
Cache iceberg metadata to reduce redundant requests to storage#509

Open
cbb330 wants to merge 15 commits into linkedin:main from cbb330:chbush/reduce-refresh-metadata

Conversation


@cbb330 cbb330 commented Mar 21, 2026

Design Doc Summary

Full design doc: https://docs.google.com/document/d/1J4yNWVePjm5nknsiuUmp-eYEJ_eXIeCLhjASgawi2tY

Problem

Kafka ETL reported that OpenHouse PUT /snapshots (commit) p99 latency is missing SLOs, sometimes exceeding 1 minute. Root cause: refreshMetadata is called 6 times per commit request, but 3 of those are redundant real HDFS reads of the same pre-commit metadata file. This happens because each loadTable() creates a fresh TableOperations instance with no shared cache.

The 6 refreshes break down as:

| # | Source | Real HDFS read? |
|---|--------|-----------------|
| 1 | TablesService.putTable initial findById | Yes |
| 2 | save() redundant existsById | Yes |
| 3 | save() real catalog.loadTable | Yes |
| 4–5 | Transaction internal ops.refresh() | No (cheap no-ops) |
| 6 | Post-commit response building | Yes (new metadata) |

Solution: TableMetadata Cache

  • 7-day fleet analysis: 33-38M metadata refreshes/day; ~28% are duplicates within a 10-minute window
  • A TableMetadata cache keyed by metadata location eliminates redundant HDFS reads across loadTable() calls within the same pod
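
As a rough illustration of the idea (not the PR's actual implementation, which goes through Spring Cache and Caffeine), a pod-scoped cache keyed by the immutable metadata-file location can be sketched with a plain `ConcurrentHashMap`. Because a metadata.json file is write-once, an entry for a given location can never go stale for that key:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: cache parsed metadata by its immutable
// metadata.json location so repeated loadTable() calls on the same
// pod skip redundant storage reads.
public class MetadataLocationCache<V> {
  private final Map<String, V> cache = new ConcurrentHashMap<>();

  public V load(String metadataLocation, Supplier<V> loader) {
    // computeIfAbsent performs the expensive load (e.g. an HDFS read)
    // at most once per key, even under concurrent callers.
    return cache.computeIfAbsent(metadataLocation, k -> loader.get());
  }
}
```

A real implementation would also need bounded size and expiry, which is what the Caffeine-backed configuration below provides.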

Cache Configuration

  • TTL: 10 minutes (captures ~60% of re-accesses; 30 min would get ~76%)
  • Max size: 5,000 entries
  • Instrumentation: hitRate() and eviction stats to guide future TTL tuning
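
Assuming the backend is Caffeine (as the later commits in this PR indicate), these settings correspond roughly to the following builder configuration; the values are the ones stated above, not a recommendation:

```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Caffeine;

// Sketch of the pod-scoped cache spec described above.
Caffeine<Object, Object> spec =
    Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(10)) // TTL from the re-access analysis (~60% capture)
        .maximumSize(5_000)                       // bounds pod memory use
        .recordStats();                           // exposes hitRate() and eviction counts
```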

Before/After Validation

  • Traces are structurally identical (42 spans) — the cache only reduces latency within spans by avoiding redundant HDFS I/O
  • AFTER eliminates 8 HDFS connection DEBUG lines (24 → 16)
  • Single-sample latency dropped from 1643ms → 560ms; production-scale validation expected to show larger gains due to HDFS p90 variance

Changes

- [ ] Client-facing API Changes
- [x] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [x] Refactoring
- [ ] Documentation
- [x] Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Internal API Changes:

  • Rename the semantic cache contracts to TableMetadataCache and ToggleStatusCache, and wire their defaults through Spring CacheManager instead of the custom request-cache abstraction.
  • Add OpenHouseCacheProperties plus internal Spring configurations so each semantic cache can choose request or pod lifetime while downstream repos can still override the semantic bean directly.

Performance Improvements:

  • Preserve same-request reuse of parsed Iceberg metadata and toggle lookups, and add pod-scoped simple and caffeine backends for longer-lived caching when enabled.
  • Use Spring-managed Caffeine caches with recordStats() so pod-scoped caches fit Spring and Micrometer cache instrumentation.
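
A minimal sketch of that wiring, assuming Spring's `CaffeineCacheManager` adapter; the bean and cache names here mirror the ones discussed later in this thread, and the class name is illustrative:

```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableCaching
class InternalCatalogCacheConfigSketch {

  @Bean
  public CacheManager internalCatalogCacheManager() {
    // One named cache ("tableMetadata") backed by Caffeine; recordStats()
    // is what lets Micrometer's cache binder report hit/miss/eviction metrics.
    CaffeineCacheManager manager = new CaffeineCacheManager("tableMetadata");
    manager.setCaffeine(
        Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofMinutes(10))
            .maximumSize(5_000)
            .recordStats());
    return manager;
  }
}
```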

Refactoring:

  • Replace the bespoke request-cache plumbing with Spring Cache and CacheManager adapters, including a lightweight RequestAttributes-backed cache manager for request lifetime.
  • Keep the cache lifecycle choice in configuration rather than in the semantic interface names.

Tests:

  • Add lifecycle-selection and no-request fallback coverage in InternalCatalogConfigTest and ToggleStatusCacheConfigTest.
  • Update metadata refresh and toggle repository tests to use the lifecycle-neutral cache interfaces and verify the existing request-reuse behavior still holds.

Testing Done

- [ ] Manually Tested on local docker setup. Please include commands ran, and their output.
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Added new tests for the changes made:
Added lifecycle-selection and no-request fallback coverage in InternalCatalogConfigTest and ToggleStatusCacheConfigTest.

Updated existing tests to reflect the changes made:
Updated OpenHouseInternalTableOperationsTest, RepositoryTestWithSettableComponents, and ToggleStatusesRepositoryImplTest to consume the new semantic cache interfaces and lifecycle-neutral wiring.

Validation run:
JAVA_HOME=$(/usr/libexec/java_home -v 17) ./gradlew :services:common:spotlessJavaCheck :iceberg:openhouse:internalcatalog:spotlessJavaCheck :services:tables:spotlessJavaCheck :iceberg:openhouse:internalcatalog:test --tests com.linkedin.openhouse.internal.catalog.InternalCatalogConfigTest --tests com.linkedin.openhouse.internal.catalog.OpenHouseInternalTableOperationsTest :services:tables:test --tests com.linkedin.openhouse.tables.config.ToggleStatusCacheConfigTest --tests com.linkedin.openhouse.tables.toggle.repository.ToggleStatusesRepositoryImplTest --tests com.linkedin.openhouse.tables.e2e.h2.RepositoryTestWithSettableComponents

Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@cbb330 cbb330 force-pushed the chbush/reduce-refresh-metadata branch from fe80df8 to b646513 Compare March 21, 2026 01:31
@Autowired MeterRegistry meterRegistry;

/** Cache FileIO by TableIdentifier to avoid redundant HTS lookups within the same request. */
private final Cache<TableIdentifier, FileIO> fileIOCache =
Collaborator

did we do tracing that this is actually used from the 6 callsites to reduce to 1? presumably this is visible on the flame graph.

Collaborator Author

I will add this (and HDFS logs) to the PR as part of e2e testing in the test cluster.

Reuse request-scoped metadata and toggle lookups so repeated refreshes in a single request avoid extra HDFS reads and duplicate HTS calls without weakening commit concurrency checks.

Made-with: Cursor
@cbb330 cbb330 force-pushed the chbush/reduce-refresh-metadata branch from b646513 to 552381d Compare April 1, 2026 08:35

@Autowired MeterRegistry meterRegistry;

@Autowired(required = false)
Collaborator Author

will fix this, it requires test fixes.

cbb330 added 2 commits April 1, 2026 02:02
Register the request cache from internalcatalog configuration so table service contexts can inject it consistently without optional autowiring fallbacks.

Made-with: Cursor
Clarify why RequestScopedCache is provided from internal catalog config and replace bespoke RequestAttributes test helpers with Spring's servlet-backed request attributes.

Made-with: Cursor
@cbb330 cbb330 changed the title Reduce redundant refreshMetadata and HTS lookups in PUT request Reduce redundant metadata refreshes with Spring-managed semantic caches Apr 4, 2026
@cbb330 cbb330 force-pushed the chbush/reduce-refresh-metadata branch from 6d979e1 to a7c4ab9 Compare April 4, 2026 03:11
cbb330 added 12 commits April 3, 2026 20:15
Replace the hand-rolled request cache with a Spring CacheManager backed by Caffeine so request-local metadata reuse is easier to extend and observe later.

Made-with: Cursor
The metadata write path already rethrows cache seeding failures before house table mapping runs, so the extra variable never changes behavior.

Made-with: Cursor
Use a dedicated ConfigurationProperties bean so Spring applies defaults and binds cache overrides from cluster YAML without manual parsing in InternalCatalogConfig.

Made-with: Cursor
This keeps the cache manager owned by the tables app so metadata cache settings bind through the shared Spring environment without creating the cache in other services.

Made-with: Cursor
Keep the metadata cache properties and cache manager in the tables service so Spring Boot can bind them directly without redundant internal catalog config types.

Made-with: Cursor
Keep the tables metadata-cache namespace owned by the deployable while leaving cache configuration and tests in internalcatalog.

Made-with: Cursor
Verify the tables-service metadata cache properties configure the internal catalog Caffeine cache end to end, and add the test-only dependencies needed to assert that wiring directly.

Made-with: Cursor
Keep services/tables focused on generic cache property propagation and move Caffeine-specific verification into internalcatalog so backend changes stay local to that module.

Made-with: Cursor
Differentiate raw Spring-bound properties, module settings, and cache bean wiring so the tables app boundary and internalcatalog runtime config are easier to follow.

Made-with: Cursor
Rely on Spring's default method-based bean name for the internal catalog cache manager so the wiring stays explicit in tests without duplicating the same name in configuration.

Made-with: Cursor
@Setter
public static class MetadataCache {
private Duration ttl = Duration.ofMinutes(5);
private long maxSize = 1000;
Collaborator

If this is a per request cache, why do we need more than one entity stored and a TTL?

Collaborator Author

the per request cache is not able to emit prometheus metrics, so we can't monitor it.

hence I've changed strategy to use a per-pod cache keyed by UUID -- we get metrics, a better cache-hit ratio, and don't affect our concurrency model.

Collaborator

don't affect our concurrency model.

I'm not sure.

T0 - Table Manifest State X
T1 - Manifest is cached
T2 - Table Manifest updated to state Y
T3 - Cached value is read vs latest value and transaction fails.
T4 - Cached value is read vs latest value and transaction fails.
T5 - Cache TTLs.

A per-request cache's limited lifecycle makes this a non-issue.

Collaborator Author

@cbb330 cbb330 Apr 13, 2026

"the manifest" refers to a manifest file (the Avro file that tracks a list of data files). this is distinct from "the metadata" which specifically refers to a json file that represents table state e.g. /data/openhouse/<db>/<table>/01226-03d77993-b04c-46a2-b3a8-dbb7b6b45783.metadata.json

and the metadata is immutable, once the UUID is generated and file is written it can never change.

So in the same format listed above, it is more accurately this:

T0 - Table Metadata State X
T1 - Metadata UUIDv1 is cached
T2 - Table Metadata updated to state Y
T3 - Cache miss because UUID is not found
T4 - Metadata UUIDv2 is cached
T5 - Metadata UUIDv1 TTLs

Member

@abhisheknath2011 abhisheknath2011 Apr 14, 2026

Trying to understand the approach here: why a per-pod cache? As there are 56 tables-service pods, the issue with a per-pod cache is that cache misses could be more frequent, since most requests could land on different pods. What are we missing with a request-level cache? I believe the idea was to optimize multiple metadata calls within a given request.


FileIOManager fileIOManager;

TableMetadataCache tableMetadataCache;
Collaborator

This can be a generic file cache, no? I might call it a file cache but have the variable name be metadata.

Collaborator Author

There are two layers here: internalCatalogCacheManager can be generic for any file, since it is just the Spring CacheManager, but the injected wrapper is intentionally TableMetadataCache and specific to table metadata.

Right now the cache namespace is tableMetadata and the value type is still TableMetadata.

We can easily reuse the generic internalCatalogCacheManager for other use cases later, like this:

  @CachePut(
      cacheManager = "internalCatalogCacheManager",
      cacheNames = "newUseCase",
      key = "#newUseCaseUUID")

Comment on lines +22 to +25
@CachePut(
cacheManager = "internalCatalogCacheManager",
cacheNames = "tableMetadata",
key = "#metadataLocation")
Collaborator

So this abstracts the cache as a Spring component?

Collaborator Author

@cbb330 cbb330 Apr 8, 2026

Yea. SpringTableMetadataCache is the Spring-managed bean we inject as the implementation. This gives us runtime dependency injection, so it allows us to override in the li-openhouse variant at runtime if needed.

And @Cacheable / @CachePut delegate the actual cache read/write through Spring’s cache abstraction rather than us calling Caffeine directly. This gives us auto-instrumentation among other things.

internalCatalogCacheManager represents a named reference to the logic actually used for the implementation, and in our config that’s currently a CaffeineCacheManager.

so the end result is that callers stay behind the TableMetadataCache interface and don’t need to know about the concrete cache impl and the impl is mostly framework config.

@@ -0,0 +1,76 @@
package com.linkedin.openhouse.tables.config;
Collaborator

I think we need a test where there are two racing commits so we can ensure stale metadata being returned from the cache does not impact correctness.

Collaborator Author

@cbb330 cbb330 Apr 8, 2026

I think the racing-commits case is safe here because the metadata is UUID-keyed, so concurrent commits are not reusing the same cached object.

For example, commit 1 writes uuid-1.json and atomically swaps that into HTS. If commit 2 races, it reads uuid-1.json from HTS, evolves that state, writes the updated metadata back to HTS, and then caches it as uuid-2.json. So even if an older cache entry exists, the next commit is still advancing from the HTS-backed metadata, not from stale cached metadata.

Collaborator

@mkuchenbecker mkuchenbecker left a comment

Is this ready for review?

@cbb330 cbb330 changed the title Reduce redundant metadata refreshes with Spring-managed semantic caches Cache iceberg metadata to reduce redundant requests to storage Apr 8, 2026
@cbb330 cbb330 marked this pull request as ready for review April 8, 2026 00:20
@cbb330 cbb330 changed the title Cache iceberg metadata to reduce redundant requests to storage Reduce metadata refreshes and tighten OpenHouse fallback validation Apr 11, 2026
@cbb330 cbb330 force-pushed the chbush/reduce-refresh-metadata branch from ca6b5ed to fbf3a65 Compare April 11, 2026 01:24
@cbb330 cbb330 changed the title Reduce metadata refreshes and tighten OpenHouse fallback validation Cache iceberg metadata to reduce redundant requests to storage Apr 11, 2026
@Getter
@Setter
public static class MetadataCache {
private Duration ttl = Duration.ofMinutes(5);
Collaborator

Was any profiling done on per-entry TableMetadata memory size to validate the 1000-entry max?

public class SpringTableMetadataCache implements TableMetadataCache {

@Override
@Cacheable(
Collaborator

What would happen if there are concurrent requests for the same metadata location? Probably both could miss and read from HDFS. Consider https://docs.spring.io/spring-framework/reference/integration/cache/annotations.html#cache-annotations-cacheable-synchronized ?
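
The suggestion above could be applied as follows. `sync = true` is a standard `@Cacheable` attribute that makes concurrent loads of the same key block on a single computation; this is a sketch against the method shown in this thread, not the PR's committed code:

```java
// Sketch: with sync = true, racing requests for one metadataLocation
// trigger a single underlying load instead of two parallel HDFS reads.
@Cacheable(
    cacheManager = "internalCatalogCacheManager",
    cacheNames = "tableMetadata",
    key = "#metadataLocation",
    sync = true)
public TableMetadata load(String metadataLocation, Supplier<TableMetadata> metadataLoader) {
  return metadataLoader.get();
}
```

Note that Spring documents `sync` as best-effort and incompatible with some other cache attributes (e.g. `unless`), so it is worth checking those constraints before adopting it.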

cacheNames = "tableMetadata",
key = "#metadataLocation")
public TableMetadata load(String metadataLocation, Supplier<TableMetadata> metadataLoader) {
return metadataLoader.get();
Member

Is the caching here on a per-request basis? What happens in the scenario of an immediate get after a put, or multiple concurrent gets after a put?
