Skip to content

Core: Cache manifest list files in manifest content cache#16762

Open
rahulsmahadev wants to merge 1 commit into
apache:mainfrom
rahulsmahadev:core-cache-manifest-lists
Open

Core: Cache manifest list files in manifest content cache#16762
rahulsmahadev wants to merge 1 commit into
apache:mainfrom
rahulsmahadev:core-cache-manifest-lists

Conversation

@rahulsmahadev

Copy link
Copy Markdown
Contributor

Summary

When io.manifest.cache.enabled is set, manifest files are served through ContentCache, but manifest list files are not — BaseSnapshot.cacheManifests reads the list with a raw FileIO.newInputFile call. Every freshly loaded Snapshot (table refresh, new table handle, streaming poll) therefore re-fetches the same immutable manifest-list file from object storage, even when the content cache is enabled and warm.

This change routes the manifest-list read through the same content cache used for manifests:

  • ManifestFiles.newInputFile(FileIO, ManifestListFile) — package-private twin of the existing newInputFile(FileIO, ManifestFile) helper: wraps the input with ContentCache.tryCache when caching is enabled for the FileIO.
  • BaseSnapshot.cacheManifests uses it for the manifest-list read.

Manifest lists are immutable and location-unique like manifests, so the existing cache keying and invalidation semantics apply unchanged. Encrypted manifest lists follow the same contract as manifests today: the FileIO (e.g. EncryptingFileIO) controls decryption, and caching behavior is identical to the manifest path.

Test plan

  • New TestManifestCaching.testManifestListCaching: a freshly parsed snapshot loads the manifest list through the cache (miss + cache-size increase), and a second snapshot instance reading the same list is served from the cache (no new miss, hit count increases).
  • Updated testPlanWithCache expectations: with the change, each append commit also caches one manifest list (parent lists are read while committing, and the current snapshot's list while planning), so the cache holds numFiles * 2 entries.
  • Verified locally: TestManifestCaching (6/6), TestManifestListVersions, TestSnapshot, TestCommitReporting, TestManifestListEncryption, plus spotless and checkstyle.

@github-actions github-actions Bot added the core label Jun 10, 2026

static InputFile newInputFile(FileIO io, ManifestListFile manifestList) {
InputFile input = io.newInputFile(manifestList);
if (cachingEnabled(io)) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if its good practice to add a new knob here or use the existing one

@szehon-ho szehon-ho left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The change is minimal and correct: manifest lists are immutable and location-unique just like manifests, so they slot into the existing ContentCache keying/invalidation cleanly, and the encrypted path stays equivalent to the manifest path. Nice catch that fresh Snapshot instances (refresh / table reload / streaming polls) re-fetch the immutable list today even with a warm cache.

One follow-up (not blocking this PR): unlike manifests, manifest-list length isn't tracked, so FileIO.newInputFile(ManifestListFile) returns a location-only InputFile. ContentCache.tryCache calls input.getLength() up front to gate on maxContentLength, which means we now pay a getFileStatus/HEAD on every cacheManifests call — including cache hits, which is exactly the refresh/streaming case this PR targets. It's still a win (HEAD instead of a full body GET), but we can make hits truly round-trip-free by probing the cache before asking the remote. Important: use a stats-free probe (cache.asMap().containsKey(location)), not getIfPresent, since getIfPresent records hit/miss stats and would skew the assertions in TestManifestCaching. Something like:

public InputFile tryCache(InputFile input) {
  if (cache.asMap().containsKey(input.location()) || input.getLength() <= maxContentLength) {
    return new CachingInputFile(this, input);
  }
  return input;
}

Fine to do that in a follow-up PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants