Core: Cache manifest list files in manifest content cache#16762
Core: Cache manifest list files in manifest content cache#16762rahulsmahadev wants to merge 1 commit into
Conversation
|
|
||
| static InputFile newInputFile(FileIO io, ManifestListFile manifestList) { | ||
| InputFile input = io.newInputFile(manifestList); | ||
| if (cachingEnabled(io)) { |
There was a problem hiding this comment.
I wonder if its good practice to add a new knob here or use the existing one
There was a problem hiding this comment.
LGTM. The change is minimal and correct: manifest lists are immutable and location-unique just like manifests, so they slot into the existing ContentCache keying/invalidation cleanly, and the encrypted path stays equivalent to the manifest path. Nice catch that fresh Snapshot instances (refresh / table reload / streaming polls) re-fetch the immutable list today even with a warm cache.
One follow-up (not blocking this PR): unlike manifests, manifest-list length isn't tracked, so FileIO.newInputFile(ManifestListFile) returns a location-only InputFile. ContentCache.tryCache calls input.getLength() up front to gate on maxContentLength, which means we now pay a getFileStatus/HEAD on every cacheManifests call — including cache hits, which is exactly the refresh/streaming case this PR targets. It's still a win (HEAD instead of a full body GET), but we can make hits truly round-trip-free by probing the cache before asking the remote. Important: use a stats-free probe (cache.asMap().containsKey(location)), not getIfPresent, since getIfPresent records hit/miss stats and would skew the assertions in TestManifestCaching. Something like:
public InputFile tryCache(InputFile input) {
if (cache.asMap().containsKey(input.location()) || input.getLength() <= maxContentLength) {
return new CachingInputFile(this, input);
}
return input;
}Fine to do that in a follow-up PR.
Summary
When
io.manifest.cache.enabledis set, manifest files are served throughContentCache, but manifest list files are not —BaseSnapshot.cacheManifestsreads the list with a rawFileIO.newInputFilecall. Every freshly loadedSnapshot(table refresh, new table handle, streaming poll) therefore re-fetches the same immutable manifest-list file from object storage, even when the content cache is enabled and warm.This change routes the manifest-list read through the same content cache used for manifests:
ManifestFiles.newInputFile(FileIO, ManifestListFile)— package-private twin of the existingnewInputFile(FileIO, ManifestFile)helper: wraps the input withContentCache.tryCachewhen caching is enabled for theFileIO.BaseSnapshot.cacheManifestsuses it for the manifest-list read.Manifest lists are immutable and location-unique like manifests, so the existing cache keying and invalidation semantics apply unchanged. Encrypted manifest lists follow the same contract as manifests today: the
FileIO(e.g.EncryptingFileIO) controls decryption, and caching behavior is identical to the manifest path.Test plan
TestManifestCaching.testManifestListCaching: a freshly parsed snapshot loads the manifest list through the cache (miss + cache-size increase), and a second snapshot instance reading the same list is served from the cache (no new miss, hit count increases).testPlanWithCacheexpectations: with the change, each append commit also caches one manifest list (parent lists are read while committing, and the current snapshot's list while planning), so the cache holdsnumFiles * 2entries.TestManifestCaching(6/6),TestManifestListVersions,TestSnapshot,TestCommitReporting,TestManifestListEncryption, plus spotless and checkstyle.