feat(cache): support parquet metadata cache #349
PAIMON_ASSIGN_OR_RAISE(file_metadata, metadata_cache_->GetOrLoad(file_uri, loader));
}
}
I’d like to better understand in what scenarios the same file would be read multiple times. As far as I remember, the code only has prefetch-related optimizations, and prefetch should already open files concurrently. In that case, how much can the metadata cache realistically reduce read latency? Do we have any concrete data showing that the Parquet footer is a bottleneck and that this optimization provides a meaningful latency improvement? Perhaps this is mainly beneficial for extremely wide tables with a large number of fields?
We provide online services on top of Paimon, primarily focused on point lookups and short scans, and we might disable the prefetch feature (to avoid wasting bandwidth on prefetching). I understand that the current paimon-cpp may be more geared towards offline analytics scenarios.
We also need offline analytics now; I mean that this feature is focused on our online serving scenarios.
/// have to orchestrate Get/Put themselves.
Result<std::shared_ptr<::parquet::FileMetaData>> GetOrLoad(const std::string& uri,
                                                           const MetadataLoader& loader);
Because captures in std::function may introduce memory issues, I’d suggest passing the loader by value rather than by const Func&.
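To make the by-value suggestion concrete, here is a minimal sketch of a get-or-load cache whose loader is taken by value. It is not the class from this PR: `MetadataCacheSketch` and `FakeFileMetaData` are hypothetical names, the stand-in struct replaces `::parquet::FileMetaData` so the snippet builds without Arrow, and eviction and the `Result<>` error type are left out.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for ::parquet::FileMetaData (assumption, so that the
// sketch compiles without Arrow/Parquet).
struct FakeFileMetaData {
    int num_row_groups = 0;
};

// Hypothetical cache, unbounded and without eviction; illustration only.
class MetadataCacheSketch {
 public:
    using MetadataPtr = std::shared_ptr<FakeFileMetaData>;
    using Loader = std::function<MetadataPtr()>;

    // The loader is taken by value, so the callee holds its own copy of the
    // callable (and of anything it captured by value) for the whole call.
    MetadataPtr GetOrLoad(const std::string& uri, Loader loader) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = entries_.find(uri);
            if (it != entries_.end()) {
                return it->second;  // hit: the loader is never invoked
            }
        }
        // Load outside the lock so a slow footer read does not block lookups
        // of other files. Two threads may both load the same uri; the first
        // insert wins and both callers get the cached entry.
        MetadataPtr loaded = loader();
        if (!loaded) {
            return nullptr;  // failures are not cached
        }
        std::lock_guard<std::mutex> lock(mutex_);
        auto result = entries_.emplace(uri, std::move(loaded));
        return result.first->second;
    }

 private:
    std::mutex mutex_;
    std::unordered_map<std::string, MetadataPtr> entries_;
};
```

Passing by value does not by itself fix a lambda that captures locals by reference; it only ensures the callable object itself cannot be destroyed underneath the cache while `GetOrLoad` runs.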
I wanted to confirm one point about the Parquet footer: in point lookup, is the footer becoming a performance hotspot because of parsing or because of I/O? If it is mainly I/O, could we handle it similarly to how we cache manifests? The cache added in
Parquet footer bytes are cached in #373, so this PR is being closed.
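For context on what "footer bytes" means here: a Parquet file ends with a 4-byte little-endian footer length followed by the magic `PAR1`, and the serialized file metadata sits immediately before those 8 bytes. The sketch below locates that byte range in an in-memory buffer. `FooterRange` and `LocateParquetFooter` are hypothetical names used only for illustration; this is not the code from #373, and it does not handle files with an encrypted footer (magic `PARE`).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical result type: where the serialized footer lives in the file.
struct FooterRange {
    std::size_t offset = 0;  // first byte of the serialized metadata
    std::size_t length = 0;  // number of metadata bytes
};

// Hypothetical helper: finds the footer range in a whole file held in memory.
// Returns false if the buffer does not look like a plaintext-footer Parquet file.
inline bool LocateParquetFooter(const std::string& file_bytes, FooterRange* out) {
    constexpr std::size_t kMagicSize = 4;
    constexpr std::size_t kTailSize = 8;  // 4-byte length + 4-byte magic

    // Need at least the leading magic plus the 8-byte tail.
    if (file_bytes.size() < kMagicSize + kTailSize) {
        return false;
    }
    const char* tail = file_bytes.data() + file_bytes.size() - kTailSize;
    if (std::memcmp(tail + 4, "PAR1", kMagicSize) != 0) {
        return false;
    }
    // Decode the length byte by byte so the result does not depend on the
    // host's endianness.
    std::uint32_t footer_len = 0;
    for (int i = 3; i >= 0; --i) {
        footer_len = (footer_len << 8) | static_cast<unsigned char>(tail[i]);
    }
    // The footer must fit between the leading magic and the tail.
    if (footer_len > file_bytes.size() - kTailSize - kMagicSize) {
        return false;
    }
    out->offset = file_bytes.size() - kTailSize - footer_len;
    out->length = footer_len;
    return true;
}
```

Caching these raw bytes saves the I/O but still pays for deserialization on every open, whereas caching the parsed metadata saves both; which one matters more is exactly the parsing-versus-I/O question raised above.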
Purpose
Support caching parsed Parquet FileMetaData to avoid repeatedly reading and parsing Parquet footers when creating readers for the same file.
This change:
Tests
Added/updated UT cases in parquet_file_batch_reader_test:
Not run locally: full CMake build/tests, because the local build blocks on bundled Arrow/Boost dependency download.
API and Format
No storage format or protocol changes.
No public include/ API changes.
Adds a Parquet read option:
Documentation
This introduces a new Parquet reader optimization. No standalone documentation changes are included.
Generative AI tooling
Generated-by: OpenAI Codex GPT-5