feat(cache): support parquet metadata cache by gripleaf · Pull Request #349 · alibaba/paimon-cpp

gripleaf · 2026-06-08T13:36:50Z

Purpose

Support caching parsed Parquet FileMetaData to avoid repeatedly reading and parsing Parquet footers when creating readers for the same file.

This change:

Adds ParquetMetadataCache, backed by GenericLruCache.
Adds option parquet.read.metadata-cache.max-bytes.
Initializes one metadata cache per ParquetReaderBuilder when the configured size is greater than 0.
Uses cached metadata in ParquetFileBatchReader::Create through Parquet FileReaderBuilder::Open(..., file_metadata).
Keeps cache entries keyed by file URI.
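
The behavior listed above can be sketched as a byte-bounded LRU cache with a GetOrLoad entry point. This is only an illustration and not the PR's code: `FakeMetadata` and `MetadataCacheSketch` are hypothetical names, the real class is backed by GenericLruCache and returns a `Result<...>`, and the sketch is not thread-safe. It assumes that a null loader result is returned without being cached, which is what the test name TestMetadataCacheNullLoaderResultIsNotCached suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for ::parquet::FileMetaData; only its size matters here.
struct FakeMetadata {
    std::size_t size_bytes;
};

// Illustrative byte-bounded LRU cache keyed by file URI (not thread-safe).
class MetadataCacheSketch {
 public:
    using Value = std::shared_ptr<FakeMetadata>;
    using Loader = std::function<Value()>;

    explicit MetadataCacheSketch(std::size_t max_bytes) : max_bytes_(max_bytes) {
        assert(max_bytes > 0);  // the PR only creates a cache for sizes > 0
    }

    // Returns the cached value for `uri`, or runs `loader` and caches a non-null result.
    Value GetOrLoad(const std::string& uri, Loader loader) {
        auto it = index_.find(uri);
        if (it != index_.end()) {
            // Hit: mark the entry as most recently used.
            entries_.splice(entries_.begin(), entries_, it->second);
            return it->second->second;
        }
        Value loaded = loader();
        if (!loaded) return nullptr;                         // null results are not cached
        if (loaded->size_bytes > max_bytes_) return loaded;  // too large to keep
        entries_.emplace_front(uri, loaded);
        index_[uri] = entries_.begin();
        used_bytes_ += loaded->size_bytes;
        // Evict least recently used entries until the budget is respected.
        while (used_bytes_ > max_bytes_) {
            const Entry& victim = entries_.back();
            used_bytes_ -= victim.second->size_bytes;
            index_.erase(victim.first);
            entries_.pop_back();
        }
        return loaded;
    }

    std::size_t size() const { return index_.size(); }

 private:
    using Entry = std::pair<std::string, Value>;
    std::size_t max_bytes_;
    std::size_t used_bytes_ = 0;
    std::list<Entry> entries_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

With this shape, a second GetOrLoad for the same URI returns the same shared pointer without invoking the loader again, which is the saving the PR is after when several readers are created for one file.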

Tests

Added/updated UT cases in parquet_file_batch_reader_test:

TestReadRewrittenFileWithoutMetadataCache
TestMetadataCacheGetOrLoad
TestMetadataCacheNullLoaderResultIsNotCached

Not run locally: full CMake build/tests, because the local build blocks on bundled Arrow/Boost dependency download.

API and Format

No storage format or protocol changes.

No public include/ API changes.

Adds a Parquet read option:

parquet.read.metadata-cache.max-bytes
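
As a rough sketch of how such an option could gate cache creation, the helper below maps an options table to an optional byte budget. The option key comes from this PR. Everything else is an assumption made for illustration: the `MetadataCacheBudget` name, the string-to-string options map, the treatment of a missing key as "disabled", and the plain-integer value syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Option key introduced by the PR.
inline constexpr char kMetadataCacheMaxBytes[] = "parquet.read.metadata-cache.max-bytes";

// Hypothetical helper: returns the byte budget when a cache should be created,
// or std::nullopt when the option is absent, malformed, or not greater than 0.
std::optional<std::uint64_t> MetadataCacheBudget(
    const std::map<std::string, std::string>& options) {
    auto it = options.find(kMetadataCacheMaxBytes);
    if (it == options.end()) return std::nullopt;  // assumption: absent means disabled
    try {
        std::size_t pos = 0;
        long long value = std::stoll(it->second, &pos);
        // Reject trailing characters (assumption: no unit suffixes) and non-positive sizes.
        if (pos != it->second.size() || value <= 0) return std::nullopt;
        return static_cast<std::uint64_t>(value);
    } catch (const std::exception&) {
        return std::nullopt;  // not a number, or out of range
    }
}
```

A reader builder could then construct its cache only when the returned optional holds a value, matching the "greater than 0" rule in the Purpose section.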

Documentation

This introduces a new Parquet reader optimization. No standalone documentation changes are included.

Generative AI tooling

Generated-by: OpenAI Codex GPT-5

lxy-9602 · 2026-06-09T07:27:14Z

+                PAIMON_ASSIGN_OR_RAISE(file_metadata, metadata_cache_->GetOrLoad(file_uri, loader));
+            }
+        }
+


I’d like to better understand in what scenarios the same file would be read multiple times. As far as I remember, the code only has prefetch-related optimizations, and prefetch should already open files concurrently. In that case, how much can the metadata cache realistically reduce read latency? Do we have any concrete data showing that the Parquet footer is a bottleneck and that this optimization provides a meaningful latency improvement? Perhaps this is mainly beneficial for extremely wide tables with a large number of fields?

We provide online services on top of Paimon, primarily focused on point lookups and short scans, and we might disable the prefetch feature (to avoid wasting bandwidth on prefetching). I understand that the current paimon-cpp may be more geared towards offline analytics scenarios.

We also need offline analytics now. What I mean is that this feature is focused on our online serving scenarios.

lxy-9602 · 2026-06-09T07:28:46Z

+    /// have to orchestrate Get/Put themselves.
+    Result<std::shared_ptr<::parquet::FileMetaData>> GetOrLoad(const std::string& uri,
+                                                               const MetadataLoader& loader);
+


Because the captures held by a std::function may introduce memory issues, I'd suggest passing the loader by value rather than by const Func&.
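
One reading of this suggestion, sketched with hypothetical names (`Loader`, `DeferredLoad`, `MakeLoad` are not from the PR): the callee takes the std::function by value and moves it into place, so it owns the callable, and the closure captures what it needs by value so nothing dangles once the caller's locals are gone.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

using Loader = std::function<std::string()>;

// Hypothetical holder that may run the loader after the caller has returned.
class DeferredLoad {
 public:
    // By-value parameter: the callee owns its copy (or a moved-in temporary),
    // so its lifetime no longer depends on the caller's std::function object.
    explicit DeferredLoad(Loader loader) : loader_(std::move(loader)) {}

    std::string Run() const { return loader_(); }

 private:
    Loader loader_;
};

DeferredLoad MakeLoad(const std::string& uri) {
    // Capture `uri` by value; a by-reference capture could dangle once the
    // argument's lifetime ends.
    return DeferredLoad([uri]() { return "loaded:" + uri; });
}
```

Passing by value costs at most one move for temporaries, which is usually negligible next to reading a footer.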

lxy-9602 · 2026-06-17T01:38:18Z

I wanted to confirm one point about the Parquet footer: in point lookup, is the footer becoming a performance hotspot because of parsing or because of I/O? If it is mainly I/O, could we handle it similarly to how we cache manifests? The cache added in ReadContext does not seem to be used yet. If the bottleneck is parsing, then we could consider caching a ::parquet::FileMetaData object instead. That said, this optimization feels somewhat specialized and would be more intrusive to the codebase, so I think we should first confirm that the performance benefit is large enough.
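
To make the I/O versus parsing distinction concrete: a Parquet file ends with a 4-byte little-endian length of the serialized footer metadata followed by the 4-byte magic "PAR1". Locating and fetching the footer bytes is therefore the I/O part, and deserializing those bytes into a FileMetaData object is the parsing part. The sketch below covers only the first step. `FooterLengthFromTail` is a hypothetical helper, not code from this PR, and it does not handle files with encrypted footers, which use a different magic.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Given the last 8 bytes of a Parquet file, returns the length in bytes of the
// serialized footer metadata, or std::nullopt if the trailing magic is not "PAR1".
std::optional<std::uint32_t> FooterLengthFromTail(const std::array<std::uint8_t, 8>& tail) {
    if (tail[4] != 'P' || tail[5] != 'A' || tail[6] != 'R' || tail[7] != '1') {
        return std::nullopt;
    }
    // The length is stored little-endian in the 4 bytes before the magic.
    return static_cast<std::uint32_t>(tail[0]) |
           (static_cast<std::uint32_t>(tail[1]) << 8) |
           (static_cast<std::uint32_t>(tail[2]) << 16) |
           (static_cast<std::uint32_t>(tail[3]) << 24);
}
```

A cache of raw footer bytes saves the reads that fetch this tail and the footer itself, while a cache of parsed metadata objects additionally saves the deserialization, so timing the two stages separately would show which cache is worth its complexity.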

gripleaf · 2026-06-17T03:55:43Z

Parquet footer bytes are cached in #373 instead, so I am closing this PR.

gripleaf marked this pull request as draft June 9, 2026 06:41

gripleaf force-pushed the feat/parquet-meta-cache branch from 41c6c77 to 59a8be7 Compare June 9, 2026 07:46

lxy-9602 reviewed Jun 9, 2026

gripleaf force-pushed the feat/parquet-meta-cache branch 4 times, most recently from 9df9037 to 6209fe0 Compare June 10, 2026 07:04

gripleaf marked this pull request as ready for review June 10, 2026 10:08

gripleaf force-pushed the feat/parquet-meta-cache branch from 6209fe0 to c4042c3 Compare June 11, 2026 11:36

feat(cache): support parquet metadata cache

cfe8260

gripleaf force-pushed the feat/parquet-meta-cache branch from c4042c3 to cfe8260 Compare June 11, 2026 13:37

lxy-9602 added 2 commits June 16, 2026 13:22

Merge branch 'main' into feat/parquet-meta-cache

26d30d0

Merge branch 'main' into feat/parquet-meta-cache

36d8e69

gripleaf closed this Jun 17, 2026

gripleaf wants to merge 3 commits into alibaba:main from gripleaf:feat/parquet-meta-cache

2 participants
