feat(cache): support parquet metadata cache#349

Closed
gripleaf wants to merge 3 commits into alibaba:main from gripleaf:feat/parquet-meta-cache

Conversation

@gripleaf gripleaf commented Jun 8, 2026

Contributor

Purpose

Support caching parsed Parquet FileMetaData to avoid repeatedly reading and parsing Parquet footers when creating readers for the same file.

This change:

  • Adds ParquetMetadataCache, backed by GenericLruCache.
  • Adds option parquet.read.metadata-cache.max-bytes.
  • Initializes one metadata cache per ParquetReaderBuilder when the configured size is greater than 0.
  • Uses cached metadata in ParquetFileBatchReader::Create through Parquet FileReaderBuilder::Open(..., file_metadata).
  • Keeps cache entries keyed by file URI.
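
The get-or-load behaviour described in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the PR's implementation: `FakeMetadata` and `MetadataCacheSketch` are hypothetical names, the real `ParquetMetadataCache` is backed by `GenericLruCache` (whose API is not shown in this conversation), and `FakeMetadata` stands in for `::parquet::FileMetaData`. The sketch keys entries by file URI, bounds the cache by a byte budget, and does not cache a null loader result.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for ::parquet::FileMetaData; only its size matters here.
struct FakeMetadata {
    std::size_t size_bytes;
};

// Hypothetical byte-bounded LRU cache with get-or-load semantics, keyed by file URI.
class MetadataCacheSketch {
 public:
    using Value = std::shared_ptr<FakeMetadata>;
    using Loader = std::function<Value()>;

    explicit MetadataCacheSketch(std::size_t max_bytes) : max_bytes_(max_bytes) {
        assert(max_bytes_ > 0);  // per the PR text, a cache is only created for sizes > 0
    }

    // Returns the cached value for `uri`, or runs `loader` and caches a non-null result.
    Value GetOrLoad(const std::string& uri, Loader loader) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            auto it = index_.find(uri);
            if (it != index_.end()) {
                lru_.splice(lru_.begin(), lru_, it->second);  // mark as most recently used
                return it->second->second;
            }
        }
        Value loaded = loader();  // runs outside the lock; concurrent misses may load twice
        if (!loaded) return nullptr;  // null results are returned but never cached
        std::lock_guard<std::mutex> lock(mu_);
        auto it = index_.find(uri);
        if (it != index_.end()) return it->second->second;  // another thread inserted first
        lru_.emplace_front(uri, loaded);
        index_[uri] = lru_.begin();
        used_bytes_ += loaded->size_bytes;
        // Evict least recently used entries until the byte budget is respected.
        while (used_bytes_ > max_bytes_ && !lru_.empty()) {
            used_bytes_ -= lru_.back().second->size_bytes;
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        return loaded;
    }

    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return lru_.size();
    }

 private:
    using Entry = std::pair<std::string, Value>;
    const std::size_t max_bytes_;
    std::size_t used_bytes_ = 0;
    mutable std::mutex mu_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

A caller would wrap the footer read and parse in the loader, so that a second reader for the same URI skips both steps. Note that keying by URI alone means a file rewritten in place under the same URI would be served stale metadata unless the entry is invalidated.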

Tests

Added/updated UT cases in parquet_file_batch_reader_test:

  • TestReadRewrittenFileWithoutMetadataCache
  • TestMetadataCacheGetOrLoad
  • TestMetadataCacheNullLoaderResultIsNotCached

Not run locally: full CMake build/tests, because the local build blocks on bundled Arrow/Boost dependency download.

API and Format

No storage format or protocol changes.

No public include/ API changes.

Adds a Parquet read option:

  • parquet.read.metadata-cache.max-bytes

Documentation

This introduces a new Parquet reader optimization. No standalone documentation changes are included.

Generative AI tooling

Generated-by: OpenAI Codex GPT-5

@gripleaf gripleaf marked this pull request as draft June 9, 2026 06:41
@gripleaf gripleaf force-pushed the feat/parquet-meta-cache branch from 41c6c77 to 59a8be7 Compare June 9, 2026 07:46
Comment thread src/paimon/format/parquet/parquet_file_batch_reader.h
Comment thread src/paimon/format/parquet/parquet_format_defs.h Outdated
PAIMON_ASSIGN_OR_RAISE(file_metadata, metadata_cache_->GetOrLoad(file_uri, loader));
}
}

Collaborator

I’d like to better understand in what scenarios the same file would be read multiple times. As far as I remember, the code only has prefetch-related optimizations, and prefetch should already open files concurrently. In that case, how much can the metadata cache realistically reduce read latency? Do we have any concrete data showing that the Parquet footer is a bottleneck and that this optimization provides a meaningful latency improvement? Perhaps this is mainly beneficial for extremely wide tables with a large number of fields?

Contributor Author

We provide online services on top of Paimon, primarily focused on point lookups and short scans, and we might disable the prefetch feature (to avoid wasting bandwidth on prefetching). I understand that the current paimon-cpp may be more geared towards offline analytics scenarios.

Contributor Author

We also need offline analytics now; I mean that this feature is focused on our online serving scenarios.

/// have to orchestrate Get/Put themselves.
Result<std::shared_ptr<::parquet::FileMetaData>> GetOrLoad(const std::string& uri,
const MetadataLoader& loader);

Collaborator

Because captures in a std::function may introduce memory issues, I’d suggest preferring pass-by-value over const Func&.
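
One possible reading of this suggestion, sketched under assumptions: a callable that is only invoked during the call is safe to take by `const&`, but as soon as it is stored for later use, a reference (or a by-reference capture inside it) can outlive what it points to. Taking the `std::function` by value and moving it into place gives the callee its own copy. `DeferredLoad` and `MakeDeferred` below are hypothetical names used only for illustration; they are not part of the PR.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical holder that runs a loader some time after construction.
class DeferredLoad {
 public:
    using Loader = std::function<std::string()>;

    // By-value parameter plus std::move: this object owns its copy of the callable,
    // so it stays valid even if the caller's original callable is destroyed.
    explicit DeferredLoad(Loader loader) : loader_(std::move(loader)) {}

    std::string Run() const { return loader_(); }

 private:
    Loader loader_;  // stored by value, never as a reference
};

// Builds a deferred load whose closure owns everything it needs.
DeferredLoad MakeDeferred(const std::string& uri) {
    // Capture `uri` by value: a by-reference capture would dangle once the
    // caller's string goes out of scope.
    return DeferredLoad([uri] { return "metadata for " + uri; });
}
```

The cost of pass-by-value here is at most one extra move of the `std::function`, which is usually negligible next to a footer read.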

Comment thread src/paimon/format/parquet/parquet_file_batch_reader_test.cpp Outdated
@gripleaf gripleaf force-pushed the feat/parquet-meta-cache branch 4 times, most recently from 9df9037 to 6209fe0 Compare June 10, 2026 07:04
@gripleaf gripleaf marked this pull request as ready for review June 10, 2026 10:08
@gripleaf gripleaf force-pushed the feat/parquet-meta-cache branch from 6209fe0 to c4042c3 Compare June 11, 2026 11:36
@gripleaf gripleaf force-pushed the feat/parquet-meta-cache branch from c4042c3 to cfe8260 Compare June 11, 2026 13:37
@lxy-9602

Collaborator

I wanted to confirm one point about the Parquet footer: in point lookup, is the footer becoming a performance hotspot because of parsing or because of I/O? If it is mainly I/O, could we handle it similarly to how we cache manifests? The cache added in ReadContext does not seem to be used yet. If the bottleneck is parsing, then we could consider caching a ::parquet::FileMetaData object instead. That said, this optimization feels somewhat specialized and would be more intrusive to the codebase, so I think we should first confirm that the performance benefit is large enough.
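
To make the I/O side of this question concrete: in the Parquet file layout, the file ends with the serialized footer, followed by a 4-byte little-endian footer length and the 4-byte magic `PAR1`. If I/O is the hotspot, the serialized footer is the natural unit to cache as raw bytes. The following is a minimal sketch of locating those bytes in a buffer that holds the tail of a file; `ExtractFooterBytes` is a hypothetical helper, not a function from paimon-cpp or Arrow, and encrypted footers (which use a different magic) are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

// Returns the serialized footer from `tail` (the last bytes of a Parquet file),
// or std::nullopt if the magic is wrong or `tail` does not contain the whole footer.
std::optional<std::string> ExtractFooterBytes(const std::string& tail) {
    constexpr std::size_t kTrailerSize = 8;  // 4-byte length + 4-byte magic
    if (tail.size() < kTrailerSize) return std::nullopt;

    const char* end = tail.data() + tail.size();
    if (std::memcmp(end - 4, "PAR1", 4) != 0) return std::nullopt;

    // Decode the little-endian footer length byte by byte to stay endian-neutral.
    const unsigned char* len = reinterpret_cast<const unsigned char*>(end - kTrailerSize);
    const std::uint32_t footer_len = static_cast<std::uint32_t>(len[0]) |
                                     (static_cast<std::uint32_t>(len[1]) << 8) |
                                     (static_cast<std::uint32_t>(len[2]) << 16) |
                                     (static_cast<std::uint32_t>(len[3]) << 24);

    // The caller did not read far enough back; a real reader would fetch more bytes.
    if (footer_len > tail.size() - kTrailerSize) return std::nullopt;

    return tail.substr(tail.size() - kTrailerSize - footer_len, footer_len);
}
```

Caching these bytes removes the remote read but still pays for Thrift deserialization on every open, whereas caching the parsed object removes both. Which one matters depends on the measurement asked for above.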

@gripleaf

Contributor Author

Caching Parquet footer bytes in #373 instead; closing this PR.

@gripleaf gripleaf closed this Jun 17, 2026