Skip to content

Conversation

@WaterWhisperer
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Currently the default behavior is to parse the full vector of encoding stats, but given the limited use of this information we should instead default to the more concise and performant bitmask.

What changes are included in this PR?

  • Implement Default for ParquetMetaDataOptions with encoding_stats_as_mask: true
  • Update Thrift decoding logic to default to bitmask even if options are missing
  • Update documentation to reflect the new default behavior
  • Update existing tests to maintain coverage for full statistics
  • Add new tests to verify default behavior and full stats option

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 27, 2025
@alamb
Copy link
Contributor

alamb commented Dec 27, 2025

run benchmark metadata

@apache apache deleted a comment from alamb-ghbot Dec 27, 2025
@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing feat/change-default-page-encoding-stats (552f0ec) to 8ed2b52 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=feat_change-default-page-encoding-stats
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                     feat_change-default-page-encoding-stats    main
-----                                     ---------------------------------------    ----
decode metadata (wide) with schema        1.00     35.5±1.06ms        ? ?/sec        1.21     42.9±0.42ms        ? ?/sec
decode metadata (wide) with skip PES      1.00     38.7±0.39ms        ? ?/sec        1.01     39.2±0.38ms        ? ?/sec
decode metadata (wide) with stats mask    1.00     38.3±0.39ms        ? ?/sec        1.01     38.6±0.67ms        ? ?/sec
decode metadata with schema               1.00      5.1±0.04µs        ? ?/sec        1.21      6.1±0.05µs        ? ?/sec
decode metadata with skip PES             1.00      8.9±0.26µs        ? ?/sec        1.04      9.3±0.18µs        ? ?/sec
decode metadata with stats mask           1.00      8.9±0.04µs        ? ?/sec        1.05      9.4±0.07µs        ? ?/sec
decode parquet metadata                   1.00      8.9±0.04µs        ? ?/sec        1.13     10.0±0.08µs        ? ?/sec
decode parquet metadata (wide)            1.00     38.2±0.87ms        ? ?/sec        1.20     45.9±0.48ms        ? ?/sec
open(default)                             1.00      9.5±0.32µs        ? ?/sec        1.12     10.6±0.12µs        ? ?/sec
open(page index)                          1.00    166.2±2.37µs        ? ?/sec        1.00    166.3±1.61µs        ? ?/sec

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @WaterWhisperer, this looks good to me. I think we should also modify the benchmark to use the old behavior for the baseline benchmarks, or remove the 'PES' benches which are now redundant.

@etseidl etseidl added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Dec 27, 2025
@WaterWhisperer WaterWhisperer force-pushed the feat/change-default-page-encoding-stats branch from 3a58d23 to 248b7eb Compare December 28, 2025 05:50
@Dandandan
Copy link
Contributor

@etseidl could you have one more look at the PR?

@Dandandan
Copy link
Contributor

run benchmark metadata

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing feat/change-default-page-encoding-stats (248b7eb) to 8ed2b52 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=feat_change-default-page-encoding-stats
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                     feat_change-default-page-encoding-stats    main
-----                                     ---------------------------------------    ----
decode metadata (full stats)              1.00      9.6±0.10µs        ? ?/sec      
decode metadata (wide) (full stats)       1.00     42.4±0.41ms        ? ?/sec      
decode metadata (wide) with schema        1.00     35.3±0.39ms        ? ?/sec        1.11     39.2±0.59ms        ? ?/sec
decode metadata (wide) with skip PES                                                 1.00     38.7±0.73ms        ? ?/sec
decode metadata (wide) with stats mask                                               1.00     38.0±0.36ms        ? ?/sec
decode metadata with schema               1.00      5.1±0.14µs        ? ?/sec        1.12      5.7±0.05µs        ? ?/sec
decode metadata with skip PES                                                        1.00      9.0±0.18µs        ? ?/sec
decode metadata with stats mask                                                      1.00      9.0±0.19µs        ? ?/sec
decode parquet metadata                   1.00      9.0±0.25µs        ? ?/sec        1.05      9.4±0.04µs        ? ?/sec
decode parquet metadata (wide)            1.00     38.4±1.55ms        ? ?/sec        1.10     42.2±0.81ms        ? ?/sec
open(default)                             1.00      9.5±0.18µs        ? ?/sec        1.04      9.9±0.17µs        ? ?/sec
open(page index)                          1.00    164.3±1.61µs        ? ?/sec        1.01    165.4±0.97µs        ? ?/sec

@Dandandan Dandandan merged commit 67e04e7 into apache:main Jan 8, 2026
16 checks passed
@WaterWhisperer WaterWhisperer deleted the feat/change-default-page-encoding-stats branch January 8, 2026 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Change default behavior for Parquet PageEncodingStats

5 participants