feat(parquet): support pushing bitmaps down to page-level filtering (except for nested columns) #375

Open
zhf999 wants to merge 11 commits into alibaba:main from zhf999:paged-bitmap2
@zhf999 zhf999 commented Jun 17, 2026

Purpose

This PR follows PR#371.

  • Support pushing selection bitmaps down to Parquet page-level filtering so only matching pages are read when both predicate and bitmap pruning are available.
  • Refactor target row-group filtering metadata into a dedicated TargetRowGroup type to carry row-group index, partial-match status, row ranges, and read-range exclusion flags consistently across planning and reading phases.
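
To make the second point concrete, here is a minimal sketch of what such a type might carry. The member names, the `RowRange` struct, and the `SelectedRows` helper are assumptions made for illustration; they are not the definitions in target_row_group.h.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical half-open row range [begin, end), relative to the row group start.
struct RowRange {
    int64_t begin;
    int64_t end;
};

// Illustrative stand-in for the PR's TargetRowGroup; all member names are assumptions.
class TargetRowGroup {
public:
    // A fully matched row group: every row is read, so no ranges are stored.
    explicit TargetRowGroup(int32_t index) : index_(index) {}

    // A partially matched row group: only the given row ranges are read.
    TargetRowGroup(int32_t index, std::vector<RowRange> ranges)
        : index_(index), partially_matched_(true), ranges_(std::move(ranges)) {
        for (const RowRange& r : ranges_) {
            assert(r.begin < r.end);  // ranges must be non-empty
        }
    }

    int32_t index() const { return index_; }
    bool partially_matched() const { return partially_matched_; }
    const std::vector<RowRange>& ranges() const { return ranges_; }

    // Assumed meaning: the reader should leave this row group out when it
    // computes the byte ranges to read.
    bool exclude_from_read_range() const { return exclude_from_read_range_; }
    void set_exclude_from_read_range(bool v) { exclude_from_read_range_ = v; }

    // Number of rows that will be read, given the row group's total row count.
    int64_t SelectedRows(int64_t total_rows) const {
        if (!partially_matched_) return total_rows;
        int64_t n = 0;
        for (const RowRange& r : ranges_) n += r.end - r.begin;
        return n;
    }

private:
    int32_t index_;
    bool partially_matched_ = false;
    bool exclude_from_read_range_ = false;
    std::vector<RowRange> ranges_;
};

using TargetRowGroups = std::vector<TargetRowGroup>;
```

Keeping the index, match status, and ranges in one object means the planning and reading phases cannot disagree about which rows of a row group are wanted.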

Changes

  • src/paimon/format/parquet/target_row_group.h

    • Introduced TargetRowGroup class and TargetRowGroups alias.
    • Replaced ad-hoc struct usage from row_ranges.h.
  • src/paimon/format/parquet/parquet_file_batch_reader.h

    • Updated row-group filtering APIs to operate on TargetRowGroups instead of raw std::vector<int32_t>.
    • Added FindColumnWithOffsetIndex and FilterPagesByBitmap.
  • src/paimon/format/parquet/parquet_file_batch_reader.cpp

    • In SetReadSchema, switched row-group candidate list to TargetRowGroups and applied bitmap filter before page-index filtering.
    • Added bitmap-driven row-group/page pruning:
      • For row groups with offset index metadata, it computes per-page row ranges and marks those RGs as partially matched.
      • If any nested columns are included in the read schema, bitmap filtering falls back to row-group-level filtering.
    • Enhanced FilterRowGroupsByPageIndex to intersect page ranges when RGs are already partially matched by bitmap.
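
The per-page range computation described above can be sketched roughly as follows. This is not the PR's `FilterPagesByBitmap`. It assumes the bitmap has already been turned into a sorted list of row-group-relative row ids, and that one column's page boundaries are available as the first row index of each page (the information a Parquet offset index provides). `PagesHitByBitmap` and `RowRange` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical half-open row range [begin, end), relative to the row group start.
struct RowRange {
    int64_t begin;
    int64_t end;
};

// page_first_rows: first row index of each page, ascending, starting at 0.
// num_rows:        total number of rows in the row group.
// selected_rows:   sorted row ids (relative to the row group) set in the bitmap.
// Returns the row ranges of all pages containing at least one selected row,
// with adjacent pages merged into a single range.
std::vector<RowRange> PagesHitByBitmap(const std::vector<int64_t>& page_first_rows,
                                       int64_t num_rows,
                                       const std::vector<int64_t>& selected_rows) {
    assert(!page_first_rows.empty() && page_first_rows.front() == 0);
    assert(std::is_sorted(page_first_rows.begin(), page_first_rows.end()));
    assert(std::is_sorted(selected_rows.begin(), selected_rows.end()));

    std::vector<RowRange> out;
    auto it = selected_rows.begin();
    for (std::size_t p = 0; p < page_first_rows.size() && it != selected_rows.end(); ++p) {
        const int64_t begin = page_first_rows[p];
        const int64_t end =
            (p + 1 < page_first_rows.size()) ? page_first_rows[p + 1] : num_rows;
        // Skip to the first selected row at or after this page's first row.
        it = std::lower_bound(it, selected_rows.end(), begin);
        if (it == selected_rows.end() || *it >= end) continue;  // no hit in this page
        if (!out.empty() && out.back().end == begin) {
            out.back().end = end;  // extend the previous range over this adjacent page
        } else {
            out.push_back({begin, end});
        }
    }
    return out;
}
```

With four 1024-row pages and selected rows 5, 1030, and 4000, the sketch returns [0, 2048) and [3072, 4096): the first two pages merge, the third is skipped, and the fourth is kept.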

Tests

  • Added bitmap + page-level integration tests:
    • BitmapAllPagesSomeRowGroups
    • BitmapPartialPagesSingleRowGroup
    • BitmapAllAndPartialPagesMixed
    • BitmapAllPagesWithPredicate
    • BitmapPartialPagesWithPredicate
    • BitmapMixedWithPredicate
    • NestedMapBitmapFallback
    • NestedListBitmapFallback
    • NestedStructBitmapFallback
  • Coverage includes bitmap-only pruning, bitmap+predicate pruning, and mixed partial/full page hit cases.
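
The bitmap+predicate cases exercise the intersection of two range lists: the pages kept by the bitmap and the pages kept by the predicate's page index. A minimal two-pointer sketch of that intersection is below. It assumes both lists are sorted and non-overlapping; `Intersect`, `IsSortedDisjoint`, and `RowRange` are illustrative names, not the PR's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical half-open row range [begin, end).
struct RowRange {
    int64_t begin;
    int64_t end;
};

// True if ranges are non-empty, ascending, and do not overlap.
bool IsSortedDisjoint(const std::vector<RowRange>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i].begin >= v[i].end) return false;
        if (i > 0 && v[i - 1].end > v[i].begin) return false;
    }
    return true;
}

// Intersects two sorted, disjoint range lists in O(|a| + |b|).
std::vector<RowRange> Intersect(const std::vector<RowRange>& a,
                                const std::vector<RowRange>& b) {
    assert(IsSortedDisjoint(a) && IsSortedDisjoint(b));
    std::vector<RowRange> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        const int64_t lo = std::max(a[i].begin, b[j].begin);
        const int64_t hi = std::min(a[i].end, b[j].end);
        if (lo < hi) out.push_back({lo, hi});  // keep the overlapping part, if any
        // Advance whichever range ends first; it cannot overlap anything later.
        if (a[i].end < b[j].end) {
            ++i;
        } else {
            ++j;
        }
    }
    return out;
}
```

For example, intersecting bitmap ranges [0, 2048) and [3072, 4096) with a predicate range [1024, 3500) leaves [1024, 2048) and [3072, 3500).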

Datasets

Performance tests were run on two datasets.

| Dataset | File size | num_rows | num_columns | num_rowgroups | Page length (rows) |
| --- | --- | --- | --- | --- | --- |
| Dataset#1 | 6.7GB | 2M | 1,176 | 39 | 1024 |
| Dataset#2 | 3GB | 11M | 4 | 23 | 1024 |

Test Methodology

To validate bitmap behavior, we organize tests into six groups based on bitmap distribution patterns. Each group contains five distinct datasets, and the final result reported is the average of these five runs.

Test Cases

  1. Matching records are contiguous and only one row is queried.
  2. Matching records are contiguous and 100 rows are queried.
  3. Matching records are contiguous and 10,000 rows are queried.
  4. Matching records are scattered in 5 segments, each segment containing 1 row.
  5. Matching records are scattered in 5 segments, each segment containing 100 rows.
  6. Matching records are scattered in 100 segments, each segment containing 1 row.
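
The six cases differ only in the number of segments and the rows per segment, so a selection for any of them could be generated along these lines. The description does not say where the segments are placed, so the even spacing used here is an assumption, as are the names `BitmapCase`, `kCases`, and `MakeSegmentedSelection`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One test case: how many segments, and how many consecutive rows in each.
struct BitmapCase {
    int64_t num_segments;
    int64_t rows_per_segment;
};

// Cases 1 to 6 from the list above, in order.
const BitmapCase kCases[6] = {{1, 1}, {1, 100}, {1, 10000}, {5, 1}, {5, 100}, {100, 1}};

// Returns sorted row ids for num_segments runs of rows_per_segment consecutive rows.
// Assumption: segment s starts at s * (total_rows / num_segments).
std::vector<int64_t> MakeSegmentedSelection(int64_t total_rows,
                                            int64_t num_segments,
                                            int64_t rows_per_segment) {
    assert(total_rows > 0 && num_segments > 0 && rows_per_segment > 0);
    const int64_t stride = total_rows / num_segments;
    assert(rows_per_segment <= stride);  // guarantees segments do not overlap

    std::vector<int64_t> rows;
    rows.reserve(static_cast<std::size_t>(num_segments * rows_per_segment));
    for (int64_t s = 0; s < num_segments; ++s) {
        const int64_t start = s * stride;
        for (int64_t r = 0; r < rows_per_segment; ++r) {
            rows.push_back(start + r);
        }
    }
    return rows;
}
```

The returned ids are already sorted, so they can be loaded into whatever bitmap type the reader expects.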

Test Results

On Dataset#1

| Branch | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Main | 2.50s | 2.45s | 3.08s | 11.46s | 11.34s | 103.63s |
| This PR | 0.35s | 0.34s | 0.96s | 1.94s | 1.88s | 18.80s |

On Dataset#2

| Branch | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Main | 1.28s | 1.28s | 1.28s | 3.12s | 3.07s | 6.47s |
| This PR | 10ms | 10ms | 42ms | 35ms | 40ms | 0.37s |
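
Because the Dataset#2 timings mix seconds and milliseconds, it helps to convert to a single unit before comparing branches. The helpers below are an illustrative sketch (the names `MsToSeconds` and `Speedup` are not from the PR) that only divides the numbers reported above.

```cpp
#include <cassert>

// Converts a duration in milliseconds to seconds.
double MsToSeconds(double ms) {
    return ms / 1000.0;
}

// Ratio of the Main branch time to this PR's time, both in seconds.
// A value above 1 means the PR is faster.
double Speedup(double main_seconds, double pr_seconds) {
    assert(main_seconds > 0.0 && pr_seconds > 0.0);
    return main_seconds / pr_seconds;
}
```

For Case 1, `Speedup(2.50, 0.35)` gives roughly 7x on Dataset#1, and `Speedup(1.28, MsToSeconds(10))` gives roughly 128x on Dataset#2.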

API and Format

  • No public API changes.
  • No storage format/protocol changes.

Documentation

  • No user-facing docs required for this change.

Generative AI tooling

  • gpt-5.3-codex

Copilot AI review requested due to automatic review settings June 17, 2026 07:28

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.
