parquet: decoder-level dictionary filter pushdown for Parquet reader#9464
lyang24 wants to merge 1 commit into apache:main
Push predicate evaluation into the decoder for dictionary-encoded BYTE_ARRAY columns. Instead of decoding all rows into StringViewArray and then filtering, this evaluates the predicate once on the small dictionary (~N unique values), then maps integer keys to booleans via a simple lookup, so no intermediate arrays are created for data rows.

Two-phase approach:
- Phase 1: decode dictionary page, evaluate predicate, produce matching_keys: Vec<bool> per row group
- Phase 2: DictFilterDecoder reads RLE-encoded integer keys and maps each key to matching_keys[key], producing a BooleanArray that feeds directly into RowSelection

Adds ArrowPredicate::use_dictionary_encoding() opt-in flag and falls back to the normal path for multi-column predicates, non-BYTE_ARRAY types, nested columns, or PLAIN-encoded pages.
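The two-phase approach described above can be sketched with plain slices. This is only an illustration, not the PR's code: `build_matching_keys` and `filter_keys` are hypothetical names, the keys are assumed to be already RLE-decoded into `u32` values, and a `Vec<bool>` stands in for the BooleanArray.

```rust
/// Phase 1 (sketch): evaluate the predicate once per dictionary entry.
/// `dictionary` stands in for the values decoded from the dictionary page.
fn build_matching_keys<F>(dictionary: &[&str], predicate: F) -> Vec<bool>
where
    F: Fn(&str) -> bool,
{
    dictionary.iter().map(|&value| predicate(value)).collect()
}

/// Phase 2 (sketch): map each integer key to its precomputed result.
/// Assumption of this sketch: a key outside the dictionary is treated
/// as non-matching instead of being reported as an error.
fn filter_keys(keys: &[u32], matching_keys: &[bool]) -> Vec<bool> {
    keys.iter()
        .map(|&key| matching_keys.get(key as usize).copied().unwrap_or(false))
        .collect()
}
```

The predicate runs once per dictionary entry rather than once per row, so the per-row work is reduced to an index lookup.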
Hi @Dandandan, can you please help me summon the gh bots for the ClickBench benchmarks?
run benchmark arrow_reader_clickbench
Sorry, missed your message. Started now!
🤖: Benchmark completed
Interesting, I have the opposite results on q13 locally.
Which issue does this PR close?
Rationale for this change
For dictionary-encoded columns, the reader previously decoded the column data into a StringViewArray and only then applied the filter. This materialized every row in the column before filtering, which introduced avoidable overhead.
What changes are included in this PR?
We introduced DictFilterDecoder for columns that are 100% dictionary encoded.
The decoder first processes the dictionary keys, then applies the filter.
After that, it expands null values.
Finally, it filters the dictionary values and materializes only the rows that pass the filter.
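The null-expansion and late-materialization steps above might look roughly like the following sketch. It is not the PR's implementation: `expand_nulls` and `materialize_selected` are hypothetical helpers, validity is modeled as a `&[bool]` with one entry per row, keys exist only for non-null rows, and null rows are assumed never to pass the filter.

```rust
/// Expand a filter computed over non-null values to one entry per row.
/// Assumption of this sketch: null rows never pass the filter.
fn expand_nulls(validity: &[bool], non_null_matches: &[bool]) -> Vec<bool> {
    let mut matches = non_null_matches.iter().copied();
    validity
        .iter()
        // `next()` is only consumed for valid rows thanks to short-circuiting.
        .map(|&is_valid| is_valid && matches.next().unwrap_or(false))
        .collect()
}

/// Materialize only the rows that passed the filter by looking up
/// their keys in the dictionary. `keys` holds one key per non-null row.
fn materialize_selected(
    dictionary: &[&str],
    keys: &[u32],
    validity: &[bool],
    row_matches: &[bool],
) -> Vec<String> {
    let mut next_key = keys.iter();
    let mut out = Vec::new();
    for (&is_valid, &selected) in validity.iter().zip(row_matches) {
        if !is_valid {
            continue; // null rows carry no key
        }
        if let Some(&key) = next_key.next() {
            if selected {
                if let Some(value) = dictionary.get(key as usize) {
                    out.push(value.to_string());
                }
            }
        }
    }
    out
}
```

Only rows whose filter bit is set are ever copied out of the dictionary; rejected rows cost one key read and no string materialization.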
Are these changes tested?
Existing tests should pass; additional tests were added.
Are there any user-facing changes?
Yes. Added a new API, with_dict_pushdown, to opt in to dictionary filter pushdown for predicates.