Refactor PushDownFilter benchmark suite: add fast default mode, validation, and unified A/B harness by kosiew · Pull Request #21029 · apache/datafusion

kosiew · 2026-03-18T15:17:56Z

Which issue does this PR close?

Part of #20002 (perf: push_down_filter is pathologically slow for some plans)

Rationale for this change

The existing PushDownFilter benchmark suite is difficult to use for iteration and debugging due to:

- Large default parameter sweeps that result in long execution times
- Lack of validation for generated predicates, allowing silent benchmark construction issues
- Duplicated setup logic across benchmark scenarios

These issues make it hard to distinguish between slow execution and incorrect behavior, reducing developer productivity when investigating optimizer performance.

This PR improves usability, correctness, and maintainability of the benchmark suite.

What changes are included in this PR?

- Introduced a lightweight default sweep (`DEFAULT_SWEEP_POINTS`) for faster local iteration
- Added an optional full sweep mode controlled by `DATAFUSION_PUSH_DOWN_FILTER_FULL_SWEEP`
- Refactored benchmark loops into a reusable helper (`bench_push_down_filter_ab`) to unify A/B comparisons
- Unified DataFrame construction via `build_left_join_df_with_push_down_filter`
- Added validation helpers:
  - `find_filter_predicates`
  - `assert_case_heavy_left_join_inference_candidates`
- Improved CASE expression generation to better simulate realistic predicate shapes and ensure join-key involvement
- Ensured predicates reference both join keys and non-join columns for meaningful PushDownFilter evaluation
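As a rough illustration of the A/B idea behind `bench_push_down_filter_ab`, the sketch below times one scenario twice, once with and once without the rule under test. It is not the code from this PR: the real suite uses Criterion, while `AbResult`, `time_iterations`, and `bench_ab` are hypothetical names built only on the standard library.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Timings for one scenario, measured with and without the rule under test.
#[derive(Debug)]
struct AbResult {
    label: String,
    with_rule: Duration,
    without_rule: Duration,
}

/// Run `f` a fixed number of times and return the total elapsed time.
fn time_iterations<T>(iterations: u32, mut f: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the compiler from optimizing the call away.
        black_box(f());
    }
    start.elapsed()
}

/// Hypothetical stand-in for an A/B helper: one scenario, two variants,
/// so every benchmark case shares the same measurement loop.
fn bench_ab<T>(
    label: &str,
    iterations: u32,
    with_rule: impl FnMut() -> T,
    without_rule: impl FnMut() -> T,
) -> AbResult {
    AbResult {
        label: label.to_string(),
        with_rule: time_iterations(iterations, with_rule),
        without_rule: time_iterations(iterations, without_rule),
    }
}
```

A caller would pass two closures that optimize the same plan, one with PushDownFilter enabled and one without, and compare `with_rule` against `without_rule` in the returned `AbResult`.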

Are these changes tested?

Yes.

Assertions were added to validate that generated predicates:
- Match the expected predicate count
- Reference join keys (l.c0 or r.c0)
- Include non-join columns

These validations act as correctness checks during benchmark construction and help prevent silent logical errors.
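The three checks can be sketched in isolation. This is a simplified model rather than the PR's `assert_case_heavy_left_join_inference_candidates`: `Predicate` and `validate_predicates` are hypothetical, a predicate is reduced to the list of columns it references, and the sketch assumes every predicate must contain both a join key and a non-join column. It returns a `Result` instead of panicking so the failure cases are easy to inspect.

```rust
/// Join-key columns named in the PR description.
const JOIN_KEYS: [&str; 2] = ["l.c0", "r.c0"];

/// Simplified stand-in for a filter predicate: only the columns it references.
struct Predicate {
    columns: Vec<&'static str>,
}

/// Check predicate count, join-key involvement, and non-join column involvement.
fn validate_predicates(
    predicates: &[Predicate],
    expected_count: usize,
) -> Result<(), String> {
    if predicates.len() != expected_count {
        return Err(format!(
            "expected {expected_count} predicates, found {}",
            predicates.len()
        ));
    }
    for (index, predicate) in predicates.iter().enumerate() {
        let has_join_key = predicate.columns.iter().any(|c| JOIN_KEYS.contains(c));
        let has_non_join = predicate.columns.iter().any(|c| !JOIN_KEYS.contains(c));
        if !has_join_key {
            return Err(format!("predicate {index} does not reference a join key"));
        }
        if !has_non_join {
            return Err(format!("predicate {index} references no non-join column"));
        }
    }
    Ok(())
}
```

Running such a check while the benchmark input is built means a malformed predicate set fails immediately instead of producing timings for a plan that never exercises the intended code path.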

Are there any user-facing changes?

No user-facing API changes.

However, benchmark execution behavior has changed:
- Default runs are now significantly faster because they use the reduced default sweep
- The full sweep is opt-in via the DATAFUSION_PUSH_DOWN_FILTER_FULL_SWEEP environment variable
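One way such an opt-in switch could look is sketched below. Only the names `DEFAULT_SWEEP_POINTS` and `DATAFUSION_PUSH_DOWN_FILTER_FULL_SWEEP` come from this PR. The shape of a sweep point, the concrete values, the parsing rule for the variable, and the helper functions are all assumptions made for illustration.

```rust
use std::env;

/// Name of the opt-in environment variable, as given in the PR description.
const FULL_SWEEP_ENV: &str = "DATAFUSION_PUSH_DOWN_FILTER_FULL_SWEEP";

/// Hypothetical (predicate_count, case_depth) pairs for the fast default mode.
const DEFAULT_SWEEP_POINTS: &[(usize, usize)] = &[(10, 1), (30, 2)];

/// Assumed parsing rule: any value other than empty, "0", or "false" opts in.
fn full_sweep_enabled(raw: Option<&str>) -> bool {
    matches!(
        raw.map(str::trim),
        Some(v) if !v.is_empty() && v != "0" && !v.eq_ignore_ascii_case("false")
    )
}

/// Pick the sweep for a given raw environment value.
fn sweep_points(raw: Option<&str>) -> Vec<(usize, usize)> {
    if full_sweep_enabled(raw) {
        // Hypothetical full grid: every combination of the two parameters.
        let mut points = Vec::new();
        for predicates in [10, 20, 30, 40] {
            for depth in [1, 2, 3] {
                points.push((predicates, depth));
            }
        }
        points
    } else {
        DEFAULT_SWEEP_POINTS.to_vec()
    }
}

/// Read the variable from the process environment and pick the sweep.
fn sweep_points_from_env() -> Vec<(usize, usize)> {
    sweep_points(env::var(FULL_SWEEP_ENV).ok().as_deref())
}
```

Separating the parsing (`sweep_points`) from the environment lookup (`sweep_points_from_env`) keeps the selection logic testable without mutating process-wide state.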

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew · 2026-03-18T15:18:45Z

run benchmark sql_planner_extended

kosiew · 2026-03-18T15:19:05Z

show benchmark queue

adriangbot · 2026-03-18T15:19:07Z

Hi @kosiew, you asked to view the benchmark queue (#21029 (comment)).

| Comment | Repo | PR | User | Benchmarks | Status |
| --- | --- | --- | --- | --- | --- |
| #4083377148 | apache/datafusion | #21029 | kosiew | ["sql_planner_extended"] | running |

adriangbot · 2026-03-18T15:21:33Z

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4083377148-423-7k56m 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-05-20002 (6cc60be) to ab28234 (merge-base) diff
BENCH_NAME=sql_planner_extended
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner_extended
BENCH_FILTER=
Results will be posted here when complete

…rganization
- Refactored functions to build left join DataFrames with push down filters.
- Created `bench_push_down_filter_ab` for streamlined benchmarking of push down filter impact.
- Updated benchmark groups to use the new building functions, enhancing readability and maintainability.

… checks
- Added assertions to verify inference candidates in case-heavy left join data frames.
- Introduced helper functions `find_filter_predicates` and `assert_case_heavy_left_join_inference_candidates` for better structure and readability.
- Updated join logic in `build_case_heavy_left_join_query` for more complex case handling.

This update improves the robustness of benchmarks by ensuring correctness in filter predicate references related to join keys.

…ery handling
- Removed unnecessary functions and constants related to push down filter sweep configurations.
- Simplified the logic for constructing test DataFrames, focusing on essential parameters for benchmarks.
- Enhanced clarity of the benchmarks by differentiating cases for `with_push_down_filter` and `without_push_down_filter`.
- Updated the implementation to improve readability and maintainability.

- Updated `find_filter_predicates` function to streamline the code by removing unnecessary line breaks and retaining clarity in the error message when the expected structure is not met.
- Ensured that the function continues to accurately identify and handle logical plans with projections.

…ests
- Refactored case heavy and non-case left join benchmark functions to include push down filter tests.
- Added utility functions to configure benchmark sweeps for push down filters, making it customizable via environment variables.
- Improved assertions for filter predicates in case heavy left join inference.
- Cleaned up and organized existing benchmark code for clarity and reuse.

kosiew mentioned this pull request Mar 18, 2026

Avoid null-restrict evaluation for predicates that reference non-join columns in PushDownFilter #20961

Draft

github-actions bot added the core Core DataFusion crate label Mar 18, 2026

kosiew changed the title ~~Make PushDownFilter benchmark sweeps opt-in to reduce long default runtimes~~ Improve PushDownFilter benchmark robustness, add sweep control, and refactor A/B harness Mar 30, 2026

kosiew changed the title ~~Improve PushDownFilter benchmark robustness, add sweep control, and refactor A/B harness~~ Improve PushDownFilter benchmark usability with configurable sweeps, validation, and refactoring Mar 31, 2026

kosiew marked this pull request as ready for review April 1, 2026 05:26

kosiew added 9 commits April 1, 2026 13:31

Refactor push down filter benchmarks to use dynamic sweep points

aa6db02

Add dynamic sample size configuration for push down filter benchmarks

52bf5c9

Remove unused sample size function and constant from push down filter…

f996e37

… benchmarks

fix(benchmarks): remove duplicate function `build_case_heavy_left_joi…

e170b32

…n_df_with_push_down_filter` This commit removes the `build_case_heavy_left_join_df_with_push_down_filter` duplicate function from the `sql_planner_extended.rs` benchmark file

kosiew force-pushed the push-down-05-20002 branch from 074c431 to e2af627 Compare April 1, 2026 05:31

kosiew changed the title ~~Improve PushDownFilter benchmark usability with configurable sweeps, validation, and refactoring~~ Refactor PushDownFilter benchmark suite: add fast default mode, validation, and unified A/B harness Apr 1, 2026

kosiew wants to merge 9 commits into apache:main from kosiew:push-down-05-20002
