Refactor PushDownFilter benchmark suite: add fast default mode, validation, and unified A/B harness#21029

Open
kosiew wants to merge 9 commits into apache:main from kosiew:push-down-05-20002


@kosiew kosiew commented Mar 18, 2026

Which issue does this PR close?


Rationale for this change

The existing PushDownFilter benchmark suite is difficult to use for iteration and debugging due to:

  • Large default parameter sweeps that result in long execution times
  • Lack of validation for generated predicates, allowing silent benchmark construction issues
  • Duplicated setup logic across benchmark scenarios

These issues make it hard to distinguish between slow execution and incorrect behavior, reducing developer productivity when investigating optimizer performance.

This PR improves usability, correctness, and maintainability of the benchmark suite.


What changes are included in this PR?

  • Introduced a lightweight default sweep (`DEFAULT_SWEEP_POINTS`) for faster local iteration

  • Added optional full sweep mode controlled by `DATAFUSION_PUSH_DOWN_FILTER_FULL_SWEEP`

  • Refactored benchmark loops into a reusable helper (`bench_push_down_filter_ab`) to unify A/B comparisons

  • Unified DataFrame construction via `build_left_join_df_with_push_down_filter`

  • Added validation helpers:

    • `find_filter_predicates`
    • `assert_case_heavy_left_join_inference_candidates`

  • Improved CASE expression generation to better simulate realistic predicate shapes and ensure join-key involvement

  • Ensured predicates reference both join keys and non-join columns for meaningful PushDownFilter evaluation
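
To illustrate the A/B idea behind `bench_push_down_filter_ab`, here is a minimal sketch that uses only the standard library. It is not the helper from this PR: the name `run_ab`, the `AbSample` struct, and the closure signature are hypothetical, and plain `Instant` timing stands in for the Criterion groups the real benchmark uses. The point is only the loop shape: every sweep point is run once with the rule enabled and once with it disabled, through a single shared code path.

```rust
use std::time::{Duration, Instant};

/// One timing sample for a single sweep point and variant (hypothetical type).
#[derive(Debug)]
pub struct AbSample {
    pub sweep_point: usize,
    pub push_down_enabled: bool,
    pub elapsed: Duration,
}

/// Hypothetical A/B driver: for each sweep point, call `plan` once with the
/// rule enabled (`true`) and once with it disabled (`false`), timing each call.
pub fn run_ab<F>(sweep_points: &[usize], mut plan: F) -> Vec<AbSample>
where
    F: FnMut(usize, bool),
{
    let mut samples = Vec::with_capacity(sweep_points.len() * 2);
    for &sweep_point in sweep_points {
        for push_down_enabled in [true, false] {
            let start = Instant::now();
            plan(sweep_point, push_down_enabled);
            samples.push(AbSample {
                sweep_point,
                push_down_enabled,
                elapsed: start.elapsed(),
            });
        }
    }
    samples
}
```

Keeping both variants inside one helper means the "with" and "without" cases cannot drift apart in setup, which is the duplication problem the rationale describes.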


Are these changes tested?

Yes.

  • Assertions were added to validate that generated predicates:

    • Match the expected predicate count
    • Reference join keys (`l.c0` or `r.c0`)
    • Include non-join columns

These validations act as correctness checks during benchmark construction and help prevent silent logical errors.
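
The three checks listed above can be sketched on a simplified model. This is an illustration under assumptions, not the code in `assert_case_heavy_left_join_inference_candidates`: the `PredicateColumns` struct is a hypothetical stand-in that records only the qualified column names a predicate references (the real helpers work on DataFusion logical plans), and the sketch assumes each individual predicate must satisfy both column conditions.

```rust
/// Hypothetical stand-in for a filter predicate: only the qualified
/// column names it references, not a real DataFusion expression.
#[derive(Debug, Clone)]
pub struct PredicateColumns {
    pub columns: Vec<String>,
}

/// Join keys named in the PR description.
const JOIN_KEYS: [&str; 2] = ["l.c0", "r.c0"];

/// Returns an error describing the first failed check, if any.
pub fn check_predicates(
    predicates: &[PredicateColumns],
    expected_count: usize,
) -> Result<(), String> {
    // Check 1: the number of predicates matches the expectation.
    if predicates.len() != expected_count {
        return Err(format!(
            "expected {} predicates, found {}",
            expected_count,
            predicates.len()
        ));
    }
    for (index, predicate) in predicates.iter().enumerate() {
        let is_key = |c: &String| JOIN_KEYS.contains(&c.as_str());
        // Check 2: at least one join key is referenced.
        if !predicate.columns.iter().any(|c| is_key(c)) {
            return Err(format!("predicate {} references no join key", index));
        }
        // Check 3: at least one non-join column is referenced.
        if !predicate.columns.iter().any(|c| !is_key(c)) {
            return Err(format!("predicate {} references no non-join column", index));
        }
    }
    Ok(())
}
```

In a benchmark setup function, the result would typically be unwrapped so that a malformed predicate set aborts construction instead of producing a misleading measurement.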


Are there any user-facing changes?

No user-facing API changes.

However:

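  • Benchmark execution behavior has changed:

    • Default runs use the lightweight sweep (`DEFAULT_SWEEP_POINTS`) and finish faster than the previous full sweep
    • The full sweep is opt-in via the `DATAFUSION_PUSH_DOWN_FILTER_FULL_SWEEP` environment variable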
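
A rough sketch of how an opt-in sweep like this can be selected is shown below. Only the names `DEFAULT_SWEEP_POINTS` and `DATAFUSION_PUSH_DOWN_FILTER_FULL_SWEEP` come from this PR. The sweep values, the `FULL_SWEEP_POINTS` constant, the helper names, and the rule for which values enable the full sweep are all assumptions made for illustration.

```rust
/// Environment variable named in the PR description.
pub const FULL_SWEEP_ENV: &str = "DATAFUSION_PUSH_DOWN_FILTER_FULL_SWEEP";

/// Placeholder values: the PR description does not list the actual points.
pub const DEFAULT_SWEEP_POINTS: &[usize] = &[8, 32];
/// Hypothetical constant for the larger, opt-in sweep.
pub const FULL_SWEEP_POINTS: &[usize] = &[8, 16, 32, 64, 128];

/// Assumed parsing rule: "1" or "true" (case-insensitive, trimmed) opts in;
/// an unset variable or any other value keeps the fast default.
pub fn full_sweep_requested(raw: Option<&str>) -> bool {
    matches!(
        raw.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1") | Some("true")
    )
}

/// Pure selection step, kept separate so it can be checked without
/// touching the process environment.
pub fn select_sweep(full: bool) -> &'static [usize] {
    if full {
        FULL_SWEEP_POINTS
    } else {
        DEFAULT_SWEEP_POINTS
    }
}

/// Reads the environment once and returns the sweep to benchmark.
pub fn sweep_points() -> &'static [usize] {
    let raw = std::env::var(FULL_SWEEP_ENV).ok();
    select_sweep(full_sweep_requested(raw.as_deref()))
}
```

Separating the parsing and selection from the environment lookup keeps the default path cheap and makes the opt-in behavior easy to verify in isolation.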

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.


kosiew commented Mar 18, 2026

run benchmark sql_planner_extended


kosiew commented Mar 18, 2026

show benchmark queue

@adriangbot

Hi @kosiew, you asked to view the benchmark queue (#21029 (comment)).

| Comment | Repo | PR | User | Benchmarks | Status |
| --- | --- | --- | --- | --- | --- |
| #4083377148 | apache/datafusion | #21029 | kosiew | ["sql_planner_extended"] | running |

@adriangbot

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4083377148-423-7k56m 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-05-20002 (6cc60be) to ab28234 (merge-base) diff
BENCH_NAME=sql_planner_extended
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner_extended
BENCH_FILTER=
Results will be posted here when complete

github-actions bot added the core (Core DataFusion crate) label Mar 18, 2026
kosiew changed the title from "Make PushDownFilter benchmark sweeps opt-in to reduce long default runtimes" to "Improve PushDownFilter benchmark robustness, add sweep control, and refactor A/B harness" Mar 30, 2026
kosiew changed the title from "Improve PushDownFilter benchmark robustness, add sweep control, and refactor A/B harness" to "Improve PushDownFilter benchmark usability with configurable sweeps, validation, and refactoring" Mar 31, 2026
kosiew marked this pull request as ready for review April 1, 2026 05:26
kosiew added 9 commits April 1, 2026 13:31
…rganization

- Refactored functions to build left join DataFrames with push down filters.
- Created `bench_push_down_filter_ab` for streamlined benchmarking of push down filter impact.
- Updated benchmark groups to use the new building functions, enhancing readability and maintainability.
…n_df_with_push_down_filter`

This commit removes the duplicate `build_case_heavy_left_join_df_with_push_down_filter` function from the `sql_planner_extended.rs` benchmark file.
… checks

- Added assertions to verify inference candidates in case-heavy left join data frames.
- Introduced helper functions `find_filter_predicates` and `assert_case_heavy_left_join_inference_candidates` for better structure and readability.
- Updated join logic in `build_case_heavy_left_join_query` for more complex case handling.

This update improves the robustness of benchmarks by ensuring correctness in filter predicate references related to join keys.
…ery handling

- Removed unnecessary functions and constants related to push down filter sweep configurations.
- Simplified the logic for constructing test DataFrames, focusing on essential parameters for benchmarks.
- Enhanced clarity of the benchmarks by differentiating cases for `with_push_down_filter` and `without_push_down_filter`.
- Updated the implementation to improve readability and maintainability.
- Updated `find_filter_predicates` function to streamline the code by removing unnecessary line breaks and retaining clarity in the error message when the expected structure is not met.
- Ensured that the function continues to accurately identify and handle logical plans with projections.
…ests

- Refactored case heavy and non-case left join benchmark functions to include push down filter tests.
- Added utility functions to configure benchmark sweeps for push down filters, making it customizable via environment variables.
- Improved assertions for filter predicates in case heavy left join inference.
- Cleaned up and organized existing benchmark code for clarity and reuse.
kosiew force-pushed the push-down-05-20002 branch from 074c431 to e2af627 April 1, 2026 05:31
kosiew changed the title from "Improve PushDownFilter benchmark usability with configurable sweeps, validation, and refactoring" to "Refactor PushDownFilter benchmark suite: add fast default mode, validation, and unified A/B harness" Apr 1, 2026