Avoid null-restrict evaluation for predicates that reference non-join columns in PushDownFilter #20961

Draft
kosiew wants to merge 64 commits into apache:main from kosiew:push-down-02-20002

@kosiew
Contributor

@kosiew kosiew commented Mar 16, 2026

Which issue does this PR close?

Rationale for this change

PushDownFilter has been identified as a major planning-time bottleneck, with profiling showing significant time spent repeatedly evaluating expression types while reasoning about predicate pushdown and null-restriction behavior. This PR addresses a slice of that problem by reducing unnecessary work in null-restriction evaluation and by tightening join-filter conversion behavior to avoid incorrect rewrites in scalar-side join cases.

In particular, the patch focuses on two goals:

  1. Avoid expensive authoritative null-restriction evaluation when a cheaper syntactic determination is sufficient.
  2. Preserve correct post-join filter semantics for joins involving scalar-producing inputs, which regressed when filters were converted into join conditions too aggressively.

Together, these changes both improve optimizer robustness and target one of the hot paths called out in the profiling for PushDownFilter.

What changes are included in this PR?

This PR includes the following changes:

  • Adds regression SQL tests under datafusion/core/tests/sql/push_down_filter_regressions.rs and wires the new module into the SQL test suite.

  • Adds end-to-end regression coverage for:

    • window + scalar subquery planning and execution
    • sqllogictest-style execution with push_down_filter enabled/disabled
    • aggregate regression functions returning non-NULL results
    • correlated IN subqueries
    • NATURAL JOIN combined with UNION ALL
  • Adds an optimizer regression test to ensure a post-join filter is retained for a cross join where one side is scalar/aggregated.

  • Changes push_down_filter so filters are only converted into join conditions for inner joins when it is safe to do so, specifically avoiding that conversion for joins with empty ON clauses when either side is known to be scalar (max_rows() == Some(1)).

  • Updates inferred predicate handling to treat null-restriction evaluation failures conservatively via unwrap_or(false).

  • Refactors null-restriction evaluation in optimizer::utils by:

    • introducing NullRestrictionEvalMode and a test-only guard for controlling evaluation mode
    • adding a fast syntactic null-substitution path for common predicate shapes
    • short-circuiting predicates that reference columns outside the provided join-column set
    • retaining the existing authoritative evaluation path as a fallback
    • adding debug assertions to verify the syntactic fast path matches the authoritative evaluator in debug builds
  • Expands unit test coverage for is_restrict_null_predicate and for agreement between the syntactic fast path and the authoritative evaluator.
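
The inner-join guard described in the list above can be pictured with a small sketch. The `JoinType` enum, the `JoinInput` struct, and `can_convert_filter_to_join_condition` are hypothetical stand-ins written for illustration, not DataFusion's actual types or functions; only the `max_rows() == Some(1)` test and the empty-ON condition come from the description.

```rust
/// Hypothetical, simplified stand-ins for the optimizer's plan types.
#[derive(Clone, Copy, PartialEq, Debug)]
enum JoinType {
    Inner,
    Left,
}

struct JoinInput {
    /// Upper bound on the number of rows this input can produce, if known.
    max_rows: Option<usize>,
}

impl JoinInput {
    fn max_rows(&self) -> Option<usize> {
        self.max_rows
    }

    /// An input is treated as scalar when its known row bound is exactly one.
    fn is_scalar(&self) -> bool {
        self.max_rows() == Some(1)
    }
}

/// Returns true when a post-join filter may be folded into the join condition.
/// Assumption: only inner joins are candidates, and the filter stays above the
/// join when the ON clause is empty and either side is scalar.
fn can_convert_filter_to_join_condition(
    join_type: JoinType,
    on_is_empty: bool,
    left: &JoinInput,
    right: &JoinInput,
) -> bool {
    if join_type != JoinType::Inner {
        return false;
    }
    !(on_is_empty && (left.is_scalar() || right.is_scalar()))
}
```

Under these assumptions, a cross-join-shaped inner join between a table with an unknown row bound and an input with `max_rows == Some(1)` returns `false`, so the filter is kept as a post-join filter.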

Are these changes tested?

Yes.

The PR adds both logical optimizer tests and SQL-level regression coverage:

  • new SQL regression tests covering the query patterns implicated by this area of the optimizer
  • a focused optimizer unit test ensuring filters are not incorrectly folded into join conditions for scalar-side cross joins
  • expanded unit tests for is_restrict_null_predicate, including additional CASE, BETWEEN, IN, and null-related predicate forms
  • a validation test that checks the syntactic fast path against the authoritative evaluator for supported expressions
  • plan-shape assertions comparing optimized and physical plans with push_down_filter enabled and disabled

These tests check correctness and guard against the regressions addressed by this patch.

Are there any user-facing changes?

There are no intended user-facing API changes.

This PR reduces planning-time work inside PushDownFilter and fixes incorrect or fragile filter-pushdown behavior for certain query shapes, such as cross joins with a scalar or aggregated side. Users may see more stable planning and correct results for affected queries, but no documentation or API changes should be required.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 6 commits March 16, 2026 21:38
Introduce a test case to assert non-restricting behavior
when evaluating the predicate a > b, focusing on join
keys that only include a. This directly tests the new
early-return branch in the is_restrict_null_predicate
function in utils.rs, enhancing overall code coverage.
Extract the column-membership check into a new helper function
called `predicate_uses_only_columns` in utils.rs. Update the
current implementation at utils.rs:91 to use this new helper,
improving code readability and maintainability.
Add call-site contract comment in push_down_filter.rs to
specify that only Ok(true) is treated as null-restricting.
State that both Ok(false) and Err(_) are considered
non-restricting and will be skipped during processing.
Inline iterator predicate in utils.rs and streamline the
null-restrict handling in push_down_filter.rs. This
reduces indirections and lines of code while maintaining
the same logic and behavior. No public interface or
behavior changes intended.
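
The call-site contract mentioned in the commit notes above (only `Ok(true)` counts as null-restricting) can be sketched as follows. The function names and the `String` error type are assumptions made for illustration; the real check and its error type live in the optimizer and are not shown in this conversation.

```rust
/// Hypothetical wrapper around the result of a fallible null-restriction check.
fn is_null_restricting(result: Result<bool, String>) -> bool {
    // Only Ok(true) counts as null-restricting. Ok(false) and Err(_) are
    // both mapped to false, so the candidate predicate is skipped.
    result.unwrap_or(false)
}

/// Keeps only the candidate predicates whose check came back as Ok(true).
fn keep_null_restricting<'a>(
    candidates: &[(&'a str, Result<bool, String>)],
) -> Vec<&'a str> {
    candidates
        .iter()
        .filter(|(_, outcome)| is_null_restricting(outcome.clone()))
        .map(|(name, _)| *name)
        .collect()
}
```

Treating an evaluation error the same as `Ok(false)` is the conservative choice: a predicate that cannot be classified is never used to infer a new filter.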
@kosiew
Contributor Author

kosiew commented Mar 16, 2026

run benchmark sql_planner_extended

@adriangbot

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4067930471-314-dhsls 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=sql_planner_extended
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner_extended
BENCH_FILTER=
Results will be posted here when complete

@github-actions github-actions bot added the optimizer (Optimizer rules) label Mar 16, 2026
@kosiew
Contributor Author

kosiew commented Mar 18, 2026

show benchmark queue

@adriangbot

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

No pending jobs.

@kosiew
Contributor Author

kosiew commented Mar 18, 2026

run benchmark sql_planner_extended

@kosiew
Contributor Author

kosiew commented Mar 18, 2026

show benchmark queue

@adriangbot

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

Comment Repo PR User Benchmarks Status
#4080428628 apache/datafusion #20961 kosiew ["sql_planner_extended"] running

@adriangbot

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4080428628-401-tbflf 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=sql_planner_extended
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner_extended
BENCH_FILTER=
Results will be posted here when complete

@kosiew
Contributor Author

kosiew commented Mar 18, 2026

show benchmark queue

@adriangbot

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

Comment Repo PR User Benchmarks Status
#4080428628 apache/datafusion #20961 kosiew ["sql_planner_extended"] running

@kosiew
Contributor Author

kosiew commented Mar 18, 2026

show benchmark queue

@adriangbot

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

Comment Repo PR User Benchmarks Status
#4082114714 apache/datafusion #21026 Dandandan ["clickbench_partitioned"] running
#4082114714 apache/datafusion #21026 Dandandan ["tpcds"] running
#4082114714 apache/datafusion #21026 Dandandan ["tpch"] running

Implement fast path for syntactic null-restriction in
utils.rs to classify predicates without evaluating
physical expressions. Enhance SQL boolean handling
with a large supporting evaluator, including CASE
management. Retain existing branch helper styles
and expand test coverage for constant simple CASE
and outside-join-key fast paths. Fix correctness
edge case for simple CASE indeterminate comparisons,
ensuring proper tracking and fallback to Unknown.
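
For readers unfamiliar with the SQL boolean handling this commit refers to, here is a minimal sketch of three-valued logic, assuming the usual SQL (Kleene) semantics. The `SqlBool` alias and the function names are hypothetical and are not the evaluator added in the commit.

```rust
/// SQL three-valued boolean: `None` stands for NULL / Unknown.
type SqlBool = Option<bool>;

/// AND is false if either side is false, true if both are true, else Unknown.
fn sql_and(a: SqlBool, b: SqlBool) -> SqlBool {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// OR is true if either side is true, false if both are false, else Unknown.
fn sql_or(a: SqlBool, b: SqlBool) -> SqlBool {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// NOT flips a known value and leaves Unknown as Unknown.
fn sql_not(a: SqlBool) -> SqlBool {
    a.map(|value| !value)
}

/// A filter keeps a row only when the predicate is TRUE, so both FALSE and
/// Unknown reject the row.
fn filter_rejects(value: SqlBool) -> bool {
    value != Some(true)
}
```

This is why a predicate that evaluates to NULL once the join columns are NULL can be treated like one that evaluates to FALSE: in both cases the filter drops the row.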
@kosiew
Contributor Author

kosiew commented Mar 18, 2026

I think the benchmark either never completes or gets killed because it is too heavy.
Amending the benchmark in #21029.

@kosiew
Contributor Author

kosiew commented Mar 19, 2026

run benchmark sql_planner_extended --sample-size 10

@kosiew
Contributor Author

kosiew commented Mar 19, 2026

show benchmark queue

@adriangbot

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

Comment Repo PR User Benchmarks Status
#4087159258 apache/datafusion #20961 kosiew ["sql_planner_extended"] running
#4087159258 apache/datafusion #20961 kosiew ["--sample-size"] running
#4087159258 apache/datafusion #20961 kosiew ["10"] running

@adriangbot

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4087159258-443-hxscx 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=sql_planner_extended
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner_extended
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4087159258-444-4d9n5 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=--sample-size
BENCH_COMMAND=cargo bench --features=parquet --bench --sample-size
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

Benchmark for this request failed.

Last 20 lines of output:

HEAD is now at ab28234 Support `columns_sorted` in row_filters (#20497)
rustc 1.94.0 (4a4ef493e 2026-03-02)
3d3945ce07b7015c11b0a4f89f3b456d785b7bdf
ab2823475d0c79a749120ae354572ab85c043b78
error: unexpected argument '--sample-size' found

  tip: a similar argument exists: '--examples'
  tip: to pass '--sample-size' as a value, use '-- --sample-size'

Usage: cargo bench --features <FEATURES> --bench [<NAME>] --examples [BENCHNAME] [-- [ARGS]...]

For more information, try '--help'.
error: unexpected argument '--sample-size' found

  tip: a similar argument exists: '--examples'
  tip: to pass '--sample-size' as a value, use '-- --sample-size'

Usage: cargo bench --features <FEATURES> --bench [<NAME>] --examples [BENCHNAME] [-- [ARGS]...]

For more information, try '--help'.

@adriangbot

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4087159258-445-b4z9j 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=10
BENCH_COMMAND=cargo bench --features=parquet --bench 10
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

Benchmark for this request failed.

Last 20 lines of output:

Cloning into '/workspace/datafusion-branch'...
push-down-02-20002
From https://github.com/apache/datafusion
 * [new ref]         refs/pull/20961/head -> push-down-02-20002
 * branch            main                 -> FETCH_HEAD
Switched to branch 'push-down-02-20002'
ab2823475d0c79a749120ae354572ab85c043b78
Cloning into '/workspace/datafusion-base'...
HEAD is now at ab28234 Support `columns_sorted` in row_filters (#20497)
rustc 1.94.0 (4a4ef493e 2026-03-02)
3d3945ce07b7015c11b0a4f89f3b456d785b7bdf
ab2823475d0c79a749120ae354572ab85c043b78
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
error: no bench target named `10` in default-run packages

help: a target with a similar name exists: `chr`

@github-actions github-actions bot added the core (Core DataFusion crate) label Mar 19, 2026
kosiew added 7 commits March 27, 2026 19:26
Update the function name to specify its relevance to scalar
subquery cross-joins. Add an intent comment for better
understanding of its purpose. Replace the old function call in
join predicate handling for improved readability.
Clarify supported expression/operator families in the syntactic
evaluator. Emphasize that returning None indicates deferral to
authoritative evaluation, rather than "non-restricting." Ensure
unsupported variants also return None for consistency.
Broaden derived-relation detection to include projection
wrappers over derived relations. Add regression tests to cover
alias/projection shape changes and ensure mixed-side filters
are preserved. Implement a panic-path robustness test to
confirm that eval mode resets properly, even on closure
panic using catch_unwind.
Combine plan-wrapper traversal and cross-join shape detection.
Shorten join-column replacement scan and share authoritative
null-result decoding. Remove unused helpers and reorganize
strict-null operator list behind a classifier helper.
Public interfaces remain unchanged.
authoritative_restrict_null_predicate

Restore evaluates_to_null behavior for general expressions
without boolean-downcasting arrays. Fix scalar_subquery_with_non_strong_project
regression. Update authoritative_restrict_null_predicate to handle
predicate results directly, treating scalar NULL as null-restricting,
resolving the CASE ... ELSE NULL test failure. Maintain non-join-column
fast path functionality.
Consolidate repeated join-input wrapper inspections into a single
JoinInputShape classifier. Hoist scalar-subquery cross-join shape
check out of the predicate loop and unify repeated left/right
predicate bucketing. Remove temporary Vec in join-column
replacement inference and narrow test-only null-restriction mode
support into its own helper module. Share column-subset check
path and extract helper for authoritative null-evaluation
results. Reduce repetition in syntactic null-restriction
evaluator by factoring strict-null-preserving unary cases.
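
One of the commits above mentions a test-only evaluation mode that must reset even when a closure panics. A common way to get that property in Rust is a drop guard over a thread-local. The sketch below assumes that pattern; `EvalMode`, `ModeGuard`, and `with_mode` are hypothetical names, not the PR's `NullRestrictionEvalMode` implementation.

```rust
use std::cell::Cell;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical evaluation modes, for illustration only.
#[derive(Clone, Copy, PartialEq, Debug)]
enum EvalMode {
    Auto,
    AuthoritativeOnly,
}

thread_local! {
    static EVAL_MODE: Cell<EvalMode> = Cell::new(EvalMode::Auto);
}

fn current_mode() -> EvalMode {
    EVAL_MODE.with(|mode| mode.get())
}

/// Remembers the previous mode and restores it when dropped.
struct ModeGuard {
    previous: EvalMode,
}

impl ModeGuard {
    fn set(mode: EvalMode) -> Self {
        let previous = EVAL_MODE.with(|cell| cell.replace(mode));
        ModeGuard { previous }
    }
}

impl Drop for ModeGuard {
    fn drop(&mut self) {
        // Drop also runs while a panic unwinds, so the mode is always restored.
        EVAL_MODE.with(|cell| cell.set(self.previous));
    }
}

/// Runs `f` with the given mode active, restoring the old mode afterwards.
fn with_mode<T>(mode: EvalMode, f: impl FnOnce() -> T) -> T {
    let _guard = ModeGuard::set(mode);
    f()
}

/// Returns true when the mode is back to its default after a closure panics.
fn mode_is_restored_after_panic() -> bool {
    let outcome = catch_unwind(AssertUnwindSafe(|| {
        with_mode(EvalMode::AuthoritativeOnly, || -> () {
            panic!("simulated failure")
        })
    }));
    outcome.is_err() && current_mode() == EvalMode::Auto
}
```

Because the restore happens in `Drop`, no explicit cleanup is needed on the panic path. This relies on the default unwinding panic strategy rather than `panic = "abort"`.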
@kosiew kosiew force-pushed the push-down-02-20002 branch from 6d3f368 to 106e963 Compare March 27, 2026 11:42
kosiew added 6 commits March 27, 2026 21:46
Eliminate JoinInputShape/PredicateDestination layers and integrate
scalar-subquery handling into the main join-pushdown flow. Revert
join-inference replacement logic to an explicit loop. Remove
unnecessary helper functions and simplify the null-restriction
evaluator for cleaner code.
Flatten single-use helper layers and tighten small utility helpers.
In push_down_filter, fold the join-input classifier and bind
predicate.column_refs() once for join inference. Shorten test-only
import surface in utils.rs and compact mapping helpers in
null_restriction.rs without changing semantics.
Tighten scalar-subquery guard, remove redundant empty-loop
scaffolding, and share more of the null-evaluation flow.
Simplify small null-restriction helpers and match arms.
Extract shared helpers in push_down_filter_regressions.rs
and push_down_filter.rs to reduce code duplication.
Consolidate optimizer-delta test assertions and create
specific plan builders for common expressions.

Add a utility in utils.rs to evaluate predicates under
different null-restriction modes, streamlining
mode-comparison tests and enhancing maintainability.
Restore is_restrict_null_predicate to its original behavior by
removing the broader syntactic null-restriction path and its
associated test scaffolding. Maintain early rejection for the
push_down_filter caller while eliminating extra tree-walk work
on common paths. Confirmed changes by running tests and
formatting.
@kosiew kosiew force-pushed the push-down-02-20002 branch from dfdd011 to d0f0976 Compare March 28, 2026 11:54
kosiew added 2 commits March 28, 2026 22:03
Reinstate the deleted fast path for is_restrict_null_predicate
in null_restriction.rs. Implement a two-stage evaluation
process: return early false for mixed-reference predicates,
perform syntactic evaluation for supported join-key-only
predicates, and ensure authoritative fallback is applied
only when necessary.
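
The staged evaluation described in this commit can be sketched on a toy expression type. Everything below is a simplified assumption for illustration: the `Expr` and `Val` enums, `null_substitute`, and the closure-based `authoritative` fallback are hypothetical and far smaller than the optimizer's real expression tree and evaluator.

```rust
use std::collections::HashSet;

/// Toy expression tree; `Opaque` stands for any shape the fast path cannot read.
#[derive(Debug, Clone)]
enum Expr {
    Column(&'static str),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
    IsNull(Box<Expr>),
    Opaque(Vec<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Val {
    Null,
    Int(i64),
    Bool(bool),
}

/// True when every column referenced by `expr` is in `allowed`.
fn uses_only(expr: &Expr, allowed: &HashSet<&str>) -> bool {
    match expr {
        Expr::Column(name) => allowed.contains(name),
        Expr::Literal(_) => true,
        Expr::Gt(l, r) => uses_only(l, allowed) && uses_only(r, allowed),
        Expr::IsNull(inner) => uses_only(inner, allowed),
        Expr::Opaque(args) => args.iter().all(|a| uses_only(a, allowed)),
    }
}

/// Syntactic pass: the value of `expr` with every column replaced by NULL.
/// `None` means "unsupported shape, defer", not "non-restricting".
fn null_substitute(expr: &Expr) -> Option<Val> {
    match expr {
        Expr::Column(_) => Some(Val::Null),
        Expr::Literal(v) => Some(Val::Int(*v)),
        Expr::Gt(l, r) => match (null_substitute(l)?, null_substitute(r)?) {
            (Val::Null, _) | (_, Val::Null) => Some(Val::Null),
            (Val::Int(a), Val::Int(b)) => Some(Val::Bool(a > b)),
            _ => None,
        },
        Expr::IsNull(inner) => Some(Val::Bool(null_substitute(inner)? == Val::Null)),
        Expr::Opaque(_) => None,
    }
}

fn is_restrict_null_predicate(
    predicate: &Expr,
    join_cols: &HashSet<&str>,
    authoritative: impl Fn(&Expr) -> bool,
) -> bool {
    // Stage 1: predicates that mention non-join columns are rejected early.
    if !uses_only(predicate, join_cols) {
        return false;
    }
    // Stage 2: try the syntactic pass, falling back only when it defers.
    match null_substitute(predicate) {
        Some(Val::Null) | Some(Val::Bool(false)) => true,
        Some(_) => false,
        None => authoritative(predicate),
    }
}
```

With join keys `{a}`, the predicate `a > b` is rejected in stage 1 without calling the fallback, `a > 1` is classified as null-restricting by the syntactic pass, and only the `Opaque` shape reaches the `authoritative` closure.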
@kosiew kosiew force-pushed the push-down-02-20002 branch from d0f0976 to 847a8fa Compare March 30, 2026 02:36
@github-actions github-actions bot added the physical-expr (Changes to the physical-expr crates) label Mar 30, 2026
@kosiew kosiew force-pushed the push-down-02-20002 branch from d79cc72 to 6ef1c84 Compare March 30, 2026 08:14
@github-actions github-actions bot removed the physical-expr (Changes to the physical-expr crates) label Mar 30, 2026
kosiew added 3 commits March 30, 2026 19:53
Update binary_boolean_value to return None for mixed
boolean states instead of asserting. This change allows
is_restrict_null_predicate to use the authoritative
evaluator, preventing panics from unsupported cases.

Add regression tests to ensure proper behavior in
unsupported boolean-wrapper scenarios, maintaining
consistent functionality between auto mode and
authoritative mode.
@kosiew kosiew force-pushed the push-down-02-20002 branch from 86609f2 to a355cb0 Compare March 30, 2026 11:53
kosiew added 3 commits March 31, 2026 11:15
- Refactored `classify_join_input` to `strip_plan_wrappers` for clearer handling of logical plan wrappers.
- Introduced helper functions `is_scalar_aggregate_subquery` and `is_derived_relation` to check for specific logical plan structures.
- Enhanced `is_scalar_subquery_cross_join` to streamline the evaluation logic of joins with scalar subqueries.
- Added `should_keep_filter_above_scalar_subquery_cross_join` to manage filter preservation based on join conditions.
- Adjusted the predicate handling in `push_down_all_join` to utilize the new helper functions, improving filter application logic.
- Updated tests for window operations over scalar subquery cross joins to maintain correct behavior and refactored test setup for clarity.
@kosiew kosiew force-pushed the push-down-02-20002 branch from a355cb0 to 1f2b598 Compare March 31, 2026 03:51
Labels

core (Core DataFusion crate), optimizer (Optimizer rules)

2 participants