fix(sql): fix a bug when planning semi- or antijoins by aalexandrov · Pull Request #20990 · apache/datafusion

aalexandrov · 2026-03-17T13:02:10Z

Closes exclude_using_columns might wrongly retain columns from a projected join input #20989.

Rationale for this change

The planner should be consistent with the expected SQL behavior—swapping the names of tables that have identical structure in a SQL query should not affect the schema for that query.

What changes are included in this PR?

A fix in the exclude_using_columns helper method in datafusion/expr/src/utils.rs that ensures that we don't retain columns from the projected side when deciding which USING columns to exclude and which to retain on top of semi- or antijoins.
Regression tests for the change in test_using_join_wildcard_schema_semi_anti.

Are these changes tested?

Added a regression test.

Are there any user-facing changes?

Yes, the change is user facing, but I doubt that this behavior is expected and is documented anywhere.
If existing docs need to be updated, please point me to the concrete places and I can take a look.

neilconway

Thanks for taking this! Overall looks good to me.

neilconway · 2026-03-26T19:45:33Z

datafusion/sql/tests/sql_integration.rs

+    let sql = "WITH 
+        s AS (SELECT 1 AS x1, 2 AS x2, 3 AS x3),
+        t AS (SELECT 1 AS x1, 4 AS x2, 5 AS x3)
+        SELECT * FROM s LEFT SEMI JOIN t USING (x1)";


This test (and the other LEFT ANTI JOIN test) should succeed without this PR, right? Because the correct side of the join happens to be picked. Can we change this (or add a new test variant) that checks that the behavior has changed for LEFT joins?

datafusion/sql/tests/sql_integration.rs

neilconway · 2026-03-26T20:04:35Z

datafusion/expr/src/utils.rs

It took me a while to wrap my head around the logic here; what do you think of this instead?

Suggested change

/// When a JOIN has a USING clause, the join columns appear in the output

/// schema once per side (for inner/outer joins) or once total (for semi/anti

/// joins). An unqualified wildcard should include each USING column only once.

/// This function returns the duplicate columns that should be excluded.

fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> {

let output_columns: HashSet<_> = plan.schema().columns().iter().cloned().collect();

let mut excluded = HashSet::new();

for cols in plan.using_columns()? {

// `using_columns()` returns join columns from both sides regardless

// of join type. For semi/anti joins, only one side's columns appear

// in the output schema. Filter to output columns so that columns

// from the non-output side don't participate in deduplication and

// displace real output columns.

let mut cols: Vec<_> = cols

.into_iter()

.filter(|c| output_columns.contains(c))

.collect();

// Sort so we keep the same qualified column, regardless of HashSet

// iteration order.

cols.sort();

let mut seen = HashSet::new();

for col in cols {

if !seen.insert(col.name.clone()) {

excluded.insert(col);

}

}

}

Ok(excluded)

}

I don't have a strong preference here—in the initial PR I was trying to do the minimal amount of changes to the existing method definition to fix the error.

I did some cosmetic changes to your suggestion and added it to the PR. I went with this for-loop body

if seen_names.contains(col.name.as_str()) { excluded.insert(col); // exclude columns with already seen name } else { seen_names.insert(col.name.clone()); // mark column name as seen }

because seen_names.contains(col.name.clone()) will create unnecessary String clones for ~50% of the cols entries.

Currently, the `exclude_using_columns` called from `expand_wildcard` doesn't consider the filtering semantics of semi- and antijoins when expanding wildcards on top of joins defined via `USING(<columns>)` syntax. From each set of columns equated by a `USING(<column>)` expression, the code currently (1) sorts the set entries, and (2) retains only the first entry from each set. Because of that, the columns surviving the `exclude_using_columns` call might be wrongly chosen from the filtering side if the table qualifier from that side is lexicographically before the filtered side qualifier. For example, given this schema of two identical tables: ```sql create table s(x1 int, x2 int, x3 int); create table t(x1 int, x2 int, x3 int); ``` One would expect that the schema of queries where the `s` and `t` names are swapped will be identical. However, currently this is not the case: ```sql -- Q1 schema: x1 int, x2 int, x3 int (because s < t) select * from s left semi join t using (x1); -- Q2 schema: x2 int, x3 int (because t < s) select * from t left semi join s using (x1); ``` This commit fixes the issue and adds some regression tests.

github-actions bot added sql SQL Planner logical-expr Logical plan and expressions labels Mar 17, 2026

neilconway reviewed Mar 26, 2026

View reviewed changes

aalexandrov force-pushed the fix_semijoin_using_wildcard_planner_bug branch from 4682e74 to 00f4ea1 Compare March 31, 2026 11:34

aalexandrov requested a review from neilconway March 31, 2026 11:34

aalexandrov force-pushed the fix_semijoin_using_wildcard_planner_bug branch from 00f4ea1 to e2b1b12 Compare March 31, 2026 11:42

aalexandrov force-pushed the fix_semijoin_using_wildcard_planner_bug branch from e2b1b12 to 73c992f Compare March 31, 2026 11:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sql): fix a bug when planning semi- or antijoins#20990

fix(sql): fix a bug when planning semi- or antijoins#20990
aalexandrov wants to merge 1 commit intoapache:mainfrom
aalexandrov:fix_semijoin_using_wildcard_planner_bug

aalexandrov commented Mar 17, 2026

Uh oh!

neilconway left a comment

Uh oh!

neilconway Mar 26, 2026

Uh oh!

aalexandrov Mar 31, 2026

Uh oh!

Uh oh!

neilconway Mar 26, 2026

Uh oh!

aalexandrov Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

+  /// When a JOIN has a USING clause, the join columns appear in the output
+  /// schema once per side (for inner/outer joins) or once total (for semi/anti
+  /// joins). An unqualified wildcard should include each USING column only once.
+  /// This function returns the duplicate columns that should be excluded.
+  fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> {
+      let output_columns: HashSet<_> = plan.schema().columns().iter().cloned().collect();
+      let mut excluded = HashSet::new();
+      for cols in plan.using_columns()? {
+          // `using_columns()` returns join columns from both sides regardless
+          // of join type. For semi/anti joins, only one side's columns appear
+          // in the output schema. Filter to output columns so that columns
+          // from the non-output side don't participate in deduplication and
+          // displace real output columns.
+          let mut cols: Vec<_> = cols
+              .into_iter()
+              .filter(|c| output_columns.contains(c))
+              .collect();
+          // Sort so we keep the same qualified column, regardless of HashSet
+          // iteration order.
+          cols.sort();
+          let mut seen = HashSet::new();
+          for col in cols {
+              if !seen.insert(col.name.clone()) {
+                  excluded.insert(col);
+              }
+          }
+      }
+      Ok(excluded)
+  }

Conversation

aalexandrov commented Mar 17, 2026

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

neilconway left a comment

Choose a reason for hiding this comment

Uh oh!

neilconway Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

aalexandrov Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

neilconway Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

aalexandrov Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants