Skip to content

fix(sql): fix a bug when planning semi- or antijoins#20990

Open
aalexandrov wants to merge 1 commit intoapache:mainfrom
aalexandrov:fix_semijoin_using_wildcard_planner_bug
Open

fix(sql): fix a bug when planning semi- or antijoins#20990
aalexandrov wants to merge 1 commit intoapache:mainfrom
aalexandrov:fix_semijoin_using_wildcard_planner_bug

Conversation

@aalexandrov
Copy link
Copy Markdown
Contributor

Rationale for this change

The planner should be consistent with the expected SQL behavior—swapping the names of tables that have identical structure in a SQL query should not affect the schema for that query.

What changes are included in this PR?

  • A fix in the exclude_using_columns helper method in datafusion/expr/src/utils.rs that ensures that we don't retain columns from the projected side when deciding which USING columns to exclude and which to retain on top of semi- or antijoins.
  • Regression tests for the change in test_using_join_wildcard_schema_semi_anti.

Are these changes tested?

  • Added a regression test.

Are there any user-facing changes?

Yes, the change is user facing, but I doubt that this behavior is expected and is documented anywhere.
If existing docs need to be updated, please point me to the concrete places and I can take a look.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions labels Mar 17, 2026
Copy link
Copy Markdown
Contributor

@neilconway neilconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this! Overall looks good to me.

Comment on lines +4855 to +4858
let sql = "WITH
s AS (SELECT 1 AS x1, 2 AS x2, 3 AS x3),
t AS (SELECT 1 AS x1, 4 AS x2, 5 AS x3)
SELECT * FROM s LEFT SEMI JOIN t USING (x1)";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test (and the other LEFT ANTI JOIN test) should succeed without this PR, right? Because the correct side of the join happens to be picked. Can we change this (or add a new test variant) that checks that the behavior has changed for LEFT joins?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 385 to 415
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to wrap my head around the logic here; what do you think of this instead?

Suggested change
/// When a JOIN has a USING clause, the join columns appear in the output
/// schema once per side (for inner/outer joins) or once total (for semi/anti
/// joins). An unqualified wildcard should include each USING column only once.
/// This function returns the duplicate columns that should be excluded.
fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> {
let output_columns: HashSet<_> = plan.schema().columns().iter().cloned().collect();
let mut excluded = HashSet::new();
for cols in plan.using_columns()? {
// `using_columns()` returns join columns from both sides regardless
// of join type. For semi/anti joins, only one side's columns appear
// in the output schema. Filter to output columns so that columns
// from the non-output side don't participate in deduplication and
// displace real output columns.
let mut cols: Vec<_> = cols
.into_iter()
.filter(|c| output_columns.contains(c))
.collect();
// Sort so we keep the same qualified column, regardless of HashSet
// iteration order.
cols.sort();
let mut seen = HashSet::new();
for col in cols {
if !seen.insert(col.name.clone()) {
excluded.insert(col);
}
}
}
Ok(excluded)
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference here—in the initial PR I was trying to do the minimal amount of changes to the existing method definition to fix the error.

I did some cosmetic changes to your suggestion and added it to the PR. I went with this for-loop body

if seen_names.contains(col.name.as_str()) {
    excluded.insert(col); // exclude columns with already seen name
} else {
    seen_names.insert(col.name.clone()); // mark column name as seen
}

because seen_names.contains(col.name.clone()) will create unnecessary String clones for ~50% of the cols entries.

@aalexandrov aalexandrov force-pushed the fix_semijoin_using_wildcard_planner_bug branch from 4682e74 to 00f4ea1 Compare March 31, 2026 11:34
@aalexandrov aalexandrov requested a review from neilconway March 31, 2026 11:34
@aalexandrov aalexandrov force-pushed the fix_semijoin_using_wildcard_planner_bug branch from 00f4ea1 to e2b1b12 Compare March 31, 2026 11:42
Currently, the `exclude_using_columns` called from `expand_wildcard`
doesn't consider the filtering semantics of semi- and antijoins when
expanding wildcards on top of joins defined via `USING(<columns>)`
syntax.

From each set of columns equated by a `USING(<column>)` expression, the
code currently (1) sorts the set entries, and (2) retains only the first
entry from each set.

Because of that, the columns surviving the `exclude_using_columns` call
might be wrongly chosen from the filtering side if the table qualifier
from that side is lexicographically before the filtered side qualifier.

For example, given this schema of two identical tables:

```sql
create table s(x1 int, x2 int, x3 int);
create table t(x1 int, x2 int, x3 int);
```

One would expect that the schema of queries where the `s` and `t` names
are swapped will be identical. However, currently this is not  the case:

```sql
-- Q1 schema: x1 int, x2 int, x3 int (because s < t)
select * from s left semi join t using (x1);

-- Q2 schema: x2 int, x3 int (because t < s)
select * from t left semi join s using (x1);
```

This commit fixes the issue and adds some regression tests.
@aalexandrov aalexandrov force-pushed the fix_semijoin_using_wildcard_planner_bug branch from e2b1b12 to 73c992f Compare March 31, 2026 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions sql SQL Planner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

exclude_using_columns might wrongly retain columns from a projected join input

2 participants