fix(sql): fix a bug when planning semi- or antijoins#20990
fix(sql): fix a bug when planning semi- or antijoins#20990aalexandrov wants to merge 1 commit intoapache:mainfrom
Conversation
neilconway
left a comment
There was a problem hiding this comment.
Thanks for taking this! Overall looks good to me.
| let sql = "WITH | ||
| s AS (SELECT 1 AS x1, 2 AS x2, 3 AS x3), | ||
| t AS (SELECT 1 AS x1, 4 AS x2, 5 AS x3) | ||
| SELECT * FROM s LEFT SEMI JOIN t USING (x1)"; |
There was a problem hiding this comment.
This test (and the other LEFT ANTI JOIN test) should succeed without this PR, right? Because the correct side of the join happens to be picked. Can we change this (or add a new test variant) that checks that the behavior has changed for LEFT joins?
datafusion/expr/src/utils.rs
Outdated
There was a problem hiding this comment.
It took me a while to wrap my head around the logic here; what do you think of this instead?
| /// When a JOIN has a USING clause, the join columns appear in the output | |
| /// schema once per side (for inner/outer joins) or once total (for semi/anti | |
| /// joins). An unqualified wildcard should include each USING column only once. | |
| /// This function returns the duplicate columns that should be excluded. | |
| fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> { | |
| let output_columns: HashSet<_> = plan.schema().columns().iter().cloned().collect(); | |
| let mut excluded = HashSet::new(); | |
| for cols in plan.using_columns()? { | |
| // `using_columns()` returns join columns from both sides regardless | |
| // of join type. For semi/anti joins, only one side's columns appear | |
| // in the output schema. Filter to output columns so that columns | |
| // from the non-output side don't participate in deduplication and | |
| // displace real output columns. | |
| let mut cols: Vec<_> = cols | |
| .into_iter() | |
| .filter(|c| output_columns.contains(c)) | |
| .collect(); | |
| // Sort so we keep the same qualified column, regardless of HashSet | |
| // iteration order. | |
| cols.sort(); | |
| let mut seen = HashSet::new(); | |
| for col in cols { | |
| if !seen.insert(col.name.clone()) { | |
| excluded.insert(col); | |
| } | |
| } | |
| } | |
| Ok(excluded) | |
| } |
There was a problem hiding this comment.
I don't have a strong preference here—in the initial PR I was trying to do the minimal amount of changes to the existing method definition to fix the error.
I did some cosmetic changes to your suggestion and added it to the PR. I went with this for-loop body
if seen_names.contains(col.name.as_str()) {
excluded.insert(col); // exclude columns with already seen name
} else {
seen_names.insert(col.name.clone()); // mark column name as seen
}because seen_names.contains(col.name.clone()) will create unnecessary String clones for ~50% of the cols entries.
4682e74 to
00f4ea1
Compare
00f4ea1 to
e2b1b12
Compare
Currently, the `exclude_using_columns` called from `expand_wildcard` doesn't consider the filtering semantics of semi- and antijoins when expanding wildcards on top of joins defined via `USING(<columns>)` syntax. From each set of columns equated by a `USING(<column>)` expression, the code currently (1) sorts the set entries, and (2) retains only the first entry from each set. Because of that, the columns surviving the `exclude_using_columns` call might be wrongly chosen from the filtering side if the table qualifier from that side is lexicographically before the filtered side qualifier. For example, given this schema of two identical tables: ```sql create table s(x1 int, x2 int, x3 int); create table t(x1 int, x2 int, x3 int); ``` One would expect that the schema of queries where the `s` and `t` names are swapped will be identical. However, currently this is not the case: ```sql -- Q1 schema: x1 int, x2 int, x3 int (because s < t) select * from s left semi join t using (x1); -- Q2 schema: x2 int, x3 int (because t < s) select * from t left semi join s using (x1); ``` This commit fixes the issue and adds some regression tests.
e2b1b12 to
73c992f
Compare
exclude_using_columnsmight wrongly retain columns from a projected join input #20989.Rationale for this change
The planner should be consistent with the expected SQL behavior—swapping the names of tables that have identical structure in a SQL query should not affect the schema for that query.
What changes are included in this PR?
exclude_using_columnshelper method indatafusion/expr/src/utils.rsthat ensures that we don't retain columns from the projected side when deciding which USING columns to exclude and which to retain on top of semi- or antijoins.test_using_join_wildcard_schema_semi_anti.Are these changes tested?
Are there any user-facing changes?
Yes, the change is user facing, but I doubt that this behavior is expected and is documented anywhere.
If existing docs need to be updated, please point me to the concrete places and I can take a look.