Optimize Clickbench Query 29 by adding a new Optimizer rule#20180
Optimize Clickbench Query 29 by adding a new Optimizer rule#20180devanshu0987 wants to merge 13 commits intoapache:mainfrom
Conversation
|
run benchmark sql_planner |
|
🤖 |
|
run benchmarks |
|
Thanks @devanshu0987 -- I kicked off some benchmarks |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
I think we can implement datafusion/datafusion/expr/src/udaf.rs Lines 682 to 684 in 35ff4ab |
Hi, is this a blocking PR review comment? I am thinking that we still have the same logical flow
|
neilconway
left a comment
There was a problem hiding this comment.
Can we add a test for queries with a HAVING clause?
I wonder if it would make sense to support this optimization for other aggregates? e.g., AVG(a+k) => AVG(a) + k.
| FROM test_table; | ||
| ---- | ||
| logical_plan | ||
| 01)Projection: sum(test_table.a) AS sum_a, sum(test_table.a + Int64(1)) AS sum_a_plus_1, sum(test_table.b) AS sum_b, sum(test_table.b + Int64(2)) AS sum_b_plus_2, sum(test_table.c + Int64(3)) AS sum_c_plus_3 |
There was a problem hiding this comment.
It looks to me like the rewrite is not actually being applied here? (Contra the comment above: "Test 7: Mixed sum types (rewrites a and b, not c)").
| | ScalarValue::UInt16(_) | ||
| | ScalarValue::UInt32(_) | ||
| | ScalarValue::UInt64(_) | ||
| | ScalarValue::Float32(_) |
| if let Expr::BinaryExpr(BinaryExpr { left, op, right }) = arg | ||
| && matches!(op, Operator::Plus | Operator::Minus) | ||
| { | ||
| // Check if right side is a literal constant |
There was a problem hiding this comment.
Remove duplicate comment
|
|
||
| /// Check if a scalar value is a numeric constant | ||
| /// (guards against non-arithmetic types like strings, booleans, dates, etc.) | ||
| fn is_numeric_constant(value: &ScalarValue) -> bool { |
There was a problem hiding this comment.
How should NULL values be handled by this function? I suppose we want to return false for them?
|
My real concern with this approach is that this optimization is so clickbench specific. I realize that other systems are doing it too (aka the aforementioned ClickHouse and DuckDB PR) but I struggle to find any actual real world usecase On the other hand, the performance results are pretty compelling so let's see if we can make it work |
alamb
left a comment
There was a problem hiding this comment.
Thanks for this @devanshu0987
I think we should proceed with the high level idea of rewriting the expressions
Howevr, I really like @neilconway and @UBarney 's suggestions to use the rewrite API instead of a new optimizer rule
After providing this feedback, however, it occurs to me the "don't use function names" and "try not to add new optimizer passes" guidance isn't written down anywhere so I'll try and make a PR to do so later
|
|
||
| match inner { | ||
| Expr::AggregateFunction(agg_fn) => { | ||
| // Rule only applies to SUM |
There was a problem hiding this comment.
Checking for SUM this was is non ideal because if someone has added a user defined function with a sum that has different behavior / semantics than the built in SUM aggregate this rule will still trigger
Hi, is this a blocking PR review comment? I am not able to understand how it would be simpler?
I am thinking that we still have the same logical flow
- Detect the pattern of SUM(col + literal)
- Modify the plan appropriately.
I think it would be better for several reasons:
- A new optimizer pass adds non trivial overhead (it ends up copying each plan node I think) so we see planning time go up with each new rule we add to the base DataFusion
- It would fix the above issue
There was a problem hiding this comment.
I also think @UBarney has the same suggestion here
There was a problem hiding this comment.
Here is a PR with some suggested documentation improvements:
The solution I proposed avoids the need to locate the However, there is an issue I overlooked: the expression after simplification becomes I am still exploring the best way to handle this. One possible direction is to introduce a new rule or modify Regarding the architecture, my current view is that the |
|
Thanks for sharing your thought process on why it should be better to write a
For other comments from @neilconway, I will take care of them post I can convert this to Thanks. |
Which issue does this PR close?
SUM(..)clauses #15524Rationale for this change
SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2), ..., SUM(ResolutionWidth + 89)SUM(A + k)intoSUM(A) + k * COUNT(*)CommonSubexprEliminateto allow re use of the common SUM and COUNT expression.What changes are included in this PR?
RewriteAggregateWithConstantAre these changes tested?
Are there any user-facing changes?
No
AI Usage