Commit 10b8612

committed
Add design spec for rule-based optimizer (sub-project 6)
1 parent 3ee471e commit 10b8612

1 file changed

Lines changed: 271 additions & 0 deletions

@@ -0,0 +1,271 @@
1+
# SQL Engine Optimizer — Design Specification
2+
3+
## Overview
4+
5+
Rule-based query optimizer that transforms logical plans into more efficient logical plans. Takes a plan tree, returns a better plan tree. Same type in, same type out.
6+
7+
Sub-project 6 of the query engine. Depends on: logical plan (sub-project 5), expression evaluator (sub-project 2), catalog (sub-project 3).
8+
9+
### Goals
10+
11+
- **Four rewrite rules:** predicate pushdown, projection pruning, constant folding, limit pushdown
12+
- **Fixed rule sequence** — rules applied in order, one pass each
13+
- **Correctness-preserving** — optimized plan produces identical results to unoptimized
14+
- **Architected for extension** — cost-based optimization can be added later without rewriting
15+
16+
### Constraints
17+
18+
- C++17, arena-allocated new nodes
19+
- Rules operate on PlanNode trees (same types as plan builder output)
20+
- No iterative fixed-point — single pass per rule
21+
- No table statistics needed (rule-based, not cost-based)
22+
23+
### Non-Goals
24+
25+
- Cost-based optimization (future sub-project)
26+
- Join reordering (needs cost model)
27+
- Index selection (needs index metadata in catalog)
28+
- Subquery decorrelation
29+
30+
---
31+
32+
## Interface
33+
34+
```cpp
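template <Dialect D>
class Optimizer {
public:
    // The catalog is borrowed, not owned; it must outlive the optimizer.
    explicit Optimizer(const Catalog& catalog);

    // Returns the optimized plan. Any new nodes are allocated from `arena`.
    PlanNode* optimize(PlanNode* plan, Arena& arena);

private:
    const Catalog& catalog_;
};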
```
46+
47+
Internally applies rules in sequence:
48+
49+
```cpp
PlanNode* optimize(PlanNode* plan, Arena& arena) {
    plan = predicate_pushdown(plan, catalog_, arena);
    plan = projection_pruning(plan, catalog_, arena);
    plan = constant_folding(plan, catalog_, arena);
    plan = limit_pushdown(plan, catalog_, arena);
    return plan;
}
```
58+
59+
Each rule is a standalone function:
60+
61+
```cpp
using RewriteRule = PlanNode*(*)(PlanNode* node, const Catalog& catalog, Arena& arena);
```
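Because every rule shares one function-pointer type, the fixed sequence can also be driven from a table. The following is a minimal sketch, not the spec's implementation: `PlanNode`, `Catalog`, and `Arena` are empty stand-ins, and `apply_rules`, `append_one`, and `append_two` are hypothetical names introduced only to make the ordering observable.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the engine's real types.
struct PlanNode { int tag = 0; };
struct Catalog {};
struct Arena {};

using RewriteRule = PlanNode* (*)(PlanNode* node, const Catalog& catalog, Arena& arena);

// Applies each rule exactly once, in array order, feeding each rule's
// output plan into the next rule.
inline PlanNode* apply_rules(PlanNode* plan, const RewriteRule* rules, std::size_t count,
                             const Catalog& catalog, Arena& arena) {
    for (std::size_t i = 0; i < count; ++i) {
        assert(rules[i] != nullptr);
        plan = rules[i](plan, catalog, arena);
    }
    return plan;
}

// Toy rules (assumptions for illustration): each appends a digit to `tag`.
inline PlanNode* append_one(PlanNode* n, const Catalog&, Arena&) {
    n->tag = n->tag * 10 + 1;
    return n;
}
inline PlanNode* append_two(PlanNode* n, const Catalog&, Arena&) {
    n->tag = n->tag * 10 + 2;
    return n;
}
```

A table-driven loop like this keeps the rule order in one place, which may make it easier to reorder or extend the sequence later.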
64+
65+
---
66+
67+
## Rule 1: Predicate Pushdown
68+
69+
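Move Filter nodes below Join nodes when the filter condition references columns from only one side of the join. This is safe for inner joins; below an outer join, a filter on the null-supplying side changes the result, so only predicates on the preserved side can be pushed.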
70+
71+
### Before
72+
73+
```
Filter(a.x > 10)
└── Join(ON a.id = b.id)
    ├── Scan(a)
    └── Scan(b)
```
79+
80+
### After
81+
82+
```
Join(ON a.id = b.id)
├── Filter(a.x > 10)
│   └── Scan(a)
└── Scan(b)
```
88+
89+
### Algorithm
90+
91+
1. Walk the tree top-down
92+
2. When encountering Filter above Join:
93+
- Analyze the filter expression: which tables does it reference?
94+
- If it references only left-side tables → push filter to left child
95+
- If it references only right-side tables → push filter to right child
96+
- If it references both sides → leave in place (can't push)
97+
3. For compound conditions (AND): split into individual predicates, push each independently
98+
4. Recurse into children
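
Steps 2 and 3 can be sketched roughly as follows. This is an illustration under simplifying assumptions: `Pred`, `Side`, `split_conjuncts`, `subset_of`, and `classify` are hypothetical names, and each leaf predicate is assumed to already carry the set of tables it references. The engine's real predicates are AST expressions.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified predicate node.
struct Pred {
    enum Kind { And, Leaf } kind;
    std::set<std::string> tables;  // Leaf only: tables the predicate references
    const Pred* lhs = nullptr;     // And only
    const Pred* rhs = nullptr;     // And only
};

enum class Side { Left, Right, Stay };

// Step 3: flatten nested ANDs into independent conjuncts.
inline void split_conjuncts(const Pred* p, std::vector<const Pred*>& out) {
    assert(p != nullptr);
    if (p->kind == Pred::And) {
        split_conjuncts(p->lhs, out);
        split_conjuncts(p->rhs, out);
    } else {
        out.push_back(p);
    }
}

inline bool subset_of(const std::set<std::string>& a, const std::set<std::string>& b) {
    for (const auto& t : a) {
        if (b.count(t) == 0) return false;
    }
    return true;
}

// Step 2: decide where a single conjunct may be placed.
inline Side classify(const Pred& p, const std::set<std::string>& left,
                     const std::set<std::string>& right) {
    if (p.tables.empty()) return Side::Stay;  // constant predicate: left in place here
    if (subset_of(p.tables, left)) return Side::Left;
    if (subset_of(p.tables, right)) return Side::Right;
    return Side::Stay;  // references both sides
}
```

Conjuncts classified as `Stay` would be recombined with AND into the Filter that remains above the Join.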
99+
100+
### Table reference analysis
101+
102+
To determine which tables an expression references, walk the AST expression looking for `NODE_COLUMN_REF` and `NODE_QUALIFIED_NAME` nodes. For qualified names (`a.x`), the table prefix is known. For unqualified names (`x`), look up in the catalog to find which table the column belongs to.
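
A minimal sketch of that walk is shown here. Only the two node-type names come from the spec; the `Expr` layout, the `ColumnOwners` map standing in for the catalog, and `collect_tables` are assumptions made for illustration. Unqualified names that do not resolve are silently skipped, and ambiguous names are not handled.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified expression AST.
enum NodeType { NODE_COLUMN_REF, NODE_QUALIFIED_NAME, NODE_OTHER };

struct Expr {
    NodeType type;
    std::string table;   // set for NODE_QUALIFIED_NAME
    std::string column;  // set for both column node types
    std::vector<const Expr*> children;
};

// Stand-in for the catalog lookup: column name -> owning table.
using ColumnOwners = std::map<std::string, std::string>;

// Collects every table referenced anywhere inside `e` into `out`.
inline void collect_tables(const Expr& e, const ColumnOwners& owners,
                           std::set<std::string>& out) {
    if (e.type == NODE_QUALIFIED_NAME) {
        out.insert(e.table);  // the prefix names the table directly
    } else if (e.type == NODE_COLUMN_REF) {
        auto it = owners.find(e.column);  // unqualified: ask the catalog stand-in
        if (it != owners.end()) out.insert(it->second);
    }
    for (const Expr* child : e.children) {
        assert(child != nullptr);
        collect_tables(*child, owners, out);
    }
}
```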
103+
104+
---
105+
106+
## Rule 2: Projection Pruning
107+
108+
If the query only needs a subset of columns, annotate or transform the plan to avoid carrying unused columns through the pipeline.
109+
110+
### Approach
111+
112+
Walk the plan tree top-down, tracking which columns are needed by each node:
113+
114+
1. Start from the root (Project node) — its expression list tells us which columns are needed
115+
2. Filter nodes add their expression's column references to the needed set
116+
3. Join conditions add their column references
117+
4. Sort keys add their column references
118+
5. Aggregate group-by and aggregate expressions add their column references
119+
120+
If a Scan node produces columns that no ancestor needs, insert a Project node immediately above the Scan to strip unused columns.
121+
122+
### Implementation
123+
124+
```cpp
PlanNode* projection_pruning(PlanNode* plan, const Catalog& catalog, Arena& arena);
```
127+
128+
Collect column references from all expressions in the plan. For each Scan, compare needed columns against available columns. If fewer are needed, insert a slimming Project.
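
The needed-column computation might look like the sketch below for a single-child chain of nodes. `Node` and `slimming_columns` are hypothetical; joins and aggregates are omitted, and each node is assumed to expose the column names its expressions reference.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified plan node (single-child chain only).
struct Node {
    enum Kind { Project, Filter, Scan } kind;
    std::vector<std::string> refs;     // columns referenced by this node's expressions
    std::vector<std::string> columns;  // Scan only: columns the table provides
    const Node* child = nullptr;       // null for Scan
};

// Walks top-down, accumulating referenced columns. Returns the columns a
// slimming Project above the Scan should keep (in scan order), or nullopt
// when the Scan has no unused columns.
inline std::optional<std::vector<std::string>> slimming_columns(const Node& root) {
    std::set<std::string> needed;
    const Node* n = &root;
    while (n->kind != Node::Scan) {
        needed.insert(n->refs.begin(), n->refs.end());
        assert(n->child != nullptr);  // every non-Scan node has a child in this sketch
        n = n->child;
    }
    std::vector<std::string> kept;
    for (const auto& c : n->columns) {
        if (needed.count(c) != 0) kept.push_back(c);
    }
    if (kept.size() == n->columns.size()) return std::nullopt;
    return kept;
}
```

For a plan equivalent to `SELECT name FROM t WHERE age > 30` over columns `id, name, age, email`, this keeps `name` and `age`: the WHERE column survives even though it is not selected.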
129+
130+
---
131+
132+
## Rule 3: Constant Folding
133+
134+
Evaluate expressions that don't reference any columns at plan time.
135+
136+
### Examples
137+
138+
- `10 + 8` → `18`
139+
- `UPPER('hello')` → `'HELLO'`
140+
- `1 > 2` → `FALSE`
141+
- `COALESCE(NULL, 42)` → `42`
142+
143+
### Algorithm
144+
145+
1. Walk all expression AST nodes in the plan (Filter, Project, Sort, etc.)
146+
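2. For each subexpression, check that it references no columns (contains no `NODE_COLUMN_REF` or `NODE_QUALIFIED_NAME` node) and calls no non-deterministic function such as `RAND()`, which must still be evaluated per row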
147+
3. If it's purely constant, evaluate it using the expression evaluator
148+
4. Replace the expression with a new literal AST node containing the result
149+
150+
### Implementation
151+
152+
```cpp
PlanNode* constant_folding(PlanNode* plan, const Catalog& catalog, Arena& arena);
```
155+
156+
Uses `evaluate_expression<D>()` with a null resolver (no columns available — if it tries to resolve a column, it's not a constant expression).
157+
158+
The expression evaluator already handles this — we just need to replace the AST node with a literal node after evaluation.
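
The replace-with-literal step, including partial folding of a subexpression, can be sketched on a toy AST. Everything here is hypothetical: the `Expr` layout, the `NodePool` deque standing in for the arena, and the inline arithmetic that takes the place of `evaluate_expression<D>()`. Booleans are represented as 0 or 1.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical miniature AST; the engine's real node types differ.
struct Expr {
    enum Kind { Literal, Column, Add, Greater } kind;
    long value = 0;             // Literal only
    std::string name;           // Column only
    const Expr* lhs = nullptr;  // Add / Greater
    const Expr* rhs = nullptr;  // Add / Greater
};

// Arena stand-in: std::deque keeps element addresses stable on push_back.
using NodePool = std::deque<Expr>;

inline const Expr* make(NodePool& pool, Expr e) {
    pool.push_back(e);
    return &pool.back();
}

// Bottom-up fold: fold children first, then this node if both are literals.
inline const Expr* fold(const Expr* e, NodePool& pool) {
    assert(e != nullptr);
    if (e->kind == Expr::Literal || e->kind == Expr::Column) return e;
    const Expr* l = fold(e->lhs, pool);
    const Expr* r = fold(e->rhs, pool);
    if (l->kind == Expr::Literal && r->kind == Expr::Literal) {
        long v = (e->kind == Expr::Add) ? l->value + r->value
                                        : static_cast<long>(l->value > r->value);
        return make(pool, Expr{Expr::Literal, v});
    }
    if (l == e->lhs && r == e->rhs) return e;        // nothing changed below
    return make(pool, Expr{e->kind, 0, "", l, r});   // rebuild with folded child
}
```

Working bottom-up is what lets `col > 10 + 8` become `col > 18`: the `10 + 8` subtree folds to a literal while the comparison itself, which references a column, is rebuilt rather than evaluated.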
159+
160+
---
161+
162+
## Rule 4: Limit Pushdown
163+
164+
Push Limit nodes past Filter nodes toward Scan nodes, enabling early termination.
165+
166+
### Before
167+
168+
```
Limit(10)
└── Filter(active = 1)
    └── Scan(users)
```
173+
174+
### After
175+
176+
```
Limit(10)
└── Filter(active = 1)
    └── Limit(10)
        └── Scan(users)
```
182+
183+
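The inner Limit lets the scan stop after 10 candidate rows (before filtering). The outer Limit caps the output at 10 rows after filtering. If the executor enforces the inner Limit as a hard cap, the Filter sees only those 10 rows and the plan can return fewer rows than the unoptimized plan, so the inner Limit must be treated as an advisory hint for the rule to stay correctness-preserving.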
184+
185+
### Constraints — when NOT to push
186+
187+
Do NOT push limit past:
188+
- **Sort** — sort needs all rows before it can produce ordered output
189+
- **Aggregate** — aggregation needs all rows to compute correct groups
190+
- **Distinct** — needs all rows to determine uniqueness
191+
- **Join** — limit on one side would produce incorrect join results
192+
193+
Only push past **Filter** nodes (and only when there's no Sort/Aggregate above).
194+
195+
### Implementation
196+
197+
```cpp
PlanNode* limit_pushdown(PlanNode* plan, const Catalog& catalog, Arena& arena);
```
200+
201+
Walk the tree. When seeing Limit → Filter → child, insert a new Limit(same count) between Filter and child. The extra Limit is a hint for early termination, not a correctness requirement.
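
The shape change alone can be sketched as below. This is an illustration under assumptions, not the spec's implementation: `Node`, `NodePool`, and `push_limit_below_filter` are hypothetical, the signature differs from the declaration above, and the sketch rewires child pointers in place. It shows only the tree rewrite and says nothing about how an executor should interpret the inserted Limit.

```cpp
#include <cassert>
#include <deque>

// Hypothetical, simplified single-child plan node.
struct Node {
    enum Kind { Limit, Filter, Sort, Scan } kind;
    long count = 0;         // Limit only
    Node* child = nullptr;  // null for Scan
};

// Arena stand-in: std::deque keeps element addresses stable on push_back.
using NodePool = std::deque<Node>;

// Rewrites Limit -> Filter -> child into Limit -> Filter -> Limit -> child.
// Every other shape, including Limit -> Sort, is left untouched.
inline Node* push_limit_below_filter(Node* n, NodePool& pool) {
    if (n == nullptr) return nullptr;
    if (n->kind == Node::Limit && n->child != nullptr && n->child->kind == Node::Filter) {
        Node* filter = n->child;
        assert(filter->child != nullptr);
        if (filter->child->kind != Node::Limit) {  // avoid stacking duplicate Limits
            pool.push_back(Node{Node::Limit, n->count, filter->child});
            filter->child = &pool.back();
        }
        // Continue below the Limit that now sits under the Filter.
        filter->child->child = push_limit_below_filter(filter->child->child, pool);
        return n;
    }
    n->child = push_limit_below_filter(n->child, pool);
    return n;
}
```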
202+
203+
---
204+
205+
## File Organization
206+
207+
```
include/sql_engine/
    optimizer.h                  — Optimizer<D> class
    rules/
        predicate_pushdown.h     — Push filters below joins
        projection_pruning.h     — Drop unused columns early
        constant_folding.h       — Evaluate constants at plan time
        limit_pushdown.h         — Push limits toward scans

tests/
    test_optimizer.cpp           — Tests for each rule + combined + correctness
```
219+
220+
---
221+
222+
## Testing Strategy
223+
224+
### Per-rule tests
225+
226+
Each rule is tested independently: build a specific plan shape, apply the rule, verify the transformed shape.
227+
228+
**Predicate pushdown:**
229+
- Filter above Join → pushed to correct side
230+
- Filter referencing both sides → stays in place
231+
- AND with mixed predicates → split and push individually
232+
- No Join → no change
233+
234+
**Projection pruning:**
235+
- SELECT subset of columns → Project inserted above Scan
236+
- SELECT * → no pruning needed
237+
- Columns used in WHERE but not SELECT → still kept
238+
239+
**Constant folding:**
240+
- `10 + 8` → literal 18
241+
- `UPPER('hello')` → literal 'HELLO'
242+
- `col + 1` → not folded (has column reference)
243+
- Mixed: `col > 10 + 8` → `col > 18` (partial folding)
244+
245+
**Limit pushdown:**
246+
- Limit → Filter → Scan → inner Limit inserted
247+
- Limit → Sort → Scan → no change (Sort blocks pushdown)
248+
- Limit → Aggregate → Scan → no change
249+
250+
### Combined optimization test
251+
252+
Apply all rules to a complex query and verify the final plan shape.
253+
254+
### Correctness test
255+
256+
Parse SQL → build plan → execute WITHOUT optimizer → get results.
257+
Parse SQL → build plan → optimize → execute WITH optimizer → get results.
258+
Verify both result sets are identical (compared as multisets when the query has no ORDER BY, since row order is then unspecified).
259+
260+
This is the critical test — the optimizer must never change query semantics.
261+
262+
---
263+
264+
## Performance Targets
265+
266+
| Operation | Target |
|---|---|
| Optimizer (simple query, 4 rules) | < 10 µs |
| Optimizer (complex query with joins) | < 50 µs |
| Predicate pushdown (single join) | < 5 µs |
| Constant folding (5 constants) | < 2 µs |
