Add design spec for Row format and Logical Plan (sub-projects 4+5)
1 parent 1ea314e commit f7e82fa

1 file changed

Lines changed: 272 additions & 0 deletions

# SQL Engine Row Format & Logical Plan: Design Specification

## Overview

This spec covers two tightly coupled components: the in-memory Row format and the Logical Plan tree. The Row defines how data flows between operators. The Logical Plan defines the relational algebra tree that the optimizer and executor will operate on.

These are sub-projects 4 and 5 of the query engine. They depend on the type system (sub-project 1), the expression evaluator (sub-project 2), and the catalog (sub-project 3).

### Goals

- **Row struct**: fixed-width array of `Value` objects, arena-allocated, indexed by ordinal
- **Logical plan nodes**: relational algebra operators: Scan, Filter, Project, Join, Aggregate, Sort, Limit, Distinct, SetOp (UNION / INTERSECT / EXCEPT)
- **Plan builder**: mechanical translation from parser AST to logical plan tree (no optimization)
- **SELECT-only** for now: INSERT/UPDATE/DELETE plan generation is deferred to the executor

### Constraints

- C++17, arena-allocated plan nodes
- Plan nodes hold pointers to parser AST expression nodes (not copies)
- No optimization in the plan builder; that is sub-project 6

---

## Row Format

```cpp
struct Row {
    Value* values;          // array indexed by ordinal
    uint16_t column_count;

    // Accessors do not bounds-check; callers must pass ordinal < column_count.
    Value get(uint16_t ordinal) const { return values[ordinal]; }
    void set(uint16_t ordinal, Value v) { values[ordinal] = v; }
    bool is_null(uint16_t ordinal) const { return values[ordinal].is_null(); }
};
```

No separate null bitmap: `Value::tag == TAG_NULL` serves as the null indicator.

**Allocation:**

```cpp
inline Row make_row(Arena& arena, uint16_t column_count) {
    Value* vals = static_cast<Value*>(arena.allocate(sizeof(Value) * column_count));
    // Every column starts out as NULL.
    for (uint16_t i = 0; i < column_count; ++i) vals[i] = value_null();
    return Row{vals, column_count};
}
```

Rows are arena-allocated and freed when the arena is reset after the query completes.
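
As a rough illustration of the Row lifecycle, the sketch below allocates a row, checks that every column starts as NULL, and round-trips one value. The `Value`, `Tag`, `value_int`, and `Arena` definitions are hypothetical stand-ins (the real ones live in other sub-projects and are not shown in this spec); only `Row` and `make_row` follow the definitions above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins: the real Value and Arena are defined elsewhere.
enum Tag : uint8_t { TAG_NULL, TAG_INT64 };

struct Value {
    Tag tag;
    int64_t i64;
    bool is_null() const { return tag == TAG_NULL; }
};

inline Value value_null() { return Value{TAG_NULL, 0}; }
inline Value value_int(int64_t v) { return Value{TAG_INT64, v}; }

// Toy bump arena over a fixed buffer; reset() reclaims everything at once.
class Arena {
public:
    void* allocate(std::size_t bytes) {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t start = (used_ + align - 1) / align * align;
        assert(start + bytes <= sizeof(buf_));
        used_ = start + bytes;
        return buf_ + start;
    }
    void reset() { used_ = 0; }
private:
    alignas(std::max_align_t) unsigned char buf_[4096];
    std::size_t used_ = 0;
};

struct Row {
    Value* values;
    uint16_t column_count;
    Value get(uint16_t ordinal) const { return values[ordinal]; }
    void set(uint16_t ordinal, Value v) { values[ordinal] = v; }
    bool is_null(uint16_t ordinal) const { return values[ordinal].is_null(); }
};

inline Row make_row(Arena& arena, uint16_t column_count) {
    Value* vals = static_cast<Value*>(arena.allocate(sizeof(Value) * column_count));
    for (uint16_t i = 0; i < column_count; ++i) vals[i] = value_null();
    return Row{vals, column_count};
}

// Usage: a fresh row is all NULL; set() on one column leaves the others NULL.
inline int64_t demo_row_roundtrip() {
    Arena arena;
    Row row = make_row(arena, 3);
    assert(row.is_null(0) && row.is_null(1) && row.is_null(2));
    row.set(1, value_int(42));
    assert(!row.is_null(1) && row.is_null(2));
    return row.get(1).i64;
}
```

Because the row only borrows arena memory, copying a `Row` copies the pointer, not the values; both copies become invalid once the arena is reset.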

**Schema:** A Row's column metadata is the `TableInfo` from the Catalog (or a `ProjectInfo` for computed columns). No new schema struct is needed.

---

## Logical Plan Nodes

### Node types

```cpp
enum class PlanNodeType : uint8_t {
    SCAN,       // read from data source
    FILTER,     // WHERE / HAVING condition
    PROJECT,    // SELECT expression list
    JOIN,       // JOIN two sources
    AGGREGATE,  // GROUP BY + aggregate functions
    SORT,       // ORDER BY
    LIMIT,      // LIMIT + OFFSET
    DISTINCT,   // remove duplicates
    SET_OP,     // UNION / INTERSECT / EXCEPT
};
```

### PlanNode struct

```cpp
struct PlanNode {
    PlanNodeType type;
    PlanNode* left = nullptr;   // primary child (or left of join/union)
    PlanNode* right = nullptr;  // right of join/union (null for unary ops)

    union {
        struct {
            const TableInfo* table;
        } scan;

        struct {
            const AstNode* expr;        // WHERE/HAVING expression AST
        } filter;

        struct {
            const AstNode** exprs;      // SELECT expression list (AST nodes)
            const AstNode** aliases;    // alias AST nodes (parallel array, nullable entries)
            uint16_t count;
        } project;

        struct {
            uint8_t join_type;          // INNER=0, LEFT=1, RIGHT=2, FULL=3, CROSS=4
            const AstNode* condition;   // ON expression AST (null for CROSS/NATURAL)
        } join;

        struct {
            const AstNode** group_by;   // GROUP BY expression list
            uint16_t group_count;
            const AstNode** agg_exprs;  // aggregate expressions (COUNT, SUM, etc.)
            uint16_t agg_count;
        } aggregate;

        struct {
            const AstNode** keys;       // ORDER BY key expressions
            uint8_t* directions;        // 0=ASC, 1=DESC (parallel array)
            uint16_t count;
        } sort;

        struct {
            int64_t count;
            int64_t offset;
        } limit;

        struct {
            uint8_t op;                 // 0=UNION, 1=INTERSECT, 2=EXCEPT
            bool all;                   // UNION ALL vs UNION
        } set_op;
    };
};
```

**Design decisions:**

1. **AST expression pointers, not copies.** Plan nodes reference the parser's AST nodes directly. The AST lives in the arena for the query's lifetime, so the pointers stay valid and no expression is duplicated.

2. **Arena-allocated.** `PlanNode` is allocated from the arena via `make_plan_node(arena, type)`.

3. **Union for node-specific data.** Compact: every `PlanNode` has the same size regardless of type, set by the largest union member (`aggregate`). The estimate is ~48 bytes; with 8-byte pointers and natural alignment the layout above works out to 56 bytes.
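
The spec names `make_plan_node(arena, type)` without showing it. One possible shape is sketched below, together with a compile-time size check. This is an assumption about the implementation, not the project's actual code: the `Arena` is a toy bump allocator, and the payload structs are hoisted into named types (`ScanData` and so on, names invented here) because unnamed struct types nested inside an anonymous union are accepted by the major compilers only as an extension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <type_traits>

struct TableInfo;  // catalog type, opaque in this sketch
struct AstNode;    // parser type, opaque in this sketch

enum class PlanNodeType : uint8_t {
    SCAN, FILTER, PROJECT, JOIN, AGGREGATE, SORT, LIMIT, DISTINCT, SET_OP
};

// Payloads as named types (hypothetical names), same fields as in the spec.
struct ScanData      { const TableInfo* table; };
struct FilterData    { const AstNode* expr; };
struct ProjectData   { const AstNode** exprs; const AstNode** aliases; uint16_t count; };
struct JoinData      { uint8_t join_type; const AstNode* condition; };
struct AggregateData { const AstNode** group_by; uint16_t group_count;
                       const AstNode** agg_exprs; uint16_t agg_count; };
struct SortData      { const AstNode** keys; uint8_t* directions; uint16_t count; };
struct LimitData     { int64_t count; int64_t offset; };
struct SetOpData     { uint8_t op; bool all; };

struct PlanNode {
    PlanNodeType type;
    PlanNode* left = nullptr;
    PlanNode* right = nullptr;
    union {
        ScanData scan;           FilterData filter;
        ProjectData project;     JoinData join;
        AggregateData aggregate; SortData sort;
        LimitData limit;         SetOpData set_op;
    };
};

// The node must stay small and must not need a destructor call on arena reset.
static_assert(sizeof(PlanNode) <= 64, "PlanNode grew past 64 bytes");
static_assert(std::is_trivially_destructible<PlanNode>::value,
              "arena reset never runs destructors");

// Toy bump arena over a fixed buffer (stand-in for the real Arena).
class Arena {
public:
    void* allocate(std::size_t bytes) {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t start = (used_ + align - 1) / align * align;
        assert(start + bytes <= sizeof(buf_));
        used_ = start + bytes;
        return buf_ + start;
    }
private:
    alignas(std::max_align_t) unsigned char buf_[4096];
    std::size_t used_ = 0;
};

// Value-initializes the node in arena memory (children null, payload zeroed),
// then stamps the requested type.
inline PlanNode* make_plan_node(Arena& arena, PlanNodeType type) {
    void* mem = arena.allocate(sizeof(PlanNode));
    PlanNode* node = new (mem) PlanNode();
    node->type = type;
    return node;
}
```

The `static_assert` on size is a cheap guard: adding a field to any payload that pushes the node past the budget fails the build instead of silently bloating every plan.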

---

## Plan Builder: AST to Logical Plan

```cpp
template <Dialect D>
class PlanBuilder {
public:
    PlanBuilder(const Catalog& catalog, Arena& arena);

    // Build a logical plan from a parsed SELECT statement AST
    PlanNode* build(const AstNode* stmt_ast);

private:
    const Catalog& catalog_;
    Arena& arena_;

    PlanNode* build_select(const AstNode* select_ast);
    PlanNode* build_from(const AstNode* from_clause);
    PlanNode* build_join(const AstNode* join_clause, PlanNode* left);
    PlanNode* build_compound(const AstNode* compound_ast);
};
```

### Translation rules for SELECT

The builder walks the SELECT AST children and builds the plan bottom-up:

```
SQL:  SELECT DISTINCT name, COUNT(*)
      FROM users u JOIN orders o ON u.id = o.user_id
      WHERE u.active = 1
      GROUP BY name
      HAVING COUNT(*) > 5
      ORDER BY name
      LIMIT 10 OFFSET 5

Plan: Limit(10, 5)
      └── Sort([name ASC])
          └── Distinct
              └── Project([name, COUNT(*)])
                  └── Filter(COUNT(*) > 5)                 ← HAVING
                      └── Aggregate([name], [COUNT(*)])
                          └── Filter(u.active = 1)         ← WHERE
                              └── Join(INNER, u.id = o.user_id)
                                  ├── Scan(users)
                                  └── Scan(orders)
```

### Build order

1. Start with FROM → `Scan` nodes (one per table)
2. If JOINs → wrap in `Join` nodes
3. If WHERE → wrap in `Filter`
4. If GROUP BY → wrap in `Aggregate`
5. If HAVING → wrap in `Filter` (above Aggregate)
6. SELECT list → wrap in `Project` (the expected plan shapes under Testing Strategy omit it for a bare `SELECT *`)
7. If DISTINCT → wrap in `Distinct`
8. If ORDER BY → wrap in `Sort`
9. If LIMIT → wrap in `Limit`
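
The build order above amounts to repeatedly wrapping the current subtree in a new parent. A minimal sketch of steps 3 to 9 follows, under several assumptions: `PlanNode` is trimmed to its type and child pointers, `NodePool` stands in for the arena, and `SelectClauses` is a hypothetical summary of which clauses the AST contains (the real builder would inspect `AstNode` children instead). Skipping `Project` for a bare `SELECT *` is inferred from the expected shapes in the testing table, not stated as a rule in the spec.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

enum class PlanNodeType : uint8_t {
    SCAN, FILTER, PROJECT, JOIN, AGGREGATE, SORT, LIMIT, DISTINCT, SET_OP
};

// Trimmed-down node for illustration: the payload union is omitted.
struct PlanNode {
    PlanNodeType type;
    PlanNode* left = nullptr;
    PlanNode* right = nullptr;
};

// Hypothetical arena stand-in; std::deque keeps element addresses stable.
struct NodePool {
    std::deque<PlanNode> nodes;
    PlanNode* make(PlanNodeType type, PlanNode* child = nullptr) {
        nodes.push_back(PlanNode{type, child, nullptr});
        return &nodes.back();
    }
};

// Hypothetical summary of the clauses present in a SELECT statement.
struct SelectClauses {
    bool select_star = false;  // assumption: bare SELECT * needs no Project
    bool where = false;
    bool group_by = false;
    bool having = false;
    bool distinct = false;
    bool order_by = false;
    bool limit = false;
};

// Wraps `from` (the Scan/Join subtree built in steps 1-2) per steps 3-9.
inline PlanNode* wrap_select(NodePool& pool, PlanNode* from, const SelectClauses& c) {
    PlanNode* plan = from;
    if (c.where)        plan = pool.make(PlanNodeType::FILTER, plan);     // step 3
    if (c.group_by)     plan = pool.make(PlanNodeType::AGGREGATE, plan);  // step 4
    if (c.having)       plan = pool.make(PlanNodeType::FILTER, plan);     // step 5
    if (!c.select_star) plan = pool.make(PlanNodeType::PROJECT, plan);    // step 6
    if (c.distinct)     plan = pool.make(PlanNodeType::DISTINCT, plan);   // step 7
    if (c.order_by)     plan = pool.make(PlanNodeType::SORT, plan);       // step 8
    if (c.limit)        plan = pool.make(PlanNodeType::LIMIT, plan);      // step 9
    return plan;
}

// Collects node types from the root down the left spine.
inline std::vector<PlanNodeType> spine(const PlanNode* node) {
    std::vector<PlanNodeType> out;
    for (; node != nullptr; node = node->left) out.push_back(node->type);
    return out;
}

// Usage: SELECT name FROM users WHERE age > 18 yields Project, Filter, Scan.
inline void check_project_filter_scan() {
    NodePool pool;
    SelectClauses clauses;
    clauses.where = true;
    PlanNode* plan = wrap_select(pool, pool.make(PlanNodeType::SCAN), clauses);
    const std::vector<PlanNodeType> expected = {
        PlanNodeType::PROJECT, PlanNodeType::FILTER, PlanNodeType::SCAN};
    assert(spine(plan) == expected);
}
```

Since each step wraps the previous result, the last clause processed ends up at the root, which is why `Limit` sits on top of the plan even though it is handled last.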

### FROM clause translation

- Single table: `Scan(table)`, with the table looked up in the Catalog
- Multiple tables (comma join): `Join(CROSS, Scan(t1), Scan(t2))`
- Explicit JOIN: `Join(type, Scan(t1), Scan(t2), condition)`
- Multiple JOINs: left-deep tree of `Join` nodes
- Subquery in FROM: deferred (the builder returns a `Scan` with a null table for now)
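
To make "left-deep" concrete, the sketch below folds a comma-separated FROM list into nested CROSS joins, so each new table becomes the right child of a join whose left child is everything built so far. The node is again trimmed down, with a table name standing in for the `TableInfo` pointer; `NodePool`, `build_from_list`, and `render` are hypothetical helpers, not part of the spec.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

enum class PlanNodeType : uint8_t { SCAN, JOIN };

// Join type codes as listed in the PlanNode comment.
enum JoinType : uint8_t {
    JOIN_INNER = 0, JOIN_LEFT = 1, JOIN_RIGHT = 2, JOIN_FULL = 3, JOIN_CROSS = 4
};

// Trimmed-down node: a table name replaces the catalog's TableInfo pointer.
struct PlanNode {
    PlanNodeType type = PlanNodeType::SCAN;
    PlanNode* left = nullptr;
    PlanNode* right = nullptr;
    std::string table;      // SCAN only
    uint8_t join_type = 0;  // JOIN only
};

// Hypothetical arena stand-in; std::deque keeps element addresses stable.
struct NodePool {
    std::deque<PlanNode> nodes;
    PlanNode* scan(const std::string& table) {
        nodes.emplace_back();
        PlanNode* n = &nodes.back();
        n->type = PlanNodeType::SCAN;
        n->table = table;
        return n;
    }
    PlanNode* join(uint8_t join_type, PlanNode* l, PlanNode* r) {
        nodes.emplace_back();
        PlanNode* n = &nodes.back();
        n->type = PlanNodeType::JOIN;
        n->join_type = join_type;
        n->left = l;
        n->right = r;
        return n;
    }
};

// Folds FROM t1, t2, ..., tn into a left-deep tree of CROSS joins.
inline PlanNode* build_from_list(NodePool& pool, const std::vector<std::string>& tables) {
    assert(!tables.empty());
    PlanNode* plan = pool.scan(tables[0]);
    for (std::size_t i = 1; i < tables.size(); ++i) {
        PlanNode* rhs = pool.scan(tables[i]);
        plan = pool.join(JOIN_CROSS, plan, rhs);  // previous tree goes on the left
    }
    return plan;
}

// Renders the tree in the Join(left, right) notation used in this spec.
inline std::string render(const PlanNode* n) {
    if (n->type == PlanNodeType::SCAN) return "Scan(" + n->table + ")";
    return "Join(" + render(n->left) + ", " + render(n->right) + ")";
}
```

For `FROM t1, t2, t3` this produces `Join(Join(Scan(t1), Scan(t2)), Scan(t3))`: the right child of every join is always a single scan, which is the defining property of a left-deep tree.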

### Compound queries

`NODE_COMPOUND_QUERY` → `SetOp` node wrapping two child plans:

```
SELECT ... UNION ALL SELECT ...
  → SetOp(UNION, all=true, build(left), build(right))
```

Trailing ORDER BY / LIMIT on compounds → `Sort` / `Limit` above the `SetOp`.
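
A compound query with a trailing ORDER BY and LIMIT could be assembled as sketched below. The function name `build_compound_shape` and its boolean parameters are invented for illustration, the node is trimmed to the fields the example needs, and `NodePool` stands in for the arena; the real `build_compound` takes an `AstNode` and would derive these flags from it.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

enum class PlanNodeType : uint8_t { SCAN, SORT, LIMIT, SET_OP };

// Set operation codes as listed in the PlanNode comment.
enum SetOpKind : uint8_t { SET_UNION = 0, SET_INTERSECT = 1, SET_EXCEPT = 2 };

// Trimmed-down node: only the set_op payload is kept.
struct PlanNode {
    PlanNodeType type = PlanNodeType::SCAN;
    PlanNode* left = nullptr;
    PlanNode* right = nullptr;
    uint8_t op = 0;    // SET_OP only
    bool all = false;  // SET_OP only
};

// Hypothetical arena stand-in; std::deque keeps element addresses stable.
struct NodePool {
    std::deque<PlanNode> nodes;
    PlanNode* make(PlanNodeType type, PlanNode* l = nullptr, PlanNode* r = nullptr) {
        nodes.emplace_back();
        PlanNode* n = &nodes.back();
        n->type = type;
        n->left = l;
        n->right = r;
        return n;
    }
};

// Combines two already-built child plans, then applies the trailing clauses
// to the combined result rather than to either side.
inline PlanNode* build_compound_shape(NodePool& pool, PlanNode* lhs, PlanNode* rhs,
                                      uint8_t op, bool all,
                                      bool has_order_by, bool has_limit) {
    assert(lhs != nullptr && rhs != nullptr);
    PlanNode* plan = pool.make(PlanNodeType::SET_OP, lhs, rhs);
    plan->op = op;
    plan->all = all;
    if (has_order_by) plan = pool.make(PlanNodeType::SORT, plan);
    if (has_limit)    plan = pool.make(PlanNodeType::LIMIT, plan);
    return plan;
}
```

Placing `Sort` and `Limit` above the `SetOp` matters: the ordering and row cap apply to the combined result, not to the right-hand SELECT alone.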

### No FROM clause

`SELECT 1 + 2` (no FROM) → `Project` with no child (leaf node). The executor generates a single empty row.

---

## File Organization

```
include/sql_engine/
    row.h              - Row struct, make_row()
    plan_node.h        - PlanNodeType enum, PlanNode struct, make_plan_node()
    plan_builder.h     - PlanBuilder<D> template

tests/
    test_row.cpp           - Row creation, get/set, null
    test_plan_builder.cpp  - AST → plan for various SELECT shapes
```

---

## Testing Strategy

### Row tests
- Create row, set/get values
- NULL checks
- Arena allocation

### Plan builder tests

Parse real SQL → build plan → walk tree → verify structure:

| SQL | Expected plan shape |
|---|---|
| `SELECT * FROM users` | Scan(users) |
| `SELECT name FROM users` | Project → Scan |
| `SELECT * FROM users WHERE id = 1` | Filter → Scan |
| `SELECT name, age FROM users WHERE age > 18` | Project → Filter → Scan |
| `SELECT * FROM users ORDER BY name LIMIT 10` | Limit → Sort → Scan |
| `SELECT status, COUNT(*) FROM users GROUP BY status` | Project → Aggregate → Scan |
| `SELECT status, COUNT(*) FROM users GROUP BY status HAVING COUNT(*) > 5` | Project → Filter → Aggregate → Scan |
| `SELECT * FROM users u JOIN orders o ON u.id = o.user_id` | Join(Scan, Scan) |
| `SELECT DISTINCT name FROM users` | Distinct → Project → Scan |
| `SELECT * FROM t1 UNION ALL SELECT * FROM t2` | SetOp(Scan, Scan) |
| `SELECT 1 + 2` | Project (no child, leaf) |

Each test verifies: correct PlanNodeType at each level, correct table in Scan, correct expression pointer in Filter.
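
One way to check plan shapes in tests is to render the tree to a string and compare it with the expected shape from the table. The helper below is a sketch under stated assumptions: `plan_shape` and `type_name` are hypothetical names, the node is trimmed to its type and children, and the ASCII arrow `->` is used in place of `→` to keep the string literals plain.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>

enum class PlanNodeType : uint8_t {
    SCAN, FILTER, PROJECT, JOIN, AGGREGATE, SORT, LIMIT, DISTINCT, SET_OP
};

// Trimmed-down node for illustration: the payload union is omitted.
struct PlanNode {
    PlanNodeType type = PlanNodeType::SCAN;
    PlanNode* left = nullptr;
    PlanNode* right = nullptr;
};

// Hypothetical arena stand-in; std::deque keeps element addresses stable.
struct NodePool {
    std::deque<PlanNode> nodes;
    PlanNode* make(PlanNodeType type, PlanNode* l = nullptr, PlanNode* r = nullptr) {
        nodes.emplace_back();
        PlanNode* n = &nodes.back();
        n->type = type;
        n->left = l;
        n->right = r;
        return n;
    }
};

inline const char* type_name(PlanNodeType t) {
    switch (t) {
        case PlanNodeType::SCAN:      return "Scan";
        case PlanNodeType::FILTER:    return "Filter";
        case PlanNodeType::PROJECT:   return "Project";
        case PlanNodeType::JOIN:      return "Join";
        case PlanNodeType::AGGREGATE: return "Aggregate";
        case PlanNodeType::SORT:      return "Sort";
        case PlanNodeType::LIMIT:     return "Limit";
        case PlanNodeType::DISTINCT:  return "Distinct";
        case PlanNodeType::SET_OP:    return "SetOp";
    }
    return "?";
}

// Leaf: "Name". Unary: "Name -> child". Binary: "Name(left, right)".
inline std::string plan_shape(const PlanNode* n) {
    assert(n != nullptr);
    const std::string name = type_name(n->type);
    if (n->left && n->right)
        return name + "(" + plan_shape(n->left) + ", " + plan_shape(n->right) + ")";
    if (n->left)
        return name + " -> " + plan_shape(n->left);
    return name;
}
```

A shape string covers only the node types. The table and expression-pointer checks listed above still need direct assertions on the node payloads.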

---

## Performance Targets

| Operation | Target |
|---|---|
| `make_row` (10 columns) | < 50 ns |
| Row get/set | < 5 ns |
| Plan builder (simple SELECT WHERE) | < 500 ns |
| Plan builder (complex JOIN + GROUP BY) | < 2 µs |
