Commit 1ae5c66

Add design specs for subquery execution (sub-project 11) and transactions (sub-project 12)
1 parent 40dad5d commit 1ae5c66

2 files changed

Lines changed: 527 additions & 0 deletions

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# Subquery Execution — Design Specification

## Overview

Enable subquery evaluation in the expression evaluator and executor. Currently `NODE_SUBQUERY` returns NULL. After this sub-project, `WHERE id IN (SELECT ...)`, `WHERE EXISTS (SELECT ...)`, scalar subqueries, and correlated subqueries work correctly.

Sub-project 11. Depends on: executor (sub-project 7), backend connections (sub-project 10).

### Goals

- **Uncorrelated subqueries**: `WHERE id IN (SELECT user_id FROM orders)`, `WHERE EXISTS (SELECT 1 FROM ...)`, `SELECT (SELECT MAX(age) FROM users)`
- **Correlated subqueries**: `WHERE age > (SELECT AVG(age) FROM users WHERE dept = outer.dept)`
- **Subqueries in FROM**: `SELECT * FROM (SELECT ...) AS t` (derived tables)
- **Subqueries in expressions**: `SELECT CASE WHEN (SELECT COUNT(*) FROM orders) > 0 THEN 'yes' ELSE 'no' END`
- **Distributed subqueries**: the subquery runs on a different backend than the outer query

### Constraints

- C++17
- Reuse existing executor pipeline for inner queries
- Arena-allocated intermediate results
- No materialized CTEs (WITH clause); deferred

---

## Subquery Types

### 1. Scalar subquery

Returns a single value. Used anywhere an expression is expected.

```sql
SELECT name, (SELECT MAX(total) FROM orders WHERE orders.user_id = users.id) AS max_order
FROM users
```

Execution: run the inner query and verify that it returns at most one row. If 0 rows → NULL. If 1 row → the single value. If more than 1 row → error.
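
The three-way rule above can be sketched as a small reduction over the inner query's rows. This is only an illustration: `Value`, `Column`, and `scalar_from_rows` are hypothetical stand-ins (the spec does not define the engine's value or result types), with `std::nullopt` modelling SQL NULL.

```cpp
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the engine's Value: nullopt models SQL NULL.
using Value = std::optional<long long>;
// Hypothetical stand-in for a materialized single-column inner result.
using Column = std::vector<Value>;

// Reduce the inner query's rows to one scalar, following the rule above.
inline Value scalar_from_rows(const Column& rows) {
    if (rows.empty()) return std::nullopt;  // 0 rows -> NULL
    if (rows.size() > 1) {
        // More than one row is an error, not a truncation.
        throw std::runtime_error("scalar subquery returned more than one row");
    }
    return rows.front();  // 1 row -> its value (which may itself be NULL)
}
```

A real implementation would likely stop pulling rows as soon as a second one arrives instead of materializing the whole result first.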

### 2. EXISTS subquery

Returns TRUE if the inner query returns at least one row, FALSE otherwise.

```sql
SELECT * FROM users WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id)
```

Execution: run the inner query with an implicit LIMIT 1 (an optimization: only the existence of a row matters). Return `value_bool(has_row)`.
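
With a pull-based inner plan, the implicit LIMIT 1 amounts to asking for a single row and stopping. A minimal sketch, assuming a hypothetical `NextFn` callback that stands in for the inner operator's `next()` (neither `Row` nor `NextFn` is defined by the spec):

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins: a row, and a pull callback that returns false
// once the inner query is exhausted.
using Row = std::vector<std::string>;
using NextFn = std::function<bool(Row&)>;

// EXISTS needs only one pull from the inner query; the row's contents
// are never inspected.
inline bool exists(const NextFn& next) {
    Row scratch;
    return next(scratch);
}
```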

### 3. IN subquery

Returns TRUE if the outer value matches any row from the inner query.

```sql
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders)
```

Execution: materialize the inner query into a set. For each outer row, check whether the value is in the set. Optimization: use a hash set for O(1) average-case lookup.
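
The hash-set lookup also has to cope with NULLs, which the testing strategy lists but this section does not spell out. The sketch below follows standard SQL three-valued logic, which is an assumption about the intended behaviour and not something this spec states. `Value`, `Tri`, `InSet`, `build_in_set`, and `in_subquery` are hypothetical names.

```cpp
#include <optional>
#include <unordered_set>
#include <vector>

using Value = std::optional<long long>;   // hypothetical; nullopt = SQL NULL
enum class Tri { False, True, Unknown };  // SQL three-valued result

// Materialized inner result: non-NULL values go into the hash set,
// NULLs are only remembered as a flag.
struct InSet {
    std::unordered_set<long long> values;
    bool has_null = false;
    bool empty = true;
};

inline InSet build_in_set(const std::vector<Value>& rows) {
    InSet s;
    for (const Value& v : rows) {
        s.empty = false;
        if (v) s.values.insert(*v);
        else s.has_null = true;
    }
    return s;
}

// Evaluate `outer IN (subquery)` against the materialized set.
inline Tri in_subquery(const Value& outer, const InSet& s) {
    if (s.empty) return Tri::False;                 // empty set: FALSE even for NULL
    if (!outer) return Tri::Unknown;                // NULL IN (non-empty) -> UNKNOWN
    if (s.values.count(*outer)) return Tri::True;   // average O(1) hash lookup
    return s.has_null ? Tri::Unknown : Tri::False;  // a NULL in the set hides a miss
}
```

Under these rules `NOT IN` is the logical negation, so a single NULL in the inner result turns every non-matching `NOT IN` test into UNKNOWN and the row is filtered out.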

### 4. Correlated subquery

The inner query references columns from the outer query. It must be re-executed for each outer row.

```sql
SELECT * FROM users u
WHERE age > (SELECT AVG(age) FROM users WHERE dept = u.dept)
```

Execution: for each outer row, bind the outer columns into the inner query's resolver, execute the inner query, and use the result.
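
To make the per-row re-execution concrete, here is the example query above worked by hand over an in-memory table. It is a sketch under stated assumptions: the `User` struct and both function names are hypothetical, and the inner query is hard-coded as a function instead of going through the executor pipeline.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the users table.
struct User {
    std::string name;
    int dept;
    double age;
};

// Inner query: SELECT AVG(age) FROM users WHERE dept = <outer dept>.
// Returns nullopt (SQL NULL) when no row matches.
inline std::optional<double> avg_age_in_dept(const std::vector<User>& users,
                                             int outer_dept) {
    double sum = 0.0;
    std::size_t n = 0;
    for (const User& u : users) {
        if (u.dept == outer_dept) {
            sum += u.age;
            ++n;
        }
    }
    if (n == 0) return std::nullopt;
    return sum / static_cast<double>(n);
}

// Outer query: the inner query runs once per outer row, with u.dept bound.
inline std::vector<std::string> older_than_dept_average(
        const std::vector<User>& users) {
    std::vector<std::string> out;
    for (const User& u : users) {
        std::optional<double> avg = avg_age_in_dept(users, u.dept);
        if (avg && u.age > *avg) out.push_back(u.name);  // NULL comparison filters the row
    }
    return out;
}
```

With N outer rows this performs N inner executions, which is why the cost of a correlated subquery grows with the size of the outer result.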

### 5. Derived table (FROM subquery)

A subquery in the FROM clause that acts as a virtual table.

```sql
SELECT t.name FROM (SELECT name, age FROM users WHERE age > 18) AS t
```

Execution: execute the inner query, materialize the results as a DataSource, and scan from it.

---

## Architecture

### SubqueryExecutor

A new component that the expression evaluator calls when it encounters `NODE_SUBQUERY`:

```cpp
template <Dialect D>
class SubqueryExecutor {
public:
    SubqueryExecutor(PlanExecutor<D>& executor,
                     PlanBuilder<D>& builder,
                     Optimizer<D>& optimizer,
                     Arena& arena);

    // Execute a subquery AST, return result.
    // Scalar: 0 rows -> NULL, 1 row -> its value, more than 1 row -> error.
    Value execute_scalar(const AstNode* subquery_ast,
                         const std::function<Value(StringRef)>& outer_resolve);

    // EXISTS: true if the inner query yields at least one row.
    bool execute_exists(const AstNode* subquery_ast,
                        const std::function<Value(StringRef)>& outer_resolve);

    // IN: materialize the inner query's rows for membership checks.
    ResultSet execute_set(const AstNode* subquery_ast,
                          const std::function<Value(StringRef)>& outer_resolve);
};
```

### Integration with expression evaluator

The expression evaluator's `evaluate_expression()` currently returns `value_null()` for `NODE_SUBQUERY`. It needs access to a `SubqueryExecutor`:

```cpp
template <Dialect D>
Value evaluate_expression(const AstNode* expr,
                          const std::function<Value(StringRef)>& resolve,
                          FunctionRegistry<D>& functions,
                          Arena& arena,
                          SubqueryExecutor<D>* subquery_exec = nullptr);  // NEW optional param
```

When `NODE_SUBQUERY` is encountered and `subquery_exec != nullptr`, call the appropriate method. The subquery type is determined by context, that is, by the parent node: `IN_LIST`, an EXISTS check, or a scalar position.

### Integration with plan builder

Derived tables in FROM: the plan builder recognizes subquery AST nodes in the FROM clause and creates a special `DERIVED_SCAN` plan node:

```cpp
struct {
    PlanNode* inner_plan;  // the subquery's plan
    const char* alias;     // the AS name of the derived table
} derived_scan;
```

### Integration with executor

The `DerivedScanOperator`:
1. On `open()`: execute the inner plan, materialize into a vector of Rows
2. On `next()`: yield rows from the materialized result
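
The two steps above can be sketched as follows. The `Operator` interface, the `Row` alias, and `VectorSource` are hypothetical stand-ins, since the spec does not show the executor's operator API; only the name `DerivedScanOperator` and the open/next split come from the text.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the engine's row type and operator interface.
using Row = std::vector<std::string>;

struct Operator {
    virtual ~Operator() = default;
    virtual void open() = 0;
    virtual bool next(Row& out) = 0;  // returns false once exhausted
};

// Stand-in for the inner plan: yields rows from a fixed vector.
class VectorSource : public Operator {
public:
    explicit VectorSource(std::vector<Row> rows) : rows_(std::move(rows)) {}
    void open() override { pos_ = 0; }
    bool next(Row& out) override {
        if (pos_ >= rows_.size()) return false;
        out = rows_[pos_++];
        return true;
    }
private:
    std::vector<Row> rows_;
    std::size_t pos_ = 0;
};

class DerivedScanOperator : public Operator {
public:
    explicit DerivedScanOperator(Operator& inner) : inner_(inner) {}

    // open(): run the inner plan to completion and keep every row.
    void open() override {
        rows_.clear();
        pos_ = 0;
        inner_.open();
        Row row;
        while (inner_.next(row)) rows_.push_back(row);
    }

    // next(): replay the materialized rows one at a time.
    bool next(Row& out) override {
        if (pos_ >= rows_.size()) return false;
        out = rows_[pos_++];
        return true;
    }

private:
    Operator& inner_;
    std::vector<Row> rows_;
    std::size_t pos_ = 0;
};
```

This sketch stores rows in a plain `std::vector`; under the arena constraint stated earlier, the real operator would presumably allocate the materialized rows from the `Arena` instead.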

### Correlated subquery handling

For correlated subqueries, the inner query's resolver needs access to the outer row. The SubqueryExecutor creates a combined resolver that tries the inner row first and falls back to the outer row. The "found" check must distinguish a missing column from a column whose value is NULL; otherwise a NULL inner column would wrongly fall through to the outer row.

```cpp
auto combined_resolve = [&outer_resolve, &inner_resolve](StringRef name) -> Value {
    // Try inner first (inner table columns take precedence)
    Value v = inner_resolve(name);
    if (/* found in inner */) return v;
    // Fall back to outer
    return outer_resolve(name);
};
```

The inner query is re-executed for each outer row (naive but correct). Optimizations (caching, decorrelation) are deferred.
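
One way to make the "found in inner" check concrete is to have resolvers return an optional, so that a missing column is a different state from a NULL value. The sketch below is an assumption about how that could look, not the engine's API: `Resolver`, `combine_resolvers`, and `resolver_from` are hypothetical, and `Value` is reduced to an integer for brevity.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

using Value = long long;  // hypothetical stand-in for the engine's Value
// nullopt means "no such column in this scope", which is not the same as SQL NULL.
using Resolver = std::function<std::optional<Value>(const std::string&)>;

// Inner scope wins; the outer scope is consulted only on a miss.
inline Resolver combine_resolvers(Resolver inner, Resolver outer) {
    return [inner = std::move(inner), outer = std::move(outer)](
               const std::string& name) -> std::optional<Value> {
        if (std::optional<Value> v = inner(name)) return v;
        return outer(name);
    };
}

// Helper for the example: resolve names against a single row.
inline Resolver resolver_from(std::map<std::string, Value> row) {
    return [row = std::move(row)](const std::string& name) -> std::optional<Value> {
        auto it = row.find(name);
        if (it == row.end()) return std::nullopt;
        return it->second;
    };
}
```

Capturing the resolvers by value keeps the combined resolver valid even if it outlives the scope that created it; capturing by reference, as in the snippet above, is fine only while both originals stay alive.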

---

## Distributed Subqueries

When the outer query and subquery reference tables on different backends:

```sql
-- users on backend_a, orders on backend_b
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders)
```

The distributed planner recognizes the subquery and:
1. Executes the subquery against backend_b: `SELECT user_id FROM orders`
2. Materializes the result locally
3. Rewrites the outer query: `SELECT * FROM users WHERE id IN (1, 2, 3, ...)` with the materialized values
4. Sends the rewritten query to backend_a

For correlated cross-backend subqueries, the engine falls back to row-by-row execution (fetch outer rows, execute inner per row). This is slow but correct.
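
Step 3 of the uncorrelated rewrite can be sketched as a string-building helper. `rewrite_in_predicate` is a hypothetical name, values are limited to integers so that no quoting or escaping is needed, and NULLs in the materialized result are ignored here. The empty-result branch is an assumption: many SQL dialects reject an empty `IN ()` list, so the sketch emits an always-false predicate instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build the predicate that replaces `<column> IN (SELECT ...)` once the
// inner result has been materialized locally.
inline std::string rewrite_in_predicate(const std::string& column,
                                        const std::vector<long long>& values) {
    // Assumption: an empty inner result becomes an always-false predicate,
    // because an empty IN list is not portable SQL.
    if (values.empty()) return "1 = 0";

    std::string out = column + " IN (";
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) out += ", ";
        out += std::to_string(values[i]);
    }
    out += ")";
    return out;
}
```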

---

## File Organization

```
include/sql_engine/
    subquery_executor.h             -- SubqueryExecutor<D>
    operators/
        derived_scan_op.h           -- DerivedScanOperator

    (modify) expression_eval.h      -- Add SubqueryExecutor* parameter
    (modify) plan_node.h            -- Add DERIVED_SCAN node type
    (modify) plan_builder.h         -- Handle subquery in FROM
    (modify) plan_executor.h        -- Build DerivedScanOperator
    (modify) distributed_planner.h  -- Distributed subquery handling

tests/
    test_subquery.cpp               -- All subquery types
```

---

## Testing Strategy

- Scalar subquery: `SELECT (SELECT MAX(age) FROM users)` → correct value
- Scalar subquery returning 0 rows → NULL
- EXISTS: `WHERE EXISTS (SELECT 1 FROM orders WHERE ...)` → correct filtering
- NOT EXISTS → correct filtering
- IN subquery: `WHERE id IN (SELECT user_id FROM orders)` → correct set membership
- NOT IN subquery → correct set exclusion
- Correlated scalar: inner references outer column → re-executed per row
- Derived table: `FROM (SELECT ...) AS t` → works as table source
- Nested subquery: subquery within subquery
- NULL handling: IN with NULLs in subquery result
- Distributed: subquery on a different backend than the outer query

---

## Performance Targets

| Operation | Target |
|---|---|
| Uncorrelated IN subquery (100 values) | <500us (materialization + hash set build) |
| EXISTS subquery | <100us (stops after first row) |
| Scalar subquery | <200us |
| Correlated subquery (100 outer rows) | <50ms (100 inner executions) |
| Derived table (100 rows) | <200us (materialization) |
