| 1 | +# Subquery Execution — Design Specification |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Enable subquery evaluation in the expression evaluator and executor. Currently `NODE_SUBQUERY` returns NULL. After this sub-project, `WHERE id IN (SELECT ...)`, `WHERE EXISTS (SELECT ...)`, scalar subqueries, and correlated subqueries work correctly. |
| 6 | + |
| 7 | +Sub-project 11. Depends on: executor (sub-project 7), backend connections (sub-project 10). |
| 8 | + |
| 9 | +### Goals |
| 10 | + |
| 11 | +- **Uncorrelated subqueries** — `WHERE id IN (SELECT user_id FROM orders)`, `WHERE EXISTS (SELECT 1 FROM ...)`, `SELECT (SELECT MAX(age) FROM users)` |
| 12 | +- **Correlated subqueries** — `WHERE age > (SELECT AVG(age) FROM users WHERE dept = outer.dept)` |
| 13 | +- **Subqueries in FROM** — `SELECT * FROM (SELECT ...) AS t` (derived tables) |
| 14 | +- **Subqueries in expressions** — `SELECT CASE WHEN (SELECT COUNT(*) FROM orders) > 0 THEN 'yes' ELSE 'no' END` |
| 15 | +- **Distributed subqueries** — subquery on different backend than outer query |
| 16 | + |
| 17 | +### Constraints |
| 18 | + |
| 19 | +- C++17 |
| 20 | +- Reuse existing executor pipeline for inner queries |
| 21 | +- Arena-allocated intermediate results |
| 22 | +- No materialized CTEs (WITH clause) — deferred |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Subquery Types |
| 27 | + |
| 28 | +### 1. Scalar subquery |
| 29 | + |
| 30 | +Returns a single value. Used anywhere an expression is expected. |
| 31 | + |
| 32 | +```sql |
| 33 | +SELECT name, (SELECT MAX(total) FROM orders WHERE orders.user_id = users.id) AS max_order |
| 34 | +FROM users |
| 35 | +``` |
| 36 | + |
| 37 | +Execution: run the inner query and verify that it returns at most one row. 0 rows → NULL. 1 row → that row's single value. More than 1 row → error. |
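The cardinality rule above can be sketched as a small reduction over a materialized result. This is only an illustration: `SketchValue`, `SketchRow`, and `scalar_from_rows` are hypothetical names, and an `std::optional` stands in for the engine's `Value` with `nullopt` playing the role of SQL NULL.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the engine's Value: nullopt models SQL NULL.
using SketchValue = std::optional<std::int64_t>;
// One row of a subquery result (a scalar subquery projects one column).
using SketchRow = std::vector<SketchValue>;

// Reduce a materialized subquery result to a scalar:
// 0 rows -> NULL, 1 row -> that row's only column, >1 rows -> error.
inline SketchValue scalar_from_rows(const std::vector<SketchRow>& rows) {
    if (rows.empty()) return std::nullopt;
    if (rows.size() > 1)
        throw std::runtime_error("scalar subquery returned more than one row");
    assert(rows.front().size() == 1 && "scalar subquery must project one column");
    return rows.front().front();
}
```

Raising an exception for the multi-row case is an assumption made for the sketch; the spec only says the case is an error, not how it is reported.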
| 38 | + |
| 39 | +### 2. EXISTS subquery |
| 40 | + |
| 41 | +Returns TRUE if inner query returns at least one row, FALSE otherwise. |
| 42 | + |
| 43 | +```sql |
| 44 | +SELECT * FROM users WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id) |
| 45 | +``` |
| 46 | + |
| 47 | +Execution: run inner query with implicit LIMIT 1 (optimization — only need to check if any row exists). Return value_bool(has_row). |
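As a rough sketch of the early-exit idea, assuming the inner plan can be exposed as a pull-based row source (`RowSource` and `exists_any_row` are hypothetical names, not part of the spec), EXISTS needs only a single pull:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

using SketchRow = std::vector<std::string>;
// Hypothetical pull-based row source: next row, or nullopt when exhausted.
using RowSource = std::function<std::optional<SketchRow>()>;

// EXISTS needs only the first pull: a row means TRUE, exhaustion means FALSE.
inline bool exists_any_row(const RowSource& inner) {
    return inner().has_value();
}
```

Whether the real executor stops early by injecting a LIMIT node or by simply not pulling further is left open here.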
| 48 | + |
| 49 | +### 3. IN subquery |
| 50 | + |
| 51 | +Returns TRUE if outer value matches any row from inner query. |
| 52 | + |
| 53 | +```sql |
| 54 | +SELECT * FROM users WHERE id IN (SELECT user_id FROM orders) |
| 55 | +``` |
| 56 | + |
| 57 | +Execution: materialize the inner query's result into a set, then check each outer row's value against it. Optimization: use a hash set for O(1) average-case lookup. |
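A minimal sketch of the hash-set lookup, including the SQL three-valued result that arises when NULLs appear (the testing strategy lists this case). All names here (`SketchValue`, `Tri`, `InSet`, `build_in_set`, `in_subquery`) are hypothetical, and integer keys are assumed for brevity.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins: nullopt models SQL NULL.
using SketchValue = std::optional<std::int64_t>;
enum class Tri { False, True, Unknown };

// Materialized IN-set: distinct non-NULL values plus a flag for any NULL seen.
struct InSet {
    std::unordered_set<std::int64_t> values;
    bool has_null = false;
};

inline InSet build_in_set(const std::vector<SketchValue>& column) {
    InSet set;
    for (const SketchValue& v : column) {
        if (v) set.values.insert(*v);
        else set.has_null = true;
    }
    return set;
}

// Standard SQL semantics for `probe IN (subquery)`.
inline Tri in_subquery(const SketchValue& probe, const InSet& set) {
    if (set.values.empty() && !set.has_null) return Tri::False;  // empty result
    if (!probe) return Tri::Unknown;                             // NULL IN (non-empty)
    if (set.values.count(*probe) != 0) return Tri::True;         // hash lookup
    return set.has_null ? Tri::Unknown : Tri::False;             // no match
}
```

In a WHERE clause `Unknown` filters the row out just like `False`, but the distinction matters for `NOT IN`, where a NULL in the subquery result makes every non-matching probe `Unknown`.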
| 58 | + |
| 59 | +### 4. Correlated subquery |
| 60 | + |
| 61 | +The inner query references columns from the outer query, so it must be re-executed for each outer row. |
| 62 | + |
| 63 | +```sql |
| 64 | +SELECT * FROM users u |
| 65 | +WHERE age > (SELECT AVG(age) FROM users WHERE dept = u.dept) |
| 66 | +``` |
| 67 | + |
| 68 | +Execution: for each outer row, bind the outer columns into the inner query's resolver, execute inner query, use result. |
| 69 | + |
| 70 | +### 5. Derived table (FROM subquery) |
| 71 | + |
| 72 | +A subquery in the FROM clause that acts as a virtual table. |
| 73 | + |
| 74 | +```sql |
| 75 | +SELECT t.name FROM (SELECT name, age FROM users WHERE age > 18) AS t |
| 76 | +``` |
| 77 | + |
| 78 | +Execution: execute inner query, materialize results as a DataSource, Scan from it. |
| 79 | + |
| 80 | +--- |
| 81 | + |
| 82 | +## Architecture |
| 83 | + |
| 84 | +### SubqueryExecutor |
| 85 | + |
| 86 | +A new component that the expression evaluator calls when it encounters NODE_SUBQUERY: |
| 87 | + |
| 88 | +```cpp |
| 89 | +template <Dialect D> |
| 90 | +class SubqueryExecutor { |
| 91 | +public: |
| 92 | + SubqueryExecutor(PlanExecutor<D>& executor, |
| 93 | + PlanBuilder<D>& builder, |
| 94 | + Optimizer<D>& optimizer, |
| 95 | + Arena& arena); |
| 96 | + |
| 97 | + // Execute a subquery AST, return result |
| 98 | + Value execute_scalar(const AstNode* subquery_ast, |
| 99 | + const std::function<Value(StringRef)>& outer_resolve); |
| 100 | + |
| 101 | + bool execute_exists(const AstNode* subquery_ast, |
| 102 | + const std::function<Value(StringRef)>& outer_resolve); |
| 103 | + |
| 104 | + ResultSet execute_set(const AstNode* subquery_ast, |
| 105 | + const std::function<Value(StringRef)>& outer_resolve); |
| 106 | +}; |
| 107 | +``` |
| 108 | + |
| 109 | +### Integration with expression evaluator |
| 110 | + |
| 111 | +The expression evaluator's `evaluate_expression()` currently returns `value_null()` for `NODE_SUBQUERY`. It needs access to a `SubqueryExecutor`: |
| 112 | + |
| 113 | +```cpp |
| 114 | +template <Dialect D> |
| 115 | +Value evaluate_expression(const AstNode* expr, |
| 116 | + const std::function<Value(StringRef)>& resolve, |
| 117 | + FunctionRegistry<D>& functions, |
| 118 | + Arena& arena, |
| 119 | + SubqueryExecutor<D>* subquery_exec = nullptr); // NEW optional param |
| 120 | +``` |
| 121 | + |
| 122 | +When `NODE_SUBQUERY` is encountered and `subquery_exec != nullptr`, call the appropriate method. The subquery type is determined by context (the parent node — IN_LIST, EXISTS check, or scalar position). |
| 123 | + |
| 124 | +### Integration with plan builder |
| 125 | + |
| 126 | +Derived tables in FROM: the plan builder recognizes subquery AST nodes in the FROM clause and creates a special DERIVED_SCAN plan node: |
| 127 | + |
| 128 | +```cpp |
| 129 | +struct { |
| 130 | + PlanNode* inner_plan; // the subquery's plan |
| 131 | + const char* alias; |
| 132 | +} derived_scan; |
| 133 | +``` |
| 134 | + |
| 135 | +### Integration with executor |
| 136 | + |
| 137 | +The DerivedScanOperator: |
| 138 | +1. On open(): execute the inner plan, materialize into a vector of Rows |
| 139 | +2. On next(): yield rows from the materialized result |
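The open()/next() contract above might look roughly like this. It is a sketch under assumptions: the inner plan is modeled as a hypothetical pull-based `RowSource`, rows are plain string vectors, and `DerivedScanSketch` is not the real `DerivedScanOperator` interface.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using SketchRow = std::vector<std::string>;
// Hypothetical pull-based source: returns the next row, or nullopt when exhausted.
using RowSource = std::function<std::optional<SketchRow>()>;

class DerivedScanSketch {
public:
    explicit DerivedScanSketch(RowSource inner) : inner_(std::move(inner)) {}

    // open(): drain the inner plan once and buffer every row.
    void open() {
        rows_.clear();
        cursor_ = 0;
        while (std::optional<SketchRow> row = inner_()) {
            rows_.push_back(std::move(*row));
        }
    }

    // next(): yield buffered rows in order, then nullopt.
    std::optional<SketchRow> next() {
        if (cursor_ >= rows_.size()) return std::nullopt;
        return rows_[cursor_++];
    }

private:
    RowSource inner_;
    std::vector<SketchRow> rows_;
    std::size_t cursor_ = 0;
};
```

The sketch buffers into a `std::vector`; the spec's constraint of arena-allocated intermediate results would replace that storage in the real operator.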
| 140 | + |
| 141 | +### Correlated subquery handling |
| 142 | + |
| 143 | +For correlated subqueries, the inner query's resolver needs access to the outer row. The SubqueryExecutor creates a combined resolver: |
| 144 | + |
| 145 | +```cpp |
| 146 | +auto combined_resolve = [&outer_resolve, &inner_resolve](StringRef name) -> Value { |
| 147 | + // Try inner first (inner table columns take precedence) |
| 148 | + Value v = inner_resolve(name); |
| 149 | + if (/* found in inner */) return v; |
| 150 | + // Fall back to outer |
| 151 | + return outer_resolve(name); |
| 152 | +}; |
| 153 | +``` |
| 154 | + |
| 155 | +The inner query is re-executed for each outer row (naive but correct). Optimizations such as caching and decorrelation are deferred. |
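One way to make the "found in inner" check concrete is to let each scope report a miss explicitly. The sketch below assumes a hypothetical `Lookup` signature that returns `std::optional`, which differs from the `std::function<Value(StringRef)>` resolvers in the spec; how the real resolvers signal "no such column" is not specified.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical lookup signature: nullopt means "this scope has no such column".
using Lookup = std::function<std::optional<long>(const std::string&)>;

inline Lookup combine_resolvers(Lookup inner, Lookup outer) {
    // Capture by value so the combined resolver can safely outlive the call site.
    return [inner = std::move(inner), outer = std::move(outer)](const std::string& name) {
        if (std::optional<long> v = inner(name)) return v;  // inner scope wins
        return outer(name);                                 // fall back to the outer row
    };
}
```

Separating "not found" from a NULL value matters here: if an inner column legitimately holds NULL, the lookup should stop at the inner scope rather than fall through to an outer column of the same name.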
| 156 | + |
| 157 | +--- |
| 158 | + |
| 159 | +## Distributed Subqueries |
| 160 | + |
| 161 | +When the outer query and subquery reference tables on different backends: |
| 162 | + |
| 163 | +```sql |
| 164 | +-- users on backend_a, orders on backend_b |
| 165 | +SELECT * FROM users WHERE id IN (SELECT user_id FROM orders) |
| 166 | +``` |
| 167 | + |
| 168 | +The distributed planner recognizes the subquery and: |
| 169 | +1. Executes the subquery against backend_b: `SELECT user_id FROM orders` |
| 170 | +2. Materializes the result locally |
| 171 | +3. Rewrites the outer query: `SELECT * FROM users WHERE id IN (1, 2, 3, ...)` with the materialized values |
| 172 | +4. Sends the rewritten query to backend_a |
| 173 | + |
| 174 | +For correlated cross-backend subqueries, the engine falls back to row-by-row execution (fetch outer rows, execute inner per row). This is slow but correct. |
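Step 3 of the uncorrelated rewrite can be sketched as a string-building helper. `render_in_predicate` is a hypothetical name; the sketch assumes non-NULL integer keys (so no quoting or escaping is needed) and substitutes `1 = 0` for an empty result, since an empty `IN ()` list is rejected by many SQL dialects.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: render the outer predicate from materialized subquery values.
inline std::string render_in_predicate(const std::string& column,
                                       const std::vector<std::int64_t>& values) {
    // An empty subquery result matches no row; avoid emitting `IN ()`.
    if (values.empty()) return "1 = 0";
    std::string sql = column + " IN (";
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) sql += ", ";
        sql += std::to_string(values[i]);
    }
    sql += ")";
    return sql;
}
```

For example, `render_in_predicate("id", {1, 2, 3})` yields `id IN (1, 2, 3)`. String or NULL values would need dialect-aware quoting and three-valued handling, which this sketch leaves out.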
| 175 | + |
| 176 | +--- |
| 177 | + |
| 178 | +## File Organization |
| 179 | + |
| 180 | +``` |
| 181 | +include/sql_engine/ |
| 182 | + subquery_executor.h -- SubqueryExecutor<D> |
| 183 | + operators/ |
| 184 | + derived_scan_op.h -- DerivedScanOperator |
| 185 | + |
| 186 | + (modify) expression_eval.h -- Add SubqueryExecutor* parameter |
| 187 | + (modify) plan_node.h -- Add DERIVED_SCAN node type |
| 188 | + (modify) plan_builder.h -- Handle subquery in FROM |
| 189 | + (modify) plan_executor.h -- Build DerivedScanOperator |
| 190 | + (modify) distributed_planner.h -- Distributed subquery handling |
| 191 | + |
| 192 | +tests/ |
| 193 | + test_subquery.cpp -- All subquery types |
| 194 | +``` |
| 195 | + |
| 196 | +--- |
| 197 | + |
| 198 | +## Testing Strategy |
| 199 | + |
| 200 | +- Scalar subquery: `SELECT (SELECT MAX(age) FROM users)` → correct value |
| 201 | +- Scalar subquery returning 0 rows → NULL |
| 202 | +- EXISTS: `WHERE EXISTS (SELECT 1 FROM orders WHERE ...)` → correct filtering |
| 203 | +- NOT EXISTS → correct filtering |
| 204 | +- IN subquery: `WHERE id IN (SELECT user_id FROM orders)` → correct set membership |
| 205 | +- NOT IN subquery → correct |
| 206 | +- Correlated scalar: inner references outer column → re-executed per row |
| 207 | +- Derived table: `FROM (SELECT ...) AS t` → works as table source |
| 208 | +- Nested subquery: subquery within subquery |
| 209 | +- NULL handling: IN with NULLs in subquery result |
| 210 | +- Distributed: subquery on different backend than outer query |
| 211 | + |
| 212 | +--- |
| 213 | + |
| 214 | +## Performance Targets |
| 215 | + |
| 216 | +| Operation | Target | |
| 217 | +|---|---| |
| 218 | +| Uncorrelated IN subquery (100 values) | < 500 µs (materialization + hash set build) | |
| 219 | +| EXISTS subquery | < 100 µs (stops after first row) | |
| 220 | +| Scalar subquery | < 200 µs | |
| 221 | +| Correlated subquery (100 outer rows) | < 50 ms (100 inner executions) | |
| 222 | +| Derived table (100 rows) | < 200 µs (materialization) | |