| 1 | +# Tier 1 Promotions, UNION Support & Query Digest — Design Specification |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Extends the SQL parser with full Tier 1 deep parsing for INSERT, UPDATE, and DELETE (both MySQL and PostgreSQL dialects), adds UNION/INTERSECT/EXCEPT compound query support with recursive nesting, and introduces AST-based query digest/normalization for query-rule matching. |
| 6 | + |
| 7 | +### Goals |
| 8 | + |
| 9 | +- **INSERT Tier 1:** Full AST for INSERT/REPLACE with VALUES, SELECT, SET, ON DUPLICATE KEY UPDATE (MySQL), ON CONFLICT (PostgreSQL), RETURNING (PostgreSQL). |
| 10 | +- **UPDATE Tier 1:** Full AST with multi-table JOIN (MySQL), FROM (PostgreSQL), ORDER BY/LIMIT (MySQL), RETURNING (PostgreSQL). |
| 11 | +- **DELETE Tier 1:** Full AST with multi-table (MySQL both forms), USING (PostgreSQL), ORDER BY/LIMIT (MySQL), RETURNING (PostgreSQL). |
| 12 | +- **Compound queries:** UNION [ALL], INTERSECT [ALL], EXCEPT [ALL] with parenthesized nesting and precedence (INTERSECT binds tighter). |
| 13 | +- **Query digest:** AST-based normalization (literals → `?`, IN list collapsing, keyword uppercasing) + 64-bit hash. Works for all statement types including Tier 2 (token-level fallback). |
| 14 | + |
| 15 | +### Constraints |
| 16 | + |
| 17 | +- Same as original spec: C++17 floor, both dialects, sub-microsecond targets, arena allocation, header-only parsers. |
| 18 | +- All new parsers follow the established pattern: `XxxParser<D>` header-only template, uses `ExpressionParser<D>`, integrated via `parser.cpp`. |
| 19 | +- Emitter extended for all new node types + digest mode. |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## New NodeType Additions |
| 24 | + |
| 25 | +```cpp |
| 26 | +// INSERT nodes |
| 27 | +NODE_INSERT_STMT, |
| 28 | +NODE_INSERT_COLUMNS, // (col1, col2, ...) |
| 29 | +NODE_VALUES_CLAUSE, // VALUES keyword wrapper |
| 30 | +NODE_VALUES_ROW, // single (val1, val2, ...) row |
| 31 | +NODE_INSERT_SET_CLAUSE, // MySQL INSERT ... SET col=val form |
| 32 | +NODE_ON_DUPLICATE_KEY, // MySQL ON DUPLICATE KEY UPDATE |
| 33 | +NODE_ON_CONFLICT, // PostgreSQL ON CONFLICT |
| 34 | +NODE_CONFLICT_TARGET, // PostgreSQL conflict target (cols or ON CONSTRAINT) |
| 35 | +NODE_CONFLICT_ACTION, // DO UPDATE SET ... or DO NOTHING |
| 36 | +NODE_RETURNING_CLAUSE, // PostgreSQL RETURNING expr_list |
| 37 | + |
| 38 | +// UPDATE nodes |
| 39 | +NODE_UPDATE_STMT, |
| 40 | +NODE_UPDATE_SET_CLAUSE, // SET col=expr, col=expr in UPDATE context |
| 41 | +NODE_UPDATE_SET_ITEM, // single col=expr pair |
| 42 | + |
| 43 | +// DELETE nodes |
| 44 | +NODE_DELETE_STMT, |
| 45 | +NODE_DELETE_USING_CLAUSE, // USING clause: PostgreSQL join-like deletes and MySQL multi-table form 2 |
| 46 | + |
| 47 | +// Compound query nodes |
| 48 | +NODE_COMPOUND_QUERY, // root for UNION/INTERSECT/EXCEPT |
| 49 | +NODE_SET_OPERATION, // operator (UNION, INTERSECT, EXCEPT) with ALL flag |
| 50 | + |
| 51 | +// Statement options (shared) |
| 52 | +NODE_STMT_OPTIONS, // LOW_PRIORITY, IGNORE, QUICK, DELAYED, etc. |
| 53 | +``` |
| 54 | + |
| 55 | +--- |
| 56 | + |
| 57 | +## INSERT Deep Parser |
| 58 | + |
| 59 | +### MySQL Syntax |
| 60 | + |
| 61 | +```sql |
| 62 | +INSERT [LOW_PRIORITY | DELAYED | HIGH_PRIORITY] [IGNORE] [INTO] table_name |
| 63 | + [(col1, col2, ...)] |
| 64 | + { VALUES (row1), (row2), ... | SELECT ... | SET col=val, ... } |
| 65 | + [ON DUPLICATE KEY UPDATE col=expr, col=expr, ...] |
| 66 | + |
| 67 | +REPLACE [LOW_PRIORITY | DELAYED] [INTO] table_name |
| 68 | + [(col1, col2, ...)] |
| 69 | + { VALUES (row1), (row2), ... | SELECT ... | SET col=val, ... } |
| 70 | +``` |
| 71 | + |
| 72 | +### PostgreSQL Syntax |
| 73 | + |
| 74 | +```sql |
| 75 | +INSERT INTO table_name [(col1, col2, ...)] |
| 76 | + { VALUES (row1), (row2), ... | SELECT ... | DEFAULT VALUES } |
| 77 | + [ON CONFLICT [(col1, col2, ...) | ON CONSTRAINT name] |
| 78 | + { DO UPDATE SET col=expr [, ...] [WHERE ...] | DO NOTHING }] |
| 79 | + [RETURNING expr_list] |
| 80 | +``` |
| 81 | + |
| 82 | +### AST Structure |
| 83 | + |
| 84 | +``` |
| 85 | +NODE_INSERT_STMT |
| 86 | + ├── NODE_STMT_OPTIONS (LOW_PRIORITY, IGNORE, etc.) |
| 87 | + ├── NODE_TABLE_REF (table name, optional schema) |
| 88 | + ├── NODE_INSERT_COLUMNS (col1, col2, ...) |
| 89 | + ├── NODE_VALUES_CLAUSE |
| 90 | + │ ├── NODE_VALUES_ROW (val1, val2, ...) |
| 91 | + │ └── NODE_VALUES_ROW (val1, val2, ...) |
| 92 | + │ OR |
| 93 | + ├── NODE_SELECT_STMT (INSERT ... SELECT) |
| 94 | + │ OR |
| 95 | + ├── NODE_INSERT_SET_CLAUSE (MySQL SET col=val) |
| 96 | + │ ├── NODE_UPDATE_SET_ITEM (col = expr) |
| 97 | + │ └── NODE_UPDATE_SET_ITEM (col = expr) |
| 98 | + ├── NODE_ON_DUPLICATE_KEY (MySQL) |
| 99 | + │ ├── NODE_UPDATE_SET_ITEM (col = expr) |
| 100 | + │ └── NODE_UPDATE_SET_ITEM (col = expr) |
| 101 | + │ OR |
| 102 | + ├── NODE_ON_CONFLICT (PostgreSQL) |
| 103 | + │ ├── NODE_CONFLICT_TARGET (cols or ON CONSTRAINT name) |
| 104 | + │ └── NODE_CONFLICT_ACTION (DO UPDATE SET ... WHERE ... or DO NOTHING) |
| 105 | + └── NODE_RETURNING_CLAUSE (PostgreSQL) |
| 106 | + ├── expression |
| 107 | + └── expression |
| 108 | +``` |
| 109 | + |
| 110 | +The parser reuses `ExpressionParser` for all value expressions and `SelectParser` for `INSERT ... SELECT`. The `RETURNING` clause uses the same item-list parsing as SELECT's select item list. |
| 111 | + |
| 112 | +--- |
| 113 | + |
| 114 | +## UPDATE Deep Parser |
| 115 | + |
| 116 | +### MySQL Syntax |
| 117 | + |
| 118 | +```sql |
| 119 | +UPDATE [LOW_PRIORITY] [IGNORE] table_references |
| 120 | + SET col=expr [, col=expr, ...] |
| 121 | + [WHERE condition] |
| 122 | + [ORDER BY ...] |
| 123 | + [LIMIT count] |
| 124 | +``` |
| 125 | + |
| 126 | +`table_references` can include JOINs — same grammar as SELECT's FROM clause. |
| 127 | + |
| 128 | +### PostgreSQL Syntax |
| 129 | + |
| 130 | +```sql |
| 131 | +UPDATE [ONLY] table_name [[AS] alias] |
| 132 | + SET col=expr [, col=expr, ...] |
| 133 | + [FROM from_list] |
| 134 | + [WHERE condition] |
| 135 | + [RETURNING expr_list] |
| 136 | +``` |
| 137 | + |
| 138 | +### AST Structure |
| 139 | + |
| 140 | +``` |
| 141 | +NODE_UPDATE_STMT |
| 142 | + ├── NODE_STMT_OPTIONS (LOW_PRIORITY, IGNORE) |
| 143 | + ├── NODE_FROM_CLAUSE (table references, may include JOINs for MySQL) |
| 144 | + ├── NODE_UPDATE_SET_CLAUSE |
| 145 | + │ ├── NODE_UPDATE_SET_ITEM (col = expr) |
| 146 | + │ └── NODE_UPDATE_SET_ITEM (col = expr) |
| 147 | + ├── NODE_WHERE_CLAUSE |
| 148 | + ├── NODE_ORDER_BY_CLAUSE (MySQL single-table only) |
| 149 | + ├── NODE_LIMIT_CLAUSE (MySQL single-table only) |
| 150 | + ├── NODE_FROM_CLAUSE (PostgreSQL FROM — second FROM node, distinct from table ref) |
| 151 | + └── NODE_RETURNING_CLAUSE (PostgreSQL) |
| 152 | +``` |
| 153 | + |
| 154 | +For MySQL multi-table UPDATE, the table references (with JOINs) reuse the existing `parse_from_clause()` / `parse_join()` logic from SelectParser. For PostgreSQL, the single table is parsed first, then an optional FROM clause provides joined tables. |
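Because each parser is a template on the dialect, the dialect-specific clause handling described above can be resolved at compile time. The following is a minimal sketch of that idea, not the actual `UpdateParser<D>`: the `Dialect` enum values and the `update_tail_clauses` helper are hypothetical stand-ins, and the function only reports which optional clauses would be attempted after `SET`.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in; the real Dialect enum lives in the parser codebase.
enum class Dialect { MySQL, PostgreSQL };

// Lists the optional clauses an UPDATE parser would try after SET, in order.
// Dialect-specific branches are discarded at compile time via `if constexpr`.
template <Dialect D>
std::string update_tail_clauses([[maybe_unused]] bool multi_table) {
    std::string out;
    if constexpr (D == Dialect::PostgreSQL) {
        out += "FROM ";  // PostgreSQL: joined tables come from an optional FROM
    }
    out += "WHERE ";
    if constexpr (D == Dialect::MySQL) {
        // MySQL accepts ORDER BY / LIMIT only for single-table UPDATE.
        if (!multi_table) out += "ORDER_BY LIMIT ";
    } else {
        out += "RETURNING ";
    }
    out.pop_back();  // drop the trailing space
    return out;
}
```

For example, `update_tail_clauses<Dialect::MySQL>(true)` yields only `WHERE`, reflecting that a MySQL multi-table UPDATE takes neither ORDER BY nor LIMIT.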
| 155 | + |
| 156 | +--- |
| 157 | + |
| 158 | +## DELETE Deep Parser |
| 159 | + |
| 160 | +### MySQL Syntax |
| 161 | + |
| 162 | +```sql |
| 163 | +-- Single-table: |
| 164 | +DELETE [LOW_PRIORITY] [QUICK] [IGNORE] FROM table_name |
| 165 | + [WHERE condition] |
| 166 | + [ORDER BY ...] |
| 167 | + [LIMIT count] |
| 168 | + |
| 169 | +-- Multi-table form 1: |
| 170 | +DELETE [LOW_PRIORITY] [QUICK] [IGNORE] t1, t2 |
| 171 | + FROM table_references |
| 172 | + [WHERE condition] |
| 173 | + |
| 174 | +-- Multi-table form 2: |
| 175 | +DELETE [LOW_PRIORITY] [QUICK] [IGNORE] FROM t1, t2 |
| 176 | + USING table_references |
| 177 | + [WHERE condition] |
| 178 | +``` |
| 179 | + |
| 180 | +### PostgreSQL Syntax |
| 181 | + |
| 182 | +```sql |
| 183 | +DELETE FROM [ONLY] table_name [[AS] alias] |
| 184 | + [USING using_list] |
| 185 | + [WHERE condition] |
| 186 | + [RETURNING expr_list] |
| 187 | +``` |
| 188 | + |
| 189 | +### AST Structure |
| 190 | + |
| 191 | +``` |
| 192 | +NODE_DELETE_STMT |
| 193 | + ├── NODE_STMT_OPTIONS (LOW_PRIORITY, QUICK, IGNORE) |
| 194 | + ├── NODE_TABLE_REF (target table(s)) |
| 195 | + ├── NODE_FROM_CLAUSE (multi-table MySQL: source tables with JOINs) |
| 196 | + ├── NODE_DELETE_USING_CLAUSE (MySQL USING or PostgreSQL USING) |
| 197 | + ├── NODE_WHERE_CLAUSE |
| 198 | + ├── NODE_ORDER_BY_CLAUSE (MySQL single-table only) |
| 199 | + ├── NODE_LIMIT_CLAUSE (MySQL single-table only) |
| 200 | + └── NODE_RETURNING_CLAUSE (PostgreSQL) |
| 201 | +``` |
| 202 | + |
| 203 | +--- |
| 204 | + |
| 205 | +## Compound Query Parser (UNION/INTERSECT/EXCEPT) |
| 206 | + |
| 207 | +### Syntax |
| 208 | + |
| 209 | +```sql |
| 210 | +select_stmt { UNION | INTERSECT | EXCEPT } [ALL] select_stmt |
| 211 | + [{ UNION | INTERSECT | EXCEPT } [ALL] select_stmt ...] |
| 212 | + [ORDER BY ...] [LIMIT ...] |
| 213 | + |
| 214 | +-- With parenthesized nesting: |
| 215 | +(SELECT ...) UNION ALL (SELECT ... INTERSECT SELECT ...) ORDER BY ... LIMIT ... |
| 216 | +``` |
| 217 | + |
| 218 | +### Precedence |
| 219 | + |
| 220 | +Per SQL standard: INTERSECT binds tighter than UNION and EXCEPT. So: |
| 221 | + |
| 222 | +```sql |
| 223 | +SELECT 1 UNION SELECT 2 INTERSECT SELECT 3 |
| 224 | +-- Parses as: SELECT 1 UNION (SELECT 2 INTERSECT SELECT 3) |
| 225 | +``` |
| 226 | + |
| 227 | +Implemented via precedence levels: |
| 228 | +- INTERSECT: higher precedence |
| 229 | +- UNION, EXCEPT: lower precedence (same level, left-associative) |
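The two precedence levels can be handled with a small precedence-climbing loop. The sketch below is illustrative only: `SetOpParser` and `group_set_ops` are hypothetical names, operands are opaque strings standing in for SELECT statements, the `ALL` flag is omitted, and the output is a fully parenthesized string rather than `NODE_SET_OPERATION` nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mini-parser over pre-split tokens (assumption for illustration).
struct SetOpParser {
    std::vector<std::string> toks;
    std::size_t pos = 0;

    static int precedence(const std::string& t) {
        if (t == "INTERSECT") return 2;               // binds tighter
        if (t == "UNION" || t == "EXCEPT") return 1;  // same, lower level
        return 0;                                     // not a set operator
    }

    std::string parse_primary() {
        if (pos < toks.size() && toks[pos] == "(") {
            ++pos;  // consume "("
            std::string inner = parse_expr(1);
            if (pos < toks.size() && toks[pos] == ")") ++pos;  // consume ")"
            return inner;
        }
        return pos < toks.size() ? toks[pos++] : std::string("<missing>");
    }

    // Precedence climbing: only consume operators at or above min_prec.
    std::string parse_expr(int min_prec) {
        std::string left = parse_primary();
        while (pos < toks.size()) {
            const int prec = precedence(toks[pos]);
            if (prec == 0 || prec < min_prec) break;
            const std::string op = toks[pos++];
            // prec + 1 makes operators of equal precedence left-associative.
            const std::string right = parse_expr(prec + 1);
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }
};

// Returns the grouping implied by precedence and associativity.
std::string group_set_ops(std::vector<std::string> toks) {
    SetOpParser p{std::move(toks)};
    return p.parse_expr(1);
}
```

With this scheme, `A UNION B INTERSECT C` groups as `(A UNION (B INTERSECT C))`, while `A UNION B EXCEPT C` groups left to right as `((A UNION B) EXCEPT C)`.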
| 230 | + |
| 231 | +### AST Structure |
| 232 | + |
| 233 | +``` |
| 234 | +NODE_COMPOUND_QUERY |
| 235 | + ├── NODE_SET_OPERATION (value="UNION ALL") |
| 236 | + │ ├── NODE_SELECT_STMT (left) |
| 237 | + │ └── NODE_SELECT_STMT (right) |
| 238 | + ├── NODE_ORDER_BY_CLAUSE (applies to whole compound) |
| 239 | + └── NODE_LIMIT_CLAUSE (applies to whole compound) |
| 240 | +``` |
| 241 | + |
| 242 | +For nested compounds: |
| 243 | +``` |
| 244 | +NODE_COMPOUND_QUERY |
| 245 | + └── NODE_SET_OPERATION (value="UNION") |
| 246 | + ├── NODE_SELECT_STMT (left) |
| 247 | + └── NODE_SET_OPERATION (value="INTERSECT") |
| 248 | + ├── NODE_SELECT_STMT |
| 249 | + └── NODE_SELECT_STMT |
| 250 | +``` |
| 251 | + |
| 252 | +### Integration |
| 253 | + |
| 254 | +The `parse_select()` method in `Parser<D>` is updated: after parsing the first SELECT, it checks for UNION/INTERSECT/EXCEPT. If found, it wraps the result in a compound query. This is transparent to the caller — `parse()` still returns a `ParseResult`. |
| 255 | + |
| 256 | +--- |
| 257 | + |
| 258 | +## Query Digest / Normalization |
| 259 | + |
| 260 | +### API |
| 261 | + |
| 262 | +```cpp |
| 263 | +struct DigestResult { |
| 264 | + StringRef normalized; // "SELECT * FROM t WHERE id = ?" |
| 265 | + uint64_t hash; // 64-bit hash |
| 266 | +}; |
| 267 | + |
| 268 | +template <Dialect D> |
| 269 | +class Digest { |
| 270 | +public: |
| 271 | + explicit Digest(Arena& arena); |
| 272 | + |
| 273 | + // From a parsed AST (Tier 1) |
| 274 | + DigestResult compute(const AstNode* ast); |
| 275 | + |
| 276 | + // From raw SQL (works for any statement, falls back to token-level for Tier 2) |
| 277 | + DigestResult compute(const char* sql, size_t len); |
| 278 | +}; |
| 279 | +``` |
| 280 | +
|
| 281 | +### Normalization Rules |
| 282 | +
|
| 283 | +1. **Literals → `?`:** Replace `NODE_LITERAL_INT`, `NODE_LITERAL_FLOAT`, `NODE_LITERAL_STRING` with `?` |
| 284 | +2. **IN list collapsing:** `IN (?, ?, ?)` → `IN (?)` (ProxySQL convention — multiple values produce the same digest) |
| 285 | +3. **Keyword uppercasing:** All SQL keywords emitted in uppercase canonical form |
| 286 | +4. **Whitespace normalization:** Single space between tokens, no leading/trailing |
| 287 | +5. **Comment stripping:** Comments already stripped by tokenizer, so this is free |
| 288 | +6. **Backtick/quote stripping:** Identifiers emitted without quotes in digest (optional, configurable) |
| 289 | +
|
| 290 | +### Token-Level Fallback (Tier 2) |
| 291 | +
|
| 292 | +For statements without a full AST (Tier 2 or parse failures), the digest works at the token level: |
| 293 | +
|
| 294 | +1. Tokenize the input |
| 295 | +2. Walk tokens, emitting each: |
| 296 | + - Keywords → uppercase |
| 297 | + - Identifiers → as-is |
| 298 | + - Literals (TK_INTEGER, TK_FLOAT, TK_STRING) → `?` |
| 299 | + - Operators/punctuation → as-is |
| 300 | +3. Collapse consecutive `?` in IN/VALUES lists |
| 301 | +4. Hash the result |
| 302 | +
|
| 303 | +This ensures digest works for ALL queries, even those the parser doesn't deeply understand. |
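Steps 2 and 3 above can be sketched as a single pass over the token stream. This is a simplified illustration under stated assumptions: the `Kind`/`Token` types and `token_digest` are hypothetical (the real tokenizer emits `TK_*` kinds), input is already tokenized, collapsing applies to any parenthesized run of literals rather than only IN/VALUES lists, and hashing is left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token model (assumption for illustration).
enum class Kind { Keyword, Identifier, Literal, Punct };
struct Token { Kind kind; std::string text; };

std::string token_digest(const std::vector<Token>& tokens) {
    std::vector<std::string> out;
    int depth = 0;  // parenthesis nesting; collapsing happens only inside lists
    for (const Token& t : tokens) {
        switch (t.kind) {
        case Kind::Keyword: {
            std::string up = t.text;
            for (char& c : up)
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            out.push_back(up);
            break;
        }
        case Kind::Literal: {
            const std::size_t n = out.size();
            if (depth > 0 && n >= 2 && out[n - 1] == "," && out[n - 2] == "?") {
                out.pop_back();  // drop the comma; the earlier "?" covers the run
            } else {
                out.push_back("?");
            }
            break;
        }
        case Kind::Punct:
            if (t.text == "(") ++depth;
            if (t.text == ")" && depth > 0) --depth;
            out.push_back(t.text);
            break;
        case Kind::Identifier:
            out.push_back(t.text);  // identifiers pass through as-is
            break;
        }
    }
    // Join with single spaces, but keep "(", ")" and "," tight.
    std::string text;
    for (const std::string& piece : out) {
        const bool glue = text.empty() || piece == "," || piece == ")" ||
                          text.back() == '(';
        if (!glue) text += ' ';
        text += piece;
    }
    return text;
}
```

Fed the tokens of `select * from t where id in (1, 2, 3)`, this produces `SELECT * FROM t WHERE id IN (?)`. Literals outside parentheses, as in `SELECT 1, 2`, are replaced but not collapsed.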
| 304 | +
|
| 305 | +### Hash Function |
| 306 | +
|
| 307 | +64-bit FNV-1a — simple, fast, no external dependency, good distribution. Computed incrementally as the normalized string is built (no second pass). |
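A minimal sketch of incremental 64-bit FNV-1a follows. The class name `Fnv1a64` and the helper `fnv1a64` are hypothetical; the offset basis and prime are the standard FNV-1a 64-bit parameters. Feeding the normalized text chunk by chunk gives the same value as hashing the finished string, which is what makes the single-pass approach possible.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical incremental hasher (assumption for illustration).
class Fnv1a64 {
public:
    // Mix in the next chunk of normalized output, byte by byte.
    void update(std::string_view chunk) {
        for (unsigned char byte : chunk) {
            state_ ^= byte;    // FNV-1a: XOR first...
            state_ *= kPrime;  // ...then multiply (wraps modulo 2^64)
        }
    }

    std::uint64_t value() const { return state_; }

private:
    static constexpr std::uint64_t kOffsetBasis = 14695981039346656037ULL;
    static constexpr std::uint64_t kPrime = 1099511628211ULL;
    std::uint64_t state_ = kOffsetBasis;
};

// One-shot convenience wrapper, useful for comparing against the incremental form.
inline std::uint64_t fnv1a64(std::string_view text) {
    Fnv1a64 h;
    h.update(text);
    return h.value();
}
```

An emitter could call `update()` each time it appends a token, so the hash is ready as soon as the normalized string is complete.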
| 308 | +
|
| 309 | +--- |
| 310 | +
|
| 311 | +## Emitter Extensions |
| 312 | +
|
| 313 | +New `emit_*` methods for each new node type, following the same pattern as existing SET/SELECT emission: |
| 314 | +
|
| 315 | +- `emit_insert_stmt`, `emit_values_clause`, `emit_values_row`, `emit_on_duplicate_key`, `emit_on_conflict`, `emit_returning` |
| 316 | +- `emit_update_stmt`, `emit_update_set_clause`, `emit_update_set_item` |
| 317 | +- `emit_delete_stmt`, `emit_delete_using` |
| 318 | +- `emit_compound_query`, `emit_set_operation` |
| 319 | +
|
| 320 | +The `RETURNING` emitter is shared across INSERT/UPDATE/DELETE. |
| 321 | +
|
| 322 | +**Digest mode** is a constructor flag on the emitter: |
| 323 | +
|
| 324 | +```cpp |
| 325 | +enum class EmitMode : uint8_t { NORMAL, DIGEST }; |
| 326 | +
|
| 327 | +Emitter(Arena& arena, EmitMode mode = EmitMode::NORMAL, |
| 328 | + const ParamBindings* bindings = nullptr); |
| 329 | +``` |
| 330 | + |
| 331 | +In digest mode, `emit_literal_*` methods write `?` instead of the actual value, keywords are uppercased, and IN lists are collapsed. |
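The effect of the mode flag can be illustrated with a toy emitter. This is a sketch under assumptions, not the real `Emitter`: `MiniEmitter` and `render_example` are hypothetical, output goes to a `std::string` instead of the arena, and the IN list is assumed to contain only literals.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class EmitMode : std::uint8_t { NORMAL, DIGEST };

// Hypothetical mini-emitter (assumption for illustration).
class MiniEmitter {
public:
    explicit MiniEmitter(EmitMode mode) : mode_(mode) {}

    void emit_raw(const std::string& text) { out_ += text; }

    // In digest mode every literal becomes a placeholder.
    void emit_literal(const std::string& source_text) {
        if (mode_ == EmitMode::DIGEST) {
            out_ += "?";
        } else {
            out_ += source_text;
        }
    }

    // In digest mode the whole list collapses to a single placeholder.
    void emit_in_list(const std::vector<std::string>& literals) {
        out_ += "IN (";
        if (mode_ == EmitMode::DIGEST) {
            out_ += "?";
        } else {
            for (std::size_t i = 0; i < literals.size(); ++i) {
                if (i != 0) out_ += ", ";
                out_ += literals[i];
            }
        }
        out_ += ")";
    }

    const std::string& str() const { return out_; }

private:
    EmitMode mode_;
    std::string out_;
};

// Emits the same statement in either mode for comparison.
std::string render_example(EmitMode mode) {
    MiniEmitter e(mode);
    e.emit_raw("SELECT * FROM t WHERE a = ");
    e.emit_literal("42");
    e.emit_raw(" AND id ");
    e.emit_in_list({"1", "2", "3"});
    return e.str();
}
```

Keeping the mode as a constructor argument means the same `emit_*` call sites serve both round-trip emission and digest generation, with the branch confined to the literal and list emitters.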
| 332 | + |
| 333 | +--- |
| 334 | + |
| 335 | +## New Token Additions |
| 336 | + |
| 337 | +```cpp |
| 338 | +// Needed for new syntax: |
| 339 | +TK_DELAYED, |
| 340 | +TK_HIGH_PRIORITY, |
| 341 | +TK_DUPLICATE, |
| 342 | +TK_KEY, |
| 343 | +TK_CONFLICT, |
| 344 | +TK_DO, |
| 345 | +TK_NOTHING, |
| 346 | +TK_RETURNING, |
| 347 | +TK_ONLY, // already exists in enum, verify in keyword tables |
| 348 | +TK_EXCEPT, |
| 349 | +TK_INTERSECT, |
| 350 | +TK_CONSTRAINT, |
| 351 | +TK_DEFAULT_VALUES, // or handle as TK_DEFAULT + TK_VALUES |
| 352 | +``` |
| 353 | + |
| 354 | +--- |
| 355 | + |
| 356 | +## Implementation Plans (separate) |
| 357 | + |
| 358 | +This spec should be implemented across 5 plans: |
| 359 | + |
| 360 | +1. **Plan 7: INSERT deep parser** — INSERT/REPLACE with all syntax, emitter, tests. Closes #5. |
| 361 | +2. **Plan 8: UPDATE deep parser** — full UPDATE syntax, emitter, tests. Closes #6. |
| 362 | +3. **Plan 9: DELETE deep parser** — full DELETE syntax, emitter, tests. Closes #7. |
| 363 | +4. **Plan 10: Compound queries** — UNION/INTERSECT/EXCEPT with nesting, emitter, tests. Closes #8. |
| 364 | +5. **Plan 11: Query digest** — Digest module with both AST and token-level modes, tests. Closes #9. |
| 365 | + |
| 366 | +Plans 7-9 are independent of each other (can be done in any order). Plan 10 depends on SELECT parser (already done). Plan 11 depends on the emitter (already done) and benefits from Plans 7-9 being complete (more node types to digest), but can rely on the Tier 2 token-level fallback for statement types that have not yet been promoted to Tier 1. |
| 367 | + |
| 368 | +--- |
| 369 | + |
| 370 | +## Performance Targets |
| 371 | + |
| 372 | +| Operation | Target | |
| 373 | +|---|---| |
| 374 | +| INSERT parse (simple VALUES) | <500ns | |
| 375 | +| INSERT parse (multi-row + ON DUPLICATE KEY) | <2us | |
| 376 | +| UPDATE parse (simple) | <500ns | |
| 377 | +| DELETE parse (simple) | <300ns | |
| 378 | +| Compound UNION (2 simple SELECTs) | <1us | |
| 379 | +| Query digest (simple SELECT) | <500ns | |
| 380 | +| Query digest (token-level, Tier 2) | <200ns | |