Commit 149b54e

Add design spec for Tier 1 promotions (INSERT/UPDATE/DELETE), UNION, and query digest
Covers full MySQL + PostgreSQL syntax for INSERT, UPDATE, DELETE deep parsers, compound queries with INTERSECT precedence, and AST-based query digest with token-level fallback for Tier 2 statements.
1 parent b7126b9 commit 149b54e

1 file changed

Lines changed: 380 additions & 0 deletions

@@ -0,0 +1,380 @@
# Tier 1 Promotions, UNION Support & Query Digest: Design Specification

## Overview

Extends the SQL parser with full Tier 1 deep parsing for INSERT, UPDATE, and DELETE (both MySQL and PostgreSQL dialects), adds UNION/INTERSECT/EXCEPT compound query support with recursive nesting, and introduces AST-based query digest/normalization for query-rule matching.

### Goals

- **INSERT Tier 1:** Full AST for INSERT/REPLACE with VALUES, SELECT, SET, ON DUPLICATE KEY UPDATE (MySQL), ON CONFLICT (PostgreSQL), RETURNING (PostgreSQL).
- **UPDATE Tier 1:** Full AST with multi-table JOIN (MySQL), FROM (PostgreSQL), ORDER BY/LIMIT (MySQL), RETURNING (PostgreSQL).
- **DELETE Tier 1:** Full AST with multi-table (MySQL both forms), USING (PostgreSQL), ORDER BY/LIMIT (MySQL), RETURNING (PostgreSQL).
- **Compound queries:** UNION [ALL], INTERSECT [ALL], EXCEPT [ALL] with parenthesized nesting and precedence (INTERSECT binds tighter).
- **Query digest:** AST-based normalization (literals → `?`, IN list collapsing, keyword uppercasing) + 64-bit hash. Works for all statement types including Tier 2 (token-level fallback).

### Constraints

- Same as original spec: C++17 floor, both dialects, sub-microsecond targets, arena allocation, header-only parsers.
- All new parsers follow the established pattern: `XxxParser<D>` header-only template, uses `ExpressionParser<D>`, integrated via `parser.cpp`.
- Emitter extended for all new node types + digest mode.

21+
---
22+
23+
## New NodeType Additions
24+
25+
```cpp
26+
// INSERT nodes
27+
NODE_INSERT_STMT,
28+
NODE_INSERT_COLUMNS, // (col1, col2, ...)
29+
NODE_VALUES_CLAUSE, // VALUES keyword wrapper
30+
NODE_VALUES_ROW, // single (val1, val2, ...) row
31+
NODE_INSERT_SET_CLAUSE, // MySQL INSERT ... SET col=val form
32+
NODE_ON_DUPLICATE_KEY, // MySQL ON DUPLICATE KEY UPDATE
33+
NODE_ON_CONFLICT, // PostgreSQL ON CONFLICT
34+
NODE_CONFLICT_TARGET, // PostgreSQL conflict target (cols or ON CONSTRAINT)
35+
NODE_CONFLICT_ACTION, // DO UPDATE SET ... or DO NOTHING
36+
NODE_RETURNING_CLAUSE, // PostgreSQL RETURNING expr_list
37+
38+
// UPDATE nodes
39+
NODE_UPDATE_STMT,
40+
NODE_UPDATE_SET_CLAUSE, // SET col=expr, col=expr in UPDATE context
41+
NODE_UPDATE_SET_ITEM, // single col=expr pair
42+
43+
// DELETE nodes
44+
NODE_DELETE_STMT,
45+
NODE_DELETE_USING_CLAUSE, // PostgreSQL USING for join-like deletes
46+
47+
// Compound query nodes
48+
NODE_COMPOUND_QUERY, // root for UNION/INTERSECT/EXCEPT
49+
NODE_SET_OPERATION, // operator (UNION, INTERSECT, EXCEPT) with ALL flag
50+
51+
// Statement options (shared)
52+
NODE_STMT_OPTIONS, // LOW_PRIORITY, IGNORE, QUICK, DELAYED, etc.
53+
```
54+
55+
---
56+

---

## INSERT Deep Parser

### MySQL Syntax

```sql
INSERT [LOW_PRIORITY | DELAYED | HIGH_PRIORITY] [IGNORE] [INTO] table_name
    [(col1, col2, ...)]
    { VALUES (row1), (row2), ... | SELECT ... | SET col=val, ... }
    [ON DUPLICATE KEY UPDATE col=expr, col=expr, ...]

REPLACE [LOW_PRIORITY | DELAYED] [INTO] table_name
    [(col1, col2, ...)]
    { VALUES (row1), (row2), ... | SELECT ... | SET col=val, ... }
```

### PostgreSQL Syntax

```sql
INSERT INTO table_name [(col1, col2, ...)]
    { VALUES (row1), (row2), ... | SELECT ... | DEFAULT VALUES }
    [ON CONFLICT [(col1, col2, ...) | ON CONSTRAINT name]
        { DO UPDATE SET col=expr [, ...] [WHERE ...] | DO NOTHING }]
    [RETURNING expr_list]
```

### AST Structure

```
NODE_INSERT_STMT
├── NODE_STMT_OPTIONS (LOW_PRIORITY, IGNORE, etc.)
├── NODE_TABLE_REF (table name, optional schema)
├── NODE_INSERT_COLUMNS (col1, col2, ...)
├── NODE_VALUES_CLAUSE
│   ├── NODE_VALUES_ROW (val1, val2, ...)
│   └── NODE_VALUES_ROW (val1, val2, ...)
│   OR
├── NODE_SELECT_STMT (INSERT ... SELECT)
│   OR
├── NODE_INSERT_SET_CLAUSE (MySQL SET col=val)
│   ├── NODE_UPDATE_SET_ITEM (col = expr)
│   └── NODE_UPDATE_SET_ITEM (col = expr)
├── NODE_ON_DUPLICATE_KEY (MySQL)
│   ├── NODE_UPDATE_SET_ITEM (col = expr)
│   └── NODE_UPDATE_SET_ITEM (col = expr)
│   OR
├── NODE_ON_CONFLICT (PostgreSQL)
│   ├── NODE_CONFLICT_TARGET (cols or ON CONSTRAINT name)
│   └── NODE_CONFLICT_ACTION (DO UPDATE SET ... WHERE ... or DO NOTHING)
└── NODE_RETURNING_CLAUSE (PostgreSQL)
    ├── expression
    └── expression
```

The parser reuses `ExpressionParser` for all value expressions and `SelectParser` for `INSERT ... SELECT`. The `RETURNING` clause uses the same item-list parsing as SELECT's select item list.


---

## UPDATE Deep Parser

### MySQL Syntax

```sql
UPDATE [LOW_PRIORITY] [IGNORE] table_references
    SET col=expr [, col=expr, ...]
    [WHERE condition]
    [ORDER BY ...]
    [LIMIT count]
```

`table_references` can include JOINs, using the same grammar as SELECT's FROM clause.

### PostgreSQL Syntax

```sql
UPDATE [ONLY] table_name [[AS] alias]
    SET col=expr [, col=expr, ...]
    [FROM from_list]
    [WHERE condition]
    [RETURNING expr_list]
```

### AST Structure

```
NODE_UPDATE_STMT
├── NODE_STMT_OPTIONS (LOW_PRIORITY, IGNORE)
├── NODE_FROM_CLAUSE (table references, may include JOINs for MySQL)
├── NODE_UPDATE_SET_CLAUSE
│   ├── NODE_UPDATE_SET_ITEM (col = expr)
│   └── NODE_UPDATE_SET_ITEM (col = expr)
├── NODE_WHERE_CLAUSE
├── NODE_ORDER_BY_CLAUSE (MySQL only)
├── NODE_LIMIT_CLAUSE (MySQL only)
├── NODE_FROM_CLAUSE (PostgreSQL FROM: second FROM node, distinct from table ref)
└── NODE_RETURNING_CLAUSE (PostgreSQL)
```

For MySQL multi-table UPDATE, the table references (with JOINs) reuse the existing `parse_from_clause()` / `parse_join()` logic from SelectParser. For PostgreSQL, the single table is parsed first, then an optional FROM clause provides joined tables.


---

## DELETE Deep Parser

### MySQL Syntax

```sql
-- Single-table:
DELETE [LOW_PRIORITY] [QUICK] [IGNORE] FROM table_name
    [WHERE condition]
    [ORDER BY ...]
    [LIMIT count]

-- Multi-table form 1:
DELETE [LOW_PRIORITY] [QUICK] [IGNORE] t1, t2
    FROM table_references
    [WHERE condition]

-- Multi-table form 2:
DELETE [LOW_PRIORITY] [QUICK] [IGNORE] FROM t1, t2
    USING table_references
    [WHERE condition]
```

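The three MySQL forms above can be told apart with a short lookahead, sketched below under simplifying assumptions. The input is a hypothetical pre-tokenized vector of uppercased keywords and plain identifiers, and `classify_mysql_delete` and `MySqlDeleteForm` are illustrative names, not part of the parser. Form 1 is the only one that names tables before `FROM`; form 2 is the only one with `USING` between `FROM` and `WHERE`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class MySqlDeleteForm { SingleTable, MultiTableFrom, MultiTableUsing, Invalid };

// Assumes tokens are uppercased keywords or plain identifiers (hypothetical
// pre-tokenized input; the real parser reads its own token stream).
inline MySqlDeleteForm classify_mysql_delete(const std::vector<std::string>& toks) {
    std::size_t i = 0;
    if (toks.empty() || toks[i++] != "DELETE") return MySqlDeleteForm::Invalid;
    // Skip the optional statement modifiers.
    while (i < toks.size() &&
           (toks[i] == "LOW_PRIORITY" || toks[i] == "QUICK" || toks[i] == "IGNORE")) {
        ++i;
    }
    if (i >= toks.size()) return MySqlDeleteForm::Invalid;
    // Form 1 lists the target tables before FROM.
    if (toks[i] != "FROM") {
        for (; i < toks.size(); ++i) {
            if (toks[i] == "FROM") return MySqlDeleteForm::MultiTableFrom;
        }
        return MySqlDeleteForm::Invalid;
    }
    if (i + 1 >= toks.size()) return MySqlDeleteForm::Invalid;  // nothing after FROM
    // After FROM, a USING that appears before WHERE marks form 2.
    for (++i; i < toks.size() && toks[i] != "WHERE"; ++i) {
        if (toks[i] == "USING") return MySqlDeleteForm::MultiTableUsing;
    }
    return MySqlDeleteForm::SingleTable;
}

// Usage: a plain single-table delete.
inline void delete_form_demo() {
    assert(classify_mysql_delete({"DELETE", "FROM", "t", "WHERE", "id", "=", "1"}) ==
           MySqlDeleteForm::SingleTable);
}
```

The sketch ignores quoted identifiers and comments; a real implementation would decide on token types rather than string comparisons.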
### PostgreSQL Syntax

```sql
DELETE FROM [ONLY] table_name [[AS] alias]
    [USING using_list]
    [WHERE condition]
    [RETURNING expr_list]
```

### AST Structure

```
NODE_DELETE_STMT
├── NODE_STMT_OPTIONS (LOW_PRIORITY, QUICK, IGNORE)
├── NODE_TABLE_REF (target table(s))
├── NODE_FROM_CLAUSE (multi-table MySQL: source tables with JOINs)
├── NODE_DELETE_USING_CLAUSE (MySQL USING or PostgreSQL USING)
├── NODE_WHERE_CLAUSE
├── NODE_ORDER_BY_CLAUSE (MySQL single-table only)
├── NODE_LIMIT_CLAUSE (MySQL single-table only)
└── NODE_RETURNING_CLAUSE (PostgreSQL)
```


---

## Compound Query Parser (UNION/INTERSECT/EXCEPT)

### Syntax

```sql
select_stmt { UNION | INTERSECT | EXCEPT } [ALL] select_stmt
    [{ UNION | INTERSECT | EXCEPT } [ALL] select_stmt ...]
    [ORDER BY ...] [LIMIT ...]

-- With parenthesized nesting:
(SELECT ...) UNION ALL (SELECT ... INTERSECT SELECT ...) ORDER BY ... LIMIT ...
```

### Precedence

Per the SQL standard, INTERSECT binds tighter than UNION and EXCEPT, so:

```sql
SELECT 1 UNION SELECT 2 INTERSECT SELECT 3
-- Parses as: SELECT 1 UNION (SELECT 2 INTERSECT SELECT 3)
```

Implemented via precedence levels:

- INTERSECT: higher precedence
- UNION, EXCEPT: lower precedence (same level, left-associative)

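The two precedence levels above map naturally onto precedence climbing. The following is a minimal sketch, not the parser's implementation: each operand token stands in for an already-parsed SELECT, the resulting tree is rendered as a parenthesized string, and `SetOpSketch` and `group_set_ops` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model: operands are placeholder names for parsed SELECT statements.
struct SetOpSketch {
    const std::vector<std::string>& toks;
    std::size_t pos = 0;

    static int precedence(const std::string& t) {
        if (t == "INTERSECT") return 2;                // binds tighter
        if (t == "UNION" || t == "EXCEPT") return 1;   // same level
        return 0;                                      // not a set operator
    }

    // Precedence climbing: an operator is consumed only if it binds at least
    // as tightly as min_prec; the right operand is parsed at prec + 1, which
    // makes operators of equal precedence left-associative.
    std::string parse(int min_prec = 1) {
        assert(pos < toks.size());
        std::string left = toks[pos++];
        while (pos < toks.size() && precedence(toks[pos]) >= min_prec) {
            std::string op = toks[pos++];
            const int prec = precedence(op);
            if (pos < toks.size() && toks[pos] == "ALL") {
                op += " ALL";
                ++pos;
            }
            const std::string right = parse(prec + 1);
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }
};

inline std::string group_set_ops(const std::vector<std::string>& toks) {
    SetOpSketch p{toks};
    return p.parse();
}
```

For the example in the text, `group_set_ops({"S1", "UNION", "S2", "INTERSECT", "S3"})` yields `(S1 UNION (S2 INTERSECT S3))`, while `S1 UNION S2 EXCEPT S3` groups left to right as `((S1 UNION S2) EXCEPT S3)`.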
### AST Structure

```
NODE_COMPOUND_QUERY
├── NODE_SET_OPERATION (value="UNION ALL")
│   ├── NODE_SELECT_STMT (left)
│   └── NODE_SELECT_STMT (right)
├── NODE_ORDER_BY_CLAUSE (applies to whole compound)
└── NODE_LIMIT_CLAUSE (applies to whole compound)
```

For nested compounds:

```
NODE_COMPOUND_QUERY
└── NODE_SET_OPERATION (value="UNION")
    ├── NODE_SELECT_STMT (left)
    └── NODE_SET_OPERATION (value="INTERSECT")
        ├── NODE_SELECT_STMT
        └── NODE_SELECT_STMT
```

### Integration

The `parse_select()` method in `Parser<D>` is updated: after parsing the first SELECT, it checks for UNION/INTERSECT/EXCEPT. If found, it wraps the result in a compound query. This is transparent to the caller: `parse()` still returns a `ParseResult`.


---

## Query Digest / Normalization

### API

```cpp
// Declared before Digest so that compute() can return it by value.
struct DigestResult {
    StringRef normalized;  // "SELECT * FROM t WHERE id = ?"
    uint64_t hash;         // 64-bit hash
};

template <Dialect D>
class Digest {
public:
    explicit Digest(Arena& arena);

    // From a parsed AST (Tier 1)
    DigestResult compute(const AstNode* ast);

    // From raw SQL (works for any statement, falls back to token-level for Tier 2)
    DigestResult compute(const char* sql, size_t len);
};
```

### Normalization Rules

1. **Literals → `?`:** Replace `NODE_LITERAL_INT`, `NODE_LITERAL_FLOAT`, `NODE_LITERAL_STRING` with `?`
2. **IN list collapsing:** `IN (?, ?, ?)` → `IN (?)` (ProxySQL convention: lists of any length produce the same digest)
3. **Keyword uppercasing:** All SQL keywords emitted in uppercase canonical form
4. **Whitespace normalization:** Single space between tokens, no leading/trailing
5. **Comment stripping:** Comments already stripped by tokenizer, so this is free
6. **Backtick/quote stripping:** Identifiers emitted without quotes in digest (optional, configurable)

### Token-Level Fallback (Tier 2)

For statements without a full AST (Tier 2 or parse failures), the digest works at the token level:

1. Tokenize the input
2. Walk tokens, emitting each:
   - Keywords → uppercase
   - Identifiers → as-is
   - Literals (TK_INTEGER, TK_FLOAT, TK_STRING) → `?`
   - Operators/punctuation → as-is
3. Collapse consecutive `?` in IN/VALUES lists
4. Hash the result

This ensures digest works for ALL queries, even those the parser doesn't deeply understand.

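Steps 1 to 3 of the fallback can be sketched as follows. This is an illustration under stated assumptions, not the spec's tokenizer: the keyword table is a tiny hypothetical subset, string literals are single-quoted with no escape handling, only the parenthesized list directly after `IN` or `VALUES` is collapsed, and `normalize_tokens` and `digest_text` are invented names. Hashing (step 4) is left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Tokenize, normalize each token, and collapse "?, ?" runs in IN/VALUES lists.
inline std::vector<std::string> normalize_tokens(const std::string& sql) {
    // Tiny illustrative keyword table; the real tokenizer has its own.
    static const std::unordered_set<std::string> kKeywords = {
        "SELECT", "FROM", "WHERE", "IN", "AND", "OR", "INSERT",
        "INTO", "VALUES", "SET", "SHOW", "LIKE"};
    auto uc = [](char ch) { return static_cast<unsigned char>(ch); };

    std::vector<std::string> out;
    std::vector<bool> collapsible;  // one flag per open parenthesis
    std::size_t i = 0;
    while (i < sql.size()) {
        const char c = sql[i];
        if (std::isspace(uc(c))) { ++i; continue; }
        if (c == '\'' || std::isdigit(uc(c))) {
            if (c == '\'') {  // string literal: skip to the closing quote
                ++i;
                while (i < sql.size() && sql[i] != '\'') ++i;
                if (i < sql.size()) ++i;
            } else {          // integer or float literal
                while (i < sql.size() && (std::isdigit(uc(sql[i])) || sql[i] == '.')) ++i;
            }
            out.push_back("?");
            const std::size_t n = out.size();
            const bool in_list = !collapsible.empty() && collapsible.back();
            if (in_list && n >= 3 && out[n - 2] == "," && out[n - 3] == "?") {
                out.resize(n - 2);  // "?, ?" collapses to "?"
            }
        } else if (std::isalpha(uc(c)) || c == '_') {
            const std::size_t start = i;
            while (i < sql.size() && (std::isalnum(uc(sql[i])) || sql[i] == '_')) ++i;
            const std::string word = sql.substr(start, i - start);
            std::string upper = word;
            for (char& ch : upper) ch = static_cast<char>(std::toupper(uc(ch)));
            out.push_back(kKeywords.count(upper) ? upper : word);  // identifiers as-is
        } else {
            if (c == '(') {
                collapsible.push_back(!out.empty() &&
                                      (out.back() == "IN" || out.back() == "VALUES"));
            } else if (c == ')' && !collapsible.empty()) {
                collapsible.pop_back();
            }
            out.push_back(std::string(1, c));
            ++i;
        }
    }
    return out;
}

// Joins tokens with single spaces, keeping "(", ")" and "," tight.
inline std::string digest_text(const std::string& sql) {
    std::string s;
    for (const std::string& t : normalize_tokens(sql)) {
        const bool tight = s.empty() || s.back() == '(' || t == ")" || t == ",";
        if (!tight) s += ' ';
        s += t;
    }
    return s;
}

// Usage: IN lists of different lengths normalize to the same text.
inline void token_digest_demo() {
    assert(digest_text("select * from t where id in (1, 2, 3)") ==
           digest_text("SELECT * FROM t WHERE id IN (42)"));
}
```

Since no grammar is involved, the same routine covers statements the deep parsers do not understand, at the cost of missing context-dependent cases such as non-reserved keywords used as identifiers.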
### Hash Function

64-bit FNV-1a: simple, fast, no external dependency, good distribution. Computed incrementally as the normalized string is built (no second pass).

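A minimal sketch of the incremental computation, using the published 64-bit FNV-1a offset basis and prime. The `Fnv1a64` class and its interface are illustrative assumptions, not the digest module's actual API; the emitter would call `update()` on each fragment as it appends to the normalized string.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Incremental 64-bit FNV-1a: XOR each byte into the state, then multiply by
// the FNV prime. Class name and interface are illustrative only.
class Fnv1a64 {
public:
    void update(std::string_view chunk) {
        for (unsigned char byte : chunk) {
            state_ ^= byte;
            state_ *= kPrime;  // unsigned overflow wraps modulo 2^64 by design
        }
    }
    std::uint64_t value() const { return state_; }

private:
    static constexpr std::uint64_t kOffsetBasis = 0xcbf29ce484222325ULL;
    static constexpr std::uint64_t kPrime = 0x100000001b3ULL;
    std::uint64_t state_ = kOffsetBasis;
};

// One-shot convenience wrapper.
inline std::uint64_t fnv1a64(std::string_view s) {
    Fnv1a64 h;
    h.update(s);
    return h.value();
}

// Usage: feeding fragments one at a time gives the same hash as hashing the
// full normalized string, so no second pass over the output is needed.
inline bool incremental_matches_one_shot() {
    Fnv1a64 h;
    h.update("SELECT * FROM t ");
    h.update("WHERE id = ?");
    return h.value() == fnv1a64("SELECT * FROM t WHERE id = ?");
}
```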

---

## Emitter Extensions

New `emit_*` methods for each new node type, following the same pattern as existing SET/SELECT emission:

- `emit_insert_stmt`, `emit_values_clause`, `emit_values_row`, `emit_on_duplicate_key`, `emit_on_conflict`, `emit_returning`
- `emit_update_stmt`, `emit_update_set_clause`, `emit_update_set_item`
- `emit_delete_stmt`, `emit_delete_using`
- `emit_compound_query`, `emit_set_operation`

The `RETURNING` emitter is shared across INSERT/UPDATE/DELETE.

**Digest mode** is a constructor flag on the emitter:

```cpp
enum class EmitMode : uint8_t { NORMAL, DIGEST };

Emitter(Arena& arena, EmitMode mode = EmitMode::NORMAL,
        const ParamBindings* bindings = nullptr);
```

In digest mode, `emit_literal_*` methods write `?` instead of the actual value, keywords are uppercased, and IN lists are collapsed.

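How a single mode flag changes emission can be shown with a toy emitter. This sketch assumes a simplified value-type AST (`Node`, `Kind`) and a string-returning `EmitterSketch`, all hypothetical; the real emitter is arena-backed and takes `ParamBindings`. It also assumes IN lists contain only literals, so collapsing to the first item is safe.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

enum class EmitMode : std::uint8_t { NORMAL, DIGEST };

// Toy AST: hypothetical stand-in for the real arena-allocated AstNode.
enum class Kind { Keyword, Ident, LiteralInt, InList };
struct Node {
    Kind kind;
    std::string text;            // keyword, identifier, or literal text
    std::vector<Node> children;  // IN-list items
};

inline Node kw(std::string t)    { return Node{Kind::Keyword, std::move(t), {}}; }
inline Node ident(std::string t) { return Node{Kind::Ident, std::move(t), {}}; }
inline Node lit(std::string t)   { return Node{Kind::LiteralInt, std::move(t), {}}; }

class EmitterSketch {
public:
    explicit EmitterSketch(EmitMode mode) : mode_(mode) {}

    std::string emit(const std::vector<Node>& nodes) const {
        std::string out;
        for (const Node& n : nodes) {
            if (!out.empty()) out += ' ';
            emit_node(n, out);
        }
        return out;
    }

private:
    void emit_node(const Node& n, std::string& out) const {
        const bool digest = mode_ == EmitMode::DIGEST;
        switch (n.kind) {
        case Kind::Keyword:  // uppercased only in digest mode
            for (char c : n.text) {
                out += digest ? static_cast<char>(std::toupper(static_cast<unsigned char>(c))) : c;
            }
            break;
        case Kind::Ident:
            out += n.text;
            break;
        case Kind::LiteralInt:  // literal value replaced by a placeholder
            if (digest) out += '?'; else out += n.text;
            break;
        case Kind::InList: {    // digest mode emits only the first item
            const std::size_t count =
                (digest && !n.children.empty()) ? 1 : n.children.size();
            out += "IN (";
            for (std::size_t i = 0; i < count; ++i) {
                if (i != 0) out += ", ";
                emit_node(n.children[i], out);
            }
            out += ')';
            break;
        }
        }
    }

    EmitMode mode_;
};

// Sample input: where id IN (1, 2, 3)
inline std::vector<Node> sample_where_clause() {
    Node list{Kind::InList, "", {lit("1"), lit("2"), lit("3")}};
    return {kw("where"), ident("id"), list};
}

// Usage: the same tree emits differently depending on the mode.
inline void emit_mode_demo() {
    assert(EmitterSketch(EmitMode::DIGEST).emit(sample_where_clause()) == "WHERE id IN (?)");
}
```

Keeping digest as a mode of the existing emitter, rather than a separate tree walker, means every new `emit_*` method gains digest support without extra code paths.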

---

## New Token Additions

```cpp
// Needed for new syntax:
TK_DELAYED,
TK_HIGH_PRIORITY,
TK_DUPLICATE,
TK_KEY,
TK_CONFLICT,
TK_DO,
TK_NOTHING,
TK_RETURNING,
TK_ONLY,            // already exists in enum, verify in keyword tables
TK_EXCEPT,
TK_INTERSECT,
TK_CONSTRAINT,
TK_DEFAULT_VALUES,  // or handle as TK_DEFAULT + TK_VALUES
```

---

## Implementation Plans (separate)

This spec should be implemented across 5 plans:

1. **Plan 7: INSERT deep parser.** INSERT/REPLACE with all syntax, emitter, tests. Closes #5.
2. **Plan 8: UPDATE deep parser.** Full UPDATE syntax, emitter, tests. Closes #6.
3. **Plan 9: DELETE deep parser.** Full DELETE syntax, emitter, tests. Closes #7.
4. **Plan 10: Compound queries.** UNION/INTERSECT/EXCEPT with nesting, emitter, tests. Closes #8.
5. **Plan 11: Query digest.** Digest module with both AST and token-level modes, tests. Closes #9.

Plans 7-9 are independent of each other (can be done in any order). Plan 10 depends on the SELECT parser (already done). Plan 11 depends on the emitter (already done) and benefits from Plans 7-9 being complete (more node types to digest), but can rely on the Tier 2 token-level fallback for statement types not yet promoted to Tier 1.

---

## Performance Targets

| Operation | Target |
|---|---|
| INSERT parse (simple VALUES) | <500ns |
| INSERT parse (multi-row + ON DUPLICATE KEY) | <2us |
| UPDATE parse (simple) | <500ns |
| DELETE parse (simple) | <300ns |
| Compound UNION (2 simple SELECTs) | <1us |
| Query digest (simple SELECT) | <500ns |
| Query digest (token-level, Tier 2) | <200ns |
