Address spec review findings: arena growth, AstNode layout, multi-statement support
- Define arena block-chaining strategy (never realloc, overflow blocks freed on reset)
- Fix AstNode field ordering to achieve 32 bytes with static_assert
- Add multi-statement query handling via ParseResult::remaining
- Clarify PARTIAL semantics for both tiers
- Fix BoundValue: add DATETIME/DECIMAL types, separate float32/float64
- Replace prepare_cache_store with parse_and_cache for safe API
- Add threading note for session migration across threads
- Correct RETURNING clause attribution (DML, not SELECT)
- Mark cross-dialect emission as out of scope
- Mandate StringRef as trivially copyable with static_assert
- Add ErrorInfo lifetime documentation
- Add max query length / arena size documentation
docs/superpowers/specs/2026-03-24-sql-parser-design.md (+59 -24: 59 additions & 24 deletions)
@@ -19,6 +19,10 @@ A high-performance, hand-written recursive descent SQL parser for ProxySQL. Supp
19
19
- Must compile on AlmaLinux 8 (GCC 8.5) through Fedora 43 and macOS (Apple Silicon).
20
20
- Static library or header-only (whichever yields better performance after benchmarking). Linked into ProxySQL at build time.
21
21
22
+
### Migration from Existing POC
23
+
24
+
This project replaces the existing Flex/Bison-based POC parser wholesale. The old parser (Flex lexer, Bison LALR grammar, `std::string`-based AST with `std::vector<AstNode*>` children, per-node heap allocation) is not carried forward. The existing `src/mysql_parser/`, `src/pgsql_parser/`, and `include/` directories will be removed once the new parser is functional. Existing examples in `examples/` serve as a test corpus for validating that the new parser handles the same queries correctly, then they too will be replaced.
25
+
22
26
---
23
27
24
28
## Architecture
@@ -56,16 +60,20 @@ Input SQL bytes
56
60
57
61
### Arena Allocator
58
62
59
-
Each parser instance owns a thread-local arena — a pre-allocated memory block (64KB default, growable). All AST nodes, materialized strings, and temporary data are allocated from the arena. After a query is fully processed, the arena resets (pointer rewind, O(1)). No per-node new/delete.
63
+
Each parser instance owns a thread-local arena — a pre-allocated memory block (64KB default). All AST nodes, materialized strings, and temporary data are allocated from the arena. After a query is fully processed, the arena resets (pointer rewind, O(1)). No per-node new/delete.
64
+
65
+
**Growth strategy:** The arena uses block chaining — never `realloc` (which would invalidate all pointers). When the current block is exhausted, a new block is allocated and linked. `reset()` retains the first (primary) block and frees any overflow blocks. This means `reset()` is O(1) in the common case (single block) and O(n_overflow_blocks) in the rare case. A configurable maximum arena size (default: 1MB) prevents unbounded growth; exceeding it returns `ParseResult::ERROR`.
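The growth strategy above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the class name `Arena`, its members, and the use of `malloc`/`free` are all assumptions, and the constructor arguments simply mirror the 64KB and 1MB defaults from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Illustrative only: every name here is hypothetical, not the project's API.
class Arena {
    // Header placed in front of each block's payload. The alignment keeps the
    // payload that follows the header suitably aligned for any scalar type.
    struct alignas(std::max_align_t) Block {
        Block*      next;
        std::size_t capacity;
        std::size_t used;
        unsigned char* data() { return reinterpret_cast<unsigned char*>(this + 1); }
    };

public:
    explicit Arena(std::size_t block_size = 64 * 1024,
                   std::size_t max_size = 1024 * 1024)
        : block_size_(block_size), max_size_(max_size),
          primary_(new_block(block_size)), current_(primary_), total_(block_size) {
        assert(primary_ != nullptr);
    }
    ~Arena() { free_overflow(); std::free(primary_); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Returns nullptr when the request would push the arena past max_size.
    // Existing allocations never move, because blocks are chained, not realloc'd.
    void* allocate(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);  // round up to 8 bytes
        if (current_->used + n > current_->capacity) {
            std::size_t cap = n > block_size_ ? n : block_size_;
            if (total_ + cap > max_size_) return nullptr;
            Block* b = new_block(cap);
            if (b == nullptr) return nullptr;
            current_->next = b;
            current_ = b;
            total_ += cap;
        }
        void* p = current_->data() + current_->used;
        current_->used += n;
        return p;
    }

    // Keeps the primary block, frees overflow blocks, rewinds the pointer.
    void reset() {
        free_overflow();
        primary_->used = 0;
        current_ = primary_;
        total_ = block_size_;
    }

    std::size_t block_count() const {
        std::size_t n = 0;
        for (const Block* b = primary_; b != nullptr; b = b->next) ++n;
        return n;
    }

private:
    static Block* new_block(std::size_t cap) {
        Block* b = static_cast<Block*>(std::malloc(sizeof(Block) + cap));
        if (b != nullptr) { b->next = nullptr; b->capacity = cap; b->used = 0; }
        return b;
    }
    void free_overflow() {
        Block* b = primary_->next;
        while (b != nullptr) { Block* nx = b->next; std::free(b); b = nx; }
        primary_->next = nullptr;
    }

    std::size_t block_size_;
    std::size_t max_size_;
    Block*      primary_;
    Block*      current_;
    std::size_t total_;
};
```

In this sketch a `nullptr` return from `allocate()` is the point where a real parser would surface `ParseResult::ERROR`. `reset()` only walks the overflow chain, so it does constant work when the query fit in the primary block.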
`StringRef` must remain a trivial type (no constructors, destructors, or virtual functions) to be safely used in unions and to enable memcpy-based operations.
92
+
82
93
Points into the original input buffer. No copies or allocations for identifiers, keywords, or literals. The input SQL string must outlive the parse result (natural in ProxySQL — the query buffer is session-owned).
83
94
84
95
When a string must be materialized (e.g., unescaping a quoted literal), it is allocated from the arena.
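As a rough illustration of the `StringRef` constraints, the sketch below assumes a pointer plus a 32-bit length (the spec only says "pointer + length") and checks the trivial-type requirement at compile time. The helpers `make_ref` and `equals` and the `TokenPayload` union are hypothetical names added for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Assumed layout; the real field names and widths may differ.
struct StringRef {
    const char* ptr;
    uint32_t    len;
};

// Compile-time guards for the "trivial type" requirement.
static_assert(std::is_trivial<StringRef>::value, "StringRef must be trivial");
static_assert(std::is_trivially_copyable<StringRef>::value,
              "StringRef must be trivially copyable");

// Zero-copy view of input[begin, begin + len): no allocation, no copy.
inline StringRef make_ref(const char* input, uint32_t begin, uint32_t len) {
    StringRef r;
    r.ptr = input + begin;
    r.len = len;
    return r;
}

// Compares the referenced bytes against a NUL-terminated literal.
inline bool equals(StringRef s, const char* literal) {
    std::size_t n = std::strlen(literal);
    return s.len == n && std::memcmp(s.ptr, literal, n) == 0;
}

// Because StringRef is trivial it can sit in a union without special handling.
union TokenPayload {
    StringRef text;
    int64_t   integer;
};
```

The view is only valid while the input buffer is alive, which matches the requirement that the SQL string outlive the parse result.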
@@ -89,13 +100,16 @@ Flat, compact struct. No virtual functions, no `std::vector`. Children use an in
89
100
90
101
```cpp
91
102
struct AstNode {
92
-
NodeType type; // enum, 2 bytes
93
-
uint16_t flags; // dialect bits, tier, modifiers
94
-
StringRef value; // pointer + length into input
95
-
AstNode* first_child; // first child (intrusive list)
96
-
AstNode* next_sibling; // next sibling
103
+
AstNode* first_child; // 8 bytes — first child (intrusive list)
// ~32 bytes per node on 64-bit, fits in half a cache line
110
+
// 32 bytes per node on 64-bit — exactly half a cache line.
111
+
// Fields ordered to avoid padding: pointers first, then 4-byte, then 2-byte.
112
+
static_assert(sizeof(AstNode) == 32);
99
113
```
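The diff view hides part of the final field list, so the layout below is only one arrangement that satisfies the stated ordering rule (pointers, then 4-byte, then 2-byte fields). It assumes the string reference is stored as an inline pointer plus a 32-bit length; `SketchNode`, `PaddedNode`, `count_children`, and the enum values are hypothetical. The contrast struct shows how the same fields in a different order pick up padding on a typical 64-bit ABI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class NodeType : uint16_t { NODE_UNKNOWN, NODE_SELECT, NODE_IDENTIFIER };

// Assumed layout: largest alignment first, so no padding is needed.
struct SketchNode {
    SketchNode* first_child;   // 8 bytes
    SketchNode* next_sibling;  // 8 bytes
    const char* value_ptr;     // 8 bytes, points into the input buffer
    uint32_t    value_len;     // 4 bytes
    NodeType    type;          // 2 bytes
    uint16_t    flags;         // 2 bytes
};

// Only meaningful on 64-bit targets, hence the guard on pointer size.
static_assert(sizeof(void*) != 8 || sizeof(SketchNode) == 32,
              "SketchNode should be 32 bytes on 64-bit");

// Same fields, small ones first: the compiler inserts padding before each
// 8-byte member, so this typically grows to 40 bytes on 64-bit.
struct PaddedNode {
    NodeType    type;
    uint16_t    flags;
    const char* value_ptr;
    uint32_t    value_len;
    PaddedNode* first_child;
    PaddedNode* next_sibling;
};

// Walks the intrusive child list: first_child, then next_sibling links.
inline int count_children(const SketchNode* parent) {
    int n = 0;
    for (const SketchNode* c = parent->first_child; c != nullptr; c = c->next_sibling) ++n;
    return n;
}
```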
100
114
101
115
### ParseResult
@@ -107,6 +121,7 @@ struct ParseResult {
107
121
StmtType stmt_type; // always set, even on error (best-effort)
108
122
AstNode* ast; // non-null for Tier 1 OK
109
123
ErrorInfo error; // populated on ERROR/PARTIAL
124
+
StringRef remaining; // unparsed input after semicolon (for multi-statement)
110
125
111
126
// Tier 2 extracted metadata
112
127
StringRef table_name;
@@ -120,7 +135,12 @@ struct ErrorInfo {
120
135
};
121
136
```
122
137
123
-
`PARTIAL` means the classifier succeeded (statement type known) but the deep parser hit a syntax error. ProxySQL can still route on statement type — let the backend report the error to the client.
138
+
`PARTIAL` semantics by tier:
139
+
- **Tier 1:** classifier succeeded (statement type known) but the deep parser hit a syntax error. The AST may be partially populated. ProxySQL can still route on statement type — let the backend report the error.
140
+
- **Tier 2:** classifier succeeded but the extractor could not find expected metadata (e.g., `INSERT INTO` with no table name following). `stmt_type` is set but metadata fields may be empty.
141
+
- `ERROR` means the first token could not be classified at all (e.g., binary garbage or empty input).
142
+
143
+
**Lifetime note:** `ErrorInfo::message` points to arena-allocated memory. It becomes invalid after `parser.reset()`. Consumers must copy the message if they need it beyond the parse lifecycle.
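A caller-side sketch of these semantics might look like the following. The types are simplified stand-ins for the structs in this spec, and the `Action` enum and `decide()` function are hypothetical; in particular, forwarding unclassified input on `ERROR` is an assumption based on the lenient-by-design goal, not something the spec mandates.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified stand-ins for the spec's types.
struct StringRef { const char* ptr; uint32_t len; };
enum class Status { OK, PARTIAL, ERROR };
enum class StmtType { UNKNOWN, SELECT, INSERT, SET };
struct ErrorInfo { StringRef message; };
struct ParseResult {
    Status    status;
    StmtType  stmt_type;
    ErrorInfo error;
};

// Hypothetical routing outcomes for the caller.
enum class Action { USE_FULL_RESULT, ROUTE_BY_STMT_TYPE, FORWARD_UNCLASSIFIED };

// Decides how to route and copies the error text out of arena memory, since
// the message becomes invalid once the parser is reset.
inline Action decide(const ParseResult& r, std::string* saved_error) {
    switch (r.status) {
    case Status::OK:
        return Action::USE_FULL_RESULT;
    case Status::PARTIAL:
        saved_error->assign(r.error.message.ptr, r.error.message.len);
        return Action::ROUTE_BY_STMT_TYPE;   // stmt_type is still usable
    case Status::ERROR:
        saved_error->assign(r.error.message.ptr, r.error.message.len);
        return Action::FORWARD_UNCLASSIFIED; // assumption: backend reports it
    }
    return Action::FORWARD_UNCLASSIFIED;
}
```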
124
144
125
145
---
126
146
@@ -179,7 +199,7 @@ struct KeywordEntry {
179
199
180
200
### Lookahead
181
201
182
-
One-token lookahead via `peek()`. Cached internally. Sufficient for recursive descent and classification.
202
+
The tokenizer provides single-token lookahead via `peek()`, cached internally. Statement parsers that need multi-token disambiguation (e.g., `SET TRANSACTION` vs `SET var = ...`, or `INSERT INTO ... SELECT` vs `INSERT INTO ... VALUES`) handle this by consuming tokens and using the parser's own state to disambiguate — no backtracking needed. For example, the SET parser consumes the second token; if it's `TRANSACTION`, it enters `parse_set_transaction()`, otherwise it treats the consumed token as the start of a variable target. This is standard recursive descent practice and does not require a multi-token lookahead buffer in the tokenizer itself.
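The consume-then-branch approach described here can be illustrated with a toy tokenizer. Everything below is a simplified assumption: the real tokenizer works on `StringRef` tokens rather than `std::string`, and this sketch only recognises the `SET var = ...` form with a plain `=`, so inputs such as `SET NAMES utf8` are reported as `INVALID`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Toy tokenizer: whitespace-separated words, with '=' as its own token.
class Tokenizer {
public:
    explicit Tokenizer(const std::string& s) : src_(s) {}

    // Single-token lookahead, cached until the next call to next().
    const std::string& peek() {
        if (!has_peek_) { peeked_ = scan(); has_peek_ = true; }
        return peeked_;
    }
    std::string next() {
        if (has_peek_) { has_peek_ = false; return peeked_; }
        return scan();
    }

private:
    std::string scan() {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        if (pos_ >= src_.size()) return "";
        if (src_[pos_] == '=') { ++pos_; return "="; }
        std::size_t start = pos_;
        while (pos_ < src_.size() && src_[pos_] != '=' &&
               !std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        return src_.substr(start, pos_ - start);
    }
    std::string src_;
    std::size_t pos_ = 0;
    bool        has_peek_ = false;
    std::string peeked_;
};

enum class SetKind { TRANSACTION, VARIABLE, INVALID };

inline std::string upper(std::string s) {
    for (char& c : s) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}

// Consumes the second token and branches on it; no backtracking is needed
// because the consumed token is reused as the variable target.
inline SetKind parse_set(Tokenizer& tz, std::string* target) {
    if (upper(tz.next()) != "SET") return SetKind::INVALID;
    std::string second = tz.next();
    if (second.empty()) return SetKind::INVALID;
    if (upper(second) == "TRANSACTION") return SetKind::TRANSACTION;  // parse_set_transaction() would start here
    *target = second;
    if (tz.peek() != "=") return SetKind::INVALID;
    tz.next();  // consume '='
    return SetKind::VARIABLE;
}
```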
183
203
184
204
---
185
205
@@ -256,7 +276,8 @@ parse_select()
256
276
- `::` type cast
257
277
- `LIMIT ALL` vs `LIMIT` with expression
258
278
- Dollar-quoted strings
259
-
-`RETURNING` clause
279
+
280
+
Note: PostgreSQL's `RETURNING` clause applies to INSERT/UPDATE/DELETE, not SELECT. It will be handled when those statements are promoted to Tier 1. Until then, Tier 2 extractors for those statements will detect `RETURNING` and include it in metadata but not build an AST for the returned expressions.
260
281
261
282
### SET Parser
262
283
@@ -286,7 +307,8 @@ Each `NodeType` has a corresponding emit function. The emitter is dialect-templa
286
307
287
308
- `StringRef` values are emitted directly from the original input (no copy unless the node was modified).
288
309
- Modified nodes emit their new values.
289
-
- Theoretically supports cross-dialect emission (parse MySQL → emit PostgreSQL) for future query translation.
310
+
311
+
Cross-dialect emission (parse MySQL → emit PostgreSQL) is **out of scope** for the initial design. Many constructs have no direct equivalent across dialects (`SQL_CALC_FOUND_ROWS`, backtick quoting, `LIMIT` syntax differences). The emitter always emits in the same dialect it parsed.
SQL template parsed normally. Placeholder tokens (`?` in MySQL, `$1`/`$2` in PostgreSQL) become `NODE_PLACEHOLDER` AST nodes with a parameter index in `flags`.
308
330
309
-
The AST is copied from the arena to a longer-lived **statement cache** (per-parser-instance, keyed by statement ID). This is the one place where memory leaves the arena.
331
+
The AST is copied from the arena to a longer-lived **statement cache** (per-parser-instance, keyed by statement ID) via `parse_and_cache()`, which atomically parses and stores the result before the arena can be reset. This is the one place where memory leaves the arena.
332
+
333
+
**Threading note:** The statement cache is per-parser-instance (i.e., per-thread). In ProxySQL, prepared statement state is per-session. If sessions can migrate between threads, the session must carry its own prepared statement metadata (statement IDs, SQL templates). The parser on the destination thread can re-parse and cache the template on first execute if the cached AST is not found. This avoids any cross-thread sharing of parser state.
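The re-parse-on-miss behaviour in the threading note could be sketched as below. This is a toy model under stated assumptions: `Parser`, `CachedStatement`, `Session`, and `statement_for_execute` are hypothetical stand-ins, the "AST" is reduced to a placeholder count, and the placeholder counter naively counts every `?` without skipping string literals.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for a cached AST: just the template and its placeholder count.
struct CachedStatement {
    std::string sql_template;
    int         placeholder_count;
};

// One instance per thread; nothing here is shared across threads.
class Parser {
public:
    // Stand-in for parse_and_cache(): parse and store in a single call.
    const CachedStatement& parse_and_cache(uint32_t stmt_id, const std::string& sql) {
        CachedStatement s{sql, count_placeholders(sql)};
        ++parse_count_;
        return cache_[stmt_id] = std::move(s);
    }
    const CachedStatement* find(uint32_t stmt_id) const {
        auto it = cache_.find(stmt_id);
        return it == cache_.end() ? nullptr : &it->second;
    }
    int parse_count() const { return parse_count_; }

private:
    // Naive: counts every '?', including any inside string literals.
    static int count_placeholders(const std::string& sql) {
        int n = 0;
        for (char c : sql) if (c == '?') ++n;
        return n;
    }
    std::unordered_map<uint32_t, CachedStatement> cache_;
    int parse_count_ = 0;
};

// Per-session metadata that travels with the session between threads.
struct Session {
    uint32_t    stmt_id;
    std::string sql_template;
};

// On execute: use the local cache, or re-parse from the session's template.
inline const CachedStatement& statement_for_execute(Parser& local, const Session& s) {
    if (const CachedStatement* hit = local.find(s.stmt_id)) return *hit;
    return local.parse_and_cache(s.stmt_id, s.sql_template);
}
```

The cost of a migration in this model is one extra parse on the destination thread, after which executes hit the local cache again.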
310
334
311
335
### Execute Phase
312
336
313
337
```cpp
314
338
struct BoundValue {
315
-
enum Type { INT, FLOAT, DOUBLE, STRING, BLOB, NULL_VAL };
@@ -349,6 +376,14 @@ The parser never throws exceptions. Errors are reported through `ParseResult::st
349
376
350
377
**Lenient by design:** ProxySQL doesn't need to reject queries — the backend does. The parser extracts as much useful information as possible and degrades gracefully.
351
378
379
+
### Multi-Statement Queries
380
+
381
+
ProxySQL regularly receives semicolon-separated multi-statement queries (e.g., `SET autocommit=0; BEGIN`). The parser handles this by parsing the **first statement** and returning its `ParseResult` along with a `remaining` field (`StringRef` pointing to the unparsed tail after the semicolon). The caller is responsible for calling `parse()` again on the remainder if needed. This avoids allocating a list of results and lets the caller decide whether to parse subsequent statements.
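A caller loop over `remaining` might look like the sketch below. The `parse()` function here is a stand-in with an assumed signature: it splits at the first semicolon with `memchr` and ignores quoting and comments, which the real parser would have to handle. `split_statements` and `trimmed` are hypothetical helpers for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct StringRef { const char* ptr; uint32_t len; };

// Reduced stand-in for ParseResult: only the fields this example needs.
struct ParseResult {
    StringRef statement;  // text of the first statement (hypothetical field)
    StringRef remaining;  // unparsed tail after the semicolon
};

// Naive stand-in for parse(): splits at the first ';' and nothing more.
inline ParseResult parse(const char* sql, uint32_t len) {
    ParseResult r;
    const void* semi = std::memchr(sql, ';', len);
    if (semi == nullptr) {
        r.statement = {sql, len};
        r.remaining = {sql + len, 0};
        return r;
    }
    uint32_t n = static_cast<uint32_t>(static_cast<const char*>(semi) - sql);
    r.statement = {sql, n};
    r.remaining = {sql + n + 1, len - n - 1};
    return r;
}

// Copies a StringRef into a std::string with surrounding spaces removed.
inline std::string trimmed(StringRef s) {
    const char* b = s.ptr;
    const char* e = s.ptr + s.len;
    while (b < e && *b == ' ') ++b;
    while (e > b && e[-1] == ' ') --e;
    return std::string(b, e);
}

// The caller decides whether to keep going; here it parses every statement.
inline std::vector<std::string> split_statements(const char* sql, uint32_t len) {
    std::vector<std::string> out;
    StringRef rest{sql, len};
    while (rest.len > 0) {
        ParseResult r = parse(rest.ptr, rest.len);
        std::string stmt = trimmed(r.statement);
        if (!stmt.empty()) out.push_back(stmt);
        rest = r.remaining;
    }
    return out;
}
```

Because `remaining` points into the original buffer, the loop allocates nothing for the parse itself; only the example's `std::string` copies allocate.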
382
+
383
+
### Maximum Query Length
384
+
385
+
The parser respects the caller's buffer size (the `len` parameter). It does not impose its own maximum query length — that is ProxySQL's responsibility (via `mysql-max_allowed_packet` or equivalent). The arena's maximum size (default 1MB) provides an implicit bound on the complexity of parseable queries; exceeding it returns `ERROR`.