Commit 8b07819

Address spec review findings: arena growth, AstNode layout, multi-statement support
- Define arena block-chaining strategy (never realloc, overflow blocks freed on reset)
- Fix AstNode field ordering to achieve 32 bytes with static_assert
- Add multi-statement query handling via ParseResult::remaining
- Clarify PARTIAL semantics for both tiers
- Fix BoundValue: add DATETIME/DECIMAL types, separate float32/float64
- Replace prepare_cache_store with parse_and_cache for safe API
- Add threading note for session migration across threads
- Correct RETURNING clause attribution (DML, not SELECT)
- Mark cross-dialect emission as out of scope
- Mandate StringRef as trivially copyable with static_assert
- Add ErrorInfo lifetime documentation
- Add max query length / arena size documentation
1 parent 8967a98 commit 8b07819

1 file changed

Lines changed: 59 additions & 24 deletions

docs/superpowers/specs/2026-03-24-sql-parser-design.md

@@ -19,6 +19,10 @@ A high-performance, hand-written recursive descent SQL parser for ProxySQL. Supp
1919
- Must compile on AlmaLinux 8 (GCC 8.5) through Fedora 43 and macOS (Apple Silicon).
2020
- Static library or header-only (whichever yields better performance after benchmarking). Linked into ProxySQL at build time.
2121

22+
### Migration from Existing POC
23+
24+
This project replaces the existing Flex/Bison-based POC parser wholesale. The old parser (Flex lexer, Bison LALR grammar, `std::string`-based AST with `std::vector<AstNode*>` children, per-node heap allocation) is not carried forward. The existing `src/mysql_parser/`, `src/pgsql_parser/`, and `include/` directories will be removed once the new parser is functional. Existing examples in `examples/` serve as a test corpus for validating that the new parser handles the same queries correctly, then they too will be replaced.
25+
2226
---
2327

2428
## Architecture
@@ -56,16 +60,20 @@ Input SQL bytes
5660

5761
### Arena Allocator
5862

59-
Each parser instance owns a thread-local arena — a pre-allocated memory block (64KB default, growable). All AST nodes, materialized strings, and temporary data are allocated from the arena. After a query is fully processed, the arena resets (pointer rewind, O(1)). No per-node new/delete.
63+
Each parser instance owns a thread-local arena — a pre-allocated memory block (64KB default). All AST nodes, materialized strings, and temporary data are allocated from the arena. After a query is fully processed, the arena resets (pointer rewind, O(1)). No per-node new/delete.
64+
65+
**Growth strategy:** The arena uses block chaining — never `realloc` (which would invalidate all pointers). When the current block is exhausted, a new block is allocated and linked. `reset()` retains the first (primary) block and frees any overflow blocks. This means `reset()` is O(1) in the common case (single block) and O(n_overflow_blocks) in the rare case. A configurable maximum arena size (default: 1MB) prevents unbounded growth; exceeding it returns `ParseResult::ERROR`.
6066

6167
```
62-
┌─────────────────────────────────────────┐
63-
│ Arena (64KB) │
64-
│ [AstNode][AstNode][string][AstNode]... │
65-
│ ^ │
66-
│ cursor │
67-
│ reset() → cursor = start │
68-
└─────────────────────────────────────────┘
68+
┌──────────────────────┐ ┌──────────────────────┐
69+
│ Block 1 (64KB) │───►│ Block 2 (overflow) │───► ...
70+
│ [AstNode][string].. │ │ [AstNode][string].. │
71+
│ ^ │ │ ^ │
72+
│ cursor │ │ cursor │
73+
│ │ │ │
74+
│ reset() → cursor=0, │ │ (freed on reset) │
75+
│ free overflow blocks│ │ │
76+
└──────────────────────┘ └──────────────────────┘
6977
```
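
The growth strategy above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Arena` class, its member names, the 8-byte alignment, and the use of `malloc`/`free` are all assumptions made for the example. Only the behaviour (chain instead of `realloc`, keep the primary block on reset, fail once the maximum size is reached) comes from the spec text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical block-chained arena; all names here are illustrative.
class Arena {
public:
    explicit Arena(std::size_t block_size = 64 * 1024,
                   std::size_t max_total = 1024 * 1024)
        : block_size_(block_size), max_total_(max_total),
          primary_(new_block(block_size)), current_(primary_),
          total_(block_size) {
        assert(max_total >= block_size);
    }
    ~Arena() { reset(); std::free(primary_); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Bump-allocates n bytes. Returns nullptr if malloc fails or the
    // configured maximum would be exceeded (the caller would report ERROR).
    void* allocate(std::size_t n) {
        if (primary_ == nullptr) return nullptr;
        n = (n + 7) & ~static_cast<std::size_t>(7);  // round up to 8 bytes
        if (current_->used + n > current_->capacity) {
            const std::size_t cap = n > block_size_ ? n : block_size_;
            if (total_ + cap > max_total_) return nullptr;
            Block* b = new_block(cap);
            if (b == nullptr) return nullptr;
            current_->next = b;  // chain a new block; old pointers stay valid
            current_ = b;
            total_ += cap;
        }
        void* p = current_->data() + current_->used;
        current_->used += n;
        return p;
    }

    // Frees overflow blocks, then rewinds the primary block's cursor.
    void reset() {
        if (primary_ == nullptr) return;
        for (Block* b = primary_->next; b != nullptr;) {
            Block* next = b->next;
            std::free(b);
            b = next;
        }
        primary_->next = nullptr;
        primary_->used = 0;
        current_ = primary_;
        total_ = block_size_;
    }

    std::size_t block_count() const {
        std::size_t n = 0;
        for (const Block* b = primary_; b != nullptr; b = b->next) ++n;
        return n;
    }

private:
    struct Block {
        Block* next;
        std::size_t capacity;
        std::size_t used;
        // Payload bytes live directly after the header.
        unsigned char* data() { return reinterpret_cast<unsigned char*>(this + 1); }
    };
    static Block* new_block(std::size_t cap) {
        Block* b = static_cast<Block*>(std::malloc(sizeof(Block) + cap));
        if (b != nullptr) { b->next = nullptr; b->capacity = cap; b->used = 0; }
        return b;
    }

    std::size_t block_size_;
    std::size_t max_total_;
    Block* primary_;
    Block* current_;
    std::size_t total_;
};
```

In this sketch an allocation that does not fit simply abandons the tail of the current block; a real allocator might handle oversized requests differently.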
7078

7179
### StringRef (Zero-Copy)
@@ -77,8 +85,11 @@ struct StringRef {
7785

7886
// Comparison, hashing helpers
7987
};
88+
static_assert(std::is_trivially_copyable_v<StringRef>);
8089
```
8190
91+
`StringRef` must remain a trivial type (no constructors, destructors, or virtual functions) to be safely used in unions and to enable memcpy-based operations.
92+
8293
Points into the original input buffer. No copies or allocations for identifiers, keywords, or literals. The input SQL string must outlive the parse result (natural in ProxySQL — the query buffer is session-owned).
8394
8495
When a string must be materialized (e.g., unescaping a quoted literal), it is allocated from the arena.
@@ -89,13 +100,16 @@ Flat, compact struct. No virtual functions, no `std::vector`. Children use an in
89100
90101
```cpp
91102
struct AstNode {
92-
NodeType type; // enum, 2 bytes
93-
uint16_t flags; // dialect bits, tier, modifiers
94-
StringRef value; // pointer + length into input
95-
AstNode* first_child; // first child (intrusive list)
96-
AstNode* next_sibling; // next sibling
103+
AstNode* first_child; // 8 bytes — first child (intrusive list)
104+
AstNode* next_sibling; // 8 bytes — next sibling
105+
const char* value_ptr; // 8 bytes — pointer into input (inlined StringRef)
106+
uint32_t value_len; // 4 bytes — length
107+
NodeType type; // 2 bytes — enum
108+
uint16_t flags; // 2 bytes — dialect bits, tier, modifiers
97109
};
98-
// ~32 bytes per node on 64-bit, fits in half a cache line
110+
// 32 bytes per node on 64-bit — exactly half a cache line.
111+
// Fields ordered to avoid padding: pointers first, then 4-byte, then 2-byte.
112+
static_assert(sizeof(AstNode) == 32);
99113
```
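
As a rough illustration of how the intrusive child list in this layout might be used, the sketch below repeats the struct and adds two helper functions. The `NodeType` enumerators and the helpers `append_child` and `child_count` are hypothetical names invented for the example; the 32-byte size check assumes a 64-bit target and a 2-byte underlying type for `NodeType`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical enumerators; the spec only states that NodeType is 2 bytes.
enum class NodeType : std::uint16_t { Select, Column, Table };

struct AstNode {
    AstNode* first_child;     // 8 bytes
    AstNode* next_sibling;    // 8 bytes
    const char* value_ptr;    // 8 bytes
    std::uint32_t value_len;  // 4 bytes
    NodeType type;            // 2 bytes
    std::uint16_t flags;      // 2 bytes
};
static_assert(sizeof(void*) == 8, "layout assumes a 64-bit target");
static_assert(sizeof(AstNode) == 32, "AstNode must stay at 32 bytes");
static_assert(std::is_trivially_copyable<AstNode>::value,
              "AstNode must be copyable with memcpy");

// Appends child at the end of parent's sibling chain (linear in child count).
inline void append_child(AstNode& parent, AstNode& child) {
    assert(&parent != &child);
    child.next_sibling = nullptr;
    AstNode** slot = &parent.first_child;
    while (*slot != nullptr) slot = &(*slot)->next_sibling;
    *slot = &child;
}

// Walks the sibling chain and counts the direct children.
inline std::size_t child_count(const AstNode& parent) {
    std::size_t n = 0;
    for (const AstNode* c = parent.first_child; c != nullptr; c = c->next_sibling) ++n;
    return n;
}
```

A parser that appends many children would probably keep a tail pointer on its own stack rather than re-walking the list; that is left out here to keep the node at 32 bytes.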
100114

101115
### ParseResult
@@ -107,6 +121,7 @@ struct ParseResult {
107121
StmtType stmt_type; // always set, even on error (best-effort)
108122
AstNode* ast; // non-null for Tier 1 OK
109123
ErrorInfo error; // populated on ERROR/PARTIAL
124+
StringRef remaining; // unparsed input after semicolon (for multi-statement)
110125

111126
// Tier 2 extracted metadata
112127
StringRef table_name;
@@ -120,7 +135,12 @@ struct ErrorInfo {
120135
};
121136
```
122137
123-
`PARTIAL` means the classifier succeeded (statement type known) but the deep parser hit a syntax error. ProxySQL can still route on statement type — let the backend report the error to the client.
138+
`PARTIAL` semantics by tier:
139+
- **Tier 1:** classifier succeeded (statement type known) but the deep parser hit a syntax error. The AST may be partially populated. ProxySQL can still route on statement type — let the backend report the error.
140+
- **Tier 2:** classifier succeeded but the extractor could not find expected metadata (e.g., `INSERT INTO` with no table name following). `stmt_type` is set but metadata fields may be empty.
141+
- `ERROR` means the first token could not be classified at all (e.g., binary garbage or empty input).
142+
143+
**Lifetime note:** `ErrorInfo::message` points to arena-allocated memory. It becomes invalid after `parser.reset()`. Consumers must copy the message if they need it beyond the parse lifecycle.
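
A consumer might act on these outcomes and on the lifetime rule roughly as sketched below. The struct shapes are assumptions for illustration: the spec names `ErrorInfo::message` and the `OK`/`PARTIAL`/`ERROR` outcomes, but the `Status` enum, the `status` field, the `StringRef` field names, and the helpers `choose_routing` and `copy_error_message` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed shapes, reduced to the fields this example needs.
struct StringRef { const char* ptr; std::uint32_t len; };
enum class Status { OK, PARTIAL, ERROR };
struct ErrorInfo { StringRef message; };
struct ParseResult { Status status; ErrorInfo error; };

enum class Routing { Full, ByStatementType, Unclassified };

// Maps the parse outcome to a routing decision.
Routing choose_routing(const ParseResult& r) {
    switch (r.status) {
        case Status::OK:      return Routing::Full;
        case Status::PARTIAL: return Routing::ByStatementType;
        case Status::ERROR:   return Routing::Unclassified;
    }
    return Routing::Unclassified;
}

// Copies the arena-backed message into an owned string so it stays
// valid after the arena is reset.
std::string copy_error_message(const ParseResult& r) {
    assert(r.error.message.ptr != nullptr || r.error.message.len == 0);
    if (r.status == Status::OK || r.error.message.ptr == nullptr) return std::string();
    return std::string(r.error.message.ptr, r.error.message.len);
}
```

The copy has to happen before `reset()` is called; after that point the `StringRef` in `ErrorInfo` refers to recycled memory.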
124144
125145
---
126146
@@ -179,7 +199,7 @@ struct KeywordEntry {
179199

180200
### Lookahead
181201

182-
One-token lookahead via `peek()`. Cached internally. Sufficient for recursive descent and classification.
202+
The tokenizer provides single-token lookahead via `peek()`, cached internally. Statement parsers that need multi-token disambiguation (e.g., `SET TRANSACTION` vs `SET var = ...`, or `INSERT INTO ... SELECT` vs `INSERT INTO ... VALUES`) handle this by consuming tokens and using the parser's own state to disambiguate — no backtracking needed. For example, the SET parser consumes the second token; if it's `TRANSACTION`, it enters `parse_set_transaction()`, otherwise it treats the consumed token as the start of a variable target. This is standard recursive descent practice and does not require a multi-token lookahead buffer in the tokenizer itself.
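
The consume-then-branch approach described above could look roughly like this. It is a simplified sketch: the whitespace-only `Tokenizer`, the `Token` struct, `iequals`, and `classify_set` are hypothetical stand-ins, and a real tokenizer would split on operators and handle quoting.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct Token { const char* ptr; std::size_t len; };  // len == 0 marks end of input

// Toy tokenizer: splits on whitespace only, with a one-token peek cache.
class Tokenizer {
public:
    Tokenizer(const char* sql, std::size_t len) : cur_(sql), end_(sql + len) {}
    const Token& peek() {
        if (!has_peek_) { peeked_ = scan(); has_peek_ = true; }
        return peeked_;
    }
    Token next() { Token t = peek(); has_peek_ = false; return t; }
private:
    Token scan() {
        while (cur_ < end_ && std::isspace(static_cast<unsigned char>(*cur_))) ++cur_;
        const char* start = cur_;
        while (cur_ < end_ && !std::isspace(static_cast<unsigned char>(*cur_))) ++cur_;
        return Token{start, static_cast<std::size_t>(cur_ - start)};
    }
    const char* cur_;
    const char* end_;
    Token peeked_{nullptr, 0};
    bool has_peek_ = false;
};

// Case-insensitive comparison against an upper-case keyword.
static bool iequals(const Token& t, const char* kw) {
    std::size_t i = 0;
    for (; i < t.len && kw[i] != '\0'; ++i) {
        if (std::toupper(static_cast<unsigned char>(t.ptr[i])) != kw[i]) return false;
    }
    return i == t.len && kw[i] == '\0';
}

enum class SetKind { Transaction, Variable, Invalid };

SetKind classify_set(const std::string& sql) {
    Tokenizer tz(sql.data(), sql.size());
    if (!iequals(tz.next(), "SET")) return SetKind::Invalid;
    const Token second = tz.next();  // consumed, not peeked
    if (second.len == 0) return SetKind::Invalid;
    if (iequals(second, "TRANSACTION")) {
        return SetKind::Transaction;  // a real parser would enter parse_set_transaction()
    }
    assert(second.ptr != nullptr);
    return SetKind::Variable;  // 'second' is already the start of the variable target
}
```

The point of the sketch is that the second token is held in a local variable after being consumed, so the tokenizer itself never needs more than one token of lookahead.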
183203

184204
---
185205

@@ -256,7 +276,8 @@ parse_select()
256276
- `::` type cast
257277
- `LIMIT ALL` vs `LIMIT` with expression
258278
- Dollar-quoted strings
259-
- `RETURNING` clause
279+
280+
Note: PostgreSQL's `RETURNING` clause applies to INSERT/UPDATE/DELETE, not SELECT. It will be handled when those statements are promoted to Tier 1. Until then, Tier 2 extractors for those statements will detect `RETURNING` and include it in metadata but not build an AST for the returned expressions.
260281

261282
### SET Parser
262283

@@ -286,7 +307,8 @@ Each `NodeType` has a corresponding emit function. The emitter is dialect-templa
286307

287308
- `StringRef` values are emitted directly from the original input (no copy unless the node was modified).
288309
- Modified nodes emit their new values.
289-
- Theoretically supports cross-dialect emission (parse MySQL → emit PostgreSQL) for future query translation.
310+
311+
Cross-dialect emission (parse MySQL → emit PostgreSQL) is **out of scope** for the initial design. Many constructs have no direct equivalent across dialects (`SQL_CALC_FOUND_ROWS`, backtick quoting, `LIMIT` syntax differences). The emitter always emits in the same dialect it parsed.
290312

291313
---
292314

@@ -306,20 +328,25 @@ COM_STMT_PREPARE COM_STMT_EXECUTE (repeated) COM_STMT_CLOSE
306328

307329
SQL template parsed normally. Placeholder tokens (`?` in MySQL, `$1`/`$2` in PostgreSQL) become `NODE_PLACEHOLDER` AST nodes with a parameter index in `flags`.
308330

309-
The AST is copied from the arena to a longer-lived **statement cache** (per-parser-instance, keyed by statement ID). This is the one place where memory leaves the arena.
331+
The AST is copied from the arena to a longer-lived **statement cache** (per-parser-instance, keyed by statement ID) via `parse_and_cache()`, which atomically parses and stores the result before the arena can be reset. This is the one place where memory leaves the arena.
332+
333+
**Threading note:** The statement cache is per-parser-instance (i.e., per-thread). In ProxySQL, prepared statement state is per-session. If sessions can migrate between threads, the session must carry its own prepared statement metadata (statement IDs, SQL templates). The parser on the destination thread can re-parse and cache the template on first execute if the cached AST is not found. This avoids any cross-thread sharing of parser state.
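
The re-parse-on-miss idea in the threading note might be sketched as follows. Everything here is a stand-in: `Session`, `ThreadParser`, and `parse_count` are hypothetical, and the cached "AST" is just the SQL text so that the example stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Session-owned prepared statement metadata that travels with the session.
struct Session {
    std::unordered_map<std::uint32_t, std::string> templates;  // stmt_id -> SQL template
};

// Stand-in for a per-thread parser with its own statement cache.
class ThreadParser {
public:
    // Parses the template and stores the result in one step.
    void parse_and_cache(const std::string& sql, std::uint32_t stmt_id) {
        cache_[stmt_id] = sql;  // a real parser would store a copied AST here
        ++parse_count_;
    }

    // Executes a statement; re-parses from the session's template on a cache miss.
    bool execute(const Session& session, std::uint32_t stmt_id) {
        if (cache_.find(stmt_id) == cache_.end()) {
            const auto it = session.templates.find(stmt_id);
            if (it == session.templates.end()) return false;  // unknown statement
            parse_and_cache(it->second, stmt_id);
        }
        return true;
    }

    int parse_count() const { return parse_count_; }

private:
    std::unordered_map<std::uint32_t, std::string> cache_;
    int parse_count_ = 0;
};
```

With this shape, a session that moves to another thread pays one extra parse on its first execute there, and no parser state is ever shared between threads.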
310334

311335
### Execute Phase
312336

313337
```cpp
314338
struct BoundValue {
315-
enum Type { INT, FLOAT, DOUBLE, STRING, BLOB, NULL_VAL };
339+
enum Type { INT, FLOAT, DOUBLE, STRING, BLOB, NULL_VAL, DATETIME, DECIMAL };
316340
Type type;
317341
union {
318342
int64_t int_val;
319-
double float_val;
343+
float float32_val; // MySQL FLOAT (4 bytes) — distinct from DOUBLE
344+
double float64_val; // MySQL DOUBLE (8 bytes)
320345
StringRef str_val; // points into COM_STMT_EXECUTE packet buffer
346+
// also used for DATETIME/DECIMAL (wire-format string)
321347
};
322348
};
349+
static_assert(std::is_trivially_copyable_v<BoundValue>);
323350

324351
struct ParamBindings {
325352
BoundValue* values;
@@ -349,6 +376,14 @@ The parser never throws exceptions. Errors are reported through `ParseResult::st
349376
350377
**Lenient by design:** ProxySQL doesn't need to reject queries — the backend does. The parser extracts as much useful information as possible and degrades gracefully.
351378
379+
### Multi-Statement Queries
380+
381+
ProxySQL regularly receives semicolon-separated multi-statement queries (e.g., `SET autocommit=0; BEGIN`). The parser handles this by parsing the **first statement** and returning its `ParseResult` along with a `remaining` field (`StringRef` pointing to the unparsed tail after the semicolon). The caller is responsible for calling `parse()` again on the remainder if needed. This avoids allocating a list of results and lets the caller decide whether to parse subsequent statements.
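
A caller-side loop over `remaining` could look roughly like the sketch below. The function `parse_first` is a hypothetical stand-in for `Parser::parse()` that naively splits at the first semicolon (it ignores quotes and comments, which a real parser must not), and the `StringRef` field names and the `statement` field are assumptions.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct StringRef { const char* ptr; std::uint32_t len; };
struct ParseResult {
    StringRef statement;  // hypothetical: text of the statement just parsed
    StringRef remaining;  // unparsed input after the semicolon
};

// Stand-in for Parser::parse(): consumes input up to the first ';'.
static ParseResult parse_first(const char* sql, std::size_t len) {
    std::size_t i = 0;
    while (i < len && sql[i] != ';') ++i;
    const std::size_t rest = (i < len) ? i + 1 : len;  // skip the semicolon
    ParseResult r;
    r.statement = StringRef{sql, static_cast<std::uint32_t>(i)};
    r.remaining = StringRef{sql + rest, static_cast<std::uint32_t>(len - rest)};
    return r;
}

// Copies a StringRef into an owned string without surrounding whitespace.
static std::string trimmed(StringRef s) {
    const char* b = s.ptr;
    const char* e = s.ptr + s.len;
    while (b < e && std::isspace(static_cast<unsigned char>(*b))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(e[-1]))) --e;
    return std::string(b, e);
}

// Caller-side loop: keep parsing the tail until nothing remains.
std::vector<std::string> split_statements(const std::string& sql) {
    std::vector<std::string> out;
    const char* p = sql.data();
    std::size_t n = sql.size();
    while (n > 0) {
        const ParseResult r = parse_first(p, n);
        std::string stmt = trimmed(r.statement);
        if (!stmt.empty()) out.push_back(stmt);
        assert(r.remaining.len < n);  // every call must make progress
        p = r.remaining.ptr;
        n = r.remaining.len;
    }
    return out;
}
```

In the real API the caller would also need to finish with each result (or copy what it needs) before the arena is reset for the next statement.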
382+
383+
### Maximum Query Length
384+
385+
The parser respects the caller's buffer size (the `len` parameter). It does not impose its own maximum query length — that is ProxySQL's responsibility (via `mysql-max_allowed_packet` or equivalent). The arena's maximum size (default 1MB) provides an implicit bound on the complexity of parseable queries; exceeding it returns `ERROR`.
386+
352387
---
353388
354389
## Public API
@@ -360,9 +395,9 @@ public:
360395
Parser(const ParserConfig& config = {}); // arena size, cache capacity
361396
362397
ParseResult parse(const char* sql, size_t len);
398+
ParseResult parse_and_cache(const char* sql, size_t len, uint32_t stmt_id);
363399
ParseResult execute(uint32_t stmt_id, const ParamBindings& params);
364400
365-
void prepare_cache_store(uint32_t stmt_id);
366401
void prepare_cache_evict(uint32_t stmt_id);
367402
368403
void reset(); // resets arena; call after each query is fully processed
@@ -397,7 +432,7 @@ include/sql_parser/
397432
keywords_pgsql.h // PostgreSQL keyword table
398433
399434
src/sql_parser/
400-
tokenizer.cpp // explicit template instantiations
435+
tokenizer.cpp // explicit template instantiations (or header-only with LTO for max inlining)
401436
classifier.cpp // switch dispatch
402437
select_parser.cpp // Tier 1: SELECT
403438
set_parser.cpp // Tier 1: SET
@@ -431,7 +466,7 @@ bench/
431466
| Tier 1 SELECT parse (complex) | <2us | Multi-join, subqueries, GROUP BY, ORDER BY |
432467
| Tier 1 SET parse | <300ns | `SET @@session.var = value` |
433468
| Query reconstruction | <500ns | Simple SELECT round-trip |
434-
| Arena reset | <10ns | Pointer rewind |
469+
| Arena reset | <10ns | Pointer rewind (single-block case; overflow blocks add O(n) free calls) |
435470

436471
---
437472
