| 1 | +# ParserSQL |
| 2 | + |
| 3 | +A high-performance, hand-written recursive descent SQL parser for [ProxySQL](https://github.com/sysown/proxysql). Supports both MySQL and PostgreSQL dialects with compile-time dispatch — zero runtime overhead for dialect selection. |
| 4 | + |
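As a rough illustration of what compile-time dialect dispatch can look like, here is a minimal sketch. The `Dialect` enum mirrors the name used in this README, but `is_identifier_quote` is a hypothetical helper invented for the example, not part of the actual headers. It assumes MySQL's default SQL mode (backtick-quoted identifiers).

```cpp
// Hypothetical stand-in for illustration; not the actual ParserSQL header.
enum class Dialect { MySQL, PostgreSQL };

// Returns true if `c` opens a quoted identifier in dialect D.
// `if constexpr` discards the untaken branch at compile time, so each
// instantiation contains only the comparison for its own dialect.
template <Dialect D>
constexpr bool is_identifier_quote(char c) {
    if constexpr (D == Dialect::MySQL) {
        return c == '`';   // MySQL (default SQL mode): `name`
    } else {
        return c == '"';   // PostgreSQL: "name"
    }
}

// Evaluated entirely by the compiler; no dialect check remains at runtime.
static_assert(is_identifier_quote<Dialect::MySQL>('`'));
static_assert(!is_identifier_quote<Dialect::MySQL>('"'));
static_assert(is_identifier_quote<Dialect::PostgreSQL>('"'));
```

Because the dialect is a template parameter, choosing between `Parser<Dialect::MySQL>` and `Parser<Dialect::PostgreSQL>` happens when the code is compiled, not per query.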
| 5 | +## Performance |
| 6 | + |
| 7 | +Most operations complete in under a microsecond on modern hardware; only the complex SELECT takes longer: |
| 8 | + |
| 9 | +| Operation | Latency | Notes | |
| 10 | +|---|---|---| |
| 11 | +| Classify statement (BEGIN) | **36 ns** | Tier 2: type + metadata only | |
| 12 | +| Parse SET statement | **130 ns** | Full AST | |
| 13 | +| Parse simple SELECT | **223 ns** | Full AST | |
| 14 | +| Parse complex SELECT (JOINs, GROUP BY, HAVING) | **1.4 µs** | Full AST | |
| 15 | +| Parse INSERT | **244 ns** | Full AST | |
| 16 | +| Query reconstruction (round-trip) | **132-263 ns** | Parse → emit | |
| 17 | +| Arena reset | **3.6 ns** | O(1) pointer rewind | |
| 18 | + |
| 19 | +Compared to other parsers on the same queries: |
| 20 | + |
| 21 | +| Parser | Simple SELECT | Complex SELECT | Notes | |
| 22 | +|---|---|---|---| |
| 23 | +| **ParserSQL** | **223 ns** | **1,189 ns** | This project | |
| 24 | +| libpg_query (raw parse) | 684 ns (3.1x slower) | 3,304 ns (2.8x) | PostgreSQL's own parser | |
| 25 | +| sqlparser-rs (Rust) | 4,687 ns (21x slower) | 23,411 ns (19x) | Apache DataFusion | |
| 26 | + |
| 27 | +See [docs/benchmarks/](docs/benchmarks/) for full results and [REPRODUCING.md](docs/benchmarks/REPRODUCING.md) for reproduction instructions. |
| 28 | + |
| 29 | +## Features |
| 30 | + |
| 31 | +- **Deep parsing (Tier 1):** SELECT, INSERT, UPDATE, DELETE, SET, REPLACE, EXPLAIN, CALL, DO, LOAD DATA |
| 32 | +- **Compound queries:** UNION / INTERSECT / EXCEPT with SQL-standard precedence and parenthesized nesting |
| 33 | +- **Tier 2 classification:** All other statement types (DDL, transactions, SHOW, GRANT, etc.) |
| 34 | +- **Query reconstruction:** Parse → modify AST → emit valid SQL |
| 35 | +- **Query digest:** Normalize queries for fingerprinting (literals → `?`, IN list collapsing, keyword uppercasing) with 64-bit FNV-1a hash |
| 36 | +- **Prepared statement cache:** LRU cache with `parse_and_cache()` / `execute()` for binary protocol support |
| 37 | +- **Both dialects:** MySQL and PostgreSQL via `Parser<Dialect::MySQL>` / `Parser<Dialect::PostgreSQL>` |
| 38 | +- **Thread-safe:** One parser instance per thread, zero shared state, no locks |
| 39 | + |
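The hashing step of the query digest can be sketched with the standard 64-bit FNV-1a algorithm. The function name `fnv1a_64` is an assumption made for this example, and the sketch takes an already-normalized query string as input; it does not reproduce the project's normalization pass.

```cpp
#include <cstdint>
#include <string_view>

// 64-bit FNV-1a: XOR each input byte into the hash, then multiply by the
// FNV prime. The constants are the standard 64-bit FNV parameters.
constexpr std::uint64_t fnv1a_64(std::string_view s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (char c : s) {
        h ^= static_cast<unsigned char>(c);    // fold in one byte
        h *= 0x100000001b3ULL;                 // FNV prime
    }
    return h;
}

// Hashing the empty string yields the offset basis unchanged.
static_assert(fnv1a_64("") == 0xcbf29ce484222325ULL);
```

Two queries that differ only in literal values, such as `WHERE id = 1` and `WHERE id = 42`, normalize to the same text (`WHERE id = ?`) and therefore produce the same hash.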
| 40 | +## Quick Start |
| 41 | + |
| 42 | +```cpp |
| 43 | +#include "sql_parser/parser.h" |
| 44 | +#include "sql_parser/emitter.h" |
| 45 | + |
| 46 | +using namespace sql_parser; |
| 47 | + |
| 48 | +// Create a parser (one per thread) |
| 49 | +Parser<Dialect::MySQL> parser; |
| 50 | + |
| 51 | +// Parse a query |
| 52 | +auto result = parser.parse("SELECT * FROM users WHERE id = 1", 32); |
| 53 | +if (result.ok()) { |
| 54 | + // result.ast contains the full AST |
| 55 | + // result.stmt_type == StmtType::SELECT |
| 56 | +} |
| 57 | + |
| 58 | +// Reconstruct SQL from AST |
| 59 | +Emitter<Dialect::MySQL> emitter(parser.arena()); |
| 60 | +emitter.emit(result.ast); |
| 61 | +StringRef sql = emitter.result(); // "SELECT * FROM users WHERE id = 1" |
| 62 | + |
| 63 | +// Query digest (normalize for fingerprinting) |
| 64 | +Digest<Dialect::MySQL> digest(parser.arena()); |
| 65 | +DigestResult dr = digest.compute(result.ast); |
| 66 | +// dr.normalized = "SELECT * FROM users WHERE id = ?" |
| 67 | +// dr.hash = 0x... (64-bit FNV-1a) |
| 68 | + |
| 69 | +// Modify AST and re-emit |
| 70 | +// ... modify nodes ... |
| 71 | +// Emitter<Dialect::MySQL> emitter2(parser.arena()); |
| 72 | +// emitter2.emit(result.ast); // emits modified SQL |
| 73 | + |
| 74 | +// Reset arena after each query (O(1), reuses memory) |
| 75 | +parser.reset(); |
| 76 | +``` |
| 77 | + |
| 78 | +## Building |
| 79 | + |
| 80 | +```bash |
| 81 | +# Build library + run tests |
| 82 | +make -f Makefile.new all |
| 83 | + |
| 84 | +# Build and run benchmarks |
| 85 | +make -f Makefile.new bench |
| 86 | + |
| 87 | +# Build comparison benchmarks (requires libpg_query) |
| 88 | +cd third_party/libpg_query && make && cd ../.. |
| 89 | +make -f Makefile.new bench-compare |
| 90 | +``` |
| 91 | + |
| 92 | +Requires: `g++` or `clang++` with C++17 support. No external dependencies for the parser itself. Google Test and Google Benchmark are vendored in `third_party/`. |
| 93 | + |
| 94 | +## Architecture |
| 95 | + |
| 96 | +``` |
| 97 | +Input SQL bytes |
| 98 | + │ |
| 99 | + ▼ |
| 100 | +┌──────────────┐ |
| 101 | +│ Tokenizer │ Zero-copy, dialect-templated, pull-based |
| 102 | +│ <Dialect D> │ Binary search keyword lookup (~110 keywords) |
| 103 | +└──────┬───────┘ |
| 104 | + │ |
| 105 | + ▼ |
| 106 | +┌──────────────┐ |
| 107 | +│ Classifier │ Switch on first token → route to parser |
| 108 | +└──────┬───────┘ |
| 109 | + │ |
| 110 | + ├──── Tier 1 ──► Deep parser (SELECT, INSERT, UPDATE, DELETE, SET, ...) |
| 111 | + │ │ |
| 112 | + │ ▼ |
| 113 | + │ Full AST in arena |
| 114 | + │ |
| 115 | + └──── Tier 2 ──► Lightweight extractor |
| 116 | + │ |
| 117 | + ▼ |
| 118 | + StmtType + table name + metadata |
| 119 | +``` |
| 120 | + |
| 121 | +**Key design decisions:** |
| 122 | +- **Arena allocator** — 64KB bump allocator, O(1) reset. All AST nodes allocated from arena. No per-node new/delete. |
| 123 | +- **Zero-copy StringRef** — Token values point into original input buffer. No string copies during parsing. |
| 124 | +- **32-byte AstNode** — Compact intrusive linked-list (first_child + next_sibling). Half a cache line. |
| 125 | +- **Compile-time dialect dispatch** — `if constexpr` for MySQL vs PostgreSQL differences. Zero runtime overhead. |
| 126 | +- **Header-only parsers** — Maximum inlining opportunity. Only `arena.cpp` and `parser.cpp` are compiled separately. |
| 127 | + |
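The arena and node layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `Arena` and `AstNode` names come from this README, but every field, method, and helper (`allocate`, `used`, `make_node`, `append_child`, `text`, `flags`) is hypothetical, and the 32-byte size holds only on a typical 64-bit target.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical bump allocator: one fixed block; allocating bumps an offset.
class Arena {
public:
    explicit Arena(std::size_t capacity = 64 * 1024)
        : buf_(new std::byte[capacity]), cap_(capacity) {}

    // Returns `size` bytes aligned to `align` (a power of two no larger than
    // the default new alignment), or nullptr when the block is exhausted.
    void* allocate(std::size_t size, std::size_t align) {
        std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > cap_ || size > cap_ - start) return nullptr;
        used_ = start + size;
        return buf_.get() + start;
    }

    void reset() { used_ = 0; }                  // O(1): rewind, keep memory
    std::size_t used() const { return used_; }

private:
    std::unique_ptr<std::byte[]> buf_;
    std::size_t cap_;
    std::size_t used_ = 0;
};

// Hypothetical node layout: 3 pointers + 8 bytes of metadata.
struct AstNode {
    AstNode* first_child = nullptr;
    AstNode* next_sibling = nullptr;
    const char* text = nullptr;     // zero-copy: points into the input buffer
    std::uint32_t text_len = 0;
    std::uint16_t type = 0;
    std::uint16_t flags = 0;
};
static_assert(sizeof(void*) != 8 || sizeof(AstNode) == 32,
              "expected 32 bytes on 64-bit targets");

// Constructs a node inside the arena; no per-node new/delete.
inline AstNode* make_node(Arena& arena, std::uint16_t type) {
    void* p = arena.allocate(sizeof(AstNode), alignof(AstNode));
    if (!p) return nullptr;
    AstNode* n = new (p) AstNode{};
    n->type = type;
    return n;
}

// Appends `child` at the end of `parent`'s sibling chain.
inline void append_child(AstNode* parent, AstNode* child) {
    if (!parent->first_child) { parent->first_child = child; return; }
    AstNode* last = parent->first_child;
    while (last->next_sibling) last = last->next_sibling;
    last->next_sibling = child;
}
```

Nodes are never freed individually. Calling `reset()` after each query invalidates every node at once, which is why a reset costs the same regardless of how large the AST was.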
| 128 | +## Testing |
| 129 | + |
| 130 | +430 unit tests, plus validation against 86K+ queries from the 7 external corpora below: |
| 131 | + |
| 132 | +| Corpus | Queries | OK Rate | |
| 133 | +|---|---|---| |
| 134 | +| PostgreSQL regression suite | 55,553 | 99.6% | |
| 135 | +| MySQL MTR test suite | 2,270 | 99.9% | |
| 136 | +| CockroachDB parser testdata | 17,429 | 95.1% | |
| 137 | +| sqlparser-rs test cases | 2,431 | 99.5% | |
| 138 | +| Vitess test cases | 2,291 | 99.8% | |
| 139 | +| TiDB test cases | 5,043 | 99.8% | |
| 140 | +| SQLGlot fixtures | 1,450 | 98.2% | |
| 141 | + |
| 142 | +## File Layout |
| 143 | + |
| 144 | +``` |
| 145 | +include/sql_parser/ |
| 146 | + parser.h Public API: Parser<D> |
| 147 | + common.h StringRef, Dialect, StmtType, NodeType enums |
| 148 | + arena.h Arena allocator |
| 149 | + ast.h AstNode (32 bytes) |
| 150 | + token.h TokenType enum |
| 151 | + tokenizer.h Tokenizer<D> (header-only) |
| 152 | + expression_parser.h Pratt expression parser |
| 153 | + select_parser.h SELECT deep parser |
| 154 | + set_parser.h SET deep parser |
| 155 | + insert_parser.h INSERT/REPLACE deep parser |
| 156 | + update_parser.h UPDATE deep parser |
| 157 | + delete_parser.h DELETE deep parser |
| 158 | + compound_query_parser.h UNION/INTERSECT/EXCEPT |
| 159 | + table_ref_parser.h Shared FROM/JOIN parsing |
| 160 | + emitter.h AST → SQL reconstruction |
| 161 | + digest.h Query normalization + hash |
| 162 | + parse_result.h ParseResult, BoundValue |
| 163 | + stmt_cache.h Prepared statement LRU cache |
| 164 | + string_builder.h Arena-backed string builder |
| 165 | + keywords_mysql.h MySQL keyword table |
| 166 | + keywords_pgsql.h PostgreSQL keyword table |
| 167 | + |
| 168 | +src/sql_parser/ |
| 169 | + arena.cpp Arena implementation |
| 170 | + parser.cpp Classifier + integration |
| 171 | + |
| 172 | +tests/ 430 unit tests (Google Test) |
| 173 | +bench/ 18 benchmarks + comparison suite |
| 174 | +``` |
| 175 | + |
| 176 | +## License |
| 177 | + |
| 178 | +See [LICENSE](LICENSE) file. |