
Commit 43fe79a

docs: add README.md with usage/benchmarks/architecture, update CLAUDE.md for current codebase
1 parent bf1cd01 commit 43fe79a

2 files changed

Lines changed: 263 additions & 0 deletions


CLAUDE.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

High-performance, hand-written recursive descent SQL parser for ProxySQL. Supports MySQL and PostgreSQL dialects via compile-time templating (`Parser<Dialect::MySQL>` / `Parser<Dialect::PostgreSQL>`). Designed for sub-microsecond latency on the proxy hot path.

## Build Commands

```bash
make -f Makefile.new all                # Build library + run all 430 tests
make -f Makefile.new lib                # Build only libsqlparser.a
make -f Makefile.new test               # Build + run tests
make -f Makefile.new bench              # Build + run benchmarks
make -f Makefile.new bench-compare      # Run comparison vs libpg_query (requires libpg_query built)
make -f Makefile.new build-corpus-test  # Build corpus test harness
make -f Makefile.new clean              # Remove all build artifacts
```

For release benchmarks: `sed 's/-g -O2/-O3/' Makefile.new > /tmp/Makefile.release && make -f /tmp/Makefile.release bench`

**Note:** The old `Makefile` (no `.new`) is for the legacy Flex/Bison parser — do not use it for new code.

## Architecture

### Three-layer pipeline

1. **Tokenizer** (`tokenizer.h`) — Zero-copy pull-based iterator, dialect-templated. Keyword lookup via sorted-array binary search. Produces `Token{type, StringRef, offset}`.
2. **Classifier** (`parser.cpp:classify_and_dispatch()`) — Switch on first token. Routes to Tier 1 deep parser or Tier 2 extractor.
3. **Statement parsers** — Each Tier 1 statement has its own header-only template class (e.g., `SelectParser<D>`, `SetParser<D>`).
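
The keyword lookup in step 1 can be pictured with a minimal sketch. It assumes a tiny hypothetical table (`kKeywords`) and hypothetical token ids; the project's real tables and `TokenType` values live in `token.h` and the keyword headers and are not reproduced here. The point it illustrates is that the table must stay sorted, because `std::lower_bound` gives wrong answers on unsorted input.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string_view>

// Hypothetical token ids, for illustration only.
enum class TokenType : std::uint8_t { Identifier, From, Select, Set, Where };

struct KeywordEntry {
    std::string_view text;  // keyword in upper case
    TokenType type;
};

// Hypothetical table. It must stay sorted by `text`, otherwise the
// binary search below silently misses entries.
constexpr KeywordEntry kKeywords[] = {
    {"FROM", TokenType::From},
    {"SELECT", TokenType::Select},
    {"SET", TokenType::Set},
    {"WHERE", TokenType::Where},
};

// Three-way comparison of an upper-case keyword against a word of any case.
inline int compare_ci(std::string_view keyword, std::string_view word) {
    const std::size_t n = std::min(keyword.size(), word.size());
    for (std::size_t i = 0; i < n; ++i) {
        const int a = static_cast<unsigned char>(keyword[i]);
        const int b = std::toupper(static_cast<unsigned char>(word[i]));
        if (a != b) return a < b ? -1 : 1;
    }
    if (keyword.size() == word.size()) return 0;
    return keyword.size() < word.size() ? -1 : 1;
}

// Returns the keyword's token type, or Identifier when the word is not a keyword.
inline TokenType lookup_keyword(std::string_view word) {
    const KeywordEntry* first = std::begin(kKeywords);
    const KeywordEntry* last = std::end(kKeywords);
    const KeywordEntry* it = std::lower_bound(
        first, last, word,
        [](const KeywordEntry& e, std::string_view w) { return compare_ci(e.text, w) < 0; });
    if (it != last && compare_ci(it->text, word) == 0) return it->type;
    return TokenType::Identifier;
}
```

With this table, `lookup_keyword("select")` resolves to `TokenType::Select` and `lookup_keyword("users")` falls back to `TokenType::Identifier`, with no allocation or copy of the input.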

### Key types

- `Arena` — Block-chained bump allocator. 64KB default, 1MB max. O(1) reset.
- `StringRef` — `{const char* ptr, uint32_t len}`. Zero-copy view into input buffer. Trivially copyable.
- `AstNode` — 32 bytes. Intrusive linked list (first_child + next_sibling). Arena-allocated.
- `ParseResult` — Status (OK/PARTIAL/ERROR) + stmt_type + ast + table_name/schema_name + remaining (for multi-statement).
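
To make these types concrete, here is a simplified sketch under stated assumptions. The `Arena` below uses a single block rather than the block chain described above, and the `AstNode` field layout (everything besides `first_child` and `next_sibling`) is a guess chosen to reach 32 bytes on a 64-bit target; neither is copied from the project's headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <type_traits>

// Zero-copy view into the input buffer, as described in the list above.
struct StringRef {
    const char* ptr;
    std::uint32_t len;
};
static_assert(std::is_trivially_copyable_v<StringRef>);

// Assumed layout: only first_child/next_sibling come from the document.
struct AstNode {
    AstNode* first_child;
    AstNode* next_sibling;
    const char* value_ptr;
    std::uint32_t value_len;
    std::uint16_t type;
    std::uint16_t flags;
};
static_assert(sizeof(void*) != 8 || sizeof(AstNode) == 32);

// Single-block bump allocator (the real Arena chains blocks up to a maximum).
class Arena {
public:
    explicit Arena(std::size_t capacity = 64 * 1024)
        : base_(static_cast<char*>(std::malloc(capacity))), capacity_(capacity) {}
    ~Arena() { std::free(base_); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // `align` must be a power of two no larger than alignof(std::max_align_t).
    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        const std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (base_ == nullptr || start > capacity_ || size > capacity_ - start) return nullptr;
        used_ = start + size;
        return base_ + start;
    }

    // O(1): rewinds the bump pointer; memory is reused by later allocations.
    void reset() { used_ = 0; }
    std::size_t used() const { return used_; }

private:
    char* base_;
    std::size_t capacity_;
    std::size_t used_ = 0;
};
```

Because nodes are never freed individually, resetting between queries is a single store, which is consistent with the O(1) reset claimed for the real allocator.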

### Namespace

Everything is in `namespace sql_parser`. All templates are parameterized on `Dialect D` (MySQL or PostgreSQL).

### Adding a new deep parser

1. Create `include/sql_parser/xxx_parser.h` — header-only template following `SetParser<D>` pattern
2. Add node types to `NodeType` enum in `common.h`
3. Add tokens to `token.h` and both keyword tables (sorted!)
4. Add `parse_xxx()` method to `parser.h` and implement in `parser.cpp`
5. Update `classify_and_dispatch()` switch to route to new parser
6. Add emit methods to `emitter.h`
7. Add `is_keyword_as_identifier()` entries in `expression_parser.h` for new keywords
8. Update `is_alias_start()` blocklist in `table_ref_parser.h` for clause-starting keywords
9. Write tests in `tests/test_xxx.cpp`, add to `Makefile.new` TEST_SRCS

### Expression parsing

`ExpressionParser<D>` uses Pratt parsing (precedence climbing). Used by all Tier 1 parsers for WHERE conditions, SET values, function args, etc. Handles: literals, identifiers, binary/unary ops, IS NULL, BETWEEN, IN, NOT IN/BETWEEN/LIKE, CASE/WHEN, function calls, subqueries, ARRAY constructors, tuple constructors, field access.
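
For readers unfamiliar with the technique, the following is a minimal Pratt-style sketch, not the project's `ExpressionParser<D>`. It handles only identifiers, parentheses, and four left-associative arithmetic operators, and it emits an S-expression string instead of arena-allocated AST nodes. `MiniPratt` and `binding_power` are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Binding power of a binary operator; 0 means "not a binary operator".
inline int binding_power(char op) {
    switch (op) {
        case '+': case '-': return 10;
        case '*': case '/': return 20;
        default: return 0;
    }
}

class MiniPratt {
public:
    explicit MiniPratt(std::string_view input) : in_(input) {}

    // Parses operators that bind strictly tighter than min_bp.
    std::string parse(int min_bp = 0) {
        std::string lhs = parse_prefix();
        for (;;) {
            skip_spaces();
            if (pos_ >= in_.size()) break;
            const char op = in_[pos_];
            const int bp = binding_power(op);
            if (bp == 0 || bp <= min_bp) break;  // caller owns this operator
            ++pos_;
            const std::string rhs = parse(bp);   // same bp => left-associative
            lhs = "(" + std::string(1, op) + " " + lhs + " " + rhs + ")";
        }
        return lhs;
    }

private:
    // Prefix position: a parenthesized expression or an identifier/number.
    std::string parse_prefix() {
        skip_spaces();
        if (pos_ < in_.size() && in_[pos_] == '(') {
            ++pos_;
            std::string inner = parse(0);
            skip_spaces();
            if (pos_ < in_.size() && in_[pos_] == ')') ++pos_;
            return inner;
        }
        const std::size_t start = pos_;
        while (pos_ < in_.size() && std::isalnum(static_cast<unsigned char>(in_[pos_]))) ++pos_;
        return std::string(in_.substr(start, pos_ - start));
    }

    void skip_spaces() {
        while (pos_ < in_.size() && std::isspace(static_cast<unsigned char>(in_[pos_]))) ++pos_;
    }

    std::string_view in_;
    std::size_t pos_ = 0;
};
```

For example, `MiniPratt("a + b * c").parse()` yields `(+ a (* b c))`: the recursive call made for `+` keeps consuming `*` because its binding power is higher, which is the whole precedence mechanism.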

### Table reference parsing

`TableRefParser<D>` is a shared utility extracted from SelectParser. Used by SELECT (FROM), UPDATE (MySQL multi-table), DELETE (MySQL multi-table), INSERT (for INSERT...SELECT). Handles: simple tables, qualified names, aliases, JOINs (all types), subqueries in FROM.

### Emitter

`Emitter<D>` walks AST and produces SQL text into arena-backed `StringBuilder`. Supports:
- Normal mode: faithful round-trip reconstruction
- Digest mode (`EmitMode::DIGEST`): literals→`?`, IN collapsing, keyword uppercasing
- Bindings mode: materializes `?` placeholders with bound parameter values

### Tests

Google Test. 430 tests across 16 test files. Validated against 86K+ external queries (PostgreSQL regression, MySQL MTR, CockroachDB, Vitess, TiDB, sqlparser-rs, SQLGlot).

Run a single test: `./run_tests --gtest_filter="*SetTest*"`

### Benchmarks

Google Benchmark. 18 single-thread + 16 multi-thread + 4 percentile benchmarks.
Comparison benchmarks against libpg_query and sqlparser-rs are in `bench/bench_comparison.cpp`.

### Corpus testing

`corpus_test` binary reads SQL from stdin (one per line), parses each, reports OK/PARTIAL/ERROR counts.
Usage: `./corpus_test mysql < queries.sql` or `./corpus_test pgsql < queries.sql`

README.md

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
# ParserSQL

A high-performance, hand-written recursive descent SQL parser for [ProxySQL](https://github.com/sysown/proxysql). Supports both MySQL and PostgreSQL dialects with compile-time dispatch — zero runtime overhead for dialect selection.

## Performance

Most operations complete in under a microsecond on modern hardware; the complex SELECT is the one exception:

| Operation | Latency | Notes |
|---|---|---|
| Classify statement (BEGIN) | **36 ns** | Tier 2: type + metadata only |
| Parse SET statement | **130 ns** | Full AST |
| Parse simple SELECT | **223 ns** | Full AST |
| Parse complex SELECT (JOINs, GROUP BY, HAVING) | **1.4 µs** | Full AST |
| Parse INSERT | **244 ns** | Full AST |
| Query reconstruction (round-trip) | **132-263 ns** | Parse → emit |
| Arena reset | **3.6 ns** | O(1) pointer rewind |

Compared to other parsers on the same queries:

| Parser | Simple SELECT | Complex SELECT | Notes |
|---|---|---|---|
| **ParserSQL** | **223 ns** | **1,189 ns** | This project |
| libpg_query (raw parse) | 684 ns (3.1x slower) | 3,304 ns (2.8x) | PostgreSQL's own parser |
| sqlparser-rs (Rust) | 4,687 ns (21x slower) | 23,411 ns (19x) | Apache DataFusion |

See [docs/benchmarks/](docs/benchmarks/) for full results and [REPRODUCING.md](docs/benchmarks/REPRODUCING.md) for reproduction instructions.

## Features

- **Deep parsing (Tier 1):** SELECT, INSERT, UPDATE, DELETE, SET, REPLACE, EXPLAIN, CALL, DO, LOAD DATA
- **Compound queries:** UNION / INTERSECT / EXCEPT with SQL-standard precedence and parenthesized nesting
- **Tier 2 classification:** All other statement types (DDL, transactions, SHOW, GRANT, etc.)
- **Query reconstruction:** Parse → modify AST → emit valid SQL
- **Query digest:** Normalize queries for fingerprinting (literals → `?`, IN list collapsing, keyword uppercasing) with 64-bit FNV-1a hash
- **Prepared statement cache:** LRU cache with `parse_and_cache()` / `execute()` for binary protocol support
- **Both dialects:** MySQL and PostgreSQL via `Parser<Dialect::MySQL>` / `Parser<Dialect::PostgreSQL>`
- **Thread-safe:** One parser instance per thread, zero shared state, no locks
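
The prepared statement cache is described as an LRU cache. As a generic illustration of that eviction policy (not the contents of `stmt_cache.h`, whose interface is not shown here), a minimal sketch with a hypothetical `LruCache` template could look like this:

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Generic LRU cache: most recently used entries sit at the front of the list.
template <typename Key, typename Value>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Inserts or overwrites an entry and marks it most recently used.
    void put(const Key& key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (items_.size() == capacity_) {  // evict the least recently used entry
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

    // Returns a copy of the value and marks the entry most recently used.
    std::optional<Value> get(const Key& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        items_.splice(items_.begin(), items_, it->second);
        return it->second->second;
    }

    std::size_t size() const { return items_.size(); }

private:
    using Item = std::pair<Key, Value>;
    std::size_t capacity_;
    std::list<Item> items_;
    std::unordered_map<Key, typename std::list<Item>::iterator> index_;
};
```

`std::list::splice` moves a node without invalidating iterators, so the hash map entries stay valid and both operations run in O(1) on average. With one cache per thread, as the feature list suggests for parsers, no locking is needed.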

## Quick Start

```cpp
#include "sql_parser/parser.h"
#include "sql_parser/emitter.h"

using namespace sql_parser;

// Create a parser (one per thread)
Parser<Dialect::MySQL> parser;

// Parse a query (second argument is the query length in bytes)
auto result = parser.parse("SELECT * FROM users WHERE id = 1", 32);
if (result.ok()) {
    // result.ast contains the full AST
    // result.stmt_type == StmtType::SELECT
}

// Reconstruct SQL from AST
Emitter<Dialect::MySQL> emitter(parser.arena());
emitter.emit(result.ast);
StringRef sql = emitter.result();  // "SELECT * FROM users WHERE id = 1"

// Query digest (normalize for fingerprinting)
Digest<Dialect::MySQL> digest(parser.arena());
DigestResult dr = digest.compute(result.ast);
// dr.normalized = "SELECT * FROM users WHERE id = ?"
// dr.hash = 0x... (64-bit FNV-1a)

// Modify AST and re-emit
// ... modify nodes ...
// Emitter emitter2(parser.arena());
// emitter2.emit(result.ast);  // emits modified SQL

// Reset arena after each query (O(1), reuses memory)
parser.reset();
```
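
The digest hash is described as 64-bit FNV-1a. For reference, the standard FNV-1a algorithm over a byte string is sketched below; `fnv1a64` is a hypothetical helper name, and how the project feeds the normalized text into the hash is not shown in this document.

```cpp
#include <cstdint>
#include <string_view>

// Standard 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
constexpr std::uint64_t fnv1a64(std::string_view text) {
    std::uint64_t hash = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (char c : text) {
        hash ^= static_cast<unsigned char>(c);
        hash *= 0x100000001b3ULL;                // FNV 64-bit prime
    }
    return hash;
}

// The empty input hashes to the offset basis itself.
static_assert(fnv1a64("") == 0xcbf29ce484222325ULL);
```

Hashing the normalized form (for example `SELECT * FROM users WHERE id = ?`) rather than the raw query is what makes queries that differ only in literals share a fingerprint.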

## Building

```bash
# Build library + run tests
make -f Makefile.new all

# Build and run benchmarks
make -f Makefile.new bench

# Build comparison benchmarks (requires libpg_query)
cd third_party/libpg_query && make && cd ../..
make -f Makefile.new bench-compare
```

Requires: `g++` or `clang++` with C++17 support. No external dependencies for the parser itself. Google Test and Google Benchmark are vendored in `third_party/`.

## Architecture

```
Input SQL bytes
       │
       ▼
┌──────────────┐
│  Tokenizer   │  Zero-copy, dialect-templated, pull-based
│ <Dialect D>  │  Binary search keyword lookup (~110 keywords)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Classifier  │  Switch on first token → route to parser
└──────┬───────┘
       │
       ├──── Tier 1 ──► Deep parser (SELECT, INSERT, UPDATE, DELETE, SET, ...)
       │                    │
       │                    ▼
       │               Full AST in arena
       │
       └──── Tier 2 ──► Lightweight extractor
                            │
                            ▼
                   StmtType + table name + metadata
```
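
The routing step in the diagram can be approximated with a short sketch. This is an assumption-laden stand-in: the real classifier switches on a token type produced by the tokenizer, whereas the hypothetical `classify` below compares the first word as text, and `Tier` and `Route` are invented names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

enum class Tier { Deep, Classify };  // hypothetical names for Tier 1 / Tier 2

struct Route {
    std::string keyword;  // first word of the statement, upper-cased
    Tier tier;
};

inline Route classify(std::string_view sql) {
    std::size_t i = 0;
    while (i < sql.size() && std::isspace(static_cast<unsigned char>(sql[i]))) ++i;

    std::string first;
    while (i < sql.size() && std::isalpha(static_cast<unsigned char>(sql[i]))) {
        first.push_back(static_cast<char>(std::toupper(static_cast<unsigned char>(sql[i]))));
        ++i;
    }

    // Statements listed as Tier 1 in the feature list. "LOAD" alone is a
    // simplification: only LOAD DATA is listed there.
    static constexpr std::string_view kDeep[] = {
        "SELECT", "INSERT", "UPDATE", "DELETE", "SET",
        "REPLACE", "EXPLAIN", "CALL", "DO", "LOAD"};
    for (std::string_view kw : kDeep) {
        if (std::string_view(first) == kw) return {first, Tier::Deep};
    }
    return {first, Tier::Classify};  // everything else: type + metadata only
}
```

Deciding the tier from the first token alone keeps the cheap path cheap, which fits the much lower latency reported for classifying `BEGIN` than for building a full AST.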

**Key design decisions:**
- **Arena allocator** — 64KB bump allocator, O(1) reset. All AST nodes allocated from arena. No per-node new/delete.
- **Zero-copy StringRef** — Token values point into original input buffer. No string copies during parsing.
- **32-byte AstNode** — Compact intrusive linked-list (first_child + next_sibling). Half a cache line.
- **Compile-time dialect dispatch** — `if constexpr` for MySQL vs PostgreSQL differences. Zero runtime overhead.
- **Header-only parsers** — Maximum inlining opportunity. Only `arena.cpp` and `parser.cpp` are compiled separately.
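
As an illustration of the compile-time dispatch idea, the sketch below picks the identifier quote character per dialect with `if constexpr`. The `Dialect` enum is redeclared so the snippet stands alone, and `identifier_quote` and `starts_quoted_identifier` are hypothetical helpers, not functions from this codebase.

```cpp
enum class Dialect { MySQL, PostgreSQL };

// Only the branch for the chosen dialect is instantiated, so the
// selection leaves no runtime branch in the generated code.
template <Dialect D>
constexpr char identifier_quote() {
    if constexpr (D == Dialect::MySQL) {
        return '`';  // MySQL quotes identifiers with backticks by default
    } else {
        return '"';  // PostgreSQL uses double quotes
    }
}

template <Dialect D>
constexpr bool starts_quoted_identifier(char c) {
    return c == identifier_quote<D>();
}

static_assert(identifier_quote<Dialect::MySQL>() == '`');
static_assert(starts_quoted_identifier<Dialect::PostgreSQL>('"'));
```

Since `D` is a template parameter, the compiler resolves the branch during instantiation, which is the sense in which dialect selection carries no runtime cost.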

## Testing

430 unit tests, plus validation against 86K+ queries from 7 external corpora:

| Corpus | Queries | OK Rate |
|---|---|---|
| PostgreSQL regression suite | 55,553 | 99.6% |
| MySQL MTR test suite | 2,270 | 99.9% |
| CockroachDB parser testdata | 17,429 | 95.1% |
| sqlparser-rs test cases | 2,431 | 99.5% |
| Vitess test cases | 2,291 | 99.8% |
| TiDB test cases | 5,043 | 99.8% |
| SQLGlot fixtures | 1,450 | 98.2% |

## File Layout

```
include/sql_parser/
    parser.h                  Public API: Parser<D>
    common.h                  StringRef, Dialect, StmtType, NodeType enums
    arena.h                   Arena allocator
    ast.h                     AstNode (32 bytes)
    token.h                   TokenType enum
    tokenizer.h               Tokenizer<D> (header-only)
    expression_parser.h       Pratt expression parser
    select_parser.h           SELECT deep parser
    set_parser.h              SET deep parser
    insert_parser.h           INSERT/REPLACE deep parser
    update_parser.h           UPDATE deep parser
    delete_parser.h           DELETE deep parser
    compound_query_parser.h   UNION/INTERSECT/EXCEPT
    table_ref_parser.h        Shared FROM/JOIN parsing
    emitter.h                 AST → SQL reconstruction
    digest.h                  Query normalization + hash
    parse_result.h            ParseResult, BoundValue
    stmt_cache.h              Prepared statement LRU cache
    string_builder.h          Arena-backed string builder
    keywords_mysql.h          MySQL keyword table
    keywords_pgsql.h          PostgreSQL keyword table

src/sql_parser/
    arena.cpp                 Arena implementation
    parser.cpp                Classifier + integration

tests/                        430 unit tests (Google Test)
bench/                        18 benchmarks + comparison suite
```

## License

See [LICENSE](LICENSE) file.
