docs: add README.md with usage/benchmarks/architecture, update CLAUDE.md for current codebase

renecannao · renecannao · commit 43fe79aa1d04 · 2026-03-24T21:32:29.000Z
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,85 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+High-performance hand-written recursive descent SQL parser for ProxySQL. Supports MySQL and PostgreSQL dialects via compile-time templating (`Parser<Dialect::MySQL>` / `Parser<Dialect::PostgreSQL>`). Designed for sub-microsecond latency on the proxy hot path.
+
+## Build Commands
+
+```bash
+make -f Makefile.new all           # Build library + run all 430 tests
+make -f Makefile.new lib           # Build only libsqlparser.a
+make -f Makefile.new test          # Build + run tests
+make -f Makefile.new bench         # Build + run benchmarks
+make -f Makefile.new bench-compare # Run comparison vs libpg_query (requires libpg_query built)
+make -f Makefile.new build-corpus-test  # Build corpus test harness
+make -f Makefile.new clean         # Remove all build artifacts
+```
+
+For release benchmarks: `sed 's/-g -O2/-O3/' Makefile.new > /tmp/Makefile.release && make -f /tmp/Makefile.release bench`
+
+**Note:** The old `Makefile` (no `.new`) is for the legacy Flex/Bison parser — do not use it for new code.
+
+## Architecture
+
+### Three-layer pipeline
+
+1. **Tokenizer** (`tokenizer.h`) — Zero-copy pull-based iterator, dialect-templated. Keyword lookup via sorted-array binary search. Produces `Token{type, StringRef, offset}`.
+2. **Classifier** (`parser.cpp:classify_and_dispatch()`) — Switch on first token. Routes to Tier 1 deep parser or Tier 2 extractor.
+3. **Statement parsers** — Each Tier 1 statement has its own header-only template class (e.g., `SelectParser<D>`, `SetParser<D>`).
+
+### Key types
+
+- `Arena` — Block-chained bump allocator. 64KB default, 1MB max. O(1) reset.
+- `StringRef` — `{const char* ptr, uint32_t len}`. Zero-copy view into input buffer. Trivially copyable.
+- `AstNode` — 32 bytes. Intrusive linked list (first_child + next_sibling). Arena-allocated.
+- `ParseResult` — Status (OK/PARTIAL/ERROR) + stmt_type + ast + table_name/schema_name + remaining (for multi-statement).
+
+### Namespace
+
+Everything is in `namespace sql_parser`. All templates are parameterized on `Dialect D` (MySQL or PostgreSQL).
+
+### Adding a new deep parser
+
+1. Create `include/sql_parser/xxx_parser.h` — header-only template following `SetParser<D>` pattern
+2. Add node types to `NodeType` enum in `common.h`
+3. Add tokens to `token.h` and both keyword tables (sorted!)
+4. Add `parse_xxx()` method to `parser.h` and implement in `parser.cpp`
+5. Update `classify_and_dispatch()` switch to route to new parser
+6. Add emit methods to `emitter.h`
+7. Add `is_keyword_as_identifier()` entries in `expression_parser.h` for new keywords
+8. Update `is_alias_start()` blocklist in `table_ref_parser.h` for clause-starting keywords
+9. Write tests in `tests/test_xxx.cpp`, add to `Makefile.new` TEST_SRCS
+
+### Expression parsing
+
+`ExpressionParser<D>` uses Pratt parsing (precedence climbing). Used by all Tier 1 parsers for WHERE conditions, SET values, function args, etc. Handles: literals, identifiers, binary/unary ops, IS NULL, BETWEEN, IN, NOT IN/BETWEEN/LIKE, CASE/WHEN, function calls, subqueries, ARRAY constructors, tuple constructors, field access.
+
+### Table reference parsing
+
+`TableRefParser<D>` is a shared utility extracted from SelectParser. Used by SELECT (FROM), UPDATE (MySQL multi-table), DELETE (MySQL multi-table), INSERT (for INSERT...SELECT). Handles: simple tables, qualified names, aliases, JOINs (all types), subqueries in FROM.
+
+### Emitter
+
+`Emitter<D>` walks AST and produces SQL text into arena-backed `StringBuilder`. Supports:
+- Normal mode: faithful round-trip reconstruction
+- Digest mode (`EmitMode::DIGEST`): literals→`?`, IN collapsing, keyword uppercasing
+- Bindings mode: materializes `?` placeholders with bound parameter values
+
+### Tests
+
+Google Test. 430 tests across 16 test files. Validated against 86K+ external queries (PostgreSQL regression, MySQL MTR, CockroachDB, Vitess, TiDB, sqlparser-rs, SQLGlot).
+
+Run a single test: `./run_tests --gtest_filter="*SetTest*"`
+
+### Benchmarks
+
+Google Benchmark. 18 single-thread + 16 multi-thread + 4 percentile benchmarks.
+Comparison benchmarks against libpg_query and sqlparser-rs in `bench/bench_comparison.cpp`.
+
+### Corpus testing
+
+`corpus_test` binary reads SQL from stdin (one per line), parses each, reports OK/PARTIAL/ERROR counts.
+Usage: `./corpus_test mysql < queries.sql` or `./corpus_test pgsql < queries.sql`
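
Note on step 3 of "Adding a new deep parser" in CLAUDE.md above: the keyword tables have to stay sorted because the tokenizer looks keywords up with a binary search over a sorted array. The following is a minimal C++17 sketch of that idea under stated assumptions. The `Keyword` struct, the `TokenType` values, the four-entry table, and the helpers `compare_ci` and `lookup_keyword` are hypothetical illustrations, not the contents of `token.h`, `keywords_mysql.h`, or `tokenizer.h`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string_view>

// Hypothetical token types; the real enum lives in token.h.
enum class TokenType { Identifier, From, Select, Set, Where };

struct Keyword {
    std::string_view text;  // upper-case spelling of the keyword
    TokenType type;
};

// Must be sorted by `text`. An out-of-order entry would not crash,
// it would simply never be found by the binary search below.
constexpr Keyword kKeywords[] = {
    {"FROM", TokenType::From},
    {"SELECT", TokenType::Select},
    {"SET", TokenType::Set},
    {"WHERE", TokenType::Where},
};

// Case-insensitive three-way comparison of an input word against an
// upper-case keyword. Returns <0, 0, or >0 like strcmp.
inline int compare_ci(std::string_view word, std::string_view kw) {
    const std::size_t n = std::min(word.size(), kw.size());
    for (std::size_t i = 0; i < n; ++i) {
        const int a = std::toupper(static_cast<unsigned char>(word[i]));
        const int b = static_cast<unsigned char>(kw[i]);
        if (a != b) return a < b ? -1 : 1;
    }
    if (word.size() == kw.size()) return 0;
    return word.size() < kw.size() ? -1 : 1;
}

// Binary search over the sorted table; anything that is not a keyword
// is reported as a plain identifier.
inline TokenType lookup_keyword(std::string_view word) {
    const Keyword* first = std::begin(kKeywords);
    const Keyword* last = std::end(kKeywords);
    const Keyword* it = std::lower_bound(
        first, last, word,
        [](const Keyword& k, std::string_view w) {
            return compare_ci(w, k.text) > 0;  // true when k sorts before w
        });
    if (it != last && compare_ci(word, it->text) == 0) return it->type;
    return TokenType::Identifier;
}
```

With this table, `lookup_keyword("select")` yields `TokenType::Select`, while `lookup_keyword("users")` falls back to `TokenType::Identifier`. The sketch takes a `std::string_view` so that no copy of the input is made, which is in the spirit of the zero-copy `StringRef` described above.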
diff --git a/README.md b/README.md
@@ -0,0 +1,178 @@
+# ParserSQL
+
+A high-performance, hand-written recursive descent SQL parser for [ProxySQL](https://github.com/sysown/proxysql). Supports both MySQL and PostgreSQL dialects with compile-time dispatch — zero runtime overhead for dialect selection.
+
+## Performance
+
+Most operations complete in under a microsecond on modern hardware; complex SELECT statements take slightly longer:
+
+| Operation | Latency | Notes |
+|---|---|---|
+| Classify statement (BEGIN) | **36 ns** | Tier 2: type + metadata only |
+| Parse SET statement | **130 ns** | Full AST |
+| Parse simple SELECT | **223 ns** | Full AST |
+| Parse complex SELECT (JOINs, GROUP BY, HAVING) | **1.4 µs** | Full AST |
+| Parse INSERT | **244 ns** | Full AST |
+| Query reconstruction (round-trip) | **132-263 ns** | Parse → emit |
+| Arena reset | **3.6 ns** | O(1) pointer rewind |
+
+Compared to other parsers on the same queries:
+
+| Parser | Simple SELECT | Complex SELECT | Notes |
+|---|---|---|---|
+| **ParserSQL** | **223 ns** | **1,189 ns** | This project |
+| libpg_query (raw parse) | 684 ns (3.1x slower) | 3,304 ns (2.8x) | PostgreSQL's own parser |
+| sqlparser-rs (Rust) | 4,687 ns (21x slower) | 23,411 ns (19x) | Apache DataFusion |
+
+See [docs/benchmarks/](docs/benchmarks/) for full results and [REPRODUCING.md](docs/benchmarks/REPRODUCING.md) for reproduction instructions.
+
+## Features
+
+- **Deep parsing (Tier 1):** SELECT, INSERT, UPDATE, DELETE, SET, REPLACE, EXPLAIN, CALL, DO, LOAD DATA
+- **Compound queries:** UNION / INTERSECT / EXCEPT with SQL-standard precedence and parenthesized nesting
+- **Tier 2 classification:** All other statement types (DDL, transactions, SHOW, GRANT, etc.)
+- **Query reconstruction:** Parse → modify AST → emit valid SQL
+- **Query digest:** Normalize queries for fingerprinting (literals → `?`, IN list collapsing, keyword uppercasing) with 64-bit FNV-1a hash
+- **Prepared statement cache:** LRU cache with `parse_and_cache()` / `execute()` for binary protocol support
+- **Both dialects:** MySQL and PostgreSQL via `Parser<Dialect::MySQL>` / `Parser<Dialect::PostgreSQL>`
+- **Thread-safe:** One parser instance per thread, zero shared state, no locks
+
+## Quick Start
+
+```cpp
+#include "sql_parser/parser.h"
+#include "sql_parser/emitter.h"
+
+using namespace sql_parser;
+
+// Create a parser (one per thread)
+Parser<Dialect::MySQL> parser;
+
+// Parse a query
+auto result = parser.parse("SELECT * FROM users WHERE id = 1", 32);
+if (result.ok()) {
+    // result.ast contains the full AST
+    // result.stmt_type == StmtType::SELECT
+}
+
+// Reconstruct SQL from AST
+Emitter<Dialect::MySQL> emitter(parser.arena());
+emitter.emit(result.ast);
+StringRef sql = emitter.result();  // "SELECT * FROM users WHERE id = 1"
+
+// Query digest (normalize for fingerprinting)
+Digest<Dialect::MySQL> digest(parser.arena());
+DigestResult dr = digest.compute(result.ast);
+// dr.normalized = "SELECT * FROM users WHERE id = ?"
+// dr.hash = 0x... (64-bit FNV-1a)
+
+// Modify AST and re-emit
+// ... modify nodes ...
+// Emitter<Dialect::MySQL> emitter2(parser.arena());
+// emitter2.emit(result.ast);  // emits modified SQL
+
+// Reset arena after each query (O(1), reuses memory)
+parser.reset();
+```
+
+## Building
+
+```bash
+# Build library + run tests
+make -f Makefile.new all
+
+# Build and run benchmarks
+make -f Makefile.new bench
+
+# Build comparison benchmarks (requires libpg_query)
+cd third_party/libpg_query && make && cd ../..
+make -f Makefile.new bench-compare
+```
+
+Requires: `g++` or `clang++` with C++17 support. No external dependencies for the parser itself. Google Test and Google Benchmark are vendored in `third_party/`.
+
+## Architecture
+
+```
+Input SQL bytes
+       │
+       ▼
+┌──────────────┐
+│  Tokenizer   │  Zero-copy, dialect-templated, pull-based
+│  <Dialect D> │  Binary search keyword lookup (~110 keywords)
+└──────┬───────┘
+       │
+       ▼
+┌──────────────┐
+│  Classifier  │  Switch on first token → route to parser
+└──────┬───────┘
+       │
+       ├──── Tier 1 ──► Deep parser (SELECT, INSERT, UPDATE, DELETE, SET, ...)
+       │                  │
+       │                  ▼
+       │                Full AST in arena
+       │
+       └──── Tier 2 ──► Lightweight extractor
+                          │
+                          ▼
+                        StmtType + table name + metadata
+```
+
+**Key design decisions:**
+- **Arena allocator** — 64KB bump allocator, O(1) reset. All AST nodes allocated from arena. No per-node new/delete.
+- **Zero-copy StringRef** — Token values point into original input buffer. No string copies during parsing.
+- **32-byte AstNode** — Compact intrusive linked-list (first_child + next_sibling). Half a cache line.
+- **Compile-time dialect dispatch** — `if constexpr` for MySQL vs PostgreSQL differences. Zero runtime overhead.
+- **Header-only parsers** — Maximum inlining opportunity. Only `arena.cpp` and `parser.cpp` are compiled separately.
+
+## Testing
+
+430 unit tests, plus validation against 86K+ queries from the 7 external corpora listed below:
+
+| Corpus | Queries | OK Rate |
+|---|---|---|
+| PostgreSQL regression suite | 55,553 | 99.6% |
+| MySQL MTR test suite | 2,270 | 99.9% |
+| CockroachDB parser testdata | 17,429 | 95.1% |
+| sqlparser-rs test cases | 2,431 | 99.5% |
+| Vitess test cases | 2,291 | 99.8% |
+| TiDB test cases | 5,043 | 99.8% |
+| SQLGlot fixtures | 1,450 | 98.2% |
+
+## File Layout
+
+```
+include/sql_parser/
+    parser.h              Public API: Parser<D>
+    common.h              StringRef, Dialect, StmtType, NodeType enums
+    arena.h               Arena allocator
+    ast.h                 AstNode (32 bytes)
+    token.h               TokenType enum
+    tokenizer.h           Tokenizer<D> (header-only)
+    expression_parser.h   Pratt expression parser
+    select_parser.h       SELECT deep parser
+    set_parser.h          SET deep parser
+    insert_parser.h       INSERT/REPLACE deep parser
+    update_parser.h       UPDATE deep parser
+    delete_parser.h       DELETE deep parser
+    compound_query_parser.h  UNION/INTERSECT/EXCEPT
+    table_ref_parser.h    Shared FROM/JOIN parsing
+    emitter.h             AST → SQL reconstruction
+    digest.h              Query normalization + hash
+    parse_result.h        ParseResult, BoundValue
+    stmt_cache.h          Prepared statement LRU cache
+    string_builder.h      Arena-backed string builder
+    keywords_mysql.h      MySQL keyword table
+    keywords_pgsql.h      PostgreSQL keyword table
+
+src/sql_parser/
+    arena.cpp             Arena implementation
+    parser.cpp            Classifier + integration
+
+tests/                    430 unit tests (Google Test)
+bench/                    18 benchmarks + comparison suite
+```
+
+## License
+
+See [LICENSE](LICENSE) file.
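
Note on the "Arena allocator" design decision in the README above: a block-chained bump allocator with O(1) reset can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `arena.h` / `arena.cpp`. The class name `SketchArena`, its `Block` layout, and the choice to keep chained blocks for reuse after `reset()` are all hypothetical, and the 1MB maximum mentioned in CLAUDE.md is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical sketch of a block-chained bump allocator.
class SketchArena {
public:
    explicit SketchArena(std::size_t block_size = 64 * 1024)
        : block_size_(block_size), head_(new_block(block_size)), current_(head_) {}

    ~SketchArena() {
        for (Block* b = head_; b != nullptr;) {
            Block* next = b->next;
            std::free(b);
            b = next;
        }
    }

    SketchArena(const SketchArena&) = delete;
    SketchArena& operator=(const SketchArena&) = delete;

    // Bump-allocate `size` bytes. `align` must be a power of two no larger
    // than alignof(std::max_align_t).
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t offset = (current_->used + align - 1) & ~(align - 1);
        if (offset + size > current_->capacity) {
            // Current block is full: reuse the next chained block if it is
            // large enough, otherwise link in a fresh one.
            if (current_->next == nullptr || current_->next->capacity < size) {
                Block* b = new_block(size > block_size_ ? size : block_size_);
                b->next = current_->next;
                current_->next = b;
            }
            current_ = current_->next;
            offset = 0;
        }
        current_->used = offset + size;
        return current_->data() + offset;
    }

    // O(1): rewind to the first block. Later blocks are lazily rewound
    // when allocate() advances into them again.
    void reset() {
        current_ = head_;
        head_->used = 0;
    }

private:
    struct alignas(std::max_align_t) Block {
        Block* next;
        std::size_t capacity;
        std::size_t used;
        // Payload starts right after the header; the header size is a
        // multiple of max_align_t, so the payload is suitably aligned.
        unsigned char* data() { return reinterpret_cast<unsigned char*>(this + 1); }
    };

    static Block* new_block(std::size_t capacity) {
        void* mem = std::malloc(sizeof(Block) + capacity);
        if (mem == nullptr) throw std::bad_alloc();
        return new (mem) Block{nullptr, capacity, 0};
    }

    std::size_t block_size_;
    Block* head_;
    Block* current_;
};
```

Because `reset()` only rewinds a pointer and a counter, its cost does not depend on how many nodes were allocated, which is the property the README's "O(1) reset" refers to. Objects placed in such an arena should be trivially destructible, since no destructors are run on reset.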