Commit 3ee471e

docs: update README and CLAUDE.md for full parser + query engine
1 parent 2d36d58 commit 3ee471e

6 files changed

Lines changed: 315 additions & 45 deletions

CLAUDE.md

Lines changed: 88 additions & 7 deletions
@@ -4,12 +4,12 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
44

55
## Project Overview
66

7-
High-performance hand-written recursive descent SQL parser for ProxySQL. Supports MySQL and PostgreSQL dialects via compile-time templating (`Parser<Dialect::MySQL>` / `Parser<Dialect::PostgreSQL>`). Designed for sub-microsecond latency on the proxy hot path.
7+
High-performance hand-written recursive descent SQL parser and query engine for ProxySQL. Supports MySQL and PostgreSQL dialects via compile-time templating (`Parser<Dialect::MySQL>` / `Parser<Dialect::PostgreSQL>`). Designed for sub-microsecond latency on the proxy hot path. The query engine takes parsed ASTs and executes them through a Volcano-model operator pipeline.
88

99
## Build Commands
1010

1111
```bash
12-
make -f Makefile.new all # Build library + run all 430 tests
12+
make -f Makefile.new all # Build library + run all 871 tests
1313
make -f Makefile.new lib # Build only libsqlparser.a
1414
make -f Makefile.new test # Build + run tests
1515
make -f Makefile.new bench # Build + run benchmarks
@@ -22,7 +22,7 @@ For release benchmarks: `sed 's/-g -O2/-O3/' Makefile.new > /tmp/Makefile.releas
2222

2323
**Note:** The old `Makefile` (no `.new`) is for the legacy Flex/Bison parser — do not use it for new code.
2424

25-
## Architecture
25+
## Parser Architecture
2626

2727
### Three-layer pipeline
2828

@@ -68,18 +68,99 @@ Everything is in `namespace sql_parser`. All templates are parameterized on `Dia
6868
- Digest mode (`EmitMode::DIGEST`): literals→`?`, IN collapsing, keyword uppercasing
6969
- Bindings mode: materializes `?` placeholders with bound parameter values
7070

71-
### Tests
71+
## Query Engine
7272

73-
Google Test. 430 tests across 16 test files. Validated against 86K+ external queries (PostgreSQL regression, MySQL MTR, CockroachDB, Vitess, TiDB, sqlparser-rs, SQLGlot).
73+
### Architecture
74+
75+
The engine follows a five-component pipeline:
76+
77+
1. **Type System** (`types.h`, `value.h`) — `SqlType` describes column types (30+ kinds). `Value` is a 14-tag discriminated union for runtime values (null, bool, int64, uint64, double, decimal, string, bytes, date, time, datetime, timestamp, interval, json).
78+
2. **Expression Evaluator** (`expression_eval.h`) — Recursively evaluates AST expression nodes against a row. Handles arithmetic, comparisons, boolean logic (three-valued), BETWEEN, IN, LIKE, CASE/WHEN, function calls. Uses `CoercionRules<D>` for type promotion and `null_semantics` for NULL propagation.
79+
3. **Catalog** (`catalog.h`, `in_memory_catalog.h`) — Abstract interface for table/column metadata. `InMemoryCatalog` is the hash-map implementation. `CatalogResolver` creates column-resolve callbacks from catalog + table + row.
80+
4. **Plan Builder** (`plan_builder.h`) — Translates a SELECT AST into a `PlanNode` tree. Translation order: FROM (Scan/Join) → WHERE (Filter) → GROUP BY (Aggregate) → HAVING (Filter) → SELECT list (Project) → DISTINCT → ORDER BY (Sort) → LIMIT.
81+
5. **Executor** (`plan_executor.h`) — Converts a `PlanNode` tree into an `Operator` tree (Volcano model: open/next/close). Pulls rows through the tree and collects them into a `ResultSet`.
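
The open/next/close contract described in step 5 can be sketched with two toy operators. This is a minimal illustration, not the engine's code: `DemoRow`, `DemoOperator`, `DemoScan`, `DemoFilter`, and `run_pipeline` are hypothetical names, and rows are plain `std::vector<int>` rather than the arena-allocated `Row`.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical simplified row; the real engine uses Row{Value*, uint16_t}.
using DemoRow = std::vector<int>;

// Volcano-style iterator interface: open(), next(), close().
struct DemoOperator {
    virtual ~DemoOperator() = default;
    virtual void open() = 0;
    virtual bool next(DemoRow& out) = 0;  // returns false when exhausted
    virtual void close() = 0;
};

// Leaf operator: yields rows from an in-memory table.
class DemoScan : public DemoOperator {
public:
    explicit DemoScan(std::vector<DemoRow> rows) : rows_(std::move(rows)) {}
    void open() override { pos_ = 0; }
    bool next(DemoRow& out) override {
        if (pos_ >= rows_.size()) return false;
        out = rows_[pos_++];
        return true;
    }
    void close() override {}
private:
    std::vector<DemoRow> rows_;
    std::size_t pos_ = 0;
};

// Filter operator: pulls from its child until the predicate accepts a row.
class DemoFilter : public DemoOperator {
public:
    DemoFilter(DemoOperator& child, std::function<bool(const DemoRow&)> pred)
        : child_(child), pred_(std::move(pred)) {}
    void open() override { child_.open(); }
    bool next(DemoRow& out) override {
        while (child_.next(out)) {
            if (pred_(out)) return true;
        }
        return false;
    }
    void close() override { child_.close(); }
private:
    DemoOperator& child_;
    std::function<bool(const DemoRow&)> pred_;
};

// Drains the root operator into a vector, one row per next() call.
inline std::vector<DemoRow> run_pipeline(DemoOperator& root) {
    std::vector<DemoRow> result;
    DemoRow row;
    root.open();
    while (root.next(row)) result.push_back(row);
    root.close();
    return result;
}
```

Each `next()` call on the root pulls exactly one row up through the tree, so a filter above a scan never materializes the full table.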
82+
83+
### Key types
84+
85+
- `Value` — 14-tag discriminated union. Constructors: `value_null()`, `value_int(i)`, `value_string(s)`, etc.
86+
- `SqlType` — Column type descriptor with `Kind` enum, precision, scale, unsigned flag, timezone flag.
87+
- `Row` — `{Value* values, uint16_t column_count}`. Arena-allocated via `make_row()`.
88+
- `PlanNode` — Arena-allocated union node. Types: SCAN, FILTER, PROJECT, JOIN, AGGREGATE, SORT, LIMIT, DISTINCT, SET_OP.
89+
- `Operator` — Abstract base class with `open()`, `next(Row&)`, `close()`. Nine implementations in `operators/`.
90+
- `ResultSet` — `{vector<Row> rows, vector<string> column_names, uint16_t column_count}`.
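
To make the idea of a tagged runtime value concrete, here is a small sketch with three tags instead of fourteen. `DemoValue`, `demo_null`, `demo_int`, `demo_string`, and `to_debug_string` are hypothetical names chosen for illustration, and the sketch uses a plain struct with a tag field rather than a true union, for brevity.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical three-tag subset of a tagged runtime value.
// A plain struct stands in for the real discriminated union.
struct DemoValue {
    enum class Tag : std::uint8_t { Null, Int64, String };
    Tag tag = Tag::Null;
    std::int64_t i = 0;  // meaningful only when tag == Tag::Int64
    std::string s;       // meaningful only when tag == Tag::String
};

// Constructor helpers in the spirit of value_null() / value_int() / value_string().
inline DemoValue demo_null() { return DemoValue{}; }

inline DemoValue demo_int(std::int64_t v) {
    DemoValue x;
    x.tag = DemoValue::Tag::Int64;
    x.i = v;
    return x;
}

inline DemoValue demo_string(std::string v) {
    DemoValue x;
    x.tag = DemoValue::Tag::String;
    x.s = std::move(v);
    return x;
}

// Dispatches on the tag to render a value for debugging.
inline std::string to_debug_string(const DemoValue& v) {
    switch (v.tag) {
        case DemoValue::Tag::Null:   return "NULL";
        case DemoValue::Tag::Int64:  return std::to_string(v.i);
        case DemoValue::Tag::String: return "'" + v.s + "'";
    }
    return "?";
}
```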
91+
92+
### How to add a new operator
93+
94+
1. Create `include/sql_engine/operators/xxx_op.h` — implement `Operator` (open/next/close)
95+
2. Add a new `PlanNodeType` enum value in `plan_node.h` and a corresponding union member in `PlanNode`
96+
3. Add `build_xxx()` in `PlanExecutor` (in `plan_executor.h`) and wire it into `build_operator()` switch
97+
4. Add translation logic in `PlanBuilder::build_select()` (in `plan_builder.h`)
98+
5. Include the new header in `plan_executor.h`
99+
6. Write tests in `tests/test_operators.cpp` or a new test file
100+
101+
### How to add a new SQL function
102+
103+
1. Write the function in the appropriate file under `include/sql_engine/functions/` (arithmetic.h, string.h, comparison.h, or cast.h). Signature: `Value fn(const Value* args, uint16_t arg_count, Arena& arena)`.
104+
2. Register it in `src/sql_engine/function_registry.cpp` inside `register_builtins()`:
105+
```cpp
106+
register_function({"MYFUNC", 6, my_func_impl, 1, 2});
107+
// name, name_len, impl, min_args, max_args (255=variadic)
108+
```
109+
3. Write tests in `tests/test_string_funcs.cpp`, `tests/test_arithmetic.cpp`, or `tests/test_registry.cpp`
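
As a rough illustration of steps 1 and 2, the sketch below writes a function with the documented signature shape and checks arity against a registry entry. `DemoArena`, `DemoValue`, `DemoFunctionEntry`, and `arity_ok` are stand-ins invented for this example; only the argument order (name, name length, implementation, min args, max args with 255 meaning variadic) comes from the text above.

```cpp
#include <cstdint>

// Placeholder types: the real Value and Arena live in the engine headers.
struct DemoArena {};
struct DemoValue {
    bool is_null;
    std::int64_t i;
};

// Same shape as the documented signature:
// Value fn(const Value* args, uint16_t arg_count, Arena& arena)
using DemoFn = DemoValue (*)(const DemoValue* args, std::uint16_t arg_count,
                             DemoArena& arena);

// Hypothetical registry entry mirroring the five documented fields.
struct DemoFunctionEntry {
    const char* name;
    std::uint8_t name_len;
    DemoFn impl;
    std::uint8_t min_args;
    std::uint8_t max_args;  // 255 = variadic
};

// MYFUNC(x [, y]): returns x + y, with y defaulting to 1.
// Any NULL argument yields NULL.
inline DemoValue my_func_impl(const DemoValue* args, std::uint16_t arg_count,
                              DemoArena& /*arena*/) {
    for (std::uint16_t k = 0; k < arg_count; ++k) {
        if (args[k].is_null) return DemoValue{true, 0};
    }
    const std::int64_t step = (arg_count == 2) ? args[1].i : 1;
    return DemoValue{false, args[0].i + step};
}

// Checks a call's argument count against the entry's declared bounds.
inline bool arity_ok(const DemoFunctionEntry& e, std::uint16_t n) {
    if (n < e.min_args) return false;
    return e.max_args == 255 || n <= e.max_args;
}

inline const DemoFunctionEntry kMyFunc{"MYFUNC", 6, my_func_impl, 1, 2};
```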
110+
111+
### How to implement a custom DataSource
112+
113+
Implement the `DataSource` interface in `data_source.h`:
114+
115+
```cpp
116+
class MySource : public DataSource {
117+
const TableInfo* table_info() const override; // return table metadata
118+
void open() override; // initialize cursor
119+
bool next(Row& out) override; // fill row, return false when done
120+
void close() override; // release resources
121+
};
122+
```
123+
124+
Then register it with the executor: `executor.add_data_source("table_name", &my_source);`
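
A vector-backed source is probably the simplest way to see the cursor contract in action. The sketch below is self-contained and uses hypothetical stand-ins (`DemoTableInfo`, `DemoRow`, `DemoDataSource`, `VectorSource`) rather than the real headers; only the four method names come from the interface shown above.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins for TableInfo and Row.
struct DemoTableInfo {
    const char* name;
    std::uint16_t column_count;
};
struct DemoRow {
    std::vector<std::int64_t> values;
};

// Same four methods as the DataSource interface described above.
struct DemoDataSource {
    virtual ~DemoDataSource() = default;
    virtual const DemoTableInfo* table_info() const = 0;
    virtual void open() = 0;
    virtual bool next(DemoRow& out) = 0;
    virtual void close() = 0;
};

// Cursor over an in-memory vector of rows.
class VectorSource : public DemoDataSource {
public:
    VectorSource(const DemoTableInfo* info, std::vector<DemoRow> rows)
        : info_(info), rows_(std::move(rows)) {}

    const DemoTableInfo* table_info() const override { return info_; }

    // open() rewinds the cursor, so the source can be scanned repeatedly.
    void open() override { pos_ = 0; }

    bool next(DemoRow& out) override {
        if (pos_ >= rows_.size()) return false;
        out = rows_[pos_++];
        return true;
    }

    void close() override {}

private:
    const DemoTableInfo* info_;
    std::vector<DemoRow> rows_;
    std::size_t pos_ = 0;
};
```

Rewinding in `open()` matters if an operator such as a nested-loop join needs to scan the same source more than once; whether the real executor relies on that is an assumption here.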
125+
126+
### How to implement a custom Catalog
127+
128+
Implement the `Catalog` interface in `catalog.h`:
129+
130+
```cpp
131+
class MyCatalog : public Catalog {
132+
const TableInfo* get_table(StringRef name) const override;
133+
const TableInfo* get_table(StringRef schema, StringRef table) const override;
134+
const ColumnInfo* get_column(const TableInfo* table, StringRef column_name) const override;
135+
};
136+
```
137+
138+
`TableInfo` must outlive any queries that reference it. `ColumnInfo::ordinal` must match the column position in rows returned by the corresponding `DataSource`.
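
The two constraints above (stable `TableInfo` lifetime, ordinals matching row positions) can be illustrated with a small hash-map catalog. This is a sketch under assumptions: `DemoCatalog`, `DemoTableInfo`, and `DemoColumnInfo` are hypothetical, and `std::string_view` stands in for `StringRef`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical metadata records; ordinal is the column's position in a row.
struct DemoColumnInfo {
    std::string name;
    std::uint16_t ordinal;
};
struct DemoTableInfo {
    std::string name;
    std::vector<DemoColumnInfo> columns;
};

class DemoCatalog {
public:
    // Assigns ordinals in declaration order so they line up with row layout.
    void add_table(const std::string& name,
                   const std::vector<std::string>& column_names) {
        DemoTableInfo info;
        info.name = name;
        for (std::size_t i = 0; i < column_names.size(); ++i) {
            info.columns.push_back(
                {column_names[i], static_cast<std::uint16_t>(i)});
        }
        tables_[name] = std::move(info);
    }

    // Returned pointers stay valid while the entry exists: unordered_map
    // does not move its elements on rehash.
    const DemoTableInfo* get_table(std::string_view name) const {
        auto it = tables_.find(std::string(name));
        return it == tables_.end() ? nullptr : &it->second;
    }

    const DemoColumnInfo* get_column(const DemoTableInfo* table,
                                     std::string_view column_name) const {
        if (table == nullptr) return nullptr;
        for (const DemoColumnInfo& c : table->columns) {
            if (c.name == column_name) return &c;
        }
        return nullptr;
    }

private:
    std::unordered_map<std::string, DemoTableInfo> tables_;
};
```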
139+
140+
### Engine namespace
141+
142+
Everything is in `namespace sql_engine`. Templates are parameterized on `Dialect D` where dialect-specific behavior applies (coercion rules, `||` semantics, LIKE matching).
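
As an example of dialect-specific behavior, `||` is string concatenation in PostgreSQL but logical OR in MySQL's default SQL mode. The sketch below shows how `if constexpr` can select between the two at compile time. `DemoDialect`, `demo_truthy`, and `eval_double_pipe` are hypothetical, operands are assumed non-NULL strings, and the MySQL truthiness rule is simplified to "parses as a nonzero number".

```cpp
#include <cstdlib>
#include <string>

enum class DemoDialect { MySQL, PostgreSQL };

// Simplified MySQL-style truthiness: a string is true if its leading
// numeric prefix is nonzero ("abc" parses as 0, hence false).
inline bool demo_truthy(const std::string& s) {
    return std::strtod(s.c_str(), nullptr) != 0.0;
}

// Evaluates `lhs || rhs` for non-NULL string operands.
// The branch not taken is discarded at compile time.
template <DemoDialect D>
std::string eval_double_pipe(const std::string& lhs, const std::string& rhs) {
    if constexpr (D == DemoDialect::PostgreSQL) {
        return lhs + rhs;  // string concatenation
    } else {
        // MySQL default mode: logical OR, yielding 1 or 0.
        return (demo_truthy(lhs) || demo_truthy(rhs)) ? "1" : "0";
    }
}
```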
143+
144+
## Tests
145+
146+
Google Test. 871 tests across 30 test files. Validated against 86K+ external queries (PostgreSQL regression, MySQL MTR, CockroachDB, Vitess, TiDB, sqlparser-rs, SQLGlot).
74147
75148
Run a single test: `./run_tests --gtest_filter="*SetTest*"`
76149
77-
### Benchmarks
150+
### Test files by component
151+
152+
**Parser:**
153+
`test_tokenizer.cpp`, `test_classifier.cpp`, `test_expression.cpp`, `test_select.cpp`, `test_insert.cpp`, `test_update.cpp`, `test_delete.cpp`, `test_set.cpp`, `test_compound.cpp`, `test_emitter.cpp`, `test_digest.cpp`, `test_stmt_cache.cpp`, `test_arena.cpp`, `test_misc_stmts.cpp`
154+
155+
**Engine:**
156+
`test_value.cpp`, `test_row.cpp`, `test_coercion.cpp`, `test_null_semantics.cpp`, `test_like.cpp`, `test_expression_eval.cpp`, `test_eval_integration.cpp`, `test_catalog.cpp`, `test_registry.cpp`, `test_arithmetic.cpp`, `test_comparison.cpp`, `test_cast.cpp`, `test_string_funcs.cpp`, `test_operators.cpp`, `test_plan_builder.cpp`, `test_plan_executor.cpp`
157+
158+
## Benchmarks
78159
79160
Google Benchmark. 18 single-thread + 16 multi-thread + 4 percentile benchmarks.
80161
Comparison benchmarks against libpg_query and sqlparser-rs in `bench/bench_comparison.cpp`.
81162
82-
### Corpus testing
163+
## Corpus testing
83164
84165
`corpus_test` binary reads SQL from stdin (one per line), parses each, reports OK/PARTIAL/ERROR counts.
85166
Usage: `./corpus_test mysql < queries.sql` or `./corpus_test pgsql < queries.sql`

README.md

Lines changed: 154 additions & 38 deletions
@@ -1,10 +1,10 @@
1-
# ParserSQL
1+
# ParserSQL — High-Performance SQL Parser & Query Engine
22

3-
A high-performance, hand-written recursive descent SQL parser for [ProxySQL](https://github.com/sysown/proxysql). Supports both MySQL and PostgreSQL dialects with compile-time dispatch — zero runtime overhead for dialect selection.
3+
A high-performance, hand-written recursive descent SQL parser and composable query engine for [ProxySQL](https://github.com/sysown/proxysql). Supports both MySQL and PostgreSQL dialects with compile-time dispatch — zero runtime overhead for dialect selection. The parser produces an AST that feeds directly into the query engine's plan builder and executor pipeline.
44

55
## Performance
66

7-
All operations run in sub-microsecond latency on modern hardware:
7+
All parser operations run in sub-microsecond latency on modern hardware:
88

99
| Operation | Latency | Notes |
1010
|---|---|---|
@@ -26,19 +26,10 @@ Compared to other parsers on the same queries:
2626

2727
See [docs/benchmarks/](docs/benchmarks/) for full results and [REPRODUCING.md](docs/benchmarks/REPRODUCING.md) for reproduction instructions.
2828

29-
## Features
30-
31-
- **Deep parsing (Tier 1):** SELECT, INSERT, UPDATE, DELETE, SET, REPLACE, EXPLAIN, CALL, DO, LOAD DATA
32-
- **Compound queries:** UNION / INTERSECT / EXCEPT with SQL-standard precedence and parenthesized nesting
33-
- **Tier 2 classification:** All other statement types (DDL, transactions, SHOW, GRANT, etc.)
34-
- **Query reconstruction:** Parse → modify AST → emit valid SQL
35-
- **Query digest:** Normalize queries for fingerprinting (literals → `?`, IN list collapsing, keyword uppercasing) with 64-bit FNV-1a hash
36-
- **Prepared statement cache:** LRU cache with `parse_and_cache()` / `execute()` for binary protocol support
37-
- **Both dialects:** MySQL and PostgreSQL via `Parser<Dialect::MySQL>` / `Parser<Dialect::PostgreSQL>`
38-
- **Thread-safe:** One parser instance per thread, zero shared state, no locks
39-
4029
## Quick Start
4130

31+
### Parse and emit SQL
32+
4233
```cpp
4334
#include "sql_parser/parser.h"
4435
#include "sql_parser/emitter.h"
@@ -66,30 +57,86 @@ DigestResult dr = digest.compute(result.ast);
6657
// dr.normalized = "SELECT * FROM users WHERE id = ?"
6758
// dr.hash = 0x... (64-bit FNV-1a)
6859

69-
// Modify AST and re-emit
70-
// ... modify nodes ...
71-
// Emitter emitter2(parser.arena());
72-
// emitter2.emit(result.ast); // emits modified SQL
73-
7460
// Reset arena after each query (O(1), reuses memory)
7561
parser.reset();
7662
```
7763
78-
## Building
64+
### Full pipeline: parse, plan, execute
7965
80-
```bash
81-
# Build library + run tests
82-
make -f Makefile.new all
66+
```cpp
67+
#include "sql_parser/parser.h"
68+
#include "sql_engine/plan_builder.h"
69+
#include "sql_engine/plan_executor.h"
70+
#include "sql_engine/in_memory_catalog.h"
71+
#include "sql_engine/data_source.h"
8372
84-
# Build and run benchmarks
85-
make -f Makefile.new bench
73+
using namespace sql_parser;
74+
using namespace sql_engine;
75+
76+
// 1. Set up catalog (table metadata)
77+
InMemoryCatalog catalog;
78+
catalog.add_table("", "users", {
79+
{"id", SqlType::make_int(), false},
80+
{"name", SqlType::make_varchar(255), true},
81+
{"age", SqlType::make_int(), true},
82+
});
83+
84+
// 2. Populate data source
85+
Arena data_arena{65536, 1048576};
86+
std::vector<Row> rows = {
87+
// ... build rows with make_row() + value_int(), value_string(), etc.
88+
};
89+
const TableInfo* table = catalog.get_table(StringRef{"users", 5});
90+
InMemoryDataSource source(table, std::move(rows));
91+
92+
// 3. Register built-in functions (UPPER, LOWER, COALESCE, etc.)
93+
FunctionRegistry<Dialect::MySQL> functions;
94+
functions.register_builtins();
95+
96+
// 4. Parse SQL
97+
Parser<Dialect::MySQL> parser;
98+
auto result = parser.parse("SELECT name, age FROM users WHERE age > 21", 42);
8699
87-
# Build comparison benchmarks (requires libpg_query)
88-
cd third_party/libpg_query && make && cd ../..
89-
make -f Makefile.new bench-compare
100+
// 5. Build logical plan (AST -> plan tree)
101+
PlanBuilder<Dialect::MySQL> builder(catalog, parser.arena());
102+
PlanNode* plan = builder.build(result.ast);
103+
104+
// 6. Execute plan
105+
PlanExecutor<Dialect::MySQL> executor(functions, catalog, parser.arena());
106+
executor.add_data_source("users", &source);
107+
ResultSet rs = executor.execute(plan);
108+
109+
// 7. Read results
110+
for (size_t i = 0; i < rs.row_count(); ++i) {
111+
Row& row = rs.rows[i];
112+
// row.get(0) = name (Value), row.get(1) = age (Value)
113+
}
114+
115+
parser.reset();
90116
```
91117

92-
Requires: `g++` or `clang++` with C++17 support. No external dependencies for the parser itself. Google Test and Google Benchmark are vendored in `third_party/`.
118+
## Features
119+
120+
### Parser
121+
122+
- **Deep parsing (Tier 1):** SELECT, INSERT, UPDATE, DELETE, SET, REPLACE, EXPLAIN, CALL, DO, LOAD DATA
123+
- **Compound queries:** UNION / INTERSECT / EXCEPT with SQL-standard precedence and parenthesized nesting
124+
- **Tier 2 classification:** All other statement types (DDL, transactions, SHOW, GRANT, etc.)
125+
- **Query reconstruction:** Parse → modify AST → emit valid SQL
126+
- **Query digest:** Normalize queries for fingerprinting (literals → `?`, IN list collapsing, keyword uppercasing) with 64-bit FNV-1a hash
127+
- **Prepared statement cache:** LRU cache with `parse_and_cache()` / `execute()` for binary protocol support
128+
- **Both dialects:** MySQL and PostgreSQL via `Parser<Dialect::MySQL>` / `Parser<Dialect::PostgreSQL>`
129+
- **Thread-safe:** One parser instance per thread, zero shared state, no locks
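
The query digest bullet mentions a 64-bit FNV-1a hash. For reference, the standard algorithm fits in a few lines; `fnv1a_64` is a name chosen for this sketch, and how the project feeds normalized SQL into its hash is not shown here.

```cpp
#include <cstdint>
#include <string_view>

// Standard 64-bit FNV-1a: XOR each byte into the hash, then multiply
// by the FNV prime. Constants are the published offset basis and prime.
constexpr std::uint64_t fnv1a_64(std::string_view data) {
    std::uint64_t hash = 14695981039346656037ULL;  // offset basis
    for (char c : data) {
        hash ^= static_cast<std::uint8_t>(c);
        hash *= 1099511628211ULL;  // FNV prime
    }
    return hash;
}
```

Because digest mode replaces literals with `?`, two queries that differ only in literal values normalize to the same string and therefore hash to the same value.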
130+
131+
### Query Engine
132+
133+
- **Type system:** 30+ SQL types (`SqlType::Kind`) with 14-tag runtime `Value` (null, bool, int64, uint64, double, decimal, string, bytes, date, time, datetime, timestamp, interval, json)
134+
- **Expression evaluator:** Recursive AST evaluator with three-valued logic (NULL propagation), type coercion, short-circuit AND/OR, BETWEEN, IN, LIKE, CASE/WHEN, IS [NOT] NULL
135+
- **Catalog:** Abstract `Catalog` interface + `InMemoryCatalog` implementation for table/column metadata resolution
136+
- **Plan builder:** Translates parsed SELECT AST into a logical plan tree (FROM → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT)
137+
- **Executor with 9 operators:** Scan, Filter, Project, NestedLoopJoin (INNER/LEFT/RIGHT/FULL/CROSS), Aggregate (COUNT/SUM/AVG/MIN/MAX), Sort, Limit, Distinct, SetOp (UNION/INTERSECT/EXCEPT)
138+
- **38 built-in functions:** Arithmetic (ABS, CEIL, FLOOR, ROUND, MOD, POWER, SQRT, LOG, LN, EXP, SIGN, TRUNCATE, RAND, GREATEST, LEAST), string (UPPER, LOWER, LENGTH, CONCAT, SUBSTRING, TRIM, LTRIM, RTRIM, REPLACE, REVERSE, LEFT, RIGHT, LPAD, RPAD, REPEAT), comparison (COALESCE, NULLIF, IF, IFNULL), type (CAST)
139+
- **Composable data sources:** Implement `DataSource` interface for custom storage backends
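
Three-valued logic, mentioned in the expression evaluator bullet, follows the SQL truth tables where NULL acts as "unknown". A minimal sketch, with `Tri`, `tri_and`, `tri_or`, and `tri_not` as hypothetical names:

```cpp
// SQL three-valued logic: NULL behaves as Unknown.
enum class Tri { False, True, Unknown };

// AND: False dominates; otherwise Unknown if either side is Unknown.
constexpr Tri tri_and(Tri a, Tri b) {
    if (a == Tri::False || b == Tri::False) return Tri::False;
    if (a == Tri::Unknown || b == Tri::Unknown) return Tri::Unknown;
    return Tri::True;
}

// OR: True dominates; otherwise Unknown if either side is Unknown.
constexpr Tri tri_or(Tri a, Tri b) {
    if (a == Tri::True || b == Tri::True) return Tri::True;
    if (a == Tri::Unknown || b == Tri::Unknown) return Tri::Unknown;
    return Tri::False;
}

// NOT: Unknown stays Unknown.
constexpr Tri tri_not(Tri a) {
    if (a == Tri::Unknown) return Tri::Unknown;
    return a == Tri::True ? Tri::False : Tri::True;
}
```

The dominance rules are what make short-circuiting safe: once the left side of an AND is False, the right side cannot change the result, even if it is NULL.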
93140

94141
## Architecture
95142

@@ -110,24 +157,36 @@ Input SQL bytes
110157
├──── Tier 1 ──► Deep parser (SELECT, INSERT, UPDATE, DELETE, SET, ...)
111158
│ │
112159
│ ▼
113-
│ Full AST in arena
114-
115-
└──── Tier 2 ──► Lightweight extractor
116-
117-
118-
StmtType + table name + metadata
160+
│ Full AST in arena ──┐
161+
│ │
162+
└──── Tier 2 ──► Lightweight │
163+
extractor │
164+
165+
┌──────────────┐
166+
│ Plan Builder │ AST → logical plan tree
167+
└──────┬───────┘
168+
169+
170+
┌──────────────┐
171+
│ Plan Executor │ Volcano-model iterator
172+
│ (Operators) │ open() / next() / close()
173+
└──────┬───────┘
174+
175+
176+
ResultSet
119177
```
120178

121179
**Key design decisions:**
122-
- **Arena allocator** — 64KB bump allocator, O(1) reset. All AST nodes allocated from arena. No per-node new/delete.
180+
- **Arena allocator** — 64KB bump allocator, O(1) reset. All AST nodes and plan nodes allocated from arena. No per-node new/delete.
123181
- **Zero-copy StringRef** — Token values point into original input buffer. No string copies during parsing.
124182
- **32-byte AstNode** — Compact intrusive linked-list (first_child + next_sibling). Half a cache line.
125183
- **Compile-time dialect dispatch** — `if constexpr` for MySQL vs PostgreSQL differences. Zero runtime overhead.
126184
- **Header-only parsers** — Maximum inlining opportunity. Only `arena.cpp` and `parser.cpp` are compiled separately.
185+
- **Volcano execution model** — Operators implement open/next/close; rows are pulled through the tree one at a time.
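
The arena bullet above describes a bump allocator with O(1) reset. The sketch below shows the general technique with a single fixed block; `DemoArena` is hypothetical and omits whatever block chaining or growth policy the project's `Arena` uses.

```cpp
#include <cstddef>
#include <memory>

// Single-block bump allocator: allocation advances an offset,
// reset() rewinds it. No per-object free.
class DemoArena {
public:
    explicit DemoArena(std::size_t capacity)
        : buf_(new unsigned char[capacity]), cap_(capacity) {}

    // `align` must be a power of two. Returns nullptr when the block is full.
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        // Round the current offset up to the requested alignment.
        const std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > cap_ || size > cap_ - start) return nullptr;
        used_ = start + size;
        return buf_.get() + start;
    }

    // O(1): every previous allocation becomes invalid at once.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::unique_ptr<unsigned char[]> buf_;
    std::size_t cap_;
    std::size_t used_ = 0;
};
```

After `reset()`, the next allocation reuses the same memory, which is why resetting between queries avoids both per-node frees and fresh heap allocations.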
127186

128187
## Testing
129188

130-
430 unit tests + validated against 86K+ queries from 9 external corpora:
189+
871 unit tests + validated against 86K+ queries from 9 external corpora:
131190

132191
| Corpus | Queries | OK Rate |
133192
|---|---|---|
@@ -165,14 +224,71 @@ include/sql_parser/
165224
keywords_mysql.h MySQL keyword table
166225
keywords_pgsql.h PostgreSQL keyword table
167226
227+
include/sql_engine/
228+
types.h SqlType with 30+ SQL type kinds
229+
value.h Tagged-union Value (14 tags) + constructors
230+
row.h Row: array of Value indexed by ordinal
231+
catalog.h Abstract Catalog interface (TableInfo, ColumnInfo)
232+
in_memory_catalog.h Hash-map Catalog implementation
233+
catalog_resolver.h Column resolver callback factory
234+
data_source.h DataSource interface + InMemoryDataSource
235+
expression_eval.h Recursive AST expression evaluator
236+
function_registry.h FunctionRegistry<D> with register_builtins()
237+
plan_node.h PlanNode union (9 node types)
238+
plan_builder.h AST → logical plan translation
239+
plan_executor.h Plan → operator tree → ResultSet
240+
operator.h Abstract Operator base (open/next/close)
241+
result_set.h ResultSet: rows + column names
242+
coercion.h Type coercion rules (dialect-specific)
243+
null_semantics.h Three-valued logic (AND/OR/NOT with NULL)
244+
like.h LIKE pattern matching
245+
tag_kind_map.h Value::Tag ↔ SqlType::Kind mapping
246+
247+
operators/
248+
scan_op.h Table scan from DataSource
249+
filter_op.h WHERE/HAVING predicate evaluation
250+
project_op.h SELECT expression list evaluation
251+
join_op.h Nested-loop join (5 join types)
252+
aggregate_op.h GROUP BY + COUNT/SUM/AVG/MIN/MAX
253+
sort_op.h ORDER BY (in-memory sort)
254+
limit_op.h LIMIT + OFFSET
255+
distinct_op.h Duplicate elimination
256+
set_op_op.h UNION/INTERSECT/EXCEPT
257+
258+
functions/
259+
arithmetic.h ABS, CEIL, FLOOR, ROUND, MOD, POWER, SQRT, ...
260+
comparison.h COALESCE, NULLIF, IF, IFNULL
261+
string.h UPPER, LOWER, LENGTH, CONCAT, SUBSTRING, TRIM, ...
262+
cast.h CAST type conversion
263+
168264
src/sql_parser/
169265
arena.cpp Arena implementation
170266
parser.cpp Classifier + integration
171267
172-
tests/ 430 unit tests (Google Test)
268+
src/sql_engine/
269+
function_registry.cpp Built-in function registration
270+
in_memory_catalog.cpp InMemoryCatalog implementation
271+
272+
tests/ 871 unit tests (Google Test)
173273
bench/ 18 benchmarks + comparison suite
174274
```
175275

276+
## Building
277+
278+
```bash
279+
# Build library + run tests
280+
make -f Makefile.new all
281+
282+
# Build and run benchmarks
283+
make -f Makefile.new bench
284+
285+
# Build comparison benchmarks (requires libpg_query)
286+
cd third_party/libpg_query && make && cd ../..
287+
make -f Makefile.new bench-compare
288+
```
289+
290+
Requires: `g++` or `clang++` with C++17 support. No external dependencies for the parser itself. Google Test and Google Benchmark are vendored in `third_party/`.
291+
176292
## License
177293

178294
See [LICENSE](LICENSE) file.
