From b2ab16943579214602f2e6c436277bcc7b011b15 Mon Sep 17 00:00:00 2001
From: Filippo Menghi <filo.gametech@gmail.com>
Date: Wed, 10 Jun 2026 11:22:18 +0200
Subject: [PATCH] =?UTF-8?q?docs:=20README=20=E2=80=94=20add=20v0.3.0=20exe?=
 =?UTF-8?q?cution-accuracy=20numbers,=20fix=20stale=20roadmap?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Roadmap section still predicted a different v0.3 than the one that
shipped, and the Numbers section had no execution-accuracy results at all.
---
 README.md | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 423dc4e..baa9181 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,26 @@ PromptQuery is an open-source CLI that lets you query Postgres in plain English
 
 ## The numbers
 
-Two independent production-scale schemas. SQL generation: `gpt-4o`. Table selection: `gpt-4o-mini`.
+### Execution accuracy — 211-table benchmark, v0.3.0
+
+Measured on [heron](https://github.com/Cyberfilo/heron), an independent open benchmark: a seeded
+211-table / 14-schema Postgres database (4.4M rows) with 100 audited questions, scored by
+**execution-equality** (the query runs and returns the right rows). Conditions: `gpt-4o`,
+temperature 0, single-state EX@1. Receipts: [`results_prq_v031.json` (heron PR #1)](https://github.com/Cyberfilo/heron/pull/1).
+
+| | v0.2.2 | **v0.3.0** |
+|---|---:|---:|
+| Execution accuracy (EX@1) | 58% | **72%** |
+| Hard DB errors | 7/100 | **0/100** |
+| Schema-retrieval recall | 98% | 99% |
+| Tokens / query | 4,257 | 4,689 |
+
+Same benchmark, same model, same questions: only the package version changed. Known weak spots,
+disclosed deliberately: window-function questions sit at 20%, multi-join at 50%.
+
+### Retrieval accuracy — two more production-scale schemas (v0.2)
+
+SQL generation: `gpt-4o`. Table selection: `gpt-4o-mini`.
 
 **What "accuracy" means here:** a question passes only if the generated SQL references *every* table the question needs and *invents none* (parsed with `sqlglot`). These two schemas ship without seeded data, so queries are parsed, not executed — execution-equality is measured separately on the seeded `shop` fixture (see [Benchmark](#benchmark)). Finding the right handful of tables out of hundreds is the hard part, and this measures exactly that. "Tokens / query" is the SQL-generator prompt size, measured with `tiktoken`.
 
@@ -246,8 +265,12 @@ See [`eval/END_TO_END.md`](eval/END_TO_END.md) for the harness internals.
 ## Roadmap
 
 - **v0.2 (shipped)** — LLM-assisted table selector, stemmed TF-IDF.
-- **v0.3** — local LLMs (Ollama), schema anonymisation (GDPR-by-default), query-history-as-few-shot.
-- **v0.4** — MySQL + SQLite adapters, MCP server mode, public competitor benchmark.
+- **v0.3 (shipped)** — enum-aware schema prompts, execution-guided self-repair (`--max-repair`),
+  answer-shape generation rules. EX 58% → 72% on the 100-question benchmark.
+- **v0.4** — value/literal linking (match question terms to actual cell values), opt-in
+  multi-candidate generation (`--thorough`), window-function/composition work.
+- **Later** — MySQL + SQLite adapters, MCP server mode, local LLMs (Ollama), schema anonymisation,
+  query-history-as-few-shot.
 
 ---