Release v0.3.0 — EX 58% → 72% on the 100-question benchmark, zero hard errors by Cyberfilo · Pull Request #17 · Cyberfilo/PromptQuery

Cyberfilo · 2026-06-10T08:44:54Z

Promotes staging to main for the 0.3.0 release. Code changes were reviewed in #16; this PR adds the measured results that gate the promotion.

Results

Re-ran the same 100-question suite against the same 211-table Postgres schema, same model (gpt-4o), temperature 0, single-state EX@1 — only the package version changed.

Metric	0.2.2 (main)	0.3.0 (staging)
Execution accuracy (EX@1)	58.0%	72.0%
Hard DB errors	7/100	0/100
Soft-F1 (row-level partial credit)	60.2	73.9
Set-Recall	98.0%	99.0%
Tokens/query	4,257	4,689
Time/query	2.26 s	2.46 s

Where it came from, by question bucket (EX, %): single-table questions 58.6 → 82.8, lexical-gap questions 36.4 → 72.7 (that's the enum vocabulary doing its job: the model stops guessing state names), joins 64 → 68. The honest negatives: analytical questions (window functions) stay at 20, and multi-join dipped 58.3 → 50.0 on a 12-question bucket. Both are composition limits the next release should target, not regressions I can explain away.

The cost of the gains: +10% prompt tokens for the enum lists and ~+200 ms/query average from repair rounds on failing queries. Worth it.
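The cost figures can be checked against the results table. A minimal sketch of that arithmetic, using only the numbers reported above (the helper name `pct_change` is illustrative, not part of the package):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from the results table (0.2.2 vs 0.3.0).
tokens_before, tokens_after = 4257, 4689
time_before_s, time_after_s = 2.26, 2.46
ex_before, ex_after = 58.0, 72.0

token_overhead_pct = pct_change(tokens_before, tokens_after)
latency_delta_ms = (time_after_s - time_before_s) * 1000.0
ex_gain_points = ex_after - ex_before

print(f"tokens/query: +{token_overhead_pct:.1f}%")
print(f"time/query:   +{latency_delta_ms:.0f} ms")
print(f"EX@1:         +{ex_gain_points:.1f} points")
```

The token overhead comes out just above 10% and the latency delta at about 200 ms, matching the rounded figures quoted in the text.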

Release mechanics after merge: tag v0.3.0 (trusted publishing pushes to PyPI), GitHub release with the changelog notes.

… in the prompt

Introspection now reads col_description() and pg_enum labels for every column; format_schema renders them, so the generator filters on real states instead of guessing them. New generation rules pin down answer shape: exactly the columns asked for, no speculative filters, INNER JOIN by default, status columns over timestamp inference.
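To make the enum-aware rendering concrete, here is a rough sketch of how column comments and enum labels might end up in the prompt text. Everything named here (`Column`, `render_table`, `COLUMN_METADATA_SQL`, the output layout) is an assumption for illustration; the actual `format_schema` signature and output format are not shown in this PR. Only `col_description()`, `format_type()`, and the `pg_enum` / `pg_attribute` catalogs are standard PostgreSQL.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Catalog query a tool like this could run per table (not executed in this sketch).
# The query shape is an assumption, not taken from the package.
COLUMN_METADATA_SQL = """
SELECT a.attname,
       format_type(a.atttypid, a.atttypmod)   AS data_type,
       col_description(a.attrelid, a.attnum)  AS comment,
       array_agg(e.enumlabel ORDER BY e.enumsortorder)
           FILTER (WHERE e.enumlabel IS NOT NULL) AS enum_labels
FROM pg_attribute a
LEFT JOIN pg_enum e ON e.enumtypid = a.atttypid
WHERE a.attrelid = %(table)s::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
GROUP BY a.attname, a.atttypid, a.atttypmod, a.attrelid, a.attnum
ORDER BY a.attnum
"""


@dataclass
class Column:
    """Hypothetical container for one introspected column."""
    name: str
    data_type: str
    comment: Optional[str] = None
    enum_labels: List[str] = field(default_factory=list)


def render_table(table: str, columns: List[Column]) -> str:
    """Render one table as prompt text, listing enum labels and comments inline."""
    lines = [f"TABLE {table}"]
    for col in columns:
        line = f"  {col.name} {col.data_type}"
        if col.enum_labels:
            # Spell out the allowed values so the model filters on real states.
            line += " -- one of: " + ", ".join(repr(v) for v in col.enum_labels)
        if col.comment:
            line += f" -- {col.comment}"
        lines.append(line)
    return "\n".join(lines)
```

Listing the labels next to the column keeps the vocabulary local to where the model decides on a filter, which is one plausible reading of why the lexical-gap bucket improved.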

When the database rejects a query, feed the SQL plus the database's own error back to the model for a bounded number of corrected attempts. Repaired SQL is re-validated by the sqlglot guard and re-confirmed in the REPL before it runs. Empty results never trigger repair — empty is often the right answer.
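The repair loop described above can be sketched roughly as follows. This is not the package's implementation: it uses the standard-library `sqlite3` as a stand-in for Postgres, a caller-supplied `is_safe` callable in place of the sqlglot guard, and a `repair` callable in place of the model call. The REPL confirmation step is left out, and `run_with_repair` and `max_rounds` are hypothetical names.

```python
import sqlite3
from typing import Any, Callable, List, Tuple


def run_with_repair(
    conn: sqlite3.Connection,
    sql: str,
    repair: Callable[[str, str], str],
    is_safe: Callable[[str], bool],
    max_rounds: int = 2,
) -> Tuple[str, List[Any]]:
    """Execute `sql`; on a database error, request a corrected query.

    At most `max_rounds` repairs are attempted. Every candidate, including
    repaired ones, must pass `is_safe` before it runs. An empty result is
    returned as-is and never triggers a repair.
    """
    for attempt in range(max_rounds + 1):
        if not is_safe(sql):
            raise ValueError(f"guard rejected query: {sql}")
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            if attempt == max_rounds:
                raise  # repair budget exhausted: surface the database error
            # Feed the failing SQL and the database's own message back.
            sql = repair(sql, str(exc))
    raise AssertionError("unreachable")
```

Keying the loop on raised database errors rather than on result size is what keeps empty results out of the repair path: a query that returns no rows succeeds and is returned unchanged.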

v0.3.0: enum-aware prompts, execution-guided self-repair, tighter generation rules

Cyberfilo and others added 6 commits June 10, 2026 09:15

fix: infer OpenAI provider for bare o4-* model names

a2a99a5

chore: 0.3.0 — changelog, pipeline docs, reconcile stale v0.1-era claims

29fa4a6

Merge pull request #16 from Cyberfilo/feat/v0.3-generation-quality

a82440e

v0.3.0: enum-aware prompts, execution-guided self-repair, tighter generation rules

docs: add measured 0.3.0 benchmark results to the changelog

f8d29b1

Cyberfilo merged commit 0ecab61 into main Jun 10, 2026
10 checks passed

Cyberfilo merged 6 commits into main from staging
