Release v0.3.0 — EX 58% → 72% on the 100-question benchmark, zero hard errors#17

Merged: Cyberfilo merged 6 commits into main from staging on Jun 10, 2026

@Cyberfilo

Owner

Promotes staging to main for the 0.3.0 release. Code changes were reviewed in #16; this PR adds the measured results that gate the promotion.

Results

Re-ran the same 100-question suite against the same 211-table Postgres schema, same model (gpt-4o), temperature 0, single-state EX@1 — only the package version changed.

| Metric | 0.2.2 (main) | 0.3.0 (staging) |
| --- | --- | --- |
| Execution accuracy (EX@1) | 58.0% | 72.0% |
| Hard DB errors | 7/100 | 0/100 |
| Soft-F1 (row-level partial credit) | 60.2 | 73.9 |
| Set-Recall | 98.0% | 99.0% |
| Tokens/query | 4,257 | 4,689 |
| Time/query | 2.26 s | 2.46 s |

Where it came from, by question bucket: single-table questions 58.6 → 82.8, lexical-gap questions 36.4 → 72.7 (that's the enum vocabulary doing its job — the model stops guessing state names), joins 64 → 68. The honest negatives: analytical questions (window functions) stay at 20%, and multi-join dipped 58.3 → 50.0 on a 12-question bucket — both are composition limits the next release should target, not regressions I can explain away.
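
To put the multi-join dip in perspective, the percentages can be converted back into question counts. This is a minimal sketch, assuming each bucket figure is simply correct answers divided by bucket size, rounded to one decimal; the helper name `correct_count` is illustrative and not part of the package.

```python
def correct_count(accuracy_pct: float, bucket_size: int) -> int:
    """Recover the number of correct answers from a rounded percentage."""
    return round(accuracy_pct / 100 * bucket_size)


BUCKET_SIZE = 12  # the multi-join bucket size stated above

before = correct_count(58.3, BUCKET_SIZE)
after = correct_count(50.0, BUCKET_SIZE)

print(f"multi-join: {before}/{BUCKET_SIZE} -> {after}/{BUCKET_SIZE} "
      f"({before - after} question(s) lost)")
```

Under that assumption the dip is 7/12 → 6/12, a single question, which is why the bucket size matters when reading the percentage.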

The cost of the gains: +10% prompt tokens for the enum lists and ~+200 ms/query average from repair rounds on failing queries. Worth it.
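
The cost figures can be checked against the results reported above. A small sketch, assuming the "+10%" refers to the Tokens/query row and the "~+200 ms" to the Time/query row:

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from the 0.2.2 and 0.3.0 results.
tokens_before, tokens_after = 4257, 4689
secs_before, secs_after = 2.26, 2.46

token_pct = relative_change_pct(tokens_before, tokens_after)
latency_ms = (secs_after - secs_before) * 1000
ex_gain_points = 72.0 - 58.0

print(f"tokens/query: +{token_pct:.1f}%")
print(f"time/query:   +{latency_ms:.0f} ms")
print(f"EX@1:         +{ex_gain_points:.1f} points")
```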

Release mechanics after merge: tag v0.3.0 (trusted publishing pushes to PyPI), GitHub release with the changelog notes.

Cyberfilo and others added 6 commits June 10, 2026 09:15
… in the prompt

Introspection now reads col_description() and pg_enum labels for every column;
format_schema renders them, so the generator filters on real states instead of
guessing them. New generation rules pin down answer shape: exactly the columns
asked for, no speculative filters, INNER JOIN by default, status columns over
timestamp inference.
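
As an illustration of the enum-aware rendering this commit describes, here is a hedged sketch. The catalog query uses real Postgres objects (`pg_attribute`, `pg_enum`, `col_description()`, `format_type()`), but the `Column` dataclass, the `render_column` helper, the output format, and the example column are assumptions made for this sketch; they are not the package's actual `format_schema`.

```python
from dataclasses import dataclass, field
from typing import Optional

# One possible catalog query (assumption: psycopg-style named parameter).
COLUMN_QUERY = """
SELECT a.attname,
       format_type(a.atttypid, a.atttypmod)  AS data_type,
       col_description(a.attrelid, a.attnum) AS comment,
       (SELECT array_agg(e.enumlabel ORDER BY e.enumsortorder)
          FROM pg_enum e
         WHERE e.enumtypid = a.atttypid)     AS enum_labels
  FROM pg_attribute a
 WHERE a.attrelid = %(table)s::regclass
   AND a.attnum > 0
   AND NOT a.attisdropped
 ORDER BY a.attnum
"""


@dataclass
class Column:
    name: str
    data_type: str
    comment: Optional[str] = None
    enum_labels: list[str] = field(default_factory=list)


def render_column(col: Column) -> str:
    """Render one column line, appending enum values and comment if present."""
    notes = []
    if col.enum_labels:
        notes.append("values: " + ", ".join(repr(v) for v in col.enum_labels))
    if col.comment:
        notes.append(col.comment)
    line = f"{col.name} {col.data_type}"
    return f"{line} -- {'; '.join(notes)}" if notes else line


# Hypothetical column, for illustration only.
example = Column("status", "order_status", "Current fulfilment state",
                 ["pending", "shipped", "cancelled"])
print(render_column(example))
```

With the labels spelled out in the prompt, the generator can filter on a listed value rather than inventing one.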
When the database rejects a query, feed the SQL plus the database's own error
back to the model for a bounded number of corrected attempts. Repaired SQL is
re-validated by the sqlglot guard and re-confirmed in the REPL before it runs.
Empty results never trigger repair — empty is often the right answer.
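
The repair loop described in this commit message can be sketched roughly as follows. Everything here is hypothetical: `run_with_repair`, `QueryRejected`, and the `execute`, `repair`, and `validate` callables stand in for the database call, the model call, and the sqlglot guard plus REPL check. The default bound of two rounds and the choice to re-raise when a repaired query fails validation are assumptions, not the package's confirmed behaviour.

```python
from typing import Callable, Sequence


class QueryRejected(Exception):
    """Raised by `execute` when the database rejects a statement."""


def run_with_repair(
    sql: str,
    execute: Callable[[str], Sequence],
    repair: Callable[[str, str], str],
    validate: Callable[[str], bool],
    max_repairs: int = 2,
) -> Sequence:
    """Run `sql`, asking `repair` for a fix at most `max_repairs` times."""
    attempt = sql
    for round_no in range(max_repairs + 1):
        try:
            # Any successful result is returned as-is, including an empty
            # one: empty rows never trigger a repair round.
            return execute(attempt)
        except QueryRejected as err:
            if round_no == max_repairs:
                raise  # repair budget exhausted
            # Feed the failing SQL and the database's own error to the model.
            candidate = repair(attempt, str(err))
            if not validate(candidate):
                raise  # repaired SQL failed the guard; surface the DB error
            attempt = candidate
    raise AssertionError("unreachable")
```

Only a database rejection enters the `except` branch, so a query that runs and returns no rows is treated as a valid answer.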
v0.3.0: enum-aware prompts, execution-guided self-repair, tighter generation rules
Cyberfilo merged commit 0ecab61 into main on Jun 10, 2026
10 checks passed