DEV-1462: LiveSQLBench-Base-Lite-SQLite one-shot harness#5
DEV-1462: LiveSQLBench-Base-Lite-SQLite one-shot harness#5ZmeiGorynych wants to merge 5 commits into
Conversation
Add a new `--mode one-shot` evaluation path that runs the
LiveSQLBench-Base-Lite-SQLite dataset (the unambiguous, one-shot sibling
of mini-interact) against BOTH SLayer agents — `pydantic_ai_recursive`
(KB→memories) and `pydantic_ai_otf_encode` (KB→models). Scored on the
180 SELECT (`category=="Query"`) tasks; no user simulator, no
`ask_user` anywhere in the spawn tree.
Highlights (full plan in DEV-1462's "Final approved plan" section):
* New `--dataset {mini-interact, livesqlbench}` and `--gold-file` flags.
`--mode one-shot` is gated to `--dataset livesqlbench`; `--dataset
livesqlbench` is gated to `--mode {one-shot, oracle}`; one-shot
requires `--slayer-setup on-the-fly`. Guards land both at the CLI
(parser.error) and at the programmatic-entry points
(run_evaluation/make_runner/run_one_task — Codex #1).
* New `scripts/prepare_livesqlbench.py` materialises per-DB
`<db>.sqlite` from `<db>_template.sqlite` (refuses git-LFS pointers
loudly). New `harness.materialize_task_db` gives each LiveSQLBench
task an isolated per-instance `db_file_path` (symlinked template) so
concurrent eval-resets never race the OTF cache build. Stale-symlink
rebuild + defensive `_ephemeral_`/`_process_` refusal (Codex #4/#5).
* New `harness.load_livesqlbench_tasks`: merges the gated gold sidecar
by `instance_id`, maps `query`→`amb_user_query`, stamps a
`task["dataset"]="livesqlbench"` marker (the irreducible source of
truth for DB isolation + the one-shot run_task programmatic guard),
filters `category=="Query"` (logs `_M_` disagreements), asserts
exactly 180 SELECT tasks on a full run, and is `filter_ids`-aware so
partial-gold targeted runs don't trip the empty-`sol_sql` fail-fast
(Codex #6).
* One-shot factories + prompts in BOTH SLayer adapters: `_build_sub_explorer`,
`_build_projection_resolver_oneshot`, `_build_query_constructor_oneshot`
(+ otf_encode `_build_kb_encoder_oneshot`). `_register_spawn_subagent`
+ `_register_kb_to_slayer` parametrized with `eval_mode` to dispatch
the right ask_user-free child. New one-shot prompt variants with NO
ask_user / user-sim language. `agent.run_task` accepts
`eval_mode in {a-interact, one-shot}` with reserve=submit_query only
and a one-shot resolver recovery prompt.
* Per-benchmark artifact separation (user principle: artifacts kept
SEPARATE by benchmark). `paths.slayer_otf_cache_root` and
`slayer_models_otf_root` gain a `benchmark: str | None = None` kwarg;
default returns the legacy mini-interact path (no caller breakage),
`benchmark="livesqlbench"` returns parallel roots
(`slayer_otf_cache_livesqlbench/`, `slayer_models_otf_livesqlbench/`)
with parallel env-var overrides. `_maybe_force_wipe_otf` is
benchmark-aware so `--otf-rebuild` never wipes the wrong root.
* Codex #2 fix: `slayer_otf/reference_build.py::_effective_db_root`
accepts an authoritative `db_root` kwarg that overrides
`$BIRD_DB_PATH` (threaded all the way from `ensure_db_reference`
through `_resolve_datasource_for_build` + `_portabilise_datasource`).
The otf_encode adapter passes `--db-path` as the explicit `db_root`
for LiveSQLBench runs, so the setup-encoder build resolves the right
sqlite even when conftest/CI sets `$BIRD_DB_PATH` to mini-interact.
* README + CLAUDE.md updated with the LiveSQLBench setup +
per-benchmark roots pattern. A `test_cloud_paths_unchanged.py`
regression test pins that the dev-1470 cloud upload-back / post-run
merge keep using the LEGACY (no-kwarg) roots — cloud-side
LiveSQLBench is explicitly out of scope and filed as a follow-up.
Full non-integration suite: 1232 passed, 94 skipped, 19 deselected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds LiveSQLBench one-shot execution: dataset preparation, gold-sidecar merging, per-task SQLite materialization, benchmark-scoped OTF cache/model roots, one-shot agent factories/prompts and budgeting for recursive and otf-encode frameworks, db_root threading through reference-build, CLI/programmatic guards, and broad tests and fixtures. ChangesLiveSQLBench One-Shot Mode
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/bird_interact_agents/agents/pydantic_ai_otf_encode/prompts.py`:
- Around line 436-469: The new KB_ENCODER_ONESHOT_PROMPT removed the
self-tagging/output-contract that sets meta.kb_id={kb_id}, which breaks
downstream dependency discovery used by _live_tagged_entities() and
_format_deps_block(); restore the tagging and output-contract sections from
KB_ENCODER_PROMPT into KB_ENCODER_ONESHOT_PROMPT (keeping only removal of
ask_user-specific instructions) so that encoders still emit meta.kb_id={kb_id}
and any required meta/self-annotation fields and call patterns (e.g.,
submit_encoding(result_json=...)) exactly once, ensuring already-encoded
dependencies are discoverable by _live_tagged_entities() and
_format_deps_block().
In `@src/bird_interact_agents/agents/pydantic_ai_recursive/agent.py`:
- Around line 282-298: run_task currently allows eval_mode="one-shot" even when
self.slayer_setup=="pre-encoded", letting callers bypass the dataset guard; add
an explicit check alongside the existing dataset check to raise ValueError if
is_one_shot and self.slayer_setup != "on-the-fly". Locate the branch around
eval_mode/is_one_shot in run_task (and reference self.slayer_setup) and enforce
that one-shot is only permitted when self.slayer_setup equals the on-the-fly
mode, with a clear error message mentioning both the required slayer_setup and
dataset constraints.
In `@tests/_livesqlbench_fixtures.py`:
- Around line 189-193: The public_task fixture currently uses truthy 'or'
defaults which overrides explicit empty values; change the assignments to use
explicit None checks so empty string "" or empty dict {} are preserved: for the
"query" key use query if query is not None else f"unambiguous request for
{instance_id}", and for the "conditions" key use conditions if conditions is not
None else {"decimal": [], "distinct": False, "order": False}; update the code in
the public_task function where "query" and "conditions" are set accordingly.
In `@tests/test_one_shot_mode.py`:
- Around line 351-352: Split the semicolon-joined statements into separate lines
to satisfy Ruff E702: replace the combined statements that create and write to
"data" and create "db_path" (variables data, db_path, tmp_path in
tests/test_one_shot_mode.py) with two distinct statements — one that assigns
data = tmp_path / "data.jsonl" and calls data.write_text(""), and another that
assigns db_path = tmp_path / "ds" and calls db_path.mkdir() — and do the same
for the other occurrences around the same file (the locations referenced near
the original combined lines).
In `@tests/test_one_shot_run.py`:
- Around line 728-799: The tests install the _spy_materialize spy but never
assert it was invoked, so add an assertion after each instantiation/run in both
OTF one-shot parity tests (the test function shown and the other one around
lines 802-870) to verify DB materialization occurred: after calling
PydanticAIOtfEncodeAgent.run_task, assert that the spy created by
_spy_materialize recorded a call to materialize_task_db (e.g., check spy.called
or spy.call_count > 0 or spy.assert_called_once() depending on the spy API).
Make this change in both test functions referencing _spy_materialize and
materialize_task_db so a regression in one-shot DB materialization is caught.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 476372c4-3570-4134-9d96-b93cf793a62c
📒 Files selected for processing (27)
CLAUDE.mdREADME.mdscripts/prepare_livesqlbench.pysrc/bird_interact_agents/agents/pydantic_ai_otf_encode/agent.pysrc/bird_interact_agents/agents/pydantic_ai_otf_encode/factories.pysrc/bird_interact_agents/agents/pydantic_ai_otf_encode/prompts.pysrc/bird_interact_agents/agents/pydantic_ai_recursive/agent.pysrc/bird_interact_agents/agents/pydantic_ai_recursive/factories.pysrc/bird_interact_agents/agents/pydantic_ai_recursive/prompts.pysrc/bird_interact_agents/harness.pysrc/bird_interact_agents/paths.pysrc/bird_interact_agents/run.pysrc/bird_interact_agents/slayer_otf/reference_build.pytests/_livesqlbench_fixtures.pytests/test_cloud_paths_unchanged.pytests/test_db_isolation.pytests/test_livesqlbench_loader.pytests/test_one_shot_mode.pytests/test_one_shot_otf_encode_factories.pytests/test_one_shot_recursive_factories.pytests/test_one_shot_run.pytests/test_otf_encode_reference_root.pytests/test_otf_rebuild_per_benchmark.pytests/test_otf_rebuild_wiring.pytests/test_paths.pytests/test_prepare_livesqlbench.pytests/test_pydantic_ai_otf_encode_agent.py
* run.py: wire `materialize_task_db` into `run_oracle_task` so
LiveSQLBench oracle runs at `--concurrency > 1` don't race the
shared `<db>.sqlite` the OTF cache reads (B0's docstring said it
was called from both run paths but the oracle wiring was missing
— Codex).
* pydantic_ai_otf_encode/prompts.py: restore the KB SELF-ANNOTATION
contract in `KB_ENCODER_ONESHOT_PROMPT`. Every encoded entity must
carry `label`, the canonical `description` block, and
`meta.kb_id={kb_id}` — downstream `_live_tagged_entities()` +
`_format_deps_block()` discover already-encoded deps via those
tags, so a successfully-encoded KB without the meta.kb_id tag
silently looks "not encoded" to dependents and breaks transitive
`kb_to_slayer` resolution (CodeRabbit).
* pydantic_ai_recursive/agent.py: belt-and-suspenders check that
one-shot ⟹ `slayer_setup == "on-the-fly"`. The check already lives
at the CLI / run_evaluation / make_runner / run_one_task
boundaries; this in-run_task assertion closes the bypass for a
programmatic caller that constructs
`PydanticAIRecursiveAgent(slayer_setup="pre-encoded")` and invokes
`.run_task(eval_mode="one-shot", ...)` directly (CodeRabbit). +
regression test.
Full non-integration suite: 1233 passed (was 1232).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* harness.load_livesqlbench_tasks: the full-run "exactly 180" check is now `if … : raise ValueError(…)`, not `assert`. Production guards must survive `python -O` / `PYTHONOPTIMIZE` — the latter strips assertions, so a truncated dataset would have silently run in exactly the scenario this guard exists for. * README: the LiveSQLBench setup steps used `cd livesqlbench-base- lite-sqlite` then ran `uv run python scripts/prepare_livesqlbench.py`, but `scripts/` lives in **bird-agents**, not the dataset dir. Rewrite step 2 as a subshell (`(cd … && git lfs pull)`) and have step 3 explicitly `cd bird-agents` before running the script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tests/_livesqlbench_fixtures.py: `public_task` uses `is not None`
(not `or`) for `query` and `conditions` defaults, so a caller can
build a fixture with an explicit empty `""` / `{}` for edge-case
tests instead of having it silently replaced by the default.
* tests/test_one_shot_mode.py: split semicolon-separated statements
(`data = ...; data.write_text("")`) into two lines each — Ruff E702
flags the one-liner form and would break CI once Ruff is wired in.
* tests/test_one_shot_run.py: both OTF one-shot parity tests (reserve
+ empty-resolver-skip) now capture the `materialize_calls` spy
return and assert it was invoked. The recursive parity tests
already do this; without it a regression in OTF one-shot DB
materialization would slip through these two paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous round threaded `db_root` into `ensure_db_reference` →
`_effective_db_root` so the otf_encode **reference build** ignores
`$BIRD_DB_PATH` when an authoritative `--db-path` is in play. But the
**per-task variant copy** (which is what the SLayer MCP server reads
at runtime) is materialised by `build_task_variant_storage`, and that
function delegates to `resolve_committed_connection_string` which
still preferred `$BIRD_DB_PATH` over the supplied root.
Net effect: a one-shot LiveSQLBench run would correctly build/reuse
the LiveSQLBench-scoped reference, then SILENTLY query the
mini-interact sqlite at task time (because conftest / CI / day-to-day
shells often set `BIRD_DB_PATH` to mini-interact).
Mirror the previous fix one more layer:
* `slayer_pipeline/portable_connection.py::resolve_committed_connection_string`
gains an explicit `db_root: Path | None = None` kwarg that overrides
`$BIRD_DB_PATH` when set. Back-compat preserved (no kwarg → legacy
env-wins semantics).
* `hard8_preprocessor.build_task_variant_storage` gains the same
`db_root` kwarg and forwards it into
`resolve_committed_connection_string`.
* otf_encode's `_resolve_otf_task_storage_dir` now passes
`db_root=Path(data_path_base).resolve()` (the same value already
passed to `ensure_db_reference`), so build-time + runtime
re-anchoring see the SAME root.
Recursive flavor unaffected — `runtime._rewrite_datasource_connection_string`
force-overwrites the connection string from `data_path_base` already,
no env consultation.
+ 3 regression tests in test_otf_encode_reference_root.py:
- resolve_committed_connection_string honours db_root over env,
- back-compat: no db_root → env wins,
- build_task_variant_storage forwards db_root through the resolver
(spies the source-module resolver, not the importing-module
binding, because the import lives inside the function body).
Full non-integration suite: 1236 passed (was 1233; +3 new tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a new
--mode one-shotevaluation path that runs theLiveSQLBench-Base-Lite-SQLite dataset (the unambiguous, one-shot
sibling of mini-interact) against BOTH SLayer agents —
pydantic_ai_recursive(KB→memories) andpydantic_ai_otf_encode(KB→models). Scored on the 180 SELECT (
category=="Query") tasks;no user simulator, no
ask_useranywhere in the spawn tree.The full final plan (incl. all Codex findings folded) is in
DEV-1462's "Final approved plan" section.
Highlights
--dataset {mini-interact, livesqlbench}and--gold-fileflags.--mode one-shotis gated to--dataset livesqlbench;--dataset livesqlbenchis gated to
--mode {one-shot, oracle}; one-shot requires--slayer-setup on-the-fly. Guards land both at the CLI(
parser.error) and the programmatic-entry points(
run_evaluation/make_runner/run_one_task).scripts/prepare_livesqlbench.pymaterialises per-DB<db>.sqlitefrom
<db>_template.sqlite(refuses git-LFS pointers loudly). Newharness.materialize_task_dbgives each LiveSQLBench task an isolatedper-instance
db_file_path(symlinked template) so concurrenteval-resets never race the OTF cache build. Stale-symlink rebuild +
defensive
_ephemeral_/_process_refusal.harness.load_livesqlbench_tasks: merges the gated gold sidecarby
instance_id, mapsquery→amb_user_query, stampstask["dataset"]="livesqlbench", filterscategory=="Query", assertsexactly 180 on a full run, and is
filter_ids-aware so partial-goldtargeted runs don't trip the empty-
sol_sqlfail-fast._build_sub_explorer,_build_projection_resolver_oneshot,_build_query_constructor_oneshot(+ otf_encode_build_kb_encoder_oneshot)._register_spawn_subagent+_register_kb_to_slayerparametrized witheval_modeto dispatchthe right ask_user-free child.
agent.run_taskacceptseval_mode in {a-interact, one-shot}with reserve=submit_queryonly and a one-shot resolver recovery prompt.
paths.slayer_otf_cache_rootand
slayer_models_otf_rootgain abenchmark: str | None = Nonekwarg; default returns the legacy mini-interact path (no caller
breakage),
benchmark="livesqlbench"returns parallel roots(
slayer_otf_cache_livesqlbench/,slayer_models_otf_livesqlbench/)with parallel env-var overrides.
_maybe_force_wipe_otfisbenchmark-aware so
--otf-rebuildnever wipes the wrong root.slayer_otf/reference_build.py::_effective_db_rootaccepts an authoritative
db_rootkwarg that overrides$BIRD_DB_PATH(threaded all the way from
ensure_db_reference). The otf_encodeadapter passes
--db-pathas the explicitdb_rootfor LiveSQLBenchruns, so the setup-encoder resolves the right sqlite even when
conftest/CI sets
$BIRD_DB_PATHto mini-interact.per-benchmark roots pattern. A
test_cloud_paths_unchanged.pyregression test pins that the DEV-1470 cloud upload-back / post-run
merge keep using the legacy (no-kwarg) roots — cloud-side
LiveSQLBench is explicitly out of scope and filed as a follow-up.
Tests
165 new tests + extensions to existing ones; full non-integration
suite: 1232 passed, 94 skipped, 19 deselected. TDD: tests landed
first, Codex-reviewed against the plan (all 8 findings folded; the
test_otf_encode_reference_rootwas restructured to be unconditional;the spawn-builder closure was moved back to call-time so the existing
nested-spawn trajectory test still pins parent_idx correctness).
Manual smoke (after merge)
Out of scope (explicit follow-ups)
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Improvements
Tests