DEV-1462: LiveSQLBench-Base-Lite-SQLite one-shot harness #5

Open
ZmeiGorynych wants to merge 5 commits into main from egor/dev-1462-tweak-the-benchmark-harness-to-run-with-livesqlbench-lite

@ZmeiGorynych ZmeiGorynych commented May 27, 2026

Add a new --mode one-shot evaluation path that runs the
LiveSQLBench-Base-Lite-SQLite dataset (the unambiguous, one-shot
sibling of mini-interact) against BOTH SLayer agents —
pydantic_ai_recursive (KB→memories) and pydantic_ai_otf_encode
(KB→models). Scored on the 180 SELECT (category=="Query") tasks;
no user simulator, no ask_user anywhere in the spawn tree.

The full final plan (incl. all Codex findings folded) is in
DEV-1462's "Final approved plan" section.

Highlights

  • New --dataset {mini-interact, livesqlbench} and --gold-file flags.
    --mode one-shot is gated to --dataset livesqlbench; --dataset livesqlbench
    is gated to --mode {one-shot, oracle}; one-shot requires
    --slayer-setup on-the-fly. Guards land both at the CLI
    (parser.error) and the programmatic-entry points
    (run_evaluation/make_runner/run_one_task).
  • New scripts/prepare_livesqlbench.py materialises per-DB <db>.sqlite
    from <db>_template.sqlite (refuses git-LFS pointers loudly). New
    harness.materialize_task_db gives each LiveSQLBench task an isolated
    per-instance db_file_path (symlinked template) so concurrent
    eval-resets never race the OTF cache build. Stale-symlink rebuild +
    defensive _ephemeral_/_process_ refusal.
  • New harness.load_livesqlbench_tasks: merges the gated gold sidecar
    by instance_id, maps query→amb_user_query, stamps
    task["dataset"]="livesqlbench", filters category=="Query", asserts
    exactly 180 on a full run, and is filter_ids-aware so partial-gold
    targeted runs don't trip the empty-sol_sql fail-fast.
  • One-shot factories + prompts in BOTH SLayer adapters:
    _build_sub_explorer, _build_projection_resolver_oneshot,
    _build_query_constructor_oneshot (+ otf_encode
    _build_kb_encoder_oneshot). _register_spawn_subagent +
    _register_kb_to_slayer parametrized with eval_mode to dispatch
    the right ask_user-free child. agent.run_task accepts
    eval_mode in {a-interact, one-shot} with reserve=submit_query
    only and a one-shot resolver recovery prompt.
  • Per-benchmark artifact separation. paths.slayer_otf_cache_root
    and slayer_models_otf_root gain a benchmark: str | None = None
    kwarg; default returns the legacy mini-interact path (no caller
    breakage), benchmark="livesqlbench" returns parallel roots
    (slayer_otf_cache_livesqlbench/, slayer_models_otf_livesqlbench/)
    with parallel env-var overrides. _maybe_force_wipe_otf is
    benchmark-aware so --otf-rebuild never wipes the wrong root.
  • Codex #2 fix: slayer_otf/reference_build.py::_effective_db_root
    accepts an authoritative db_root kwarg that overrides $BIRD_DB_PATH
    (threaded all the way from ensure_db_reference). The otf_encode
    adapter passes --db-path as the explicit db_root for LiveSQLBench
    runs, so the setup-encoder resolves the right sqlite even when
    conftest/CI sets $BIRD_DB_PATH to mini-interact.
  • README + CLAUDE.md updated with the LiveSQLBench setup +
    per-benchmark roots pattern. A test_cloud_paths_unchanged.py
    regression test pins that the DEV-1470 cloud upload-back / post-run
    merge keep using the legacy (no-kwarg) roots — cloud-side
    LiveSQLBench is explicitly out of scope and filed as a follow-up.
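
The dataset/mode gating in the first highlight can be sketched as one shared validator that both the CLI and the programmatic entry points call. This is a minimal illustration, not the PR's code: `validate_combo`, the `a-interact` value for `--mode`, and the flag defaults are assumptions; only the flag names and the three gating rules come from the description above.

```python
import argparse
from typing import List, Optional


def validate_combo(dataset: str, mode: str, slayer_setup: str) -> Optional[str]:
    """Return an error message for an invalid combination, else None."""
    if mode == "one-shot" and dataset != "livesqlbench":
        return "--mode one-shot requires --dataset livesqlbench"
    if dataset == "livesqlbench" and mode not in {"one-shot", "oracle"}:
        return "--dataset livesqlbench requires --mode one-shot or oracle"
    if mode == "one-shot" and slayer_setup != "on-the-fly":
        return "--mode one-shot requires --slayer-setup on-the-fly"
    return None


def build_parser() -> argparse.ArgumentParser:
    # Choices other than the ones named in the PR text, and all defaults,
    # are assumptions made for this sketch.
    parser = argparse.ArgumentParser(prog="bird-interact")
    parser.add_argument("--dataset", choices=["mini-interact", "livesqlbench"],
                        default="mini-interact")
    parser.add_argument("--mode", choices=["a-interact", "oracle", "one-shot"],
                        default="a-interact")
    parser.add_argument("--slayer-setup", choices=["pre-encoded", "on-the-fly"],
                        default="pre-encoded")
    return parser


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    error = validate_combo(args.dataset, args.mode, args.slayer_setup)
    if error:
        parser.error(error)  # prints usage to stderr and exits with status 2
    return args
```

A programmatic entry point could call the same `validate_combo` and raise `ValueError` on a non-`None` result, so the CLI and library paths cannot drift apart.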

Tests

165 new tests + extensions to existing ones; full non-integration
suite: 1232 passed, 94 skipped, 19 deselected. TDD: tests landed
first, Codex-reviewed against the plan (all 8 findings folded; the
test_otf_encode_reference_root test was restructured to be unconditional;
the spawn-builder closure was moved back to call-time so the existing
nested-spawn trajectory test still pins parent_idx correctness).

Manual smoke (after merge)

git lfs pull   # in ../livesqlbench-base-lite-sqlite/
uv run python scripts/prepare_livesqlbench.py
# Oracle validation (needs gated gold):
uv run bird-interact --dataset livesqlbench --mode oracle \
  --gold-file <gated.jsonl> --data ... --db-path ... --limit 5
# One-shot, both flavors:
uv run bird-interact --dataset livesqlbench --mode one-shot --query-mode slayer \
  --framework pydantic_ai_recursive --slayer-setup on-the-fly \
  --gold-file <gated.jsonl> --instance-id <id1>,<id2> ...
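
The `git lfs pull` step matters because the prepare script refuses git-LFS pointer files. One way such a check could look is sketched below; the function names are hypothetical and this is not the script's actual code. It relies only on two well-known facts: a git-LFS pointer is a small text file beginning with `version https://git-lfs.github.com/spec/v1`, and a real SQLite database begins with the 16-byte header `SQLite format 3\0`.

```python
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
SQLITE_MAGIC = b"SQLite format 3\x00"


def is_lfs_pointer(path: Path) -> bool:
    """True when the file is a git-LFS pointer stub rather than real content."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def check_template(path: Path) -> None:
    """Raise a descriptive error unless `path` looks like a real SQLite file."""
    if is_lfs_pointer(path):
        raise RuntimeError(f"{path} is a git-LFS pointer; run `git lfs pull` first")
    with path.open("rb") as handle:
        if handle.read(len(SQLITE_MAGIC)) != SQLITE_MAGIC:
            raise RuntimeError(f"{path} does not start with the SQLite header")
```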

Out of scope (explicit follow-ups)

  • 90 Management tasks (need phase-1 preprocess + custom-test_cases eval).
  • Raw one-shot; one-shot for non-SLayer frameworks; Base-Full / Large.
  • Cloud-side LiveSQLBench — see B8 in the linked spec.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • One-shot execution mode (--mode one-shot) with fixed 30.0 budget, supports two slayer frameworks and enforces on-the-fly setup.
    • LiveSQLBench dataset support: per-task SQLite isolation, gated gold sidecar (--gold-file), and a dataset preparation CLI.
  • Documentation

    • README and docs: LiveSQLBench setup, one-shot usage, dataset/mode gating, and benchmark-scoped OTF roots.
  • Improvements

    • Benchmark-scoped on-the-fly artifact roots and safer --otf-rebuild that only purges the targeted benchmark.
  • Tests

    • Extensive new tests covering one-shot behavior, LiveSQLBench loader/prep, path scoping, DB isolation, and rebuild wiring.

Add a new `--mode one-shot` evaluation path that runs the
LiveSQLBench-Base-Lite-SQLite dataset (the unambiguous, one-shot sibling
of mini-interact) against BOTH SLayer agents — `pydantic_ai_recursive`
(KB→memories) and `pydantic_ai_otf_encode` (KB→models). Scored on the
180 SELECT (`category=="Query"`) tasks; no user simulator, no
`ask_user` anywhere in the spawn tree.

Highlights (full plan in DEV-1462's "Final approved plan" section):

* New `--dataset {mini-interact, livesqlbench}` and `--gold-file` flags.
  `--mode one-shot` is gated to `--dataset livesqlbench`; `--dataset
  livesqlbench` is gated to `--mode {one-shot, oracle}`; one-shot
  requires `--slayer-setup on-the-fly`. Guards land both at the CLI
  (parser.error) and at the programmatic-entry points
  (run_evaluation/make_runner/run_one_task — Codex #1).

* New `scripts/prepare_livesqlbench.py` materialises per-DB
  `<db>.sqlite` from `<db>_template.sqlite` (refuses git-LFS pointers
  loudly). New `harness.materialize_task_db` gives each LiveSQLBench
  task an isolated per-instance `db_file_path` (symlinked template) so
  concurrent eval-resets never race the OTF cache build. Stale-symlink
  rebuild + defensive `_ephemeral_`/`_process_` refusal (Codex #4/#5).

* New `harness.load_livesqlbench_tasks`: merges the gated gold sidecar
  by `instance_id`, maps `query`→`amb_user_query`, stamps a
  `task["dataset"]="livesqlbench"` marker (the irreducible source of
  truth for DB isolation + the one-shot run_task programmatic guard),
  filters `category=="Query"` (logs `_M_` disagreements), asserts
  exactly 180 SELECT tasks on a full run, and is `filter_ids`-aware so
  partial-gold targeted runs don't trip the empty-`sol_sql` fail-fast
  (Codex #6).

* One-shot factories + prompts in BOTH SLayer adapters: `_build_sub_explorer`,
  `_build_projection_resolver_oneshot`, `_build_query_constructor_oneshot`
  (+ otf_encode `_build_kb_encoder_oneshot`). `_register_spawn_subagent`
  + `_register_kb_to_slayer` parametrized with `eval_mode` to dispatch
  the right ask_user-free child. New one-shot prompt variants with NO
  ask_user / user-sim language. `agent.run_task` accepts
  `eval_mode in {a-interact, one-shot}` with reserve=submit_query only
  and a one-shot resolver recovery prompt.

* Per-benchmark artifact separation (user principle: artifacts kept
  SEPARATE by benchmark). `paths.slayer_otf_cache_root` and
  `slayer_models_otf_root` gain a `benchmark: str | None = None` kwarg;
  default returns the legacy mini-interact path (no caller breakage),
  `benchmark="livesqlbench"` returns parallel roots
  (`slayer_otf_cache_livesqlbench/`, `slayer_models_otf_livesqlbench/`)
  with parallel env-var overrides. `_maybe_force_wipe_otf` is
  benchmark-aware so `--otf-rebuild` never wipes the wrong root.

* Codex #2 fix: `slayer_otf/reference_build.py::_effective_db_root`
  accepts an authoritative `db_root` kwarg that overrides
  `$BIRD_DB_PATH` (threaded all the way from `ensure_db_reference`
  through `_resolve_datasource_for_build` + `_portabilise_datasource`).
  The otf_encode adapter passes `--db-path` as the explicit `db_root`
  for LiveSQLBench runs, so the setup-encoder build resolves the right
  sqlite even when conftest/CI sets `$BIRD_DB_PATH` to mini-interact.

* README + CLAUDE.md updated with the LiveSQLBench setup +
  per-benchmark roots pattern. A `test_cloud_paths_unchanged.py`
  regression test pins that the DEV-1470 cloud upload-back / post-run
  merge keep using the LEGACY (no-kwarg) roots — cloud-side
  LiveSQLBench is explicitly out of scope and filed as a follow-up.
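
The per-benchmark root pattern described above (legacy path when no kwarg is passed, parallel root plus parallel env override otherwise) could be sketched roughly as follows. The explicit `base` parameter, the legacy directory name `slayer_otf_cache`, and both environment-variable names are assumptions for illustration; the real `paths.slayer_otf_cache_root` signature is only known to carry `benchmark: str | None = None`.

```python
import os
from pathlib import Path
from typing import Optional

_KNOWN_BENCHMARKS = {"livesqlbench"}  # contents assumed for this sketch


def otf_cache_root(base: Path, benchmark: Optional[str] = None) -> Path:
    """Resolve the OTF cache root; `benchmark=None` keeps the legacy location."""
    if benchmark is None:
        # Hypothetical env-var name and assumed legacy directory name.
        env_name, dir_name = "EXAMPLE_OTF_CACHE_ROOT", "slayer_otf_cache"
    elif benchmark in _KNOWN_BENCHMARKS:
        env_name = f"EXAMPLE_OTF_CACHE_ROOT_{benchmark.upper()}"
        dir_name = f"slayer_otf_cache_{benchmark}"
    else:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    override = os.environ.get(env_name)
    return Path(override) if override else base / dir_name
```

Because existing callers pass no `benchmark`, they keep resolving the legacy root, which is the property the cloud-path regression test is described as pinning.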

Full non-integration suite: 1232 passed, 94 skipped, 19 deselected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear Bot commented May 27, 2026

DEV-1462


coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9573c42e-4d57-48a9-bb4d-b906315281b2

📥 Commits

Reviewing files that changed from the base of the PR and between c9bf9e5 and 201ea99.

📒 Files selected for processing (5)
  • src/bird_interact_agents/agents/pydantic_ai_otf_encode/agent.py
  • src/bird_interact_agents/hard8_preprocessor.py
  • src/bird_interact_agents/slayer_pipeline/portable_connection.py
  • tests/test_one_shot_run.py
  • tests/test_otf_encode_reference_root.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/bird_interact_agents/agents/pydantic_ai_otf_encode/agent.py

📝 Walkthrough

Adds LiveSQLBench one-shot execution: dataset preparation, gold-sidecar merging, per-task SQLite materialization, benchmark-scoped OTF cache/model roots, one-shot agent factories/prompts and budgeting for recursive and otf-encode frameworks, db_root threading through reference-build, CLI/programmatic guards, and broad tests and fixtures.

Changes

LiveSQLBench One-Shot Mode

| Layer | File(s) | Summary |
|---|---|---|
| Docs: README & CLAUDE | `README.md`, `CLAUDE.md` | Adds LiveSQLBench one-shot usage documentation and per-benchmark OTF artifact root guidance. |
| Prepare script & CLI | `scripts/prepare_livesqlbench.py`, `tests/test_prepare_livesqlbench.py` | New prepare utility that detects git-LFS pointer templates, atomically materializes `<db>.sqlite` from `<db>_template.sqlite`, supports idempotent/skippable runs and `--force`, prints a per-DB summary, and exposes a programmatic `prepare(...)` and a `main(...)` CLI. |
| Path helpers & benchmark scoping | `src/bird_interact_agents/paths.py`, `tests/test_paths.py`, `tests/test_cloud_paths_unchanged.py` | Adds `livesqlbench_root()` and `livesqlbench_data_file()` with env overrides; introduces `_KNOWN_BENCHMARKS` and `_validate_benchmark`; updates `slayer_otf_cache_root()` and `slayer_models_otf_root()` to accept `benchmark: str \| None = None`. |
| Reference build db_root propagation | `src/bird_interact_agents/slayer_otf/reference_build.py`, `tests/test_otf_encode_reference_root.py` | Adds a `db_root` kwarg to `ensure_db_reference()` and threads it through `_build_reference`, `_resolve_datasource_for_build`, and `_portabilise_datasource`; introduces `_effective_db_root()` with precedence explicit `db_root` > `BIRD_DB_PATH` > mini-interact default. |
| Loader & per-task DB materialization | `src/bird_interact_agents/harness.py`, `tests/test_livesqlbench_loader.py`, `tests/test_db_isolation.py` | Adds `load_livesqlbench_tasks()`, which merges gold sidecar fields, filters Query tasks, and enforces the 180 SELECT-row contract, and `materialize_task_db()`, which symlinks per-instance working DBs from templates and is idempotent/stale-safe. |
| Run CLI & programmatic guards | `src/bird_interact_agents/run.py`, `tests/test_one_shot_mode.py` | Extends the CLI with `--dataset` and `--gold-file`, includes one-shot in `--mode`; enforces dataset↔mode and one-shot framework/slayer-setup/query-mode constraints; scopes `_maybe_force_wipe_otf` purges by benchmark; materializes per-task DBs for oracle runs; hardens `run_evaluation` args and validation. |
| OTF-encode agent one-shot support | `src/bird_interact_agents/agents/pydantic_ai_otf_encode/*`, `tests/test_one_shot_otf_encode_factories.py`, `tests/test_one_shot_run.py` | Adds one-shot `eval_mode` handling, a oneshot KB-encoder builder and prompts, benchmark-scoped resolution for cache and model roots, a `materialize_task_db` pre-run, and a oneshot-specific projection/constructor flow and recovery prompt. |
| Recursive agent one-shot support | `src/bird_interact_agents/agents/pydantic_ai_recursive/*`, `tests/test_one_shot_recursive_factories.py`, `tests/test_one_shot_run.py` | Adds `eval_mode` handling, one-shot budgeting (submit-only reserve), benchmark forwarding into `_resolve_otf_task_storage_dir`, `materialize_task_db` usage, oneshot factory variants and non-interactive prompts. |
| Factory & prompt additions | `src/bird_interact_agents/agents/*/factories.py`, `src/bird_interact_agents/agents/*/prompts.py` | Introduces SUB_EXPLORER/ROOT_EXPLORER and multiple ONESHOT prompt constants, `spawn_subagent` `eval_mode` propagation, per-KB oneshot encoder dispatch, and oneshot submit wrappers removing `ask_user` tooling. |
| Test fixtures & helpers | `tests/_livesqlbench_fixtures.py` | Provides a deterministic tiny SQLite creator, LFS pointer writer, dataset/gold builders, and a `public_task` helper for tests. |
| Validation & unit tests | `tests/*` (many files) | Adds broad test coverage: CLI/programmatic validation, loader semantics, prepare script behavior, path/env override tests, reference-build `db_root` propagation tests, otf-rebuild per-benchmark scoping, a cloud-side AST contract to avoid passing `benchmark` in cloud modules, factory/tool/prompt/docstring checks, orchestration/integration tests, and DB isolation/concurrency tests. |
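
The loader behaviour summarised in the walkthrough (merge gold by `instance_id`, keep `category == "Query"`, enforce the 180-row contract only on full runs) might look roughly like this. It is a sketch under assumptions: `merge_gold`, the in-memory list inputs, and the shape of the gold rows are hypothetical; the field names `instance_id`, `query`, `amb_user_query`, `category`, `sol_sql`, and `dataset` are taken from the PR text.

```python
from typing import Dict, Iterable, List, Optional, Set


def merge_gold(
    tasks: Iterable[Dict],
    gold_rows: Iterable[Dict],
    filter_ids: Optional[Set[str]] = None,
    expected: int = 180,
) -> List[Dict]:
    """Merge gold fields into public tasks and keep only Query tasks."""
    gold_by_id = {row["instance_id"]: row for row in gold_rows}
    merged = []
    for task in tasks:
        if task.get("category") != "Query":
            continue
        # Filter first, so a partial gold file only has to cover targeted IDs.
        if filter_ids is not None and task["instance_id"] not in filter_ids:
            continue
        row = dict(task)
        row.update(gold_by_id.get(task["instance_id"], {}))
        row["amb_user_query"] = row["query"]
        row["dataset"] = "livesqlbench"
        if not row.get("sol_sql"):
            raise ValueError(f"no gold sol_sql for {task['instance_id']}")
        merged.append(row)
    # The row-count contract applies to full runs only.
    if filter_ids is None and len(merged) != expected:
        raise ValueError(f"expected {expected} Query tasks, got {len(merged)}")
    return merged
```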

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • MotleyAI/bird-agents#2: Related OTF artifact lifecycle and _maybe_force_wipe_otf purging/OTF-cache handling.
  • MotleyAI/bird-agents#3: Overlaps in slayer_otf reference-build area (locking/reuse behavior and build pipeline).

"From a rabbit's log, I hop and write,
One-shot paths split left and right,
Templates turned stable with care,
Benchmarked roots now live elsewhere,
A tiny hop — tests pass by night!" 🐇✨

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/bird_interact_agents/agents/pydantic_ai_otf_encode/prompts.py`:
- Around line 436-469: The new KB_ENCODER_ONESHOT_PROMPT removed the
self-tagging/output-contract that sets meta.kb_id={kb_id}, which breaks
downstream dependency discovery used by _live_tagged_entities() and
_format_deps_block(); restore the tagging and output-contract sections from
KB_ENCODER_PROMPT into KB_ENCODER_ONESHOT_PROMPT (keeping only removal of
ask_user-specific instructions) so that encoders still emit meta.kb_id={kb_id}
and any required meta/self-annotation fields and call patterns (e.g.,
submit_encoding(result_json=...)) exactly once, ensuring already-encoded
dependencies are discoverable by _live_tagged_entities() and
_format_deps_block().

In `@src/bird_interact_agents/agents/pydantic_ai_recursive/agent.py`:
- Around line 282-298: run_task currently allows eval_mode="one-shot" even when
self.slayer_setup=="pre-encoded", letting callers bypass the dataset guard; add
an explicit check alongside the existing dataset check to raise ValueError if
is_one_shot and self.slayer_setup != "on-the-fly". Locate the branch around
eval_mode/is_one_shot in run_task (and reference self.slayer_setup) and enforce
that one-shot is only permitted when self.slayer_setup equals the on-the-fly
mode, with a clear error message mentioning both the required slayer_setup and
dataset constraints.

In `@tests/_livesqlbench_fixtures.py`:
- Around line 189-193: The public_task fixture currently uses truthy 'or'
defaults which overrides explicit empty values; change the assignments to use
explicit None checks so empty string "" or empty dict {} are preserved: for the
"query" key use query if query is not None else f"unambiguous request for
{instance_id}", and for the "conditions" key use conditions if conditions is not
None else {"decimal": [], "distinct": False, "order": False}; update the code in
the public_task function where "query" and "conditions" are set accordingly.

In `@tests/test_one_shot_mode.py`:
- Around line 351-352: Split the semicolon-joined statements into separate lines
to satisfy Ruff E702: replace the combined statements that create and write to
"data" and create "db_path" (variables data, db_path, tmp_path in
tests/test_one_shot_mode.py) with two distinct statements — one that assigns
data = tmp_path / "data.jsonl" and calls data.write_text(""), and another that
assigns db_path = tmp_path / "ds" and calls db_path.mkdir() — and do the same
for the other occurrences around the same file (the locations referenced near
the original combined lines).

In `@tests/test_one_shot_run.py`:
- Around line 728-799: The tests install the _spy_materialize spy but never
assert it was invoked, so add an assertion after each instantiation/run in both
OTF one-shot parity tests (the test function shown and the other one around
lines 802-870) to verify DB materialization occurred: after calling
PydanticAIOtfEncodeAgent.run_task, assert that the spy created by
_spy_materialize recorded a call to materialize_task_db (e.g., check spy.called
or spy.call_count > 0 or spy.assert_called_once() depending on the spy API).
Make this change in both test functions referencing _spy_materialize and
materialize_task_db so a regression in one-shot DB materialization is caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 476372c4-3570-4134-9d96-b93cf793a62c

📥 Commits

Reviewing files that changed from the base of the PR and between 33f30ac and 7233a09.

📒 Files selected for processing (27)
  • CLAUDE.md
  • README.md
  • scripts/prepare_livesqlbench.py
  • src/bird_interact_agents/agents/pydantic_ai_otf_encode/agent.py
  • src/bird_interact_agents/agents/pydantic_ai_otf_encode/factories.py
  • src/bird_interact_agents/agents/pydantic_ai_otf_encode/prompts.py
  • src/bird_interact_agents/agents/pydantic_ai_recursive/agent.py
  • src/bird_interact_agents/agents/pydantic_ai_recursive/factories.py
  • src/bird_interact_agents/agents/pydantic_ai_recursive/prompts.py
  • src/bird_interact_agents/harness.py
  • src/bird_interact_agents/paths.py
  • src/bird_interact_agents/run.py
  • src/bird_interact_agents/slayer_otf/reference_build.py
  • tests/_livesqlbench_fixtures.py
  • tests/test_cloud_paths_unchanged.py
  • tests/test_db_isolation.py
  • tests/test_livesqlbench_loader.py
  • tests/test_one_shot_mode.py
  • tests/test_one_shot_otf_encode_factories.py
  • tests/test_one_shot_recursive_factories.py
  • tests/test_one_shot_run.py
  • tests/test_otf_encode_reference_root.py
  • tests/test_otf_rebuild_per_benchmark.py
  • tests/test_otf_rebuild_wiring.py
  • tests/test_paths.py
  • tests/test_prepare_livesqlbench.py
  • tests/test_pydantic_ai_otf_encode_agent.py

Comment thread src/bird_interact_agents/agents/pydantic_ai_otf_encode/prompts.py
Comment thread src/bird_interact_agents/agents/pydantic_ai_recursive/agent.py
Comment thread tests/_livesqlbench_fixtures.py Outdated
Comment thread tests/test_one_shot_mode.py Outdated
Comment thread tests/test_one_shot_run.py
ZmeiGorynych and others added 4 commits May 27, 2026 17:47
* run.py: wire `materialize_task_db` into `run_oracle_task` so
  LiveSQLBench oracle runs at `--concurrency > 1` don't race the
  shared `<db>.sqlite` the OTF cache reads (B0's docstring said it
  was called from both run paths but the oracle wiring was missing
  — Codex).

* pydantic_ai_otf_encode/prompts.py: restore the KB SELF-ANNOTATION
  contract in `KB_ENCODER_ONESHOT_PROMPT`. Every encoded entity must
  carry `label`, the canonical `description` block, and
  `meta.kb_id={kb_id}` — downstream `_live_tagged_entities()` +
  `_format_deps_block()` discover already-encoded deps via those
  tags, so a successfully-encoded KB without the meta.kb_id tag
  silently looks "not encoded" to dependents and breaks transitive
  `kb_to_slayer` resolution (CodeRabbit).

* pydantic_ai_recursive/agent.py: belt-and-suspenders check that
  one-shot ⟹ `slayer_setup == "on-the-fly"`. The check already lives
  at the CLI / run_evaluation / make_runner / run_one_task
  boundaries; this in-run_task assertion closes the bypass for a
  programmatic caller that constructs
  `PydanticAIRecursiveAgent(slayer_setup="pre-encoded")` and invokes
  `.run_task(eval_mode="one-shot", ...)` directly (CodeRabbit). +
  regression test.
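
For context on the first item in this commit, a per-instance symlink with stale-link rebuild could be sketched as below. The function name, the `<instance_id>.sqlite` naming, and the directory layout are assumptions; the actual `harness.materialize_task_db` is not shown on this page, and this sketch does not attempt its `_ephemeral_`/`_process_` refusal.

```python
import os
from pathlib import Path


def materialize_instance_db(template: Path, work_dir: Path, instance_id: str) -> Path:
    """Create (or repair) a per-instance symlink pointing at the template DB."""
    work_dir.mkdir(parents=True, exist_ok=True)
    link = work_dir / f"{instance_id}.sqlite"
    target = template.resolve()
    if link.is_symlink():
        if Path(os.readlink(link)) == target and link.exists():
            return link  # already points at the right, existing template
        link.unlink()  # stale or dangling link: rebuild it
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(target)
    return link
```

The sketch is idempotent for repeated calls with the same template, but it does not guard against two workers creating the same instance link at the same moment.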

Full non-integration suite: 1233 passed (was 1232).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* harness.load_livesqlbench_tasks: the full-run "exactly 180" check
  is now `if … : raise ValueError(…)`, not `assert`. Production
  guards must survive `python -O` / `PYTHONOPTIMIZE`: both strip
  assertions, so a truncated dataset would have silently run in
  exactly the scenario this guard exists for.

* README: the LiveSQLBench setup steps used
  `cd livesqlbench-base-lite-sqlite` then ran
  `uv run python scripts/prepare_livesqlbench.py`, but `scripts/`
  lives in **bird-agents**, not the dataset dir. Rewrite step 2 as a
  subshell (`(cd … && git lfs pull)`) and have step 3 explicitly
  `cd bird-agents` before running the script.
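
The `assert` versus `raise` point can be reproduced in-process with the built-in `compile(..., optimize=1)`, which compiles code the way `python -O` would. The guard functions below are illustrative stand-ins, not the harness code.

```python
ASSERT_GUARD = """
def check(rows):
    assert len(rows) == 180, "truncated dataset"
    return "accepted"
"""

RAISE_GUARD = """
def check(rows):
    if len(rows) != 180:
        raise ValueError("truncated dataset")
    return "accepted"
"""


def load_check(source: str, optimize: int):
    """Compile `source` at the given optimization level and return its check()."""
    namespace = {}
    exec(compile(source, "<guard>", "exec", optimize=optimize), namespace)
    return namespace["check"]


# At optimize=1 the assert is compiled away, so an empty dataset is accepted.
silently_accepted = load_check(ASSERT_GUARD, 1)([])
```

With `RAISE_GUARD`, the same call raises `ValueError` at every optimization level.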

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tests/_livesqlbench_fixtures.py: `public_task` uses `is not None`
  (not `or`) for `query` and `conditions` defaults, so a caller can
  build a fixture with an explicit empty `""` / `{}` for edge-case
  tests instead of having it silently replaced by the default.

* tests/test_one_shot_mode.py: split semicolon-separated statements
  (`data = ...; data.write_text("")`) into two lines each — Ruff E702
  flags the one-liner form and would break CI once Ruff is wired in.

* tests/test_one_shot_run.py: both OTF one-shot parity tests (reserve
  + empty-resolver-skip) now capture the `materialize_calls` spy
  return and assert it was invoked. The recursive parity tests
  already do this; without it a regression in OTF one-shot DB
  materialization would slip through these two paths.
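
The `or` versus `is not None` difference in the fixture fix can be seen in a small sketch. The two helper names are hypothetical; the default values are the ones quoted in the review comment.

```python
from typing import Dict, Optional

DEFAULT_CONDITIONS = {"decimal": [], "distinct": False, "order": False}


def task_with_or(instance_id: str, query: Optional[str] = None,
                 conditions: Optional[Dict] = None) -> Dict:
    # Truthiness-based default: "" and {} are falsy, so they get replaced.
    return {
        "query": query or f"unambiguous request for {instance_id}",
        "conditions": conditions or dict(DEFAULT_CONDITIONS),
    }


def task_with_none_check(instance_id: str, query: Optional[str] = None,
                         conditions: Optional[Dict] = None) -> Dict:
    # Only a missing (None) argument triggers the default.
    return {
        "query": query if query is not None
        else f"unambiguous request for {instance_id}",
        "conditions": conditions if conditions is not None
        else dict(DEFAULT_CONDITIONS),
    }
```

Calling both with `query=""` and `conditions={}` shows the gap: the `or` version returns the defaults, while the `is not None` version preserves the explicit empty values.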

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous round threaded `db_root` into `ensure_db_reference` →
`_effective_db_root` so the otf_encode **reference build** ignores
`$BIRD_DB_PATH` when an authoritative `--db-path` is in play. But the
**per-task variant copy** (which is what the SLayer MCP server reads
at runtime) is materialised by `build_task_variant_storage`, and that
function delegates to `resolve_committed_connection_string` which
still preferred `$BIRD_DB_PATH` over the supplied root.

Net effect: a one-shot LiveSQLBench run would correctly build/reuse
the LiveSQLBench-scoped reference, then SILENTLY query the
mini-interact sqlite at task time (because conftest / CI / day-to-day
shells often set `BIRD_DB_PATH` to mini-interact).

Mirror the previous fix one more layer:

* `slayer_pipeline/portable_connection.py::resolve_committed_connection_string`
  gains an explicit `db_root: Path | None = None` kwarg that overrides
  `$BIRD_DB_PATH` when set. Back-compat preserved (no kwarg → legacy
  env-wins semantics).

* `hard8_preprocessor.build_task_variant_storage` gains the same
  `db_root` kwarg and forwards it into
  `resolve_committed_connection_string`.

* otf_encode's `_resolve_otf_task_storage_dir` now passes
  `db_root=Path(data_path_base).resolve()` (the same value already
  passed to `ensure_db_reference`), so build-time + runtime
  re-anchoring see the SAME root.

Recursive flavor unaffected — `runtime._rewrite_datasource_connection_string`
force-overwrites the connection string from `data_path_base` already,
no env consultation.
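
The precedence this commit establishes (explicit `db_root` over `$BIRD_DB_PATH` over the default) can be sketched as a single resolver. The function name and the `default_root` parameter are assumptions; only the precedence order and the `BIRD_DB_PATH` variable come from the PR.

```python
import os
from pathlib import Path
from typing import Optional


def effective_db_root(default_root: Path, db_root: Optional[Path] = None) -> Path:
    """Resolve the DB root: explicit kwarg, then $BIRD_DB_PATH, then default."""
    if db_root is not None:
        return Path(db_root).resolve()  # authoritative: ignores the environment
    env_value = os.environ.get("BIRD_DB_PATH")
    if env_value:
        return Path(env_value).resolve()  # legacy behaviour: env wins
    return default_root.resolve()
```

Passing the same explicit root to both the reference build and the per-task variant build is what keeps build time and runtime anchored to one directory.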

+ 3 regression tests in test_otf_encode_reference_root.py:
  - resolve_committed_connection_string honours db_root over env,
  - back-compat: no db_root → env wins,
  - build_task_variant_storage forwards db_root through the resolver
    (spies the source-module resolver, not the importing-module
    binding, because the import lives inside the function body).
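
The parenthetical about spying on the source-module resolver reflects a general `unittest.mock` rule: when a function does its `from module import name` inside its own body, the name is looked up on the source module at each call, so the patch has to target that module. A self-contained sketch, with `example_source`, `resolve`, and `build_variant` as hypothetical stand-ins:

```python
import sys
import types
from unittest import mock


def _real_resolve(db_root=None):
    return "real"


# Hypothetical stand-in for the module that defines the resolver.
source = types.ModuleType("example_source")
source.resolve = _real_resolve
sys.modules["example_source"] = source


def build_variant(db_root=None):
    # The import lives in the function body, so `resolve` is fetched from
    # example_source on every call rather than bound once at import time.
    from example_source import resolve
    return resolve(db_root=db_root)


def call_with_spy(db_root):
    """Patch the source-module attribute and report what the caller saw."""
    with mock.patch("example_source.resolve", return_value="spied") as spy:
        result = build_variant(db_root=db_root)
    return result, spy
```

After the `with` block exits, `mock.patch` restores the original attribute, so later calls to `build_variant` reach the real resolver again.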

Full non-integration suite: 1236 passed (was 1233; +3 new tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>