feat(engine): per-tool tool-schema tiering + tier-aware guard (slab Phases 1-2) [A/B-gated]#1012
feat(engine): per-tool tool-schema tiering + tier-aware guard (slab Phases 1-2) [A/B-gated]#1012100yenadmin wants to merge 1 commit into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
… (slab Phases 1-2) Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"] — verified to resolve per-tool in claude 2.1.160 and to propagate from FastMCP @tool meta on the installed mcp 1.27.1. No facade server, no rename of the frozen worldos-engine id. - PINNED_ALLOWLIST (69): the hot beat loop + the full active-combat verb set + the cold-open path + the 18 reach-for tools. New tools default DEFERRED. (285-transcript census: 92/153 never called.) - _apply_tool_tiering(): annotates the core; INERT under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, default) so production is byte-identical until the post-A/B cutover; activates for the tiered A/B arm (=0). Validates the allowlist names exist (fail loud). - test_tool_schema_budget.py is now tier-aware: a ratchet on the PINNED-core slab, assert the pinned set == PINNED_ALLOWLIST (the growth forcing-function), assert per-tool _meta actually propagates to list_tools(), keep the reach-for first-sentence guard + a full-slab secondary cap. Measured: pinned core = 63,862 B vs 118,739 B full slab = -46% per beat (~13.7k tokens) with a deliberately generous core. Baseline byte-identical. Full engine suite 2992 green. Phases 1-2 of the FPAD slab decision (worldos-session-notes/2026-06-18/tool-schema-slab-decision/). DO NOT MERGE until the Phase-3 duo A/B (cache_creation/read + cold-open + chance-corrected selection >= control). This PR is the dormant mechanism + guard; the production cutover (dropping server-level alwaysLoad) is a separate gated flip. --- Rebased over #1246 / #1248 / #1250 (2026-07-02) --- Rebased onto origin/main after three merged engine PRs touched servers/engine/server.py. Conflict was import-only (both sides added imports): kept main's fcntl/functools/json (from #1246 action-economy + #1248 AC-ownership helpers) AND this PR's os, alphabetized. The tiering tail block appended cleanly after mark_climax; _apply_tool_tiering() still runs at import after every @mcp.tool registers, so it now sees main's 156 tools. New @mcp.tool defs on main since the PR base (from the #598/#1250 coords-combat wiring), classified per this PR's own tiering rule (hot = combat/turn-loop verb the DM hits every beat; a mid-fight ToolSearch hop is the worst-failure class) — all three PINNED (HOT): - move_to_coords : grid twin of the pinned move_to_zone; die/turn-loop movement, returns opportunity-attack. HOT. - place_combatant_at_coords: grid twin of the pinned place_combatant; combat setup verb. HOT. - set_grid : combat-setup verb that establishes grid mode; the coords verbs are inert without it — pinned to keep the grid verb set coherent. HOT. Ratchets bumped WITH justification (3 real combat tools promoted into the core, exactly the case the ratchet-test error message says to raise for): pinned slab 63,862 -> 66,955 B (ceiling 66,000 -> 67,000); full-slab cap 118,739 -> 124,835 B (121,000 -> 125,000, grew because main added the tools regardless of tiering). Remains DRAFT pending the Phase-3 duo A/B.
1de4f35 to
35e8624
Compare
Rebased over #1246 / #1248 / #1250 (still DRAFT)Rebased Conflict resolution — one hunk, import-only:
New tools since PR base → classified per this PR's own tiering rule (hot = combat/turn-loop verb the DM hits every beat). The #598/#1250 coords-combat wiring added 3
Ratchets bumped with justification (3 real combat tools promoted into the core — exactly the case the ratchet-test error says to raise for): pinned slab Verification (single-process, no xdist):
Remains DRAFT pending the Phase-3 duo A/B (cache_creation/read + cold-open + chance-corrected selection ≥ control). No default flag flipped ( |
Phase-3 duo A/B: FAIL — do not cut over for latency; retain as a BUDGET leverBoth arms played clean (same SHA/world/persona/6 beats, 0 failed beats, behavioral GREEN both). Flag exercised: per-tool Latency (duration_api_ms): cold-open 185.3s base → 192.6s tier (+3.9%, REGRESSED); s/beat 99.8 → 94.5; turns/beat 5.7 → 5.9. The deferral reintroduces exactly the ToolSearch round-trip churn the alwaysLoad pin removed (see worldos-latency-forensics: cold-open 248→176 when pinning landed). Quality: story +0.1 (noise), mech −0.2, angry-dm −0.5 (likely scene variance — low-coverage social scenes both arms — but outside the specified noise band). Disposition: stays DRAFT. The tiering doesn't buy latency — it costs it. Its real value is the ~46% injected-schema reduction as a budget-relief lever: post-#1257/#1265/#1277 the slab sits at 119,319B of the ~121K ceiling (~1.7KB headroom). Recommend re-framing this PR as the pull-when-headroom-runs-out mechanism, with this measured latency cost documented as the price. Artifacts: |
DRAFT — do not merge until the Phase-3 duo A/B passes (per the approved FPAD plan: merge only if cache-not-dented + cold-open-not-worse + tool-selection ≥ control).
What
Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool
_meta["anthropic/alwaysLoad"].Verified the mechanism end-to-end before building: claude 2.1.160 resolves the pin per-tool (
_meta["anthropic/alwaysLoad"]===true, binary grep) and FastMCP (mcp 1.27.1) propagates@mcp.tool(meta=...)→list_tools()._meta(runtime probe). So this is an in-place decorator-style annotation on the frozenworldos-engineserver — no facade server, no engine split, no rename (R1's feared blocker was falsified).The win (measured)
−46% of the per-beat injected slab (~13.7k tokens) — with a deliberately generous core. (285-transcript census: 92/153 tools never called in real play.)
How it's safe
_apply_tool_tiering()is inert under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, the default): the harness ORs the server pin over every tool, so production is byte-identical until the post-A/B cutover. It only activates for the tiered A/B arm (=0).PINNED_ALLOWLIST= hot beat loop + full active-combat verb set (die-triggered, no payload hint) + cold-open path (no payload names them; the 22-turn give-up band) + the 18 reach-for tools. New tools default deferred.obligations/directorpayloads the DM already holds, or they're explicit-intent-gated (Step-1.7 reach-for validation found no selection regression).Guard (
test_tool_schema_budget.py, now tier-aware)PINNED_ALLOWLIST(growth forcing-function). 3. Per-tool_metaactually propagates tolist_tools()(fail loud on a FastMCP/claude upgrade). 4. Full-slab secondary cap + the reach-for first-sentence guard.Deviation from the approved Phase 1 (flagged)
The approved plan said "drop server-level
alwaysLoad" in Phase 1. I kept it env-gated instead so production default stays baseline (this PR is the dormant mechanism). Dropping server-levelalwaysLoad= flipping production to tiered = exactly the behavior change the A/B must gate, so it's deferred to a tiny post-A/B cutover flip. Both A/B arms run from this branch today viaWORLDOS_ENGINE_ALWAYSLOAD.Tests
Full engine suite 2992 passed (single-process). Baseline byte-clean asserted.
Phase 3 (the gate — next)
Same-SHA/same-seed duo A/B: arm1
WORLDOS_ENGINE_ALWAYSLOAD=1vs arm2=0. Remaining harness work: extendqa/latency_rollup.pyto parsecache_creation/cache_read+ cold-open seconds from the*.dm.jsonlresult events; add a chance-corrected tool-selection check vs the census. Heavy/paired playtests → support-VM lane.FPAD record:
worldos-session-notes/2026-06-18/tool-schema-slab-decision/decision-record.md.