Skip to content

feat(engine): per-tool tool-schema tiering + tier-aware guard (slab Phases 1-2) [A/B-gated]#1012

Draft
100yenadmin wants to merge 1 commit into
mainfrom
feat/slab-phase1-tiering
Draft

feat(engine): per-tool tool-schema tiering + tier-aware guard (slab Phases 1-2) [A/B-gated]#1012
100yenadmin wants to merge 1 commit into
mainfrom
feat/slab-phase1-tiering

Conversation

@100yenadmin

Copy link
Copy Markdown
Member

DRAFT — do not merge until the Phase-3 duo A/B passes (per the approved FPAD plan: merge only if cache-not-dented + cold-open-not-worse + tool-selection ≥ control).

What

Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"].

Verified the mechanism end-to-end before building: claude 2.1.160 resolves the pin per-tool (_meta["anthropic/alwaysLoad"]===true, binary grep) and FastMCP (mcp 1.27.1) propagates @mcp.tool(meta=...)list_tools()._meta (runtime probe). So this is an in-place decorator-style annotation on the frozen worldos-engine server — no facade server, no engine split, no rename (R1's feared blocker was falsified).

The win (measured)

bytes tools
Full slab (baseline, today) 118,739 B 153
Pinned core (tiered arm) 63,862 B 69
Deferred (behind ToolSearch) 57,500 B 84

−46% of the per-beat injected slab (~13.7k tokens) — with a deliberately generous core. (285-transcript census: 92/153 tools never called in real play.)

How it's safe

  • _apply_tool_tiering() is inert under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, the default): the harness ORs the server pin over every tool, so production is byte-identical until the post-A/B cutover. It only activates for the tiered A/B arm (=0).
  • PINNED_ALLOWLIST = hot beat loop + full active-combat verb set (die-triggered, no payload hint) + cold-open path (no payload names them; the 22-turn give-up band) + the 18 reach-for tools. New tools default deferred.
  • Cold tools stay findable: the engine names them in the obligations/director payloads the DM already holds, or they're explicit-intent-gated (Step-1.7 reach-for validation found no selection regression).

Guard (test_tool_schema_budget.py, now tier-aware)

  1. Ratchet on the pinned-core slab bytes. 2. Pinned set == PINNED_ALLOWLIST (growth forcing-function). 3. Per-tool _meta actually propagates to list_tools() (fail loud on a FastMCP/claude upgrade). 4. Full-slab secondary cap + the reach-for first-sentence guard.

Deviation from the approved Phase 1 (flagged)

The approved plan said "drop server-level alwaysLoad" in Phase 1. I kept it env-gated instead so production default stays baseline (this PR is the dormant mechanism). Dropping server-level alwaysLoad = flipping production to tiered = exactly the behavior change the A/B must gate, so it's deferred to a tiny post-A/B cutover flip. Both A/B arms run from this branch today via WORLDOS_ENGINE_ALWAYSLOAD.

Tests

Full engine suite 2992 passed (single-process). Baseline byte-clean asserted.

Phase 3 (the gate — next)

Same-SHA/same-seed duo A/B: arm1 WORLDOS_ENGINE_ALWAYSLOAD=1 vs arm2 =0. Remaining harness work: extend qa/latency_rollup.py to parse cache_creation/cache_read + cold-open seconds from the *.dm.jsonl result events; add a chance-corrected tool-selection check vs the census. Heavy/paired playtests → support-VM lane.

FPAD record: worldos-session-notes/2026-06-18/tool-schema-slab-decision/decision-record.md.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ae0c2545-a64a-4c46-b5ed-3263fbe01320

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/slab-phase1-tiering

Comment @coderabbitai help to get the list of available commands.

… (slab Phases 1-2)

Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail
behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"] — verified to resolve
per-tool in claude 2.1.160 and to propagate from FastMCP @tool meta on the installed mcp 1.27.1.
No facade server, no rename of the frozen worldos-engine id.

- PINNED_ALLOWLIST (69): the hot beat loop + the full active-combat verb set + the cold-open path
  + the 18 reach-for tools. New tools default DEFERRED. (285-transcript census: 92/153 never called.)
- _apply_tool_tiering(): annotates the core; INERT under the whole-server baseline
  (WORLDOS_ENGINE_ALWAYSLOAD=1, default) so production is byte-identical until the post-A/B cutover;
  activates for the tiered A/B arm (=0). Validates the allowlist names exist (fail loud).
- test_tool_schema_budget.py is now tier-aware: a ratchet on the PINNED-core slab, assert the
  pinned set == PINNED_ALLOWLIST (the growth forcing-function), assert per-tool _meta actually
  propagates to list_tools(), keep the reach-for first-sentence guard + a full-slab secondary cap.

Measured: pinned core = 63,862 B vs 118,739 B full slab = -46% per beat (~13.7k tokens) with a
deliberately generous core. Baseline byte-identical. Full engine suite 2992 green.

Phases 1-2 of the FPAD slab decision (worldos-session-notes/2026-06-18/tool-schema-slab-decision/).
DO NOT MERGE until the Phase-3 duo A/B (cache_creation/read + cold-open + chance-corrected
selection >= control). This PR is the dormant mechanism + guard; the production cutover (dropping
server-level alwaysLoad) is a separate gated flip.

--- Rebased over #1246 / #1248 / #1250 (2026-07-02) ---
Rebased onto origin/main after three merged engine PRs touched servers/engine/server.py.
Conflict was import-only (both sides added imports): kept main's fcntl/functools/json
(from #1246 action-economy + #1248 AC-ownership helpers) AND this PR's os, alphabetized.
The tiering tail block appended cleanly after mark_climax; _apply_tool_tiering() still
runs at import after every @mcp.tool registers, so it now sees main's 156 tools.

New @mcp.tool defs on main since the PR base (from the #598/#1250 coords-combat wiring),
classified per this PR's own tiering rule (hot = combat/turn-loop verb the DM hits every
beat; a mid-fight ToolSearch hop is the worst-failure class) — all three PINNED (HOT):
  - move_to_coords          : grid twin of the pinned move_to_zone; die/turn-loop movement,
                              returns opportunity-attack. HOT.
  - place_combatant_at_coords: grid twin of the pinned place_combatant; combat setup verb. HOT.
  - set_grid                : combat-setup verb that establishes grid mode; the coords verbs
                              are inert without it — pinned to keep the grid verb set coherent. HOT.

Ratchets bumped WITH justification (3 real combat tools promoted into the core, exactly the
case the ratchet-test error message says to raise for): pinned slab 63,862 -> 66,955 B
(ceiling 66,000 -> 67,000); full-slab cap 118,739 -> 124,835 B (121,000 -> 125,000, grew
because main added the tools regardless of tiering). Remains DRAFT pending the Phase-3 duo A/B.
@100yenadmin 100yenadmin force-pushed the feat/slab-phase1-tiering branch from 1de4f35 to 35e8624 Compare July 2, 2026 09:48
@100yenadmin

Copy link
Copy Markdown
Member Author

Rebased over #1246 / #1248 / #1250 (still DRAFT)

Rebased feat/slab-phase1-tiering onto current origin/main (HEAD 811a5ec3, past the three merged PRs that touched servers/engine/server.py).

Conflict resolution — one hunk, import-only:

New tools since PR base → classified per this PR's own tiering rule (hot = combat/turn-loop verb the DM hits every beat). The #598/#1250 coords-combat wiring added 3 @mcp.tools; all three are PINNED (HOT):

  • move_to_coords — grid twin of the pinned move_to_zone; die/turn-loop movement, returns opportunity-attack.
  • place_combatant_at_coords — grid twin of the pinned place_combatant; combat setup.
  • set_grid — establishes grid mode; the coords verbs are inert without it — pinned to keep the grid verb set coherent.

Ratchets bumped with justification (3 real combat tools promoted into the core — exactly the case the ratchet-test error says to raise for): pinned slab 63,862 → 66,955 B (ceiling 66,000 → 67,000); full-slab cap 118,739 → 124,835 B (121,000 → 125,000, grew because main added the tools regardless of tiering).

Verification (single-process, no xdist):

  • test_tool_schema_budget.py — 5 passed
  • affected surfaces (action-economy, AC-ownership, armor-props, grid, combat, scene-grid, schema) — 336 passed
  • qa/fast_gate.sh — PASS (241 deterministic, exit 0)

Remains DRAFT pending the Phase-3 duo A/B (cache_creation/read + cold-open + chance-corrected selection ≥ control). No default flag flipped (WORLDOS_ENGINE_ALWAYSLOAD stays 1 = byte-identical baseline).

@100yenadmin

Copy link
Copy Markdown
Member Author

Phase-3 duo A/B: FAIL — do not cut over for latency; retain as a BUDGET lever

Both arms played clean (same SHA/world/persona/6 beats, 0 failed beats, behavioral GREEN both). Flag exercised: per-tool _meta diff confirmed 72 pinned / 84 deferred under ALWAYSLOAD=0; gameplay-level ToolSearch went 1 (base) → 10 (tier) — the deferral is real and load-bearing.

Latency (duration_api_ms): cold-open 185.3s base → 192.6s tier (+3.9%, REGRESSED); s/beat 99.8 → 94.5; turns/beat 5.7 → 5.9. The deferral reintroduces exactly the ToolSearch round-trip churn the alwaysLoad pin removed (see worldos-latency-forensics: cold-open 248→176 when pinning landed). Quality: story +0.1 (noise), mech −0.2, angry-dm −0.5 (likely scene variance — low-coverage social scenes both arms — but outside the specified noise band).

Disposition: stays DRAFT. The tiering doesn't buy latency — it costs it. Its real value is the ~46% injected-schema reduction as a budget-relief lever: post-#1257/#1265/#1277 the slab sits at 119,319B of the ~121K ceiling (~1.7KB headroom). Recommend re-framing this PR as the pull-when-headroom-runs-out mechanism, with this measured latency cost documented as the price.

Artifacts: worldos-wt-slab-tiering/qa/transcripts/ab1012b-{base,tier}.* (latency.json + 3 lenses each).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant