feat(engine): per-tool tool-schema tiering + tier-aware guard (slab Phases 1-2) [A/B-gated] by 100yenadmin · Pull Request #1012 · electricsheephq/WorldOS

100yenadmin · 2026-06-18T07:26:51Z

DRAFT — do not merge until the Phase-3 duo A/B passes (per the approved FPAD plan: merge only if cache-not-dented + cold-open-not-worse + tool-selection ≥ control).

What

Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"].

Verified the mechanism end-to-end before building: claude 2.1.160 resolves the pin per-tool (_meta["anthropic/alwaysLoad"]===true, binary grep) and FastMCP (mcp 1.27.1) propagates @mcp.tool(meta=...) → list_tools()._meta (runtime probe). So this is an in-place decorator-style annotation on the frozen worldos-engine server — no facade server, no engine split, no rename (R1's feared blocker was falsified).
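The decorator-to-listing path described above can be pictured with a small stand-in. This is only a sketch: `ToyServer` and `rarely_used_tool` are hypothetical and do not reproduce the FastMCP API; only the `anthropic/alwaysLoad` key and the `_meta` field name come from the PR text.

```python
from typing import Any, Callable, Dict, List, Optional

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"  # key quoted in the PR text


class ToyServer:
    """Hypothetical registry imitating a decorator that accepts per-tool meta."""

    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def tool(self, meta: Optional[Dict[str, Any]] = None) -> Callable:
        def decorator(fn: Callable) -> Callable:
            # Store a copy of the metadata next to the registered function.
            self._tools[fn.__name__] = {"fn": fn, "meta": dict(meta or {})}
            return fn
        return decorator

    def list_tools(self) -> List[Dict[str, Any]]:
        # Expose per-tool metadata under "_meta", as the PR describes.
        return [{"name": name, "_meta": dict(entry["meta"])}
                for name, entry in self._tools.items()]


server = ToyServer()


@server.tool(meta={ALWAYS_LOAD_KEY: True})
def move_to_zone(zone: str) -> str:
    return f"moved to {zone}"


@server.tool()  # no meta: stays deferred in this sketch
def rarely_used_tool() -> str:
    return "ok"


def is_pinned(tool: Dict[str, Any]) -> bool:
    # Strict identity check, mirroring the `=== true` resolution noted in the PR.
    return tool["_meta"].get(ALWAYS_LOAD_KEY) is True


pinned = [t["name"] for t in server.list_tools() if is_pinned(t)]
print(pinned)
```

The point of the sketch is that the annotation travels with each tool, so no second server or rename is needed to split the set into pinned and deferred.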

The win (measured)

| | bytes | tools |
|---|---|---|
| Full slab (baseline, today) | 118,739 B | 153 |
| Pinned core (tiered arm) | 63,862 B | 69 |
| Deferred (behind ToolSearch) | 57,500 B | 84 |

−46% of the per-beat injected slab (~13.7k tokens) — with a deliberately generous core. (285-transcript census: 92/153 tools never called in real play.)
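The headline figures can be re-derived from the byte counts in the table. The sketch below assumes roughly 4 bytes per token to turn bytes into tokens; that ratio is an assumption for illustration, not something the PR states.

```python
FULL_SLAB_BYTES = 118_739     # from the table above
PINNED_CORE_BYTES = 63_862    # from the table above
BYTES_PER_TOKEN = 4           # assumption: rough bytes-per-token ratio

saved_bytes = FULL_SLAB_BYTES - PINNED_CORE_BYTES
reduction = saved_bytes / FULL_SLAB_BYTES
saved_tokens = saved_bytes / BYTES_PER_TOKEN

# Census figure quoted in the PR: 92 of 153 tools never called.
never_called_share = 92 / 153

print(f"saved {saved_bytes} B = {reduction:.1%} of the slab")
print(f"about {saved_tokens / 1000:.1f}k tokens at {BYTES_PER_TOKEN} B/token")
print(f"never-called share of tools: {never_called_share:.0%}")
```

Under that assumption the saved 54,877 B comes out near the ~13.7k tokens quoted above.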

How it's safe

_apply_tool_tiering() is inert under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, the default): the harness ORs the server pin over every tool, so production is byte-identical until the post-A/B cutover. It only activates for the tiered A/B arm (=0).
PINNED_ALLOWLIST = hot beat loop + full active-combat verb set (die-triggered, no payload hint) + cold-open path (no payload names them; the 22-turn give-up band) + the 18 reach-for tools. New tools default deferred.
Cold tools stay findable: the engine names them in the obligations/director payloads the DM already holds, or they're explicit-intent-gated (Step-1.7 reach-for validation found no selection regression).
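A minimal sketch of the gating behaviour described in these points, assuming the registry is a plain dict of tool name to metadata dict. The function body, the registry shape, and `hypothetical_cold_tool` are illustrative; the real `_apply_tool_tiering()` is not shown in this thread.

```python
import os
from typing import Dict, Iterable, Mapping, Optional

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"   # key quoted in the PR
ENV_FLAG = "WORLDOS_ENGINE_ALWAYSLOAD"     # flag quoted in the PR


def apply_tool_tiering(
    registry: Dict[str, dict],
    allowlist: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Pin allowlisted tools in place; return True only if tiering activated."""
    env = os.environ if env is None else env
    # Assumption: anything other than an explicit "0" counts as the baseline.
    if env.get(ENV_FLAG, "1") != "0":
        return False  # inert: per-tool metadata is left untouched
    names = list(allowlist)
    missing = sorted(set(names) - set(registry))
    if missing:
        # Fail loud if the allowlist names a tool that was never registered.
        raise RuntimeError(f"allowlist names unknown tools: {missing}")
    for name in names:
        registry[name][ALWAYS_LOAD_KEY] = True
    return True  # tools outside the allowlist stay deferred by default


def make_registry() -> Dict[str, dict]:
    return {"move_to_zone": {}, "place_combatant": {}, "hypothetical_cold_tool": {}}


ALLOWLIST = ("move_to_zone", "place_combatant")

baseline = make_registry()
baseline_active = apply_tool_tiering(baseline, ALLOWLIST, env={})

tiered = make_registry()
tiered_active = apply_tool_tiering(tiered, ALLOWLIST, env={ENV_FLAG: "0"})
```

Passing `env` explicitly keeps the sketch testable without touching the process environment; the default falls back to `os.environ`.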

Guard (`test_tool_schema_budget.py`, now tier-aware)

1. Ratchet on the pinned-core slab bytes.
2. Pinned set == PINNED_ALLOWLIST (growth forcing-function).
3. Per-tool _meta actually propagates to list_tools() (fail loud on a FastMCP/claude upgrade).
4. Full-slab secondary cap + the reach-for first-sentence guard.
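The guard's byte ratchet and set-equality checks might look roughly like the sketch below. How the real test serializes and measures the slab is not shown in this thread, so `slab_bytes` (JSON-encoded length) is an assumption, `guard_violations` is a hypothetical helper, and the reach-for first-sentence guard is not modelled.

```python
import json
from typing import Any, Dict, Iterable, List

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"


def slab_bytes(tools: Iterable[Dict[str, Any]]) -> int:
    # Assumption: slab size is approximated by the UTF-8 length of each tool's JSON.
    return sum(len(json.dumps(t, sort_keys=True).encode("utf-8")) for t in tools)


def guard_violations(
    tools: List[Dict[str, Any]],
    allowlist: Iterable[str],
    pinned_ceiling: int,
    full_cap: int,
) -> List[str]:
    """Return a list of human-readable violations; empty means the guard passes."""
    pinned = [t for t in tools if t.get("_meta", {}).get(ALWAYS_LOAD_KEY) is True]
    errors = []
    if slab_bytes(pinned) > pinned_ceiling:
        errors.append("pinned slab over ceiling")
    # Equality (not subset) forces a deliberate allowlist edit when the core grows,
    # and also fails if per-tool _meta stops propagating to the listing.
    if {t["name"] for t in pinned} != set(allowlist):
        errors.append("pinned set != allowlist")
    if slab_bytes(tools) > full_cap:
        errors.append("full slab over secondary cap")
    return errors


tools = [
    {"name": "move_to_zone", "description": "Move a combatant to a zone.",
     "_meta": {ALWAYS_LOAD_KEY: True}},
    {"name": "hypothetical_cold_tool", "description": "Rarely used.", "_meta": {}},
]

ok = guard_violations(tools, {"move_to_zone"}, pinned_ceiling=1000, full_cap=2000)
drifted = guard_violations(tools, {"move_to_zone", "set_grid"}, 1000, 2000)
print(ok, drifted)
```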

Deviation from the approved Phase 1 (flagged)

The approved plan said "drop server-level alwaysLoad" in Phase 1. I kept it env-gated instead so production default stays baseline (this PR is the dormant mechanism). Dropping server-level alwaysLoad = flipping production to tiered = exactly the behavior change the A/B must gate, so it's deferred to a tiny post-A/B cutover flip. Both A/B arms run from this branch today via WORLDOS_ENGINE_ALWAYSLOAD.

Tests

Full engine suite 2992 passed (single-process). Baseline byte-clean asserted.

Phase 3 (the gate — next)

Same-SHA/same-seed duo A/B: arm1 WORLDOS_ENGINE_ALWAYSLOAD=1 vs arm2 =0. Remaining harness work: extend qa/latency_rollup.py to parse cache_creation/cache_read + cold-open seconds from the *.dm.jsonl result events; add a chance-corrected tool-selection check vs the census. Heavy/paired playtests → support-VM lane.
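A rough sketch of the rollup extension mentioned above, reading result events from JSONL lines. The field names (`type == "result"`, `usage.cache_creation_input_tokens`, `usage.cache_read_input_tokens`, `duration_api_ms`) and the choice to treat the first result event as the cold open are assumptions about the transcript format, and the sample values are made up.

```python
import json
from typing import Dict, Iterable, Optional


def rollup_result_events(lines: Iterable[str]) -> Dict[str, Optional[float]]:
    """Sum cache token counters over result events and note the first event's duration."""
    totals: Dict[str, Optional[float]] = {
        "cache_creation": 0,
        "cache_read": 0,
        "results": 0,
        "cold_open_seconds": None,  # assumption: first result event = cold open
    }
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the transcript
        event = json.loads(line)
        if event.get("type") != "result":
            continue
        usage = event.get("usage") or {}
        totals["cache_creation"] += usage.get("cache_creation_input_tokens", 0)
        totals["cache_read"] += usage.get("cache_read_input_tokens", 0)
        totals["results"] += 1
        if totals["cold_open_seconds"] is None:
            totals["cold_open_seconds"] = event.get("duration_api_ms", 0) / 1000.0
    return totals


# Made-up sample lines, not real transcript data.
sample = [
    '{"type": "assistant", "message": "..."}',
    '{"type": "result", "duration_api_ms": 2000, '
    '"usage": {"cache_creation_input_tokens": 1200, "cache_read_input_tokens": 30000}}',
    "",
    '{"type": "result", "duration_api_ms": 3000, '
    '"usage": {"cache_creation_input_tokens": 300, "cache_read_input_tokens": 31000}}',
]

totals = rollup_result_events(sample)
print(totals)
```

Running the same function over each arm's transcript would give comparable cache and cold-open figures for the two sides of the A/B.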

FPAD record: worldos-session-notes/2026-06-18/tool-schema-slab-decision/decision-record.md.

coderabbitai · 2026-06-18T07:26:59Z

Important

Review skipped

Draft detected.


@tool

… (slab Phases 1-2)

Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"], verified to resolve per-tool in claude 2.1.160 and to propagate from FastMCP @tool meta on the installed mcp 1.27.1. No facade server, no rename of the frozen worldos-engine id.

- PINNED_ALLOWLIST (69): the hot beat loop + the full active-combat verb set + the cold-open path + the 18 reach-for tools. New tools default DEFERRED. (285-transcript census: 92/153 never called.)
- _apply_tool_tiering(): annotates the core; INERT under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, default) so production is byte-identical until the post-A/B cutover; activates for the tiered A/B arm (=0). Validates the allowlist names exist (fail loud).
- test_tool_schema_budget.py is now tier-aware: a ratchet on the PINNED-core slab, assert the pinned set == PINNED_ALLOWLIST (the growth forcing-function), assert per-tool _meta actually propagates to list_tools(), keep the reach-for first-sentence guard + a full-slab secondary cap.

Measured: pinned core = 63,862 B vs 118,739 B full slab = -46% per beat (~13.7k tokens) with a deliberately generous core. Baseline byte-identical. Full engine suite 2992 green.

Phases 1-2 of the FPAD slab decision (worldos-session-notes/2026-06-18/tool-schema-slab-decision/). DO NOT MERGE until the Phase-3 duo A/B (cache_creation/read + cold-open + chance-corrected selection >= control). This PR is the dormant mechanism + guard; the production cutover (dropping server-level alwaysLoad) is a separate gated flip.

--- Rebased over #1246 / #1248 / #1250 (2026-07-02) ---

Rebased onto origin/main after three merged engine PRs touched servers/engine/server.py. Conflict was import-only (both sides added imports): kept main's fcntl/functools/json (from #1246 action-economy + #1248 AC-ownership helpers) AND this PR's os, alphabetized.

The tiering tail block appended cleanly after mark_climax; _apply_tool_tiering() still runs at import after every @mcp.tool registers, so it now sees main's 156 tools.

New @mcp.tool defs on main since the PR base (from the #598/#1250 coords-combat wiring), classified per this PR's own tiering rule (hot = combat/turn-loop verb the DM hits every beat; a mid-fight ToolSearch hop is the worst-failure class). All three PINNED (HOT):

- move_to_coords: grid twin of the pinned move_to_zone; die/turn-loop movement, returns opportunity-attack. HOT.
- place_combatant_at_coords: grid twin of the pinned place_combatant; combat setup verb. HOT.
- set_grid: combat-setup verb that establishes grid mode; the coords verbs are inert without it, so it is pinned to keep the grid verb set coherent. HOT.

Ratchets bumped WITH justification (3 real combat tools promoted into the core, exactly the case the ratchet-test error message says to raise for): pinned slab 63,862 -> 66,955 B (ceiling 66,000 -> 67,000); full-slab cap 118,739 -> 124,835 B (121,000 -> 125,000, grew because main added the tools regardless of tiering).

Remains DRAFT pending the Phase-3 duo A/B.

100yenadmin · 2026-07-02T09:48:44Z

Rebased over #1246 / #1248 / #1250 (still DRAFT)

Rebased feat/slab-phase1-tiering onto current origin/main (HEAD 811a5ec3, past the three merged PRs that touched servers/engine/server.py).

Conflict resolution — one hunk, import-only:

servers/engine/server.py imports: main added fcntl/functools/json (from #1246 action-economy + #1248 AC-ownership helpers), this PR added os. Kept both, alphabetized. Main's new/changed tool bodies (#1246 per-turn action gates, #1248 armor_ac_source/_derive_worn_ac_from_equipped, #1250 viewer-only) are untouched; the tiering annotations apply on top. #1250 was viewer/server.py only, so no conflict.
The tiering tail block appended cleanly after mark_climax; _apply_tool_tiering() still runs at import after every @mcp.tool registers, so it now sees main's full tool set.

New tools since PR base → classified per this PR's own tiering rule (hot = combat/turn-loop verb the DM hits every beat). The #598/#1250 coords-combat wiring added 3 @mcp.tools; all three are PINNED (HOT):

move_to_coords — grid twin of the pinned move_to_zone; die/turn-loop movement, returns opportunity-attack.
place_combatant_at_coords — grid twin of the pinned place_combatant; combat setup.
set_grid — establishes grid mode; the coords verbs are inert without it — pinned to keep the grid verb set coherent.

Ratchets bumped with justification (3 real combat tools promoted into the core — exactly the case the ratchet-test error says to raise for): pinned slab 63,862 → 66,955 B (ceiling 66,000 → 67,000); full-slab cap 118,739 → 124,835 B (121,000 → 125,000, grew because main added the tools regardless of tiering).

Verification (single-process, no xdist):

test_tool_schema_budget.py — 5 passed
affected surfaces (action-economy, AC-ownership, armor-props, grid, combat, scene-grid, schema) — 336 passed
qa/fast_gate.sh — PASS (241 deterministic, exit 0)

Remains DRAFT pending the Phase-3 duo A/B (cache_creation/read + cold-open + chance-corrected selection ≥ control). No default flag flipped (WORLDOS_ENGINE_ALWAYSLOAD stays 1 = byte-identical baseline).
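The thread does not say which statistic backs the "chance-corrected selection" check. One common chance-corrected agreement measure is Cohen's kappa, sketched here under that assumption; the per-beat labels are made up, and treating the census as the source of the expected tool per beat is also an assumption.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(expected: Sequence[str], observed: Sequence[str]) -> float:
    """Agreement between two label sequences, corrected for chance agreement."""
    if len(expected) != len(observed) or not expected:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(expected)
    # Observed agreement: fraction of positions where the labels match.
    p_o = sum(e == o for e, o in zip(expected, observed)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    e_counts, o_counts = Counter(expected), Counter(observed)
    p_e = sum(e_counts[label] * o_counts[label] for label in e_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Made-up per-beat labels: the tool expected for a beat vs the tool each arm picked.
expected = ["move_to_zone", "move_to_zone", "set_grid", "set_grid"]
control = ["move_to_zone", "move_to_zone", "set_grid", "set_grid"]
tiered = ["move_to_zone", "move_to_zone", "set_grid", "move_to_zone"]

kappa_control = cohens_kappa(expected, control)
kappa_tiered = cohens_kappa(expected, tiered)
selection_gate_passes = kappa_tiered >= kappa_control
print(kappa_control, kappa_tiered, selection_gate_passes)
```

Correcting for chance matters here because a few hot tools dominate play, so raw match rates would look high even for a model that always picked the most common tool.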

100yenadmin · 2026-07-02T15:17:04Z

Phase-3 duo A/B: FAIL — do not cut over for latency; retain as a BUDGET lever

Both arms played clean (same SHA/world/persona/6 beats, 0 failed beats, behavioral GREEN both). Flag exercised: per-tool _meta diff confirmed 72 pinned / 84 deferred under ALWAYSLOAD=0; gameplay-level ToolSearch went 1 (base) → 10 (tier) — the deferral is real and load-bearing.

Latency (duration_api_ms): cold-open 185.3s base → 192.6s tier (+3.9%, REGRESSED); s/beat 99.8 → 94.5; turns/beat 5.7 → 5.9. The deferral reintroduces exactly the ToolSearch round-trip churn the alwaysLoad pin removed (see worldos-latency-forensics: cold-open 248→176 when pinning landed). Quality: story +0.1 (noise), mech −0.2, angry-dm −0.5 (likely scene variance — low-coverage social scenes both arms — but outside the specified noise band).
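The relative changes behind these figures can be recomputed directly. The sketch uses only the numbers quoted above; `pct_change` is a hypothetical helper, and the "not worse" test is a plain comparison with no noise band applied.

```python
def pct_change(base: float, tier: float) -> float:
    """Relative change from the baseline arm to the tiered arm, in percent."""
    return (tier - base) / base * 100.0


cold_open_delta = pct_change(185.3, 192.6)   # cold-open seconds, base vs tier
per_beat_delta = pct_change(99.8, 94.5)      # seconds per beat
turns_delta = pct_change(5.7, 5.9)           # turns per beat

# Gate criterion from the PR description: cold-open must not be worse than baseline.
cold_open_not_worse = 192.6 <= 185.3

print(f"cold-open {cold_open_delta:+.1f}%")
print(f"s/beat    {per_beat_delta:+.1f}%")
print(f"turns     {turns_delta:+.1f}%")
print("cold-open gate passes:", cold_open_not_worse)
```

Seconds per beat actually improves in the tiered arm, but the gate is written against cold-open, which is the metric that regressed.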

Disposition: stays DRAFT. The tiering doesn't buy latency — it costs it. Its real value is the ~46% injected-schema reduction as a budget-relief lever: post-#1257/#1265/#1277 the slab sits at 119,319B of the ~121K ceiling (~1.7KB headroom). Recommend re-framing this PR as the pull-when-headroom-runs-out mechanism, with this measured latency cost documented as the price.

Artifacts: worldos-wt-slab-tiering/qa/transcripts/ab1012b-{base,tier}.* (latency.json + 3 lenses each).

100yenadmin mentioned this pull request Jun 18, 2026

feat(qa): latency-rollup token/cache ledger + slab A/B comparator (Phase 3 harness) #1014

Merged

100yenadmin force-pushed the feat/slab-phase1-tiering branch from 1de4f35 to 35e8624 Compare July 2, 2026 09:48

100yenadmin mentioned this pull request Jul 2, 2026

[qa] score.sh hardening: caller-provided stale OAuth token shadows the keychain fallback + retry-without-truncate leaves trailing JSON fragments #1278

Open

100yenadmin wants to merge 1 commit into main from feat/slab-phase1-tiering
