Remediation pipeline token efficiency: 22.8M cache_read for ~500 line PR #783 #792

@Trecek

Description

Problem

PR #783 ("Rectify: CLI Startup Update Prompt — Terminal Freeze Immunity") produced 425 additions / 101 deletions but consumed 22.8M cache_read tokens across child sessions (reported in the PR's Token Usage Summary) plus an additional ~4.7M cache_read from the orchestrator session — for a true pipeline cost of ~27.5M cache_read tokens. The pipeline ran 12 child sessions over 2h 35m, making ~500+ API calls with average context sizes of 60-80K tokens per call.

This is not a bug — the pipeline completed successfully. But the token cost is disproportionate to the output size, and the investigation identifies specific structural inefficiencies that compound multiplicatively.

Investigation Summary

The cost traces to three compounding structural factors, each of which is independently fixable:

Factor 1: Non-Tool API Turns in Headless Sessions (implement session — 9.5M cache_read)

The implementation session (9768e31d) made 183 API calls for only 111 tool invocations. The remaining 72 calls (39%) produced no tool calls:

  • 61 were text-only narration turns — the model announces intent ("Let me now implement Phase 2a...") in one API call, then executes the tool in the next
  • 11 were extended-thinking-only turns — model reasoning with no visible output or tool use

Key statistics:

  • Of 111 tool-use turns, 62 (56%) were preceded by a non-tool turn (narration or thinking)
  • The remaining 49 (44%) directly followed another tool-use turn — no narration in between
  • Zero turns contained both text and tool_use in the same message — the model consistently separates them into two API calls

Each non-tool turn sends the full context (~85K avg) for zero work output. At 72 wasted turns × ~85K tokens = ~6.1M cache_read overhead from narration/thinking alone.
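The overhead arithmetic above can be checked directly (figures taken from this report):

```python
# Figures from the implement-session analysis above.
wasted_turns = 72           # text-only narration + thinking-only turns
avg_context_tokens = 85_000  # average context sent per API call (~85K)

overhead = wasted_turns * avg_context_tokens
print(f"~{overhead / 1e6:.1f}M cache_read from non-tool turns")
```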

Additionally, Phase 2e of the plan migrated 8 input() call sites with ~9 API calls per site. The pattern for each site: Read → narrate → Edit → narrate → Grep (verify) → Read again → Edit fixup. The same files were read repeatedly: app.py 7 times, test_update_checks.py 7 times.

Two Agent subagents spawned early in the session returned 18K + 21K character reports that were injected into the persistent conversation context and stayed there for all subsequent 165+ API calls, adding ~3-4K tokens of permanent overhead per call (~500K+ cache_read over the session lifetime).

Factor 2: Test Suite Execution Overhead (assess step — 8.0M cache_read, 2 sessions)

Two resolve-failures sessions ran a combined 9 test suite invocations (each full run takes 5-7 minutes):

Session 1 (096e0115, 162 API calls, ~32 min):

| Run | Result | Notes |
|-----|--------|-------|
| 1 | ABORTED | install-worktree failure: uv venv: virtual environment already exists — 0 tests ran |
| 2 | ABORTED | Same install-worktree collision — 0 tests ran |
| 3 | FAIL — 60 failures | First real run. All failures from timed_prompt()'s select.select(sys.stdin) crashing under pytest's captured stdin (no fileno()) |
| 4 | FAIL — 4 failures | 3 subset-gate tests + sigterm flaky test |
| 5 | FAIL — 1 failure | Pre-existing sigterm flaky test only |

Root cause of runs 1-2: The worktree had 22 packages newer than the lockfile. install-worktree's uv venv .venv (without --clear) failed because a venv already existed. Fix: uv sync --all-extras downgraded packages to match lockfile. Time wasted: 65 seconds, but 2 full API round-trips with context.

Session 2 (e2495448, 102 API calls / 54 deduplicated, ~45 min):

| Run | Result | Notes |
|-----|--------|-------|
| 1-3 | FAIL — 1 each | Sigterm test failing under -n 4 parallel load (passes in isolation) |
| 4 | PASS — 7750 passed | Fix: increased subprocess startup deadline from 5s to 15s |

This session had significant polling overhead: 22 API calls were spent monitoring a single test run via repeated `sleep 60` / `tail` / `ps aux | grep pytest` cycles instead of a single background wait.

Factor 3: Pipeline Architecture — Per-Session Startup and Orchestrator Cost

Each of the 12 child sessions starts from scratch with:

  • System prompt + CLAUDE.md + skill docs: ~12K tokens (cached but counted in cache_read every call)
  • Plan file re-read: ~21K chars
  • Full context accumulation from zero on every session

The orchestrator session (812fdbe0) ran 5 hours managing orders 774 and 775 concurrently:

  • 124 API calls, 8.7M total cache_read, 41 MCP tool calls
  • 98 of 124 turns (79%) were shared context (thinking/processing between dispatches)
  • Attribution: ~54% to order 774 (~4.7M), ~46% to order 775
  • This cost is not included in the PR's Token Usage Summary — only child session tokens are reported

Token Accounting Details

How the 22.8M Figure Is Computed

The PR's Token Usage Summary is generated by token_summary_hook.py:

  1. Reads sessions.jsonl index, filters by order_id
  2. For each session, reads token_usage.json from the session diagnostics directory
  3. These files are written by session_log.py::flush_session_log() using extract_token_usage(stdout) from session.py
  4. extract_token_usage prefers the Claude CLI's result record (a single session-level aggregate) over summing per-turn assistant records
  5. No extended-thinking deduplication exists anywhere in AutoSkillit — the 22.8M vs. raw per-turn sum (~45.4M) difference comes entirely from the Claude CLI's own result record being roughly half the per-turn sum
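The preference logic described in steps 4-5 can be sketched as follows. This is a minimal illustration only; the record shapes and field names are assumptions, not AutoSkillit's or the Claude CLI's actual schema:

```python
import json

def aggregate_cache_read(stdout_lines):
    """Illustrative sketch: prefer the CLI's single session-level `result`
    record over summing per-turn `assistant` records. The raw per-turn sum
    double-counts (the ~45.4M vs 22.8M gap described above)."""
    per_turn_sum = 0
    result_record = None
    for line in stdout_lines:
        rec = json.loads(line)
        if rec.get("type") == "result":
            result_record = rec.get("usage", {})
        elif rec.get("type") == "assistant":
            per_turn_sum += rec.get("usage", {}).get("cache_read_input_tokens", 0)
    if result_record is not None:
        return result_record.get("cache_read_input_tokens", 0)
    return per_turn_sum  # fallback: raw per-turn sum
```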

Per-Step Breakdown (from PR Token Usage Summary)

| Step | uncached | output | cache_read | cache_write | count | time |
|------|----------|--------|------------|-------------|-------|------|
| investigate | 5.5k | 10.7k | 870.9k | 58.4k | 1 | 6m 18s |
| rectify | 28 | 13.5k | 1.0M | 63.7k | 1 | 6m 44s |
| review | 3.7k | 7.7k | 313.4k | 37.9k | 1 | 5m 7s |
| dry_walkthrough | 1.9k | 14.9k | 1.9M | 62.9k | 1 | 12m 57s |
| implement | 119 | 46.6k | 9.5M | 130.7k | 1 | 22m 55s |
| assess | 135 | 47.4k | 8.0M | 164.2k | 2 | 1h 24m |
| prepare_pr | 9 | 7.0k | 139.3k | 31.2k | 1 | 2m 38s |
| run_arch_lenses | 3.4k | 20.1k | 720.7k | 108.2k | 3 | 10m 41s |
| compose_pr | 9 | 12.0k | 200.5k | 40.6k | 1 | 3m 42s |
| Total | 14.8k | 179.8k | 22.8M | 697.8k | 12 | 2h 35m |

Full Session Inventory (Order 774)

| Session ID | Step | API Calls | Cache Read | Output | Duration | Notes |
|------------|------|-----------|------------|--------|----------|-------|
| 14b6a3bc | investigate | 38 | 870.9K | 10.7K | 6m 18s | |
| 6202fd96 | rectify | 38 | 1.0M | 13.5K | 6m 44s | |
| 504ea192 | review-approach | 24 | 313.4K | 7.7K | 5m 7s | |
| fd6d3079 | dry-walkthrough | 65 | 1.9M | 14.9K | 12m 57s | |
| 9768e31d | implement | 183 | 9.5M | 46.6K | 22m 55s | 72 non-tool turns, 2 Agent subagents (18K+21K char reports in context) |
| 096e0115 | resolve-failures #1 | 162 | ~5.1M | 33.9K | 32m | 5 test runs (2 aborted on venv), 60→4→1 failures |
| e2495448 | resolve-failures #2 | 102 | ~2.9M | 13.4K | ~45m | 4 test runs, polling overhead, sigterm fix |
| a53d2b38 | prepare-pr | 18 | 139.3K | 7.0K | 2m 38s | |
| 79e2d7d8 | arch-lens-process-flow | 18 | ~240K | 5.5K | ~3m | |
| 0f1f76be | arch-lens-state-lifecycle | 17 | ~245K | 7.0K | ~3m | |
| a2540a25 | arch-lens-module-dep | 21 | ~300K | 7.6K | ~4m | |
| 7651e1a4 | compose-pr | 20 | 200.5K | 12.0K | 3m 42s | |
| cdce8960 | review-pr | 51 | 2.8M | 24.1K | | Post-PR creation, excluded from summary |

Orchestrator (Excluded from PR Summary)

| Metric | Value |
|--------|-------|
| Session | 812fdbe0-ba63-4df1-960d-bda6cf3ca374 |
| Duration | 5h (23:25 → 04:28 UTC) |
| API calls | 124 |
| Total cache_read | 8.7M |
| Attributable to order 774 | ~4.7M (54%) |
| MCP tool calls | 41 (26 run_skill + run_cmd/test_check) |
| Context growth | 15K → 98K tokens over session |

The orchestrator also managed order 775 ("MCP server startup race — open_kitchen 'No such tool available' on first call") concurrently. Order 775 ran 12 sessions but never completed a PR — it was still in resolve-failures when the orchestrator session ended.

Crashed Session Records (41 total)

The session diagnostics system recorded 41 "crashed" sessions during this pipeline run. Investigation confirms none represent lost work:

11 PID-reuse duplicates: The process monitor re-opened tracking records for PIDs belonging to already-completing successful sessions. All show peak_rss_kb=1920 (monitor wrapper RSS), empty skill_command, no token usage.

30 orphaned subagent processes (10 triplet groups): Each group has a consistent fingerprint:

| Role | Peak RSS | Description |
|------|----------|-------------|
| Primary Node.js | ~180-218K KB | Claude Code process, no SIGTERM handler, fd_count=21, 2 threads |
| Monitor wrapper | 1664 KB | Python wrapper, sleeping (hrtimer_nanosleep), 1 thread |
| Secondary Node.js | ~171-173K KB | Spawned subagent child, has SIGTERM handler (bit 15 in sig_cgt), fd_count=18 |

All 30 have: end_ts="", skill_command="", cwd="", write_call_count=0, no token usage. They disappear within 4-6 snapshots (~200-300ms).

The groups cluster around resolve-failures session endings: 096e0115 generated 3 groups, e2495448 generated 4 groups, cdce8960 generated 1 group. These sessions spawned the most Agent subagents internally.

Only 1 anomaly file exists across all 41 crashed sessions: crashed_23417 flags rss_growth (512→1664 KB) — normal Python startup allocation of the monitor wrapper, not meaningful.

Pipeline Data Flow

Orchestrator (812fdbe0, 5h, 8.7M cache_read, 124 API calls)
├── Order 774: "CLI startup update prompt terminal freeze" → PR #783
│   ├── investigate (14b6a3bc) → rectify (6202fd96) → review-approach (504ea192)
│   ├── dry-walkthrough (fd6d3079, 65 API calls)
│   ├── implement-worktree-no-merge (9768e31d, 183 API calls, 9.5M cache_read)
│   │   ├── Agent subagent 1: explore _update_checks.py → 18K char report → orphaned
│   │   └── Agent subagent 2: survey input() calls → 21K char report → orphaned
│   ├── resolve-failures #1 (096e0115, 162 API calls, 5 test runs)
│   │   ├── Runs 1-2: aborted (install-worktree venv collision, 22 packages out of sync)
│   │   ├── Run 3: 60 failures (select.select stdin crash)
│   │   ├── Run 4: 4 failures → Run 5: 1 failure (pre-existing sigterm)
│   │   └── Agent subagent: categorize failures → 5K report → orphaned
│   ├── resolve-failures #2 (e2495448, 102 API calls, 4 test runs)
│   │   ├── Runs 1-3: sigterm failing under -n 4 load
│   │   └── Run 4: PASS (fix: 5s→15s startup deadline)
│   ├── prepare-pr → arch-lenses ×3 (parallel) → compose-pr
│   └── review-pr (cdce8960, post-creation, 51 API calls)
│
└── Order 775: "MCP server startup race" (parallel, 12 sessions, NO PR produced)
    ├── investigate → rectify → review-approach → dry-walkthrough (part A)
    ├── implement (part A) → resolve-failures (part A passes)
    ├── dry-walkthrough (part B) → implement (part B)
    └── resolve-failures (part B) ×4 — still running when orchestrator ended

Recommendations (Ordered by Impact)

High Impact — Reduce API Call Count

R1. Suppress narration in headless sessions. Add to the headless session system prompt: "Proceed directly to tool use without announcing intent. Combine reasoning and tool calls in single API responses. Never emit a text-only turn before a tool call." The implement session's 72 non-tool turns are the single largest waste factor. Reducing them from 72 to ~10 (extended thinking is harder to suppress) would save ~5-6M cache_read on the implement step alone. Estimated impact: -5M cache_read per pipeline run.

R2. Batch file edits instead of per-call-site granularity. Phase 2e made ~9 API calls per input() call site migration (8 sites = ~72 calls). A single Write replacing the whole file, or grouped Edits, would reduce this to ~16 calls. Estimated impact: -3M cache_read (fewer calls with the same ~85K context).
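The grouped-edit shape R2 describes can be sketched as follows. This is a minimal illustration, not the actual Edit/Write tool API; the `timed_prompt` target comes from this PR, but the edit pairs are hypothetical:

```python
def apply_grouped_edits(source: str, edits: list[tuple[str, str]]) -> str:
    """Apply every call-site migration in one pass over the file, so a single
    Write-style operation replaces ~9 Read/Edit/verify round-trips per site."""
    for old, new in edits:
        assert old in source, f"stale edit target: {old!r}"
        source = source.replace(old, new, 1)
    return source

# One batched pass migrates every input() site in a single operation:
migrated = apply_grouped_edits(
    "a = input('x')\nb = input('y')\n",
    [("input('x')", "timed_prompt('x')"), ("input('y')", "timed_prompt('y')")],
)
```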

Medium Impact — Reduce Context Size Per Call

R3. Extract Agent subagent reports to files instead of injecting into context. The two 18K+21K char reports stayed in context for 165+ turns. Instead: Write reports to .autoskillit/temp/, inject only a 1-line summary ("Report at /path/to/file.md"), and Read the file only when its content is needed. Estimated impact: -500K cache_read (removes ~4K tokens from 165+ calls).
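A minimal sketch of R3, with assumed paths and helper names (not AutoSkillit's actual API):

```python
import tempfile
from pathlib import Path

def stash_report(report_text: str, name: str, temp_dir: Path) -> str:
    """Persist the full subagent report to disk and hand back only a one-line
    pointer, so ~20K chars never enter the persistent conversation context.
    (Illustrative; the .autoskillit/temp location is taken from R3 above.)"""
    temp_dir.mkdir(parents=True, exist_ok=True)
    path = temp_dir / f"{name}.md"
    path.write_text(report_text)
    return f"Report ({len(report_text)} chars) at {path}"

# The session would carry only this summary line, and Read the file on demand:
summary = stash_report("…exploration findings…", "explore-update-checks",
                       Path(tempfile.mkdtemp()) / ".autoskillit" / "temp")
```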

R4. Truncate test output injected into context. Each test run injects ~10K chars of output. After 5 test runs, that's 50K chars of historical test output in context that is rarely re-consulted. Summarize prior run results instead of carrying full output. Estimated impact: -500K cache_read.
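One possible shape for R4, as a sketch (the filtering heuristics are assumptions, not an existing implementation):

```python
def summarize_test_output(raw: str, keep_tail: int = 15) -> str:
    """Keep only FAILED lines plus the final summary tail of a pytest
    transcript, so prior runs carry ~15 lines in context instead of ~10K chars."""
    lines = raw.splitlines()
    failures = [ln for ln in lines if ln.startswith("FAILED")]
    tail = lines[-keep_tail:]
    kept = failures + [ln for ln in tail if ln not in failures]
    return "\n".join(kept)
```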

Medium Impact — Reduce Test Run Count

R5. Pre-validate worktree environment before running tests. The install-worktree task's uv venv .venv (without --clear) caused 2 wasted test runs when the worktree had package version drift. Fix: add --clear to the uv venv call in install-worktree, or add a pre-test environment check step that runs uv sync --check and auto-fixes before the first task test-check. Estimated impact: -200K cache_read (2 fewer API round-trips with full context).

R6. Run targeted test subsets before full suite. Before running task test-check (5-7 min), run only the specific failing test files first (10-20 seconds). This would eliminate full-suite re-runs where only 1-4 tests fail. Estimated impact: -1M cache_read (fewer 5-7 min wait cycles requiring polling turns).

R7. Replace test polling with single background wait. Session 2 spent 22 API calls polling a test run via sleep 60 + tail + ps. Use run_in_background with a single Bash call instead. Estimated impact: -1M cache_read (20 fewer polling calls at ~50K context each).
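In Python terms, the single blocking wait R7 proposes looks like this sketch (the command is a stand-in for the real `task test-check` invocation):

```python
import subprocess
import sys

# One blocking call replaces the sleep 60 / tail / ps polling loop: the
# session makes zero API turns while the suite runs, then resumes once.
proc = subprocess.run(
    [sys.executable, "-c", "print('7750 passed')"],  # stand-in test command
    capture_output=True, text=True, timeout=600,     # bound the single wait
)
summary = proc.stdout.strip()  # only the final result re-enters the context
```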

Lower Impact — Improve Observability

R8. Include orchestrator token usage in PR Token Usage Summary. The orchestrator's ~4.7M cache_read for order 774 is invisible in the current summary. Add an "orchestrator" row to the token table showing the attributed share. Without this, the reported 22.8M understates the true 27.5M cost by 17%.

R9. Clean up orphaned subagent processes. 30 orphaned Claude Code subprocesses across 10 triplet groups indicate CLAUDE_CODE_EXIT_AFTER_STOP_DELAY is not propagating to all Agent invocations within headless sessions. While these don't consume API tokens (they're killed before making calls), they pollute session diagnostics with 41 false-positive crash records.

Estimated Combined Impact

If R1-R7 were all implemented:

  • Implement step: 183 → ~100 API calls (R1+R2), smaller context per call (R3+R4) → ~4M vs 9.5M
  • Assess step: fewer test runs (R5+R6), less polling (R7) → ~4M vs 8.0M
  • Other steps: minimal change
  • Estimated total: ~12-14M cache_read vs. current 22.8M (40-47% reduction)

The orchestrator cost (~4.7M for order 774) is harder to reduce without architectural changes to multi-order session management.
