Problem
PR #783 ("Rectify: CLI Startup Update Prompt — Terminal Freeze Immunity") produced 425 additions / 101 deletions but consumed 22.8M cache_read tokens across child sessions (reported in the PR's Token Usage Summary) plus an additional ~4.7M cache_read from the orchestrator session — for a true pipeline cost of ~27.5M cache_read tokens. The pipeline ran 12 child sessions over 2h 35m, making ~500+ API calls with average context sizes of 60-80K tokens per call.
This is not a bug — the pipeline completed successfully. But the token cost is disproportionate to the output size, and the investigation identifies specific structural inefficiencies that compound multiplicatively.
Investigation Summary
The cost traces to three compounding structural factors, each of which is independently fixable:
Factor 1: Non-Tool API Turns in Headless Sessions (implement session — 9.5M cache_read)
The implementation session (9768e31d) made 183 API calls for only 111 tool invocations. The remaining 72 calls (39%) produced no tool calls:
- 61 were text-only narration turns — the model announces intent ("Let me now implement Phase 2a...") in one API call, then executes the tool in the next
- 11 were extended-thinking-only turns — model reasoning with no visible output or tool use
Key statistics:
- Of 111 tool-use turns, 62 (56%) were preceded by a non-tool turn (narration or thinking)
- The remaining 48 (43%) directly followed another tool-use turn — no narration in between
- Zero turns contained both text and tool_use in the same message — the model consistently separates them into two API calls
Each non-tool turn sends the full context (~85K avg) for zero work output. At 72 wasted turns × ~85K tokens = ~6.1M cache_read overhead from narration/thinking alone.
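The overhead estimate is simple arithmetic on the figures above; a quick sanity check:

```python
# Sanity-check the narration/thinking overhead estimate from the session data above.
non_tool_turns = 72          # text-only narration turns + thinking-only turns
avg_context_tokens = 85_000  # average context size per API call in this session

wasted_cache_read = non_tool_turns * avg_context_tokens
print(f"~{wasted_cache_read / 1e6:.1f}M cache_read")  # → ~6.1M cache_read
```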
Additionally, Phase 2e of the plan migrated 8 input() call sites with ~9 API calls per site. The pattern for each site: Read → narrate → Edit → narrate → Grep (verify) → Read again → Edit fixup. The same files were read repeatedly: app.py 7 times, test_update_checks.py 7 times.
Two Agent subagents spawned early in the session returned 18K + 21K character reports that were injected into the persistent conversation context and stayed there for all subsequent 165+ API calls, adding ~3-4K tokens of permanent overhead per call (~500K+ cache_read over the session lifetime).
Factor 2: Test Suite Execution Overhead (assess step — 8.0M cache_read, 2 sessions)
Two resolve-failures sessions ran a combined 9 test suite invocations (each full run takes 5-7 minutes):
Session 1 (096e0115, 162 API calls, ~32 min):
| Run | Result | Notes |
|---|---|---|
| 1 | ABORTED | `install-worktree` failure: `uv venv`: virtual environment already exists — 0 tests ran |
| 2 | ABORTED | Same `install-worktree` collision — 0 tests ran |
| 3 | FAIL — 60 failures | First real run. All failures from `timed_prompt()`'s `select.select(sys.stdin)` crashing under pytest's captured stdin (no `fileno()`) |
| 4 | FAIL — 4 failures | 3 subset-gate tests + sigterm flaky test |
| 5 | FAIL — 1 failure | Pre-existing sigterm flaky test only |
Root cause of runs 1-2: the worktree had 22 packages newer than the lockfile, and `install-worktree`'s `uv venv .venv` (run without `--clear`) aborted because a venv already existed. Fix: `uv sync --all-extras` downgraded packages to match the lockfile. The wall-clock loss was only 65 seconds, but it cost 2 full API round-trips carrying complete context.
Session 2 (e2495448, 102 API calls / 54 deduplicated, ~45 min):
| Run | Result | Notes |
|---|---|---|
| 1-3 | FAIL — 1 each | Sigterm test failing under `-n 4` parallel load (passes in isolation) |
| 4 | PASS — 7750 passed | Fix: increased subprocess startup deadline from 5s to 15s |
This session had significant polling overhead: 22 API calls were spent monitoring a single test run via repeated sleep 60 → tail → ps aux | grep pytest cycles instead of a single background wait.
Factor 3: Pipeline Architecture — Per-Session Startup and Orchestrator Cost
Each of the 12 child sessions starts from scratch with:
- System prompt + CLAUDE.md + skill docs: ~12K tokens (cached but counted in cache_read every call)
- Plan file re-read: ~21K chars
- Full context accumulation from zero on every session
The orchestrator session (812fdbe0) ran 5 hours managing orders 774 and 775 concurrently:
- 124 API calls, 8.7M total cache_read, 41 MCP tool calls
- 98 of 124 turns (79%) were shared context (thinking/processing between dispatches)
- Attribution: ~54% to order 774 (~4.7M), ~46% to order 775
- This cost is not included in the PR's Token Usage Summary — only child session tokens are reported
Token Accounting Details
How the 22.8M Figure Is Computed
The PR's Token Usage Summary is generated by `token_summary_hook.py`:
- Reads the `sessions.jsonl` index, filters by `order_id`
- For each session, reads `token_usage.json` from the session diagnostics directory
- These files are written by `session_log.py::flush_session_log()` using `extract_token_usage(stdout)` from `session.py`
- `extract_token_usage` prefers the Claude CLI's `result` record (a single session-level aggregate) over summing per-turn assistant records
- No extended-thinking deduplication exists anywhere in AutoSkillit — the 22.8M vs. raw per-turn sum (~45.4M) difference comes entirely from the Claude CLI's own `result` record being roughly half the per-turn sum
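The preference logic described above can be sketched as follows. The function name `extract_token_usage` and the record-preference behavior come from the investigation; the record shapes and parsing details are a hypothetical reconstruction, not the actual AutoSkillit implementation:

```python
import json

def extract_token_usage(stdout: str) -> dict:
    """Hypothetical sketch: prefer the CLI's single session-level `result`
    record over summing per-turn `assistant` records (the per-turn sum runs
    roughly 2x the aggregate)."""
    records = [json.loads(line) for line in stdout.splitlines() if line.strip()]

    # Preferred path: the session-level aggregate emitted once by the CLI.
    for rec in records:
        if rec.get("type") == "result" and "usage" in rec:
            return rec["usage"]

    # Fallback: sum per-turn assistant usage records.
    totals: dict = {}
    for rec in records:
        if rec.get("type") == "assistant":
            for key, value in rec.get("message", {}).get("usage", {}).items():
                totals[key] = totals.get(key, 0) + value
    return totals
```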
Per-Step Breakdown (from PR Token Usage Summary)
| Step | uncached | output | cache_read | cache_write | count | time |
|---|---|---|---|---|---|---|
| investigate | 5.5k | 10.7k | 870.9k | 58.4k | 1 | 6m 18s |
| rectify | 28 | 13.5k | 1.0M | 63.7k | 1 | 6m 44s |
| review | 3.7k | 7.7k | 313.4k | 37.9k | 1 | 5m 7s |
| dry_walkthrough | 1.9k | 14.9k | 1.9M | 62.9k | 1 | 12m 57s |
| implement | 119 | 46.6k | 9.5M | 130.7k | 1 | 22m 55s |
| assess | 135 | 47.4k | 8.0M | 164.2k | 2 | 1h 24m |
| prepare_pr | 9 | 7.0k | 139.3k | 31.2k | 1 | 2m 38s |
| run_arch_lenses | 3.4k | 20.1k | 720.7k | 108.2k | 3 | 10m 41s |
| compose_pr | 9 | 12.0k | 200.5k | 40.6k | 1 | 3m 42s |
| Total | 14.8k | 179.8k | 22.8M | 697.8k | 12 | 2h 35m |
Full Session Inventory (Order 774)
| Session ID | Step | API Calls | Cache Read | Output | Duration | Notes |
|---|---|---|---|---|---|---|
| 14b6a3bc | investigate | 38 | 870.9K | 10.7K | 6m 18s | |
| 6202fd96 | rectify | 38 | 1.0M | 13.5K | 6m 44s | |
| 504ea192 | review-approach | 24 | 313.4K | 7.7K | 5m 7s | |
| fd6d3079 | dry-walkthrough | 65 | 1.9M | 14.9K | 12m 57s | |
| 9768e31d | implement | 183 | 9.5M | 46.6K | 22m 55s | 72 non-tool turns, 2 Agent subagents (18K+21K char reports in context) |
| 096e0115 | resolve-failures #1 | 162 | ~5.1M | 33.9K | 32m | 5 test runs (2 aborted on venv), 60→4→1 failures |
| e2495448 | resolve-failures #2 | 102 | ~2.9M | 13.4K | ~45m | 4 test runs, polling overhead, sigterm fix |
| a53d2b38 | prepare-pr | 18 | 139.3K | 7.0K | 2m 38s | |
| 79e2d7d8 | arch-lens-process-flow | 18 | ~240K | 5.5K | ~3m | |
| 0f1f76be | arch-lens-state-lifecycle | 17 | ~245K | 7.0K | ~3m | |
| a2540a25 | arch-lens-module-dep | 21 | ~300K | 7.6K | ~4m | |
| 7651e1a4 | compose-pr | 20 | 200.5K | 12.0K | 3m 42s | |
| cdce8960 | review-pr | 51 | 2.8M | 24.1K | — | Post-PR creation, excluded from summary |
Orchestrator (Excluded from PR Summary)
| Metric | Value |
|---|---|
| Session | 812fdbe0-ba63-4df1-960d-bda6cf3ca374 |
| Duration | 5h (23:25 → 04:28 UTC) |
| API calls | 124 |
| Total cache_read | 8.7M |
| Attributable to order 774 | ~4.7M (54%) |
| MCP tool calls | 41 (26 run_skill + run_cmd/test_check) |
| Context growth | 15K → 98K tokens over session |
The orchestrator also managed order 775 ("MCP server startup race — open_kitchen 'No such tool available' on first call") concurrently. Order 775 ran 12 sessions but never completed a PR — it was still in resolve-failures when the orchestrator session ended.
Crashed Session Records (41 total)
The session diagnostics system recorded 41 "crashed" sessions during this pipeline run. Investigation confirms none represent lost work:
11 PID-reuse duplicates: The process monitor re-opened tracking records for PIDs belonging to already-completing successful sessions. All show peak_rss_kb=1920 (monitor wrapper RSS), empty skill_command, no token usage.
30 orphaned subagent processes (10 triplet groups): Each group has a consistent fingerprint:
| Role | Peak RSS | Description |
|---|---|---|
| Primary Node.js | ~180-218K KB | Claude Code process, no SIGTERM handler, fd_count=21, 2 threads |
| Monitor wrapper | 1664 KB | Python wrapper, sleeping (hrtimer_nanosleep), 1 thread |
| Secondary Node.js | ~171-173K KB | Spawned subagent child, has SIGTERM handler (bit 15 in sig_cgt), fd_count=18 |
All 30 have: end_ts="", skill_command="", cwd="", write_call_count=0, no token usage. They disappear within 4-6 snapshots (~200-300ms).
The groups cluster around resolve-failures session endings: 096e0115 generated 3 groups, e2495448 generated 4 groups, cdce8960 generated 1 group. These sessions spawned the most Agent subagents internally.
Only 1 anomaly file exists across all 41 crashed sessions: crashed_23417 flags rss_growth (512→1664 KB) — normal Python startup allocation of the monitor wrapper, not meaningful.
Pipeline Data Flow
```
Orchestrator (812fdbe0, 5h, 8.7M cache_read, 124 API calls)
├── Order 774: "CLI startup update prompt terminal freeze" → PR #783
│   ├── investigate (14b6a3bc) → rectify (6202fd96) → review-approach (504ea192)
│   ├── dry-walkthrough (fd6d3079, 65 API calls)
│   ├── implement-worktree-no-merge (9768e31d, 183 API calls, 9.5M cache_read)
│   │   ├── Agent subagent 1: explore _update_checks.py → 18K char report → orphaned
│   │   └── Agent subagent 2: survey input() calls → 21K char report → orphaned
│   ├── resolve-failures #1 (096e0115, 162 API calls, 5 test runs)
│   │   ├── Runs 1-2: aborted (install-worktree venv collision, 22 packages out of sync)
│   │   ├── Run 3: 60 failures (select.select stdin crash)
│   │   ├── Run 4: 4 failures → Run 5: 1 failure (pre-existing sigterm)
│   │   └── Agent subagent: categorize failures → 5K report → orphaned
│   ├── resolve-failures #2 (e2495448, 102 API calls, 4 test runs)
│   │   ├── Runs 1-3: sigterm failing under -n 4 load
│   │   └── Run 4: PASS (fix: 5s→15s startup deadline)
│   ├── prepare-pr → arch-lenses ×3 (parallel) → compose-pr
│   └── review-pr (cdce8960, post-creation, 51 API calls)
│
└── Order 775: "MCP server startup race" (parallel, 12 sessions, NO PR produced)
    ├── investigate → rectify → review-approach → dry-walkthrough (part A)
    ├── implement (part A) → resolve-failures (part A passes)
    ├── dry-walkthrough (part B) → implement (part B)
    └── resolve-failures (part B) ×4 — still running when orchestrator ended
```
Recommendations (Ordered by Impact)
High Impact — Reduce API Call Count
R1. Suppress narration in headless sessions. Add to the headless session system prompt: "Proceed directly to tool use without announcing intent. Combine reasoning and tool calls in single API responses. Never emit a text-only turn before a tool call." The implement session's 72 non-tool turns are the single largest waste factor. Reducing them from 72 to ~10 (extended thinking is harder to suppress) would save ~5-6M cache_read on the implement step alone. Estimated impact: -5M cache_read per pipeline run.
R2. Batch file edits instead of per-call-site granularity. Phase 2e made ~9 API calls per input() call site migration (8 sites = ~72 calls). A single Write replacing the whole file, or grouped Edits, would reduce this to ~16 calls. Estimated impact: -3M cache_read (fewer calls with the same ~85K context).
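Outside the model loop, R2's batching amounts to collecting all per-site replacements for a file and applying them in one read-modify-write cycle instead of one Edit round-trip per site. A minimal sketch (the helper name and replacement pairs are illustrative):

```python
from pathlib import Path

def apply_batched_edits(path: Path, replacements: list[tuple[str, str]]) -> int:
    """Apply every (old, new) pair in a single read-modify-write cycle.

    Returns the number of substitutions made, so the caller can verify
    the expected site count without a separate Grep pass.
    """
    text = path.read_text()
    count = 0
    for old, new in replacements:
        count += text.count(old)
        text = text.replace(old, new)
    path.write_text(text)
    return count
```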
Medium Impact — Reduce Context Size Per Call
R3. Extract Agent subagent reports to files instead of injecting into context. The two 18K+21K char reports stayed in context for 165+ turns. Instead: Write reports to .autoskillit/temp/, inject only a 1-line summary ("Report at /path/to/file.md"), and Read the file only when its content is needed. Estimated impact: -500K cache_read (removes ~4K tokens from 165+ calls).
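The file-indirection pattern in R3 can be sketched as follows. The `.autoskillit/temp/` directory comes from the recommendation above; the helper name and stub format are hypothetical:

```python
import hashlib
from pathlib import Path

def stash_subagent_report(report: str,
                          report_dir: Path = Path(".autoskillit/temp")) -> str:
    """Write a long subagent report to disk and return a one-line stub.

    Only the stub enters the persistent conversation context; the full
    text is read back from disk only when actually needed.
    """
    report_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(report.encode()).hexdigest()[:12] + ".md"
    path = report_dir / name
    path.write_text(report)
    first_line = report.splitlines()[0] if report else ""
    return f"[{len(report)} char report: {first_line[:80]} — full text at {path}]"
```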
R4. Truncate test output injected into context. Each test run injects ~10K chars of output. After 5 test runs, that's 50K chars of historical test output in context that is rarely re-consulted. Summarize prior run results instead of carrying full output. Estimated impact: -500K cache_read.
Medium Impact — Reduce Test Run Count
R5. Pre-validate worktree environment before running tests. The install-worktree task's uv venv .venv (without --clear) caused 2 wasted test runs when the worktree had package version drift. Fix: add --clear to the uv venv call in install-worktree, or add a pre-test environment check step that runs uv sync --check and auto-fixes before the first task test-check. Estimated impact: -200K cache_read (2 fewer API round-trips with full context).
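R5 could be a small pre-flight step in the task runner. The `uv sync --check`, `uv venv --clear`, and `uv sync --all-extras` invocations come from the recommendation above; the check-then-repair policy and helper names are an assumed sketch, not the actual install-worktree task:

```python
import subprocess

def repair_commands(check_returncode: int) -> list[list[str]]:
    """Commands to restore the worktree env when `uv sync --check` fails.

    The repair policy (recreate the venv, then sync back to the lockfile)
    is an assumption based on the failure mode described above.
    """
    if check_returncode == 0:
        return []  # environment already matches the lockfile
    return [
        ["uv", "venv", ".venv", "--clear"],  # --clear avoids the "already exists" abort
        ["uv", "sync", "--all-extras"],      # pin packages back to lockfile versions
    ]

def ensure_worktree_env(worktree: str) -> None:
    """Pre-flight step to run before the first `task test-check`."""
    check = subprocess.run(["uv", "sync", "--check"], cwd=worktree, capture_output=True)
    for cmd in repair_commands(check.returncode):
        subprocess.run(cmd, cwd=worktree, check=True)
```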
R6. Run targeted test subsets before full suite. Before running task test-check (5-7 min), run only the specific failing test files first (10-20 seconds). This would eliminate full-suite re-runs where only 1-4 tests fail. Estimated impact: -1M cache_read (fewer 5-7 min wait cycles requiring polling turns).
R7. Replace test polling with single background wait. Session 2 spent 22 API calls polling a test run via sleep 60 + tail + ps. Use run_in_background with a single Bash call instead. Estimated impact: -1M cache_read (20 fewer polling calls at ~50K context each).
Lower Impact — Improve Observability
R8. Include orchestrator token usage in PR Token Usage Summary. The orchestrator's ~4.7M cache_read for order 774 is invisible in the current summary. Add an "orchestrator" row to the token table showing the attributed share. Without this, the reported 22.8M understates the true 27.5M cost by 17%.
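R8 amounts to appending one attributed row before the total is rendered. A sketch with hypothetical row dicts (the 54% attribution share comes from the orchestrator analysis above; the row schema is illustrative, not the hook's actual data model):

```python
def add_orchestrator_row(rows: list[dict], orch_usage: dict, share: float) -> list[dict]:
    """Append the orchestrator's attributed share as a step row and
    recompute the total, so the summary reflects true pipeline cost."""
    attributed = {k: round(v * share) for k, v in orch_usage.items()}

    # Drop the stale total, add the orchestrator row, then re-total.
    out = [r for r in rows if r.get("step") != "Total"]
    out.append({"step": "orchestrator (attributed)", **attributed})
    total: dict = {"step": "Total"}
    for row in out:
        for k, v in row.items():
            if k != "step":
                total[k] = total.get(k, 0) + v
    return out + [total]
```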
R9. Clean up orphaned subagent processes. 30 orphaned Claude Code subprocesses across 10 triplet groups indicate CLAUDE_CODE_EXIT_AFTER_STOP_DELAY is not propagating to all Agent invocations within headless sessions. While these don't consume API tokens (they're killed before making calls), they pollute session diagnostics with 41 false-positive crash records.
Estimated Combined Impact
If R1-R7 were all implemented:
- Implement step: 183 → ~100 API calls (R1+R2), smaller context per call (R3+R4) → ~4M vs 9.5M
- Assess step: fewer test runs (R5+R6), less polling (R7) → ~4M vs 8.0M
- Other steps: minimal change
- Estimated total: ~12-14M cache_read vs. current 22.8M (40-47% reduction)
The orchestrator cost (~4.7M for order 774) is harder to reduce without architectural changes to multi-order session management.