Remediation pipeline token efficiency: 22.8M cache_read for ~500 line PR #783 #792

@Trecek

Description

Problem

PR #783 ("Rectify: CLI Startup Update Prompt — Terminal Freeze Immunity") produced 425 additions / 101 deletions but consumed 22.8M cache_read tokens across child sessions (reported in the PR's Token Usage Summary) plus an additional ~4.7M cache_read from the orchestrator session — for a true pipeline cost of ~27.5M cache_read tokens. The pipeline ran 12 child sessions over 2h 35m, making ~500+ API calls with average context sizes of 60-80K tokens per call.

This is not a bug — the pipeline completed successfully. But the token cost is disproportionate to the output size, and the investigation identifies specific structural inefficiencies that compound multiplicatively.

Investigation Summary

The cost traces to three compounding structural factors, each of which is independently fixable:

Factor 1: Non-Tool API Turns in Headless Sessions (implement session — 9.5M cache_read)

The implementation session (9768e31d) made 183 API calls for only 111 tool invocations. The remaining 72 calls (39%) produced no tool calls:

  • 61 were text-only narration turns — the model announces intent ("Let me now implement Phase 2a...") in one API call, then executes the tool in the next
  • 11 were extended-thinking-only turns — model reasoning with no visible output or tool use

Key statistics:

  • Of 111 tool-use turns, 62 (56%) were preceded by a non-tool turn (narration or thinking)
  • The remaining 49 (44%) directly followed another tool-use turn — no narration in between
  • Zero turns contained both text and tool_use in the same message — the model consistently separates them into two API calls

Each non-tool turn sends the full context (~85K avg) for zero work output. At 72 wasted turns × ~85K tokens = ~6.1M cache_read overhead from narration/thinking alone.
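The overhead arithmetic above can be checked directly (figures taken from this report):

```python
# Figures from the implement-session analysis above.
wasted_turns = 72           # text-only narration + thinking-only turns
avg_context_tokens = 85_000  # average context sent per API call (~85K)

overhead = wasted_turns * avg_context_tokens
print(f"~{overhead / 1e6:.1f}M cache_read from non-tool turns")
```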

Additionally, Phase 2e of the plan migrated 8 input() call sites with ~9 API calls per site. The pattern for each site: Read → narrate → Edit → narrate → Grep (verify) → Read again → Edit fixup. The same files were read repeatedly: app.py 7 times, test_update_checks.py 7 times.

Two Agent subagents spawned early in the session returned 18K + 21K character reports that were injected into the persistent conversation context and stayed there for all subsequent 165+ API calls, adding ~3-4K tokens of permanent overhead per call (~500K+ cache_read over the session lifetime).

Factor 2: Test Suite Execution Overhead (assess step — 8.0M cache_read, 2 sessions)

Two resolve-failures sessions ran a combined 9 test suite invocations (each full run takes 5-7 minutes):

Session 1 (096e0115, 162 API calls, ~32 min):

| Run | Result | Notes |
|-----|--------|-------|
| 1 | ABORTED | install-worktree failure: uv venv: virtual environment already exists — 0 tests ran |
| 2 | ABORTED | Same install-worktree collision — 0 tests ran |
| 3 | FAIL — 60 failures | First real run. All failures from timed_prompt()'s select.select(sys.stdin) crashing under pytest's captured stdin (no fileno()) |
| 4 | FAIL — 4 failures | 3 subset-gate tests + sigterm flaky test |
| 5 | FAIL — 1 failure | Pre-existing sigterm flaky test only |

Root cause of runs 1-2: The worktree had 22 packages newer than the lockfile. install-worktree's uv venv .venv (without --clear) failed because a venv already existed. Fix: uv sync --all-extras downgraded packages to match lockfile. Time wasted: 65 seconds, but 2 full API round-trips with context.

Session 2 (e2495448, 102 API calls / 54 deduplicated, ~45 min):

| Run | Result | Notes |
|-----|--------|-------|
| 1-3 | FAIL — 1 each | Sigterm test failing under -n 4 parallel load (passes in isolation) |
| 4 | PASS — 7750 passed | Fix: increased subprocess startup deadline from 5s to 15s |

This session had significant polling overhead: 22 API calls were spent monitoring a single test run via repeated `sleep 60` / `tail` / `ps aux | grep pytest` cycles instead of a single background wait.

Factor 3: Pipeline Architecture — Per-Session Startup and Orchestrator Cost

Each of the 12 child sessions starts from scratch with:

  • System prompt + CLAUDE.md + skill docs: ~12K tokens (cached but counted in cache_read every call)
  • Plan file re-read: ~21K chars
  • Full context accumulation from zero on every session

The orchestrator session (812fdbe0) ran 5 hours managing orders 774 and 775 concurrently:

  • 124 API calls, 8.7M total cache_read, 41 MCP tool calls
  • 98 of 124 turns (79%) were shared context (thinking/processing between dispatches)
  • Attribution: ~54% to order 774 (~4.7M), ~46% to order 775
  • This cost is not included in the PR's Token Usage Summary — only child session tokens are reported

Token Accounting Details

How the 22.8M Figure Is Computed

The PR's Token Usage Summary is generated by token_summary_hook.py:

  1. Reads sessions.jsonl index, filters by order_id
  2. For each session, reads token_usage.json from the session diagnostics directory
  3. These files are written by session_log.py::flush_session_log() using extract_token_usage(stdout) from session.py
  4. extract_token_usage prefers the Claude CLI's result record (a single session-level aggregate) over summing per-turn assistant records
  5. No extended-thinking deduplication exists anywhere in AutoSkillit — the 22.8M vs. raw per-turn sum (~45.4M) difference comes entirely from the Claude CLI's own result record being roughly half the per-turn sum
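The preference logic described in steps 4-5 can be sketched as follows. This is a minimal illustration only; the record shapes and field names are assumptions, not AutoSkillit's or the Claude CLI's actual schema:

```python
import json

def aggregate_cache_read(stdout_lines):
    """Illustrative sketch: prefer the CLI's single session-level `result`
    record over summing per-turn `assistant` records. The raw per-turn sum
    double-counts (the ~45.4M vs 22.8M gap described above)."""
    per_turn_sum = 0
    result_record = None
    for line in stdout_lines:
        rec = json.loads(line)
        if rec.get("type") == "result":
            result_record = rec.get("usage", {})
        elif rec.get("type") == "assistant":
            per_turn_sum += rec.get("usage", {}).get("cache_read_input_tokens", 0)
    if result_record is not None:
        return result_record.get("cache_read_input_tokens", 0)
    return per_turn_sum  # fallback: raw per-turn sum
```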

Per-Step Breakdown (from PR Token Usage Summary)

| Step | uncached | output | cache_read | cache_write | count | time |
|------|----------|--------|------------|-------------|-------|------|
| investigate | 5.5k | 10.7k | 870.9k | 58.4k | 1 | 6m 18s |
| rectify | 28 | 13.5k | 1.0M | 63.7k | 1 | 6m 44s |
| review | 3.7k | 7.7k | 313.4k | 37.9k | 1 | 5m 7s |
| dry_walkthrough | 1.9k | 14.9k | 1.9M | 62.9k | 1 | 12m 57s |
| implement | 119 | 46.6k | 9.5M | 130.7k | 1 | 22m 55s |
| assess | 135 | 47.4k | 8.0M | 164.2k | 2 | 1h 24m |
| prepare_pr | 9 | 7.0k | 139.3k | 31.2k | 1 | 2m 38s |
| run_arch_lenses | 3.4k | 20.1k | 720.7k | 108.2k | 3 | 10m 41s |
| compose_pr | 9 | 12.0k | 200.5k | 40.6k | 1 | 3m 42s |
| Total | 14.8k | 179.8k | 22.8M | 697.8k | 12 | 2h 35m |

Full Session Inventory (Order 774)

| Session ID | Step | API Calls | Cache Read | Output | Duration | Notes |
|------------|------|-----------|------------|--------|----------|-------|
| 14b6a3bc | investigate | 38 | 870.9K | 10.7K | 6m 18s | |
| 6202fd96 | rectify | 38 | 1.0M | 13.5K | 6m 44s | |
| 504ea192 | review-approach | 24 | 313.4K | 7.7K | 5m 7s | |
| fd6d3079 | dry-walkthrough | 65 | 1.9M | 14.9K | 12m 57s | |
| 9768e31d | implement | 183 | 9.5M | 46.6K | 22m 55s | 72 non-tool turns, 2 Agent subagents (18K+21K char reports in context) |
| 096e0115 | resolve-failures #1 | 162 | ~5.1M | 33.9K | 32m | 5 test runs (2 aborted on venv), 60→4→1 failures |
| e2495448 | resolve-failures #2 | 102 | ~2.9M | 13.4K | ~45m | 4 test runs, polling overhead, sigterm fix |
| a53d2b38 | prepare-pr | 18 | 139.3K | 7.0K | 2m 38s | |
| 79e2d7d8 | arch-lens-process-flow | 18 | ~240K | 5.5K | ~3m | |
| 0f1f76be | arch-lens-state-lifecycle | 17 | ~245K | 7.0K | ~3m | |
| a2540a25 | arch-lens-module-dep | 21 | ~300K | 7.6K | ~4m | |
| 7651e1a4 | compose-pr | 20 | 200.5K | 12.0K | 3m 42s | |
| cdce8960 | review-pr | 51 | 2.8M | 24.1K | | Post-PR creation, excluded from summary |

Orchestrator (Excluded from PR Summary)

| Metric | Value |
|--------|-------|
| Session | 812fdbe0-ba63-4df1-960d-bda6cf3ca374 |
| Duration | 5h (23:25 → 04:28 UTC) |
| API calls | 124 |
| Total cache_read | 8.7M |
| Attributable to order 774 | ~4.7M (54%) |
| MCP tool calls | 41 (26 run_skill + run_cmd/test_check) |
| Context growth | 15K → 98K tokens over session |

The orchestrator also managed order 775 ("MCP server startup race — open_kitchen 'No such tool available' on first call") concurrently. Order 775 ran 12 sessions but never completed a PR — it was still in resolve-failures when the orchestrator session ended.

Crashed Session Records (41 total)

The session diagnostics system recorded 41 "crashed" sessions during this pipeline run. Investigation confirms none represent lost work:

11 PID-reuse duplicates: The process monitor re-opened tracking records for PIDs belonging to already-completing successful sessions. All show peak_rss_kb=1920 (monitor wrapper RSS), empty skill_command, no token usage.

30 orphaned subagent processes (10 triplet groups): Each group has a consistent fingerprint:

| Role | Peak RSS | Description |
|------|----------|-------------|
| Primary Node.js | ~180-218K KB | Claude Code process, no SIGTERM handler, fd_count=21, 2 threads |
| Monitor wrapper | 1664 KB | Python wrapper, sleeping (hrtimer_nanosleep), 1 thread |
| Secondary Node.js | ~171-173K KB | Spawned subagent child, has SIGTERM handler (bit 15 in sig_cgt), fd_count=18 |

All 30 have: end_ts="", skill_command="", cwd="", write_call_count=0, no token usage. They disappear within 4-6 snapshots (~200-300ms).

The groups cluster around resolve-failures session endings: 096e0115 generated 3 groups, e2495448 generated 4 groups, cdce8960 generated 1 group. These sessions spawned the most Agent subagents internally.

Only 1 anomaly file exists across all 41 crashed sessions: crashed_23417 flags rss_growth (512→1664 KB) — normal Python startup allocation of the monitor wrapper, not meaningful.

Pipeline Data Flow

Orchestrator (812fdbe0, 5h, 8.7M cache_read, 124 API calls)
├── Order 774: "CLI startup update prompt terminal freeze" → PR #783
│   ├── investigate (14b6a3bc) → rectify (6202fd96) → review-approach (504ea192)
│   ├── dry-walkthrough (fd6d3079, 65 API calls)
│   ├── implement-worktree-no-merge (9768e31d, 183 API calls, 9.5M cache_read)
│   │   ├── Agent subagent 1: explore _update_checks.py → 18K char report → orphaned
│   │   └── Agent subagent 2: survey input() calls → 21K char report → orphaned
│   ├── resolve-failures #1 (096e0115, 162 API calls, 5 test runs)
│   │   ├── Runs 1-2: aborted (install-worktree venv collision, 22 packages out of sync)
│   │   ├── Run 3: 60 failures (select.select stdin crash)
│   │   ├── Run 4: 4 failures → Run 5: 1 failure (pre-existing sigterm)
│   │   └── Agent subagent: categorize failures → 5K report → orphaned
│   ├── resolve-failures #2 (e2495448, 102 API calls, 4 test runs)
│   │   ├── Runs 1-3: sigterm failing under -n 4 load
│   │   └── Run 4: PASS (fix: 5s→15s startup deadline)
│   ├── prepare-pr → arch-lenses ×3 (parallel) → compose-pr
│   └── review-pr (cdce8960, post-creation, 51 API calls)
│
└── Order 775: "MCP server startup race" (parallel, 12 sessions, NO PR produced)
    ├── investigate → rectify → review-approach → dry-walkthrough (part A)
    ├── implement (part A) → resolve-failures (part A passes)
    ├── dry-walkthrough (part B) → implement (part B)
    └── resolve-failures (part B) ×4 — still running when orchestrator ended

Recommendations (Ordered by Impact)

High Impact — Reduce API Call Count

R1. Suppress narration in headless sessions. Add to the headless session system prompt: "Proceed directly to tool use without announcing intent. Combine reasoning and tool calls in single API responses. Never emit a text-only turn before a tool call." The implement session's 72 non-tool turns are the single largest waste factor. Reducing them from 72 to ~10 (extended thinking is harder to suppress) would save ~5-6M cache_read on the implement step alone. Estimated impact: -5M cache_read per pipeline run.

R2. Batch file edits instead of per-call-site granularity. Phase 2e made ~9 API calls per input() call site migration (8 sites = ~72 calls). A single Write replacing the whole file, or grouped Edits, would reduce this to ~16 calls. Estimated impact: -3M cache_read (fewer calls with the same ~85K context).
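The grouped-edit shape R2 describes can be sketched as follows. This is a minimal illustration, not the actual Edit/Write tool API; the `timed_prompt` target comes from this PR, but the edit pairs are hypothetical:

```python
def apply_grouped_edits(source: str, edits: list[tuple[str, str]]) -> str:
    """Apply every call-site migration in one pass over the file, so a single
    Write-style operation replaces ~9 Read/Edit/verify round-trips per site."""
    for old, new in edits:
        assert old in source, f"stale edit target: {old!r}"
        source = source.replace(old, new, 1)
    return source

# One batched pass migrates every input() site in a single operation:
migrated = apply_grouped_edits(
    "a = input('x')\nb = input('y')\n",
    [("input('x')", "timed_prompt('x')"), ("input('y')", "timed_prompt('y')")],
)
```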

Medium Impact — Reduce Context Size Per Call

R3. Extract Agent subagent reports to files instead of injecting into context. The two 18K+21K char reports stayed in context for 165+ turns. Instead: Write reports to .autoskillit/temp/, inject only a 1-line summary ("Report at /path/to/file.md"), and Read the file only when its content is needed. Estimated impact: -500K cache_read (removes ~4K tokens from 165+ calls).
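A minimal sketch of R3, with assumed paths and helper names (not AutoSkillit's actual API):

```python
import tempfile
from pathlib import Path

def stash_report(report_text: str, name: str, temp_dir: Path) -> str:
    """Persist the full subagent report to disk and hand back only a one-line
    pointer, so ~20K chars never enter the persistent conversation context.
    (Illustrative; the .autoskillit/temp location is taken from R3 above.)"""
    temp_dir.mkdir(parents=True, exist_ok=True)
    path = temp_dir / f"{name}.md"
    path.write_text(report_text)
    return f"Report ({len(report_text)} chars) at {path}"

# The session would carry only this summary line, and Read the file on demand:
summary = stash_report("…exploration findings…", "explore-update-checks",
                       Path(tempfile.mkdtemp()) / ".autoskillit" / "temp")
```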

R4. Truncate test output injected into context. Each test run injects ~10K chars of output. After 5 test runs, that's 50K chars of historical test output in context that is rarely re-consulted. Summarize prior run results instead of carrying full output. Estimated impact: -500K cache_read.
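One possible shape for R4, as a sketch (the filtering heuristics are assumptions, not an existing implementation):

```python
def summarize_test_output(raw: str, keep_tail: int = 15) -> str:
    """Keep only FAILED lines plus the final summary tail of a pytest
    transcript, so prior runs carry ~15 lines in context instead of ~10K chars."""
    lines = raw.splitlines()
    failures = [ln for ln in lines if ln.startswith("FAILED")]
    tail = lines[-keep_tail:]
    kept = failures + [ln for ln in tail if ln not in failures]
    return "\n".join(kept)
```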

Medium Impact — Reduce Test Run Count

R5. Pre-validate worktree environment before running tests. The install-worktree task's uv venv .venv (without --clear) caused 2 wasted test runs when the worktree had package version drift. Fix: add --clear to the uv venv call in install-worktree, or add a pre-test environment check step that runs uv sync --check and auto-fixes before the first task test-check. Estimated impact: -200K cache_read (2 fewer API round-trips with full context).

R6. Run targeted test subsets before full suite. Before running task test-check (5-7 min), run only the specific failing test files first (10-20 seconds). This would eliminate full-suite re-runs where only 1-4 tests fail. Estimated impact: -1M cache_read (fewer 5-7 min wait cycles requiring polling turns).

R7. Replace test polling with single background wait. Session 2 spent 22 API calls polling a test run via sleep 60 + tail + ps. Use run_in_background with a single Bash call instead. Estimated impact: -1M cache_read (20 fewer polling calls at ~50K context each).
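In Python terms, the single blocking wait R7 proposes looks like this sketch (the command is a stand-in for the real `task test-check` invocation):

```python
import subprocess
import sys

# One blocking call replaces the sleep 60 / tail / ps polling loop: the
# session makes zero API turns while the suite runs, then resumes once.
proc = subprocess.run(
    [sys.executable, "-c", "print('7750 passed')"],  # stand-in test command
    capture_output=True, text=True, timeout=600,     # bound the single wait
)
summary = proc.stdout.strip()  # only the final result re-enters the context
```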

Lower Impact — Improve Observability

R8. Include orchestrator token usage in PR Token Usage Summary. The orchestrator's ~4.7M cache_read for order 774 is invisible in the current summary. Add an "orchestrator" row to the token table showing the attributed share. Without this, the reported 22.8M understates the true 27.5M cost by 17%.

R9. Clean up orphaned subagent processes. 30 orphaned Claude Code subprocesses across 10 triplet groups indicate CLAUDE_CODE_EXIT_AFTER_STOP_DELAY is not propagating to all Agent invocations within headless sessions. While these don't consume API tokens (they're killed before making calls), they pollute session diagnostics with 41 false-positive crash records.

Estimated Combined Impact

If R1-R7 were all implemented:

  • Implement step: 183 → ~100 API calls (R1+R2), smaller context per call (R3+R4) → ~4M vs 9.5M
  • Assess step: fewer test runs (R5+R6), less polling (R7) → ~4M vs 8.0M
  • Other steps: minimal change
  • Estimated total: ~12-14M cache_read vs. current 22.8M (40-47% reduction)

The orchestrator cost (~4.7M for order 774) is harder to reduce without architectural changes to multi-order session management.
