feat(codex-fleet): tighten claude-supervisor classifier + replay harness by NagyVikt · Pull Request #116 · recodeee/codex-fleetui

NagyVikt · 2026-05-15T20:37:18Z

Summary

Tightens the claude-supervisor.sh asking/blocked classifier so it stops paying Sonnet for false positives and stops ignoring real blockers. Extracts the classifier into a pure-bash lib that the daemon and a new fixture-driven replay harness both consume.

scripts/codex-fleet/lib/claude-supervisor-classifier.sh (new, sourceable)
scripts/codex-fleet/claude-supervisor.sh (replaces inline classifier with source)
scripts/codex-fleet/test/test-claude-supervisor-classifier.sh (new) + 24 fixtures under scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/
openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/ (proposal + spec + tasks)

Audit — top 3 failure modes

The supervisor has never produced a real metrics.tsv on disk (no /tmp/claude-viz/claude-supervisor/metrics.tsv exists yet), so this audit is grounded in the classifier's structural failure modes rather than replayed production rows. Each failure mode is pinned by a fixture under scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/; the harness asserts the post-fix label.

FM-A — bare `:$` cursor admits scrollback `Continue?` as a live ask (false positive)

The prior last_line_is_prompt accepted any line ending in : or ? as a waiting cursor:

'[?:][[:space:]]*$'

Worker tails routinely end mid-work on a line such as `Reading file: rust/fleet-launcher/src/heartbeat.rs:`. With that line at the bottom and an older, already answered `Continue? yes` ten lines back in the 80-line capture, the prior is_asking returned true and paid for a --model sonnet --effort medium call to paste an answer codex never asked for. Pure cost.

Fixture quiet__colon_ending_stale_continue.txt:

  ran scripts/codex-fleet/lib/sandboxed-cargo-test.sh: ok
  next:
Reading file: rust/fleet-launcher/src/heartbeat.rs:
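
A minimal sketch of the FM-A difference, assuming the tightened gate is a single regex. The prior regex is the one quoted above; `tightened_last_line_is_prompt` and the `LEAD_WORDS` variable are hypothetical names, and the lead-word list is only the subset named in the commit message. The real gate may also accept other cursor shapes that this sketch ignores.

```shell
# Prior gate (regex quoted in this PR): any line ending in ':' or '?'.
prior_last_line_is_prompt() {
  printf '%s\n' "$1" | grep -qE '[?:][[:space:]]*$'
}

# Hypothetical tightened gate: the bare ':' rule is gone, and a trailing '?'
# counts only when a question lead-word appears on the same line.
# LEAD_WORDS is an illustrative subset, not the lib's actual list.
LEAD_WORDS='(Continue|Approve|Should I|Do you want|Choose|Select)'
tightened_last_line_is_prompt() {
  printf '%s\n' "$1" | grep -qE "${LEAD_WORDS}.*\?[[:space:]]*\$"
}

line='Reading file: rust/fleet-launcher/src/heartbeat.rs:'
if prior_last_line_is_prompt "$line"; then echo "prior: prompt"; else echo "prior: not a prompt"; fi
if tightened_last_line_is_prompt "$line"; then echo "tightened: prompt"; else echo "tightened: not a prompt"; fi
```

On the fixture's bottom line the prior gate fires and the tightened one does not, which is the behavior the `quiet__colon_ending_stale_continue.txt` fixture pins.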

FM-B — stale `Working (` in scrollback masks fresh asks (false negative)

is_busy previously grep'd the entire 80-line window:

is_busy() {
  for pat in "${BUSY_PATTERNS[@]}"; do
    printf '%s\n' "$tail" | grep -qF "$pat" && return 0
  done
}

A pane that finished Working (12s) 40 lines ago and is now sitting on a fresh menu read as busy → skipped entirely. Real asks slipped past the supervisor.

Fixture asking__busy_in_scrollback_fresh_cursor.txt:

> claim sub-task 1 of plan codex-fleet-glass-menu-drop-tabstrip-2026-05-15
  reading plan.json
Working (8s • ↑ 1.2k tokens • esc to interrupt)
Worked for 6m 32s • ↑ 12.4k tokens
> file rust/fleet-ui/src/glass/mod.rs already exists, contents diverge from plan

Detected merge of generated migration vs hand-edited one — which approach do you prefer?
  1) accept generated (recommended)
  2) keep hand-edited
  3) abort the sub-task
❯

Fix: anchor is_busy to the LAST non-empty line only. codex rewrites the Working (…) footer in place; if the worker is genuinely busy, that line is the bottom of the capture. Once the turn completes, codex replaces it with Worked for … and the pane is no longer busy.
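
The anchoring described above could look roughly like the following. This is a sketch, not the lib's code: `last_nonempty_line` is a hypothetical helper, the function takes the capture as an argument (an assumption), and the `BUSY_PATTERNS` values are limited to strings that appear in the fixtures quoted in this PR.

```shell
# Illustrative patterns only; the real BUSY_PATTERNS may differ.
BUSY_PATTERNS=("Working (" "esc to interrupt")

# Print the last line of stdin that contains a non-whitespace character.
last_nonempty_line() {
  awk 'NF { last = $0 } END { if (last != "") print last }'
}

# Busy only if the LAST non-empty line of the capture matches a busy pattern.
is_busy() {
  local tail="$1" last pat
  last="$(printf '%s\n' "$tail" | last_nonempty_line)"
  [ -n "$last" ] || return 1
  for pat in "${BUSY_PATTERNS[@]}"; do
    # -F: fixed-string match; -- guards against patterns starting with '-'.
    printf '%s\n' "$last" | grep -qF -- "$pat" && return 0
  done
  return 1
}
```

Returning 1 explicitly after the loop also avoids reporting busy when `BUSY_PATTERNS` is empty, since a `for` loop that never iterates exits with status 0.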

FM-C — `BLOCKED_PATTERNS` missed real codex-fleet stuck states (false negative)

The prior set only covered the plan-side blockers (PLAN_SUBTASK_NOT_FOUND, stale-claim, told me not to rescue, less than 5% of your 5h limit, Blocked.). Real production stuck states the supervisor stayed silent on:

CONFLICT (content): Merge conflict in …
error: uncommitted changes
fatal: unable to access 'https://github.com/…' (any ^fatal: from git)
Permission denied (publickey) (ssh key bust on push)
gh: command not found
Bad credentials (GH token rejected)
MCP server <name> (not found|missing|unavailable)
429 Too Many Requests (cap-swap normally catches this; supervisor is the safety net)
BLOCKED: (canonical Guardex blocker prefix)

Fixtures cover six of these: blocked__merge_conflict.txt, blocked__uncommitted_changes.txt, blocked__fatal_git.txt, blocked__permission_denied.txt, blocked__mcp_server_missing.txt, blocked__bad_credentials.txt. The gh: command not found, 429 Too Many Requests, and BLOCKED: patterns have no dedicated fixture in the harness run listed under Replay harness. Fix: extend BLOCKED_PATTERNS with the patterns above; MCP server .*(not found|...) admits the server name token in the middle (MCP server colony not found …).
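
As a rough illustration of the extension, the new patterns could be expressed as extended regexes like this. The exact regex spellings, the anchoring on `^fatal: ` and `^BLOCKED:`, and the choice to scan the whole capture are assumptions; only the matched strings themselves come from the list above, and the pre-existing plan-side patterns are omitted.

```shell
# Hypothetical ERE spellings of the new stuck-state patterns.
BLOCKED_PATTERNS=(
  'CONFLICT \(content\): Merge conflict in '
  'error: uncommitted changes'
  '^fatal: '
  'Permission denied \(publickey\)'
  'gh: command not found'
  'Bad credentials'
  'MCP server .*(not found|missing|unavailable)'
  '429 Too Many Requests'
  '^BLOCKED:'
)

# True if any line of the capture matches any blocked pattern.
is_blocked() {
  local tail="$1" pat
  for pat in "${BLOCKED_PATTERNS[@]}"; do
    printf '%s\n' "$tail" | grep -qE -- "$pat" && return 0
  done
  return 1
}
```

The `.*` in the MCP pattern is what lets a server name such as `colony` sit between `MCP server` and the failure word.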

Logic changes

| Surface | Before | After |
| --- | --- | --- |
| `last_line_is_prompt` bare-`:$` rule | accepted | removed |
| `last_line_is_prompt` bare-`?$` rule | accepted unconditionally | accepted only with question lead-word on same line |
| `last_line_is_prompt` numbered-list rule | matched any `N) item` at bottom | requires `(recommended)` or `(default)` tag on same line |
| `is_busy` scope | full 80-line tail | last non-empty line only |
| `is_asking` scope | full tail (after gate) | last N non-empty lines (default 8, env `CLAUDE_SUPERVISOR_RECENT_LINES`) |
| `BLOCKED_PATTERNS` | plan-side blockers only | + merge conflict / uncommitted / fatal git / publickey / gh / bad creds / MCP / 429 / BLOCKED: |
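
Two of the rows above, the recent-lines window and the tagged numbered-list rule, can be sketched as follows. All function names here are hypothetical, and applying the menu check across the recent window (rather than a single line) is an assumption made to keep the example short; only the default of 8 and the `CLAUDE_SUPERVISOR_RECENT_LINES` variable come from the table.

```shell
# Window size: env override, else the documented default of 8.
RECENT_LINES="${CLAUDE_SUPERVISOR_RECENT_LINES:-8}"

# Keep only the last N non-empty lines of the capture (reads stdin).
recent_nonempty_lines() {
  awk 'NF' | tail -n "$RECENT_LINES"
}

# An "N) item" line counts as a menu entry only when it carries a
# (recommended) or (default) tag on the same line (reads stdin).
has_tagged_menu_item() {
  grep -qE '^[[:space:]]*[0-9]+\)[[:space:]].*\((recommended|default)\)'
}

# Hypothetical combination: tagged menu item somewhere in the recent window.
tail_has_tagged_menu() {
  printf '%s\n' "$1" | recent_nonempty_lines | has_tagged_menu_item
}
```

With this shape, a narrative summary such as `1) fixed parser` is not treated as a menu, and an answered question that has scrolled more than N non-empty lines up falls outside the window.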

Cost impact

The Sonnet/Opus tier split is unchanged: state=asking still routes to --model sonnet --effort medium, state=blocked still routes to --model claude-opus-4-7 --effort high. Both still share the JSON-schema + prompt-cache prefix. Effect of the changes:

Fewer asking false positives → fewer Sonnet calls per tick (the dominant cost line).
More-accurate blocked detection may produce slightly more Opus calls, but those are precisely the panes that previously sat stuck forever — the supervisor was undercounting the genuinely-stuck class.
3-strike loop guard and per-pane cooldown unchanged; per-pane spend is still capped.
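
For orientation, the tier split described above amounts to a mapping from classifier label to model flags. The function name `flags_for_state` is hypothetical, and treating every other label as "no model call" is an assumption; the flag strings are the ones quoted in this section.

```shell
# Map a classifier label to the model flags quoted in this PR.
flags_for_state() {
  case "$1" in
    asking)  echo "--model sonnet --effort medium" ;;
    blocked) echo "--model claude-opus-4-7 --effort high" ;;
    *)       return 1 ;;  # assumed: busy / quiet trigger no model call
  esac
}
```

Every false `asking` label removed by the tightened gate is one fewer trip through the first branch, which is where the savings come from.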

Replay harness

$ bash scripts/codex-fleet/test/test-claude-supervisor-classifier.sh
  PASS  asking        asking__busy_in_scrollback_fresh_cursor.txt
  PASS  asking        asking__press_digit_to_continue.txt
  PASS  asking        asking__recommended_numbered_menu.txt
  PASS  asking        asking__stale_blocker_in_scrollback_menu_now.txt
  PASS  asking        asking__yn_lowercase_default.txt
  PASS  asking        asking__yn_uppercase_destructive.txt
  PASS  blocked       blocked__bad_credentials.txt
  PASS  blocked       blocked__fatal_git.txt
  PASS  blocked       blocked__five_pct_limit.txt
  PASS  blocked       blocked__mcp_server_missing.txt
  PASS  blocked       blocked__merge_conflict.txt
  PASS  blocked       blocked__permission_denied.txt
  PASS  blocked       blocked__plan_subtask_not_found.txt
  PASS  blocked       blocked__stale_claim_told_not_rescue.txt
  PASS  blocked       blocked__uncommitted_changes.txt
  PASS  busy          busy__codex_working_active.txt
  PASS  busy          busy__esc_to_interrupt_last_line.txt
  PASS  quiet         quiet__colon_ending_stale_continue.txt
  PASS  quiet         quiet__do_you_want_in_narration.txt
  PASS  quiet         quiet__empty_pane.txt
  PASS  quiet         quiet__narrative_should_i_no_cursor.txt
  PASS  quiet         quiet__numbered_list_narrative_summary.txt
  PASS  quiet         quiet__prompt_ready_for_input.txt
  PASS  quiet         quiet__worked_for_completion_footer.txt
24 pass, 0 fail

Fixture naming: <expected-label>__<short-name>.txt. Labels: busy | asking | blocked | quiet. The harness sources the pure-bash lib (no daemon side effects) and runs classify_tail per fixture.
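
The naming convention makes the harness loop small. The sketch below is not the shipped script: `run_fixtures` is a hypothetical name, the `classify_tail` shown is a stand-in stub so the example runs on its own, and passing the capture as an argument (rather than on stdin) is an assumption.

```shell
# Stand-in for the lib's classify_tail; the logic here is a placeholder.
classify_tail() {
  if [ -z "$1" ]; then echo quiet; else echo asking; fi
}

# Run every fixture in a directory. The expected label is the filename
# prefix before the double underscore.
run_fixtures() {
  local dir="$1" pass=0 fail=0 f name expected got
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue            # directory has no .txt fixtures
    name="$(basename "$f")"
    expected="${name%%__*}"
    got="$(classify_tail "$(cat "$f")")"
    if [ "$got" = "$expected" ]; then
      printf '  PASS  %-13s %s\n' "$expected" "$name"
      pass=$((pass + 1))
    else
      printf '  FAIL  %-13s %s (got %s)\n' "$expected" "$name" "$got"
      fail=$((fail + 1))
    fi
  done
  printf '%d pass, %d fail\n' "$pass" "$fail"
  [ "$fail" -eq 0 ]                    # non-zero exit if any fixture failed
}
```

In the real harness the stub would be replaced by sourcing scripts/codex-fleet/lib/claude-supervisor-classifier.sh before calling the loop.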

Test plan

bash -n clean on lib, harness, daemon.
bash scripts/codex-fleet/test/test-claude-supervisor-classifier.sh — 24 pass, 0 fail.
claude-supervisor.sh --once --dry-run runs without tmux (no-op tick, rc=0).
openspec validate agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25 --type change --strict — valid.

🤖 Generated with Claude Code

Extracts the asking/blocked classifier from claude-supervisor.sh into a sourceable pure-bash lib so the daemon and a new fixture-driven harness share one implementation.

Tightens three failure modes that produced false positives (and missed real asks/blocks) under the prior logic:

1. last_line_is_prompt no longer accepts bare ":$" as a waiting cursor; bare "?$" is admitted only when the line carries a known question lead-word (Continue/Approve/Should I/Do you want/Choose/Select/...).
2. is_busy is anchored to the LAST non-empty line. A stale "Working (" in scrollback no longer masks a fresh interactive cursor below it.
3. is_asking scopes ASK_PATTERN matching to the recent N lines AND requires the tightened last_line_is_prompt gate; both gates are needed.

Extends BLOCKED_PATTERNS with the codex-fleet stuck states the supervisor was previously deaf to: "CONFLICT (content)" / merge conflict, "error: uncommitted changes", "fatal: <git>", "Permission denied (publickey)", "gh: command not found", "Bad credentials", "MCP server <name> not found|missing|unavailable", "429 Too Many Requests", and the canonical "BLOCKED:" prefix.

Adds scripts/codex-fleet/test/test-claude-supervisor-classifier.sh + 24 pane-capture fixtures under scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/. Filename prefix encodes the expected label (busy|asking|blocked|quiet). 24 pass, 0 fail. Daemon --once --dry-run runs clean without tmux.

Cost: fewer ASK false positives → fewer sonnet/medium calls per tick. Sonnet stays the workhorse for the remaining real asks; opus stays gated on the now-more-accurate BLOCKED set. Strike guard unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

NagyVikt merged commit 5275198 into main May 15, 2026

NagyVikt deleted the agent/claude/claude-supervisor-classifier-audit-2026-05-15-22-25 branch May 15, 2026 20:37
