feat(codex-fleet): tighten claude-supervisor classifier + replay harness #116

Merged
NagyVikt merged 1 commit into main from agent/claude/claude-supervisor-classifier-audit-2026-05-15-22-25
May 15, 2026

@NagyVikt NagyVikt commented May 15, 2026

Summary

Tightens the claude-supervisor.sh asking/blocked classifier so it stops paying Sonnet for false positives and stops ignoring real blockers. Extracts the classifier into a pure-bash lib that the daemon and a new fixture-driven replay harness both consume.

  • scripts/codex-fleet/lib/claude-supervisor-classifier.sh (new, sourceable)
  • scripts/codex-fleet/claude-supervisor.sh (replaces inline classifier with source)
  • scripts/codex-fleet/test/test-claude-supervisor-classifier.sh (new) + 24 fixtures under scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/
  • openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/ (proposal + spec + tasks)

Audit — top 3 failure modes

The supervisor has never produced a real metrics.tsv on disk (no /tmp/claude-viz/claude-supervisor/metrics.tsv exists yet), so this audit is grounded in the classifier's structural failure modes rather than replayed production rows. Each failure mode is pinned by a fixture under scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/; the harness asserts the post-fix label.

FM-A — bare :$ cursor admits scrollback Continue? as a live ask (false positive)

The prior last_line_is_prompt accepted any line ending in : or ? as a waiting cursor:

'[?:][[:space:]]*$'

Worker tails routinely end with Reading file: rust/fleet-launcher/src/heartbeat.rs: mid-work. With that line at the bottom and an older Continue? yes — answered earlier ten lines back in the 80-line capture, the prior is_asking returned true and paid --model sonnet --effort medium to paste an answer codex never asked for. Pure cost.

Fixture quiet__colon_ending_stale_continue.txt:

  ran scripts/codex-fleet/lib/sandboxed-cargo-test.sh: ok
  next:
Reading file: rust/fleet-launcher/src/heartbeat.rs:

Fix: drop the bare [?:][[:space:]]*$ rule. Bare-? is admitted only when the same line carries a known question lead-word (Continue|Approve|Proceed|Confirm|Apply|Should I|Do you want|Would you like|Which option/approach/one|Choose|Select|Pick|Need clarification|Need more …|Please clarify/confirm/choose/specify). Bare-: no longer counts at all.
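
The tightened bare-? rule can be sketched roughly as follows. This is a minimal illustration, not the shipped lib: the helper last_nonempty_line, the variable names, and the case-sensitive substring matching are assumptions, and the sketch covers only the bare-?/bare-: rule, not the other prompt shapes the classifier recognizes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the tightened bare-? gate (not the shipped lib).

# Lead-words copied from the list above. The elided "Need more ..." entry
# is kept as a bare prefix because the source does not spell it out.
LEAD_RE='(Continue|Approve|Proceed|Confirm|Apply|Should I|Do you want|Would you like|Which (option|approach|one)|Choose|Select|Pick|Need clarification|Need more|Please (clarify|confirm|choose|specify))'
QUESTION_RE='\?[[:space:]]*$'

# Print the last non-empty line of a captured pane tail (assumed helper).
last_nonempty_line() {
  printf '%s\n' "$1" | grep -v '^[[:space:]]*$' | tail -n 1
}

# Return 0 only when the last non-empty line ends in "?" AND carries a
# lead-word on that same line. A trailing ":" never qualifies.
last_line_is_prompt() {
  local line
  line=$(last_nonempty_line "$1")
  [[ $line =~ $QUESTION_RE ]] || return 1
  [[ $line =~ $LEAD_RE ]]
}
```

With this shape, a tail whose bottom line is Reading file: rust/fleet-launcher/src/heartbeat.rs: is rejected regardless of what sits in scrollback, because only the bottom line is inspected and it does not end in ?.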

FM-B — stale Working ( in scrollback masks fresh asks (false negative)

is_busy previously grep'd the entire 80-line window:

is_busy() {
  for pat in "${BUSY_PATTERNS[@]}"; do
    printf '%s\n' "$tail" | grep -qF "$pat" && return 0
  done
}

A pane that finished Working (12s) 40 lines ago and is now sitting on a fresh menu was read as busy and skipped entirely, so real asks slipped past the supervisor.

Fixture asking__busy_in_scrollback_fresh_cursor.txt:

> claim sub-task 1 of plan codex-fleet-glass-menu-drop-tabstrip-2026-05-15
  reading plan.json
Working (8s • ↑ 1.2k tokens • esc to interrupt)
Worked for 6m 32s • ↑ 12.4k tokens
> file rust/fleet-ui/src/glass/mod.rs already exists, contents diverge from plan

Detected merge of generated migration vs hand-edited one — which approach do you prefer?
  1) accept generated (recommended)
  2) keep hand-edited
  3) abort the sub-task
❯

Fix: anchor is_busy to the LAST non-empty line only. codex rewrites the Working (…) footer in place; if the worker is genuinely busy, that line is the bottom of the capture. Once the turn completes, codex replaces it with Worked for … and the pane is no longer busy.
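
A rough sketch of the anchored check, for illustration only. The BUSY_PATTERNS contents here are assumptions inferred from the fixture names (Working ( and esc to interrupt); the real array lives in the lib and may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of is_busy anchored to the last non-empty line.

# Assumed pattern set; the shipped BUSY_PATTERNS may contain other entries.
BUSY_PATTERNS=('Working (' 'esc to interrupt')

is_busy() {
  local tail_text=$1 last pat
  # Keep only the bottom non-empty line of the capture.
  last=$(printf '%s\n' "$tail_text" | grep -v '^[[:space:]]*$' | tail -n 1)
  [[ -n $last ]] || return 1
  for pat in "${BUSY_PATTERNS[@]}"; do
    # Fixed-string match against that single line, never the scrollback.
    printf '%s\n' "$last" | grep -qF -- "$pat" && return 0
  done
  return 1
}
```

Fed the fixture above, the bottom line is the ❯ cursor, so the stale Working (8s …) footer higher up no longer marks the pane as busy.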

FM-C — BLOCKED_PATTERNS missed real codex-fleet stuck states (false negative)

The prior set only covered the plan-side blockers (PLAN_SUBTASK_NOT_FOUND, stale-claim, told me not to rescue, less than 5% of your 5h limit, Blocked.). Real production stuck states the supervisor stayed silent on:

  • CONFLICT (content): Merge conflict in …
  • error: uncommitted changes
  • fatal: unable to access 'https://github.com/…' (any ^fatal: from git)
  • Permission denied (publickey) (ssh key bust on push)
  • gh: command not found
  • Bad credentials (GH token rejected)
  • MCP server <name> (not found|missing|unavailable)
  • 429 Too Many Requests (cap-swap normally catches this; supervisor is the safety net)
  • BLOCKED: (canonical Guardex blocker prefix)

Fixtures cover six of these: blocked__merge_conflict.txt, blocked__uncommitted_changes.txt, blocked__fatal_git.txt, blocked__permission_denied.txt, blocked__mcp_server_missing.txt, blocked__bad_credentials.txt. Fix: extend BLOCKED_PATTERNS with the patterns above. The MCP pattern, MCP server .*(not found|...), admits the server-name token in the middle (for example, MCP server colony not found …).
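
As an illustration of how the extended set might be matched, here is a minimal sketch using extended regexes. The exact regex spellings and the function name is_blocked are assumptions. The sketch scans whatever text it is handed, whereas the shipped lib may scope or order the check differently. The earlier plan-side patterns are omitted for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the new stuck-state patterns as extended regexes.
BLOCKED_PATTERNS=(
  'CONFLICT \(content\): Merge conflict in '
  'error: uncommitted changes'
  '^fatal: '
  'Permission denied \(publickey\)'
  'gh: command not found'
  'Bad credentials'
  'MCP server .*(not found|missing|unavailable)'
  '429 Too Many Requests'
  'BLOCKED:'
)

# Return 0 when any pattern matches any line of the given text.
is_blocked() {
  local pat
  for pat in "${BLOCKED_PATTERNS[@]}"; do
    printf '%s\n' "$1" | grep -qE -- "$pat" && return 0
  done
  return 1
}
```

The .* in the MCP pattern is what lets an arbitrary server name sit between MCP server and the failure word.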

Logic changes

| Surface | Before | After |
| --- | --- | --- |
| last_line_is_prompt bare-:$ rule | accepted | removed |
| last_line_is_prompt bare-?$ rule | accepted unconditionally | accepted only with question lead-word on same line |
| last_line_is_prompt numbered-list rule | matched any N) item at bottom | requires (recommended) or (default) tag on same line |
| is_busy scope | full 80-line tail | last non-empty line only |
| is_asking scope | full tail (after gate) | last N non-empty lines (default 8, env CLAUDE_SUPERVISOR_RECENT_LINES) |
| BLOCKED_PATTERNS | plan-side blockers only | + merge conflict / uncommitted / fatal git / publickey / gh / bad creds / MCP / 429 / BLOCKED: |
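
The is_asking scope change amounts to windowing the capture before matching. A minimal sketch of that window, assuming hypothetical helper names recent_lines and matches_recent (only the environment variable and its default of 8 come from this PR):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the "last N non-empty lines" window.

# Print the last N non-empty lines, N taken from the env var (default 8).
recent_lines() {
  local n=${CLAUDE_SUPERVISOR_RECENT_LINES:-8}
  printf '%s\n' "$1" | grep -v '^[[:space:]]*$' | tail -n "$n"
}

# Return 0 when the caller-supplied extended regex matches inside the window.
matches_recent() {
  recent_lines "$1" | grep -qE -- "$2"
}
```

An answered Continue? that has scrolled more than N non-empty lines up falls outside the window and cannot trigger a match, and raising CLAUDE_SUPERVISOR_RECENT_LINES widens the window again.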

Cost impact

The Sonnet/Opus tier split is unchanged: state=asking still routes to --model sonnet --effort medium, state=blocked still routes to --model claude-opus-4-7 --effort high. Both still share the JSON-schema + prompt-cache prefix. Effect of the changes:

  • Fewer asking false positives → fewer Sonnet calls per tick (the dominant cost line).
  • More-accurate blocked detection may produce slightly more Opus calls, but those are precisely the panes that previously sat stuck forever — the supervisor was undercounting the genuinely-stuck class.
  • 3-strike loop guard and per-pane cooldown unchanged; per-pane spend is still capped.
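
For orientation, the unchanged tier split can be pictured as a small lookup from classifier state to CLI flags. The function name model_args_for_state is hypothetical, and treating busy and quiet as "no model call" is an assumption; only the two flag strings come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the state -> model tier routing described above.

# Print the flags for a state; return 1 for states assumed to make no call.
model_args_for_state() {
  case $1 in
    asking)  printf '%s\n' '--model sonnet --effort medium' ;;
    blocked) printf '%s\n' '--model claude-opus-4-7 --effort high' ;;
    *)       return 1 ;;
  esac
}
```

Because the routing itself is untouched, any cost change comes from how often each state is emitted, not from what each state costs.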

Replay harness

$ bash scripts/codex-fleet/test/test-claude-supervisor-classifier.sh
  PASS  asking        asking__busy_in_scrollback_fresh_cursor.txt
  PASS  asking        asking__press_digit_to_continue.txt
  PASS  asking        asking__recommended_numbered_menu.txt
  PASS  asking        asking__stale_blocker_in_scrollback_menu_now.txt
  PASS  asking        asking__yn_lowercase_default.txt
  PASS  asking        asking__yn_uppercase_destructive.txt
  PASS  blocked       blocked__bad_credentials.txt
  PASS  blocked       blocked__fatal_git.txt
  PASS  blocked       blocked__five_pct_limit.txt
  PASS  blocked       blocked__mcp_server_missing.txt
  PASS  blocked       blocked__merge_conflict.txt
  PASS  blocked       blocked__permission_denied.txt
  PASS  blocked       blocked__plan_subtask_not_found.txt
  PASS  blocked       blocked__stale_claim_told_not_rescue.txt
  PASS  blocked       blocked__uncommitted_changes.txt
  PASS  busy          busy__codex_working_active.txt
  PASS  busy          busy__esc_to_interrupt_last_line.txt
  PASS  quiet         quiet__colon_ending_stale_continue.txt
  PASS  quiet         quiet__do_you_want_in_narration.txt
  PASS  quiet         quiet__empty_pane.txt
  PASS  quiet         quiet__narrative_should_i_no_cursor.txt
  PASS  quiet         quiet__numbered_list_narrative_summary.txt
  PASS  quiet         quiet__prompt_ready_for_input.txt
  PASS  quiet         quiet__worked_for_completion_footer.txt
24 pass, 0 fail

Fixture naming: <expected-label>__<short-name>.txt. Labels: busy | asking | blocked | quiet. The harness sources the pure-bash lib (no daemon side effects) and runs classify_tail per fixture.
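
The naming convention makes the replay loop short. The following is a sketch of the idea rather than the harness itself: run_fixtures is a hypothetical name, classify_tail is assumed to take the capture as its first argument and print one label, and a stand-in classify_tail is defined only when the real lib has not been sourced.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a fixture-driven replay loop.

# Stand-in classifier so the sketch runs on its own; the real one would
# come from sourcing the lib. Assumed contract: print one label.
if ! declare -F classify_tail >/dev/null; then
  classify_tail() {
    local nonblank='[^[:space:]]'
    if [[ $1 =~ $nonblank ]]; then echo busy; else echo quiet; fi
  }
fi

# Replay every <expected-label>__<short-name>.txt under a directory.
run_fixtures() {
  local dir=$1 pass=0 fail=0 f name expected actual
  for f in "$dir"/*.txt; do
    [[ -e $f ]] || continue
    name=${f##*/}              # strip the directory
    expected=${name%%__*}      # label is everything before the first "__"
    actual=$(classify_tail "$(cat "$f")")
    if [[ $actual == "$expected" ]]; then
      printf '  PASS  %-13s %s\n' "$expected" "$name"
      pass=$((pass + 1))
    else
      printf '  FAIL  %-13s %s (got %s)\n' "$expected" "$name" "$actual"
      fail=$((fail + 1))
    fi
  done
  printf '%d pass, %d fail\n' "$pass" "$fail"
  [[ $fail -eq 0 ]]
}
```

Encoding the expected label in the filename keeps each fixture self-describing, so adding a regression case is a matter of dropping one more capture file into the directory.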

Test plan

  • bash -n clean on lib, harness, daemon.
  • bash scripts/codex-fleet/test/test-claude-supervisor-classifier.sh — 24 pass, 0 fail.
  • claude-supervisor.sh --once --dry-run runs without tmux (no-op tick, rc=0).
  • openspec validate agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25 --type change --strict — valid.

🤖 Generated with Claude Code

Extracts the asking/blocked classifier from claude-supervisor.sh into a
sourceable pure-bash lib so the daemon and a new fixture-driven harness
share one implementation. Tightens three failure modes that produced
false positives (and missed real asks/blocks) under the prior logic:

1. last_line_is_prompt no longer accepts bare ":$" as a waiting cursor;
   bare "?$" is admitted only when the line carries a known question
   lead-word (Continue/Approve/Should I/Do you want/Choose/Select/...).
2. is_busy is anchored to the LAST non-empty line. A stale "Working ("
   in scrollback no longer masks a fresh interactive cursor below it.
3. is_asking scopes ASK_PATTERN matching to the recent N lines AND
   requires the tightened last_line_is_prompt gate — both gates needed.

Extends BLOCKED_PATTERNS with the codex-fleet stuck states the
supervisor was previously deaf to: CONFLICT (content / merge conflict),
"error: uncommitted changes", "fatal: <git>", Permission denied
(publickey), "gh: command not found", "Bad credentials",
"MCP server <name> not found|missing|unavailable", "429 Too Many
Requests", and the canonical "BLOCKED:" prefix.

Adds scripts/codex-fleet/test/test-claude-supervisor-classifier.sh +
24 pane-capture fixtures under
scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/.
Filename prefix encodes the expected label (busy|asking|blocked|quiet).
24 pass, 0 fail. Daemon --once --dry-run runs clean without tmux.

Cost: fewer ASK false positives → fewer sonnet/medium calls per tick.
Sonnet stays the workhorse for the remaining real asks; opus stays
gated on the now-more-accurate BLOCKED set. Strike guard unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NagyVikt NagyVikt merged commit 5275198 into main May 15, 2026
@NagyVikt NagyVikt deleted the agent/claude/claude-supervisor-classifier-audit-2026-05-15-22-25 branch May 15, 2026 20:37