diff --git a/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/.openspec.yaml b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/.openspec.yaml new file mode 100644 index 0000000..9f70866 --- /dev/null +++ b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-05-15 diff --git a/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/proposal.md b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/proposal.md new file mode 100644 index 0000000..937d646 --- /dev/null +++ b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/proposal.md @@ -0,0 +1,63 @@ +## Why + +`claude-supervisor.sh`'s asking/blocked classifier had two structural +weaknesses that produced classifier cost and precision problems: + +1. The "is this a real interactive cursor?" gate (`last_line_is_prompt`) + accepted any line ending in `:` or `?` as a waiting cursor. Worker + tails routinely end with `Reading file:` mid-work, so an old + `Continue?` or `Should I` 40 lines back in the 80-line capture + was enough to call sonnet/medium and paste an answer codex never + asked for. Pure false-positive cost. +2. `is_busy` grep'd the whole 80-line window for `Working (` or + `esc to interrupt`. A pane that finished `Working (12s)` 40 lines + ago and is now sitting at a fresh `[Y/n] ` cursor read as "busy" + and never reached the ask path — real asks slipped past the + supervisor entirely. + +Additionally, several real codex-fleet blockers (merge conflict, +uncommitted-changes, fatal git, ssh permission-denied, MCP server +missing, bad GH credentials, BLOCKED: prefix) were missing from +`BLOCKED_PATTERNS`, so panes parked on these states classified as +`quiet` and the supervisor stayed silent. + +## What Changes + +- Extract the classifier (BUSY/ASK/BLOCKED patterns + `is_busy`, + `is_asking`, `is_blocked`, `last_line_is_prompt`, `classify_tail`, + `tail_hash`) into a pure-bash library at + `scripts/codex-fleet/lib/claude-supervisor-classifier.sh` so the + daemon and a replay harness share one implementation. +- Tighten `last_line_is_prompt`: drop the bare `[?:][[:space:]]*$` + rule. Bare-`?` is only admitted when the line carries a known + question lead-word. `:$` no longer counts. +- Tighten `is_busy`: anchor BUSY_PATTERNS to the LAST non-empty line + only. codex rewrites the `Working (…)` footer in place; if the + worker is busy it's at the bottom. Stale `Working (` in scrollback + no longer masks a fresh interactive prompt. +- Tighten `is_asking`: scope ASK_PATTERN matching to the recent N + non-empty lines (default 8 via `CLAUDE_SUPERVISOR_RECENT_LINES`) + AND require `last_line_is_prompt` to pass. +- Extend `BLOCKED_PATTERNS` with the codex-fleet-specific stuck + states listed under "Why". +- Add a fixture-driven replay harness at + `scripts/codex-fleet/test/test-claude-supervisor-classifier.sh` + with 24 pane-capture fixtures covering the false-positive, + missed-block, and previously-correct cases. Filename prefix + encodes the expected classification. + +## Impact + +- Cost: fewer ASK false positives → fewer sonnet/medium calls per + tick. Sonnet stays the workhorse for the remaining real asks. + Opus calls are gated on the (now more accurate) BLOCKED set; + strike guard caps per-pane spend. +- Behavior: panes the supervisor used to ignore (real ask under + stale `Working (`, merge-conflict, MCP-missing, bad GH creds) + now classify correctly. +- Risk: the harness pins the precision/recall trade-off — any + future loosening of the gates regresses the harness. +- Surfaces touched: `claude-supervisor.sh` replaces its inline + classifier with `source`; new lib; new test + fixtures. No + daemon-state files, no plan-watcher changes, no cap-swap-daemon + changes. diff --git a/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/specs/claude-supervisor-classifier-audit/spec.md b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/specs/claude-supervisor-classifier-audit/spec.md new file mode 100644 index 0000000..fdbff47 --- /dev/null +++ b/openspec/changes/agent-claude-claude-supervisor-classifier-audit-2026-05-15-22-25/specs/claude-supervisor-classifier-audit/spec.md @@ -0,0 +1,89 @@ +## ADDED Requirements + +### Requirement: classifier lives in a pure-bash library + +The system SHALL keep the classifier (BUSY/ASK/BLOCKED pattern +arrays, `is_busy`, `is_asking`, `is_blocked`, `last_line_is_prompt`, +`classify_tail`, and `tail_hash`) at +`scripts/codex-fleet/lib/claude-supervisor-classifier.sh`. The lib +SHALL be safe to `source` with no side effects (no tmux calls, no +claude calls, no file writes at source time). `claude-supervisor.sh` +SHALL source this lib rather than inlining the classifier. + +#### Scenario: library is sourceable with no side effects +- **WHEN** `scripts/codex-fleet/lib/claude-supervisor-classifier.sh` + is sourced from a bash script +- **THEN** no tmux, claude, or filesystem-mutating command runs +- **AND** `classify_tail`, `is_busy`, `is_asking`, `is_blocked`, + `last_line_is_prompt`, and `tail_hash` are defined + +### Requirement: classify_tail returns one of four labels + +`classify_tail ""` SHALL echo exactly one of +`busy`, `asking`, `blocked`, `quiet`. `asking` SHALL outrank `blocked` +when both conditions match — a pane that mentioned a stale blocker +but is now showing an interactive menu wants an answer to the menu. + +#### Scenario: asking outranks blocked +- **WHEN** the recent tail contains both a BLOCKED_PATTERN (e.g., + stale-claim) and an ASK_PATTERN (e.g., a numbered menu with a + `(recommended)` option) AND `last_line_is_prompt` accepts the + bottom line +- **THEN** `classify_tail` echoes `asking` + +### Requirement: busy is anchored to the last non-empty line + +`is_busy` SHALL match BUSY_PATTERNS only against the last non-empty +line of the ANSI-stripped tail. A stale `Working (` or +`esc to interrupt` earlier in scrollback SHALL NOT mask a fresh +interactive cursor at the bottom. + +#### Scenario: stale Working in scrollback does not mask a fresh ask +- **WHEN** the tail contains `Working (` nine lines from the bottom + AND the bottom line is a bare prompt sigil (`❯`) under a numbered + menu with `(recommended)` option +- **THEN** `classify_tail` echoes `asking`, not `busy` + +### Requirement: last_line_is_prompt rejects bare-colon endings + +`last_line_is_prompt` SHALL NOT accept a bare trailing `:` as a +waiting cursor. A bare trailing `?` SHALL be accepted only when the +same line carries a known question lead-word (Continue, Approve, +Proceed, Confirm, Apply, Should I, Do you want, Would you like, +Which option/approach/one, Choose, Select, Pick, Need clarification, +Need more …, Please clarify/confirm/choose/specify). + +#### Scenario: narrative status line ending in ":" does not trigger asking +- **WHEN** the bottom line of the tail is `Reading file: …:` AND an + older `Continue?` appears earlier in scrollback +- **THEN** `classify_tail` echoes `quiet` + +### Requirement: BLOCKED_PATTERNS cover codex-fleet stuck states + +`BLOCKED_PATTERNS` SHALL match each of: git merge conflict +(`CONFLICT (content`), `error: uncommitted changes`, `fatal: ` git +errors, `Permission denied (publickey)`, `gh: command not found`, +`Bad credentials`, `MCP server (not found|missing|unavailable)`, +`429 Too Many Requests`, and the canonical `BLOCKED:` prefix — +in addition to the prior set (PLAN_SUBTASK_NOT_FOUND, stale-claim, +told-not-to-rescue, less-than-5%-limit, etc.). + +#### Scenario: missing MCP server is classified as blocked +- **WHEN** the tail contains `Error: MCP server colony not found in + the registered servers` +- **THEN** `classify_tail` echoes `blocked` + +### Requirement: replay harness pins classifier behavior + +The system SHALL ship a replay harness at +`scripts/codex-fleet/test/test-claude-supervisor-classifier.sh` +that discovers every `*.txt` fixture under +`scripts/codex-fleet/test/fixtures/claude-supervisor-classifier/`, +parses the expected label from the filename prefix +(`