Owner: @devpulse (own module — src/aipass/devpulse/apps/handlers/watchdog/agent.py)
Severity: Low (stall path only reports — it never kills the agent). Improve later.
The watchdog's JSONL stall detector works but has two rough edges noted during S191/S193:
1. False-positive on a long single operation
_has_jsonl_activity() (agent.py:212) decides liveness purely from JSONL file size-growth over the 120s window (stall_threshold = 120.0, line 389). An agent doing one genuinely long operation (big file read, long-running tool call, heavy compute) writes no new JSONL lines for that span → the detector reads it as idle → STALLED fires while the agent is actively working. JSONL-growth ≠ liveness.
Fix options:
- (a) Parse the last JSONL entry for an in-flight/unclosed tool_use and treat that as activity (not just size delta).
- (b) Cross-check PID CPU time / fd activity before declaring a stall (current PID logic is liveness/zombie only, not CPU).
- (c) Require N consecutive idle windows before reporting.
2. Stall signal isn't surfaced to the caller
The stall line is written to _stderr() + logger (agent.py:453/458), and line 41 explicitly notes stderr 'doesn't pollute stdout capture.' So when the watchdog is armed via the Monitor tool, the stall event is not relayed to devpulse live — it only lands in the prax stream. The crash/timeout paths are fine; it's the stall notification that's invisible to the wrapper.
Fix: emit the stall as a structured event the Monitor wrapper can see (stdout/return payload), so devpulse can act (kill + resume) immediately instead of waiting on the 600s timeout.
Both are minor polish on a working feature — batch into one watchdog-hardening pass when convenient.
Owner: @devpulse (own module —
src/aipass/devpulse/apps/handlers/watchdog/agent.py)Severity: Low (stall path only reports — it never kills the agent). Improve later.
The watchdog's JSONL stall detector works but has two rough edges noted during S191/S193:
1. False-positive on a long single operation
_has_jsonl_activity()(agent.py:212) decides liveness purely from JSONL file size-growth over the 120s window (stall_threshold = 120.0, line 389). An agent doing one genuinely long operation (big file read, long-running tool call, heavy compute) writes no new JSONL lines for that span → the detector reads it as idle →STALLEDfires while the agent is actively working. JSONL-growth ≠ liveness.Fix options:
2. Stall signal isn't surfaced to the caller
The stall line is written to
_stderr()+logger(agent.py:453/458), and line 41 explicitly notes stderr 'doesn't pollute stdout capture.' So when the watchdog is armed via the Monitor tool, the stall event is not relayed to devpulse live — it only lands in the prax stream. The crash/timeout paths are fine; it's the stall notification that's invisible to the wrapper.Fix: emit the stall as a structured event the Monitor wrapper can see (stdout/return payload), so devpulse can act (kill + resume) immediately instead of waiting on the 600s timeout.
Both are minor polish on a working feature — batch into one watchdog-hardening pass when convenient.