Watchdog stall detector: false-positives on long single operations + stall signal not surfaced to caller #634

**Owner:** @devpulse (own module — `src/aipass/devpulse/apps/handlers/watchdog/agent.py`)
**Severity:** Low (stall path only *reports* — it never kills the agent). Improve later.

The watchdog's JSONL stall detector works but has two rough edges noted during S191/S193:

### 1. False-positive on a long single operation
`_has_jsonl_activity()` (agent.py:212) decides liveness purely from **JSONL file size-growth** over the 120s window (`stall_threshold = 120.0`, agent.py:389). An agent doing one genuinely long operation (big file read, long-running tool call, heavy compute) writes **no new JSONL lines** for that span → the detector reads it as idle → `STALLED` fires **while the agent is actively working**. JSONL-growth ≠ liveness.

Fix options:
- (a) Parse the last JSONL entry for an **in-flight/unclosed tool_use** and treat that as activity (not just size delta).
- (b) Cross-check **PID CPU time / fd activity** before declaring a stall (current PID logic is liveness/zombie only, not CPU).
- (c) Require **N consecutive idle windows** before reporting.
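
A minimal sketch of how options (a) and (c) could combine, not the current `agent.py` code. It assumes each JSONL entry carries a `message.content` list whose blocks use `type: "tool_use"` with an `id` and `type: "tool_result"` with a `tool_use_id`; that schema, plus the names `has_open_tool_use`, `StallTracker`, and `required_idle_windows`, are hypothetical and would need checking against the real transcript format.

```python
import json
from dataclasses import dataclass
from typing import Iterable


def has_open_tool_use(jsonl_lines: Iterable[str]) -> bool:
    """Return True if any tool_use block has no matching tool_result yet.

    Assumed schema (not confirmed by the issue): each line is a JSON object
    with message.content = [{"type": "tool_use", "id": ...} or
    {"type": "tool_result", "tool_use_id": ...}, ...].
    """
    open_ids = set()
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # e.g. a partially written final line
        message = entry.get("message") if isinstance(entry, dict) else None
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use" and isinstance(block.get("id"), str):
                open_ids.add(block["id"])
            elif block.get("type") == "tool_result":
                open_ids.discard(block.get("tool_use_id"))
    return bool(open_ids)


@dataclass
class StallTracker:
    """Option (c): report only after N consecutive idle windows."""

    required_idle_windows: int = 3  # hypothetical default, tune as needed
    idle_windows: int = 0

    def observe(self, grew: bool, tool_in_flight: bool) -> bool:
        """Record one window; return True when a stall should be reported."""
        if grew or tool_in_flight:
            self.idle_windows = 0  # any sign of activity resets the count
            return False
        self.idle_windows += 1
        return self.idle_windows >= self.required_idle_windows
```

One caveat with (a) on its own: a tool call that is truly hung would keep its `tool_use` open indefinitely and mask a real stall, so it likely needs an upper bound or the CPU cross-check from (b) alongside it.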

### 2. Stall signal isn't surfaced to the caller
The stall line is written to `_stderr()` + `logger` (agent.py:453/458), and agent.py:41 explicitly notes that stderr 'doesn't pollute stdout capture.' So when the watchdog is armed via the **Monitor tool**, the stall event is **not relayed to devpulse live**: it only lands in the prax stream. The crash/timeout paths are fine; it's the stall *notification* that's invisible to the wrapper.

Fix: emit the stall as a structured event the Monitor wrapper can see (stdout/return payload), so devpulse can act (kill + resume) immediately instead of waiting on the 600s timeout.
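
One possible shape for that structured event, as a hedged sketch: a single JSON line on stdout that the wrapper can pick out of ordinary output. The event name `watchdog.stalled`, the field names, and both helper functions are assumptions made for illustration; the issue does not specify a payload format or how the Monitor wrapper parses output.

```python
import json
import sys
import time
from typing import Optional, TextIO

STALL_EVENT = "watchdog.stalled"  # hypothetical event name


def emit_stall_event(idle_seconds: float, pid: int,
                     stream: Optional[TextIO] = None) -> str:
    """Write one JSON line describing the stall and return that line."""
    out = stream if stream is not None else sys.stdout
    event = {
        "event": STALL_EVENT,
        "pid": pid,
        "idle_seconds": round(idle_seconds, 1),
        "ts": time.time(),
    }
    line = json.dumps(event, sort_keys=True)
    out.write(line + "\n")
    out.flush()  # flush so the wrapper sees it immediately, not at exit
    return line


def parse_stall_event(line: str) -> Optional[dict]:
    """Wrapper side: return the event dict, or None for any other output."""
    try:
        payload = json.loads(line)
    except ValueError:
        return None  # plain, non-JSON output
    if isinstance(payload, dict) and payload.get("event") == STALL_EVENT:
        return payload
    return None
```

Keeping the existing `_stderr()` + `logger` writes and adding the stdout line on top would leave the prax stream unchanged while giving the wrapper something to react to.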

Both are minor polish on a working feature — batch into one watchdog-hardening pass when convenient.
