Watchdog stall detector: false-positives on long single operations + stall signal not surfaced to caller #634

@AIOSAI

Owner: @devpulse (own module — src/aipass/devpulse/apps/handlers/watchdog/agent.py)
Severity: Low (stall path only reports — it never kills the agent). Improve later.

The watchdog's JSONL stall detector works but has two rough edges noted during S191/S193:

1. False-positive on a long single operation

_has_jsonl_activity() (agent.py:212) decides liveness purely from JSONL file size-growth over the 120s window (stall_threshold = 120.0, line 389). An agent doing one genuinely long operation (big file read, long-running tool call, heavy compute) writes no new JSONL lines for that span → the detector reads it as idle → STALLED fires while the agent is actively working. JSONL-growth ≠ liveness.

Fix options:

  • (a) Parse the last JSONL entry for an in-flight/unclosed tool_use and treat that as activity (not just size delta).
  • (b) Cross-check PID CPU time / fd activity before declaring a stall (current PID logic is liveness/zombie only, not CPU).
  • (c) Require N consecutive idle windows before reporting.
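A rough sketch of how (a) and (c) could combine, not the actual agent.py code. It assumes a hypothetical flat JSONL schema (`{"type": "tool_use", "id": ...}` closed by `{"type": "tool_result", "tool_use_id": ...}`); the real session log format may differ. The names `open_tool_use_ids` and `IdleWindowCounter` are made up for illustration.

```python
import json
from pathlib import Path


def open_tool_use_ids(jsonl_path: Path) -> set[str]:
    """Return ids of tool_use entries that have no matching tool_result yet.

    Assumed (hypothetical) schema, one JSON object per line:
      {"type": "tool_use", "id": "..."}
      {"type": "tool_result", "tool_use_id": "..."}
    """
    open_ids: set[str] = set()
    with jsonl_path.open(encoding="utf-8") as fh:
        for raw in fh:
            raw = raw.strip()
            if not raw:
                continue
            try:
                entry = json.loads(raw)
            except json.JSONDecodeError:
                # A half-written trailing line is skipped, not treated as evidence.
                continue
            if not isinstance(entry, dict):
                continue
            if entry.get("type") == "tool_use" and isinstance(entry.get("id"), str):
                open_ids.add(entry["id"])
            elif entry.get("type") == "tool_result":
                open_ids.discard(entry.get("tool_use_id"))
    return open_ids


class IdleWindowCounter:
    """Option (c): report a stall only after N consecutive idle windows."""

    def __init__(self, required_idle_windows: int = 3) -> None:
        self.required_idle_windows = required_idle_windows
        self._idle_windows = 0

    def observe(self, grew: bool, tool_in_flight: bool) -> bool:
        """Record one window; return True when a stall should be reported."""
        if grew or tool_in_flight:
            # Either signal counts as activity and resets the streak.
            self._idle_windows = 0
        else:
            self._idle_windows += 1
        return self._idle_windows >= self.required_idle_windows
```

Per window, the detector would call `counter.observe(grew=<size delta result>, tool_in_flight=bool(open_tool_use_ids(path)))`. One caveat with (a) alone: a tool call that has genuinely hung also looks "in flight", so it probably still needs an upper bound such as the existing timeout.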

2. Stall signal isn't surfaced to the caller

The stall line is written to _stderr() + logger (agent.py:453/458), and line 41 explicitly notes stderr 'doesn't pollute stdout capture.' So when the watchdog is armed via the Monitor tool, the stall event is not relayed to devpulse live — it only lands in the prax stream. The crash/timeout paths are fine; it's the stall notification that's invisible to the wrapper.

Fix: emit the stall as a structured event the Monitor wrapper can see (stdout/return payload), so devpulse can act (kill + resume) immediately instead of waiting on the 600s timeout.
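One possible shape for that event, as a sketch only. The field names (`event`, `pid`, `idle_seconds`) and both helper functions are hypothetical, and it assumes the Monitor wrapper reads the watchdog's stdout line by line; the actual wrapper contract may be different.

```python
import json
import sys
from typing import Optional, TextIO


def emit_stall_event(idle_seconds: float, pid: int,
                     stream: Optional[TextIO] = None) -> str:
    """Write one JSON line describing the stall and return that line.

    Defaults to stdout so the wrapper capturing stdout can see it;
    the existing stderr/logger output could stay alongside it.
    """
    out = stream if stream is not None else sys.stdout
    event = {"event": "stall", "pid": pid, "idle_seconds": round(idle_seconds, 1)}
    line = json.dumps(event, sort_keys=True)
    print(line, file=out, flush=True)  # flush so the caller sees it immediately
    return line


def parse_stall_event(line: str) -> Optional[dict]:
    """Wrapper side: return the event dict for a stall line, else None."""
    try:
        payload = json.loads(line)
    except json.JSONDecodeError:
        return None  # ordinary, non-JSON output
    if isinstance(payload, dict) and payload.get("event") == "stall":
        return payload
    return None
```

On the devpulse side, the wrapper would run `parse_stall_event` on each captured line and trigger kill + resume when it returns a dict. A single JSON line keeps the event easy to tell apart from ordinary stdout text.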

Both are minor polish on a working feature — batch into one watchdog-hardening pass when convenient.
