Skip to content

Rectify: Stdout Idle Watchdog + Bounded Suppression for Stall Detection#735

Open
Trecek wants to merge 5 commits intointegrationfrom
impl-rectify-idle-stall-watchdog-20260411-142908
Open

Rectify: Stdout Idle Watchdog + Bounded Suppression for Stall Detection#735
Trecek wants to merge 5 commits intointegrationfrom
impl-rectify-idle-stall-watchdog-20260411-142908

Conversation

@Trecek
Copy link
Copy Markdown
Collaborator

@Trecek Trecek commented Apr 11, 2026

Summary

A headless Claude subprocess hung for 82 minutes on a silently stalled HTTPS SSE stream. The TCP connection remained ESTABLISHED with zero traffic, so the _has_active_api_connection suppression gate in _process_monitor.py kept resetting the stale-kill timer every 2-second poll cycle. Neither Channel A nor Channel B detected the stall because Channel A only checks for valid JSONL result records (not raw byte growth), and Channel B's stale detection was suppressed by the ESTABLISHED port-443 connection.

Three layers of fix provide immunity:

  1. Stdout idle watchdog (primary) — new _watch_stdout_idle coroutine in the anyio task group monitors stdout temp file byte count. Fires TerminationReason.IDLE_STALL if no growth for idle_output_timeout seconds. NOT suppressed by active API connections — orthogonal to Channel A/B.

  2. Bounded suppression (defense-in-depth) — max_suppression_seconds caps the _has_active_api_connection and _has_active_child_processes suppression gates so they cannot defer stale kills indefinitely. Counter resets on genuine JSONL activity.

  3. Subprocess timeoutstimeout= added to all unguarded subprocess.run() and asyncio subprocess calls in remote_resolver, smoke_utils, clone, and token_summary_appender.

Changed Files

New (★)

  • tests/execution/test_process_idle_watchdog.py — watchdog coroutine unit tests
  • tests/workspace/test_clone_timeouts.py — static analysis: git network commands have timeouts

Modified (●)

  • src/autoskillit/core/_type_enums.pyIDLE_STALL members on TerminationReason and CliSubtype
  • src/autoskillit/config/settings.pyidle_output_timeout and max_suppression_seconds on RunSkillConfig
  • src/autoskillit/execution/_process_race.py_watch_stdout_idle coroutine, idle_stall field on RaceAccumulator/RaceSignals, priority update in resolve_termination
  • src/autoskillit/execution/_process_monitor.py — bounded suppression logic with suppression_start tracker
  • src/autoskillit/execution/process.py — wire watchdog into task group, IDLE_STALL kill dispatch
  • src/autoskillit/execution/headless.pyIDLE_STALL routing in _build_skill_result (retriable)
  • src/autoskillit/execution/session.py — exhaustive match arms for IDLE_STALL
  • src/autoskillit/core/_type_subprocess.py — protocol updated with new params
  • src/autoskillit/execution/recording.py — runner implementations updated
  • src/autoskillit/execution/remote_resolver.py — 15s timeout via asyncio.wait_for
  • src/autoskillit/smoke_utils.pytimeout=60 on all subprocess calls
  • src/autoskillit/workspace/clone.pytimeout=300 (clone), timeout=120 (push)
  • src/autoskillit/hooks/token_summary_appender.pytimeout=30 on gh api calls

Test Plan

  • test_watch_stdout_idle_fires_on_silence — watchdog fires IDLE_STALL when stdout stops growing
  • test_watch_stdout_idle_resets_on_continuous_output — watchdog does NOT fire when output is steady
  • test_watch_stdout_idle_handles_missing_file — tolerates missing file until it appears
  • test_resolve_termination_idle_stall_priority — idle stall outranks stale in priority chain
  • test_resolve_termination_process_exit_beats_idle_stall — process exit still highest priority
  • test_resolve_termination_idle_stall_beats_stale — idle stall outranks stale
  • test_race_signals_field_count — sentinel: 7 fields on RaceSignals
  • test_build_skill_result_idle_stall_is_retriable — IDLE_STALL produces retriable SkillResult
  • test_run_managed_async_idle_stall_kills_hanging_process — end-to-end integration
  • test_stale_suppression_bounded_by_max_duration — bounded suppression fires after cap
  • test_stale_suppression_resets_on_genuine_activity — counter resets on real JSONL activity
  • test_stale_suppression_logs_warning_on_bounded_kill — warning log emitted
  • Static analysis tests: all subprocess calls in target files have timeout=
  • All 7370 tests pass, pre-commit clean, mypy clean

Implementation Plan

Plan file: .autoskillit/temp/rectify/rectify_idle_stall_watchdog_2026-04-11_140500.md

🤖 Generated with Claude Code via AutoSkillit

Trecek added 3 commits April 11, 2026 14:52
…ss timeouts

Addresses 82-minute hung SSE stream where ESTABLISHED TCP connection
suppressed stale-kill indefinitely. Three layers of fix:

1. Stdout idle watchdog (_watch_stdout_idle): independent coroutine in the
   anyio task group that monitors stdout temp file byte count. Fires
   IDLE_STALL if no growth for idle_output_timeout seconds. NOT suppressed
   by active API connections — orthogonal to Channel A/B.

2. Bounded suppression: max_suppression_seconds caps _has_active_api_connection
   and _has_active_child_processes suppression gates so they cannot defer
   stale kills indefinitely. Counter resets on genuine JSONL activity.

3. Subprocess timeouts: add timeout= to all unguarded subprocess.run() and
   asyncio subprocess calls in remote_resolver, smoke_utils, clone, and
   token_summary_appender.
Copy link
Copy Markdown
Collaborator Author

@Trecek Trecek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AutoSkillit PR Review — Verdict: changes_requested (posted as COMMENT — cannot REQUEST_CHANGES on own PR)

stdout, _ = await proc.communicate()
io_task = asyncio.ensure_future(proc.communicate())
try:
await asyncio.wait_for(proc.wait(), timeout=15.0)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] defense: wait_for(proc.wait()) races with concurrent proc.communicate() future. wait() does not drain stdout pipe; if subprocess fills the pipe buffer before exiting, this can deadlock. Use asyncio.wait_for(proc.communicate(), timeout=15.0) directly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Investigated — this is intentional. The code creates io_task = asyncio.ensure_future(proc.communicate()) at line 54 BEFORE calling wait_for(proc.wait()). The communicate() future runs concurrently on the event loop and actively drains stdout, so the pipe buffer deadlock scenario cannot occur here. wait() alone would deadlock, but the concurrent communicate() task prevents it.

input_data: str | None = None,
completion_drain_timeout: float = 5.0,
linux_tracing_config: Any | None = None,
idle_output_timeout: float | None = None,
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] arch: idle_output_timeout and max_suppression_seconds are execution-tuning params (L1 concerns) added to the HeadlessExecutor protocol in L0 core/. This couples L0 to L1 operational details. Consider kwargs escape hatch or moving protocol to L1.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Investigated — this is intentional. The SubprocessRunner protocol already contained multiple execution-tuning params at L0 before this PR: timeout (float), stale_threshold (float=1200), completion_drain_timeout (float=5.0), pty_mode (bool), linux_tracing_config (Any|None). The new idle_output_timeout and max_suppression_seconds follow the identical pre-existing pattern. This is the established design for this protocol — it serves as the structural contract mirroring run_managed_async.

Orthogonal to Channel A/B: NOT suppressed by active API connections.
Monitors raw byte count (st_size), not JSONL record structure.
"""
import time as _time
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[info] cohesion: import time as _time is a local import inside _watch_stdout_idle. The rest of the module uses module-level imports. Promote to module level.

await anyio.sleep(_poll_interval)
try:
current_size = stdout_path.stat().st_size
except OSError:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[info] defense: except OSError: continue silently swallows all OS errors on stdout_path.stat(). ENOENT is expected, but other errors (permission denied, I/O error) should be logged at DEBUG.

_phase2_poll=_phase2_poll,
_phase1_timeout=_phase1_timeout,
expected_session_id=acc.stdout_session_id,
**_monitor_kwargs, # type: ignore[arg-type]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[info] defense: # type: ignore[arg-type] suppresses type error without rationale. dict[str, object] is too wide for the target signature. Consider TypedDict or add explanatory comment.

completion_drain_timeout: float = 5.0
exit_after_stop_delay_ms: int = 120000
idle_output_timeout: int = 600
max_suppression_seconds: int = 1800
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[info] defense: max_suppression_seconds and idle_output_timeout have no range validation. Zero or negative values would cause silent logic bugs (immediate expiry). Consider post_init assertion.

is_error=True,
exit_code=-1,
needs_retry=True,
retry_reason=RetryReason.STALE,
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[info] defense: RetryReason.STALE reused for IDLE_STALL termination. Downstream retry telemetry cannot distinguish the two failure modes.

Trecek and others added 2 commits April 11, 2026 16:17
Add io_task.cancel() in the TimeoutError handler to prevent a task
leak when proc.communicate() future is abandoned after timeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split shared suppression_start into suppression_start_api and
suppression_start_child to prevent elapsed time carry-over when
sessions alternate between API connection and child process conditions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Trecek Trecek added this pull request to the merge queue Apr 12, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant