Rectify: Stdout Idle Watchdog + Bounded Suppression for Stall Detection#735
Rectify: Stdout Idle Watchdog + Bounded Suppression for Stall Detection#735Trecek wants to merge 5 commits intointegrationfrom
Conversation
…ss timeouts Addresses 82-minute hung SSE stream where ESTABLISHED TCP connection suppressed stale-kill indefinitely. Three layers of fix: 1. Stdout idle watchdog (_watch_stdout_idle): independent coroutine in the anyio task group that monitors stdout temp file byte count. Fires IDLE_STALL if no growth for idle_output_timeout seconds. NOT suppressed by active API connections — orthogonal to Channel A/B. 2. Bounded suppression: max_suppression_seconds caps _has_active_api_connection and _has_active_child_processes suppression gates so they cannot defer stale kills indefinitely. Counter resets on genuine JSONL activity. 3. Subprocess timeouts: add timeout= to all unguarded subprocess.run() and asyncio subprocess calls in remote_resolver, smoke_utils, clone, and token_summary_appender.
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit PR Review — Verdict: changes_requested (posted as COMMENT — cannot REQUEST_CHANGES on own PR)
| stdout, _ = await proc.communicate() | ||
| io_task = asyncio.ensure_future(proc.communicate()) | ||
| try: | ||
| await asyncio.wait_for(proc.wait(), timeout=15.0) |
There was a problem hiding this comment.
[warning] defense: wait_for(proc.wait()) races with concurrent proc.communicate() future. wait() does not drain stdout pipe; if subprocess fills the pipe buffer before exiting, this can deadlock. Use asyncio.wait_for(proc.communicate(), timeout=15.0) directly.
There was a problem hiding this comment.
Investigated — this is intentional. The code creates io_task = asyncio.ensure_future(proc.communicate()) at line 54 BEFORE calling wait_for(proc.wait()). The communicate() future runs concurrently on the event loop and actively drains stdout, so the pipe buffer deadlock scenario cannot occur here. wait() alone would deadlock, but the concurrent communicate() task prevents it.
| input_data: str | None = None, | ||
| completion_drain_timeout: float = 5.0, | ||
| linux_tracing_config: Any | None = None, | ||
| idle_output_timeout: float | None = None, |
There was a problem hiding this comment.
[warning] arch: idle_output_timeout and max_suppression_seconds are execution-tuning params (L1 concerns) added to the HeadlessExecutor protocol in L0 core/. This couples L0 to L1 operational details. Consider kwargs escape hatch or moving protocol to L1.
There was a problem hiding this comment.
Investigated — this is intentional. The SubprocessRunner protocol already contained multiple execution-tuning params at L0 before this PR: timeout (float), stale_threshold (float=1200), completion_drain_timeout (float=5.0), pty_mode (bool), linux_tracing_config (Any|None). The new idle_output_timeout and max_suppression_seconds follow the identical pre-existing pattern. This is the established design for this protocol — it serves as the structural contract mirroring run_managed_async.
| Orthogonal to Channel A/B: NOT suppressed by active API connections. | ||
| Monitors raw byte count (st_size), not JSONL record structure. | ||
| """ | ||
| import time as _time |
There was a problem hiding this comment.
[info] cohesion: import time as _time is a local import inside _watch_stdout_idle. The rest of the module uses module-level imports. Promote to module level.
| await anyio.sleep(_poll_interval) | ||
| try: | ||
| current_size = stdout_path.stat().st_size | ||
| except OSError: |
There was a problem hiding this comment.
[info] defense: except OSError: continue silently swallows all OS errors on stdout_path.stat(). ENOENT is expected, but other errors (permission denied, I/O error) should be logged at DEBUG.
| _phase2_poll=_phase2_poll, | ||
| _phase1_timeout=_phase1_timeout, | ||
| expected_session_id=acc.stdout_session_id, | ||
| **_monitor_kwargs, # type: ignore[arg-type] |
There was a problem hiding this comment.
[info] defense: # type: ignore[arg-type] suppresses type error without rationale. dict[str, object] is too wide for the target signature. Consider TypedDict or add explanatory comment.
| completion_drain_timeout: float = 5.0 | ||
| exit_after_stop_delay_ms: int = 120000 | ||
| idle_output_timeout: int = 600 | ||
| max_suppression_seconds: int = 1800 |
There was a problem hiding this comment.
[info] defense: max_suppression_seconds and idle_output_timeout have no range validation. Zero or negative values would cause silent logic bugs (immediate expiry). Consider post_init assertion.
| is_error=True, | ||
| exit_code=-1, | ||
| needs_retry=True, | ||
| retry_reason=RetryReason.STALE, |
There was a problem hiding this comment.
[info] defense: RetryReason.STALE reused for IDLE_STALL termination. Downstream retry telemetry cannot distinguish the two failure modes.
Add io_task.cancel() in the TimeoutError handler to prevent a task leak when proc.communicate() future is abandoned after timeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split shared suppression_start into suppression_start_api and suppression_start_child to prevent elapsed time carry-over when sessions alternate between API connection and child process conditions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
A headless Claude subprocess hung for 82 minutes on a silently stalled HTTPS SSE stream. The TCP connection remained ESTABLISHED with zero traffic, so the
_has_active_api_connectionsuppression gate in_process_monitor.pykept resetting the stale-kill timer every 2-second poll cycle. Neither Channel A nor Channel B detected the stall because Channel A only checks for valid JSONL result records (not raw byte growth), and Channel B's stale detection was suppressed by the ESTABLISHED port-443 connection.Three layers of fix provide immunity:
Stdout idle watchdog (primary) — new
_watch_stdout_idlecoroutine in the anyio task group monitors stdout temp file byte count. FiresTerminationReason.IDLE_STALLif no growth foridle_output_timeoutseconds. NOT suppressed by active API connections — orthogonal to Channel A/B.Bounded suppression (defense-in-depth) —
max_suppression_secondscaps the_has_active_api_connectionand_has_active_child_processessuppression gates so they cannot defer stale kills indefinitely. Counter resets on genuine JSONL activity.Subprocess timeouts —
timeout=added to all unguardedsubprocess.run()andasynciosubprocess calls inremote_resolver,smoke_utils,clone, andtoken_summary_appender.Changed Files
New (★)
tests/execution/test_process_idle_watchdog.py— watchdog coroutine unit teststests/workspace/test_clone_timeouts.py— static analysis: git network commands have timeoutsModified (●)
src/autoskillit/core/_type_enums.py—IDLE_STALLmembers onTerminationReasonandCliSubtypesrc/autoskillit/config/settings.py—idle_output_timeoutandmax_suppression_secondsonRunSkillConfigsrc/autoskillit/execution/_process_race.py—_watch_stdout_idlecoroutine,idle_stallfield onRaceAccumulator/RaceSignals, priority update inresolve_terminationsrc/autoskillit/execution/_process_monitor.py— bounded suppression logic withsuppression_starttrackersrc/autoskillit/execution/process.py— wire watchdog into task group,IDLE_STALLkill dispatchsrc/autoskillit/execution/headless.py—IDLE_STALLrouting in_build_skill_result(retriable)src/autoskillit/execution/session.py— exhaustive match arms forIDLE_STALLsrc/autoskillit/core/_type_subprocess.py— protocol updated with new paramssrc/autoskillit/execution/recording.py— runner implementations updatedsrc/autoskillit/execution/remote_resolver.py— 15s timeout viaasyncio.wait_forsrc/autoskillit/smoke_utils.py—timeout=60on all subprocess callssrc/autoskillit/workspace/clone.py—timeout=300(clone),timeout=120(push)src/autoskillit/hooks/token_summary_appender.py—timeout=30on gh api callsTest Plan
test_watch_stdout_idle_fires_on_silence— watchdog fires IDLE_STALL when stdout stops growingtest_watch_stdout_idle_resets_on_continuous_output— watchdog does NOT fire when output is steadytest_watch_stdout_idle_handles_missing_file— tolerates missing file until it appearstest_resolve_termination_idle_stall_priority— idle stall outranks stale in priority chaintest_resolve_termination_process_exit_beats_idle_stall— process exit still highest prioritytest_resolve_termination_idle_stall_beats_stale— idle stall outranks staletest_race_signals_field_count— sentinel: 7 fields on RaceSignalstest_build_skill_result_idle_stall_is_retriable— IDLE_STALL produces retriable SkillResulttest_run_managed_async_idle_stall_kills_hanging_process— end-to-end integrationtest_stale_suppression_bounded_by_max_duration— bounded suppression fires after captest_stale_suppression_resets_on_genuine_activity— counter resets on real JSONL activitytest_stale_suppression_logs_warning_on_bounded_kill— warning log emittedtimeout=Implementation Plan
Plan file:
.autoskillit/temp/rectify/rectify_idle_stall_watchdog_2026-04-11_140500.md🤖 Generated with Claude Code via AutoSkillit