fix: surface agent timeout as state['error'] in CliAgentEnv#1170

Merged
rasdani merged 1 commit into main from fix/cli-agent-timeout-surfaces-error
Apr 18, 2026

@rasdani rasdani commented Apr 17, 2026

Problem

CliAgentEnv.wait_for_completion catches asyncio.TimeoutError but only sets state["agent_timed_out"] = True, leaving state["error"] unset. When the agent times out without producing any trajectory, the rollout comes back with trajectory=[], error=None, and the orchestrator scheduler reschedules it as a generic "Empty trajectory" with no diagnostic — indistinguishable from a genuinely silent empty rollout.

Fix

In the asyncio.TimeoutError branch, also set state["error"] = AgentError(f"Agent timed out after {self.timeout_seconds}s"), mirroring the style of the except Exception branch directly below. agent_timed_out=True is preserved for existing downstream checks elsewhere in the file.
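
A minimal sketch of the patched branch, assuming a simplified free-function version of wait_for_completion. The AgentError class here is a local stand-in, the agent_job parameter and the run_demo helper are hypothetical, and the message in the generic except Exception branch is an assumption, since the PR does not quote it:

```python
import asyncio


class AgentError(Exception):
    """Local stand-in for the library's AgentError (assumption for this sketch)."""


async def wait_for_completion(agent_job, state: dict, timeout_seconds: float) -> None:
    """Simplified, hypothetical version of the patched method."""
    try:
        await asyncio.wait_for(agent_job, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Existing flag, preserved for downstream checks.
        state["agent_timed_out"] = True
        # New in this PR: record the timeout as a real error.
        state["error"] = AgentError(f"Agent timed out after {timeout_seconds}s")
    except Exception as e:
        # Assumed shape of the generic branch; the real message may differ.
        state["error"] = AgentError(f"Agent failed: {e}")


async def _slow_agent() -> None:
    # Hypothetical agent job that outlives the timeout.
    await asyncio.sleep(1.0)


def run_demo() -> dict:
    state = {"trajectory": [], "error": None}
    asyncio.run(wait_for_completion(_slow_agent(), state, timeout_seconds=0.01))
    return state
```

Calling run_demo() returns a state with both agent_timed_out set and an AgentError in state["error"], so an empty trajectory caused by a timeout carries its own diagnostic.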

Context

Third in a series closing silent-failure holes in CliAgentEnv:

A companion fix in prime-rl reorders the scheduler's if empty / elif error so that the error branch is checked first (otherwise this surfaced error would still be hidden behind the empty-trajectory branch). See companion PR in prime-rl.
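
The effect of the branch ordering can be sketched as follows. This is not the prime-rl scheduler code; the function names and the returned strings (other than "Empty trajectory") are hypothetical and only illustrate why the error check has to come first:

```python
from typing import Any, Optional


def reason_old_order(trajectory: list, error: Optional[Any]) -> Optional[str]:
    """Empty check first: an error on an empty rollout is never reported."""
    if not trajectory:
        return "Empty trajectory"
    elif error is not None:
        return f"Rollout error: {error}"
    return None


def reason_new_order(trajectory: list, error: Optional[Any]) -> Optional[str]:
    """Error check first: a timed-out rollout keeps its diagnostic."""
    if error is not None:
        return f"Rollout error: {error}"
    elif not trajectory:
        return "Empty trajectory"
    return None
```

With trajectory=[] and a timeout error set, the old ordering still reports "Empty trajectory", while the new ordering reports the error. A rollout that is empty with error=None is reported as "Empty trajectory" under both.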

Tests

No regression test was added. The existing tests/test_cli_agent_env.py covers timeout_reached (the stop-condition) but does not exercise wait_for_completion's asyncio.TimeoutError path. Adding such a test would require mocking asyncio.wait_for and the background-job polling loop, which felt out of scope for a 3-line fix — flagged here for a follow-up if desired.
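
For the follow-up, a regression test might look roughly like the sketch below. It runs against a hypothetical stub rather than the real CliAgentEnv: FakeCliAgentEnv, its _poll_job method, and the 30-second timeout are assumptions made so the example is self-contained. Only the mocking approach (patching asyncio.wait_for and the polling loop) comes from the description above.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock, patch


class AgentError(Exception):
    """Local stand-in for the library's AgentError (assumption for this sketch)."""


class FakeCliAgentEnv:
    """Hypothetical stub holding only what the timeout branch touches."""

    timeout_seconds = 30.0

    async def _poll_job(self, state: dict) -> None:
        # Placeholder for the real background-job polling loop.
        await asyncio.sleep(0)

    async def wait_for_completion(self, state: dict) -> None:
        try:
            await asyncio.wait_for(self._poll_job(state), timeout=self.timeout_seconds)
        except asyncio.TimeoutError:
            state["agent_timed_out"] = True
            state["error"] = AgentError(
                f"Agent timed out after {self.timeout_seconds}s"
            )


def run_timeout_case() -> dict:
    env = FakeCliAgentEnv()
    state: dict = {"trajectory": []}
    # Replace the polling loop with a plain mock so no coroutine is left
    # un-awaited, and make asyncio.wait_for raise TimeoutError immediately.
    with patch.object(env, "_poll_job", new=MagicMock()), patch(
        "asyncio.wait_for", new=AsyncMock(side_effect=asyncio.TimeoutError)
    ):
        asyncio.run(env.wait_for_completion(state))
    return state


def test_timeout_sets_error() -> None:
    state = run_timeout_case()
    assert state["agent_timed_out"] is True
    assert isinstance(state["error"], AgentError)
    assert "timed out after 30.0s" in str(state["error"])
```

Because asyncio.wait_for is patched, the test finishes immediately instead of waiting out a real timeout.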

Checks

  • uv run ruff check verifiers/envs/experimental/cli_agent_env.py — passed
  • uv run ruff format --check verifiers/envs/experimental/cli_agent_env.py — already formatted

Note

Low Risk
Small, localized change to timeout handling that only affects how errors are surfaced in rollout state; minimal behavioral impact beyond improved diagnostics and potential downstream branching on state["error"].

Overview
CliAgentEnv.wait_for_completion now records timeouts as a real failure by setting state["error"] to an AgentError when asyncio.wait_for hits TimeoutError, in addition to the existing state["agent_timed_out"] = True flag.

This ensures timeout rollouts surface a diagnostic error instead of looking like a silent empty trajectory to downstream schedulers/consumers.

Reviewed by Cursor Bugbot for commit e88fd94. Bugbot is set up for automated code reviews on this repo.

When the agent background job exceeds timeout_seconds, wait_for_completion
previously set only agent_timed_out=True and left state["error"] unset.
Rollouts with no trajectory then returned error=None, which the orchestrator
scheduler logs as a generic "Empty trajectory" with no root cause.

Mirror the behavior of the Exception branch and set state["error"] to an
AgentError naming the timeout so downstream consumers can distinguish a
timed-out rollout from a silently empty one. agent_timed_out=True is kept
for existing downstream checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rasdani rasdani merged commit b3a7255 into main Apr 18, 2026
6 checks passed