Conversation
7773606 to
12d9b08
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 777360656e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| done, _ = await asyncio.wait({rollout_task}, timeout=self.timeout_seconds) | ||
| if rollout_task in done: | ||
| return await rollout_task | ||
|
|
||
| rollout_task.cancel() |
There was a problem hiding this comment.
Re-check rollout task completion before forcing timeout
When asyncio.wait(..., timeout=...) returns due to timeout, the rollout task can still complete naturally in the small window before rollout_task.cancel() runs. This branch unconditionally marks the rollout as timed out/truncated, so near-deadline successful rollouts can be mislabeled and have cleanup run again with timeout metadata. Add a rollout_task.done() check immediately before cancellation/timeout state mutation so completed work is returned as-is.
Useful? React with 👍 / 👎.
| state["timed_out"] = True | ||
| state["is_completed"] = True | ||
| state["is_truncated"] = True | ||
| state["stop_condition"] = "timeout_reached" |
There was a problem hiding this comment.
Preserve timeout hook side effects in forced-timeout path
This forced-timeout path writes timed_out/stop_condition directly instead of executing stop-condition logic, so subclass timeout side effects are skipped. In ApiEnv, timeout_reached sets agent_timed_out and ApiEnvMonitorRubric reads that field, so rollouts that time out via this branch can be reported as not timed out in metrics/logs. Route timeout finalization through the same timeout hook (or mirror its side effects) to keep subclass state consistent.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 12d9b08. Configure here.
| await self._render_timing(state) | ||
| await self._cleanup(state) | ||
| await self.render_completion(state) | ||
| return state |
There was a problem hiding this comment.
Rollout timeout races with stop condition causing double cleanup
Medium Severity
CliAgentEnv always sets self.timeout_seconds = 3600.0, so the new rollout method enters the asyncio.wait timeout path. Both timeout_reached (checking start_time) and asyncio.wait fire at nearly the same wall-clock moment (offset by microseconds). When timeout_reached triggers, is_completed calls _cleanup (which includes destroy_sandbox — a network call). Microseconds later asyncio.wait returns, the task is cancelled mid-cleanup, then the outer code calls _cleanup again — running handlers like destroy_sandbox a second time on a partially-cleaned state.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 12d9b08. Configure here.
|
|
||
| ```bash | ||
| prime eval run my-env -x '{"max_turns": 20}' | ||
| prime eval run my-env -x '{"max_turns": 20, "timeout_seconds": 600}' |
There was a problem hiding this comment.
Skills file not updated after docs/evaluation.md change
Low Severity
docs/evaluation.md now documents timeout_seconds as an --extra-env-kwargs option, but skills/evaluate-environments/SKILL.md (line 92) still only shows max_turns as the extra_env_kwargs example. Per project rules, changes to docs/evaluation.md require verifying and updating affected skill files.
Triggered by project rule: BugBot Instructions
Reviewed by Cursor Bugbot for commit 12d9b08. Configure here.
| async def timeout_reached(self, state: State) -> bool: | ||
| if self.timeout_seconds is None: | ||
| return False | ||
| if time.time() - state["timing"]["start_time"] <= self.timeout_seconds: |
There was a problem hiding this comment.
should use time.perf_counter() everywhere, not time.time(). The former is more accurate and robust against things like time-zone changes.


Summary
timeout_secondskwarg toMultiTurnEnvand stop timed out rollouts withtimeout_reachedextra_env_kwargs) instead of adding a first-party eval flagCliAgentEnvtyping fix needed for the repo'stypre-push checkTesting
uv run pytest tests/test_multiturn_env.py tests/test_eval_cli.pyuv run pre-commit run --all-filesToolEnvwith a 10s sleeping tool andtimeout_seconds=5stopped in ~5s withstop_condition=timeout_reachedNote
Medium Risk
Changes the core
MultiTurnEnv.rolloutexecution loop to support wall-clock cancellation, which could affect rollout completion/cleanup behavior and stop-condition reporting across all multi-turn environments.Overview
Adds a nullable
timeout_secondsparameter toMultiTurnEnvand a new built-in stop conditiontimeout_reached, allowing multi-turn rollouts to terminate based on wall-clock time.Updates
MultiTurnEnv.rolloutto enforce the timeout via task cancellation and to mark timed-out states as completed/truncated withstop_condition="timeout_reached", ensuring timing, cleanup, and completion rendering still run.Extends eval configuration/tests to pass
timeout_secondsthroughextra_env_kwargs(CLI + TOML), adjustsCliAgentEnvsandbox resource calculation when the timeout is unset, and updates docs/reference to document the new timeout behavior and configuration.Reviewed by Cursor Bugbot for commit 12d9b08. Bugbot is set up for automated code reviews on this repo. Configure here.