Add streaming observability + resume to TaskSet.validate()#1169
Conversation
Rewrites TaskSet.validate() to stream results via asyncio.as_completed, append each row to a JSONL sink as it finishes, show a tqdm pass-rate bar, and retry on vf.InfraError. Each row carries instance_id, repo, reason (pass/test_failed/gold_apply_failed/setup_failed/sandbox_error/billing_error/timeout/flaky_pass/flaky_fail), attempts, error_type, and a test_output_tail. Adds resume=True to skip indices already in out_path so a crashed or interrupted run can be continued without redoing completed work, and classifies PaymentRequiredError distinctly so a billing outage is visible rather than buried in setup_failed.
PaymentRequiredError is terminal — retrying won't fix a dead wallet, and grinding through thousands of pending tasks just burns time and API calls. On the first billing_error row, log the abort, cancel all pending tasks, and exit the as_completed loop. Completed rows are preserved in the JSONL so resume=True can pick up from there once billing is restored.
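A minimal sketch of that abort shape, with stand-in coroutines in place of the real per-instance validation (the row schema and client are simplified away; only the cancel-pending-and-break structure matches the description above):

```python
import asyncio

class PaymentRequiredError(Exception):
    """Stand-in for the provider's billing error."""

async def run_all(coros):
    # Wrap in tasks so pending work can be cancelled on the first billing error.
    tasks = [asyncio.ensure_future(c) for c in coros]
    rows = []
    try:
        for fut in asyncio.as_completed(tasks):
            try:
                rows.append(await fut)
            except PaymentRequiredError:
                # Terminal: record the row, cancel everything still pending,
                # and stop consuming the as_completed iterator.
                rows.append({"reason": "billing_error"})
                for t in tasks:
                    t.cancel()
                break
    finally:
        # Drain so cancelled tasks are awaited and nothing is left dangling.
        await asyncio.gather(*tasks, return_exceptions=True)
    return rows

async def ok(i):
    await asyncio.sleep(0.01 * i)
    return {"reason": "pass", "index": i}

async def broke():
    await asyncio.sleep(0.02)
    raise PaymentRequiredError()

rows = asyncio.run(run_all([ok(1), broke(), ok(5), ok(6)]))
```

Completed rows collected before the abort stay in `rows` (and, in the real method, in the JSONL), which is what makes `resume=True` viable after a top-up.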
Fix fd leak in _append_jsonl by wrapping the write in a
`with path.open("a") as f` block instead of calling open().write()
inline. Matches the pattern save_outputs() already uses in
verifiers/utils/save_utils.py.
Drop flaky_retries entirely. The branch that was supposed to produce
reason="flaky_fail" was dead (flaky_outcomes always started [False]
so all() never returned True), and rerunning failed tests to detect
flakiness is something a caller can do by filtering the JSONL — it
doesn't need to be a validate() feature.
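The dead branch reduces to a one-line truth check: seeding the outcome list with the original failure means `all()` can never be satisfied (variable name from the commit message; the surrounding retry loop is assumed, not copied):

```python
# Minimal reconstruction of why the branch was dead:
flaky_outcomes = [False]              # seeded with the original test_failed result
flaky_outcomes += [True, True, True]  # even if every rerun passes...
flaky_all = all(flaky_outcomes)       # ...the seeded False pins this to False
```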
```python
try:
    await client.delete(sb.id)
except Exception:
    pass
```
Sandbox cleanup leaks on task cancellation
High Severity
The sandbox cleanup in _validate_once's finally block catches Exception but not BaseException. When tasks are cancelled (billing-error abort or Ctrl-C), asyncio.CancelledError — a BaseException in Python 3.9+ — propagates through the await client.delete(sb.id) call and escapes the except Exception: pass handler, leaving the sandbox running and leaking resources. This is specifically triggered by the new cancellation paths this PR introduces.
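A self-contained repro of the escape, with `asyncio.sleep` standing in for `await client.delete(sb.id)`:

```python
import asyncio

log = []

async def naive_cleanup():
    try:
        await asyncio.sleep(10)          # stand-in for await client.delete(sb.id)
    except Exception:
        pass                             # CancelledError is a BaseException: not caught
    log.append("naive: delete finished") # skipped when the task is cancelled

async def fixed_cleanup():
    try:
        await asyncio.sleep(10)
    except BaseException:
        log.append("fixed: cleanup ran") # reached even under cancellation
        raise                            # re-raise so the cancel still propagates

async def cancel_midway(coro_fn):
    task = asyncio.ensure_future(coro_fn())
    await asyncio.sleep(0.01)            # let the task reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(cancel_midway(naive_cleanup))
asyncio.run(cancel_midway(fixed_cleanup))
```

Note the `except BaseException` variant only shows the handler is reached; a real fix would also protect the delete call itself (e.g. `asyncio.shield`) so cancellation cannot abort it mid-flight.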
Reviewed by Cursor Bugbot for commit d684e58.
The old classifier hit "timeout" when test_output started with "ERROR:" and the word "timeout" appeared anywhere in pytest's output. That's two unrelated coincidences: pytest emits "ERROR: found no collectors ..." as a first line on collection errors, and its own log often mentions "timeout" (plugin names, test names, warnings). Real wall-clock timeouts were buried inside the caller's broad except Exception and never reached the classifier anyway. Dispatch on exception class instead:

- asyncio.TimeoutError / TimeoutError via isinstance
- CommandTimeoutError / BackgroundJobTimeoutError by class name (keeps prime_sandboxes out of the import graph)
- vf.InfraError → sandbox_error (already class-based)
- PaymentRequiredError → billing_error (already class-based)
- gold_apply_failed is still message-matched since it's a plain RuntimeError raised from our own code

When exc is None and validate_instance returned False, always return "test_failed" — don't guess from the pytest log.
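That dispatch can be sketched as a single function (class names taken from the list above; the InfraError check and the gold-apply message pattern are illustrative stand-ins, not the exact source):

```python
import asyncio

def classify_failure(exc, valid):
    """Sketch of class-based failure classification."""
    if exc is None:
        # validate_instance returned normally: never guess from the pytest log
        return "pass" if valid else "test_failed"
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)):
        return "timeout"
    # Matched by class name so prime_sandboxes stays out of the import graph
    if type(exc).__name__ in {"CommandTimeoutError", "BackgroundJobTimeoutError"}:
        return "timeout"
    if type(exc).__name__ == "PaymentRequiredError":
        return "billing_error"
    if type(exc).__name__ == "InfraError":  # stand-in for vf.InfraError
        return "sandbox_error"
    if isinstance(exc, RuntimeError) and "gold" in str(exc).lower():
        # plain RuntimeError raised from our own code: message-matched
        return "gold_apply_failed"
    return "setup_failed"
```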
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 426dd77.
```python
*,
out_path: str | Path | None = None,
max_retries: int = 0,
resume: bool = False,
```
Resume silently ignored when out_path not provided
Medium Severity
Passing resume=True without out_path silently does nothing — no warning, no error. The resume logic is entirely gated on out_path_p is not None, so completed_indices stays empty and all instances are re-validated from scratch. A user running a multi-hour validation (the PR motivation mentions ~4h for 4.4k rows) could believe resume is active, Ctrl-C, restart, and unknowingly redo all completed work.
Reviewed by Cursor Bugbot for commit 426dd77.
```python
    continue
if isinstance(row.get("index"), int):
    prior_rows.append(row)
    completed_indices.add(row["index"])
```
Resume loads prior rows outside requested validation range
Low Severity
When resuming, ALL rows from the JSONL are loaded into prior_rows regardless of total (derived from n). If a previous run used a larger n, the result list and pass-rate/reason-breakdown summaries include rows with indices outside the current range(total). For example, resuming with n=50 after completing a full n=100 run returns 100 rows instead of ≤50, with skipped also overcounting as len(completed_indices) rather than the number of in-range indices skipped.
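A sketch of range-aware loading that fixes the overcount (function name hypothetical; the tolerance for a torn final line mirrors the snippet's `continue`):

```python
import json
from pathlib import Path

def load_resume_state(out_path: Path, total: int):
    """Only rows whose index falls inside range(total) count as completed."""
    prior_rows, completed_indices = [], set()
    if not out_path.exists():
        return prior_rows, completed_indices
    for line in out_path.read_text().splitlines():
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a torn final line after a crash
        idx = row.get("index")
        if isinstance(idx, int) and 0 <= idx < total:
            prior_rows.append(row)
            completed_indices.add(idx)
    return prior_rows, completed_indices
```

With this, resuming `n=50` after a full `n=100` run skips only in-range indices, and `skipped` / the pass-rate summary count only rows the current run would actually have produced.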
Reviewed by Cursor Bugbot for commit 426dd77.
APIError (and SandboxNotRunningError) from prime_sandboxes are transient infra failures — a common instance is "Read file timed out after 30s: /tmp/job_*.exit" when the background-job exit file hasn't landed yet. They were falling into setup_failed and never retrying. Treat them as sandbox_error so max_retries kicks in. Matched by class name (not isinstance) to keep prime_sandboxes out of this module's import graph.
Replaces the class-name string sets with proper isinstance tuples, imported lazily from prime_sandboxes alongside the existing lazy CreateSandboxRequest import. Expands coverage: SandboxTimeoutError, APITimeoutError, DownloadTimeoutError, UploadTimeoutError all map to "timeout", and SandboxImagePullError joins the retryable sandbox_error bucket. PaymentRequiredError is caught by isinstance now rather than by class name. Pure-LLM tasksets don't import prime_sandboxes (the sandbox-type tuples stay empty for them), so the non-sandbox code path is unchanged.
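The tuple construction might look like the following, shown against an injected module object so the sketch runs without prime_sandboxes installed (class names come from the commit message; the real code would import the module lazily under try/except ImportError and leave the tuples empty on failure):

```python
import types

def build_error_tuples(mod):
    """Build isinstance() tuples from an already-imported sandbox module."""
    def tup(*names):
        found = []
        for name in names:
            cls = getattr(mod, name, None)
            if isinstance(cls, type):
                found.append(cls)
        return tuple(found)

    timeout_errors = tup("SandboxTimeoutError", "APITimeoutError",
                         "DownloadTimeoutError", "UploadTimeoutError")
    sandbox_errors = tup("APIError", "SandboxNotRunningError",
                         "SandboxImagePullError")
    billing_errors = tup("PaymentRequiredError")
    return timeout_errors, sandbox_errors, billing_errors
```

An empty tuple is a safe default: `isinstance(exc, ())` is always False, so the non-sandbox code path genuinely stays unchanged for pure-LLM tasksets.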
The 500-char tail wasn't large enough to capture both halves of a two-run eval script — for SWE tasksets where eval.sh runs pytest on FAIL_TO_PASS and then PASS_TO_PASS, the tail would often show only the passing half's summary while the failing half scrolled off the top, leaving a whole class of rows un-classifiable by post-hoc pattern matching (the "tail_ambiguous" bucket). Add a test_output_tail_chars kwarg (default 2000) and use it in place of the hardcoded 500. 2000 is a reasonable default that comfortably holds both summary blocks plus some tracebacks; callers with very chatty suites can bump it higher at the cost of JSONL size.
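The change itself is a one-line slice, parameterized (kwarg name from the commit message; helper name hypothetical):

```python
def clip_tail(test_output: str, test_output_tail_chars: int = 2000) -> str:
    # Keep only the last N chars; for a two-run eval script the window must
    # span both pytest summary blocks, which 500 chars often did not.
    return test_output[-test_output_tail_chars:] if test_output else ""
```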


Summary
Rewrites `TaskSet.validate()` to stream results and survive interruptions — inspired by how `Environment.a_generate` already streams rollouts through `asyncio.as_completed` with incremental JSONL saves. Previous behavior (`asyncio.gather` with no output until completion) hid progress and lost all work on crash/Ctrl-C.

New kwargs:

- `out_path` — JSONL sink; each completed row is appended (off the event loop) as it finishes.
- `max_retries` — retries on `vf.InfraError` only (sandbox create/exec failures, tunnel drops). Fresh sandbox per retry; `attempts` field records how many.
- `flaky_retries` — optional reruns of `test_failed` rows to distinguish flaky tests from real failures; tags outcomes as `flaky_pass`/`flaky_fail`. Off by default (expensive).
- `resume=True` — skip indices already in `out_path` and append new rows. Enables Ctrl-C → fix → restart without redoing completed work.

Each row now carries: `index`, `instance_id`, `repo`, `valid`, `reason`, `attempts`, `elapsed`, `error`, `error_type`, `test_output_tail` (last 500 chars of `state["test_output"]`).

`reason` values: `pass`, `test_failed`, `gold_apply_failed`, `setup_failed`, `sandbox_error`, `billing_error`, `timeout`, `flaky_pass`, `flaky_fail`. Classification is derived from exception type/message and the taskset-provided `state["test_output"]` — no subclass changes required. `PaymentRequiredError` is surfaced as `billing_error` rather than buried in `setup_failed`, so a billing outage mid-run is immediately visible in the stream.

Display: tqdm progress bar with running pass-rate, plus per-row `logger.info` lines with reason + instance_id.

Motivation: while validating SWE-Lego-Real-Data (4.4k rows, ~4h runtime) we had no visibility into progress, no way to resume, and a mid-run `PaymentRequiredError` outage produced 2094 useless "setup_failed" rows that were indistinguishable from real setup bugs. This change makes bulk validation observable, retry-aware, and recoverable.

Test plan

- `verifiers/envs/experimental/composable/task.py`
- `pre-commit run --files ...` passes (includes AGENTS.md sync)
- `TaskSet` (sandbox-free) end-to-end: first run writes 4 rows, `resume=True` appends 2 more, returned list is the 6-row union with correct `reason` classification (pass/test_failed)
- `PrimeIntellect/SWE-Lego-Real-Data` (5 rows, c=5): streams JSONL as it goes, tqdm bar updates, each row includes `instance_id` + `repo` + `test_output_tail`
- `PaymentRequiredError` outage + credit top-up, `--resume` correctly picks up only the missing 2094 indices
- `validate()` on any existing `SandboxTaskSet` with `out_path=`, confirm tqdm + streaming JSONL look as expected

🤖 Generated with Claude Code
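Because every row lands in the JSONL with a `reason` field, post-hoc triage (e.g. spotting the 2094 billing-outage rows) is a few lines; a sketch against the row schema above (helper name hypothetical):

```python
import collections
import json
from pathlib import Path

def reason_breakdown(out_path):
    """Count rows per `reason` in a validate() JSONL sink."""
    counts = collections.Counter()
    for line in Path(out_path).read_text().splitlines():
        if not line.strip():
            continue
        counts[json.loads(line).get("reason", "unknown")] += 1
    return counts
```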
Note
Medium Risk
Touches core validation execution/concurrency and sandbox lifecycle/error classification, which could affect long-running bulk validations and failure handling; changes are localized but behaviorally significant.
Overview
`TaskSet.validate()` is rewritten to stream validation results as they complete (via `asyncio.as_completed`) and optionally persist each row to a JSONL file (`out_path`) so long runs are observable and partial progress survives crashes.

The method now supports resume mode (skip indices already recorded in the JSONL), retry-on-infra for transient sandbox failures (`max_retries` on `vf.InfraError`/sandbox infra errors), adds tqdm progress/pass-rate logging, and returns a richer per-row result schema including `instance_id`/`repo`, `reason` classification (e.g. `billing_error`, `timeout`, `gold_apply_failed`), and `test_output_tail`. Docs/READMEs are updated to reflect the new `validate()` signature and usage.

Reviewed by Cursor Bugbot for commit 070291e. Bugbot is set up for automated code reviews on this repo.