
Add streaming observability + resume to TaskSet.validate()#1169

Open
hallerite wants to merge 7 commits into main from add-validate-observability

Conversation


@hallerite hallerite commented Apr 17, 2026

Summary

Rewrites TaskSet.validate() to stream results and survive interruptions — inspired by how Environment.a_generate already streams rollouts through asyncio.as_completed with incremental JSONL saves. Previous behavior (asyncio.gather with no output until completion) hid progress and lost all work on crash/Ctrl-C.

New kwargs:

  • out_path — JSONL sink; each completed row is appended (off the event loop) as it finishes.
  • max_retries — retries on vf.InfraError only (sandbox create/exec failures, tunnel drops). Fresh sandbox per retry; attempts field records how many.
  • flaky_retries — optional reruns of test_failed rows to distinguish flaky tests from real failures; tags outcomes as flaky_pass / flaky_fail. Off by default (expensive).
  • resume=True — skip indices already in out_path and append new rows. Enables Ctrl-C → fix → restart without redoing completed work.

Each row now carries: index, instance_id, repo, valid, reason, attempts, elapsed, error, error_type, test_output_tail (last 500 chars of state["test_output"]).

reason values: pass, test_failed, gold_apply_failed, setup_failed, sandbox_error, billing_error, timeout, flaky_pass, flaky_fail. Classification is derived from exception type/message and the taskset-provided state["test_output"] — no subclass changes required. PaymentRequiredError is surfaced as billing_error rather than buried in setup_failed, so a billing outage mid-run is immediately visible in the stream.

Display: tqdm progress bar with running pass-rate, plus per-row logger.info lines with reason + instance_id.

Motivation: while validating SWE-Lego-Real-Data (4.4k rows, ~4h runtime) we had no visibility into progress, no way to resume, and a mid-run PaymentRequiredError outage produced 2094 useless "setup_failed" rows that were indistinguishable from real setup bugs. This change makes bulk validation observable, retry-aware, and recoverable.

Test plan

  • Ruff check / format clean on verifiers/envs/experimental/composable/task.py
  • pre-commit run --files ... passes (includes AGENTS.md sync)
  • Toy TaskSet (sandbox-free) end-to-end: first run writes 4 rows, resume=True appends 2 more, returned list is the 6-row union with correct reason classification (pass / test_failed)
  • Live smoke on PrimeIntellect/SWE-Lego-Real-Data (5 rows, c=5): streams JSONL as it goes, tqdm bar updates, each row includes instance_id + repo + test_output_tail
  • Full run on 4432 rows at c=50/100: JSONL is tailable mid-run, pre-billing cohort (2338 rows) validates at 95.5% pass rate; after a mid-run PaymentRequiredError outage + credit top-up, --resume correctly picks up only the missing 2094 indices
  • Reviewer: run a small validate() on any existing SandboxTaskSet with out_path=, confirm tqdm + streaming JSONL look as expected

🤖 Generated with Claude Code


Note

Medium Risk
Touches core validation execution/concurrency and sandbox lifecycle/error classification, which could affect long-running bulk validations and failure handling; changes are localized but behaviorally significant.

Overview
TaskSet.validate() is rewritten to stream validation results as they complete (via asyncio.as_completed) and optionally persist each row to a JSONL file (out_path) so long runs are observable and partial progress survives crashes.

The method now supports resume mode (skip indices already recorded in the JSONL), retry-on-infra for transient sandbox failures (max_retries on vf.InfraError/sandbox infra errors), adds tqdm progress/pass-rate logging, and returns a richer per-row result schema including instance_id/repo, reason classification (e.g. billing_error, timeout, gold_apply_failed), and test_output_tail. Docs/READMEs are updated to reflect the new validate() signature and usage.
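The streaming core described above can be sketched as a minimal, sandbox-free loop. This is illustrative only (`stream_validate`, the coroutine signature, and the row schema are stand-ins, and the real implementation appends off the event loop), but it shows the `asyncio.as_completed` + incremental JSONL pattern the PR adopts:

```python
import asyncio
import json
from pathlib import Path


async def stream_validate(checks, out_path):
    """Run validation coroutines concurrently, appending each finished
    row to a JSONL sink as it completes (illustrative sketch)."""
    out = Path(out_path)

    async def run(index, coro):
        return index, await coro

    tasks = [asyncio.ensure_future(run(i, c)) for i, c in enumerate(checks)]
    results = []
    for fut in asyncio.as_completed(tasks):
        index, valid = await fut
        row = {"index": index, "valid": valid,
               "reason": "pass" if valid else "test_failed"}
        with out.open("a") as f:  # incremental save: partial progress survives a crash
            f.write(json.dumps(row) + "\n")
        results.append(row)
    return results
```

Because rows are appended as they finish, the JSONL is tailable mid-run and every completed row survives a Ctrl-C.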

Reviewed by Cursor Bugbot for commit 070291e.

Rewrites TaskSet.validate() to stream results via asyncio.as_completed,
append each row to a JSONL sink as it finishes, show a tqdm pass-rate
bar, and retry on vf.InfraError. Each row carries instance_id, repo,
reason (pass/test_failed/gold_apply_failed/setup_failed/sandbox_error/
billing_error/timeout/flaky_pass/flaky_fail), attempts, error_type,
and a test_output_tail. Adds resume=True to skip indices already in
out_path so a crashed or interrupted run can be continued without
redoing completed work, and classifies PaymentRequiredError distinctly
so a billing outage is visible rather than buried in setup_failed.
PaymentRequiredError is terminal — retrying won't fix a dead wallet,
and grinding through thousands of pending tasks just burns time and
API calls. On the first billing_error row, log the abort, cancel all
pending tasks, and exit the as_completed loop. Completed rows are
preserved in the JSONL so resume=True can pick up from there once
billing is restored.
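The abort-on-billing-error flow can be sketched as follows (a simplified model: `run_until_billing_error` and the one-field rows are hypothetical, and the real loop also writes each row to the JSONL sink before checking):

```python
import asyncio


async def run_until_billing_error(tasks):
    """Stop on the first billing_error row: cancel all pending tasks and
    return only the rows completed so far (illustrative sketch)."""
    results = []
    for fut in asyncio.as_completed(tasks):
        row = await fut
        results.append(row)
        if row.get("reason") == "billing_error":
            for t in tasks:
                t.cancel()  # a dead wallet won't recover mid-run
            # Reap cancelled tasks so nothing is left dangling.
            await asyncio.gather(*tasks, return_exceptions=True)
            break
    return results
```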
Comment thread verifiers/envs/experimental/composable/task.py Outdated
Fix fd leak in _append_jsonl by using a `with path.open("a") as f`
helper instead of calling open().write() inline. Matches the pattern
save_outputs() already uses in verifiers/utils/save_utils.py.

Drop flaky_retries entirely. The branch that was supposed to produce
reason="flaky_fail" was dead (flaky_outcomes always started [False]
so all() never returned True), and rerunning failed tests to detect
flakiness is something a caller can do by filtering the JSONL — it
doesn't need to be a validate() feature.
try:
    await client.delete(sb.id)
except Exception:
    pass

Sandbox cleanup leaks on task cancellation

High Severity

The sandbox cleanup in _validate_once's finally block catches Exception but not BaseException. When tasks are cancelled (billing-error abort or Ctrl-C), asyncio.CancelledError — a BaseException since Python 3.8 — propagates through the await client.delete(sb.id) call and escapes the except Exception: pass handler, leaving the sandbox running and leaking resources. This is specifically triggered by the new cancellation paths this PR introduces.
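One possible fix (a sketch, not necessarily what the PR ships): shield the delete so it keeps running even if the enclosing task is cancelled mid-await, re-raising the cancellation afterwards. `delete_sandbox_best_effort` and the `client` interface are hypothetical names here.

```python
import asyncio


async def delete_sandbox_best_effort(client, sandbox_id):
    """Best-effort sandbox cleanup that survives task cancellation.

    asyncio.shield lets client.delete() continue even if the enclosing
    task is cancelled; CancelledError is re-raised so the cancellation
    itself still propagates to the caller.
    """
    try:
        await asyncio.shield(client.delete(sandbox_id))
    except asyncio.CancelledError:
        raise  # propagate cancellation; the shielded delete was already started
    except Exception:
        pass  # delete failures must not mask the validation result
```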


Reviewed by Cursor Bugbot for commit d684e58.

Comment thread verifiers/envs/experimental/composable/task.py
The old classifier hit "timeout" when test_output started with
"ERROR:" and the word "timeout" appeared anywhere in pytest's output.
That's two unrelated coincidences: pytest emits "ERROR: found no
collectors ..." as a first line on collection errors, and its own
log often mentions "timeout" (plugin names, test names, warnings).
Real wall-clock timeouts were buried inside the caller's broad
except Exception and never reached the classifier anyway.

Dispatch on exception class instead:
- asyncio.TimeoutError / TimeoutError via isinstance
- CommandTimeoutError / BackgroundJobTimeoutError by class name
  (keeps prime_sandboxes out of the import graph)
- vf.InfraError → sandbox_error (already class-based)
- PaymentRequiredError → billing_error (already class-based)
- gold_apply_failed is still message-matched since it's a plain
  RuntimeError raised from our own code
When exc is None and validate_instance returned False, always
return "test_failed" — don't guess from the pytest log.
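The dispatch described above can be sketched like this. It is a simplified model: `classify` is a hypothetical name, and `InfraError`/`PaymentRequiredError` are matched by class name here for self-containment, whereas the real code can isinstance-check `vf.InfraError` directly.

```python
import asyncio

# Sandbox-side timeouts matched by name so prime_sandboxes stays unimported.
SANDBOX_TIMEOUT_NAMES = {"CommandTimeoutError", "BackgroundJobTimeoutError"}


def classify(exc, valid):
    """Map an exception (or a clean False result) to a reason string."""
    if exc is None:
        # No exception: trust the boolean, never guess from the pytest log.
        return "pass" if valid else "test_failed"
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)):
        return "timeout"
    name = type(exc).__name__
    if name in SANDBOX_TIMEOUT_NAMES:
        return "timeout"
    if name == "PaymentRequiredError":
        return "billing_error"
    if name == "InfraError":
        return "sandbox_error"
    if isinstance(exc, RuntimeError) and "gold" in str(exc).lower():
        return "gold_apply_failed"  # message-matched: plain RuntimeError from our code
    return "setup_failed"
```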

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 426dd77.

*,
out_path: str | Path | None = None,
max_retries: int = 0,
resume: bool = False,

Resume silently ignored when out_path not provided

Medium Severity

Passing resume=True without out_path silently does nothing — no warning, no error. The resume logic is entirely gated on out_path_p is not None, so completed_indices stays empty and all instances are re-validated from scratch. A user running a multi-hour validation (the PR motivation mentions ~4h for 4.4k rows) could believe resume is active, Ctrl-C, restart, and unknowingly redo all completed work.
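One straightforward fix (a sketch, not necessarily what the PR ships) is to fail fast on the inconsistent combination; `check_resume_args` is a hypothetical helper name:

```python
def check_resume_args(resume: bool, out_path) -> None:
    # Raise instead of silently ignoring resume=True without a JSONL sink,
    # so a multi-hour run can't falsely appear resumable.
    if resume and out_path is None:
        raise ValueError("resume=True requires out_path to locate prior rows")
```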


            continue
        if isinstance(row.get("index"), int):
            prior_rows.append(row)
            completed_indices.add(row["index"])

Resume loads prior rows outside requested validation range

Low Severity

When resuming, ALL rows from the JSONL are loaded into prior_rows regardless of total (derived from n). If a previous run used a larger n, the result list and pass-rate/reason-breakdown summaries include rows with indices outside the current range(total). For example, resuming with n=50 after completing a full n=100 run returns 100 rows instead of ≤50, with skipped also overcounting as len(completed_indices) rather than the number of in-range indices skipped.
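A range-aware loader is one way to address this; the sketch below (hypothetical `load_prior_rows` helper, operating on raw JSONL lines) keeps only rows whose index falls inside the current run:

```python
import json


def load_prior_rows(jsonl_lines, total):
    """Collect prior rows and completed indices, ignoring any row whose
    index falls outside range(total), so a smaller resumed run doesn't
    inherit out-of-range rows or overcount skips."""
    prior_rows, completed = [], set()
    for line in jsonl_lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a torn final line from an interrupted write
        idx = row.get("index")
        if isinstance(idx, int) and 0 <= idx < total:
            prior_rows.append(row)
            completed.add(idx)
    return prior_rows, completed
```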


APIError (and SandboxNotRunningError) from prime_sandboxes are
transient infra failures — a common instance is "Read file timed
out after 30s: /tmp/job_*.exit" when the background-job exit file
hasn't landed yet. They were falling into setup_failed and never
retrying. Treat them as sandbox_error so max_retries kicks in.

Matched by class name (not isinstance) to keep prime_sandboxes
out of this module's import graph.
Replaces the class-name string sets with proper isinstance tuples,
imported lazily from prime_sandboxes alongside the existing lazy
CreateSandboxRequest import. Expands coverage: SandboxTimeoutError,
APITimeoutError, DownloadTimeoutError, UploadTimeoutError all map
to "timeout", and SandboxImagePullError joins the retryable
sandbox_error bucket. PaymentRequiredError is caught by isinstance
now rather than by class name.

Pure-LLM tasksets don't import prime_sandboxes (the sandbox-type
tuples stay empty for them), so the non-sandbox code path is
unchanged.
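The lazy-import pattern might look like the sketch below. The exception names come from the commit messages above; treat both the names and the `getattr` fallback as assumptions about prime_sandboxes rather than its confirmed API.

```python
def sandbox_exception_tuples():
    """Lazily build isinstance tuples from prime_sandboxes (sketch).

    Pure-LLM tasksets never import prime_sandboxes, so the tuples stay
    empty and the non-sandbox code path is unchanged.
    """
    try:
        import prime_sandboxes as ps  # lazy: only sandbox tasksets pay for this
    except ImportError:
        return (), ()
    timeout_excs = tuple(
        exc for name in ("SandboxTimeoutError", "APITimeoutError",
                         "DownloadTimeoutError", "UploadTimeoutError")
        if (exc := getattr(ps, name, None)) is not None
    )
    infra_excs = tuple(
        exc for name in ("APIError", "SandboxNotRunningError",
                         "SandboxImagePullError")
        if (exc := getattr(ps, name, None)) is not None
    )
    return timeout_excs, infra_excs
```

The classifier can then use `isinstance(exc, timeout_excs)` and `isinstance(exc, infra_excs)`, which degrade gracefully to never-matching empty tuples when the library is absent.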
The 500-char tail wasn't large enough to capture both halves of a
two-run eval script — for SWE tasksets where eval.sh runs pytest on
FAIL_TO_PASS and then PASS_TO_PASS, the tail would often show only
the passing half's summary while the failing half scrolled off the
top, leaving a whole class of rows un-classifiable by post-hoc
pattern matching (the "tail_ambiguous" bucket).

Add a test_output_tail_chars kwarg (default 2000) and use it in
place of the hardcoded 500. 2000 is a reasonable default that
comfortably holds both summary blocks plus some tracebacks; callers
with very chatty suites can bump it higher at the cost of JSONL
size.