
Add streaming observability + resume to TaskSet.validate()#1169

Open
hallerite wants to merge 7 commits into main from add-validate-observability

Conversation


@hallerite hallerite commented Apr 17, 2026

Summary

Rewrites TaskSet.validate() to stream results and survive interruptions — inspired by how Environment.a_generate already streams rollouts through asyncio.as_completed with incremental JSONL saves. Previous behavior (asyncio.gather with no output until completion) hid progress and lost all work on crash/Ctrl-C.

New kwargs:

  • out_path — JSONL sink; each completed row is appended (off the event loop) as it finishes.
  • max_retries — retries on vf.InfraError only (sandbox create/exec failures, tunnel drops). Fresh sandbox per retry; attempts field records how many.
  • flaky_retries — optional reruns of test_failed rows to distinguish flaky tests from real failures; tags outcomes as flaky_pass / flaky_fail. Off by default (expensive).
  • resume=True — skip indices already in out_path and append new rows. Enables Ctrl-C → fix → restart without redoing completed work.

Each row now carries: index, instance_id, repo, valid, reason, attempts, elapsed, error, error_type, test_output_tail (last 500 chars of state["test_output"]).

reason values: pass, test_failed, gold_apply_failed, setup_failed, sandbox_error, billing_error, timeout, flaky_pass, flaky_fail. Classification is derived from exception type/message and the taskset-provided state["test_output"] — no subclass changes required. PaymentRequiredError is surfaced as billing_error rather than buried in setup_failed, so a billing outage mid-run is immediately visible in the stream.

Display: tqdm progress bar with running pass-rate, plus per-row logger.info lines with reason + instance_id.

Motivation: while validating SWE-Lego-Real-Data (4.4k rows, ~4h runtime) we had no visibility into progress, no way to resume, and a mid-run PaymentRequiredError outage produced 2094 useless "setup_failed" rows that were indistinguishable from real setup bugs. This change makes bulk validation observable, retry-aware, and recoverable.

Test plan

  • Ruff check / format clean on verifiers/envs/experimental/composable/task.py
  • pre-commit run --files ... passes (includes AGENTS.md sync)
  • Toy TaskSet (sandbox-free) end-to-end: first run writes 4 rows, resume=True appends 2 more, returned list is the 6-row union with correct reason classification (pass / test_failed)
  • Live smoke on PrimeIntellect/SWE-Lego-Real-Data (5 rows, c=5): streams JSONL as it goes, tqdm bar updates, each row includes instance_id + repo + test_output_tail
  • Full run on 4432 rows at c=50/100: JSONL is tailable mid-run, pre-billing cohort (2338 rows) validates at 95.5% pass rate; after a mid-run PaymentRequiredError outage + credit top-up, --resume correctly picks up only the missing 2094 indices
  • Reviewer: run a small validate() on any existing SandboxTaskSet with out_path=, confirm tqdm + streaming JSONL look as expected

🤖 Generated with Claude Code


Note

Medium Risk
Touches core validation execution/concurrency and sandbox lifecycle/error classification, which could affect long-running bulk validations and failure handling; changes are localized but behaviorally significant.

Overview
TaskSet.validate() is rewritten to stream validation results as they complete (via asyncio.as_completed) and optionally persist each row to a JSONL file (out_path) so long runs are observable and partial progress survives crashes.

The method now supports resume mode (skip indices already recorded in the JSONL), retry-on-infra for transient sandbox failures (max_retries on vf.InfraError/sandbox infra errors), adds tqdm progress/pass-rate logging, and returns a richer per-row result schema including instance_id/repo, reason classification (e.g. billing_error, timeout, gold_apply_failed), and test_output_tail. Docs/READMEs are updated to reflect the new validate() signature and usage.
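The streaming core described above can be sketched as a minimal, sandbox-free loop. This is illustrative only (`stream_validate`, the coroutine signature, and the row schema are stand-ins, and the real implementation appends off the event loop), but it shows the `asyncio.as_completed` + incremental JSONL pattern the PR adopts:

```python
import asyncio
import json
from pathlib import Path


async def stream_validate(checks, out_path):
    """Run validation coroutines concurrently, appending each finished
    row to a JSONL sink as it completes (illustrative sketch)."""
    out = Path(out_path)

    async def run(index, coro):
        return index, await coro

    tasks = [asyncio.ensure_future(run(i, c)) for i, c in enumerate(checks)]
    results = []
    for fut in asyncio.as_completed(tasks):
        index, valid = await fut
        row = {"index": index, "valid": valid,
               "reason": "pass" if valid else "test_failed"}
        with out.open("a") as f:  # incremental save: partial progress survives a crash
            f.write(json.dumps(row) + "\n")
        results.append(row)
    return results
```

Because rows are appended as they finish, the JSONL is tailable mid-run and every completed row survives a Ctrl-C.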

Reviewed by Cursor Bugbot for commit 070291e.

Rewrites TaskSet.validate() to stream results via asyncio.as_completed,
append each row to a JSONL sink as it finishes, show a tqdm pass-rate
bar, and retry on vf.InfraError. Each row carries instance_id, repo,
reason (pass/test_failed/gold_apply_failed/setup_failed/sandbox_error/
billing_error/timeout/flaky_pass/flaky_fail), attempts, error_type,
and a test_output_tail. Adds resume=True to skip indices already in
out_path so a crashed or interrupted run can be continued without
redoing completed work, and classifies PaymentRequiredError distinctly
so a billing outage is visible rather than buried in setup_failed.
PaymentRequiredError is terminal — retrying won't fix a dead wallet,
and grinding through thousands of pending tasks just burns time and
API calls. On the first billing_error row, log the abort, cancel all
pending tasks, and exit the as_completed loop. Completed rows are
preserved in the JSONL so resume=True can pick up from there once
billing is restored.
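The abort-on-billing-error flow can be sketched as follows (a simplified model: `run_until_billing_error` and the one-field rows are hypothetical, and the real loop also writes each row to the JSONL sink before checking):

```python
import asyncio


async def run_until_billing_error(tasks):
    """Stop on the first billing_error row: cancel all pending tasks and
    return only the rows completed so far (illustrative sketch)."""
    results = []
    for fut in asyncio.as_completed(tasks):
        row = await fut
        results.append(row)
        if row.get("reason") == "billing_error":
            for t in tasks:
                t.cancel()  # a dead wallet won't recover mid-run
            # Reap cancelled tasks so nothing is left dangling.
            await asyncio.gather(*tasks, return_exceptions=True)
            break
    return results
```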
Comment thread verifiers/envs/experimental/composable/task.py Outdated
Fix fd leak in _append_jsonl by using a `with path.open("a") as f`
helper instead of calling open().write() inline. Matches the pattern
save_outputs() already uses in verifiers/utils/save_utils.py.

Drop flaky_retries entirely. The branch that was supposed to produce
reason="flaky_fail" was dead (flaky_outcomes always started [False]
so all() never returned True), and rerunning failed tests to detect
flakiness is something a caller can do by filtering the JSONL — it
doesn't need to be a validate() feature.
try:
    await client.delete(sb.id)
except Exception:
    pass

Sandbox cleanup leaks on task cancellation

High Severity

The sandbox cleanup in _validate_once's finally block catches Exception but not BaseException. When tasks are cancelled (billing-error abort or Ctrl-C), asyncio.CancelledError — a BaseException since Python 3.8 — propagates through the await client.delete(sb.id) call and escapes the except Exception: pass handler, leaving the sandbox running and leaking resources. This is specifically triggered by the new cancellation paths this PR introduces.
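One possible fix (a sketch, not necessarily what the PR ships): shield the delete so it keeps running even if the enclosing task is cancelled mid-await, re-raising the cancellation afterwards. `delete_sandbox_best_effort` and the `client` interface are hypothetical names here.

```python
import asyncio


async def delete_sandbox_best_effort(client, sandbox_id):
    """Best-effort sandbox cleanup that survives task cancellation.

    asyncio.shield lets client.delete() continue even if the enclosing
    task is cancelled; CancelledError is re-raised so the cancellation
    itself still propagates to the caller.
    """
    try:
        await asyncio.shield(client.delete(sandbox_id))
    except asyncio.CancelledError:
        raise  # propagate cancellation; the shielded delete was already started
    except Exception:
        pass  # delete failures must not mask the validation result
```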


Reviewed by Cursor Bugbot for commit d684e58.

Comment thread verifiers/envs/experimental/composable/task.py
The old classifier hit "timeout" when test_output started with
"ERROR:" and the word "timeout" appeared anywhere in pytest's output.
That's two unrelated coincidences: pytest emits "ERROR: found no
collectors ..." as a first line on collection errors, and its own
log often mentions "timeout" (plugin names, test names, warnings).
Real wall-clock timeouts were buried inside the caller's broad
except Exception and never reached the classifier anyway.

Dispatch on exception class instead:
- asyncio.TimeoutError / TimeoutError via isinstance
- CommandTimeoutError / BackgroundJobTimeoutError by class name
  (keeps prime_sandboxes out of the import graph)
- vf.InfraError → sandbox_error (already class-based)
- PaymentRequiredError → billing_error (already class-based)
- gold_apply_failed is still message-matched since it's a plain
  RuntimeError raised from our own code
When exc is None and validate_instance returned False, always
return "test_failed" — don't guess from the pytest log.
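The dispatch described above can be sketched like this. It is a simplified model: `classify` is a hypothetical name, and `InfraError`/`PaymentRequiredError` are matched by class name here for self-containment, whereas the real code can isinstance-check `vf.InfraError` directly.

```python
import asyncio

# Sandbox-side timeouts matched by name so prime_sandboxes stays unimported.
SANDBOX_TIMEOUT_NAMES = {"CommandTimeoutError", "BackgroundJobTimeoutError"}


def classify(exc, valid):
    """Map an exception (or a clean False result) to a reason string."""
    if exc is None:
        # No exception: trust the boolean, never guess from the pytest log.
        return "pass" if valid else "test_failed"
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)):
        return "timeout"
    name = type(exc).__name__
    if name in SANDBOX_TIMEOUT_NAMES:
        return "timeout"
    if name == "PaymentRequiredError":
        return "billing_error"
    if name == "InfraError":
        return "sandbox_error"
    if isinstance(exc, RuntimeError) and "gold" in str(exc).lower():
        return "gold_apply_failed"  # message-matched: plain RuntimeError from our code
    return "setup_failed"
```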

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 426dd77.

*,
out_path: str | Path | None = None,
max_retries: int = 0,
resume: bool = False,

Resume silently ignored when out_path not provided

Medium Severity

Passing resume=True without out_path silently does nothing — no warning, no error. The resume logic is entirely gated on out_path_p is not None, so completed_indices stays empty and all instances are re-validated from scratch. A user running a multi-hour validation (the PR motivation mentions ~4h for 4.4k rows) could believe resume is active, Ctrl-C, restart, and unknowingly redo all completed work.
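One straightforward fix (a sketch, not necessarily what the PR ships) is to fail fast on the inconsistent combination; `check_resume_args` is a hypothetical helper name:

```python
def check_resume_args(resume: bool, out_path) -> None:
    # Raise instead of silently ignoring resume=True without a JSONL sink,
    # so a multi-hour run can't falsely appear resumable.
    if resume and out_path is None:
        raise ValueError("resume=True requires out_path to locate prior rows")
```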


            continue
        if isinstance(row.get("index"), int):
            prior_rows.append(row)
            completed_indices.add(row["index"])

Resume loads prior rows outside requested validation range

Low Severity

When resuming, ALL rows from the JSONL are loaded into prior_rows regardless of total (derived from n). If a previous run used a larger n, the result list and pass-rate/reason-breakdown summaries include rows with indices outside the current range(total). For example, resuming with n=50 after completing a full n=100 run returns 100 rows instead of ≤50, with skipped also overcounting as len(completed_indices) rather than the number of in-range indices skipped.
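A range-aware loader is one way to address this; the sketch below (hypothetical `load_prior_rows` helper, operating on raw JSONL lines) keeps only rows whose index falls inside the current run:

```python
import json


def load_prior_rows(jsonl_lines, total):
    """Collect prior rows and completed indices, ignoring any row whose
    index falls outside range(total), so a smaller resumed run doesn't
    inherit out-of-range rows or overcount skips."""
    prior_rows, completed = [], set()
    for line in jsonl_lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a torn final line from an interrupted write
        idx = row.get("index")
        if isinstance(idx, int) and 0 <= idx < total:
            prior_rows.append(row)
            completed.add(idx)
    return prior_rows, completed
```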


APIError (and SandboxNotRunningError) from prime_sandboxes are
transient infra failures — a common instance is "Read file timed
out after 30s: /tmp/job_*.exit" when the background-job exit file
hasn't landed yet. They were falling into setup_failed and never
retrying. Treat them as sandbox_error so max_retries kicks in.

Matched by class name (not isinstance) to keep prime_sandboxes
out of this module's import graph.
Replaces the class-name string sets with proper isinstance tuples,
imported lazily from prime_sandboxes alongside the existing lazy
CreateSandboxRequest import. Expands coverage: SandboxTimeoutError,
APITimeoutError, DownloadTimeoutError, UploadTimeoutError all map
to "timeout", and SandboxImagePullError joins the retryable
sandbox_error bucket. PaymentRequiredError is caught by isinstance
now rather than by class name.

Pure-LLM tasksets don't import prime_sandboxes (the sandbox-type
tuples stay empty for them), so the non-sandbox code path is
unchanged.
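The lazy-import pattern might look like the sketch below. The exception names come from the commit messages above; treat both the names and the `getattr` fallback as assumptions about prime_sandboxes rather than its confirmed API.

```python
def sandbox_exception_tuples():
    """Lazily build isinstance tuples from prime_sandboxes (sketch).

    Pure-LLM tasksets never import prime_sandboxes, so the tuples stay
    empty and the non-sandbox code path is unchanged.
    """
    try:
        import prime_sandboxes as ps  # lazy: only sandbox tasksets pay for this
    except ImportError:
        return (), ()
    timeout_excs = tuple(
        exc for name in ("SandboxTimeoutError", "APITimeoutError",
                         "DownloadTimeoutError", "UploadTimeoutError")
        if (exc := getattr(ps, name, None)) is not None
    )
    infra_excs = tuple(
        exc for name in ("APIError", "SandboxNotRunningError",
                         "SandboxImagePullError")
        if (exc := getattr(ps, name, None)) is not None
    )
    return timeout_excs, infra_excs
```

The classifier can then use `isinstance(exc, timeout_excs)` and `isinstance(exc, infra_excs)`, which degrade gracefully to never-matching empty tuples when the library is absent.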
The 500-char tail wasn't large enough to capture both halves of a
two-run eval script — for SWE tasksets where eval.sh runs pytest on
FAIL_TO_PASS and then PASS_TO_PASS, the tail would often show only
the passing half's summary while the failing half scrolled off the
top, leaving a whole class of rows un-classifiable by post-hoc
pattern matching (the "tail_ambiguous" bucket).

Add a test_output_tail_chars kwarg (default 2000) and use it in
place of the hardcoded 500. 2000 is a reasonable default that
comfortably holds both summary blocks plus some tracebacks; callers
with very chatty suites can bump it higher at the cost of JSONL
size.