## Summary
Running `swebenchmultimodal` with ACP agents consistently fails on ~60-75% of instances because the runtime API (`runtime.eval.all-hands.dev/start`) cannot start pods for large Docker images. The failure is non-deterministic: the same instance can succeed on one attempt and fail on the next, depending on cluster state (node image caching, scheduling, API load).
## Root Cause

The runtime API `/start` endpoint times out when starting pods with large container images. Two error types:

- `ReadTimeout` (~75% of errors): the HTTP read timeout expires waiting for the pod to start
- `Server disconnected without sending a response` (~25%): the server drops the connection entirely
These are infrastructure-level failures in the runtime API, not benchmark or agent bugs. The runtime either starts within a few seconds or never starts within the timeout window — there is no "slow start" case where a longer timeout would help.
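Both failure modes surface as distinct `httpx` exceptions on the client side. A minimal sketch of classifying them on a `/start` call (the request payload shape is a hypothetical placeholder; only the hostname and error types come from the observations above):

```python
# Minimal sketch: classifying the two observed /start failure modes.
# The JSON payload shape is hypothetical; only the endpoint and the
# exception types are taken from the observed errors.
import httpx

START_URL = "https://runtime.eval.all-hands.dev/start"

def start_runtime(image: str, timeout_s: float = 60.0) -> dict | None:
    try:
        resp = httpx.post(
            START_URL,
            json={"image": image},  # hypothetical payload
            timeout=httpx.Timeout(10.0, read=timeout_s),  # read timeout dominates here
        )
        resp.raise_for_status()
        return resp.json()
    except httpx.ReadTimeout:
        # ~75% of errors: pod never came up within the read timeout
        print(f"ReadTimeout after {timeout_s}s starting {image}")
    except httpx.RemoteProtocolError as exc:
        # ~25% of errors: "Server disconnected without sending a response."
        print(f"Server dropped connection: {exc}")
    return None
```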
## Dataset Composition

The `swebenchmultimodal` dev dataset has 102 instances across 5 repos with vastly different image sizes:
| Repo | Instances | Image Size | Typical Success Rate |
|------|-----------|------------|----------------------|
| Automattic/wp-calypso | 37 (~36% of dataset) | Very large | 0-15% |
| chartjs/Chart.js | 24 | Small | ~90-100% |
| processing/p5.js | 16 | Large | 0-30% |
| markedjs/marked | 14 | Small | 50-90% |
| diegomura/react-pdf | 11 | Medium | ~60-80% |
| **Total** | **102** | | |
**Key insight:** `wp-calypso` (37 instances) and `p5.js` (16 instances) together make up 52% of the dataset and account for >80% of all failures.
## Evidence

### Bumping `api_timeout` from 60s to 600s does NOT help

Branch `fix/swebenchmultimodal-api-timeout-600s` increases the `httpx` read timeout from 60s to 600s. Testing shows:
- Every successful instance starts immediately (0 errors, completes on first try)
- No instance that hit a ReadTimeout ever recovered within the same attempt — even with 600s to wait
- The 600s timeout actively hurts: each failed retry wastes ~10m43s instead of ~70s, making runs 9x slower per failure
- Identical success rates between 60s and 600s timeout runs for the same repos
Timing evidence (600s timeout):
`wp-calypso-26008`: Started 15:38:02 → Timeout 15:48:45 (10m43s) → Timeout 15:59:29 (10m44s) → Timeout 16:10:12 (10m43s)
Three retries × 10m43s = 32 minutes wasted per failed instance (vs 3.5 min with 60s timeout).
### Concurrent dispatches cause 429 rate limiting
Original run dispatched 3 models simultaneously (10 workers each = 30 concurrent requests):
- 574 total 429 errors across all 3 jobs
- Opus was especially impacted: attempts 1 and 2 were completely wiped out
Retry runs dispatched 1 model at a time:
- 6 total 429 errors (effectively zero)
### Success is probabilistic, not deterministic

`wp-calypso-21977` and `wp-calypso-22242` failed in the original run but succeeded (zero errors, clean first try) in the retry. This proves wp-calypso instances ARE processable — the runtime start is a coin flip that depends on node scheduling and image caching.
## Current Run Status (2026-03-28)

### Original runs (benchmarks branch `fix/swebenchmultimodal-api-timeout-600s`, 600s timeout)
| Model | Agent | Succeeded | Failed | Total |
|-------|-------|-----------|--------|-------|
| claude-sonnet-4-5 | acp-claude | 24/102 | 78 | 102 |
| claude-opus-4-6 | acp-claude | 30/102 | 72 | 102 |
| gpt-5.4 | acp-codex | 27/102 | 75 | 102 |
### Retry runs (cleaned archive + single dispatch, still running)
| Model | Agent | Orig Success | Retry Zero-Error | Still Running | Permanent Fail (attempt 1) |
|-------|-------|--------------|------------------|---------------|----------------------------|
| claude-sonnet-4-5 | acp-claude | 24 | +9 | ~36 left | 32 (17 wp-calypso, 8 marked, 6 p5.js, 1 react-pdf) |
| claude-opus-4-6 | acp-claude | 30 | +7 | ~31 left | 35 (19 wp-calypso, 8 marked, 7 p5.js, 1 react-pdf) |
| gpt-5.4 | acp-codex | 27 | +12 | ~30 left | 38 (21 wp-calypso, 8 marked, 8 p5.js, 1 react-pdf) |
## Proposed Solution

The `api_timeout` bump should be reverted (or not merged). Instead:

### 1. Reduce worker concurrency for `swebenchmultimodal`
Set `num_workers=2-3` (instead of 10) to reduce concurrent load on the runtime API. Fewer simultaneous `/start` requests mean fewer 429s and fewer scheduling conflicts for large images.
### 2. Run one model at a time
Never dispatch multiple models for `swebenchmultimodal` simultaneously. Stagger dispatches to avoid 429 storms.
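A hypothetical sketch of sequential dispatch via the `gh` CLI (the `eval.yml` workflow name and `model` input are assumptions, not the real SDK workflow; the point is blocking on each run before dispatching the next):

```python
# Hypothetical sequential dispatcher: one model at a time, never overlapping.
# "eval.yml" and the "model" input are assumptions, not the real SDK workflow.
import subprocess
import time

MODELS = ["claude-sonnet-4-5", "claude-opus-4-6", "gpt-5.4"]

for model in MODELS:
    subprocess.run(
        ["gh", "workflow", "run", "eval.yml", "-f", f"model={model}"],
        check=True,
    )
    time.sleep(15)  # give GitHub a moment to register the new run
    # Grab the id of the run we just dispatched and block until it finishes.
    run_id = subprocess.run(
        ["gh", "run", "list", "--workflow", "eval.yml", "--limit", "1",
         "--json", "databaseId", "--jq", ".[0].databaseId"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["gh", "run", "watch", run_id], check=True)
```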
### 3. Increase `max_retries` to 5-10

Since wp-calypso success is probabilistic, more outer attempts mean more chances. With the 60s timeout, 3 inner retries cost ~3.5 min per failed outer attempt, so 10 outer attempts take ~35 min per persistently failing instance (vs ~320 min with the 600s timeout × 3 retries × 10 attempts).
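As a quick sanity check on that arithmetic (the ~70s and ~643s per-attempt costs come from the timing evidence above):

```python
# Worst-case wall time per persistently failing instance:
# every inner retry is spent waiting on a /start that never succeeds.
def worst_case_minutes(per_try_s: float, inner_retries: int, outer_attempts: int) -> float:
    return per_try_s * inner_retries * outer_attempts / 60

print(worst_case_minutes(70, 3, 10))   # 60s timeout  -> 35.0 min
print(worst_case_minutes(643, 3, 10))  # 600s timeout -> 321.5 min (~320)
```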
### 4. Keep timeout at 60s (or lower)
Fail fast on hopeless attempts. The runtime either starts in seconds or not at all. A shorter timeout means faster iteration through retry cycles.
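Together, items 3 and 4 reduce to a fail-fast outer loop. A sketch, reusing the hypothetical `start_runtime` from the Root Cause section:

```python
# Fail-fast retry loop: short read timeout, many cheap outer attempts.
# Reuses the hypothetical start_runtime() sketched under Root Cause.
import time

def start_with_retries(image: str, max_retries: int = 10) -> dict | None:
    for attempt in range(1, max_retries + 1):
        runtime = start_runtime(image, timeout_s=60.0)  # fail fast
        if runtime is not None:
            return runtime
        # Brief backoff between attempts; success depends on cluster state
        # (caching, scheduling), not wait time, so a long per-attempt
        # timeout buys nothing.
        time.sleep(5 * attempt)
    return None
```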
### 5. (Infrastructure) Pre-pull large images on nodes
The most impactful fix would be ensuring wp-calypso/p5.js images are cached on nodes via a DaemonSet or node image pre-puller. This would make all 102 instances reliably processable.
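The standard pre-puller pattern is a DaemonSet whose init containers pull each large image and exit, leaving a pause container running so the pod stays resident on every node. A minimal sketch that generates such a manifest (all image names below are placeholders, not the real benchmark images):

```python
# Sketch: generate a DaemonSet manifest that pre-pulls large images on every
# node. Image names are placeholders, not the real swebenchmultimodal images.
import yaml  # PyYAML

LARGE_IMAGES = [
    "example.registry/swebench-mm/wp-calypso:latest",  # placeholder
    "example.registry/swebench-mm/p5js:latest",        # placeholder
]

manifest = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "swebench-image-prepuller", "namespace": "evaluation-jobs"},
    "spec": {
        "selector": {"matchLabels": {"app": "swebench-image-prepuller"}},
        "template": {
            "metadata": {"labels": {"app": "swebench-image-prepuller"}},
            "spec": {
                # Each init container forces a pull of one image, then exits.
                "initContainers": [
                    {
                        "name": f"pull-{i}",
                        "image": img,
                        # Assumes the image ships a `true` binary (most do).
                        "command": ["true"],
                    }
                    for i, img in enumerate(LARGE_IMAGES)
                ],
                # Tiny long-running container so the pod stays scheduled.
                "containers": [{"name": "pause", "image": "registry.k8s.io/pause:3.9"}],
            },
        },
    },
}

print(yaml.safe_dump(manifest, sort_keys=False))
```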
## Investigation Links

### Current retry runs

- Sonnet K8s job: `eval-23686021661-claude-son` (namespace: `evaluation-jobs`)
- Opus K8s job: `eval-23686024551-claude-4-6`
- GPT-5.4 K8s job: `eval-23686024050-gpt-5-4`
### GitHub Actions
- Original SDK workflows: OpenHands/software-agent-sdk runs 23673797686 (sonnet/opus), 23673798829 (gpt-5.4)
- Retry SDK workflows: 23686021661 (sonnet), 23686024551 (opus), 23686024050 (gpt-5.4)
### Related

- Branch `fix/swebenchmultimodal-api-timeout-600s` — should NOT be merged (makes things worse)

/cc @OpenHands/benchmarks-team