## Summary
Running `swebenchmultimodal` with ACP agents consistently fails on ~60-75% of instances because the runtime API (`runtime.eval.all-hands.dev/start`) cannot start pods for large Docker images. The failure is non-deterministic: the same instance can succeed on one attempt and fail on the next, depending on cluster state (node image caching, scheduling, API load).
## Root Cause

The runtime API `/start` endpoint times out when starting pods with large container images. Two error types:

- `ReadTimeout` (~75% of errors): the HTTP read timeout expires waiting for the pod to start
- `Server disconnected without sending a response` (~25%): the server drops the connection entirely
These are infrastructure-level failures in the runtime API, not benchmark or agent bugs. The runtime either starts within a few seconds or never starts within the timeout window — there is no "slow start" case where a longer timeout would help.
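Both failure modes surface as distinct `httpx` exceptions on the client side. A minimal sketch of classifying them on a `/start` call (the request payload shape is a hypothetical placeholder; only the hostname and error types come from the observations above):

```python
# Minimal sketch: classifying the two observed /start failure modes.
# The JSON payload shape is hypothetical; only the endpoint and the
# exception types are taken from the observed errors.
import httpx

START_URL = "https://runtime.eval.all-hands.dev/start"

def start_runtime(image: str, timeout_s: float = 60.0) -> dict | None:
    try:
        resp = httpx.post(
            START_URL,
            json={"image": image},  # hypothetical payload
            timeout=httpx.Timeout(10.0, read=timeout_s),  # read timeout dominates here
        )
        resp.raise_for_status()
        return resp.json()
    except httpx.ReadTimeout:
        # ~75% of errors: pod never came up within the read timeout
        print(f"ReadTimeout after {timeout_s}s starting {image}")
    except httpx.RemoteProtocolError as exc:
        # ~25% of errors: "Server disconnected without sending a response."
        print(f"Server dropped connection: {exc}")
    return None
```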
## Dataset Composition

The `swebenchmultimodal` dev dataset has 102 instances across 5 repos with vastly different image sizes:
| Repo | Instances | Image Size | Typical Success Rate |
|------|-----------|------------|----------------------|
| Automattic/wp-calypso | 37 (~36% of dataset) | Very large | 0-15% |
| chartjs/Chart.js | 24 | Small | ~90-100% |
| processing/p5.js | 16 | Large | 0-30% |
| markedjs/marked | 14 | Small | 50-90% |
| diegomura/react-pdf | 11 | Medium | ~60-80% |
| **Total** | **102** | | |
**Key insight:** `wp-calypso` (37 instances) and `p5.js` (16 instances) together make up 52% of the dataset and account for >80% of all failures.
## Evidence

### Bumping `api_timeout` from 60s to 600s does NOT help

Branch `fix/swebenchmultimodal-api-timeout-600s` increases the `httpx` read timeout from 60s to 600s. Testing shows:
- Every successful instance starts immediately (0 errors, completes on first try)
- No instance that hit a ReadTimeout ever recovered within the same attempt — even with 600s to wait
- The 600s timeout actively hurts: each failed retry wastes ~10m43s instead of ~70s, making runs 9x slower per failure
- Identical success rates between 60s and 600s timeout runs for the same repos
Timing evidence (600s timeout):
`wp-calypso-26008`: Started 15:38:02 → Timeout 15:48:45 (10m43s) → Timeout 15:59:29 (10m44s) → Timeout 16:10:12 (10m43s)
Three retries × 10m43s = 32 minutes wasted per failed instance (vs 3.5 min with 60s timeout).
### Concurrent dispatches cause 429 rate limiting
Original run dispatched 3 models simultaneously (10 workers each = 30 concurrent requests):
- 574 total 429 errors across all 3 jobs
- Opus was especially impacted: attempts 1 and 2 were completely wiped out
Retry runs dispatched 1 model at a time:
- 6 total 429 errors (effectively zero)
### Success is probabilistic, not deterministic

`wp-calypso-21977` and `wp-calypso-22242` failed in the original run but succeeded (zero errors, clean first try) in the retry. This proves wp-calypso instances ARE processable — the runtime start is a coin flip that depends on node scheduling and image caching.
## Current Run Status (2026-03-28)

### Original runs (benchmarks branch `fix/swebenchmultimodal-api-timeout-600s`, 600s timeout)
| Model | Agent | Succeeded | Failed | Total |
|-------|-------|-----------|--------|-------|
| claude-sonnet-4-5 | acp-claude | 24/102 | 78 | 102 |
| claude-opus-4-6 | acp-claude | 30/102 | 72 | 102 |
| gpt-5.4 | acp-codex | 27/102 | 75 | 102 |
### Retry runs (cleaned archive + single dispatch, still running)
| Model | Agent | Orig Success | Retry Zero-Error | Still Running | Permanent Fail (attempt 1) |
|-------|-------|--------------|------------------|---------------|----------------------------|
| claude-sonnet-4-5 | acp-claude | 24 | +9 | ~36 left | 32 (17 wp-calypso, 8 marked, 6 p5.js, 1 react-pdf) |
| claude-opus-4-6 | acp-claude | 30 | +7 | ~31 left | 35 (19 wp-calypso, 8 marked, 7 p5.js, 1 react-pdf) |
| gpt-5.4 | acp-codex | 27 | +12 | ~30 left | 38 (21 wp-calypso, 8 marked, 8 p5.js, 1 react-pdf) |
## Proposed Solution

The `api_timeout` bump should be reverted (or not merged). Instead:

### 1. Reduce worker concurrency for `swebenchmultimodal`
Set `num_workers=2-3` (instead of 10) to reduce concurrent load on the runtime API. Fewer simultaneous `/start` requests mean fewer 429s and fewer scheduling conflicts for large images.
### 2. Run one model at a time
Never dispatch multiple models for `swebenchmultimodal` simultaneously. Stagger dispatches to avoid 429 storms.
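A hypothetical sketch of sequential dispatch via the `gh` CLI (the `eval.yml` workflow name and `model` input are assumptions, not the real SDK workflow; the point is blocking on each run before dispatching the next):

```python
# Hypothetical sequential dispatcher: one model at a time, never overlapping.
# "eval.yml" and the "model" input are assumptions, not the real SDK workflow.
import subprocess
import time

MODELS = ["claude-sonnet-4-5", "claude-opus-4-6", "gpt-5.4"]

for model in MODELS:
    subprocess.run(
        ["gh", "workflow", "run", "eval.yml", "-f", f"model={model}"],
        check=True,
    )
    time.sleep(15)  # give GitHub a moment to register the new run
    # Grab the id of the run we just dispatched and block until it finishes.
    run_id = subprocess.run(
        ["gh", "run", "list", "--workflow", "eval.yml", "--limit", "1",
         "--json", "databaseId", "--jq", ".[0].databaseId"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["gh", "run", "watch", run_id], check=True)
```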
### 3. Increase `max_retries` to 5-10

Since wp-calypso success is probabilistic, more outer attempts mean more chances. With the 60s timeout, 3 inner retries cost ~3.5 min per failed outer attempt, so 10 outer attempts take ~35 min per persistently failing instance (vs ~320 min with the 600s timeout × 3 retries × 10 attempts).
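As a quick sanity check on that arithmetic (the ~70s and ~643s per-attempt costs come from the timing evidence above):

```python
# Worst-case wall time per persistently failing instance:
# every inner retry is spent waiting on a /start that never succeeds.
def worst_case_minutes(per_try_s: float, inner_retries: int, outer_attempts: int) -> float:
    return per_try_s * inner_retries * outer_attempts / 60

print(worst_case_minutes(70, 3, 10))   # 60s timeout  -> 35.0 min
print(worst_case_minutes(643, 3, 10))  # 600s timeout -> 321.5 min (~320)
```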
### 4. Keep timeout at 60s (or lower)
Fail fast on hopeless attempts. The runtime either starts in seconds or not at all. A shorter timeout means faster iteration through retry cycles.
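Together, items 3 and 4 reduce to a fail-fast outer loop. A sketch, reusing the hypothetical `start_runtime` from the Root Cause section:

```python
# Fail-fast retry loop: short read timeout, many cheap outer attempts.
# Reuses the hypothetical start_runtime() sketched under Root Cause.
import time

def start_with_retries(image: str, max_retries: int = 10) -> dict | None:
    for attempt in range(1, max_retries + 1):
        runtime = start_runtime(image, timeout_s=60.0)  # fail fast
        if runtime is not None:
            return runtime
        # Brief backoff between attempts; success depends on cluster state
        # (caching, scheduling), not wait time, so a long per-attempt
        # timeout buys nothing.
        time.sleep(5 * attempt)
    return None
```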
### 5. (Infrastructure) Pre-pull large images on nodes
The most impactful fix would be ensuring wp-calypso/p5.js images are cached on nodes via a DaemonSet or node image pre-puller. This would make all 102 instances reliably processable.
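The standard pre-puller pattern is a DaemonSet whose init containers pull each large image and exit, leaving a pause container running so the pod stays resident on every node. A minimal sketch that generates such a manifest (all image names below are placeholders, not the real benchmark images):

```python
# Sketch: generate a DaemonSet manifest that pre-pulls large images on every
# node. Image names are placeholders, not the real swebenchmultimodal images.
import yaml  # PyYAML

LARGE_IMAGES = [
    "example.registry/swebench-mm/wp-calypso:latest",  # placeholder
    "example.registry/swebench-mm/p5js:latest",        # placeholder
]

manifest = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "swebench-image-prepuller", "namespace": "evaluation-jobs"},
    "spec": {
        "selector": {"matchLabels": {"app": "swebench-image-prepuller"}},
        "template": {
            "metadata": {"labels": {"app": "swebench-image-prepuller"}},
            "spec": {
                # Each init container forces a pull of one image, then exits.
                "initContainers": [
                    {
                        "name": f"pull-{i}",
                        "image": img,
                        # Assumes the image ships a `true` binary (most do).
                        "command": ["true"],
                    }
                    for i, img in enumerate(LARGE_IMAGES)
                ],
                # Tiny long-running container so the pod stays scheduled.
                "containers": [{"name": "pause", "image": "registry.k8s.io/pause:3.9"}],
            },
        },
    },
}

print(yaml.safe_dump(manifest, sort_keys=False))
```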
## Investigation Links

### Current retry runs

- Sonnet K8s job: `eval-23686021661-claude-son` (namespace: `evaluation-jobs`)
- Opus K8s job: `eval-23686024551-claude-4-6`
- GPT-5.4 K8s job: `eval-23686024050-gpt-5-4`
### GitHub Actions
- Original SDK workflows: OpenHands/software-agent-sdk runs 23673797686 (sonnet/opus), 23673798829 (gpt-5.4)
- Retry SDK workflows: 23686021661 (sonnet), 23686024551 (opus), 23686024050 (gpt-5.4)
### Related

- Branch `fix/swebenchmultimodal-api-timeout-600s` — should NOT be merged (makes things worse)

/cc @OpenHands/benchmarks-team