
swebenchmultimodal: Runtime API fails to start pods for large images (wp-calypso, p5.js) #523

@simonrosenberg

Description


Summary

Running swebenchmultimodal with ACP agents consistently fails on ~60-75% of instances due to the runtime API (runtime.eval.all-hands.dev/start) being unable to start pods for large Docker images. The failure is non-deterministic: the same instance can succeed on one attempt and fail on the next, depending on cluster state (node image caching, scheduling, API load).

Root Cause

The runtime API /start endpoint times out when starting pods with large container images. Two error types:

  • ReadTimeout (~75% of errors): The HTTP read timeout expires waiting for the pod to start
  • Server disconnected without sending a response (~25%): The server drops the connection entirely

These are infrastructure-level failures in the runtime API, not benchmark or agent bugs. The runtime either starts within a few seconds or never starts within the timeout window — there is no "slow start" case where a longer timeout would help.

Dataset Composition

The swebenchmultimodal dev dataset has 102 instances across 5 repos with vastly different image sizes:

| Repo | Instances | Image Size | Typical Success Rate |
|------|-----------|------------|----------------------|
| Automattic/wp-calypso | 37 (36% of dataset) | Very large | 0-15% |
| chartjs/Chart.js | 24 | Small | ~90-100% |
| processing/p5.js | 16 | Large | 0-30% |
| markedjs/marked | 14 | Small | 50-90% |
| diegomura/react-pdf | 11 | Medium | ~60-80% |
| **Total** | **102** | | |

Key insight: wp-calypso (37 instances) and p5.js (16 instances) together make up 52% of the dataset (53/102) yet account for >80% of all failures.

Evidence

Bumping api_timeout from 60s to 600s does NOT help

Branch fix/swebenchmultimodal-api-timeout-600s increases the httpx read timeout from 60s to 600s. Testing shows:

  1. Every successful instance starts immediately (0 errors, completes on first try)
  2. No instance that hit a ReadTimeout ever recovered within the same attempt — even with 600s to wait
  3. The 600s timeout actively hurts: each failed retry wastes ~10m43s instead of ~70s, making runs 9x slower per failure
  4. Identical success rates between 60s and 600s timeout runs for the same repos

Timing evidence (600s timeout):

wp-calypso-26008: Started 15:38:02 → Timeout 15:48:45 (10m43s) → Timeout 15:59:29 (10m44s) → Timeout 16:10:12 (10m43s)

Three retries × 10m43s = 32 minutes wasted per failed instance (vs 3.5 min with 60s timeout).
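The cost comparison above can be checked directly (times taken from the log excerpt; the ~70s per-attempt figure at the 60s setting is approximate and includes connection overhead):

```python
# Wall-clock cost of one permanently failing instance under each timeout setting.
attempt_600s = 10 * 60 + 43   # observed ReadTimeout duration at the 600s setting (10m43s)
attempt_60s = 70              # approximate ReadTimeout duration at the 60s setting
inner_retries = 3

cost_600s = inner_retries * attempt_600s   # 1929 s, ~32 minutes
cost_60s = inner_retries * attempt_60s     # 210 s, ~3.5 minutes

print(cost_600s, cost_60s, round(cost_600s / cost_60s, 1))  # → 1929 210 9.2
```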

Concurrent dispatches cause 429 rate limiting

Original run dispatched 3 models simultaneously (10 workers each = 30 concurrent requests):

  • 574 total 429 errors across all 3 jobs
  • Opus was especially impacted: attempts 1 and 2 were completely wiped out

Retry runs dispatched 1 model at a time:

  • 6 total 429 errors (effectively zero)

Success is probabilistic, not deterministic

wp-calypso-21977 and wp-calypso-22242 failed in the original run but succeeded (zero errors, clean first try) in the retry. This proves wp-calypso instances ARE processable — the runtime start is a coin flip that depends on node scheduling and image caching.
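If each /start attempt succeeds independently with some probability p, the chance of at least one success across n outer attempts is 1 - (1 - p)^n, which is why more attempts help even when the per-attempt odds are poor. The p below is illustrative, not a measured rate:

```python
def p_success(p: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts

# Illustrative per-attempt success rate for a hard repo like wp-calypso.
p = 0.15
print(round(p_success(p, 3), 3))   # → 0.386
print(round(p_success(p, 10), 3))  # → 0.803
```

Under this (assumed) model, going from 3 to 10 attempts roughly doubles the chance of eventually starting the runtime.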

Current Run Status (2026-03-28)

Original runs (benchmarks branch fix/swebenchmultimodal-api-timeout-600s, 600s timeout)

| Model | Agent | Succeeded | Failed | Total |
|-------|-------|-----------|--------|-------|
| claude-sonnet-4-5 | acp-claude | 24/102 | 78 | 102 |
| claude-opus-4-6 | acp-claude | 30/102 | 72 | 102 |
| gpt-5.4 | acp-codex | 27/102 | 75 | 102 |

Retry runs (cleaned archive + single dispatch, still running)

| Model | Agent | Orig Success | Retry Zero-Error | Still Running | Permanent Fail (attempt 1) |
|-------|-------|--------------|------------------|---------------|----------------------------|
| claude-sonnet-4-5 | acp-claude | 24 | +9 | ~36 left | 32 (17 wp-calypso, 8 marked, 6 p5.js, 1 react-pdf) |
| claude-opus-4-6 | acp-claude | 30 | +7 | ~31 left | 35 (19 wp-calypso, 8 marked, 7 p5.js, 1 react-pdf) |
| gpt-5.4 | acp-codex | 27 | +12 | ~30 left | 38 (21 wp-calypso, 8 marked, 8 p5.js, 1 react-pdf) |

Proposed Solution

The api_timeout bump should be reverted (or not merged). Instead:

1. Reduce worker concurrency for swebenchmultimodal

Set num_workers=2-3 (instead of 10) to reduce concurrent load on the runtime API. Fewer simultaneous /start requests = fewer 429s and fewer scheduling conflicts for large images.
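A sketch of what capping concurrency looks like in a driver script. `start_instance` is a hypothetical stand-in for the harness's /start call, not its real API:

```python
from concurrent.futures import ThreadPoolExecutor

def start_instance(instance_id: str) -> str:
    # Hypothetical stand-in for POSTing to runtime.eval.all-hands.dev/start.
    return f"started {instance_id}"

instances = [f"wp-calypso-{i}" for i in range(6)]

# max_workers caps how many /start requests are in flight at once;
# 2-3 avoids the 429 storms seen with 10 workers per model.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(start_instance, instances))

print(len(results))  # → 6
```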

2. Run one model at a time

Never dispatch multiple models for swebenchmultimodal simultaneously. Stagger dispatches to avoid 429 storms.

3. Increase max_retries to 5-10

Since wp-calypso success is probabilistic, more outer attempts = more chances. With 60s timeout × 3 inner retries = 3.5 min per failed instance, 10 outer attempts would take ~35 min per persistently failing instance (vs ~320 min with 600s timeout × 3 retries × 10 attempts).

4. Keep timeout at 60s (or lower)

Fail fast on hopeless attempts. The runtime either starts in seconds or not at all. A shorter timeout means faster iteration through retry cycles.
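Combining the short timeout (point 4) with more outer attempts (point 3) yields a simple fail-fast loop. `try_start` below is a seeded stand-in for one /start attempt; the real HTTP call and its exception types are not modeled:

```python
import random

def try_start(rng: random.Random) -> bool:
    # Stand-in for one /start attempt with a short timeout: it either
    # succeeds quickly or "times out" (returns False); no slow-start case.
    return rng.random() < 0.3

def start_with_retries(rng: random.Random, max_retries: int = 10):
    for attempt in range(1, max_retries + 1):
        if try_start(rng):
            return attempt   # started; report which attempt succeeded
    return None              # permanent failure after max_retries

rng = random.Random(0)
print(start_with_retries(rng))  # → 4 with this seed
```

Because each failed attempt costs only ~70s at the 60s setting, cycling quickly through attempts is cheaper than waiting out a single long timeout.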

5. (Infrastructure) Pre-pull large images on nodes

The most impactful fix would be ensuring wp-calypso/p5.js images are cached on nodes via a DaemonSet or node image pre-puller. This would make all 102 instances reliably processable.
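One common pattern is a pre-puller DaemonSet: init containers pull the heavy images and exit, and a pause container keeps the pod running on every node. This is a sketch only; the manifest name, namespace placement, and image references are placeholders, not the real registry paths:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: swebenchmm-prepuller   # hypothetical name
spec:
  selector:
    matchLabels:
      app: swebenchmm-prepuller
  template:
    metadata:
      labels:
        app: swebenchmm-prepuller
    spec:
      initContainers:
        # Each init container pulls one heavy image, runs a no-op, and exits.
        - name: pull-wp-calypso
          image: <registry>/wp-calypso-instance:latest   # placeholder
          command: ["sh", "-c", "true"]
        - name: pull-p5js
          image: <registry>/p5js-instance:latest         # placeholder
          command: ["sh", "-c", "true"]
      containers:
        # Minimal long-running container so the DaemonSet pod stays healthy
        # after the init containers complete.
        - name: pause
          image: registry.k8s.io/pause:3.9
```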

Investigation Links

Current retry runs

  • Sonnet K8s job: eval-23686021661-claude-son (namespace: evaluation-jobs)
  • Opus K8s job: eval-23686024551-claude-4-6
  • GPT-5.4 K8s job: eval-23686024050-gpt-5-4

GitHub Actions

  • Original SDK workflows: OpenHands/software-agent-sdk runs 23673797686 (sonnet/opus), 23673798829 (gpt-5.4)
  • Retry SDK workflows: 23686021661 (sonnet), 23686024551 (opus), 23686024050 (gpt-5.4)

Related

/cc @OpenHands/benchmarks-team
