
fix(swebench): explicitly close conversation after evaluate_instance to prevent OOM #701

Closed
simonrosenberg wants to merge 2 commits into main from fix/close-conversation-after-evaluate

Conversation

@simonrosenberg
Collaborator

Summary

  • Wrap the conversation body in try/finally and call conversation.close() before returning from evaluate_instance() in both swebench and swebenchmultimodal run_infer.py
  • close() sets self._ws_client = None, breaking the circular reference so CPython's reference counter can immediately reclaim _cached_events
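A minimal sketch of the pattern described above, using a stand-in Conversation class rather than the actual RemoteConversation and run_infer.py code:

```python
class Conversation:
    """Illustrative stand-in for RemoteConversation (not the real SDK class)."""

    def __init__(self):
        self._ws_client = object()       # stands in for WebSocketCallbackClient
        self._cached_events = ["event-1", "event-2"]

    def close(self):
        # As described above: dropping _ws_client breaks the reference cycle,
        # so plain reference counting can reclaim _cached_events.
        self._ws_client = None


def evaluate_instance(conversation):
    try:
        # Extract whatever the caller needs *before* the finally block runs.
        history = list(conversation._cached_events)
        return history
    finally:
        conversation.close()             # runs on both success and exception


conv = Conversation()
history = evaluate_instance(conv)
```

The key detail is that results are copied out of the conversation inside the try block, so close() in finally cannot race with the return value.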

Root cause (OpenHands/evaluation#540)

RemoteConversation has a circular reference that prevents immediate GC:

conversation → _ws_client (WebSocketCallbackClient)
             → callback (composed_callback)
             → run_complete_callback closure (remote_conversation.py:813)
             → self._terminal_status_queue   ← captures self

Without an explicit close() call, these objects can only be collected by CPython's cyclic GC, which runs when an allocation threshold is crossed, not immediately. With 30 workers completing instances rapidly, many RemoteConversation objects pile up before a collection happens.
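The effect can be reproduced in a few lines; the class below is a toy model of the cycle sketched above, not the SDK code:

```python
import gc
import weakref


class ConversationLike:
    """Toy reproduction of the cycle: self -> _ws_client -> closure -> self."""

    def __init__(self):
        self._terminal_status_queue = []

        def run_complete_callback():
            # The closure captures self, completing the cycle.
            self._terminal_status_queue.append("done")

        self._ws_client = run_complete_callback


conv = ConversationLike()
ref = weakref.ref(conv)
del conv

alive_after_del = ref() is not None   # True: refcounting alone cannot free it
gc.collect()                          # the cyclic collector breaks the cycle
alive_after_gc = ref() is not None    # False: object reclaimed only now
```

Dropping the last external reference does not free the object; only an explicit (or threshold-triggered) gc.collect() does, which is exactly the delay the fix removes.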

This is especially bad for ACP agents (acp-codex, acp-claude) because:

  • ACPAgent raises NotImplementedError if a condenser is configured, so all events accumulate uncapped in _cached_events
  • GPT-5.5 with reasoning_effort=high on large JS projects (wp-calypso in swebenchmultimodal) makes 100+ tool calls per instance; each ACPToolCallEvent stores raw_input + raw_output which can be MB-sized for file reads and test output

Observed failure: swebenchmultimodal with acp-codex OOMed the eval-container (12Gi limit) at ~40/68 instances across 5 consecutive runs. Both num_workers=8 and num_workers=30 hit the same fraction — confirming per-instance linear accumulation rather than concurrency pressure.

Test plan

  • Trigger a swebenchmultimodal eval with acp-codex agent type and verify it runs to completion without OOMKill
  • Verify swebench eval still works normally (no regression)
  • Confirm the fix applies to both files identically

🤖 Generated with Claude Code

…to prevent OOM

RemoteConversation has a circular reference:
  conversation → _ws_client (WebSocketCallbackClient)
               → callback (composed_callback)
               → run_complete_callback closure
               → self._terminal_status_queue  (captures self)

Because of this cycle CPython's reference counter cannot free the object
when evaluate_instance() returns — it must wait for the cyclic GC, which
only runs on a threshold.  With 30 workers completing ACP instances rapidly,
many RemoteConversation objects pile up before GC runs.

ACP agents (acp-codex, acp-claude) make this severe because:
- ACPAgent raises NotImplementedError if a condenser is configured, so all
  events accumulate uncapped in _cached_events.
- GPT-5.5 on large JS projects (wp-calypso) makes 100+ tool calls per
  instance; each ACPToolCallEvent stores raw_input + raw_output which can
  be MB-sized for file reads and test output.

Result: on swebenchmultimodal with acp-codex the eval-container OOMed at
~40/68 instances across 5 consecutive runs (workers=8 and workers=30 both
hit the same fraction — confirming per-instance accumulation, not
concurrency pressure).

Fix: wrap the conversation body in try/finally and call conversation.close()
before returning.  close() sets self._ws_client = None, which breaks the
cycle and lets the reference counter reclaim _cached_events immediately.

Applied to both swebench and swebenchmultimodal run_infer.py since both
have the same structure and both can be triggered with ACP agents.

Fixes OpenHands/evaluation#540

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Clean resource management fix for a real production problem.

Analysis

The implementation is correct:

  • ✅ try/finally pattern properly ensures conversation.close() runs even if exceptions occur
  • ✅ Extracting history and metrics before the finally block is necessary and correct
  • ✅ Identical changes in both swebench and swebenchmultimodal maintain consistency
  • ✅ The root cause analysis is excellent (circular reference preventing immediate GC)
  • ✅ Solves a real, observed problem (OOM at 40/68 instances)

[TESTING GAPS]

The test plan in the PR description has unchecked boxes, and there's no Evidence section showing that:

  1. The fix was tested with acp-codex agent type and completed without OOMKill
  2. Regular swebench evals still work correctly
  3. Memory usage was actually reduced

For infrastructure changes that fix critical production issues, concrete evidence is important:

  • Run swebenchmultimodal-infer with acp-codex and show completion stats
  • Include memory usage comparison (before/after) if possible
  • Show at least one successful end-to-end run demonstrating the fix

[RISK ASSESSMENT]

⚠️ Risk Assessment: 🟡 MEDIUM

This is an infrastructure change in a critical evaluation path. The logic is sound (breaking circular references for immediate GC is the right approach), and the implementation is clean. However, without test evidence, there's a medium risk that edge cases weren't validated. The change should be safe since close() is idiomatic resource cleanup, but verification in a real environment is recommended before merge.

VERDICT:
Worth merging after evidence is provided - Core logic is correct and solves a real problem, just needs validation

KEY INSIGHT:
This is a textbook example of fixing resource leaks caused by circular references - the implementation is clean and minimal, just needs empirical confirmation that it solves the OOM issue.



@simonrosenberg
Collaborator Author

Closing in favor of a cleaner fix: gc.collect() in the Evaluation base class (one change covers all benchmarks) + weakref fix in the SDK to eliminate the circular reference at the source.
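A sketch of what the mentioned weakref fix could look like; this is an assumption about the SDK change, not its actual implementation:

```python
import weakref


class Conversation:
    """Hypothetical cycle-free variant: the callback holds a weak reference."""

    def __init__(self):
        self._terminal_status_queue = []
        self_ref = weakref.ref(self)      # weak, so no strong cycle is formed

        def run_complete_callback():
            conv = self_ref()
            if conv is not None:          # the conversation may already be gone
                conv._terminal_status_queue.append("done")

        self._ws_client = run_complete_callback


conv = Conversation()
ref = weakref.ref(conv)
del conv
freed_immediately = ref() is None         # plain refcounting reclaims it
```

Because the closure captures only the weak reference, deleting the last strong reference frees the object at once, with no explicit close() or gc.collect() required.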

