fix(swebench): explicitly close conversation after evaluate_instance to prevent OOM#701
fix(swebench): explicitly close conversation after evaluate_instance to prevent OOM#701simonrosenberg wants to merge 2 commits intomainfrom
Conversation
…to prevent OOM
RemoteConversation has a circular reference:
conversation → _ws_client (WebSocketCallbackClient)
→ callback (composed_callback)
→ run_complete_callback closure
→ self._terminal_status_queue (captures self)
Because of this cycle CPython's reference counter cannot free the object
when evaluate_instance() returns — it must wait for the cyclic GC, which
only runs on a threshold. With 30 workers completing ACP instances rapidly,
many RemoteConversation objects pile up before GC runs.
ACP agents (acp-codex, acp-claude) make this severe because:
- ACPAgent raises NotImplementedError if a condenser is configured, so all
events accumulate uncapped in _cached_events.
- GPT-5.5 on large JS projects (wp-calypso) makes 100+ tool calls per
instance; each ACPToolCallEvent stores raw_input + raw_output which can
be MB-sized for file reads and test output.
Result: on swebenchmultimodal with acp-codex the eval-container OOMed at
~40/68 instances across 5 consecutive runs (workers=8 and workers=30 both
hit the same fraction — confirming per-instance accumulation, not
concurrency pressure).
Fix: wrap the conversation body in try/finally and call conversation.close()
before returning. close() sets self._ws_client = None, which breaks the
cycle and lets the reference counter reclaim _cached_events immediately.
Applied to both swebench and swebenchmultimodal run_infer.py since both
have the same structure and both can be triggered with ACP agents.
Fixes OpenHands/evaluation#540
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Acceptable - Clean resource management fix for a real production problem.
Analysis
The implementation is correct:
- ✅
try/finallypattern properly ensuresconversation.close()runs even if exceptions occur - ✅ Extracting
historyandmetricsbefore thefinallyblock is necessary and correct - ✅ Identical changes in both
swebenchandswebenchmultimodalmaintain consistency - ✅ The root cause analysis is excellent (circular reference preventing immediate GC)
- ✅ Solves a real, observed problem (OOM at 40/68 instances)
[TESTING GAPS]
The test plan in the PR description has unchecked boxes, and there's no Evidence section showing that:
- The fix was tested with
acp-codexagent type and completed without OOMKill - Regular swebench evals still work correctly
- Memory usage was actually reduced
For infrastructure changes that fix critical production issues, concrete evidence is important:
- Run
swebenchmultimodal-inferwithacp-codexand show completion stats - Include memory usage comparison (before/after) if possible
- Show at least one successful end-to-end run demonstrating the fix
[RISK ASSESSMENT]
This is an infrastructure change in a critical evaluation path. The logic is sound (breaking circular references for immediate GC is the right approach), and the implementation is clean. However, without test evidence, there's a medium risk that edge cases weren't validated. The change should be safe since close() is idiomatic resource cleanup, but verification in a real environment is recommended before merge.
VERDICT:
✅ Worth merging after evidence is provided - Core logic is correct and solves a real problem, just needs validation
KEY INSIGHT:
This is a textbook example of fixing resource leaks caused by circular references - the implementation is clean and minimal, just needs empirical confirmation that it solves the OOM issue.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Closing in favor of a cleaner fix: gc.collect() in the Evaluation base class (one change covers all benchmarks) + weakref fix in the SDK to eliminate the circular reference at the source. |
Summary
try/finallyand callconversation.close()before returning fromevaluate_instance()in bothswebenchandswebenchmultimodalrun_infer.pyclose()setsself._ws_client = None, breaking the circular reference so CPython's reference counter can immediately reclaim_cached_eventsRoot cause (OpenHands/evaluation#540)
RemoteConversationhas a circular reference that prevents immediate GC:Without an explicit
close()call, CPython's cyclic GC must collect these objects, and it only runs on a threshold — not immediately. With 30 workers completing instances rapidly, manyRemoteConversationobjects pile up.This is especially bad for ACP agents (
acp-codex,acp-claude) because:ACPAgentraisesNotImplementedErrorif a condenser is configured, so all events accumulate uncapped in_cached_eventsreasoning_effort=highon large JS projects (wp-calypso in swebenchmultimodal) makes 100+ tool calls per instance; eachACPToolCallEventstoresraw_input+raw_outputwhich can be MB-sized for file reads and test outputObserved failure: swebenchmultimodal with
acp-codexOOMed the eval-container (12Gi limit) at ~40/68 instances across 5 consecutive runs. Bothnum_workers=8andnum_workers=30hit the same fraction — confirming per-instance linear accumulation rather than concurrency pressure.Test plan
acp-codexagent type and verify it runs to completion without OOMKill🤖 Generated with Claude Code