
fix(swebench): explicitly close conversation after evaluate_instance to prevent OOM #701

Closed
simonrosenberg wants to merge 2 commits into main from fix/close-conversation-after-evaluate

Conversation

@simonrosenberg
Collaborator

Summary

  • Wrap the conversation body in try/finally and call conversation.close() before returning from evaluate_instance() in both swebench and swebenchmultimodal run_infer.py
  • close() sets self._ws_client = None, breaking the circular reference so CPython's reference counter can immediately reclaim _cached_events
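A minimal sketch of the pattern described above, using a stand-in Conversation class rather than the actual RemoteConversation and run_infer.py code:

```python
class Conversation:
    """Illustrative stand-in for RemoteConversation (not the real SDK class)."""

    def __init__(self):
        self._ws_client = object()       # stands in for WebSocketCallbackClient
        self._cached_events = ["event-1", "event-2"]

    def close(self):
        # As described above: dropping _ws_client breaks the reference cycle,
        # so plain reference counting can reclaim _cached_events.
        self._ws_client = None


def evaluate_instance(conversation):
    try:
        # Extract whatever the caller needs *before* the finally block runs.
        history = list(conversation._cached_events)
        return history
    finally:
        conversation.close()             # runs on both success and exception


conv = Conversation()
history = evaluate_instance(conv)
```

The key detail is that results are copied out of the conversation inside the try block, so close() in finally cannot race with the return value.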

Root cause (OpenHands/evaluation#540)

RemoteConversation has a circular reference that prevents immediate GC:

conversation → _ws_client (WebSocketCallbackClient)
             → callback (composed_callback)
             → run_complete_callback closure (remote_conversation.py:813)
             → self._terminal_status_queue   ← captures self

Without an explicit close() call, these objects can only be collected by CPython's cyclic GC, which runs when an allocation threshold is crossed, not immediately. With 30 workers completing instances rapidly, many RemoteConversation objects pile up before a collection happens.
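The effect can be reproduced in a few lines; the class below is a toy model of the cycle sketched above, not the SDK code:

```python
import gc
import weakref


class ConversationLike:
    """Toy reproduction of the cycle: self -> _ws_client -> closure -> self."""

    def __init__(self):
        self._terminal_status_queue = []

        def run_complete_callback():
            # The closure captures self, completing the cycle.
            self._terminal_status_queue.append("done")

        self._ws_client = run_complete_callback


conv = ConversationLike()
ref = weakref.ref(conv)
del conv

alive_after_del = ref() is not None   # True: refcounting alone cannot free it
gc.collect()                          # the cyclic collector breaks the cycle
alive_after_gc = ref() is not None    # False: object reclaimed only now
```

Dropping the last external reference does not free the object; only an explicit (or threshold-triggered) gc.collect() does, which is exactly the delay the fix removes.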

This is especially bad for ACP agents (acp-codex, acp-claude) because:

  • ACPAgent raises NotImplementedError if a condenser is configured, so all events accumulate uncapped in _cached_events
  • GPT-5.5 with reasoning_effort=high on large JS projects (wp-calypso in swebenchmultimodal) makes 100+ tool calls per instance; each ACPToolCallEvent stores raw_input + raw_output which can be MB-sized for file reads and test output

Observed failure: swebenchmultimodal with acp-codex OOMed the eval-container (12Gi limit) at ~40/68 instances across 5 consecutive runs. Both num_workers=8 and num_workers=30 hit the same fraction — confirming per-instance linear accumulation rather than concurrency pressure.

Test plan

  • Trigger a swebenchmultimodal eval with acp-codex agent type and verify it runs to completion without OOMKill
  • Verify swebench eval still works normally (no regression)
  • Confirm the fix applies to both files identically

🤖 Generated with Claude Code

…to prevent OOM

RemoteConversation has a circular reference:
  conversation → _ws_client (WebSocketCallbackClient)
               → callback (composed_callback)
               → run_complete_callback closure
               → self._terminal_status_queue  (captures self)

Because of this cycle CPython's reference counter cannot free the object
when evaluate_instance() returns — it must wait for the cyclic GC, which
only runs on a threshold.  With 30 workers completing ACP instances rapidly,
many RemoteConversation objects pile up before GC runs.

ACP agents (acp-codex, acp-claude) make this severe because:
- ACPAgent raises NotImplementedError if a condenser is configured, so all
  events accumulate uncapped in _cached_events.
- GPT-5.5 on large JS projects (wp-calypso) makes 100+ tool calls per
  instance; each ACPToolCallEvent stores raw_input + raw_output which can
  be MB-sized for file reads and test output.

Result: on swebenchmultimodal with acp-codex the eval-container OOMed at
~40/68 instances across 5 consecutive runs (workers=8 and workers=30 both
hit the same fraction — confirming per-instance accumulation, not
concurrency pressure).

Fix: wrap the conversation body in try/finally and call conversation.close()
before returning.  close() sets self._ws_client = None, which breaks the
cycle and lets the reference counter reclaim _cached_events immediately.

Applied to both swebench and swebenchmultimodal run_infer.py since both
have the same structure and both can be triggered with ACP agents.

Fixes OpenHands/evaluation#540

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Clean resource management fix for a real production problem.

Analysis

The implementation is correct:

  • ✅ try/finally pattern properly ensures conversation.close() runs even if exceptions occur
  • ✅ Extracting history and metrics before the finally block is necessary and correct
  • ✅ Identical changes in both swebench and swebenchmultimodal maintain consistency
  • ✅ The root cause analysis is excellent (circular reference preventing immediate GC)
  • ✅ Solves a real, observed problem (OOM at 40/68 instances)

[TESTING GAPS]

The test plan in the PR description has unchecked boxes, and there's no Evidence section showing that:

  1. The fix was tested with acp-codex agent type and completed without OOMKill
  2. Regular swebench evals still work correctly
  3. Memory usage was actually reduced

For infrastructure changes that fix critical production issues, concrete evidence is important:

  • Run swebenchmultimodal-infer with acp-codex and show completion stats
  • Include memory usage comparison (before/after) if possible
  • Show at least one successful end-to-end run demonstrating the fix

[RISK ASSESSMENT]

⚠️ Risk Assessment: 🟡 MEDIUM

This is an infrastructure change in a critical evaluation path. The logic is sound (breaking circular references for immediate GC is the right approach), and the implementation is clean. However, without test evidence, there's a medium risk that edge cases weren't validated. The change should be safe since close() is idiomatic resource cleanup, but verification in a real environment is recommended before merge.

VERDICT:
Worth merging after evidence is provided - Core logic is correct and solves a real problem, just needs validation

KEY INSIGHT:
This is a textbook example of fixing resource leaks caused by circular references - the implementation is clean and minimal, just needs empirical confirmation that it solves the OOM issue.



@simonrosenberg
Collaborator Author

Closing in favor of a cleaner fix: gc.collect() in the Evaluation base class (one change covers all benchmarks) + weakref fix in the SDK to eliminate the circular reference at the source.
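A sketch of what the mentioned weakref fix could look like; this is an assumption about the SDK change, not its actual implementation:

```python
import weakref


class Conversation:
    """Hypothetical cycle-free variant: the callback holds a weak reference."""

    def __init__(self):
        self._terminal_status_queue = []
        self_ref = weakref.ref(self)      # weak, so no strong cycle is formed

        def run_complete_callback():
            conv = self_ref()
            if conv is not None:          # the conversation may already be gone
                conv._terminal_status_queue.append("done")

        self._ws_client = run_complete_callback


conv = Conversation()
ref = weakref.ref(conv)
del conv
freed_immediately = ref() is None         # plain refcounting reclaims it
```

Because the closure captures only the weak reference, deleting the last strong reference frees the object at once, with no explicit close() or gc.collect() required.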

