fix: register image listener before sending request to prevent race condition #241

Open
mateusz28011 wants to merge 1 commit into Runware:main from mateusz28011:fix/listener-registration-race-condition


mateusz28011 commented Feb 18, 2026

Summary

Fix a race condition in _requestImages() where the image response listener was registered after sending the WebSocket request, causing intermittent 300-second timeouts.

Problem

In _requestImages(), the order of operations was:

  1. await self.send([new_request_object]) — send request via WebSocket
  2. await self.listenToImages(...) — register listener for the response

Because send() yields control to the event loop (via await), the server's response could arrive and be dispatched by on_message() before the listener was registered. When this happened, the response was silently dropped — no listener existed to catch it. The SDK then polled _globalImages for 300 seconds (IMAGE_INFERENCE_TIMEOUT) finding nothing, and eventually timed out.

This was especially problematic with multiple concurrent requests sharing one WebSocket connection.
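
The ordering problem described above can be reproduced in isolation. The following is a minimal sketch, not the SDK's code: ToyClient, its listen() and on_message() methods, and the "task-1" identifier are hypothetical stand-ins, and the fast server reply is simulated with loop.call_soon().

```python
import asyncio


class ToyClient:
    """Hypothetical stand-in for the SDK client (not the real Runware API)."""

    def __init__(self):
        self.listeners = {}  # task_uuid -> callback
        self.received = {}   # task_uuid -> payload delivered to a listener
        self.dropped = []    # task_uuids whose response found no listener

    def on_message(self, task_uuid, payload):
        callback = self.listeners.get(task_uuid)
        if callback is None:
            # No listener yet: the response is lost, as in the bug report.
            self.dropped.append(task_uuid)
            return
        callback(payload)

    def listen(self, task_uuid):
        def on_response(payload):
            self.received[task_uuid] = payload

        self.listeners[task_uuid] = on_response

    async def send(self, task_uuid):
        # Simulate a server that answers while send() is still suspended.
        loop = asyncio.get_running_loop()
        loop.call_soon(self.on_message, task_uuid, "image-bytes")
        await asyncio.sleep(0.01)  # yields control to the event loop


async def send_then_listen():
    client = ToyClient()
    await client.send("task-1")  # response is dispatched during this await
    client.listen("task-1")      # registered too late
    return client


if __name__ == "__main__":
    client = asyncio.run(send_then_listen())
    print(client.dropped)   # ['task-1']
    print(client.received)  # {}
```

In this toy model the response is dispatched while send() is suspended, so the late listener never sees it. A real client would then wait until its timeout expires.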

Production error pattern:

Timeout waiting for images | TaskUUIDs: ['...'] | Expected: 1 images | Received: 0 images | Timeout: 300000ms

Fix

Swap the order: register listenToImages() before send(), so the listener is always ready to catch the response. In the webhook code path, the listener is destroyed before the early return since it's not needed.

Before:

await self.send([new_request_object])

if new_request_object.get("webhookURL"):
    return await self._handleWebhookAcknowledgment(...)

let_lis = await self.listenToImages(...)

After:

let_lis = await self.listenToImages(...)

await self.send([new_request_object])

if new_request_object.get("webhookURL"):
    let_lis["destroy"]()
    return await self._handleWebhookAcknowledgment(...)
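
As a rough illustration of the reordered flow under concurrency, the sketch below registers the listener first, then sends, and destroys the listener on the webhook path. It is a toy model under stated assumptions: ToyClient, request(), run_concurrent(), and the webhook URL are hypothetical names, and only the dict with a "destroy" entry mirrors the let_lis shape shown in this PR.

```python
import asyncio


class ToyClient:
    """Hypothetical stand-in for the SDK client, for illustration only."""

    def __init__(self):
        self.listeners = {}  # task_uuid -> callback
        self.results = {}    # task_uuid -> payload

    def listen(self, task_uuid):
        def on_response(payload):
            self.results[task_uuid] = payload

        self.listeners[task_uuid] = on_response
        # Same shape as let_lis in the PR: a dict exposing "destroy".
        return {"destroy": lambda: self.listeners.pop(task_uuid, None)}

    def _dispatch(self, task_uuid, payload):
        callback = self.listeners.get(task_uuid)
        if callback is not None:
            callback(payload)

    async def send(self, task_uuid):
        # Simulated fast server: the reply is dispatched during the await.
        loop = asyncio.get_running_loop()
        loop.call_soon(self._dispatch, task_uuid, f"image-for-{task_uuid}")
        await asyncio.sleep(0.01)

    async def request(self, task_uuid, webhook_url=None):
        listener = self.listen(task_uuid)  # 1. register the listener
        await self.send(task_uuid)         # 2. only then send
        if webhook_url:
            listener["destroy"]()          # not needed on the webhook path
            return "webhook-ack"
        result = self.results.get(task_uuid)
        listener["destroy"]()
        return result


async def run_concurrent(n=3):
    client = ToyClient()
    results = await asyncio.gather(
        *(client.request(f"task-{i}") for i in range(n))
    )
    return results, client.listeners


if __name__ == "__main__":
    results, listeners = asyncio.run(run_concurrent())
    print(results)    # ['image-for-task-0', 'image-for-task-1', 'image-for-task-2']
    print(listeners)  # {}
```

Every concurrent request receives its own response because each listener exists before its request goes out, and no listener is left registered afterwards.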

…ondition

The listener for image responses was registered after sending the request
via WebSocket. If the server responded before the listener was set up,
the response was lost and the SDK would poll for 300s until timeout.

This moves listenToImages() before send() so the listener is always
ready to catch the response. In the webhook code path, the listener
is destroyed immediately since it's not needed.
@przemyslaw

I’m seeing the exact same issue.

With concurrent requests over one WebSocket connection, I randomly hit the full 300s timeout even though the server responds. Logs look like this:

Timeout waiting for images | Expected: 1 images | Received: 0 images | Timeout: 300000ms

Contributor

Copilot AI left a comment


Pull request overview

Fixes a race condition in the synchronous image inference flow by ensuring the WebSocket image-response listener is registered before sending the request, preventing intermittent long timeouts when responses arrive very quickly.

Changes:

  • Register listenToImages() before send() in _requestImages() to avoid missing early responses.
  • Destroy the temporary image listener on the webhook early-return path.


Comment on lines +598 to 608
let_lis = await self.listenToImages(
    onPartialImages=on_partial_images,
    taskUUID=task_uuid,
    groupKey=LISTEN_TO_IMAGES_KEY.REQUEST_IMAGES,
)

await self.send([new_request_object])

if new_request_object.get("webhookURL"):
    let_lis["destroy"]()
    return await self._handleWebhookAcknowledgment(

Copilot AI Feb 19, 2026


let_lis is now created before await self.send(...), but _requestImages() doesn’t guard this with a try/finally. If send() raises (e.g., connection closing) or if the subsequent wait/processing raises, this listener will remain registered in self._listeners and can accumulate across retries/concurrent calls. Consider wrapping the section from listener creation through response handling in try/finally (or try/except + re-raise) to ensure let_lis["destroy"]() always runs on all exit paths (including send() failures and getSimililarImage() timeouts/exceptions).
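
One possible shape for the suggested cleanup is sketched below. It is an assumption-laden toy, not the SDK's implementation: ToyClient, fail_on_send, and request() are hypothetical, and the only detail taken from the PR is that the listener handle exposes a "destroy" callable.

```python
import asyncio


class ToyClient:
    """Hypothetical stand-in for the SDK client, for illustration only."""

    def __init__(self, fail_on_send=False):
        self.listeners = {}
        self.fail_on_send = fail_on_send

    def listen(self, task_uuid):
        self.listeners[task_uuid] = lambda payload: None
        return {"destroy": lambda: self.listeners.pop(task_uuid, None)}

    async def send(self, task_uuid):
        await asyncio.sleep(0)
        if self.fail_on_send:
            # Simulates send() raising, e.g. because the connection is closing.
            raise ConnectionError("connection is closing")

    async def request(self, task_uuid):
        listener = self.listen(task_uuid)  # registered before send()
        try:
            await self.send(task_uuid)
            return "ok"
        finally:
            # Runs on success, on send() failure, and on cancellation.
            listener["destroy"]()


async def demo():
    client = ToyClient(fail_on_send=True)
    try:
        await client.request("task-1")
    except ConnectionError:
        pass
    return len(client.listeners)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # 0
```

Because the cleanup lives in finally, a failed send() leaves no listener behind, so retries and concurrent calls do not accumulate stale entries.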

@Sirsho1997
Collaborator

Sirsho1997 commented Feb 19, 2026

Hey @mateusz28011 and @przemyslaw

Thanks for investigating this.

  • With how many concurrent requests do you hit this issue?
  • Do you see the same issue when using deliveryMethod = async in imageInference as well? (See the example using the async workflow in videoInference.)


4 participants