fix: register image listener before sending request to prevent race condition#241
Conversation
…ondition The listener for image responses was registered after sending the request via WebSocket. If the server responded before the listener was set up, the response was lost and the SDK would poll for 300s until timeout. This moves listenToImages() before send() so the listener is always ready to catch the response. In the webhook code path, the listener is destroyed immediately since it's not needed.
|
I’m seeing the exact same issue. With concurrent requests over one WebSocket connection, I randomly hit the full 300s timeout even though the server responds. Logs look like this: Timeout waiting for images | Expected: 1 images | Received: 0 images | Timeout: 300000ms |
There was a problem hiding this comment.
Pull request overview
Fixes a race condition in the synchronous image inference flow by ensuring the WebSocket image-response listener is registered before sending the request, preventing intermittent long timeouts when responses arrive very quickly.
Changes:
- Register
listenToImages()beforesend()in_requestImages()to avoid missing early responses. - Destroy the temporary image listener on the webhook early-return path.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let_lis = await self.listenToImages( | ||
| onPartialImages=on_partial_images, | ||
| taskUUID=task_uuid, | ||
| groupKey=LISTEN_TO_IMAGES_KEY.REQUEST_IMAGES, | ||
| ) | ||
|
|
||
| await self.send([new_request_object]) | ||
|
|
||
| if new_request_object.get("webhookURL"): | ||
| let_lis["destroy"]() | ||
| return await self._handleWebhookAcknowledgment( |
There was a problem hiding this comment.
let_lis is now created before await self.send(...), but _requestImages() doesn’t guard this with a try/finally. If send() raises (e.g., connection closing) or if the subsequent wait/processing raises, this listener will remain registered in self._listeners and can accumulate across retries/concurrent calls. Consider wrapping the section from listener creation through response handling in try/finally (or try/except + re-raise) to ensure let_lis["destroy"]() always runs on all exit paths (including send() failures and getSimililarImage() timeouts/exceptions).
|
Hey @mateusz28011 and @przemyslaw Thanks for investigating this
|
Summary
Fix a race condition in
_requestImages()where the image response listener was registered after sending the WebSocket request, causing intermittent 300-second timeouts.Problem
In
_requestImages(), the order of operations was:await self.send([new_request_object])— send request via WebSocketawait self.listenToImages(...)— register listener for the responseBecause
send()yields control to the event loop (viaawait), the server's response could arrive and be dispatched byon_message()before the listener was registered. When this happened, the response was silently dropped — no listener existed to catch it. The SDK then polled_globalImagesfor 300 seconds (IMAGE_INFERENCE_TIMEOUT) finding nothing, and eventually timed out.This was especially problematic with multiple concurrent requests sharing one WebSocket connection.
Production error pattern:
Timeout waiting for images | TaskUUIDs: ['...'] | Expected: 1 images | Received: 0 images | Timeout: 300000ms
Fix
Swap the order: register
listenToImages()beforesend(), so the listener is always ready to catch the response. In the webhook code path, the listener is destroyed before the early return since it's not needed.Before:
After: