fix(sqs): long-poll wakeup race + DashMap guard held across await #20
Open
AnatolyRugalev wants to merge 2 commits into tyrchen:master from
Conversation
tyrchen (Owner) approved these changes on Apr 24, 2026 and left a comment:
Thanks for the PR. Looks good to me.

Minor changes: consider dropping the now-dead `notify_waiters()` calls and the `notified()` branch.
…nder concurrency

Two issues found while using Rustack as a LocalStack replacement in our Go integration test suite (~2500 SQS operations per run):

1. The queue actor event loop uses `tokio::select!` with `commands.recv()` and `message_notify.notified()` as competing arms. When SendMessage arrives via `commands.recv()`, it calls `notify_waiters()` inside `handle_command`, but the `notified()` future lost the select race and isn't being polled, so the wakeup is lost. Pending long-polls only get fulfilled when they expire (`WaitTimeSeconds` later). Fix: after `handle_command`, directly call `fulfill_pending_long_polls()` when there are pending receivers. This is cheap (a no-op when no polls are pending or no messages are available).

2. `get_queue()` returns a `DashMap::Ref` (read guard). Every operation method holds this guard while `.await`-ing the actor response via a oneshot channel. Under concurrent load with create/delete/send/receive across many queues, this can deadlock when a task holding a read guard awaits on the actor while another task needs a write guard on the same DashMap shard. Fix: `get_queue()` now clones the handle out of the guard and returns an owned `QueueHandle`. The guard is released before any `.await`. `QueueHandle` derives `Clone` (`JoinHandle` wrapped in `Arc`).

Both issues include regression tests in the integration suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When ChangeMessageVisibility sets a short timeout (e.g. 1s), the message stays in-flight until `periodic_cleanup` (1s interval) moves it back to the available queue. `periodic_cleanup` then calls `notify_waiters()`, but this notification is lost: the same `tokio::select!` race as the SendMessage fix, since the `notified()` future isn't being polled during the `cleanup_interval.tick()` arm.

Fix: `periodic_cleanup` now calls `fulfill_pending_long_polls()` directly when messages become available, instead of relying on `notify_waiters()`.

Regression test: `test_change_visibility_wakes_long_poll` sends a message, receives it, changes visibility to 1s, then starts a 10s long-poll. Without the fix, the long-poll returns 0 messages after 10s. With the fix, it returns in ~2s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
857a081 to f7af606
Author

Hey @tyrchen, the suggestion is addressed. Feel free to push to my forked branch if you need to tweak anything before the merge. Thanks!
Hey! We've been evaluating Rustack as a drop-in LocalStack replacement for our Go integration test suite and ran into three SQS issues. Love the project — startup is instant and it's way lighter than LocalStack.
Bug 1: Long-poll ReceiveMessage doesn't wake on new messages

When `ReceiveMessage` is waiting with `WaitTimeSeconds > 0` and a new message is sent to the queue, the long-poll doesn't return; it waits the full `WaitTimeSeconds` before expiring with 0 messages.

Root cause: In the actor event loop (`queue/actor.rs`), `SendMessage` arrives via the `commands.recv()` arm of `tokio::select!` and calls `notify_waiters()` inside `handle_command`. But the `notified()` arm lost the select race and isn't being polled at that point, so the notification is lost.

Fix: After `handle_command`, directly call `fulfill_pending_long_polls()` when there are pending receivers. This is a no-op when there's nothing to fulfill.

Regression test: `test_long_poll_wakes_on_new_message` starts a 20s long-poll, sends a message 200ms later, and asserts it returns within 5s. Fails on `main`, passes with this fix.

Bug 2: DashMap guard held across `.await` can deadlock under concurrency

`get_queue()` returns a `DashMap::Ref` (read guard). All operation methods (`send_message`, `receive_message`, etc.) hold this guard while `.await`-ing the actor response. Under concurrent load with many queue create/delete/send/receive operations, this can deadlock when a task holding a read guard awaits while another task needs a write guard on the same DashMap shard.

We hit this running our accounting service test suite (~2500 SQS operations, many concurrent queues). The server would hang with the last log line being a `CreateQueue` dispatch that never completed.

Fix: `get_queue()` now returns an owned `QueueHandle` (clone) instead of a `DashMap::Ref`. The guard is released before any `.await`. `QueueHandle` derives `Clone`, with the `JoinHandle` wrapped in `Arc`.

Regression test: `test_concurrent_queue_operations_do_not_deadlock` runs 50 concurrent tasks doing create → idempotent re-create → send → receive → delete. This is more of a stress/regression guard (the original deadlock is probabilistic).

Bug 3: ChangeMessageVisibility doesn't wake long-poll waiters

When a consumer calls `ChangeMessageVisibility` with a short timeout (e.g. 1s) to retry a message quickly, the message becomes available after the timeout expires but any pending `ReceiveMessage` long-poll is not woken up. The consumer hangs until `WaitTimeSeconds` expires.

Root cause: `periodic_cleanup` (1s interval) moves expired in-flight messages back to the available queue and calls `notify_waiters()`. But this runs inside the `cleanup_interval.tick()` arm of `tokio::select!`, the same race as Bug 1. The `notified()` future isn't being polled, so the notification is lost.

Fix: `periodic_cleanup` now calls `fulfill_pending_long_polls()` directly when messages become available, instead of relying on `notify_waiters()`.

Regression test: `test_change_visibility_wakes_long_poll` sends a message, receives it, changes visibility to 1s, and starts a 10s long-poll. Without the fix, the long-poll returns 0 messages after 10s. With the fix, it returns in ~2s.

Common theme

All three bugs stem from the same pattern: calling `notify_waiters()` inside one arm of `tokio::select!` while the `notified()` future is in another arm. Since only one arm is polled per iteration, the notification is always lost. The fix in each case is to call `fulfill_pending_long_polls()` directly instead of going through the `Notify` mechanism.