Skip to content

fix(sqs): long-poll wakeup race + DashMap guard held across await #20

Open
AnatolyRugalev wants to merge 2 commits into tyrchen:master from AnatolyRugalev:fix/sqs-longpoll-and-dashmap-deadlock

Conversation

@AnatolyRugalev commented Apr 21, 2026

Hey! We've been evaluating Rustack as a drop-in LocalStack replacement for our Go integration test suite and ran into three SQS issues. Love the project — startup is instant and it's way lighter than LocalStack.

Bug 1: Long-poll ReceiveMessage doesn't wake on new messages

When ReceiveMessage is waiting with WaitTimeSeconds > 0 and a new message is sent to the queue, the long-poll doesn't return — it waits the full WaitTimeSeconds before expiring with 0 messages.

Root cause: In the actor event loop (queue/actor.rs), SendMessage arrives via the commands.recv() arm of tokio::select! and calls notify_waiters() inside handle_command. But the notified() arm lost the select race and isn't being polled at that point, so the notification is lost.

Fix: After handle_command, directly call fulfill_pending_long_polls() when there are pending receivers. This is a no-op when there's nothing to fulfill.
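
A minimal sketch of the fixed event loop (`handle_command`, `fulfill_pending_long_polls`, `commands`, and `cleanup_interval` are the names from this PR; the `Command` enum and `Actor` fields are simplified stand-ins):

```rust
use std::collections::VecDeque;
use tokio::sync::{mpsc, oneshot};

enum Command {
    Send(String),
    Receive(oneshot::Sender<String>),
}

struct Actor {
    messages: VecDeque<String>,
    pending: VecDeque<oneshot::Sender<String>>,
}

impl Actor {
    fn handle_command(&mut self, cmd: Command) {
        match cmd {
            Command::Send(body) => self.messages.push_back(body),
            Command::Receive(tx) => self.pending.push_back(tx),
        }
    }

    // Hand queued messages to waiting long-polls. No-op when either
    // side is empty, so calling it unconditionally is cheap.
    fn fulfill_pending_long_polls(&mut self) {
        while !self.messages.is_empty() && !self.pending.is_empty() {
            let msg = self.messages.pop_front().unwrap();
            let tx = self.pending.pop_front().unwrap();
            let _ = tx.send(msg);
        }
    }
}

async fn run_actor(mut commands: mpsc::Receiver<Command>) {
    let mut actor = Actor { messages: VecDeque::new(), pending: VecDeque::new() };
    let mut cleanup_interval = tokio::time::interval(std::time::Duration::from_secs(1));
    loop {
        tokio::select! {
            Some(cmd) = commands.recv() => {
                actor.handle_command(cmd);
                // Fix: fulfill waiters directly. The old code called
                // notify_waiters() inside handle_command, but the
                // notified() arm of this select! was not a registered
                // waiter at that moment, so the wakeup evaporated.
                if !actor.pending.is_empty() {
                    actor.fulfill_pending_long_polls();
                }
            }
            _ = cleanup_interval.tick() => {
                // expired in-flight messages requeued here (see Bug 3)
            }
        }
    }
}
```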

Regression test: test_long_poll_wakes_on_new_message — starts a 20s long-poll, sends a message 200ms later, asserts it returns within 5s. Fails on main, passes with this fix.
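
Roughly what the test does, sketched against aws-sdk-sqs with a placeholder local endpoint (the real test lives in the integration suite and may differ):

```rust
use std::time::{Duration, Instant};

#[tokio::test]
async fn test_long_poll_wakes_on_new_message() {
    let config = aws_config::from_env()
        .endpoint_url("http://localhost:4566") // placeholder endpoint
        .load()
        .await;
    let client = aws_sdk_sqs::Client::new(&config);
    let queue_url = client
        .create_queue()
        .queue_name("long-poll-wake")
        .send()
        .await
        .unwrap()
        .queue_url()
        .unwrap()
        .to_string();

    // Start a 20s long-poll before any message exists.
    let started = Instant::now();
    let poll = tokio::spawn({
        let (client, queue_url) = (client.clone(), queue_url.clone());
        async move {
            client
                .receive_message()
                .queue_url(queue_url)
                .wait_time_seconds(20)
                .send()
                .await
                .unwrap()
        }
    });

    // Send a message 200ms later; the poll should wake right away
    // instead of sleeping out the full 20 seconds.
    tokio::time::sleep(Duration::from_millis(200)).await;
    client
        .send_message()
        .queue_url(&queue_url)
        .message_body("hello")
        .send()
        .await
        .unwrap();

    let resp = poll.await.unwrap();
    assert_eq!(resp.messages().len(), 1);
    assert!(started.elapsed() < Duration::from_secs(5));
}
```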

Bug 2: DashMap guard held across .await can deadlock under concurrency

get_queue() returns a DashMap::Ref (a shard read guard). All operation methods (send_message, receive_message, etc.) hold this guard while .await-ing the actor response. Under concurrent load with many queue create/delete/send/receive operations this can deadlock: a task parks at the .await while still holding the read guard, a writer (queue create/delete) then blocks synchronously on the same shard and ties up a worker thread, and once enough workers are stuck waiting for the shard, the actor task that would let the guard-holder resume never gets scheduled.

We hit this running our accounting service test suite (~2500 SQS operations, many concurrent queues). The server would hang with the last log line being a CreateQueue dispatch that never completed.

Fix: get_queue() now returns an owned QueueHandle (clone) instead of a DashMap::Ref. The guard is released before any .await. QueueHandle derives Clone with JoinHandle wrapped in Arc.
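
A sketch of the shape of the fix (QueueHandle and get_queue are from the PR; QueueStore and the field names are assumed):

```rust
use std::sync::Arc;
use dashmap::DashMap;
use tokio::{sync::mpsc, task::JoinHandle};

enum Command {} // elided

// Clone is derivable because mpsc::Sender is Clone and the
// non-cloneable JoinHandle is wrapped in an Arc.
#[derive(Clone)]
struct QueueHandle {
    commands: mpsc::Sender<Command>,
    actor_task: Arc<JoinHandle<()>>,
}

struct QueueStore {
    queues: DashMap<String, QueueHandle>,
}

impl QueueStore {
    // Before: -> Option<dashmap::mapref::one::Ref<'_, String, QueueHandle>>,
    // a shard read guard that callers then held across .await.
    // After: clone the handle out, so the guard drops here, before
    // the caller reaches any await point.
    fn get_queue(&self, name: &str) -> Option<QueueHandle> {
        self.queues.get(name).map(|entry| entry.value().clone())
    }
}
```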

Regression test: test_concurrent_queue_operations_do_not_deadlock — 50 concurrent tasks doing create → idempotent re-create → send → receive → delete. This is more of a stress/regression guard (the original deadlock is probabilistic).
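
Sketched with aws-sdk-sqs; local_client() is a hypothetical helper that returns a Client pointed at the local server:

```rust
use std::time::Duration;

#[tokio::test]
async fn test_concurrent_queue_operations_do_not_deadlock() {
    let client = local_client().await; // hypothetical helper
    let tasks: Vec<_> = (0..50)
        .map(|i| {
            let client = client.clone();
            tokio::spawn(async move {
                let name = format!("stress-{i}");
                let url = client
                    .create_queue()
                    .queue_name(&name)
                    .send()
                    .await
                    .unwrap()
                    .queue_url()
                    .unwrap()
                    .to_string();
                // Idempotent re-create exercises the write path.
                client.create_queue().queue_name(&name).send().await.unwrap();
                client.send_message().queue_url(&url).message_body("m").send().await.unwrap();
                client.receive_message().queue_url(&url).send().await.unwrap();
                client.delete_queue().queue_url(&url).send().await.unwrap();
            })
        })
        .collect();

    // The failure mode is a hang, not an error, so bound the whole
    // join with a timeout.
    tokio::time::timeout(Duration::from_secs(30), async {
        for task in tasks {
            task.await.unwrap();
        }
    })
    .await
    .expect("concurrent queue operations deadlocked");
}
```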

Bug 3: ChangeMessageVisibility doesn't wake long-poll waiters

When a consumer calls ChangeMessageVisibility with a short timeout (e.g. 1s) to retry a message quickly, the message becomes available after the timeout expires but any pending ReceiveMessage long-poll is not woken up. The consumer hangs until WaitTimeSeconds expires.

Root cause: periodic_cleanup (1s interval) moves expired in-flight messages back to the available queue and calls notify_waiters(). But this runs inside the cleanup_interval.tick() arm of tokio::select! — same race as Bug 1. The notified() future isn't being polled, so the notification is lost.

Fix: periodic_cleanup now calls fulfill_pending_long_polls() directly when messages become available, instead of relying on notify_waiters().
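
Continuing the event-loop sketch from Bug 1, the cleanup path looks roughly like this (requeue_expired_inflight is an assumed helper name; the PR only names periodic_cleanup and fulfill_pending_long_polls):

```rust
impl Actor {
    // Runs from the cleanup_interval.tick() arm of the event loop.
    fn periodic_cleanup(&mut self) {
        let requeued = self.requeue_expired_inflight();
        if requeued > 0 {
            // Was notify_waiters(): lost for the same reason as Bug 1,
            // since the notified() arm isn't being polled while the
            // tick() arm runs.
            self.fulfill_pending_long_polls();
        }
    }
}
```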

Regression test: test_change_visibility_wakes_long_poll — sends a message, receives it, changes visibility to 1s, starts a 10s long-poll. Without the fix, the long-poll returns 0 messages after 10s. With the fix, it returns in ~2s.

Common theme

All three bugs stem from the same pattern: calling notify_waiters() inside one arm of tokio::select! while the notified() future sits in another arm. At the moment the notification fires, the notified() future is not a registered waiter, and notify_waiters() stores no permit for future waiters (unlike notify_one()), so the wakeup is always lost. The fix in each case is to call fulfill_pending_long_polls() directly instead of going through the Notify mechanism.
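
A standalone sketch of the lost wakeup, runnable on its own (the Notify, channel, and timings are illustrative, not from the Rustack code):

```rust
use std::time::Duration;
use tokio::sync::{mpsc, Notify};

// notify_waiters() only wakes futures currently parked inside
// notified(); it stores no permit. Firing it from inside one select!
// arm can never wake the notified() future in the other arm.
#[tokio::main]
async fn main() {
    let notify = Notify::new();
    let (tx, mut rx) = mpsc::channel::<()>(1);
    tx.send(()).await.unwrap();

    tokio::select! {
        Some(()) = rx.recv() => {
            // This fires from inside one arm: the select! has already
            // resolved, so the notified() arm can no longer observe it.
            notify.notify_waiters();
        }
        _ = notify.notified() => unreachable!(),
    }

    // A fresh notified() never completes: the earlier notification
    // left no trace. This is why long-polls slept out WaitTimeSeconds.
    let woke = tokio::time::timeout(Duration::from_millis(100), notify.notified()).await;
    assert!(woke.is_err(), "wakeup was lost, as in the three bugs above");
}
```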

@tyrchen (Owner) left a comment

Thanks for the PR. Looks good to me.

Minor changes: consider dropping the now-dead notify_waiters() calls and the notified() branch.

AnatolyRugalev and others added 2 commits April 25, 2026 23:58
…nder concurrency

Two issues found while using Rustack as a LocalStack replacement in our
Go integration test suite (~2500 SQS operations per run):

The queue actor event loop uses `tokio::select!` with `commands.recv()`
and `message_notify.notified()` as competing arms. When SendMessage
arrives via `commands.recv()`, it calls `notify_waiters()` inside
`handle_command` — but the `notified()` future lost the select race and
isn't being polled, so the wakeup is lost. Pending long-polls only get
fulfilled when they expire (WaitTimeSeconds later).

Fix: after `handle_command`, directly call `fulfill_pending_long_polls()`
when there are pending receivers. This is cheap (no-op when no polls are
pending or no messages are available).

`get_queue()` returns a `DashMap::Ref` (read guard). Every operation
method holds this guard while `.await`-ing the actor response via a
oneshot channel. Under concurrent load with create/delete/send/receive
across many queues, this can deadlock when a task holding a read guard
awaits on the actor while another task needs a write guard on the same
DashMap shard.

Fix: `get_queue()` now clones the handle out of the guard and returns
an owned `QueueHandle`. The guard is released before any `.await`.
`QueueHandle` derives `Clone` (JoinHandle wrapped in Arc).

Both issues include regression tests in the integration suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When ChangeMessageVisibility sets a short timeout (e.g. 1s), the message
stays in-flight until periodic_cleanup (1s interval) moves it back to
the available queue. periodic_cleanup then calls notify_waiters(), but
this notification is lost — same tokio::select race as the SendMessage
fix: the notified() future isn't being polled during the
cleanup_interval.tick() arm.

Fix: periodic_cleanup now calls fulfill_pending_long_polls() directly
when messages become available, instead of relying on notify_waiters().

Regression test: test_change_visibility_wakes_long_poll — sends a
message, receives it, changes visibility to 1s, then starts a 10s
long-poll. Without the fix, the long-poll returns 0 messages after 10s.
With the fix, it returns in ~2s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AnatolyRugalev force-pushed the fix/sqs-longpoll-and-dashmap-deadlock branch from 857a081 to f7af606 on April 25, 2026 at 22:05
@AnatolyRugalev (Author) commented

Hey @tyrchen, the suggestion is addressed. Feel free to push into my forked branch if you need to tweak anything before the merge.

Thanks!
