Skip to content

fix(sqs): long-poll wakeup race + DashMap guard held across await #20

Open
AnatolyRugalev wants to merge 2 commits into tyrchen:master from AnatolyRugalev:fix/sqs-longpoll-and-dashmap-deadlock

Conversation

@AnatolyRugalev commented Apr 21, 2026

Hey! We've been evaluating Rustack as a drop-in LocalStack replacement for our Go integration test suite and ran into three SQS issues. Love the project — startup is instant and it's way lighter than LocalStack.

Bug 1: Long-poll ReceiveMessage doesn't wake on new messages

When ReceiveMessage is waiting with WaitTimeSeconds > 0 and a new message is sent to the queue, the long-poll doesn't return — it waits the full WaitTimeSeconds before expiring with 0 messages.

Root cause: In the actor event loop (queue/actor.rs), SendMessage arrives via the commands.recv() arm of tokio::select! and calls notify_waiters() inside handle_command. But the notified() arm lost the select race and isn't being polled at that point, so the notification is lost.

Fix: After handle_command, directly call fulfill_pending_long_polls() when there are pending receivers. This is a no-op when there's nothing to fulfill.
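
A minimal sketch of the fixed event loop (`handle_command`, `fulfill_pending_long_polls`, `commands`, and `cleanup_interval` are the names from this PR; the `Command` enum and `Actor` fields are simplified stand-ins):

```rust
use std::collections::VecDeque;
use tokio::sync::{mpsc, oneshot};

enum Command {
    Send(String),
    Receive(oneshot::Sender<String>),
}

struct Actor {
    messages: VecDeque<String>,
    pending: VecDeque<oneshot::Sender<String>>,
}

impl Actor {
    fn handle_command(&mut self, cmd: Command) {
        match cmd {
            Command::Send(body) => self.messages.push_back(body),
            Command::Receive(tx) => self.pending.push_back(tx),
        }
    }

    // Hand queued messages to waiting long-polls. No-op when either
    // side is empty, so calling it unconditionally is cheap.
    fn fulfill_pending_long_polls(&mut self) {
        while !self.messages.is_empty() && !self.pending.is_empty() {
            let msg = self.messages.pop_front().unwrap();
            let tx = self.pending.pop_front().unwrap();
            let _ = tx.send(msg);
        }
    }
}

async fn run_actor(mut commands: mpsc::Receiver<Command>) {
    let mut actor = Actor { messages: VecDeque::new(), pending: VecDeque::new() };
    let mut cleanup_interval = tokio::time::interval(std::time::Duration::from_secs(1));
    loop {
        tokio::select! {
            Some(cmd) = commands.recv() => {
                actor.handle_command(cmd);
                // Fix: fulfill waiters directly. The old code called
                // notify_waiters() inside handle_command, but the
                // notified() arm of this select! was not a registered
                // waiter at that moment, so the wakeup evaporated.
                if !actor.pending.is_empty() {
                    actor.fulfill_pending_long_polls();
                }
            }
            _ = cleanup_interval.tick() => {
                // expired in-flight messages requeued here (see Bug 3)
            }
        }
    }
}
```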

Regression test: test_long_poll_wakes_on_new_message — starts a 20s long-poll, sends a message 200ms later, asserts it returns within 5s. Fails on main, passes with this fix.
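
Roughly what the test does, sketched against aws-sdk-sqs with a placeholder local endpoint (the real test lives in the integration suite and may differ):

```rust
use std::time::{Duration, Instant};

#[tokio::test]
async fn test_long_poll_wakes_on_new_message() {
    let config = aws_config::from_env()
        .endpoint_url("http://localhost:4566") // placeholder endpoint
        .load()
        .await;
    let client = aws_sdk_sqs::Client::new(&config);
    let queue_url = client
        .create_queue()
        .queue_name("long-poll-wake")
        .send()
        .await
        .unwrap()
        .queue_url()
        .unwrap()
        .to_string();

    // Start a 20s long-poll before any message exists.
    let started = Instant::now();
    let poll = tokio::spawn({
        let (client, queue_url) = (client.clone(), queue_url.clone());
        async move {
            client
                .receive_message()
                .queue_url(queue_url)
                .wait_time_seconds(20)
                .send()
                .await
                .unwrap()
        }
    });

    // Send a message 200ms later; the poll should wake right away
    // instead of sleeping out the full 20 seconds.
    tokio::time::sleep(Duration::from_millis(200)).await;
    client
        .send_message()
        .queue_url(&queue_url)
        .message_body("hello")
        .send()
        .await
        .unwrap();

    let resp = poll.await.unwrap();
    assert_eq!(resp.messages().len(), 1);
    assert!(started.elapsed() < Duration::from_secs(5));
}
```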

Bug 2: DashMap guard held across .await can deadlock under concurrency

get_queue() returns a DashMap::Ref (a shard read guard). All operation methods (send_message, receive_message, etc.) hold this guard while .await-ing the actor response. Under concurrent load with many queue create/delete/send/receive operations this can deadlock: a task parks at the .await while still holding the read guard, a writer (queue create/delete) then blocks synchronously on the same shard and ties up a worker thread, and once enough workers are stuck waiting for the shard, the actor task that would let the guard-holder resume never gets scheduled.

We hit this running our accounting service test suite (~2500 SQS operations, many concurrent queues). The server would hang with the last log line being a CreateQueue dispatch that never completed.

Fix: get_queue() now returns an owned QueueHandle (clone) instead of a DashMap::Ref. The guard is released before any .await. QueueHandle derives Clone with JoinHandle wrapped in Arc.
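
A sketch of the shape of the fix (QueueHandle and get_queue are from the PR; QueueStore and the field names are assumed):

```rust
use std::sync::Arc;
use dashmap::DashMap;
use tokio::{sync::mpsc, task::JoinHandle};

enum Command {} // elided

// Clone is derivable because mpsc::Sender is Clone and the
// non-cloneable JoinHandle is wrapped in an Arc.
#[derive(Clone)]
struct QueueHandle {
    commands: mpsc::Sender<Command>,
    actor_task: Arc<JoinHandle<()>>,
}

struct QueueStore {
    queues: DashMap<String, QueueHandle>,
}

impl QueueStore {
    // Before: -> Option<dashmap::mapref::one::Ref<'_, String, QueueHandle>>,
    // a shard read guard that callers then held across .await.
    // After: clone the handle out, so the guard drops here, before
    // the caller reaches any await point.
    fn get_queue(&self, name: &str) -> Option<QueueHandle> {
        self.queues.get(name).map(|entry| entry.value().clone())
    }
}
```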

Regression test: test_concurrent_queue_operations_do_not_deadlock — 50 concurrent tasks doing create → idempotent re-create → send → receive → delete. This is more of a stress/regression guard (the original deadlock is probabilistic).
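
Sketched with aws-sdk-sqs; local_client() is a hypothetical helper that returns a Client pointed at the local server:

```rust
use std::time::Duration;

#[tokio::test]
async fn test_concurrent_queue_operations_do_not_deadlock() {
    let client = local_client().await; // hypothetical helper
    let tasks: Vec<_> = (0..50)
        .map(|i| {
            let client = client.clone();
            tokio::spawn(async move {
                let name = format!("stress-{i}");
                let url = client
                    .create_queue()
                    .queue_name(&name)
                    .send()
                    .await
                    .unwrap()
                    .queue_url()
                    .unwrap()
                    .to_string();
                // Idempotent re-create exercises the write path.
                client.create_queue().queue_name(&name).send().await.unwrap();
                client.send_message().queue_url(&url).message_body("m").send().await.unwrap();
                client.receive_message().queue_url(&url).send().await.unwrap();
                client.delete_queue().queue_url(&url).send().await.unwrap();
            })
        })
        .collect();

    // The failure mode is a hang, not an error, so bound the whole
    // join with a timeout.
    tokio::time::timeout(Duration::from_secs(30), async {
        for task in tasks {
            task.await.unwrap();
        }
    })
    .await
    .expect("concurrent queue operations deadlocked");
}
```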

Bug 3: ChangeMessageVisibility doesn't wake long-poll waiters

When a consumer calls ChangeMessageVisibility with a short timeout (e.g. 1s) to retry a message quickly, the message becomes available after the timeout expires but any pending ReceiveMessage long-poll is not woken up. The consumer hangs until WaitTimeSeconds expires.

Root cause: periodic_cleanup (1s interval) moves expired in-flight messages back to the available queue and calls notify_waiters(). But this runs inside the cleanup_interval.tick() arm of tokio::select! — same race as Bug 1. The notified() future isn't being polled, so the notification is lost.

Fix: periodic_cleanup now calls fulfill_pending_long_polls() directly when messages become available, instead of relying on notify_waiters().
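
Continuing the event-loop sketch from Bug 1, the cleanup path looks roughly like this (requeue_expired_inflight is an assumed helper name; the PR only names periodic_cleanup and fulfill_pending_long_polls):

```rust
impl Actor {
    // Runs from the cleanup_interval.tick() arm of the event loop.
    fn periodic_cleanup(&mut self) {
        let requeued = self.requeue_expired_inflight();
        if requeued > 0 {
            // Was notify_waiters(): lost for the same reason as Bug 1,
            // since the notified() arm isn't being polled while the
            // tick() arm runs.
            self.fulfill_pending_long_polls();
        }
    }
}
```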

Regression test: test_change_visibility_wakes_long_poll — sends a message, receives it, changes visibility to 1s, starts a 10s long-poll. Without the fix, the long-poll returns 0 messages after 10s. With the fix, it returns in ~2s.

Common theme

All three bugs stem from the same pattern: calling notify_waiters() inside one arm of tokio::select! while the notified() future sits in another arm. At the moment the notification fires, the notified() future is not a registered waiter, and notify_waiters() stores no permit for future waiters (unlike notify_one()), so the wakeup is always lost. The fix in each case is to call fulfill_pending_long_polls() directly instead of going through the Notify mechanism.
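
A standalone sketch of the lost wakeup, runnable on its own (the Notify, channel, and timings are illustrative, not from the Rustack code):

```rust
use std::time::Duration;
use tokio::sync::{mpsc, Notify};

// notify_waiters() only wakes futures currently parked inside
// notified(); it stores no permit. Firing it from inside one select!
// arm can never wake the notified() future in the other arm.
#[tokio::main]
async fn main() {
    let notify = Notify::new();
    let (tx, mut rx) = mpsc::channel::<()>(1);
    tx.send(()).await.unwrap();

    tokio::select! {
        Some(()) = rx.recv() => {
            // This fires from inside one arm: the select! has already
            // resolved, so the notified() arm can no longer observe it.
            notify.notify_waiters();
        }
        _ = notify.notified() => unreachable!(),
    }

    // A fresh notified() never completes: the earlier notification
    // left no trace. This is why long-polls slept out WaitTimeSeconds.
    let woke = tokio::time::timeout(Duration::from_millis(100), notify.notified()).await;
    assert!(woke.is_err(), "wakeup was lost, as in the three bugs above");
}
```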

@tyrchen (Owner) left a comment

Thanks for the PR. Looks good to me.

Minor changes: consider dropping the now-dead notify_waiters() calls and the notified() branch.

AnatolyRugalev and others added 2 commits April 25, 2026 23:58
…nder concurrency

Two issues found while using Rustack as a LocalStack replacement in our
Go integration test suite (~2500 SQS operations per run):

The queue actor event loop uses `tokio::select!` with `commands.recv()`
and `message_notify.notified()` as competing arms. When SendMessage
arrives via `commands.recv()`, it calls `notify_waiters()` inside
`handle_command` — but the `notified()` future lost the select race and
isn't being polled, so the wakeup is lost. Pending long-polls only get
fulfilled when they expire (WaitTimeSeconds later).

Fix: after `handle_command`, directly call `fulfill_pending_long_polls()`
when there are pending receivers. This is cheap (no-op when no polls are
pending or no messages are available).

`get_queue()` returns a `DashMap::Ref` (read guard). Every operation
method holds this guard while `.await`-ing the actor response via a
oneshot channel. Under concurrent load with create/delete/send/receive
across many queues, this can deadlock when a task holding a read guard
awaits on the actor while another task needs a write guard on the same
DashMap shard.

Fix: `get_queue()` now clones the handle out of the guard and returns
an owned `QueueHandle`. The guard is released before any `.await`.
`QueueHandle` derives `Clone` (JoinHandle wrapped in Arc).

Both issues include regression tests in the integration suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When ChangeMessageVisibility sets a short timeout (e.g. 1s), the message
stays in-flight until periodic_cleanup (1s interval) moves it back to
the available queue. periodic_cleanup then calls notify_waiters(), but
this notification is lost — same tokio::select race as the SendMessage
fix: the notified() future isn't being polled during the
cleanup_interval.tick() arm.

Fix: periodic_cleanup now calls fulfill_pending_long_polls() directly
when messages become available, instead of relying on notify_waiters().

Regression test: test_change_visibility_wakes_long_poll — sends a
message, receives it, changes visibility to 1s, then starts a 10s
long-poll. Without the fix, the long-poll returns 0 messages after 10s.
With the fix, it returns in ~2s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AnatolyRugalev force-pushed the fix/sqs-longpoll-and-dashmap-deadlock branch from 857a081 to f7af606 on April 25, 2026 at 22:05
@AnatolyRugalev (Author) commented

Hey @tyrchen, the suggestion is addressed. Feel free to push into my forked branch if you need to tweak anything before the merge.

Thanks!
