Skip to content

nvme_driver: Set correct drain state for new queues (#2864)#2867

Merged
alandau merged 1 commit intomicrosoft:release/1.7.2511from
alandau:cherrypick/release/1.7.2511/pr-2864
Mar 2, 2026
Merged

nvme_driver: Set correct drain state for new queues (#2864)#2867
alandau merged 1 commit intomicrosoft:release/1.7.2511from
alandau:cherrypick/release/1.7.2511/pr-2864

Conversation

@alandau
Copy link
Contributor

@alandau alandau commented Mar 2, 2026

Clean cherry pick of PR #2864

In #2833, saved queues (eager and proto) were given the correct drain state, but new queues were given the AllDrained state meaning they could race before all existing queues have drained. This could happen if IO is received during draining on a CPU (thus queue) that has never seen IO before the save.

This PR fixes the race by giving new queues the correct drain state - either AllDrained if all existing queues have drained already, or SelfDrained, like proto queues (no IO pending on them) if not.

In microsoft#2833, saved queues (eager and proto) were given the correct drain
state, but new queues were given the `AllDrained` state meaning they
could race before all existing queues have drained. This could happen if
IO is received during draining on a CPU (thus queue) that has never seen
IO before the save.

This PR fixes the race by giving new queues the correct drain state -
either `AllDrained` if all existing queues have drained already, or
`SelfDrained`, like proto queues (no IO pending on them) if not.

(cherry picked from commit 831eb3d)
@alandau alandau requested a review from a team as a code owner March 2, 2026 21:07
Copilot AI review requested due to automatic review settings March 2, 2026 21:07
@alandau alandau requested a review from a team as a code owner March 2, 2026 21:07
@github-actions github-actions bot added the release_1.7.2511 Targets the release/1.7.2511 branch. label Mar 2, 2026
@alandau alandau enabled auto-merge (squash) March 2, 2026 21:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a restore-time race in the NVMe driver’s “drain-after-restore” barrier by ensuring newly created I/O queues (created after restore, not from saved state) receive an appropriate drain state instead of incorrectly starting in AllDrained.

Changes:

  • Add DrainAfterRestoreBuilder::{is_drain_complete, new_for_new_queue} to pick the correct drain state for newly created queues during an active drain.
  • Thread an explicit DrainAfterRestore into QueuePair::new so new queues don’t implicitly default to AllDrained.
  • Store a shared DrainAfterRestoreBuilder in the worker during restore so new queues participate in the drain barrier until completion.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs Adds builder helpers to determine drain completion and compute the correct drain state for brand-new queues; updates QueuePair::new to accept an explicit drain state.
vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs Persists the drain builder across restore so newly created queues can block on draining; uses new_for_new_queue() when creating new I/O queues.

@alandau alandau merged commit 5bc290d into microsoft:release/1.7.2511 Mar 2, 2026
56 checks passed
@alandau alandau deleted the cherrypick/release/1.7.2511/pr-2864 branch March 2, 2026 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release_1.7.2511 Targets the release/1.7.2511 branch.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants