nvme_driver: Drain all queues after restore #2833
Conversation
Pull request overview
This pull request implements a mechanism to drain all NVMe I/O queues after restore by coordinating completion of in-flight commands across multiple queues before allowing new I/O. The implementation uses a three-state machine (`Draining` -> `SelfDrained` -> `AllDrained`) with a shared atomic counter to determine when all queues have finished draining.
Changes:
- Introduced `DrainAfterRestore` enum to track queue draining state during restore
- Modified queue handler to block new I/O commands until all queues finish draining existing commands
- Added coordination mechanism using shared atomic counter and device interrupt signaling
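The three-state machine and shared counter described above might look roughly like the sketch below. The enum name and variant names come from the PR description; the struct, fields, and methods here are illustrative assumptions, not the driver's actual code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Illustrative mirror of the PR's three-state machine.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum DrainAfterRestore {
    /// This queue still has pre-save commands in flight.
    Draining,
    /// This queue's own commands finished, but other queues may still drain.
    SelfDrained,
    /// Every restored queue has drained; new I/O may be accepted.
    AllDrained,
}

struct QueueDrain {
    state: DrainAfterRestore,
    /// Shared count of queues still in `Draining`.
    remaining: Arc<AtomicUsize>,
}

impl QueueDrain {
    /// Called when this queue's last pre-save command completes.
    fn on_self_drained(&mut self) {
        if self.state != DrainAfterRestore::Draining {
            return;
        }
        self.state = DrainAfterRestore::SelfDrained;
        // The queue that takes the counter to zero knows the drain is over.
        // (The real driver signals a device interrupt here so the other
        // queues also observe `AllDrained`; this sketch only updates itself.)
        if self.remaining.fetch_sub(1, Ordering::AcqRel) == 1 {
            self.state = DrainAfterRestore::AllDrained;
        }
    }

    fn accepts_new_io(&self) -> bool {
        self.state == DrainAfterRestore::AllDrained
    }
}

fn main() {
    let remaining = Arc::new(AtomicUsize::new(2));
    let mut q1 = QueueDrain { state: DrainAfterRestore::Draining, remaining: remaining.clone() };
    let mut q2 = QueueDrain { state: DrainAfterRestore::Draining, remaining: remaining.clone() };
    q1.on_self_drained();
    assert!(!q1.accepts_new_io()); // q2 is still draining
    q2.on_self_drained();
    assert!(q2.accepts_new_io()); // the last drainer observes AllDrained
}
```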
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs | Adds DrainAfterRestore state machine, modifies queue handler event loop to support draining mode, updates restore logic to mark empty queues as drained |
| vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs | Integrates DrainAfterRestore into queue restoration, creates shared drain coordinator for I/O queues, passes appropriate drain state to admin and prototype queues |
mattkur left a comment
Nifty. You asked for feedback on the approach. I think it looks good to me. I left a few nit comments (though I imagine the things I commented on are already on your radar). Thanks for this, Alex!
I think regression testing is "good enough" for this fix, but I hope you can also find a way to prove whether or not there is a real bug here.
Force-pushed from 4eb5369 to 7a5b3ea
```rust
Request(Req),
Command(Cmd),
Completion(spec::Completion),
NoOp,
```
nit: How about naming this DrainComplete (or something like that)?
Feel free to take/ignore the nit comments. Overall logic looks pretty sound to me! (Appreciate the detailed comments btw)
mattkur left a comment
I can only find nits. I think this looks good. Nice work, and a tricky area. Thanks Alex!
```rust
.iter()
.filter(|q| !q.queue_data.handler_data.pending_cmds.commands.is_empty())
.count();
tracing::info!(nonempty_queues, "drain-after-restore initialization");
```
nit, please include ?pci_id as a field.
```rust
let old_counter = counter.fetch_sub(1, Ordering::AcqRel);
if old_counter == 1 {
    signal.signal_uncached();
    tracing::info!(
```
minor, include pci_id here (unless this is really painful to plumb; can happen in a separate PR later)
Addressed nits, thanks Matt & Guramrit!
```
w1 | w1' c1' w2 c2 c1
```

Guest sends a write request w1. Then we do servicing. Then storvsp duplicates the request into w1', which completes at c1'. If we allow the queue to run freely (i.e. accept new guest requests), the guest can at this point issue another write w2, which completes at c2; only then does the original pre-save write complete at c1, overwriting what is, as far as the guest is concerned, the "latest" write w2. That's the reason we stop the queue until all pre-save requests complete.

But with multiple queues, if we drain one queue and start accepting new guest requests on it while other queues are still draining, we can hit the same race. Therefore, with this PR, all queues are drained before new guest requests are accepted.

(cherry picked from commit c1541ef)
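The timeline above can be replayed in a few lines: the disk ends up holding whatever data the last completion carried, so letting the stale c1 land after c2 silently loses the guest's newest write. The function and values below are illustrative only, not driver code.

```rust
/// Applies write completions in arrival order; the disk ends up holding
/// whatever data the last completion carried.
fn final_contents(completions: &[u32]) -> u32 {
    let mut disk = 0;
    for &data in completions {
        disk = data;
    }
    disk
}

fn main() {
    let (w1, w2) = (1u32, 2u32);
    // Racy order from the description: c1' (= w1), then c2 (= w2), then the
    // stale pre-save c1 (= w1). The guest's latest write w2 is lost.
    assert_eq!(final_contents(&[w1, w2, w1]), w1);
    // Drained order: all pre-save completions land before w2 is admitted.
    assert_eq!(final_contents(&[w1, w1, w2]), w2);
}
```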
Clean cherry pick of PR #2833.

Co-authored-by: Alex Landau <alexlandau@microsoft.com>
Backported to release/1.7.2511 in #2851
In #2833, saved queues (eager and proto) were given the correct drain state, but new queues were given the `AllDrained` state, meaning they could accept I/O before all existing queues had drained. This could happen if I/O arrives during draining on a CPU (and thus queue) that had never seen I/O before the save. This PR fixes the race by giving new queues the correct drain state: `AllDrained` if all existing queues have already drained, or `SelfDrained` (like proto queues, which have no I/O pending on them) otherwise.
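The state selection for new queues can be sketched as below. This is a hypothetical helper, not the PR's actual code; the enum variants come from the PR description, while the function name and signature are assumptions for illustration.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[allow(dead_code)]
enum DrainAfterRestore {
    Draining,
    SelfDrained,
    AllDrained,
}

/// A queue created after restore has no pre-save I/O of its own, so it is at
/// least `SelfDrained`; it may only start as `AllDrained` once no restored
/// queue is still draining.
fn drain_state_for_new_queue(queues_still_draining: usize) -> DrainAfterRestore {
    if queues_still_draining == 0 {
        DrainAfterRestore::AllDrained
    } else {
        DrainAfterRestore::SelfDrained
    }
}

fn main() {
    assert_eq!(drain_state_for_new_queue(0), DrainAfterRestore::AllDrained);
    // With queues still draining, a new queue must wait like everyone else.
    assert_eq!(drain_state_for_new_queue(3), DrainAfterRestore::SelfDrained);
}
```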
Clean cherry pick of PR #2864.