api: Avoid race-condition in volume-attach timeout handling #593
Conversation
joker-at-work left a comment
Looks good, just 2 minor things.
    self.assertRaises(oslo_exceptions.MessagingTimeout,
                      self.compute_api.attach_volume,
                      self.context, instance, volume['id'])
    objects.BlockDeviceMapping.get_by_volume_and_instance(
        context, volume['id'], instance.uuid)
    if bdm.attachment_id:
        LOG.warning(
Are you sure this needs to be a warning? I think info would be enough; debug would be fine, too, imho.
Could we also start the log string on this same line, like it's done below for the LOG.debug after the bdm.destroy()?
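For reference, a minimal sketch of the requested style using plain Python logging; the names here are placeholders, not the actual patch:

    import logging

    LOG = logging.getLogger(__name__)
    volume_id = "vol-0000"  # placeholder for volume['id']

    # Debug level, with the log string starting on the same line as the call.
    LOG.debug("BDM for volume %s has attachment_id set, "
              "not deleting to avoid race-condition", volume_id)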
Correct me if I am wrong, but my reading of the code is that there is no real uniqueness constraint on the block-device-mapping (the uuid is randomly generated), which allows us to create multiple of them for the same instance-volume pair.
If that is the case, this doesn't quite solve the issue.
I'd suggest passing a uuid to _create_volume_bdm so we know exactly which BDM we want to delete, and deleting that (a rough sketch of what I mean follows below).
Having a block-device-mapping without an attachment_id is an expected intermediate state, so the second thread can also be in that state at the point in time we handle the exception.
We need to ensure that it is really our block-device-mapping we clean up. The query simply takes the first block-device-mapping returned by get_by_volume_and_instance; by all accounts that could be any of them, such as the one from the second thread.
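A rough, self-contained sketch of that suggestion, using an in-memory dict in place of the real BlockDeviceMapping table; the helper names and the idea of a caller-supplied uuid are assumptions for illustration, not Nova's actual API:

    import uuid

    bdm_table = {}  # bdm_uuid -> row; an in-memory stand-in for the BDM table

    def create_volume_bdm(volume_id, instance_uuid, bdm_uuid):
        # The caller chooses the uuid, so it can later identify exactly this row.
        bdm_table[bdm_uuid] = {"volume_id": volume_id,
                               "instance_uuid": instance_uuid,
                               "attachment_id": None}
        return bdm_uuid

    def cleanup_after_timeout(bdm_uuid):
        # Delete only the BDM this request created; a concurrent request's
        # BDM has a different uuid and is left untouched.
        bdm_table.pop(bdm_uuid, None)

    # Two concurrent attach requests for the same volume/instance pair:
    first = create_volume_bdm("vol-1", "inst-1", str(uuid.uuid4()))
    second = create_volume_bdm("vol-1", "inst-1", str(uuid.uuid4()))

    cleanup_after_timeout(first)   # the timed-out request removes only its own BDM
    assert second in bdm_table     # the second request's BDM survives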
    bdm = \
        objects.BlockDeviceMapping.get_by_volume_and_instance(
            context, volume['id'], instance.uuid)
    if bdm.attachment_id:
        LOG.debug("BDM for volume %s has attachment_id set, "
                  "not deleting to avoid race-condition",
                  volume['id'])
    else:
        bdm.destroy()
Assuming there is just one block-device-mapping per volume-and-instance pair (which I think is not enforced), isn't there still a race here?
I mean, I get the block-device-mapping, then the other thread saves the new version with attachment_id, and then I delete it because my copy doesn't have the attachment_id (sketched below).
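A deterministic toy model of that window (made-up names, not Nova code): thread A reads its copy before thread B saves attachment_id, then deletes based on the stale copy:

    # In-memory stand-in for the BDM row both threads operate on.
    db = {"bdm-1": {"attachment_id": None}}

    # Thread A (the timed-out request) fetches its copy first ...
    copy_a = dict(db["bdm-1"])

    # ... then thread B finishes reserve_block_device_name and saves attachment_id.
    db["bdm-1"]["attachment_id"] = "att-123"

    # Thread A decides based on its stale copy and deletes the now-valid BDM.
    if not copy_a["attachment_id"]:
        del db["bdm-1"]

    assert "bdm-1" not in db  # thread B's valid BDM is gone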
When nova-api calls reserve_block_device_name RPC to nova-compute and the call times out, we try to clean up by deleting the BDM entry. However, during the timeout window a second attachment request for the same volume can come in, create a valid BDM, and progress to talking to Cinder. The original timed-out request then deletes this new valid BDM, leaving the volume in an inconsistent state.
We fix this by checking if the BDM has attachment_id set before deleting it. The attachment_id field is only populated in _check_attach_and_reserve_volume(), which we only call after the reserve_block_device_name RPC succeeds. If attachment_id is set, we know the BDM belongs to a subsequent request that has already progressed past the RPC phase, so we should not delete it.
Change-Id: I7ed649a5cab7f254690f329fac285128d8cd1c92
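For completeness, a minimal toy model of the check the commit message describes; the row layout and names are illustrative only, while the real logic sits around _check_attach_and_reserve_volume and the cleanup path shown in the diff above:

    db = {"bdm-1": {"attachment_id": None}}

    # The second request has already progressed past reserve_block_device_name
    # and recorded its Cinder attachment on the shared row.
    db["bdm-1"]["attachment_id"] = "att-123"

    # Cleanup path of the timed-out request, modelled on the patch: re-read,
    # check attachment_id, and only destroy when it is unset.
    bdm = db["bdm-1"]
    if bdm["attachment_id"]:
        pass  # skip the delete, mirroring the LOG.debug branch of the patch
    else:
        del db["bdm-1"]

    assert "bdm-1" in db  # the second request's valid BDM survives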