api: Avoid race-condition in volume-attach timeout handling #593
Conversation
joker-at-work left a comment
Looks good, just 2 minor things.
    self.assertRaises(oslo_exceptions.MessagingTimeout,
                      self.compute_api.attach_volume,
                      self.context, instance, volume['id'])
    objects.BlockDeviceMapping.get_by_volume_and_instance(
        context, volume['id'], instance.uuid)
    if bdm.attachment_id:
        LOG.warning(
Are you sure this needs to be a warning? I think info would be enough; debug would be fine, too, imho.
Could we also start the log string on this same line, like it's done below for the LOG.debug after the bdm.destroy()?
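For reference, a minimal sketch of the requested style using plain Python logging; the names here are placeholders, not the actual patch:

    import logging

    LOG = logging.getLogger(__name__)
    volume_id = "vol-0000"  # placeholder for volume['id']

    # Debug level, with the log string starting on the same line as the call.
    LOG.debug("BDM for volume %s has attachment_id set, "
              "not deleting to avoid race-condition", volume_id)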
Correct me if I am wrong, but my reading of the code is that there is no real uniqueness constraint on the block-device-mapping (the uuid is randomly generated), which allows us to create multiple of them for the same instance-volume pair.
If that is the case, this doesn't quite solve the issue.
I'd suggest passing a uuid to _create_volume_bdm so we know exactly which BDM we want to delete, and deleting that (a rough sketch of what I mean follows below).
Having a block-device-mapping without an attachment_id is an expected intermediate state, so the second thread can also be in that state at the point in time we handle the exception.
We need to ensure that it is really our block-device-mapping we clean up. The query simply takes the first block-device-mapping returned by get_by_volume_and_instance; by all accounts that could be any of them, such as the one from the second thread.
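A rough, self-contained sketch of that suggestion, using an in-memory dict in place of the real BlockDeviceMapping table; the helper names and the idea of a caller-supplied uuid are assumptions for illustration, not Nova's actual API:

    import uuid

    bdm_table = {}  # bdm_uuid -> row; an in-memory stand-in for the BDM table

    def create_volume_bdm(volume_id, instance_uuid, bdm_uuid):
        # The caller chooses the uuid, so it can later identify exactly this row.
        bdm_table[bdm_uuid] = {"volume_id": volume_id,
                               "instance_uuid": instance_uuid,
                               "attachment_id": None}
        return bdm_uuid

    def cleanup_after_timeout(bdm_uuid):
        # Delete only the BDM this request created; a concurrent request's
        # BDM has a different uuid and is left untouched.
        bdm_table.pop(bdm_uuid, None)

    # Two concurrent attach requests for the same volume/instance pair:
    first = create_volume_bdm("vol-1", "inst-1", str(uuid.uuid4()))
    second = create_volume_bdm("vol-1", "inst-1", str(uuid.uuid4()))

    cleanup_after_timeout(first)   # the timed-out request removes only its own BDM
    assert second in bdm_table     # the second request's BDM survives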
    bdm = \
        objects.BlockDeviceMapping.get_by_volume_and_instance(
            context, volume['id'], instance.uuid)
    if bdm.attachment_id:
        LOG.debug("BDM for volume %s has attachment_id set, "
                  "not deleting to avoid race-condition",
                  volume['id'])
    else:
        bdm.destroy()
Assuming there is just one block-device-mapping per volume-and-instance pair (which I think is not enforced), isn't there still a race here?
I mean, I get the block-device-mapping, then the other thread saves the new version with attachment_id, and then I delete it because my copy doesn't have the attachment_id (sketched below).
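A deterministic toy model of that window (made-up names, not Nova code): thread A reads its copy before thread B saves attachment_id, then deletes based on the stale copy:

    # In-memory stand-in for the BDM row both threads operate on.
    db = {"bdm-1": {"attachment_id": None}}

    # Thread A (the timed-out request) fetches its copy first ...
    copy_a = dict(db["bdm-1"])

    # ... then thread B finishes reserve_block_device_name and saves attachment_id.
    db["bdm-1"]["attachment_id"] = "att-123"

    # Thread A decides based on its stale copy and deletes the now-valid BDM.
    if not copy_a["attachment_id"]:
        del db["bdm-1"]

    assert "bdm-1" not in db  # thread B's valid BDM is gone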
When nova-api calls reserve_block_device_name RPC to nova-compute and the call times out, we try to clean up by deleting the BDM entry. However, during the timeout window a second attachment request for the same volume can come in, create a valid BDM, and progress to talking to Cinder. The original timed-out request then deletes this new valid BDM, leaving the volume in an inconsistent state.
We fix this by checking if the BDM has attachment_id set before deleting it. The attachment_id field is only populated in _check_attach_and_reserve_volume(), which we only call after the reserve_block_device_name RPC succeeds. If attachment_id is set, we know the BDM belongs to a subsequent request that has already progressed past the RPC phase, so we should not delete it.
Change-Id: I7ed649a5cab7f254690f329fac285128d8cd1c92
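For completeness, a minimal toy model of the check the commit message describes; the row layout and names are illustrative only, while the real logic sits around _check_attach_and_reserve_volume and the cleanup path shown in the diff above:

    db = {"bdm-1": {"attachment_id": None}}

    # The second request has already progressed past reserve_block_device_name
    # and recorded its Cinder attachment on the shared row.
    db["bdm-1"]["attachment_id"] = "att-123"

    # Cleanup path of the timed-out request, modelled on the patch: re-read,
    # check attachment_id, and only destroy when it is unset.
    bdm = db["bdm-1"]
    if bdm["attachment_id"]:
        pass  # skip the delete, mirroring the LOG.debug branch of the patch
    else:
        del db["bdm-1"]

    assert "bdm-1" in db  # the second request's valid BDM survives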