[SAP] Implement graceful shutdown for cinder services by hemna · Pull Request #314 · sapcc/cinder

hemna · 2026-02-20T20:05:32Z

Graceful Shutdown for Cinder Volume Services

Implements a three-phase graceful shutdown that lets in-flight volume operations complete before the pod exits during Kubernetes rolling updates.

How It Works

Phase 1 — Stop new messages (without killing in-flight handlers):

- Sends Basic.Cancel directly for each AMQP consumer tag via conn.consumer_cancel(), so the broker stops delivering new messages.
- Does NOT call conn.stop_consuming(), which puts _runner into a busy loop that starves other greenthreads of CPU.
- The _runner greenthread stays blocked in drain_events() at 0% CPU.

Phase 2 — Wait for in-flight operations:

- GreenPool.waitall() blocks until all RPC handler greenthreads finish.
- The worker entry heartbeat keeps worker DB entries fresh, so a new pod's cleanup does not interfere with them.
- Service heartbeats continue, so the service stays "up" in the DB.

Phase 3 — Clean exit:

- Skips rpcserver.stop()/rpcserver.wait(), which hang on the dead AMQP socket.
- The process exits cleanly after stop() returns.
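
The three phases can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: plain threads stand in for eventlet greenthreads, joining them stands in for GreenPool.waitall(), and the `DrainingService` class and its attribute names are hypothetical.

```python
import threading
import time


class DrainingService:
    """Toy stand-in for a service with in-flight RPC handlers."""

    def __init__(self):
        self.accepting = True            # Phase 1 flips this to False
        self._handlers = []              # stand-in for the GreenPool
        self.rpc_server_stopped = False  # never set: Phase 3 skips it

    def dispatch(self, func, *args):
        """Start a handler unless the service is already draining."""
        if not self.accepting:
            return False
        worker = threading.Thread(target=func, args=args)
        self._handlers.append(worker)
        worker.start()
        return True

    def stop(self):
        # Phase 1: stop taking new work; running handlers are untouched.
        self.accepting = False
        # Phase 2: wait for every in-flight handler to finish
        # (GreenPool.waitall() in the real service).
        for worker in self._handlers:
            worker.join()
        # Phase 3: return without stopping the RPC server, mirroring the
        # skipped rpcserver.stop()/rpcserver.wait() calls.


def slow_operation(results, name, seconds):
    """Pretend volume operation that takes a while to complete."""
    time.sleep(seconds)
    results.append(name)


results = []
service = DrainingService()
started = service.dispatch(slow_operation, results, "create_volume", 0.2)
service.stop()  # returns only after create_volume has finished
rejected = not service.dispatch(slow_operation, results, "late_call", 0.0)
```

Here stop() returns only once the in-flight operation has recorded its result, and work submitted afterwards is refused rather than started.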

Additional Mechanisms

- Worker entry heartbeat (cinder/objects/cleanable.py): the set_workers decorator spawns a greenthread that touches worker DB entries every 10s while an operation runs. This prevents the new pod's init_host → _do_cleanup path from resetting in-flight volumes to 'error'.
- do_cleanup freshness check (cinder/manager.py): skips worker entries updated within service_down_time (60s), so only truly stale or crashed entries are cleaned up.
- reject_if_draining decorator: rejects new RPC calls during shutdown so the scheduler routes them to healthy backends.
- Semaphore guard: prevents concurrent stop() calls on the same Service instance.
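
Two of these mechanisms are easy to illustrate in isolation. The sketch below is an assumption-laden approximation, not the code in this PR: `is_fresh`, `entries_to_clean`, `ServiceDraining`, the `draining` attribute, and the dict-shaped worker entries are all hypothetical, and the strict "less than" comparison at the 60s boundary is a guess.

```python
import functools
from datetime import datetime, timedelta, timezone

SERVICE_DOWN_TIME = 60  # seconds, the value quoted in the description


def is_fresh(updated_at, now, service_down_time=SERVICE_DOWN_TIME):
    """Return True if the entry was touched within service_down_time."""
    return now - updated_at < timedelta(seconds=service_down_time)


def entries_to_clean(workers, now):
    """Keep only stale entries; fresh ones belong to a live operation."""
    return [w for w in workers if not is_fresh(w["updated_at"], now)]


class ServiceDraining(Exception):
    """Hypothetical error raised for calls that arrive during shutdown."""


def reject_if_draining(method):
    """Refuse to run the wrapped method once the instance is draining."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, "draining", False):
            raise ServiceDraining(method.__name__)
        return method(self, *args, **kwargs)
    return wrapper


class Manager:
    draining = False  # hypothetical flag set when shutdown begins

    @reject_if_draining
    def create_volume(self, name):
        return f"created {name}"


now = datetime(2026, 2, 20, 20, 0, 0, tzinfo=timezone.utc)
workers = [
    # Heartbeat touched this entry 5s ago: an operation is still running.
    {"id": 1, "updated_at": now - timedelta(seconds=5)},
    # No update for 10 minutes: the owning process presumably crashed.
    {"id": 2, "updated_at": now - timedelta(minutes=10)},
]
stale_ids = [w["id"] for w in entries_to_clean(workers, now)]
```

With a 10s heartbeat and a 60s freshness window, an entry has to miss several consecutive heartbeats before cleanup will touch it.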

Requirements (separate changes)

- dumb-init --single-child on the cinder-volume container command, which ensures the ProcessLauncher parent waits for all children before exiting.
- terminationGracePeriodSeconds: 900 on the pod spec.
- oslo.messaging PR oslo.messaging#4 ("Send basic.cancel to broker in stop_consuming()"): adds basic.cancel to stop_consuming(). Not strictly required by our code path (we call consumer_cancel() directly), but it provides a safety net if other code paths invoke stop_consuming() during cleanup.

Test Results

See sap-doc/graceful-shutdown-test-results.md for full details.

| Test | Operation | Result |
| --- | --- | --- |
| Idle shutdown | Clean exit | ✅ <1s |
| Volume create from image | 16 GB, kill pod mid-download | ✅ (41s to 8min drains) |
| Backup to Swift | Kill backup pod mid-stream | ✅ |
| Scheduler rerouting | New work during drain | ✅ |
| Snapshot create | Kill pod mid-snapshot | ✅ |
| Snapshot delete | Kill pod mid-delete | ✅ |
| Volume clone | Kill pod mid-clone | ✅ |
| Backup (kill volume pod) | Multi-service coordination | ✅ |

Files Changed

| File | Purpose |
| --- | --- |
| `cinder/service.py` | Three-phase shutdown, semaphore guard, heartbeat continuation |
| `cinder/manager.py` | `do_cleanup` freshness check for worker entries |
| `cinder/objects/cleanable.py` | Worker heartbeat greenthread in `set_workers` decorator |
| `cinder/volume/manager.py` | Direct flow execution (removed tpool.execute) |
| `tox.ini` | Exclude `sap-tools`/`sap-doc` from flake8 |
| `doc/source/admin/graceful-shutdown-race-condition.rst` | Race condition documentation |
| `sap-doc/graceful-shutdown-test-results.md` | Sanitized test results |

No oslo.messaging source changes required

All changes are self-contained in cinder. We access oslo.messaging internal attributes defensively (getattr with fallbacks) but don't modify oslo.messaging source.
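
As a rough illustration of what defensive access with getattr fallbacks can look like, consider the sketch below. It is not the code from this PR: `FakeConnection` is a hypothetical stand-in, and the `_consumers` attribute name is an assumption about the driver's internals; only the consumer_cancel() name comes from the description above.

```python
class FakeConnection:
    """Hypothetical stand-in for a driver connection object."""

    def __init__(self):
        self.cancelled = []
        # Assumed internal mapping of consumer tag -> consumer object.
        self._consumers = {"tag-1": object(), "tag-2": object()}

    def consumer_cancel(self, tag):
        self.cancelled.append(tag)


def cancel_consumers(conn):
    """Cancel every known consumer tag; return how many were cancelled.

    Missing attributes are treated as "nothing to do" rather than errors,
    since internal attributes can change between library releases.
    """
    consumers = getattr(conn, "_consumers", None) or {}
    cancel = getattr(conn, "consumer_cancel", None)
    if not callable(cancel):
        return 0
    count = 0
    for tag in list(consumers):  # copy keys in case the dict is mutated
        cancel(tag)
        count += 1
    return count


conn = FakeConnection()
cancelled = cancel_consumers(conn)
untouched = cancel_consumers(object())  # no such attributes: falls back to 0
```

The fallback path means an unexpected library version degrades to a no-op instead of raising during shutdown.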

Three-phase graceful shutdown that allows in-flight volume operations (create, delete, clone, snapshot, backup) to complete before the pod exits during Kubernetes rolling updates.

Phase 1: Send Basic.Cancel to RabbitMQ consumers (no new messages) without disrupting the _runner greenthread or setting _consume_loop_stopped (avoids busy-loop CPU starvation).

Phase 2: Block in pool.waitall() until all in-flight RPC handler greenthreads in the GreenPool complete their operations.

Phase 3: Skip rpcserver.stop()/wait() (hangs on dead AMQP socket). Process exits cleanly after stop() returns.

Additional mechanisms:
- Worker entry heartbeat in set_workers decorator: touches worker DB entries every 10s during operations, preventing new pod's init_host _do_cleanup from resetting in-flight volumes to 'error'.
- do_cleanup freshness check: skips worker entries updated within service_down_time (60s), only cleans up truly stale/crashed entries.
- Semaphore guard: prevents concurrent stop() calls on same Service.
- Heartbeat continues during drain: service stays 'up' in DB.
- reject_if_draining decorator: rejects new RPC calls during shutdown so scheduler routes to healthy backends.

Requires:
- dumb-init --single-child (Helm chart change in separate commit)
- terminationGracePeriodSeconds: 900 on pod spec

Tested operations surviving pod termination:
- Volume create from image (41s to 8min drains)
- Volume clone, snapshot create, snapshot delete
- Backup (kill backup pod during stream)
- Backup (kill volume pod during snapshot prep)
- Scheduler rerouting during drain
- Idle shutdown (clean exit <1s)

Change-Id: Icdd28affc73fd34491b656a68410dce8e46264d4

hemna force-pushed the graceful-shutdown branch 3 times, most recently from 5a72074 to 1f69e00 (February 23, 2026 14:24)

hemna force-pushed the graceful-shutdown branch 3 times, most recently from 4c183ec to d331087 (April 30, 2026 12:37)

hemna force-pushed the graceful-shutdown branch 2 times, most recently from 1dd055a to d2fddd7 (May 14, 2026 22:25)

hemna changed the title from "[SAP] Try graceful shutdown" to "[SAP] Implement graceful shutdown for cinder services" (May 14, 2026)

hemna force-pushed the graceful-shutdown branch 3 times, most recently from ca845fc to a235a58 (May 14, 2026 22:40)

hemna force-pushed the graceful-shutdown branch from a235a58 to b77438d (May 14, 2026 22:41)

Scsabiii previously approved these changes May 15, 2026

Fix import order in service.py to pass pep8 CI check

81fe034

Move eventlet imports after stdlib imports (inspect, os, random, etc.) to comply with flake8-import-order (import-order-style = pep8). Change-Id: Icdd28affc73fd34491b656a68410dce8e46264d4

hemna dismissed Scsabiii’s stale review via 81fe034 May 15, 2026 14:21

hemna wants to merge 2 commits into stable/2023.1-m3 from graceful-shutdown
