[SAP] Fix EMS vserver race condition in NetApp REST client by hemna · Pull Request #336 · sapcc/cinder

hemna · 2026-05-12T14:51:20Z

Summary

- Cherry-pick of 267739b from stable/2023.1-m3
- Fixes race condition in send_ems_log_message() where mutating self.connection's vserver caused concurrent greenthreads to query the wrong SVM, resulting in "Volume not found" errors during startup
- Uses copy.deepcopy(self.connection) for the EMS request so self.connection is never mutated
- Depends on [SAP] Fix race condition in NetApp REST client session handling #335 (session race fix)

Create the clone on the template DS, then move it to the target DS via svmotion after the clone.

Change-Id: Icc4dda70f98498723c622913dfc383fb27b25da6

The RestNaServer.send_http_request() method was rebuilding self._session on every call via _build_session() without any locking or thread-local isolation. In an eventlet environment with concurrent greenthreads, this caused a race condition where one greenthread's session headers (including the critical X-Dot-SVM-Name vserver tunneling header) could be silently overwritten by another greenthread's _build_session() call before the HTTP request was actually sent.

This manifested in production as the get_operational_lif_addresses() REST call intermittently returning LIFs for the wrong vserver (or an empty set), because the tunneling header was lost due to the race. The driver then logged 'Address not found for NFS share' for all configured shares, reported zero pools to the scheduler, and the backend became permanently invisible to the volume controller's pool cache, since the cache has no TTL or retry logic.

The fix eliminates the shared mutable state by making _build_session() return a new local Session object instead of storing it on self._session. Each concurrent REST call now gets its own isolated session with the correct headers, preventing any cross-greenthread contamination.

This is the proper fix for the class of issues previously worked around in commit deebedf ('use zapi in _get_flexvol_to_pool_map'), which forced get_flexvol calls through the ZAPI fallback path to avoid this same REST client race condition.

Change-Id: Ida5dbde04e4976b41d88fcd82c0573df7721cb0a
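The difference between the shared and the per-call session can be sketched roughly as follows. This is an illustration only, not the driver code: `FakeSession`, `SharedSessionClient`, `LocalSessionClient`, and the `svm_a`/`svm_b` names are hypothetical stand-ins, and the greenthread interleaving is replayed by hand so the outcome is deterministic.

```python
# Illustrative sketch only: every class and name here is a hypothetical
# stand-in, not the cinder NetApp client.


class FakeSession:
    """Stand-in for an HTTP session; it only carries headers."""

    def __init__(self, headers):
        self.headers = dict(headers)


class SharedSessionClient:
    """Racy pattern: the session is stored on the instance."""

    def __init__(self):
        self._session = None

    def build_session(self, vserver):
        # Overwrites state shared by every caller of this client.
        self._session = FakeSession({"X-Dot-SVM-Name": vserver})

    def send(self):
        # Uses whichever session was built last, by any caller.
        return self._session.headers["X-Dot-SVM-Name"]


class LocalSessionClient:
    """Fixed pattern: each call builds and uses its own session."""

    def build_session(self, vserver):
        return FakeSession({"X-Dot-SVM-Name": vserver})

    def send(self, vserver):
        session = self.build_session(vserver)  # local, never shared
        return session.headers["X-Dot-SVM-Name"]


def simulate_interleaving():
    """Replay the problematic ordering without real concurrency."""
    shared = SharedSessionClient()
    shared.build_session("svm_a")  # caller A prepares its request
    shared.build_session("svm_b")  # caller B runs before A sends
    racy_result = shared.send()    # caller A sends with B's header

    local = LocalSessionClient()
    fixed_result = local.send("svm_a")
    return racy_result, fixed_result
```

In the shared variant, caller A ends up sending with caller B's tunneling header; in the local variant there is no instance state for another caller to overwrite.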

The send_ems_log_message method previously mutated self.connection's vserver to temporarily switch to the cluster admin vserver for EMS logging. This caused a race condition with concurrent greenthreads (e.g., _update_ssc / volume discovery at startup) that would read the wrong vserver from self.connection, resulting in REST calls being tunneled to the wrong SVM and returning 'Volume not found' errors.

Fix by using a deep copy of self.connection for the EMS request, ensuring that self.connection is never mutated and concurrent operations continue using the correct data vserver.

Change-Id: IF426ECCD93A24707ADF25C3CB843B5C7
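A minimal sketch of the copy-instead-of-mutate approach, assuming a connection object that exposes `set_vserver()` and `get_vserver()`. The `Connection` and `Client` classes, the `admin_svm` and `data_svm` names, and the return value are hypothetical and exist only to make the example self-contained.

```python
import copy


class Connection:
    """Hypothetical stand-in for the REST connection object."""

    def __init__(self, vserver):
        self._vserver = vserver

    def set_vserver(self, vserver):
        self._vserver = vserver

    def get_vserver(self):
        return self._vserver


class Client:
    """Hypothetical client holding one shared connection."""

    def __init__(self, data_vserver, admin_vserver):
        self.connection = Connection(data_vserver)
        self._admin_vserver = admin_vserver

    def send_ems_log_message(self, message):
        # Work on a private copy so the shared connection keeps
        # pointing at the data vserver for concurrent callers.
        ems_connection = copy.deepcopy(self.connection)
        ems_connection.set_vserver(self._admin_vserver)
        # A real client would issue the EMS request here; this sketch
        # only reports which vserver the request would target.
        return ems_connection.get_vserver()


client = Client(data_vserver="data_svm", admin_vserver="admin_svm")
ems_target = client.send_ems_log_message("driver started")
```

After the call, `client.connection` still reports `data_svm`, so other operations running concurrently never observe the admin vserver.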

Scsabiii and others added 3 commits May 12, 2026 10:47

Fix image cache, do the clone on the templates ds

83e3c25


hemna force-pushed the fix/ems-vserver-race-2025.1 branch from 58f6abe to 0bb7eb4 Compare May 12, 2026 15:18

Scsabiii approved these changes May 12, 2026


hemna wants to merge 3 commits into stable/2025.1-m3 from fix/ems-vserver-race-2025.1
