[SAP] Fix EMS vserver race condition in NetApp REST client #336

Open
hemna wants to merge 3 commits into stable/2025.1-m3 from fix/ems-vserver-race-2025.1

hemna commented May 12, 2026

Summary

  • Cherry-pick of 267739b from stable/2023.1-m3
  • Fixes a race condition in send_ems_log_message(): mutating the vserver on self.connection caused concurrent greenthreads to query the wrong SVM, which produced "Volume not found" errors during startup
  • Uses copy.deepcopy(self.connection) for the EMS request so self.connection is never mutated
  • Depends on [SAP] Fix race condition in NetApp REST client session handling #335 (session race fix)

Scsabiii and others added 3 commits May 12, 2026 10:47
Create the clone on the template DS, then move it to the target DS via svmotion after the clone completes

Change-Id: Icc4dda70f98498723c622913dfc383fb27b25da6

The RestNaServer.send_http_request() method was rebuilding self._session
on every call via _build_session() without any locking or thread-local
isolation. In an eventlet environment with concurrent greenthreads, this
caused a race condition where one greenthread's session headers (including
the critical X-Dot-SVM-Name vserver tunneling header) could be silently
overwritten by another greenthread's _build_session() call before the
HTTP request was actually sent.

This manifested in production as the get_operational_lif_addresses() REST
call intermittently returning LIFs for the wrong vserver (or an empty
set), because the tunneling header was lost due to the race. The driver
then logged 'Address not found for NFS share' for all configured shares,
reported zero pools to the scheduler, and the backend became invisible
to the volume controller's pool cache — permanently, since the cache has
no TTL or retry logic.

The fix eliminates the shared mutable state by making _build_session()
return a new local Session object instead of storing it on self._session.
Each concurrent REST call now gets its own isolated session with the
correct headers, preventing any cross-greenthread contamination.
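
A minimal sketch of the difference, not the driver's actual code: `FakeSession`, `SharedSessionClient`, and `LocalSessionClient` are hypothetical stand-ins, the SVM names are invented, and the interleaving of two greenthreads is simulated by calling `_build_session()` twice before the first caller "sends".

```python
class FakeSession:
    """Hypothetical stand-in for an HTTP session; it only carries headers."""

    def __init__(self, vserver):
        self.headers = {"X-Dot-SVM-Name": vserver}


class SharedSessionClient:
    """Old pattern (assumed shape): each call overwrites one shared session."""

    def __init__(self):
        self._session = None

    def _build_session(self, vserver):
        self._session = FakeSession(vserver)


class LocalSessionClient:
    """New pattern: _build_session() returns a session owned by the caller."""

    def _build_session(self, vserver):
        return FakeSession(vserver)


# Greenthread A builds its session, then B runs before A sends its request.
shared = SharedSessionClient()
shared._build_session("svm_a")
shared._build_session("svm_b")
# A reads the shared attribute and picks up B's tunneling header.
header_a_shared = shared._session.headers["X-Dot-SVM-Name"]

# With local sessions, B's call cannot touch the object A is holding.
local = LocalSessionClient()
session_a = local._build_session("svm_a")
session_b = local._build_session("svm_b")
header_a_local = session_a.headers["X-Dot-SVM-Name"]

print(header_a_shared, header_a_local)
```

In the shared variant, caller A ends up with the header written by caller B; in the local variant each caller keeps the header it asked for.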

This is the proper fix for the class of issues previously worked around
in commit deebedf ('use zapi in _get_flexvol_to_pool_map'), which
forced get_flexvol calls through the ZAPI fallback path to avoid this
same REST client race condition.

Change-Id: Ida5dbde04e4976b41d88fcd82c0573df7721cb0a

The send_ems_log_message method previously mutated self.connection's
vserver to temporarily switch to the cluster admin vserver for EMS
logging. This caused a race condition with concurrent greenthreads
(e.g., _update_ssc / volume discovery at startup) that would read
the wrong vserver from self.connection, resulting in REST calls being
tunneled to the wrong SVM and returning 'Volume not found' errors.

Fix by using a deep copy of self.connection for the EMS request,
ensuring that self.connection is never mutated and concurrent
operations continue using the correct data vserver.
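
A rough sketch of the two approaches, under stated assumptions: `Connection` and `Client` are simplified stand-ins rather than the real NetApp client classes, `"cluster_admin"` and `"svm_data"` are placeholder values, and the `observe` callback stands in for another greenthread that runs while the EMS request is in flight.

```python
import copy


class Connection:
    """Simplified stand-in for the REST connection object."""

    def __init__(self, vserver):
        self._vserver = vserver

    def get_vserver(self):
        return self._vserver

    def set_vserver(self, vserver):
        self._vserver = vserver


class Client:
    def __init__(self, connection):
        self.connection = connection

    def send_ems_log_message_racy(self, observe):
        # Old pattern: switch the shared connection, then restore it.
        original = self.connection.get_vserver()
        self.connection.set_vserver("cluster_admin")  # placeholder value
        try:
            observe()  # another greenthread runs during the EMS request
        finally:
            self.connection.set_vserver(original)

    def send_ems_log_message(self, observe):
        # New pattern: only the private copy is switched.
        ems_connection = copy.deepcopy(self.connection)
        ems_connection.set_vserver("cluster_admin")  # placeholder value
        observe()  # the shared connection still points at the data vserver


client = Client(Connection("svm_data"))
seen = []


def record_vserver():
    # What a concurrent caller would read from the shared connection.
    seen.append(client.connection.get_vserver())


client.send_ems_log_message_racy(record_vserver)
client.send_ems_log_message(record_vserver)
print(seen)
```

The first recorded value is the admin vserver leaking to the concurrent reader; the second is the untouched data vserver. Note that `copy.deepcopy` only works when everything the connection holds can be copied, which is an assumption in this sketch.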

Change-Id: IF426ECCD93A24707ADF25C3CB843B5C7
hemna force-pushed the fix/ems-vserver-race-2025.1 branch from 58f6abe to 0bb7eb4 on May 12, 2026 15:18

2 participants