Bug Report: Thread-safety race in SupportsPrimaryWrites ??= caching causes permanent write failures after failover
Summary
A race condition in the SupportsPrimaryWrites property's ??= (null-coalescing assignment) caching can permanently poison the cached value during Redis failover events. This causes all write commands to fail with "Command cannot be issued to a replica" even though the endpoint IS a primary — and the state never self-heals until the multiplexer is disposed and recreated.
Affected Versions
Confirmed in 2.8.58. Also present in latest main branch (code unchanged).
The Bug
In ServerEndPoint.cs:
private bool? supportsPrimaryWrites;
public bool SupportsPrimaryWrites => supportsPrimaryWrites ??= !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;

private void SetConfig<T>(ref T field, T value, ...)
{
    if (!EqualityComparer<T>.Default.Equals(field, value))
    {
        field = value;
        ClearMemoized();
    }
}

private void ClearMemoized()
{
    supportsDatabases = null;
    supportsPrimaryWrites = null;
}
The ??= operator is not atomic: the null check, the evaluation, and the store are separate steps. It behaves roughly like:
if (supportsPrimaryWrites == null) {
    supportsPrimaryWrites = !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;
}
return supportsPrimaryWrites.Value;
During a failover, when a replica is promoted to primary (IsReplica changes from true to false), a reader thread can interleave with the reconfigure thread:
Thread A (request)                            Thread B (reconfigure: promoting to primary)
──────────────────                            ──────────────────────────────────────────────
1. Reads supportsPrimaryWrites → null
   (enters evaluation branch)
2. Reads IsReplica → true
   (still replica at this instant)
   Computes: !true || !true || false = false
                                              3. SetConfig(ref isReplica, false)
                                                 → field = false
                                                 → ClearMemoized()
                                                 → supportsPrimaryWrites = null ✓
4. Stores: supportsPrimaryWrites = false
   ← OVERWRITES the null from step 3!
Final State (Permanently Stuck)
isReplica = false ← CORRECT (node IS primary)
supportsPrimaryWrites = false ← STALE (cached from when IsReplica was true)
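The interleaving can be replayed deterministically on a single thread by splitting the getter into its separate steps. The following is a minimal sketch, not the library's code: EndpointModel and RaceReplay are hypothetical stand-ins that model only the fields involved, with ReplicaReadOnly fixed to true and AllowReplicaWrites fixed to false as in the trace above.

```csharp
// Hypothetical stand-in for ServerEndPoint; only the fields involved in the race are modelled.
public sealed class EndpointModel
{
    private readonly bool replicaReadOnly = true;     // assumed, as in the trace above
    private readonly bool allowReplicaWrites = false; // assumed, as in the trace above
    private bool isReplica = true;                    // starts out as a replica
    private bool? supportsPrimaryWrites;              // memoized result, null = not computed yet

    public bool IsReplica => isReplica;
    public bool? CachedSupportsPrimaryWrites => supportsPrimaryWrites;

    // The expression that the real property memoizes.
    public bool Compute() => !isReplica || !replicaReadOnly || allowReplicaWrites;

    // Thread B: flip the role, then clear the cache (SetConfig + ClearMemoized).
    public void Promote()
    {
        isReplica = false;
        supportsPrimaryWrites = null;
    }

    // Thread A, last step: store a value that was computed earlier.
    public void Store(bool computed) => supportsPrimaryWrites = computed;
}

public static class RaceReplay
{
    // Replays steps 1-4 in the order shown in the diagram.
    public static EndpointModel Run()
    {
        var ep = new EndpointModel();
        bool computed = ep.Compute(); // steps 1-2: cache is empty, IsReplica is still true
        ep.Promote();                 // step 3: role changes and the cache is cleared
        ep.Store(computed);           // step 4: the stale result overwrites the cleared cache
        return ep;
    }
}
```

After RaceReplay.Run(), the model reports IsReplica as false while the cached value is still false, which is the stuck state listed above.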
Why It Never Self-Heals
The 60-second heartbeat reads CLUSTER NODES, sees the node is primary, and calls:
SetConfig(ref isReplica, false);
But isReplica is already false, so the old and new values compare equal and SetConfig is a no-op → ClearMemoized() is never called → the stale cache persists indefinitely.
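The no-op can be seen in isolation with a small model. This is a sketch under stated assumptions rather than the library's code: HeartbeatModel is a hypothetical class that starts in the poisoned state, reuses the SetConfig shape shown earlier without its elided parameters, and counts how often the cache is cleared.

```csharp
using System.Collections.Generic;

// Hypothetical model that starts in the post-race state described above.
public sealed class HeartbeatModel
{
    private bool isReplica;                      // already false: the node is primary
    private bool? supportsPrimaryWrites = false; // stale value left behind by the race

    public bool? Cached => supportsPrimaryWrites;
    public int ClearCount { get; private set; }

    // Same shape as the SetConfig shown earlier, minus the elided parameters.
    private void SetConfig<T>(ref T field, T value)
    {
        if (!EqualityComparer<T>.Default.Equals(field, value))
        {
            field = value;
            ClearMemoized();
        }
    }

    private void ClearMemoized()
    {
        supportsPrimaryWrites = null;
        ClearCount++;
    }

    // What each heartbeat effectively does once it has seen that the node is primary.
    public void Heartbeat() => SetConfig(ref isReplica, false);
}
```

Calling Heartbeat() any number of times leaves ClearCount at zero and the cached value at false, because the guarded branch is never entered.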
Impact
- Routing layer checks IsReplica directly → sees false → passes the endpoint as valid primary
- Bridge layer checks SupportsPrimaryWrites → reads stale cached false → throws:
  RedisConnectionException: InternalFailure on [0]:SETEX key (BooleanProcessor)
  ---> RedisCommandException: Command cannot be issued to a replica: SETEX key
  at PhysicalBridge.WriteMessageToServerInsideWriteLock line 1535
- 100% of write commands fail on the affected endpoint until the multiplexer is disposed
- In production, we observed 2,337 errors over 5 minutes before the application's error threshold triggered a multiplexer reset
Reproduction
using System.Linq;
using System.Reflection;
using StackExchange.Redis;
var mux = await ConnectionMultiplexer.ConnectAsync("your-redis:6380,...");
var db = mux.GetDatabase();
// Verify baseline works
await db.StringSetAsync("test", "value"); // ✓ Success
// Simulate the post-race state: IsReplica=false but supportsPrimaryWrites cached as false
var endpoint = mux.GetEndPoints()[0]; // primary endpoint
var server = mux.GetServer(endpoint);
// Get the internal ServerEndPoint object via reflection
var getServerEpMethod = typeof(ConnectionMultiplexer)
    .GetMethods(BindingFlags.NonPublic | BindingFlags.Instance)
    .First(m => m.Name == "GetServerEndPoint");
var sep = getServerEpMethod.Invoke(mux, new object[] { endpoint });
// Set the stale state (simulates race result)
var spwField = sep.GetType().GetField("supportsPrimaryWrites", BindingFlags.NonPublic | BindingFlags.Instance);
spwField.SetValue(sep, (bool?)false); // Stale cached value
// IsReplica is still false (correct) - routing will pass
// But SupportsPrimaryWrites returns false (stale) - bridge will block
await db.StringSetAsync("test", "value");
// ❌ Throws: "Command cannot be issued to a replica: SETEX"
// ConfigureAsync does NOT fix it (isReplica is already false → SetConfig is no-op)
await mux.ConfigureAsync();
await db.StringSetAsync("test", "value");
// ❌ STILL throws!
// Only way to recover: dispose and recreate the multiplexer
Suggested Fix
Option A: Store the computed value with a compare-and-swap
Use Interlocked.CompareExchange so that a computed value is stored only if the cache is still empty:
public bool SupportsPrimaryWrites
{
    get
    {
        var current = Volatile.Read(ref supportsPrimaryWrites);
        if (current is not null) return current.Value;
        // Compute fresh value
        var computed = !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;
        // Only store if still null (another thread may have cleared it)
        Interlocked.CompareExchange(ref supportsPrimaryWrites, computed, null);
        // Return what's actually stored (may differ if cleared between compute and store)
        return Volatile.Read(ref supportsPrimaryWrites) ?? computed;
    }
}
Option B: Unconditionally call ClearMemoized in SetConfig
Remove the equality check so heartbeat always clears the cache:
private void SetConfig<T>(ref T field, T value, ...)
{
    if (!EqualityComparer<T>.Default.Equals(field, value))
    {
        field = value;
        Multiplexer?.ReconfigureIfNeeded(EndPoint, false, caller!);
    }
    ClearMemoized(); // Always clear, even if value didn't change
}
Option C: Use Volatile.Write in ClearMemoized + double-check in the getter
private void ClearMemoized()
{
    Volatile.Write(ref supportsDatabases, null);
    Volatile.Write(ref supportsPrimaryWrites, null);
}

public bool SupportsPrimaryWrites
{
    get
    {
        var val = supportsPrimaryWrites;
        if (val is null)
        {
            val = !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;
            supportsPrimaryWrites = val;
            // Double-check: if IsReplica changed between our read and store, re-evaluate
            var recheck = !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;
            if (recheck != val)
            {
                supportsPrimaryWrites = null; // Force re-evaluation next time
                return recheck;
            }
        }
        return val.Value;
    }
}
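Option A is the smallest change and stays lock-free, but it has two caveats as written. Volatile.Read, Volatile.Write and Interlocked.CompareExchange do not accept bool?, so Options A and C only compile on .NET 8 once the field uses another representation (for example an int tri-state). A compare-exchange against null also still succeeds after ClearMemoized() has reset the field, so the interleaving above can still store the stale value. Option B does not remove the race, but the next heartbeat clears the cache, so the state self-heals. Option C's re-check catches the interleaving only if a full fence (for example Thread.MemoryBarrier()) separates the store from the re-check.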
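A compare-exchange against an empty cache cannot tell "never filled" apart from "just cleared". One way to keep the lock-free shape of Option A while letting the getter notice an intervening clear is to pack a version counter next to the value. The following is a sketch of that idea, not a patch against the library: VersionedFlagCache and its members are hypothetical names, and the int encoding (two value bits plus a version) is an assumption made for illustration.

```csharp
using System;
using System.Threading;

// Hypothetical replacement for the bool? field: one int holding a version and a tri-state value.
public sealed class VersionedFlagCache
{
    private const int Unknown = 0, CachedFalse = 1, CachedTrue = 2, ValueMask = 3;

    // Bits 0-1: cached value. Bits 2-31: version, bumped by every Clear().
    private int state;

    public bool IsCached => (Volatile.Read(ref state) & ValueMask) != Unknown;

    // Equivalent of ClearMemoized(): drop the value and move to a new version.
    public void Clear()
    {
        int seen, next;
        do
        {
            seen = Volatile.Read(ref state);
            next = unchecked(((seen >> 2) + 1) << 2); // version + 1, value bits = Unknown
        }
        while (Interlocked.CompareExchange(ref state, next, seen) != seen);
    }

    // Equivalent of the getter: return the cached value or compute and try to publish it.
    public bool Get(Func<bool> compute)
    {
        int seen = Volatile.Read(ref state);
        int bits = seen & ValueMask;
        if (bits != Unknown) return bits == CachedTrue;

        bool computed = compute();
        int next = seen | (computed ? CachedTrue : CachedFalse);
        // Fails if Clear() ran since 'seen' was read, because the version no longer matches.
        Interlocked.CompareExchange(ref state, next, seen);
        return computed;
    }
}
```

With this shape, a reader that computed its value before a clear may still return that one stale result, but it can no longer publish it: the next call recomputes from the current fields. As in the existing SetConfig, Clear() would have to run after the field write for this to hold.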
Environment
- SE.Redis 2.8.58 (also present in latest main)
- Azure Redis Cluster (2 shards, 4 nodes)
- .NET 8
- Trigger: Azure Redis planned maintenance / Entra Auth rollout causing node reboots and role promotions
Workaround
There is no clean workaround from application code. ConfigureAsync() does not help because IsReplica is already correct. The only recovery is full multiplexer disposal and recreation.
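Until a fix ships, the recovery path can at least be automated, along the lines of the error threshold mentioned under Impact. The following is a rough sketch only: ResettableConnection is a hypothetical helper, not part of StackExchange.Redis, and it is written against a generic IDisposable so it stays self-contained. In an application, T would presumably be ConnectionMultiplexer and the factory a call that connects with the application's configuration.

```csharp
using System;
using System.Threading;

// Hypothetical helper: counts write failures and rebuilds the connection once a threshold is hit.
public sealed class ResettableConnection<T> where T : class, IDisposable
{
    private readonly Func<T> factory;
    private readonly int threshold;
    private readonly object gate = new object();
    private T current;
    private int failures;

    public ResettableConnection(Func<T> factory, int threshold)
    {
        this.factory = factory;
        this.threshold = threshold;
        current = factory();
    }

    public T Current => Volatile.Read(ref current);

    // Call after a successful write to reset the failure streak.
    public void ReportSuccess() => Interlocked.Exchange(ref failures, 0);

    // Call when a write fails with "Command cannot be issued to a replica".
    // Returns true if this call rebuilt the connection.
    public bool ReportFailure()
    {
        if (Interlocked.Increment(ref failures) < threshold) return false;
        lock (gate)
        {
            // Another thread may have rebuilt the connection while this one waited for the lock.
            if (Volatile.Read(ref failures) < threshold) return false;
            var old = current;
            Volatile.Write(ref current, factory());
            Interlocked.Exchange(ref failures, 0);
            old.Dispose();
            return true;
        }
    }
}
```

Disposing the old instance immediately can fail operations that are still in flight on it, so a real application may prefer to delay the disposal. The threshold trades recovery time against the risk of resetting on unrelated transient errors.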