Thread-safety issue in SupportsPrimaryWrites ??= caching causes permanent "Command cannot be issued to a replica" errors

# Bug Report: Thread-safety race in `SupportsPrimaryWrites` ??= caching causes permanent write failures after failover

## Summary

A race condition in the `SupportsPrimaryWrites` property's `??=` (null-coalescing assignment) caching can permanently poison the cached value during Redis failover events. This causes all write commands to fail with `"Command cannot be issued to a replica"` even though the endpoint IS a primary — and the state never self-heals until the multiplexer is disposed and recreated.

## Affected Versions

Confirmed in 2.8.58. Also present in latest `main` branch (code unchanged).

## The Bug

In `ServerEndPoint.cs`:

```csharp
private bool? supportsPrimaryWrites;

public bool SupportsPrimaryWrites => supportsPrimaryWrites ??= !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;

private void SetConfig<T>(ref T field, T value, ...)
{
    if (!EqualityComparer<T>.Default.Equals(field, value))
    {
        field = value;
        ClearMemoized();
    }
}

private void ClearMemoized()
{
    supportsDatabases = null;
    supportsPrimaryWrites = null;
}
```

The `??=` operator is not atomic. It expands to:

```csharp
if (supportsPrimaryWrites == null) {
    supportsPrimaryWrites = !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;
}
return supportsPrimaryWrites.Value;
```

During a failover, when a replica is promoted to primary (`IsReplica` changes from `true` to `false`), a reader thread can interleave with the reconfigure thread:

```
Thread A (request)                    Thread B (reconfigure: promoting to primary)
──────────────────                    ──────────────────────────────────────────────

1. Reads supportsPrimaryWrites → null
   (enters evaluation branch)

2. Reads IsReplica → true
   (still replica at this instant)
   Computes: !true || !true || false = false

                                      3. SetConfig(ref isReplica, false)
                                         → field = false
                                         → ClearMemoized()
                                         → supportsPrimaryWrites = null  ✓

4. Stores: supportsPrimaryWrites = false
   ← OVERWRITES the null from step 3!
```
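
The interleaving above can be replayed deterministically on a single thread. The following is a minimal sketch, not the library's code: `EndpointModel` and `RaceReplay` are hypothetical stand-ins that model only the two fields involved, and it assumes `ReplicaReadOnly` is `true` and `AllowReplicaWrites` is `false`, as in the diagram.

```csharp
using System;

// Hypothetical stand-in for ServerEndPoint; only the fields involved in the race are modelled.
public sealed class EndpointModel
{
    private bool isReplica = true;        // the node starts out as a read-only replica
    private bool? supportsPrimaryWrites;  // memoized value, null = not yet computed

    public bool IsReplica => isReplica;
    public bool? CachedValue => supportsPrimaryWrites;

    // Thread A, steps 1-2: the cache was observed as null, so compute from the current role.
    // Assumes ReplicaReadOnly = true and AllowReplicaWrites = false, so the result is !isReplica.
    public bool ComputeFromCurrentRole() => !isReplica;

    // Thread B, step 3: SetConfig(ref isReplica, false) followed by ClearMemoized().
    public void PromoteToPrimary()
    {
        isReplica = false;
        supportsPrimaryWrites = null;
    }

    // Thread A, step 4: the second half of ??=, storing the value computed earlier.
    public void StoreComputed(bool value) => supportsPrimaryWrites = value;
}

public static class RaceReplay
{
    public static EndpointModel Run()
    {
        var ep = new EndpointModel();
        bool stale = ep.ComputeFromCurrentRole(); // A computes while the node is still a replica
        ep.PromoteToPrimary();                    // B promotes the node and clears the cache
        ep.StoreComputed(stale);                  // A overwrites the cleared cache
        return ep;
    }

    public static void Main()
    {
        var ep = Run();
        Console.WriteLine($"IsReplica={ep.IsReplica}, cached SupportsPrimaryWrites={ep.CachedValue}");
    }
}
```

After `Run()`, the model reports a primary (`IsReplica` is `false`) whose cached `SupportsPrimaryWrites` is also `false`, which is the stuck state described in the next section.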

## Final State (Permanently Stuck)

```
isReplica = false                    ← CORRECT (node IS primary)
supportsPrimaryWrites = false        ← STALE (cached from when IsReplica was true)
```

## Why It Never Self-Heals

The 60-second heartbeat reads `CLUSTER NODES`, sees the node is primary, and calls:

```csharp
SetConfig(ref isReplica, false);
```

But `isReplica` is **already** `false`, so the new value equals the stored one and the body of the `if` is skipped. `SetConfig` is a **no-op**, `ClearMemoized()` is never called, and the stale cache persists indefinitely.
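
A small model of this behaviour, assuming the `SetConfig` and `ClearMemoized` shapes quoted earlier in this report. `HeartbeatModel`, `HeartbeatDemo`, and `ClearCount` are hypothetical names used only for illustration:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of the post-race state: the role is correct, the cache is stale.
public sealed class HeartbeatModel
{
    private bool isReplica = false;               // already correct: the node is a primary
    private bool? supportsPrimaryWrites = false;  // stale value left behind by the race

    public int ClearCount { get; private set; }   // how many times the cache was cleared
    public bool? CachedValue => supportsPrimaryWrites;

    // Same shape as the SetConfig quoted above: clears the cache only when the value changes.
    private void SetConfig<T>(ref T field, T value)
    {
        if (!EqualityComparer<T>.Default.Equals(field, value))
        {
            field = value;
            ClearMemoized();
        }
    }

    private void ClearMemoized()
    {
        supportsPrimaryWrites = null;
        ClearCount++;
    }

    // Each heartbeat re-applies the role reported by the server, which is unchanged.
    public void Heartbeat() => SetConfig(ref isReplica, false);
}

public static class HeartbeatDemo
{
    public static HeartbeatModel Run(int beats)
    {
        var model = new HeartbeatModel();
        for (int i = 0; i < beats; i++) model.Heartbeat();
        return model;
    }

    public static void Main()
    {
        var model = Run(10);
        Console.WriteLine($"clears={model.ClearCount}, cached={model.CachedValue}");
    }
}
```

No matter how many heartbeats run, `ClearCount` stays at zero in this model and the cached value stays `false`.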

## Impact

- **Routing layer** checks `IsReplica` directly → sees `false` → passes the endpoint as valid primary
- **Bridge layer** checks `SupportsPrimaryWrites` → reads stale cached `false` → throws:
  ```
  RedisConnectionException: InternalFailure on [0]:SETEX key (BooleanProcessor)
   ---> RedisCommandException: Command cannot be issued to a replica: SETEX key
     at PhysicalBridge.WriteMessageToServerInsideWriteLock line 1535
  ```
- **100% of write commands fail** on the affected endpoint until the multiplexer is disposed
- In production, we observed **2,337 errors over 5 minutes** before the application's error threshold triggered a multiplexer reset

## Reproduction

```csharp
using System.Linq;
using System.Reflection;
using StackExchange.Redis;

var mux = await ConnectionMultiplexer.ConnectAsync("your-redis:6380,...");
var db = mux.GetDatabase();

// Verify baseline works
await db.StringSetAsync("test", "value"); // ✓ Success

// Simulate the post-race state: IsReplica=false but supportsPrimaryWrites cached as false
var endpoint = mux.GetEndPoints()[0]; // primary endpoint

// Get the internal ServerEndPoint object
var getServerEpMethod = typeof(ConnectionMultiplexer)
    .GetMethods(BindingFlags.NonPublic | BindingFlags.Instance)
    .First(m => m.Name == "GetServerEndPoint");
var sep = getServerEpMethod.Invoke(mux, new object[] { endpoint });

// Set the stale state (simulates race result)
var spwField = sep.GetType().GetField("supportsPrimaryWrites", BindingFlags.NonPublic | BindingFlags.Instance);
spwField.SetValue(sep, (bool?)false);  // Stale cached value

// IsReplica is still false (correct) - routing will pass
// But SupportsPrimaryWrites returns false (stale) - bridge will block

await db.StringSetAsync("test", "value"); 
// ❌ Throws: "Command cannot be issued to a replica: SETEX"

// ConfigureAsync does NOT fix it (isReplica is already false → SetConfig is no-op)
await mux.ConfigureAsync();
await db.StringSetAsync("test", "value"); 
// ❌ STILL throws!

// Only way to recover: dispose and recreate the multiplexer
```

## Suggested Fix

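### Option A: Publish the cached value only if no clear intervened

A plain compare-exchange against `null` is not enough here: `ClearMemoized()` resets the field to `null`, so the comparison would still succeed after a concurrent clear and the stale value would be stored anyway. In addition, `Interlocked.CompareExchange` and `Volatile.Read` have no overloads for `bool?`. One way around both problems is to pack the cached value and a generation counter into a single `int`, so that a clear landing between compute and store makes the compare-exchange fail:

```csharp
// Low 2 bits: 0 = not computed, 1 = false, 2 = true. Upper bits: generation, bumped on every clear.
private int supportsPrimaryWritesState;

public bool SupportsPrimaryWrites
{
    get
    {
        var snapshot = Volatile.Read(ref supportsPrimaryWritesState);
        if ((snapshot & 3) != 0) return (snapshot & 3) == 2;

        // Compute fresh value
        var computed = !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;

        // Publish only if nothing changed since the snapshot; a concurrent clear bumps
        // the generation, so the exchange fails and the possibly stale value is never cached
        Interlocked.CompareExchange(ref supportsPrimaryWritesState, snapshot | (computed ? 2 : 1), snapshot);
        return computed;
    }
}

private void ClearMemoized()
{
    supportsDatabases = null;
    int seen, next;
    do
    {
        seen = Volatile.Read(ref supportsPrimaryWritesState);
        next = unchecked((seen & ~3) + 4); // drop the cached value, advance the generation
    }
    while (Interlocked.CompareExchange(ref supportsPrimaryWritesState, next, seen) != seen);
}
```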

### Option B: Unconditionally call ClearMemoized in SetConfig

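Move `ClearMemoized()` outside the equality check so that every heartbeat clears the cache, even when the value is unchanged. This does not close the race window, but it bounds the stale state to one heartbeat interval instead of the lifetime of the multiplexer: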

```csharp
private void SetConfig<T>(ref T field, T value, ...)
{
    if (!EqualityComparer<T>.Default.Equals(field, value))
    {
        field = value;
        Multiplexer?.ReconfigureIfNeeded(EndPoint, false, caller!);
    }
    ClearMemoized(); // Always clear, even if value didn't change
}
```

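### Option C: Memory barrier in ClearMemoized + double-check in the getter

`Volatile.Write` has no overload for `bool?`, so this variant keeps plain assignments and uses full fences to order them. The fence in the getter matters: without it, the re-read of `IsReplica` could be reordered ahead of the store to the cache, and the double-check would miss the change.

```csharp
private void ClearMemoized()
{
    // Full fence: the config field written by SetConfig must be visible before the cache is cleared
    Interlocked.MemoryBarrier();
    supportsDatabases = null;
    supportsPrimaryWrites = null;
}

public bool SupportsPrimaryWrites
{
    get
    {
        var val = supportsPrimaryWrites;
        if (val is null)
        {
            val = !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;
            supportsPrimaryWrites = val;

            // Full fence: the re-read below must not move ahead of the store above
            Interlocked.MemoryBarrier();

            // Double-check: if IsReplica changed between our read and store, re-evaluate
            var recheck = !IsReplica || !ReplicaReadOnly || AllowReplicaWrites;
            if (recheck != val)
            {
                supportsPrimaryWrites = null; // Force re-evaluation next time
                return recheck;
            }
        }
        return val.Value;
    }
}
```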

**Option A** is preferred: it is lock-free, and the worst case is a single wasted computation (benign).

## Environment

- SE.Redis 2.8.58 (also present in latest main)
- Azure Redis Cluster (2 shards, 4 nodes)
- .NET 8
- Trigger: Azure Redis planned maintenance / Entra Auth rollout causing node reboots and role promotions

## Workaround

There is no clean workaround from application code. `ConfigureAsync()` does not help because `IsReplica` is already correct. The only recovery is full multiplexer disposal and recreation.

