
feat(brain): migrate MCP SSE sessions from in-memory DashMap to Redis for multi-instance scaling #313

@ruvnet

Summary

The ruvbrain Cloud Run service currently stores MCP SSE sessions in a process-local DashMap (line 182 of routes.rs). This forces max-instances=1 to prevent session-not-found 404s when Cloud Run routes requests across instances. To restore horizontal scaling, sessions must move to a shared external store.

Current State

// crates/mcp-brain-server/src/routes.rs:182-183
let sessions: Arc<dashmap::DashMap<String, tokio::sync::mpsc::Sender<String>>> =
    Arc::new(dashmap::DashMap::new());

Proposed Solution: Memorystore for Redis

Google Cloud Memorystore provides a managed Redis instance that all Cloud Run instances can share.

Google Cloud Resources Required

# 1. Create Memorystore Redis instance (Basic tier, 1GB, us-central1)
gcloud redis instances create ruvbrain-sessions \
  --size=1 \
  --region=us-central1 \
  --tier=basic \
  --redis-version=redis_7_2 \
  --network=default \
  --connect-mode=private-service-access \
  --project=ruv-dev

# 2. Create VPC connector for Cloud Run → Redis
gcloud compute networks vpc-access connectors create ruvbrain-connector \
  --network=default \
  --region=us-central1 \
  --range=10.8.0.0/28 \
  --project=ruv-dev

# 3. Update ruvbrain service with VPC connector and restore scaling
gcloud run services update ruvbrain \
  --region=us-central1 --project=ruv-dev \
  --vpc-connector=ruvbrain-connector \
  --max-instances=10 \
  --set-env-vars="REDIS_HOST=<redis-ip>,REDIS_PORT=6379"

# 4. Store Redis host as a secret
gcloud secrets create REDIS_HOST --data-file=- <<< "<redis-ip>"
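On the application side, the service needs to turn the REDIS_HOST/REDIS_PORT values from step 3 into a connection URL at startup. A minimal sketch, assuming only those env var names; the helper name and default port are illustrative:

// Sketch: build the Redis URL from the env vars set on the Cloud Run
// service in step 3 above. The 6379 fallback is illustrative.
fn redis_url_from_env() -> String {
    let host = std::env::var("REDIS_HOST").expect("REDIS_HOST must be set");
    let port = std::env::var("REDIS_PORT").unwrap_or_else(|_| "6379".to_string());
    format!("redis://{host}:{port}")
}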

Estimated Monthly Cost

Resource                                     Cost
Memorystore Basic 1GB (us-central1)          ~$35/month
VPC connector (f1-micro, min 2 instances)    ~$7/month
Total                                        ~$42/month

Code Changes

1. Add redis dependency to mcp-brain-server

# crates/mcp-brain-server/Cargo.toml
redis = { version = "0.25", features = ["tokio-comp", "connection-manager"] }

2. Replace DashMap with Redis-backed session store

// New: RedisSessionStore
use dashmap::DashMap;
use redis::RedisResult;
use tokio::sync::mpsc::Sender;

pub struct RedisSessionStore {
    pool: redis::aio::ConnectionManager,
    local_senders: DashMap<String, Sender<String>>,
}

impl RedisSessionStore {
    pub async fn new(redis_url: &str) -> RedisResult<Self> {
        let client = redis::Client::open(redis_url)?;
        let pool = redis::aio::ConnectionManager::new(client).await?;
        Ok(Self { pool, local_senders: DashMap::new() })
    }

    pub async fn register(&self, session_id: &str, sender: Sender<String>) {
        // Store the sender locally (an mpsc channel is not serializable)
        self.local_senders.insert(session_id.to_string(), sender);
        // Record session existence in Redis with a TTL so other instances
        // can tell a live session apart from a bogus ID
        let mut conn = self.pool.clone();
        let _: () = redis::cmd("SET")
            .arg(format!("mcp:session:{session_id}"))
            .arg("1")
            .arg("EX").arg(3600) // 1-hour TTL
            .query_async(&mut conn)
            .await
            .unwrap_or(());
    }

    pub async fn get(&self, session_id: &str) -> Option<Sender<String>> {
        // Check local senders first (session lives on this instance)
        if let Some(s) = self.local_senders.get(session_id) {
            return Some(s.value().clone());
        }
        // The session may exist on another instance. We can't forward to a
        // channel we don't hold, so log for observability and return None;
        // the client will reconnect.
        let mut conn = self.pool.clone();
        let exists: bool = redis::cmd("EXISTS")
            .arg(format!("mcp:session:{session_id}"))
            .query_async(&mut conn)
            .await
            .unwrap_or(false);
        if exists {
            tracing::warn!(%session_id, "Session found in Redis but owned by another instance");
        }
        None
    }
}
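Wiring the store into shared state at startup might look like the sketch below; AppState, the sessions field, and the main shape are assumptions for illustration, not the actual crate layout:

use std::sync::Arc;

// Sketch: shared state holding the store. `AppState` and the `sessions`
// field are assumptions for illustration.
#[derive(Clone)]
struct AppState {
    sessions: Arc<RedisSessionStore>,
}

#[tokio::main]
async fn main() -> redis::RedisResult<()> {
    // Reuses the env-var helper sketched in the gcloud section above
    let store = RedisSessionStore::new(&redis_url_from_env()).await?;
    let _state = AppState { sessions: Arc::new(store) };
    // ... build the axum Router with `_state` and serve as before
    Ok(())
}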

3. Update messages_handler

async fn messages_handler(...) -> StatusCode {
    let sender = match state.sessions.get(&query.session_id).await {
        Some(s) => s,
        None => {
            // Log session miss for monitoring
            tracing::warn!(session_id = %query.session_id, "Session not found (may be on another instance)");
            return StatusCode::NOT_FOUND;
        }
    };
    // ... rest unchanged
}

Migration Plan

  1. Provision Memorystore Redis + VPC connector
  2. Add redis dependency to mcp-brain-server
  3. Implement RedisSessionStore with local sender cache + Redis existence tracking
  4. Add session TTL (1 hour) and cleanup
  5. Update Cloud Run service: add VPC connector, set REDIS_HOST env var
  6. Restore max-instances=10
  7. Test: SSE connect on Instance A, POST on Instance B → verify session found (see the test sketch after this list)
  8. Monitor: track session miss rate in logs
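
The Redis half of step 7 can be exercised without two real Cloud Run instances. A hedged test sketch, assuming a reachable test Redis (the REDIS_URL fallback is illustrative): two stores stand in for two instances, and the second should see the key the first registered.

#[tokio::test]
async fn session_registered_on_instance_a_visible_to_instance_b() -> redis::RedisResult<()> {
    let url = std::env::var("REDIS_URL")
        .unwrap_or_else(|_| "redis://127.0.0.1:6379".to_string());

    // Two stores stand in for two Cloud Run instances sharing one Redis
    let instance_a = RedisSessionStore::new(&url).await?;
    let instance_b = RedisSessionStore::new(&url).await?;

    let (tx, _rx) = tokio::sync::mpsc::channel::<String>(8);
    instance_a.register("test-session", tx).await;

    // Instance B holds no local sender, so it cannot forward...
    assert!(instance_b.get("test-session").await.is_none());

    // ...but the session key is visible in shared Redis
    let client = redis::Client::open(url.as_str())?;
    let mut conn = client.get_multiplexed_async_connection().await?;
    let exists: bool = redis::cmd("EXISTS")
        .arg("mcp:session:test-session")
        .query_async(&mut conn)
        .await?;
    assert!(exists);
    Ok(())
}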

Alternative: Streamable HTTP Transport

MCP spec revision 2025-03-26 introduced the Streamable HTTP transport, which can operate statelessly (no server-held sessions) and would eliminate the session-affinity problem entirely. However, Claude Code currently connects over the SSE transport, so this is a longer-term migration.
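
For reference, the stateless shape could be as small as a single POST route that answers each JSON-RPC message directly, with no session registry. A sketch only, assuming axum and serde_json; the route path and handler shape are not a committed design:

use axum::{routing::post, Json, Router};
use serde_json::{json, Value};

// Sketch: a stateless MCP endpoint. Each POST carries a complete JSON-RPC
// request and gets a complete response; no session lookup is performed.
async fn mcp_handler(Json(request): Json<Value>) -> Json<Value> {
    let id = request.get("id").cloned().unwrap_or(Value::Null);
    // ... dispatch on request["method"] as the SSE path does today
    Json(json!({ "jsonrpc": "2.0", "id": id, "result": {} }))
}

fn mcp_router() -> Router {
    Router::new().route("/mcp", post(mcp_handler))
}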

Context

🤖 Generated with claude-flow
