-
Notifications
You must be signed in to change notification settings - Fork 4
feat: Add distributed state manager for multi-node WebSocket support #498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Claude ReviewIssues Found
|
Implements DistributedStateManager service to enable horizontal scaling without sticky sessions: - Redis-backed connection tracking across instances - Distributed user list aggregation from all instances - Atomic leader election using Lua scripts - Cross-instance session membership validation Key changes: - New DistributedStateManager service with comprehensive test coverage - RoomManager integration for seamless multi-instance support - Updated requireSessionMember helper for cross-instance auth - getSessionUsers now async to support distributed queries This enables running multiple backend nodes behind a load balancer without requiring sticky sessions. Leader election is coordinated via Redis, and user lists are aggregated across all instances.
- Add DistributedStateManager to architecture diagram - Update leader election section for distributed election - Expand multi-instance support section with new diagrams - Add distributed state Redis keys documentation - Update graceful shutdown sequence for distributed cleanup - Add distributed state timeouts and limits - Add new files to Related Files section - Update server.ts to call roomManager.shutdown() on exit
f15002f to
24da364
Compare
Claude Reviewpackages/backend/src/services/distributed-state.ts
packages/backend/src/graphql/resolvers/shared/helpers.ts
packages/backend/src/tests/distributed-state.test.ts
|
- Fix race condition in registerClient by awaiting distributed state registration and rolling back local state on failure - Fix missing await in removeClient to prevent ghost connections - Fix findNearbySessions to use distributed state for accurate cross-instance participant counts - Remove unused connectedAt parameter from LEADER_ELECTION_SCRIPT - Add TTL refresh for session members in ELECT_NEW_LEADER_SCRIPT to prevent expiry during long sessions - Add try-catch to test cleanup in all beforeEach blocks for robustness - Fix queue-sync-fixes.test.ts to await async registerClient
Claude ReviewIssues Found
Test QualityThe tests are comprehensive with good coverage of:
Missing test coverage:
DocumentationDocumentation in |
- Fix race condition in leader election by making LEADER_ELECTION_SCRIPT atomically set both leader key and isLeader flag in a single Lua operation - Add error handling for Lua script failures in leaveSession with fallback to clear leader key on error - Add refreshSessionMembership method and update refreshConnection to also refresh session membership TTL for long-running sessions - Fix potential memory leak in cleanupInstanceConnections by force-cleaning failed connection data and wrapping operations in try-catch - Add resetDistributedState function for test isolation and hot-reload
Claude ReviewIssues Found1. Potential race condition in removeConnection (distributed-state.ts:243-270)
2. Missing old leader isLeader flag reset in Lua script (distributed-state.ts:96-140)
3. Singleton reset not thread-safe (distributed-state.ts:648-650)
4. Error swallowed in cleanup (room-manager.ts:409-417)
5. TTL refresh race condition (distributed-state.ts:459-472)
6. Missing test coverage for error paths
7. Documentation updated correctly
|
- Fix race condition in removeConnection by adding automatic leader election when removing a leader (controlled via electNewLeader parameter) - Update ELECT_NEW_LEADER_SCRIPT to reset old leader's isLeader flag before electing new leader, preventing stale flag inconsistencies - Make resetDistributedState async and call stop() to prevent orphaned heartbeat intervals; add forceResetDistributedState for sync reset - Add atomic REFRESH_TTL_SCRIPT Lua script to avoid race conditions in refreshConnection where session could change between read and write - Improve removeClient error handling to return cleanup status and provide more detailed logging about distributed state failures - Simplify cleanupInstanceConnections since removeConnection now handles leader election automatically - Add comprehensive tests for removeConnection with leader election, refreshConnection, and old leader flag reset scenarios
Claude ReviewIssues Found
DocumentationThe |
Implements DistributedStateManager service to enable horizontal scaling
without sticky sessions:
Key changes:
This enables running multiple backend nodes behind a load balancer
without requiring sticky sessions. Leader election is coordinated
via Redis, and user lists are aggregated across all instances.