Skip to content

Conversation

@marcodejongh
Copy link
Owner

@marcodejongh marcodejongh commented Dec 30, 2025

Summary

Comprehensive fix for party mode queue reliability issues, addressing erratic navigation behavior and implementing robust synchronization.

Problem

When navigating rapidly in party mode, server echoes of optimistic updates could overwrite newer local state, causing the queue to jump erratically. Additionally, reconnection after brief disconnections could result in missed events.

Solution

Echo Detection (Previous Commits)

  • Track pending optimistic updates via correlation IDs
  • Skip server events that match pending local updates
  • Add clientId-based detection as fallback for echo identification

Sequence-Based Reliability (This Commit)

  • Add sequence numbers to all queue events for gap detection
  • Add state hashes (FNV-1a) for periodic drift verification
  • Client validates sequence continuity and logs gaps
  • Periodic hash verification every 60 seconds

Delta Sync for Reconnection

  • Store last 100 events per session in Redis (5-min TTL)
  • Add eventsReplay GraphQL query for efficient recovery
  • Client attempts delta sync on reconnection for gaps ≤ 100 events
  • Falls back to full sync for larger gaps or errors

Bug Fix

  • Fix pendingTimestamps Map in QueueContext (was recreated every render, breaking cleanup logic)

Technical Details

Based on industry patterns from:

  • Kafka: Sequence-based ordering and gap detection
  • Spotify playlists: Server-authoritative shared ordered list
  • Database replication: Total ordering + state verification

Files changed:

  • packages/shared-schema/: Updated GraphQL schema and TypeScript types
  • packages/backend/src/: Sequence tracking, hash computation, event buffer
  • packages/web/: Client-side validation, delta sync on reconnection

Test Plan

  • All 286 existing tests pass
  • TypeScript compilation succeeds
  • Manual test: Rapid queue navigation in party mode
  • Manual test: Disconnect/reconnect during active navigation
  • Manual test: Multi-user rapid navigation

🤖 Generated with Claude Code

@vercel
Copy link

vercel bot commented Dec 30, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
boardsesh Ready Ready Preview, Comment Dec 31, 2025 5:44am

@claude
Copy link

claude bot commented Dec 30, 2025

Claude Review

packages/web/app/components/graphql-queue/QueueContext.tsx

  1. Potential memory leak (lines 714-716): The setTimeout creates a closure over pendingUuid but there's no cleanup mechanism. If the component unmounts before the 5-second timeout, the callback still executes and tries to modify the ref. This won't crash but is wasteful. Consider clearing timeouts on unmount or using a cleanup pattern.

  2. Missing tracking in setCurrentClimb method (lines 643-678): Unlike setCurrentClimbQueueItem (line 709), the setCurrentClimb method also calls persistentSession.setCurrentClimb() but doesn't add to pendingCurrentClimbUpdatesRef. This means server echoes for climbs set via setCurrentClimb will still cause the race condition the PR aims to fix.

  3. Race condition with rapid updates (line 250-251): If the same UUID is set twice rapidly (unlikely but possible), deleting from the set on first echo could allow the second echo through. Consider using a counter instead of a boolean presence check, or track timestamps.

@claude
Copy link

claude bot commented Dec 30, 2025

Claude Review

No changes to review. This PR has 0 file changes - the fix commit was reverted by a subsequent commit. The PR description mentions investigating further.

Consider closing this PR if the approach isn't viable, or force-push a new implementation when ready.

@claude
Copy link

claude bot commented Dec 30, 2025

Claude Review

Issues found:

  1. PR description mismatch: Description mentions "Add timeout to cleanup stale pending entries (5 seconds)" and "Remove pending entries on error" but neither feature is implemented in the code.

  2. No tests for new functionality: The isServerEvent flag and pendingCurrentClimbUpdates tracking have no test coverage in reducer.test.ts.

  3. Memory leak potential (reducer.ts:182): The .slice(-10) limit is arbitrary and could still accumulate if server echoes never arrive (network issues). Consider also clearing stale entries on timer or full sync.

@claude
Copy link

claude bot commented Dec 30, 2025

Claude Review

Performance Issue - QueueContext.tsx:259-276: The cleanup effect creates new timeouts for ALL pending updates every time the array changes. This means rapidly navigating through 50 items creates and clears 1+2+3+...+50 = 1275 timeouts instead of 50. Each pending update should have its timeout set only once.

Potential Memory Leak - QueueContext.tsx:264-270: The setTimeout callback captures the uuid by closure but the dispatch function may be stale if the component re-renders. While dispatch from useReducer is stable, this pattern is fragile if refactored.

Logic Issue - reducer.ts:166-168: The deduplication check returns early without updating pendingCurrentClimbUpdates for local updates. If a user rapidly clicks the same climb, it will not be tracked as pending, but the server echo will still arrive and potentially cause issues since it is not in the pending list.

@claude
Copy link

claude bot commented Dec 30, 2025

Claude Review

packages/web/app/components/graphql-queue/QueueContext.tsx:727 - Missing explicit isServerEvent: false in local dispatch. While this works due to !undefined being truthy, it's inconsistent with the test patterns that explicitly set isServerEvent: false. Consider adding for clarity and consistency.

packages/web/app/components/graphql-queue/QueueContext.tsx:294-297 - The cleanup function in useEffect runs on every dependency change, not just unmount. This could cause issues if the component re-renders while timeouts are still active - timeouts get cleared prematurely before the new effect runs.

packages/web/app/components/queue-control/reducer.ts:183 - Potential memory issue: using .slice(-50) on an array that already has 50+ items creates a new array on every local update, even when at the limit. Consider checking length first to avoid unnecessary allocations during rapid navigation.

@claude
Copy link

claude bot commented Dec 30, 2025

Claude Review

Minor Issues:

  1. reducer.ts:186-191 - Missing isServerEvent flag in optimistic update: In QueueContext.tsx:684-687, setCurrentClimbQueueItem dispatches without isServerEvent: false, relying on it being undefined. While this works because undefined is falsy, it's inconsistent with the explicit isServerEvent: true used for server events. Consider being explicit for clarity.

  2. reducer.ts:153-157 - Stale entry cleanup only runs on DELTA_UPDATE_CURRENT_CLIMB: Stale pending entries (>5 seconds) are only cleaned up during DELTA_UPDATE_CURRENT_CLIMB actions. If no navigation happens for a while, stale entries accumulate until the next navigation. Consider also cleaning in CLEANUP_PENDING_UPDATE for consistency.

  3. reducer.ts:161 - Using find when a boolean check suffices: const isPending = freshPending.find(...) returns the object but only the truthy check is used. Could use some() for marginal clarity, though this is minor.

No critical bugs or security issues found. The implementation correctly handles the race condition between optimistic updates and server echoes.

@claude
Copy link

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. Potential race condition with stale entry cleanup (reducer.ts:153-157): The 5-second stale timeout is applied at the beginning of every reducer call, meaning an update that arrives at 4.9 seconds works, but one at 5.1 seconds is considered fresh from a different source. This could cause inconsistent behavior at boundary conditions.

  2. Echo detection logic gap (reducer.ts:173-180): The fallback pending-list check only runs when !eventClientId, but if eventClientId is present and doesn't match myClientId, the code falls through without checking the pending list. This means events from other clients that happen to match a pending UUID will still update state (which is correct), but the pending entry is never cleaned up - it just waits to timeout.

  3. GraphQL schema breaking change (schema.ts:531): clientId: ID! is marked as non-nullable, but the backend could potentially send null if ctx.connectionId is undefined. Consider making it nullable (ID) or ensuring the backend always provides a value.

  4. Missing cleanup timeout for pending updates: While there's a CLEANUP_PENDING_UPDATE action and error handling calls it, there's no scheduled timeout to clean up pending updates that never receive a server echo (e.g., if the network request failed silently). The 5-second filter in the reducer only runs on state transitions, so orphaned entries could persist indefinitely in an idle queue.

  5. Test file size (pending-updates-integration.test.ts): 473 lines of test code is substantial - consider splitting into separate test files per scenario category for better maintainability.

@claude
Copy link

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. packages/web/app/components/persistent-session/persistent-session-context.tsx:636-637 - Uses require() inside useEffect for hash module. This is anti-pattern in React/ESM - should be a top-level import.

  2. packages/web/app/components/persistent-session/persistent-session-context.tsx:232 - lastReceivedSequence in handleQueueEvent dependency but mutated via setLastReceivedSequence inside callback creates stale closure risk. The sequence gap check compares against potentially stale lastReceivedSequence value.

  3. packages/backend/src/services/room-manager.ts:590 - sequence: result[0].version comment says "Use version as sequence for now (will be separate column later)" but this inconsistency means Postgres-only reads return version-as-sequence while Redis stores actual sequence. Could cause sequence mismatches across restarts.

  4. packages/web/app/components/graphql-queue/QueueContext.tsx:268-293 - The pending update cleanup interval runs a forEach dispatching multiple actions in quick succession. Could cause render thrashing. Consider batching cleanup into single action.

  5. packages/web/app/components/queue-control/reducer.ts:188-195 - UUID-based fallback heuristic hasPendingWithSameUUID only checks if pendingUpdates.length > 0, ignoring whether UUIDs actually match. This could incorrectly skip legitimate server updates.

  6. packages/backend/src/pubsub/index.ts:269-271 - Empty catch block silently swallows errors. Even though logged in storeEventInBuffer, the .catch((error) => {}) pattern hides intent.

  7. packages/web/app/components/graphql-queue/QueueContext.tsx:273-287 - 10-second cleanup timeout + 5-second check interval means stale pending updates could block legitimate server events for up to 15 seconds before cleanup.

claude and others added 9 commits December 31, 2025 15:07
When navigating rapidly in party mode, server echoes of optimistic updates
could overwrite newer local state, causing the queue to jump erratically.

The fix tracks pending optimistic updates and skips server events that match
them. This ensures that when a user swipes quickly through the queue, each
navigation is preserved and not reverted by delayed server responses.

- Add pendingCurrentClimbUpdatesRef to track UUIDs of pending updates
- Skip CurrentClimbChanged events that match pending updates
- Clear pending updates on FullSync (full state reset)
- Add timeout to cleanup stale pending entries (5 seconds)
- Remove pending entries on error to prevent future updates being blocked
The previous fix for queue navigation race conditions was causing
the party connection to drop on each navigation. Reverting to
investigate further.
When navigating rapidly in party mode, server echoes of optimistic updates
could overwrite newer local state, causing the queue to jump erratically.

This fix uses the reducer to track pending local updates instead of a ref,
which is safer and avoids potential issues with modifying state during
subscription callbacks.

Changes:
- Add pendingCurrentClimbUpdates array to QueueState
- Add isServerEvent flag to DELTA_UPDATE_CURRENT_CLIMB action
- Reducer tracks local updates as pending, skips server echoes that match
- Pending list is bounded to last 10 entries to prevent unbounded growth
- Clear pending updates on INITIAL_QUEUE_DATA (full sync)
This commit addresses critical issues in the pending updates mechanism
and adds comprehensive test coverage:

Fixes:
- Add missing 5-second timeout cleanup for stale pending updates
- Add error handling to remove pending updates on failed server requests
- Increase pending array limit from 10 to 50 items for rapid navigation
- Fix logic order: check pending updates before deduplication
- Fix test initialization to include pendingCurrentClimbUpdates

Changes:
- Add CLEANUP_PENDING_UPDATE action type and reducer handler
- Add useEffect in QueueContext for timeout-based cleanup
- Add error handling in setCurrentClimbQueueItem
- Fix order of checks in DELTA_UPDATE_CURRENT_CLIMB reducer

Tests Added:
- 13 new unit tests for server event handling and cleanup
- 6 new integration tests covering rapid navigation and edge cases
- All 79 queue-control tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
CRITICAL FIX: The previous timeout cleanup implementation caused
cascading re-render storms leading to component crashes and websocket
reconnection loops.

Problem:
- useEffect re-ran on EVERY pendingCurrentClimbUpdates change
- Cleared ALL timeouts and recreated them for ALL items
- Rapid clicks created exponential effect re-runs
- Component crashes → websocket reconnections → multiple connections

Solution:
- Use useRef Map to track timeouts per UUID
- Only create timeouts for NEW items (not already tracked)
- Only clear timeouts for items removed from pending
- Remove dispatch from dependency array (it's stable)

Impact:
- Eliminates re-render storm during rapid navigation
- Prevents component crashes and websocket reconnection loops
- Maintains 5-second timeout cleanup for stale pending updates
- All 79 tests still passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace complex timeout-based cleanup with simpler timestamp-based cleanup directly in the reducer. This eliminates ~40 lines of useEffect complexity and prevents potential re-render storms.

Changes:
- Change pendingCurrentClimbUpdates from string[] to Array<{ uuid: string; addedAt: number }>
- Add automatic stale entry filtering (>5 seconds) in DELTA_UPDATE_CURRENT_CLIMB reducer
- Remove entire timeout useEffect and useRef Map from QueueContext.tsx
- Update CLEANUP_PENDING_UPDATE to work with timestamp objects
- Update all test assertions to work with timestamp objects
- Add test for timestamp-based stale cleanup

Benefits:
- Pure reducer logic with no side effects
- Deterministic cleanup on every state transition
- No crash loops from effect dependencies
- Easier to test and maintain
- Same 5-second cleanup behavior, simpler implementation

All 80 tests passing ✓

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace UUID-based echo detection with client ID matching to properly distinguish between own updates and updates from other users. This fixes the issue where other users' queue changes caused weird jumps in the play view.

**Problem:**
- Previously used UUID matching to detect server echoes
- If User A navigated to climb-2 (pending) and then User B also navigated to climb-2, User A would skip User B's update because climb-2 was in the pending list
- This caused "weird jumps" when other clients changed the queue state

**Solution:**
- Add clientId to CurrentClimbChanged event payload (schema, backend, subscription)
- Primary echo detection: compare eventClientId === myClientId
- Fallback to pending list for backward compatibility
- Updates from other clients are now properly applied

**Changes:**
- Schema: Added clientId: ID! to CurrentClimbChanged type
- Backend: Include ctx.connectionId as clientId when publishing events
- Subscription: Query clientId field in CurrentClimbChanged fragment
- Reducer: Use clientId comparison as primary echo detection method
- QueueContext: Pass eventClientId and myClientId in dispatch
- Tests: Added 4 new tests for clientId-based detection scenarios

**Test Coverage:**
- Own echo skipping (clientId match)
- Other client updates applied (clientId mismatch)
- Same climb from different clients handled correctly
- Fallback to pending list when clientIds unavailable

All 84 tests passing ✓ (+4 new tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements a comprehensive reliability system for party mode queue synchronization
based on industry patterns from Kafka, Spotify playlists, and database replication.

Phase 1 - Sequence Numbers & State Hashes:
- Add sequence number to all queue events for gap detection
- Add state hash (FNV-1a) for periodic drift verification
- Client validates sequence continuity and logs gaps
- Periodic hash verification every 60 seconds

Phase 2 - Delta Sync (Event Replay):
- Store last 100 events per session in Redis (5-min TTL)
- Add eventsReplay GraphQL query for efficient recovery
- Client attempts delta sync on reconnection for gaps ≤ 100
- Falls back to full sync for larger gaps or errors

Phase 3 - Bug Fix:
- Fix pendingTimestamps Map in QueueContext (was recreated every render)
- Now uses useRef for proper persistence across renders

This builds on the previous echo detection work to provide:
- Gap detection during normal operation
- Efficient recovery from brief disconnections
- State verification to detect any drift
- Graceful degradation to full sync when needed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…eview

- Remove ClientQueueEvent type - use unified QueueEvent everywhere
- Remove GraphQL field aliases (addedItem→item, currentItem→item)
- Fix stale closure in sequence gap detection using ref
- Add batch cleanup action to prevent render thrashing
- Fix updateQueueState calls with missing sequence/stateHash params
- Add auto-resync on hash mismatch detection
- Keep state hash in sync after delta events
- Add sequence column to Postgres (migration 0027)
- Add exponential backoff for session restoration lock
- Reduce cleanup timing from 10s/5s to 5s/2s
- Add proper error logging for event buffer failures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. Race condition in addQueueItem mutation (packages/backend/src/graphql/resolvers/queue/mutations.ts:79-84): The sequence is fetched after the queue is modified, but before the event is published. Another concurrent mutation could increment the sequence between the state update and getQueueState() call, leading to incorrect sequence numbers on events.

  2. Stale closure in reconnection handler (packages/web/app/components/persistent-session/persistent-session-context.tsx:439): lastReceivedSequence state is captured by closure at effect creation time. The ref (lastReceivedSequenceRef) was added for handleQueueEvent, but handleReconnect still uses the stale state variable directly.

  3. Missing input validation for sinceSequence (packages/backend/src/graphql/resolvers/sessions/queries.ts:46): The sinceSequence parameter isn't validated with a Zod schema like other inputs - negative values or non-integers could cause unexpected behavior.

  4. State hash computed on sorted UUIDs ignores queue order (packages/backend/src/utils/hash.ts:49): Sorting queue UUIDs means queue reordering won't change the hash, so QueueReordered events won't be detected by hash verification.

  5. Test data type mismatch (packages/web/app/components/queue-control/__tests__/reducer.test.ts:899, 933, 970): Tests pass { uuid: string, addedAt: number } objects for pendingCurrentClimbUpdates, but the implementation now expects string[] (correlation IDs). These tests may not be properly validating the new behavior.

  6. Unnecessary state updates in hash sync (packages/web/app/components/persistent-session/persistent-session-context.tsx:315-319): setLastReceivedStateHash is called on every queue or currentClimbQueueItem change, even when the hash doesn't change. Consider adding a comparison before updating.

GraphQL doesn't allow the same field name with different nullability in
a union type. QueueItemAdded.item is non-null while CurrentClimbChanged.item
is nullable, causing subscription errors.

Solution:
- Restore field aliases in GraphQL operations (addedItem: item, currentItem: item)
- Add SubscriptionQueueEvent type for client-side use with aliased field names
- Add transformToSubscriptionEvent function to convert eventsReplay responses
- Update all client-side handlers to use SubscriptionQueueEvent

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Dec 31, 2025

Claude Review

Issues Found:

  1. Stale closure in hash verification - persistent-session-context.tsx:519: The handleReconnect function captures queue and currentClimbQueueItem from the effect closure. When called from the 60-second hash verification interval (line 714), these values may be stale since the interval closure also captures old state values.

  2. Potential race condition in sequence tracking - persistent-session-context.tsx:265-268: updateLastReceivedSequence updates both a ref and state. The ref is updated synchronously but the state update is async. If multiple events arrive rapidly, the ref and state could become inconsistent.

  3. Missing error boundary for event replay - persistent-session-context.tsx:496: When replaying events from eventsReplay, errors in individual event handlers are not caught. A single malformed event could break the entire delta sync.

  4. Hash sorting creates non-deterministic queue order comparison - packages/backend/src/utils/hash.ts:50 and packages/web/app/utils/hash.ts:29: The hash sorts UUIDs before hashing, meaning queue order changes won't be detected by the hash. This is intentional for set-membership detection but means reorder-only bugs won't trigger resync.

  5. Defensive check triggers resync repeatedly - persistent-session-context.tsx:737-752: The effect that checks if currentClimbQueueItem exists in queue runs on every state change. If there's a legitimate edge case where current climb isn't in queue (e.g., removed by another user), this will trigger infinite resync attempts.

  6. jose downgrade from v6 to v4 - packages/backend/package.json: Downgrading jose from ^6.0.0 to ^4.15.9 is a major version downgrade. If intentional, should be documented; if unintentional, could introduce compatibility issues.

  7. Unused import in reducer test - packages/web/app/components/queue-control/__tests__/reducer.test.ts: Test file imports and structure look correct, but ensure test coverage for edge cases in DELTA_UPDATE_CURRENT_CLIMB with missing correlationId.

  8. Event buffer size mismatch - packages/backend/src/pubsub/index.ts:9 stores 100 events, but persistent-session-context.tsx:483 says delta sync works for gaps ≤100. Off-by-one: if exactly 100 events occurred, the oldest one needed is trimmed.

Minor:

  • room-manager.ts:160-169: When restoring from Postgres, stateHash is computed but then also stored in Redis - redundant computation if stateHash was already stored in Postgres.

@claude
Copy link

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. Jose version downgrade (packages/backend/package.json): Downgrading jose from ^6.0.0 to ^4.15.9 is a significant version regression. This should be documented with the reason - is there a compatibility issue with graphql-ws or another dependency?

  2. Hash state drift on queue order changes (packages/backend/src/utils/hash.ts:50, packages/web/app/utils/hash.ts:29): The hash sorts queue UUIDs before hashing, which means queue reordering won't change the hash. This defeats the purpose of detecting reorder-related state drift. If clients reorder items differently, the hash will still match incorrectly.

  3. Missing optimistic locking in removeQueueItem (packages/backend/src/graphql/resolvers/queue/mutations.ts:102-127): Unlike addQueueItem and setCurrentClimb, removeQueueItem doesn't use a retry loop with version checking. This could cause race conditions with concurrent modifications.

  4. Missing optimistic locking in mirrorCurrentClimb (packages/backend/src/graphql/resolvers/queue/mutations.ts:240-271): Same issue - no retry loop for version conflicts.

  5. Stale closure risk in hash verification (packages/web/app/components/persistent-session/persistent-session-context.tsx:519): The handleReconnect function captures queue and currentClimbQueueItem at effect creation time, but these values may be stale when the reconnect actually happens. This could cause incorrect hash comparisons.

  6. Potential infinite loop in state inconsistency check (packages/web/app/components/persistent-session/persistent-session-context.tsx:737-752): If triggerResyncRef.current() fails to fix the state inconsistency (e.g., the current climb truly shouldn't be in the queue), this effect will keep triggering resyncs indefinitely. There should be a debounce or max retry limit.

  7. Test file includes unused imports (packages/backend/src/__tests__/queue-sync-fixes.test.ts:6-7): db and schema tables (boardSessions, boardSessionQueues) are imported but appear unused (using sessions and sessionQueues instead).

@marcodejongh marcodejongh merged commit 4842c8a into main Dec 31, 2025
6 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants