More robust SignalR connection handling#2317
Conversation
📝 WalkthroughWalkthroughAdds network-aware SignalR push-listener recovery for MAUI and web: introduces an ChangesSignalR Push Listener Recovery
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
The latest updates on your projects. Learn more about Argos notifications ↗︎
|
545a4b9 to
c498cb3
Compare
ListenForProjectChanges previously registered a project in the
_reconnectProjects table only AFTER its state-based early returns. A
project subscribed while the underlying HubConnection was Reconnecting
would silently never be resubscribed when the connection came back.
Hoist the registration above the state checks so the comment ("Reconnected
handler will resubscribe...") is actually true.
Also extract the Reconnecting auth-loss decision into a static internal
method so the auth-stop behavior can be unit-tested without a real
HubConnection or full DI graph. Add two unit tests covering token-present
(no-op) and token-null (evict + stop) branches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously OnAuthenticationChanged only invalidated the cached project list — the hub connection stayed up with whatever token it was built with. After a logout/login cycle the listener kept running with stale auth until the next reconnect attempt happened to detect it. Track each project we're asked to listen for in a stable dictionary (separate from the per-connection _reconnectProjects table) so we have a list to re-subscribe even after the underlying HubConnection has been torn down. On every auth event, evict the cached connection and walk that list to (re)start listeners; on logout the new connections fail fast at the auth check, on login they come up with the fresh token — auto-recovery without waiting for the user to manually sync or reopen the project. EvictAndStopIfCached is extracted as a static internal helper to match the HandleReconnecting pattern; add two unit tests for its empty-cache and cached-connection branches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
If the initial StartLexboxProjectChangeListener call fails (e.g. user was offline at project-open), the WithAutomaticReconnect policy never kicks in — the listener stays dead until something re-triggers ListenForProjectChanges. Today the only triggers are project open and the FwLiteWeb hub's OnConnectedAsync; a long-running session can sit indefinitely with no push notifications. Fire-and-forget a ListenForProjectChanges call after every successful SyncService.ExecuteSync. If the listener is already running the cache short-circuits and it's effectively free; if it died (cold-start failure, transient network) this is the recovery trigger. Wrap in try/catch so a listener problem can't masquerade as a sync failure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When two ListenForProjectChanges callers race past the cache check (both see it empty), both build a HubConnection. The second wins the cache slot; the first is evicted via the IMemoryCache eviction callback and disposed. Disposing fires its Closed event handler, which unconditionally called cache.Remove — evicting the *replacement* connection that just took its place. The next ListenForProjectChanges call finds an empty cache and races again. The race is exposed (not caused) by the post-sync piggyback firing ListenForProjectChanges on every successful sync — observed during two-instance manual testing as a cascade where the cache stayed empty and connections kept cycling. Guard the cache eviction with a ReferenceEquals check so an orphaned Closed event can't take down the live cached connection. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Dispose the HubConnection when StartAsync fails (was leaked on cold-start retry) - Evict a dead cached connection if Disconnected-recovery StartAsync fails - Broaden the subscribe SendAsync catch so a mid-call drop can't fail project-open - Extract RegisterForResubscribe and unit-test the subscription-ordering invariant - HandleReconnecting takes a bool; null token = logged out (OAuthClient owns transient-vs-logout) - MAUI: ConnectivitySyncTrigger re-establishes listeners when connectivity returns Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PushListenerRecoveryService re-runs EnsureListenersForTrackedProjects every 5 minutes so a listener that failed to start (e.g. offline at project-open) recovers without a manual sync, on platforms with no connectivity events (FwLiteWeb). Servers with a live connection are skipped via a pre-loop snapshot so the periodic pass is a no-op once connected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… recovery Two paths could leave sibling projects silently unsubscribed while their server reported Connected: a manual revive (StartAsync does not raise Reconnected) resubscribed only the requesting project, and the recovery pass skipped whole servers, so a connection rebuilt by a single-project path (post-sync, project-open) never picked the siblings back up. Extract the Reconnected handler body into ResubscribeRegisteredProjects, call it after a successful revive, and make EnsureListenersForTrackedProjects skip per project (connected AND registered) instead of per server. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
HandleReconnecting removed the cache entry by key, and the revive-failure path evicted whatever was cached — either could race a rebuild and evict a live replacement connection, orphaning it while callers believe the server has no listener. Apply the same ReferenceEquals guard the Closed handler uses: HandleReconnecting only evicts its own entry (still stops itself — it has no token), and EvictAndStopIfCached takes an optional expected instance for callers holding a specific dead connection. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Concurrent ListenForProjectChanges callers that both missed the cache each built and started a connection; the cache replacement then disposed the loser mid-use and silently discarded its resubscribe registrations. Concurrent revives of the same Disconnected connection were worse: the losing StartAsync threw and evicted the winner. A per-server gate around build-and-cache and revive (with re-checks inside) leaves one connection per server by construction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The HubConnection logged only to its own console-only factory, so reconnect attempts and close reasons never reached the app log file -- production logs could not answer why a connection died. Register the app logger factory in the builder''s service collection so SignalR client logs flow through the app''s providers and filters. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
An OS-frozen app (Android Doze) can return to the foreground with its push listener down and no connectivity transition for the connectivity trigger to react to, leaving recovery to the 5-minute periodic backstop. Resume is exactly when the user is watching, so recover immediately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fixed backoff topped out at 60s, and since SignalR cannot be hurried mid-delay, that was the worst-case window where HTTP already works but pushes are still down (e.g. after a server deploy, where no client-side signal exists to trigger recovery). Retry every 10s while the device reports online -- the failure is server-side and could end any moment -- and back off to 60s while offline, where every attempt is a pointless local failure. Hosts without a connectivity API (FwLiteWeb) report always-online. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A connection sleeping in its reconnect backoff cannot be hurried -- every recovery mechanism found it Reconnecting and correctly stood down, so the committed delay was the floor on recovery latency even when the network was demonstrably back. Callers with strong evidence (connectivity regained, app resume, a successful sync to that server) now stop the mid-backoff connection and rebuild through the normal serialized path, which also resubscribes every tracked project. Weak-signal callers (the periodic backstop) never kick: a wrong kick trades the automatic retry loop for the much slower backstops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ndler Remove the StartAsync retry TODO (post-sync, connectivity, resume and periodic recovery now cover it), fix comments referencing the renamed retry policy, and catch TriggerSync failures in the OnProjectUpdated handler -- it throws when the background sync service is not running and SignalR would swallow that into its own logger. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Reconnecting handler decided "logged out" from whether a token could be fetched right then -- but reconnecting happens precisely when the network is flaky, and fetching a token can require a refresh round-trip that fails transiently, which read as a false logout and tore down a listener that should have kept retrying. Switch the signal to IsSignedIn (a local-only account-presence read that flips only on a real logout), now available from the merged OAuth fixes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c498cb3 to
29c91a8
Compare
C# Unit Tests165 tests 165 ✅ 23s ⏱️ Results for commit 4177009. ♻️ This comment has been updated with latest results. |
Move auth-change listener recovery out of LexboxProjectService into a dedicated AuthChangeListenerTrigger hosted service, so the subscription lifecycle no longer rides on the service's IDisposable. Report connectivity from the OS network-interface table (NetworkInterfaceNetworkStatus) rather than assuming always-online, and add NetworkChangeSyncTrigger so web hosts re-ensure listeners when connectivity returns (the cross-platform counterpart to MAUI's ConnectivitySyncTrigger). StartLexboxProjectChangeListener now checks the connection cache before any auth call and gates a new build on IsSignedIn (account presence) instead of GetCurrentToken: a transient token-refresh failure during a reconnect could otherwise be mistaken for a logout and tear down a healthy auto-reconnecting connection (the same false-logout trap HandleReconnecting already avoids). Adds a regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
29c91a8 to
40068a1
Compare
Each recovery path now announces itself: the auth-change, MAUI-connectivity and web-network-availability triggers log at startup that they're watching, the MAUI resume hook logs when it fires, and the periodic backstop logs a debug heartbeat per check. The connectivity/network triggers already logged their fire events; auth-change fire detail stays in HandleAuthChanged. Makes it easier to tell which of the several recovery paths did (or didn't) run when debugging listener issues. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ne verdict A health check that failed while offline was cached as unhealthy for 30 minutes, so a sync triggered by network recovery (this branch's reconnect-sync) returned the stale "unhealthy" verdict and refused to sync until the cache expired or the app restarted. On a genuine connect transition — where SignalR reaching the server proves it's reachable — drop the cached health verdict (and the per-server projects list, which has the same stale-offline-failure problem) so the catch-up sync re-probes. Also quiet the expected ObjectDisposedException when stopping an evicted connection that's mid-reconnect (it has already disposed itself; logged at Debug), and downgrade the per-attempt reconnect log to Debug. Verified on Android: cold-start-while-offline then online now syncs within seconds, where it previously stayed "Offline" with pending downloads for the full cache window. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….user Reframe the connectivity-change log (was a temporary diagnostic) as a permanent event so real-device behaviour stays legible, and gitignore the local-harmony *.props.user artifact. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
C# FwHeadless Unit Tests48 tests 48 ✅ 16s ⏱️ Results for commit 4177009. ♻️ This comment has been updated with latest results. |
|
Caution Review failedAn error occurred during the review process. Please try again later. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…rror When the device was offline the sync indicator/dialog stayed green and a manual sync surfaced a raw UnknownHostException: the server-health cache can still report healthy from an earlier online sync, so ExecuteSync sailed past the ShouldSync gate and the actual transfer threw — which bubbled to the UI and skipped the offline status update. - ExecuteSync now trusts the device's network state: if offline, report Offline and return without attempting the transfer; also catch a mid-sync HttpRequestException and map it to Offline. Either path fires an Offline sync event, so the indicator dot and the sync dialog reflect offline. - SyncDialog handles an un-synced result gracefully (gentle "you're offline" toast, no optimistic zeroing of pending counts) rather than letting it reach the global error handler. Verified on Android: an offline sync now shows "Offline" + cloud-off with a friendly toast; going back online still syncs normally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Widen the ExecuteSync catch to OperationCanceledException so an HttpClient timeout reads as Offline instead of leaking a raw error. ExecuteSync takes no cancellation token, so an OCE there is always a transport timeout, never a caller abort. The CRDT sync dialog showed "You're offline" for every non-synced outcome (also NoServer/NotLoggedIn); use a status-neutral message and drop a dead null-guard. The persistent status indicator still shows the specific status. Re-extract catalogs and add translator context for the new string. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ExecuteSync now reports offline/connectivity failures as an unsynced result instead of throwing, so UploadProject's catch no longer fired on a failed upload: the project was left bound to a server it never reached and the /upload route returned Ok(). Honor the unsynced result so the origin is rolled back and the failure surfaces, restoring the pre-offline-handling contract. Found by Devin review of #2317. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… too The PushListenerRecoveryService comment claimed the connectivity trigger was MAUI-only and that FwLiteWeb had none, but this branch added NetworkChangeSyncTrigger for exactly that. Correct it, and state the real reason the periodic backstop still matters: the OS network signal is optimistic (a virtual adapter reads "available"), so a real-uplink return may not fire. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)
32-35: 🏗️ Heavy liftAdd an untrack path for
_trackedProjectsto prevent stale recovery/sync fan-out.
_trackedProjectsis session-lifetime and currently only grows. Over long sessions, closed/removed projects will still be iterated by periodic recovery and reconnect sync triggers, which can create avoidable background work. Consider adding an explicit unregistration hook on project close/removal and pruning dead entries.Also applies to: 395-401
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` around lines 32 - 35, The _trackedProjects ConcurrentDictionary grows indefinitely over the session lifetime and never removes closed or removed projects, causing unnecessary iteration and background work in periodic recovery and reconnect sync operations. Add an explicit untracking method that removes entries from _trackedProjects when projects are closed or removed, and call this method in the appropriate project lifecycle handlers (project close/removal events or callbacks). This ensures dead entries are pruned and don't create avoidable background work during periodic recovery and reconnect sync triggers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 32-35: The _trackedProjects ConcurrentDictionary grows
indefinitely over the session lifetime and never removes closed or removed
projects, causing unnecessary iteration and background work in periodic recovery
and reconnect sync operations. Add an explicit untracking method that removes
entries from _trackedProjects when projects are closed or removed, and call this
method in the appropriate project lifecycle handlers (project close/removal
events or callbacks). This ensures dead entries are pruned and don't create
avoidable background work during periodic recovery and reconnect sync triggers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: df2671f8-6e2c-4bd2-a51e-a0202d3ef5f2
📒 Files selected for processing (29)
.gitignorebackend/.editorconfigbackend/FwLite/FwLiteMaui.Tests/ConnectivitySyncTriggerTests.csbackend/FwLite/FwLiteMaui/App.xaml.csbackend/FwLite/FwLiteMaui/FwLiteMauiKernel.csbackend/FwLite/FwLiteMaui/Services/ConnectivityNetworkStatus.csbackend/FwLite/FwLiteMaui/Services/ConnectivitySyncTrigger.csbackend/FwLite/FwLiteShared.Tests/Projects/AdaptiveRetryPolicyTests.csbackend/FwLite/FwLiteShared.Tests/Projects/LexboxProjectServiceTests.csbackend/FwLite/FwLiteShared/FwLiteSharedKernel.csbackend/FwLite/FwLiteShared/Projects/AuthChangeListenerTrigger.csbackend/FwLite/FwLiteShared/Projects/LexboxProjectService.csbackend/FwLite/FwLiteShared/Projects/PushListenerRecoveryService.csbackend/FwLite/FwLiteShared/Services/INetworkStatus.csbackend/FwLite/FwLiteShared/Services/NetworkInterfaceNetworkStatus.csbackend/FwLite/FwLiteShared/Sync/SyncService.csbackend/FwLite/FwLiteWeb/FwLiteWebKernel.csbackend/FwLite/FwLiteWeb/Services/NetworkChangeSyncTrigger.csbackend/FwLite/LcmCrdt.Tests/CrdtHttpSyncServiceTests.csbackend/FwLite/LcmCrdt/RemoteSync/CrdtHttpSyncService.csfrontend/viewer/src/locales/en.pofrontend/viewer/src/locales/es.pofrontend/viewer/src/locales/fr.pofrontend/viewer/src/locales/id.pofrontend/viewer/src/locales/ko.pofrontend/viewer/src/locales/ms.pofrontend/viewer/src/locales/sw.pofrontend/viewer/src/locales/vi.pofrontend/viewer/src/project/SyncDialog.svelte
Resolves #2331 — production symptoms and log evidence live there.
From manual testing, syncing (both pulling and pushing) appear to recover very quickly on connectivity and auth changes, even when opening projects when offline or unauthenticated, which previously caused trouble. The sync dialog also refreshed and keeps state info up-to-date. Tested most thoroughly on Android, but also on desktop (MAUI).
FwLite listens for Lexbox project-change push notifications over a per-server SignalR
HubConnection. After network loss or an OS-frozen app (Android Doze) that connection could end up dead-but-cached: opening a project occasionally threwSendCoreAsync … not active, but more often push just silently stopped until a manual sync or restart. This makes the whole lifecycle robust, on the FwLite client side.Don't get stuck. Check auth before reconnecting or handing back a cached connection; abort the infinite retry loop only on a real logout — decided by local account presence (
IsSignedIn), not by whether a token can be fetched mid-outage, which was a false-logout trap. GuardClosed/eviction withReferenceEqualsso a live replacement is never evicted, tolerateSendAsyncracing a drop, dispose on failedStartAsync, and serialize connection creation per server so concurrent callers can't build duplicates that dispose each other mid-use.Always recover. A new
_trackedProjectsmap (kept across rebuilds) lets listeners be re-established from every angle: auth change, successful sync, MAUI connectivity-regained, MAUI app-resume, and a 5-minute periodic backstop. All idempotent, and healing is per-project so siblings on a rebuilt connection get resubscribed rather than silently dropped.Recover faster. Reconnect backoff adapts to device-reported network state (retry ~10s while online, back off to 60s while offline); and because SignalR has no retry-now API, strong recovery signals (connectivity/resume/post-sync) interrupt a mid-backoff reconnect instead of waiting it out.
Show offline, don't error. When a sync can't reach the server — connection drop or timeout — it now reports an
Offlinestatus instead of surfacing a raw connection error, and the FwLite sync dialog shows a brief offline notice while leaving the pending-change counts intact rather than optimistically zeroing them.Be diagnosable. HubConnection's own logs now flow into the app logger, so the next occurrence is debuggable from user logs.
Depended on #2308 (merged) — the "logged out" signal is only trustworthy with those OAuth fixes in place. New/updated unit tests cover the reconnect decision, adaptive backoff, eviction guards, and resubscribe bookkeeping; the connection-orchestration paths that need a live
HubConnectionare exercised manually (cold-start network isolation needs per-process network control, not in CI). Manual check for the offline path: go offline, press Sync → the dialog shows the offline notice and the pending-change counts are preserved.Housekeeping in this PR:
.editorconfigsilences IDE0061/IDE0022 for the expression-bodied members added here, and.gitignoreexcludes*.props.userand a local-onlysignalr-acceptance-tests.mdmanual-test cookbook.Known follow-up:
_trackedProjectsis process-lifetime with no untrack path — a project opened once keeps being resubscribed for the session (bounded by distinct projects opened, not per-open). Intentional for now; revisit if listeners should stop on project-close.🤖 Generated with Claude Code