What the user actually sees
A user downloaded a project over flaky internet on Android (build v2026-05-06-9470018e, commit 9470018e). End state, observed in screenshots from the user's device:
Project list shows "Synced with Lexbox" under the project name — misleading. The local .sqlite is 1 page, schema_version 0, no tables in sqlite_master. The "Synced" label and the project name come from ProjectDataCache (project-cache.json), not from the database.
Opening the project shows "No entries found" — silently. No error toast, no recovery affordance, no indication that anything is wrong with the file.
The only signal that something is amiss is a small red dot on "Synchronize" in the side menu, which does not tell the user what is wrong or how to fix it.
A separate MsalClientException: authentication_canceled toast appears intermittently — that is a different, already-known bug (infinite-retry SignalR HubConnection that never reissues a new connection after the underlying auth dies). Out of scope for this issue.
Why local reproduction shows a different symptom
When the broken .sqlite is opened against the current develop build, OpenCrdtProject throws System.InvalidOperationException: Sequence contains no elements at CurrentProjectService.cs:117:
That code was added by #2219 on 2026-05-12 — after the user's 2026-05-06 build was cut. On the user's older build there is no FirstAsync post-migration, so OpenCrdtProject succeeds, the entries query against the empty Entry table returns 0 rows, and the UI silently shows "No entries found." Today's build at least throws an exception; the version actually in users' hands silently lies about state. Both behaviours stem from the same underlying bug (an orphaned .sqlite file with no schema, no ProjectData).
Reproduction log on current code (the exception path): bin/Debug/net10.0/win-x64/fw-lite-web.log lines 16267-16797.
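Root cause chain
1. CreateProject leaves the orphan
backend/FwLite/LcmCrdt/CrdtProjectsService.cs:133-195
CreateProject builds the project under its final filename and runs:
MigrateAsync: creates schema
ProjectData.Add(...) + SaveChangesAsync: inserts the project metadata row
AddPredefinedMorphTypes: seeds morph types
AfterCreate(...): for downloads, this is SyncService.ExecuteSync(true) (see CombinedProjectsService.cs:200)
If any step throws, the catch (line 177) does:
EnsureDeletedAsync drops tables, then deletes the file. If the file is locked (sync may still hold connections), it throws after dropping tables.
EnsureDeleteProject (line 197) is a fire-and-forget Task.Run: it retries File.Delete every 1 s for 10 s, then gives up silently. If the user closes the app during those 10 s (which a frustrated user with flaky internet might), the background Task is killed.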
Outcome: the file remains, sometimes with header only (tables dropped, delete failed; or migration transaction never committed in the first place).
2. ProjectDataCache makes orphans look like real projects
backend/FwLite/LcmCrdt/Project/ProjectDataCache.cs (file-backed JSON cache of ProjectData).
Once CreateProject got far enough to populate the cache (which can happen before the on-disk database is fully usable), the project list reads the project name and Role from JSON and renders the tile as "Synced with Lexbox" regardless of the actual sqlite state.
3. On reopen, behaviour depends on build
On builds without #2219, OpenCrdtProject succeeds against the empty file: migration reapplies the schema, no further validation runs, and entry queries return 0 rows. User experience: silently empty project, misleading "Synced" label.
On builds with #2219, MigrateDb (CurrentProjectService.cs:96-134) calls dbContext.ProjectData.AsNoTracking().FirstAsync() after migration. The row is missing → Sequence contains no elements is thrown → UI navigates to ?troubleshootDialogOpen=true. User experience: scary error, no recovery affordance.
Worse on both paths: the static MigrationTasks Lazy<Task> at line 96/102 caches the (faulted or completed) task, so subsequent opens in the same process do not retry the underlying setup.
4. Sync layer has no retry / resilience (the upstream trigger)
backend/FwLite/FwLiteShared/FwLiteSharedKernel.cs:70-108 registers the auth/sync HttpClient bare: no .AddStandardResilienceHandler(), no Polly, and the default 100 s HttpClient timeout. A single dropped TCP connection or a 30+ s pause kills ExecuteSync outright. That is the typical trigger that drops CreateProject into the failure path described in (1).
Harmony itself is well-behaved on the between-attempts side:
The pull happens in one SQLite transaction (harmony/src/SIL.Harmony/DataModel.cs:144-174), so on failure the local db is unchanged from before the sync attempt — no half-applied corruption inside the file.
SyncState (per-client HLC heads, SIL.Harmony.Core/SyncState.cs:3) is resumable between attempts.
Re-pulling commits is idempotent (CrdtRepository.cs:112-125 dedupes by commit GUID).
But none of that helps when CreateProject treats any failure as fatal and tears down the entire local project on the first hiccup.
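The three properties listed above (one transaction per pull, resumable state, dedupe by commit id) can be illustrated with a small stand-alone sketch. It is written in Python against the standard sqlite3 module and is not Harmony or EF Core code; the commits table, the apply_commits helper, and the fail_after parameter are hypothetical names chosen for illustration only.

```python
import sqlite3
import uuid


def open_db(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are controlled explicitly with BEGIN/COMMIT/ROLLBACK.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commits (id TEXT PRIMARY KEY, payload TEXT)"
    )
    return conn


def apply_commits(conn, commits, fail_after=None):
    """Apply a batch in one transaction; roll back the whole batch on failure."""
    conn.execute("BEGIN")
    try:
        for index, (commit_id, payload) in enumerate(commits):
            if fail_after is not None and index >= fail_after:
                # Hypothetical hook that simulates the network dying mid-pull.
                raise ConnectionError("simulated dropped connection")
            # INSERT OR IGNORE dedupes by primary key, so re-pulling is harmless.
            conn.execute(
                "INSERT OR IGNORE INTO commits (id, payload) VALUES (?, ?)",
                (commit_id, payload),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def commit_count(conn):
    return conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
```

A failed pull leaves the table exactly as it was, and repeating a successful pull adds nothing, which is why a retry after a dropped connection is safe in this model.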
Proposed fix (one PR)
A. Atomic create-then-rename in CreateProject
Build at <final>.sqlite.tmp, do all work against it, atomically File.Move(tmp, final) only on full success. On any failure, delete the tmp file. Same-volume rename is atomic on Windows + Linux.
This makes an orphan with a "real" name impossible regardless of what fails (sync error, process kill, cleanup race, migration transaction rollback). The existing background EnsureDeleteProject becomes a fallback for the tmp name.
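As a rough illustration of the create-then-rename pattern in (A), here is a minimal Python sketch using only the standard library. It is not the CrdtProjectsService implementation; create_project and the populate callback are hypothetical stand-ins for the migrate, seed, and AfterCreate steps.

```python
import os
import sqlite3


def create_project(final_path, populate):
    """Build the database at <final>.tmp and rename it into place on success."""
    tmp_path = final_path + ".tmp"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)  # stale leftover from an earlier failed attempt
    conn = sqlite3.connect(tmp_path)
    try:
        try:
            populate(conn)  # schema, metadata row, seed data, initial sync
            conn.commit()
        finally:
            conn.close()  # release the file handle before rename or delete
        # Rename into place; atomic when both paths are on the same volume.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Any failure removes the temp file, so the final name never exists
        # unless every step succeeded.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

The key property is that the final filename only ever appears as the result of the rename, so a listing that globs for *.sqlite cannot see a half-built project.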
B. Startup sweep of *.sqlite.tmp
In CrdtProjectsService.ListProjects (or app startup), delete leftover *.sqlite.tmp with a warning log. Handles process-kill cases where the tmp file survives.
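The sweep in (B) could look roughly like the following Python sketch. The function name sweep_stale_tmp and the logger name are hypothetical, and the code is an illustration of the idea, not FwLite code.

```python
import glob
import logging
import os

log = logging.getLogger("project_sweep")


def sweep_stale_tmp(projects_dir):
    """Delete leftover *.sqlite.tmp files and return the paths that were removed."""
    removed = []
    for path in glob.glob(os.path.join(projects_dir, "*.sqlite.tmp")):
        try:
            os.remove(path)
        except OSError as error:
            # A locked file is not fatal; it can be retried on the next startup.
            log.warning("could not remove stale temp database %s: %s", path, error)
            continue
        log.warning("removed stale temp database %s", path)
        removed.append(path)
    return removed
```

A real sweep would also need to avoid deleting a temp file that an in-progress download in the same process is still writing; that coordination is left out of this sketch.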
C. Add .AddStandardResilienceHandler() to the auth HTTP client
FwLiteSharedKernel.cs:70, on the OAuthClient.AuthHttpClientName builder. Default config gives 3 retries with exponential backoff, per-attempt and total timeouts, circuit breaker. Massively reduces the triggering failure rate for flaky-internet users.
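To make the retry part of (C) concrete, here is a minimal Python sketch of retrying a transient failure with exponential backoff. It only models the retry behaviour described above; it is not the .NET resilience handler, and it omits the timeouts, jitter, and circuit breaker. The names call_with_retry, base_delay, and retry_on are hypothetical.

```python
import time


def call_with_retry(operation, max_retries=3, base_delay=1.0, sleep=time.sleep,
                    retry_on=(ConnectionError, TimeoutError)):
    """Run operation, retrying transient failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= max_retries:
                raise  # retries exhausted; surface the last error
            # Delay doubles each time: 1 s, 2 s, 4 s with the defaults.
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

The sleep function is injectable so the backoff schedule can be checked without waiting. With the defaults, a call that fails twice and then succeeds waits 1 s and 2 s in total before returning.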
D. (Smaller, defense-in-depth) Make MigrateDb fail clearly on empty ProjectData
CurrentProjectService.cs:117 — use SingleOrDefaultAsync() and throw a typed exception (e.g. CrdtProjectMissingDataException) with an actionable message when the row is missing. Avoid caching faulted migration tasks (either remove on exception from MigrationTasks, or switch to LazyThreadSafetyMode.PublicationOnly).
This is the safety net for any existing orphan files (like the one the user already has — A/B prevent new ones but do not recover this one) and any future class of orphan we have not thought of.
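The two ideas in (D), a typed error for the missing row and a migration cache that does not hold on to failures, can be sketched in Python with asyncio. This is an analogy, not the CurrentProjectService code: ProjectMissingDataError, require_project_data, and migrate_once are hypothetical names, and the dictionary of tasks stands in for the cached migration tasks.

```python
import asyncio


class ProjectMissingDataError(Exception):
    """Raised when a project database has no ProjectData row."""


def require_project_data(rows, path):
    """Return the single metadata row, or fail with an actionable message."""
    if not rows:
        raise ProjectMissingDataError(
            f"{path} contains no project metadata; delete it and download again"
        )
    return rows[0]


_migrations = {}  # project path -> asyncio.Task


async def migrate_once(path, migrate):
    """Share one migration per path, but forget it if it failed."""
    task = _migrations.get(path)
    if task is None:
        task = asyncio.ensure_future(migrate(path))
        _migrations[path] = task
    try:
        return await task
    except Exception:
        # Evict the faulted task so the next open retries instead of
        # replaying the cached error.
        if _migrations.get(path) is task:
            del _migrations[path]
        raise
```

Successful migrations stay cached, so repeated opens of a healthy project still run the migration only once.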
E. Do not render "Synced with Lexbox" for projects whose local db is empty/missing
Project-list tile should reflect actual sqlite state, not just the JSON cache. Either:
Probe __EFMigrationsHistory / ProjectData row presence on listing and label orphans as "Unrecoverable — delete" or "Re-download required", or
Reset the ProjectDataCache entry when CreateProject fails and do not rebuild it from a broken on-disk file.
Without this, the symptom remains visually misleading even when A–D close the underlying gap for new downloads.
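The probe option in (E) might look like the following Python sketch, which opens the file read-only and checks for a metadata row before the tile is labeled. It uses the standard sqlite3 module, not EF Core; is_usable_project is a hypothetical helper, and the table name ProjectData is an assumption about the on-disk schema.

```python
import sqlite3
from pathlib import Path


def is_usable_project(path):
    """Return True only if the file has a ProjectData table with at least one row."""
    # mode=ro opens read-only, so probing never creates or modifies the file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return False  # missing or unopenable file
    try:
        has_table = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'ProjectData'"
        ).fetchone()
        if has_table is None:
            return False  # header-only or schema-less file
        row = conn.execute("SELECT 1 FROM ProjectData LIMIT 1").fetchone()
        return row is not None
    except sqlite3.DatabaseError:
        return False  # not a readable SQLite database
    finally:
        conn.close()
```

A listing could call this per file and fall back to a "Re-download required" label when it returns False, instead of trusting the JSON cache alone.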
Out of scope / follow-up
Pending-sync project state. Today a failed AfterCreate rolls back the local project entirely; the user has to re-download from scratch. Because Harmony sync is resumable, we could keep the local project (schema + ProjectData + morph types — already a valid empty project) on AfterCreate failure and mark it SyncPending. User opens an empty project, hits retry, sync picks up correctly. Big UX win for flaky-internet users, but needs design on the project-list affordance, retry button, and how to communicate the state.
Orphan recovery UI. Even with (D) the user gets a clear error but no in-app way to delete the file. The troubleshoot dialog could grow a "delete orphaned project" action when this typed exception is surfaced.
SignalR HubConnection auth retry. The MsalClientException: authentication_canceled toast visible in user screenshots is a different bug — an infinite-retry policy on the SignalR connection means a dead auth/connection is never re-established. Tim has a separate WIP branch.
Evidence
User screenshots (Android, build v2026-05-06-9470018e, commit 9470018e):
Project list — both projects labeled "Synced with Lexbox"; MSAL auth-cancel toast visible.
Local .sqlite: 1 page, schema_version 0, sqlite_master empty.
Reproduction log on develop (exception path): bin/Debug/net10.0/win-x64/fw-lite-web.log lines 16267-16797.
CreateProject failure/cleanup path: LcmCrdt/CrdtProjectsService.cs:133-225.
ProjectDataCache: LcmCrdt/Project/ProjectDataCache.cs.
MigrateDb failure point: LcmCrdt/CurrentProjectService.cs:117 (introduced in #2219, "Seed canonical morph types and regenerate search index", 2026-05-12).
harmony/src/SIL.Harmony/DataModel.cs:144-174.
FwLiteShared/FwLiteSharedKernel.cs:70-108, LcmCrdt/RemoteSync/CrdtHttpSyncService.cs:84-109.