
Fix orphaned CRDT project files from failed downloads (stuck at 'Sequence contains no elements') #2292

@myieye

What the user actually sees

A user downloaded a project over flaky internet on Android (build v2026-05-06-9470018e, commit 9470018e). End state, as observed in screenshots from the user's device:

  1. Project list shows "Synced with Lexbox" under the project name — misleading. The local .sqlite is 1 page, schema_version 0, no tables in sqlite_master. The "Synced" label and the project name come from ProjectDataCache (project-cache.json), not from the database.
  2. Opening the project shows "No entries found" — silently. No error toast, no recovery affordance, no indication that anything is wrong with the file.
  3. The only signal that something is amiss is a small red dot on "Synchronize" in the side menu, which does not tell the user what is wrong or how to fix it.
  4. A separate MsalClientException: authentication_canceled toast appears intermittently — that is a different, already-known bug (infinite-retry SignalR HubConnection that never reissues a new connection after the underlying auth dies). Out of scope for this issue.

Why local reproduction shows a different symptom

When the broken .sqlite is opened against the current develop build, OpenCrdtProject throws System.InvalidOperationException: Sequence contains no elements at CurrentProjectService.cs:117:

var projectData = await dbContext.ProjectData.AsNoTracking().FirstAsync();

That code was added by #2219 on 2026-05-12 — after the user's 2026-05-06 build was cut. On the user's older build there is no FirstAsync post-migration, so OpenCrdtProject succeeds, the entries query against the empty Entry table returns 0 rows, and the UI silently shows "No entries found." Today's build at least throws an exception; the version actually in users' hands silently lies about state. Both behaviours stem from the same underlying bug (an orphaned .sqlite file with no schema, no ProjectData).

Reproduction log on current code (the exception path): bin/Debug/net10.0/win-x64/fw-lite-web.log lines 16267-16797.

Root cause chain

1. CreateProject leaves the orphan

backend/FwLite/LcmCrdt/CrdtProjectsService.cs:133-195

CreateProject builds the project under its final filename and runs:

  1. MigrateAsync — creates schema
  2. ProjectData.Add(...) + SaveChangesAsync — inserts the project metadata row
  3. AddPredefinedMorphTypes — seeds morph types
  4. AfterCreate(...) — for downloads, this is SyncService.ExecuteSync(true) (see CombinedProjectsService.cs:200)

If any step throws, the catch (line 177) does:

await db.Database.CloseConnectionAsync();
try { await db.Database.EnsureDeletedAsync(); }
catch { EnsureDeleteProject(sqliteFile); }
throw;
  • EnsureDeletedAsync drops tables then deletes the file. If the file is locked (sync may still hold connections), it throws after dropping tables.
  • EnsureDeleteProject (line 197) is a fire-and-forget Task.Run that retries File.Delete every 1 s for 10 s, then gives up silently. If the user closes the app during those 10 s (which a frustrated user with flaky internet might), the background Task is killed.

Outcome: the file remains, sometimes with header only (tables dropped, delete failed; or migration transaction never committed in the first place).

2. ProjectDataCache makes orphans look like real projects

backend/FwLite/LcmCrdt/Project/ProjectDataCache.cs (file-backed JSON cache of ProjectData).

Once CreateProject got far enough to populate the cache (which can happen before the on-disk database is fully usable), the project list reads the project name and Role from JSON and renders the tile as "Synced with Lexbox" regardless of the actual sqlite state.

3. On reopen, behaviour depends on build

On builds without #2219, OpenCrdtProject succeeds against the empty file — migration reapplies the schema, no further validation runs, and entry queries return 0 rows. User experience: silently empty project, misleading "Synced" label.

On builds with #2219, MigrateDb (CurrentProjectService.cs:96-134) calls dbContext.ProjectData.AsNoTracking().FirstAsync() after migration. The row is missing → Sequence contains no elements is thrown → UI navigates to ?troubleshootDialogOpen=true. User experience: scary error, no recovery affordance.

Worse on both paths: the static MigrationTasks Lazy<Task> at line 96/102 caches the (faulted or completed) task, so subsequent opens in the same process do not retry the underlying setup.

4. Sync layer has no retry / resilience (the upstream trigger)

backend/FwLite/FwLiteShared/FwLiteSharedKernel.cs:70-108 registers the auth/sync HttpClient bare — no .AddStandardResilienceHandler(), no Polly, default 100 s HttpClient timeout. A single dropped TCP / 30+ s pause kills ExecuteSync outright. That is the typical trigger that drops CreateProject into the failure path described in (1).

Harmony itself behaves well between sync attempts:

  • The pull happens in one SQLite transaction (harmony/src/SIL.Harmony/DataModel.cs:144-174), so on failure the local db is unchanged from before the sync attempt — no half-applied corruption inside the file.
  • SyncState (per-client HLC heads, SIL.Harmony.Core/SyncState.cs:3) is resumable between attempts.
  • Re-pulling commits is idempotent (CrdtRepository.cs:112-125 dedupes by commit GUID).

But none of that helps when CreateProject treats any failure as fatal and tears down the entire local project on the first hiccup.

Proposed fix (one PR)

A. Atomic create-then-rename in CreateProject

Build at <final>.sqlite.tmp, do all work against it, atomically File.Move(tmp, final) only on full success. On any failure, delete the tmp file. Same-volume rename is atomic on Windows + Linux.

This makes an orphan with a "real" name impossible regardless of what fails (sync error, process kill, cleanup race, migration transaction rollback). The existing background EnsureDeleteProject becomes a fallback for the tmp name.
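
A minimal sketch of the create-then-rename idea, not the actual CreateProject code. The helper name AtomicProjectFile and the build callback (standing in for migrate + seed + initial sync) are hypothetical; only File.Move and File.Delete are real APIs here.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class AtomicProjectFile
{
    // Hypothetical helper: `build` stands in for MigrateAsync, the ProjectData
    // insert, morph type seeding and AfterCreate, all run against the tmp path.
    public static async Task CreateAsync(string finalPath, Func<string, Task> build)
    {
        var tmpPath = finalPath + ".tmp";

        // A stale tmp file from an earlier crash must not be reused.
        File.Delete(tmpPath); // no-op if the file does not exist

        try
        {
            await build(tmpPath);

            // Assumption: every connection to tmpPath is closed by now.
            // The final name only ever appears after full success.
            File.Move(tmpPath, finalPath, overwrite: false);
        }
        catch
        {
            try
            {
                File.Delete(tmpPath);
            }
            catch (Exception e) when (e is IOException or UnauthorizedAccessException)
            {
                // Still locked: leave it for a later sweep of *.sqlite.tmp.
            }
            throw;
        }
    }
}
```

If the database runs in WAL mode, the -wal and -shm sidecar files would need the same treatment; the sketch does not cover that.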

B. Startup sweep of *.sqlite.tmp

In CrdtProjectsService.ListProjects (or app startup), delete leftover *.sqlite.tmp with a warning log. Handles process-kill cases where the tmp file survives.
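
The sweep could look roughly like the following. TmpProjectSweeper, the projectsRoot parameter, and the recursive search are assumptions for illustration; the real project folder layout may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TmpProjectSweeper
{
    // Deletes leftover *.sqlite.tmp files and returns the paths that were removed.
    // Assumption: projects live somewhere under projectsRoot.
    public static IReadOnlyList<string> Sweep(string projectsRoot, Action<string> logWarning)
    {
        var removed = new List<string>();
        if (!Directory.Exists(projectsRoot)) return removed;

        // GetFiles materializes the list first, so deleting while looping is safe.
        foreach (var path in Directory.GetFiles(projectsRoot, "*.sqlite.tmp", SearchOption.AllDirectories))
        {
            try
            {
                File.Delete(path);
                removed.Add(path);
                logWarning($"Deleted leftover temporary project file: {path}");
            }
            catch (Exception e) when (e is IOException or UnauthorizedAccessException)
            {
                logWarning($"Could not delete temporary project file {path}: {e.Message}");
            }
        }
        return removed;
    }
}
```

The sweep should run before any download can start in the same process, otherwise it could delete the tmp file of a download that is still in progress.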

C. Add .AddStandardResilienceHandler() to the auth HTTP client

FwLiteSharedKernel.cs:70, on the OAuthClient.AuthHttpClientName builder. The default configuration gives 3 retries with exponential backoff, per-attempt and total timeouts, and a circuit breaker. This should substantially reduce the triggering failure rate for flaky-internet users.
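
As a rough illustration, the registration might look like this. It assumes the Microsoft.Extensions.Http.Resilience package is referenced; the class name and the client-name constant are placeholders for the real OAuthClient.AuthHttpClientName registration in FwLiteSharedKernel.cs.

```csharp
using Microsoft.Extensions.DependencyInjection;

public static class ResilientAuthClientSketch
{
    // Placeholder for OAuthClient.AuthHttpClientName.
    public const string AuthHttpClientName = "AuthHttpClient";

    public static IServiceCollection AddResilientAuthClient(this IServiceCollection services)
    {
        services.AddHttpClient(AuthHttpClientName)
            // Adds the standard pipeline with its default settings:
            // retry, circuit breaker, per-attempt timeout and total timeout.
            .AddStandardResilienceHandler();
        return services;
    }
}
```

The default per-attempt and total timeouts may be too short for large sync pulls, so they would need checking against real download times before shipping the defaults unchanged.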

D. (Smaller, defense-in-depth) Make MigrateDb fail clearly on empty ProjectData

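At CurrentProjectService.cs:117, use SingleOrDefaultAsync() and throw a typed exception (e.g. CrdtProjectMissingDataException) with an actionable message when the row is missing. Also avoid caching faulted migration tasks, most simply by removing the entry from MigrationTasks when the task fails. Note that LazyThreadSafetyMode.PublicationOnly on its own only skips caching when the value factory itself throws; if the factory is an async method, it returns a faulted Task that the Lazy still caches, so that mode would not be enough in that case.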

This is the safety net for any existing orphan files (like the one the user already has — A/B prevent new ones but do not recover this one) and any future class of orphan we have not thought of.
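
One way to avoid caching faulted tasks is sketched below. RetryableTaskCache is a hypothetical stand-in for the static MigrationTasks cache, whose real shape is not shown in this issue; the exception message is only an example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class CrdtProjectMissingDataException : Exception
{
    public CrdtProjectMissingDataException(string projectName)
        : base($"Project '{projectName}' has no ProjectData row. The local file is probably " +
               "left over from a failed download; delete it and download the project again.")
    {
    }
}

public sealed class RetryableTaskCache
{
    private readonly ConcurrentDictionary<string, Lazy<Task>> _tasks = new();

    // Runs `work` at most once per key while it succeeds; a failed run is
    // evicted so the next call starts a fresh attempt.
    public async Task RunOnceAsync(string key, Func<Task> work)
    {
        var lazy = _tasks.GetOrAdd(key, _ => new Lazy<Task>(work));
        try
        {
            await lazy.Value;
        }
        catch
        {
            // Remove only this exact entry, so a newer retry that has already
            // replaced it is left alone.
            _tasks.TryRemove(new KeyValuePair<string, Lazy<Task>>(key, lazy));
            throw;
        }
    }
}
```

Concurrent callers for the same key still share one in-flight attempt; only a completed failure triggers a new one.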

E. Do not render "Synced with Lexbox" for projects whose local db is empty/missing

Project-list tile should reflect actual sqlite state, not just the JSON cache. Either:

  • Probe __EFMigrationsHistory / ProjectData row presence on listing and label orphans as "Unrecoverable — delete" or "Re-download required", or
  • Reset the ProjectDataCache entry when CreateProject fails and do not rebuild it from a broken on-disk file.

Without this, the symptom remains visually misleading even when A–D close the underlying gap for new downloads.
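
The first option (probing the file on listing) might be sketched like this, assuming Microsoft.Data.Sqlite is available. ProjectFileProbe, LocalDbState, and the table name "ProjectData" are illustrative assumptions; the real table name depends on the EF Core mapping.

```csharp
using System;
using System.IO;
using Microsoft.Data.Sqlite;

public enum LocalDbState { Missing, NoSchema, NoProjectData, Usable }

public static class ProjectFileProbe
{
    // Assumption: the ProjectData entity maps to a table named "ProjectData".
    public static LocalDbState Probe(string sqlitePath, string projectDataTable = "ProjectData")
    {
        if (!File.Exists(sqlitePath)) return LocalDbState.Missing;

        var connectionString = new SqliteConnectionStringBuilder
        {
            DataSource = sqlitePath,
            Mode = SqliteOpenMode.ReadOnly, // never create or modify the file
            Pooling = false,                // release the file handle on dispose
        }.ToString();

        using var connection = new SqliteConnection(connectionString);
        connection.Open();

        // A header-only orphan has no tables in sqlite_master at all.
        using var hasTable = connection.CreateCommand();
        hasTable.CommandText =
            "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = $name";
        hasTable.Parameters.AddWithValue("$name", projectDataTable);
        if (Convert.ToInt64(hasTable.ExecuteScalar()) == 0) return LocalDbState.NoSchema;

        // The schema exists; check that the metadata row was actually written.
        var quoted = "\"" + projectDataTable.Replace("\"", "\"\"") + "\"";
        using var hasRow = connection.CreateCommand();
        hasRow.CommandText = $"SELECT EXISTS (SELECT 1 FROM {quoted})";
        return Convert.ToInt64(hasRow.ExecuteScalar()) == 1
            ? LocalDbState.Usable
            : LocalDbState.NoProjectData;
    }
}
```

A file that is not a SQLite database at all would surface as a SqliteException here; the listing code would need to decide how to label that case.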

Out of scope / follow-up

Pending-sync project state. Today a failed AfterCreate rolls back the local project entirely; the user has to re-download from scratch. Because Harmony sync is resumable, we could keep the local project (schema + ProjectData + morph types — already a valid empty project) on AfterCreate failure and mark it SyncPending. User opens an empty project, hits retry, sync picks up correctly. Big UX win for flaky-internet users, but needs design on the project-list affordance, retry button, and how to communicate the state.

Orphan recovery UI. Even with (D) the user gets a clear error but no in-app way to delete the file. The troubleshoot dialog could grow a "delete orphaned project" action when this typed exception is surfaced.

SignalR HubConnection auth retry. The MsalClientException: authentication_canceled toast visible in user screenshots is a different bug — an infinite-retry policy on the SignalR connection means a dead auth/connection is never re-established. Tim has a separate WIP branch.

Evidence

  • User screenshots (Android, build v2026-05-06-9470018e, commit 9470018e):
    • Project list — both projects labeled "Synced with Lexbox"; MSAL auth-cancel toast visible.
    • Project view — "No entries found" / "Filter 0 words" silently.
    • Side menu — "Browse: 0", red dot on "Synchronize".
  • Broken .sqlite: 1 page, schema_version 0, sqlite_master empty.
  • Reproducing log against current develop (exception path): bin/Debug/net10.0/win-x64/fw-lite-web.log lines 16267-16797.
  • CreateProject failure/cleanup path: LcmCrdt/CrdtProjectsService.cs:133-225.
  • ProjectDataCache: LcmCrdt/Project/ProjectDataCache.cs.
  • MigrateDb failure point: LcmCrdt/CurrentProjectService.cs:117 (introduced by #2219, "Seed canonical morph types and regenerate search index", 2026-05-12).
  • Sync transactional model: harmony/src/SIL.Harmony/DataModel.cs:144-174.
  • Sync HTTP path (no resilience): FwLiteShared/FwLiteSharedKernel.cs:70-108, LcmCrdt/RemoteSync/CrdtHttpSyncService.cs:84-109.

Labels: High Priority; bug (Something isn't working); 💻 FW Lite (issues related to the fw lite application, not miniLcm or crdt related); 📖 MiniLcm (issues related to miniLcm library code, includes fwdat bridge and lcmCrdt)
