store: Don't hold write lock while receiving LTX by saleemrashid · Pull Request #450 · superfly/litefs

saleemrashid · 2026-04-21T14:47:00Z

Holding the recovery lock for too long results in "locking protocol" errors from SQLite. This happens when receiving the LTX takes too long (e.g. the LTX is large or the connection to the primary is slow).

Before we have the LTX, we only verify whether the database position matches the LTX pre-apply position. We can instead keep this as a preflight check (without the write lock), then verify again once we've received the LTX and acquired the write lock.

Holding the recovery lock for too long results in "locking protocol" errors from SQLite. This happens when receiving the LTX takes too long (e.g. the LTX is large or the connection to the primary is slow). Before we have the LTX, we only verify whether the database position matches the LTX pre-apply position. We can instead keep this as a preflight check (without the write lock), then verify again once we've received the LTX and acquired the write lock.

saleemrashid · 2026-04-21T14:52:38Z

-	// unless this is a snapshot, which will overwrite all data.
+	// unless this is a snapshot, which will overwrite all data. This is a preflight
+	// check so we can skip streaming an LTX we can't apply, but we'll need to verify
+	// again after we acquire the write lock.


There's a behavior change here now we're not acquiring the write lock: if a transaction is being applied concurrently, we might now observe the previous db.Pos() and return a position mismatch error here. In contrast, the existing code is serialized by the write lock, and would observe the post-apply db.Pos() of any transaction being applied concurrently.

I don't think this is an issue, but it is a behavior change, so I wanted to call it out.

…mFrame Moving the halt lock check after receiving the LTX increases the probability of an existing deadlock: - AcquireRemoteHaltLock stores remoteHaltLock and calls WaitPosExact, i.e. it waits for processLTXStreamFrame to process pending LTXs - processLTXStreamFrame sees remoteHaltLock != nil (this is more likely now, since it receives the LTX first) - It incorrectly assumes the remote lock is stale and tries to clear it, since we don't expect new LTXs from the primary while we hold the halt lock - UnsetRemoteHaltLock deadlocks in Recover, because processLTXStreamFrame already called AcquireWriteLock - WaitPosExact deadlocks, because it's waiting for processLTXStreamFrame, which is deadlocked Instead of trying to clear the remote lock locally, processLTXStreamFrame will perform recovery so the LTX can be applied. If the remote lock was stale, we'll now only find out when committing a transaction, as the primary will reject the remote commit. Deferring the error is an acceptable trade-off, as the halt lock is intended to be short-lived.

saleemrashid · 2026-04-22T12:33:39Z

+	// We'll perform recovery so the LTX can be applied, but not clear the remote lock.
+	if haltLock := db.RemoteHaltLock(); haltLock != nil {
+		TraceLog.Printf("[ProcessLTXStreamFrame.Recover(%s)]: replica holds HALT lock but received LTX file, performing recovery", db.Name())
+		if err := db.recover(ctx); err != nil {


@benbjohnson Since we haven't called UnsetRemoteHaltLock yet, which would have performed recovery, we have to perform recovery here to apply the LTX, right? And that should be safe whether the remote lock is stale, or if AcquireRemoteHaltLock is stuck in WaitPosExact (it'll just be a no-op, because we haven't done any writes yet)

Also: we probably shouldn't set remoteHaltLock until after WaitPosExact returns, since Writeable will return true prematurely (before we've caught up to the primary). But that would only happen if the user started writing before AcquireRemoteHaltLock returned success, and it'd result in the primary rejecting the remote commit. So I think it's fine to keep it as is (so we're not changing around too much).

When "PRAGMA journal_mode = WAL" is used on the replica before the database file exists, SQLite will try to create the database file to update the header, which fails with "attempt to write a readonly database".

saleemrashid · 2026-04-23T16:16:30Z

I almost wonder if we should make the tests use an artificially slow connection between the primary and replica to better catch race conditions

saleemrashid commented Apr 21, 2026

View reviewed changes

saleemrashid requested a review from benbjohnson April 21, 2026 14:54

benbjohnson approved these changes Apr 21, 2026

View reviewed changes

saleemrashid force-pushed the saleem/ltx-no-write-lock branch from b8dd162 to ebc2c91 Compare April 22, 2026 12:24

saleemrashid commented Apr 22, 2026

View reviewed changes

saleemrashid marked this pull request as ready for review April 22, 2026 12:38

saleemrashid force-pushed the saleem/ltx-no-write-lock branch 2 times, most recently from 0c601aa to 0982090 Compare April 22, 2026 13:29

mount: Fix race condition in WAL tests

68dd622

When "PRAGMA journal_mode = WAL" is used on the replica before the database file exists, SQLite will try to create the database file to update the header, which fails with "attempt to write a readonly database".

saleemrashid force-pushed the saleem/ltx-no-write-lock branch from 0982090 to 68dd622 Compare April 23, 2026 16:11

saleemrashid requested a review from benbjohnson April 23, 2026 16:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

store: Don't hold write lock while receiving LTX#450

store: Don't hold write lock while receiving LTX#450
saleemrashid wants to merge 3 commits intosaleem/incremental-chksumfrom
saleem/ltx-no-write-lock

saleemrashid commented Apr 21, 2026

Uh oh!

saleemrashid Apr 21, 2026

Uh oh!

Uh oh!

saleemrashid Apr 22, 2026 •

edited

Loading

Uh oh!

saleemrashid commented Apr 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

saleemrashid commented Apr 21, 2026

Uh oh!

saleemrashid Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

saleemrashid Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

saleemrashid commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

saleemrashid Apr 22, 2026 •

edited

Loading

saleemrashid commented Apr 23, 2026 •

edited

Loading