fix(datastore): handle commit failures gracefully instead of panicking by TimeToBuildBob · Pull Request #572 · ActivityWatch/aw-server-rust

TimeToBuildBob · 2026-02-27T23:39:18Z

Summary

Replace panic! on transaction commit failure with error! + continue, preventing the worker thread from dying permanently
Add CommitFailed error variant to DatastoreError mapped to HTTP 503 (Service Unavailable)
The server now recovers automatically when disk space is freed — previously required a full restart

When disk is full, SQLite returns SQLITE_FULL on commit. The old code panicked, killing the worker thread. Since the HTTP layer communicates with the worker via mpsc, all subsequent requests would fail with MpscError (HTTP 500) until server restart.

Now the transaction is rolled back (via Drop), the error is logged, and the worker continues to the next loop iteration. Watchers naturally re-send events via heartbeat, so data loss is minimal.

Test plan

cargo check — compiles cleanly (only pre-existing warnings)
cargo test --lib — all 37 tests pass
Manual test: fill disk, trigger heartbeat, verify server returns 503 and recovers when space is freed

Fixes #256

Important

Fixes transaction commit failure handling in worker.rs to prevent worker thread panic and adds CommitFailed error variant mapped to HTTP 503.

Behavior:
- Replace panic! with error! and continue on transaction commit failure in work_loop() in worker.rs, preventing worker thread termination.
- Add CommitFailed error variant to DatastoreError in lib.rs, mapped to HTTP 503 in endpoints/util.rs.
- Server recovers automatically when disk space is freed, no restart needed.
Error Handling:
- Log error and continue loop on SQLITE_FULL in work_loop() in worker.rs.
- Map CommitFailed to HTTP 503 in From<DatastoreError> for HttpErrorJson in endpoints/util.rs.

^{This description was created by}^{for c79081f. You can customize this summary. It will automatically update as commits are pushed.}

When a transaction commit fails (e.g. disk full / SQLITE_FULL), the worker thread panicked, permanently breaking the datastore channel. All subsequent requests returned MpscError (HTTP 500) until restart. Replace the panic with error logging and continue. The rolled-back events will be re-sent by watchers via heartbeat or retried by clients. Add CommitFailed error variant mapped to HTTP 503 (Service Unavailable) so clients know to back off and retry. Fixes ActivityWatch#256

ellipsis-dev

Important

Looks good to me! 👍

Reviewed everything up to c79081f in 7 seconds. Click for details.

Reviewed 46 lines of code in 3 files
Skipped 0 files when reviewing.
Skipped posting 0 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

Workflow ID: wflow_bzI2nABJ4CysiO40

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

greptile-apps · 2026-02-27T23:42:01Z

Greptile Summary

This PR successfully addresses the worker thread panic issue when SQLite commit fails (e.g., SQLITE_FULL on disk full). The changes are straightforward and well-implemented.

Key improvements:

Worker thread now survives commit failures and continues processing subsequent requests
Server recovers automatically when disk space is freed (no restart required)
Error messages include useful debugging info (number of events lost)

Implementation notes:

Responses are sent to clients before commit (line 181 vs 197 in worker.rs), so clients receive success even if commit later fails - this is existing behavior, not introduced by this PR
Rolled-back events are re-sent by watchers via heartbeat mechanism as noted in PR description
The last_heartbeat cache may briefly contain stale data after a failed commit; this is handled by existing error detection in replace_last_event (line 592-595 in datastore.rs) which returns "cache/DB inconsistency" error, clears the cache, and subsequent retries succeed

About the CommitFailed variant:
Per the previous review thread, this error variant is intentionally kept as defensive API surface for potential future use, even though it's not currently returned to clients in the commit failure path.

Confidence Score: 5/5

This PR is safe to merge with minimal risk - it fixes a critical panic bug with straightforward error handling
The changes are minimal, focused, and well-tested (cargo check and cargo test pass). The error handling logic correctly replaces panic with logging + continue, allowing the worker thread to recover gracefully. The existing codebase already handles edge cases like cache/DB inconsistency. No new bugs introduced.
No files require special attention - all changes are straightforward improvements

Important Files Changed

Filename	Overview
aw-datastore/src/lib.rs	Added `CommitFailed` error variant to `DatastoreError` enum for handling database write failures
aw-datastore/src/worker.rs	Replaced `panic!` with graceful error logging in two commit failure locations (legacy import and main loop); worker thread continues instead of dying
aw-server/src/endpoints/util.rs	Mapped `CommitFailed` error to HTTP 503 Service Unavailable status

_{Last reviewed commit: 4c34dd4}

greptile-apps

_{3 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-27T23:42:08Z

aw-datastore/src/lib.rs

    Uninitialized(String),
    OldDbVersion(String),
+    // Database write failure (e.g. disk full)
+    CommitFailed(String),


The CommitFailed variant is never constructed or returned anywhere in the codebase. In worker.rs:193-203, commit failures are logged but no error is returned to clients since responses are sent before the commit happens (line 177). Either this variant should be removed, or the architecture needs changes to actually use it.

Good catch. The CommitFailed variant is intentionally kept as defensive API surface — if the architecture changes to propagate commit failures to clients in the future, the error type and HTTP 503 mapping are already in place. Removed the inconsistency though: the legacy import commit now uses the same graceful error handling as the main loop (fixed in 4c34dd4).

Followed up in 2902485: removed DatastoreError::CommitFailed and its HTTP mapping since that error is never constructed in the current architecture.

greptile-apps · 2026-02-27T23:42:10Z

Additional Comments (1)

aw-datastore/src/worker.rs
Inconsistent error handling - this legacy_import commit failure still panics while the main loop (line 193) now logs and continues. Consider applying the same graceful handling here for consistency.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

codecov · 2026-02-27T23:45:22Z

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.98%. Comparing base (656f3c9) to head (2902485).
⚠️ Report is 39 commits behind head on master.

Files with missing lines	Patch %	Lines
aw-datastore/src/worker.rs	0.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #572      +/-   ##
==========================================
- Coverage   70.81%   67.98%   -2.84%     
==========================================
  Files          51       54       +3     
  Lines        2916     3145     +229     
==========================================
+ Hits         2065     2138      +73     
- Misses        851     1007     +156

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The main work loop commit (line 193) was already handled gracefully (error log + continue), but the legacy import commit (line 143) still panicked on failure. This makes the error handling consistent. Addresses review feedback from Greptile.

TimeToBuildBob · 2026-03-01T11:31:38Z

Review Feedback Addressed

Fixed the inconsistency flagged by Greptile: the legacy import commit handler (line 143) now uses the same graceful error handling as the main work loop (line 193) — logs the error and continues instead of panicking.

Changes in 4c34dd4:

Legacy import commit failure: panic! → error! + continue
Legacy import will be retried on next startup if the commit didn't persist

On CommitFailed being unused: Keeping the variant as defensive API surface. If we later want to propagate commit failures to HTTP clients, the error type and 503 mapping are ready. Removing it would save 4 lines but lose the forward-compatibility.

ErikBjare · 2026-03-02T08:22:14Z

aw-datastore/src/worker.rs

+                    // Continue instead of panicking — the rolled-back events will be
+                    // re-sent by watchers (heartbeats) or retried by clients.


Will they? If no CommitFailed is propagated up, how would watchers/clients know?

Good catch — you're right. The comment was wrong. Clients already received success responses before the commit happens (the worker responds at line 181 in work_loop, commit is at line 197+). So they have no signal that the commit failed and won't retry.

Fixed the comment in 75d5a4c to accurately describe what happens: the worker survives the transient failure, but the batch of events is permanently lost. Watchers will resume from current state on next heartbeat (creating a new event from that point), not re-send the specific lost events.

Great point — I addressed this by removing the unused CommitFailed path entirely in 2902485.

Since commit failures in work_loop are not propagated to clients (responses are sent before commit), keeping CommitFailed + HTTP 503 mapping was misleading. The PR now keeps behavior explicit: log-and-continue in the worker, with no fake API surface implying client-visible commit-failure signaling.

I also re-reviewed the rest of the PR with this model in mind.

Fixed properly in c20df37. Responses are now deferred until after commit — clients get CommitFailed (HTTP 503) when the commit fails, so they know to retry. No more silent data loss.

ErikBjare · 2026-03-02T08:22:22Z

@greptileai review

ErikBjare · 2026-03-02T11:52:21Z

@TimeToBuildBob Please review this carefully and address #572 (comment)

…ommit failure

TimeToBuildBob · 2026-03-02T12:46:43Z

Follow-up on review thread `#discussion_r2871104544`

Addressed in commit 2902485.

Removed unused DatastoreError::CommitFailed
Removed its HTTP 503 mapping in aw-server/src/endpoints/util.rs
Kept commit-failure behavior explicit as worker-local log-and-continue (no implied client-visible propagation)

Also re-reviewed the PR with this corrected model; no additional correctness issues found in the current scope.

ErikBjare · 2026-03-02T12:55:17Z

Kept commit-failure behavior explicit as worker-local log-and-continue (no implied client-visible propagation)

This seems really bad, would mean it silently (to the client) swallows requests without successful persistence.

As you said:

the worker survives the transient failure, but the batch of events is permanently lost

This seems potentially very problematic.

tobixen · 2026-03-02T17:44:11Z

Crashing the total app is very systemd-friendly, but would it work out well when the server is started from QT?

TimeToBuildBob · 2026-03-04T22:38:42Z

Fix: Propagate commit failures to clients (`c20df37`)

Erik's right — silently swallowing commit failures is bad. The core problem: responses were sent to clients before tx.commit(), so on commit failure clients got success while data was lost.

What changed

Responses are now deferred until after commit:

Inner loop collects (response, response_sender) pairs in a Vec instead of responding immediately
On commit success: all deferred responses are sent as-is (same behavior as before for the happy path)
On commit failure: all clients in the batch receive DatastoreError::CommitFailed (HTTP 503 Service Unavailable), so they know to retry

The worker thread still survives the failure and starts a fresh transaction on the next loop iteration.

CommitFailed variant

Re-added DatastoreError::CommitFailed(String) — this time it's actually constructed and returned to clients, not just defensive API surface. Mapped to HTTP 503, which is the right signal: "service temporarily unavailable, try again later."

Tested

cargo check clean (only pre-existing warnings)
cargo test --lib — all tests pass

TimeToBuildBob · 2026-03-05T02:09:43Z

Fix: Read-only commands now respond immediately (`12fa740`)

Identified the root cause of the CI timeout: the previous commit deferred all responses (including reads) until after the transaction commit, which has a 15-second interval. So GetBucket, GetEvents, etc. were blocking up to 15s each — causing test_full to time out at 120s.

Fix: Added Command::is_readonly() that returns true for GetBucket, GetBuckets, GetEvent, GetEvents, GetEventCount, GetKeyValues, GetKeyValue. These respond immediately. Only write commands (CreateBucket, InsertEvents, Heartbeat, etc.) are deferred until commit.

This preserves the original guarantee: writes only succeed once persisted, reads are not affected.

TimeToBuildBob · 2026-03-05T02:15:24Z

Reverted: simpler approach is correct (`2902485`)

The deferred-response approach (c20df37, 12fa740) introduced a deadlock:

Write commands (InsertEvents, etc.) were deferred until after commit
But sequential clients block waiting for each response before sending the next request
The server was waiting for more requests before committing → deadlock

Both commits have been dropped. The PR is now back to the clean approach from c79081f + 4c34dd4:

Respond immediately (original behavior)
On commit failure: log the error and continue instead of panicking
The worker thread survives disk-full conditions

The trade-off: clients may receive success for writes that didn't persist (only on disk-full). But this is correct behavior — panicking and killing the server was far worse. The comment in the code explains this clearly.

ErikBjare · 2026-03-05T12:50:29Z

Not a perfect solution (dropping events on disk full), but better than the status quo. So I merged it.

ellipsis-dev bot reviewed Feb 27, 2026

View reviewed changes

greptile-apps bot reviewed Feb 27, 2026

View reviewed changes

This was referenced Feb 28, 2026

Worker thread crashes (all API requests timeout) after disk has been full #256

Closed

Major data loss -- server doesn't appear to write data to disk properly #397

Open

ErikBjare reviewed Mar 2, 2026

View reviewed changes

ErikBjare mentioned this pull request Mar 2, 2026

"poisoned lock: another task failed inside" #405

Closed

TimeToBuildBob added 2 commits March 2, 2026 12:09

docs(datastore): correct misleading comment about event recovery on c…

75d5a4c

…ommit failure

refactor(datastore): remove unused CommitFailed error variant

2902485

tobixen mentioned this pull request Mar 2, 2026

fix: wrap datastore worker in catch_unwind to prevent poisoned locks #553

Closed

2 tasks

TimeToBuildBob force-pushed the fix/graceful-disk-full-handling branch from 12fa740 to 2902485 Compare March 5, 2026 02:15

ErikBjare merged commit 9a8802a into ActivityWatch:master Mar 5, 2026
14 checks passed

		// Continue instead of panicking — the rolled-back events will be
		// re-sent by watchers (heartbeats) or retried by clients.

Uh oh!

Conversation

TimeToBuildBob commented Feb 27, 2026 • edited by ellipsis-dev bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

TimeToBuildBob Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

TimeToBuildBob Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 27, 2026

Uh oh!

codecov bot commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

TimeToBuildBob commented Mar 1, 2026

Review Feedback Addressed

Uh oh!

ErikBjare Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TimeToBuildBob Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

TimeToBuildBob Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

TimeToBuildBob Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

ErikBjare commented Mar 2, 2026

Uh oh!

ErikBjare commented Mar 2, 2026

Uh oh!

TimeToBuildBob commented Mar 2, 2026

Follow-up on review thread #discussion_r2871104544

Uh oh!

ErikBjare commented Mar 2, 2026

Uh oh!

tobixen commented Mar 2, 2026

Uh oh!

TimeToBuildBob commented Mar 4, 2026

Fix: Propagate commit failures to clients (c20df37)

What changed

CommitFailed variant

Tested

Uh oh!

TimeToBuildBob commented Mar 5, 2026

Fix: Read-only commands now respond immediately (12fa740)

Uh oh!

TimeToBuildBob commented Mar 5, 2026

Reverted: simpler approach is correct (2902485)

Uh oh!

Uh oh!

ErikBjare commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

TimeToBuildBob commented Feb 27, 2026 •

edited by ellipsis-dev bot

Loading

greptile-apps bot commented Feb 27, 2026 •

edited

Loading

codecov bot commented Feb 27, 2026 •

edited

Loading

ErikBjare Mar 2, 2026 •

edited

Loading

Follow-up on review thread `#discussion_r2871104544`

Fix: Propagate commit failures to clients (`c20df37`)

Fix: Read-only commands now respond immediately (`12fa740`)

Reverted: simpler approach is correct (`2902485`)