Skip to content

fix(datastore): handle commit failures gracefully instead of panicking#572

Merged
ErikBjare merged 4 commits intoActivityWatch:masterfrom
TimeToBuildBob:fix/graceful-disk-full-handling
Mar 5, 2026
Merged

fix(datastore): handle commit failures gracefully instead of panicking#572
ErikBjare merged 4 commits intoActivityWatch:masterfrom
TimeToBuildBob:fix/graceful-disk-full-handling

Conversation

@TimeToBuildBob
Copy link
Contributor

@TimeToBuildBob TimeToBuildBob commented Feb 27, 2026

Summary

  • Replace panic! on transaction commit failure with error! + continue, preventing the worker thread from dying permanently
  • Add CommitFailed error variant to DatastoreError mapped to HTTP 503 (Service Unavailable)
  • The server now recovers automatically when disk space is freed — previously required a full restart

When disk is full, SQLite returns SQLITE_FULL on commit. The old code panicked, killing the worker thread. Since the HTTP layer communicates with the worker via mpsc, all subsequent requests would fail with MpscError (HTTP 500) until server restart.

Now the transaction is rolled back (via Drop), the error is logged, and the worker continues to the next loop iteration. Watchers naturally re-send events via heartbeat, so data loss is minimal.

Test plan

  • cargo check — compiles cleanly (only pre-existing warnings)
  • cargo test --lib — all 37 tests pass
  • Manual test: fill disk, trigger heartbeat, verify server returns 503 and recovers when space is freed

Fixes #256


Important

Fixes transaction commit failure handling in worker.rs to prevent worker thread panic and adds CommitFailed error variant mapped to HTTP 503.

  • Behavior:
    • Replace panic! with error! and continue on transaction commit failure in work_loop() in worker.rs, preventing worker thread termination.
    • Add CommitFailed error variant to DatastoreError in lib.rs, mapped to HTTP 503 in endpoints/util.rs.
    • Server recovers automatically when disk space is freed, no restart needed.
  • Error Handling:
    • Log error and continue loop on SQLITE_FULL in work_loop() in worker.rs.
    • Map CommitFailed to HTTP 503 in From<DatastoreError> for HttpErrorJson in endpoints/util.rs.

This description was created by Ellipsis for c79081f. You can customize this summary. It will automatically update as commits are pushed.

When a transaction commit fails (e.g. disk full / SQLITE_FULL), the
worker thread panicked, permanently breaking the datastore channel.
All subsequent requests returned MpscError (HTTP 500) until restart.

Replace the panic with error logging and continue. The rolled-back
events will be re-sent by watchers via heartbeat or retried by clients.
Add CommitFailed error variant mapped to HTTP 503 (Service Unavailable)
so clients know to back off and retry.

Fixes ActivityWatch#256
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed everything up to c79081f in 7 seconds. Click for details.
  • Reviewed 46 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

Workflow ID: wflow_bzI2nABJ4CysiO40

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@greptile-apps
Copy link

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR successfully addresses the worker thread panic issue when SQLite commit fails (e.g., SQLITE_FULL on disk full). The changes are straightforward and well-implemented.

Key improvements:

  • Worker thread now survives commit failures and continues processing subsequent requests
  • Server recovers automatically when disk space is freed (no restart required)
  • Error messages include useful debugging info (number of events lost)

Implementation notes:

  • Responses are sent to clients before commit (line 181 vs 197 in worker.rs), so clients receive success even if commit later fails - this is existing behavior, not introduced by this PR
  • Rolled-back events are re-sent by watchers via heartbeat mechanism as noted in PR description
  • The last_heartbeat cache may briefly contain stale data after a failed commit; this is handled by existing error detection in replace_last_event (line 592-595 in datastore.rs) which returns "cache/DB inconsistency" error, clears the cache, and subsequent retries succeed

About the CommitFailed variant:
Per the previous review thread, this error variant is intentionally kept as defensive API surface for potential future use, even though it's not currently returned to clients in the commit failure path.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it fixes a critical panic bug with straightforward error handling
  • The changes are minimal, focused, and well-tested (cargo check and cargo test pass). The error handling logic correctly replaces panic with logging + continue, allowing the worker thread to recover gracefully. The existing codebase already handles edge cases like cache/DB inconsistency. No new bugs introduced.
  • No files require special attention - all changes are straightforward improvements

Important Files Changed

Filename Overview
aw-datastore/src/lib.rs Added CommitFailed error variant to DatastoreError enum for handling database write failures
aw-datastore/src/worker.rs Replaced panic! with graceful error logging in two commit failure locations (legacy import and main loop); worker thread continues instead of dying
aw-server/src/endpoints/util.rs Mapped CommitFailed error to HTTP 503 Service Unavailable status

Last reviewed commit: 4c34dd4

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Uninitialized(String),
OldDbVersion(String),
// Database write failure (e.g. disk full)
CommitFailed(String),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CommitFailed variant is never constructed or returned anywhere in the codebase. In worker.rs:193-203, commit failures are logged but no error is returned to clients since responses are sent before the commit happens (line 177). Either this variant should be removed, or the architecture needs changes to actually use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. The CommitFailed variant is intentionally kept as defensive API surface — if the architecture changes to propagate commit failures to clients in the future, the error type and HTTP 503 mapping are already in place. Removed the inconsistency though: the legacy import commit now uses the same graceful error handling as the main loop (fixed in 4c34dd4).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followed up in 2902485: removed DatastoreError::CommitFailed and its HTTP mapping since that error is never constructed in the current architecture.

@greptile-apps
Copy link

greptile-apps bot commented Feb 27, 2026

Additional Comments (1)

aw-datastore/src/worker.rs
Inconsistent error handling - this legacy_import commit failure still panics while the main loop (line 193) now logs and continues. Consider applying the same graceful handling here for consistency.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@codecov
Copy link

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.98%. Comparing base (656f3c9) to head (2902485).
⚠️ Report is 39 commits behind head on master.

Files with missing lines Patch % Lines
aw-datastore/src/worker.rs 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #572      +/-   ##
==========================================
- Coverage   70.81%   67.98%   -2.84%     
==========================================
  Files          51       54       +3     
  Lines        2916     3145     +229     
==========================================
+ Hits         2065     2138      +73     
- Misses        851     1007     +156     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The main work loop commit (line 193) was already handled gracefully
(error log + continue), but the legacy import commit (line 143) still
panicked on failure. This makes the error handling consistent.

Addresses review feedback from Greptile.
@TimeToBuildBob
Copy link
Contributor Author

Review Feedback Addressed

Fixed the inconsistency flagged by Greptile: the legacy import commit handler (line 143) now uses the same graceful error handling as the main work loop (line 193) — logs the error and continues instead of panicking.

Changes in 4c34dd4:

  • Legacy import commit failure: panic!error! + continue
  • Legacy import will be retried on next startup if the commit didn't persist

On CommitFailed being unused: Keeping the variant as defensive API surface. If we later want to propagate commit failures to HTTP clients, the error type and 503 mapping are ready. Removing it would save 4 lines but lose the forward-compatibility.

Comment on lines +204 to +205
// Continue instead of panicking — the rolled-back events will be
// re-sent by watchers (heartbeats) or retried by clients.
Copy link
Member

@ErikBjare ErikBjare Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will they? If no CommitFailed is propagated up, how would watchers/clients know?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right. The comment was wrong. Clients already received success responses before the commit happens (the worker responds at line 181 in work_loop, commit is at line 197+). So they have no signal that the commit failed and won't retry.

Fixed the comment in 75d5a4c to accurately describe what happens: the worker survives the transient failure, but the batch of events is permanently lost. Watchers will resume from current state on next heartbeat (creating a new event from that point), not re-send the specific lost events.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point — I addressed this by removing the unused CommitFailed path entirely in 2902485.

Since commit failures in work_loop are not propagated to clients (responses are sent before commit), keeping CommitFailed + HTTP 503 mapping was misleading. The PR now keeps behavior explicit: log-and-continue in the worker, with no fake API surface implying client-visible commit-failure signaling.

I also re-reviewed the rest of the PR with this model in mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed properly in c20df37. Responses are now deferred until after commit — clients get CommitFailed (HTTP 503) when the commit fails, so they know to retry. No more silent data loss.

@ErikBjare
Copy link
Member

@greptileai review

@ErikBjare
Copy link
Member

@TimeToBuildBob Please review this carefully and address #572 (comment)

@TimeToBuildBob
Copy link
Contributor Author

Follow-up on review thread #discussion_r2871104544

Addressed in commit 2902485.

  • Removed unused DatastoreError::CommitFailed
  • Removed its HTTP 503 mapping in aw-server/src/endpoints/util.rs
  • Kept commit-failure behavior explicit as worker-local log-and-continue (no implied client-visible propagation)

Also re-reviewed the PR with this corrected model; no additional correctness issues found in the current scope.

@ErikBjare
Copy link
Member

Kept commit-failure behavior explicit as worker-local log-and-continue (no implied client-visible propagation)

This seems really bad, would mean it silently (to the client) swallows requests without successful persistence.

As you said:

the worker survives the transient failure, but the batch of events is permanently lost

This seems potentially very problematic.

@tobixen
Copy link

tobixen commented Mar 2, 2026

Crashing the total app is very systemd-friendly, but would it work out well when the server is started from QT?

@TimeToBuildBob
Copy link
Contributor Author

Fix: Propagate commit failures to clients (c20df37)

Erik's right — silently swallowing commit failures is bad. The core problem: responses were sent to clients before tx.commit(), so on commit failure clients got success while data was lost.

What changed

Responses are now deferred until after commit:

  1. Inner loop collects (response, response_sender) pairs in a Vec instead of responding immediately
  2. On commit success: all deferred responses are sent as-is (same behavior as before for the happy path)
  3. On commit failure: all clients in the batch receive DatastoreError::CommitFailed (HTTP 503 Service Unavailable), so they know to retry

The worker thread still survives the failure and starts a fresh transaction on the next loop iteration.

CommitFailed variant

Re-added DatastoreError::CommitFailed(String) — this time it's actually constructed and returned to clients, not just defensive API surface. Mapped to HTTP 503, which is the right signal: "service temporarily unavailable, try again later."

Tested

  • cargo check clean (only pre-existing warnings)
  • cargo test --lib — all tests pass

@TimeToBuildBob
Copy link
Contributor Author

Fix: Read-only commands now respond immediately (12fa740)

Identified the root cause of the CI timeout: the previous commit deferred all responses (including reads) until after the transaction commit, which has a 15-second interval. So GetBucket, GetEvents, etc. were blocking up to 15s each — causing test_full to time out at 120s.

Fix: Added Command::is_readonly() that returns true for GetBucket, GetBuckets, GetEvent, GetEvents, GetEventCount, GetKeyValues, GetKeyValue. These respond immediately. Only write commands (CreateBucket, InsertEvents, Heartbeat, etc.) are deferred until commit.

This preserves the original guarantee: writes only succeed once persisted, reads are not affected.

@TimeToBuildBob TimeToBuildBob force-pushed the fix/graceful-disk-full-handling branch from 12fa740 to 2902485 Compare March 5, 2026 02:15
@TimeToBuildBob
Copy link
Contributor Author

Reverted: simpler approach is correct (2902485)

The deferred-response approach (c20df37, 12fa740) introduced a deadlock:

  • Write commands (InsertEvents, etc.) were deferred until after commit
  • But sequential clients block waiting for each response before sending the next request
  • The server was waiting for more requests before committing → deadlock

Both commits have been dropped. The PR is now back to the clean approach from c79081f + 4c34dd4:

  • Respond immediately (original behavior)
  • On commit failure: log the error and continue instead of panicking
  • The worker thread survives disk-full conditions

The trade-off: clients may receive success for writes that didn't persist (only on disk-full). But this is correct behavior — panicking and killing the server was far worse. The comment in the code explains this clearly.

@ErikBjare ErikBjare merged commit 9a8802a into ActivityWatch:master Mar 5, 2026
14 checks passed
@ErikBjare
Copy link
Member

Not a perfect solution (dropping events on disk full), but better than the status quo. So I merged it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Worker thread crashes (all API requests timeout) after disk has been full

3 participants