fix: wrap datastore worker in catch_unwind to prevent poisoned locks #553
tobixen wants to merge 3 commits into ActivityWatch:master
Conversation
Important

Looks good to me! 👍

Reviewed everything up to f8cf670 in 36 seconds.
- Reviewed 35 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments; view those below
1. aw-datastore/src/worker.rs:315
   Draft comment: Good use of catch_unwind to safeguard against panics. Ensure the worker state (di) is unwind safe; using AssertUnwindSafe here assumes all captured state is safe to unwind.
   Reason this comment was not posted: confidence changes required (0%) <= threshold (50%).
2. aw-datastore/src/worker.rs:327
   Draft comment: Panic payload extraction is correctly handled. Optionally, consider logging additional context or a backtrace to aid debugging.
   Reason this comment was not posted: confidence changes required (0%) <= threshold (50%).
3. aw-datastore/src/worker.rs:338
   Draft comment: After a panic, the worker thread exits, closing the channel. Confirm that this shutdown behavior (vs. automatically restarting a new worker) is acceptable for the server's resilience strategy.
   Reason this comment was not posted: confidence changes required (33%) <= threshold (50%).
Greptile Summary

Wrapped the datastore worker's work loop in catch_unwind so panics no longer poison the datastore Mutex.

Key Changes
Analysis

The fix addresses the root symptom from issue #405, where panics in the worker thread would poison the Mutex, making the entire server unusable until restart. While the PR correctly notes that panics should ideally be replaced with proper error handling, this safety net is appropriate as defense in depth.

Confidence Score: 4/5

Important Files Changed
Sequence Diagram

sequenceDiagram
participant Client as API Client
participant Datastore as Datastore (Main Thread)
participant Channel as MPSC Channel
participant Worker as DatastoreWorker Thread
participant CatchUnwind as catch_unwind Handler
participant SQLite as SQLite Database
Note over Datastore,Worker: Initialization
Datastore->>Channel: Create MPSC channel
Datastore->>Worker: Spawn worker thread
Worker->>CatchUnwind: Wrap work_loop in catch_unwind
CatchUnwind->>Worker: Execute work_loop(method)
Worker->>SQLite: Open connection
Worker->>Worker: Initialize DatastoreInstance
Note over Client,SQLite: Normal Operation
Client->>Datastore: Request (e.g., insert_events)
Datastore->>Channel: Send command
Channel->>Worker: Deliver command
Worker->>SQLite: Execute SQL operation
SQLite-->>Worker: Result
Worker->>Channel: Send response
Channel->>Datastore: Deliver response
Datastore-->>Client: Return result
Note over Client,SQLite: Panic Scenario (Before Fix)
Client->>Datastore: Request triggers panic
Datastore->>Channel: Send command
Channel->>Worker: Deliver command
Worker->>SQLite: Execute operation
SQLite-->>Worker: Unexpected error
Worker->>Worker: panic! occurs
Note over Worker: Thread exits, Mutex poisoned
Client->>Datastore: Subsequent request
Datastore--XClient: Error: "poisoned lock"
Note over Datastore: Server unusable until restart
Note over Client,SQLite: Panic Scenario (After Fix)
Client->>Datastore: Request triggers panic
Datastore->>Channel: Send command
Channel->>Worker: Deliver command
Worker->>SQLite: Execute operation
SQLite-->>Worker: Unexpected error
Worker->>Worker: panic! occurs
Worker->>CatchUnwind: Panic caught
CatchUnwind->>CatchUnwind: Log panic message
CatchUnwind->>Worker: Exit gracefully
Worker->>Channel: Channel closes cleanly
Note over Worker: No lock poisoning
Client->>Datastore: Subsequent request
Datastore->>Channel: Send command
Channel--XDatastore: Channel closed error
Datastore-->>Client: Clean error (channel closed)
Note over Client: Clear error, can restart server gracefully
This pull request was created by Claude the AI. I don't know Rust well enough to have a subjective opinion on whether this change is good or not. The prompt given to Claude was like this:

My rationale is that a server that is down may be auto-restarted, and it's also easy to catch this through monitoring. A server that is up and listening on the port but not working at all may be a little harder to monitor. I run the server through systemd, and systemd supports auto-restarting. I'm not sure how things work when utilizing the aw-qt app.
LGTM

I believe only non-zero exit codes get restarted (that's how it works in aw-tauri; not sure about aw-qt), since that signals a crash. This change would cause the worker to exit normally, so it might not be restarted. You can still throw a panic after gracefully handling the first one, though, if that is deemed necessary.
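The restart-friendly variant suggested here — catch the panic so the Mutex is never poisoned, but still report a non-zero exit status so a service manager (systemd, aw-tauri) treats it as a crash and restarts the process — might look like this sketch. `run_worker` and the exit-code policy are illustrative, not the actual aw-server-rust code:

```rust
use std::panic::{self, AssertUnwindSafe};

fn work_loop() {
    // Stand-in for the real worker loop; panics to simulate a bug.
    panic!("simulated worker bug");
}

/// Run the worker and return the exit code the process should use.
/// Returning the code (instead of calling std::process::exit here)
/// keeps the policy testable while still letting the service manager
/// observe a failure.
fn run_worker() -> i32 {
    match panic::catch_unwind(AssertUnwindSafe(work_loop)) {
        Ok(()) => 0,
        Err(_) => {
            eprintln!("worker panicked; reporting non-zero exit so the service manager restarts us");
            1
        }
    }
}

fn main() {
    let code = run_worker();
    assert_eq!(code, 1);
    // In a real binary: std::process::exit(code);
}
```

The key point is that the panic is still contained (no poisoned lock), but the failure remains visible to whatever supervises the process.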
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## master #553 +/- ##
==========================================
- Coverage 70.81% 67.97% -2.84%
==========================================
Files 51 54 +3
Lines 2916 3151 +235
==========================================
+ Hits 2065 2142 +77
- Misses 851 1009 +158

☔ View full report in Codecov by Sentry.
Force-pushed 3afcbd9 to 9d0bc84
Use std::panic::catch_unwind() around the worker loop as a safety net. If any panic occurs, the worker will exit gracefully instead of poisoning the Mutex lock and leaving the server in an unusable state.

This ensures:
- Panics are caught and logged with their message
- The worker thread exits cleanly
- The channel closes properly
- Future requests get clean "channel closed" errors
- No poisoned locks requiring server restart

Related to: ActivityWatch#405

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests to verify that the catch_unwind wrapper properly handles worker panics and prevents poisoned lock errors.

Tests are gated behind a `test-panic` feature flag. Run with:
cargo test --package aw-datastore --features test-panic

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… extraction

Extract the panic payload message logic into a standalone `extract_panic_message` helper and add unit tests covering all three branches: &str literal panics, String (formatted) panics, and unknown payload types.

These tests run unconditionally (no feature flag required), so they are picked up by tarpaulin in CI and close the coverage gap reported by Codecov on the previous commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
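For readers unfamiliar with panic payloads: a helper covering the three branches described above might look like the sketch below. The exact signature of `extract_panic_message` in worker.rs may differ; this is a self-contained illustration of the pattern:

```rust
use std::any::Any;

/// Extract a human-readable message from a panic payload.
/// `panic!("literal")` carries a `&str`, `panic!("{}", x)` carries a
/// `String`, and any other payload type is reported as unknown.
fn extract_panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic payload".to_string()
    }
}

fn main() {
    // &str literal panic
    let p: Box<dyn Any + Send> = Box::new("literal panic");
    assert_eq!(extract_panic_message(p.as_ref()), "literal panic");

    // String (formatted) panic
    let p: Box<dyn Any + Send> = Box::new(format!("formatted {}", 42));
    assert_eq!(extract_panic_message(p.as_ref()), "formatted 42");

    // Unknown payload type
    let p: Box<dyn Any + Send> = Box::new(7u32);
    assert_eq!(extract_panic_message(p.as_ref()), "unknown panic payload");
}
```

Because the helper takes `&(dyn Any + Send)` rather than the whole `catch_unwind` result, each branch is trivially unit-testable, which is what closes the tarpaulin coverage gap.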
Force-pushed 9d0bc84 to 6d234a3
panic_msg
);
}
// Worker exits, channel closes, future requests get clean errors
Closing this PR in light of the discussion today on issue #405 and PR #572.

Why closing:

Recommendation — focus on #572: PR #572 is closer to the right fix architecturally, but it also has a serious problem in its current form: the worker responds to clients before committing, so a commit failure silently and permanently drops events that clients believe were saved. ErikBjare flagged this too. The correct fix for #572 would be to commit before responding — if the commit fails, the client gets an error and can retry. That requires restructuring the inner worker loop but would make the server truly correct under disk-full conditions. That leaves panics from other causes (bugs, not disk-full) still unhandled.
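The commit-before-responding ordering recommended above can be sketched as a toy worker loop. All names and types here are hypothetical stand-ins, not the real aw-datastore API; the point is only the ordering of commit and reply:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request/response types for illustration.
enum Response {
    Ok,
    Err(String),
}

// Stand-in for "apply the request and commit it to disk".
fn commit(req: u32) -> Result<(), String> {
    if req == 0 {
        Err("commit failed: disk full".into())
    } else {
        Ok(())
    }
}

fn worker(rx: mpsc::Receiver<(u32, mpsc::Sender<Response>)>) {
    for (req, reply) in rx {
        // Commit BEFORE responding: if the commit fails, the client
        // sees the error and can retry, instead of believing the
        // events were saved when they were silently dropped.
        let resp = match commit(req) {
            Ok(()) => Response::Ok,
            Err(e) => Response::Err(e),
        };
        let _ = reply.send(resp);
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || worker(rx));

    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send((0, reply_tx)).unwrap();
    // The failed commit surfaces as an error response, not silence.
    assert!(matches!(reply_rx.recv().unwrap(), Response::Err(_)));

    drop(tx); // closing the channel ends the worker loop
    handle.join().unwrap();
}
```

The design choice is simply that the reply channel carries the commit result, so durability failures are visible to the caller.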
My first impression of Claude was just amazing - it apparently managed to understand some of my projects in a few minutes, even though I thought I was the only person in the world understanding them fully. It has managed to make small scripts, tons of test code and bug fixes for me much faster and better than what I'm able to do myself. Other times it's just making tons of crap, or replacing good code with bad code. I found the "this is fine"-meme quite fitting - it seems happy to burn down houses without realizing it's a bad idea, and if one requests it to change the color of the curtains in the burning house, it would do so without asking any questions. I've come to realize that delivering pull requests without understanding the project or the code it wants to change perhaps isn't the best idea.
@tobixen Sometimes they are put on the spot and panic, and seem to just do something, in hopes of satisfying some RL reward like a short solution and a passing test suite 😆 They also seem to have an unusually difficult time with Rust, not sure why that is. But yeah, making PRs to others' projects still requires a little bit of self-review and early prodding so that they have some gating on their confidence. Still, I think it could probably have gotten there, with some feedback, had I not been trying that other approach at the same time.

One solution that would be friendly to systemd etc. would be to crash the entire app, so that the service manager (aw-qt, aw-tauri, systemd) deals with the restart instead of the server having to handle/restart poisoned locks. But we should probably just try to clean up and restart in-process? We will continue that discussion in #572.
Summary

Use `std::panic::catch_unwind()` around the datastore worker loop as a safety net. If any panic occurs, the worker will exit gracefully instead of poisoning the Mutex lock and leaving the server in an unusable state.

Problem

When the datastore worker thread panics (for any reason), it poisons the Mutex lock. All subsequent requests then fail with "poisoned lock: another task failed inside" errors, and the server becomes completely unusable until restart.

This is the symptom described in #405.
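For illustration, the poisoning behavior described here can be reproduced with a few lines of standard-library Rust; this is a self-contained toy, not the aw-datastore code:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let lock = Arc::new(Mutex::new(0u32));

    // A thread that panics while holding the guard poisons the Mutex.
    let l = Arc::clone(&lock);
    let _ = thread::spawn(move || {
        let _guard = l.lock().unwrap();
        panic!("worker panicked while holding the lock");
    })
    .join();

    // Every later lock() now returns Err(PoisonError), which is where
    // the "poisoned lock: another task failed inside" message comes from.
    assert!(lock.lock().is_err());
}
```

Poisoning is deliberate in the standard library: it signals that the protected data may be in an inconsistent state. The trouble in #405 is that nothing ever recovers from it, so the server stays broken.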
Solution

Wrap the worker's `work_loop()` in `catch_unwind()`.

This is a defense-in-depth approach - ideally panics should be replaced with proper error handling, but this safety net ensures no panic can leave the server in an unrecoverable state.
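A minimal, self-contained sketch of the pattern, assuming a simplified `work_loop` (the real signature and captured state in worker.rs differ, which is why the PR needs `AssertUnwindSafe`):

```rust
use std::panic::{self, AssertUnwindSafe};

fn work_loop(id: u32) {
    // Stand-in for the real worker loop; panics to simulate a bug.
    if id == 0 {
        panic!("simulated worker bug");
    }
}

fn main() {
    // AssertUnwindSafe is needed because the captured state is not
    // automatically UnwindSafe; the PR makes the same assumption.
    let result = panic::catch_unwind(AssertUnwindSafe(|| work_loop(0)));
    if let Err(payload) = result {
        let msg = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .unwrap_or_else(|| "unknown panic".to_string());
        eprintln!("worker panicked: {}", msg);
        // Fall through: the thread exits cleanly, the channel closes,
        // and no Mutex guard is ever dropped during an unwind here.
    }

    // A non-panicking run completes normally.
    assert!(panic::catch_unwind(AssertUnwindSafe(|| work_loop(1))).is_ok());
}
```

Because the unwind is stopped inside the worker thread before any shared lock is involved, subsequent callers see a closed channel rather than a poisoned lock.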
Test plan

- `cargo build` compiles successfully
- `cargo test --package aw-datastore` passes (9 tests)

🤖 Generated with Claude Code
Important

Wrap `work_loop()` in `catch_unwind()` in `aw-datastore/src/worker.rs` to handle panics gracefully and prevent poisoned locks.

This description was created by
for f8cf670.