Fix Spurious recovery action trigger on first run target activation (#198)#216
Conversation
License Check Results🚀 The license check job ran with the Bazel command: bazel run --lockfile_mode=error //:license-checkStatus: Click to expand output |
|
The created documentation from the pull request is available at: docu-html |
MaciejKaszynski
left a comment
There was a problem hiding this comment.
The fix makes sense. Ideally we would run this with tsan and helgrind to make sure however there is a plan to refactor the graph to make the concurrency nicer so don't think there is any point in doing this now.
f72f0a9 to
f7dab10
Compare
There was a problem hiding this comment.
Pull request overview
Fixes a race condition (#198) where a transition could complete to kSuccess concurrently with another thread setting kCancelled, leading the state machine into kUndefinedState and triggering a spurious recovery action. A std::shared_mutex (transition_completion_mutex_) is added to mutually exclude transition finalization (in nodeExecuted, holding the unique lock) from new transition starts and cancellations (holding shared locks). With the race fixed, the sleep(1) workarounds added to the integration tests are removed.
Changes:
- Add
transition_completion_mutex_(shared_mutex) toGraph; finalization innodeExecutedtakes an exclusive lock whilestartTransition,startTransitionToOffState, andcanceltake shared locks around their state changes. - Remove the temporary
sleep(1)workarounds from four integration tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| score/launch_manager/daemon/src/process_group_manager/details/graph.hpp | Adds <shared_mutex> include and the new transition_completion_mutex_ member. |
| score/launch_manager/daemon/src/process_group_manager/details/graph.cpp | Acquires shared/unique locks around transition start, cancel, and node-executed completion to serialize completion vs start/cancel. |
| tests/integration/smoke/control_daemon_mock.cpp | Removes the sleep(1) workaround for #198. |
| tests/integration/process_crash_monitoring/control_client_mock.cpp | Removes the sleep(1) workaround for #198. |
| tests/integration/crash_on_startup/control_client_mock.cpp | Removes the sleep(1) workaround for #198. |
| tests/integration/complex_monitoring/control_client_mock.cpp | Removes the sleep(1) workaround for #198. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Added a shared mutex to protect completing transitions. This resolves a race where a transition would finish (in
Graph::handleTransitionExecution) and the state would be changed tokCancelledby another thread just after being set tokSuccess, which is a blocked transition. This led to the undefined state being set and a recovery action triggering.Fixes #198