Fix Spurious recovery action trigger on first run target activation (#198) #216

Merged
NicolasFussberger merged 3 commits into eclipse-score:main from etas-contrib:feature/fix-spurious-failures on Jun 2, 2026

Conversation

Contributor

@WilliamRoebuck WilliamRoebuck commented May 27, 2026

Added a shared mutex to protect the completion of transitions. This resolves a race in which a transition would finish (in Graph::handleTransitionExecution) and, just after the state had been set to kSuccess, another thread would change it to kCancelled. Going from kSuccess to kCancelled is a blocked transition, so the undefined state was set and a recovery action was triggered.

Fixes #198

@WilliamRoebuck WilliamRoebuck changed the title Fix Fix ' May 27, 2026
@WilliamRoebuck WilliamRoebuck changed the title Fix ' Fix #198 May 27, 2026

github-actions Bot commented May 27, 2026

License Check Results

🚀 The license check job ran with the Bazel command:

bazel run --lockfile_mode=error //:license-check

Status: ⚠️ Needs Review

[License Check Output]
Extracting Bazel installation...
Starting local Bazel server (8.4.2) and connecting to it...
INFO: Invocation ID: 8e9470b0-6785-479d-bbc2-ce43c3ab7a77
Computing main repo mapping: 
Computing main repo mapping: 
Loading: 
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
    currently loading: 
Loading: 0 packages loaded
    currently loading: 
Analyzing: target //:license-check (1 packages loaded, 0 targets configured)
Analyzing: target //:license-check (1 packages loaded, 0 targets configured)

Analyzing: target //:license-check (35 packages loaded, 10 targets configured)

Analyzing: target //:license-check (105 packages loaded, 183 targets configured)

Analyzing: target //:license-check (145 packages loaded, 3288 targets configured)

Analyzing: target //:license-check (147 packages loaded, 7487 targets configured)

Analyzing: target //:license-check (152 packages loaded, 8045 targets configured)

Analyzing: target //:license-check (157 packages loaded, 8094 targets configured)

Analyzing: target //:license-check (158 packages loaded, 8218 targets configured)

INFO: Analyzed target //:license-check (162 packages loaded, 10232 targets configured).
[12 / 16] JavaToolchainCompileClasses external/rules_java+/toolchains/platformclasspath_classes; 0s disk-cache, processwrapper-sandbox ... (2 actions, 1 running)
[15 / 16] [Prepa] Building license.check.license_check.jar ()
INFO: Found 1 target...
Target //:license.check.license_check up-to-date:
  bazel-bin/license.check.license_check
  bazel-bin/license.check.license_check.jar
INFO: Elapsed time: 18.806s, Critical Path: 2.47s
INFO: 16 processes: 12 internal, 3 processwrapper-sandbox, 1 worker.
INFO: Build completed successfully, 16 total actions
INFO: Running command line: bazel-bin/license.check.license_check ./formatted.txt <args omitted>
usage: org.eclipse.dash.licenses.cli.Main [-batch <int>] [-cd <url>]
       [-confidence <int>] [-ef <url>] [-excludeSources <sources>] [-help] [-lic
       <url>] [-project <shortname>] [-repo <url>] [-review] [-summary <file>]
       [-timeout <seconds>] [-token <token>]

@github-actions

The created documentation from the pull request is available at: docu-html

@NicolasFussberger NicolasFussberger changed the title Fix #198 Fix Spurious recovery action trigger on first run target activation (#198) May 28, 2026
Contributor

@MaciejKaszynski MaciejKaszynski left a comment

The fix makes sense. Ideally we would run this with TSan and Helgrind to make sure. However, there is a plan to refactor the graph to make the concurrency nicer, so I don't think there is any point in doing this now.

@WilliamRoebuck WilliamRoebuck force-pushed the feature/fix-spurious-failures branch from f72f0a9 to f7dab10 Compare June 2, 2026 12:07
@NicolasFussberger NicolasFussberger requested a review from Copilot June 2, 2026 12:51

Copilot AI left a comment

Pull request overview

Fixes a race condition (#198) where a transition could complete to kSuccess concurrently with another thread setting kCancelled, leading the state machine into kUndefinedState and triggering a spurious recovery action. A std::shared_mutex (transition_completion_mutex_) is added to mutually exclude transition finalization (in nodeExecuted, holding the unique lock) from new transition starts and cancellations (holding shared locks). With the race fixed, the sleep(1) workarounds added to the integration tests are removed.

Changes:

  • Add transition_completion_mutex_ (shared_mutex) to Graph; finalization in nodeExecuted takes an exclusive lock while startTransition, startTransitionToOffState, and cancel take shared locks around their state changes.
  • Remove the temporary sleep(1) workarounds from four integration tests.
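
The locking scheme described above can be sketched roughly as follows. This is a minimal illustration and not the code from graph.cpp: the class GraphSketch, the kInTransition state, the setState rule, and the helper countUndefinedStates are hypothetical names introduced here. Only transition_completion_mutex_, nodeExecuted, startTransition, cancel, kSuccess, kCancelled, and kUndefinedState are taken from the discussion. The sketch assumes that any change away from a final state is blocked and recorded as kUndefinedState.

```cpp
#include <atomic>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hypothetical, simplified states. kInTransition is an assumption made
// for this sketch; the other names appear in the PR discussion.
enum class State { kInTransition, kSuccess, kCancelled, kUndefinedState };

// Hypothetical stand-in for Graph, reduced to the locking pattern.
class GraphSketch {
public:
    // Start path: shared lock, so it cannot interleave with finalization.
    void startTransition() {
        std::shared_lock<std::shared_mutex> lock(transition_completion_mutex_);
        state_.store(State::kInTransition);
    }

    // Cancel path: the shared lock covers both the check and the change.
    void cancel() {
        std::shared_lock<std::shared_mutex> lock(transition_completion_mutex_);
        if (state_.load() == State::kInTransition) {
            setState(State::kCancelled);
        }
    }

    // Finalization path: the exclusive lock keeps out starts and cancels.
    void nodeExecuted() {
        std::unique_lock<std::shared_mutex> lock(transition_completion_mutex_);
        if (state_.load() == State::kInTransition) {
            setState(State::kSuccess);
        }
    }

    State state() const { return state_.load(); }

private:
    // Assumed rule: only kInTransition may move to a final state. Any
    // other change to a different state is blocked and ends up undefined.
    void setState(State next) {
        State expected = State::kInTransition;
        if (!state_.compare_exchange_strong(expected, next) && expected != next) {
            state_.store(State::kUndefinedState);
        }
    }

    std::shared_mutex transition_completion_mutex_;
    std::atomic<State> state_{State::kSuccess};
};

// Races one finalization against one cancel per round and returns how
// many rounds ended in kUndefinedState.
int countUndefinedStates(int rounds) {
    int undefined = 0;
    for (int i = 0; i < rounds; ++i) {
        GraphSketch graph;
        graph.startTransition();
        std::thread finisher([&graph] { graph.nodeExecuted(); });
        std::thread canceller([&graph] { graph.cancel(); });
        finisher.join();
        canceller.join();
        if (graph.state() == State::kUndefinedState) {
            ++undefined;
        }
    }
    return undefined;
}
```

In this sketch, whichever of nodeExecuted and cancel acquires the mutex first decides the final state, and the other call then sees a state that is no longer in transition and does nothing. Without the locks, cancel could pass its check, lose the race to nodeExecuted, and then attempt the blocked change from kSuccess to kCancelled.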

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| score/launch_manager/daemon/src/process_group_manager/details/graph.hpp | Adds <shared_mutex> include and the new transition_completion_mutex_ member. |
| score/launch_manager/daemon/src/process_group_manager/details/graph.cpp | Acquires shared/unique locks around transition start, cancel, and node-executed completion to serialize completion vs start/cancel. |
| tests/integration/smoke/control_daemon_mock.cpp | Removes the sleep(1) workaround for #198. |
| tests/integration/process_crash_monitoring/control_client_mock.cpp | Removes the sleep(1) workaround for #198. |
| tests/integration/crash_on_startup/control_client_mock.cpp | Removes the sleep(1) workaround for #198. |
| tests/integration/complex_monitoring/control_client_mock.cpp | Removes the sleep(1) workaround for #198. |

Comment thread score/launch_manager/daemon/src/process_group_manager/details/graph.hpp Outdated
@NicolasFussberger NicolasFussberger merged commit 2e2c5ca into eclipse-score:main Jun 2, 2026
19 of 27 checks passed

Development

Successfully merging this pull request may close these issues.

Spurious recovery action trigger on first run target activation

4 participants