fix(provider-node): deflake coordinator pause/stop startup race #161

Merged: bkontur merged 1 commit into dev from fix-flaky-challenge-responder-pause on Jun 11, 2026

bkontur (Collaborator) commented Jun 11, 2026

Problem

challenge::test_resume_after_pause is flaky and failed on an unrelated CI run (e.g. the PR #158 hardening run):

challenge::test_resume_after_pause panicked at provider-node/tests/coordinators/challenge.rs:418:
assertion failed: mock.submitted.lock().unwrap().is_empty()

Root cause

Each coordinator's run_loop does:

tokio::select! {
    cmd = command_rx.recv() => { ... }   // Pause / Stop / Resume
    _   = interval.tick()    => { ... }  // poll + auto-respond
}

tokio::time::interval fires its first tick immediately. So on the first loop iteration both arms are ready, and an unbiased select! picks one at random. A Pause/Stop queued right after start() could lose to that immediate tick — the loop then polls and submits once before the command takes effect, leaving mock.submitted non-empty and tripping the assertion ~50% of the time.
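
The first-iteration race can be illustrated with a small std-only model. This is a sketch rather than the coordinator code: `Arm`, `winning_arm`, and the `tick_polled_first` flag are hypothetical names, and the flag stands in for the random starting branch that an unbiased select! chooses.

```rust
/// The two arms of the run loop's select!, as described above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Arm {
    Command,
    Tick,
}

/// Hypothetical model of which ready arm runs on one loop iteration.
/// `biased` means arms are polled in declaration order (Command first).
/// `tick_polled_first` stands in for the random starting arm of an
/// unbiased select!; it is ignored when `biased` is true.
pub fn winning_arm(
    command_ready: bool,
    tick_ready: bool,
    biased: bool,
    tick_polled_first: bool,
) -> Option<Arm> {
    match (command_ready, tick_ready) {
        (false, false) => None,
        (true, false) => Some(Arm::Command),
        (false, true) => Some(Arm::Tick),
        // Both ready: the arm polled first wins.
        (true, true) if biased || !tick_polled_first => Some(Arm::Command),
        (true, true) => Some(Arm::Tick),
    }
}
```

With both arms ready and `biased` false, the result depends only on `tick_polled_first`, which matches the roughly even split between passing and failing runs reported above.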

Fix

Add biased; to the select! in all four coordinators (challenge, checkpoint, replica-sync, agreement) so control commands are polled before the timer tick. Pause/Stop now deterministically take effect before the first poll, while immediate-poll-on-start is preserved when no command is pending.

Applied to all four because they're copies of the same loop with the same latent race; only the challenge test currently asserts in a way that catches it.
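
The command-first ordering can be sketched without tokio as a single synchronous loop step. This is an illustrative model under stated assumptions, not the merged change: `Command`, `LoopState`, and `step` are hypothetical names, `std::sync::mpsc` stands in for the coordinator's command channel, and pushing a counter onto `submitted` stands in for poll + auto-respond.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical control commands, named after the ones in the description.
pub enum Command {
    Pause,
    Resume,
    Stop,
}

/// Hypothetical loop state; `submitted` plays the role of mock.submitted.
#[derive(Default)]
pub struct LoopState {
    pub paused: bool,
    pub stopped: bool,
    pub submitted: Vec<u64>,
}

/// One loop iteration that checks the command channel before the tick,
/// mirroring the ordering that `biased;` gives the select!.
pub fn step(rx: &Receiver<Command>, state: &mut LoopState, tick_ready: bool) {
    // Command arm: a pending command always wins this iteration.
    if let Ok(cmd) = rx.try_recv() {
        match cmd {
            Command::Pause => state.paused = true,
            Command::Resume => state.paused = false,
            Command::Stop => state.stopped = true,
        }
        return;
    }
    // Tick arm: only reached when no command was pending.
    if tick_ready && !state.paused && !state.stopped {
        let next = state.submitted.len() as u64;
        state.submitted.push(next);
    }
}
```

In this model a Pause sent before the first `step` leaves `submitted` empty, while a fresh loop with no pending command still submits on its first ready tick.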

Verification

  • cargo test -p storage-provider-node --test coordinators → 39/39 pass
  • previously-flaky test looped → 50/50 pass (was ~50% before)
  • cargo +nightly fmt --all → clean
  • cargo clippy -p storage-provider-node --all-targets --all-features -- -D warnings → clean

… race

The coordinator run loops select over a command channel and a poll
interval. tokio::time::interval fires its first tick immediately, so on
the first loop iteration both arms are ready and the unbiased select
picked one at random — a Pause/Stop queued right after start() could lose
to that immediate tick, letting one poll run (and submit) before the
command took effect. This made challenge::test_resume_after_pause flaky.

Add `biased;` to the select in all four coordinators (challenge,
checkpoint, replica-sync, agreement) so control commands are polled
before the timer tick. Pause/Stop now deterministically take effect
before the first poll, while immediate-poll-on-start is preserved when no
command is pending.
@bkontur bkontur merged commit 3b43d65 into dev Jun 11, 2026
26 checks passed
@bkontur bkontur deleted the fix-flaky-challenge-responder-pause branch June 11, 2026 09:14