fix(provider-node): deflake coordinator pause/stop startup race by bkontur · Pull Request #161 · paritytech/web3-storage

bkontur · 2026-06-11T08:35:38Z

Problem

challenge::test_resume_after_pause is flaky and failed on an unrelated CI run (e.g. the PR #158 hardening run):

challenge::test_resume_after_pause panicked at provider-node/tests/coordinators/challenge.rs:418:
assertion failed: mock.submitted.lock().unwrap().is_empty()

Root cause

Each coordinator's run_loop does:

tokio::select! {
    cmd = command_rx.recv() => { ... }   // Pause / Stop / Resume
    _   = interval.tick()    => { ... }  // poll + auto-respond
}

tokio::time::interval completes its first tick immediately. So on the first loop iteration both arms can be ready, and an unbiased select! chooses at random which branch to poll first. A Pause/Stop queued right after start() could therefore lose to that immediate tick: the loop then polls and submits once before the command takes effect, leaving mock.submitted non-empty and tripping the assertion ~50% of the time.

Fix

Add biased; to the select! in all four coordinators (challenge, checkpoint, replica-sync, agreement) so control commands are polled before the timer tick. Pause/Stop now deterministically take effect before the first poll, while immediate-poll-on-start is preserved when no command is pending.

Applied to all four because they're copies of the same loop with the same latent race; only the challenge test currently asserts in a way that catches it.
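The effect of the polling order can be sketched with a std-only model. This is not the coordinator code and does not use tokio: `PollOrder`, `run_loop`, and the two helper functions are hypothetical names, the timer is modeled as always ready, and the channel is a plain `std::sync::mpsc` channel. `CommandFirst` stands in for what `biased;` guarantees (branches are polled top to bottom), while `TickFirst` stands in for the unlucky outcome an unbiased select can produce on the first iteration.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Control commands, mirroring the Pause / Stop / Resume arm of the loop.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Command {
    Pause,
    Resume,
    Stop,
}

/// Which arm is checked first on the first iteration (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PollOrder {
    /// What `biased;` guarantees: the command arm is checked first.
    CommandFirst,
    /// One outcome an unbiased select may produce: the tick arm wins.
    TickFirst,
}

/// Simplified stand-in for a coordinator run loop. The timer is modeled as
/// always ready, like an interval whose first tick fires immediately.
/// `first` applies to the first iteration only; later iterations check the
/// command channel first. Returns how many times the loop "submitted".
pub fn run_loop(rx: &Receiver<Command>, first: PollOrder, iterations: usize) -> usize {
    let mut paused = false;
    let mut submitted = 0;
    for i in 0..iterations {
        let check_command = i > 0 || first == PollOrder::CommandFirst;
        let cmd = if check_command { rx.try_recv().ok() } else { None };
        match cmd {
            Some(Command::Pause) => paused = true,
            Some(Command::Resume) => paused = false,
            Some(Command::Stop) => break,
            // No command taken: the tick arm runs and submits unless paused.
            None => {
                if !paused {
                    submitted += 1;
                }
            }
        }
    }
    submitted
}

/// Queue `Pause` before the loop starts, as a test would right after start().
pub fn submits_with_pause_queued(first: PollOrder) -> usize {
    let (tx, rx) = channel();
    tx.send(Command::Pause).unwrap();
    run_loop(&rx, first, 3)
}

/// No command pending: the immediate poll on start still happens.
pub fn submits_with_no_command(first: PollOrder) -> usize {
    let (_tx, rx) = channel::<Command>();
    run_loop(&rx, first, 3)
}
```

In this model, `submits_with_pause_queued(PollOrder::TickFirst)` records one submission before the pause lands, which is the condition the failing assertion detects, while `PollOrder::CommandFirst` records none. With no command queued, both orders still submit on the first iteration, matching the preserved immediate-poll-on-start behavior.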

Verification

cargo test -p storage-provider-node --test coordinators → 39/39 pass
previously-flaky test looped → 50/50 pass (was ~50% before)
cargo +nightly fmt --all → clean
cargo clippy -p storage-provider-node --all-targets --all-features -- -D warnings → clean
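
As a rough sanity check on the 50/50 result: assuming the pre-fix failure rate really was about 50% per run and that runs are independent (both are assumptions, not measurements reported in this PR), the chance of 50 consecutive passes without the fix would be

```latex
P(\text{50 passes in a row}) = (1 - 0.5)^{50} = 2^{-50} \approx 8.9 \times 10^{-16}
```

Under those assumptions, the looped result is very unlikely to be luck.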

… race

The coordinator run loops select over a command channel and a poll interval. tokio::time::interval fires its first tick immediately, so on the first loop iteration both arms are ready and the unbiased select picked one at random. A Pause/Stop queued right after start() could lose to that immediate tick, letting one poll run (and submit) before the command took effect. This made challenge::test_resume_after_pause flaky.

Add `biased;` to the select in all four coordinators (challenge, checkpoint, replica-sync, agreement) so control commands are polled before the timer tick. Pause/Stop now deterministically take effect before the first poll, while immediate-poll-on-start is preserved when no command is pending.

bkontur mentioned this pull request Jun 11, 2026

test(provider-node): add HTTP and storage integration tests #147

Merged

3 tasks

bkontur merged commit 3b43d65 into dev Jun 11, 2026
26 checks passed

bkontur deleted the fix-flaky-challenge-responder-pause branch June 11, 2026 09:14
