fix(bittensor): improve reconnection by properly stopping old BlockSync #33

Merged
echobt merged 1 commit into main from fix/bittensor-reconnection-cleanup
Feb 3, 2026

Conversation

echobt (Contributor) commented Feb 3, 2026

Summary

When Bittensor RPC connection errors occur (e.g., 'restart required'), the reconnection logic now properly stops the old BlockSync before creating a new one. This prevents the old internal task from continuing to emit errors.

Problem

When the error Subxt error: RPC error: ... connection closed; restart required occurred, it would spam the logs every 5 seconds because:

  1. The old BlockSync's internal task wasn't being properly stopped
  2. We were only aborting the wrapper JoinHandle, not the internal event processing task
  3. The old listener continued running and emitting ConnectionError events

Solution

  • Track the BlockSync instance directly instead of just a spawned task handle
  • Call sync.stop() before creating a new connection to properly clean up internal tasks
  • Remove the unnecessary wrapper task around sync.start()
  • Maintain exponential backoff for reconnection attempts

Testing

  • Code compiles successfully (cargo check)
  • No clippy warnings (cargo clippy)
  • Code formatted (cargo fmt)

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved Bittensor reconnection handling with exponential backoff to prevent rapid reconnection attempts.
    • Enhanced critical error detection for connection failures.
    • More stable block event processing and weight submission handling.
  • Refactor

    • Streamlined internal state management and function signatures for improved code maintainability.

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

The PR refactors the validator-node's core initialization and event-handling logic by introducing context structs to bundle related parameters, adding exponential backoff-based reconnection state management, and restructuring the main loop to gracefully handle Bittensor disconnections and block events through formalized context objects and state transitions.

Changes

All changes are in bins/validator-node/src/main.rs:

  • New Context and State Types: Introduces WeightSubmissionContext and BlockEventContext structs to group related parameters; adds a ReconnectionState struct to manage exponential backoff with methods for state transitions; adds ReconnectionResult to encapsulate post-reconnection artifacts (client, BlockSync, event receiver).
  • Reconnection and Error Handling: Adds an is_critical_bittensor_error function to detect fatal connection errors; introduces an attempt_bittensor_reconnect async function to gracefully shut down the old BlockSync, create a new client/BlockSync, and return the reconnection result.
  • Function Refactoring: Refactors submit_weights_for_epoch to accept WeightSubmissionContext instead of multiple separate parameters; refactors handle_block_event to accept BlockEventContext instead of a raw parameter list.
  • Main Loop Restructuring: Refactors main initialization and the event-loop flow to incorporate block_sync, block_rx, and reconnect_state with exponential backoff logic; replaces ad-hoc disconnection handling with structured retry logic and immediate reconnection attempts for critical errors when permitted by the backoff state.

Sequence Diagram

sequenceDiagram
    participant Main as Main Loop
    participant BlockSync as BlockSync
    participant BittensorClient as BittensorClient
    participant ReconnectState as ReconnectionState

    Main->>BlockSync: Receive BlockSyncEvent
    alt Event received
        Main->>Main: Handle event with BlockEventContext
        Note over Main: Process block/challenge data
    else Error detected
        Main->>ReconnectState: Check should_attempt_reconnect()
        alt Backoff allows retry
            Main->>BlockSync: Graceful shutdown
            Main->>BittensorClient: Create new connection
            Main->>BlockSync: Initialize new BlockSync
            BittensorClient-->>Main: Return ReconnectionResult
            Main->>ReconnectState: mark_success()
            Note over Main: Resume event loop
        else Backoff in progress
            Main->>ReconnectState: mark_failure()
            Note over Main: Wait before next attempt
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A hop, a skip, through reconnect's grace,
Exponential backoff keeps pace,
Context bundles, state so clean,
Validator's heart runs lean,
No more crashes—blessed peace!

🚥 Pre-merge checks | ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and concisely identifies the main fix: proper cleanup of the old BlockSync during Bittensor reconnection, the central purpose of the changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



@echobt echobt merged commit 15077d5 into main Feb 3, 2026
5 of 6 checks passed
coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@bins/validator-node/src/main.rs`:
- Around line 98-112: The exponential pow call can overflow when self.failures
>= 64; update both should_attempt and current_backoff_secs to clamp the exponent
before calling pow (e.g. let exp = std::cmp::min(self.failures, 63u32); let
backoff_secs = std::cmp::min(10 * 2u64.pow(exp), 120);) so the pow never
receives >=64 and overflow is prevented; apply the same clamped-exp computation
in both should_attempt and current_backoff_secs (referencing the should_attempt
and current_backoff_secs functions).

Comment on lines +98 to +112

    /// Check if reconnection should be attempted based on exponential backoff.
    /// Backoff formula: min(10 * 2^failures, 120) seconds
    fn should_attempt(&self) -> bool {
        if !self.disconnected {
            return false;
        }
        let backoff_secs = std::cmp::min(10 * 2u64.pow(self.failures), 120);
        self.last_attempt.elapsed() > Duration::from_secs(backoff_secs)
    }

    /// Get the current backoff interval in seconds
    fn current_backoff_secs(&self) -> u64 {
        std::cmp::min(10 * 2u64.pow(self.failures), 120)
    }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Rust u64 pow overflow behavior debug release builds panic wrap

💡 Result:

u64::pow uses normal integer arithmetic, so its overflow behavior follows Rust’s integer-overflow rules:

  • Debug builds (overflow checks enabled by default): if self.pow(exp) overflows u64, it panics at runtime (same rule as *, +, etc.). The Rust Reference states that when debug assertions are enabled, implementations must insert dynamic checks that panic on overflow. [3]
  • Release builds (overflow checks disabled by default): on overflow it typically wraps modulo 2⁶⁴ (two’s-complement wrapping). The Rust Reference allows “silently wrapped values” in non-debug builds, but requires that wrapped results be well-defined two’s-complement results. [3]

How to control / make it explicit:

  • Always wrap: wrapping_pow (mod 2⁶⁴). [1]
  • Detect overflow: overflowing_pow returns (value, overflowed). [1]
  • Return None on overflow: checked_pow. [1]
  • Always panic on overflow (even in release): strict_pow. [1]
  • Enable panics-on-overflow in release: set Cargo profile overflow-checks = true (controls -C overflow-checks). [2]

Sources:
[1] Rust std docs for u64 (pow, checked_pow, overflowing_pow, wrapping_pow, strict_pow) https://doc.rust-lang.org/std/primitive.u64.html
[2] Cargo profiles: overflow-checks https://doc.rust-lang.org/cargo/reference/profiles.html
[3] Rust Reference: integer overflow behavior (panic with debug assertions; otherwise may wrap, but must be two’s-complement) https://doc.rust-lang.org/stable/reference/behavior-not-considered-unsafe.html


🏁 Script executed:

cat -n bins/validator-node/src/main.rs | sed -n '90,130p'

Repository: PlatformNetwork/platform

Length of output: 1546


Clamp exponential backoff exponent to prevent overflow.

Lines 104 and 110 use 2u64.pow(self.failures), which overflows when self.failures >= 64. Since self.failures grows via saturating_add, that state is reachable after 64 reconnection attempts. In debug builds the overflow panics; in release builds the value wraps, making the backoff unpredictable.

Since the backoff is capped at 120s, clamp the exponent before pow to avoid overflow:

🛠️ Proposed fix
-        let backoff_secs = std::cmp::min(10 * 2u64.pow(self.failures), 120);
+        let exp = self.failures.min(4); // 10 * 2^4 = 160 > 120 cap
+        let backoff_secs = std::cmp::min(10 * 2u64.pow(exp), 120);
@@
-        std::cmp::min(10 * 2u64.pow(self.failures), 120)
+        let exp = self.failures.min(4);
+        std::cmp::min(10 * 2u64.pow(exp), 120)
