fix(bittensor): improve reconnection by properly stopping old BlockSync #33

Merged
echobt merged 1 commit into main from fix/bittensor-reconnection-cleanup
Feb 3, 2026

Conversation

echobt (Contributor) commented Feb 3, 2026

Summary

When Bittensor RPC connection errors occur (e.g., 'restart required'), the reconnection logic now properly stops the old BlockSync before creating a new one. This prevents the old internal task from continuing to emit errors.

Problem

When the error Subxt error: RPC error: ... connection closed; restart required occurred, it would spam the logs every 5 seconds because:

  1. The old BlockSync's internal task wasn't being properly stopped
  2. We were only aborting the wrapper JoinHandle, not the internal event processing task
  3. The old listener continued running and emitting ConnectionError events

Solution

  • Track the BlockSync instance directly instead of just a spawned task handle
  • Call sync.stop() before creating a new connection to properly clean up internal tasks
  • Remove the unnecessary wrapper task around sync.start()
  • Maintain exponential backoff for reconnection attempts

Testing

  • Code compiles successfully (cargo check)
  • No clippy warnings (cargo clippy)
  • Code formatted (cargo fmt)

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved Bittensor reconnection handling with exponential backoff to prevent rapid reconnection attempts.
    • Enhanced critical error detection for connection failures.
    • More stable block event processing and weight submission handling.
  • Refactor

    • Streamlined internal state management and function signatures for improved code maintainability.

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

The PR refactors the validator-node's core initialization and event-handling logic by introducing context structs to bundle related parameters, adding exponential backoff-based reconnection state management, and restructuring the main loop to gracefully handle Bittensor disconnections and block events through formalized context objects and state transitions.

Changes

All changes are in bins/validator-node/src/main.rs:

  • New Context and State Types: Introduces WeightSubmissionContext and BlockEventContext structs to group related parameters; adds a ReconnectionState struct to manage exponential backoff with methods for state transitions; adds ReconnectionResult to encapsulate post-reconnection artifacts (client, BlockSync, event receiver).
  • Reconnection and Error Handling: Adds an is_critical_bittensor_error function to detect fatal connection errors; introduces an attempt_bittensor_reconnect async function to gracefully shut down the old BlockSync, create a new client/BlockSync, and return the reconnection result.
  • Function Refactoring: Refactors submit_weights_for_epoch to accept WeightSubmissionContext instead of multiple separate parameters; refactors handle_block_event to accept BlockEventContext instead of a raw parameter list.
  • Main Loop Restructuring: Refactors main initialization and the event-loop flow to incorporate block_sync, block_rx, and reconnect_state with exponential backoff logic; replaces ad-hoc disconnection handling with structured retry logic and immediate reconnection attempts for critical errors when permitted by the backoff state.

Sequence Diagram

sequenceDiagram
    participant Main as Main Loop
    participant BlockSync as BlockSync
    participant BittensorClient as BittensorClient
    participant ReconnectState as ReconnectionState

    Main->>BlockSync: Receive BlockSyncEvent
    alt Event received
        Main->>Main: Handle event with BlockEventContext
        Note over Main: Process block/challenge data
    else Error detected
        Main->>ReconnectState: Check should_attempt_reconnect()
        alt Backoff allows retry
            Main->>BlockSync: Graceful shutdown
            Main->>BittensorClient: Create new connection
            Main->>BlockSync: Initialize new BlockSync
            BittensorClient-->>Main: Return ReconnectionResult
            Main->>ReconnectState: mark_success()
            Note over Main: Resume event loop
        else Backoff in progress
            Main->>ReconnectState: mark_failure()
            Note over Main: Wait before next attempt
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A hop, a skip, through reconnect's grace,
Exponential backoff keeps pace,
Context bundles, state so clean,
Validator's heart runs lean,
No more crashes—blessed peace!

🚥 Pre-merge checks | ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and concisely identifies the main fix: proper cleanup of the old BlockSync during Bittensor reconnection, the central purpose of the changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



@echobt echobt merged commit 15077d5 into main Feb 3, 2026
5 of 6 checks passed
coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@bins/validator-node/src/main.rs`:
- Around line 98-112: The exponential pow call can overflow when self.failures
>= 64; update both should_attempt and current_backoff_secs to clamp the exponent
before calling pow (e.g. let exp = std::cmp::min(self.failures, 63u32); let
backoff_secs = std::cmp::min(10 * 2u64.pow(exp), 120);) so the pow never
receives >=64 and overflow is prevented; apply the same clamped-exp computation
in both should_attempt and current_backoff_secs (referencing the should_attempt
and current_backoff_secs functions).

Comment on lines +98 to +112

    /// Check if reconnection should be attempted based on exponential backoff.
    /// Backoff formula: min(10 * 2^failures, 120) seconds
    fn should_attempt(&self) -> bool {
        if !self.disconnected {
            return false;
        }
        let backoff_secs = std::cmp::min(10 * 2u64.pow(self.failures), 120);
        self.last_attempt.elapsed() > Duration::from_secs(backoff_secs)
    }

    /// Get the current backoff interval in seconds
    fn current_backoff_secs(&self) -> u64 {
        std::cmp::min(10 * 2u64.pow(self.failures), 120)
    }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Rust u64 pow overflow behavior debug release builds panic wrap

💡 Result:

u64::pow uses normal integer arithmetic, so its overflow behavior follows Rust’s integer-overflow rules:

  • Debug builds (overflow checks enabled by default): if self.pow(exp) overflows u64, it panics at runtime (same rule as *, +, etc.). The Rust Reference states that when debug assertions are enabled, implementations must insert dynamic checks that panic on overflow. [3]
  • Release builds (overflow checks disabled by default): on overflow it typically wraps modulo 2⁶⁴ (two’s-complement wrapping). The Rust Reference allows “silently wrapped values” in non-debug builds, but requires that wrapped results be well-defined two’s-complement results. [3]

How to control / make it explicit:

  • Always wrap: wrapping_pow (mod 2⁶⁴). [1]
  • Detect overflow: overflowing_pow returns (value, overflowed). [1]
  • Return None on overflow: checked_pow. [1]
  • Always panic on overflow (even in release): strict_pow. [1]
  • Enable panics-on-overflow in release: set Cargo profile overflow-checks = true (controls -C overflow-checks). [2]

Sources:
[1] Rust std docs for u64 (pow, checked_pow, overflowing_pow, wrapping_pow, strict_pow) https://doc.rust-lang.org/std/primitive.u64.html
[2] Cargo profiles: overflow-checks https://doc.rust-lang.org/cargo/reference/profiles.html
[3] Rust Reference: integer overflow behavior (panic with debug assertions; otherwise may wrap, but must be two’s-complement) https://doc.rust-lang.org/stable/reference/behavior-not-considered-unsafe.html


🏁 Script executed:

cat -n bins/validator-node/src/main.rs | sed -n '90,130p'

Repository: PlatformNetwork/platform

Length of output: 1546


Clamp exponential backoff exponent to prevent overflow.

Lines 104 and 110 use 2u64.pow(self.failures), which overflows when self.failures >= 64. Since self.failures grows via saturating_add, that state is reachable after 64 reconnection attempts. In debug builds the overflow panics; in release builds the value wraps, making the backoff unpredictable.

Since the backoff is capped at 120s, clamp the exponent before pow to avoid overflow:

🛠️ Proposed fix
-        let backoff_secs = std::cmp::min(10 * 2u64.pow(self.failures), 120);
+        let exp = self.failures.min(4); // 10 * 2^4 = 160 > 120 cap
+        let backoff_secs = std::cmp::min(10 * 2u64.pow(exp), 120);
@@
-        std::cmp::min(10 * 2u64.pow(self.failures), 120)
+        let exp = self.failures.min(4);
+        std::cmp::min(10 * 2u64.pow(exp), 120)
