fix(bittensor): improve reconnection by properly stopping old BlockSync #33
Conversation
When Bittensor RPC connection errors occur (e.g., 'restart required'), the reconnection logic now properly stops the old BlockSync before creating a new one. This prevents the old internal task from continuing to emit errors.

Key changes:
- Track the BlockSync instance instead of just the spawned task handle
- Call `sync.stop()` before creating a new connection to clean up internal tasks
- Remove the unnecessary wrapper task around `sync.start()`
- Maintain exponential backoff for reconnection attempts

This resolves the issue where connection errors would spam the logs every 5 seconds due to the old BlockSync task not being properly terminated.
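The shape of that fix, reduced to its essentials, looks roughly like the sketch below. The `BlockSync` stand-in and the `reconnect` helper are illustrative assumptions, not the crate's actual API; only the stop-before-recreate ordering and the capped exponential backoff come from the description above.

```rust
use tokio::time::{sleep, Duration};

// Hypothetical stand-in for the real validator-node type.
struct BlockSync;

impl BlockSync {
    async fn stop(&self) {
        // In the real code this aborts BlockSync's internal task so it
        // stops emitting errors for the dead connection.
    }

    async fn start() -> Result<BlockSync, String> {
        // Stands in for creating a new client and BlockSync.
        Ok(BlockSync)
    }
}

async fn reconnect(old: BlockSync, mut backoff: Duration) -> BlockSync {
    // Stop the old BlockSync *before* creating the replacement.
    old.stop().await;
    loop {
        match BlockSync::start().await {
            Ok(sync) => return sync,
            Err(_) => {
                // Exponential backoff, capped at 120s as in the PR.
                sleep(backoff).await;
                backoff = (backoff * 2).min(Duration::from_secs(120));
            }
        }
    }
}
```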
📝 Walkthrough

The PR refactors the validator-node's core initialization and event-handling logic by introducing context structs to bundle related parameters, adding exponential-backoff-based reconnection state management, and restructuring the main loop to gracefully handle Bittensor disconnections and block events through formalized context objects and state transitions.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Main as Main Loop
    participant BlockSync as BlockSync
    participant BittensorClient as BittensorClient
    participant ReconnectState as ReconnectionState

    Main->>BlockSync: Receive BlockSyncEvent
    alt Event received
        Main->>Main: Handle event with BlockEventContext
        Note over Main: Process block/challenge data
    else Error detected
        Main->>ReconnectState: Check should_attempt_reconnect()
        alt Backoff allows retry
            Main->>BlockSync: Graceful shutdown
            Main->>BittensorClient: Create new connection
            Main->>BlockSync: Initialize new BlockSync
            BittensorClient-->>Main: Return ReconnectionResult
            Main->>ReconnectState: mark_success()
            Note over Main: Resume event loop
        else Backoff in progress
            Main->>ReconnectState: mark_failure()
            Note over Main: Wait before next attempt
        end
    end
```
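The diagram references `mark_success()` and `mark_failure()` transitions. A minimal sketch of what that state struct might look like: the `disconnected`, `failures`, and `last_attempt` fields appear in the reviewed code further down, but the method bodies here are assumptions about how the transitions work.

```rust
use std::time::Instant;

/// Sketch of the reconnection state the diagram refers to. Field names
/// match the reviewed code below; method bodies are assumptions.
struct ReconnectionState {
    disconnected: bool,
    failures: u32,
    last_attempt: Instant,
}

impl ReconnectionState {
    /// After a successful reconnect: clear the flag and reset the backoff.
    fn mark_success(&mut self) {
        self.disconnected = false;
        self.failures = 0;
    }

    /// After a failed attempt: saturating_add matches the review's note
    /// that `failures` keeps growing without wrapping.
    fn mark_failure(&mut self) {
        self.disconnected = true;
        self.failures = self.failures.saturating_add(1);
        self.last_attempt = Instant::now();
    }
}
```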
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
```rust
/// Check if reconnection should be attempted based on exponential backoff.
/// Backoff formula: min(10 * 2^failures, 120) seconds
fn should_attempt(&self) -> bool {
    if !self.disconnected {
        return false;
    }
    let backoff_secs = std::cmp::min(10 * 2u64.pow(self.failures), 120);
    self.last_attempt.elapsed() > Duration::from_secs(backoff_secs)
}

/// Get the current backoff interval in seconds
fn current_backoff_secs(&self) -> u64 {
    std::cmp::min(10 * 2u64.pow(self.failures), 120)
}
```
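For concreteness, the formula yields 10s, 20s, 40s, and 80s for `failures` = 0 through 3, and hits the 120s cap from `failures` = 4 onward (10 · 2⁴ = 160 > 120), so exponents above 4 never change the result.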
🧩 Analysis chain
🌐 Web query: `Rust u64 pow overflow behavior debug release builds panic wrap`

💡 Result:
`u64::pow(exp)` uses normal integer arithmetic, so its overflow behavior follows Rust's integer-overflow rules:

- Debug builds (overflow checks enabled by default): if `self.pow(exp)` overflows `u64`, it panics at runtime (same rule as `*`, `+`, etc.). The Rust Reference states that when debug assertions are enabled, implementations must insert dynamic checks that panic on overflow. [3]
- Release builds (overflow checks disabled by default): on overflow it typically wraps modulo 2⁶⁴ (two's-complement wrapping). The Rust Reference allows "silently wrapped values" in non-debug builds, but requires that wrapped results be well-defined two's-complement results. [3]

How to control / make it explicit:

- Always wrap: `wrapping_pow` (mod 2⁶⁴). [1]
- Detect overflow: `overflowing_pow` returns `(value, overflowed)`. [1]
- Return `None` on overflow: `checked_pow`. [1]
- Always panic on overflow (even in release): `strict_pow`. [1]
- Enable panics-on-overflow in release: set the Cargo profile option `overflow-checks = true` (controls `-C overflow-checks`). [2]
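As a quick illustration of the non-panicking variants listed above (plain standard-library Rust, nothing project-specific):

```rust
fn main() {
    let base: u64 = 2;

    // checked_pow returns None instead of overflowing.
    assert_eq!(base.checked_pow(63), Some(1u64 << 63));
    assert_eq!(base.checked_pow(64), None);

    // overflowing_pow returns the wrapped value plus an overflow flag.
    let (wrapped, overflowed) = base.overflowing_pow(64);
    assert_eq!((wrapped, overflowed), (0, true));

    // wrapping_pow always wraps modulo 2^64.
    assert_eq!(base.wrapping_pow(64), 0);
}
```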
Sources:
[1] Rust std docs for u64 (`pow`, `checked_pow`, `overflowing_pow`, `wrapping_pow`, `strict_pow`): https://doc.rust-lang.org/std/primitive.u64.html
[2] Cargo profiles: overflow-checks: https://doc.rust-lang.org/cargo/reference/profiles.html
[3] Rust Reference: integer overflow behavior (panic with debug assertions; otherwise may wrap, but must be two's-complement): https://doc.rust-lang.org/stable/reference/behavior-not-considered-unsafe.html
🏁 Script executed:

```shell
cat -n bins/validator-node/src/main.rs | sed -n '90,130p'
```

Repository: PlatformNetwork/platform
Length of output: 1546
Clamp the exponential backoff exponent to prevent overflow.

Lines 104 and 110 use `2u64.pow(self.failures)`, which overflows when `self.failures >= 64`. Since `self.failures` grows via `saturating_add`, this is reachable after 64 reconnection attempts. In debug builds the overflow panics; in release builds the value wraps unpredictably, making the backoff unreliable.

Since the backoff is capped at 120s, clamp the exponent before calling `pow`:
🛠️ Proposed fix

```diff
-        let backoff_secs = std::cmp::min(10 * 2u64.pow(self.failures), 120);
+        let exp = self.failures.min(4); // 10 * 2^4 = 160 > 120 cap
+        let backoff_secs = std::cmp::min(10 * 2u64.pow(exp), 120);
@@
-        std::cmp::min(10 * 2u64.pow(self.failures), 120)
+        let exp = self.failures.min(4);
+        std::cmp::min(10 * 2u64.pow(exp), 120)
```

🤖 Prompt for AI Agents

```text
In bins/validator-node/src/main.rs around lines 98-112: the exponential pow
call can overflow when self.failures >= 64; update both should_attempt and
current_backoff_secs to clamp the exponent before calling pow (e.g. let exp =
std::cmp::min(self.failures, 63u32); let backoff_secs = std::cmp::min(10 *
2u64.pow(exp), 120);) so the pow never receives >= 64 and overflow is
prevented; apply the same clamped-exp computation in both should_attempt and
current_backoff_secs.
```
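A minimal test of the clamped computation could look like the following; `backoff_secs` here is a hypothetical free-function restatement of the suggested fix, not the crate's actual method.

```rust
/// Free-function restatement of the suggested fix (hypothetical helper).
fn backoff_secs(failures: u32) -> u64 {
    let exp = failures.min(4); // 10 * 2^4 = 160 > 120 cap
    std::cmp::min(10 * 2u64.pow(exp), 120)
}

#[test]
fn backoff_caps_at_120_without_overflow() {
    assert_eq!(backoff_secs(0), 10);
    assert_eq!(backoff_secs(1), 20);
    assert_eq!(backoff_secs(3), 80);
    assert_eq!(backoff_secs(4), 120);
    // These would overflow pow (panicking in debug) without the clamp.
    assert_eq!(backoff_secs(64), 120);
    assert_eq!(backoff_secs(u32::MAX), 120);
}
```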
Summary
When Bittensor RPC connection errors occur (e.g., 'restart required'), the reconnection logic now properly stops the old BlockSync before creating a new one. This prevents the old internal task from continuing to emit errors.
Problem

When the error `Subxt error: RPC error: ... connection closed; restart required` occurred, it would spam the logs every 5 seconds because the old BlockSync's internal task was never stopped and kept emitting errors for the dead connection.

Solution
- Call `sync.stop()` before creating a new connection to properly clean up internal tasks
- Remove the unnecessary wrapper task around `sync.start()`

Testing
- `cargo check`
- `cargo clippy`
- `cargo fmt`

Summary by CodeRabbit
Release Notes

Bug Fixes
- Bittensor reconnection now stops the old BlockSync before creating a new one, so stale connection errors no longer spam the logs.

Refactor
- Validator-node initialization and event handling now use context structs and formalized reconnection state.