
Improve TopologyService and HeartbeatService scalability for large clusters #17595

Draft
CRZbulabula wants to merge 3 commits into master from upgrade-heartbeat-service

Conversation

@CRZbulabula
Contributor

Summary

  • Adaptive topology probing interval: Probing interval now scales with DataNode count (max(base, base × N / refN)), preventing fixed-interval overhead from dominating large clusters. Configurable via topology_probing_base_interval_in_ms, topology_probing_reference_node_count, and topology_probing_timeout_ratio (interval formula sketched below this list).
  • √N sampling with batch rotation: Each probing cycle selects only ceil(√N) DataNodes as probers (instead of all N), rotating across cycles for full coverage. Reduces per-cycle RPC fan-out from O(N) to O(√N) and total connection tests from O(N²) to O(N√N) (rotation sketched below this list).
  • Independent topology push channel: Topology is no longer piggybacked on the 1-second heartbeat. TopologyService pushes updates directly to DataNodes via dedicated heartbeat RPC, only when a DataNode's reachable set changes. Eliminates O(N²) per-heartbeat payload and the compareAndSet-based update-loss issue.
  • Timeout protection for test connection RPCs: submitInternalTestConnectionTask and submitTestConnectionTask on DataNode now use bounded timeouts (dnConnectionTimeoutInMS) instead of unbounded CountDownLatch.await(). Prevents internal RPC threads from blocking indefinitely when target nodes are unreachable (bounded wait sketched below this list).
  • Isolated topology probing thread pool on DataNode: submitInternalTestConnectionTask handler offloads blocking work to a dedicated 2-thread TOPOLOGY_PROBING_EXECUTOR with Future.get(timeout), so the DataNodeInternalRPCService thread pool (shared with sendFragmentInstance, consensus, etc.) is no longer held hostage by slow probes (executor offload sketched below this list).
  • Increased heartbeat selector threads: AsyncDataNodeHeartbeatServiceClientPoolFactory and AsyncConfigNodeHeartbeatServiceClientPoolFactory now use a dedicated heartbeat_selector_num_of_client_manager config (default 4, up from 1), preventing the single NIO selector thread from becoming a serial bottleneck when processing N heartbeat responses.
  • Fixed collectPipeMetaList lock timeout: Changed from dnConnectionTimeoutInMS * 2/3 (≈ 40000 seconds due to a unit mismatch, effectively 11 hours) to 2 seconds. If the pipe meta lock cannot be acquired within 2s, the collection is skipped for this heartbeat cycle and retried in the next (every ~100 heartbeats); a tryLock sketch follows this list.
  • Fixed topology diff logging bugs: LoadCache.updateTopology() and ClusterTopology.updateTopology() both had copy-paste bugs where originReachable read from latestTopology instead of the old topology, making the diff comparison always equal and the change-log dead code (corrected diff sketched below this list).
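
A minimal sketch of the adaptive interval formula; only the config option names match the PR, the helper class and method are illustrative:

```java
// Minimal sketch of the adaptive probing interval: max(base, base * N / refN).
final class ProbingIntervalSketch {
  static long probingIntervalMs(
      long topologyProbingBaseIntervalInMs,
      int topologyProbingReferenceNodeCount,
      int dataNodeCount) {
    // Small clusters keep the base interval; beyond the reference size the
    // interval grows linearly with the DataNode count.
    return Math.max(
        topologyProbingBaseIntervalInMs,
        topologyProbingBaseIntervalInMs * dataNodeCount / topologyProbingReferenceNodeCount);
  }
}
```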
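A sketch of the ceil(√N) prober selection with batch rotation; the class, cursor field, and node-ID type are assumptions for illustration, not the PR's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative prober rotation: each cycle picks ceil(sqrt(N)) DataNodes,
// advancing a cursor so every node becomes a prober over successive cycles.
final class ProberRotationSketch {
  private int cursor = 0;

  List<Integer> nextProbers(List<Integer> dataNodeIds) {
    int n = dataNodeIds.size();
    if (n == 0) {
      return List.of();
    }
    int batchSize = (int) Math.ceil(Math.sqrt(n));
    List<Integer> probers = new ArrayList<>(batchSize);
    for (int i = 0; i < batchSize; i++) {
      probers.add(dataNodeIds.get((cursor + i) % n));
    }
    cursor = (cursor + batchSize) % n; // rotate the window for full coverage
    return probers;
  }
}
```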
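A sketch of the bounded wait from the timeout-protection bullet, assuming the test-connection task signals completion through a CountDownLatch; the wrapper is illustrative:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative bounded wait: instead of latch.await() (which can block forever
// if a target node never answers), wait at most the connection timeout and
// treat whatever has not completed as unreachable.
final class BoundedWaitSketch {
  boolean awaitProbeResults(CountDownLatch latch, long dnConnectionTimeoutInMS)
      throws InterruptedException {
    return latch.await(dnConnectionTimeoutInMS, TimeUnit.MILLISECONDS);
  }
}
```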
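A sketch of the dedicated probing executor; TOPOLOGY_PROBING_EXECUTOR and dnConnectionTimeoutInMS are named in the PR, while the handler shape around them is assumed:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative offload: the RPC handler hands the blocking probe to a small
// dedicated pool and bounds its wait, so the shared DataNodeInternalRPCService
// threads are never tied up by a slow or unreachable target.
final class ProbeOffloadSketch {
  private static final ExecutorService TOPOLOGY_PROBING_EXECUTOR =
      Executors.newFixedThreadPool(2);

  <T> T runProbeWithTimeout(Callable<T> blockingProbe, long dnConnectionTimeoutInMS)
      throws Exception {
    Future<T> future = TOPOLOGY_PROBING_EXECUTOR.submit(blockingProbe);
    try {
      return future.get(dnConnectionTimeoutInMS, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // abandon the probe instead of blocking the caller
      throw e;
    }
  }
}
```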
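A sketch of the 2-second bound on the pipe meta lock; the lock type and result type are placeholders, not the actual PipeDataNodeTaskAgent types:

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative bounded acquisition: if the pipe meta lock is contended for more
// than 2 seconds, skip collection for this heartbeat and retry next time.
final class PipeMetaCollectionSketch {
  private final ReentrantLock pipeMetaLock = new ReentrantLock();

  Optional<List<Object>> collectPipeMetaList() throws InterruptedException {
    if (!pipeMetaLock.tryLock(2, TimeUnit.SECONDS)) {
      return Optional.empty(); // skipped; the next collection cycle retries
    }
    try {
      return Optional.of(doCollect());
    } finally {
      pipeMetaLock.unlock();
    }
  }

  private List<Object> doCollect() {
    return List.of();
  }
}
```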
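A sketch of the corrected diff from the last bullet; the map-of-sets representation and log wording are simplifications, not what LoadCache.updateTopology and ClusterTopology.updateTopology actually hold:

```java
import java.util.Map;
import java.util.Set;

// Illustrative fix: read the origin reachable set from the *old* topology, not
// from latestTopology, so the comparison can actually detect changes.
final class TopologyDiffSketch {
  static void logDiff(
      Map<Integer, Set<Integer>> oldTopology, Map<Integer, Set<Integer>> latestTopology) {
    for (Map.Entry<Integer, Set<Integer>> entry : latestTopology.entrySet()) {
      Set<Integer> originReachable = oldTopology.getOrDefault(entry.getKey(), Set.of());
      Set<Integer> latestReachable = entry.getValue();
      if (!originReachable.equals(latestReachable)) {
        // Placeholder message; the real [Topology] log wording differs.
        System.out.printf(
            "[Topology] reachable DataNodes of DataNode %d changed: %s -> %s%n",
            entry.getKey(), originReachable, latestReachable);
      }
    }
  }
}
```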

Test plan

  • Verify TopologyService adaptive interval: deploy a 3-node cluster, confirm probing interval equals base_interval; add nodes beyond reference_node_count, confirm interval scales proportionally
  • Verify √N sampling: with 9+ DataNodes, confirm only ceil(√N) probers are selected per cycle via [Topology] log lines; confirm all nodes rotate through as probers over multiple cycles
  • Verify topology push: confirm the [Topology] latest view from config-node log appears in DataNode logs only when the reachable set changes, not on every heartbeat
  • Verify timeout protection: stop one DataNode, confirm other DataNodes' internal RPC threads are not blocked beyond dnConnectionTimeoutInMS
  • Verify collectPipeMetaList timeout: create a pipe, confirm heartbeat handler does not block for more than 2 seconds even under pipe lock contention
  • Verify topology diff logging: trigger a network partition, confirm [Topology] Topology of DataNode X is now unreachable to DataNode Y log entries appear correctly
  • Regression: run existing TopologyService and HeartbeatService integration tests

🤖 Generated with Claude Code


codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 17.73050% with 116 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.05%. Comparing base (f72b3ee) to head (b0e2df7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...nfignode/manager/load/service/TopologyService.java | 7.27% | 51 Missing ⚠️ |
| ...ol/thrift/impl/DataNodeInternalRPCServiceImpl.java | 17.24% | 24 Missing ⚠️ |
| ...che/iotdb/db/queryengine/plan/ClusterTopology.java | 4.16% | 23 Missing ⚠️ |
| ...he/iotdb/confignode/conf/ConfigNodeDescriptor.java | 0.00% | 8 Missing ⚠️ |
| ...apache/iotdb/confignode/conf/ConfigNodeConfig.java | 25.00% | 6 Missing ⚠️ |
| ...apache/iotdb/commons/client/ClientPoolFactory.java | 50.00% | 2 Missing ⚠️ |
| ...iotdb/confignode/manager/load/cache/LoadCache.java | 0.00% | 1 Missing ⚠️ |
| ...otdb/db/pipe/agent/task/PipeDataNodeTaskAgent.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #17595   +/-   ##
=========================================
  Coverage     40.05%   40.05%           
  Complexity     2554     2554           
=========================================
  Files          5176     5176           
  Lines        348528   348594   +66     
  Branches      44558    44557    -1     
=========================================
+ Hits         139595   139626   +31     
- Misses       208933   208968   +35     

☔ View full report in Codecov by Sentry.



sonarqubecloud Bot commented May 5, 2026

Quality Gate failed

Failed conditions
Reliability Rating on New Code: C (required ≥ A)

See analysis details on SonarQube Cloud

