
Improve TopologyService and HeartbeatService scalability for large clusters #17595

Draft
CRZbulabula wants to merge 3 commits into master from upgrade-heartbeat-service

Conversation

@CRZbulabula
Contributor

Summary

  • Adaptive topology probing interval: Probing interval now scales with DataNode count (max(base, base × N / refN)), preventing fixed-interval overhead from dominating large clusters. Configurable via topology_probing_base_interval_in_ms, topology_probing_reference_node_count, and topology_probing_timeout_ratio (interval formula sketched below this list).
  • √N sampling with batch rotation: Each probing cycle selects only ceil(√N) DataNodes as probers (instead of all N), rotating across cycles for full coverage. Reduces per-cycle RPC fan-out from O(N) to O(√N) and total connection tests from O(N²) to O(N√N) (rotation sketched below this list).
  • Independent topology push channel: Topology is no longer piggybacked on the 1-second heartbeat. TopologyService pushes updates directly to DataNodes via dedicated heartbeat RPC, only when a DataNode's reachable set changes. Eliminates O(N²) per-heartbeat payload and the compareAndSet-based update-loss issue.
  • Timeout protection for test connection RPCs: submitInternalTestConnectionTask and submitTestConnectionTask on DataNode now use bounded timeouts (dnConnectionTimeoutInMS) instead of unbounded CountDownLatch.await(). Prevents internal RPC threads from blocking indefinitely when target nodes are unreachable (bounded wait sketched below this list).
  • Isolated topology probing thread pool on DataNode: submitInternalTestConnectionTask handler offloads blocking work to a dedicated 2-thread TOPOLOGY_PROBING_EXECUTOR with Future.get(timeout), so the DataNodeInternalRPCService thread pool (shared with sendFragmentInstance, consensus, etc.) is no longer held hostage by slow probes (executor offload sketched below this list).
  • Increased heartbeat selector threads: AsyncDataNodeHeartbeatServiceClientPoolFactory and AsyncConfigNodeHeartbeatServiceClientPoolFactory now use a dedicated heartbeat_selector_num_of_client_manager config (default 4, up from 1), preventing the single NIO selector thread from becoming a serial bottleneck when processing N heartbeat responses.
  • Fixed collectPipeMetaList lock timeout: Changed from dnConnectionTimeoutInMS * 2/3 (≈ 40000 seconds due to a unit mismatch, effectively 11 hours) to 2 seconds. If the pipe meta lock cannot be acquired within 2s, the collection is skipped for this heartbeat cycle and retried in the next (every ~100 heartbeats); a tryLock sketch follows this list.
  • Fixed topology diff logging bugs: LoadCache.updateTopology() and ClusterTopology.updateTopology() both had copy-paste bugs where originReachable read from latestTopology instead of the old topology, making the diff comparison always equal and the change-log dead code (corrected diff sketched below this list).
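
A minimal sketch of the adaptive interval formula; only the config option names match the PR, the helper class and method are illustrative:

```java
// Minimal sketch of the adaptive probing interval: max(base, base * N / refN).
final class ProbingIntervalSketch {
  static long probingIntervalMs(
      long topologyProbingBaseIntervalInMs,
      int topologyProbingReferenceNodeCount,
      int dataNodeCount) {
    // Small clusters keep the base interval; beyond the reference size the
    // interval grows linearly with the DataNode count.
    return Math.max(
        topologyProbingBaseIntervalInMs,
        topologyProbingBaseIntervalInMs * dataNodeCount / topologyProbingReferenceNodeCount);
  }
}
```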
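A sketch of the ceil(√N) prober selection with batch rotation; the class, cursor field, and node-ID type are assumptions for illustration, not the PR's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative prober rotation: each cycle picks ceil(sqrt(N)) DataNodes,
// advancing a cursor so every node becomes a prober over successive cycles.
final class ProberRotationSketch {
  private int cursor = 0;

  List<Integer> nextProbers(List<Integer> dataNodeIds) {
    int n = dataNodeIds.size();
    if (n == 0) {
      return List.of();
    }
    int batchSize = (int) Math.ceil(Math.sqrt(n));
    List<Integer> probers = new ArrayList<>(batchSize);
    for (int i = 0; i < batchSize; i++) {
      probers.add(dataNodeIds.get((cursor + i) % n));
    }
    cursor = (cursor + batchSize) % n; // rotate the window for full coverage
    return probers;
  }
}
```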
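A sketch of the bounded wait from the timeout-protection bullet, assuming the test-connection task signals completion through a CountDownLatch; the wrapper is illustrative:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative bounded wait: instead of latch.await() (which can block forever
// if a target node never answers), wait at most the connection timeout and
// treat whatever has not completed as unreachable.
final class BoundedWaitSketch {
  boolean awaitProbeResults(CountDownLatch latch, long dnConnectionTimeoutInMS)
      throws InterruptedException {
    return latch.await(dnConnectionTimeoutInMS, TimeUnit.MILLISECONDS);
  }
}
```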
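A sketch of the dedicated probing executor; TOPOLOGY_PROBING_EXECUTOR and dnConnectionTimeoutInMS are named in the PR, while the handler shape around them is assumed:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative offload: the RPC handler hands the blocking probe to a small
// dedicated pool and bounds its wait, so the shared DataNodeInternalRPCService
// threads are never tied up by a slow or unreachable target.
final class ProbeOffloadSketch {
  private static final ExecutorService TOPOLOGY_PROBING_EXECUTOR =
      Executors.newFixedThreadPool(2);

  <T> T runProbeWithTimeout(Callable<T> blockingProbe, long dnConnectionTimeoutInMS)
      throws Exception {
    Future<T> future = TOPOLOGY_PROBING_EXECUTOR.submit(blockingProbe);
    try {
      return future.get(dnConnectionTimeoutInMS, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // abandon the probe instead of blocking the caller
      throw e;
    }
  }
}
```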
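A sketch of the 2-second bound on the pipe meta lock; the lock type and result type are placeholders, not the actual PipeDataNodeTaskAgent types:

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative bounded acquisition: if the pipe meta lock is contended for more
// than 2 seconds, skip collection for this heartbeat and retry next time.
final class PipeMetaCollectionSketch {
  private final ReentrantLock pipeMetaLock = new ReentrantLock();

  Optional<List<Object>> collectPipeMetaList() throws InterruptedException {
    if (!pipeMetaLock.tryLock(2, TimeUnit.SECONDS)) {
      return Optional.empty(); // skipped; the next collection cycle retries
    }
    try {
      return Optional.of(doCollect());
    } finally {
      pipeMetaLock.unlock();
    }
  }

  private List<Object> doCollect() {
    return List.of();
  }
}
```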
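A sketch of the corrected diff from the last bullet; the map-of-sets representation and log wording are simplifications, not what LoadCache.updateTopology and ClusterTopology.updateTopology actually hold:

```java
import java.util.Map;
import java.util.Set;

// Illustrative fix: read the origin reachable set from the *old* topology, not
// from latestTopology, so the comparison can actually detect changes.
final class TopologyDiffSketch {
  static void logDiff(
      Map<Integer, Set<Integer>> oldTopology, Map<Integer, Set<Integer>> latestTopology) {
    for (Map.Entry<Integer, Set<Integer>> entry : latestTopology.entrySet()) {
      Set<Integer> originReachable = oldTopology.getOrDefault(entry.getKey(), Set.of());
      Set<Integer> latestReachable = entry.getValue();
      if (!originReachable.equals(latestReachable)) {
        // Placeholder message; the real [Topology] log wording differs.
        System.out.printf(
            "[Topology] reachable DataNodes of DataNode %d changed: %s -> %s%n",
            entry.getKey(), originReachable, latestReachable);
      }
    }
  }
}
```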

Test plan

  • Verify TopologyService adaptive interval: deploy a 3-node cluster, confirm probing interval equals base_interval; add nodes beyond reference_node_count, confirm interval scales proportionally
  • Verify √N sampling: with 9+ DataNodes, confirm only ceil(√N) probers are selected per cycle via [Topology] log lines; confirm all nodes rotate through as probers over multiple cycles
  • Verify topology push: confirm the [Topology] latest view from config-node log appears in DataNode logs only when the reachable set changes, not on every heartbeat
  • Verify timeout protection: stop one DataNode, confirm other DataNodes' internal RPC threads are not blocked beyond dnConnectionTimeoutInMS
  • Verify collectPipeMetaList timeout: create a pipe, confirm heartbeat handler does not block for more than 2 seconds even under pipe lock contention
  • Verify topology diff logging: trigger a network partition, confirm [Topology] Topology of DataNode X is now unreachable to DataNode Y log entries appear correctly
  • Regression: run existing TopologyService and HeartbeatService integration tests

🤖 Generated with Claude Code


codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 17.73050% with 116 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.05%. Comparing base (f72b3ee) to head (b0e2df7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...nfignode/manager/load/service/TopologyService.java | 7.27% | 51 Missing ⚠️ |
| ...ol/thrift/impl/DataNodeInternalRPCServiceImpl.java | 17.24% | 24 Missing ⚠️ |
| ...che/iotdb/db/queryengine/plan/ClusterTopology.java | 4.16% | 23 Missing ⚠️ |
| ...he/iotdb/confignode/conf/ConfigNodeDescriptor.java | 0.00% | 8 Missing ⚠️ |
| ...apache/iotdb/confignode/conf/ConfigNodeConfig.java | 25.00% | 6 Missing ⚠️ |
| ...apache/iotdb/commons/client/ClientPoolFactory.java | 50.00% | 2 Missing ⚠️ |
| ...iotdb/confignode/manager/load/cache/LoadCache.java | 0.00% | 1 Missing ⚠️ |
| ...otdb/db/pipe/agent/task/PipeDataNodeTaskAgent.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #17595   +/-   ##
=========================================
  Coverage     40.05%   40.05%           
  Complexity     2554     2554           
=========================================
  Files          5176     5176           
  Lines        348528   348594   +66     
  Branches      44558    44557    -1     
=========================================
+ Hits         139595   139626   +31     
- Misses       208933   208968   +35     

☔ View full report in Codecov by Sentry.



sonarqubecloud Bot commented May 5, 2026

Quality Gate failed

Failed conditions
Reliability Rating on New Code: C (required ≥ A)

See analysis details on SonarQube Cloud

