Controller's cluster_nodes push silently drops when a data node is unreachable, leaving cluster meta stale forever #395

@thinker0

Summary

kvctl-server propagates cluster topology via a push to each data node, but
if the push fails (network blip, target restarting, etc.) it is not retried.
After a controller restart or any partial failure the data nodes end up with
different views of the cluster, and the cluster does not self-heal.

Observed

In an 18-shard cluster (replicas=3, 54 nodes) we ran CLUSTER NODES on
four representative masters at the same time:

Master      Visible nodes (master + slave)
shard 0     3    (a residual view from a 3-shard cluster created earlier)
shard 1     17
shard 9     33   (a residual view from a 27-shard cluster created in between)
shard 17    8
Expected    54

Every node reports cluster_state:ok and cluster_slots_ok:16384, so routing
formally works. However, every node holds a different cluster_nodes snapshot,
and clients that bootstrap from a master with a "small" view keep redirecting
traffic to a few hot masters.
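
A quick way to quantify this divergence is to count the entries each node returns for CLUSTER NODES and flag any node whose count differs from the expected total. The sketch below is illustrative only: it assumes the output has one node entry per line (the Redis-style format) and that the raw text has already been collected from each node. The helper names `count_visible_nodes` and `find_divergent`, and the sample data, are hypothetical.

```python
from typing import Dict


def count_visible_nodes(cluster_nodes_output: str) -> int:
    """Count non-empty lines, assuming one node entry per line."""
    return sum(1 for line in cluster_nodes_output.splitlines() if line.strip())


def find_divergent(outputs: Dict[str, str], expected: int) -> Dict[str, int]:
    """Return {node: visible_count} for nodes whose count differs from `expected`."""
    counts = {node: count_visible_nodes(text) for node, text in outputs.items()}
    return {node: n for node, n in counts.items() if n != expected}


# Hypothetical sample: three nodes, one of which only sees two members.
sample = {
    "node-1": "id1 host1 master\nid2 host2 slave\nid3 host3 slave\n",
    "node-2": "id1 host1 master\nid2 host2 slave\nid3 host3 slave\n",
    "node-3": "id1 host1 master\nid3 host3 slave\n",
}
divergent = find_divergent(sample, expected=3)
print(divergent)
```

Counting entries only catches views of the wrong size; comparing the sets of node IDs would also catch views that have the right size but the wrong members.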

The state never converges back. A rolling restart of all 3 manager instances
only partially refreshed the views. There is no operator-facing endpoint to
force a re-push: the paths we tried (/sync, /refresh, /reload, …) all
returned 404.

Expected

  • The controller retries the per-node push with backoff until each node
    acknowledges the new topology version, or
  • the controller exposes a POST /clusters/{name}/sync (or similar) endpoint
    that re-pushes the current cluster_nodes to every member.
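
The first option could look roughly like the sketch below. It is not the controller's actual code: `push_until_acked`, the `push` callback, and all parameter names and defaults are hypothetical. It assumes `push(node, version)` returns True once the node acknowledges the topology version and raises an OSError subclass (for example ConnectionError) when the node is unreachable. Attempts are bounded here so that the example terminates.

```python
import time
from typing import Callable, Dict, Iterable


def push_until_acked(
    nodes: Iterable[str],
    version: int,
    push: Callable[[str, int], bool],
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bool]:
    """Push `version` to every node, retrying unacknowledged nodes with
    exponential backoff. Returns {node: acknowledged}."""
    acked = {node: False for node in nodes}
    delay = base_delay
    for attempt in range(max_attempts):
        for node in list(acked):
            if acked[node]:
                continue  # already acknowledged, do not push again
            try:
                acked[node] = bool(push(node, version))
            except OSError:
                acked[node] = False  # unreachable; retry on the next round
        if all(acked.values()):
            break
        if attempt < max_attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, capped
    return acked
```

Nodes that are still unacknowledged after the last attempt are returned as False, so the caller can log them or hand them to a periodic reconcile loop instead of dropping the push silently.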

Repro (vanilla setup)

  1. Start kvctl-server v1.3.0 and 6 Apache Kvrocks 2.15.0 data nodes.
  2. Create a cluster with replicas=3 (2 shards × 3 nodes).
  3. Block inbound traffic to node 3 for ~10 s:
     iptables -A INPUT -p tcp --dport <node-3-port> -j DROP
  4. Remove the rule (iptables -D with the same arguments), wait, then run
     CLUSTER NODES on every node.

Node 3 has a stale cluster_nodes view; nothing converges back even
after several minutes.

Versions

  • kvctl-server v1.3.0
  • Apache Kvrocks 2.15.0
