fix: make cluster produce cluster_state:ok on bootstrap #339

Merged
kacy merged 2 commits into main from fix/cluster-bootstrap-state
Feb 27, 2026
kacy commented Feb 27, 2026

summary

make cluster was showing cluster_state:fail, cluster_slots_assigned:5461, and cluster_known_nodes:5 instead of the expected healthy state. three bugs were responsible.

bug a — bootstrap node never populates gossip.local_slots

GossipEngine initializes local_slots = []. When the bootstrap flag is set, ClusterState::single_node() assigns all 16384 slots into state.slot_map, but this was never reflected in gossip.local_slots. Every Welcome reply sent slots: [] to joining nodes, which then queued a SlotsChanged(node1, []) update. When that looped back to node 1, the handler cleared all 16384 slots from slot_map. The subsequent addslotsrange 0 5460 in the script then "succeeded" (slots were free) and left only 5461 assigned.

fix: populate gossip.local_slots from the bootstrap ClusterState immediately after construction, before the engine is placed behind a Mutex.
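A minimal sketch of fix a, assuming simplified stand-ins for the real types. The fields and helpers here (`slots_owned_by`, `welcome_slots`, `bootstrap_engine`, numeric node IDs) are hypothetical and only illustrate the ordering: seed `local_slots` from the bootstrap state, then wrap the engine in the `Mutex`.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

const SLOT_COUNT: u16 = 16384;

/// Hypothetical, simplified stand-in for the real ClusterState.
struct ClusterState {
    local_id: u64,
    slot_map: HashMap<u16, u64>, // slot -> owning node ID
}

impl ClusterState {
    /// Bootstrap: the single node owns every slot.
    fn single_node(local_id: u64) -> Self {
        let slot_map = (0..SLOT_COUNT).map(|slot| (slot, local_id)).collect();
        Self { local_id, slot_map }
    }

    /// Sorted list of the slots owned by `id` (hypothetical helper).
    fn slots_owned_by(&self, id: u64) -> Vec<u16> {
        let mut slots: Vec<u16> = self
            .slot_map
            .iter()
            .filter(|(_, owner)| **owner == id)
            .map(|(slot, _)| *slot)
            .collect();
        slots.sort_unstable();
        slots
    }
}

/// Hypothetical, simplified stand-in for the real GossipEngine.
struct GossipEngine {
    local_slots: Vec<u16>,
}

impl GossipEngine {
    fn new() -> Self {
        // Matches the behaviour described above: starts empty.
        Self { local_slots: Vec::new() }
    }

    /// The slot list a Welcome reply would advertise.
    fn welcome_slots(&self) -> &[u16] {
        &self.local_slots
    }
}

/// Seed local_slots from the bootstrap state *before* the engine
/// is shared behind a Mutex.
fn bootstrap_engine(state: &ClusterState) -> Mutex<GossipEngine> {
    let mut engine = GossipEngine::new();
    engine.local_slots = state.slots_owned_by(state.local_id);
    Mutex::new(engine)
}
```

With this ordering, the first Welcome reply already carries all 16384 slots instead of an empty list.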

bug b — race condition in cluster_meet leaves stale placeholders

cluster_meet was: (1) build gossip message, (2) send UDP, (3) insert placeholder. The gossip receive task can process the Welcome reply between steps 2 and 3, inserting the real node. Step 3 then adds the placeholder under a fake ID, leaving both entries in state.nodes. With two CLUSTER MEET calls, that is 3 real nodes plus 2 stale placeholders: 5 entries, hence cluster_known_nodes:5.

fix: insert the placeholder before sending the UDP packet. The gossip lock is already released at that point so there is no deadlock.
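The ordering problem can be modelled without any networking. In the sketch below the "send" is a closure that delivers the Welcome reply immediately, which is the worst-case interleaving. Everything here is hypothetical: the `State` and `Node` types, the `placeholder-<addr>` ID scheme, and in particular the assumption that the Welcome handler replaces a placeholder registered for the same address.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Node {
    addr: String,
    placeholder: bool,
}

/// Hypothetical, simplified node table keyed by node ID.
#[derive(Default)]
struct State {
    nodes: HashMap<String, Node>,
}

impl State {
    /// Register a temporary entry under a fake ID until the peer replies.
    fn insert_placeholder(&mut self, addr: &str) {
        let fake_id = format!("placeholder-{addr}");
        self.nodes.insert(
            fake_id,
            Node { addr: addr.to_string(), placeholder: true },
        );
    }

    /// Assumed Welcome handling: drop any placeholder for this address,
    /// then insert the real node under its real ID.
    fn on_welcome(&mut self, real_id: &str, addr: &str) {
        self.nodes.retain(|_, n| !(n.placeholder && n.addr == addr));
        self.nodes.insert(
            real_id.to_string(),
            Node { addr: addr.to_string(), placeholder: false },
        );
    }
}

/// Fixed order: placeholder first, then send.
fn meet_fixed(state: &mut State, addr: &str, send: impl FnOnce(&mut State)) {
    state.insert_placeholder(addr);
    send(state); // the reply may be handled immediately; placeholder is replaced
}

/// Old order: send first, then placeholder.
fn meet_racy(state: &mut State, addr: &str, send: impl FnOnce(&mut State)) {
    send(state); // the reply is handled before the placeholder exists
    state.insert_placeholder(addr); // stale entry is never cleaned up
}
```

Calling `meet_fixed` with a closure that invokes `on_welcome` leaves one entry per peer, while `meet_racy` with the same closure leaves two.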

bug c — bootstrap node's config_epoch mismatch

ClusterState::single_node() set config_epoch: 1 on the state but the ClusterNode itself defaulted to config_epoch: 0, causing cluster_my_epoch:0 vs cluster_current_epoch:1.

fix: set local_node.config_epoch = 1 inside single_node().
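As an illustration of fix c, here is a sketch with hypothetical, pared-down versions of `ClusterState` and `ClusterNode`. Only the `config_epoch` fields and `single_node()` come from the description above; `my_epoch`, `current_epoch`, and `BOOTSTRAP_EPOCH` are names invented for the example.

```rust
/// Epoch assigned to a freshly bootstrapped cluster (value taken from
/// the description above; the constant name is hypothetical).
const BOOTSTRAP_EPOCH: u64 = 1;

/// Hypothetical, simplified node record. Default gives config_epoch = 0,
/// which is the value that previously leaked into cluster_my_epoch.
#[derive(Debug, Default)]
struct ClusterNode {
    config_epoch: u64,
}

/// Hypothetical, simplified cluster state.
struct ClusterState {
    config_epoch: u64,
    local_node: ClusterNode,
}

impl ClusterState {
    fn single_node() -> Self {
        Self {
            config_epoch: BOOTSTRAP_EPOCH,
            // The fix: set the node's epoch explicitly rather than
            // relying on the default of 0.
            local_node: ClusterNode { config_epoch: BOOTSTRAP_EPOCH },
        }
    }

    /// What cluster_my_epoch would report in this model.
    fn my_epoch(&self) -> u64 {
        self.local_node.config_epoch
    }

    /// What cluster_current_epoch would report in this model.
    fn current_epoch(&self) -> u64 {
        self.config_epoch
    }
}
```

Deriving both values from one constant keeps the two epochs from drifting apart again.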

defensive guard: skip any external SlotsChanged event that refers to the local node — the node is authoritative for its own slot ownership.
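The guard could look roughly like the following. This is a sketch under assumptions: the `GossipEvent` enum shape, the `apply` method, its boolean return value, and numeric node IDs are all hypothetical and are not taken from the actual handler.

```rust
use std::collections::HashMap;

/// Hypothetical event type; only the SlotsChanged name comes from the PR.
enum GossipEvent {
    SlotsChanged { node: u64, slots: Vec<u16> },
}

/// Hypothetical, simplified cluster state.
struct State {
    local_id: u64,
    slot_map: HashMap<u16, u64>, // slot -> owning node ID
}

impl State {
    /// Apply an event received from a peer.
    /// Returns true if the slot map was updated, false if the event was skipped.
    fn apply(&mut self, event: GossipEvent) -> bool {
        match event {
            GossipEvent::SlotsChanged { node, slots } => {
                // Guard: the local node is authoritative for its own slots,
                // so gossip about them from the outside is ignored.
                if node == self.local_id {
                    return false;
                }
                // Replace the remote node's ownership with the advertised set.
                self.slot_map.retain(|_, owner| *owner != node);
                for slot in slots {
                    self.slot_map.insert(slot, node);
                }
                true
            }
        }
    }
}
```

With the guard in place, a looped-back `SlotsChanged` for the local node with an empty slot list leaves `slot_map` untouched, while updates about other nodes are still applied.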

script: removed the redundant addslotsrange 0 5460 on node-1 (bootstrap already owns those slots) and bumped the convergence sleep from 0.5s to 1.5s.

what was tested

  • cargo check -p ember-server -p ember-cluster passes cleanly
  • the three bug scenarios were traced through the code paths with the fixes applied
  • make cluster should now show cluster_state:ok, cluster_slots_assigned:16384, cluster_known_nodes:3
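The expected output above could be checked mechanically. The helper below is a hypothetical sketch, not part of the PR. It assumes the cluster info output is one `key:value` pair per line, as the three fields quoted here suggest, and hard-codes the expected values for the three-node `make cluster` setup.

```rust
use std::collections::HashMap;

/// Parse `key:value` lines into a map; lines without a colon are ignored.
fn parse_cluster_info(text: &str) -> HashMap<&str, &str> {
    text.lines()
        .filter_map(|line| line.trim().split_once(':'))
        .collect()
}

/// True when the three fields match the healthy state expected from
/// `make cluster` (three nodes, all slots assigned).
fn is_healthy(text: &str) -> bool {
    let info = parse_cluster_info(text);
    info.get("cluster_state") == Some(&"ok")
        && info.get("cluster_slots_assigned") == Some(&"16384")
        && info.get("cluster_known_nodes") == Some(&"3")
}
```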

design considerations

fixes a and b are the load-bearing changes. fix a eliminates the gossip feedback loop that was destroying the canonical slot assignment. fix b closes a genuine TOCTOU race that is hard to force deterministically in a test but shows up consistently under normal async scheduling. the defensive guard is belt-and-suspenders, and it adds no cost on the non-bootstrap path.

kacy added 2 commits February 26, 2026 22:33
three bugs caused `make cluster` to show `cluster_state:fail`,
`cluster_slots_assigned:5461`, and `cluster_known_nodes:5`.

fix a: after creating the bootstrap ClusterState, populate
gossip.local_slots with all 16384 owned slots. previously it was
left empty, so every Welcome reply advertised zero slots to joining
nodes, which then gossiped back a SlotsChanged(node1, []) event
that wiped the canonical slot assignment.

fix b: insert the placeholder node into state.nodes *before* sending
the UDP join packet. the gossip receive task can process the Welcome
reply before cluster_meet resumes, leaving both a real entry and a
stale placeholder in state.nodes (and therefore cluster_known_nodes:5
instead of 3).

fix c: set local_node.config_epoch = 1 inside single_node() so that
cluster_my_epoch matches cluster_current_epoch on the bootstrap node.

defensive: skip external SlotsChanged events for the local node — the
node is authoritative for its own slots and should never accept gossip
overrides of its own ownership.

script: remove the redundant addslotsrange 0 5460 on node-1 (bootstrap
already owns those slots), and bump the convergence sleep from 0.5s to
1.5s so gossip has time to propagate before cluster info is printed.
kacy merged commit 2dd9e65 into main Feb 27, 2026
7 of 8 checks passed
kacy deleted the fix/cluster-bootstrap-state branch February 27, 2026 04:25