fix: keep chash ring stable during health changes#13532
Open
nic-6443 wants to merge 4 commits into
Open
Conversation
There was a problem hiding this comment.
Pull request overview
This PR changes chash balancing to keep the hash ring topology stable across upstream/instance health transitions by building the ring from the configured nodes/instances and applying health filtering at pick time (falling through the ring via the existing “tried” mechanism when an unhealthy target is selected). This reduces unnecessary key movement for keys that were already mapped to healthy backends, especially with uneven weights.
Changes:
- Upstream chash: build the picker from all configured nodes and skip unhealthy nodes during selection instead of rebuilding the ring from only healthy nodes.
- ai-proxy-multi chash: build the picker from all configured instances and skip unhealthy instances during selection; non-chash algorithms keep using healthy-only pickers.
- Add regression tests covering stable-ring behavior for both upstream chash and ai-proxy-multi chash.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
apisix/balancer.lua |
Keeps chash ring stable by building pickers from all nodes and filtering health during pick. |
apisix/plugins/ai-proxy-multi.lua |
Keeps ai-proxy-multi chash ring stable by building pickers from all instances and filtering health during pick. |
t/node/chash-healthcheck-stable-ring.t |
Regression test ensuring healthy-key mappings don’t move when health changes for upstream chash. |
t/plugin/ai-proxy-multi-chash-healthcheck-stable-ring.t |
Regression test ensuring healthy-key mappings don’t move when health changes for ai-proxy-multi chash. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Chash previously rebuilt the ring from only healthy nodes whenever health status changed. With uneven weights, that changes the ring topology and can move keys that were already mapped to healthy nodes.
This keeps chash rings tied to config and weight changes, then checks health at pick time. The health snapshot is cached by checker status version, so it only recomputes when health state changes. If the selected node or AI instance is unhealthy, it is marked as tried and the existing chash next path selects another physical target. Non-chash algorithms still build pickers from the healthy subset. If healthcheck targets are not registered yet after DNS or checker changes, their status is treated as unknown instead of unhealthy. ai-proxy-multi keeps its existing exhaustion error string for compatibility.
The same behavior is applied to upstream chash balancing and ai-proxy-multi chash balancing, with regression coverage for both paths.