refactor: [Scale from Zero] Introduce PodLocator #1950
Conversation
This PR introduces the PodLocator interface, which defines the contract for resolving candidate pods based on request metadata, decoupling the resolution logic from the storage layer.
Introduces DatastorePodLocator and a caching decorator. This reduces contention on the Datastore RWMutex during high-throughput dispatch cycles by caching subset resolution results for a short TTL.
Refactors the Director to use the injected PodLocator interface instead of the private getCandidatePodsForScheduling method. This prepares the Director for lazy resolution without changing current behavior.
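To make the shape of the refactor concrete, here is a minimal sketch of what the contract might look like. The method name `Locate` comes from the PR description; the signature, the `Pod` stand-in type, and the metadata parameter are assumptions for illustration:

```go
package contracts

import "context"

// Pod is a stand-in for the project's actual pod/endpoint metrics type.
type Pod struct {
	NamespacedName string
	Address        string
}

// PodLocator resolves the candidate pods eligible to serve a request, based
// on request metadata such as Envoy subset filters. Implementations may query
// the Datastore directly or serve results from a short-TTL cache.
type PodLocator interface {
	// Locate is named per the PR description; the parameters and return type
	// here are assumptions.
	Locate(ctx context.Context, requestMetadata map[string]any) []Pod
}
```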
Hi @LukeAVanDrie. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I am defining this interface within the Flow Control contracts package as it will serve as the primary contract for lazy candidate resolution during the dispatch cycle.
Crucially, this abstraction decouples the core request lifecycle from specific upstream filtering logic. While the current implementation handles Envoy subsetting, isolating this behavior behind an interface paves the way for promoting it to an EPP Extension Point. This would allow adopters to inject environment-specific or vendor-customized discovery mechanisms in the future without polluting the core directory.
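As a rough illustration of that decoupling (the locator, constructor, and option names below are all hypothetical), an adopter could swap in a custom discovery mechanism without touching the dispatch logic:

```go
import "context"

// vendorLocator is a hypothetical environment-specific implementation of
// contracts.PodLocator backed by an external discovery mechanism.
type vendorLocator struct{}

func (l *vendorLocator) Locate(ctx context.Context, md map[string]any) []contracts.Pod {
	// Resolve candidates from a vendor-specific source instead of the EPP
	// Datastore; the Director never needs to know the difference.
	return nil
}

// Hypothetical wiring once this becomes an extension point:
//
//	director := NewDirector(WithPodLocator(&vendorLocator{}))
```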
I have one more PR stacked on top of this that moves candidate resolution to after the admission control check. This involves injecting the PodLocator into the FlowControl layer code and performing lazy candidate resolution before each call to the saturation detector. I'm splitting that out as a separate PR to keep this one a behavioral no-op refactor. I will likely need to update some integration tests in the subsequent PR as well.
/ok-to-test
What changed that made us concerned about RLock contention? Is that expected as a result of the follow-up PR? Is this why we are introducing this caching layer?
Yes, this is specifically to support the scale-from-zero logic in the follow-up PR (#1952). In the current legacy path, we resolve pods once per request. In the Flow Control path, the dispatch loop re-resolves candidates on every attempt for the head-of-line request, which is what makes the caching layer necessary.
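A sketch of what that lazy resolution could look like in the Flow Control dispatch cycle; everything here except the `PodLocator` contract is an illustrative assumption:

```go
import "context"

// queuedRequest, saturated, and dispatch are hypothetical stand-ins for the
// Flow Control layer's actual types and helpers.
type queuedRequest struct {
	metadata map[string]any
}

func saturated(pods []contracts.Pod) bool          { return false }
func dispatch(r *queuedRequest, p []contracts.Pod) {}

// dispatchHeadOfLine resolves candidates on every attempt rather than once at
// enqueue time, so pods that appear after a scale-from-zero event become
// visible to the waiting head-of-line request.
func dispatchHeadOfLine(ctx context.Context, locator contracts.PodLocator, req *queuedRequest) bool {
	for ctx.Err() == nil {
		// Lazy resolution: the pod set may have changed since the last attempt.
		pods := locator.Locate(ctx, req.metadata)
		if len(pods) == 0 || saturated(pods) {
			continue // not dispatchable yet; a real loop would back off here
		}
		dispatch(req, pods)
		return true
	}
	return false
}
```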
Yes. I opted to encapsulate this inside the CachedPodLocator decorator.
ahg-g left a comment:
It seems the most expensive part is that we are creating a full snapshot of the pod metrics per request. We should find a way to take one snapshot and share it among requests, with a "view" on that cache based on the latest subset.
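One way to read that suggestion, sketched with hypothetical types: take a single immutable snapshot per refresh and give each request a filtered view of it, instead of copying the pod metrics per request:

```go
// metricsSnapshot is a hypothetical immutable snapshot of pod metrics, taken
// once and shared across requests; only the per-request filtering differs.
type metricsSnapshot struct {
	pods []contracts.Pod
}

// view filters the shared snapshot down to a request's subset without copying
// the underlying metrics. Keying the subset by pod address is an assumption.
func (s *metricsSnapshot) view(subset map[string]bool) []contracts.Pod {
	out := make([]contracts.Pod, 0, len(s.pods))
	for _, p := range s.pods {
		if subset[p.Address] {
			out = append(out, p)
		}
	}
	return out
}
```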
```go
// DatastorePodLocator implements contracts.PodLocator by querying the EPP Datastore.
// It centralizes the logic for resolving candidate pods based on request metadata
// (specifically Envoy subset filters).
type DatastorePodLocator struct {
```
I think we are converging on endpoints instead of pods, right @kfswain?
```go
//
// This interface allows the Flow Controller to fetch a fresh list of pods dynamically
// during the dispatch cycle, enabling support for "Scale-from-Zero" scenarios where
// pods may not exist when the request is first enqueued.
type PodLocator interface {
```
Why not keep the "candidates" term instead? Like EndpointCandidates or CandidateEndpoints?
Are you just referring to the interface name here? Both of these seem better to me.
Yes, referring to the name.
We can merge this as is and then rename after the other PR merges just so we don't cause conflicts with the second PR.
This is what the caching decorator is for. I am caching by "subset key", though, not per pod. If two requests share the same subset, we use the cached results. This handles 99% of the calls, which come from the tight loop spinning on the HoL request in Flow Control. Cache misses for most unique requests are probably fine, as this is how EPP operated before this change. I was trying to strike a balance between caching on request ID and caching on pod namespaced name.
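For concreteness, a minimal sketch of that decorator shape, assuming a mutex-guarded map keyed by subset key; the field names, key derivation, and locking scheme are guesses, not the actual `CachedPodLocator`:

```go
import (
	"context"
	"sync"
	"time"
)

type cacheEntry struct {
	pods    []contracts.Pod
	expires time.Time
}

// cachedLocator decorates another PodLocator, caching results per subset key
// for a short TTL so repeated dispatch attempts on the same HoL request skip
// the Datastore RLock.
type cachedLocator struct {
	mu       sync.Mutex
	delegate contracts.PodLocator
	ttl      time.Duration // e.g., 50ms per the PR description
	entries  map[string]cacheEntry
}

// subsetKey is a hypothetical helper deriving a stable cache key from the
// request's Envoy subset filter metadata.
func subsetKey(md map[string]any) string { return "" /* derive a stable key */ }

func (c *cachedLocator) Locate(ctx context.Context, md map[string]any) []contracts.Pod {
	key := subsetKey(md)

	c.mu.Lock()
	if e, ok := c.entries[key]; ok && time.Now().Before(e.expires) {
		c.mu.Unlock()
		return e.pods // hit: requests sharing a subset share the result
	}
	c.mu.Unlock()

	pods := c.delegate.Locate(ctx, md) // miss: fall through to the Datastore
	c.mu.Lock()
	c.entries[key] = cacheEntry{pods: pods, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return pods
}
```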
Great!
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, LukeAVanDrie

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR refactors the candidate resolution logic out of the `Director` and into a dedicated `PodLocator` component.

Reasoning:
- Introduces `CachedPodLocator`, a decorator that caches resolution results (TTL 50ms), significantly reducing `RLock` contention on the central Datastore.

Changes:
- Adds the `contracts.PodLocator` interface.
- Implements `DatastorePodLocator` (logic moved from the Director) and `CachedPodLocator`.
- Injects the `PodLocator` into the `Director`, replacing `getCandidatePodsForScheduling` with `locator.Locate()`.

Note: This is a pure refactor. The sequence of events in `HandleRequest` remains unchanged (Admission is still checked after resolution).

Which issue(s) this PR fixes:
Tracks #1800
Does this PR introduce a user-facing change?:
NONE