Conversation

@LukeAVanDrie
Contributor

⚠️ Stacked Pull Request

Status: 🚧 Draft / Blocked
Depends on: #1950

This PR is based on top of changes in #1950
Please ignore the commits from that PR. The actual changes for this feature start at commit: 191440268b9e06db3faefc04471b98504ff595e0.

Once #1950 is merged, I will rebase this branch to show a clean diff.

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR enables Scale-from-Zero support in the Endpoint Picker (EPP).

Previously, the Director eagerly resolved candidate pods before Admission. If the pool was scaled to zero (or no subsets matched), the request was immediately rejected with a 503 Service Unavailable, preventing the Flow Control layer from queuing the request while the autoscaler reacted.

Key Changes

  1. Inverted Control Flow:
    The Director now attempts Admission (Queueing) before Resolution (Finding Pods); a rough sketch follows this list.

    • Before: Resolve Pods $\to$ Check Admission $\to$ Schedule.
    • After: Check Admission $\to$ Resolve Pods $\to$ Schedule.
  2. Lazy Resolution (Flow Control):
    The FlowControlAdmissionController no longer requires a list of candidate pods. It enqueues the request (carrying only Metadata). The ShardProcessor then uses the PodLocator to resolve candidate pods Just-In-Time during the dispatch loop.

    • If 0 pods are found during dispatch, the system is considered "Saturated", enforcing Head-of-Line (HoL) blocking until pods appear or the request TTL expires.
  3. Legacy Path Preservation:
    The LegacyAdmissionController (used when Flow Control is disabled) retains the need for eager resolution to perform immediate shedding. It has been updated to use the PodLocator internally to resolve pods within the Admit call.

  4. Safety Checks:
    The Director retains an explicit check after resolution: if a request passes admission but still resolves to 0 pods (e.g., non-queued traffic, or race conditions), it fails with 503 Service Unavailable.
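
To make the control-flow inversion concrete, here is a minimal, self-contained sketch of the new ordering. Only the PodLocator dependency and the Admit call are taken from this PR; every other name and signature is illustrative and does not mirror the actual code in pkg/epp/requestcontrol/director.go.

```go
package sketch

import (
	"context"
	"errors"
)

// Minimal stand-ins for the real types; all names are illustrative.
type Pod struct{ Name string }

type PodLocator interface {
	// Resolves candidate pods from opaque request metadata.
	Locate(ctx context.Context, metadata map[string]any) []Pod
}

type AdmissionController interface {
	// Admit may queue the request (Flow Control) or shed it (legacy path).
	Admit(ctx context.Context, metadata map[string]any) error
}

type Scheduler interface {
	Schedule(ctx context.Context, metadata map[string]any, pods []Pod) error
}

var errServiceUnavailable = errors.New("503: no candidate pods available")

type Director struct {
	admission AdmissionController
	locator   PodLocator
	scheduler Scheduler
}

// handleRequest sketches the inverted flow: Admission -> Resolution -> Schedule.
func (d *Director) handleRequest(ctx context.Context, metadata map[string]any) error {
	// 1. Admission first: with Flow Control enabled this may queue the
	//    request (carrying only metadata) while the autoscaler reacts.
	if err := d.admission.Admit(ctx, metadata); err != nil {
		return err
	}

	// 2. Resolution second: candidate pods are looked up only after admission.
	pods := d.locator.Locate(ctx, metadata)

	// 3. Safety check: admitted but still zero pods -> 503 Service Unavailable.
	if len(pods) == 0 {
		return errServiceUnavailable
	}

	// 4. Schedule onto the freshly resolved candidates.
	return d.scheduler.Schedule(ctx, metadata, pods)
}
```

The LegacyAdmissionController path instead performs the locator call inside Admit itself, which preserves the old immediate-shedding behavior when Flow Control is disabled.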

Which issue(s) this PR fixes:
Tracks #1800 -- not marking as fixed until we have sufficient validation of this use case (@lionelvillard FYI).

Does this PR introduce a user-facing change?:

The Endpoint Picker now supports Scale-from-Zero when Flow Control is enabled. Requests targeting a pool with no available backends will be queued in the Flow Control layer (up to their timeout) instead of being immediately rejected, allowing time for backends to scale up.

This defines the contract for resolving candidate pods based on request
metadata, decoupling the resolution logic from the storage layer.

Introduces DatastorePodLocator and a caching decorator. This reduces
contention on the Datastore RWMutex during high-throughput dispatch
cycles by caching subset resolution results for a short TTL.

Refactors the Director to use the injected PodLocator interface instead
of the private getCandidatePodsForScheduling method. This prepares the
Director for lazy resolution without changing current behavior.
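
As a rough illustration of the caching decorator described above (continuing the illustrative sketch package from the Key Changes section), the shape is roughly the following; the real DatastorePodLocator, its cache-key scheme, and all field names here are assumptions, not the code in this PR.

```go
package sketch

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// cachingPodLocator decorates another PodLocator (e.g. a datastore-backed
// one) with a short-TTL cache so repeated subset resolutions during a
// dispatch cycle do not hit the Datastore RWMutex.
type cachingPodLocator struct {
	inner PodLocator
	ttl   time.Duration

	mu    sync.Mutex
	cache map[string]cacheEntry
}

type cacheEntry struct {
	pods    []Pod
	expires time.Time
}

func newCachingPodLocator(inner PodLocator, ttl time.Duration) *cachingPodLocator {
	return &cachingPodLocator{inner: inner, ttl: ttl, cache: make(map[string]cacheEntry)}
}

func (c *cachingPodLocator) Locate(ctx context.Context, metadata map[string]any) []Pod {
	key := fmt.Sprintf("%v", metadata) // placeholder for a real subset-filter cache key

	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.cache[key]; ok && time.Now().Before(e.expires) {
		return e.pods // cache hit: no Datastore access needed
	}

	pods := c.inner.Locate(ctx, metadata) // delegate to the wrapped locator
	c.cache[key] = cacheEntry{pods: pods, expires: time.Now().Add(c.ttl)}
	return pods
}
```
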
Updates the FlowControlRequest interface to carry request metadata
instead of a pre-resolved list of candidate pods. This prepares the
system for lazy pod resolution.

- Adds GetMetadata() to FlowControlRequest.
- Removes CandidatePodsForScheduling() from FlowControlRequest.
- Updates mocks in flowcontrol/types and contracts.
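
In interface terms, the change looks roughly like this; only GetMetadata() and CandidatePodsForScheduling() come from the notes above, and the before/after types are simplified stand-ins rather than the full FlowControlRequest interface.

```go
package sketch

// Before (simplified): the request carried a pre-resolved pod list.
type flowControlRequestBefore interface {
	CandidatePodsForScheduling() []Pod // removed by this PR
}

// After (simplified): the request carries only opaque metadata, which
// the contracts.PodLocator resolves lazily during the dispatch cycle.
type flowControlRequestAfter interface {
	GetMetadata() map[string]any // added by this PR
}
```
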
Refactors the request processing flow to support queuing when no
backends are available.

- Inverts Director flow: Admission is now called before Pod Resolution.
- Updates AdmissionController interface to remove eager pod list.
- LegacyAdmissionController now resolves pods internally via PodLocator.
- ShardProcessor (Flow Control) now resolves pods lazily via PodLocator
  during the dispatch cycle (sketched after this list).
- Updates Runner wiring to inject PodLocator where needed.
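
A condensed sketch of the lazy-resolution step in the dispatch cycle, again reusing the illustrative stand-ins above: queue management, TTL bookkeeping, and the real ShardProcessor internals are elided, and only the PodLocator lookup plus the zero-pods-means-saturated rule follow the PR description.

```go
package sketch

import (
	"context"
	"time"
)

type queuedItem struct {
	metadata map[string]any
	deadline time.Time // request TTL
}

type shardProcessorSketch struct {
	locator PodLocator
}

// dispatchHead sketches one pass over the head-of-line item.
func (p *shardProcessorSketch) dispatchHead(ctx context.Context, item queuedItem) bool {
	// Resolve candidate pods Just-In-Time from request metadata.
	pods := p.locator.Locate(ctx, item.metadata)

	if len(pods) == 0 {
		// Zero pods => "Saturated": keep the item at the head of the line
		// (HoL blocking) until pods appear or its TTL expires.
		if time.Now().After(item.deadline) {
			return false // expire the request; error handling elided
		}
		return false // stay queued and retry on the next dispatch cycle
	}

	// Dispatch to scheduling with the freshly resolved candidates (elided).
	return true
}
```
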
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Dec 4, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign danehans for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify bot commented Dec 4, 2025

Deploy Preview for gateway-api-inference-extension ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 8191661 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69321ef29a207800087bdfde |
| 😎 Deploy Preview | https://deploy-preview-1952--gateway-api-inference-extension.netlify.app |

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 4, 2025
@k8s-ci-robot
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 4, 2025
@LukeAVanDrie
Contributor Author

/hold

Awaiting diffbase to be merged.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2025
```go
// GetMetadata returns the opaque metadata associated with the request (e.g., header-derived context, subset filters).
// This data is passed transparently to components like the contracts.PodLocator to resolve resources (candidate pods)
// lazily during the dispatch cycle.
GetMetadata() map[string]any
```
@LukeAVanDrie
Contributor Author

@lioraron This incidentally provides the plumbing necessary for #1863. Let me know if this is what you were expecting. This is populated from reqCtx.Request.Metadata.

IntraFlowDispatchPolicy.SelectItem gets access to a QueueItemAccessor, which exposes OriginalRequest() FlowControlRequest. Your custom plugin implementation can extract whatever it needs from there.

Is this sufficient to resolve #1863 or are you also seeking an extension point to intercept and augment reqCtx.Request.Metadata with additional key-value pairs?
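
For illustration only, a custom policy implementation could pull values out of the metadata roughly like this; only OriginalRequest() and GetMetadata() come from this thread, and the helper name, the example key, and the QueueItemAccessor wiring are hypothetical (imports elided):

```go
// Hypothetical helper inside a custom IntraFlowDispatchPolicy implementation.
// QueueItemAccessor is the accessor handed to SelectItem.
func tenantOf(item QueueItemAccessor) (string, bool) {
	md := item.OriginalRequest().GetMetadata() // populated from reqCtx.Request.Metadata
	v, ok := md["x-tenant-id"].(string)        // example key; purely illustrative
	return v, ok
}
```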

@dumb0002

dumb0002 commented Dec 5, 2025

This PR only enables scale-from-zero for one use case: when flow control is enabled.

@LukeAVanDrie
Contributor Author

> This PR only enables scale-from-zero for one use case: when flow control is enabled.

Yes, I can update the title to be clearer about this requirement. Without a buffer (flow control), there is no way to hold requests until capacity is online.

@dumb0002

dumb0002 commented Dec 5, 2025

> This PR only enables scale-from-zero for one use case: when flow control is enabled.

> Yes, I can update the title to be clearer about this requirement. Without a buffer (flow control), there is no way to hold requests until capacity is online.

@LukeAVanDrie, we have a proposal to also support scale-from-zero when flow control is disabled. It is described in detail here. Basically, the idea is to create an admission plugin containing the logic to hold the request and emit a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. Then, release the request once capacity is online, or time out and drop the request after a pre-defined period. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; that would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

This new admission plugin and new metric would be included as part of the llm-d inference-scheduler set of existing plugins and metrics.
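
For concreteness, a very rough sketch of the proposed plugin shape; every name here, the metric included, is a placeholder, and this does not follow the actual AdmissionController interface:

```go
package sketch

import (
	"context"
	"errors"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// candidateCounter is a hypothetical helper that reports how many
// candidate pods are currently resolvable from the request metadata.
type candidateCounter interface {
	Count(ctx context.Context, metadata map[string]any) int
}

// scaleFromZeroAdmission holds requests while the pool is empty, signals
// the autoscaler via a waiting-count metric, and releases or drops them.
type scaleFromZeroAdmission struct {
	candidates  candidateCounter
	maxWait     time.Duration
	pollPeriod  time.Duration
	waitingReqs prometheus.Gauge // e.g. ..._scale_zero_waiting_request_count
}

func (a *scaleFromZeroAdmission) Admit(ctx context.Context, metadata map[string]any) error {
	if a.candidates.Count(ctx, metadata) > 0 {
		return nil // capacity already online
	}

	a.waitingReqs.Inc() // signal the autoscaler
	defer a.waitingReqs.Dec()

	deadline := time.Now().Add(a.maxWait)
	ticker := time.NewTicker(a.pollPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if a.candidates.Count(ctx, metadata) > 0 {
				return nil // capacity came online; release the request
			}
			if time.Now().After(deadline) {
				return errors.New("timed out waiting for backends to scale up")
			}
		}
	}
}
```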

@LukeAVanDrie
Contributor Author

LukeAVanDrie commented Dec 5, 2025

> Basically, the idea is to create an admission plugin containing the logic to hold the request and emit a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. Then, release the request once capacity is online, or time out and drop the request after a pre-defined period. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; that would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

This is exactly what Flow Control does already. Flow Control is an Admission Control plugin with some more bells and whistles (priority and fairness). I guess I am not understanding why we need to bifurcate here.

This series of PRs moves candidate resolution to after Admission Control (Flow Control or otherwise).

@dumb0002

dumb0002 commented Dec 5, 2025

> Basically, the idea is to create an admission plugin containing the logic to hold the request and emit a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. Then, release the request once capacity is online, or time out and drop the request after a pre-defined period. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; that would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

> This is exactly what Flow Control does already. Flow Control is an Admission Control plugin with some more bells and whistles (priority and fairness). I guess I am not understanding why we need to bifurcate here.

> This series of PRs moves candidate resolution to after Admission Control (Flow Control or otherwise).

I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled while still supporting scale-from-zero.

@LukeAVanDrie
Contributor Author

LukeAVanDrie commented Dec 5, 2025

> I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled.

Yes, I guess my question is why create a separate plugin that handles request buffering only for the FC disabled path rather than simply enabling FC? What is the value in duplicating here? If FC has gaps that don't satisfy your use case, doesn't it make more sense to prioritize closing those?


Your approach will work, and this PR already sets the stage for the prerequisites you need. You will just need to implement your plugin externally. I will update the release note to clarify this is only for FC. I am just trying to better understand your use case.

@dumb0002

dumb0002 commented Dec 5, 2025

> I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled.

> Yes, I guess my question is why create a separate plugin that handles request buffering only for the FC disabled path rather than simply enabling FC? What is the value in duplicating here? If FC has gaps that don't satisfy your use case, doesn't it make more sense to prioritize closing those?

> Your approach will work, and this PR already sets the stage for the prerequisites you need. You will just need to implement your plugin externally. I will update the release note to clarify this is only for FC. I am just trying to better understand your use case.

You raised a very important question: do we need to support scale-from-zero if FC is disabled? I started with the assumption of also providing scale-from-zero support even when FC is disabled. However, I agree that more evaluation of this scenario is indeed needed. Sounds good; updating the release note to highlight the FC focus will help avoid any confusion. Thanks!
