Conversation

@LukeAVanDrie
Contributor

⚠️ Stacked Pull Request

Status: 🚧 Draft / Blocked
Depends on: #1950

This PR is based on top of changes in #1950
Please ignore the commits from that PR. The actual changes for this feature start at commit: 191440268b9e06db3faefc04471b98504ff595e0.

Once #1950 is merged, I will rebase this branch to show a clean diff.

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR enables Scale-from-Zero support in the Endpoint Picker (EPP).

Previously, the Director eagerly resolved candidate pods before Admission. If the pool was scaled to zero (or no subsets matched), the request was immediately rejected with a 503 Service Unavailable, preventing the Flow Control layer from queuing the request while the autoscaler reacted.

Key Changes

  1. Inverted Control Flow:
    The Director now attempts Admission (Queueing) before Resolution (Finding Pods); a rough sketch follows this list.

    • Before: Resolve Pods $\to$ Check Admission $\to$ Schedule.
    • After: Check Admission $\to$ Resolve Pods $\to$ Schedule.
  2. Lazy Resolution (Flow Control):
    The FlowControlAdmissionController no longer requires a list of candidate pods. It enqueues the request (carrying only Metadata). The ShardProcessor then uses the PodLocator to resolve candidate pods Just-In-Time during the dispatch loop.

    • If 0 pods are found during dispatch, the system is considered "Saturated", enforcing Head-of-Line (HoL) blocking until pods appear or the request TTL expires.
  3. Legacy Path Preservation:
    The LegacyAdmissionController (used when Flow Control is disabled) retains the need for eager resolution to perform immediate shedding. It has been updated to use the PodLocator internally to resolve pods within the Admit call.

  4. Safety Checks:
    The Director retains an explicit check after resolution: if a request passes admission but still resolves to 0 pods (e.g., non-queued traffic, or race conditions), it fails with 503 Service Unavailable.
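
To make the control-flow inversion concrete, here is a minimal, self-contained sketch of the new ordering. Only the PodLocator dependency and the Admit call are taken from this PR; every other name and signature is illustrative and does not mirror the actual code in pkg/epp/requestcontrol/director.go.

```go
package sketch

import (
	"context"
	"errors"
)

// Minimal stand-ins for the real types; all names are illustrative.
type Pod struct{ Name string }

type PodLocator interface {
	// Resolves candidate pods from opaque request metadata.
	Locate(ctx context.Context, metadata map[string]any) []Pod
}

type AdmissionController interface {
	// Admit may queue the request (Flow Control) or shed it (legacy path).
	Admit(ctx context.Context, metadata map[string]any) error
}

type Scheduler interface {
	Schedule(ctx context.Context, metadata map[string]any, pods []Pod) error
}

var errServiceUnavailable = errors.New("503: no candidate pods available")

type Director struct {
	admission AdmissionController
	locator   PodLocator
	scheduler Scheduler
}

// handleRequest sketches the inverted flow: Admission -> Resolution -> Schedule.
func (d *Director) handleRequest(ctx context.Context, metadata map[string]any) error {
	// 1. Admission first: with Flow Control enabled this may queue the
	//    request (carrying only metadata) while the autoscaler reacts.
	if err := d.admission.Admit(ctx, metadata); err != nil {
		return err
	}

	// 2. Resolution second: candidate pods are looked up only after admission.
	pods := d.locator.Locate(ctx, metadata)

	// 3. Safety check: admitted but still zero pods -> 503 Service Unavailable.
	if len(pods) == 0 {
		return errServiceUnavailable
	}

	// 4. Schedule onto the freshly resolved candidates.
	return d.scheduler.Schedule(ctx, metadata, pods)
}
```

The LegacyAdmissionController path instead performs the locator call inside Admit itself, which preserves the old immediate-shedding behavior when Flow Control is disabled.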

Which issue(s) this PR fixes:
Tracks #1800 -- not marking as fixed until we have sufficient validation of this use case (@lionelvillard FYI).

Does this PR introduce a user-facing change?:

The Endpoint Picker now supports Scale-from-Zero when Flow Control is enabled. Requests targeting a pool with no available backends will be queued in the Flow Control layer (up to their timeout) instead of being immediately rejected, allowing time for backends to scale up.

This defines the contract for resolving candidate pods based on request
metadata, decoupling the resolution logic from the storage layer.

Introduces DatastorePodLocator and a caching decorator. This reduces
contention on the Datastore RWMutex during high-throughput dispatch
cycles by caching subset resolution results for a short TTL.

Refactors the Director to use the injected PodLocator interface instead
of the private getCandidatePodsForScheduling method. This prepares the
Director for lazy resolution without changing current behavior.
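
As a rough illustration of the caching decorator described above (continuing the illustrative sketch package from the Key Changes section), the shape is roughly the following; the real DatastorePodLocator, its cache-key scheme, and all field names here are assumptions, not the code in this PR.

```go
package sketch

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// cachingPodLocator decorates another PodLocator (e.g. a datastore-backed
// one) with a short-TTL cache so repeated subset resolutions during a
// dispatch cycle do not hit the Datastore RWMutex.
type cachingPodLocator struct {
	inner PodLocator
	ttl   time.Duration

	mu    sync.Mutex
	cache map[string]cacheEntry
}

type cacheEntry struct {
	pods    []Pod
	expires time.Time
}

func newCachingPodLocator(inner PodLocator, ttl time.Duration) *cachingPodLocator {
	return &cachingPodLocator{inner: inner, ttl: ttl, cache: make(map[string]cacheEntry)}
}

func (c *cachingPodLocator) Locate(ctx context.Context, metadata map[string]any) []Pod {
	key := fmt.Sprintf("%v", metadata) // placeholder for a real subset-filter cache key

	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.cache[key]; ok && time.Now().Before(e.expires) {
		return e.pods // cache hit: no Datastore access needed
	}

	pods := c.inner.Locate(ctx, metadata) // delegate to the wrapped locator
	c.cache[key] = cacheEntry{pods: pods, expires: time.Now().Add(c.ttl)}
	return pods
}
```
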
Updates the FlowControlRequest interface to carry request metadata
instead of a pre-resolved list of candidate pods. This prepares the
system for lazy pod resolution.

- Adds GetMetadata() to FlowControlRequest.
- Removes CandidatePodsForScheduling() from FlowControlRequest.
- Updates mocks in flowcontrol/types and contracts.
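
In interface terms, the change looks roughly like this; only GetMetadata() and CandidatePodsForScheduling() come from the notes above, and the before/after types are simplified stand-ins rather than the full FlowControlRequest interface.

```go
package sketch

// Before (simplified): the request carried a pre-resolved pod list.
type flowControlRequestBefore interface {
	CandidatePodsForScheduling() []Pod // removed by this PR
}

// After (simplified): the request carries only opaque metadata, which
// the contracts.PodLocator resolves lazily during the dispatch cycle.
type flowControlRequestAfter interface {
	GetMetadata() map[string]any // added by this PR
}
```
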
Refactors the request processing flow to support queuing when no
backends are available.

- Inverts Director flow: Admission is now called before Pod Resolution.
- Updates AdmissionController interface to remove eager pod list.
- LegacyAdmissionController now resolves pods internally via PodLocator.
- ShardProcessor (Flow Control) now resolves pods lazily via PodLocator
  during the dispatch cycle (sketched after this list).
- Updates Runner wiring to inject PodLocator where needed.
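
A condensed sketch of the lazy-resolution step in the dispatch cycle, again reusing the illustrative stand-ins above: queue management, TTL bookkeeping, and the real ShardProcessor internals are elided, and only the PodLocator lookup plus the zero-pods-means-saturated rule follow the PR description.

```go
package sketch

import (
	"context"
	"time"
)

type queuedItem struct {
	metadata map[string]any
	deadline time.Time // request TTL
}

type shardProcessorSketch struct {
	locator PodLocator
}

// dispatchHead sketches one pass over the head-of-line item.
func (p *shardProcessorSketch) dispatchHead(ctx context.Context, item queuedItem) bool {
	// Resolve candidate pods Just-In-Time from request metadata.
	pods := p.locator.Locate(ctx, item.metadata)

	if len(pods) == 0 {
		// Zero pods => "Saturated": keep the item at the head of the line
		// (HoL blocking) until pods appear or its TTL expires.
		if time.Now().After(item.deadline) {
			return false // expire the request; error handling elided
		}
		return false // stay queued and retry on the next dispatch cycle
	}

	// Dispatch to scheduling with the freshly resolved candidates (elided).
	return true
}
```
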
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Dec 4, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign danehans for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify bot commented Dec 4, 2025

Deploy Preview for gateway-api-inference-extension ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 8191661 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69321ef29a207800087bdfde |
| 😎 Deploy Preview | https://deploy-preview-1952--gateway-api-inference-extension.netlify.app |

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 4, 2025
@k8s-ci-robot
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 4, 2025
@LukeAVanDrie
Contributor Author

/hold

Awaiting diffbase to be merged.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2025
```go
// GetMetadata returns the opaque metadata associated with the request (e.g., header-derived context, subset filters).
// This data is passed transparently to components like the contracts.PodLocator to resolve resources (candidate pods)
// lazily during the dispatch cycle.
GetMetadata() map[string]any
```
@LukeAVanDrie
Contributor Author

@lioraron This incidentally provides the plumbing necessary for #1863. Let me know if this is what you were expecting. This is populated from reqCtx.Request.Metadata.

IntraFlowDispatchPolicy.SelectItem gets access to a QueueItemAccessor, which exposes OriginalRequest() FlowControlRequest. Your custom plugin implementation can extract whatever it needs from there.

Is this sufficient to resolve #1863 or are you also seeking an extension point to intercept and augment reqCtx.Request.Metadata with additional key-value pairs?
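
For illustration only, a custom policy implementation could pull values out of the metadata roughly like this; only OriginalRequest() and GetMetadata() come from this thread, and the helper name, the example key, and the QueueItemAccessor wiring are hypothetical (imports elided):

```go
// Hypothetical helper inside a custom IntraFlowDispatchPolicy implementation.
// QueueItemAccessor is the accessor handed to SelectItem.
func tenantOf(item QueueItemAccessor) (string, bool) {
	md := item.OriginalRequest().GetMetadata() // populated from reqCtx.Request.Metadata
	v, ok := md["x-tenant-id"].(string)        // example key; purely illustrative
	return v, ok
}
```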

@dumb0002

dumb0002 commented Dec 5, 2025

This PR only enables scale-from-zero for one use case: when flow control is enabled.

@LukeAVanDrie
Contributor Author

> This PR only enables scale-from-zero for one use case: when flow control is enabled.

Yes, I can update the title to be clearer about this requirement. Without a buffer (flow control), there is no way to hold requests until capacity is online.

@dumb0002

dumb0002 commented Dec 5, 2025

> This PR only enables scale-from-zero for one use case: when flow control is enabled.

> Yes, I can update the title to be clearer about this requirement. Without a buffer (flow control), there is no way to hold requests until capacity is online.

@LukeAVanDrie, we have a proposal to also support scale-from-zero when flow control is disabled. It is described in detail here. Basically, the idea is to create an admission plugin containing the logic to hold the request and emit a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. Then, release the request once capacity is online, or time out and drop the request after a pre-defined period. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; that would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

This new admission plugin and new metric would be included as part of the llm-d inference-scheduler set of existing plugins and metrics.
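
For concreteness, a very rough sketch of the proposed plugin shape; every name here, the metric included, is a placeholder, and this does not follow the actual AdmissionController interface:

```go
package sketch

import (
	"context"
	"errors"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// candidateCounter is a hypothetical helper that reports how many
// candidate pods are currently resolvable from the request metadata.
type candidateCounter interface {
	Count(ctx context.Context, metadata map[string]any) int
}

// scaleFromZeroAdmission holds requests while the pool is empty, signals
// the autoscaler via a waiting-count metric, and releases or drops them.
type scaleFromZeroAdmission struct {
	candidates  candidateCounter
	maxWait     time.Duration
	pollPeriod  time.Duration
	waitingReqs prometheus.Gauge // e.g. ..._scale_zero_waiting_request_count
}

func (a *scaleFromZeroAdmission) Admit(ctx context.Context, metadata map[string]any) error {
	if a.candidates.Count(ctx, metadata) > 0 {
		return nil // capacity already online
	}

	a.waitingReqs.Inc() // signal the autoscaler
	defer a.waitingReqs.Dec()

	deadline := time.Now().Add(a.maxWait)
	ticker := time.NewTicker(a.pollPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if a.candidates.Count(ctx, metadata) > 0 {
				return nil // capacity came online; release the request
			}
			if time.Now().After(deadline) {
				return errors.New("timed out waiting for backends to scale up")
			}
		}
	}
}
```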

@LukeAVanDrie
Contributor Author

LukeAVanDrie commented Dec 5, 2025

> Basically, the idea is to create an admission plugin containing the logic to hold the request and emit a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. Then, release the request once capacity is online, or time out and drop the request after a pre-defined period. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; that would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

This is exactly what Flow Control does already. Flow Control is an Admission Control plugin with some more bells and whistles (priority and fairness). I guess I am not understanding why we need to bifurcate here.

This series of PRs moves candidate resolution to after Admission Control (Flow Control or otherwise).

@dumb0002

dumb0002 commented Dec 5, 2025

> Basically, the idea is to create an admission plugin containing the logic to hold the request and emit a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. Then, release the request once capacity is online, or time out and drop the request after a pre-defined period. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; that would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

> This is exactly what Flow Control does already. Flow Control is an Admission Control plugin with some more bells and whistles (priority and fairness). I guess I am not understanding why we need to bifurcate here.

> This series of PRs moves candidate resolution to after Admission Control (Flow Control or otherwise).

I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled while still supporting scale-from-zero.

@LukeAVanDrie
Contributor Author

LukeAVanDrie commented Dec 5, 2025

> I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled.

Yes, I guess my question is why create a separate plugin that handles request buffering only for the FC disabled path rather than simply enabling FC? What is the value in duplicating here? If FC has gaps that don't satisfy your use case, doesn't it make more sense to prioritize closing those?


Your approach will work, and this PR already sets the stage for the prerequisites you need. You will just need to implement your plugin externally. I will update the release note to clarify this is only for FC. I am just trying to better understand your use case.

@dumb0002

dumb0002 commented Dec 5, 2025

> I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled.

> Yes, I guess my question is why create a separate plugin that handles request buffering only for the FC disabled path rather than simply enabling FC? What is the value in duplicating here? If FC has gaps that don't satisfy your use case, doesn't it make more sense to prioritize closing those?

> Your approach will work, and this PR already sets the stage for the prerequisites you need. You will just need to implement your plugin externally. I will update the release note to clarify this is only for FC. I am just trying to better understand your use case.

You raised a very important question: do we need to support scale-from-zero if FC is disabled? I started with the assumption of also providing scale-from-zero support even when FC is disabled. However, I agree that more evaluation of this scenario is indeed needed. Sounds good; updating the release note to highlight the FC focus will help avoid any confusion. Thanks!
