
Conversation

@omerap12 (Member) commented Nov 15, 2025:

What type of PR is this?

/kind documentation

What this PR does / why we need it:

AEP for #8720

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

Signed-off-by: Omer Aplatony <omerap12@gmail.com>
@k8s-ci-robot (Contributor) commented:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added the `do-not-merge/work-in-progress`, `do-not-merge/release-note-label-needed`, and `cncf-cla: yes` labels on Nov 15, 2025
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: omerap12
Once this PR has been reviewed and has the lgtm label, please assign gjtempleton for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the `size/XS` and `area/vertical-pod-autoscaler` labels and removed `do-not-merge/needs-area` on Nov 15, 2025
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
@k8s-ci-robot added the `size/L` and `release-note-none` labels and removed `size/XS` and `do-not-merge/release-note-label-needed` on Nov 15, 2025
@omerap12 changed the title from "[WIP] In Place Only VPA" to "AEP-8720: InPlace Update Mode" on Nov 15, 2025
@omerap12 changed the title from "AEP-8720: InPlace Update Mode" to "AEP-8818: InPlace Update Mode" on Nov 15, 2025
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
@omerap12 (Member Author) commented:

/cc @adrianmoisey @maxcao13

@omerap12 marked this pull request as ready for review on November 16, 2025 11:55
@k8s-ci-robot removed the `do-not-merge/work-in-progress` label on Nov 16, 2025
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
@omerap12 (Member Author) commented:

/kind api-review

@k8s-ci-robot (Contributor) commented:

@omerap12: The label(s) kind/api-review cannot be applied, because the repository doesn't have them.


In response to this:

> /kind api-review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@omerap12 (Member Author) commented:

/label kind/api-review

Comment on lines 126 to 127
```go
klog.V(4).InfoS("Can't in-place update pod, waiting for next loop", "pod", klog.KObj(pod))
return utils.InPlaceDeferred
```
Member:

Minor nit, are these supposed to be indented the same level?

@omerap12 (Member Author):

Not sure, I'll try to fmt soon.


### Behavior when Feature Gate is Disabled

- When the `InPlace` feature gate is disabled and a VPA is configured with `UpdateMode: InPlace`, the updater will skip processing that VPA entirely (it will not fall back to eviction).
Member:

Just want to check: it won't evict and it won't in-place update?

Also, what does the admission-controller do when the feature gate is disabled but a pod is set to InPlace?

@omerap12 (Member Author):

The admission controller will deny the request (ref).

@omerap12 (Member Author):

> Just want to check: it won't evict and it won't in-place update?

That’s what I assumed, because if someone wants to use in-place mode only, it likely means the workload can’t be evicted. In that case, I think the correct action is to do nothing.

Member:

Well, what if someone does this:

  1. Upgrades to this version of VPA and enables the feature gate
  2. Uses the InPlace mode on a VPA
  3. Disables the feature gate
  4. Deletes a Pod from the VPA pointing at InPlace

Does the admission-controller:

  1. Set the resources as per the recommendation (as if the VPA was in "Initial" mode)
  2. Ignore the pod (as if the VPA was in "Off" mode)
  3. Something else..

@omerap12 (Member Author):

TBH I didn't test it, but it should be 1.

@omerap12 (Member Author):

Just checked, we set the resources as per the recommendation.

Member:

Cool, that's worth clarifying here

@omerap12 (Member Author):

Done in: 13f1fa7
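
To make that concrete, here is a minimal Go sketch of the agreed behavior (the helper is hypothetical, and `UpdateModeInPlace` is the constant this AEP proposes, not an existing API):

```go
package updater

import (
	vpa_types "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
)

// shouldProcessVPA sketches the agreed behavior: with the InPlace feature
// gate disabled, the updater skips InPlace-mode VPAs entirely (no eviction,
// no resize), while the admission-controller keeps applying recommendations
// on pod creation like every other mode.
func shouldProcessVPA(vpa *vpa_types.VerticalPodAutoscaler, inPlaceGateEnabled bool) bool {
	if vpa.Spec.UpdatePolicy == nil || vpa.Spec.UpdatePolicy.UpdateMode == nil {
		return true
	}
	if *vpa.Spec.UpdatePolicy.UpdateMode == vpa_types.UpdateModeInPlace && !inPlaceGateEnabled {
		return false // skip entirely; do not fall back to eviction
	}
	return true
}
```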

- Apply recommendations during pod admission (like all other modes)
- Attempt in-place updates for running pods under the same conditions as `InPlaceOrRecreate`
- Never add pods to `podsForEviction` if in-place updates fail
- Continuously retry failed in-place updates
Member:

Should we have a backoff policy for retrying, or do we think linear retry is sufficient if we keep failing?

@omerap12 (Member Author):

I have another idea - let me know what you think.
Since the kubelet automatically retries deferred pods when resources become available, we could use that behavior to our advantage. Let's say we send an in-place update that sets a pod to x CPU and y memory.
In the next update loop, if the recommended values are still x CPU and y memory, we can skip sending a new update. We already know the kubelet got the first request and will retry it when it can.
So the updater only needs to check whether the requested resources have changed since the previous cycle. If they haven’t changed, we just move on.

The main drawback is that the updater now has to remember which recommendation was last applied for each pod, which means some extra memory use and more bookkeeping in the code.
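
A rough sketch of the bookkeeping this idea would need (all names hypothetical):

```go
package updater

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// lastApplied remembers, per pod, the resources most recently requested in
// an in-place resize. This is the extra bookkeeping mentioned above.
var lastApplied = map[types.NamespacedName]corev1.ResourceList{}

// needsResize reports whether a new resize request should be sent: only when
// the recommendation differs from what the kubelet was already asked to
// apply. If unchanged, the kubelet will retry a deferred resize on its own.
func needsResize(pod *corev1.Pod, recommended corev1.ResourceList) bool {
	key := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
	if prev, ok := lastApplied[key]; ok && equalResources(prev, recommended) {
		return false // same request as last cycle; let kubelet keep retrying
	}
	lastApplied[key] = recommended
	return true
}

func equalResources(a, b corev1.ResourceList) bool {
	if len(a) != len(b) {
		return false
	}
	for name, qty := range a {
		other, ok := b[name]
		if !ok || qty.Cmp(other) != 0 {
			return false
		}
	}
	return true
}
```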

@maxcao13 (Member) commented Dec 10, 2025:

I think what you are proposing makes sense, but I think the existing resize conditions will tell us this information without the extra bookkeeping: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources#resize-status

I think this already exists in the vpa code -> with Deferred we will just keep waiting until the deferred timeout, at which point InPlaceOrRecreate will fall back to eviction. Is our intention with InPlace to just let it sit in Deferred indefinitely (same with InProgress)? I think that's what we want, I just wanted to clarify here, and maybe clarify that in the AEP.

With Infeasible I'm not so sure. In the KEP it says the kubelet itself will never retry, which is where the VPA will come in to manually retry, and where I guess I am asking whether we should back off these requests or not. But as an alpha implementation, I think it's fine to just retry indefinitely if we see Infeasible, until pre-production testing tells us it's better not to do that.

@omerap12 (Member Author):

> I think this already exists in the vpa code -> with Deferred we will just keep waiting until the deferred timeout, at which point InPlaceOrRecreate will fall back to eviction. Is our intention with InPlace to just let it sit in Deferred indefinitely (same with InProgress)? I think that's what we want, I just wanted to clarify here, and maybe clarify that in the AEP.

Exactly, with Deferred we will just skip that pod and do nothing (the kubelet will do the hard work for us).

> With Infeasible I'm not so sure. In the KEP it says the kubelet itself will never retry, which is where the VPA will come in to manually retry, and where I guess I am asking whether we should back off these requests or not. But as an alpha implementation, I think it's fine to just retry indefinitely if we see Infeasible, until pre-production testing tells us it's better not to do that.

Agree.

So to sum up:

  • Deferred - do nothing.
  • Infeasible - we retry with no backoff for alpha.

Member:

Regarding both of these cases, what happens if the recommendation changes? (Omer already mentioned this earlier in the thread).

Should the updater check if recommendations != spec.resources, and if they aren't the same, resize again?

It's possible that the new recommendation could be smaller, allowing for the pod to be resized.

@omerap12 (Member Author):

I think we should, since we don’t know how long the pod will remain in deferred mode and we don’t want to miss recommendations.

Member:

That makes sense +1
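
Summing up the whole thread, the per-pod decision might look roughly like this (a sketch with a hypothetical helper; the condition type and reason names come from KEP-1287):

```go
package updater

import (
	corev1 "k8s.io/api/core/v1"
)

// decideInPlace sketches the updater's per-loop decision for a pod under an
// InPlace-mode VPA. recommendationChanged means the current recommendation
// differs from what pod.Spec already requests.
func decideInPlace(pod *corev1.Pod, recommendationChanged bool) string {
	// A changed recommendation always wins: re-send the resize, whatever
	// state the previous request is in. The new target may be smaller and
	// feasible where the old one was not.
	if recommendationChanged {
		return "resize"
	}
	for _, cond := range pod.Status.Conditions {
		// Condition names per KEP-1287 (in-place pod resize).
		switch string(cond.Type) {
		case "PodResizePending":
			if cond.Reason == "Deferred" {
				return "wait" // kubelet retries deferred resizes itself
			}
			if cond.Reason == "Infeasible" {
				return "retry" // alpha: retry with no backoff
			}
		case "PodResizeInProgress":
			return "wait" // kubelet is applying the resize
		}
	}
	return "noop" // spec already matches the recommendation
}
```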

- Allow VPA to eventually apply updates when cluster conditions improve
- Respect the existing in-place update infrastructure from AEP-4016

## Non-Goals
Member:

Do we think we should have some small note that this update mode is subject to the behavior of the `InPlacePodVerticalScaling` gate, such that it's possible (but improbable) that a resize can cause an OOMKill during a memory limit downsize?

Though I don't actually know the probability of this happening if a limit gets resized close to the usage, I think it may be useful to call out, since we emphasize that brief disruptions are unacceptable.

I think to mitigate risk here we may want to recommend that if you absolutely cannot tolerate disruption (i.e. unintended OOMkill), then you can either:

  1. disallow memory limits for your no disruption container
  2. if you must allow VPA to set memory limits, then you should configure the VPA to generate more generous/conservative memory limit recommendations as a safety buffer.

^ Though this may or may not be better for our docs, instead of getting into it in the AEP here.

Thoughts? cc @adrianmoisey

Member:

I think you're right.
I was thinking something similar about the "Provide a truly non-disruptive VPA update mode that never evicts pods" goal.

I think it may be worth softening the language in the AEP (since we can't make guarantees that resizes are non-disruptive)

I also agree that most of what you suggested may be good for the docs

Member:

Related: #8805

@omerap12 (Member Author):

Yeah, that sounds very reasonable. I think we can have this in both our docs and the AEP.
Let me know what you think of this: ba9514a

Member:

Looks good to me, thanks for that 👍

omerap12 and others added 5 commits December 9, 2025 08:57
…re flag turned off

Signed-off-by: Omer Aplatony <omerap12@gmail.com>
Co-authored-by: Adrian Moisey <adrian@changeover.za.net>
Co-authored-by: Adrian Moisey <adrian@changeover.za.net>
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
Comment on lines +243 to +245
## Implementation History

- 2025-15-11: initial version
Member:

I know we didn't write anything down for AEP-4016 in terms of graduation criteria, but since we went through the process of graduating that one from alpha to beta, I'm wondering if we should have some sort of idea for this one?

I don't know if we should have some formal process, but just judging from the last graduation, I think it makes sense to say we would keep it in alpha for one release cycle to allow early adoption, and if there are no graduation bugs/blockers that come up in the issues, then we are okay to graduate to beta.

@omerap12 (Member Author):

Added Graduation Criteria section.

)
```

Modify the `CanInPlaceUpdate` to accomdate the new update mode:
Member:

nit:
verify check is complaining about this:

vertical-pod-autoscaler/enhancements/8818-in-place-only/README.md:80:33: "accomdate" is a misspelling of "accommodate"

@omerap12 (Member Author):

Thanks! fixed :)
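
For reference, the `CanInPlaceUpdate` change quoted above might look roughly like this (a simplified sketch: the decision values are redeclared locally so the snippet stands alone, and `UpdateModeInPlace` is the proposed constant):

```go
package restriction

import (
	vpa_types "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
)

// InPlaceDecision mirrors the updater's utils.InPlaceApproved /
// InPlaceDeferred / InPlaceEvict values, redeclared here so the sketch
// stands alone.
type InPlaceDecision string

const (
	InPlaceApproved InPlaceDecision = "InPlaceApproved"
	InPlaceDeferred InPlaceDecision = "InPlaceDeferred"
	InPlaceEvict    InPlaceDecision = "InPlaceEvict"
)

// canInPlaceUpdate sketches the accommodation: both in-place modes may
// attempt a resize, but InPlace must never resolve to eviction, so a pod
// that cannot be resized right now is deferred to the next loop instead.
func canInPlaceUpdate(mode vpa_types.UpdateMode, resizeInFlight bool) InPlaceDecision {
	isInPlaceMode := mode == vpa_types.UpdateModeInPlaceOrRecreate || mode == vpa_types.UpdateModeInPlace
	if !isInPlaceMode {
		return InPlaceEvict // non-in-place modes take the eviction path
	}
	if resizeInFlight {
		return InPlaceDeferred // wait for the next loop; don't stack requests
	}
	return InPlaceApproved
}
```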

Signed-off-by: Omer Aplatony <omerap12@gmail.com>
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
@maxcao13 (Member) commented Dec 11, 2025:

/lgtm

Thanks for writing this up, this is great :-)

@k8s-ci-robot added the `lgtm` label on Dec 11, 2025
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
@k8s-ci-robot (Contributor) commented:

New changes are detected. LGTM label has been removed.

@k8s-ci-robot removed the `lgtm` label on Dec 11, 2025
@omerap12 (Member Author) commented:

@maxcao13 , @adrianmoisey , @iamzili
Updated the AEP based on our talk; if you lgtm I want to ping the sig-node folks for review as well.

@omerap12 (Member Author) commented:

/label tide/merge-method-squash

@k8s-ci-robot added the `tide/merge-method-squash` label on Dec 11, 2025

Disabling the `InPlace` feature gate will cause the following to happen:
- admission-controller will:
  - Reject new VPA objects being created with `InPlace` configured
Member:

Just want to clarify, it will reject new VPAs with InPlace, existing VPAs with InPlace can still be modified, right?
(that's how k/k handles this)

@omerap12 (Member Author):

Yes - I need to double-check that, but yes, just like InPlaceOrRecreate.
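
A minimal sketch of that validation (hypothetical helper; `UpdateModeInPlace` is the proposed constant):

```go
package admission

import (
	"fmt"

	vpa_types "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
)

// validateUpdateMode sketches the gate handling: creating a new VPA with
// UpdateMode InPlace is rejected while the gate is off, but updates to VPAs
// that already use InPlace stay allowed, mirroring InPlaceOrRecreate and the
// usual k/k feature-gate convention.
func validateUpdateMode(isCreate, gateEnabled bool, vpa *vpa_types.VerticalPodAutoscaler) error {
	if vpa.Spec.UpdatePolicy == nil || vpa.Spec.UpdatePolicy.UpdateMode == nil {
		return nil
	}
	if *vpa.Spec.UpdatePolicy.UpdateMode != vpa_types.UpdateModeInPlace {
		return nil
	}
	if isCreate && !gateEnabled {
		return fmt.Errorf("unexpected UpdateMode InPlace: the InPlace feature gate is disabled")
	}
	return nil // existing InPlace VPAs may still be modified
}
```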

@adrianmoisey (Member) commented:

Generally speaking I think this is good.
I think it's safe to ping sig-node and api-review on this if you want

@omerap12 (Member Author) commented:

/label api-review

@natasha41575 left a comment:

No large concerns from me from ippr / node perspective, just a few questions

I'm kind of confused by the handling of Infeasible: there are mentions of doing "retry" in this case, which sounds odd to me because Infeasible means that the node cannot support the resize - in which case, what is the purpose of retrying? (This is probably just me not understanding what it means for VPA to retry.)

Comment on lines +87 to +88
- `Deferred`: When the resize status is Deferred and the recommendation matches spec, VPA waits and lets kubelet handle it. This means kubelet is waiting to apply the resize, and VPA should not interfere.
- `Infeasible`: When the resize status is Infeasible and the recommendation matches spec, VPA defers action. The node cannot accommodate the current resize, but if the recommendation changes, VPA will attempt the new resize.
@natasha41575 commented Dec 15, 2025:

So basically Deferred, Infeasible, and InProgress will all be handled the same by VPA, right? VPA will do nothing if recommendation = spec; otherwise it will apply the new recommendation? I'm slightly confused because there is a thread with talk about "retrying" Infeasible resizes, but I don't see that written here, nor do I fully understand what that means.


Something else I'd like to mention - we are planning to move feasibility checks to admission (hopefully in v1.36). So the goal is that resizes that are infeasible should fail to patch rather than being accepted and then marked Infeasible by the kubelet. Does this change anything about how you want to handle infeasible resizes?

- `Deferred`: When the resize status is Deferred and the recommendation matches spec, VPA waits and lets kubelet handle it. This means kubelet is waiting to apply the resize, and VPA should not interfere.
- `Infeasible`: When the resize status is Infeasible and the recommendation matches spec, VPA defers action. The node cannot accommodate the current resize, but if the recommendation changes, VPA will attempt the new resize.
- `InProgress`: When the resize status is InProgress and the recommendation matches spec, VPA waits for completion. The resize is actively being applied by kubelet.
- `Error`: When the resize status is Error, VPA retries the operation. An error occurred during resize and retrying may succeed.
@natasha41575 commented Dec 15, 2025:

What is meant by Error here? There are two interpretations I can think of:

  • The API server returns an error (the patch does not succeed).
  • A resize has the InProgress status, but there is an Error present in the InProgress condition.

I think both should be explicitly mentioned for clarity. Not sure if you are proposing something different depending on which case it is.
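
For what it's worth, the second case can be read off the pod's conditions; a sketch (condition type and reason names per KEP-1287, helper hypothetical):

```go
package updater

import (
	corev1 "k8s.io/api/core/v1"
)

// resizeErrored covers the second interpretation: the patch landed, but the
// kubelet reports an error while applying it, surfaced on the
// PodResizeInProgress condition (reason "Error" per KEP-1287). The first
// interpretation, an API-server rejection, would instead surface as an error
// returned from the updater's patch call itself.
func resizeErrored(pod *corev1.Pod) (message string, errored bool) {
	for _, cond := range pod.Status.Conditions {
		if string(cond.Type) == "PodResizeInProgress" && cond.Reason == "Error" {
			return cond.Message, true
		}
	}
	return "", false
}
```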


## Kubernetes version compatibility

`InPlace` is being built assuming that it will be running on a Kubernetes version of at least 1.33 with the beta version of [KEP-1287: In-Place Update of Pod Resources](https://github.com/kubernetes/enhancements/issues/1287) enabled.


1.33 does not support memory limit decreases - the patch will not go through. Is it easier to just support 1.34+, or do you have some plan to handle the forbidden memory limit decreases of 1.33?

- Apply recommendations during pod admission (like all other modes)
- Attempt in-place updates for running pods under the same conditions as `InPlaceOrRecreate`
- Never add pods to `podsForEviction` if in-place updates fail
- Continuously retry failed in-place updates
Copy link

@natasha41575 natasha41575 Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "retry" mean in this context? Does this mean you re-attempt the patch with the same recommendation, or does VPA adjust its recommendation to something else and try to apply that?


Labels

- `api-review`: Categorizes an issue or PR as actively needing an API review.
- `area/vertical-pod-autoscaler`
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `release-note-none`: Denotes a PR that doesn't merit a release note.
- `size/L`: Denotes a PR that changes 100-499 lines, ignoring generated files.
- `tide/merge-method-squash`: Denotes a PR that should be squashed by tide when it merges.
