WIP: Test: requeue PerformanceStatus update when write fails#1479
WIP: Test: requeue PerformanceStatus update when write fails#1479MarSik wants to merge 1 commit intoopenshift:release-4.20from
Conversation
WalkthroughModified error handling in the performance profile controller to include a fixed 30-second requeue delay when status update operations fail, replacing empty reconcile results with results containing the requeue timing. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/RHsyseng/operator-utils@v1.4.13: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/go-systemd@v0.0.0-20191104093116-d3cd4ed1dbcf: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/ignition@v0.35.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/ignition/v2@v2.22.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/docker/go-units@v0.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-logr/stdr@v1.2.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/google/go-cmp@v0.7.0 ... [truncated 18026 characters] ... io/kubectl: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/kubelet: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/legacy-cloud-providers: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/metrics: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/mount-utils: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/pod-security-admission: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-apiserver: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@jmencak This could solve the issue we were investigating. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: MarSik The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/test e2e-hypershift |
|
@MarSik: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/agentic_review |
Code Review by Qodo
|
|
/agentic_describe |
Review Summary by QodoRequeue PerformanceProfile reconciliation on status update failures
WalkthroughsDescription• Add requeue delay when PerformanceProfile status update fails • Prevent status from becoming stale during write failures • Introduce statusUpdateRequeueAfter constant for retry timing • Apply requeue logic to three status update failure paths Diagramflowchart LR
A["Status Update Fails"] --> B["Return RequeueAfter Result"]
B --> C["Retry After 30 Seconds"]
C --> D["Prevent Stale Status State"]
File Changes1. pkg/performanceprofile/controller/performanceprofile_controller.go
|
| if err := r.StatusWriter.Update(ctx, instance, conditions); err != nil { | ||
| klog.Errorf("failed to update performance profile %q status: %v", instance.GetName(), err) | ||
| return reconcile.Result{}, err | ||
| return reconcile.Result{RequeueAfter: statusUpdateRequeueAfter}, err | ||
| } | ||
| return reconcile.Result{}, err | ||
| } | ||
| err = r.StatusWriter.UpdateOwnedConditions(ctx, instance) | ||
| if err != nil { | ||
| klog.Errorf("failed to update performance profile %q status: %v", instance.GetName(), err) | ||
| return ctrl.Result{RequeueAfter: statusUpdateRequeueAfter}, err | ||
| } |
There was a problem hiding this comment.
1. Requeueafter ignored with error 🐞 Bug ≡ Correctness
Reconcile returns (Result{RequeueAfter: statusUpdateRequeueAfter}, err) on status update
failures, but controller-runtime ignores any non-zero Result when err != nil and requeues using
exponential backoff instead of the intended 30s delay. This breaks the PR’s stated goal and can
cause immediate/repeated retries rather than the fixed retry delay.
Agent Prompt
### Issue description
The reconciler currently returns `Result{RequeueAfter: ...}` together with a non-nil `err` in status-write failure paths. In controller-runtime, `Result` is ignored when `err != nil`, so the requested 30s delay is not applied.
### Issue Context
Goal is to retry reconciliation after a status update failure with a fixed delay (30s) so status doesn’t remain stale, without triggering immediate rate-limited retries.
### Fix Focus Areas
- pkg/performanceprofile/controller/performanceprofile_controller.go[502-517]
### Suggested change
- In the status update failure branches, change returns from `return ctrl/reconcile.Result{RequeueAfter: statusUpdateRequeueAfter}, err` to `return ctrl.Result{RequeueAfter: statusUpdateRequeueAfter}, nil` (keeping the existing log line so the failure is still visible).
- Ensure any other requeue-after paths do not return a non-nil error if the delay must be honored.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Interesting, CodeRabbit didn't catch this @MarSik , looks like there's some value in Qodo.
Edit: CodeRabbit AI found it too.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/performanceprofile/controller/performanceprofile_controller.go (1)
543-556:⚠️ Potential issue | 🟠 MajorSame issue:
RequeueAfterignored due to non-nil error return.Apply the same fix pattern here—return
nilfor error when status update fails to honor the 30-second requeue.Proposed fix
if err != nil { conditions := status.GetDegradedConditions(status.ConditionFailedToFindMachineConfigPool, err.Error()) - if err := r.StatusWriter.Update(ctx, profile, conditions); err != nil { + if statusErr := r.StatusWriter.Update(ctx, profile, conditions); statusErr != nil { klog.Errorf("failed to update performance profile %q status: %v", profile.GetName(), err) - return nil, &reconcile.Result{RequeueAfter: statusUpdateRequeueAfter}, err + return nil, &reconcile.Result{RequeueAfter: statusUpdateRequeueAfter}, nil } return nil, &reconcile.Result{}, nil } if err := validateProfileMachineConfigPool(profile, profileMCP); err != nil { conditions := status.GetDegradedConditions(status.ConditionBadMachineConfigLabels, err.Error()) - if err := r.StatusWriter.Update(ctx, profile, conditions); err != nil { + if statusErr := r.StatusWriter.Update(ctx, profile, conditions); statusErr != nil { klog.Errorf("failed to update performance profile %q status: %v", profile.GetName(), err) - return nil, &reconcile.Result{RequeueAfter: statusUpdateRequeueAfter}, err + return nil, &reconcile.Result{RequeueAfter: statusUpdateRequeueAfter}, nil } return nil, &reconcile.Result{}, nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/performanceprofile/controller/performanceprofile_controller.go` around lines 543 - 556, The status update error handling after validateProfileMachineConfigPool(profile, profileMCP) incorrectly returns a non-nil error which prevents the reconcile.Result.RequeueAfter (statusUpdateRequeueAfter) from being honored; update the block that calls r.StatusWriter.Update(ctx, profile, conditions) so that when the status update fails you still log the error (klog.Errorf) but return nil for the error and &reconcile.Result{RequeueAfter: statusUpdateRequeueAfter} (i.e., mirror the earlier fixed pattern used above), referencing validateProfileMachineConfigPool, status.GetDegradedConditions, r.StatusWriter.Update, and statusUpdateRequeueAfter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/performanceprofile/controller/performanceprofile_controller.go`:
- Around line 506-515: The status update error handling is returning both a
non-nil error and a Result with RequeueAfter (so the fixed 30s requeue is
ignored); in the Reconcile flow (and in getAndValidateMCP) change the error
returns after r.StatusWriter.Update and r.StatusWriter.UpdateOwnedConditions to
log the error via klog.Errorf (including instance.GetName() and err), then
return ctrl.Result{RequeueAfter: statusUpdateRequeueAfter}, nil (i.e., drop the
error) so the fixed requeue is honored; ensure this change is applied to all
four locations referencing r.StatusWriter.Update,
r.StatusWriter.UpdateOwnedConditions and getAndValidateMCP where
statusUpdateRequeueAfter is currently paired with a non-nil err.
---
Outside diff comments:
In `@pkg/performanceprofile/controller/performanceprofile_controller.go`:
- Around line 543-556: The status update error handling after
validateProfileMachineConfigPool(profile, profileMCP) incorrectly returns a
non-nil error which prevents the reconcile.Result.RequeueAfter
(statusUpdateRequeueAfter) from being honored; update the block that calls
r.StatusWriter.Update(ctx, profile, conditions) so that when the status update
fails you still log the error (klog.Errorf) but return nil for the error and
&reconcile.Result{RequeueAfter: statusUpdateRequeueAfter} (i.e., mirror the
earlier fixed pattern used above), referencing validateProfileMachineConfigPool,
status.GetDegradedConditions, r.StatusWriter.Update, and
statusUpdateRequeueAfter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4ca6448a-225d-4081-af9c-19ad5a481598
📒 Files selected for processing (1)
pkg/performanceprofile/controller/performanceprofile_controller.go
| if err := r.StatusWriter.Update(ctx, instance, conditions); err != nil { | ||
| klog.Errorf("failed to update performance profile %q status: %v", instance.GetName(), err) | ||
| return reconcile.Result{}, err | ||
| return reconcile.Result{RequeueAfter: statusUpdateRequeueAfter}, err | ||
| } | ||
| return reconcile.Result{}, err | ||
| } | ||
| err = r.StatusWriter.UpdateOwnedConditions(ctx, instance) | ||
| if err != nil { | ||
| klog.Errorf("failed to update performance profile %q status: %v", instance.GetName(), err) | ||
| return ctrl.Result{RequeueAfter: statusUpdateRequeueAfter}, err |
There was a problem hiding this comment.
RequeueAfter is ignored when returning a non-nil error.
Per controller-runtime's reconciliation loop (see vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:334-347), when an error is returned, the framework uses rate-limited exponential backoff and ignores the Result entirely. The log at line 344 confirms: "The result will always be ignored if the error is non-nil."
All four modified locations return both RequeueAfter: statusUpdateRequeueAfter and a non-nil err, so the 30-second fixed delay will never be applied.
To achieve the intended fixed requeue interval, return nil for the error and log it separately:
Proposed fix for lines 506-516
conditions := status.GetDegradedConditions(status.ConditionReasonComponentsCreationFailed, err.Error())
- if err := r.StatusWriter.Update(ctx, instance, conditions); err != nil {
+ if statusErr := r.StatusWriter.Update(ctx, instance, conditions); statusErr != nil {
klog.Errorf("failed to update performance profile %q status: %v", instance.GetName(), err)
- return reconcile.Result{RequeueAfter: statusUpdateRequeueAfter}, err
+ return reconcile.Result{RequeueAfter: statusUpdateRequeueAfter}, nil
}
return reconcile.Result{}, err
}
- err = r.StatusWriter.UpdateOwnedConditions(ctx, instance)
- if err != nil {
+ if statusErr := r.StatusWriter.UpdateOwnedConditions(ctx, instance); statusErr != nil {
klog.Errorf("failed to update performance profile %q status: %v", instance.GetName(), err)
- return ctrl.Result{RequeueAfter: statusUpdateRequeueAfter}, err
+ return ctrl.Result{RequeueAfter: statusUpdateRequeueAfter}, nil
}The same pattern applies to the status update failures in getAndValidateMCP (lines 543-545 and 551-554).
Alternatively, if exponential backoff is acceptable, remove the RequeueAfter to avoid misleading code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/performanceprofile/controller/performanceprofile_controller.go` around
lines 506 - 515, The status update error handling is returning both a non-nil
error and a Result with RequeueAfter (so the fixed 30s requeue is ignored); in
the Reconcile flow (and in getAndValidateMCP) change the error returns after
r.StatusWriter.Update and r.StatusWriter.UpdateOwnedConditions to log the error
via klog.Errorf (including instance.GetName() and err), then return
ctrl.Result{RequeueAfter: statusUpdateRequeueAfter}, nil (i.e., drop the error)
so the fixed requeue is honored; ensure this change is applied to all four
locations referencing r.StatusWriter.Update,
r.StatusWriter.UpdateOwnedConditions and getAndValidateMCP where
statusUpdateRequeueAfter is currently paired with a non-nil err.
Test PR, do not merge here. I will resubmit for the main branch once the fix is validated.
A failure to write a PerformanceProfile status could cause the status to be stuck in stale state for a long time, because it will only be recomputed when a new reconcile event arrives. That can actually take hours.