Requeue zone update when context is cancelled #1965
2 issues found across 8 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="internal/controller/operator/controllers_test.go">
<violation number="1" location="internal/controller/operator/controllers_test.go:355">
P2: This test uses an unreachable `context.Canceled`+`ErrZone` error shape, so it does not verify the real requeue-on-zone-cancel path.</violation>
</file>
<file name="internal/controller/operator/factory/vmdistributed/zone.go">
<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:406">
P1: `WithCancelCause` alone is not enough here: `wait.PollUntilContextCancel` only returns `ctx.Err()`, so `ErrZone` never reaches the reconcile code and the new requeue path will not trigger.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
303cbe5 to d35cbc1
This solution most likely doesn't cover the case described in the issue, where VMCluster reconciliation for some reason returns context.Canceled and VMDistributed waits for its readiness forever.
    contextCancelErrorsTotal.Inc()
    var errZone *vmdistributed.ErrZone
    if errors.As(err, &errZone) {
        return ctrl.Result{Requeue: true}, nil
Why not just ignore the cause set in cmd/main.go and requeue in all other cases?
In that case the other causes are not needed.
Good idea, thanks
Also, let's keep only this cancelWithCause and drop the rest.
I mean all the WithCancelCause calls added in this PR. Let's drop all of them except the one that actually impacts reconcile behaviour. #1964 was initially for a different purpose: it keeps only one function for reconcile error handling and processes all reconcile errors in that function.
401395b to 021e56b
1 issue found across 1 file (changes from recent commits).
<file name="internal/controller/operator/factory/vmdistributed/zone.go">
<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:355">
P1: Ignore per-address polling cancellations here; otherwise normal EndpointSlice churn aborts the whole queue-drain wait.</violation>
</file>
77bef40 to c412b94
1 issue found across 2 files (changes from recent commits).
<file name="internal/controller/operator/factory/vmdistributed/zone.go">
<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:199">
P1: Wrap `ctx.Err()` in this return; otherwise canceled zone updates are treated as generic failures and won't be requeued.</violation>
</file>
-    return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty: %w", item, nsnCluster.String(), err)
+    zs.waitForEmptyPQ(ctx, rclient, defaultMetricsCheckInterval, i)
+    if ctx.Err() != nil {
+        return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty", item, nsnCluster.String())
P1: Wrap ctx.Err() in this return; otherwise canceled zone updates are treated as generic failures and won't be requeued.
<file context>
@@ -195,8 +194,9 @@ func (zs *zones) upgrade(ctx context.Context, rclient client.Client, cr *vmv1alp
- return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty: %w", item, nsnCluster.String(), err)
+ zs.waitForEmptyPQ(ctx, rclient, defaultMetricsCheckInterval, i)
+ if ctx.Err() != nil {
+ return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty", item, nsnCluster.String())
}
</file context>
Suggested change:
-    return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty", item, nsnCluster.String())
+    return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty: %w", item, nsnCluster.String(), ctx.Err())
    }

    var wg sync.WaitGroup
    var resultErr error
I like that we remove this ugly block, but I'm not sure it's worth removing the returned error and reading it from the context instead.
This function returns nothing besides context.Canceled; other errors are treated as transient.
Ah, okay, let's roll with it then. Could you LGTM the PR?
Attach a more detailed error every time we cancel the context. Requeue the request if the cancellation occurred during zone processing. This would prevent some zones from being left untouched, as otherwise the controller would restart from scratch.
Fixes #1962