Requeue zone update when context is cancelled #1965

Merged
AndrewChubatiuk merged 2 commits into master from context-cancelled on Mar 17, 2026

Conversation

@vrutkovs
Collaborator

Attach a more detailed error every time we cancel the context. Requeue the request if the cancellation occurred during zone processing. This would prevent some zones from being left untouched, as otherwise the controller would restart from scratch.

Fixes #1962
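
The idea in the description can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `errZone` is a hypothetical stand-in for the zone error type, and `shouldRequeue` is an assumed helper name. It shows a context being cancelled with a cause and the caller inspecting that cause to decide whether to requeue.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errZone is a hypothetical stand-in for the zone error type used in the PR.
type errZone struct{ zone string }

func (e *errZone) Error() string { return "zone " + e.zone + " update interrupted" }

// shouldRequeue reports whether the context was cancelled with a zone cause.
func shouldRequeue(ctx context.Context) bool {
	var ez *errZone
	return errors.As(context.Cause(ctx), &ez)
}

// cancelDuringZone cancels a context with a zone cause and returns the decision.
func cancelDuringZone() bool {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(&errZone{zone: "zone-a"})
	return shouldRequeue(ctx)
}

// cancelPlain cancels without a cause, so the cause is plain context.Canceled.
func cancelPlain() bool {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(nil)
	return shouldRequeue(ctx)
}

func main() {
	fmt.Println(cancelDuringZone(), cancelPlain())
}
```

`context.WithCancelCause` and `context.Cause` require Go 1.20 or newer.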

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 8 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/controller/operator/controllers_test.go">

<violation number="1" location="internal/controller/operator/controllers_test.go:355">
P2: This test uses an unreachable `context.Canceled`+`ErrZone` error shape, so it does not verify the real requeue-on-zone-cancel path.</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/zone.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:406">
P1: `WithCancelCause` alone is not enough here: `wait.PollUntilContextCancel` only returns `ctx.Err()`, so `ErrZone` never reaches the reconcile code and the new requeue path will not trigger.</violation>
</file>
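
The P1 finding can be reproduced without Kubernetes dependencies. The sketch below assumes, as the review states, that the poller returns only `ctx.Err()`; `pollUntilCancelled` is a hypothetical stand-in for that poller and `errZone` is a hypothetical error type. It shows that the cause is invisible in the returned error but still reachable through `context.Cause`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errZone is a hypothetical stand-in for the zone error type.
type errZone struct{ zone string }

func (e *errZone) Error() string { return "zone " + e.zone + " update interrupted" }

// pollUntilCancelled is a hypothetical poller that, like the one described in
// the review, returns ctx.Err() and not the cancellation cause.
func pollUntilCancelled(ctx context.Context, interval time.Duration, cond func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if cond() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// causeVisibility reports whether the zone cause is visible through the
// returned error and through context.Cause.
func causeVisibility() (viaErr, viaCause bool) {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(&errZone{zone: "zone-a"})
	err := pollUntilCancelled(ctx, time.Millisecond, func() bool { return false })

	var ez *errZone
	viaErr = errors.As(err, &ez)
	viaCause = errors.As(context.Cause(ctx), &ez)
	return viaErr, viaCause
}

func main() {
	fmt.Println(causeVisibility())
}
```

One way to surface the cause to the caller is to wrap `context.Cause(ctx)` into the returned error at the point where the cancellation is observed.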

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@vrutkovs vrutkovs force-pushed the context-cancelled branch 2 times, most recently from 303cbe5 to d35cbc1 Compare March 13, 2026 15:10
@AndrewChubatiuk
Contributor

This solution most likely doesn't cover the case described in the issue, where VMCluster reconciliation for some reason returns context.Canceled and VMDistributed waits for its readiness forever.

contextCancelErrorsTotal.Inc()
var errZone *vmdistributed.ErrZone
if errors.As(err, &errZone) {
	return ctrl.Result{Requeue: true}, nil
Contributor

Why not just ignore the cause set in cmd/main.go and requeue in all other cases?
In that case the other causes are not needed.
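
A rough sketch of this suggestion, using only the standard library: `errShutdown` is a hypothetical sentinel standing in for whatever cause cmd/main.go attaches on shutdown, and `requeueOnCancel` is an assumed helper name. Every cancellation except the shutdown one results in a requeue.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errShutdown is a hypothetical sentinel for the cause set on operator shutdown.
var errShutdown = errors.New("operator is shutting down")

// requeueOnCancel requeues every cancelled reconcile except a shutdown.
func requeueOnCancel(ctx context.Context, err error) bool {
	if !errors.Is(err, context.Canceled) {
		return false
	}
	return !errors.Is(context.Cause(ctx), errShutdown)
}

// decide simulates a reconcile context derived from the root context and
// cancels either the root (shutdown) or the reconcile context itself.
func decide(shutdown bool) bool {
	root, stop := context.WithCancelCause(context.Background())
	defer stop(nil)
	ctx, cancel := context.WithCancelCause(root)
	defer cancel(nil)

	if shutdown {
		// Cancelling the parent propagates its cause to the child context.
		stop(errShutdown)
	} else {
		cancel(errors.New("zone update interrupted"))
	}
	return requeueOnCancel(ctx, ctx.Err())
}

func main() {
	fmt.Println(decide(true), decide(false))
}
```

With this shape only one cause needs to be recognised, which matches the point that the other causes become unnecessary.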

Contributor

added it here 00bc895

Collaborator Author

Good idea, thanks

Contributor

also let's keep only this cancelWithCause and drop the rest

Collaborator Author

This would be removed in #1964 anyway? I want to keep this PR minimal so that it can be backported to 0.68 (not so sure about #1964).

Perhaps it's easier to merge these changes in #1964?

Contributor

@AndrewChubatiuk AndrewChubatiuk Mar 16, 2026

I mean all the WithCancelCause calls added in this PR. Let's drop all of them except the one that actually affects reconcile behaviour. #1964 was initially meant for a different purpose: it keeps a single function for reconcile error handling and processes all reconcile errors in that function.

@vrutkovs vrutkovs force-pushed the context-cancelled branch 4 times, most recently from 401395b to 021e56b Compare March 17, 2026 09:12
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 1 file (changes from recent commits).


<file name="internal/controller/operator/factory/vmdistributed/zone.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:355">
P1: Ignore per-address polling cancellations here; otherwise normal EndpointSlice churn aborts the whole queue-drain wait.</violation>
</file>


@vrutkovs vrutkovs force-pushed the context-cancelled branch from 77bef40 to c412b94 Compare March 17, 2026 09:37
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 2 files (changes from recent commits).


<file name="internal/controller/operator/factory/vmdistributed/zone.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:199">
P1: Wrap `ctx.Err()` in this return; otherwise canceled zone updates are treated as generic failures and won't be requeued.</violation>
</file>


-			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty: %w", item, nsnCluster.String(), err)
+		zs.waitForEmptyPQ(ctx, rclient, defaultMetricsCheckInterval, i)
+		if ctx.Err() != nil {
+			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty", item, nsnCluster.String())
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 17, 2026

P1: Wrap ctx.Err() in this return; otherwise canceled zone updates are treated as generic failures and won't be requeued.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/controller/operator/factory/vmdistributed/zone.go, line 199:

<comment>Wrap `ctx.Err()` in this return; otherwise canceled zone updates are treated as generic failures and won't be requeued.</comment>

<file context>
@@ -195,8 +194,9 @@ func (zs *zones) upgrade(ctx context.Context, rclient client.Client, cr *vmv1alp
-			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty: %w", item, nsnCluster.String(), err)
+		zs.waitForEmptyPQ(ctx, rclient, defaultMetricsCheckInterval, i)
+		if ctx.Err() != nil {
+			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty", item, nsnCluster.String())
 		}
 
</file context>
Suggested change
-			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty", item, nsnCluster.String())
+			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty: %w", item, nsnCluster.String(), ctx.Err())
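
The difference the suggested change makes can be checked in isolation. In this sketch the zone and cluster names are placeholder values and `describe` is a hypothetical helper; only the two format strings come from the diff. Without `%w` the cancellation is reduced to text, so `errors.Is(err, context.Canceled)` fails for the caller.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// describe builds the error with or without wrapping ctx.Err().
// The zone and cluster values are placeholders for illustration.
func describe(ctx context.Context, wrap bool) error {
	const zone, cluster = "zone-a", "default/vmcluster-a"
	if wrap {
		return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty: %w", zone, cluster, ctx.Err())
	}
	return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty", zone, cluster)
}

// detectCancel reports whether a caller can still recognise the cancellation.
func detectCancel(wrap bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return errors.Is(describe(ctx, wrap), context.Canceled)
}

func main() {
	fmt.Println(detectCancel(false), detectCancel(true))
}
```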

}

var wg sync.WaitGroup
var resultErr error
Collaborator Author

I like that we remove this ugly block, but I'm not sure it's worth removing the returned error and reading it from the context instead.

Contributor

@AndrewChubatiuk AndrewChubatiuk Mar 17, 2026

This function returns nothing besides context.Canceled; other errors are treated as transient.
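
The pattern described here, a wait function with no return value whose caller reads the outcome from the context, might look roughly like the sketch below. `waitUntilEmpty` and its `isEmpty` callback are hypothetical names and do not reproduce the signature of `waitForEmptyPQ`; transient check errors are simply retried on the next tick.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitUntilEmpty polls isEmpty until it succeeds or ctx is done.
// Check errors are treated as transient and retried, so nothing is returned.
func waitUntilEmpty(ctx context.Context, interval time.Duration, isEmpty func() (bool, error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if empty, err := isEmpty(); err == nil && empty {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// drainAfterTransientErrors fails the check twice before reporting an empty queue.
func drainAfterTransientErrors() (attempts int, cancelled bool) {
	ctx := context.Background()
	waitUntilEmpty(ctx, time.Millisecond, func() (bool, error) {
		attempts++
		if attempts < 3 {
			return false, errors.New("metrics endpoint unavailable")
		}
		return true, nil
	})
	return attempts, ctx.Err() != nil
}

// drainInterrupted waits on an already cancelled context and a queue that never drains.
func drainInterrupted() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	waitUntilEmpty(ctx, time.Millisecond, func() (bool, error) { return false, nil })
	// The caller learns about the interruption from the context, not a return value.
	return ctx.Err() != nil
}

func main() {
	fmt.Println(drainAfterTransientErrors())
	fmt.Println(drainInterrupted())
}
```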

Collaborator Author

Ah, okay, let's roll with it then. Could you LGTM the PR?

@AndrewChubatiuk AndrewChubatiuk merged commit 8a88772 into master Mar 17, 2026
7 checks passed
@AndrewChubatiuk AndrewChubatiuk deleted the context-cancelled branch March 17, 2026 10:50
Development

Successfully merging this pull request may close these issues.

handleReconcileErr swallows context.Canceled without requeueing, permanently dropping CRs from work queue

2 participants