Requeue zone update when context is cancelled by vrutkovs · Pull Request #1965 · VictoriaMetrics/operator

vrutkovs · 2026-03-13T13:11:33Z

Attach a more detailed error every time we cancel the context. Requeue the request if the cancellation occurred during zone processing. This would prevent some zones from being left untouched, as otherwise the controller would restart from scratch.

Fixes #1962

cubic-dev-ai

2 issues found across 8 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/controller/operator/controllers_test.go">

<violation number="1" location="internal/controller/operator/controllers_test.go:355">
P2: This test uses an unreachable `context.Canceled`+`ErrZone` error shape, so it does not verify the real requeue-on-zone-cancel path.</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/zone.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:406">
P1: `WithCancelCause` alone is not enough here: `wait.PollUntilContextCancel` only returns `ctx.Err()`, so `ErrZone` never reaches the reconcile code and the new requeue path will not trigger.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}
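
The P1 finding above turns on the difference between `ctx.Err()` and `context.Cause(ctx)`. The following is a minimal sketch using only the standard library; `errZoneUpdate` and `waitDone` are hypothetical stand-ins for the PR's `ErrZone` and for a poller that, as the review comment describes, reports only `ctx.Err()`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errZoneUpdate is a hypothetical cause; it stands in for the PR's ErrZone.
var errZoneUpdate = errors.New("zone update interrupted")

// waitDone mimics a poller that reports only ctx.Err() on cancellation.
func waitDone(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// cancelAndInspect cancels a context with a cause and reports what the
// poller's error and the context itself expose afterwards.
func cancelAndInspect() (isCanceled, errHasCause, ctxHasCause bool) {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errZoneUpdate)
	err := waitDone(ctx)
	return errors.Is(err, context.Canceled),
		errors.Is(err, errZoneUpdate),
		errors.Is(context.Cause(ctx), errZoneUpdate)
}

func main() {
	isCanceled, errHasCause, ctxHasCause := cancelAndInspect()
	fmt.Println(isCanceled, errHasCause, ctxHasCause) // true false true
}
```

In this sketch the cause is visible only through `context.Cause(ctx)`, so code that inspects just the returned error with `errors.As` or `errors.Is` would not see it.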

AndrewChubatiuk · 2026-03-13T20:08:24Z

This solution most likely doesn't cover the case described in the issue, where VMCluster reconciliation returns context.Canceled for some reason and VMDistributed waits for its readiness forever.

AndrewChubatiuk · 2026-03-13T20:21:11Z

internal/controller/operator/controllers.go

 		contextCancelErrorsTotal.Inc()
+		var errZone *vmdistributed.ErrZone
+		if errors.As(err, &errZone) {
+			return ctrl.Result{Requeue: true}, nil


Why not just ignore the cause set in cmd/main.go and requeue in all other cases?
In that case the other causes are not needed.

added it here 00bc895

Good idea, thanks

Also, let's keep only this cancelWithCause and drop the rest.

This would be removed in #1964 anyway? I want to keep this PR minimal so that it can be backported to 0.68 (not so sure about #1964).

Perhaps it's easier to merge these changes in #1964?

I mean all the WithCancelCause calls added in this PR. Let's drop all of them except the one that actually affects reconcile behaviour. #1964 was initially meant for a different purpose: it keeps a single function for handling reconcile errors and processes all of them there.
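
The approach discussed in this thread (attach a cause only on graceful shutdown, requeue on every other cancellation) could look roughly like the sketch below. It is not the PR's code: `errShutdown`, `shouldRequeue`, and `canceledWith` are hypothetical names, and only the standard library is used.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errShutdown is a hypothetical cause that the process entry point would
// attach when it cancels the root context on graceful shutdown.
var errShutdown = errors.New("operator is shutting down")

// shouldRequeue reports whether a canceled reconcile should be retried:
// every cancellation except the one caused by graceful shutdown.
func shouldRequeue(ctx context.Context, err error) bool {
	if !errors.Is(err, context.Canceled) {
		return false
	}
	return !errors.Is(context.Cause(ctx), errShutdown)
}

// canceledWith returns a context that is already canceled with the cause.
// A nil cause makes context.Cause return context.Canceled.
func canceledWith(cause error) context.Context {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(cause)
	return ctx
}

func main() {
	shutdown := canceledWith(errShutdown)
	other := canceledWith(nil)
	fmt.Println(shouldRequeue(shutdown, shutdown.Err())) // false
	fmt.Println(shouldRequeue(other, other.Err()))       // true
}
```

Because a cause set on a parent context is also returned by `context.Cause` for contexts derived from it, a single `WithCancelCause` at the root would be enough for this check in the sketch.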

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

<file name="internal/controller/operator/factory/vmdistributed/zone.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:355">
P1: Ignore per-address polling cancellations here; otherwise normal EndpointSlice churn aborts the whole queue-drain wait.</violation>
</file>
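
One way to read this finding: each polled address gets its own child context, and cancelling a child (for example when an address disappears) should not be reported as a failure of the overall wait. The sketch below illustrates that idea under those assumptions; the `pollers` type and its methods are hypothetical and are not taken from `zone.go`.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// pollers tracks one goroutine per address, each with its own child context.
type pollers struct {
	parent  context.Context
	wg      sync.WaitGroup
	mu      sync.Mutex
	cancels map[string]context.CancelFunc
}

func newPollers(parent context.Context) *pollers {
	return &pollers{parent: parent, cancels: map[string]context.CancelFunc{}}
}

// add starts a poller for addr under a child of the parent context.
func (p *pollers) add(addr string) {
	ctx, cancel := context.WithCancel(p.parent)
	p.mu.Lock()
	p.cancels[addr] = cancel
	p.mu.Unlock()
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		<-ctx.Done() // a real poller would check the address until done
	}()
}

// remove stops the poller for addr only; the parent context is untouched.
func (p *pollers) remove(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if cancel, ok := p.cancels[addr]; ok {
		cancel()
		delete(p.cancels, addr)
	}
}

// wait blocks until all pollers stop and reports only the parent's state,
// so per-address cancellations are ignored.
func (p *pollers) wait() error {
	p.wg.Wait()
	return p.parent.Err()
}

// demo returns the result of a wait with address churn only, and the
// result of a wait whose parent context was canceled.
func demo() (churnErr, parentErr error) {
	churn := newPollers(context.Background())
	churn.add("10.0.0.1")
	churn.add("10.0.0.2")
	churn.remove("10.0.0.1")
	churn.remove("10.0.0.2")
	churnErr = churn.wait()

	ctx, cancel := context.WithCancel(context.Background())
	stopped := newPollers(ctx)
	stopped.add("10.0.0.3")
	cancel()
	parentErr = stopped.wait()
	return churnErr, parentErr
}

func main() {
	churnErr, parentErr := demo()
	fmt.Println(churnErr, parentErr) // <nil> context canceled
}
```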

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

<file name="internal/controller/operator/factory/vmdistributed/zone.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:199">
P1: Wrap `ctx.Err()` in this return; otherwise canceled zone updates are treated as generic failures and won't be requeued.</violation>
</file>

cubic-dev-ai · 2026-03-17T10:20:19Z

internal/controller/operator/factory/vmdistributed/zone.go

-			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty: %w", item, nsnCluster.String(), err)
+		zs.waitForEmptyPQ(ctx, rclient, defaultMetricsCheckInterval, i)
+		if ctx.Err() != nil {
+			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty", item, nsnCluster.String())


P1: Wrap ctx.Err() in this return; otherwise canceled zone updates are treated as generic failures and won't be requeued.

Suggested change

-			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty", item, nsnCluster.String())
+			return fmt.Errorf("zone=%s: failed to wait till VMCluster=%s queue is empty: %w", item, nsnCluster.String(), ctx.Err())
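
The effect of the suggested `%w` can be checked in isolation. This is a small sketch with hypothetical helpers (`zoneErr`, `isCanceled`, `canceledCtx`) and a shortened message, not the code from `zone.go`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// zoneErr builds the zone error with or without wrapping ctx.Err().
func zoneErr(ctx context.Context, zone string, wrap bool) error {
	if wrap {
		return fmt.Errorf("zone=%s: failed to wait till queue is empty: %w", zone, ctx.Err())
	}
	return fmt.Errorf("zone=%s: failed to wait till queue is empty", zone)
}

// isCanceled is the kind of check a caller could use to decide on a requeue.
func isCanceled(err error) bool {
	return errors.Is(err, context.Canceled)
}

// canceledCtx returns an already canceled context.
func canceledCtx() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

func main() {
	ctx := canceledCtx()
	fmt.Println(isCanceled(zoneErr(ctx, "a", false))) // false
	fmt.Println(isCanceled(zoneErr(ctx, "a", true)))  // true
}
```

Without the wrap, the error carries only text, so `errors.Is` cannot match it against `context.Canceled`.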

vrutkovs · 2026-03-17T10:20:56Z

internal/controller/operator/factory/vmdistributed/zone.go

 	}

 	var wg sync.WaitGroup
-	var resultErr error


I like that we remove this ugly block, but I'm not sure it's worth dropping the returned error and reading it from the context instead.

This function returns nothing besides context.Canceled; other errors are treated as transient.

Ah, okay, let's roll with it then. Could you LGTM the PR?
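
The pattern agreed on here (a wait function with no return value, with the caller reading cancellation from the context) might be sketched as follows. `waitUntil`, `drain`, and `runDemo` are hypothetical names; the real `waitForEmptyPQ` has a different signature.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitUntil polls check every interval until it reports true or ctx is done.
// It returns nothing: errors from check are treated as transient and retried.
func waitUntil(ctx context.Context, interval time.Duration, check func() (bool, error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if done, err := check(); err == nil && done {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// drain waits for check to succeed and reads cancellation from the context.
func drain(ctx context.Context, check func() (bool, error)) error {
	waitUntil(ctx, time.Millisecond, check)
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("failed to wait till queue is empty: %w", err)
	}
	return nil
}

// runDemo returns the result of a wait that succeeds after transient
// errors, and the result of a wait on an already canceled context.
func runDemo() (okErr, canceledErr error) {
	calls := 0
	flaky := func() (bool, error) {
		calls++
		if calls < 3 {
			return false, errors.New("transient scrape error")
		}
		return true, nil
	}
	okErr = drain(context.Background(), flaky)

	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	never := func() (bool, error) { return false, nil }
	canceledErr = drain(ctx, never)
	return okErr, canceledErr
}

func main() {
	okErr, canceledErr := runDemo()
	fmt.Println(okErr)
	fmt.Println(canceledErr)
}
```

One trade-off of this shape: if the context is canceled at about the same moment the check succeeds, the caller still reports a cancellation, which is acceptable when a canceled request is simply requeued.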

vrutkovs requested a review from AndrewChubatiuk as a code owner March 13, 2026 13:11

vrutkovs assigned AndrewChubatiuk Mar 13, 2026

cubic-dev-ai bot reviewed Mar 13, 2026

vrutkovs force-pushed the context-cancelled branch 2 times, most recently from 303cbe5 to d35cbc1 Compare March 13, 2026 15:10

AndrewChubatiuk reviewed Mar 13, 2026

vrutkovs force-pushed the context-cancelled branch 4 times, most recently from 401395b to 021e56b Compare March 17, 2026 09:12

cubic-dev-ai bot reviewed Mar 17, 2026

fix: add cancellation reason for graceful shutdown, retry other requests

c412b94

vrutkovs force-pushed the context-cancelled branch from 77bef40 to c412b94 Compare March 17, 2026 09:37

make waitForEmptyPQ return no error

031fbcf

cubic-dev-ai bot reviewed Mar 17, 2026

vrutkovs commented Mar 17, 2026

AndrewChubatiuk approved these changes Mar 17, 2026

AndrewChubatiuk merged commit 8a88772 into master Mar 17, 2026
7 checks passed

AndrewChubatiuk deleted the context-cancelled branch March 17, 2026 10:50
