Acceptance: stretch-cluster multicluster operator upgrade test by RafalKorepta · Pull Request #1533 · redpanda-data/redpanda-operator

RafalKorepta · 2026-05-19T15:15:19Z

Adds a multicluster acceptance scenario that exercises the supported
operator-upgrade path described in
docs/stretch-cluster-operator-upgrade.md: install the published
redpanda/operator chart from charts.redpanda.com on every vcluster,
bring up a 3-broker StretchCluster, then helm-upgrade each vcluster
one at a time to the local dev chart with localhost/redpanda-operator:dev.

The test contract:

- Before the upgrade, every vcluster's TransportService.Status RPC
  must report is_healthy=true with no unhealthy peers. The starting
  raft term is snapshotted per vcluster.
- After each per-vcluster helm upgrade, the operator Deployment is
  observed rolling (old pod gone, new pod Available, new UID), and
  the raft quorum on EVERY vcluster must recover with is_healthy=true
  AND a strictly advanced term vs the prior snapshot. The advanced
  term is the load-bearing signal: a steady is_healthy alone could
  mean the new pod is sitting on the sidelines while the existing
  quorum re-uses the old leader.
- vclusters are upgraded in first-to-last index order (vc-0, vc-1,
  vc-2). The runbook recommends this order so support tooling can
  give one concrete sequence rather than "any order works"; leader
  flapping that this ordering may cause is not on the data path
  and is acceptable.
- After all three vclusters are upgraded, a sentinel topic created
  on the v26.2.1-beta.1-managed cluster is re-read to prove no
  broker outage occurred during the operator rollout.
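
The contract above can be sketched in Go as an ordered loop that upgrades one vcluster and then checks every vcluster against its term snapshot. This is a minimal illustration, not the code in this PR: `raftStatus`, `verifyAdvanced`, and `upgradeInOrder` are hypothetical names, the `upgrade` and `status` callbacks stand in for the helm upgrade and the Status RPC, and the single check per vcluster replaces whatever polling the real steps do while the quorum recovers.

```go
package main

import "fmt"

// raftStatus is a hypothetical, simplified view of a Status RPC response.
type raftStatus struct {
	IsHealthy      bool
	UnhealthyPeers []string
	Term           uint64
}

// verifyAdvanced fails unless the quorum is healthy, has no unhealthy peers,
// and reports a term strictly greater than the snapshot.
func verifyAdvanced(name string, prevTerm uint64, st raftStatus) error {
	if !st.IsHealthy || len(st.UnhealthyPeers) > 0 {
		return fmt.Errorf("%s: quorum unhealthy (is_healthy=%t, unhealthy peers=%v)",
			name, st.IsHealthy, st.UnhealthyPeers)
	}
	if st.Term <= prevTerm {
		return fmt.Errorf("%s: term %d did not advance past snapshot %d",
			name, st.Term, prevTerm)
	}
	return nil
}

// upgradeInOrder upgrades each vcluster in slice order. After every upgrade
// it checks ALL vclusters and refreshes their snapshots, so the next upgrade
// must advance the term again.
func upgradeInOrder(
	names []string,
	snapshot map[string]uint64,
	upgrade func(name string) error,
	status func(name string) (raftStatus, error),
) error {
	for _, target := range names {
		if err := upgrade(target); err != nil {
			return fmt.Errorf("upgrading %s: %w", target, err)
		}
		for _, name := range names {
			st, err := status(name)
			if err != nil {
				return fmt.Errorf("status of %s after upgrading %s: %w", name, target, err)
			}
			if err := verifyAdvanced(name, snapshot[name], st); err != nil {
				return fmt.Errorf("after upgrading %s: %w", target, err)
			}
			snapshot[name] = st.Term
		}
	}
	return nil
}

func main() {
	names := []string{"vc-0", "vc-1", "vc-2"}
	snapshot := map[string]uint64{"vc-0": 7, "vc-1": 7, "vc-2": 7}

	// Simulated cluster: every upgrade triggers one re-election.
	term := uint64(7)
	upgrade := func(string) error { term++; return nil }
	status := func(string) (raftStatus, error) {
		return raftStatus{IsHealthy: true, Term: term}, nil
	}

	if err := upgradeInOrder(names, snapshot, upgrade, status); err != nil {
		fmt.Println("upgrade failed:", err)
		return
	}
	fmt.Println("all vclusters upgraded; final terms:", snapshot)
}
```

Refreshing the snapshot after each round is one reading of "strictly advanced term vs the prior snapshot"; a variant that compares only against the pre-upgrade snapshot would be weaker, because a single re-election would then satisfy all three rounds.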

Supporting changes:

- pkg/vcluster/vcluster.go: Cluster.HelmUpgrade mirroring HelmInstall
  via action.NewUpgrade, so per-vcluster upgrades go through the same
  in-process helm machinery as the initial install.
- acceptance/steps/stretch.go: deployOperators refactored to take an
  operatorChartSource (Path or RemoteChart + optional image override).
  Existing call sites continue to use the local dev chart via a thin
  wrapper; the new step uses a pulled upstream chart.
- acceptance/steps/operator_chart.go: helm chart-pull helper backed by
  action.NewPullWithOpts, downloading and untarring the published
  chart into a temp dir so loader.Load can consume it.
- acceptance/steps/stretch_operator_upgrade.go: the three new step
  handlers plus the operator gRPC Status helper that dials each
  vcluster's operator pod on :9443 using mTLS material read from
  redpanda-operator-multicluster-certificates (same secret the
  bootstrap step writes and the existing rpk-k8s raft check reads).
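
To illustrate the "Path or RemoteChart + optional image override" shape, here is a rough Go sketch of what such a chart-source value could look like. Every field name, the `validate` method, and the local chart path are assumptions made for illustration; the actual `operatorChartSource` in acceptance/steps/stretch.go may be laid out differently.

```go
package main

import (
	"errors"
	"fmt"
)

// remoteChart identifies a published chart (hypothetical fields).
type remoteChart struct {
	RepoURL string
	Name    string
}

// operatorChartSource selects where the operator chart comes from.
// Exactly one of Path or RemoteChart is expected to be set.
type operatorChartSource struct {
	Path        string       // local chart directory
	RemoteChart *remoteChart // published chart to pull
	Image       string       // optional image override; empty keeps the chart default
}

// validate rejects sources that set both or neither of Path and RemoteChart.
func (s operatorChartSource) validate() error {
	hasPath := s.Path != ""
	hasRemote := s.RemoteChart != nil
	if hasPath == hasRemote {
		return errors.New("exactly one of Path or RemoteChart must be set")
	}
	if hasRemote && (s.RemoteChart.RepoURL == "" || s.RemoteChart.Name == "") {
		return errors.New("RemoteChart needs a repository URL and a chart name")
	}
	return nil
}

func main() {
	// Upstream chart used for the initial install.
	published := operatorChartSource{
		RemoteChart: &remoteChart{RepoURL: "https://charts.redpanda.com", Name: "operator"},
	}
	// Local dev chart used as the upgrade target; the path is a placeholder.
	dev := operatorChartSource{
		Path:  "./path/to/operator/chart",
		Image: "localhost/redpanda-operator:dev",
	}
	fmt.Println("published:", published.validate())
	fmt.Println("dev:", dev.validate())
}
```

Keeping both sources in one type lets a single deploy function serve the install and the upgrade, which matches the thin-wrapper approach described for existing call sites.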

The scenario is tagged @multicluster @dev-env so it runs under the
existing multicluster acceptance harness and is included in the
dev-env setup-only flow.

K8S-848

setupMulticlusterManager called mgr.Elected() and then immediately iterated mgr.GetClusterNames() to call mgr.GetCluster(ctx, name) for cache-sync. Elected() only signals leader election, not provider readiness: the multicluster-runtime clusters.Provider engages each non-local cluster from a manager runnable started by mgr.Start(ctx), and that engagement loop can still be in progress after Elected fires.

Until engagement completes for a given name, mgr.GetCluster returns ErrClusterNotFound immediately rather than blocking. The previous code therefore raced the engagement loop and could fail with t.Fatalf or, worse, appear to succeed for one cluster while silently leaving the others absent from downstream factory.GetClient lookups (surfaced as "no Kafka brokers found" inside Factory.kafkaForStretchCluster).

Replace the direct GetCluster call with a 60s-budgeted poll on GetCluster per cluster, then run WaitForCacheSync on the engaged cluster. Behavior under healthy conditions is unchanged: engagement is essentially "start the cluster's cache" and typically completes in under a second.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


RafalKorepta requested review from andrewstucki, chrisseto, gene-redpanda and hidalgopl as code owners May 19, 2026 15:15

RafalKorepta added the no-changelog label May 19, 2026

RafalKorepta force-pushed the rk/k8s-848/upgrade-test branch from dc7aeeb to a913aea Compare May 20, 2026 13:11

RafalKorepta wants to merge 2 commits into main from rk/k8s-848/upgrade-test
