Acceptance: stretch-cluster multicluster operator upgrade test #1533
Open
RafalKorepta wants to merge 2 commits into
setupMulticlusterManager called mgr.Elected() and then immediately iterated mgr.GetClusterNames() to call mgr.GetCluster(ctx, name) for cache-sync. Elected() only signals leader election, not provider readiness: the multicluster-runtime clusters.Provider engages each non-local cluster from a manager runnable started by mgr.Start(ctx), and that engagement loop can still be in progress after Elected fires.

Until engagement completes for a given name, mgr.GetCluster returns ErrClusterNotFound immediately rather than blocking. The previous code therefore raced the engagement loop and could fail the test via t.Fatalf, or worse, appear to succeed for one cluster while silently leaving the others absent from downstream factory.GetClient lookups (surfaced as "no Kafka brokers found" inside Factory.kafkaForStretchCluster).

Replace the direct GetCluster call with a 60s-budgeted poll on GetCluster per cluster, then run WaitForCacheSync on the engaged cluster. Behavior under healthy conditions is unchanged: engagement is essentially "start the cluster's cache" and typically completes in under a second.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
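The per-cluster poll described in this commit message could be sketched roughly as follows. This is a minimal illustration and not the code in this change: it does not import multicluster-runtime, and `lookupFunc`, `errClusterNotFound`, `waitForCluster`, and `demo` are hypothetical stand-ins for `mgr.GetCluster`, `ErrClusterNotFound`, and the real helper. The real change budgets 60s and then runs WaitForCacheSync on the engaged cluster, which this sketch omits.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errClusterNotFound is a hypothetical stand-in for the provider's
// ErrClusterNotFound.
var errClusterNotFound = errors.New("cluster not found")

// lookupFunc is a hypothetical stand-in for mgr.GetCluster: it fails fast
// with errClusterNotFound until the provider has engaged the named cluster.
type lookupFunc func(ctx context.Context, name string) error

// waitForCluster retries lookup every interval until it succeeds, returns an
// error other than errClusterNotFound, or ctx expires. It reports how many
// lookups were attempted.
func waitForCluster(ctx context.Context, lookup lookupFunc, name string, interval time.Duration) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for attempts := 1; ; attempts++ {
		err := lookup(ctx, name)
		if err == nil {
			return attempts, nil
		}
		if !errors.Is(err, errClusterNotFound) {
			// Unexpected failure: do not keep polling.
			return attempts, err
		}
		select {
		case <-ctx.Done():
			return attempts, fmt.Errorf("cluster %q not engaged: %w", name, ctx.Err())
		case <-ticker.C:
		}
	}
}

// demo simulates a provider that engages the cluster on the third lookup.
func demo() (int, error) {
	calls := 0
	lookup := func(ctx context.Context, name string) error {
		calls++
		if calls < 3 {
			return errClusterNotFound
		}
		return nil
	}
	// The commit uses a 60s budget; the demo uses 1s to stay quick.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return waitForCluster(ctx, lookup, "vc-1", 10*time.Millisecond)
}

func main() {
	attempts, err := demo()
	fmt.Println(attempts, err)
}
```

Treating only the not-found error as retryable keeps genuine failures visible immediately, while the context deadline bounds how long a cluster that never engages can stall the test.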
Adds a multicluster acceptance scenario that exercises the supported operator-upgrade path described in docs/stretch-cluster-operator-upgrade.md: install the published redpanda/operator chart from charts.redpanda.com on every vcluster, bring up a 3-broker StretchCluster, then helm-upgrade each vcluster one at a time to the local dev chart with localhost/redpanda-operator:dev.

The test contract:

- Before the upgrade, every vcluster's TransportService.Status RPC must report is_healthy=true with no unhealthy peers. The starting raft term is snapshotted per vcluster.
- After each per-vcluster helm upgrade, the operator Deployment is observed rolling (old pod gone, new pod Available, new UID), and the raft quorum on EVERY vcluster must recover with is_healthy=true AND a strictly advanced term vs the prior snapshot. The advanced term is the load-bearing signal: a steady is_healthy alone could mean the new pod is sitting on the sidelines while the existing quorum re-uses the old leader.
- vclusters are upgraded in first-to-last index order (vc-0, vc-1, vc-2). The runbook recommends this order so support tooling can give one concrete sequence rather than "any order works"; leader flapping that this ordering may cause is not on the data path and is acceptable.
- After all three vclusters are upgraded, a sentinel topic created on the v26.2.1-beta.1-managed cluster is re-read to prove no broker outage occurred during the operator rollout.

Supporting changes:

- pkg/vcluster/vcluster.go: Cluster.HelmUpgrade mirroring HelmInstall via action.NewUpgrade, so per-vcluster upgrades go through the same in-process helm machinery as the initial install.
- acceptance/steps/stretch.go: deployOperators refactored to take an operatorChartSource (Path or RemoteChart + optional image override). Existing call sites continue to use the local dev chart via a thin wrapper; the new step uses a pulled upstream chart.
- acceptance/steps/operator_chart.go: helm chart-pull helper backed by action.NewPullWithOpts, downloading and untarring the published chart into a temp dir so loader.Load can consume it.
- acceptance/steps/stretch_operator_upgrade.go: the three new step handlers plus the operator gRPC Status helper that dials each vcluster's operator pod on :9443 using mTLS material read from redpanda-operator-multicluster-certificates (same secret the bootstrap step writes and the existing rpk-k8s raft check reads).

The scenario is tagged @multicluster @dev-env so it runs under the existing multicluster acceptance harness and is included in the dev-env setup-only flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
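The post-upgrade rule in the test contract (is_healthy=true AND a strictly advanced term on every vcluster) can be illustrated with a small sketch. It is an assumption-laden illustration and not the step handler from this change: `status` is a hypothetical, trimmed-down view of a Status reply, and `notRecovered` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// status is a hypothetical, trimmed-down view of one vcluster's Status reply.
type status struct {
	Healthy bool
	Term    uint64
}

// notRecovered returns the sorted names of vclusters that fail the
// post-upgrade check: each must be healthy AND report a term strictly
// greater than the term snapshotted before the upgrade.
func notRecovered(before map[string]uint64, after map[string]status) []string {
	var failed []string
	for name, prevTerm := range before {
		s, ok := after[name]
		if !ok || !s.Healthy || s.Term <= prevTerm {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	return failed
}

func main() {
	before := map[string]uint64{"vc-0": 4, "vc-1": 4, "vc-2": 4}
	after := map[string]status{
		"vc-0": {Healthy: true, Term: 5},
		"vc-1": {Healthy: true, Term: 4}, // healthy, but the term did not advance
		"vc-2": {Healthy: true, Term: 5},
	}
	fmt.Println(notRecovered(before, after)) // prints [vc-1]
}
```

Here vc-1 is flagged even though it reports healthy, which is the case the contract guards against: a health flag that stays true while the upgraded pod never takes part in a new election.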
dc7aeeb to a913aea
K8S-848