Acceptance: stretch-cluster multicluster operator upgrade test#1533

Open
RafalKorepta wants to merge 2 commits into main from rk/k8s-848/upgrade-test


@RafalKorepta RafalKorepta commented May 19, 2026

Adds a multicluster acceptance scenario that exercises the supported
operator-upgrade path described in
docs/stretch-cluster-operator-upgrade.md: install the published
redpanda/operator chart from charts.redpanda.com on every vcluster,
bring up a 3-broker StretchCluster, then helm-upgrade each vcluster
one at a time to the local dev chart with localhost/redpanda-operator:dev.

The test contract:

  • Before the upgrade, every vcluster's TransportService.Status RPC
      must report is_healthy=true with no unhealthy peers. The starting
      raft term is snapshotted per vcluster.
  • After each per-vcluster helm upgrade, the operator Deployment is
      observed rolling (old pod gone, new pod Available, new UID), and
      the raft quorum on EVERY vcluster must recover with is_healthy=true
      AND a strictly advanced term vs the prior snapshot. The advanced
      term is the load-bearing signal — a steady is_healthy alone could
      mean the new pod is sitting on the sidelines while the existing
      quorum re-uses the old leader.
  • vclusters are upgraded in first-to-last index order (vc-0, vc-1,
      vc-2). The runbook recommends this order so support tooling can
      give one concrete sequence rather than "any order works"; leader
      flapping that this ordering may cause is not on the data path
      and is acceptable.
  • After all three vclusters are upgraded, a sentinel topic created
      on the v26.2.1-beta.1-managed cluster is re-read to prove no
      broker outage occurred during the operator rollout.
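The health-and-term contract above can be sketched in Go roughly as follows. This is a minimal illustration, not the PR's actual step code: the `raftStatus` struct, its field names, and the `checkBaseline`, `checkRecovered`, and `checkAll` helpers are all hypothetical stand-ins for whatever the TransportService.Status response and the step handlers really look like.

```go
package main

import (
	"errors"
	"fmt"
)

// raftStatus is a hypothetical, trimmed-down view of a Status RPC response.
type raftStatus struct {
	IsHealthy      bool
	UnhealthyPeers []string
	Term           uint64
}

// checkBaseline enforces the pre-upgrade contract: healthy, no unhealthy peers.
func checkBaseline(name string, s raftStatus) error {
	if !s.IsHealthy {
		return fmt.Errorf("%s: is_healthy=false before upgrade", name)
	}
	if len(s.UnhealthyPeers) > 0 {
		return fmt.Errorf("%s: unhealthy peers before upgrade: %v", name, s.UnhealthyPeers)
	}
	return nil
}

// checkRecovered enforces the post-upgrade contract: healthy AND a term that
// is strictly greater than the snapshot taken before this upgrade step.
func checkRecovered(name string, prevTerm uint64, s raftStatus) error {
	if !s.IsHealthy {
		return fmt.Errorf("%s: is_healthy=false after upgrade", name)
	}
	if s.Term <= prevTerm {
		return fmt.Errorf("%s: term %d did not advance past %d", name, s.Term, prevTerm)
	}
	return nil
}

// checkAll applies checkRecovered to every vcluster in the snapshot and
// joins the failures, so one stale vcluster does not hide another.
func checkAll(snapshot map[string]uint64, current map[string]raftStatus) error {
	var errs []error
	for name, prev := range snapshot {
		cur, ok := current[name]
		if !ok {
			errs = append(errs, fmt.Errorf("%s: no status reported", name))
			continue
		}
		if err := checkRecovered(name, prev, cur); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

func main() {
	snapshot := map[string]uint64{"vc-0": 4, "vc-1": 4, "vc-2": 4}
	// vc-2 is healthy but still reports the old term, so it must fail.
	current := map[string]raftStatus{
		"vc-0": {IsHealthy: true, Term: 5},
		"vc-1": {IsHealthy: true, Term: 5},
		"vc-2": {IsHealthy: true, Term: 4},
	}
	fmt.Println(checkAll(snapshot, current))
}
```

The strict `<=` comparison is the point of the sketch: a vcluster that stays healthy on an unchanged term is treated as a failure, matching the description's concern that is_healthy alone cannot show the new pod actually took part in an election.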

Supporting changes:

  • pkg/vcluster/vcluster.go: Cluster.HelmUpgrade mirroring HelmInstall
      via action.NewUpgrade, so per-vcluster upgrades go through the same
      in-process helm machinery as the initial install.
  • acceptance/steps/stretch.go: deployOperators refactored to take an
      operatorChartSource (Path or RemoteChart + optional image override).
      Existing call sites continue to use the local dev chart via a thin
      wrapper; the new step uses a pulled upstream chart.
  • acceptance/steps/operator_chart.go: helm chart-pull helper backed by
      action.NewPullWithOpts, downloading and untarring the published
      chart into a temp dir so loader.Load can consume it.
  • acceptance/steps/stretch_operator_upgrade.go: the three new step
      handlers plus the operator gRPC Status helper that dials each
      vcluster's operator pod on :9443 using mTLS material read from
      redpanda-operator-multicluster-certificates (same secret the
      bootstrap step writes and the existing rpk-k8s raft check reads).
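As a rough illustration of the "Path or RemoteChart + optional image override" shape described for operatorChartSource, the sketch below models it as a struct with a validation method. Only the type name and the Path/RemoteChart distinction come from the description; the field layout, the `remoteChart` type, the `validate` method, the `localDevChart` wrapper, and the example chart path are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// remoteChart is a hypothetical description of a published chart to pull.
type remoteChart struct {
	Repo    string
	Name    string
	Version string
}

// operatorChartSource selects where the operator chart comes from.
// Exactly one of Path or RemoteChart is expected to be set (assumed rule).
type operatorChartSource struct {
	Path          string       // local chart directory
	RemoteChart   *remoteChart // published chart to download
	ImageOverride string       // optional; empty means use the chart default
}

// validate rejects sources that set both or neither chart location.
func (s operatorChartSource) validate() error {
	switch {
	case s.Path != "" && s.RemoteChart != nil:
		return errors.New("operatorChartSource: set Path or RemoteChart, not both")
	case s.Path == "" && s.RemoteChart == nil:
		return errors.New("operatorChartSource: one of Path or RemoteChart is required")
	}
	return nil
}

// localDevChart mimics the thin wrapper that existing call sites could use
// to keep deploying the local dev chart with the dev image.
func localDevChart(path string) operatorChartSource {
	return operatorChartSource{
		Path:          path,
		ImageOverride: "localhost/redpanda-operator:dev",
	}
}

func main() {
	dev := localDevChart("charts/operator") // hypothetical path
	published := operatorChartSource{
		RemoteChart: &remoteChart{
			Repo:    "https://charts.redpanda.com",
			Name:    "operator",
			Version: "v26.2.1-beta.1",
		},
	}
	fmt.Println(dev.validate(), published.validate())
}
```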

The scenario is tagged @multicluster @dev-env so it runs under the
existing multicluster acceptance harness and is included in the
dev-env setup-only flow.

K8S-848

setupMulticlusterManager called mgr.Elected() and then immediately
iterated mgr.GetClusterNames() to call mgr.GetCluster(ctx, name) for
cache-sync. Elected() only signals leader election, not provider
readiness: the multicluster-runtime clusters.Provider engages each
non-local cluster from a manager runnable started by mgr.Start(ctx),
and that engagement loop can still be in progress after Elected fires.
Until engagement completes for a given name, mgr.GetCluster returns
ErrClusterNotFound immediately rather than blocking, so the previous
code raced the engagement loop. It could fail with t.Fatalf or, worse,
appear to succeed for one cluster while silently leaving the others
absent from downstream factory.GetClient lookups (surfaced as
"no Kafka brokers found" inside Factory.kafkaForStretchCluster).

Replace the direct GetCluster call with a 60s-budgeted poll on
GetCluster per cluster, then run WaitForCacheSync on the engaged
cluster. Behavior under healthy conditions is unchanged — engagement
is essentially "start the cluster's cache" and typically completes in
under a second.
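The budgeted poll described above might look roughly like the sketch below. It uses only the standard library and a fake lookup function, so `errClusterNotFound`, `getClusterFunc`, `waitForEngagement`, and `runDemo` are hypothetical stand-ins rather than the multicluster-runtime API, and the choice to retry only on the not-found error is an assumption about the fix, not something the commit message states.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errClusterNotFound stands in for the provider's ErrClusterNotFound.
var errClusterNotFound = errors.New("cluster not found")

// getClusterFunc is a hypothetical stand-in for mgr.GetCluster; the string
// result represents the engaged cluster handle.
type getClusterFunc func(ctx context.Context, name string) (string, error)

// waitForEngagement polls get until the cluster is engaged, the budget is
// exhausted, or get returns an error other than "not found".
func waitForEngagement(ctx context.Context, name string, budget, interval time.Duration, get getClusterFunc) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		cl, err := get(ctx, name)
		if err == nil {
			return cl, nil
		}
		if !errors.Is(err, errClusterNotFound) {
			return "", err // unexpected failure: do not keep retrying
		}
		select {
		case <-ctx.Done():
			return "", fmt.Errorf("cluster %q not engaged within %s: %w", name, budget, ctx.Err())
		case <-ticker.C:
			// Try again on the next tick.
		}
	}
}

// runDemo simulates a cluster that becomes engaged on the third lookup and
// reports how many lookups were needed.
func runDemo() (int, error) {
	attempts := 0
	get := func(ctx context.Context, name string) (string, error) {
		attempts++
		if attempts < 3 {
			return "", errClusterNotFound
		}
		return "engaged:" + name, nil
	}
	_, err := waitForEngagement(context.Background(), "vc-1", time.Second, 5*time.Millisecond, get)
	return attempts, err
}

func main() {
	attempts, err := runDemo()
	fmt.Println(attempts, err)
}
```

In the real setup the budget would be the 60s mentioned above, and WaitForCacheSync would run on the returned cluster once the poll succeeds.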

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RafalKorepta RafalKorepta force-pushed the rk/k8s-848/upgrade-test branch from dc7aeeb to a913aea Compare May 20, 2026 13:11