Clean up StretchCluster & NodePool specs#1529
Conversation
…odePool
Per-K8s-cluster configuration (TLS, External, Listeners, RBAC, ServiceAccount,
Monitoring, ClusterDomain) moves entirely from StretchClusterSpec to
EmbeddedNodePoolSpec. Defaultable fields (Storage, Resources, Service,
ImagePullSecrets) stay on StretchClusterSpec as cluster-wide defaults and gain
an optional NodePool override (top-level replace, no deep merge).
NodePool drops its `np` shortname (collides with NetworkPolicy).
Defaulting splits into two stages, run by NewRenderState:
1. cluster.Spec.MergeDefaults() — cluster-wide fields.
2. pool.Spec.MergeDefaultsFrom(&cluster.Spec) — per-pool: inherits the
defaultable group when unset, then populates the moved fields.
Helpers that read moved fields move from StretchClusterSpec to
EmbeddedNodePoolSpec (IsAdminTLSEnabled, port readers, GetClusterDomain,
GetServiceAccountName, InternalDomain, InUse*Certs, GetResourceRequirements,
GetRedpandaStartFlags, GetOverProvisionValue, GetEnableMemoryLocking, tiered
storage helpers, GetStorageMinFreeBytes, AdminInternalURL, AdminAPIURLs).
Config/audit/metrics readers stay on StretchClusterSpec.
RenderState exposes the representative local pool via state.PoolSpec(); all
multicluster renderers read per-K8s-cluster values through it instead of
state.Spec(). state.Spec() retains cluster-wide reads (Auth, Networking,
CommonLabels, Image, Tuning, Logging, AuditLogging, Config, RackAwareness,
Enterprise).
Factory.defaultedPoolSpec fetches a NodePool when building admin/Kafka/Schema
Registry clients so listener/TLS/port lookups happen on the merged pool.
syncCA picks a representative NodePool for BootstrappedCertNames.
rapidutil gains a *runtime.RawExtension generator so fuzz tests can traverse
the now-larger NodePool type tree (Monitoring.TLSConfig wraps a runtime.Object
interface that rapid cannot synthesize).
Generated files (CRDs, deepcopy, applyconfiguration) and golden files are
regenerated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two regressions surfaced by the multicluster smoke test: 1. PoolSpec() returned nil when there were no in-cluster pools yet (the reconciler fires once before the user has created NodePools). Direct field reads like state.PoolSpec().TLS then panicked. Returns an empty spec instead — helpers already handle the nil receivers, and resources that depend on per-pool config (certs, listeners) self-skip. 2. Acceptance feature files still applied TLS/External/RBAC/Listeners on StretchCluster.spec, which is rejected by the new CRD schema. Moved those fields to the NodePool manifests in the same scenarios. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every resource that reads a moved-to-NodePool field (TLS, Listeners,
ClusterDomain, External, RBAC, ServiceAccount, Monitoring, plus the
defaultable overrides Storage/Resources/ImagePullSecrets) now renders per
local NodePool — picking a representative pool was wrong because each pool
may legitimately differ.
API:
- PoolSpec() becomes PoolSpec(pool) — every read is explicit about which
pool's view it wants.
- Drop the Service field from both StretchClusterSpec and EmbeddedNodePoolSpec
along with GetServiceName. The two usages were a vestigial headless-service
rename (now driven by the cluster's name) and a per-pod-service annotation
default already covered by NodePoolServices.PerPod.{Local,Remote}.Annotations.
- InternalDomain / AdminInternalURL / AdminAPIURLs move to *RenderState
(combine cluster-wide service name with per-pool ClusterDomain).
- RenderState.TLSConfig(pool, certName): take a pool so secret lookup uses
the per-pool cert names.
- representativePool() kept only for the headless ClusterIP Service, whose
listener port list is cluster-wide.
Resource layout (each pool's resource name = {cluster}-{pool}):
- Bootstrap / RPK / Pandaproxy / Schema Registry ConfigMaps → per pool.
- Redpanda ConfigMap → already per pool, now reads its own pool's spec.
- ServiceAccount, Role, ClusterRole, RoleBinding, ClusterRoleBinding → per pool.
- ServiceMonitor → per pool, selector scoped via labelComponentKey.
- Certificates (cert-manager) → per pool, cert SANs include the per-pool
StatefulSet hostname pattern; Issuers stay cluster-wide (one per cert name)
because they reference the synced CA Secret shared across the StretchCluster.
- NodePort + LoadBalancer external Services → per pool, selector scoped via
labelComponentKey.
- Lifecycle Scripts Secret → per pool (admin URL/curl flags differ per pool).
- Configurator and FS-Validator Secrets → already per pool.
- Headless ClusterIP Service → stays cluster-wide, named {fullname}; listener
ports taken from a representative pool (assumed identical within a K8s
cluster).
- CA root Secret, bootstrap user Secret → stay cluster-wide.
syncCA in the multicluster controller now collects bootstrappable cert names
as the union across every NodePool.
Tests:
- New per-pool-variation-test case in stretch-cluster-cases.txtar with two
pools whose TLS / RBAC / ServiceAccount / Monitoring / Storage / Resources
differ; goldens regenerated. Confirms each per-pool resource reflects its
own pool's spec rather than leaking from a representative.
- Existing GetServiceName test removed (helper deleted).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ance assertion
Two bugs from the per-pool fan-out refactor surfaced when the smoke test
tried to use a partial Storage override on the NodePool:
- statefulset_init.go init-container VolumeMounts for the configurator and
fs-validator still referenced the old fullname-based volume names while
the corresponding Volumes had already been renamed to the per-pool
poolFullname form. The StatefulSet was rejected with `volumeMounts[5].
name: Not found "cluster-configurator"`. Mounts now use poolFullname too.
- inheritFromCluster does whole-replace, so a NodePool that set only
storage.persistentVolume.labels (no size) ended up with PersistentVolume
populated but Size nil, and the StatefulSet was rejected with
`volumeClaimTemplates[0].spec.resources[storage]: Required value`.
Extracted the existing mergeDefaultStorage / mergeDefaultResources bodies
into free funcs (fillStretchStorageDefaults / fillStretchResourcesDefaults),
and MergeDefaultsFrom now applies them at the pool level after inheriting
from the cluster — partial overrides pick up the unset sub-fields.
Acceptance:
- nodepoolManifest now substitutes ${POOL_NAME} into the DocString body
before applying, so a single feature step can express per-region pool
variation.
- New step: `in "<mc>" region "<vc-N>" the PVC "<name>" in namespace "<ns>"
should have label "<key>" with value "<value>"` (checkPVCLabelInRegion).
- stretch-cluster-basics.feature: NodePool now sets
storage.persistentVolume.labels.redpanda.com/test-pool-marker: ${POOL_NAME}
and asserts each region's datadir PVC carries its own pool's value. If
the operator ever leaks a representative pool's spec into other pools,
the wrong PVC would carry the wrong value and the assertion would fail.
PVC renderer change:
- volumeClaimTemplateDatadir now propagates pv.Labels and pv.Annotations
to the PVC template (previously ignored).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the v1alpha2 CRD surface changes: - Fields removed from StretchCluster.spec (clusterDomain, tls, external, listeners, rbac, serviceAccount, monitoring, service). - Fields added to NodePool.spec, with the defaultable ones (storage, resources, imagePullSecrets) inheriting from the cluster when unset. - NodePool CRD's `np` short name dropped (collided with NetworkPolicy). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
loadBalancerServices used to iterate state.allPodNames(); after the per-pool fan-out it iterates state.inClusterPools and derives pod names per pool, so the helper is dead. golangci-lint flagged it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l fan-out
Two regressions surfaced once the operator started naming cert resources
per-pool and creating per-pool NodePort Services:
- Both tests applied a NodePool that left `external` at its default
(enabled, NodePort). They run in the same K8s cluster, so their pools
both tried to grab the default admin NodePort 31644 and the reconcile
failed before any cert resources were created. Disabled external on
both pools.
- The leaf-cert lookup computed the secret name with
`CertServerSecretName(scName, "default")`. Cert resources are now named
with the pool's full name (`{cluster}-{pool}-{certname}-cert`), so the
test was waiting on the pre-refactor name forever. Renamed the pool to
a plain `pool` (so poolFullname is `{scName}-pool`) and threaded that
into the secret-name computation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o per-pool spec The StretchCluster → StaticConfigurationSource converter (newly added on main as part of the Console multicluster registration) read TLS, listener ports, and ClusterDomain off StretchClusterSpec — all of which moved to NodePool.Spec in this PR. - ConvertStretchClusterToStaticConfig now takes a representative *EmbeddedNodePoolSpec alongside the StretchCluster. Per-K8s-cluster fields (TLS, listener ports, ClusterDomain) read off the pool spec; Auth/SASL still reads off the cluster. - console controller fetches NodePools for the referenced StretchCluster, picks the first one, applies cluster + pool defaults, and passes the merged pool spec to the converter. - The conversion test reorganized: each case now mutates either the cluster or the pool spec via a (sc, poolSpec) callback; the runner defaults both before invoking the converter. - The console multicluster integration test creates a NodePool with TLS disabled alongside the StretchCluster (instead of setting TLS on the cluster spec). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6d4b4fe to
417880a
Compare
…l converter signature change The previous commit changed ConvertStretchClusterToStaticConfig to require a representative NodePool and the console controller fails its reconcile if it can't find one. The unit test only created the StretchCluster, so the stretch-cluster-ref case hit `no NodePools found for StretchCluster test-ns/test-stretch`. Apply a minimal NodePool alongside the StretchCluster in the test setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picking an arbitrary NodePool on every reconcile risked rendering flapping between pools as new ones were added or sort order shifted. Persist the choice instead: - maybeSelectStretchSourcePool runs early in Reconcile. If `operator.redpanda.com/stretch-source-pool` is already set and the named pool still exists in the local K8s cluster, keep it. Otherwise pick the lexically smallest matching NodePool (deterministic tie-breaker), write the annotation via Apply, and use it. - clusterFragment now reads that annotation to look up the pinned pool by name rather than iterating + first-match selecting on each render. The annotation lives on the Console CR (operator-owned state). Because the controller's ctl is bound to the local K8s cluster where the Console CR lives, the candidate NodePool list is automatically scoped to that cluster too — matching the "same K8s cluster as the Console CR" intent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The StretchCluster and NodePool CRDs are only installed by the multicluster operator. In single-cluster mode the controller must not List NodePools or even attempt to dereference a ClusterRef of kind StretchCluster — there's nothing to read. - New Controller.Multicluster bool; SetupWithMulticlusterManager sets it true, SetupWithManager leaves it false. - maybeSelectStretchSourcePool short-circuits when Multicluster is false. - The render struct gets the same flag; clusterFragment returns a clear error if a StretchCluster ref shows up in single-cluster mode rather than 404'ing on the CRDs. - Unit test sets Multicluster=true since it exercises the stretch-cluster-ref case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The early-return in syncCA when no NodePools exist created a chicken-and-egg problem: a user applies a StretchCluster, syncCA short-circuits, then the user creates a NodePool and the CA still has to wait for the StretchCluster reconciler to fire again before any TLS cert can be issued. Pre-refactor this didn't happen because TLS defaults lived on the cluster spec, so MergeDefaults produced a "default" + "external" cert set even before any NodePool existed. Restore that behaviour by treating "no NodePools" as "one empty placeholder pool" and letting MergeDefaultsFrom populate the same defaults a real NodePool would inherit. Any unusual cert names a NodePool later defines are picked up on the next reconcile. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CA bootstrapping in the multicluster reconciler only ever bootstraps two cert names — "default" and "external" — because those are the only certs the operator owns. Anything a user adds via NodePool.Spec.TLS.Certs is expected to bring its own CA (SecretRef or IssuerRef), which BootstrappedCertNames was already filtering out. Iterating NodePools to recompute that static pair on every reconcile was just plumbing and forced the placeholder-pool workaround for the no-NodePools-yet case. Hard-code the list instead. syncCA no longer depends on NodePools or BootstrappedCertNames, and the chicken-and-egg between StretchCluster creation and the first NodePool goes away. Renderer-side cert bookkeeping (cert_issuers.go, certs.go) continues to use BootstrappedCertNames per-pool, which is correct — there each pool's spec genuinely shapes the cert resources rendered for that pool. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "default" and "external" cert names — the operator's well-known CA-bootstrapped certs — appeared as bare string literals in several places: mergeDefaultTLS, the listener defaulting helpers, ConvertStretchClusterToStaticConfig, Factory.stretchClusterListenerTLSConfig, and the freshly-simplified syncCA. Expose them as exported constants on the v1alpha2 package and use them everywhere a cert name is meant. Cuts the typo risk and makes the rename refactor (if it ever happens) trivial. The "default" listener-map key inside mergeDefaultExternalListener stays a literal because it's a listener name, not a cert name — annotated as such. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| External *External `json:"external,omitempty"` | ||
|
|
||
| // Defines the log level settings. | ||
| Logging *StretchLogging `json:"logging,omitempty"` |
There was a problem hiding this comment.
Can logging be also added to NodePool.Spec? During indidents there is only subset of Redpandas that needs different Log Level to change for further investigation.
There was a problem hiding this comment.
RackAwareness should be removed from StretchCluster as each cloud provider define different NodeAnnotation.
| Logging *StretchLogging `json:"logging,omitempty"` | ||
|
|
||
| // Defines audit logging settings. | ||
| AuditLogging *AuditLogging `json:"auditLogging,omitempty"` |
There was a problem hiding this comment.
In AuditLogging property there is indirect dependency on now removed Listener.Kafka
// Kafka external listener name, note that it must have
authenticationMethodset to sasl
Listener *stringjson:"listener,omitempty"
From the current code could you remove Listener from CRD and Redpanda helm values as it is not used?
| - -c | ||
| - curl --silent --fail -k -m 5 --cacert /etc/tls/certs/default/ca.crt | ||
| "https://${SERVICE_NAME}.custom-cluster-domain.custom-cluster-domain.svc.custom.local:9644/v1/status/ready" | ||
| "https://${SERVICE_NAME}.custom-cluster-domain.custom-cluster-domain.svc.cluster.local.:9644/v1/status/ready" |
There was a problem hiding this comment.
As this changed could you adjust
redpanda-operator/operator/multicluster/testdata/render-cases.txtar
Lines 869 to 870 in 001e9a9
| - containerPort: 9644 | ||
| name: admin | ||
| - containerPort: 31644 | ||
| - containerPort: 9645 |
There was a problem hiding this comment.
This test case should be adjusted too
redpanda-operator/operator/multicluster/testdata/render-cases.txtar
Lines 281 to 305 in 001e9a9
| - containerPort: 9644 | ||
| name: admin | ||
| - containerPort: 30644 | ||
| - containerPort: 9645 |
There was a problem hiding this comment.
And this too
redpanda-operator/operator/multicluster/testdata/render-cases.txtar
Lines 367 to 416 in 001e9a9
| - containerPort: 9644 | ||
| name: admin | ||
| - containerPort: 30644 | ||
| - containerPort: 9645 |
There was a problem hiding this comment.
and this too
redpanda-operator/operator/multicluster/testdata/render-cases.txtar
Lines 989 to 1044 in 001e9a9
| - -c | ||
| - curl --silent --fail -k -m 5 --cacert /etc/tls/certs/default-client/ca.crt | ||
| --cert /etc/tls/certs/default-client/tls.crt --key /etc/tls/certs/default-client/tls.key | ||
| - curl --silent --fail -k -m 5 --cacert /etc/tls/certs/default/ca.crt |
There was a problem hiding this comment.
and this too
redpanda-operator/operator/multicluster/testdata/render-cases.txtar
Lines 486 to 517 in 001e9a9
| Name: state.Spec().GetServiceName(state.fullname()), | ||
| Namespace: state.namespace, | ||
| Labels: labels, | ||
| Annotations: annotations, |
There was a problem hiding this comment.
I wonder if we should add spec.service.internal.annotation map to the NodePool. It is regresion, but I don't remember what is the use case for our internal service even in none multicluster setup. Probably some kind annotation for other operators/controllers.
| // TODO: consider a special field for per pod service annotation, either in nodepool or stretchcluster spec. | ||
| annotations = spec.Service.Internal.Annotations | ||
| } | ||
| annotations := map[string]string{} |
There was a problem hiding this comment.
NIT: Maybe inline this with Service declaration?
RafalKorepta
left a comment
There was a problem hiding this comment.
LGTM, my comments could be addressed in follow up.
…nalAccessReady Per-pool external NodePort Services all default to the same advertised ports (admin 31644 / kafka 31092 / http 30082 / schema-registry 30081). Two local NodePools in the same Kubernetes cluster therefore both request the same nodePort numbers, and the API server rejects the second Service with `provided port is already allocated`. Resolve collisions deterministically in the renderer: the lexically first pool wins each nodePort and any pool that would clash has its Service skipped. The NodePoolReconciler independently runs the same detection (via DetectExternalNodePortConflicts) and writes a new ExternalAccessReady condition on each pool — Available, Disabled, or NodePortConflict with a message naming the winning pool and the colliding ports — so users see why their external Service is missing. ExternalAccessReady is added to the Stable rollup. Goldens for `multi-pool` and `full-featured` lose the duplicate Service that the API server would have rejected; that was the regression Codex flagged in the PR review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
syncCA used to hardcode {default, external}, but certIssuers renders
Issuers for the union of BootstrappedCertNames across all NodePools.
The two diverged in both directions:
* A NodePool that referenced a custom operator-managed cert name (no
SecretRef, no IssuerRef) got an Issuer pointing at a `--root-certificate`
Secret that syncCA never created — cert-manager would stall waiting on
the missing CA.
* A NodePool that switched the default or external cert to IssuerRef
still got the operator's unused CA private-key Secret created in every
member cluster.
Add ManagedCertNamesForBootstrap that runs the same defaulting as the
renderer and returns the union of BootstrappedCertNames over all pools,
with a {default, external} fallback for the no-pools bootstrap window.
syncCA now calls it instead of hardcoding the list, keeping CA Secrets
and Issuers strictly in sync.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stretchListenerTLS passed sc.Name as the fullname argument to TLS.CertificatesFor, which is correct for shared root-CA Secrets (scoped to the cluster name) but wrong for IssuerRef-backed certs. cert-manager creates the leaf Secret from a Certificate the operator renders against the per-pool fullname `<cluster>-<pool>`, so the actual Secret name is `<cluster>-<pool>-<cert>-cert`. Console pointed at `<cluster>-<cert>-cert` and TLS broke. Take the NodePool object in ConvertStretchClusterToStaticConfig and compute its fullname inside the converter. Switch fullname to the pool variant when the cert uses IssuerRef; leave the shared CA path on the cluster name. The console controller now just passes the pinned pool through and the converter handles defaulting itself, which trims duplicate setup at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Console reads per-K8s-cluster TLS, listener ports, and clusterDomain from the pinned source NodePool, but the multicluster Console controller only watched StretchCluster. Updates to that pool (or its deletion) never re-enqueued dependent Consoles, so rendered configs could stay stale indefinitely until some other StretchCluster or Console event happened to fire. Add a NodePool watch that translates each NodePool event to a lookup against the existing `console_stretch` index — the same index the StretchCluster watch already uses — so the same set of Consoles gets enqueued. The pinned-pool repair logic in maybeSelectStretchSourcePool already re-picks when the previously-annotated pool is gone; it just needed a reconcile trigger to run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
During an incident, operators want to crank up the log level on one region's brokers without touching the rest of the cluster. Add Logging *StretchLogging to NodePool.spec.embeddedNodePoolSpec as an optional per-pool override that inherits StretchCluster.Spec.Logging when unset, mirroring the existing Storage/Resources/ImagePullSecrets override pattern. The rpk configmap renderer reads from the defaulted pool spec, so the per-pool value (or the inherited cluster value) drives --default-log-level for each StatefulSet. Addresses Rafał's review comment on PR #1529. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RackAwareness.NodeAnnotation identifies the K8s node label that encodes a node's rack/zone. The right key varies per cloud provider (EKS / GKE / AKS use the standard topology.kubernetes.io/zone but on-prem and some custom deployments use other annotations), and the StretchCluster topology already crosses cloud boundaries. A cluster-wide field can't accommodate that. Move RackAwareness *RackAwareness from StretchClusterSpec to EmbeddedNodePoolSpec. Each pool picks the annotation that matches its own K8s cluster's nodes. The renderer (rack-awareness ClusterRole, configurator init-container volume mount, bootstrap configmap, configurator script) now reads RackAwareness off the pool spec. The render-cases `rack-awareness` and `full-featured` fixtures move the field onto their NodePool entries so the goldens continue to exercise the rack-aware code paths. Addresses Rafał's review comment on PR #1529. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AuditLogging.Listener referenced a Kafka external listener by name. The Listeners block moved off StretchCluster.spec onto each NodePool, so a cluster-scoped Listener pointer can no longer resolve to a single listener — and the field was never wired into the rendered audit configuration anyway. Fork a new StretchAuditLogging type for StretchCluster.spec.auditLogging without the Listener field. Redpanda.spec.auditLogging keeps the shared AuditLogging type unchanged, preserving the single-cluster operator's surface. Addresses Rafał's review comment on PR #1529. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The render-cases input fixtures still placed `tls`, `external`, `listeners`, `monitoring`, and `clusterDomain` on StretchCluster.spec. After the field move those keys no longer exist on the cluster CRD, so the YAML decoder dropped them silently and many cases ended up exercising only the renderer's defaults instead of the case's intended configuration. Migrate the moved fields onto each case's NodePool spec across all 13 affected cases (tls-self-signed, tls-mtls, sasl-scram512-with-tls, external-loadbalancer, external-nodeport, external-tls, tls-issuer-ref, tls-issuer-ref-mtls, monitoring, custom-cluster-domain, full-featured, and the rack-awareness / full-featured already-touched-in-prior-commits ones). Goldens regenerated — external NodePorts and the custom cluster domain now show up in the rendered output, restoring the coverage the cases were originally written for. Addresses Rafał's review comments on PR #1529 (he flagged five specifically; the underlying issue applied to all of them). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pod annotations
`StretchCluster.spec.service.internal.annotations` was dropped when the
`service` field was removed wholesale, but the headless ClusterIP
Service it annotated is cluster-wide (one per member K8s cluster), not
per pool. The renderer had no way to attach annotations consumed by
external-dns / service-mesh / other sibling operators.
Add `StretchCluster.spec.internalServiceAnnotations` as a cluster-wide
map; thread it onto the headless Service rendered by
`serviceInternal`. Annotations stay nil when the user didn't set any,
so apply-time comparisons keep working.
While here, inline the empty `annotations := map[string]string{}`
declaration in service_per_pod.go directly into the ObjectMeta
literal (Rafał nit).
Addresses Rafał's review comments on PR #1529.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Move per-K8s-cluster fields from
StretchCluster.spectoNodePool.specand rework the operator so each NodePool's resources are rendered from that pool's own spec.Why
Pre-PR every field on
StretchCluster.spechad to be identical across all member K8s clusters. That's wrong for things that are of a K8s cluster —clusterDomain,tls/listeners(resolved against per-cluster Secrets and Issuers),external(per-cloud LB/NodePort),rbac/serviceAccount(per-cluster K8s API),monitoring(per-cluster Prometheus). Pinning them on the federated CR meant either picking a "representative" pool silently (wrong for per-pool customization) or rejecting useful configurations.API surface changes (breaking)
Moved from
cluster.redpanda.com/v1alpha2 StretchCluster.spectoNodePool.spec(each pool sets its own):clusterDomain,tls,external,listeners,rbac,serviceAccount,monitoring.Removed entirely from both specs:
service. Its only two effects were a rarely-used rename of the headless ClusterIP Service (now always named after the cluster) and a per-pod-service annotation default (already covered byNodePoolServices.PerPod.{Local,Remote}.Annotations).Added on
NodePool.specas optional overrides:storage,resources,imagePullSecrets— inherit fromStretchCluster.specwhen unset.Other: dropped the
npshort name on the NodePool CRD (collided with NetworkPolicy).Rendering changes
Every resource that depends on a moved field now fans out per local NodePool, named
{cluster}-{pool}: ConfigMaps (redpanda, bootstrap, rpk, pandaproxy, schema-registry), ServiceAccount, Roles/Bindings, ServiceMonitor, Certificates, external NodePort/LB Services, lifecycle Secrets. The headless ClusterIP Service stays cluster-wide. Cert-manager Issuers stay one-per-cert (they reference the shared synced CA Secret).Notable behavior
syncCA: only bootstraps the well-knowndefaultandexternalcert names (the operator's contract). No longer waits for a NodePool to exist before generating the CA.Tests
stretch-cluster-cases.txtarasserting each pool's resources reflect its own spec.stretch-cluster-basicsnow sets${POOL_NAME}into a PVC label and asserts each region's PVC carries its own pool's value — a fan-out bug would show up immediately.