Clean up StretchCluster & NodePool specs by hidalgopl · Pull Request #1529 · redpanda-data/redpanda-operator

hidalgopl · 2026-05-15T13:20:47Z

What

Move per-K8s-cluster fields from StretchCluster.spec to NodePool.spec and rework the operator so each NodePool's resources are rendered from that pool's own spec.

Why

Pre-PR every field on StretchCluster.spec had to be identical across all member K8s clusters. That's wrong for things that are of a K8s cluster — clusterDomain, tls/listeners (resolved against per-cluster Secrets and Issuers), external (per-cloud LB/NodePort), rbac/serviceAccount (per-cluster K8s API), monitoring (per-cluster Prometheus). Pinning them on the federated CR meant either picking a "representative" pool silently (wrong for per-pool customization) or rejecting useful configurations.

API surface changes (breaking)

Moved from cluster.redpanda.com/v1alpha2 StretchCluster.spec to NodePool.spec (each pool sets its own): clusterDomain, tls, external, listeners, rbac, serviceAccount, monitoring.

Removed entirely from both specs: service. Its only two effects were a rarely-used rename of the headless ClusterIP Service (now always named after the cluster) and a per-pod-service annotation default (already covered by NodePoolServices.PerPod.{Local,Remote}.Annotations).

Added on NodePool.spec as optional overrides: storage, resources, imagePullSecrets — inherit from StretchCluster.spec when unset.

Other: dropped the np short name on the NodePool CRD (collided with NetworkPolicy).

Rendering changes

Every resource that depends on a moved field now fans out per local NodePool, named {cluster}-{pool}: ConfigMaps (redpanda, bootstrap, rpk, pandaproxy, schema-registry), ServiceAccount, Roles/Bindings, ServiceMonitor, Certificates, external NodePort/LB Services, lifecycle Secrets. The headless ClusterIP Service stays cluster-wide. Cert-manager Issuers stay one-per-cert (they reference the shared synced CA Secret).

Notable behavior

Console controller (multicluster mode): a StretchCluster ClusterRef requires the controller to pick one local NodePool as the source of per-cluster TLS/listener config. The choice is persisted via an annotation on the Console CR so renders don't flap when pools come and go. Skipped entirely in single-cluster mode.
syncCA: only bootstraps the well-known default and external cert names (the operator's contract). No longer waits for a NodePool to exist before generating the CA.

Tests

New per-pool variation case in stretch-cluster-cases.txtar asserting each pool's resources reflect its own spec.
Acceptance feature stretch-cluster-basics now sets ${POOL_NAME} into a PVC label and asserts each region's PVC carries its own pool's value — a fan-out bug would show up immediately.

…odePool Per-K8s-cluster configuration (TLS, External, Listeners, RBAC, ServiceAccount, Monitoring, ClusterDomain) moves entirely from StretchClusterSpec to EmbeddedNodePoolSpec. Defaultable fields (Storage, Resources, Service, ImagePullSecrets) stay on StretchClusterSpec as cluster-wide defaults and gain an optional NodePool override (top-level replace, no deep merge). NodePool drops its `np` shortname (collides with NetworkPolicy). Defaulting splits into two stages, run by NewRenderState: 1. cluster.Spec.MergeDefaults() — cluster-wide fields. 2. pool.Spec.MergeDefaultsFrom(&cluster.Spec) — per-pool: inherits the defaultable group when unset, then populates the moved fields. Helpers that read moved fields move from StretchClusterSpec to EmbeddedNodePoolSpec (IsAdminTLSEnabled, port readers, GetClusterDomain, GetServiceAccountName, InternalDomain, InUse*Certs, GetResourceRequirements, GetRedpandaStartFlags, GetOverProvisionValue, GetEnableMemoryLocking, tiered storage helpers, GetStorageMinFreeBytes, AdminInternalURL, AdminAPIURLs). Config/audit/metrics readers stay on StretchClusterSpec. RenderState exposes the representative local pool via state.PoolSpec(); all multicluster renderers read per-K8s-cluster values through it instead of state.Spec(). state.Spec() retains cluster-wide reads (Auth, Networking, CommonLabels, Image, Tuning, Logging, AuditLogging, Config, RackAwareness, Enterprise). Factory.defaultedPoolSpec fetches a NodePool when building admin/Kafka/Schema Registry clients so listener/TLS/port lookups happen on the merged pool. syncCA picks a representative NodePool for BootstrappedCertNames. rapidutil gains a *runtime.RawExtension generator so fuzz tests can traverse the now-larger NodePool type tree (Monitoring.TLSConfig wraps a runtime.Object interface that rapid cannot synthesize). Generated files (CRDs, deepcopy, applyconfiguration) and golden files are regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two regressions surfaced by the multicluster smoke test: 1. PoolSpec() returned nil when there were no in-cluster pools yet (the reconciler fires once before the user has created NodePools). Direct field reads like state.PoolSpec().TLS then panicked. Returns an empty spec instead — helpers already handle the nil receivers, and resources that depend on per-pool config (certs, listeners) self-skip. 2. Acceptance feature files still applied TLS/External/RBAC/Listeners on StretchCluster.spec, which is rejected by the new CRD schema. Moved those fields to the NodePool manifests in the same scenarios. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Every resource that reads a moved-to-NodePool field (TLS, Listeners, ClusterDomain, External, RBAC, ServiceAccount, Monitoring, plus the defaultable overrides Storage/Resources/ImagePullSecrets) now renders per local NodePool — picking a representative pool was wrong because each pool may legitimately differ. API: - PoolSpec() becomes PoolSpec(pool) — every read is explicit about which pool's view it wants. - Drop the Service field from both StretchClusterSpec and EmbeddedNodePoolSpec along with GetServiceName. The two usages were a vestigial headless-service rename (now driven by the cluster's name) and a per-pod-service annotation default already covered by NodePoolServices.PerPod.{Local,Remote}.Annotations. - InternalDomain / AdminInternalURL / AdminAPIURLs move to *RenderState (combine cluster-wide service name with per-pool ClusterDomain). - RenderState.TLSConfig(pool, certName): take a pool so secret lookup uses the per-pool cert names. - representativePool() kept only for the headless ClusterIP Service, whose listener port list is cluster-wide. Resource layout (each pool's resource name = {cluster}-{pool}): - Bootstrap / RPK / Pandaproxy / Schema Registry ConfigMaps → per pool. - Redpanda ConfigMap → already per pool, now reads its own pool's spec. - ServiceAccount, Role, ClusterRole, RoleBinding, ClusterRoleBinding → per pool. - ServiceMonitor → per pool, selector scoped via labelComponentKey. - Certificates (cert-manager) → per pool, cert SANs include the per-pool StatefulSet hostname pattern; Issuers stay cluster-wide (one per cert name) because they reference the synced CA Secret shared across the StretchCluster. - NodePort + LoadBalancer external Services → per pool, selector scoped via labelComponentKey. - Lifecycle Scripts Secret → per pool (admin URL/curl flags differ per pool). - Configurator and FS-Validator Secrets → already per pool. - Headless ClusterIP Service → stays cluster-wide, named {fullname}; listener ports taken from a representative pool (assumed identical within a K8s cluster). - CA root Secret, bootstrap user Secret → stay cluster-wide. syncCA in the multicluster controller now collects bootstrappable cert names as the union across every NodePool. Tests: - New per-pool-variation-test case in stretch-cluster-cases.txtar with two pools whose TLS / RBAC / ServiceAccount / Monitoring / Storage / Resources differ; goldens regenerated. Confirms each per-pool resource reflects its own pool's spec rather than leaking from a representative. - Existing GetServiceName test removed (helper deleted). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ance assertion Two bugs from the per-pool fan-out refactor surfaced when the smoke test tried to use a partial Storage override on the NodePool: - statefulset_init.go init-container VolumeMounts for the configurator and fs-validator still referenced the old fullname-based volume names while the corresponding Volumes had already been renamed to the per-pool poolFullname form. The StatefulSet was rejected with `volumeMounts[5]. name: Not found "cluster-configurator"`. Mounts now use poolFullname too. - inheritFromCluster does whole-replace, so a NodePool that set only storage.persistentVolume.labels (no size) ended up with PersistentVolume populated but Size nil, and the StatefulSet was rejected with `volumeClaimTemplates[0].spec.resources[storage]: Required value`. Extracted the existing mergeDefaultStorage / mergeDefaultResources bodies into free funcs (fillStretchStorageDefaults / fillStretchResourcesDefaults), and MergeDefaultsFrom now applies them at the pool level after inheriting from the cluster — partial overrides pick up the unset sub-fields. Acceptance: - nodepoolManifest now substitutes ${POOL_NAME} into the DocString body before applying, so a single feature step can express per-region pool variation. - New step: `in "<mc>" region "<vc-N>" the PVC "<name>" in namespace "<ns>" should have label "<key>" with value "<value>"` (checkPVCLabelInRegion). - stretch-cluster-basics.feature: NodePool now sets storage.persistentVolume.labels.redpanda.com/test-pool-marker: ${POOL_NAME} and asserts each region's datadir PVC carries its own pool's value. If the operator ever leaks a representative pool's spec into other pools, the wrong PVC would carry the wrong value and the assertion would fail. PVC renderer change: - volumeClaimTemplateDatadir now propagates pv.Labels and pv.Annotations to the PVC template (previously ignored). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Documents the v1alpha2 CRD surface changes: - Fields removed from StretchCluster.spec (clusterDomain, tls, external, listeners, rbac, serviceAccount, monitoring, service). - Fields added to NodePool.spec, with the defaultable ones (storage, resources, imagePullSecrets) inheriting from the cluster when unset. - NodePool CRD's `np` short name dropped (collided with NetworkPolicy). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

loadBalancerServices used to iterate state.allPodNames(); after the per-pool fan-out it iterates state.inClusterPools and derives pod names per pool, so the helper is dead. golangci-lint flagged it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l fan-out Two regressions surfaced once the operator started naming cert resources per-pool and creating per-pool NodePort Services: - Both tests applied a NodePool that left `external` at its default (enabled, NodePort). They run in the same K8s cluster, so their pools both tried to grab the default admin NodePort 31644 and the reconcile failed before any cert resources were created. Disabled external on both pools. - The leaf-cert lookup computed the secret name with `CertServerSecretName(scName, "default")`. Cert resources are now named with the pool's full name (`{cluster}-{pool}-{certname}-cert`), so the test was waiting on the pre-refactor name forever. Renamed the pool to a plain `pool` (so poolFullname is `{scName}-pool`) and threaded that into the secret-name computation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…o per-pool spec The StretchCluster → StaticConfigurationSource converter (newly added on main as part of the Console multicluster registration) read TLS, listener ports, and ClusterDomain off StretchClusterSpec — all of which moved to NodePool.Spec in this PR. - ConvertStretchClusterToStaticConfig now takes a representative *EmbeddedNodePoolSpec alongside the StretchCluster. Per-K8s-cluster fields (TLS, listener ports, ClusterDomain) read off the pool spec; Auth/SASL still reads off the cluster. - console controller fetches NodePools for the referenced StretchCluster, picks the first one, applies cluster + pool defaults, and passes the merged pool spec to the converter. - The conversion test reorganized: each case now mutates either the cluster or the pool spec via a (sc, poolSpec) callback; the runner defaults both before invoking the converter. - The console multicluster integration test creates a NodePool with TLS disabled alongside the StretchCluster (instead of setting TLS on the cluster spec). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l converter signature change The previous commit changed ConvertStretchClusterToStaticConfig to require a representative NodePool and the console controller fails its reconcile if it can't find one. The unit test only created the StretchCluster, so the stretch-cluster-ref case hit `no NodePools found for StretchCluster test-ns/test-stretch`. Apply a minimal NodePool alongside the StretchCluster in the test setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Picking an arbitrary NodePool on every reconcile risked rendering flapping between pools as new ones were added or sort order shifted. Persist the choice instead: - maybeSelectStretchSourcePool runs early in Reconcile. If `operator.redpanda.com/stretch-source-pool` is already set and the named pool still exists in the local K8s cluster, keep it. Otherwise pick the lexically smallest matching NodePool (deterministic tie-breaker), write the annotation via Apply, and use it. - clusterFragment now reads that annotation to look up the pinned pool by name rather than iterating + first-match selecting on each render. The annotation lives on the Console CR (operator-owned state). Because the controller's ctl is bound to the local K8s cluster where the Console CR lives, the candidate NodePool list is automatically scoped to that cluster too — matching the "same K8s cluster as the Console CR" intent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The StretchCluster and NodePool CRDs are only installed by the multicluster operator. In single-cluster mode the controller must not List NodePools or even attempt to dereference a ClusterRef of kind StretchCluster — there's nothing to read. - New Controller.Multicluster bool; SetupWithMulticlusterManager sets it true, SetupWithManager leaves it false. - maybeSelectStretchSourcePool short-circuits when Multicluster is false. - The render struct gets the same flag; clusterFragment returns a clear error if a StretchCluster ref shows up in single-cluster mode rather than 404'ing on the CRDs. - Unit test sets Multicluster=true since it exercises the stretch-cluster-ref case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The early-return in syncCA when no NodePools exist created a chicken-and-egg problem: a user applies a StretchCluster, syncCA short-circuits, then the user creates a NodePool and the CA still has to wait for the StretchCluster reconciler to fire again before any TLS cert can be issued. Pre-refactor this didn't happen because TLS defaults lived on the cluster spec, so MergeDefaults produced a "default" + "external" cert set even before any NodePool existed. Restore that behaviour by treating "no NodePools" as "one empty placeholder pool" and letting MergeDefaultsFrom populate the same defaults a real NodePool would inherit. Any unusual cert names a NodePool later defines are picked up on the next reconcile. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CA bootstrapping in the multicluster reconciler only ever bootstraps two cert names — "default" and "external" — because those are the only certs the operator owns. Anything a user adds via NodePool.Spec.TLS.Certs is expected to bring its own CA (SecretRef or IssuerRef), which BootstrappedCertNames was already filtering out. Iterating NodePools to recompute that static pair on every reconcile was just plumbing and forced the placeholder-pool workaround for the no-NodePools-yet case. Hard-code the list instead. syncCA no longer depends on NodePools or BootstrappedCertNames, and the chicken-and-egg between StretchCluster creation and the first NodePool goes away. Renderer-side cert bookkeeping (cert_issuers.go, certs.go) continues to use BootstrappedCertNames per-pool, which is correct — there each pool's spec genuinely shapes the cert resources rendered for that pool. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The "default" and "external" cert names — the operator's well-known CA-bootstrapped certs — appeared as bare string literals in several places: mergeDefaultTLS, the listener defaulting helpers, ConvertStretchClusterToStaticConfig, Factory.stretchClusterListenerTLSConfig, and the freshly-simplified syncCA. Expose them as exported constants on the v1alpha2 package and use them everywhere a cert name is meant. Cuts the typo risk and makes the rename refactor (if it ever happens) trivial. The "default" listener-map key inside mergeDefaultExternalListener stays a literal because it's a listener name, not a cert name — annotated as such. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

RafalKorepta · 2026-05-20T07:21:00Z

-	External *External `json:"external,omitempty"`
-
 	// Defines the log level settings.
 	Logging *StretchLogging `json:"logging,omitempty"`


Can logging be also added to NodePool.Spec? During indidents there is only subset of Redpandas that needs different Log Level to change for further investigation.

RafalKorepta · 2026-05-20T07:23:02Z

RackAwareness should be removed from StretchCluster as each cloud provider define different NodeAnnotation.

RafalKorepta · 2026-05-20T07:35:23Z

 	Logging *StretchLogging `json:"logging,omitempty"`

 	// Defines audit logging settings.
 	AuditLogging *AuditLogging `json:"auditLogging,omitempty"`


In AuditLogging property there is indirect dependency on now removed Listener.Kafka

// Kafka external listener name, note that it must have authenticationMethod set to sasl
Listener *string json:"listener,omitempty"

From the current code could you remove Listener from CRD and Redpanda helm values as it is not used?

RafalKorepta · 2026-05-20T07:41:21Z

              - -c
              - curl --silent --fail -k -m 5 --cacert /etc/tls/certs/default/ca.crt
-                "https://${SERVICE_NAME}.custom-cluster-domain.custom-cluster-domain.svc.custom.local:9644/v1/status/ready"
+                "https://${SERVICE_NAME}.custom-cluster-domain.custom-cluster-domain.svc.cluster.local.:9644/v1/status/ready"


As this changed could you adjust

redpanda-operator/operator/multicluster/testdata/render-cases.txtar

Lines 869 to 870 in 001e9a9

spec:

clusterDomain: custom.local

test case and maybe others as some properties moved from StretchCluster to NodePool?

RafalKorepta · 2026-05-20T07:45:07Z

          - containerPort: 9644
            name: admin
-          - containerPort: 31644
+          - containerPort: 9645


This test case should be adjusted too

redpanda-operator/operator/multicluster/testdata/render-cases.txtar

Lines 281 to 305 in 001e9a9

listeners:

kafka:

port: 9093

external:

default:

enabled: true

port: 30092

admin:

port: 9644

external:

default:

enabled: true

port: 30644

http:

port: 8082

external:

default:

enabled: true

port: 30082

schemaRegistry:

port: 8081

external:

default:

enabled: true

port: 30081

RafalKorepta · 2026-05-20T07:47:07Z

          - containerPort: 9644
            name: admin
-          - containerPort: 30644
+          - containerPort: 9645


And this too

redpanda-operator/operator/multicluster/testdata/render-cases.txtar

Lines 367 to 416 in 001e9a9

tls:

enabled: true

certs:

default:

caEnabled: true

listeners:

kafka:

port: 9093

tls:

cert: default

external:

default:

enabled: true

port: 30092

tls:

cert: default

admin:

port: 9644

tls:

cert: default

external:

default:

enabled: true

port: 30644

tls:

cert: default

http:

port: 8082

tls:

cert: default

external:

default:

enabled: true

port: 30082

tls:

cert: default

schemaRegistry:

port: 8081

tls:

cert: default

external:

default:

enabled: true

port: 30081

tls:

cert: default

rpc:

port: 33145

tls:

cert: default

RafalKorepta · 2026-05-20T07:48:23Z

          - containerPort: 9644
            name: admin
-          - containerPort: 30644
+          - containerPort: 9645


and this too

redpanda-operator/operator/multicluster/testdata/render-cases.txtar

Lines 989 to 1044 in 001e9a9

tls:

enabled: true

certs:

default:

caEnabled: true

external:

enabled: true

type: LoadBalancer

domain: prod.example.com

service:

enabled: true

listeners:

kafka:

port: 9093

tls:

cert: default

external:

default:

enabled: true

port: 30092

tls:

cert: default

admin:

port: 9644

tls:

cert: default

external:

default:

enabled: true

port: 30644

tls:

cert: default

http:

port: 8082

tls:

cert: default

external:

default:

enabled: true

port: 30082

tls:

cert: default

schemaRegistry:

port: 8081

tls:

cert: default

external:

default:

enabled: true

port: 30081

tls:

cert: default

rpc:

port: 33145

tls:

cert: default

RafalKorepta · 2026-05-20T07:50:06Z

              - -c
-              - curl --silent --fail -k -m 5 --cacert /etc/tls/certs/default-client/ca.crt
-                --cert /etc/tls/certs/default-client/tls.crt --key /etc/tls/certs/default-client/tls.key
+              - curl --silent --fail -k -m 5 --cacert /etc/tls/certs/default/ca.crt


and this too

redpanda-operator/operator/multicluster/testdata/render-cases.txtar

Lines 486 to 517 in 001e9a9

tls:

enabled: true

certs:

default:

caEnabled: true

issuerRef:

name: my-ca-issuer

kind: ClusterIssuer

group: cert-manager.io

listeners:

kafka:

port: 9093

tls:

cert: default

requireClientAuth: true

admin:

port: 9644

tls:

cert: default

requireClientAuth: true

http:

port: 8082

tls:

cert: default

schemaRegistry:

port: 8081

tls:

cert: default

rpc:

port: 33145

tls:

cert: default

RafalKorepta · 2026-05-20T08:14:17Z

-			Name:        state.Spec().GetServiceName(state.fullname()),
-			Namespace:   state.namespace,
-			Labels:      labels,
-			Annotations: annotations,


I wonder if we should add spec.service.internal.annotation map to the NodePool. It is regresion, but I don't remember what is the use case for our internal service even in none multicluster setup. Probably some kind annotation for other operators/controllers.

RafalKorepta · 2026-05-20T08:55:17Z

-		// TODO: consider a special field for per pod service annotation, either in nodepool or stretchcluster spec.
-		annotations = spec.Service.Internal.Annotations
-	}
+	annotations := map[string]string{}


NIT: Maybe inline this with Service declaration?

RafalKorepta

LGTM, my comments could be addressed in follow up.

…nalAccessReady Per-pool external NodePort Services all default to the same advertised ports (admin 31644 / kafka 31092 / http 30082 / schema-registry 30081). Two local NodePools in the same Kubernetes cluster therefore both request the same nodePort numbers, and the API server rejects the second Service with `provided port is already allocated`. Resolve collisions deterministically in the renderer: the lexically first pool wins each nodePort and any pool that would clash has its Service skipped. The NodePoolReconciler independently runs the same detection (via DetectExternalNodePortConflicts) and writes a new ExternalAccessReady condition on each pool — Available, Disabled, or NodePortConflict with a message naming the winning pool and the colliding ports — so users see why their external Service is missing. ExternalAccessReady is added to the Stable rollup. Goldens for `multi-pool` and `full-featured` lose the duplicate Service that the API server would have rejected; that was the regression Codex flagged in the PR review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

syncCA used to hardcode {default, external}, but certIssuers renders Issuers for the union of BootstrappedCertNames across all NodePools. The two diverged in both directions: * A NodePool that referenced a custom operator-managed cert name (no SecretRef, no IssuerRef) got an Issuer pointing at a `--root-certificate` Secret that syncCA never created — cert-manager would stall waiting on the missing CA. * A NodePool that switched the default or external cert to IssuerRef still got the operator's unused CA private-key Secret created in every member cluster. Add ManagedCertNamesForBootstrap that runs the same defaulting as the renderer and returns the union of BootstrappedCertNames over all pools, with a {default, external} fallback for the no-pools bootstrap window. syncCA now calls it instead of hardcoding the list, keeping CA Secrets and Issuers strictly in sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

stretchListenerTLS passed sc.Name as the fullname argument to TLS.CertificatesFor, which is correct for shared root-CA Secrets (scoped to the cluster name) but wrong for IssuerRef-backed certs. cert-manager creates the leaf Secret from a Certificate the operator renders against the per-pool fullname `<cluster>-<pool>`, so the actual Secret name is `<cluster>-<pool>-<cert>-cert`. Console pointed at `<cluster>-<cert>-cert` and TLS broke. Take the NodePool object in ConvertStretchClusterToStaticConfig and compute its fullname inside the converter. Switch fullname to the pool variant when the cert uses IssuerRef; leave the shared CA path on the cluster name. The console controller now just passes the pinned pool through and the converter handles defaulting itself, which trims duplicate setup at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Console reads per-K8s-cluster TLS, listener ports, and clusterDomain from the pinned source NodePool, but the multicluster Console controller only watched StretchCluster. Updates to that pool (or its deletion) never re-enqueued dependent Consoles, so rendered configs could stay stale indefinitely until some other StretchCluster or Console event happened to fire. Add a NodePool watch that translates each NodePool event to a lookup against the existing `console_stretch` index — the same index the StretchCluster watch already uses — so the same set of Consoles gets enqueued. The pinned-pool repair logic in maybeSelectStretchSourcePool already re-picks when the previously-annotated pool is gone; it just needed a reconcile trigger to run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

During an incident, operators want to crank up the log level on one region's brokers without touching the rest of the cluster. Add Logging *StretchLogging to NodePool.spec.embeddedNodePoolSpec as an optional per-pool override that inherits StretchCluster.Spec.Logging when unset, mirroring the existing Storage/Resources/ImagePullSecrets override pattern. The rpk configmap renderer reads from the defaulted pool spec, so the per-pool value (or the inherited cluster value) drives --default-log-level for each StatefulSet. Addresses Rafał's review comment on PR #1529. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

RackAwareness.NodeAnnotation identifies the K8s node label that encodes a node's rack/zone. The right key varies per cloud provider (EKS / GKE / AKS use the standard topology.kubernetes.io/zone but on-prem and some custom deployments use other annotations), and the StretchCluster topology already crosses cloud boundaries. A cluster-wide field can't accommodate that. Move RackAwareness *RackAwareness from StretchClusterSpec to EmbeddedNodePoolSpec. Each pool picks the annotation that matches its own K8s cluster's nodes. The renderer (rack-awareness ClusterRole, configurator init-container volume mount, bootstrap configmap, configurator script) now reads RackAwareness off the pool spec. The render-cases `rack-awareness` and `full-featured` fixtures move the field onto their NodePool entries so the goldens continue to exercise the rack-aware code paths. Addresses Rafał's review comment on PR #1529. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AuditLogging.Listener referenced a Kafka external listener by name. The Listeners block moved off StretchCluster.spec onto each NodePool, so a cluster-scoped Listener pointer can no longer resolve to a single listener — and the field was never wired into the rendered audit configuration anyway. Fork a new StretchAuditLogging type for StretchCluster.spec.auditLogging without the Listener field. Redpanda.spec.auditLogging keeps the shared AuditLogging type unchanged, preserving the single-cluster operator's surface. Addresses Rafał's review comment on PR #1529. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The render-cases input fixtures still placed `tls`, `external`, `listeners`, `monitoring`, and `clusterDomain` on StretchCluster.spec. After the field move those keys no longer exist on the cluster CRD, so the YAML decoder dropped them silently and many cases ended up exercising only the renderer's defaults instead of the case's intended configuration. Migrate the moved fields onto each case's NodePool spec across all 13 affected cases (tls-self-signed, tls-mtls, sasl-scram512-with-tls, external-loadbalancer, external-nodeport, external-tls, tls-issuer-ref, tls-issuer-ref-mtls, monitoring, custom-cluster-domain, full-featured, and the rack-awareness / full-featured already-touched-in-prior-commits ones). Goldens regenerated — external NodePorts and the custom cluster domain now show up in the rendered output, restoring the coverage the cases were originally written for. Addresses Rafał's review comments on PR #1529 (he flagged five specifically; the underlying issue applied to all of them). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pod annotations `StretchCluster.spec.service.internal.annotations` was dropped when the `service` field was removed wholesale, but the headless ClusterIP Service it annotated is cluster-wide (one per member K8s cluster), not per pool. The renderer had no way to attach annotations consumed by external-dns / service-mesh / other sibling operators. Add `StretchCluster.spec.internalServiceAnnotations` as a cluster-wide map; thread it onto the headless Service rendered by `serviceInternal`. Annotations stay nil when the user didn't set any, so apply-time comparisons keep working. While here, inline the empty `annotations := map[string]string{}` declaration in service_per_pod.go directly into the ObjectMeta literal (Rafał nit). Addresses Rafał's review comments on PR #1529. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hidalgopl requested review from RafalKorepta, andrewstucki, chrisseto and gene-redpanda as code owners May 15, 2026 13:20

hidalgopl marked this pull request as draft May 15, 2026 13:20

hidalgopl and others added 8 commits May 18, 2026 15:10

hidalgopl force-pushed the pb/move-fields-to-nodepool branch from 6d4b4fe to 417880a Compare May 18, 2026 13:17

hidalgopl and others added 6 commits May 18, 2026 15:36

hidalgopl marked this pull request as ready for review May 19, 2026 13:01

RafalKorepta reviewed May 20, 2026

View reviewed changes

RafalKorepta approved these changes May 20, 2026

View reviewed changes

hidalgopl and others added 7 commits May 20, 2026 16:05

hidalgopl and others added 2 commits May 20, 2026 17:13

hidalgopl marked this pull request as draft May 21, 2026 09:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up StretchCluster & NodePool specs#1529

Clean up StretchCluster & NodePool specs#1529
hidalgopl wants to merge 23 commits into
mainfrom
pb/move-fields-to-nodepool

hidalgopl commented May 15, 2026 •

edited

Loading

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta May 20, 2026

Uh oh!

RafalKorepta left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	listeners:
	kafka:
	port: 9093
	external:
	default:
	enabled: true
	port: 30092
	admin:
	port: 9644
	external:
	default:
	enabled: true
	port: 30644
	http:
	port: 8082
	external:
	default:
	enabled: true
	port: 30082
	schemaRegistry:
	port: 8081
	external:
	default:
	enabled: true
	port: 30081

	tls:
	enabled: true
	certs:
	default:
	caEnabled: true
	listeners:
	kafka:
	port: 9093
	tls:
	cert: default
	external:
	default:
	enabled: true
	port: 30092
	tls:
	cert: default
	admin:
	port: 9644
	tls:
	cert: default
	external:
	default:
	enabled: true
	port: 30644
	tls:
	cert: default
	http:
	port: 8082
	tls:
	cert: default
	external:
	default:
	enabled: true
	port: 30082
	tls:
	cert: default
	schemaRegistry:
	port: 8081
	tls:
	cert: default
	external:
	default:
	enabled: true
	port: 30081
	tls:
	cert: default
	rpc:
	port: 33145
	tls:
	cert: default

Conversation

hidalgopl commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

API surface changes (breaking)

Rendering changes

Notable behavior

Tests

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RafalKorepta left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hidalgopl commented May 15, 2026 •

edited

Loading