Skip to content

Clean up StretchCluster & NodePool specs#1529

Draft
hidalgopl wants to merge 23 commits into
mainfrom
pb/move-fields-to-nodepool
Draft

Clean up StretchCluster & NodePool specs#1529
hidalgopl wants to merge 23 commits into
mainfrom
pb/move-fields-to-nodepool

Conversation

@hidalgopl
Copy link
Copy Markdown
Contributor

@hidalgopl hidalgopl commented May 15, 2026

What

Move per-K8s-cluster fields from StretchCluster.spec to NodePool.spec and rework the operator so each NodePool's resources are rendered from that pool's own spec.

Why

Pre-PR every field on StretchCluster.spec had to be identical across all member K8s clusters. That's wrong for things that are of a K8s cluster — clusterDomain, tls/listeners (resolved against per-cluster Secrets and Issuers), external (per-cloud LB/NodePort), rbac/serviceAccount (per-cluster K8s API), monitoring (per-cluster Prometheus). Pinning them on the federated CR meant either picking a "representative" pool silently (wrong for per-pool customization) or rejecting useful configurations.

API surface changes (breaking)

Moved from cluster.redpanda.com/v1alpha2 StretchCluster.spec to NodePool.spec (each pool sets its own): clusterDomain, tls, external, listeners, rbac, serviceAccount, monitoring.

Removed entirely from both specs: service. Its only two effects were a rarely-used rename of the headless ClusterIP Service (now always named after the cluster) and a per-pod-service annotation default (already covered by NodePoolServices.PerPod.{Local,Remote}.Annotations).

Added on NodePool.spec as optional overrides: storage, resources, imagePullSecrets — inherit from StretchCluster.spec when unset.

Other: dropped the np short name on the NodePool CRD (collided with NetworkPolicy).

Rendering changes

Every resource that depends on a moved field now fans out per local NodePool, named {cluster}-{pool}: ConfigMaps (redpanda, bootstrap, rpk, pandaproxy, schema-registry), ServiceAccount, Roles/Bindings, ServiceMonitor, Certificates, external NodePort/LB Services, lifecycle Secrets. The headless ClusterIP Service stays cluster-wide. Cert-manager Issuers stay one-per-cert (they reference the shared synced CA Secret).

Notable behavior

  • Console controller (multicluster mode): a StretchCluster ClusterRef requires the controller to pick one local NodePool as the source of per-cluster TLS/listener config. The choice is persisted via an annotation on the Console CR so renders don't flap when pools come and go. Skipped entirely in single-cluster mode.
  • syncCA: only bootstraps the well-known default and external cert names (the operator's contract). No longer waits for a NodePool to exist before generating the CA.

Tests

  • New per-pool variation case in stretch-cluster-cases.txtar asserting each pool's resources reflect its own spec.
  • Acceptance feature stretch-cluster-basics now sets ${POOL_NAME} into a PVC label and asserts each region's PVC carries its own pool's value — a fan-out bug would show up immediately.

@hidalgopl hidalgopl marked this pull request as draft May 15, 2026 13:20
hidalgopl and others added 8 commits May 18, 2026 15:10
…odePool

Per-K8s-cluster configuration (TLS, External, Listeners, RBAC, ServiceAccount,
Monitoring, ClusterDomain) moves entirely from StretchClusterSpec to
EmbeddedNodePoolSpec. Defaultable fields (Storage, Resources, Service,
ImagePullSecrets) stay on StretchClusterSpec as cluster-wide defaults and gain
an optional NodePool override (top-level replace, no deep merge).

NodePool drops its `np` shortname (collides with NetworkPolicy).

Defaulting splits into two stages, run by NewRenderState:
  1. cluster.Spec.MergeDefaults() — cluster-wide fields.
  2. pool.Spec.MergeDefaultsFrom(&cluster.Spec) — per-pool: inherits the
     defaultable group when unset, then populates the moved fields.

Helpers that read moved fields move from StretchClusterSpec to
EmbeddedNodePoolSpec (IsAdminTLSEnabled, port readers, GetClusterDomain,
GetServiceAccountName, InternalDomain, InUse*Certs, GetResourceRequirements,
GetRedpandaStartFlags, GetOverProvisionValue, GetEnableMemoryLocking, tiered
storage helpers, GetStorageMinFreeBytes, AdminInternalURL, AdminAPIURLs).
Config/audit/metrics readers stay on StretchClusterSpec.

RenderState exposes the representative local pool via state.PoolSpec(); all
multicluster renderers read per-K8s-cluster values through it instead of
state.Spec(). state.Spec() retains cluster-wide reads (Auth, Networking,
CommonLabels, Image, Tuning, Logging, AuditLogging, Config, RackAwareness,
Enterprise).

Factory.defaultedPoolSpec fetches a NodePool when building admin/Kafka/Schema
Registry clients so listener/TLS/port lookups happen on the merged pool.
syncCA picks a representative NodePool for BootstrappedCertNames.

rapidutil gains a *runtime.RawExtension generator so fuzz tests can traverse
the now-larger NodePool type tree (Monitoring.TLSConfig wraps a runtime.Object
interface that rapid cannot synthesize).

Generated files (CRDs, deepcopy, applyconfiguration) and golden files are
regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two regressions surfaced by the multicluster smoke test:

1. PoolSpec() returned nil when there were no in-cluster pools yet (the
   reconciler fires once before the user has created NodePools). Direct
   field reads like state.PoolSpec().TLS then panicked. Returns an empty
   spec instead — helpers already handle the nil receivers, and resources
   that depend on per-pool config (certs, listeners) self-skip.

2. Acceptance feature files still applied TLS/External/RBAC/Listeners on
   StretchCluster.spec, which is rejected by the new CRD schema. Moved
   those fields to the NodePool manifests in the same scenarios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every resource that reads a moved-to-NodePool field (TLS, Listeners,
ClusterDomain, External, RBAC, ServiceAccount, Monitoring, plus the
defaultable overrides Storage/Resources/ImagePullSecrets) now renders per
local NodePool — picking a representative pool was wrong because each pool
may legitimately differ.

API:
- PoolSpec() becomes PoolSpec(pool) — every read is explicit about which
  pool's view it wants.
- Drop the Service field from both StretchClusterSpec and EmbeddedNodePoolSpec
  along with GetServiceName. The two usages were a vestigial headless-service
  rename (now driven by the cluster's name) and a per-pod-service annotation
  default already covered by NodePoolServices.PerPod.{Local,Remote}.Annotations.
- InternalDomain / AdminInternalURL / AdminAPIURLs move to *RenderState
  (combine cluster-wide service name with per-pool ClusterDomain).
- RenderState.TLSConfig(pool, certName): take a pool so secret lookup uses
  the per-pool cert names.
- representativePool() kept only for the headless ClusterIP Service, whose
  listener port list is cluster-wide.

Resource layout (each pool's resource name = {cluster}-{pool}):
- Bootstrap / RPK / Pandaproxy / Schema Registry ConfigMaps → per pool.
- Redpanda ConfigMap → already per pool, now reads its own pool's spec.
- ServiceAccount, Role, ClusterRole, RoleBinding, ClusterRoleBinding → per pool.
- ServiceMonitor → per pool, selector scoped via labelComponentKey.
- Certificates (cert-manager) → per pool, cert SANs include the per-pool
  StatefulSet hostname pattern; Issuers stay cluster-wide (one per cert name)
  because they reference the synced CA Secret shared across the StretchCluster.
- NodePort + LoadBalancer external Services → per pool, selector scoped via
  labelComponentKey.
- Lifecycle Scripts Secret → per pool (admin URL/curl flags differ per pool).
- Configurator and FS-Validator Secrets → already per pool.
- Headless ClusterIP Service → stays cluster-wide, named {fullname}; listener
  ports taken from a representative pool (assumed identical within a K8s
  cluster).
- CA root Secret, bootstrap user Secret → stay cluster-wide.

syncCA in the multicluster controller now collects bootstrappable cert names
as the union across every NodePool.

Tests:
- New per-pool-variation-test case in stretch-cluster-cases.txtar with two
  pools whose TLS / RBAC / ServiceAccount / Monitoring / Storage / Resources
  differ; goldens regenerated. Confirms each per-pool resource reflects its
  own pool's spec rather than leaking from a representative.
- Existing GetServiceName test removed (helper deleted).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ance assertion

Two bugs from the per-pool fan-out refactor surfaced when the smoke test
tried to use a partial Storage override on the NodePool:

- statefulset_init.go init-container VolumeMounts for the configurator and
  fs-validator still referenced the old fullname-based volume names while
  the corresponding Volumes had already been renamed to the per-pool
  poolFullname form. The StatefulSet was rejected with `volumeMounts[5].
  name: Not found "cluster-configurator"`. Mounts now use poolFullname too.

- inheritFromCluster does whole-replace, so a NodePool that set only
  storage.persistentVolume.labels (no size) ended up with PersistentVolume
  populated but Size nil, and the StatefulSet was rejected with
  `volumeClaimTemplates[0].spec.resources[storage]: Required value`.
  Extracted the existing mergeDefaultStorage / mergeDefaultResources bodies
  into free funcs (fillStretchStorageDefaults / fillStretchResourcesDefaults),
  and MergeDefaultsFrom now applies them at the pool level after inheriting
  from the cluster — partial overrides pick up the unset sub-fields.

Acceptance:
- nodepoolManifest now substitutes ${POOL_NAME} into the DocString body
  before applying, so a single feature step can express per-region pool
  variation.
- New step: `in "<mc>" region "<vc-N>" the PVC "<name>" in namespace "<ns>"
  should have label "<key>" with value "<value>"` (checkPVCLabelInRegion).
- stretch-cluster-basics.feature: NodePool now sets
  storage.persistentVolume.labels.redpanda.com/test-pool-marker: ${POOL_NAME}
  and asserts each region's datadir PVC carries its own pool's value. If
  the operator ever leaks a representative pool's spec into other pools,
  the wrong PVC would carry the wrong value and the assertion would fail.

PVC renderer change:
- volumeClaimTemplateDatadir now propagates pv.Labels and pv.Annotations
  to the PVC template (previously ignored).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the v1alpha2 CRD surface changes:
- Fields removed from StretchCluster.spec (clusterDomain, tls, external,
  listeners, rbac, serviceAccount, monitoring, service).
- Fields added to NodePool.spec, with the defaultable ones (storage,
  resources, imagePullSecrets) inheriting from the cluster when unset.
- NodePool CRD's `np` short name dropped (collided with NetworkPolicy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
loadBalancerServices used to iterate state.allPodNames(); after the per-pool
fan-out it iterates state.inClusterPools and derives pod names per pool, so
the helper is dead. golangci-lint flagged it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l fan-out

Two regressions surfaced once the operator started naming cert resources
per-pool and creating per-pool NodePort Services:

- Both tests applied a NodePool that left `external` at its default
  (enabled, NodePort). They run in the same K8s cluster, so their pools
  both tried to grab the default admin NodePort 31644 and the reconcile
  failed before any cert resources were created. Disabled external on
  both pools.

- The leaf-cert lookup computed the secret name with
  `CertServerSecretName(scName, "default")`. Cert resources are now named
  with the pool's full name (`{cluster}-{pool}-{certname}-cert`), so the
  test was waiting on the pre-refactor name forever. Renamed the pool to
  a plain `pool` (so poolFullname is `{scName}-pool`) and threaded that
  into the secret-name computation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o per-pool spec

The StretchCluster → StaticConfigurationSource converter (newly added on
main as part of the Console multicluster registration) read TLS, listener
ports, and ClusterDomain off StretchClusterSpec — all of which moved to
NodePool.Spec in this PR.

- ConvertStretchClusterToStaticConfig now takes a representative
  *EmbeddedNodePoolSpec alongside the StretchCluster. Per-K8s-cluster
  fields (TLS, listener ports, ClusterDomain) read off the pool spec;
  Auth/SASL still reads off the cluster.
- console controller fetches NodePools for the referenced StretchCluster,
  picks the first one, applies cluster + pool defaults, and passes the
  merged pool spec to the converter.
- The conversion test reorganized: each case now mutates either the
  cluster or the pool spec via a (sc, poolSpec) callback; the runner
  defaults both before invoking the converter.
- The console multicluster integration test creates a NodePool with
  TLS disabled alongside the StretchCluster (instead of setting TLS on
  the cluster spec).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hidalgopl hidalgopl force-pushed the pb/move-fields-to-nodepool branch from 6d4b4fe to 417880a Compare May 18, 2026 13:17
hidalgopl and others added 6 commits May 18, 2026 15:36
…l converter signature change

The previous commit changed ConvertStretchClusterToStaticConfig to require a
representative NodePool and the console controller fails its reconcile if it
can't find one. The unit test only created the StretchCluster, so the
stretch-cluster-ref case hit `no NodePools found for StretchCluster
test-ns/test-stretch`. Apply a minimal NodePool alongside the StretchCluster
in the test setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picking an arbitrary NodePool on every reconcile risked rendering flapping
between pools as new ones were added or sort order shifted. Persist the
choice instead:

- maybeSelectStretchSourcePool runs early in Reconcile. If
  `operator.redpanda.com/stretch-source-pool` is already set and the named
  pool still exists in the local K8s cluster, keep it. Otherwise pick the
  lexically smallest matching NodePool (deterministic tie-breaker), write
  the annotation via Apply, and use it.
- clusterFragment now reads that annotation to look up the pinned pool by
  name rather than iterating + first-match selecting on each render.

The annotation lives on the Console CR (operator-owned state). Because the
controller's ctl is bound to the local K8s cluster where the Console CR
lives, the candidate NodePool list is automatically scoped to that cluster
too — matching the "same K8s cluster as the Console CR" intent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The StretchCluster and NodePool CRDs are only installed by the multicluster
operator. In single-cluster mode the controller must not List NodePools or
even attempt to dereference a ClusterRef of kind StretchCluster — there's
nothing to read.

- New Controller.Multicluster bool; SetupWithMulticlusterManager sets it
  true, SetupWithManager leaves it false.
- maybeSelectStretchSourcePool short-circuits when Multicluster is false.
- The render struct gets the same flag; clusterFragment returns a clear
  error if a StretchCluster ref shows up in single-cluster mode rather
  than 404'ing on the CRDs.
- Unit test sets Multicluster=true since it exercises the stretch-cluster-ref
  case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The early-return in syncCA when no NodePools exist created a chicken-and-egg
problem: a user applies a StretchCluster, syncCA short-circuits, then the
user creates a NodePool and the CA still has to wait for the StretchCluster
reconciler to fire again before any TLS cert can be issued.

Pre-refactor this didn't happen because TLS defaults lived on the cluster
spec, so MergeDefaults produced a "default" + "external" cert set even
before any NodePool existed. Restore that behaviour by treating "no
NodePools" as "one empty placeholder pool" and letting MergeDefaultsFrom
populate the same defaults a real NodePool would inherit. Any unusual cert
names a NodePool later defines are picked up on the next reconcile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CA bootstrapping in the multicluster reconciler only ever bootstraps two
cert names — "default" and "external" — because those are the only certs
the operator owns. Anything a user adds via NodePool.Spec.TLS.Certs is
expected to bring its own CA (SecretRef or IssuerRef), which
BootstrappedCertNames was already filtering out.

Iterating NodePools to recompute that static pair on every reconcile was
just plumbing and forced the placeholder-pool workaround for the
no-NodePools-yet case. Hard-code the list instead. syncCA no longer
depends on NodePools or BootstrappedCertNames, and the
chicken-and-egg between StretchCluster creation and the first NodePool
goes away.

Renderer-side cert bookkeeping (cert_issuers.go, certs.go) continues to
use BootstrappedCertNames per-pool, which is correct — there each pool's
spec genuinely shapes the cert resources rendered for that pool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "default" and "external" cert names — the operator's well-known
CA-bootstrapped certs — appeared as bare string literals in several
places: mergeDefaultTLS, the listener defaulting helpers,
ConvertStretchClusterToStaticConfig, Factory.stretchClusterListenerTLSConfig,
and the freshly-simplified syncCA. Expose them as exported constants on
the v1alpha2 package and use them everywhere a cert name is meant. Cuts
the typo risk and makes the rename refactor (if it ever happens) trivial.

The "default" listener-map key inside mergeDefaultExternalListener stays a
literal because it's a listener name, not a cert name — annotated as such.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hidalgopl hidalgopl marked this pull request as ready for review May 19, 2026 13:01
External *External `json:"external,omitempty"`

// Defines the log level settings.
Logging *StretchLogging `json:"logging,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can logging be also added to NodePool.Spec? During indidents there is only subset of Redpandas that needs different Log Level to change for further investigation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RackAwareness should be removed from StretchCluster as each cloud provider define different NodeAnnotation.

Logging *StretchLogging `json:"logging,omitempty"`

// Defines audit logging settings.
AuditLogging *AuditLogging `json:"auditLogging,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In AuditLogging property there is indirect dependency on now removed Listener.Kafka

// Kafka external listener name, note that it must have authenticationMethod set to sasl
Listener *string json:"listener,omitempty"

From the current code could you remove Listener from CRD and Redpanda helm values as it is not used?

- -c
- curl --silent --fail -k -m 5 --cacert /etc/tls/certs/default/ca.crt
"https://${SERVICE_NAME}.custom-cluster-domain.custom-cluster-domain.svc.custom.local:9644/v1/status/ready"
"https://${SERVICE_NAME}.custom-cluster-domain.custom-cluster-domain.svc.cluster.local.:9644/v1/status/ready"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this changed could you adjust

spec:
clusterDomain: custom.local
test case and maybe others as some properties moved from StretchCluster to NodePool?

- containerPort: 9644
name: admin
- containerPort: 31644
- containerPort: 9645
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case should be adjusted too

listeners:
kafka:
port: 9093
external:
default:
enabled: true
port: 30092
admin:
port: 9644
external:
default:
enabled: true
port: 30644
http:
port: 8082
external:
default:
enabled: true
port: 30082
schemaRegistry:
port: 8081
external:
default:
enabled: true
port: 30081

- containerPort: 9644
name: admin
- containerPort: 30644
- containerPort: 9645
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this too

tls:
enabled: true
certs:
default:
caEnabled: true
listeners:
kafka:
port: 9093
tls:
cert: default
external:
default:
enabled: true
port: 30092
tls:
cert: default
admin:
port: 9644
tls:
cert: default
external:
default:
enabled: true
port: 30644
tls:
cert: default
http:
port: 8082
tls:
cert: default
external:
default:
enabled: true
port: 30082
tls:
cert: default
schemaRegistry:
port: 8081
tls:
cert: default
external:
default:
enabled: true
port: 30081
tls:
cert: default
rpc:
port: 33145
tls:
cert: default

- containerPort: 9644
name: admin
- containerPort: 30644
- containerPort: 9645
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this too

tls:
enabled: true
certs:
default:
caEnabled: true
external:
enabled: true
type: LoadBalancer
domain: prod.example.com
service:
enabled: true
listeners:
kafka:
port: 9093
tls:
cert: default
external:
default:
enabled: true
port: 30092
tls:
cert: default
admin:
port: 9644
tls:
cert: default
external:
default:
enabled: true
port: 30644
tls:
cert: default
http:
port: 8082
tls:
cert: default
external:
default:
enabled: true
port: 30082
tls:
cert: default
schemaRegistry:
port: 8081
tls:
cert: default
external:
default:
enabled: true
port: 30081
tls:
cert: default
rpc:
port: 33145
tls:
cert: default

- -c
- curl --silent --fail -k -m 5 --cacert /etc/tls/certs/default-client/ca.crt
--cert /etc/tls/certs/default-client/tls.crt --key /etc/tls/certs/default-client/tls.key
- curl --silent --fail -k -m 5 --cacert /etc/tls/certs/default/ca.crt
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this too

tls:
enabled: true
certs:
default:
caEnabled: true
issuerRef:
name: my-ca-issuer
kind: ClusterIssuer
group: cert-manager.io
listeners:
kafka:
port: 9093
tls:
cert: default
requireClientAuth: true
admin:
port: 9644
tls:
cert: default
requireClientAuth: true
http:
port: 8082
tls:
cert: default
schemaRegistry:
port: 8081
tls:
cert: default
rpc:
port: 33145
tls:
cert: default

Name: state.Spec().GetServiceName(state.fullname()),
Namespace: state.namespace,
Labels: labels,
Annotations: annotations,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add spec.service.internal.annotation map to the NodePool. It is regresion, but I don't remember what is the use case for our internal service even in none multicluster setup. Probably some kind annotation for other operators/controllers.

// TODO: consider a special field for per pod service annotation, either in nodepool or stretchcluster spec.
annotations = spec.Service.Internal.Annotations
}
annotations := map[string]string{}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Maybe inline this with Service declaration?

Copy link
Copy Markdown
Contributor

@RafalKorepta RafalKorepta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, my comments could be addressed in follow up.

hidalgopl and others added 7 commits May 20, 2026 16:05
…nalAccessReady

Per-pool external NodePort Services all default to the same advertised
ports (admin 31644 / kafka 31092 / http 30082 / schema-registry 30081).
Two local NodePools in the same Kubernetes cluster therefore both
request the same nodePort numbers, and the API server rejects the second
Service with `provided port is already allocated`.

Resolve collisions deterministically in the renderer: the lexically
first pool wins each nodePort and any pool that would clash has its
Service skipped. The NodePoolReconciler independently runs the same
detection (via DetectExternalNodePortConflicts) and writes a new
ExternalAccessReady condition on each pool — Available, Disabled, or
NodePortConflict with a message naming the winning pool and the
colliding ports — so users see why their external Service is missing.

ExternalAccessReady is added to the Stable rollup. Goldens for
`multi-pool` and `full-featured` lose the duplicate Service that the API
server would have rejected; that was the regression Codex flagged in
the PR review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
syncCA used to hardcode {default, external}, but certIssuers renders
Issuers for the union of BootstrappedCertNames across all NodePools.
The two diverged in both directions:

* A NodePool that referenced a custom operator-managed cert name (no
  SecretRef, no IssuerRef) got an Issuer pointing at a `--root-certificate`
  Secret that syncCA never created — cert-manager would stall waiting on
  the missing CA.
* A NodePool that switched the default or external cert to IssuerRef
  still got the operator's unused CA private-key Secret created in every
  member cluster.

Add ManagedCertNamesForBootstrap that runs the same defaulting as the
renderer and returns the union of BootstrappedCertNames over all pools,
with a {default, external} fallback for the no-pools bootstrap window.
syncCA now calls it instead of hardcoding the list, keeping CA Secrets
and Issuers strictly in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stretchListenerTLS passed sc.Name as the fullname argument to
TLS.CertificatesFor, which is correct for shared root-CA Secrets
(scoped to the cluster name) but wrong for IssuerRef-backed certs.
cert-manager creates the leaf Secret from a Certificate the operator
renders against the per-pool fullname `<cluster>-<pool>`, so the
actual Secret name is `<cluster>-<pool>-<cert>-cert`. Console pointed
at `<cluster>-<cert>-cert` and TLS broke.

Take the NodePool object in ConvertStretchClusterToStaticConfig and
compute its fullname inside the converter. Switch fullname to the
pool variant when the cert uses IssuerRef; leave the shared CA path
on the cluster name. The console controller now just passes the
pinned pool through and the converter handles defaulting itself,
which trims duplicate setup at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Console reads per-K8s-cluster TLS, listener ports, and clusterDomain from
the pinned source NodePool, but the multicluster Console controller only
watched StretchCluster. Updates to that pool (or its deletion) never
re-enqueued dependent Consoles, so rendered configs could stay stale
indefinitely until some other StretchCluster or Console event happened
to fire.

Add a NodePool watch that translates each NodePool event to a lookup
against the existing `console_stretch` index — the same index the
StretchCluster watch already uses — so the same set of Consoles gets
enqueued. The pinned-pool repair logic in maybeSelectStretchSourcePool
already re-picks when the previously-annotated pool is gone; it just
needed a reconcile trigger to run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
During an incident, operators want to crank up the log level on one
region's brokers without touching the rest of the cluster. Add
Logging *StretchLogging to NodePool.spec.embeddedNodePoolSpec as an
optional per-pool override that inherits StretchCluster.Spec.Logging
when unset, mirroring the existing Storage/Resources/ImagePullSecrets
override pattern.

The rpk configmap renderer reads from the defaulted pool spec, so the
per-pool value (or the inherited cluster value) drives
--default-log-level for each StatefulSet.

Addresses Rafał's review comment on PR #1529.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RackAwareness.NodeAnnotation identifies the K8s node label that
encodes a node's rack/zone. The right key varies per cloud provider
(EKS / GKE / AKS use the standard topology.kubernetes.io/zone but
on-prem and some custom deployments use other annotations), and the
StretchCluster topology already crosses cloud boundaries. A
cluster-wide field can't accommodate that.

Move RackAwareness *RackAwareness from StretchClusterSpec to
EmbeddedNodePoolSpec. Each pool picks the annotation that matches
its own K8s cluster's nodes. The renderer (rack-awareness ClusterRole,
configurator init-container volume mount, bootstrap configmap,
configurator script) now reads RackAwareness off the pool spec.

The render-cases `rack-awareness` and `full-featured` fixtures move
the field onto their NodePool entries so the goldens continue to
exercise the rack-aware code paths.

Addresses Rafał's review comment on PR #1529.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AuditLogging.Listener referenced a Kafka external listener by name. The
Listeners block moved off StretchCluster.spec onto each NodePool, so a
cluster-scoped Listener pointer can no longer resolve to a single
listener — and the field was never wired into the rendered audit
configuration anyway.

Fork a new StretchAuditLogging type for StretchCluster.spec.auditLogging
without the Listener field. Redpanda.spec.auditLogging keeps the shared
AuditLogging type unchanged, preserving the single-cluster operator's
surface.

Addresses Rafał's review comment on PR #1529.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hidalgopl and others added 2 commits May 20, 2026 17:13
The render-cases input fixtures still placed `tls`, `external`,
`listeners`, `monitoring`, and `clusterDomain` on StretchCluster.spec.
After the field move those keys no longer exist on the cluster CRD, so
the YAML decoder dropped them silently and many cases ended up
exercising only the renderer's defaults instead of the case's intended
configuration.

Migrate the moved fields onto each case's NodePool spec across all 13
affected cases (tls-self-signed, tls-mtls, sasl-scram512-with-tls,
external-loadbalancer, external-nodeport, external-tls, tls-issuer-ref,
tls-issuer-ref-mtls, monitoring, custom-cluster-domain, full-featured,
and the rack-awareness / full-featured already-touched-in-prior-commits
ones). Goldens regenerated — external NodePorts and the custom cluster
domain now show up in the rendered output, restoring the coverage the
cases were originally written for.

Addresses Rafał's review comments on PR #1529 (he flagged five
specifically; the underlying issue applied to all of them).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pod annotations

`StretchCluster.spec.service.internal.annotations` was dropped when the
`service` field was removed wholesale, but the headless ClusterIP
Service it annotated is cluster-wide (one per member K8s cluster), not
per pool. The renderer had no way to attach annotations consumed by
external-dns / service-mesh / other sibling operators.

Add `StretchCluster.spec.internalServiceAnnotations` as a cluster-wide
map; thread it onto the headless Service rendered by
`serviceInternal`. Annotations stay nil when the user didn't set any,
so apply-time comparisons keep working.

While here, inline the empty `annotations := map[string]string{}`
declaration in service_per_pod.go directly into the ObjectMeta
literal (Rafał nit).

Addresses Rafał's review comments on PR #1529.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hidalgopl hidalgopl marked this pull request as draft May 21, 2026 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants