Fix up multicluster watches for nodepools and tls hostnames #1534
Merged
RafalKorepta (Contributor) approved these changes on May 20, 2026 and left a comment:
LGTM, this fixes our implementation.
I wonder if it would be worth having a test that asserts that our generated seed_servers or advertised_rpc_api addresses match one of the certificate's DNSNames?
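One way such an assertion could look, as a minimal sketch only. The helper names (strictMatch, uncovered) and the hostnames are hypothetical, and the matcher is a simplified model of strict wildcard checking (exact match, or a leftmost-label wildcard whose parent has at least two labels), not OpenSSL's actual algorithm or the operator's code.

```go
package main

import (
	"fmt"
	"strings"
)

// strictMatch reports whether a single SAN entry covers host.
// Simplified model: exact match, or a "*." wildcard whose parent domain
// has at least two labels and equals everything after host's first label.
func strictMatch(san, host string) bool {
	san, host = strings.ToLower(san), strings.ToLower(host)
	if !strings.HasPrefix(san, "*.") {
		return san == host
	}
	parent := san[2:]
	if strings.Count(parent, ".") < 1 {
		return false // single-label parent, e.g. *.<ns>
	}
	i := strings.Index(host, ".")
	if i <= 0 {
		return false
	}
	return host[i+1:] == parent
}

// uncovered returns the advertised hosts that no SAN entry covers.
func uncovered(hosts, sans []string) []string {
	var missing []string
	for _, h := range hosts {
		covered := false
		for _, s := range sans {
			if strictMatch(s, h) {
				covered = true
				break
			}
		}
		if !covered {
			missing = append(missing, h)
		}
	}
	return missing
}

func main() {
	// Hypothetical advertised addresses in <pod>.<ns> form.
	hosts := []string{"rp-0.redpanda", "rp-1.redpanda"}

	// SAN list relying on the single-label-parent wildcard.
	before := []string{"*.redpanda", "*.redpanda.svc"}
	// SAN list with explicit per-broker entries instead.
	after := []string{"*.redpanda.svc", "rp-0.redpanda", "rp-1.redpanda"}

	fmt.Println(uncovered(hosts, before)) // both hosts are reported
	fmt.Println(uncovered(hosts, after))  // nothing is reported
}
```

In a real test, hosts would come from the rendered seed_servers / advertised_rpc_api values and sans from the rendered Certificate's DNSNames; the test would fail whenever uncovered returns a non-empty slice.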
Summary

Two fixes for the StretchCluster multicluster controller:

1. Watch NodePool from the StretchCluster reconciler

The StretchCluster reconciler now watches NodePool events across local and provider clusters and re-enqueues the owning StretchCluster (spec.clusterRef.Name) on every change, with the originating cluster name preserved so the multicluster runtime routes the request correctly. Without this, NodePool edits (replica count, image, services, …) didn't trigger a StretchCluster reconcile until something else nudged it, so scaling and config changes lagged unpredictably.

2. Operator-issued broker TLS certs fail strict hostname verification on flat/MCS modes (fixes #1499)

In flat & MCS cross-cluster networking modes, the operator writes 2-label broker hostnames (<pod>.<ns>) into seed_servers[].host.address / advertised_rpc_api.address. The generated cert-manager Certificate covered those only via a *.<ns> wildcard, a single-label-parent wildcard that RFC 6125 §6.4.3 disallows and OpenSSL ≥3.0 rejects with "verify error:num=62: hostname mismatch". The broker RPC handshake then fails, cluster_bootstrap_info RPCs return rpc::errc:4, and the StretchCluster stays Ready: False forever.

The fix is additive on the SAN list, apart from removing the dead, RFC-violating wildcard:

- Drop *.<ns> from the operator-emitted SAN list. It can never validate under any modern strict TLS stack, and it was the only thing previously "covering" the 2-label hostnames the operator advertises. Sibling wildcards *.<ns>.svc / *.<ns>.svc.<domain> (multi-label parents) are kept; those are RFC-valid and cover the FQDN forms.
- Add per-broker SANs matching seed_servers[].host.address / advertised_rpc_api.address:
  - flat: <pod>.<ns>
  - MCS: <pod>.<ns> and <pod>.<ns>.svc.clusterset.local

The SANs come from the same enumeration site that already builds seed_servers, so there is no new code path; the fix just consumes the broker list the renderer already produces. Non-stretch / pre-26.2 deployments are unaffected; the wildcards that did the work for them (*.<cluster>.<ns>.svc.cluster.local etc.) are untouched.

Changes

- operator/internal/controller/redpanda/multicluster_controller.go: add a Watches(&NodePool{}, …) on the StretchCluster controller that maps NodePool events to their owning StretchCluster via spec.clusterRef (only when IsStretchCluster()), with cluster-name preservation.
- operator/multicluster/certs.go: drop *.<ns>; append <pod>.<ns> per broker in the flat/applyInternalDNSNames branch; append <pod>.<ns>.svc.clusterset.local per broker in the MCS branch.
- operator/multicluster/testdata/render-cases.txtar: add flat-network-tls and mcs-network-tls fixtures (the pre-existing flat-network/mcs-network cases had TLS disabled and never exercised the cert SAN path).
- operator/multicluster/testdata/render-cases.resources.golden.txtar
- operator/multicluster/testdata/render-cases.pools.golden.txtar
- operator/internal/lifecycle/testdata/stretch-cluster-cases.resources.golden.txtar
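
The NodePool-to-StretchCluster mapping described above can be sketched roughly as follows. This is an illustration with stdlib-only stand-in types, not the operator's code: the NodePool, ClusterRef and Request structs, the Kind-based IsStretchCluster check, and the assumption that the StretchCluster lives in the NodePool's namespace are all hypothetical.

```go
package main

import "fmt"

// ClusterRef is a hypothetical stand-in for spec.clusterRef on a NodePool.
type ClusterRef struct {
	Name string
	Kind string
}

// IsStretchCluster mirrors the guard named in the PR; comparing Kind is an
// assumption made for this sketch.
func (r ClusterRef) IsStretchCluster() bool { return r.Kind == "StretchCluster" }

// NodePool is a trimmed-down stand-in for the real resource.
type NodePool struct {
	Namespace  string
	Name       string
	ClusterRef ClusterRef
}

// Request is a stand-in for a multicluster reconcile request: the usual
// namespace/name key plus the cluster the event originated from.
type Request struct {
	ClusterName string
	Namespace   string
	Name        string
}

// mapNodePool turns a NodePool event into a request for its owning
// StretchCluster, preserving the originating cluster name. NodePools that do
// not reference a StretchCluster produce no request.
func mapNodePool(clusterName string, np NodePool) []Request {
	if !np.ClusterRef.IsStretchCluster() || np.ClusterRef.Name == "" {
		return nil
	}
	return []Request{{
		ClusterName: clusterName,
		Namespace:   np.Namespace, // assumption: same namespace as the NodePool
		Name:        np.ClusterRef.Name,
	}}
}

func main() {
	np := NodePool{
		Namespace:  "redpanda",
		Name:       "pool-a",
		ClusterRef: ClusterRef{Name: "stretch", Kind: "StretchCluster"},
	}
	fmt.Printf("%+v\n", mapNodePool("provider-east", np))

	// A NodePool owned by something else is ignored.
	np.ClusterRef.Kind = "Redpanda"
	fmt.Println(len(mapNodePool("provider-east", np)))
}
```

Carrying the cluster name in the request is what would let a multicluster runtime route the reconcile to the cluster where the event was observed, which is the behaviour the PR description calls cluster-name preservation.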