operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness#1547
Draft
david-yu wants to merge 8 commits into
Draft
operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness#1547david-yu wants to merge 8 commits into
david-yu wants to merge 8 commits into
Conversation
Bumps common-go/rpadmin from v0.2.4 to v0.2.7, which exposes the PreRestartProbe/PostRestartProbe bindings for /v1/broker/pre_restart_probe (Redpanda 25.1+) merged in common-go#170. Wiring on the operator side: 1. internal/probes/broker.go: new Prober.IsBrokerSafeToRestart method that calls PreRestartProbe scoped to the local broker and returns safe only when none of acks1_data_loss / unavailable / full_acks_produce_unavailable is populated. rf1_offline is treated as acceptable risk. On 404 returns ErrPreRestartProbeUnsupported so callers can fall back. IsClusterBrokerHealthy now consults this first; only falls through to the legacy cluster-overview path on older brokers. 2. internal/controller/redpanda/redpanda_controller.go: new helper brokerSafeToRestart that the rolling-restart loop calls per pod (using admin.BrokerIDToURL + ForHost to scope per-broker). Replaces the cluster-wide health.IsHealthy heuristic at the roll gate. The helper falls back to the cluster-wide IsHealthy on 404 so behavior on Redpanda < 25.1 is preserved. Tests: new TestIntegrationBrokerSafeToRestart in the redpanda controller package boots a real Redpanda container via testcontainers and pins four behaviors: - fresh cluster → safe - RF=1 topic populates rf1_offline but still reports safe - 404 falls back to the cluster.IsHealthy argument (both directions) - non-404 admin errors surface as errors (not silent fallback) Verified locally against v26.1.1: PASS in 19.907s, 4/4 subtests. Build cleanup: cluster_controller_configuration_test.go used github.com/moby/moby/pkg/namesgenerator for one-off random namespace names. The moby monolith conflicts with the split-out moby/moby/api module that testcontainers v0.42 pulls in via the rpadmin v0.2.7 test dep graph. Replaced the namesgenerator call with a 4-byte crypto/rand hex suffix so the operator no longer depends on the moby monolith. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Incorporates the two correctness fixes from the (now-closed) #1537 into the ENG-222 wiring stack: - brokerMap parsing now strips a "host:port" suffix via net.SplitHostPort and keys by both the first DNS label and the raw host. Previously strings.Split(InternalRPCAddress, ".")[0] silently mis-keyed when the admin API returned the address as "host:port" or a bare IP, causing the roll loop's lookup to miss and treat the broker as orphan. - The "pod not in brokerMap" branch now deletes one pod and requeues instead of deleting + continuing. Combined with the parsing bug above (or any transient cause that empties AllNodes — rolling controller leadership, brief admin-API 503s, NodeID flip after PVC reset), the previous behavior was the only single-reconcile path that could produce multiple brokers literally Terminating at the same time. Also pins the existing HasRecentlyReplacedPods gate's contract with a new TestHasRecentlyReplacedPods unit test in pool_test.go covering the no-op, all-old-revision, all-Ready-new-revision, just-replaced-not-yet- Ready, terminating-pod, and stuck-in-Pending cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 tasks
Adds parameterized subtests to TestIntegrationBrokerSafeToRestart that
walk the four failure modes the rolling-restart RFC enumerates and
assert the operator's roll/don't-roll decision for the JSON each mode
produces:
RFC case 1: in-sync 2, offline 1
→ restarting an online replica makes the partition leaderless
→ probe populates `unavailable`
→ operator blocks ✓
RFC case 2 (in-sync restart path): in-sync 2, recovering 1
→ restarting the in-sync replica leaves no acks=-1 quorum
→ probe populates `full_acks_produce_unavailable`
→ operator blocks ✓
RFC case 2 (recovering-replica restart path)
→ probe populates nothing — recovering replica was always at risk
→ operator allows ✓
RFC case 3: in-sync 1, offline 1, recovering 1
→ restarting the in-sync replica with acks=1 risks data loss
→ probe populates both `unavailable` and `acks1_data_loss`
→ operator blocks on either ✓
RFC case 4: in-sync 1, recovering 2
→ restarting the in-sync replica causes acks=1 data loss
→ probe populates `acks1_data_loss`
→ operator blocks ✓
Plus the documented edge cases:
- rf1_offline alone → acceptable risk, operator allows
- rf1_offline + a dangerous category → still blocked
- all categories empty → safe (steady-state happy path)
The subtests use an httptest stub that returns the engineered
PreRestartCheckResult JSON for each case. Reproducing the actual
multi-broker cluster states inside a single testcontainer isn't
practical; Redpanda core's own integration suite verifies the
probe's server-side risk calculation per cluster state. What this
test verifies is the operator's *interpretation* of the contract —
that the four documented failure modes each translate to the right
roll decision.
Verified locally against redpandadata/redpanda:v26.1.1:
PASS in 19.042s, 12/12 subtests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rolling-restart RFC's ideal sequence has six steps; #1547 wired up the pre-restart probe (step 2) but step 5 — "wait for post-restart probe" — was still served by HasRecentlyReplacedPods, which checks K8s pod-Ready as a proxy. K8s pod-Ready can flip True before the broker has finished replaying its in-sync replicas. That window is exactly where we'd kick off the next roll prematurely. This commit adds /v1/broker/post_restart_probe (load_reclaimed_pc) as an additional gate that runs after HasRecentlyReplacedPods passes and before the rollSet loop: internal/probes/broker.go: + Prober.IsBrokerCaughtUp(ctx, brokerURL, threshold) + ErrPostRestartProbeUnsupported sentinel + DefaultPostRestartCaughtUpPercent = 100 (RFC's strictest reading) internal/controller/redpanda/redpanda_controller.go: + brokerCaughtUp(ctx, admin, brokerID, threshold, logger, podName) — BrokerIDToURL + ForHost + PostRestartProbe; 404 → treat as caught up (older Redpanda) + brokersStillRecovering(ctx, admin, brokerMap, threshold, logger) — iterates brokerMap entries (dedup'd by broker ID since the map intentionally double-keys), returns true if any broker is below threshold + reconcileDecommission now calls brokersStillRecovering after HasRecentlyReplacedPods has cleared. Probe errors are non-fatal (the K8s-Ready gate and per-broker pre-restart probe still protect the dangerous failure modes). Tests in redpanda_restart_probe_test.go (10 new subtests): TestIntegrationBrokerCaughtUp - 100% → caught up at default threshold - 99% → not caught up at default strict threshold - 99% → caught up at relaxed threshold (95) - 0% (just restarted) → not caught up - 404 → falls back to caught up (pre-25.1 broker) - 500 → propagates as error TestIntegrationBrokersStillRecovering - any broker below threshold blocks - all brokers caught up returns false - dedupes by broker ID across brokerMap's double-key entries - 404 on a broker is treated as endpoint absent (not recovering) All pass locally, 4.328s total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aram) CI's golangci-lint flagged brokersStillRecovering's threshold parameter as unparam — every caller (one in the controller, three in the test) passes DefaultPostRestartCaughtUpPercent (100). The parameter was speculative for "what if we someday want tunable rolling-restart tolerance" but the call sites converged on the strictest reading. Hardcoding probes.DefaultPostRestartCaughtUpPercent inside the helper keeps brokerCaughtUp's threshold (which the tests exercise with 100, 99, 95, 0) and removes the unused-flexibility surface from the gate helper. The doc comment now explains why the threshold was lifted out and what would have to change if we ever wanted to plumb it back in. Also trimmed an unrelated sentence from the changelog entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rpadmin v0.2.4 → v0.2.7 bump cascaded into transitive-dep
updates across the workspace. CI's task generate runs:
- mod:tidy across all 13 modules in go.work
- dev:update-licenses (adds BSL boilerplate to generated proto files
that don't yet have it)
- generate:third-party-licenses-list (licenses/third_party.md)
and the ci:lint git diff --exit-code then flagged the deltas:
acceptance/go.sum
charts/{connectors,console,redpanda}/go.{mod,sum}
gen/go.sum
gotohelm/go.{mod,sum}
gotohelm/testdata/src/example/go.{mod,sum}
harpoon/go.{mod,sum}
pkg/go.{mod,sum}
pkg/multicluster/leaderelection/proto/gen/transport/v1/*.pb.go
licenses/third_party.md
No source code changes — purely the regenerated artifacts that
match what CI produces. The two .pb.go files only gained the
mandatory BSL boilerplate header (licenseupdater added it; the
generated proto bodies are unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My previous commit ran licenseupdater locally, which added the BSL boilerplate header to two generated proto files (message.pb.go, message_grpc.pb.go). But CI's task generate runs licenseupdater followed by `buf generate`, and `buf generate` regenerates those files from scratch — stripping the header back out. The canonical state is the unheaded one. Reverting the header addition so my branch matches the post-buf-generate output that CI produces. The proto file bodies were unchanged either way. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The sidecar's /healthz is K8s readiness — "am I serving traffic?". The pre-restart probe is a stricter signal — "would restarting me cause partition risks?". During cluster bootstrap (internal-topic creation, partition leader election) the pre-restart probe transiently reports unavailable/acks1_data_loss/full_acks_produce_unavailable even though the broker is happily serving Kafka traffic. My earlier IsClusterBrokerHealthy change wired PreRestartProbe in as the primary gate, which kept pods NotReady through bootstrap and made helm installs time out (`Error: INSTALLATION FAILED: context deadline exceeded`) in the integration suite — TestIntegrationStatefulSetDecommissioner, TestIntegrationFactoryOperatorV1, TestIntegrationClientFactoryTLSListeners, TestIntegrationRedpandaController/TestLicense, etc. all failed with that shape on PR #1547. This commit restores IsClusterBrokerHealthy to its prior logic (cluster.IsHealthy → relaxed broker-local fallback). Prober still exposes IsBrokerSafeToRestart for callers that genuinely want the restart-safety signal — its primary user is the operator's roll-gate helper brokerSafeToRestart, which is the right place to use the per-broker probe. PostRestartProbe wiring (brokerCaughtUp / brokersStillRecovering) is untouched — those are operator-side only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adopt Redpanda 25.1+'s per-broker pre-restart probe (ENG-222) as the primary
rolling-restart gate, replacing the cluster-wide
health_overview.IsHealthyheuristic with the per-broker counterfactual
/v1/broker/pre_restart_probe.This PR is the consolidated customer-blocking fix for the rolling-restart
races; it folds in the two correctness fixes from the (now-closed) #1537 so
they don't get lost.
What this PR does
1. ENG-222 — replace cluster.IsHealthy with the per-broker probe
internal/probes/broker.go— newProber.IsBrokerSafeToRestartmethod.Calls
PreRestartProbe(Redpanda 25.1+) and returns safe only when none ofthe dangerous risk categories is populated:
acks1_data_loss(acks=1 producers may lose data)unavailable(produce + consume reject)full_acks_produce_unavailable(acks=-1 produce rejects)rf1_offlineis acceptable per ENG-222: RF=1 already has no redundancy.On 404 (Redpanda < 25.1) returns
ErrPreRestartProbeUnsupportedso callerscan fall back.
IsClusterBrokerHealthynow consults this first; fallsthrough to the legacy cluster-overview-with-relaxed-fallback path only when
the endpoint isn't present.
internal/controller/redpanda/redpanda_controller.go— the rollingloop now calls a new
brokerSafeToRestarthelper for each pod, usingadmin.BrokerIDToURL+ForHostto scope per-broker. Theelse if health.IsHealthybranch is replaced by:On 404 the helper falls back to
health.IsHealthyso behavior on Redpanda< 25.1 is unchanged.
2. PostRestartProbe — wait for replica recovery before rolling the next broker
The RFC's ideal sequence has six steps; we now cover both probe steps:
HasRecentlyReplacedPods(the K8s pod-Ready gate added in #1446) stillfires first — fast, cheap, and protects against the most obvious
"pod isn't even Running yet" case. On top of that, the controller
now consults
/v1/broker/post_restart_probe(load_reclaimed_pc) perbroker before the rollSet loop. A broker is treated as caught up when
its
load_reclaimed_pc >= 100(default strictest reading, perprobes.DefaultPostRestartCaughtUpPercent). Implementation:internal/probes/broker.go— newProber.IsBrokerCaughtUpmethodErrPostRestartProbeUnsupportedsentinel.internal/controller/redpanda/redpanda_controller.go—brokerCaughtUphelper (BrokerIDToURL + ForHost + PostRestartProbe;404 → treat as caught up so older clusters behave as before) and
brokersStillRecoveringouter-gate helper that iteratesbrokerMap,dedupes by broker ID (the map intentionally double-keys), and returns
true if any broker is below threshold.
per-broker pre-restart probe still protect the dangerous failure
modes.
3. Incorporated from #1537 — orphan-branch correctness
Two latent code-path bugs in the same
reconcileDecommissionfunction:brokerMap parsing. Brokers were keyed by
strings.Split(broker.InternalRPCAddress, ".")[0]. This produces the wrongkey when the admin API returns the address as
host:port(advertised RPCport suffix) or as a bare IP — the lookup
brokerMap[pod.GetName()]thenmisses and the pod is silently treated as orphan. Fix: strip the port via
net.SplitHostPortand key by both the first DNS label and the raw host(strict superset of the previous keyspace; no regression for FQDN-only
deployments).
"Pod not in brokerMap" branch. The branch deleted and set
continueExecution=true, so every unmapped pod inrollSetwas deleted inthe same reconcile pass. Combined with the parsing bug above (or any
transient cause that empties
AllNodes— rolling controller leadership,brief admin-API 503s, NodeID flip after PVC reset), this was the only
single-reconcile path that could produce multiple brokers literally
Terminatingat the same time. Now the branch deletes one pod andrequeues, so each pod's state is re-evaluated against fresh broker-map
data before the next one is touched.
4. Regression test for the existing
HasRecentlyReplacedPodsgateNew
TestHasRecentlyReplacedPodsinpool_test.gopins the contract for theouter K8s-pod-state gate that #1446 introduced. Cases: no-op, all-old-revision,
all-Ready-new-revision (steady state), just-replaced-not-yet-Ready,
terminating pod, stuck-in-Pending.
Why all together
The changes cover four independent failure surfaces that all contribute to
the same customer-visible bug:
!inBrokerMapbranch tear-downstrings.Split(...)[0]parsernet.SplitHostPort+ dual-keyhealth.IsHealthy(cluster-wide)PreRestartProbePostRestartProbeouter gateTogether they make rolling restart safe under load, both for the literal
parallel-delete scenario (the orphan-branch fix) and the timing race where
the cluster reports
IsHealthy=truewhile a broker is still partition-recovering (the pre- and post-restart probes).
Dependencies
github.com/redpanda-data/common-go/rpadmin v0.2.7(carriesPreRestartProbe/PostRestartProbefrom common-go#170)github.com/testcontainers/testcontainers-go v0.42.0(already present)github.com/moby/moby/pkg/namesgenerator(single-line usein
cluster_controller_configuration_test.go) because the moby monolithv28+ conflicts with the split-out
moby/moby/apimodule pulled in viatestcontainers v0.42's transitive deps under rpadmin v0.2.7. Replaced with
a 4-byte
crypto/randhex suffix.Tests
TestIntegrationBrokerSafeToRestartboots a real Redpanda container viatestcontainers (for the live-probe subtests) and uses httptest stubs
(for the RFC-case and error-handling subtests). Twelve subtests pin the
pre-restart contract:
fresh cluster is safe to restartRF=1 topic still counts as safe (acceptable risk)404 falls back to cluster.IsHealthynon-404 admin errors surface as errorsunavailablefull_acks_produce_unavailableunavailableandacks1_data_lossare populatedacks1_data_lossThe RFC cases reference the rolling-restart safety probes RFC's
"Scenarios" section. Reproducing the actual multi-broker cluster states
inside a single testcontainer isn't practical; Redpanda core's own
suite verifies server-side risk calculation per cluster state. What
this test pins is the operator's interpretation of the wire
contract — for each documented failure mode, the JSON the probe
returns translates to the right roll decision.
Verified locally against
redpandadata/redpanda:v26.1.1:TestIntegrationBrokerCaughtUp+TestIntegrationBrokersStillRecoveringpin the post-restart contract (10 subtests total) via httptest stubs:
fully caught up at default threshold99% not caught up at default strict threshold99% caught up at relaxed threshold (95)0% (just restarted) not caught up404 falls back to caught up (pre-25.1 broker)500 propagates as errorany broker below threshold blocksall brokers caught up returns falsededupes by broker ID across map entries404 on a broker is treated as endpoint absentTestHasRecentlyReplacedPodsininternal/lifecycle/pool_test.go— 6 cases,all pass locally.
Run with:
DOCKER_HOST=unix:///path/to/docker.sock \ TESTCONTAINERS_DOCKER_SOCKET_OVERRIDE=/var/run/docker.sock \ TESTCONTAINERS_RYUK_DISABLED=true \ GOLANG_PROTOBUF_REGISTRATION_CONFLICT=ignore \ TEST_REDPANDA_REPO=redpandadata/redpanda TEST_REDPANDA_VERSION=v26.1.1 \ nix develop -c go test ./operator/internal/controller/redpanda/ \ -run TestIntegrationBrokerSafeToRestart -v -count=1 -timeout 360sTest plan
go build ./operator/internal/probes/... ./operator/internal/controller/redpanda/... ./operator/internal/lifecycle/...— cleango veton the same — cleanTestIntegrationBrokerSafeToRestartagainst realv26.1.1— greenTestHasRecentlyReplacedPods— greentask test:unitwindow goes away
Follow-ups (not in this PR)
acks=1 case... we should prioritize making forward progress in the
rolling restart process as well, though automatically determining if
possible data loss is 'legitimate'... may prove hard." Current
default is strict (always block on
acks1_data_loss). A CR-levelknob could allow opt-in tolerance for clusters that have explicitly
accepted the trade-off. Separate PR.
PR wires only the single-cluster controller. The multicluster controller
(which we'll be removing) is unchanged.
; truemask onthe maintenance-mode drain script is the other half of the customer
scenario — see charts/redpanda: surface lifecycle-hook timeouts in pod logs #1548.
🤖 Generated with Claude Code