operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness by david-yu · Pull Request #1547 · redpanda-data/redpanda-operator

david-yu · 2026-05-21T17:26:31Z

Summary

Adopt Redpanda 25.1+'s per-broker pre-restart probe (ENG-222) as the primary
rolling-restart gate, replacing the cluster-wide health_overview.IsHealthy
heuristic with the per-broker counterfactual /v1/broker/pre_restart_probe.

This PR is the consolidated customer-blocking fix for the rolling-restart
races; it folds in the two correctness fixes from the (now-closed) #1537 so
they don't get lost.

What this PR does

1. ENG-222 — replace cluster.IsHealthy with the per-broker probe

internal/probes/broker.go — new Prober.IsBrokerSafeToRestart method.
Calls PreRestartProbe (Redpanda 25.1+) and returns safe only when none of
the dangerous risk categories is populated:
- acks1_data_loss (acks=1 producers may lose data)
- unavailable (produce + consume reject)
- full_acks_produce_unavailable (acks=-1 produce rejects)
rf1_offline is acceptable per ENG-222: RF=1 already has no redundancy.
On 404 (Redpanda < 25.1) returns ErrPreRestartProbeUnsupported so callers
can fall back. IsClusterBrokerHealthy now consults this first; falls
through to the legacy cluster-overview-with-relaxed-fallback path only when
the endpoint isn't present.

internal/controller/redpanda/redpanda_controller.go — the rolling
loop now calls a new brokerSafeToRestart helper for each pod, using
admin.BrokerIDToURL + ForHost to scope per-broker. The
else if health.IsHealthy branch is replaced by:

safe, err := brokerSafeToRestart(ctx, state.admin, brokerID, health.IsHealthy, logger, pod.GetName())
switch {
case err != nil:    // probe failed → skip pod, continue
case safe:          // probe says OK → delete and halt
default:            // probe says not safe → skip pod, continue
}

On 404 the helper falls back to health.IsHealthy so behavior on Redpanda
< 25.1 is unchanged.

2. PostRestartProbe — wait for replica recovery before rolling the next broker

The RFC's ideal sequence has six steps; we now cover both probe steps:

request maintenance → wait pre-restart probe → restart → wait available
                  → maintenance off → wait post-restart probe → next broker

HasRecentlyReplacedPods (the K8s pod-Ready gate added in #1446) still
fires first — fast, cheap, and protects against the most obvious
"pod isn't even Running yet" case. On top of that, the controller
now consults /v1/broker/post_restart_probe (load_reclaimed_pc) per
broker before the rollSet loop. A broker is treated as caught up when
its load_reclaimed_pc >= 100 (default strictest reading, per
probes.DefaultPostRestartCaughtUpPercent). Implementation:

internal/probes/broker.go — new Prober.IsBrokerCaughtUp method
- ErrPostRestartProbeUnsupported sentinel.
internal/controller/redpanda/redpanda_controller.go —
brokerCaughtUp helper (BrokerIDToURL + ForHost + PostRestartProbe;
404 → treat as caught up so older clusters behave as before) and
brokersStillRecovering outer-gate helper that iterates brokerMap,
dedupes by broker ID (the map intentionally double-keys), and returns
true if any broker is below threshold.
Probe errors are non-fatal at the gate — the K8s-Ready gate and the
per-broker pre-restart probe still protect the dangerous failure
modes.

3. Incorporated from #1537 — orphan-branch correctness

Two latent code-path bugs in the same reconcileDecommission function:

brokerMap parsing. Brokers were keyed by
strings.Split(broker.InternalRPCAddress, ".")[0]. This produces the wrong
key when the admin API returns the address as host:port (advertised RPC
port suffix) or as a bare IP — the lookup brokerMap[pod.GetName()] then
misses and the pod is silently treated as orphan. Fix: strip the port via
net.SplitHostPort and key by both the first DNS label and the raw host
(strict superset of the previous keyspace; no regression for FQDN-only
deployments).
"Pod not in brokerMap" branch. The branch deleted and set
continueExecution=true, so every unmapped pod in rollSet was deleted in
the same reconcile pass. Combined with the parsing bug above (or any
transient cause that empties AllNodes — rolling controller leadership,
brief admin-API 503s, NodeID flip after PVC reset), this was the only
single-reconcile path that could produce multiple brokers literally
Terminating at the same time. Now the branch deletes one pod and
requeues, so each pod's state is re-evaluated against fresh broker-map
data before the next one is touched.

4. Regression test for the existing `HasRecentlyReplacedPods` gate

New TestHasRecentlyReplacedPods in pool_test.go pins the contract for the
outer K8s-pod-state gate that #1446 introduced. Cases: no-op, all-old-revision,
all-Ready-new-revision (steady state), just-replaced-not-yet-Ready,
terminating pod, stuck-in-Pending.

Why all together

The changes cover four independent failure surfaces that all contribute to
the same customer-visible bug:

Concern	What's broken	What this PR does
Multiple deletes per reconcile	`!inBrokerMap` branch tear-down	one-pod-per-reconcile
Address-format misclassification	`strings.Split(...)[0]` parser	`net.SplitHostPort` + dual-key
Coarse-grained roll signal	`health.IsHealthy` (cluster-wide)	per-broker `PreRestartProbe`
K8s pod-Ready ≠ broker caught up	new broker passes pod-Ready before it has finished replaying partition state	per-broker `PostRestartProbe` outer gate

Together they make rolling restart safe under load, both for the literal
parallel-delete scenario (the orphan-branch fix) and the timing race where
the cluster reports IsHealthy=true while a broker is still partition-
recovering (the pre- and post-restart probes).

Dependencies

github.com/redpanda-data/common-go/rpadmin v0.2.7 (carries
PreRestartProbe/PostRestartProbe from common-go#170)
github.com/testcontainers/testcontainers-go v0.42.0 (already present)
Cleanup: dropped github.com/moby/moby/pkg/namesgenerator (single-line use
in cluster_controller_configuration_test.go) because the moby monolith
v28+ conflicts with the split-out moby/moby/api module pulled in via
testcontainers v0.42's transitive deps under rpadmin v0.2.7. Replaced with
a 4-byte crypto/rand hex suffix.

Tests

TestIntegrationBrokerSafeToRestart boots a real Redpanda container via
testcontainers (for the live-probe subtests) and uses httptest stubs
(for the RFC-case and error-handling subtests). Twelve subtests pin the
pre-restart contract:

Subtest	What it pins
`fresh cluster is safe to restart`	A broker with no partitions reports safe (live broker)
`RF=1 topic still counts as safe (acceptable risk)`	rf1_offline is populated but the helper still returns safe (live broker)
`404 falls back to cluster.IsHealthy`	Pre-25.1 broker → returns the cluster-health argument unchanged (both true and false directions pinned)
`non-404 admin errors surface as errors`	5xx propagates instead of silently falling back
RFC case 1: in-sync 2, offline 1 — restarting online replica makes partition leaderless	blocks on `unavailable`
RFC case 2 (in-sync restart): in-sync 2, recovering 1 — restarting in-sync replica blocks acks=-1 produce	blocks on `full_acks_produce_unavailable`
RFC case 2 (recovering-replica restart)	empty risks → safe to restart
RFC case 3: in-sync 1, offline 1, recovering 1 — restarting in-sync risks acks=1 data loss	blocks even when both `unavailable` and `acks1_data_loss` are populated
RFC case 4: in-sync 1, recovering 2 — restarting in-sync causes acks=1 data loss	blocks on `acks1_data_loss`
RF=1 acceptable + dangerous category populated	still blocked — rf1 doesn't override
all categories empty	safe (steady-state happy path)

The RFC cases reference the rolling-restart safety probes RFC's
"Scenarios" section. Reproducing the actual multi-broker cluster states
inside a single testcontainer isn't practical; Redpanda core's own
suite verifies server-side risk calculation per cluster state. What
this test pins is the operator's interpretation of the wire
contract — for each documented failure mode, the JSON the probe
returns translates to the right roll decision.

Verified locally against redpandadata/redpanda:v26.1.1:

PASS  TestIntegrationBrokerSafeToRestart (19.042s) — 12/12 subtests

TestIntegrationBrokerCaughtUp + TestIntegrationBrokersStillRecovering
pin the post-restart contract (10 subtests total) via httptest stubs:

Subtest	What it pins
`fully caught up at default threshold`	100% → caught up
`99% not caught up at default strict threshold`	strict default blocks
`99% caught up at relaxed threshold (95)`	threshold is honored
`0% (just restarted) not caught up`	freshly-restarted broker blocks
`404 falls back to caught up (pre-25.1 broker)`	older Redpanda → not blocked
`500 propagates as error`	admin failure surfaces
`any broker below threshold blocks`	gate fires if any broker is recovering
`all brokers caught up returns false`	clean steady state
`dedupes by broker ID across map entries`	brokerMap's double-key doesn't double-probe
`404 on a broker is treated as endpoint absent`	mixed-version cluster behaves safely

PASS  TestIntegrationBrokerCaughtUp (3.03s)              — 6/6
PASS  TestIntegrationBrokersStillRecovering (0.00s)      — 4/4

TestHasRecentlyReplacedPods in internal/lifecycle/pool_test.go — 6 cases,
all pass locally.

Run with:

DOCKER_HOST=unix:///path/to/docker.sock \
TESTCONTAINERS_DOCKER_SOCKET_OVERRIDE=/var/run/docker.sock \
TESTCONTAINERS_RYUK_DISABLED=true \
GOLANG_PROTOBUF_REGISTRATION_CONFLICT=ignore \
TEST_REDPANDA_REPO=redpandadata/redpanda TEST_REDPANDA_VERSION=v26.1.1 \
nix develop -c go test ./operator/internal/controller/redpanda/ \
  -run TestIntegrationBrokerSafeToRestart -v -count=1 -timeout 360s

Test plan

go build ./operator/internal/probes/... ./operator/internal/controller/redpanda/... ./operator/internal/lifecycle/... — clean
go vet on the same — clean
TestIntegrationBrokerSafeToRestart against real v26.1.1 — green
TestHasRecentlyReplacedPods — green
Full unit suite via task test:unit
Acceptance suite under load — confirm the "1 Terminating + 1 NotReady"
window goes away

Follow-ups (not in this PR)

Opt-in acks=1 tolerance. The RFC explicitly punts: "For the
acks=1 case... we should prioritize making forward progress in the
rolling restart process as well, though automatically determining if
possible data loss is 'legitimate'... may prove hard." Current
default is strict (always block on acks1_data_loss). A CR-level
knob could allow opt-in tolerance for clusters that have explicitly
accepted the trade-off. Separate PR.
PreRestartProbe equivalent for stretch / multicluster controller. This
PR wires only the single-cluster controller. The multicluster controller
(which we'll be removing) is unchanged.
Chart-side preStop drain hardening. The 45s timeout + ; true mask on
the maintenance-mode drain script is the other half of the customer
scenario — see charts/redpanda: surface lifecycle-hook timeouts in pod logs #1548.

🤖 Generated with Claude Code

Bumps common-go/rpadmin from v0.2.4 to v0.2.7, which exposes the PreRestartProbe/PostRestartProbe bindings for /v1/broker/pre_restart_probe (Redpanda 25.1+) merged in common-go#170. Wiring on the operator side: 1. internal/probes/broker.go: new Prober.IsBrokerSafeToRestart method that calls PreRestartProbe scoped to the local broker and returns safe only when none of acks1_data_loss / unavailable / full_acks_produce_unavailable is populated. rf1_offline is treated as acceptable risk. On 404 returns ErrPreRestartProbeUnsupported so callers can fall back. IsClusterBrokerHealthy now consults this first; only falls through to the legacy cluster-overview path on older brokers. 2. internal/controller/redpanda/redpanda_controller.go: new helper brokerSafeToRestart that the rolling-restart loop calls per pod (using admin.BrokerIDToURL + ForHost to scope per-broker). Replaces the cluster-wide health.IsHealthy heuristic at the roll gate. The helper falls back to the cluster-wide IsHealthy on 404 so behavior on Redpanda < 25.1 is preserved. Tests: new TestIntegrationBrokerSafeToRestart in the redpanda controller package boots a real Redpanda container via testcontainers and pins four behaviors: - fresh cluster → safe - RF=1 topic populates rf1_offline but still reports safe - 404 falls back to the cluster.IsHealthy argument (both directions) - non-404 admin errors surface as errors (not silent fallback) Verified locally against v26.1.1: PASS in 19.907s, 4/4 subtests. Build cleanup: cluster_controller_configuration_test.go used github.com/moby/moby/pkg/namesgenerator for one-off random namespace names. The moby monolith conflicts with the split-out moby/moby/api module that testcontainers v0.42 pulls in via the rpadmin v0.2.7 test dep graph. Replaced the namesgenerator call with a 4-byte crypto/rand hex suffix so the operator no longer depends on the moby monolith. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Incorporates the two correctness fixes from the (now-closed) #1537 into the ENG-222 wiring stack: - brokerMap parsing now strips a "host:port" suffix via net.SplitHostPort and keys by both the first DNS label and the raw host. Previously strings.Split(InternalRPCAddress, ".")[0] silently mis-keyed when the admin API returned the address as "host:port" or a bare IP, causing the roll loop's lookup to miss and treat the broker as orphan. - The "pod not in brokerMap" branch now deletes one pod and requeues instead of deleting + continuing. Combined with the parsing bug above (or any transient cause that empties AllNodes — rolling controller leadership, brief admin-API 503s, NodeID flip after PVC reset), the previous behavior was the only single-reconcile path that could produce multiple brokers literally Terminating at the same time. Also pins the existing HasRecentlyReplacedPods gate's contract with a new TestHasRecentlyReplacedPods unit test in pool_test.go covering the no-op, all-old-revision, all-Ready-new-revision, just-replaced-not-yet- Ready, terminating-pod, and stuck-in-Pending cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds parameterized subtests to TestIntegrationBrokerSafeToRestart that walk the four failure modes the rolling-restart RFC enumerates and assert the operator's roll/don't-roll decision for the JSON each mode produces: RFC case 1: in-sync 2, offline 1 → restarting an online replica makes the partition leaderless → probe populates `unavailable` → operator blocks ✓ RFC case 2 (in-sync restart path): in-sync 2, recovering 1 → restarting the in-sync replica leaves no acks=-1 quorum → probe populates `full_acks_produce_unavailable` → operator blocks ✓ RFC case 2 (recovering-replica restart path) → probe populates nothing — recovering replica was always at risk → operator allows ✓ RFC case 3: in-sync 1, offline 1, recovering 1 → restarting the in-sync replica with acks=1 risks data loss → probe populates both `unavailable` and `acks1_data_loss` → operator blocks on either ✓ RFC case 4: in-sync 1, recovering 2 → restarting the in-sync replica causes acks=1 data loss → probe populates `acks1_data_loss` → operator blocks ✓ Plus the documented edge cases: - rf1_offline alone → acceptable risk, operator allows - rf1_offline + a dangerous category → still blocked - all categories empty → safe (steady-state happy path) The subtests use an httptest stub that returns the engineered PreRestartCheckResult JSON for each case. Reproducing the actual multi-broker cluster states inside a single testcontainer isn't practical; Redpanda core's own integration suite verifies the probe's server-side risk calculation per cluster state. What this test verifies is the operator's *interpretation* of the contract — that the four documented failure modes each translate to the right roll decision. Verified locally against redpandadata/redpanda:v26.1.1: PASS in 19.042s, 12/12 subtests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The rolling-restart RFC's ideal sequence has six steps; #1547 wired up the pre-restart probe (step 2) but step 5 — "wait for post-restart probe" — was still served by HasRecentlyReplacedPods, which checks K8s pod-Ready as a proxy. K8s pod-Ready can flip True before the broker has finished replaying its in-sync replicas. That window is exactly where we'd kick off the next roll prematurely. This commit adds /v1/broker/post_restart_probe (load_reclaimed_pc) as an additional gate that runs after HasRecentlyReplacedPods passes and before the rollSet loop: internal/probes/broker.go: + Prober.IsBrokerCaughtUp(ctx, brokerURL, threshold) + ErrPostRestartProbeUnsupported sentinel + DefaultPostRestartCaughtUpPercent = 100 (RFC's strictest reading) internal/controller/redpanda/redpanda_controller.go: + brokerCaughtUp(ctx, admin, brokerID, threshold, logger, podName) — BrokerIDToURL + ForHost + PostRestartProbe; 404 → treat as caught up (older Redpanda) + brokersStillRecovering(ctx, admin, brokerMap, threshold, logger) — iterates brokerMap entries (dedup'd by broker ID since the map intentionally double-keys), returns true if any broker is below threshold + reconcileDecommission now calls brokersStillRecovering after HasRecentlyReplacedPods has cleared. Probe errors are non-fatal (the K8s-Ready gate and per-broker pre-restart probe still protect the dangerous failure modes). Tests in redpanda_restart_probe_test.go (10 new subtests): TestIntegrationBrokerCaughtUp - 100% → caught up at default threshold - 99% → not caught up at default strict threshold - 99% → caught up at relaxed threshold (95) - 0% (just restarted) → not caught up - 404 → falls back to caught up (pre-25.1 broker) - 500 → propagates as error TestIntegrationBrokersStillRecovering - any broker below threshold blocks - all brokers caught up returns false - dedupes by broker ID across brokerMap's double-key entries - 404 on a broker is treated as endpoint absent (not recovering) All pass locally, 4.328s total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…aram) CI's golangci-lint flagged brokersStillRecovering's threshold parameter as unparam — every caller (one in the controller, three in the test) passes DefaultPostRestartCaughtUpPercent (100). The parameter was speculative for "what if we someday want tunable rolling-restart tolerance" but the call sites converged on the strictest reading. Hardcoding probes.DefaultPostRestartCaughtUpPercent inside the helper keeps brokerCaughtUp's threshold (which the tests exercise with 100, 99, 95, 0) and removes the unused-flexibility surface from the gate helper. The doc comment now explains why the threshold was lifted out and what would have to change if we ever wanted to plumb it back in. Also trimmed an unrelated sentence from the changelog entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The rpadmin v0.2.4 → v0.2.7 bump cascaded into transitive-dep updates across the workspace. CI's task generate runs: - mod:tidy across all 13 modules in go.work - dev:update-licenses (adds BSL boilerplate to generated proto files that don't yet have it) - generate:third-party-licenses-list (licenses/third_party.md) and the ci:lint git diff --exit-code then flagged the deltas: acceptance/go.sum charts/{connectors,console,redpanda}/go.{mod,sum} gen/go.sum gotohelm/go.{mod,sum} gotohelm/testdata/src/example/go.{mod,sum} harpoon/go.{mod,sum} pkg/go.{mod,sum} pkg/multicluster/leaderelection/proto/gen/transport/v1/*.pb.go licenses/third_party.md No source code changes — purely the regenerated artifacts that match what CI produces. The two .pb.go files only gained the mandatory BSL boilerplate header (licenseupdater added it; the generated proto bodies are unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

My previous commit ran licenseupdater locally, which added the BSL boilerplate header to two generated proto files (message.pb.go, message_grpc.pb.go). But CI's task generate runs licenseupdater followed by `buf generate`, and `buf generate` regenerates those files from scratch — stripping the header back out. The canonical state is the unheaded one. Reverting the header addition so my branch matches the post-buf-generate output that CI produces. The proto file bodies were unchanged either way. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The sidecar's /healthz is K8s readiness — "am I serving traffic?". The pre-restart probe is a stricter signal — "would restarting me cause partition risks?". During cluster bootstrap (internal-topic creation, partition leader election) the pre-restart probe transiently reports unavailable/acks1_data_loss/full_acks_produce_unavailable even though the broker is happily serving Kafka traffic. My earlier IsClusterBrokerHealthy change wired PreRestartProbe in as the primary gate, which kept pods NotReady through bootstrap and made helm installs time out (`Error: INSTALLATION FAILED: context deadline exceeded`) in the integration suite — TestIntegrationStatefulSetDecommissioner, TestIntegrationFactoryOperatorV1, TestIntegrationClientFactoryTLSListeners, TestIntegrationRedpandaController/TestLicense, etc. all failed with that shape on PR #1547. This commit restores IsClusterBrokerHealthy to its prior logic (cluster.IsHealthy → relaxed broker-local fallback). Prober still exposes IsBrokerSafeToRestart for callers that genuinely want the restart-safety signal — its primary user is the operator's roll-gate helper brokerSafeToRestart, which is the right place to use the per-broker probe. PostRestartProbe wiring (brokerCaughtUp / brokersStillRecovering) is untouched — those are operator-side only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu and others added 2 commits May 21, 2026 10:25

david-yu changed the title ~~operator: adopt per-broker pre-restart probe (ENG-222)~~ operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness May 21, 2026

david-yu mentioned this pull request May 21, 2026

charts/redpanda: surface lifecycle-hook timeouts in pod logs #1548

Open

6 tasks

david-yu and others added 6 commits May 21, 2026 10:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness#1547

operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness#1547
david-yu wants to merge 8 commits into
mainfrom
dyu/eng-222-prestart-probes

david-yu commented May 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

david-yu commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What this PR does

1. ENG-222 — replace cluster.IsHealthy with the per-broker probe

2. PostRestartProbe — wait for replica recovery before rolling the next broker

3. Incorporated from #1537 — orphan-branch correctness

4. Regression test for the existing HasRecentlyReplacedPods gate

Why all together

Dependencies

Tests

Test plan

Follow-ups (not in this PR)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

david-yu commented May 21, 2026 •

edited

Loading

4. Regression test for the existing `HasRecentlyReplacedPods` gate