Skip to content

operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness#1547

Draft
david-yu wants to merge 8 commits into
mainfrom
dyu/eng-222-prestart-probes
Draft

operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness#1547
david-yu wants to merge 8 commits into
mainfrom
dyu/eng-222-prestart-probes

Conversation

@david-yu
Copy link
Copy Markdown
Contributor

@david-yu david-yu commented May 21, 2026

Summary

Adopt Redpanda 25.1+'s per-broker pre-restart probe (ENG-222) as the primary
rolling-restart gate, replacing the cluster-wide health_overview.IsHealthy
heuristic with the per-broker counterfactual /v1/broker/pre_restart_probe.

This PR is the consolidated customer-blocking fix for the rolling-restart
races; it folds in the two correctness fixes from the (now-closed) #1537 so
they don't get lost.

What this PR does

1. ENG-222 — replace cluster.IsHealthy with the per-broker probe

  • internal/probes/broker.go — new Prober.IsBrokerSafeToRestart method.
    Calls PreRestartProbe (Redpanda 25.1+) and returns safe only when none of
    the dangerous risk categories is populated:

    • acks1_data_loss (acks=1 producers may lose data)
    • unavailable (produce + consume reject)
    • full_acks_produce_unavailable (acks=-1 produce rejects)

    rf1_offline is acceptable per ENG-222: RF=1 already has no redundancy.
    On 404 (Redpanda < 25.1) returns ErrPreRestartProbeUnsupported so callers
    can fall back. IsClusterBrokerHealthy now consults this first; falls
    through to the legacy cluster-overview-with-relaxed-fallback path only when
    the endpoint isn't present.

  • internal/controller/redpanda/redpanda_controller.go — the rolling
    loop now calls a new brokerSafeToRestart helper for each pod, using
    admin.BrokerIDToURL + ForHost to scope per-broker. The
    else if health.IsHealthy branch is replaced by:

    safe, err := brokerSafeToRestart(ctx, state.admin, brokerID, health.IsHealthy, logger, pod.GetName())
    switch {
    case err != nil:    // probe failed → skip pod, continue
    case safe:          // probe says OK → delete and halt
    default:            // probe says not safe → skip pod, continue
    }

    On 404 the helper falls back to health.IsHealthy so behavior on Redpanda
    < 25.1 is unchanged.

2. PostRestartProbe — wait for replica recovery before rolling the next broker

The RFC's ideal sequence has six steps; we now cover both probe steps:

request maintenance → wait pre-restart probe → restart → wait available
                  → maintenance off → wait post-restart probe → next broker

HasRecentlyReplacedPods (the K8s pod-Ready gate added in #1446) still
fires first — fast, cheap, and protects against the most obvious
"pod isn't even Running yet" case. On top of that, the controller
now consults /v1/broker/post_restart_probe (load_reclaimed_pc) per
broker before the rollSet loop. A broker is treated as caught up when
its load_reclaimed_pc >= 100 (default strictest reading, per
probes.DefaultPostRestartCaughtUpPercent). Implementation:

  • internal/probes/broker.go — new Prober.IsBrokerCaughtUp method
    • ErrPostRestartProbeUnsupported sentinel.
  • internal/controller/redpanda/redpanda_controller.go
    brokerCaughtUp helper (BrokerIDToURL + ForHost + PostRestartProbe;
    404 → treat as caught up so older clusters behave as before) and
    brokersStillRecovering outer-gate helper that iterates brokerMap,
    dedupes by broker ID (the map intentionally double-keys), and returns
    true if any broker is below threshold.
  • Probe errors are non-fatal at the gate — the K8s-Ready gate and the
    per-broker pre-restart probe still protect the dangerous failure
    modes.

3. Incorporated from #1537 — orphan-branch correctness

Two latent code-path bugs in the same reconcileDecommission function:

  • brokerMap parsing. Brokers were keyed by
    strings.Split(broker.InternalRPCAddress, ".")[0]. This produces the wrong
    key when the admin API returns the address as host:port (advertised RPC
    port suffix) or as a bare IP — the lookup brokerMap[pod.GetName()] then
    misses and the pod is silently treated as orphan. Fix: strip the port via
    net.SplitHostPort and key by both the first DNS label and the raw host
    (strict superset of the previous keyspace; no regression for FQDN-only
    deployments).

  • "Pod not in brokerMap" branch. The branch deleted and set
    continueExecution=true, so every unmapped pod in rollSet was deleted in
    the same reconcile pass. Combined with the parsing bug above (or any
    transient cause that empties AllNodes — rolling controller leadership,
    brief admin-API 503s, NodeID flip after PVC reset), this was the only
    single-reconcile path that could produce multiple brokers literally
    Terminating at the same time. Now the branch deletes one pod and
    requeues, so each pod's state is re-evaluated against fresh broker-map
    data before the next one is touched.

4. Regression test for the existing HasRecentlyReplacedPods gate

New TestHasRecentlyReplacedPods in pool_test.go pins the contract for the
outer K8s-pod-state gate that #1446 introduced. Cases: no-op, all-old-revision,
all-Ready-new-revision (steady state), just-replaced-not-yet-Ready,
terminating pod, stuck-in-Pending.

Why all together

The changes cover four independent failure surfaces that all contribute to
the same customer-visible bug:

Concern What's broken What this PR does
Multiple deletes per reconcile !inBrokerMap branch tear-down one-pod-per-reconcile
Address-format misclassification strings.Split(...)[0] parser net.SplitHostPort + dual-key
Coarse-grained roll signal health.IsHealthy (cluster-wide) per-broker PreRestartProbe
K8s pod-Ready ≠ broker caught up new broker passes pod-Ready before it has finished replaying partition state per-broker PostRestartProbe outer gate

Together they make rolling restart safe under load, both for the literal
parallel-delete scenario (the orphan-branch fix) and the timing race where
the cluster reports IsHealthy=true while a broker is still partition-
recovering (the pre- and post-restart probes).

Dependencies

  • github.com/redpanda-data/common-go/rpadmin v0.2.7 (carries
    PreRestartProbe/PostRestartProbe from common-go#170)
  • github.com/testcontainers/testcontainers-go v0.42.0 (already present)
  • Cleanup: dropped github.com/moby/moby/pkg/namesgenerator (single-line use
    in cluster_controller_configuration_test.go) because the moby monolith
    v28+ conflicts with the split-out moby/moby/api module pulled in via
    testcontainers v0.42's transitive deps under rpadmin v0.2.7. Replaced with
    a 4-byte crypto/rand hex suffix.

Tests

TestIntegrationBrokerSafeToRestart boots a real Redpanda container via
testcontainers (for the live-probe subtests) and uses httptest stubs
(for the RFC-case and error-handling subtests). Twelve subtests pin the
pre-restart contract:

Subtest What it pins
fresh cluster is safe to restart A broker with no partitions reports safe (live broker)
RF=1 topic still counts as safe (acceptable risk) rf1_offline is populated but the helper still returns safe (live broker)
404 falls back to cluster.IsHealthy Pre-25.1 broker → returns the cluster-health argument unchanged (both true and false directions pinned)
non-404 admin errors surface as errors 5xx propagates instead of silently falling back
RFC case 1: in-sync 2, offline 1 — restarting online replica makes partition leaderless blocks on unavailable
RFC case 2 (in-sync restart): in-sync 2, recovering 1 — restarting in-sync replica blocks acks=-1 produce blocks on full_acks_produce_unavailable
RFC case 2 (recovering-replica restart) empty risks → safe to restart
RFC case 3: in-sync 1, offline 1, recovering 1 — restarting in-sync risks acks=1 data loss blocks even when both unavailable and acks1_data_loss are populated
RFC case 4: in-sync 1, recovering 2 — restarting in-sync causes acks=1 data loss blocks on acks1_data_loss
RF=1 acceptable + dangerous category populated still blocked — rf1 doesn't override
all categories empty safe (steady-state happy path)

The RFC cases reference the rolling-restart safety probes RFC's
"Scenarios" section. Reproducing the actual multi-broker cluster states
inside a single testcontainer isn't practical; Redpanda core's own
suite verifies server-side risk calculation per cluster state. What
this test pins is the operator's interpretation of the wire
contract — for each documented failure mode, the JSON the probe
returns translates to the right roll decision.

Verified locally against redpandadata/redpanda:v26.1.1:

PASS  TestIntegrationBrokerSafeToRestart (19.042s) — 12/12 subtests

TestIntegrationBrokerCaughtUp + TestIntegrationBrokersStillRecovering
pin the post-restart contract (10 subtests total) via httptest stubs:

Subtest What it pins
fully caught up at default threshold 100% → caught up
99% not caught up at default strict threshold strict default blocks
99% caught up at relaxed threshold (95) threshold is honored
0% (just restarted) not caught up freshly-restarted broker blocks
404 falls back to caught up (pre-25.1 broker) older Redpanda → not blocked
500 propagates as error admin failure surfaces
any broker below threshold blocks gate fires if any broker is recovering
all brokers caught up returns false clean steady state
dedupes by broker ID across map entries brokerMap's double-key doesn't double-probe
404 on a broker is treated as endpoint absent mixed-version cluster behaves safely
PASS  TestIntegrationBrokerCaughtUp (3.03s)              — 6/6
PASS  TestIntegrationBrokersStillRecovering (0.00s)      — 4/4

TestHasRecentlyReplacedPods in internal/lifecycle/pool_test.go — 6 cases,
all pass locally.

Run with:

DOCKER_HOST=unix:///path/to/docker.sock \
TESTCONTAINERS_DOCKER_SOCKET_OVERRIDE=/var/run/docker.sock \
TESTCONTAINERS_RYUK_DISABLED=true \
GOLANG_PROTOBUF_REGISTRATION_CONFLICT=ignore \
TEST_REDPANDA_REPO=redpandadata/redpanda TEST_REDPANDA_VERSION=v26.1.1 \
nix develop -c go test ./operator/internal/controller/redpanda/ \
  -run TestIntegrationBrokerSafeToRestart -v -count=1 -timeout 360s

Test plan

  • go build ./operator/internal/probes/... ./operator/internal/controller/redpanda/... ./operator/internal/lifecycle/... — clean
  • go vet on the same — clean
  • TestIntegrationBrokerSafeToRestart against real v26.1.1 — green
  • TestHasRecentlyReplacedPods — green
  • Full unit suite via task test:unit
  • Acceptance suite under load — confirm the "1 Terminating + 1 NotReady"
    window goes away

Follow-ups (not in this PR)

  • Opt-in acks=1 tolerance. The RFC explicitly punts: "For the
    acks=1 case... we should prioritize making forward progress in the
    rolling restart process as well, though automatically determining if
    possible data loss is 'legitimate'... may prove hard."
    Current
    default is strict (always block on acks1_data_loss). A CR-level
    knob could allow opt-in tolerance for clusters that have explicitly
    accepted the trade-off. Separate PR.
  • PreRestartProbe equivalent for stretch / multicluster controller. This
    PR wires only the single-cluster controller. The multicluster controller
    (which we'll be removing) is unchanged.
  • Chart-side preStop drain hardening. The 45s timeout + ; true mask on
    the maintenance-mode drain script is the other half of the customer
    scenario — see charts/redpanda: surface lifecycle-hook timeouts in pod logs #1548.

🤖 Generated with Claude Code

david-yu and others added 2 commits May 21, 2026 10:25
Bumps common-go/rpadmin from v0.2.4 to v0.2.7, which exposes the
PreRestartProbe/PostRestartProbe bindings for /v1/broker/pre_restart_probe
(Redpanda 25.1+) merged in common-go#170.

Wiring on the operator side:

1. internal/probes/broker.go: new Prober.IsBrokerSafeToRestart method
   that calls PreRestartProbe scoped to the local broker and returns
   safe only when none of acks1_data_loss / unavailable /
   full_acks_produce_unavailable is populated. rf1_offline is treated
   as acceptable risk. On 404 returns ErrPreRestartProbeUnsupported so
   callers can fall back. IsClusterBrokerHealthy now consults this
   first; only falls through to the legacy cluster-overview path on
   older brokers.

2. internal/controller/redpanda/redpanda_controller.go: new helper
   brokerSafeToRestart that the rolling-restart loop calls per pod
   (using admin.BrokerIDToURL + ForHost to scope per-broker). Replaces
   the cluster-wide health.IsHealthy heuristic at the roll gate. The
   helper falls back to the cluster-wide IsHealthy on 404 so behavior
   on Redpanda < 25.1 is preserved.

Tests: new TestIntegrationBrokerSafeToRestart in the redpanda
controller package boots a real Redpanda container via testcontainers
and pins four behaviors:
  - fresh cluster → safe
  - RF=1 topic populates rf1_offline but still reports safe
  - 404 falls back to the cluster.IsHealthy argument (both directions)
  - non-404 admin errors surface as errors (not silent fallback)

Verified locally against v26.1.1:
  PASS in 19.907s, 4/4 subtests.

Build cleanup: cluster_controller_configuration_test.go used
github.com/moby/moby/pkg/namesgenerator for one-off random namespace
names. The moby monolith conflicts with the split-out moby/moby/api
module that testcontainers v0.42 pulls in via the rpadmin v0.2.7 test
dep graph. Replaced the namesgenerator call with a 4-byte crypto/rand
hex suffix so the operator no longer depends on the moby monolith.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Incorporates the two correctness fixes from the (now-closed) #1537 into
the ENG-222 wiring stack:

  - brokerMap parsing now strips a "host:port" suffix via net.SplitHostPort
    and keys by both the first DNS label and the raw host. Previously
    strings.Split(InternalRPCAddress, ".")[0] silently mis-keyed when the
    admin API returned the address as "host:port" or a bare IP, causing
    the roll loop's lookup to miss and treat the broker as orphan.

  - The "pod not in brokerMap" branch now deletes one pod and requeues
    instead of deleting + continuing. Combined with the parsing bug above
    (or any transient cause that empties AllNodes — rolling controller
    leadership, brief admin-API 503s, NodeID flip after PVC reset), the
    previous behavior was the only single-reconcile path that could
    produce multiple brokers literally Terminating at the same time.

Also pins the existing HasRecentlyReplacedPods gate's contract with a
new TestHasRecentlyReplacedPods unit test in pool_test.go covering the
no-op, all-old-revision, all-Ready-new-revision, just-replaced-not-yet-
Ready, terminating-pod, and stuck-in-Pending cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu david-yu changed the title operator: adopt per-broker pre-restart probe (ENG-222) operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness May 21, 2026
david-yu and others added 6 commits May 21, 2026 10:59
Adds parameterized subtests to TestIntegrationBrokerSafeToRestart that
walk the four failure modes the rolling-restart RFC enumerates and
assert the operator's roll/don't-roll decision for the JSON each mode
produces:

  RFC case 1: in-sync 2, offline 1
    → restarting an online replica makes the partition leaderless
    → probe populates `unavailable`
    → operator blocks ✓

  RFC case 2 (in-sync restart path): in-sync 2, recovering 1
    → restarting the in-sync replica leaves no acks=-1 quorum
    → probe populates `full_acks_produce_unavailable`
    → operator blocks ✓

  RFC case 2 (recovering-replica restart path)
    → probe populates nothing — recovering replica was always at risk
    → operator allows ✓

  RFC case 3: in-sync 1, offline 1, recovering 1
    → restarting the in-sync replica with acks=1 risks data loss
    → probe populates both `unavailable` and `acks1_data_loss`
    → operator blocks on either ✓

  RFC case 4: in-sync 1, recovering 2
    → restarting the in-sync replica causes acks=1 data loss
    → probe populates `acks1_data_loss`
    → operator blocks ✓

Plus the documented edge cases:
  - rf1_offline alone → acceptable risk, operator allows
  - rf1_offline + a dangerous category → still blocked
  - all categories empty → safe (steady-state happy path)

The subtests use an httptest stub that returns the engineered
PreRestartCheckResult JSON for each case. Reproducing the actual
multi-broker cluster states inside a single testcontainer isn't
practical; Redpanda core's own integration suite verifies the
probe's server-side risk calculation per cluster state. What this
test verifies is the operator's *interpretation* of the contract —
that the four documented failure modes each translate to the right
roll decision.

Verified locally against redpandadata/redpanda:v26.1.1:
  PASS in 19.042s, 12/12 subtests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rolling-restart RFC's ideal sequence has six steps; #1547 wired up
the pre-restart probe (step 2) but step 5 — "wait for post-restart
probe" — was still served by HasRecentlyReplacedPods, which checks K8s
pod-Ready as a proxy. K8s pod-Ready can flip True before the broker
has finished replaying its in-sync replicas. That window is exactly
where we'd kick off the next roll prematurely.

This commit adds /v1/broker/post_restart_probe (load_reclaimed_pc) as
an additional gate that runs after HasRecentlyReplacedPods passes and
before the rollSet loop:

  internal/probes/broker.go:
    + Prober.IsBrokerCaughtUp(ctx, brokerURL, threshold)
    + ErrPostRestartProbeUnsupported sentinel
    + DefaultPostRestartCaughtUpPercent = 100 (RFC's strictest reading)

  internal/controller/redpanda/redpanda_controller.go:
    + brokerCaughtUp(ctx, admin, brokerID, threshold, logger, podName)
      — BrokerIDToURL + ForHost + PostRestartProbe; 404 → treat as
      caught up (older Redpanda)
    + brokersStillRecovering(ctx, admin, brokerMap, threshold, logger)
      — iterates brokerMap entries (dedup'd by broker ID since the
      map intentionally double-keys), returns true if any broker is
      below threshold
    + reconcileDecommission now calls brokersStillRecovering after
      HasRecentlyReplacedPods has cleared. Probe errors are
      non-fatal (the K8s-Ready gate and per-broker pre-restart probe
      still protect the dangerous failure modes).

Tests in redpanda_restart_probe_test.go (10 new subtests):

  TestIntegrationBrokerCaughtUp
    - 100% → caught up at default threshold
    - 99% → not caught up at default strict threshold
    - 99% → caught up at relaxed threshold (95)
    - 0% (just restarted) → not caught up
    - 404 → falls back to caught up (pre-25.1 broker)
    - 500 → propagates as error

  TestIntegrationBrokersStillRecovering
    - any broker below threshold blocks
    - all brokers caught up returns false
    - dedupes by broker ID across brokerMap's double-key entries
    - 404 on a broker is treated as endpoint absent (not recovering)

All pass locally, 4.328s total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aram)

CI's golangci-lint flagged brokersStillRecovering's threshold parameter
as unparam — every caller (one in the controller, three in the test)
passes DefaultPostRestartCaughtUpPercent (100). The parameter was
speculative for "what if we someday want tunable rolling-restart
tolerance" but the call sites converged on the strictest reading.

Hardcoding probes.DefaultPostRestartCaughtUpPercent inside the helper
keeps brokerCaughtUp's threshold (which the tests exercise with 100,
99, 95, 0) and removes the unused-flexibility surface from the gate
helper. The doc comment now explains why the threshold was lifted out
and what would have to change if we ever wanted to plumb it back in.

Also trimmed an unrelated sentence from the changelog entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rpadmin v0.2.4 → v0.2.7 bump cascaded into transitive-dep
updates across the workspace. CI's task generate runs:

  - mod:tidy across all 13 modules in go.work
  - dev:update-licenses (adds BSL boilerplate to generated proto files
    that don't yet have it)
  - generate:third-party-licenses-list (licenses/third_party.md)

and the ci:lint git diff --exit-code then flagged the deltas:

  acceptance/go.sum
  charts/{connectors,console,redpanda}/go.{mod,sum}
  gen/go.sum
  gotohelm/go.{mod,sum}
  gotohelm/testdata/src/example/go.{mod,sum}
  harpoon/go.{mod,sum}
  pkg/go.{mod,sum}
  pkg/multicluster/leaderelection/proto/gen/transport/v1/*.pb.go
  licenses/third_party.md

No source code changes — purely the regenerated artifacts that
match what CI produces. The two .pb.go files only gained the
mandatory BSL boilerplate header (licenseupdater added it; the
generated proto bodies are unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My previous commit ran licenseupdater locally, which added the BSL
boilerplate header to two generated proto files
(message.pb.go, message_grpc.pb.go). But CI's task generate runs
licenseupdater followed by `buf generate`, and `buf generate`
regenerates those files from scratch — stripping the header back out.

The canonical state is the unheaded one. Reverting the header
addition so my branch matches the post-buf-generate output that CI
produces. The proto file bodies were unchanged either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The sidecar's /healthz is K8s readiness — "am I serving traffic?". The
pre-restart probe is a stricter signal — "would restarting me cause
partition risks?". During cluster bootstrap (internal-topic creation,
partition leader election) the pre-restart probe transiently reports
unavailable/acks1_data_loss/full_acks_produce_unavailable even though
the broker is happily serving Kafka traffic. My earlier IsClusterBrokerHealthy
change wired PreRestartProbe in as the primary gate, which kept pods
NotReady through bootstrap and made helm installs time out
(`Error: INSTALLATION FAILED: context deadline exceeded`) in the
integration suite — TestIntegrationStatefulSetDecommissioner,
TestIntegrationFactoryOperatorV1, TestIntegrationClientFactoryTLSListeners,
TestIntegrationRedpandaController/TestLicense, etc. all failed with
that shape on PR #1547.

This commit restores IsClusterBrokerHealthy to its prior logic
(cluster.IsHealthy → relaxed broker-local fallback). Prober still
exposes IsBrokerSafeToRestart for callers that genuinely want the
restart-safety signal — its primary user is the operator's
roll-gate helper brokerSafeToRestart, which is the right place to
use the per-broker probe.

PostRestartProbe wiring (brokerCaughtUp / brokersStillRecovering) is
untouched — those are operator-side only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant