Skip to content

fix(server): make MonitorHealthChan resilient to slow and closed notifiers#115

Draft
miotte wants to merge 1 commit into
mainfrom
miotte-pr12
Draft

fix(server): make MonitorHealthChan resilient to slow and closed notifiers#115
miotte wants to merge 1 commit into
mainfrom
miotte-pr12

Conversation

@miotte
Copy link
Copy Markdown
Contributor

@miotte miotte commented May 31, 2026

Fixes #114

MonitorHealthChan could stall every health broadcast on one busy client and panic on a send to a closed notifier. Deliver each code non-blocking through a sendHealthCode helper that recovers, and give the per-client notifier a buffer of 1. Adds internal regression tests for both failure modes.

@miotte
Copy link
Copy Markdown
Contributor Author

miotte commented May 31, 2026

Defect #2 (send on a closed channel): clarification + reclassification

Reclassification — this is reachable, not latent. The PR/issue/commit currently describe both defects as latent because MonitorHealthChan has no caller in this repo. That's misleading for the full picture: MonitorHealthChan is exported API. So a routine client disconnect — or a GOAWAY-time mass disconnect — that races a broadcast can panic the monitor goroutine and crash the consuming process. (The "not wired/latent" wording in the description and commit message should be corrected to say so.)

Is the panic "just how a channel works"? Yes and no, and the distinction is the whole point:

  • The panic mechanism is correct, intentional Go behavior — sending on a closed channel is defined to panic. The runtime is doing the right thing.
  • But the panic is a symptom of a real ownership bug in this code: the notifier channel is closed by Check (defer) and by notifyHealth (on re-registration), while a separate goroutine — MonitorHealthChan — sends to it, with no synchronization across the close and the send (the ConcurrentMap lock is released by Get before the send happens). That is the classic "a non-owner sends to a channel that owners close" anti-pattern; the panic is Go surfacing it.

Why recover rather than a redesign. Go offers no safe test-then-send: you cannot check "is this channel closed?" and then send, because the close can land between the check and the send. So for a broadcaster that sends to channels it does not own and whose closing it cannot control, there are exactly two options:

  1. Restructure ownership so close and send are mutually exclusive (single owner, or a lock held across both), or
  2. recover from the panic.

Option 1 is the more correct fix, but it is materially more invasive here: there are two closers (Check's defer and notifyHealth's replace-close); notifyHealth's close-on-re-registration is behavior covered by tests in #112, so it can't simply be removed; and it needs a locking protocol (close + send serialized under a shared lock) that ConcurrentMap does not currently expose. For a live crash bug, the right immediate move is the minimal, contained fix: recover stops the crash, changes no exported signatures, alters no happy-path behavior, and has a tiny blast radius. Recommendation: keep recover here and track the ownership/locking redesign as a separate follow-up rather than rush it into the crash-fix.

**One behavior change to weigh ** The non-blocking send means a notification can be dropped if a client's reader is momentarily busy (e.g. mid stream.Send). For best-effort health that's usually fine, but GOAWAY is meaningful — a dropped GOAWAY means that client never learns the server is leaving. The blocking original guaranteed delivery, but at the cost of head-of-line-blocking every client. If guaranteed GOAWAY delivery matters to consumers, the options are: give the per-client notifier channel (created in Check) a small buffer, or use a short bounded send timeout instead of a pure non-blocking drop. Happy to do either — flagging for input.

…fiers

MonitorHealthChan broadcasts health codes to every registered Healthz
client with a blocking send on an unbuffered channel, in a sequential
loop. A client whose reader was busy (e.g. still in a blocking
stream.Send to a slow connection) would block that send and stall the
entire broadcast loop, starving all other clients. Separately, the
notifier channel is closed when a client disconnects (Check's defer) or
re-registers (notifyHealth); because the registry lock is not held
across the lookup and the send, a send can race that close and panic
the monitor goroutine on a routine disconnect, taking down the process.

MonitorHealthChan is exported and arke is consumed as a library, so
these are reachable in production, not theoretical: a disconnect during
a broadcast (e.g. a GOAWAY-time mass disconnect) can crash a consuming
process.

Route each delivery through a new sendHealthCode helper that sends
non-blocking (dropping with a debug log if the receiver is not ready)
and recovers from a send on a closed channel: one slow or disconnecting
client can no longer block or crash notifications to the rest. Give the
per-client notifier channel a buffer of 1 so a broadcast still reaches a
reader that is momentarily busy rather than being dropped, without
reintroducing head-of-line blocking.

Add internal regression tests for the non-blocking and closed-channel
behaviors; both were confirmed to fail/panic against the prior
implementation.

Note: a fuller fix for the close/send race is to redesign channel
ownership (single owner closing, coordinated with senders under a lock);
recover is the minimal, contained fix that stops the crash without
changing exported signatures or happy-path behavior. Tracked separately.

Fixes #114

Signed-off-by: Michael Otteni <MichaelGOtteni@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MonitorHealthChan can stall all health broadcasts and panic on a closed notifier

1 participant