fix(server): make MonitorHealthChan resilient to slow and closed notifiers by miotte · Pull Request #115 · sassoftware/arke

miotte · 2026-05-31T20:19:21Z

Fixes #114

MonitorHealthChan could stall every health broadcast on one busy client and panic on a send to a closed notifier. Deliver each code non-blocking through a sendHealthCode helper that recovers, and give the per-client notifier a buffer of 1. Adds internal regression tests for both failure modes.

miotte · 2026-05-31T20:34:13Z

Defect #2 (send on a closed channel): clarification + reclassification

Reclassification — this is reachable, not latent. The PR/issue/commit currently describe both defects as latent because MonitorHealthChan has no caller in this repo. That's misleading for the full picture: MonitorHealthChan is exported API. So a routine client disconnect — or a GOAWAY-time mass disconnect — that races a broadcast can panic the monitor goroutine and crash the consuming process. (The "not wired/latent" wording in the description and commit message should be corrected to say so.)

Is the panic "just how a channel works"? Yes and no, and the distinction is the whole point:

The panic mechanism is correct, intentional Go behavior — sending on a closed channel is defined to panic. The runtime is doing the right thing.
But the panic is a symptom of a real ownership bug in this code: the notifier channel is closed by Check (defer) and by notifyHealth (on re-registration), while a separate goroutine — MonitorHealthChan — sends to it, with no synchronization across the close and the send (the ConcurrentMap lock is released by Get before the send happens). That is the classic "a non-owner sends to a channel that owners close" anti-pattern; the panic is Go surfacing it.

Why recover rather than a redesign. Go offers no safe test-then-send: you cannot check "is this channel closed?" and then send, because the close can land between the check and the send. So for a broadcaster that sends to channels it does not own and whose closing it cannot control, there are exactly two options:

Restructure ownership so close and send are mutually exclusive (single owner, or a lock held across both), or
recover from the panic.

Option 1 is the more correct fix, but it is materially more invasive here: there are two closers (Check's defer and notifyHealth's replace-close); notifyHealth's close-on-re-registration is behavior covered by tests in #112, so it can't simply be removed; and it needs a locking protocol (close + send serialized under a shared lock) that ConcurrentMap does not currently expose. For a live crash bug, the right immediate move is the minimal, contained fix: recover stops the crash, changes no exported signatures, alters no happy-path behavior, and has a tiny blast radius. Recommendation: keep recover here and track the ownership/locking redesign as a separate follow-up rather than rush it into the crash-fix.

**One behavior change to weigh ** The non-blocking send means a notification can be dropped if a client's reader is momentarily busy (e.g. mid stream.Send). For best-effort health that's usually fine, but GOAWAY is meaningful — a dropped GOAWAY means that client never learns the server is leaving. The blocking original guaranteed delivery, but at the cost of head-of-line-blocking every client. If guaranteed GOAWAY delivery matters to consumers, the options are: give the per-client notifier channel (created in Check) a small buffer, or use a short bounded send timeout instead of a pure non-blocking drop. Happy to do either — flagging for input.

…fiers MonitorHealthChan broadcasts health codes to every registered Healthz client with a blocking send on an unbuffered channel, in a sequential loop. A client whose reader was busy (e.g. still in a blocking stream.Send to a slow connection) would block that send and stall the entire broadcast loop, starving all other clients. Separately, the notifier channel is closed when a client disconnects (Check's defer) or re-registers (notifyHealth); because the registry lock is not held across the lookup and the send, a send can race that close and panic the monitor goroutine on a routine disconnect, taking down the process. MonitorHealthChan is exported and arke is consumed as a library, so these are reachable in production, not theoretical: a disconnect during a broadcast (e.g. a GOAWAY-time mass disconnect) can crash a consuming process. Route each delivery through a new sendHealthCode helper that sends non-blocking (dropping with a debug log if the receiver is not ready) and recovers from a send on a closed channel: one slow or disconnecting client can no longer block or crash notifications to the rest. Give the per-client notifier channel a buffer of 1 so a broadcast still reaches a reader that is momentarily busy rather than being dropped, without reintroducing head-of-line blocking. Add internal regression tests for the non-blocking and closed-channel behaviors; both were confirmed to fail/panic against the prior implementation. Note: a fuller fix for the close/send race is to redesign channel ownership (single owner closing, coordinated with senders under a lock); recover is the minimal, contained fix that stops the crash without changing exported signatures or happy-path behavior. Tracked separately. Fixes #114 Signed-off-by: Michael Otteni <MichaelGOtteni@gmail.com>

miotte force-pushed the miotte-pr12 branch from 2efa198 to 45462d3 Compare May 31, 2026 20:38

miotte mentioned this pull request Jun 1, 2026

MonitorHealthChan can stall all health broadcasts and panic on a closed notifier #114

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): make MonitorHealthChan resilient to slow and closed notifiers#115

fix(server): make MonitorHealthChan resilient to slow and closed notifiers#115
miotte wants to merge 1 commit into
mainfrom
miotte-pr12

miotte commented May 31, 2026 •

edited

Loading

Uh oh!

miotte commented May 31, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

miotte commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

miotte commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Defect #2 (send on a closed channel): clarification + reclassification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

miotte commented May 31, 2026 •

edited

Loading

miotte commented May 31, 2026 •

edited

Loading