fix(server): make MonitorHealthChan resilient to slow and closed notifiers#115
fix(server): make MonitorHealthChan resilient to slow and closed notifiers#115miotte wants to merge 1 commit into
Conversation
Defect #2 (send on a closed channel): clarification + reclassificationReclassification — this is reachable, not latent. The PR/issue/commit currently describe both defects as latent because Is the panic "just how a channel works"? Yes and no, and the distinction is the whole point:
Why
Option 1 is the more correct fix, but it is materially more invasive here: there are two closers ( **One behavior change to weigh ** The non-blocking send means a notification can be dropped if a client's reader is momentarily busy (e.g. mid |
…fiers MonitorHealthChan broadcasts health codes to every registered Healthz client with a blocking send on an unbuffered channel, in a sequential loop. A client whose reader was busy (e.g. still in a blocking stream.Send to a slow connection) would block that send and stall the entire broadcast loop, starving all other clients. Separately, the notifier channel is closed when a client disconnects (Check's defer) or re-registers (notifyHealth); because the registry lock is not held across the lookup and the send, a send can race that close and panic the monitor goroutine on a routine disconnect, taking down the process. MonitorHealthChan is exported and arke is consumed as a library, so these are reachable in production, not theoretical: a disconnect during a broadcast (e.g. a GOAWAY-time mass disconnect) can crash a consuming process. Route each delivery through a new sendHealthCode helper that sends non-blocking (dropping with a debug log if the receiver is not ready) and recovers from a send on a closed channel: one slow or disconnecting client can no longer block or crash notifications to the rest. Give the per-client notifier channel a buffer of 1 so a broadcast still reaches a reader that is momentarily busy rather than being dropped, without reintroducing head-of-line blocking. Add internal regression tests for the non-blocking and closed-channel behaviors; both were confirmed to fail/panic against the prior implementation. Note: a fuller fix for the close/send race is to redesign channel ownership (single owner closing, coordinated with senders under a lock); recover is the minimal, contained fix that stops the crash without changing exported signatures or happy-path behavior. Tracked separately. Fixes #114 Signed-off-by: Michael Otteni <MichaelGOtteni@gmail.com>
Fixes #114
MonitorHealthChancould stall every health broadcast on one busy client and panic on a send to a closed notifier. Deliver each code non-blocking through asendHealthCodehelper thatrecovers, and give the per-client notifier a buffer of 1. Adds internal regression tests for both failure modes.