outbound/urltest: fix gateway freeze when relay stops forwarding #4256
Open
HouMinXi wants to merge 1 commit into
When a relay server accepts TCP but stops forwarding application data, URL test probe goroutines block in Read() indefinitely. batch.Wait() then blocks forever, keeping the checking flag true and suppressing all future health checks. selectedOutbound is never updated, so new connections keep routing to the dead relay. This creates a triple self-locking loop that makes the gateway completely unresponsive.

Fix four issues:

1. Set SetReadDeadline on the URL test connection. Context cancellation does not interrupt net.Conn.Read() when the connection was obtained through a custom DialContext. Use a timeout relative to the current time, so that time already consumed by the dial phase does not count against the read.
2. Add a hard timeout (2*TCPTimeout) around batch.Wait(). When the timeout fires, proceed with whatever results are available rather than blocking indefinitely.
3. Propagate the batch context to individual probes by deriving testCtx from batchCtx instead of g.ctx, so batch cancellation reaches stuck probes.
4. Clean up stale history entries for probes that did not complete within the timeout, preventing performUpdateCheck from selecting a stuck outbound based on outdated results.

Root cause confirmed by a SIGQUIT goroutine dump during a live stall event on a tproxy gateway (205 goroutines, 168 blocked in CopyConn, 1 blocked in semacquire inside batch.Wait() for 3 minutes).

Fixes SagerNet#4255
Ref: SagerNet#4144 SagerNet#1620

Signed-off-by: Minxi Hou <houminxi@gmail.com>
Summary
Fix transparent gateway freeze caused by URLTestGroup.urlTest() blocking indefinitely when a relay server accepts TCP but stops forwarding data.
Problem
When a relay becomes unresponsive at the application layer (TCP keepalive keeps the connection alive), the following mechanisms lock simultaneously:
- bufio.CopyConn goroutines block in Read() forever (no idle timeout on relay connections)
- URL test probe goroutines block in Read() waiting for the HTTP response from the dead relay
- batch.Wait() requires ALL probes to complete, so one stuck probe blocks the entire health check
- The checking atomic flag stays true, silencing all future timer-triggered health checks
- selectedOutbound is never updated, routing all new connections to the dead relay
Confirmed by SIGQUIT goroutine dump during a live stall: 205 goroutines total, 168 blocked in CopyConn IO wait, 1 in batch.Wait() semacquire for 3 minutes. Full dump available in #4255.
Fix
common/urltest/urltest.go:
- Call SetReadDeadline on the dialed connection. Context cancellation does not interrupt net.Conn.Read() when the connection comes from a custom DialContext. Uses a relative timeout to avoid issues with time already consumed by the dial phase.
protocol/group/urltest.go:
- Derive testCtx from the batch context (was g.ctx), so batch cancellation propagates to individual probes
- Wrap batch.Wait() with a hard timer (2*TCPTimeout). On timeout, proceed with available results
- Clean up stale history entries for probes that did not complete in time, preventing performUpdateCheck from selecting a stuck outbound
Testing
Deployed patched binary on the affected gateway (N100 iStoreOS, tproxy + urltest with 6 exit nodes). Before patch: 8 stalls in 2 hours with intervals shrinking to 57s. After patch: monitoring for stability.
Ref: #4255 (goroutine dump and full analysis), #4144 (same symptom on different deployments), #1620 (same symptom on N100 tproxy, closed as stale)
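The stale-history cleanup listed under Fix can be sketched in isolation as follows. This is an illustration rather than the code in protocol/group/urltest.go: the history map, pruneStale, and selectFastest are hypothetical stand-ins for the real history storage and for performUpdateCheck, and delays are plain milliseconds.

```go
package main

import "fmt"

// pruneStale deletes the history entry of every tag that did not complete in
// the latest round, and returns the tags it removed.
func pruneStale(history map[string]uint16, tags []string, completed map[string]bool) []string {
	var removed []string
	for _, tag := range tags {
		if completed[tag] {
			continue
		}
		if _, ok := history[tag]; ok {
			delete(history, tag)
			removed = append(removed, tag)
		}
	}
	return removed
}

// selectFastest returns the tag with the lowest recorded delay, or "" when no
// tag has a history entry. Ties go to the tag listed first.
func selectFastest(history map[string]uint16, tags []string) string {
	best := ""
	var bestDelay uint16
	for _, tag := range tags {
		delay, ok := history[tag]
		if !ok {
			continue
		}
		if best == "" || delay < bestDelay {
			best, bestDelay = tag, delay
		}
	}
	return best
}

func main() {
	tags := []string{"node-a", "node-b", "node-c"}
	// node-b looks fastest, but only because its entry predates the stall.
	history := map[string]uint16{"node-a": 50, "node-b": 20, "node-c": 80}
	completed := map[string]bool{"node-a": true, "node-c": true}

	fmt.Println("before prune:", selectFastest(history, tags))
	pruneStale(history, tags, completed)
	fmt.Println("after prune:", selectFastest(history, tags))
}
```

The design point is that a probe which timed out carries no fresh information, so its old delay should not compete with results measured in the current round.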