urltest batch.Wait() blocks indefinitely when relay is unresponsive, stalling transparent gateway (goroutine dump attached) #4255

@HouMinXi

Operating system

Linux

System version

iStoreOS (OpenWrt-based) x86_64, kernel 5.15

Installation type

Original sing-box Command Line

Version

sing-box version 1.10.7

Environment: go1.23.4 linux/amd64
Tags: with_gvisor,with_quic,with_dhcp,with_wireguard,with_ech,with_utls,with_reality_server,with_acme,with_clash_api
Revision: 253b41936ecd6ae17948d49d9c510d7100830927
CGO: disabled

Description

Transparent gateway (tproxy inbound + urltest outbound with 6 relay servers) becomes unresponsive under normal residential traffic. The process stays alive but stops processing connections. accept() works at the kernel level but no data flows.

This is related to #4144 (transparent gateway stall under normal connection churn). We captured a SIGQUIT goroutine dump during a live stall, which shows the blocking point is URLTestGroup.urlTest() calling batch.Wait() with no timeout.

Evidence: goroutine dump during stall (205 goroutines)

State distribution at stall time:

  66 [IO wait]
  26 [IO wait, 1 minutes]
  18 [IO wait, 2 minutes]
  10 [IO wait, 4 minutes]
  20 [select]
  18 [select, 2 minutes]
  17 [select, 4 minutes]
  12 [select, 1 minutes]
   1 [semacquire, 3 minutes]   ← URLTest batch.Wait()
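A tally like the one above can be produced from a raw dump with a few lines of Go. This is a minimal sketch, not the tool used for this report: it assumes the default header format `goroutine N [state]:` that appears in the excerpts in this issue, and the sample input in `main` mixes two headers from this report with one made-up goroutine (382) for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// headerRE matches goroutine header lines such as
// "goroutine 103 [semacquire, 3 minutes]:" and captures the bracketed state.
var headerRE = regexp.MustCompile(`^goroutine \d+ \[([^\]]+)\]:$`)

// countStates tallies goroutines by their full bracketed state string.
func countStates(dump string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		if m := headerRE.FindStringSubmatch(strings.TrimSpace(sc.Text())); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Goroutines 103 and 381 come from this report; 382 is hypothetical.
	dump := `goroutine 103 [semacquire, 3 minutes]:
sync.(*WaitGroup).Wait(...)

goroutine 381 [IO wait, 1 minutes]:
net.(*TCPConn).Read(...)

goroutine 382 [IO wait, 1 minutes]:
net.(*TCPConn).Read(...)
`
	counts := countStates(dump)
	states := make([]string, 0, len(counts))
	for s := range counts {
		states = append(states, s)
	}
	sort.Strings(states)
	for _, s := range states {
		fmt.Printf("%4d [%s]\n", counts[s], s)
	}
}
```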

The blocking chain (a stall on one hung worker, not a lock cycle):

  1. URLTestGroup.urlTest() (goroutine 103) spawns batch workers to probe each outbound via batch.Go()
  2. One batch worker (goroutine 381) is stuck in TCPConn.Read for 1+ minute — the relay server accepted the TCP connection but never sent an HTTP response
  3. batch.Wait() (goroutine 103) blocks on sync.WaitGroup.Wait() indefinitely — one stuck worker blocks the entire batch
  4. While the batch is blocked, selectedOutboundTCP/selectedOutboundUDP are never updated
  5. New connections keep being routed to the unresponsive relay
  6. 168 CopyConn goroutines accumulate, all blocked in IO wait on the dead relay

Goroutine 103 (the blocked URLTest checker):

goroutine 103 [semacquire, 3 minutes]:
sync.(*WaitGroup).Wait(...)
  sync/waitgroup.go:118
github.com/sagernet/sing/common/batch.(*Batch[...]).Wait(...)
  github.com/sagernet/sing@v0.5.1/common/batch/batch.go:77
github.com/sagernet/sing-box/outbound.(*URLTestGroup).urlTest(...)
  github.com/sagernet/sing-box/outbound/urltest.go:407

Goroutine 381 (the stuck batch worker):

goroutine 381 [IO wait, 1 minutes]:
net.(*TCPConn).Read(...)
  ...
github.com/sagernet/sing-box/common/urltest.URLTest(...)
  github.com/sagernet/sing-box/common/urltest/urltest.go:96
github.com/sagernet/sing-box/outbound.(*URLTestGroup).urlTest.func1()
  github.com/sagernet/sing-box/outbound/urltest.go:390
github.com/sagernet/sing/common/batch.(*Batch[...]).Go.func1()
  github.com/sagernet/sing@v0.5.1/common/batch/batch.go:59

Socket state at stall (from ss)

  • 112 ESTAB connections with Send-Q = 0 (accepted but not writing)
  • 5 CLOSE-WAIT with Recv-Q = 1 (remote sent FIN, sing-box never called close())
  • conntrack: 76/76 ASSURED (TCP layer healthy, application layer dead)

Reproduction pattern

  • Config: tproxy inbound, urltest outbound with 6 exit nodes via relay servers
  • Traffic: normal residential (2-3 LAN devices, web browsing + streaming)
  • Trigger: relay server becomes slow or unresponsive (accepts TCP, does not respond to HTTP)
  • Stall occurs within 1-19 minutes of VPN reconnect during degraded relay conditions
  • Restart of sing-box process restores function temporarily

Observed 8 stalls in 2 hours, at irregular but generally shorter intervals: 13min, 19min, 6.5min, 57s, 4.5min, 11min, 5min, 4min

Root cause analysis

URLTestGroup.urlTest() at outbound/urltest.go:407 calls batch.Wait() which blocks on sync.WaitGroup.Wait(). The batch spawns one goroutine per outbound to run URL tests. If any single outbound's relay accepts the TCP connection but does not respond to the HTTP request, that batch worker blocks in Read() indefinitely. Since batch.Wait() requires ALL workers to complete, one stuck worker blocks the entire health check.
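The failure mode described above can be reproduced outside sing-box with the standard library alone. The sketch below is a stand-in, not sing-box code: `silentListener` plays the relay that accepts TCP and never answers, and `waitStillBlocked` plays a batch of workers joined by a plain `sync.WaitGroup`. Both names, and the `probeWindow` constant, are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// probeWindow is how long the demo waits before checking on Wait().
const probeWindow = 200 * time.Millisecond

// silentListener accepts TCP connections and never writes to them,
// standing in for the unresponsive relay.
func silentListener() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go func() {
		var held []net.Conn // keep accepted conns open and referenced
		defer func() {
			for _, c := range held {
				c.Close()
			}
		}()
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			held = append(held, conn)
		}
	}()
	return ln, nil
}

// waitStillBlocked starts `workers` goroutines, one of which reads from addr
// with no deadline, and reports whether wg.Wait() is still blocked after d.
func waitStillBlocked(addr string, workers int, d time.Duration) (bool, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return false, err
	}
	defer conn.Close() // closing the conn is what finally unblocks the Read

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i == 0 {
				buf := make([]byte, 1)
				_, _ = conn.Read(buf) // no deadline: blocks until the peer writes or closes
			}
		}(i)
	}

	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return false, nil
	case <-time.After(d):
		return true, nil
	}
}

func main() {
	ln, err := silentListener()
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	blocked, err := waitStillBlocked(ln.Addr().String(), 6, probeWindow)
	if err != nil {
		panic(err)
	}
	fmt.Println("Wait() still blocked:", blocked)
}
```

Five of the six workers return immediately, yet the join does not complete while the sixth sits in `Read`, which matches the shape of goroutines 103 and 381 in the dump.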

While the health check is blocked:

  • selectedOutboundTCP/selectedOutboundUDP are never updated to healthy outbounds
  • New connections continue routing to the unresponsive outbound
  • Connection copy goroutines (2 per connection) accumulate in IO wait
  • The gateway appears fully unresponsive to LAN devices

Suggested fix

Add a context timeout to the urltest batch, so that unresponsive outbounds are detected as failures rather than blocking the entire health check:

// outbound/urltest.go, in urlTest()
// Before: batch.Wait() blocks forever if one probe hangs
// After: timeout ensures batch completes even with unresponsive outbounds

ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
// Use ctx for batch operations so they respect the timeout
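One caveat on the batch-level deadline: `sync.WaitGroup.Wait()` does not observe a context, so the deadline only helps if the join itself is bounded or the workers honor cancellation. A minimal sketch of a bounded join follows. `waitTimeout`, `runBatch`, and `demoDeadline` are hypothetical names, not part of `sagernet/sing`, and the hung probe is modelled with a channel rather than a socket.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoDeadline bounds the whole batch in this demo.
const demoDeadline = 100 * time.Millisecond

// waitTimeout waits for wg for at most d and reports whether every worker
// finished. The helper goroutine stays parked until the workers do finish,
// so a truly hung worker still costs one goroutine.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

// runBatch starts n workers. Worker 0 models a hung probe; the others
// record themselves as healthy. It returns whether the batch completed
// within d and which workers reported in.
func runBatch(n int, d time.Duration) (bool, []int) {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		healthy []int
	)
	hang := make(chan struct{})
	defer close(hang) // release the stuck worker so the demo does not leak it

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i == 0 {
				<-hang // never answers within the deadline
				return
			}
			mu.Lock()
			healthy = append(healthy, i)
			mu.Unlock()
		}(i)
	}

	allDone := waitTimeout(&wg, d)
	mu.Lock()
	defer mu.Unlock()
	return allDone, append([]int(nil), healthy...)
}

func main() {
	allDone, healthy := runBatch(6, demoDeadline)
	fmt.Println("batch completed in time:", allDone)
	fmt.Println("workers that reported:", len(healthy))
}
```

With this shape the health check can proceed with the results it has and treat the missing outbound as failed. The stuck worker is still alive, though, so a per-request deadline is needed as well to reclaim it.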

Alternatively, ensure each individual URL test request uses a bounded timeout that covers the full HTTP transaction (connect, request write, and response read), not just the TCP dial phase.
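A per-request deadline of that kind can be sketched with `net/http` and a context, which bounds dial, request write, and response read together. This is an illustration under stated assumptions, not the code in `common/urltest`: `probe`, `probeSilentRelay`, `silentListener`, and `probeTimeout` are hypothetical names, and the target is a local listener that accepts TCP and never replies.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// probeTimeout bounds one whole probe: connect, request, and response.
const probeTimeout = 300 * time.Millisecond

// silentListener accepts TCP connections and never writes to them.
func silentListener() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go func() {
		var held []net.Conn // keep accepted conns open and referenced
		defer func() {
			for _, c := range held {
				c.Close()
			}
		}()
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			held = append(held, conn)
		}
	}()
	return ln, nil
}

// probe issues a HEAD request whose context deadline covers the entire
// transaction, so a peer that accepts but never responds yields an error.
func probe(ctx context.Context, client *http.Client, url string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// probeSilentRelay probes a silent listener and returns how long the
// probe took and the error it produced.
func probeSilentRelay() (time.Duration, error) {
	ln, err := silentListener()
	if err != nil {
		return 0, err
	}
	defer ln.Close()
	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}
	start := time.Now()
	err = probe(context.Background(), client, "http://"+ln.Addr().String()+"/", probeTimeout)
	return time.Since(start), err
}

func main() {
	elapsed, err := probeSilentRelay()
	fmt.Println("probe returned after", elapsed.Round(10*time.Millisecond))
	fmt.Println("error:", err)
}
```

Because the probe returns an error instead of blocking, a batch worker built this way always reaches its completion call, and the unresponsive outbound can be recorded as failed.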

Deployment details

  • Hardware: Intel N100 mini-PC, 8GB RAM
  • Role: home network transparent proxy gateway
  • Inbound: tproxy (TCP only, QUIC/UDP rejected)
  • Outbound: urltest group selecting from 6 exit nodes (Dallas, LA, Chicago, NY, Atlanta, Miami) via relay servers
  • BPF TCP keepalive: 30s (clamped via CGROUP_SETSOCKOPT)
  • sysctl: tcp_retries2=6

Full goroutine dump and socket state

Captured by sending SIGQUIT to the sing-box process during a live stall event (GOTRACEBACK=all):
