Operating system
Linux
System version
iStoreOS (OpenWrt-based) x86_64, kernel 5.15
Installation type
Original sing-box Command Line
Version
sing-box version 1.10.7
Environment: go1.23.4 linux/amd64
Tags: with_gvisor,with_quic,with_dhcp,with_wireguard,with_ech,with_utls,with_reality_server,with_acme,with_clash_api
Revision: 253b41936ecd6ae17948d49d9c510d7100830927
CGO: disabled
Description
Transparent gateway (tproxy inbound + urltest outbound with 6 relay servers) becomes unresponsive under normal residential traffic. The process stays alive but stops processing connections. accept() works at the kernel level but no data flows.
This is related to #4144 (transparent gateway stall under normal connection churn). We captured a SIGQUIT goroutine dump during a live stall, which shows the blocking point is URLTestGroup.urlTest() calling batch.Wait() with no timeout.
Evidence: goroutine dump during stall (205 goroutines)
State distribution at stall time:
66 [IO wait]
26 [IO wait, 1 minutes]
18 [IO wait, 2 minutes]
10 [IO wait, 4 minutes]
20 [select]
18 [select, 2 minutes]
17 [select, 4 minutes]
12 [select, 1 minutes]
1 [semacquire, 3 minutes] ← URLTest batch.Wait()
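A distribution like the one above can be tallied from a raw SIGQUIT dump with a few lines of Go. This is only a sketch of one way to do it, not the exact tooling used for this report; `tallyStates` and `headerRE` are hypothetical names, and the sample dump in `main` is abbreviated from the two goroutines quoted in this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// headerRE matches goroutine header lines such as
// "goroutine 103 [semacquire, 3 minutes]:" and captures the bracketed state.
var headerRE = regexp.MustCompile(`^goroutine \d+ \[([^\]]+)\]:$`)

// tallyStates counts goroutines per state string in a goroutine dump.
func tallyStates(dump string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		if m := headerRE.FindStringSubmatch(strings.TrimSpace(sc.Text())); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Abbreviated sample; a real dump has one header per goroutine.
	sample := "goroutine 103 [semacquire, 3 minutes]:\nsync.(*WaitGroup).Wait(...)\n\n" +
		"goroutine 381 [IO wait, 1 minutes]:\nnet.(*TCPConn).Read(...)\n"
	for state, n := range tallyStates(sample) {
		fmt.Println(n, "["+state+"]")
	}
}
```

Map iteration order is random in Go, so the printed lines may appear in any order.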
The deadlock chain:
- URLTestGroup.urlTest() (goroutine 103) spawns batch workers to probe each outbound via batch.Go()
- One batch worker (goroutine 381) is stuck in TCPConn.Read for 1+ minute: the relay server accepted the TCP connection but never sent an HTTP response
- batch.Wait() (goroutine 103) blocks on sync.WaitGroup.Wait() indefinitely: one stuck worker blocks the entire batch
- While the batch is blocked, selectedOutboundTCP/selectedOutboundUDP are never updated
- New connections keep being routed to the unresponsive relay
- 168 CopyConn goroutines accumulate, all blocked in IO wait on the dead relay
Goroutine 103 (the blocked URLTest checker):
goroutine 103 [semacquire, 3 minutes]:
sync.(*WaitGroup).Wait(...)
sync/waitgroup.go:118
github.com/sagernet/sing/common/batch.(*Batch[...]).Wait(...)
github.com/sagernet/sing@v0.5.1/common/batch/batch.go:77
github.com/sagernet/sing-box/outbound.(*URLTestGroup).urlTest(...)
github.com/sagernet/sing-box/outbound/urltest.go:407
Goroutine 381 (the stuck batch worker):
goroutine 381 [IO wait, 1 minutes]:
net.(*TCPConn).Read(...)
...
github.com/sagernet/sing-box/common/urltest.URLTest(...)
github.com/sagernet/sing-box/common/urltest/urltest.go:96
github.com/sagernet/sing-box/outbound.(*URLTestGroup).urlTest.func1()
github.com/sagernet/sing-box/outbound/urltest.go:390
github.com/sagernet/sing/common/batch.(*Batch[...]).Go.func1()
github.com/sagernet/sing@v0.5.1/common/batch/batch.go:59
Socket state at stall (from ss)
- 112 ESTAB connections with Send-Q = 0 (accepted but not writing)
- 5 CLOSE-WAIT with Recv-Q = 1 (remote sent FIN, sing-box never called close())
- conntrack: 76/76 ASSURED (TCP layer healthy, application layer dead)
Reproduction pattern
- Config: tproxy inbound, urltest outbound with 6 exit nodes via relay servers
- Traffic: normal residential (2-3 LAN devices, web browsing + streaming)
- Trigger: relay server becomes slow or unresponsive (accepts TCP, does not respond to HTTP)
- Stall occurs within 1-19 minutes of VPN reconnect during degraded relay conditions
- Restart of sing-box process restores function temporarily
Observed 8 stalls in 2 hours, with intervals shrinking: 13min, 19min, 6.5min, 57s, 4.5min, 11min, 5min, 4min
Root cause analysis
URLTestGroup.urlTest() at outbound/urltest.go:407 calls batch.Wait() which blocks on sync.WaitGroup.Wait(). The batch spawns one goroutine per outbound to run URL tests. If any single outbound's relay accepts the TCP connection but does not respond to the HTTP request, that batch worker blocks in Read() indefinitely. Since batch.Wait() requires ALL workers to complete, one stuck worker blocks the entire health check.
While the health check is blocked:
- selectedOutboundTCP/selectedOutboundUDP are never updated to healthy outbounds
- New connections continue routing to the unresponsive outbound
- Connection copy goroutines (2 per connection) accumulate in IO wait
- The gateway appears fully unresponsive to LAN devices
Suggested fix
Add a context timeout to the urltest batch, so that unresponsive outbounds are detected as failures rather than blocking the entire health check:
// outbound/urltest.go, in urlTest()
// Before: batch.Wait() blocks forever if one probe hangs
// After: timeout ensures batch completes even with unresponsive outbounds
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
// Use ctx for batch operations so they respect the timeout
Or ensure individual URL test requests use a bounded timeout that covers the full HTTP transaction (connect + read), not just the TCP dial phase.
Deployment details
- Hardware: Intel N100 mini-PC, 8GB RAM
- Role: home network transparent proxy gateway
- Inbound: tproxy (TCP only, QUIC/UDP rejected)
- Outbound: urltest group selecting from 6 exit nodes (Dallas, LA, Chicago, NY, Atlanta, Miami) via relay servers
- BPF TCP keepalive: 30s (clamped via CGROUP_SETSOCKOPT)
- sysctl: tcp_retries2=6
Full goroutine dump and socket state
Captured by sending SIGQUIT to the sing-box process during a live stall event (GOTRACEBACK=all):