urltest batch.Wait() blocks indefinitely when relay is unresponsive, stalling transparent gateway (goroutine dump attached)

### Operating system

Linux

### System version

iStoreOS (OpenWrt-based) x86_64, kernel 5.15

### Installation type

Original sing-box Command Line

### Version

```
sing-box version 1.10.7

Environment: go1.23.4 linux/amd64
Tags: with_gvisor,with_quic,with_dhcp,with_wireguard,with_ech,with_utls,with_reality_server,with_acme,with_clash_api
Revision: 253b41936ecd6ae17948d49d9c510d7100830927
CGO: disabled
```

### Description

Transparent gateway (tproxy inbound + urltest outbound with 6 relay servers) becomes unresponsive under normal residential traffic. The process stays alive but stops processing connections: `accept()` still succeeds at the kernel level, but no data flows.

This is related to #4144 (transparent gateway stall under normal connection churn). We captured a SIGQUIT goroutine dump during a live stall, which shows the blocking point is `URLTestGroup.urlTest()` calling `batch.Wait()` with no timeout.

### Evidence: goroutine dump during stall (205 goroutines)

State distribution at stall time:
```
  66 [IO wait]
  26 [IO wait, 1 minutes]
  18 [IO wait, 2 minutes]
  10 [IO wait, 4 minutes]
  20 [select]
  18 [select, 2 minutes]
  17 [select, 4 minutes]
  12 [select, 1 minutes]
   1 [semacquire, 3 minutes]   ← URLTest batch.Wait()
```
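
For anyone who wants to reproduce this tally from their own dump, here is a minimal sketch in Go. It assumes the default header format `goroutine N [state]:` (other `GOTRACEBACK` modes may print extra fields in the header). `tallyStates` is a hypothetical helper name and the sample text in `main` is synthetic, not taken from the attached dump.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// headerRe matches goroutine header lines such as
// "goroutine 103 [semacquire, 3 minutes]:" (default traceback format assumed).
var headerRe = regexp.MustCompile(`^goroutine \d+ \[([^\]]+)\]:`)

// tallyStates counts goroutines per bracketed state string in a dump.
func tallyStates(dump string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		if m := headerRe.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Synthetic sample for illustration only.
	sample := "goroutine 1 [semacquire, 3 minutes]:\nsync.(*WaitGroup).Wait(...)\n\n" +
		"goroutine 2 [IO wait, 1 minutes]:\nnet.(*TCPConn).Read(...)\n\n" +
		"goroutine 3 [IO wait, 1 minutes]:\nnet.(*TCPConn).Read(...)\n"

	counts := tallyStates(sample)
	states := make([]string, 0, len(counts))
	for s := range counts {
		states = append(states, s)
	}
	sort.Strings(states) // stable output order
	for _, s := range states {
		fmt.Printf("%4d [%s]\n", counts[s], s)
	}
}
```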

**The deadlock chain:**

1. `URLTestGroup.urlTest()` (goroutine 103) spawns batch workers to probe each outbound via `batch.Go()`
2. One batch worker (goroutine 381) is stuck in `TCPConn.Read` for 1+ minute — the relay server accepted the TCP connection but never sent an HTTP response
3. `batch.Wait()` (goroutine 103) blocks on `sync.WaitGroup.Wait()` indefinitely — one stuck worker blocks the entire batch
4. While the batch is blocked, `selectedOutboundTCP`/`selectedOutboundUDP` are never updated
5. New connections keep being routed to the unresponsive relay
6. 168 CopyConn goroutines accumulate, all blocked in IO wait on the dead relay

Goroutine 103 (the blocked URLTest checker):
```
goroutine 103 [semacquire, 3 minutes]:
sync.(*WaitGroup).Wait(...)
  sync/waitgroup.go:118
github.com/sagernet/sing/common/batch.(*Batch[...]).Wait(...)
  github.com/sagernet/sing@v0.5.1/common/batch/batch.go:77
github.com/sagernet/sing-box/outbound.(*URLTestGroup).urlTest(...)
  github.com/sagernet/sing-box/outbound/urltest.go:407
```

Goroutine 381 (the stuck batch worker):
```
goroutine 381 [IO wait, 1 minutes]:
net.(*TCPConn).Read(...)
  ...
github.com/sagernet/sing-box/common/urltest.URLTest(...)
  github.com/sagernet/sing-box/common/urltest/urltest.go:96
github.com/sagernet/sing-box/outbound.(*URLTestGroup).urlTest.func1()
  github.com/sagernet/sing-box/outbound/urltest.go:390
github.com/sagernet/sing/common/batch.(*Batch[...]).Go.func1()
  github.com/sagernet/sing@v0.5.1/common/batch/batch.go:59
```

### Socket state at stall (from ss)

- 112 ESTAB connections with Send-Q = 0 (accepted but not writing)
- 5 CLOSE-WAIT with Recv-Q = 1 (remote sent FIN, sing-box never called close())
- conntrack: 76/76 ASSURED (TCP layer healthy, application layer dead)

### Reproduction pattern

- Config: tproxy inbound, urltest outbound with 6 exit nodes via relay servers
- Traffic: normal residential (2-3 LAN devices, web browsing + streaming)
- Trigger: relay server becomes slow or unresponsive (accepts TCP, does not respond to HTTP)
- Stall occurs within 1-19 minutes of VPN reconnect during degraded relay conditions
- Restart of sing-box process restores function temporarily

Observed 8 stalls in 2 hours, at irregular but generally shortening intervals: 13 min, 19 min, 6.5 min, 57 s, 4.5 min, 11 min, 5 min, 4 min.

### Root cause analysis

`URLTestGroup.urlTest()` at `outbound/urltest.go:407` calls `batch.Wait()`, which blocks on `sync.WaitGroup.Wait()`. The batch spawns one goroutine per outbound to run URL tests. If the relay for any single outbound accepts the TCP connection but does not respond to the HTTP request, that batch worker blocks in `Read()` indefinitely. Because `batch.Wait()` returns only after all workers complete, one stuck worker blocks the entire health check.

While the health check is blocked:
- `selectedOutboundTCP`/`selectedOutboundUDP` are never updated to healthy outbounds
- New connections continue routing to the unresponsive outbound
- Connection copy goroutines (2 per connection) accumulate in IO wait
- The gateway appears fully unresponsive to LAN devices

### Suggested fix

Add a context timeout to the urltest batch, so that unresponsive outbounds are detected as failures rather than blocking the entire health check:

```go
// outbound/urltest.go, in urlTest()
// Before: batch.Wait() blocks forever if one probe hangs
// After: timeout ensures batch completes even with unresponsive outbounds

ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
// Use ctx for batch operations so they respect the timeout
```

Alternatively, ensure that each individual URL test request uses a bounded timeout that covers the full HTTP transaction (connect + write + read), not just the TCP dial phase.

### Deployment details

- Hardware: Intel N100 mini-PC, 8GB RAM
- Role: home network transparent proxy gateway
- Inbound: tproxy (TCP only, QUIC/UDP rejected)
- Outbound: urltest group selecting from 6 exit nodes (Dallas, LA, Chicago, NY, Atlanta, Miami) via relay servers
- BPF TCP keepalive: 30s (clamped via CGROUP_SETSOCKOPT)
- sysctl: tcp_retries2=6

### Full goroutine dump and socket state

Captured by sending SIGQUIT to the sing-box process during a live stall event (`GOTRACEBACK=all`):

- **Goroutine dump** (588KB, 7256 lines, 205 goroutines): [goroutine_dump.txt](https://gist.github.com/HouMinXi/f5aa1870026685e06aeed2cd7e0030cf#file-goroutine_dump-txt)
- **Socket state** (ss -tnp, 117 connections, IPs sanitized): [sockets_sanitized.txt](https://gist.github.com/HouMinXi/f5aa1870026685e06aeed2cd7e0030cf#file-sockets_sanitized-txt)

