Skip to content

Conversation

@baizhufbb
Copy link

Problem

When using an HTTP proxy to connect to HTTPS servers, if the TLS handshake fails (e.g., proxy timeout, node switching) after the CONNECT request succeeds, the connection gets stuck in ACTIVE state and never gets cleaned up. This results in a zombie connection that permanently occupies the connection pool.

Symptoms:

  • is_closed() = False (connection state is ACTIVE, not CLOSED)
  • has_expired() = False (only checks IDLE state connections)
  • Connection not removed from pool
  • Connection pool becomes full (e.g., 1/1)
  • New requests cannot create connections
  • Results in Pool timeout infinite loop
  • Application cannot recover even after proxy recovers

Root Cause

Execution flow when TLS handshake fails:

  1. TCP connect to proxy succeeds ✅
  2. CONNECT request succeeds (HTTP/1.1 200 Connection Established)
    • Connection state set to ACTIVE
  3. TLS handshake fails (e.g., ConnectError: EndOfStream)
    • Exception is raised
    • ⚠️ No aclose() is called
    • ⚠️ Connection state remains ACTIVE
  4. Connection pool cleanup check:
    • is_closed() returns False (state is ACTIVE)
    • has_expired() returns False (only checks if state == IDLE)
    • ⚠️ Connection is not cleaned up
  5. Zombie connection occupies pool slot forever
  6. Connection pool full → Pool timeout

Solution

This PR adds a try-except block around start_tls() to ensure cleanup on TLS failures:

try:
    async with Trace("start_tls", logger, request, kwargs) as trace:  
        stream = await stream.start_tls(**kwargs)
        trace.return_value = stream
except Exception:
    await self._connection.aclose()
    raise

Alternative Approaches Considered

Option 1: Outer try-except

Following the pattern in AsyncHTTP11Connection.handle_async_request(), we could wrap the entire tunnel establishment in an outer try-except.

Option 2: Fix has_expired() logic

Check for ACTIVE state with readable socket:

  def has_expired(self) -> bool:
      server_disconnected = (
          (self._state == HTTPConnectionState.IDLE or self._state == HTTPConnectionState.ACTIVE)
          and self._network_stream.get_extra_info("is_readable")        
      )
      return keepalive_expired or server_disconnected

Pros: No need to modify http_proxy.py.
Cons: May incorrectly close healthy ACTIVE connections, broader impact.

Verification

Before fix (zombie connection):
20:28:08.895 - start_tls.failed exception=ConnectError(EndOfStream())
20:28:08.896 - state=ACTIVE, is_readable=True
20:28:08.896 - Connection not cleaned: closed=False, expired=False, idle=False
20:28:08.897 - Pool: 1/1 (full)
20:28:11.412 - Pool timeout (infinite loop)

After fix (proper cleanup):
08:42:53.481 - start_tls.failed exception=ConnectError(EndOfStream())
08:42:53.482 - close.started
08:42:53.482 - close.complete
08:42:53.482 - is_closed=True
08:42:53.482 - Removing closed connection. Pool size before: 1
08:42:53.483 - Pool size after removing closed: 0
08:42:54.487 - Creating new connection. Pool: 0/1
08:42:57.386 - start_tls.failed (retry succeeded)

Test environment:

  • OS: Windows 11
  • Proxy: HTTP proxy (FlClash)
  • Test duration: 12+ hours
  • TLS failures: 10+ occurrences, all recovered successfully
  • No Pool timeout issues after fix

Related

Checklist

  • I understand that this PR may be closed in case there was no previous discussion. (This is a straightforward bug fix with clear reproduction. Happy to discuss if needed.)
  • I've added a test for each change that was introduced. (Can add integration test if reviewers consider it necessary.)
  • I've updated the documentation accordingly. (N/A - Bug fix, no API changes)

When using an HTTP proxy to connect to HTTPS servers, if the TLS handshake
fails after CONNECT succeeds, the connection remains in ACTIVE state and never
gets cleaned up, occupying the connection pool forever.

This fix ensures the connection is properly closed when TLS fails, preventing
Pool timeout issues.
Apply the same fix to the synchronous version to keep async/sync in sync.
@baizhufbb
Copy link
Author

This is a bug fix for a hard-to-test edge case (TLS failure in production). The uncovered lines are exception handling code that only executes during proxy failures. I've verified the fix works in a real-world scenario (12+ hours of testing with multiple TLS failures).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant