Skip to content

fix(pooling): fix race conditions, deadlock, and connection leak in PoolEntry#100

Open
dkasimovskiy wants to merge 4 commits into
masterfrom
fix/connection-pool-flaky-tests
Open

fix(pooling): fix race conditions, deadlock, and connection leak in PoolEntry#100
dkasimovskiy wants to merge 4 commits into
masterfrom
fix/connection-pool-flaky-tests

Conversation

@dkasimovskiy

@dkasimovskiy dkasimovskiy commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Summary

PoolEntry had several races between netty IO threads, the HashedWheelTimer heartbeat worker, and reconnect calls, manifesting as ABBA deadlocks, NPEs on inline connect failures, and leaked connections after a KILL/reconnect cycle in CI.

Production fixes

PoolEntry

  • Synchronize state mutations; mark fields volatile/AtomicBoolean for cross-thread visibility.
  • Narrow monitor critical sections to field mutations; call client.close() and emit() outside the monitor to break the ABBA deadlock between ConnectionImpl and PoolEntry monitors that hung DistributingRoundRobinBalancerTest.
  • Return the local chain future from internalConnect() so an inline connect failure cannot leave the caller observing a null connectFuture after handleConnectError() nulls it for reconnect.
  • Keep client.close() in shutdown() on every invocation (idempotent via closeChannel no-op) but guard only the onConnectionClosed emit, so a KILL-then-reconnect cycle cannot leak the new connection when its auth/ping subsequently fails.
  • Serialize connectAfter() reconnect-task scheduling to avoid double scheduling; double-check connectFuture in internalConnect() to return the in-flight future instead of starting a new connect.

IProtoClientPoolImpl

  • Synchronize forEach() on connectionPoolLock to avoid CME under concurrent setGroups().

Test fixes

  • Wait for box.stat.net().CONNECTIONS.current to stabilise in BasePoolTest.getActiveConnectionsCount: the IProto worker updates it asynchronously, so a single read often lags by 5-15 connections when 20+ are opened in a burst.
  • Collapse the wait-for-stable Lua script to a single line so tarantool emits one YAML document (SnakeYAML rejects multi-document streams).
  • Wait for the active connection count to reach the expected value in ConnectionPoolReconnectsTest post-reconnect assertions via a new waitForActiveConnections() helper.

I haven't forgotten about:

  • Tests
  • Changelog
  • Documentation
    • JavaDoc was written
  • Commit messages comply with the guideline
  • Cleanup the code for review. See checklist

Related issues:

dkasimovskiy and others added 2 commits June 19, 2026 11:08
…oolEntry

PoolEntry state transitions had several races between netty IO threads,
the HashedWheelTimer heartbeat worker, and user-facing reconnect calls
that surfaced as ABBA deadlocks, NPEs on inline connect failures, and
leaked connections after a KILL/reconnect cycle in CI.

PoolEntry:
- Synchronize state mutations and mark connectFuture, heartbeatTask,
  reconnectTask, lastHeartbeatEvent, isLocked, and isShutdown
  volatile/AtomicBoolean for cross-thread visibility.
- Narrow the entry-monitor critical sections to field mutations; call
  client.close() and emit() outside the monitor to break the ABBA
  deadlock between ConnectionImpl and PoolEntry monitors that hung
  DistributingRoundRobinBalancerTest.
- Return the local connect future from internalConnect() so an inline
  connect failure cannot leave the caller observing a null
  connectFuture after handleConnectError() nulls it for reconnect.
- Keep client.close() in shutdown() on every invocation (closeChannel
  is idempotent if already closed) but guard only the
  onConnectionClosed emit, so a KILL-then-reconnect cycle cannot
  leak the new connection when its auth/ping subsequently fails.
- Serialize connectAfter() reconnect-task scheduling to avoid double
  scheduling; add a double-check of connectFuture in internalConnect()
  to return the in-flight future instead of starting a new connect.

IProtoClientPoolImpl:
- Synchronize forEach() on connectionPoolLock to avoid CME under
  concurrent setGroups().

Tests (BasePoolTest / ConnectionPoolReconnectsTest):
- Wait for box.stat.net().CONNECTIONS.current to stabilise in
  getActiveConnectionsCount: the IProto worker updates it
  asynchronously, so a single read often lags by 5-15 connections
  when 20+ are opened in a burst.
- Collapse the wait-for-stable Lua script to a single line so
  tarantool emits one YAML document (SnakeYAML rejects
  multi-document streams).
- Wait for the active connection count to reach the expected value
  in ConnectionPoolReconnectsTest post-reconnect assertions via a
  new waitForActiveConnections() helper.

Verified locally on 3.5.0 and 2.11.8: ConnectionPoolReconnectsTest,
ConnectionPoolTest, ConnectionPoolHeartbeatTest,
DistributingRoundRobinBalancerTest, and unit tests all pass
consistently where they were previously flaky.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oolEntry

Fix a relocated ABBA deadlock: `PoolEntry.connect()`/`internalConnect()` no longer hold the entry monitor across `client.connect()`, which would deadlock with the Netty close-callback path on the shared `ConnectionImpl` monitor during a close/reconnect overlap.
Fix `onConnectionClosed` accounting: the `PoolEntry` shutdown idempotency flag is now reset per connection generation, so a KILL/reconnect cycle emits one close event per generation instead of suppressing all closes after the first.
`IProtoClientPoolImpl.forEach()` snapshots clients under the pool lock and invokes the action outside it, avoiding `ConcurrentModificationException` under concurrent `setGroups()` without holding the lock across a user callback.
…oolEntry

Fix a relocated ABBA deadlock: `PoolEntry.connect()`/`internalConnect()` no longer hold the entry monitor across `client.connect()`, which would deadlock with the Netty close-callback path on the shared `ConnectionImpl` monitor during a close/reconnect overlap.
Fix `onConnectionClosed` accounting: the `PoolEntry` shutdown idempotency flag is now reset per connection generation, so a KILL/reconnect cycle emits one close event per generation instead of suppressing all closes after the first.
`IProtoClientPoolImpl.forEach()` snapshots clients under the pool lock and invokes the action outside it, avoiding `ConcurrentModificationException` under concurrent `setGroups()` without holding the lock across a user callback.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant