fix(pooling): fix race conditions, deadlock, and connection leak in PoolEntry#100
Open
dkasimovskiy wants to merge 4 commits into
Open
fix(pooling): fix race conditions, deadlock, and connection leak in PoolEntry#100dkasimovskiy wants to merge 4 commits into
dkasimovskiy wants to merge 4 commits into
Conversation
…oolEntry PoolEntry state transitions had several races between netty IO threads, the HashedWheelTimer heartbeat worker, and user-facing reconnect calls that surfaced as ABBA deadlocks, NPEs on inline connect failures, and leaked connections after a KILL/reconnect cycle in CI. PoolEntry: - Synchronize state mutations and mark connectFuture, heartbeatTask, reconnectTask, lastHeartbeatEvent, isLocked, and isShutdown volatile/AtomicBoolean for cross-thread visibility. - Narrow the entry-monitor critical sections to field mutations; call client.close() and emit() outside the monitor to break the ABBA deadlock between ConnectionImpl and PoolEntry monitors that hung DistributingRoundRobinBalancerTest. - Return the local connect future from internalConnect() so an inline connect failure cannot leave the caller observing a null connectFuture after handleConnectError() nulls it for reconnect. - Keep client.close() in shutdown() on every invocation (closeChannel is idempotent if already closed) but guard only the onConnectionClosed emit, so a KILL-then-reconnect cycle cannot leak the new connection when its auth/ping subsequently fails. - Serialize connectAfter() reconnect-task scheduling to avoid double scheduling; add a double-check of connectFuture in internalConnect() to return the in-flight future instead of starting a new connect. IProtoClientPoolImpl: - Synchronize forEach() on connectionPoolLock to avoid CME under concurrent setGroups(). Tests (BasePoolTest / ConnectionPoolReconnectsTest): - Wait for box.stat.net().CONNECTIONS.current to stabilise in getActiveConnectionsCount: the IProto worker updates it asynchronously, so a single read often lags by 5-15 connections when 20+ are opened in a burst. - Collapse the wait-for-stable Lua script to a single line so tarantool emits one YAML document (SnakeYAML rejects multi-document streams). - Wait for the active connection count to reach the expected value in ConnectionPoolReconnectsTest post-reconnect assertions via a new waitForActiveConnections() helper. Verified locally on 3.5.0 and 2.11.8: ConnectionPoolReconnectsTest, ConnectionPoolTest, ConnectionPoolHeartbeatTest, DistributingRoundRobinBalancerTest, and unit tests all pass consistently where they were previously flaky.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oolEntry Fix a relocated ABBA deadlock: `PoolEntry.connect()`/`internalConnect()` no longer hold the entry monitor across `client.connect()`, which would deadlock with the Netty close-callback path on the shared `ConnectionImpl` monitor during a close/reconnect overlap. Fix `onConnectionClosed` accounting: the `PoolEntry` shutdown idempotency flag is now reset per connection generation, so a KILL/reconnect cycle emits one close event per generation instead of suppressing all closes after the first. `IProtoClientPoolImpl.forEach()` snapshots clients under the pool lock and invokes the action outside it, avoiding `ConcurrentModificationException` under concurrent `setGroups()` without holding the lock across a user callback.
…oolEntry Fix a relocated ABBA deadlock: `PoolEntry.connect()`/`internalConnect()` no longer hold the entry monitor across `client.connect()`, which would deadlock with the Netty close-callback path on the shared `ConnectionImpl` monitor during a close/reconnect overlap. Fix `onConnectionClosed` accounting: the `PoolEntry` shutdown idempotency flag is now reset per connection generation, so a KILL/reconnect cycle emits one close event per generation instead of suppressing all closes after the first. `IProtoClientPoolImpl.forEach()` snapshots clients under the pool lock and invokes the action outside it, avoiding `ConcurrentModificationException` under concurrent `setGroups()` without holding the lock across a user callback.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PoolEntry had several races between netty IO threads, the HashedWheelTimer heartbeat worker, and reconnect calls, manifesting as ABBA deadlocks, NPEs on inline connect failures, and leaked connections after a KILL/reconnect cycle in CI.
Production fixes
PoolEntry
volatile/AtomicBooleanfor cross-thread visibility.client.close()andemit()outside the monitor to break the ABBA deadlock betweenConnectionImplandPoolEntrymonitors that hungDistributingRoundRobinBalancerTest.internalConnect()so an inline connect failure cannot leave the caller observing a nullconnectFutureafterhandleConnectError()nulls it for reconnect.client.close()inshutdown()on every invocation (idempotent viacloseChannelno-op) but guard only theonConnectionClosedemit, so a KILL-then-reconnect cycle cannot leak the new connection when its auth/ping subsequently fails.connectAfter()reconnect-task scheduling to avoid double scheduling; double-checkconnectFutureininternalConnect()to return the in-flight future instead of starting a new connect.IProtoClientPoolImpl
forEach()onconnectionPoolLockto avoid CME under concurrentsetGroups().Test fixes
box.stat.net().CONNECTIONS.currentto stabilise inBasePoolTest.getActiveConnectionsCount: the IProto worker updates it asynchronously, so a single read often lags by 5-15 connections when 20+ are opened in a burst.ConnectionPoolReconnectsTestpost-reconnect assertions via a newwaitForActiveConnections()helper.I haven't forgotten about:
Related issues: