Add background task to change for stale connections and reconnect proactively to fix #109 by soenkeliebau · Pull Request #112 · stackabletech/trino-lb

soenkeliebau · 2026-06-12T16:43:27Z

TrinoLB has an issue where under certain network failure scenarios it looses the connection to redis, but never reconnects on its own. So it just hangs until restarted.

This PR introduces a background task to check for timeouts and reconnect when they occur, so no restart is necessary.

fixes #109

When the Kubernetes node running the Redis master is drained, the established TCP connection turns into a black hole: packets are silently dropped and no FIN/RST ever arrives. The only error trino-lb sees is a response timeout, which the redis crate's ConnectionManager classifies as RetryImmediately rather than a dropped connection, so it never reconnects and keeps reusing the dead socket. Every command then times out for hours until the liveness probe restarts the pod (#109). This only reproduced on drain (not pod delete, which closes the socket cleanly and triggers a reconnect) and under load, since active traffic keeps in-flight requests on the wedged connection. Fix this in trino-lb-persistence/src/redis/mod.rs: - Add a Reconnectable<T> wrapper (RwLock<Arc<T>> + rebuild factory + single-flight guard). reconnect() only rebuilds when the connection a failing command used is still current (Arc::ptr_eq generation check), so a burst of failures rebuilds the connection exactly once. - Route every Redis operation through a run() combinator that rebuilds the connection when a command times out. Dropped-connection errors are left to ConnectionManager's own reconnect, so this only covers the timeout gap. - Enable TCP keepalive (5s/2s/3 probes) and TCP_USER_TIMEOUT (12s) on the Redis socket via ConnectionInfo::set_tcp_settings (single-node) and ClusterClientBuilder::tcp_settings (cluster). These are re-applied on every reconnect and let the kernel tear down the dead socket faster. Reachable via redis::io::tcp, so no new dependency is needed. ConnectionManagerConfig does not expose TCP settings (even on redis-rs main), hence routing them through ConnectionInfo / the cluster builder. Add unit tests for the Reconnectable coordination logic (replace, stale-handle no-op, concurrent single-flight). End-to-end black-hole recovery is not covered by automated tests and still needs an integration or manual node-drain test. fixes #109

The #109 fix recovers a black-holed Redis connection (node drain -> only response timeouts, which ConnectionManager never reconnects on) by rebuilding the connection through a Reconnectable<T> wrapper. The trigger for that rebuild was a run() combinator that every Redis operation had to be wrapped in, so it could observe a timeout and call reconnect(). That trigger was the expensive part: it forced all ~15 call sites into `self.run(|conn| async move { ... })` closures, dragged Future/FnOnce bounds through the call path, and still could not cover the sscan path (its iterator borrows the connection for the whole loop), which had to reconnect by hand. The detection policy ended up smeared across every operation. Move detection into a single background task instead. spawn_health_check PINGs Redis every 5s and calls the same Reconnectable::reconnect on any ping failure; the connection's own response timeout makes a PING on a black-holed socket fail rather than hang. All call sites revert to the original `self.connection()` form and the sscan special case disappears. The diff against the pre-fix baseline shrinks from +368/-103 to +268/-23 on redis/mod.rs - almost entirely call-site churn that is no longer needed. Trade-offs: - Detection is now proactive and time-based (ping interval + response timeout, ~5-15s after a drain) instead of riding on the next failing business operation. It also heals while the pod is idle, which the run() approach could not. Worst-case recovery latency is comparable. - A single failed ping triggers a rebuild. This is the usual health-check semantics and the rebuild is single-flight + Arc::ptr_eq guarded so it happens at most once per dead connection, but it is slightly more eager than rebuilding only on a failed real operation. If churn on a flaky- but-alive Redis ever shows up, gate the rebuild behind N consecutive failed pings. - The task is spawn-and-forget (no JoinHandle). The persistence layer lives for the whole process, so there is nothing to abort; adding lifecycle machinery would be dead weight. Reconnectable<T>, the ConnectionFactory closures, the TCP keepalive / TCP_USER_TIMEOUT settings, and the Reconnectable unit tests are unchanged. Real black-hole recovery is still not covered by automated tests and needs an integration (TCP proxy) or manual node-drain test.

soenkeliebau added 2 commits June 11, 2026 17:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add background task to change for stale connections and reconnect proactively to fix #109#112

Add background task to change for stale connections and reconnect proactively to fix #109#112
soenkeliebau wants to merge 2 commits into
mainfrom
iss/109

soenkeliebau commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

soenkeliebau commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant