Add background task to change for stale connections and reconnect proactively to fix #109#112
Open
soenkeliebau wants to merge 2 commits into
Open
Add background task to change for stale connections and reconnect proactively to fix #109#112soenkeliebau wants to merge 2 commits into
soenkeliebau wants to merge 2 commits into
Conversation
When the Kubernetes node running the Redis master is drained, the established TCP connection turns into a black hole: packets are silently dropped and no FIN/RST ever arrives. The only error trino-lb sees is a response timeout, which the redis crate's ConnectionManager classifies as RetryImmediately rather than a dropped connection, so it never reconnects and keeps reusing the dead socket. Every command then times out for hours until the liveness probe restarts the pod (#109). This only reproduced on drain (not pod delete, which closes the socket cleanly and triggers a reconnect) and under load, since active traffic keeps in-flight requests on the wedged connection. Fix this in trino-lb-persistence/src/redis/mod.rs: - Add a Reconnectable<T> wrapper (RwLock<Arc<T>> + rebuild factory + single-flight guard). reconnect() only rebuilds when the connection a failing command used is still current (Arc::ptr_eq generation check), so a burst of failures rebuilds the connection exactly once. - Route every Redis operation through a run() combinator that rebuilds the connection when a command times out. Dropped-connection errors are left to ConnectionManager's own reconnect, so this only covers the timeout gap. - Enable TCP keepalive (5s/2s/3 probes) and TCP_USER_TIMEOUT (12s) on the Redis socket via ConnectionInfo::set_tcp_settings (single-node) and ClusterClientBuilder::tcp_settings (cluster). These are re-applied on every reconnect and let the kernel tear down the dead socket faster. Reachable via redis::io::tcp, so no new dependency is needed. ConnectionManagerConfig does not expose TCP settings (even on redis-rs main), hence routing them through ConnectionInfo / the cluster builder. Add unit tests for the Reconnectable coordination logic (replace, stale-handle no-op, concurrent single-flight). End-to-end black-hole recovery is not covered by automated tests and still needs an integration or manual node-drain test. fixes #109
The #109 fix recovers a black-holed Redis connection (node drain -> only response timeouts, which ConnectionManager never reconnects on) by rebuilding the connection through a Reconnectable<T> wrapper. The trigger for that rebuild was a run() combinator that every Redis operation had to be wrapped in, so it could observe a timeout and call reconnect(). That trigger was the expensive part: it forced all ~15 call sites into `self.run(|conn| async move { ... })` closures, dragged Future/FnOnce bounds through the call path, and still could not cover the sscan path (its iterator borrows the connection for the whole loop), which had to reconnect by hand. The detection policy ended up smeared across every operation. Move detection into a single background task instead. spawn_health_check PINGs Redis every 5s and calls the same Reconnectable::reconnect on any ping failure; the connection's own response timeout makes a PING on a black-holed socket fail rather than hang. All call sites revert to the original `self.connection()` form and the sscan special case disappears. The diff against the pre-fix baseline shrinks from +368/-103 to +268/-23 on redis/mod.rs - almost entirely call-site churn that is no longer needed. Trade-offs: - Detection is now proactive and time-based (ping interval + response timeout, ~5-15s after a drain) instead of riding on the next failing business operation. It also heals while the pod is idle, which the run() approach could not. Worst-case recovery latency is comparable. - A single failed ping triggers a rebuild. This is the usual health-check semantics and the rebuild is single-flight + Arc::ptr_eq guarded so it happens at most once per dead connection, but it is slightly more eager than rebuilding only on a failed real operation. If churn on a flaky- but-alive Redis ever shows up, gate the rebuild behind N consecutive failed pings. - The task is spawn-and-forget (no JoinHandle). The persistence layer lives for the whole process, so there is nothing to abort; adding lifecycle machinery would be dead weight. Reconnectable<T>, the ConnectionFactory closures, the TCP keepalive / TCP_USER_TIMEOUT settings, and the Reconnectable unit tests are unchanged. Real black-hole recovery is still not covered by automated tests and needs an integration (TCP proxy) or manual node-drain test.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TrinoLB has an issue where under certain network failure scenarios it looses the connection to redis, but never reconnects on its own. So it just hangs until restarted.
This PR introduces a background task to check for timeouts and reconnect when they occur, so no restart is necessary.
fixes #109