Skip to content

Add background task to change for stale connections and reconnect proactively to fix #109#112

Open
soenkeliebau wants to merge 2 commits into
mainfrom
iss/109
Open

Add background task to change for stale connections and reconnect proactively to fix #109#112
soenkeliebau wants to merge 2 commits into
mainfrom
iss/109

Conversation

@soenkeliebau

Copy link
Copy Markdown
Member

TrinoLB has an issue where under certain network failure scenarios it looses the connection to redis, but never reconnects on its own. So it just hangs until restarted.

This PR introduces a background task to check for timeouts and reconnect when they occur, so no restart is necessary.

fixes #109

  When the Kubernetes node running the Redis master is drained, the
  established TCP connection turns into a black hole: packets are silently
  dropped and no FIN/RST ever arrives. The only error trino-lb sees is a
  response timeout, which the redis crate's ConnectionManager classifies as
  RetryImmediately rather than a dropped connection, so it never reconnects
  and keeps reusing the dead socket. Every command then times out for hours
  until the liveness probe restarts the pod (#109).

  This only reproduced on drain (not pod delete, which closes the socket
  cleanly and triggers a reconnect) and under load, since active traffic
  keeps in-flight requests on the wedged connection.

  Fix this in trino-lb-persistence/src/redis/mod.rs:

  - Add a Reconnectable<T> wrapper (RwLock<Arc<T>> + rebuild factory +
    single-flight guard). reconnect() only rebuilds when the connection a
    failing command used is still current (Arc::ptr_eq generation check), so
    a burst of failures rebuilds the connection exactly once.
  - Route every Redis operation through a run() combinator that rebuilds the
    connection when a command times out. Dropped-connection errors are left
    to ConnectionManager's own reconnect, so this only covers the timeout gap.
  - Enable TCP keepalive (5s/2s/3 probes) and TCP_USER_TIMEOUT (12s) on the
    Redis socket via ConnectionInfo::set_tcp_settings (single-node) and
    ClusterClientBuilder::tcp_settings (cluster). These are re-applied on
    every reconnect and let the kernel tear down the dead socket faster.
    Reachable via redis::io::tcp, so no new dependency is needed.

  ConnectionManagerConfig does not expose TCP settings (even on redis-rs
  main), hence routing them through ConnectionInfo / the cluster builder.

  Add unit tests for the Reconnectable coordination logic (replace,
  stale-handle no-op, concurrent single-flight). End-to-end black-hole
  recovery is not covered by automated tests and still needs an integration
  or manual node-drain test.

  fixes #109
The #109 fix recovers a black-holed Redis connection (node drain -> only
response timeouts, which ConnectionManager never reconnects on) by
rebuilding the connection through a Reconnectable<T> wrapper. The trigger
for that rebuild was a run() combinator that every Redis operation had to
be wrapped in, so it could observe a timeout and call reconnect().

That trigger was the expensive part: it forced all ~15 call sites into
`self.run(|conn| async move { ... })` closures, dragged Future/FnOnce
bounds through the call path, and still could not cover the sscan path
(its iterator borrows the connection for the whole loop), which had to
reconnect by hand. The detection policy ended up smeared across every
operation.

Move detection into a single background task instead. spawn_health_check
PINGs Redis every 5s and calls the same Reconnectable::reconnect on any
ping failure; the connection's own response timeout makes a PING on a
black-holed socket fail rather than hang. All call sites revert to the
original `self.connection()` form and the sscan special case disappears.
The diff against the pre-fix baseline shrinks from +368/-103 to +268/-23
on redis/mod.rs - almost entirely call-site churn that is no longer
needed.

Trade-offs:

- Detection is now proactive and time-based (ping interval + response
  timeout, ~5-15s after a drain) instead of riding on the next failing
  business operation. It also heals while the pod is idle, which the
  run() approach could not. Worst-case recovery latency is comparable.
- A single failed ping triggers a rebuild. This is the usual health-check
  semantics and the rebuild is single-flight + Arc::ptr_eq guarded so it
  happens at most once per dead connection, but it is slightly more eager
  than rebuilding only on a failed real operation. If churn on a flaky-
  but-alive Redis ever shows up, gate the rebuild behind N consecutive
  failed pings.
- The task is spawn-and-forget (no JoinHandle). The persistence layer
  lives for the whole process, so there is nothing to abort; adding
  lifecycle machinery would be dead weight.

Reconnectable<T>, the ConnectionFactory closures, the TCP keepalive /
TCP_USER_TIMEOUT settings, and the Reconnectable unit tests are unchanged.
Real black-hole recovery is still not covered by automated tests and
needs an integration (TCP proxy) or manual node-drain test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

trino-lb is stuck after redis master shuts down

1 participant