fix(gateway): make readiness health checks dependency-aware#1328
fix(gateway): make readiness health checks dependency-aware#1328alangou wants to merge 2 commits into
Conversation
TaylorMutch
left a comment
There was a problem hiding this comment.
A couple of focused comments on the readiness changes.
2cb34c9 to
05b6998
Compare
|
@TaylorMutch I removed the hardcoded value for the database timeout some documentation and deployments resources has been updated to reflect that change (helm chart, doc) |
|
Label |
05b6998 to
2b11909
Compare
|
@TaylorMutch I have rebased the branch PR |
TaylorMutch
left a comment
There was a problem hiding this comment.
A few changes I think are needed on this PR, but it's getting closer.
| openshell gateway info | ||
| ``` | ||
|
|
||
| Gateways expose `/healthz` as a liveness signal and `/readyz` as a dependency-aware readiness signal. `/readyz` checks database connectivity and reports unhealthy when the probe exceeds the configured timeout. Tune that timeout with `--readiness-db-timeout-secs` or `OPENSHELL_READINESS_DB_TIMEOUT_SECS` when you need a slower or stricter readiness window. |
There was a problem hiding this comment.
This is probably only relevant on the Kubernetes docs which care about these things; should this move into the Kubernetes docs?
| --config "${GATEWAY_CONFIG}" | ||
| --bind-address 0.0.0.0 | ||
| --port "${HOST_PORT}" | ||
| --health-port "${HEALTH_PORT}" |
There was a problem hiding this comment.
We don't currently depend on the health port in any configuration other than Kubernetes, so I would suggest we not add it into the docker e2e path and only exercise it on the kubernetes e2e tests (which would enable the health port by default).
| # Database readiness timeout in seconds used by /readyz and /health. | ||
| # Keep this lower than probes.readiness.timeoutSeconds so the gateway can | ||
| # return a readiness decision before Kubernetes times out the probe. | ||
| readinessDbTimeoutSecs: 1 |
There was a problem hiding this comment.
This seems like a possible footgun; the readiness check should be pretty instantaneous right? And if we have to add in a configuration option to the server to determine this, that seems strange to me.
Perhaps the health check for DB health should run in the background and the health check handler only checks if we detect the state to be unhealthy? That would remove this race condition. WDYT?
There was a problem hiding this comment.
Yes that works for me. The database healthcheck would run in background (with the configurable timeout) and update some sort of store in the gateway. When /readyz is queried the store is read. This would indeed be almost instant
ab4deb4 to
a167000
Compare
Emit Prometheus readiness metrics for database probes (healthy gauge and outcome-labeled latency histogram) with coverage in health HTTP tests. Restrict Store::close behind test support cfg to prevent accidental runtime pool shutdown under live traffic. Signed-off-by: Adrien Langou <alangou@nvidia.com>
Signed-off-by: Adrien Langou <alangou@nvidia.com>
a167000 to
1cb6011
Compare
Summary
This PR makes gateway readiness signals dependency-aware instead of always healthy, while keeping liveness intentionally lightweight.
The database health check runs as an in-process background monitor that publishes the latest result through a
tokio::sync::watchchannel;/readyzand/healthread that cached state, so probe responses are sub-millisecond and never race with the kubelet's probe timeout.It also extends readiness coverage with a Kubernetes-only e2e test and updates the Rust test task so
openshell-servertest-only coverage runs withtest-support.Related Issue
Changes
openshell-server:/healthzremains liveness-only (200when process is responsive)/readyzand/healthreflect a cached database health state and return503when the latest background check failedchecks.database.status,latency_ms,error)crates/openshell-server/src/readiness.rs):DatabaseHealthMonitorspawns a Tokio task that periodically pings the persistence layer and publishes the result throughtokio::sync::watchInitializing(treated as unhealthy until the first poll completes),Healthy { latency_ms },Unhealthy { reason, latency_ms }withreasoninUnavailable | Timeoutconst_assert!enforcingtimeout < intervalhealth_routerawaits the monitor's first poll before serving, so/readyznever returns the placeholderInitializingstate on the very first probeStore::ping, plus SQLite/Postgresping) used by the monitor.openshell_server_readiness_database_healthygauge (1healthy,0unhealthy)openshell_server_readiness_database_probe_duration_secondshistogram labeled byoutcome(success,db_error,timeout)/metricsexposureStore::closeand backendclosemethods are behind#[cfg(any(test, feature = "test-support"))]health_routerso they provide a realStore.crates/openshell-server/tests/health_endpoint_integration.rs/readyz:e2e/rust/tests/readyz_health.rs(gated by thee2e-kubernetesfeature)e2e/with-kube-gateway.shopens a dedicated port-forward to the gateway pod's namedhealthcontainer port and exportsOPENSHELL_E2E_HEALTH_PORTfor the teste2e/rust/e2e-kubernetes.shso it actually runs the Kubernetes-gated tests:E2E_FEATURESdid not includee2e-kubernetes, so every test withrequired-features = ["e2e-kubernetes"]was silently skipped on the Kubernetes lane. Added the feature to the default list.e2e/rust/tests/user_namespaces.rsthat had been hidden by the misconfiguration. The test was marked#[ignore]with an inline header documenting the two blocking issues (hardcodeddocker exec openshell-cluster-openshell kubectlinvocation that no project setup creates, and a mid-test gateway StatefulSet rollout that breaks the wrapper'skubectl port-forwardwhile the test swallows stderr). Re-enable withcargo test -- --ignoredonce both are fixed; out of scope for this PR.probes.readiness.timeoutSecondsdefault set to2sto match the per-check timeoutserver.readinessDbTimeoutSecsproposal is dropped now that the monitor decouples the probe from the DB call)docs/kubernetes/setup.mdxadds aProbessection describing/healthzvs/readyzand the background DB checkdocs/sandboxes/manage-gateways.mdxnotes the two endpoints in the existing gateway health contextarchitecture/gateway.mdrecords the dependency-aware readiness behavior and its cached/background semanticstest-supportcoverage by default in the Rust test lane:tasks/test.tomlruns workspace tests excludingopenshell-server, then runsopenshell-serverwith--features test-supportWhy
/healthzstill returns 200 when DB is down/healthzis kept as a pure liveness probe by design. If liveness depended on DB, transient DB outages could trigger unnecessary pod restarts and CrashLoop behavior without fixing the dependency outage. Readiness (/readyz) is the dependency-aware signal used to remove unhealthy pods from traffic.Why the DB health check runs in the background
The first iteration of this PR pinged the database from inside the
/readyzhandler with a bounded timeout. That coupled probe latency to DB latency and required a gateway-side timeout config so the app timeout could stay below the kubelet's probe timeout. Running the ping in a background task decouples those two timeouts: handler latency stays sub-millisecond regardless of DB state, the published state is stale by at most one polling interval (5s), and operators no longer need a gateway-side knob to keep the app timeout below the Kubernetes probe timeout. As a side effect, the per-check timeout (2s) and polling interval (5s) can stay hardcoded — they are properties of the monitor, not of any individual request.Why
closeis test-onlycloseis used to simulate dependency outages in tests. Exposing it in runtime code would make it possible to tear down an active pool under live traffic. Until a dedicated graceful-shutdown flow exists, keeping it behind test support prevents accidental production use.Testing
mise run pre-commitpassesValidation run:
mise run e2emise run ciChecklist