fix(gateway): make readiness health checks dependency-aware by alangou · Pull Request #1328 · NVIDIA/OpenShell

alangou · 2026-05-12T15:18:42Z

Summary

This PR makes gateway readiness signals dependency-aware instead of always healthy, while keeping liveness intentionally lightweight.
The database health check runs as an in-process background monitor that publishes the latest result through a tokio::sync::watch channel; /readyz and /health read that cached state, so probe responses are sub-millisecond and never race with the kubelet's probe timeout.
It also extends readiness coverage with a Kubernetes-only e2e test and updates the Rust test task so openshell-server test-only coverage runs with test-support.

Related Issue

closes OS-156
Parent initiative: OSGH-111 (Runtime Reliability)

Changes

Added dependency-aware health behavior in openshell-server:
- /healthz remains liveness-only (200 when process is responsive)
- /readyz and /health reflect a cached database health state and return 503 when the latest background check failed
- readiness payload includes structured dependency details (checks.database.status, latency_ms, error)
Added background database health monitor (crates/openshell-server/src/readiness.rs):
- DatabaseHealthMonitor spawns a Tokio task that periodically pings the persistence layer and publishes the result through tokio::sync::watch
- Three published states: Initializing (treated as unhealthy until the first poll completes), Healthy { latency_ms }, Unhealthy { reason, latency_ms } with reason in Unavailable | Timeout
- Polling parameters are hardcoded by design: 5s interval, 2s per-check timeout, with a const_assert! enforcing timeout < interval
- health_router awaits the monitor's first poll before serving, so /readyz never returns the placeholder Initializing state on the very first probe
Added persistence connectivity helpers (Store::ping, plus SQLite/Postgres ping) used by the monitor.
Added Prometheus readiness metrics:
- openshell_server_readiness_database_healthy gauge (1 healthy, 0 unhealthy)
- openshell_server_readiness_database_probe_duration_seconds histogram labeled by outcome (success, db_error, timeout)
- health tests validate metric emission and /metrics exposure
Scoped pool teardown APIs to test support only:
- Store::close and backend close methods are behind #[cfg(any(test, feature = "test-support"))]
Updated existing integration tests that instantiate health_router so they provide a real Store.
Added readiness integration test coverage:
- crates/openshell-server/tests/health_endpoint_integration.rs
Added Kubernetes e2e coverage for /readyz:
- e2e/rust/tests/readyz_health.rs (gated by the e2e-kubernetes feature)
- e2e/with-kube-gateway.sh opens a dedicated port-forward to the gateway pod's named health container port and exports OPENSHELL_E2E_HEALTH_PORT for the test
- Docker e2e is intentionally not extended: nothing else in the Docker path depends on the health port today
Fixed e2e/rust/e2e-kubernetes.sh so it actually runs the Kubernetes-gated tests:
- The script's default E2E_FEATURES did not include e2e-kubernetes, so every test with required-features = ["e2e-kubernetes"] was silently skipped on the Kubernetes lane. Added the feature to the default list.
- This surfaced a pre-existing failure in e2e/rust/tests/user_namespaces.rs that had been hidden by the misconfiguration. The test was marked #[ignore] with an inline header documenting the two blocking issues (hardcoded docker exec openshell-cluster-openshell kubectl invocation that no project setup creates, and a mid-test gateway StatefulSet rollout that breaks the wrapper's kubectl port-forward while the test swallows stderr). Re-enable with cargo test -- --ignored once both are fixed; out of scope for this PR.
Helm chart updates aligned with the background monitor:
- probes.readiness.timeoutSeconds default set to 2s to match the per-check timeout
- chart README documents that the readiness probe is now cached/background and no gateway-side tuning is exposed
- no new chart value is introduced for the readiness check (the previous server.readinessDbTimeoutSecs proposal is dropped now that the monitor decouples the probe from the DB call)
Documentation updates:
- docs/kubernetes/setup.mdx adds a Probes section describing /healthz vs /readyz and the background DB check
- docs/sandboxes/manage-gateways.mdx notes the two endpoints in the existing gateway health context
- architecture/gateway.md records the dependency-aware readiness behavior and its cached/background semantics
Test task update for test-support coverage by default in the Rust test lane:
- tasks/test.toml runs workspace tests excluding openshell-server, then runs openshell-server with --features test-support

Why `/healthz` still returns 200 when DB is down

/healthz is kept as a pure liveness probe by design. If liveness depended on DB, transient DB outages could trigger unnecessary pod restarts and CrashLoop behavior without fixing the dependency outage. Readiness (/readyz) is the dependency-aware signal used to remove unhealthy pods from traffic.

Why the DB health check runs in the background

The first iteration of this PR pinged the database from inside the /readyz handler with a bounded timeout. That coupled probe latency to DB latency and required a gateway-side timeout config so the app timeout could stay below the kubelet's probe timeout. Running the ping in a background task decouples those two timeouts: handler latency stays sub-millisecond regardless of DB state, the published state is stale by at most one polling interval (5s), and operators no longer need a gateway-side knob to keep the app timeout below the Kubernetes probe timeout. As a side effect, the per-check timeout (2s) and polling interval (5s) can stay hardcoded — they are properties of the monitor, not of any individual request.

Why `close` is test-only

close is used to simulate dependency outages in tests. Exposing it in runtime code would make it possible to tear down an active pool under live traffic. Until a dedicated graceful-shutdown flow exists, keeping it behind test support prevents accidental production use.

Testing

mise run pre-commit passes
Unit tests added/updated
E2E tests added/updated (if applicable)

Validation run:

mise run e2e
mise run ci

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

TaylorMutch

A couple of focused comments on the readiness changes.

alangou · 2026-05-13T13:42:31Z

@TaylorMutch I removed the hardcoded value for the database timeout some documentation and deployments resources has been updated to reflect that change (helm chart, doc)

github-actions · 2026-05-13T16:54:14Z

Label test:e2e applied for 05b6998. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

alangou · 2026-05-18T09:42:27Z

@TaylorMutch I have rebased the branch PR

TaylorMutch

A few changes I think are needed on this PR, but it's getting closer.

TaylorMutch · 2026-05-18T15:35:25Z

 openshell gateway info
 ```

+Gateways expose `/healthz` as a liveness signal and `/readyz` as a dependency-aware readiness signal. `/readyz` checks database connectivity and reports unhealthy when the probe exceeds the configured timeout. Tune that timeout with `--readiness-db-timeout-secs` or `OPENSHELL_READINESS_DB_TIMEOUT_SECS` when you need a slower or stricter readiness window.


This is probably only relevant on the Kubernetes docs which care about these things; should this move into the Kubernetes docs?

TaylorMutch · 2026-05-18T15:36:37Z

  --config "${GATEWAY_CONFIG}"
  --bind-address 0.0.0.0
  --port "${HOST_PORT}"
+  --health-port "${HEALTH_PORT}"


We don't currently depend on the health port in any configuration other than Kubernetes, so I would suggest we not add it into the docker e2e path and only exercise it on the kubernetes e2e tests (which would enable the health port by default).

TaylorMutch · 2026-05-18T15:39:45Z

+  # Database readiness timeout in seconds used by /readyz and /health.
+  # Keep this lower than probes.readiness.timeoutSeconds so the gateway can
+  # return a readiness decision before Kubernetes times out the probe.
+  readinessDbTimeoutSecs: 1


This seems like a possible footgun; the readiness check should be pretty instantaneous right? And if we have to add in a configuration option to the server to determine this, that seems strange to me.

Perhaps the health check for DB health should run in the background and the health check handler only checks if we detect the state to be unhealthy? That would remove this race condition. WDYT?

Yes that works for me. The database healthcheck would run in background (with the configurable timeout) and update some sort of store in the gateway. When /readyz is queried the store is read. This would indeed be almost instant

Emit Prometheus readiness metrics for database probes (healthy gauge and outcome-labeled latency histogram) with coverage in health HTTP tests. Restrict Store::close behind test support cfg to prevent accidental runtime pool shutdown under live traffic. Signed-off-by: Adrien Langou <alangou@nvidia.com>

Signed-off-by: Adrien Langou <alangou@nvidia.com>

alangou requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 12, 2026 15:18

TaylorMutch reviewed May 12, 2026

View reviewed changes

Comment thread crates/openshell-server/src/http.rs Outdated

Comment thread crates/openshell-server/src/persistence/mod.rs

alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch 5 times, most recently from 2cb34c9 to 05b6998 Compare May 13, 2026 12:44

alangou requested a review from TaylorMutch May 13, 2026 13:42

TaylorMutch added the test:e2e Requires end-to-end coverage label May 13, 2026

alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch from 05b6998 to 2b11909 Compare May 18, 2026 09:22

TaylorMutch requested changes May 18, 2026

View reviewed changes

TaylorMutch self-assigned this May 19, 2026

alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch 4 times, most recently from ab4deb4 to a167000 Compare May 19, 2026 16:06

alangou added 2 commits May 20, 2026 11:09

test(e2e): add simple e2e test with kubernetes to test /readyz

1cb6011

Signed-off-by: Adrien Langou <alangou@nvidia.com>

alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch from a167000 to 1cb6011 Compare May 20, 2026 09:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): make readiness health checks dependency-aware#1328

fix(gateway): make readiness health checks dependency-aware#1328
alangou wants to merge 2 commits into
NVIDIA:mainfrom
alangou:alangou/os-156-update-gateway-health-check-to-account-for-database

alangou commented May 12, 2026 •

edited

Loading

Uh oh!

TaylorMutch left a comment

Uh oh!

Uh oh!

Uh oh!

alangou commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

alangou commented May 18, 2026

Uh oh!

TaylorMutch left a comment

Uh oh!

TaylorMutch May 18, 2026

Uh oh!

TaylorMutch May 18, 2026

Uh oh!

TaylorMutch May 18, 2026

Uh oh!

alangou May 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alangou commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Why /healthz still returns 200 when DB is down

Why the DB health check runs in the background

Why close is test-only

Testing

Checklist

Uh oh!

TaylorMutch left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

alangou commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

alangou commented May 18, 2026

Uh oh!

TaylorMutch left a comment

Choose a reason for hiding this comment

Uh oh!

TaylorMutch May 18, 2026

Choose a reason for hiding this comment

Uh oh!

TaylorMutch May 18, 2026

Choose a reason for hiding this comment

Uh oh!

TaylorMutch May 18, 2026

Choose a reason for hiding this comment

Uh oh!

alangou May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

alangou commented May 12, 2026 •

edited

Loading

Why `/healthz` still returns 200 when DB is down

Why `close` is test-only

alangou May 18, 2026 •

edited

Loading