
tools: sse-timeout-probe for UI-01 / UI-02 empirical trace#292

Draft
jamesbroadhead wants to merge 1 commit into databricks:main from jamesbroadhead:jb/ui-01-sse-timeout-repro

Conversation


@jamesbroadhead jamesbroadhead commented Apr 21, 2026

Summary

Adds a tiny TS reproducer (tools/sse-timeout-probe/) for the Databricks Apps SSE idle-timeout gap reported in ES-1742245 and captured as UI-01 / UI-02 in the internal EMEA Apps "gaps that matter" doc.

Why this is a separate PR

UI-01's source doc and the ES ticket disagree on whether the drop is timeout-driven or buffering-driven:

  • Doc says: 75% of SSE connections drop, "distinct from the 120s idle timeout," likely buffering or HTTP/2 multiplexing.
  • Naïm Achahboun's comment on the ticket says: it is the timeout — multi-agent LLM calls have non-deterministic duration, so ~30% finish under the ceiling and ~70% don't.

We can't pick the fix until we know which diagnosis is correct. Running this probe against a dogfood app answers the question deterministically and tells us whether to:

  • raise request_timeout per-route on apps/gateway, or
  • ship server-side SSE keepalive middleware in AppKit (and templates), or
  • harden FlushInterval / HTTP/2 end-to-end across apps-gateway + oauth2-proxy + apps/runtime.
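If the probe shows an idle timeout, the AppKit middleware option amounts to periodically writing SSE comment frames so the connection never looks idle to an intermediary. A minimal sketch of that shape (all names here are hypothetical, not the actual AppKit API):

```typescript
// Hedged sketch of server-side SSE keepalive, assuming an idle-timeout
// diagnosis. An SSE comment frame starts with ':' and is ignored by clients,
// so it resets proxy idle timers without affecting application events.
function keepaliveFrame(label = "keepalive"): string {
  return `: ${label}\n\n`;
}

// Writes a comment frame every `intervalMs` until the response closes.
// The `res` shape matches Node's http.ServerResponse closely enough for
// illustration; the real middleware would hook the framework's response.
function attachKeepalive(
  res: { write(chunk: string): boolean; on(ev: string, cb: () => void): void },
  intervalMs = 30_000,
): void {
  const timer = setInterval(() => res.write(keepaliveFrame()), intervalMs);
  res.on("close", () => clearInterval(timer));
}
```

Note this only helps against idle timeouts; an absolute request timeout fires regardless of traffic, which is exactly the distinction the probe is meant to settle.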

What's in the PR

  • tools/sse-timeout-probe/probe.ts — opens one SSE connection per duration in a configurable ladder; records lifetime, bytes, and how the connection ended (completed / server-close / network-error).
  • tools/sse-timeout-probe/server.ts — companion server on /sse-probe that holds the connection open for the requested duration with optional heartbeat. Deploy as an app entrypoint to measure the Databricks-hosted ceiling vs an EKS / localhost control.
  • tools/sse-timeout-probe/README.md — usage, what to look for (sharp cliff at 60s/90s/120s/180s maps back to apps/gateway vs oauth2-proxy vs DP ApiProxy envoy), and how heartbeat behavior distinguishes idle timeouts from absolute request timeouts.
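The measurement loop in probe.ts can be sketched roughly as follows (a simplified illustration, not the file's actual contents; `classifyEnd` and `probeOnce` are hypothetical names, and the 0.95 tolerance is an assumption):

```typescript
type EndReason = "completed" | "server-close" | "network-error";

// A stream that throws is a network error; one that closes well short of the
// requested duration was cut by the server or a proxy; otherwise it completed.
function classifyEnd(lifetimeMs: number, requestedMs: number, errored: boolean): EndReason {
  if (errored) return "network-error";
  return lifetimeMs < requestedMs * 0.95 ? "server-close" : "completed";
}

// One SSE connection per ladder entry: record lifetime, bytes, and end reason.
// Requires Node 18+ for global fetch with streaming bodies.
async function probeOnce(base: string, requestedMs: number) {
  const started = Date.now();
  let bytes = 0;
  let errored = false;
  try {
    const res = await fetch(`${base}/sse-probe?duration=${requestedMs}`, {
      headers: { accept: "text/event-stream" },
    });
    const reader = res.body!.getReader();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      bytes += value?.byteLength ?? 0;
    }
  } catch {
    errored = true;
  }
  const lifetimeMs = Date.now() - started;
  return { requestedMs, lifetimeMs, bytes, ended: classifyEnd(lifetimeMs, requestedMs, errored) };
}
```

A sharp cliff in the `lifetimeMs` column across the ladder, rather than a gradual spread, is what distinguishes a fixed proxy ceiling from sporadic network loss.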

Test plan

  • Deploy server.ts as an app in dogfood; run probe.ts against it and against an EKS / localhost control; capture the result ladders side-by-side.
  • Re-run with --heartbeat 30000 to distinguish idle-timeout behavior from absolute request-timeout behavior.
  • Decide UI-01 / UI-02 fix path based on the results.
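The decision rule the heartbeat re-run enables can be stated mechanically (a hypothetical helper, not part of the PR; the 5s tolerance is an assumption): if the heartbeat moves or removes the cliff, the limit is an idle timeout and keepalive middleware is the fix; if the cliff stays at the same wall-clock point, it is an absolute request timeout and the per-route request_timeout raise is the fix.

```typescript
// Hedged sketch of the interpretation step. `cliffWithHeartbeatMs` is null
// when no cliff was observed with the heartbeat enabled (connections survived
// the whole ladder).
function diagnose(
  cliffWithoutHeartbeatMs: number,
  cliffWithHeartbeatMs: number | null,
  toleranceMs = 5_000,
): "idle-timeout" | "absolute-timeout" {
  if (cliffWithHeartbeatMs === null) return "idle-timeout";
  return Math.abs(cliffWithHeartbeatMs - cliffWithoutHeartbeatMs) <= toleranceMs
    ? "absolute-timeout"
    : "idle-timeout";
}
```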

Follow-ups explicitly out of scope for this PR

  • Wire the companion server into apps/dev-playground so probing is one pnpm deploy away.
  • WebSocket variant of the probe so UI-02's ping/pong-bypass claim can be measured on the same axes.
  • Once the fix path is chosen, the AppKit-side middleware (heartbeat / reconnect hardening) or apps/gateway PR will reference this probe as the regression test.

This pull request and its description were written by Claude (claude.ai).

Adds a tiny TS reproducer for the SSE idle-timeout gap reported in ES-1742245
(field-facing "AI Value Roadmap" app dropping ~75% of SSE connections
through the Apps reverse proxy).

Three files:
- probe.ts     — opens one SSE connection per duration in a configurable
                 ladder; records lifetime, bytes, and how the connection
                 ended (completed / server-close / network-error).
- server.ts    — companion server that responds on /sse-probe, holding the
                 connection open for the requested duration with an optional
                 heartbeat comment. Deploy as an app entrypoint to measure
                 the Databricks-hosted ceiling vs an EKS / localhost control.
- README.md    — usage, what to look for (sharp cliff at 60s/90s/120s/180s
                 maps back to apps/gateway vs oauth2-proxy vs DP ApiProxy
                 envoy), and how heartbeat behavior distinguishes idle
                 timeouts from absolute request timeouts.

Why this is a separate PR: UI-01's source doc and ES-1742245 disagree on
whether the drop is timeout-driven or buffering-driven. Running this probe
against a dogfood app answers that question empirically and tells us which
fix to pursue (per-route request_timeout raise, heartbeat middleware, or
buffering / HTTP/2 hardening). Draft because the fix itself depends on
those results.

Co-authored-by: Isaac
