
tools: sse-timeout-probe for UI-01 / UI-02 empirical trace#292

Draft
jamesbroadhead wants to merge 1 commit into databricks:main from jamesbroadhead:jb/ui-01-sse-timeout-repro

Conversation


@jamesbroadhead jamesbroadhead commented Apr 21, 2026

Summary

Adds a tiny TS reproducer (tools/sse-timeout-probe/) for the Databricks Apps SSE idle-timeout gap reported in ES-1742245 and captured as UI-01 / UI-02 in the internal EMEA Apps "gaps that matter" doc.

Why this is a separate PR

UI-01's source doc and the ES ticket disagree on whether the drop is timeout-driven or buffering-driven:

  • Doc says: 75% of SSE connections drop, "distinct from the 120s idle timeout," likely buffering or HTTP/2 multiplexing.
  • Naïm Achahboun's comment on the ticket says: it is the timeout — multi-agent LLM calls have non-deterministic duration, so ~30% finish under the ceiling and ~70% don't.

We can't pick the fix until we know which diagnosis is correct. Running this probe against a dogfood app answers the question deterministically and tells us whether to:

  • raise request_timeout per-route on apps/gateway, or
  • ship server-side SSE keepalive middleware in AppKit (and templates), or
  • harden FlushInterval / HTTP/2 end-to-end across apps-gateway + oauth2-proxy + apps/runtime.
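If the probe shows an idle timeout, the AppKit middleware option amounts to periodically writing SSE comment frames so the connection never looks idle to an intermediary. A minimal sketch of that shape (all names here are hypothetical, not the actual AppKit API):

```typescript
// Hedged sketch of server-side SSE keepalive, assuming an idle-timeout
// diagnosis. An SSE comment frame starts with ':' and is ignored by clients,
// so it resets proxy idle timers without affecting application events.
function keepaliveFrame(label = "keepalive"): string {
  return `: ${label}\n\n`;
}

// Writes a comment frame every `intervalMs` until the response closes.
// The `res` shape matches Node's http.ServerResponse closely enough for
// illustration; the real middleware would hook the framework's response.
function attachKeepalive(
  res: { write(chunk: string): boolean; on(ev: string, cb: () => void): void },
  intervalMs = 30_000,
): void {
  const timer = setInterval(() => res.write(keepaliveFrame()), intervalMs);
  res.on("close", () => clearInterval(timer));
}
```

Note this only helps against idle timeouts; an absolute request timeout fires regardless of traffic, which is exactly the distinction the probe is meant to settle.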

What's in the PR

  • tools/sse-timeout-probe/probe.ts — opens one SSE connection per duration in a configurable ladder; records lifetime, bytes, and how the connection ended (completed / server-close / network-error).
  • tools/sse-timeout-probe/server.ts — companion server on /sse-probe that holds the connection open for the requested duration with optional heartbeat. Deploy as an app entrypoint to measure the Databricks-hosted ceiling vs an EKS / localhost control.
  • tools/sse-timeout-probe/README.md — usage, what to look for (sharp cliff at 60s/90s/120s/180s maps back to apps/gateway vs oauth2-proxy vs DP ApiProxy envoy), and how heartbeat behavior distinguishes idle timeouts from absolute request timeouts.
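The measurement loop in probe.ts can be sketched roughly as follows (a simplified illustration, not the file's actual contents; `classifyEnd` and `probeOnce` are hypothetical names, and the 0.95 tolerance is an assumption):

```typescript
type EndReason = "completed" | "server-close" | "network-error";

// A stream that throws is a network error; one that closes well short of the
// requested duration was cut by the server or a proxy; otherwise it completed.
function classifyEnd(lifetimeMs: number, requestedMs: number, errored: boolean): EndReason {
  if (errored) return "network-error";
  return lifetimeMs < requestedMs * 0.95 ? "server-close" : "completed";
}

// One SSE connection per ladder entry: record lifetime, bytes, and end reason.
// Requires Node 18+ for global fetch with streaming bodies.
async function probeOnce(base: string, requestedMs: number) {
  const started = Date.now();
  let bytes = 0;
  let errored = false;
  try {
    const res = await fetch(`${base}/sse-probe?duration=${requestedMs}`, {
      headers: { accept: "text/event-stream" },
    });
    const reader = res.body!.getReader();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      bytes += value?.byteLength ?? 0;
    }
  } catch {
    errored = true;
  }
  const lifetimeMs = Date.now() - started;
  return { requestedMs, lifetimeMs, bytes, ended: classifyEnd(lifetimeMs, requestedMs, errored) };
}
```

A sharp cliff in the `lifetimeMs` column across the ladder, rather than a gradual spread, is what distinguishes a fixed proxy ceiling from sporadic network loss.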

Test plan

  • Deploy server.ts as an app in dogfood; run probe.ts against it and against an EKS / localhost control; capture the result ladders side-by-side.
  • Re-run with --heartbeat 30000 to distinguish idle-timeout behavior from absolute request-timeout behavior.
  • Decide UI-01 / UI-02 fix path based on the results.
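The decision rule the heartbeat re-run enables can be stated mechanically (a hypothetical helper, not part of the PR; the 5s tolerance is an assumption): if the heartbeat moves or removes the cliff, the limit is an idle timeout and keepalive middleware is the fix; if the cliff stays at the same wall-clock point, it is an absolute request timeout and the per-route request_timeout raise is the fix.

```typescript
// Hedged sketch of the interpretation step. `cliffWithHeartbeatMs` is null
// when no cliff was observed with the heartbeat enabled (connections survived
// the whole ladder).
function diagnose(
  cliffWithoutHeartbeatMs: number,
  cliffWithHeartbeatMs: number | null,
  toleranceMs = 5_000,
): "idle-timeout" | "absolute-timeout" {
  if (cliffWithHeartbeatMs === null) return "idle-timeout";
  return Math.abs(cliffWithHeartbeatMs - cliffWithoutHeartbeatMs) <= toleranceMs
    ? "absolute-timeout"
    : "idle-timeout";
}
```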

Follow-ups explicitly out of scope for this PR

  • Wire the companion server into apps/dev-playground so probing is one pnpm deploy away.
  • WebSocket variant of the probe so UI-02's ping/pong-bypass claim can be measured on the same axes.
  • Once the fix path is chosen, the AppKit-side middleware (heartbeat / reconnect hardening) or apps/gateway PR will reference this probe as the regression test.

This pull request and its description were written by Claude (claude.ai).

Adds a tiny TS reproducer for the SSE idle-timeout gap reported in ES-1742245
(field-facing "AI Value Roadmap" app dropping ~75% of SSE connections
through the Apps reverse proxy).

Three files:
- probe.ts     — opens one SSE connection per duration in a configurable
                 ladder; records lifetime, bytes, and how the connection
                 ended (completed / server-close / network-error).
- server.ts    — companion server that responds on /sse-probe, holding the
                 connection open for the requested duration with an optional
                 heartbeat comment. Deploy as an app entrypoint to measure
                 the Databricks-hosted ceiling vs an EKS / localhost control.
- README.md    — usage, what to look for (sharp cliff at 60s/90s/120s/180s
                 maps back to apps/gateway vs oauth2-proxy vs DP ApiProxy
                 envoy), and how heartbeat behavior distinguishes idle
                 timeouts from absolute request timeouts.

Why this is a separate PR: UI-01's source doc and ES-1742245 disagree on
whether the drop is timeout-driven or buffering-driven. Running this probe
against a dogfood app answers that question empirically and tells us which
fix to pursue (per-route request_timeout raise, heartbeat middleware, or
buffering / HTTP/2 hardening). Draft because the fix itself depends on
those results.

Co-authored-by: Isaac
