Skip to content

sandbox: cap API and SSH timeouts so unenabled regions fail fast#5565

Open
akshaysingla-db wants to merge 1 commit into
mainfrom
akshay/sandbox-timeouts
Open

sandbox: cap API and SSH timeouts so unenabled regions fail fast#5565
akshaysingla-db wants to merge 1 commit into
mainfrom
akshay/sandbox-timeouts

Conversation

@akshaysingla-db

Copy link
Copy Markdown
Contributor

Summary

  • In regions where the sandbox manager service isn't deployed, the gateway silently holds the connection rather than returning a structured error, so databricks sandbox <anything> would hang for ~60s on the API call and another ~75s in ssh before giving up.
  • Tighten the per-request timeout on every sandbox HTTP call to 10s (via context.WithTimeout inside a new do wrapper on *sandboxAPI), and add -o ConnectTimeout=10 to the ssh argv.
  • Translate context.DeadlineExceeded from the API path into "sandbox API timed out after 10s — this region may not have sandbox enabled, or the manager is unreachable", so the user gets an actionable hint instead of a raw deadline error.

Long flows (cold start, etc.) are made of many short calls polled in a loop, so each call still has its own 10s budget while the outer wait loop's own deadline (startWaitTimeout = 10m) keeps governing the overall flow.

Considered routing through the SDK's Config.HTTPTimeoutSeconds instead, but that path rewrites a timeout into a freshly-allocated url.Error whose inner error no longer satisfies errors.Is(err, context.DeadlineExceeded) — making clean error detection require string matching, which the repo style guide forbids. The context.WithTimeout approach composes fine with the SDK's 60s default backstop.

Test plan

  • go test ./cmd/sandbox/... — new TestSandboxAPIDoTranslatesTimeout exercises the wrapper against an httptest server that hangs and asserts both the "timed out" string and the region hint; existing tests pass unchanged.
  • TestBuildSSHArgsBaseFlags now asserts ConnectTimeout=10 is in the argv.
  • Manual: invoke databricks sandbox list against a profile in a region without the manager and confirm the clear timeout message arrives in ~10s instead of hanging.

This pull request and its description were written by Isaac.

…ns fail fast

In regions where the sandbox manager service isn't deployed, the
gateway holds the connection without returning a structured error,
so `databricks sandbox <anything>` would hang indefinitely behind
the SDK's default 60s inactivity timeout (and ssh's ~75s connect
default). Tighten both to 10s and translate the API timeout into a
user-facing message that hints at the most likely cause.

Co-authored-by: Isaac
@github-actions

Copy link
Copy Markdown
Contributor

Waiting for approval

Based on git history, these people are best suited to review:

  • @pietern -- recent work in cmd/sandbox/

Eligible reviewers: @shuochen0311

Suggestions based on git history. See OWNERS for ownership rules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants