
experimental/ssh: show compute provisioning status during ssh connect startup #5572

Closed
TanishqDatabricks wants to merge 1 commit into databricks:main from TanishqDatabricks:ssh-connect-gpu-startup-ux



TanishqDatabricks commented Jun 12, 2026


Changes

While the SSH server bootstrap job's compute spins up, the spinner now reads `Waiting for compute to start...` (all connection types) instead of `Starting SSH server...`. For GPU accelerators, a persistent notice is printed upfront: `Waiting for GPU_8xH100 compute to be provisioned. This can take upwards of 10 minutes depending on capacity...`
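For illustration only, the upfront notice could be derived from the accelerator name roughly as follows. This is a minimal sketch, not the code in this PR: `provisioningNotice` is a hypothetical helper, and treating any name with a `GPU_` prefix as a GPU accelerator is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// provisioningNotice returns the persistent notice printed before the spinner
// starts, or "" when no notice applies.
// Hypothetical helper: the "GPU_" prefix check is an assumption about how
// GPU accelerators are identified, not something taken from the CLI source.
func provisioningNotice(accelerator string) string {
	if !strings.HasPrefix(accelerator, "GPU_") {
		return ""
	}
	return fmt.Sprintf(
		"Waiting for %s compute to be provisioned. This can take upwards of 10 minutes depending on capacity...",
		accelerator,
	)
}

func main() {
	// Print the notice only when one applies, so non-GPU connections
	// see the spinner alone.
	for _, acc := range []string{"", "GPU_8xH100"} {
		if notice := provisioningNotice(acc); notice != "" {
			fmt.Println(notice)
		}
	}
}
```

Returning an empty string for the non-GPU case keeps the decision in one place and leaves printing to the caller.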

Why

`ssh connect --accelerator=GPU_8xH100` frequently fails with:

Error: failed to ensure that ssh server is running: failed to submit and start ssh server job: timed out: waiting for task to start (current state: PENDING)

GPU_8xH100 launch latency is ~10 minutes at P50 and ~30 minutes at P90, so sessions routinely hit the startup timeout even when nothing is wrong. Nothing in the output indicated that compute was being provisioned, so users read the error as a service outage.

Tests

  • go build, go vet, and go test ./experimental/ssh/... all pass; TestWaitForJobToStartSurfacesFailure updated for the waitForJobToStart signature change.
  • The change is display-only (spinner and notice text); no control flow or error behavior is modified.

This pull request and its description were written by Isaac.


github-actions Bot commented Jun 12, 2026


Waiting for approval

Based on git history, these people are best suited to review:

  • @ilia-db -- recent work in experimental/ssh/internal/client/
  • @anton-107 -- recent work in experimental/ssh/internal/client/

Eligible reviewers: @andrewnester, @denik, @pietern, @renaudhartert-db, @shreyas-goenka, @simonfaltum

Suggestions based on git history. See OWNERS for ownership rules.

TanishqDatabricks force-pushed the ssh-connect-gpu-startup-ux branch from e053cdd to 0785ade, June 12, 2026 12:40
TanishqDatabricks changed the title from "experimental/ssh: clarify GPU compute provisioning during ssh connect startup" to "experimental/ssh: clarify compute provisioning during ssh connect startup", Jun 12, 2026
TanishqDatabricks force-pushed the ssh-connect-gpu-startup-ux branch from 0785ade to 901c635, June 12, 2026 16:13
… startup

GPU_8xH100 serverless capacity takes ~10 minutes at P50 and ~30 minutes at
P90 to acquire, but while waiting `ssh connect` only showed a generic
"Starting SSH server... (task: PENDING)" spinner, so users assumed a long
wait meant a service outage (see the Zillow report in
#remote-development-help).

Show "Waiting for compute to start..." while the bootstrap job's compute
spins up (all connection types, including dedicated-cluster auto-start),
and print an upfront notice for GPU accelerators that provisioning can
take upwards of 10 minutes.

The startup timeout increase for GPU accelerators is handled separately.

Co-authored-by: Isaac
TanishqDatabricks force-pushed the ssh-connect-gpu-startup-ux branch from 901c635 to ea6ce07, June 12, 2026 16:15
TanishqDatabricks changed the title from "experimental/ssh: clarify compute provisioning during ssh connect startup" to "experimental/ssh: show compute provisioning status during ssh connect startup", Jun 12, 2026
@github-actions


An authorized user can trigger integration tests manually by following the instructions below:

Trigger:
go/deco-tests-run/cli

Inputs:

  • PR number: 5572
  • Commit SHA: ea6ce0717101b322b47ce05998b77de784de46ef

Checks will be approved automatically on success.

@TanishqDatabricks


Closing in favor of #5576: this PR was opened from a fork, where GitHub does not issue OIDC tokens to PR workflows, so the JFrog-authenticated CI jobs (test-exp-ssh, validate-generated) could never pass. #5576 carries the identical commit from an in-repo branch.
