Skip to content

experimental/ssh: show compute provisioning status during ssh connect startup#5576

Open
TanishqDatabricks wants to merge 1 commit into
mainfrom
ssh-connect-gpu-startup-ux
Open

experimental/ssh: show compute provisioning status during ssh connect startup#5576
TanishqDatabricks wants to merge 1 commit into
mainfrom
ssh-connect-gpu-startup-ux

Conversation

@TanishqDatabricks

@TanishqDatabricks TanishqDatabricks commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Changes

While the SSH server bootstrap job's compute spins up, the spinner now reads Waiting for compute to start... (all connection types) instead of Starting SSH server.... For GPU accelerators, a persistent notice is printed upfront: Waiting for GPU_8xH100 compute to be provisioned. This can take upwards of 10 minutes depending on capacity....

Why

ssh connect --accelerator=GPU_8xH100 frequently fails with:

Error: failed to ensure that ssh server is running: failed to submit and start ssh server job: timed out: waiting for task to start (current state: PENDING)

GPU_8xH100 launch latency is ~10 minutes at P50 and ~30 minutes at P90, so sessions routinely hit the startup timeout even when nothing is wrong. Nothing in the output indicated that compute was being provisioned, so users read the error as a service outage.

Tests

  • go build, go vet, and go test ./experimental/ssh/... all pass; TestWaitForJobToStartSurfacesFailure updated for the waitForJobToStart signature change.
  • The change is display-only (spinner and notice text); no control flow or error behavior is modified.

This pull request and its description were written by Isaac.

… startup

GPU_8xH100 serverless capacity takes ~10 minutes at P50 and ~30 minutes at
P90 to acquire, but while waiting `ssh connect` only showed a generic
"Starting SSH server... (task: PENDING)" spinner, so users assumed a long
wait meant a service outage (see the Zillow report in
#remote-development-help).

Show "Waiting for compute to start..." while the bootstrap job's compute
spins up (all connection types, including dedicated-cluster auto-start),
and print an upfront notice for GPU accelerators that provisioning can
take upwards of 10 minutes.

The startup timeout increase for GPU accelerators is handled separately.

Co-authored-by: Isaac
@github-actions

Copy link
Copy Markdown
Contributor

Waiting for approval

Based on git history, these people are best suited to review:

  • @ilia-db -- recent work in experimental/ssh/internal/client/
  • @anton-107 -- recent work in experimental/ssh/internal/client/

Eligible reviewers: @andrewnester, @denik, @pietern, @renaudhartert-db, @shreyas-goenka, @simonfaltum

Suggestions based on git history. See OWNERS for ownership rules.

@eng-dev-ecosystem-bot

Copy link
Copy Markdown
Collaborator

Integration test report

Commit: ea6ce07

Run: 27428511968

Env 🟨​KNOWN 💚​RECOVERED 🙈​SKIP ✅​pass 🙈​skip Time
🟨​ aws linux 7 15 264 977 8:10
🟨​ aws windows 7 15 266 975 13:17
💚​ aws-ucws linux 7 15 360 891 7:56
💚​ aws-ucws windows 7 15 362 889 12:18
💚​ azure linux 1 17 267 975 7:11
💚​ azure windows 1 17 269 973 11:21
💚​ azure-ucws linux 1 17 365 887 8:01
💚​ azure-ucws windows 1 17 367 885 12:46
💚​ gcp linux 1 17 263 978 8:26
💚​ gcp windows 1 17 265 976 11:44
22 interesting tests: 15 SKIP, 7 KNOWN
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
🟨​ TestAccept 🟨​K 🟨​K 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/invariant/no_drift 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 🟨​K 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 🟨​K 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 🟨​K 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 🟨​K 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 🟨​K 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 🟨​K 🟨​K 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/replace_existing 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/grants/select 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
Top 28 slowest tests (at least 2 minutes):
duration env testname
6:16 gcp windows TestAccept
6:12 aws-ucws windows TestAccept
6:05 azure-ucws windows TestAccept
6:02 azure windows TestAccept
4:54 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:19 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:14 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:08 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:57 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:41 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:29 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:17 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:16 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:12 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:05 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:58 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:58 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:56 azure linux TestAccept
2:54 gcp linux TestAccept
2:49 aws-ucws linux TestAccept
2:47 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:46 azure-ucws linux TestAccept
2:46 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:45 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:44 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:36 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:34 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:32 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants