experimental/ssh: show compute provisioning status during ssh connect startup#5576
Open
TanishqDatabricks wants to merge 1 commit into
… startup

GPU_8xH100 serverless capacity takes ~10 minutes at P50 and ~30 minutes at P90 to acquire, but while waiting `ssh connect` only showed a generic "Starting SSH server... (task: PENDING)" spinner, so users assumed a long wait meant a service outage (see the Zillow report in #remote-development-help).

Show "Waiting for compute to start..." while the bootstrap job's compute spins up (all connection types, including dedicated-cluster auto-start), and print an upfront notice for GPU accelerators that provisioning can take upwards of 10 minutes. The startup timeout increase for GPU accelerators is handled separately.

Co-authored-by: Isaac
Waiting for approval
Based on git history, these people are best suited to review:
Eligible reviewers: Suggestions based on git history. See OWNERS for ownership rules.
Integration test report
Commit: ea6ce07
22 interesting tests: 15 SKIP, 7 KNOWN
Top 28 slowest tests (at least 2 minutes):
Changes
While the SSH server bootstrap job's compute spins up, the spinner now reads `Waiting for compute to start...` (all connection types) instead of `Starting SSH server...`. For GPU accelerators, a persistent notice is printed upfront: `Waiting for GPU_8xH100 compute to be provisioned. This can take upwards of 10 minutes depending on capacity...`.
Why
`ssh connect --accelerator=GPU_8xH100` frequently fails with:
GPU_8xH100 launch latency is ~10 minutes at P50 and ~30 minutes at P90, so sessions routinely hit the startup timeout even when nothing is wrong. Nothing in the output indicated that compute was being provisioned, so users read the error as a service outage.
Tests
`go build`, `go vet`, and `go test ./experimental/ssh/...` all pass; `TestWaitForJobToStartSurfacesFailure` updated for the `waitForJobToStart` signature change.
This pull request and its description were written by Isaac.
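For readers without the diff at hand, the message selection described under Changes might look roughly like the sketch below. It is not the code from this PR: `taskState`, `spinnerMessage`, and `gpuNotice` are hypothetical names, the mapping of a `PENDING` task to "compute is still starting" is an assumption, and so is keeping the `(task: ...)` suffix on the later message.

```go
package main

import "fmt"

// taskState is a hypothetical stand-in for the bootstrap job's task state;
// the real CLI gets this from the Jobs API.
type taskState string

const (
	statePending taskState = "PENDING"
	stateRunning taskState = "RUNNING"
)

// spinnerMessage picks the spinner text for a task state.
// Assumption: PENDING means the job's compute has not started yet.
func spinnerMessage(s taskState) string {
	if s == statePending {
		return "Waiting for compute to start..."
	}
	return fmt.Sprintf("Starting SSH server... (task: %s)", s)
}

// gpuNotice returns the upfront notice for a GPU accelerator,
// or an empty string when no accelerator was requested.
func gpuNotice(accelerator string) string {
	if accelerator == "" {
		return ""
	}
	return fmt.Sprintf(
		"Waiting for %s compute to be provisioned. This can take upwards of 10 minutes depending on capacity...",
		accelerator,
	)
}

func main() {
	// Print the persistent notice once, before the spinner starts.
	if n := gpuNotice("GPU_8xH100"); n != "" {
		fmt.Println(n)
	}
	// Simulate the spinner text as the task moves through its states.
	for _, s := range []taskState{statePending, stateRunning} {
		fmt.Println(spinnerMessage(s))
	}
}
```

Keeping the notice separate from the spinner text reflects the description's distinction between a persistent upfront line and a status that changes as the job progresses.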