ssh: increase server startup timeout to 45m for GPU accelerators #5569
Open
anton-107 wants to merge 2 commits into
Serverless GPU compute (--accelerator) can take much longer to provision than CPU compute, and the fixed 10-minute wait for the SSH server job to start frequently expires before the GPU node is up. Use a 45-minute startup timeout when an accelerator is requested.
Co-authored-by: Isaac
Co-authored-by: Isaac
a71119d to 45cfdd4
Integration test report
Commit: 45cfdd4
22 interesting tests: 15 SKIP, 7 RECOVERED
Top 30 slowest tests (at least 2 minutes):
Changes
databricks ssh connect waits for the bootstrap SSH server job run to reach RUNNING state, bounded by a fixed 10-minute taskStartupTimeout. Serverless GPU compute (requested via --accelerator GPU_1xA10/GPU_8xH100) can take much longer to provision than CPU compute, so the wait frequently expires before the GPU node is up.

When --accelerator is set, use a 45-minute startup timeout (gpuTaskStartupTimeout) instead. The CPU/dedicated-cluster path keeps the existing 10-minute timeout.

Why
GPU serverless provisioning routinely exceeds 10 minutes, causing ssh connect --accelerator ... to fail with a startup timeout even though the job would have come up shortly after.

Tests
go build ./experimental/ssh/... and go test ./experimental/ssh/... pass.
ClientOptions.TaskStartupTimeout; no behavior change without --accelerator.

This pull request and its description were written by Isaac.