
ssh: increase server startup timeout to 45m for GPU accelerators#5569

Open
anton-107 wants to merge 2 commits into
mainfrom
ssh-gpu-startup-timeout

@anton-107

Contributor

Changes

databricks ssh connect waits for the bootstrap SSH server job run to reach the RUNNING state, bounded by a fixed 10-minute taskStartupTimeout. Serverless GPU compute (requested via --accelerator, for example GPU_1xA10 or GPU_8xH100) can take much longer to provision than CPU compute, so the wait frequently expires before the GPU node is up.

When --accelerator is set, use a 45-minute startup timeout (gpuTaskStartupTimeout) instead. The CPU/dedicated-cluster path keeps the existing 10-minute timeout.

Why

GPU serverless provisioning routinely exceeds 10 minutes, causing ssh connect --accelerator ... to fail with a startup timeout even though the job would have come up shortly after.

Tests

  • go build ./experimental/ssh/... and go test ./experimental/ssh/... pass.
  • The change is a constant selection wired through the existing ClientOptions.TaskStartupTimeout; no behavior change without --accelerator.

This pull request and its description were written by Isaac.

Serverless GPU compute (--accelerator) can take much longer to
provision than CPU compute, and the fixed 10-minute wait for the SSH
server job to start frequently expires before the GPU node is up. Use a
45-minute startup timeout when an accelerator is requested.

Co-authored-by: Isaac
@anton-107 anton-107 temporarily deployed to test-trigger-is June 12, 2026 10:22 — with GitHub Actions Inactive
@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

Waiting for approval

Based on git history, these people are best suited to review:

  • @ilia-db -- recent work in experimental/ssh/cmd/
  • @simonfaltum -- recent work in ./, experimental/ssh/cmd/
  • @denik -- recent work in ./

Eligible reviewers: @andrewnester, @pietern, @renaudhartert-db, @shreyas-goenka

Suggestions based on git history. See OWNERS for ownership rules.

@anton-107 anton-107 requested review from pietern and rclarey June 12, 2026 10:24
@anton-107 anton-107 temporarily deployed to test-trigger-is June 12, 2026 10:26 — with GitHub Actions Inactive
Co-authored-by: Isaac
@anton-107 anton-107 force-pushed the ssh-gpu-startup-timeout branch from a71119d to 45cfdd4 Compare June 12, 2026 10:27
@anton-107 anton-107 temporarily deployed to test-trigger-is June 12, 2026 10:28 — with GitHub Actions Inactive
@eng-dev-ecosystem-bot

Collaborator

Integration test report

Commit: 45cfdd4

Run: 27410069834

| Env | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
| --- | --- | --- | --- | --- | --- |
| 💚 aws linux | 7 | 15 | 264 | 973 | 6:42 |
| 💚 aws windows | 7 | 15 | 266 | 971 | 11:13 |
| 💚 aws-ucws linux | 7 | 15 | 360 | 887 | 7:38 |
| 💚 aws-ucws windows | 7 | 15 | 362 | 885 | 12:48 |
| 💚 azure linux | 1 | 17 | 267 | 971 | 6:32 |
| 💚 azure windows | 1 | 17 | 269 | 969 | 11:38 |
| 💚 azure-ucws linux | 1 | 17 | 365 | 883 | 7:24 |
| 💚 azure-ucws windows | 1 | 17 | 367 | 881 | 13:07 |
| 💚 gcp linux | 1 | 17 | 263 | 974 | 7:27 |
| 💚 gcp windows | 1 | 17 | 265 | 972 | 11:21 |

22 interesting tests: 15 SKIP, 7 RECOVERED

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 💚 TestAccept | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/grants/select | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/ssh/connection | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |

Top 30 slowest tests (at least 2 minutes):

| duration | env | testname |
| --- | --- | --- |
| 6:14 | aws windows | TestAccept |
| 6:12 | gcp windows | TestAccept |
| 6:10 | azure windows | TestAccept |
| 6:02 | aws-ucws windows | TestAccept |
| 6:00 | azure-ucws windows | TestAccept |
| 4:30 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 4:21 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 4:19 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 4:13 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:09 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:08 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:07 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:05 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:02 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:54 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:54 | aws-ucws linux | TestAccept |
| 2:54 | azure linux | TestAccept |
| 2:53 | gcp linux | TestAccept |
| 2:53 | azure-ucws linux | TestAccept |
| 2:53 | aws linux | TestAccept |
| 2:50 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:50 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:48 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:47 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:45 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:44 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:41 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:39 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:30 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:29 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
