ssh: surface server errors from a running server on failed connections #5555

Open
anton-107 wants to merge 1 commit into main from anton-107/ssh-logs-endpoint

Conversation

@anton-107

Contributor

Changes

  • The SSH server keeps its recent warning/error log records in a bounded in-memory buffer (16 KB, oldest records evicted first) and serves them at /logs next to the existing /metadata endpoint, behind the same driver-proxy auth. The buffer is fed by a tee slog.Handler, so all records still flow to stdout (the run-page logs) unchanged.
  • When the spawned ssh client exits with a connection-level failure (code 255), ssh connect fetches /logs and prints the server's actual errors (e.g. failed to start SSHD process: ... /usr/sbin/sshd: no such file or directory). The generic "install openssh-server" hint remains as the fallback when no logs are available (e.g. older server versions without /logs); the fetch is best-effort.
  • Extracted newDriverProxyRequest from getServerMetadata, shared by the new logs fetch.
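
The buffer-plus-tee idea in the first bullet can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the names logBuffer, teeHandler, and newServerLogger are hypothetical, and the sketch stores only the level and message of each record, leaving out attributes such as the per-connection session attrs.

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"sync"
)

// logBuffer (hypothetical name) keeps the newest log lines within a byte
// budget and evicts the oldest lines first.
type logBuffer struct {
	mu    sync.Mutex
	max   int // byte budget, e.g. 16 * 1024
	size  int // bytes currently held
	lines []string
}

func (b *logBuffer) add(line string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.lines = append(b.lines, line)
	b.size += len(line)
	// Evict from the front until the budget is met; always keep the newest line.
	for b.size > b.max && len(b.lines) > 1 {
		b.size -= len(b.lines[0])
		b.lines = b.lines[1:]
	}
}

// snapshot returns a copy of the buffered lines, oldest first.
func (b *logBuffer) snapshot() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return append([]string(nil), b.lines...)
}

// teeHandler (hypothetical name) forwards records to next unchanged and
// additionally copies warning-and-above records into buf.
type teeHandler struct {
	next slog.Handler
	buf  *logBuffer
}

func (h teeHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return l >= slog.LevelWarn || h.next.Enabled(ctx, l)
}

func (h teeHandler) Handle(ctx context.Context, r slog.Record) error {
	if r.Level >= slog.LevelWarn {
		h.buf.add(r.Level.String() + " " + r.Message)
	}
	if h.next.Enabled(ctx, r.Level) {
		return h.next.Handle(ctx, r)
	}
	return nil
}

func (h teeHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return teeHandler{next: h.next.WithAttrs(attrs), buf: h.buf}
}

func (h teeHandler) WithGroup(name string) slog.Handler {
	return teeHandler{next: h.next.WithGroup(name), buf: h.buf}
}

// newServerLogger wires the tee in front of a plain stdout text handler.
func newServerLogger(buf *logBuffer) *slog.Logger {
	return slog.New(teeHandler{next: slog.NewTextHandler(os.Stdout, nil), buf: buf})
}
```

With a logger built by newServerLogger(&logBuffer{max: 16 * 1024}), Info records only reach stdout, while Warn and Error records reach stdout and are also retained for a /logs-style endpoint to serve from snapshot().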

Why

When a connection attempt fails against a healthy-looking bootstrap job (FAILURE_MODES.md Mode 1: the container lacks sshd, the server logs the error per connection and keeps running), the real error was previously unreachable from the client. The Jobs API exposes no stdout logs for a running notebook task (GetRunOutput requires a terminal state and RunOutput.Logs is unsupported for notebook tasks), so the server's own HTTP service behind the driver proxy is the only channel available while the job is alive. This complements #5552, which covers the terminated-job case.

Tests

  • New unit tests for the log buffer and tee handler (eviction, warn+ filtering, per-connection session attrs, HTTP handler); ./task test-exp-ssh and full lint pass.
  • Manually verified against dogfood with a planted failing sshd path: the bootstrap job stays RUNNING, and after the connection drops the terminal prints The SSH connection closed unexpectedly. Recent SSH server errors: followed by the server's failed to start SSHD process: fork/exec ...: no such file or directory log line. A regular ssh connect (no plant) still connects end-to-end.
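
The client-side behaviour described above (detect exit code 255, try the server logs, fall back to the generic hint) could look roughly like this. It is a sketch under assumptions, not this PR's implementation: isConnectionFailure, failureMessage, runAndExplain, errNoLogs, and the fetchLogs callback are hypothetical names, and the wording of fallbackHint is a placeholder for the existing "install openssh-server" hint.

```go
package main

import (
	"errors"
	"os/exec"
	"strings"
)

// fallbackHint is placeholder wording for the generic hint (assumption).
const fallbackHint = "The SSH connection closed unexpectedly. Check that openssh-server is installed in the container."

// errNoLogs stands in for any failure to fetch /logs, e.g. an older server
// version that does not expose the endpoint.
var errNoLogs = errors.New("no server logs available")

// isConnectionFailure reports whether err is a process exit with status 255,
// which the OpenSSH client uses when ssh itself hit an error.
func isConnectionFailure(err error) bool {
	var exitErr *exec.ExitError
	return errors.As(err, &exitErr) && exitErr.ExitCode() == 255
}

// failureMessage is best-effort: any fetch error or an empty result falls
// back to the generic hint instead of surfacing a second error.
func failureMessage(fetchLogs func() ([]string, error)) string {
	lines, err := fetchLogs()
	if err != nil || len(lines) == 0 {
		return fallbackHint
	}
	return "The SSH connection closed unexpectedly. Recent SSH server errors:\n" +
		strings.Join(lines, "\n")
}

// runAndExplain runs the client command and returns a message only for
// connection-level failures; any other outcome yields an empty string.
func runAndExplain(name string, args []string, fetchLogs func() ([]string, error)) string {
	err := exec.Command(name, args...).Run()
	if !isConnectionFailure(err) {
		return ""
	}
	return failureMessage(fetchLogs)
}
```

Passing the fetch as a callback keeps the exit-code handling testable without a live driver proxy; in a real client the callback would issue the authenticated request to /logs.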

This pull request and its description were written by Isaac.

@anton-107 anton-107 temporarily deployed to test-trigger-is June 11, 2026 14:27 — with GitHub Actions Inactive
The ssh server keeps its recent warning/error log lines in a bounded
in-memory buffer and serves them at /logs next to /metadata. When the
spawned ssh client exits with a connection-level failure (code 255),
"ssh connect" fetches that endpoint and prints the server's actual
errors instead of only a generic hint. The Jobs API exposes no stdout
logs for a running notebook task, so this is the only way to read the
server's errors while the bootstrap job is still alive.

Co-authored-by: Isaac
@anton-107 anton-107 force-pushed the anton-107/ssh-logs-endpoint branch from ddc7889 to 0a76b0c Compare June 11, 2026 14:27
@anton-107 anton-107 temporarily deployed to test-trigger-is June 11, 2026 14:28 — with GitHub Actions Inactive
@anton-107 anton-107 requested a review from rclarey June 11, 2026 14:30
@eng-dev-ecosystem-bot

Collaborator

Commit: 0a76b0c

Run: 27354123063

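| Env | 🟨 KNOWN | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|---|
| 🟨 aws linux | 7 | | | 15 | 264 | 969 | 8:35 |
| 🟨 aws windows | 7 | | | 15 | 266 | 967 | 16:28 |
| 💚 aws-ucws linux | | | 7 | 15 | 360 | 883 | 11:55 |
| 💚 aws-ucws windows | | | 7 | 15 | 362 | 881 | 13:16 |
| 💚 azure linux | | | 1 | 17 | 267 | 967 | 8:21 |
| 💚 azure windows | | | 1 | 17 | 269 | 965 | 9:40 |
| 💚 azure-ucws linux | | | 1 | 17 | 365 | 879 | 10:13 |
| 💚 azure-ucws windows | | | 1 | 17 | 367 | 877 | 11:50 |
| 🔄 gcp linux | | 2 | 1 | 17 | 261 | 970 | 11:41 |
| 💚 gcp windows | | | 1 | 17 | 265 | 968 | 10:19 |
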
24 interesting tests: 15 SKIP, 7 KNOWN, 2 flaky

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🟨 TestAccept | 🟨 K | 🟨 K | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/grants/select | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/ssh/connection | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🔄 TestFetchRepositoryInfoAPI_FromRepo/root | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | 🔄 f | ✅ p |
| 🔄 TestFetchRepositoryInfoAPI_FromRepo/subdir | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | 🔄 f | ✅ p |

Top 29 slowest tests (at least 2 minutes):
duration env testname
7:35 aws-ucws windows TestAccept
5:47 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
5:23 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
5:13 azure-ucws windows TestAccept
5:08 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
5:02 gcp windows TestAccept
4:51 azure windows TestAccept
4:19 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:50 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:45 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:43 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:20 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:09 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:03 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:01 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:57 gcp linux TestAccept
2:55 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:53 azure linux TestAccept
2:53 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:52 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:51 azure-ucws linux TestAccept
2:50 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:48 aws-ucws linux TestAccept
2:47 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:45 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:43 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:39 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:37 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:10 aws windows TestSecretsPutSecretStringValue
