ssh: surface server errors from a running server on failed connections#5555
Open
anton-107 wants to merge 1 commit into
Conversation
The ssh server keeps its recent warning/error log lines in a bounded in-memory buffer and serves them at /logs next to /metadata. When the spawned ssh client exits with a connection-level failure (code 255), "ssh connect" fetches that endpoint and prints the server's actual errors instead of only a generic hint. The Jobs API exposes no stdout logs for a running notebook task, so this is the only way to read the server's errors while the bootstrap job is still alive.

Co-authored-by: Isaac
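For readers following the commit message, here is a minimal sketch of what a tee `slog.Handler` with a bounded buffer could look like. It is not the code from this PR: `ringBuffer`, `teeHandler`, and `newTeeLogger` are hypothetical names, the buffered line format is an assumption, and only the record's level and message are kept.

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"os"
	"sync"
)

// ringBuffer keeps at most max recent lines (hypothetical helper, not from the PR).
type ringBuffer struct {
	mu    sync.Mutex
	max   int
	lines []string
}

func (b *ringBuffer) add(line string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.lines = append(b.lines, line)
	if len(b.lines) > b.max {
		// Drop the oldest entries so the buffer stays bounded.
		b.lines = b.lines[len(b.lines)-b.max:]
	}
}

// snapshot returns a copy so callers never share the internal slice.
func (b *ringBuffer) snapshot() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return append([]string(nil), b.lines...)
}

// teeHandler forwards every record to next unchanged and additionally
// copies warning-and-above records into buf.
type teeHandler struct {
	next slog.Handler
	buf  *ringBuffer
}

func (h teeHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.next.Enabled(ctx, l)
}

func (h teeHandler) Handle(ctx context.Context, r slog.Record) error {
	if r.Level >= slog.LevelWarn {
		// Assumed line format: level and message only, attributes omitted.
		h.buf.add(fmt.Sprintf("%s %s", r.Level, r.Message))
	}
	return h.next.Handle(ctx, r)
}

func (h teeHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return teeHandler{next: h.next.WithAttrs(attrs), buf: h.buf}
}

func (h teeHandler) WithGroup(name string) slog.Handler {
	return teeHandler{next: h.next.WithGroup(name), buf: h.buf}
}

// newTeeLogger wires a stdout text handler behind the tee (hypothetical helper).
func newTeeLogger(max int) (*slog.Logger, *ringBuffer) {
	buf := &ringBuffer{max: max}
	next := slog.NewTextHandler(os.Stdout, nil)
	return slog.New(teeHandler{next: next, buf: buf}), buf
}

func main() {
	logger, buf := newTeeLogger(100)
	logger.Info("server started")
	logger.Error("failed to start SSHD process")
	for _, line := range buf.snapshot() {
		fmt.Println("buffered:", line)
	}
}
```

Because `Handle` always delegates to the wrapped handler, stdout output is unaffected; the buffer is a side copy that an HTTP handler could serve on request.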
ddc7889 to 0a76b0c
Collaborator
Commit: 0a76b0c
24 interesting tests: 15 SKIP, 7 KNOWN, 2 flaky
Top 29 slowest tests (at least 2 minutes):
rclarey approved these changes Jun 12, 2026
Changes
- The ssh server keeps its recent warning/error log lines in a bounded in-memory buffer and serves them at `/logs`, next to the existing `/metadata` endpoint, behind the same driver-proxy auth. Implemented as a tee `slog.Handler`, so all records still flow to stdout (the run-page logs) unchanged.
- When the spawned `ssh` client exits with a connection-level failure (code 255), `ssh connect` fetches `/logs` and prints the server's actual errors (e.g. `failed to start SSHD process: ... /usr/sbin/sshd: no such file or directory`). The generic "install openssh-server" hint remains as the fallback when no logs are available (e.g. older server versions without `/logs`); the fetch is best-effort.
- Extracts `newDriverProxyRequest` from `getServerMetadata`, shared by the new logs fetch.

Why
When a connection attempt fails against a healthy-looking bootstrap job (FAILURE_MODES.md Mode 1: the container lacks `sshd`, the server logs the error per connection and keeps running), the real error was unreachable from the client: the Jobs API exposes no stdout logs for a running notebook task (`GetRunOutput` requires a terminal state and `RunOutput.Logs` is unsupported for notebook tasks). The server's own HTTP service behind the driver proxy is the only channel available while the job is alive. Complements #5552, which covers the terminated-job case.

Tests
- Unit tests (…, `session` attrs, HTTP handler); `./task test-exp-ssh` and full lint pass.
- […] `The SSH connection closed unexpectedly. Recent SSH server errors:` followed by the server's `failed to start SSHD process: fork/exec ...: no such file or directory` log line. A regular `ssh connect` (no plant) still connects end-to-end.

This pull request and its description were written by Isaac.
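The client-side fallback described under Changes could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `isConnectionFailure`, `failureMessage`, the `fetchLogs` callback, and the wording of `genericHint` are all hypothetical. Only exit code 255 and the "Recent SSH server errors" header come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// OpenSSH's ssh exits with 255 when the client itself fails,
// as opposed to returning the remote command's exit status.
const sshConnectionFailure = 255

// genericHint is placeholder wording for the existing fallback hint (assumed text).
const genericHint = "The SSH connection closed unexpectedly. Is openssh-server installed?"

// isConnectionFailure reports whether err, as returned by (*exec.Cmd).Run,
// carries exit code 255.
func isConnectionFailure(err error) bool {
	var exitErr *exec.ExitError
	return errors.As(err, &exitErr) && exitErr.ExitCode() == sshConnectionFailure
}

// failureMessage builds the text shown to the user. fetchLogs stands in for
// the best-effort request to the server's /logs endpoint; any error or an
// empty result falls back to the generic hint.
func failureMessage(fetchLogs func() ([]string, error)) string {
	lines, err := fetchLogs()
	if err != nil || len(lines) == 0 {
		return genericHint
	}
	return "The SSH connection closed unexpectedly. Recent SSH server errors:\n" +
		strings.Join(lines, "\n")
}

func main() {
	// Simulate an ssh client that exits with 255 (requires a POSIX sh).
	err := exec.Command("sh", "-c", "exit 255").Run()
	if isConnectionFailure(err) {
		fmt.Println(failureMessage(func() ([]string, error) {
			return []string{"failed to start SSHD process: ..."}, nil
		}))
	}
}
```

Treating a failed or empty fetch the same way keeps the behavior unchanged against older servers that do not serve `/logs`.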