Skip to content

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703

Open
st-gr wants to merge 2 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket
Open

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703
st-gr wants to merge 2 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket

Conversation

@st-gr

@st-gr st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds an opt-in --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) to the gateway. When set, the gateway dispatches sandbox lifecycle to an out-of-tree compute driver speaking the existing compute_driver.proto contract over a Unix domain socket, instead of one of the four built-in drivers (Kubernetes / Podman / Docker / VM).

The implementation deliberately avoids adding a fifth ComputeDriverKind variant — the gateway treats out-of-tree drivers as an acquired endpoint rather than a parallel kind in the in-tree enum. This was rewritten in response to review feedback from @elezar and @drew (see "Review history" below).

Related Issue

Supersedes the closed draft #1604 (same author, same patch shape — re-opened as a non-draft after rebasing onto current main, with all commits DCO-signed and st-gr-attributed). Same direction as cheese-head's vouch in #1345 ("extend or provide new out-of-tree compute drivers to OpenShell").

No upstream tracking issue filed — happy to file one if reviewers prefer.

Changes

  • crates/openshell-core/src/config.rs: adds Config.external_compute_driver_socket: Option<PathBuf> plus with_external_compute_driver_socket(...). No new ComputeDriverKind variant. The four in-tree variants remain Copy and bare.
  • crates/openshell-server/src/cli.rs: adds --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) on RunArgs. When set, effective_single_driver returns None, short-circuiting both --drivers and the auto-detect probe so callers keyed off the in-tree enum skip the match.
  • crates/openshell-server/src/compute/mod.rs:
    • ComputeRuntime::from_driver now takes Option<ComputeDriverKind>; the four in-tree constructors wrap their kind in Some(...). Out-of-tree drivers pass None.
    • connect_external_compute_driver(&Path) produces a tonic Channel over a pre-existing UDS, mirroring the connector pattern the VM driver already uses (hyper-util::TokioIo + tokio::net::UnixStream).
    • ComputeRuntime::new_remote_external consumes the channel via the existing RemoteComputeDriver proxy and skips shutdown cleanup / managed-process supervision (the operator owns the lifecycle).
    • from_driver logs the driver_name advertised by GetCapabilities whenever driver_kind is None, so operators can confirm the gateway connected to the driver they expect (logged for diagnostics — not validated against operator-supplied input in this PR).
  • crates/openshell-server/src/lib.rs: build_compute_runtime short-circuits to connect_external_compute_driver + new_remote_external when Config.external_compute_driver_socket is set, before consulting --drivers / auto-detect. The four in-tree backends keep their existing dispatch arms unchanged.
  • architecture/compute-runtimes.md: adds an "External" row to the Runtime Summary table (activation flag, GetCapabilities.driver_name logging, operator-owned process+socket lifecycle, trust boundary = filesystem permissions on the UDS) and a matching row in Supervisor Delivery.

No compute_driver.proto changes. No new workspace dependencies (hyper-util and tower are already in scope; tokio already exposes UnixStream under cfg(unix)). No security-model changes.

Testing

  • mise run pre-commit passes — not run end-to-end (mise not on author's dev environment), but the equivalent rust pieces verified independently in a rust:1.95.0-slim-bookworm dev container with the workspace's apt deps: cargo fmt --all -- --check clean; cargo clippy --no-deps --workspace --all-targets -- -D warnings clean; cargo test --workspace --lib --bins passes (incl. all 769 openshell-server lib tests).
  • Unit tests added/updated — 3 new CLI-arg tests in crates/openshell-server/src/cli.rs::tests: compute_driver_socket_flag_populates_run_args, compute_driver_socket_overrides_drivers_flag, compute_driver_socket_reads_from_env_var. The four existing in-tree compute-driver tests (configured_compute_driver_*) continue to pass unchanged.
  • E2E tests added/updated — none — running an External driver in CI would require shipping a stub driver binary; deferring until at least one in-tree consumer wants to test against it. The patch is exercised in production on SAP BTP via the st-gr/openshell-driver-kyma reference driver running against a Gardener-managed Kyma cluster since 2026-05-28 (sandbox CR creation, openshell sandbox exec, openshell sandbox upload/download all work).

Checklist

  • Follows Conventional Commits (feat(core):, feat(server):, docs(compute):)
  • Commits are signed off (DCO)
  • Architecture docs updated — added "External" row to architecture/compute-runtimes.md::Runtime Summary and ::Supervisor Delivery. Per-runtime implementation-notes section is intentionally NOT extended for external drivers because they ship their own documentation; the reference implementation lives at st-gr/openshell-driver-kyma.

Review history

The original revision of this PR added ComputeDriverKind::External(PathBuf) as a fifth enum variant. @elezar and @drew both flagged this as the wrong shape — out-of-tree drivers are an acquired endpoint rather than a parallel kind. The current revision implements that direction:

  • No External enum variant. ComputeDriverKind is unchanged; out-of-tree dispatch lives on Config.external_compute_driver_socket and goes through a separate construction path that produces a ComputeRuntime with driver_kind = None.
  • (name, socket) tuple syntax is deferred. The flag remains a single path. GetCapabilities.driver_name is logged for diagnostics, not validated against operator input. This matches @drew's suggestion to defer name-keying / <name>@<socket> to a follow-up that aligns with the broader name-keyed driver registry direction (rfc-0006: add driver config passthrough proposal #1589).
  • Construction / consumption split. The VM driver still spawns its own subprocess; external drivers consume an operator-owned channel. Both paths share RemoteComputeDriver for the actual proto consumption — the only divergence is at construction.

Happy to revisit any of those choices if reviewers prefer a different cut point.

Notes for reviewers

Out of scope (intentional):

  • No new in-tree compute driver implementations. This PR only adds the activation flag + connector + runtime split. Operators bring their own driver process and point --compute-driver-socket at its socket path.
  • No compute_driver.proto changes. The existing protocol is reused verbatim — the same one already used by the in-tree VM driver to talk to its tonic peer.
  • No security model changes. UDS endpoint security is the operator's responsibility (filesystem permissions, sidecar isolation), matching the --drivers vm posture.

Operator context:

The forked gateway image at ghcr.io/st-gr/openshell-gateway (carrying these patches plus an in-tree Dockerfile.gateway) has been driving production-shaped Kubernetes clusters with the st-gr/openshell-driver-kyma compute driver since 2026-05-28. The flag has needed zero changes since it landed — the API shape is stable. Merging this PR lets the driver consume the unmodified upstream gateway image, retiring the fork.

@st-gr st-gr requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 3, 2026 05:27
@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

I have read the DCO document and I hereby sign the DCO.

@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

recheck

@drew drew self-assigned this Jun 3, 2026
@drew

drew commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

/ok to test 8689967

@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 8689967 to 464018b Compare June 3, 2026 08:18
@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Pushed a doc_markdown fix (added backticks around compute_driver.proto and OPENSHELL_COMPUTE_DRIVER_SOCKET in the External variant's doc comment, folded into the originating feat(core): commit via autosquash).

/ok to test 464018b7 — superseded by a second force-push (redundant_pub_crate fix); see the follow-up comment below for the current SHA.

st-gr added a commit to st-gr/OpenShell that referenced this pull request Jun 3, 2026
NVIDIA/OpenShell's Cargo.toml has

    [workspace.lints.clippy]
    pedantic = "warn"
    nursery = "warn"
    all = "warn"

so `cargo clippy --workspace -- -D warnings` escalates pedantic lints
(including `doc_markdown`) to errors. Their CI uses this as part of
`mise run pre-commit`. The fork's existing `release-gateway.yml` only
runs `docker build` (which calls `cargo build --release` — no clippy),
so the same lints weren't enforced on the fork.

PR NVIDIA#1703 caught us with two `doc_markdown` violations
in `crates/openshell-core/src/config.rs:56` (`compute_driver.proto` and
`OPENSHELL_COMPUTE_DRIVER_SOCKET` missing backticks in the External
variant's doc comment). The fork's CI didn't catch it; NVIDIA's did.

This workflow runs on push to main + `feat/**` branches and on
workflow_dispatch. It pins to Rust 1.95.0 (matches upstream mise.toml)
and installs the same native deps `Dockerfile.gateway` needs
(libclang-dev + libz3-dev for transitive bindgen + z3-sys).

Workflow is light: ubuntu-latest runner, ~10-15 min cold, ~3-5 min
with the Swatinem/rust-cache hot. Path-filtered so doc-only changes
don't trigger it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 464018b to 1c6663d Compare June 3, 2026 09:01
@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Force-pushed once more — caught a second pedantic clippy lint locally that the previous run didn't get to (redundant_pub_crate on connect_external_compute_driver). Folded into the originating feat(server): wire ... commit via autosquash so the PR stays at 5 commits.

New HEAD: 1c6663d. Verified clean with the full-workspace check NVIDIA's CI runs:

      cargo fmt --all -- --check
      cargo clippy --workspace --all-targets -- -D warnings

Could I get a re-vet with /ok to test 1c6663d3?

Apologies for the round-trip — I've added a rust-lint workflow to my fork to mirror your mise run pre-commit so future pushes catch these locally before they reach your runners.

elezar
elezar previously requested changes Jun 3, 2026

@elezar elezar left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking ahead to changes such as #1589 and support for multiple drivers in future, does it make sense to formulate this as a "named driver endpoint" where a user would supply a driver name and a socket path? Especially given that the in-tree vm driver is already an example of such a driver and we would only be extending the existing functionality.

In practice, the gateway would select a driver by name (instead of a kind enum) and one could use GetCapabilities.driver_name to validate the user-provided name if required.

@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

@elezar
I read through #1589 / RFC 0005. The named-endpoint shape and GetCapabilities.driver_name validation both fit cleanly. Two ways to implement your suggestion in this PR:

A. Minimal add. Keep External(PathBuf) and add a paired --compute-driver-name=<name> flag. After the UDS connect, the gateway calls GetCapabilities, compares response.driver_name to the supplied name, and errors loudly on mismatch. The enum surface stays flat. Could serve as a stepping stone; when #1589's registry lands, it can replace the enum entirely without re-migrating this code.

B. Restructured External. Extend the variant to External { name: String, socket: PathBuf }. More visible signal of the multi-driver direction at the type level, but touches FromStr / Display / their tests.

Replacing ComputeDriverKind itself with a name-keyed registry (your second paragraph's "select a driver by name instead of a kind enum") reads to me as the actual #1589 implementation. I can work on that as a follow-up once the RFC has a concrete shape, but that's a much bigger refactor than this PR.

I'd lean (A) because it doesn't pre-commit the data model to a shape #1589 might want to rewrite, and it costs ~30 LOC. Either is fine, though. Let me know which you'd prefer to see here.

@elezar

elezar commented Jun 3, 2026

Copy link
Copy Markdown
Member

As a quick follow-up: What would it look like if we first migrate to implementing the vm driver as a specific-instance of a NamedDriver{ name: String, socket: PathBuf } where name == "vm" and then generalize from here?

Also note, that although #1589 is mentioned here as a motivator for restructuring how drivers are managed from a gateway perspective, moving to a name-keyed registry for managing drivers is not specifically related to the driver-specific config.

@st-gr

st-gr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Right, separating from #1589 makes sense, the named-driver registry is its own concern.

Confirmed against the #1051 OpenShell Drivers roadmap: Its stated intent of "the core OpenShell team maintains a small deployment footprint and third parties may develop their own drivers" is exactly the direction your NamedDriver proposal serves. So aligning this PR there feels right; I won't propose alternatives that pull Kyma (or any other operator-specific driver) in-tree.

Here's what Phase A could look like: A pure refactor that pulls today's Vm arm into a generalized shape, no behavior change for existing --drivers vm operators:

    pub struct NamedDriver {
        pub name: String,
        pub socket: PathBuf,
        /// Set when the gateway owns the driver's lifecycle (current
        /// `vm` behavior - `compute::vm::spawn` returning a
        /// `ManagedDriverProcess`). When `None`, the operator runs the
        /// driver out-of-process and supplies the socket path.
        pub spawn: Option<DriverSpawnConfig>,
    }

    pub enum ComputeDriverKind {
        Kubernetes,
        Docker,
        Podman,
        Named(NamedDriver),
    }

Maps cleanly to the existing wiring at crates/openshell-server/src/compute/mod.rs:

  • --drivers vm translates to Named(NamedDriver { name: "vm", socket: <generated under XDG_STATE>, spawn: Some(default_vm_spawn()) }). The CLI front-end keeps the vm alias for backward compat.
  • ManagedDriverProcess (lines 100-128) stays as-is - it's already the lifecycle holder for spawned drivers; it just gets reached via the Named arm instead of being conditional on ComputeDriverKind::Vm.
  • new_remote_vm (line 388) becomes new_remote_named - same signature, same Option<Arc<ManagedDriverProcess>> field.
  • GetCapabilities.driver_name validation lands here: gateway compares Named.name against the proto response and errors at startup on mismatch.

Then Phase B (this PR) becomes additive:

  • --compute-driver=<name>@<path> (or two flags) → Named(NamedDriver { name, socket, spawn: None }).
  • All the dispatch, lifecycle-watcher, and capability-check logic is shared with the VM path.

Ordering: I'd implement Phase A first as a precursor PR (pure refactor, no behavior change since --drivers vm still works through the alias), then rebase this PR on top. Each piece reviews more cleanly on its own. Happy to bundle into one PR if you'd rather see the full motivation in
a single diff.

Before I start coding, anything about the NamedDriver shape you'd want different? Specifically:

  • The spawn: Option<DriverSpawnConfig> distinction (gateway-managed vs operator-managed) - does that match how you'd want the boundary drawn?
  • name: String vs name: &'static str for the in-tree vm case - string keeps the field uniform with operator-supplied names; static would mean a separate type.
  • Whether Kubernetes / Docker / Podman should follow into Named(...) in this same precursor PR or stay as enum arms (they don't speak compute_driver.proto over UDS - they're in-tree adapters - so probably out of scope).

Once we converge on the shape, Phase A is straightforward.

@elezar

elezar commented Jun 4, 2026

Copy link
Copy Markdown
Member

Thanks, this is the right direction, but I'd separate the pieces slightly differently.

For Phase A, I don't think spawn should be part of the named driver definition. A named endpoint should describe how the gateway consumes a remote driver: { name, socket }. Whether that socket came from a gateway-spawned VM driver or from an operator-managed external process is a construction/lifecycle concern.

For the VM refactor in Phase A, the shape I'd like to see is:

  1. Resolve --drivers vm.
  2. Run the VM-specific construction path: read VM config, prepare the socket path, spawn openshell-driver-vm, and wait for readiness.
  3. Treat the result as an acquired named endpoint plus a lifecycle guard.
  4. From that point on, use the generic endpoint path: connect to the socket, wrap the channel in RemoteComputeDriver, and pass it into ComputeRuntime as a normal SharedComputeDriver.

Conceptually:

VM config -> spawn VM driver -> acquired named endpoint
acquired named endpoint -> RemoteComputeDriver -> ComputeRuntime

For Phase B, an external driver should enter at the same acquired-endpoint boundary:

external config -> existing socket -> acquired named endpoint
acquired named endpoint -> RemoteComputeDriver -> ComputeRuntime

That keeps driver construction separate from driver consumption. The VM lifecycle guard can own process/socket cleanup; an external driver can use a no-op lifecycle guard. ComputeRuntime should not need to care which acquisition path produced the driver.

I'd also prefer to avoid a special External enum variant if possible. The <name>@<socket> shape you proposed seems like a good fit here, especially if we expect to support multiple configured drivers later. With that said, I'd like @drew or @johntmyers to weigh in here in terms of the UX of this. We may not want to overload the existing flag, but then we need to consider how we want to handle multiple drivers and their mapping to endpoints.

Reserved names like vm, docker, podman, and kubernetes can keep their current meaning when supplied bare. Endpoint-based drivers can use the explicit endpoint form:

--drivers vm
--drivers vm@/path/to/compute-driver.sock
--drivers kyma@/path/to/compute-driver.sock

In that model, bare vm keeps the managed VM default path. vm@<socket> uses the VM construction path but overrides the managed socket path (if we wanted to allow this). A non-reserved name such as kyma@<socket> is an unmanaged out-of-tree endpoint-based driver. A non-reserved bare name should error until there is some other way to resolve its endpoint.

For the initial parser, I'd keep the rules narrow: parse the current comma-delimited --drivers entries independently, split endpoint entries on the first @, and reject empty names or empty socket paths. docker@..., podman@..., and kubernetes@... can remain reserved/errors for now unless we explicitly add endpoint-backed versions of those drivers.

One follow-up to call out is the gateway TOML mapping. Today [openshell.gateway].compute_drivers is the structured driver selection list, while [openshell.drivers.<name>] is intentionally driver-owned/raw TOML. If we accept <name>@<socket> on the CLI, we should decide whether the TOML uses the same string form:

[openshell.gateway]
compute_drivers = ["kyma@/run/openshell/kyma.sock"]

or a more structured form under the driver table:

[openshell.gateway]
compute_drivers = ["kyma"]

[openshell.drivers.kyma]
endpoint = "/run/openshell/kyma.sock"

I don't think that has to block Phase A, but it should be discussed and clarified before Phase B lands so CLI/env/TOML do not grow incompatible shapes.

Name matching via GetCapabilities.driver_name can be deferred as a later hardening step. The initial refactor only needs to establish the construction/consumption split and the common endpoint-based consumption path.

@st-gr

st-gr commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @elezar — agree the named-driver-endpoint framing is the right shape, and lines up with #1589's name-keyed driver_config. Concrete proposal for this PR:

  • ComputeDriverKind::External(PathBuf)ComputeDriverKind::External { name: String, socket: PathBuf }. Other variants stay bare.
  • CLI: add --compute-driver-name=<name> (env OPENSHELL_COMPUTE_DRIVER_NAME) paired with the existing --compute-driver-socket. Both required when either is set; the pair pins External { name, socket } and skips --drivers and auto-detect.
  • On startup, after the tonic channel connects, the gateway calls GetCapabilities and fails fast if response.driver_name != name.

One scope question before I push: should I generalise the dispatch beyond External in this PR — e.g. fold Vm into the same name-keyed path so the registry is uniform — or keep that for the #1589 follow-up and leave the four in-tree variants as-is for now? My read is the latter (this PR stays narrow; #1589 / a successor RFC handles the registry), but happy to expand if you'd prefer it land in one go.

@drew

drew commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

I'd also prefer to avoid a special External enum variant if possible. The <name>@<socket> shape you proposed seems like a good fit here, especially if we expect to support multiple configured drivers later.

+1 to avoiding the External enum variant. I like the <name>@<socket> pattern, though I do worry a bit about overloading the string.

We will eventually want to support multiple drivers, and possibly different versions of the same driver.

That said, can we add the <name>@<socket> convention in a followup PR without breaking contracts? Would be good to see how external drivers gets used before including including additional overloads to the path. In the meantime we can rely on GetCapabilities.driver_name and assume it will be unique.

@drew

drew commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

@st-gr

One scope question before I push: should I generalise the dispatch beyond External in this PR — e.g. fold Vm into the same name-keyed path so the registry is uniform — or keep that for the #1589 follow-up and leave the four in-tree variants as-is for now? My read is the latter (this PR stays narrow; #1589 / a successor RFC handles the registry), but happy to expand if you'd prefer it land in one go.

+1 to your read. Let's leave the four in-tree for now.

@st-gr

st-gr commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

@elezar @drew — reading the convergence:

This PR:

  • Extract the construction/consumption split as @elezar described: VM and external both produce (channel, advertised_name) via dedicated acquisition paths; consumption (RemoteComputeDriverComputeRuntime::from_driver) is shared. VM construction keeps its ManagedDriverProcess lifecycle guard on its side of the boundary — endpoint type stays slim, no spawn field. UX for --drivers vm unchanged.
  • Drop ComputeDriverKind::External(PathBuf). Restore upstream Copy + const fn as_str. External driver is signalled by a separate Config.compute_driver_socket: Option<PathBuf> that takes precedence over compute_drivers in build_compute_runtime.
  • Keep --compute-driver-socket=<path> (env OPENSHELL_COMPUTE_DRIVER_SOCKET) as the single external flag — no <name>@<socket>, no user-supplied name. Gateway calls GetCapabilities once after connect and logs the advertised driver_name (no validation, per @drew).
  • Bare --drivers vm/docker/podman/kubernetes unchanged.

Deferred to follow-up(s): <name>@<socket> syntax + reserved-name discipline, multi-driver / multi-version, TOML shape choice, GetCapabilities.driver_name matching against a supplied name.

Branch will be rewritten as: (1) refactor extracting the boundary, (2) wire --compute-driver-socket through it, (3) drop External + revert the enum changes, (4) refresh the arch doc. Will start coding and force-push a new revision; flag if anything's misread.

@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 1c6663d to be90905 Compare June 5, 2026 11:12
@st-gr

st-gr commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

@elezar @drew — pushed the rewrite. PR body has been refreshed end-to-end to match the new shape. Summary of what changed since your last comments:

  • No External enum variant. ComputeDriverKind is unchanged. Out-of-tree dispatch now lives on Config.external_compute_driver_socket: Option<PathBuf> and goes through a separate construction path that produces a ComputeRuntime with driver_kind: None.
  • Construction / consumption split. ComputeRuntime::from_driver now takes Option<ComputeDriverKind>; the four in-tree constructors wrap their kind in Some(...), the new new_remote_external passes None. Both paths share the existing RemoteComputeDriver proxy for proto consumption.
  • (name, socket) tuple deferred per @drew's guidance — flag stays a single path, GetCapabilities.driver_name is logged for diagnostics only (no validation against operator input). Name-keying / <name>@<socket> is a follow-up that aligns with rfc-0006: add driver config passthrough proposal #1589.
  • Workspace verification: cargo fmt --check, cargo clippy --workspace --all-targets -D warnings, and cargo test --workspace --lib --bins all clean (incl. all 769 openshell-server lib tests).

Branch is force-pushed; commits are squash-friendly (4 small, named per conventional-commits). Happy to revisit any cut point if you'd prefer a different shape.

@st-gr

st-gr commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

I have read the DCO document and I hereby sign the DCO.

Comment thread crates/openshell-server/src/lib.rs Outdated
tracing_log_bus: TracingLogBus,
supervisor_sessions: Arc<supervisor_session::SupervisorSessionRegistry>,
) -> Result<ComputeRuntime> {
if let Some(socket_path) = config.external_compute_driver_socket.as_deref() {

@elezar elezar Jun 15, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we discussed that the current in-tree VM driver is an example of a "remote socket driver", I don't think that we should introduce a new conditional branch for this special case.

I am prepping a PR from a fork that would propose additional changes. See st-gr#1

Comment thread architecture/compute-runtimes.md Outdated
@st-gr

st-gr commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@elezar — adopted your refactor PR (st-gr/OpenShell#1) on top of the rebased PR #1703 branch. Branch is now 0925729b + 2979bf9c; new HEAD 2979bf9c carries your feat(server): model remote compute driver endpoints commit verbatim, with authorship preserved.

This addresses both inline comments:

  • lib.rs:712 — VM and out-of-tree remote drivers now route through the same acquired-endpoint path (compute::vm::spawnEndpointRemoteComputeDriverComputeRuntime). The previous External conditional branch is gone.
  • architecture/compute-runtimes.md:37 — renamed "External" → "Remote" throughout.

Verified inside rust:1.95-slim:

  • cargo fmt --all -- --check
  • cargo check --workspace --tests
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test -p openshell-core: 176/176 ✅
  • cargo test -p openshell-server: 822/822 ✅
  • No conflict against current upstream/main (git merge-tree exits clean).

The branch now sits at 5 commits ahead of upstream/main. Ready for re-review and /ok to test 2979bf9c when convenient.

@johntmyers johntmyers assigned elezar and unassigned drew Jun 16, 2026
@elezar elezar force-pushed the feat/external-compute-driver-socket branch from 2979bf9 to 02d9b31 Compare June 17, 2026 13:09
@elezar

elezar commented Jun 17, 2026

Copy link
Copy Markdown
Member

Hi @st-gr. I spent some time on this again today. I rebased the changes onto the current main as a single commit (keeping shared authorship) and also added an initial in-tree test case for remote drivers. There are some follow-ups that I would like to see, but I don't think that they block this PR further. Most notably is #1950 which discusses adding end-to-end tests for this driver path, but probably requires more work in general.

There is also #1949 which should further cleanup driver aquisition / configuration, but once again is not a blocker.

If you could verify the current changes against your remote driver, we could look at getting this integrated. We would also need to update the PR description. Did you want to do that, or should I?

@elezar

elezar commented Jun 17, 2026

Copy link
Copy Markdown
Member

/ok-to-test 02d9b31

@st-gr

st-gr commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@elezar — verified against my Kyma compute driver. Both the cold-start dispatch and live sandbox traffic work cleanly with 02d9b319.

Build (cargo zigbuild --target x86_64-unknown-linux-gnu.2.31 --features bundled-z3 -p openshell-server --bin openshell-gateway):

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test -p openshell-core -p openshell-server --lib → 1,083 tests, all green (includes your new test_support.rs UDS lifecycle coverage).
  • Image: ghcr.io/st-gr/openshell-gateway@sha256:343630cda459bcd27507c035c521032e277c36ea2d544d29ec0b14178d30bbaa.

Deploy (Helm upgrade on Gardener-managed Kyma cluster, 4 amd64 workers; the existing ods release rolled gateway.image.tag to the new digest with --reuse-values):

Required one chart-side change in openshell-driver-kyma: pass --drivers kyma alongside the existing --compute-driver-socket, since the refactored gateway requires the explicit driver name (reserved built-ins reject sockets; kyma is non-reserved). One-line addition in the deployment template.

Gateway log on first boot:

INFO openshell_server: Using compute driver driver=kyma
INFO openshell_server: Using remote compute driver endpoint driver=kyma socket=/var/run/openshell-driver.sock
INFO openshell_server::compute: Compute driver connected configured_driver=kyma advertised_driver=kyma in_tree=false

That's the exact shape your refactor advertises — configured/advertised name comparison via GetCapabilities, in_tree=false correctly identifying Kyma as out-of-tree. The pre-existing persistent sandbox kept working through the rollout: GetSandboxConfig, ConnectSupervisor, GetSandboxProviderEnvironment, GetInferenceBundle all returning 200 with sub-5 ms latency. Supervisor session accepted, sandbox-side claude-tui resumed without restart.

Follow-ups in this fork's chart (openshell-driver-kyma) — I'll land these after #1703 merges so the chart and upstream stay in sync:

I'm happy for you to do the PR description update — your wording on the construction/consumption split and the named-endpoint framing is the canonical version, and you've been driving the refactor. From my side, this is ready to merge once you've updated the description.

@elezar elezar left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming @st-gr. I think this looks good now.

@drew any thoughts from your side?

@elezar elezar added the test:e2e Requires end-to-end coverage label Jun 23, 2026
@github-actions

Copy link
Copy Markdown

Label test:e2e applied for 02d9b31. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

st-gr and others added 2 commits June 23, 2026 16:20
Add named remote compute driver endpoint support to the gateway.

Remote drivers are selected by a non-reserved compute driver name and either a CLI/env socket endpoint or [openshell.drivers.<name>].socket_path. The VM driver now enters ComputeRuntime through the same acquired remote endpoint path, while Docker, Podman, and Kubernetes retain their in-process drivers.

Require --drivers/OPENSHELL_DRIVERS when pairing an ad-hoc socket endpoint so the socket does not imply a magic driver name, and keep reserved in-tree names unavailable for unmanaged socket endpoints.

Co-authored-by: Evan Lezar <elezar@nvidia.com>

Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar added the gator:approval-needed Gator completed review; maintainer approval needed label Jun 23, 2026
@elezar

elezar commented Jun 23, 2026

Copy link
Copy Markdown
Member

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete.

Validation: this PR is project-valid because it implements a maintainer-reviewed extension-driver path for OpenShell compute drivers using named remote endpoints, with the built-in driver namespace reserved and user-facing docs updated.
Review: no blocking findings remain. A bounded independent review found one nonblocking follow-up risk: unmanaged UDS endpoints could be hardened by validating socket type, symlink status, and unsafe permissions before connecting.
Docs: updated in docs/reference/gateway-config.mdx and docs/reference/sandbox-compute-drivers.mdx; no docs navigation change is needed because both pages are already under the reference section.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and required gate statuses are passing for head 02d9b3191f129c99809d1e166fbdf48a512da26c.
E2E: test:e2e is applied and OpenShell / E2E is passing. GPU E2E is not required for this PR.

Human maintainer approval or merge decision is now required.

@elezar elezar force-pushed the feat/external-compute-driver-socket branch from 02d9b31 to b6ae186 Compare June 23, 2026 18:13
@elezar

elezar commented Jun 23, 2026

Copy link
Copy Markdown
Member

Rebased onto latest origin/main and ran local pre-commit successfully:

PATH=/Library/Developer/CommandLineTools/usr/bin:$PATH mise run -j 1 pre-commit

@elezar elezar added gator:watch-pipeline Gator is monitoring PR CI/CD status and removed gator:approval-needed Gator completed review; maintainer approval needed labels Jun 23, 2026
@elezar

elezar commented Jun 23, 2026

Copy link
Copy Markdown
Member

gator-agent

Re-check After Maintainer Update

I re-evaluated latest head b6ae1867468ac3fd73923a5d3d24272e8b846627 after @elezar's 2026-06-23 18:13 UTC update that the branch was rebased onto latest origin/main and local pre-commit completed.

Disposition: partially resolved, but not ready for gator:approval-needed at the current head.

Remaining items:

  • Review remains approved for the current head and no blocking review findings are active.
  • Still pending: OpenShell / Branch Checks, OpenShell / E2E, and OpenShell / Helm Lint are pending for this head, so the prior approval handoff for 02d9b3191f129c99809d1e166fbdf48a512da26c is stale.
  • OpenShell / GPU E2E is not required for this PR and is green.

Next state: gator:watch-pipeline

@elezar

elezar commented Jun 23, 2026

Copy link
Copy Markdown
Member

/ok to test b6ae186

@johntmyers johntmyers added this to the OpenShell Beta milestone Jun 23, 2026
@st-gr

st-gr commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@elezar — heads-up: the three red checks at b6ae1867 are a transient ghcr.io auth flake, not a regression from the rebase.

Root cause in build-gateway / Build gateway (amd64) (run):

Error response from daemon: Get "https://ghcr.io/v2/":
  Get "https://ghcr.io/token?account=copy-pr-bot%5Bbot%5D&...":
  context deadline exceeded (Client.Timeout exceeded while awaiting headers)

The two Core E2E result / OpenShell / E2E failures are downstream — the E2E jobs can't proceed without the gateway image artifact.

A "Re-run failed jobs" on run 28048603052 should clear all three. Apologies for the round-trip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gator:watch-pipeline Gator is monitoring PR CI/CD status test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants