feat: guest command exec over vsock (Hyper.exec/3)#49

Open
markovejnovic wants to merge 29 commits into main from feat/guest-exec-vsock

Conversation

@markovejnovic

Contributor

Guest command exec over vsock

Adds Hyper.exec(vm, argv, opts) — run a one-shot command inside a running guest microVM over virtio-vsock and get back {stdout, stderr, exit_code}. Also fixes guest boot on bare OCI rootfses (the alpine image was spinning on a missing /sbin/openrc) by replacing the image init with our own PID-1 agent.

Design spec + plan live under docs/superpowers/ (gitignored scratch).

How it works

  1. hyper-guest-agent (new native/guest-agent/ static musl Rust crate): runs as guest PID 1 (init=/hyper-init). It mounts /proc, /sys, /dev, reaps orphans, binds AF_VSOCK:1024, and fork/execs one command per connection. Because it is init, any OCI image boots regardless of its own init system.
  2. Provisioning: mix guest_agent.install builds the two-arch binaries; a per-arch provider resolves them, mirroring the vmlinux/suidhelper patterns.
  3. Rootfs bake: OciLoader stages the agent to /hyper-init before mke2fs -d (1-layer image; delta-layer authoring deliberately deferred to its own spec).
  4. Boot: BootSpec adds init=/hyper-init.
  5. vsock device + grant: the Configuring state PUTs /vsock (cid 3, vsock.sock) then hands the firecracker-created host UDS to the node uid via a new chroot-jail grant-vsock setuid op (shares a factored grant.rs confinement helper with grant-api, which was refactored onto it with no behavioral drift). InstanceStart is gated behind the grant, retry bounded by the existing boot_deadline.
  6. Host client: Hyper.Node.FireVMM.Exec does firecracker's CONNECT 1024\n handshake + length-prefixed framing; Hyper.exec/3 ties it together.
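
The grant gating in step 5 amounts to a retry loop bounded by a deadline. The host-side state machine is Elixir and its code is not shown in this description, so the following is only a minimal Rust sketch of the pattern; `Attempt` and `retry_until` are hypothetical names introduced for illustration.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Outcome of one attempt (hypothetical type for this sketch).
enum Attempt<T> {
    Ready(T),
    Pending,
}

/// Retries `op` every `interval` until it is ready or `deadline` has elapsed.
/// Returns `None` once the deadline passes without success.
fn retry_until<T>(
    deadline: Duration,
    interval: Duration,
    mut op: impl FnMut() -> Attempt<T>,
) -> Option<T> {
    let start = Instant::now();
    loop {
        if let Attempt::Ready(value) = op() {
            return Some(value);
        }
        if start.elapsed() >= deadline {
            return None;
        }
        sleep(interval);
    }
}

fn main() {
    // Simulated grant: the socket "appears" on the third attempt.
    let mut calls = 0;
    let granted = retry_until(Duration::from_secs(1), Duration::from_millis(1), || {
        calls += 1;
        if calls >= 3 {
            Attempt::Ready(calls)
        } else {
            Attempt::Pending
        }
    });
    println!("granted after {:?} attempts", granted);
}
```

Because the loop always makes at least one attempt before checking the clock, a zero deadline still gives the operation one chance to succeed.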

Wire protocol

  • Request: u32 BE length + JSON {argv, env, cwd?, timeout_ms?} (env as a JSON object).
  • Response: i32 BE exit_code + u32 BE stdout_len + stdout + u32 BE stderr_len + stderr. Exec failure → 127.
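
The framing in this section can be sketched in a few lines of stdlib-only Rust. This is an illustration of the byte layout listed above, not the crate's actual code; `Response`, `encode_request`, `take`, `take_u32` and `decode_response` are names assumed for the sketch.

```rust
use std::convert::TryInto;

/// Decoded response frame: exit code plus captured stdio.
#[derive(Debug, PartialEq)]
struct Response {
    exit_code: i32,
    stdout: Vec<u8>,
    stderr: Vec<u8>,
}

/// Request frame: u32 big-endian length, then the JSON payload bytes.
fn encode_request(json: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(4 + json.len());
    frame.extend_from_slice(&(json.len() as u32).to_be_bytes());
    frame.extend_from_slice(json);
    frame
}

/// Splits `n` bytes off the front of `buf`, or returns None if it is too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let whole: &'a [u8] = *buf;
    if whole.len() < n {
        return None;
    }
    let (head, tail) = whole.split_at(n);
    *buf = tail;
    Some(head)
}

/// Reads one big-endian u32 length prefix.
fn take_u32(buf: &mut &[u8]) -> Option<u32> {
    Some(u32::from_be_bytes(take(buf, 4)?.try_into().ok()?))
}

/// Response: i32 BE exit_code, u32 BE stdout_len, stdout, u32 BE stderr_len, stderr.
fn decode_response(mut buf: &[u8]) -> Option<Response> {
    let exit_code = i32::from_be_bytes(take(&mut buf, 4)?.try_into().ok()?);
    let stdout_len = take_u32(&mut buf)? as usize;
    let stdout = take(&mut buf, stdout_len)?.to_vec();
    let stderr_len = take_u32(&mut buf)? as usize;
    let stderr = take(&mut buf, stderr_len)?.to_vec();
    Some(Response { exit_code, stdout, stderr })
}

fn main() {
    let frame = encode_request(br#"{"argv":["uname","-a"],"env":{}}"#);
    println!("request frame is {} bytes", frame.len());
    // Hand-built response: exit 0, stdout "hi", empty stderr.
    let wire = [0, 0, 0, 0, 0, 0, 0, 2, b'h', b'i', 0, 0, 0, 0];
    println!("{:?}", decode_response(&wire));
}
```

The exit code is signed, so a value such as -1 travels as `FF FF FF FF`; a truncated frame decodes to `None` instead of panicking.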

Resource lifecycle

No new host-side process to supervise and no new leak surface: the agent is PID 1 inside the VM (dies with it); the granted UDS lives in the jail and is removed by the existing teardown (cgroup.kill + jail removal). Deliberate, given the recently-fixed reaper/dm fragility (#48).

Review & quality

Built subagent-driven with a per-task spec+quality review gate and a whole-branch opus review. That whole-branch review caught two cross-task contract bugs the per-task reviews structurally couldn't — both fixed and re-confirmed:

  • grant-vsock required the leaf to end in .vsock, but the real socket is vsock.sock → would have boot-looped every VM. Now an exact-name gate (mirrors grant-api; also closes an evil.vsock hole).
  • the agent decoded env as a JSON array but the client sends an object → every :env exec would fail. Now a BTreeMap, pinned by a cross-language fixture test that feeds the exact bytes the Elixir client emits.

Gate (HEAD)

  • Elixir: compile --warnings-as-errors, format, credo --strict (exit 0), 266 tests (90 properties), dialyzer 0 errors.
  • Rust: suidhelper 134 tests, guest-agent 6 tests, clippy -D warnings + fmt clean, x86_64-musl release builds.

Not yet done

Live E2E — this is static-gate-green but has not been booted on a real VM yet. Before/after merge, on the VM host: mix guest_agent.install + mix suidhelper.install (re-stamps the bumped suidhelper), then Hyper.exec(vm, ["uname","-a"]) should return {:ok, %{exit_code: 0, stdout: "Linux ..."}} with no openrc spam in the boot log.

Implement exec::run/1 and exec::serve_one/2 in src/exec.rs.
Replace tests/exec.rs placeholder with three example tests (captures
stdout+exit, missing command → 127+stderr, honours cwd).
Extend tests/wire.rs request_roundtrips to generate env pairs and
assert env survives the round-trip (carry-over from Task 1 review).

Add the `chroot-jail grant-vsock --socket <path>` op and its Elixir
wrapper `Hyper.SuidHelper.ChrootJail.grant_vsock/1`. Mirrors `grant-api`
exactly: confine the socket path to the jail base via SafePath, walk
O_NOFOLLOW from jail_base, require S_IFSOCK + no-follow, chown to caller,
chmod 0660, chgrp+chmod 0710 on the parent root dir for traversal. The
leaf must end with `.vsock` (vs the fixed `api.socket` name for grant-api).

Factors the shared confinement+chown+chmod logic (GrantOut, constants,
walk_to, grant_to_caller) into a new `grant.rs` module called by both
ops; grant_api.rs re-exports GrantOut for backward compatibility.

14 refusal + pending + grant tests in tests/tools/chroot_jail_grant_vsock.rs
(mirrors grant_api's suite, security refusal contracts first).

Implements Hyper.Node.FireVMM.Exec.run/3: connect the firecracker
vsock UDS, complete the CONNECT 1024\n / OK handshake, send the
length-prefixed JSON request frame, and decode the signed-32 exit-code
+ two length-prefixed stdio frames. recv_exactly loops defensively until
all bytes arrive; recv_line accumulates byte-by-byte for the handshake.
Retries on :econnrefused/:enoent/non-OK until the connect_timeout (5 s
default) elapses, then returns {:error, :agent_unavailable}.
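
The defensive `recv_exactly` loop exists because a single read on a stream socket may return fewer bytes than requested. The client in this commit is Elixir; the sketch below shows the same idea in Rust, with a hypothetical `Trickle` reader that hands out one byte per call (similar in spirit to the chunked-response test). Rust's standard library already offers `Read::read_exact`; the loop is spelled out here only to make the accumulation visible.

```rust
use std::io::{self, Read};

/// Reads exactly `n` bytes, looping because one `read` may return fewer.
fn recv_exactly<R: Read>(src: &mut R, n: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; n];
    let mut filled = 0;
    while filled < n {
        match src.read(&mut buf[filled..])? {
            0 => {
                return Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "peer closed mid-frame",
                ))
            }
            got => filled += got,
        }
    }
    Ok(buf)
}

/// Test double (hypothetical) that yields one byte per `read` call.
struct Trickle<'a>(&'a [u8]);

impl Read for Trickle<'_> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if self.0.is_empty() || out.is_empty() {
            return Ok(0);
        }
        out[0] = self.0[0];
        self.0 = &self.0[1..];
        Ok(1)
    }
}

fn main() -> io::Result<()> {
    let mut src = Trickle(b"\x00\x00\x00\x02hi");
    let len = recv_exactly(&mut src, 4)?;
    let body = recv_exactly(&mut src, 2)?;
    println!("len prefix {:?}, body {:?}", len, body);
    Ok(())
}
```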

Tests: full round-trip against a fake AF_UNIX server + a chunked-response
variant that sends the response one byte at a time to exercise frame
accumulation.

The `.ends_with(".vsock")` suffix check allowed any leaf ending with
`.vsock` (e.g. `evil.vsock`) and rejected the real socket name
`vsock.sock`, causing every boot to loop until the deadline.

Replace the suffix check with an exact-name guard mirroring grant-api's
`SOCKET_NAME` pattern: define `VSOCK_NAME = "vsock.sock"` and require
`leaf == Path::new(VSOCK_NAME)`. Update all tests to use the real
`vsock.sock` leaf so the name gate isn't short-circuited when testing
other refusal contracts; add `wrong_leaf_name_is_rejected` and
`api_socket_name_is_rejected` to pin the exact-name contract.
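
The difference between the two gates is easy to see side by side. This is a simplified sketch, not the suidhelper's code: `suffix_gate` and `exact_gate` are hypothetical helper names, the paths are made up, and the real op also performs the SafePath confinement and O_NOFOLLOW walk that are omitted here.

```rust
use std::ffi::OsStr;
use std::path::Path;

const VSOCK_NAME: &str = "vsock.sock";

/// The original gate: any leaf ending in ".vsock" passes.
fn suffix_gate(socket: &Path) -> bool {
    socket
        .file_name()
        .and_then(OsStr::to_str)
        .map_or(false, |leaf| leaf.ends_with(".vsock"))
}

/// The fixed gate: the leaf must be exactly VSOCK_NAME.
fn exact_gate(socket: &Path) -> bool {
    socket.file_name() == Some(OsStr::new(VSOCK_NAME))
}

fn main() {
    for leaf in ["vsock.sock", "evil.vsock", "api.socket"] {
        let path = Path::new("/jail/root").join(leaf);
        println!(
            "{:<12} suffix={} exact={}",
            leaf,
            suffix_gate(&path),
            exact_gate(&path)
        );
    }
}
```

The suffix gate accepts `evil.vsock` and rejects the real `vsock.sock`; the exact gate does the opposite and also rejects `api.socket`.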
… client

The Elixir client sends env as a JSON object (`{"K":"V"}`); serde was
trying to deserialize it into `Vec<(String,String)>` and failing with
"invalid type: map, expected a sequence", causing every exec with :env
to return {:error, :closed}.

Change the env field type to `BTreeMap<String,String>`, which serde
deserializes correctly from a JSON object. Update exec.rs to use
`.envs(&req.env)`, and add a cross-language fixture test that feeds the
exact byte sequence the Elixir client produces and asserts correct decoding.

Also fix M1: the signal-exit comment claimed "128+signal" but the code
returns -1; update the comment to match. Fix M3: the Hyper.exec/3 @doc
now names all three accepted timeout opts (:timeout, :timeout_ms,
:connect_timeout) accurately.
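
A minimal sketch of an exec core with the behaviour these commits describe: env held in a `BTreeMap` and passed via `.envs(...)`, 127 on spawn failure, and -1 when the child dies from a signal. It uses only `std::process::Command`; the signature, the empty-argv handling and the fact that the child inherits the parent environment are assumptions of this sketch, not confirmed details of src/exec.rs.

```rust
use std::collections::BTreeMap;
use std::path::Path;
use std::process::Command;

/// Runs argv[0] with the given env and optional cwd.
/// Returns (exit_code, stdout, stderr).
fn run(
    argv: &[String],
    env: &BTreeMap<String, String>,
    cwd: Option<&Path>,
) -> (i32, Vec<u8>, Vec<u8>) {
    let (program, args) = match argv.split_first() {
        Some(parts) => parts,
        // Assumption: an empty argv is treated like a failed exec.
        None => return (127, Vec::new(), b"empty argv".to_vec()),
    };
    let mut cmd = Command::new(program);
    cmd.args(args).envs(env);
    if let Some(dir) = cwd {
        cmd.current_dir(dir);
    }
    match cmd.output() {
        // `code()` is None when the child was killed by a signal; report -1.
        Ok(out) => (out.status.code().unwrap_or(-1), out.stdout, out.stderr),
        // Spawn failure (e.g. command not found) maps to 127, error text on stderr.
        Err(err) => (127, Vec::new(), err.to_string().into_bytes()),
    }
}

fn main() {
    let mut env = BTreeMap::new();
    env.insert("GREETING".to_string(), "hi".to_string());
    let argv: Vec<String> = ["sh", "-c", "echo $GREETING"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let (code, stdout, stderr) = run(&argv, &env, None);
    println!(
        "exit={} stdout={:?} stderr={:?}",
        code,
        String::from_utf8_lossy(&stdout),
        String::from_utf8_lossy(&stderr)
    );
}
```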
@codecov

codecov Bot commented Jul 1, 2026


@github-actions

github-actions Bot commented Jul 1, 2026


Test Results

359 tests   - 18   358 ✅  - 17   9s ⏱️ +3s
 60 suites  -  7     1 💤 ± 0 
  2 files   ± 0     0 ❌  -  1 

Results for commit a4d73f0. ± Comparison against base commit c3a9766.

This pull request removes 32 and adds 14 tests. Note that renamed tests count towards both.
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args contain --id, --uid, --gid with the opts values
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args do not contain privileged flags owned by the suidhelper
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args end with --api-sock /api.socket
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args include --cgroup cpu.max and memory.max for :micro type
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args start with the jailer subcommand
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test binary is the suid helper
Elixir.Hyper.Node.Img.MutableTest ‑ test active_vm_ids drops a vm_id once its mutable layer dies
Elixir.Hyper.Node.Img.MutableTest ‑ test active_vm_ids lists the vm_ids of every live mutable layer
Elixir.Hyper.Node.Reaper.LivenessTest ‑ test a vm with a live mutable layer is never an orphan, even with a leftover rw device
Elixir.Hyper.Node.Reaper.PlanPropertiesTest ‑ property Mutable.dm_name/1 round-trips through rw_ids for a real vm_id
…
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ api_socket_name_is_rejected
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ dotdot_traversal_is_rejected
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ missing_jail_tree_is_pending
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ missing_socket_is_pending
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ real_socket_is_granted_and_chmod_0660
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ regular_file_at_leaf_is_refused_and_untouched
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ relative_socket_is_rejected
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ shape_classification
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ socket_outside_jail_base_is_rejected
hyper-suidhelper::tools_chroot_jail_grant_vsock ‑ symlink_at_leaf_is_refused
…

♻️ This comment has been updated with latest results.

…he app

Replace the manual `mix guest_agent.install` task with a `:guest_agent_build`
Mix compiler (mirroring the codegen compilers) that builds the static musl
binary into priv/guest-agent/ at compile time, so the agent ships inside the
app and its release with no install step. Host arch is required; a cross arch
is best-effort (built only when its rustup target is installed, warn+skip on
link failure). Provider resolves from :code.priv_dir; ensure_installed/0 now
checks the host arch only (KVM boots same-arch guests). Drops the now-unused
guest_agent_install_dir.
@markovejnovic

Contributor Author

Follow-up (per review feedback): the guest agent no longer needs a manual mix guest_agent.install. It's now built into priv/guest-agent/ by a :guest_agent_build Mix compiler at mix compile (same pattern as the codegen compilers) and ships inside the app/release. Host arch is required; a cross arch is best-effort (built only when its rustup target is installed, warn+skip on link failure). Runtime resolves via :code.priv_dir; ensure_installed/0 checks the host arch only (KVM boots same-arch guests). So the only remaining manual step for E2E is mix suidhelper.install.

Replace length-prefixed JSON framing with self-delimiting CBOR in both
directions. Request: client CBOR.encodes a map and sends raw bytes;
agent reads one value via ciborium::from_reader (no EOF needed).
Response: agent ciborium::into_writer's then closes; client reads to EOF
and CBOR.decodes the accumulated buffer. stdout/stderr are CBOR byte
strings (serde_bytes on Rust, %CBOR.Tag{tag: :bytes} unwrap on Elixir).
Removes serde_json from the guest-agent crate. Cross-language contract
is pinned by a Rust fixture (decodes the exact Elixir-produced anchor
bytes) and a matching Elixir assertion on the same hex anchor.
…unwrap_bytes

Add a shared response anchor (exit_code=3, stdout=[0xFF,0x00,0x68,0x69], stderr=[])
that is pinned on both sides:
- Rust: rust_encodes_response_anchor asserts ciborium emits the exact 32 literal
  bytes (byte string, major type 2 for stdout/stderr), catching any regression
  where serde_bytes is dropped from wire::Response.
- Elixir: two new tests decode the same hex and round-trip through the fake
  server, asserting unwrapped stdout == <<0xFF,0x00,0x68,0x69>>.

The 0xFF,0x00 prefix is invalid UTF-8: a text-string encoder would corrupt or
reject it, making both per-side suites fail on the same regression the anchor
is designed to catch.
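
To make the byte-string versus text-string distinction concrete, here is a hand-rolled sketch of the two CBOR string encodings from RFC 8949 (major type 2 for byte strings, major type 3 for text strings). It is stdlib-only and is not how the agent encodes (the commit uses ciborium with serde_bytes); `cbor_head`, `cbor_bytes` and `cbor_text` are names assumed for illustration, and the sketch does not reproduce the full response anchor.

```rust
/// Encodes a CBOR head: major type in the top 3 bits, then the length argument.
fn cbor_head(major: u8, len: u64) -> Vec<u8> {
    let mt = major << 5;
    let mut out = Vec::with_capacity(9);
    if len < 24 {
        out.push(mt | len as u8);
    } else if len <= u8::MAX as u64 {
        out.push(mt | 24);
        out.push(len as u8);
    } else if len <= u16::MAX as u64 {
        out.push(mt | 25);
        out.extend_from_slice(&(len as u16).to_be_bytes());
    } else if len <= u32::MAX as u64 {
        out.push(mt | 26);
        out.extend_from_slice(&(len as u32).to_be_bytes());
    } else {
        out.push(mt | 27);
        out.extend_from_slice(&len.to_be_bytes());
    }
    out
}

/// Byte string (major type 2): arbitrary bytes, no UTF-8 requirement.
fn cbor_bytes(data: &[u8]) -> Vec<u8> {
    let mut out = cbor_head(2, data.len() as u64);
    out.extend_from_slice(data);
    out
}

/// Text string (major type 3): the payload must be valid UTF-8.
fn cbor_text(text: &str) -> Vec<u8> {
    let mut out = cbor_head(3, text.len() as u64);
    out.extend_from_slice(text.as_bytes());
    out
}

fn main() {
    let stdout = [0xFF, 0x00, 0x68, 0x69];
    println!("bytes: {:02x?}", cbor_bytes(&stdout));
    println!("text:  {:02x?}", cbor_text("hi"));
    // The anchor's stdout cannot be represented as a Rust &str at all.
    println!("valid UTF-8? {}", std::str::from_utf8(&stdout).is_ok());
}
```

The four stdout bytes of the anchor encode as `44 ff 00 68 69` when emitted as a byte string; since they are not valid UTF-8, a text-string path has no faithful encoding for them.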

Fix 2: unwrap_bytes/1 now returns {:ok, binary()} | {:error, term()} instead of
raising FunctionClauseError on unexpected input shapes. recv_response/2 threads
the result through `with`, so a malformed agent response (e.g. stdout as an
integer list) becomes {:error, {:bad_response, {:not_bytes, other}}} rather than
an unhandled raise that violates the @spec.

New test: malformed-response test asserts {:error, _} via the fake server seam.
@markovejnovic

Contributor Author

Wire protocol reworked → CBOR (replaces the length-prefixed JSON framing described above). Review feedback was that the hand-rolled u32/i32 framing was low-quality; since the exchange is strict request-reply and bounded, we dropped all framing. It's now CBOR in both directions with no length prefixes or delimiters: the request is a self-delimiting CBOR value (the agent reads one via ciborium::from_reader); the response is written and then the connection closes, so the client reads to EOF and CBOR.decodes. stdout/stderr are CBOR byte strings (raw bytes, no base64) via serde_bytes on Rust ↔ %CBOR.Tag{tag: :bytes} on Elixir. gRPC was considered and rejected (HTTP/2 + an async runtime is overkill for one unary call on a tiny PID-1 agent; QEMU-guest-agent's JSON-over-vsock is the closer prior art, and CBOR gives us the same ergonomics while staying byte-native). The cross-language contract is pinned by shared literal-byte anchors in both directions (the response anchor uses non-UTF-8 bytes so a byte-vs-text regression can't pass). serde_json is dropped from the agent. Gate: 270 Elixir tests + Rust 7/7, dialyzer 0.

… + grpc.gen)

- Add proto/hyper/agent/v1/agent.proto: GuestAgent service with Exec and
  Health RPCs; package hyper.agent.v1.
- Add native/guest-agent/build.rs: calls tonic_prost_build::configure()
  .build_client(false).compile_protos(...) to generate server-side Rust
  stubs into OUT_DIR/hyper.agent.v1.rs.
- Add tonic 0.14.6, prost 0.13.5, tokio 1, tokio-vsock 0.7.2 (tonic014
  feature), and tonic-prost-build 0.14.6 to native/guest-agent/Cargo.toml.
  Note: tonic-prost-build (build dep) pulls prost 0.14 for codegen; the
  runtime prost 0.13 coexists since the generated module is not yet
  included (later task).
- Refactor Mix.Tasks.Compile.GrpcGen to glob proto/**/*.proto and compile
  each proto separately with its own directory as --proto_path (required
  so protoc-gen-elixir derives a bare filename, not a package-doubled path).
  A second --proto_path=proto enables future cross-proto imports.
- Gitignore lib/hyper/agent/v1/ alongside the existing grpc/v0 entry.
…elayed UDS

Wires up a minimal tonic Health server (examples/health_uds.rs) that serves
over a Unix domain socket, plus an Elixir integration test that proves the
full transport chain: Gun HTTP/2 client → byte-pipe UDS relay → tonic server.
The Health RPC round-trips with ok: true, confirming the assumption that
grpc-elixir's Gun adapter can speak HTTP/2 over a relayed UDS without a VM.

:gun added to deps (was an optional dep of :grpc, not yet resolved); grpc
must be force-recompiled after adding gun so the conditional Gun adapter
module is included. tonic-prost + tokio-stream added as dev-dependencies
for the health_uds example. Integration tests excluded from the normal
mix test run via :integration tag in test_helper.exs; run separately with
--no-start + Application.ensure_all_started(:grpc) to avoid the suidhelper
version check that fails on dev machines.
…op CBOR wire

Replace the CBOR serve loop with a tonic 0.14 gRPC server bound on
AF_VSOCK:1024. PID-1 invariants are preserved: sync mounts run before
the tokio runtime, a tokio SIGCHLD reaper is spawned before the server
accepts connections, and the main thread parks forever if serve() returns.

- src/agent.rs: new GuestAgent impl; Health returns ok=true; Exec stubs
  UNIMPLEMENTED (Task 4 fills it in)
- src/init.rs: drop raw libc::signal SIGCHLD handler; add spawn_reaper()
  using tokio::signal so tokio's child-process tracking is not clobbered
- src/main.rs: PID-1 skeleton with tokio runtime + VsockListener::bind +
  serve_with_incoming(listener.incoming())
- src/exec.rs: move Request/Response types here (no serde/CBOR); drop
  serve_one(); run() core unchanged for Task 4
- tests/exec.rs: import from exec:: instead of wire::
- Delete src/wire.rs, tests/wire.rs and [[test]] wire entry
- Cargo.toml: add tonic-prost to [dependencies]; drop ciborium/serde/
  serde_bytes; add [profile.release] size profile (z/lto/1cu/abort/strip),
  shrinking the musl binary from 7.08 MB to 1.09 MB

Replace the scaffolding Request/Response types in exec.rs with a pure
run(argv, env, cwd) -> (i32, Vec<u8>, Vec<u8>) core; Agent::exec maps
the generated ExecRequest -> run (via spawn_blocking) -> ExecResponse.

RelayDialer: pure module that dials the Firecracker vsock UDS via :socket,
sends CONNECT <port>\n, reads the OK line byte-by-byte (to avoid consuming
HTTP/2 bytes), and returns the connected socket.

Relay: GenServer that listens on a host UDS, accepts connections into an
unblocking linked acceptor process, and for each conn spawns an isolated
connection process that calls RelayDialer.dial/3 then pipes bytes in two
unlinked direction-workers — a pipe crash drops only that connection.
terminate/2 closes the listen socket and removes the socket file.
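
The two direction-workers can be illustrated outside the BEAM. The Relay itself is an Elixir GenServer; the sketch below is a Rust analogue that swaps processes for OS threads and uses `UnixStream::pair` in place of the real listen socket and vsock UDS. `pipe` and `relay` are hypothetical names, and the supervision and crash-isolation behaviour described above is not modelled.

```rust
use std::io::{self, Read, Write};
use std::net::Shutdown;
use std::os::unix::net::UnixStream;
use std::thread::{self, JoinHandle};

/// One direction-worker: copy until EOF, then signal EOF downstream.
fn pipe(mut from: UnixStream, mut to: UnixStream) -> JoinHandle<io::Result<u64>> {
    thread::spawn(move || {
        let copied = io::copy(&mut from, &mut to);
        let _ = to.shutdown(Shutdown::Write);
        copied
    })
}

/// Starts both directions for one relayed connection.
fn relay(
    a: UnixStream,
    b: UnixStream,
) -> io::Result<(JoinHandle<io::Result<u64>>, JoinHandle<io::Result<u64>>)> {
    let a_to_b = pipe(a.try_clone()?, b.try_clone()?);
    let b_to_a = pipe(b, a);
    Ok((a_to_b, b_to_a))
}

fn main() -> io::Result<()> {
    // Socket pairs stand in for the client side and the vsock side.
    let (mut client, front) = UnixStream::pair()?;
    let (back, mut server) = UnixStream::pair()?;
    let (up, down) = relay(front, back)?;

    client.write_all(b"ping")?;
    client.shutdown(Shutdown::Write)?;
    let mut request = Vec::new();
    server.read_to_end(&mut request)?;

    server.write_all(b"pong")?;
    server.shutdown(Shutdown::Write)?;
    let mut reply = Vec::new();
    client.read_to_end(&mut reply)?;

    up.join().expect("worker panicked")?;
    down.join().expect("worker panicked")?;
    println!(
        "{} / {}",
        String::from_utf8_lossy(&request),
        String::from_utf8_lossy(&reply)
    );
    Ok(())
}
```

Each worker half-closes its destination when its source reaches EOF, so one direction can finish while the other is still carrying data.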
…er-read test

Fix 1: distinguish :closed (expected terminate/2 shutdown) from any other
accept error in accept_loop. Non-:closed errors now exit({:accept_error, reason})
so the linked GenServer receives a non-:normal EXIT and stops, letting the
Task 7 supervisor restart the relay rather than leaving it zombied with no
acceptor. Add synthetic-EXIT test proving the GenServer stops on non-normal
acceptor exits.

Fix 2: add over-read regression test to relay_dialer_test — fake Firecracker
sends "OK 5\nHELLO_H2_BYTES" in one write; asserts trailing bytes are readable
after dial/3 returns, which would fail if a bulk recv replaced the byte-by-byte
line reader.
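
The over-read hazard can be sketched without a real Firecracker socket. The dialer in this PR is Elixir; below is a Rust illustration in which `handshake` is a hypothetical function that sends the CONNECT line and reads the reply one byte at a time, so anything the peer wrote after the newline stays unread for the HTTP/2 layer.

```rust
use std::io::{self, Cursor, Read, Write};

/// Sends `CONNECT <port>\n`, then reads the reply byte-by-byte up to '\n'
/// so that no bytes after the newline are consumed from `rx`.
fn handshake<R: Read, W: Write>(rx: &mut R, tx: &mut W, port: u32) -> io::Result<String> {
    write!(tx, "CONNECT {}\n", port)?;
    tx.flush()?;
    let mut line = Vec::new();
    let mut byte = [0u8; 1];
    loop {
        if rx.read(&mut byte)? == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "closed during handshake",
            ));
        }
        if byte[0] == b'\n' {
            break;
        }
        line.push(byte[0]);
    }
    let line = String::from_utf8_lossy(&line).into_owned();
    if line.starts_with("OK ") {
        Ok(line)
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            format!("handshake refused: {}", line),
        ))
    }
}

fn main() -> io::Result<()> {
    // Fake peer: the OK line and the first payload bytes arrive in one write.
    let mut rx = Cursor::new(b"OK 5\nHELLO_H2_BYTES".to_vec());
    let mut tx = Vec::new();
    let reply = handshake(&mut rx, &mut tx, 1024)?;
    let mut rest = Vec::new();
    rx.read_to_end(&mut rest)?;
    println!("reply={:?} left for HTTP/2={:?}", reply, String::from_utf8_lossy(&rest));
    Ok(())
}
```

A bulk read into a larger buffer would return the OK line and the trailing bytes together, and the caller would then have to hand the surplus back to whatever reads the socket next.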

Fix 3: change pipe workers from spawn to spawn_link so a connection handler
death reaps both workers via the link instead of orphaning them. Keep the
existing Process.monitor for the handler to react to each worker's normal end.
Add Process.unlink(sibling) before Process.exit(sibling, :kill) in
await_pipe_end to prevent the :killed cascade from reaching the handler back
through the now-live link.
…lient

Replace Hyper.Node.FireVMM.Exec (CBOR over vsock) with a typed gRPC client
(Hyper.Node.FireVMM.Agent) that dials the per-VM relay UDS through the Gun
adapter. relay_socket_path/1 is the single source of truth for the host-side
socket path; the relay GenServer (Task 7) will call the same function.

Also fix Mix.Tasks.Compile.GrpcGen to format generated pb.ex files via
Code.format_string! instead of Mix.Task.run("format"), which is a no-op
when the format task was already consumed earlier in the same mix check run.

- Wire Relay as a third :one_for_one child of FireVMM.Supervisor, started
  with {vsock_uds: Jailer.host_vsock/1, listen_path: Agent.relay_socket_path/1}.
- Override Relay.child_spec/1 with restart: :transient so abnormal acceptor
  crashes are restarted by the supervisor while clean :shutdown is not.
- Add Reclaim.reclaim_sockets/0 (public @doc false) to sweep grpc-*.sock
  files in socket_dir at node boot; called from Reclaim.run/0 alongside the
  existing dm/loop sweeps. Also creates the dir if absent so the relay's
  bind never races a missing directory on first boot.
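
The boot-time sweep in the last bullet is a small directory scan. Reclaim.reclaim_sockets/0 is Elixir and its body is not shown here, so the following Rust sketch only illustrates the described behaviour (create the directory if absent, remove entries named `grpc-*.sock`); the function signature and the scratch directory name are assumptions.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Creates `socket_dir` if needed and removes stale `grpc-*.sock` entries.
/// Returns how many entries were removed.
fn reclaim_sockets(socket_dir: &Path) -> io::Result<usize> {
    fs::create_dir_all(socket_dir)?;
    let mut removed = 0;
    for entry in fs::read_dir(socket_dir)? {
        let entry = entry?;
        let name = entry.file_name();
        let name = name.to_string_lossy();
        if name.starts_with("grpc-") && name.ends_with(".sock") {
            fs::remove_file(entry.path())?;
            removed += 1;
        }
    }
    Ok(removed)
}

fn main() -> io::Result<()> {
    // Hypothetical scratch directory standing in for the node's socket_dir.
    let dir = std::env::temp_dir().join(format!("hyper-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("grpc-vm1.sock"), b"")?;
    fs::write(dir.join("keep.txt"), b"")?;
    println!("removed {} stale socket(s)", reclaim_sockets(&dir)?);
    fs::remove_dir_all(&dir)
}
```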