feat: guest command exec over vsock (Hyper.exec/3) #49
Implement exec::run/1 and exec::serve_one/2 in src/exec.rs. Replace tests/exec.rs placeholder with three example tests (captures stdout+exit, missing command → 127+stderr, honours cwd). Extend tests/wire.rs request_roundtrips to generate env pairs and assert env survives the round-trip (carry-over from Task 1 review).
Add the `chroot-jail grant-vsock --socket <path>` op and its Elixir wrapper `Hyper.SuidHelper.ChrootJail.grant_vsock/1`. Mirrors `grant-api` exactly: confine the socket path to the jail base via SafePath, walk O_NOFOLLOW from jail_base, require S_IFSOCK + no-follow, chown to caller, chmod 0660, chgrp+chmod 0710 on the parent root dir for traversal. The leaf must end with `.vsock` (vs the fixed `api.socket` name for grant-api). Factors the shared confinement+chown+chmod logic (GrantOut, constants, walk_to, grant_to_caller) into a new `grant.rs` module called by both ops; grant_api.rs re-exports GrantOut for backward compatibility. 14 refusal + pending + grant tests in tests/tools/chroot_jail_grant_vsock.rs (mirrors grant_api's suite, security refusal contracts first).
Implements Hyper.Node.FireVMM.Exec.run/3: connect the firecracker
vsock UDS, complete the CONNECT 1024\n / OK handshake, send the
length-prefixed JSON request frame, and decode the signed-32 exit-code
+ two length-prefixed stdio frames. recv_exactly loops defensively until
all bytes arrive; recv_line accumulates byte-by-byte for the handshake.
Retries on :econnrefused/:enoent/non-OK until the connect_timeout (5 s
default) elapses, then returns {:error, :agent_unavailable}.
Tests: full round-trip against a fake AF_UNIX server + a chunked-response
variant that sends the response one byte at a time to exercise frame
accumulation.
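The framing this commit describes can be sketched in Rust as follows. This is a minimal illustration of the byte layout only (u32 BE length prefix on the request, i32 BE exit code plus two length-prefixed stdio frames on the response), not the Elixir client's code; `ExecResponse`, `encode_frame`, `read_frame` and `decode_response` are hypothetical names. `read_exact` plays the role of `recv_exactly`: it keeps reading until the requested number of bytes has arrived.

```rust
use std::io::{self, Read};

/// Decoded exec response (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct ExecResponse {
    pub exit_code: i32,
    pub stdout: Vec<u8>,
    pub stderr: Vec<u8>,
}

/// Encode a body as a u32 big-endian length prefix followed by the bytes.
pub fn encode_frame(body: &[u8]) -> Vec<u8> {
    let mut out = (body.len() as u32).to_be_bytes().to_vec();
    out.extend_from_slice(body);
    out
}

/// Read one u32-BE length-prefixed frame.
/// A production reader would cap the length before allocating.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let mut body = vec![0u8; u32::from_be_bytes(len) as usize];
    r.read_exact(&mut body)?;
    Ok(body)
}

/// Decode: i32 BE exit code, then the stdout frame, then the stderr frame.
pub fn decode_response<R: Read>(r: &mut R) -> io::Result<ExecResponse> {
    let mut code = [0u8; 4];
    r.read_exact(&mut code)?;
    let stdout = read_frame(r)?;
    let stderr = read_frame(r)?;
    Ok(ExecResponse {
        exit_code: i32::from_be_bytes(code),
        stdout,
        stderr,
    })
}
```

Because the exit code is decoded as a signed 32-bit value, a negative code such as -1 survives the round-trip unchanged.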
The `.ends_with(".vsock")` suffix check allowed any leaf ending with
`.vsock` (e.g. `evil.vsock`) and rejected the real socket name
`vsock.sock`, causing every boot to loop until the deadline.
Replace the suffix check with an exact-name guard mirroring grant-api's
`SOCKET_NAME` pattern: define `VSOCK_NAME = "vsock.sock"` and require
`leaf == Path::new(VSOCK_NAME)`. Update all tests to use the real
`vsock.sock` leaf so the name gate isn't short-circuited when testing
other refusal contracts; add `wrong_leaf_name_is_rejected` and
`api_socket_name_is_rejected` to pin the exact-name contract.
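The difference between the two gates can be sketched as below. This is an illustration under assumptions, not the code in `grant.rs`: `suffix_gate` and `exact_name_gate` are hypothetical names, and only the constant `VSOCK_NAME` and the comparison against it come from the commit message.

```rust
use std::ffi::OsStr;
use std::path::Path;

const VSOCK_NAME: &str = "vsock.sock";

/// Old behaviour: any leaf ending in ".vsock" passes.
pub fn suffix_gate(path: &Path) -> bool {
    path.file_name()
        .and_then(OsStr::to_str)
        .map_or(false, |leaf| leaf.ends_with(".vsock"))
}

/// New behaviour: the leaf must be exactly VSOCK_NAME.
pub fn exact_name_gate(path: &Path) -> bool {
    path.file_name() == Some(OsStr::new(VSOCK_NAME))
}
```

With these definitions `suffix_gate` accepts `evil.vsock` and rejects `vsock.sock`, while `exact_name_gate` does the opposite and also rejects `api.socket`.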
… client
The Elixir client sends env as a JSON object (`{"K":"V"}`); serde was
trying to deserialize it into `Vec<(String,String)>` and failing with
"invalid type: map, expected a sequence", causing every exec with :env
to return {:error, :closed}.
Change the env field type to `BTreeMap<String,String>`, which serde
deserializes correctly from a JSON object. Update exec.rs to use
`.envs(&req.env)`, and add a cross-language fixture test that feeds the
exact byte sequence the Elixir client produces and asserts correct decoding.
Also fix M1: the signal-exit comment claimed "128+signal" but the code
returns -1; update the comment to match. Fix M3: the Hyper.exec/3 @doc
now names all three accepted timeout opts (:timeout, :timeout_ms,
:connect_timeout) accurately.
Test Results: 359 tests (-18), 358 ✅ (-17), 9s ⏱️ (+3s). Results for commit a4d73f0. ± Comparison against base commit c3a9766. This pull request removes 32 and adds 14 tests. Note that renamed tests count towards both.
…he app
Replace the manual `mix guest_agent.install` task with a `:guest_agent_build` Mix compiler (mirroring the codegen compilers) that builds the static musl binary into priv/guest-agent/ at compile time, so the agent ships inside the app and its release with no install step. Host arch is required; a cross arch is best-effort (built only when its rustup target is installed, warn+skip on link failure). Provider resolves from :code.priv_dir; ensure_installed/0 now checks the host arch only (KVM boots same-arch guests). Drops the now-unused guest_agent_install_dir.
Follow-up (per review feedback): the guest agent no longer needs a manual |
Replace length-prefixed JSON framing with self-delimiting CBOR in both
directions. Request: client CBOR.encodes a map and sends raw bytes;
agent reads one value via ciborium::from_reader (no EOF needed).
Response: agent ciborium::into_writer's then closes; client reads to EOF
and CBOR.decodes the accumulated buffer. stdout/stderr are CBOR byte
strings (serde_bytes on Rust, %CBOR.Tag{tag: :bytes} unwrap on Elixir).
Removes serde_json from the guest-agent crate. Cross-language contract
is pinned by a Rust fixture (decodes the exact Elixir-produced anchor
bytes) and a matching Elixir assertion on the same hex anchor.
…unwrap_bytes
Add a shared response anchor (exit_code=3, stdout=[0xFF,0x00,0x68,0x69], stderr=[])
that is pinned on both sides:
- Rust: rust_encodes_response_anchor asserts ciborium emits the exact 32 literal
bytes (byte string, major type 2 for stdout/stderr), catching any regression
where serde_bytes is dropped from wire::Response.
- Elixir: two new tests decode the same hex and round-trip through the fake
server, asserting unwrapped stdout == <<0xFF,0x00,0x68,0x69>>.
The 0xFF,0x00 prefix is invalid UTF-8: a text-string encoder would corrupt or
reject it, making both per-side suites fail on the same regression the anchor
is designed to catch.
Fix 2: unwrap_bytes/1 now returns {:ok, binary()} | {:error, term()} instead of
raising FunctionClauseError on unexpected input shapes. recv_response/2 threads
the result through `with`, so a malformed agent response (e.g. stdout as an
integer list) becomes {:error, {:bad_response, {:not_bytes, other}}} rather than
an unhandled raise that violates the @spec.
New test: malformed-response test asserts {:error, _} via the fake server seam.
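Why the `0xFF,0x00` prefix makes a useful anchor can be shown with a small hand-rolled encoder. This is a sketch of the CBOR initial-byte rule for short strings (major type 2 for byte strings, major type 3 for text strings), not ciborium's implementation; `cbor_short_bytes` and `cbor_short_text` are hypothetical helpers that only handle lengths below 24, where the length fits in the initial byte.

```rust
/// Encode `data` as a definite-length CBOR byte string (major type 2).
/// Returns None for lengths of 24 or more, which this sketch does not cover.
pub fn cbor_short_bytes(data: &[u8]) -> Option<Vec<u8>> {
    if data.len() >= 24 {
        return None;
    }
    let mut out = vec![0x40 | data.len() as u8];
    out.extend_from_slice(data);
    Some(out)
}

/// Encode `data` as a definite-length CBOR text string (major type 3).
/// Text strings must be valid UTF-8, so invalid input is rejected.
pub fn cbor_short_text(data: &[u8]) -> Option<Vec<u8>> {
    let text = std::str::from_utf8(data).ok()?;
    if text.len() >= 24 {
        return None;
    }
    let mut out = vec![0x60 | text.len() as u8];
    out.extend_from_slice(text.as_bytes());
    Some(out)
}
```

The anchor's stdout payload encodes as a byte string with initial byte `0x44`, but the text-string path refuses it outright, which is the regression the shared anchor is meant to surface.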
Wire protocol reworked → CBOR (replaces the length-prefixed JSON framing described above). After review feedback that the hand-rolled |
… + grpc.gen)
- Add proto/hyper/agent/v1/agent.proto: GuestAgent service with Exec and Health RPCs; package hyper.agent.v1.
- Add native/guest-agent/build.rs: calls tonic_prost_build::configure().build_client(false).compile_protos(...) to generate server-side Rust stubs into OUT_DIR/hyper.agent.v1.rs.
- Add tonic 0.14.6, prost 0.13.5, tokio 1, tokio-vsock 0.7.2 (tonic014 feature), and tonic-prost-build 0.14.6 to native/guest-agent/Cargo.toml. Note: tonic-prost-build (build dep) pulls prost 0.14 for codegen; the runtime prost 0.13 coexists since the generated module is not yet included (later task).
- Refactor Mix.Tasks.Compile.GrpcGen to glob proto/**/*.proto and compile each proto separately with its own directory as --proto_path (required so protoc-gen-elixir derives a bare filename, not a package-doubled path). A second --proto_path=proto enables future cross-proto imports.
- Gitignore lib/hyper/agent/v1/ alongside the existing grpc/v0 entry.
…elayed UDS
Wires up a minimal tonic Health server (examples/health_uds.rs) that serves over a Unix domain socket, plus an Elixir integration test that proves the full transport chain: Gun HTTP/2 client → byte-pipe UDS relay → tonic server. The Health RPC round-trips with ok: true, confirming the assumption that grpc-elixir's Gun adapter can speak HTTP/2 over a relayed UDS without a VM.
:gun added to deps (was an optional dep of :grpc, not yet resolved); grpc must be force-recompiled after adding gun so the conditional Gun adapter module is included. tonic-prost + tokio-stream added as dev-dependencies for the health_uds example.
Integration tests excluded from the normal mix test run via :integration tag in test_helper.exs; run separately with --no-start + Application.ensure_all_started(:grpc) to avoid the suidhelper version check that fails on dev machines.
…op CBOR wire
Replace the CBOR serve loop with a tonic 0.14 gRPC server bound on AF_VSOCK:1024. PID-1 invariants are preserved: sync mounts run before the tokio runtime, a tokio SIGCHLD reaper is spawned before the server accepts connections, and the main thread parks forever if serve() returns.
- src/agent.rs: new GuestAgent impl; Health returns ok=true; Exec stubs UNIMPLEMENTED (Task 4 fills it in)
- src/init.rs: drop raw libc::signal SIGCHLD handler; add spawn_reaper() using tokio::signal so tokio's child-process tracking is not clobbered
- src/main.rs: PID-1 skeleton with tokio runtime + VsockListener::bind + serve_with_incoming(listener.incoming())
- src/exec.rs: move Request/Response types here (no serde/CBOR); drop serve_one(); run() core unchanged for Task 4
- tests/exec.rs: import from exec:: instead of wire::
- Delete src/wire.rs, tests/wire.rs and [[test]] wire entry
- Cargo.toml: add tonic-prost to [dependencies]; drop ciborium/serde/serde_bytes; add [profile.release] size profile (z/lto/1cu/abort/strip); musl binary 7.08 MB → 1.09 MB
Replace the scaffolding Request/Response types in exec.rs with a pure run(argv, env, cwd) -> (i32, Vec<u8>, Vec<u8>) core; Agent::exec maps the generated ExecRequest -> run (via spawn_blocking) -> ExecResponse.
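A pure core with that shape might look like the sketch below, built only on `std::process::Command`. It is an illustration, not the crate's `exec.rs`: the `BTreeMap` env type and the empty-argv handling are assumptions, while the 127 on spawn failure and the -1 on signal exit follow the behaviour described in the earlier commits.

```rust
use std::collections::BTreeMap;
use std::path::Path;
use std::process::Command;

/// Run `argv` with extra `env` vars and an optional `cwd`.
/// Returns (exit_code, stdout, stderr).
pub fn run(
    argv: &[String],
    env: &BTreeMap<String, String>,
    cwd: Option<&Path>,
) -> (i32, Vec<u8>, Vec<u8>) {
    // Assumption: an empty argv is treated like a missing command.
    let Some((program, args)) = argv.split_first() else {
        return (127, Vec::new(), b"empty argv".to_vec());
    };

    let mut cmd = Command::new(program);
    cmd.args(args).envs(env);
    if let Some(dir) = cwd {
        cmd.current_dir(dir);
    }

    match cmd.output() {
        // code() is None when the child was killed by a signal; map that to -1.
        Ok(out) => (out.status.code().unwrap_or(-1), out.stdout, out.stderr),
        // Spawn failure (for example, command not found) becomes 127 plus a message.
        Err(e) => (127, Vec::new(), e.to_string().into_bytes()),
    }
}
```

Since `Command::output` blocks until the child exits, calling a function like this from an async handler is the reason to wrap it in `spawn_blocking`.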
RelayDialer: pure module that dials the Firecracker vsock UDS via :socket, sends CONNECT <port>\n, reads the OK line byte-by-byte (to avoid consuming HTTP/2 bytes), and returns the connected socket. Relay: GenServer that listens on a host UDS, accepts connections into an unblocking linked acceptor process, and for each conn spawns an isolated connection process that calls RelayDialer.dial/3 then pipes bytes in two unlinked direction-workers — a pipe crash drops only that connection. terminate/2 closes the listen socket and removes the socket file.
…er-read test
Fix 1: distinguish :closed (expected terminate/2 shutdown) from any other
accept error in accept_loop. Non-:closed errors now exit({:accept_error, reason})
so the linked GenServer receives a non-:normal EXIT and stops, letting the
Task 7 supervisor restart the relay rather than leaving it zombied with no
acceptor. Add synthetic-EXIT test proving the GenServer stops on non-normal
acceptor exits.
Fix 2: add over-read regression test to relay_dialer_test — fake Firecracker
sends "OK 5\nHELLO_H2_BYTES" in one write; asserts trailing bytes are readable
after dial/3 returns, which would fail if a bulk recv replaced the byte-by-byte
line reader.
Fix 3: change pipe workers from spawn to spawn_link so a connection handler
death reaps both workers via the link instead of orphaning them. Keep the
existing Process.monitor for the handler to react to each worker's normal end.
Add Process.unlink(sibling) before Process.exit(sibling, :kill) in
await_pipe_end to prevent the :killed cascade from reaching the handler back
through the now-live link.
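The over-read hazard from Fix 2 is language-independent, so it can be sketched in Rust even though the relay itself is Elixir. The hypothetical `read_line_bytewise` below reads the handshake reply one byte at a time and stops at the newline, so whatever follows it in the same write stays unread for the next consumer.

```rust
use std::io::{self, Read};

/// Read one '\n'-terminated line a byte at a time and return it without
/// the newline. Nothing past the newline is consumed from `r`.
/// A production reader would also cap the line length.
pub fn read_line_bytewise<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut line = Vec::new();
    let mut byte = [0u8; 1];
    loop {
        // read_exact fails with UnexpectedEof if the peer closes mid-line.
        r.read_exact(&mut byte)?;
        if byte[0] == b'\n' {
            return Ok(line);
        }
        line.push(byte[0]);
    }
}

/// Accept the handshake only if the reply line starts with "OK".
pub fn expect_ok<R: Read>(r: &mut R) -> io::Result<()> {
    let line = read_line_bytewise(r)?;
    if line.starts_with(b"OK") {
        Ok(())
    } else {
        Err(io::Error::new(io::ErrorKind::InvalidData, "handshake refused"))
    }
}
```

A bulk read with a large buffer would instead pull the trailing HTTP/2 bytes into the handshake buffer, which is the failure the regression test guards against.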
…lient
Replace Hyper.Node.FireVMM.Exec (CBOR over vsock) with a typed gRPC client
(Hyper.Node.FireVMM.Agent) that dials the per-VM relay UDS through the Gun
adapter. relay_socket_path/1 is the single source of truth for the host-side
socket path; the relay GenServer (Task 7) will call the same function.
Also fix Mix.Tasks.Compile.GrpcGen to format generated pb.ex files via
Code.format_string! instead of Mix.Task.run("format"), which is a no-op
when the format task was already consumed earlier in the same mix check run.
- Wire Relay as a third :one_for_one child of FireVMM.Supervisor, started
with {vsock_uds: Jailer.host_vsock/1, listen_path: Agent.relay_socket_path/1}.
- Override Relay.child_spec/1 with restart: :transient so abnormal acceptor
crashes are restarted by the supervisor while clean :shutdown is not.
- Add Reclaim.reclaim_sockets/0 (public @doc false) to sweep grpc-*.sock
files in socket_dir at node boot; called from Reclaim.run/0 alongside the
existing dm/loop sweeps. Also creates the dir if absent so the relay's
bind never races a missing directory on first boot.
…v/PATH in Hyper.exec
Guest command exec over vsock
Adds `Hyper.exec(vm, argv, opts)`: run a one-shot command inside a running guest microVM over virtio-vsock and get back `{stdout, stderr, exit_code}`. Also fixes guest boot on bare OCI rootfses (the alpine image was spinning on a missing `/sbin/openrc`) by replacing the image init with our own PID-1 agent.
Design spec + plan live under `docs/superpowers/` (gitignored scratch).
How it works
- `hyper-guest-agent` (new `native/guest-agent/` static musl Rust crate) runs as guest PID 1 (`init=/hyper-init`): mounts `/proc`, `/sys`, `/dev`, reaps orphans, binds `AF_VSOCK:1024`, and fork/execs one command per connection. Because it is init, any OCI image boots regardless of its own init system.
- `mix guest_agent.install` builds the two-arch binaries; a per-arch provider resolves them, mirroring the vmlinux/suidhelper patterns.
- `OciLoader` stages the agent to `/hyper-init` before `mke2fs -d` (1-layer image; delta-layer authoring deliberately deferred to its own spec). `BootSpec` adds `init=/hyper-init`.
- `Configuring` state PUTs `/vsock` (cid 3, `vsock.sock`) then hands the firecracker-created host UDS to the node uid via a new `chroot-jail grant-vsock` setuid op (shares a factored `grant.rs` confinement helper with `grant-api`, which was refactored onto it with no behavioral drift). `InstanceStart` is gated behind the grant, retry bounded by the existing `boot_deadline`.
- `Hyper.Node.FireVMM.Exec` does firecracker's `CONNECT 1024\n` handshake + length-prefixed framing; `Hyper.exec/3` ties it together.
Wire protocol
- Request: `u32` BE length + JSON `{argv, env, cwd?, timeout_ms?}` (env as a JSON object).
- Response: `i32` BE exit_code + `u32` BE stdout_len + stdout + `u32` BE stderr_len + stderr. Exec failure → 127.
Resource lifecycle
No new host-side process to supervise and no new leak surface: the agent is PID 1 inside the VM (dies with it); the granted UDS lives in the jail and is removed by the existing teardown (cgroup.kill + jail removal). Deliberate, given the recently-fixed reaper/dm fragility (#48).
Review & quality
Built subagent-driven with a per-task spec+quality review gate and a whole-branch opus review. That whole-branch review caught two cross-task contract bugs the per-task reviews structurally couldn't — both fixed and re-confirmed:
- The grant-vsock leaf check required a `.vsock` suffix, but the real socket is `vsock.sock` → would have boot-looped every VM. Now an exact-name gate (mirrors grant-api; also closes an `evil.vsock` hole).
- The agent deserialized `env` as a JSON array but the client sends an object → every `:env` exec would fail. Now a `BTreeMap`, pinned by a cross-language fixture test that feeds the exact bytes the Elixir client emits.
Gate (HEAD)
- Elixir: `--warnings-as-errors`, format, credo `--strict` (exit 0), 266 tests (90 properties), dialyzer 0 errors.
- Rust: `-D warnings` + fmt clean, x86_64-musl release builds.
Not yet done
Live E2E: this is static-gate-green but has not been booted on a real VM yet. Before/after merge, on the VM host:
`mix guest_agent.install` + `mix suidhelper.install` (re-stamps the bumped suidhelper), then `Hyper.exec(vm, ["uname","-a"])` should return `{:ok, %{exit_code: 0, stdout: "Linux ..."}}` with no openrc spam in the boot log.