From 1fe39c1874b2509faee3f556d4b1660678c95705 Mon Sep 17 00:00:00 2001 From: Patrick Riel Date: Wed, 20 May 2026 16:10:54 +0000 Subject: [PATCH 1/3] add rfc for vm lifecycle extensions Signed-off-by: Patrick Riel --- .../README.md | 281 ++++++++++++++++++ 1 file changed, 281 insertions(+) create mode 100644 rfc/0004-vm-driver-lifecycle-extension-api/README.md diff --git a/rfc/0004-vm-driver-lifecycle-extension-api/README.md b/rfc/0004-vm-driver-lifecycle-extension-api/README.md new file mode 100644 index 000000000..3a3b28e32 --- /dev/null +++ b/rfc/0004-vm-driver-lifecycle-extension-api/README.md @@ -0,0 +1,281 @@ +--- +authors: + - "@cheese-head" +state: review +--- + +# RFC 0004: VM-Driver Lifecycle Extension API + +## Summary + +This RFC makes `openshell-driver-vm` extensible without splitting the VM +driver into multiple driver crates. + +It introduces two core changes: + +1. `template.runtime_class_name` selects the VM backend: + `libkrun`, `qemu`, or omitted for the default. +2. `VmLifecycleExtension` lets in-tree extensions participate in VM + launch, launch result handling, delete, and restart reconcile. + +The default behavior is unchanged: if no backend is requested and no +extension is configured, the VM driver behaves as it does today. + +## Goals + +- Let users request QEMU without also requesting a GPU. +- Keep `libkrun` as the default VM backend. +- Give in-tree VM extensions a supported way to add launch-time + resources such as VFIO devices, DPU-backed NICs, vDPA devices, vTPMs, + encrypted volumes, or audit hooks. +- Keep extension state isolated by extension and sandbox. +- Make restart reconcile explicit, observable, and advisory by default. +- Keep user-facing resource requests separate from VM-driver internal + attachment details. + +## Non-Goals + +- A new `openshell-driver-qemu` crate. +- Out-of-tree extension support. +- A public resource-claim API. +- Treating `runtime_class_name` as a network attachment selector. 
+- Replacing gateway-level compute hooks. +- Automatic VM restart within the same sandbox lifetime. + +## Core Components + +### Runtime Class + +`template.runtime_class_name` selects the VM backend: + +```yaml +spec: + template: + runtime_class_name: qemu +``` + +Supported values: + +| Value | Behavior | +| --- | --- | +| omitted | Use the driver default, initially `libkrun`. | +| `libkrun` | Use the existing libkrun path. | +| `qemu` | Use QEMU with TAP and vsock. | + +GPU selection remains independent from backend selection. A GPU request +with `libkrun` is rejected with a clear error. A GPU request with +`qemu` uses the existing QEMU + VFIO GPU path. + +### Lifecycle Extension + +`VmLifecycleExtension` is an in-process VM-driver extension trait. +Extensions are compiled into the VM driver and registered by name. + +An extension can: + +- allocate per-sandbox state before VM launch; +- add validated launcher arguments or environment variables; +- add typed VM attachments; +- react to launch success or launch failure; +- clean up on sandbox delete; +- reconcile its external state after driver restart. + +The VM driver remains the source of truth for sandbox lifecycle. An +extension owns only the external or attached state it allocates. + +### Launch Plan + +Before spawning the VM, the driver builds a `VmLaunchPlan`. +Configured extensions receive a restricted view of that plan and may add +attachment bodies through driver-owned APIs. + +The driver stamps the extension namespace on every attachment. An +extension cannot forge another extension's namespace in metrics, OCSF +events, persisted state, or launcher validation output. + +### Attachments + +`VmAttachment` is the driver-internal representation of launch-time +resources. Attachment bodies may include: + +- launcher arguments and environment variables; +- storage attachments; +- device attachments; +- future typed attachment variants. + +`LauncherArgs` exists as a bounded escape hatch. 
The renderer validates +allowed flag prefixes, maximum counts, maximum lengths, and +driver-stamped environment prefixes before the command is spawned. + +### Extension State + +Each extension may return a `PersistedExtensionState` from +`before_vm_launch`. + +State is stored per sandbox and per extension: + +```text +/sandboxes//extensions/.json +``` + +The driver passes that state back only to the extension that created it. +This supports multiple extensions without state collisions. + +### Reconcile + +Reconcile runs when the driver starts: + +1. `reconcile_before_restore`: extension-level health and global checks. +2. VM driver reloads persisted sandboxes. +3. `reconcile_after_restore`: extension checks restored live sandboxes + against external or attached state. + +Reconcile outcomes: + +| Outcome | Meaning | +| --- | --- | +| `Ok` | No drift. | +| `Advisory(report)` | Drift found; report only. | +| `Authoritative(report)` | Drift found; extension may repair or clean up. | +| `Failed(status, report)` | Reconcile could not complete. | + +`advisory` is the default. An extension can perform authoritative repair +only when both are true: + +- the extension declares support for authoritative reconcile; +- the operator sets `reconcile_mode = "authoritative"` for that + extension. + +Otherwise authoritative outcomes are demoted to advisory behavior. + +## Operator Configuration + +Extensions are configured in VM-driver TOML: + +```toml +[vm] +extensions = ["logging"] + +[vm.extension."logging"] +reconcile_mode = "advisory" + +[vm.extension."logging".timeouts] +before_vm_launch_ms = 15000 +``` + +Rules: + +- unknown extension names fail driver startup; +- extension order follows the config list; +- unknown config keys fail startup via typed deserialization; +- absent or empty `extensions` means no extension chain; +- no extension can enable itself at runtime. + +## Relationship to Resource Requests + +This API is not the public resource request model. 
+ +Public sandbox intent, such as "I need one GPU", "I need a DPU-backed +network function", or "I need a vDPA network device", should be +expressed through typed sandbox resource requirements or +driver-specific config namespaces. + +The VM driver then realizes that request after it has been selected. +For example: + +- a portable GPU request may compile into QEMU VFIO GPU arguments; +- a DPU-backed NIC request may compile into a DPU extension attachment; +- a vDPA request may compile into a future VM network/device extension; +- a plain QEMU backend request may use TAP because that is how the QEMU + backend currently connects networking. + +`runtime_class_name` chooses the VM backend. It should not be overloaded +to mean "use TAP", "use VFIO", or "use vDPA". + +## TAP, VFIO, and vDPA + +This RFC enables the VM driver to support TAP, VFIO, and vDPA, but it +does not define a final user-facing selector for all of them. + +Current interpretation: + +| User intent | Intended path | +| --- | --- | +| Use QEMU | `runtime_class_name = "qemu"` | +| Use default libkrun path | omit `runtime_class_name` or set `libkrun` | +| Use QEMU's normal TAP networking | selected implicitly by QEMU backend | +| Attach a GPU by VFIO | typed GPU resource request plus VM driver realization | +| Attach a DPU-backed VF/SF | typed device or generic resource request plus DPU extension | +| Attach vDPA | future typed resource request plus VM extension | + +The important boundary is: + +- user-facing requests describe *what* the sandbox needs; +- VM-driver extensions describe *how* this VM driver realizes that need. 
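
The backend-selection rows of the table above, together with the GPU rule from the Runtime Class section, can be sketched roughly as follows. This is a minimal illustration, not the driver's code: `VmBackend`, `BackendSelectError`, and `select_backend` are hypothetical names. Rejecting unknown values, and applying the GPU check to the resolved default when `runtime_class_name` is omitted, are both assumptions that the RFC text does not spell out.

```rust
/// VM backends named in this RFC.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmBackend {
    Libkrun,
    Qemu,
}

/// Hypothetical error type; the RFC only asks for "a clear error".
#[derive(Debug, PartialEq, Eq)]
pub enum BackendSelectError {
    /// Assumption: values other than `libkrun` and `qemu` are rejected.
    UnknownRuntimeClass(String),
    /// A GPU was requested while the resolved backend is libkrun.
    GpuRequiresQemu,
}

/// Resolve the backend from `runtime_class_name` and a GPU request flag.
pub fn select_backend(
    runtime_class_name: Option<&str>,
    gpu_requested: bool,
) -> Result<VmBackend, BackendSelectError> {
    let backend = match runtime_class_name {
        // Omitted falls back to the driver default, initially libkrun.
        None | Some("libkrun") => VmBackend::Libkrun,
        Some("qemu") => VmBackend::Qemu,
        Some(other) => {
            return Err(BackendSelectError::UnknownRuntimeClass(other.to_string()))
        }
    };
    // Assumption: the GPU rule is checked against the resolved backend,
    // so an omitted value combined with a GPU request is also rejected.
    if gpu_requested && backend == VmBackend::Libkrun {
        return Err(BackendSelectError::GpuRequiresQemu);
    }
    Ok(backend)
}
```

Note that network attachment choice does not appear in the signature at all, which mirrors the boundary drawn above: the runtime class picks a backend and nothing else.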
+ +## Lifecycle Flow + +Create path: + +```text +gateway selects VM driver + -> VM driver selects backend from runtime_class_name + -> driver builds VmLaunchPlan + -> extension.before_vm_launch in config order + -> driver validates rendered launcher args + -> driver spawns VM + -> extension.after_vm_launch_succeeded + or extension.after_vm_launch_failed in reverse order +``` + +Delete path: + +```text +gateway deletes sandbox + -> driver terminates VM if needed + -> extension.after_sandbox_deleted in reverse order + -> driver removes sandbox state +``` + +Restart path: + +```text +driver starts + -> extension.reconcile_before_restore + -> driver restores live sandbox records + -> extension.reconcile_after_restore + -> conditions and metrics report drift or failure +``` + +## Observability + +The driver emits extension metrics and conditions for: + +- hook duration and outcome; +- rollback count; +- reconcile outcome; +- dropped condition or OCSF events; +- launcher argument validation failures; +- unhealthy extensions. + +OCSF events emitted by extensions are stamped by the driver with: + +- `extension_layer = "vm-driver"` +- `extension_name = ` + +## Compatibility + +- Empty extension chain preserves existing behavior. +- Omitted `runtime_class_name` preserves existing backend selection. +- QEMU + GPU continues to work through the existing path. +- QEMU without GPU becomes valid. +- `bound_threshold_ms` defaults to `0`, preserving today's + ready-after-spawn behavior unless operators opt in to a liveness + threshold. + +## Open Questions + +- Should TAP, VFIO, and vDPA get a shared typed network attachment + request shape, or remain separate resource classes? +- Should per-sandbox backend or bound-threshold overrides be added? +- Should runtime extension reload be supported later? 
\ No newline at end of file From 3866097d4ebeb6e2401060dfd5fac82aa24f4860 Mon Sep 17 00:00:00 2001 From: Patrick Riel Date: Wed, 20 May 2026 18:30:51 +0000 Subject: [PATCH 2/3] update rfc Signed-off-by: Patrick Riel --- .../README.md | 552 ++++++++++++------ 1 file changed, 368 insertions(+), 184 deletions(-) diff --git a/rfc/0004-vm-driver-lifecycle-extension-api/README.md b/rfc/0004-vm-driver-lifecycle-extension-api/README.md index 3a3b28e32..cb2d5e67c 100644 --- a/rfc/0004-vm-driver-lifecycle-extension-api/README.md +++ b/rfc/0004-vm-driver-lifecycle-extension-api/README.md @@ -8,274 +8,458 @@ state: review ## Summary -This RFC makes `openshell-driver-vm` extensible without splitting the VM -driver into multiple driver crates. +This RFC adds an in-tree DPU extension for `openshell-driver-vm`. +The extension is vendor-neutral at the host-driver layer, with +BlueField as the first concrete coordinator backend. -It introduces two core changes: +The extension lets the VM driver attach a DPU-backed VF/SF to a sandbox +VM, pass the device through to the guest, and delegate L2/L3/L4 network +policy enforcement to a DPU-side coordinator. When the coordinator +supports storage provisioning, the same extension boundary can also +provide a DPU-provisioned rootfs block device to the VM driver. -1. `template.runtime_class_name` selects the VM backend: - `libkrun`, `qemu`, or omitted for the default. -2. `VmLifecycleExtension` lets in-tree extensions participate in VM - launch, launch result handling, delete, and restart reconcile. - -The default behavior is unchanged: if no backend is requested and no -extension is configured, the VM driver behaves as it does today. +Default OpenShell builds and deployments without DPU hardware are +unchanged. The extension runs only when `dpu` is listed in VM-driver +extension config. ## Goals -- Let users request QEMU without also requesting a GPU. -- Keep `libkrun` as the default VM backend. 
-- Give in-tree VM extensions a supported way to add launch-time - resources such as VFIO devices, DPU-backed NICs, vDPA devices, vTPMs, - encrypted volumes, or audit hooks. -- Keep extension state isolated by extension and sandbox. -- Make restart reconcile explicit, observable, and advisory by default. -- Keep user-facing resource requests separate from VM-driver internal - attachment details. +- Provide an in-tree DPU consumer of the VM-driver lifecycle extension + API. +- Keep host-side DPU integration vendor-neutral. +- Use BlueField as the first concrete backend without requiring DOCA or + Comch in the OpenShell tree. +- Let DPU coordinators watch sandbox network policy directly from the + gateway. +- Scope DPU policy access by coordinator identity. +- Preserve existing in-guest L7 policy enforcement and add DPU-side + L2/L3/L4 enforcement for the VF/SF data path. +- Make policy delivery projection-based so future DPU-capable policy + domains can be added explicitly. +- Keep DPU operator settings separate from portable sandbox resource + requests. +- Use the VM driver's typed launch-plan attachments for VF/SF, vDPA, and + DPU-provisioned storage rather than raw QEMU argument injection. ## Non-Goals -- A new `openshell-driver-qemu` crate. -- Out-of-tree extension support. -- A public resource-claim API. -- Treating `runtime_class_name` as a network attachment selector. -- Replacing gateway-level compute hooks. -- Automatic VM restart within the same sandbox lifetime. +- Defining a separate VM-driver lifecycle extension API. +- Shipping DOCA SDK, Comch transport, or vendor-proprietary link + dependencies. +- Replacing the in-guest OPA/L7 proxy. +- A final public API for choosing DPU-only networking, vDPA, or + DPU-provisioned rootfs. +- Multi-DPU per-host scheduling. +- Moving a running sandbox between DPUs. +- Live migration or hot-plug of DPU attachments. +- A required SNAP implementation in the first upstream slice. 
## Core Components -### Runtime Class +### Host Extension -`template.runtime_class_name` selects the VM backend: +`openshell-dpu-extension` implements `VmLifecycleExtension` and +registers as `dpu` in the VM driver's extension registry. -```yaml -spec: - template: - runtime_class_name: qemu -``` +The extension: -Supported values: +- calls a `DpuCoordinator` before VM launch to allocate a VF/SF; +- updates the VM launch plan with typed network, device, and optional + storage attachments; +- persists attachment state under the sandbox's extension state; +- reports launch, detach, policy, and health events; +- detaches on sandbox delete; +- reconciles DPU state after driver restart. -| Value | Behavior | -| --- | --- | -| omitted | Use the driver default, initially `libkrun`. | -| `libkrun` | Use the existing libkrun path. | -| `qemu` | Use QEMU with TAP and vsock. | +### Coordinator Trait -GPU selection remains independent from backend selection. A GPU request -with `libkrun` is rejected with a clear error. A GPU request with -`qemu` uses the existing QEMU + VFIO GPU path. +`DpuCoordinator` abstracts the host-to-DPU attachment lifecycle. -### Lifecycle Extension +The trait covers: -`VmLifecycleExtension` is an in-process VM-driver extension trait. -Extensions are compiled into the VM driver and registered by name. +- `health`: firmware, SR-IOV mode, OVS offload, and required control + checks; +- `attach`: allocate a VF/SF or vDPA endpoint, install initial + enforcement, and return typed VM attachment details; +- `provision_rootfs` (optional capability): prepare or expose a + DPU-backed rootfs/block device and return a typed storage attachment; +- `detach`: idempotently release an attachment; +- `list`: report coordinator-known attachments; +- `reconcile`: compare host-restored state with coordinator state; +- `watch_attachment_events`: stream policy and health events back to + the host extension. 
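
One possible shape for a subset of the operations listed above is sketched here. Only the method names `health`, `attach`, `detach`, and `list` come from this RFC. The signatures, the `Attachment` and `CoordinatorError` types, and the `InMemoryCoordinator` implementation are assumptions made for illustration, and the real trait is presumably asynchronous.

```rust
use std::collections::HashMap;

/// Hypothetical attachment record; the field set is an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Attachment {
    pub attachment_id: String,
    pub sandbox_id: String,
}

/// Hypothetical error type for coordinator calls.
#[derive(Debug, PartialEq, Eq)]
pub enum CoordinatorError {
    Unhealthy(String),
}

/// Subset of the coordinator operations, with assumed synchronous signatures.
pub trait DpuCoordinator {
    fn health(&self) -> Result<(), CoordinatorError>;
    fn attach(&mut self, sandbox_id: &str) -> Result<Attachment, CoordinatorError>;
    /// Idempotent: releasing an unknown or already released attachment succeeds.
    fn detach(&mut self, attachment_id: &str) -> Result<(), CoordinatorError>;
    fn list(&self) -> Vec<Attachment>;
}

/// In-memory implementation for illustration; it touches no hardware.
#[derive(Default)]
pub struct InMemoryCoordinator {
    next_id: u64,
    attachments: HashMap<String, Attachment>,
}

impl DpuCoordinator for InMemoryCoordinator {
    fn health(&self) -> Result<(), CoordinatorError> {
        Ok(())
    }

    fn attach(&mut self, sandbox_id: &str) -> Result<Attachment, CoordinatorError> {
        self.next_id += 1;
        let attachment = Attachment {
            attachment_id: format!("att-{}", self.next_id),
            sandbox_id: sandbox_id.to_string(),
        };
        self.attachments
            .insert(attachment.attachment_id.clone(), attachment.clone());
        Ok(attachment)
    }

    fn detach(&mut self, attachment_id: &str) -> Result<(), CoordinatorError> {
        // Removing a missing key is a no-op, which gives idempotent detach.
        self.attachments.remove(attachment_id);
        Ok(())
    }

    fn list(&self) -> Vec<Attachment> {
        let mut all: Vec<Attachment> = self.attachments.values().cloned().collect();
        all.sort_by(|a, b| a.attachment_id.cmp(&b.attachment_id));
        all
    }
}
```

Calling `detach` twice for the same attachment returns `Ok(())` both times in this sketch, which is the property the `detach` bullet asks for.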
-An extension can: +The host extension is written against this trait rather than against +BlueField-specific code. -- allocate per-sandbox state before VM launch; -- add validated launcher arguments or environment variables; -- add typed VM attachments; -- react to launch success or launch failure; -- clean up on sandbox delete; -- reconcile its external state after driver restart. +### Coordinator Backends -The VM driver remains the source of truth for sandbox lifecycle. An -extension owns only the external or attached state it allocates. +This RFC defines two in-tree coordinator backends: -### Launch Plan +| Backend | Purpose | Default | +| --- | --- | --- | +| `fake` | Unit and integration testing without DPU hardware. | on | +| `bluefield-grpc` | mTLS gRPC client for the BlueField coordinator daemon. | off | -Before spawning the VM, the driver builds a `VmLaunchPlan`. -Configured extensions receive a restricted view of that plan and may add -attachment bodies through driver-owned APIs. +Other vendors can add new coordinator backends without changing the +host extension contract. -The driver stamps the extension namespace on every attachment. An -extension cannot forge another extension's namespace in metrics, OCSF -events, persisted state, or launcher validation output. +### BlueField Coordinator -### Attachments +`openshell-bluefield-coordinator` is the DPU-side daemon. It runs on +the BlueField ARM cores and owns: -`VmAttachment` is the driver-internal representation of launch-time -resources. Attachment bodies may include: +- VF/SF allocation; +- vDPA endpoint allocation when supported; +- optional DPU-backed rootfs/block-device provisioning; +- representor and OVS programming; +- policy application and verification; +- durable on-DPU attachment registry; +- policy streaming from the gateway; +- event emission back to the host extension. 
-- launcher arguments and environment variables; -- storage attachments; -- device attachments; -- future typed attachment variants. +The host driver instructs the coordinator to attach and detach +sandboxes. The coordinator owns policy enforcement and policy updates. -`LauncherArgs` exists as a bounded escape hatch. The renderer validates -allowed flag prefixes, maximum counts, maximum lengths, and -driver-stamped environment prefixes before the command is spawned. +### Gateway Policy Stream -### Extension State +`WatchSandboxPolicies` is a gateway RPC for DPU coordinators. -Each extension may return a `PersistedExtensionState` from -`before_vm_launch`. +It streams authorized policy projections to coordinator identities. The +initial projection is `NetworkScope`, which is the only projection this +RFC requires. The stream uses: -State is stored per sandbox and per extension: +- `INITIAL` events for initial state; +- `DELTA` events for updates; +- `REMOVED` events for deletion; +- monotonic `seq` values; +- slow-consumer disconnects; +- per-coordinator authorization. -```text -/sandboxes//extensions/.json +There is no polling fallback. If the gateway does not support the +stream, the DPU extension fails startup cleanly. + +### Policy Projections + +The gateway must not send the full sandbox policy blob to the DPU. +Instead, it emits explicit, versioned projections. Each projection has +its own schema, capability gate, authorization check, and threat-model +treatment. + +The first projection is `NetworkScope`, covering L2/L3/L4 enforcement +for the VF/SF data path. Future projections may be added for domains a +DPU or SmartNIC can actually enforce, such as HTTP/L7 inspection, +credential handling, or selected sandbox identity metadata. 
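
As a rough illustration of the `INITIAL` / `DELTA` / `REMOVED` semantics described under Gateway Policy Stream, a coordinator-side consumer might look like the sketch below. The `PolicyEvent`, `PolicyTable`, and `ApplyError` types are hypothetical, and the projection payload is reduced to an opaque string. The sketch also assumes that `seq` is monotonic per stream, that a `DELTA` carries a full replacement scope, and that a non-increasing `seq` is rejected; the RFC text fixes none of these.

```rust
use std::collections::HashMap;

/// Event kinds named by the RFC for `WatchSandboxPolicies`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EventKind {
    Initial,
    Delta,
    Removed,
}

/// Hypothetical event shape; `network_scope` stands in for a `NetworkScope`.
#[derive(Debug, Clone)]
pub struct PolicyEvent {
    pub seq: u64,
    pub sandbox_id: String,
    pub kind: EventKind,
    pub network_scope: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ApplyError {
    /// The event's `seq` did not advance past the last applied one.
    StaleSeq { last: u64, got: u64 },
}

/// Last known scope per sandbox, plus the highest `seq` applied so far.
#[derive(Default)]
pub struct PolicyTable {
    last_seq: Option<u64>,
    scopes: HashMap<String, String>,
}

impl PolicyTable {
    pub fn apply(&mut self, ev: PolicyEvent) -> Result<(), ApplyError> {
        if let Some(last) = self.last_seq {
            if ev.seq <= last {
                return Err(ApplyError::StaleSeq { last, got: ev.seq });
            }
        }
        self.last_seq = Some(ev.seq);
        match ev.kind {
            // Assumption: DELTA replaces the stored scope wholesale.
            EventKind::Initial | EventKind::Delta => {
                self.scopes.insert(ev.sandbox_id, ev.network_scope);
            }
            EventKind::Removed => {
                self.scopes.remove(&ev.sandbox_id);
            }
        }
        Ok(())
    }

    pub fn scope_for(&self, sandbox_id: &str) -> Option<&str> {
        self.scopes.get(sandbox_id).map(String::as_str)
    }
}
```

A rejected event leaves the table unchanged, so the previously applied scope stays in force.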
+ +Projection rules: + +- projections are opt-in by coordinator capability; +- projections are scoped to delegated sandboxes only; +- each projection exposes the minimum fields needed for enforcement; +- fields that are not enforceable by the DPU stay out of the + projection; +- credentials and sensitive metadata require their own projection and + authorization, not implicit inclusion in `NetworkScope`. + +Example shape: + +```proto +message PolicyProjection { + string sandbox_id = 1; + string attachment_id = 2; + NetworkScope network = 10; + HttpScope http = 11; // optional future projection + CredentialScope credentials = 12; // optional future projection + MetadataScope metadata = 13; // optional future projection +} ``` -The driver passes that state back only to the extension that created it. -This supports multiple extensions without state collisions. +This keeps the policy stream generic without turning it into a broad +"read all policy" channel. -### Reconcile +### Policy Reader Role -Reconcile runs when the driver starts: +The gateway adds a `policy-reader` role for DPU coordinators. -1. `reconcile_before_restore`: extension-level health and global checks. -2. VM driver reloads persisted sandboxes. -3. `reconcile_after_restore`: extension checks restored live sandboxes - against external or attached state. +The role is scoped to the sandboxes delegated to that coordinator. A +coordinator identity cannot subscribe to the whole fleet's network +policy. The gateway validates sandbox registration against the +coordinator's attachment-derived allowlist. -Reconcile outcomes: +### NetworkScope -| Outcome | Meaning | -| --- | --- | -| `Ok` | No drift. | -| `Advisory(report)` | Drift found; report only. | -| `Authoritative(report)` | Drift found; extension may repair or clean up. | -| `Failed(status, report)` | Reconcile could not complete. | +`NetworkScope` is the initial policy projection sent to the DPU. 
It +contains only the L2/L3/L4 fields a network fabric can enforce. -`advisory` is the default. An extension can perform authoritative repair -only when both are true: +It does not include: -- the extension declares support for authoritative reconcile; -- the operator sets `reconcile_mode = "authoritative"` for that - extension. +- filesystem policy; +- process policy; +- L7 request-body or operation-level policy; +- provider credentials; +- unrelated sandbox metadata. -Otherwise authoritative outcomes are demoted to advisory behavior. +Those domains can be added only through separate projections with +separate coordinator capabilities and authorization rules. -## Operator Configuration +## Relationship to Resource Requests -Extensions are configured in VM-driver TOML: +The DPU extension is not a public resource request API. -```toml -[vm] -extensions = ["logging"] +Portable sandbox intent such as "attach a DPU-backed VF/SF" should be +expressed through typed device or generic resource requirements with a +stable class name, count, selectors, and namespaced parameters. -[vm.extension."logging"] -reconcile_mode = "advisory" +The VM driver realizes that request after it has been selected: -[vm.extension."logging".timeouts] -before_vm_launch_ms = 15000 +```text +resource requirement + -> gateway selects VM driver + -> VM driver builds launch plan + -> dpu extension allocates VF/SF, vDPA, or DPU rootfs as requested + -> dpu extension updates typed launch-plan attachments + -> VM driver validates and renders the final launch plan ``` -Rules: +Deployment-specific settings remain extension config, including: -- unknown extension names fail driver startup; -- extension order follows the config list; -- unknown config keys fail startup via typed deserialization; -- absent or empty `extensions` means no extension chain; -- no extension can enable itself at runtime. 
+- coordinator backend; +- coordinator endpoint; +- mTLS material; +- initial policy behavior; +- stale policy behavior; +- rate limits; +- `reconcile_mode`. -## Relationship to Resource Requests +Public request fields describe what the sandbox needs. DPU extension +config describes how this deployment provides it. + +## VM Launch Plan Integration -This API is not the public resource request model. +The DPU extension consumes the VM driver's typed launch plan. It does +not primarily contribute raw QEMU arguments. -Public sandbox intent, such as "I need one GPU", "I need a DPU-backed -network function", or "I need a vDPA network device", should be -expressed through typed sandbox resource requirements or -driver-specific config namespaces. +The extension may update: -The VM driver then realizes that request after it has been selected. -For example: +- `plan.network`: replace the default TAP attachment with + `VmNetworkAttachment::VfioPci` for a VF/SF, or + `VmNetworkAttachment::Vdpa` for a vDPA endpoint; +- `plan.devices`: add supporting `VmDeviceAttachment::VfioPci` devices + when the DPU integration needs a non-network PCI device passed through; +- `plan.rootfs`: replace the default host-file root block device with + `VmStorageAttachment::DpuProvisioned` when the coordinator exposes a + DPU-provisioned rootfs/block device. -- a portable GPU request may compile into QEMU VFIO GPU arguments; -- a DPU-backed NIC request may compile into a DPU extension attachment; -- a vDPA request may compile into a future VM network/device extension; -- a plain QEMU backend request may use TAP because that is how the QEMU - backend currently connects networking. +The default QEMU path still uses host-file rootfs plus TAP/vsock. A DPU +extension can deliberately omit TAP by replacing `plan.network` before +the driver validates and renders the launcher configuration. + +## Operator Configuration -`runtime_class_name` chooses the VM backend. 
It should not be overloaded -to mean "use TAP", "use VFIO", or "use vDPA". +The extension is enabled through VM-driver config: -## TAP, VFIO, and vDPA +```toml +[vm] +extensions = ["dpu"] -This RFC enables the VM driver to support TAP, VFIO, and vDPA, but it -does not define a final user-facing selector for all of them. +[vm.extension."dpu"] +coordinator = "bluefield-grpc" # or "fake" +coordinator_endpoint = "https://192.168.100.2:8443" +coordinator_ca_path = "/etc/openshell/dpu/coordinator-ca.pem" +client_cert_path = "/etc/openshell/dpu/host-client.pem" +client_key_path = "/etc/openshell/dpu/host-client.key" -Current interpretation: +initial_policy = "wait-initial" # "wait-initial" | "baseline" +initial_policy_timeout_ms = 5000 +network_class_defaults = "dpu-ovs-isolated" -| User intent | Intended path | -| --- | --- | -| Use QEMU | `runtime_class_name = "qemu"` | -| Use default libkrun path | omit `runtime_class_name` or set `libkrun` | -| Use QEMU's normal TAP networking | selected implicitly by QEMU backend | -| Attach a GPU by VFIO | typed GPU resource request plus VM driver realization | -| Attach a DPU-backed VF/SF | typed device or generic resource request plus DPU extension | -| Attach vDPA | future typed resource request plus VM extension | +stale_threshold_ms = 30000 +on_stale = "keep-last-known" # "keep-last-known" | "deny-all" -The important boundary is: +reconcile_mode = "advisory" # or "authoritative" +max_attach_qps = 50 +max_attachments = 1024 +``` -- user-facing requests describe *what* the sandbox needs; -- VM-driver extensions describe *how* this VM driver realizes that need. +Unknown keys fail startup through typed config deserialization. 
-## Lifecycle Flow +## Attachment Lifecycle Create path: ```text -gateway selects VM driver - -> VM driver selects backend from runtime_class_name - -> driver builds VmLaunchPlan - -> extension.before_vm_launch in config order - -> driver validates rendered launcher args - -> driver spawns VM - -> extension.after_vm_launch_succeeded - or extension.after_vm_launch_failed in reverse order +VM driver selected + -> dpu.before_vm_launch + -> coordinator.attach + -> coordinator installs initial enforcement + -> extension returns PersistedExtensionState + -> extension updates plan.network / plan.devices / plan.rootfs + -> VM driver validates and renders typed attachments + -> VM driver launches sandbox + -> extension reports bound or detaches on launch failure ``` Delete path: ```text gateway deletes sandbox - -> driver terminates VM if needed - -> extension.after_sandbox_deleted in reverse order - -> driver removes sandbox state + -> VM driver terminates VM if needed + -> dpu.after_sandbox_deleted + -> coordinator.detach + -> extension state removed ``` -Restart path: +The attachment lifetime equals the sandbox lifetime. VM process exit +does not immediately detach the DPU resource; delete does. -```text -driver starts - -> extension.reconcile_before_restore - -> driver restores live sandbox records - -> extension.reconcile_after_restore - -> conditions and metrics report drift or failure -``` +### DPU-Provisioned Rootfs -## Observability +If the coordinator advertises rootfs provisioning, `attach` or +`provision_rootfs` may return a DPU-backed block device. The host +extension represents that as `VmStorageAttachment::DpuProvisioned`. + +The RFC intentionally keeps the storage backend behind the coordinator +trait. A BlueField implementation may use SNAP, NVMe emulation, a block +device exposed to the host, or another mechanism, but the VM driver only +sees the typed storage attachment and renders it into the QEMU storage +configuration. 
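
The plan updates described above could be sketched as follows. The enum and variant names `VmNetworkAttachment::VfioPci`, `VmNetworkAttachment::Vdpa`, and `VmStorageAttachment::DpuProvisioned` come from this RFC. The `Tap` and `HostFile` variants, every payload field, the `AttachResult` type, and `apply_dpu_attachment` are assumptions added for illustration, and `plan.devices` is left out for brevity.

```rust
/// Network attachment variants; `Tap` and the payload fields are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum VmNetworkAttachment {
    Tap,
    VfioPci { pci_address: String },
    Vdpa { device_path: String },
}

/// Storage attachment variants; `HostFile` and the payload fields are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum VmStorageAttachment {
    HostFile { path: String },
    DpuProvisioned { block_device: String },
}

/// Reduced launch plan holding only the two fields this sketch touches.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VmLaunchPlan {
    pub network: VmNetworkAttachment,
    pub rootfs: VmStorageAttachment,
}

/// Hypothetical summary of what a coordinator returns from `attach`.
pub struct AttachResult {
    pub vf_pci_address: String,
    pub rootfs_block_device: Option<String>,
}

/// Replace the default TAP network, and the rootfs only when one was provisioned.
pub fn apply_dpu_attachment(plan: &mut VmLaunchPlan, result: &AttachResult) {
    plan.network = VmNetworkAttachment::VfioPci {
        pci_address: result.vf_pci_address.clone(),
    };
    if let Some(dev) = &result.rootfs_block_device {
        plan.rootfs = VmStorageAttachment::DpuProvisioned {
            block_device: dev.clone(),
        };
    }
}
```

Because the extension writes typed values rather than launcher flags, the driver can still validate the whole plan before it renders any QEMU configuration.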
+ +## Initial Policy Bootstrap + +The extension must avoid exposing a VF/SF to a guest before enforcement +is installed. + +Two modes are supported: -The driver emits extension metrics and conditions for: +| Mode | Behavior | +| --- | --- | +| `wait-initial` | Register sandbox, wait for gateway `INITIAL`, apply it, then return from `attach`. Timeout fails the create. | +| `baseline` | Apply `network_class_defaults`, return from `attach`, then replace with gateway `INITIAL` when it arrives. | + +`wait-initial` is the default because it fails closed when the policy +stream is unavailable. + +## Reconcile + +On driver restart: + +1. `reconcile_before_restore` calls `DpuCoordinator::health`. +2. The VM driver reloads persisted sandbox state. +3. `reconcile_after_restore` compares restored host state with + coordinator state. + +Advisory is the default: + +- orphaned DPU attachments are reported as drift; +- no cleanup is performed; +- operators can inspect conditions and logs. + +Authoritative reconcile is opt-in: + +- the extension must support authoritative behavior; +- the operator must set `reconcile_mode = "authoritative"`; +- the coordinator may garbage-collect attachments it still owns but + the host driver no longer knows about. + +## Lease Fencing + +Every attachment has: + +- `attachment_id`; +- `lease_generation`; +- `sandbox_id`; +- `host_instance_id`. + +`lease_generation` is monotonic per attachment. The coordinator rejects +stale detach or reconcile operations whose generation is older than the +highest generation it has observed for that attachment. + +This prevents an old host process or stale restored state from +corrupting a newer attachment allocation. + +## Failure Behavior -- hook duration and outcome; -- rollback count; -- reconcile outcome; -- dropped condition or OCSF events; -- launcher argument validation failures; -- unhealthy extensions. +Policy and coordinator failures can be configured to fail open or fail +closed where appropriate. 
+ +Recommended defaults: + +| Failure | Default | +| --- | --- | +| Gateway policy stream down before attach | fail closed in `wait-initial` | +| Policy update apply failure | keep last known policy | +| Previous policy unusable or stale | emit `PolicyStale`; optional deny-all | +| Coordinator health missing required controls | refuse new attaches | +| Stale host detach operation | reject by lease generation | + +## Security Model + +The strongest property is available only when the DPU has an out-of-band +path to the gateway that the host cannot observe or terminate. + +With that topology: + +- VF/SF egress traverses the DPU representor and can be enforced even + if the guest is compromised. +- The host driver is not in the policy-read path. +- DPU policy-stream credentials are scoped to the coordinator's + delegated sandboxes and authorized projections. +- Sensitive policy domains such as L7, credentials, and metadata are + not exposed unless their projection is explicitly enabled. + +Important limitation: + +- The default QEMU path still has TAP + virtio-net. A DPU extension must + replace the network plan and omit TAP for deployments that require all + guest egress to traverse the DPU-controlled data path. If TAP remains + present, this RFC enforces only the VF/SF or vDPA data path and does + not claim all guest egress is DPU-enforced. + +## mTLS and Identity + +The BlueField coordinator authenticates to the gateway with a dedicated +coordinator credential. + +Requirements: + +- coordinator private key stays on the DPU; +- gateway maps coordinator identity to an allowlist of sandbox + attachments; +- cert renewal and revocation are supported; +- CA rotation uses a dual-trust window; +- host proxying of the stream is allowed only for explicitly configured + development scenarios. 
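
The requirement that the gateway maps a coordinator identity to an allowlist of sandbox attachments can be illustrated with a small lookup structure. This is a sketch under assumptions: `CoordinatorAllowlist` and its methods are hypothetical names, identities are plain strings rather than certificate-derived principals, and how the gateway actually derives the allowlist from attachments is not shown.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical gateway-side map from coordinator identity to sandbox IDs.
#[derive(Default)]
pub struct CoordinatorAllowlist {
    by_identity: HashMap<String, HashSet<String>>,
}

impl CoordinatorAllowlist {
    /// Record that a sandbox has been delegated to a coordinator.
    pub fn delegate(&mut self, coordinator_identity: &str, sandbox_id: &str) {
        self.by_identity
            .entry(coordinator_identity.to_string())
            .or_default()
            .insert(sandbox_id.to_string());
    }

    /// Drop a delegation, for example after the attachment is released.
    pub fn revoke(&mut self, coordinator_identity: &str, sandbox_id: &str) {
        if let Some(sandboxes) = self.by_identity.get_mut(coordinator_identity) {
            sandboxes.remove(sandbox_id);
        }
    }

    /// Unknown identities and undelegated sandboxes are both denied.
    pub fn may_watch(&self, coordinator_identity: &str, sandbox_id: &str) -> bool {
        self.by_identity
            .get(coordinator_identity)
            .map_or(false, |sandboxes| sandboxes.contains(sandbox_id))
    }
}
```

The lookup denies by default, so a coordinator credential with no delegated sandboxes cannot subscribe to any sandbox's policy.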
+ +## Observability -OCSF events emitted by extensions are stamped by the driver with: +The extension and coordinator emit: -- `extension_layer = "vm-driver"` -- `extension_name = ` +- driver-level conditions such as `DpuCoordinatorHealthy`; +- per-sandbox conditions such as `DpuAttachInProgress`, + `DpuAttachmentBound`, `DpuPolicyApplied`, `DpuPolicyDegraded`, + `DpuPolicyStale`, and `DpuFirmwareDegraded`; +- OCSF events tagged with the VM-driver extension name; +- coordinator metrics for policy apply latency, offload status, + event stream reconnects, attachment count, and stale policy count. ## Compatibility -- Empty extension chain preserves existing behavior. -- Omitted `runtime_class_name` preserves existing backend selection. -- QEMU + GPU continues to work through the existing path. -- QEMU without GPU becomes valid. -- `bound_threshold_ms` defaults to `0`, preserving today's - ready-after-spawn behavior unless operators opt in to a liveness - threshold. +- Default builds do not compile the BlueField gRPC backend. +- The fake backend keeps host-side tests available without hardware. +- The extension does not run unless configured. +- Existing VM sandboxes are unaffected. +- Existing in-guest L7 enforcement remains in place. +- No DOCA, Comch, or proprietary SDK dependency is introduced. ## Open Questions -- Should TAP, VFIO, and vDPA get a shared typed network attachment - request shape, or remain separate resource classes? -- Should per-sandbox backend or bound-threshold overrides be added? -- Should runtime extension reload be supported later? \ No newline at end of file +- Should the DPU resource class be a standard typed device class or a + generic resource extension? +- What public resource/profile should select DPU-only networking, vDPA, + or DPU-provisioned rootfs? +- Should `reconcile_mode` remain advisory by default for DPU, or should + some deployments opt into authoritative by default? 
+- Should the shared DPU proto remain vendor-neutral, or should vendors + own separate coordinator protos behind the same trait? +- Should topology verification fail closed automatically at coordinator + startup? From 86077d7001f15a55fbb1e8184ebc0ff1163eeddf Mon Sep 17 00:00:00 2001 From: Patrick Riel Date: Wed, 20 May 2026 18:36:49 +0000 Subject: [PATCH 3/3] update rfc Signed-off-by: Patrick Riel --- .../README.md | 583 +++++++----------- 1 file changed, 219 insertions(+), 364 deletions(-) diff --git a/rfc/0004-vm-driver-lifecycle-extension-api/README.md b/rfc/0004-vm-driver-lifecycle-extension-api/README.md index cb2d5e67c..f27fc7c7e 100644 --- a/rfc/0004-vm-driver-lifecycle-extension-api/README.md +++ b/rfc/0004-vm-driver-lifecycle-extension-api/README.md @@ -4,462 +4,317 @@ authors: state: review --- -# RFC 0004: VM-Driver Lifecycle Extension API +# VM-Driver Lifecycle Extension API ## Summary -This RFC adds an in-tree DPU extension for `openshell-driver-vm`. -The extension is vendor-neutral at the host-driver layer, with -BlueField as the first concrete coordinator backend. +This RFC makes `openshell-driver-vm` extensible without splitting the VM +driver into multiple driver crates. -The extension lets the VM driver attach a DPU-backed VF/SF to a sandbox -VM, pass the device through to the guest, and delegate L2/L3/L4 network -policy enforcement to a DPU-side coordinator. When the coordinator -supports storage provisioning, the same extension boundary can also -provide a DPU-provisioned rootfs block device to the VM driver. +It introduces three core changes: -Default OpenShell builds and deployments without DPU hardware are -unchanged. The extension runs only when `dpu` is listed in VM-driver -extension config. +1. `template.platform_config.runtime_class_name` selects the VM backend: + `libkrun`, `qemu`, or omitted for the default. +2. 
`VmLifecycleExtension` lets in-tree extensions participate in VM + launch, launch result handling, delete, and restart reconcile. +3. `VmLaunchPlan` exposes typed rootfs/storage, network, and device + attachments so extensions can provide hardware resources without raw + QEMU argument injection. + +The default behavior is unchanged: if no backend is requested and no +extension is configured, the VM driver behaves as it does today. ## Goals -- Provide an in-tree DPU consumer of the VM-driver lifecycle extension - API. -- Keep host-side DPU integration vendor-neutral. -- Use BlueField as the first concrete backend without requiring DOCA or - Comch in the OpenShell tree. -- Let DPU coordinators watch sandbox network policy directly from the - gateway. -- Scope DPU policy access by coordinator identity. -- Preserve existing in-guest L7 policy enforcement and add DPU-side - L2/L3/L4 enforcement for the VF/SF data path. -- Make policy delivery projection-based so future DPU-capable policy - domains can be added explicitly. -- Keep DPU operator settings separate from portable sandbox resource - requests. -- Use the VM driver's typed launch-plan attachments for VF/SF, vDPA, and - DPU-provisioned storage rather than raw QEMU argument injection. +- Let users request QEMU without also requesting a GPU. +- Keep `libkrun` as the default VM backend. +- Give in-tree VM extensions a supported way to add launch-time + resources such as VFIO devices, DPU-backed NICs, vDPA devices, vTPMs, + encrypted volumes, or audit hooks. +- Keep extension state isolated by extension and sandbox. +- Make restart reconcile explicit, observable, and advisory by default. +- Keep user-facing resource requests separate from VM-driver internal + attachment details. ## Non-Goals -- Defining a separate VM-driver lifecycle extension API. -- Shipping DOCA SDK, Comch transport, or vendor-proprietary link - dependencies. -- Replacing the in-guest OPA/L7 proxy. 
-- A final public API for choosing DPU-only networking, vDPA, or - DPU-provisioned rootfs. -- Multi-DPU per-host scheduling. -- Moving a running sandbox between DPUs. -- Live migration or hot-plug of DPU attachments. -- A required SNAP implementation in the first upstream slice. +- A new `openshell-driver-qemu` crate. +- Out-of-tree extension support. +- A public resource-claim API. +- Treating `runtime_class_name` as a network attachment selector. +- Replacing gateway-level compute hooks. +- Automatic VM restart within the same sandbox lifetime. ## Core Components -### Host Extension - -`openshell-dpu-extension` implements `VmLifecycleExtension` and -registers as `dpu` in the VM driver's extension registry. +### Runtime Class -The extension: +`template.platform_config.runtime_class_name` selects the VM backend: -- calls a `DpuCoordinator` before VM launch to allocate a VF/SF; -- updates the VM launch plan with typed network, device, and optional - storage attachments; -- persists attachment state under the sandbox's extension state; -- reports launch, detach, policy, and health events; -- detaches on sandbox delete; -- reconciles DPU state after driver restart. +```yaml +spec: + template: + platform_config: + runtime_class_name: qemu +``` -### Coordinator Trait +Supported values: -`DpuCoordinator` abstracts the host-to-DPU attachment lifecycle. +| Value | Behavior | +| --- | --- | +| omitted | Use the driver default, initially `libkrun`. | +| `libkrun` | Use the existing libkrun path. | +| `qemu` | Use QEMU. The default QEMU plan uses TAP and vsock. | -The trait covers: +GPU selection remains independent from backend selection. A GPU request +with `libkrun` is rejected with a clear error. A GPU request with +`qemu` uses the existing QEMU + VFIO GPU path. 
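The backend-selection rules in the table above might be expressed roughly as follows. This is a sketch under stated assumptions: `VmBackend`, `BackendError`, and `select_backend` are hypothetical names, rejecting unknown values is an assumption, and applying the GPU check to the resolved backend (including the omitted case) is also an assumption, since the RFC does not spell out how an omitted value combined with a GPU request resolves.

```rust
/// VM backends named in the Runtime Class table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmBackend {
    Libkrun,
    Qemu,
}

/// Hypothetical error type; the RFC only asks for "a clear error".
#[derive(Debug, PartialEq, Eq)]
enum BackendError {
    /// Assumption: values outside the table are rejected.
    UnknownRuntimeClass(String),
    /// A GPU was requested but the resolved backend is libkrun.
    GpuRequiresQemu,
}

/// Resolve the backend from an optional `runtime_class_name` and a GPU flag.
fn select_backend(
    runtime_class_name: Option<&str>,
    gpu_requested: bool,
) -> Result<VmBackend, BackendError> {
    let backend = match runtime_class_name {
        // Omitted falls back to the driver default, initially libkrun.
        None | Some("libkrun") => VmBackend::Libkrun,
        Some("qemu") => VmBackend::Qemu,
        Some(other) => return Err(BackendError::UnknownRuntimeClass(other.to_string())),
    };
    // GPU selection is independent of backend selection, but libkrun cannot serve it.
    if gpu_requested && backend == VmBackend::Libkrun {
        return Err(BackendError::GpuRequiresQemu);
    }
    Ok(backend)
}
```

Keeping the GPU flag as a separate input mirrors the goal that QEMU can be requested without a GPU.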
-- `health`: firmware, SR-IOV mode, OVS offload, and required control - checks; -- `attach`: allocate a VF/SF or vDPA endpoint, install initial - enforcement, and return typed VM attachment details; -- `provision_rootfs` (optional capability): prepare or expose a - DPU-backed rootfs/block device and return a typed storage attachment; -- `detach`: idempotently release an attachment; -- `list`: report coordinator-known attachments; -- `reconcile`: compare host-restored state with coordinator state; -- `watch_attachment_events`: stream policy and health events back to - the host extension. +### Lifecycle Extension -The host extension is written against this trait rather than against -BlueField-specific code. +`VmLifecycleExtension` is an in-process VM-driver extension trait. +Extensions are compiled into the VM driver and registered by name. -### Coordinator Backends +An extension can: -This RFC defines two in-tree coordinator backends: +- allocate per-sandbox state before VM launch; +- add validated environment variables or bounded extra launcher + arguments; +- add or replace typed rootfs, network, and device attachments; +- react to launch success or launch failure; +- clean up on sandbox delete; +- reconcile its external state after driver restart. -| Backend | Purpose | Default | -| --- | --- | --- | -| `fake` | Unit and integration testing without DPU hardware. | on | -| `bluefield-grpc` | mTLS gRPC client for the BlueField coordinator daemon. | off | +The VM driver remains the source of truth for sandbox lifecycle. An +extension owns only the external or attached state it allocates. -Other vendors can add new coordinator backends without changing the -host extension contract. +### Launch Plan -### BlueField Coordinator +Before spawning the VM, the driver builds a `VmLaunchPlan`. +Configured extensions receive that plan and may mutate validated, +driver-owned fields before the launcher command is rendered. 
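One way to picture the extension chain mutating the plan is the sketch below. Only the names `VmLifecycleExtension`, `VmLaunchPlan`, `VmNetworkAttachment`, `PersistedExtensionState`, and the hook names come from this RFC. The signatures, the reduced field sets, the `PassthroughNic` extension, and `run_before_vm_launch` are assumptions made for illustration, and rollback on failure is left out.

```rust
use std::collections::BTreeMap;

/// Reduced stand-in for the RFC's network attachment enum.
#[derive(Debug, Clone, PartialEq, Eq)]
enum VmNetworkAttachment {
    Tap { ifname: String },
    VfioPci { bdf: String },
}

/// Reduced stand-in for `VmLaunchPlan`; the real plan carries more fields.
#[derive(Debug, Default)]
struct VmLaunchPlan {
    network: Vec<VmNetworkAttachment>,
}

/// State an extension wants persisted; the `String` payload is an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PersistedExtensionState(String);

/// Assumed shape of the extension trait. Only the hook names come from the RFC.
trait VmLifecycleExtension {
    fn name(&self) -> &str;
    fn before_vm_launch(
        &self,
        plan: &mut VmLaunchPlan,
    ) -> Result<Option<PersistedExtensionState>, String>;
    fn after_vm_launch_succeeded(&self) {}
    fn after_vm_launch_failed(&self) {}
}

/// Hypothetical extension that swaps the default TAP device for a VFIO function.
struct PassthroughNic {
    bdf: String,
}

impl VmLifecycleExtension for PassthroughNic {
    fn name(&self) -> &str {
        "passthrough-nic"
    }

    fn before_vm_launch(
        &self,
        plan: &mut VmLaunchPlan,
    ) -> Result<Option<PersistedExtensionState>, String> {
        // Replace whatever network attachments the driver planned by default.
        plan.network = vec![VmNetworkAttachment::VfioPci { bdf: self.bdf.clone() }];
        Ok(Some(PersistedExtensionState(format!("bdf={}", self.bdf))))
    }
}

/// Run `before_vm_launch` in config order and key returned state by extension name.
fn run_before_vm_launch(
    extensions: &[Box<dyn VmLifecycleExtension>],
    plan: &mut VmLaunchPlan,
) -> Result<BTreeMap<String, PersistedExtensionState>, String> {
    let mut states = BTreeMap::new();
    for ext in extensions {
        // Stop at the first failing hook; rollback handling is out of scope here.
        if let Some(state) = ext.before_vm_launch(plan)? {
            states.insert(ext.name().to_string(), state);
        }
    }
    Ok(states)
}
```

Keying the returned state by extension name is what lets the driver hand each extension only its own state later.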
-`openshell-bluefield-coordinator` is the DPU-side daemon. It runs on -the BlueField ARM cores and owns: +The selected backend is fixed before extensions run. An extension may +not change `libkrun` to `qemu`, or `qemu` to `libkrun`, after the VM +driver has allocated backend-specific resources. -- VF/SF allocation; -- vDPA endpoint allocation when supported; -- optional DPU-backed rootfs/block-device provisioning; -- representor and OVS programming; -- policy application and verification; -- durable on-DPU attachment registry; -- policy streaming from the gateway; -- event emission back to the host extension. +### Attachments -The host driver instructs the coordinator to attach and detach -sandboxes. The coordinator owns policy enforcement and policy updates. +Launch attachments are typed driver-internal values: -### Gateway Policy Stream +```rust +enum VmStorageAttachment { + HostFile { path, read_only }, + HostBlockDevice { path, read_only }, + DpuProvisioned { id, device, read_only }, +} -`WatchSandboxPolicies` is a gateway RPC for DPU coordinators. +struct VmRootfsConfig { + root: VmStorageAttachment, + overlay: VmStorageAttachment, + image: Option, +} -It streams authorized policy projections to coordinator identities. The -initial projection is `NetworkScope`, which is the only projection this -RFC requires. The stream uses: +enum VmNetworkAttachment { + Tap { ifname, guest_ip, host_ip, mac, gateway_port }, + VfioPci { bdf, mac }, + Vdpa { device, mac }, +} -- `INITIAL` events for initial state; -- `DELTA` events for updates; -- `REMOVED` events for deletion; -- monotonic `seq` values; -- slow-consumer disconnects; -- per-coordinator authorization. +enum VmDeviceAttachment { + VfioPci { bdf, id }, + Vsock { cid }, +} +``` -There is no polling fallback. If the gateway does not support the -stream, the DPU extension fails startup cleanly. 
+The default VM driver builds the same plan as before: host-file rootfs, +host-file overlay, optional host-file prepared image disk, TAP +networking for QEMU, a vhost-vsock device, and optional GPU VFIO. -### Policy Projections +DPU or hardware-specific extensions can replace `network`, `devices`, or +`rootfs` with typed attachments. For example, a DPU extension can remove +the default TAP attachment and provide a VFIO PCI network function, a +vDPA device, or a DPU-provisioned root block device. -The gateway must not send the full sandbox policy blob to the DPU. -Instead, it emits explicit, versioned projections. Each projection has -its own schema, capability gate, authorization check, and threat-model -treatment. +`LauncherArgs` exists as a bounded escape hatch. The renderer validates +allowed flag prefixes, maximum counts, maximum lengths, and +reserved driver-owned flag prefixes before the command is spawned. It +is not the primary hardware-integration mechanism. -The first projection is `NetworkScope`, covering L2/L3/L4 enforcement -for the VF/SF data path. Future projections may be added for domains a -DPU or SmartNIC can actually enforce, such as HTTP/L7 inspection, -credential handling, or selected sandbox identity metadata. +### Extension State -Projection rules: +Each extension may return a `PersistedExtensionState` from +`before_vm_launch`. -- projections are opt-in by coordinator capability; -- projections are scoped to delegated sandboxes only; -- each projection exposes the minimum fields needed for enforcement; -- fields that are not enforceable by the DPU stay out of the - projection; -- credentials and sensitive metadata require their own projection and - authorization, not implicit inclusion in `NetworkScope`. 
+State is stored per sandbox and per extension: -Example shape: - -```proto -message PolicyProjection { - string sandbox_id = 1; - string attachment_id = 2; - NetworkScope network = 10; - HttpScope http = 11; // optional future projection - CredentialScope credentials = 12; // optional future projection - MetadataScope metadata = 13; // optional future projection -} +```text +/sandboxes//extensions/.json ``` -This keeps the policy stream generic without turning it into a broad -"read all policy" channel. +The driver passes that state back only to the extension that created it. +This supports multiple extensions without state collisions. -### Policy Reader Role +### Reconcile -The gateway adds a `policy-reader` role for DPU coordinators. +Reconcile runs when the driver starts: -The role is scoped to the sandboxes delegated to that coordinator. A -coordinator identity cannot subscribe to the whole fleet's network -policy. The gateway validates sandbox registration against the -coordinator's attachment-derived allowlist. +1. `reconcile_before_restore`: extension-level health and global checks. +2. VM driver reloads persisted sandboxes. +3. `reconcile_after_restore`: extension checks restored live sandboxes + against external or attached state. -### NetworkScope +Reconcile outcomes: -`NetworkScope` is the initial policy projection sent to the DPU. It -contains only the L2/L3/L4 fields a network fabric can enforce. +| Outcome | Meaning | +| --- | --- | +| `Ok` | No drift. | +| `Advisory(report)` | Drift found; report only. | +| `Authoritative(report)` | Drift found; extension may repair or clean up. | +| `Failed(status, report)` | Reconcile could not complete. | -It does not include: +`advisory` is the default. An extension can perform authoritative repair +only when both are true: -- filesystem policy; -- process policy; -- L7 request-body or operation-level policy; -- provider credentials; -- unrelated sandbox metadata. 
+- the extension declares support for authoritative reconcile; +- the operator sets `reconcile_mode = "authoritative"` for that + extension. -Those domains can be added only through separate projections with -separate coordinator capabilities and authorization rules. +Otherwise authoritative outcomes are demoted to advisory behavior. -## Relationship to Resource Requests +## Operator Configuration -The DPU extension is not a public resource request API. +Extensions are configured in VM-driver TOML: -Portable sandbox intent such as "attach a DPU-backed VF/SF" should be -expressed through typed device or generic resource requirements with a -stable class name, count, selectors, and namespaced parameters. +```toml +[vm] +extensions = ["logging"] -The VM driver realizes that request after it has been selected: +[vm.extension."logging"] +reconcile_mode = "advisory" -```text -resource requirement - -> gateway selects VM driver - -> VM driver builds launch plan - -> dpu extension allocates VF/SF, vDPA, or DPU rootfs as requested - -> dpu extension updates typed launch-plan attachments - -> VM driver validates and renders the final launch plan +[vm.extension."logging".timeouts] +before_vm_launch_ms = 15000 ``` -Deployment-specific settings remain extension config, including: - -- coordinator backend; -- coordinator endpoint; -- mTLS material; -- initial policy behavior; -- stale policy behavior; -- rate limits; -- `reconcile_mode`. +Rules: -Public request fields describe what the sandbox needs. DPU extension -config describes how this deployment provides it. +- unknown extension names fail driver startup; +- extension order follows the config list; +- unknown config keys fail startup via typed deserialization; +- absent or empty `extensions` means no extension chain; +- no extension can enable itself at runtime. -## VM Launch Plan Integration +## Relationship to Resource Requests -The DPU extension consumes the VM driver's typed launch plan. 
It does -not primarily contribute raw QEMU arguments. +This API is not the public resource request model. -The extension may update: +Public sandbox intent, such as "I need one GPU", "I need a DPU-backed +network function", or "I need a vDPA network device", should be +expressed through typed sandbox resource requirements or +driver-specific config namespaces. -- `plan.network`: replace the default TAP attachment with - `VmNetworkAttachment::VfioPci` for a VF/SF, or - `VmNetworkAttachment::Vdpa` for a vDPA endpoint; -- `plan.devices`: add supporting `VmDeviceAttachment::VfioPci` devices - when the DPU integration needs a non-network PCI device passed through; -- `plan.rootfs`: replace the default host-file root block device with - `VmStorageAttachment::DpuProvisioned` when the coordinator exposes a - DPU-provisioned rootfs/block device. +The VM driver then realizes that request after it has been selected. +For example: -The default QEMU path still uses host-file rootfs plus TAP/vsock. A DPU -extension can deliberately omit TAP by replacing `plan.network` before -the driver validates and renders the launcher configuration. +- a portable GPU request may compile into QEMU VFIO GPU arguments; +- a DPU-backed NIC request may compile into a DPU extension-provided + `VmNetworkAttachment::VfioPci`; +- a vDPA request may compile into `VmNetworkAttachment::Vdpa`; +- a DPU-backed rootfs request may compile into + `VmStorageAttachment::DpuProvisioned`; +- a plain QEMU backend request may use TAP because that is how the QEMU + backend connects networking by default. -## Operator Configuration +`runtime_class_name` chooses the VM backend. It should not be overloaded +to mean "use TAP", "use VFIO", or "use vDPA". -The extension is enabled through VM-driver config: +## TAP, VFIO, and vDPA -```toml -[vm] -extensions = ["dpu"] +This RFC enables the VM driver to support TAP, VFIO, and vDPA, but it +does not define a final user-facing selector for all of them. 
-[vm.extension."dpu"] -coordinator = "bluefield-grpc" # or "fake" -coordinator_endpoint = "https://192.168.100.2:8443" -coordinator_ca_path = "/etc/openshell/dpu/coordinator-ca.pem" -client_cert_path = "/etc/openshell/dpu/host-client.pem" -client_key_path = "/etc/openshell/dpu/host-client.key" +Current interpretation: -initial_policy = "wait-initial" # "wait-initial" | "baseline" -initial_policy_timeout_ms = 5000 -network_class_defaults = "dpu-ovs-isolated" +| User intent | Intended path | +| --- | --- | +| Use QEMU | `template.platform_config.runtime_class_name = "qemu"` | +| Use default libkrun path | omit `template.platform_config.runtime_class_name` or set `libkrun` | +| Use QEMU's normal TAP networking | selected implicitly by QEMU backend | +| Attach a GPU by VFIO | typed GPU resource request plus VM driver realization | +| Attach a DPU-backed VF/SF | typed device or generic resource request plus DPU extension | +| Attach vDPA | typed resource request plus VM extension | +| Use DPU-provisioned rootfs | typed storage/resource request plus VM extension | -stale_threshold_ms = 30000 -on_stale = "keep-last-known" # "keep-last-known" | "deny-all" +The important boundary is: -reconcile_mode = "advisory" # or "authoritative" -max_attach_qps = 50 -max_attachments = 1024 -``` +- user-facing requests describe *what* the sandbox needs; +- VM-driver extensions describe *how* this VM driver realizes that need. -Unknown keys fail startup through typed config deserialization. 
- -## Attachment Lifecycle +## Lifecycle Flow Create path: ```text -VM driver selected - -> dpu.before_vm_launch - -> coordinator.attach - -> coordinator installs initial enforcement - -> extension returns PersistedExtensionState - -> extension updates plan.network / plan.devices / plan.rootfs - -> VM driver validates and renders typed attachments - -> VM driver launches sandbox - -> extension reports bound or detaches on launch failure +gateway selects VM driver + -> VM driver selects backend from template.platform_config.runtime_class_name + -> driver builds VmLaunchPlan + -> extension.before_vm_launch in config order + -> driver validates the final typed launch plan + -> driver renders launcher config and arguments + -> driver spawns VM + -> extension.after_vm_launch_succeeded + or extension.after_vm_launch_failed in config order ``` Delete path: ```text gateway deletes sandbox - -> VM driver terminates VM if needed - -> dpu.after_sandbox_deleted - -> coordinator.detach - -> extension state removed + -> driver terminates VM if needed + -> extension.after_sandbox_deleted in config order + -> driver removes sandbox state ``` -The attachment lifetime equals the sandbox lifetime. VM process exit -does not immediately detach the DPU resource; delete does. - -### DPU-Provisioned Rootfs - -If the coordinator advertises rootfs provisioning, `attach` or -`provision_rootfs` may return a DPU-backed block device. The host -extension represents that as `VmStorageAttachment::DpuProvisioned`. - -The RFC intentionally keeps the storage backend behind the coordinator -trait. A BlueField implementation may use SNAP, NVMe emulation, a block -device exposed to the host, or another mechanism, but the VM driver only -sees the typed storage attachment and renders it into the QEMU storage -configuration. - -## Initial Policy Bootstrap - -The extension must avoid exposing a VF/SF to a guest before enforcement -is installed. 
- -Two modes are supported: +Restart path: -| Mode | Behavior | -| --- | --- | -| `wait-initial` | Register sandbox, wait for gateway `INITIAL`, apply it, then return from `attach`. Timeout fails the create. | -| `baseline` | Apply `network_class_defaults`, return from `attach`, then replace with gateway `INITIAL` when it arrives. | - -`wait-initial` is the default because it fails closed when the policy -stream is unavailable. - -## Reconcile - -On driver restart: - -1. `reconcile_before_restore` calls `DpuCoordinator::health`. -2. The VM driver reloads persisted sandbox state. -3. `reconcile_after_restore` compares restored host state with - coordinator state. - -Advisory is the default: - -- orphaned DPU attachments are reported as drift; -- no cleanup is performed; -- operators can inspect conditions and logs. - -Authoritative reconcile is opt-in: - -- the extension must support authoritative behavior; -- the operator must set `reconcile_mode = "authoritative"`; -- the coordinator may garbage-collect attachments it still owns but - the host driver no longer knows about. - -## Lease Fencing - -Every attachment has: - -- `attachment_id`; -- `lease_generation`; -- `sandbox_id`; -- `host_instance_id`. - -`lease_generation` is monotonic per attachment. The coordinator rejects -stale detach or reconcile operations whose generation is older than the -highest generation it has observed for that attachment. - -This prevents an old host process or stale restored state from -corrupting a newer attachment allocation. - -## Failure Behavior - -Policy and coordinator failures can be configured to fail open or fail -closed where appropriate. 
- -Recommended defaults: - -| Failure | Default | -| --- | --- | -| Gateway policy stream down before attach | fail closed in `wait-initial` | -| Policy update apply failure | keep last known policy | -| Previous policy unusable or stale | emit `PolicyStale`; optional deny-all | -| Coordinator health missing required controls | refuse new attaches | -| Stale host detach operation | reject by lease generation | - -## Security Model - -The strongest property is available only when the DPU has an out-of-band -path to the gateway that the host cannot observe or terminate. - -With that topology: - -- VF/SF egress traverses the DPU representor and can be enforced even - if the guest is compromised. -- The host driver is not in the policy-read path. -- DPU policy-stream credentials are scoped to the coordinator's - delegated sandboxes and authorized projections. -- Sensitive policy domains such as L7, credentials, and metadata are - not exposed unless their projection is explicitly enabled. - -Important limitation: - -- The default QEMU path still has TAP + virtio-net. A DPU extension must - replace the network plan and omit TAP for deployments that require all - guest egress to traverse the DPU-controlled data path. If TAP remains - present, this RFC enforces only the VF/SF or vDPA data path and does - not claim all guest egress is DPU-enforced. - -## mTLS and Identity - -The BlueField coordinator authenticates to the gateway with a dedicated -coordinator credential. 
+```text +driver starts + -> extension.reconcile_before_restore + -> driver restores live sandbox records + -> extension.reconcile_after_restore + -> conditions and metrics report drift or failure +``` -Requirements: +## Observability -- coordinator private key stays on the DPU; -- gateway maps coordinator identity to an allowlist of sandbox - attachments; -- cert renewal and revocation are supported; -- CA rotation uses a dual-trust window; -- host proxying of the stream is allowed only for explicitly configured - development scenarios. +The driver emits extension metrics and conditions for: -## Observability +- hook duration and outcome; +- rollback count; +- reconcile outcome; +- dropped condition or OCSF events; +- launcher argument validation failures; +- unhealthy extensions. -The extension and coordinator emit: +OCSF events emitted by extensions are stamped by the driver with: -- driver-level conditions such as `DpuCoordinatorHealthy`; -- per-sandbox conditions such as `DpuAttachInProgress`, - `DpuAttachmentBound`, `DpuPolicyApplied`, `DpuPolicyDegraded`, - `DpuPolicyStale`, and `DpuFirmwareDegraded`; -- OCSF events tagged with the VM-driver extension name; -- coordinator metrics for policy apply latency, offload status, - event stream reconnects, attachment count, and stale policy count. +- `extension_layer = "vm-driver"` +- `extension_name = ` ## Compatibility -- Default builds do not compile the BlueField gRPC backend. -- The fake backend keeps host-side tests available without hardware. -- The extension does not run unless configured. -- Existing VM sandboxes are unaffected. -- Existing in-guest L7 enforcement remains in place. -- No DOCA, Comch, or proprietary SDK dependency is introduced. +- Empty extension chain preserves existing behavior. +- Omitted `runtime_class_name` preserves existing backend selection. +- QEMU + GPU continues to work through the existing path. +- QEMU without GPU becomes valid. 
+- `bound_threshold_ms` defaults to `0`, preserving today's + ready-after-spawn behavior unless operators opt in to a liveness + threshold. ## Open Questions -- Should the DPU resource class be a standard typed device class or a - generic resource extension? -- What public resource/profile should select DPU-only networking, vDPA, - or DPU-provisioned rootfs? -- Should `reconcile_mode` remain advisory by default for DPU, or should - some deployments opt into authoritative by default? -- Should the shared DPU proto remain vendor-neutral, or should vendors - own separate coordinator protos behind the same trait? -- Should topology verification fail closed automatically at coordinator - startup? +- Should TAP, VFIO, and vDPA get a shared typed network attachment + request shape, or remain separate resource classes? +- Should per-sandbox backend or bound-threshold overrides be added? +- Should runtime extension reload be supported later? \ No newline at end of file