feat: Kubernetes deployment support (Dockerfile, manifests, docs) #13
chaosreload wants to merge 9 commits into zerobootdev:main from
Conversation
P0 fixes:
- serve: change default bind from 127.0.0.1 to 0.0.0.0 to fix K8s health probes and Service routing; add --bind flag for explicit control
- entrypoint.sh: pass $ZEROBOOT_BIND (default 0.0.0.0) to serve command

P1 fixes:
- deployment.yaml: replace devices.kubevirt.io/kvm (requires kubevirt) with privileged: true + hostPath /dev/kvm (works on plain EKS)
- deployment.yaml: increase livenessProbe initialDelaySeconds from 60 to 120; template creation takes ~19s, 60s was too tight on slow EBS attach
- deployment.yaml: add /dev/kvm hostPath volume and mount

EKS self-managed node group (new file):
- deploy/eks/eks-self-managed-kvm.sh: end-to-end script to create a self-managed ASG + Launch Template with CpuOptions.NestedVirtualization=enabled; EKS managed node groups silently drop CpuOptions — self-managed bypasses this
- deploy/eks/eks-with-kvm-nodegroup.yaml: add warning about CpuOptions being dropped by managed node groups (documented as a gap vs AWS official docs)

Docs:
- docs/KUBERNETES.md: add EKS managed vs self-managed section with root cause analysis and the recommended self-managed approach
- docs/KUBERNETES.md: add server bind address configuration note
- docs/KUBERNETES.md: add ZEROBOOT_BIND env var reference

Validated on: EKS 1.31 / ap-southeast-1 / c8i.xlarge (nested virt)
Ref: chaosreload/zeroboot PR zerobootdev#13
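The P1 changes above can be sketched as a manifest fragment. This is a hedged reconstruction, not the exact manifest from the PR: the container name, image, and port 8080 are assumptions.

```yaml
# Sketch of the privileged + hostPath /dev/kvm approach described above.
# Image name and port are assumptions, not taken from this PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zeroboot
  namespace: zeroboot
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zeroboot
  template:
    metadata:
      labels:
        app: zeroboot
    spec:
      containers:
        - name: zeroboot
          image: zeroboot:latest        # assumed image name
          securityContext:
            privileged: true            # plain-EKS alternative to the kubevirt device plugin
          volumeMounts:
            - name: dev-kvm
              mountPath: /dev/kvm
          livenessProbe:
            httpGet:
              path: /v1/health
              port: 8080                # assumed port
            initialDelaySeconds: 120    # template creation ~19s; slow EBS attach needs headroom
      volumes:
        - name: dev-kvm
          hostPath:
            path: /dev/kvm
            type: CharDevice
```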
- Dockerfile: multi-stage build (Rust compiler + Ubuntu runtime); Firecracker bundled; vmlinux/rootfs mounted via PVC
- docker/entrypoint.sh: handles template creation on first boot, skips if snapshot already exists on PVC
- deploy/k8s/: namespace, PVC (gp3 20Gi), Deployment with KVM device plugin resource, podAntiAffinity, health probes, HPA, Service
- docs/KUBERNETES.md: EC2 instance family requirements, KVM device plugin setup, PVC storage guidance, autoscaling with custom metric (zeroboot_concurrent_forks), Karpenter NodePool example, ServiceMonitor config, configuration reference

Closes zerobootdev#9
c8i/m8i/r8i support nested virtualization on regular (non-metal) sizes via --cpu-options NestedVirtualization=enabled. Other families (c6i, m6i etc.) require .metal sizes for KVM access. Update instance table to make this distinction explicit.
Three files covering two scenarios:
- eks-with-kvm-nodegroup.yaml: cluster + KVM node group in one shot
- eks-cluster-only.yaml: cluster only (no node groups)
- eks-add-kvm-nodegroup.yaml: add KVM node group to an existing cluster

All configs use c8i.xlarge with cpuOptions.nestedVirtualization=enabled, the AmazonLinux2023 AMI, and the aws-ebs-csi-driver addon for PVC support.
src/main.rs and entrypoint.sh bind address changes belong in a dedicated fix PR. This PR should only contain K8s deployment configs and docs. The deployment.yaml already handles the 127.0.0.1 limitation via the hostPath /dev/kvm approach; users can add a socat sidecar if needed until the fix PR is merged.
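The interim socat sidecar mentioned above could be sketched as a container-spec fragment. Ports and the image are assumptions for illustration; the idea is that containers in a Pod share a network namespace, so the sidecar can listen on the Pod IP and forward to the loopback-bound server.

```yaml
# Hypothetical socat sidecar: exposes the 127.0.0.1-bound server to the Pod
# network until the bind-address fix PR lands. Ports are assumptions.
- name: socat-proxy
  image: alpine/socat:latest
  args:
    - "TCP-LISTEN:8081,fork,reuseaddr"   # listen on all interfaces of the Pod
    - "TCP:127.0.0.1:8080"               # forward to the loopback-bound zeroboot server
  ports:
    - containerPort: 8081
```

The Service would then target port 8081 instead of the server's own port.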
Force-pushed from a2898a1 to c41f21c
docs/KUBERNETES.md (outdated)
```
        │
 ┌──────┼──────┐
 │      │      │
Pod-1  Pod-2  Pod-3   ← one Pod per KVM-capable Node (podAntiAffinity)
```
Curious — if we need one Pod per KVM-capable Node, would a DaemonSet make sense here?
Good call — switched to DaemonSet as the default in the latest commit.
Added deploy/k8s/daemonset.yaml:
- `nodeSelector: kvm-capable: "true"` — only runs on KVM-capable nodes
- `hostPath: /var/lib/zeroboot` with `DirectoryOrCreate` — local storage is the right choice here since Firecracker snapshots are bound to the host CPU microarchitecture and KVM state; cross-node migration is not meaningful anyway
- `updateStrategy: RollingUpdate` with `maxUnavailable: 1`
Kept deployment.yaml as an annotated advanced alternative for HPA or manual replica control scenarios.
Updated KUBERNETES.md: DaemonSet + node-level autoscaling (Karpenter) is now the primary scaling model; Deployment + HPA documented as the fallback.
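The DaemonSet described above might look roughly like this. It is a sketch from the bullet points, not the file in the PR; the image name is an assumption.

```yaml
# Sketch of deploy/k8s/daemonset.yaml per the description above.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: zeroboot
  namespace: zeroboot
spec:
  selector:
    matchLabels:
      app: zeroboot
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: zeroboot
    spec:
      nodeSelector:
        kvm-capable: "true"          # only schedule onto KVM-capable nodes
      containers:
        - name: zeroboot
          image: zeroboot:latest     # assumed image name
          volumeMounts:
            - name: data
              mountPath: /var/lib/zeroboot
      volumes:
        - name: data
          hostPath:
            path: /var/lib/zeroboot
            type: DirectoryOrCreate  # snapshots are node-local by design
```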
- Add deploy/k8s/daemonset.yaml: one Pod per KVM-capable node via nodeSelector,
hostPath /var/lib/zeroboot (DirectoryOrCreate), RollingUpdate strategy
- Update deploy/k8s/deployment.yaml: add header comment clarifying it is an
advanced alternative for HPA/manual replica control scenarios
- Update docs/KUBERNETES.md:
- Architecture diagram updated to reflect DaemonSet semantics
- New "Deploying" section with Option A (DaemonSet) and Option B (Deployment)
- New "Persistent storage" section explaining hostPath rationale and trade-offs
(snapshot CPU-affinity, no cross-node migration, node drain behavior)
- Autoscaling section: DaemonSet node-level scaling as primary path,
HPA documented under Deployment advanced usage
- Limitations updated to reflect DaemonSet/hostPath model
Addresses reviewer question: DaemonSet is the correct primitive when the
scaling unit is a KVM-capable node, not a Pod replica.
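For the Deployment fallback mentioned above, a custom-metric HPA on `zeroboot_concurrent_forks` might be sketched as follows. The target value is an assumption, and the metric plumbing (e.g. a Prometheus Adapter exposing the metric) is not shown.

```yaml
# Sketch of a custom-metric HPA for the Deployment variant; the
# averageValue threshold is an assumption, not taken from this PR.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: zeroboot
  namespace: zeroboot
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: zeroboot
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: zeroboot_concurrent_forks
        target:
          type: AverageValue
          averageValue: "8"   # assumed threshold per Pod
```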
…i→c8i/m8i/r8i
- deployment.yaml: replicas: 2 → 1; RWO PVC cannot be mounted by multiple pods
simultaneously. Running 2 replicas would cause one pod to be stuck Pending.
- docs/KUBERNETES.md:
- Karpenter NodePool: remove c6i/c7i from instance-family list; these
families require .metal sizes for KVM — non-metal c6i/c7i nodes provisioned
by Karpenter would never pass the kvm-capable check
- Remove duplicate horizontal rule (lines 77-78)
- Fix two stale "set in deployment.yaml" references to include daemonset.yaml
Found by Claude Code review.
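The corrected Karpenter NodePool constraint could look like the fragment below — a sketch in which everything except the instance families is an assumption.

```yaml
# Sketch of a Karpenter NodePool restricted to families with non-metal
# nested virtualization; c6i/c7i omitted since KVM needs .metal there.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: zeroboot-kvm
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c8i", "m8i", "r8i"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default   # assumed EC2NodeClass name
```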
Hi @adammiribyan 👋 This PR adds Kubernetes deployment support for Zeroboot (closes #9). It has been validated on EKS 1.31 / ap-southeast-1 / c8i.xlarge. Following feedback from @minhoryang, I've updated the default deployment to DaemonSet (instead of Deployment), which better fits Zeroboot's node-bound nature. Would you be able to take a look and review when you have a chance? Happy to address any feedback. 🙏
Summary
Add first-class Kubernetes deployment support for Zeroboot, addressing all items in #9.
Changes
Dockerfile

Multi-stage build: Rust 1.86 compiler stage + Ubuntu 22.04 runtime. Firecracker binary bundled at build time.
- vmlinux and rootfs images are not baked into the image — they are mounted via host storage, keeping the image lean and allowing runtime upgrades without rebuilding.

docker/entrypoint.sh

- Verifies /dev/kvm access on startup (fast-fail with clear error message)

deploy/k8s/

Reference manifests ready to apply:
- namespace.yaml — dedicated zeroboot namespace
- daemonset.yaml — default deployment (see DaemonSet section below)
- deployment.yaml — alternative for HPA / manual replica control (single-replica, RWO PVC)
- pvc.yaml — 20 Gi gp3 PVC for deployment.yaml
- service.yaml — ClusterIP service (works with both DaemonSet and Deployment)
- hpa.yaml — custom-metric HPA on zeroboot_concurrent_forks (for deployment.yaml)

deploy/eks/

EKS-specific deployment configs:
- eks-cluster-only.yaml — create cluster without node groups (Step 1)
- eks-self-managed-kvm.sh — end-to-end script for a self-managed ASG with CpuOptions.NestedVirtualization=enabled; see EKS note below
- eks-with-kvm-nodegroup.yaml — eksctl managed node group config
- eks-add-kvm-nodegroup.yaml — add KVM node group to an existing cluster

docs/KUBERNETES.md

Comprehensive deployment guide covering:
- EKS managed vs self-managed node groups, including root-cause analysis of CpuOptions being silently dropped

DaemonSet as Default (updated after reviewer feedback)
Following @minhoryang's suggestion, the default deployment is now DaemonSet instead of Deployment.
Why DaemonSet:
- Zeroboot is bound to /dev/kvm and the host CPU microarchitecture — the natural unit of scale is a Node, not a Pod replica
- No podAntiAffinity + replicas: N workarounds

Storage: DaemonSet uses hostPath: /var/lib/zeroboot (DirectoryOrCreate). Firecracker snapshots are bound to the host CPU microarchitecture and KVM state — local storage is the correct choice. Cross-node snapshot migration is not supported and not needed for short-lived sandbox workloads.

The Deployment option is retained for cases requiring HPA or manual replica control, documented as an advanced alternative.
TL;DR: EKS managed node groups silently drop CpuOptions.NestedVirtualization — your nodes start without /dev/kvm even if the eksctl YAML looks correct.

Root cause: When you provide a Launch Template to a managed node group, EKS generates a new internal LT and merges only a subset of fields. CpuOptions is not in that subset — despite not being listed in the official blocked-fields list. This is a documentation gap.

Solution: Use deploy/eks/eks-self-managed-kvm.sh, which creates an ASG + Launch Template directly via the AWS CLI, bypassing EKS's internal LT generation entirely.

Architecture note
Kubernetes manages the lifecycle of the zeroboot server process — it does not schedule individual sandboxes. Each /v1/exec request is handled entirely within the Pod via a KVM fork (~0.8 ms). K8s's role is capacity management: health checks, rolling updates, and node-level scaling.

Testing
Validated on EKS 1.31 / ap-southeast-1 / c8i.xlarge (self-managed node group with nested virt):
- /var/lib/zeroboot created automatically on each node
- /dev/kvm present on nodes
- /v1/health → {"status":"ok","templates":{"python":{"ready":true,"numpy":true}}}
- CODE: print(1+1) → 2 (fork 2.2 ms, exec 118 ms)
- CODE: import numpy as np; print(np.array([1,2,3,4,5]).mean()) → 3.0 (exec 259 ms)
- cat /etc/os-release → Ubuntu 22.04 content (exec 28 ms)

Depends on: #14
Closes #9