feat: add GPU count support for Kubernetes sandboxes #1338

@ryana

Problem Statement

OpenShell can express generic GPU intent with openshell sandbox create --gpu, but users cannot request a specific GPU count through the public sandbox API.

For Kubernetes-backed gateways, generic GPU intent maps to a single nvidia.com/gpu resource limit. This blocks workloads that need multiple GPUs, for example:

openshell sandbox create --gpu-count 4 -- claude

Users can work around this only by injecting Kubernetes-specific resource settings through sandbox templates. That makes a common scheduling requirement driver-specific and bypasses OpenShell's typed sandbox spec layer.

Proposed Design

Add first-class GPU count support across the public sandbox spec, compute-driver spec, CLI, server mapping, and Kubernetes driver.

Public API:

  • Add gpu_count to SandboxSpec.
  • Treat 0 (the default) as unspecified, so the existing default behavior applies.
  • Use values >0 to request that many GPUs.
  • Preserve existing gpu: true behavior.

Compute driver API:

  • Add gpu_count to DriverSandboxSpec.
  • Copy SandboxSpec.gpu_count into DriverSandboxSpec.gpu_count in the server public-to-driver mapping.
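The two spec additions and the server mapping could be sketched roughly as follows. This is a minimal Python illustration, not OpenShell's implementation: the dataclasses stand in for the real spec messages (only the GPU fields are shown), and `to_driver_spec` is a hypothetical name for the public-to-driver mapping step.

```python
from dataclasses import dataclass


@dataclass
class SandboxSpec:
    """Stand-in for the public sandbox spec (GPU fields only)."""
    gpu: bool = False
    gpu_count: int = 0  # 0 means unspecified/default


@dataclass
class DriverSandboxSpec:
    """Stand-in for the compute-driver spec (GPU fields only)."""
    gpu: bool = False
    gpu_count: int = 0


def to_driver_spec(public: SandboxSpec) -> DriverSandboxSpec:
    """Hypothetical mapping helper: copy the GPU fields through unchanged."""
    return DriverSandboxSpec(gpu=public.gpu, gpu_count=public.gpu_count)
```

A client that never sets `gpu_count` produces a driver spec with `gpu_count == 0`, which is what keeps the existing `gpu: true` path intact.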

CLI:

  • Add openshell sandbox create --gpu-count COUNT.
  • Reject --gpu-count 0.
  • Treat --gpu-count N as GPU intent, equivalent to setting gpu: true.
  • Reject combining --gpu-count with --gpu-device, because count-based scheduling and device-specific selection are different allocation modes.
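The CLI rules above amount to a small validation step before the spec is built. The sketch below assumes the flags have already been parsed; `resolve_gpu_flags` and its parameter names are hypothetical, and the type of the `--gpu-device` value is assumed to be a string.

```python
from typing import Optional


def resolve_gpu_flags(gpu: bool = False,
                      gpu_count: Optional[int] = None,
                      gpu_device: Optional[str] = None) -> dict:
    """Turn parsed CLI flags into the GPU fields of the sandbox spec."""
    if gpu_count is not None:
        # The proposal rejects 0; negative values are rejected too
        # (an assumption, since the real field may be unsigned).
        if gpu_count <= 0:
            raise ValueError("--gpu-count must be greater than 0")
        # Count-based scheduling and device selection are separate modes.
        if gpu_device is not None:
            raise ValueError("--gpu-count cannot be combined with --gpu-device")
        # A count implies GPU intent, equivalent to gpu: true.
        return {"gpu": True, "gpu_count": gpu_count}
    # No count given: leave gpu_count at 0 so existing behavior applies.
    return {"gpu": gpu, "gpu_count": 0}
```

With these rules, `--gpu-count 4` yields `gpu: true, gpu_count: 4`, while plain `--gpu` still yields `gpu: true, gpu_count: 0`.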

Kubernetes driver:

  • If gpu_count > 0, set the sandbox container resource limit:
      resources:
        limits:
          nvidia.com/gpu: "<count>"
  • If gpu_count == 0 and gpu == true, preserve current behavior by requesting one GPU.
  • Preserve existing CPU, memory, custom resource, and typed-resource overlay behavior.
  • Require clusters to expose allocatable nvidia.com/gpu resources through the NVIDIA device plugin or equivalent.
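The limit selection described in this list can be sketched as a pure function. This is an illustration under assumptions rather than the driver's code: `render_limits` is a hypothetical name, `existing` stands for whatever limits the CPU, memory, and custom-resource handling has already produced, and letting the typed GPU value overwrite an existing nvidia.com/gpu entry is an assumed precedence that the proposal does not specify.

```python
from typing import Dict, Optional

GPU_RESOURCE = "nvidia.com/gpu"


def render_limits(gpu: bool, gpu_count: int,
                  existing: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return container resource limits with the GPU limit applied."""
    # Copy so other limits (CPU, memory, custom resources) pass through.
    limits = dict(existing or {})
    if gpu_count > 0:
        # Explicit count: request exactly that many GPUs.
        limits[GPU_RESOURCE] = str(gpu_count)
    elif gpu:
        # Generic GPU intent without a count: keep the one-GPU behavior.
        limits[GPU_RESOURCE] = "1"
    return limits
```

The quantity is rendered as a string to match the `limits["nvidia.com/gpu"] == "4"` form used in the acceptance criteria.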

Compatibility:

  • Existing clients omit gpu_count, so it defaults to 0.
  • Existing --gpu behavior remains unchanged.
  • Docker, Podman, and VM drivers can safely receive the new field and ignore it unless they later add explicit count support.

Acceptance criteria:

  • openshell sandbox create --gpu-count 4 -- claude sends SandboxSpec { gpu: true, gpu_count: 4 }.
  • --gpu-count 0 is rejected with a clear error.
  • --gpu-count cannot be combined with --gpu-device.
  • Server mapping copies public gpu_count into the driver spec.
  • Kubernetes pod rendering emits limits["nvidia.com/gpu"] == "4" for gpu_count: 4.
  • Existing --gpu still emits limits["nvidia.com/gpu"] == "1".
  • Docs explain --gpu-count, Kubernetes nvidia.com/gpu scheduling, and the --gpu-device conflict.

Alternatives Considered

  • Continue injecting nvidia.com/gpu through raw template resources.
    • This works only for users who know the Kubernetes resource model and bypasses OpenShell's typed sandbox API.
  • Overload --gpu with an optional value.
    • This is ambiguous and risks breaking existing boolean flag behavior.
  • Reuse --gpu-device for counts.
    • Device-specific selection and count-based scheduling are separate allocation modes, so combining them would make driver behavior unclear.

Agent Investigation

  • Inspected the existing proto contracts, CLI sandbox-create path, server compute mapping, and Kubernetes driver rendering path.
  • Found that OpenShell already has a public-to-driver sandbox spec mapping layer, so GPU count belongs in typed specs rather than template resource passthrough.
  • Found that the existing Kubernetes GPU behavior maps generic gpu: true to one nvidia.com/gpu limit.
  • Identified docs that need updates: sandbox management docs, Kubernetes setup prerequisites, Kubernetes driver README, and compute runtime architecture docs.

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
