
feat(gpu): introduce GPU request spec#1156

Open: elezar wants to merge 5 commits into main from feat/gpu-request-spec
@elezar elezar commented May 4, 2026

Summary

Replace legacy GPU-specific request fields in the public and driver protos with ResourceRequirements carrying a GPUSpec. This also adds GPU count requests while preserving the existing default GPU request shape used by --gpu and image-name auto-detection.
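As a rough illustration of the request shape described above, the sketch below mirrors it in plain Rust rather than generated proto code. Only `ResourceRequirements`, `GPUSpec`, `count`, and `device_ids` come from this PR; the `gpu` field name, the `String` device-ID type, the `u32` count type, and the `validate_requirements` helper are assumptions made for the example.

```rust
/// Hypothetical Rust mirror of the `GPUSpec` message.
/// The field types are assumptions, not the generated proto types.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct GpuSpec {
    /// Explicit device IDs; empty means no specific devices were requested.
    pub device_ids: Vec<String>,
    /// Optional GPU count; `None` means no count was requested.
    pub count: Option<u32>,
}

/// Hypothetical Rust mirror of `ResourceRequirements`.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ResourceRequirements {
    /// `Some` means a GPU was requested, even when the spec itself is empty.
    /// The field name `gpu` is an assumption.
    pub gpu: Option<GpuSpec>,
}

/// Reject a spec that sets both `count` and `device_ids`.
/// A request without a GPU spec, or with an empty spec, is accepted.
pub fn validate_requirements(req: &ResourceRequirements) -> Result<(), String> {
    match &req.gpu {
        Some(spec) if spec.count.is_some() && !spec.device_ids.is_empty() => {
            Err("count and device_ids are mutually exclusive".to_string())
        }
        _ => Ok(()),
    }
}
```

Under these assumptions, the default request produced by `--gpu` corresponds to `ResourceRequirements { gpu: Some(GpuSpec::default()) }`: a spec that is present but has no device IDs and no count.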

Related Issue

Related to #1444

Changes

  • Added ResourceRequirements with an embedded GPUSpec to public and driver protos.
  • Added optional GPUSpec.count, mutually exclusive with explicit device_ids.
  • Mapped CLI GPU flags into ResourceRequirements: --gpu-device and --gpu-count now imply a GPU request, --gpu-count must be greater than zero, and --gpu-count conflicts with --gpu-device.
  • Validated GPU count/device-ID mutual exclusion at the gateway and preserved GPU requirements in driver request translation.
  • Honored GPU count in Kubernetes by setting nvidia.com/gpu to the requested count.
  • Kept VM support limited to one GPU, and rejected count requests for Docker and Podman until those drivers support count semantics.
  • Preserved existing behavior for image-name GPU auto-detection and --gpu: both still produce a GPU request that is present but carries no device IDs and no count.
  • Updated unit tests, e2e fixtures, generated Python proto exports, and user-facing sandbox docs.
  • Included a small VM clippy cleanup for the local container-engine fallback required by the current main baseline.
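The flag and driver rules listed above can be sketched as follows. This is a simplified model, not the code in this PR: the `GpuRequest` and `Driver` enums, the function names, and the error strings are all hypothetical, and treating a VM count other than 1 as an error is an assumption based on "limited to one GPU". Only the flag names and the `nvidia.com/gpu` resource name come from the change list.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified view of a GPU request after CLI parsing.
#[derive(Debug, Clone, PartialEq)]
pub enum GpuRequest {
    /// `--gpu` or image-name auto-detection: present, no device IDs, no count.
    Default,
    /// One or more `--gpu-device` values.
    Devices(Vec<String>),
    /// `--gpu-count N` with N greater than zero.
    Count(u32),
}

/// Drivers named in the change list.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Driver {
    Kubernetes,
    Vm,
    Docker,
    Podman,
}

/// Map the three CLI flags to an optional GPU request.
/// `--gpu-device` and `--gpu-count` imply a request even without `--gpu`.
pub fn request_from_flags(
    gpu: bool,
    devices: &[String],
    count: Option<u32>,
) -> Result<Option<GpuRequest>, String> {
    match (devices.is_empty(), count) {
        (false, Some(_)) => Err("--gpu-count conflicts with --gpu-device".to_string()),
        (_, Some(0)) => Err("--gpu-count must be greater than zero".to_string()),
        (true, Some(n)) => Ok(Some(GpuRequest::Count(n))),
        (false, None) => Ok(Some(GpuRequest::Devices(devices.to_vec()))),
        (true, None) if gpu => Ok(Some(GpuRequest::Default)),
        (true, None) => Ok(None),
    }
}

/// Check whether a driver accepts a count request (assumed semantics).
pub fn check_count_support(driver: Driver, count: u32) -> Result<(), String> {
    match driver {
        Driver::Kubernetes => Ok(()),
        Driver::Vm if count == 1 => Ok(()),
        Driver::Vm => Err("the VM driver supports a single GPU".to_string()),
        Driver::Docker | Driver::Podman => {
            Err("GPU count requests are not supported by this driver".to_string())
        }
    }
}

/// Build the Kubernetes resource entry for a count request.
pub fn kubernetes_gpu_limits(count: u32) -> BTreeMap<String, String> {
    let mut limits = BTreeMap::new();
    limits.insert("nvidia.com/gpu".to_string(), count.to_string());
    limits
}
```

Modeling the request as an enum makes the count/device-ID conflict unrepresentable after parsing, which is one possible design; the protos in this PR instead carry both fields and validate the exclusion at the gateway.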

Testing

  • mise run pre-commit passes
  • cargo check -p openshell-core -p openshell-server -p openshell-driver-docker -p openshell-driver-kubernetes -p openshell-driver-podman -p openshell-driver-vm -p openshell-cli
  • cargo test -p openshell-cli gpu
  • cargo test -p openshell-core -p openshell-server -p openshell-driver-docker -p openshell-driver-kubernetes -p openshell-driver-podman -p openshell-driver-vm --lib gpu
  • Unit tests added/updated
  • E2E fixtures updated; dedicated GPU e2e coverage was already merged separately

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (not applicable; user-facing docs updated)

copy-pr-bot Bot commented May 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@elezar elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 8b95e09 to 3d17c4e Compare May 13, 2026 16:00
@elezar elezar force-pushed the feat/gpu-request-spec branch from 3d17c4e to 930c581 Compare May 13, 2026 16:44
@elezar elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 07b171d to dd18d21 Compare May 15, 2026 15:12
@elezar elezar marked this pull request as ready for review May 15, 2026 15:12
@elezar elezar requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 15, 2026 15:12
@elezar elezar force-pushed the feat/gpu-request-spec branch from dd18d21 to 9ff168a Compare May 15, 2026 15:14
elezar added 3 commits May 18, 2026 14:38
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/gpu-request-spec branch from 9ff168a to cfceac9 Compare May 18, 2026 13:28
@elezar elezar added the test:e2e-gpu Requires GPU end-to-end coverage label May 18, 2026
@github-actions

Label test:e2e-gpu applied for cfceac9. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/gpu-request-spec branch from cfceac9 to dc21ff0 Compare May 19, 2026 13:21
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/gpu-request-spec branch from dc21ff0 to ff9e99d Compare May 19, 2026 15:03
