
Add GPU passthrough for Docker sandboxes#658

Open
shamanez wants to merge 1 commit into alibaba:master from shamanez:feature/sandbox-gpu-passthrough

Conversation

@shamanez

Summary

Add server-side sandbox GPU passthrough controls for Docker-backed sandboxes.

This change introduces runtime-config-driven GPU exposure for sandbox containers without changing the current SDK start request shape. It supports:

  • enabling GPU passthrough globally at the ROCK runtime layer
  • fixed GPU requests such as `all` or `device=1`
  • round-robin GPU assignment across host GPUs
  • optional env-var overrides for local deployments
  • unit coverage for the new runtime config and Docker argument generation logic
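The four new fields listed under "What Changed" below can be pictured roughly as follows. This is a minimal sketch only: the field names come from this PR, but the types, defaults, and dataclass form are assumptions, not the actual `RuntimeConfig` implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RuntimeConfig:
    # GPU passthrough stays off unless explicitly enabled in runtime config.
    enable_gpu_passthrough: bool = False
    # Fixed request forwarded to `docker run --gpus`, e.g. "all" or "device=1".
    gpu_device_request: Optional[str] = None
    # "fixed" uses gpu_device_request verbatim; "round_robin" rotates
    # assignments across the GPUs detected on the host.
    gpu_allocation_mode: str = "fixed"
    # How many GPUs each sandbox receives in round-robin mode.
    gpu_count_per_sandbox: int = 1
```

With these defaults, an operator opts in per deployment, e.g. `RuntimeConfig(enable_gpu_passthrough=True, gpu_device_request="all")`, and existing CPU-only configs are unaffected.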

Motivation

Today ROCK can launch CPU/memory-scoped sandbox containers, but there is no built-in way to expose host GPUs to those containers. In local and agentic training workflows this makes it difficult to run GPU-bound evaluation or tool execution inside the sandbox itself.

This PR ports a previously working local workaround into the current upstream codebase, so Docker-backed sandboxes can opt into GPU passthrough via ROCK Admin runtime configuration.

What Changed

  • Added GPU passthrough fields to RuntimeConfig
    • enable_gpu_passthrough
    • gpu_device_request
    • gpu_allocation_mode
    • gpu_count_per_sandbox
  • Added sample local config in rock-conf/rock-local.yml
  • Extended DockerDeployment to:
    • detect host GPU count via nvidia-smi
    • compute round-robin device assignments
    • pass `--gpus ...` to `docker run`
    • set NVIDIA_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES for deterministic assignment
    • skip GPU injection if docker_args already contains --gpus
  • Added unit tests for:
    • runtime config GPU fields
    • fixed GPU mode
    • pre-existing explicit docker_args
    • round-robin allocation
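The round-robin assignment and argument injection described above can be sketched as below. This is an illustrative sketch, not the code in `DockerDeployment`: the function names, the module-level counter, and the exact argument layout are assumptions, and real host GPU detection (via `nvidia-smi`) is replaced by a plain `host_gpu_count` parameter.

```python
from itertools import count

# Hypothetical module-level counter standing in for per-deployment state.
_sandbox_counter = count()

def round_robin_devices(host_gpu_count: int, per_sandbox: int) -> list[int]:
    # Hand each new sandbox the next block of device indices, wrapping
    # around the number of GPUs detected on the host.
    start = next(_sandbox_counter) * per_sandbox
    return [(start + i) % host_gpu_count for i in range(per_sandbox)]

def build_gpu_args(docker_args: list[str], devices: list[int]) -> list[str]:
    # Skip injection entirely if the caller already set an explicit --gpus.
    if any(a == "--gpus" or a.startswith("--gpus=") for a in docker_args):
        return []
    ids = ",".join(str(d) for d in devices)
    return [
        "--gpus", f"device={ids}",
        # Pin the env vars as well, so the in-container device numbering
        # is deterministic regardless of runtime defaults.
        "-e", f"NVIDIA_VISIBLE_DEVICES={ids}",
        "-e", f"CUDA_VISIBLE_DEVICES={ids}",
    ]
```

For example, on a 4-GPU host with `per_sandbox=2`, successive sandboxes would receive devices `[0, 1]`, then `[2, 3]`, then wrap back to `[0, 1]`, while a sandbox that already carries `--gpus` in its `docker_args` is left untouched.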

Scope / Non-Goals

This PR intentionally stays server-side and config-driven.

It does not yet add:

  • per-sandbox GPU fields to the SDK SandboxConfig
  • per-request GPU fields to SandboxStartRequest
  • scheduler-aware GPU reservation semantics in Ray/K8s

Those would be good follow-up work for first-class end-to-end GPU support.

Validation

Passed:

  • uv run ruff check rock/config.py rock/deployments/docker.py tests/unit/test_config.py tests/unit/rocklet/test_docker_deployment.py
  • uv run pytest tests/unit/test_config.py
  • uv run pytest tests/unit/admin/proto/test_sandbox_request.py
  • uv run pytest tests/unit/rocklet/test_docker_deployment.py::test_build_gpu_args_disabled_by_default
  • uv run pytest tests/unit/rocklet/test_docker_deployment.py::test_build_gpu_args_fixed_mode_from_runtime
  • uv run pytest tests/unit/rocklet/test_docker_deployment.py::test_build_gpu_args_skips_when_docker_args_already_set
  • uv run pytest tests/unit/rocklet/test_docker_deployment.py::test_build_gpu_args_round_robin

Not reliable in this environment:

  • tests/unit/rocklet/test_docker_deployment.py::test_docker_deployment
    • the live Docker-backed sandbox process exits immediately in the current machine environment, so this was not used as a gating signal for the GPU feature itself

@CLAassistant

CLAassistant commented Mar 23, 2026

CLA assistant check
All committers have signed the CLA.

@shamanez
Author

Related issue: #657

@shamanez shamanez force-pushed the feature/sandbox-gpu-passthrough branch from 385bbaa to 78b2ef8 on March 24, 2026 at 03:25