Support n_gpus=0 models and native baremetal launcher #82
Open
raviguptaamd wants to merge 2 commits into ROCm:raviguptaamd/update_slurm_launcher
Conversation
- Add "native" to BAREMETAL_LAUNCHERS so scripts using the native launcher run directly on the host instead of inside Docker - Skip GPU argument generation when n_gpus=0 - Propagate user-provided GPU context keys (MAD_SYSTEM_NGPUS, etc.) from additional_context into docker_env_vars to short-circuit hardware detection on headless nodes - Fast-path in init_gpu_context: skip all GPU hardware probes when MAD_SYSTEM_NGPUS=0 (no rocminfo, nvidia-smi, renderD detection) Enables running non-GPU workloads like RDMA bandwidth tests from machines without GPU hardware.
- Set top-level `ctx["gpu_vendor"]` in the `n_gpus=0` fast path so `run_container()` doesn't `KeyError` before reaching the baremetal check
- Move baremetal launcher detection before the Docker options setup so the native launcher skips the `gpu_vendor`/Docker code entirely
- Propagate top-level `additional_context` string values as env vars to baremetal scripts (e.g. `KUBECONFIG_PATH`), sketched below
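A sketch of the env-var propagation for baremetal scripts; the function name and launcher plumbing are assumptions, while `KUBECONFIG_PATH` comes from the commit message:

```python
import os
import subprocess

# Hedged sketch: only top-level *string* values in additional_context are
# forwarded, so nested structures (e.g. per-GPU settings) are left alone.
def run_baremetal_script(script_path: str, additional_context: dict) -> None:
    env = dict(os.environ)
    for key, value in additional_context.items():
        if isinstance(value, str):
            env[key] = value  # e.g. KUBECONFIG_PATH
    subprocess.run(["bash", script_path], env=env, check=True)
```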
Summary
"native"toBAREMETAL_LAUNCHERSso scripts can run directly on the host without DockerMAD_SYSTEM_NGPUS=0— enables running network-only tests (RDMA bandwidth) from headless machines that manage GPU nodes via kubectladditional_contextintodocker_env_varsto short-circuit hardware probesChanges
Changes

- `src/madengine/execution/container_runner.py`
  - Add `"native"` to the `BAREMETAL_LAUNCHERS` list
  - Skip `get_gpu_arg()` when `requested_gpus == "0"`
- `src/madengine/core/context.py`
  - `init_gpu_context()`: skip `rocminfo`/`nvidia-smi` when `MAD_SYSTEM_NGPUS=0`
  - Propagate `MAD_GPU_VENDOR`, `MAD_SYSTEM_NGPUS`, `MAD_SYSTEM_GPU_ARCHITECTURE`, etc. from `additional_context` into `docker_env_vars`
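Put together, the `container_runner.py` side might look roughly like this; the other `BAREMETAL_LAUNCHERS` entries, the function names besides `get_gpu_arg()`, and the Docker flag values are assumptions:

```python
BAREMETAL_LAUNCHERS = ["slurm", "native"]  # "native" is the new entry; others assumed

def get_gpu_arg(requested_gpus: str) -> list:
    # Stand-in for the real helper; actual Docker flags are assumptions.
    return ["--device=/dev/kfd", "--device=/dev/dri"]

def build_docker_args(launcher: str, requested_gpus: str) -> list:
    if launcher in BAREMETAL_LAUNCHERS:
        # Baremetal detection now runs before any gpu_vendor/Docker setup,
        # so the native launcher never touches Docker options.
        return []
    args = []
    if requested_gpus != "0":
        args.extend(get_gpu_arg(requested_gpus))  # skipped when n_gpus=0
    return args
```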
Motivation

The RDMA bandwidth tests in MAD-private (ROCm/MAD-private#200) use `n_gpus: "0"` and `launcher: "native"` because the launch machine orchestrates K8s pods remotely; it doesn't need local GPUs or Docker.
Test plan

- Ran `madengine build` + `run` with `n_gpus=0` and the `native` launcher on a headless machine
- Confirmed the GPU probe fast path is taken (`MAD_SYSTEM_NGPUS=0`)