Updates to slurm launcher by raviguptaamd · Pull Request #86 · ROCm/madengine

raviguptaamd · 2026-03-12T20:19:59Z

Motivation

PR Summary: SGLang Disaggregated Inference Support for Madengine v2

Replaces PR #68 - now sourced from fork with all latest commits.

Overview

This PR adds support for running SGLang Disaggregated (P/D) inference workloads on SLURM clusters. The key architectural change is introducing a "baremetal launcher" path for disaggregated inference launchers that manage their own Docker containers via SLURM.

Files Modified

1. `src/madengine/cli/commands/build.py` (+18 lines)

Purpose: Add CLI options for flexible build workflows

Added --use-image option: Skip Docker build and use pre-built image
Added --build-on-compute option: Build Docker images on SLURM compute node
Both options are optional and default to disabled
Existing build workflows unchanged when options not specified

2. `src/madengine/deployment/slurm.py` (+344 lines)

Purpose: Support baremetal execution for disagg launchers

Added self.reservation parsing from slurm config
Added baremetal launcher detection in prepare()
Added _prepare_baremetal_script() method that generates simple wrapper scripts
Added _run_inside_existing_allocation() for salloc support
Added _validate_allocation_nodes() for node count validation
Detection is explicit: Only triggers for sglang-disagg or vllm-disagg launchers
All other launchers continue to use existing job.sh.j2 template path

3. `src/madengine/deployment/slurm_node_selector.py` (+29/-18 lines)

Purpose: Add reservation support to node health checks

Added reservation parameter to srun commands for health checks and cleanup

4. `src/madengine/execution/container_runner.py` (+206 lines)

Purpose: Execute scripts directly on baremetal for disagg launchers

Added BAREMETAL_LAUNCHERS constant
Added _run_on_baremetal() method for direct script execution
Added launcher detection before Docker container creation
All other launchers continue to Docker execution unchanged

5. `src/madengine/orchestration/build_orchestrator.py` (+283 lines)

Purpose: Support pre-built images and compute-node builds

Added _execute_with_prebuilt_image() for --use-image option
Added _execute_build_on_compute() for --build-on-compute option
Standard execute() flow unchanged when options not used

Impact Analysis

Component	Affected Launchers	Other Launchers
slurm.py	sglang-disagg, vllm-disagg	No change - uses existing template
container_runner.py	sglang-disagg, vllm-disagg	No change - uses Docker execution
build.py	All (optional flags)	No change - flags default to disabled
build_orchestrator.py	All (optional flags)	No change - standard build if flags not used

Backward Compatibility

Existing models: All existing models continue to work unchanged
Existing launchers: torchrun, deepspeed, megatron-lm, vllm, sglang (non-disagg) all unchanged
Existing build workflows: Default behavior preserved
Existing SLURM deployments: Standard job template path unchanged

Testing

Tested with:

pyt_sglang_disagg_qwen3-32b model on 3-node SLURM cluster
Both salloc (existing allocation) and sbatch (new job) modes
Pre-built image workflow with --use-image
Benchmarks completed successfully (GSM8K accuracy + throughput sweep)

How to test

# Clone from fork
git clone https://github.com/raviguptaamd/madengine.git
cd madengine
git checkout raviguptaamd/update_slurm_launcher
pip install -e .

Made with Cursor

- Add docker_image field to built_images and built_models in --use-image mode - Merge model's distributed.launcher into deployment_config for proper detection - Make reservation optional (empty string is ignored) - Pass reservation to SlurmNodeSelector for health checks Co-authored-by: Cursor <cursoragent@cursor.com>

Add missing newline after 'fi' statement to prevent syntax error in generated sbatch script when using --build-on-compute option. Co-authored-by: Cursor <cursoragent@cursor.com>

…launcher

Resolved conflict in slurm.py by keeping both: - _prepare_baremetal_script() for sglang-disagg/vllm-disagg baremetal execution - _normalize_nodelist() utility for SLURM nodelist normalization (from PR ROCm#71) Made-with: Cursor

Replace legacy launcher names with unified slurm_multi: - Remove sglang-disagg, vllm-disagg from VALID_LAUNCHERS - Add slurm_multi as the only baremetal multi-node launcher - Update deployment logic in slurm.py and kubernetes.py - Rename example config files from sglang-disagg-* to slurm-multi-* - Update docs/launchers.md references Breaking change: sglang-disagg and vllm-disagg are no longer valid launcher names. Use slurm_multi instead. Made-with: Cursor

…ures docs/cli-reference.md: - Add --use-image and --build-on-compute options to build command - Add examples for pre-built image and compute node build workflows - Document when to use each mode docs/usage.md: - Add "Pre-built Image Mode" section with examples - Add "Build on Compute Node" section explaining workflow docs/deployment.md: - Add "SLURM Allocation Detection" section - Add "Baremetal Execution (slurm_multi)" section - Document environment variables and behavior inside salloc docs/launchers.md: - Fix example file paths (slurm_multi -> slurm-multi) Made-with: Cursor

--use-image changes: - Make image name optional (auto-detect from model card's DOCKER_IMAGE_NAME) - Add mutual exclusivity with --registry and --build-on-compute - Handle multiple models with different images (use first, warn) --build-on-compute changes: - Require --registry option (build once, push, pull everywhere) - Build on 1 compute node only, push to registry - Merge SLURM config from model card + additional-context (CLI overrides) - Add registry credential validation before build - Add docker login step in build script - Smart registry image naming (namespace/repo vs namespace formats) - Parallel docker pull on all nodes in run phase - Clear error messages for missing partition, credentials Run phase changes: - Detect built_on_compute flag in manifest - Pull image in parallel on all nodes via srun before model execution Documentation updated for new workflows. Made-with: Cursor

- slurm_multi launcher now requires --registry or --use-image - Parallel docker pull on all nodes for any registry image (not just built_on_compute) - DOCKER_IMAGE_NAME added to manifest env_vars for all registry builds - Documentation updated for slurm_multi requirements Made-with: Cursor

Ensure consistent documentation across: - deployment.md - launchers.md - usage.md All now mention that slurm_multi requires --registry or --use-image. Made-with: Cursor

- Make gpu_vendor/guest_os optional when using --use-image (pre-built image) - Remove model name truncation in CLI results table - Merge model card slurm and distributed config into build_manifest.json - Add skip_monitoring flag for synchronous salloc execution - Add completion marker file for robust sbatch job completion detection Made-with: Cursor

On shared NFS filesystems, perf.csv written by compute nodes may not be immediately visible to the login node. Retry up to 30s to allow NFS propagation before reporting the file as missing. Made-with: Cursor

…cher column - Emit #SBATCH --nodelist in _prepare_baremetal_script() when nodelist is set via model card or --additional-context - Include nodelist in model card slurm merge key list so it propagates to the build manifest - Rename perf table column from "Launcher" to "Workload" for clarity Made-with: Cursor

…time) When --additional-context provides partial slurm overrides (e.g. only nodelist), ConfigLoader defaults were overwriting model card values like nodes and time. Fixed by capturing original user slurm keys before ConfigLoader runs, and deep-merging manifest slurm config with runtime context instead of replacing it wholesale. Made-with: Cursor

raviguptaamd and others added 14 commits February 5, 2026 07:33

Updates to slurm launcher

8c4b959

Fix syntax error in build-on-compute sbatch script generation

e7db0be

Add missing newline after 'fi' statement to prevent syntax error in generated sbatch script when using --build-on-compute option. Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'coketaste/refactor-dis' into raviguptaamd/update_slurm_…

db40029

…launcher

Add slurm_multi registry requirement to all docs

3da3403

Ensure consistent documentation across: - deployment.md - launchers.md - usage.md All now mention that slurm_multi requires --registry or --use-image. Made-with: Cursor

Add retry loop for perf.csv detection after SLURM job completion

d707397

On shared NFS filesystems, perf.csv written by compute nodes may not be immediately visible to the login node. Retry up to 30s to allow NFS propagation before reporting the file as missing. Made-with: Cursor

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updates to slurm launcher#86

Updates to slurm launcher#86
raviguptaamd wants to merge 14 commits intoROCm:coketaste/refactor-disfrom
raviguptaamd:raviguptaamd/update_slurm_launcher

raviguptaamd commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

raviguptaamd commented Mar 12, 2026

Motivation

Overview

Files Modified

1. src/madengine/cli/commands/build.py (+18 lines)

2. src/madengine/deployment/slurm.py (+344 lines)

3. src/madengine/deployment/slurm_node_selector.py (+29/-18 lines)

4. src/madengine/execution/container_runner.py (+206 lines)

5. src/madengine/orchestration/build_orchestrator.py (+283 lines)

Impact Analysis

Backward Compatibility

Testing

How to test

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. `src/madengine/cli/commands/build.py` (+18 lines)

2. `src/madengine/deployment/slurm.py` (+344 lines)

3. `src/madengine/deployment/slurm_node_selector.py` (+29/-18 lines)

4. `src/madengine/execution/container_runner.py` (+206 lines)

5. `src/madengine/orchestration/build_orchestrator.py` (+283 lines)