Skip to content

Updates to slurm launcher#86

Open
raviguptaamd wants to merge 14 commits intoROCm:coketaste/refactor-disfrom
raviguptaamd:raviguptaamd/update_slurm_launcher
Open

Updates to slurm launcher#86
raviguptaamd wants to merge 14 commits intoROCm:coketaste/refactor-disfrom
raviguptaamd:raviguptaamd/update_slurm_launcher

Conversation

@raviguptaamd
Copy link

Motivation

PR Summary: SGLang Disaggregated Inference Support for Madengine v2

Replaces PR #68 - now sourced from fork with all latest commits.

Overview

This PR adds support for running SGLang Disaggregated (P/D) inference workloads on SLURM clusters. The key architectural change is introducing a "baremetal launcher" path for disaggregated inference launchers that manage their own Docker containers via SLURM.

Files Modified

1. src/madengine/cli/commands/build.py (+18 lines)

Purpose: Add CLI options for flexible build workflows

  • Added --use-image option: Skip Docker build and use pre-built image
  • Added --build-on-compute option: Build Docker images on SLURM compute node
  • Both options are optional and default to disabled
  • Existing build workflows unchanged when options not specified

2. src/madengine/deployment/slurm.py (+344 lines)

Purpose: Support baremetal execution for disagg launchers

  • Added self.reservation parsing from slurm config
  • Added baremetal launcher detection in prepare()
  • Added _prepare_baremetal_script() method that generates simple wrapper scripts
  • Added _run_inside_existing_allocation() for salloc support
  • Added _validate_allocation_nodes() for node count validation
  • Detection is explicit: Only triggers for sglang-disagg or vllm-disagg launchers
  • All other launchers continue to use existing job.sh.j2 template path

3. src/madengine/deployment/slurm_node_selector.py (+29/-18 lines)

Purpose: Add reservation support to node health checks

  • Added reservation parameter to srun commands for health checks and cleanup

4. src/madengine/execution/container_runner.py (+206 lines)

Purpose: Execute scripts directly on baremetal for disagg launchers

  • Added BAREMETAL_LAUNCHERS constant
  • Added _run_on_baremetal() method for direct script execution
  • Added launcher detection before Docker container creation
  • All other launchers continue to Docker execution unchanged

5. src/madengine/orchestration/build_orchestrator.py (+283 lines)

Purpose: Support pre-built images and compute-node builds

  • Added _execute_with_prebuilt_image() for --use-image option
  • Added _execute_build_on_compute() for --build-on-compute option
  • Standard execute() flow unchanged when options not used

Impact Analysis

Component Affected Launchers Other Launchers
slurm.py sglang-disagg, vllm-disagg No change - uses existing template
container_runner.py sglang-disagg, vllm-disagg No change - uses Docker execution
build.py All (optional flags) No change - flags default to disabled
build_orchestrator.py All (optional flags) No change - standard build if flags not used

Backward Compatibility

  • Existing models: All existing models continue to work unchanged
  • Existing launchers: torchrun, deepspeed, megatron-lm, vllm, sglang (non-disagg) all unchanged
  • Existing build workflows: Default behavior preserved
  • Existing SLURM deployments: Standard job template path unchanged

Testing

Tested with:

  • pyt_sglang_disagg_qwen3-32b model on 3-node SLURM cluster
  • Both salloc (existing allocation) and sbatch (new job) modes
  • Pre-built image workflow with --use-image
  • Benchmarks completed successfully (GSM8K accuracy + throughput sweep)

How to test

# Clone from fork
git clone https://github.com/raviguptaamd/madengine.git
cd madengine
git checkout raviguptaamd/update_slurm_launcher
pip install -e .

Made with Cursor

raviguptaamd and others added 14 commits February 5, 2026 07:33
- Add docker_image field to built_images and built_models in --use-image mode
- Merge model's distributed.launcher into deployment_config for proper detection
- Make reservation optional (empty string is ignored)
- Pass reservation to SlurmNodeSelector for health checks

Co-authored-by: Cursor <cursoragent@cursor.com>
Add missing newline after 'fi' statement to prevent syntax error
in generated sbatch script when using --build-on-compute option.

Co-authored-by: Cursor <cursoragent@cursor.com>
Resolved conflict in slurm.py by keeping both:
- _prepare_baremetal_script() for sglang-disagg/vllm-disagg baremetal execution
- _normalize_nodelist() utility for SLURM nodelist normalization (from PR ROCm#71)

Made-with: Cursor
Replace legacy launcher names with unified slurm_multi:
- Remove sglang-disagg, vllm-disagg from VALID_LAUNCHERS
- Add slurm_multi as the only baremetal multi-node launcher
- Update deployment logic in slurm.py and kubernetes.py
- Rename example config files from sglang-disagg-* to slurm-multi-*
- Update docs/launchers.md references

Breaking change: sglang-disagg and vllm-disagg are no longer valid
launcher names. Use slurm_multi instead.

Made-with: Cursor
…ures

docs/cli-reference.md:
- Add --use-image and --build-on-compute options to build command
- Add examples for pre-built image and compute node build workflows
- Document when to use each mode

docs/usage.md:
- Add "Pre-built Image Mode" section with examples
- Add "Build on Compute Node" section explaining workflow

docs/deployment.md:
- Add "SLURM Allocation Detection" section
- Add "Baremetal Execution (slurm_multi)" section
- Document environment variables and behavior inside salloc

docs/launchers.md:
- Fix example file paths (slurm_multi -> slurm-multi)

Made-with: Cursor
--use-image changes:
- Make image name optional (auto-detect from model card's DOCKER_IMAGE_NAME)
- Add mutual exclusivity with --registry and --build-on-compute
- Handle multiple models with different images (use first, warn)

--build-on-compute changes:
- Require --registry option (build once, push, pull everywhere)
- Build on 1 compute node only, push to registry
- Merge SLURM config from model card + additional-context (CLI overrides)
- Add registry credential validation before build
- Add docker login step in build script
- Smart registry image naming (namespace/repo vs namespace formats)
- Parallel docker pull on all nodes in run phase
- Clear error messages for missing partition, credentials

Run phase changes:
- Detect built_on_compute flag in manifest
- Pull image in parallel on all nodes via srun before model execution

Documentation updated for new workflows.

Made-with: Cursor
- slurm_multi launcher now requires --registry or --use-image
- Parallel docker pull on all nodes for any registry image (not just built_on_compute)
- DOCKER_IMAGE_NAME added to manifest env_vars for all registry builds
- Documentation updated for slurm_multi requirements

Made-with: Cursor
Ensure consistent documentation across:
- deployment.md
- launchers.md
- usage.md

All now mention that slurm_multi requires --registry or --use-image.

Made-with: Cursor
- Make gpu_vendor/guest_os optional when using --use-image (pre-built image)
- Remove model name truncation in CLI results table
- Merge model card slurm and distributed config into build_manifest.json
- Add skip_monitoring flag for synchronous salloc execution
- Add completion marker file for robust sbatch job completion detection

Made-with: Cursor
On shared NFS filesystems, perf.csv written by compute nodes may not
be immediately visible to the login node. Retry up to 30s to allow
NFS propagation before reporting the file as missing.

Made-with: Cursor
…cher column

- Emit #SBATCH --nodelist in _prepare_baremetal_script() when nodelist
  is set via model card or --additional-context
- Include nodelist in model card slurm merge key list so it propagates
  to the build manifest
- Rename perf table column from "Launcher" to "Workload" for clarity

Made-with: Cursor
…time)

When --additional-context provides partial slurm overrides (e.g. only
nodelist), ConfigLoader defaults were overwriting model card values like
nodes and time. Fixed by capturing original user slurm keys before
ConfigLoader runs, and deep-merging manifest slurm config with runtime
context instead of replacing it wholesale.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants