Updates to slurm launcher#86
Open
raviguptaamd wants to merge 14 commits intoROCm:coketaste/refactor-disfrom
Open
Updates to slurm launcher#86raviguptaamd wants to merge 14 commits intoROCm:coketaste/refactor-disfrom
raviguptaamd wants to merge 14 commits intoROCm:coketaste/refactor-disfrom
Conversation
- Add docker_image field to built_images and built_models in --use-image mode - Merge model's distributed.launcher into deployment_config for proper detection - Make reservation optional (empty string is ignored) - Pass reservation to SlurmNodeSelector for health checks Co-authored-by: Cursor <cursoragent@cursor.com>
Add missing newline after 'fi' statement to prevent syntax error in generated sbatch script when using --build-on-compute option. Co-authored-by: Cursor <cursoragent@cursor.com>
Resolved conflict in slurm.py by keeping both: - _prepare_baremetal_script() for sglang-disagg/vllm-disagg baremetal execution - _normalize_nodelist() utility for SLURM nodelist normalization (from PR ROCm#71) Made-with: Cursor
Replace legacy launcher names with unified slurm_multi: - Remove sglang-disagg, vllm-disagg from VALID_LAUNCHERS - Add slurm_multi as the only baremetal multi-node launcher - Update deployment logic in slurm.py and kubernetes.py - Rename example config files from sglang-disagg-* to slurm-multi-* - Update docs/launchers.md references Breaking change: sglang-disagg and vllm-disagg are no longer valid launcher names. Use slurm_multi instead. Made-with: Cursor
…ures docs/cli-reference.md: - Add --use-image and --build-on-compute options to build command - Add examples for pre-built image and compute node build workflows - Document when to use each mode docs/usage.md: - Add "Pre-built Image Mode" section with examples - Add "Build on Compute Node" section explaining workflow docs/deployment.md: - Add "SLURM Allocation Detection" section - Add "Baremetal Execution (slurm_multi)" section - Document environment variables and behavior inside salloc docs/launchers.md: - Fix example file paths (slurm_multi -> slurm-multi) Made-with: Cursor
--use-image changes: - Make image name optional (auto-detect from model card's DOCKER_IMAGE_NAME) - Add mutual exclusivity with --registry and --build-on-compute - Handle multiple models with different images (use first, warn) --build-on-compute changes: - Require --registry option (build once, push, pull everywhere) - Build on 1 compute node only, push to registry - Merge SLURM config from model card + additional-context (CLI overrides) - Add registry credential validation before build - Add docker login step in build script - Smart registry image naming (namespace/repo vs namespace formats) - Parallel docker pull on all nodes in run phase - Clear error messages for missing partition, credentials Run phase changes: - Detect built_on_compute flag in manifest - Pull image in parallel on all nodes via srun before model execution Documentation updated for new workflows. Made-with: Cursor
- slurm_multi launcher now requires --registry or --use-image - Parallel docker pull on all nodes for any registry image (not just built_on_compute) - DOCKER_IMAGE_NAME added to manifest env_vars for all registry builds - Documentation updated for slurm_multi requirements Made-with: Cursor
Ensure consistent documentation across: - deployment.md - launchers.md - usage.md All now mention that slurm_multi requires --registry or --use-image. Made-with: Cursor
- Make gpu_vendor/guest_os optional when using --use-image (pre-built image) - Remove model name truncation in CLI results table - Merge model card slurm and distributed config into build_manifest.json - Add skip_monitoring flag for synchronous salloc execution - Add completion marker file for robust sbatch job completion detection Made-with: Cursor
On shared NFS filesystems, perf.csv written by compute nodes may not be immediately visible to the login node. Retry up to 30s to allow NFS propagation before reporting the file as missing. Made-with: Cursor
…cher column - Emit #SBATCH --nodelist in _prepare_baremetal_script() when nodelist is set via model card or --additional-context - Include nodelist in model card slurm merge key list so it propagates to the build manifest - Rename perf table column from "Launcher" to "Workload" for clarity Made-with: Cursor
…time) When --additional-context provides partial slurm overrides (e.g. only nodelist), ConfigLoader defaults were overwriting model card values like nodes and time. Fixed by capturing original user slurm keys before ConfigLoader runs, and deep-merging manifest slurm config with runtime context instead of replacing it wholesale. Made-with: Cursor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
PR Summary: SGLang Disaggregated Inference Support for Madengine v2
Overview
This PR adds support for running SGLang Disaggregated (P/D) inference workloads on SLURM clusters. The key architectural change is introducing a "baremetal launcher" path for disaggregated inference launchers that manage their own Docker containers via SLURM.
Files Modified
1.
src/madengine/cli/commands/build.py(+18 lines)Purpose: Add CLI options for flexible build workflows
--use-imageoption: Skip Docker build and use pre-built image--build-on-computeoption: Build Docker images on SLURM compute node2.
src/madengine/deployment/slurm.py(+344 lines)Purpose: Support baremetal execution for disagg launchers
self.reservationparsing from slurm configprepare()_prepare_baremetal_script()method that generates simple wrapper scripts_run_inside_existing_allocation()for salloc support_validate_allocation_nodes()for node count validationsglang-disaggorvllm-disagglaunchersjob.sh.j2template path3.
src/madengine/deployment/slurm_node_selector.py(+29/-18 lines)Purpose: Add reservation support to node health checks
reservationparameter to srun commands for health checks and cleanup4.
src/madengine/execution/container_runner.py(+206 lines)Purpose: Execute scripts directly on baremetal for disagg launchers
BAREMETAL_LAUNCHERSconstant_run_on_baremetal()method for direct script execution5.
src/madengine/orchestration/build_orchestrator.py(+283 lines)Purpose: Support pre-built images and compute-node builds
_execute_with_prebuilt_image()for--use-imageoption_execute_build_on_compute()for--build-on-computeoptionexecute()flow unchanged when options not usedImpact Analysis
Backward Compatibility
Testing
Tested with:
pyt_sglang_disagg_qwen3-32bmodel on 3-node SLURM clustersalloc(existing allocation) andsbatch(new job) modes--use-imageHow to test
Made with Cursor