Day 0 GB300 DeepSeek-V4-Pro FP4 vLLM disagg #1150
Conversation
Adds the same set of topologies (1k/1k: 1p1d-dep8-tep8, 1p1d-dep8-dep16,
3p1d-dep8-dep16; 8k/1k: same plus 7p1d-dep8-dep16) targeted at the
gb300-cr cluster (CoreWeave, 2x 18-node racks). Per-worker tuning is
identical to the gb200 sweep — only gpu_type, name, and the launch
script's filesystem / partition assumptions differ.
- Adds gb300-cr runner group (gb300-cr_0/1) and launch_gb300-cr.sh.
- Recipes mounted at /mnt/vast/models/deepseek-v4-pro/ and squash files
under /mnt/vast/squash/; SLURM partition is 'all'.
- Each job rack-pins via srtctl's auto '#SBATCH --segment={total_nodes}';
the 18-node 7p1d topology fits one rack exactly.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Runner names use the existing CoreWeave 'cw' suffix convention (matches b200-cw_*, h100-cw_*, etc.) — gb300-cr was wrong. Model weights are at /mnt/vast/models/dsv4/ (the directory the user already populated), not .../deepseek-v4-pro/ as I'd guessed.
| - "Same topologies, same per-worker tuning, same container (vllm/vllm-openai:deepseekv4-cu130). Recipes duplicated as disagg-gb300-*.yaml with gpu_type: gb300; 1k/1k and 8k/1k both included" | ||
| - "New runners group gb300-cr (gb300-cr_0/1) and launch_gb300-cr.sh: SLURM partition `all`, model staging at /mnt/vast/models/deepseek-v4-pro/, squash files at /mnt/vast/squash/. Each job rack-pins via srtctl's auto `#SBATCH --segment={total_nodes}` (max 18-node 7p1d topology fits one rack exactly)" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1150 |
🟡 The new dsv4-fp4-gb300-dynamo-vllm changelog entry at perf-changelog.yaml:1822 ends with pr-link: TBD rather than the actual PR URL. The immediately-preceding gb200 sibling entry (line 1814) and most recent merges follow the convention of using the real https://github.com/SemiAnalysisAI/InferenceX/pull/<num> link, so this should be updated to https://github.com/SemiAnalysisAI/InferenceX/pull/1150 before merge. Cosmetic / nit — does not affect runtime, but the placeholder will be permanently retained in the changelog once merged.
Extended reasoning...
What's wrong. The new entry added at the bottom of perf-changelog.yaml (lines 1820-1822) has pr-link: TBD as its final line. The convention in this file is to fill in the actual GitHub PR URL — the directly-preceding gb200 sibling entry (lines 1807-1814, the dsv4-fp4-gb200-dynamo-vllm port from PR #1129) ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1129, and most recently-merged entries (#1144, #1147, #1148) follow the same pattern.
Why it matters. Once this PR is merged as #1150, the placeholder TBD becomes permanent in the changelog history — no automated process rewrites the entry post-merge. So the changelog will permanently link to nothing for this entry, breaking the audit trail that lets readers click through from a config-key change back to its originating PR.
Step-by-step proof.
- Check the current state of HEAD for perf-changelog.yaml: `git show 154be8d -- perf-changelog.yaml` shows the diff hunk adding lines 1815-1822, ending with `pr-link: TBD`.
- Read perf-changelog.yaml lines 1820-1822 directly: the literal string `pr-link: TBD` appears, not a URL.
- Compare to the gb200 sibling 8 lines above (line 1814): `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1129` — proper URL.
- The PR description explicitly calls out `perf-changelog.yaml: additions-only entry` and the PR number is 1150, so the intended value is unambiguous.
Note on the PR diff display. The bug-tracker's rendered diff in the review pane shows this line as pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1150, which differs from what is actually committed at HEAD. The git working tree is the source of truth — git show 154be8d -- perf-changelog.yaml confirms the literal TBD.
Fix. Replace pr-link: TBD on line 1822 with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1150 before merging, matching the convention of the gb200 sibling above.
Severity rationale (nit). This is documentation-only and has no runtime impact. Some prior entries in the file already have pull/XXX, pull/XXXX, or pull/TBD placeholder leakage from previously-merged PRs (e.g., lines 16, 46, 53, 349, 824, 852, 889, 906, 1556-1676), so the convention is clearly not enforced. Worth fixing for parity with the gb200 sibling immediately above, but not a blocker.
# Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling, just
# pointed at the gb300 recipe variants. Cluster gb300-cr is 2x 18-node
# racks; each job is rack-pinned via srtctl's auto `#SBATCH --segment={N}`.
🟡 Comment at line 7666 reads "Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling" but it's inside the dsv4-fp4-gb300-dynamo-vllm config block (declared at line 7657) — so the config is referring to itself as having a gb200 sibling. Almost certainly a copy-paste leftover from the GB200→GB300 port; should reference dsv4-fp4-gb200-dynamo-vllm instead. Pure comment-only nit, no runtime effect.
Extended reasoning...
What's wrong
In .github/configs/nvidia-master.yaml the new config block dsv4-fp4-gb300-dynamo-vllm: is declared at line 7657. The header comment for that block, lines 7666-7668, currently reads:
    dsv4-fp4-gb300-dynamo-vllm:                                                  # line 7657
    ...
    # Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling, just # line 7666
    # pointed at the gb300 recipe variants. Cluster gb300-cr is 2x 18-node
    # racks; each job is rack-pinned via srtctl's auto `#SBATCH --segment={N}`.

A config can't be its own sibling. The author clearly intended to point at dsv4-fp4-gb200-dynamo-vllm — that is the existing GB200 config defined immediately above (ending at line 7655) and the actual upstream this PR ports from.
Step-by-step proof
- Line 7657: `dsv4-fp4-gb300-dynamo-vllm:` — this opens the config block.
- Lines 7658-7665: scalar fields (image, model, model-prefix, runner, precision, framework, multinode, disagg) — all still inside the block opened at 7657.
- Line 7666: comment under that same key, which begins "Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling…"
- The phrase "dsv4-fp4-gb300-dynamo-vllm's gb200 sibling" reads as "the gb200 sibling of dsv4-fp4-gb300-dynamo-vllm" — i.e. the current block's gb200 sibling, which is dsv4-fp4-gb200-dynamo-vllm (lines 7544-7655). Saying "X's gb200 sibling" while being X is a tautology with no referent.
- The PR description corroborates: "Same DSV4-Pro FP4 sweep we already run on gb200, ported to the gb300-cr cluster" — i.e. the sibling is gb200, not gb300.
Impact
None on runtime, parsing, generated artifacts, or sweep behavior — YAML comments are inert. This is purely a readability issue: a future reader following the comment will go looking for a non-existent reference.
Fix
Change dsv4-fp4-gb300-dynamo-vllm's gb200 sibling to dsv4-fp4-gb200-dynamo-vllm (or equivalent phrasing such as "as the gb200 sibling (dsv4-fp4-gb200-dynamo-vllm)"). One-token edit while the PR is still open.
- SLURM_ACCOUNT: benchmark -> cw-sup. The 'benchmark' account was inherited from launch_gb200-nv.sh but doesn't exist on the cw cluster; sacctmgr shows the user is associated with cw-sup.
- Extend the gb300-cw runner group to include gb300-cw_2 and gb300-cw_3. All four cw runners carry the gb300-cw label, so list them all in the group so that matrix expansion can round-robin across the full pool.
srtctl's slurm template (job_script_minimal.j2) does `if ! command -v uv` and only installs its own (ARM64) uv when missing. The runner pod is x86 and /mnt/home is shared NFS with the aarch64 compute nodes; the default uv install location $HOME/.local/bin lands on that shared NFS path and shadows the template's install on the compute side, causing `Exec format error` from slurmd. Install via XDG_BIN_HOME to a runner-pod-local /tmp tmpfs path. Scrub any stale x86 uv from prior runs out of $HOME/.local/bin and fail loud if XDG_BIN_HOME isn't honored or the install leaks to NFS anyway.
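A minimal sketch of the workaround described above, assuming the uv installer honors XDG_BIN_HOME as the commit states; the /tmp path and the exact checks are illustrative, not the actual launch-script code:

```bash
# Sketch only: keep the x86 uv off the NFS-shared $HOME/.local/bin so the aarch64
# compute nodes never see it, and fail loud if the install leaks onto NFS anyway.
set -euo pipefail

# Scrub any stale x86 uv left behind on shared NFS by earlier runs.
rm -f "$HOME/.local/bin/uv" "$HOME/.local/bin/uvx"

# Install into a runner-pod-local tmpfs path instead of the NFS home.
export XDG_BIN_HOME=/tmp/uv-bin    # assumed location; any pod-local tmpfs path works
mkdir -p "$XDG_BIN_HOME"
curl -LsSf https://astral.sh/uv/install.sh | sh

# Fail loud if XDG_BIN_HOME wasn't honored or the binary still landed on NFS.
[[ -x "$XDG_BIN_HOME/uv" ]] || { echo "uv did not land in $XDG_BIN_HOME" >&2; exit 1; }
[[ ! -e "$HOME/.local/bin/uv" ]] || { echo "uv leaked onto NFS \$HOME/.local/bin" >&2; exit 1; }

export PATH="$XDG_BIN_HOME:$PATH"
```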
Previously relied on srtctl's auto '#SBATCH --segment={total_nodes}'
(controlled by use_segment_sbatch_directive=true, the schema default).
Real runs on gb300-cw showed the directive was missing from the
generated sbatch — workers landed on different racks.
Make the constraint explicit per recipe:
sbatch_directives:
  segment: "<total_nodes>"
and turn off the auto path in srtslurm.yaml so we don't emit two
overlapping #SBATCH --segment lines. Each gb300 recipe now declares
its own segment value matching its prefill_nodes + decode_nodes
total (4, 6, 10, or 18).
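For illustration, the two pieces this commit touches might look roughly like the fragments below; the key names (sbatch_directives, segment, use_segment_sbatch_directive) come from the commit message, while the file names and exact YAML nesting are assumptions:

```bash
# 1) Per-recipe explicit rack pin, e.g. the 4-node 1p1d-dep8-tep8 topology
#    (hypothetical file name; segment = prefill_nodes + decode_nodes):
cat <<'EOF' >> disagg-gb300-1p1d-dep8-tep8.yaml
sbatch_directives:
  segment: "4"
EOF

# 2) Disable srtctl's auto-emitted directive so the generated sbatch doesn't carry
#    two overlapping '#SBATCH --segment' lines:
cat <<'EOF' >> srtslurm.yaml
use_segment_sbatch_directive: false
EOF
```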
OOM during 'maturin build' of dynamo source on gb300-cw. Cargo defaults to nproc parallel rustc workers; on Grace ARM (~72 cores per node) the peak RAM during the link phase exceeded the SLURM cgroup limit, causing SIGKILL with 'task 0: Out Of Memory' before vLLM ever started. Capped at 4 in both prefill_environment and decode_environment of every gb300 recipe. Each rustc uses ~5-10GB during linking, so 4 parallel jobs keep peak well under any reasonable per-task cgroup limit. (gb200-nv runs the same install via the same srt-slurm path and works without this cap, so cw evidently has tighter per-task memory limits.)
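A sketch of the cap; CARGO_BUILD_JOBS=4 and the prefill_environment/decode_environment keys come from the commit message, and the exact recipe nesting is assumed:

```bash
# Recipe fragment (assumed nesting): cap cargo's parallel rustc workers in both workers.
cat <<'EOF'
prefill_environment:
  CARGO_BUILD_JOBS: "4"   # ~5-10 GB per rustc during linking, so 4 jobs keeps peak bounded
decode_environment:
  CARGO_BUILD_JOBS: "4"
EOF
```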
…oc backtick bug

Two changes:
1. Add 'mem: "0"' to sbatch_directives in every gb300 recipe so each sbatch emits '#SBATCH --mem=0'. cw evidently has a tighter default per-task memory cgroup than nv; without --mem=0 the workers were getting killed with 'srun: task 0: Out Of Memory' partway through model load (and possibly during the dynamo source build before that). --mem=0 means 'use all node memory', which is what we want for these node-exclusive ML jobs.
2. Drop backticks from the comment in launch_gb300-cw.sh's heredoc. The heredoc terminator is unquoted (<<EOF), so bash performed command substitution on the backtick content, producing a noisy 'sbatch_directives:: command not found' error. Cosmetic only — the srtslurm.yaml was still written correctly — but the error looked alarming.
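The heredoc gotcha in item 2 is easy to reproduce; a minimal illustration (not the actual launch-script content):

```bash
# With an unquoted delimiter, bash performs expansion inside the heredoc, so the
# backticks in the comment run as a command substitution:
cat <<EOF > srtslurm.yaml
# values below end up under `sbatch_directives:`
EOF
# stderr shows "sbatch_directives:: command not found"; the file is still written
# (minus the backticked text), which is why the error is only cosmetic.

# Fix: drop the backticks (what the commit does), or quote the delimiter so nothing expands:
cat <<'EOF' > srtslurm.yaml
# values below end up under sbatch_directives:
EOF
```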
The recipe header comments still claimed each job is rack-pinned
'via srtctl's auto #SBATCH --segment={total_nodes}', but two commits
ago we flipped use_segment_sbatch_directive to false in srtslurm.yaml
and added explicit sbatch_directives.segment per recipe. Update the
six gb300 recipe headers to match the actual mechanism.
…cuda.so.1

First gb300-cw run died with 'ImportError: libcuda.so.1: cannot open shared object file' inside the decode worker container — vllm._C is linked against libcuda but the shared lib wasn't on the dynamic linker search path. cw's pyxis/enroot doesn't auto-inject the host NVIDIA driver libraries the way gb200-nv's setup does; the prestart hook needs NVIDIA_VISIBLE_DEVICES + NVIDIA_DRIVER_CAPABILITIES in the runtime env to know which devices and capabilities to expose. Setting them in the launch script before 'srtctl apply' propagates through SLURM's default --export=ALL on both sbatch and srun, so they reach the enroot prestart hook and trigger the libcuda + libnvidia-* bind-mounts.
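A sketch of the fix, with the variable names from the commit; the specific values and the srtctl invocation shown are assumptions, not verified against the launch script:

```bash
# Export the NVIDIA runtime hints in the launch script before submitting anything.
# SLURM's default --export=ALL on sbatch and srun carries them into the job env,
# where enroot's prestart hook reads them and bind-mounts libcuda / libnvidia-*.
export NVIDIA_VISIBLE_DEVICES=all                    # assumed value
export NVIDIA_DRIVER_CAPABILITIES=compute,utility    # assumed value

srtctl apply   # the commit only says the exports happen before 'srtctl apply'
```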
Failure mode (now diagnosed): srt-slurm's DP+EP path launches one srun
container per GPU. Each container independently runs the dynamo source
install ('maturin build' of the rust runtime, ~10 min on Grace ARM).
With 4 ranks per node x 2 nodes per worker the install times vary
enough across ranks that the early finishers hit vLLM's hardcoded
5-min 'Did not receive response from front-end process' deadline
while late finishers (rank 0 included) are still compiling.
Fix:
- runners/gb300-cw-vllm-container-deps.sh: new setup script that takes
a global flock on /mnt/vast and, on cache miss, builds the dynamo
wheel + a pruned source archive ONCE. Every rank pip-installs from
the cache (~30 s) so timing across ranks stays tight.
- launch_gb300-cw.sh: overlay the custom script into the cloned
srt-slurm's configs/ dir so the recipes' setup_script reference
resolves to it.
- All 6 gb300 recipes: dynamo.install: false (was true) so srt-slurm's
hardcoded per-rank install path is skipped — our setup script is the
sole installer.
Previous attempt's logs proved every rank ran maturin build in parallel
('[dynamo-cache] cold cache — building...' showed up in ALL worker
output), so the flock on /mnt/vast was a silent no-op. /mnt/vast is
NFS-backed and flock is unreliable there without explicit nolock
config — typical in clusters.
mkdir IS atomic across NFS. Switch to mkdir-based leader election: the
rank whose mkdir of <hash>.building succeeds is the leader and runs the
build; everyone else polls for .done. Followers timeout at 30 min if
the leader crashes; in practice the build is ~10 min.
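A minimal sketch of the mkdir-based leader election described above; the paths, the DYNAMO_HASH variable, and the build_dynamo_wheel helper are hypothetical stand-ins for what gb300-cw-vllm-container-deps.sh actually does:

```bash
CACHE="/mnt/vast/dynamo_cache/${DYNAMO_HASH}"   # DYNAMO_HASH: hypothetical content hash

if mkdir "${CACHE}.building" 2>/dev/null; then
  # mkdir is atomic even over NFS: exactly one rank wins and becomes the leader.
  build_dynamo_wheel "$CACHE"                   # hypothetical ~10 min maturin build
  touch "${CACHE}.done"
else
  # Followers poll for the leader's .done marker, giving up after 30 min in case
  # the leader crashed mid-build.
  for _ in $(seq 180); do
    [[ -e "${CACHE}.done" ]] && break
    sleep 10
  done
  [[ -e "${CACHE}.done" ]] || { echo "leader never published ${CACHE}" >&2; exit 1; }
fi

pip install "${CACHE}"/*.whl
```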
Two prior attempts at coordinating a one-time dynamo build across the ~60 worker containers via fs-level locks on /mnt/vast both failed: NFS silently no-ops flock and races negatively-cached mkdir. Every rank ended up running maturin build in parallel, the timing skew across nodes blew vLLM's hardcoded 5-min 'Did not receive response from front-end' deadline, and ranks died.

New design eliminates all per-rank coordination:
* launch_gb300-cw.sh now runs a one-shot srun BEFORE submitting the main sbatch. That srun builds the dynamo wheel + a pruned source archive into a temp dir on /mnt/vast and atomically renames it into place. Same-dir rename on NFS IS atomic (unlike flock or mkdir-vs-cache), so even when both gb300-cw_0 and gb300-cw_1 race on a cold cache the loser cleanly discards its build.
* gb300-cw-vllm-container-deps.sh becomes pure pip-install-from-cache; it errors out fast if the prebuild didn't run, instead of trying to build on its own.

Net: per-rank setup is now ~30 s (pip install of prebuilt wheel) vs. ~10 min cargo build, and identical across all ranks, so we don't blow vLLM's startup window.
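A sketch of the publish step in the new design; the same-dir rename is the primitive the commit relies on, while the directory names, DYNAMO_HASH, and the build helper are hypothetical:

```bash
CACHE_ROOT=/mnt/vast/dynamo_cache
FINAL="${CACHE_ROOT}/${DYNAMO_HASH}"              # DYNAMO_HASH: hypothetical content hash

[[ -d "$FINAL" ]] && exit 0                        # warm cache: nothing to build

TMP="$(mktemp -d "${CACHE_ROOT}/build.XXXXXX")"    # temp dir on the SAME filesystem
build_dynamo_wheel_and_source_archive "$TMP"       # hypothetical ~10 min build step

# rename(2) within one directory is atomic even on NFS. If another runner published
# first, the rename fails and we simply discard our copy (the "loser" case).
mv -T "$TMP" "$FINAL" 2>/dev/null || rm -rf "$TMP"
```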
Last attempt's prebuild srun got OOM-killed mid-build:
error: could not compile `moxcms` (lib)
Caused by: process didn't exit successfully ... (signal: 9, SIGKILL)
error: Detected 1 oom_kill event in StepId=71.0
srun: task 0: Out Of Memory
Default per-task memory cgroup is too small for cargo's link phase on a
big rust workspace. Three knobs added:
- --mem=0: claim full node memory (same lever the main sbatch already uses)
- CARGO_BUILD_JOBS=8: cap parallel rustc workers; on 72-core Grace ARM the default nproc setting can have dozens of rustc processes peaking together
- -C debuginfo=0: cargo's default debuginfo=2 is what makes the link phase memory-hungry; we don't need debug symbols in the runtime wheel
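Put together, the prebuild srun might look like the sketch below; '--mem=0' and CARGO_BUILD_JOBS=8 are named in the commit, while passing '-C debuginfo=0' via RUSTFLAGS and the script name are assumptions:

```bash
# One-shot prebuild before the main sbatch, with the three memory knobs applied.
srun --partition=all --nodes=1 --ntasks=1 --mem=0 \
  bash -c 'export CARGO_BUILD_JOBS=8 RUSTFLAGS="-C debuginfo=0"; exec ./dynamo-prebuild.sh'
# ./dynamo-prebuild.sh is a hypothetical name for the wheel + source-archive build step.
```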
Last attempt's prebuild succeeded, the launch script reported '[dynamo-prebuild] published cache at /mnt/vast/dynamo_cache/<hash>', but every worker still errored with our 'prebuilt cache missing' message. Reason: srt-slurm only mounts the model dir (/mnt/vast/models/dsv4) into worker containers — /mnt/vast/dynamo_cache isn't visible inside, so setup_script's stat of the cache always fails. Add extra_mount: /mnt/vast/dynamo_cache:/mnt/vast/dynamo_cache to all six gb300 recipes. Verified the recipes still parse cleanly via srtctl's load_config; cfg.extra_mount is now populated as expected.
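For illustration: the mount line added to each recipe, plus the kind of fail-fast check the setup script performs inside the container (the exact YAML nesting and the check are assumptions; the error text echoes the 'prebuilt cache missing' message quoted above):

```bash
# Recipe fragment: make the prebuilt cache visible inside every worker container.
cat <<'EOF'
extra_mount: /mnt/vast/dynamo_cache:/mnt/vast/dynamo_cache
EOF

# Inside the container, the setup script can now see the cache; bail out fast otherwise.
CACHE="/mnt/vast/dynamo_cache/${DYNAMO_HASH}"   # DYNAMO_HASH: hypothetical content hash
[[ -d "$CACHE" ]] || { echo "[dynamo-cache] prebuilt cache missing at $CACHE" >&2; exit 1; }
pip install "${CACHE}"/*.whl
```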
Latest run got past dynamo install (cache mount + prebuild both work now — 41 ranks all succeeded), then hit a different wall:

RuntimeError: Did not receive response from front-end process within 5 minutes

This is vllm's hardcoded engine-core handshake deadline. With DSV4-Pro weights (~850 GB) on /mnt/vast NFS and 8 DP ranks reading in parallel through one NFS client mount, rank 0's model load runs longer than 5 minutes under contention; the other DP ranks then time out waiting for the front-end (rank 0's DPAsyncMPClient) to respond. The 5-min limit is a module-level constant HANDSHAKE_TIMEOUT_MINS in vllm/v1/engine/core.py with no env-var override. The setup script now seds it to 30 in each rank's container after the dynamo install completes. (No-op + warning if the constant ever changes upstream.)
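A sketch of the sed patch plus the no-op warning; the constant name and file come from the commit message, while the exact source formatting (and therefore the sed pattern) is an assumption, which is why the grep guard exists:

```bash
# Locate the installed module and bump the hardcoded handshake deadline from 5 to 30 min.
core_py="$(python3 -c 'import vllm.v1.engine.core as m; print(m.__file__)')"

if grep -qE '^HANDSHAKE_TIMEOUT_MINS = 5\b' "$core_py"; then
  sed -i 's/^HANDSHAKE_TIMEOUT_MINS = 5\b/HANDSHAKE_TIMEOUT_MINS = 30/' "$core_py"
else
  # No-op + warning if the constant ever changes upstream.
  echo "[setup] HANDSHAKE_TIMEOUT_MINS = 5 not found in $core_py; leaving vllm unpatched" >&2
fi
```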
After patching the handshake timeout to 30 min, every rank still hits 'Did not receive response from front-end process within 30 minutes'. Rank 0 itself goes silent right after vllm config init — no model load progress, just a 30+ min gap. Suggests NCCL init is hanging, not slow NFS load. Two cw-specific tweaks:
- NCCL_MNNVL_ENABLE: removed. cw does not have multi-node NVLink (that's a gb200-nv tray feature). Telling NCCL it's there can confuse init.
- NCCL_P2P_LEVEL: NVL: removed. Across nodes there is no NVLink path, so forcing NVL-only P2P is wrong; let NCCL auto-pick (PIX/NET/etc).
Plus NCCL_DEBUG=INFO so the next run's worker logs show where NCCL is stuck. We can revert the debug log once we know the root cause.
…ducer

NVL72 GB300 HAS multi-node NVLink — removing NCCL_MNNVL_ENABLE was wrong. This commit restores it (and NCCL_P2P_LEVEL=NVL on tep8 recipes) to match the working gb200 references. Adds NCCL_DEBUG_SUBSYS + NCCL_DEBUG_FILE to all gb300 recipes so NCCL init/bootstrap/net diagnostics land in per-process log files instead of flooding the main sweep log. Also adds VLLM_ENGINE_READY_TIMEOUT_S to dep16 recipes (was only on tep8 before). Reduces nvidia-master search space to just the 1p1d-dep8-tep8 topology (4 nodes) for both ISL configs to isolate the DP Coordinator startup failure before scaling up to larger topologies.
…L_DEBUG_FILE

Three changes to diagnose the prefill DP Coordinator startup failure:
1. Restore the HANDSHAKE_TIMEOUT_MINS 5→30 sed patch in the setup script. Removing it (87bdf1f) caused follower DP ranks to hit the hardcoded 5-minute front-end handshake timeout during model load from VAST NFS. VLLM_ENGINE_READY_TIMEOUT_S does not control this code path.
2. Add a Python patch to vllm's coordinator.py that logs the DP Coordinator child's pid, alive status, and exitcode when the parent sees "failed to report ZMQ addresses". This surfaces the actual child failure instead of the opaque parent-side error.
3. Remove NCCL_DEBUG_FILE from all gb300 recipes — /tmp inside the container is ephemeral and not collected. NCCL debug now goes to stderr, which lands in the SLURM .out files.
The previous coordinator patch (7f526db) failed because the needle strings didn't match the actual multi-line format in vllm/v1/engine/coordinator.py. Rewrote based on the real source: (a) Bump _wait_for_zmq_addrs timeout=30 → timeout=300 by matching the exact "[zmq_addr_pipe, self.proc.sentinel], timeout=30" string. (b) Insert child-process debug logging (pid, alive, exitcode) before the RuntimeError raise, matching the exact multi-line raise block. This should expose whether the DP Coordinator child is crashing vs just slow, and give it 5 minutes instead of 30 seconds to report ZMQ addresses.
Previous patches (7f526db, 6415458) failed because exact string matching was too brittle for the multi-line raise block in coordinator.py. Now:
- Timeout bump: still exact-matches "[zmq_addr_pipe, self.proc.sentinel], timeout=30" → timeout=300 (this string is stable)
- Debug logging: regex-matches the RuntimeError raise block with flexible indentation/whitespace, injects child proc debug info (pid, alive, exitcode, sentinel) using self.proc (not the wrong self._coordinator_proc from the v1 attempt)
- Verification: dumps inspect.getsource(DPCoordinator._wait_for_zmq_addrs) so the per-rank logs show exactly what code will run
Separates the timeout bump and the logging into independent python blocks so a failure in one doesn't skip the other.
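A sketch of just the exact-string timeout bump (the regex-based logging injection is omitted); the needle string is quoted from the commit message, and locating the file through the installed package is an assumption about how the setup script does it:

```bash
python3 - <<'PY'
import vllm.v1.engine.coordinator as m

path = m.__file__
src = open(path).read()
needle = "[zmq_addr_pipe, self.proc.sentinel], timeout=30"

if needle in src:
    # Give the DP Coordinator child 5 minutes instead of 30 seconds to report ZMQ addresses.
    open(path, "w").write(src.replace(needle, needle.replace("timeout=30", "timeout=300")))
    print(f"[setup] bumped _wait_for_zmq_addrs timeout to 300 in {path}")
else:
    print(f"[setup] coordinator timeout needle not found in {path}; skipping")
PY
```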
Summary

Adds `dsv4-fp4-gb300-dynamo-vllm` — same DSV4-Pro FP4 sweep we already run on gb200, ported to the gb300-cw (CoreWeave) cluster. Topologies, per-worker tuning, container, and concurrency sweep are identical to the gb200 config; only `gpu_type`, the launch script's filesystem assumptions, and the SLURM partition differ.

What's in here

- `runners/launch_gb300-cw.sh` (new): adapted from `launch_gb200-nv.sh`. Stages weights at `/mnt/vast/models/dsv4/`, squash files at `/mnt/vast/squash/`, partition `all`. cw has no Lustre and no compute-node-local NVMe — VAST is the only option.
- `runners.yaml`: new `gb300-cw` group with `gb300-cw_0/_1` (kept separate from existing `gb300` group so dsr1-fp8-gb300-dynamo-sglang doesn't get cross-routed onto cw's launch script).
- `benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/{1k1k,8k1k}/disagg-gb300-*.yaml`: byte-for-byte mirrors of the gb200 recipes with `gpu_type: gb300` and updated headers. Tuning kept verbatim — GB300's extra HBM (288 GB vs 184 GB) probably means the CPU/DRAM offload knobs in the tep8 recipes can be dropped, but worth measuring first rather than re-tuning blind.
- `nvidia-master.yaml`: `dsv4-fp4-gb300-dynamo-vllm` config entry, `runner: gb300-cw`, recipes pointing to the new gb300 paths.
- `perf-changelog.yaml`: additions-only entry.

Rack pinning (cw-specific)

cw is 2x 18-node racks. srtctl already auto-emits `#SBATCH --segment={total_nodes}` by default (`use_segment_sbatch_directive: true` is the schema default), and the launch script spells this out in `srtslurm.yaml` so it's obvious. The largest topology (8k/1k 7p1d-dep8-dep16) needs exactly 18 nodes — fits one rack exactly. Anything wider would no longer fit.

Test plan

- `dsv4-fp4-gb300-dynamo-vllm` on the gb300-cw runners — start with the 4-node `1p1d-dep8-tep8` recipe to validate cluster plumbing before any 18-node job.
- `gb300-cw` (assumed, not verified by this PR).

Validation done locally

- `process_changelog.py` against `main`: passes, produces 6 multi-node entries (3x 1k/1k + 3x 8k/1k), 23 benchmark points.
- `benchmark.concurrencies` audit vs matrix `conc-list`: all 6 pairs aligned.
- `generate_sweep_configs.py full-sweep --runner-type gb300-cw`: returns 6 entries.