Day 0 GB300 DeepSeek-V4-Pro FP4 vLLM disagg #1150
Conversation
Adds the same set of topologies (1k/1k: 1p1d-dep8-tep8, 1p1d-dep8-dep16,
3p1d-dep8-dep16; 8k/1k: same plus 7p1d-dep8-dep16) targeted at the
gb300-cr cluster (CoreWeave, 2x 18-node racks). Per-worker tuning is
identical to the gb200 sweep — only gpu_type, name, and the launch
script's filesystem / partition assumptions differ.
- Adds gb300-cr runner group (gb300-cr_0/1) and launch_gb300-cr.sh.
- Recipes mounted at /mnt/vast/models/deepseek-v4-pro/ and squash files
under /mnt/vast/squash/; SLURM partition is 'all'.
- Each job rack-pins via srtctl's auto '#SBATCH --segment={total_nodes}';
the 18-node 7p1d topology fits one rack exactly.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Runner names use the existing CoreWeave 'cw' suffix convention (matches b200-cw_*, h100-cw_*, etc.) — gb300-cr was wrong. Model weights are at /mnt/vast/models/dsv4/ (the directory the user already populated), not .../deepseek-v4-pro/ as I'd guessed.
| - "Same topologies, same per-worker tuning, same container (vllm/vllm-openai:deepseekv4-cu130). Recipes duplicated as disagg-gb300-*.yaml with gpu_type: gb300; 1k/1k and 8k/1k both included" | ||
| - "New runners group gb300-cr (gb300-cr_0/1) and launch_gb300-cr.sh: SLURM partition `all`, model staging at /mnt/vast/models/deepseek-v4-pro/, squash files at /mnt/vast/squash/. Each job rack-pins via srtctl's auto `#SBATCH --segment={total_nodes}` (max 18-node 7p1d topology fits one rack exactly)" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1150 |
🟡 The new dsv4-fp4-gb300-dynamo-vllm changelog entry at perf-changelog.yaml:1822 ends with pr-link: TBD rather than the actual PR URL. The immediately-preceding gb200 sibling entry (line 1814) and most recent merges follow the convention of using the real https://github.com/SemiAnalysisAI/InferenceX/pull/<num> link, so this should be updated to https://github.com/SemiAnalysisAI/InferenceX/pull/1150 before merge. Cosmetic / nit — does not affect runtime, but the placeholder will be permanently retained in the changelog once merged.
Extended reasoning...
What's wrong. The new entry added at the bottom of perf-changelog.yaml (lines 1820-1822) has pr-link: TBD as its final line. The convention in this file is to fill in the actual GitHub PR URL — the directly-preceding gb200 sibling entry (lines 1807-1814, the dsv4-fp4-gb200-dynamo-vllm port from PR #1129) ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1129, and most recently-merged entries (#1144, #1147, #1148) follow the same pattern.
Why it matters. Once this PR is merged as #1150, the placeholder TBD becomes permanent in the changelog history — no automated process rewrites the entry post-merge. So the changelog will permanently link to nothing for this entry, breaking the audit trail that lets readers click through from a config-key change back to its originating PR.
Step-by-step proof.
- Check the current state of HEAD for perf-changelog.yaml: `git show 154be8d -- perf-changelog.yaml` shows the diff hunk adding lines 1815-1822, ending with `pr-link: TBD`.
- Read perf-changelog.yaml lines 1820-1822 directly: the literal string `pr-link: TBD` appears, not a URL.
- Compare to the gb200 sibling 8 lines above (line 1814): `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1129` — proper URL.
- The PR description explicitly calls out `perf-changelog.yaml: additions-only entry` and the PR number is 1150, so the intended value is unambiguous.
Note on the PR diff display. The bug-tracker's rendered diff in the review pane shows this line as pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1150, which differs from what is actually committed at HEAD. The git working tree is the source of truth — git show 154be8d -- perf-changelog.yaml confirms the literal TBD.
Fix. Replace pr-link: TBD on line 1822 with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1150 before merging, matching the convention of the gb200 sibling above.
Severity rationale (nit). This is documentation-only and has no runtime impact. Some prior entries in the file already have pull/XXX, pull/XXXX, or pull/TBD placeholder leakage from previously-merged PRs (e.g., lines 16, 46, 53, 349, 824, 852, 889, 906, 1556-1676), so the convention is clearly not enforced. Worth fixing for parity with the gb200 sibling immediately above, but not a blocker.
# Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling, just
# pointed at the gb300 recipe variants. Cluster gb300-cr is 2x 18-node
# racks; each job is rack-pinned via srtctl's auto `#SBATCH --segment={N}`.
🟡 Comment at line 7666 reads "Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling" but it's inside the dsv4-fp4-gb300-dynamo-vllm config block (declared at line 7657) — so the config is referring to itself as having a gb200 sibling. Almost certainly a copy-paste leftover from the GB200→GB300 port; should reference dsv4-fp4-gb200-dynamo-vllm instead. Pure comment-only nit, no runtime effect.
Extended reasoning...
What's wrong
In .github/configs/nvidia-master.yaml the new config block dsv4-fp4-gb300-dynamo-vllm: is declared at line 7657. The header comment for that block, lines 7666-7668, currently reads:
    dsv4-fp4-gb300-dynamo-vllm:                                                  # line 7657
    ...
    # Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling, just # line 7666
    # pointed at the gb300 recipe variants. Cluster gb300-cr is 2x 18-node
    # racks; each job is rack-pinned via srtctl's auto `#SBATCH --segment={N}`.

A config can't be its own sibling. The author clearly intended to point at dsv4-fp4-gb200-dynamo-vllm — that is the existing GB200 config defined immediately above (ending at line 7655) and the actual upstream this PR ports from.
Step-by-step proof
- Line 7657: `dsv4-fp4-gb300-dynamo-vllm:` — this opens the config block.
- Lines 7658-7665: scalar fields (image, model, model-prefix, runner, precision, framework, multinode, disagg) — all still inside the block opened at 7657.
- Line 7666: comment under that same key, which begins "Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling…"
- The phrase "dsv4-fp4-gb300-dynamo-vllm's gb200 sibling" reads as "the gb200 sibling of dsv4-fp4-gb300-dynamo-vllm" — i.e. the current block's gb200 sibling, which is dsv4-fp4-gb200-dynamo-vllm (lines 7544-7655). Saying "X's gb200 sibling" while being X is a tautology with no referent.
- The PR description corroborates: "Same DSV4-Pro FP4 sweep we already run on gb200, ported to the gb300-cr cluster" — i.e. the sibling is gb200, not gb300.
Impact
None on runtime, parsing, generated artifacts, or sweep behavior — YAML comments are inert. This is purely a readability issue: a future reader following the comment will go looking for a non-existent reference.
Fix
Change dsv4-fp4-gb300-dynamo-vllm's gb200 sibling to dsv4-fp4-gb200-dynamo-vllm (or equivalent phrasing such as "as the gb200 sibling (dsv4-fp4-gb200-dynamo-vllm)"). One-token edit while the PR is still open.
- SLURM_ACCOUNT: benchmark -> cw-sup. The 'benchmark' account was inherited from launch_gb200-nv.sh but doesn't exist on the cw cluster; sacctmgr shows the user is associated with cw-sup.
- Extend the gb300-cw runner group to include gb300-cw_2 and gb300-cw_3. All four cw runners carry the gb300-cw label, so list them all in the group so that matrix expansion can round-robin across the full pool.
srtctl's slurm template (job_script_minimal.j2) does `if ! command -v uv` and only installs its own (ARM64) uv when missing. The runner pod is x86 and /mnt/home is shared NFS with the aarch64 compute nodes; the default uv install location $HOME/.local/bin lands on that shared NFS path and shadows the template's install on the compute side, causing `Exec format error` from slurmd. Install via XDG_BIN_HOME to a runner-pod-local /tmp tmpfs path. Scrub any stale x86 uv from prior runs out of $HOME/.local/bin and fail loud if XDG_BIN_HOME isn't honored or the install leaks to NFS anyway.
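A minimal sketch of the workaround described above, assuming the uv installer honors XDG_BIN_HOME as the commit states; the /tmp path and the exact checks are illustrative, not the actual launch-script code:

```bash
# Sketch only: keep the x86 uv off the NFS-shared $HOME/.local/bin so the aarch64
# compute nodes never see it, and fail loud if the install leaks onto NFS anyway.
set -euo pipefail

# Scrub any stale x86 uv left behind on shared NFS by earlier runs.
rm -f "$HOME/.local/bin/uv" "$HOME/.local/bin/uvx"

# Install into a runner-pod-local tmpfs path instead of the NFS home.
export XDG_BIN_HOME=/tmp/uv-bin    # assumed location; any pod-local tmpfs path works
mkdir -p "$XDG_BIN_HOME"
curl -LsSf https://astral.sh/uv/install.sh | sh

# Fail loud if XDG_BIN_HOME wasn't honored or the binary still landed on NFS.
[[ -x "$XDG_BIN_HOME/uv" ]] || { echo "uv did not land in $XDG_BIN_HOME" >&2; exit 1; }
[[ ! -e "$HOME/.local/bin/uv" ]] || { echo "uv leaked onto NFS \$HOME/.local/bin" >&2; exit 1; }

export PATH="$XDG_BIN_HOME:$PATH"
```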
Previously relied on srtctl's auto '#SBATCH --segment={total_nodes}'
(controlled by use_segment_sbatch_directive=true, the schema default).
Real runs on gb300-cw showed the directive was missing from the
generated sbatch — workers landed on different racks.
Make the constraint explicit per recipe:
sbatch_directives:
  segment: "<total_nodes>"
and turn off the auto path in srtslurm.yaml so we don't emit two
overlapping #SBATCH --segment lines. Each gb300 recipe now declares
its own segment value matching its prefill_nodes + decode_nodes
total (4, 6, 10, or 18).
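For illustration, the two pieces this commit touches might look roughly like the fragments below; the key names (sbatch_directives, segment, use_segment_sbatch_directive) come from the commit message, while the file names and exact YAML nesting are assumptions:

```bash
# 1) Per-recipe explicit rack pin, e.g. the 4-node 1p1d-dep8-tep8 topology
#    (hypothetical file name; segment = prefill_nodes + decode_nodes):
cat <<'EOF' >> disagg-gb300-1p1d-dep8-tep8.yaml
sbatch_directives:
  segment: "4"
EOF

# 2) Disable srtctl's auto-emitted directive so the generated sbatch doesn't carry
#    two overlapping '#SBATCH --segment' lines:
cat <<'EOF' >> srtslurm.yaml
use_segment_sbatch_directive: false
EOF
```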
OOM during 'maturin build' of dynamo source on gb300-cw. Cargo defaults to nproc parallel rustc workers; on Grace ARM (~72 cores per node) the peak RAM during the link phase exceeded the SLURM cgroup limit, causing SIGKILL with 'task 0: Out Of Memory' before vLLM ever started. Capped at 4 in both prefill_environment and decode_environment of every gb300 recipe. Each rustc uses ~5-10GB during linking, so 4 parallel jobs keep peak well under any reasonable per-task cgroup limit. (gb200-nv runs the same install via the same srt-slurm path and works without this cap, so cw evidently has tighter per-task memory limits.)
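A sketch of the cap; CARGO_BUILD_JOBS=4 and the prefill_environment/decode_environment keys come from the commit message, and the exact recipe nesting is assumed:

```bash
# Recipe fragment (assumed nesting): cap cargo's parallel rustc workers in both workers.
cat <<'EOF'
prefill_environment:
  CARGO_BUILD_JOBS: "4"   # ~5-10 GB per rustc during linking, so 4 jobs keeps peak bounded
decode_environment:
  CARGO_BUILD_JOBS: "4"
EOF
```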
…oc backtick bug

Two changes:
1. Add 'mem: "0"' to sbatch_directives in every gb300 recipe so each sbatch emits '#SBATCH --mem=0'. cw evidently has a tighter default per-task memory cgroup than nv; without --mem=0 the workers were getting killed with 'srun: task 0: Out Of Memory' partway through model load (and possibly during the dynamo source build before that). --mem=0 means 'use all node memory', which is what we want for these node-exclusive ML jobs.
2. Drop backticks from the comment in launch_gb300-cw.sh's heredoc. The heredoc terminator is unquoted (<<EOF), so bash performed command substitution on the backtick content, producing a noisy 'sbatch_directives:: command not found' error. Cosmetic only — the srtslurm.yaml was still written correctly — but the error looked alarming.
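The heredoc gotcha in item 2 is easy to reproduce; a minimal illustration (not the actual launch-script content):

```bash
# With an unquoted delimiter, bash performs expansion inside the heredoc, so the
# backticks in the comment run as a command substitution:
cat <<EOF > srtslurm.yaml
# values below end up under `sbatch_directives:`
EOF
# stderr shows "sbatch_directives:: command not found"; the file is still written
# (minus the backticked text), which is why the error is only cosmetic.

# Fix: drop the backticks (what the commit does), or quote the delimiter so nothing expands:
cat <<'EOF' > srtslurm.yaml
# values below end up under sbatch_directives:
EOF
```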
The recipe header comments still claimed each job is rack-pinned
'via srtctl's auto #SBATCH --segment={total_nodes}', but two commits
ago we flipped use_segment_sbatch_directive to false in srtslurm.yaml
and added explicit sbatch_directives.segment per recipe. Update the
six gb300 recipe headers to match the actual mechanism.
…cuda.so.1

First gb300-cw run died with 'ImportError: libcuda.so.1: cannot open shared object file' inside the decode worker container — vllm._C is linked against libcuda but the shared lib wasn't on the dynamic linker search path. cw's pyxis/enroot doesn't auto-inject the host NVIDIA driver libraries the way gb200-nv's setup does; the prestart hook needs NVIDIA_VISIBLE_DEVICES + NVIDIA_DRIVER_CAPABILITIES in the runtime env to know which devices and capabilities to expose. Setting them in the launch script before 'srtctl apply' propagates through SLURM's default --export=ALL on both sbatch and srun, so they reach the enroot prestart hook and trigger the libcuda + libnvidia-* bind-mounts.
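A sketch of the fix, with the variable names from the commit; the specific values and the srtctl invocation shown are assumptions, not verified against the launch script:

```bash
# Export the NVIDIA runtime hints in the launch script before submitting anything.
# SLURM's default --export=ALL on sbatch and srun carries them into the job env,
# where enroot's prestart hook reads them and bind-mounts libcuda / libnvidia-*.
export NVIDIA_VISIBLE_DEVICES=all                    # assumed value
export NVIDIA_DRIVER_CAPABILITIES=compute,utility    # assumed value

srtctl apply   # the commit only says the exports happen before 'srtctl apply'
```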
Failure mode (now diagnosed): srt-slurm's DP+EP path launches one srun
container per GPU. Each container independently runs the dynamo source
install ('maturin build' of the rust runtime, ~10 min on Grace ARM).
With 4 ranks per node x 2 nodes per worker the install times vary
enough across ranks that the early finishers hit vLLM's hardcoded
5-min 'Did not receive response from front-end process' deadline
while late finishers (rank 0 included) are still compiling.
Fix:
- runners/gb300-cw-vllm-container-deps.sh: new setup script that takes
a global flock on /mnt/vast and, on cache miss, builds the dynamo
wheel + a pruned source archive ONCE. Every rank pip-installs from
the cache (~30 s) so timing across ranks stays tight.
- launch_gb300-cw.sh: overlay the custom script into the cloned
srt-slurm's configs/ dir so the recipes' setup_script reference
resolves to it.
- All 6 gb300 recipes: dynamo.install: false (was true) so srt-slurm's
hardcoded per-rank install path is skipped — our setup script is the
sole installer.
Previous attempt's logs proved every rank ran maturin build in parallel
('[dynamo-cache] cold cache — building...' showed up in ALL worker
output), so the flock on /mnt/vast was a silent no-op. /mnt/vast is
NFS-backed and flock is unreliable there without explicit nolock
config — typical in clusters.
mkdir IS atomic across NFS. Switch to mkdir-based leader election: the
rank whose mkdir of <hash>.building succeeds is the leader and runs the
build; everyone else polls for .done. Followers timeout at 30 min if
the leader crashes; in practice the build is ~10 min.
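A minimal sketch of the mkdir-based leader election described above; the paths, the DYNAMO_HASH variable, and the build_dynamo_wheel helper are hypothetical stand-ins for what gb300-cw-vllm-container-deps.sh actually does:

```bash
CACHE="/mnt/vast/dynamo_cache/${DYNAMO_HASH}"   # DYNAMO_HASH: hypothetical content hash

if mkdir "${CACHE}.building" 2>/dev/null; then
  # mkdir is atomic even over NFS: exactly one rank wins and becomes the leader.
  build_dynamo_wheel "$CACHE"                   # hypothetical ~10 min maturin build
  touch "${CACHE}.done"
else
  # Followers poll for the leader's .done marker, giving up after 30 min in case
  # the leader crashed mid-build.
  for _ in $(seq 180); do
    [[ -e "${CACHE}.done" ]] && break
    sleep 10
  done
  [[ -e "${CACHE}.done" ]] || { echo "leader never published ${CACHE}" >&2; exit 1; }
fi

pip install "${CACHE}"/*.whl
```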
Two prior attempts at coordinating a one-time dynamo build across the ~60 worker containers via fs-level locks on /mnt/vast both failed: NFS silently no-ops flock and races negatively-cached mkdir. Every rank ended up running maturin build in parallel, the timing skew across nodes blew vLLM's hardcoded 5-min 'Did not receive response from front-end' deadline, and ranks died.

New design eliminates all per-rank coordination:
* launch_gb300-cw.sh now runs a one-shot srun BEFORE submitting the main sbatch. That srun builds the dynamo wheel + a pruned source archive into a temp dir on /mnt/vast and atomically renames it into place. Same-dir rename on NFS IS atomic (unlike flock or mkdir-vs-cache), so even when both gb300-cw_0 and gb300-cw_1 race on a cold cache the loser cleanly discards its build.
* gb300-cw-vllm-container-deps.sh becomes pure pip-install-from-cache; it errors out fast if the prebuild didn't run, instead of trying to build on its own.

Net: per-rank setup is now ~30 s (pip install of prebuilt wheel) vs. ~10 min cargo build, and identical across all ranks, so we don't blow vLLM's startup window.
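A sketch of the publish step in the new design; the same-dir rename is the primitive the commit relies on, while the directory names, DYNAMO_HASH, and the build helper are hypothetical:

```bash
CACHE_ROOT=/mnt/vast/dynamo_cache
FINAL="${CACHE_ROOT}/${DYNAMO_HASH}"              # DYNAMO_HASH: hypothetical content hash

[[ -d "$FINAL" ]] && exit 0                        # warm cache: nothing to build

TMP="$(mktemp -d "${CACHE_ROOT}/build.XXXXXX")"    # temp dir on the SAME filesystem
build_dynamo_wheel_and_source_archive "$TMP"       # hypothetical ~10 min build step

# rename(2) within one directory is atomic even on NFS. If another runner published
# first, the rename fails and we simply discard our copy (the "loser" case).
mv -T "$TMP" "$FINAL" 2>/dev/null || rm -rf "$TMP"
```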
Last attempt's prebuild srun got OOM-killed mid-build:
error: could not compile `moxcms` (lib)
Caused by: process didn't exit successfully ... (signal: 9, SIGKILL)
error: Detected 1 oom_kill event in StepId=71.0
srun: task 0: Out Of Memory
Default per-task memory cgroup is too small for cargo's link phase on a
big rust workspace. Three knobs added:
- --mem=0: claim full node memory (same lever the main sbatch already uses)
- CARGO_BUILD_JOBS=8: cap parallel rustc workers; on 72-core Grace ARM the default nproc setting can have dozens of rustc processes peaking together
- -C debuginfo=0: cargo's default debuginfo=2 is what makes the link phase memory-hungry; we don't need debug symbols in the runtime wheel
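Put together, the prebuild srun might look like the sketch below; '--mem=0' and CARGO_BUILD_JOBS=8 are named in the commit, while passing '-C debuginfo=0' via RUSTFLAGS and the script name are assumptions:

```bash
# One-shot prebuild before the main sbatch, with the three memory knobs applied.
srun --partition=all --nodes=1 --ntasks=1 --mem=0 \
  bash -c 'export CARGO_BUILD_JOBS=8 RUSTFLAGS="-C debuginfo=0"; exec ./dynamo-prebuild.sh'
# ./dynamo-prebuild.sh is a hypothetical name for the wheel + source-archive build step.
```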
Last attempt's prebuild succeeded, the launch script reported '[dynamo-prebuild] published cache at /mnt/vast/dynamo_cache/<hash>', but every worker still errored with our 'prebuilt cache missing' message. Reason: srt-slurm only mounts the model dir (/mnt/vast/models/dsv4) into worker containers — /mnt/vast/dynamo_cache isn't visible inside, so setup_script's stat of the cache always fails. Add extra_mount: /mnt/vast/dynamo_cache:/mnt/vast/dynamo_cache to all six gb300 recipes. Verified the recipes still parse cleanly via srtctl's load_config; cfg.extra_mount is now populated as expected.
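For illustration: the mount line added to each recipe, plus the kind of fail-fast check the setup script performs inside the container (the exact YAML nesting and the check are assumptions; the error text echoes the 'prebuilt cache missing' message quoted above):

```bash
# Recipe fragment: make the prebuilt cache visible inside every worker container.
cat <<'EOF'
extra_mount: /mnt/vast/dynamo_cache:/mnt/vast/dynamo_cache
EOF

# Inside the container, the setup script can now see the cache; bail out fast otherwise.
CACHE="/mnt/vast/dynamo_cache/${DYNAMO_HASH}"   # DYNAMO_HASH: hypothetical content hash
[[ -d "$CACHE" ]] || { echo "[dynamo-cache] prebuilt cache missing at $CACHE" >&2; exit 1; }
pip install "${CACHE}"/*.whl
```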
Latest run got past dynamo install (cache mount + prebuild both work now — 41 ranks all succeeded), then hit a different wall:

RuntimeError: Did not receive response from front-end process within 5 minutes

This is vllm's hardcoded engine-core handshake deadline. With DSV4-Pro weights (~850 GB) on /mnt/vast NFS and 8 DP ranks reading in parallel through one NFS client mount, rank 0's model load runs longer than 5 minutes under contention; the other DP ranks then time out waiting for the front-end (rank 0's DPAsyncMPClient) to respond. The 5-min limit is a module-level constant HANDSHAKE_TIMEOUT_MINS in vllm/v1/engine/core.py with no env-var override. The setup script now seds it to 30 in each rank's container after the dynamo install completes. (No-op + warning if the constant ever changes upstream.)
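A sketch of the sed patch plus the no-op warning; the constant name and file come from the commit message, while the exact source formatting (and therefore the sed pattern) is an assumption, which is why the grep guard exists:

```bash
# Locate the installed module and bump the hardcoded handshake deadline from 5 to 30 min.
core_py="$(python3 -c 'import vllm.v1.engine.core as m; print(m.__file__)')"

if grep -qE '^HANDSHAKE_TIMEOUT_MINS = 5\b' "$core_py"; then
  sed -i 's/^HANDSHAKE_TIMEOUT_MINS = 5\b/HANDSHAKE_TIMEOUT_MINS = 30/' "$core_py"
else
  # No-op + warning if the constant ever changes upstream.
  echo "[setup] HANDSHAKE_TIMEOUT_MINS = 5 not found in $core_py; leaving vllm unpatched" >&2
fi
```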
After patching the handshake timeout to 30 min, every rank still hits 'Did not receive response from front-end process within 30 minutes'. Rank 0 itself goes silent right after vllm config init — no model load progress, just a 30+ min gap. Suggests NCCL init is hanging, not slow NFS load. Two cw-specific tweaks:
- NCCL_MNNVL_ENABLE: removed. cw does not have multi-node NVLink (that's a gb200-nv tray feature). Telling NCCL it's there can confuse init.
- NCCL_P2P_LEVEL: NVL: removed. Across nodes there is no NVLink path, so forcing NVL-only P2P is wrong; let NCCL auto-pick (PIX/NET/etc).
Plus NCCL_DEBUG=INFO so the next run's worker logs show where NCCL is stuck. We can revert the debug log once we know the root cause.
…ducer

NVL72 GB300 HAS multi-node NVLink — removing NCCL_MNNVL_ENABLE was wrong. This commit restores it (and NCCL_P2P_LEVEL=NVL on tep8 recipes) to match the working gb200 references. Adds NCCL_DEBUG_SUBSYS + NCCL_DEBUG_FILE to all gb300 recipes so NCCL init/bootstrap/net diagnostics land in per-process log files instead of flooding the main sweep log. Also adds VLLM_ENGINE_READY_TIMEOUT_S to dep16 recipes (was only on tep8 before). Reduces nvidia-master search space to just the 1p1d-dep8-tep8 topology (4 nodes) for both ISL configs to isolate the DP Coordinator startup failure before scaling up to larger topologies.
…L_DEBUG_FILE

Three changes to diagnose the prefill DP Coordinator startup failure:
1. Restore the HANDSHAKE_TIMEOUT_MINS 5→30 sed patch in the setup script. Removing it (87bdf1f) caused follower DP ranks to hit the hardcoded 5-minute front-end handshake timeout during model load from VAST NFS. VLLM_ENGINE_READY_TIMEOUT_S does not control this code path.
2. Add a Python patch to vllm's coordinator.py that logs the DP Coordinator child's pid, alive status, and exitcode when the parent sees "failed to report ZMQ addresses". This surfaces the actual child failure instead of the opaque parent-side error.
3. Remove NCCL_DEBUG_FILE from all gb300 recipes — /tmp inside the container is ephemeral and not collected. NCCL debug now goes to stderr, which lands in the SLURM .out files.
The previous coordinator patch (7f526db) failed because the needle strings didn't match the actual multi-line format in vllm/v1/engine/coordinator.py. Rewrote based on the real source: (a) Bump _wait_for_zmq_addrs timeout=30 → timeout=300 by matching the exact "[zmq_addr_pipe, self.proc.sentinel], timeout=30" string. (b) Insert child-process debug logging (pid, alive, exitcode) before the RuntimeError raise, matching the exact multi-line raise block. This should expose whether the DP Coordinator child is crashing vs just slow, and give it 5 minutes instead of 30 seconds to report ZMQ addresses.
Previous patches (7f526db, 6415458) failed because exact string matching was too brittle for the multi-line raise block in coordinator.py. Now:
- Timeout bump: still exact-matches "[zmq_addr_pipe, self.proc.sentinel], timeout=30" → timeout=300 (this string is stable)
- Debug logging: regex-matches the RuntimeError raise block with flexible indentation/whitespace, injects child proc debug info (pid, alive, exitcode, sentinel) using self.proc (not the wrong self._coordinator_proc from the v1 attempt)
- Verification: dumps inspect.getsource(DPCoordinator._wait_for_zmq_addrs) so the per-rank logs show exactly what code will run
Separates the timeout bump and the logging into independent python blocks so a failure in one doesn't skip the other.
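A sketch of just the exact-string timeout bump (the regex-based logging injection is omitted); the needle string is quoted from the commit message, and locating the file through the installed package is an assumption about how the setup script does it:

```bash
python3 - <<'PY'
import vllm.v1.engine.coordinator as m

path = m.__file__
src = open(path).read()
needle = "[zmq_addr_pipe, self.proc.sentinel], timeout=30"

if needle in src:
    # Give the DP Coordinator child 5 minutes instead of 30 seconds to report ZMQ addresses.
    open(path, "w").write(src.replace(needle, needle.replace("timeout=30", "timeout=300")))
    print(f"[setup] bumped _wait_for_zmq_addrs timeout to 300 in {path}")
else:
    print(f"[setup] coordinator timeout needle not found in {path}; skipping")
PY
```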
Summary

Adds `dsv4-fp4-gb300-dynamo-vllm` — same DSV4-Pro FP4 sweep we already run on gb200, ported to the gb300-cw (CoreWeave) cluster. Topologies, per-worker tuning, container, and concurrency sweep are identical to the gb200 config; only `gpu_type`, the launch script's filesystem assumptions, and the SLURM partition differ.

What's in here

- `runners/launch_gb300-cw.sh` (new): adapted from `launch_gb200-nv.sh`. Stages weights at `/mnt/vast/models/dsv4/`, squash files at `/mnt/vast/squash/`, partition `all`. cw has no Lustre and no compute-node-local NVMe — VAST is the only option.
- `runners.yaml`: new `gb300-cw` group with `gb300-cw_0/_1` (kept separate from existing `gb300` group so dsr1-fp8-gb300-dynamo-sglang doesn't get cross-routed onto cw's launch script).
- `benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/{1k1k,8k1k}/disagg-gb300-*.yaml`: byte-for-byte mirrors of the gb200 recipes with `gpu_type: gb300` and updated headers. Tuning kept verbatim — GB300's extra HBM (288 GB vs 184 GB) probably means the CPU/DRAM offload knobs in the tep8 recipes can be dropped, but worth measuring first rather than re-tuning blind.
- `nvidia-master.yaml`: `dsv4-fp4-gb300-dynamo-vllm` config entry, `runner: gb300-cw`, recipes pointing to the new gb300 paths.
- `perf-changelog.yaml`: additions-only entry.

Rack pinning (cw-specific)

cw is 2x 18-node racks. srtctl already auto-emits `#SBATCH --segment={total_nodes}` by default (`use_segment_sbatch_directive: true` is the schema default), and the launch script spells this out in `srtslurm.yaml` so it's obvious. The largest topology (8k/1k 7p1d-dep8-dep16) needs exactly 18 nodes — fits one rack exactly. Anything wider would no longer fit.

Test plan

- `dsv4-fp4-gb300-dynamo-vllm` on the gb300-cw runners — start with the 4-node `1p1d-dep8-tep8` recipe to validate cluster plumbing before any 18-node job.
- `gb300-cw` (assumed, not verified by this PR).

Validation done locally

- `process_changelog.py` against `main`: passes, produces 6 multi-node entries (3x 1k/1k + 3x 8k/1k), 23 benchmark points.
- `benchmark.concurrencies` audit vs matrix `conc-list`: all 6 pairs aligned.
- `generate_sweep_configs.py full-sweep --runner-type gb300-cw`: returns 6 entries.