Add H100 config: dsv4-fp8-dynamo-vllm (DeepSeek-V4-Pro multinode disagg)#1142

Open
Oseltamivir wants to merge 17 commits into main from dsv4-fp8-h100-dynamo-vllm

Conversation


@Oseltamivir (Collaborator) commented on Apr 24, 2026

Summary

  • Port the DSV4-Pro vLLM recipe from single-node H200 to H100 multinode disaggregated serving via Dynamo (2 prefill nodes + 2 decode nodes, DP16/EP16 per side, 32xH100 total).
  • 2P+2D is the minimum viable shape: ~862 GB FP8 weights don't fit on one 8xH100-80GB node (640 GB), so each side must own >=2 nodes. This exactly fills the h100-multinode pool.
  • srt-slurm recipes are bundled locally at benchmarks/multi_node/srt_slurm_recipes/ and overlaid onto the upstream clone at runtime. Temporary pending an upstream PR to NVIDIA/srt-slurm.

Config parity with H200

Engine flags match benchmarks/single_node/dsv4_fp8_h200.sh:

  • deepseek_v4 tokenizer, tool-call, and reasoning parsers
  • --kv-cache-dtype fp8, --block-size 256
  • --no-enable-prefix-caching, --no-enable-flashinfer-autotune
  • --enable-expert-parallel, --gpu-memory-utilization 0.95, --max-num-seqs 512, --max-num-batched-tokens 512
  • compilation mode 0 with FULL_DECODE_ONLY cudagraph
  • VLLM_ENGINE_READY_TIMEOUT_S=3600

Differs from H200:

  • max-model-len: 16384 (H200's 800k does not fit KV across two 80GB decode nodes)
  • H100 80GB tunings from the DSR1 H100 vLLM recipe: VLLM_MOE_DP_CHUNK_SIZE=192, deepep_{high_throughput,low_latency} all2all backends, VLLM_USE_DEEP_GEMM=1
  • NixlConnector P<->D KV transfer, dynamo: { version: 1.0.1, install: true } with setup_script: vllm-container-deps.sh
  • tensor-parallel-size: 1 + data-parallel-size: 16 per side (vs H200 --data-parallel-size 8)
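
Putting the deltas together, a sketch of the per-side engine block might look like this (YAML field names are illustrative, not copied from the actual srt-slurm schema):

```yaml
# Hypothetical shape of the H100 disagg recipe deltas; the real
# srt-slurm schema may spell these fields differently.
prefill: &side
  tensor-parallel-size: 1
  data-parallel-size: 16          # vs --data-parallel-size 8 on H200
  enable-expert-parallel: true    # ep=16
  max-model-len: 16384            # H200's 800k KV does not fit on 80 GB
  kv-cache-dtype: fp8
  block-size: 256
  env:
    VLLM_MOE_DP_CHUNK_SIZE: "192"
    VLLM_USE_DEEP_GEMM: "1"
    VLLM_ENGINE_READY_TIMEOUT_S: "3600"
decode: *side                     # same engine flags on the decode side
```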

🤖 Generated with Claude Code

Port the DSV4-Pro vLLM recipe from single-node H200 to H100 as multinode
disaggregated serving via Dynamo. The ~862 GB FP8 weights don't fit on one
8xH100-80GB node (640 GB), so each side must own >=2 nodes; with the
h100-multinode pool at 4 nodes, 2P+2D DP16/EP16 per side (32 H100s total)
is the minimum viable shape and fills the pool exactly.

Engine flags match the single-node H200 recipe: deepseek_v4 tokenizer,
tool-call, and reasoning parsers; FP8 KV cache; block size 256; prefix
caching disabled; compilation mode 0 with FULL_DECODE_ONLY cudagraph.
max-model-len is capped at 16384 (H200's 800k does not fit KV across two
80GB decode nodes). Keeps H100-tuned knobs from the DSR1 vLLM recipe:
VLLM_MOE_DP_CHUNK_SIZE=192, deepep_{high_throughput,low_latency} all2all
backends, NixlConnector P<->D KV transfer, VLLM_USE_DEEP_GEMM, dynamo 1.0.1.

srt-slurm recipes are bundled locally at benchmarks/multi_node/srt_slurm_recipes/
and overlaid onto the srt-slurm clone at runtime. This is temporary until
the recipes can be upstreamed to NVIDIA/srt-slurm.

Changes:
- recipes: benchmarks/multi_node/srt_slurm_recipes/vllm/deepseek-v4-pro/
  {1k1k,8k1k}/disagg-h100-fp8-1p1d-dep16-dep16.yaml
- runner: launch_h100-dgxc-slurm.sh gains a dynamo-vllm framework branch
  (dsv4-fp8 model path at /mnt/nfs/lustre/models/dsv4-fp8, vLLM container
  squash mapping, srtslurm.yaml dynamo-vllm alias) and an unconditional
  local-recipes overlay after the srt-slurm checkout
- master: .github/configs/nvidia-master.yaml adds dsv4-fp8-h100-dynamo-vllm
  with 1k1k conc [4,8,16,32,64,128] and 8k1k conc [4,8,16,32,64]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions (Contributor) commented

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please open a PR there first before we merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Actions jobs pass after merging. Failures are often just flakes, and re-running the failed jobs fixes them; if you re-run failed jobs, you are responsible for making sure they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Oseltamivir and others added 14 commits April 24, 2026 13:31
Sweep 24909864822 had all three multinode jobs fail in 6s with
ExitCode=1:0 and no sweep_JOBID.log written, leaving no usable
diagnostic in the CI artifact. Two defensive changes:

1. mkdir -p outputs/$JOB_ID/logs before polling, so Slurm's
   #SBATCH --output=outputs/%j/logs/sweep_%j.log directive can
   open the target file even when the compute-node stepd lacks
   permission to create the parent dir on NFS.

2. On the "job failed before creating log file" path, tar
   outputs/$JOB_ID/ (sbatch_script.sh, config.yaml, any partial
   log, and the scontrol dump) into multinode_server_logs.tar.gz
   so the CI artifact captures what was submitted and why Slurm
   exited early. Previously exit 1 ran before the tar step.
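
In the runner script, the two changes could be sketched roughly like this (JOB_ID and the paths follow the commit text; the surrounding submit/poll logic is assumed):

```shell
JOB_ID="${JOB_ID:-12345}"  # hypothetical job id for illustration

# 1. Pre-create the log dir so Slurm's
#    `#SBATCH --output=outputs/%j/logs/sweep_%j.log` can open its target
#    even if the compute-node slurmstepd cannot mkdir on NFS.
mkdir -p "outputs/${JOB_ID}/logs"

# ... sbatch submission and squeue/sacct polling happen here ...

# 2. If the job died before its log existed, still tar up what was
#    submitted (sbatch_script.sh, config.yaml, scontrol dump) so the
#    CI artifact is not empty.
if [ ! -f "outputs/${JOB_ID}/logs/sweep_${JOB_ID}.log" ]; then
  tar -czf multinode_server_logs.tar.gz "outputs/${JOB_ID}/" || true
fi
```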

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 1142's first real sweep hit "ModuleNotFoundError: No module named
'vllm.inputs.data'" on all three multinode jobs. Same error as PR 1129
on GB200.

Root cause: ai-dynamo 1.0.1 (installed by NVIDIA/srt-slurm@sa-submission-q2-2026
via `dynamo: { version: 1.0.1 }`) imports vllm.inputs.data.TokensPrompt,
a path removed in the DSV4 vLLM wheel. Dynamo workers crash during
import before any vLLM flag matters.

Fix, mirroring PR 1129:
- launch_h100-dgxc-slurm.sh: override srt-slurm clone URL/ref via
  SRT_SLURM_REPO_URL and SRT_SLURM_REF env vars, set to
  alec-flowers/srt-slurm@d60e3f1c (head of NVIDIA/srt-slurm#71) for
  dynamo-vllm+dsv4. All other frameworks/models keep NVIDIA upstream.
- Recipes: replace `dynamo.version: 1.0.1` with `dynamo.hash:
  6a159fedd8e4a1563aa647c31f622aedbf254b5b`. The fork's schema accepts
  `hash:` for pinning a specific ai-dynamo/dynamo commit. That commit
  has the matching vllm.inputs import path.
- Recipes: adopt DSV4-specific flags PR 1129 proved necessary for
  startup: `enforce-eager: true` (prefill only), `enable-sleep-mode: true`,
  `no-disable-hybrid-kv-cache-manager: true`, explicit
  `kv-transfer-config` (NixlConnector kv_both), env vars
  VLLM_SERVER_DEV_MODE=1 and TILELANG_CLEANUP_TEMP_FILES=1.
- Recipes: drop `data-parallel-hybrid-lb` and `async-scheduling` (DSR1
  patterns that PR 1129 omitted on DSV4; keep minimal delta from DSV4
  H200 single-node).
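
The clone override could be wired into the launch script along these lines (the default/override pattern and the FRAMEWORK/MODEL variables are assumptions about the surrounding script; the repo/ref values come from the commit text):

```shell
# Hypothetical wiring; FRAMEWORK/MODEL defaults are for illustration only.
FRAMEWORK="${FRAMEWORK:-dynamo-vllm}"
MODEL="${MODEL:-dsv4-fp8}"

# Default to NVIDIA upstream; override only for dynamo-vllm + dsv4.
SRT_SLURM_REPO_URL="${SRT_SLURM_REPO_URL:-https://github.com/NVIDIA/srt-slurm.git}"
SRT_SLURM_REF="${SRT_SLURM_REF:-sa-submission-q2-2026}"
if [ "$FRAMEWORK" = "dynamo-vllm" ] && [ "$MODEL" = "dsv4-fp8" ]; then
  SRT_SLURM_REPO_URL="https://github.com/alec-flowers/srt-slurm.git"
  SRT_SLURM_REF="d60e3f1c"   # head of NVIDIA/srt-slurm#71
fi
echo "cloning ${SRT_SLURM_REPO_URL} @ ${SRT_SLURM_REF}"
# ... later: git clone "$SRT_SLURM_REPO_URL" srt-slurm
#            git -C srt-slurm checkout "$SRT_SLURM_REF"
```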

Kept H100-specific knobs: VLLM_MOE_DP_CHUNK_SIZE=192, deepep_{high_throughput,
low_latency} all2all backends, VLLM_USE_DEEP_GEMM. Skipped GB200-only
flags (NCCL_MNNVL_ENABLE, NCCL_NVLS_ENABLE, VLLM_USE_NCCL_SYMM_MEM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dynamo vLLM worker argparse rejects --enable-auto-tool-choice and
--tool-call-parser — the sweep from e0359c6 got past the module-import
error but failed with "unrecognized arguments: --enable-auto-tool-choice
--tool-call-parser deepseek_v4" during prefill worker startup.

These flags (along with --tokenizer-mode and --reasoning-parser) are
OpenAI API-server concerns. In disagg, Dynamo is the frontend and does
tokenization / tool parsing itself; the vLLM workers are engine-only
processes and expose only engine args. The H200 single-node recipe
uses `vllm serve` directly (full API server), which is why those flags
work there but fail here.

Kimi K2.5 (only other working dynamo-vllm recipe) also omits all four
flags — that's the precedent.

Removed from both prefill and decode:
  tokenizer-mode: deepseek_v4
  tool-call-parser: deepseek_v4
  reasoning-parser: deepseek_v4
  enable-auto-tool-choice: true

Kept trust-remote-code: true (needed for DSV4's custom modeling code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers got past module import and weight load (471s), then died
simultaneously with:

  /dvs/p4/build/sw/rel/gpgpu/toolkit/r12.9/main_nvshmem/src/host/mem/
  mem_heap.cpp:exchange_heap_memory_handle:781: Fatal IPC Failure
  IPC failure: Sending data over socket failed: No such file or directory

Root cause: `all2all-backend: deepep_{high_throughput,low_latency}`
routes expert-parallel comms through NVSHMEM. The cu129 DSV4 vLLM
wheel's NVSHMEM can't complete host-side IPC bootstrap after the
workers enter the executor init phase. DSR1 on the same H100 nodes
uses deepep successfully, but through a different container
(nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0) with an older NVSHMEM.

Fix — mirror PR 1129's GB200 approach:
1. Drop the `all2all-backend` override entirely. The DSV4 vLLM code
   picks its own default for this model, which routes through NCCL
   symmetric memory instead of NVSHMEM.
2. Add env vars:
     VLLM_USE_NCCL_SYMM_MEM=1  (prefer NCCL symm mem path)
     NCCL_CUMEM_ENABLE=1       (cuMem-based allocation, which the symm-mem path needs)

Skipped NCCL_MNNVL_ENABLE and NCCL_NVLS_ENABLE (Blackwell-only; MNNVL
is GB200 NVSwitch fabric, NVLS is NVLink SHARP — neither exists on
H100). Keeps all H100-specific knobs (VLLM_USE_DEEP_GEMM,
VLLM_MOE_DP_CHUNK_SIZE=192, VLLM_SKIP_P2P_CHECK).
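
After this change, the recipes' env section might look like the following sketch (key names follow the commit text; the enclosing schema is assumed):

```yaml
# Note: no `all2all-backend:` key — the DSV4 vLLM default routes MoE
# comms over NCCL symmetric memory instead of NVSHMEM.
env:
  VLLM_USE_NCCL_SYMM_MEM: "1"     # prefer the NCCL symm-mem path
  NCCL_CUMEM_ENABLE: "1"          # cuMem allocation, needed by symm mem
  VLLM_USE_DEEP_GEMM: "1"         # kept H100 knobs
  VLLM_MOE_DP_CHUNK_SIZE: "192"
  VLLM_SKIP_P2P_CHECK: "1"
```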

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 24913192394 got past every prior failure (NVSHMEM/IPC, module
import, argparse) but OOMed during compile_or_warm_up_model:

  torch.OutOfMemoryError: CUDA out of memory.
  Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 79.19 GiB
  of which 93.00 MiB is free. PyTorch: 72.99 GiB | CUDA Graphs: 1.28 GiB
  File ".../vllm/model_executor/layers/sparse_attn_indexer.py", line 122

DSV4's "Lightning Indexer" sparse attention layer allocates transient
torch.empty buffers that aren't accounted for in vLLM's KV cache
profiling. With gpu-memory-utilization=0.95, vLLM reserves ~75 GiB of
each H100's 79 GiB usable, leaving only ~4 GiB for non-PyTorch state
(NCCL buffers, NVSHMEM scratch, the indexer's transient allocations).
The indexer's 512 MiB allocation tips it over.

The H200 single-node DSV4 recipe uses 0.95 and works because each H200
has 141 GiB/GPU — the ~7 GiB headroom left at 0.95 is enough there. PR 1129 uses 0.88
(prefill) / 0.9 (decode) on GB200's 192 GiB. DSR1 H100 disagg uses
vLLM's default 0.9 and works because DSR1's MLA doesn't have the
indexer overhead.

0.85 reserves ~12 GiB headroom on H100 80GB, well above the indexer's
~6 GiB working set.
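
The headroom arithmetic above can be checked back-of-envelope (79.19 GiB usable per the OOM report; awk handles the float math):

```shell
# headroom = usable * (1 - gpu-memory-utilization), in GiB
headroom() { awk -v t=79.19 -v u="$1" 'BEGIN { printf "%.1f", t * (1 - u) }'; }

echo "util=0.95 -> $(headroom 0.95) GiB headroom"   # ~4.0: the indexer tips it over
echo "util=0.85 -> $(headroom 0.85) GiB headroom"   # ~11.9: clears the ~6 GiB working set
```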

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 24914869373: server starts successfully (eval-only succeeds in 33m,
end-to-end gsm8k completions). The throughput jobs fail before sending
a single request:

  ValueError: Cannot use chat template functions because
  tokenizer.chat_template is not set
  File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 346,
  in sample_random_requests
      chat_template_dummy = tokenizer.apply_chat_template(...)

DSV4-Pro's HF tokenizer ships without a chat_template attribute. The
server uses tokenizer-mode=deepseek_v4 (set automatically from the
model's tokenizer_config.json) to handle templating itself, but
sa-bench's prompt-construction path runs a *local* HF
apply_chat_template before sending — and that raises with no template
to apply.

Eval works because lm-eval-harness sends raw messages to
/v1/chat/completions; the server templates them via Dynamo's parser.

Set `use_chat_template: false` on both recipes' benchmark blocks
(matches PR 1129). sa-bench will send raw random text, which is what
the throughput benchmark wants anyway.
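
The fix is a one-line change in each recipe's benchmark block (the surrounding key names are assumed):

```yaml
benchmark:
  use_chat_template: false   # DSV4-Pro's HF tokenizer ships no chat_template;
                             # the Dynamo frontend templates server-side
```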

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expand the search space with a TEP-style recipe alongside the existing
DEP, following the dsr1-fp8-h100-dynamo-sglang TEP/DEP split pattern.

The h100-multinode pool is exactly 4 nodes and DSV4-Pro weights need
>=2 nodes per side, so we cannot add more workers (1P+1D = 4 nodes is
the only fit). The TEP variant therefore differs from DEP by changing
each worker's *internal* parallelism, not the worker count:

  DEP (existing): tp=1, dp=16, ep=16, dp-attn=true
                  16 independent attention paths, sharded experts.
                  Better at high concurrency / throughput.

  TEP (new):      tp=16, dp=1, ep=16, dp-attn=false
                  Single replica spread across all 16 GPUs, sharded
                  experts. All 16 GPUs cooperate on each forward pass.
                  Cross-node TP routes attn all-reduce + MoE all2all
                  over IB — expensive per token, but latency wins at
                  small batch sizes (conc 4-32).
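
As a sketch, the two variants' parallelism settings differ only in how each worker's 16 GPUs are used (field names are illustrative; `dp-attn` mirrors the matrix metadata rather than an exact flag name):

```yaml
# DEP (existing): 16 independent attention replicas, experts sharded 16-way
dep:
  tensor-parallel-size: 1
  data-parallel-size: 16
  enable-expert-parallel: true   # ep=16
  dp-attn: true

# TEP (new): one replica sharded across all 16 GPUs, experts still 16-way
tep:
  tensor-parallel-size: 16
  data-parallel-size: 1
  enable-expert-parallel: true
  dp-attn: false
```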

Concurrency split per the user's hint ("DEP for high conc, TEP for
low conc"):
  1k1k TEP: [4, 8, 16, 32]    1k1k DEP: [64, 128, 256]
  8k1k TEP: [4, 8, 16]        8k1k DEP: [32, 64, 128]

Also extends the DEP high-conc tail by one point each side
(1k1k 128 -> 256, 8k1k 64 -> 128).

TEP recipe drops `data-parallel-hybrid-lb` (no DP) and lowers
`max-num-seqs` to 64 / `max-num-batched-tokens` to 512 since cudagraph
capture would otherwise reserve memory for batch shapes never reached
at conc<=32. Keeps the existing DSV4 startup workarounds
(VLLM_USE_NCCL_SYMM_MEM, gpu-memory-utilization=0.85, no all2all-backend
override, etc).

Doubles the matrix from 2 to 4 entries (validated via
MultiNodeMatrixEntry).

Also adds `du -sh "$MODEL_PATH"` in the dynamo-vllm branch of
launch_h100-dgxc-slurm.sh so model size shows in CI output — useful
for catching partial downloads or wrong revisions before the 8-min
weight-load step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep 24921015519 surfaced that cross-node TP=16 doesn't work with the
Dynamo+vLLM stack:

  pydantic_core._pydantic_core.ValidationError: 1 validation error for ParallelConfig
    Value error, World size (16) is larger than the number of available
    GPUs (1) in this node. If this is intentional and you are using:
    - ray, set '--distributed-executor-backend ray'.
    - multiprocessing, set '--nnodes' appropriately.

Dynamo spawns one vLLM process per GPU; each process only sees its
single local GPU and vLLM rejects world_size=16. Working around this
would need --distributed-executor-backend=ray which Dynamo doesn't
coordinate. None of the working DSV4 vLLM recipes (kimi GB200, DSR1
H100, PR 1129 GB200) use cross-node TP either — the execution model
assumes one process per GPU.

So drop TEP entirely; instead deliver two DEP recipes per ISL/OSL
that differ in batch tuning:

  DEP-eager (low conc): max-num-seqs=64, max-num-batched-tokens=256,
    enforce-eager=true on decode (no cudagraph). Smaller cudagraph
    capture footprint, faster warmup, no decode kernel-launch
    optimization (irrelevant at conc<=32 where network round-trips
    dominate per-token latency).
  DEP (high conc, existing): max-num-seqs=512, max-num-batched-tokens
    =512, decode cudagraph enabled. Higher batching throughput at
    conc>=64.

Conc splits unchanged from previous attempt:
  1k1k eager [4,8,16,32]    1k1k dep [64,128,256]
  8k1k eager [4,8,16]       8k1k dep [32,64,128]

Same 4 matrix entries, all with the same tp=1/dp=16/ep=16/dp-attn=true
metadata; differentiation is via the CONFIG_FILE pointer in
additional-settings (mirrors how the trtllm dsr1-h100 recipes encode
multiple variants of the same topology).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…914869373)

The eager low-conc DEP variant added in 1bdeb9e was untested, and the
TEP variant before that didn't work at all on Dynamo+vLLM. Drop both
and revert to the single-DEP search-space form that successfully served
gsm8k eval-only in run 24914869373:

  1k1k DEP: conc [4, 8, 16, 32, 64, 128]
  8k1k DEP: conc [4, 8, 16, 32, 64]

Each entry uses tp=1, dp=16, ep=16, dp-attn=true (1P+1D filling the
4-node h100-multinode pool). max-num-seqs=512, decode cudagraph on,
gpu-memory-utilization=0.85.

Removes:
- benchmarks/multi_node/srt_slurm_recipes/vllm/deepseek-v4-pro/1k1k/disagg-h100-fp8-1p1d-dep16-dep16-eager.yaml
- benchmarks/multi_node/srt_slurm_recipes/vllm/deepseek-v4-pro/8k1k/disagg-h100-fp8-1p1d-dep16-dep16-eager.yaml

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vllm

# Conflicts:
#	.github/configs/nvidia-master.yaml
#	perf-changelog.yaml
Run 24922713022 hit the default 1800s orchestrator deadline on all three
matrix jobs (1k1k bench, 8k1k bench, 8k1k eval). Concurrent multinode
matrix jobs starve the same Lustre OSTs — first shard load took 423s,
shard 8/64 was reached at 16 min, projected total weight load ~107 min.

Match the GB200 dsv4 recipes which already added these blocks for the
same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DSV4-Pro per-rank weights are 74.99 GiB at DP=16/EP=16 — H100 80GB
leaves only ~4 GiB headroom and sparse_attn_indexer's profile_run
torch.empty(512 MiB) OOMs (run 24923521075).

Cross-node TP=16 shards the model 16-way across 2 nodes (~5 GiB per
rank). srt-slurm's vllm.py:386-388 emits --headless on the secondary
node when data-parallel-size is absent and the worker spans nodes;
Dynamo's run_dynamo_headless calls vLLM's run_headless which uses
MultiprocExecutor + torch.distributed (no Ray) to form the cross-node
PG. NCCL TP all-reduce flows over IB on every layer — slower per-token
than intra-node NVLink, but the only way to fit DSV4-Pro at 80 GB.

Other changes: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for
fragmentation; gpu-memory-utilization back to 0.95 (matches H200);
enforce-eager on decode for the first attempt (cross-node cudagraphs
are fragile).
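
A rough check of the sharding math, starting from the 74.99 GiB per-rank figure in the run above (real TP sharding is not perfectly uniform, so this is an approximation):

```shell
# per-rank weight share at a given TP degree, from the DP=16/EP=16
# per-rank footprint of 74.99 GiB
share() { awk -v w=74.99 -v n="$1" 'BEGIN { printf "%.1f", w / n }'; }

echo "tp=1  -> $(share 1) GiB per rank (OOMs: only ~4 GiB headroom on 80 GB)"
echo "tp=16 -> $(share 16) GiB per rank (fits easily)"
```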

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Oseltamivir Oseltamivir force-pushed the dsv4-fp8-h100-dynamo-vllm branch from 348910a to f798361 on April 25, 2026 at 07:05