50 changes: 50 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -7428,3 +7428,53 @@ kimik2.5-fp4-gb200-dynamo-vllm:
tp: 16
ep: 16
dp-attn: true

# DeepSeek-V4-Pro on GB200, SGLang aggregated (TP=8 across 2 nodes).
# Recipes live in YAMY1234/srt-slurm-nv:dsv4-pro-recipes (NVIDIA srt-slurm
# PR #69), derived from the official SGLang DeepSeek-V4 cookbook.
# `framework: sglang` (no Dynamo frontend) tells the runner to clone that
# fork instead of NVIDIA/srt-slurm and to use the recipe directly.
dsv4-fp4-gb200-sglang:
  image: lmsysorg/sglang:deepseek-v4-grace-blackwell
  model: deepseek-ai/DeepSeek-V4-Pro
  model-prefix: dsv4
  runner: gb200
  precision: fp4
  framework: sglang
  multinode: true
  disagg: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        # Low-latency: TP=8 + EAGLE 3/4 speculative decoding (smaller batches,
        # better TPOT). Recipe targets the low-conc end of the curve.
        - conc-list: [1, 2, 4, 8, 16, 32, 64]
          prefill:
            num-worker: 1
            tp: 8
            ep: 1
            dp-attn: false
          additional-settings:
            # https://github.com/NVIDIA/srt-slurm/pull/69/files#diff-recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml
            - "CONFIG_FILE=recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml"
Comment on lines +7447 to +7460

🔴 The low-latency search-space entry (conc 1-64) references agg-2n-low-latency.yaml, which this PR's own description and the adjacent YAML comment describe as using EAGLE 3/4 speculative decoding, but the entry omits spec-decoding: "mtp" and therefore defaults to "none". Because SPEC_DECODING is plumbed through the runner into the result metadata (utils/process_result.py line 53), every EAGLE run from this config will be mislabeled as non-speculative in downstream dashboards, and it will be indistinguishable from the agg-2n-nomtp entry in any aggregation that keys on spec-decoding. Fix: add spec-decoding: "mtp" to the conc 1-64 entry at lines 7452-7465, matching the convention used by every other EAGLE/MTP config in this file.

Extended reasoning...

What the bug is

The new dsv4-fp4-gb200-sglang config in .github/configs/nvidia-master.yaml has two search-space entries for the 1k/1k seq-len. The first entry (conc [1, 2, 4, 8, 16, 32, 64]) points at recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml. This PR's own description says agg-2n-low-latency.yaml — TP=8 + EAGLE 3/4 speculative decoding, the adjacent YAML comment at line 7450 says Low-latency: TP=8 + EAGLE 3/4 speculative decoding, and the perf-changelog entry says agg-2n-low-latency (EAGLE 3/4 spec decoding). Despite this, the search-space entry does not set spec-decoding: "mtp".

Why the default is wrong here

Per utils/matrix_logic/validation.py:227-228, MultiNodeSearchSpaceEntry.spec_decoding is declared as Literal["mtp", "draft_model", "none"] with default="none". So the low-latency EAGLE entry silently becomes spec_decoding="none" when the matrix is generated. Every other EAGLE/MTP recipe in this same file explicitly sets spec-decoding: "mtp" (the *-trt-mtp, *-sglang-mtp, *-dynamo-sglang-mtp entries; the changelog's own description of the other MTP PRs in this file confirms this is the established convention).
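The silent-default behavior can be sketched in a few lines (a stdlib dataclass stands in for the repo's actual Pydantic model; the class and field names here only mirror validation.py and are illustrative):

```python
from dataclasses import dataclass
from typing import List, Literal

# Hypothetical stand-in for MultiNodeSearchSpaceEntry (validation.py:227-228),
# using a stdlib dataclass instead of the repo's Pydantic model.
@dataclass
class SearchSpaceEntry:
    conc_list: List[int]
    spec_decoding: Literal["mtp", "draft_model", "none"] = "none"

# The PR's low-latency entry: spec-decoding omitted, default silently applied.
low_latency = SearchSpaceEntry(conc_list=[1, 2, 4, 8, 16, 32, 64])
print(low_latency.spec_decoding)  # prints: none (despite the recipe using EAGLE)

# The convention every other EAGLE/MTP config in the file follows.
fixed = SearchSpaceEntry(conc_list=[1, 2, 4, 8, 16, 32, 64], spec_decoding="mtp")
print(fixed.spec_decoding)  # prints: mtp
```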

Impact

utils/process_result.py is the result-metadata writer run for every benchmark point. At module load (line 27) it requires SPEC_DECODING as an env var, and at line 53 it writes 'spec_decoding': spec_decoding into the result JSON that is uploaded for dashboards:

data = {
    ...
    'framework': framework,
    'precision': precision,
    'spec_decoding': spec_decoding,
    ...
}

With spec-decoding missing from the YAML, the generated matrix entry carries spec_decoding="none", the job's SPEC_DECODING env var is set to "none", and all seven low-latency EAGLE points (conc 1, 2, 4, 8, 16, 32, 64) get written into result metadata as non-speculative. Downstream dashboards and aggregations that key on spec_decoding will silently fold the EAGLE points in with the non-MTP entry (agg-2n-nomtp.yaml, conc 128-1024), producing a combined sweep that looks like a pure non-MTP config — you lose the ability to see the EAGLE speedup at low concurrency.

The multi-node eval-grouping logic in utils/matrix_logic/generate_sweep_configs.py:106-114 also keys on spec-decoding, but the grouping targets only 8k1k (see target_isl, target_osl = seq_len_stoi["8k1k"] at line 51) and this config is 1k1k, so eval selection is not directly affected. The metadata-correctness impact on result labeling remains, which is the primary defect.
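The 8k1k-only gating described above reduces to a toy check like this (the tuple-valued `seq_len_stoi` is an assumption about a structure the diff does not show):

```python
# Toy version of the eval-grouping gate in generate_sweep_configs.py:
# grouping targets only the 8k1k bucket, so this PR's 1k1k config is skipped.
seq_len_stoi = {"1k1k": (1024, 1024), "8k1k": (8192, 1024)}  # assumed shape
target_isl, target_osl = seq_len_stoi["8k1k"]

def eligible_for_eval_grouping(isl: int, osl: int) -> bool:
    return (isl, osl) == (target_isl, target_osl)

print(eligible_for_eval_grouping(1024, 1024))  # prints: False (this config)
print(eligible_for_eval_grouping(8192, 1024))  # prints: True
```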

Proof (step-by-step)

  1. generate_sweep_configs.py expands the config. For the conc=16 point in the first search-space entry, it calls into the validated MultiNodeSearchSpaceEntry model. Because spec-decoding is absent, Pydantic applies the default "none" (validation.py:227-228).
  2. The entry is emitted into the matrix with spec-decoding: "none".
  3. The matrix row drives a GitHub Actions job; the runner sets SPEC_DECODING=none in the environment alongside FRAMEWORK=sglang, MODEL_PREFIX=dsv4, CONFIG_FILE=recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml.
  4. srtctl apply runs the agg-2n-low-latency.yaml recipe, which actually does use EAGLE 3/4 speculative decoding (per the YAMY1234/srt-slurm-nv fork linked in the diff).
  5. Post-benchmark, utils/process_result.py reads SPEC_DECODING="none" and writes {"spec_decoding": "none", ...} into the per-run result JSON that is uploaded for dashboards.
  6. The EAGLE run is now indistinguishable from a non-speculative run in downstream aggregations.
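Steps 1-6 can be traced end to end in a short sketch (env var names are from the source; the dictionaries are illustrative, not the repo's code):

```python
import json
import os

# Steps 1-2: spec-decoding is absent from the YAML, so the generated
# matrix entry carries the Pydantic default.
matrix_entry = {"conc": 16, "spec-decoding": "none"}

# Step 3: the runner exports it into the job environment.
os.environ["SPEC_DECODING"] = matrix_entry["spec-decoding"]

# Step 5: a process_result.py-style metadata write reads it back.
result = {
    "framework": "sglang",
    "spec_decoding": os.environ["SPEC_DECODING"],
}
print(json.dumps(result))  # the EAGLE run is recorded as non-speculative
```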

Fix

Add spec-decoding: "mtp" to the first search-space entry:

    - conc-list: [1, 2, 4, 8, 16, 32, 64]
      spec-decoding: "mtp"
      prefill:
        num-worker: 1
        ...

The second (throughput) entry correctly leaves it at the default since agg-2n-nomtp.yaml has no MTP.

          decode:
            num-worker: 1
            tp: 8
            ep: 1
            dp-attn: false
        # Throughput: TP=8 with no MTP (matches cookbook's "throughput" tier).
        - conc-list: [128, 256, 512, 1024]
          prefill:
            num-worker: 1
            tp: 8
            ep: 1
            dp-attn: false
          additional-settings:
            # https://github.com/NVIDIA/srt-slurm/pull/69/files#diff-recipes/gb200-fp4/1k1k-dsv4/agg-2n-nomtp.yaml
            - "CONFIG_FILE=recipes/gb200-fp4/1k1k-dsv4/agg-2n-nomtp.yaml"
          decode:
            num-worker: 1
            tp: 8
            ep: 1
            dp-attn: false
10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -1,3 +1,13 @@
- config-keys:
    - dsv4-fp4-gb200-sglang
  description:
    - "Add DeepSeek-V4-Pro SGLang aggregated GB200 benchmarks (1k/1k, TP=8, 2 nodes)"
    - "Recipes from YAMY1234/srt-slurm-nv:dsv4-pro-recipes (NVIDIA srt-slurm PR #69)"
    - "Image: lmsysorg/sglang:deepseek-v4-grace-blackwell"
    - "Two recipes: agg-2n-low-latency (EAGLE 3/4 spec decoding) for conc 1-64, agg-2n-nomtp for conc 128-1024"
    - "Runner script clones the YAMY1234 fork pinned at commit da535e87 instead of NVIDIA/srt-slurm"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
Comment on lines +1 to +9

🔴 The new dsv4-fp4-gb200-sglang changelog entry was prepended to the top of perf-changelog.yaml (lines 1-9), but AGENTS.md line 160 explicitly requires new entries to be appended to the END of the file ("oldest at the top, newest at the bottom"). All other recent entries (#1120, #1040, #1043, etc.) correctly live at the bottom, so prepending here inverts chronological ordering for this one entry. Please move the 9-line block from the top of the file to the end.

Extended reasoning...

What the bug is

AGENTS.md line 160 states the rule plainly:

The file is read in chronological order: oldest at the top, newest at the bottom. New entries MUST be appended to the END of the file — never insert in the middle or prepend.

The diff for perf-changelog.yaml in this PR is a single hunk @@ -1,3 +1,13 @@ that inserts the new dsv4-fp4-gb200-sglang entry at the very top of the file, before the pre-existing dsr1-fp8-h100-dynamo-trt/dsr1-fp8-h100-dynamo-sglang entry. That directly violates the documented rule.

Why existing code/convention doesn't prevent it

The rule is a human-enforced convention documented in AGENTS.md — there is no lint or CI check that validates ordering, so the only guard is reviewer/author attention. Every other recent PR in the repo (e.g. #1120 evals trigger, #1040 qwen atom, #1043 glm5.1 atom, #1098, #1106, #1094) lands its entry at the bottom, which is how new readers correctly infer the chronology.
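Had an automated guard existed, it could be as simple as checking that the new file starts with the old one verbatim (hypothetical helper, not part of the repo):

```python
def is_append_only(old_text: str, new_text: str) -> bool:
    """True iff new_text only adds content at the end of old_text."""
    return new_text.startswith(old_text)

# Illustrative changelog fragments.
old = "- config-keys:\n    - dsr1-fp8-h100-dynamo-trt\n"
new_entry = "- config-keys:\n    - dsv4-fp4-gb200-sglang\n"

print(is_append_only(old, old + new_entry))  # prints: True  (appended, OK)
print(is_append_only(old, new_entry + old))  # prints: False (prepended, rejected)
```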

Impact

Anyone scanning perf-changelog.yaml top-down to understand the evolution of configs will now see the DSV4-Pro GB200 SGLang entry (a brand-new April 2026 submission) as if it were the oldest change, predating things like 70b-fp8-*-vllm (PR #95) and gptoss-fp4-*-trt (PR #110). That misleads humans, and any future tooling that parses the file chronologically (e.g. "what changed since last quarter") would get the wrong answer.

How to fix

Move the 9-line block currently at lines 1-9 of perf-changelog.yaml to the end of the file (after the last glm5.1-fp4-mi355x-atom entry from PR #1043). The content of the entry itself is fine — only its position needs to change.

Step-by-step proof

  1. Open AGENTS.md line 160 → confirms the rule: "appended to the END of the file — never insert in the middle or prepend."
  2. Open the PR diff for perf-changelog.yaml → the single hunk is @@ -1,3 +1,13 @@, meaning the 10 added lines (9 entry + 1 blank separator) sit at the very beginning of the file. The first pre-existing entry - config-keys: [dsr1-fp8-h100-dynamo-trt, ...] is pushed from line 1 down to line 11.
  3. Compare to PR trigger H100 multinode evals #1120 (the most recent "evals trigger" entry in the file) and PR [AMD/ROCM] atom glm5.1 fp4 on mi355x #1043 (glm5.1-fp4-mi355x-atom) — both sit at the bottom of the modified file, per the rule.
  4. Therefore this PR prepends while all sibling PRs appended, violating the explicit documented convention.


- config-keys:
    - dsr1-fp8-h100-dynamo-trt
    - dsr1-fp8-h100-dynamo-sglang
28 changes: 27 additions & 1 deletion runners/launch_gb200-nv.sh
@@ -46,6 +46,16 @@ elif [[ $FRAMEWORK == "dynamo-vllm" ]]; then
    echo "Unsupported model prefix/precision combination: $MODEL_PREFIX/$PRECISION. Supported combinations for dynamo-vllm: kimik2.5/fp4"
    exit 1
  fi
elif [[ $FRAMEWORK == "sglang" ]]; then
  # Direct SGLang aggregated serving (no Dynamo frontend), used by recipes
  # in YAMY1234/srt-slurm-nv:dsv4-pro-recipes (NVIDIA srt-slurm PR #69).
  if [[ $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" ]]; then
    export MODEL_PATH="/mnt/lustre01/users/sa-shared/DeepSeek-V4-Pro"
    export SRT_SLURM_MODEL_PREFIX="dsv4-pro"
  else
    echo "Unsupported model prefix/precision combination: $MODEL_PREFIX/$PRECISION. Supported combinations for sglang: dsv4/fp4"
    exit 1
  fi
else
  export MODEL_PATH=$MODEL
fi
@@ -134,7 +144,22 @@ if [ -d "$SRT_REPO_DIR" ]; then
rm -rf "$SRT_REPO_DIR"
fi

if [[ $FRAMEWORK == "dynamo-vllm" ]]; then
if [[ $FRAMEWORK == "sglang" && $MODEL_PREFIX == "dsv4" ]]; then
  # YAMY1234's fork of NVIDIA/srt-slurm, branch dsv4-pro-recipes
  # (https://github.com/NVIDIA/srt-slurm/pull/69) — adds DeepSeek-V4-Pro
  # SGLang aggregated recipes for GB200 / GB300 derived from the SGLang
  # DeepSeek-V4 cookbook. Pinned to the PR head commit for reproducibility.
  git clone https://github.com/YAMY1234/srt-slurm-nv.git "$SRT_REPO_DIR"
  cd "$SRT_REPO_DIR"
  git checkout da535e87338cfac0388fc301f9c87b7bc5e669a6
  # The upstream recipes hardcode slurm.partition to NVIDIA's internal
  # partition names (gb200 / gb300). Rewrite to our partition so sbatch
  # doesn't fail with "invalid partition specified".
  find recipes/gb200-fp4 recipes/gb300-fp4 -type f -name "*.yaml" -exec \
    sed -i "s/^  partition: gb200$/  partition: ${SLURM_PARTITION}/" {} +
  find recipes/gb200-fp4 recipes/gb300-fp4 -type f -name "*.yaml" -exec \
    sed -i "s/^  partition: gb300$/  partition: ${SLURM_PARTITION}/" {} +
elif [[ $FRAMEWORK == "dynamo-vllm" ]]; then
  git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
  cd "$SRT_REPO_DIR"
  git checkout sa-submission-q2-2026
@@ -187,6 +212,7 @@ model_paths:
containers:
  dynamo-trtllm: ${SQUASH_FILE}
  dynamo-sglang: ${SQUASH_FILE}
  dsv4-grace-blackwell: ${SQUASH_FILE}
  "${IMAGE}": ${SQUASH_FILE}
  nginx-sqsh: ${NGINX_SQUASH_FILE}
EOF
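The partition rewrite in the runner diff can be exercised against a throwaway recipe file (the /tmp path and the batch_gb200 partition name are made up for illustration; two-space YAML indentation in the sed pattern is assumed, since the diff view collapsed whitespace):

```shell
# Create a sample recipe that hardcodes NVIDIA's internal partition name.
mkdir -p /tmp/recipes/gb200-fp4
cat > /tmp/recipes/gb200-fp4/demo.yaml <<'YAML'
slurm:
  partition: gb200
YAML

# Rewrite it to the local cluster's partition, as the runner does.
SLURM_PARTITION=batch_gb200
find /tmp/recipes/gb200-fp4 -type f -name "*.yaml" -exec \
  sed -i "s/^  partition: gb200$/  partition: ${SLURM_PARTITION}/" {} +

cat /tmp/recipes/gb200-fp4/demo.yaml
```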