[AMD/ROCm] Add Kimi-K2.5 FP4 vLLM Eagle3 speculative decoding config for MI355X #1116
chunfangamd wants to merge 6 commits into main from
Conversation
- Co-authored with Li, Larry and Li, Chao
Changes:
- amd-master.yaml: add kimik2.5-fp4-mi355x-vllm-eagle3 entry sweeping
tp={4,8} x conc={4..64} on both 1k1k and 8k1k. Uses the vllm-rocm nightly
required for Eagle3 support.
- benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh: new benchmark script
that reuses the kimik2.5_fp4_mi355x.sh recipe and adds --speculative-config
for the Eagle3 draft (lightseekorg/kimi-k2.5-eagle3,
num_speculative_tokens=7, draft_tensor_parallel_size=1). --block-size is
intentionally omitted: the Eagle3 draft uses standard attention on ROCm
and no ROCm standard-attention backend supports block_size=1.
- utils/matrix_logic/validation.py (+ test_validation.py): extend the
spec-decoding Literal from {mtp, draft_model, none} to include eagle3.
- runners/launch_mi355x-amds.sh: extend SPEC_SUFFIX so SPEC_DECODING=eagle3
resolves to the _eagle3 benchmark script variant.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts "$((CONC * 10))" \
  --max-concurrency "$CONC" \
  --result-filename "$RESULT_FILENAME" \
  --result-dir /workspace/ \
  --trust-remote-code
🔴 The new benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh does not pass --use-chat-template to run_benchmark_serving (lines 108-116). Eagle3 is EAGLE-style speculative decoding (the SPEC_CONFIG sets method: "eagle3" at line 78), so AGENTS.md:167 and .github/workflows/claude-pr-review.yml:277-296 both mandate this flag — without it the draft (trained on chat-formatted inputs) is benchmarked against raw prompts, silently regressing acceptance rate and producing misleading tok/s/gpu. Fix: add --use-chat-template to the run_benchmark_serving invocation, matching every other *_mtp.sh spec-decoding script in the repo.
Extended reasoning...
What the bug is
The new Eagle3 benchmark script benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh invokes run_benchmark_serving at lines 105-116 without the --use-chat-template flag. This is the only spec-decoding benchmark script in the repo that omits it.
Why the rule applies
Two independent pieces of repo policy make this a hard requirement for EAGLE-style spec-decoding scripts:
- AGENTS.md:162-167: "MTP scripts MUST pass --use-chat-template to run_benchmark_serving — no exceptions. EAGLE-style speculative decoding is trained against chat-formatted inputs, so benchmarking against raw prompts silently regresses acceptance rate and produces misleading numbers."
- .github/workflows/claude-pr-review.yml:277-296: classifies a missing --use-chat-template on an MTP/EAGLE benchmark as a BLOCKING issue.
The new script configures method: "eagle3" in SPEC_CONFIG (line 78) — literally EAGLE-family spec decoding, squarely in scope.
Why existing code doesn't prevent it
run_benchmark_serving in benchmarks/benchmark_lib.sh is a pass-through wrapper: the flag is honored when supplied (see benchmark_lib.sh:351-353) but there is no default/fallback — if the script omits it, raw prompts are sent. All 17+ existing *_mtp.sh spec scripts (dsr1, glm5, qwen3.5, kimi MTP variants across AMD and NVIDIA) pass the flag; the new Eagle3 script is the lone omission.
Step-by-step proof
1. Script runs. Line 78 sets SPEC_CONFIG with method: "eagle3", model: lightseekorg/kimi-k2.5-eagle3, num_speculative_tokens: 7.
2. vLLM starts the target with the Eagle3 draft attached.
3. run_benchmark_serving is called at line 107 with --input-len, --output-len, --random-range-ratio, etc. — but no --use-chat-template.
4. The benchmark client emits the random dataset directly as tokenized input rather than wrapping it in the model's chat template.
5. The Eagle3 draft head (trained on prompts formatted as <|im_start|>user ... <|im_end|><|im_start|>assistant ...) now sees raw random tokens — off-distribution input.
6. Acceptance rate of speculated tokens drops. This is the exact failure mode perf-changelog.yaml:467-472 documented for DeepSeek-MTP in PR #647 ("Add benchmark script code style guidelines to PR review workflow"): "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP." — same root cause, opposite-sign artifact.
7. Reported tok/s/gpu diverges from the real number customers would see on chat-formatted traffic — directly defeating the purpose of the benchmark.
Impact
Performance numbers published from this script would be systematically wrong and not comparable to the other kimik2.5 or spec-decoding entries in the results. Since the entire point of adding an Eagle3 config is to measure the speedup from Eagle3, a broken acceptance rate undermines the benchmark's validity.
Fix
One-line change in the run_benchmark_serving call (around lines 107-116):
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
+ --use-chat-template \
  --trust-remote-code

precision: str
framework: str
- spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+ spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
🟡 README docs at .github/workflows/README.md:238 still illustrate the spec-decoding Literal as Literal["mtp", "draft_model", "none"], but this PR extends it to include eagle3 at four sites in utils/matrix_logic/validation.py (lines 80, 119, 207, 227). Nit — please update that README line to include eagle3 (or rephrase to avoid enumerating) so future config authors know eagle3 is a valid value.
Extended reasoning...
What the bug is
This PR extends the spec_decoding Pydantic Literal from {mtp, draft_model, none} to {mtp, draft_model, eagle3, none} at four call sites in utils/matrix_logic/validation.py:
- Line 80 — SingleNodeMatrixEntry.spec_decoding
- Line 119 — MultiNodeMatrixEntry.spec_decoding
- Line 207 — SingleNodeSearchSpaceEntry.spec_decoding
- Line 227 — MultiNodeSearchSpaceEntry.spec_decoding
However, the developer-facing documentation at .github/workflows/README.md:238 still reads verbatim:
Strict typing: Fields like spec-decoding use Literal["mtp", "draft_model", "none"] to restrict values to known options.
This enumeration is now stale — it omits eagle3, which is a valid value after this PR.
Why existing code doesn't prevent it
The README is a free-text Markdown illustrative example, not machine-read. Nothing in CI cross-references the README string against validation.py, so there is no automatic enforcement that the two stay in sync. The code (Pydantic Literal) remains the source of truth and will correctly accept eagle3 at runtime — so there is no functional regression. It's a pure documentation drift.
Impact
A future config author reading the README to learn what spec-decoding values are allowed will see only three options and not realize eagle3 is valid. They may either (a) try eagle3 anyway and hit no issue, or (b) incorrectly assume the option doesn't exist and add another value. Mild confusion, not broken behavior.
Step-by-step proof
1. Check the README at .github/workflows/README.md:238: it contains the string Literal["mtp", "draft_model", "none"].
2. Check utils/matrix_logic/validation.py:80 (and 119, 207, 227) in this PR: all four sites now declare Literal["mtp", "draft_model", "eagle3", "none"].
3. The README string is therefore a proper subset of the actual Literal values, and misstates the illustrative example.
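As a quick illustrative check (not repo code), the accepted set of a four-way Literal like the one this PR declares can be enumerated with Python's typing.get_args, invoked here from the shell:

```shell
# Illustrative only: enumerate the values a Literal accepts, mirroring the
# extended spec_decoding field in utils/matrix_logic/validation.py.
OUT="$(python3 -c 'from typing import Literal, get_args; print(",".join(get_args(Literal["mtp", "draft_model", "eagle3", "none"])))')"
echo "$OUT"   # mtp,draft_model,eagle3,none
```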
How to fix
Two easy options at .github/workflows/README.md:238:

- Update the enumeration to match: Literal["mtp", "draft_model", "eagle3", "none"].
- Or rephrase to avoid enumerating, e.g., "Fields like spec-decoding use a Literal type to restrict values to a fixed set of known options (see validation.py for the current set)." — this avoids the same drift the next time a value is added.
# Eagle3 speculative config. Single-quoted JSON passed as one arg so spaces/braces
# survive bash word-splitting when expanded into the vllm invocation below.
SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"
🟡 Nit: the comment at lines 76-77 says "Single-quoted JSON passed as one arg", but the actual code on line 78 is a double-quoted bash string with backslash-escaped inner double quotes (SPEC_CONFIG="{\"model\":...}") — it has to be double-quoted so ${SPEC_DRAFT_MODEL}, ${SPEC_NUM_TOKENS}, and ${SPEC_DRAFT_TP} expand. The code itself works correctly; only the comment is wrong. Suggest updating to something like: "Double-quoted JSON string with escaped inner quotes, expanded as a single arg so spaces/braces survive bash word-splitting."
Extended reasoning...
What the bug is
Lines 76-77 contain a comment that misdescribes how SPEC_CONFIG is constructed:
# Eagle3 speculative config. Single-quoted JSON passed as one arg so spaces/braces
# survive bash word-splitting when expanded into the vllm invocation below.
SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"

The comment claims the JSON is single-quoted, but line 78 uses a double-quoted bash string with backslash-escaped inner " characters. These are two materially different shell mechanisms.
Why the code must use double quotes (not single)
Single-quoted shell strings ('...') do not perform variable expansion. The author needs ${SPEC_DRAFT_MODEL}, ${SPEC_NUM_TOKENS}, and ${SPEC_DRAFT_TP} to be substituted into the JSON payload, so single-quoting here would have produced literal ${SPEC_DRAFT_MODEL} text and caused vLLM to fail. Double-quoting with \" escapes for the inner JSON quotes is the correct mechanism — it just isn't what the comment says.
Step-by-step proof
1. Suppose SPEC_DRAFT_MODEL=lightseekorg/kimi-k2.5-eagle3, SPEC_NUM_TOKENS=7, SPEC_DRAFT_TP=1.
2. After line 78 evaluates, SPEC_CONFIG contains {"model":"lightseekorg/kimi-k2.5-eagle3","method":"eagle3","num_speculative_tokens":7,"draft_tensor_parallel_size":1} — variable expansion happened because of the outer double quotes.
3. If the code were genuinely single-quoted as the comment describes (SPEC_CONFIG='{"model":"${SPEC_DRAFT_MODEL}",...}'), SPEC_CONFIG would contain the literal text ${SPEC_DRAFT_MODEL}, and vLLM would reject the config because the model string is not a real path.
4. Line 98 passes "$SPEC_CONFIG" — the outer double quotes on expansion preserve the value as one argv element regardless of which method built it; that part is fine.
Impact
Zero runtime impact — the code works. The issue is strictly a documentation/comment accuracy problem: a future maintainer reading the comment might try to "fix" the escaping to match what the comment says (e.g., replacing the \" escapes with a single-quoted literal), which would silently break variable expansion and produce a broken --speculative-config argument.
How to fix
Update the comment on lines 76-77 to accurately describe the mechanism, e.g.:
# Eagle3 speculative config. Double-quoted JSON string with escaped inner quotes so
# ${SPEC_DRAFT_MODEL}/${SPEC_NUM_TOKENS}/${SPEC_DRAFT_TP} expand; passed as one arg
# via "$SPEC_CONFIG" below so spaces/braces survive bash word-splitting.

This is a nit — worth a one-line fix but does not block the PR.
case "$SPEC_DECODING" in
  mtp)    SPEC_SUFFIX='_mtp' ;;
  eagle3) SPEC_SUFFIX='_eagle3' ;;
  *)      SPEC_SUFFIX='' ;;
esac
🟣 Pre-existing issue (not introduced by this PR). The new case statement at lines 182-186 maps mtp and eagle3, but the spec-decoding Literal in validation.py also accepts draft_model — which would silently fall through to the *) catch-all, yielding SPEC_SUFFIX='' and launching the non-spec benchmark script with SPEC_DECODING=draft_model still in the env. No master config uses draft_model today and the prior ternary had the same gap, but since this PR rewrote these exact lines it's a natural opportunity to fail fast on unknown spec values (either add an explicit error branch, or map draft_model once a corresponding _draft_model.sh exists). The same silent-fallthrough pattern also exists in launch_h200-nb.sh, launch_h200-cw.sh, launch_h200-dgxc-slurm.sh, launch_b200-nb.sh, launch_b200-dgxc-slurm.sh, and launch_b300-nv.sh.
Extended reasoning...
What the bug is
utils/matrix_logic/validation.py (lines 80, 119, 207, 227) declares the spec-decoding field as Literal["mtp", "draft_model", "eagle3", "none"], so any of those four values will pass config validation. The launcher case statement at runners/launch_mi355x-amds.sh:182-186 only handles two of them:
case "$SPEC_DECODING" in
  mtp)    SPEC_SUFFIX='_mtp' ;;
  eagle3) SPEC_SUFFIX='_eagle3' ;;
  *)      SPEC_SUFFIX='' ;;
esac

Both none and draft_model fall into the *) branch. For none that is intended (non-spec script). For draft_model it is silently wrong: the launcher will invoke benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi355x${FRAMEWORK_SUFFIX}.sh (the plain, non-spec variant) while SPEC_DECODING=draft_model is still exported into the container env.
Step-by-step proof
1. Someone adds a master-config entry with spec-decoding: draft_model under e.g. kimik2.5-fp4-mi355x-vllm-*. Validation accepts it because the Literal allows draft_model.
2. generate_sweep_configs.py emits a matrix entry where SPEC_DECODING=draft_model is exported by the workflow into the launcher's environment.
3. In launch_mi355x-amds.sh the case statement hits the *) branch → SPEC_SUFFIX=''.
4. The final command becomes bash benchmarks/single_node/kimik2.5_fp4_mi355x.sh (no _draft_model suffix, no _eagle3 suffix) — the plain non-spec recipe runs.
5. The reported result is tagged with the original run metadata saying spec-decoding=draft_model, but no speculative decoding actually happened. Silent divergence between requested and executed experiment.
Why existing code doesn't prevent it
The validation layer only checks that the string is one of the accepted literals — it has no notion that the launcher must have a matching _<name>.sh script. The launcher's default *) branch treats any unrecognised value as "no spec suffix", which is correct for none but masks misconfiguration for draft_model.
Addressing the refutations
One verifier argued this is purely hypothetical because (a) no master config uses draft_model, (b) no *_draft_model.sh script exists in the repo, (c) the same pattern exists in six sibling launchers, and (d) hardening is out of scope for this PR. Those points are all factually correct, which is why this is marked pre_existing and not a blocker. The reason it is still worth flagging: the PR rewrote this exact block from a ternary into a case statement to add eagle3, which is the natural moment to either (i) add an explicit error branch for unmapped-but-valid literals, or (ii) tighten the Literal to only include values with a corresponding script. Doing neither leaves a foot-gun for the next person who extends spec-decoding.
How to fix
Replace the silent catch-all with an explicit split between none (intended empty suffix) and unknown values (error):
case "$SPEC_DECODING" in
  mtp)     SPEC_SUFFIX='_mtp' ;;
  eagle3)  SPEC_SUFFIX='_eagle3' ;;
  none|'') SPEC_SUFFIX='' ;;
  *)       echo "ERROR: unsupported SPEC_DECODING=$SPEC_DECODING" >&2; exit 1 ;;
esac

The same hardening would apply to the other mi*/h*/b* launchers that share this pattern.
- benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh: add --use-chat-template to run_benchmark_serving. Eagle3 is EAGLE-family speculative decoding and the Eagle3 draft is trained on chat-formatted inputs; AGENTS.md requires this flag for all *_mtp.sh scripts and every peer script in the repo passes it. Without it, acceptance rate and throughput are silently distorted.
- benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh: fix the SPEC_CONFIG comment to accurately describe the double-quoted bash string mechanism (was mis-described as single-quoted).
- .github/workflows/README.md: rephrase the spec-decoding Literal example to point at utils/matrix_logic/validation.py as the source of truth, instead of enumerating values that drift whenever the Literal changes.

Addresses review comments on PR #1116.

Made-with: Cursor
Thanks for the PR @haic0 @chunfangamd. As we wrote in the March 2026 InferenceX v3 PDF for native MTP, we haven't thought much about guidelines for MTP on models that don't natively ship with it, since we didn't think we would include it for InferenceX v3 and have previously rejected submissions for it (#1026 (review)). Do you think we should include it for InferenceX v3? The questions that need some thought are:

We already have 3 different models that natively ship with MTP (deepseek, glm5, qwen3.5); is it worth spending time on MTP for models that don't come with it natively? @haic0 has some good initial answers that we should work together to flesh out over Slack before we decide on merging this or not (#1115 (comment)).
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys kimik2.5-fp4-mi355x-vllm-eagle3
@chunfangamd Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24802980019
Add Kimi-K2.5 eagle3 ISL/OSL=1K/8K experiments
Co-authored with Larry Li and Chao Li
Changes: