[AMD/ROCm] Add Kimi-K2.5 FP4 vLLM Eagle3 speculative decoding config for MI355X #1116

Draft
chunfangamd wants to merge 6 commits into main from chun-larry-chao/kimik2.5-fp4-eagle3

Conversation

@chunfangamd
Collaborator

Co-authored with Larry Li and Chao Li

Changes:

  • amd-master.yaml
  • benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh
  • Supports --block-size=16, though it is not used in this PR
  • utils/matrix_logic/validation.py (+ test_validation.py): extend the spec-decoding Literal from {mtp, draft_model, none} to include eagle3.
  • runners/launch_mi355x-amds.sh: extend SPEC_SUFFIX so SPEC_DECODING=eagle3 resolves to the _eagle3 benchmark script variant.

- Co-authored with Li, Larry and Li, Chao
Changes:
- amd-master.yaml: add kimik2.5-fp4-mi355x-vllm-eagle3 entry sweeping
  tp={4,8} x conc={4..64} on both 1k1k and 8k1k. Uses the vllm-rocm nightly
  required for Eagle3 support.
- benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh: new benchmark script
  that reuses the kimik2.5_fp4_mi355x.sh recipe and adds --speculative-config
  for the Eagle3 draft (lightseekorg/kimi-k2.5-eagle3,
  num_speculative_tokens=7, draft_tensor_parallel_size=1). --block-size is
  intentionally omitted: the Eagle3 draft uses standard attention on ROCm
  and no ROCm standard-attention backend supports block_size=1.
- utils/matrix_logic/validation.py (+ test_validation.py): extend the
  spec-decoding Literal from {mtp, draft_model, none} to include eagle3.
- runners/launch_mi355x-amds.sh: extend SPEC_SUFFIX so SPEC_DECODING=eagle3
  resolves to the _eagle3 benchmark script variant.
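
For orientation, a condensed sketch of the Eagle3 wiring described above (the SPEC_* values come from this description; the serve invocation shape is illustrative, not the script verbatim):

# Sketch only: condensed from the change description, not the actual script.
SPEC_DRAFT_MODEL=lightseekorg/kimi-k2.5-eagle3
SPEC_NUM_TOKENS=7
SPEC_DRAFT_TP=1
SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"

# --block-size intentionally omitted: no ROCm standard-attention backend
# supports block_size=1 for the Eagle3 draft.
vllm serve "$MODEL" \
    --tensor-parallel-size "$TP" \
    --speculative-config "$SPEC_CONFIG"
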
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

@chunfangamd chunfangamd self-assigned this Apr 22, 2026
Comment on lines +105 to +116
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code
Contributor


🔴 The new benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh does not pass --use-chat-template to run_benchmark_serving (lines 108-116). Eagle3 is EAGLE-style speculative decoding (the SPEC_CONFIG sets method: "eagle3" at line 78), so AGENTS.md:167 and .github/workflows/claude-pr-review.yml:277-296 both mandate this flag — without it the draft (trained on chat-formatted inputs) is benchmarked against raw prompts, silently regressing acceptance rate and producing misleading tok/s/gpu. Fix: add --use-chat-template to the run_benchmark_serving invocation, matching every other *_mtp.sh spec-decoding script in the repo.

Extended reasoning...

What the bug is

The new Eagle3 benchmark script benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh invokes run_benchmark_serving at lines 105-116 without the --use-chat-template flag. This is the only spec-decoding benchmark script in the repo that omits it.

Why the rule applies

Two independent pieces of repo policy make this a hard requirement for EAGLE-style spec-decoding scripts:

  1. AGENTS.md:162-167: "MTP scripts MUST pass --use-chat-template to run_benchmark_serving — no exceptions. EAGLE-style speculative decoding is trained against chat-formatted inputs, so benchmarking against raw prompts silently regresses acceptance rate and produces misleading numbers."
  2. .github/workflows/claude-pr-review.yml:277-296: classifies a missing --use-chat-template on an MTP/EAGLE benchmark as a BLOCKING issue.

The new script configures method: "eagle3" in SPEC_CONFIG (line 78) — literally EAGLE-family spec decoding, squarely in scope.

Why existing code doesn't prevent it

run_benchmark_serving in benchmarks/benchmark_lib.sh is a pass-through wrapper: the flag is honored when supplied (see benchmark_lib.sh:351-353) but there is no default/fallback — if the script omits it, raw prompts are sent. All 17+ existing *_mtp.sh spec scripts (dsr1, glm5, qwen3.5, kimi MTP variants across AMD and NVIDIA) pass the flag; the new Eagle3 script is the lone omission.
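
A minimal sketch of that pass-through shape (hypothetical; the actual run_benchmark_serving in benchmark_lib.sh and its client script name may differ):

run_benchmark_serving() {
    # All caller-supplied flags are forwarded verbatim; nothing injects
    # --use-chat-template by default, so omitting it means raw prompts are sent.
    python3 benchmark_serving.py "$@"
}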

Step-by-step proof

  1. Script runs. Line 78 sets SPEC_CONFIG with method: "eagle3", model: lightseekorg/kimi-k2.5-eagle3, num_speculative_tokens: 7.
  2. vLLM starts the target with the Eagle3 draft attached.
  3. run_benchmark_serving is called at line 107 with --input-len, --output-len, --random-range-ratio, etc. — but no --use-chat-template.
  4. The benchmark client emits the random dataset directly as tokenized input rather than wrapping it in the model's chat template.
  5. The Eagle3 draft head (trained on <|im_start|>user ... <|im_end|><|im_start|>assistant ... formatted prompts) now sees raw random tokens — off-distribution input.
  6. Acceptance rate of speculated tokens drops (this is the exact failure mode perf-changelog.yaml:467-472 documented for DeepSeek-MTP in PR #647, "Add benchmark script code style guidelines to PR review workflow": "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP." — same root cause, opposite-sign artifact).
  7. Reported tok/s/gpu diverges from the real number customers would see on chat-formatted traffic — directly defeating the purpose of the benchmark.

Impact

Performance numbers published from this script would be systematically wrong and not comparable to the other kimik2.5 or spec-decoding entries in the results. Since the entire point of adding an Eagle3 config is to measure the speedup from Eagle3, a broken acceptance rate undermines the benchmark's validity.

Fix

One-line change in the run_benchmark_serving call (around lines 107-116):

 run_benchmark_serving \
     --model "$MODEL" \
     --port "$PORT" \
     --backend vllm \
     --input-len "$ISL" \
     --output-len "$OSL" \
     --random-range-ratio "$RANDOM_RANGE_RATIO" \
     --num-prompts "$((CONC * 10))" \
     --max-concurrency "$CONC" \
     --result-filename "$RESULT_FILENAME" \
     --result-dir /workspace/ \
+    --use-chat-template \
     --trust-remote-code

 precision: str
 framework: str
-spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
Contributor


🟡 README docs at .github/workflows/README.md:238 still illustrate the spec-decoding Literal as Literal["mtp", "draft_model", "none"], but this PR extends it to include eagle3 at four sites in utils/matrix_logic/validation.py (lines 80, 119, 207, 227). Nit — please update that README line to include eagle3 (or rephrase to avoid enumerating) so future config authors know eagle3 is a valid value.

Extended reasoning...

What the bug is

This PR extends the spec_decoding Pydantic Literal from {mtp, draft_model, none} to {mtp, draft_model, eagle3, none} at four call sites in utils/matrix_logic/validation.py:

  • Line 80 — SingleNodeMatrixEntry.spec_decoding
  • Line 119 — MultiNodeMatrixEntry.spec_decoding
  • Line 207 — SingleNodeSearchSpaceEntry.spec_decoding
  • Line 227 — MultiNodeSearchSpaceEntry.spec_decoding

However, the developer-facing documentation at .github/workflows/README.md:238 still reads verbatim:

Strict typing: Fields like spec-decoding use Literal["mtp", "draft_model", "none"] to restrict values to known options.

This enumeration is now stale — it omits eagle3, which is a valid value after this PR.

Why existing code doesn't prevent it

The README is a free-text Markdown illustrative example, not machine-read. Nothing in CI cross-references the README string against validation.py, so there is no automatic enforcement that the two stay in sync. The code (Pydantic Literal) remains the source of truth and will correctly accept eagle3 at runtime — so there is no functional regression. It's a pure documentation drift.

Impact

A future config author reading the README to learn what spec-decoding values are allowed will see only three options and not realize eagle3 is valid. They may either (a) try eagle3 anyway and hit no issue, or (b) incorrectly assume the option doesn't exist and add another value. Mild confusion, not broken behavior.

Step-by-step proof

  1. Check the README at .github/workflows/README.md:238: it contains the string Literal["mtp", "draft_model", "none"].
  2. Check utils/matrix_logic/validation.py:80 (and 119, 207, 227) in this PR: all four sites now declare Literal["mtp", "draft_model", "eagle3", "none"].
  3. The README string is therefore a proper subset of the actual Literal values, and misstates the illustrative example.

How to fix

Two easy options at .github/workflows/README.md:238:

  • Update the enumeration to match: Literal["mtp", "draft_model", "eagle3", "none"].
  • Or rephrase to avoid enumerating, e.g., "Fields like spec-decoding use a Literal type to restrict values to a fixed set of known options (see validation.py for the current set)." — this avoids the same drift the next time a value is added.

Comment on lines +76 to +78
# Eagle3 speculative config. Single-quoted JSON passed as one arg so spaces/braces
# survive bash word-splitting when expanded into the vllm invocation below.
SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"
Contributor


🟡 Nit: the comment at lines 76-77 says "Single-quoted JSON passed as one arg", but the actual code on line 78 is a double-quoted bash string with backslash-escaped inner double quotes (SPEC_CONFIG="{\"model\":...}") — it has to be double-quoted so ${SPEC_DRAFT_MODEL}, ${SPEC_NUM_TOKENS}, and ${SPEC_DRAFT_TP} expand. The code itself works correctly; only the comment is wrong. Suggest updating to something like: "Double-quoted JSON string with escaped inner quotes, expanded as a single arg so spaces/braces survive bash word-splitting."

Extended reasoning...

What the bug is

Lines 76-77 contain a comment that misdescribes how SPEC_CONFIG is constructed:

# Eagle3 speculative config. Single-quoted JSON passed as one arg so spaces/braces
# survive bash word-splitting when expanded into the vllm invocation below.
SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"

The comment claims the JSON is single-quoted, but line 78 uses a double-quoted bash string with backslash-escaped inner " characters. These are two materially different shell mechanisms.

Why the code must use double quotes (not single)

Single-quoted shell strings ('...') do not perform variable expansion. The author needs ${SPEC_DRAFT_MODEL}, ${SPEC_NUM_TOKENS}, and ${SPEC_DRAFT_TP} to be substituted into the JSON payload, so single-quoting here would have produced literal ${SPEC_DRAFT_MODEL} text and caused vLLM to fail. Double-quoting with \" escapes for the inner JSON quotes is the correct mechanism — it just isn't what the comment says.
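
A short shell demo of the distinction (illustrative values, not from the script):

DRAFT=lightseekorg/kimi-k2.5-eagle3
echo "{\"model\":\"${DRAFT}\"}"   # double quotes: {"model":"lightseekorg/kimi-k2.5-eagle3"}
echo '{"model":"${DRAFT}"}'       # single quotes: {"model":"${DRAFT}"} (no expansion)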

Step-by-step proof

  1. Suppose SPEC_DRAFT_MODEL=lightseekorg/kimi-k2.5-eagle3, SPEC_NUM_TOKENS=7, SPEC_DRAFT_TP=1.
  2. After line 78 evaluates, SPEC_CONFIG contains: {"model":"lightseekorg/kimi-k2.5-eagle3","method":"eagle3","num_speculative_tokens":7,"draft_tensor_parallel_size":1} — variable expansion happened because of the outer double quotes.
  3. If the code were genuinely single-quoted as the comment describes (SPEC_CONFIG='{"model":"${SPEC_DRAFT_MODEL}",...}'), SPEC_CONFIG would contain the literal text ${SPEC_DRAFT_MODEL}, and vLLM would reject the config because the model string is not a real path.
  4. Line 98 passes "$SPEC_CONFIG" — the outer double-quotes on expansion preserve the value as one argv element regardless of which method built it; that part is fine.

Impact

Zero runtime impact — the code works. The issue is strictly a documentation/comment accuracy problem: a future maintainer reading the comment might try to "fix" the escaping to match what the comment says (e.g., replacing the \" escapes with a single-quoted literal), which would silently break variable expansion and produce a broken --speculative-config argument.

How to fix

Update the comment on lines 76-77 to accurately describe the mechanism, e.g.:

# Eagle3 speculative config. Double-quoted JSON string with escaped inner quotes so
# ${SPEC_DRAFT_MODEL}/${SPEC_NUM_TOKENS}/${SPEC_DRAFT_TP} expand; passed as one arg
# via "$SPEC_CONFIG" below so spaces/braces survive bash word-splitting.

This is a nit — worth a one-line fix but does not block the PR.

Comment on lines +182 to +186
case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    *)      SPEC_SUFFIX='' ;;
esac
Contributor


🟣 Pre-existing issue (not introduced by this PR). The new case statement at lines 182-186 maps mtp and eagle3, but the spec-decoding Literal in validation.py also accepts draft_model — which would silently fall through to the *) catch-all, yielding SPEC_SUFFIX='' and launching the non-spec benchmark script with SPEC_DECODING=draft_model still in the env. No master config uses draft_model today and the prior ternary had the same gap, but since this PR rewrote these exact lines it's a natural opportunity to fail fast on unknown spec values (either add an explicit error branch, or map draft_model once a corresponding _draft_model.sh exists). The same silent-fallthrough pattern also exists in launch_h200-nb.sh, launch_h200-cw.sh, launch_h200-dgxc-slurm.sh, launch_b200-nb.sh, launch_b200-dgxc-slurm.sh, and launch_b300-nv.sh.

Extended reasoning...

What the bug is

utils/matrix_logic/validation.py (lines 80, 119, 207, 227) declares the spec-decoding field as Literal["mtp", "draft_model", "eagle3", "none"], so any of those four values will pass config validation. The launcher case statement at runners/launch_mi355x-amds.sh:182-186 only handles two of them:

case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    *)      SPEC_SUFFIX='' ;;
esac

Both none and draft_model fall into the *) branch. For none that is intended (non-spec script). For draft_model it is silently wrong: the launcher will invoke benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi355x${FRAMEWORK_SUFFIX}.sh (the plain, non-spec variant) while SPEC_DECODING=draft_model is still exported into the container env.
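
An illustrative trace of the fallthrough (the variable values and the exact path template are assumptions; the real template lives in the launcher):

EXP_NAME=kimik2.5_1k1k PRECISION=fp4 FRAMEWORK_SUFFIX='' SPEC_DECODING=draft_model

case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    *)      SPEC_SUFFIX='' ;;   # draft_model silently lands here
esac

echo "benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi355x${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
# prints: benchmarks/single_node/kimik2.5_fp4_mi355x.sh (the plain, non-spec recipe)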

Step-by-step proof

  1. Someone adds a master-config entry with spec-decoding: draft_model under e.g. kimik2.5-fp4-mi355x-vllm-*. Validation accepts it because the Literal allows draft_model.
  2. generate_sweep_configs.py emits a matrix entry where SPEC_DECODING=draft_model is exported by the workflow into the launcher's environment.
  3. In launch_mi355x-amds.sh the case statement hits the *) branch → SPEC_SUFFIX=''.
  4. The final command becomes bash benchmarks/single_node/kimik2.5_fp4_mi355x.sh (no _draft_model suffix, no _eagle3 suffix) — the plain non-spec recipe runs.
  5. The reported result is tagged with the original run metadata saying spec-decoding=draft_model, but no speculative decoding actually happened. Silent divergence between requested and executed experiment.

Why existing code doesn't prevent it

The validation layer only checks that the string is one of the accepted literals — it has no notion that the launcher must have a matching _<name>.sh script. The launcher's default *) branch treats any unrecognised value as "no spec suffix", which is correct for none but masks misconfiguration for draft_model.

Addressing the refutations

One verifier argued this is purely hypothetical because (a) no master config uses draft_model, (b) no *_draft_model.sh script exists in the repo, (c) the same pattern exists in six sibling launchers, and (d) hardening is out of scope for this PR. Those points are all factually correct, which is why this is marked pre_existing and not a blocker. The reason it is still worth flagging: the PR rewrote this exact block from a ternary into a case statement to add eagle3, which is the natural moment to either (i) add an explicit error branch for unmapped-but-valid literals, or (ii) tighten the Literal to only include values with a corresponding script. Doing neither leaves a foot-gun for the next person who extends spec-decoding.

How to fix

Replace the silent catch-all with an explicit split between none (intended empty suffix) and unknown values (error):

case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    none|'') SPEC_SUFFIX='' ;;
    *) echo "ERROR: unsupported SPEC_DECODING=$SPEC_DECODING" >&2; exit 1 ;;
esac

The same hardening would apply to the other mi*/h*/b* launchers that share this pattern.

- benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh: add
  --use-chat-template to run_benchmark_serving. Eagle3 is EAGLE-family
  speculative decoding and the Eagle3 draft is trained on chat-formatted
  inputs; AGENTS.md requires this flag for all *_mtp.sh scripts and every
  peer script in the repo passes it. Without it, acceptance rate and
  throughput are silently distorted.
- benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh: fix the
  SPEC_CONFIG comment to accurately describe the double-quoted bash
  string mechanism (was mis-described as single-quoted).
- .github/workflows/README.md: rephrase the spec-decoding Literal
  example to point at utils/matrix_logic/validation.py as the source of
  truth, instead of enumerating values that drift whenever the Literal
  changes.

Addresses review comments on PR #1116.

Made-with: Cursor
@functionstackx functionstackx marked this pull request as draft April 22, 2026 20:48
@functionstackx
Contributor

thanks for the PR @haic0 @chunfangamd

As we wrote in the March 2026 InferenceX v3 PDF for native MTP, we haven't thought much about the guidelines for MTP for models that don't natively ship with it, since we didn't think we would include it for InferenceX v3 & have previously rejected submissions for it (#1026 (review)). Do you think we should include it for InferenceX v3? The questions that need some thought are:

  1. what speculator weight & speculator # of layers/arch should be used?
  2. Can different vendors submit different speculator weights?
  3. Can different vendors submit different speculators? If so, how do you maintain a similar acceptance rate between vendors? i.e. vllm-project/vllm#38045 ([Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling): --speculative-config '{"method": "eagle", "model": <eagle_model>, "num_speculative_tokens": 3, "rejection_sample_method": "synthetic", "synthetic_acceptance_rate": <test-value>}'. If you are doing this, how do we ensure that the specific AR distribution implementation is similar across different frameworks?
  4. should we be using lightseekorg/kimi-k2.6-eagle3? https://github.com/vllm-project/recipes/pull/347/changes

We already have 3 different models that natively ship with MTP (deepseek, glm5, qwen3.5), is it worth it to spend time thinking about MTP for models that don't come natively with MTP?

@haic0 has some good initial answers that we should work together to flesh out over Slack before we decide on merging this or not (#1115 (comment))

@chunfangamd
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys kimik2.5-fp4-mi355x-vllm-eagle3

@github-actions
Contributor

@chunfangamd Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24802980019
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys kimik2.5-fp4-mi355x-vllm-eagle3
Pinned ref: a3b35a3
Approval: not required (trusted collaborator).

Add Kimi-K2.5 eagle3 ISL/OSL=1K/8K experiments