[AMD/ROCm] Add Kimi-K2.5 FP4 vLLM Eagle3 speculative decoding config for MI355X #1116

Draft
chunfangamd wants to merge 6 commits into main from chun-larry-chao/kimik2.5-fp4-eagle3

Conversation

@chunfangamd
Collaborator

Co-authored with Larry Li and Chao Li

Changes:

  • amd-master.yaml
  • benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh
  • Supports --block-size=16, though it is not used in this PR
  • utils/matrix_logic/validation.py (+ test_validation.py): extend the spec-decoding Literal from {mtp, draft_model, none} to include eagle3.
  • runners/launch_mi355x-amds.sh: extend SPEC_SUFFIX so SPEC_DECODING=eagle3 resolves to the _eagle3 benchmark script variant.

- Co-authored with Li, Larry and Li, Chao
Changes:
- amd-master.yaml: add kimik2.5-fp4-mi355x-vllm-eagle3 entry sweeping
  tp={4,8} x conc={4..64} on both 1k1k and 8k1k. Uses the vllm-rocm nightly
  required for Eagle3 support.
- benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh: new benchmark script
  that reuses the kimik2.5_fp4_mi355x.sh recipe and adds --speculative-config
  for the Eagle3 draft (lightseekorg/kimi-k2.5-eagle3,
  num_speculative_tokens=7, draft_tensor_parallel_size=1). --block-size is
  intentionally omitted: the Eagle3 draft uses standard attention on ROCm
  and no ROCm standard-attention backend supports block_size=1.
- utils/matrix_logic/validation.py (+ test_validation.py): extend the
  spec-decoding Literal from {mtp, draft_model, none} to include eagle3.
- runners/launch_mi355x-amds.sh: extend SPEC_SUFFIX so SPEC_DECODING=eagle3
  resolves to the _eagle3 benchmark script variant.
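
For orientation, a condensed sketch of the Eagle3 wiring described above (the SPEC_* values come from this description; the serve invocation shape is illustrative, not the script verbatim):

# Sketch only: condensed from the change description, not the actual script.
SPEC_DRAFT_MODEL=lightseekorg/kimi-k2.5-eagle3
SPEC_NUM_TOKENS=7
SPEC_DRAFT_TP=1
SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"

# --block-size intentionally omitted: no ROCm standard-attention backend
# supports block_size=1 for the Eagle3 draft.
vllm serve "$MODEL" \
    --tensor-parallel-size "$TP" \
    --speculative-config "$SPEC_CONFIG"
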
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

@chunfangamd chunfangamd self-assigned this Apr 22, 2026
Comment on lines +105 to +116
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code
Contributor


🔴 The new benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh does not pass --use-chat-template to run_benchmark_serving (lines 108-116). Eagle3 is EAGLE-style speculative decoding (the SPEC_CONFIG sets method: "eagle3" at line 78), so AGENTS.md:167 and .github/workflows/claude-pr-review.yml:277-296 both mandate this flag — without it the draft (trained on chat-formatted inputs) is benchmarked against raw prompts, silently regressing acceptance rate and producing misleading tok/s/gpu. Fix: add --use-chat-template to the run_benchmark_serving invocation, matching every other *_mtp.sh spec-decoding script in the repo.

Extended reasoning...

What the bug is

The new Eagle3 benchmark script benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh invokes run_benchmark_serving at lines 105-116 without the --use-chat-template flag. This is the only spec-decoding benchmark script in the repo that omits it.

Why the rule applies

Two independent pieces of repo policy make this a hard requirement for EAGLE-style spec-decoding scripts:

  1. AGENTS.md:162-167: "MTP scripts MUST pass --use-chat-template to run_benchmark_serving — no exceptions. EAGLE-style speculative decoding is trained against chat-formatted inputs, so benchmarking against raw prompts silently regresses acceptance rate and produces misleading numbers."
  2. .github/workflows/claude-pr-review.yml:277-296: classifies a missing --use-chat-template on an MTP/EAGLE benchmark as a BLOCKING issue.

The new script configures method: "eagle3" in SPEC_CONFIG (line 78) — literally EAGLE-family spec decoding, squarely in scope.

Why existing code doesn't prevent it

run_benchmark_serving in benchmarks/benchmark_lib.sh is a pass-through wrapper: the flag is honored when supplied (see benchmark_lib.sh:351-353) but there is no default/fallback — if the script omits it, raw prompts are sent. All 17+ existing *_mtp.sh spec scripts (dsr1, glm5, qwen3.5, kimi MTP variants across AMD and NVIDIA) pass the flag; the new Eagle3 script is the lone omission.
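
A minimal sketch of that pass-through shape (hypothetical; the actual run_benchmark_serving in benchmark_lib.sh and its client script name may differ):

run_benchmark_serving() {
    # All caller-supplied flags are forwarded verbatim; nothing injects
    # --use-chat-template by default, so omitting it means raw prompts are sent.
    python3 benchmark_serving.py "$@"
}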

Step-by-step proof

  1. Script runs. Line 78 sets SPEC_CONFIG with method: "eagle3", model: lightseekorg/kimi-k2.5-eagle3, num_speculative_tokens: 7.
  2. vLLM starts the target with the Eagle3 draft attached.
  3. run_benchmark_serving is called at line 107 with --input-len, --output-len, --random-range-ratio, etc. — but no --use-chat-template.
  4. The benchmark client emits the random dataset directly as tokenized input rather than wrapping it in the model's chat template.
  5. The Eagle3 draft head (trained on <|im_start|>user ... <|im_end|><|im_start|>assistant ... formatted prompts) now sees raw random tokens — off-distribution input.
  6. Acceptance rate of speculated tokens drops (this is the exact failure mode perf-changelog.yaml:467-472 documented for DeepSeek-MTP in PR #647, "Add benchmark script code style guidelines to PR review workflow": "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP." — same root cause, opposite-sign artifact).
  7. Reported tok/s/gpu diverges from the real number customers would see on chat-formatted traffic — directly defeating the purpose of the benchmark.

Impact

Performance numbers published from this script would be systematically wrong and not comparable to the other kimik2.5 or spec-decoding entries in the results. Since the entire point of adding an Eagle3 config is to measure the speedup from Eagle3, a broken acceptance rate undermines the benchmark's validity.

Fix

One-line change in the run_benchmark_serving call (around lines 107-116):

 run_benchmark_serving \
     --model "$MODEL" \
     --port "$PORT" \
     --backend vllm \
     --input-len "$ISL" \
     --output-len "$OSL" \
     --random-range-ratio "$RANDOM_RANGE_RATIO" \
     --num-prompts "$((CONC * 10))" \
     --max-concurrency "$CONC" \
     --result-filename "$RESULT_FILENAME" \
     --result-dir /workspace/ \
+    --use-chat-template \
     --trust-remote-code

 precision: str
 framework: str
-spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
Contributor


🟡 README docs at .github/workflows/README.md:238 still illustrate the spec-decoding Literal as Literal["mtp", "draft_model", "none"], but this PR extends it to include eagle3 at four sites in utils/matrix_logic/validation.py (lines 80, 119, 207, 227). Nit — please update that README line to include eagle3 (or rephrase to avoid enumerating) so future config authors know eagle3 is a valid value.

Extended reasoning...

What the bug is

This PR extends the spec_decoding Pydantic Literal from {mtp, draft_model, none} to {mtp, draft_model, eagle3, none} at four call sites in utils/matrix_logic/validation.py:

  • Line 80 — SingleNodeMatrixEntry.spec_decoding
  • Line 119 — MultiNodeMatrixEntry.spec_decoding
  • Line 207 — SingleNodeSearchSpaceEntry.spec_decoding
  • Line 227 — MultiNodeSearchSpaceEntry.spec_decoding

However, the developer-facing documentation at .github/workflows/README.md:238 still reads verbatim:

Strict typing: Fields like spec-decoding use Literal["mtp", "draft_model", "none"] to restrict values to known options.

This enumeration is now stale — it omits eagle3, which is a valid value after this PR.

Why existing code doesn't prevent it

The README is a free-text Markdown illustrative example, not machine-read. Nothing in CI cross-references the README string against validation.py, so there is no automatic enforcement that the two stay in sync. The code (Pydantic Literal) remains the source of truth and will correctly accept eagle3 at runtime — so there is no functional regression. It's a pure documentation drift.

Impact

A future config author reading the README to learn what spec-decoding values are allowed will see only three options and not realize eagle3 is valid. They may either (a) try eagle3 anyway and hit no issue, or (b) incorrectly assume the option doesn't exist and add another value. Mild confusion, not broken behavior.

Step-by-step proof

  1. Check the README at .github/workflows/README.md:238: it contains the string Literal["mtp", "draft_model", "none"].
  2. Check utils/matrix_logic/validation.py:80 (and 119, 207, 227) in this PR: all four sites now declare Literal["mtp", "draft_model", "eagle3", "none"].
  3. The README string is therefore a proper subset of the actual Literal values, and misstates the illustrative example.

How to fix

Two easy options at .github/workflows/README.md:238:

  • Update the enumeration to match: Literal["mtp", "draft_model", "eagle3", "none"].
  • Or rephrase to avoid enumerating, e.g., "Fields like spec-decoding use a Literal type to restrict values to a fixed set of known options (see validation.py for the current set)." — this avoids the same drift the next time a value is added.

Comment on lines +76 to +78
# Eagle3 speculative config. Single-quoted JSON passed as one arg so spaces/braces
# survive bash word-splitting when expanded into the vllm invocation below.
SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"
Contributor


🟡 Nit: the comment at lines 76-77 says "Single-quoted JSON passed as one arg", but the actual code on line 78 is a double-quoted bash string with backslash-escaped inner double quotes (SPEC_CONFIG="{\"model\":...}") — it has to be double-quoted so ${SPEC_DRAFT_MODEL}, ${SPEC_NUM_TOKENS}, and ${SPEC_DRAFT_TP} expand. The code itself works correctly; only the comment is wrong. Suggest updating to something like: "Double-quoted JSON string with escaped inner quotes, expanded as a single arg so spaces/braces survive bash word-splitting."

Extended reasoning...

What the bug is

Lines 76-77 contain a comment that misdescribes how SPEC_CONFIG is constructed:

# Eagle3 speculative config. Single-quoted JSON passed as one arg so spaces/braces
# survive bash word-splitting when expanded into the vllm invocation below.
SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"

The comment claims the JSON is single-quoted, but line 78 uses a double-quoted bash string with backslash-escaped inner " characters. These are two materially different shell mechanisms.

Why the code must use double quotes (not single)

Single-quoted shell strings ('...') do not perform variable expansion. The author needs ${SPEC_DRAFT_MODEL}, ${SPEC_NUM_TOKENS}, and ${SPEC_DRAFT_TP} to be substituted into the JSON payload, so single-quoting here would have produced literal ${SPEC_DRAFT_MODEL} text and caused vLLM to fail. Double-quoting with \" escapes for the inner JSON quotes is the correct mechanism — it just isn't what the comment says.
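
A short shell demo of the distinction (illustrative values, not from the script):

DRAFT=lightseekorg/kimi-k2.5-eagle3
echo "{\"model\":\"${DRAFT}\"}"   # double quotes: {"model":"lightseekorg/kimi-k2.5-eagle3"}
echo '{"model":"${DRAFT}"}'       # single quotes: {"model":"${DRAFT}"} (no expansion)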

Step-by-step proof

  1. Suppose SPEC_DRAFT_MODEL=lightseekorg/kimi-k2.5-eagle3, SPEC_NUM_TOKENS=7, SPEC_DRAFT_TP=1.
  2. After line 78 evaluates, SPEC_CONFIG contains: {"model":"lightseekorg/kimi-k2.5-eagle3","method":"eagle3","num_speculative_tokens":7,"draft_tensor_parallel_size":1} — variable expansion happened because of the outer double quotes.
  3. If the code were genuinely single-quoted as the comment describes (SPEC_CONFIG='{"model":"${SPEC_DRAFT_MODEL}",...}'), SPEC_CONFIG would contain the literal text ${SPEC_DRAFT_MODEL}, and vLLM would reject the config because the model string is not a real path.
  4. Line 98 passes "$SPEC_CONFIG" — the outer double-quotes on expansion preserve the value as one argv element regardless of which method built it; that part is fine.

Impact

Zero runtime impact — the code works. The issue is strictly a documentation/comment accuracy problem: a future maintainer reading the comment might try to "fix" the escaping to match what the comment says (e.g., replacing the \" escapes with a single-quoted literal), which would silently break variable expansion and produce a broken --speculative-config argument.

How to fix

Update the comment on lines 76-77 to accurately describe the mechanism, e.g.:

# Eagle3 speculative config. Double-quoted JSON string with escaped inner quotes so
# ${SPEC_DRAFT_MODEL}/${SPEC_NUM_TOKENS}/${SPEC_DRAFT_TP} expand; passed as one arg
# via "$SPEC_CONFIG" below so spaces/braces survive bash word-splitting.

This is a nit — worth a one-line fix but does not block the PR.

Comment on lines +182 to +186
case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    *)      SPEC_SUFFIX='' ;;
esac
Contributor


🟣 Pre-existing issue (not introduced by this PR). The new case statement at lines 182-186 maps mtp and eagle3, but the spec-decoding Literal in validation.py also accepts draft_model — which would silently fall through to the *) catch-all, yielding SPEC_SUFFIX='' and launching the non-spec benchmark script with SPEC_DECODING=draft_model still in the env. No master config uses draft_model today and the prior ternary had the same gap, but since this PR rewrote these exact lines it's a natural opportunity to fail fast on unknown spec values (either add an explicit error branch, or map draft_model once a corresponding _draft_model.sh exists). The same silent-fallthrough pattern also exists in launch_h200-nb.sh, launch_h200-cw.sh, launch_h200-dgxc-slurm.sh, launch_b200-nb.sh, launch_b200-dgxc-slurm.sh, and launch_b300-nv.sh.

Extended reasoning...

What the bug is

utils/matrix_logic/validation.py (lines 80, 119, 207, 227) declares the spec-decoding field as Literal["mtp", "draft_model", "eagle3", "none"], so any of those four values will pass config validation. The launcher case statement at runners/launch_mi355x-amds.sh:182-186 only handles two of them:

case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    *)      SPEC_SUFFIX='' ;;
esac

Both none and draft_model fall into the *) branch. For none that is intended (non-spec script). For draft_model it is silently wrong: the launcher will invoke benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi355x${FRAMEWORK_SUFFIX}.sh (the plain, non-spec variant) while SPEC_DECODING=draft_model is still exported into the container env.
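
An illustrative trace of the fallthrough (the variable values and the exact path template are assumptions; the real template lives in the launcher):

EXP_NAME=kimik2.5_1k1k PRECISION=fp4 FRAMEWORK_SUFFIX='' SPEC_DECODING=draft_model

case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    *)      SPEC_SUFFIX='' ;;   # draft_model silently lands here
esac

echo "benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi355x${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
# prints: benchmarks/single_node/kimik2.5_fp4_mi355x.sh (the plain, non-spec recipe)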

Step-by-step proof

  1. Someone adds a master-config entry with spec-decoding: draft_model under e.g. kimik2.5-fp4-mi355x-vllm-*. Validation accepts it because the Literal allows draft_model.
  2. generate_sweep_configs.py emits a matrix entry where SPEC_DECODING=draft_model is exported by the workflow into the launcher's environment.
  3. In launch_mi355x-amds.sh the case statement hits the *) branch → SPEC_SUFFIX=''.
  4. The final command becomes bash benchmarks/single_node/kimik2.5_fp4_mi355x.sh (no _draft_model suffix, no _eagle3 suffix) — the plain non-spec recipe runs.
  5. The reported result is tagged with the original run metadata saying spec-decoding=draft_model, but no speculative decoding actually happened. Silent divergence between requested and executed experiment.

Why existing code doesn't prevent it

The validation layer only checks that the string is one of the accepted literals — it has no notion that the launcher must have a matching _<name>.sh script. The launcher's default *) branch treats any unrecognised value as "no spec suffix", which is correct for none but masks misconfiguration for draft_model.

Addressing the refutations

One verifier argued this is purely hypothetical because (a) no master config uses draft_model, (b) no *_draft_model.sh script exists in the repo, (c) the same pattern exists in six sibling launchers, and (d) hardening is out of scope for this PR. Those points are all factually correct, which is why this is marked pre_existing and not a blocker. The reason it is still worth flagging: the PR rewrote this exact block from a ternary into a case statement to add eagle3, which is the natural moment to either (i) add an explicit error branch for unmapped-but-valid literals, or (ii) tighten the Literal to only include values with a corresponding script. Doing neither leaves a foot-gun for the next person who extends spec-decoding.

How to fix

Replace the silent catch-all with an explicit split between none (intended empty suffix) and unknown values (error):

case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    none|'') SPEC_SUFFIX='' ;;
    *) echo "ERROR: unsupported SPEC_DECODING=$SPEC_DECODING" >&2; exit 1 ;;
esac

The same hardening would apply to the other mi*/h*/b* launchers that share this pattern.

- benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh: add
  --use-chat-template to run_benchmark_serving. Eagle3 is EAGLE-family
  speculative decoding and the Eagle3 draft is trained on chat-formatted
  inputs; AGENTS.md requires this flag for all *_mtp.sh scripts and every
  peer script in the repo passes it. Without it, acceptance rate and
  throughput are silently distorted.
- benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh: fix the
  SPEC_CONFIG comment to accurately describe the double-quoted bash
  string mechanism (was mis-described as single-quoted).
- .github/workflows/README.md: rephrase the spec-decoding Literal
  example to point at utils/matrix_logic/validation.py as the source of
  truth, instead of enumerating values that drift whenever the Literal
  changes.

Addresses review comments on PR #1116.

Made-with: Cursor
@functionstackx functionstackx marked this pull request as draft April 22, 2026 20:48
@functionstackx
Contributor

thanks for the PR @haic0 @chunfangamd

As we wrote in the March 2026 InferenceX v3 PDF for native MTP, we haven't thought much about the guidelines for MTP for models that don't natively ship with it, since we didn't think we would include it for InferenceX v3 & have previously rejected submissions for it (#1026 (review)). Do you think we should include it for InferenceX v3? The questions that need some thought are:

  1. what speculator weight & speculator # of layers/arch should be used?
  2. Can different vendors submit different speculator weights?
  3. Can different vendors submit different speculators? If so, how do you maintain a similar acceptance rate between vendors? i.e. vllm-project/vllm#38045 ([Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling): --speculative-config '{"method": "eagle", "model": <eagle_model>, "num_speculative_tokens": 3, "rejection_sample_method": "synthetic", "synthetic_acceptance_rate": <test-value>}'. If you are doing this, how do we ensure that the specific AR distribution implementation is similar across different frameworks?
  4. should we be using lightseekorg/kimi-k2.6-eagle3? https://github.com/vllm-project/recipes/pull/347/changes

We already have 3 different models that natively ship with MTP (deepseek, glm5, qwen3.5), is it worth it to spend time thinking about MTP for models that don't come natively with MTP?

@haic0 has some good initial answers that we should work together to flesh out over Slack before we decide on merging this or not (#1115 (comment))

@chunfangamd
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys kimik2.5-fp4-mi355x-vllm-eagle3

@github-actions
Contributor

@chunfangamd Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24802980019
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys kimik2.5-fp4-mi355x-vllm-eagle3
Pinned ref: a3b35a3
Approval: not required (trusted collaborator).

Add Kimi-K2.5 eagle3 ISL/OSL=1K/8K experiments