[AMD/ROCm] Add Kimi-K2.5 FP4 vLLM Eagle3 speculative decoding config for MI355X #1116
base: main
Changes from all commits: 7834799 · 1c51d2e · 8c8ee0f · e605f96 · a3b35a3 · c06b5e9
benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh (new file, +114 lines):

```bash
#!/usr/bin/env bash

# Kimi-K2.5 MXFP4 + Eagle3 speculative decoding on MI355X (vLLM).
# Adds `--speculative-config` on top of the plain kimik2.5_fp4_mi355x.sh flow.
#
# Draft model: lightseekorg/kimi-k2.5-eagle3 (~6 GB BF16, Eagle3 MTP head)
# Spec tokens: 7 (reproduced baseline: 764.1 +/- 35.7 tok/s/gpu @ TP=4, 1k1k, conc=64)
# Draft TP: 1 (draft runs on a single GPU; target occupies $TP)

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

# Draft model (Eagle3 head). Override via SPEC_DRAFT_MODEL if needed.
SPEC_DRAFT_MODEL="${SPEC_DRAFT_MODEL:-lightseekorg/kimi-k2.5-eagle3}"
SPEC_NUM_TOKENS="${SPEC_NUM_TOKENS:-7}"
SPEC_DRAFT_TP="${SPEC_DRAFT_TP:-1}"

hf download "$MODEL"
hf download "$SPEC_DRAFT_MODEL"

# Set HIP_VISIBLE_DEVICES to match ROCR_VISIBLE_DEVICES for Ray compatibility
# in vLLM 0.14+.
if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
    export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

# If the machine runs a MEC FW older than 177, RCCL cannot reclaim some
# memory. Disable that feature to avoid crashes. This is related to the
# driver changes described at:
# https://rocm.docs.amd.com/en/docs-6.4.3/about/release-notes.html#amdgpu-driver-updates
version=$(rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}')
if [[ "$version" == "" || $version -lt 177 ]]; then
    export HSA_NO_SCRATCH_RECLAIM=1
fi

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4

# Disable AITER RMSNorm for TP < 8 due to accuracy issues.
if [ "${TP}" -lt 8 ]; then
    export VLLM_ROCM_USE_AITER_RMSNORM=0
fi

if [ "${EP_SIZE:-0}" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"
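
# Note (illustration only, not part of the committed file): with the
# defaults above, the one-liner SPEC_CONFIG expands to
#   {"model": "lightseekorg/kimi-k2.5-eagle3", "method": "eagle3",
#    "num_speculative_tokens": 7, "draft_tensor_parallel_size": 1}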

# Start GPU monitoring (power, temperature, clocks every second).
start_gpu_monitor

set -x
vllm serve $MODEL --port $PORT \
    --tensor-parallel-size=$TP \
    $EP \
    --gpu-memory-utilization 0.90 \
    --max-model-len $MAX_MODEL_LEN \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --mm-encoder-tp-mode data \
    --speculative-config "$SPEC_CONFIG" > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for the server to be ready.
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true.
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring.
stop_gpu_monitor
set +x
```
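
For reference, a minimal invocation sketch; the checkpoint id, `MAX_MODEL_LEN`, `RANDOM_RANGE_RATIO`, and result filename below are placeholders (not taken from the PR), while `TP`/`CONC`/`ISL`/`OSL` mirror the reproduced baseline noted in the script header (TP=4, 1k1k, concurrency 64):

```bash
# Placeholder values; substitute the real FP4 checkpoint and tuned settings.
MODEL="<kimi-k2.5-fp4-checkpoint>" \
TP=4 \
CONC=64 \
ISL=1024 \
OSL=1024 \
MAX_MODEL_LEN=4096 \
RANDOM_RANGE_RATIO=1.0 \
RESULT_FILENAME=kimik2.5_fp4_eagle3_tp4_1k1k_conc64 \
bash benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh
```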
Launcher script:

```diff
@@ -179,7 +179,11 @@ else
     export PORT_OFFSET=${RUNNER_NAME: -1}
     export PORT=$(( 8888 + ${PORT_OFFSET} ))
     FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "atom" ]] && printf '_atom' || printf '')
-    SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
+    case "$SPEC_DECODING" in
+        mtp) SPEC_SUFFIX='_mtp' ;;
+        eagle3) SPEC_SUFFIX='_eagle3' ;;
+        *) SPEC_SUFFIX='' ;;
+    esac
```
Comment on lines +182 to +186:
Contributor:

🟣 Pre-existing issue (not introduced by this PR). The new case statement at lines 182-186 maps only `mtp` and `eagle3` to a filename suffix and silently maps everything else to the empty string.

**What the bug is**

```bash
case "$SPEC_DECODING" in
    mtp) SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    *) SPEC_SUFFIX='' ;;
esac
```

Both `none` and the equally schema-accepted `draft_model` fall into the `*` catch-all, so an unmapped value silently produces an empty suffix instead of an error.

**Why existing code doesn't prevent it**

The validation layer only checks that the string is one of the accepted literals — it has no notion that the launcher must have a matching suffix branch.

**Addressing the refutations**

One verifier argued this is purely hypothetical because no master config uses `draft_model`.

**How to fix**

Replace the silent catch-all with an explicit split between the "no spec decoding" case and genuinely unsupported values:

```bash
case "$SPEC_DECODING" in
    mtp) SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    none|'') SPEC_SUFFIX='' ;;
    *) echo "ERROR: unsupported SPEC_DECODING=$SPEC_DECODING" >&2; exit 1 ;;
esac
```

The same hardening would apply to the other mi*/h*/b* launchers that share this pattern.
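
To make the fallthrough concrete, a standalone sketch (hypothetical snippet; `draft_model` is the schema-accepted value the launcher leaves unmapped):

```bash
#!/usr/bin/env bash
# "draft_model" passes schema validation but has no branch in the launcher.
SPEC_DECODING="draft_model"

case "$SPEC_DECODING" in
    mtp) SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    *) SPEC_SUFFIX='' ;;   # silent catch-all: no error, empty suffix
esac

# Prints suffix=''; the run would reuse the non-spec result filename
# instead of failing fast.
echo "suffix='${SPEC_SUFFIX}'"
```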
Context below the comment:

```diff
     PARTITION="compute"
     SQUASH_FILE="/var/lib/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
```
Config schema (Pydantic models):

```diff
@@ -77,7 +77,7 @@ class SingleNodeMatrixEntry(BaseModel):
     model_prefix: str = Field(alias=Fields.MODEL_PREFIX.value)
     precision: str
     framework: str
-    spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+    spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
         alias=Fields.SPEC_DECODING.value
     )
     runner: str
```
Contributor:

🟡 The README docs are stale.

**What the bug is**

This PR extends the `spec_decoding` Literal to accept `eagle3`. However, the developer-facing documentation in the README still enumerates the old accepted values. This enumeration is now stale — it omits `eagle3`.

**Why existing code doesn't prevent it**

The README is a free-text Markdown illustrative example, not machine-read. Nothing in CI cross-references the README string against the `Literal` in the schema.

**Impact**

A future config author reading the README to learn what `spec_decoding` accepts would not discover `eagle3`.

**How to fix**

Update the README enumeration to include `eagle3`.
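
A guard along these lines would close the gap the comment describes; this is a minimal sketch, assuming a hypothetical README.md location and the exact enumeration string from the diff:

```bash
#!/usr/bin/env bash
# Hypothetical CI step: fail when the README's spec_decoding enumeration
# drifts from the schema. The README path and grep pattern are assumptions.
set -euo pipefail

expected='"mtp", "draft_model", "eagle3", "none"'
if ! grep -qF "$expected" README.md; then
    echo "README spec_decoding enumeration is stale (expected: $expected)" >&2
    exit 1
fi
```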
```diff
@@ -116,7 +116,7 @@ class MultiNodeMatrixEntry(BaseModel):
     model_prefix: str = Field(alias=Fields.MODEL_PREFIX.value)
     precision: str
     framework: str
-    spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+    spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
         alias=Fields.SPEC_DECODING.value
     )
     runner: str
```
```diff
@@ -204,7 +204,7 @@ class SingleNodeSearchSpaceEntry(BaseModel):
     tp: int
     ep: Optional[int] = None
-    spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+    spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
         default="none", alias=Fields.SPEC_DECODING.value)
     dp_attn: Optional[bool] = Field(
         default=None, alias=Fields.DP_ATTN.value)
```

```diff
@@ -224,7 +224,7 @@ class MultiNodeSearchSpaceEntry(BaseModel):
     """Multinode search space configuration."""
     model_config = ConfigDict(extra='forbid', populate_by_name=True)

-    spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+    spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
         default="none", alias=Fields.SPEC_DECODING.value)
     prefill: WorkerConfig
     decode: WorkerConfig
```
Contributor:

🔴 The new `benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh` does not pass `--use-chat-template` to `run_benchmark_serving` (lines 108-116). Eagle3 is EAGLE-style speculative decoding (the `SPEC_CONFIG` sets `method: "eagle3"` at line 78), so AGENTS.md:167 and `.github/workflows/claude-pr-review.yml:277-296` both mandate this flag — without it the draft (trained on chat-formatted inputs) is benchmarked against raw prompts, silently regressing acceptance rate and producing misleading tok/s/gpu. Fix: add `--use-chat-template` to the `run_benchmark_serving` invocation, matching every other `*_mtp.sh` spec-decoding script in the repo.

**What the bug is**

The new Eagle3 benchmark script `benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh` invokes `run_benchmark_serving` at lines 105-116 without the `--use-chat-template` flag. It is the only spec-decoding benchmark script in the repo that omits it.

**Why the rule applies**

Two independent pieces of repo policy make this a hard requirement for EAGLE-style spec-decoding scripts:

- AGENTS.md:162-167: "MTP scripts MUST pass `--use-chat-template` to `run_benchmark_serving` — no exceptions. EAGLE-style speculative decoding is trained against chat-formatted inputs, so benchmarking against raw prompts silently regresses acceptance rate and produces misleading numbers."
- `.github/workflows/claude-pr-review.yml:277-296` classifies a missing `--use-chat-template` on an MTP/EAGLE benchmark as a BLOCKING issue.

The new script configures `method: "eagle3"` in `SPEC_CONFIG` (line 78) — literally EAGLE-family spec decoding, squarely in scope.

**Why existing code doesn't prevent it**

`run_benchmark_serving` in `benchmarks/benchmark_lib.sh` is a pass-through wrapper: the flag is honored when supplied (see `benchmark_lib.sh:351-353`), but there is no default or fallback — if the script omits it, raw prompts are sent. All 17+ existing `*_mtp.sh` spec scripts (dsr1, glm5, qwen3.5, kimi MTP variants across AMD and NVIDIA) pass the flag; the new Eagle3 script is the lone omission.

**Step-by-step proof**

1. The script builds `SPEC_CONFIG` with `method: "eagle3"`, `model: lightseekorg/kimi-k2.5-eagle3`, `num_speculative_tokens: 7`.
2. `run_benchmark_serving` is called at line 107 with `--input-len`, `--output-len`, `--random-range-ratio`, etc. — but no `--use-chat-template`.
3. Without the flag, the benchmark sends the `random` dataset directly as tokenized input rather than wrapping it in the model's chat template.
4. A draft head trained on `<|im_start|>user ... <|im_end|><|im_start|>assistant ...` formatted prompts now sees raw random tokens — off-distribution input.
5. Precedent: `perf-changelog.yaml:467-472` documented the same issue for DeepSeek-MTP in PR #647 ("Add benchmark script code style guidelines to PR review workflow"): "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP." — same root cause, opposite-sign artifact.

**Impact**

Performance numbers published from this script would be systematically wrong and not comparable to the other kimik2.5 or spec-decoding entries in the results. Since the entire point of adding an Eagle3 config is to measure the speedup from Eagle3, a broken acceptance rate undermines the benchmark's validity.

**Fix**

One-line change in the `run_benchmark_serving` call (around lines 107-116):

```diff
 run_benchmark_serving \
     --model "$MODEL" \
     --port "$PORT" \
     --backend vllm \
     --input-len "$ISL" \
     --output-len "$OSL" \
     --random-range-ratio "$RANDOM_RANGE_RATIO" \
     --num-prompts "$((CONC * 10))" \
     --max-concurrency "$CONC" \
     --result-filename "$RESULT_FILENAME" \
     --result-dir /workspace/ \
+    --use-chat-template \
     --trust-remote-code
```