25 changes: 25 additions & 0 deletions .github/configs/amd-master.yaml
@@ -431,6 +431,31 @@ kimik2.5-fp4-mi355x-vllm:
        - { tp: 8, conc-start: 4, conc-end: 64 }
        - { tp: 4, conc-start: 4, conc-end: 64 }

kimik2.5-fp4-mi355x-vllm-eagle3:
  image: vllm/vllm-openai-rocm:nightly-4eafc729285e459a5fc96efd6f7b313b155cad48
  model: amd/Kimi-K2.5-MXFP4
  model-prefix: kimik2.5
  runner: mi355x
  precision: fp4
  framework: vllm
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 4, conc-start: 4, conc-end: 64, spec-decoding: eagle3 }
        - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: eagle3 }
    - isl: 1024
      osl: 8192
      search-space:
        - { tp: 4, conc-start: 4, conc-end: 64, spec-decoding: eagle3 }
        - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: eagle3 }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 4, conc-start: 4, conc-end: 64, spec-decoding: eagle3 }
        - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: eagle3 }

kimik2.5-fp4-mi355x-atom:
  image: rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2
  model: amd/Kimi-K2.5-MXFP4
2 changes: 1 addition & 1 deletion .github/workflows/README.md
@@ -235,7 +235,7 @@ The corresponding `SingleNodeMatrixEntry` enforces these same fields with approp

2. **`extra='forbid'`**: Unknown fields are rejected, preventing typos or deprecated fields from slipping through.

-3. **Strict typing**: Fields like `spec-decoding` use `Literal["mtp", "draft_model", "none"]` to restrict values to known options.
+3. **Strict typing**: Fields like `spec-decoding` use a `Literal` type to restrict values to a fixed set of known options (see `utils/matrix_logic/validation.py` for the current set).

4. **Concurrency validation**: The system ensures either `conc-list` OR `conc-start`/`conc-end` is provided, but not both.
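The either/or rule in point 4 can be sketched in plain Python. This is a simplified stand-in for the repo's Pydantic validator, not the actual implementation — `check_concurrency` and the dict-based entry are illustrative names:

```python
def check_concurrency(entry: dict) -> dict:
    """Require exactly one of `conc-list` or the `conc-start`/`conc-end` pair."""
    has_list = entry.get("conc-list") is not None
    has_range = entry.get("conc-start") is not None and entry.get("conc-end") is not None
    if has_list and has_range:
        # Ambiguous: both a literal list and a range were given.
        raise ValueError("provide either conc-list or conc-start/conc-end, not both")
    if not (has_list or has_range):
        # Neither form supplied: the sweep would have no concurrency values.
        raise ValueError("one of conc-list or conc-start/conc-end is required")
    return entry
```

In the real models this logic would live in a Pydantic `model_validator` so it runs automatically on construction.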

114 changes: 114 additions & 0 deletions benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh
@@ -0,0 +1,114 @@
#!/usr/bin/env bash

# Kimi-K2.5 MXFP4 + Eagle3 speculative decoding on MI355X (vLLM).
# Adds `--speculative-config` on top of the plain kimik2.5_fp4_mi355x.sh flow.
#
# Draft model: lightseekorg/kimi-k2.5-eagle3 (~6 GB BF16, Eagle3 MTP head)
# Spec tokens: 7 (reproduced baseline: 764.1 +/- 35.7 tok/s/gpu @ TP=4, 1k1k, conc=64)
# Draft TP: 1 (draft runs on a single GPU; target occupies $TP)

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

# Draft model (Eagle3 head). Override via SPEC_DRAFT_MODEL if needed.
SPEC_DRAFT_MODEL="${SPEC_DRAFT_MODEL:-lightseekorg/kimi-k2.5-eagle3}"
SPEC_NUM_TOKENS="${SPEC_NUM_TOKENS:-7}"
SPEC_DRAFT_TP="${SPEC_DRAFT_TP:-1}"

hf download "$MODEL"
hf download "$SPEC_DRAFT_MODEL"

# Set HIP_VISIBLE_DEVICES to match ROCR_VISIBLE_DEVICES for Ray compatibility in vLLM 0.14+
if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
    export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

# If the machine runs a MEC FW older than 177, RCCL
# cannot reclaim some memory.
# Disable that feature to avoid crashes.
# This is related to the changes in the driver at:
# https://rocm.docs.amd.com/en/docs-6.4.3/about/release-notes.html#amdgpu-driver-updates
version=$(rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}')
if [[ "$version" == "" || $version -lt 177 ]]; then
    export HSA_NO_SCRATCH_RECLAIM=1
fi

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4

# Disable AITER RMSNorm for TP < 8 due to accuracy issues
if [ "${TP}" -lt 8 ]; then
    export VLLM_ROCM_USE_AITER_RMSNORM=0
fi

if [ "${EP_SIZE:-0}" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

SPEC_CONFIG="{\"model\":\"${SPEC_DRAFT_MODEL}\",\"method\":\"eagle3\",\"num_speculative_tokens\":${SPEC_NUM_TOKENS},\"draft_tensor_parallel_size\":${SPEC_DRAFT_TP}}"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
vllm serve $MODEL --port $PORT \
    --tensor-parallel-size=$TP \
    $EP \
    --gpu-memory-utilization 0.90 \
    --max-model-len $MAX_MODEL_LEN \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --mm-encoder-tp-mode data \
    --speculative-config "$SPEC_CONFIG" > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template \
    --trust-remote-code
Comment on lines +92 to +104
Contributor
🔴 The new benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh does not pass --use-chat-template to run_benchmark_serving (lines 108-116). Eagle3 is EAGLE-style speculative decoding (the SPEC_CONFIG sets method: "eagle3" at line 78), so AGENTS.md:167 and .github/workflows/claude-pr-review.yml:277-296 both mandate this flag — without it the draft (trained on chat-formatted inputs) is benchmarked against raw prompts, silently regressing acceptance rate and producing misleading tok/s/gpu. Fix: add --use-chat-template to the run_benchmark_serving invocation, matching every other *_mtp.sh spec-decoding script in the repo.

Extended reasoning...

What the bug is

The new Eagle3 benchmark script benchmarks/single_node/kimik2.5_fp4_mi355x_eagle3.sh invokes run_benchmark_serving at lines 105-116 without the --use-chat-template flag. This is the only spec-decoding benchmark script in the repo that omits it.

Why the rule applies

Two independent pieces of repo policy make this a hard requirement for EAGLE-style spec-decoding scripts:

  1. AGENTS.md:162-167: "MTP scripts MUST pass --use-chat-template to run_benchmark_serving — no exceptions. EAGLE-style speculative decoding is trained against chat-formatted inputs, so benchmarking against raw prompts silently regresses acceptance rate and produces misleading numbers."
  2. .github/workflows/claude-pr-review.yml:277-296: classifies a missing --use-chat-template on an MTP/EAGLE benchmark as a BLOCKING issue.

The new script configures method: "eagle3" in SPEC_CONFIG (line 78) — literally EAGLE-family spec decoding, squarely in scope.

Why existing code doesn't prevent it

run_benchmark_serving in benchmarks/benchmark_lib.sh is a pass-through wrapper: the flag is honored when supplied (see benchmark_lib.sh:351-353) but there is no default/fallback — if the script omits it, raw prompts are sent. All 17+ existing *_mtp.sh spec scripts (dsr1, glm5, qwen3.5, kimi MTP variants across AMD and NVIDIA) pass the flag; the new Eagle3 script is the lone omission.

Step-by-step proof

  1. Script runs. Line 78 sets SPEC_CONFIG with method: "eagle3", model: lightseekorg/kimi-k2.5-eagle3, num_speculative_tokens: 7.
  2. vLLM starts the target with the Eagle3 draft attached.
  3. run_benchmark_serving is called at line 107 with --input-len, --output-len, --random-range-ratio, etc. — but no --use-chat-template.
  4. The benchmark client emits the random dataset directly as tokenized input rather than wrapping it in the model's chat template.
  5. The Eagle3 draft head (trained on <|im_start|>user ... <|im_end|><|im_start|>assistant ... formatted prompts) now sees raw random tokens — off-distribution input.
  6. Acceptance rate of speculated tokens drops (this is the exact failure mode perf-changelog.yaml:467-472 documented for DeepSeek-MTP in PR Add benchmark script code style guidelines to PR review workflow #647: "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP." — same root cause, opposite-sign artifact).
  7. Reported tok/s/gpu diverges from the real number customers would see on chat-formatted traffic — directly defeating the purpose of the benchmark.

Impact

Performance numbers published from this script would be systematically wrong and not comparable to the other kimik2.5 or spec-decoding entries in the results. Since the entire point of adding an Eagle3 config is to measure the speedup from Eagle3, a broken acceptance rate undermines the benchmark's validity.

Fix

One-line change in the run_benchmark_serving call (around line 107-116):

 run_benchmark_serving \
     --model "$MODEL" \
     --port "$PORT" \
     --backend vllm \
     --input-len "$ISL" \
     --output-len "$OSL" \
     --random-range-ratio "$RANDOM_RANGE_RATIO" \
     --num-prompts "$((CONC * 10))" \
     --max-concurrency "$CONC" \
     --result-filename "$RESULT_FILENAME" \
     --result-dir /workspace/ \
+    --use-chat-template \
     --trust-remote-code
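To make the off-distribution point concrete, here is a toy illustration of what `--use-chat-template` changes. The `<|im_start|>` markers follow the ChatML-style format the comment quotes; the exact template is model-specific and this wrapper is illustrative only:

```python
def chat_format(user_msg: str) -> str:
    """Toy ChatML-style chat template wrapper (illustrative, not Kimi's actual template)."""
    return (
        "<|im_start|>user\n" + user_msg + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

raw = "lorem ipsum 1234"          # what the benchmark sends without the flag
formatted = chat_format(raw)      # what the draft head was trained to see
# Feeding `raw` instead of `formatted` gives the Eagle3 draft off-distribution
# input, which shifts its token-acceptance rate and skews the reported tok/s/gpu.
```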


# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -1702,3 +1702,12 @@
  description:
    - "Add VLLM_FLOAT32_MATMUL_PRECISION=high"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1069

- config-keys:
    - kimik2.5-fp4-mi355x-vllm-eagle3
  description:
    - "Add Kimi-K2.5 FP4 vLLM Eagle3 speculative decoding config for MI355X"
    - "Image: vllm/vllm-openai-rocm:nightly-4eafc729285e459a5fc96efd6f7b313b155cad48"
    - "Model: amd/Kimi-K2.5-MXFP4"
    - "Draft model: lightseekorg/kimi-k2.5-eagle3 (num_speculative_tokens=7, draft_tp=1)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1116
6 changes: 5 additions & 1 deletion runners/launch_mi355x-amds.sh
@@ -179,7 +179,11 @@ else
export PORT_OFFSET=${RUNNER_NAME: -1}
export PORT=$(( 8888 + ${PORT_OFFSET} ))
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "atom" ]] && printf '_atom' || printf '')
-SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
+case "$SPEC_DECODING" in
+    mtp)    SPEC_SUFFIX='_mtp' ;;
+    eagle3) SPEC_SUFFIX='_eagle3' ;;
+    *)      SPEC_SUFFIX='' ;;
+esac
Comment on lines +182 to +186
Contributor
🟣 Pre-existing issue (not introduced by this PR). The new case statement at lines 182-186 maps mtp and eagle3, but the spec-decoding Literal in validation.py also accepts draft_model — which would silently fall through to the *) catch-all, yielding SPEC_SUFFIX='' and launching the non-spec benchmark script with SPEC_DECODING=draft_model still in the env. No master config uses draft_model today and the prior ternary had the same gap, but since this PR rewrote these exact lines it's a natural opportunity to fail fast on unknown spec values (either add an explicit error branch, or map draft_model once a corresponding _draft_model.sh exists). The same silent-fallthrough pattern also exists in launch_h200-nb.sh, launch_h200-cw.sh, launch_h200-dgxc-slurm.sh, launch_b200-nb.sh, launch_b200-dgxc-slurm.sh, and launch_b300-nv.sh.

Extended reasoning...

What the bug is

utils/matrix_logic/validation.py (lines 80, 119, 207, 227) declares the spec-decoding field as Literal["mtp", "draft_model", "eagle3", "none"], so any of those four values will pass config validation. The launcher case statement at runners/launch_mi355x-amds.sh:182-186 only handles two of them:

case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    *)      SPEC_SUFFIX='' ;;
esac

Both none and draft_model fall into the *) branch. For none that is intended (non-spec script). For draft_model it is silently wrong: the launcher will invoke benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi355x${FRAMEWORK_SUFFIX}.sh (the plain, non-spec variant) while SPEC_DECODING=draft_model is still exported into the container env.

Step-by-step proof

  1. Someone adds a master-config entry with spec-decoding: draft_model under e.g. kimik2.5-fp4-mi355x-vllm-*. Validation accepts it because the Literal allows draft_model.
  2. generate_sweep_configs.py emits a matrix entry where SPEC_DECODING=draft_model is exported by the workflow into the launcher's environment.
  3. In launch_mi355x-amds.sh the case statement hits the *) branch → SPEC_SUFFIX=''.
  4. The final command becomes bash benchmarks/single_node/kimik2.5_fp4_mi355x.sh (no _draft_model suffix, no _eagle3 suffix) — the plain non-spec recipe runs.
  5. The reported result is tagged with the original run metadata saying spec-decoding=draft_model, but no speculative decoding actually happened. Silent divergence between requested and executed experiment.

Why existing code doesn't prevent it

The validation layer only checks that the string is one of the accepted literals — it has no notion that the launcher must have a matching _<name>.sh script. The launcher's default *) branch treats any unrecognised value as "no spec suffix", which is correct for none but masks misconfiguration for draft_model.

Addressing the refutations

One verifier argued this is purely hypothetical because (a) no master config uses draft_model, (b) no *_draft_model.sh script exists in the repo, (c) the same pattern exists in six sibling launchers, and (d) hardening is out of scope for this PR. Those points are all factually correct, which is why this is marked pre_existing and not a blocker. The reason it is still worth flagging: the PR rewrote this exact block from a ternary into a case statement to add eagle3, which is the natural moment to either (i) add an explicit error branch for unmapped-but-valid literals, or (ii) tighten the Literal to only include values with a corresponding script. Doing neither leaves a foot-gun for the next person who extends spec-decoding.

How to fix

Replace the silent catch-all with an explicit split between none (intended empty suffix) and unknown values (error):

case "$SPEC_DECODING" in
    mtp)    SPEC_SUFFIX='_mtp' ;;
    eagle3) SPEC_SUFFIX='_eagle3' ;;
    none|'') SPEC_SUFFIX='' ;;
    *) echo "ERROR: unsupported SPEC_DECODING=$SPEC_DECODING" >&2; exit 1 ;;
esac

The same hardening would apply to the other mi*/h*/b* launchers that share this pattern.


PARTITION="compute"
SQUASH_FILE="/var/lib/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
2 changes: 1 addition & 1 deletion utils/matrix_logic/test_validation.py
@@ -275,7 +275,7 @@ def test_conc_as_list(self, valid_single_node_matrix_entry):

    def test_spec_decoding_values(self, valid_single_node_matrix_entry):
        """Spec decoding should accept valid literal values."""
-        for value in ["mtp", "draft_model", "none"]:
+        for value in ["mtp", "draft_model", "eagle3", "none"]:
            valid_single_node_matrix_entry["spec-decoding"] = value
            entry = SingleNodeMatrixEntry(**valid_single_node_matrix_entry)
            assert entry.spec_decoding == value
8 changes: 4 additions & 4 deletions utils/matrix_logic/validation.py
@@ -77,7 +77,7 @@ class SingleNodeMatrixEntry(BaseModel):
    model_prefix: str = Field(alias=Fields.MODEL_PREFIX.value)
    precision: str
    framework: str
-    spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+    spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
Contributor
🟡 README docs at .github/workflows/README.md:238 still illustrate the spec-decoding Literal as Literal["mtp", "draft_model", "none"], but this PR extends it to include eagle3 at four sites in utils/matrix_logic/validation.py (lines 80, 119, 207, 227). Nit — please update that README line to include eagle3 (or rephrase to avoid enumerating) so future config authors know eagle3 is a valid value.

Extended reasoning...

What the bug is

This PR extends the spec_decoding Pydantic Literal from {mtp, draft_model, none} to {mtp, draft_model, eagle3, none} at four call sites in utils/matrix_logic/validation.py:

  • Line 80 — SingleNodeMatrixEntry.spec_decoding
  • Line 119 — MultiNodeMatrixEntry.spec_decoding
  • Line 207 — SingleNodeSearchSpaceEntry.spec_decoding
  • Line 227 — MultiNodeSearchSpaceEntry.spec_decoding

However, the developer-facing documentation at .github/workflows/README.md:238 still reads verbatim:

Strict typing: Fields like spec-decoding use Literal["mtp", "draft_model", "none"] to restrict values to known options.

This enumeration is now stale — it omits eagle3, which is a valid value after this PR.

Why existing code doesn't prevent it

The README is a free-text Markdown illustrative example, not machine-read. Nothing in CI cross-references the README string against validation.py, so there is no automatic enforcement that the two stay in sync. The code (Pydantic Literal) remains the source of truth and will correctly accept eagle3 at runtime — so there is no functional regression. It's a pure documentation drift.

Impact

A future config author reading the README to learn what spec-decoding values are allowed will see only three options and not realize eagle3 is valid. They may either (a) try eagle3 anyway and hit no issue, or (b) incorrectly assume the option doesn't exist and add another value. Mild confusion, not broken behavior.

Step-by-step proof

  1. Check the README at .github/workflows/README.md:238: it contains the string Literal["mtp", "draft_model", "none"].
  2. Check utils/matrix_logic/validation.py:80 (and 119, 207, 227) in this PR: all four sites now declare Literal["mtp", "draft_model", "eagle3", "none"].
  3. The README string is therefore a proper subset of the actual Literal values, and misstates the illustrative example.

How to fix

Two easy options at .github/workflows/README.md:238:

  • Update the enumeration to match: Literal["mtp", "draft_model", "eagle3", "none"].
  • Or rephrase to avoid enumerating, e.g., "Fields like spec-decoding use a Literal type to restrict values to a fixed set of known options (see validation.py for the current set)." — this avoids the same drift the next time a value is added.
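The drift the comment describes is easy to demonstrate with stdlib `typing` alone. This sketch (function name hypothetical) shows why the four-value `Literal` accepts `eagle3` while the README's stale three-value enumeration would mislead a config author:

```python
from typing import Literal, get_args

# The set after this PR (mirrors validation.py); the README still lists only three values.
SpecDecoding = Literal["mtp", "draft_model", "eagle3", "none"]

def parse_spec_decoding(value: str) -> str:
    """Accept only values in the current Literal, mimicking Pydantic's check."""
    if value not in get_args(SpecDecoding):
        raise ValueError(f"unsupported spec-decoding value: {value!r}")
    return value
```

Because `get_args` reads the values straight from the type alias, a doc string generated from `get_args(SpecDecoding)` could never drift the way the hand-written README line did.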

        alias=Fields.SPEC_DECODING.value
    )
    runner: str
@@ -116,7 +116,7 @@ class MultiNodeMatrixEntry(BaseModel):
    model_prefix: str = Field(alias=Fields.MODEL_PREFIX.value)
    precision: str
    framework: str
-    spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+    spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
        alias=Fields.SPEC_DECODING.value
    )
    runner: str
@@ -204,7 +204,7 @@ class SingleNodeSearchSpaceEntry(BaseModel):

    tp: int
    ep: Optional[int] = None
-    spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+    spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
        default="none", alias=Fields.SPEC_DECODING.value)
    dp_attn: Optional[bool] = Field(
        default=None, alias=Fields.DP_ATTN.value)
@@ -224,7 +224,7 @@ class MultiNodeSearchSpaceEntry(BaseModel):
    """Multinode search space configuration."""
    model_config = ConfigDict(extra='forbid', populate_by_name=True)

-    spec_decoding: Literal["mtp", "draft_model", "none"] = Field(
+    spec_decoding: Literal["mtp", "draft_model", "eagle3", "none"] = Field(
        default="none", alias=Fields.SPEC_DECODING.value)
    prefill: WorkerConfig
    decode: WorkerConfig