fix(glm_asr): honor sampling params in vLLM generate() by SuperMarioYL · Pull Request #2997 · modelscope/FunASR

SuperMarioYL · 2026-06-20T05:15:05Z

Summary

GLMASRVLLMEngine.generate() accepts **kwargs but hardcodes temperature=0 and never forwards any sampling parameter into SamplingParams, so callers cannot control decoding — anything they pass (temperature, top_p, top_k, …) is silently dropped.

The sibling Fun-ASR-Nano vLLM engine already exposes these parameters (temperature/top_p/top_k/repetition_penalty). This PR brings GLM-ASR to parity.

Changes

Add temperature, top_p, top_k, repetition_penalty arguments to generate() and forward them into SamplingParams.
Defaults (temperature=0.0, top_p=1.0, top_k=-1, repetition_penalty=1.0) reproduce the previous greedy-decoding behavior exactly, so existing callers are unaffected.
Normalize non-positive top_k to -1 (disabled), mirroring Fun-ASR-Nano.
Guard repetition_penalty: this engine runs vLLM in prompt-embeds mode (enable_prompt_embeds=True), where a non-neutral repetition penalty has no prompt token IDs to scatter over and crashes the engine with a CUDA scatter gather index out of bounds assertion (the same failure mode fixed for Fun-ASR-Nano in vllm离线服务推理，长音频报错 #2948). A non-neutral value is coerced back to 1.0 with a one-time warning.

Testing

Added tests/test_glm_asr_vllm_sampling.py (CPU-only, vLLM stubbed) covering:

sampling parameters are forwarded into SamplingParams
defaults preserve the previous greedy behavior
non-positive top_k is normalized to disabled
the repetition_penalty guard coerces non-neutral values and warns once

$ python -m pytest tests/test_glm_asr_vllm_sampling.py -q
7 passed

Verified red on main (params dropped) and green on this branch.

GLMASRVLLMEngine.generate() accepted **kwargs but hardcoded temperature=0 and never forwarded any sampling parameter, so callers could not control decoding. Its sibling Fun-ASR-Nano engine already exposes temperature/top_p/top_k/repetition_penalty. Mirror that surface on GLM-ASR: forward temperature, top_p and top_k into SamplingParams (preserving the previous greedy default), and guard repetition_penalty so a non-neutral value is coerced back to 1.0 (with a one-time warning). This engine runs vLLM in prompt-embeds mode, where a non-neutral repetition_penalty triggers the issue modelscope#2948 CUDA crash. Add CPU-only regression tests covering parameter forwarding, the greedy defaults, top_k normalization, and the repetition_penalty guard.

gemini-code-assist

Code Review

This pull request adds sampling parameters (temperature, top_p, top_k, and repetition_penalty) to the GLM-ASR vLLM inference engine, along with corresponding unit tests. It also introduces a safety check to prevent a CUDA crash when using repetition penalty in prompt-embeds mode. The review feedback points out a potential TypeError if top_k is passed as None and suggests adding a safe check and an associated unit test.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-20T05:15:50Z

-            temperature=0,
+            temperature=temperature,
+            top_p=top_p,
+            top_k=top_k if top_k > 0 else -1,


If top_k is passed as None (which is common when passing optional configuration dictionaries), the comparison top_k > 0 will raise a TypeError: '>' not supported between instances of 'NoneType' and 'int'. We should explicitly check if top_k is not None before comparing it.

Suggested change

top_k=top_k if top_k > 0 else -1,

top_k=top_k if (top_k is not None and top_k > 0) else -1,

gemini-code-assist · 2026-06-20T05:15:51Z

+    def test_non_positive_top_k_is_normalized_to_disabled(self):
+        self.engine.generate("a.wav", top_k=0)
+        self.assertEqual(self.captured["top_k"], -1)


Let's also test the case where top_k is None to ensure it is correctly normalized to -1 without raising a TypeError.

Suggested change

def test_non_positive_top_k_is_normalized_to_disabled(self):

self.engine.generate("a.wav", top_k=0)

self.assertEqual(self.captured["top_k"], -1)

def test_non_positive_top_k_is_normalized_to_disabled(self):

self.engine.generate("a.wav", top_k=0)

self.assertEqual(self.captured["top_k"], -1)

self.engine.generate("a.wav", top_k=None)

self.assertEqual(self.captured["top_k"], -1)

gemini-code-assist Bot reviewed Jun 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(glm_asr): honor sampling params in vLLM generate()#2997

fix(glm_asr): honor sampling params in vLLM generate()#2997
SuperMarioYL wants to merge 1 commit into
modelscope:mainfrom
SuperMarioYL:fix/glm-asr-vllm-sampling-params

SuperMarioYL commented Jun 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	top_k=top_k if top_k > 0 else -1,
	top_k=top_k if (top_k is not None and top_k > 0) else -1,

Conversation

SuperMarioYL commented Jun 20, 2026

Summary

Changes

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant