Skip to content

fix(glm_asr): honor sampling params in vLLM generate()#2997

Open
SuperMarioYL wants to merge 1 commit into
modelscope:mainfrom
SuperMarioYL:fix/glm-asr-vllm-sampling-params
Open

fix(glm_asr): honor sampling params in vLLM generate()#2997
SuperMarioYL wants to merge 1 commit into
modelscope:mainfrom
SuperMarioYL:fix/glm-asr-vllm-sampling-params

Conversation

@SuperMarioYL

Copy link
Copy Markdown
Contributor

Summary

GLMASRVLLMEngine.generate() accepts **kwargs but hardcodes temperature=0 and never forwards any sampling parameter into SamplingParams, so callers cannot control decoding — anything they pass (temperature, top_p, top_k, …) is silently dropped.

The sibling Fun-ASR-Nano vLLM engine already exposes these parameters (temperature/top_p/top_k/repetition_penalty). This PR brings GLM-ASR to parity.

Changes

  • Add temperature, top_p, top_k, repetition_penalty arguments to generate() and forward them into SamplingParams.
  • Defaults (temperature=0.0, top_p=1.0, top_k=-1, repetition_penalty=1.0) reproduce the previous greedy-decoding behavior exactly, so existing callers are unaffected.
  • Normalize non-positive top_k to -1 (disabled), mirroring Fun-ASR-Nano.
  • Guard repetition_penalty: this engine runs vLLM in prompt-embeds mode (enable_prompt_embeds=True), where a non-neutral repetition penalty has no prompt token IDs to scatter over and crashes the engine with a CUDA scatter gather index out of bounds assertion (the same failure mode fixed for Fun-ASR-Nano in vllm离线服务推理,长音频报错 #2948). A non-neutral value is coerced back to 1.0 with a one-time warning.

Testing

Added tests/test_glm_asr_vllm_sampling.py (CPU-only, vLLM stubbed) covering:

  • sampling parameters are forwarded into SamplingParams
  • defaults preserve the previous greedy behavior
  • non-positive top_k is normalized to disabled
  • the repetition_penalty guard coerces non-neutral values and warns once
$ python -m pytest tests/test_glm_asr_vllm_sampling.py -q
7 passed

Verified red on main (params dropped) and green on this branch.

GLMASRVLLMEngine.generate() accepted **kwargs but hardcoded
temperature=0 and never forwarded any sampling parameter, so callers
could not control decoding. Its sibling Fun-ASR-Nano engine already
exposes temperature/top_p/top_k/repetition_penalty.

Mirror that surface on GLM-ASR: forward temperature, top_p and top_k
into SamplingParams (preserving the previous greedy default), and guard
repetition_penalty so a non-neutral value is coerced back to 1.0 (with a
one-time warning). This engine runs vLLM in prompt-embeds mode, where a
non-neutral repetition_penalty triggers the issue modelscope#2948 CUDA crash.

Add CPU-only regression tests covering parameter forwarding, the greedy
defaults, top_k normalization, and the repetition_penalty guard.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds sampling parameters (temperature, top_p, top_k, and repetition_penalty) to the GLM-ASR vLLM inference engine, along with corresponding unit tests. It also introduces a safety check to prevent a CUDA crash when using repetition penalty in prompt-embeds mode. The review feedback points out a potential TypeError if top_k is passed as None and suggests adding a safe check and an associated unit test.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

temperature=0,
temperature=temperature,
top_p=top_p,
top_k=top_k if top_k > 0 else -1,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If top_k is passed as None (which is common when passing optional configuration dictionaries), the comparison top_k > 0 will raise a TypeError: '>' not supported between instances of 'NoneType' and 'int'. We should explicitly check if top_k is not None before comparing it.

Suggested change
top_k=top_k if top_k > 0 else -1,
top_k=top_k if (top_k is not None and top_k > 0) else -1,

Comment on lines +83 to +85
def test_non_positive_top_k_is_normalized_to_disabled(self):
self.engine.generate("a.wav", top_k=0)
self.assertEqual(self.captured["top_k"], -1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Let's also test the case where top_k is None to ensure it is correctly normalized to -1 without raising a TypeError.

Suggested change
def test_non_positive_top_k_is_normalized_to_disabled(self):
self.engine.generate("a.wav", top_k=0)
self.assertEqual(self.captured["top_k"], -1)
def test_non_positive_top_k_is_normalized_to_disabled(self):
self.engine.generate("a.wav", top_k=0)
self.assertEqual(self.captured["top_k"], -1)
self.engine.generate("a.wav", top_k=None)
self.assertEqual(self.captured["top_k"], -1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant