test: add a diagnostic script for prefix caching naning #1987

Merged
terrykong merged 4 commits into main from tk/vllm-nan-diagnostic
Mar 1, 2026
Conversation


@terrykong terrykong commented Feb 18, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Documentation

    • Added diagnostic guidance for troubleshooting prefix caching NaN logprobs issues.
  • New Features

    • Introduced a diagnostic tool to validate prefix caching behavior and detect NaN logprobs.

@terrykong terrykong requested review from a team as code owners February 18, 2026 21:18
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 18, 2026

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough


The documentation was updated with a new diagnostic section on prefix caching NaN logprobs validation, and a new Python script was added to tools/model_diagnostics/ that reproduces and validates prefix caching behavior in vLLM, including multi-iteration generation with prefix cache reuse and NaN logprob detection.

Changes

Cohort / File(s) — Summary

Documentation — docs/adding-new-models.md
Added a diagnostic section documenting the prefix caching NaN logprobs test case, with usage instructions, expected outputs for different vLLM versions, and an explanation of the underlying prefix caching bug.

Diagnostic Tool — tools/model_diagnostics/5.prefix_caching_nan.py
New script that reproduces prefix caching behavior: it instantiates an LLM with prefix caching enabled, performs two-iteration generation (an initial cache-building pass, then a cache-reuse pass), detects NaN logprobs, and asserts the expected behavior.
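The NaN-detection step at the heart of the script can be sketched without a GPU. The helper below is an illustrative mock, not the script's actual code: `count_nan_logprobs` is a hypothetical name, and `SimpleNamespace` stands in for vLLM's per-token `Logprob` objects. It also folds in the reviewer's suggestion (see the nitpick below) to drop the early `break` so every token-id entry in each step is inspected:

```python
import math
from types import SimpleNamespace


def count_nan_logprobs(step_logprobs):
    """Count NaN logprobs across every token-id entry of every generation step.

    `step_logprobs` mirrors the shape of vLLM's per-step logprobs: a list of
    {token_id: Logprob} mappings. No early break, so steps that return more
    than one candidate entry are still counted correctly.
    """
    nan_count = 0
    for step in step_logprobs:
        for _tid, lp_obj in step.items():
            # Accept either a Logprob-like object or a bare float.
            lp = lp_obj.logprob if hasattr(lp_obj, "logprob") else lp_obj
            if isinstance(lp, float) and math.isnan(lp):
                nan_count += 1
    return nan_count


# Mocked output: two steps, the second of which contains one NaN logprob.
steps = [
    {101: SimpleNamespace(logprob=-0.25)},
    {102: SimpleNamespace(logprob=float("nan")), 103: SimpleNamespace(logprob=-1.5)},
]
print(count_nan_logprobs(steps))  # → 1
```

In the real script this count feeds the final assertion: a nonzero count on the cache-reuse iteration is what flags the prefix caching bug.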

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

CI:docs

Suggested reviewers

  • chtruong814
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check — ❓ Inconclusive: The title contains a typo ("naning" instead of "NaN") and lacks clarity about what the diagnostic script addresses. Resolution: correct the title to "test: add a diagnostic script for prefix caching NaN logprobs" to accurately describe the script's purpose of detecting NaN values in prefix caching.

✅ Passed checks (3 passed)

Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate; docstring coverage check skipped.
Test Results For Major Changes — ✅ Passed: PR adds minor diagnostic utilities and documentation; no major production code changes or breaking changes.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
tools/model_diagnostics/5.prefix_caching_nan.py (2)

75-84: Unconditional break silently under-counts NaNs if logprobs > 1.

The break at line 84 exits after inspecting only the first token-id entry per step, regardless of whether a NaN was found. With logprobs=1 and temperature=0.0 (greedy), each step's dict has exactly one entry so this is functionally correct today. However, vLLM returns up to logprobs+1 elements per step, meaning if logprobs is ever bumped above 1, the counter would silently undercount NaN occurrences (at most 1 per step). Removing the break makes the intent clear and future-proof:

♻️ Proposed fix
     for _tid, lp_obj in step.items():
         lp = lp_obj.logprob if hasattr(lp_obj, "logprob") else lp_obj
         if isinstance(lp, float) and math.isnan(lp):
             nan_count += 1
-        break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/model_diagnostics/5.prefix_caching_nan.py` around lines 75 - 84, The
loop over out2.logprobs currently contains an unconditional break after
inspecting the first token-id entry, which causes under-counting NaNs when a
step contains multiple entries; remove the break so the inner loop over
step.items() examines every lp_obj (keep existing hasattr(lp_obj, "logprob")
check and NaN detection for lp) so nan_count increments for every NaN in all
token-id entries rather than just the first one per step.

35-41: Consider consolidating under if __name__ == "__main__": and moving imports to the top.

All module-level logic (argparse, LLM instantiation, generation) runs unconditionally on import. A __main__ guard is the standard protection against accidental execution when scripts are discovered by tooling. Additionally, the vllm import (lines 40–41) appears mid-file after parse_args() — while this speeds up --help, it diverges from PEP 8 and can surprise readers.

♻️ Suggested structure
 import argparse
 import math
+from vllm import LLM, SamplingParams
+import vllm

 MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
 ...

-parser = argparse.ArgumentParser()
-parser.add_argument("--model", type=str, default=MODEL)
-parser.add_argument("--tp", type=int, default=TP)
-args = parser.parse_args()
-
-import vllm
-from vllm import LLM, SamplingParams
-
-print(f"vLLM version: {vllm.__version__}")
-...
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model", type=str, default=MODEL)
+    parser.add_argument("--tp", type=int, default=TP)
+    args = parser.parse_args()
+
+    print(f"vLLM version: {vllm.__version__}")
+    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/model_diagnostics/5.prefix_caching_nan.py` around lines 35 - 41, Move
all runtime logic (argparse setup and calls using parser/args, LLM
instantiation, and generation) under a guarded block: wrap code that calls
parser.parse_args(), creates the vllm LLM and SamplingParams, and runs
generation inside if __name__ == "__main__":. Also relocate imports (import vllm
and from vllm import LLM, SamplingParams) to the top of the file with other
imports to follow PEP8; keep only lightweight module-level constants like MODEL
and TP outside the guard and ensure no heavy side-effect code runs at import
time.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/adding-new-models.md`:
- Around line 330-344: The two fenced code blocks under "Expected pass output
(vLLM 0.13.0)" and "Expected fail output (vLLM 0.14.0)" are missing language
specifiers causing markdownlint MD040; update both fences to include a language
(e.g., use ```text) so the pass-output and fail-output blocks explicitly start
with ```text and end with ``` to satisfy MD040 and preserve formatting.

In `@tools/model_diagnostics/5.prefix_caching_nan.py`:
- Line 87: The print statement currently uses an unnecessary f-string: locate
the call print(f"\n  Sample logprobs from iteration 2:") and remove the leading
"f" so it becomes a plain string literal; this eliminates the Ruff F541 spurious
f-string warning without changing behavior.
- Around line 29-75: The module defines several module-level mutable bindings
(parser, args, numbers, prompt, llm, sampling_params, out1, out2, nan_count)
which violate the global naming guideline; wrap all runtime code that creates or
mutates these symbols inside an if __name__ == "__main__": block so they become
local to main (keep MODEL, TP, MAX_TOKENS, MAX_MODEL_LEN, COUNT_UP_TO and
imports at module scope), e.g., move creation of
argparse.ArgumentParser()/parser, args = parser.parse_args(), numbers, prompt
construction, LLM() instantiation, SamplingParams(), the two generate calls that
produce out1/out2, and nan_count into that guard; alternatively if you must keep
any of them global, rename using upper snake_case with the G_ prefix (e.g.,
G_PARSER, G_LLM) to satisfy the guideline.


yfw previously approved these changes Feb 19, 2026
@terrykong terrykong changed the title tests: add a diagnostic script for prefix caching naning test: add a diagnostic script for prefix caching naning Feb 20, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong force-pushed the tk/vllm-nan-diagnostic branch from c5c0bee to 5f9f9be Compare February 27, 2026 08:09
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong added CI:docs Run doctest and removed CI:docs Run doctest labels Feb 27, 2026
@terrykong terrykong enabled auto-merge (squash) February 27, 2026 22:12
@terrykong terrykong requested a review from yuki-97 March 1, 2026 08:43
@terrykong terrykong merged commit ac2c774 into main Mar 1, 2026
42 of 43 checks passed
@terrykong terrykong deleted the tk/vllm-nan-diagnostic branch March 1, 2026 15:17
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>

Labels

CI:docs (Run doctest), documentation (Improvements or additions to documentation)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants