docs(eval skill): vLLM backend env vars + SLURM HF-cache/cpu_partition guidance by cjluo-nv · Pull Request #1625 · NVIDIA/Model-Optimizer

cjluo-nv · 2026-06-04T06:35:32Z

What does this PR do?

Type of change: documentation

Hardens the evaluation skill with operational guidance discovered while running AA-Index evals (NVFP4 Nemotron-3-Nano) on SLURM. Two themes:

1. vLLM backend env vars (original commits). Some models need a model-card backend toggle that is an env var, not a CLI flag — e.g. NVFP4 MoE models like NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 need VLLM_USE_FLASHINFER_MOE_FP4=1 + VLLM_FLASHINFER_MOE_BACKEND=throughput. These go in deployment.env_vars (with the lit: prefix), not the vllm serve command.

2. SLURM deploy/eval operational lessons (new commits).

- mount_home: always false. Some internal cluster templates default it to true, which mounts the host ~/.cache; where that is a symlink into a shared/networked filesystem, it dangles in the container and the vLLM trust-remote-code deploy dies with FileNotFoundError: /root/.cache/huggingface. This failure is invisible to --dry-run.
- HF cache: mount the realpath of ~/.cache/huggingface to /hf-cache and set HF_HOME: lit:/hf-cache for both stages.
- execution.cpu_partition: on split GPU/CPU-partition clusters, the CPU-only MLflow auto-export job is rejected by the GPU partition, which marks the whole task FAILED despite EVAL_EXIT_CODE=0. Set cpu_partition to route it correctly.
- Top-level env_vars: use for vars both stages need (HF_TOKEN, HF_HOME); execution.env_vars is unsupported and hard-errors.
- lcr.md: LOG_LEVEL=WARNING to skip logging AA-LCR's ~120K-token inputs.
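
Because several of these failures are invisible to --dry-run, a small pre-flight check on the parsed config can catch them before a job is submitted. The following is a minimal sketch only: `check_slurm_config` is a hypothetical helper, not part of the launcher, and it assumes the YAML has already been loaded into a plain dict with the key layout shown in this PR. The cpu_partition warning is only relevant on clusters with separate GPU and CPU partitions.

```python
def check_slurm_config(cfg: dict) -> list:
    """Return warnings for the SLURM pitfalls described in this PR.

    Hypothetical helper: `cfg` is assumed to be the eval YAML loaded as a dict.
    """
    warnings = []
    execution = cfg.get("execution", {})

    # mount_home must be explicitly false; some templates default it to true.
    if execution.get("mounts", {}).get("mount_home") is not False:
        warnings.append("set execution.mounts.mount_home: false")

    # execution.env_vars is reported as unsupported; shared vars go top-level.
    if "env_vars" in execution:
        warnings.append("move execution.env_vars to top-level env_vars")

    # Without cpu_partition the CPU-only export job lands on the GPU partition.
    if not execution.get("cpu_partition"):
        warnings.append("set execution.cpu_partition for the auto-export job")

    return warnings
```

An empty list means none of the three pitfalls was detected; it does not guarantee that the deploy will succeed.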

Usage

```yaml
execution:
  cpu_partition: <cpu-partition>   # CPU-only auto-export job
  mounts:
    mount_home: false
    deployment: { <shared-fs>/<user>/.cache/huggingface: /hf-cache }
    evaluation: { <shared-fs>/<user>/.cache/huggingface: /hf-cache }
env_vars:                          # shared by both stages
  HF_TOKEN: host:HF_TOKEN
  HF_HOME: lit:/hf-cache
deployment:
  env_vars:
    VLLM_USE_FLASHINFER_MOE_FP4: lit:1
    VLLM_FLASHINFER_MOE_BACKEND: lit:throughput
```
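
The `<shared-fs>/<user>/.cache/huggingface` placeholder in the mounts stands for the realpath of the host cache directory. One way to find that value is sketched below. `hf_cache_mount` is a hypothetical helper written for illustration; it assumes it runs on the host that submits the job and that the cache lives at the default ~/.cache/huggingface.

```python
import os


def hf_cache_mount(cache_dir: str = "~/.cache/huggingface",
                   target: str = "/hf-cache") -> dict:
    """Map the symlink-free host cache path to the container mount point."""
    # expanduser resolves "~"; realpath then follows any symlinks so the
    # mounted path cannot dangle inside the container.
    real = os.path.realpath(os.path.expanduser(cache_dir))
    return {real: target}


if __name__ == "__main__":
    # Prints a one-entry mapping to paste into execution.mounts.
    print(hf_cache_mount())
```

The same mapping is used for both the deployment and evaluation mounts, with HF_HOME set to the container-side target.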

Testing

Docs/skill-only. Validated end-to-end by running the full AA suite (GPQA + SciCode + AA-LCR) across 5 NVFP4 checkpoints on SLURM: dry-run + canary + full runs all SUCCESS with these settings. Pre-commit (yamlfmt, markdownlint) passed.

Before your PR is "Ready for review"

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: N/A
Did you update Changelog?: N/A
Did you get Claude approval on this PR?: ❌

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Clarified where backend-related environment variables must be configured (deployment-level) and not embedded in execution commands.
- Added guidance to mount the real HuggingFace cache path and set home-mounting to false for containerized SLURM runs.
- Documented routing CPU-only jobs via CPU partitions and added a task-level LOG_LEVEL setting to reduce verbose long-context logging.

Add guidance for setting vLLM-specific environment variables (e.g. VLLM_USE_FLASHINFER_MOE_FP4 + VLLM_FLASHINFER_MOE_BACKEND for NVFP4 MoE models like NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4) via the deployment.env_vars block rather than the command string.

- example_eval.yaml: commented FlashInfer FP4 example under deployment.env_vars
- SKILL.md Step 3: expand the extra-env-vars note as the single authority, with the FlashInfer example and lit: prefix usage
- SKILL.md command-generation section: brief cross-reference clarifying env vars are not CLI flags

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai · 2026-06-04T06:35:46Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Updates evaluation docs and example config: require vLLM backend env vars in deployment.env_vars as lit: values (not in command), add FlashInfer/NVFP4 commented examples, add SLURM/container mount and HF cache guidance, route MLflow via execution.cpu_partition, and set LOG_LEVEL: lit:WARNING for an LCR task.

Changes

Evaluation docs and examples

| Layer / File(s) | Summary |
|---|---|
| vLLM backend env var placement and examples (`.claude/skills/evaluation/SKILL.md`, `.claude/skills/evaluation/recipes/examples/example_eval.yaml`) | SKILL.md instructs placing vLLM backend/runtime env vars in `deployment.env_vars` as `lit:`-prefixed values and not embedding them in the `command`. `example_eval.yaml` adds comments for HF cache mounting and shows commented FlashInfer/NVFP4 FP4 toggle examples with `lit:` assignments. |
| Execution mounts, HF cache, SLURM routing, and task LOG_LEVEL (`.claude/skills/evaluation/SKILL.md`, `.claude/skills/evaluation/recipes/examples/example_eval.yaml`, `.claude/skills/evaluation/recipes/tasks/aa/lcr.md`) | Adds Step 4 SLURM/execution guidance: enforce `execution.mounts.mount_home: false`, mount the real HuggingFace cache to `/hf-cache` and set `HF_HOME: lit:/hf-cache`, use `execution.cpu_partition` for MLflow auto-export on CPU partitions, prefer top-level `env_vars` for shared variables, and sets `LOG_LEVEL: lit:WARNING` for the AA-LCR task fragment. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1595: Prior doc/template changes about lit:-prefixed env var placement and container-scoped env var semantics.

Suggested reviewers

kaix-nv
shengliangxu
chadvoegele

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: documenting vLLM backend environment variables and SLURM-specific operational guidance (HuggingFace cache mounting and CPU partition configuration). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | This is a documentation-only PR with 0 Python code changes. Files modified are: 16 Markdown files, 1 YAML configuration file, and 1 example config file. No security-sensitive code patterns detected. |

Tighten the prose added in the prior commit (Step 3 note, command-section cross-reference, and the example_eval.yaml comment) without changing meaning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

codecov · 2026-06-04T06:55:27Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.38%. Comparing base (ca7eb64) to head (23cb9a3).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1625      +/-   ##
==========================================
- Coverage   77.39%   77.38%   -0.01%
==========================================
  Files         482      482
  Lines       52960    53051      +91
==========================================
+ Hits        40986    41054      +68
- Misses      11974    11997      +23
```

| Flag | Coverage Δ | |
|---|---|---|
| unit | `53.93% <ø> (+<0.01%)` | ⬆️ |


shengliangxu · 2026-06-04T19:22:48Z

@kevalmorabia97 I think we need to create code owners for this agent-skills stuff.

kevalmorabia97 · 2026-06-04T20:09:07Z

> @kevalmorabia97 I think we need to create code owners for this agent-skills stuff.

Created https://github.com/orgs/NVIDIA/teams/modelopt-agents-codeowners.
Feel free to add more members to this group, and also add it to .github/CODEOWNERS for whichever folders we want it to cover.

…guidance

Lessons hardened while running AA evals on gcp-nrt (NVFP4 Nemotron-3-Nano):

- mount_home: ALWAYS false. Internal cluster templates (e.g. gcp-nrt) default it true, which mounts the host ~/.cache symlink-into-lustre; that dangles in the container and the vLLM trust-remote-code deploy dies with FileNotFoundError /root/.cache/huggingface (invisible to --dry-run).
- HF cache: mount the realpath of ~/.cache/huggingface to /hf-cache and set HF_HOME: lit:/hf-cache for both stages (sidesteps the symlink, reuses token).
- Auto-export on split GPU/CPU-partition clusters: set execution.cpu_partition (e.g. cpu) so the CPU-only export job isn't rejected by the GPU partition (which otherwise marks the whole task FAILED despite EVAL_EXIT_CODE=0).
- Shared env vars go in a top-level env_vars: block (merges into deployment + evaluation); execution.env_vars is unsupported and hard-errors.
- lcr.md: LOG_LEVEL=WARNING to skip logging AA-LCR's ~120K-token inputs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

Replace internal cluster filesystem references with neutral wording (shared/networked filesystem) in the HF-cache mount guidance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

…dance

- example_eval.yaml: set HF_HOME: lit:/hf-cache under deployment.env_vars and evaluation.env_vars; tighten the mounts comment block.
- SKILL.md: collapse the mount_home / cpu_partition / shared-env_vars notes into one compact "SLURM gotchas" block with a single yaml snippet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

Tighten the three bullets to one line each; fold the realpath hint into the yaml snippet. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

meenchen

Bot review — DM the bot to share feedback.

Docs-only change to the evaluation Claude skill. Three changes are coherent and consistent with each other: (1) SKILL.md + example_eval.yaml steer vLLM backend env vars (e.g. VLLM_USE_FLASHINFER_MOE_FP4) into deployment.env_vars rather than the vllm serve command, with the lit: prefix; (2) a SLURM "gotchas" block documents mount_home: false, real-cache mount → HF_HOME: lit:/hf-cache, cpu_partition for the chained CPU-only auto-export job, and top-level env_vars: for shared vars; (3) lcr.md adds LOG_LEVEL: lit:WARNING to skip logging ~120K-token AA-LCR inputs.

Author reports end-to-end validation across 5 NVFP4 checkpoints (dry-run + canary + full SUCCESS) and pre-commit clean.

Minor: SKILL.md recommends top-level env_vars: for HF_TOKEN/HF_HOME, but the example yaml duplicates them under both deployment.env_vars and evaluation.env_vars instead. Both forms work, but it is worth aligning if the top-level pattern is the recommended one. Also, cpu_partition is documented in the gotcha block but not surfaced as a commented line in example_eval.yaml itself; a one-line addition there would aid discoverability.

Neither is blocking. No licensing impact, no prompt-injection in PR content.

- SKILL.md: note shared env_vars work top-level OR per-stage (matches the example's per-stage layout) instead of prescribing only top-level.
- example_eval.yaml: surface cpu_partition as a commented line for discoverability.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

Edwardf0t1

Can we check the agent's traces in MLflow to see whether it behaves as expected?

cpu_partition is optional + cluster-specific and already covered in SKILL.md's SLURM gotchas; keep the generic example lean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

Make explicit: NEL runs MLflow export as a separate CPU-only job chained via afterok; partition = cpu_partition or execution.partition (no auto-routing); unset -> lands on the GPU partition -> rejected -> fails the task despite EVAL_EXIT_CODE=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
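
The fallback rule in this commit message (cpu_partition if set, otherwise execution.partition, with no auto-routing) can be sketched as below. This illustrates the described behavior only and is not the launcher's actual code; `export_partition` and the partition names `gpu` and `cpu` are hypothetical.

```python
from typing import Optional


def export_partition(execution: dict) -> Optional[str]:
    """Pick the partition for the CPU-only export job, per the rule above."""
    # cpu_partition wins when set; otherwise the job falls back to the main
    # partition, which on a split cluster is the GPU partition.
    return execution.get("cpu_partition") or execution.get("partition")


if __name__ == "__main__":
    print(export_partition({"partition": "gpu"}))
    print(export_partition({"partition": "gpu", "cpu_partition": "cpu"}))
```

In the first call the export job would land on the GPU partition, which is the case the commit describes as rejected and failing the task despite EVAL_EXIT_CODE=0.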

… for export Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

Edwardf0t1

SKILL.md is getting too long; we should keep it lean. Suggestions:

- Move the SLURM gotchas into a new references/slurm.md (mount_home, hf-cache, cpu_partition, top-level vs per-stage env_vars), matching the existing reference-file pattern.
- In SKILL.md Step 4, replace the blockquote with a one-line pointer, e.g. "On SLURM, several deploy/eval failures are invisible to --dry-run and only surface at canary (mount_home, HF cache, cpu_partition); read references/slurm.md."
- Keep the deployment.env_vars backend-env-var note where it is (that one's short, directly in the deployment-config flow, and genuinely high-traffic), but it can lose the second near-duplicate sentence added at line ~143.
- The example_eval.yaml and lcr.md comments are fine as-is; that's the right place for copy-paste-adjacent detail.

coderabbitai Bot approved these changes Jun 4, 2026

cjluo-nv changed the title ~~docs(eval skill): document vLLM backend env vars in deployment config~~ docs(eval skill): vLLM backend env vars + SLURM HF-cache/cpu_partition guidance Jun 4, 2026

cjluo-nv and others added 3 commits June 4, 2026 14:42

docs(eval skill): further compress SLURM gotchas block

2b6e969

Tighten the three bullets to one line each; fold the realpath hint into the yaml snippet. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

cjluo-nv requested review from Edwardf0t1, chadvoegele and meenchen June 4, 2026 21:51

meenchen approved these changes Jun 4, 2026

Edwardf0t1 reviewed Jun 4, 2026

cjluo-nv and others added 4 commits June 4, 2026 15:15

docs(eval skill): tighten cpu_partition note; emphasize it's required…

de7b8eb

… for export Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

docs(eval skill): reword cpu_partition note ("if not specified")

23cb9a3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

meenchen approved these changes Jun 4, 2026

Edwardf0t1 requested changes Jun 5, 2026

cjluo-nv wants to merge 11 commits into main from eval-skill-vllm-env-vars

5 participants
