docs(eval skill): vLLM backend env vars + SLURM HF-cache/cpu_partition guidance#1625

Open: cjluo-nv wants to merge 11 commits into main from eval-skill-vllm-env-vars

@cjluo-nv (Collaborator) commented Jun 4, 2026

What does this PR do?

Type of change: documentation

Hardens the evaluation skill with operational guidance discovered while running AA-Index evals (NVFP4 Nemotron-3-Nano) on SLURM. Two themes:

1. vLLM backend env vars (original commits). Some models need a model-card backend toggle that is an env var, not a CLI flag — e.g. NVFP4 MoE models like NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 need VLLM_USE_FLASHINFER_MOE_FP4=1 + VLLM_FLASHINFER_MOE_BACKEND=throughput. These go in deployment.env_vars (with the lit: prefix), not the vllm serve command.

2. SLURM deploy/eval operational lessons (new commits).

  • mount_home: false always. Some internal cluster templates default it to true, which mounts the host ~/.cache; where that is a symlink into a shared/networked filesystem, it dangles in the container and the vLLM trust-remote-code deploy dies with FileNotFoundError: /root/.cache/huggingface (invisible to --dry-run).
  • HF cache: mount the realpath of ~/.cache/huggingface to /hf-cache and set HF_HOME: lit:/hf-cache for both stages.
  • execution.cpu_partition: on split GPU/CPU-partition clusters, the CPU-only MLflow auto-export job is rejected by the GPU partition and marks the whole task FAILED despite EVAL_EXIT_CODE=0. Set cpu_partition to route it correctly.
  • Top-level env_vars: for vars both stages need (HF_TOKEN, HF_HOME); execution.env_vars is unsupported and hard-errors.
  • lcr.md: LOG_LEVEL=WARNING to skip logging AA-LCR's ~120K-token inputs.
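
The cpu_partition bullet above amounts to a simple fallback rule, which a later commit in this PR states as "partition = cpu_partition or execution.partition (no auto-routing)". A minimal Python sketch of that rule follows. The function name, the dict shape, and the partition name "gpu" are hypothetical, and this is not NEL's actual code.

```python
def export_job_partition(execution: dict) -> str:
    """Pick the SLURM partition for the CPU-only MLflow auto-export job.

    Hypothetical helper that mirrors the rule
    "cpu_partition or execution.partition".
    """
    # An unset or empty cpu_partition falls back to the main partition.
    return execution.get("cpu_partition") or execution["partition"]


# With cpu_partition set, the export job is routed to the CPU partition.
routed = export_job_partition({"partition": "gpu", "cpu_partition": "cpu"})
# Without it, the export job lands on the main (GPU) partition, which a
# split GPU/CPU cluster may reject.
fallback = export_job_partition({"partition": "gpu"})
print(routed, fallback)
```

Because there is no auto-routing, the only way to keep the export job off the GPU partition in this model is to set cpu_partition explicitly.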

Usage

execution:
  cpu_partition: <cpu-partition>   # CPU-only auto-export job
  mounts:
    mount_home: false
    deployment: { <shared-fs>/<user>/.cache/huggingface: /hf-cache }
    evaluation: { <shared-fs>/<user>/.cache/huggingface: /hf-cache }
env_vars:                          # shared by both stages
  HF_TOKEN: host:HF_TOKEN
  HF_HOME: lit:/hf-cache
deployment:
  env_vars:
    VLLM_USE_FLASHINFER_MOE_FP4: lit:1
    VLLM_FLASHINFER_MOE_BACKEND: lit:throughput
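
The <shared-fs>/<user>/.cache/huggingface placeholders above stand for the resolved (symlink-free) path of ~/.cache/huggingface. One way to find it is sketched below in Python, assuming the cache sits at the default ~/.cache/huggingface location. The helper name hf_cache_mount is hypothetical and is not part of the skill or of NEL.

```python
import os


def hf_cache_mount(cache_path: str = "~/.cache/huggingface",
                   container_path: str = "/hf-cache") -> dict:
    """Return a {host_path: container_path} mapping with symlinks resolved.

    Hypothetical helper; the mapping shape matches the mounts entries above.
    """
    # expanduser turns "~" into the home directory; realpath follows symlinks,
    # so a ~/.cache that points into a shared filesystem resolves to its target.
    host_path = os.path.realpath(os.path.expanduser(cache_path))
    return {host_path: container_path}


if __name__ == "__main__":
    # Prints the mapping to paste under execution.mounts.deployment/evaluation.
    print(hf_cache_mount())
```

Mounting the resolved path instead of the symlink means the container never has to follow a link whose target is not mounted.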

Testing

Docs/skill-only. Validated end-to-end by running the full AA suite (GPQA + SciCode + AA-LCR) across 5 NVFP4 checkpoints on SLURM: dry-run + canary + full runs all SUCCESS with these settings. Pre-commit (yamlfmt, markdownlint) passed.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: ❌

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Clarified where backend-related environment variables must be configured (deployment-level) and not embedded in execution commands.
    • Added guidance to mount the real HuggingFace cache path and set home-mounting to false for containerized SLURM runs.
    • Documented routing CPU-only jobs via CPU partitions and added a task-level LOG_LEVEL setting to reduce verbose long-context logging.

Add guidance for setting vLLM-specific environment variables (e.g.
VLLM_USE_FLASHINFER_MOE_FP4 + VLLM_FLASHINFER_MOE_BACKEND for NVFP4 MoE
models like NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4) via the
deployment.env_vars block rather than the command string.

- example_eval.yaml: commented FlashInfer FP4 example under deployment.env_vars
- SKILL.md Step 3: expand the extra-env-vars note as the single authority,
  with the FlashInfer example and lit: prefix usage
- SKILL.md command-generation section: brief cross-reference clarifying env
  vars are not CLI flags

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@coderabbitai Bot (Contributor) commented Jun 4, 2026

Walkthrough

Updates evaluation docs and example config: require vLLM backend env vars in deployment.env_vars as lit: values (not in command), add FlashInfer/NVFP4 commented examples, add SLURM/container mount and HF cache guidance, route MLflow via execution.cpu_partition, and set LOG_LEVEL: lit:WARNING for an LCR task.

Changes

Evaluation docs and examples

| Layer / File(s) | Summary |
| --- | --- |
| vLLM backend env var placement and examples (.claude/skills/evaluation/SKILL.md, .claude/skills/evaluation/recipes/examples/example_eval.yaml) | SKILL.md instructs placing vLLM backend/runtime env vars in deployment.env_vars as lit:-prefixed values and not embedding them in the command. example_eval.yaml adds comments for HF cache mounting and shows commented FlashInfer/NVFP4 FP4 toggle examples with lit: assignments. |
| Execution mounts, HF cache, SLURM routing, and task LOG_LEVEL (.claude/skills/evaluation/SKILL.md, .claude/skills/evaluation/recipes/examples/example_eval.yaml, .claude/skills/evaluation/recipes/tasks/aa/lcr.md) | Adds Step 4 SLURM/execution guidance: enforce execution.mounts.mount_home: false, mount the real HuggingFace cache to /hf-cache and set HF_HOME: lit:/hf-cache, use execution.cpu_partition for MLflow auto-export on CPU partitions, prefer top-level env_vars for shared variables, and sets LOG_LEVEL: lit:WARNING for the AA-LCR task fragment. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1595: Prior doc/template changes about lit:-prefixed env var placement and container-scoped env var semantics.

Suggested reviewers

  • kaix-nv
  • shengliangxu
  • chadvoegele

🚥 Pre-merge checks: ✅ 6 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: documenting vLLM backend environment variables and SLURM-specific operational guidance (HuggingFace cache mounting and CPU partition configuration). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | This is a documentation-only PR with 0 Python code changes. Files modified are: 16 Markdown files, 1 YAML configuration file, and 1 example config file. No security-sensitive code patterns detected. |


Tighten the prose added in the prior commit (Step 3 note, command-section
cross-reference, and the example_eval.yaml comment) without changing meaning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@codecov Bot commented Jun 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.38%. Comparing base (ca7eb64) to head (23cb9a3).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1625      +/-   ##
==========================================
- Coverage   77.39%   77.38%   -0.01%     
==========================================
  Files         482      482              
  Lines       52960    53051      +91     
==========================================
+ Hits        40986    41054      +68     
- Misses      11974    11997      +23     
| Flag | Coverage Δ |
| --- | --- |
| unit | 53.93% <ø> (+<0.01%) ⬆️ |


@shengliangxu (Collaborator) commented

@kevalmorabia97 I think we need to create code owners for these agent skills stuff.

@kevalmorabia97 (Collaborator) commented Jun 4, 2026

> @kevalmorabia97 I think we need to create code owners for these agent skills stuff.

Created https://github.com/orgs/NVIDIA/teams/modelopt-agents-codeowners
Feel free to add more members to this group and also add it to .github/CODEOWNERS for whatever folders we want to set this for

…guidance

Lessons hardened while running AA evals on gcp-nrt (NVFP4 Nemotron-3-Nano):

- mount_home: ALWAYS false. Internal cluster templates (e.g. gcp-nrt) default
  it true, which mounts the host ~/.cache symlink-into-lustre; that dangles in
  the container and the vLLM trust-remote-code deploy dies with
  FileNotFoundError /root/.cache/huggingface (invisible to --dry-run).
- HF cache: mount the realpath of ~/.cache/huggingface to /hf-cache and set
  HF_HOME: lit:/hf-cache for both stages (sidesteps the symlink, reuses token).
- Auto-export on split GPU/CPU-partition clusters: set execution.cpu_partition
  (e.g. cpu) so the CPU-only export job isn't rejected by the GPU partition
  (which otherwise marks the whole task FAILED despite EVAL_EXIT_CODE=0).
- Shared env vars go in a top-level env_vars: block (merges into deployment +
  evaluation); execution.env_vars is unsupported and hard-errors.
- lcr.md: LOG_LEVEL=WARNING to skip logging AA-LCR's ~120K-token inputs.
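
The "merges into deployment + evaluation" behavior in the shared-env-vars bullet above can be pictured as a dictionary merge. The Python sketch below is illustrative only: the function name is hypothetical, and the precedence (a per-stage value overriding a top-level one) is an assumption that this PR does not confirm.

```python
def merged_stage_env(top_level: dict, stage: dict) -> dict:
    """Combine shared env vars with one stage's own env vars.

    Assumption (not confirmed by the source): on a key conflict,
    the per-stage value wins over the top-level one.
    """
    return {**top_level, **stage}


# Values reuse the host:/lit: prefixed strings from the PR's usage snippet.
shared = {"HF_TOKEN": "host:HF_TOKEN", "HF_HOME": "lit:/hf-cache"}
deployment_only = {"VLLM_USE_FLASHINFER_MOE_FP4": "lit:1"}

deployment_env = merged_stage_env(shared, deployment_only)
evaluation_env = merged_stage_env(shared, {})
print(sorted(deployment_env))
```

Under this model the backend toggle reaches only the deployment stage, while HF_TOKEN and HF_HOME reach both.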

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv changed the title from "docs(eval skill): document vLLM backend env vars in deployment config" to "docs(eval skill): vLLM backend env vars + SLURM HF-cache/cpu_partition guidance" on Jun 4, 2026
cjluo-nv and others added 3 commits June 4, 2026 14:42
Replace internal cluster filesystem references with neutral wording
(shared/networked filesystem) in the HF-cache mount guidance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
…dance

- example_eval.yaml: set HF_HOME: lit:/hf-cache under deployment.env_vars and
  evaluation.env_vars; tighten the mounts comment block.
- SKILL.md: collapse the mount_home / cpu_partition / shared-env_vars notes into
  one compact "SLURM gotchas" block with a single yaml snippet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Tighten the three bullets to one line each; fold the realpath hint into the
yaml snippet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@meenchen (Contributor) left a comment

Bot review — DM the bot to share feedback.

Docs-only change to the evaluation Claude skill. Three changes are coherent and consistent with each other:

  1. SKILL.md + example_eval.yaml steer vLLM backend env vars (e.g. VLLM_USE_FLASHINFER_MOE_FP4) into deployment.env_vars rather than the vllm serve command, with the lit: prefix.
  2. A SLURM "gotchas" block documents mount_home: false, real-cache mount → HF_HOME: lit:/hf-cache, cpu_partition for the chained CPU-only auto-export job, and top-level env_vars: for shared vars.
  3. lcr.md adds LOG_LEVEL: lit:WARNING to skip logging ~120K-token AA-LCR inputs.

Author reports end-to-end validation across 5 NVFP4 checkpoints (dry-run + canary + full SUCCESS) and pre-commit clean.

Minor, neither blocking:

  • SKILL.md recommends top-level env_vars: for HF_TOKEN/HF_HOME but the example yaml duplicates them under both deployment.env_vars and evaluation.env_vars instead. Both forms work, but it is worth aligning them if the top-level pattern is the recommended one.
  • cpu_partition is documented in the gotcha block but not surfaced as a commented line in example_eval.yaml itself; a one-line addition there would aid discoverability.

No licensing impact, no prompt-injection in PR content.

- SKILL.md: note shared env_vars work top-level OR per-stage (matches the
  example's per-stage layout) instead of prescribing only top-level.
- example_eval.yaml: surface cpu_partition as a commented line for discoverability.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@Edwardf0t1 (Contributor) left a comment

Can we check the agent's traces in MLflow to see if it behaves as expected?

cjluo-nv and others added 4 commits June 4, 2026 15:15
cpu_partition is optional + cluster-specific and already covered in SKILL.md's
SLURM gotchas; keep the generic example lean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Make explicit: NEL runs MLflow export as a separate CPU-only job chained via
afterok; partition = cpu_partition or execution.partition (no auto-routing);
unset -> lands on the GPU partition -> rejected -> fails the task despite
EVAL_EXIT_CODE=0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
… for export

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@Edwardf0t1 (Contributor) left a comment

SKILL.md is getting too long; we should keep it lean. Suggestions:

  • Move the SLURM gotchas into a new references/slurm.md (mount_home, hf-cache, cpu_partition, top-level vs per-stage env_vars), matching the existing reference-file pattern.
  • In SKILL.md Step 4, replace the blockquote with a one-liner pointer, e.g. "On SLURM, several deploy/eval failures are invisible to --dry-run and only surface at canary (mount_home, HF cache, cpu_partition) — read references/slurm.md."
  • Keep the deployment.env_vars backend-env-var note where it is (that one's short, directly in the deployment-config flow, and genuinely high-traffic), but it can lose the second near-duplicate sentence added at line ~143.
  • The example_eval.yaml and lcr.md comments are fine as-is — that's the right place for copy-paste-adjacent detail.
