docs(eval skill): vLLM backend env vars + SLURM HF-cache/cpu_partition guidance#1625
cjluo-nv wants to merge 11 commits into
Add guidance for setting vLLM-specific environment variables (e.g. VLLM_USE_FLASHINFER_MOE_FP4 + VLLM_FLASHINFER_MOE_BACKEND for NVFP4 MoE models like NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4) via the deployment.env_vars block rather than the command string.

- example_eval.yaml: commented FlashInfer FP4 example under deployment.env_vars
- SKILL.md Step 3: expand the extra-env-vars note as the single authority, with the FlashInfer example and lit: prefix usage
- SKILL.md command-generation section: brief cross-reference clarifying env vars are not CLI flags

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Tighten the prose added in the prior commit (Step 3 note, command-section cross-reference, and the example_eval.yaml comment) without changing meaning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #1625 +/- ##
==========================================
- Coverage 77.39% 77.38% -0.01%
==========================================
Files 482 482
Lines 52960 53051 +91
==========================================
+ Hits 40986 41054 +68
- Misses 11974 11997 +23
@kevalmorabia97 I think we need to create code owners for these agent skills.
Created https://github.com/orgs/NVIDIA/teams/modelopt-agents-codeowners
…guidance

Lessons hardened while running AA evals on gcp-nrt (NVFP4 Nemotron-3-Nano):

- mount_home: ALWAYS false. Internal cluster templates (e.g. gcp-nrt) default it true, which mounts the host ~/.cache symlink-into-lustre; that dangles in the container and the vLLM trust-remote-code deploy dies with FileNotFoundError /root/.cache/huggingface (invisible to --dry-run).
- HF cache: mount the realpath of ~/.cache/huggingface to /hf-cache and set HF_HOME: lit:/hf-cache for both stages (sidesteps the symlink, reuses token).
- Auto-export on split GPU/CPU-partition clusters: set execution.cpu_partition (e.g. cpu) so the CPU-only export job isn't rejected by the GPU partition (which otherwise marks the whole task FAILED despite EVAL_EXIT_CODE=0).
- Shared env vars go in a top-level env_vars: block (merges into deployment + evaluation); execution.env_vars is unsupported and hard-errors.
- lcr.md: LOG_LEVEL=WARNING to skip logging AA-LCR's ~120K-token inputs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Replace internal cluster filesystem references with neutral wording (shared/networked filesystem) in the HF-cache mount guidance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
…dance

- example_eval.yaml: set HF_HOME: lit:/hf-cache under deployment.env_vars and evaluation.env_vars; tighten the mounts comment block.
- SKILL.md: collapse the mount_home / cpu_partition / shared-env_vars notes into one compact "SLURM gotchas" block with a single yaml snippet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Tighten the three bullets to one line each; fold the realpath hint into the yaml snippet. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
meenchen left a comment
Bot review — DM the bot to share feedback.
Docs-only change to the evaluation Claude skill. Three changes are coherent and consistent with each other:
(1) SKILL.md + example_eval.yaml steer vLLM backend env vars (e.g. VLLM_USE_FLASHINFER_MOE_FP4) into deployment.env_vars rather than the vllm serve command, with the lit: prefix;
(2) a SLURM "gotchas" block documents mount_home: false, real-cache mount → HF_HOME: lit:/hf-cache, cpu_partition for the chained CPU-only auto-export job, and top-level env_vars: for shared vars;
(3) lcr.md adds LOG_LEVEL: lit:WARNING to skip logging ~120K-token AA-LCR inputs.

Author reports end-to-end validation across 5 NVFP4 checkpoints (dry-run + canary + full SUCCESS) and pre-commit clean.

Minor: SKILL.md recommends top-level env_vars: for HF_TOKEN/HF_HOME, but the example yaml duplicates them under both deployment.env_vars and evaluation.env_vars instead. Both forms work, but they are worth aligning if the top-level pattern is the recommended one. Also, cpu_partition is documented in the gotcha block but not surfaced as a commented line in example_eval.yaml itself; a one-line addition there would aid discoverability. Neither is blocking. No licensing impact, no prompt-injection in PR content.
- SKILL.md: note shared env_vars work top-level OR per-stage (matches the example's per-stage layout) instead of prescribing only top-level.
- example_eval.yaml: surface cpu_partition as a commented line for discoverability.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Edwardf0t1 left a comment
Can we check the agent's traces in MLflow to see whether it behaves as expected?
cpu_partition is optional + cluster-specific and already covered in SKILL.md's SLURM gotchas; keep the generic example lean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Make explicit: NEL runs MLflow export as a separate CPU-only job chained via afterok; partition = cpu_partition or execution.partition (no auto-routing); unset -> lands on the GPU partition -> rejected -> fails the task despite EVAL_EXIT_CODE=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
… for export Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Edwardf0t1 left a comment
SKILL.md is getting too long; we should keep it lean. Suggestions:
- Move the SLURM gotchas into a new references/slurm.md (mount_home, hf-cache, cpu_partition, top-level vs per-stage env_vars), matching the existing reference-file pattern.
- In SKILL.md Step 4, replace the blockquote with a one-liner pointer, e.g. "On SLURM, several deploy/eval failures are invisible to --dry-run and only surface at canary (mount_home, HF cache, cpu_partition) — read references/slurm.md."
- Keep the deployment.env_vars backend-env-var note where it is (that one's short, directly in the deployment-config flow, and genuinely high-traffic), but it can lose the second near-duplicate sentence added at line ~143.
- The example_eval.yaml and lcr.md comments are fine as-is — that's the right place for copy-paste-adjacent detail.
What does this PR do?
Type of change: documentation
Hardens the evaluation skill with operational guidance discovered while running AA-Index evals (NVFP4 Nemotron-3-Nano) on SLURM. Two themes:
1. vLLM backend env vars (original commits). Some models need a model-card backend toggle that is an env var, not a CLI flag; e.g. NVFP4 MoE models like NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 need VLLM_USE_FLASHINFER_MOE_FP4=1 + VLLM_FLASHINFER_MOE_BACKEND=throughput. These go in deployment.env_vars (with the lit: prefix), not the vllm serve command.
2. SLURM deploy/eval operational lessons (new commits).
- mount_home: false always. Some internal cluster templates default it true, which mounts the host ~/.cache; where that is a symlink into a shared/networked filesystem, it dangles in the container and the vLLM trust-remote-code deploy dies with FileNotFoundError: /root/.cache/huggingface (invisible to --dry-run).
- HF cache: mount the realpath of ~/.cache/huggingface to /hf-cache and set HF_HOME: lit:/hf-cache for both stages.
- execution.cpu_partition: on split GPU/CPU-partition clusters, the CPU-only MLflow auto-export job is rejected by the GPU partition and marks the whole task FAILED despite EVAL_EXIT_CODE=0. Set cpu_partition to route it correctly.
- Top-level env_vars: for vars both stages need (HF_TOKEN, HF_HOME); execution.env_vars is unsupported and hard-errors.
- lcr.md: LOG_LEVEL=WARNING to skip logging AA-LCR's ~120K-token inputs.

Usage
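A minimal sketch of how the settings described above might fit together in an eval config. Only the key names quoted in this PR (deployment.env_vars, evaluation.env_vars, execution.cpu_partition, mount_home, HF_HOME, the lit: prefix, and the two VLLM_* variables) come from the PR text. The nesting of mounts and mount_home under execution, the per-stage mount layout, and the host path are assumptions for illustration, so check them against example_eval.yaml before copying.

```yaml
# Illustrative fragment only; not the PR's actual example_eval.yaml.
execution:
  cpu_partition: cpu        # optional; only for clusters with split GPU/CPU partitions
  mounts:                   # ASSUMPTION: exact nesting and per-stage layout
    mount_home: false       # per this PR: always false (placement here is assumed)
    deployment:
      # Placeholder host path: use the output of `realpath ~/.cache/huggingface`
      /path/to/real/huggingface: /hf-cache
    evaluation:
      /path/to/real/huggingface: /hf-cache

deployment:
  env_vars:
    HF_HOME: lit:/hf-cache
    # Backend toggles for NVFP4 MoE models; env vars, not `vllm serve` flags
    VLLM_USE_FLASHINFER_MOE_FP4: lit:1
    VLLM_FLASHINFER_MOE_BACKEND: lit:throughput

evaluation:
  env_vars:
    HF_HOME: lit:/hf-cache
```

The per-stage layout is used here because the review thread notes that shared variables work either top-level or per-stage; nothing is placed under execution.env_vars, which the PR says is unsupported.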
Testing
Docs/skill-only. Validated end-to-end by running the full AA suite (GPQA + SciCode + AA-LCR) across 5 NVFP4 checkpoints on SLURM: dry-run + canary + full runs all SUCCESS with these settings. Pre-commit (yamlfmt, markdownlint) passed.
Before your PR is "Ready for review"
CONTRIBUTING.md: N/A

🤖 Generated with Claude Code