fix: use smoothed reward metric for VLM GRPO CLEVR convergence tests #2015
Conversation
The nightly GB200 test `vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1` was failing because the metric check evaluated `train/reward` at a single step (200), which is subject to per-step noise (std ~0.04 in the converged region). The GB200 run actually converged well (mean = 0.954, std = 0.043 over steps 151-200) but happened to land on a noise dip at step 200 (reward = 0.9000, failing `> 0.9`), while step 199 was 0.9813. Replace single-step checks with a smoothed mean over the last 20 steps (181-200) for all four VLM GRPO CLEVR test variants (dtensor and megatron, 4g and 8g).

Fixes NVIDIA-NeMo#1996

Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>
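Why a 20-step mean fixes the flakiness can be sketched with synthetic numbers (this simulation mirrors the reported statistics, mean 0.954 / std 0.043, but is not the actual CI data):

```python
import random

random.seed(0)
# Simulate a converged reward curve with the statistics reported for
# steps 151-200 of the GB200 run: mean ~0.954, per-step std ~0.043.
rewards = {step: random.gauss(0.954, 0.043) for step in range(151, 201)}

# Single-step check: pass/fail hinges on whichever noise sample lands on step 200.
single_step_pass = rewards[200] > 0.9

# Smoothed check: mean over the last 20 steps (181-200) averages out the noise;
# the std of the 20-step mean is ~0.043/sqrt(20) ~ 0.01, far from the 0.9 threshold.
window = [rewards[s] for s in range(181, 201)]
smoothed_pass = sum(window) / len(window) > 0.9
print("single-step:", single_step_pass, "smoothed:", smoothed_pass)
```

The single-step result varies run to run, while the smoothed check is stable for any genuinely converged curve.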
📝 Walkthrough

Four test scripts in the VLM GRPO test suite were modified to change how metric evaluation is performed. The changes replace single-step threshold checks (at step 200) with moving-average style checks that evaluate rewards and losses over a range starting from step 181, addressing test failures when metrics don't meet thresholds at specific checkpoints.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🧹 Nitpick comments (1)
`tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh` (1)
**38-39:** Avoid hardcoding the smoothing window start.

If `MAX_STEPS` changes, the fixed `181` will drift from the intended last-20-steps window. Consider deriving it from `MAX_STEPS` to keep the window consistent.

💡 Suggested refactor
```diff
-uv run tests/check_metrics.py $JSON_METRICS \
-    'mean(data["train/loss"], range_start=181) < 0.1' \
-    'mean(data["train/reward"], range_start=181) > 0.9'
+RANGE_START=$((MAX_STEPS-19))
+uv run tests/check_metrics.py $JSON_METRICS \
+    "mean(data[\"train/loss\"], range_start=${RANGE_START}) < 0.1" \
+    "mean(data[\"train/reward\"], range_start=${RANGE_START}) > 0.9"
```
📒 Files selected for processing (4)

- `tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1.sh`
- `tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh`
- `tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh`
- `tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh`
Summary
Fixes #1996
The nightly GB200 convergence test `vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1` was intermittently failing because the metric check evaluated `train/reward` at a single step (200), which is subject to per-step training noise.

Root Cause
Analysis of the CI logs from pipeline 44432479 showed:
The GB200 run actually converged better on average but happened to land on a noise dip at step 200 (reward = 0.9000, failing `> 0.9`), while step 199 was 0.9813.

Reward comparison from issue #1996:
Fix
Replace single-step metric checks with a smoothed mean over the last 20 steps (181-200) using the existing `mean()` helper in `check_metrics.py`. This is applied to all four VLM GRPO CLEVR test variants:

- `vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1.sh` (the failing test)
- `vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh` (H100 counterpart)
- `vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh` (Megatron GB200)
- `vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh` (Megatron H100)

Before:
`'data["train/reward"]["200"] > 0.9'`

After:

`'mean(data["train/reward"], range_start=181) > 0.9'`

Test plan