
fix: use smoothed reward metric for VLM GRPO CLEVR convergence tests#2015

Open
aroshanghias-nvd wants to merge 1 commit into NVIDIA-NeMo:main from aroshanghias-nvd:aroshanghias/fix-vlm-grpo-flaky-metric-check

Conversation


@aroshanghias-nvd aroshanghias-nvd commented Feb 23, 2026

Summary

Fixes #1996

The nightly GB200 convergence test vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1 was intermittently failing because the metric check evaluated train/reward at a single step (200), which is subject to per-step training noise.

Root Cause

Analysis of the CI logs from pipeline 44432479 showed:

| Metric | H100 (1n8g, passed) | GB200 (1n4g, failed) |
| --- | --- | --- |
| Reward at step 200 | 0.9500 | 0.9000 |
| Mean reward (steps 151-200) | 0.942 | 0.954 |
| Std reward (steps 151-200) | 0.036 | 0.043 |

The GB200 run actually converged better on average but happened to land on a noise dip at step 200 (reward = 0.9000, failing the > 0.9 check), while step 199 measured 0.9813.
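To make the noise argument concrete, here is a hypothetical simulation (not project code; the distribution parameters are taken from the GB200 numbers above) comparing how often a single-step check fails versus a 20-step mean check under Gaussian per-step noise:

```python
# Hypothetical illustration: with per-step reward noise of std ~0.043
# around a converged mean of 0.954, a check at a single step fails far
# more often than a check on the mean of the last 20 steps.
import random
import statistics

random.seed(0)
MEAN, STD, THRESHOLD, WINDOW, TRIALS = 0.954, 0.043, 0.9, 20, 10_000

single_fail = window_fail = 0
for _ in range(TRIALS):
    rewards = [random.gauss(MEAN, STD) for _ in range(WINDOW)]
    if rewards[-1] <= THRESHOLD:               # single-step check at the last step
        single_fail += 1
    if statistics.mean(rewards) <= THRESHOLD:  # smoothed check over the window
        window_fail += 1

print(f"single-step failure rate: {single_fail / TRIALS:.3f}")
print(f"windowed-mean failure rate: {window_fail / TRIALS:.3f}")
```

Averaging over 20 steps shrinks the noise std by roughly a factor of sqrt(20), which is why the windowed check is effectively immune to dips like the one at step 200.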

Reward comparison from issue #1996:


Fix

Replace the single-step metric checks with a smoothed mean over the last 20 steps (181-200), using the existing mean() helper in check_metrics.py. The change is applied to all four VLM GRPO CLEVR test variants:

  • vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1.sh (the failing test)
  • vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh (H100 counterpart)
  • vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh (Megatron GB200)
  • vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh (Megatron H100)

Before:

'data["train/reward"]["200"] > 0.9'

After:

'mean(data["train/reward"], range_start=181) > 0.9'
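For illustration, here is a minimal sketch of what a mean() helper with a range_start parameter could look like; the actual implementation in check_metrics.py may differ, and the metrics dict below is fabricated:

```python
# Hypothetical sketch of a mean() helper like the one used by
# check_metrics.py; the real implementation may differ. Metric values
# are assumed to be keyed by step number as strings, as in the checks.
def mean(metric, range_start=0, range_end=None):
    """Average the metric values whose step falls in [range_start, range_end]."""
    values = [
        value
        for step, value in metric.items()
        if int(step) >= range_start and (range_end is None or int(step) <= range_end)
    ]
    return sum(values) / len(values)


# Fabricated data: a converged run with a noise dip at the final step.
data = {"train/reward": {str(step): 0.95 for step in range(181, 201)}}
data["train/reward"]["200"] = 0.90

# The old single-step check fails on the dip...
print(data["train/reward"]["200"] > 0.9)
# ...while the smoothed check still passes.
print(mean(data["train/reward"], range_start=181) > 0.9)
```

This mirrors the failure mode described above: the step-200 dip fails the point check, while the windowed mean comfortably clears the threshold.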

Test plan

  • Nightly GB200 convergence test passes consistently
  • Nightly H100 convergence test continues to pass
  • No functional change to training — only the metric evaluation logic is changed

The nightly GB200 test vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1
was failing because the metric check evaluated train/reward at a single step (200),
which is subject to per-step noise (std ~0.04 in the converged region). The GB200
run actually converged well (mean=0.954, std=0.043 over steps 151-200) but happened
to land on a noise dip at step 200 (reward=0.9000, failing > 0.9), while step 199
was 0.9813.

Replace single-step checks with smoothed mean over the last 20 steps (181-200) for
all four VLM GRPO CLEVR test variants (dtensor and megatron, 4g and 8g).

Fixes NVIDIA-NeMo#1996

Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>
@aroshanghias-nvd aroshanghias-nvd requested a review from a team as a code owner February 23, 2026 21:19

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Four test scripts in the VLM GRPO test suite were modified to change how metric evaluation is performed. The changes replace single-step threshold checks (at step 200) with moving-average style checks that evaluate rewards and losses over a range starting from step 181, addressing test failures when metrics don't meet thresholds at specific checkpoints.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **VLM GRPO Test Suite Metric Evaluation**: tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1.sh, tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh, tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh, tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh | Changed metric evaluation from point-in-time checks at step 200 to range-based mean calculations starting at step 181. Updated train/reward and train/loss threshold assertions to use moving-average style evaluation instead of exact step indices. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

CI:L0, Run CICD

Suggested reviewers

  • terrykong
  • yfw
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: replacing a single-step reward metric with a smoothed mean reward for VLM GRPO CLEVR convergence tests. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #1996 by implementing smoothed mean-based reward checks (range_start=181) instead of single-step checks, making convergence tests robust to per-step noise. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the four CLEVR test variants' metric evaluation logic; no unrelated modifications to training code or other components are present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | Minor changes affecting only test metric evaluation logic across 4 test scripts with 1-2 lines changed per file. PR explicitly states no functional training changes; only metric evaluation logic modified. PR description includes adequate context explaining why changes are needed and a test plan to verify the fix works. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh (1)

38-39: Avoid hardcoding the smoothing window start.

If MAX_STEPS changes, the fixed 181 will drift from the intended last-20-steps window. Consider deriving it from MAX_STEPS to keep the window consistent.

💡 Suggested refactor
-    uv run tests/check_metrics.py $JSON_METRICS \
-        'mean(data["train/loss"], range_start=181) < 0.1' \
-        'mean(data["train/reward"], range_start=181) > 0.9'
+    RANGE_START=$((MAX_STEPS-19))
+    uv run tests/check_metrics.py $JSON_METRICS \
+        "mean(data[\"train/loss\"], range_start=${RANGE_START}) < 0.1" \
+        "mean(data[\"train/reward\"], range_start=${RANGE_START}) > 0.9"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh`
around lines 38 - 39, The hardcoded smoothing window start (181) in the metric
expressions should be derived from MAX_STEPS so the "last-20-steps" window stays
correct if MAX_STEPS changes; compute a LAST_WINDOW_START (e.g., MAX_STEPS-19)
and replace occurrences like 'mean(data["train/loss"], range_start=181)' and
'mean(data["train/reward"], range_start=181)' to use that variable
(range_start=LAST_WINDOW_START) so the window is always the final 20 steps based
on MAX_STEPS.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9148186 and acad5e0.

📒 Files selected for processing (4)
  • tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1.sh
  • tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh
  • tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh
  • tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh



Development

Successfully merging this pull request may close these issues.

RL nightly test failing train/reward < 0.9

1 participant