
fix: use smoothed reward metric for VLM GRPO CLEVR convergence tests#2015

Open
aroshanghias-nvd wants to merge 1 commit into NVIDIA-NeMo:main from aroshanghias-nvd:aroshanghias/fix-vlm-grpo-flaky-metric-check

Conversation


@aroshanghias-nvd aroshanghias-nvd commented Feb 23, 2026

Summary

Fixes #1996

The nightly GB200 convergence test vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1 was intermittently failing because the metric check evaluated train/reward at a single step (200), which is subject to per-step training noise.

Root Cause

Analysis of the CI logs from pipeline 44432479 showed:

| Metric | H100 (1n8g, passed) | GB200 (1n4g, failed) |
| --- | --- | --- |
| Reward at step 200 | 0.9500 | 0.9000 |
| Mean reward (steps 151-200) | 0.942 | 0.954 |
| Std reward (steps 151-200) | 0.036 | 0.043 |

The GB200 run actually converged better on average but happened to land on a noise dip at step 200 (reward = 0.9000, failing the > 0.9 check), while step 199 measured 0.9813.
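To make the noise argument concrete, here is a hypothetical simulation (not project code; the distribution parameters are taken from the GB200 numbers above) comparing how often a single-step check fails versus a 20-step mean check under Gaussian per-step noise:

```python
# Hypothetical illustration: with per-step reward noise of std ~0.043
# around a converged mean of 0.954, a check at a single step fails far
# more often than a check on the mean of the last 20 steps.
import random
import statistics

random.seed(0)
MEAN, STD, THRESHOLD, WINDOW, TRIALS = 0.954, 0.043, 0.9, 20, 10_000

single_fail = window_fail = 0
for _ in range(TRIALS):
    rewards = [random.gauss(MEAN, STD) for _ in range(WINDOW)]
    if rewards[-1] <= THRESHOLD:               # single-step check at the last step
        single_fail += 1
    if statistics.mean(rewards) <= THRESHOLD:  # smoothed check over the window
        window_fail += 1

print(f"single-step failure rate: {single_fail / TRIALS:.3f}")
print(f"windowed-mean failure rate: {window_fail / TRIALS:.3f}")
```

Averaging over 20 steps shrinks the noise std by roughly a factor of sqrt(20), which is why the windowed check is effectively immune to dips like the one at step 200.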

Reward comparison from issue #1996:


Fix

Replace the single-step metric checks with a smoothed mean over the last 20 steps (181-200), using the existing mean() helper in check_metrics.py. The change is applied to all four VLM GRPO CLEVR test variants:

  • vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1.sh (the failing test)
  • vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh (H100 counterpart)
  • vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh (Megatron GB200)
  • vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh (Megatron H100)

Before:

'data["train/reward"]["200"] > 0.9'

After:

'mean(data["train/reward"], range_start=181) > 0.9'
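For illustration, here is a minimal sketch of what a mean() helper with a range_start parameter could look like; the actual implementation in check_metrics.py may differ, and the metrics dict below is fabricated:

```python
# Hypothetical sketch of a mean() helper like the one used by
# check_metrics.py; the real implementation may differ. Metric values
# are assumed to be keyed by step number as strings, as in the checks.
def mean(metric, range_start=0, range_end=None):
    """Average the metric values whose step falls in [range_start, range_end]."""
    values = [
        value
        for step, value in metric.items()
        if int(step) >= range_start and (range_end is None or int(step) <= range_end)
    ]
    return sum(values) / len(values)


# Fabricated data: a converged run with a noise dip at the final step.
data = {"train/reward": {str(step): 0.95 for step in range(181, 201)}}
data["train/reward"]["200"] = 0.90

# The old single-step check fails on the dip...
print(data["train/reward"]["200"] > 0.9)
# ...while the smoothed check still passes.
print(mean(data["train/reward"], range_start=181) > 0.9)
```

This mirrors the failure mode described above: the step-200 dip fails the point check, while the windowed mean comfortably clears the threshold.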

Test plan

  • Nightly GB200 convergence test passes consistently
  • Nightly H100 convergence test continues to pass
  • No functional change to training — only the metric evaluation logic is changed

The nightly GB200 test vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1
was failing because the metric check evaluated train/reward at a single step (200),
which is subject to per-step noise (std ~0.04 in the converged region). The GB200
run actually converged well (mean=0.954, std=0.043 over steps 151-200) but happened
to land on a noise dip at step 200 (reward=0.9000, failing > 0.9), while step 199
was 0.9813.

Replace single-step checks with smoothed mean over the last 20 steps (181-200) for
all four VLM GRPO CLEVR test variants (dtensor and megatron, 4g and 8g).

Fixes NVIDIA-NeMo#1996

Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>
@aroshanghias-nvd aroshanghias-nvd requested a review from a team as a code owner February 23, 2026 21:19

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Four test scripts in the VLM GRPO test suite were modified to change how metric evaluation is performed. The changes replace single-step threshold checks (at step 200) with moving-average style checks that evaluate rewards and losses over a range starting from step 181, addressing test failures when metrics don't meet thresholds at specific checkpoints.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **VLM GRPO Test Suite Metric Evaluation**: tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1.sh, tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh, tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh, tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh | Changed metric evaluation from point-in-time checks at step 200 to range-based mean calculations starting at step 181. Updated train/reward and train/loss threshold assertions to use moving-average style evaluation instead of exact step indices. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

CI:L0, Run CICD

Suggested reviewers

  • terrykong
  • yfw
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: replacing a single-step reward metric with a smoothed mean reward for VLM GRPO CLEVR convergence tests. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #1996 by implementing smoothed mean-based reward checks (range_start=181) instead of single-step checks, making convergence tests robust to per-step noise. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the four CLEVR test variants' metric evaluation logic; no unrelated modifications to training code or other components are present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | Minor changes affecting only test metric evaluation logic across 4 test scripts with 1-2 lines changed per file. PR explicitly states no functional training changes; only metric evaluation logic modified. PR description includes adequate context explaining why changes are needed and a test plan to verify the fix works. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh (1)

38-39: Avoid hardcoding the smoothing window start.

If MAX_STEPS changes, the fixed 181 will drift from the intended last-20-steps window. Consider deriving it from MAX_STEPS to keep the window consistent.

💡 Suggested refactor
-    uv run tests/check_metrics.py $JSON_METRICS \
-        'mean(data["train/loss"], range_start=181) < 0.1' \
-        'mean(data["train/reward"], range_start=181) > 0.9'
+    RANGE_START=$((MAX_STEPS-19))
+    uv run tests/check_metrics.py $JSON_METRICS \
+        "mean(data[\"train/loss\"], range_start=${RANGE_START}) < 0.1" \
+        "mean(data[\"train/reward\"], range_start=${RANGE_START}) > 0.9"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh`
around lines 38 - 39, The hardcoded smoothing window start (181) in the metric
expressions should be derived from MAX_STEPS so the "last-20-steps" window stays
correct if MAX_STEPS changes; compute a LAST_WINDOW_START (e.g., MAX_STEPS-19)
and replace occurrences like 'mean(data["train/loss"], range_start=181)' and
'mean(data["train/reward"], range_start=181)' to use that variable
(range_start=LAST_WINDOW_START) so the window is always the final 20 steps based
on MAX_STEPS.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9148186 and acad5e0.

📒 Files selected for processing (4)
  • tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-dtensor2tp1.v1.sh
  • tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh
  • tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh
  • tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh



Development

Successfully merging this pull request may close these issues.

RL nightly test failing train/reward < 0.9

1 participant