[training] fix flops computation by jepio · Pull Request #3947 · NVIDIA-NeMo/Megatron-Bridge

jepio · 2026-05-22T15:15:29Z

What does this PR do ?

This PR fixes two discrepancies found when analyzing GPT-OSS FLOPs numbers coming from MBridge:

gated linear units were only being counted correctly when using SWIGLU activations. Other gated linear unit activation units were using an expansion factor of 2 rather than 3. Correct this to 3 for all GLU. For GPT-OSS 20B this is on the order of 10% of total FLOPS during a training step.
sliding window attention was undercounting FLOPS because we were dividing by 2. Expand the formula for the two regimes so that we correctly compute those FLOPS. This has a minimal effect but worth including for correctness sake.

Changelog

fix FLOPS accounting for gated linear units
fix FLOPS accounting for sliding window attention with small windows

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

copy-pr-bot · 2026-05-22T15:15:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

claude · 2026-05-22T15:26:25Z

Review -- fix flops computation

The two fixes are directionally correct:

GLU fix: Removing the activation_func == F.silu guard so all gated linear units (not just SwiGLU) get expansion factor 3.
SWA context fix: Replacing the min(W, S) upper-bound approximation with the exact average causal SWA context, and removing the spurious /2 that was double-counting the causal mask.

1. Existing unit tests will break (blocking)
At least two tests in test_flop_utils.py compute expected values using the old formulas and will fail with this change:

test_swa_exact_delta (line 983) -- hardcodes core_diff = Q * (S - W), which no longer matches.
test_swa_no_effect_when_window_ge_seq (line 958) -- asserts flops_swa == flops_full, but the new W >= T branch uses (T+1)/2 while full_core uses T/2, producing a mismatch of Q per SWA layer.
These tests need to be updated in this PR.

2. W >= T inconsistency (see inline comment)
When the window covers the full sequence, swa_core should equal full_core. The else branch currently uses (effective_seq_length + 1) / 2 (exact) while full_core uses core_attn_seq_factor / 2 (approximate). Suggest using core_attn_seq_factor / 2 in the else branch so the two paths agree -- this also correctly threads through the seqlen_squared_sum override.

3. Missing test for non-silu GLU (the motivating fix)
The GLU fix is the larger of the two corrections (about 10 percent of total FLOPs per the PR description), but there is no test exercising gated_linear_unit=True with a non-silu activation (e.g., F.gelu). The existing test_swiglu_scaling_factor only tests with the default F.silu. Please add a test for GLU with a non-silu activation confirming expansion factor 3.

Suggested test cases
No perf tests impacted.
Unit tests to add/update in tests/unit_tests/training/utils/test_flop_utils.py:

Update test_swa_exact_delta with the new SWA context/core formulas
Update test_swa_no_effect_when_window_ge_seq to match the aligned W >= T behavior
Add a test for GLU with non-silu activation (gated_linear_unit=True, activation_func=F.gelu) confirming expansion factor 3

The factor 3 applies to any GLU, not just SWIGLU. Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>

For models like GPT-OSS that use SWA and window size is smaller than sequence length, the division by 2 in the current formula in MBridge is wrong and undercounts FLOPS. Removing the division makes the approximation wrong for models with a large window size, eg. Gemma2. To make things work correctly in all cases expand the full formula. Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>

Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>

jepio · 2026-05-22T18:40:44Z

Suggested test cases No perf tests impacted. Unit tests to add/update in tests/unit_tests/training/utils/test_flop_utils.py:

Update test_swa_exact_delta with the new SWA context/core formulas

Update test_swa_no_effect_when_window_ge_seq to match the aligned W >= T behavior

Add a test for GLU with non-silu activation (gated_linear_unit=True, activation_func=F.gelu) confirming expansion factor 3

Did all 3.

jepio · 2026-05-22T18:41:06Z

@claude do you want to review again

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-05-22T21:22:55Z

/ok to test 8729946

jepio added bug Something isn't working area:training Training loop, callbacks, and runtime integration 26.06 labels May 22, 2026

claude Bot reviewed May 22, 2026

View reviewed changes

Comment thread src/megatron/bridge/training/utils/flop_utils.py Outdated

claude Bot reviewed May 22, 2026

View reviewed changes

Comment thread src/megatron/bridge/training/utils/flop_utils.py

jepio changed the title ~~fix flops computation~~ [training] fix flops computation May 22, 2026

dingqingy-nv requested a review from cuichenx May 22, 2026 18:15

jepio force-pushed the jpiotrowski/fix-flops-computation branch from 99b97b1 to ad80a0a Compare May 22, 2026 18:38

jepio added 4 commits May 22, 2026 20:38

[training] Fix FLOPS calculation for gated linear units

5ae9397

The factor 3 applies to any GLU, not just SWIGLU. Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>

[training,test] Update SWA test for more accurate formula

7fcc595

Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>

[training,test] Add a test case for non-SWIGLU GLU

964c3c7

Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>

jepio force-pushed the jpiotrowski/fix-flops-computation branch from ad80a0a to 964c3c7 Compare May 22, 2026 18:38

yaoyu-33 added full-test-suite needs-review PR is ready for code review and waiting on a reviewer labels May 22, 2026

[training] revert accidental mcore bump

8729946

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

copy-pr-bot Bot temporarily deployed to public May 22, 2026 21:23 Inactive

copy-pr-bot Bot temporarily deployed to test May 22, 2026 21:23 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 21:31 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 21:46 Inactive

yaoyu-33 added area:perf Performance optimizations and benchmarking and removed area:training Training loop, callbacks, and runtime integration labels May 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[training] fix flops computation#3947

[training] fix flops computation#3947
jepio wants to merge 5 commits into
mainfrom
jpiotrowski/fix-flops-computation

jepio commented May 22, 2026

Uh oh!

copy-pr-bot Bot commented May 22, 2026

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 22, 2026

Uh oh!

jepio commented May 22, 2026

Uh oh!

jepio commented May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jepio commented May 22, 2026

What does this PR do ?

Changelog

GitHub Actions CI

Before your PR is "Ready for review"

Additional Information

Uh oh!

copy-pr-bot Bot commented May 22, 2026

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 22, 2026

Review -- fix flops computation

Uh oh!

jepio commented May 22, 2026

Uh oh!

jepio commented May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants