Skip to content

[training] fix flops computation#3947

Open
jepio wants to merge 5 commits into
mainfrom
jpiotrowski/fix-flops-computation
Open

[training] fix flops computation#3947
jepio wants to merge 5 commits into
mainfrom
jpiotrowski/fix-flops-computation

Conversation

@jepio
Copy link
Copy Markdown

@jepio jepio commented May 22, 2026

What does this PR do ?

This PR fixes two discrepancies found when analyzing GPT-OSS FLOPs numbers coming from MBridge:

  • gated linear units were only being counted correctly when using SWIGLU activations. Other gated linear unit activation units were using an expansion factor of 2 rather than 3. Correct this to 3 for all GLU. For GPT-OSS 20B this is on the order of 10% of total FLOPS during a training step.
  • sliding window attention was undercounting FLOPS because we were dividing by 2. Expand the formula for the two regimes so that we correctly compute those FLOPS. This has a minimal effect but worth including for correctness sake.

Changelog

  • fix FLOPS accounting for gated linear units
  • fix FLOPS accounting for sliding window attention with small windows

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

@jepio jepio added bug Something isn't working area:training Training loop, callbacks, and runtime integration 26.06 labels May 22, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread src/megatron/bridge/training/utils/flop_utils.py Outdated
Comment thread src/megatron/bridge/training/utils/flop_utils.py
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

Review -- fix flops computation

The two fixes are directionally correct:

  • GLU fix: Removing the activation_func == F.silu guard so all gated linear units (not just SwiGLU) get expansion factor 3.
  • SWA context fix: Replacing the min(W, S) upper-bound approximation with the exact average causal SWA context, and removing the spurious /2 that was double-counting the causal mask.

1. Existing unit tests will break (blocking)
At least two tests in test_flop_utils.py compute expected values using the old formulas and will fail with this change:

  • test_swa_exact_delta (line 983) -- hardcodes core_diff = Q * (S - W), which no longer matches.
  • test_swa_no_effect_when_window_ge_seq (line 958) -- asserts flops_swa == flops_full, but the new W >= T branch uses (T+1)/2 while full_core uses T/2, producing a mismatch of Q per SWA layer.
    These tests need to be updated in this PR.

2. W >= T inconsistency (see inline comment)
When the window covers the full sequence, swa_core should equal full_core. The else branch currently uses (effective_seq_length + 1) / 2 (exact) while full_core uses core_attn_seq_factor / 2 (approximate). Suggest using core_attn_seq_factor / 2 in the else branch so the two paths agree -- this also correctly threads through the seqlen_squared_sum override.

3. Missing test for non-silu GLU (the motivating fix)
The GLU fix is the larger of the two corrections (about 10 percent of total FLOPs per the PR description), but there is no test exercising gated_linear_unit=True with a non-silu activation (e.g., F.gelu). The existing test_swiglu_scaling_factor only tests with the default F.silu. Please add a test for GLU with a non-silu activation confirming expansion factor 3.

Suggested test cases
No perf tests impacted.
Unit tests to add/update in tests/unit_tests/training/utils/test_flop_utils.py:

  • Update test_swa_exact_delta with the new SWA context/core formulas
  • Update test_swa_no_effect_when_window_ge_seq to match the aligned W >= T behavior
  • Add a test for GLU with non-silu activation (gated_linear_unit=True, activation_func=F.gelu) confirming expansion factor 3

@jepio jepio changed the title fix flops computation [training] fix flops computation May 22, 2026
@dingqingy-nv dingqingy-nv requested a review from cuichenx May 22, 2026 18:15
@jepio jepio force-pushed the jpiotrowski/fix-flops-computation branch from 99b97b1 to ad80a0a Compare May 22, 2026 18:38
jepio added 4 commits May 22, 2026 20:38
The factor 3 applies to any GLU, not just SWIGLU.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>
For models like GPT-OSS that use SWA and window size is smaller than sequence
length, the division by 2 in the current formula in MBridge is wrong and
undercounts FLOPS. Removing the division makes the approximation wrong for
models with a large window size, eg. Gemma2. To make things work correctly in
all cases expand the full formula.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@nvidia.com>
@jepio jepio force-pushed the jpiotrowski/fix-flops-computation branch from ad80a0a to 964c3c7 Compare May 22, 2026 18:38
@jepio
Copy link
Copy Markdown
Author

jepio commented May 22, 2026

Suggested test cases No perf tests impacted. Unit tests to add/update in tests/unit_tests/training/utils/test_flop_utils.py:

  • Update test_swa_exact_delta with the new SWA context/core formulas
  • Update test_swa_no_effect_when_window_ge_seq to match the aligned W >= T behavior
  • Add a test for GLU with non-silu activation (gated_linear_unit=True, activation_func=F.gelu) confirming expansion factor 3

Did all 3.

@jepio
Copy link
Copy Markdown
Author

jepio commented May 22, 2026

@claude do you want to review again

@yaoyu-33 yaoyu-33 added full-test-suite needs-review PR is ready for code review and waiting on a reviewer labels May 22, 2026
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Copy link
Copy Markdown
Contributor

/ok to test 8729946

@yaoyu-33 yaoyu-33 added area:perf Performance optimizations and benchmarking and removed area:training Training loop, callbacks, and runtime integration labels May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

26.06 area:perf Performance optimizations and benchmarking bug Something isn't working full-test-suite needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants