[training] fix: validate MXFP8 param gather overlap #3920

Open: Glitchfix wants to merge 1 commit into NVIDIA-NeMo:main from Glitchfix:fix/mxfp8-param-gather-overlap-validation

Glitchfix commented May 21, 2026

What does this PR do?

Adds config validation so that the non-FSDP MXFP8 parameter-gather path requires ddp.overlap_param_gather=True whenever reuse_grad_buf_for_mxfp8_param_ag=True.
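As a rough illustration of the rule described above, the check could look something like the sketch below. This is not the code in src/megatron/bridge/training/config.py: `SketchConfig`, `use_fsdp`, and `validate_mxfp8_param_gather` are hypothetical names, and the real fields live on nested config objects rather than one flat dataclass. Skipping the check on the FSDP path is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical flat stand-in for the relevant config fields."""

    overlap_param_gather: bool = False  # stands in for ddp.overlap_param_gather
    reuse_grad_buf_for_mxfp8_param_ag: bool = False
    use_fsdp: bool = False  # hypothetical flag name for the FSDP path


def validate_mxfp8_param_gather(cfg: SketchConfig) -> None:
    """Raise ValueError for the unsafe non-FSDP combination (sketch only)."""
    if cfg.use_fsdp:
        # Assumption: the FSDP path is handled elsewhere, so it is skipped here.
        return
    if cfg.reuse_grad_buf_for_mxfp8_param_ag and not cfg.overlap_param_gather:
        raise ValueError(
            "reuse_grad_buf_for_mxfp8_param_ag=True requires "
            "ddp.overlap_param_gather=True when FSDP is not used"
        )
```

With this sketch, `SketchConfig(reuse_grad_buf_for_mxfp8_param_ag=True)` fails validation, while also setting `overlap_param_gather=True` passes.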

Changelog

  • Require ddp.overlap_param_gather=True with non-FSDP MXFP8 gradient-buffer reuse.
  • Extend config validation coverage for the invalid overlap-disabled case and the valid overlap-enabled case.

GitHub Actions CI

Local validation:

  • uvx pre-commit run --files src/megatron/bridge/training/config.py tests/unit_tests/training/test_config.py
  • PYTHONPATH=src:3rdparty/Megatron-LM python -m pytest tests/unit_tests/training/test_config.py::TestConfigContainerValidation::test_reuse_grad_buf_for_mxfp8_param_ag_required_without_fsdp

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation? Not needed for this validation-only fix.
  • Does the PR affect components that are optional to install? No.

Additional Information

- Added an MXFP8 param-gather validation so that non-FSDP configs which enable gradient-buffer reuse without overlapping parameter gather fail during config validation.

- Updated the existing config unit test to cover the unsafe overlap-disabled path and the valid overlap-enabled path.

- Kept FSDP handling unchanged because Bridge already disables the unsupported reuse path there.
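
The three cases listed above (overlap disabled, overlap enabled, FSDP) could be exercised along these lines. This is a hedged sketch rather than the test in tests/unit_tests/training/test_config.py: `is_rejected` is a hypothetical predicate standing in for the real validation, and treating the FSDP case as "not rejected by this check" is an assumption based on the description.

```python
import unittest


def is_rejected(reuse_grad_buf: bool, overlap_param_gather: bool, use_fsdp: bool) -> bool:
    """Hypothetical predicate: True when this sketch would reject the config."""
    if use_fsdp:
        # Assumption: FSDP is handled by separate logic, so this check does not fire.
        return False
    return reuse_grad_buf and not overlap_param_gather


class TestMxfp8ParamGatherSketch(unittest.TestCase):
    def test_overlap_disabled_is_rejected(self):
        # Non-FSDP, buffer reuse on, overlap off: the unsafe case.
        self.assertTrue(is_rejected(True, False, False))

    def test_overlap_enabled_is_accepted(self):
        # Non-FSDP, buffer reuse on, overlap on: the valid case.
        self.assertFalse(is_rejected(True, True, False))

    def test_fsdp_is_left_alone(self):
        # FSDP path: this particular check is assumed not to apply.
        self.assertFalse(is_rejected(True, False, True))
```

Keeping the unsafe and valid cases in separate test methods makes it clear which combination regressed if the validation logic changes later.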

Signed-off-by: Shivanjan Chakravorty <shivanjanc@nvidia.com>

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Glitchfix (Author) commented

Requested labels per CONTRIBUTING.md: bug, area:training, area:quant, needs-review. I attempted to apply them directly, but this account does not have label permissions on the upstream repository.

Glitchfix changed the title from "[training, quant] fix: require overlap for MXFP8 param gather" to "[training, quant] fix: validate MXFP8 param gather overlap" on May 21, 2026
yaoyu-33 added the labels area:training, bug, community-request, and needs-review on May 21, 2026
Glitchfix changed the title from "[training, quant] fix: validate MXFP8 param gather overlap" to "[training] fix: validate MXFP8 param gather overlap" on May 21, 2026