[training] fix: validate MXFP8 param gather overlap #3920
Open
Glitchfix wants to merge 1 commit into
- Added an MXFP8 param-gather validation so non-FSDP configs fail before enabling gradient-buffer reuse without overlapping parameter gather.
- Updated the existing config unit test to cover the unsafe overlap-disabled path and the valid overlap-enabled path.
- Kept FSDP handling unchanged because Bridge already disables the unsupported reuse path there.

Signed-off-by: Shivanjan Chakravorty <shivanjanc@nvidia.com>
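The rule described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `GatherSettings` dataclass, its `use_fsdp` field, and the `validate_mxfp8_param_gather` function are hypothetical names, and the real config in Bridge is structured differently. Only `overlap_param_gather` and `reuse_grad_buf_for_mxfp8_param_ag` are taken from the PR text.

```python
from dataclasses import dataclass


@dataclass
class GatherSettings:
    """Hypothetical flat stand-in for the relevant training config fields."""

    overlap_param_gather: bool = False
    reuse_grad_buf_for_mxfp8_param_ag: bool = False
    use_fsdp: bool = False  # hypothetical flag; the real FSDP switch may differ


def validate_mxfp8_param_gather(cfg: GatherSettings) -> None:
    """Reject gradient-buffer reuse without overlapped param gather (non-FSDP only)."""
    if cfg.use_fsdp:
        # FSDP handling is left unchanged: per the PR, the unsupported
        # reuse path is already disabled elsewhere for FSDP.
        return
    if cfg.reuse_grad_buf_for_mxfp8_param_ag and not cfg.overlap_param_gather:
        raise ValueError(
            "reuse_grad_buf_for_mxfp8_param_ag=True requires "
            "overlap_param_gather=True when FSDP is not used"
        )


# Valid path: reuse enabled together with overlapped parameter gather.
validate_mxfp8_param_gather(
    GatherSettings(overlap_param_gather=True, reuse_grad_buf_for_mxfp8_param_ag=True)
)

# Unsafe path: reuse enabled without overlap is rejected up front.
try:
    validate_mxfp8_param_gather(
        GatherSettings(reuse_grad_buf_for_mxfp8_param_ag=True)
    )
except ValueError as err:
    print(f"rejected: {err}")
```

Failing at validation time, rather than later during training, mirrors the two cases the updated unit test is described as covering: the overlap-disabled path raises and the overlap-enabled path passes.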
Requested labels per CONTRIBUTING.md:
What does this PR do?
Adds validation so the non-FSDP MXFP8 parameter-gather path requires ddp.overlap_param_gather=True when reuse_grad_buf_for_mxfp8_param_ag=True is set.
Changelog
- Require ddp.overlap_param_gather=True with non-FSDP MXFP8 gradient-buffer reuse.
GitHub Actions CI
Local validation:
uvx pre-commit run --files src/megatron/bridge/training/config.py tests/unit_tests/training/test_config.py
PYTHONPATH=src:3rdparty/Megatron-LM python -m pytest tests/unit_tests/training/test_config.py::TestConfigContainerValidation::test_reuse_grad_buf_for_mxfp8_param_ag_required_without_fsdp
Before your PR is "Ready for review"
Pre checks:
Additional Information