initial commit on the megatron bridge part #3949
Draft
pengdurice wants to merge 1 commit into
Signed-off-by: pengdurice <pengduhit@gmail.com>
What does this PR do?
Fixes a finetuning checkpoint-load bug when Megatron Bridge uses mixed precision, distributed optimizer, and optimizer CPU offload.
When model weights are loaded after optimizer construction, optimizer.reload_model_params() refreshes the normal distributed optimizer FP32 main params, but the hybrid CPU-offload optimizer also owns CPU parameter clones used by CPU sub-optimizers. Those copies can remain at random initialization and later overwrite the loaded checkpoint weights on optimizer step.
This PR explicitly synchronizes the hybrid optimizer parameter copies from the loaded model weights after model-only checkpoint loads.
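The failure mode and the fix described above can be illustrated with a small, self-contained toy. This is only a sketch: `ToyHybridOptimizer`, `sync_copies_from_model`, and `finetune_first_step` are hypothetical names invented for illustration and are not Megatron Bridge or Megatron-LM APIs. Plain Python lists stand in for tensors.

```python
# Toy stand-ins only: nothing here is a real Megatron Bridge API.

class ToyHybridOptimizer:
    """Keeps its own CPU clone of each model parameter, like a CPU-offload optimizer."""

    def __init__(self, model_params):
        self.model_params = model_params        # low-precision model weights
        self.cpu_copies = list(model_params)    # clone taken at construction time

    def sync_copies_from_model(self):
        """Hypothetical fix: refresh the clones from the current model weights."""
        self.cpu_copies = list(self.model_params)

    def step(self, grads, lr=0.1):
        """Update the CPU clones, then copy them back into the model in place."""
        self.cpu_copies = [w - lr * g for w, g in zip(self.cpu_copies, grads)]
        self.model_params[:] = self.cpu_copies


def finetune_first_step(sync_after_load):
    model_params = [0.9, -0.3]                    # stands in for random initialization
    optimizer = ToyHybridOptimizer(model_params)  # optimizer is built before the load
    model_params[:] = [1.0, 2.0]                  # model-only checkpoint load
    if sync_after_load:
        optimizer.sync_copies_from_model()
    optimizer.step(grads=[0.0, 0.0])              # zero gradients: weights should not move
    return model_params
```

Without the sync, the first step writes the stale clones back and the loaded weights are lost, even with zero gradients. With the sync, the loaded weights survive the step.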
Why is this needed?
The risky configuration is:
- bf16 or fp16
- use_distributed_optimizer=True
- optimizer_cpu_offload=True
Bridge builds the model and optimizer before loading the finetuning checkpoint. After the checkpoint load, the BF16/FP16 model params are correct, but CPU-offload optimizer copies can still contain the pre-load random initialization. The first optimizer step can then update and copy those stale values back into the model.
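The three conditions listed above can be written as a single predicate. This is a sketch under assumptions: the field names are taken from the list, but the flat `ToyConfig` dataclass and the `is_risky_config` helper are hypothetical, and the real Bridge configuration may spread these flags across different sections.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical flat config; only the field names come from the PR description."""
    bf16: bool = False
    fp16: bool = False
    use_distributed_optimizer: bool = False
    optimizer_cpu_offload: bool = False


def is_risky_config(cfg: ToyConfig) -> bool:
    """True when all three conditions from the PR description hold together."""
    mixed_precision = cfg.bf16 or cfg.fp16
    return (
        mixed_precision
        and cfg.use_distributed_optimizer
        and cfg.optimizer_cpu_offload
    )
```

For example, `is_risky_config(ToyConfig(bf16=True, use_distributed_optimizer=True, optimizer_cpu_offload=True))` is true, while dropping any one of the three flags makes it false.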
Implementation
- … (shard_fp32_from_float16_groups)
- … (gpu_params_map_cpu_copy)
- update_fp32_param_by_new_param()
- … cfg.optimizer.optimizer_cpu_offload is enabled, after the existing optimizer.reload_model_params() path.
Test plan
Passed focused regression test in a Slurm job using a NeMo container .sqsh:
The job ran:
Result: