initial commit on the megatron bridge part by pengdurice · Pull Request #3949 · NVIDIA-NeMo/Megatron-Bridge

pengdurice · 2026-05-22T17:07:03Z

What does this PR do?

Fixes a finetuning checkpoint-load bug that occurs when Megatron Bridge combines mixed precision, the distributed optimizer, and optimizer CPU offload.

When model weights are loaded after optimizer construction, optimizer.reload_model_params() refreshes the FP32 main params of the regular distributed optimizer. The hybrid CPU-offload optimizer, however, also owns CPU parameter clones used by its CPU sub-optimizers. Those clones can remain at their random initialization and overwrite the loaded checkpoint weights on a later optimizer step.

This PR explicitly synchronizes the hybrid optimizer parameter copies from the loaded model weights after model-only checkpoint loads.

Why is this needed?

The risky configuration is:

- bf16 or fp16
- use_distributed_optimizer=True
- optimizer_cpu_offload=True
- finetuning / pretrained checkpoint load without loading optimizer state

Bridge builds the model and optimizer before loading the finetuning checkpoint. After the checkpoint load, the BF16/FP16 model params are correct, but CPU-offload optimizer copies can still contain the pre-load random initialization. The first optimizer step can then update and copy those stale values back into the model.
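The failure mode described above can be reproduced in miniature. The sketch below is a toy model only: plain Python lists stand in for tensors, and `ToyOffloadOptimizer`, `sync_from_model`, and `run` are hypothetical names invented for illustration, not Megatron Bridge or Megatron-LM APIs. Zero gradients are used so that the only visible effect is the copy-back of the optimizer-owned clone.

```python
# Toy model of the failure mode; plain Python lists stand in for tensors.
# All names here are hypothetical and are not Megatron Bridge APIs.

class ToyOffloadOptimizer:
    """Keeps its own copy of the weights, as a CPU-offload optimizer would."""

    def __init__(self, model_params):
        self.model_params = model_params      # the "GPU" weights (shared list)
        self.cpu_copy = list(model_params)    # clone taken at construction time

    def sync_from_model(self):
        # Refresh the optimizer-owned clone from the current model weights.
        self.cpu_copy[:] = self.model_params

    def step(self, grads, lr=0.1):
        # Update the clone, then copy it back into the model.
        for i, g in enumerate(grads):
            self.cpu_copy[i] -= lr * g
        self.model_params[:] = self.cpu_copy


def run(sync_after_load):
    model = [0.5, -0.5]                 # stand-in for random initialization
    opt = ToyOffloadOptimizer(model)    # optimizer is built before the load
    model[:] = [1.0, 2.0]               # model-only checkpoint load
    if sync_after_load:
        opt.sync_from_model()
    opt.step(grads=[0.0, 0.0])          # zero gradients isolate the copy-back
    return model


stale = run(sync_after_load=False)   # clone still holds the random init
fixed = run(sync_after_load=True)    # clone was refreshed after the load
```

Without the sync, the step writes the pre-load values back over the loaded weights; with it, the loaded weights survive the step.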

Implementation

- Adds a Bridge-side helper to refresh HybridDeviceOptimizer copies after model-only checkpoint loads.
- Synchronizes loaded model params into:
  - distributed optimizer FP32 shards (shard_fp32_from_float16_groups)
  - CPU offload clones (gpu_params_map_cpu_copy)
  - HybridDeviceOptimizer FP32 copies via update_fp32_param_by_new_param()
- Calls the helper only when cfg.optimizer.optimizer_cpu_offload is enabled, after the existing optimizer.reload_model_params() path.
- Adds focused unit coverage for both the hybrid sync path and the non-hybrid no-op path.
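For readers without the diff at hand, the shape of such a helper might look roughly like the sketch below. It is not the PR's code. `FakeTensor`, `FakeHybridOptimizer`, and `sync_hybrid_param_copies` are hypothetical stand-ins, and the layout of `gpu_params_map_cpu_copy` (a mapping from each model param to its CPU clone) is an assumption based only on the attribute names listed above. The distributed optimizer FP32 shard refresh (`shard_fp32_from_float16_groups`) is omitted for brevity.

```python
# Sketch only: stand-in classes model the shapes assumed from the PR text.
# None of this is the PR's code or the real Megatron classes.

class FakeTensor:
    """Minimal stand-in for a tensor that supports in-place copy."""

    def __init__(self, values):
        self.values = list(values)

    def copy_(self, other):
        self.values[:] = other.values
        return self


class FakeHybridOptimizer:
    """Stand-in exposing attribute names mentioned in the PR description."""

    def __init__(self, params):
        # Assumed layout: maps each model param to its CPU clone.
        self.gpu_params_map_cpu_copy = {p: FakeTensor(p.values) for p in params}
        self.fp32_refreshed = False

    def update_fp32_param_by_new_param(self):
        # The real method refreshes FP32 copies; here we only record the call.
        self.fp32_refreshed = True


def sync_hybrid_param_copies(optimizer):
    """Hypothetical helper: refresh optimizer-owned copies from model params.

    Returns False (a no-op) when the optimizer has no CPU clone map.
    """
    cpu_map = getattr(optimizer, "gpu_params_map_cpu_copy", None)
    if cpu_map is None:
        return False
    for model_param, cpu_copy in cpu_map.items():
        cpu_copy.copy_(model_param)
    optimizer.update_fp32_param_by_new_param()
    return True


param = FakeTensor([0.5, -0.5])             # stand-in for random initialization
opt = FakeHybridOptimizer([param])          # optimizer is built before the load
param.values[:] = [1.0, 2.0]                # model-only checkpoint load
synced = sync_hybrid_param_copies(opt)      # hybrid path: clones are refreshed
noop = sync_hybrid_param_copies(object())   # non-hybrid path: nothing to do
```

The two calls at the end mirror the two cases the unit tests are described as covering: the hybrid sync path and the non-hybrid no-op path.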

Test plan

The focused regression test passed in a Slurm job using a NeMo container image (.sqsh):

IMAGE=/fsx/peng/containers/megatron-bridge-nemo-26.04-nvrx.sqsh \
  sbatch slurm_scripts/run_checkpointing_sync_test.slurm

The job ran:

python3 -m pytest tests/unit_tests/training/test_checkpointing.py \
  -k sync_hybrid_optimizer_param_copies -q

Result:

2 passed, 117 deselected, 33 warnings in 7.05s

Signed-off-by: pengdurice <pengduhit@gmail.com>

copy-pr-bot · 2026-05-22T17:07:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

initial commit on the megatron bridge part

b5c9302

Signed-off-by: pengdurice <pengduhit@gmail.com>

github-actions Bot added the community-request label May 22, 2026

pengdurice wants to merge 1 commit into NVIDIA-NeMo:main from pengdurice:peng-optimizer-random-weight-v1
