initial commit on the megatron bridge part #3949

Draft
pengdurice wants to merge 1 commit into NVIDIA-NeMo:main from pengdurice:peng-optimizer-random-weight-v1

@pengdurice

What does this PR do?

Fixes a finetuning checkpoint-load bug when Megatron Bridge uses mixed precision, distributed optimizer, and optimizer CPU offload.

When model weights are loaded after optimizer construction, optimizer.reload_model_params() refreshes the normal distributed optimizer FP32 main params, but the hybrid CPU-offload optimizer also owns CPU parameter clones used by CPU sub-optimizers. Those copies can remain at random initialization and later overwrite the loaded checkpoint weights on optimizer step.

This PR explicitly synchronizes the hybrid optimizer parameter copies from the loaded model weights after model-only checkpoint loads.
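
The failure mode described above can be reproduced in miniature without Megatron. The following is a toy sketch in plain Python, not the Bridge or Megatron-Core code: `ToyOffloadOptimizer`, its `cpu_copy` attribute, and `sync_from_model()` are hypothetical stand-ins for an optimizer that clones parameters at construction time and writes the clones back into the model on `step()`.

```python
class ToyOffloadOptimizer:
    """Hypothetical stand-in for an optimizer that keeps its own parameter clone."""

    def __init__(self, model_params, lr=0.1):
        self.model_params = model_params    # same list object the "model" uses
        self.cpu_copy = list(model_params)  # clone taken at construction time
        self.lr = lr

    def sync_from_model(self):
        # The kind of fix this PR describes: refresh the clone from the loaded weights.
        self.cpu_copy[:] = self.model_params

    def step(self, grads):
        # Update the clone, then copy the result back into the model.
        for i, g in enumerate(grads):
            self.cpu_copy[i] -= self.lr * g
        self.model_params[:] = self.cpu_copy


def finetune_one_step(sync_after_load):
    model = [0.37, -0.91]              # stands in for random initialization
    opt = ToyOffloadOptimizer(model)   # optimizer is built before the load
    model[:] = [1.0, 2.0]              # model-only checkpoint load
    if sync_after_load:
        opt.sync_from_model()
    opt.step(grads=[0.0, 0.0])         # zero gradients: weights should not move
    return model


print(finetune_one_step(sync_after_load=False))  # loaded weights are overwritten
print(finetune_one_step(sync_after_load=True))   # loaded weights survive the step
```

Even with zero gradients, the unsynchronized run ends with the initialization values in the model, which is the symptom this PR targets.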

Why is this needed?

The risky configuration is:

  • bf16 or fp16
  • use_distributed_optimizer=True
  • optimizer_cpu_offload=True
  • finetuning / pretrained checkpoint load without loading optimizer state

Bridge builds the model and optimizer before loading the finetuning checkpoint. After the checkpoint load, the BF16/FP16 model params are correct, but CPU-offload optimizer copies can still contain the pre-load random initialization. The first optimizer step can then update and copy those stale values back into the model.
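
As a rough illustration, the four conditions listed above can be written as a single predicate. This is a sketch under assumptions: `ToyConfig` is a hypothetical flat config (the real Bridge config is nested, for example `cfg.optimizer.optimizer_cpu_offload`), and `loads_optimizer_state` is an invented field name standing in for "optimizer state is restored from the checkpoint".

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical flat config used only for this illustration."""
    bf16: bool = False
    fp16: bool = False
    use_distributed_optimizer: bool = False
    optimizer_cpu_offload: bool = False
    loads_optimizer_state: bool = False  # invented name, not a Bridge field


def is_exposed_to_stale_copies(cfg: ToyConfig) -> bool:
    """Return True when all four risk conditions hold at once."""
    mixed_precision = cfg.bf16 or cfg.fp16
    return (
        mixed_precision
        and cfg.use_distributed_optimizer
        and cfg.optimizer_cpu_offload
        and not cfg.loads_optimizer_state
    )


risky = ToyConfig(bf16=True, use_distributed_optimizer=True, optimizer_cpu_offload=True)
print(is_exposed_to_stale_copies(risky))
```

This predicate is only a reading aid for the list above. As described under Implementation, the PR itself gates the new helper on `cfg.optimizer.optimizer_cpu_offload` alone.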

Implementation

  • Adds a Bridge-side helper to refresh HybridDeviceOptimizer copies after model-only checkpoint loads.
  • Synchronizes loaded model params into:
    • distributed optimizer FP32 shards (shard_fp32_from_float16_groups)
    • CPU offload clones (gpu_params_map_cpu_copy)
    • HybridDeviceOptimizer FP32 copies via update_fp32_param_by_new_param()
  • Calls the helper only when cfg.optimizer.optimizer_cpu_offload is enabled, after the existing optimizer.reload_model_params() path.
  • Adds focused unit coverage for both the hybrid sync path and the non-hybrid no-op path.
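
The shape of such a helper, including the non-hybrid no-op path, might look roughly like the sketch below. It is not the code in this PR. Only the names `gpu_params_map_cpu_copy` and `update_fp32_param_by_new_param` come from the description above; the assumed dict layout (model param to CPU clone), the no-argument method signature, the toy classes, and the function name `sync_param_copies_sketch` are all assumptions made for illustration. The distributed optimizer FP32 shard step is omitted.

```python
class ToyParam:
    """Hypothetical stand-in for a tensor-backed parameter."""

    def __init__(self, values):
        self.data = list(values)


class ToyHybridOptimizer:
    """Toy optimizer that owns CPU clones and FP32 copies of the model params."""

    def __init__(self, model_params):
        # Assumed layout: model param -> CPU clone of that param.
        self.gpu_params_map_cpu_copy = {p: ToyParam(p.data) for p in model_params}
        self.fp32_copies = []
        self.update_fp32_param_by_new_param()

    def update_fp32_param_by_new_param(self):
        # Assumed behavior: rebuild the FP32 copies from the current CPU clones.
        self.fp32_copies = [list(c.data) for c in self.gpu_params_map_cpu_copy.values()]


class ToyPlainOptimizer:
    """Optimizer without CPU offload clones."""


def sync_param_copies_sketch(optimizer):
    """Refresh optimizer-owned copies from model params; no-op for other optimizers."""
    cpu_copies = getattr(optimizer, "gpu_params_map_cpu_copy", None)
    if cpu_copies is None:
        return False  # non-hybrid path: nothing to synchronize
    for model_param, cpu_copy in cpu_copies.items():
        cpu_copy.data[:] = model_param.data
    optimizer.update_fp32_param_by_new_param()
    return True


param = ToyParam([0.37, -0.91])              # random initialization
hybrid = ToyHybridOptimizer([param])         # clones taken before the load
param.data[:] = [1.0, 2.0]                   # model-only checkpoint load
print(sync_param_copies_sketch(hybrid), hybrid.fp32_copies)
print(sync_param_copies_sketch(ToyPlainOptimizer()))
```

Detecting the hybrid optimizer by attribute lookup keeps the sketch free of imports; a real helper would more likely check the optimizer type directly.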

Test plan

The focused regression test passed in a Slurm job using a NeMo container image (.sqsh):

IMAGE=/fsx/peng/containers/megatron-bridge-nemo-26.04-nvrx.sqsh \
  sbatch slurm_scripts/run_checkpointing_sync_test.slurm

The job ran:

python3 -m pytest tests/unit_tests/training/test_checkpointing.py \
  -k sync_hybrid_optimizer_param_copies -q

Result:

2 passed, 117 deselected, 33 warnings in 7.05s

Signed-off-by: pengdurice <pengduhit@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
