feat: add MegatronMIMO model conversion by liding-nv · Pull Request #3905 · NVIDIA-NeMo/Megatron-Bridge

liding-nv · 2026-05-20T15:15:43Z

Summary

This PR adds the framework-level conversion path for MegatronMIMO:

import HF weights into a constructed MimoModel with per-component parallelism;
save/load that model as a Megatron distributed checkpoint;
export the MIMO checkpoint back to HuggingFace safetensors.

Dense Qwen3.5-VL is the first model family to use the framework. It adds only the standard bridge/provider metadata needed for default route construction; the conversion infrastructure itself is model-family agnostic.

This enables the workflow: download an HF VLM checkpoint → convert to MIMO → continue training/SFT/LoRA with heterogeneous component parallelism.

Why MIMO Needs Dedicated Conversion Infrastructure

The standard bridge path assumes one provider, one model, one global parallel_state, and one rank topology. MegatronMIMO breaks those assumptions: language, vision, and future modality components can use different TP/PP/DP layouts and may live on different rank sets.

There are two core problems this PR solves:

Global distributed state. Standard bridge mappings read TP/PP/DP groups from MCore global parallel_state. MegatronMIMO intentionally does not initialize one global topology; each component owns its own ProcessGroupCollection. The conversion orchestrator therefore runs one component route at a time, temporarily bridges MCore parallel_state from that component's process groups, and attaches the same pg_collection to the target submodule. This lets existing bridge mapping/gather code run without changing standard conversion internals.
Component-local naming. Standard bridges describe parameters in a single Megatron namespace, e.g. language_model.* and vision_model.*. Inside MimoModel, those are separate target submodules, so the component prefix must be stripped before dispatch. A route table maps each component to its target submodule and builds a route-local mapping registry by cloning the source bridge mappings with the component prefix removed.
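
The component-local naming step can be illustrated with a minimal sketch. It is not the PR's implementation: `ParamMapping`, `build_route_local_mappings`, and the HF parameter names are hypothetical stand-ins, and real bridge mappings carry more state than a pair of names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ParamMapping:
    """Hypothetical stand-in for a bridge mapping (Megatron name -> HF name)."""
    megatron_param: str
    hf_param: str


def build_route_local_mappings(mappings, component_prefix):
    """Clone the mappings owned by one component with its prefix removed.

    Mappings that belong to other components are skipped, and the source
    mappings are left untouched because `replace` returns a new object.
    """
    local = []
    for m in mappings:
        if m.megatron_param.startswith(component_prefix):
            stripped = m.megatron_param[len(component_prefix):]
            local.append(replace(m, megatron_param=stripped))
    return local


# Illustrative names only; they are not taken from the PR.
source = [
    ParamMapping("language_model.decoder.final_layernorm.weight", "hf.language.norm.weight"),
    ParamMapping("vision_model.patch_embed.proj.weight", "hf.visual.patch_embed.proj.weight"),
]
language_local = build_route_local_mappings(source, "language_model.")
```

Cloning rather than mutating keeps the standard bridge registry usable for the non-MIMO path.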

The rest of the framework follows from those two constraints:

HF export keeps one representative rank per component before gathering to rank 0, avoiding duplicated tensors across component replicas.
Nested TransformerConfigs inside MIMO specs are finalized before model construction.
Qwen3.5-VL vision patch/deepstack mergers receive the component TP group explicitly, while the standard non-MIMO path still falls back to global parallel_state.
MTP is disabled for MIMO v1 because the current MimoModel schema has no MTP submodule.
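
The representative-rank rule for HF export can be sketched as below. This is a simplified illustration that assumes every rank in a component already holds that component's full tensors after the component-local gather. The rank layout and the choice of the lowest rank as representative are assumptions, not details from the PR.

```python
def representative_ranks(component_ranks):
    """Pick one rank per component (here: the lowest) to forward tensors."""
    return {name: min(ranks) for name, ranks in component_ranks.items()}


def should_contribute(rank, component, reps):
    """True only for the single rank that speaks for `component`."""
    return reps[component] == rank


# Hypothetical non-colocated layout: 4 ranks per component.
layout = {"language": [0, 1, 2, 3], "images": [4, 5, 6, 7]}
reps = representative_ranks(layout)
contributors = [
    (name, rank)
    for name, ranks in layout.items()
    for rank in ranks
    if should_contribute(rank, name, reps)
]
```

With this rule, rank 0 receives each component's tensors exactly once, regardless of how many replicas hold them.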

Framework-Level Design

The implementation is centered on MegatronMIMOBridge, an AutoBridge subclass with MIMO-specific entry points. to_megatron_mimo_provider() builds a MegatronMIMOProvider, to_megatron_model() imports HF weights into a constructed MimoModel, and import_ckpt() / export_ckpt() provide the checkpoint-level user workflow.

Default model support is metadata-driven:

the standard bridge provides mimo_source_prefixes, e.g. {"language": "language_model.", "images": "vision_model."};
the standard provider provides modality metadata, e.g. modality_keys = {"images": "qwen_visual"}, plus build_language_model_spec() and special_token_ids;
the framework combines those into MIMOComponent routes and builds the provider via MegatronMIMOProvider.from_standard_provider().
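
As a rough sketch of how the two metadata sources could be combined into routes, consider the following. `ComponentRoute` and `build_routes` are hypothetical; the PR's MIMOComponent may carry different fields, and only the two example dictionaries come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComponentRoute:
    """Hypothetical route record; not the PR's MIMOComponent definition."""
    name: str
    source_prefix: str
    modality_key: Optional[str]  # None for components with no modality entry


def build_routes(mimo_source_prefixes, modality_keys):
    """Join bridge prefixes with provider modality metadata, one route per component."""
    return [
        ComponentRoute(name, prefix, modality_keys.get(name))
        for name, prefix in mimo_source_prefixes.items()
    ]


routes = build_routes(
    {"language": "language_model.", "images": "vision_model."},
    {"images": "qwen_visual"},
)
```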

For each route, the framework clones the standard bridge mapping registry, strips the component prefix, temporarily exposes the component process groups through parallel_state, and invokes the existing standard bridge weight import/export code on the target submodule. This keeps most conversion logic shared with the standard bridge path.
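
The "temporarily expose, then restore" pattern can be sketched with a context manager. `FakeParallelState` is a stand-in object and deliberately does not use the MCore parallel_state API; the group values are placeholder strings rather than real process groups.

```python
from contextlib import contextmanager


class FakeParallelState:
    """Stand-in for a module-level global; NOT the MCore parallel_state API."""

    def __init__(self):
        self.tp_group = None
        self.pp_group = None


GLOBAL_STATE = FakeParallelState()


@contextmanager
def bridged_parallel_state(state, tp_group, pp_group):
    """Expose one component's groups globally, restoring the old values on exit."""
    saved = (state.tp_group, state.pp_group)
    state.tp_group, state.pp_group = tp_group, pp_group
    try:
        yield state
    finally:
        # Runs even if conversion raises, so no component's groups leak out.
        state.tp_group, state.pp_group = saved


component_groups = {
    "language": ("lang_tp", "lang_pp"),
    "images": ("img_tp", "img_pp"),
}
seen = []
for name, (tp, pp) in component_groups.items():
    with bridged_parallel_state(GLOBAL_STATE, tp, pp):
        # Existing bridge import/export code would run here for this route.
        seen.append((name, GLOBAL_STATE.tp_group))
```

Running one route at a time inside the context is what lets code that reads global state see a consistent topology for the component it is converting.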

Explicit register_mimo_conversion_spec() support remains as an escape hatch for models whose provider or route construction cannot be derived from standard metadata.

Checkpoint save/load uses the regular Megatron distributed checkpointing path with MIMO-aware module names and component process groups. Derived model specs are rebuilt from the persisted standard provider on load, so HF import -> MIMO checkpoint -> fresh export works across processes.

Adding another model family should usually require only standard bridge/provider metadata; the generic orchestrator should not need model-family edits.

Validation

Focused unit tests: MIMO conversion/provider/Qwen coverage.
Manual 27B regression: Qwen/Qwen3.5-27B HF → MIMO checkpoint → HF export on 8 GPUs (language=tp=4, images=tp=4), non-MTP parity 1184/1184, 0 mismatches. Expected ignored MTP keys: 15.
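
A parity check of this kind could look roughly like the sketch below. It is not the script used for the regression above: the key names and the "mtp." prefix are hypothetical, and tensors are represented as plain lists to keep the example dependency-free (a real check would compare tensors, for example with torch.equal).

```python
def compare_checkpoints(original, exported, ignored_prefixes=()):
    """Count exact matches between two name -> tensor maps.

    Keys in `original` that start with an ignored prefix are reported
    separately rather than counted as mismatches.
    """
    matched, mismatched, ignored = 0, [], []
    for key, value in original.items():
        if key.startswith(tuple(ignored_prefixes)):
            ignored.append(key)
        elif key in exported and exported[key] == value:
            matched += 1
        else:
            mismatched.append(key)
    return matched, mismatched, ignored


# Toy data with hypothetical key names.
original = {
    "model.norm.weight": [1.0, 2.0],
    "model.embed.weight": [0.5],
    "mtp.layers.0.weight": [3.0],
}
exported = {
    "model.norm.weight": [1.0, 2.0],
    "model.embed.weight": [0.5],
}
matched, mismatched, ignored = compare_checkpoints(
    original, exported, ignored_prefixes=("mtp.",)
)
```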

Not included in this PR: an L0/L1 functional test. The current conversion path needs multiple GPUs and a real HF model; a synthetic 2-GPU functional test can be added as follow-up.

Limitations

Only dense Qwen3.5-VL is supported in this PR.
MTP is disabled for MIMO v1.
MoE Qwen3.5-VL variants are not supported yet.
Checkpoint resharding across colocated ↔ non-colocated placement changes is not guaranteed.
Only model weights round-trip; optimizer, scheduler, and RNG state are out of scope.

User Surface

The concrete import/export examples are in the following script:

bash examples/megatron_mimo/qwen35_vl_conversion.sh

liding-nv · 2026-05-20T15:16:12Z

/ok to test 9ce2fea