
feat: add MegatronMIMO model conversion #3905

Open: liding-nv wants to merge 10 commits into main from liding/mimo-bridge


Conversation

@liding-nv
Contributor

@liding-nv liding-nv commented May 20, 2026

Summary

This PR adds the framework-level conversion path for MegatronMIMO:

  • import HF weights into a constructed MimoModel with per-component parallelism;
  • save/load that model as a Megatron distributed checkpoint;
  • export the MIMO checkpoint back to HuggingFace safetensors.

Dense Qwen3.5-VL is the first model family to use the framework. It adds only the standard bridge/provider metadata needed for default route construction; the conversion infrastructure itself is model-family agnostic.

This enables the workflow: download an HF VLM checkpoint → convert to MIMO → continue training/SFT/LoRA with heterogeneous component parallelism.

Why MIMO Needs Dedicated Conversion Infrastructure

The standard bridge path assumes one provider, one model, one global parallel_state, and one rank topology. MegatronMIMO breaks those assumptions: language, vision, and future modality components can use different TP/PP/DP layouts and may live on different rank sets.

This PR solves two core problems:

  1. Global distributed state. Standard bridge mappings read TP/PP/DP groups from MCore global parallel_state. MegatronMIMO intentionally does not initialize one global topology; each component owns its own ProcessGroupCollection. The conversion orchestrator therefore runs one component route at a time, temporarily bridges MCore parallel_state from that component's process groups, and attaches the same pg_collection to the target submodule. This lets existing bridge mapping/gather code run without changing standard conversion internals.

  2. Component-local naming. Standard bridges describe parameters in a single Megatron namespace, e.g. language_model.* and vision_model.*. Inside MimoModel, those are separate target submodules, so the component prefix must be stripped before dispatch. A route table maps each component to its target submodule and builds a route-local mapping registry by cloning the source bridge mappings with the component prefix removed.
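
The prefix-stripping step in item 2 can be illustrated with a minimal sketch. `ParamMapping` and `build_route_mappings` are hypothetical names introduced here for illustration, and the parameter names are made up; the real bridge mapping classes in this PR may look different.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ParamMapping:
    """Hypothetical stand-in for one bridge mapping entry."""
    megatron_param: str
    hf_param: str


def build_route_mappings(mappings, prefix):
    """Clone the mappings under `prefix`, dropping the prefix from the Megatron name."""
    return [
        replace(m, megatron_param=m.megatron_param[len(prefix):])
        for m in mappings
        if m.megatron_param.startswith(prefix)
    ]


# Parameter names below are illustrative, not taken from the real bridge.
source_mappings = [
    ParamMapping("language_model.decoder.final_layernorm.weight", "model.language_model.norm.weight"),
    ParamMapping("vision_model.patch_embed.proj.weight", "model.visual.patch_embed.proj.weight"),
]
language_route = build_route_mappings(source_mappings, "language_model.")
vision_route = build_route_mappings(source_mappings, "vision_model.")
```

Cloning rather than editing in place leaves the source registry untouched, so the standard non-MIMO path would keep seeing the prefixed names.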

The rest of the framework follows from those two constraints:

  • HF export keeps one representative rank per component before gathering to rank 0, avoiding duplicated tensors across component replicas.
  • Nested TransformerConfigs inside MIMO specs are finalized before model construction.
  • Qwen3.5-VL vision patch/deepstack mergers receive the component TP group explicitly, while the standard non-MIMO path still falls back to global parallel_state.
  • MTP is disabled for MIMO v1 because the current MimoModel schema has no MTP submodule.

Framework-Level Design

The implementation is centered on MegatronMIMOBridge, an AutoBridge subclass with MIMO-specific entry points. to_megatron_mimo_provider() builds a MegatronMIMOProvider, to_megatron_model() imports HF weights into a constructed MimoModel, and import_ckpt() / export_ckpt() provide the checkpoint-level user workflow.

Default model support is metadata-driven:

  • the standard bridge provides mimo_source_prefixes, e.g. {"language": "language_model.", "images": "vision_model."};
  • the standard provider provides modality metadata, e.g. modality_keys = {"images": "qwen_visual"}, plus build_language_model_spec() and special_token_ids;
  • the framework combines those into MIMOComponent routes and builds the provider via MegatronMIMOProvider.from_standard_provider().
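
As a rough illustration of how the two metadata dictionaries above could be combined into routes, here is a small sketch. `ComponentRoute` and `build_routes` are hypothetical; the actual MIMOComponent fields and the validation done by the framework are not shown in this description and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComponentRoute:
    """Hypothetical route record; the PR's MIMOComponent may carry different fields."""
    name: str
    source_prefix: str
    modality_key: Optional[str]  # None for the language component


def build_routes(source_prefixes, modality_keys):
    """Pair each component's source prefix with its modality key, if it has one."""
    routes = []
    for name, prefix in source_prefixes.items():
        # Assumption: every non-language component needs a modality key.
        if name != "language" and name not in modality_keys:
            raise ValueError(f"component {name!r} has a source prefix but no modality key")
        routes.append(ComponentRoute(name, prefix, modality_keys.get(name)))
    return routes


routes = build_routes(
    {"language": "language_model.", "images": "vision_model."},
    {"images": "qwen_visual"},
)
```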

For each route, the framework clones the standard bridge mapping registry, strips the component prefix, temporarily exposes the component process groups through parallel_state, and invokes the existing standard bridge weight import/export code on the target submodule. This keeps most conversion logic shared with the standard bridge path.
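
The "temporarily exposes" part is the kind of thing a context manager handles well. The sketch below uses a plain namespace object as a stand-in for global state; it is not the MCore parallel_state API, and the attribute names and `bridged_parallel_state` are assumptions made for illustration.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for a module holding global process-group state (not the real MCore module).
fake_parallel_state = SimpleNamespace(tp_group=None, pp_group=None, dp_group=None)


@contextmanager
def bridged_parallel_state(state, component_groups):
    """Temporarily expose one component's groups through `state`, then restore."""
    saved = {name: getattr(state, name) for name in component_groups}
    try:
        for name, group in component_groups.items():
            setattr(state, name, group)
        yield state
    finally:
        # Restore even if the route raised, so the next route starts clean.
        for name, group in saved.items():
            setattr(state, name, group)


# Strings stand in for process-group handles.
language_groups = {"tp_group": "language-tp", "pp_group": "language-pp", "dp_group": "language-dp"}
with bridged_parallel_state(fake_parallel_state, language_groups):
    seen_during_route = fake_parallel_state.tp_group  # what mapping code would read
restored_after_route = fake_parallel_state.tp_group
```

Restoring in a `finally` block matters because routes run one at a time: a failure in one component should not leave its groups visible to the next.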

Explicit register_mimo_conversion_spec() support remains as an escape hatch for models whose provider or route construction cannot be derived from standard metadata.

Checkpoint save/load uses the regular Megatron distributed checkpointing path with MIMO-aware module names and component process groups. Derived model specs are rebuilt from the persisted standard provider on load, so the HF import → MIMO checkpoint → HF export round trip works even when the export runs in a fresh process.

Adding another model family should usually require only standard bridge/provider metadata; the generic orchestrator should not need model-family edits.

Validation

  • Focused unit tests: MIMO conversion/provider/Qwen coverage.
  • Manual 27B regression: Qwen/Qwen3.5-27B HF → MIMO checkpoint → HF export on 8 GPUs (language=tp=4, images=tp=4). Non-MTP parity was 1184/1184 with 0 mismatches; the 15 ignored MTP keys are expected.
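
For readers who want to reproduce a parity check of this kind, the comparison can be sketched as below. This is not the script used for the numbers above: `compare_checkpoints` is a hypothetical helper, plain lists stand in for tensors, and the `"mtp."` key prefix is an assumption about how MTP keys are named.

```python
def compare_checkpoints(original, exported, ignored_prefixes=("mtp.",)):
    """Compare two name -> value dicts, skipping keys under the ignored prefixes."""
    prefixes = tuple(ignored_prefixes)
    ignored = sorted(k for k in original if k.startswith(prefixes))
    kept = sorted(k for k in original if not k.startswith(prefixes))
    # A key counts as mismatched if it is missing from the export or differs in value.
    mismatched = [k for k in kept if k not in exported or exported[k] != original[k]]
    unexpected = sorted(k for k in exported if k not in original)
    return {
        "compared": len(kept),
        "mismatched": mismatched,
        "ignored": ignored,
        "unexpected": unexpected,
    }


# Toy data: lists stand in for tensors; key names are illustrative.
original = {
    "model.norm.weight": [1.0, 2.0],
    "model.embed.weight": [0.5],
    "mtp.fc.weight": [3.0],
}
exported = {
    "model.norm.weight": [1.0, 2.0],
    "model.embed.weight": [0.5],
}
report = compare_checkpoints(original, exported)
```

With real checkpoints the value comparison would use a tensor equality check instead of `!=` on lists.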

Not included in this PR: an L0/L1 functional test. The current conversion path needs multiple GPUs and a real HF model; a synthetic 2-GPU functional test can be added as follow-up.

Limitations

  • Only dense Qwen3.5-VL is supported in this PR.
  • MTP is disabled for MIMO v1.
  • MoE Qwen3.5-VL variants are not supported yet.
  • Checkpoint resharding across colocated ↔ non-colocated placement changes is not guaranteed.
  • Only model weights round-trip; optimizer, scheduler, and RNG state are out of scope.

User Surface

Concrete import/export examples are provided as a script:

bash examples/megatron_mimo/qwen35_vl_conversion.sh

@copy-pr-bot

copy-pr-bot Bot commented May 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@liding-nv
Contributor Author

/ok to test 9ce2fea

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv liding-nv force-pushed the liding/mimo-bridge branch from 9ce2fea to ce528bf May 20, 2026 15:26
@liding-nv
Contributor Author

/ok to test ce528bf

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv
Contributor Author

/ok to test 617c404

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv
Contributor Author

/ok to test 6408e9b

liding-nv added 2 commits May 21, 2026 20:50
Signed-off-by: Li Ding <liding@nvidia.com>
Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv
Contributor Author

/ok to test acbd4cd

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv
Contributor Author

/ok to test 0979433

Comment on lines +154 to +165
f"--component {name!r}: ranks per component ({ranks_per_component}) "
f"is not divisible by total_model_parallel_size ({mp}); "
f"specify dp=N explicitly."
)
parallelism.data_parallel_size = ranks_per_component // mp
parallelism.rank_offset = offset
# Whether auto-filled or user-supplied, advance offset by this
# component's ranks so the next auto-filled component lands after it.
offset += parallelism.total_ranks


def run_import(
Contributor


Minor: _auto_fill_layout is order-dependent when mixing user-supplied dp= and auto-filled components. If a user-supplied component (which keeps rank_offset=0 by default) appears after an auto-filled component, both can end up at the same rank_offset, causing overlapping rank assignments. This only happens with mixed user/auto layouts and depends on dict iteration order, so it may never surface in practice, but it is worth a comment or a post-loop overlap check to future-proof.
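
A post-loop overlap check along these lines could look like the sketch below. `Layout` and `find_overlaps` are hypothetical names; only `rank_offset` and `total_ranks` come from the snippet above. It also assumes a non-colocated layout in which component rank ranges are meant to be disjoint.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Layout:
    """Hypothetical subset of the per-component parallelism record."""
    rank_offset: int
    total_ranks: int


def find_overlaps(layouts):
    """Return pairs of component names whose [offset, offset + total_ranks) ranges intersect."""
    overlaps = []
    for (a, la), (b, lb) in combinations(layouts.items(), 2):
        a_end = la.rank_offset + la.total_ranks
        b_end = lb.rank_offset + lb.total_ranks
        # Two half-open ranges intersect when each starts before the other ends.
        if la.rank_offset < b_end and lb.rank_offset < a_end:
            overlaps.append((a, b))
    return overlaps


# Auto-filled "language" lands at offset 0; user-supplied "images" keeps its default offset 0.
colliding = {"language": Layout(0, 4), "images": Layout(0, 4)}
disjoint = {"language": Layout(0, 4), "images": Layout(4, 4)}
collisions = find_overlaps(colliding)
no_collisions = find_overlaps(disjoint)
```

Raising an error when the returned list is non-empty would turn the silent overlap into an explicit failure at layout time.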

@claude
Contributor

claude Bot commented May 22, 2026

Light Code Review

Overall this is a well-structured PR with comprehensive unit test coverage (131 test functions across 10 new/modified test files).

Bug fix in state.py (correctness): The save_generator fix (replacing keys_for_file with tensors_to_save.keys()) is a genuine correctness fix -- the old code could produce a model.safetensors.index.json that maps keys to files that do not contain them. No dedicated regression test for this specific bug path was added; consider adding one if practical.
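
A regression test for this could rest on a simple invariant: every key in the index's weight map must be present in the file it points to. The sketch below shows that invariant with plain dicts; `build_weight_map` and `stale_entries` are hypothetical helpers, not the functions in state.py, and the file names are illustrative.

```python
def build_weight_map(shards):
    """Build tensor name -> file name from what was actually written to each shard."""
    weight_map = {}
    for filename, tensors_to_save in shards.items():
        for key in tensors_to_save.keys():
            weight_map[key] = filename
    return weight_map


def stale_entries(weight_map, shards):
    """Return keys whose indexed file does not contain them."""
    return sorted(k for k, f in weight_map.items() if k not in shards.get(f, {}))


# Integers stand in for tensors.
shards = {
    "model-00001-of-00002.safetensors": {"a.weight": 1},
    "model-00002-of-00002.safetensors": {"b.weight": 2},
}
good_index = build_weight_map(shards)

# An index built from a planned key list that disagrees with what was written.
bad_index = {
    "a.weight": "model-00001-of-00002.safetensors",
    "b.weight": "model-00001-of-00002.safetensors",
}
good_stale = stale_entries(good_index, shards)
bad_stale = stale_entries(bad_index, shards)
```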

_auto_fill_layout ordering dependence (minor): In examples/conversion/convert_megatron_mimo.py, when mixing user-supplied dp= components with auto-filled ones, the auto-fill is iteration-order-dependent. A user-supplied component keeps its default rank_offset=0 and is not repositioned, so it can overlap an auto-filled component that was placed earlier at the same offset. See inline comment.

_rebuild_derived_specs_from_standard_provider mutates in place: Setting standard_provider.mtp_num_layers = None mutates the provider object in place. If the same provider instance is used in a non-MIMO path elsewhere, MTP would be silently disabled. Documented as intentional for MIMO v1, but worth noting.
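
If the in-place mutation ever becomes a problem, one option is to disable MTP on a copy instead. The sketch below is only an illustration: `StandardProvider` is a hypothetical stand-in with a single field, and `provider_for_mimo` is not a function from this PR.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandardProvider:
    """Hypothetical stand-in; only the field relevant to this comment."""
    mtp_num_layers: Optional[int] = 1


def provider_for_mimo(provider):
    """Return a copy with MTP disabled, leaving the caller's provider untouched."""
    # Shallow copy: nested objects, if any, would still be shared with the original.
    mimo_provider = copy.copy(provider)
    mimo_provider.mtp_num_layers = None
    return mimo_provider


shared_provider = StandardProvider(mtp_num_layers=1)
mimo_provider = provider_for_mimo(shared_provider)
```

A shallow copy is enough for a scalar field like this one; anything that mutates nested configs would need `copy.deepcopy` or an explicit rebuild.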

Missing test for _bridge_parallel_state_globals: The build_model.py helper sets about 10 parallel_state globals but has no direct unit test. A targeted test would catch regressions if the mcore global names change.

The PR adds excellent unit test coverage for CLI parsing, route validation, orchestrator import/export/streaming, Qwen3.5-VL default conversion, MIMO provider factory, TransformerConfig finalization, and a functional checkpoint roundtrip test.

Suggested test cases: No perf tests impacted.

@yaoyu-33 yaoyu-33 added labels May 22, 2026:

  • area:ckpt: Checkpoint conversion, loading, export, and save paths
  • feature: New capabilities, enhancements, or enablement work
  • full-test-suite
  • high-complexity: Harder to merge: prone to conflicts and needs additional test coverage
  • needs-review: PR is ready for code review and waiting on a reviewer


3 participants