fix(export): correct unified_export_megatron at EP > 1 and DP > 1#1631
fix(export): correct unified_export_megatron at EP > 1 and DP > 1#1631yueshen2016 wants to merge 1 commit into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughCoordinates Megatron EP ranks for safe export: adds distributed imports, gates single-writer sidecar and quant config writes with barriers, documents EP semantics, computes local-to-global expert mappings with validation, and gathers per-expert state across EP ranks into the exported _state_dict. ChangesDistributed Expert Export for Expert-Model-Parallel Layouts
Sequence DiagramsequenceDiagram
participant Exporter
participant EP_Rank
participant EP_Group
participant Filesystem
Exporter->>EP_Rank: compute ep_rank/ep_size and local_expert_indices
EP_Rank->>EP_Group: all_reduce(MAX) on local missing-key flags
EP_Rank->>EP_Rank: build per-local-expert tensors on CPU and torch.save -> BytesIO
EP_Rank->>EP_Group: all_gather_object(serialized_bytes)
EP_Rank->>EP_Rank: torch.load gathered bytes and merge into _state_dict
EP_Rank->>Filesystem: gated single-writer writes (config.json, sidecars, hf_quant_config.json) by DP0+EP0 in last PP main
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 408-428: The file-writing for the last model stage currently lets
every last-stage main rank perform shared side-effect operations (copying
sidecars, writing config.json / hf_quant_config.json) which races when DP > 1;
hoist the single-writer guard so only the last-stage main rank with
get_data_parallel_rank() == 0 performs all shared writes. Concretely, wrap the
entire last-stage sidecar/config writing sequence (the blocks that copy
sidecars, write hf_quant_config.json and config.json and any uses of
save_directory) inside a single conditional: if is_last_stage_main_rank and
get_data_parallel_rank() == 0 and self._hf_quant_config (as needed), and
surround it with torch.distributed.barrier() calls before and after so all PP/DP
ranks wait for the single writer; keep convert_hf_quant_config_format and
json.dump usage unchanged but called only from that guarded writer. Ensure no
per-rank writes to the same files occur outside this guarded block.
- Around line 1175-1190: The torch.load call that reconstructs peer-gathered
expert states currently passes weights_only=False; change it to
weights_only=True when reloading the byte stream produced from
local_expert_state (dict[str, torch.Tensor]) in the block that gathers
gathered_bytes (the torch.distributed.all_gather_object usage and subsequent
loop over gathered_bytes), i.e., update the torch.load(...) invocation that
feeds self._state_dict.update(s) to use weights_only=True; if you truly need
weights_only=False instead, add an inline comment justifying why this
peer-provided payload is trusted and include the required security-review
exception in the PR description.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d25eb08a-5d03-4923-af06-927bcd4372fb
📒 Files selected for processing (1)
modelopt/torch/export/unified_export_megatron.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1631 +/- ##
==========================================
+ Coverage 77.09% 77.36% +0.27%
==========================================
Files 482 482
Lines 52961 52997 +36
==========================================
+ Hits 40830 41003 +173
+ Misses 12131 11994 -137
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
35467a8 to
460ff39
Compare
|
Force-pushed 1. Previous gate ( 3. Without this, the writer rank would emit an Concerns 2 ( |
6beb3cb to
0083a75
Compare
|
Force-pushed Responses to inline review comments@coderabbitai inline #1 (line 432, Major — hoist single-writer guard)Applied. You were right that gating only the
@coderabbitai inline #2 (line 1214, Critical —
|
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
# List CodeRabbit's review comments on PR `#1631` to find comment IDs to resolve
gh api repos/NVIDIA/Model-Optimizer/pulls/1631/comments --jq '.[] | select(.user.login == "coderabbitai") | {id: .id, body: .body[0:120]}'Length of output: 472 🏁 Script executed: #!/bin/bash
# Try fetching review comments via the reviews API
gh api repos/NVIDIA/Model-Optimizer/pulls/1631/reviews --jq '.[] | select(.user.login == "coderabbitai") | {id: .id, state: .state, body: .body[0:100]}'Length of output: 458
To summarize: all three actionable items have been satisfactorily addressed in
The deferred items (defensive 🐇 ✨ ✏️ Learnings added
🧠 Learnings used |
Two related bugs surface when exporting Mamba-MoE Megatron-Core
checkpoints (e.g., Nemotron 3 Ultra) to HuggingFace with non-trivial
expert / data parallelism.
1. **TEGroupedMLP expert-id collision at EP > 1**. `_grouped_mlp_slicing`
iterates `range(num_experts)`, but `num_experts == module.num_gemms`
is the *local* expert count on each EP rank. Using local ids 0..N-1
for the saved HF key prefix means all EP ranks emit the same key
set (`experts.0..N-1.*`), and the last writer wins for each layer —
at EP=8 this silently loses 7/8 of every MoE layer's experts.
This change:
- Reads the authoritative local-to-global mapping from
`module.local_expert_indices` (or
`module.experts.local_expert_indices`) when exposed; falls back to
the standard Megatron contiguous layout `[ep_rank*N,
(ep_rank+1)*N - 1]`.
- Saves each expert under its *global* id so the key sets are
disjoint across EP ranks.
- Adds a collective-safe missing-key check via `all_reduce(MAX)`
over a CUDA-resident int32 flag — CPU tensors trip
`RuntimeError: No backend type associated with device type cpu`
on the NCCL-backed EP process group.
- Builds the per-rank state dict locally, then byte-stream
all-gathers it across the EP group via `torch.save` →
`all_gather_object` → `torch.load`. (`all_gather_object` pickling
fails on quantized uint8 weights because their `UntypedStorage`
has no `dtype` attr; the byte-stream round-trip uses PyTorch's
own tensor codec instead.)
- Pre-moves shared scales / aux tensors to CPU once so the gather
payload doesn't repeatedly clone GPU tensors.
- Records `_per_layer_quant_config` for **all global expert ids**
(0..num_experts*ep_size - 1) on every rank, not just the local
slice — `_per_layer_quant_config` is later serialized into
`hf_quant_config.json`, and if each EP rank only recorded its
local 1/EP slice the writer rank's dict would miss (EP-1)/EP of
the routed-expert entries. Within a single TEGroupedMLP layer all
routed experts share the same qformat / block_size by
construction, so it's safe to reuse the local
qformat/block_size for the global record.
When EP == 1 the gather is skipped and behavior is identical to
the original loop modulo the global-id rename (no-op there).
2. **config.json write race at DP > 1 or EP > 1**. The post-save block
that injects `quantization_config` into `config.json` had no DP/EP
rank gating. With DP > 1 or EP > 1, multiple ranks satisfy
`is_last_stage_main_rank` simultaneously (one per DP×EP cell on
the last PP stage with TP rank 0), all of them read-then-write the
same file, and any pair can interleave such that another rank
reads a truncated mid-write file and raises `JSONDecodeError`.
Guarded with
`is_last_stage_main_rank AND get_data_parallel_rank() == 0 AND
get_expert_model_parallel_rank() == 0`, bracketed with
`torch.distributed.barrier()` so every other rank waits for the
single writer.
The new imports (`get_data_parallel_rank`,
`get_expert_model_parallel_*`) and `io` are added at the module level.
Verified end-to-end on Nemotron 3 Ultra (108-layer Mamba-MoE,
512 routed experts) exported with PP=9 TP=1 EP=8 → all 512 experts
per layer present in the HF shards, vLLM 0.21 loads the checkpoint
via the `modelopt_mixed` path and serves real generations.
The TODO note is intentionally left for a follow-up: the
`SequentialMLP` path (around `local_experts.linear_fc{1,2}` in
`_get_transformer_layer_state_dict`) has the same local-id-collision
issue and will need an equivalent global-id + EP-gather treatment
when MoE recipes start exercising that spec.
Signed-off-by: James Shen <yueshen@nvidia.com>
0083a75 to
46b5c8c
Compare
|
Force-pushed The check failed on a single Locally verified |
Summary
Two related bugs surface when exporting a Megatron-Core Mamba-MoE checkpoint (e.g., Nemotron 3 Ultra) to HuggingFace with non-trivial expert and data parallelism.
1. TEGroupedMLP expert-id collision at EP > 1
_grouped_mlp_slicingiteratesrange(num_experts), butnum_experts == module.num_gemmsis the local expert count on each EP rank. Using local ids0..N-1for the saved HF key prefix means every EP rank emits the same key set (experts.0..N-1.*), and the last writer wins for each layer — at EP=8 this silently loses 7/8 of every MoE layer's experts.The fix:
module.local_expert_indices(ormodule.experts.local_expert_indices) when exposed; falls back to the standard Megatron contiguous layout[ep_rank*N, (ep_rank+1)*N − 1].all_reduce(MAX)over a CUDA-resident int32 flag — CPU tensors tripRuntimeError: No backend type associated with device type cpuon the NCCL-backed EP process group.torch.save→all_gather_object→torch.load.all_gather_objectpickling fails on quantized uint8 weights because theirUntypedStoragehas nodtypeattr; the byte-stream round-trip uses PyTorch's own tensor codec to sidestep this.When EP == 1 the gather is skipped and behavior is identical to the original loop modulo the global-id rename (a no-op when local_expert_indices is the trivial 0..N-1).
2.
config.jsonrace at DP > 1The post-save block that injects
quantization_configintoconfig.jsonhad no DP-rank gating. With DP > 1 there are multiple last-stage main ranks; all of them read-then-write the same file, and any interleaving can leave another rank reading a truncated file, raisingJSONDecodeError. Guarded withis_last_stage_main_rank AND get_data_parallel_rank() == 0and bracketed withtorch.distributed.barrier()s so every PP/DP rank waits for the single writer.Two new imports (
get_data_parallel_rank,get_expert_model_parallel_*) and anioimport are added at the module level.Repro
Without this patch, exporting Nemotron 3 Ultra (108-layer Mamba-MoE hybrid, 512 routed experts) at
--pp 9 --tp 1 --ep 8produces a HuggingFace checkpoint where only 64 of 512 experts per MoE layer are present, and theconfig.jsonwrite races to a JSONDecodeError on DP=8. With this patch both succeed and vLLM loads the resulting checkpoint and serves real generations.Test plan
model.safetensors.index.jsonand confirmed all 512 experts per MoE layer are present in the HF shards.vllm serveand verified end-to-end generation (noKeyErroron weight loading, real model output).Follow-up
The
SequentialMLPpath in_get_transformer_layer_state_dict(iteratinglocal_experts.linear_fc{1,2}) has the same local-id-collision issue and will need an equivalent global-id + EP-gather treatment when MoE recipes start exercising that spec. Not addressed here because the recipes in current Megatron-Bridge mostly useTEGroupedMLP.🤖 Generated with Claude Code
Signed-off-by: James Shen yueshen@nvidia.com
Summary by CodeRabbit
Bug Fixes
Improvements