[training, model] fix: Adapt MCore main bump compatibility by yaoyu-33 · Pull Request #3944 · NVIDIA-NeMo/Megatron-Bridge

yaoyu-33 · 2026-05-22T14:17:16Z

Original bump PR

Original bump PR: chore(beep boop 🤖): Bump uv.lock (main, mcore-main) (2026-05-22) #3933
Target: main / mcore-main
Bump head mirrored from: a2503d4abf668ef04dd0ad7fa9a3e1b6b44a12c6
MCore range: 38986a98aae6a0cc4c8ae7b435db3288a890b0cb -> 4c63602607966297f7daef31a2d6a9bd2dc50f65

Failure classification

MCore broke Bridge.

Root cause

The 2026-05-22 mcore-main bump includes MCore PR #4103, which changed get_batch_on_this_cp_rank to require is_hybrid_cp. Existing Bridge GPT, LLaVA, Qwen3-VL, DGPT, and VLM CP slicing call sites used the older signature.

The same bump also updates MCore transformer module construction to pass a name keyword into attention modules. Bridge WAN's DiT attention subclasses did not accept that keyword.

Fix summary

Added a narrow get_batch_on_this_cp_rank_compat wrapper that adapts to pre- and post-MLM #4103 signatures.
Routed affected Bridge batch slicing call sites through the wrapper with is_hybrid_cp=False.
Updated DiT self/cross attention constructors to accept name and pass it through to MCore attention.
Updated focused unit-test mocks to match the compatibility keyword surface.

Guards

Added one local compatibility guard in src/megatron/bridge/utils/common_utils.py.

Removal TODO:

# TODO: remove this guard when Bridge no longer supports Megatron-LM commits before PR #4103.

No stale guards were removed.

Validation

Local:

python3 -m py_compile src/megatron/bridge/utils/common_utils.py src/megatron/bridge/training/gpt_step.py src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py src/megatron/bridge/training/llava_step.py src/megatron/bridge/diffusion/models/common/dgpt_step.py src/megatron/bridge/diffusion/models/common/dit_attention.py tests/unit_tests/diffusion/model/nemotron_labs_diffusion/test_dgpt_step.py tests/unit_tests/utils/test_slice_batch_for_cp.py - passed.
uv run python -m pytest ... on this local host was blocked because nvidia-resiliency-ext==0.6.0 has no wheel for the host manylinux_2_31_x86_64 platform.

CW interactive partition, 2026-05-22 America/Los_Angeles:

Command: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:30:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3933 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && uv run python -m pytest tests/unit_tests/utils/test_slice_batch_for_cp.py tests/unit_tests/diffusion/model/nemotron_labs_diffusion/test_dgpt_step.py tests/unit_tests/training/test_vlm_step.py tests/unit_tests/training/test_audio_lm_step.py tests/unit_tests/diffusion/model/wan/test_wan_layer_spec.py tests/unit_tests/diffusion/model/wan/test_wan_provider.py -q"
Result: 69 passed, 33 warnings in 3.74s.

Pre-commit:

Command: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:45:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3933 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && uv run pre-commit run --all-files"
Result: passed.

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>

copy-pr-bot · 2026-05-22T14:17:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 · 2026-05-22T14:17:31Z

/ok to test 036b3e5

claude · 2026-05-22T14:22:21Z

Light Code Review

Overall: Clean compatibility fix. The signature-introspection approach in get_batch_on_this_cp_rank_compat is a solid pattern for bridging old/new MCore signatures, and all Bridge call sites are consistently routed through the wrapper.

Observations

Test mocks in test_vlm_step.py and test_audio_lm_step.py not updated - These files still use lambda x: x mocks for megatron.core.utils.get_batch_on_this_cp_rank. This works today because the compat wrapper introspects the lambda, sees it only accepts x, and strips all keyword arguments before calling. However, the mocks in test_slice_batch_for_cp.py and test_dgpt_step.py were updated to accept the new kwargs. Consider updating the remaining mocks for consistency, even though the current code handles them correctly.
No dedicated unit test for get_batch_on_this_cp_rank_compat - The wrapper is tested indirectly through existing tests, but a small focused test exercising both the old-style (no is_hybrid_cp parameter) and new-style (with is_hybrid_cp) function signatures would help catch regressions if the introspection logic ever changes.
OLMoESelfAttention absorbs name via kwargs but does not forward it to super().init() (olmoe_provider.py:109-118) - Pre-existing issue, not introduced by this PR, but worth noting since it means OLMoE modules will not get MCore new name-tracking feature.

Suggested test cases

No perf tests impacted.

yaoyu-33 · 2026-05-22T16:24:10Z

+    return frozenset(parameters)
+
+
+def get_batch_on_this_cp_rank_compat(


get_batch_on_this_cp_rank_compat, no need this, just support new api/set of signiture.

Removed the Bridge compatibility wrapper and updated the affected call sites to use the new MCore get_batch_on_this_cp_rank API directly with is_hybrid_cp=False and the appropriate CP group.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-05-22T16:30:27Z

/ok to test 3f8e933

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-05-22T18:30:35Z

/ok to test bbe7e59

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-05-22T21:11:13Z

/ok to test a961e93

dimapihtar and others added 2 commits May 22, 2026 06:25

chore(beep boop 🤖): Bump uv.lock (main, mcore-main) (2026-05-22)

a2503d4

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

fix: adapt mcore cp batch and wan constructors

036b3e5

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>

yaoyu-33 mentioned this pull request May 22, 2026

chore(beep boop 🤖): Bump uv.lock (main, mcore-main) (2026-05-22) #3933

Closed

copy-pr-bot Bot temporarily deployed to public May 22, 2026 14:18 Inactive

copy-pr-bot Bot temporarily deployed to test May 22, 2026 14:18 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 14:26 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 14:27 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 14:42 Inactive

yaoyu-33 added area:training Training loop, callbacks, and runtime integration bug Something isn't working full-test-suite needs-review PR is ready for code review and waiting on a reviewer labels May 22, 2026

yaoyu-33 commented May 22, 2026

View reviewed changes

[training] fix CP batch API call sites

3f8e933

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

copy-pr-bot Bot temporarily deployed to public May 22, 2026 16:30 Inactive

copy-pr-bot Bot temporarily deployed to test May 22, 2026 16:31 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 16:41 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 16:56 Inactive

asolergi-nv previously approved these changes May 22, 2026

View reviewed changes

yaoyu-33 enabled auto-merge (squash) May 22, 2026 18:05

[training] fix CP batch unit mocks

bbe7e59

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 dismissed asolergi-nv’s stale review via bbe7e59 May 22, 2026 18:30

copy-pr-bot Bot temporarily deployed to public May 22, 2026 18:31 Inactive

copy-pr-bot Bot temporarily deployed to test May 22, 2026 18:31 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 18:38 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 18:39 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 18:54 Inactive

[model] fix MTP recompute padding mask compat

a961e93

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

copy-pr-bot Bot temporarily deployed to public May 22, 2026 21:11 Inactive

copy-pr-bot Bot temporarily deployed to test May 22, 2026 21:12 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 21:22 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 21:36 Inactive

cuichenx approved these changes May 22, 2026

View reviewed changes

cuichenx disabled auto-merge May 23, 2026 00:23

cuichenx merged commit 92deeef into main May 23, 2026
160 of 161 checks passed

cuichenx deleted the yuya/mcore-main-autofix-20260522-pr3933 branch May 23, 2026 00:23

asolergi-nv mentioned this pull request May 25, 2026

Fix MLM!4103 #3931

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[training, model] fix: Adapt MCore main bump compatibility#3944

[training, model] fix: Adapt MCore main bump compatibility#3944
cuichenx merged 5 commits into
mainfrom
yuya/mcore-main-autofix-20260522-pr3933

yaoyu-33 commented May 22, 2026

Uh oh!

copy-pr-bot Bot commented May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

claude Bot commented May 22, 2026

Uh oh!

yaoyu-33 May 22, 2026

Uh oh!

yaoyu-33 May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		return frozenset(parameters)


		def get_batch_on_this_cp_rank_compat(

Conversation

yaoyu-33 commented May 22, 2026

Original bump PR

Failure classification

Root cause

Fix summary

Guards

Validation

Uh oh!

copy-pr-bot Bot commented May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

claude Bot commented May 22, 2026

Light Code Review

Uh oh!

yaoyu-33 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

yaoyu-33 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants