Skip to content

[training, model] fix: Adapt MCore main bump compatibility#3944

Merged
cuichenx merged 5 commits into
mainfrom
yuya/mcore-main-autofix-20260522-pr3933
May 23, 2026
Merged

[training, model] fix: Adapt MCore main bump compatibility#3944
cuichenx merged 5 commits into
mainfrom
yuya/mcore-main-autofix-20260522-pr3933

Conversation

@yaoyu-33
Copy link
Copy Markdown
Contributor

Original bump PR

Failure classification

MCore broke Bridge.

Root cause

The 2026-05-22 mcore-main bump includes MCore PR #4103, which changed get_batch_on_this_cp_rank to require is_hybrid_cp. Existing Bridge GPT, LLaVA, Qwen3-VL, DGPT, and VLM CP slicing call sites used the older signature.

The same bump also updates MCore transformer module construction to pass a name keyword into attention modules. Bridge WAN's DiT attention subclasses did not accept that keyword.

Fix summary

  • Added a narrow get_batch_on_this_cp_rank_compat wrapper that adapts to pre- and post-MLM #4103 signatures.
  • Routed affected Bridge batch slicing call sites through the wrapper with is_hybrid_cp=False.
  • Updated DiT self/cross attention constructors to accept name and pass it through to MCore attention.
  • Updated focused unit-test mocks to match the compatibility keyword surface.

Guards

Added one local compatibility guard in src/megatron/bridge/utils/common_utils.py.

Removal TODO:

# TODO: remove this guard when Bridge no longer supports Megatron-LM commits before PR #4103.

No stale guards were removed.

Validation

Local:

  • python3 -m py_compile src/megatron/bridge/utils/common_utils.py src/megatron/bridge/training/gpt_step.py src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py src/megatron/bridge/training/llava_step.py src/megatron/bridge/diffusion/models/common/dgpt_step.py src/megatron/bridge/diffusion/models/common/dit_attention.py tests/unit_tests/diffusion/model/nemotron_labs_diffusion/test_dgpt_step.py tests/unit_tests/utils/test_slice_batch_for_cp.py - passed.
  • uv run python -m pytest ... on this local host was blocked because nvidia-resiliency-ext==0.6.0 has no wheel for the host manylinux_2_31_x86_64 platform.

CW interactive partition, 2026-05-22 America/Los_Angeles:

  • Command: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:30:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3933 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && uv run python -m pytest tests/unit_tests/utils/test_slice_batch_for_cp.py tests/unit_tests/diffusion/model/nemotron_labs_diffusion/test_dgpt_step.py tests/unit_tests/training/test_vlm_step.py tests/unit_tests/training/test_audio_lm_step.py tests/unit_tests/diffusion/model/wan/test_wan_layer_spec.py tests/unit_tests/diffusion/model/wan/test_wan_provider.py -q"
  • Result: 69 passed, 33 warnings in 3.74s.

Pre-commit:

  • Command: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:45:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3933 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && uv run pre-commit run --all-files"
  • Result: passed.

dimapihtar and others added 2 commits May 22, 2026 06:25
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33
Copy link
Copy Markdown
Contributor Author

/ok to test 036b3e5

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

Light Code Review

Overall: Clean compatibility fix. The signature-introspection approach in get_batch_on_this_cp_rank_compat is a solid pattern for bridging old/new MCore signatures, and all Bridge call sites are consistently routed through the wrapper.

Observations

  1. Test mocks in test_vlm_step.py and test_audio_lm_step.py not updated - These files still use lambda x: x mocks for megatron.core.utils.get_batch_on_this_cp_rank. This works today because the compat wrapper introspects the lambda, sees it only accepts x, and strips all keyword arguments before calling. However, the mocks in test_slice_batch_for_cp.py and test_dgpt_step.py were updated to accept the new kwargs. Consider updating the remaining mocks for consistency, even though the current code handles them correctly.

  2. No dedicated unit test for get_batch_on_this_cp_rank_compat - The wrapper is tested indirectly through existing tests, but a small focused test exercising both the old-style (no is_hybrid_cp parameter) and new-style (with is_hybrid_cp) function signatures would help catch regressions if the introspection logic ever changes.

  3. OLMoESelfAttention absorbs name via kwargs but does not forward it to super().init() (olmoe_provider.py:109-118) - Pre-existing issue, not introduced by this PR, but worth noting since it means OLMoE modules will not get MCore new name-tracking feature.

Suggested test cases

No perf tests impacted.

@yaoyu-33 yaoyu-33 added area:training Training loop, callbacks, and runtime integration bug Something isn't working full-test-suite needs-review PR is ready for code review and waiting on a reviewer labels May 22, 2026
return frozenset(parameters)


def get_batch_on_this_cp_rank_compat(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_batch_on_this_cp_rank_compat, no need this, just support new api/set of signiture.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the Bridge compatibility wrapper and updated the affected call sites to use the new MCore get_batch_on_this_cp_rank API directly with is_hybrid_cp=False and the appropriate CP group.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Copy link
Copy Markdown
Contributor Author

/ok to test 3f8e933

asolergi-nv
asolergi-nv previously approved these changes May 22, 2026
@yaoyu-33 yaoyu-33 enabled auto-merge (squash) May 22, 2026 18:05
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Copy link
Copy Markdown
Contributor Author

/ok to test bbe7e59

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Copy link
Copy Markdown
Contributor Author

/ok to test a961e93

@cuichenx cuichenx disabled auto-merge May 23, 2026 00:23
@cuichenx cuichenx merged commit 92deeef into main May 23, 2026
160 of 161 checks passed
@cuichenx cuichenx deleted the yuya/mcore-main-autofix-20260522-pr3933 branch May 23, 2026 00:23
@asolergi-nv asolergi-nv mentioned this pull request May 25, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:training Training loop, callbacks, and runtime integration bug Something isn't working full-test-suite needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants