[training, model] fix: Adapt MCore main bump compatibility #3944
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
/ok to test 036b3e5
Light Code Review
Overall: Clean compatibility fix. The signature-introspection approach in `get_batch_on_this_cp_rank_compat` is a solid pattern for bridging old and new MCore signatures, and all Bridge call sites are consistently routed through the wrapper.
Observations
Suggested test cases
No perf tests impacted.
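As a rough illustration of the signature-introspection pattern this review refers to, here is a minimal sketch. It is not the code from this PR: `old_api`, `new_api`, and `call_with_supported_kwargs` are hypothetical stand-ins, and only the idea of caching the accepted parameter names as a `frozenset` mirrors the `frozenset(parameters)` line in the diff.

```python
import inspect
from functools import lru_cache


@lru_cache(maxsize=None)
def accepted_parameters(fn):
    """Return the parameter names that fn declares, cached per function."""
    return frozenset(inspect.signature(fn).parameters)


def accepts_var_keyword(fn):
    """Return True if fn declares **kwargs and so accepts any keyword."""
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD
        for p in inspect.signature(fn).parameters.values()
    )


def call_with_supported_kwargs(fn, *args, **optional_kwargs):
    """Call fn, forwarding only the optional keywords it can accept."""
    if accepts_var_keyword(fn):
        return fn(*args, **optional_kwargs)
    names = accepted_parameters(fn)
    supported = {k: v for k, v in optional_kwargs.items() if k in names}
    return fn(*args, **supported)


# Hypothetical stand-ins for an older and a newer upstream signature.
def old_api(batch):
    return ("old", batch)


def new_api(batch, is_hybrid_cp):
    return ("new", batch, is_hybrid_cp)
```

With these stand-ins, `call_with_supported_kwargs(old_api, batch, is_hybrid_cp=False)` silently drops the keyword, while the same call against `new_api` forwards it. The trade-off is that a wrapper like this hides signature drift instead of failing loudly.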
return frozenset(parameters)

def get_batch_on_this_cp_rank_compat(
`get_batch_on_this_cp_rank_compat`: no need for this wrapper; just support the new API and its signature.
Removed the Bridge compatibility wrapper and updated the affected call sites to use the new MCore get_batch_on_this_cp_rank API directly with is_hybrid_cp=False and the appropriate CP group.
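For context on what these call sites compute, the following is a simplified, framework-free sketch of load-balanced context-parallel slicing along the sequence dimension: the sequence is cut into `2 * cp_size` chunks and rank `r` keeps chunks `r` and `2 * cp_size - r - 1`. This describes the non-hybrid path as an assumption, not MCore's implementation. `slice_for_cp_rank` is a hypothetical helper that works on plain lists rather than tensors and does not model the CP group argument.

```python
def slice_for_cp_rank(tokens, cp_size, cp_rank):
    """Return the part of a sequence that one context-parallel rank keeps.

    Hypothetical helper for illustration only; it is not an MCore API.
    """
    tokens = list(tokens)
    if cp_size <= 1:
        # Without context parallelism every rank sees the full sequence.
        return tokens
    if not 0 <= cp_rank < cp_size:
        raise ValueError("cp_rank must be in [0, cp_size)")

    num_chunks = 2 * cp_size
    if len(tokens) % num_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * cp_size")

    chunk = len(tokens) // num_chunks
    first = cp_rank                    # chunk taken from the front
    second = num_chunks - cp_rank - 1  # mirrored chunk taken from the back
    return (
        tokens[first * chunk:(first + 1) * chunk]
        + tokens[second * chunk:(second + 1) * chunk]
    )
```

Pairing a front chunk with its mirrored back chunk gives each rank a similar amount of causal-attention work, which is why the split is not a plain contiguous one.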
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
/ok to test 3f8e933
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
/ok to test bbe7e59
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
/ok to test a961e93
Original bump PR
- uv.lock (main, mcore-main) (2026-05-22) #3933
- main/mcore-main
- a2503d4abf668ef04dd0ad7fa9a3e1b6b44a12c6
- 38986a98aae6a0cc4c8ae7b435db3288a890b0cb -> 4c63602607966297f7daef31a2d6a9bd2dc50f65
Failure classification
MCore broke Bridge.
Root cause
The 2026-05-22 `mcore-main` bump includes MCore PR #4103, which changed `get_batch_on_this_cp_rank` to require `is_hybrid_cp`. Existing Bridge GPT, LLaVA, Qwen3-VL, DGPT, and VLM CP slicing call sites used the older signature.
The same bump also updates MCore transformer module construction to pass a `name` keyword into attention modules. Bridge WAN's DiT attention subclasses did not accept that keyword.
Fix summary
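- Added a `get_batch_on_this_cp_rank_compat` wrapper that adapts to pre- and post-MLM#4103 signatures.
- Updated the affected CP slicing call sites to pass `is_hybrid_cp=False`.
- Updated the WAN DiT attention subclasses to accept `name` and pass it through to MCore attention.
Guards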
Added one local compatibility guard in `src/megatron/bridge/utils/common_utils.py`.
Removal TODO:
`# TODO: remove this guard when Bridge no longer supports Megatron-LM commits before PR #4103.`
No stale guards were removed.
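The `name` keyword failure described under Root cause can be sketched with plain Python classes. This is a minimal illustration, not the Bridge or MCore code: `BaseAttention`, `LegacyAttention`, `PatchedAttention`, and `build_attention` are all hypothetical names, and the constructor arguments are assumptions chosen to keep the example small.

```python
class BaseAttention:
    """Hypothetical stand-in for an upstream attention module."""

    def __init__(self, config, layer_number, name=None):
        self.config = config
        self.layer_number = layer_number
        self.name = name


class LegacyAttention(BaseAttention):
    """Subclass written before the builder started passing name=."""

    def __init__(self, config, layer_number):
        super().__init__(config, layer_number)


class PatchedAttention(BaseAttention):
    """Subclass that accepts name and forwards it to the parent class."""

    def __init__(self, config, layer_number, name=None, **kwargs):
        super().__init__(config, layer_number, name=name, **kwargs)


def build_attention(module_cls, config, layer_number, name):
    """Mimic a builder that always passes name= as a keyword argument."""
    return module_cls(config, layer_number, name=name)
```

Here `build_attention(LegacyAttention, ...)` raises `TypeError` because the subclass constructor does not accept `name`, while `PatchedAttention` accepts the keyword and forwards it.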
Validation
Local:
- `python3 -m py_compile src/megatron/bridge/utils/common_utils.py src/megatron/bridge/training/gpt_step.py src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py src/megatron/bridge/training/llava_step.py src/megatron/bridge/diffusion/models/common/dgpt_step.py src/megatron/bridge/diffusion/models/common/dit_attention.py tests/unit_tests/diffusion/model/nemotron_labs_diffusion/test_dgpt_step.py tests/unit_tests/utils/test_slice_batch_for_cp.py` - passed.
- `uv run python -m pytest ...` on this local host was blocked because `nvidia-resiliency-ext==0.6.0` has no wheel for the host `manylinux_2_31_x86_64` platform.
CW interactive partition, 2026-05-22 America/Los_Angeles:
`srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:30:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3933 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && uv run python -m pytest tests/unit_tests/utils/test_slice_batch_for_cp.py tests/unit_tests/diffusion/model/nemotron_labs_diffusion/test_dgpt_step.py tests/unit_tests/training/test_vlm_step.py tests/unit_tests/training/test_audio_lm_step.py tests/unit_tests/diffusion/model/wan/test_wan_layer_spec.py tests/unit_tests/diffusion/model/wan/test_wan_provider.py -q"`
69 passed, 33 warnings in 3.74s.
Pre-commit:
`srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:45:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3933 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && uv run pre-commit run --all-files"`