[training] fix: Adapt tokenizer build for MCore dev#3945
Conversation
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
|
/ok to test 1ab4f6f |
Light Code Review |
Light Code ReviewGeneral observations Clean, focused fix for the MCore dev tokenizer builder breaking change. The approach -- shallow copy + conditional setattr -- is the right pattern: it isolates the original config from MCore mutations and avoids overwriting if MCore later restores these fields. Missing labels: This PR has no labels. Per CONTRIBUTING.md, it should have at minimum bug (type), area:training + area:build (areas), and needs-more-tests / full-test-suite since this is an MCore submodule bump. Findings Missing test coverage for _config_for_mcore_build_tokenizer (src/megatron/bridge/training/tokenizers/tokenizer.py lines 13-26) The existing test_tokenizer.py tests exercise build_tokenizer which now calls the new helper, providing implicit happy-path coverage. However, the key invariants of the shim are not directly tested:
A small unit test covering these two cases would make the compatibility guard self-documenting and catch regressions when the TODO is eventually resolved. Suggested test cases
|
Original bump PR
uv.lock(main, mcore-dev) (2026-05-22) #3934dev/mcore-dev78d009f47b919b17b91ece7864ba20a65e7fa03ecf081d5df0b34b665d214ff936f27489d7396876 -> 56481b0501cf7b3719e1869c495e2680ef0f3456Failure classification
MCore broke Bridge.
Root cause
The 2026-05-22
mcore-devbump uses an MCore tokenizer builder that unconditionally computes padded vocab size throughargs.make_vocab_size_divisible_by * args.tensor_model_parallel_size. MCore dev'sTokenizerConfigno longer declares those padding fields, while Bridge's tokenizer config deliberately disables tokenizer-side vocab padding because Bridge model providers handle vocab padding separately.CI failed in
Launch_Unit_Tests_Corejob77372646987with six failures intests/unit_tests/training/test_tokenizer.py, all ending inAttributeError: 'TokenizerConfig' object has no attribute 'make_vocab_size_divisible_by'from3rdparty/Megatron-LM/megatron/core/tokenizers/utils/build_tokenizer.py:106.Fix summary
make_vocab_size_divisible_by=1,tensor_model_parallel_size=1,rank=0) only when MCore/Bridge config objects do not already expose them._set_padded_vocab_sizemutation does not change Bridge's original tokenizer config object.Guards
Added one local compatibility guard in
src/megatron/bridge/training/tokenizers/tokenizer.py.Removal TODO:
No stale guards were removed. Audit result on 2026-05-22: the existing
InferenceModeguard is still needed because MCore dev56481b0501cfdoes not exposeInferenceMode; the YARN guard is still needed because MCore still uses thehasattr(config, "yarn_rotary_scaling_factor")path; the TE grouped-linear guard remains capability-checked against the runtime TE API.Validation
Local syntax check, 2026-05-22 America/Los_Angeles:
python3 -m py_compile src/megatron/bridge/training/tokenizers/tokenizer.py tests/unit_tests/training/test_tokenizer.py- passed.CW interactive partition, 2026-05-22 America/Los_Angeles:
srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:30:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && uv run python -m pytest tests/unit_tests/training/test_tokenizer.py -q"- blocked before pytest bynvidia-cusolverdownload timeout.UV_HTTP_TIMEOUT=120- blocked before pytest by a broken-pipe error while downloadingnvidia-cusolver.uv run --no-build-isolation- reached package build but hit the 30-minute Slurm time limit during environment setup before pytest.srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:10:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && export PYTHONPATH=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/src:/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/3rdparty/Megatron-LM:\$PYTHONPATH && uv run --no-sync python -m pytest tests/unit_tests/training/test_tokenizer.py -q"6 passed, 41 warnings in 4.24s.Pre-commit, CW interactive partition, 2026-05-22 America/Los_Angeles:
srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:10:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && export PYTHONPATH=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/src:/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/3rdparty/Megatron-LM:\$PYTHONPATH && UV_NO_SYNC=1 uv run pre-commit run --all-files"fix end of files,trim trailing whitespace,ruff,ruff,ruff-format; markdown filename hook skipped).