[training] fix: Adapt tokenizer build for MCore dev by yaoyu-33 · Pull Request #3945 · NVIDIA-NeMo/Megatron-Bridge

yaoyu-33 · 2026-05-22T15:10:46Z

Original bump PR

Original bump PR: chore(beep boop 🤖): Bump uv.lock (main, mcore-dev) (2026-05-22) #3934
Target: dev / mcore-dev
Bump head mirrored from: 78d009f47b919b17b91ece7864ba20a65e7fa03e
MCore dev range: cf081d5df0b34b665d214ff936f27489d7396876 -> 56481b0501cf7b3719e1869c495e2680ef0f3456

Failure classification

MCore broke Bridge.

Root cause

The 2026-05-22 mcore-dev bump uses an MCore tokenizer builder that unconditionally computes padded vocab size through args.make_vocab_size_divisible_by * args.tensor_model_parallel_size. MCore dev's TokenizerConfig no longer declares those padding fields, while Bridge's tokenizer config deliberately disables tokenizer-side vocab padding because Bridge model providers handle vocab padding separately.

CI failed in Launch_Unit_Tests_Core job 77372646987 with six failures in tests/unit_tests/training/test_tokenizer.py, all ending in AttributeError: 'TokenizerConfig' object has no attribute 'make_vocab_size_divisible_by' from 3rdparty/Megatron-LM/megatron/core/tokenizers/utils/build_tokenizer.py:106.

Fix summary

Added a local Bridge wrapper config for MCore tokenizer construction.
Supplies no-op padding compatibility attrs (make_vocab_size_divisible_by=1, tensor_model_parallel_size=1, rank=0) only when MCore/Bridge config objects do not already expose them.
Uses a shallow copy so MCore dev's unconditional _set_padded_vocab_size mutation does not change Bridge's original tokenizer config object.

Guards

Added one local compatibility guard in src/megatron/bridge/training/tokenizers/tokenizer.py.

Removal TODO:

# TODO: remove this guard when Megatron-LM tokenizer build_tokenizer either restores
# pad_vocab_size handling or no longer requires padding attrs on TokenizerConfig.

No stale guards were removed. Audit result on 2026-05-22: the existing InferenceMode guard is still needed because MCore dev 56481b0501cf does not expose InferenceMode; the YARN guard is still needed because MCore still uses the hasattr(config, "yarn_rotary_scaling_factor") path; the TE grouped-linear guard remains capability-checked against the runtime TE API.

Validation

Local syntax check, 2026-05-22 America/Los_Angeles:

python3 -m py_compile src/megatron/bridge/training/tokenizers/tokenizer.py tests/unit_tests/training/test_tokenizer.py - passed.

CW interactive partition, 2026-05-22 America/Los_Angeles:

Initial full-sync command: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:30:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && uv run python -m pytest tests/unit_tests/training/test_tokenizer.py -q" - blocked before pytest by nvidia-cusolver download timeout.
Retry with UV_HTTP_TIMEOUT=120 - blocked before pytest by a broken-pipe error while downloading nvidia-cusolver.
Retry with uv run --no-build-isolation - reached package build but hit the 30-minute Slurm time limit during environment setup before pytest.
Focused test command that completed: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:10:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && export PYTHONPATH=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/src:/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/3rdparty/Megatron-LM:\$PYTHONPATH && uv run --no-sync python -m pytest tests/unit_tests/training/test_tokenizer.py -q"
Result: 6 passed, 41 warnings in 4.24s.

Pre-commit, CW interactive partition, 2026-05-22 America/Los_Angeles:

Command: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:10:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && export PYTHONPATH=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/src:/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/3rdparty/Megatron-LM:\$PYTHONPATH && UV_NO_SYNC=1 uv run pre-commit run --all-files"
Result: passed (fix end of files, trim trailing whitespace, ruff, ruff, ruff-format; markdown filename hook skipped).

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>

yaoyu-33 · 2026-05-22T15:11:00Z

/ok to test 1ab4f6f

copy-pr-bot · 2026-05-22T15:11:54Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

claude · 2026-05-22T15:15:36Z

Light Code Review

claude · 2026-05-22T15:16:51Z

Light Code Review

General observations

Clean, focused fix for the MCore dev tokenizer builder breaking change. The approach -- shallow copy + conditional setattr -- is the right pattern: it isolates the original config from MCore mutations and avoids overwriting if MCore later restores these fields.

Missing labels: This PR has no labels. Per CONTRIBUTING.md, it should have at minimum bug (type), area:training + area:build (areas), and needs-more-tests / full-test-suite since this is an MCore submodule bump.

Findings

Missing test coverage for _config_for_mcore_build_tokenizer (src/megatron/bridge/training/tokenizers/tokenizer.py lines 13-26)

The existing test_tokenizer.py tests exercise build_tokenizer which now calls the new helper, providing implicit happy-path coverage. However, the key invariants of the shim are not directly tested:

Copy isolation -- the original TokenizerConfig must not be mutated after build_tokenizer returns (the whole reason for using copy()).
Conditional setting -- if MCore restores make_vocab_size_divisible_by / tensor_model_parallel_size / rank on the config, the shim must not overwrite them.

A small unit test covering these two cases would make the compatibility guard self-documenting and catch regressions when the TODO is eventually resolved.

Suggested test cases

No perf tests impacted.

dimapihtar and others added 2 commits May 22, 2026 06:27

chore(beep boop 🤖): Bump uv.lock (main, mcore-dev) (2026-05-22)

78d009f

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

fix: adapt tokenizer build for mcore dev

1ab4f6f

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>

yaoyu-33 mentioned this pull request May 22, 2026

chore(beep boop 🤖): Bump uv.lock (main, mcore-dev) (2026-05-22) #3934

Open

copy-pr-bot Bot temporarily deployed to public May 22, 2026 15:11 Inactive

copy-pr-bot Bot temporarily deployed to test May 22, 2026 15:11 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 16:00 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 16:16 Inactive

yaoyu-33 added area:training Training loop, callbacks, and runtime integration bug Something isn't working full-test-suite needs-review PR is ready for code review and waiting on a reviewer labels May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[training] fix: Adapt tokenizer build for MCore dev#3945

[training] fix: Adapt tokenizer build for MCore dev#3945
yaoyu-33 wants to merge 2 commits into
mainfrom
yuya/mcore-dev-autofix-20260522-pr3934

yaoyu-33 commented May 22, 2026

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

copy-pr-bot Bot commented May 22, 2026

Uh oh!

claude Bot commented May 22, 2026

Uh oh!

claude Bot commented May 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yaoyu-33 commented May 22, 2026

Original bump PR

Failure classification

Root cause

Fix summary

Guards

Validation

Uh oh!

yaoyu-33 commented May 22, 2026

Uh oh!

copy-pr-bot Bot commented May 22, 2026

Uh oh!

claude Bot commented May 22, 2026

Light Code Review

Uh oh!

claude Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Light Code Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented May 22, 2026 •

edited

Loading