Skip to content

[training] fix: Adapt tokenizer build for MCore dev#3945

Open
yaoyu-33 wants to merge 2 commits into
mainfrom
yuya/mcore-dev-autofix-20260522-pr3934
Open

[training] fix: Adapt tokenizer build for MCore dev#3945
yaoyu-33 wants to merge 2 commits into
mainfrom
yuya/mcore-dev-autofix-20260522-pr3934

Conversation

@yaoyu-33
Copy link
Copy Markdown
Contributor

Original bump PR

Failure classification

MCore broke Bridge.

Root cause

The 2026-05-22 mcore-dev bump uses an MCore tokenizer builder that unconditionally computes padded vocab size through args.make_vocab_size_divisible_by * args.tensor_model_parallel_size. MCore dev's TokenizerConfig no longer declares those padding fields, while Bridge's tokenizer config deliberately disables tokenizer-side vocab padding because Bridge model providers handle vocab padding separately.

CI failed in Launch_Unit_Tests_Core job 77372646987 with six failures in tests/unit_tests/training/test_tokenizer.py, all ending in AttributeError: 'TokenizerConfig' object has no attribute 'make_vocab_size_divisible_by' from 3rdparty/Megatron-LM/megatron/core/tokenizers/utils/build_tokenizer.py:106.

Fix summary

  • Added a local Bridge wrapper config for MCore tokenizer construction.
  • Supplies no-op padding compatibility attrs (make_vocab_size_divisible_by=1, tensor_model_parallel_size=1, rank=0) only when MCore/Bridge config objects do not already expose them.
  • Uses a shallow copy so MCore dev's unconditional _set_padded_vocab_size mutation does not change Bridge's original tokenizer config object.

Guards

Added one local compatibility guard in src/megatron/bridge/training/tokenizers/tokenizer.py.

Removal TODO:

# TODO: remove this guard when Megatron-LM tokenizer build_tokenizer either restores
# pad_vocab_size handling or no longer requires padding attrs on TokenizerConfig.

No stale guards were removed. Audit result on 2026-05-22: the existing InferenceMode guard is still needed because MCore dev 56481b0501cf does not expose InferenceMode; the YARN guard is still needed because MCore still uses the hasattr(config, "yarn_rotary_scaling_factor") path; the TE grouped-linear guard remains capability-checked against the runtime TE API.

Validation

Local syntax check, 2026-05-22 America/Los_Angeles:

  • python3 -m py_compile src/megatron/bridge/training/tokenizers/tokenizer.py tests/unit_tests/training/test_tokenizer.py - passed.

CW interactive partition, 2026-05-22 America/Los_Angeles:

  • Initial full-sync command: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:30:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && uv run python -m pytest tests/unit_tests/training/test_tokenizer.py -q" - blocked before pytest by nvidia-cusolver download timeout.
  • Retry with UV_HTTP_TIMEOUT=120 - blocked before pytest by a broken-pipe error while downloading nvidia-cusolver.
  • Retry with uv run --no-build-isolation - reached package build but hit the 30-minute Slurm time limit during environment setup before pytest.
  • Focused test command that completed: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:10:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && export PYTHONPATH=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/src:/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/3rdparty/Megatron-LM:\$PYTHONPATH && uv run --no-sync python -m pytest tests/unit_tests/training/test_tokenizer.py -q"
  • Result: 6 passed, 41 warnings in 4.24s.

Pre-commit, CW interactive partition, 2026-05-22 America/Los_Angeles:

  • Command: srun -A coreai_dlalgo_llm -p interactive --container-image=/lustre/fsw/portfolios/coreai/users/yuya/containers/mbridge-260321.sqsh --container-mounts=/lustre:/lustre --no-container-mount-home -n 1 --gpus-per-task=1 --time=00:10:00 bash -lc "cd /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726 && export UV_CACHE_DIR=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/uv_cache && export PYTHONPATH=/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/src:/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/yuya/mb-autofix-pr3934-20260522-0726/3rdparty/Megatron-LM:\$PYTHONPATH && UV_NO_SYNC=1 uv run pre-commit run --all-files"
  • Result: passed (fix end of files, trim trailing whitespace, ruff, ruff, ruff-format; markdown filename hook skipped).

dimapihtar and others added 2 commits May 22, 2026 06:27
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
@yaoyu-33
Copy link
Copy Markdown
Contributor Author

/ok to test 1ab4f6f

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

Light Code Review

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

Light Code Review

General observations

Clean, focused fix for the MCore dev tokenizer builder breaking change. The approach -- shallow copy + conditional setattr -- is the right pattern: it isolates the original config from MCore mutations and avoids overwriting if MCore later restores these fields.

Missing labels: This PR has no labels. Per CONTRIBUTING.md, it should have at minimum bug (type), area:training + area:build (areas), and needs-more-tests / full-test-suite since this is an MCore submodule bump.

Findings

Missing test coverage for _config_for_mcore_build_tokenizer (src/megatron/bridge/training/tokenizers/tokenizer.py lines 13-26)

The existing test_tokenizer.py tests exercise build_tokenizer which now calls the new helper, providing implicit happy-path coverage. However, the key invariants of the shim are not directly tested:

  1. Copy isolation -- the original TokenizerConfig must not be mutated after build_tokenizer returns (the whole reason for using copy()).
  2. Conditional setting -- if MCore restores make_vocab_size_divisible_by / tensor_model_parallel_size / rank on the config, the shim must not overwrite them.

A small unit test covering these two cases would make the compatibility guard self-documenting and catch regressions when the TODO is eventually resolved.

Suggested test cases

  • No perf tests impacted.

@yaoyu-33 yaoyu-33 added area:training Training loop, callbacks, and runtime integration bug Something isn't working full-test-suite needs-review PR is ready for code review and waiting on a reviewer labels May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:training Training loop, callbacks, and runtime integration bug Something isn't working full-test-suite needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants