[Training] Add memory estimator breakdown by yaoyu-33 · Pull Request #3910 · NVIDIA-NeMo/Megatron-Bridge

yaoyu-33 · 2026-05-21T00:18:16Z

Summary

Add a structured estimate_training_memory API and formatter around the existing theoretical memory utility.
Account for MoE layer patterns, routed expert EP/ETP sharding, context parallel partitioning, and distributed optimizer shard sizes.
Preserve the existing training-time aggregate memory report while adding docs, focused arithmetic tests, and memory-tuning skill guidance.

Closes #1673

Validation

git diff --check origin/main..HEAD
python3 -m py_compile src/megatron/bridge/training/utils/theoretical_memory_utils.py tests/unit_tests/training/utils/test_theoretical_memory_utils.py
/home/yuya/.local/bin/ruff format src/megatron/bridge/training/utils/theoretical_memory_utils.py tests/unit_tests/training/utils/test_theoretical_memory_utils.py
/home/yuya/.local/bin/ruff check src/megatron/bridge/training/utils/theoretical_memory_utils.py tests/unit_tests/training/utils/test_theoretical_memory_utils.py
/home/yuya/.local/bin/pre-commit run --all-files

Not run:

uv run pre-commit run --all-files: local uv run cannot install nvidia-resiliency-ext==0.6.0 because this workstation is manylinux_2_31_x86_64 and the locked wheel is only available for manylinux_2_39_{x86_64,aarch64}.
Targeted pytest: not run locally per task instructions; cw, sbatch, and srun are not available in this environment.

copy-pr-bot · 2026-05-21T00:18:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 · 2026-05-21T00:18:25Z

/ok to test 20e5d87

claude · 2026-05-21T00:21:06Z

+def _expert_optimizer_shard_size(
+    config: ConfigContainer,
+    *,
+    tensor_parallel_size: int,
+    expert_parallel_size: int,
+    expert_tensor_parallel_size: int,
+) -> int:
+    data_parallel_size = _positive_int_attr(config, "data_parallel_size", 1)
+    shard_size = data_parallel_size * tensor_parallel_size
+    shard_size //= max(1, expert_parallel_size * expert_tensor_parallel_size)
+    return max(1, shard_size)


Bug: context_parallel_size is missing from the expert optimizer shard size.

For dense parameters (line 151), the optimizer shard size correctly includes CP: data_parallel_size * context_parallel_size. But for experts, CP ranks also share the same expert parameters, so the expert data-parallel shard size should be data_parallel_size * tensor_parallel_size * context_parallel_size / (EP * ETP), not data_parallel_size * tensor_parallel_size / (EP * ETP).

When context_parallel_size > 1, this under-counts the shard size by a factor of CP. Because the optimizer term is divided by shard_size, bytes_per_parameter comes out too high for experts, so the estimator overestimates expert memory.
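
To make the size of the error concrete, here is a small arithmetic sketch. The parallel sizes are made-up illustrative values and the helper names are hypothetical; only the `6 + 12 / shard_size` formula and the two shard-size expressions come from this PR.

```python
def expert_shard_size(dp, tp, ep, etp, cp=1):
    """Expert data-parallel shard size; cp=1 reproduces the pre-fix formula."""
    return max(1, (dp * tp * cp) // max(1, ep * etp))


def bytes_per_parameter(shard_size):
    """Bytes per parameter with the distributed optimizer enabled."""
    return 6 + 12 / shard_size


# Illustrative sizes (assumed, not taken from the PR): DP=4, TP=2, CP=2, EP=4, ETP=1.
without_cp = expert_shard_size(dp=4, tp=2, ep=4, etp=1)
with_cp = expert_shard_size(dp=4, tp=2, ep=4, etp=1, cp=2)

print(without_cp, bytes_per_parameter(without_cp))  # 2 12.0
print(with_cp, bytes_per_parameter(with_cp))        # 4 9.0
```

With these assumed sizes, leaving out CP reports 12 bytes per expert parameter where the corrected formula gives 9.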

Suggested change

-def _expert_optimizer_shard_size(
-    config: ConfigContainer,
-    *,
-    tensor_parallel_size: int,
-    expert_parallel_size: int,
-    expert_tensor_parallel_size: int,
-) -> int:
-    data_parallel_size = _positive_int_attr(config, "data_parallel_size", 1)
-    shard_size = data_parallel_size * tensor_parallel_size
-    shard_size //= max(1, expert_parallel_size * expert_tensor_parallel_size)
-    return max(1, shard_size)
+def _expert_optimizer_shard_size(
+    config: ConfigContainer,
+    *,
+    tensor_parallel_size: int,
+    context_parallel_size: int,
+    expert_parallel_size: int,
+    expert_tensor_parallel_size: int,
+) -> int:
+    data_parallel_size = _positive_int_attr(config, "data_parallel_size", 1)
+    shard_size = data_parallel_size * tensor_parallel_size * context_parallel_size
+    shard_size //= max(1, expert_parallel_size * expert_tensor_parallel_size)
+    return max(1, shard_size)

The caller at line 168 would also need context_parallel_size=context_parallel_size.

yaoyu-33 · 2026-05-21

Fixed in 304786d: expert optimizer shard size now includes context_parallel_size, and the MoE CP test asserts the corrected bytes-per-parameter value.

claude · 2026-05-21T00:21:09Z

+- `6 + 12 / shard_size` bytes per parameter when the distributed optimizer is enabled
+
+For dense parameters, `shard_size` is `data_parallel_size * context_parallel_size`.


Nit: the doc says shard_size for dense is data_parallel_size * context_parallel_size, but if the bug above is fixed the expert shard formula should also be updated to mention CP:

For routed MoE experts, expert parameters are divided by expert_model_parallel_size * expert_tensor_parallel_size, and optimizer state uses the expert data-parallel shard size which includes context parallel ranks.

yaoyu-33 · 2026-05-21

Updated in 304786d: the docs now state that the expert data-parallel shard size includes context parallel ranks.
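
As a rough illustration of how the two shard sizes described in the docs feed into a per-rank estimate, here is a hedged sketch. It is not the estimator's actual code: the function name, its arguments, and the example sizes are assumptions, pipeline parallelism is ignored, and the dense count is taken as already local to the rank.

```python
def bytes_per_parameter(shard_size):
    """Bytes per parameter with the distributed optimizer enabled."""
    return 6 + 12 / shard_size


def estimate_parameter_bytes(dense_params_on_rank, routed_expert_params, *, dp, tp, cp, ep, etp):
    """Return (dense_bytes, expert_bytes) for one rank under the documented formulas."""
    # Dense parameters: optimizer state is sharded over DP * CP ranks.
    dense_shard = dp * cp
    # Routed experts: parameters are divided by EP * ETP, and optimizer state is
    # sharded over the expert data-parallel group, which includes CP ranks.
    expert_params_on_rank = routed_expert_params / (ep * etp)
    expert_shard = max(1, (dp * tp * cp) // max(1, ep * etp))
    dense_bytes = dense_params_on_rank * bytes_per_parameter(dense_shard)
    expert_bytes = expert_params_on_rank * bytes_per_parameter(expert_shard)
    return dense_bytes, expert_bytes


# Illustrative sizes only (assumed): 1M dense parameters on this rank, 8M routed expert parameters.
dense_bytes, expert_bytes = estimate_parameter_bytes(
    1_000_000, 8_000_000, dp=4, tp=2, cp=2, ep=4, etp=1
)
print(dense_bytes, expert_bytes)  # 7500000.0 18000000.0
```

Here the dense shard size is 8 (7.5 bytes per parameter) and the expert shard size is 4 (9 bytes per parameter), so the same CP setting affects the two parameter groups differently.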

claude · 2026-05-21T00:23:33Z

Review of [Training] Add memory estimator breakdown.

Bug: _expert_optimizer_shard_size is missing context_parallel_size (theoretical_memory_utils.py:542-552). The dense optimizer shard size correctly includes CP at line 151, but the expert shard size at line 550 uses data_parallel_size * tensor_parallel_size without multiplying by CP. When CP > 1, this under-counts the shard size and therefore overestimates expert memory. See the inline comment for the suggested fix.

Test coverage gaps: MTP layer counting, moe_layer_freq as a list, the moe_latent_size latent projection branch, shared expert parameters, the VPP activation penalty, and report_theoretical_memory integration are all untested.

Suggested test cases: No perf tests impacted.


yaoyu-33 · 2026-05-21T00:35:35Z

/ok to test 304786d


yaoyu-33 · 2026-05-22T16:34:57Z

/ok to test 01d2da3


yaoyu-33 · 2026-05-22T16:51:44Z

/ok to test dc1c741

copy-pr-bot Bot temporarily deployed to public May 21, 2026 00:18 Inactive

copy-pr-bot Bot had a problem deploying to test May 21, 2026 00:19 Error

claude Bot reviewed May 21, 2026


copy-pr-bot Bot temporarily deployed to public May 21, 2026 00:26 Inactive

[Training] feat: add memory estimator breakdown

304786d

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 force-pushed the priya/issue1673-memory-estimator branch from 20e5d87 to 304786d May 21, 2026 00:35

copy-pr-bot Bot temporarily deployed to public May 21, 2026 00:36 Inactive

copy-pr-bot Bot temporarily deployed to test May 21, 2026 00:36 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 00:43 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 00:44 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 01:00 Inactive

yaoyu-33 added the area:training (Training loop, callbacks, and runtime integration), feature (New capabilities, enhancements, or enablement work), and needs-review (PR is ready for code review and waiting on a reviewer) labels May 21, 2026

[training] test: Cover memory report wrapper

01d2da3

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

copy-pr-bot Bot temporarily deployed to public May 22, 2026 16:35 Inactive

copy-pr-bot Bot had a problem deploying to test May 22, 2026 16:35 Error

copy-pr-bot Bot temporarily deployed to public May 22, 2026 16:43 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 16:44 Inactive

[training] cite memory estimator source

dc1c741

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

copy-pr-bot Bot temporarily deployed to public May 22, 2026 16:52 Inactive

copy-pr-bot Bot temporarily deployed to test May 22, 2026 16:52 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 17:00 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 17:14 Inactive


yaoyu-33 wants to merge 3 commits into main from priya/issue1673-memory-estimator
