[Training] Add memory estimator breakdown#3910
Conversation
|
/ok to test 20e5d87 |
| def _expert_optimizer_shard_size( | ||
| config: ConfigContainer, | ||
| *, | ||
| tensor_parallel_size: int, | ||
| expert_parallel_size: int, | ||
| expert_tensor_parallel_size: int, | ||
| ) -> int: | ||
| data_parallel_size = _positive_int_attr(config, "data_parallel_size", 1) | ||
| shard_size = data_parallel_size * tensor_parallel_size | ||
| shard_size //= max(1, expert_parallel_size * expert_tensor_parallel_size) | ||
| return max(1, shard_size) |
There was a problem hiding this comment.
Bug: context_parallel_size is missing from the expert optimizer shard size.
For dense parameters (line 151), the optimizer shard size correctly includes CP: data_parallel_size * context_parallel_size. But for experts, CP ranks also share the same expert parameters, so the expert data-parallel shard size should be data_parallel_size * tensor_parallel_size * context_parallel_size / (EP * ETP), not data_parallel_size * tensor_parallel_size / (EP * ETP).
When context_parallel_size > 1, this under-counts the shard size by a factor of CP, making bytes_per_parameter too high for experts (overestimate).
| def _expert_optimizer_shard_size( | |
| config: ConfigContainer, | |
| *, | |
| tensor_parallel_size: int, | |
| expert_parallel_size: int, | |
| expert_tensor_parallel_size: int, | |
| ) -> int: | |
| data_parallel_size = _positive_int_attr(config, "data_parallel_size", 1) | |
| shard_size = data_parallel_size * tensor_parallel_size | |
| shard_size //= max(1, expert_parallel_size * expert_tensor_parallel_size) | |
| return max(1, shard_size) | |
| def _expert_optimizer_shard_size( | |
| config: ConfigContainer, | |
| *, | |
| tensor_parallel_size: int, | |
| context_parallel_size: int, | |
| expert_parallel_size: int, | |
| expert_tensor_parallel_size: int, | |
| ) -> int: | |
| data_parallel_size = _positive_int_attr(config, "data_parallel_size", 1) | |
| shard_size = data_parallel_size * tensor_parallel_size * context_parallel_size | |
| shard_size //= max(1, expert_parallel_size * expert_tensor_parallel_size) | |
| return max(1, shard_size) |
The caller at line 168 would also need context_parallel_size=context_parallel_size.
There was a problem hiding this comment.
Fixed in 304786d: expert optimizer shard size now includes context_parallel_size, and the MoE CP test asserts the corrected bytes-per-parameter value.
| - `6 + 12 / shard_size` bytes per parameter when the distributed optimizer is enabled | ||
|
|
||
| For dense parameters, `shard_size` is `data_parallel_size * context_parallel_size`. |
There was a problem hiding this comment.
Nit: the doc says shard_size for dense is data_parallel_size * context_parallel_size, but if the bug above is fixed the expert shard formula should also be updated to mention CP:
For routed MoE experts, expert parameters are divided by
expert_model_parallel_size * expert_tensor_parallel_size, and optimizer state uses the expert data-parallel shard size which includes context parallel ranks.
There was a problem hiding this comment.
Updated in 304786d: the docs now state that the expert data-parallel shard size includes context parallel ranks.
|
Review of [Training] Add memory estimator breakdown. Bug: _expert_optimizer_shard_size is missing context_parallel_size (theoretical_memory_utils.py:542-552). The dense optimizer shard size correctly includes CP at line 151 but the expert shard size at line 550 uses data_parallel_size times tensor_parallel_size without multiplying by CP. When CP > 1 this under-counts the shard size overestimating expert memory. See inline comment for suggested fix. Test coverage gaps: MTP layer counting, moe_layer_freq as a list, moe_latent_size latent projection branch, shared expert parameters, VPP activation penalty, and report_theoretical_memory integration are all untested. Suggested test cases: No perf tests impacted. |
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
20e5d87 to
304786d
Compare
|
/ok to test 304786d |
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
|
/ok to test 01d2da3 |
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
|
/ok to test dc1c741 |
Summary
estimate_training_memoryAPI and formatter around the existing theoretical memory utility.Closes #1673
Validation
git diff --check origin/main..HEADpython3 -m py_compile src/megatron/bridge/training/utils/theoretical_memory_utils.py tests/unit_tests/training/utils/test_theoretical_memory_utils.py/home/yuya/.local/bin/ruff format src/megatron/bridge/training/utils/theoretical_memory_utils.py tests/unit_tests/training/utils/test_theoretical_memory_utils.py/home/yuya/.local/bin/ruff check src/megatron/bridge/training/utils/theoretical_memory_utils.py tests/unit_tests/training/utils/test_theoretical_memory_utils.py/home/yuya/.local/bin/pre-commit run --all-filesNot run:
uv run pre-commit run --all-files: localuv runcannot installnvidia-resiliency-ext==0.6.0because this workstation ismanylinux_2_31_x86_64and the locked wheel is only available formanylinux_2_39_{x86_64,aarch64}.cw,sbatch, andsrunare not available in this environment.