[ROCm] reduce temp compile memory usage before training starts by cj401-amd · Pull Request #85 · ROCm/maxtext

cj401-amd · 2026-04-30T21:46:22Z

Description

Previously, excessive temp compile memory usage was observed with various models, as shown in https://github.com/orgs/ROCm/projects/14/views/7?pane=issue&itemId=149359236&issue=ROCm%7Cframeworks-internal%7C15124. This PR addresses the issue and reduces temp compile memory usage without negatively impacting training performance, as shown in https://github.com/orgs/ROCm/projects/14/views/1?filterQuery=assignee%3A%22cj401-amd%22&pane=issue&itemId=179156668&issue=ROCm%7Cframeworks-internal%7C16353. It can potentially avoid crashes due to OOM caused by excessive temp compile memory usage for some training workloads, e.g., 405B.

file changes

File	Change	Status
`train.py`	Remove nan_to_num + grad_dtype guard + hasattr JAX config guard + flax_always_shard_variable	Clean
`train_compile.py`	hasattr guard for JAX 0.7.1	Clean
`sharding.py`	skip_trivial_specs param	Clean
`gather_reduce_sc.py`	functools.partial(jax.jit, ...) fix	Clean
`configs/base.yml`	float32_weight_sum: false with updated comment	Clean
`configs/types.py`	float32_weight_sum default synced to False	Clean
`attentions.py`	skip_trivial_specs=True added	Clean
`attention_op.py`	Synthetic mask shortcut, scale_factor removed, context_parallel_axis restored, context_parallel_strategy removed with comment	Clean
`normalizations.py`	Replaced einsum with y * effective_scale + explicit sharding	Clean
`moe.py`	3-tuple tiling with justifying comment, removed .astype(dtype)	Clean
`embeddings.py`	Fixed activation_length_no_exp → output_default_axis_names, added nn.with_logical_constraint	Clean
`pipeline.py`	Replaced shard_map-based permute/shift with pure array ops, removed extra sharding ops	Clean
`mixtral.py`	Converted NNX→Linen MixtralDecoderLayer, MixtralDecoderLayerToLinen = MixtralDecoderLayer alias	Clean
`decoders.py`	shared_embedding as class field, logits sharding constraints added	Clean
`multi_token_prediction.py`	shared_embedding removed from call signatures	Clean
`models.py`	shared_embedding passed at construction, removed from call sites	Clean
`deepseek.py`	remove_size_one_mesh_axis for activation_pspec, removed jax.reshard calls, nd_dense_init(1.0, ...)	Clean

temp memory changes

Model	Branch	Total	Output	Temp	Argument	Δ Temp vs rocm-main
ds-proxy-N1-ep2-pp4	rocm-main	59.8 GB	14.6 GB	45.1 GB	14.6 GB	—
ds-proxy-N1-ep2-pp4	cj-fix-tmp-mem_rocm-main	44.9 GB	14.6 GB	30.3 GB	14.6 GB	−14.8 GB
ds-proxy-se2-e256-h4096	rocm-main	66.0 GB	29.3 GB	36.8 GB	29.3 GB	—
ds-proxy-se2-e256-h4096	cj-fix-tmp-mem_rocm-main	60.3 GB	29.3 GB	31.1 GB	29.3 GB	−5.7 GB
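As a quick check on the table above, the relative temp-memory savings can be recomputed from the Temp column. This is only arithmetic on the reported figures; the helper name `temp_reduction` is made up for illustration.

```python
# Temp-memory figures (GB) copied from the table: (rocm-main, cj-fix-tmp-mem_rocm-main).
rows = {
    "ds-proxy-N1-ep2-pp4": (45.1, 30.3),
    "ds-proxy-se2-e256-h4096": (36.8, 31.1),
}


def temp_reduction(before_gb, after_gb):
    """Return (GB saved, percent saved), each rounded to one decimal place."""
    saved = before_gb - after_gb
    return round(saved, 1), round(100.0 * saved / before_gb, 1)


summary = {model: temp_reduction(*pair) for model, pair in rows.items()}
for model, (saved_gb, saved_pct) in summary.items():
    print(f"{model}: -{saved_gb} GB temp ({saved_pct}% lower than rocm-main)")
```

On these numbers the pp=4 proxy saves roughly a third of its temp allocation, while the se2-e256 proxy saves about 15%.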

- Import remove_size_one_mesh_axis from sharding utils
- Use remove_size_one_mesh_axis for activation_pspec to handle all mesh axes including fsdp_transpose, expert, context correctly
- Remove jax.reshard calls that caused extra temp memory allocation
- Fix dense_init_scale to 1.0 (was self.config.dense_init_scale)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

JAX 0.7.1 requires fun as first positional argument to jax.jit, so @jax.jit(static_argnames=[...]) fails. Use functools.partial instead.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
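A minimal sketch of the decorator pattern this commit message describes. The function `tile_and_sum` and its argument names are illustrative assumptions, not the code in gather_reduce_sc.py; only `functools.partial` and `jax.jit(..., static_argnames=...)` are taken from the message.

```python
import functools

import jax
import jax.numpy as jnp


# Equivalent to jax.jit(tile_and_sum, static_argnames=["num_repeats"]):
# functools.partial binds the keyword options first, and the decorator call
# then supplies the function as the first positional argument to jax.jit.
@functools.partial(jax.jit, static_argnames=["num_repeats"])
def tile_and_sum(x, num_repeats):
    # num_repeats is static because it fixes the output shape at trace time.
    return jnp.tile(x, num_repeats).sum()


result = tile_and_sum(jnp.arange(3.0), num_repeats=2)
print(float(result))
```

Because this form always hands the function to `jax.jit` positionally, it does not depend on whether a given JAX release accepts the keyword-only decorator call.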

This config option was added in JAX > 0.7.1; guard with hasattr so the code runs on both older and newer JAX versions.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
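The hasattr guard can be sketched roughly as below. The helper `update_if_supported` and the two stand-in config classes are hypothetical and exist only so the snippet runs without JAX installed; the flag name comes from the commit title, and the value `True` is an assumption rather than what the PR necessarily sets.

```python
FLAG = "jax_remove_size_one_mesh_axis_from_type"


def update_if_supported(config, name, value):
    """Set a config option only if this config object defines it."""
    if hasattr(config, name):
        config.update(name, value)
        return True
    return False


class OldConfig:
    """Stand-in for a jax.config that predates the flag (hypothetical)."""

    def update(self, name, value):
        setattr(self, name, value)


class NewConfig(OldConfig):
    """Stand-in for a jax.config that defines the flag (hypothetical)."""

    jax_remove_size_one_mesh_axis_from_type = False


old_cfg, new_cfg = OldConfig(), NewConfig()
print(update_if_supported(old_cfg, FLAG, True))  # flag absent: skipped
print(update_if_supported(new_cfg, FLAG, True))  # flag present: updated
```

With real JAX the call would presumably look like `update_if_supported(jax.config, FLAG, True)`, relying on `jax.config.update(name, value)`.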

Three root causes identified vs the working cj-reduce-tmp-mem_rocm-main:

1. GMM 9-tuple tile sizes (moe.py): newer rocm-main reintroduced 9-tuple (fwd+dlhs+drhs) tiling which changes XLA backward-pass planning and adds ~0.5-1 GB temp memory. Revert to 3-tuple (forward-only).
2. Trivial sharding constraints (sharding.py, mixtral.py, attentions.py): For pp=8 ep=1, ALL mesh axes inside the pipeline vmap are size 1. Every with_sharding_constraint resolves to all-None/() PartitionSpecs (trivial), which XLA loop_broadcast_fusion hoists into the pipeline scan carry as extra buffers (+5 GB). Fix: add skip_trivial_specs param to maybe_shard_with_logical; replace nn.with_logical_constraint in MixtralDecoderLayer with shard() helper using skip_trivial_specs=True; same for Attention._maybe_shard_with_logical.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
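The skip_trivial_specs idea can be illustrated with a small, framework-free sketch. Everything here is an assumption for illustration: partition specs are modelled as plain tuples, the mesh as a dict of axis sizes, and `is_trivial_spec` / `maybe_constrain` are hypothetical names rather than the helpers in sharding.py.

```python
def is_trivial_spec(spec, mesh_shape):
    """True when no dimension is split across more than one device."""
    for entry in spec:
        if entry is None:
            continue  # this dimension is replicated
        axes = (entry,) if isinstance(entry, str) else tuple(entry)
        if any(mesh_shape[axis] > 1 for axis in axes):
            return False
    return True


def maybe_constrain(x, spec, mesh_shape, constrain, skip_trivial_specs=False):
    """Apply `constrain` unless the spec is trivial and skipping is enabled."""
    if skip_trivial_specs and is_trivial_spec(spec, mesh_shape):
        return x  # emit no constraint op at all
    return constrain(x, spec)


calls = []


def record(x, spec):
    # Stand-in for a real sharding-constraint call; it only records the spec.
    calls.append(spec)
    return x


inner_mesh = {"data": 1, "fsdp": 1, "tensor": 1, "expert": 1}  # all axes size 1
spec = (("data", "fsdp"), None, "tensor")

maybe_constrain("activations", spec, inner_mesh, record, skip_trivial_specs=True)
print(len(calls))  # constraint skipped

maybe_constrain("activations", spec, {**inner_mesh, "fsdp": 4}, record,
                skip_trivial_specs=True)
print(len(calls))  # constraint applied once a mesh axis is larger than 1
```

In real code the `constrain` callback would be something like `jax.lax.with_sharding_constraint`; the point of the flag is that a trivial spec produces no op for XLA to hoist.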

HLO analysis shows pipeline_module.init_states/_maybe_shard_with_logical at sharding.py:115 generating loop_broadcast_fusion entries with bf16[1,1,4096,4096] tensors in the preallocated-temp pool for pp=8 ep=1. These are trivial constraints (all size-1 mesh axes) that XLA cannot fully eliminate. Skip them to avoid polluting the scan carry.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

This nan_to_num tree_map forced materialization of every gradient tensor (f32[8,16,4096,2048] etc.) as an explicit elementwise op, preventing XLA from eliding them as loop-invariant values through the pipeline scan carry. Result: +10 sub-computations and +80 parameter() occurrences per compiled module, blocking loop_broadcast_fusion and adding ~6 GB preallocated temp. The good branch (cj-reduce-tmp-mem_rocm-main) never had this call. FP8 NaN sanitization is still handled conditionally in the fp8_stats block.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
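To give a sense of scale for the tensor shapes quoted in these commit messages, the sketch below computes their unsharded sizes. It is plain arithmetic; it does not model how XLA actually lays out or shares the preallocated temp pool, so it is not meant to reproduce the ~6 GB figure.

```python
import math

# Bytes per element for the two dtypes mentioned in the commit messages.
DTYPE_BYTES = {"f32": 4, "bf16": 2}


def tensor_bytes(dtype, shape):
    """Size in bytes of a dense, unsharded tensor."""
    return DTYPE_BYTES[dtype] * math.prod(shape)


grad_bytes = tensor_bytes("f32", (8, 16, 4096, 2048))
carry_bytes = tensor_bytes("bf16", (1, 1, 4096, 4096))

print(f"f32[8,16,4096,2048]  = {grad_bytes / 2**30:.1f} GiB")
print(f"bf16[1,1,4096,4096] = {carry_bytes / 2**20:.1f} MiB")
```

A single materialized gradient of the first shape is already several GiB before sharding, which is why forcing an explicit elementwise op on every gradient leaf is costly.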

- Remove 'cj-reduce ---' debug prefix from print_compiled_memory_stats log
- Restore shard_optimizer_over_data guard (was commented out)
- Restore compiled_trainstep_file guard so pre-compiled files skip recompile (forced compile broke any run using compiled_trainstep_file)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

attentions.py:
- Restore depth_scaling = jnp.sqrt(self.head_dim) in the default branch. Both branches were incorrectly set to 1.0, eliminating the T5-style 1/sqrt(d_k) folded into the query weight initializer. For Mixtral with head_dim=128 this produced query weights ~11x larger than intended, degrading training convergence from step 0.

attention_op.py:
- Restore context_parallel_axis=self.config.context_sharding (was hardcoded to "context", silently breaking the ep-as-cp mesh config where context_sharding="expert").
- Add comment explaining scale_factor is intentionally omitted: passing 1.0 disables QK scaling, while TE's default (None) auto-computes 1/sqrt(head_dim).
- Add comment explaining context_parallel_strategy is omitted because the installed TE 2.6.x DotProductAttention does not accept that parameter.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
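The "~11x" figure follows directly from sqrt(128). The sketch below assumes the 1/sqrt(d_k) factor is folded in by dividing the initializer output by depth_scaling; `query_init_std` is a hypothetical helper, not the initializer in attentions.py.

```python
import math


def query_init_std(base_std, head_dim, fold_scale=True):
    """Std of query weights when 1/sqrt(head_dim) is folded into the init."""
    depth_scaling = math.sqrt(head_dim) if fold_scale else 1.0
    return base_std / depth_scaling


head_dim = 128
intended = query_init_std(1.0, head_dim, fold_scale=True)
regressed = query_init_std(1.0, head_dim, fold_scale=False)  # depth_scaling = 1.0
ratio = regressed / intended

print(f"query weights are {ratio:.2f}x larger without the folded scale")
```

The ratio equals sqrt(head_dim), so the same regression would be larger for models with bigger attention heads.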

…um sync

moe.py: Add comment explaining why 3-tuple tiling is correct. The 9-tuple (fwd+dlhs+drhs) is only used by the megablox custom VJP (_gmm_bwd reads tiling[3:6] and tiling[-3:] for backward passes). ds-proxy uses megablox=False / use_tokamax_gmm=False (jax.lax.ragged_dot path) which only reads tiling[0:3], so the extra 6 values were always ignored. base.yml documents this explicitly: "megablox/jax ragged dot - supports forward pass only".

embeddings.py: Replace undefined axis "activation_length_no_exp" with "activation_length" (output_default_axis_names) in the ShardMode.EXPLICIT path. The undefined axis would silently map to None (fully replicated) in any explicit-shard config that lacks a rule for it (e.g. deepseek3-671b-batchsplit). ds-proxy is unaffected (uses shard_mode="auto"), but the latent bug is real.

types.py / base.yml: Sync float32_weight_sum default from True (types.py) to False (base.yml). The False default was set intentionally in commit 4fceae4 to eliminate a ~2 GB temporary f32 tensor from the MoE weight_sum einsum. bf16 summation over 4 experts (num_experts_per_tok) is numerically acceptable. Update comment to document the memory trade-off.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
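The slicing convention described for moe.py can be sketched as follows. The helper `split_tiling` and the tile values are hypothetical; only the slice positions (tiling[0:3], tiling[3:6], tiling[-3:]) come from the commit message.

```python
def split_tiling(tiling):
    """Split a GMM tiling tuple into forward / dlhs / drhs tile sizes."""
    fwd = tuple(tiling[0:3])
    if len(tiling) == 3:
        return {"fwd": fwd}  # forward-only tiling
    if len(tiling) == 9:
        return {
            "fwd": fwd,
            "dlhs": tuple(tiling[3:6]),  # backward pass w.r.t. the lhs
            "drhs": tuple(tiling[-3:]),  # backward pass w.r.t. the rhs
        }
    raise ValueError(f"expected 3 or 9 tile sizes, got {len(tiling)}")


three = split_tiling((512, 1024, 1024))
nine = split_tiling((512, 1024, 1024, 256, 512, 512, 128, 256, 256))

# A code path that only reads tiling[0:3] sees the same forward tiles either way.
print(three["fwd"] == nine["fwd"])
```

This is why a path that never reads past index 2 is indifferent to the six trailing values, even though supplying them may still change what the compiler plans.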

i-chaochen · 2026-04-30T22:42:04Z

@cj401-amd We need to land this PR in upstream maxtext; we cannot afford the maintenance burden on rocm/maxtext:rocm-main. If your work can reduce the memory on the other side, go directly to upstream maxtext.

i-chaochen

Just keep the general changes (no ROCm-specific ones), and verify on the other side.

i-chaochen

Clean up your commits and target the correct branch (rocm-main), then upstream to maxtext, not here.

gulsumgudukbay and others added 29 commits April 23, 2026 15:55

adding rocm_jax_0.7.1 reqs

fe61a0c

Revert "removing CI workflows for now to upstream decoupling changes"

bc13c85

This reverts commit 11a8852.

skip ring attention test on ROCm

cad0e39

[DOWNSTREAM-ONLY] update schedule for build_and_test_maxtext

9a840ad

adding jax 0.8.2 requirements

5d8a1d8

update configs in tests to use helper functions

2b4b7a6

adding TE build and upload CI workflow

5c59338

Adding CI workflow changes for ROCm and JAX 0.8.2 requirements files

568f2d4

update te wheel consumption

2c770ad

refactoring TE wheel release workflow

65eff61

update runner labels to mi355

def5679

fix te wheel selection

cec342e

(cherry picked from commit a2f9860) fix rocm version finding

Remove ROCm fused-attention backend variables

62b9b2d

Removed ROCm specific environment variables for fused-attention.

refactor requirements location change

9f6a367

fix TE build workflow, add rocm torch dependency to env

23da49f

fix ci

2eb63a2

Merge branch 'AI-Hypercomputer:main' into rocm-main

97bc532

update with a reduction of temp mem usage

ec4b5b9

upate to reduce temp mem usage

985ab79

update for reduce temp memory usage for ep=2 pp=4

4fceae4

Fix jax.jit decorator syntax for JAX 0.7.1 compatibility

8168931

Guard jax_remove_size_one_mesh_axis_from_type for JAX 0.7.1 compat

720619e

cj401-amd requested a review from gulsumgudukbay April 30, 2026 21:46

cj401-amd requested a review from yeandy April 30, 2026 21:46

i-chaochen reviewed Apr 30, 2026

i-chaochen suggested changes Apr 30, 2026

cj401-amd closed this May 1, 2026

cj401-amd wants to merge 29 commits into main from cj-fix-tmp-mem_rocm-main

3 participants
