fix: try remove memory footprint by anyangml · Pull Request #5449 · deepmodeling/deepmd-kit

anyangml · 2026-05-20T10:20:26Z

There are two issues hindering multitask DDP training in pt-expt compile mode:

1. memory footprint scales with N tasks;
1. lazy compile cause NCCL timeout.

Solutions:

1. clean up memory footprint after compiling each task
1. compile all task before training start and sync.

Summary by CodeRabbit

Bug Fixes
- Reduced memory usage during model tracing/compilation by immediately releasing temporary trace and compiled artifacts and conditionally clearing GPU cache when CUDA is active.
- Improved multi-task compilation stability by cleaning up per-task intermediate data right after each task.
- Ensured logged and validation metrics are converted to Python scalars for accurate aggregation and reporting.

Copilot

Pull request overview

This PR attempts to reduce peak/retained memory during torch.compile setup in the PyTorch experimental trainer by explicitly releasing intermediate tensors after FX tracing and after per-task compilation setup (especially for multi-task scenarios).

Changes:

Delete trace seed tensors after make_fx captures the graph in _trace_and_compile.
Store the compiled module in a local variable, delete traced_lower, then return the compiled wrapper.
In _compile_model, delete per-task intermediate tensors after installing the compiled wrapper to avoid accumulation across tasks, and call torch.cuda.empty_cache().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai · 2026-05-20T10:23:27Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Deletes temporary tensors created during torch.fx tracing and per-task compilation, warms up lazily compiled modules, conditionally clears CUDA cache when CUDA is active, and converts training/validation metric tensors to Python scalars for logging/aggregation.

Changes

Memory management in model compilation

Layer / File(s)	Summary
Single-model trace and compile cleanup `deepmd/pt_expt/train/training.py`	After `make_fx` produces the FX graph, trace-time inputs (`ext_coord`, `ext_atype`, `nlist`, `mapping`, optional `fparam`/`aparam`) are deleted; `torch.compile` result is stored in `compiled`, `traced_lower` is deleted, and `torch.cuda.empty_cache()` is called only when model params/buffers are on CUDA and CUDA is initialized.
Per-task compilation warmup and cleanup `deepmd/pt_expt/train/training.py`	Before wrapping the model in `_CompiledModel`, the lazily-compiled `compiled_lower` is warmed up with the task's sample inputs (warmup output deleted), optional `torch.cuda.synchronize()` is run for CUDA, `_CompiledModel` is installed into `wrapper_mod.model[task_key]`, and per-task tracing/compilation intermediates are deleted; conditional CUDA cache clearing follows and a distributed barrier is retained.
Single-task logging and validation scalarization `deepmd/pt_expt/train/training.py`	Training logging converts `more_loss` values to Python scalars via `.item()` (excluding keys starting with `l2_`); validation accumulation converts per-batch `more_loss` tensors to scalars before summing into `valid_results`.
Multi-task logging, forward-release, and validation scalarization `deepmd/pt_expt/train/training.py`	Multi-task training scalarizes `more_loss` for current and other tasks with `.item()`; after computing other tasks' contributions local intermediates (`_loss`, `_more`, inputs/labels) are deleted and CUDA cache cleared conditionally. Multi-task validation aggregates per-batch metrics using `.item()` to accumulate scalars per task.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'fix: try remove memory footprint' is vague and uses non-descriptive language ('try remove') that doesn't clearly convey what specific memory optimization was implemented.	Make the title more specific about the memory optimization, such as 'fix: reduce memory footprint by cleaning up intermediate tensors' or 'fix: add tensor cleanup and cache clearing in compilation and training loop'.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-20T11:17:27Z

Codecov Report

❌ Patch coverage is 86.20690% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.46%. Comparing base (f39a081) to head (ec7296c).

Files with missing lines	Patch %	Lines
deepmd/pt_expt/train/training.py	86.20%	4 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5449   +/-   ##
=======================================
  Coverage   82.46%   82.46%           
=======================================
  Files         829      829           
  Lines       88763    88786   +23     
  Branches     4225     4226    +1     
=======================================
+ Hits        73197    73218   +21     
- Misses      14274    14277    +3     
+ Partials     1292     1291    -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>

for more information, see https://pre-commit.ci

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>

coderabbitai

🧹 Nitpick comments (1)

deepmd/pt_expt/train/training.py (1)
1023-1024: ⚡ Quick win

Add torch.cuda.is_available() check for consistency and defensive programming.

The CUDA cache cleanup at line 1023 checks DEVICE.type == "cuda" and torch.cuda.is_initialized(), but the similar logic in _trace_and_compile (line 336) includes an additional torch.cuda.is_available() check. While functionally safe (if CUDA is unavailable, is_initialized() returns False), adding is_available() is an idiomatic PyTorch defensive pattern and improves consistency across the codebase.
♻️ Suggested improvement for consistency
-            if DEVICE.type == "cuda" and torch.cuda.is_initialized():
+            if DEVICE.type == "cuda" and torch.cuda.is_available() and torch.cuda.is_initialized():
                 torch.cuda.empty_cache()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pt_expt/train/training.py` around lines 1023 - 1024, Update the CUDA
cache cleanup conditional to include torch.cuda.is_available() for defensive
consistency: where you currently check DEVICE.type == "cuda" and
torch.cuda.is_initialized() (in training.py), add torch.cuda.is_available() to
the condition so it mirrors the check used in _trace_and_compile and guards
against unavailable CUDA devices; modify the condition that references DEVICE
and torch.cuda.is_initialized() to include torch.cuda.is_available().

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@deepmd/pt_expt/train/training.py`:
- Around line 1023-1024: Update the CUDA cache cleanup conditional to include
torch.cuda.is_available() for defensive consistency: where you currently check
DEVICE.type == "cuda" and torch.cuda.is_initialized() (in training.py), add
torch.cuda.is_available() to the condition so it mirrors the check used in
_trace_and_compile and guards against unavailable CUDA devices; modify the
condition that references DEVICE and torch.cuda.is_initialized() to include
torch.cuda.is_available().

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b0c09d7f-9c9f-4236-97ff-991995ee8557

📥 Commits

Reviewing files that changed from the base of the PR and between c7d9f57 and 2a10532.

📒 Files selected for processing (1)

deepmd/pt_expt/train/training.py

for more information, see https://pre-commit.ci

njzjz-bot

Thanks for chasing this down. I reviewed the final diff in deepmd/pt_expt/train/training.py: the changes now do three useful things for the observed memory/timeout issues:

drop temporary tracing/compile inputs after each task is wrapped;
convert display-only loss tensors to Python scalars so their graphs can be released promptly;
force a warmup compile before distributed training and barrier afterwards, avoiding lazy per-rank compilation desyncs.

The final _traced_lower_ref handling also makes sense: keeping the FX graph alive addresses the SymInt/view-metadata lifetime issue, and object.__setattr__ avoids registering it as a recursive submodule.

CI is green. Approving.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

wanghan-iapcm · 2026-05-24T10:00:13Z

Code review

Found 1 issue:

Three new CUDA guards are inconsistent with each other and with the existing pt_expt convention.

The PR introduces three different CUDA-guard patterns:

_trace_and_compile: model_uses_cuda and torch.cuda.is_available() and torch.cuda.is_initialized()
_compile_model: DEVICE.type == "cuda" and torch.cuda.is_initialized()
run() display path: torch.cuda.is_available() and torch.cuda.is_initialized()

The pre-existing pt_expt convention is simply DEVICE.type == "cuda", used at pt_expt/train/training.py:660-662 (the only CUDA guard before this PR) and pt_expt/utils/serialization.py:883-885. torch.cuda.is_initialized() is not used anywhere else in the codebase.

The is_initialized() half is dead in all three sites: if DEVICE.type == "cuda" (or any model parameter is on CUDA), the context is necessarily initialized.
The third site (display path) is also actively wrong on a CPU run on a CUDA-equipped host (DEVICE=cpu env var): torch.cuda.is_available() and torch.cuda.is_initialized() returns True if anything else in the process touched CUDA, causing a stray empty_cache() on an unrelated context. DEVICE.type == "cuda" correctly returns False.
The _trace_and_compile model_uses_cuda = any(param.is_cuda for ...) iterates every parameter and buffer on each call — same answer as DEVICE.type == "cuda" in pt_expt (every model is on DEVICE), at higher cost.

Suggest collapsing all three sites to:

if DEVICE.type == "cuda":
    torch.cuda.empty_cache()

DEVICE is already imported in training.py; for the module-level _trace_and_compile helper it can be imported from deepmd.pt_expt.utils.env.

deepmd-kit/deepmd/pt_expt/train/training.py

Lines 341 to 348 in b5236de

    
           # causes RecursionError in trainer.wrapper.train(). 
        
           object.__setattr__(compiled, "_traced_lower_ref", traced_lower) 
        
           del traced_lower 
        
           model_uses_cuda = any(param.is_cuda for param in model.parameters()) or any( 
        
               buffer.is_cuda for buffer in model.buffers() 
        
           ) 
        
           if model_uses_cuda and torch.cuda.is_available() and torch.cuda.is_initialized(): 
        
               torch.cuda.empty_cache()

wanghan-iapcm · 2026-05-24T10:11:47Z

Follow-up: lifeline workaround for `_traced_lower_ref`

The object.__setattr__(compiled, "_traced_lower_ref", traced_lower) workaround is correct and the comment explains the GC mechanism well. Two maintainability suggestions before this lands:

Fail loud if PyTorch's OptimizedModule ever starts using the same attribute name. The current code silently overwrites whatever may live there. Replace the bare setattr with:
```
if hasattr(compiled, "_traced_lower_ref"):
    raise RuntimeError(
        "torch.compile wrapper already exposes _traced_lower_ref; "
        "the FX-graph lifeline mechanism needs review against the "
        "current PyTorch version."
    )
object.__setattr__(compiled, "_traced_lower_ref", traced_lower)
```
assert would be wrong here — it's stripped under python -O / PYTHONOPTIMIZE=1, and this is a check at an external-library boundary (the shape of torch.compile's return type), which is canonically a RuntimeError. The day PyTorch changes how it wraps modules, this turns a silent regression into a loud one.
File the underlying bug upstream. The root cause — _remove_detach_nodes leaving C++ view-metadata with raw pointers to FX-graph-owned SymInts that get GC'd — is a real PyTorch issue worth a minimal reproducer. Without an upstream tracking issue, this workaround becomes permanent debt with no exit path. A short link in the in-code comment (# upstream: pytorch/pytorch#NNNNN) would let future maintainers check whether the workaround can be removed.

Neither is blocking; both are about making sure the workaround stays visible and removable rather than quietly accumulating.

deepmd-kit/deepmd/pt_expt/train/training.py

Lines 335 to 345 in b5236de

    
           # Python references while C++ view metadata still holds raw pointers to 
        
           # them — causing apply_view_meta_sequence to read garbage (crash at 
        
           # random training steps, earlier under higher GC pressure from many tasks). 
        
           # Use object.__setattr__ to bypass nn.Module.__setattr__: traced_lower is 
        
           # an nn.Module, and normal assignment would register it as a submodule of 
        
           # compiled (also an nn.Module), creating a cycle in the module tree that 
        
           # causes RecursionError in trainer.wrapper.train(). 
        
           object.__setattr__(compiled, "_traced_lower_ref", traced_lower) 
        
           del traced_lower 
        
           model_uses_cuda = any(param.is_cuda for param in model.parameters()) or any( 
        
               buffer.is_cuda for buffer in model.buffers()

wanghan-iapcm

see my posted comments.

fix: try remove memory footprint

7714b84

Copilot AI review requested due to automatic review settings May 20, 2026 10:20

dosubot Bot added the bug label May 20, 2026

github-actions Bot added the Python label May 20, 2026

Copilot started reviewing on behalf of anyangml May 20, 2026 10:20 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread deepmd/pt_expt/train/training.py Outdated

Comment thread deepmd/pt_expt/train/training.py Outdated

anyangml and others added 3 commits May 20, 2026 19:33

Potential fix for pull request finding

ee413c3

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

eb239ef

for more information, see https://pre-commit.ci

fix: comment

c7d9f57

anyangml requested a review from Copilot May 20, 2026 11:39

Copilot started reviewing on behalf of anyangml May 20, 2026 11:39 View session

anyangml requested review from iProzd and wanghan-iapcm May 20, 2026 11:40

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread deepmd/pt_expt/train/training.py Outdated

Comment thread deepmd/pt_expt/train/training.py Outdated

Potential fix for pull request finding

2a10532

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

anyangml and others added 11 commits May 21, 2026 10:47

fix: remove graph

8b7584a

[pre-commit.ci] auto fixes from pre-commit.com hooks

87e4e46

for more information, see https://pre-commit.ci

fix: lazy compile in multitask NCCL timeout

4ffc15a

fix: mark variable-size dimensions as dynamic to prevent NCCL timeout

36d57a7

[pre-commit.ci] auto fixes from pre-commit.com hooks

1978e33

for more information, see https://pre-commit.ci

fix: mark tensors as dynamic to prevent NCCL timeout during training

937b742

[pre-commit.ci] auto fixes from pre-commit.com hooks

2ee8af1

for more information, see https://pre-commit.ci

fix: dynamic shape

c05542a

chore: revert extra mark_dynamic

697b24f

fix: prevent GC error

c27c95d

fix:recursion

b5236de

njzjz-bot approved these changes May 23, 2026

View reviewed changes

wanghan-iapcm reviewed May 24, 2026

View reviewed changes

anyangml added 4 commits May 25, 2026 09:43

fix: try fix remove_detach_node

2377b1a

Merge branch 'master' into fix/compile-multitask-oom

edc9a57

fix: add new charge_spin para

211608b

fix: comment

ec7296c

anyangml requested a review from wanghan-iapcm May 25, 2026 06:26

Conversation

anyangml commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

njzjz-bot left a comment

Choose a reason for hiding this comment

Uh oh!

wanghan-iapcm commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review

Uh oh!

wanghan-iapcm commented May 24, 2026

Follow-up: lifeline workaround for _traced_lower_ref

Uh oh!

wanghan-iapcm left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

anyangml commented May 20, 2026 •

edited

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading

codecov Bot commented May 20, 2026 •

edited

Loading

wanghan-iapcm commented May 24, 2026 •

edited

Loading

Follow-up: lifeline workaround for `_traced_lower_ref`