Skip to content

fix: try remove memory footprint#5449

Open
anyangml wants to merge 20 commits into
deepmodeling:masterfrom
anyangml:fix/compile-multitask-oom
Open

fix: try remove memory footprint#5449
anyangml wants to merge 20 commits into
deepmodeling:masterfrom
anyangml:fix/compile-multitask-oom

Conversation

@anyangml
Copy link
Copy Markdown
Collaborator

@anyangml anyangml commented May 20, 2026

There are two issues hindering multitask DDP training in pt-expt compile mode:

    1. memory footprint scales with N tasks;
    1. lazy compile cause NCCL timeout.

Solutions:

    1. clean up memory footprint after compiling each task
    1. compile all task before training start and sync.

Summary by CodeRabbit

  • Bug Fixes
    • Reduced memory usage during model tracing/compilation by immediately releasing temporary trace and compiled artifacts and conditionally clearing GPU cache when CUDA is active.
    • Improved multi-task compilation stability by cleaning up per-task intermediate data right after each task.
    • Ensured logged and validation metrics are converted to Python scalars for accurate aggregation and reporting.

Review Change Stack

Copilot AI review requested due to automatic review settings May 20, 2026 10:20
@dosubot dosubot Bot added the bug label May 20, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR attempts to reduce peak/retained memory during torch.compile setup in the PyTorch experimental trainer by explicitly releasing intermediate tensors after FX tracing and after per-task compilation setup (especially for multi-task scenarios).

Changes:

  • Delete trace seed tensors after make_fx captures the graph in _trace_and_compile.
  • Store the compiled module in a local variable, delete traced_lower, then return the compiled wrapper.
  • In _compile_model, delete per-task intermediate tensors after installing the compiled wrapper to avoid accumulation across tasks, and call torch.cuda.empty_cache().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread deepmd/pt_expt/train/training.py Outdated
Comment thread deepmd/pt_expt/train/training.py Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Deletes temporary tensors created during torch.fx tracing and per-task compilation, warms up lazily compiled modules, conditionally clears CUDA cache when CUDA is active, and converts training/validation metric tensors to Python scalars for logging/aggregation.

Changes

Memory management in model compilation

Layer / File(s) Summary
Single-model trace and compile cleanup
deepmd/pt_expt/train/training.py
After make_fx produces the FX graph, trace-time inputs (ext_coord, ext_atype, nlist, mapping, optional fparam/aparam) are deleted; torch.compile result is stored in compiled, traced_lower is deleted, and torch.cuda.empty_cache() is called only when model params/buffers are on CUDA and CUDA is initialized.
Per-task compilation warmup and cleanup
deepmd/pt_expt/train/training.py
Before wrapping the model in _CompiledModel, the lazily-compiled compiled_lower is warmed up with the task's sample inputs (warmup output deleted), optional torch.cuda.synchronize() is run for CUDA, _CompiledModel is installed into wrapper_mod.model[task_key], and per-task tracing/compilation intermediates are deleted; conditional CUDA cache clearing follows and a distributed barrier is retained.
Single-task logging and validation scalarization
deepmd/pt_expt/train/training.py
Training logging converts more_loss values to Python scalars via .item() (excluding keys starting with l2_); validation accumulation converts per-batch more_loss tensors to scalars before summing into valid_results.
Multi-task logging, forward-release, and validation scalarization
deepmd/pt_expt/train/training.py
Multi-task training scalarizes more_loss for current and other tasks with .item(); after computing other tasks' contributions local intermediates (_loss, _more, inputs/labels) are deleted and CUDA cache cleared conditionally. Multi-task validation aggregates per-batch metrics using .item() to accumulate scalars per task.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'fix: try remove memory footprint' is vague and uses non-descriptive language ('try remove') that doesn't clearly convey what specific memory optimization was implemented. Make the title more specific about the memory optimization, such as 'fix: reduce memory footprint by cleaning up intermediate tensors' or 'fix: add tensor cleanup and cache clearing in compilation and training loop'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 86.20690% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.46%. Comparing base (f39a081) to head (ec7296c).

Files with missing lines Patch % Lines
deepmd/pt_expt/train/training.py 86.20% 4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5449   +/-   ##
=======================================
  Coverage   82.46%   82.46%           
=======================================
  Files         829      829           
  Lines       88763    88786   +23     
  Branches     4225     4226    +1     
=======================================
+ Hits        73197    73218   +21     
- Misses      14274    14277    +3     
+ Partials     1292     1291    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

anyangml and others added 3 commits May 20, 2026 19:33
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread deepmd/pt_expt/train/training.py Outdated
Comment thread deepmd/pt_expt/train/training.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
deepmd/pt_expt/train/training.py (1)

1023-1024: ⚡ Quick win

Add torch.cuda.is_available() check for consistency and defensive programming.

The CUDA cache cleanup at line 1023 checks DEVICE.type == "cuda" and torch.cuda.is_initialized(), but the similar logic in _trace_and_compile (line 336) includes an additional torch.cuda.is_available() check. While functionally safe (if CUDA is unavailable, is_initialized() returns False), adding is_available() is an idiomatic PyTorch defensive pattern and improves consistency across the codebase.

♻️ Suggested improvement for consistency
-            if DEVICE.type == "cuda" and torch.cuda.is_initialized():
+            if DEVICE.type == "cuda" and torch.cuda.is_available() and torch.cuda.is_initialized():
                 torch.cuda.empty_cache()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pt_expt/train/training.py` around lines 1023 - 1024, Update the CUDA
cache cleanup conditional to include torch.cuda.is_available() for defensive
consistency: where you currently check DEVICE.type == "cuda" and
torch.cuda.is_initialized() (in training.py), add torch.cuda.is_available() to
the condition so it mirrors the check used in _trace_and_compile and guards
against unavailable CUDA devices; modify the condition that references DEVICE
and torch.cuda.is_initialized() to include torch.cuda.is_available().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@deepmd/pt_expt/train/training.py`:
- Around line 1023-1024: Update the CUDA cache cleanup conditional to include
torch.cuda.is_available() for defensive consistency: where you currently check
DEVICE.type == "cuda" and torch.cuda.is_initialized() (in training.py), add
torch.cuda.is_available() to the condition so it mirrors the check used in
_trace_and_compile and guards against unavailable CUDA devices; modify the
condition that references DEVICE and torch.cuda.is_initialized() to include
torch.cuda.is_available().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b0c09d7f-9c9f-4236-97ff-991995ee8557

📥 Commits

Reviewing files that changed from the base of the PR and between c7d9f57 and 2a10532.

📒 Files selected for processing (1)
  • deepmd/pt_expt/train/training.py

Copy link
Copy Markdown
Contributor

@njzjz-bot njzjz-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for chasing this down. I reviewed the final diff in deepmd/pt_expt/train/training.py: the changes now do three useful things for the observed memory/timeout issues:

  • drop temporary tracing/compile inputs after each task is wrapped;
  • convert display-only loss tensors to Python scalars so their graphs can be released promptly;
  • force a warmup compile before distributed training and barrier afterwards, avoiding lazy per-rank compilation desyncs.

The final _traced_lower_ref handling also makes sense: keeping the FX graph alive addresses the SymInt/view-metadata lifetime issue, and object.__setattr__ avoids registering it as a recursive submodule.

CI is green. Approving.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

@wanghan-iapcm
Copy link
Copy Markdown
Collaborator

wanghan-iapcm commented May 24, 2026

Code review

Found 1 issue:

  1. Three new CUDA guards are inconsistent with each other and with the existing pt_expt convention.

    The PR introduces three different CUDA-guard patterns:

    • _trace_and_compile: model_uses_cuda and torch.cuda.is_available() and torch.cuda.is_initialized()
    • _compile_model: DEVICE.type == "cuda" and torch.cuda.is_initialized()
    • run() display path: torch.cuda.is_available() and torch.cuda.is_initialized()

    The pre-existing pt_expt convention is simply DEVICE.type == "cuda", used at pt_expt/train/training.py:660-662 (the only CUDA guard before this PR) and pt_expt/utils/serialization.py:883-885. torch.cuda.is_initialized() is not used anywhere else in the codebase.

    • The is_initialized() half is dead in all three sites: if DEVICE.type == "cuda" (or any model parameter is on CUDA), the context is necessarily initialized.
    • The third site (display path) is also actively wrong on a CPU run on a CUDA-equipped host (DEVICE=cpu env var): torch.cuda.is_available() and torch.cuda.is_initialized() returns True if anything else in the process touched CUDA, causing a stray empty_cache() on an unrelated context. DEVICE.type == "cuda" correctly returns False.
    • The _trace_and_compile model_uses_cuda = any(param.is_cuda for ...) iterates every parameter and buffer on each call — same answer as DEVICE.type == "cuda" in pt_expt (every model is on DEVICE), at higher cost.

    Suggest collapsing all three sites to:

    if DEVICE.type == "cuda":
        torch.cuda.empty_cache()

    DEVICE is already imported in training.py; for the module-level _trace_and_compile helper it can be imported from deepmd.pt_expt.utils.env.

    # causes RecursionError in trainer.wrapper.train().
    object.__setattr__(compiled, "_traced_lower_ref", traced_lower)
    del traced_lower
    model_uses_cuda = any(param.is_cuda for param in model.parameters()) or any(
    buffer.is_cuda for buffer in model.buffers()
    )
    if model_uses_cuda and torch.cuda.is_available() and torch.cuda.is_initialized():
    torch.cuda.empty_cache()

@wanghan-iapcm
Copy link
Copy Markdown
Collaborator

Follow-up: lifeline workaround for _traced_lower_ref

The object.__setattr__(compiled, "_traced_lower_ref", traced_lower) workaround is correct and the comment explains the GC mechanism well. Two maintainability suggestions before this lands:

  1. Fail loud if PyTorch's OptimizedModule ever starts using the same attribute name. The current code silently overwrites whatever may live there. Replace the bare setattr with:

    if hasattr(compiled, "_traced_lower_ref"):
        raise RuntimeError(
            "torch.compile wrapper already exposes _traced_lower_ref; "
            "the FX-graph lifeline mechanism needs review against the "
            "current PyTorch version."
        )
    object.__setattr__(compiled, "_traced_lower_ref", traced_lower)

    assert would be wrong here — it's stripped under python -O / PYTHONOPTIMIZE=1, and this is a check at an external-library boundary (the shape of torch.compile's return type), which is canonically a RuntimeError. The day PyTorch changes how it wraps modules, this turns a silent regression into a loud one.

  2. File the underlying bug upstream. The root cause — _remove_detach_nodes leaving C++ view-metadata with raw pointers to FX-graph-owned SymInts that get GC'd — is a real PyTorch issue worth a minimal reproducer. Without an upstream tracking issue, this workaround becomes permanent debt with no exit path. A short link in the in-code comment (# upstream: pytorch/pytorch#NNNNN) would let future maintainers check whether the workaround can be removed.

Neither is blocking; both are about making sure the workaround stays visible and removable rather than quietly accumulating.

# Python references while C++ view metadata still holds raw pointers to
# them — causing apply_view_meta_sequence to read garbage (crash at
# random training steps, earlier under higher GC pressure from many tasks).
# Use object.__setattr__ to bypass nn.Module.__setattr__: traced_lower is
# an nn.Module, and normal assignment would register it as a submodule of
# compiled (also an nn.Module), creating a cycle in the module tree that
# causes RecursionError in trainer.wrapper.train().
object.__setattr__(compiled, "_traced_lower_ref", traced_lower)
del traced_lower
model_uses_cuda = any(param.is_cuda for param in model.parameters()) or any(
buffer.is_cuda for buffer in model.buffers()

Copy link
Copy Markdown
Collaborator

@wanghan-iapcm wanghan-iapcm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my posted comments.

@anyangml anyangml requested a review from wanghan-iapcm May 25, 2026 06:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants