fix: try remove memory footprint#5449
Conversation
There was a problem hiding this comment.
Pull request overview
This PR attempts to reduce peak/retained memory during torch.compile setup in the PyTorch experimental trainer by explicitly releasing intermediate tensors after FX tracing and after per-task compilation setup (especially for multi-task scenarios).
Changes:
- Delete trace seed tensors after
make_fxcaptures the graph in_trace_and_compile. - Store the compiled module in a local variable, delete
traced_lower, then return the compiled wrapper. - In
_compile_model, delete per-task intermediate tensors after installing the compiled wrapper to avoid accumulation across tasks, and calltorch.cuda.empty_cache().
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughDeletes temporary tensors created during torch.fx tracing and per-task compilation, warms up lazily compiled modules, conditionally clears CUDA cache when CUDA is active, and converts training/validation metric tensors to Python scalars for logging/aggregation. ChangesMemory management in model compilation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5449 +/- ##
=======================================
Coverage 82.46% 82.46%
=======================================
Files 829 829
Lines 88763 88786 +23
Branches 4225 4226 +1
=======================================
+ Hits 73197 73218 +21
- Misses 14274 14277 +3
+ Partials 1292 1291 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>
for more information, see https://pre-commit.ci
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
deepmd/pt_expt/train/training.py (1)
1023-1024: ⚡ Quick winAdd
torch.cuda.is_available()check for consistency and defensive programming.The CUDA cache cleanup at line 1023 checks
DEVICE.type == "cuda" and torch.cuda.is_initialized(), but the similar logic in_trace_and_compile(line 336) includes an additionaltorch.cuda.is_available()check. While functionally safe (if CUDA is unavailable,is_initialized()returns False), addingis_available()is an idiomatic PyTorch defensive pattern and improves consistency across the codebase.♻️ Suggested improvement for consistency
- if DEVICE.type == "cuda" and torch.cuda.is_initialized(): + if DEVICE.type == "cuda" and torch.cuda.is_available() and torch.cuda.is_initialized(): torch.cuda.empty_cache()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt_expt/train/training.py` around lines 1023 - 1024, Update the CUDA cache cleanup conditional to include torch.cuda.is_available() for defensive consistency: where you currently check DEVICE.type == "cuda" and torch.cuda.is_initialized() (in training.py), add torch.cuda.is_available() to the condition so it mirrors the check used in _trace_and_compile and guards against unavailable CUDA devices; modify the condition that references DEVICE and torch.cuda.is_initialized() to include torch.cuda.is_available().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@deepmd/pt_expt/train/training.py`:
- Around line 1023-1024: Update the CUDA cache cleanup conditional to include
torch.cuda.is_available() for defensive consistency: where you currently check
DEVICE.type == "cuda" and torch.cuda.is_initialized() (in training.py), add
torch.cuda.is_available() to the condition so it mirrors the check used in
_trace_and_compile and guards against unavailable CUDA devices; modify the
condition that references DEVICE and torch.cuda.is_initialized() to include
torch.cuda.is_available().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b0c09d7f-9c9f-4236-97ff-991995ee8557
📒 Files selected for processing (1)
deepmd/pt_expt/train/training.py
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
njzjz-bot
left a comment
There was a problem hiding this comment.
Thanks for chasing this down. I reviewed the final diff in deepmd/pt_expt/train/training.py: the changes now do three useful things for the observed memory/timeout issues:
- drop temporary tracing/compile inputs after each task is wrapped;
- convert display-only loss tensors to Python scalars so their graphs can be released promptly;
- force a warmup compile before distributed training and barrier afterwards, avoiding lazy per-rank compilation desyncs.
The final _traced_lower_ref handling also makes sense: keeping the FX graph alive addresses the SymInt/view-metadata lifetime issue, and object.__setattr__ avoids registering it as a recursive submodule.
CI is green. Approving.
— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)
Code reviewFound 1 issue:
|
Follow-up: lifeline workaround for
|
wanghan-iapcm
left a comment
There was a problem hiding this comment.
see my posted comments.
There are two issues hindering multitask DDP training in pt-expt compile mode:
Solutions:
Summary by CodeRabbit