fix(pt/pd): fix incompatibility between AutoBatchSize and eval hooks by njzjz · Pull Request #5181 · deepmodeling/deepmd-kit

njzjz · 2026-01-29T15:39:37Z

Summary by CodeRabbit

New Features
- Optional automatic OOM retry during evaluation: on an out-of-memory error, evaluation restarts in a retry loop with an adjusted batch size.
- New public control to enable or disable OOM-retry mode for evaluation flows.
Bug Fixes
- Evaluation retry now guarantees cleanup: evaluation hooks and OOM-retry mode are reliably cleared after success or failure.
Tests
- Added unit tests verifying OOM-retry behavior, retry cleanup, and state/hook reset across attempts.

for more information, see https://pre-commit.ci

coderabbitai · 2026-01-29T15:42:52Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds RetrySignal and an oom_retry_mode to AutoBatchSize, raises RetrySignal on OOM when retry mode is enabled, wraps descriptor and fitting-last-layer evaluation in Paddle and PyTorch backends to toggle oom-retry mode and retry on RetrySignal, and adds tests verifying hook/state cleanup.

Changes

OOM Retry Integration

| Layer / File(s) | Summary |
| --- | --- |
| AutoBatchSize: Retry signal and API `deepmd/utils/batch_size.py` | Adds `RetrySignal` exception, initializes `self.oom_retry_mode`, raises `RetrySignal` from `execute` when an OOM occurs and retry mode is enabled, and adds `set_oom_retry_mode(enable: bool)` public method. |
| Paddle eval wrappers `deepmd/pd/infer/deep_eval.py` | Imports `RetrySignal`; wraps `eval_descriptor` and `eval_fitting_last_layer` in a retry loop that enables `auto_batch_size.set_oom_retry_mode(True)` and the model eval hooks, calls `self.eval(...)` then model eval, retries on `RetrySignal`, and ensures hooks and mode are reset in `finally`. |
| PyTorch eval wrappers `deepmd/pt/infer/deep_eval.py` | Same changes as Paddle: import `RetrySignal`, toggle OOM-retry mode around `eval_descriptor` and `eval_fitting_last_layer`, retry on `RetrySignal`, and reset mode and hook state in `finally`. |
| Tests: OOM-retry behavior `source/tests/common/test_oom_retry.py` | Adds tests and dummy helpers that simulate OOM and runtime failures to validate `RetrySignal` propagation, `AutoBatchSize.execute` behavior, and that evaluation hooks/oom state are cleared between attempts and on non-retry errors. |
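
The `AutoBatchSize` row above can be sketched in a few lines. This is a minimal illustration assuming a simplified `execute` signature; `SketchAutoBatchSize`, `OutOfMemory`, and `always_oom` are hypothetical stand-ins and not the deepmd-kit classes.

```python
class RetrySignal(Exception):
    """Tells the caller to restart the whole evaluation."""


class OutOfMemory(RuntimeError):
    """Hypothetical stand-in for a backend OOM error."""


class SketchAutoBatchSize:
    """Simplified stand-in for AutoBatchSize (assumed shape, not the real class)."""

    def __init__(self, initial_batch_size: int = 8) -> None:
        self.current_batch_size = initial_batch_size
        self.oom_retry_mode = False

    def set_oom_retry_mode(self, enable: bool) -> None:
        self.oom_retry_mode = enable

    def execute(self, func, start_index: int):
        try:
            return func(self.current_batch_size, start_index)
        except OutOfMemory as e:
            # Shrink first, so the next attempt already uses the smaller batch.
            self.current_batch_size = max(self.current_batch_size // 2, 1)
            if self.oom_retry_mode:
                # Chain the original error so callers can inspect __cause__.
                raise RetrySignal from e
            return 0, None


def always_oom(batch_size, start_index):
    raise OutOfMemory("simulated")


batcher = SketchAutoBatchSize(8)
plain_result = batcher.execute(always_oom, 0)  # retry mode off: swallow the OOM

batcher.set_oom_retry_mode(True)
try:
    batcher.execute(always_oom, 0)
    cause = None
except RetrySignal as signal:
    cause = signal.__cause__
```

With retry mode off the sketch reports that zero frames were processed; with retry mode on it escalates to the caller, which is then responsible for clearing any accumulated hook state before starting over.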

Sequence Diagram(s)

sequenceDiagram
  participant DeepEval
  participant AutoBatchSize
  participant GPU
  participant Model
  DeepEval->>AutoBatchSize: set_oom_retry_mode(True)
  DeepEval->>Model: set_eval_*_hook(True)
  DeepEval->>AutoBatchSize: execute(batch)
  AutoBatchSize->>GPU: run_batch()
  GPU-->>AutoBatchSize: OOM error
  AutoBatchSize->>DeepEval: raise RetrySignal
  DeepEval->>Model: set_eval_*_hook(False)
  DeepEval->>AutoBatchSize: set_oom_retry_mode(False)
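
The diagram shows one failed attempt. A rough sketch of the full flow, including the second attempt, might look like the following; `FakeModel`, `FakeAutoBatchSize`, and `run_eval` are hypothetical fakes that only record calls, and the wrapper is a simplified free function rather than the real method.

```python
class RetrySignal(Exception):
    pass


events = []  # ordered log of every state toggle


class FakeModel:
    def set_eval_descriptor_hook(self, enable):
        events.append(("hook", enable))

    def eval_descriptor(self):
        return [0.0, 1.0]


class FakeAutoBatchSize:
    def __init__(self):
        self.oom_retry_mode = False

    def set_oom_retry_mode(self, enable):
        self.oom_retry_mode = enable
        events.append(("oom_retry_mode", enable))


def eval_descriptor(model, auto_batch_size, run_eval):
    while True:
        retry = False
        auto_batch_size.set_oom_retry_mode(True)
        model.set_eval_descriptor_hook(True)
        try:
            run_eval()
            descriptor = model.eval_descriptor()
        except RetrySignal:
            retry = True
        finally:
            # Runs on success, on RetrySignal, and on any other exception.
            model.set_eval_descriptor_hook(False)
            auto_batch_size.set_oom_retry_mode(False)
        if not retry:
            return descriptor


outcomes = iter([RetrySignal(), None])  # first attempt fails, second succeeds


def run_eval():
    outcome = next(outcomes)
    if outcome is not None:
        raise outcome


model, batcher = FakeModel(), FakeAutoBatchSize()
descriptor = eval_descriptor(model, batcher, run_eval)
hook_events = [value for name, value in events if name == "hook"]
```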

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

deepmodeling/deepmd-kit#5418: Enhances PyTorch OOM detection used by AutoBatchSize to recognize additional CUDA OOM wrappers, which affects whether RetrySignal is raised.

Suggested reviewers

wanghan-iapcm
njzjz-bot

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.93% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly summarizes the main change: fixing incompatibility between AutoBatchSize and eval hooks in PyTorch/Paddle backends. |
| Linked Issues check | ✅ Passed | The PR implements a retry mechanism with OOM signal handling to fix mismatched descriptor frame counts. Changes align with resolving the AutoBatchSize and eval hook incompatibility reported in issue `#5180`. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the AutoBatchSize and eval hooks incompatibility: RetrySignal exception, OOM retry mode logic, eval hook management, and comprehensive test coverage. |

Copilot

Pull request overview

This pull request attempts to fix an incompatibility between AutoBatchSize and evaluation hooks (eval_descriptor and eval_fitting_last_layer) that caused mismatched descriptor output when OOM errors occurred during batch processing. The issue manifested as descriptors having more frames than the input system (e.g., 241 frames vs 175 in the reported issue).
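
One plausible way the extra frames could arise is sketched below. It rests on two assumptions that the thread does not confirm: that the eval hook appends each batch's output to a list, and that an OOM can occur after the hook has already fired for that batch. All names here are hypothetical.

```python
class OutOfMemory(RuntimeError):
    """Hypothetical stand-in for a backend OOM error."""


recorded = []  # what the eval hook has captured so far


def run_batch(batch, fail):
    # Assumption: the hook fires inside the forward pass, before the OOM.
    recorded.extend(batch)
    if fail:
        raise OutOfMemory("simulated OOM after the hook fired")


def evaluate(frames, batch_size):
    start = 0
    first_call = True
    while start < len(frames):
        batch = frames[start : start + batch_size]
        fail, first_call = first_call, False
        try:
            run_batch(batch, fail)
        except OutOfMemory:
            # Same start index, smaller batch: the failed frames run again.
            batch_size = max(batch_size // 2, 1)
            continue
        start += len(batch)


frames = list(range(10))
evaluate(frames, batch_size=4)
extra_frames = len(recorded) - len(frames)
```

In this toy run the four frames of the failed batch are recorded twice, so the hook holds 14 entries for 10 input frames. Restarting from scratch with cleared hook state avoids the stale entries.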

Changes:

- Introduces a retry mechanism via RetrySignal exception and @retry decorator to restart processing from the beginning when OOM occurs during hook-based evaluation
- Adds oom_retry_mode flag to AutoBatchSize to control whether OOM errors trigger a full retry
- Enables retry mode in eval_descriptor and eval_fitting_last_layer methods for both PyTorch (pt) and Paddle (pd) backends

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| deepmd/utils/batch_size.py | Adds RetrySignal exception, retry decorator, oom_retry_mode flag, and logic to raise RetrySignal on OOM when retry mode is enabled |
| deepmd/pt/infer/deep_eval.py | Enables/disables oom_retry_mode around eval calls in eval_descriptor and eval_fitting_last_layer methods |
| deepmd/pd/infer/deep_eval.py | Enables/disables oom_retry_mode around eval calls in eval_descriptor and eval_fitting_last_layer methods |

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

deepmd/pt/infer/deep_eval.py (1)

796-812: Ensure hooks & OOM retry mode are reset on exceptions.
Lines 796-812 and 855-871 enable hooks/retry mode but only disable them on the success path. If self.eval(...) or model.eval_* throws, the flags remain enabled and can corrupt subsequent calls.

✅ Safer pattern (apply to both methods)

-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(True)
-        model.set_eval_descriptor_hook(True)
-        self.eval(
-            coords,
-            cells,
-            atom_types,
-            atomic=False,
-            fparam=fparam,
-            aparam=aparam,
-            **kwargs,
-        )
-        descriptor = model.eval_descriptor()
-        model.set_eval_descriptor_hook(False)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(False)
+        if self.auto_batch_size is not None:
+            self.auto_batch_size.set_oom_retry_mode(True)
+        model.set_eval_descriptor_hook(True)
+        try:
+            self.eval(
+                coords,
+                cells,
+                atom_types,
+                atomic=False,
+                fparam=fparam,
+                aparam=aparam,
+                **kwargs,
+            )
+            descriptor = model.eval_descriptor()
+        finally:
+            model.set_eval_descriptor_hook(False)
+            if self.auto_batch_size is not None:
+                self.auto_batch_size.set_oom_retry_mode(False)

Also applies to: 855-871

deepmd/pd/infer/deep_eval.py (1)

823-842: Ensure hooks & OOM retry mode are reset on exceptions.
Lines 823-842 and 884-901 toggle hooks/retry mode without a finally. Any exception during self.eval(...) or model.eval_* can leave the backend in a bad state.

✅ Safer pattern (apply to both methods)

-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(True)
-        model.set_eval_descriptor_hook(True)
-        self.eval(
-            coords,
-            cells,
-            atom_types,
-            atomic=False,
-            fparam=fparam,
-            aparam=aparam,
-            **kwargs,
-        )
-        descriptor = model.eval_descriptor()
-        model.set_eval_descriptor_hook(False)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(False)
+        if self.auto_batch_size is not None:
+            self.auto_batch_size.set_oom_retry_mode(True)
+        model.set_eval_descriptor_hook(True)
+        try:
+            self.eval(
+                coords,
+                cells,
+                atom_types,
+                atomic=False,
+                fparam=fparam,
+                aparam=aparam,
+                **kwargs,
+            )
+            descriptor = model.eval_descriptor()
+        finally:
+            model.set_eval_descriptor_hook(False)
+            if self.auto_batch_size is not None:
+                self.auto_batch_size.set_oom_retry_mode(False)

Also applies to: 884-901
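
Since the same toggle-and-reset pattern applies to four methods, one alternative is to package it as a context manager. This is only a sketch of that idea and not what the PR does; `eval_hook_scope` is a hypothetical helper, and the model and batch-size objects are `MagicMock` fakes.

```python
from contextlib import contextmanager
from unittest.mock import MagicMock, call


@contextmanager
def eval_hook_scope(model, auto_batch_size):
    """Enable the descriptor hook and OOM-retry mode, always undoing both."""
    if auto_batch_size is not None:
        auto_batch_size.set_oom_retry_mode(True)
    model.set_eval_descriptor_hook(True)
    try:
        yield
    finally:
        model.set_eval_descriptor_hook(False)
        if auto_batch_size is not None:
            auto_batch_size.set_oom_retry_mode(False)


model = MagicMock()
auto_batch_size = MagicMock()

try:
    with eval_hook_scope(model, auto_batch_size):
        raise RuntimeError("simulated failure inside eval")
except RuntimeError:
    failure_propagated = True
else:
    failure_propagated = False

hook_calls = model.set_eval_descriptor_hook.call_args_list
mode_calls = auto_batch_size.set_oom_retry_mode.call_args_list
```

The explicit `try`/`finally` in the suggested diff has the same effect; the context manager only removes the duplication across methods and backends.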

🤖 Fix all issues with AI agents

In `@deepmd/utils/batch_size.py`:
- Around line 161-162: The OOM handler incorrectly checks the method object
self.set_oom_retry_mode rather than its boolean result, so every OOM triggers a
retry; change the condition to call the method (if self.set_oom_retry_mode():
...) so it evaluates the returned bool before raising RetrySignal, and ensure
the method returns a proper bool.
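
The underlying pitfall is that a bound method object is always truthy in Python. The sketch below uses a hypothetical `Flagged` class; it also shows that, given the `set_oom_retry_mode(enable: bool)` signature listed in the walkthrough, calling the setter without an argument raises `TypeError`, so reading the `oom_retry_mode` attribute is one way to test the flag.

```python
class Flagged:
    """Hypothetical class with the same flag/setter shape as the walkthrough."""

    def __init__(self) -> None:
        self.oom_retry_mode = False

    def set_oom_retry_mode(self, enable: bool) -> None:
        self.oom_retry_mode = enable


obj = Flagged()

# A bound method is an object, and objects are truthy by default.
method_is_truthy = bool(obj.set_oom_retry_mode)

# The stored flag is what actually reflects the current mode.
flag_value = obj.oom_retry_mode

# Calling the one-argument setter with no argument fails outright.
try:
    obj.set_oom_retry_mode()
    call_without_arg_failed = False
except TypeError:
    call_without_arg_failed = True
```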

🧹 Nitpick comments (1)

deepmd/utils/batch_size.py (1)
32-54: Clarify retry semantics (docstring vs behavior).
Line 32 says retries happen for “certain times,” but the wrapper loops forever. Either document it as unbounded or add a cap/backoff.
✏️ Minimal doc fix
-    """Decorator to retry the function until it succeeds or fails for certain times.
+    """Decorator to retry the function until it succeeds (no max retry cap).
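
If a cap were preferred over documenting the loop as unbounded, a bounded variant could look roughly like this. The `max_attempts` parameter and the decorator-factory form are assumptions for illustration; the thread does not show the original decorator's signature.

```python
import functools


class RetrySignal(Exception):
    pass


def retry(max_attempts: int = 5):
    """Retry the wrapped function on RetrySignal, at most max_attempts times."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except RetrySignal:
                    if attempt == max_attempts:
                        raise  # give up and surface the last signal

        return wrapper

    return decorator


calls = {"n": 0}


@retry(max_attempts=5)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetrySignal
    return "done"


@retry(max_attempts=2)
def hopeless():
    raise RetrySignal


result = flaky()
try:
    hopeless()
    gave_up = False
except RetrySignal:
    gave_up = True
```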

codecov · 2026-01-29T16:15:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.48%. Comparing base (8787b45) to head (a1ba195).
⚠️ Report is 181 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5181      +/-   ##
==========================================
+ Coverage   81.95%   82.48%   +0.53%     
==========================================
  Files         714      829     +115     
  Lines       73441    88810   +15369     
  Branches     3616     4225     +609     
==========================================
+ Hits        60187    73258   +13071     
- Misses      12091    14260    +2169     
- Partials     1163     1292     +129

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

njzjz · 2026-02-02T08:07:24Z

@QuantumMisaka could you test whether this PR fixes your issue?

njzjz-bot

I found two OOM-retry correctness issues in the PT/PD eval hook paths and left apply-ready suggestions below.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

@njzjz

Co-authored-by: A bot of @njzjz <48687836+njzjz-bot@users.noreply.github.com> Signed-off-by: Jinzhe Zeng <njzjz@qq.com>

wanghan-iapcm

The fix introduces three new code paths with zero test coverage. Please add unit tests for each — they exist solely because of this PR and will silently regress otherwise.

1. `RetrySignal` raise path in `AutoBatchSize.execute()`

Target: deepmd/utils/batch_size.py:130-131

Subclass AutoBatchSize with a fake executor that raises an OOM-sentinel on first call. Then:

- With set_oom_retry_mode(True): assert RetrySignal is raised, its __cause__ is the original OOM, and current_batch_size was halved before the raise.
- With set_oom_retry_mode(False): assert it returns (0, None) instead, which pins down that the flag actually gates the behavior.

2. Recursive retry in `eval_descriptor` / `eval_fitting_last_layer` clears the hook between attempts

Target: deepmd/{pt,pd}/infer/deep_eval.py

This is the actual fix for #5180 (descriptor frame count doubling). Per backend:

- Inject an AutoBatchSize that raises RetrySignal on the first execute, succeeds on the second.
- Call eval_descriptor with N frames.
- Assert the returned array has exactly N frames (not 2N; this is the #5180 regression check).
- Spy on model.set_eval_descriptor_hook; assert the call sequence is [True, False, True, False].
- Assert auto_batch_size.oom_retry_mode == False after return.

Repeat for eval_fitting_last_layer with set_eval_fitting_last_layer_hook.

3. `finally` clears state on non-`RetrySignal` exceptions

Target: same files as Test 2.

The PR moves hook cleanup into finally — a real improvement, since arbitrary errors no longer leave the hook stuck True. Per backend, per method:

- Monkey-patch self.eval to raise a generic RuntimeError.
- Call inside pytest.raises(RuntimeError).
- Assert set_eval_*_hook was called with False (sequence [True, False]).
- Assert auto_batch_size.oom_retry_mode == False.

Out of scope (do NOT add)

- oom_retry_mode == True + no OOM: exercises no new code.
- oom_retry_mode == False + OOM returning (0, None): pre-existing behavior.

Acceptance

- All new tests pass.
- Mutation check: removing raise RetrySignal fails Test 1; removing the finally fails Tests 2 and 3.

Add regression coverage for the AutoBatchSize RetrySignal path and eval hook cleanup around retry and non-retry failures. Move recursive retries until after finally so hooks are cleared between attempts. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-05-23T16:23:09Z

Addressed review feedback by adding regression coverage and ensuring retry attempts restart only after hook state is cleared.

Opened follow-up PR against this PR branch: njzjz#229

Local focused test passed:

uv run --no-project --with pytest --with array-api-strict --with array-api-compat --with numpy --with packaging --with typing-extensions --with pyyaml --with wcmatch python -m pytest source/tests/common/test_oom_retry.py -q
# 6 passed

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Avoid recursive RetrySignal handling in eval_descriptor and eval_fitting_last_layer so repeated OOM retries do not consume Python stack frames. The loop still clears hook and retry state between attempts before retrying. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-05-24T05:49:45Z

Addressed Copilot's latest recursion comments by opening another follow-up PR against this PR branch: njzjz#230

It converts the RetrySignal handling in PT/PD eval_descriptor and eval_fitting_last_layer from recursive self-calls to iterative retry loops, while preserving the per-attempt hook / OOM retry cleanup before retrying.
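
The difference between the two shapes can be illustrated with a deliberately exaggerated number of retries. This is a sketch with hypothetical helpers (`make_flaky`, `eval_recursive`, `eval_iterative`); real OOM retries are likely far fewer, since the batch size shrinks on every failure.

```python
import sys


class RetrySignal(Exception):
    pass


def make_flaky(failures):
    """Return a callable that raises RetrySignal `failures` times, then succeeds."""
    state = {"left": failures}

    def attempt():
        if state["left"] > 0:
            state["left"] -= 1
            raise RetrySignal
        return "ok"

    return attempt


def eval_recursive(attempt):
    try:
        return attempt()
    except RetrySignal:
        pass
    # Retrying via a self-call adds one stack frame per retry.
    return eval_recursive(attempt)


def eval_iterative(attempt):
    while True:
        retry = False
        try:
            result = attempt()
        except RetrySignal:
            retry = True
        if not retry:
            return result


failures = sys.getrecursionlimit() + 10

iterative_result = eval_iterative(make_flaky(failures))

try:
    eval_recursive(make_flaky(failures))
    recursive_failed = False
except RecursionError:
    recursive_failed = True
```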

Local focused test passed:

uv run --no-project --with pytest --with array-api-strict --with array-api-compat --with numpy --with packaging --with typing-extensions --with pyyaml --with wcmatch python -m pytest source/tests/common/test_oom_retry.py -q
# 6 passed

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

wanghan-iapcm

The new tests address the three asks from my prior review, with one structural concern worth raising before merge.

DummyDeepEval.eval_descriptor / eval_fitting_last_layer (lines 68-104 of source/tests/common/test_oom_retry.py) are hand-written copies of the production try/except RetrySignal/finally orchestration in deepmd/pt/infer/deep_eval.py and deepmd/pd/infer/deep_eval.py. The tests call the dummy's methods, not production's. This is testing the test code, not the code that ships.

Concrete evidence the gap matters: commit b5f789ae (the day after the tests were added) refactored the production code from recursion to iteration (while True + retry flag). The dummy still uses recursion. The tests passed throughout; they never noticed the production-side change because they don't exercise it. If a future refactor drops the finally block from production, the same tests will keep passing.

The mutation-check guarantee from the prior review ("removing the finally fails Tests 2 and 3") does not currently hold for Tests 2 and 3. It holds for Test 1, which calls the real AutoBatchSize.execute.

Mocking is right for a UT — but mock the dependencies of the production method, not the production method itself. Roughly:

class TestEvalDescriptorRetry(unittest.TestCase):
    def setUp(self):
        # Construct DeepEval without loading a real model.
        self.dp_eval = DeepPotPT.__new__(DeepPotPT)
        self.dp_eval.dp = MagicMock()
        self.dp_eval.dp.model = {"Default": MagicMock()}
        self.dp_eval.auto_batch_size = MagicMock()

    def test_retry_clears_hook_between_attempts(self):
        model = self.dp_eval.dp.model["Default"]
        model.eval_descriptor.return_value = np.array([1, 2, 3])
        with patch.object(self.dp_eval, "eval",
                          side_effect=[RetrySignal, None]):
            result = self.dp_eval.eval_descriptor(
                coords=..., cells=..., atom_types=...,
            )
        self.assertEqual(
            model.set_eval_descriptor_hook.call_args_list,
            [call(True), call(False), call(True), call(False)],
        )
        np.testing.assert_array_equal(result, [1, 2, 3])

    def test_finally_clears_hook_on_runtime_error(self):
        with patch.object(self.dp_eval, "eval",
                          side_effect=RuntimeError("non-retry failure")):
            with self.assertRaisesRegex(RuntimeError, "non-retry failure"):
                self.dp_eval.eval_descriptor(
                    coords=..., cells=..., atom_types=...,
                )
        model = self.dp_eval.dp.model["Default"]
        self.assertEqual(
            model.set_eval_descriptor_hook.call_args_list,
            [call(True), call(False)],
        )

The reusable DummyModel / DummyAutoBatchSize stubs (lines 24-49) are fine as collaborator fakes, so keep them. Only DummyDeepEval should go; replace its usages with the real DeepPotPT / DeepPotPD constructed via __new__ (or a tiny init helper) with the dependencies above patched in. Same shape, similar line count, but the assertions then pin the production code paths.
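
The `__new__` technique can be tried without deepmd-kit installed. In the sketch below, `ExpensiveEval` is a hypothetical toy class whose `__init__` would load a model; the point is that the method under test is the class's own code, while only its collaborators are mocked.

```python
from unittest.mock import MagicMock, call, patch


class RetrySignal(Exception):
    pass


class ExpensiveEval:
    """Toy stand-in for a DeepEval-like class; not deepmd-kit code."""

    def __init__(self, model_file):
        raise RuntimeError("would load a real model from disk")

    def eval(self):
        raise NotImplementedError("patched out in the test")

    def eval_descriptor(self):
        while True:
            self.model.set_eval_descriptor_hook(True)
            try:
                self.eval()
                return self.model.eval_descriptor()
            except RetrySignal:
                pass  # finally still runs, then the loop starts over
            finally:
                self.model.set_eval_descriptor_hook(False)


# object.__new__ allocates the instance without running __init__.
dp = ExpensiveEval.__new__(ExpensiveEval)
dp.model = MagicMock()
dp.model.eval_descriptor.return_value = [1.0, 2.0]

# First call to eval raises RetrySignal, second returns None.
with patch.object(dp, "eval", side_effect=[RetrySignal, None]):
    result = dp.eval_descriptor()

hook_calls = dp.model.set_eval_descriptor_hook.call_args_list
```

If the `finally` block were deleted from `eval_descriptor`, the recorded hook calls would change and the assertion on the call sequence would fail, which is the property the mutation check asks for.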

Test 1 (the AutoBatchSize.execute / RetrySignal raise path) is already correct — no changes needed there.

Replace the hand-written DummyDeepEval orchestration with real PT/PD DeepEval instances constructed through __new__, while mocking their dependencies. This keeps the AutoBatchSize test intact and makes retry/finally assertions pin the production eval_descriptor and eval_fitting_last_layer methods. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-05-25T11:32:43Z

Addressed Wang Han's latest structural test feedback by opening another follow-up PR against this PR branch: njzjz#231

It removes the hand-written DummyDeepEval orchestration and instead constructs real PT/PD DeepEval instances via __new__, mocking only their dependencies. The retry and finally assertions now exercise the production eval_descriptor and eval_fitting_last_layer methods.

Local focused checks passed:

uv run --no-project --with pytest --with array-api-strict --with array-api-compat --with numpy --with packaging --with typing-extensions --with pyyaml --with wcmatch python -m pytest source/tests/common/test_oom_retry.py -q
# 2 passed, 8 skipped locally because torch/paddle are unavailable in this lightweight environment

uv run --no-project --with ruff ruff check source/tests/common/test_oom_retry.py
# All checks passed

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jinzhe Zeng <njzjz@qq.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Production eval helpers convert model outputs with backend precision handling, which rejects integer arrays. Use floating arrays in the mocked descriptor and fitting outputs so the production PT/PD retry tests exercise the intended cleanup path. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-05-25T15:36:53Z

Follow-up to fix my test mistake: njzjz#232

The production PT retry tests in #231 mocked descriptor/fitting outputs with integer NumPy arrays. In CI, the real PT backend then passed those through to_numpy_array(), which rejects integer dtypes with ValueError: unknown precision int64. The new follow-up changes those mocked outputs and expectations to floating arrays.
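
The failure mode can be mimicked with a small precision lookup. This is a hypothetical stand-in, assuming the conversion accepts only floating dtypes from a fixed table; `to_numpy_array_sketch` and `_FLOAT_PRECISIONS` are illustrative names and not the deepmd-kit implementation.

```python
import numpy as np

# Hypothetical table of accepted precisions (assumption for illustration).
_FLOAT_PRECISIONS = {
    np.dtype("float16"),
    np.dtype("float32"),
    np.dtype("float64"),
}


def to_numpy_array_sketch(x):
    """Return x as an array, rejecting dtypes outside the precision table."""
    arr = np.asarray(x)
    if arr.dtype not in _FLOAT_PRECISIONS:
        raise ValueError(f"unknown precision {arr.dtype}")
    return arr


float_mock = np.array([1.0, 2.0, 3.0], dtype=np.float64)
int_mock = np.array([1, 2, 3], dtype=np.int64)

converted = to_numpy_array_sketch(float_mock)

try:
    to_numpy_array_sketch(int_mock)
    error_message = None
except ValueError as e:
    error_message = str(e)
```

A mocked return value therefore needs the same dtype family as real model output, otherwise the test fails in the conversion step before reaching the cleanup logic it was meant to exercise.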

Local focused checks passed:

uv run --no-project --with pytest --with array-api-strict --with array-api-compat --with numpy --with packaging --with typing-extensions --with pyyaml --with wcmatch python -m pytest source/tests/common/test_oom_retry.py -q
# 2 passed, 8 skipped locally because torch/paddle are unavailable here

uv run --no-project --with ruff ruff check source/tests/common/test_oom_retry.py
# All checks passed

The separate TensorFlow MessageToString(float_format) failures seen in the bot-fork reproduction appear unrelated to this OOM retry test fix.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

test(oom): return floating mock outputs

fix(pt/pd): fix incompatibility between AutoBatchSize and eval hooks

6b4ca59

Copilot AI review requested due to automatic review settings January 29, 2026 15:39

github-actions Bot added the Python label Jan 29, 2026

Copilot started reviewing on behalf of njzjz January 29, 2026 15:39 View session

dosubot Bot added the bug label Jan 29, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

404d1ac

for more information, see https://pre-commit.ci

Copilot AI reviewed Jan 29, 2026

Comment thread deepmd/utils/batch_size.py Outdated

Comment thread deepmd/utils/batch_size.py

Comment thread deepmd/utils/batch_size.py Outdated

coderabbitai Bot reviewed Jan 29, 2026

Comment thread deepmd/utils/batch_size.py Outdated

njzjz marked this pull request as draft January 29, 2026 16:12

Copilot AI reviewed Jan 29, 2026

njzjz and others added 4 commits January 30, 2026 00:19

apply Copilot's suggestions

01de666

Merge branch 'eval_desc_auto_batch_size' of https://github.com/njzjz/…

232651b

…deepmd-kit into eval_desc_auto_batch_size

rm retry

72c4e36

[pre-commit.ci] auto fixes from pre-commit.com hooks

f785d61

for more information, see https://pre-commit.ci

njzjz mentioned this pull request Mar 10, 2026

[BUG] dp eval-desc give mismatch descriptors on certain System in certain GPU node #5180

Open

njzjz-bot reviewed May 22, 2026

Comment thread deepmd/pt/infer/deep_eval.py Outdated

Comment thread deepmd/pt/infer/deep_eval.py Outdated

Comment thread deepmd/pd/infer/deep_eval.py Outdated

Comment thread deepmd/pd/infer/deep_eval.py Outdated

Apply suggestions from code review

fb6fff5

Co-authored-by: A bot of @njzjz <48687836+njzjz-bot@users.noreply.github.com> Signed-off-by: Jinzhe Zeng <njzjz@qq.com>

njzjz marked this pull request as ready for review May 23, 2026 02:35

dosubot Bot added the enhancement label May 23, 2026

njzjz requested review from iProzd and wanghan-iapcm May 23, 2026 02:36

wanghan-iapcm reviewed May 23, 2026

njzjz-bot mentioned this pull request May 23, 2026

test(infer): cover OOM retry hook cleanup njzjz/deepmd-kit#229

Merged

Merge pull request #229 from njzjz-bothub/pr-5181-oom-retry-tests

3b640c9

njzjz requested a review from Copilot May 24, 2026 00:56

Copilot started reviewing on behalf of njzjz May 24, 2026 00:57 View session

Copilot AI reviewed May 24, 2026

Comment thread deepmd/pt/infer/deep_eval.py Outdated

Comment thread deepmd/pt/infer/deep_eval.py Outdated

Comment thread deepmd/pd/infer/deep_eval.py Outdated

Comment thread deepmd/pd/infer/deep_eval.py Outdated

Merge pull request #230 from njzjz-bothub/pr-5181-iterative-retry

42771a8

njzjz requested a review from Copilot May 25, 2026 07:12

Copilot started reviewing on behalf of njzjz May 25, 2026 07:13 View session

Copilot AI reviewed May 25, 2026

njzjz requested a review from wanghan-iapcm May 25, 2026 07:25

wanghan-iapcm reviewed May 25, 2026

njzjz and others added 2 commits May 25, 2026 19:34

Merge pull request #231 from njzjz-bothub/pr-5181-production-tests

d19993b

[pre-commit.ci] auto fixes from pre-commit.com hooks

a80ce64

for more information, see https://pre-commit.ci

njzjz requested a review from Copilot May 25, 2026 12:13

Copilot started reviewing on behalf of njzjz May 25, 2026 12:13 View session

Copilot AI reviewed May 25, 2026

Comment thread deepmd/utils/batch_size.py Outdated

Potential fix for pull request finding

a1ba195

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jinzhe Zeng <njzjz@qq.com>

njzjz requested a review from Copilot May 25, 2026 12:27

Copilot started reviewing on behalf of njzjz May 25, 2026 12:27 View session

Copilot AI reviewed May 25, 2026

Comment thread source/tests/common/test_oom_retry.py Outdated

Comment thread deepmd/utils/batch_size.py

Merge pull request #232 from njzjz-bothub/pr-5181-production-tests

23340af

test(oom): return floating mock outputs
