[bugfix] ensure predict threads are joined on exception by tiankongdeguiji · Pull Request #433 · alibaba/TorchEasyRec

tiankongdeguiji · 2026-03-22T02:56:27Z

Summary

Added try/finally around the prediction pipeline in tzrec/main.py to ensure forward and write threads are properly cleaned up on exception
Without this, if an exception occurred during prediction, threads could be left running indefinitely
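
As a rough illustration of the pattern described above (not the actual tzrec/main.py code), the sketch below wraps a toy producer loop in try/finally so that worker threads are signaled and joined on both the normal and the error path. The names run_pipeline, _forward_loop, fail_at, and the QUEUE_TIMEOUT value are assumptions made for this example.

```python
import queue
import threading

QUEUE_TIMEOUT = 1.0  # stand-in for PREDICT_QUEUE_TIMEOUT; the value is an assumption


def _forward_loop(data_queue, results):
    # Hypothetical worker: consume batches until the None sentinel arrives.
    while True:
        batch = data_queue.get(timeout=QUEUE_TIMEOUT)
        if batch is None:
            return
        results.append(batch * 2)


def run_pipeline(batches, num_threads=2, fail_at=None):
    data_queue = queue.Queue(maxsize=8)
    results = []
    forward_t_list = []
    try:
        for _ in range(num_threads):
            t = threading.Thread(target=_forward_loop, args=(data_queue, results))
            t.start()
            forward_t_list.append(t)  # record only threads that actually started
        for i, batch in enumerate(batches):
            if i == fail_at:
                raise RuntimeError("simulated failure mid-prediction")
            data_queue.put(batch, timeout=QUEUE_TIMEOUT)
    finally:
        # One sentinel per started thread, then join, on success and on error alike.
        for _ in range(len(forward_t_list)):
            data_queue.put(None, timeout=QUEUE_TIMEOUT)
        for t in forward_t_list:
            t.join(timeout=QUEUE_TIMEOUT)
    return results
```

Calling run_pipeline([1, 2, 3], fail_at=1) still raises the RuntimeError, but only after the finally block has sent the sentinels and joined the workers.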

Test plan

Run prediction pipeline end-to-end
Verify threads are cleaned up when prediction completes normally
Verify threads are cleaned up when an exception occurs

🤖 Generated with Claude Code

Wrap prediction loop in try/finally to ensure forward and write threads are properly signaled and joined even when exceptions occur, preventing potential resource leaks and hanging processes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-27T03:26:54Z

tzrec/main.py

+        for t in forward_t_list:
+            t.join()


Bug risk: join() without timeout can hang the process indefinitely

If a forward thread is blocked on pred_queue.put() (queue full because write thread stopped consuming), the sentinel None on data_queue won't unblock it. t.join() will then block forever. In a distributed job this hangs all ranks.

Consider adding a timeout:

Suggested change

         for t in forward_t_list:
-            t.join()
+            t.join(timeout=PREDICT_QUEUE_TIMEOUT)

github-actions · 2026-03-27T03:26:56Z

tzrec/main.py

+                pred_queue.put((None, None), timeout=PREDICT_QUEUE_TIMEOUT)
+            except Exception:
+                pass
+            write_t.join()


Same concern — add a timeout to prevent indefinite hang if the sentinel put above failed silently:

Suggested change

-            write_t.join()
+            write_t.join(timeout=PREDICT_QUEUE_TIMEOUT)

github-actions · 2026-03-27T03:26:58Z

tzrec/main.py

+            except Exception:
+                pass


Nit: Silently swallowing exceptions makes cleanup hangs very hard to diagnose in production. Consider logging instead of bare pass:

            except Exception:
                logger.warning("Failed to send sentinel to data_queue during cleanup", exc_info=True)

Same for the pred_queue.put handler at line 1242.

github-actions · 2026-03-27T03:27:14Z

Code Review Summary

Good bugfix — wrapping the prediction loop in try/finally is the right pattern to prevent thread leaks on exception. The cleanup ordering (forward threads → write thread → writer close) correctly respects the pipeline topology, and using len(forward_t_list) instead of predict_threads for sentinel count properly handles partial initialization.

Issues to address

1. join() without timeout can hang indefinitely (inline comments posted)

If a forward thread is blocked on pred_queue.put() (not on data_queue.get()), the None sentinel won't unblock it. t.join() then blocks forever — up to N × 600s in the worst case. In a distributed job, this hangs all ranks. Add timeout=PREDICT_QUEUE_TIMEOUT to both join() calls.

2. predict_checkpoint() has the same bug but was not fixed

The function at line ~1412–1439 uses the same thread pattern (pred_queue / write_t) but still has bare assert write_t is not None / write_t.join() without try/finally. If this fix is warranted for predict(), the same pattern should be applied to predict_checkpoint().

Minor suggestions

Log instead of bare except Exception: pass in cleanup handlers (inline comment posted) — silent swallowing makes production hangs very hard to diagnose.
_forward_loop has no exception handling (pre-existing): if _forward() raises inside a worker thread, the thread dies silently and its sentinel goes unconsumed, which can cause other threads to exit prematurely. Consider a follow-up to add error propagation in thread bodies.

🤖 Generated with Claude Code

predict_checkpoint had the same vulnerability as predict: threads and writer not cleaned up on exception, plus assert on write_t that would crash if no steps ran. Apply the same try/finally pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-03T12:18:01Z

tzrec/main.py

+        for t in forward_t_list:
+            t.join()


Potential hang: t.join() has no timeout. If the sentinel put(None) above failed silently (caught by except Exception: pass), the worker thread never receives the stop signal and this join() blocks indefinitely. Same applies to write_t.join() on line 1251.

Consider adding a timeout and logging a warning:

Suggested change

         for t in forward_t_list:
-            t.join()
+            t.join(timeout=PREDICT_QUEUE_TIMEOUT)

And after each join:

            if t.is_alive():
                logger.warning("Forward thread did not exit within timeout.")

This bounds worst-case hang time and makes failures diagnosable. The same pattern should be applied to write_t.join() below and in predict_checkpoint (line 1454).
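
A minimal sketch of the bounded-join idea, assuming a hypothetical helper named join_with_timeout (not part of the PR). Note that Thread.join() always returns None, so the only way to tell whether the timeout expired is to check is_alive() afterwards.

```python
import logging
import threading

logger = logging.getLogger(__name__)


def join_with_timeout(threads, timeout):
    """Join each thread for at most `timeout` seconds; return those still alive."""
    stuck = []
    for t in threads:
        t.join(timeout=timeout)
        if t.is_alive():
            logger.warning("Thread %s did not exit within %.1fs.", t.name, timeout)
            stuck.append(t)
    return stuck


# Demo: one thread finishes at once, the other waits on an event nobody sets yet.
release = threading.Event()
quick = threading.Thread(target=lambda: None)
slow = threading.Thread(target=release.wait, daemon=True)
quick.start()
slow.start()
stuck = join_with_timeout([quick, slow], timeout=0.2)
release.set()  # let the slow thread finish after the demo
```

Here stuck contains only the slow thread, which is the signal an operator would need to start diagnosing a leak.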

github-actions · 2026-04-03T12:18:05Z

tzrec/main.py

+            try:
+                data_queue.put(None, timeout=PREDICT_QUEUE_TIMEOUT)
+            except Exception:
+                pass


Silent failure → deadlock risk: If the sentinel put times out (e.g. queue is full because a downstream thread crashed), the exception is silently swallowed. The subsequent t.join() then blocks forever since the thread never got the stop signal.

At minimum, log the failure so hangs are diagnosable:

Suggested change

             try:
                 data_queue.put(None, timeout=PREDICT_QUEUE_TIMEOUT)
             except Exception:
-                pass
+                logger.warning("Failed to send sentinel to data_queue during cleanup.", exc_info=True)

Same applies to the pred_queue sentinel put on lines 1247-1250 and in predict_checkpoint (lines 1449-1453).

github-actions · 2026-04-03T12:18:17Z

Code Review Summary

Overall: Good bugfix — wrapping prediction loops in try/finally to guarantee thread cleanup on exceptions is the right approach. The happy path is unaffected. Two issues in the error-path cleanup logic are worth addressing before merge.

Actionable Feedback

thread.join() without timeout can hang indefinitely — If the sentinel queue.put(None) fails (silently caught), the worker thread never exits, and the unbounded join() blocks the main thread forever. This undermines the PR's goal of clean error-path shutdown. Add timeout=PREDICT_QUEUE_TIMEOUT to all join() calls and log a warning if the thread is still alive. Applies to both predict and predict_checkpoint.
Bare except Exception: pass hides the root cause — The sentinel put failures are silently swallowed, making deadlock-on-error scenarios very hard to diagnose. Replace pass with logger.warning(...) so operators have a signal when things go wrong.

Non-blocking Observations (pre-existing, not introduced by this PR)

Worker threads (_forward_loop, _write_loop) have no internal exception handling. A crash in a worker silently kills the thread, leaving sibling threads and the main thread stuck on queue operations. Consider adding error propagation (e.g., shared error event) in a follow-up.
torch.cuda.synchronize() called every step in predict_checkpoint (line 1428) prevents GPU/CPU overlap. The other predict function avoids this by doing .to("cpu") on tensors instead.
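
The error-propagation follow-up suggested for the worker threads could look roughly like the sketch below. It is one possible shape, not the project's implementation: the error_event / errors pair, the run helper, and the short polling timeout are all assumptions chosen to keep the example small.

```python
import queue
import threading


def _forward_loop(data_queue, pred_queue, error_event, errors, forward_fn):
    # Hypothetical worker body: record any failure and tell siblings to stop.
    try:
        while not error_event.is_set():
            try:
                batch = data_queue.get(timeout=0.1)
            except queue.Empty:
                continue  # re-check error_event instead of blocking forever
            if batch is None:
                return
            pred_queue.put(forward_fn(batch))
    except Exception as exc:
        errors.append(exc)
        error_event.set()


def run(batches, forward_fn, num_threads=2):
    data_queue, pred_queue = queue.Queue(), queue.Queue()
    error_event, errors = threading.Event(), []
    threads = [
        threading.Thread(
            target=_forward_loop,
            args=(data_queue, pred_queue, error_event, errors, forward_fn),
        )
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for batch in batches:
        data_queue.put(batch)
    for _ in threads:
        data_queue.put(None)
    for t in threads:
        t.join(timeout=5.0)
    if errors:
        # Surface the worker failure in the main thread instead of losing it.
        raise RuntimeError("worker thread failed") from errors[0]
    preds = []
    while not pred_queue.empty():
        preds.append(pred_queue.get_nowait())
    return preds
```

With this shape a crash in one worker ends the sibling loops promptly and reaches the caller as a chained exception, rather than leaving other threads stuck on queue operations.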

🤖 Generated with Claude Code

If Thread.start() raises RuntimeError (e.g., OS can't create thread), write_t would be non-None but never started. In the finally cleanup, write_t.join() would raise RuntimeError, causing writer.close() to be skipped, leaking the writer resource.

Fix: assign write_t only after start() succeeds, matching the pattern already used for forward threads via forward_t_list.append().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
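
The failure mode behind this commit can be reproduced in isolation: calling join() on a thread that was never started raises RuntimeError. The sketch below shows that behavior and the "assign only after start()" ordering; start_thread is a hypothetical helper used only for illustration.

```python
import threading

# join() on a thread that was never started raises RuntimeError.
never_started = threading.Thread(target=lambda: None)
try:
    never_started.join()
    join_error = None
except RuntimeError as exc:
    join_error = exc


def start_thread(target):
    """Return a started thread; the caller's variable is set only on success."""
    t = threading.Thread(target=target)
    t.start()  # if this raises, the caller's variable keeps its previous value
    return t


write_t = None
try:
    write_t = start_thread(lambda: None)
finally:
    # Safe in cleanup: write_t is either None or a thread that really started.
    if write_t is not None:
        write_t.join(timeout=1.0)
```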

github-actions · 2026-04-07T03:05:03Z

tzrec/main.py

+            except Exception:
+                pass
+        for t in forward_t_list:
+            t.join()


join() without timeout can hang forever — If the sentinel put above fails (caught by except Exception: pass), the thread never sees its termination signal and join() blocks indefinitely. This creates a new deadlock path in the very code meant to fix deadlocks.

Consider adding a timeout and logging:

Suggested change

-            t.join()
+            t.join(timeout=PREDICT_QUEUE_TIMEOUT)
+            if t.is_alive():
+                logger.warning("Forward thread %s did not terminate in time.", t.name)

github-actions · 2026-04-07T03:05:05Z

tzrec/main.py

+                pred_queue.put((None, None), timeout=PREDICT_QUEUE_TIMEOUT)
+            except Exception:
+                pass
+            write_t.join()


Same concern here — write_t.join() without a timeout can hang if the sentinel didn't reach the write thread.

Suggested change

-            write_t.join()
+            write_t.join(timeout=PREDICT_QUEUE_TIMEOUT)
+            if write_t.is_alive():
+                logger.warning("Write thread did not terminate in time.")

github-actions · 2026-04-07T03:05:07Z

tzrec/main.py

+            except Exception:
+                pass


Bare except Exception: pass silently swallows the sentinel enqueue failure, giving zero signal that a thread will never see its termination signal. At minimum, log a warning so operators can diagnose hangs:

Suggested change

             except Exception:
-                pass
+                logger.warning("Failed to send sentinel to data_queue", exc_info=True)

github-actions · 2026-04-07T03:05:09Z

tzrec/main.py

+            except Exception:
+                pass


Same here — log instead of silently passing:

Suggested change

             except Exception:
-                pass
+                logger.warning("Failed to send sentinel to pred_queue", exc_info=True)

github-actions · 2026-04-07T03:05:11Z

tzrec/main.py

+                    pred_queue.put((None, None), timeout=PREDICT_QUEUE_TIMEOUT)
+                except Exception:
+                    pass
+                write_t.join()


Same two issues as the predict function: join() needs a timeout, and the except Exception: pass above should log.

Suggested change

-                write_t.join()
+                write_t.join(timeout=PREDICT_QUEUE_TIMEOUT)
+                if write_t.is_alive():
+                    logger.warning("Write thread did not terminate in time.")

github-actions · 2026-04-07T03:05:28Z

Code Review Summary

Overall: This is a well-motivated bugfix. The try/finally pattern is the right approach, and the t = Thread(...); t.start(); write_t = t change is a genuine correctness improvement that prevents referencing a thread that failed to start. The scope is focused and appropriate.

Key Finding: `join()` without timeout can still hang

The main concern is that thread.join() is called without a timeout in the finally blocks. If the sentinel put() fails (silently caught by except Exception: pass), the target thread never receives its termination signal, and join() blocks indefinitely — replacing one hang with another. Adding join(timeout=PREDICT_QUEUE_TIMEOUT) would make the cleanup path robust against this scenario.

Secondary: Silent exception swallowing

The except Exception: pass blocks around sentinel queue puts discard all diagnostic information. When these fail, something has gone seriously wrong (worker thread stuck or dead), and operators get zero signal. Logging a warning here would significantly improve debuggability in production.

See inline comments for specific suggestions with code.

What looks good

write_t = t only after t.start() succeeds — eliminates the risk of joining an unstarted thread
if write_t is not None replacing assert write_t is not None — assertions in cleanup paths are counterproductive
No performance regressions — the try/finally adds no overhead to the hot path
Consistent pattern applied to both predict and predict_checkpoint
Existing integration tests adequately cover the happy path; no new tests needed for this defensive fix

Replace silent except-pass with warning logs so cleanup failures are visible. Helps debugging when threads are stuck or queues are broken during error-path cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…eption writer.close() can fail in real scenarios (ODPS commit error, disk full during flush, network drop). When called unprotected in a finally block, such failures replace the original exception that triggered cleanup, making root-cause debugging harder. Wrap writer.close() in try/except with a warning log, consistent with the queue sentinel cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
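
To illustrate the masking problem this commit describes, the sketch below compares an unprotected and a protected close() inside finally. FlakyWriter and the two predict_* functions are hypothetical stand-ins, not tzrec code.

```python
import logging

logger = logging.getLogger(__name__)


class FlakyWriter:
    """Hypothetical writer whose close() fails, e.g. on a failed commit."""

    def close(self):
        raise OSError("commit failed")


def predict_unprotected(writer):
    try:
        raise ValueError("original prediction error")
    finally:
        writer.close()  # a failure here becomes the exception the caller sees


def predict_protected(writer):
    try:
        raise ValueError("original prediction error")
    finally:
        try:
            writer.close()
        except Exception:
            logger.warning("Failed to close writer during cleanup.", exc_info=True)


def raised_type(fn):
    """Return the type of the exception that escapes fn(FlakyWriter())."""
    try:
        fn(FlakyWriter())
    except Exception as exc:
        return type(exc)
    return None
```

In the unprotected version the caller sees OSError; Python still keeps the ValueError reachable through `__context__`, but code that catches or logs by type no longer sees the root cause first. In the protected version the original ValueError propagates and the close failure is only logged.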

github-actions · 2026-04-07T07:09:45Z

tzrec/main.py

+        for t in forward_t_list:
+            t.join()
+        if write_t is not None:
+            try:
+                pred_queue.put((None, None), timeout=PREDICT_QUEUE_TIMEOUT)
+            except Exception as e:
+                logger.warning(f"Failed to send sentinel to pred_queue: {e}")
+            write_t.join()


Potential indefinite hang if sentinel delivery fails.

When pred_queue.put((None, None)) fails (caught on line 1250), write_t.join() on line 1252 blocks until _write_loop gives up on its own: the loop is waiting on pred_queue.get(timeout=PREDICT_QUEUE_TIMEOUT) and only unblocks after the queue timeout (default 600s). The same applies to the forward thread joins on line 1246 if data_queue.put(None) fails.

With multiple forward threads, this could block cleanup for up to N_threads * 600s sequentially.

Consider adding a timeout to join() calls:

        for t in forward_t_list:
            t.join(timeout=PREDICT_QUEUE_TIMEOUT + 30)
            if t.is_alive():
                logger.warning("Forward thread did not exit within timeout")
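
A per-thread timeout still adds up when joins run sequentially: N stuck threads cost roughly N times the timeout. One alternative, which this PR does not use, is a single shared deadline for the whole cleanup. The sketch below assumes a hypothetical helper named join_all_before.

```python
import threading
import time


def join_all_before(threads, total_timeout):
    """Join all threads against one shared deadline; return those still alive."""
    deadline = time.monotonic() + total_timeout
    alive = []
    for t in threads:
        # Each join gets only the time left on the shared budget.
        remaining = max(0.0, deadline - time.monotonic())
        t.join(timeout=remaining)
        if t.is_alive():
            alive.append(t)
    return alive


# Demo: three threads that never exit on their own during the join.
release = threading.Event()
workers = [threading.Thread(target=release.wait, daemon=True) for _ in range(3)]
for w in workers:
    w.start()
started_at = time.monotonic()
still_alive = join_all_before(workers, total_timeout=0.3)
elapsed = time.monotonic() - started_at
release.set()  # let the demo threads finish
```

With three stuck threads and a 0.3s budget, the cleanup waits about 0.3s in total instead of about 0.9s.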

github-actions · 2026-04-07T07:09:47Z

tzrec/main.py

+            if write_t is not None:
+                try:
+                    pred_queue.put((None, None), timeout=PREDICT_QUEUE_TIMEOUT)
+                except Exception as e:
+                    logger.warning(f"Failed to send sentinel to pred_queue: {e}")
+                write_t.join()


Same concern here — write_t.join() on line 1459 has no timeout. If the sentinel put fails, the join blocks until _write_loop's queue get times out (default 600s).

github-actions · 2026-04-07T07:10:02Z

Code Review Summary

Good bugfix — the original code had a real resource-leak risk where threads and the writer would never be cleaned up if an exception occurred mid-prediction. The try/finally pattern is the right approach, and applying it consistently to both predict and predict_checkpoint is thorough.

What looks good

Assigning write_t = t only after t.start() succeeds — correct defensive pattern
Using len(forward_t_list) instead of range(predict_threads) for sentinel count — only sends sentinels for threads that actually started
Guarding writer.close() against masking the original exception

Issues to consider

1. join() calls without timeout can hang indefinitely (see inline comments)

If a sentinel put fails (caught and logged), the corresponding thread never receives its shutdown signal. thread.join() then blocks until the thread's own queue.get(timeout=600) times out. With multiple forward threads, cleanup could block for up to N_threads * 600s. Recommend adding timeout to all join() calls and logging a warning if threads are still alive.

2. No test coverage for the exception cleanup paths

The entire purpose of this PR — exception-path thread cleanup — has no test coverage. Existing tests only cover the happy path via subprocess-based integration tests. A unit test that mocks the model to raise mid-prediction and verifies all threads are joined and writer.close() is called would be the highest-value addition to prevent regression.
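
A unit test along the lines suggested above might look like the following sketch. Because it cannot import the real predict function here, it exercises a toy stand-in (toy_predict) with the same cleanup shape; that function, _write_loop, and the started list are assumptions for illustration only.

```python
import queue
import threading
from unittest import mock

TIMEOUT = 1.0  # stand-in for PREDICT_QUEUE_TIMEOUT


def _write_loop(pred_queue, writer):
    # Hypothetical write thread: forward items to the writer until the sentinel.
    while True:
        item = pred_queue.get(timeout=TIMEOUT)
        if item is None:
            return
        writer.write(item)


def toy_predict(model, batches, writer, started):
    pred_queue = queue.Queue()
    write_t = None
    try:
        t = threading.Thread(target=_write_loop, args=(pred_queue, writer))
        t.start()
        write_t = t  # assigned only after start() succeeded
        started.append(t)
        for batch in batches:
            pred_queue.put(model(batch))
    finally:
        if write_t is not None:
            pred_queue.put(None)
            write_t.join(timeout=TIMEOUT)
        writer.close()


def test_cleanup_on_exception():
    # The mocked model returns one prediction, then raises on the second batch.
    model = mock.Mock(side_effect=[10, RuntimeError("boom")])
    writer = mock.Mock()
    started = []
    try:
        toy_predict(model, [1, 2], writer, started)
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected the model error to propagate")
    assert all(not t.is_alive() for t in started)
    writer.write.assert_called_once_with(10)
    writer.close.assert_called_once_with()
```

The same assertions could be pointed at the real functions by patching the model and writer, assuming their construction can be mocked in the project's test setup.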

🤖 Generated with Claude Code

…angs

Per PR review feedback: t.join() and write_t.join() without a timeout can hang the cleanup path indefinitely if:
- The sentinel put failed (caught and logged), so the worker never sees its termination signal
- The worker is stuck in a non-timed operation (CUDA op, network call, blocking writer flush)

In a distributed job, hanging cleanup hangs all ranks. Add bounded join(timeout=PREDICT_QUEUE_TIMEOUT) and log a warning if the thread is still alive after the timeout, so operators can diagnose the leak.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tiankongdeguiji and others added 2 commits March 22, 2026 10:56

Merge branch 'master' into fix/predict-thread-cleanup

e7cd75f

tiankongdeguiji added the claude-review Let Claude Review label Mar 27, 2026

github-actions bot removed the claude-review Let Claude Review label Mar 27, 2026

github-actions bot reviewed Mar 27, 2026

tiankongdeguiji and others added 2 commits April 3, 2026 10:28

Merge branch 'master' into fix/predict-thread-cleanup

40cc18b

tiankongdeguiji added the claude-review Let Claude Review label Apr 3, 2026

github-actions bot removed the claude-review Let Claude Review label Apr 3, 2026

github-actions bot reviewed Apr 3, 2026

tiankongdeguiji added the claude-review Let Claude Review label Apr 7, 2026

github-actions bot removed the claude-review Let Claude Review label Apr 7, 2026

github-actions bot reviewed Apr 7, 2026

tiankongdeguiji and others added 2 commits April 7, 2026 11:56

tiankongdeguiji added the claude-review Let Claude Review label Apr 7, 2026

github-actions bot removed the claude-review Let Claude Review label Apr 7, 2026

github-actions bot reviewed Apr 7, 2026

eric-gecheng approved these changes Apr 7, 2026

tiankongdeguiji merged commit ca6d308 into alibaba:master Apr 7, 2026
6 checks passed

-            t.join()
+            t.join(timeout=PREDICT_QUEUE_TIMEOUT)
+            if t.is_alive():
+                logger.warning("Forward thread %s did not terminate in time.", t.name)

-            write_t.join()
+            write_t.join(timeout=PREDICT_QUEUE_TIMEOUT)
+            if write_t.is_alive():
+                logger.warning("Write thread did not terminate in time.")
