Skip to content

Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL (#20267)#20267

Open
GregoryComer wants to merge 1 commit into
pytorch:mainfrom
GregoryComer:export-D108226589
Open

Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL (#20267)#20267
GregoryComer wants to merge 1 commit into
pytorch:mainfrom
GregoryComer:export-D108226589

Conversation

@GregoryComer

@GregoryComer GregoryComer commented Jun 13, 2026

Copy link
Copy Markdown
Member

Summary:

Fix three issues with the local patching needed for pthreadpool, plus the ExecuTorch SDPA op that runs on it:

Human tl;dr:

  • There are a few small bugs with the number of threads patch we apply to pthreadpool that weren't caught in the initial land. Note that these are mainly in paths not used until we update XNNPACK.
  • It also shifts the behavior such that the subset of threads used isn't stable between jobs. Trying to restore this with the new algorithm led to a few % latency regression on llama, so I decided to just fix the downstream issue - ET SDPA does nested parallelism of SDPA via pthreadpool then with OMP inside MKL BLAS (linked only on Linux currently). This leads to many thousands of threads, though perf is fine, is just times out when cleaning up. Since we already parallelize the SDPA GEMMs ourself, I've updated to not parallelize inside MKL/OMP.

[Claude Summary]
(1) Dynamic work-stealing corrupts output when a job runs with a thread subset. The dynamic thread functions in portable-api.c reach into peer threads' ranges via threads[(num_threads + thread_number - tid) % num_threads]. When the pool is created with max_num_threads (e.g. hardware concurrency) but a parallelize runs with a smaller num_threads selected via pthreadpool_set_num_threads_to_use (as caffe2/ATen does for mobile inference), every physical worker with thread_number >= num_threads is aliased by the % num_threads indexing onto an active thread's range, and in the tid == 0 branch re-reads that range's range_start and re-processes its tiles from the front. The result is front tiles processed multiple times and back tiles skipped entirely: nondeterministic, in-bounds data corruption that is invisible to ASAN and TSAN. The static (non-dynamic) thread functions are unaffected because each thread only ever processes its own threads[thread_number] range and out-of-subset threads are handed empty sentinel ranges. Fix: in all 17 dynamic thread functions, return early when thread_number >= num_threads, before the stealing loop. This is the dynamic-path equivalent of the static path's empty-sentinel behavior.

(2) Executor-borrowed threads leak num_active_threads_mutex and deadlock the pool. In condvar builds (PTHREADPOOL_USE_FUTEX=0, the Linux/Android/wasm configuration), wait_on_num_active_threads locks num_active_threads_mutex while the pool is idle, then an executor-borrowed worker returns PTHREADPOOL_NUM_ACTIVE_THREADS_DONE from inside the wait loop without releasing it. The orphaned lock then blocks the main thread's signal_num_active_threads (inside pthreadpool_parallelize) and every other worker entering wait_on_num_active_threads, hanging the pool. This only affects pools created via pthreadpool_create_v2 with a real executor; the classic pthreadpool_create path never takes the branch. Fix: handle executor-borrowed threads in a noinline cold helper (return_thread_to_executor) reached before the lock is taken, so there is no orphaned lock to leak. Keeping it out-of-line also leaves the own-threads wait loop byte-identical: that spin/sleep coordination is sensitive to codegen perturbation, and releasing the lock inline at the early return shifted the loop's codegen enough to regress decode throughput.

(3) The ExecuTorch SDPA nests OpenMP-threaded MKL under the threadpool, which deadlocks at process teardown. cpu_flash_attention (op_sdpa_impl.h) parallelizes over query blocks via the threadpool, and each block calls cpublas::gemm -> sgemm_. When the optimized BLAS is OpenMP MKL (the libblas variant, fbsource//third-party/mkl:mkl_lp64_omp), each per-block gemm enters a nested MKL/OpenMP region, so the pthreadpool worker that ran the block is registered by libomp as a "root" thread for the rest of its life. On a 96-core host this turned ~40 of the ~63 workers into roots (~3562 live threads), and at process exit the concurrent root teardown deadlocked on libomp's global __kmp_forkjoin_lock while reaping hidden-helper condvars -- surfacing as sgr_llm_tests LlmTest.TestTextPrefill intermittently FATAL/TIMEOUT under tpx (T275129576). Fix: serialize the SDPA's per-block gemm so it never spawns a nested team. The blocks are already threadpool-parallel, so the inner gemm should run single-threaded -- this is the correct nesting model, not a workaround. The optimized BLAS library compiles with -DET_CPUBLAS_MKL_OMP exactly when it links OpenMP MKL (lib_defs.bzl), gating a SingleThreadedGemmGuard -- a thread-local mkl_set_num_threads_local(1) for its scope -- constructed at the top of cpu_flash_attention's per-block lambda. On any other BLAS backend the guard compiles to a no-op and emits no MKL symbol reference, and only the SDPA is affected: the matmul ops (op_bmm/op_mm/op_linear) keep using threaded MKL. This removes the SDPA's nested OpenMP teams, so the rotating-worker root pileup (~40 of ~63 workers on the 96-core host at baseline) no longer forms; LlmTest.TestTextPrefill passes 20/20 under stress with the pthreadpool work-stealing left completely stock (no participation or scheduling change).

Note: (1) and (2) are local fixes to vendored third-party pthreadpool and should also go upstream to google/pthreadpool; (3) is in ExecuTorch and should go upstream to pytorch/executorch.

Reviewed By: jessiezheng123, shoumikhin

Differential Revision: D108226589

@pytorch-bot

pytorch-bot Bot commented Jun 13, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20267

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 13, 2026
@meta-codesync

meta-codesync Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

@GregoryComer has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108226589.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…orch SDPA under OpenMP MKL (pytorch#20267)

Summary:

Fix three issues with the local patching needed for pthreadpool, plus the ExecuTorch SDPA op that runs on it:

Human tl;dr:
* There are a few small bugs with the number of threads patch we apply to pthreadpool that weren't caught in the initial land.
* It also shifts the behavior such that the subset of threads used isn't stable between jobs. Trying to restore this with the new algorithm led to a few % latency regression on llama, so I decided to just fix the downstream issue - ET SDPA does nested parallelism of SDPA via pthreadpool then with OMP inside MKL BLAS (linked only on Linux currently). This leads to many thousands of threads, though perf is fine, is just times out when cleaning up. Since we already parallelize the SDPA GEMMs ourself, I've updated to not parallelize inside MKL/OMP.

[Claude Summary]
(1) Dynamic work-stealing corrupts output when a job runs with a thread subset. The dynamic thread functions in `portable-api.c` reach into peer threads' ranges via `threads[(num_threads + thread_number - tid) % num_threads]`. When the pool is created with `max_num_threads` (e.g. hardware concurrency) but a parallelize runs with a smaller `num_threads` selected via `pthreadpool_set_num_threads_to_use` (as caffe2/ATen does for mobile inference), every physical worker with `thread_number >= num_threads` is aliased by the `% num_threads` indexing onto an active thread's range, and in the `tid == 0` branch re-reads that range's `range_start` and re-processes its tiles from the front. The result is front tiles processed multiple times and back tiles skipped entirely: nondeterministic, in-bounds data corruption that is invisible to ASAN and TSAN. The static (non-dynamic) thread functions are unaffected because each thread only ever processes its own `threads[thread_number]` range and out-of-subset threads are handed empty sentinel ranges. Fix: in all 17 dynamic thread functions, return early when `thread_number >= num_threads`, before the stealing loop. This is the dynamic-path equivalent of the static path's empty-sentinel behavior.

(2) Executor-borrowed threads leak `num_active_threads_mutex` and deadlock the pool. In condvar builds (`PTHREADPOOL_USE_FUTEX=0`, the Linux/Android/wasm configuration), `wait_on_num_active_threads` locks `num_active_threads_mutex` while the pool is idle, then an executor-borrowed worker returns `PTHREADPOOL_NUM_ACTIVE_THREADS_DONE` from inside the wait loop without releasing it. The orphaned lock then blocks the main thread's `signal_num_active_threads` (inside `pthreadpool_parallelize`) and every other worker entering `wait_on_num_active_threads`, hanging the pool. This only affects pools created via `pthreadpool_create_v2` with a real executor; the classic `pthreadpool_create` path never takes the branch. Fix: handle executor-borrowed threads in a `noinline cold` helper (`return_thread_to_executor`) reached before the lock is taken, so there is no orphaned lock to leak. Keeping it out-of-line also leaves the own-threads wait loop byte-identical: that spin/sleep coordination is sensitive to codegen perturbation, and releasing the lock inline at the early return shifted the loop's codegen enough to regress decode throughput.

(3) The ExecuTorch SDPA nests OpenMP-threaded MKL under the threadpool, which deadlocks at process teardown. `cpu_flash_attention` (`op_sdpa_impl.h`) parallelizes over query blocks via the threadpool, and each block calls `cpublas::gemm` -> `sgemm_`. When the optimized BLAS is OpenMP MKL (the `libblas` variant, `fbsource//third-party/mkl:mkl_lp64_omp`), each per-block gemm enters a nested MKL/OpenMP region, so the pthreadpool worker that ran the block is registered by libomp as a "root" thread for the rest of its life. On a 96-core host this turned ~40 of the ~63 workers into roots (~3562 live threads), and at process exit the concurrent root teardown deadlocked on libomp's global `__kmp_forkjoin_lock` while reaping hidden-helper condvars -- surfacing as `sgr_llm_tests` `LlmTest.TestTextPrefill` intermittently FATAL/TIMEOUT under tpx (T275129576). Fix: serialize the SDPA's per-block gemm so it never spawns a nested team. The blocks are already threadpool-parallel, so the inner gemm should run single-threaded -- this is the correct nesting model, not a workaround. The optimized BLAS library compiles with `-DET_CPUBLAS_MKL_OMP` exactly when it links OpenMP MKL (`lib_defs.bzl`), gating a `SingleThreadedGemmGuard` -- a thread-local `mkl_set_num_threads_local(1)` for its scope -- constructed at the top of `cpu_flash_attention`'s per-block lambda. On any other BLAS backend the guard compiles to a no-op and emits no MKL symbol reference, and only the SDPA is affected: the matmul ops (`op_bmm`/`op_mm`/`op_linear`) keep using threaded MKL. This removes the SDPA's nested OpenMP teams, so the rotating-worker root pileup (~40 of ~63 workers on the 96-core host at baseline) no longer forms; `LlmTest.TestTextPrefill` passes 20/20 under stress with the pthreadpool work-stealing left completely stock (no participation or scheduling change).

Note: (1) and (2) are local fixes to vendored third-party pthreadpool and should also go upstream to google/pthreadpool; (3) is in ExecuTorch and should go upstream to pytorch/executorch.

Reviewed By: jessiezheng123, shoumikhin

Differential Revision: D108226589
@meta-codesync meta-codesync Bot changed the title Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL (#20267) Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant