Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL (#20267) by GregoryComer · Pull Request #20267 · pytorch/executorch

GregoryComer · 2026-06-13T01:45:15Z

Summary:

Fix three issues with the local patching needed for pthreadpool, plus the ExecuTorch SDPA op that runs on it:

Human tl;dr:

There are a few small bugs with the number of threads patch we apply to pthreadpool that weren't caught in the initial land. Note that these are mainly in paths not used until we update XNNPACK.
It also shifts the behavior such that the subset of threads used isn't stable between jobs. Trying to restore this with the new algorithm led to a few % latency regression on llama, so I decided to just fix the downstream issue - ET SDPA does nested parallelism of SDPA via pthreadpool then with OMP inside MKL BLAS (linked only on Linux currently). This leads to many thousands of threads, though perf is fine, is just times out when cleaning up. Since we already parallelize the SDPA GEMMs ourself, I've updated to not parallelize inside MKL/OMP.

[Claude Summary]
(1) Dynamic work-stealing corrupts output when a job runs with a thread subset. The dynamic thread functions in portable-api.c reach into peer threads' ranges via threads[(num_threads + thread_number - tid) % num_threads]. When the pool is created with max_num_threads (e.g. hardware concurrency) but a parallelize runs with a smaller num_threads selected via pthreadpool_set_num_threads_to_use (as caffe2/ATen does for mobile inference), every physical worker with thread_number >= num_threads is aliased by the % num_threads indexing onto an active thread's range, and in the tid == 0 branch re-reads that range's range_start and re-processes its tiles from the front. The result is front tiles processed multiple times and back tiles skipped entirely: nondeterministic, in-bounds data corruption that is invisible to ASAN and TSAN. The static (non-dynamic) thread functions are unaffected because each thread only ever processes its own threads[thread_number] range and out-of-subset threads are handed empty sentinel ranges. Fix: in all 17 dynamic thread functions, return early when thread_number >= num_threads, before the stealing loop. This is the dynamic-path equivalent of the static path's empty-sentinel behavior.

(2) Executor-borrowed threads leak num_active_threads_mutex and deadlock the pool. In condvar builds (PTHREADPOOL_USE_FUTEX=0, the Linux/Android/wasm configuration), wait_on_num_active_threads locks num_active_threads_mutex while the pool is idle, then an executor-borrowed worker returns PTHREADPOOL_NUM_ACTIVE_THREADS_DONE from inside the wait loop without releasing it. The orphaned lock then blocks the main thread's signal_num_active_threads (inside pthreadpool_parallelize) and every other worker entering wait_on_num_active_threads, hanging the pool. This only affects pools created via pthreadpool_create_v2 with a real executor; the classic pthreadpool_create path never takes the branch. Fix: handle executor-borrowed threads in a noinline cold helper (return_thread_to_executor) reached before the lock is taken, so there is no orphaned lock to leak. Keeping it out-of-line also leaves the own-threads wait loop byte-identical: that spin/sleep coordination is sensitive to codegen perturbation, and releasing the lock inline at the early return shifted the loop's codegen enough to regress decode throughput.

(3) The ExecuTorch SDPA nests OpenMP-threaded MKL under the threadpool, which deadlocks at process teardown. cpu_flash_attention (op_sdpa_impl.h) parallelizes over query blocks via the threadpool, and each block calls cpublas::gemm -> sgemm_. When the optimized BLAS is OpenMP MKL (the libblas variant, fbsource//third-party/mkl:mkl_lp64_omp), each per-block gemm enters a nested MKL/OpenMP region, so the pthreadpool worker that ran the block is registered by libomp as a "root" thread for the rest of its life. On a 96-core host this turned ~40 of the ~63 workers into roots (~3562 live threads), and at process exit the concurrent root teardown deadlocked on libomp's global __kmp_forkjoin_lock while reaping hidden-helper condvars -- surfacing as sgr_llm_tests LlmTest.TestTextPrefill intermittently FATAL/TIMEOUT under tpx (T275129576). Fix: serialize the SDPA's per-block gemm so it never spawns a nested team. The blocks are already threadpool-parallel, so the inner gemm should run single-threaded -- this is the correct nesting model, not a workaround. The optimized BLAS library compiles with -DET_CPUBLAS_MKL_OMP exactly when it links OpenMP MKL (lib_defs.bzl), gating a SingleThreadedGemmGuard -- a thread-local mkl_set_num_threads_local(1) for its scope -- constructed at the top of cpu_flash_attention's per-block lambda. On any other BLAS backend the guard compiles to a no-op and emits no MKL symbol reference, and only the SDPA is affected: the matmul ops (op_bmm/op_mm/op_linear) keep using threaded MKL. This removes the SDPA's nested OpenMP teams, so the rotating-worker root pileup (~40 of ~63 workers on the 96-core host at baseline) no longer forms; LlmTest.TestTextPrefill passes 20/20 under stress with the pthreadpool work-stealing left completely stock (no participation or scheduling change).

Note: (1) and (2) are local fixes to vendored third-party pthreadpool and should also go upstream to google/pthreadpool; (3) is in ExecuTorch and should go upstream to pytorch/executorch.

Reviewed By: jessiezheng123, shoumikhin

Differential Revision: D108226589

pytorch-bot · 2026-06-13T01:45:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20267

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-06-13T01:45:26Z

@GregoryComer has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108226589.

github-actions · 2026-06-13T01:46:08Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…orch SDPA under OpenMP MKL (pytorch#20267) Summary: Fix three issues with the local patching needed for pthreadpool, plus the ExecuTorch SDPA op that runs on it: Human tl;dr: * There are a few small bugs with the number of threads patch we apply to pthreadpool that weren't caught in the initial land. * It also shifts the behavior such that the subset of threads used isn't stable between jobs. Trying to restore this with the new algorithm led to a few % latency regression on llama, so I decided to just fix the downstream issue - ET SDPA does nested parallelism of SDPA via pthreadpool then with OMP inside MKL BLAS (linked only on Linux currently). This leads to many thousands of threads, though perf is fine, is just times out when cleaning up. Since we already parallelize the SDPA GEMMs ourself, I've updated to not parallelize inside MKL/OMP. [Claude Summary] (1) Dynamic work-stealing corrupts output when a job runs with a thread subset. The dynamic thread functions in `portable-api.c` reach into peer threads' ranges via `threads[(num_threads + thread_number - tid) % num_threads]`. When the pool is created with `max_num_threads` (e.g. hardware concurrency) but a parallelize runs with a smaller `num_threads` selected via `pthreadpool_set_num_threads_to_use` (as caffe2/ATen does for mobile inference), every physical worker with `thread_number >= num_threads` is aliased by the `% num_threads` indexing onto an active thread's range, and in the `tid == 0` branch re-reads that range's `range_start` and re-processes its tiles from the front. The result is front tiles processed multiple times and back tiles skipped entirely: nondeterministic, in-bounds data corruption that is invisible to ASAN and TSAN. The static (non-dynamic) thread functions are unaffected because each thread only ever processes its own `threads[thread_number]` range and out-of-subset threads are handed empty sentinel ranges. Fix: in all 17 dynamic thread functions, return early when `thread_number >= num_threads`, before the stealing loop. This is the dynamic-path equivalent of the static path's empty-sentinel behavior. (2) Executor-borrowed threads leak `num_active_threads_mutex` and deadlock the pool. In condvar builds (`PTHREADPOOL_USE_FUTEX=0`, the Linux/Android/wasm configuration), `wait_on_num_active_threads` locks `num_active_threads_mutex` while the pool is idle, then an executor-borrowed worker returns `PTHREADPOOL_NUM_ACTIVE_THREADS_DONE` from inside the wait loop without releasing it. The orphaned lock then blocks the main thread's `signal_num_active_threads` (inside `pthreadpool_parallelize`) and every other worker entering `wait_on_num_active_threads`, hanging the pool. This only affects pools created via `pthreadpool_create_v2` with a real executor; the classic `pthreadpool_create` path never takes the branch. Fix: handle executor-borrowed threads in a `noinline cold` helper (`return_thread_to_executor`) reached before the lock is taken, so there is no orphaned lock to leak. Keeping it out-of-line also leaves the own-threads wait loop byte-identical: that spin/sleep coordination is sensitive to codegen perturbation, and releasing the lock inline at the early return shifted the loop's codegen enough to regress decode throughput. (3) The ExecuTorch SDPA nests OpenMP-threaded MKL under the threadpool, which deadlocks at process teardown. `cpu_flash_attention` (`op_sdpa_impl.h`) parallelizes over query blocks via the threadpool, and each block calls `cpublas::gemm` -> `sgemm_`. When the optimized BLAS is OpenMP MKL (the `libblas` variant, `fbsource//third-party/mkl:mkl_lp64_omp`), each per-block gemm enters a nested MKL/OpenMP region, so the pthreadpool worker that ran the block is registered by libomp as a "root" thread for the rest of its life. On a 96-core host this turned ~40 of the ~63 workers into roots (~3562 live threads), and at process exit the concurrent root teardown deadlocked on libomp's global `__kmp_forkjoin_lock` while reaping hidden-helper condvars -- surfacing as `sgr_llm_tests` `LlmTest.TestTextPrefill` intermittently FATAL/TIMEOUT under tpx (T275129576). Fix: serialize the SDPA's per-block gemm so it never spawns a nested team. The blocks are already threadpool-parallel, so the inner gemm should run single-threaded -- this is the correct nesting model, not a workaround. The optimized BLAS library compiles with `-DET_CPUBLAS_MKL_OMP` exactly when it links OpenMP MKL (`lib_defs.bzl`), gating a `SingleThreadedGemmGuard` -- a thread-local `mkl_set_num_threads_local(1)` for its scope -- constructed at the top of `cpu_flash_attention`'s per-block lambda. On any other BLAS backend the guard compiles to a no-op and emits no MKL symbol reference, and only the SDPA is affected: the matmul ops (`op_bmm`/`op_mm`/`op_linear`) keep using threaded MKL. This removes the SDPA's nested OpenMP teams, so the rotating-worker root pileup (~40 of ~63 workers on the 96-core host at baseline) no longer forms; `LlmTest.TestTextPrefill` passes 20/20 under stress with the pthreadpool work-stealing left completely stock (no participation or scheduling change). Note: (1) and (2) are local fixes to vendored third-party pthreadpool and should also go upstream to google/pthreadpool; (3) is in ExecuTorch and should go upstream to pytorch/executorch. Reviewed By: jessiezheng123, shoumikhin Differential Revision: D108226589

GregoryComer requested review from larryliu0820, manuelcandales and mergennachin as code owners June 13, 2026 01:45

GregoryComer had a problem deploying to cadence June 13, 2026 01:45 — with GitHub Actions Failure

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 13, 2026

meta-codesync Bot added the meta-exported label Jun 13, 2026

meta-codesync Bot changed the title ~~Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL~~ Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL (#20267) Jun 15, 2026

GregoryComer force-pushed the export-D108226589 branch from 0f67e68 to 4728a51 Compare June 15, 2026 21:17

GregoryComer had a problem deploying to cadence June 15, 2026 21:17 — with GitHub Actions Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL (#20267)#20267

Fix pthreadpool subset corruption + executor deadlock; de-nest ExecuTorch SDPA under OpenMP MKL (#20267)#20267
GregoryComer wants to merge 1 commit into
pytorch:mainfrom
GregoryComer:export-D108226589

GregoryComer commented Jun 13, 2026 •

edited

Loading

Uh oh!

pytorch-bot Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

meta-codesync Bot commented Jun 13, 2026

Uh oh!

github-actions Bot commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

GregoryComer commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20267

Uh oh!

meta-codesync Bot commented Jun 13, 2026

Uh oh!

github-actions Bot commented Jun 13, 2026

This PR needs a release notes: label

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

GregoryComer commented Jun 13, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 13, 2026 •

edited

Loading

This PR needs a `release notes:` label