
[WIP] Persistent + grouped tile MoE batched GEMM #19219

Draft
Gasoonjia wants to merge 1 commit into hoist-activation-quant from persistent-on-hoist

Conversation

@Gasoonjia
Contributor

No description provided.

…-quant

Rewrites the four batched MoE GEMM kernels (BF16/INT8 GEMM1 + GEMM2) from
one-CTA-per-tile to a persistent grid: launch min(NUM_SMS, num_tiles)
programs, each of which loops over its share of (expert_block, n_block)
tiles via `tl.range(start_pid, num_tiles, NUM_SMS)`.
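
A minimal sketch of that launch pattern (the signature, the
`num_m_blocks`/`num_n_blocks` names, and the elided body are illustrative,
not the actual ExecuTorch kernel):

```python
import triton
import triton.language as tl


@triton.jit
def _persistent_gemm_sketch(  # illustrative signature, not the real kernel
    num_m_blocks,
    num_n_blocks,
    NUM_SMS: tl.constexpr,
):
    start_pid = tl.program_id(axis=0)
    num_tiles = num_m_blocks * num_n_blocks
    # Each of the min(NUM_SMS, num_tiles) launched programs strides over the
    # whole tile space instead of owning exactly one tile.
    for tile_id in tl.range(start_pid, num_tiles, NUM_SMS):
        pid_m = tile_id // num_n_blocks
        pid_n = tile_id % num_n_blocks
        # ... load the A/B blocks for (pid_m, pid_n), run the K-loop,
        # store the C tile ...


# Host side: launch at most NUM_SMS programs.
# grid = (min(NUM_SMS, num_tiles),)
```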

Tiles are visited in column-major-within-group order (Triton tutorial style)
so consecutive M-blocks of the same expert reuse B[expert, n_block, *] via L2.
Since moe_align_block_size already sorts blocks by expert, this gives clean
weight reuse across the group.
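
The tile_id -> (pid_m, pid_n) remap follows the grouped ordering from the
Triton matmul tutorial; a sketch (variable names are the tutorial's, not
necessarily the kernel's):

```python
# Inside the persistent loop: remap the linear tile_id so that up to
# GROUP_SIZE_M consecutive M-blocks share the same pid_n (column-major
# within a group), keeping the B[expert, n_block, *] tile hot in L2.
num_pid_in_group = GROUP_SIZE_M * num_pid_n
group_id = tile_id // num_pid_in_group
first_pid_m = group_id * GROUP_SIZE_M
group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
pid_m = first_pid_m + ((tile_id % num_pid_in_group) % group_size_m)
pid_n = (tile_id % num_pid_in_group) // group_size_m
```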

Adds NUM_SMS (constexpr) and GROUP_SIZE_M (autotuned over {8, 16, 32}) to
all four batched kernels and their wrappers. NUM_SMS is queried once per
device via _num_sms() with lru_cache.
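
A sketch of the SM query and the autotune axis; the `_num_sms` name and the
{8, 16, 32} values are from this PR, but the body below and the rest of each
config are assumptions:

```python
from functools import lru_cache

import torch
import triton


@lru_cache(maxsize=None)
def _num_sms(device_index: int = 0) -> int:
    # Cached so the device property query happens once per device.
    return torch.cuda.get_device_properties(device_index).multi_processor_count


# GROUP_SIZE_M joins the autotune space; num_warps/num_stages are illustrative.
_group_m_configs = [
    triton.Config({"GROUP_SIZE_M": g}, num_warps=4, num_stages=3)
    for g in (8, 16, 32)
]
```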

Benchmark on A100 80GB, prefill=1642, decode=512, --cuda_graph
(persistent_on_hoist vs hoist baseline):

  hoist baseline:       6171 prefill tok/s (5941-6313),   99.0 decode tok/s
  persistent + grouped: 6127 prefill tok/s (6037-6315),   84.5 decode tok/s

Best-case prefill is essentially tied with the hoist baseline at this
prefill length on A100 (108 SMs). The hoist optimization had already
shortened the K-loop by removing per-tile activation quantization, so the
persistent kernel's L2 reuse gain doesn't show on top of it for M=1642.
The optimization should matter more for short prefills (M <= 512) and on
GPUs with fewer SMs / bigger L2 (e.g. RTX 4090 with 128 SMs and 72MB L2)
where wave quantization is more severe.

Behavior preserved: kernels are mathematically identical to the hoist
baseline; only the tile traversal order and grid launch shape changed.
@pytorch-bot

pytorch-bot Bot commented Apr 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19219

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 3 Cancelled Jobs, 3 Pending, 1 Unrelated Failure

As of commit a67881e with merge base cb4e5ae:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 30, 2026
@Gasoonjia Gasoonjia force-pushed the persistent-on-hoist branch from 015476d to a67881e Compare April 30, 2026 08:24
