Hoist W4A8 activation quantization out of GEMM K-loop #19209
Open
Gasoonjia wants to merge 7 commits into gh/digantdesai/53/base from
Conversation
…Qwen3.5 MoE runner" Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19209
Note: Links to docs will display an error until the docs builds have been completed.
❌ 9 New Failures, 2 Cancelled Jobs, 1 Pending, 2 Unrelated Failures as of commit d936717 with merge base cb4e5ae. The broken-trunk failures were already present on the merge base; 👉 rebase onto the `viable/strict` branch to avoid them. Cancelled jobs should be retried.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
385bf6d to 793be8d (Compare)
Add dedicated _quantize_activations_int8_kernel and _silu_quantize_int8_kernel that pre-quantize activations to INT8 with per-row-per-tile FP32 scales before GEMM1 and GEMM2 respectively. The existing _fused_moe_batched_int8_kernel and _fused_moe_silu_batched_int8_kernel are rewritten to consume pre-quantized activations + scales, eliminating ~256 redundant tl.max reductions per program (cdiv(K, BLOCK_K) tiles * BLOCK_M rows) and halving activation HBM bandwidth in the K-loop (bf16 -> int8). BLOCK_SIZE_K is fixed at PREQUANT_BLOCK_K (= 128) so per-tile activation scales align with the GEMM K-loop.

Correctness: 7/7 microbenchmark configs pass with rel diff <1.5% vs BF16 ref.

End-to-end (Qwen3.5 MoE 1600 prefill + 512 decode, --cuda_graph, A100): prefill 5727 -> 6171 tok/s (+7.7%), decode 92.6 -> 99.0 tok/s (+6.9%).
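For reference, a minimal PyTorch sketch of what the hoisted pre-quantization step computes (not the PR's Triton kernel; the helper name and the 1e-8 clamp are illustrative): per-row-per-tile symmetric INT8 quantization with FP32 scales, i.e. one absmax reduction per (row, K-tile) done once up front.

```python
import torch

PREQUANT_BLOCK_K = 128  # K-tile width; the GEMM's BLOCK_SIZE_K is fixed to this

def prequantize_int8(a: torch.Tensor):
    """a: [M, K] bf16/fp32 activations, K divisible by PREQUANT_BLOCK_K.
    Returns (a_int8 [M, K], scales [M, K // PREQUANT_BLOCK_K] fp32)."""
    M, K = a.shape
    tiles = a.float().view(M, K // PREQUANT_BLOCK_K, PREQUANT_BLOCK_K)
    # One absmax reduction per (row, K-tile), instead of inside every
    # GEMM program's K-loop.
    absmax = tiles.abs().amax(dim=-1, keepdim=True)       # [M, K/128, 1]
    scales = (absmax / 127.0).clamp_min(1e-8)              # avoid divide-by-zero
    a_int8 = torch.round(tiles / scales).clamp(-127, 127).to(torch.int8)
    return a_int8.view(M, K), scales.squeeze(-1)
```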
793be8d to d936717 (Compare)
Base automatically changed from gh/digantdesai/53/head to gh/digantdesai/53/base on April 30, 2026 15:05
Context
The original K-loop did `tl.max(tl.abs(a))` + INT8 cast on every tile (16 tiles × 16 rows = 256 reductions per program). Hoisting eliminates this redundant work and halves activation HBM bandwidth in the GEMM (bf16 → int8).

Improvement
Pre-quantize activations to INT8 once into a dedicated buffer (with per-row-per-tile FP32 scales) before the W4A8 batched MoE GEMMs, instead of re-quantizing inside the K-loop on every tile. An illustrative sketch of the resulting K-loop follows.
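To make the before/after concrete, here is a plain-Python sketch of a GEMM K-loop that consumes pre-quantized activations plus scales; it is a simplification of the Triton kernels (int8 weights stand in for the packed int4 W4A8 weights, and the weight-scale layout is an assumption):

```python
import torch

PREQUANT_BLOCK_K = 128  # must match the activation pre-quantization tile width

def gemm_prequantized(a_int8, a_scales, w_int8, w_scales):
    """a_int8: [M, K] int8 and a_scales: [M, K/128] fp32 from the hoisted step;
    w_int8: [K, N] int8 stand-in for packed int4 weights, w_scales: [K/128, N] fp32."""
    M, K = a_int8.shape
    N = w_int8.shape[1]
    acc = torch.zeros(M, N, dtype=torch.float32)
    for t in range(K // PREQUANT_BLOCK_K):  # GEMM K-loop
        ks = slice(t * PREQUANT_BLOCK_K, (t + 1) * PREQUANT_BLOCK_K)
        # Hoisted version: no abs/max reduction or INT8 cast here, only int8
        # loads, the dot product, and a rescale by precomputed per-tile scales.
        part = a_int8[:, ks].float() @ w_int8[ks, :].float()
        acc += part * a_scales[:, t:t + 1] * w_scales[t:t + 1, :]
    return acc
```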
Perf (1600-token prefill)
Qwen3.5 MoE, 1600-token prefill + 512 decode with --cuda_graph on A100, vs the base (gh/digantdesai/53/head): prefill 5727 -> 6171 tok/s (+7.7%), decode 92.6 -> 99.0 tok/s (+6.9%).

Correctness
7/7 microbenchmark configs (incl. qwen3.5-like M=128, K=2048, gs=128) pass with relative diff <1.5% vs BF16 reference — within INT8 quantization noise.
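A hypothetical check in the spirit of that comparison, reusing the two sketches above; the metric here (max absolute diff over max absolute reference) and the shapes other than M/K are assumptions, not necessarily the PR's exact harness:

```python
import torch

torch.manual_seed(0)
M, K, N = 128, 2048, 512                  # M/K mirror the qwen3.5-like config; N is arbitrary
a = torch.randn(M, K, dtype=torch.bfloat16)
w = torch.randn(K, N, dtype=torch.bfloat16)

ref = a.float() @ w.float()               # BF16 inputs, FP32-accumulated reference

a_int8, a_scales = prequantize_int8(a)    # from the sketch above
# Quantize "weights" per-(column, K-tile) by reusing the same helper on w^T.
w_q_t, w_s_t = prequantize_int8(w.t().contiguous())
out = gemm_prequantized(a_int8, a_scales, w_q_t.t(), w_s_t.t())

rel_diff = (out - ref).abs().max() / ref.abs().max()
print(f"max relative diff: {rel_diff.item():.3%}")  # PR reports <1.5% across its 7 configs
```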