
Hoist W4A8 activation quantization out of GEMM K-loop #19209

Open
Gasoonjia wants to merge 7 commits into gh/digantdesai/53/base from hoist-activation-quant

Conversation

@Gasoonjia
Contributor

Context

The original K-loop performed a tl.max(tl.abs(a)) reduction plus an INT8 cast on every tile (16 tiles × 16 rows = 256 reductions per program). Hoisting this work out of the loop eliminates the redundant reductions and halves activation HBM bandwidth in the GEMM (bf16 → int8).

Improvement

Pre-quantize activations to INT8 once into a dedicated buffer (with per-row-per-tile FP32 scales) before the W4A8 batched MoE GEMMs, instead of re-quantizing inside the K-loop on every tile.
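
For illustration, a hoisted pre-quantization kernel of this shape can be sketched in Triton as below. This is not the PR's kernel: the names `_prequant_int8_sketch` / `prequant_int8` are made up for the example, `PREQUANT_BLOCK_K = 128` follows the commit description, and round-half-away-from-zero is one plausible rounding choice. Each program handles one (row, K-tile), does a single abs-max reduction, and writes the INT8 tile plus its FP32 scale so the GEMM K-loop can consume them directly.

```python
import torch
import triton
import triton.language as tl

PREQUANT_BLOCK_K = 128  # must match the GEMM's BLOCK_SIZE_K so scales align per K-tile

@triton.jit
def _prequant_int8_sketch(
    a_ptr, aq_ptr, scale_ptr,
    M, K,
    stride_am, stride_ak,   # bf16/fp activation strides
    stride_qm, stride_qk,   # int8 output strides
    stride_sm, stride_sk,   # per-row-per-tile fp32 scale strides
    BLOCK_K: tl.constexpr,  # power of two, e.g. 128
):
    # One program per (row, K-tile): a single abs-max reduction per tile,
    # instead of redoing it on every GEMM K-loop iteration.
    row = tl.program_id(0)
    kt = tl.program_id(1)
    offs_k = kt * BLOCK_K + tl.arange(0, BLOCK_K)
    mask = offs_k < K
    a = tl.load(a_ptr + row * stride_am + offs_k * stride_ak,
                mask=mask, other=0.0).to(tl.float32)
    amax = tl.max(tl.abs(a), axis=0)
    scale = amax / 127.0 + 1e-8              # FP32 scale for this (row, tile)
    q = a / scale
    q = tl.where(q >= 0, q + 0.5, q - 0.5)   # round half away from zero
    q = tl.minimum(tl.maximum(q, -127.0), 127.0).to(tl.int8)
    tl.store(aq_ptr + row * stride_qm + offs_k * stride_qk, q, mask=mask)
    tl.store(scale_ptr + row * stride_sm + kt * stride_sk, scale)

def prequant_int8(a: torch.Tensor):
    # a: (M, K) activations -> (int8 values, fp32 per-row-per-tile scales)
    M, K = a.shape
    n_tiles = triton.cdiv(K, PREQUANT_BLOCK_K)
    aq = torch.empty((M, K), dtype=torch.int8, device=a.device)
    scales = torch.empty((M, n_tiles), dtype=torch.float32, device=a.device)
    _prequant_int8_sketch[(M, n_tiles)](
        a, aq, scales, M, K,
        a.stride(0), a.stride(1),
        aq.stride(0), aq.stride(1),
        scales.stride(0), scales.stride(1),
        BLOCK_K=PREQUANT_BLOCK_K,
    )
    return aq, scales
```
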

Perf (1600-token prefill)

| Metric  | Baseline (gh/digantdesai/53/head) | Optimized              | Speedup |
| ------- | --------------------------------- | ---------------------- | ------- |
| Prefill | 5727 tok/s (5296–5963)            | 6171 tok/s (5941–6313) | 1.08×   |

Correctness

7/7 microbenchmark configs (incl. qwen3.5-like M=128, K=2048, gs=128) pass with relative diff <1.5% vs BF16 reference — within INT8 quantization noise.
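
The exact comparison metric is not shown in the PR; a plausible relative-difference check against the BF16 reference (illustrative names) would be:

```python
import torch

def relative_diff(out_quant: torch.Tensor, out_bf16_ref: torch.Tensor) -> float:
    # ||quantized-path output - BF16 reference|| / ||BF16 reference||
    ref = out_bf16_ref.float()
    err = (out_quant.float() - ref).norm()
    return (err / ref.norm().clamp_min(1e-12)).item()

# A config "passes" when relative_diff(...) < 0.015, i.e. within INT8 quantization noise.
```
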

… runner

Runner now uses llm::Stats with proper timestamps for model load, prefill,
decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h
print_report format: PyTorchObserver JSON line plus human-readable table.

This commit was authored with the assistance of Claude Code.

[ghstack-poisoned]
@Gasoonjia Gasoonjia requested a review from lucylq as a code owner April 29, 2026 19:47
@pytorch-bot

pytorch-bot Bot commented Apr 29, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19209

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures, 2 Cancelled Jobs, 1 Pending, 2 Unrelated Failures

As of commit d936717 with merge base cb4e5ae:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Apr 29, 2026
@Gasoonjia Gasoonjia force-pushed the hoist-activation-quant branch 2 times, most recently from 385bf6d to 793be8d on April 29, 2026 20:19
Add dedicated _quantize_activations_int8_kernel and _silu_quantize_int8_kernel
that pre-quantize activations to INT8 with per-row-per-tile FP32 scales before
GEMM1 and GEMM2 respectively. The existing _fused_moe_batched_int8_kernel and
_fused_moe_silu_batched_int8_kernel are rewritten to consume pre-quantized
activations + scales, eliminating ~256 redundant tl.max reductions per program
(cdiv(K, BLOCK_K) tiles * BLOCK_M rows) and halving activation HBM bandwidth in
the K-loop (bf16 -> int8). BLOCK_SIZE_K is fixed at PREQUANT_BLOCK_K (= 128)
so per-tile activation scales align with the GEMM K-loop.

Correctness: 7/7 microbenchmark configs pass with rel diff <1.5% vs BF16 ref.
End-to-end (Qwen3.5 MoE 1600 prefill + 512 decode, --cuda_graph, A100):
prefill 5727 -> 6171 tok/s (+7.7%), decode 92.6 -> 99.0 tok/s (+6.9%).
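
As a simplified sketch of the consuming side (illustrative names; full-precision weights stand in here for the packed INT4 weights of the real _fused_moe_batched_int8_kernel), each K-loop iteration now loads the pre-quantized INT8 tile plus one FP32 scale per row and applies that scale to the tile's partial product; the scale index kt lines up only because BLOCK_K equals PREQUANT_BLOCK_K:

```python
import triton
import triton.language as tl

@triton.jit
def _gemm_prequant_a_sketch(
    aq_ptr, ascale_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_sm, stride_sk,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    # Block sizes must be >= 16 for tl.dot; BLOCK_K == PREQUANT_BLOCK_K (128).
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for kt in range(0, tl.cdiv(K, BLOCK_K)):
        offs_k = kt * BLOCK_K + tl.arange(0, BLOCK_K)
        a_mask = (offs_m[:, None] < M) & (offs_k[None, :] < K)
        b_mask = (offs_k[:, None] < K) & (offs_n[None, :] < N)
        # INT8 activation tile: no abs-max reduction or cast inside the loop.
        aq = tl.load(aq_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak,
                     mask=a_mask, other=0)
        # One FP32 scale per row for this K-tile; valid because the pre-quant tile
        # size and the GEMM K-tile size are the same.
        a_s = tl.load(ascale_ptr + offs_m * stride_sm + kt * stride_sk,
                      mask=offs_m < M, other=0.0)
        b = tl.load(b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                    mask=b_mask, other=0.0)
        # Scale this tile's partial product by the per-row scale and accumulate.
        acc += tl.dot(aq.to(tl.float32), b) * a_s[:, None]
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn,
             acc, mask=c_mask)
```
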
@Gasoonjia Gasoonjia force-pushed the hoist-activation-quant branch from 793be8d to d936717 on April 29, 2026 21:14
Base automatically changed from gh/digantdesai/53/head to gh/digantdesai/53/base April 30, 2026 15:05
