Qualcomm AI Engine Direct - Decouple quantization and compile graphs for faster VLM/LLM PTQ #19220
Open
DannyYuyang-quic wants to merge 1 commit into pytorch:main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19220
Note: Links to docs will display an error until the docs builds have been completed.
Contributor (Author)
Hi @abhinaykukkadapu,
Contributor (Author)
@pytorchbot label "release notes: qualcomm"
faster VLM/LLM PTQ

Summary:
- Calibrate decoder using prefill stage only (full chunk input_ids)
- Remove need for AR-N calibration loops
- Significantly reduce calibration overhead
c3f07e0 to a447ba2 (Compare)
Summary
(Benchmark table comparing calibration Time(sec) before and after this change; values not captured in this view.)
This change decouples the quantization graph from the graph used for subsequent lowering, so calibration no longer depends on the AR-N decoding flow.
Previously, calibration ran directly on the graph shaped for lowering (with fixed AR-N constraints). That forced an autoregressive loop (AR-1 per step), which was both inefficient and slow, since the model never saw the full sequence context in a single pass.
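For contrast, here is a minimal sketch of that AR-1 calibration loop; the model signature and cache handling are illustrative assumptions, not the actual ExecuTorch runner:

```python
import torch

def calibrate_ar1(prepared_model, input_ids, kv_cache):
    """One observed forward per token: a C-token chunk costs C passes."""
    pos = torch.zeros(1, dtype=torch.long)
    for tok in input_ids.split(1, dim=-1):  # AR-1: single token per step
        _logits, kv_cache = prepared_model(tok, pos, kv_cache)
        pos = pos + 1
    return kv_cache
```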
With this update, calibration is done once during the prefill stage using the full token chunk. This gives much better coverage in a single run and completely removes the need for iterative decoding during calibration.
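A hedged sketch of the single-pass prefill calibration, written against the standard PT2E entry points; the quantizer object and the prefill-shaped example input are assumptions about the surrounding pipeline, not the PR's exact code:

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

def calibrate_on_prefill(model, quantizer, calib_chunks):
    """Observe the full token chunk in one pass instead of looping AR-1."""
    example = (calib_chunks[0],)                      # prefill-shaped input
    gm = torch.export.export(model, example).module()
    prepared = prepare_pt2e(gm, quantizer)
    for chunk in calib_chunks:                        # one forward per chunk
        prepared(chunk)
    return convert_pt2e(prepared)                     # quantized graph
```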
After quantization, we take the KV cache encodings from the output, override the input KV cache encodings, and then propagate those into the graph that will later be lowered. This keeps everything consistent without needing to recalibrate on that graph.
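A minimal sketch of that encoding propagation, assuming quantization parameters live as scale/zero-point args on quantize/dequantize nodes; `is_kv_cache_io`, the name-matching scheme, and the q/dq arg layout are hypothetical stand-ins for the actual bookkeeping:

```python
def propagate_kv_encodings(calibrated_gm, lowering_gm, is_kv_cache_io):
    """Copy (scale, zero_point) from the calibrated prefill graph onto the
    matching KV-cache inputs of the graph that will be lowered."""
    # Assume q/dq nodes carry args of (tensor, scale, zero_point, ...).
    encodings = {
        node.name: (node.args[1], node.args[2])
        for node in calibrated_gm.graph.nodes
        if is_kv_cache_io(node)
    }
    # Override the matching KV-cache input encodings before lowering.
    for node in lowering_gm.graph.nodes:
        if is_kv_cache_io(node) and node.name in encodings:
            scale, zp = encodings[node.name]
            node.args = (node.args[0], scale, zp, *node.args[3:])
    lowering_gm.recompile()
```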
Result: same accuracy, significantly faster calibration, and a much cleaner separation between quantization and lowering.
Test plan
Test CI in TestExampleLLMScript and TestExampleMultimodalityScript.