Qualcomm AI Engine Direct - [LLM Quantization] Support dataloader-based prefill by DannyYuyang-quic · Pull Request #20273 · pytorch/executorch

DannyYuyang-quic · 2026-06-15T05:41:39Z

Summary

Calibration dataset:

- Replace HF AutoModel token generation with direct tokenization of a curated corpus (LLM eval tasks or JSON samples)
- Add default calibration samples: assets/samples/{text,vision,audio}.json
- Support dataloader-based calibration
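The idea of tokenizing a curated corpus directly, rather than letting the model generate calibration tokens, can be sketched roughly as follows. This is a minimal illustration using only the standard library: the JSON layout, the `toy_tokenize` helper, and `build_calibration_batches` are all hypothetical names invented for this sketch, and the real schema of assets/samples/text.json and the real tokenizer may differ.

```python
import json
from typing import Dict, Iterator, List

# Hypothetical sample layout; the real assets/samples/text.json schema may differ.
SAMPLES_JSON = (
    '[{"text": "the quick brown fox"},'
    ' {"text": "jumps over the lazy dog today"}]'
)


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Stand-in for a real tokenizer: one id per whitespace-separated word.

    Ids start at 1 so that 0 stays free for padding.
    """
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def build_calibration_batches(
    samples_json: str, seq_len: int, batch_size: int, pad_id: int = 0
) -> Iterator[List[List[int]]]:
    """Tokenize each sample once, truncate or pad to seq_len, and yield batches."""
    vocab: Dict[str, int] = {}
    rows: List[List[int]] = []
    for sample in json.loads(samples_json):
        ids = toy_tokenize(sample["text"], vocab)[:seq_len]
        rows.append(ids + [pad_id] * (seq_len - len(ids)))
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


if __name__ == "__main__":
    for batch in build_calibration_batches(SAMPLES_JSON, seq_len=5, batch_size=2):
        print(batch)
```

Each yielded batch is a fixed-length block of token ids that could be fed to a calibration forward pass as-is, with no generation loop involved.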

Architecture:

- Introduce PTQStrategy + DecoderInference as unified calibration forward-pass primitives; remove decoder_utils.graph_module_inference
- Refactor dataset.py into a dataset/ package: builders, collators, config, datasets, loaders, preprocessors, schema
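One way to picture a "strategy plus inference runner" split is the small sketch below. It is only an illustration of the general pattern: `CalibrationStrategy`, `FullSequenceStrategy`, `ChunkedStrategy`, and `CalibrationRunner` are hypothetical names made up here, and they are not the PR's actual PTQStrategy or DecoderInference interfaces, whose signatures are not shown on this page.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Protocol

Batch = List[List[int]]            # batch of token-id rows
Forward = Callable[[Batch], None]  # model forward pass; observers would hook in here


class CalibrationStrategy(Protocol):
    """Hypothetical interface: decides how batches are pushed through the model."""

    def run(self, forward: Forward, batches: Iterable[Batch]) -> int:
        ...


class FullSequenceStrategy:
    """Feed each batch through the model once, full sequence at a time."""

    def run(self, forward: Forward, batches: Iterable[Batch]) -> int:
        calls = 0
        for batch in batches:
            forward(batch)
            calls += 1
        return calls


class ChunkedStrategy:
    """Feed each batch in fixed-size chunks along the sequence axis."""

    def __init__(self, chunk_len: int) -> None:
        self.chunk_len = chunk_len

    def run(self, forward: Forward, batches: Iterable[Batch]) -> int:
        calls = 0
        for batch in batches:
            seq_len = len(batch[0])
            for start in range(0, seq_len, self.chunk_len):
                forward([row[start:start + self.chunk_len] for row in batch])
                calls += 1
        return calls


@dataclass
class CalibrationRunner:
    """Single entry point: the caller picks a strategy, the runner drives it."""

    forward: Forward
    strategy: CalibrationStrategy

    def calibrate(self, batches: Iterable[Batch]) -> int:
        return self.strategy.run(self.forward, batches)
```

With this shape, swapping the calibration behaviour means passing a different strategy object, while the code that owns the model and the data stays unchanged.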

Test plan

Test CI:

- ExampleLLMScript
- TestExampleMultimodalityScript

pytorch-bot · 2026-06-15T05:41:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20273

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⚠️ 16 Awaiting Approval, 1 Pending

As of commit 01574e1 with merge base e88fd04:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

DannyYuyang-quic · 2026-06-15T05:52:24Z

@psiddh Hi, this PR adds dataloader-based calibration for MLLMs. With it, LLMs can be calibrated on the full input sequence at once, which eliminates iterative autoregressive (AR) processing over long sequences. For example, instead of running hundreds of iterations for a sequence length of 1024, calibration can now be completed in a single forward pass.
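The difference in forward-pass count can be illustrated with a toy min/max observer. This is a sketch under strong assumptions: `MinMaxObserver`, `toy_layer`, and both calibration functions are hypothetical, and the toy layer is position-wise, so both paths see identical activation ranges here. That need not hold exactly for a real decoder.

```python
import math
from typing import List


class MinMaxObserver:
    """Tracks the running activation range, as a quantization observer might."""

    def __init__(self) -> None:
        self.lo = math.inf
        self.hi = -math.inf
        self.forward_calls = 0

    def observe(self, values: List[float]) -> None:
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))


def toy_layer(token_ids: List[int]) -> List[float]:
    """Hypothetical position-wise layer standing in for a real decoder."""
    return [0.5 * t - 3.0 for t in token_ids]


def calibrate_iteratively(
    token_ids: List[int], observer: MinMaxObserver, step: int
) -> None:
    """AR-style calibration: one forward call per `step` tokens."""
    for start in range(0, len(token_ids), step):
        observer.observe(toy_layer(token_ids[start:start + step]))
        observer.forward_calls += 1


def calibrate_single_pass(token_ids: List[int], observer: MinMaxObserver) -> None:
    """Prefill-style calibration: the whole sequence in one forward call."""
    observer.observe(toy_layer(token_ids))
    observer.forward_calls += 1


if __name__ == "__main__":
    tokens = list(range(1024))
    iterative, single = MinMaxObserver(), MinMaxObserver()
    calibrate_iteratively(tokens, iterative, step=4)
    calibrate_single_pass(tokens, single)
    print(iterative.forward_calls, single.forward_calls)
    print((iterative.lo, iterative.hi) == (single.lo, single.hi))
```

The `step` value of 4 is arbitrary and chosen only so that a 1024-token sequence needs hundreds of iterative calls; the actual AR window used before this PR is not stated here.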

Below is a comparison between AR iterative calibration and dataloader-based calibration across different models:

MLLMs metrics

| Model | AR iterative calibration: time (s) / PPL | Dataloader-based calibration: time (s) / PPL | Speedup |
|---|---|---|---|
| gemma-2b | 1216 / 16.588 | 100 / 16.609 | 12.16x |
| gemma2-2b | 1827 / 11.504 | 123 / 11.517 | 14.85x |
| gemma3-1b | 907 / 23.052 | 81 / 22.722 | 10.67x |
| glm-1_5b | 963 / 20.180 | 85 / 20.041 | 11.32x |
| llama3_2-3b | 2286 / 10.745 | 138 / 10.498 | 16.56x |
| phi_4_mini | 2824 / 13.437 | 180 / 13.605 | 15.68x |
| qwen2_5-0_5b | 486 / 13.951 | 77 / 13.813 | 6.31x |
| qwen2_5-1_5b | 1068 / 9.714 | 116 / 9.669 | 9.2x |
| qwen3-1_7b | 1478 / 14.756 | 111 / 14.913 | 13.31x |
| smollm2_135m | 399 / 19.797 | 80 / 19.706 | 4.98x |
| smollm3-3b | 2065 / 8.345 | 132 / 8.989 | 15.64x |
| smolvlm_500m_instruct | 170 / - | 86 / - | 1.97x |
| internvl3_1b | 170 / - | 75 / - | 2.26x |
| granite_speech_3_3-2b | 447 / - | 179 / - | 2.49x |
| llama3_2-1b | 1237 / 14.973 | 883 / 15.647 | 1.4x |
| qwen3-0_6b | 1013 / 19.740 | 408 / 19.912 | 2.48x |
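For readers who want to check the arithmetic, the speedup column appears to be the AR time divided by the dataloader-based time. The sketch below recomputes it for three rows copied from the table and also derives the relative PPL change, which is not a column in the original table. The helper names are hypothetical.

```python
from typing import Dict, Tuple

# (AR time s, AR PPL, dataloader time s, dataloader PPL), copied from the table.
ROWS: Dict[str, Tuple[float, float, float, float]] = {
    "gemma-2b": (1216, 16.588, 100, 16.609),
    "qwen2_5-0_5b": (486, 13.951, 77, 13.813),
    "llama3_2-1b": (1237, 14.973, 883, 15.647),
}


def speedup(ar_time: float, dl_time: float) -> float:
    """Ratio of AR iterative calibration time to dataloader-based time."""
    return ar_time / dl_time


def ppl_change_percent(ar_ppl: float, dl_ppl: float) -> float:
    """Relative PPL change in percent; positive means PPL got worse."""
    return 100.0 * (dl_ppl - ar_ppl) / ar_ppl


if __name__ == "__main__":
    for name, (ar_t, ar_ppl, dl_t, dl_ppl) in ROWS.items():
        print(
            f"{name}: {speedup(ar_t, dl_t):.2f}x, "
            f"PPL change {ppl_change_percent(ar_ppl, dl_ppl):+.2f}%"
        )
```

Looking at speed and PPL together makes it easier to see which models trade some perplexity for faster calibration and which do not.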

cc: @shewu-quic, @haowhsu-quic


DannyYuyang-quic · 2026-06-15T16:20:30Z

@pytorchbot label "release notes: qualcomm"

DannyYuyang-quic requested review from abhinaykukkadapu and psiddh as code owners June 15, 2026 05:41

meta-cla Bot added the CLA Signed label Jun 15, 2026

DannyYuyang-quic had a problem deploying to cadence with GitHub Actions (Failure) June 15, 2026 05:44

DannyYuyang-quic changed the title ~~Qualcomm AI Engine Direct - Support dataloader-based prefill quantize~~ Qualcomm AI Engine Direct - [LLM Quantization] Support dataloader-based prefill Jun 15, 2026

DannyYuyang-quic force-pushed the dev1/danny/remove_token_gen_from_calib branch from 28dd82b to d80b723 Compare June 15, 2026 07:46

DannyYuyang-quic force-pushed the dev1/danny/remove_token_gen_from_calib branch from d80b723 to 01574e1 Compare June 15, 2026 07:54

pytorch-bot Bot added the release notes: qualcomm label Jun 15, 2026

DannyYuyang-quic wants to merge 1 commit into pytorch:main from CodeLinaro:dev1/danny/remove_token_gen_from_calib
