Qualcomm AI Engine Direct - [LLM Quantization] Support dataloader-based prefill #20273
DannyYuyang-quic wants to merge 1 commit into
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20273
@psiddh Hi, this PR adds support for dataloader-based calibration in MLLMs. With it, LLMs can be calibrated on the full input sequence at once, which removes the need for iterative autoregressive (AR) processing over long sequences. For example, instead of running hundreds of iterations for a sequence length of 1024, calibration completes in a single forward pass. Below is a comparison between AR iterative calibration and dataloader-based calibration across different models: MLLMs metrics
cc: @shewu-quic, @haowhsu-quic
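To illustrate the difference in forward-pass count described above, here is a minimal sketch in plain Python. `ToyDecoder`, `MinMaxObserver`, and the `ar_len` parameter are hypothetical stand-ins and not names from this PR; the toy model has no attention or KV cache, so it only shows that both strategies can observe the same activation range while the number of forward calls differs.

```python
import math


class MinMaxObserver:
    """Toy PTQ observer: tracks the running min/max of every value it sees."""

    def __init__(self):
        self.lo = math.inf
        self.hi = -math.inf

    def observe(self, values):
        for v in values:
            self.lo = min(self.lo, v)
            self.hi = max(self.hi, v)


class ToyDecoder:
    """Hypothetical stand-in for a decoder: maps each token id to one activation."""

    def __init__(self):
        self.observer = MinMaxObserver()
        self.forward_calls = 0

    def forward(self, token_ids):
        self.forward_calls += 1
        activations = [0.01 * t - 1.0 for t in token_ids]
        self.observer.observe(activations)
        return activations


def calibrate_ar(model, token_ids, ar_len):
    """Iterative calibration: feed the sequence ar_len tokens at a time."""
    for start in range(0, len(token_ids), ar_len):
        model.forward(token_ids[start:start + ar_len])


def calibrate_full_sequence(model, token_ids):
    """Dataloader-style calibration: one forward pass over the whole sequence."""
    model.forward(token_ids)


tokens = list(range(1024))  # assumed sequence length of 1024
ar_model, full_model = ToyDecoder(), ToyDecoder()
calibrate_ar(ar_model, tokens, ar_len=4)  # ar_len=4 is an arbitrary choice
calibrate_full_sequence(full_model, tokens)

print(ar_model.forward_calls, full_model.forward_calls)  # 256 1
```

In this toy setup both observers end with identical ranges; whether that holds for a real decoder depends on details this sketch does not model.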
28dd82b to d80b723
Calibration dataset:
- Replace HF AutoModel token generation with direct tokenization of a
  curated corpus (LLM eval tasks or JSON samples)
- Add default calibration samples: assets/samples/{text,vision,audio}.json
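As a rough picture of "direct tokenization of a curated corpus", the sketch below turns JSON samples into fixed-length token id rows without running a model to generate tokens. Everything here is assumed for illustration: the `"text"` key, the inline `SAMPLES_JSON` string, and the whitespace tokenizer are hypothetical, and a real flow would presumably use the model's own tokenizer and the sample files' actual schema.

```python
import json

# Hypothetical sample content; the real JSON schema is not shown in this PR text.
SAMPLES_JSON = '[{"text": "the quick brown fox"}, {"text": "hello world"}]'


def build_vocab(texts):
    """Assign an integer id to each distinct whitespace-separated word."""
    vocab = {"<pad>": 0}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def tokenize(text, vocab, max_seq_len):
    """Tokenize one sample, then truncate or right-pad to max_seq_len."""
    ids = [vocab[word] for word in text.split()][:max_seq_len]
    return ids + [vocab["<pad>"]] * (max_seq_len - len(ids))


samples = json.loads(SAMPLES_JSON)
texts = [sample["text"] for sample in samples]
vocab = build_vocab(texts)

# Each row is one full calibration sequence, ready for a single forward pass.
calibration_batch = [tokenize(t, vocab, max_seq_len=6) for t in texts]
print(calibration_batch)  # [[1, 2, 3, 4, 0, 0], [5, 6, 0, 0, 0, 0]]
```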
Architecture:
- Introduce PTQStrategy + DecoderInference as unified calibration
  forward-pass primitives; remove decoder_utils.graph_module_inference
- Refactor dataset.py into dataset/ package:
  builders, collators, config, datasets, loaders, preprocessors, schema
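One way to picture a strategy object paired with an inference runner is sketched below. This is not the PR's API: `ToyInference` and `ToyCalibrationStrategy` are hypothetical names, and their fields and methods are assumptions made only to show the separation between "how a forward pass runs" and "how calibration batches are fed to it".

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

Batch = List[int]


@dataclass
class ToyInference:
    """Hypothetical runner: owns the forward pass and counts the batches it ran."""

    forward_fn: Callable[[Batch], List[float]]
    seen_batches: int = 0

    def run(self, batch: Batch) -> List[float]:
        self.seen_batches += 1
        return self.forward_fn(batch)


@dataclass
class ToyCalibrationStrategy:
    """Hypothetical strategy: decides how loader batches reach the runner."""

    inference: ToyInference
    ranges: List[Tuple[float, float]] = field(default_factory=list)

    def calibrate(self, loader: Iterable[Batch]) -> Tuple[float, float]:
        # One forward pass per batch; record each batch's output range.
        for batch in loader:
            outputs = self.inference.run(batch)
            self.ranges.append((min(outputs), max(outputs)))
        # Merge per-batch ranges into one overall (min, max) pair.
        return (min(r[0] for r in self.ranges), max(r[1] for r in self.ranges))


loader = [[1, 2, 3], [4, 5, 6]]  # stand-in for a dataloader of token id batches
runner = ToyInference(forward_fn=lambda batch: [0.5 * t for t in batch])
strategy = ToyCalibrationStrategy(inference=runner)
overall_range = strategy.calibrate(loader)
print(overall_range, runner.seen_batches)  # (0.5, 3.0) 2
```

Keeping the runner separate from the strategy means the feeding policy can change without touching the forward-pass code, which is the general idea a unified primitive suggests.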
d80b723 to 01574e1
@pytorchbot label "release notes: qualcomm"
Test plan
Test CI: