
feat(quant): add GPTQ/AWQ quantized checkpoint support (fixes #70)#82

Open
drunkcoding wants to merge 2 commits into dev from feat/quantized-checkpoint-support

Conversation

@drunkcoding
Contributor

Summary

Problem

MoE-Infinity unconditionally casts all checkpoint tensors to the model's float dtype during loading. This destroys the packed integer representations in quantized checkpoints (GPTQ qweight/qzeros/g_idx, AWQ qweight/scales), causing silent weight corruption. Issue #70 reported this for the HQQ-quantized Mixtral variant.

Changes

New: moe_infinity/utils/quantization.py

  • detect_quantization() — identifies quant format from config attributes or checkpoint files (quantize_config.json, quant_config.json, quantization_config.json, .gguf files)
  • validate_quantization_support() — raises ValueError with actionable messages for unsupported formats
  • should_cast_tensor() — decides per-tensor whether to cast to model dtype (skips packed quant tensors)
  • QuantizationInfo dataclass — carries method, bits, group_size, support status
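A minimal sketch of how these utilities could fit together (the function and dataclass names come from this PR, but field names, config-key lookups, and exact signatures here are assumptions, not the actual implementation):

```python
import json
import os
from dataclasses import dataclass
from typing import Optional

# Packed quant tensor suffixes that must never be cast to float
PACKED_TENSOR_SUFFIXES = ("qweight", "qzeros", "scales", "g_idx")
# Sidecar config files checked in the checkpoint directory
CONFIG_FILES = ("quantize_config.json", "quant_config.json", "quantization_config.json")


@dataclass
class QuantizationInfo:
    method: str                       # e.g. "gptq", "awq", "gguf"
    bits: Optional[int] = None
    group_size: Optional[int] = None
    supported: bool = False


def detect_quantization(config, checkpoint_dir: Optional[str] = None) -> Optional[QuantizationInfo]:
    """Identify the quant format from config attributes or checkpoint files."""
    # 1) Config attribute (transformers stores GPTQ/AWQ info here)
    qcfg = getattr(config, "quantization_config", None)
    if qcfg is not None:
        method = qcfg.get("quant_method", "unknown") if isinstance(qcfg, dict) \
            else getattr(qcfg, "quant_method", "unknown")
        return QuantizationInfo(method=method, supported=method in ("gptq", "awq"))
    # 2) Sidecar config files or .gguf payloads in the checkpoint directory
    if checkpoint_dir and os.path.isdir(checkpoint_dir):
        for fname in CONFIG_FILES:
            path = os.path.join(checkpoint_dir, fname)
            if os.path.exists(path):
                with open(path) as f:
                    data = json.load(f)
                method = data.get("quant_method", "gptq")
                return QuantizationInfo(method=method, bits=data.get("bits"),
                                        group_size=data.get("group_size"),
                                        supported=method in ("gptq", "awq"))
        if any(f.endswith(".gguf") for f in os.listdir(checkpoint_dir)):
            return QuantizationInfo(method="gguf", supported=False)
    return None


def should_cast_tensor(name: str, quant_info: Optional[QuantizationInfo]) -> bool:
    """Decide per tensor whether to cast to the model dtype."""
    # Full-precision path: no quant info means always cast (backward compatible)
    if quant_info is None:
        return True
    # Packed integer tensors keep their dtype; casting would corrupt them
    return not name.endswith(PACKED_TENSOR_SUFFIXES)
```

Note how `should_cast_tensor(name, None)` returns `True` unconditionally, which is what keeps the existing full-precision path byte-for-byte identical.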

Modified: moe_infinity/runtime/model_offload.py

  • Tensor casting loop now conditionally skips cast for quantized tensors (qweight, qzeros, scales, g_idx)
  • GPTQ detection hardened: uses detect_quantization() instead of fragile hasattr check, handles file-based config
  • Added AWQ conversion path with optional autoawq dependency (clear ImportError if missing)
  • Clear error when optimum not installed for GPTQ models
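The guarded casting loop might look roughly like the following; `cast_state_dict` and its argument names are illustrative stand-ins, not the actual code in model_offload.py:

```python
from typing import Dict, Optional

import torch

# Packed GPTQ/AWQ tensor suffixes that must keep their integer dtype
PACKED_SUFFIXES = ("qweight", "qzeros", "scales", "g_idx")


def cast_state_dict(state_dict: Dict[str, torch.Tensor],
                    target_dtype: torch.dtype,
                    quant_info: Optional[object] = None) -> Dict[str, torch.Tensor]:
    """Cast checkpoint tensors to the model dtype, skipping packed quant tensors."""
    out = {}
    for name, tensor in state_dict.items():
        # Casting a bit-packed int32 qweight to float would destroy it,
        # so quantized tensors pass through untouched.
        skip = quant_info is not None and name.endswith(PACKED_SUFFIXES)
        out[name] = tensor if skip else tensor.to(target_dtype)
    return out
```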

Modified: moe_infinity/entrypoints/big_modeling.py

  • Early quant detection after AutoConfig.from_pretrained() — fails before expensive snapshot_download()
  • Post-download re-detection for file-only formats (GGUF)
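The key property is the ordering: detection runs on the cheap AutoConfig result before the expensive download, then once more on local files. A sketch of that flow (the helpers are passed in as callables here purely so the ordering is testable; the real big_modeling.py calls AutoConfig.from_pretrained and snapshot_download directly):

```python
def prepare_checkpoint(model_id, load_config, detect, validate, download):
    """Fail fast on unsupported quant formats before downloading weights."""
    config = load_config(model_id)        # AutoConfig.from_pretrained equivalent
    info = detect(config)                 # cheap: inspects config only
    validate(info)                        # raises ValueError in seconds, not
                                          # after a multi-GB snapshot_download
    local_dir = download(model_id)        # expensive weight download
    # Post-download re-detection for file-only formats such as GGUF,
    # which carry no hint in the model config
    info = info or detect(config, local_dir)
    validate(info)
    return local_dir, info
```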

New: requirements-optional.txt

  • autoawq>=0.2.0 as optional dependency for AWQ support

Test Coverage

| File | Tests | Coverage |
| --- | --- | --- |
| test_quantization_detection.py | 29 | Detection, validation, cast decisions |
| test_gptq_loading.py | 11 | GPTQ tensor handling, expert key mapping |
| test_awq_loading.py | 10 | AWQ tensor handling, conversion, key mapping |
| test_unsupported_quant_errors.py | 8 | Fail-fast error messages |
| test_quant_regression.py | 10 | Full-precision path unchanged, DeepSeek V3 fp8 intact |
| test_quantized_e2e.py | 4 | Real checkpoint e2e (GPTQ/AWQ/HQQ/FP16) |

Unit tests: 248 passed, 4 skipped (CUDA), 0 failed
E2E tests: Scaffolded with auto-skip when checkpoints not available

Supported Quantized Models

| Format | Models Covered | Status |
| --- | --- | --- |
| GPTQ | Mixtral, Qwen3-30B-A3B (official), DeepSeek | ✅ Supported |
| AWQ | Mixtral, Qwen3, DeepSeek-V3, GPT-OSS-120B | ✅ Supported |
| HQQ | Issue #70 model | ❌ Clear error |
| bitsandbytes | Various | ❌ Clear error |
| GGUF | Various | ❌ Clear error (suggests llama.cpp/Ollama) |
| EXL2 | Various | ❌ Clear error (suggests ExLlamaV2) |
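For the unsupported formats, `validate_quantization_support()` raises a ValueError whose message points at a working alternative. A hypothetical sketch of that mapping (the exact message wording is an assumption; the function only needs `.method` and `.supported` on the info object):

```python
# Assumed mapping from unsupported format to an actionable hint
UNSUPPORTED_HINTS = {
    "hqq": "HQQ checkpoints are not supported; re-quantize with GPTQ or AWQ.",
    "bitsandbytes": "bitsandbytes checkpoints are not supported; use GPTQ or AWQ.",
    "gguf": "GGUF is not supported; try llama.cpp or Ollama instead.",
    "exl2": "EXL2 is not supported; try ExLlamaV2 instead.",
}


def validate_quantization_support(quant_info) -> None:
    """Raise a ValueError with an actionable message for unsupported formats."""
    if quant_info is None or quant_info.supported:
        return  # full-precision model, or a supported quant format
    hint = UNSUPPORTED_HINTS.get(quant_info.method,
                                 "this quantization format is not supported")
    raise ValueError(f"Cannot load '{quant_info.method}' checkpoint: {hint}")
```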

Backward Compatibility

  • should_cast_tensor(name, None) returns True for all tensors — existing full-precision path unchanged
  • detect_quantization() returns None for standard models — no new code paths activated
  • DeepSeek V3 float8_e4m3fn special case verified untouched
  • No new required dependencies added to requirements.txt

xly added 2 commits April 1, 2026 11:06
free_gpu_blocks was a no-op (two consecutive returns), causing scheduler
preemption to fail — freed blocks were never returned to the allocator.
Now releases physical blocks while preserving sequence table entry and
CPU swap buffers for potential swap-in recovery.