
feat(quant): add GPTQ/AWQ quantized checkpoint support (fixes #70)#82

Open
drunkcoding wants to merge 2 commits into dev from feat/quantized-checkpoint-support

Conversation

@drunkcoding
Contributor

Summary

Problem

MoE-Infinity unconditionally casts all checkpoint tensors to the model's float dtype during loading. This destroys the packed integer representations in quantized checkpoints (GPTQ qweight/qzeros/g_idx, AWQ qweight/scales), causing silent weight corruption. Issue #70 reported this for the HQQ-quantized Mixtral variant.

Changes

New: moe_infinity/utils/quantization.py

  • detect_quantization() — identifies quant format from config attributes or checkpoint files (quantize_config.json, quant_config.json, quantization_config.json, .gguf files)
  • validate_quantization_support() — raises ValueError with actionable messages for unsupported formats
  • should_cast_tensor() — decides per-tensor whether to cast to model dtype (skips packed quant tensors)
  • QuantizationInfo dataclass — carries method, bits, group_size, support status
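A minimal sketch of how these utilities could fit together (the function and dataclass names come from this PR, but field names, config-key lookups, and exact signatures here are assumptions, not the actual implementation):

```python
import json
import os
from dataclasses import dataclass
from typing import Optional

# Packed quant tensor suffixes that must never be cast to float
PACKED_TENSOR_SUFFIXES = ("qweight", "qzeros", "scales", "g_idx")
# Sidecar config files checked in the checkpoint directory
CONFIG_FILES = ("quantize_config.json", "quant_config.json", "quantization_config.json")


@dataclass
class QuantizationInfo:
    method: str                       # e.g. "gptq", "awq", "gguf"
    bits: Optional[int] = None
    group_size: Optional[int] = None
    supported: bool = False


def detect_quantization(config, checkpoint_dir: Optional[str] = None) -> Optional[QuantizationInfo]:
    """Identify the quant format from config attributes or checkpoint files."""
    # 1) Config attribute (transformers stores GPTQ/AWQ info here)
    qcfg = getattr(config, "quantization_config", None)
    if qcfg is not None:
        method = qcfg.get("quant_method", "unknown") if isinstance(qcfg, dict) \
            else getattr(qcfg, "quant_method", "unknown")
        return QuantizationInfo(method=method, supported=method in ("gptq", "awq"))
    # 2) Sidecar config files or .gguf payloads in the checkpoint directory
    if checkpoint_dir and os.path.isdir(checkpoint_dir):
        for fname in CONFIG_FILES:
            path = os.path.join(checkpoint_dir, fname)
            if os.path.exists(path):
                with open(path) as f:
                    data = json.load(f)
                method = data.get("quant_method", "gptq")
                return QuantizationInfo(method=method, bits=data.get("bits"),
                                        group_size=data.get("group_size"),
                                        supported=method in ("gptq", "awq"))
        if any(f.endswith(".gguf") for f in os.listdir(checkpoint_dir)):
            return QuantizationInfo(method="gguf", supported=False)
    return None


def should_cast_tensor(name: str, quant_info: Optional[QuantizationInfo]) -> bool:
    """Decide per tensor whether to cast to the model dtype."""
    # Full-precision path: no quant info means always cast (backward compatible)
    if quant_info is None:
        return True
    # Packed integer tensors keep their dtype; casting would corrupt them
    return not name.endswith(PACKED_TENSOR_SUFFIXES)
```

Note how `should_cast_tensor(name, None)` returns `True` unconditionally, which is what keeps the existing full-precision path byte-for-byte identical.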

Modified: moe_infinity/runtime/model_offload.py

  • Tensor casting loop now conditionally skips cast for quantized tensors (qweight, qzeros, scales, g_idx)
  • GPTQ detection hardened: uses detect_quantization() instead of fragile hasattr check, handles file-based config
  • Added AWQ conversion path with optional autoawq dependency (clear ImportError if missing)
  • Clear error when optimum not installed for GPTQ models
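The guarded casting loop might look roughly like the following; `cast_state_dict` and its argument names are illustrative stand-ins, not the actual code in model_offload.py:

```python
from typing import Dict, Optional

import torch

# Packed GPTQ/AWQ tensor suffixes that must keep their integer dtype
PACKED_SUFFIXES = ("qweight", "qzeros", "scales", "g_idx")


def cast_state_dict(state_dict: Dict[str, torch.Tensor],
                    target_dtype: torch.dtype,
                    quant_info: Optional[object] = None) -> Dict[str, torch.Tensor]:
    """Cast checkpoint tensors to the model dtype, skipping packed quant tensors."""
    out = {}
    for name, tensor in state_dict.items():
        # Casting a bit-packed int32 qweight to float would destroy it,
        # so quantized tensors pass through untouched.
        skip = quant_info is not None and name.endswith(PACKED_SUFFIXES)
        out[name] = tensor if skip else tensor.to(target_dtype)
    return out
```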

Modified: moe_infinity/entrypoints/big_modeling.py

  • Early quant detection after AutoConfig.from_pretrained() — fails before expensive snapshot_download()
  • Post-download re-detection for file-only formats (GGUF)
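The key property is the ordering: detection runs on the cheap AutoConfig result before the expensive download, then once more on local files. A sketch of that flow (the helpers are passed in as callables here purely so the ordering is testable; the real big_modeling.py calls AutoConfig.from_pretrained and snapshot_download directly):

```python
def prepare_checkpoint(model_id, load_config, detect, validate, download):
    """Fail fast on unsupported quant formats before downloading weights."""
    config = load_config(model_id)        # AutoConfig.from_pretrained equivalent
    info = detect(config)                 # cheap: inspects config only
    validate(info)                        # raises ValueError in seconds, not
                                          # after a multi-GB snapshot_download
    local_dir = download(model_id)        # expensive weight download
    # Post-download re-detection for file-only formats such as GGUF,
    # which carry no hint in the model config
    info = info or detect(config, local_dir)
    validate(info)
    return local_dir, info
```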

New: requirements-optional.txt

  • autoawq>=0.2.0 as optional dependency for AWQ support

Test Coverage

| File | Tests | Coverage |
| --- | --- | --- |
| test_quantization_detection.py | 29 | Detection, validation, cast decisions |
| test_gptq_loading.py | 11 | GPTQ tensor handling, expert key mapping |
| test_awq_loading.py | 10 | AWQ tensor handling, conversion, key mapping |
| test_unsupported_quant_errors.py | 8 | Fail-fast error messages |
| test_quant_regression.py | 10 | Full-precision path unchanged, DeepSeek V3 fp8 intact |
| test_quantized_e2e.py | 4 | Real checkpoint e2e (GPTQ/AWQ/HQQ/FP16) |

Unit tests: 248 passed, 4 skipped (CUDA), 0 failed
E2E tests: Scaffolded with auto-skip when checkpoints not available

Supported Quantized Models

| Format | Models Covered | Status |
| --- | --- | --- |
| GPTQ | Mixtral, Qwen3-30B-A3B (official), DeepSeek | ✅ Supported |
| AWQ | Mixtral, Qwen3, DeepSeek-V3, GPT-OSS-120B | ✅ Supported |
| HQQ | Issue #70 model | ❌ Clear error |
| bitsandbytes | Various | ❌ Clear error |
| GGUF | Various | ❌ Clear error (suggests llama.cpp/Ollama) |
| EXL2 | Various | ❌ Clear error (suggests ExLlamaV2) |
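For the unsupported formats, `validate_quantization_support()` raises a ValueError whose message points at a working alternative. A hypothetical sketch of that mapping (the exact message wording is an assumption; the function only needs `.method` and `.supported` on the info object):

```python
# Assumed mapping from unsupported format to an actionable hint
UNSUPPORTED_HINTS = {
    "hqq": "HQQ checkpoints are not supported; re-quantize with GPTQ or AWQ.",
    "bitsandbytes": "bitsandbytes checkpoints are not supported; use GPTQ or AWQ.",
    "gguf": "GGUF is not supported; try llama.cpp or Ollama instead.",
    "exl2": "EXL2 is not supported; try ExLlamaV2 instead.",
}


def validate_quantization_support(quant_info) -> None:
    """Raise a ValueError with an actionable message for unsupported formats."""
    if quant_info is None or quant_info.supported:
        return  # full-precision model, or a supported quant format
    hint = UNSUPPORTED_HINTS.get(quant_info.method,
                                 "this quantization format is not supported")
    raise ValueError(f"Cannot load '{quant_info.method}' checkpoint: {hint}")
```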

Backward Compatibility

  • should_cast_tensor(name, None) returns True for all tensors — existing full-precision path unchanged
  • detect_quantization() returns None for standard models — no new code paths activated
  • DeepSeek V3 float8_e4m3fn special case verified untouched
  • No new required dependencies added to requirements.txt

xly added 2 commits April 1, 2026 11:06
free_gpu_blocks was a no-op (two consecutive returns), causing scheduler
preemption to fail — freed blocks were never returned to the allocator.
Now releases physical blocks while preserving sequence table entry and
CPU swap buffers for potential swap-in recovery.