IREE gap: quantized weight values not emitted in stablehlo.constant bodies (P1) #493

@michalharakal

Description

Context

Surfaced during the int8 NPU readiness audit on 2026-04-13. The `skainet.tensor_encodings` module-level attribute (#478) tells a downstream consumer which tensors are quantized and how. But `ConstantOperationsConverter` in `skainet-compile-hlo` emits constant tensor bodies as float dense literals:

```mlir
%w = stablehlo.constant dense<[0.123, 0.456, ...]> : tensor<8x4xf32>
```

…even when the source tensor's `TensorData` is a `Q8_0BlockTensorData` or `Q4_KBlockTensorData`. The module attribute separately says `skainet.tensor_encodings = {w = "Q8_0"}` so a consumer can see the "intent" — but the actual constant in the IR is a fictional placeholder because the emitter does not have a path to serialize quantized block bytes into the MLIR.
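Concretely, the mismatch looks something like this (attribute placement is illustrative and the surrounding function structure is elided):

```mlir
module attributes {skainet.tensor_encodings = {w = "Q8_0"}} {
  // The module attribute declares the Q8_0 intent, but the only
  // ground truth in the IR is this float literal placeholder:
  %w = stablehlo.constant dense<[0.123, 0.456, ...]> : tensor<8x4xf32>
}
```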

Why this is P1 — the load-bearing IREE gap

This is the single biggest blocker for a real "SKaiNET → IREE → NPU quantized inference" pipeline. Until it's fixed:

  • IREE sees a dense float weight and dequantizes at compile time (loses the storage benefit).
  • The emitted MLIR is numerically wrong if the "dense float" values are just placeholders rather than faithfully dequantized from the Q8_0 / Q4_K blocks.
  • The `skainet.tensor_encodings` attribute is informational only, with no ground-truth data in the IR backing it.

Design options

Three approaches, listed from cheapest to most rigorous:

Option A — dequantize at emit time (correct but defeats the point)

`ConstantOperationsConverter` detects `tensor.data is PackedBlockStorage`, calls the existing GGML dequant routine to produce a float buffer, and emits that as the dense constant. Wasteful in bytes but correct: the emitted MLIR is numerically equivalent to the quantized source, and IREE can re-quantize at compile time if it wants. Simplest to implement (roughly a day), but it leaves the ~4× storage saving on the table, and IREE's compile-time re-quant may not match the source quant scheme.
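A minimal sketch of the Option A dequant step, assuming a hypothetical `Q8_0Block` stand-in for SKaiNET's packed block storage (the real type and the GGML routine's signature may differ). It follows the GGML Q8_0 layout: 32 signed 8-bit codes per block plus one scale, with each value dequantizing as `scale * code`:

```kotlin
const val Q8_0_BLOCK_SIZE = 32

// One Q8_0 block: a single scale plus 32 int8 codes.
// Hypothetical stand-in for an entry of SKaiNET's PackedBlockStorage.
class Q8_0Block(val scale: Float, val codes: ByteArray) {
    init { require(codes.size == Q8_0_BLOCK_SIZE) }
}

// Materialize a float buffer from Q8_0 blocks (the GGML rule: value = scale * code).
// The result would become the dense constant body in the emitted MLIR.
fun dequantizeQ8_0(blocks: List<Q8_0Block>): FloatArray {
    val out = FloatArray(blocks.size * Q8_0_BLOCK_SIZE)
    blocks.forEachIndexed { b, block ->
        for (i in 0 until Q8_0_BLOCK_SIZE) {
            out[b * Q8_0_BLOCK_SIZE + i] = block.scale * block.codes[i]
        }
    }
    return out
}
```

This is the "numerically equivalent" half of Option A; the storage cost comes from emitting `out` as an f32 dense literal instead of the packed bytes.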

Option B — emit raw int8 constant + separate scale/zp attribute (matches TFLite flatbuffer style)

Emit the constant as a `stablehlo.constant dense<...> : tensor<8x4xi8>` with the raw quantized codes, and carry the dequant params (scales, zero points, block structure) in an `#skainet.quant_params` attribute on the constant op. A downstream pass reads the combination and materializes `!quant.uniform` types itself. Preserves storage, preserves intent, moderate implementation complexity. Requires extending `TypeMapper` and `ConstantOperationsConverter` to understand per-tensor quant types.
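A sketch of what the Option B encoding could look like. `#skainet.quant_params` does not exist yet, so its exact syntax here is a proposal, not current output:

```mlir
// Raw int8 codes become the ground truth; dequant params ride alongside.
// #skainet.quant_params is a proposed attribute -- fields shown are illustrative.
%w = stablehlo.constant {skainet.quant_params = #skainet.quant_params<
       scheme = "Q8_0", block_size = 32,
       scales = dense<[0.05]> : tensor<1xf32>>}
     dense<[12, -7, ...]> : tensor<8x4xi8>
```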

Option C — emit as `!quant.uniform<i8:f32, 0.1:128>` directly (the "right" MLIR way)

Use MLIR's `quant` dialect element types natively. Maximum downstream tool compatibility: IREE, XLA, TFLite, ONNXRuntime all understand `!quant.uniform`. But Q4_K is a block-quantized format with per-block scales and a super-block shape that doesn't map cleanly to the `uniform` type. Would require either:

  • reducing to per-tensor / per-channel scale (lossy for Q4_K)
  • emitting a custom `#skainet.q4_k` dialect type the consumer has to extend to parse
  • decomposing Q4_K into a stablehlo.custom_call stub that wraps the raw bytes

Not a 1-PR project. Start with A or B, plan toward C.
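For reference, the Option C target for the cases that do fit, with the per-axis form showing why Q4_K does not. `!quant.uniform` tops out at one scale per axis slice, while Q4_K carries per-block and super-block scales:

```mlir
// Per-tensor: one scale and zero point for the whole weight.
%w = stablehlo.constant dense<...> : tensor<8x4x!quant.uniform<i8:f32, 0.1:128>>

// Per-axis (quantization dimension 0): one scale:zero_point per row.
// This is as fine-grained as !quant.uniform gets -- Q4_K's per-block
// scales inside a super-block have no direct spelling here.
%w2 = stablehlo.constant dense<...> : tensor<8x4x!quant.uniform<i8:f32:0,
        {0.1:120, 0.2:125, 0.05:128, 0.1:128, 0.3:110, 0.1:128, 0.2:128, 0.1:128}>>
```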

Recommended starting point

Do Option A first. Correctness beats storage efficiency for the first working IREE round-trip. Once a SKaiNET → IREE → NPU pipeline is running end-to-end on any quantized model, revisit for Option B or C based on whichever downstream consumer is actually in play.

Scope of the first PR

  1. Extend `ConstantOperationsConverter` (or wherever weight-constant emission happens) to detect `is PackedBlockStorage` on the tensor's `TensorData`.
  2. Call the existing GGML dequant routine for each quantized block format and materialize a float buffer.
  3. Emit the float buffer as the dense constant body, keeping the existing `skainet.tensor_encodings` module attribute on the header as the metadata hint for future Option B/C consumers.
  4. Unit test: build a graph with a Q8_0 weight, emit, parse the constant body back, compare against a reference dequantized FloatArray.
  5. `./gradlew :skainet-compile:skainet-compile-hlo:allTests` before pushing.
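The round-trip check in step 4 could take roughly this shape. `quantizeQ8_0` below is a hypothetical absmax reference quantizer (in the style of GGML), not SKaiNET's actual API; the point is the assertion pattern, not the names:

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

// Hypothetical reference quantizer: absmax scaling to int8, one scale per array.
fun quantizeQ8_0(values: FloatArray): Pair<Float, ByteArray> {
    val amax = values.maxOf { abs(it) }
    val scale = if (amax == 0f) 1f else amax / 127f
    val codes = ByteArray(values.size) { i ->
        (values[i] / scale).roundToInt().coerceIn(-127, 127).toByte()
    }
    return scale to codes
}

fun main() {
    val reference = FloatArray(32) { (it - 16) * 0.1f }
    val (scale, codes) = quantizeQ8_0(reference)
    val roundTrip = FloatArray(codes.size) { scale * codes[it] }
    for (i in reference.indices) {
        // Each value must land within half a quantization step
        // of the source (with a little float slack).
        check(abs(reference[i] - roundTrip[i]) <= scale * 0.51f)
    }
}
```

In the real test, `roundTrip` would instead come from parsing the emitted constant body back out of the MLIR and comparing against the reference dequantized `FloatArray`.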

Out of scope

  • Option B / C. Follow-ups.
  • Per-channel or per-block quant dialect emission.
  • Changing `TensorEncoding` or `TensorSpec`.

Relationship to other IREE work

Second of two gaps surfaced in the 2026-04-13 audit. The other gap ("TensorEncoding not propagated through intermediate ops") is tracked in its own issue. Both are tagged into the project at https://github.com/orgs/SKaiNET-developers/projects/1.
