Design: externalize weights via IREE parameter archive (supersedes #519) #523

@michalharakal

Problem

ConstantOperationsConverter inlines every weight tensor as stablehlo.constant dense<[[...]]>. For Whisper-tiny.en this produces a 151 MB text MLIR file; iree-compile takes minutes to parse, and larger models would be gigabytes. See #519 for the original bug report.

Principle

The HLO converter never writes numerical weight bytes. It emits symbolic references. Bytes live wherever the TensorSpec says they live, are packaged by a dedicated IO module, and are consumed by IREE's existing parameter loader (util.parameter.load + .irpa archive, --iree-opt-import-parameters=<path>).

This principle removes the entire 151 MB problem and every future variant. No new MLIR dialect, no custom weight format, no duplication of work IREE already does.

Architecture fit

SKaiNET already has the right layers. We only add one policy seam and one packaging module:

| Concern | Layer | Piece |
| --- | --- | --- |
| Symbolic name for a weight | skainet-lang | TensorSpec.name (exists) |
| Physical encoding (Dense / Q4_K / Q8_0 / TurboQuant / TernaryPacked) | skainet-lang-core | TensorEncoding (exists) |
| Bytes | skainet-lang-core | BufferHandle (exists) |
| "Inline vs external" decision | new, skainet-compile-core | ConstantMaterializationPolicy |
| MLIR emission | skainet-compile-hlo | ConstantOperationsConverter consults policy |
| .irpa packaging | new, skainet-io-iree-params | peer of skainet-io-gguf |
| mmap-backed loaders | skainet-io-gguf, skainet-io-safetensors | return BufferHandle windows, not copies |
| End-to-end orchestration | caller (e.g. skainet-whisper) | converter → packager → iree-compile |

MLIR emission shape

Today:

%w = stablehlo.constant dense<[[0.0, 1.2, 3.4, ...]]> : tensor<384x384xf32>

With this design:

%w = util.global.load.indirect @model::@encoder.layer0.attn_q : tensor<384x384xf32>

This is a native IREE mechanism: the reference is resolved at compile time against --iree-opt-import-parameters=model.irpa.

Converter output contract (one field added)

public data class StableHloModule(
    val content: String,
    // ... existing fields
    val externalParameters: List<ExternalParameterRef> = emptyList(),
)

public data class ExternalParameterRef(
    val scope: String,             // e.g. "model"
    val key: String,               // spec.name
    val encoding: TensorEncoding,  // for .irpa header
    val source: BufferHandle,      // converter does NOT copy bytes
)

The converter sees specs + a name-addressable byte source. It never handles ByteArray.
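A minimal sketch of how the external path might record a reference while emitting the symbolic load. `TensorEncoding` and `BufferHandle` are reduced to stand-ins here, and `ExternalConstantEmitter` is a hypothetical name, not an existing SKaiNET class. The emitted line reproduces the emission shape from this issue verbatim and has not been checked against IREE's parser.

```kotlin
// Stand-ins for the real SKaiNET types; only the shape matters for this sketch.
enum class TensorEncoding { Dense, Q4_K, Q8_0 }
class BufferHandle(val sizeInBytes: Long)

data class ExternalParameterRef(
    val scope: String,
    val key: String,
    val encoding: TensorEncoding,
    val source: BufferHandle,
)

// Hypothetical collector: emits the symbolic load and remembers where the bytes live.
class ExternalConstantEmitter(private val scope: String = "model") {
    private val refs = mutableListOf<ExternalParameterRef>()
    val externalParameters: List<ExternalParameterRef> get() = refs

    fun emit(
        ssaName: String,
        key: String,
        mlirType: String,
        encoding: TensorEncoding,
        source: BufferHandle,
    ): String {
        // The handle is passed through untouched; no bytes are read or copied here.
        refs += ExternalParameterRef(scope, key, encoding, source)
        // Emission shape copied from the issue text.
        return "%$ssaName = util.global.load.indirect @$scope::@$key : $mlirType"
    }
}
```

After conversion, `externalParameters` would be copied into the new `StableHloModule` field, so the packager can find every byte source without the converter ever touching a `ByteArray`.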

Policy seam

public sealed interface ConstantMaterializationPolicy {
    public object InlineAlways : ConstantMaterializationPolicy   // default, no behavior change
    public object ExternalAlways : ConstantMaterializationPolicy // production
    public data class SizeThreshold(val bytes: Long) : ConstantMaterializationPolicy
}

Default InlineAlways means PR B ships with zero behavior change — it's purely the seam.
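The decision the converter has to make per constant could look roughly like this. `shouldExternalize` is a hypothetical helper, and treating a constant whose size equals the threshold as external is an assumption; the issue does not specify the boundary.

```kotlin
// Mirrors the sealed interface above so the sketch compiles on its own.
sealed interface ConstantMaterializationPolicy {
    object InlineAlways : ConstantMaterializationPolicy
    object ExternalAlways : ConstantMaterializationPolicy
    data class SizeThreshold(val bytes: Long) : ConstantMaterializationPolicy
}

// Hypothetical helper: true when a constant of `sizeInBytes` should become
// an external parameter reference instead of an inline dense<...> literal.
fun shouldExternalize(policy: ConstantMaterializationPolicy, sizeInBytes: Long): Boolean =
    when (policy) {
        is ConstantMaterializationPolicy.InlineAlways -> false
        is ConstantMaterializationPolicy.ExternalAlways -> true
        // Assumption: constants at or above the threshold go external.
        is ConstantMaterializationPolicy.SizeThreshold -> sizeInBytes >= policy.bytes
    }
```

For the tensor<384x384xf32> weight used as the example in this issue (384 * 384 * 4 = 589,824 bytes), `SizeThreshold(1024)` would externalize it, while `SizeThreshold(1_048_576)` would keep it inline.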

PR breakdown

PR A — splat-for-uniform (in review: #522)

Polish on the inline path. Uniform value lists collapse to dense<v>. Small. Cosmetic; does not solve #519 on its own.
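As an illustration of the splat idea only (this is not the code in #522), a uniform value list can be detected and collapsed as sketched below. `denseAttribute` is a hypothetical helper that handles a flat list of floats; the real converter also has to print nested brackets for higher-rank tensors.

```kotlin
// Hypothetical helper: renders the dense<...> attribute body for a flat list of values.
fun denseAttribute(values: List<Float>): String {
    require(values.isNotEmpty()) { "a constant needs at least one element" }
    val first = values.first()
    // Compare bit patterns so that 0.0 and -0.0 are not merged into one splat.
    val uniform = values.all { it.toRawBits() == first.toRawBits() }
    return if (uniform) {
        "dense<$first>" // splat form: one value, the shape comes from the tensor type
    } else {
        values.joinToString(prefix = "dense<[", postfix = "]>")
    }
}
```

A zero-initialized bias of any length then prints as `dense<0.0>`, while a list with at least two distinct values still prints every element.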

PR B — the seam

Introduce ConstantMaterializationPolicy in skainet-compile-core. Add ExternalParameterRef to StableHloModule. Teach ConstantOperationsConverter to consult the policy; emit util.global.load.indirect + record an ExternalParameterRef when external, otherwise fall through to today's inline path. Default InlineAlways → no observable change. Medium. Unlocks everything else.

PR C — .irpa packager

New module skainet-io-iree-params, peer of skainet-io-gguf. Stream-writes an IREE parameter archive from a list of ExternalParameterRef. Honors each TensorEncoding so Q4_K / Q8_0 blocks stay packed verbatim — zero re-quantization. Medium; most of the work is understanding IREE's .irpa format (stable and documented).

PR D — caller wiring in skainet-whisper

Flip policy to ExternalAlways, pipe externalParameters into the .irpa writer, invoke iree-compile --iree-opt-import-parameters=model.irpa. Small. Expected outcome: MLIR text 151 MB → <1 MB, parse time minutes → seconds.
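The last step of that wiring might be sketched as follows on a JVM target. `ireeCompileCommand` and `runCommand` are hypothetical helpers; only the --iree-opt-import-parameters flag is taken from this issue, and any target or backend flags a real build needs are assumed to be supplied by the caller through `extraFlags`.

```kotlin
// Hypothetical helper: assembles the argument list for the compile step.
// Backend/target flags are deployment-specific and pass through `extraFlags` untouched.
fun ireeCompileCommand(
    mlirPath: String,
    irpaPath: String,
    outputPath: String,
    extraFlags: List<String> = emptyList(),
): List<String> =
    listOf("iree-compile", mlirPath, "--iree-opt-import-parameters=$irpaPath") +
        extraFlags +
        listOf("-o", outputPath)

// Runs the command with inherited stdio and returns the process exit code.
fun runCommand(command: List<String>): Int =
    ProcessBuilder(command).inheritIO().start().waitFor()
```

Keeping command construction separate from execution lets the caller unit-test the flag wiring without having iree-compile installed.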

PR E — mmap-backed loaders in skainet-io-gguf and skainet-io-safetensors

Teach both loaders to yield BufferHandles that are mmap windows into the source file rather than owned copies. The .irpa writer then blits those windows verbatim — no copy, no parse, no re-quantization. Backend-agnostic ingestion-side win.

Feasibility note: safetensors is designed for zero-copy mmap (its whole selling point), so that side is trivial. GGUF requires the window to respect the file's block layout (e.g. Q4_K's 144-byte blocks), but each tensor is a contiguous region, so mmap windowing is natural. Both are realistic.
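On a JVM target, a window of this kind could be obtained roughly as below. This is a sketch under assumptions: `mapTensorWindow` and `q4kByteLength` are hypothetical helpers, the tensor's offset is assumed to come from the loader's existing header parsing, and non-JVM targets would need a platform-specific equivalent.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Hypothetical helper: a read-only view of [offset, offset + length) in `file`,
// backed by the OS page cache rather than a heap copy.
fun mapTensorWindow(file: Path, offset: Long, length: Long): ByteBuffer =
    FileChannel.open(file, StandardOpenOption.READ).use { channel ->
        require(offset >= 0 && length >= 0 && offset + length <= channel.size()) {
            "window [$offset, ${offset + length}) exceeds file size ${channel.size()}"
        }
        // A mapping stays valid after the channel that created it is closed.
        channel.map(FileChannel.MapMode.READ_ONLY, offset, length)
    }

// Q4_K packs 256 weights into one 144-byte block, so the packed byte length
// of a tensor follows directly from its element count.
fun q4kByteLength(elementCount: Long): Long {
    require(elementCount % 256L == 0L) { "Q4_K needs a multiple of 256 elements" }
    return elementCount / 256L * 144L
}
```

For a 384x384 tensor stored as Q4_K, `q4kByteLength(384L * 384L)` gives 82,944 bytes, which is the window length a writer would blit verbatim.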

What this is NOT

  • Not a new MLIR dialect or attribute.
  • Not a custom weight format. .irpa is IREE's, stable, documented.
  • Not a rewrite of ConstantOperationsConverter — the inline path stays as the InlineAlways mode.
  • Not tied to Whisper — any SKaiNET → IREE export benefits once the caller flips the policy.

Migration order and parallelism

Hard ordering: B before C, D, E (it's the seam).

After B ships, C / D / E can proceed in parallel because they touch disjoint modules. D can only be exercised end to end once C's .irpa writer exists.

Supersedes

This issue supersedes #519. The original bug report documents the 151 MB symptom; this issue documents the architectural answer. When B + C + D land, #519 closes.
