Problem
ConstantOperationsConverter inlines every weight tensor as stablehlo.constant dense<[[...]]>. For Whisper-tiny.en this produces a 151 MB text MLIR file; iree-compile takes minutes to parse, and larger models would be gigabytes. See #519 for the original bug report.
Principle
The HLO converter never writes numerical weight bytes. It emits symbolic references. Bytes live wherever the TensorSpec says they live, are packaged by a dedicated IO module, and are consumed by IREE's existing parameter loader (util.parameter.load + .irpa archive, --iree-opt-import-parameters=<path>).
This principle removes the entire 151 MB problem and every future variant. No new MLIR dialect, no custom weight format, no duplication of work IREE already does.
Architecture fit
SKaiNET already has the right layers. We only add one policy seam and one packaging module:
| Concern | Layer | Piece |
|---|---|---|
| Symbolic name for a weight | skainet-lang | TensorSpec.name (exists) |
| Physical encoding (Dense / Q4_K / Q8_0 / TurboQuant / TernaryPacked) | skainet-lang-core | TensorEncoding (exists) |
| Bytes | skainet-lang-core | BufferHandle (exists) |
| "Inline vs external" decision | new, skainet-compile-core | ConstantMaterializationPolicy |
| MLIR emission | skainet-compile-hlo | ConstantOperationsConverter consults policy |
| .irpa packaging | new, skainet-io-iree-params | peer of skainet-io-gguf |
| mmap-backed loaders | skainet-io-gguf, skainet-io-safetensors | return BufferHandle windows, not copies |
| End-to-end orchestration | caller (e.g. skainet-whisper) | converter → packager → iree-compile |
MLIR emission shape
Today:
%w = stablehlo.constant dense<[[0.0, 1.2, 3.4, ...]]> : tensor<384x384xf32>
With this design:
%w = util.global.load.indirect @model::@encoder.layer0.attn_q : tensor<384x384xf32>
Native IREE mechanism. Resolved at compile time against --iree-opt-import-parameters=model.irpa.
Converter output contract (one field added)
public data class StableHloModule(
    val content: String,
    // ... existing fields
    val externalParameters: List<ExternalParameterRef> = emptyList(),
)

public data class ExternalParameterRef(
    val scope: String,            // e.g. "model"
    val key: String,              // spec.name
    val encoding: TensorEncoding, // for .irpa header
    val source: BufferHandle,     // converter does NOT copy bytes
)
The converter sees specs + a name-addressable byte source. It never handles ByteArray.
Policy seam
public sealed interface ConstantMaterializationPolicy {
    public object InlineAlways : ConstantMaterializationPolicy   // default, no behavior change
    public object ExternalAlways : ConstantMaterializationPolicy // production
    public data class SizeThreshold(val bytes: Long) : ConstantMaterializationPolicy
}
Default InlineAlways means PR B ships with zero behavior change — it's purely the seam.
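One way the converter could turn the policy into a per-constant decision is sketched below. The `shouldExternalize` helper is hypothetical (it is not part of the proposal), and the rule that a constant at or above the threshold goes external is an assumption; the issue does not say which side of the boundary is inclusive.

```kotlin
// Mirrors the sealed interface above so the sketch compiles on its own.
sealed interface ConstantMaterializationPolicy {
    object InlineAlways : ConstantMaterializationPolicy
    object ExternalAlways : ConstantMaterializationPolicy
    data class SizeThreshold(val bytes: Long) : ConstantMaterializationPolicy
}

// Hypothetical helper: true when the constant should be emitted as an
// external parameter reference instead of an inline dense<...> literal.
fun shouldExternalize(policy: ConstantMaterializationPolicy, sizeInBytes: Long): Boolean =
    when (policy) {
        ConstantMaterializationPolicy.InlineAlways -> false
        ConstantMaterializationPolicy.ExternalAlways -> true
        // Assumption: constants at or above the threshold go external.
        is ConstantMaterializationPolicy.SizeThreshold -> sizeInBytes >= policy.bytes
    }

fun main() {
    val weightBytes = 384L * 384L * 4L // one 384x384 f32 tensor
    println(shouldExternalize(ConstantMaterializationPolicy.InlineAlways, weightBytes))
    println(shouldExternalize(ConstantMaterializationPolicy.SizeThreshold(1024), weightBytes))
}
```

Because the `when` is exhaustive over the sealed interface, adding a new policy variant later becomes a compile error at this call site rather than a silent fallthrough.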
PR breakdown
PR A — splat-for-uniform (in review: #522)
Polish on the inline path. Uniform value lists collapse to dense<v>. Small. Cosmetic; does not solve #519 on its own.
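A minimal sketch of the splat collapse, assuming a flat list of f32 values. The `denseAttribute` name is hypothetical and this is not the code in #522; the non-uniform branch only renders a rank-1 literal, whereas the real converter nests brackets per dimension.

```kotlin
// Hypothetical helper: renders the dense<...> attribute for a flat list of
// f32 values, collapsing uniform lists to the splat form dense<v>.
fun denseAttribute(values: FloatArray): String {
    require(values.isNotEmpty()) { "constant must have at least one element" }
    val first = values[0]
    // Compare raw bits so that 0.0 and -0.0 (or different NaN payloads)
    // are not treated as the same value.
    val uniform = values.all { it.toRawBits() == first.toRawBits() }
    return if (uniform) {
        "dense<$first>"
    } else {
        // Rank-1 rendering only; nested shapes are out of scope for this sketch.
        values.joinToString(prefix = "dense<[", postfix = "]>")
    }
}

fun main() {
    println(denseAttribute(FloatArray(4) { 0.0f }))
    println(denseAttribute(floatArrayOf(0.0f, 1.5f)))
}
```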
PR B — the seam
Introduce ConstantMaterializationPolicy in skainet-compile-core. Add ExternalParameterRef to StableHloModule. Teach ConstantOperationsConverter to consult the policy; emit util.global.load.indirect + record an ExternalParameterRef when external, otherwise fall through to today's inline path. Default InlineAlways → no observable change. Medium. Unlocks everything else.
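The emit-and-record step could look roughly like the sketch below. `ParamRef` and `ConstantEmitter` are simplified, hypothetical stand-ins (the real ExternalParameterRef also carries a TensorEncoding and a BufferHandle), and the emitted text simply copies the op shape quoted under "MLIR emission shape".

```kotlin
// Simplified stand-in for ExternalParameterRef; encoding and BufferHandle
// are omitted so the sketch compiles on its own.
data class ParamRef(val scope: String, val key: String)

class ConstantEmitter(private val scope: String = "model") {
    val externalParameters = mutableListOf<ParamRef>()

    // inlineBody is a lambda so the external path never materializes values.
    fun emit(
        ssa: String,
        key: String,
        type: String,
        external: Boolean,
        inlineBody: () -> String,
    ): String =
        if (external) {
            // Record the reference; bytes stay wherever they already live.
            externalParameters += ParamRef(scope, key)
            "$ssa = util.global.load.indirect @$scope::@$key : $type"
        } else {
            // Today's inline path, unchanged.
            "$ssa = stablehlo.constant ${inlineBody()} : $type"
        }
}

fun main() {
    val emitter = ConstantEmitter()
    val line = emitter.emit("%w", "encoder.layer0.attn_q", "tensor<384x384xf32>", external = true) {
        error("inline body must not be evaluated on the external path")
    }
    println(line)
    println(emitter.externalParameters)
}
```

Passing the inline body lazily is one way to keep the "converter never handles ByteArray" guarantee structural rather than a convention.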
PR C — .irpa packager
New module skainet-io-iree-params, peer of skainet-io-gguf. Stream-writes an IREE parameter archive from a list of ExternalParameterRef. Honors each TensorEncoding so Q4_K / Q8_0 blocks stay packed verbatim — zero re-quantization. Medium; most of the work is understanding IREE's .irpa format (stable and documented).
PR D — caller wiring in skainet-whisper
Flip policy to ExternalAlways, pipe externalParameters into the .irpa writer, invoke iree-compile --iree-opt-import-parameters=model.irpa. Small. Expected outcome: MLIR text 151 MB → <1 MB, parse time minutes → seconds.
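For the last step of that wiring, a caller might assemble the iree-compile invocation as sketched here. `ireeCompileCommand` and the file names are hypothetical; only the --iree-opt-import-parameters flag comes from this issue, and target-backend flags are left to the caller via `extraArgs`.

```kotlin
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical helper: builds the argument list for iree-compile.
fun ireeCompileCommand(
    mlir: Path,
    irpa: Path,
    output: Path,
    extraArgs: List<String> = emptyList(),
): List<String> =
    listOf(
        "iree-compile",
        mlir.toString(),
        "--iree-opt-import-parameters=$irpa",
        "-o", output.toString(),
    ) + extraArgs

fun main() {
    val cmd = ireeCompileCommand(
        Paths.get("whisper.mlir"),
        Paths.get("model.irpa"),
        Paths.get("whisper.vmfb"),
    )
    println(cmd.joinToString(" "))
    // To actually run it (requires iree-compile on PATH):
    // val exitCode = ProcessBuilder(cmd).inheritIO().start().waitFor()
}
```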
PR E — mmap-backed loaders in skainet-io-gguf and skainet-io-safetensors
Teach both loaders to yield BufferHandles that are mmap windows into the source file rather than owned copies. The .irpa writer then blits those windows verbatim — no copy, no parse, no re-quantization. Backend-agnostic ingestion-side win.
Feasibility note: safetensors is designed for zero-copy mmap (its whole selling point); trivial. GGUF requires honoring the file's chunked block layout (e.g. Q4_K's 144-byte blocks) — each tensor is a contiguous region and mmap windowing is natural. Both are realistic.
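On the JVM, a window of this kind can be sketched with `FileChannel.map`. This assumes a JVM target and a known byte offset and length per tensor; `mapTensorWindow` is a hypothetical name, and how the result would be wrapped in a BufferHandle is not shown. A single mapping is limited to Int.MAX_VALUE bytes, so very large tensors would need several windows.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Hypothetical helper: maps one tensor's byte range read-only.
// No tensor bytes are copied onto the heap.
fun mapTensorWindow(file: Path, offset: Long, length: Long): ByteBuffer =
    FileChannel.open(file, StandardOpenOption.READ).use { channel ->
        require(offset >= 0 && length >= 0 && offset + length <= channel.size()) {
            "window [$offset, ${offset + length}) is outside the file"
        }
        // The mapping remains valid after the channel is closed.
        channel.map(FileChannel.MapMode.READ_ONLY, offset, length)
    }

fun main() {
    // Stand-in for a weight file: 16 bytes holding the values 0..15.
    val tmp = Files.createTempFile("weights", ".bin")
    tmp.toFile().deleteOnExit()
    Files.write(tmp, ByteArray(16) { it.toByte() })

    val window = mapTensorWindow(tmp, offset = 4, length = 8)
    println("${window.remaining()} bytes, first = ${window.get(0)}")
}
```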
What this is NOT
- Not a new MLIR dialect or attribute.
- Not a custom weight format. .irpa is IREE's, stable, documented.
- Not a rewrite of ConstantOperationsConverter; the inline path stays as the InlineAlways mode.
- Not tied to Whisper — any SKaiNET → IREE export benefits once the caller flips the policy.
Migration order and parallelism
Hard ordering: B before C, D, E (it's the seam).
After B ships, C / D / E can proceed in parallel since they touch disjoint modules. D's final wiring calls the .irpa writer from C, so D can only land once C has.
Supersedes
This issue supersedes #519. The original bug report documents the 151 MB symptom; this issue documents the architectural answer. When B + C + D land, #519 closes.