Design: externalize weights via IREE parameter archive (supersedes #519) #523

## Problem

`ConstantOperationsConverter` inlines every weight tensor as `stablehlo.constant dense<[[...]]>`. For Whisper-tiny.en this produces a **151 MB** text MLIR file that `iree-compile` takes minutes to parse; larger models would produce gigabytes of text. See #519 for the original bug report.

## Principle

**The HLO converter never writes numerical weight bytes.** It emits symbolic references. Bytes live wherever the `TensorSpec` says they live, are packaged by a dedicated IO module, and are consumed by IREE's existing parameter loader (`util.parameter.load` + `.irpa` archive, `--iree-opt-import-parameters=<path>`).

This principle removes the entire 151 MB problem and every future variant. No new MLIR dialect, no custom weight format, no duplication of work IREE already does.

## Architecture fit

SKaiNET already has the right layers. We only add one policy seam and one packaging module:

| Concern | Layer | Piece |
|---|---|---|
| Symbolic name for a weight | `skainet-lang` | `TensorSpec.name` (exists) |
| Physical encoding (Dense / Q4_K / Q8_0 / TurboQuant / TernaryPacked) | `skainet-lang-core` | `TensorEncoding` (exists) |
| Bytes | `skainet-lang-core` | `BufferHandle` (exists) |
| **"Inline vs external" decision** | **new, `skainet-compile-core`** | `ConstantMaterializationPolicy` |
| MLIR emission | `skainet-compile-hlo` | `ConstantOperationsConverter` consults policy |
| **`.irpa` packaging** | **new, `skainet-io-iree-params`** | peer of `skainet-io-gguf` |
| mmap-backed loaders | `skainet-io-gguf`, `skainet-io-safetensors` | return `BufferHandle` windows, not copies |
| End-to-end orchestration | caller (e.g. `skainet-whisper`) | converter → packager → `iree-compile` |

## MLIR emission shape

Today:
```mlir
%w = stablehlo.constant dense<[[0.0, 1.2, 3.4, ...]]> : tensor<384x384xf32>
```

With this design:
```mlir
%w = util.global.load.indirect @model::@encoder.layer0.attn_q : tensor<384x384xf32>
```

This is a native IREE mechanism. The reference is resolved at compile time against `--iree-opt-import-parameters=model.irpa`.

## Converter output contract (one field added)

```kotlin
public data class StableHloModule(
    val content: String,
    // ... existing fields
    val externalParameters: List<ExternalParameterRef> = emptyList(),
)

public data class ExternalParameterRef(
    val scope: String,             // e.g. "model"
    val key: String,               // spec.name
    val encoding: TensorEncoding,  // for .irpa header
    val source: BufferHandle,      // converter does NOT copy bytes
)
```

The converter sees specs + a name-addressable byte source. It never handles `ByteArray`.

## Policy seam

```kotlin
public sealed interface ConstantMaterializationPolicy {
    public object InlineAlways : ConstantMaterializationPolicy   // default, no behavior change
    public object ExternalAlways : ConstantMaterializationPolicy // production
    public data class SizeThreshold(val bytes: Long) : ConstantMaterializationPolicy
}
```

Default `InlineAlways` means **PR B ships with zero behavior change** — it's purely the seam.
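
A minimal sketch of how a converter might evaluate the seam. The policy type is repeated so the snippet compiles on its own. The helper `shouldExternalize` and its "at or above the threshold goes external" semantics are assumptions for illustration, not part of the proposed API.

```kotlin
sealed interface ConstantMaterializationPolicy {
    object InlineAlways : ConstantMaterializationPolicy
    object ExternalAlways : ConstantMaterializationPolicy
    data class SizeThreshold(val bytes: Long) : ConstantMaterializationPolicy
}

// Hypothetical helper: true when a constant of this size should leave the MLIR text.
// Assumption: SizeThreshold externalizes constants whose size is >= the threshold.
fun shouldExternalize(policy: ConstantMaterializationPolicy, sizeInBytes: Long): Boolean =
    when (policy) {
        ConstantMaterializationPolicy.InlineAlways -> false
        ConstantMaterializationPolicy.ExternalAlways -> true
        is ConstantMaterializationPolicy.SizeThreshold -> sizeInBytes >= policy.bytes
    }
```

With a `tensor<384x384xf32>` weight (384 * 384 * 4 bytes), `InlineAlways` keeps today's behavior, while `SizeThreshold(1024)` would externalize it and still leave small scalars and biases inline.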

## PR breakdown

### PR A — splat-for-uniform (in review: #522)
Polish on the inline path. Uniform value lists collapse to `dense<v>`. Small. Cosmetic; does not solve #519 on its own.
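
For illustration only, the splat collapse could look roughly like the sketch below. `splatOrNull` is a hypothetical name, and it is not taken from #522. It returns `null` whenever the data is not uniform so the caller can fall through to the existing inline path.

```kotlin
// Hypothetical helper: collapses a uniform f32 constant to the splat form `dense<v>`.
// Returns null when the values differ (or are not finite), leaving the caller
// to emit the full element list as it does today.
fun splatOrNull(values: FloatArray, tensorType: String): String? {
    if (values.isEmpty()) return null
    val first = values[0]
    // NaN and infinities do not print as plain decimal literals, so skip them here.
    if (!first.isFinite()) return null
    // Compare raw bits so 0.0 and -0.0 are not merged into one splat.
    val firstBits = first.toRawBits()
    if (values.any { it.toRawBits() != firstBits }) return null
    return "stablehlo.constant dense<$first> : $tensorType"
}
```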

### PR B — the seam
Introduce `ConstantMaterializationPolicy` in `skainet-compile-core`. Add `ExternalParameterRef` to `StableHloModule`. Teach `ConstantOperationsConverter` to consult the policy; emit `util.global.load.indirect` + record an `ExternalParameterRef` when external, otherwise fall through to today's inline path. Default `InlineAlways` → no observable change. **Medium. Unlocks everything else.**
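
A rough sketch of the branch this PR adds, under simplifying assumptions: `ParamRefSketch` and `ConstantEmitterSketch` are hypothetical stand-ins (the real `ExternalParameterRef` also carries an encoding and a `BufferHandle`), the policy decision arrives as a plain `Boolean`, and the external line simply reuses the emission shape shown earlier in this issue. The inline literal is passed as a lambda so that weight values are never rendered to text on the external path.

```kotlin
// Stand-in for the proposed ExternalParameterRef, reduced to what this sketch needs.
data class ParamRefSketch(val scope: String, val key: String)

// Hypothetical emitter: `externalize` is the already-evaluated policy decision.
class ConstantEmitterSketch(private val scope: String = "model") {
    // References recorded for the packaging step; no bytes are stored here.
    val recorded = mutableListOf<ParamRefSketch>()

    fun emit(
        ssaName: String,
        key: String,
        tensorType: String,
        externalize: Boolean,
        inlineLiteral: () -> String, // only invoked on the inline path
    ): String =
        if (externalize) {
            recorded += ParamRefSketch(scope, key)
            "%$ssaName = util.global.load.indirect @$scope::@$key : $tensorType"
        } else {
            "%$ssaName = stablehlo.constant dense<${inlineLiteral()}> : $tensorType"
        }
}
```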

### PR C — `.irpa` packager
New module `skainet-io-iree-params`, peer of `skainet-io-gguf`. Stream-writes an IREE parameter archive from a list of `ExternalParameterRef`. Honors each `TensorEncoding` so Q4_K / Q8_0 blocks stay packed verbatim — zero re-quantization. Medium; most of the work is understanding IREE's `.irpa` format (stable and documented).
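
Whatever the exact header layout turns out to be, a streaming writer needs to know where each payload lands before it writes any bytes. The sketch below shows only that offset bookkeeping. The types, the function names, and the 64-byte alignment are placeholders for illustration; the real header fields and alignment rules have to come from IREE's `.irpa` documentation.

```kotlin
// Hypothetical input: one entry per external parameter, sized in packed bytes
// (for quantized encodings this is the block-packed size, not element count * 4).
data class EntrySketch(val key: String, val byteLength: Long)

// Hypothetical output: where each payload would start in the data segment.
data class PlacementSketch(val key: String, val offset: Long, val byteLength: Long)

// Rounds `value` up to the next multiple of a power-of-two `alignment`.
fun alignUp(value: Long, alignment: Long): Long {
    require(alignment > 0 && (alignment and (alignment - 1)) == 0L) { "alignment must be a power of two" }
    return (value + alignment - 1) and (alignment - 1).inv()
}

// Assigns offsets in list order; 64 is a placeholder alignment, not IREE's rule.
fun planPayload(entries: List<EntrySketch>, alignment: Long = 64L): List<PlacementSketch> {
    var cursor = 0L
    return entries.map { entry ->
        val offset = alignUp(cursor, alignment)
        cursor = offset + entry.byteLength
        PlacementSketch(entry.key, offset, entry.byteLength)
    }
}
```

Planning offsets up front lets the writer emit the header first and then copy each source buffer straight to its slot, without holding any tensor in memory.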

### PR D — caller wiring in skainet-whisper
Flip policy to `ExternalAlways`, pipe `externalParameters` into the `.irpa` writer, invoke `iree-compile --iree-opt-import-parameters=model.irpa`. Small. Expected outcome: MLIR text 151 MB → <1 MB, parse time minutes → seconds.
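
As a sketch of the final step, the caller could assemble the `iree-compile` invocation like this. `ireeCompileCommand` and `runIreeCompile` are hypothetical helpers. Target and backend flags are deliberately left to `extraArgs`, since they depend on the deployment and are not specified in this issue.

```kotlin
import java.nio.file.Path

// Hypothetical helper: builds the argument list for iree-compile.
// Only the parameter-import flag comes from this design; anything else
// (target backends etc.) is supplied by the caller through extraArgs.
fun ireeCompileCommand(
    mlir: Path,
    irpa: Path,
    output: Path,
    extraArgs: List<String> = emptyList(),
): List<String> =
    listOf("iree-compile", mlir.toString(), "--iree-opt-import-parameters=$irpa") +
        extraArgs +
        listOf("-o", output.toString())

// Runs the command, forwarding compiler output to this process; returns the exit code.
fun runIreeCompile(command: List<String>): Int =
    ProcessBuilder(command).inheritIO().start().waitFor()
```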

### PR E — mmap-backed loaders in skainet-io-gguf and skainet-io-safetensors
Teach both loaders to yield `BufferHandle`s that are `mmap` windows into the source file rather than owned copies. The `.irpa` writer then blits those windows verbatim — **no copy, no parse, no re-quantization**. Backend-agnostic ingestion-side win.

Feasibility note: **safetensors** is designed for zero-copy mmap (its whole selling point), so this is trivial. **GGUF** requires honoring the file's chunked block layout (e.g. Q4_K's 144-byte blocks), but each tensor is a contiguous region, so mmap windowing is natural. Both are realistic.
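
A JVM-only sketch of what an mmap window and a verbatim blit might look like, using `java.nio`. The helper names are hypothetical, and how a `BufferHandle` would wrap the mapped buffer is left open. The Q4_K size helper relies on the 144-byte block mentioned above holding 256 weights.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Hypothetical helper: maps one tensor's byte range read-only, without copying it to the heap.
// Note: a single FileChannel.map call is limited to Int.MAX_VALUE bytes.
fun mapWindow(file: Path, offset: Long, length: Long): ByteBuffer =
    FileChannel.open(file, StandardOpenOption.READ).use { channel ->
        require(offset >= 0 && length >= 0 && offset + length <= channel.size()) { "window out of bounds" }
        // The mapping stays valid after the channel is closed.
        channel.map(FileChannel.MapMode.READ_ONLY, offset, length)
    }

// Packed size of a Q4_K tensor: 256 weights per 144-byte block.
fun q4kByteLength(elementCount: Long): Long {
    require(elementCount % 256L == 0L) { "Q4_K tensors hold whole blocks of 256 weights" }
    return elementCount / 256L * 144L
}

// Copies a window to the output channel as-is: no parsing, no re-quantization.
fun blit(window: ByteBuffer, out: FileChannel) {
    val view = window.duplicate() // leave the caller's position untouched
    while (view.hasRemaining()) out.write(view)
}
```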

## What this is NOT

- Not a new MLIR dialect or attribute.
- Not a custom weight format. `.irpa` is IREE's, stable, documented.
- Not a rewrite of `ConstantOperationsConverter` — the inline path stays as the `InlineAlways` mode.
- Not tied to Whisper — any SKaiNET → IREE export benefits once the caller flips the policy.

## Migration order and parallelism

Hard ordering: **B before C, D, E** (it's the seam).

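After B ships, C, D, and E can be developed in parallel because they touch disjoint modules. D consumes the `.irpa` writer from C, so D can only be validated end to end once C has landed.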

## Supersedes

This issue supersedes #519. The original bug report documents the 151 MB symptom; this issue documents the architectural answer. When B + C + D land, #519 closes.
