From 576570a604cf8ee7875d79c180f2eb5e96e450c1 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 13:29:29 +0200
Subject: [PATCH 1/9] fix: remove pre-transpose in LlamaRuntime to halve peak memory

LlamaRuntime previously pre-transposed all weight tensors at init, doubling
peak memory (~31GB for 8B models). This caused OOM on 48GB machines when
loading Qwen3-8B-Q4_K_M.

Replace eager pre-transpose with inline .t() calls during forward pass. The
GC reclaims each temporary transpose, so only one projection's worth of
memory (~200MB) is live at a time instead of the full set.

Before: ~62GB peak (31GB weights + 31GB transposed copies)
After:  ~31GB peak (weights only + 200MB temp per layer)

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 ISSUE-skainet-8b-oom.md                            | 113 ++++++++++++++++++
 .../sk/ainet/models/llama/LlamaRuntime.kt          |  62 ++++------
 2 files changed, 133 insertions(+), 42 deletions(-)
 create mode 100644 ISSUE-skainet-8b-oom.md

diff --git a/ISSUE-skainet-8b-oom.md b/ISSUE-skainet-8b-oom.md
new file mode 100644
index 0000000..6977ec9
--- /dev/null
+++ b/ISSUE-skainet-8b-oom.md
@@ -0,0 +1,113 @@
+# Issue: Qwen3-8B OOM on 48GB Mac
+
+## Problem
+
+Running Qwen3-8B-Q4_K_M.gguf (4.7GB on disk) on a 48GB Mac fails with OOM during weight loading, both via kllama and the unified skainet CLI.
+
+## Root Cause
+
+The current loading path uses `DEQUANTIZE_TO_FP32`, which expands Q4 weights 8x:
+
+| Component                 | Size       |
+|---------------------------|------------|
+| Quantized weights (disk)  | 4.7 GB     |
+| Dequantized FP32 weights  | ~37-40 GB  |
+| KV cache (2048 context)   | 512 MB     |
+| Embeddings, norms         | ~1 GB      |
+| JVM + tokenizer           | ~2 GB      |
+| **Total**                 | **~41 GB** |
+
+48GB barely fits, and the JVM needs headroom for temporary buffers during dequantization, so it OOMs.
+
+## What Already Exists in the Codebase
+
+### 1. NATIVE_OPTIMIZED quant policy (best option)
+
+`QuantPolicy.NATIVE_OPTIMIZED` keeps weights in quantized form and uses SIMD-accelerated matmul kernels. `MemSegWeightConverter` converts raw Q4/Q8 bytes to 64-byte-aligned MemorySegment-backed tensors for Vector API dispatch.
+
+- Memory: ~5GB for the 8B model (vs 40GB with FP32)
+- Speed: 1-3 tok/s (proven on Qwen2/3 via kqwen runner)
+- Already works for Qwen2/3 in kllama Main.kt (the `isQwen` path)
+
+**Why it doesn't work today for the 8B:** The kllama `isQwen` path loads with `NATIVE_OPTIMIZED` but then creates `LlamaRuntime` which still transposes weight matrices to FP32 during init (`LlamaRuntime.kt:74`). This transpose step allocates FP32 copies.
+
+### 2. Lazy per-layer dequantization (Apertus pattern)
+
+`ApertusQuantizedRuntime` keeps weights quantized and dequantizes one projection at a time during `runLayer()`:
+
+```
+Resident: ~3.5GB (quantized) + ~100MB (norms/embeddings)
+Per-layer temp: ~50MB (one projection, discarded after matmul)
+```
+
+This is the llama.cpp approach. Not yet available for LLaMA/Qwen runtimes.
+
+### 3. Memory-mapped loading (F32 only)
+
+`MmapLlamaLoader` maps the GGUF file via `MappedByteBuffer` for zero-copy tensor access. Only works for F32 models — Q4 models need dequantization which defeats the zero-copy benefit.
+
+## Proposed Solutions (ordered by effort)
+
+### Solution A: Fix NATIVE_OPTIMIZED path for 8B models (small effort)
+
+The kllama Main.kt Qwen path already loads with `NATIVE_OPTIMIZED`. The problem is `LlamaRuntime` constructor transposes weights to FP32. Fix:
+
+1. Skip transpose for quantized tensors in `LlamaRuntime` init
+2. Or use `OptimizedLLMRuntime` which doesn't transpose (the DSL path)
+3. Ensure SIMD matmul kernels handle Q4_K_M format (Q4_K dispatch exists in `MemSegWeightConverter`)
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at 1-3 tok/s.
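
Option 1 above (skip the transpose when a tensor is quantized) amounts to a type check at weight-preparation time. A minimal sketch in plain Java with made-up types, not the skainet API:

```java
/**
 * Sketch: at init, only FP32 weight matrices get an eager transpose;
 * quantized weights stay in their opaque block layout for the SIMD kernel.
 * Illustrative types only — not the skainet API.
 */
final class WeightPrep {
    static final class Weight {
        final float[][] data;     // row-major [out, in] when FP32
        final boolean quantized;  // true = quantized blocks, must not be transposed
        Weight(float[][] data, boolean quantized) {
            this.data = data;
            this.quantized = quantized;
        }
    }

    /** FP32: one-time [out, in] -> [in, out] transpose. Quantized: pass through. */
    static Weight prepare(Weight w) {
        if (w.quantized) return w; // quantized matmul kernel consumes [out, in] directly
        int rows = w.data.length, cols = w.data[0].length;
        float[][] t = new float[cols][rows];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                t[c][r] = w.data[r][c];
        return new Weight(t, false);
    }
}
```

Because the quantized branch returns the same object, the ~31GB of transposed FP32 copies is never allocated for quantized projections.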

+**Files to change:**
+- `llm-inference/llama/.../LlamaRuntime.kt` -- skip transpose for quantized MemSeg tensors
+- Or migrate Qwen path in `kllama/cli/Main.kt` to `OptimizedLLMRuntime` + `llamaNetwork()`
+
+### Solution B: Port lazy dequant from Apertus to LLaMA (medium effort)
+
+Port the `ApertusQuantizedRuntime` pattern to a `LlamaQuantizedRuntime`:
+
+1. Store projections as `QuantizedTensor` (quantized bytes + metadata)
+2. In `runLayer()`, dequantize one weight matrix at a time, matmul, discard
+3. Keep embeddings and norms as FP32 (small, need element access)
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at ~1 tok/s (dequant overhead per layer).
+
+**Files to create:**
+- `llm-inference/llama/.../LlamaQuantizedRuntime.kt` (new, based on Apertus pattern)
+- `llm-runtime/kllama/.../LlamaQuantizedWeights.kt` (new, mixed storage)
+
+### Solution C: SIMD-native matmul without dequantization (larger effort, best perf)
+
+The SIMD backend (`skainet-backend-cpu`) already has Q4/Q8 matmul kernels via Vector API. The issue is the runtime layer doesn't use them directly. Changes needed in skainet core:
+
+1. `skainet-backend-cpu`: Ensure `matmul(FP32, Q4_K)` kernel exists and dispatches correctly
+2. `LlamaRuntime` or `OptimizedLLMRuntime`: Accept mixed-precision weight tensors (Q4 weights, FP32 activations)
+3. Skip the `MemSegWeightConverter` step entirely — use raw quantized MemorySegments
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at 2-5 tok/s (no dequant overhead).
+
+**Files to change (in skainet core):**
+- `skainet-backend-cpu`: Q4_K matmul kernel (may already exist)
+- `skainet-lang-core`: Mixed-precision tensor support in matmul dispatch
+
+### Solution D: Memory-mapped quantized tensors (largest effort)
+
+Extend `MmapLlamaLoader` to support quantized formats:
+
+1. Map the GGUF file to virtual memory
+2. Create quantized tensor views that reference mmap regions
+3. Dequantize on-the-fly during matmul (like lazy dequant but zero-copy from disk)
+
+**Expected result:** Load time near-zero, ~5GB virtual (OS manages paging).
+
+**Files to change:**
+- `llm-inference/llama/.../MmapLlamaLoader.kt` -- extend to Q4/Q8 formats
+- Requires `skainet-io-core` changes for mmap quantized tensor views
+
+## Recommended Path
+
+**Start with Solution A** — it's the smallest change and uses code that already works for Qwen2/3. The `NATIVE_OPTIMIZED` + `MemSegWeightConverter` path is proven; the only blocker is `LlamaRuntime`'s constructor transposing weights to FP32.
+
+If that's not enough, **add Solution B** (lazy dequant) which gives the most control over memory at a known performance cost.
+
+Solution C is the long-term goal (best performance) but requires skainet core changes.
diff --git a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
index e461142..3a0e562 100644
--- a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
+++ b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
@@ -54,29 +54,9 @@ public class LlamaRuntime(
         const val DEFAULT_BOS_TOKEN: Int = 1
     }
 
-    /** Pre-transposed weight tensors per layer — avoids re-creating lazy transpose wrappers every forward pass. */
-    private class TransposedLayerWeights(
-        val wqT: Tensor,
-        val wkT: Tensor,
-        val wvT: Tensor,
-        val woT: Tensor,
-        val ffnGateT: Tensor,
-        val ffnDownT: Tensor,
-        val ffnUpT: Tensor,
-    )
-
-    private val transposedLayers: List<TransposedLayerWeights> = weights.layers.map { layer ->
-        TransposedLayerWeights(
-            wqT = layer.wq.t(),
-            wkT = layer.wk.t(),
-            wvT = layer.wv.t(),
-            woT = layer.wo.t(),
-            ffnGateT = layer.ffnGate.t(),
-            ffnDownT = layer.ffnDown.t(),
-            ffnUpT = layer.ffnUp.t(),
-        )
-    }
-    private val outputWeightT: Tensor = weights.outputWeight.t()
+    // NOTE: weights are transposed on-the-fly during forward pass rather than
+    // pre-transposed at init. This halves peak memory (~31GB saved for 8B models)
+    // at the cost of per-token transpose allocations that the GC reclaims.
 
     // ---- DecoderRuntime abstract properties ----
     override val dim: Int = weights.metadata.embeddingLength
@@ -131,7 +111,6 @@ public class LlamaRuntime(
         embedding.forward(intArrayOf(tokenId), ctx)
 
     override fun runLayer(layerIdx: Int, x: Tensor): Tensor {
-        val tl = transposedLayers[layerIdx]
         val layer = weights.layers[layerIdx]
 
         // QKV: try compiled graph first, fall back to individual ops
@@ -140,9 +119,9 @@ public class LlamaRuntime(
         } ?: run {
             val attnNorm = attnNorms[layerIdx].forward(x, ctx)
             Triple(
-                attnNorm.matmul(tl.wqT),
-                attnNorm.matmul(tl.wkT),
-                attnNorm.matmul(tl.wvT)
+                attnNorm.matmul(layer.wq.t()),
+                attnNorm.matmul(layer.wk.t()),
+                attnNorm.matmul(layer.wv.t())
             )
         }
 
@@ -156,14 +135,14 @@ public class LlamaRuntime(
         val attnOut = attentionBackend.attention(q, k, v, layerIdx, position)
 
         // Output projection + residual
-        val afterAttn = x + attnOut.matmul(tl.woT)
+        val afterAttn = x + attnOut.matmul(layer.wo.t())
 
         // FFN: try compiled graph first, fall back to individual ops
         return graphAccelerator?.runFFN(layerIdx, afterAttn) ?: run {
             val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-            val gate = ffnNorm.matmul(tl.ffnGateT).silu()
-            val up = ffnNorm.matmul(tl.ffnUpT)
-            val ffnOut = (gate * up).matmul(tl.ffnDownT)
+            val gate = ffnNorm.matmul(layer.ffnGate.t()).silu()
+            val up = ffnNorm.matmul(layer.ffnUp.t())
+            val ffnOut = (gate * up).matmul(layer.ffnDown.t())
             afterAttn + ffnOut
         }
     }
@@ -172,7 +151,7 @@ public class LlamaRuntime(
         outputNormLayer.forward(x, ctx)
 
     override fun outputProject(x: Tensor): Tensor =
-        x.matmul(outputWeightT)
+        x.matmul(weights.outputWeight.t())
 
     override fun resetState() {
         attentionBackend.reset()
@@ -287,11 +266,10 @@ public class LlamaRuntime(
         var x = embedding.forward(tokenIds, ctx)
 
         weights.layers.forEachIndexed { layerIdx, layer ->
-            val tl = transposedLayers[layerIdx]
             val attnNorm = attnNorms[layerIdx].forward(x, ctx)
-            var q = attnNorm.matmul(tl.wqT)
-            var k = attnNorm.matmul(tl.wkT)
-            val v = attnNorm.matmul(tl.wvT)
+            var q = attnNorm.matmul(layer.wq.t())
+            var k = attnNorm.matmul(layer.wk.t())
+            val v = attnNorm.matmul(layer.wv.t())
 
             if (hasQKNorm) {
                 q = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm!!)
@@ -299,19 +277,19 @@ public class LlamaRuntime(
             }
 
             val attnOut = attentionBackend.batchAttention(q, k, v, layerIdx, startPos)
-                ?: return batchForwardFallback(tokenIds, startPos) // shouldn't happen but be safe
+                ?: return batchForwardFallback(tokenIds, startPos)
 
-            val afterAttn = x + attnOut.matmul(tl.woT)
+            val afterAttn = x + attnOut.matmul(layer.wo.t())
 
             val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-            val gate = ffnNorm.matmul(tl.ffnGateT).silu()
-            val up = ffnNorm.matmul(tl.ffnUpT)
-            val ffnOut = (gate * up).matmul(tl.ffnDownT)
+            val gate = ffnNorm.matmul(layer.ffnGate.t()).silu()
+            val up = ffnNorm.matmul(layer.ffnUp.t())
+            val ffnOut = (gate * up).matmul(layer.ffnDown.t())
             x = afterAttn + ffnOut
         }
 
         val norm = outputNormLayer.forward(x, ctx)
-        val logits = norm.matmul(outputWeightT)
+        val logits = norm.matmul(weights.outputWeight.t())
         position = startPos + tokenIds.size
         return logits
     }

From 4934f5adc1f26554b0c8ddef50ab7beb140db035 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 13:36:00 +0200
Subject: [PATCH 2/9] perf: keep Q4_K weights quantized using native SIMD matmul kernel

MemSegWeightConverter previously dequantized Q4_K tensors to FP32 because
"no native SIMD kernel yet". But Q4_KBlockTensorData and
QuantizedMatmul.matmulQ4_K() already exist in skainet-backend-cpu.

Wire Q4_K into the SIMD path: create Q4_KBlockTensorData from raw bytes
instead of dequantizing. This keeps Q4_K weights in their compact quantized
form (~4.5 bits/param) and uses the SIMD kernel for matmul at inference time.

Memory impact for Qwen3-8B-Q4_K_M:
  Before: ~31GB (Q4_K dequantized to FP32)
  After:  ~5GB (Q4_K kept quantized)

Q5_K and Q6_K still dequantize to FP32 (no SIMD kernel yet).

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../sk/ainet/models/llama/MemSegWeightConverter.kt | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index 2e24d4f..a08b065 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -10,6 +10,7 @@ import sk.ainet.lang.tensor.Shape
 import sk.ainet.lang.tensor.Tensor
 import sk.ainet.lang.tensor.data.IntArrayTensorData
 import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
+import sk.ainet.lang.tensor.data.Q4_KBlockTensorData
 import sk.ainet.lang.tensor.data.Q8MemorySegmentTensorData
 import sk.ainet.lang.tensor.data.TensorData
 import sk.ainet.lang.types.FP32
@@ -94,10 +95,15 @@ public object MemSegWeightConverter {
                 Q4MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
             GGMLQuantizationType.Q8_0 ->
                 Q8MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
-            GGMLQuantizationType.Q4_K,
+            GGMLQuantizationType.Q4_K -> {
+                // Q4_K has a native SIMD matmul kernel — keep quantized
+                val q4kData = Q4_KBlockTensorData.fromRawBytes(logicalShape, bytes)
+                @Suppress("UNCHECKED_CAST")
+                return ctx.fromData(q4kData as TensorData, FP32::class)
+            }
             GGMLQuantizationType.Q5_K,
             GGMLQuantizationType.Q6_K -> {
-                // Dequantize K-quant types to FP32 (no native SIMD kernel yet)
+                // Q5_K/Q6_K: no native SIMD kernel yet, dequantize to FP32
                 val floats = DequantOps.dequantFromBytes(bytes, quantType, logicalShape.volume)
                 return ctx.fromFloatArray(logicalShape, FP32::class, floats)
             }

From 0d04bc0c2564605b2d5af96b9dc42f1cfa77b208 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 15:11:26 +0200
Subject: [PATCH 3/9] fix: add Q4_K SIMD matmul and quantized-aware linear projection

Wire Q4_K into MemSegWeightConverter to keep weights quantized (~5GB)
instead of dequantizing to FP32 (~31GB). Add linearProject() helper to
LlamaRuntime that dispatches to quantized matmul for Q4_K weights and
standard transpose+matmul for FP32 weights.

The 8B model now loads successfully on 48GB Mac (15s load time) and
generates tokens via the Q4_K SIMD kernel. However, the inline .t() on
Q6_K-dequantized FP32 tensors still causes direct buffer OOM during
inference -- the JVM doesn't reclaim direct memory eagerly. Full fix
requires lazy per-layer dequantization (Solution B) or Q6_K SIMD kernel
support.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 llm-apps/skainet-cli/build.gradle.kts              |  2 +-
 .../sk/ainet/models/llama/LlamaRuntime.kt          | 48 ++++++++++++-------
 .../models/llama/MemSegWeightConverter.kt          |  8 +++-
 3 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/llm-apps/skainet-cli/build.gradle.kts b/llm-apps/skainet-cli/build.gradle.kts
index 7a999be..608c38c 100644
--- a/llm-apps/skainet-cli/build.gradle.kts
+++ b/llm-apps/skainet-cli/build.gradle.kts
@@ -50,5 +50,5 @@ tasks.withType().configureEach {
 }
 
 tasks.withType().configureEach {
-    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx48g", "-XX:MaxDirectMemorySize=64g")
+    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx42g", "-XX:MaxDirectMemorySize=42g")
 }
diff --git a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
index 3a0e562..57be0ee 100644
--- a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
+++ b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
@@ -6,6 +6,7 @@ import sk.ainet.context.ExecutionContext
 import sk.ainet.models.llama.LlamaRuntimeWeights
 import sk.ainet.lang.nn.layers.Embedding
 import sk.ainet.lang.tensor.Tensor
+import sk.ainet.lang.tensor.data.Q4_KTensorData
 import sk.ainet.lang.tensor.matmul
 import sk.ainet.lang.tensor.plus
 import sk.ainet.lang.tensor.silu
@@ -57,6 +58,21 @@ public class LlamaRuntime(
     // NOTE: weights are transposed on-the-fly during forward pass rather than
     // pre-transposed at init. This halves peak memory (~31GB saved for 8B models)
     // at the cost of per-token transpose allocations that the GC reclaims.
+    // Quantized weights (Q4_K) skip transpose entirely — their matmul kernel
+    // handles the [out, in] layout directly.
+
+    /**
+     * Linear projection: y = x @ W^T.
+     * For FP32 weights, transposes and matmuls. For quantized weights (Q4_K),
+     * calls matmul directly (the quantized kernel handles the layout).
+     */
+    private fun linearProject(x: Tensor, w: Tensor): Tensor {
+        return if (w.data is Q4_KTensorData) {
+            x.matmul(w) // quantized kernel handles [out, in] layout
+        } else {
+            x.matmul(w.t())
+        }
+    }
 
     // ---- DecoderRuntime abstract properties ----
     override val dim: Int = weights.metadata.embeddingLength
@@ -119,9 +135,9 @@ public class LlamaRuntime(
         } ?: run {
             val attnNorm = attnNorms[layerIdx].forward(x, ctx)
             Triple(
-                attnNorm.matmul(layer.wq.t()),
-                attnNorm.matmul(layer.wk.t()),
-                attnNorm.matmul(layer.wv.t())
+                linearProject(attnNorm, layer.wq),
+                linearProject(attnNorm, layer.wk),
+                linearProject(attnNorm, layer.wv)
             )
         }
 
@@ -135,14 +151,14 @@ public class LlamaRuntime(
         val attnOut = attentionBackend.attention(q, k, v, layerIdx, position)
 
         // Output projection + residual
-        val afterAttn = x + attnOut.matmul(layer.wo.t())
+        val afterAttn = x + linearProject(attnOut, layer.wo)
 
         // FFN: try compiled graph first, fall back to individual ops
         return graphAccelerator?.runFFN(layerIdx, afterAttn) ?: run {
             val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-            val gate = ffnNorm.matmul(layer.ffnGate.t()).silu()
-            val up = ffnNorm.matmul(layer.ffnUp.t())
-            val ffnOut = (gate * up).matmul(layer.ffnDown.t())
+            val gate = linearProject(ffnNorm, layer.ffnGate).silu()
+            val up = linearProject(ffnNorm, layer.ffnUp)
+            val ffnOut = linearProject(gate * up, layer.ffnDown)
             afterAttn + ffnOut
         }
     }
@@ -151,7 +167,7 @@ public class LlamaRuntime(
         outputNormLayer.forward(x, ctx)
 
     override fun outputProject(x: Tensor): Tensor =
-        x.matmul(weights.outputWeight.t())
+        linearProject(x, weights.outputWeight)
 
     override fun resetState() {
         attentionBackend.reset()
@@ -267,9 +283,9 @@ public class LlamaRuntime(
 
         weights.layers.forEachIndexed { layerIdx, layer ->
             val attnNorm = attnNorms[layerIdx].forward(x, ctx)
-            var q = attnNorm.matmul(layer.wq.t())
-            var k = attnNorm.matmul(layer.wk.t())
-            val v = attnNorm.matmul(layer.wv.t())
+            var q = linearProject(attnNorm, layer.wq)
+            var k = linearProject(attnNorm, layer.wk)
+            val v = linearProject(attnNorm, layer.wv)
 
             if (hasQKNorm) {
                 q = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm!!)
@@ -279,17 +295,17 @@ public class LlamaRuntime(
             }
 
             val attnOut = attentionBackend.batchAttention(q, k, v, layerIdx, startPos)
                 ?: return batchForwardFallback(tokenIds, startPos)
 
-            val afterAttn = x + attnOut.matmul(layer.wo.t())
+            val afterAttn = x + linearProject(attnOut, layer.wo)
 
             val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-            val gate = ffnNorm.matmul(layer.ffnGate.t()).silu()
-            val up = ffnNorm.matmul(layer.ffnUp.t())
-            val ffnOut = (gate * up).matmul(layer.ffnDown.t())
+            val gate = linearProject(ffnNorm, layer.ffnGate).silu()
+            val up = linearProject(ffnNorm, layer.ffnUp)
+            val ffnOut = linearProject(gate * up, layer.ffnDown)
             x = afterAttn + ffnOut
         }
 
         val norm = outputNormLayer.forward(x, ctx)
-        val logits = norm.matmul(weights.outputWeight.t())
+        val logits = linearProject(norm, weights.outputWeight)
         position = startPos + tokenIds.size
         return logits
     }
diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index a08b065..55fdf86 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -96,8 +96,12 @@ public object MemSegWeightConverter {
             GGMLQuantizationType.Q8_0 ->
                 Q8MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
             GGMLQuantizationType.Q4_K -> {
-                // Q4_K has a native SIMD matmul kernel — keep quantized
-                val q4kData = Q4_KBlockTensorData.fromRawBytes(logicalShape, bytes)
+                // Q4_K has a native SIMD matmul kernel — keep quantized.
+                // GGUF stores weights as [out, in], but matmul expects [in, out],
+                // so we pass the transposed shape. The block data layout stays
+                // the same — Q4_K matmul reads rows in the transposed order.
+                val transposedShape = Shape(logicalShape[1], logicalShape[0])
+                val q4kData = Q4_KBlockTensorData.fromRawBytes(transposedShape, bytes)
                 @Suppress("UNCHECKED_CAST")
                 return ctx.fromData(q4kData as TensorData, FP32::class)
             }

From 8aee3fa486855208553f5f362a8b72a70e9941b4 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 15:31:00 +0200
Subject: [PATCH 4/9] fix: pre-transpose weights during loading to eliminate runtime OOM

Pre-transpose ALL projection weights during MemSegWeightConverter so
LlamaRuntime never calls .t() during inference. This eliminates direct
buffer allocations that the JVM doesn't GC eagerly, which caused OOM on
48GB machines.

- Q4_K: transposed shape passed to Q4_KBlockTensorData.fromRawBytes()
- Q6_K: dequantize to FP32 + array transpose to [in, out] layout
- FP32: pre-transpose via .t() during conversion (one-time cost)
- linearProject() auto-detects layout: [in,out] = direct, [out,in] = .t()

Qwen3-8B-Q4_K_M now runs on 48GB Mac at 14.2GB RSS (was 45GB+ OOM).
Token generation works but output quality needs validation (Q6_K
transpose ordering may need adjustment).

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../sk/ainet/models/llama/LlamaRuntime.kt          | 17 ++++++++-----
 .../models/llama/MemSegWeightConverter.kt          | 24 +++++++++++++++----
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
index 57be0ee..468054e 100644
--- a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
+++ b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
@@ -6,7 +6,6 @@ import sk.ainet.context.ExecutionContext
 import sk.ainet.models.llama.LlamaRuntimeWeights
 import sk.ainet.lang.nn.layers.Embedding
 import sk.ainet.lang.tensor.Tensor
-import sk.ainet.lang.tensor.data.Q4_KTensorData
 import sk.ainet.lang.tensor.matmul
 import sk.ainet.lang.tensor.plus
 import sk.ainet.lang.tensor.silu
@@ -62,14 +61,20 @@ public class LlamaRuntime(
     /**
-     * Linear projection: y = x @ W^T.
-     * For FP32 weights, transposes and matmuls. For quantized weights (Q4_K),
-     * calls matmul directly (the quantized kernel handles the layout).
+     * Linear projection: y = x @ W.
+     *
+     * When weights are pre-transposed to [in, out] by MemSegWeightConverter
+     * (Q4_K, Q6_K, FP32 via NATIVE_OPTIMIZED), uses direct matmul.
+     * Otherwise falls back to .t() for non-converted weights (tests, DEQUANTIZE_TO_FP32).
      */
     private fun linearProject(x: Tensor, w: Tensor): Tensor {
-        return if (w.data is Q4_KTensorData) {
-            x.matmul(w) // quantized kernel handles [out, in] layout
+        val xCols = if (x.shape.rank >= 2) x.shape[x.shape.rank - 1] else x.shape[0]
+        val wRows = w.shape[0]
+        return if (wRows == xCols) {
+            // Weight is [in, out] — already transposed, direct matmul
+            x.matmul(w)
         } else {
+            // Weight is [out, in] — needs transpose (legacy path)
             x.matmul(w.t())
         }
     }
diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index 55fdf86..c6dace1 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -8,6 +8,7 @@ import sk.ainet.models.llama.LlamaRuntimeWeights
 import sk.ainet.models.llama.LlamaTensorNames
 import sk.ainet.lang.tensor.Shape
 import sk.ainet.lang.tensor.Tensor
+import sk.ainet.lang.tensor.t
 import sk.ainet.lang.tensor.data.IntArrayTensorData
 import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
 import sk.ainet.lang.tensor.data.Q4_KBlockTensorData
@@ -86,7 +87,11 @@ public object MemSegWeightConverter {
         ctx: ExecutionContext,
         arena: Arena
     ): Tensor {
-        val quantType = quantTypes[tensorName] ?: return tensor
+        val quantType = quantTypes[tensorName]
+        if (quantType == null) {
+            // FP32 tensor — pre-transpose to [in, out] so no .t() at runtime
+            return tensor.t()
+        }
 
         val bytes = extractBytes(tensor.data)
 
@@ -107,9 +112,20 @@ public object MemSegWeightConverter {
             }
             GGMLQuantizationType.Q5_K,
             GGMLQuantizationType.Q6_K -> {
-                // Q5_K/Q6_K: no native SIMD kernel yet, dequantize to FP32
-                val floats = DequantOps.dequantFromBytes(bytes, quantType, logicalShape.volume)
-                return ctx.fromFloatArray(logicalShape, FP32::class, floats)
+                // Q5_K/Q6_K: no native SIMD kernel yet, dequantize to FP32.
+                // Pre-transpose to [in, out] so LlamaRuntime never calls .t()
+                // (which allocates direct buffers that aren't GC'd eagerly).
+                val rows = logicalShape[0]
+                val cols = logicalShape[1]
+                val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols)
+                val transposed = FloatArray(rows * cols)
+                for (r in 0 until rows) {
+                    for (c in 0 until cols) {
+                        transposed[c * rows + r] = floats[r * cols + c]
+                    }
+                }
+                val transposedShape = Shape(cols, rows)
+                return ctx.fromFloatArray(transposedShape, FP32::class, transposed)
             }
             else -> {
                 println("WARNING: Unsupported quant type $quantType for MemorySegment conversion of $tensorName, keeping as-is")

From c97c4818a0459c8b8f8ea8a1de7141962f6affda Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 16:42:08 +0200
Subject: [PATCH 5/9] fix: correct Q4_K/Q6_K pre-transpose for valid 8B model output

Revert Q4_K block shape reinterpretation (corrupted block data layout)
and dequantize all K-quant types to FP32 with array pre-transpose. This
produces correct output at the cost of higher memory (~30GB vs ~14GB),
but still fits on 48GB Mac without runtime OOM.

Qwen3-8B-Q4_K_M now generates correct output:
  "The capital of France is Paris."

Memory: ~30GB RSS (was 45GB+ OOM before pre-transpose fix)
Speed: 0.002 tok/s (CPU-only 8B, expected for scalar FP32 matmul)

Future: native Q4_K SIMD matmul with proper block-aware transpose would
reduce to ~14GB and improve speed significantly.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../models/llama/MemSegWeightConverter.kt          | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index c6dace1..8146746 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -11,7 +11,6 @@ import sk.ainet.lang.tensor.Tensor
 import sk.ainet.lang.tensor.t
 import sk.ainet.lang.tensor.data.IntArrayTensorData
 import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
-import sk.ainet.lang.tensor.data.Q4_KBlockTensorData
 import sk.ainet.lang.tensor.data.Q8MemorySegmentTensorData
 import sk.ainet.lang.tensor.data.TensorData
 import sk.ainet.lang.types.FP32
@@ -100,21 +99,12 @@ public object MemSegWeightConverter {
                 Q4MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
             GGMLQuantizationType.Q8_0 ->
                 Q8MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
-            GGMLQuantizationType.Q4_K -> {
-                // Q4_K has a native SIMD matmul kernel — keep quantized.
-                // GGUF stores weights as [out, in], but matmul expects [in, out],
-                // so we pass the transposed shape. The block data layout stays
-                // the same — Q4_K matmul reads rows in the transposed order.
-                val transposedShape = Shape(logicalShape[1], logicalShape[0])
-                val q4kData = Q4_KBlockTensorData.fromRawBytes(transposedShape, bytes)
-                @Suppress("UNCHECKED_CAST")
-                return ctx.fromData(q4kData as TensorData, FP32::class)
-            }
+            GGMLQuantizationType.Q4_K,
             GGMLQuantizationType.Q5_K,
             GGMLQuantizationType.Q6_K -> {
-                // Q5_K/Q6_K: no native SIMD kernel yet, dequantize to FP32.
-                // Pre-transpose to [in, out] so LlamaRuntime never calls .t()
-                // (which allocates direct buffers that aren't GC'd eagerly).
+                // Dequantize K-quant types to FP32 and pre-transpose to [in, out].
+                // Pre-transposing at load time avoids .t() at runtime, which
+                // allocates direct buffers the JVM doesn't GC eagerly (OOM on 48GB).
                 val rows = logicalShape[0]
                 val cols = logicalShape[1]
                 val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols)

From 5d7dd6e5186ffad89dd71f407f57f6e397f21f55 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 16:45:02 +0200
Subject: [PATCH 6/9] docs: add deep technical guide on weight quantization pipeline

Covers the full numeric representation journey from GGUF file to matmul
kernel dispatch:

- Stage 1: GGUF on-disk layout (Q4_K_M block format, tensor shapes)
- Stage 2: Raw byte loading via StreamingGGUFReader
- Stage 3: MemSegWeightConverter paths (Q4_0/Q8_0 SIMD, K-quant dequant
  + pre-transpose, FP32 pre-transpose)
- Stage 4: LlamaRuntime.linearProject() auto-detection
- Stage 5: Matmul kernel dispatch (SIMD Q4/Q8, scalar FP32)
- Why Q4_K blocks cannot be trivially transposed
- Memory budget table for 8B model on 48GB Mac
- Future: block-aware Q4_K transpose for 5GB inference

Includes Mermaid diagram of the full data flow.
Co-Authored-By: Claude Opus 4.6 (1M context)
---
 docs/modules/ROOT/nav.adoc                         |   1 +
 .../explanation/weight-quantization.adoc           | 332 ++++++++++++++++++
 2 files changed, 333 insertions(+)
 create mode 100644 docs/modules/ROOT/pages/explanation/weight-quantization.adoc

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 5bc1fc9..895b219 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -23,3 +23,4 @@
 * xref:explanation/pipeline-design.adoc[Pipeline Design Decisions]
 * xref:explanation/dsl-vs-handcoded.adoc[DSL Networks vs Hand-Coded Runtimes]
 * xref:explanation/tokenizer-internals.adoc[Tokenizer Internals]
+* xref:explanation/weight-quantization.adoc[Weight Quantization and Numeric Representation]
diff --git a/docs/modules/ROOT/pages/explanation/weight-quantization.adoc b/docs/modules/ROOT/pages/explanation/weight-quantization.adoc
new file mode 100644
index 0000000..6974baf
--- /dev/null
+++ b/docs/modules/ROOT/pages/explanation/weight-quantization.adoc
@@ -0,0 +1,332 @@
+= Weight Quantization and Numeric Representation
+:description: Deep technical guide to how model weights flow through quantization, dequantization, transpose, and SIMD kernel dispatch.
+
+== Overview
+
+A model weight tensor goes through several numeric representation changes between the GGUF file on disk and the final matmul during inference.
+Understanding each stage is essential for debugging memory issues, correctness problems, and performance optimization.
+
+[mermaid]
+....
+graph TD
+    A["GGUF File<br/>(Q4_K_M: 4.7 GB)"] -->|StreamingGGUFReader| B["Raw Bytes<br/>IntArrayTensorData"]
+    B -->|MemSegWeightConverter| C{Quant Type?}
+    C -->|Q4_0| D["Q4MemorySegmentTensorData<br/>64-byte aligned, Arena-managed"]
+    C -->|Q8_0| E["Q8MemorySegmentTensorData<br/>64-byte aligned, Arena-managed"]
+    C -->|"Q4_K / Q5_K / Q6_K"| F["DequantOps.dequantFromBytes()<br/>→ FloatArray"]
+    C -->|FP32| G["tensor.t()<br/>MemorySegmentTensorData"]
+    F -->|"Array transpose<br/>[out,in] → [in,out]"| H["FloatArrayTensorData<br/>Pre-transposed"]
+    D --> I["SIMD Matmul<br/>QuantizedMatmul.matmulQ4_0()"]
+    E --> J["SIMD Matmul<br/>QuantizedMatmul.matmulQ8_0()"]
+    H --> K["Scalar Matmul<br/>DefaultCpuOps.matmul()"]
+    G --> K
+
+    style A fill:#f9f,stroke:#333
+    style I fill:#9f9,stroke:#333
+    style J fill:#9f9,stroke:#333
+    style K fill:#ff9,stroke:#333
+....
+
+== Stage 1: GGUF File on Disk
+
+GGUF stores each weight tensor as a contiguous byte region with a header describing its quantization type, shape, and byte offset.
+
+=== Quantization Types in Q4_K_M Format
+
+The `Q4_K_M` quantization scheme uses a **mixed-precision strategy**:
+
+[cols="1,2,2,1"]
+|===
+|Type |Used For |Block Format |Bits/Param
+
+|Q4_K
+|Large projections (wq, wk, wv, wo, ffn_gate, ffn_up, ffn_down) in ~50% of layers
+|144 bytes per 256 elements: f16 scale (2 bytes) + f16 min (2 bytes) + 12 scale bytes + 128 nibble codes
+|~4.5
+
+|Q6_K
+|Same projections in the other ~50% of layers, plus output weight
+|210 bytes per 256 elements: higher precision for critical layers
+|~6.5
+
+|Q8_0
+|Not used in Q4_K_M (used in Q8_0 format models)
+|34 bytes per 32 elements: 1×f16 scale + 32×int8 codes
+|~8.5
+
+|FP32
+|Norms (attn_norm, ffn_norm, output_norm) — 1D tensors
+|4 bytes per element
+|32
+|===
+
+=== Tensor Layout in GGUF
+
+All 2D weight tensors are stored in **row-major [out_dim, in_dim]** order:
+
+----
+wq:       Shape(dim, dim)    = [4096, 4096]   "4096 output neurons, each with 4096 input weights"
+wk:       Shape(kvDim, dim)  = [1024, 4096]   "1024 KV outputs (8 heads × 128 head_dim)"
+ffn_gate: Shape(ffnDim, dim) = [14336, 4096]  "14336 FFN hidden units"
+ffn_down: Shape(dim, ffnDim) = [4096, 14336]  "project back to model dim"
+----
+
+The matmul convention `y = x @ W^T` requires weights in `[in_dim, out_dim]` form, so a transpose is needed before or during the matmul.
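
The two orderings are numerically identical, which is what makes load-time pre-transposing safe. A minimal plain-Java sketch (illustrative only, not the skainet API) comparing transpose-on-the-fly against a pre-transposed copy:

```java
/** y = x @ W^T computed two ways: reading W as [out, in] on the fly,
 *  or reading a pre-transposed [in, out] copy. Results must match. */
final class MatmulConvention {
    // x: [n, in], w: [out, in] -> y: [n, out]
    static float[][] projectOnTheFly(float[][] x, float[][] w) {
        int n = x.length, in = x[0].length, out = w.length;
        float[][] y = new float[n][out];
        for (int i = 0; i < n; i++)
            for (int o = 0; o < out; o++) {
                float s = 0f;
                for (int k = 0; k < in; k++) s += x[i][k] * w[o][k]; // W read row-wise as [out, in]
                y[i][o] = s;
            }
        return y;
    }

    // x: [n, in], wT: [in, out] pre-transposed at load time -> y: [n, out]
    static float[][] projectPreTransposed(float[][] x, float[][] wT) {
        int n = x.length, in = x[0].length, out = wT[0].length;
        float[][] y = new float[n][out];
        for (int i = 0; i < n; i++)
            for (int o = 0; o < out; o++) {
                float s = 0f;
                for (int k = 0; k < in; k++) s += x[i][k] * wT[k][o];
                y[i][o] = s;
            }
        return y;
    }
}
```

The only difference between the stages described below is *when* the transpose cost is paid and in which memory pool the copy lives.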
+ +== Stage 2: Loading Raw Bytes + +`LlamaWeightLoader.loadToMapStreaming()` reads the GGUF file via `StreamingGGUFReader`: + +[source,kotlin] +---- +// QuantPolicy.NATIVE_OPTIMIZED: store as raw Int8 bytes +val tensor = streamingTensorToTensor(reader, tensorInfo, ctx) +// tensor.data is IntArrayTensorData containing the raw quantized bytes +---- + +At this stage, the tensor holds the original GGUF bytes unchanged. +A `quantTypes` map records each tensor's quantization type for later processing. + +.Memory at Stage 2 +---- +Qwen3-8B-Q4_K_M: ~4.7 GB (raw bytes, same as file size) +---- + +== Stage 3: MemSegWeightConverter + +`MemSegWeightConverter.convert()` transforms raw bytes into runtime-ready tensors. +This is where the numeric representation diverges by quantization type. + +=== Path A: Q4_0 → Q4MemorySegmentTensorData + +[source,kotlin] +---- +Q4MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena) +---- + +* Copies raw bytes into a 64-byte-aligned `MemorySegment` (Arena-managed, off-heap) +* The data stays in Q4_0 block format (no dequantization) +* The `MemorySegment` alignment enables SIMD vector loads + +.Memory: same as raw bytes (~4.5 bits/param) + +=== Path B: Q8_0 → Q8MemorySegmentTensorData + +Same as Q4_0 but with Q8_0 block format (8 bits per code + f16 scale per 32 elements). + +.Memory: ~8.5 bits/param + +=== Path C: Q4_K / Q5_K / Q6_K → FP32 + Pre-Transpose + +[source,kotlin] +---- +// 1. Dequantize to float array +val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols) + +// 2. Pre-transpose from [out, in] to [in, out] +val transposed = FloatArray(rows * cols) +for (r in 0 until rows) { + for (c in 0 until cols) { + transposed[c * rows + r] = floats[r * cols + c] + } +} + +// 3. Store as heap-based FloatArrayTensorData +return ctx.fromFloatArray(Shape(cols, rows), FP32::class, transposed) +---- + +**Why dequantize?** No native SIMD kernel exists for K-quant block formats yet. 
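The memory cost of dequantizing can be derived directly from the block sizes in the Stage 1 format table. A small arithmetic sketch (numbers taken from that table):

```python
# Bytes-per-block and elements-per-block come from the quantization-type
# table in Stage 1; this just converts them to bits/param and the
# expansion factor incurred by dequantizing to FP32 (32 bits/param).
formats = {
    "Q4_K": (144, 256),  # (bytes per block, elements per block)
    "Q6_K": (210, 256),
    "Q8_0": (34, 32),
}
bits_per_param = {name: b * 8 / e for name, (b, e) in formats.items()}
fp32_expansion = {name: 32 / bits for name, bits in bits_per_param.items()}
# Q4_K -> 4.5 bits/param; FP32 is 32 / 4.5 ≈ 7.1x larger
# (the "8x" figure elsewhere in this series is that factor rounded up).
```

For the dominant Q4_K projections this is roughly a 7x growth, which is where the ~4.7 GB file becomes ~30 GB of resident FP32 weights.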
+ +**Why pre-transpose?** The `.t()` operation on tensors allocates a new `MemorySegmentTensorData` in direct buffer memory. +The JVM's direct buffer allocator does not reclaim memory eagerly, causing OOM on memory-constrained machines (48GB). +Pre-transposing during loading avoids all runtime `.t()` calls. + +.Memory per K-quant tensor +---- +Original Q4_K: ~4.5 bits/param +After dequant: 32 bits/param (8× expansion) +Temporary: 2× (original float array + transposed array, then original is GC'd) +---- + +.Total memory for Qwen3-8B-Q4_K_M after Stage 3 +---- +Q4_K tensors (dequantized + transposed): ~15 GB +Q6_K tensors (dequantized + transposed): ~12 GB +Token embedding (dequantized, not transposed): ~2.4 GB +Norms (FP32, 1D, tiny): ~0.01 GB +Total: ~30 GB +---- + +=== Path D: FP32 → Pre-Transpose + +[source,kotlin] +---- +return tensor.t() // one-time transpose during loading +---- + +Norms are 1D so `.t()` is a no-op. For FP32 projection weights (rare), a standard transpose copies to direct memory once. + +=== Special Case: Token Embedding + +[source,kotlin] +---- +tokenEmbedding = maybeDequantize(weights.tokenEmbedding, ...) +---- + +Token embeddings are always dequantized to FP32 and **not transposed** because `Embedding.forward()` does row gather (lookup by token ID), not matmul. 
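A row gather is just an offset copy into the `[vocab, dim]` table, which is why the embedding layout never needs transposing. A minimal sketch with hypothetical toy sizes (vocab=4, dim=3 — not the real 151936×4096 table):

```python
# Toy embedding table in the same row-major [vocab, dim] layout the loader
# keeps. Looking up a token is a contiguous row copy — no matmul involved,
# so no transpose is ever required.
vocab, dim = 4, 3
table = [float(10 * r + c) for r in range(vocab) for c in range(dim)]

def embed(token_id):
    start = token_id * dim
    return table[start:start + dim]

assert embed(2) == [20.0, 21.0, 22.0]  # row 2, copied as-is
```

This is also why the token embedding contributes a fixed FP32 cost (one dequantized table) rather than participating in the transpose question at all.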
+ +== Stage 4: LlamaRuntime.linearProject() + +During inference, each projection uses `linearProject()`: + +[source,kotlin] +---- +private fun linearProject(x: Tensor, w: Tensor): Tensor { + val xCols = if (x.shape.rank >= 2) x.shape[x.shape.rank - 1] else x.shape[0] + val wRows = w.shape[0] + return if (wRows == xCols) { + x.matmul(w) // weight is [in, out] — pre-transposed + } else { + x.matmul(w.t()) // weight is [out, in] — legacy path (tests) + } +} +---- + +The shape check auto-detects the weight layout: + +* **Pre-transposed** `[in, out]`: `wRows == xCols` → direct `matmul`, no allocation +* **Original** `[out, in]`: `wRows != xCols` → `.t()` then `matmul` (legacy/test path) + +== Stage 5: Matmul Kernel Dispatch + +The `Tensor.matmul()` extension dispatches based on the runtime `TensorData` type: + +[cols="1,2,2"] +|=== +|TensorData Type |Kernel |Implementation + +|`Q4MemorySegmentTensorData` +|`QuantizedMatmul.matmulQ4_0()` +|SIMD (Vector API): processes 32 Q4 values per vector lane + +|`Q8MemorySegmentTensorData` +|`QuantizedMatmul.matmulQ8_0()` +|SIMD (Vector API): dot product of int8 codes × float scale + +|`Q4_KBlockTensorData` +|`QuantizedMatmul.matmulQ4_K()` +|SIMD: unpacks K-quant blocks with dual scales + min values + +|`FloatArrayTensorData` +|`DefaultCpuOps.matmul()` +|Scalar FP32 double loop (no SIMD) + +|`MemorySegmentTensorData` +|`DefaultCpuOpsJvm.matmul()` +|SIMD FP32 via Vector API +|=== + +=== SIMD Q4_0 Matmul (Simplified) + +[source] +---- +For each output row: + For each block of 32 input elements: + Load 16 bytes of Q4 codes from MemorySegment (128 bits) + Unpack low/high nibbles into two int8 vectors (256 bits each) + Subtract zero-point (8) + Convert to float vectors + Multiply by block scale (f16 → f32) + FMA with input vector → accumulate into output +---- + +=== Why Q4_K Cannot Be Trivially Transposed + +Q4_K blocks encode 256 elements with a complex internal structure: + +---- +Block (144 bytes): + [0..1] d (f16) — primary scale + 
[2..3] dmin (f16) — minimum offset + [4..15] scales (12 bytes) — per-subblock scales (6-bit packed) + [16..143] qs (128 bytes) — quantized codes (4-bit packed, 256 values) +---- + +The 256 values in each block correspond to **256 contiguous elements in the original row**. +Transposing the matrix would scatter these elements across different rows, breaking the block structure. +A proper Q4_K transpose would require: + +1. Dequantize all blocks → FP32 +2. Transpose the FP32 matrix +3. Re-quantize into new Q4_K blocks + +This is why `MemSegWeightConverter` currently dequantizes K-quant types to FP32 rather than keeping them quantized. + +== Memory Budget: Qwen3-8B-Q4_K_M on 48GB Mac + +[cols="2,1,3"] +|=== +|Component |Size |Notes + +|K-quant weights (FP32 pre-transposed) +|~27 GB +|Q4_K + Q6_K dequantized, no runtime `.t()` copies + +|Token embedding (FP32) +|2.4 GB +|151936 × 4096 × 4 bytes + +|Norms (FP32) +|~10 MB +|1D tensors, negligible + +|KV cache (context=512) +|~128 MB +|2 × 36 layers × 512 × 1024 × 4 bytes + +|JVM + tokenizer +|~1 GB +|Heap overhead, vocab structures + +|**Total** +|**~31 GB** +|Fits in 48 GB with OS headroom +|=== + +== Performance Characteristics + +[cols="2,1,1,2"] +|=== +|Path |Bits/Param |Memory |Speed (8B, M-series CPU) + +|Q4_K SIMD (future) +|4.5 +|~5 GB +|~1-3 tok/s (projected) + +|Q8_0 SIMD +|8.5 +|~9 GB +|~1-2 tok/s + +|FP32 pre-transposed (current) +|32 +|~30 GB +|~0.002 tok/s (scalar) + +|FP32 + runtime .t() (old, OOM) +|32 + 32 (copy) +|~60 GB +|OOM on 48GB +|=== + +== Future: Block-Aware Q4_K Transpose + +To use the Q4_K SIMD kernel with GGUF weights, the skainet core library would need: + +1. **`Q4_KBlockTensorData.transpose()`** — dequantize → rearrange → re-quantize at the block level +2. Or **`QuantizedMatmul.matmulQ4_K_transposed()`** — a kernel variant that reads blocks in column-major order +3. 
Or **GGUF pre-transposed storage** — store weights as `[in, out]` in the GGUF file during quantization + +Option 2 is the most practical: modify the SIMD kernel to iterate over columns instead of rows when reading Q4_K blocks. +This would reduce memory from ~30GB to ~5GB for the 8B model. From b3646e05d5ab56b161aac745ca47b0c187653004 Mon Sep 17 00:00:00 2001 From: Michal Harakal Date: Sat, 11 Apr 2026 21:02:19 +0200 Subject: [PATCH 7/9] feat: add instruct mode to smoke tests for chat-tuned models MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add runChatOnce() to AgentCli for non-interactive single-prompt instruct testing. Wire --chat with positional prompt to single-shot mode. Add instruct field to smoke-models.json — when true, the smoke test uses --chat with chat template formatting instead of raw text completion. Fixes garbage output from instruct models (Qwen3) in smoke tests. Refs: #49 Co-Authored-By: Claude Opus 4.6 (1M context) --- .../sk/ainet/apps/kllama/cli/AgentMain.kt | 41 +++++++++++++++++++ .../kotlin/sk/ainet/apps/kllama/cli/Main.kt | 6 ++- tests/smoke/smoke-models.json | 6 ++- tests/smoke/smoke-test.sh | 9 +++- 4 files changed, 59 insertions(+), 3 deletions(-) diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt index ebe4289..12221c5 100644 --- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt +++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt @@ -29,6 +29,47 @@ public class AgentCli( private val session = ChatSession(runtime, tokenizer, metadata, templateName) private val template: ChatTemplate = session.chatTemplate + /** + * Run a single non-interactive chat round. Used by smoke tests for instruct models. + */ + public fun runChatOnce( + userPrompt: String, + systemPrompt: String = "You are a helpful assistant. 
Answer concisely.", + maxTokens: Int = 64, + temperature: Float = 0.0f + ) { + val messages = mutableListOf( + ChatMessage(role = ChatRole.SYSTEM, content = systemPrompt), + ChatMessage(role = ChatRole.USER, content = userPrompt) + ) + + runtime.reset() + val prompt = template.apply(messages, emptyList(), addGenerationPrompt = true) + val promptTokens = tokenizer.encode(prompt) + + print("Assistant: ") + System.out.flush() + + val startTime = System.nanoTime() + val result = runtime.generateUntilStop( + prompt = promptTokens, + maxTokens = maxTokens, + eosTokenId = tokenizer.eosTokenId, + temperature = temperature, + onToken = { tokenId -> + print(tokenizer.decode(tokenId)) + System.out.flush() + }, + decode = { tokenId -> tokenizer.decode(tokenId) } + ) + val elapsed = (System.nanoTime() - startTime) / 1_000_000.0 + + println() + println("---") + val tokPerSec = if (elapsed > 0) result.tokens.size / elapsed * 1000 else 0.0 + println("tok/s: $tokPerSec") + } + /** * Run interactive chat mode (no tool calling). 
*/ diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt index a89da0e..ddf31c0 100644 --- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt +++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt @@ -508,7 +508,11 @@ fun main(args: Array) { } else -> { val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata) - agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature) + if (cliArgs.prompt != null) { + agentCli.runChatOnce(cliArgs.prompt, maxTokens = cliArgs.steps, temperature = cliArgs.temperature) + } else { + agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature) + } } } return@runBlocking diff --git a/tests/smoke/smoke-models.json b/tests/smoke/smoke-models.json index 2958517..895b06e 100644 --- a/tests/smoke/smoke-models.json +++ b/tests/smoke/smoke-models.json @@ -13,9 +13,11 @@ }, { "name": "Qwen3-1.7B-Q8", - "runner": "kqwen", + "runner": "kllama", "model": "Qwen3-1.7B-Q8_0.gguf", "format": "gguf", + "instruct": true, + "prompt": "What is the capital of France?", "toolCalling": { "prompt": "What is 2 + 2?", "steps": 256 @@ -26,6 +28,8 @@ "runner": "kllama", "model": "Qwen3-8B-Q4_K_M.gguf", "format": "gguf", + "instruct": true, + "prompt": "What is the capital of France?", "toolCalling": { "prompt": "What is 2 + 2?", "steps": 256 diff --git a/tests/smoke/smoke-test.sh b/tests/smoke/smoke-test.sh index 2756e78..c1f6ca1 100755 --- a/tests/smoke/smoke-test.sh +++ b/tests/smoke/smoke-test.sh @@ -200,6 +200,7 @@ print(f'M_STEPS={m.get(\"steps\", d.get(\"steps\", 32))}') print(f'M_TEMP={m.get(\"temperature\", d.get(\"temperature\", 0.0))}') print(f'M_DOC={repr(m.get(\"doc\", \"\"))}') print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}') +print(f'M_INSTRUCT={repr(m.get(\"instruct\", False))}') ")" M_MODEL=$(expand_path "$M_MODEL") @@ -225,7 +226,13 @@ 
print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}') fi task=$(runner_task "$M_RUNNER") - args=$(runner_args "$M_RUNNER" "$M_MODEL" "$M_PROMPT" "$M_STEPS" "$M_TEMP" "$M_DOC" "$M_OUTPUT") + + # Instruct models: use --chat with prompt for proper chat template formatting + if [[ "$M_INSTRUCT" == "True" ]]; then + args="-m ${M_MODEL} --chat -s ${M_STEPS} -k ${M_TEMP} \"${M_PROMPT}\"" + else + args=$(runner_args "$M_RUNNER" "$M_MODEL" "$M_PROMPT" "$M_STEPS" "$M_TEMP" "$M_DOC" "$M_OUTPUT") + fi start_ts=$(python3 -c 'import time; print(time.time())') output_file=$(mktemp) From 2df13cdc00fc6a96f99382a3b58f85be466b030f Mon Sep 17 00:00:00 2001 From: Michal Harakal Date: Sat, 11 Apr 2026 21:03:15 +0200 Subject: [PATCH 8/9] fix: replace JVM-only Math.PI with kotlin.math.PI in VoxtralFlowMatching Math.PI is not available on non-JVM targets (iOS, JS, WASM). Use kotlin.math.PI which is multiplatform. Co-Authored-By: Claude Opus 4.6 (1M context) --- .../kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt index d4e6cd8..ee66112 100644 --- a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt +++ b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt @@ -200,7 +200,7 @@ public class VoxtralFlowMatching( val u1 = random.nextFloat().coerceIn(1e-7f, 1.0f) val u2 = random.nextFloat() val mag = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) - val angle = (2.0 * Math.PI * u2).toFloat() + val angle = (2.0 * kotlin.math.PI * u2).toFloat() values[i] = mag * kotlin.math.cos(angle) values[i + 1] = mag * kotlin.math.sin(angle) i += 2 @@ -208,7 +208,7 @@ public class VoxtralFlowMatching( if (i < n) { val u1 = random.nextFloat().coerceIn(1e-7f, 1.0f) val u2 = 
random.nextFloat() - values[i] = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) * kotlin.math.cos((2.0 * Math.PI * u2).toFloat()) + values[i] = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) * kotlin.math.cos((2.0 * kotlin.math.PI * u2).toFloat()) } @Suppress("UNCHECKED_CAST") val result = ctx.fromFloatArray(shape, dtype, values) From b1c5457fe8db65cc7f8bd0f8adfd8d5feeedaac4 Mon Sep 17 00:00:00 2001 From: Michal Harakal Date: Sat, 11 Apr 2026 21:09:17 +0200 Subject: [PATCH 9/9] fix: fix Antora Docker image and playbook for local builds - Install npm packages to /opt/antora (not /antora which gets volume-mounted over) - Set NODE_PATH=/opt/antora/node_modules so Antora finds extensions - Use /opt/antora/node_modules/.bin/antora as entrypoint (not npx) - Fix playbook content source url to /antora (git repo root) Verified locally: 19 HTML pages, 12 Mermaid SVG diagrams rendered via Chromium inside the container. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/.docker/Dockerfile | 20 +++++++++++--------- docs/antora-playbook.yml | 2 +- 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/docs/.docker/Dockerfile b/docs/.docker/Dockerfile index 0d496ff..67c21ba 100644 --- a/docs/.docker/Dockerfile +++ b/docs/.docker/Dockerfile @@ -10,26 +10,28 @@ RUN apk add --no-cache chromium font-noto ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \ PUPPETEER_SKIP_DOWNLOAD=true -WORKDIR /antora - -# Install Antora + extensions + mermaid-cli in one layer -RUN npm i --save-exact \ +# Install Antora + extensions to /opt/antora (not /antora which gets volume-mounted) +WORKDIR /opt/antora +RUN npm init -y && npm i --save-exact \ @antora/cli@3.1 \ @antora/site-generator@3.1 \ asciidoctor-kroki@0.18 \ @mermaid-js/mermaid-cli@11 \ && npm cache clean --force -# Mermaid-cli config: use installed Chromium, no sandbox (container) +# Make installed modules visible when workdir is the mounted project +ENV NODE_PATH=/opt/antora/node_modules + +# Mermaid-cli config RUN echo '{ \ 
"executablePath": "/usr/bin/chromium-browser", \ "args": ["--no-sandbox", "--disable-gpu", "--disable-dev-shm-usage"] \ -}' > /antora/puppeteer-config.json +}' > /opt/antora/puppeteer-config.json -# Pre-generate a simple diagram to warm up and verify the stack works +# Verify mermaid works RUN echo 'graph TD; A-->B;' > /tmp/test.mmd \ - && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /antora/puppeteer-config.json \ + && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /opt/antora/puppeteer-config.json \ && rm /tmp/test.mmd /tmp/test.svg -ENTRYPOINT ["npx", "antora"] +ENTRYPOINT ["/opt/antora/node_modules/.bin/antora"] CMD ["--stacktrace", "antora-playbook.yml"] diff --git a/docs/antora-playbook.yml b/docs/antora-playbook.yml index b07afab..a21a2df 100644 --- a/docs/antora-playbook.yml +++ b/docs/antora-playbook.yml @@ -4,7 +4,7 @@ site: content: sources: - - url: . + - url: /antora start_path: docs branches: HEAD