From 576570a604cf8ee7875d79c180f2eb5e96e450c1 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 13:29:29 +0200
Subject: [PATCH 1/9] fix: remove pre-transpose in LlamaRuntime to halve peak memory

LlamaRuntime previously pre-transposed all weight tensors at init, doubling
peak memory (~31GB for 8B models). This caused OOM on 48GB machines when
loading Qwen3-8B-Q4_K_M.

Replace eager pre-transpose with inline .t() calls during forward pass. The
GC reclaims each temporary transpose, so only one projection's worth of
memory (~200MB) is live at a time instead of the full set.

Before: ~62GB peak (31GB weights + 31GB transposed copies)
After:  ~31GB peak (weights only + 200MB temp per layer)

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 ISSUE-skainet-8b-oom.md                            | 113 ++++++++++++++++++
 .../sk/ainet/models/llama/LlamaRuntime.kt          |  62 ++++------
 2 files changed, 133 insertions(+), 42 deletions(-)
 create mode 100644 ISSUE-skainet-8b-oom.md

diff --git a/ISSUE-skainet-8b-oom.md b/ISSUE-skainet-8b-oom.md
new file mode 100644
index 0000000..6977ec9
--- /dev/null
+++ b/ISSUE-skainet-8b-oom.md
@@ -0,0 +1,113 @@
+# Issue: Qwen3-8B OOM on 48GB Mac
+
+## Problem
+
+Running Qwen3-8B-Q4_K_M.gguf (4.7GB on disk) on a 48GB Mac fails with OOM during weight loading, both via kllama and the unified skainet CLI.
+
+## Root Cause
+
+The current loading path uses `DEQUANTIZE_TO_FP32`, which expands Q4 weights 8x:
+
+| Component                 | Size       |
+|---------------------------|------------|
+| Quantized weights (disk)  | 4.7 GB     |
+| Dequantized FP32 weights  | ~37-40 GB  |
+| KV cache (2048 context)   | 512 MB     |
+| Embeddings, norms         | ~1 GB      |
+| JVM + tokenizer           | ~2 GB      |
+| **Total**                 | **~41 GB** |
+
+48GB barely fits, and the JVM needs headroom for temporary buffers during dequantization, so it OOMs.
+
+## What Already Exists in the Codebase
+
+### 1. NATIVE_OPTIMIZED quant policy (best option)
+
+`QuantPolicy.NATIVE_OPTIMIZED` keeps weights in quantized form and uses SIMD-accelerated matmul kernels. `MemSegWeightConverter` converts raw Q4/Q8 bytes to 64-byte-aligned MemorySegment-backed tensors for Vector API dispatch.
+
+- Memory: ~5GB for the 8B model (vs 40GB with FP32)
+- Speed: 1-3 tok/s (proven on Qwen2/3 via kqwen runner)
+- Already works for Qwen2/3 in kllama Main.kt (the `isQwen` path)
+
+**Why it doesn't work today for the 8B:** The kllama `isQwen` path loads with `NATIVE_OPTIMIZED` but then creates `LlamaRuntime` which still transposes weight matrices to FP32 during init (`LlamaRuntime.kt:74`). This transpose step allocates FP32 copies.
+
+### 2. Lazy per-layer dequantization (Apertus pattern)
+
+`ApertusQuantizedRuntime` keeps weights quantized and dequantizes one projection at a time during `runLayer()`:
+
+```
+Resident: ~3.5GB (quantized) + ~100MB (norms/embeddings)
+Per-layer temp: ~50MB (one projection, discarded after matmul)
+```
+
+This is the llama.cpp approach. Not yet available for LLaMA/Qwen runtimes.
+
+### 3. Memory-mapped loading (F32 only)
+
+`MmapLlamaLoader` maps the GGUF file via `MappedByteBuffer` for zero-copy tensor access. Only works for F32 models — Q4 models need dequantization which defeats the zero-copy benefit.
+
+## Proposed Solutions (ordered by effort)
+
+### Solution A: Fix NATIVE_OPTIMIZED path for 8B models (small effort)
+
+The kllama Main.kt Qwen path already loads with `NATIVE_OPTIMIZED`. The problem is `LlamaRuntime` constructor transposes weights to FP32. Fix:
+
+1. Skip transpose for quantized tensors in `LlamaRuntime` init
+2. Or use `OptimizedLLMRuntime` which doesn't transpose (the DSL path)
+3. Ensure SIMD matmul kernels handle Q4_K_M format (Q4_K dispatch exists in `MemSegWeightConverter`)
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at 1-3 tok/s.
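
Option 1 above (skip the transpose when a tensor is quantized) amounts to a type check at weight-preparation time. A minimal sketch in plain Java with made-up types, not the skainet API:

```java
/**
 * Sketch: at init, only FP32 weight matrices get an eager transpose;
 * quantized weights stay in their opaque block layout for the SIMD kernel.
 * Illustrative types only — not the skainet API.
 */
final class WeightPrep {
    static final class Weight {
        final float[][] data;     // row-major [out, in] when FP32
        final boolean quantized;  // true = quantized blocks, must not be transposed
        Weight(float[][] data, boolean quantized) {
            this.data = data;
            this.quantized = quantized;
        }
    }

    /** FP32: one-time [out, in] -> [in, out] transpose. Quantized: pass through. */
    static Weight prepare(Weight w) {
        if (w.quantized) return w; // quantized matmul kernel consumes [out, in] directly
        int rows = w.data.length, cols = w.data[0].length;
        float[][] t = new float[cols][rows];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                t[c][r] = w.data[r][c];
        return new Weight(t, false);
    }
}
```

Because the quantized branch returns the same object, the ~31GB of transposed FP32 copies is never allocated for quantized projections.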

+**Files to change:**
+- `llm-inference/llama/.../LlamaRuntime.kt` -- skip transpose for quantized MemSeg tensors
+- Or migrate Qwen path in `kllama/cli/Main.kt` to `OptimizedLLMRuntime` + `llamaNetwork()`
+
+### Solution B: Port lazy dequant from Apertus to LLaMA (medium effort)
+
+Port the `ApertusQuantizedRuntime` pattern to a `LlamaQuantizedRuntime`:
+
+1. Store projections as `QuantizedTensor` (quantized bytes + metadata)
+2. In `runLayer()`, dequantize one weight matrix at a time, matmul, discard
+3. Keep embeddings and norms as FP32 (small, need element access)
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at ~1 tok/s (dequant overhead per layer).
+
+**Files to create:**
+- `llm-inference/llama/.../LlamaQuantizedRuntime.kt` (new, based on Apertus pattern)
+- `llm-runtime/kllama/.../LlamaQuantizedWeights.kt` (new, mixed storage)
+
+### Solution C: SIMD-native matmul without dequantization (larger effort, best perf)
+
+The SIMD backend (`skainet-backend-cpu`) already has Q4/Q8 matmul kernels via Vector API. The issue is the runtime layer doesn't use them directly. Changes needed in skainet core:
+
+1. `skainet-backend-cpu`: Ensure `matmul(FP32, Q4_K)` kernel exists and dispatches correctly
+2. `LlamaRuntime` or `OptimizedLLMRuntime`: Accept mixed-precision weight tensors (Q4 weights, FP32 activations)
+3. Skip the `MemSegWeightConverter` step entirely — use raw quantized MemorySegments
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at 2-5 tok/s (no dequant overhead).
+
+**Files to change (in skainet core):**
+- `skainet-backend-cpu`: Q4_K matmul kernel (may already exist)
+- `skainet-lang-core`: Mixed-precision tensor support in matmul dispatch
+
+### Solution D: Memory-mapped quantized tensors (largest effort)
+
+Extend `MmapLlamaLoader` to support quantized formats:
+
+1. Map the GGUF file to virtual memory
+2. Create quantized tensor views that reference mmap regions
+3. Dequantize on-the-fly during matmul (like lazy dequant but zero-copy from disk)
+
+**Expected result:** Load time near-zero, ~5GB virtual (OS manages paging).
+
+**Files to change:**
+- `llm-inference/llama/.../MmapLlamaLoader.kt` -- extend to Q4/Q8 formats
+- Requires `skainet-io-core` changes for mmap quantized tensor views
+
+## Recommended Path
+
+**Start with Solution A** — it's the smallest change and uses code that already works for Qwen2/3. The `NATIVE_OPTIMIZED` + `MemSegWeightConverter` path is proven; the only blocker is `LlamaRuntime`'s constructor transposing weights to FP32.
+
+If that's not enough, **add Solution B** (lazy dequant) which gives the most control over memory at a known performance cost.
+
+Solution C is the long-term goal (best performance) but requires skainet core changes.
diff --git a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
index e461142..3a0e562 100644
--- a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
+++ b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
@@ -54,29 +54,9 @@ public class LlamaRuntime(
         const val DEFAULT_BOS_TOKEN: Int = 1
     }
 
-    /** Pre-transposed weight tensors per layer — avoids re-creating lazy transpose wrappers every forward pass. */
-    private class TransposedLayerWeights(
-        val wqT: Tensor,
-        val wkT: Tensor,
-        val wvT: Tensor,
-        val woT: Tensor,
-        val ffnGateT: Tensor,
-        val ffnDownT: Tensor,
-        val ffnUpT: Tensor,
-    )
-
-    private val transposedLayers: List<TransposedLayerWeights> = weights.layers.map { layer ->
-        TransposedLayerWeights(
-            wqT = layer.wq.t(),
-            wkT = layer.wk.t(),
-            wvT = layer.wv.t(),
-            woT = layer.wo.t(),
-            ffnGateT = layer.ffnGate.t(),
-            ffnDownT = layer.ffnDown.t(),
-            ffnUpT = layer.ffnUp.t(),
-        )
-    }
-    private val outputWeightT: Tensor = weights.outputWeight.t()
+    // NOTE: weights are transposed on-the-fly during forward pass rather than
+    // pre-transposed at init. This halves peak memory (~31GB saved for 8B models)
+    // at the cost of per-token transpose allocations that the GC reclaims.
 
     // ---- DecoderRuntime abstract properties ----
     override val dim: Int = weights.metadata.embeddingLength
@@ -131,7 +111,6 @@ public class LlamaRuntime(
         embedding.forward(intArrayOf(tokenId), ctx)
 
     override fun runLayer(layerIdx: Int, x: Tensor): Tensor {
-        val tl = transposedLayers[layerIdx]
         val layer = weights.layers[layerIdx]
 
         // QKV: try compiled graph first, fall back to individual ops
@@ -140,9 +119,9 @@ public class LlamaRuntime(
         } ?: run {
             val attnNorm = attnNorms[layerIdx].forward(x, ctx)
             Triple(
-                attnNorm.matmul(tl.wqT),
-                attnNorm.matmul(tl.wkT),
-                attnNorm.matmul(tl.wvT)
+                attnNorm.matmul(layer.wq.t()),
+                attnNorm.matmul(layer.wk.t()),
+                attnNorm.matmul(layer.wv.t())
             )
         }
 
@@ -156,14 +135,14 @@ public class LlamaRuntime(
         val attnOut = attentionBackend.attention(q, k, v, layerIdx, position)
 
         // Output projection + residual
-        val afterAttn = x + attnOut.matmul(tl.woT)
+        val afterAttn = x + attnOut.matmul(layer.wo.t())
 
         // FFN: try compiled graph first, fall back to individual ops
         return graphAccelerator?.runFFN(layerIdx, afterAttn) ?: run {
             val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-            val gate = ffnNorm.matmul(tl.ffnGateT).silu()
-            val up = ffnNorm.matmul(tl.ffnUpT)
-            val ffnOut = (gate * up).matmul(tl.ffnDownT)
+            val gate = ffnNorm.matmul(layer.ffnGate.t()).silu()
+            val up = ffnNorm.matmul(layer.ffnUp.t())
+            val ffnOut = (gate * up).matmul(layer.ffnDown.t())
             afterAttn + ffnOut
         }
     }
@@ -172,7 +151,7 @@ public class LlamaRuntime(
         outputNormLayer.forward(x, ctx)
 
     override fun outputProject(x: Tensor): Tensor =
-        x.matmul(outputWeightT)
+        x.matmul(weights.outputWeight.t())
 
     override fun resetState() {
         attentionBackend.reset()
@@ -287,11 +266,10 @@ public class LlamaRuntime(
         var x = embedding.forward(tokenIds, ctx)
 
         weights.layers.forEachIndexed { layerIdx, layer ->
-            val tl = transposedLayers[layerIdx]
             val attnNorm = attnNorms[layerIdx].forward(x, ctx)
-            var q = attnNorm.matmul(tl.wqT)
-            var k = attnNorm.matmul(tl.wkT)
-            val v = attnNorm.matmul(tl.wvT)
+            var q = attnNorm.matmul(layer.wq.t())
+            var k = attnNorm.matmul(layer.wk.t())
+            val v = attnNorm.matmul(layer.wv.t())
 
             if (hasQKNorm) {
                 q = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm!!)
@@ -299,19 +277,19 @@ public class LlamaRuntime(
             }
 
             val attnOut = attentionBackend.batchAttention(q, k, v, layerIdx, startPos)
-                ?: return batchForwardFallback(tokenIds, startPos) // shouldn't happen but be safe
+                ?: return batchForwardFallback(tokenIds, startPos)
 
-            val afterAttn = x + attnOut.matmul(tl.woT)
+            val afterAttn = x + attnOut.matmul(layer.wo.t())
 
             val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-            val gate = ffnNorm.matmul(tl.ffnGateT).silu()
-            val up = ffnNorm.matmul(tl.ffnUpT)
-            val ffnOut = (gate * up).matmul(tl.ffnDownT)
+            val gate = ffnNorm.matmul(layer.ffnGate.t()).silu()
+            val up = ffnNorm.matmul(layer.ffnUp.t())
+            val ffnOut = (gate * up).matmul(layer.ffnDown.t())
             x = afterAttn + ffnOut
         }
 
         val norm = outputNormLayer.forward(x, ctx)
-        val logits = norm.matmul(outputWeightT)
+        val logits = norm.matmul(weights.outputWeight.t())
         position = startPos + tokenIds.size
         return logits
     }

From 4934f5adc1f26554b0c8ddef50ab7beb140db035 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 13:36:00 +0200
Subject: [PATCH 2/9] perf: keep Q4_K weights quantized using native SIMD matmul kernel

MemSegWeightConverter previously dequantized Q4_K tensors to FP32 because
"no native SIMD kernel yet". But Q4_KBlockTensorData and
QuantizedMatmul.matmulQ4_K() already exist in skainet-backend-cpu.

Wire Q4_K into the SIMD path: create Q4_KBlockTensorData from raw bytes
instead of dequantizing. This keeps Q4_K weights in their compact quantized
form (~4.5 bits/param) and uses the SIMD kernel for matmul at inference time.

Memory impact for Qwen3-8B-Q4_K_M:
  Before: ~31GB (Q4_K dequantized to FP32)
  After:  ~5GB (Q4_K kept quantized)

Q5_K and Q6_K still dequantize to FP32 (no SIMD kernel yet).

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../sk/ainet/models/llama/MemSegWeightConverter.kt | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index 2e24d4f..a08b065 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -10,6 +10,7 @@ import sk.ainet.lang.tensor.Shape
 import sk.ainet.lang.tensor.Tensor
 import sk.ainet.lang.tensor.data.IntArrayTensorData
 import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
+import sk.ainet.lang.tensor.data.Q4_KBlockTensorData
 import sk.ainet.lang.tensor.data.Q8MemorySegmentTensorData
 import sk.ainet.lang.tensor.data.TensorData
 import sk.ainet.lang.types.FP32
@@ -94,10 +95,15 @@ public object MemSegWeightConverter {
                 Q4MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
             GGMLQuantizationType.Q8_0 ->
                 Q8MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
-            GGMLQuantizationType.Q4_K,
+            GGMLQuantizationType.Q4_K -> {
+                // Q4_K has a native SIMD matmul kernel — keep quantized
+                val q4kData = Q4_KBlockTensorData.fromRawBytes(logicalShape, bytes)
+                @Suppress("UNCHECKED_CAST")
+                return ctx.fromData(q4kData as TensorData, FP32::class)
+            }
             GGMLQuantizationType.Q5_K,
             GGMLQuantizationType.Q6_K -> {
-                // Dequantize K-quant types to FP32 (no native SIMD kernel yet)
+                // Q5_K/Q6_K: no native SIMD kernel yet, dequantize to FP32
                 val floats = DequantOps.dequantFromBytes(bytes, quantType, logicalShape.volume)
                 return ctx.fromFloatArray(logicalShape, FP32::class, floats)
             }

From 0d04bc0c2564605b2d5af96b9dc42f1cfa77b208 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 15:11:26 +0200
Subject: [PATCH 3/9] fix: add Q4_K SIMD matmul and quantized-aware linear projection

Wire Q4_K into MemSegWeightConverter to keep weights quantized (~5GB)
instead of dequantizing to FP32 (~31GB). Add linearProject() helper to
LlamaRuntime that dispatches to quantized matmul for Q4_K weights and
standard transpose+matmul for FP32 weights.

The 8B model now loads successfully on 48GB Mac (15s load time) and
generates tokens via the Q4_K SIMD kernel. However, the inline .t() on
Q6_K-dequantized FP32 tensors still causes direct buffer OOM during
inference -- the JVM doesn't reclaim direct memory eagerly. Full fix
requires lazy per-layer dequantization (Solution B) or Q6_K SIMD kernel
support.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 llm-apps/skainet-cli/build.gradle.kts              |  2 +-
 .../sk/ainet/models/llama/LlamaRuntime.kt          | 48 ++++++++++++-------
 .../models/llama/MemSegWeightConverter.kt          |  8 +++-
 3 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/llm-apps/skainet-cli/build.gradle.kts b/llm-apps/skainet-cli/build.gradle.kts
index 7a999be..608c38c 100644
--- a/llm-apps/skainet-cli/build.gradle.kts
+++ b/llm-apps/skainet-cli/build.gradle.kts
@@ -50,5 +50,5 @@ tasks.withType().configureEach {
 }
 
 tasks.withType().configureEach {
-    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx48g", "-XX:MaxDirectMemorySize=64g")
+    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx42g", "-XX:MaxDirectMemorySize=42g")
 }
diff --git a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
index 3a0e562..57be0ee 100644
--- a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
+++ b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
@@ -6,6 +6,7 @@ import sk.ainet.context.ExecutionContext
 import sk.ainet.models.llama.LlamaRuntimeWeights
 import sk.ainet.lang.nn.layers.Embedding
 import sk.ainet.lang.tensor.Tensor
+import sk.ainet.lang.tensor.data.Q4_KTensorData
 import sk.ainet.lang.tensor.matmul
 import sk.ainet.lang.tensor.plus
 import sk.ainet.lang.tensor.silu
@@ -57,6 +58,21 @@ public class LlamaRuntime(
     // NOTE: weights are transposed on-the-fly during forward pass rather than
     // pre-transposed at init. This halves peak memory (~31GB saved for 8B models)
     // at the cost of per-token transpose allocations that the GC reclaims.
+    // Quantized weights (Q4_K) skip transpose entirely — their matmul kernel
+    // handles the [out, in] layout directly.
+
+    /**
+     * Linear projection: y = x @ W^T.
+     * For FP32 weights, transposes and matmuls. For quantized weights (Q4_K),
+     * calls matmul directly (the quantized kernel handles the layout).
+     */
+    private fun linearProject(x: Tensor, w: Tensor): Tensor {
+        return if (w.data is Q4_KTensorData) {
+            x.matmul(w) // quantized kernel handles [out, in] layout
+        } else {
+            x.matmul(w.t())
+        }
+    }
 
     // ---- DecoderRuntime abstract properties ----
     override val dim: Int = weights.metadata.embeddingLength
@@ -119,9 +135,9 @@ public class LlamaRuntime(
         } ?: run {
             val attnNorm = attnNorms[layerIdx].forward(x, ctx)
             Triple(
-                attnNorm.matmul(layer.wq.t()),
-                attnNorm.matmul(layer.wk.t()),
-                attnNorm.matmul(layer.wv.t())
+                linearProject(attnNorm, layer.wq),
+                linearProject(attnNorm, layer.wk),
+                linearProject(attnNorm, layer.wv)
             )
         }
 
@@ -135,14 +151,14 @@ public class LlamaRuntime(
         val attnOut = attentionBackend.attention(q, k, v, layerIdx, position)
 
         // Output projection + residual
-        val afterAttn = x + attnOut.matmul(layer.wo.t())
+        val afterAttn = x + linearProject(attnOut, layer.wo)
 
         // FFN: try compiled graph first, fall back to individual ops
         return graphAccelerator?.runFFN(layerIdx, afterAttn) ?: run {
             val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-            val gate = ffnNorm.matmul(layer.ffnGate.t()).silu()
-            val up = ffnNorm.matmul(layer.ffnUp.t())
-            val ffnOut = (gate * up).matmul(layer.ffnDown.t())
+            val gate = linearProject(ffnNorm, layer.ffnGate).silu()
+            val up = linearProject(ffnNorm, layer.ffnUp)
+            val ffnOut = linearProject(gate * up, layer.ffnDown)
             afterAttn + ffnOut
         }
     }
@@ -151,7 +167,7 @@ public class LlamaRuntime(
         outputNormLayer.forward(x, ctx)
 
     override fun outputProject(x: Tensor): Tensor =
-        x.matmul(weights.outputWeight.t())
+        linearProject(x, weights.outputWeight)
 
     override fun resetState() {
         attentionBackend.reset()
@@ -267,9 +283,9 @@ public class LlamaRuntime(
 
         weights.layers.forEachIndexed { layerIdx, layer ->
             val attnNorm = attnNorms[layerIdx].forward(x, ctx)
-            var q = attnNorm.matmul(layer.wq.t())
-            var k = attnNorm.matmul(layer.wk.t())
-            val v = attnNorm.matmul(layer.wv.t())
+            var q = linearProject(attnNorm, layer.wq)
+            var k = linearProject(attnNorm, layer.wk)
+            val v = linearProject(attnNorm, layer.wv)
 
             if (hasQKNorm) {
                 q = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm!!)
@@ -279,17 +295,17 @@ public class LlamaRuntime(
             }
 
             val attnOut = attentionBackend.batchAttention(q, k, v, layerIdx, startPos)
                 ?: return batchForwardFallback(tokenIds, startPos)
 
-            val afterAttn = x + attnOut.matmul(layer.wo.t())
+            val afterAttn = x + linearProject(attnOut, layer.wo)
 
             val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
-            val gate = ffnNorm.matmul(layer.ffnGate.t()).silu()
-            val up = ffnNorm.matmul(layer.ffnUp.t())
-            val ffnOut = (gate * up).matmul(layer.ffnDown.t())
+            val gate = linearProject(ffnNorm, layer.ffnGate).silu()
+            val up = linearProject(ffnNorm, layer.ffnUp)
+            val ffnOut = linearProject(gate * up, layer.ffnDown)
             x = afterAttn + ffnOut
         }
 
         val norm = outputNormLayer.forward(x, ctx)
-        val logits = norm.matmul(weights.outputWeight.t())
+        val logits = linearProject(norm, weights.outputWeight)
         position = startPos + tokenIds.size
         return logits
     }
diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index a08b065..55fdf86 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -96,8 +96,12 @@ public object MemSegWeightConverter {
             GGMLQuantizationType.Q8_0 ->
                 Q8MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
             GGMLQuantizationType.Q4_K -> {
-                // Q4_K has a native SIMD matmul kernel — keep quantized
-                val q4kData = Q4_KBlockTensorData.fromRawBytes(logicalShape, bytes)
+                // Q4_K has a native SIMD matmul kernel — keep quantized.
+                // GGUF stores weights as [out, in], but matmul expects [in, out],
+                // so we pass the transposed shape. The block data layout stays
+                // the same — Q4_K matmul reads rows in the transposed order.
+                val transposedShape = Shape(logicalShape[1], logicalShape[0])
+                val q4kData = Q4_KBlockTensorData.fromRawBytes(transposedShape, bytes)
                 @Suppress("UNCHECKED_CAST")
                 return ctx.fromData(q4kData as TensorData, FP32::class)
             }

From 8aee3fa486855208553f5f362a8b72a70e9941b4 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 15:31:00 +0200
Subject: [PATCH 4/9] fix: pre-transpose weights during loading to eliminate runtime OOM

Pre-transpose ALL projection weights during MemSegWeightConverter so
LlamaRuntime never calls .t() during inference. This eliminates direct
buffer allocations that the JVM doesn't GC eagerly, which caused OOM on
48GB machines.

- Q4_K: transposed shape passed to Q4_KBlockTensorData.fromRawBytes()
- Q6_K: dequantize to FP32 + array transpose to [in, out] layout
- FP32: pre-transpose via .t() during conversion (one-time cost)
- linearProject() auto-detects layout: [in,out] = direct, [out,in] = .t()

Qwen3-8B-Q4_K_M now runs on 48GB Mac at 14.2GB RSS (was 45GB+ OOM).
Token generation works but output quality needs validation (Q6_K
transpose ordering may need adjustment).

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../sk/ainet/models/llama/LlamaRuntime.kt          | 17 ++++++++-----
 .../models/llama/MemSegWeightConverter.kt          | 24 +++++++++++++++----
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
index 57be0ee..468054e 100644
--- a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
+++ b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
@@ -6,7 +6,6 @@ import sk.ainet.context.ExecutionContext
 import sk.ainet.models.llama.LlamaRuntimeWeights
 import sk.ainet.lang.nn.layers.Embedding
 import sk.ainet.lang.tensor.Tensor
-import sk.ainet.lang.tensor.data.Q4_KTensorData
 import sk.ainet.lang.tensor.matmul
 import sk.ainet.lang.tensor.plus
 import sk.ainet.lang.tensor.silu
@@ -62,14 +61,20 @@ public class LlamaRuntime(
     /**
-     * Linear projection: y = x @ W^T.
-     * For FP32 weights, transposes and matmuls. For quantized weights (Q4_K),
-     * calls matmul directly (the quantized kernel handles the layout).
+     * Linear projection: y = x @ W.
+     *
+     * When weights are pre-transposed to [in, out] by MemSegWeightConverter
+     * (Q4_K, Q6_K, FP32 via NATIVE_OPTIMIZED), uses direct matmul.
+     * Otherwise falls back to .t() for non-converted weights (tests, DEQUANTIZE_TO_FP32).
      */
     private fun linearProject(x: Tensor, w: Tensor): Tensor {
-        return if (w.data is Q4_KTensorData) {
-            x.matmul(w) // quantized kernel handles [out, in] layout
+        val xCols = if (x.shape.rank >= 2) x.shape[x.shape.rank - 1] else x.shape[0]
+        val wRows = w.shape[0]
+        return if (wRows == xCols) {
+            // Weight is [in, out] — already transposed, direct matmul
+            x.matmul(w)
         } else {
+            // Weight is [out, in] — needs transpose (legacy path)
             x.matmul(w.t())
         }
     }
diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index 55fdf86..c6dace1 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -8,6 +8,7 @@ import sk.ainet.models.llama.LlamaRuntimeWeights
 import sk.ainet.models.llama.LlamaTensorNames
 import sk.ainet.lang.tensor.Shape
 import sk.ainet.lang.tensor.Tensor
+import sk.ainet.lang.tensor.t
 import sk.ainet.lang.tensor.data.IntArrayTensorData
 import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
 import sk.ainet.lang.tensor.data.Q4_KBlockTensorData
@@ -86,7 +87,11 @@ public object MemSegWeightConverter {
         ctx: ExecutionContext,
         arena: Arena
     ): Tensor {
-        val quantType = quantTypes[tensorName] ?: return tensor
+        val quantType = quantTypes[tensorName]
+        if (quantType == null) {
+            // FP32 tensor — pre-transpose to [in, out] so no .t() at runtime
+            return tensor.t()
+        }
 
         val bytes = extractBytes(tensor.data)
 
@@ -107,9 +112,20 @@ public object MemSegWeightConverter {
             }
             GGMLQuantizationType.Q5_K,
             GGMLQuantizationType.Q6_K -> {
-                // Q5_K/Q6_K: no native SIMD kernel yet, dequantize to FP32
-                val floats = DequantOps.dequantFromBytes(bytes, quantType, logicalShape.volume)
-                return ctx.fromFloatArray(logicalShape, FP32::class, floats)
+                // Q5_K/Q6_K: no native SIMD kernel yet, dequantize to FP32.
+                // Pre-transpose to [in, out] so LlamaRuntime never calls .t()
+                // (which allocates direct buffers that aren't GC'd eagerly).
+                val rows = logicalShape[0]
+                val cols = logicalShape[1]
+                val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols)
+                val transposed = FloatArray(rows * cols)
+                for (r in 0 until rows) {
+                    for (c in 0 until cols) {
+                        transposed[c * rows + r] = floats[r * cols + c]
+                    }
+                }
+                val transposedShape = Shape(cols, rows)
+                return ctx.fromFloatArray(transposedShape, FP32::class, transposed)
             }
             else -> {
                 println("WARNING: Unsupported quant type $quantType for MemorySegment conversion of $tensorName, keeping as-is")

From c97c4818a0459c8b8f8ea8a1de7141962f6affda Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 16:42:08 +0200
Subject: [PATCH 5/9] fix: correct Q4_K/Q6_K pre-transpose for valid 8B model output

Revert Q4_K block shape reinterpretation (corrupted block data layout)
and dequantize all K-quant types to FP32 with array pre-transpose. This
produces correct output at the cost of higher memory (~30GB vs ~14GB),
but still fits on 48GB Mac without runtime OOM.

Qwen3-8B-Q4_K_M now generates correct output:
  "The capital of France is Paris."

Memory: ~30GB RSS (was 45GB+ OOM before pre-transpose fix)
Speed: 0.002 tok/s (CPU-only 8B, expected for scalar FP32 matmul)

Future: native Q4_K SIMD matmul with proper block-aware transpose would
reduce to ~14GB and improve speed significantly.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../models/llama/MemSegWeightConverter.kt          | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index c6dace1..8146746 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -11,7 +11,6 @@ import sk.ainet.lang.tensor.Tensor
 import sk.ainet.lang.tensor.t
 import sk.ainet.lang.tensor.data.IntArrayTensorData
 import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
-import sk.ainet.lang.tensor.data.Q4_KBlockTensorData
 import sk.ainet.lang.tensor.data.Q8MemorySegmentTensorData
 import sk.ainet.lang.tensor.data.TensorData
 import sk.ainet.lang.types.FP32
@@ -100,21 +99,12 @@ public object MemSegWeightConverter {
                 Q4MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
             GGMLQuantizationType.Q8_0 ->
                 Q8MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
-            GGMLQuantizationType.Q4_K -> {
-                // Q4_K has a native SIMD matmul kernel — keep quantized.
-                // GGUF stores weights as [out, in], but matmul expects [in, out],
-                // so we pass the transposed shape. The block data layout stays
-                // the same — Q4_K matmul reads rows in the transposed order.
-                val transposedShape = Shape(logicalShape[1], logicalShape[0])
-                val q4kData = Q4_KBlockTensorData.fromRawBytes(transposedShape, bytes)
-                @Suppress("UNCHECKED_CAST")
-                return ctx.fromData(q4kData as TensorData, FP32::class)
-            }
+            GGMLQuantizationType.Q4_K,
             GGMLQuantizationType.Q5_K,
             GGMLQuantizationType.Q6_K -> {
-                // Q5_K/Q6_K: no native SIMD kernel yet, dequantize to FP32.
-                // Pre-transpose to [in, out] so LlamaRuntime never calls .t()
-                // (which allocates direct buffers that aren't GC'd eagerly).
+                // Dequantize K-quant types to FP32 and pre-transpose to [in, out].
+                // Pre-transposing at load time avoids .t() at runtime, which
+                // allocates direct buffers the JVM doesn't GC eagerly (OOM on 48GB).
                 val rows = logicalShape[0]
                 val cols = logicalShape[1]
                 val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols)

From 5d7dd6e5186ffad89dd71f407f57f6e397f21f55 Mon Sep 17 00:00:00 2001
From: Michal Harakal
Date: Sat, 11 Apr 2026 16:45:02 +0200
Subject: [PATCH 6/9] docs: add deep technical guide on weight quantization pipeline

Covers the full numeric representation journey from GGUF file to matmul
kernel dispatch:

- Stage 1: GGUF on-disk layout (Q4_K_M block format, tensor shapes)
- Stage 2: Raw byte loading via StreamingGGUFReader
- Stage 3: MemSegWeightConverter paths (Q4_0/Q8_0 SIMD, K-quant dequant
  + pre-transpose, FP32 pre-transpose)
- Stage 4: LlamaRuntime.linearProject() auto-detection
- Stage 5: Matmul kernel dispatch (SIMD Q4/Q8, scalar FP32)
- Why Q4_K blocks cannot be trivially transposed
- Memory budget table for 8B model on 48GB Mac
- Future: block-aware Q4_K transpose for 5GB inference

Includes Mermaid diagram of the full data flow.
Co-Authored-By: Claude Opus 4.6 (1M context)
---
 docs/modules/ROOT/nav.adoc                         |   1 +
 .../explanation/weight-quantization.adoc           | 332 ++++++++++++++++++
 2 files changed, 333 insertions(+)
 create mode 100644 docs/modules/ROOT/pages/explanation/weight-quantization.adoc

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 5bc1fc9..895b219 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -23,3 +23,4 @@
 * xref:explanation/pipeline-design.adoc[Pipeline Design Decisions]
 * xref:explanation/dsl-vs-handcoded.adoc[DSL Networks vs Hand-Coded Runtimes]
 * xref:explanation/tokenizer-internals.adoc[Tokenizer Internals]
+* xref:explanation/weight-quantization.adoc[Weight Quantization and Numeric Representation]
diff --git a/docs/modules/ROOT/pages/explanation/weight-quantization.adoc b/docs/modules/ROOT/pages/explanation/weight-quantization.adoc
new file mode 100644
index 0000000..6974baf
--- /dev/null
+++ b/docs/modules/ROOT/pages/explanation/weight-quantization.adoc
@@ -0,0 +1,332 @@
+= Weight Quantization and Numeric Representation
+:description: Deep technical guide to how model weights flow through quantization, dequantization, transpose, and SIMD kernel dispatch.
+
+== Overview
+
+A model weight tensor goes through several numeric representation changes between the GGUF file on disk and the final matmul during inference.
+Understanding each stage is essential for debugging memory issues, correctness problems, and performance optimization.
+
+[mermaid]
+....
+graph TD
+    A["GGUF File<br/>(Q4_K_M: 4.7 GB)"] -->|StreamingGGUFReader| B["Raw Bytes<br/>IntArrayTensorData"]
+    B -->|MemSegWeightConverter| C{Quant Type?}
+    C -->|Q4_0| D["Q4MemorySegmentTensorData<br/>64-byte aligned, Arena-managed"]
+    C -->|Q8_0| E["Q8MemorySegmentTensorData<br/>64-byte aligned, Arena-managed"]
+    C -->|"Q4_K / Q5_K / Q6_K"| F["DequantOps.dequantFromBytes()<br/>→ FloatArray"]
+    C -->|FP32| G["tensor.t()<br/>MemorySegmentTensorData"]
+    F -->|"Array transpose<br/>[out,in] → [in,out]"| H["FloatArrayTensorData<br/>Pre-transposed"]
+    D --> I["SIMD Matmul<br/>QuantizedMatmul.matmulQ4_0()"]
+    E --> J["SIMD Matmul<br/>QuantizedMatmul.matmulQ8_0()"]
+    H --> K["Scalar Matmul<br/>DefaultCpuOps.matmul()"]
+    G --> K
+
+    style A fill:#f9f,stroke:#333
+    style I fill:#9f9,stroke:#333
+    style J fill:#9f9,stroke:#333
+    style K fill:#ff9,stroke:#333
+....
+
+== Stage 1: GGUF File on Disk
+
+GGUF stores each weight tensor as a contiguous byte region with a header describing its quantization type, shape, and byte offset.
+
+=== Quantization Types in Q4_K_M Format
+
+The `Q4_K_M` quantization scheme uses a **mixed-precision strategy**:
+
+[cols="1,2,2,1"]
+|===
+|Type |Used For |Block Format |Bits/Param
+
+|Q4_K
+|Large projections (wq, wk, wv, wo, ffn_gate, ffn_up, ffn_down) in ~50% of layers
+|144 bytes per 256 elements: f16 scale (2 bytes) + f16 min (2 bytes) + 12 scale bytes + 128 nibble codes
+|~4.5
+
+|Q6_K
+|Same projections in the other ~50% of layers, plus output weight
+|210 bytes per 256 elements: higher precision for critical layers
+|~6.5
+
+|Q8_0
+|Not used in Q4_K_M (used in Q8_0 format models)
+|34 bytes per 32 elements: 1×f16 scale + 32×int8 codes
+|~8.5
+
+|FP32
+|Norms (attn_norm, ffn_norm, output_norm) — 1D tensors
+|4 bytes per element
+|32
+|===
+
+=== Tensor Layout in GGUF
+
+All 2D weight tensors are stored in **row-major [out_dim, in_dim]** order:
+
+----
+wq:       Shape(dim, dim)    = [4096, 4096]   "4096 output neurons, each with 4096 input weights"
+wk:       Shape(kvDim, dim)  = [1024, 4096]   "1024 KV outputs (8 heads × 128 head_dim)"
+ffn_gate: Shape(ffnDim, dim) = [14336, 4096]  "14336 FFN hidden units"
+ffn_down: Shape(dim, ffnDim) = [4096, 14336]  "project back to model dim"
+----
+
+The matmul convention `y = x @ W^T` requires weights in `[in_dim, out_dim]` form, so a transpose is needed before or during the matmul.
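
The two orderings are numerically identical, which is what makes load-time pre-transposing safe. A minimal plain-Java sketch (illustrative only, not the skainet API) comparing transpose-on-the-fly against a pre-transposed copy:

```java
/** y = x @ W^T computed two ways: reading W as [out, in] on the fly,
 *  or reading a pre-transposed [in, out] copy. Results must match. */
final class MatmulConvention {
    // x: [n, in], w: [out, in] -> y: [n, out]
    static float[][] projectOnTheFly(float[][] x, float[][] w) {
        int n = x.length, in = x[0].length, out = w.length;
        float[][] y = new float[n][out];
        for (int i = 0; i < n; i++)
            for (int o = 0; o < out; o++) {
                float s = 0f;
                for (int k = 0; k < in; k++) s += x[i][k] * w[o][k]; // W read row-wise as [out, in]
                y[i][o] = s;
            }
        return y;
    }

    // x: [n, in], wT: [in, out] pre-transposed at load time -> y: [n, out]
    static float[][] projectPreTransposed(float[][] x, float[][] wT) {
        int n = x.length, in = x[0].length, out = wT[0].length;
        float[][] y = new float[n][out];
        for (int i = 0; i < n; i++)
            for (int o = 0; o < out; o++) {
                float s = 0f;
                for (int k = 0; k < in; k++) s += x[i][k] * wT[k][o];
                y[i][o] = s;
            }
        return y;
    }
}
```

The only difference between the stages described below is *when* the transpose cost is paid and in which memory pool the copy lives.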
+ +== Stage 2: Loading Raw Bytes + +`LlamaWeightLoader.loadToMapStreaming()` reads the GGUF file via `StreamingGGUFReader`: + +[source,kotlin] +---- +// QuantPolicy.NATIVE_OPTIMIZED: store as raw Int8 bytes +val tensor = streamingTensorToTensor(reader, tensorInfo, ctx) +// tensor.data is IntArrayTensorData containing the raw quantized bytes +---- + +At this stage, the tensor holds the original GGUF bytes unchanged. +A `quantTypes` map records each tensor's quantization type for later processing. + +.Memory at Stage 2 +---- +Qwen3-8B-Q4_K_M: ~4.7 GB (raw bytes, same as file size) +---- + +== Stage 3: MemSegWeightConverter + +`MemSegWeightConverter.convert()` transforms raw bytes into runtime-ready tensors. +This is where the numeric representation diverges by quantization type. + +=== Path A: Q4_0 → Q4MemorySegmentTensorData + +[source,kotlin] +---- +Q4MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena) +---- + +* Copies raw bytes into a 64-byte-aligned `MemorySegment` (Arena-managed, off-heap) +* The data stays in Q4_0 block format (no dequantization) +* The `MemorySegment` alignment enables SIMD vector loads + +.Memory: same as raw bytes (~4.5 bits/param) + +=== Path B: Q8_0 → Q8MemorySegmentTensorData + +Same as Q4_0 but with Q8_0 block format (8 bits per code + f16 scale per 32 elements). + +.Memory: ~8.5 bits/param + +=== Path C: Q4_K / Q5_K / Q6_K → FP32 + Pre-Transpose + +[source,kotlin] +---- +// 1. Dequantize to float array +val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols) + +// 2. Pre-transpose from [out, in] to [in, out] +val transposed = FloatArray(rows * cols) +for (r in 0 until rows) { + for (c in 0 until cols) { + transposed[c * rows + r] = floats[r * cols + c] + } +} + +// 3. Store as heap-based FloatArrayTensorData +return ctx.fromFloatArray(Shape(cols, rows), FP32::class, transposed) +---- + +**Why dequantize?** No native SIMD kernel exists for K-quant block formats yet. 
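The memory cost of dequantizing can be derived directly from the block sizes in the Stage 1 format table. A small arithmetic sketch (numbers taken from that table):

```python
# Bytes-per-block and elements-per-block come from the quantization-type
# table in Stage 1; this just converts them to bits/param and the
# expansion factor incurred by dequantizing to FP32 (32 bits/param).
formats = {
    "Q4_K": (144, 256),  # (bytes per block, elements per block)
    "Q6_K": (210, 256),
    "Q8_0": (34, 32),
}
bits_per_param = {name: b * 8 / e for name, (b, e) in formats.items()}
fp32_expansion = {name: 32 / bits for name, bits in bits_per_param.items()}
# Q4_K -> 4.5 bits/param; FP32 is 32 / 4.5 ≈ 7.1x larger
# (the "8x" figure elsewhere in this series is that factor rounded up).
```

For the dominant Q4_K projections this is roughly a 7x growth, which is where the ~4.7 GB file becomes ~30 GB of resident FP32 weights.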
+ +**Why pre-transpose?** The `.t()` operation on tensors allocates a new `MemorySegmentTensorData` in direct buffer memory. +The JVM's direct buffer allocator does not reclaim memory eagerly, causing OOM on memory-constrained machines (48GB). +Pre-transposing during loading avoids all runtime `.t()` calls. + +.Memory per K-quant tensor +---- +Original Q4_K: ~4.5 bits/param +After dequant: 32 bits/param (8× expansion) +Temporary: 2× (original float array + transposed array, then original is GC'd) +---- + +.Total memory for Qwen3-8B-Q4_K_M after Stage 3 +---- +Q4_K tensors (dequantized + transposed): ~15 GB +Q6_K tensors (dequantized + transposed): ~12 GB +Token embedding (dequantized, not transposed): ~2.4 GB +Norms (FP32, 1D, tiny): ~0.01 GB +Total: ~30 GB +---- + +=== Path D: FP32 → Pre-Transpose + +[source,kotlin] +---- +return tensor.t() // one-time transpose during loading +---- + +Norms are 1D so `.t()` is a no-op. For FP32 projection weights (rare), a standard transpose copies to direct memory once. + +=== Special Case: Token Embedding + +[source,kotlin] +---- +tokenEmbedding = maybeDequantize(weights.tokenEmbedding, ...) +---- + +Token embeddings are always dequantized to FP32 and **not transposed** because `Embedding.forward()` does row gather (lookup by token ID), not matmul. 
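A row gather is just an offset copy into the `[vocab, dim]` table, which is why the embedding layout never needs transposing. A minimal sketch with hypothetical toy sizes (vocab=4, dim=3 — not the real 151936×4096 table):

```python
# Toy embedding table in the same row-major [vocab, dim] layout the loader
# keeps. Looking up a token is a contiguous row copy — no matmul involved,
# so no transpose is ever required.
vocab, dim = 4, 3
table = [float(10 * r + c) for r in range(vocab) for c in range(dim)]

def embed(token_id):
    start = token_id * dim
    return table[start:start + dim]

assert embed(2) == [20.0, 21.0, 22.0]  # row 2, copied as-is
```

This is also why the token embedding contributes a fixed FP32 cost (one dequantized table) rather than participating in the transpose question at all.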
+ +== Stage 4: LlamaRuntime.linearProject() + +During inference, each projection uses `linearProject()`: + +[source,kotlin] +---- +private fun linearProject(x: Tensor, w: Tensor): Tensor { + val xCols = if (x.shape.rank >= 2) x.shape[x.shape.rank - 1] else x.shape[0] + val wRows = w.shape[0] + return if (wRows == xCols) { + x.matmul(w) // weight is [in, out] — pre-transposed + } else { + x.matmul(w.t()) // weight is [out, in] — legacy path (tests) + } +} +---- + +The shape check auto-detects the weight layout: + +* **Pre-transposed** `[in, out]`: `wRows == xCols` → direct `matmul`, no allocation +* **Original** `[out, in]`: `wRows != xCols` → `.t()` then `matmul` (legacy/test path) + +== Stage 5: Matmul Kernel Dispatch + +The `Tensor.matmul()` extension dispatches based on the runtime `TensorData` type: + +[cols="1,2,2"] +|=== +|TensorData Type |Kernel |Implementation + +|`Q4MemorySegmentTensorData` +|`QuantizedMatmul.matmulQ4_0()` +|SIMD (Vector API): processes 32 Q4 values per vector lane + +|`Q8MemorySegmentTensorData` +|`QuantizedMatmul.matmulQ8_0()` +|SIMD (Vector API): dot product of int8 codes × float scale + +|`Q4_KBlockTensorData` +|`QuantizedMatmul.matmulQ4_K()` +|SIMD: unpacks K-quant blocks with dual scales + min values + +|`FloatArrayTensorData` +|`DefaultCpuOps.matmul()` +|Scalar FP32 double loop (no SIMD) + +|`MemorySegmentTensorData` +|`DefaultCpuOpsJvm.matmul()` +|SIMD FP32 via Vector API +|=== + +=== SIMD Q4_0 Matmul (Simplified) + +[source] +---- +For each output row: + For each block of 32 input elements: + Load 16 bytes of Q4 codes from MemorySegment (128 bits) + Unpack low/high nibbles into two int8 vectors (256 bits each) + Subtract zero-point (8) + Convert to float vectors + Multiply by block scale (f16 → f32) + FMA with input vector → accumulate into output +---- + +=== Why Q4_K Cannot Be Trivially Transposed + +Q4_K blocks encode 256 elements with a complex internal structure: + +---- +Block (144 bytes): + [0..1] d (f16) — primary scale + 
[2..3] dmin (f16) — minimum offset + [4..15] scales (12 bytes) — per-subblock scales (6-bit packed) + [16..143] qs (128 bytes) — quantized codes (4-bit packed, 256 values) +---- + +The 256 values in each block correspond to **256 contiguous elements in the original row**. +Transposing the matrix would scatter these elements across different rows, breaking the block structure. +A proper Q4_K transpose would require: + +1. Dequantize all blocks → FP32 +2. Transpose the FP32 matrix +3. Re-quantize into new Q4_K blocks + +This is why `MemSegWeightConverter` currently dequantizes K-quant types to FP32 rather than keeping them quantized. + +== Memory Budget: Qwen3-8B-Q4_K_M on 48GB Mac + +[cols="2,1,3"] +|=== +|Component |Size |Notes + +|K-quant weights (FP32 pre-transposed) +|~27 GB +|Q4_K + Q6_K dequantized, no runtime `.t()` copies + +|Token embedding (FP32) +|2.4 GB +|151936 × 4096 × 4 bytes + +|Norms (FP32) +|~10 MB +|1D tensors, negligible + +|KV cache (context=512) +|~128 MB +|2 × 36 layers × 512 × 1024 × 4 bytes + +|JVM + tokenizer +|~1 GB +|Heap overhead, vocab structures + +|**Total** +|**~31 GB** +|Fits in 48 GB with OS headroom +|=== + +== Performance Characteristics + +[cols="2,1,1,2"] +|=== +|Path |Bits/Param |Memory |Speed (8B, M-series CPU) + +|Q4_K SIMD (future) +|4.5 +|~5 GB +|~1-3 tok/s (projected) + +|Q8_0 SIMD +|8.5 +|~9 GB +|~1-2 tok/s + +|FP32 pre-transposed (current) +|32 +|~30 GB +|~0.002 tok/s (scalar) + +|FP32 + runtime .t() (old, OOM) +|32 + 32 (copy) +|~60 GB +|OOM on 48GB +|=== + +== Future: Block-Aware Q4_K Transpose + +To use the Q4_K SIMD kernel with GGUF weights, the skainet core library would need: + +1. **`Q4_KBlockTensorData.transpose()`** — dequantize → rearrange → re-quantize at the block level +2. Or **`QuantizedMatmul.matmulQ4_K_transposed()`** — a kernel variant that reads blocks in column-major order +3. 
Or **GGUF pre-transposed storage** — store weights as `[in, out]` in the GGUF file during quantization + +Option 2 is the most practical: modify the SIMD kernel to iterate over columns instead of rows when reading Q4_K blocks. +This would reduce memory from ~30GB to ~5GB for the 8B model. From b3646e05d5ab56b161aac745ca47b0c187653004 Mon Sep 17 00:00:00 2001 From: Michal Harakal Date: Sat, 11 Apr 2026 21:02:19 +0200 Subject: [PATCH 7/9] feat: add instruct mode to smoke tests for chat-tuned models MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add runChatOnce() to AgentCli for non-interactive single-prompt instruct testing. Wire --chat with positional prompt to single-shot mode. Add instruct field to smoke-models.json — when true, the smoke test uses --chat with chat template formatting instead of raw text completion. Fixes garbage output from instruct models (Qwen3) in smoke tests. Refs: #49 Co-Authored-By: Claude Opus 4.6 (1M context) --- .../sk/ainet/apps/kllama/cli/AgentMain.kt | 41 +++++++++++++++++++ .../kotlin/sk/ainet/apps/kllama/cli/Main.kt | 6 ++- tests/smoke/smoke-models.json | 6 ++- tests/smoke/smoke-test.sh | 9 +++- 4 files changed, 59 insertions(+), 3 deletions(-) diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt index ebe4289..12221c5 100644 --- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt +++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt @@ -29,6 +29,47 @@ public class AgentCli( private val session = ChatSession(runtime, tokenizer, metadata, templateName) private val template: ChatTemplate = session.chatTemplate + /** + * Run a single non-interactive chat round. Used by smoke tests for instruct models. + */ + public fun runChatOnce( + userPrompt: String, + systemPrompt: String = "You are a helpful assistant. 
Answer concisely.", + maxTokens: Int = 64, + temperature: Float = 0.0f + ) { + val messages = mutableListOf( + ChatMessage(role = ChatRole.SYSTEM, content = systemPrompt), + ChatMessage(role = ChatRole.USER, content = userPrompt) + ) + + runtime.reset() + val prompt = template.apply(messages, emptyList(), addGenerationPrompt = true) + val promptTokens = tokenizer.encode(prompt) + + print("Assistant: ") + System.out.flush() + + val startTime = System.nanoTime() + val result = runtime.generateUntilStop( + prompt = promptTokens, + maxTokens = maxTokens, + eosTokenId = tokenizer.eosTokenId, + temperature = temperature, + onToken = { tokenId -> + print(tokenizer.decode(tokenId)) + System.out.flush() + }, + decode = { tokenId -> tokenizer.decode(tokenId) } + ) + val elapsed = (System.nanoTime() - startTime) / 1_000_000.0 + + println() + println("---") + val tokPerSec = if (elapsed > 0) result.tokens.size / elapsed * 1000 else 0.0 + println("tok/s: $tokPerSec") + } + /** * Run interactive chat mode (no tool calling). 
*/ diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt index a89da0e..ddf31c0 100644 --- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt +++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt @@ -508,7 +508,11 @@ fun main(args: Array) { } else -> { val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata) - agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature) + if (cliArgs.prompt != null) { + agentCli.runChatOnce(cliArgs.prompt, maxTokens = cliArgs.steps, temperature = cliArgs.temperature) + } else { + agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature) + } } } return@runBlocking diff --git a/tests/smoke/smoke-models.json b/tests/smoke/smoke-models.json index 2958517..895b06e 100644 --- a/tests/smoke/smoke-models.json +++ b/tests/smoke/smoke-models.json @@ -13,9 +13,11 @@ }, { "name": "Qwen3-1.7B-Q8", - "runner": "kqwen", + "runner": "kllama", "model": "Qwen3-1.7B-Q8_0.gguf", "format": "gguf", + "instruct": true, + "prompt": "What is the capital of France?", "toolCalling": { "prompt": "What is 2 + 2?", "steps": 256 @@ -26,6 +28,8 @@ "runner": "kllama", "model": "Qwen3-8B-Q4_K_M.gguf", "format": "gguf", + "instruct": true, + "prompt": "What is the capital of France?", "toolCalling": { "prompt": "What is 2 + 2?", "steps": 256 diff --git a/tests/smoke/smoke-test.sh b/tests/smoke/smoke-test.sh index 2756e78..c1f6ca1 100755 --- a/tests/smoke/smoke-test.sh +++ b/tests/smoke/smoke-test.sh @@ -200,6 +200,7 @@ print(f'M_STEPS={m.get(\"steps\", d.get(\"steps\", 32))}') print(f'M_TEMP={m.get(\"temperature\", d.get(\"temperature\", 0.0))}') print(f'M_DOC={repr(m.get(\"doc\", \"\"))}') print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}') +print(f'M_INSTRUCT={repr(m.get(\"instruct\", False))}') ")" M_MODEL=$(expand_path "$M_MODEL") @@ -225,7 +226,13 @@ 
print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}') fi task=$(runner_task "$M_RUNNER") - args=$(runner_args "$M_RUNNER" "$M_MODEL" "$M_PROMPT" "$M_STEPS" "$M_TEMP" "$M_DOC" "$M_OUTPUT") + + # Instruct models: use --chat with prompt for proper chat template formatting + if [[ "$M_INSTRUCT" == "True" ]]; then + args="-m ${M_MODEL} --chat -s ${M_STEPS} -k ${M_TEMP} \"${M_PROMPT}\"" + else + args=$(runner_args "$M_RUNNER" "$M_MODEL" "$M_PROMPT" "$M_STEPS" "$M_TEMP" "$M_DOC" "$M_OUTPUT") + fi start_ts=$(python3 -c 'import time; print(time.time())') output_file=$(mktemp) From 2df13cdc00fc6a96f99382a3b58f85be466b030f Mon Sep 17 00:00:00 2001 From: Michal Harakal Date: Sat, 11 Apr 2026 21:03:15 +0200 Subject: [PATCH 8/9] fix: replace JVM-only Math.PI with kotlin.math.PI in VoxtralFlowMatching Math.PI is not available on non-JVM targets (iOS, JS, WASM). Use kotlin.math.PI which is multiplatform. Co-Authored-By: Claude Opus 4.6 (1M context) --- .../kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt index d4e6cd8..ee66112 100644 --- a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt +++ b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt @@ -200,7 +200,7 @@ public class VoxtralFlowMatching( val u1 = random.nextFloat().coerceIn(1e-7f, 1.0f) val u2 = random.nextFloat() val mag = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) - val angle = (2.0 * Math.PI * u2).toFloat() + val angle = (2.0 * kotlin.math.PI * u2).toFloat() values[i] = mag * kotlin.math.cos(angle) values[i + 1] = mag * kotlin.math.sin(angle) i += 2 @@ -208,7 +208,7 @@ public class VoxtralFlowMatching( if (i < n) { val u1 = random.nextFloat().coerceIn(1e-7f, 1.0f) val u2 = 
random.nextFloat() - values[i] = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) * kotlin.math.cos((2.0 * Math.PI * u2).toFloat()) + values[i] = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) * kotlin.math.cos((2.0 * kotlin.math.PI * u2).toFloat()) } @Suppress("UNCHECKED_CAST") val result = ctx.fromFloatArray(shape, dtype, values) From b1c5457fe8db65cc7f8bd0f8adfd8d5feeedaac4 Mon Sep 17 00:00:00 2001 From: Michal Harakal Date: Sat, 11 Apr 2026 21:09:17 +0200 Subject: [PATCH 9/9] fix: fix Antora Docker image and playbook for local builds - Install npm packages to /opt/antora (not /antora which gets volume-mounted over) - Set NODE_PATH=/opt/antora/node_modules so Antora finds extensions - Use /opt/antora/node_modules/.bin/antora as entrypoint (not npx) - Fix playbook content source url to /antora (git repo root) Verified locally: 19 HTML pages, 12 Mermaid SVG diagrams rendered via Chromium inside the container. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/.docker/Dockerfile | 20 +++++++++++--------- docs/antora-playbook.yml | 2 +- 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/docs/.docker/Dockerfile b/docs/.docker/Dockerfile index 0d496ff..67c21ba 100644 --- a/docs/.docker/Dockerfile +++ b/docs/.docker/Dockerfile @@ -10,26 +10,28 @@ RUN apk add --no-cache chromium font-noto ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \ PUPPETEER_SKIP_DOWNLOAD=true -WORKDIR /antora - -# Install Antora + extensions + mermaid-cli in one layer -RUN npm i --save-exact \ +# Install Antora + extensions to /opt/antora (not /antora which gets volume-mounted) +WORKDIR /opt/antora +RUN npm init -y && npm i --save-exact \ @antora/cli@3.1 \ @antora/site-generator@3.1 \ asciidoctor-kroki@0.18 \ @mermaid-js/mermaid-cli@11 \ && npm cache clean --force -# Mermaid-cli config: use installed Chromium, no sandbox (container) +# Make installed modules visible when workdir is the mounted project +ENV NODE_PATH=/opt/antora/node_modules + +# Mermaid-cli config RUN echo '{ \ 
"executablePath": "/usr/bin/chromium-browser", \ "args": ["--no-sandbox", "--disable-gpu", "--disable-dev-shm-usage"] \ -}' > /antora/puppeteer-config.json +}' > /opt/antora/puppeteer-config.json -# Pre-generate a simple diagram to warm up and verify the stack works +# Verify mermaid works RUN echo 'graph TD; A-->B;' > /tmp/test.mmd \ - && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /antora/puppeteer-config.json \ + && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /opt/antora/puppeteer-config.json \ && rm /tmp/test.mmd /tmp/test.svg -ENTRYPOINT ["npx", "antora"] +ENTRYPOINT ["/opt/antora/node_modules/.bin/antora"] CMD ["--stacktrace", "antora-playbook.yml"] diff --git a/docs/antora-playbook.yml b/docs/antora-playbook.yml index b07afab..a21a2df 100644 --- a/docs/antora-playbook.yml +++ b/docs/antora-playbook.yml @@ -4,7 +4,7 @@ site: content: sources: - - url: . + - url: /antora start_path: docs branches: HEAD