diff --git a/ISSUE-skainet-8b-oom.md b/ISSUE-skainet-8b-oom.md
new file mode 100644
index 0000000..6977ec9
--- /dev/null
+++ b/ISSUE-skainet-8b-oom.md
@@ -0,0 +1,113 @@
+# Issue: Qwen3-8B OOM on 48GB Mac
+
+## Problem
+
+Running Qwen3-8B-Q4_K_M.gguf (4.7GB on disk) on a 48GB Mac fails with OOM during weight loading, both via kllama and the unified skainet CLI.
+
+## Root Cause
+
+The current loading path uses `DEQUANTIZE_TO_FP32`, which expands the nominally 4-bit Q4 weights ~8× into 32-bit floats:
+
+| Component | Size |
+|--------------------------|-----------|
+| Quantized weights (disk) | 4.7 GB |
+| Dequantized FP32 weights | ~37-40 GB |
+| KV cache (2048 context)  | ~576 MB   |
+| Embeddings, norms | ~1 GB |
+| JVM + tokenizer | ~2 GB |
+| **Total** | **~41 GB** |
+
+That ~41 GB barely fits in 48 GB, and the JVM needs additional headroom for temporary buffers during dequantization, so loading OOMs.
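
The expansion factor in the table can be sanity-checked with a one-liner. This is an upper-bound sketch: Q4_K_M averages slightly more than 4 bits per parameter, so the true FP32 size is a little below `diskGb * 8`.

```kotlin
// Nominal expansion when DEQUANTIZE_TO_FP32 inflates 4-bit codes to 32-bit floats.
// Upper-bound estimate: real Q4_K_M carries scale/min metadata per block.
fun fp32WeightsGb(diskGb: Double): Double = diskGb * (32.0 / 4.0)

fun main() {
    println("FP32 weights: ${fp32WeightsGb(4.7)} GB")  // 37.6, matching the table row
}
```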
+
+## What Already Exists in the Codebase
+
+### 1. NATIVE_OPTIMIZED quant policy (best option)
+
+`QuantPolicy.NATIVE_OPTIMIZED` keeps weights in quantized form and uses SIMD-accelerated matmul kernels. `MemSegWeightConverter` converts raw Q4/Q8 bytes to 64-byte-aligned MemorySegment-backed tensors for Vector API dispatch.
+
+- Memory: ~5GB for the 8B model (vs 40GB with FP32)
+- Speed: 1-3 tok/s (proven on Qwen2/3 via kqwen runner)
+- Already works for Qwen2/3 in kllama Main.kt (the `isQwen` path)
+
+**Why it doesn't work today for the 8B:** The kllama `isQwen` path loads with `NATIVE_OPTIMIZED` but then creates `LlamaRuntime` which still transposes weight matrices to FP32 during init (`LlamaRuntime.kt:74`). This transpose step allocates FP32 copies.
+
+### 2. Lazy per-layer dequantization (Apertus pattern)
+
+`ApertusQuantizedRuntime` keeps weights quantized and dequantizes one projection at a time during `runLayer()`:
+
+```
+Resident: ~3.5GB (quantized) + ~100MB (norms/embeddings)
+Per-layer temp: ~50MB (one projection, discarded after matmul)
+```
+
+This matches llama.cpp's resident-memory profile (llama.cpp also keeps weights quantized, though it runs quantized matmul kernels rather than dequantizing per layer). Not yet available for the LLaMA/Qwen runtimes.
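
A minimal sketch of the pattern, using a toy single-scale int8 scheme for illustration (`QuantizedTensor` here is a stand-in, not the real Apertus type):

```kotlin
// Toy sketch of lazy per-projection dequantization. The real runtime stores GGUF
// block formats; this uses one scale per tensor purely for illustration.
class QuantizedTensor(val codes: ByteArray, val scale: Float, val rows: Int, val cols: Int)

fun dequantize(w: QuantizedTensor): FloatArray =
    FloatArray(w.rows * w.cols) { i -> w.codes[i] * w.scale }

// y = W @ x with W in [out, in] layout; the FP32 copy lives only for this call.
fun lazyProject(x: FloatArray, w: QuantizedTensor): FloatArray {
    val dense = dequantize(w)          // one projection's temporary (~50MB for 8B)
    val y = FloatArray(w.rows)
    for (r in 0 until w.rows)
        for (c in 0 until w.cols)
            y[r] += dense[r * w.cols + c] * x[c]
    return y                           // dense becomes unreachable here and is GC'd
}
```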
+
+### 3. Memory-mapped loading (F32 only)
+
+`MmapLlamaLoader` maps the GGUF file via `MappedByteBuffer` for zero-copy tensor access. Only works for F32 models — Q4 models need dequantization which defeats the zero-copy benefit.
+
+## Proposed Solutions (ordered by effort)
+
+### Solution A: Fix NATIVE_OPTIMIZED path for 8B models (small effort)
+
+The kllama Main.kt Qwen path already loads with `NATIVE_OPTIMIZED`. The problem is that the `LlamaRuntime` constructor transposes the weights to FP32. Fix:
+
+1. Skip transpose for quantized tensors in `LlamaRuntime` init
+2. Or use `OptimizedLLMRuntime` which doesn't transpose (the DSL path)
+3. Ensure SIMD matmul kernels handle Q4_K_M format (Q4_K dispatch exists in `MemSegWeightConverter`)
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at 1-3 tok/s.
+
+**Files to change:**
+- `llm-inference/llama/.../LlamaRuntime.kt` -- skip transpose for quantized MemSeg tensors
+- Or migrate Qwen path in `kllama/cli/Main.kt` to `OptimizedLLMRuntime` + `llamaNetwork()`
+
+### Solution B: Port lazy dequant from Apertus to LLaMA (medium effort)
+
+Port the `ApertusQuantizedRuntime` pattern to a `LlamaQuantizedRuntime`:
+
+1. Store projections as `QuantizedTensor` (quantized bytes + metadata)
+2. In `runLayer()`, dequantize one weight matrix at a time, matmul, discard
+3. Keep embeddings and norms as FP32 (small, need element access)
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at ~1 tok/s (dequant overhead per layer).
+
+**Files to create:**
+- `llm-inference/llama/.../LlamaQuantizedRuntime.kt` (new, based on Apertus pattern)
+- `llm-runtime/kllama/.../LlamaQuantizedWeights.kt` (new, mixed storage)
+
+### Solution C: SIMD-native matmul without dequantization (larger effort, best perf)
+
+The SIMD backend (`skainet-backend-cpu`) already has Q4/Q8 matmul kernels via Vector API. The issue is the runtime layer doesn't use them directly. Changes needed in skainet core:
+
+1. `skainet-backend-cpu`: Ensure `matmul(FP32, Q4_K)` kernel exists and dispatches correctly
+2. `LlamaRuntime` or `OptimizedLLMRuntime`: Accept mixed-precision weight tensors (Q4 weights, FP32 activations)
+3. Skip the `MemSegWeightConverter` step entirely — use raw quantized MemorySegments
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at 2-5 tok/s (no dequant overhead).
+
+**Files to change (in skainet core):**
+- `skainet-backend-cpu`: Q4_K matmul kernel (may already exist)
+- `skainet-lang-core`: Mixed-precision tensor support in matmul dispatch
+
+### Solution D: Memory-mapped quantized tensors (largest effort)
+
+Extend `MmapLlamaLoader` to support quantized formats:
+
+1. Map the GGUF file to virtual memory
+2. Create quantized tensor views that reference mmap regions
+3. Dequantize on-the-fly during matmul (like lazy dequant but zero-copy from disk)
+
+**Expected result:** Load time near-zero, ~5GB virtual (OS manages paging).
+
+**Files to change:**
+- `llm-inference/llama/.../MmapLlamaLoader.kt` -- extend to Q4/Q8 formats
+- Requires `skainet-io-core` changes for mmap quantized tensor views
+
+## Recommended Path
+
+**Start with Solution A** — it's the smallest change and uses code that already works for Qwen2/3. The `NATIVE_OPTIMIZED` + `MemSegWeightConverter` path is proven; the only blocker is `LlamaRuntime`'s constructor transposing weights to FP32.
+
+If that's not enough, **add Solution B** (lazy dequant) which gives the most control over memory at a known performance cost.
+
+Solution C is the long-term goal (best performance) but requires skainet core changes.
diff --git a/docs/.docker/Dockerfile b/docs/.docker/Dockerfile
index 0d496ff..67c21ba 100644
--- a/docs/.docker/Dockerfile
+++ b/docs/.docker/Dockerfile
@@ -10,26 +10,28 @@ RUN apk add --no-cache chromium font-noto
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
PUPPETEER_SKIP_DOWNLOAD=true
-WORKDIR /antora
-
-# Install Antora + extensions + mermaid-cli in one layer
-RUN npm i --save-exact \
+# Install Antora + extensions to /opt/antora (not /antora which gets volume-mounted)
+WORKDIR /opt/antora
+RUN npm init -y && npm i --save-exact \
@antora/cli@3.1 \
@antora/site-generator@3.1 \
asciidoctor-kroki@0.18 \
@mermaid-js/mermaid-cli@11 \
&& npm cache clean --force
-# Mermaid-cli config: use installed Chromium, no sandbox (container)
+# Make installed modules visible when workdir is the mounted project
+ENV NODE_PATH=/opt/antora/node_modules
+
+# Mermaid-cli config
RUN echo '{ \
"executablePath": "/usr/bin/chromium-browser", \
"args": ["--no-sandbox", "--disable-gpu", "--disable-dev-shm-usage"] \
-}' > /antora/puppeteer-config.json
+}' > /opt/antora/puppeteer-config.json
-# Pre-generate a simple diagram to warm up and verify the stack works
+# Verify mermaid works
RUN echo 'graph TD; A-->B;' > /tmp/test.mmd \
- && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /antora/puppeteer-config.json \
+ && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /opt/antora/puppeteer-config.json \
&& rm /tmp/test.mmd /tmp/test.svg
-ENTRYPOINT ["npx", "antora"]
+ENTRYPOINT ["/opt/antora/node_modules/.bin/antora"]
CMD ["--stacktrace", "antora-playbook.yml"]
diff --git a/docs/antora-playbook.yml b/docs/antora-playbook.yml
index b07afab..a21a2df 100644
--- a/docs/antora-playbook.yml
+++ b/docs/antora-playbook.yml
@@ -4,7 +4,7 @@ site:
content:
sources:
- - url: .
+ - url: /antora
start_path: docs
branches: HEAD
diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 5bc1fc9..895b219 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -23,3 +23,4 @@
* xref:explanation/pipeline-design.adoc[Pipeline Design Decisions]
* xref:explanation/dsl-vs-handcoded.adoc[DSL Networks vs Hand-Coded Runtimes]
* xref:explanation/tokenizer-internals.adoc[Tokenizer Internals]
+* xref:explanation/weight-quantization.adoc[Weight Quantization and Numeric Representation]
diff --git a/docs/modules/ROOT/pages/explanation/weight-quantization.adoc b/docs/modules/ROOT/pages/explanation/weight-quantization.adoc
new file mode 100644
index 0000000..6974baf
--- /dev/null
+++ b/docs/modules/ROOT/pages/explanation/weight-quantization.adoc
@@ -0,0 +1,332 @@
+= Weight Quantization and Numeric Representation
+:description: Deep technical guide to how model weights flow through quantization, dequantization, transpose, and SIMD kernel dispatch.
+
+== Overview
+
+A model weight tensor goes through several numeric representation changes between the GGUF file on disk and the final matmul during inference.
+Understanding each stage is essential for debugging memory issues, correctness problems, and performance optimization.
+
+[mermaid]
+....
+graph TD
+ A["GGUF File
(Q4_K_M: 4.7 GB)"] -->|StreamingGGUFReader| B["Raw Bytes
IntArrayTensorData"]
+ B -->|MemSegWeightConverter| C{Quant Type?}
+ C -->|Q4_0| D["Q4MemorySegmentTensorData
64-byte aligned, Arena-managed"]
+ C -->|Q8_0| E["Q8MemorySegmentTensorData
64-byte aligned, Arena-managed"]
+ C -->|Q4_K / Q6_K| F["DequantOps.dequantFromBytes()
→ FloatArray"]
+ C -->|FP32| G["tensor.t()
MemorySegmentTensorData"]
+ F -->|"Array transpose
[out,in] → [in,out]"| H["FloatArrayTensorData
Pre-transposed"]
+ D --> I["SIMD Matmul
QuantizedMatmul.matmulQ4_0()"]
+ E --> J["SIMD Matmul
QuantizedMatmul.matmulQ8_0()"]
+ H --> K["Scalar Matmul
DefaultCpuOps.matmul()"]
+ G --> K
+
+ style A fill:#f9f,stroke:#333
+ style I fill:#9f9,stroke:#333
+ style J fill:#9f9,stroke:#333
+ style K fill:#ff9,stroke:#333
+....
+
+== Stage 1: GGUF File on Disk
+
+GGUF stores each weight tensor as a contiguous byte region with a header describing its quantization type, shape, and byte offset.
+
+=== Quantization Types in Q4_K_M Format
+
+The `Q4_K_M` quantization scheme uses a **mixed-precision strategy**:
+
+[cols="1,2,2,1"]
+|===
+|Type |Used For |Block Format |Bits/Param
+
+|Q4_K
+|Large projections (wq, wk, wv, wo, ffn_gate, ffn_up, ffn_down) in ~50% of layers
+|144 bytes per 256 elements: 1×f16 super-scale + 1×f16 super-min + 12 bytes of packed 6-bit sub-block scales + 128 bytes of nibble codes
+|~4.5
+
+|Q6_K
+|Same projections in the other ~50% of layers, plus output weight
+|210 bytes per 256 elements: 128 bytes of low nibbles + 64 bytes of high bits + 16 int8 sub-block scales + 1×f16 scale; higher precision for critical layers
+|~6.5
+
+|Q8_0
+|Not used in Q4_K_M (used in Q8_0 format models)
+|34 bytes per 32 elements: 1×f16 scale + 32×int8 codes
+|~8.5
+
+|FP32
+|Norms (attn_norm, ffn_norm, output_norm) — 1D tensors
+|4 bytes per element
+|32
+|===
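
The bits-per-param column follows directly from the block sizes in the table:

```kotlin
// Derive bits/param from block byte size (values from the table above).
fun bitsPerParam(blockBytes: Int, elemsPerBlock: Int): Double =
    blockBytes * 8.0 / elemsPerBlock

fun main() {
    println(bitsPerParam(144, 256)) // Q4_K: 4.5
    println(bitsPerParam(210, 256)) // Q6_K: 6.5625, rounded to ~6.5 above
    println(bitsPerParam(34, 32))   // Q8_0: 8.5
}
```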
+
+=== Tensor Layout in GGUF
+
+All 2D weight tensors are stored in **row-major [out_dim, in_dim]** order:
+
+----
+wq: Shape(dim, dim) = [4096, 4096] "4096 output neurons, each with 4096 input weights"
+wk: Shape(kvDim, dim) = [1024, 4096] "1024 KV outputs (8 heads × 128 head_dim)"
+ffn_gate: Shape(ffnDim, dim) = [14336, 4096] "14336 FFN hidden units"
+ffn_down: Shape(dim, ffnDim) = [4096, 14336] "project back to model dim"
+----
+
+The matmul convention `y = x @ W^T` requires weights in `[in_dim, out_dim]` form, so a transpose is needed before or during the matmul.
+
+== Stage 2: Loading Raw Bytes
+
+`LlamaWeightLoader.loadToMapStreaming()` reads the GGUF file via `StreamingGGUFReader`:
+
+[source,kotlin]
+----
+// QuantPolicy.NATIVE_OPTIMIZED: store as raw Int8 bytes
+val tensor = streamingTensorToTensor(reader, tensorInfo, ctx)
+// tensor.data is IntArrayTensorData containing the raw quantized bytes
+----
+
+At this stage, the tensor holds the original GGUF bytes unchanged.
+A `quantTypes` map records each tensor's quantization type for later processing.
+
+.Memory at Stage 2
+----
+Qwen3-8B-Q4_K_M: ~4.7 GB (raw bytes, same as file size)
+----
+
+== Stage 3: MemSegWeightConverter
+
+`MemSegWeightConverter.convert()` transforms raw bytes into runtime-ready tensors.
+This is where the numeric representation diverges by quantization type.
+
+=== Path A: Q4_0 → Q4MemorySegmentTensorData
+
+[source,kotlin]
+----
+Q4MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
+----
+
+* Copies raw bytes into a 64-byte-aligned `MemorySegment` (Arena-managed, off-heap)
+* The data stays in Q4_0 block format (no dequantization)
+* The `MemorySegment` alignment enables SIMD vector loads
+
+.Memory: same as raw bytes (~4.5 bits/param)
+
+=== Path B: Q8_0 → Q8MemorySegmentTensorData
+
+Same as Q4_0 but with Q8_0 block format (8 bits per code + f16 scale per 32 elements).
+
+.Memory: ~8.5 bits/param
+
+=== Path C: Q4_K / Q5_K / Q6_K → FP32 + Pre-Transpose
+
+[source,kotlin]
+----
+// 1. Dequantize to float array
+val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols)
+
+// 2. Pre-transpose from [out, in] to [in, out]
+val transposed = FloatArray(rows * cols)
+for (r in 0 until rows) {
+ for (c in 0 until cols) {
+ transposed[c * rows + r] = floats[r * cols + c]
+ }
+}
+
+// 3. Store as heap-based FloatArrayTensorData
+return ctx.fromFloatArray(Shape(cols, rows), FP32::class, transposed)
+----
+
+**Why dequantize?** This path needs weights in pre-transposed `[in, out]` layout, and K-quant blocks cannot be transposed without a dequantize/re-quantize round trip (see "Why Q4_K Cannot Be Trivially Transposed" below), so K-quant tensors are expanded to FP32 at load time.
+
+**Why pre-transpose?** The `.t()` operation on tensors allocates a new `MemorySegmentTensorData` in direct buffer memory.
+The JVM's direct buffer allocator does not reclaim memory eagerly, causing OOM on memory-constrained machines (48GB).
+Pre-transposing during loading avoids all runtime `.t()` calls.
+
+.Memory per K-quant tensor
+----
+Original Q4_K: ~4.5 bits/param
+After dequant: 32 bits/param (8× expansion)
+Temporary: 2× (original float array + transposed array, then original is GC'd)
+----
+
+.Total memory for Qwen3-8B-Q4_K_M after Stage 3
+----
+Q4_K tensors (dequantized + transposed): ~15 GB
+Q6_K tensors (dequantized + transposed): ~12 GB
+Token embedding (dequantized, not transposed): ~2.4 GB
+Norms (FP32, 1D, tiny): ~0.01 GB
+Total: ~30 GB
+----
+
+=== Path D: FP32 → Pre-Transpose
+
+[source,kotlin]
+----
+return tensor.t() // one-time transpose during loading
+----
+
+Norms are 1D so `.t()` is a no-op. For FP32 projection weights (rare), a standard transpose copies to direct memory once.
+
+=== Special Case: Token Embedding
+
+[source,kotlin]
+----
+tokenEmbedding = maybeDequantize(weights.tokenEmbedding, ...)
+----
+
+Token embeddings are always dequantized to FP32 and **not transposed** because `Embedding.forward()` does row gather (lookup by token ID), not matmul.
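
A row gather is just an indexed copy, which is why the `[vocab, dim]` layout needs no transpose. Illustrative sketch only, not the actual `Embedding.forward()` code:

```kotlin
// Embedding lookup as a row gather: copy one [dim]-sized row per token ID.
fun gatherRows(table: FloatArray, dim: Int, tokenIds: IntArray): FloatArray {
    val out = FloatArray(tokenIds.size * dim)
    for ((i, id) in tokenIds.withIndex()) {
        // row `id` starts at offset id * dim in the row-major [vocab, dim] table
        System.arraycopy(table, id * dim, out, i * dim, dim)
    }
    return out
}
```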
+
+== Stage 4: LlamaRuntime.linearProject()
+
+During inference, each projection uses `linearProject()`:
+
+[source,kotlin]
+----
+private fun linearProject(x: Tensor, w: Tensor): Tensor {
+ val xCols = if (x.shape.rank >= 2) x.shape[x.shape.rank - 1] else x.shape[0]
+ val wRows = w.shape[0]
+ return if (wRows == xCols) {
+ x.matmul(w) // weight is [in, out] — pre-transposed
+ } else {
+ x.matmul(w.t()) // weight is [out, in] — legacy path (tests)
+ }
+}
+----
+
+The shape check auto-detects the weight layout:
+
+* **Pre-transposed** `[in, out]`: `wRows == xCols` → direct `matmul`, no allocation
+* **Original** `[out, in]`: `wRows != xCols` → `.t()` then `matmul` (legacy/test path)
+
+== Stage 5: Matmul Kernel Dispatch
+
+The `Tensor.matmul()` extension dispatches based on the runtime `TensorData` type:
+
+[cols="1,2,2"]
+|===
+|TensorData Type |Kernel |Implementation
+
+|`Q4MemorySegmentTensorData`
+|`QuantizedMatmul.matmulQ4_0()`
+|SIMD (Vector API): processes 32 Q4 values per vector lane
+
+|`Q8MemorySegmentTensorData`
+|`QuantizedMatmul.matmulQ8_0()`
+|SIMD (Vector API): dot product of int8 codes × float scale
+
+|`Q4_KBlockTensorData`
+|`QuantizedMatmul.matmulQ4_K()`
+|SIMD: unpacks K-quant blocks with dual scales + min values
+
+|`FloatArrayTensorData`
+|`DefaultCpuOps.matmul()`
+|Scalar FP32 double loop (no SIMD)
+
+|`MemorySegmentTensorData`
+|`DefaultCpuOpsJvm.matmul()`
+|SIMD FP32 via Vector API
+|===
+
+=== SIMD Q4_0 Matmul (Simplified)
+
+[source]
+----
+For each output row:
+ For each block of 32 input elements:
+ Load 16 bytes of Q4 codes from MemorySegment (128 bits)
+ Unpack low/high nibbles into two int8 vectors (256 bits each)
+ Subtract zero-point (8)
+ Convert to float vectors
+ Multiply by block scale (f16 → f32)
+ FMA with input vector → accumulate into output
+----
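
A scalar reference version of the same loop, illustrating only the block format: the real kernel reads f16 scales from a `MemorySegment` and uses the Vector API, whereas here the scale is a plain `Float`. The nibble layout follows llama.cpp's Q4_0 convention (low nibbles hold elements 0..15, high nibbles elements 16..31).

```kotlin
// One Q4_0 block: 32 values packed as 16 nibble-pair bytes plus one scale.
class Q4Block(val scale: Float, val codes: ByteArray /* 16 bytes */)

// Dot product of one block against 32 input floats, zero-point 8.
fun dotQ4(block: Q4Block, x: FloatArray, xOffset: Int): Float {
    var acc = 0.0f
    for (i in 0 until 16) {
        val b = block.codes[i].toInt() and 0xFF
        val lo = (b and 0x0F) - 8       // low nibble -> element i
        val hi = (b ushr 4) - 8         // high nibble -> element i + 16
        acc += lo * x[xOffset + i]
        acc += hi * x[xOffset + i + 16]
    }
    return acc * block.scale            // scale applied once per 32-element block
}
```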
+
+=== Why Q4_K Cannot Be Trivially Transposed
+
+Q4_K blocks encode 256 elements with a complex internal structure:
+
+----
+Block (144 bytes):
+ [0..1] d (f16) — primary scale
+ [2..3] dmin (f16) — minimum offset
+ [4..15] scales (12 bytes) — per-subblock scales (6-bit packed)
+ [16..143] qs (128 bytes) — quantized codes (4-bit packed, 256 values)
+----
+
+The 256 values in each block correspond to **256 contiguous elements in the original row**.
+Transposing the matrix would scatter these elements across different rows, breaking the block structure.
+A proper Q4_K transpose would require:
+
+1. Dequantize all blocks → FP32
+2. Transpose the FP32 matrix
+3. Re-quantize into new Q4_K blocks
+
+This is why `MemSegWeightConverter` currently dequantizes K-quant types to FP32 rather than keeping them quantized.
+
+== Memory Budget: Qwen3-8B-Q4_K_M on 48GB Mac
+
+[cols="2,1,3"]
+|===
+|Component |Size |Notes
+
+|K-quant weights (FP32 pre-transposed)
+|~27 GB
+|Q4_K + Q6_K dequantized, no runtime `.t()` copies
+
+|Token embedding (FP32)
+|2.4 GB
+|151936 × 4096 × 4 bytes
+
+|Norms (FP32)
+|~10 MB
+|1D tensors, negligible
+
+|KV cache (context=512)
+|~144 MB
+|2 × 36 layers × 512 × 1024 × 4 bytes
+
+|JVM + tokenizer
+|~1 GB
+|Heap overhead, vocab structures
+
+|**Total**
+|**~31 GB**
+|Fits in 48 GB with OS headroom
+|===
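
The KV-cache entry can be recomputed from the formula in its note (assuming an FP32 cache, as the 4-bytes-per-element factor implies):

```kotlin
// KV cache bytes: one K and one V tensor per layer, [context, kvDim] each, FP32.
fun kvCacheBytes(layers: Int, ctx: Int, kvDim: Int, bytesPerElem: Int): Long =
    2L * layers * ctx * kvDim * bytesPerElem

fun main() {
    val mib = kvCacheBytes(36, 512, 1024, 4) / (1024.0 * 1024.0)
    println("KV cache: $mib MiB")  // 144.0 MiB
}
```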
+
+== Performance Characteristics
+
+[cols="2,1,1,2"]
+|===
+|Path |Bits/Param |Memory |Speed (8B, M-series CPU)
+
+|Q4_K SIMD (future)
+|4.5
+|~5 GB
+|~1-3 tok/s (projected)
+
+|Q8_0 SIMD
+|8.5
+|~9 GB
+|~1-2 tok/s
+
+|FP32 pre-transposed (current)
+|32
+|~30 GB
+|~0.002 tok/s (scalar)
+
+|FP32 + runtime .t() (old, OOM)
+|32 + 32 (copy)
+|~60 GB
+|OOM on 48GB
+|===
+
+== Future: Block-Aware Q4_K Transpose
+
+To use the Q4_K SIMD kernel with GGUF weights, the skainet core library would need:
+
+1. **`Q4_KBlockTensorData.transpose()`** — dequantize → rearrange → re-quantize at the block level
+2. Or **`QuantizedMatmul.matmulQ4_K_transposed()`** — a kernel variant that reads blocks in column-major order
+3. Or **GGUF pre-transposed storage** — store weights as `[in, out]` in the GGUF file during quantization
+
+Option 2 is the most practical: modify the SIMD kernel to iterate over columns instead of rows when reading Q4_K blocks.
+This would reduce memory from ~30GB to ~5GB for the 8B model.
diff --git a/llm-apps/skainet-cli/build.gradle.kts b/llm-apps/skainet-cli/build.gradle.kts
index 7a999be..608c38c 100644
--- a/llm-apps/skainet-cli/build.gradle.kts
+++ b/llm-apps/skainet-cli/build.gradle.kts
@@ -50,5 +50,5 @@ tasks.withType().configureEach {
}
tasks.withType().configureEach {
- jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx48g", "-XX:MaxDirectMemorySize=64g")
+ jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx42g", "-XX:MaxDirectMemorySize=42g")
}
diff --git a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
index e461142..468054e 100644
--- a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
+++ b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
@@ -54,29 +54,30 @@ public class LlamaRuntime(
const val DEFAULT_BOS_TOKEN: Int = 1
}
- /** Pre-transposed weight tensors per layer — avoids re-creating lazy transpose wrappers every forward pass. */
- private class TransposedLayerWeights(
- val wqT: Tensor,
- val wkT: Tensor,
- val wvT: Tensor,
- val woT: Tensor,
- val ffnGateT: Tensor,
- val ffnDownT: Tensor,
- val ffnUpT: Tensor,
- )
+ // NOTE: weights are no longer pre-transposed eagerly at init. Dropping the
+ // eager FP32 transpose copies halves peak memory (~31GB saved for 8B models);
+ // weights still stored [out, in] are transposed on the fly in linearProject(),
+ // at the cost of per-token transpose allocations that the GC reclaims.
+ // Q4_0/Q8_0 weights skip transpose entirely: their SIMD matmul kernels
+ // handle the [out, in] layout directly.
- private val transposedLayers: List<TransposedLayerWeights> = weights.layers.map { layer ->
- TransposedLayerWeights(
- wqT = layer.wq.t(),
- wkT = layer.wk.t(),
- wvT = layer.wv.t(),
- woT = layer.wo.t(),
- ffnGateT = layer.ffnGate.t(),
- ffnDownT = layer.ffnDown.t(),
- ffnUpT = layer.ffnUp.t(),
- )
+ /**
+ * Linear projection: y = x @ W.
+ *
+ * When weights are pre-transposed to [in, out] by MemSegWeightConverter
+ * (Q4_K, Q6_K, FP32 via NATIVE_OPTIMIZED), uses direct matmul.
+ * Otherwise falls back to .t() for non-converted weights (tests, DEQUANTIZE_TO_FP32).
+ */
+ private fun linearProject(x: Tensor, w: Tensor): Tensor {
+ val xCols = if (x.shape.rank >= 2) x.shape[x.shape.rank - 1] else x.shape[0]
+ val wRows = w.shape[0]
+ return if (wRows == xCols) {
+ // Weight is [in, out] — already transposed, direct matmul
+ x.matmul(w)
+ } else {
+ // Weight is [out, in] — needs transpose (legacy path)
+ x.matmul(w.t())
+ }
}
- private val outputWeightT: Tensor = weights.outputWeight.t()
// ---- DecoderRuntime abstract properties ----
override val dim: Int = weights.metadata.embeddingLength
@@ -131,7 +132,6 @@ public class LlamaRuntime(
embedding.forward(intArrayOf(tokenId), ctx)
override fun runLayer(layerIdx: Int, x: Tensor): Tensor {
- val tl = transposedLayers[layerIdx]
val layer = weights.layers[layerIdx]
// QKV: try compiled graph first, fall back to individual ops
@@ -140,9 +140,9 @@ public class LlamaRuntime(
} ?: run {
val attnNorm = attnNorms[layerIdx].forward(x, ctx)
Triple(
- attnNorm.matmul(tl.wqT),
- attnNorm.matmul(tl.wkT),
- attnNorm.matmul(tl.wvT)
+ linearProject(attnNorm, layer.wq),
+ linearProject(attnNorm, layer.wk),
+ linearProject(attnNorm, layer.wv)
)
}
@@ -156,14 +156,14 @@ public class LlamaRuntime(
val attnOut = attentionBackend.attention(q, k, v, layerIdx, position)
// Output projection + residual
- val afterAttn = x + attnOut.matmul(tl.woT)
+ val afterAttn = x + linearProject(attnOut, layer.wo)
// FFN: try compiled graph first, fall back to individual ops
return graphAccelerator?.runFFN(layerIdx, afterAttn) ?: run {
val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
- val gate = ffnNorm.matmul(tl.ffnGateT).silu()
- val up = ffnNorm.matmul(tl.ffnUpT)
- val ffnOut = (gate * up).matmul(tl.ffnDownT)
+ val gate = linearProject(ffnNorm, layer.ffnGate).silu()
+ val up = linearProject(ffnNorm, layer.ffnUp)
+ val ffnOut = linearProject(gate * up, layer.ffnDown)
afterAttn + ffnOut
}
}
@@ -172,7 +172,7 @@ public class LlamaRuntime(
outputNormLayer.forward(x, ctx)
override fun outputProject(x: Tensor): Tensor =
- x.matmul(outputWeightT)
+ linearProject(x, weights.outputWeight)
override fun resetState() {
attentionBackend.reset()
@@ -287,11 +287,10 @@ public class LlamaRuntime(
var x = embedding.forward(tokenIds, ctx)
weights.layers.forEachIndexed { layerIdx, layer ->
- val tl = transposedLayers[layerIdx]
val attnNorm = attnNorms[layerIdx].forward(x, ctx)
- var q = attnNorm.matmul(tl.wqT)
- var k = attnNorm.matmul(tl.wkT)
- val v = attnNorm.matmul(tl.wvT)
+ var q = linearProject(attnNorm, layer.wq)
+ var k = linearProject(attnNorm, layer.wk)
+ val v = linearProject(attnNorm, layer.wv)
if (hasQKNorm) {
q = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm!!)
@@ -299,19 +298,19 @@ public class LlamaRuntime(
}
val attnOut = attentionBackend.batchAttention(q, k, v, layerIdx, startPos)
- ?: return batchForwardFallback(tokenIds, startPos) // shouldn't happen but be safe
+ ?: return batchForwardFallback(tokenIds, startPos)
- val afterAttn = x + attnOut.matmul(tl.woT)
+ val afterAttn = x + linearProject(attnOut, layer.wo)
val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
- val gate = ffnNorm.matmul(tl.ffnGateT).silu()
- val up = ffnNorm.matmul(tl.ffnUpT)
- val ffnOut = (gate * up).matmul(tl.ffnDownT)
+ val gate = linearProject(ffnNorm, layer.ffnGate).silu()
+ val up = linearProject(ffnNorm, layer.ffnUp)
+ val ffnOut = linearProject(gate * up, layer.ffnDown)
x = afterAttn + ffnOut
}
val norm = outputNormLayer.forward(x, ctx)
- val logits = norm.matmul(outputWeightT)
+ val logits = linearProject(norm, weights.outputWeight)
position = startPos + tokenIds.size
return logits
}
diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index 2e24d4f..8146746 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -8,6 +8,7 @@ import sk.ainet.models.llama.LlamaRuntimeWeights
import sk.ainet.models.llama.LlamaTensorNames
import sk.ainet.lang.tensor.Shape
import sk.ainet.lang.tensor.Tensor
+import sk.ainet.lang.tensor.t
import sk.ainet.lang.tensor.data.IntArrayTensorData
import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
import sk.ainet.lang.tensor.data.Q8MemorySegmentTensorData
@@ -85,7 +86,11 @@ public object MemSegWeightConverter {
ctx: ExecutionContext,
arena: Arena
): Tensor {
- val quantType = quantTypes[tensorName] ?: return tensor
+ val quantType = quantTypes[tensorName]
+ if (quantType == null) {
+ // FP32 tensor — pre-transpose to [in, out] so no .t() at runtime
+ return tensor.t()
+ }
val bytes = extractBytes(tensor.data)
@@ -97,9 +102,20 @@ public object MemSegWeightConverter {
GGMLQuantizationType.Q4_K,
GGMLQuantizationType.Q5_K,
GGMLQuantizationType.Q6_K -> {
- // Dequantize K-quant types to FP32 (no native SIMD kernel yet)
- val floats = DequantOps.dequantFromBytes(bytes, quantType, logicalShape.volume)
- return ctx.fromFloatArray(logicalShape, FP32::class, floats)
+ // Dequantize K-quant types to FP32 and pre-transpose to [in, out].
+ // Pre-transposing at load time avoids .t() at runtime, which
+ // allocates direct buffers the JVM doesn't GC eagerly (OOM on 48GB).
+ val rows = logicalShape[0]
+ val cols = logicalShape[1]
+ val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols)
+ val transposed = FloatArray(rows * cols)
+ for (r in 0 until rows) {
+ for (c in 0 until cols) {
+ transposed[c * rows + r] = floats[r * cols + c]
+ }
+ }
+ val transposedShape = Shape(cols, rows)
+ return ctx.fromFloatArray(transposedShape, FP32::class, transposed)
}
else -> {
println("WARNING: Unsupported quant type $quantType for MemorySegment conversion of $tensorName, keeping as-is")
diff --git a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt
index d4e6cd8..ee66112 100644
--- a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt
+++ b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt
@@ -200,7 +200,7 @@ public class VoxtralFlowMatching(
val u1 = random.nextFloat().coerceIn(1e-7f, 1.0f)
val u2 = random.nextFloat()
val mag = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1))
- val angle = (2.0 * Math.PI * u2).toFloat()
+ val angle = (2.0 * kotlin.math.PI * u2).toFloat()
values[i] = mag * kotlin.math.cos(angle)
values[i + 1] = mag * kotlin.math.sin(angle)
i += 2
@@ -208,7 +208,7 @@ public class VoxtralFlowMatching(
if (i < n) {
val u1 = random.nextFloat().coerceIn(1e-7f, 1.0f)
val u2 = random.nextFloat()
- values[i] = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) * kotlin.math.cos((2.0 * Math.PI * u2).toFloat())
+ values[i] = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) * kotlin.math.cos((2.0 * kotlin.math.PI * u2).toFloat())
}
@Suppress("UNCHECKED_CAST")
val result = ctx.fromFloatArray(shape, dtype, values)
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt
index ebe4289..12221c5 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt
@@ -29,6 +29,47 @@ public class AgentCli(
private val session = ChatSession(runtime, tokenizer, metadata, templateName)
private val template: ChatTemplate = session.chatTemplate
+ /**
+ * Run a single non-interactive chat round. Used by smoke tests for instruct models.
+ */
+ public fun runChatOnce(
+ userPrompt: String,
+ systemPrompt: String = "You are a helpful assistant. Answer concisely.",
+ maxTokens: Int = 64,
+ temperature: Float = 0.0f
+ ) {
+ val messages = mutableListOf(
+ ChatMessage(role = ChatRole.SYSTEM, content = systemPrompt),
+ ChatMessage(role = ChatRole.USER, content = userPrompt)
+ )
+
+ runtime.reset()
+ val prompt = template.apply(messages, emptyList(), addGenerationPrompt = true)
+ val promptTokens = tokenizer.encode(prompt)
+
+ print("Assistant: ")
+ System.out.flush()
+
+ val startTime = System.nanoTime()
+ val result = runtime.generateUntilStop(
+ prompt = promptTokens,
+ maxTokens = maxTokens,
+ eosTokenId = tokenizer.eosTokenId,
+ temperature = temperature,
+ onToken = { tokenId ->
+ print(tokenizer.decode(tokenId))
+ System.out.flush()
+ },
+ decode = { tokenId -> tokenizer.decode(tokenId) }
+ )
+ val elapsed = (System.nanoTime() - startTime) / 1_000_000.0
+
+ println()
+ println("---")
+ val tokPerSec = if (elapsed > 0) result.tokens.size / elapsed * 1000 else 0.0
+ println("tok/s: $tokPerSec")
+ }
+
/**
* Run interactive chat mode (no tool calling).
*/
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
index a89da0e..ddf31c0 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
@@ -508,7 +508,11 @@ fun main(args: Array<String>) {
}
else -> {
val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata)
- agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+ if (cliArgs.prompt != null) {
+ agentCli.runChatOnce(cliArgs.prompt, maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+ } else {
+ agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+ }
}
}
return@runBlocking
diff --git a/tests/smoke/smoke-models.json b/tests/smoke/smoke-models.json
index 2958517..895b06e 100644
--- a/tests/smoke/smoke-models.json
+++ b/tests/smoke/smoke-models.json
@@ -13,9 +13,11 @@
},
{
"name": "Qwen3-1.7B-Q8",
- "runner": "kqwen",
+ "runner": "kllama",
"model": "Qwen3-1.7B-Q8_0.gguf",
"format": "gguf",
+ "instruct": true,
+ "prompt": "What is the capital of France?",
"toolCalling": {
"prompt": "What is 2 + 2?",
"steps": 256
@@ -26,6 +28,8 @@
"runner": "kllama",
"model": "Qwen3-8B-Q4_K_M.gguf",
"format": "gguf",
+ "instruct": true,
+ "prompt": "What is the capital of France?",
"toolCalling": {
"prompt": "What is 2 + 2?",
"steps": 256
diff --git a/tests/smoke/smoke-test.sh b/tests/smoke/smoke-test.sh
index 2756e78..c1f6ca1 100755
--- a/tests/smoke/smoke-test.sh
+++ b/tests/smoke/smoke-test.sh
@@ -200,6 +200,7 @@ print(f'M_STEPS={m.get(\"steps\", d.get(\"steps\", 32))}')
print(f'M_TEMP={m.get(\"temperature\", d.get(\"temperature\", 0.0))}')
print(f'M_DOC={repr(m.get(\"doc\", \"\"))}')
print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}')
+print(f'M_INSTRUCT={repr(m.get(\"instruct\", False))}')
")"
M_MODEL=$(expand_path "$M_MODEL")
@@ -225,7 +226,13 @@ print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}')
fi
task=$(runner_task "$M_RUNNER")
- args=$(runner_args "$M_RUNNER" "$M_MODEL" "$M_PROMPT" "$M_STEPS" "$M_TEMP" "$M_DOC" "$M_OUTPUT")
+
+ # Instruct models: use --chat with prompt for proper chat template formatting
+ if [[ "$M_INSTRUCT" == "True" ]]; then
+ args="-m ${M_MODEL} --chat -s ${M_STEPS} -k ${M_TEMP} \"${M_PROMPT}\""
+ else
+ args=$(runner_args "$M_RUNNER" "$M_MODEL" "$M_PROMPT" "$M_STEPS" "$M_TEMP" "$M_DOC" "$M_OUTPUT")
+ fi
start_ts=$(python3 -c 'import time; print(time.time())')
output_file=$(mktemp)