diff --git a/ISSUE-skainet-8b-oom.md b/ISSUE-skainet-8b-oom.md
new file mode 100644
index 0000000..6977ec9
--- /dev/null
+++ b/ISSUE-skainet-8b-oom.md
@@ -0,0 +1,113 @@
+# Issue: Qwen3-8B OOM on 48GB Mac
+
+## Problem
+
+Running Qwen3-8B-Q4_K_M.gguf (4.7GB on disk) on a 48GB Mac fails with OOM during weight loading, both via kllama and the unified skainet CLI.
+
+## Root Cause
+
+The current loading path uses `DEQUANTIZE_TO_FP32`, which expands the nominally 4-bit Q4 weights ~8× into 32-bit floats:
+
+| Component | Size |
+|--------------------------|-----------|
+| Quantized weights (disk) | 4.7 GB |
+| Dequantized FP32 weights | ~37-40 GB |
+| KV cache (2048 context)  | ~576 MB   |
+| Embeddings, norms | ~1 GB |
+| JVM + tokenizer | ~2 GB |
+| **Total** | **~41 GB** |
+
+That ~41 GB barely fits in 48 GB, and the JVM needs additional headroom for temporary buffers during dequantization, so loading OOMs.
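
The expansion factor in the table can be sanity-checked with a one-liner. This is an upper-bound sketch: Q4_K_M averages slightly more than 4 bits per parameter, so the true FP32 size is a little below `diskGb * 8`.

```kotlin
// Nominal expansion when DEQUANTIZE_TO_FP32 inflates 4-bit codes to 32-bit floats.
// Upper-bound estimate: real Q4_K_M carries scale/min metadata per block.
fun fp32WeightsGb(diskGb: Double): Double = diskGb * (32.0 / 4.0)

fun main() {
    println("FP32 weights: ${fp32WeightsGb(4.7)} GB")  // 37.6, matching the table row
}
```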
+
+## What Already Exists in the Codebase
+
+### 1. NATIVE_OPTIMIZED quant policy (best option)
+
+`QuantPolicy.NATIVE_OPTIMIZED` keeps weights in quantized form and uses SIMD-accelerated matmul kernels. `MemSegWeightConverter` converts raw Q4/Q8 bytes to 64-byte-aligned MemorySegment-backed tensors for Vector API dispatch.
+
+- Memory: ~5GB for the 8B model (vs 40GB with FP32)
+- Speed: 1-3 tok/s (proven on Qwen2/3 via kqwen runner)
+- Already works for Qwen2/3 in kllama Main.kt (the `isQwen` path)
+
+**Why it doesn't work today for the 8B:** The kllama `isQwen` path loads with `NATIVE_OPTIMIZED` but then creates `LlamaRuntime` which still transposes weight matrices to FP32 during init (`LlamaRuntime.kt:74`). This transpose step allocates FP32 copies.
+
+### 2. Lazy per-layer dequantization (Apertus pattern)
+
+`ApertusQuantizedRuntime` keeps weights quantized and dequantizes one projection at a time during `runLayer()`:
+
+```
+Resident: ~3.5GB (quantized) + ~100MB (norms/embeddings)
+Per-layer temp: ~50MB (one projection, discarded after matmul)
+```
+
+This matches llama.cpp's resident-memory profile (llama.cpp also keeps weights quantized, though it runs quantized matmul kernels rather than dequantizing per layer). Not yet available for the LLaMA/Qwen runtimes.
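
A minimal sketch of the pattern, using a toy single-scale int8 scheme for illustration (`QuantizedTensor` here is a stand-in, not the real Apertus type):

```kotlin
// Toy sketch of lazy per-projection dequantization. The real runtime stores GGUF
// block formats; this uses one scale per tensor purely for illustration.
class QuantizedTensor(val codes: ByteArray, val scale: Float, val rows: Int, val cols: Int)

fun dequantize(w: QuantizedTensor): FloatArray =
    FloatArray(w.rows * w.cols) { i -> w.codes[i] * w.scale }

// y = W @ x with W in [out, in] layout; the FP32 copy lives only for this call.
fun lazyProject(x: FloatArray, w: QuantizedTensor): FloatArray {
    val dense = dequantize(w)          // one projection's temporary (~50MB for 8B)
    val y = FloatArray(w.rows)
    for (r in 0 until w.rows)
        for (c in 0 until w.cols)
            y[r] += dense[r * w.cols + c] * x[c]
    return y                           // dense becomes unreachable here and is GC'd
}
```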
+
+### 3. Memory-mapped loading (F32 only)
+
+`MmapLlamaLoader` maps the GGUF file via `MappedByteBuffer` for zero-copy tensor access. Only works for F32 models — Q4 models need dequantization which defeats the zero-copy benefit.
+
+## Proposed Solutions (ordered by effort)
+
+### Solution A: Fix NATIVE_OPTIMIZED path for 8B models (small effort)
+
+The kllama Main.kt Qwen path already loads with `NATIVE_OPTIMIZED`. The problem is that the `LlamaRuntime` constructor transposes the weights to FP32. Fix:
+
+1. Skip transpose for quantized tensors in `LlamaRuntime` init
+2. Or use `OptimizedLLMRuntime` which doesn't transpose (the DSL path)
+3. Ensure SIMD matmul kernels handle Q4_K_M format (Q4_K dispatch exists in `MemSegWeightConverter`)
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at 1-3 tok/s.
+
+**Files to change:**
+- `llm-inference/llama/.../LlamaRuntime.kt` -- skip transpose for quantized MemSeg tensors
+- Or migrate Qwen path in `kllama/cli/Main.kt` to `OptimizedLLMRuntime` + `llamaNetwork()`
+
+### Solution B: Port lazy dequant from Apertus to LLaMA (medium effort)
+
+Port the `ApertusQuantizedRuntime` pattern to a `LlamaQuantizedRuntime`:
+
+1. Store projections as `QuantizedTensor` (quantized bytes + metadata)
+2. In `runLayer()`, dequantize one weight matrix at a time, matmul, discard
+3. Keep embeddings and norms as FP32 (small, need element access)
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at ~1 tok/s (dequant overhead per layer).
+
+**Files to create:**
+- `llm-inference/llama/.../LlamaQuantizedRuntime.kt` (new, based on Apertus pattern)
+- `llm-runtime/kllama/.../LlamaQuantizedWeights.kt` (new, mixed storage)
+
+### Solution C: SIMD-native matmul without dequantization (larger effort, best perf)
+
+The SIMD backend (`skainet-backend-cpu`) already has Q4/Q8 matmul kernels via Vector API. The issue is the runtime layer doesn't use them directly. Changes needed in skainet core:
+
+1. `skainet-backend-cpu`: Ensure `matmul(FP32, Q4_K)` kernel exists and dispatches correctly
+2. `LlamaRuntime` or `OptimizedLLMRuntime`: Accept mixed-precision weight tensors (Q4 weights, FP32 activations)
+3. Skip the `MemSegWeightConverter` step entirely — use raw quantized MemorySegments
+
+**Expected result:** 8B Q4 loads in ~5GB, runs at 2-5 tok/s (no dequant overhead).
+
+**Files to change (in skainet core):**
+- `skainet-backend-cpu`: Q4_K matmul kernel (may already exist)
+- `skainet-lang-core`: Mixed-precision tensor support in matmul dispatch
+
+### Solution D: Memory-mapped quantized tensors (largest effort)
+
+Extend `MmapLlamaLoader` to support quantized formats:
+
+1. Map the GGUF file to virtual memory
+2. Create quantized tensor views that reference mmap regions
+3. Dequantize on-the-fly during matmul (like lazy dequant but zero-copy from disk)
+
+**Expected result:** Load time near-zero, ~5GB virtual (OS manages paging).
+
+**Files to change:**
+- `llm-inference/llama/.../MmapLlamaLoader.kt` -- extend to Q4/Q8 formats
+- Requires `skainet-io-core` changes for mmap quantized tensor views
+
+## Recommended Path
+
+**Start with Solution A** — it's the smallest change and uses code that already works for Qwen2/3. The `NATIVE_OPTIMIZED` + `MemSegWeightConverter` path is proven; the only blocker is `LlamaRuntime`'s constructor transposing weights to FP32.
+
+If that's not enough, **add Solution B** (lazy dequant) which gives the most control over memory at a known performance cost.
+
+Solution C is the long-term goal (best performance) but requires skainet core changes.
diff --git a/docs/.docker/Dockerfile b/docs/.docker/Dockerfile
index 0d496ff..67c21ba 100644
--- a/docs/.docker/Dockerfile
+++ b/docs/.docker/Dockerfile
@@ -10,26 +10,28 @@ RUN apk add --no-cache chromium font-noto
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
PUPPETEER_SKIP_DOWNLOAD=true
-WORKDIR /antora
-
-# Install Antora + extensions + mermaid-cli in one layer
-RUN npm i --save-exact \
+# Install Antora + extensions to /opt/antora (not /antora which gets volume-mounted)
+WORKDIR /opt/antora
+RUN npm init -y && npm i --save-exact \
@antora/cli@3.1 \
@antora/site-generator@3.1 \
asciidoctor-kroki@0.18 \
@mermaid-js/mermaid-cli@11 \
&& npm cache clean --force
-# Mermaid-cli config: use installed Chromium, no sandbox (container)
+# Make installed modules visible when workdir is the mounted project
+ENV NODE_PATH=/opt/antora/node_modules
+
+# Mermaid-cli config
RUN echo '{ \
"executablePath": "/usr/bin/chromium-browser", \
"args": ["--no-sandbox", "--disable-gpu", "--disable-dev-shm-usage"] \
-}' > /antora/puppeteer-config.json
+}' > /opt/antora/puppeteer-config.json
-# Pre-generate a simple diagram to warm up and verify the stack works
+# Verify mermaid works
RUN echo 'graph TD; A-->B;' > /tmp/test.mmd \
- && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /antora/puppeteer-config.json \
+ && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /opt/antora/puppeteer-config.json \
&& rm /tmp/test.mmd /tmp/test.svg
-ENTRYPOINT ["npx", "antora"]
+ENTRYPOINT ["/opt/antora/node_modules/.bin/antora"]
CMD ["--stacktrace", "antora-playbook.yml"]
diff --git a/docs/antora-playbook.yml b/docs/antora-playbook.yml
index b07afab..a21a2df 100644
--- a/docs/antora-playbook.yml
+++ b/docs/antora-playbook.yml
@@ -4,7 +4,7 @@ site:
content:
sources:
- - url: .
+ - url: /antora
start_path: docs
branches: HEAD
diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 5bc1fc9..895b219 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -23,3 +23,4 @@
* xref:explanation/pipeline-design.adoc[Pipeline Design Decisions]
* xref:explanation/dsl-vs-handcoded.adoc[DSL Networks vs Hand-Coded Runtimes]
* xref:explanation/tokenizer-internals.adoc[Tokenizer Internals]
+* xref:explanation/weight-quantization.adoc[Weight Quantization and Numeric Representation]
diff --git a/docs/modules/ROOT/pages/explanation/weight-quantization.adoc b/docs/modules/ROOT/pages/explanation/weight-quantization.adoc
new file mode 100644
index 0000000..6974baf
--- /dev/null
+++ b/docs/modules/ROOT/pages/explanation/weight-quantization.adoc
@@ -0,0 +1,332 @@
+= Weight Quantization and Numeric Representation
+:description: Deep technical guide to how model weights flow through quantization, dequantization, transpose, and SIMD kernel dispatch.
+
+== Overview
+
+A model weight tensor goes through several numeric representation changes between the GGUF file on disk and the final matmul during inference.
+Understanding each stage is essential for debugging memory issues, correctness problems, and performance optimization.
+
+[mermaid]
+....
+graph TD
+ A["GGUF File
(Q4_K_M: 4.7 GB)"] -->|StreamingGGUFReader| B["Raw Bytes
IntArrayTensorData"]
+ B -->|MemSegWeightConverter| C{Quant Type?}
+ C -->|Q4_0| D["Q4MemorySegmentTensorData
64-byte aligned, Arena-managed"]
+ C -->|Q8_0| E["Q8MemorySegmentTensorData
64-byte aligned, Arena-managed"]
+ C -->|Q4_K / Q6_K| F["DequantOps.dequantFromBytes()
→ FloatArray"]
+ C -->|FP32| G["tensor.t()
MemorySegmentTensorData"]
+ F -->|"Array transpose
[out,in] → [in,out]"| H["FloatArrayTensorData
Pre-transposed"]
+ D --> I["SIMD Matmul
QuantizedMatmul.matmulQ4_0()"]
+ E --> J["SIMD Matmul
QuantizedMatmul.matmulQ8_0()"]
+ H --> K["Scalar Matmul
DefaultCpuOps.matmul()"]
+ G --> K
+
+ style A fill:#f9f,stroke:#333
+ style I fill:#9f9,stroke:#333
+ style J fill:#9f9,stroke:#333
+ style K fill:#ff9,stroke:#333
+....
+
+== Stage 1: GGUF File on Disk
+
+GGUF stores each weight tensor as a contiguous byte region with a header describing its quantization type, shape, and byte offset.
+
+=== Quantization Types in Q4_K_M Format
+
+The `Q4_K_M` quantization scheme uses a **mixed-precision strategy**:
+
+[cols="1,2,2,1"]
+|===
+|Type |Used For |Block Format |Bits/Param
+
+|Q4_K
+|Large projections (wq, wk, wv, wo, ffn_gate, ffn_up, ffn_down) in ~50% of layers
+|144 bytes per 256 elements: 1×f16 super-scale + 1×f16 super-min + 12 bytes of packed 6-bit sub-block scales + 128 bytes of nibble codes
+|~4.5
+
+|Q6_K
+|Same projections in the other ~50% of layers, plus output weight
+|210 bytes per 256 elements: 128 bytes of low nibbles + 64 bytes of high bits + 16 int8 sub-block scales + 1×f16 scale; higher precision for critical layers
+|~6.5
+
+|Q8_0
+|Not used in Q4_K_M (used in Q8_0 format models)
+|34 bytes per 32 elements: 1×f16 scale + 32×int8 codes
+|~8.5
+
+|FP32
+|Norms (attn_norm, ffn_norm, output_norm) — 1D tensors
+|4 bytes per element
+|32
+|===
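
The bits-per-param column follows directly from the block sizes in the table:

```kotlin
// Derive bits/param from block byte size (values from the table above).
fun bitsPerParam(blockBytes: Int, elemsPerBlock: Int): Double =
    blockBytes * 8.0 / elemsPerBlock

fun main() {
    println(bitsPerParam(144, 256)) // Q4_K: 4.5
    println(bitsPerParam(210, 256)) // Q6_K: 6.5625, rounded to ~6.5 above
    println(bitsPerParam(34, 32))   // Q8_0: 8.5
}
```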
+
+=== Tensor Layout in GGUF
+
+All 2D weight tensors are stored in **row-major [out_dim, in_dim]** order:
+
+----
+wq: Shape(dim, dim) = [4096, 4096] "4096 output neurons, each with 4096 input weights"
+wk: Shape(kvDim, dim) = [1024, 4096] "1024 KV outputs (8 heads × 128 head_dim)"
+ffn_gate: Shape(ffnDim, dim) = [14336, 4096] "14336 FFN hidden units"
+ffn_down: Shape(dim, ffnDim) = [4096, 14336] "project back to model dim"
+----
+
+The matmul convention `y = x @ W^T` requires weights in `[in_dim, out_dim]` form, so a transpose is needed before or during the matmul.
+
+== Stage 2: Loading Raw Bytes
+
+`LlamaWeightLoader.loadToMapStreaming()` reads the GGUF file via `StreamingGGUFReader`:
+
+[source,kotlin]
+----
+// QuantPolicy.NATIVE_OPTIMIZED: store as raw Int8 bytes
+val tensor = streamingTensorToTensor(reader, tensorInfo, ctx)
+// tensor.data is IntArrayTensorData containing the raw quantized bytes
+----
+
+At this stage, the tensor holds the original GGUF bytes unchanged.
+A `quantTypes` map records each tensor's quantization type for later processing.
+
+.Memory at Stage 2
+----
+Qwen3-8B-Q4_K_M: ~4.7 GB (raw bytes, same as file size)
+----
+
+== Stage 3: MemSegWeightConverter
+
+`MemSegWeightConverter.convert()` transforms raw bytes into runtime-ready tensors.
+This is where the numeric representation diverges by quantization type.
+
+=== Path A: Q4_0 → Q4MemorySegmentTensorData
+
+[source,kotlin]
+----
+Q4MemorySegmentTensorData.fromRawBytes(logicalShape, bytes, arena)
+----
+
+* Copies raw bytes into a 64-byte-aligned `MemorySegment` (Arena-managed, off-heap)
+* The data stays in Q4_0 block format (no dequantization)
+* The `MemorySegment` alignment enables SIMD vector loads
+
+.Memory: same as raw bytes (~4.5 bits/param)
+
+=== Path B: Q8_0 → Q8MemorySegmentTensorData
+
+Same as Q4_0 but with Q8_0 block format (8 bits per code + f16 scale per 32 elements).
+
+.Memory: ~8.5 bits/param
+
+=== Path C: Q4_K / Q5_K / Q6_K → FP32 + Pre-Transpose
+
+[source,kotlin]
+----
+// 1. Dequantize to float array
+val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols)
+
+// 2. Pre-transpose from [out, in] to [in, out]
+val transposed = FloatArray(rows * cols)
+for (r in 0 until rows) {
+ for (c in 0 until cols) {
+ transposed[c * rows + r] = floats[r * cols + c]
+ }
+}
+
+// 3. Store as heap-based FloatArrayTensorData
+return ctx.fromFloatArray(Shape(cols, rows), FP32::class, transposed)
+----
+
+**Why dequantize?** This path needs weights in pre-transposed `[in, out]` layout, and K-quant blocks cannot be transposed without a dequantize/re-quantize round trip (see "Why Q4_K Cannot Be Trivially Transposed" below), so K-quant tensors are expanded to FP32 at load time.
+
+**Why pre-transpose?** The `.t()` operation on tensors allocates a new `MemorySegmentTensorData` in direct buffer memory.
+The JVM's direct buffer allocator does not reclaim memory eagerly, causing OOM on memory-constrained machines (48GB).
+Pre-transposing during loading avoids all runtime `.t()` calls.
+
+.Memory per K-quant tensor
+----
+Original Q4_K: ~4.5 bits/param
+After dequant: 32 bits/param (8× expansion)
+Temporary: 2× (original float array + transposed array, then original is GC'd)
+----
+
+.Total memory for Qwen3-8B-Q4_K_M after Stage 3
+----
+Q4_K tensors (dequantized + transposed): ~15 GB
+Q6_K tensors (dequantized + transposed): ~12 GB
+Token embedding (dequantized, not transposed): ~2.4 GB
+Norms (FP32, 1D, tiny): ~0.01 GB
+Total: ~30 GB
+----
+
+=== Path D: FP32 → Pre-Transpose
+
+[source,kotlin]
+----
+return tensor.t() // one-time transpose during loading
+----
+
+Norms are 1D so `.t()` is a no-op. For FP32 projection weights (rare), a standard transpose copies to direct memory once.
+
+=== Special Case: Token Embedding
+
+[source,kotlin]
+----
+tokenEmbedding = maybeDequantize(weights.tokenEmbedding, ...)
+----
+
+Token embeddings are always dequantized to FP32 and **not transposed** because `Embedding.forward()` does row gather (lookup by token ID), not matmul.
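
A row gather is just an indexed copy, which is why the `[vocab, dim]` layout needs no transpose. Illustrative sketch only, not the actual `Embedding.forward()` code:

```kotlin
// Embedding lookup as a row gather: copy one [dim]-sized row per token ID.
fun gatherRows(table: FloatArray, dim: Int, tokenIds: IntArray): FloatArray {
    val out = FloatArray(tokenIds.size * dim)
    for ((i, id) in tokenIds.withIndex()) {
        // row `id` starts at offset id * dim in the row-major [vocab, dim] table
        System.arraycopy(table, id * dim, out, i * dim, dim)
    }
    return out
}
```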
+
+== Stage 4: LlamaRuntime.linearProject()
+
+During inference, each projection uses `linearProject()`:
+
+[source,kotlin]
+----
+private fun linearProject(x: Tensor, w: Tensor): Tensor {
+ val xCols = if (x.shape.rank >= 2) x.shape[x.shape.rank - 1] else x.shape[0]
+ val wRows = w.shape[0]
+ return if (wRows == xCols) {
+ x.matmul(w) // weight is [in, out] — pre-transposed
+ } else {
+ x.matmul(w.t()) // weight is [out, in] — legacy path (tests)
+ }
+}
+----
+
+The shape check auto-detects the weight layout:
+
+* **Pre-transposed** `[in, out]`: `wRows == xCols` → direct `matmul`, no allocation
+* **Original** `[out, in]`: `wRows != xCols` → `.t()` then `matmul` (legacy/test path)
+
+== Stage 5: Matmul Kernel Dispatch
+
+The `Tensor.matmul()` extension dispatches based on the runtime `TensorData` type:
+
+[cols="1,2,2"]
+|===
+|TensorData Type |Kernel |Implementation
+
+|`Q4MemorySegmentTensorData`
+|`QuantizedMatmul.matmulQ4_0()`
+|SIMD (Vector API): processes 32 Q4 values per vector lane
+
+|`Q8MemorySegmentTensorData`
+|`QuantizedMatmul.matmulQ8_0()`
+|SIMD (Vector API): dot product of int8 codes × float scale
+
+|`Q4_KBlockTensorData`
+|`QuantizedMatmul.matmulQ4_K()`
+|SIMD: unpacks K-quant blocks with dual scales + min values
+
+|`FloatArrayTensorData`
+|`DefaultCpuOps.matmul()`
+|Scalar FP32 double loop (no SIMD)
+
+|`MemorySegmentTensorData`
+|`DefaultCpuOpsJvm.matmul()`
+|SIMD FP32 via Vector API
+|===
+
+=== SIMD Q4_0 Matmul (Simplified)
+
+[source]
+----
+For each output row:
+ For each block of 32 input elements:
+ Load 16 bytes of Q4 codes from MemorySegment (128 bits)
+ Unpack low/high nibbles into two int8 vectors (256 bits each)
+ Subtract zero-point (8)
+ Convert to float vectors
+ Multiply by block scale (f16 → f32)
+ FMA with input vector → accumulate into output
+----
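
A scalar reference version of the same loop, illustrating only the block format: the real kernel reads f16 scales from a `MemorySegment` and uses the Vector API, whereas here the scale is a plain `Float`. The nibble layout follows llama.cpp's Q4_0 convention (low nibbles hold elements 0..15, high nibbles elements 16..31).

```kotlin
// One Q4_0 block: 32 values packed as 16 nibble-pair bytes plus one scale.
class Q4Block(val scale: Float, val codes: ByteArray /* 16 bytes */)

// Dot product of one block against 32 input floats, zero-point 8.
fun dotQ4(block: Q4Block, x: FloatArray, xOffset: Int): Float {
    var acc = 0.0f
    for (i in 0 until 16) {
        val b = block.codes[i].toInt() and 0xFF
        val lo = (b and 0x0F) - 8       // low nibble -> element i
        val hi = (b ushr 4) - 8         // high nibble -> element i + 16
        acc += lo * x[xOffset + i]
        acc += hi * x[xOffset + i + 16]
    }
    return acc * block.scale            // scale applied once per 32-element block
}
```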
+
+=== Why Q4_K Cannot Be Trivially Transposed
+
+Q4_K blocks encode 256 elements with a complex internal structure:
+
+----
+Block (144 bytes):
+ [0..1] d (f16) — primary scale
+ [2..3] dmin (f16) — minimum offset
+ [4..15] scales (12 bytes) — per-subblock scales (6-bit packed)
+ [16..143] qs (128 bytes) — quantized codes (4-bit packed, 256 values)
+----
+
+The 256 values in each block correspond to **256 contiguous elements in the original row**.
+Transposing the matrix would scatter these elements across different rows, breaking the block structure.
+A proper Q4_K transpose would require:
+
+1. Dequantize all blocks → FP32
+2. Transpose the FP32 matrix
+3. Re-quantize into new Q4_K blocks
+
+This is why `MemSegWeightConverter` currently dequantizes K-quant types to FP32 rather than keeping them quantized.
+
+== Memory Budget: Qwen3-8B-Q4_K_M on 48GB Mac
+
+[cols="2,1,3"]
+|===
+|Component |Size |Notes
+
+|K-quant weights (FP32 pre-transposed)
+|~27 GB
+|Q4_K + Q6_K dequantized, no runtime `.t()` copies
+
+|Token embedding (FP32)
+|2.4 GB
+|151936 × 4096 × 4 bytes
+
+|Norms (FP32)
+|~10 MB
+|1D tensors, negligible
+
+|KV cache (context=512)
+|~144 MB
+|2 × 36 layers × 512 × 1024 × 4 bytes
+
+|JVM + tokenizer
+|~1 GB
+|Heap overhead, vocab structures
+
+|**Total**
+|**~31 GB**
+|Fits in 48 GB with OS headroom
+|===
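
The KV-cache entry can be recomputed from the formula in its note (assuming an FP32 cache, as the 4-bytes-per-element factor implies):

```kotlin
// KV cache bytes: one K and one V tensor per layer, [context, kvDim] each, FP32.
fun kvCacheBytes(layers: Int, ctx: Int, kvDim: Int, bytesPerElem: Int): Long =
    2L * layers * ctx * kvDim * bytesPerElem

fun main() {
    val mib = kvCacheBytes(36, 512, 1024, 4) / (1024.0 * 1024.0)
    println("KV cache: $mib MiB")  // 144.0 MiB
}
```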
+
+== Performance Characteristics
+
+[cols="2,1,1,2"]
+|===
+|Path |Bits/Param |Memory |Speed (8B, M-series CPU)
+
+|Q4_K SIMD (future)
+|4.5
+|~5 GB
+|~1-3 tok/s (projected)
+
+|Q8_0 SIMD
+|8.5
+|~9 GB
+|~1-2 tok/s
+
+|FP32 pre-transposed (current)
+|32
+|~30 GB
+|~0.002 tok/s (scalar)
+
+|FP32 + runtime .t() (old, OOM)
+|32 + 32 (copy)
+|~60 GB
+|OOM on 48GB
+|===
+
+== Future: Block-Aware Q4_K Transpose
+
+To use the Q4_K SIMD kernel with GGUF weights, the skainet core library would need:
+
+1. **`Q4_KBlockTensorData.transpose()`** — dequantize → rearrange → re-quantize at the block level
+2. Or **`QuantizedMatmul.matmulQ4_K_transposed()`** — a kernel variant that reads blocks in column-major order
+3. Or **GGUF pre-transposed storage** — store weights as `[in, out]` in the GGUF file during quantization
+
+Option 2 is the most practical: modify the SIMD kernel to iterate over columns instead of rows when reading Q4_K blocks.
+This would reduce memory from ~30GB to ~5GB for the 8B model.
diff --git a/llm-apps/skainet-cli/build.gradle.kts b/llm-apps/skainet-cli/build.gradle.kts
index 7a999be..608c38c 100644
--- a/llm-apps/skainet-cli/build.gradle.kts
+++ b/llm-apps/skainet-cli/build.gradle.kts
@@ -50,5 +50,5 @@ tasks.withType().configureEach {
}
tasks.withType().configureEach {
- jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx48g", "-XX:MaxDirectMemorySize=64g")
+ jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx42g", "-XX:MaxDirectMemorySize=42g")
}
diff --git a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
index e461142..468054e 100644
--- a/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
+++ b/llm-inference/llama/src/commonMain/kotlin/sk/ainet/models/llama/LlamaRuntime.kt
@@ -54,29 +54,30 @@ public class LlamaRuntime(
const val DEFAULT_BOS_TOKEN: Int = 1
}
- /** Pre-transposed weight tensors per layer — avoids re-creating lazy transpose wrappers every forward pass. */
- private class TransposedLayerWeights(
- val wqT: Tensor,
- val wkT: Tensor,
- val wvT: Tensor,
- val woT: Tensor,
- val ffnGateT: Tensor,
- val ffnDownT: Tensor,
- val ffnUpT: Tensor,
- )
+ // NOTE: weights are no longer pre-transposed eagerly at init. Dropping the
+ // eager FP32 transpose copies halves peak memory (~31GB saved for 8B models);
+ // weights still stored [out, in] are transposed on the fly in linearProject(),
+ // at the cost of per-token transpose allocations that the GC reclaims.
+ // Q4_0/Q8_0 weights skip transpose entirely: their SIMD matmul kernels
+ // handle the [out, in] layout directly.
- private val transposedLayers: List<TransposedLayerWeights> = weights.layers.map { layer ->
- TransposedLayerWeights(
- wqT = layer.wq.t(),
- wkT = layer.wk.t(),
- wvT = layer.wv.t(),
- woT = layer.wo.t(),
- ffnGateT = layer.ffnGate.t(),
- ffnDownT = layer.ffnDown.t(),
- ffnUpT = layer.ffnUp.t(),
- )
+ /**
+ * Linear projection: y = x @ W.
+ *
+ * When weights are pre-transposed to [in, out] by MemSegWeightConverter
+ * (Q4_K, Q6_K, FP32 via NATIVE_OPTIMIZED), uses direct matmul.
+ * Otherwise falls back to .t() for non-converted weights (tests, DEQUANTIZE_TO_FP32).
+ */
+ private fun linearProject(x: Tensor, w: Tensor): Tensor {
+ val xCols = if (x.shape.rank >= 2) x.shape[x.shape.rank - 1] else x.shape[0]
+ val wRows = w.shape[0]
+ return if (wRows == xCols) {
+ // Weight is [in, out] — already transposed, direct matmul
+ x.matmul(w)
+ } else {
+ // Weight is [out, in] — needs transpose (legacy path)
+ x.matmul(w.t())
+ }
}
- private val outputWeightT: Tensor = weights.outputWeight.t()
// ---- DecoderRuntime abstract properties ----
override val dim: Int = weights.metadata.embeddingLength
@@ -131,7 +132,6 @@ public class LlamaRuntime(
embedding.forward(intArrayOf(tokenId), ctx)
override fun runLayer(layerIdx: Int, x: Tensor): Tensor {
- val tl = transposedLayers[layerIdx]
val layer = weights.layers[layerIdx]
// QKV: try compiled graph first, fall back to individual ops
@@ -140,9 +140,9 @@ public class LlamaRuntime(
} ?: run {
val attnNorm = attnNorms[layerIdx].forward(x, ctx)
Triple(
- attnNorm.matmul(tl.wqT),
- attnNorm.matmul(tl.wkT),
- attnNorm.matmul(tl.wvT)
+ linearProject(attnNorm, layer.wq),
+ linearProject(attnNorm, layer.wk),
+ linearProject(attnNorm, layer.wv)
)
}
@@ -156,14 +156,14 @@ public class LlamaRuntime(
val attnOut = attentionBackend.attention(q, k, v, layerIdx, position)
// Output projection + residual
- val afterAttn = x + attnOut.matmul(tl.woT)
+ val afterAttn = x + linearProject(attnOut, layer.wo)
// FFN: try compiled graph first, fall back to individual ops
return graphAccelerator?.runFFN(layerIdx, afterAttn) ?: run {
val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
- val gate = ffnNorm.matmul(tl.ffnGateT).silu()
- val up = ffnNorm.matmul(tl.ffnUpT)
- val ffnOut = (gate * up).matmul(tl.ffnDownT)
+ val gate = linearProject(ffnNorm, layer.ffnGate).silu()
+ val up = linearProject(ffnNorm, layer.ffnUp)
+ val ffnOut = linearProject(gate * up, layer.ffnDown)
afterAttn + ffnOut
}
}
@@ -172,7 +172,7 @@ public class LlamaRuntime(
outputNormLayer.forward(x, ctx)
override fun outputProject(x: Tensor): Tensor =
- x.matmul(outputWeightT)
+ linearProject(x, weights.outputWeight)
override fun resetState() {
attentionBackend.reset()
@@ -287,11 +287,10 @@ public class LlamaRuntime(
var x = embedding.forward(tokenIds, ctx)
weights.layers.forEachIndexed { layerIdx, layer ->
- val tl = transposedLayers[layerIdx]
val attnNorm = attnNorms[layerIdx].forward(x, ctx)
- var q = attnNorm.matmul(tl.wqT)
- var k = attnNorm.matmul(tl.wkT)
- val v = attnNorm.matmul(tl.wvT)
+ var q = linearProject(attnNorm, layer.wq)
+ var k = linearProject(attnNorm, layer.wk)
+ val v = linearProject(attnNorm, layer.wv)
if (hasQKNorm) {
q = applyPerHeadRMSNorm(q, nHeads, headDim, layer.qNorm!!)
@@ -299,19 +298,19 @@ public class LlamaRuntime(
}
val attnOut = attentionBackend.batchAttention(q, k, v, layerIdx, startPos)
- ?: return batchForwardFallback(tokenIds, startPos) // shouldn't happen but be safe
+ ?: return batchForwardFallback(tokenIds, startPos)
- val afterAttn = x + attnOut.matmul(tl.woT)
+ val afterAttn = x + linearProject(attnOut, layer.wo)
val ffnNorm = ffnNorms[layerIdx].forward(afterAttn, ctx)
- val gate = ffnNorm.matmul(tl.ffnGateT).silu()
- val up = ffnNorm.matmul(tl.ffnUpT)
- val ffnOut = (gate * up).matmul(tl.ffnDownT)
+ val gate = linearProject(ffnNorm, layer.ffnGate).silu()
+ val up = linearProject(ffnNorm, layer.ffnUp)
+ val ffnOut = linearProject(gate * up, layer.ffnDown)
x = afterAttn + ffnOut
}
val norm = outputNormLayer.forward(x, ctx)
- val logits = norm.matmul(outputWeightT)
+ val logits = linearProject(norm, weights.outputWeight)
position = startPos + tokenIds.size
return logits
}
diff --git a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
index 2e24d4f..8146746 100644
--- a/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
+++ b/llm-inference/llama/src/jvmMain/kotlin/sk/ainet/models/llama/MemSegWeightConverter.kt
@@ -8,6 +8,7 @@ import sk.ainet.models.llama.LlamaRuntimeWeights
import sk.ainet.models.llama.LlamaTensorNames
import sk.ainet.lang.tensor.Shape
import sk.ainet.lang.tensor.Tensor
+import sk.ainet.lang.tensor.t
import sk.ainet.lang.tensor.data.IntArrayTensorData
import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
import sk.ainet.lang.tensor.data.Q8MemorySegmentTensorData
@@ -85,7 +86,11 @@ public object MemSegWeightConverter {
ctx: ExecutionContext,
arena: Arena
): Tensor {
- val quantType = quantTypes[tensorName] ?: return tensor
+ val quantType = quantTypes[tensorName]
+ if (quantType == null) {
+ // FP32 tensor — pre-transpose to [in, out] so no .t() at runtime
+ return tensor.t()
+ }
val bytes = extractBytes(tensor.data)
@@ -97,9 +102,20 @@ public object MemSegWeightConverter {
GGMLQuantizationType.Q4_K,
GGMLQuantizationType.Q5_K,
GGMLQuantizationType.Q6_K -> {
- // Dequantize K-quant types to FP32 (no native SIMD kernel yet)
- val floats = DequantOps.dequantFromBytes(bytes, quantType, logicalShape.volume)
- return ctx.fromFloatArray(logicalShape, FP32::class, floats)
+ // Dequantize K-quant types to FP32 and pre-transpose to [in, out].
+ // Pre-transposing at load time avoids .t() at runtime, which
+ // allocates direct buffers the JVM doesn't GC eagerly (OOM on 48GB).
+ val rows = logicalShape[0]
+ val cols = logicalShape[1]
+ val floats = DequantOps.dequantFromBytes(bytes, quantType, rows * cols)
+ val transposed = FloatArray(rows * cols)
+ for (r in 0 until rows) {
+ for (c in 0 until cols) {
+ transposed[c * rows + r] = floats[r * cols + c]
+ }
+ }
+ val transposedShape = Shape(cols, rows)
+ return ctx.fromFloatArray(transposedShape, FP32::class, transposed)
}
else -> {
println("WARNING: Unsupported quant type $quantType for MemorySegment conversion of $tensorName, keeping as-is")
diff --git a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt
index d4e6cd8..ee66112 100644
--- a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt
+++ b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/VoxtralFlowMatching.kt
@@ -200,7 +200,7 @@ public class VoxtralFlowMatching(
val u1 = random.nextFloat().coerceIn(1e-7f, 1.0f)
val u2 = random.nextFloat()
val mag = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1))
- val angle = (2.0 * Math.PI * u2).toFloat()
+ val angle = (2.0 * kotlin.math.PI * u2).toFloat()
values[i] = mag * kotlin.math.cos(angle)
values[i + 1] = mag * kotlin.math.sin(angle)
i += 2
@@ -208,7 +208,7 @@ public class VoxtralFlowMatching(
if (i < n) {
val u1 = random.nextFloat().coerceIn(1e-7f, 1.0f)
val u2 = random.nextFloat()
- values[i] = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) * kotlin.math.cos((2.0 * Math.PI * u2).toFloat())
+ values[i] = kotlin.math.sqrt(-2.0f * kotlin.math.ln(u1)) * kotlin.math.cos((2.0 * kotlin.math.PI * u2).toFloat())
}
@Suppress("UNCHECKED_CAST")
val result = ctx.fromFloatArray(shape, dtype, values)
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt
index ebe4289..12221c5 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt
@@ -29,6 +29,47 @@ public class AgentCli(
private val session = ChatSession(runtime, tokenizer, metadata, templateName)
private val template: ChatTemplate = session.chatTemplate
+ /**
+ * Run a single non-interactive chat round. Used by smoke tests for instruct models.
+ */
+ public fun runChatOnce(
+ userPrompt: String,
+ systemPrompt: String = "You are a helpful assistant. Answer concisely.",
+ maxTokens: Int = 64,
+ temperature: Float = 0.0f
+ ) {
+ val messages = mutableListOf(
+ ChatMessage(role = ChatRole.SYSTEM, content = systemPrompt),
+ ChatMessage(role = ChatRole.USER, content = userPrompt)
+ )
+
+ runtime.reset()
+ val prompt = template.apply(messages, emptyList(), addGenerationPrompt = true)
+ val promptTokens = tokenizer.encode(prompt)
+
+ print("Assistant: ")
+ System.out.flush()
+
+ val startTime = System.nanoTime()
+ val result = runtime.generateUntilStop(
+ prompt = promptTokens,
+ maxTokens = maxTokens,
+ eosTokenId = tokenizer.eosTokenId,
+ temperature = temperature,
+ onToken = { tokenId ->
+ print(tokenizer.decode(tokenId))
+ System.out.flush()
+ },
+ decode = { tokenId -> tokenizer.decode(tokenId) }
+ )
+ val elapsed = (System.nanoTime() - startTime) / 1_000_000.0
+
+ println()
+ println("---")
+ val tokPerSec = if (elapsed > 0) result.tokens.size / elapsed * 1000 else 0.0
+ println("tok/s: $tokPerSec")
+ }
+
/**
* Run interactive chat mode (no tool calling).
*/
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
index a89da0e..ddf31c0 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
@@ -508,7 +508,11 @@ fun main(args: Array<String>) {
}
else -> {
val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata)
- agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+ if (cliArgs.prompt != null) {
+ agentCli.runChatOnce(cliArgs.prompt, maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+ } else {
+ agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+ }
}
}
return@runBlocking
diff --git a/tests/smoke/smoke-models.json b/tests/smoke/smoke-models.json
index 2958517..895b06e 100644
--- a/tests/smoke/smoke-models.json
+++ b/tests/smoke/smoke-models.json
@@ -13,9 +13,11 @@
},
{
"name": "Qwen3-1.7B-Q8",
- "runner": "kqwen",
+ "runner": "kllama",
"model": "Qwen3-1.7B-Q8_0.gguf",
"format": "gguf",
+ "instruct": true,
+ "prompt": "What is the capital of France?",
"toolCalling": {
"prompt": "What is 2 + 2?",
"steps": 256
@@ -26,6 +28,8 @@
"runner": "kllama",
"model": "Qwen3-8B-Q4_K_M.gguf",
"format": "gguf",
+ "instruct": true,
+ "prompt": "What is the capital of France?",
"toolCalling": {
"prompt": "What is 2 + 2?",
"steps": 256
diff --git a/tests/smoke/smoke-test.sh b/tests/smoke/smoke-test.sh
index 2756e78..c1f6ca1 100755
--- a/tests/smoke/smoke-test.sh
+++ b/tests/smoke/smoke-test.sh
@@ -200,6 +200,7 @@ print(f'M_STEPS={m.get(\"steps\", d.get(\"steps\", 32))}')
print(f'M_TEMP={m.get(\"temperature\", d.get(\"temperature\", 0.0))}')
print(f'M_DOC={repr(m.get(\"doc\", \"\"))}')
print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}')
+print(f'M_INSTRUCT={repr(m.get(\"instruct\", False))}')
")"
M_MODEL=$(expand_path "$M_MODEL")
@@ -225,7 +226,13 @@ print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}')
fi
task=$(runner_task "$M_RUNNER")
- args=$(runner_args "$M_RUNNER" "$M_MODEL" "$M_PROMPT" "$M_STEPS" "$M_TEMP" "$M_DOC" "$M_OUTPUT")
+
+ # Instruct models: use --chat with prompt for proper chat template formatting
+ if [[ "$M_INSTRUCT" == "True" ]]; then
+ args="-m ${M_MODEL} --chat -s ${M_STEPS} -k ${M_TEMP} \"${M_PROMPT}\""
+ else
+ args=$(runner_args "$M_RUNNER" "$M_MODEL" "$M_PROMPT" "$M_STEPS" "$M_TEMP" "$M_DOC" "$M_OUTPUT")
+ fi
start_ts=$(python3 -c 'import time; print(time.time())')
output_file=$(mktemp)