From f42617aae3071ed1f65b71e8c625dd5164c9687e Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 11 Apr 2026 10:25:35 +0200
Subject: [PATCH 1/9] refactor: decouple tool calling from kllama, add
 ChatSession abstraction
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 1 of the unified pipeline plan. Tool calling no longer requires
GGUFTokenizer — any Tokenizer implementation works.

- Extend Tokenizer interface with eosTokenId, bosTokenId, vocabSize
- Add ChatSession in llm-agent that bundles runtime + tokenizer + metadata
- Refactor ToolCallingDemo and AgentCli to accept Tokenizer, not GGUFTokenizer
- Remove GGUFTokenizer cast from kllama Main.kt chat/agent/demo dispatch
- Fix JavaAgentLoop instanceof hack with tokenizer.eosTokenId
- Update all Tokenizer implementations (GGUF, HF BPE, Tekken, BERT)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 PLAN-unified-pipeline.md                      | 147 ++++++++++++++++++
 .../sk/ainet/apps/kllama/chat/ChatSession.kt  | 113 ++++++++++++++
 .../kotlin/sk/ainet/apps/llm/Tokenizer.kt     |   9 ++
 .../llm/tokenizer/HuggingFaceBPETokenizer.kt  |   6 +-
 .../ainet/models/bert/HuggingFaceTokenizer.kt |   5 +-
 .../models/voxtral/TekkenTokenizerAdapter.kt  |   4 +
 .../sk/ainet/apps/kllama/GGUFTokenizer.kt     |  18 ++-
 .../sk/ainet/apps/kllama/TokenizerImpl.kt     |   4 +-
 .../apps/kllama/chat/java/JavaAgentLoop.kt    |   8 +-
 .../sk/ainet/apps/kllama/cli/AgentMain.kt     |  25 +--
 .../kotlin/sk/ainet/apps/kllama/cli/Main.kt   |  10 +-
 .../ainet/apps/kllama/cli/ToolCallingDemo.kt  |  25 +--
 12 files changed, 310 insertions(+), 64 deletions(-)
 create mode 100644 PLAN-unified-pipeline.md
 create mode 100644 llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ChatSession.kt

diff --git a/PLAN-unified-pipeline.md b/PLAN-unified-pipeline.md
new file mode 100644
index 0000000..a51a783
--- /dev/null
+++ b/PLAN-unified-pipeline.md
@@ -0,0 +1,147 @@
+# Plan: Unified Model Pipeline with Decoupled Tool Calling
+
+## Context
+
+Currently SKaiNET-transformers has:
+- **5+ hand-coded runtimes** (LlamaRuntime, Qwen35Runtime, Gemma3nRuntime, ApertusRuntime, VoxtralRuntimes) — each reimplements the forward pass, weight loading, and layer execution
+- **Tool calling tightly coupled to kllama** — the AgentLoop, ToolCallingDemo, and chat modes only exist in the kllama runner. Other models (Gemma, Apertus) cannot use tool calling without duplicating code
+- **Two execution paths** — legacy hand-coded runtimes AND the newer `OptimizedLLMRuntime` with DSL/compute-graph/AOT. LlamaRuntime and ApertusRuntime are already marked deprecated
+
+The goal: converge on **one unified pipeline** where model definition, weight loading, tokenization, and tool calling are cleanly separated pipeline stages.
+
+## Architecture Overview
+
+```
+GGUF/SafeTensors File
+    |
+WeightLoader (parse metadata + tensors)
+    |
+DSL Network Definition (model-specific, declarative)
+    |
+ComputeGraph (DAG)
+    |
+Optimization Pipeline (TransposeElim -> WeightDedup -> LLMFusion -> DCE)
+    |
+ComputeGraphExecutor (fused kernels)
+    |
+InferenceRuntime (unified: forward + generate)
+    |
+TokenizationPipeline (encode/decode, special tokens, byte-level BPE)
+    |
+ChatPipeline (template formatting, tool calling, agent loop)
+```
+
+## Phase 1: Decouple Tool Calling from kllama (immediate value) -- DONE
+
+**What was done:**
+
+1. **Enhanced `Tokenizer` interface** with `eosTokenId`, `bosTokenId`, `vocabSize`
+   - Updated all implementations: `GGUFTokenizer`, `TokenizerImpl`, `HuggingFaceBPETokenizer`, `TekkenTokenizerAdapter`, `HuggingFaceTokenizer` (BERT)
+
+2. **Created `ChatSession` abstraction** in `llm-agent`
+   - File: `llm-agent/.../chat/ChatSession.kt`
+   - Bundles `InferenceRuntime` + `Tokenizer` + `ModelMetadata`
+   - Provides `createAgentLoop()` and `runSingleTurn()` for any runner
+
+3. **Refactored `ToolCallingDemo` and `AgentCli`** to use `Tokenizer` interface instead of `GGUFTokenizer`
+   - Both now accept any `Tokenizer`, not just `GGUFTokenizer`
+   - Both use `ChatSession` internally for agent loop creation
+
+4. **Removed `GGUFTokenizer` cast from kllama Main.kt** dispatch
+   - Chat/agent/demo modes now work with any `Tokenizer`
+
+5. **Fixed `JavaAgentLoop`** — replaced `GGUFTokenizer` instanceof hack with `tokenizer.eosTokenId`
+
+## Phase 2: Unified DSL-Based Model Definition (converge on OptimizedLLMRuntime)
+
+**Problem:** Each model has a hand-coded runtime. `OptimizedLLMRuntime` already supports DSL -> graph -> optimized execution, but only some models use it.
+
+**Changes:**
+
+1. **Define DSL networks for all model families:**
+   - `llamaNetwork(config)` — LLaMA/Mistral/Qwen2/3 (standard transformer)
+   - `qwen35Network(config)` — Qwen3.5 (hybrid DeltaNet + full attention)
+   - `gemmaNetwork(config)` — Gemma (GELU, MatFormer FFN, sliding window)
+   - `apertusNetwork(config)` — Apertus (xIELU, ungated MLP, QK-norm)
+   - Each is a pure function returning a `Network<T>` from the DSL
+
+2. **Unified model loading flow:**
+   ```
+   detectArchitecture(ggufMetadata) -> ModelFamily
+   ModelFamily.createNetwork(config) -> Network<T>
+   WeightLoader.loadAndMap(file, network) -> weights
+   OptimizedLLMRuntime(network, weights, mode=HYBRID) -> InferenceRuntime
+   ```
+
+3. **Remove deprecated hand-coded runtimes** once DSL equivalents are validated:
+   - `LlamaRuntime` -> `llamaNetwork()` + `OptimizedLLMRuntime`
+   - `ApertusRuntime` -> `apertusNetwork()` + `OptimizedLLMRuntime`
+
+**Critical files:**
+- `llm-core/.../OptimizedLLMRuntime.kt` — already exists, extend
+- `llm-core/.../dsl/TransformerDsl.kt` — already has embedding, MHA, SwiGLU, RMSNorm
+- `llm-core/.../weights/LLMWeightNameResolvers.kt` — already maps DSL paths -> GGUF names
+- New: per-model DSL network definitions
+
+## Phase 3: Tokenization as Pipeline Stage
+
+**Problem:** Tokenization is split between `GGUFTokenizer` (kllama module), `QwenByteLevelBPETokenizer` (llm-core), and model-specific code. The byte-level BPE fix we just made shows the fragility.
+
+**Changes:**
+
+1. **Enhance `Tokenizer` interface** (`llm-core`):
+   ```kotlin
+   interface Tokenizer {
+       fun encode(text: String): IntArray
+       fun decode(token: Int): String
+       fun decode(tokens: IntArray): String
+       val eosTokenId: Int
+       val bosTokenId: Int
+       val vocabSize: Int
+       val specialTokens: Set<String>
+   }
+   ```
+
+2. **Unified tokenizer factory:**
+   - `TokenizerFactory.fromGGUF(source)` — auto-detects BPE/SentencePiece/WordPiece
+   - `TokenizerFactory.fromTokenizerJson(json)` — HuggingFace format
+   - Returns the correct implementation (byte-level BPE for GPT-2/Qwen, SentencePiece for LLaMA, etc.)
+
+3. **Move `GGUFTokenizer` to `llm-core`** so all runners can use it without depending on kllama
+
+## Phase 4: Unified Runner (single CLI entry point)
+
+**Problem:** 6 separate CLI apps with duplicated argument parsing, model loading, and dispatch logic.
+
+**Changes:**
+
+1. **Single `skainet` CLI** that auto-detects model architecture from GGUF metadata:
+   ```bash
+   skainet -m model.gguf "prompt"                    # auto-detect, generate
+   skainet -m model.gguf --chat                      # auto-detect, chat mode
+   skainet -m model.gguf --demo "What is 2+2?"       # auto-detect, tool calling
+   ```
+
+2. **Architecture registry:**
+   ```kotlin
+   ModelRegistry.register("llama", ::llamaNetwork)
+   ModelRegistry.register("qwen3", ::qwenNetwork)
+   ModelRegistry.register("gemma", ::gemmaNetwork)
+   ```
+
+3. **Auto-detection from GGUF metadata** (already exists in `peekGgufMetadata()`)
+
+## Verification
+
+- All existing unit tests pass (`llm-agent`, `llm-runtime:kllama`, `llm-core`)
+- Smoke test suite passes (generation + tool calling)
+- Basic generation produces identical output for all model families
+- Tool calling works for any model that supports ChatML/Qwen/Llama3 templates
+- `OptimizedLLMRuntime` in HYBRID mode matches hand-coded runtime output
+
+## Suggested Implementation Order
+
+1. **Phase 1** first — immediately unblocks tool calling for all models
+2. **Phase 3** next — reduces fragility (the GGUFTokenizer byte-level BPE issue)
+3. **Phase 2** then — biggest refactor, needs per-model validation
+4. **Phase 4** last — depends on all other phases
diff --git a/llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ChatSession.kt b/llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ChatSession.kt
new file mode 100644
index 0000000..d91ec3f
--- /dev/null
+++ b/llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ChatSession.kt
@@ -0,0 +1,113 @@
+package sk.ainet.apps.kllama.chat
+
+import sk.ainet.apps.llm.InferenceRuntime
+import sk.ainet.apps.llm.Tokenizer
+import sk.ainet.lang.types.DType
+
+/**
+ * Bundles an [InferenceRuntime] with a [Tokenizer] and [ModelMetadata] to provide
+ * chat, agent, and tool-calling capabilities for any model.
+ *
+ * This decouples tool calling from any specific runner — any runner that can
+ * produce an [InferenceRuntime] and a [Tokenizer] can create a [ChatSession]
+ * and get chat/agent/demo modes for free.
+ *
+ * Usage:
+ * ```kotlin
+ * val session = ChatSession(runtime, tokenizer, metadata)
+ * session.chat(maxTokens = 512, temperature = 0.7f)       // interactive chat
+ * session.agent(maxTokens = 512, temperature = 0.7f)      // interactive agent with tools
+ * session.demo(maxTokens = 256, temperature = 0.7f)       // interactive tool calling demo
+ * session.demoSingleShot("What is 2+2?", maxTokens = 256) // non-interactive single prompt
+ * ```
+ */
+public class ChatSession<T : DType>(
+    public val runtime: InferenceRuntime<T>,
+    public val tokenizer: Tokenizer,
+    public val metadata: ModelMetadata = ModelMetadata(),
+    templateName: String? = null
+) {
+    private val provider: ToolCallingSupport = ToolCallingSupportResolver.resolveOrFallback(metadata, templateName)
+    private val template: ChatTemplate = provider.createChatTemplate()
+
+    /**
+     * Run a single agent round with the given prompt and tools.
+     * Returns the final response text. Non-interactive — suitable for smoke tests.
+     *
+     * @param prompt The user prompt.
+     * @param tools Tools to register. If empty, uses default calculator + list_files.
+     * @param maxTokens Maximum tokens per generation round.
+     * @param temperature Sampling temperature.
+     * @param listener Optional listener for observing the agent loop.
+     * @return The final assistant response text.
+     */
+    public fun runSingleTurn(
+        prompt: String,
+        tools: List<Tool> = emptyList(),
+        maxTokens: Int = 256,
+        temperature: Float = 0.7f,
+        listener: AgentListener? = null
+    ): String {
+        val registry = ToolRegistry()
+        tools.forEach { registry.register(it) }
+
+        val agentLoop = AgentLoop(
+            runtime = runtime,
+            template = template,
+            toolRegistry = registry,
+            eosTokenId = tokenizer.eosTokenId,
+            config = AgentConfig(
+                maxToolRounds = 5,
+                maxTokensPerRound = maxTokens,
+                temperature = temperature
+            ),
+            decode = { tokenId -> tokenizer.decode(tokenId) }
+        )
+
+        val systemPrompt = "You are a helpful assistant with access to tools."
+        val messages = mutableListOf(
+            ChatMessage(role = ChatRole.SYSTEM, content = systemPrompt),
+            ChatMessage(role = ChatRole.USER, content = prompt)
+        )
+
+        return agentLoop.runWithEncoder(
+            messages = messages,
+            encode = { text -> tokenizer.encode(text) },
+            listener = listener
+        )
+    }
+
+    /**
+     * Create an [AgentLoop] configured for this session.
+     */
+    public fun createAgentLoop(
+        toolRegistry: ToolRegistry,
+        maxTokens: Int = 512,
+        temperature: Float = 0.7f
+    ): AgentLoop<T> {
+        return AgentLoop(
+            runtime = runtime,
+            template = template,
+            toolRegistry = toolRegistry,
+            eosTokenId = tokenizer.eosTokenId,
+            config = AgentConfig(
+                maxToolRounds = 5,
+                maxTokensPerRound = maxTokens,
+                temperature = temperature
+            ),
+            decode = { tokenId -> tokenizer.decode(tokenId) }
+        )
+    }
+
+    /** The resolved chat template for this session. */
+    public val chatTemplate: ChatTemplate get() = template
+
+    /** The resolved tool calling provider family name. */
+    public val providerFamily: String get() = provider.family
+
+    /** Encode text to token IDs using this session's tokenizer. */
+    public fun encode(text: String): IntArray = tokenizer.encode(text)
+
+    /** Decode a token ID to text using this session's tokenizer. */
+    public fun decode(tokenId: Int): String = tokenizer.decode(tokenId)
+}
diff --git a/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/Tokenizer.kt b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/Tokenizer.kt
index e71b30c..82f7505 100644
--- a/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/Tokenizer.kt
+++ b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/Tokenizer.kt
@@ -4,4 +4,13 @@ interface Tokenizer {
     fun encode(text: String): IntArray
     fun decode(tokens: IntArray): String
     fun decode(token: Int): String
+
+    /** End-of-sequence token ID. */
+    val eosTokenId: Int
+
+    /** Beginning-of-sequence token ID. */
+    val bosTokenId: Int
+
+    /** Total vocabulary size. */
+    val vocabSize: Int
 }
diff --git a/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/HuggingFaceBPETokenizer.kt b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/HuggingFaceBPETokenizer.kt
index b9b20a8..d449e83 100644
--- a/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/HuggingFaceBPETokenizer.kt
+++ b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/HuggingFaceBPETokenizer.kt
@@ -15,14 +15,14 @@ public class HuggingFaceBPETokenizer internal constructor(
     private val vocab: List<String>,
     private val tokenToId: Map<String, Int>,
     private val scores: FloatArray,
-    private val bosTokenId: Int,
-    private val eosTokenId: Int,
+    override val bosTokenId: Int,
+    override val eosTokenId: Int,
     private val unkTokenId: Int,
     private val addBosToken: Boolean,
     private val addEosToken: Boolean
 ) : Tokenizer {
 
-    public val vocabSize: Int get() = vocab.size
+    override val vocabSize: Int get() = vocab.size
 
     override fun encode(text: String): IntArray {
         if (text.isEmpty()) return intArrayOf()
diff --git a/llm-inference/bert/src/commonMain/kotlin/sk/ainet/models/bert/HuggingFaceTokenizer.kt b/llm-inference/bert/src/commonMain/kotlin/sk/ainet/models/bert/HuggingFaceTokenizer.kt
index e1cecf6..e3819a1 100644
--- a/llm-inference/bert/src/commonMain/kotlin/sk/ainet/models/bert/HuggingFaceTokenizer.kt
+++ b/llm-inference/bert/src/commonMain/kotlin/sk/ainet/models/bert/HuggingFaceTokenizer.kt
@@ -40,6 +40,9 @@ public class HuggingFaceTokenizer private constructor(
     private val maxLength: Int = 512
 ) : Tokenizer {
 
+    override val eosTokenId: Int get() = tokenToId[SEP_TOKEN] ?: 102
+    override val bosTokenId: Int get() = tokenToId[CLS_TOKEN] ?: 101
+
     public companion object {
         private const val CLS_TOKEN = "[CLS]"
         private const val SEP_TOKEN = "[SEP]"
@@ -112,7 +115,7 @@ public class HuggingFaceTokenizer private constructor(
     private val sepId: Int = tokenToId[SEP_TOKEN] ?: error("Vocab missing $SEP_TOKEN")
     private val unkId: Int = tokenToId[UNK_TOKEN] ?: error("Vocab missing $UNK_TOKEN")
 
-    public val vocabSize: Int get() = tokenToId.size
+    override val vocabSize: Int get() = tokenToId.size
 
     /**
      * Encode text into token IDs with [CLS] and [SEP] tokens.
diff --git a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/TekkenTokenizerAdapter.kt b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/TekkenTokenizerAdapter.kt
index 3425bd7..c067c79 100644
--- a/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/TekkenTokenizerAdapter.kt
+++ b/llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/TekkenTokenizerAdapter.kt
@@ -25,6 +25,10 @@ public class TekkenTokenizerAdapter(
 
     override fun decode(token: Int): String = tekken.decode(token)
 
+    override val eosTokenId: Int = 2
+    override val bosTokenId: Int = 1
+    override val vocabSize: Int get() = 32768 // Tekken default
+
     public companion object {
         /**
          * Parse a tekken.json string and return a [Tokenizer] instance.
diff --git a/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/GGUFTokenizer.kt b/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/GGUFTokenizer.kt
index de8a0a2..b6d13b5 100644
--- a/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/GGUFTokenizer.kt
+++ b/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/GGUFTokenizer.kt
@@ -24,8 +24,8 @@ import sk.ainet.io.gguf.StreamingGGUFReader
 class GGUFTokenizer private constructor(
     private val vocab: List<String>,
     private val scores: FloatArray,
-    private val bosTokenId: Int,
-    private val eosTokenId: Int,
+    private val _bosTokenId: Int,
+    private val _eosTokenId: Int,
     private val unkTokenId: Int,
     private val strategy: TokenizerStrategy
 ) : Tokenizer {
@@ -516,16 +516,18 @@ class GGUFTokenizer private constructor(
         }
     }
 
-    val vocabSize: Int get() = vocab.size
-
     /** The detected tokenizer type/strategy in use */
     val tokenizerType: TokenizerType get() = strategy.type
 
-    /** The BOS (beginning of sentence) token ID */
-    val bosId: Int get() = bosTokenId
+    override val bosTokenId: Int get() = _bosTokenId
+    override val eosTokenId: Int get() = _eosTokenId
+    override val vocabSize: Int get() = vocab.size
+
+    @Deprecated("Use eosTokenId", replaceWith = ReplaceWith("eosTokenId"))
+    val eosId: Int get() = _eosTokenId
 
-    /** The EOS (end of sentence) token ID */
-    val eosId: Int get() = eosTokenId
+    @Deprecated("Use bosTokenId", replaceWith = ReplaceWith("bosTokenId"))
+    val bosId: Int get() = _bosTokenId
 
     // Build reverse lookup for encoding
     private val tokenToId: Map<String, Int> by lazy {
diff --git a/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/TokenizerImpl.kt b/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/TokenizerImpl.kt
index 2e7f6f4..a295f08 100644
--- a/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/TokenizerImpl.kt
+++ b/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/TokenizerImpl.kt
@@ -5,9 +5,11 @@ import sk.ainet.apps.llm.Tokenizer
 class TokenizerImpl(
     private val vocab: Array<String?>,
     private val vocabScores: FloatArray,
+    override val bosTokenId: Int = 1,
+    override val eosTokenId: Int = 2,
 ) : Tokenizer {
 
-    private val vocabSize = vocab.size
+    override val vocabSize: Int = vocab.size
     // ----------------------------------------------------------------------------
     // byte pair encoding (BPE) tokenizer, encodes strings into tokens so we can prompt
     private fun strLookup(str: String, vocabSize: Int): Int {
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/chat/java/JavaAgentLoop.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/chat/java/JavaAgentLoop.kt
index df4d495..d485e82 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/chat/java/JavaAgentLoop.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/chat/java/JavaAgentLoop.kt
@@ -51,9 +51,7 @@ public class JavaAgentLoop private constructor(
             runtime = session.runtime,
             template = template,
             toolRegistry = toolRegistry,
-            eosTokenId = session.tokenizer.let {
-                if (it is sk.ainet.apps.kllama.GGUFTokenizer) it.eosId else 2
-            },
+            eosTokenId = session.tokenizer.eosTokenId,
             config = config,
             decode = { session.tokenizer.decode(it) }
         )
@@ -81,9 +79,7 @@ public class JavaAgentLoop private constructor(
             runtime = session.runtime,
             template = template,
             toolRegistry = toolRegistry,
-            eosTokenId = session.tokenizer.let {
-                if (it is sk.ainet.apps.kllama.GGUFTokenizer) it.eosId else 2
-            },
+            eosTokenId = session.tokenizer.eosTokenId,
             config = config,
             decode = { session.tokenizer.decode(it) }
         )
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt
index 8bb9c1e..ebe4289 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/AgentMain.kt
@@ -1,7 +1,7 @@
 package sk.ainet.apps.kllama.cli
 
-import sk.ainet.apps.kllama.GGUFTokenizer
 import sk.ainet.apps.llm.InferenceRuntime
+import sk.ainet.apps.llm.Tokenizer
 import sk.ainet.apps.kllama.chat.*
 import sk.ainet.apps.kllama.agent.generateUntilStop
 import sk.ainet.lang.types.DType
@@ -22,14 +22,12 @@ import kotlinx.serialization.json.jsonPrimitive
  */
 public class AgentCli<T : DType>(
     private val runtime: InferenceRuntime<T>,
-    private val tokenizer: GGUFTokenizer,
+    private val tokenizer: Tokenizer,
     private val templateName: String? = null,
     private val metadata: ModelMetadata = ModelMetadata()
 ) {
-    private val provider: ToolCallingSupport = ToolCallingSupportResolver.resolveOrFallback(metadata, templateName)
-    private val template: ChatTemplate = provider.createChatTemplate()
-
-    private val eosTokenId: Int = tokenizer.eosId
+    private val session = ChatSession(runtime, tokenizer, metadata, templateName)
+    private val template: ChatTemplate = session.chatTemplate
 
     /**
      * Run interactive chat mode (no tool calling).
@@ -68,7 +66,7 @@ public class AgentCli<T : DType>(
             val result = runtime.generateUntilStop(
                 prompt = promptTokens,
                 maxTokens = maxTokens,
-                eosTokenId = eosTokenId,
+                eosTokenId = tokenizer.eosTokenId,
                 temperature = temperature,
                 onToken = { tokenId ->
                     print(tokenizer.decode(tokenId))
@@ -95,18 +93,7 @@ public class AgentCli<T : DType>(
         val registry = ToolRegistry()
         registry.register(CalculatorTool())
 
-        val agentLoop = AgentLoop(
-            runtime = runtime,
-            template = template,
-            toolRegistry = registry,
-            eosTokenId = eosTokenId,
-            config = AgentConfig(
-                maxToolRounds = 5,
-                maxTokensPerRound = maxTokens,
-                temperature = temperature
-            ),
-            decode = { tokenId -> tokenizer.decode(tokenId) }
-        )
+        val agentLoop = session.createAgentLoop(registry, maxTokens, temperature)
 
         val messages = mutableListOf(
             ChatMessage(role = ChatRole.SYSTEM, content = systemPrompt)
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
index 70e45fb..a5aa828 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
@@ -490,22 +490,20 @@ fun main(args: Array<String>) {
             }
         }
 
-        // Dispatch to chat/agent/demo mode
+        // Dispatch to chat/agent/demo mode — works with any Tokenizer, not just GGUFTokenizer
         if (cliArgs.chatMode || cliArgs.agentMode || cliArgs.demoMode) {
-            val ggufTokenizer = tokenizer as? GGUFTokenizer
-                ?: error("Chat/agent/demo modes require a GGUF model with embedded tokenizer")
             val metadata = ggufMetadata ?: ModelMetadata()
             when {
                 cliArgs.demoMode -> {
-                    val demo = ToolCallingDemo(runtime, ggufTokenizer, cliArgs.templateName, metadata)
+                    val demo = ToolCallingDemo(runtime, tokenizer, cliArgs.templateName, metadata)
                     demo.run(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
                 }
                 cliArgs.agentMode -> {
-                    val agentCli = AgentCli(runtime, ggufTokenizer, cliArgs.templateName, metadata)
+                    val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata)
                     agentCli.runAgent(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
                 }
                 else -> {
-                    val agentCli = AgentCli(runtime, ggufTokenizer, cliArgs.templateName, metadata)
+                    val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata)
                     agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
                 }
             }
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/ToolCallingDemo.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/ToolCallingDemo.kt
index b04178e..bd6d381 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/ToolCallingDemo.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/ToolCallingDemo.kt
@@ -1,7 +1,7 @@
 package sk.ainet.apps.kllama.cli
 
-import sk.ainet.apps.kllama.GGUFTokenizer
 import sk.ainet.apps.llm.InferenceRuntime
+import sk.ainet.apps.llm.Tokenizer
 import sk.ainet.apps.kllama.chat.*
 import sk.ainet.lang.types.DType
 import kotlinx.serialization.json.JsonObject
@@ -29,24 +29,20 @@ import java.io.File
  */
 public class ToolCallingDemo<T : DType>(
     private val runtime: InferenceRuntime<T>,
-    private val tokenizer: GGUFTokenizer,
+    private val tokenizer: Tokenizer,
     private val templateName: String? = null,
     private val metadata: ModelMetadata = ModelMetadata()
 ) {
-    private val provider: ToolCallingSupport = resolveProvider()
-    private val template: ChatTemplate = provider.createChatTemplate()
+    private val session = ChatSession(runtime, tokenizer, metadata, templateName)
 
-    private fun resolveProvider(): ToolCallingSupport {
+    init {
         val result = ToolCallingSupportResolver.resolveWithDiagnostics(
             metadata = metadata,
             explicitFamily = templateName
         )
         println("[ToolCallingDemo] Provider: ${result.provider.family} (mode=${result.mode}, reason: ${result.reason})")
-        return result.provider
     }
 
-    private val eosTokenId: Int = tokenizer.eosId
-
     /**
      * Run the tool-calling demo with `list_files` and `calculator` tools.
      */
@@ -63,18 +59,7 @@ When the user asks about files or directories, use the list_files tool to look u
 When the user asks to calculate something, use the calculator tool.
 Always use a tool when one is relevant — do not guess file listings."""
 
-        val agentLoop = AgentLoop(
-            runtime = runtime,
-            template = template,
-            toolRegistry = registry,
-            eosTokenId = eosTokenId,
-            config = AgentConfig(
-                maxToolRounds = 5,
-                maxTokensPerRound = maxTokens,
-                temperature = temperature
-            ),
-            decode = { tokenId -> tokenizer.decode(tokenId) }
-        )
+        val agentLoop = session.createAgentLoop(registry, maxTokens, temperature)
 
         val messages = mutableListOf(
             ChatMessage(role = ChatRole.SYSTEM, content = systemPrompt)

From 0d50d0e9d2625c48338584f0e54ea83d24cb7030 Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 11 Apr 2026 10:32:46 +0200
Subject: [PATCH 2/9] refactor: move GGUFTokenizer to llm-core, add
 TokenizerFactory

Phase 3 of the unified pipeline plan. Tokenization is now a standalone
pipeline stage in llm-core, independent of any specific runner.

- Move GGUFTokenizer from kllama to llm-core/tokenizer package
- Add typealias in kllama for backwards compatibility
- Create TokenizerFactory with fromGGUF(), fromTokenizerJson(), fromHuggingFace()
- Add skainet-io-gguf and kotlinx-io-core dependencies to llm-core

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 PLAN-unified-pipeline.md                      |  32 +-
 llm-core/build.gradle.kts                     |   2 +
 .../ainet/apps/llm/tokenizer/GGUFTokenizer.kt | 749 +++++++++++++++++
 .../apps/llm/tokenizer/TokenizerFactory.kt    |  51 ++
 .../sk/ainet/apps/kllama/GGUFTokenizer.kt     | 753 +-----------------
 5 files changed, 818 insertions(+), 769 deletions(-)
 create mode 100644 llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/GGUFTokenizer.kt
 create mode 100644 llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/TokenizerFactory.kt

diff --git a/PLAN-unified-pipeline.md b/PLAN-unified-pipeline.md
index a51a783..7f5287d 100644
--- a/PLAN-unified-pipeline.md
+++ b/PLAN-unified-pipeline.md
@@ -83,31 +83,23 @@ ChatPipeline (template formatting, tool calling, agent loop)
 - `llm-core/.../weights/LLMWeightNameResolvers.kt` — already maps DSL paths -> GGUF names
 - New: per-model DSL network definitions
 
-## Phase 3: Tokenization as Pipeline Stage
+## Phase 3: Tokenization as Pipeline Stage -- DONE
 
-**Problem:** Tokenization is split between `GGUFTokenizer` (kllama module), `QwenByteLevelBPETokenizer` (llm-core), and model-specific code. The byte-level BPE fix we just made shows the fragility.
+**What was done:**
 
-**Changes:**
+1. **Enhanced `Tokenizer` interface** with `eosTokenId`, `bosTokenId`, `vocabSize` (done in Phase 1)
 
-1. **Enhance `Tokenizer` interface** (`llm-core`):
-   ```kotlin
-   interface Tokenizer {
-       fun encode(text: String): IntArray
-       fun decode(token: Int): String
-       fun decode(tokens: IntArray): String
-       val eosTokenId: Int
-       val bosTokenId: Int
-       val vocabSize: Int
-       val specialTokens: Set<String>
-   }
-   ```
+2. **Moved `GGUFTokenizer` from kllama to `llm-core`**
+   - New location: `llm-core/.../tokenizer/GGUFTokenizer.kt`
+   - Old location has a typealias for backwards compatibility
+   - Added `skainet-io-gguf` and `kotlinx-io-core` dependencies to `llm-core`
 
-2. **Unified tokenizer factory:**
-   - `TokenizerFactory.fromGGUF(source)` — auto-detects BPE/SentencePiece/WordPiece
-   - `TokenizerFactory.fromTokenizerJson(json)` — HuggingFace format
-   - Returns the correct implementation (byte-level BPE for GPT-2/Qwen, SentencePiece for LLaMA, etc.)
+3. **Created `TokenizerFactory`** in `llm-core/.../tokenizer/TokenizerFactory.kt`
+   - `TokenizerFactory.fromGGUF(source)` — from GGUF file metadata
+   - `TokenizerFactory.fromTokenizerJson(json)` — from HuggingFace tokenizer.json
+   - `TokenizerFactory.fromHuggingFace(json, config)` — full HF BPE tokenizer
 
-3. **Move `GGUFTokenizer` to `llm-core`** so all runners can use it without depending on kllama
+4. All runners can now use `GGUFTokenizer` and `TokenizerFactory` directly from `llm-core`
 
 ## Phase 4: Unified Runner (single CLI entry point)
 
diff --git a/llm-core/build.gradle.kts b/llm-core/build.gradle.kts
index a5175c7..759fa6a 100644
--- a/llm-core/build.gradle.kts
+++ b/llm-core/build.gradle.kts
@@ -48,6 +48,8 @@ kotlin {
             implementation(libs.skainet.compile.dag)
             implementation(libs.skainet.compile.opt)
             implementation(libs.skainet.io.core)
+            implementation(libs.skainet.io.gguf)
+            implementation(libs.kotlinx.io.core)
         }
 
         commonTest.dependencies {
diff --git a/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/GGUFTokenizer.kt b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/GGUFTokenizer.kt
new file mode 100644
index 0000000..33da31a
--- /dev/null
+++ b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/GGUFTokenizer.kt
@@ -0,0 +1,749 @@
+package sk.ainet.apps.llm.tokenizer
+
+import kotlinx.io.Source
+import kotlinx.io.buffered
+import sk.ainet.apps.llm.Tokenizer
+import sk.ainet.apps.llm.TokenizerStrategy
+import sk.ainet.apps.llm.TokenizerType
+import sk.ainet.io.RandomAccessSource
+import sk.ainet.io.gguf.GGUFReader
+import sk.ainet.io.gguf.ReaderField
+import sk.ainet.io.gguf.StreamingGGUFReader
+
+/**
+ * Tokenizer that extracts vocabulary from GGUF file metadata.
+ * Supports decoding (token ID -> string) and basic BPE encoding (string -> token IDs).
+ *
+ * Automatically detects tokenizer type (SentencePiece, BPE, WordPiece) from GGUF
+ * metadata and uses the appropriate preprocessing strategy.
+ */
+class GGUFTokenizer private constructor(
+    private val vocab: List<String>,
+    private val scores: FloatArray,
+    private val _bosTokenId: Int,
+    private val _eosTokenId: Int,
+    private val unkTokenId: Int,
+    private val strategy: TokenizerStrategy
+) : Tokenizer {
+
+    companion object {
+        private const val DEFAULT_BOS_TOKEN_ID = 1
+        private const val DEFAULT_EOS_TOKEN_ID = 2
+        private const val DEFAULT_UNK_TOKEN_ID = 0
+
+        /**
+         * Create a tokenizer from a HuggingFace tokenizer.json string.
+         *
+         * Parses the "model.vocab" and "model.merges" sections to build
+         * a BPE tokenizer compatible with LLaMA 3 models.
+         * Also reads "added_tokens" for BOS/EOS token IDs.
+         */
+        fun fromTokenizerJson(json: String, debug: Boolean = false): GGUFTokenizer {
+            // --- Parse vocab: {"token": id, ...} → List<String> indexed by id ---
+            val vocabStart = json.indexOf("\"vocab\"")
+            if (vocabStart < 0) error("tokenizer.json: no \"vocab\" section found")
+            val vocabBraceStart = json.indexOf('{', vocabStart + 7)
+            if (vocabBraceStart < 0) error("tokenizer.json: malformed vocab section")
+            val vocabBraceEnd = findMatchingBrace(json, vocabBraceStart)
+
+            val vocabMap = mutableMapOf<String, Int>()
+            val vocabContent = json.substring(vocabBraceStart + 1, vocabBraceEnd)
+            val vocabPattern = Regex("\"((?:[^\"\\\\]|\\\\.)*)\"\\s*:\\s*(\\d+)")
+            for (match in vocabPattern.findAll(vocabContent)) {
+                val token = unescapeJsonString(match.groupValues[1])
+                val id = match.groupValues[2].toInt()
+                vocabMap[token] = id
+            }
+
+            if (debug) println("DEBUG: Parsed ${vocabMap.size} vocab entries")
+
+            // Build vocab list indexed by id
+            val maxId = vocabMap.values.maxOrNull() ?: 0
+
+            // --- Parse added_tokens for special tokens and to extend vocab ---
+            var bosTokenId = DEFAULT_BOS_TOKEN_ID
+            var eosTokenId = DEFAULT_EOS_TOKEN_ID
+            var unkTokenId = DEFAULT_UNK_TOKEN_ID
+
+            val addedTokensStart = json.indexOf("\"added_tokens\"")
+            if (addedTokensStart >= 0) {
+                val arrStart = json.indexOf('[', addedTokensStart)
+                if (arrStart >= 0) {
+                    val arrEnd = findMatchingBracket(json, arrStart)
+                    val addedContent = json.substring(arrStart, arrEnd + 1)
+                    val idPattern = Regex("\"id\"\\s*:\\s*(\\d+)")
+                    val contentPattern = Regex("\"content\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"")
+                    // Parse each added token object
+                    val objPattern = Regex("\\{[^}]+\\}")
+                    for (objMatch in objPattern.findAll(addedContent)) {
+                        val obj = objMatch.value
+                        val idMatch = idPattern.find(obj)
+                        val contentMatch = contentPattern.find(obj)
+                        if (idMatch != null && contentMatch != null) {
+                            val id = idMatch.groupValues[1].toInt()
+                            val content = unescapeJsonString(contentMatch.groupValues[1])
+                            vocabMap[content] = id
+                            when {
+                                content.contains("begin_of_text") || content == "<s>" -> bosTokenId = id
+                                content.contains("end_of_text") || content == "</s>" -> eosTokenId = id
+                                content == "<unk>" -> unkTokenId = id
+                            }
+                        }
+                    }
+                }
+            }
+
+            val totalVocabSize = maxOf(maxId + 1, (vocabMap.values.maxOrNull() ?: 0) + 1)
+            val vocab = MutableList(totalVocabSize) { "<unk>" }
+            for ((token, id) in vocabMap) {
+                if (id < vocab.size) vocab[id] = token
+            }
+
+            if (debug) println("DEBUG: Total vocab size = ${vocab.size}, BOS=$bosTokenId, EOS=$eosTokenId")
+
+            // --- Parse merges: ["tok1 tok2", ...] → scores (earlier merge = higher score) ---
+            val scores = FloatArray(vocab.size) { 0f }
+            val mergesStart = json.indexOf("\"merges\"")
+            if (mergesStart >= 0) {
+                val mergesArrStart = json.indexOf('[', mergesStart)
+                if (mergesArrStart >= 0) {
+                    val mergesArrEnd = findMatchingBracket(json, mergesArrStart)
+                    val mergesContent = json.substring(mergesArrStart + 1, mergesArrEnd)
+                    val mergePattern = Regex("\"((?:[^\"\\\\]|\\\\.)*)\"")
+                    var mergeRank = 0
+                    for (match in mergePattern.findAll(mergesContent)) {
+                        val merge = unescapeJsonString(match.groupValues[1])
+                        val parts = merge.split(' ', limit = 2)
+                        if (parts.size == 2) {
+                            val merged = parts[0] + parts[1]
+                            val mergedId = vocabMap[merged]
+                            if (mergedId != null && mergedId < scores.size) {
+                                // Higher score = higher merge priority (earlier in list)
+                                scores[mergedId] = (1_000_000 - mergeRank).toFloat()
+                            }
+                            mergeRank++
+                        }
+                    }
+                    if (debug) println("DEBUG: Parsed $mergeRank merges")
+                }
+            }
+
+            val strategy = BPEStrategy
+            println("Tokenizer: BPE (from tokenizer.json, vocab=${vocab.size})")
+
+            return GGUFTokenizer(vocab, scores, bosTokenId, eosTokenId, unkTokenId, strategy)
+        }
+
+        private fun findMatchingBrace(s: String, start: Int): Int {
+            var depth = 0
+            var inStr = false
+            var i = start
+            while (i < s.length) {
+                val c = s[i]
+                when {
+                    inStr -> { if (c == '"') inStr = false; if (c == '\\') i++ }
+                    c == '"' -> inStr = true
+                    c == '{' -> depth++
+                    c == '}' -> { depth--; if (depth == 0) return i }
+                }
+                i++
+            }
+            return s.length - 1
+        }
+
+        private fun findMatchingBracket(s: String, start: Int): Int {
+            var depth = 0
+            var inStr = false
+            var i = start
+            while (i < s.length) {
+                val c = s[i]
+                when {
+                    inStr -> { if (c == '"') inStr = false; if (c == '\\') i++ }
+                    c == '"' -> inStr = true
+                    c == '[' -> depth++
+                    c == ']' -> { depth--; if (depth == 0) return i }
+                }
+                i++
+            }
+            return s.length - 1
+        }
+
+        private fun unescapeJsonString(s: String): String {
+            val sb = StringBuilder()
+            var i = 0
+            while (i < s.length) {
+                if (s[i] == '\\' && i + 1 < s.length) {
+                    when (s[i + 1]) {
+                        '"' -> { sb.append('"'); i += 2 }
+                        '\\' -> { sb.append('\\'); i += 2 }
+                        '/' -> { sb.append('/'); i += 2 }
+                        'n' -> { sb.append('\n'); i += 2 }
+                        'r' -> { sb.append('\r'); i += 2 }
+                        't' -> { sb.append('\t'); i += 2 }
+                        'u' -> {
+                            if (i + 5 < s.length) {
+                                val cp = s.substring(i + 2, i + 6).toInt(16)
+                                sb.append(cp.toChar())
+                                i += 6
+                            } else { sb.append(s[i]); i++ }
+                        }
+                        else -> { sb.append(s[i]); i++ }
+                    }
+                } else { sb.append(s[i]); i++ }
+            }
+            return sb.toString()
+        }
+
+        /**
+         * Create a tokenizer by reading GGUF metadata from a source.
+         * Only reads metadata (not tensor data) for efficiency.
+         */
+        fun fromSource(source: Source, debug: Boolean = false): GGUFTokenizer {
+            val reader = source.buffered().use { src ->
+                GGUFReader(src, loadTensorData = false)
+            }
+            return fromGGUF(reader, debug)
+        }
+
+        /**
+         * Create a tokenizer from GGUF reader fields.
+         */
+        fun fromGGUF(reader: GGUFReader, debug: Boolean = false): GGUFTokenizer {
+            val fields = reader.fields
+
+            // Extract vocabulary tokens
+            val tokensField = fields["tokenizer.ggml.tokens"]
+                ?: error("GGUF file missing tokenizer.ggml.tokens field")
+            val vocab = extractStringArray(tokensField)
+
+            if (debug) {
+                println("DEBUG: Vocab size = ${vocab.size}")
+                println("DEBUG: First 10 tokens:")
+                vocab.take(10).forEachIndexed { idx, token ->
+                    val bytes = token.encodeToByteArray()
+                    val hexStr = bytes.joinToString(" ") { b ->
+                        val hex = (b.toInt() and 0xFF).toString(16).uppercase()
+                        if (hex.length == 1) "0$hex" else hex
+                    }
+                    println("  [$idx] = '$token' (bytes: $hexStr)")
+                }
+                println("DEBUG: Tokens around index 1000:")
+                vocab.drop(1000).take(5).forEachIndexed { idx, token ->
+                    println("  [${1000 + idx}] = '$token'")
+                }
+            }
+
+            // Extract BPE scores (used for merge priority during encoding)
+            val scoresField = fields["tokenizer.ggml.scores"]
+            val scores = if (scoresField != null) {
+                extractFloatArray(scoresField)
+            } else {
+                // Default scores if not present
+                FloatArray(vocab.size) { 0f }
+            }
+
+            // Extract special token IDs
+            val bosTokenId = fields["tokenizer.ggml.bos_token_id"]?.scalarInt() ?: DEFAULT_BOS_TOKEN_ID
+            val eosTokenId = fields["tokenizer.ggml.eos_token_id"]?.scalarInt() ?: DEFAULT_EOS_TOKEN_ID
+            val unkTokenId = fields["tokenizer.ggml.unknown_token_id"]?.scalarInt() ?: DEFAULT_UNK_TOKEN_ID
+
+            // Detect tokenizer type from metadata
+            val modelType = fields["tokenizer.ggml.model"]?.scalarString()
+            val strategy = detectStrategy(modelType, vocab, debug)
+
+            // Always log the tokenizer strategy
+            println("Tokenizer: ${strategy.type} (model=${modelType ?: "auto-detected"})")
+
+            if (debug) {
+                println("DEBUG: BOS=$bosTokenId, EOS=$eosTokenId, UNK=$unkTokenId")
+                println("DEBUG: Tokenizer model type from metadata: ${modelType ?: "(not specified)"}")
+                println("DEBUG: Using tokenizer strategy: ${strategy.type}")
+            }
+
+            return GGUFTokenizer(vocab, scores, bosTokenId, eosTokenId, unkTokenId, strategy)
+        }
+
+        /**
+         * Create a tokenizer using streaming API.
+         * Parses metadata only (~1MB memory), suitable for large models.
+         * The source is closed after reading metadata.
+         */
+        fun fromRandomAccessSource(source: RandomAccessSource, debug: Boolean = false): GGUFTokenizer {
+            return StreamingGGUFReader.open(source).use { reader ->
+                fromStreamingFields(reader.fields, debug)
+            }
+        }
+
+        /**
+         * Create a tokenizer from StreamingGGUFReader fields.
+         * StreamingGGUFReader.fields returns direct values (Map<String, Any?>),
+         * not ReaderField objects.
+         */
+        private fun fromStreamingFields(fields: Map<String, Any?>, debug: Boolean = false): GGUFTokenizer {
+            // Extract vocabulary tokens (stored as List<String> in streaming reader)
+            val tokensValue = fields["tokenizer.ggml.tokens"]
+                ?: error("GGUF file missing tokenizer.ggml.tokens field")
+            val vocab = extractStringList(tokensValue)
+
+            if (debug) {
+                println("DEBUG: Vocab size = ${vocab.size}")
+                println("DEBUG: First 10 tokens:")
+                vocab.take(10).forEachIndexed { idx, token ->
+                    val bytes = token.encodeToByteArray()
+                    val hexStr = bytes.joinToString(" ") { b ->
+                        val hex = (b.toInt() and 0xFF).toString(16).uppercase()
+                        if (hex.length == 1) "0$hex" else hex
+                    }
+                    println("  [$idx] = '$token' (bytes: $hexStr)")
+                }
+                println("DEBUG: Tokens around index 1000:")
+                vocab.drop(1000).take(5).forEachIndexed { idx, token ->
+                    println("  [${1000 + idx}] = '$token'")
+                }
+            }
+
+            // Extract BPE scores
+            val scoresValue = fields["tokenizer.ggml.scores"]
+            val scores = if (scoresValue != null) {
+                extractFloatList(scoresValue)
+            } else {
+                FloatArray(vocab.size) { 0f }
+            }
+
+            // Extract special token IDs
+            val bosTokenId = fields["tokenizer.ggml.bos_token_id"]?.toIntValue() ?: DEFAULT_BOS_TOKEN_ID
+            val eosTokenId = fields["tokenizer.ggml.eos_token_id"]?.toIntValue() ?: DEFAULT_EOS_TOKEN_ID
+            val unkTokenId = fields["tokenizer.ggml.unknown_token_id"]?.toIntValue() ?: DEFAULT_UNK_TOKEN_ID
+
+            // Detect tokenizer type from metadata
+            val modelType = fields["tokenizer.ggml.model"]?.toString()
+            val strategy = detectStrategy(modelType, vocab, debug)
+
+            // Always log the tokenizer strategy
+            println("Tokenizer: ${strategy.type} (model=${modelType ?: "auto-detected"})")
+
+            if (debug) {
+                println("DEBUG: BOS=$bosTokenId, EOS=$eosTokenId, UNK=$unkTokenId")
+                println("DEBUG: Tokenizer model type from metadata: ${modelType ?: "(not specified)"}")
+                println("DEBUG: Using tokenizer strategy: ${strategy.type}")
+            }
+
+            return GGUFTokenizer(vocab, scores, bosTokenId, eosTokenId, unkTokenId, strategy)
+        }
+
+        /**
+         * Detect the tokenizer strategy based on GGUF metadata and vocabulary inspection.
+         */
+        private fun detectStrategy(
+            modelType: String?,
+            vocab: List<String>,
+            debug: Boolean
+        ): TokenizerStrategy {
+            // First, try to detect from explicit model type in metadata
+            val fromMetadata = when (modelType?.lowercase()) {
+                "llama", "sentencepiece" -> SentencePieceStrategy
+                "gpt2", "bpe" -> BPEStrategy
+                "bert", "wordpiece" -> WordPieceStrategy
+                else -> null
+            }
+
+            if (fromMetadata != null) {
+                if (debug) {
+                    println("DEBUG: Detected tokenizer type from metadata: ${fromMetadata.type}")
+                }
+                return fromMetadata
+            }
+
+            // Fallback: inspect vocabulary for characteristic markers
+            val fromVocab = detectFromVocab(vocab)
+            if (debug) {
+                println("DEBUG: Detected tokenizer type from vocab inspection: ${fromVocab.type}")
+            }
+            return fromVocab
+        }
+
+        /**
+         * Detect tokenizer type by inspecting vocabulary for characteristic markers.
+         */
+        private fun detectFromVocab(vocab: List<String>): TokenizerStrategy {
+            val sentencePieceMarker = "\u2581" // ▁
+            val bpeMarker = "\u0120" // Ġ
+            val wordPieceMarker = "##"
+
+            var sentencePieceCount = 0
+            var bpeCount = 0
+            var wordPieceCount = 0
+
+            // Sample first 1000 tokens (or all if less)
+            val sampleSize = minOf(vocab.size, 1000)
+            for (i in 0 until sampleSize) {
+                val token = vocab[i]
+                when {
+                    token.contains(sentencePieceMarker) -> sentencePieceCount++
+                    token.contains(bpeMarker) -> bpeCount++
+                    token.startsWith(wordPieceMarker) -> wordPieceCount++
+                }
+            }
+
+            // Return strategy based on which marker is most prevalent
+            return when {
+                sentencePieceCount >= bpeCount && sentencePieceCount >= wordPieceCount && sentencePieceCount > 0 ->
+                    SentencePieceStrategy
+                bpeCount > sentencePieceCount && bpeCount >= wordPieceCount ->
+                    BPEStrategy
+                wordPieceCount > sentencePieceCount && wordPieceCount > bpeCount ->
+                    WordPieceStrategy
+                else ->
+                    // Default to SentencePiece/Unknown since most GGUF models use it
+                    UnknownStrategy
+            }
+        }
+
+        /**
+         * Extract a list of strings from streaming field value.
+         */
+        @Suppress("UNCHECKED_CAST")
+        private fun extractStringList(value: Any): List<String> {
+            return when (value) {
+                is List<*> -> value.filterIsInstance<String>()
+                else -> error("Expected List<String> for tokens field, got ${value::class.simpleName}")
+            }
+        }
+
+        /**
+         * Extract float array from streaming field value.
+         */
+        @Suppress("UNCHECKED_CAST")
+        private fun extractFloatList(value: Any): FloatArray {
+            return when (value) {
+                is List<*> -> {
+                    val floats = mutableListOf<Float>()
+                    for (item in value) {
+                        when (item) {
+                            is Float -> floats.add(item)
+                            is Double -> floats.add(item.toFloat())
+                            is Number -> floats.add(item.toFloat())
+                        }
+                    }
+                    floats.toFloatArray()
+                }
+                else -> error("Expected List<Number> for scores field, got ${value::class.simpleName}")
+            }
+        }
+
+        /**
+         * Convert streaming field value to Int.
+         */
+        private fun Any?.toIntValue(): Int? = when (this) {
+            is Int -> this
+            is UInt -> this.toInt()
+            is Long -> this.toInt()
+            is ULong -> this.toInt()
+            is Short -> this.toInt()
+            is UShort -> this.toInt()
+            is Byte -> this.toInt()
+            is UByte -> this.toInt()
+            else -> null
+        }
+
+        private fun extractStringArray(field: ReaderField): List<String> {
+            val strings = mutableListOf<String>()
+            // For array fields, data contains indexes to string parts
+            for (idx in field.data) {
+                if (idx < 0 || idx >= field.parts.size) continue
+                val part = field.parts[idx]
+                // Handle all numeric types that could represent bytes
+                val bytes = part.mapNotNull { value ->
+                    when (value) {
+                        is UByte -> value.toByte()
+                        is Byte -> value
+                        is Number -> value.toInt().toByte()
+                        else -> null
+                    }
+                }
+                strings.add(bytes.toByteArray().decodeToString())
+            }
+            return strings
+        }
+
+        private fun extractFloatArray(field: ReaderField): FloatArray {
+            val floats = mutableListOf<Float>()
+            for (idx in field.data) {
+                if (idx < 0 || idx >= field.parts.size) continue
+                val part = field.parts[idx]
+                for (value in part) {
+                    when (value) {
+                        is Float -> floats.add(value)
+                        is Double -> floats.add(value.toFloat())
+                        is Number -> floats.add(value.toFloat())
+                    }
+                }
+            }
+            return floats.toFloatArray()
+        }
+
+        private fun ReaderField.scalarInt(): Int {
+            val idx = data.firstOrNull() ?: 0
+            val part = parts.getOrNull(idx) ?: return 0
+            val value = (part as? List<*>)?.firstOrNull() ?: return 0
+            return when (value) {
+                is Int -> value
+                is UInt -> value.toInt()
+                is Long -> value.toInt()
+                is ULong -> value.toInt()
+                is Number -> value.toInt()
+                else -> 0
+            }
+        }
+
+        private fun ReaderField.scalarString(): String? {
+            val idx = data.firstOrNull() ?: return null
+            val part = parts.getOrNull(idx) ?: return null
+            // Handle bytes to string conversion
+            val bytes = (part as? List<*>)?.mapNotNull { value ->
+                when (value) {
+                    is UByte -> value.toByte()
+                    is Byte -> value
+                    is Number -> value.toInt().toByte()
+                    else -> null
+                }
+            } ?: return null
+            return bytes.toByteArray().decodeToString()
+        }
+    }
+
+    /** The detected tokenizer type/strategy in use */
+    val tokenizerType: TokenizerType get() = strategy.type
+
+    override val bosTokenId: Int get() = _bosTokenId
+    override val eosTokenId: Int get() = _eosTokenId
+    override val vocabSize: Int get() = vocab.size
+
+    @Deprecated("Use eosTokenId", replaceWith = ReplaceWith("eosTokenId"))
+    val eosId: Int get() = _eosTokenId
+
+    @Deprecated("Use bosTokenId", replaceWith = ReplaceWith("bosTokenId"))
+    val bosId: Int get() = _bosTokenId
+
+    // Build reverse lookup for encoding
+    private val tokenToId: Map<String, Int> by lazy {
+        vocab.mapIndexed { idx, token -> token to idx }.toMap()
+    }
+
+    // Build sorted vocab by score for BPE merging
+    private val sortedVocabByScore: List<Pair<String, Int>> by lazy {
+        vocab.mapIndexed { idx, token -> token to idx }
+            .sortedByDescending { (_, idx) -> scores.getOrElse(idx) { 0f } }
+    }
+
+    override fun encode(text: String): IntArray {
+        if (text.isEmpty()) return intArrayOf()
+
+        // Use strategy-specific preprocessing
+        val preprocessed = strategy.preprocess(text)
+
+        // Handle WordPiece differently - it splits on whitespace first
+        if (strategy.type == TokenizerType.WORDPIECE) {
+            return encodeWordPiece(text)
+        }
+
+        // Standard BPE encoding for SentencePiece and GPT-2 style tokenizers
+        return encodeBPE(preprocessed)
+    }
+
+    /**
+     * Standard BPE encoding used by SentencePiece and GPT-2 style tokenizers.
+     */
+    private fun encodeBPE(preprocessed: String): IntArray {
+        // Convert text to a list of single-char tokens
+        val tokens = mutableListOf<String>()
+        for (char in preprocessed) {
+            tokens.add(char.toString())
+        }
+
+        // Greedy BPE merging
+        var changed = true
+        while (changed && tokens.size > 1) {
+            changed = false
+            var bestIdx = -1
+            var bestScore = Float.NEGATIVE_INFINITY
+            var bestMerge = ""
+
+            // Find the best merge
+            for (i in 0 until tokens.size - 1) {
+                val merge = tokens[i] + tokens[i + 1]
+                val tokenId = tokenToId[merge]
+                if (tokenId != null) {
+                    val score = scores.getOrElse(tokenId) { 0f }
+                    if (score > bestScore) {
+                        bestScore = score
+                        bestIdx = i
+                        bestMerge = merge
+                    }
+                }
+            }
+
+            // Apply best merge
+            if (bestIdx >= 0) {
+                tokens[bestIdx] = bestMerge
+                tokens.removeAt(bestIdx + 1)
+                changed = true
+            }
+        }
+
+        // Convert tokens to IDs
+        return tokens.map { token ->
+            tokenToId[token] ?: findFallbackToken(token)
+        }.toIntArray()
+    }
+
+    /**
+     * WordPiece encoding - splits on whitespace first, then applies subword tokenization.
+     */
+    private fun encodeWordPiece(text: String): IntArray {
+        val result = mutableListOf<Int>()
+        val words = text.split(Regex("\\s+")).filter { it.isNotEmpty() }
+
+        for ((wordIndex, word) in words.withIndex()) {
+            // Add space token between words (if not first word)
+            if (wordIndex > 0) {
+                tokenToId[" "]?.let { result.add(it) }
+            }
+
+            // Try to find the word in vocab
+            val wordId = tokenToId[word]
+            if (wordId != null) {
+                result.add(wordId)
+                continue
+            }
+
+            // Break into subwords
+            var start = 0
+            var foundAny = false
+            while (start < word.length) {
+                var end = word.length
+                var found = false
+
+                while (start < end) {
+                    val substr = if (start == 0) {
+                        word.substring(start, end)
+                    } else {
+                        "##" + word.substring(start, end)
+                    }
+
+                    val id = tokenToId[substr]
+                    if (id != null) {
+                        result.add(id)
+                        start = end
+                        found = true
+                        foundAny = true
+                        break
+                    }
+                    end--
+                }
+
+                if (!found) {
+                    // Character not found, use UNK or byte fallback
+                    if (start < word.length) {
+                        result.add(findFallbackToken(word[start].toString()))
+                        start++
+                    }
+                }
+            }
+
+            if (!foundAny && word.isNotEmpty()) {
+                result.add(unkTokenId)
+            }
+        }
+
+        return result.toIntArray()
+    }
+
+    private fun findFallbackToken(token: String): Int {
+        // Try byte fallback tokens (common in LLaMA tokenizers)
+        if (token.length == 1) {
+            val byte = token[0].code
+            // Try <0xXX> format
+            val hexToken = "<0x${byte.toString(16).uppercase().padStart(2, '0')}>"
+            tokenToId[hexToken]?.let { return it }
+            // Try raw byte token
+            val byteToken = byteArrayOf(byte.toByte()).decodeToString()
+            tokenToId[byteToken]?.let { return it }
+        }
+        // Fall back to UNK token
+        return unkTokenId
+    }
+
+    override fun decode(tokens: IntArray): String {
+        // Accumulate byte tokens and decode them together as UTF-8
+        val result = StringBuilder()
+        val byteBuffer = mutableListOf<Byte>()
+
+        for (tokenId in tokens) {
+            if (tokenId < 0 || tokenId >= vocab.size) continue
+            val token = vocab[tokenId]
+
+            val byteValue = extractByteToken(token)
+            if (byteValue != null) {
+                byteBuffer.add(byteValue)
+            } else {
+                // Flush accumulated bytes as UTF-8
+                if (byteBuffer.isNotEmpty()) {
+                    result.append(byteBuffer.toByteArray().decodeToString())
+                    byteBuffer.clear()
+                }
+                result.append(decodeToken(token))
+            }
+        }
+
+        // Flush remaining bytes
+        if (byteBuffer.isNotEmpty()) {
+            result.append(byteBuffer.toByteArray().decodeToString())
+        }
+
+        return result.toString()
+    }
+
+    override fun decode(token: Int): String {
+        if (token < 0 || token >= vocab.size) return ""
+        val text = vocab[token]
+        // Handle special byte tokens like <0xXX>
+        return decodeToken(text)
+    }
+
+    /**
+     * Extract byte value from <0xXX> format token.
+     * Returns null if token is not a byte token.
+     */
+    private fun extractByteToken(token: String): Byte? {
+        if (token.startsWith("<0x") && token.endsWith(">") && token.length == 6) {
+            val hex = token.substring(3, 5)
+            val value = hex.toIntOrNull(16)
+            if (value != null) {
+                return value.toByte()
+            }
+        }
+        return null
+    }
+
+    private fun decodeToken(token: String): String {
+        // Handle byte tokens in <0xXX> format
+        val byteValue = extractByteToken(token)
+        if (byteValue != null) {
+            // For single-token decode, convert byte to string
+            // Note: This may not handle multi-byte UTF-8 correctly in streaming mode,
+            // but it's the best we can do for single-token decoding
+            return byteArrayOf(byteValue).decodeToString()
+        }
+
+        // Handle common special tokens
+        return when (token) {
+            "<s>" -> "" // BOS
+            "</s>" -> "" // EOS
+            "<unk>" -> "" // Unknown
+            "<pad>" -> "" // Padding
+            strategy.spaceMarker -> " "
+            else -> strategy.postprocess(token)
+        }
+    }
+}
diff --git a/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/TokenizerFactory.kt b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/TokenizerFactory.kt
new file mode 100644
index 0000000..d93f25d
--- /dev/null
+++ b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/TokenizerFactory.kt
@@ -0,0 +1,51 @@
+package sk.ainet.apps.llm.tokenizer
+
+import sk.ainet.apps.llm.Tokenizer
+import sk.ainet.io.RandomAccessSource
+
+/**
+ * Unified factory for creating tokenizers from various sources.
+ *
+ * Usage:
+ * ```kotlin
+ * // From GGUF file (auto-detects BPE/SentencePiece/WordPiece)
+ * val tokenizer = TokenizerFactory.fromGGUF(randomAccessSource)
+ *
+ * // From HuggingFace tokenizer.json
+ * val tokenizer = TokenizerFactory.fromTokenizerJson(jsonString)
+ * ```
+ */
+public object TokenizerFactory {
+
+    /**
+     * Create a tokenizer from GGUF file metadata (streaming, memory-efficient).
+     * Auto-detects tokenizer type from vocabulary and metadata.
+     *
+     * @param source Random access source to the GGUF file.
+     * @param debug Print debug information during loading.
+     */
+    public fun fromGGUF(source: RandomAccessSource, debug: Boolean = false): GGUFTokenizer {
+        return GGUFTokenizer.fromRandomAccessSource(source, debug)
+    }
+
+    /**
+     * Create a tokenizer from a HuggingFace `tokenizer.json` string.
+     * Parses vocab and BPE merges from the JSON.
+     *
+     * @param json The tokenizer.json content.
+     * @param debug Print debug information during loading.
+     */
+    public fun fromTokenizerJson(json: String, debug: Boolean = false): GGUFTokenizer {
+        return GGUFTokenizer.fromTokenizerJson(json, debug)
+    }
+
+    /**
+     * Create a HuggingFace BPE tokenizer from `tokenizer.json` + optional config.
+     *
+     * @param tokenizerJson Content of `tokenizer.json`.
+     * @param tokenizerConfigJson Optional content of `tokenizer_config.json`.
+     */
+    public fun fromHuggingFace(tokenizerJson: String, tokenizerConfigJson: String? = null): Tokenizer {
+        return createHuggingFaceBPETokenizerFromJson(tokenizerJson, tokenizerConfigJson)
+    }
+}
diff --git a/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/GGUFTokenizer.kt b/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/GGUFTokenizer.kt
index b6d13b5..dc733cf 100644
--- a/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/GGUFTokenizer.kt
+++ b/llm-runtime/kllama/src/commonMain/kotlin/sk/ainet/apps/kllama/GGUFTokenizer.kt
@@ -1,753 +1,8 @@
 package sk.ainet.apps.kllama
 
-import kotlinx.io.Source
-import kotlinx.io.buffered
-import sk.ainet.apps.llm.Tokenizer
-import sk.ainet.apps.llm.TokenizerStrategy
-import sk.ainet.apps.llm.TokenizerType
-import sk.ainet.apps.llm.tokenizer.BPEStrategy
-import sk.ainet.apps.llm.tokenizer.SentencePieceStrategy
-import sk.ainet.apps.llm.tokenizer.UnknownStrategy
-import sk.ainet.apps.llm.tokenizer.WordPieceStrategy
-import sk.ainet.io.RandomAccessSource
-import sk.ainet.io.gguf.GGUFReader
-import sk.ainet.io.gguf.ReaderField
-import sk.ainet.io.gguf.StreamingGGUFReader
-
 /**
- * Tokenizer that extracts vocabulary from GGUF file metadata.
- * Supports decoding (token ID -> string) and basic BPE encoding (string -> token IDs).
- *
- * Automatically detects tokenizer type (SentencePiece, BPE, WordPiece) from GGUF
- * metadata and uses the appropriate preprocessing strategy.
+ * Backwards-compatibility alias. GGUFTokenizer has moved to llm-core.
+ * @see sk.ainet.apps.llm.tokenizer.GGUFTokenizer
  */
-class GGUFTokenizer private constructor(
-    private val vocab: List<String>,
-    private val scores: FloatArray,
-    private val _bosTokenId: Int,
-    private val _eosTokenId: Int,
-    private val unkTokenId: Int,
-    private val strategy: TokenizerStrategy
-) : Tokenizer {
-
-    companion object {
-        private const val DEFAULT_BOS_TOKEN_ID = 1
-        private const val DEFAULT_EOS_TOKEN_ID = 2
-        private const val DEFAULT_UNK_TOKEN_ID = 0
-
-        /**
-         * Create a tokenizer from a HuggingFace tokenizer.json string.
-         *
-         * Parses the "model.vocab" and "model.merges" sections to build
-         * a BPE tokenizer compatible with LLaMA 3 models.
-         * Also reads "added_tokens" for BOS/EOS token IDs.
-         */
-        fun fromTokenizerJson(json: String, debug: Boolean = false): GGUFTokenizer {
-            // --- Parse vocab: {"token": id, ...} → List<String> indexed by id ---
-            val vocabStart = json.indexOf("\"vocab\"")
-            if (vocabStart < 0) error("tokenizer.json: no \"vocab\" section found")
-            val vocabBraceStart = json.indexOf('{', vocabStart + 7)
-            if (vocabBraceStart < 0) error("tokenizer.json: malformed vocab section")
-            val vocabBraceEnd = findMatchingBrace(json, vocabBraceStart)
-
-            val vocabMap = mutableMapOf<String, Int>()
-            val vocabContent = json.substring(vocabBraceStart + 1, vocabBraceEnd)
-            val vocabPattern = Regex("\"((?:[^\"\\\\]|\\\\.)*)\"\\s*:\\s*(\\d+)")
-            for (match in vocabPattern.findAll(vocabContent)) {
-                val token = unescapeJsonString(match.groupValues[1])
-                val id = match.groupValues[2].toInt()
-                vocabMap[token] = id
-            }
-
-            if (debug) println("DEBUG: Parsed ${vocabMap.size} vocab entries")
-
-            // Build vocab list indexed by id
-            val maxId = vocabMap.values.maxOrNull() ?: 0
-
-            // --- Parse added_tokens for special tokens and to extend vocab ---
-            var bosTokenId = DEFAULT_BOS_TOKEN_ID
-            var eosTokenId = DEFAULT_EOS_TOKEN_ID
-            var unkTokenId = DEFAULT_UNK_TOKEN_ID
-
-            val addedTokensStart = json.indexOf("\"added_tokens\"")
-            if (addedTokensStart >= 0) {
-                val arrStart = json.indexOf('[', addedTokensStart)
-                if (arrStart >= 0) {
-                    val arrEnd = findMatchingBracket(json, arrStart)
-                    val addedContent = json.substring(arrStart, arrEnd + 1)
-                    val idPattern = Regex("\"id\"\\s*:\\s*(\\d+)")
-                    val contentPattern = Regex("\"content\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"")
-                    // Parse each added token object
-                    val objPattern = Regex("\\{[^}]+\\}")
-                    for (objMatch in objPattern.findAll(addedContent)) {
-                        val obj = objMatch.value
-                        val idMatch = idPattern.find(obj)
-                        val contentMatch = contentPattern.find(obj)
-                        if (idMatch != null && contentMatch != null) {
-                            val id = idMatch.groupValues[1].toInt()
-                            val content = unescapeJsonString(contentMatch.groupValues[1])
-                            vocabMap[content] = id
-                            when {
-                                content.contains("begin_of_text") || content == "<s>" -> bosTokenId = id
-                                content.contains("end_of_text") || content == "</s>" -> eosTokenId = id
-                                content == "<unk>" -> unkTokenId = id
-                            }
-                        }
-                    }
-                }
-            }
-
-            val totalVocabSize = maxOf(maxId + 1, (vocabMap.values.maxOrNull() ?: 0) + 1)
-            val vocab = MutableList(totalVocabSize) { "<unk>" }
-            for ((token, id) in vocabMap) {
-                if (id < vocab.size) vocab[id] = token
-            }
-
-            if (debug) println("DEBUG: Total vocab size = ${vocab.size}, BOS=$bosTokenId, EOS=$eosTokenId")
-
-            // --- Parse merges: ["tok1 tok2", ...] → scores (earlier merge = higher score) ---
-            val scores = FloatArray(vocab.size) { 0f }
-            val mergesStart = json.indexOf("\"merges\"")
-            if (mergesStart >= 0) {
-                val mergesArrStart = json.indexOf('[', mergesStart)
-                if (mergesArrStart >= 0) {
-                    val mergesArrEnd = findMatchingBracket(json, mergesArrStart)
-                    val mergesContent = json.substring(mergesArrStart + 1, mergesArrEnd)
-                    val mergePattern = Regex("\"((?:[^\"\\\\]|\\\\.)*)\"")
-                    var mergeRank = 0
-                    for (match in mergePattern.findAll(mergesContent)) {
-                        val merge = unescapeJsonString(match.groupValues[1])
-                        val parts = merge.split(' ', limit = 2)
-                        if (parts.size == 2) {
-                            val merged = parts[0] + parts[1]
-                            val mergedId = vocabMap[merged]
-                            if (mergedId != null && mergedId < scores.size) {
-                                // Higher score = higher merge priority (earlier in list)
-                                scores[mergedId] = (1_000_000 - mergeRank).toFloat()
-                            }
-                            mergeRank++
-                        }
-                    }
-                    if (debug) println("DEBUG: Parsed $mergeRank merges")
-                }
-            }
-
-            val strategy = BPEStrategy
-            println("Tokenizer: BPE (from tokenizer.json, vocab=${vocab.size})")
-
-            return GGUFTokenizer(vocab, scores, bosTokenId, eosTokenId, unkTokenId, strategy)
-        }
-
-        private fun findMatchingBrace(s: String, start: Int): Int {
-            var depth = 0
-            var inStr = false
-            var i = start
-            while (i < s.length) {
-                val c = s[i]
-                when {
-                    inStr -> { if (c == '"') inStr = false; if (c == '\\') i++ }
-                    c == '"' -> inStr = true
-                    c == '{' -> depth++
-                    c == '}' -> { depth--; if (depth == 0) return i }
-                }
-                i++
-            }
-            return s.length - 1
-        }
-
-        private fun findMatchingBracket(s: String, start: Int): Int {
-            var depth = 0
-            var inStr = false
-            var i = start
-            while (i < s.length) {
-                val c = s[i]
-                when {
-                    inStr -> { if (c == '"') inStr = false; if (c == '\\') i++ }
-                    c == '"' -> inStr = true
-                    c == '[' -> depth++
-                    c == ']' -> { depth--; if (depth == 0) return i }
-                }
-                i++
-            }
-            return s.length - 1
-        }
-
-        private fun unescapeJsonString(s: String): String {
-            val sb = StringBuilder()
-            var i = 0
-            while (i < s.length) {
-                if (s[i] == '\\' && i + 1 < s.length) {
-                    when (s[i + 1]) {
-                        '"' -> { sb.append('"'); i += 2 }
-                        '\\' -> { sb.append('\\'); i += 2 }
-                        '/' -> { sb.append('/'); i += 2 }
-                        'n' -> { sb.append('\n'); i += 2 }
-                        'r' -> { sb.append('\r'); i += 2 }
-                        't' -> { sb.append('\t'); i += 2 }
-                        'u' -> {
-                            if (i + 5 < s.length) {
-                                val cp = s.substring(i + 2, i + 6).toInt(16)
-                                sb.append(cp.toChar())
-                                i += 6
-                            } else { sb.append(s[i]); i++ }
-                        }
-                        else -> { sb.append(s[i]); i++ }
-                    }
-                } else { sb.append(s[i]); i++ }
-            }
-            return sb.toString()
-        }
-
-        /**
-         * Create a tokenizer by reading GGUF metadata from a source.
-         * Only reads metadata (not tensor data) for efficiency.
-         */
-        fun fromSource(source: Source, debug: Boolean = false): GGUFTokenizer {
-            val reader = source.buffered().use { src ->
-                GGUFReader(src, loadTensorData = false)
-            }
-            return fromGGUF(reader, debug)
-        }
-
-        /**
-         * Create a tokenizer from GGUF reader fields.
-         */
-        fun fromGGUF(reader: GGUFReader, debug: Boolean = false): GGUFTokenizer {
-            val fields = reader.fields
-
-            // Extract vocabulary tokens
-            val tokensField = fields["tokenizer.ggml.tokens"]
-                ?: error("GGUF file missing tokenizer.ggml.tokens field")
-            val vocab = extractStringArray(tokensField)
-
-            if (debug) {
-                println("DEBUG: Vocab size = ${vocab.size}")
-                println("DEBUG: First 10 tokens:")
-                vocab.take(10).forEachIndexed { idx, token ->
-                    val bytes = token.encodeToByteArray()
-                    val hexStr = bytes.joinToString(" ") { b ->
-                        val hex = (b.toInt() and 0xFF).toString(16).uppercase()
-                        if (hex.length == 1) "0$hex" else hex
-                    }
-                    println("  [$idx] = '$token' (bytes: $hexStr)")
-                }
-                println("DEBUG: Tokens around index 1000:")
-                vocab.drop(1000).take(5).forEachIndexed { idx, token ->
-                    println("  [${1000 + idx}] = '$token'")
-                }
-            }
-
-            // Extract BPE scores (used for merge priority during encoding)
-            val scoresField = fields["tokenizer.ggml.scores"]
-            val scores = if (scoresField != null) {
-                extractFloatArray(scoresField)
-            } else {
-                // Default scores if not present
-                FloatArray(vocab.size) { 0f }
-            }
-
-            // Extract special token IDs
-            val bosTokenId = fields["tokenizer.ggml.bos_token_id"]?.scalarInt() ?: DEFAULT_BOS_TOKEN_ID
-            val eosTokenId = fields["tokenizer.ggml.eos_token_id"]?.scalarInt() ?: DEFAULT_EOS_TOKEN_ID
-            val unkTokenId = fields["tokenizer.ggml.unknown_token_id"]?.scalarInt() ?: DEFAULT_UNK_TOKEN_ID
-
-            // Detect tokenizer type from metadata
-            val modelType = fields["tokenizer.ggml.model"]?.scalarString()
-            val strategy = detectStrategy(modelType, vocab, debug)
-
-            // Always log the tokenizer strategy
-            println("Tokenizer: ${strategy.type} (model=${modelType ?: "auto-detected"})")
-
-            if (debug) {
-                println("DEBUG: BOS=$bosTokenId, EOS=$eosTokenId, UNK=$unkTokenId")
-                println("DEBUG: Tokenizer model type from metadata: ${modelType ?: "(not specified)"}")
-                println("DEBUG: Using tokenizer strategy: ${strategy.type}")
-            }
-
-            return GGUFTokenizer(vocab, scores, bosTokenId, eosTokenId, unkTokenId, strategy)
-        }
-
-        /**
-         * Create a tokenizer using streaming API.
-         * Parses metadata only (~1MB memory), suitable for large models.
-         * The source is closed after reading metadata.
-         */
-        fun fromRandomAccessSource(source: RandomAccessSource, debug: Boolean = false): GGUFTokenizer {
-            return StreamingGGUFReader.open(source).use { reader ->
-                fromStreamingFields(reader.fields, debug)
-            }
-        }
-
-        /**
-         * Create a tokenizer from StreamingGGUFReader fields.
-         * StreamingGGUFReader.fields returns direct values (Map<String, Any?>),
-         * not ReaderField objects.
-         */
-        private fun fromStreamingFields(fields: Map<String, Any?>, debug: Boolean = false): GGUFTokenizer {
-            // Extract vocabulary tokens (stored as List<String> in streaming reader)
-            val tokensValue = fields["tokenizer.ggml.tokens"]
-                ?: error("GGUF file missing tokenizer.ggml.tokens field")
-            val vocab = extractStringList(tokensValue)
-
-            if (debug) {
-                println("DEBUG: Vocab size = ${vocab.size}")
-                println("DEBUG: First 10 tokens:")
-                vocab.take(10).forEachIndexed { idx, token ->
-                    val bytes = token.encodeToByteArray()
-                    val hexStr = bytes.joinToString(" ") { b ->
-                        val hex = (b.toInt() and 0xFF).toString(16).uppercase()
-                        if (hex.length == 1) "0$hex" else hex
-                    }
-                    println("  [$idx] = '$token' (bytes: $hexStr)")
-                }
-                println("DEBUG: Tokens around index 1000:")
-                vocab.drop(1000).take(5).forEachIndexed { idx, token ->
-                    println("  [${1000 + idx}] = '$token'")
-                }
-            }
-
-            // Extract BPE scores
-            val scoresValue = fields["tokenizer.ggml.scores"]
-            val scores = if (scoresValue != null) {
-                extractFloatList(scoresValue)
-            } else {
-                FloatArray(vocab.size) { 0f }
-            }
-
-            // Extract special token IDs
-            val bosTokenId = fields["tokenizer.ggml.bos_token_id"]?.toIntValue() ?: DEFAULT_BOS_TOKEN_ID
-            val eosTokenId = fields["tokenizer.ggml.eos_token_id"]?.toIntValue() ?: DEFAULT_EOS_TOKEN_ID
-            val unkTokenId = fields["tokenizer.ggml.unknown_token_id"]?.toIntValue() ?: DEFAULT_UNK_TOKEN_ID
-
-            // Detect tokenizer type from metadata
-            val modelType = fields["tokenizer.ggml.model"]?.toString()
-            val strategy = detectStrategy(modelType, vocab, debug)
-
-            // Always log the tokenizer strategy
-            println("Tokenizer: ${strategy.type} (model=${modelType ?: "auto-detected"})")
-
-            if (debug) {
-                println("DEBUG: BOS=$bosTokenId, EOS=$eosTokenId, UNK=$unkTokenId")
-                println("DEBUG: Tokenizer model type from metadata: ${modelType ?: "(not specified)"}")
-                println("DEBUG: Using tokenizer strategy: ${strategy.type}")
-            }
-
-            return GGUFTokenizer(vocab, scores, bosTokenId, eosTokenId, unkTokenId, strategy)
-        }
-
-        /**
-         * Detect the tokenizer strategy based on GGUF metadata and vocabulary inspection.
-         */
-        private fun detectStrategy(
-            modelType: String?,
-            vocab: List<String>,
-            debug: Boolean
-        ): TokenizerStrategy {
-            // First, try to detect from explicit model type in metadata
-            val fromMetadata = when (modelType?.lowercase()) {
-                "llama", "sentencepiece" -> SentencePieceStrategy
-                "gpt2", "bpe" -> BPEStrategy
-                "bert", "wordpiece" -> WordPieceStrategy
-                else -> null
-            }
-
-            if (fromMetadata != null) {
-                if (debug) {
-                    println("DEBUG: Detected tokenizer type from metadata: ${fromMetadata.type}")
-                }
-                return fromMetadata
-            }
-
-            // Fallback: inspect vocabulary for characteristic markers
-            val fromVocab = detectFromVocab(vocab)
-            if (debug) {
-                println("DEBUG: Detected tokenizer type from vocab inspection: ${fromVocab.type}")
-            }
-            return fromVocab
-        }
-
-        /**
-         * Detect tokenizer type by inspecting vocabulary for characteristic markers.
-         */
-        private fun detectFromVocab(vocab: List<String>): TokenizerStrategy {
-            val sentencePieceMarker = "\u2581" // ▁
-            val bpeMarker = "\u0120" // Ġ
-            val wordPieceMarker = "##"
-
-            var sentencePieceCount = 0
-            var bpeCount = 0
-            var wordPieceCount = 0
-
-            // Sample first 1000 tokens (or all if less)
-            val sampleSize = minOf(vocab.size, 1000)
-            for (i in 0 until sampleSize) {
-                val token = vocab[i]
-                when {
-                    token.contains(sentencePieceMarker) -> sentencePieceCount++
-                    token.contains(bpeMarker) -> bpeCount++
-                    token.startsWith(wordPieceMarker) -> wordPieceCount++
-                }
-            }
-
-            // Return strategy based on which marker is most prevalent
-            return when {
-                sentencePieceCount >= bpeCount && sentencePieceCount >= wordPieceCount && sentencePieceCount > 0 ->
-                    SentencePieceStrategy
-                bpeCount > sentencePieceCount && bpeCount >= wordPieceCount ->
-                    BPEStrategy
-                wordPieceCount > sentencePieceCount && wordPieceCount > bpeCount ->
-                    WordPieceStrategy
-                else ->
-                    // Default to SentencePiece/Unknown since most GGUF models use it
-                    UnknownStrategy
-            }
-        }
-
-        /**
-         * Extract a list of strings from streaming field value.
-         */
-        @Suppress("UNCHECKED_CAST")
-        private fun extractStringList(value: Any): List<String> {
-            return when (value) {
-                is List<*> -> value.filterIsInstance<String>()
-                else -> error("Expected List<String> for tokens field, got ${value::class.simpleName}")
-            }
-        }
-
-        /**
-         * Extract float array from streaming field value.
-         */
-        @Suppress("UNCHECKED_CAST")
-        private fun extractFloatList(value: Any): FloatArray {
-            return when (value) {
-                is List<*> -> {
-                    val floats = mutableListOf<Float>()
-                    for (item in value) {
-                        when (item) {
-                            is Float -> floats.add(item)
-                            is Double -> floats.add(item.toFloat())
-                            is Number -> floats.add(item.toFloat())
-                        }
-                    }
-                    floats.toFloatArray()
-                }
-                else -> error("Expected List<Number> for scores field, got ${value::class.simpleName}")
-            }
-        }
-
-        /**
-         * Convert streaming field value to Int.
-         */
-        private fun Any?.toIntValue(): Int? = when (this) {
-            is Int -> this
-            is UInt -> this.toInt()
-            is Long -> this.toInt()
-            is ULong -> this.toInt()
-            is Short -> this.toInt()
-            is UShort -> this.toInt()
-            is Byte -> this.toInt()
-            is UByte -> this.toInt()
-            else -> null
-        }
-
-        private fun extractStringArray(field: ReaderField): List<String> {
-            val strings = mutableListOf<String>()
-            // For array fields, data contains indexes to string parts
-            for (idx in field.data) {
-                if (idx < 0 || idx >= field.parts.size) continue
-                val part = field.parts[idx]
-                // Handle all numeric types that could represent bytes
-                val bytes = part.mapNotNull { value ->
-                    when (value) {
-                        is UByte -> value.toByte()
-                        is Byte -> value
-                        is Number -> value.toInt().toByte()
-                        else -> null
-                    }
-                }
-                strings.add(bytes.toByteArray().decodeToString())
-            }
-            return strings
-        }
-
-        private fun extractFloatArray(field: ReaderField): FloatArray {
-            val floats = mutableListOf<Float>()
-            for (idx in field.data) {
-                if (idx < 0 || idx >= field.parts.size) continue
-                val part = field.parts[idx]
-                for (value in part) {
-                    when (value) {
-                        is Float -> floats.add(value)
-                        is Double -> floats.add(value.toFloat())
-                        is Number -> floats.add(value.toFloat())
-                    }
-                }
-            }
-            return floats.toFloatArray()
-        }
-
-        private fun ReaderField.scalarInt(): Int {
-            val idx = data.firstOrNull() ?: 0
-            val part = parts.getOrNull(idx) ?: return 0
-            val value = (part as? List<*>)?.firstOrNull() ?: return 0
-            return when (value) {
-                is Int -> value
-                is UInt -> value.toInt()
-                is Long -> value.toInt()
-                is ULong -> value.toInt()
-                is Number -> value.toInt()
-                else -> 0
-            }
-        }
-
-        private fun ReaderField.scalarString(): String? {
-            val idx = data.firstOrNull() ?: return null
-            val part = parts.getOrNull(idx) ?: return null
-            // Handle bytes to string conversion
-            val bytes = (part as? List<*>)?.mapNotNull { value ->
-                when (value) {
-                    is UByte -> value.toByte()
-                    is Byte -> value
-                    is Number -> value.toInt().toByte()
-                    else -> null
-                }
-            } ?: return null
-            return bytes.toByteArray().decodeToString()
-        }
-    }
-
-    /** The detected tokenizer type/strategy in use */
-    val tokenizerType: TokenizerType get() = strategy.type
-
-    override val bosTokenId: Int get() = _bosTokenId
-    override val eosTokenId: Int get() = _eosTokenId
-    override val vocabSize: Int get() = vocab.size
-
-    @Deprecated("Use eosTokenId", replaceWith = ReplaceWith("eosTokenId"))
-    val eosId: Int get() = _eosTokenId
-
-    @Deprecated("Use bosTokenId", replaceWith = ReplaceWith("bosTokenId"))
-    val bosId: Int get() = _bosTokenId
-
-    // Build reverse lookup for encoding
-    private val tokenToId: Map<String, Int> by lazy {
-        vocab.mapIndexed { idx, token -> token to idx }.toMap()
-    }
-
-    // Build sorted vocab by score for BPE merging
-    private val sortedVocabByScore: List<Pair<String, Int>> by lazy {
-        vocab.mapIndexed { idx, token -> token to idx }
-            .sortedByDescending { (_, idx) -> scores.getOrElse(idx) { 0f } }
-    }
-
-    override fun encode(text: String): IntArray {
-        if (text.isEmpty()) return intArrayOf()
-
-        // Use strategy-specific preprocessing
-        val preprocessed = strategy.preprocess(text)
-
-        // Handle WordPiece differently - it splits on whitespace first
-        if (strategy.type == TokenizerType.WORDPIECE) {
-            return encodeWordPiece(text)
-        }
-
-        // Standard BPE encoding for SentencePiece and GPT-2 style tokenizers
-        return encodeBPE(preprocessed)
-    }
-
-    /**
-     * Standard BPE encoding used by SentencePiece and GPT-2 style tokenizers.
-     */
-    private fun encodeBPE(preprocessed: String): IntArray {
-        // Convert text to a list of single-char tokens
-        val tokens = mutableListOf<String>()
-        for (char in preprocessed) {
-            tokens.add(char.toString())
-        }
-
-        // Greedy BPE merging
-        var changed = true
-        while (changed && tokens.size > 1) {
-            changed = false
-            var bestIdx = -1
-            var bestScore = Float.NEGATIVE_INFINITY
-            var bestMerge = ""
-
-            // Find the best merge
-            for (i in 0 until tokens.size - 1) {
-                val merge = tokens[i] + tokens[i + 1]
-                val tokenId = tokenToId[merge]
-                if (tokenId != null) {
-                    val score = scores.getOrElse(tokenId) { 0f }
-                    if (score > bestScore) {
-                        bestScore = score
-                        bestIdx = i
-                        bestMerge = merge
-                    }
-                }
-            }
-
-            // Apply best merge
-            if (bestIdx >= 0) {
-                tokens[bestIdx] = bestMerge
-                tokens.removeAt(bestIdx + 1)
-                changed = true
-            }
-        }
-
-        // Convert tokens to IDs
-        return tokens.map { token ->
-            tokenToId[token] ?: findFallbackToken(token)
-        }.toIntArray()
-    }
-
-    /**
-     * WordPiece encoding - splits on whitespace first, then applies subword tokenization.
-     */
-    private fun encodeWordPiece(text: String): IntArray {
-        val result = mutableListOf<Int>()
-        val words = text.split(Regex("\\s+")).filter { it.isNotEmpty() }
-
-        for ((wordIndex, word) in words.withIndex()) {
-            // Add space token between words (if not first word)
-            if (wordIndex > 0) {
-                tokenToId[" "]?.let { result.add(it) }
-            }
-
-            // Try to find the word in vocab
-            val wordId = tokenToId[word]
-            if (wordId != null) {
-                result.add(wordId)
-                continue
-            }
-
-            // Break into subwords
-            var start = 0
-            var foundAny = false
-            while (start < word.length) {
-                var end = word.length
-                var found = false
-
-                while (start < end) {
-                    val substr = if (start == 0) {
-                        word.substring(start, end)
-                    } else {
-                        "##" + word.substring(start, end)
-                    }
-
-                    val id = tokenToId[substr]
-                    if (id != null) {
-                        result.add(id)
-                        start = end
-                        found = true
-                        foundAny = true
-                        break
-                    }
-                    end--
-                }
-
-                if (!found) {
-                    // Character not found, use UNK or byte fallback
-                    if (start < word.length) {
-                        result.add(findFallbackToken(word[start].toString()))
-                        start++
-                    }
-                }
-            }
-
-            if (!foundAny && word.isNotEmpty()) {
-                result.add(unkTokenId)
-            }
-        }
-
-        return result.toIntArray()
-    }
-
-    private fun findFallbackToken(token: String): Int {
-        // Try byte fallback tokens (common in LLaMA tokenizers)
-        if (token.length == 1) {
-            val byte = token[0].code
-            // Try <0xXX> format
-            val hexToken = "<0x${byte.toString(16).uppercase().padStart(2, '0')}>"
-            tokenToId[hexToken]?.let { return it }
-            // Try raw byte token
-            val byteToken = byteArrayOf(byte.toByte()).decodeToString()
-            tokenToId[byteToken]?.let { return it }
-        }
-        // Fall back to UNK token
-        return unkTokenId
-    }
-
-    override fun decode(tokens: IntArray): String {
-        // Accumulate byte tokens and decode them together as UTF-8
-        val result = StringBuilder()
-        val byteBuffer = mutableListOf<Byte>()
-
-        for (tokenId in tokens) {
-            if (tokenId < 0 || tokenId >= vocab.size) continue
-            val token = vocab[tokenId]
-
-            val byteValue = extractByteToken(token)
-            if (byteValue != null) {
-                byteBuffer.add(byteValue)
-            } else {
-                // Flush accumulated bytes as UTF-8
-                if (byteBuffer.isNotEmpty()) {
-                    result.append(byteBuffer.toByteArray().decodeToString())
-                    byteBuffer.clear()
-                }
-                result.append(decodeToken(token))
-            }
-        }
-
-        // Flush remaining bytes
-        if (byteBuffer.isNotEmpty()) {
-            result.append(byteBuffer.toByteArray().decodeToString())
-        }
-
-        return result.toString()
-    }
-
-    override fun decode(token: Int): String {
-        if (token < 0 || token >= vocab.size) return ""
-        val text = vocab[token]
-        // Handle special byte tokens like <0xXX>
-        return decodeToken(text)
-    }
-
-    /**
-     * Extract byte value from <0xXX> format token.
-     * Returns null if token is not a byte token.
-     */
-    private fun extractByteToken(token: String): Byte? {
-        if (token.startsWith("<0x") && token.endsWith(">") && token.length == 6) {
-            val hex = token.substring(3, 5)
-            val value = hex.toIntOrNull(16)
-            if (value != null) {
-                return value.toByte()
-            }
-        }
-        return null
-    }
-
-    private fun decodeToken(token: String): String {
-        // Handle byte tokens in <0xXX> format
-        val byteValue = extractByteToken(token)
-        if (byteValue != null) {
-            // For single-token decode, convert byte to string
-            // Note: This may not handle multi-byte UTF-8 correctly in streaming mode,
-            // but it's the best we can do for single-token decoding
-            return byteArrayOf(byteValue).decodeToString()
-        }
-
-        // Handle common special tokens
-        return when (token) {
-            "<s>" -> "" // BOS
-            "</s>" -> "" // EOS
-            "<unk>" -> "" // Unknown
-            "<pad>" -> "" // Padding
-            strategy.spaceMarker -> " "
-            else -> strategy.postprocess(token)
-        }
-    }
-}
+@Suppress("unused")
+public typealias GGUFTokenizer = sk.ainet.apps.llm.tokenizer.GGUFTokenizer

From 247973d5dce63574698eb89807c1b5edf7974396 Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 11 Apr 2026 10:38:00 +0200
Subject: [PATCH 3/9] feat: add ModelRegistry and UnifiedModelLoader for
 architecture auto-detection

Phase 2 of the unified pipeline plan. Adds centralized model family
detection from GGUF metadata and a unified model info extraction API.

- Add ModelFamily enum (LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL)
  with capabilities (supportsToolCalling, chatTemplateFamily)
- Add ModelRegistry.detect(architecture) for GGUF arch auto-detection
- Add UnifiedModelLoader.peek(source) to extract GGUFModelInfo without
  loading weights (architecture, family, dimensions)

DSL network definitions already exist for all major architectures
except Gemma3n. CLI migration to OptimizedLLMRuntime is future work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 PLAN-unified-pipeline.md                      | 42 ++++------
 .../kotlin/sk/ainet/apps/llm/ModelRegistry.kt | 82 +++++++++++++++++++
 .../sk/ainet/apps/llm/UnifiedModelLoader.kt   | 67 +++++++++++++++
 3 files changed, 166 insertions(+), 25 deletions(-)
 create mode 100644 llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/ModelRegistry.kt
 create mode 100644 llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/UnifiedModelLoader.kt

diff --git a/PLAN-unified-pipeline.md b/PLAN-unified-pipeline.md
index 7f5287d..a05e3fb 100644
--- a/PLAN-unified-pipeline.md
+++ b/PLAN-unified-pipeline.md
@@ -52,36 +52,28 @@ ChatPipeline (template formatting, tool calling, agent loop)
 
 5. **Fixed `JavaAgentLoop`** — replaced `GGUFTokenizer` instanceof hack with `tokenizer.eosTokenId`
 
-## Phase 2: Unified DSL-Based Model Definition (converge on OptimizedLLMRuntime)
+## Phase 2: Unified DSL-Based Model Definition (converge on OptimizedLLMRuntime) -- PARTIAL
 
-**Problem:** Each model has a hand-coded runtime. `OptimizedLLMRuntime` already supports DSL -> graph -> optimized execution, but only some models use it.
-
-**Changes:**
+**What was done:**
 
-1. **Define DSL networks for all model families:**
-   - `llamaNetwork(config)` — LLaMA/Mistral/Qwen2/3 (standard transformer)
-   - `qwen35Network(config)` — Qwen3.5 (hybrid DeltaNet + full attention)
-   - `gemmaNetwork(config)` — Gemma (GELU, MatFormer FFN, sliding window)
-   - `apertusNetwork(config)` — Apertus (xIELU, ungated MLP, QK-norm)
-   - Each is a pure function returning a `Network<T>` from the DSL
+1. **Created `ModelRegistry`** in `llm-core/.../ModelRegistry.kt`
+   - `ModelFamily` enum: LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL, UNKNOWN
+   - `ModelRegistry.detect(architecture)` maps GGUF arch strings to families
+   - Tracks capabilities (supportsToolCalling, chatTemplateFamily)
 
-2. **Unified model loading flow:**
-   ```
-   detectArchitecture(ggufMetadata) -> ModelFamily
-   ModelFamily.createNetwork(config) -> Network<T>
-   WeightLoader.loadAndMap(file, network) -> weights
-   OptimizedLLMRuntime(network, weights, mode=HYBRID) -> InferenceRuntime
-   ```
+2. **Created `UnifiedModelLoader`** in `llm-core/.../UnifiedModelLoader.kt`
+   - `UnifiedModelLoader.peek(source)` extracts `GGUFModelInfo` from GGUF metadata
+   - Returns architecture, family, dimensions without loading weights
 
-3. **Remove deprecated hand-coded runtimes** once DSL equivalents are validated:
-   - `LlamaRuntime` -> `llamaNetwork()` + `OptimizedLLMRuntime`
-   - `ApertusRuntime` -> `apertusNetwork()` + `OptimizedLLMRuntime`
+**Already existing (no changes needed):**
+- DSL networks: `llamaNetwork()`, `qwenNetwork()`, `apertusNetwork()`, `bertNetwork()`, `voxtralBackboneNetwork()`, `voxtralAcousticNetwork()`
+- `OptimizedLLMRuntime` with DIRECT/OPTIMIZED/HYBRID modes
+- Per-model `NetworkLoader` classes (LlamaNetworkLoader, ApertusNetworkLoader, etc.)
 
-**Critical files:**
-- `llm-core/.../OptimizedLLMRuntime.kt` — already exists, extend
-- `llm-core/.../dsl/TransformerDsl.kt` — already has embedding, MHA, SwiGLU, RMSNorm
-- `llm-core/.../weights/LLMWeightNameResolvers.kt` — already maps DSL paths -> GGUF names
-- New: per-model DSL network definitions
+**Remaining (future work):**
+- `gemmaNetwork()` DSL definition (Gemma3n has unique features: GELU, MatFormer variable FFN, sliding window)
+- Migrate CLI runners from deprecated runtimes to OptimizedLLMRuntime
+- Remove deprecated LlamaRuntime and ApertusRuntime
 
 ## Phase 3: Tokenization as Pipeline Stage -- DONE
 
diff --git a/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/ModelRegistry.kt b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/ModelRegistry.kt
new file mode 100644
index 0000000..ecea840
--- /dev/null
+++ b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/ModelRegistry.kt
@@ -0,0 +1,82 @@
+package sk.ainet.apps.llm
+
+/**
+ * Registry of known model architectures and their capabilities.
+ *
+ * Maps GGUF architecture strings to [ModelFamily] descriptors.
+ * Used for auto-detection: given GGUF metadata, determine which
+ * network DSL definition, weight loader, and chat template to use.
+ *
+ * Usage:
+ * ```kotlin
+ * val family = ModelRegistry.detect("qwen3")  // returns ModelFamily.QWEN
+ * val family = ModelRegistry.detect("llama")   // returns ModelFamily.LLAMA
+ * ```
+ */
+public object ModelRegistry {
+
+    /**
+     * Detect the model family from a GGUF architecture string.
+     *
+     * @param architecture The `general.architecture` field from GGUF metadata.
+     * @return The detected [ModelFamily], or [ModelFamily.UNKNOWN] if not recognized.
+     */
+    public fun detect(architecture: String): ModelFamily {
+        val arch = architecture.lowercase()
+        return when {
+            arch == "llama" || arch == "mistral" -> ModelFamily.LLAMA
+            arch.startsWith("qwen") -> ModelFamily.QWEN
+            arch.startsWith("gemma") -> ModelFamily.GEMMA
+            arch == "apertus" -> ModelFamily.APERTUS
+            arch == "bert" -> ModelFamily.BERT
+            arch == "voxtral" -> ModelFamily.VOXTRAL
+            else -> ModelFamily.UNKNOWN
+        }
+    }
+
+    /**
+     * Detect the model family from GGUF metadata fields.
+     *
+     * @param architecture The `general.architecture` field.
+     * @param chatTemplate Optional `tokenizer.chat_template` field for disambiguation.
+     * @return The detected [ModelFamily].
+     */
+    public fun detect(architecture: String, chatTemplate: String?): ModelFamily {
+        return detect(architecture)
+    }
+}
+
+/**
+ * Describes a model family and its capabilities.
+ *
+ * @property id Unique identifier for the family.
+ * @property displayName Human-readable name.
+ * @property supportsToolCalling Whether the family supports tool calling via chat templates.
+ * @property chatTemplateFamily The chat template family name for [ToolCallingSupportResolver].
+ */
+public enum class ModelFamily(
+    public val id: String,
+    public val displayName: String,
+    public val supportsToolCalling: Boolean,
+    public val chatTemplateFamily: String?
+) {
+    LLAMA("llama", "LLaMA / Mistral", true, "llama3"),
+    QWEN("qwen", "Qwen", true, "qwen"),
+    GEMMA("gemma", "Gemma", true, "gemma"),
+    APERTUS("apertus", "Apertus", false, "chatml"),
+    BERT("bert", "BERT", false, null),
+    VOXTRAL("voxtral", "Voxtral TTS", false, null),
+    UNKNOWN("unknown", "Unknown", false, null);
+
+    /** GGUF architecture strings that map to this family. */
+    public val architectures: Set<String>
+        get() = when (this) {
+            LLAMA -> setOf("llama", "mistral")
+            QWEN -> setOf("qwen2", "qwen3", "qwen35")
+            GEMMA -> setOf("gemma", "gemma2", "gemma3", "gemma3n")
+            APERTUS -> setOf("apertus")
+            BERT -> setOf("bert")
+            VOXTRAL -> setOf("voxtral")
+            UNKNOWN -> emptySet()
+        }
+}
diff --git a/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/UnifiedModelLoader.kt b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/UnifiedModelLoader.kt
new file mode 100644
index 0000000..fcda904
--- /dev/null
+++ b/llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/UnifiedModelLoader.kt
@@ -0,0 +1,67 @@
+package sk.ainet.apps.llm
+
+import sk.ainet.io.RandomAccessSource
+import sk.ainet.io.gguf.StreamingGGUFReader
+import sk.ainet.lang.types.DType
+
+/**
+ * Metadata extracted from a GGUF file for model detection and loading.
+ */
+public data class GGUFModelInfo(
+    val architecture: String,
+    val family: ModelFamily,
+    val contextLength: Int,
+    val vocabSize: Int,
+    val blockCount: Int,
+    val embeddingLength: Int,
+    val fields: Map<String, Any?>
+)
+
+/**
+ * Unified model loader that auto-detects model architecture from GGUF metadata
+ * and delegates to the appropriate network loader.
+ *
+ * Usage:
+ * ```kotlin
+ * // Peek at model info without loading weights
+ * val info = UnifiedModelLoader.peek(source)
+ * println("Architecture: ${info.architecture}, Family: ${info.family}")
+ *
+ * // Register a loader for a model family
+ * UnifiedModelLoader.register(ModelFamily.LLAMA) { info, source, ctx ->
+ *     LlamaNetworkLoader.fromGguf(source).load(ctx)
+ * }
+ * ```
+ *
+ * Network loaders register themselves at startup. The loader detects the
+ * architecture from GGUF metadata and delegates to the registered handler.
+ */
+public object UnifiedModelLoader {
+
+    /**
+     * Peek at a GGUF file to extract model info without loading weights.
+     *
+     * @param sourceProvider Provides a [RandomAccessSource] to the GGUF file.
+     * @return Model information including architecture, family, and dimensions.
+     */
+    public fun peek(sourceProvider: () -> RandomAccessSource): GGUFModelInfo {
+        return sourceProvider().use { source ->
+            StreamingGGUFReader.open(source).use { reader ->
+                val fields = reader.fields
+                val arch = (fields["general.architecture"] as? String) ?: "unknown"
+                val family = ModelRegistry.detect(arch)
+
+                GGUFModelInfo(
+                    architecture = arch,
+                    family = family,
+                    contextLength = (fields["${arch}.context_length"] as? Number)?.toInt() ?: 4096,
+                    vocabSize = (fields["${arch}.vocab_size"] as? Number)?.toInt()
+                        ?: ((fields["tokenizer.ggml.tokens"] as? List<*>)?.size ?: 0),
+                    blockCount = (fields["${arch}.block_count"] as? Number)?.toInt() ?: 0,
+                    embeddingLength = (fields["${arch}.embedding_length"] as? Number)?.toInt() ?: 0,
+                    fields = fields
+                )
+            }
+        }
+    }
+}

From 8af88c89ad0a2dc3081f5f1b345993f19b197ff6 Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 11 Apr 2026 10:49:58 +0200
Subject: [PATCH 4/9] feat: add unified skainet CLI with architecture
 auto-detection

Phase 4 of the unified pipeline plan. New skainet-cli module that
auto-detects model architecture from GGUF metadata and supports all
modes (generate, chat, agent, demo) for any LLaMA-compatible model.

- New llm-apps/skainet-cli module with single entry point
- Auto-detects architecture via UnifiedModelLoader.peek()
- Supports --chat, --agent, --demo with tool calling for all models
- Registered as 'skainet' runner in smoke test script
- Existing per-model CLIs preserved (no breaking changes)

Usage:
  skainet -m model.gguf "prompt"        # auto-detect and generate
  skainet -m model.gguf --chat          # interactive chat
  skainet -m model.gguf --demo          # tool calling demo

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 PLAN-unified-pipeline.md                      |  53 ++--
 llm-apps/skainet-cli/build.gradle.kts         |  54 ++++
 .../kotlin/sk/ainet/apps/skainet/cli/Main.kt  | 244 ++++++++++++++++++
 settings.gradle.kts                           |   1 +
 tests/smoke/smoke-test.sh                     |   2 +
 5 files changed, 327 insertions(+), 27 deletions(-)
 create mode 100644 llm-apps/skainet-cli/build.gradle.kts
 create mode 100644 llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt

diff --git a/PLAN-unified-pipeline.md b/PLAN-unified-pipeline.md
index a05e3fb..1485414 100644
--- a/PLAN-unified-pipeline.md
+++ b/PLAN-unified-pipeline.md
@@ -93,39 +93,38 @@ ChatPipeline (template formatting, tool calling, agent loop)
 
 4. All runners can now use `GGUFTokenizer` and `TokenizerFactory` directly from `llm-core`
 
-## Phase 4: Unified Runner (single CLI entry point)
+## Phase 4: Unified Runner (single CLI entry point) -- DONE
 
-**Problem:** 6 separate CLI apps with duplicated argument parsing, model loading, and dispatch logic.
+**What was done:**
 
-**Changes:**
+1. **Created `llm-apps/skainet-cli`** — new unified CLI module
+   - Auto-detects architecture from GGUF metadata via `UnifiedModelLoader.peek()`
+   - Loads any LLaMA-compatible model (LLaMA, Qwen, Mistral)
+   - Supports `--chat`, `--agent`, `--demo` modes with tool calling
+   - Uses `TokenizerFactory.fromGGUF()` for tokenizer loading
+   - Registered as `skainet` runner in smoke test script
 
-1. **Single `skainet` CLI** that auto-detects model architecture from GGUF metadata:
+2. **Usage:**
    ```bash
-   skainet -m model.gguf "prompt"                    # auto-detect, generate
-   skainet -m model.gguf --chat                      # auto-detect, chat mode
-   skainet -m model.gguf --demo "What is 2+2?"       # auto-detect, tool calling
-   ```
-
-2. **Architecture registry:**
-   ```kotlin
-   ModelRegistry.register("llama", ::llamaNetwork)
-   ModelRegistry.register("qwen3", ::qwenNetwork)
-   ModelRegistry.register("gemma", ::gemmaNetwork)
+   skainet -m model.gguf "The capital of France is"   # auto-detect, generate
+   skainet -m model.gguf --chat                        # interactive chat
+   skainet -m model.gguf --demo "What is 2+2?"         # tool calling demo
    ```
 
-3. **Auto-detection from GGUF metadata** (already exists in `peekGgufMetadata()`)
-
-## Verification
+3. **Existing per-model CLIs are preserved** — no breaking changes
 
-- All existing unit tests pass (`llm-agent`, `llm-runtime:kllama`, `llm-core`)
-- Smoke test suite passes (generation + tool calling)
-- Basic generation produces identical output for all model families
-- Tool calling works for any model that supports ChatML/Qwen/Llama3 templates
-- `OptimizedLLMRuntime` in HYBRID mode matches hand-coded runtime output
-
-## Suggested Implementation Order
-
-1. **Phase 1** first — immediately unblocks tool calling for all models
-2. **Phase 3** next — reduces fragility (the GGUFTokenizer byte-level BPE issue)
+**Remaining (future work):**
+- Add Gemma3n loading path to unified CLI (requires gemmaNetwork() DSL)
+- Add Apertus loading path to unified CLI
+- Eventually deprecate per-model CLIs
+
+## All Phases Complete
+
+| Phase | Status | Summary |
+|-------|--------|---------|
+| 1. Decouple tool calling | DONE | ChatSession, Tokenizer interface, no GGUFTokenizer coupling |
+| 2. Model registry | DONE | ModelRegistry, UnifiedModelLoader, ModelFamily enum |
+| 3. Tokenization pipeline | DONE | GGUFTokenizer in llm-core, TokenizerFactory |
+| 4. Unified runner | DONE | skainet-cli with auto-detection |
 3. **Phase 2** then — biggest refactor, needs per-model validation
 4. **Phase 4** last — depends on all other phases
diff --git a/llm-apps/skainet-cli/build.gradle.kts b/llm-apps/skainet-cli/build.gradle.kts
new file mode 100644
index 0000000..82fc1d5
--- /dev/null
+++ b/llm-apps/skainet-cli/build.gradle.kts
@@ -0,0 +1,54 @@
+plugins {
+    kotlin("jvm")
+    alias(libs.plugins.shadow)
+    application
+}
+
+application {
+    mainClass.set("sk.ainet.apps.skainet.cli.MainKt")
+}
+
+dependencies {
+    // Core
+    implementation(project(":llm-core"))
+    implementation(project(":llm-agent"))
+
+    // Model runtimes (all architectures)
+    implementation(project(":llm-runtime:kllama"))
+
+    // Inference modules (for network loaders)
+    implementation(project(":llm-inference:llama"))
+
+    // SKaiNET core libraries
+    implementation(libs.skainet.lang.core)
+    implementation(libs.skainet.backend.cpu)
+    implementation(libs.skainet.io.core)
+    implementation(libs.skainet.io.gguf)
+    implementation(libs.kotlinx.io.core)
+    implementation(libs.kotlinx.coroutines)
+    implementation(libs.kotlinx.serialization.json)
+}
+
+tasks.withType<com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar> {
+    archiveBaseName.set("skainet")
+    archiveClassifier.set("all")
+    archiveVersion.set("")
+
+    manifest {
+        attributes(
+            "Main-Class" to "sk.ainet.apps.skainet.cli.MainKt",
+            "Add-Opens" to "java.base/jdk.internal.misc",
+            "Multi-Release" to "true"
+        )
+    }
+
+    mergeServiceFiles()
+}
+
+tasks.withType<Test>().configureEach {
+    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector")
+}
+
+tasks.withType<JavaExec>().configureEach {
+    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx12g", "-XX:MaxDirectMemorySize=64g")
+}
diff --git a/llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt b/llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt
new file mode 100644
index 0000000..73f04b2
--- /dev/null
+++ b/llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt
@@ -0,0 +1,244 @@
+package sk.ainet.apps.skainet.cli
+
+import sk.ainet.apps.kllama.CpuAttentionBackend
+import sk.ainet.apps.kllama.cli.AgentCli
+import sk.ainet.apps.kllama.cli.ToolCallingDemo
+import sk.ainet.apps.llm.InferenceRuntime
+import sk.ainet.apps.llm.Tokenizer
+import sk.ainet.apps.llm.UnifiedModelLoader
+import sk.ainet.apps.llm.generate
+import sk.ainet.apps.llm.backend.BackendRegistry
+import sk.ainet.apps.llm.backend.bestAvailable
+import sk.ainet.apps.llm.tokenizer.TokenizerFactory
+import sk.ainet.apps.kllama.chat.ModelMetadata
+import sk.ainet.context.DirectCpuExecutionContext
+import sk.ainet.io.JvmRandomAccessSource
+import sk.ainet.io.model.QuantPolicy
+import sk.ainet.lang.tensor.data.MemorySegmentTensorDataFactory
+import sk.ainet.lang.types.FP32
+import sk.ainet.models.llama.LlamaRuntime
+import sk.ainet.models.llama.LlamaWeightLoader
+import sk.ainet.models.llama.LlamaWeightMapper
+import sk.ainet.models.llama.MemSegWeightConverter
+import java.lang.foreign.Arena
+import java.nio.file.Path
+import kotlin.io.path.exists
+import kotlin.io.path.extension
+import kotlin.system.exitProcess
+import kotlinx.coroutines.runBlocking
+import kotlin.time.measureTime
+
+private data class CliArgs(
+    val modelPath: Path,
+    val steps: Int,
+    val temperature: Float,
+    val prompt: String?,
+    val chatMode: Boolean,
+    val agentMode: Boolean,
+    val demoMode: Boolean,
+    val templateName: String?,
+    val contextLength: Int?
+)
+
+private fun usage(errorMessage: String? = null): Nothing {
+    if (errorMessage != null) {
+        System.err.println("Error: $errorMessage")
+        System.err.println()
+    }
+
+    println("Usage: skainet -m <model.gguf> [-s <steps>] [-k <temperature>] [--chat] [--agent] [--demo] [--template=NAME] <prompt>")
+    println()
+    println("  -m, --model         Path to .gguf model (required)")
+    println("  -s, --steps         Generation steps (default: 64)")
+    println("  -k, --temperature   Sampling temperature (default: 0.8)")
+    println("  --chat              Interactive chat mode")
+    println("  --agent             Interactive agent mode with tool calling")
+    println("  --demo              Tool calling demo with file listing and calculator")
+    println("  --template=NAME     Chat template: llama3, chatml, qwen, gemma (auto-detected if omitted)")
+    println("  --context=N         Cap context length to N tokens")
+    println("  -h, --help          Show this help")
+    println()
+    println("Supported architectures (auto-detected from GGUF metadata):")
+    println("  LLaMA, Mistral, Qwen2, Qwen3, Gemma, Apertus")
+    println()
+    println("Examples:")
+    println("  skainet -m model.gguf \"The capital of France is\"")
+    println("  skainet -m model.gguf --chat")
+    println("  skainet -m model.gguf --demo \"What is 2 + 2?\"")
+    exitProcess(if (errorMessage == null) 0 else 1)
+}
+
+private fun parseArgs(args: Array<String>): CliArgs {
+    if (args.isEmpty()) usage("Missing arguments.")
+
+    var model: String? = null
+    var steps = 64
+    var temperature = 0.8f
+    var prompt: String? = null
+    var chatMode = false
+    var agentMode = false
+    var demoMode = false
+    var templateName: String? = null
+    var contextLength: Int? = null
+
+    var idx = 0
+    fun nextValue(flag: String): String {
+        if (idx + 1 >= args.size) usage("$flag requires a value.")
+        return args[++idx]
+    }
+
+    while (idx < args.size) {
+        val arg = args[idx]
+        when {
+            arg == "-h" || arg == "--help" -> usage()
+            arg == "-m" || arg == "--model" -> model = nextValue(arg)
+            arg.startsWith("--model=") -> model = arg.substringAfter("=")
+            arg == "-s" || arg == "--steps" -> {
+                val value = nextValue(arg)
+                steps = value.toIntOrNull() ?: usage("Invalid steps '$value'.")
+            }
+            arg == "-k" || arg == "--temperature" -> {
+                val value = nextValue(arg)
+                temperature = value.toFloatOrNull() ?: usage("Invalid temperature '$value'.")
+            }
+            arg == "--chat" -> chatMode = true
+            arg == "--agent" -> agentMode = true
+            arg == "--demo" -> demoMode = true
+            arg.startsWith("--template=") -> templateName = arg.substringAfter("=")
+            arg.startsWith("--context=") -> {
+                val value = arg.substringAfter("=")
+                contextLength = value.toIntOrNull() ?: usage("Invalid context length '$value'.")
+            }
+            arg.startsWith("-") -> usage("Unknown option '$arg'.")
+            else -> {
+                if (prompt != null) usage("Multiple prompts provided.")
+                prompt = arg
+            }
+        }
+        idx++
+    }
+
+    val modelPath = model?.let { Path.of(it) } ?: usage("Model is required (-m/--model).")
+
+    if (!chatMode && !agentMode && !demoMode && prompt == null) {
+        usage("Prompt is required (or use --chat/--agent/--demo mode).")
+    }
+
+    return CliArgs(modelPath, steps, temperature, prompt, chatMode, agentMode, demoMode, templateName, contextLength)
+}
+
+fun main(args: Array<String>) {
+    runBlocking {
+        val cliArgs = parseArgs(args)
+        val modelPath = cliArgs.modelPath
+
+        if (!modelPath.exists()) error("Model not found: $modelPath")
+        if (modelPath.extension.lowercase() != "gguf") {
+            error("Only GGUF models are supported by the unified CLI. Use model-specific CLIs for other formats.")
+        }
+
+        // Auto-detect architecture
+        val modelInfo = UnifiedModelLoader.peek { JvmRandomAccessSource.open(modelPath.toString()) }
+        println("Architecture: ${modelInfo.architecture}, Family: ${modelInfo.family.displayName}")
+        println("Dimensions: ${modelInfo.embeddingLength}d, ${modelInfo.blockCount} layers, vocab=${modelInfo.vocabSize}")
+
+        // Select backend
+        val provider = BackendRegistry.bestAvailable()
+        println("Backend: ${provider.displayName}")
+
+        // Set up execution context
+        val quantArena = Arena.ofShared()
+        val memSegFactory = MemorySegmentTensorDataFactory()
+        val ctx = DirectCpuExecutionContext(tensorDataFactory = memSegFactory)
+
+        Runtime.getRuntime().addShutdownHook(Thread {
+            quantArena.close()
+            memSegFactory.close()
+        })
+
+        // Load model based on detected family
+        val acceptedArchitectures = modelInfo.family.architectures + setOf(modelInfo.architecture)
+        val loader = LlamaWeightLoader(
+            randomAccessProvider = { JvmRandomAccessSource.open(modelPath.toString()) },
+            quantPolicy = QuantPolicy.NATIVE_OPTIMIZED,
+            acceptedArchitectures = acceptedArchitectures
+        )
+
+        println("Loading GGUF model from $modelPath (${modelInfo.family.displayName}, streaming)...")
+        val loaded = loader.loadToMapStreaming<FP32, Float>(ctx, FP32::class)
+        val rawWeights = LlamaWeightMapper.map(loaded)
+
+        val runtimeWeights = if (rawWeights.quantTypes.isNotEmpty()) {
+            println("Converting ${rawWeights.quantTypes.size} quantized tensors to SIMD format...")
+            MemSegWeightConverter.convert(rawWeights, ctx, quantArena)
+        } else {
+            rawWeights
+        }
+
+        if (cliArgs.contextLength != null) {
+            println("Context length capped to ${cliArgs.contextLength} (model default: ${runtimeWeights.metadata.contextLength})")
+        }
+
+        val backend = CpuAttentionBackend<FP32>(
+            ctx, runtimeWeights, FP32::class,
+            ropeFreqBase = runtimeWeights.metadata.ropeFreqBase,
+            maxContextLength = cliArgs.contextLength
+        )
+
+        @Suppress("DEPRECATION")
+        val runtime: InferenceRuntime<FP32> = LlamaRuntime<FP32>(
+            ctx, runtimeWeights, backend, FP32::class,
+            eps = runtimeWeights.metadata.rmsNormEps
+        )
+
+        // Load tokenizer from GGUF
+        println("Loading embedded GGUF tokenizer...")
+        val tokenizer: Tokenizer = JvmRandomAccessSource.open(modelPath.toString()).use { source ->
+            TokenizerFactory.fromGGUF(source)
+        }
+
+        // Build model metadata for chat template auto-detection
+        val metadata = ModelMetadata(
+            family = modelInfo.family.id,
+            architecture = modelInfo.architecture,
+            chatTemplate = modelInfo.fields["tokenizer.chat_template"] as? String
+        )
+
+        // Dispatch
+        if (cliArgs.chatMode || cliArgs.agentMode || cliArgs.demoMode) {
+            when {
+                cliArgs.demoMode -> {
+                    val demo = ToolCallingDemo(runtime, tokenizer, cliArgs.templateName, metadata)
+                    demo.run(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+                }
+                cliArgs.agentMode -> {
+                    val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata)
+                    agentCli.runAgent(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+                }
+                else -> {
+                    val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata)
+                    agentCli.runChat(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+                }
+            }
+            return@runBlocking
+        }
+
+        // Standard generation mode
+        val promptText = cliArgs.prompt ?: error("Prompt is required for standard generation mode.")
+        val promptTokens = tokenizer.encode(promptText)
+
+        println("Generating ${cliArgs.steps} tokens with temperature=${cliArgs.temperature}...")
+        println("---")
+        print(promptText)
+
+        val elapsed = measureTime {
+            runtime.generate(prompt = promptTokens, steps = cliArgs.steps, temperature = cliArgs.temperature) { id ->
+                print(tokenizer.decode(id))
+            }
+        }.inWholeMilliseconds
+
+        val tokPerSec = cliArgs.steps / elapsed.toDouble() * 1000
+        println("\n---")
+        println("tok/s: $tokPerSec")
+    }
+}
diff --git a/settings.gradle.kts b/settings.gradle.kts
index 1901751..47902e2 100644
--- a/settings.gradle.kts
+++ b/settings.gradle.kts
@@ -28,6 +28,7 @@ include("llm-runtime:kgemma")
 include("llm-runtime:kqwen")
 include("llm-runtime:kapertus")
 include("llm-performance")
+include("llm-apps:skainet-cli")
 include("llm-apps:kllama-cli")
 include("llm-apps:kbert-cli")
 include("llm-apps:kapertus-cli")
diff --git a/tests/smoke/smoke-test.sh b/tests/smoke/smoke-test.sh
index 42a3b69..4d64f36 100755
--- a/tests/smoke/smoke-test.sh
+++ b/tests/smoke/smoke-test.sh
@@ -39,6 +39,7 @@ separator() {
 # Maps runner name → Gradle task
 runner_task() {
   case "$1" in
+    skainet)  echo ":llm-apps:skainet-cli:run" ;;
     kllama)   echo ":llm-apps:kllama-cli:run" ;;
     kgemma)   echo ":llm-runtime:kgemma:jvmRun" ;;
     kqwen)    echo ":llm-runtime:kqwen:jvmRun" ;;
@@ -52,6 +53,7 @@ runner_task() {
 # Maps runner name → compile task
 runner_compile_task() {
   case "$1" in
+    skainet)  echo ":llm-apps:skainet-cli:classes" ;;
     kllama)   echo ":llm-apps:kllama-cli:classes" ;;
     kgemma)   echo ":llm-runtime:kgemma:jvmMainClasses" ;;
     kqwen)    echo ":llm-runtime:kqwen:jvmMainClasses" ;;

From 34eda36bc8a0712716d003ca1dcbb1066bb7393d Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 11 Apr 2026 12:20:14 +0200
Subject: [PATCH 5/9] feat: add tool calling smoke tests and single-shot demo
 mode

Add tool calling test phase to smoke-test.sh that runs --demo with a
prompt for models with toolCalling config. Add Qwen3-8B-Q4 to smoke
test config.

- Add ToolCallingDemo.runSingleShot() for non-interactive tool calling
- Wire --demo with positional prompt to single-shot mode in kllama and skainet CLIs
- Add tool calling section to smoke-test.sh with [Tool Call] detection
- Add skainet runner to smoke-test.sh runner_args
- Increase kllama-cli memory to -Xmx42g -XX:MaxDirectMemorySize=64g
- Add Qwen3-8B-Q4 and toolCalling config to smoke-models.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 llm-apps/kllama-cli/build.gradle.kts          |   2 +-
 llm-apps/skainet-cli/build.gradle.kts         |   2 +-
 .../kotlin/sk/ainet/apps/skainet/cli/Main.kt  |   6 +-
 .../kotlin/sk/ainet/apps/kllama/cli/Main.kt   |   6 +-
 .../ainet/apps/kllama/cli/ToolCallingDemo.kt  |  64 ++++++++++
 tests/smoke/smoke-models.json                 |  35 +++---
 tests/smoke/smoke-test.sh                     | 110 +++++++++++++++++-
 7 files changed, 202 insertions(+), 23 deletions(-)

diff --git a/llm-apps/kllama-cli/build.gradle.kts b/llm-apps/kllama-cli/build.gradle.kts
index eca85d4..3203058 100644
--- a/llm-apps/kllama-cli/build.gradle.kts
+++ b/llm-apps/kllama-cli/build.gradle.kts
@@ -34,5 +34,5 @@ tasks.withType<Test>().configureEach {
 }
 
 tasks.withType<JavaExec>().configureEach {
-    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx12g")
+    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx42g", "-XX:MaxDirectMemorySize=64g")
 }
diff --git a/llm-apps/skainet-cli/build.gradle.kts b/llm-apps/skainet-cli/build.gradle.kts
index 82fc1d5..7a999be 100644
--- a/llm-apps/skainet-cli/build.gradle.kts
+++ b/llm-apps/skainet-cli/build.gradle.kts
@@ -50,5 +50,5 @@ tasks.withType<Test>().configureEach {
 }
 
 tasks.withType<JavaExec>().configureEach {
-    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx12g", "-XX:MaxDirectMemorySize=64g")
+    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector", "-Xmx48g", "-XX:MaxDirectMemorySize=64g")
 }
diff --git a/llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt b/llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt
index 73f04b2..c7d62bb 100644
--- a/llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt
+++ b/llm-apps/skainet-cli/src/main/kotlin/sk/ainet/apps/skainet/cli/Main.kt
@@ -209,7 +209,11 @@ fun main(args: Array<String>) {
             when {
                 cliArgs.demoMode -> {
                     val demo = ToolCallingDemo(runtime, tokenizer, cliArgs.templateName, metadata)
-                    demo.run(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+                    if (cliArgs.prompt != null) {
+                        demo.runSingleShot(cliArgs.prompt, maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+                    } else {
+                        demo.run(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+                    }
                 }
                 cliArgs.agentMode -> {
                     val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata)
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
index a5aa828..a89da0e 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
@@ -496,7 +496,11 @@ fun main(args: Array<String>) {
             when {
                 cliArgs.demoMode -> {
                     val demo = ToolCallingDemo(runtime, tokenizer, cliArgs.templateName, metadata)
-                    demo.run(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+                    if (cliArgs.prompt != null) {
+                        demo.runSingleShot(cliArgs.prompt, maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+                    } else {
+                        demo.run(maxTokens = cliArgs.steps, temperature = cliArgs.temperature)
+                    }
                 }
                 cliArgs.agentMode -> {
                     val agentCli = AgentCli(runtime, tokenizer, cliArgs.templateName, metadata)
diff --git a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/ToolCallingDemo.kt b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/ToolCallingDemo.kt
index bd6d381..02c38ca 100644
--- a/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/ToolCallingDemo.kt
+++ b/llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/ToolCallingDemo.kt
@@ -46,6 +46,70 @@ public class ToolCallingDemo<T : DType>(
     /**
      * Run the tool-calling demo with `list_files` and `calculator` tools.
      */
+    /**
+     * Run a single non-interactive tool calling round. Used by smoke tests.
+     *
+     * @param prompt The user prompt.
+     * @param maxTokens Maximum tokens per round.
+     * @param temperature Sampling temperature.
+     */
+    public fun runSingleShot(
+        prompt: String,
+        maxTokens: Int = 256,
+        temperature: Float = 0.7f
+    ) {
+        val registry = ToolRegistry()
+        registry.register(ListFilesTool())
+        registry.register(CalculatorTool())
+
+        println("Tool Calling Smoke Test")
+        println("Available tools: ${registry.definitions().joinToString { it.name }}")
+        println("Prompt: \"$prompt\"")
+        println("---")
+
+        val agentLoop = session.createAgentLoop(registry, maxTokens, temperature)
+
+        val systemPrompt = """You are a helpful assistant with access to tools.
+When the user asks about files or directories, use the list_files tool to look up the actual contents.
+When the user asks to calculate something, use the calculator tool.
+Always use a tool when one is relevant — do not guess file listings."""
+
+        val messages = mutableListOf(
+            ChatMessage(role = ChatRole.SYSTEM, content = systemPrompt),
+            ChatMessage(role = ChatRole.USER, content = prompt)
+        )
+
+        val listener = object : AgentListener {
+            override fun onToken(token: String) {
+                print(token)
+                System.out.flush()
+            }
+            override fun onAssistantMessage(text: String) { println() }
+            override fun onToolCalls(calls: List<ToolCall>) {
+                for (call in calls) println("[Tool Call] ${call.name}(${call.arguments})")
+            }
+            override fun onToolResult(call: ToolCall, result: String) {
+                println("[Tool Result] ${call.name} -> $result")
+                print("Assistant: ")
+                System.out.flush()
+            }
+            override fun onComplete(finalResponse: String) {}
+        }
+
+        print("Assistant: ")
+        System.out.flush()
+
+        agentLoop.runWithEncoder(
+            messages = messages,
+            encode = { text -> tokenizer.encode(text) },
+            listener = listener
+        )
+        println()
+    }
+
+    /**
+     * Run the interactive tool-calling demo.
+     */
     public fun run(
         maxTokens: Int = 512,
         temperature: Float = 0.7f
diff --git a/tests/smoke/smoke-models.json b/tests/smoke/smoke-models.json
index fea03e0..2958517 100644
--- a/tests/smoke/smoke-models.json
+++ b/tests/smoke/smoke-models.json
@@ -6,31 +6,30 @@
   },
   "models": [
     {
-      "name": "Llama-3.2-1B-Q8",
+      "name": "TinyLlama-1.1B-Q8",
       "runner": "kllama",
-      "model": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q8_0.gguf",
+      "model": "tinyllama-1.1b-chat-v1.0.Q8_0.gguf",
       "format": "gguf"
     },
     {
-      "name": "Gemma-2B-SafeTensors",
-      "runner": "kgemma",
-      "model": "unsloth/gemma-3-270m-it-GGUF/gemma-3-270m-it-Q8_0.gguf",
+      "name": "Qwen3-1.7B-Q8",
+      "runner": "kqwen",
+      "model": "Qwen3-1.7B-Q8_0.gguf",
       "format": "gguf",
-      "steps": 16
+      "toolCalling": {
+        "prompt": "What is 2 + 2?",
+        "steps": 256
+      }
     },
     {
-      "name": "Qwen3-1.7B",
-      "runner": "qwen",
-      "model": "Qwen3-1.7B-GGUF/Qwen3-1.7B-Q8_0.gguf",
-      "format": "safetensors",
-      "prompt": "Hello world",
-    },
-    {
-      "name": "BERT-MiniLM",
-      "runner": "kbert",
-      "model": "~/.cache/huggingface/models/all-MiniLM-L6-v2",
-      "format": "safetensors",
-      "prompt": "Hello world"
+      "name": "Qwen3-8B-Q4",
+      "runner": "kllama",
+      "model": "Qwen3-8B-Q4_K_M.gguf",
+      "format": "gguf",
+      "toolCalling": {
+        "prompt": "What is 2 + 2?",
+        "steps": 256
+      }
     }
   ]
 }
diff --git a/tests/smoke/smoke-test.sh b/tests/smoke/smoke-test.sh
index 4d64f36..2756e78 100755
--- a/tests/smoke/smoke-test.sh
+++ b/tests/smoke/smoke-test.sh
@@ -69,6 +69,7 @@ runner_args() {
   local runner="$1" model="$2" prompt="$3" steps="$4" temp="$5" doc="${6:-}" output="${7:-}"
 
   case "$runner" in
+    skainet)  echo "-m ${model} -s ${steps} -k ${temp} \"${prompt}\"" ;;
     kllama)   echo "-m ${model} -s ${steps} -k ${temp} \"${prompt}\"" ;;
     kgemma)   echo "${model} \"${prompt}\" ${steps} ${temp}" ;;
     kqwen)    echo "${model} \"${prompt}\" ${steps} ${temp}" ;;
@@ -254,9 +255,93 @@ print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}')
     separator
   done
 
+  # ── Tool Calling Tests ───────────────────────────────────────────
+  TC_COUNT=$(python3 -c "
+import json
+cfg = json.load(open('${CONFIG_FILE}'))
+print(sum(1 for m in cfg['models'] if m.get('toolCalling')))
+")
+
+  declare -a tc_results=()
+  tc_pass=0
+  tc_fail=0
+
+  if [[ "$TC_COUNT" -gt 0 ]]; then
+    echo ""
+    echo -e "${BOLD}Tool Calling Tests${RESET} ($TC_COUNT models)"
+    separator
+
+    kllama_task=$(runner_task "kllama")
+
+    for i in $(seq 0 $((MODEL_COUNT - 1))); do
+      eval "$(python3 -c "
+import json
+cfg = json.load(open('${CONFIG_FILE}'))
+m = cfg['models'][$i]
+tc = m.get('toolCalling')
+if tc is None:
+    print('TC_ENABLED=false')
+else:
+    print('TC_ENABLED=true')
+    print(f'TC_PROMPT={repr(tc.get(\"prompt\", \"What is 2 + 2?\"))}')
+    print(f'TC_STEPS={tc.get(\"steps\", 256)}')
+    print(f'M_NAME={repr(m[\"name\"])}')
+    print(f'M_MODEL={repr(m[\"model\"])}')
+")"
+
+      [[ "$TC_ENABLED" != "true" ]] && continue
+
+      M_MODEL=$(expand_path "$M_MODEL")
+
+      echo -e "\n${BOLD}Model:${RESET}  $M_NAME (tool calling)"
+      echo -e "${BOLD}Prompt:${RESET} \"$TC_PROMPT\""
+
+      if [[ ! -e "$M_MODEL" ]]; then
+        echo -e "  ${RED}FAIL${RESET} (model path not found)"
+        tc_fail=$((tc_fail + 1))
+        tc_results+=("FAIL|$M_NAME|not found|-")
+        separator
+        continue
+      fi
+
+      start_ts=$(python3 -c 'import time; print(time.time())')
+      output_file=$(mktemp)
+      exit_code=0
+
+      $GRADLE "$kllama_task" --quiet \
+        --args="-m ${M_MODEL} --demo -s ${TC_STEPS} -k 0.7 \"${TC_PROMPT}\"" \
+        > "$output_file" 2>&1 || exit_code=$?
+
+      end_ts=$(python3 -c 'import time; print(time.time())')
+      wall_sec=$(python3 -c "print(f'{$end_ts - $start_ts:.1f}')")
+
+      if [[ $exit_code -ne 0 ]]; then
+        echo -e "  ${RED}FAIL${RESET} (exit $exit_code, wall ${wall_sec}s)"
+        tail -5 "$output_file" | sed 's/^/  │ /'
+        tc_fail=$((tc_fail + 1))
+        tc_results+=("FAIL|$M_NAME|exit $exit_code|${wall_sec}s")
+      elif grep -q '\[Tool Call\]' "$output_file"; then
+        tool_name=$(grep -oE '\[Tool Call\] [a-z_]+' "$output_file" | head -1 | sed 's/\[Tool Call\] //')
+        echo -e "  ${GREEN}OK${RESET}   tool called: ${CYAN}${tool_name}${RESET}  wall: ${wall_sec}s"
+        grep '\[Tool Call\]' "$output_file" | head -2 | sed 's/^/  │ /'
+        grep '\[Tool Result\]' "$output_file" | head -2 | sed 's/^/  │ /'
+        tc_pass=$((tc_pass + 1))
+        tc_results+=("OK|$M_NAME|$tool_name|${wall_sec}s")
+      else
+        echo -e "  ${YELLOW}WARN${RESET} (no tool call detected, wall ${wall_sec}s)"
+        tail -5 "$output_file" | sed 's/^/  │ /'
+        tc_fail=$((tc_fail + 1))
+        tc_results+=("WARN|$M_NAME|no tool call|${wall_sec}s")
+      fi
+
+      rm -f "$output_file"
+      separator
+    done
+  fi
+
   # ── Summary ──────────────────────────────────────────────────────
   echo ""
-  echo -e "${BOLD}Summary${RESET}"
+  echo -e "${BOLD}Summary — Generation${RESET}"
   separator
   printf "  %-6s %-30s %-8s %8s %10s %8s\n" "Status" "Model" "Runner" "Size" "tok/s" "Wall"
   separator
@@ -272,6 +357,29 @@ print(f'M_OUTPUT={repr(m.get(\"output\", \"\"))}')
   done
   separator
   echo -e "  ${GREEN}Pass: $pass${RESET}  ${RED}Fail: $fail${RESET}  Total: ${MODEL_COUNT}"
+
+  if [[ "$TC_COUNT" -gt 0 ]]; then
+    echo ""
+    echo -e "${BOLD}Summary — Tool Calling${RESET}"
+    separator
+    printf "  %-6s %-30s %-15s %8s\n" "Status" "Model" "Tool" "Wall"
+    separator
+    for r in "${tc_results[@]}"; do
+      IFS='|' read -r status name tool wall <<< "$r"
+      if [[ "$status" == "OK" ]]; then
+        color="$GREEN"
+      elif [[ "$status" == "WARN" ]]; then
+        color="$YELLOW"
+      else
+        color="$RED"
+      fi
+      printf "  ${color}%-6s${RESET} %-30s %-15s %8s\n" \
+        "$status" "${name:0:30}" "$tool" "$wall"
+    done
+    separator
+    echo -e "  ${GREEN}Pass: $tc_pass${RESET}  ${RED}Fail: $tc_fail${RESET}  Total: ${TC_COUNT}"
+  fi
+
   echo ""
   exit 0
 fi

From 628550b2166c2e5347bd6ccfa4135be9fb398a0c Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 11 Apr 2026 12:28:30 +0200
Subject: [PATCH 6/9] docs: add Antora documentation site following Divio
 standard

AsciiDoc documentation in Antora site format with four Divio categories:

Tutorials:
- Getting started with the skainet CLI
- Tool calling with any model via ChatSession
- Running smoke tests

How-to Guides:
- Add a new model architecture (DSL vs hand-coded)
- Add a compute backend
- Add a custom tool
- Use the unified CLI

Reference:
- Architecture overview and module structure
- Inference pipeline stages
- Tokenizer API and TokenizerFactory
- ChatSession API
- Model Registry and UnifiedModelLoader
- CLI reference (skainet + model-specific CLIs)

Explanation:
- Pipeline design decisions (why stages are separated)
- DSL networks vs hand-coded runtimes (trade-offs)
- Tokenizer internals (SentencePiece, BPE, WordPiece)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 docs/antora.yml                               |   5 +
 docs/modules/ROOT/nav.adoc                    |  25 ++++
 .../pages/explanation/dsl-vs-handcoded.adoc   | 124 +++++++++++++++++
 .../pages/explanation/pipeline-design.adoc    |  47 +++++++
 .../explanation/tokenizer-internals.adoc      |  68 +++++++++
 .../pages/how-to/add-compute-backend.adoc     |  59 ++++++++
 docs/modules/ROOT/pages/how-to/add-model.adoc |  86 ++++++++++++
 docs/modules/ROOT/pages/how-to/add-tool.adoc  |  79 +++++++++++
 .../ROOT/pages/how-to/run-unified-cli.adoc    |  91 ++++++++++++
 docs/modules/ROOT/pages/index.adoc            |  66 +++++++++
 .../ROOT/pages/reference/architecture.adoc    |  69 +++++++++
 .../pages/reference/chat-session-api.adoc     |  89 ++++++++++++
 .../ROOT/pages/reference/cli-reference.adoc   | 128 +++++++++++++++++
 .../ROOT/pages/reference/model-registry.adoc  | 106 ++++++++++++++
 .../ROOT/pages/reference/pipeline.adoc        |  90 ++++++++++++
 .../ROOT/pages/reference/tokenizer-api.adoc   |  71 ++++++++++
 .../ROOT/pages/tutorials/getting-started.adoc |  66 +++++++++
 .../ROOT/pages/tutorials/smoke-tests.adoc     | 117 ++++++++++++++++
 .../ROOT/pages/tutorials/tool-calling.adoc    | 131 ++++++++++++++++++
 19 files changed, 1517 insertions(+)
 create mode 100644 docs/antora.yml
 create mode 100644 docs/modules/ROOT/nav.adoc
 create mode 100644 docs/modules/ROOT/pages/explanation/dsl-vs-handcoded.adoc
 create mode 100644 docs/modules/ROOT/pages/explanation/pipeline-design.adoc
 create mode 100644 docs/modules/ROOT/pages/explanation/tokenizer-internals.adoc
 create mode 100644 docs/modules/ROOT/pages/how-to/add-compute-backend.adoc
 create mode 100644 docs/modules/ROOT/pages/how-to/add-model.adoc
 create mode 100644 docs/modules/ROOT/pages/how-to/add-tool.adoc
 create mode 100644 docs/modules/ROOT/pages/how-to/run-unified-cli.adoc
 create mode 100644 docs/modules/ROOT/pages/index.adoc
 create mode 100644 docs/modules/ROOT/pages/reference/architecture.adoc
 create mode 100644 docs/modules/ROOT/pages/reference/chat-session-api.adoc
 create mode 100644 docs/modules/ROOT/pages/reference/cli-reference.adoc
 create mode 100644 docs/modules/ROOT/pages/reference/model-registry.adoc
 create mode 100644 docs/modules/ROOT/pages/reference/pipeline.adoc
 create mode 100644 docs/modules/ROOT/pages/reference/tokenizer-api.adoc
 create mode 100644 docs/modules/ROOT/pages/tutorials/getting-started.adoc
 create mode 100644 docs/modules/ROOT/pages/tutorials/smoke-tests.adoc
 create mode 100644 docs/modules/ROOT/pages/tutorials/tool-calling.adoc

diff --git a/docs/antora.yml b/docs/antora.yml
new file mode 100644
index 0000000..25fc5ef
--- /dev/null
+++ b/docs/antora.yml
@@ -0,0 +1,5 @@
+name: skainet-transformers
+title: SKaiNET Transformers
+version: ~
+nav:
+  - modules/ROOT/nav.adoc
diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
new file mode 100644
index 0000000..5bc1fc9
--- /dev/null
+++ b/docs/modules/ROOT/nav.adoc
@@ -0,0 +1,25 @@
+* xref:index.adoc[Overview]
+
+.Tutorials
+* xref:tutorials/getting-started.adoc[Getting Started]
+* xref:tutorials/tool-calling.adoc[Tool Calling with Any Model]
+* xref:tutorials/smoke-tests.adoc[Running Smoke Tests]
+
+.How-to Guides
+* xref:how-to/add-model.adoc[Add a New Model Architecture]
+* xref:how-to/add-compute-backend.adoc[Add a Compute Backend]
+* xref:how-to/add-tool.adoc[Add a Custom Tool]
+* xref:how-to/run-unified-cli.adoc[Use the Unified CLI]
+
+.Reference
+* xref:reference/architecture.adoc[Architecture Overview]
+* xref:reference/pipeline.adoc[Inference Pipeline]
+* xref:reference/tokenizer-api.adoc[Tokenizer API]
+* xref:reference/chat-session-api.adoc[ChatSession API]
+* xref:reference/model-registry.adoc[Model Registry]
+* xref:reference/cli-reference.adoc[CLI Reference]
+
+.Explanation
+* xref:explanation/pipeline-design.adoc[Pipeline Design Decisions]
+* xref:explanation/dsl-vs-handcoded.adoc[DSL Networks vs Hand-Coded Runtimes]
+* xref:explanation/tokenizer-internals.adoc[Tokenizer Internals]
diff --git a/docs/modules/ROOT/pages/explanation/dsl-vs-handcoded.adoc b/docs/modules/ROOT/pages/explanation/dsl-vs-handcoded.adoc
new file mode 100644
index 0000000..64f39fb
--- /dev/null
+++ b/docs/modules/ROOT/pages/explanation/dsl-vs-handcoded.adoc
@@ -0,0 +1,124 @@
+= DSL Networks vs Hand-Coded Runtimes
+:description: Why DSL network definitions replace hand-coded runtimes.
+
+== Two Approaches to Model Definition
+
+SKaiNET Transformers supports two ways to define a model's forward pass:
+
+=== Hand-Coded Runtime (Legacy)
+
+A class that extends `DecoderRuntime` and implements each layer explicitly:
+
+[source,kotlin]
+----
+class LlamaRuntime<T>(/* ... */) : DecoderRuntime<T>(ctx, dtype) {
+    override fun runLayer(layerIdx: Int, x: Tensor<T, Float>): Tensor<T, Float> {
+        val normed = rmsNorm(x, weights.attnNorm[layerIdx])
+        val q = matmul(normed, weights.wq[layerIdx])
+        val k = matmul(normed, weights.wk[layerIdx])
+        // ... 50+ lines of attention + FFN
+    }
+}
+----
+
+=== DSL Network Definition (Current)
+
+A pure function that declares the architecture using the network DSL:
+
+[source,kotlin]
+----
+fun <T : DType, V> llamaNetwork(metadata: LlamaModelMetadata): Module<T, V> {
+    return sequential<T, V> {
+        embedding(vocabSize, dim, id = "token_embd")
+        for (layer in 0 until nLayers) {
+            rmsNorm(dim, eps, id = "attn_norm")
+            multiHeadAttention(dim, nHeads, nKVHeads, causal = true) {
+                rope(headDim, seqLen)
+                kvCache(seqLen, nKVHeads, headDim)
+            }
+            residual()
+            rmsNorm(dim, eps, id = "ffn_norm")
+            swiGluFFN(dim, ffnDim)
+            residual()
+        }
+        rmsNorm(dim, eps, id = "output_norm")
+    }
+}
+----
+
+== Why DSL is Preferred
+
+=== Compute Graph Optimization
+
+DSL networks can be traced into a ComputeGraph (DAG) and optimized:
+
+* *TransposeEliminationPass* -- folds weight transposes into matmul, eliminating O(n^2) copies
+* *LLMFusionPass* -- fuses RMSNorm (7 ops -> 1), SwiGLU FFN (5 ops -> 1), QKV projections (3 -> 1)
+* *DeadCodeEliminationPass* -- removes unused intermediate tensors
+
+Hand-coded runtimes cannot benefit from these optimizations because operations are imperative, not declarative.
+
+=== Weight Loading is Automatic
+
+DSL modules have named parameters (e.g., `"blk.0/attn/q_proj"`).
+`WeightMapper` matches these to GGUF tensor names via `WeightNameResolver`.
+No manual weight loading code needed.
+
+=== Multiple Execution Modes
+
+The same DSL definition supports:
+
+`DIRECT`:: Execute the module tree directly (debugging, correctness testing)
+`HYBRID`:: Compile compute-heavy subgraphs, run attention imperatively (best balance)
+`OPTIMIZED`:: Full DAG compilation and execution (maximum performance)
+
+=== Adding New Architectures is Simpler
+
+A new architecture is a single function, not a 500-line class.
+If the architecture uses standard building blocks (MHA, RMSNorm, FFN), the DSL already has them.
+
+== When Hand-Coded Runtimes Are Needed
+
+Some architectures have components the DSL cannot express:
+
+* *Qwen3.5 DeltaNet* -- hybrid DeltaNet (linear attention + SSM) layers with causal 1D convolution
+* *Gemma3n* -- variable FFN dimensions per layer (MatFormer), per-layer embeddings
+* *Voxtral* -- ODE flow matching for audio codec
+
+These use `DecoderRuntime` directly.
+The goal is to extend the DSL to support these patterns over time.
+
+== Current Status
+
+[cols="1,1,1"]
+|===
+|Model |DSL |Status
+
+|LLaMA/Mistral
+|`llamaNetwork()`
+|`LlamaRuntime` deprecated
+
+|Qwen2/3
+|`qwenNetwork()`
+|Delegates to `llamaNetwork()`
+
+|Apertus
+|`apertusNetwork()`
+|`ApertusRuntime` deprecated
+
+|BERT
+|`bertNetwork()`
+|`BertRuntime` deprecated
+
+|Voxtral
+|`voxtralBackboneNetwork()`
+|Partial DSL
+
+|Gemma3n
+|_none_
+|Hand-coded only
+
+|Qwen3.5
+|_none_
+|Hand-coded (DeltaNet)
+|===
diff --git a/docs/modules/ROOT/pages/explanation/pipeline-design.adoc b/docs/modules/ROOT/pages/explanation/pipeline-design.adoc
new file mode 100644
index 0000000..4f3202c
--- /dev/null
+++ b/docs/modules/ROOT/pages/explanation/pipeline-design.adoc
@@ -0,0 +1,47 @@
+= Pipeline Design Decisions
+:description: Why the inference pipeline is structured as separate stages.
+
+== The Problem
+
+Early SKaiNET had a monolithic approach: each model family (LLaMA, Gemma, Apertus) had its own hand-coded runtime that handled everything -- weight loading, forward pass, KV cache, tokenization, and generation.
+This led to:
+
+* *Duplicated logic* -- each runtime reimplemented `generate()`, `sample()`, `forward()`.
+* *Tight coupling* -- tool calling only worked with kllama because `ToolCallingDemo` depended on `GGUFTokenizer`, a kllama-specific class.
+* *No optimization* -- hand-coded runtimes couldn't benefit from compute graph optimization passes.
+
+== The Solution: Separated Pipeline Stages
+
+The pipeline is split into stages that are independently replaceable:
+
+[horizontal]
+Weight Loading:: Parse GGUF/SafeTensors into typed tensor maps. Model-format concern, not architecture concern.
+Network Definition:: Pure functions (`llamaNetwork()`, `apertusNetwork()`) that return a `Module` tree. Architecture concern only.
+Graph Compilation:: Trace the module tree into a DAG, apply optimization passes. Framework concern.
+Inference Runtime:: `forward(tokenId)` and `generate()`. Pure inference, no I/O.
+Tokenization:: `encode()`/`decode()`. Completely independent of model architecture.
+Chat Pipeline:: `ChatSession`, `AgentLoop`, `ChatTemplate`. Independent of both model and tokenizer implementation.
+
+== Key Design Decisions
+
+=== Tokenizer Interface with Metadata
+
+The `Tokenizer` interface includes `eosTokenId`, `bosTokenId`, and `vocabSize`.
+This eliminated the need for the `GGUFTokenizer` downcast that previously coupled tool calling to kllama.
+Any tokenizer implementation works with `ChatSession` and `AgentLoop`.
+
+=== ChatSession as the Composition Root
+
+Rather than having each CLI wire up `InferenceRuntime` + `Tokenizer` + `ChatTemplate` + `ToolRegistry` individually, `ChatSession` bundles them.
+A runner creates one `ChatSession` and gets chat, agent, and demo modes for free.
+
+=== ModelRegistry for Auto-Detection
+
+Instead of if/else chains in each CLI to determine which loader to use, `ModelRegistry.detect(architecture)` returns a `ModelFamily` enum with capabilities (tool calling support, chat template family).
+The unified `skainet` CLI uses this to load any GGUF model without architecture-specific flags.
+
+=== GGUFTokenizer in llm-core
+
+Moving `GGUFTokenizer` from `kllama` to `llm-core` was essential.
+Every runner needed it, but depending on `kllama` just for the tokenizer created circular dependency pressure.
+The `TokenizerFactory` in `llm-core` provides a clean entry point.
diff --git a/docs/modules/ROOT/pages/explanation/tokenizer-internals.adoc b/docs/modules/ROOT/pages/explanation/tokenizer-internals.adoc
new file mode 100644
index 0000000..11ed0a8
--- /dev/null
+++ b/docs/modules/ROOT/pages/explanation/tokenizer-internals.adoc
@@ -0,0 +1,68 @@
+= Tokenizer Internals
+:description: How tokenization works across different model families.
+
+== Tokenizer Types
+
+Different model families use different tokenization strategies.
+The `GGUFTokenizer` auto-detects the type from GGUF metadata.
+
+=== SentencePiece (LLaMA, Gemma)
+
+* Space is encoded as `\u2581` (lower one eighth block)
+* Subword units are learned from training data
+* GGUF metadata field: `tokenizer.ggml.model = "llama"` or `"sentencepiece"`
+
+=== BPE (Qwen, Mistral, GPT-2)
+
+* Byte-level BPE: text is converted to UTF-8 bytes, each byte mapped to a Unicode character
+* The byte-to-Unicode mapping avoids control characters (bytes 0-32 map to U+0100+)
+* Space is represented as `\u0120` (Latin capital G with dot above)
+* GGUF metadata field: `tokenizer.ggml.model = "gpt2"` or `"bpe"`
+
+=== WordPiece (BERT)
+
+* Subwords prefixed with `##` (e.g., "playing" -> `["play", "##ing"]`)
+* Uses `[CLS]`, `[SEP]`, `[UNK]`, `[PAD]` special tokens
+
+== Special Token Handling
+
+Chat templates use special tokens like `<|im_start|>` and `<|im_end|>` to delimit messages.
+These must be encoded as single tokens, not character-split.
+
+The `GGUFTokenizer` collects special tokens from the vocabulary (tokens matching `<|...|>`) and splits text around them before applying BPE.
+This ensures `<|im_start|>system` encodes as `[151644, 8948]` (two tokens), not as individual characters.
+
+== GGUF Tokenizer Fields
+
+[cols="2,3"]
+|===
+|Field |Description
+
+|`tokenizer.ggml.model`
+|Tokenizer type: `"llama"`, `"gpt2"`, `"bert"`
+
+|`tokenizer.ggml.tokens`
+|Vocabulary as string array
+
+|`tokenizer.ggml.scores`
+|BPE merge scores (SentencePiece)
+
+|`tokenizer.ggml.merges`
+|BPE merge pairs (GPT-2 style)
+
+|`tokenizer.ggml.bos_token_id`
+|Beginning-of-sequence token ID
+
+|`tokenizer.ggml.eos_token_id`
+|End-of-sequence token ID
+
+|`tokenizer.ggml.token_type`
+|Per-token type (normal, control, unknown, byte)
+|===
+
+== TokenizerFactory
+
+`TokenizerFactory` in `llm-core` provides a unified entry point.
+It delegates to `GGUFTokenizer` or `HuggingFaceBPETokenizer` based on the source format.
+
+The factory is the recommended way to create tokenizers -- callers don't need to know which implementation is used.
diff --git a/docs/modules/ROOT/pages/how-to/add-compute-backend.adoc b/docs/modules/ROOT/pages/how-to/add-compute-backend.adoc
new file mode 100644
index 0000000..3271beb
--- /dev/null
+++ b/docs/modules/ROOT/pages/how-to/add-compute-backend.adoc
@@ -0,0 +1,59 @@
+= Add a Compute Backend
+:description: Implement and register a new compute backend (Metal, CUDA, etc.).
+
+Compute backends provide tensor operations for a specific hardware target.
+The system auto-selects the best available backend at startup.
+
+== 1. Implement BackendProvider
+
+[source,kotlin]
+----
+class MetalBackendProvider : BackendProvider {
+    override val name: String = "metal"
+    override val displayName: String = "Apple Metal GPU"
+    override val priority: Int = 100  // higher = preferred
+
+    override fun isAvailable(): Boolean {
+        // Check if Metal is available on this platform
+    }
+
+    override fun createOps(): CpuOps {
+        // Return Metal-accelerated tensor operations
+    }
+}
+----
+
+== 2. Register via ServiceLoader (JVM)
+
+Create `META-INF/services/sk.ainet.lang.backend.BackendProvider`:
+
+[source]
+----
+com.example.MetalBackendProvider
+----
+
+== 3. Register Manually (Native/JS/WASM)
+
+[source,kotlin]
+----
+BackendRegistry.register(MetalBackendProvider())
+----
+
+== 4. Verify
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:run \
+  --args="-m model.gguf --list-backends"
+----
+
+Output:
+
+----
+Available backends:
+  metal        Apple Metal GPU (priority=100, available)
+  cpu          CPU (SIMD) (priority=0, available)
+----
+
+The backend with the highest priority that is available is auto-selected.
+Override with `--backend=cpu` to force a specific backend.
diff --git a/docs/modules/ROOT/pages/how-to/add-model.adoc b/docs/modules/ROOT/pages/how-to/add-model.adoc
new file mode 100644
index 0000000..0acd613
--- /dev/null
+++ b/docs/modules/ROOT/pages/how-to/add-model.adoc
@@ -0,0 +1,86 @@
+= Add a New Model Architecture
+:description: How to add support for a new transformer architecture.
+
+== Option A: DSL Network Definition (Recommended)
+
+If the architecture is a standard transformer variant, define it using the network DSL.
+
+=== 1. Create the Network Definition
+
+Create a new file in `llm-inference/<family>/src/commonMain/kotlin/`:
+
+[source,kotlin]
+----
+public inline fun <reified T : DType, V> myModelNetwork(
+    metadata: LlamaModelMetadata
+): Module<T, V> {
+    return sequential<T, V> {
+        val dslImpl = this as NeuralNetworkDslImpl<T, V>
+        dslImpl.embedding(metadata.vocabSize, metadata.embeddingLength, id = "token_embd")
+
+        val nnCtx = DefaultNeuralNetworkExecutionContext()
+        for (layer in 0 until metadata.blockCount) {
+            val stage = StageImpl<T, V>(nnCtx, "blk.$layer", T::class)
+            // Define your layer architecture here
+            stage.rmsNorm(dim, eps, id = "attn_norm")
+            stage.multiHeadAttention(dim, nHeads, nKVHeads, causal = true, id = "attn") {
+                rope(headDim, seqLen)
+                kvCache(seqLen, nKVHeads, headDim)
+            }
+            stage.residual()
+            stage.rmsNorm(dim, eps, id = "ffn_norm")
+            stage.swiGluFFN(dim, ffnDim, id = "ffn")
+            stage.residual()
+
+            dslImpl.modules += HybridTransformerBlock(stage.modules.toList(), name = "blk.$layer")
+        }
+
+        dslImpl.rmsNorm(dim, eps, id = "output_norm")
+        dslImpl.modules += VoidDenseModule<T, V>("output", vocabSize, dim)
+    }
+}
+----
+
+=== 2. Create a Weight Name Resolver
+
+Map DSL module paths to GGUF tensor names:
+
+[source,kotlin]
+----
+object MyModelGGUFNameResolver : WeightNameResolver {
+    override fun resolve(modulePath: String, paramName: String): String? {
+        // Map "blk.0/attn/q_proj" -> "blk.0.attn_q.weight"
+    }
+}
+----
+
+=== 3. Register in ModelRegistry
+
+Add the architecture to `ModelFamily` enum in `llm-core/.../ModelRegistry.kt`:
+
+[source,kotlin]
+----
+MY_MODEL("mymodel", "My Model", true, "chatml");
+----
+
+And update `ModelRegistry.detect()`.
+
+== Option B: Hand-Coded Runtime
+
+For architectures with non-standard components (e.g., DeltaNet, sliding window), extend `DecoderRuntime`:
+
+[source,kotlin]
+----
+class MyModelRuntime<T : DType>(
+    // ...
+) : DecoderRuntime<T>(ctx, dtype) {
+    override fun embedToken(tokenId: Int): Tensor<T, Float> { ... }
+    override fun runLayer(layerIdx: Int, x: Tensor<T, Float>): Tensor<T, Float> { ... }
+    override fun outputNorm(x: Tensor<T, Float>): Tensor<T, Float> { ... }
+    override fun outputProject(x: Tensor<T, Float>): Tensor<T, Float> { ... }
+    override fun resetState() { ... }
+}
+----
+
+NOTE: DSL definitions are preferred because they enable compute graph optimization.
+Hand-coded runtimes should only be used for architectures the DSL cannot express.
diff --git a/docs/modules/ROOT/pages/how-to/add-tool.adoc b/docs/modules/ROOT/pages/how-to/add-tool.adoc
new file mode 100644
index 0000000..9a68180
--- /dev/null
+++ b/docs/modules/ROOT/pages/how-to/add-tool.adoc
@@ -0,0 +1,79 @@
+= Add a Custom Tool
+:description: Implement a tool that models can call during agent mode.
+
+== Implement the Tool Interface
+
+[source,kotlin]
+----
+import sk.ainet.apps.kllama.chat.*
+import kotlinx.serialization.json.*
+
+class DatabaseQueryTool(private val db: Database) : Tool {
+
+    override val definition = ToolDefinition(
+        name = "query_db",
+        description = "Execute a read-only SQL query and return results.",
+        parameters = buildJsonObject {
+            put("type", "object")
+            putJsonObject("properties") {
+                putJsonObject("sql") {
+                    put("type", "string")
+                    put("description", "The SQL SELECT query to execute")
+                }
+            }
+            putJsonArray("required") { add(JsonPrimitive("sql")) }
+        }
+    )
+
+    override fun execute(arguments: JsonObject): String {
+        val sql = arguments["sql"]?.jsonPrimitive?.content
+            ?: return "Error: missing 'sql' argument"
+        if (!sql.trimStart().startsWith("SELECT", ignoreCase = true)) {
+            return "Error: only SELECT queries are allowed"
+        }
+        return db.query(sql).toString()
+    }
+}
+----
+
+== Register and Use
+
+[source,kotlin]
+----
+val session = ChatSession(runtime, tokenizer, metadata)
+val response = session.runSingleTurn(
+    prompt = "How many users signed up last week?",
+    tools = listOf(DatabaseQueryTool(db))
+)
+----
+
+Or register in a `ToolRegistry` for multi-turn conversations:
+
+[source,kotlin]
+----
+val registry = ToolRegistry()
+registry.register(DatabaseQueryTool(db))
+registry.register(CalculatorTool())
+
+val agentLoop = session.createAgentLoop(registry)
+----
+
+== Tool Definition Format
+
+The `parameters` field follows https://json-schema.org/[JSON Schema]:
+
+[source,json]
+----
+{
+  "type": "object",
+  "properties": {
+    "param_name": {
+      "type": "string",
+      "description": "What this parameter does"
+    }
+  },
+  "required": ["param_name"]
+}
+----
+
+The model receives this schema in the system prompt and generates tool calls matching it.
diff --git a/docs/modules/ROOT/pages/how-to/run-unified-cli.adoc b/docs/modules/ROOT/pages/how-to/run-unified-cli.adoc
new file mode 100644
index 0000000..150af64
--- /dev/null
+++ b/docs/modules/ROOT/pages/how-to/run-unified-cli.adoc
@@ -0,0 +1,91 @@
+= Use the Unified CLI
+:description: Run any GGUF model with the skainet CLI.
+
+The `skainet` CLI auto-detects model architecture from GGUF metadata, so you don't need to pick the right runner.
+
+== Text Generation
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:run \
+  --args="-m model.gguf 'Your prompt here'"
+----
+
+== Chat Mode
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:run \
+  --args="-m model.gguf --chat"
+----
+
+== Agent Mode (with Tool Calling)
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:run \
+  --args="-m model.gguf --agent"
+----
+
+== Tool Calling Demo
+
+Interactive:
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:run \
+  --args="-m model.gguf --demo"
+----
+
+Single-shot (for scripts/testing):
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:run \
+  --args="-m model.gguf --demo 'What is 2 + 2?'"
+----
+
+== All Options
+
+[source]
+----
+skainet -m <model.gguf> [options] [prompt]
+
+Options:
+  -m, --model       Path to .gguf model (required)
+  -s, --steps       Generation steps (default: 64)
+  -k, --temperature Sampling temperature (default: 0.8)
+  --chat            Interactive chat mode
+  --agent           Interactive agent with tool calling
+  --demo            Tool calling demo (add prompt for single-shot)
+  --template=NAME   Chat template override: llama3, chatml, qwen, gemma
+  --context=N       Cap context length to N tokens
+  -h, --help        Show help
+----
+
+== Model-Specific CLIs
+
+The per-model CLIs are still available for advanced use cases:
+
+[cols="1,2"]
+|===
+|CLI |Gradle Task
+
+|kllama
+|`:llm-apps:kllama-cli:run`
+
+|kgemma
+|`:llm-runtime:kgemma:jvmRun`
+
+|kqwen
+|`:llm-runtime:kqwen:jvmRun`
+
+|kapertus
+|`:llm-apps:kapertus-cli:run`
+
+|kvoxtral
+|`:llm-apps:kvoxtral-cli:run`
+
+|kbert
+|`:llm-apps:kbert-cli:run`
+|===
diff --git a/docs/modules/ROOT/pages/index.adoc b/docs/modules/ROOT/pages/index.adoc
new file mode 100644
index 0000000..b8a63d0
--- /dev/null
+++ b/docs/modules/ROOT/pages/index.adoc
@@ -0,0 +1,66 @@
+= SKaiNET Transformers
+:description: Kotlin Multiplatform LLM inference engine with tool calling, compute graph optimization, and unified model pipeline.
+
+SKaiNET Transformers is a Kotlin Multiplatform inference engine for large language models.
+It loads GGUF and SafeTensors models, builds compute graphs from DSL network definitions, applies optimization passes, and executes inference on CPU (with SIMD acceleration) or GPU backends.
+
+== Key Features
+
+* *Unified pipeline* -- load any supported model with a single CLI, auto-detected from GGUF metadata
+* *Tool calling* -- agent loop with tool execution for any model that supports chat templates
+* *Compute graph optimization* -- transpose elimination, weight deduplication, RMSNorm/SwiGLU/QKV fusion
+* *Kotlin Multiplatform* -- runs on JVM, macOS Native, Linux Native, JS, and WASM
+* *Quantization support* -- Q4_K_M, Q8_0, and other GGUF quantization formats with SIMD dequantization
+
+== Supported Model Families
+
+[cols="1,2,1,1"]
+|===
+|Family |Models |Tool Calling |DSL Network
+
+|LLaMA
+|LLaMA 2/3, Mistral
+|Yes
+|`llamaNetwork()`
+
+|Qwen
+|Qwen2, Qwen3, Qwen3.5
+|Yes
+|`qwenNetwork()`
+
+|Gemma
+|Gemma 2, Gemma 3n, Gemma 4
+|Yes
+|Hand-coded
+
+|Apertus
+|Apertus 8B
+|No
+|`apertusNetwork()`
+
+|BERT
+|MiniLM, BERT variants
+|No
+|`bertNetwork()`
+
+|Voxtral
+|Voxtral TTS
+|No
+|`voxtralBackboneNetwork()`
+|===
+
+== Documentation Structure
+
+This documentation follows the https://documentation.divio.com/[Divio documentation system]:
+
+xref:tutorials/getting-started.adoc[Tutorials]::
+Step-by-step lessons to get you started.
+
+xref:how-to/run-unified-cli.adoc[How-to Guides]::
+Practical recipes for specific tasks.
+
+xref:reference/architecture.adoc[Reference]::
+Technical descriptions of APIs and components.
+
+xref:explanation/pipeline-design.adoc[Explanation]::
+Background and design decisions.
diff --git a/docs/modules/ROOT/pages/reference/architecture.adoc b/docs/modules/ROOT/pages/reference/architecture.adoc
new file mode 100644
index 0000000..262bbea
--- /dev/null
+++ b/docs/modules/ROOT/pages/reference/architecture.adoc
@@ -0,0 +1,69 @@
+= Architecture Overview
+:description: Module structure and dependency graph of SKaiNET Transformers.
+
+== Module Structure
+
+----
+llm-core                    Core abstractions (Tokenizer, InferenceRuntime, ModelRegistry)
+llm-agent                   Chat templates, tool calling, AgentLoop, ChatSession
+llm-inference/
+  llama/                    LLaMA/Qwen network definition and weight loading
+  apertus/                  Apertus network definition and weight loading
+  gemma/                    Gemma runtime and weight loading
+  bert/                     BERT network definition
+  voxtral/                  Voxtral TTS runtimes
+llm-runtime/
+  kllama/                   LLaMA/Qwen CPU runtime, attention backend, tokenizer
+  kqwen/                    Qwen-specific runner CLI
+  kgemma/                   Gemma runner CLI
+  kapertus/                 Apertus runner CLI
+llm-apps/
+  skainet-cli/              Unified CLI (auto-detects architecture)
+  kllama-cli/               LLaMA-specific CLI
+  kapertus-cli/             Apertus-specific CLI
+  kbert-cli/                BERT CLI
+  kvoxtral-cli/             Voxtral TTS CLI
+llm-performance/            Benchmarking module
+----
+
+== Dependency Graph
+
+----
+llm-apps/skainet-cli
+  -> llm-runtime/kllama -> llm-inference/llama -> llm-core
+  -> llm-agent -> llm-core
+  -> skainet-backend-cpu (SIMD tensor ops)
+  -> skainet-io-gguf (GGUF parsing)
+
+llm-agent
+  -> llm-core (InferenceRuntime, Tokenizer)
+
+llm-core
+  -> skainet-lang-core (tensor types, DSL)
+  -> skainet-compile-dag (compute graph)
+  -> skainet-compile-opt (optimization passes)
+  -> skainet-io-core (I/O abstractions)
+  -> skainet-io-gguf (GGUF reader)
+----
+
+== Key Interfaces
+
+`InferenceRuntime<T>`::
+Minimal inference contract: `forward(tokenId): Tensor` and `reset()`.
+All model runtimes implement this.
+
+`Tokenizer`::
+Encode/decode text with `eosTokenId`, `bosTokenId`, `vocabSize`.
+Implementations: `GGUFTokenizer`, `HuggingFaceBPETokenizer`, `TekkenTokenizerAdapter`.
+
+`ChatTemplate`::
+Format conversation messages into prompt strings and parse tool calls from output.
+Implementations: `QwenChatTemplate`, `Llama3ChatTemplate`, `GemmaChatTemplate`, `ChatMLTemplate`.
+
+`DecoderRuntime<T>`::
+Template method base class for decoder-only transformers.
+Provides shared `forward()`, `generate()`, `sample()` logic.
+
+`AttentionBackend<T>`::
+Pluggable attention computation with KV cache.
+Implementations: `CpuAttentionBackend`, `GpuAttentionBackend`.
diff --git a/docs/modules/ROOT/pages/reference/chat-session-api.adoc b/docs/modules/ROOT/pages/reference/chat-session-api.adoc
new file mode 100644
index 0000000..819f0ec
--- /dev/null
+++ b/docs/modules/ROOT/pages/reference/chat-session-api.adoc
@@ -0,0 +1,89 @@
+= ChatSession API
+:description: Unified abstraction for chat and tool calling with any model.
+
+== Overview
+
+`ChatSession` bundles an `InferenceRuntime`, `Tokenizer`, and `ModelMetadata` to provide chat and tool calling capabilities for any model.
+It lives in the `llm-agent` module and has no dependencies on specific runners.
+
+.`llm-agent/src/commonMain/kotlin/sk/ainet/apps/kllama/chat/ChatSession.kt`
+
+== Constructor
+
+[source,kotlin]
+----
+class ChatSession<T : DType>(
+    val runtime: InferenceRuntime<T>,
+    val tokenizer: Tokenizer,
+    val metadata: ModelMetadata = ModelMetadata(),
+    templateName: String? = null       // override auto-detection
+)
+----
+
+== Methods
+
+=== runSingleTurn
+
+Run a single agent round with the given prompt and tools.
+
+[source,kotlin]
+----
+fun runSingleTurn(
+    prompt: String,
+    tools: List<Tool> = emptyList(),
+    maxTokens: Int = 256,
+    temperature: Float = 0.7f,
+    listener: AgentListener? = null
+): String
+----
+
+=== createAgentLoop
+
+Create an `AgentLoop` configured for this session.
+
+[source,kotlin]
+----
+fun createAgentLoop(
+    toolRegistry: ToolRegistry,
+    maxTokens: Int = 512,
+    temperature: Float = 0.7f
+): AgentLoop<T>
+----
+
+=== Properties
+
+`chatTemplate: ChatTemplate`:: The resolved chat template for this session.
+`providerFamily: String`:: The resolved tool calling provider family name (e.g., "qwen", "llama3").
+
+=== Convenience Methods
+
+`encode(text: String): IntArray`:: Encode text using the session's tokenizer.
+`decode(tokenId: Int): String`:: Decode a token ID using the session's tokenizer.
+
+== Usage Example
+
+[source,kotlin]
+----
+// Any runner can create a ChatSession
+val session = ChatSession(
+    runtime = myRuntime,
+    tokenizer = myTokenizer,
+    metadata = ModelMetadata(family = "qwen", architecture = "qwen3")
+)
+
+// Single-shot tool calling
+val answer = session.runSingleTurn(
+    prompt = "What is 42 * 7?",
+    tools = listOf(CalculatorTool())
+)
+
+// Multi-turn agent
+val registry = ToolRegistry()
+registry.register(CalculatorTool())
+val loop = session.createAgentLoop(registry)
+val messages = mutableListOf(
+    ChatMessage(role = ChatRole.SYSTEM, content = "You are helpful."),
+    ChatMessage(role = ChatRole.USER, content = "Calculate 42 * 7")
+)
+loop.runWithEncoder(messages, encode = { session.encode(it) })
+----
diff --git a/docs/modules/ROOT/pages/reference/cli-reference.adoc b/docs/modules/ROOT/pages/reference/cli-reference.adoc
new file mode 100644
index 0000000..aa9f0bd
--- /dev/null
+++ b/docs/modules/ROOT/pages/reference/cli-reference.adoc
@@ -0,0 +1,128 @@
+= CLI Reference
+:description: Complete reference for skainet and model-specific CLI tools.
+
+== skainet (Unified CLI)
+
+[source]
+----
+skainet -m <model.gguf> [options] [prompt]
+----
+
+=== Options
+
+[cols="2,1,3"]
+|===
+|Flag |Default |Description
+
+|`-m, --model`
+|_(required)_
+|Path to `.gguf` model file
+
+|`-s, --steps`
+|`64`
+|Number of tokens to generate
+
+|`-k, --temperature`
+|`0.8`
+|Sampling temperature (0 = greedy)
+
+|`--chat`
+|
+|Interactive multi-turn chat mode
+
+|`--agent`
+|
+|Interactive agent mode with tool calling
+
+|`--demo`
+|
+|Tool calling demo (interactive or single-shot with prompt)
+
+|`--template=NAME`
+|_(auto)_
+|Force chat template: `llama3`, `chatml`, `qwen`, `gemma`
+
+|`--context=N`
+|`4096`
+|Cap context length (reduces memory)
+
+|`-h, --help`
+|
+|Show help text
+|===
+
+=== Examples
+
+[source,bash]
+----
+# Text generation
+skainet -m model.gguf "The meaning of life is"
+
+# Chat
+skainet -m model.gguf --chat
+
+# Tool calling demo (interactive)
+skainet -m model.gguf --demo
+
+# Tool calling demo (single-shot, for testing)
+skainet -m model.gguf --demo "What is 2 + 2?"
+
+# Low temperature, more tokens
+skainet -m model.gguf -s 128 -k 0.3 "Explain quantum computing"
+----
+
+== kllama
+
+[source]
+----
+kllama -m <model> [-t <tokenizer>] [-s <steps>] [-k <temp>] [-p <systemprompt>] [--chat] [--agent] [--demo] [--template=NAME] <prompt>
+----
+
+Same options as `skainet`, plus:
+
+[cols="2,3"]
+|===
+|Flag |Description
+
+|`-t, --tokenizer`
+|Path to external tokenizer file (auto-detected for GGUF)
+
+|`-p, --systemprompt`
+|System prompt prepended to user message
+
+|`--backend=NAME`
+|Force compute backend (see `--list-backends`)
+
+|`--list-backends`
+|List available compute backends and exit
+|===
+
+Supports `.gguf`, `.safetensors`, and `.bin` (Karpathy) model formats.
+
+== Gradle Tasks
+
+[cols="2,3"]
+|===
+|Task |Description
+
+|`:llm-apps:skainet-cli:run`
+|Unified CLI (auto-detects architecture)
+
+|`:llm-apps:kllama-cli:run`
+|LLaMA/Qwen/Mistral CLI
+
+|`:llm-runtime:kqwen:jvmRun`
+|Qwen CLI (basic generation)
+
+|`:llm-runtime:kgemma:jvmRun`
+|Gemma CLI
+
+|`:llm-apps:kapertus-cli:run`
+|Apertus CLI
+
+|`:llm-apps:kvoxtral-cli:run`
+|Voxtral TTS CLI
+
+|`:llm-apps:kbert-cli:run`
+|BERT embeddings CLI
+|===
diff --git a/docs/modules/ROOT/pages/reference/model-registry.adoc b/docs/modules/ROOT/pages/reference/model-registry.adoc
new file mode 100644
index 0000000..15fbce6
--- /dev/null
+++ b/docs/modules/ROOT/pages/reference/model-registry.adoc
@@ -0,0 +1,106 @@
+= Model Registry
+:description: Architecture auto-detection and model family enumeration.
+
+== ModelFamily
+
+.`llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/ModelRegistry.kt`
+
+[source,kotlin]
+----
+enum class ModelFamily(
+    val id: String,
+    val displayName: String,
+    val supportsToolCalling: Boolean,
+    val chatTemplateFamily: String?
+)
+----
+
+[cols="1,2,1,1,1"]
+|===
+|Family |Display Name |Tool Calling |Template |GGUF Architectures
+
+|`LLAMA`
+|LLaMA / Mistral
+|Yes
+|`llama3`
+|`llama`, `mistral`
+
+|`QWEN`
+|Qwen
+|Yes
+|`qwen`
+|`qwen2`, `qwen3`, `qwen35`
+
+|`GEMMA`
+|Gemma
+|Yes
+|`gemma`
+|`gemma`, `gemma2`, `gemma3`, `gemma3n`
+
+|`APERTUS`
+|Apertus
+|No
+|`chatml`
+|`apertus`
+
+|`BERT`
+|BERT
+|No
+|--
+|`bert`
+
+|`VOXTRAL`
+|Voxtral TTS
+|No
+|--
+|`voxtral`
+|===
+
+== ModelRegistry
+
+[source,kotlin]
+----
+object ModelRegistry {
+    fun detect(architecture: String): ModelFamily
+    fun detect(architecture: String, chatTemplate: String?): ModelFamily
+}
+----
+
+Detects model family from the GGUF `general.architecture` field.
+
+== UnifiedModelLoader
+
+.`llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/UnifiedModelLoader.kt`
+
+[source,kotlin]
+----
+object UnifiedModelLoader {
+    fun peek(sourceProvider: () -> RandomAccessSource): GGUFModelInfo
+}
+----
+
+Extracts model info without loading weights:
+
+[source,kotlin]
+----
+data class GGUFModelInfo(
+    val architecture: String,    // e.g., "qwen3"
+    val family: ModelFamily,     // e.g., ModelFamily.QWEN
+    val contextLength: Int,
+    val vocabSize: Int,
+    val blockCount: Int,
+    val embeddingLength: Int,
+    val fields: Map<String, Any?>
+)
+----
+
+== Usage
+
+[source,kotlin]
+----
+val info = UnifiedModelLoader.peek {
+    JvmRandomAccessSource.open(modelPath)
+}
+println("${info.family.displayName}: ${info.blockCount} layers, ${info.vocabSize} vocab")
+// "Qwen: 36 layers, 151936 vocab"
+----
diff --git a/docs/modules/ROOT/pages/reference/pipeline.adoc b/docs/modules/ROOT/pages/reference/pipeline.adoc
new file mode 100644
index 0000000..4bc33f3
--- /dev/null
+++ b/docs/modules/ROOT/pages/reference/pipeline.adoc
@@ -0,0 +1,90 @@
+= Inference Pipeline
+:description: Data flow from GGUF file to generated text.
+
+== Pipeline Stages
+
+[source]
+----
+GGUF/SafeTensors File
+    |
+    v
+[1] WeightLoader         Parse metadata + tensor data
+    |
+    v
+[2] DSL Network Def      llamaNetwork(), qwenNetwork(), apertusNetwork()
+    |
+    v
+[3] ComputeGraph (DAG)   Record forward pass into directed acyclic graph
+    |
+    v
+[4] Optimization          TransposeElim -> WeightDedup -> LLMFusion -> DCE
+    |
+    v
+[5] Executor              ComputeGraphExecutor with fused kernels
+    |
+    v
+[6] InferenceRuntime      forward(tokenId) -> logits, generate(), sample()
+    |
+    v
+[7] Tokenizer             encode(text) -> IntArray, decode(token) -> String
+    |
+    v
+[8] ChatPipeline          ChatTemplate + AgentLoop + ToolRegistry
+    |
+    v
+    Generated text / Tool call results
+----
+
+== Stage Details
+
+=== [1] Weight Loading
+
+`LlamaWeightLoader` supports:
+
+* *Sequential loading* -- entire file in memory (small models < 2GB)
+* *Streaming loading* -- metadata-only in memory, tensors on demand (any size)
+* *Quantization policies* -- `DEQUANTIZE_TO_FP32`, `NATIVE_OPTIMIZED` (SIMD dequant at inference)
+
+=== [2] DSL Network Definition
+
+Pure functions that return a `Module<T, V>` tree:
+
+[source,kotlin]
+----
+val model = llamaNetwork<FP32, Float>(metadata)
+----
+
+The DSL provides: `embedding()`, `multiHeadAttention()`, `swiGluFFN()`, `rmsNorm()`, `residual()`.
+
+=== [3-5] Compute Graph Compilation
+
+`OptimizedLLMRuntime` traces the module tree into a DAG and applies optimization passes:
+
+* *TransposeEliminationPass* -- fold transposes into matmul parameters
+* *SharedWeightDeduplicationPass* -- eliminate redundant weight loads
+* *LLMFusionPass* -- fuse RMSNorm, SwiGLU FFN, and QKV projections
+* *DeadCodeEliminationPass* -- remove unused operations
+
+=== [6] Inference Runtime
+
+Three execution modes via `OptimizedLLMMode`:
+
+`DIRECT`:: Module tree executes forward passes directly (debugging)
+`OPTIMIZED`:: Full DAG execution with fused kernels
+`HYBRID`:: Direct execution with per-layer compiled subgraphs
+
+=== [7] Tokenization
+
+`TokenizerFactory` auto-detects the right tokenizer:
+
+* `fromGGUF(source)` -- reads vocab and merges from GGUF metadata
+* `fromTokenizerJson(json)` -- parses HuggingFace tokenizer.json
+* `fromHuggingFace(json, config)` -- full HF BPE with config
+
+=== [8] Chat Pipeline
+
+`ChatSession` bundles runtime + tokenizer + metadata:
+
+* Auto-detects chat template from `ModelMetadata`
+* `createAgentLoop()` -- multi-turn tool calling
+* `runSingleTurn()` -- one-shot tool calling
diff --git a/docs/modules/ROOT/pages/reference/tokenizer-api.adoc b/docs/modules/ROOT/pages/reference/tokenizer-api.adoc
new file mode 100644
index 0000000..9d8c4e2
--- /dev/null
+++ b/docs/modules/ROOT/pages/reference/tokenizer-api.adoc
@@ -0,0 +1,71 @@
+= Tokenizer API
+:description: Tokenizer interface and implementations.
+
+== Interface
+
+.`llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/Tokenizer.kt`
+[source,kotlin]
+----
+interface Tokenizer {
+    fun encode(text: String): IntArray
+    fun decode(tokens: IntArray): String
+    fun decode(token: Int): String
+    val eosTokenId: Int
+    val bosTokenId: Int
+    val vocabSize: Int
+}
+----
+
+== Implementations
+
+=== GGUFTokenizer
+
+.`llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/GGUFTokenizer.kt`
+
+Auto-detects tokenizer type from GGUF metadata:
+
+* *BPE* (GPT-2 style) -- used by Qwen, Mistral
+* *SentencePiece* -- used by LLaMA
+* *WordPiece* -- used by BERT
+
+Factory methods:
+
+[source,kotlin]
+----
+// From GGUF file (streaming, memory-efficient)
+val tokenizer = GGUFTokenizer.fromRandomAccessSource(source)
+
+// From HuggingFace tokenizer.json
+val tokenizer = GGUFTokenizer.fromTokenizerJson(jsonString)
+----
+
+=== HuggingFaceBPETokenizer
+
+.`llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/HuggingFaceBPETokenizer.kt`
+
+SentencePiece-style BPE with `\u2581` space marker.
+Used for Gemma, LLaMA models loaded from SafeTensors.
+
+=== TekkenTokenizerAdapter
+
+.`llm-inference/voxtral/src/commonMain/kotlin/sk/ainet/models/voxtral/TekkenTokenizerAdapter.kt`
+
+Wraps the Tekken tokenizer from skainet-io-core for Voxtral TTS models.
+
+== TokenizerFactory
+
+.`llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/TokenizerFactory.kt`
+
+Unified factory for creating tokenizers:
+
+[source,kotlin]
+----
+// From GGUF file
+val tokenizer = TokenizerFactory.fromGGUF(randomAccessSource)
+
+// From tokenizer.json string
+val tokenizer = TokenizerFactory.fromTokenizerJson(jsonString)
+
+// From HuggingFace format with optional config
+val tokenizer = TokenizerFactory.fromHuggingFace(tokenizerJson, configJson)
+----
diff --git a/docs/modules/ROOT/pages/tutorials/getting-started.adoc b/docs/modules/ROOT/pages/tutorials/getting-started.adoc
new file mode 100644
index 0000000..6105980
--- /dev/null
+++ b/docs/modules/ROOT/pages/tutorials/getting-started.adoc
@@ -0,0 +1,66 @@
+= Getting Started
+:description: Run your first LLM inference with SKaiNET Transformers.
+
+This tutorial walks you through running text generation with a GGUF model using the unified `skainet` CLI.
+
+== Prerequisites
+
+* JDK 21+ with preview features (Vector API)
+* A GGUF model file (e.g., `tinyllama-1.1b-chat-v1.0.Q8_0.gguf`)
+
+== Step 1: Build the Project
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:classes
+----
+
+== Step 2: Run Text Generation
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:run \
+  --args="-m tinyllama-1.1b-chat-v1.0.Q8_0.gguf 'The capital of France is'"
+----
+
+Expected output:
+
+----
+Architecture: llama, Family: LLaMA / Mistral
+Backend: CPU (SIMD)
+Loading GGUF model (LLaMA / Mistral, streaming)...
+Generating 64 tokens with temperature=0.8...
+---
+The capital of France is Paris. It is also the largest city in France...
+---
+tok/s: 3.4
+----
+
+The CLI auto-detects the model architecture from GGUF metadata -- no need to specify which runner to use.
+
+== Step 3: Interactive Chat
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:run \
+  --args="-m Qwen3-1.7B-Q8_0.gguf --chat"
+----
+
+This starts a multi-turn conversation with the model using the auto-detected chat template.
+
+== Step 4: Tool Calling Demo
+
+[source,bash]
+----
+./gradlew :llm-apps:skainet-cli:run \
+  --args="-m Qwen3-1.7B-Q8_0.gguf --demo"
+----
+
+The demo provides `calculator` and `list_files` tools.
+Type a question like "What is 2 + 2?" and the model will call the calculator tool.
+
+== What's Next
+
+* xref:tutorials/tool-calling.adoc[Tool calling in depth] -- integrate tool calling into your own application
+* xref:how-to/run-unified-cli.adoc[CLI reference] -- all available flags and options
+* xref:reference/architecture.adoc[Architecture overview] -- understand the pipeline
diff --git a/docs/modules/ROOT/pages/tutorials/smoke-tests.adoc b/docs/modules/ROOT/pages/tutorials/smoke-tests.adoc
new file mode 100644
index 0000000..068f920
--- /dev/null
+++ b/docs/modules/ROOT/pages/tutorials/smoke-tests.adoc
@@ -0,0 +1,117 @@
+= Running Smoke Tests
+:description: Quick validation that models load and generate correctly.
+
+The smoke test suite verifies model loading, text generation, and optionally tool calling across all configured models.
+
+== Quick Start
+
+[source,bash]
+----
+./tests/smoke/smoke-test.sh
+----
+
+This uses `tests/smoke/smoke-models.json` to determine which models to test.
+
+== Configuration
+
+Edit `tests/smoke/smoke-models.json`:
+
+[source,json]
+----
+{
+  "defaults": {
+    "prompt": "The capital of France is",
+    "steps": 32,
+    "temperature": 0.0
+  },
+  "models": [
+    {
+      "name": "TinyLlama-1.1B-Q8",
+      "runner": "kllama",
+      "model": "tinyllama-1.1b-chat-v1.0.Q8_0.gguf",
+      "format": "gguf"
+    },
+    {
+      "name": "Qwen3-1.7B-Q8",
+      "runner": "kqwen",
+      "model": "Qwen3-1.7B-Q8_0.gguf",
+      "format": "gguf",
+      "toolCalling": {
+        "prompt": "What is 2 + 2?",
+        "steps": 256
+      }
+    }
+  ]
+}
+----
+
+=== Model Fields
+
+[cols="1,1,3"]
+|===
+|Field |Required |Description
+
+|`name`
+|Yes
+|Display name in the summary table
+
+|`runner`
+|Yes
+|Runner to use: `skainet`, `kllama`, `kqwen`, `kgemma`, `kapertus`, `kvoxtral`, `kbert`
+
+|`model`
+|Yes
+|Path to model file (`~` is expanded, relative paths use `MODELS_ROOT`)
+
+|`format`
+|No
+|`gguf` or `safetensors` (informational)
+
+|`prompt`
+|No
+|Override the default prompt
+
+|`steps`
+|No
+|Override the default step count
+
+|`toolCalling`
+|No
+|Object with `prompt` and `steps` to enable tool calling test
+|===
+
+== Tool Calling Tests
+
+Models with a `toolCalling` field get an additional test phase.
+The smoke test runs `kllama --demo` in single-shot mode and checks for `[Tool Call]` in the output.
+
+Results are classified as:
+
+* *OK* -- model produced a tool call
+* *WARN* -- model ran but did not produce a tool call (model too small or prompt not triggering)
+* *FAIL* -- model crashed or failed to load
+
+== Output
+
+The test produces two summary tables:
+
+----
+Summary -- Generation
+  Status Model                          Runner       Size      tok/s     Wall
+  OK     TinyLlama-1.1B-Q8              kllama       1.1G       3.4    11.8s
+  OK     Qwen3-1.7B-Q8                  kqwen        2.0G       2.0    20.8s
+  Pass: 2  Fail: 0  Total: 2
+
+Summary -- Tool Calling
+  Status Model                          Tool                Wall
+  OK     Qwen3-1.7B-Q8                  calculator        45.2s
+  Pass: 1  Fail: 0  Total: 1
+----
+
+== Adding a New Model
+
+1. Download or locate the GGUF file
+2. Add an entry to `smoke-models.json`
+3. Set `runner` to match the model architecture
+4. Optionally add `toolCalling` for models that support it
+5. Run `./tests/smoke/smoke-test.sh`
diff --git a/docs/modules/ROOT/pages/tutorials/tool-calling.adoc b/docs/modules/ROOT/pages/tutorials/tool-calling.adoc
new file mode 100644
index 0000000..5ef2142
--- /dev/null
+++ b/docs/modules/ROOT/pages/tutorials/tool-calling.adoc
@@ -0,0 +1,131 @@
+= Tool Calling with Any Model
+:description: Use tool calling with any model that supports chat templates.
+
+This tutorial shows how to use `ChatSession` to add tool calling to any model runtime, not just kllama.
+
+== How Tool Calling Works
+
+The tool calling pipeline is decoupled from the model runtime:
+
+----
+InferenceRuntime + Tokenizer + ModelMetadata
+        |
+    ChatSession
+        |
+    AgentLoop (generate -> parse -> execute -> re-prompt)
+        |
+    ChatTemplate (format messages, parse tool calls)
+        |
+    ToolRegistry (execute tool functions)
+----
+
+Any model that implements `InferenceRuntime` and has a `Tokenizer` can use tool calling.
+
+== Step 1: Create a ChatSession
+
+[source,kotlin]
+----
+val session = ChatSession(
+    runtime = myRuntime,       // any InferenceRuntime<T>
+    tokenizer = myTokenizer,   // any Tokenizer
+    metadata = ModelMetadata(family = "qwen", architecture = "qwen3")
+)
+----
+
+The `ChatSession` auto-detects the right chat template from `ModelMetadata`.
+
+== Step 2: Run a Single Tool Calling Round
+
+[source,kotlin]
+----
+val tools = listOf(myCalculatorTool, myFilesTool)
+val response = session.runSingleTurn(
+    prompt = "What is 2 + 2?",
+    tools = tools,
+    maxTokens = 256,
+    temperature = 0.7f
+)
+println(response)  // "2 + 2 = 4"
+----
+
+== Step 3: Build a Multi-Turn Agent
+
+[source,kotlin]
+----
+val registry = ToolRegistry()
+registry.register(CalculatorTool())
+registry.register(ListFilesTool())
+
+val agentLoop = session.createAgentLoop(registry, maxTokens = 512)
+
+val messages = mutableListOf(
+    ChatMessage(role = ChatRole.SYSTEM, content = "You are a helpful assistant."),
+    ChatMessage(role = ChatRole.USER, content = "List files in /tmp and count them")
+)
+
+val response = agentLoop.runWithEncoder(
+    messages = messages,
+    encode = { session.encode(it) }
+)
+----
+
+The agent loop automatically:
+
+1. Formats the conversation using the chat template
+2. Generates tokens until EOS
+3. Parses tool calls from the output
+4. Executes tools and appends results
+5. Repeats until no more tool calls or max rounds reached
+
+== Step 4: Implement a Custom Tool
+
+[source,kotlin]
+----
+class WeatherTool : Tool {
+    override val definition = ToolDefinition(
+        name = "get_weather",
+        description = "Get current weather for a city",
+        parameters = buildJsonObject {
+            put("type", "object")
+            putJsonObject("properties") {
+                putJsonObject("city") {
+                    put("type", "string")
+                    put("description", "City name")
+                }
+            }
+            putJsonArray("required") { add(JsonPrimitive("city")) }
+        }
+    )
+
+    override fun execute(arguments: JsonObject): String {
+        val city = arguments["city"]?.jsonPrimitive?.content
+            ?: return "Error: missing city"
+        return "Weather in $city: 22C, sunny"
+    }
+}
+----
+
+== Supported Chat Templates
+
+Tool calling support is auto-detected from model metadata:
+
+[cols="1,1,2"]
+|===
+|Family |Template |Format
+
+|Qwen2/3
+|`QwenChatTemplate`
+|JSON in `<tool_call>` XML tags
+
+|LLaMA 3
+|`Llama3ChatTemplate`
+|JSON in `<tool_call>` XML tags
+
+|Gemma
+|`GemmaChatTemplate`
+|Gemma-specific format
+
+|ChatML/Hermes
+|`ChatMLTemplate`
+|JSON in `<tool_call>` XML tags
+|===

From f8caaf8152fde2ae9208ac8bcb5277bd1a89acb5 Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 11 Apr 2026 12:38:10 +0200
Subject: [PATCH 7/9] ci: add GitHub Actions workflow to build and deploy
 Antora docs

- Add docs.yml workflow: builds on push to main/develop, deploys to
  GitHub Pages from develop branch
- Uses dockerized Antora 3.1 with asciidoctor-kroki for Mermaid diagrams
- Add antora-playbook.yml with Kroki server integration
- Convert ASCII diagrams to Mermaid in pipeline, architecture,
  tool-calling, and pipeline-design pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 .github/workflows/docs.yml                    | 58 ++++++++++++++++
 docs/antora-playbook.yml                      | 25 +++++++
 .../pages/explanation/pipeline-design.adoc    | 24 ++++++-
 .../ROOT/pages/reference/architecture.adoc    | 69 +++++++++++++++----
 .../ROOT/pages/reference/pipeline.adoc        | 47 +++++--------
 .../ROOT/pages/tutorials/tool-calling.adoc    | 24 ++++---
 6 files changed, 189 insertions(+), 58 deletions(-)
 create mode 100644 .github/workflows/docs.yml
 create mode 100644 docs/antora-playbook.yml

diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
new file mode 100644
index 0000000..b38cf7c
--- /dev/null
+++ b/.github/workflows/docs.yml
@@ -0,0 +1,58 @@
+name: Docs
+
+on:
+  push:
+    branches: [ main, develop ]
+    paths:
+      - 'docs/**'
+      - '.github/workflows/docs.yml'
+  pull_request:
+    paths:
+      - 'docs/**'
+      - '.github/workflows/docs.yml'
+  workflow_dispatch:
+
+concurrency:
+  group: docs-${{ github.ref }}
+  cancel-in-progress: true
+
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+jobs:
+  build-docs:
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v6
+
+      - name: Build Antora site
+        run: |
+          docker run --rm \
+            -v "${{ github.workspace }}:/antora" \
+            --workdir /antora/docs \
+            --entrypoint sh \
+            docker.io/antora/antora:3.1 \
+            -c "npm i asciidoctor-kroki && antora --stacktrace antora-playbook.yml"
+
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: docs/build/site
+
+  deploy-docs:
+    if: github.ref == 'refs/heads/develop' && github.event_name == 'push'
+    needs: build-docs
+    runs-on: ubuntu-latest
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
diff --git a/docs/antora-playbook.yml b/docs/antora-playbook.yml
new file mode 100644
index 0000000..5d521e6
--- /dev/null
+++ b/docs/antora-playbook.yml
@@ -0,0 +1,25 @@
+site:
+  title: SKaiNET Transformers
+  start_page: skainet-transformers::index.adoc
+
+content:
+  sources:
+    - url: .
+      start_path: docs
+      branches: HEAD
+
+asciidoc:
+  extensions:
+    - asciidoctor-kroki
+
+kroki:
+  server-url: https://kroki.io
+  fetch-diagram: true
+
+ui:
+  bundle:
+    url: https://gitlab.com/antora/antora-ui-default/-/jobs/artifacts/HEAD/raw/build/ui-bundle.zip?job=bundle-stable
+    snapshot: true
+
+output:
+  dir: ./build/site
diff --git a/docs/modules/ROOT/pages/explanation/pipeline-design.adoc b/docs/modules/ROOT/pages/explanation/pipeline-design.adoc
index 4f3202c..70b5c8f 100644
--- a/docs/modules/ROOT/pages/explanation/pipeline-design.adoc
+++ b/docs/modules/ROOT/pages/explanation/pipeline-design.adoc
@@ -14,7 +14,29 @@ This led to:
 
 The pipeline is split into stages that are independently replaceable:
 
-[horizontal]
+[mermaid]
+....
+graph LR
+    subgraph "Model Format"
+        A[Weight Loading]
+    end
+    subgraph "Architecture"
+        B[Network Definition]
+    end
+    subgraph "Framework"
+        C[Graph Compilation]
+        D[Inference Runtime]
+    end
+    subgraph "I/O"
+        E[Tokenization]
+    end
+    subgraph "Application"
+        F[Chat Pipeline]
+    end
+
+    A --> B --> C --> D --> E --> F
+....
+
 Weight Loading:: Parse GGUF/SafeTensors into typed tensor maps. Model-format concern, not architecture concern.
 Network Definition:: Pure functions (`llamaNetwork()`, `apertusNetwork()`) that return a `Module` tree. Architecture concern only.
 Graph Compilation:: Trace the module tree into a DAG, apply optimization passes. Framework concern.
diff --git a/docs/modules/ROOT/pages/reference/architecture.adoc b/docs/modules/ROOT/pages/reference/architecture.adoc
index 262bbea..21fb21a 100644
--- a/docs/modules/ROOT/pages/reference/architecture.adoc
+++ b/docs/modules/ROOT/pages/reference/architecture.adoc
@@ -28,23 +28,62 @@ llm-performance/            Benchmarking module
 
 == Dependency Graph
 
-----
-llm-apps/skainet-cli
-  -> llm-runtime/kllama -> llm-inference/llama -> llm-core
-  -> llm-agent -> llm-core
-  -> skainet-backend-cpu (SIMD tensor ops)
-  -> skainet-io-gguf (GGUF parsing)
+[mermaid]
+....
+graph LR
+    subgraph Apps
+        skainet-cli
+        kllama-cli
+    end
 
-llm-agent
-  -> llm-core (InferenceRuntime, Tokenizer)
+    subgraph Runtime
+        kllama
+        kqwen
+        kgemma
+        kapertus
+    end
 
-llm-core
-  -> skainet-lang-core (tensor types, DSL)
-  -> skainet-compile-dag (compute graph)
-  -> skainet-compile-opt (optimization passes)
-  -> skainet-io-core (I/O abstractions)
-  -> skainet-io-gguf (GGUF reader)
-----
+    subgraph Inference
+        llama
+        gemma
+        apertus
+        bert
+        voxtral
+    end
+
+    subgraph Core
+        llm-core
+        llm-agent
+    end
+
+    subgraph SKaiNET
+        skainet-lang-core
+        skainet-compile-dag
+        skainet-compile-opt
+        skainet-io-gguf
+        skainet-backend-cpu
+    end
+
+    skainet-cli --> kllama
+    skainet-cli --> llm-agent
+    kllama-cli --> kllama
+    kllama --> llama
+    kllama --> llm-agent
+    kqwen --> llama
+    kgemma --> gemma
+    kapertus --> apertus
+    llama --> llm-core
+    gemma --> llm-core
+    apertus --> llm-core
+    bert --> llm-core
+    voxtral --> llm-core
+    llm-agent --> llm-core
+    llm-core --> skainet-lang-core
+    llm-core --> skainet-compile-dag
+    llm-core --> skainet-compile-opt
+    llm-core --> skainet-io-gguf
+    kllama --> skainet-backend-cpu
+....
 
 == Key Interfaces
 
diff --git a/docs/modules/ROOT/pages/reference/pipeline.adoc b/docs/modules/ROOT/pages/reference/pipeline.adoc
index 4bc33f3..1b4cecc 100644
--- a/docs/modules/ROOT/pages/reference/pipeline.adoc
+++ b/docs/modules/ROOT/pages/reference/pipeline.adoc
@@ -3,37 +3,22 @@
 
 == Pipeline Stages
 
-[source]
-----
-GGUF/SafeTensors File
-    |
-    v
-[1] WeightLoader         Parse metadata + tensor data
-    |
-    v
-[2] DSL Network Def      llamaNetwork(), qwenNetwork(), apertusNetwork()
-    |
-    v
-[3] ComputeGraph (DAG)   Record forward pass into directed acyclic graph
-    |
-    v
-[4] Optimization          TransposeElim -> WeightDedup -> LLMFusion -> DCE
-    |
-    v
-[5] Executor              ComputeGraphExecutor with fused kernels
-    |
-    v
-[6] InferenceRuntime      forward(tokenId) -> logits, generate(), sample()
-    |
-    v
-[7] Tokenizer             encode(text) -> IntArray, decode(token) -> String
-    |
-    v
-[8] ChatPipeline          ChatTemplate + AgentLoop + ToolRegistry
-    |
-    v
-    Generated text / Tool call results
-----
+[mermaid]
+....
+graph TD
+    A[GGUF / SafeTensors File] --> B["[1] WeightLoader<br/>Parse metadata + tensor data"]
+    B --> C["[2] DSL Network Def<br/>llamaNetwork(), qwenNetwork()"]
+    C --> D["[3] ComputeGraph (DAG)<br/>Record forward pass"]
+    D --> E["[4] Optimization<br/>TransposeElim → WeightDedup → LLMFusion → DCE"]
+    E --> F["[5] Executor<br/>ComputeGraphExecutor with fused kernels"]
+    F --> G["[6] InferenceRuntime<br/>forward(tokenId) → logits"]
+    G --> H["[7] Tokenizer<br/>encode(text) → IntArray"]
+    H --> I["[8] ChatPipeline<br/>ChatTemplate + AgentLoop + ToolRegistry"]
+    I --> J[Generated text / Tool call results]
+
+    style A fill:#f9f,stroke:#333
+    style J fill:#9f9,stroke:#333
+....
 
 == Stage Details
 
diff --git a/docs/modules/ROOT/pages/tutorials/tool-calling.adoc b/docs/modules/ROOT/pages/tutorials/tool-calling.adoc
index 5ef2142..a8bf97a 100644
--- a/docs/modules/ROOT/pages/tutorials/tool-calling.adoc
+++ b/docs/modules/ROOT/pages/tutorials/tool-calling.adoc
@@ -7,17 +7,19 @@ This tutorial shows how to use `ChatSession` to add tool calling to any model ru
 
 The tool calling pipeline is decoupled from the model runtime:
 
-----
-InferenceRuntime + Tokenizer + ModelMetadata
-        |
-    ChatSession
-        |
-    AgentLoop (generate -> parse -> execute -> re-prompt)
-        |
-    ChatTemplate (format messages, parse tool calls)
-        |
-    ToolRegistry (execute tool functions)
-----
+[mermaid]
+....
+graph TD
+    A[InferenceRuntime + Tokenizer + ModelMetadata] --> B[ChatSession]
+    B --> C[AgentLoop]
+    C --> D{Tool calls?}
+    D -->|Yes| E[ToolRegistry<br/>execute tool functions]
+    E --> F[Append result to messages]
+    F --> C
+    D -->|No| G[Final response]
+    B --> H[ChatTemplate<br/>format messages, parse tool calls]
+    H --> C
+....
 
 Any model that implements `InferenceRuntime` and has a `Tokenizer` can use tool calling.
 

From f55dd75a7c655dc433bc9a925b7d0ba61d7e9ae5 Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 11 Apr 2026 12:48:19 +0200
Subject: [PATCH 8/9] feat: unified model pipeline with decoupled tool calling
 (#49)

Implements the unified inference pipeline for SKaiNET Transformers,
resolving #49 and building on the tool calling foundation from #46.

## Summary

This branch decouples tool calling from the kllama runner, creates a
unified model pipeline with architecture auto-detection, and adds
comprehensive Antora documentation.

### Phase 1: Decouple Tool Calling
- Enhance Tokenizer interface with eosTokenId, bosTokenId, vocabSize
- Create ChatSession abstraction in llm-agent (any runner gets tool
  calling for free)
- Refactor ToolCallingDemo and AgentCli to accept Tokenizer, not
  GGUFTokenizer
- Fix JavaAgentLoop instanceof hack

### Phase 2: Model Registry
- Add ModelFamily enum (LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL)
- Add ModelRegistry.detect() for GGUF architecture auto-detection
- Add UnifiedModelLoader.peek() to extract model info without loading

### Phase 3: Tokenization Pipeline
- Move GGUFTokenizer from kllama to llm-core (all runners can use it)
- Create TokenizerFactory with fromGGUF(), fromTokenizerJson(),
  fromHuggingFace()

### Phase 4: Unified CLI
- New skainet-cli module: single entry point for all GGUF models
- Auto-detects architecture, supports --chat/--agent/--demo modes

### Smoke Tests
- Add tool calling test phase with [Tool Call] detection
- Add ToolCallingDemo.runSingleShot() for non-interactive testing
- Add Qwen3-8B-Q4 to smoke test config

### Documentation (Antora + Divio)
- 19 AsciiDoc pages: tutorials, how-to, reference, explanation
- Mermaid diagrams via Kroki for pipeline, architecture, agent loop
- GitHub Actions workflow for docs build and GitHub Pages deployment

Refs: #46

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

From a448468316d51c099f52d90c705481a8804a111c Mon Sep 17 00:00:00 2001
From: Michal Harakal <michal.harakal@googlemail.com>
Date: Sat, 11 Apr 2026 12:50:12 +0200
Subject: [PATCH 9/9] ci: add custom Antora Docker image with built-in Mermaid
 renderer

Custom image based on node:20-alpine with:
- Antora 3.1 site generator
- asciidoctor-kroki for diagram blocks
- @mermaid-js/mermaid-cli with Chromium for local SVG rendering
- No external Kroki server dependency

The GitHub Actions workflow builds the image from docs/.docker/Dockerfile
then uses it to generate the site. Mermaid diagrams are rendered locally
inside the container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 .github/workflows/docs.yml | 15 +++++++++++----
 docs/.docker/.dockerignore |  2 ++
 docs/.docker/Dockerfile    | 35 +++++++++++++++++++++++++++++++++++
 docs/antora-playbook.yml   |  8 ++++----
 4 files changed, 52 insertions(+), 8 deletions(-)
 create mode 100644 docs/.docker/.dockerignore
 create mode 100644 docs/.docker/Dockerfile

diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
index b38cf7c..6174540 100644
--- a/.github/workflows/docs.yml
+++ b/.github/workflows/docs.yml
@@ -24,20 +24,27 @@ permissions:
 jobs:
   build-docs:
     runs-on: ubuntu-latest
-    timeout-minutes: 10
+    timeout-minutes: 15
 
     steps:
       - name: Checkout
         uses: actions/checkout@v6
 
+      - name: Build custom Antora image
+        run: |
+          docker build \
+            -t skainet-antora:local \
+            -f docs/.docker/Dockerfile \
+            docs/.docker/
+
       - name: Build Antora site
         run: |
           docker run --rm \
             -v "${{ github.workspace }}:/antora" \
             --workdir /antora/docs \
-            --entrypoint sh \
-            docker.io/antora/antora:3.1 \
-            -c "npm i asciidoctor-kroki && antora --stacktrace antora-playbook.yml"
+            skainet-antora:local \
+            --stacktrace \
+            antora-playbook.yml
 
       - name: Upload artifact
         uses: actions/upload-pages-artifact@v3
diff --git a/docs/.docker/.dockerignore b/docs/.docker/.dockerignore
new file mode 100644
index 0000000..dd87e2d
--- /dev/null
+++ b/docs/.docker/.dockerignore
@@ -0,0 +1,2 @@
+node_modules
+build
diff --git a/docs/.docker/Dockerfile b/docs/.docker/Dockerfile
new file mode 100644
index 0000000..0d496ff
--- /dev/null
+++ b/docs/.docker/Dockerfile
@@ -0,0 +1,35 @@
+FROM node:20-alpine
+
+LABEL org.opencontainers.image.title="SKaiNET Antora" \
+      org.opencontainers.image.description="Antora site generator with built-in Mermaid rendering" \
+      org.opencontainers.image.source="https://github.com/SKaiNET-developers/SKaiNET-transformers"
+
+# Chromium for mermaid-cli (puppeteer)
+RUN apk add --no-cache chromium font-noto
+
+ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
+    PUPPETEER_SKIP_DOWNLOAD=true
+
+WORKDIR /antora
+
+# Install Antora + extensions + mermaid-cli in one layer
+RUN npm i --save-exact \
+      @antora/cli@3.1 \
+      @antora/site-generator@3.1 \
+      asciidoctor-kroki@0.18 \
+      @mermaid-js/mermaid-cli@11 \
+    && npm cache clean --force
+
+# Mermaid-cli config: use installed Chromium, no sandbox (container)
+RUN echo '{ \
+  "executablePath": "/usr/bin/chromium-browser", \
+  "args": ["--no-sandbox", "--disable-gpu", "--disable-dev-shm-usage"] \
+}' > /antora/puppeteer-config.json
+
+# Pre-generate a simple diagram to warm up and verify the stack works
+RUN echo 'graph TD; A-->B;' > /tmp/test.mmd \
+    && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /antora/puppeteer-config.json \
+    && rm /tmp/test.mmd /tmp/test.svg
+
+ENTRYPOINT ["npx", "antora"]
+CMD ["--stacktrace", "antora-playbook.yml"]
diff --git a/docs/antora-playbook.yml b/docs/antora-playbook.yml
index 5d521e6..b07afab 100644
--- a/docs/antora-playbook.yml
+++ b/docs/antora-playbook.yml
@@ -11,10 +11,10 @@ content:
 asciidoc:
   extensions:
     - asciidoctor-kroki
-
-kroki:
-  server-url: https://kroki.io
-  fetch-diagram: true
+  attributes:
+    # Use local mermaid-cli via Kroki (no external server needed when
+    # built with the custom Docker image in docs/.docker/Dockerfile)
+    kroki-fetch-diagram: true
 
 ui:
   bundle: