Merged
65 changes: 65 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,65 @@
name: Docs

on:
  push:
    branches: [ main, develop ]
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'
  pull_request:
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'
  workflow_dispatch:

concurrency:
  group: docs-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build-docs:
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - name: Checkout
        uses: actions/checkout@v6

      - name: Build custom Antora image
        run: |
          docker build \
            -t skainet-antora:local \
            -f docs/.docker/Dockerfile \
            docs/.docker/

      - name: Build Antora site
        run: |
          docker run --rm \
            -v "${{ github.workspace }}:/antora" \
            --workdir /antora/docs \
            skainet-antora:local \
            --stacktrace \
            antora-playbook.yml

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: docs/build/site

  deploy-docs:
    if: github.ref == 'refs/heads/develop' && github.event_name == 'push'
    needs: build-docs
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}

    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
130 changes: 130 additions & 0 deletions PLAN-unified-pipeline.md
@@ -0,0 +1,130 @@
# Plan: Unified Model Pipeline with Decoupled Tool Calling

## Context

Currently SKaiNET-transformers has:
- **5+ hand-coded runtimes** (LlamaRuntime, Qwen35Runtime, Gemma3nRuntime, ApertusRuntime, VoxtralRuntimes) — each reimplements the forward pass, weight loading, and layer execution
- **Tool calling tightly coupled to kllama** — the AgentLoop, ToolCallingDemo, and chat modes only exist in the kllama runner. Other models (Gemma, Apertus) cannot use tool calling without duplicating code
- **Two execution paths** — legacy hand-coded runtimes AND the newer `OptimizedLLMRuntime` with DSL/compute-graph/AOT. LlamaRuntime and ApertusRuntime are already marked deprecated

The goal: converge on **one unified pipeline** where model definition, weight loading, tokenization, and tool calling are cleanly separated pipeline stages.

## Architecture Overview

```
GGUF/SafeTensors File
|
WeightLoader (parse metadata + tensors)
|
DSL Network Definition (model-specific, declarative)
|
ComputeGraph (DAG)
|
Optimization Pipeline (TransposeElim -> WeightDedup -> LLMFusion -> DCE)
|
ComputeGraphExecutor (fused kernels)
|
InferenceRuntime (unified: forward + generate)
|
TokenizationPipeline (encode/decode, special tokens, byte-level BPE)
|
ChatPipeline (template formatting, tool calling, agent loop)
```
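Read as code, the diagram is a straight-line composition of stages. A toy sketch of that shape follows — every type and function here is a stand-in invented for illustration, not a real SKaiNET API; the point is only the one-directional data flow between cleanly separated stages:

```kotlin
// Toy stand-ins for the pipeline stages above; none of these types are
// real SKaiNET APIs. Each stage consumes the previous stage's output.
data class Weights(val tensors: Map<String, FloatArray>)
data class ComputeGraph(val nodes: List<String>)

// Stage 1: parse metadata + tensors (here: a hard-coded toy weight set).
fun loadWeights(path: String) =
    Weights(mapOf("tok_embd" to FloatArray(8), "tok_embd_copy" to FloatArray(8)))

// Stage 2: declarative network definition built from the loaded weights.
fun defineNetwork(w: Weights) = ComputeGraph(w.tensors.keys.toList())

// Stage 3: stand-in for TransposeElim -> WeightDedup -> LLMFusion -> DCE.
fun optimize(g: ComputeGraph) =
    ComputeGraph(g.nodes.filterNot { it.endsWith("_copy") })

fun main() {
    val graph = optimize(defineNetwork(loadWeights("model.gguf")))
    println(graph.nodes)  // [tok_embd]
}
```

Because each stage only depends on the previous stage's output type, a model family swaps in a different `defineNetwork` while every other stage stays shared.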

## Phase 1: Decouple Tool Calling from kllama (immediate value) -- DONE

**What was done:**

1. **Enhanced `Tokenizer` interface** with `eosTokenId`, `bosTokenId`, `vocabSize`
- Updated all implementations: `GGUFTokenizer`, `TokenizerImpl`, `HuggingFaceBPETokenizer`, `TekkenTokenizerAdapter`, `HuggingFaceTokenizer` (BERT)

2. **Created `ChatSession` abstraction** in `llm-agent`
- File: `llm-agent/.../chat/ChatSession.kt`
- Bundles `InferenceRuntime` + `Tokenizer` + `ModelMetadata`
- Provides `createAgentLoop()` and `runSingleTurn()` for any runner

3. **Refactored `ToolCallingDemo` and `AgentCli`** to use `Tokenizer` interface instead of `GGUFTokenizer`
- Both now accept any `Tokenizer`, not just `GGUFTokenizer`
- Both use `ChatSession` internally for agent loop creation

4. **Removed `GGUFTokenizer` cast from kllama Main.kt** dispatch
- Chat/agent/demo modes now work with any `Tokenizer`

5. **Fixed `JavaAgentLoop`** — replaced `GGUFTokenizer` instanceof hack with `tokenizer.eosTokenId`
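Taken together, Phase 1 boils down to a small interface surface. A minimal, self-contained sketch — only `eosTokenId`/`bosTokenId`/`vocabSize` and the `ChatSession` members named above come from this plan; every other type and signature is an illustrative assumption:

```kotlin
// Sketch of the Phase 1 surface. Only eosTokenId/bosTokenId/vocabSize and
// the ChatSession members (createAgentLoop, runSingleTurn) are named in
// the plan; all other types and signatures are illustrative assumptions.
interface Tokenizer {
    val eosTokenId: Int
    val bosTokenId: Int
    val vocabSize: Int
    fun encode(text: String): IntArray
    fun decode(tokens: IntArray): String
}

interface InferenceRuntime {
    fun nextToken(context: IntArray): Int
}

data class ModelMetadata(val name: String, val contextLength: Int)

class AgentLoop(val session: ChatSession)

// Bundles runtime + tokenizer + metadata so chat/agent/demo modes can run
// against any Tokenizer, with no GGUFTokenizer casts anywhere.
class ChatSession(
    val runtime: InferenceRuntime,
    val tokenizer: Tokenizer,
    val metadata: ModelMetadata,
) {
    fun createAgentLoop() = AgentLoop(this)

    fun runSingleTurn(prompt: String, maxTokens: Int = 64): String {
        var context = tokenizer.encode(prompt)
        val generated = mutableListOf<Int>()
        while (generated.size < maxTokens) {
            val next = runtime.nextToken(context)
            if (next == tokenizer.eosTokenId) break // interface, not instanceof
            generated += next
            context += next
        }
        return tokenizer.decode(generated.toIntArray())
    }
}
```

The `JavaAgentLoop` fix in item 5 is the same move in miniature: stopping on `tokenizer.eosTokenId` instead of downcasting to `GGUFTokenizer`.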

## Phase 2: Unified DSL-Based Model Definition (converge on OptimizedLLMRuntime) -- PARTIAL

**What was done:**

1. **Created `ModelRegistry`** in `llm-core/.../ModelRegistry.kt`
- `ModelFamily` enum: LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL, UNKNOWN
- `ModelRegistry.detect(architecture)` maps GGUF arch strings to families
- Tracks capabilities (supportsToolCalling, chatTemplateFamily)

2. **Created `UnifiedModelLoader`** in `llm-core/.../UnifiedModelLoader.kt`
- `UnifiedModelLoader.peek(source)` extracts `GGUFModelInfo` from GGUF metadata
- Returns architecture, family, dimensions without loading weights
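A rough sketch of this detect-before-load flow — the `ModelFamily` values come from the plan, but the matching rules, the `GGUFModelInfo` fields, and the `peek` signature shown here are assumptions (the real `peek` reads GGUF header bytes; here metadata is passed in directly):

```kotlin
// Sketch of detect-before-load. ModelFamily values come from the plan;
// matching rules and GGUFModelInfo fields are illustrative assumptions.
enum class ModelFamily { LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL, UNKNOWN }

data class GGUFModelInfo(val architecture: String, val family: ModelFamily)

object ModelRegistry {
    // Map GGUF "general.architecture" strings to families.
    fun detect(architecture: String): ModelFamily = when {
        architecture.startsWith("llama") -> ModelFamily.LLAMA
        architecture.startsWith("qwen") -> ModelFamily.QWEN
        architecture.startsWith("gemma") -> ModelFamily.GEMMA
        architecture.startsWith("apertus") -> ModelFamily.APERTUS
        architecture == "bert" -> ModelFamily.BERT
        architecture.startsWith("voxtral") -> ModelFamily.VOXTRAL
        else -> ModelFamily.UNKNOWN
    }
}

// The real peek() reads only the GGUF header, never the tensor data;
// this toy version takes the already-parsed metadata map.
fun peek(metadata: Map<String, String>): GGUFModelInfo {
    val arch = metadata["general.architecture"] ?: "unknown"
    return GGUFModelInfo(arch, ModelRegistry.detect(arch))
}

fun main() {
    println(peek(mapOf("general.architecture" to "qwen2")).family)  // QWEN
}
```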

**Already existing (no changes needed):**
- DSL networks: `llamaNetwork()`, `qwenNetwork()`, `apertusNetwork()`, `bertNetwork()`, `voxtralBackboneNetwork()`, `voxtralAcousticNetwork()`
- `OptimizedLLMRuntime` with DIRECT/OPTIMIZED/HYBRID modes
- Per-model `NetworkLoader` classes (LlamaNetworkLoader, ApertusNetworkLoader, etc.)

**Remaining (future work):**
- `gemmaNetwork()` DSL definition (Gemma3n has unique features: GELU, MatFormer variable FFN, sliding window)
- Migrate CLI runners from deprecated runtimes to OptimizedLLMRuntime
- Remove deprecated LlamaRuntime and ApertusRuntime

## Phase 3: Tokenization as Pipeline Stage -- DONE

**What was done:**

1. **Enhanced `Tokenizer` interface** with `eosTokenId`, `bosTokenId`, `vocabSize` (done in Phase 1)

2. **Moved `GGUFTokenizer` from kllama to `llm-core`**
- New location: `llm-core/.../tokenizer/GGUFTokenizer.kt`
- Old location has a typealias for backwards compatibility
- Added `skainet-io-gguf` and `kotlinx-io-core` dependencies to `llm-core`

3. **Created `TokenizerFactory`** in `llm-core/.../tokenizer/TokenizerFactory.kt`
- `TokenizerFactory.fromGGUF(source)` — from GGUF file metadata
- `TokenizerFactory.fromTokenizerJson(json)` — from HuggingFace tokenizer.json
- `TokenizerFactory.fromHuggingFace(json, config)` — full HF BPE tokenizer

4. All runners can now use `GGUFTokenizer` and `TokenizerFactory` directly from `llm-core`
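The factory pattern here is simple dispatch on input format. A sketch — the three entry-point names come from the plan, but the toy tokenizer classes and the method bodies are assumptions:

```kotlin
// Sketch of TokenizerFactory dispatch; the entry-point names come from
// the plan, the toy tokenizers and bodies are illustrative assumptions.
interface Tokenizer { val vocabSize: Int }

// Toy GGUF-backed tokenizer: vocab size from the token list in metadata.
class GGUFTokenizer(metadata: Map<String, List<String>>) : Tokenizer {
    override val vocabSize = metadata["tokenizer.ggml.tokens"]?.size ?: 0
}

// Toy HuggingFace-style tokenizer: vocab size from a vocab map.
class HuggingFaceBPETokenizer(vocab: Map<String, Int>) : Tokenizer {
    override val vocabSize = vocab.size
}

object TokenizerFactory {
    fun fromGGUF(metadata: Map<String, List<String>>): Tokenizer =
        GGUFTokenizer(metadata)

    fun fromTokenizerJson(vocab: Map<String, Int>): Tokenizer =
        HuggingFaceBPETokenizer(vocab)
}

fun main() {
    val t = TokenizerFactory.fromGGUF(
        mapOf("tokenizer.ggml.tokens" to listOf("<s>", "</s>", "hello"))
    )
    println(t.vocabSize)  // 3
}
```

Callers depend only on the `Tokenizer` return type, so the concrete class behind each factory method can change without touching any runner.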

## Phase 4: Unified Runner (single CLI entry point) -- DONE

**What was done:**

1. **Created `llm-apps/skainet-cli`** — new unified CLI module
- Auto-detects architecture from GGUF metadata via `UnifiedModelLoader.peek()`
- Loads any LLaMA-compatible model (LLaMA, Qwen, Mistral)
- Supports `--chat`, `--agent`, `--demo` modes with tool calling
- Uses `TokenizerFactory.fromGGUF()` for tokenizer loading
- Registered as `skainet` runner in smoke test script

2. **Usage:**
```bash
skainet -m model.gguf "The capital of France is" # auto-detect, generate
skainet -m model.gguf --chat # interactive chat
skainet -m model.gguf --demo "What is 2+2?" # tool calling demo
```

3. **Existing per-model CLIs are preserved** — no breaking changes

**Remaining (future work):**
- Add Gemma3n loading path to unified CLI (requires gemmaNetwork() DSL)
- Add Apertus loading path to unified CLI
- Eventually deprecate per-model CLIs

## Status Summary

| Phase | Status | Summary |
|-------|--------|---------|
| 1. Decouple tool calling | DONE | ChatSession, Tokenizer interface, no GGUFTokenizer coupling |
| 2. Unified DSL pipeline | PARTIAL | ModelRegistry, UnifiedModelLoader done; gemmaNetwork() DSL and migration off deprecated runtimes remain |
| 3. Tokenization pipeline | DONE | GGUFTokenizer in llm-core, TokenizerFactory |
| 4. Unified runner | DONE | skainet-cli with auto-detection |
2 changes: 2 additions & 0 deletions docs/.docker/.dockerignore
@@ -0,0 +1,2 @@
node_modules
build
35 changes: 35 additions & 0 deletions docs/.docker/Dockerfile
@@ -0,0 +1,35 @@
FROM node:20-alpine

LABEL org.opencontainers.image.title="SKaiNET Antora" \
      org.opencontainers.image.description="Antora site generator with built-in Mermaid rendering" \
      org.opencontainers.image.source="https://github.com/SKaiNET-developers/SKaiNET-transformers"

# Chromium for mermaid-cli (puppeteer)
RUN apk add --no-cache chromium font-noto

ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
    PUPPETEER_SKIP_DOWNLOAD=true

WORKDIR /antora

# Install Antora + extensions + mermaid-cli in one layer
RUN npm i --save-exact \
      @antora/cli@3.1 \
      @antora/site-generator@3.1 \
      asciidoctor-kroki@0.18 \
      @mermaid-js/mermaid-cli@11 \
    && npm cache clean --force

# Mermaid-cli config: use installed Chromium, no sandbox (container)
RUN echo '{ \
      "executablePath": "/usr/bin/chromium-browser", \
      "args": ["--no-sandbox", "--disable-gpu", "--disable-dev-shm-usage"] \
    }' > /antora/puppeteer-config.json

# Pre-generate a simple diagram to warm up and verify the stack works
RUN echo 'graph TD; A-->B;' > /tmp/test.mmd \
    && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /antora/puppeteer-config.json \
    && rm /tmp/test.mmd /tmp/test.svg

ENTRYPOINT ["npx", "antora"]
CMD ["--stacktrace", "antora-playbook.yml"]
25 changes: 25 additions & 0 deletions docs/antora-playbook.yml
@@ -0,0 +1,25 @@
site:
  title: SKaiNET Transformers
  start_page: skainet-transformers::index.adoc

content:
  sources:
    - url: .
      start_path: docs
      branches: HEAD

asciidoc:
  extensions:
    - asciidoctor-kroki
  attributes:
    # Use local mermaid-cli via Kroki (no external server needed when
    # built with the custom Docker image in docs/.docker/Dockerfile)
    kroki-fetch-diagram: true

ui:
  bundle:
    url: https://gitlab.com/antora/antora-ui-default/-/jobs/artifacts/HEAD/raw/build/ui-bundle.zip?job=bundle-stable
    snapshot: true

output:
  dir: ./build/site
5 changes: 5 additions & 0 deletions docs/antora.yml
@@ -0,0 +1,5 @@
name: skainet-transformers
title: SKaiNET Transformers
version: ~
nav:
  - modules/ROOT/nav.adoc
25 changes: 25 additions & 0 deletions docs/modules/ROOT/nav.adoc
@@ -0,0 +1,25 @@
* xref:index.adoc[Overview]

.Tutorials
* xref:tutorials/getting-started.adoc[Getting Started]
* xref:tutorials/tool-calling.adoc[Tool Calling with Any Model]
* xref:tutorials/smoke-tests.adoc[Running Smoke Tests]

.How-to Guides
* xref:how-to/add-model.adoc[Add a New Model Architecture]
* xref:how-to/add-compute-backend.adoc[Add a Compute Backend]
* xref:how-to/add-tool.adoc[Add a Custom Tool]
* xref:how-to/run-unified-cli.adoc[Use the Unified CLI]

.Reference
* xref:reference/architecture.adoc[Architecture Overview]
* xref:reference/pipeline.adoc[Inference Pipeline]
* xref:reference/tokenizer-api.adoc[Tokenizer API]
* xref:reference/chat-session-api.adoc[ChatSession API]
* xref:reference/model-registry.adoc[Model Registry]
* xref:reference/cli-reference.adoc[CLI Reference]

.Explanation
* xref:explanation/pipeline-design.adoc[Pipeline Design Decisions]
* xref:explanation/dsl-vs-handcoded.adoc[DSL Networks vs Hand-Coded Runtimes]
* xref:explanation/tokenizer-internals.adoc[Tokenizer Internals]