Merged
65 changes: 65 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,65 @@
name: Docs

on:
  push:
    branches: [ main, develop ]
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'
  pull_request:
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'
  workflow_dispatch:

concurrency:
  group: docs-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build-docs:
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - name: Checkout
        uses: actions/checkout@v6

      - name: Build custom Antora image
        run: |
          docker build \
            -t skainet-antora:local \
            -f docs/.docker/Dockerfile \
            docs/.docker/

      - name: Build Antora site
        run: |
          docker run --rm \
            -v "${{ github.workspace }}:/antora" \
            --workdir /antora/docs \
            skainet-antora:local \
            --stacktrace \
            antora-playbook.yml

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: docs/build/site

  deploy-docs:
    if: github.ref == 'refs/heads/develop' && github.event_name == 'push'
    needs: build-docs
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}

    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
130 changes: 130 additions & 0 deletions PLAN-unified-pipeline.md
@@ -0,0 +1,130 @@
# Plan: Unified Model Pipeline with Decoupled Tool Calling

## Context

Currently SKaiNET-transformers has:
- **5+ hand-coded runtimes** (LlamaRuntime, Qwen35Runtime, Gemma3nRuntime, ApertusRuntime, VoxtralRuntimes) — each reimplements the forward pass, weight loading, and layer execution
- **Tool calling tightly coupled to kllama** — the AgentLoop, ToolCallingDemo, and chat modes only exist in the kllama runner. Other models (Gemma, Apertus) cannot use tool calling without duplicating code
- **Two execution paths** — legacy hand-coded runtimes AND the newer `OptimizedLLMRuntime` with DSL/compute-graph/AOT. LlamaRuntime and ApertusRuntime are already marked deprecated

The goal: converge on **one unified pipeline** where model definition, weight loading, tokenization, and tool calling are cleanly separated pipeline stages.

## Architecture Overview

```
GGUF/SafeTensors File
|
WeightLoader (parse metadata + tensors)
|
DSL Network Definition (model-specific, declarative)
|
ComputeGraph (DAG)
|
Optimization Pipeline (TransposeElim -> WeightDedup -> LLMFusion -> DCE)
|
ComputeGraphExecutor (fused kernels)
|
InferenceRuntime (unified: forward + generate)
|
TokenizationPipeline (encode/decode, special tokens, byte-level BPE)
|
ChatPipeline (template formatting, tool calling, agent loop)
```
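Read as code, the diagram is a straight-line composition of stages. A toy sketch of that shape follows — every type and function here is a stand-in invented for illustration, not a real SKaiNET API; the point is only the one-directional data flow between cleanly separated stages:

```kotlin
// Toy stand-ins for the pipeline stages above; none of these types are
// real SKaiNET APIs. Each stage consumes the previous stage's output.
data class Weights(val tensors: Map<String, FloatArray>)
data class ComputeGraph(val nodes: List<String>)

// Stage 1: parse metadata + tensors (here: a hard-coded toy weight set).
fun loadWeights(path: String) =
    Weights(mapOf("tok_embd" to FloatArray(8), "tok_embd_copy" to FloatArray(8)))

// Stage 2: declarative network definition built from the loaded weights.
fun defineNetwork(w: Weights) = ComputeGraph(w.tensors.keys.toList())

// Stage 3: stand-in for TransposeElim -> WeightDedup -> LLMFusion -> DCE.
fun optimize(g: ComputeGraph) =
    ComputeGraph(g.nodes.filterNot { it.endsWith("_copy") })

fun main() {
    val graph = optimize(defineNetwork(loadWeights("model.gguf")))
    println(graph.nodes)  // [tok_embd]
}
```

Because each stage only depends on the previous stage's output type, a model family swaps in a different `defineNetwork` while every other stage stays shared.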

## Phase 1: Decouple Tool Calling from kllama (immediate value) -- DONE

**What was done:**

1. **Enhanced `Tokenizer` interface** with `eosTokenId`, `bosTokenId`, `vocabSize`
- Updated all implementations: `GGUFTokenizer`, `TokenizerImpl`, `HuggingFaceBPETokenizer`, `TekkenTokenizerAdapter`, `HuggingFaceTokenizer` (BERT)

2. **Created `ChatSession` abstraction** in `llm-agent`
- File: `llm-agent/.../chat/ChatSession.kt`
- Bundles `InferenceRuntime` + `Tokenizer` + `ModelMetadata`
- Provides `createAgentLoop()` and `runSingleTurn()` for any runner

3. **Refactored `ToolCallingDemo` and `AgentCli`** to use `Tokenizer` interface instead of `GGUFTokenizer`
- Both now accept any `Tokenizer`, not just `GGUFTokenizer`
- Both use `ChatSession` internally for agent loop creation

4. **Removed `GGUFTokenizer` cast from kllama Main.kt** dispatch
- Chat/agent/demo modes now work with any `Tokenizer`

5. **Fixed `JavaAgentLoop`** — replaced `GGUFTokenizer` instanceof hack with `tokenizer.eosTokenId`
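Taken together, Phase 1 boils down to a small interface surface. A minimal, self-contained sketch — only `eosTokenId`/`bosTokenId`/`vocabSize` and the `ChatSession` members named above come from this plan; every other type and signature is an illustrative assumption:

```kotlin
// Sketch of the Phase 1 surface. Only eosTokenId/bosTokenId/vocabSize and
// the ChatSession members (createAgentLoop, runSingleTurn) are named in
// the plan; all other types and signatures are illustrative assumptions.
interface Tokenizer {
    val eosTokenId: Int
    val bosTokenId: Int
    val vocabSize: Int
    fun encode(text: String): IntArray
    fun decode(tokens: IntArray): String
}

interface InferenceRuntime {
    fun nextToken(context: IntArray): Int
}

data class ModelMetadata(val name: String, val contextLength: Int)

class AgentLoop(val session: ChatSession)

// Bundles runtime + tokenizer + metadata so chat/agent/demo modes can run
// against any Tokenizer, with no GGUFTokenizer casts anywhere.
class ChatSession(
    val runtime: InferenceRuntime,
    val tokenizer: Tokenizer,
    val metadata: ModelMetadata,
) {
    fun createAgentLoop() = AgentLoop(this)

    fun runSingleTurn(prompt: String, maxTokens: Int = 64): String {
        var context = tokenizer.encode(prompt)
        val generated = mutableListOf<Int>()
        while (generated.size < maxTokens) {
            val next = runtime.nextToken(context)
            if (next == tokenizer.eosTokenId) break // interface, not instanceof
            generated += next
            context += next
        }
        return tokenizer.decode(generated.toIntArray())
    }
}
```

The `JavaAgentLoop` fix in item 5 is the same move in miniature: stopping on `tokenizer.eosTokenId` instead of downcasting to `GGUFTokenizer`.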

## Phase 2: Unified DSL-Based Model Definition (converge on OptimizedLLMRuntime) -- PARTIAL

**What was done:**

1. **Created `ModelRegistry`** in `llm-core/.../ModelRegistry.kt`
- `ModelFamily` enum: LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL, UNKNOWN
- `ModelRegistry.detect(architecture)` maps GGUF arch strings to families
- Tracks capabilities (supportsToolCalling, chatTemplateFamily)

2. **Created `UnifiedModelLoader`** in `llm-core/.../UnifiedModelLoader.kt`
- `UnifiedModelLoader.peek(source)` extracts `GGUFModelInfo` from GGUF metadata
- Returns architecture, family, dimensions without loading weights
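A rough sketch of this detect-before-load flow — the `ModelFamily` values come from the plan, but the matching rules, the `GGUFModelInfo` fields, and the `peek` signature shown here are assumptions (the real `peek` reads GGUF header bytes; here metadata is passed in directly):

```kotlin
// Sketch of detect-before-load. ModelFamily values come from the plan;
// matching rules and GGUFModelInfo fields are illustrative assumptions.
enum class ModelFamily { LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL, UNKNOWN }

data class GGUFModelInfo(val architecture: String, val family: ModelFamily)

object ModelRegistry {
    // Map GGUF "general.architecture" strings to families.
    fun detect(architecture: String): ModelFamily = when {
        architecture.startsWith("llama") -> ModelFamily.LLAMA
        architecture.startsWith("qwen") -> ModelFamily.QWEN
        architecture.startsWith("gemma") -> ModelFamily.GEMMA
        architecture.startsWith("apertus") -> ModelFamily.APERTUS
        architecture == "bert" -> ModelFamily.BERT
        architecture.startsWith("voxtral") -> ModelFamily.VOXTRAL
        else -> ModelFamily.UNKNOWN
    }
}

// The real peek() reads only the GGUF header, never the tensor data;
// this toy version takes the already-parsed metadata map.
fun peek(metadata: Map<String, String>): GGUFModelInfo {
    val arch = metadata["general.architecture"] ?: "unknown"
    return GGUFModelInfo(arch, ModelRegistry.detect(arch))
}

fun main() {
    println(peek(mapOf("general.architecture" to "qwen2")).family)  // QWEN
}
```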

**Already existing (no changes needed):**
- DSL networks: `llamaNetwork()`, `qwenNetwork()`, `apertusNetwork()`, `bertNetwork()`, `voxtralBackboneNetwork()`, `voxtralAcousticNetwork()`
- `OptimizedLLMRuntime` with DIRECT/OPTIMIZED/HYBRID modes
- Per-model `NetworkLoader` classes (LlamaNetworkLoader, ApertusNetworkLoader, etc.)

**Remaining (future work):**
- `gemmaNetwork()` DSL definition (Gemma3n has unique features: GELU, MatFormer variable FFN, sliding window)
- Migrate CLI runners from deprecated runtimes to OptimizedLLMRuntime
- Remove deprecated LlamaRuntime and ApertusRuntime

## Phase 3: Tokenization as Pipeline Stage -- DONE

**What was done:**

1. **Enhanced `Tokenizer` interface** with `eosTokenId`, `bosTokenId`, `vocabSize` (done in Phase 1)

2. **Moved `GGUFTokenizer` from kllama to `llm-core`**
- New location: `llm-core/.../tokenizer/GGUFTokenizer.kt`
- Old location has a typealias for backwards compatibility
- Added `skainet-io-gguf` and `kotlinx-io-core` dependencies to `llm-core`

3. **Created `TokenizerFactory`** in `llm-core/.../tokenizer/TokenizerFactory.kt`
- `TokenizerFactory.fromGGUF(source)` — from GGUF file metadata
- `TokenizerFactory.fromTokenizerJson(json)` — from HuggingFace tokenizer.json
- `TokenizerFactory.fromHuggingFace(json, config)` — full HF BPE tokenizer

4. All runners can now use `GGUFTokenizer` and `TokenizerFactory` directly from `llm-core`
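The factory pattern here is simple dispatch on input format. A sketch — the three entry-point names come from the plan, but the toy tokenizer classes and the method bodies are assumptions:

```kotlin
// Sketch of TokenizerFactory dispatch; the entry-point names come from
// the plan, the toy tokenizers and bodies are illustrative assumptions.
interface Tokenizer { val vocabSize: Int }

// Toy GGUF-backed tokenizer: vocab size from the token list in metadata.
class GGUFTokenizer(metadata: Map<String, List<String>>) : Tokenizer {
    override val vocabSize = metadata["tokenizer.ggml.tokens"]?.size ?: 0
}

// Toy HuggingFace-style tokenizer: vocab size from a vocab map.
class HuggingFaceBPETokenizer(vocab: Map<String, Int>) : Tokenizer {
    override val vocabSize = vocab.size
}

object TokenizerFactory {
    fun fromGGUF(metadata: Map<String, List<String>>): Tokenizer =
        GGUFTokenizer(metadata)

    fun fromTokenizerJson(vocab: Map<String, Int>): Tokenizer =
        HuggingFaceBPETokenizer(vocab)
}

fun main() {
    val t = TokenizerFactory.fromGGUF(
        mapOf("tokenizer.ggml.tokens" to listOf("<s>", "</s>", "hello"))
    )
    println(t.vocabSize)  // 3
}
```

Callers depend only on the `Tokenizer` return type, so the concrete class behind each factory method can change without touching any runner.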

## Phase 4: Unified Runner (single CLI entry point) -- DONE

**What was done:**

1. **Created `llm-apps/skainet-cli`** — new unified CLI module
- Auto-detects architecture from GGUF metadata via `UnifiedModelLoader.peek()`
- Loads any LLaMA-compatible model (LLaMA, Qwen, Mistral)
- Supports `--chat`, `--agent`, `--demo` modes with tool calling
- Uses `TokenizerFactory.fromGGUF()` for tokenizer loading
- Registered as `skainet` runner in smoke test script

2. **Usage:**
```bash
skainet -m model.gguf "The capital of France is" # auto-detect, generate
skainet -m model.gguf --chat # interactive chat
skainet -m model.gguf --demo "What is 2+2?" # tool calling demo
```

3. **Existing per-model CLIs are preserved** — no breaking changes

**Remaining (future work):**
- Add Gemma3n loading path to unified CLI (requires gemmaNetwork() DSL)
- Add Apertus loading path to unified CLI
- Eventually deprecate per-model CLIs

## Status Summary

| Phase | Status | Summary |
|-------|--------|---------|
| 1. Decouple tool calling | DONE | ChatSession, Tokenizer interface, no GGUFTokenizer coupling |
| 2. Unified DSL pipeline | PARTIAL | ModelRegistry, UnifiedModelLoader done; gemmaNetwork() DSL and migration off deprecated runtimes remain |
| 3. Tokenization pipeline | DONE | GGUFTokenizer in llm-core, TokenizerFactory |
| 4. Unified runner | DONE | skainet-cli with auto-detection |
2 changes: 2 additions & 0 deletions docs/.docker/.dockerignore
@@ -0,0 +1,2 @@
node_modules
build
35 changes: 35 additions & 0 deletions docs/.docker/Dockerfile
@@ -0,0 +1,35 @@
FROM node:20-alpine

LABEL org.opencontainers.image.title="SKaiNET Antora" \
      org.opencontainers.image.description="Antora site generator with built-in Mermaid rendering" \
      org.opencontainers.image.source="https://github.com/SKaiNET-developers/SKaiNET-transformers"

# Chromium for mermaid-cli (puppeteer)
RUN apk add --no-cache chromium font-noto

ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
    PUPPETEER_SKIP_DOWNLOAD=true

WORKDIR /antora

# Install Antora + extensions + mermaid-cli in one layer
RUN npm i --save-exact \
      @antora/cli@3.1 \
      @antora/site-generator@3.1 \
      asciidoctor-kroki@0.18 \
      @mermaid-js/mermaid-cli@11 \
    && npm cache clean --force

# Mermaid-cli config: use installed Chromium, no sandbox (container)
RUN echo '{ \
      "executablePath": "/usr/bin/chromium-browser", \
      "args": ["--no-sandbox", "--disable-gpu", "--disable-dev-shm-usage"] \
    }' > /antora/puppeteer-config.json

# Pre-generate a simple diagram to warm up and verify the stack works
RUN echo 'graph TD; A-->B;' > /tmp/test.mmd \
    && npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /antora/puppeteer-config.json \
    && rm /tmp/test.mmd /tmp/test.svg

ENTRYPOINT ["npx", "antora"]
CMD ["--stacktrace", "antora-playbook.yml"]
25 changes: 25 additions & 0 deletions docs/antora-playbook.yml
@@ -0,0 +1,25 @@
site:
  title: SKaiNET Transformers
  start_page: skainet-transformers::index.adoc

content:
  sources:
    - url: .
      start_path: docs
      branches: HEAD

asciidoc:
  extensions:
    - asciidoctor-kroki
  attributes:
    # Use local mermaid-cli via Kroki (no external server needed when
    # built with the custom Docker image in docs/.docker/Dockerfile)
    kroki-fetch-diagram: true

ui:
  bundle:
    url: https://gitlab.com/antora/antora-ui-default/-/jobs/artifacts/HEAD/raw/build/ui-bundle.zip?job=bundle-stable
    snapshot: true

output:
  dir: ./build/site
5 changes: 5 additions & 0 deletions docs/antora.yml
@@ -0,0 +1,5 @@
name: skainet-transformers
title: SKaiNET Transformers
version: ~
nav:
  - modules/ROOT/nav.adoc
25 changes: 25 additions & 0 deletions docs/modules/ROOT/nav.adoc
@@ -0,0 +1,25 @@
* xref:index.adoc[Overview]

.Tutorials
* xref:tutorials/getting-started.adoc[Getting Started]
* xref:tutorials/tool-calling.adoc[Tool Calling with Any Model]
* xref:tutorials/smoke-tests.adoc[Running Smoke Tests]

.How-to Guides
* xref:how-to/add-model.adoc[Add a New Model Architecture]
* xref:how-to/add-compute-backend.adoc[Add a Compute Backend]
* xref:how-to/add-tool.adoc[Add a Custom Tool]
* xref:how-to/run-unified-cli.adoc[Use the Unified CLI]

.Reference
* xref:reference/architecture.adoc[Architecture Overview]
* xref:reference/pipeline.adoc[Inference Pipeline]
* xref:reference/tokenizer-api.adoc[Tokenizer API]
* xref:reference/chat-session-api.adoc[ChatSession API]
* xref:reference/model-registry.adoc[Model Registry]
* xref:reference/cli-reference.adoc[CLI Reference]

.Explanation
* xref:explanation/pipeline-design.adoc[Pipeline Design Decisions]
* xref:explanation/dsl-vs-handcoded.adoc[DSL Networks vs Hand-Coded Runtimes]
* xref:explanation/tokenizer-internals.adoc[Tokenizer Internals]