
Feature/49 tool calling pipeline (#50)

Merged
michalharakal merged 9 commits into develop from feature/49-tool-calling-pipeline on Apr 11, 2026
Conversation

michalharakal (Contributor) commented Apr 11, 2026

feat: unified model pipeline with decoupled tool calling (#49)
Implements the unified inference pipeline for SKaiNET Transformers,
resolving #49 and building on the tool calling foundation from #46.

Summary

This branch decouples tool calling from the kllama runner, creates a
unified model pipeline with architecture auto-detection, and adds
comprehensive Antora documentation.

Phase 1: Decouple Tool Calling

  • Enhance Tokenizer interface with eosTokenId, bosTokenId, vocabSize
  • Create ChatSession abstraction in llm-agent (any runner gets tool
    calling for free)
  • Refactor ToolCallingDemo and AgentCli to accept Tokenizer, not
    GGUFTokenizer
  • Fix JavaAgentLoop instanceof hack
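The enhanced `Tokenizer` interface from Phase 1 might look like the following minimal sketch. The property names come from the bullet list above; `CharTokenizer` is a hypothetical toy implementation added purely for illustration:

```kotlin
// Sketch of the enhanced Tokenizer interface described above.
// Exposing eosTokenId/bosTokenId/vocabSize lets callers (e.g. an agent
// loop) stop generation without casting to a concrete GGUFTokenizer.
interface Tokenizer {
    val eosTokenId: Int
    val bosTokenId: Int
    val vocabSize: Int
    fun encode(text: String): List<Int>
    fun decode(ids: List<Int>): String
}

// Hypothetical toy implementation: one token per character.
class CharTokenizer : Tokenizer {
    override val bosTokenId = 0
    override val eosTokenId = 1
    override val vocabSize = 258 // 256 byte values + BOS + EOS
    override fun encode(text: String) = text.map { it.code + 2 }
    override fun decode(ids: List<Int>) =
        ids.filter { it >= 2 }.map { (it - 2).toChar() }.joinToString("")
}
```

Any class satisfying this contract — GGUF, HF BPE, Tekken, or BERT WordPiece — can then drive the agent loop.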

Phase 2: Model Registry

  • Add ModelFamily enum (LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL)
  • Add ModelRegistry.detect() for GGUF architecture auto-detection
  • Add UnifiedModelLoader.peek() to extract model info without loading
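Detection maps the GGUF `general.architecture` metadata string to a `ModelFamily`. A minimal sketch follows; the enum values come from the bullet list above, but the concrete architecture strings and the capability flags are illustrative placeholders, not the real registry:

```kotlin
// Sketch of ModelFamily detection from a GGUF architecture string.
// supportsToolCalling values here are placeholders for illustration.
enum class ModelFamily(val supportsToolCalling: Boolean) {
    LLAMA(true), QWEN(true), GEMMA(false),
    APERTUS(false), BERT(false), VOXTRAL(false), UNKNOWN(false)
}

object ModelRegistry {
    // Assumed architecture prefixes ("llama", "qwen3", ...); the real
    // registry reads the exact string from GGUF metadata.
    fun detect(architecture: String): ModelFamily = when {
        architecture.startsWith("llama") -> ModelFamily.LLAMA
        architecture.startsWith("qwen") -> ModelFamily.QWEN
        architecture.startsWith("gemma") -> ModelFamily.GEMMA
        architecture.startsWith("bert") -> ModelFamily.BERT
        else -> ModelFamily.UNKNOWN
    }
}
```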

Phase 3: Tokenization Pipeline

  • Move GGUFTokenizer from kllama to llm-core (all runners can use it)
  • Create TokenizerFactory with fromGGUF(), fromTokenizerJson(),
    fromHuggingFace()
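The factory's three entry points can be sketched as a dispatch on the tokenizer source. The method names come from the bullet above; the file-name-based dispatch and the string return type are assumptions made so the sketch stays self-contained:

```kotlin
// Sketch of a TokenizerFactory front-end that picks a loader by source.
// In the real module each branch would parse the file and build a
// Tokenizer; here we only report which loader would be chosen.
object TokenizerFactory {
    fun loaderFor(path: String): String = when {
        path.endsWith(".gguf") -> "fromGGUF"                   // vocab embedded in GGUF metadata
        path.endsWith("tokenizer.json") -> "fromTokenizerJson" // HF fast-tokenizer JSON
        else -> "fromHuggingFace"                              // resolve via a model repo id
    }
}
```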

Phase 4: Unified CLI

  • New skainet-cli module: single entry point for all GGUF models
  • Auto-detects architecture, supports --chat/--agent/--demo modes

Smoke Tests

  • Add tool calling test phase with [Tool Call] detection
  • Add ToolCallingDemo.runSingleShot() for non-interactive testing
  • Add Qwen3-8B-Q4 to smoke test config

Documentation (Antora + Divio)

  • 19 AsciiDoc pages: tutorials, how-to, reference, explanation
  • Mermaid diagrams via Kroki for pipeline, architecture, agent loop
  • GitHub Actions workflow for docs build and GitHub Pages deployment

michalharakal and others added 9 commits on April 11, 2026 at 10:25
Phase 1 of the unified pipeline plan. Tool calling no longer requires
GGUFTokenizer — any Tokenizer implementation works.

- Extend Tokenizer interface with eosTokenId, bosTokenId, vocabSize
- Add ChatSession in llm-agent that bundles runtime + tokenizer + metadata
- Refactor ToolCallingDemo and AgentCli to accept Tokenizer, not GGUFTokenizer
- Remove GGUFTokenizer cast from kllama Main.kt chat/agent/demo dispatch
- Fix JavaAgentLoop instanceof hack with tokenizer.eosTokenId
- Update all Tokenizer implementations (GGUF, HF BPE, Tekken, BERT)
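The ChatSession idea — bundle any token-level runner with any `Tokenizer` so tool calling needs no runner-specific casts — can be sketched as below. The `runner` function type, the stop condition, and the minimal `Tokenizer` subset are illustrative assumptions, not the llm-agent API:

```kotlin
// Minimal Tokenizer subset, just enough for the sketch.
interface Tokenizer {
    val eosTokenId: Int
    fun encode(text: String): List<Int>
    fun decode(ids: List<Int>): String
}

// Sketch of ChatSession: pair a generic next-token runner with a
// tokenizer. Generation stops at eosTokenId via the interface, which is
// what replaces the old instanceof/cast on GGUFTokenizer.
class ChatSession(
    val tokenizer: Tokenizer,
    val runner: (List<Int>) -> Int, // next-token function over token ids
) {
    fun generate(prompt: String, maxTokens: Int = 64): String {
        val ids = tokenizer.encode(prompt).toMutableList()
        val out = mutableListOf<Int>()
        repeat(maxTokens) {
            val next = runner(ids)
            if (next == tokenizer.eosTokenId) return tokenizer.decode(out)
            ids += next
            out += next
        }
        return tokenizer.decode(out)
    }
}
```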

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3 of the unified pipeline plan. Tokenization is now a standalone
pipeline stage in llm-core, independent of any specific runner.

- Move GGUFTokenizer from kllama to llm-core/tokenizer package
- Add typealias in kllama for backwards compatibility
- Create TokenizerFactory with fromGGUF(), fromTokenizerJson(), fromHuggingFace()
- Add skainet-io-gguf and kotlinx-io-core dependencies to llm-core

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…detection

Phase 2 of the unified pipeline plan. Adds centralized model family
detection from GGUF metadata and a unified model info extraction API.

- Add ModelFamily enum (LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL)
  with capabilities (supportsToolCalling, chatTemplateFamily)
- Add ModelRegistry.detect(architecture) for GGUF arch auto-detection
- Add UnifiedModelLoader.peek(source) to extract GGUFModelInfo without
  loading weights (architecture, family, dimensions)

DSL network definitions already exist for all major architectures
except Gemma3n. CLI migration to OptimizedLLMRuntime is future work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 4 of the unified pipeline plan. New skainet-cli module that
auto-detects model architecture from GGUF metadata and supports all
modes (generate, chat, agent, demo) for any LLaMA-compatible model.

- New llm-apps/skainet-cli module with single entry point
- Auto-detects architecture via UnifiedModelLoader.peek()
- Supports --chat, --agent, --demo with tool calling for all models
- Registered as 'skainet' runner in smoke test script
- Existing per-model CLIs preserved (no breaking changes)

Usage:
  skainet -m model.gguf "prompt"        # auto-detect and generate
  skainet -m model.gguf --chat          # interactive chat
  skainet -m model.gguf --demo          # tool calling demo

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add tool calling test phase to smoke-test.sh that runs --demo with a
prompt for models with toolCalling config. Add Qwen3-8B-Q4 to smoke
test config.

- Add ToolCallingDemo.runSingleShot() for non-interactive tool calling
- Wire --demo with positional prompt to single-shot mode in kllama and skainet CLIs
- Add tool calling section to smoke-test.sh with [Tool Call] detection
- Add skainet runner to smoke-test.sh runner_args
- Increase kllama-cli memory to -Xmx42g -XX:MaxDirectMemorySize=64g
- Add Qwen3-8B-Q4 and toolCalling config to smoke-models.json
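The `--demo` wiring described above — a positional prompt triggers one non-interactive `runSingleShot()` turn, no prompt starts the interactive loop — can be sketched as an argument-dispatch helper. `DemoMode` and `demoModeFor` are hypothetical names for illustration; only the flag names come from the commit:

```kotlin
// Sketch of the --demo dispatch: positional prompt -> single-shot,
// otherwise interactive. Assumes an arg list like
// ["-m", "model.gguf", "--demo", "What is 2+2?"].
enum class DemoMode { SINGLE_SHOT, INTERACTIVE }

fun demoModeFor(args: List<String>): DemoMode {
    val positional = args
        .filterNot { it.startsWith("-") }
        .drop(1) // drop the model path that follows -m in the real CLI
    return if ("--demo" in args && positional.isNotEmpty()) DemoMode.SINGLE_SHOT
    else DemoMode.INTERACTIVE
}
```

Single-shot mode is what lets smoke-test.sh run the demo headlessly and grep the output for the `[Tool Call]` marker.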

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AsciiDoc documentation in Antora site format with four Divio categories:

Tutorials:
- Getting started with the skainet CLI
- Tool calling with any model via ChatSession
- Running smoke tests

How-to Guides:
- Add a new model architecture (DSL vs hand-coded)
- Add a compute backend
- Add a custom tool
- Use the unified CLI

Reference:
- Architecture overview and module structure
- Inference pipeline stages
- Tokenizer API and TokenizerFactory
- ChatSession API
- Model Registry and UnifiedModelLoader
- CLI reference (skainet + model-specific CLIs)

Explanation:
- Pipeline design decisions (why stages are separated)
- DSL networks vs hand-coded runtimes (trade-offs)
- Tokenizer internals (SentencePiece, BPE, WordPiece)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add docs.yml workflow: builds on push to main/develop, deploys to
  GitHub Pages from develop branch
- Uses dockerized Antora 3.1 with asciidoctor-kroki for Mermaid diagrams
- Add antora-playbook.yml with Kroki server integration
- Convert ASCII diagrams to Mermaid in pipeline, architecture,
  tool-calling, and pipeline-design pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the unified inference pipeline for SKaiNET Transformers,
resolving #49 and building on the tool calling foundation from #46.

## Summary

This branch decouples tool calling from the kllama runner, creates a
unified model pipeline with architecture auto-detection, and adds
comprehensive Antora documentation.

### Phase 1: Decouple Tool Calling
- Enhance Tokenizer interface with eosTokenId, bosTokenId, vocabSize
- Create ChatSession abstraction in llm-agent (any runner gets tool
  calling for free)
- Refactor ToolCallingDemo and AgentCli to accept Tokenizer, not
  GGUFTokenizer
- Fix JavaAgentLoop instanceof hack

### Phase 2: Model Registry
- Add ModelFamily enum (LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL)
- Add ModelRegistry.detect() for GGUF architecture auto-detection
- Add UnifiedModelLoader.peek() to extract model info without loading

### Phase 3: Tokenization Pipeline
- Move GGUFTokenizer from kllama to llm-core (all runners can use it)
- Create TokenizerFactory with fromGGUF(), fromTokenizerJson(),
  fromHuggingFace()

### Phase 4: Unified CLI
- New skainet-cli module: single entry point for all GGUF models
- Auto-detects architecture, supports --chat/--agent/--demo modes

### Smoke Tests
- Add tool calling test phase with [Tool Call] detection
- Add ToolCallingDemo.runSingleShot() for non-interactive testing
- Add Qwen3-8B-Q4 to smoke test config

### Documentation (Antora + Divio)
- 19 AsciiDoc pages: tutorials, how-to, reference, explanation
- Mermaid diagrams via Kroki for pipeline, architecture, agent loop
- GitHub Actions workflow for docs build and GitHub Pages deployment

Refs: #46

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom image based on node:20-alpine with:
- Antora 3.1 site generator
- asciidoctor-kroki for diagram blocks
- @mermaid-js/mermaid-cli with Chromium for local SVG rendering
- No external Kroki server dependency

The GitHub Actions workflow builds the image from docs/.docker/Dockerfile
then uses it to generate the site. Mermaid diagrams are rendered locally
inside the container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
michalharakal merged commit f9110fb into develop on Apr 11, 2026 (1 of 4 checks passed)
michalharakal deleted the feature/49-tool-calling-pipeline branch on April 11, 2026 at 10:58