Unified Model Pipeline with Decoupled Tool Calling #49

@michalharakal

Description

Context

Currently SKaiNET-transformers has:

  • 5+ hand-coded runtimes (LlamaRuntime, Qwen35Runtime, Gemma3nRuntime, ApertusRuntime, VoxtralRuntime) — each reimplements the forward pass, weight loading, and layer execution
  • Tool calling tightly coupled to kllama — the AgentLoop, ToolCallingDemo, and chat modes only exist in the kllama runner. Other models (Gemma, Apertus) cannot use tool calling without duplicating code
  • Two execution paths — legacy hand-coded runtimes AND the newer OptimizedLLMRuntime with DSL/compute-graph/AOT. LlamaRuntime and ApertusRuntime are already marked deprecated

The goal: converge on one unified pipeline where model definition, weight loading, tokenization, and tool calling are cleanly separated pipeline stages.

Architecture Overview

GGUF/SafeTensors File
    |
WeightLoader (parse metadata + tensors)
    |
DSL Network Definition (model-specific, declarative)
    |
ComputeGraph (DAG)
    |
Optimization Pipeline (TransposeElim -> WeightDedup -> LLMFusion -> DCE)
    |
ComputeGraphExecutor (fused kernels)
    |
InferenceRuntime (unified: forward + generate)
    |
TokenizationPipeline (encode/decode, special tokens, byte-level BPE)
    |
ChatPipeline (template formatting, tool calling, agent loop)

Phase 1: Decouple Tool Calling from kllama (immediate value)

Problem: Tool calling lives in llm-agent (good), but it is wired up only through the kllama CLI. Other runners can't use it.

Changes:

  1. Move AgentLoop's generation into InferenceRuntime interface (llm-core)

    • generateUntilStop() is currently an extension function in llm-agent — promote it or add a generate() method with stop-token support to the interface
    • File: llm-core/.../InferenceRuntime.kt
  2. Create ChatSession abstraction (llm-agent)

    • Bundles: InferenceRuntime + Tokenizer + ChatTemplate + ToolRegistry
    • Any runner creates a ChatSession to get chat/agent/demo modes for free
    • File: new llm-agent/.../ChatSession.kt
  3. Extract CLI chat/agent/demo dispatch from kllama Main.kt into shared module

    • Currently lines 532-551 of Main.kt dispatch to ToolCallingDemo / AgentCli
    • This logic should work for any runner that provides InferenceRuntime + Tokenizer

Critical files:

  • llm-core/.../InferenceRuntime.kt — extend interface
  • llm-agent/.../AgentLoop.kt — already well-abstracted, keep as-is
  • llm-runtime/kllama/.../Main.kt — extract dispatch logic
  • llm-runtime/kllama/.../ToolCallingDemo.kt — move to llm-agent or llm-apps shared module
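The shape of the Phase 1 surface could look roughly like the sketch below. This is illustrative only: the stand-in types (Tokenizer, ChatTemplate, ToolRegistry) and the generate() signature are assumptions, not the final llm-core/llm-agent API.

```kotlin
// Simplified stand-ins for the real llm-core / llm-agent types.
interface Tokenizer {
    fun encode(text: String): IntArray
    fun decode(tokens: IntArray): String
}

interface InferenceRuntime {
    // Promoted from the llm-agent extension: generate until a stop token
    // is produced or maxTokens is reached.
    fun generate(prompt: IntArray, stopTokens: Set<Int>, maxTokens: Int = 512): IntArray
}

class ChatTemplate {
    // Placeholder formatting; real templates are model-specific (ChatML, Llama3, ...).
    fun format(role: String, text: String): String = "<|$role|>\n$text<|end|>"
}

class ToolRegistry

// ChatSession bundles everything a runner needs for chat/agent/demo modes,
// so the dispatch logic no longer has to live in kllama's Main.kt.
class ChatSession(
    private val runtime: InferenceRuntime,
    private val tokenizer: Tokenizer,
    private val template: ChatTemplate,
    private val tools: ToolRegistry,
    private val stopTokens: Set<Int>,
) {
    fun ask(userMessage: String): String {
        val prompt = tokenizer.encode(template.format("user", userMessage))
        return tokenizer.decode(runtime.generate(prompt, stopTokens))
    }
}
```

With this split, any runner that can supply an InferenceRuntime and a Tokenizer gets chat, agent, and demo modes by constructing a ChatSession.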

Phase 2: Unified DSL-Based Model Definition (converge on OptimizedLLMRuntime)

Problem: Each model has a hand-coded runtime. OptimizedLLMRuntime already supports DSL -> graph -> optimized execution, but only some models use it.

Changes:

  1. Define DSL networks for all model families:

    • llamaNetwork(config) — LLaMA/Mistral/Qwen2/3 (standard transformer)
    • qwen35Network(config) — Qwen3.5 (hybrid DeltaNet + full attention)
    • gemmaNetwork(config) — Gemma (GELU, MatFormer FFN, sliding window)
    • apertusNetwork(config) — Apertus (xIELU, ungated MLP, QK-norm)
    • Each is a pure function returning a Network<T> from the DSL
  2. Unified model loading flow:

    detectArchitecture(ggufMetadata) -> ModelFamily
    ModelFamily.createNetwork(config) -> Network<T>
    WeightLoader.loadAndMap(file, network) -> weights
    OptimizedLLMRuntime(network, weights, mode=HYBRID) -> InferenceRuntime
    
  3. Remove deprecated hand-coded runtimes once DSL equivalents are validated:

    • LlamaRuntime -> llamaNetwork() + OptimizedLLMRuntime
    • ApertusRuntime -> apertusNetwork() + OptimizedLLMRuntime
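The detection step in the loading flow could be sketched as follows. ModelFamily and the exact architecture string values are assumptions about the final API; "general.architecture" is the standard GGUF metadata key for the architecture name.

```kotlin
enum class ModelFamily { LLAMA, QWEN35, GEMMA, APERTUS }

// Map the GGUF architecture string to a model family; the accepted
// string values here are illustrative, not exhaustive.
fun detectArchitecture(ggufMetadata: Map<String, String>): ModelFamily =
    when (val arch = ggufMetadata["general.architecture"]) {
        "llama", "mistral", "qwen2", "qwen3" -> ModelFamily.LLAMA
        "gemma", "gemma2", "gemma3" -> ModelFamily.GEMMA
        "apertus" -> ModelFamily.APERTUS
        else -> error("Unsupported GGUF architecture: $arch")
    }
```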

Critical files:

  • llm-core/.../OptimizedLLMRuntime.kt — already exists, extend
  • llm-core/.../dsl/TransformerDsl.kt — already has embedding, MHA, SwiGLU, RMSNorm
  • llm-core/.../weights/LLMWeightNameResolvers.kt — already maps DSL paths -> GGUF names
  • New: per-model DSL network definitions

Phase 3: Tokenization as Pipeline Stage

Problem: Tokenization is split between GGUFTokenizer (kllama module), QwenByteLevelBPETokenizer (llm-core), and model-specific code. The byte-level BPE fix we just made shows the fragility.

Changes:

  1. Enhance Tokenizer interface (llm-core):

    interface Tokenizer {
        fun encode(text: String): IntArray
        fun decode(token: Int): String
        fun decode(tokens: IntArray): String
        val eosTokenId: Int
        val bosTokenId: Int
        val vocabSize: Int
        val specialTokens: Set<String>
    }
  2. Unified tokenizer factory:

    • TokenizerFactory.fromGGUF(source) — auto-detects BPE/SentencePiece/WordPiece
    • TokenizerFactory.fromTokenizerJson(json) — HuggingFace format
    • Returns the correct implementation (byte-level BPE for GPT-2/Qwen, SentencePiece for LLaMA, etc.)
  3. Move GGUFTokenizer to llm-core so all runners can use it without depending on kllama
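The factory dispatch could look like the sketch below. The key "tokenizer.ggml.model" and its "gpt2"/"llama" values follow GGUF convention, but the concrete tokenizer classes here are empty placeholders for the real implementations.

```kotlin
interface Tokenizer {
    fun encode(text: String): IntArray
    fun decode(tokens: IntArray): String
}

// Placeholder implementations; the real classes carry vocab + merge tables.
class ByteLevelBPETokenizer : Tokenizer {   // GPT-2/Qwen style
    override fun encode(text: String) = intArrayOf()
    override fun decode(tokens: IntArray) = ""
}
class SentencePieceTokenizer : Tokenizer {  // LLaMA style
    override fun encode(text: String) = intArrayOf()
    override fun decode(tokens: IntArray) = ""
}

object TokenizerFactory {
    fun fromGGUF(metadata: Map<String, String>): Tokenizer =
        when (val model = metadata["tokenizer.ggml.model"]) {
            "gpt2" -> ByteLevelBPETokenizer()
            "llama" -> SentencePieceTokenizer()
            else -> error("Unknown tokenizer model: $model")
        }
}
```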

Phase 4: Unified Runner (single CLI entry point)

Problem: 6 separate CLI apps with duplicated argument parsing, model loading, and dispatch logic.

Changes:

  1. Single skainet CLI that auto-detects model architecture from GGUF metadata:

    skainet -m model.gguf "prompt"                    # auto-detect, generate
    skainet -m model.gguf --chat                      # auto-detect, chat mode
    skainet -m model.gguf --demo "What is 2+2?"       # auto-detect, tool calling
  2. Architecture registry:

    ModelRegistry.register("llama", ::llamaNetwork)
    ModelRegistry.register("qwen3", ::qwenNetwork)
    ModelRegistry.register("gemma", ::gemmaNetwork)
  3. Auto-detection from GGUF metadata (already exists in peekGgufMetadata())
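A minimal registry sketch, with Network and ModelConfig standing in for the actual DSL types:

```kotlin
class Network      // placeholder for the DSL's Network<T>
class ModelConfig  // placeholder for per-model hyperparameters

object ModelRegistry {
    private val factories = mutableMapOf<String, (ModelConfig) -> Network>()

    fun register(name: String, factory: (ModelConfig) -> Network) {
        factories[name] = factory
    }

    fun create(name: String, config: ModelConfig): Network =
        factories[name]?.invoke(config)
            ?: error("No network registered for architecture '$name'")
}

// Hypothetical network factory, registered under its GGUF architecture name.
fun llamaNetwork(config: ModelConfig): Network = Network()

fun main() {
    ModelRegistry.register("llama", ::llamaNetwork)
    val network = ModelRegistry.create("llama", ModelConfig())
}
```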

Verification

  • All existing unit tests pass (llm-agent, llm-runtime:kllama, llm-core)
  • Smoke test suite passes (generation + tool calling)
  • Basic generation produces the same output as the current runtimes for all model families
  • Tool calling works for any model that supports ChatML/Qwen/Llama3 templates
  • OptimizedLLMRuntime in HYBRID mode matches hand-coded runtime output
