Context
Currently SKaiNET-transformers has:
- 5+ hand-coded runtimes (LlamaRuntime, Qwen35Runtime, Gemma3nRuntime, ApertusRuntime, VoxtralRuntime) — each reimplements the forward pass, weight loading, and layer execution
- Tool calling tightly coupled to kllama — the AgentLoop, ToolCallingDemo, and chat modes only exist in the kllama runner. Other models (Gemma, Apertus) cannot use tool calling without duplicating code
- Two execution paths — the legacy hand-coded runtimes AND the newer OptimizedLLMRuntime with DSL/compute-graph/AOT. LlamaRuntime and ApertusRuntime are already marked deprecated
The goal: converge on one unified pipeline where model definition, weight loading, tokenization, and tool calling are cleanly separated pipeline stages.
Architecture Overview
```
GGUF/SafeTensors File
        |
WeightLoader (parse metadata + tensors)
        |
DSL Network Definition (model-specific, declarative)
        |
ComputeGraph (DAG)
        |
Optimization Pipeline (TransposeElim -> WeightDedup -> LLMFusion -> DCE)
        |
ComputeGraphExecutor (fused kernels)
        |
InferenceRuntime (unified: forward + generate)
        |
TokenizationPipeline (encode/decode, special tokens, byte-level BPE)
        |
ChatPipeline (template formatting, tool calling, agent loop)
```
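To make the stage boundaries concrete, here is a minimal sketch of how the top chat stage composes with the two stages below it. Every interface and name here is illustrative, not the actual SKaiNET-transformers API:

```kotlin
// Illustrative stage interfaces; names are placeholders, not the real API.
interface InferenceRuntime {
    fun generate(prompt: IntArray, maxTokens: Int): IntArray
}

interface Tokenizer {
    fun encode(text: String): IntArray
    fun decode(tokens: IntArray): String
}

// The chat stage only sees the two stages directly below it in the diagram:
// it never touches weights, the compute graph, or the executor.
class ChatPipeline(
    private val runtime: InferenceRuntime,
    private val tokenizer: Tokenizer,
) {
    fun reply(userText: String, maxTokens: Int = 64): String =
        tokenizer.decode(runtime.generate(tokenizer.encode(userText), maxTokens))
}
```

The point of the layering is that everything below InferenceRuntime can be swapped (hand-coded runtime vs. optimized graph executor) without the chat stage noticing.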
Phase 1: Decouple Tool Calling from kllama (immediate value)
Problem: Tool calling lives in llm-agent (good) but is wired only through kllama CLI. Other runners can't use it.
Changes:
- Move AgentLoop's generation into the InferenceRuntime interface (llm-core)
  - generateUntilStop() is currently an extension function in llm-agent — promote it, or add a generate() method with stop-token support to the interface
  - File: llm-core/.../InferenceRuntime.kt
- Create a ChatSession abstraction (llm-agent)
  - Bundles: InferenceRuntime + Tokenizer + ChatTemplate + ToolRegistry
  - Any runner creates a ChatSession to get chat/agent/demo modes for free
  - File: new llm-agent/.../ChatSession.kt
- Extract the CLI chat/agent/demo dispatch from kllama's Main.kt into a shared module
  - Currently lines 532-551 of Main.kt dispatch to ToolCallingDemo / AgentCli
  - This logic should work for any runner that provides InferenceRuntime + Tokenizer
Critical files:
- llm-core/.../InferenceRuntime.kt — extend interface
- llm-agent/.../AgentLoop.kt — already well-abstracted, keep as-is
- llm-runtime/kllama/.../Main.kt — extract dispatch logic
- llm-runtime/kllama/.../ToolCallingDemo.kt — move to llm-agent or a shared llm-apps module
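One possible shape for the ChatSession abstraction, bundling the four collaborators named above. All interfaces and method names here are assumptions for illustration, not the existing API:

```kotlin
// Hypothetical ChatSession sketch; collaborator interfaces are placeholders.
interface InferenceRuntime {
    fun generateUntilStop(prompt: IntArray, stopTokens: Set<Int>): IntArray
}

interface Tokenizer {
    fun encode(text: String): IntArray
    fun decode(tokens: IntArray): String
    val eosTokenId: Int
}

fun interface ChatTemplate {
    fun format(history: List<Pair<String, String>>): String // (role, content) pairs
}

fun interface ToolRegistry {
    fun dispatch(modelOutput: String): String? // null = no tool call detected
}

class ChatSession(
    private val runtime: InferenceRuntime,
    private val tokenizer: Tokenizer,
    private val template: ChatTemplate,
    private val tools: ToolRegistry,
) {
    private val history = mutableListOf<Pair<String, String>>()

    fun send(userText: String): String {
        history += "user" to userText
        val prompt = tokenizer.encode(template.format(history))
        val raw = tokenizer.decode(runtime.generateUntilStop(prompt, setOf(tokenizer.eosTokenId)))
        // One tool round shown for brevity; the real agent loop would iterate.
        val reply = tools.dispatch(raw) ?: raw
        history += "assistant" to reply
        return reply
    }
}
```

Any runner that can supply these four pieces gets chat, agent, and demo modes without touching kllama.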
Phase 2: Unified DSL-Based Model Definition (converge on OptimizedLLMRuntime)
Problem: Each model has a hand-coded runtime. OptimizedLLMRuntime already supports DSL -> graph -> optimized execution, but only some models use it.
Changes:
- Define DSL networks for all model families:
  - llamaNetwork(config) — LLaMA/Mistral/Qwen2/3 (standard transformer)
  - qwen35Network(config) — Qwen3.5 (hybrid DeltaNet + full attention)
  - gemmaNetwork(config) — Gemma (GELU, MatFormer FFN, sliding window)
  - apertusNetwork(config) — Apertus (xIELU, ungated MLP, QK-norm)
  - Each is a pure function returning a Network<T> from the DSL
- Unified model loading flow:
  - detectArchitecture(ggufMetadata) -> ModelFamily
  - ModelFamily.createNetwork(config) -> Network<T>
  - WeightLoader.loadAndMap(file, network) -> weights
  - OptimizedLLMRuntime(network, weights, mode=HYBRID) -> InferenceRuntime
- Remove deprecated hand-coded runtimes once DSL equivalents are validated:
  - LlamaRuntime -> llamaNetwork() + OptimizedLLMRuntime
  - ApertusRuntime -> apertusNetwork() + OptimizedLLMRuntime
Critical files:
- llm-core/.../OptimizedLLMRuntime.kt — already exists, extend
- llm-core/.../dsl/TransformerDsl.kt — already has embedding, MHA, SwiGLU, RMSNorm
- llm-core/.../weights/LLMWeightNameResolvers.kt — already maps DSL paths -> GGUF names
- New: per-model DSL network definitions
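The detection step of the unified loading flow could look like the sketch below. ModelFamily and the helper are illustrative, and the architecture strings are examples; "general.architecture" is the standard GGUF metadata key naming the model architecture:

```kotlin
// Sketch of detectArchitecture(ggufMetadata) -> ModelFamily.
// Architecture strings shown are examples, not an exhaustive mapping.
enum class ModelFamily { LLAMA, QWEN35, GEMMA, APERTUS }

fun detectArchitecture(ggufMetadata: Map<String, String>): ModelFamily =
    when (val arch = ggufMetadata["general.architecture"]) {
        "llama", "qwen2", "qwen3" -> ModelFamily.LLAMA // standard transformer family
        "qwen3.5" -> ModelFamily.QWEN35
        "gemma", "gemma2", "gemma3" -> ModelFamily.GEMMA
        "apertus" -> ModelFamily.APERTUS
        else -> error("unsupported architecture: $arch")
    }
```

From there the flow is mechanical: the family picks the network factory, the factory builds the Network<T>, and WeightLoader maps tensors onto it.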
Phase 3: Tokenization as Pipeline Stage
Problem: Tokenization is split between GGUFTokenizer (kllama module), QwenByteLevelBPETokenizer (llm-core), and model-specific code. The byte-level BPE fix we just made shows the fragility.
Changes:
- Enhance the Tokenizer interface (llm-core):

  ```kotlin
  interface Tokenizer {
      fun encode(text: String): IntArray
      fun decode(token: Int): String
      fun decode(tokens: IntArray): String
      val eosTokenId: Int
      val bosTokenId: Int
      val vocabSize: Int
      val specialTokens: Set<String>
  }
  ```

- Unified tokenizer factory:
  - TokenizerFactory.fromGGUF(source) — auto-detects BPE/SentencePiece/WordPiece
  - TokenizerFactory.fromTokenizerJson(json) — HuggingFace format
  - Returns the correct implementation (byte-level BPE for GPT-2/Qwen, SentencePiece for LLaMA, etc.)
- Move GGUFTokenizer to llm-core so all runners can use it without depending on kllama
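A sketch of what TokenizerFactory.fromGGUF could key on: in GGUF, the "tokenizer.ggml.model" metadata field distinguishes tokenizer families ("gpt2" for byte-level BPE, "llama" for SentencePiece). The enum and function here are illustrative:

```kotlin
// Hypothetical detection step inside TokenizerFactory.fromGGUF.
enum class TokenizerKind { BYTE_LEVEL_BPE, SENTENCEPIECE }

fun detectTokenizerKind(ggufMetadata: Map<String, String>): TokenizerKind =
    when (val model = ggufMetadata["tokenizer.ggml.model"]) {
        "gpt2" -> TokenizerKind.BYTE_LEVEL_BPE // GPT-2/Qwen style
        "llama" -> TokenizerKind.SENTENCEPIECE // LLaMA style
        else -> error("unsupported tokenizer model: $model")
    }
```

The factory would then construct the matching Tokenizer implementation from the vocab and merges stored in the same metadata block.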
Phase 4: Unified Runner (single CLI entry point)
Problem: 6 separate CLI apps with duplicated argument parsing, model loading, and dispatch logic.
Changes:
- Single skainet CLI that auto-detects model architecture from GGUF metadata:

  ```
  skainet -m model.gguf "prompt"               # auto-detect, generate
  skainet -m model.gguf --chat                 # auto-detect, chat mode
  skainet -m model.gguf --demo "What is 2+2?"  # auto-detect, tool calling
  ```

- Architecture registry:

  ```kotlin
  ModelRegistry.register("llama", ::llamaNetwork)
  ModelRegistry.register("qwen3", ::qwenNetwork)
  ModelRegistry.register("gemma", ::gemmaNetwork)
  ```

- Auto-detection from GGUF metadata (already exists in peekGgufMetadata())
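A minimal sketch of the registry itself; Network and the config type are placeholders for the real DSL types:

```kotlin
// Hypothetical ModelRegistry: architecture name -> network factory.
class Network(val family: String)

object ModelRegistry {
    private val factories = mutableMapOf<String, (Map<String, Any>) -> Network>()

    fun register(arch: String, factory: (Map<String, Any>) -> Network) {
        factories[arch] = factory
    }

    fun create(arch: String, config: Map<String, Any>): Network =
        factories[arch]?.invoke(config)
            ?: error("no network registered for architecture '$arch'")
}
```

The CLI then only needs peekGgufMetadata() to read the architecture string and ModelRegistry.create() to build the network; adding a new model family is a single register() call.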
Verification
- All existing unit tests pass (llm-agent, llm-runtime:kllama, llm-core)
- Smoke test suite passes (generation + tool calling)
- Basic generation produces output identical to the existing runtimes for all model families
- Tool calling works for any model that supports ChatML/Qwen/Llama3 templates
- OptimizedLLMRuntime in HYBRID mode matches hand-coded runtime output
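The last item can be checked mechanically: the DSL-based runtime must emit the same token stream as the hand-coded runtime it replaces. A sketch, with the Generator interface as a placeholder for whatever both runtimes expose:

```kotlin
// Hypothetical parity check between a reference (hand-coded) runtime and
// a candidate (DSL-based) runtime for the same prompt.
fun interface Generator {
    fun generate(prompt: IntArray, maxTokens: Int): IntArray
}

fun checkParity(reference: Generator, candidate: Generator, prompt: IntArray, maxTokens: Int) {
    val expected = reference.generate(prompt, maxTokens)
    val actual = candidate.generate(prompt, maxTokens)
    require(expected.contentEquals(actual)) {
        "outputs diverge: expected ${expected.toList()}, got ${actual.toList()}"
    }
}
```

Greedy decoding makes this comparison deterministic, so any divergence points at a real numerical or graph-construction bug rather than sampling noise.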