Merged
CHANGELOG.md (6 additions, 3 deletions)

@@ -4,12 +4,15 @@

### Added
- **Qwen / GPT-2 Byte-Level BPE Tokenizer**: `QwenByteLevelBpeTokenizer` implements the full GPT-2-style pipeline — byte-to-unicode mapping, GPT-2 pretokenization regex, merge-rank BPE, and atomic special-token splitting. Builds from either GGUF metadata (`fromGgufFields`) or a HuggingFace `tokenizer.json` (`fromTokenizerJson`). Verified against Qwen2.5-0.5B reference token IDs from HuggingFace `transformers`.
- **LLaMA / SentencePiece Tokenizer**: `SentencePieceTokenizer` implements the llama.cpp SPM pipeline — whitespace escape (`▁`), code-point symbol split, **score-priority** BPE (the SPM rule, opposite of the merge-rank rule used for GPT-2 BPE), and `<0xNN>` byte fallback for unknown characters. Builds from GGUF (`tokenizer.ggml.model == "llama"`) and HuggingFace `tokenizer.json` (`model.type == "Unigram"`). Verified against TinyLlama-1.1B reference token IDs from HuggingFace `transformers`.
- **`TokenizerFactory` with Per-Architecture Dispatch**: Tokenizer selection is now **per-architecture, not per file format**. `TokenizerFactory.fromGguf(fields)` and `.fromTokenizerJson(json)` inspect `tokenizer.ggml.model` / `model.type` and dispatch to the right implementation — Qwen/GPT-2 → byte-level BPE, LLaMA/Gemma/TinyLlama → SentencePiece — regardless of whether weights come from GGUF or SafeTensors.
- **`Tokenizer` Interface**: Common surface implemented by `TekkenTokenizer`, `QwenByteLevelBpeTokenizer`, and `SentencePieceTokenizer` (`encode`, `decode`, `vocabSize`, `bosTokenId`, `eosTokenId`).
- **GGUF Tokenizer Metadata**: `GgufModelMetadata` now exposes `tokenizerModel`, `tokenizerTokens`, `tokenizerMerges`, `tokenizerTokenTypes`, `bosTokenId`, and `eosTokenId` so callers can build a tokenizer without re-parsing the raw field map.
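Taken together, the entries above form a single call path. A minimal usage sketch follows; the exact type of the GGUF field map and the `readGgufFields` helper are assumptions for illustration, not confirmed API:

```kotlin
// Sketch only: `fields` stands in for the GGUF key/value map produced by the
// reader; `readGgufFields` is a hypothetical accessor, not a repo function.
val fields: Map<String, Any?> = readGgufFields("qwen2.5-0.5b.gguf")

// Dispatch is per-architecture via `tokenizer.ggml.model`
// (byte-level BPE for Qwen/GPT-2, SentencePiece for "llama"),
// regardless of whether the weights are GGUF or SafeTensors.
val tokenizer: Tokenizer = TokenizerFactory.fromGguf(fields)

// The common `Tokenizer` surface from the entry above.
val ids: List<Int> = tokenizer.encode("Hello, world!")
println(tokenizer.decode(ids)) // round-trips the input
println(tokenizer.vocabSize)
```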

### Fixed
- **Byte-Level BPE Broken for Qwen/GPT-2 Models**: Previously there was no GPT-2-style byte-level BPE tokenizer in the repo, and `GgufModelMetadata` ignored `tokenizer.ggml.merges` entirely — so any Qwen / GPT-2 / Mistral-Nemo model encoded text into garbage tokens (byte-level chars instead of merged vocab IDs), blocking chat mode and tool calling. The new `QwenByteLevelBpeTokenizer` + `TokenizerFactory` dispatch fix the issue for both GGUF and SafeTensors sources. (#463)
- **No SentencePiece Path for LLaMA-Family GGUF Models**: `TokenizerFactory` previously threw `UnsupportedTokenizerException` for `tokenizer.ggml.model == "llama"`, leaving LLaMA / TinyLlama / Gemma / Mistral-v0.1 GGUFs untokenizable. The new `SentencePieceTokenizer` closes that gap. (#464)
- **GGUF UInt Fields Silently Dropped**: GGUF UINT32 fields (e.g. `tokenizer.ggml.bos_token_id`) arrive from `StreamingGGUFReader` as `kotlin.UInt`, which is a value class — *not* a subclass of `kotlin.Number` — so a plain `as? Number` cast was returning null. The new `toIntFlexible` helper handles every signed and unsigned numeric type GGUF can produce, restoring the BOS/EOS/UNK ids on the tokenizer builders.
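The two merge-selection rules contrasted above (merge-rank for GPT-2-style BPE, score-priority for SentencePiece) can be sketched side by side. The function and parameter names below are invented for illustration and are not the repo's internals:

```kotlin
// Illustrative sketch of the two pair-selection rules used during BPE.

// GPT-2 / merge-rank BPE: among mergeable adjacent pairs, apply the one
// whose merge appears EARLIEST in the merges list (lowest rank).
fun pickByMergeRank(
    pairs: List<Pair<String, String>>,
    mergeRank: Map<Pair<String, String>, Int>,
): Pair<String, String>? =
    pairs.filter { it in mergeRank }.minByOrNull { mergeRank.getValue(it) }

// SentencePiece / score-priority BPE: apply the pair whose MERGED piece
// has the HIGHEST score in the vocabulary.
fun pickByScore(
    pairs: List<Pair<String, String>>,
    pieceScore: Map<String, Float>,
): Pair<String, String>? =
    pairs.filter { (a, b) -> (a + b) in pieceScore }
        .maxByOrNull { (a, b) -> pieceScore.getValue(a + b) }
```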

## [0.18.0] - 2026-04-08

skainet-io/skainet-io-core/build.gradle.kts (28 additions)

@@ -140,6 +140,34 @@ val downloadQwenTokenizerFixtures by tasks.registering {
}
}

val downloadTinyLlamaTokenizerFixtures by tasks.registering {
    group = "verification"
    description = "Download TinyLlama-1.1B GGUF + tokenizer.json for #464 tests"
    val outDir = fixturesDir
    outputs.dir(outDir)
    doLast {
        val dir = outDir.get().asFile.apply { mkdirs() }
        val files = listOf(
            "tinyllama-1.1b-chat-v1.0.Q8_0.gguf" to
                "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q8_0.gguf",
            "tinyllama-tokenizer.json" to
                "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/tokenizer.json",
        )
        for ((name, url) in files) {
            val target = dir.resolve(name)
            if (target.exists() && target.length() > 0) {
                logger.lifecycle("fixture already present: ${target.name}")
                continue
            }
            logger.lifecycle("downloading $name from $url")
            URI(url).toURL().openStream().use { input ->
                target.outputStream().use { out -> input.copyTo(out) }
            }
            logger.lifecycle(" -> ${target.length()} bytes")
        }
    }
}

tasks.withType<Test>().configureEach {
    systemProperty("skainet.test.fixturesDir", fixturesDir.get().asFile.absolutePath)
}
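On the test side, the property wired up above would be read back roughly like this; the fixture resolution shown is a sketch, assuming tests run under the Gradle configuration above:

```kotlin
import java.io.File

// Resolve the shared fixtures directory injected by the Gradle test config.
val fixturesDir = File(
    System.getProperty("skainet.test.fixturesDir")
        ?: error("run via Gradle so skainet.test.fixturesDir is set"),
)
val tokenizerJson = fixturesDir.resolve("tinyllama-tokenizer.json")
```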
@@ -0,0 +1,17 @@
package sk.ainet.io.tokenizer

/**
 * GGUF UINT32 fields come back from `StreamingGGUFReader` as `kotlin.UInt`,
 * which is a value class — not a subclass of `kotlin.Number`. A plain
 * `as? Number` cast silently returns `null` for them, which is how
 * `tokenizer.ggml.bos_token_id` etc. were getting lost. This helper
 * accepts every numeric and unsigned numeric type GGUF can produce.
 */
internal fun Any?.toIntFlexible(): Int? = when (this) {
    is Number -> toInt()
    is UByte -> toInt()
    is UShort -> toInt()
    is UInt -> toInt()
    is ULong -> toInt()
    else -> null
}
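A quick illustration of the failure mode the KDoc describes, assuming the helper behaves as written above:

```kotlin
val raw: Any? = 2u // what a GGUF UINT32 field yields: a boxed kotlin.UInt

// UInt is a value class, not a Number subclass, so the old cast
// silently loses the value:
check(raw as? Number == null)

// The flexible helper recovers it:
check(raw.toIntFlexible() == 2)
```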
@@ -213,8 +213,8 @@ public class QwenByteLevelBpeTokenizer(
            tokens = tokens,
            merges = merges,
            specialTokens = specialTokens,
-           bosTokenId = (fields["tokenizer.ggml.bos_token_id"] as? Number)?.toInt(),
-           eosTokenId = (fields["tokenizer.ggml.eos_token_id"] as? Number)?.toInt(),
+           bosTokenId = fields["tokenizer.ggml.bos_token_id"]?.toIntFlexible(),
+           eosTokenId = fields["tokenizer.ggml.eos_token_id"]?.toIntFlexible(),
        )
    }
