Merged
CHANGELOG.md (6 additions, 3 deletions)

@@ -4,12 +4,15 @@

### Added
- **Qwen / GPT-2 Byte-Level BPE Tokenizer**: `QwenByteLevelBpeTokenizer` implements the full GPT-2-style pipeline — byte-to-unicode mapping, GPT-2 pretokenization regex, merge-rank BPE, and atomic special-token splitting. Builds from either GGUF metadata (`fromGgufFields`) or a HuggingFace `tokenizer.json` (`fromTokenizerJson`). Verified against Qwen2.5-0.5B reference token IDs from HuggingFace `transformers`.
- **LLaMA / SentencePiece Tokenizer**: `SentencePieceTokenizer` implements the llama.cpp SPM pipeline — whitespace escape (`▁`), code-point symbol split, **score-priority** BPE (the SPM rule, opposite of the merge-rank rule used for GPT-2 BPE), and `<0xNN>` byte fallback for unknown characters. Builds from GGUF (`tokenizer.ggml.model == "llama"`) and HuggingFace `tokenizer.json` (`model.type == "Unigram"`). Verified against TinyLlama-1.1B reference token IDs from HuggingFace `transformers`.
- **`TokenizerFactory` with Per-Architecture Dispatch**: Tokenizer selection is now **per-architecture, not per file format**. `TokenizerFactory.fromGguf(fields)` and `.fromTokenizerJson(json)` inspect `tokenizer.ggml.model` / `model.type` and dispatch to the right implementation — Qwen/GPT-2 → byte-level BPE, LLaMA/Gemma/TinyLlama → SentencePiece — regardless of whether weights come from GGUF or SafeTensors.
- **`Tokenizer` Interface**: Common surface implemented by `TekkenTokenizer`, `QwenByteLevelBpeTokenizer`, and `SentencePieceTokenizer` (`encode`, `decode`, `vocabSize`, `bosTokenId`, `eosTokenId`).
- **GGUF Tokenizer Metadata**: `GgufModelMetadata` now exposes `tokenizerModel`, `tokenizerTokens`, `tokenizerMerges`, `tokenizerTokenTypes`, `bosTokenId`, and `eosTokenId` so callers can build a tokenizer without re-parsing the raw field map.
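Taken together, the entries above form a single call path. A minimal usage sketch follows; the exact type of the GGUF field map and the `readGgufFields` helper are assumptions for illustration, not confirmed API:

```kotlin
// Sketch only: `fields` stands in for the GGUF key/value map produced by the
// reader; `readGgufFields` is a hypothetical accessor, not a repo function.
val fields: Map<String, Any?> = readGgufFields("qwen2.5-0.5b.gguf")

// Dispatch is per-architecture via `tokenizer.ggml.model`
// (byte-level BPE for Qwen/GPT-2, SentencePiece for "llama"),
// regardless of whether the weights are GGUF or SafeTensors.
val tokenizer: Tokenizer = TokenizerFactory.fromGguf(fields)

// The common `Tokenizer` surface from the entry above.
val ids: List<Int> = tokenizer.encode("Hello, world!")
println(tokenizer.decode(ids)) // round-trips the input
println(tokenizer.vocabSize)
```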

### Fixed
- **Byte-Level BPE Broken for Qwen/GPT-2 Models**: Previously there was no GPT-2-style byte-level BPE tokenizer in the repo, and `GgufModelMetadata` ignored `tokenizer.ggml.merges` entirely — so any Qwen / GPT-2 / Mistral-Nemo model encoded text into garbage tokens (byte-level chars instead of merged vocab IDs), blocking chat mode and tool calling. The new `QwenByteLevelBpeTokenizer` + `TokenizerFactory` dispatch fix the issue for both GGUF and SafeTensors sources. (#463)
- **No SentencePiece Path for LLaMA-Family GGUF Models**: `TokenizerFactory` previously threw `UnsupportedTokenizerException` for `tokenizer.ggml.model == "llama"`, leaving LLaMA / TinyLlama / Gemma / Mistral-v0.1 GGUFs untokenizable. The new `SentencePieceTokenizer` closes that gap. (#464)
- **GGUF UInt Fields Silently Dropped**: GGUF UINT32 fields (e.g. `tokenizer.ggml.bos_token_id`) arrive from `StreamingGGUFReader` as `kotlin.UInt`, which is a value class — *not* a subclass of `kotlin.Number` — so a plain `as? Number` cast was returning null. The new `toIntFlexible` helper handles every signed and unsigned numeric type GGUF can produce, restoring the BOS/EOS/UNK ids on the tokenizer builders.
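The two merge-selection rules contrasted above (merge-rank for GPT-2-style BPE, score-priority for SentencePiece) can be sketched side by side. The function and parameter names below are invented for illustration and are not the repo's internals:

```kotlin
// Illustrative sketch of the two pair-selection rules used during BPE.

// GPT-2 / merge-rank BPE: among mergeable adjacent pairs, apply the one
// whose merge appears EARLIEST in the merges list (lowest rank).
fun pickByMergeRank(
    pairs: List<Pair<String, String>>,
    mergeRank: Map<Pair<String, String>, Int>,
): Pair<String, String>? =
    pairs.filter { it in mergeRank }.minByOrNull { mergeRank.getValue(it) }

// SentencePiece / score-priority BPE: apply the pair whose MERGED piece
// has the HIGHEST score in the vocabulary.
fun pickByScore(
    pairs: List<Pair<String, String>>,
    pieceScore: Map<String, Float>,
): Pair<String, String>? =
    pairs.filter { (a, b) -> (a + b) in pieceScore }
        .maxByOrNull { (a, b) -> pieceScore.getValue(a + b) }
```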

## [0.18.0] - 2026-04-08

skainet-io/skainet-io-core/build.gradle.kts (28 additions)

@@ -140,6 +140,34 @@ val downloadQwenTokenizerFixtures by tasks.registering {
}
}

val downloadTinyLlamaTokenizerFixtures by tasks.registering {
    group = "verification"
    description = "Download TinyLlama-1.1B GGUF + tokenizer.json for #464 tests"
    val outDir = fixturesDir
    outputs.dir(outDir)
    doLast {
        val dir = outDir.get().asFile.apply { mkdirs() }
        val files = listOf(
            "tinyllama-1.1b-chat-v1.0.Q8_0.gguf" to
                "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q8_0.gguf",
            "tinyllama-tokenizer.json" to
                "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/tokenizer.json",
        )
        for ((name, url) in files) {
            val target = dir.resolve(name)
            if (target.exists() && target.length() > 0) {
                logger.lifecycle("fixture already present: ${target.name}")
                continue
            }
            logger.lifecycle("downloading $name from $url")
            URI(url).toURL().openStream().use { input ->
                target.outputStream().use { out -> input.copyTo(out) }
            }
            logger.lifecycle(" -> ${target.length()} bytes")
        }
    }
}

tasks.withType<Test>().configureEach {
    systemProperty("skainet.test.fixturesDir", fixturesDir.get().asFile.absolutePath)
}
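On the test side, the property wired up above would be read back roughly like this; the fixture resolution shown is a sketch, assuming tests run under the Gradle configuration above:

```kotlin
import java.io.File

// Resolve the shared fixtures directory injected by the Gradle test config.
val fixturesDir = File(
    System.getProperty("skainet.test.fixturesDir")
        ?: error("run via Gradle so skainet.test.fixturesDir is set"),
)
val tokenizerJson = fixturesDir.resolve("tinyllama-tokenizer.json")
```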
@@ -0,0 +1,17 @@
package sk.ainet.io.tokenizer

/**
 * GGUF UINT32 fields come back from `StreamingGGUFReader` as `kotlin.UInt`,
 * which is a value class — not a subclass of `kotlin.Number`. A plain
 * `as? Number` cast silently returns `null` for them, which is how
 * `tokenizer.ggml.bos_token_id` etc. were getting lost. This helper
 * accepts every numeric and unsigned numeric type GGUF can produce.
 */
internal fun Any?.toIntFlexible(): Int? = when (this) {
    is Number -> toInt()
    is UByte -> toInt()
    is UShort -> toInt()
    is UInt -> toInt()
    is ULong -> toInt()
    else -> null
}
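A quick illustration of the failure mode the KDoc describes, assuming the helper behaves as written above:

```kotlin
val raw: Any? = 2u // what a GGUF UINT32 field yields: a boxed kotlin.UInt

// UInt is a value class, not a Number subclass, so the old cast
// silently loses the value:
check(raw as? Number == null)

// The flexible helper recovers it:
check(raw.toIntFlexible() == 2)
```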
@@ -213,8 +213,8 @@ public class QwenByteLevelBpeTokenizer(
            tokens = tokens,
            merges = merges,
            specialTokens = specialTokens,
-           bosTokenId = (fields["tokenizer.ggml.bos_token_id"] as? Number)?.toInt(),
-           eosTokenId = (fields["tokenizer.ggml.eos_token_id"] as? Number)?.toInt(),
+           bosTokenId = fields["tokenizer.ggml.bos_token_id"]?.toIntFlexible(),
+           eosTokenId = fields["tokenizer.ggml.eos_token_id"]?.toIntFlexible(),
        )
    }
