Summary
GGUFTokenizer.encodeBPE() (in llm-core/.../tokenizer/GGUFTokenizer.kt) does not implement byte-level BPE correctly for GPT-2/Qwen-family tokenizers.
When used to encode text containing chat-template special tokens (e.g., <|im_start|>) or arbitrary Unicode, it produces broken token sequences that cause the model to generate nonsense output (CJK characters, URL-encoded fragments, HTML entities).
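The mojibake in the output is characteristic of byte-level BPE: GPT-2-family tokenizers remap every raw byte to a printable Unicode codepoint before merging, so a space becomes `Ġ` (U+0120) and a newline becomes `Ċ` (U+010A). A minimal sketch of that byte-to-unicode table (the function name here is illustrative, not the project's API):

```kotlin
// Sketch of GPT-2's byte-to-unicode mapping, which byte-level BPE
// tokenizers (GPT-2, Qwen) require before any merges are applied.
fun bytesToUnicode(): Map<Int, Char> {
    // Printable bytes that map to themselves: '!'..'~', '¡'..'¬', '®'..'ÿ'.
    val bs = (('!'.code..'~'.code) + ('¡'.code..'¬'.code) + ('®'.code..'ÿ'.code)).toMutableList()
    val cs = bs.toMutableList()
    var n = 0
    for (b in 0..255) {
        if (b !in bs) {
            // Remap unprintable bytes into the 256+ range to keep them visible.
            bs.add(b)
            cs.add(256 + n)
            n++
        }
    }
    return bs.zip(cs.map { it.toChar() }).toMap()
}

fun main() {
    val map = bytesToUnicode()
    println(map[' '.code])   // Ġ — why spaces show up as Ġ in vocab entries
    println(map['\n'.code])  // Ċ — why newlines show up as Ċ
}
```

If `encodeBPE()` skips this mapping (or its inverse on decode), the merge table never matches and the model emits exactly the kind of `Ġ`/`Ċ`/CJK salad seen in the Symptoms section below.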
This affects both file formats, not just GGUF:
GGUFTokenizer.fromRandomAccessSource(gguf) — broken for Qwen GGUF models
GGUFTokenizer.fromTokenizerJson(json) — broken for Qwen SafeTensors models (same code path)
The bug hasn't surfaced on SafeTensors simply because no Qwen SafeTensors model has been tested in this project yet. All SafeTensors testing so far used LLaMA/Gemma (SentencePiece), which goes through a different, working code path.
Tokenizer selection should be per-architecture (per tokenizer type), not per file format.
A Qwen model needs byte-level BPE whether its weights come from .gguf or .safetensors. A LLaMA model needs SentencePiece regardless of format.
This blocks tool calling and chat mode for Qwen2, Qwen2.5, Qwen3, Mistral-Nemo, and any other model that uses GPT-2-style byte-level BPE.
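As a sketch of what per-tokenizer-type selection could look like: GGUF carries the tokenizer family in the `tokenizer.ggml.model` metadata key (`"gpt2"` vs `"llama"`), and tokenizer.json carries it in `model.type` (`"BPE"` vs `"Unigram"`). The enum and function names below are hypothetical, not existing project API:

```kotlin
// Hypothetical sketch: choose the tokenizer from metadata, never from
// the weight file's extension.
enum class TokenizerKind { BYTE_LEVEL_BPE, SENTENCEPIECE }

fun tokenizerKindFor(tokenizerModel: String): TokenizerKind = when (tokenizerModel) {
    // GGUF "tokenizer.ggml.model" = "gpt2", tokenizer.json model.type = "BPE"
    "gpt2", "BPE" -> TokenizerKind.BYTE_LEVEL_BPE      // Qwen2/2.5/3, Mistral-Nemo
    // GGUF "tokenizer.ggml.model" = "llama", tokenizer.json model.type = "Unigram"
    "llama", "Unigram" -> TokenizerKind.SENTENCEPIECE  // LLaMA, Gemma
    else -> error("unknown tokenizer model: $tokenizerModel")
}
```

With this shape, both `fromRandomAccessSource` and `fromTokenizerJson` would dispatch to the same byte-level BPE implementation for a Qwen model regardless of whether the weights came from .gguf or .safetensors.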
Symptoms
Qwen2.5-0.5B-Instruct tool-calling demo
./gradlew :llm-apps:kllama-cli:run --args="-m Qwen2.5-0.5B-Instruct-Q8_0.gguf --demo 'What is 2 + 2?'"
Model loads successfully (tied embeddings, Q8_0 SIMD, qwen chat template auto-detected). The agent loop then produces:
Assistant: footingök JSONExceptionzm.bzéħ¬Ùĥتreckæľ¬ç½ijaira ?>>
<?ä¸ĢæĹ¥/ĊĊĊĊannppers ... (continues with random CJK, HTML, URL fragments)
Expected: either a plain answer (2 + 2 = 4) or a <tool_call>{"name":"calculator",...}</tool_call> XML block.
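Part of the fix is that special tokens such as `<|im_start|>` must be split out of the input before byte-level BPE runs; otherwise they get byte-encoded like ordinary text and the chat template is destroyed. A minimal pre-splitting sketch (the function name is illustrative, not the project's API):

```kotlin
// Hypothetical sketch: isolate special tokens as whole segments so the
// BPE stage only ever sees the plain-text spans between them.
fun splitOnSpecials(text: String, specials: Set<String>): List<String> {
    if (specials.isEmpty()) return listOf(text)
    val re = Regex(specials.joinToString("|") { Regex.escape(it) })
    val out = mutableListOf<String>()
    var last = 0
    for (m in re.findAll(text)) {
        if (m.range.first > last) out.add(text.substring(last, m.range.first))
        out.add(m.value)           // emit the special token as its own segment
        last = m.range.last + 1
    }
    if (last < text.length) out.add(text.substring(last))
    return out
}

fun main() {
    val parts = splitOnSpecials(
        "<|im_start|>user\nhi<|im_end|>",
        setOf("<|im_start|>", "<|im_end|>")
    )
    println(parts)  // [<|im_start|>, user\nhi, <|im_end|>]
}
```

Each special segment is then looked up directly in the vocab as a single token id, and only the in-between segments go through the byte-to-unicode mapping and merge loop.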