Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors) #52

@michalharakal

Summary

GGUFTokenizer.encodeBPE() (in llm-core/.../tokenizer/GGUFTokenizer.kt) does not implement byte-level BPE correctly for GPT-2/Qwen-family tokenizers.
When used to encode text containing chat-template special tokens (e.g., <|im_start|>) or arbitrary Unicode, it produces broken token sequences that cause the model to generate nonsense output (CJK characters, URL-encoded fragments, HTML entities).
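
For reference, here is a minimal sketch of the two steps a GPT-2-style encoder usually performs before any BPE merges: cutting special tokens out of the text so they map directly to their ids, and mapping the UTF-8 bytes of the remaining text to printable byte-level symbols. The function names are hypothetical and are not taken from GGUFTokenizer.kt. The regex pre-tokenization and the merge loop are omitted.

```kotlin
// Sketch only: all names below are hypothetical, not the project's API.

/** GPT-2 byte-to-unicode table: every byte 0..255 gets a printable code point. */
fun buildByteToUnicode(): Map<Int, Char> {
    // Bytes that are already printable map to themselves.
    val printable = (33..126) + (161..172) + (174..255)
    val table = LinkedHashMap<Int, Char>()
    for (b in printable) table[b] = Char(b)
    // Remaining bytes (controls, space, 0x7F..0xA0, 0xAD) are shifted to U+0100 and up.
    var next = 0
    for (b in 0..255) {
        if (b !in table) {
            table[b] = Char(256 + next)
            next++
        }
    }
    return table
}

/** Maps text to byte-level symbols: UTF-8 encode, then look up each byte. */
fun toByteLevel(text: String, table: Map<Int, Char>): String =
    text.encodeToByteArray().joinToString("") { b -> table.getValue(b.toInt() and 0xFF).toString() }

/** Splits text into (segment, isSpecial) pairs so special tokens bypass BPE. */
fun splitOnSpecialTokens(text: String, specials: Set<String>): List<Pair<String, Boolean>> {
    val out = mutableListOf<Pair<String, Boolean>>()
    var pos = 0
    while (pos < text.length) {
        // Earliest special token at or after pos; the longest one wins on ties.
        val hit = specials
            .filter { it.isNotEmpty() }
            .map { it to text.indexOf(it, pos) }
            .filter { it.second >= 0 }
            .minWithOrNull(compareBy<Pair<String, Int>> { it.second }.thenByDescending { it.first.length })
        if (hit == null) {
            out += text.substring(pos) to false
            break
        }
        if (hit.second > pos) out += text.substring(pos, hit.second) to false
        out += hit.first to true
        pos = hit.second + hit.first.length
    }
    return out
}

fun main() {
    val table = buildByteToUnicode()
    val prompt = "<|im_start|>user\nHi<|im_end|>"
    for ((segment, isSpecial) in splitOnSpecialTokens(prompt, setOf("<|im_start|>", "<|im_end|>"))) {
        if (isSpecial) println("special: $segment")
        else println("bpe input: ${toByteLevel(segment, table)}")
    }
}
```

In this scheme the newline becomes Ċ and a space becomes Ġ, so the vocabulary lookup and merges operate on those symbols, not on the raw characters.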

This affects both file formats, not just GGUF:

  • GGUFTokenizer.fromRandomAccessSource(gguf) — broken for Qwen GGUF models
  • GGUFTokenizer.fromTokenizerJson(json) — broken for Qwen SafeTensors models (same code path)

The bug hasn't surfaced on SafeTensors simply because no Qwen SafeTensors model has been tested in this project yet. All SafeTensors testing so far used LLaMA/Gemma (SentencePiece), which goes through a different, working code path.

Tokenizer selection should be per-architecture (per tokenizer type), not per file format.
A Qwen model needs byte-level BPE whether its weights come from .gguf or .safetensors. A LLaMA model needs SentencePiece regardless of format.
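
One way to express that rule is sketched below. It assumes the loader can obtain a tokenizer model identifier, such as the GGUF metadata value `tokenizer.ggml.model` ("gpt2" or "llama"), or an equivalent value derived from tokenizer.json on the SafeTensors side. `TokenizerKind` and `selectTokenizerKind` are hypothetical names, not existing project code.

```kotlin
// Sketch only: TokenizerKind and selectTokenizerKind are hypothetical names.
enum class TokenizerKind { BYTE_LEVEL_BPE, SENTENCEPIECE }

/**
 * Picks the tokenizer algorithm from tokenizer metadata, never from the file extension.
 * `modelId` is assumed to be a value such as GGUF's `tokenizer.ggml.model`.
 */
fun selectTokenizerKind(modelId: String): TokenizerKind = when (modelId.lowercase()) {
    "gpt2" -> TokenizerKind.BYTE_LEVEL_BPE   // GPT-2-style byte-level BPE (e.g. Qwen)
    "llama" -> TokenizerKind.SENTENCEPIECE   // SentencePiece-style (e.g. LLaMA, Gemma)
    else -> throw IllegalArgumentException("Unsupported tokenizer model: $modelId")
}

fun main() {
    // The same identifier yields the same tokenizer for .gguf and .safetensors weights.
    println(selectTokenizerKind("gpt2"))
    println(selectTokenizerKind("llama"))
}
```

Failing loudly on an unknown identifier is a deliberate choice in this sketch: silently falling back to the wrong algorithm is what produces garbage output of the kind shown under Symptoms.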

This blocks tool calling and chat mode for Qwen2, Qwen2.5, Qwen3, Mistral-Nemo, and any other model that uses GPT-2-style byte-level BPE.

Symptoms

Qwen2.5-0.5B-Instruct tool-calling demo

./gradlew :llm-apps:kllama-cli:run --args="-m Qwen2.5-0.5B-Instruct-Q8_0.gguf --demo 'What is 2 + 2?'"

Model loads successfully (tied embeddings, Q8_0 SIMD, qwen chat template auto-detected). The agent loop then produces:

Assistant: footingök JSONExceptionzm.bzéħ¬Ùĥتreckæľ¬ç½ijaira ?>>
<?ä¸ĢæĹ¥/ĊĊĊĊannppers ... (continues with random CJK, HTML, URL fragments)

Expected: either a plain answer (2 + 2 = 4) or a <tool_call>{"name":"calculator",...}</tool_call> XML block.
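
A note on reading the output above: fragments such as Ċ and æĹ¥ match the printable symbols of the GPT-2 byte-to-unicode table, not ordinary text. The sketch below, using hypothetical helper names, reverses that table to show which bytes those symbols stand for. It only illustrates the mapping and makes no claim about where in the pipeline the symbols leak through.

```kotlin
// Sketch only: helper names are hypothetical, not the project's API.

/** Reverse of the GPT-2 byte-to-unicode table: printable symbol -> byte value. */
fun buildUnicodeToByte(): Map<Char, Int> {
    val printable = ((33..126) + (161..172) + (174..255)).toSet()
    val table = HashMap<Char, Int>()
    var next = 0
    for (b in 0..255) {
        if (b in printable) {
            table[Char(b)] = b              // printable bytes map to themselves
        } else {
            table[Char(256 + next)] = b     // the rest were shifted to U+0100 and up
            next++
        }
    }
    return table
}

/** Turns byte-level symbols back into bytes and decodes them as UTF-8. */
fun fromByteLevel(symbols: String, table: Map<Char, Int>): String {
    val bytes = ByteArray(symbols.length) { i -> table.getValue(symbols[i]).toByte() }
    return bytes.decodeToString()
}

fun main() {
    val table = buildUnicodeToByte()
    println(fromByteLevel("ä¸ĢæĹ¥", table))          // bytes E4 B8 80 E6 97 A5
    println(fromByteLevel("Ċ", table) == "\n")        // Ċ stands for byte 0x0A
}
```

Under this mapping, ä¸ĢæĹ¥ corresponds to the UTF-8 bytes of 一日 and each Ċ corresponds to a newline.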
