Merged
Introduces a common Tokenizer surface so tokenizer selection can be per-architecture rather than per file format. This is the first step toward fixing byte-level BPE for GPT-2/Qwen models (#463), where the same tokenizer must be dispatched whether weights come from GGUF or SafeTensors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
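A minimal sketch of what such a shared surface could look like. The actual interface in the PR may differ; the `encode`/`decode` names and the `IdentityTokenizer` stand-in are assumptions made here purely for illustration.

```kotlin
// Hypothetical shape of the common tokenizer surface; method names are assumptions.
interface Tokenizer {
    fun encode(text: String): List<Int>
    fun decode(ids: List<Int>): String
}

// A trivial stand-in implementation showing the contract in use:
// each character round-trips through its code point.
class IdentityTokenizer : Tokenizer {
    override fun encode(text: String): List<Int> = text.map { it.code }
    override fun decode(ids: List<Int>): String = ids.map { it.toChar() }.joinToString("")
}
```

With a surface like this, the factory introduced later in the PR can hand back any implementation (byte-level BPE, or eventually SentencePiece) behind the same type.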
Introduces the canonical Karpathy/HF byte-to-unicode mapping that byte-level BPE tokenizers use to represent arbitrary bytes (including newline, tab, space) as printable code points before BPE merging. Covered by a commonTest round-trip over all 256 bytes plus the canonical 0x0A -> U+010A and 0x20 -> U+0120 spot-checks that lock the table to the GPT-2 reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
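The table itself is small: printable Latin-1 bytes map to themselves, and the remaining bytes are assigned fresh code points 256, 257, … in byte order, which is exactly what produces the 0x0A -> U+010A and 0x20 -> U+0120 spot-check values. A sketch of the construction (the function name is illustrative, not the PR's):

```kotlin
// Canonical GPT-2 byte-to-unicode table (after the Karpathy/HF bytes_to_unicode
// reference). Bytes in the printable Latin-1 ranges keep their own code point;
// every other byte is shifted into the 256+ range so the result is printable.
fun byteToUnicode(): Map<Int, Char> {
    val printable = (('!'.code..'~'.code) +
        ('\u00A1'.code..'\u00AC'.code) +
        ('\u00AE'.code..'\u00FF'.code)).toSet()
    val table = LinkedHashMap<Int, Char>()
    var next = 256
    for (b in 0..255) {
        table[b] = if (b in printable) b.toChar() else (next++).toChar()
    }
    return table
}
```

Because the mapping is a bijection over all 256 byte values, the decode path can invert it exactly, which is what the round-trip commonTest locks in.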
Implements GPT-2-style byte-level BPE with the full seven-step pipeline: special-token splitting, GPT-2 pretokenization regex, byte-to-unicode mapping, merge-rank BPE (lowest rank wins, not vocab score), vocab lookup, and the reverse for decode. Includes a synthetic commonTest that locks in the algorithm with a hand-crafted vocab + merges — no model fixture required. End-to-end assertions against real Qwen2.5 IDs will follow in step 6.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
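The distinguishing detail here is the merge loop: GPT-2-style BPE repeatedly merges the adjacent pair with the *lowest* merge rank (its position in the merges list), not the pair with the best vocab score as in SentencePiece-style BPE. A self-contained sketch of that loop, with illustrative names (the symbols are assumed to already be byte-to-unicode-mapped pretoken pieces):

```kotlin
// Merge-rank BPE: each iteration finds the adjacent symbol pair with the
// lowest rank in the merges table and fuses every occurrence of it, until
// no remaining adjacent pair appears in the table.
fun applyBpe(symbols: List<String>, ranks: Map<Pair<String, String>, Int>): List<String> {
    var parts = symbols
    while (parts.size > 1) {
        val best = parts.zipWithNext()
            .filter { it in ranks }
            .minByOrNull { ranks.getValue(it) } ?: break  // no mergeable pair left
        val merged = ArrayList<String>(parts.size)
        var i = 0
        while (i < parts.size) {
            if (i + 1 < parts.size && (parts[i] to parts[i + 1]) == best) {
                merged += parts[i] + parts[i + 1]
                i += 2
            } else {
                merged += parts[i]
                i += 1
            }
        }
        parts = merged
    }
    return parts
}
```

A hand-crafted merges table like `{("l","o") -> 0, ("lo","w") -> 1}` is enough to pin the behavior down, which is why the synthetic commonTest needs no model fixture.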
Adds typed accessors for tokenizer.ggml.{model, tokens, merges,
token_type, bos_token_id, eos_token_id} and reworks vocabSize to
derive from the tokens list (the old code read
"tokenizer.ggml.tokens" as an Int, which was dead code — that
field is a string array). Includes a commonTest that parses a
stub map so the contract is locked in before the factory wiring
lands in the next step.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
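A hypothetical sketch of the accessor shape described above; the real class in the PR may differ, but the key names follow the `tokenizer.ggml.*` GGUF metadata convention, and the important fix is visible: `vocabSize` counts the tokens array instead of misreading `tokenizer.ggml.tokens` as a scalar Int.

```kotlin
// Illustrative typed accessors over a raw GGUF field map (names are assumptions).
class GgufTokenizerMetadata(private val fields: Map<String, Any?>) {
    val model: String
        get() = fields["tokenizer.ggml.model"] as String

    @Suppress("UNCHECKED_CAST")
    val tokens: List<String>
        get() = fields["tokenizer.ggml.tokens"] as List<String>

    val bosTokenId: Int
        get() = (fields["tokenizer.ggml.bos_token_id"] as Number).toInt()

    // Derived from the tokens array: "tokenizer.ggml.tokens" is a string
    // array in GGUF, so reading it as an Int could never succeed.
    val vocabSize: Int
        get() = tokens.size
}
```

A stub map is all a commonTest needs to lock this contract in, which is why the factory wiring can land separately.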
Introduces TokenizerFactory.fromGguf(fields) and .fromTokenizerJson(json) that route by tokenizer type — gpt2/BPE goes to QwenByteLevelBpeTokenizer regardless of file format. LLaMA/SentencePiece and WordPiece branches throw UnsupportedTokenizerException and are tracked in #464.

QwenByteLevelBpeTokenizer gains two builders:
- fromGgufFields: reads tokens/merges/token_type from the raw GGUF field map and treats token_type == 3 entries as atomic specials.
- fromTokenizerJson: parses a HuggingFace tokenizer.json with kotlinx.serialization, inverting the vocab map and pulling added_tokens as specials.

Covered by TokenizerFactoryDispatchTest: gpt2 -> byte-level BPE (verified by encoding a "Hello<|end|>" round trip), llama/Unigram throw, and a synthetic tokenizer.json produces the expected merged token id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
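The dispatch itself reduces to a `when` over the tokenizer type string. A sketch of the routing logic under stated assumptions — the real factory returns a tokenizer instance, while this illustration returns a label, and the exact branch strings are guesses:

```kotlin
// Routing by tokenizer type, not by file format: GGUF's "gpt2" model string
// and tokenizer.json's BPE model both land on the byte-level BPE path, while
// unimplemented families fail loudly instead of mis-tokenizing silently.
class UnsupportedTokenizerException(message: String) : Exception(message)

fun routeTokenizer(model: String): String = when (model.lowercase()) {
    "gpt2", "bpe" -> "byte-level BPE"  // GPT-2/Qwen path
    "llama" -> throw UnsupportedTokenizerException("SentencePiece: tracked in #464")
    "bert" -> throw UnsupportedTokenizerException("WordPiece: tracked in #464")
    else -> throw UnsupportedTokenizerException("unknown tokenizer model: $model")
}
```

Throwing for the unimplemented families is the safer design: a wrong-but-plausible token stream from a fallback tokenizer is much harder to debug than an immediate exception with an issue number attached.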
Adds a fixture-gated jvmTest that validates the full GGUF path against the real Qwen2.5-0.5B-Instruct model: all seven reference assertions from #463 pass ("Hello" -> [9707], "<|im_start|>" -> [151644], "The capital of France is" -> [785, 6722, 315, 9625, 374], newline -> [198], chat-template roundtrip, and GGUF vs tokenizer.json produce identical ids).

Fixtures are downloaded on demand into build/test-fixtures/ via a new downloadQwenTokenizerFixtures Gradle task; tests print a skip notice and pass when the files are absent, so offline/CI builds stay green without network access. The synthetic commonTest coverage added in earlier steps still exercises the algorithm unconditionally.

Adds a jvmTest dependency on :skainet-io:skainet-io-gguf so the test can load real GGUF files via StreamingGGUFReader.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
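The fixture-gating pattern can be sketched as a small helper: run the test body only when the downloaded file exists, otherwise print the skip notice and return so the test still passes offline. The helper name and path below are assumptions for illustration, not the PR's actual code:

```kotlin
import java.io.File

// Illustrative fixture gate: the body runs only when the fixture is present;
// otherwise the test prints a notice and passes, keeping offline/CI green.
fun withFixture(path: String, body: (File) -> Unit) {
    val fixture = File(path)
    if (!fixture.exists()) {
        println("SKIP: $path missing; run the downloadQwenTokenizerFixtures task")
        return
    }
    body(fixture)
}
```

This keeps the unconditional synthetic commonTest as the always-on safety net, with the real-model assertions layered on top whenever the fixtures have been fetched.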