Merged
Introduces a common Tokenizer surface so tokenizer selection can be per-architecture rather than per file format. This is the first step toward fixing byte-level BPE for GPT-2/Qwen models (#463), where the same tokenizer must be dispatched whether weights come from GGUF or SafeTensors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
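A minimal sketch of what such a shared surface could look like. The actual interface in the PR may differ; the `encode`/`decode` names and the `IdentityTokenizer` stand-in are assumptions made here purely for illustration.

```kotlin
// Hypothetical shape of the common tokenizer surface; method names are assumptions.
interface Tokenizer {
    fun encode(text: String): List<Int>
    fun decode(ids: List<Int>): String
}

// A trivial stand-in implementation showing the contract in use:
// each character round-trips through its code point.
class IdentityTokenizer : Tokenizer {
    override fun encode(text: String): List<Int> = text.map { it.code }
    override fun decode(ids: List<Int>): String = ids.map { it.toChar() }.joinToString("")
}
```

With a surface like this, the factory introduced later in the PR can hand back any implementation (byte-level BPE, or eventually SentencePiece) behind the same type.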
Introduces the canonical Karpathy/HF byte-to-unicode mapping that byte-level BPE tokenizers use to represent arbitrary bytes (including newline, tab, space) as printable code points before BPE merging. Covered by a commonTest round-trip over all 256 bytes plus the canonical 0x0A -> U+010A and 0x20 -> U+0120 spot-checks that lock the table to the GPT-2 reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
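The table itself is small: printable Latin-1 bytes map to themselves, and the remaining bytes are assigned fresh code points 256, 257, … in byte order, which is exactly what produces the 0x0A -> U+010A and 0x20 -> U+0120 spot-check values. A sketch of the construction (the function name is illustrative, not the PR's):

```kotlin
// Canonical GPT-2 byte-to-unicode table (after the Karpathy/HF bytes_to_unicode
// reference). Bytes in the printable Latin-1 ranges keep their own code point;
// every other byte is shifted into the 256+ range so the result is printable.
fun byteToUnicode(): Map<Int, Char> {
    val printable = (('!'.code..'~'.code) +
        ('\u00A1'.code..'\u00AC'.code) +
        ('\u00AE'.code..'\u00FF'.code)).toSet()
    val table = LinkedHashMap<Int, Char>()
    var next = 256
    for (b in 0..255) {
        table[b] = if (b in printable) b.toChar() else (next++).toChar()
    }
    return table
}
```

Because the mapping is a bijection over all 256 byte values, the decode path can invert it exactly, which is what the round-trip commonTest locks in.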
Implements GPT-2-style byte-level BPE with the full seven-step pipeline: special-token splitting, GPT-2 pretokenization regex, byte-to-unicode mapping, merge-rank BPE (lowest rank wins, not vocab score), vocab lookup, and the reverse for decode. Includes a synthetic commonTest that locks in the algorithm with a hand-crafted vocab + merges — no model fixture required. End-to-end assertions against real Qwen2.5 IDs will follow in step 6.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
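The distinguishing detail here is the merge loop: GPT-2-style BPE repeatedly merges the adjacent pair with the *lowest* merge rank (its position in the merges list), not the pair with the best vocab score as in SentencePiece-style BPE. A self-contained sketch of that loop, with illustrative names (the symbols are assumed to already be byte-to-unicode-mapped pretoken pieces):

```kotlin
// Merge-rank BPE: each iteration finds the adjacent symbol pair with the
// lowest rank in the merges table and fuses every occurrence of it, until
// no remaining adjacent pair appears in the table.
fun applyBpe(symbols: List<String>, ranks: Map<Pair<String, String>, Int>): List<String> {
    var parts = symbols
    while (parts.size > 1) {
        val best = parts.zipWithNext()
            .filter { it in ranks }
            .minByOrNull { ranks.getValue(it) } ?: break  // no mergeable pair left
        val merged = ArrayList<String>(parts.size)
        var i = 0
        while (i < parts.size) {
            if (i + 1 < parts.size && (parts[i] to parts[i + 1]) == best) {
                merged += parts[i] + parts[i + 1]
                i += 2
            } else {
                merged += parts[i]
                i += 1
            }
        }
        parts = merged
    }
    return parts
}
```

A hand-crafted merges table like `{("l","o") -> 0, ("lo","w") -> 1}` is enough to pin the behavior down, which is why the synthetic commonTest needs no model fixture.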
Adds typed accessors for tokenizer.ggml.{model, tokens, merges,
token_type, bos_token_id, eos_token_id} and reworks vocabSize to
derive from the tokens list (the old code read
"tokenizer.ggml.tokens" as an Int, which was dead code — that
field is a string array). Includes a commonTest that parses a
stub map so the contract is locked in before the factory wiring
lands in the next step.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
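A hypothetical sketch of the accessor shape described above; the real class in the PR may differ, but the key names follow the `tokenizer.ggml.*` GGUF metadata convention, and the important fix is visible: `vocabSize` counts the tokens array instead of misreading `tokenizer.ggml.tokens` as a scalar Int.

```kotlin
// Illustrative typed accessors over a raw GGUF field map (names are assumptions).
class GgufTokenizerMetadata(private val fields: Map<String, Any?>) {
    val model: String
        get() = fields["tokenizer.ggml.model"] as String

    @Suppress("UNCHECKED_CAST")
    val tokens: List<String>
        get() = fields["tokenizer.ggml.tokens"] as List<String>

    val bosTokenId: Int
        get() = (fields["tokenizer.ggml.bos_token_id"] as Number).toInt()

    // Derived from the tokens array: "tokenizer.ggml.tokens" is a string
    // array in GGUF, so reading it as an Int could never succeed.
    val vocabSize: Int
        get() = tokens.size
}
```

A stub map is all a commonTest needs to lock this contract in, which is why the factory wiring can land separately.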
Introduces TokenizerFactory.fromGguf(fields) and .fromTokenizerJson(json) that route by tokenizer type — gpt2/BPE goes to QwenByteLevelBpeTokenizer regardless of file format. LLaMA/SentencePiece and WordPiece branches throw UnsupportedTokenizerException and are tracked in #464.

QwenByteLevelBpeTokenizer gains two builders:
- fromGgufFields: reads tokens/merges/token_type from the raw GGUF field map and treats token_type == 3 entries as atomic specials.
- fromTokenizerJson: parses a HuggingFace tokenizer.json with kotlinx.serialization, inverting the vocab map and pulling added_tokens as specials.

Covered by TokenizerFactoryDispatchTest: gpt2 -> byte-level BPE (verified by encoding a "Hello<|end|>" round trip), llama/Unigram throw, and a synthetic tokenizer.json produces the expected merged token id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
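The dispatch itself reduces to a `when` over the tokenizer type string. A sketch of the routing logic under stated assumptions — the real factory returns a tokenizer instance, while this illustration returns a label, and the exact branch strings are guesses:

```kotlin
// Routing by tokenizer type, not by file format: GGUF's "gpt2" model string
// and tokenizer.json's BPE model both land on the byte-level BPE path, while
// unimplemented families fail loudly instead of mis-tokenizing silently.
class UnsupportedTokenizerException(message: String) : Exception(message)

fun routeTokenizer(model: String): String = when (model.lowercase()) {
    "gpt2", "bpe" -> "byte-level BPE"  // GPT-2/Qwen path
    "llama" -> throw UnsupportedTokenizerException("SentencePiece: tracked in #464")
    "bert" -> throw UnsupportedTokenizerException("WordPiece: tracked in #464")
    else -> throw UnsupportedTokenizerException("unknown tokenizer model: $model")
}
```

Throwing for the unimplemented families is the safer design: a wrong-but-plausible token stream from a fallback tokenizer is much harder to debug than an immediate exception with an issue number attached.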
Adds a fixture-gated jvmTest that validates the full GGUF path against the real Qwen2.5-0.5B-Instruct model: all seven reference assertions from #463 pass ("Hello" -> [9707], "<|im_start|>" -> [151644], "The capital of France is" -> [785, 6722, 315, 9625, 374], newline -> [198], chat-template roundtrip, and GGUF vs tokenizer.json produce identical ids).

Fixtures are downloaded on demand into build/test-fixtures/ via a new downloadQwenTokenizerFixtures Gradle task; tests print a skip notice and pass when the files are absent, so offline/CI builds stay green without network access. The synthetic commonTest coverage added in earlier steps still exercises the algorithm unconditionally.

Adds a jvmTest dependency on :skainet-io:skainet-io-gguf so the test can load real GGUF files via StreamingGGUFReader.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
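The fixture-gating pattern can be sketched as a small helper: run the test body only when the downloaded file exists, otherwise print the skip notice and return so the test still passes offline. The helper name and path below are assumptions for illustration, not the PR's actual code:

```kotlin
import java.io.File

// Illustrative fixture gate: the body runs only when the fixture is present;
// otherwise the test prints a notice and passes, keeping offline/CI green.
fun withFixture(path: String, body: (File) -> Unit) {
    val fixture = File(path)
    if (!fixture.exists()) {
        println("SKIP: $path missing; run the downloadQwenTokenizerFixtures task")
        return
    }
    body(fixture)
}
```

This keeps the unconditional synthetic commonTest as the always-on safety net, with the real-model assertions layered on top whenever the fixtures have been fetched.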