Feature/complete sentencepiece tokenizer #65
Arpitsh7 wants to merge 8 commits into AOSSIE-Org:main
Conversation
⚠️ Warning: Rate limit exceeded
⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered with the `@coderabbitai review` command. We recommend that you space out your commits to avoid hitting the rate limit.
🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (1)
Walkthrough: Adds full SentencePieceTokenizer functionality (train, load, encode, decode, validation, artifact paths) and new tests for tokenizer behavior and determinism.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py`:
- Around line 25-27: The constructor stores min_frequency but it is never
applied during training in the SentencePiece training path; update the training
routine (the method that calls SentencePieceTrainer.Train — referenced as the
training method that sets self._model) to use the stored value
(self._min_frequency) by passing it into the SentencePieceTrainer.Train call (or
equivalent training config) so the vocab pruning respects min_frequency; if the
underlying SentencePiece API does not accept a min_frequency parameter, make the
change fail-fast by raising a clear exception or logging an error when
min_frequency is non-default so callers are not silently misconfigured.
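The fail-fast behavior the comment asks for can be sketched as follows. This is a minimal, hypothetical stand-in class: only the `min_frequency` attribute and the intent to guard `SentencePieceTrainer.Train` come from the review; the class body and error message are illustrative.

```python
class SentencePieceTokenizerSketch:
    """Illustrative stand-in for the tokenizer under review."""

    def __init__(self, vocab_size: int = 100, min_frequency: int = 1):
        self.vocab_size = vocab_size
        self.min_frequency = min_frequency

    def train(self, text_file: str) -> None:
        # Fail fast instead of silently ignoring an unsupported setting:
        # SentencePiece's trainer has no confirmed min_frequency option.
        if self.min_frequency != 1:
            raise ValueError(
                "min_frequency is not supported by the SentencePiece trainer; "
                "leave it at the default of 1."
            )
        # spm.SentencePieceTrainer.train(...) would run here in the real class.
```

This keeps callers from being misconfigured without noticing, at the cost of rejecting a parameter the constructor still accepts.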
- Around line 140-145: get_merges_path currently returns None which breaks
polymorphism where callers expect a Path; change get_merges_path(self,
tokenizer_dir: Path) to return a Path object (e.g. tokenizer_dir / "merges" or
tokenizer_dir / "merges.txt") to maintain the artifact-path API for
BaseTokenizer consumers, and update the docstring to note SentencePiece doesn't
use merges but returns a placeholder Path that may not exist.
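A placeholder-path fix along the lines suggested might look like this sketch. The method name matches the review comment; the surrounding class and the choice of `merges.txt` as the placeholder file name are assumptions.

```python
from pathlib import Path


class SentencePieceTokenizerSketch:
    """Illustrative stand-in; only get_merges_path is shown."""

    def get_merges_path(self, tokenizer_dir: Path) -> Path:
        # SentencePiece does not use a merges file; return a placeholder
        # Path (which may not exist) so callers always receive a Path.
        return Path(tokenizer_dir) / "merges.txt"
```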
In `@tests/test_sentencepiece.py`:
- Around line 199-213: Extend the determinism test
test_spm_training_is_deterministic to also compare the trained binary model:
after training SentencePieceTokenizer instances (tokenizer_1, tokenizer_2) and
reading spm.vocab, open the corresponding "spm.model" files from save_path_1 and
save_path_2 in binary mode (e.g., read bytes or compute a hash) and assert they
are identical; ensure you reference the same file name "spm.model" and perform a
binary equality or hash equality check so the runtime model artifact is included
in the determinism assertion.
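A binary-equality check like the one requested can be done by hashing the model files. The sketch below uses stand-in files; in the actual test the hashes would be computed over `save_path_1 / "spm.model"` and `save_path_2 / "spm.model"` (names taken from the review comment).

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file's bytes so binary model artifacts can be compared."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


# Demonstration with two identical stand-in files; the real determinism test
# would hash the two trained spm.model artifacts instead.
with tempfile.TemporaryDirectory() as tmp:
    model_1 = Path(tmp) / "spm_1.model"
    model_2 = Path(tmp) / "spm_2.model"
    model_1.write_bytes(b"\x00fake-model-bytes")
    model_2.write_bytes(b"\x00fake-model-bytes")
    assert file_sha256(model_1) == file_sha256(model_2)
```

Hashing avoids holding both files in memory at once, which matters little for tokenizer models but keeps the assertion message short on failure.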
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 436c3fd3-76ab-45aa-880b-3a3ea484b0cd
📒 Files selected for processing (4)
.gitignore, openverifiablellm/tokenizer/sentencepiece_tokenizer.py, openverifiablellm/verify.py, tests/test_sentencepiece.py
Actionable comments posted: 1
♻️ Duplicate comments (1)
openverifiablellm/tokenizer/sentencepiece_tokenizer.py (1)
26-35: ⚠️ Potential issue | 🟠 Major
Verify that `min_count` is a real SentencePiece trainer option. I can't find a `min_count` field in SentencePiece's upstream `TrainerSpec`, and the Python wrapper just forwards keyword args into the native trainer. If the pinned wheel doesn't support this key, `min_frequency` is still effectively misconfigured here, and the new tests would not catch it because they never assert that `min_frequency` changes the trained artifacts. Either map this to a confirmed SentencePiece option or fail fast when `min_frequency != 1`. (raw.githubusercontent.com)

Safe fallback until the upstream option is confirmed:

```diff
     save_path.mkdir(parents=True, exist_ok=True)

+    if self.min_frequency != 1:
+        raise ValueError(
+            "SentencePieceTokenizer does not currently support min_frequency; use 1 until a supported trainer option is wired through."
+        )
+
     model_prefix = save_path / "spm"
     spm.SentencePieceTrainer.train(
         input=str(text_file),
         model_prefix=str(model_prefix),
         vocab_size=self.vocab_size,
-        min_count=self.min_frequency,
         pad_id=0,
         unk_id=1,
         bos_id=2,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py` around lines 26 - 35, The call to spm.SentencePieceTrainer.train is using a nonstandard keyword min_count (from our min_frequency) which may not be supported by the upstream wrapper; update the SentencePiece tokenization code (around the save_path/model_prefix and the spm.SentencePieceTrainer.train invocation in sentencepiece_tokenizer.py) to either map our min_frequency to a confirmed SentencePiece TrainerSpec option or, until you verify such mapping, add a fail-fast check that raises a clear error if self.min_frequency != 1; reference the min_frequency field and the spm.SentencePieceTrainer.train call so the check runs before invoking train.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py`:
- Around line 49-60: Promote the runtime API methods encode, decode, and load
from SentencePieceTokenizer into the BaseTokenizer interface so
tokenizer-swapping code can rely on them: add abstract method signatures
encode(self, text: str) -> list, decode(self, ids: list) -> str, and load(self,
tokenizer_dir: Path) -> None to BaseTokenizer (or provide default
implementations that raise NotImplementedError), then have
SentencePieceTokenizer implement/override these methods (using its _check_loaded
and _load_model helpers) to preserve current behavior; ensure type hints match
and update any callers to depend on BaseTokenizer for runtime encode/decode/load
usage.
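A promoted interface along these lines could be sketched as below. The three runtime signatures follow the comment's suggestion; `train`, `get_vocab_path`, and `get_merges_path` are the methods the review says `BaseTokenizer` already guarantees, so their exact signatures here are assumptions.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseTokenizer(ABC):
    """Sketch of the interface with encode/decode/load promoted into it."""

    @abstractmethod
    def train(self, text_file: Path, save_path: Path) -> None: ...

    @abstractmethod
    def get_vocab_path(self, tokenizer_dir: Path) -> Path: ...

    @abstractmethod
    def get_merges_path(self, tokenizer_dir: Path) -> Path: ...

    # Runtime API promoted from SentencePieceTokenizer:
    @abstractmethod
    def encode(self, text: str) -> list: ...

    @abstractmethod
    def decode(self, ids: list) -> str: ...

    @abstractmethod
    def load(self, tokenizer_dir: Path) -> None: ...
```

With abstract methods, any future tokenizer implementation that omits the runtime trio fails at instantiation time rather than deep in pipeline code.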
---
Duplicate comments:
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py`:
- Around line 26-35: The call to spm.SentencePieceTrainer.train is using a
nonstandard keyword min_count (from our min_frequency) which may not be
supported by the upstream wrapper; update the SentencePiece tokenization code
(around the save_path/model_prefix and the spm.SentencePieceTrainer.train
invocation in sentencepiece_tokenizer.py) to either map our min_frequency to a
confirmed SentencePiece TrainerSpec option or, until you verify such mapping,
add a fail-fast check that raises a clear error if self.min_frequency != 1;
reference the min_frequency field and the spm.SentencePieceTrainer.train call so
the check runs before invoking train.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c569e98c-903a-419e-afdb-571636ed3102
📒 Files selected for processing (2)
openverifiablellm/tokenizer/sentencepiece_tokenizer.py, tests/test_sentencepiece.py
```python
def encode(self, text: str) -> list:
    self._check_loaded()
    return self._model.encode(text, out_type=int)

def decode(self, ids: list) -> str:
    self._check_loaded()
    return self._model.decode(ids)

def load(self, tokenizer_dir: Path):
    tokenizer_dir = Path(tokenizer_dir)
    self._load_model(tokenizer_dir)
```
🧹 Nitpick | 🔵 Trivial
Promote encode / decode / load into BaseTokenizer if tokenizer swapping is supported.
These methods only exist on SentencePieceTokenizer. BaseTokenizer still guarantees only train, get_vocab_path, and get_merges_path, so tokenizer-agnostic code cannot rely on the new runtime API and future tokenizer implementations will not be forced to keep parity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py` around lines 49 - 60,
Promote the runtime API methods encode, decode, and load from
SentencePieceTokenizer into the BaseTokenizer interface so tokenizer-swapping
code can rely on them: add abstract method signatures encode(self, text: str) ->
list, decode(self, ids: list) -> str, and load(self, tokenizer_dir: Path) ->
None to BaseTokenizer (or provide default implementations that raise
NotImplementedError), then have SentencePieceTokenizer implement/override these
methods (using its _check_loaded and _load_model helpers) to preserve current
behavior; ensure type hints match and update any callers to depend on
BaseTokenizer for runtime encode/decode/load usage.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_sentencepiece.py`:
- Around line 136-140: The test test_spm_get_merges_path_returns_none has a
stale expectation; update it to reflect SentencePieceTokenizer.get_merges_path
returning a path. Replace the assertion so it checks that
tokenizer.get_merges_path(tmp_path) equals tmp_path / "spm.model" (use the
SentencePieceTokenizer class and get_merges_path method to locate the code).
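The updated test could look like the following sketch. The expected value `tmp_path / "spm.model"` is taken directly from the review comment; `StubTokenizer` is a hypothetical stand-in for the real `SentencePieceTokenizer` so the example stays self-contained.

```python
from pathlib import Path


class StubTokenizer:
    """Stand-in mimicking get_merges_path after the fix."""

    def get_merges_path(self, tokenizer_dir: Path) -> Path:
        return Path(tokenizer_dir) / "spm.model"


def test_spm_get_merges_path_returns_model_path(tmp_path: Path) -> None:
    tokenizer = StubTokenizer()
    assert tokenizer.get_merges_path(tmp_path) == tmp_path / "spm.model"
```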
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7a6d89a0-f206-497d-8108-a8f1711a2833
📒 Files selected for processing (2)
openverifiablellm/tokenizer/sentencepiece_tokenizer.py, tests/test_sentencepiece.py
Summary
Completes the SentencePieceTokenizer implementation introduced in #17 by adding encode(), decode(), and load(), making SentencePieceTokenizer fully usable in the pipeline.
Background
PR #17 introduced a modular tokenizer architecture with SentencePieceTokenizer as one of the implementations. However, the current SentencePieceTokenizer supports only training and lacks encode(), decode(), and load().
Since these methods are absent, the tokenizer cannot process text despite being trained, which prevents it from being used in downstream pipeline tasks.
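The missing runtime guard this PR adds can be sketched as follows. The `_check_loaded()` name and the `encode` body mirror the code in this PR; the class wrapper and error message are illustrative.

```python
class TokenizerRuntimeSketch:
    """Minimal sketch of the loaded-model guard around encode/decode."""

    def __init__(self):
        self._model = None  # set by train() or load() in the real class

    def _check_loaded(self) -> None:
        if self._model is None:
            raise RuntimeError(
                "Tokenizer model is not loaded; call train() or load() first."
            )

    def encode(self, text: str) -> list:
        self._check_loaded()
        return self._model.encode(text, out_type=int)
```

Calling encode() before training or loading raises immediately instead of failing with an opaque AttributeError on `None`.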
Changes
- openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  - encode(text) -> list[int]
  - decode(ids) -> str
  - load(tokenizer_dir): reloads from vocab.json + merges.txt
  - text_file.is_file() validation in train()
  - _check_loaded() guard for encode/decode
- tests/test_sentencepiece.py ← new file
Testing
All 18 tests pass.
Scope
Self-contained change. Touches only sentencepiece_tokenizer.py and tests/test_sentencepiece.py. No overlap with any open PR.
Related
Closes: [FEATURE]: Complete SentencePiece tokenizer — encode, decode, load and input validation #51