
Feature/complete sentencepiece tokenizer#65

Open
Arpitsh7 wants to merge 8 commits into AOSSIE-Org:main from Arpitsh7:feature/complete-sentencepiece_tokenizer

Conversation

@Arpitsh7
Contributor

@Arpitsh7 Arpitsh7 commented Mar 11, 2026

Summary

Completes the SentencePieceTokenizer implementation introduced in #17 by adding encode(), decode(), and load() — making SentencePieceTokenizer fully usable in the pipeline.


Background

PR #17 introduced a modular tokenizer architecture with SentencePieceTokenizer as one of the implementations. However, the current SentencePieceTokenizer supports only training and lacks encode(), decode(), and load().

Since these methods are absent, the tokenizer cannot process text despite being trained, which prevents it from being used in downstream pipeline tasks.


Changes

openverifiablellm/tokenizer/sentencepiece_tokenizer.py

  • Implemented encode(text) -> list[int]
  • Implemented decode(ids) -> str
  • Implemented load(tokenizer_dir) — reloads from vocab.json + merges.txt
  • Added text_file.is_file() validation in train()
  • Added _check_loaded() guard for encode/decode
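The new runtime methods can be sketched roughly as below. This is a simplified sketch based on the bullet list above, not the PR's exact code: the lazy sentencepiece import, the spm.model file name, and the error messages are illustrative assumptions.

```python
from pathlib import Path


class SentencePieceTokenizer:
    """Sketch of the runtime API added by this PR (names assumed from the description)."""

    def __init__(self) -> None:
        self._model = None  # populated by load()/train()

    def _check_loaded(self) -> None:
        # Guard used by encode()/decode(): fail fast with a clear error
        # instead of an AttributeError on an unset model.
        if self._model is None:
            raise RuntimeError(
                "Tokenizer model not loaded. Call train() or load() first."
            )

    def load(self, tokenizer_dir: Path) -> None:
        # Lazy import so the class can be defined without sentencepiece installed.
        import sentencepiece as spm

        model_file = Path(tokenizer_dir) / "spm.model"
        if not model_file.is_file():
            raise FileNotFoundError(f"Missing model file: {model_file}")
        self._model = spm.SentencePieceProcessor(model_file=str(model_file))

    def encode(self, text: str) -> list:
        self._check_loaded()
        return self._model.encode(text, out_type=int)

    def decode(self, ids: list) -> str:
        self._check_loaded()
        return self._model.decode(ids)
```

Calling encode() or decode() before load() hits the guard and raises RuntimeError rather than failing obscurely deeper in the call.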

tests/test_sentencepiece.py ← new file

  • 18 tests covering training, encode/decode roundtrip, load, artifact paths, special tokens, determinism, and constructor validation

Testing

pytest tests/test_sentencepiece.py -v

All 18 tests pass.


Scope

Self-contained change. Touches only sentencepiece_tokenizer.py and tests/test_sentencepiece.py. No overlap with any open PR.

Related

Summary by CodeRabbit

  • New Features

    • Added a SentencePiece tokenizer: train from text, save/load persistent models, and encode/decode with validation and clear errors.
  • Tests

    • New comprehensive tests for tokenizer training, loading, encode/decode, artifact creation, special tokens, and deterministic behavior.
  • Chores

    • Updated .gitignore to exclude the standard Python venv directory.
  • Style

    • Minor formatting adjustments with no behavioral changes.

@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

Warning

Rate limit exceeded

@Arpitsh7 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 0 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 41c2df51-e82d-4c03-b243-2c777270637a

📥 Commits

Reviewing files that changed from the base of the PR and between c73d046 and becd3c1.

📒 Files selected for processing (1)
  • tests/test_sentencepiece.py

Walkthrough

Adds full SentencePieceTokenizer functionality (train, load, encode, decode, validation, artifact paths), new tests for tokenizer behavior and determinism, a .gitignore entry for venv/, and a minor formatting change in verify.py. (50 words)

Changes

Cohort / File(s) Summary

  • Gitignore (.gitignore): Added venv/ ignore pattern alongside existing *.venv.
  • SentencePiece Tokenizer (openverifiablellm/tokenizer/sentencepiece_tokenizer.py): Added SPM_MODEL_FILE/SPM_VOCAB_FILE constants; new constructor (vocab_size, min_frequency); implemented train, load, encode, decode, get_vocab_path, get_merges_path; added _load_model and _check_loaded helpers; input validation, directory creation, and FileNotFoundError/RuntimeError handling.
  • Verification Formatting (openverifiablellm/verify.py): Reformatted construction of the Skip CheckResult for the missing environment_hash case (style-only change).
  • Tests (tests/test_sentencepiece.py): New pytest module covering training, artifact generation (spm.model, spm.vocab), load behavior, encode/decode round-trips, error cases, special tokens, path helpers, and deterministic training checks.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐇 I learned to split the words with care and vim,
I save my model, vocab — tidy, trim, and prim,
I load, I encode, decode with joyful hops,
Tests check my hops and never let me stop,
A rabbit claps — tokens leap on a whim!

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Out of Scope Changes check: ❓ Inconclusive. Minor out-of-scope changes detected: the .gitignore modification (venv/ pattern) and the openverifiablellm/verify.py reformatting are unrelated to issue #51 requirements, though they are minimal and unlikely to cause issues. Resolution: consider isolating the .gitignore and verify.py changes into a separate PR to maintain focus on the core SentencePieceTokenizer completion objectives.

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title 'Feature/complete sentencepiece tokenizer' directly aligns with the main objective of completing the SentencePieceTokenizer implementation with encode, decode, and load functionality.
  • Linked Issues check: ✅ Passed. The PR successfully implements all key requirements from issue #51: encode() and decode() methods, load() for reloading artifacts, input validation (text_file existence, _check_loaded guard), special tokens in training, save_path directory creation, and comprehensive tests.


@github-actions github-actions bot added size/L and removed size/L labels Mar 11, 2026
@Arpitsh7
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions github-actions bot added size/L and removed size/L labels Mar 13, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py`:
- Around line 25-27: The constructor stores min_frequency but it is never
applied during training in the SentencePiece training path; update the training
routine (the method that calls SentencePieceTrainer.Train — referenced as the
training method that sets self._model) to use the stored value
(self._min_frequency) by passing it into the SentencePieceTrainer.Train call (or
equivalent training config) so the vocab pruning respects min_frequency; if the
underlying SentencePiece API does not accept a min_frequency parameter, make the
change fail-fast by raising a clear exception or logging an error when
min_frequency is non-default so callers are not silently misconfigured.
- Around line 140-145: get_merges_path currently returns None which breaks
polymorphism where callers expect a Path; change get_merges_path(self,
tokenizer_dir: Path) to return a Path object (e.g. tokenizer_dir / "merges" or
tokenizer_dir / "merges.txt") to maintain the artifact-path API for
BaseTokenizer consumers, and update the docstring to note SentencePiece doesn't
use merges but returns a placeholder Path that may not exist.

In `@tests/test_sentencepiece.py`:
- Around line 199-213: Extend the determinism test
test_spm_training_is_deterministic to also compare the trained binary model:
after training SentencePieceTokenizer instances (tokenizer_1, tokenizer_2) and
reading spm.vocab, open the corresponding "spm.model" files from save_path_1 and
save_path_2 in binary mode (e.g., read bytes or compute a hash) and assert they
are identical; ensure you reference the same file name "spm.model" and perform a
binary equality or hash equality check so the runtime model artifact is included
in the determinism assertion.
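The binary comparison the comment asks for can be done with a small hash helper like the sketch below; save_path_1 and save_path_2 are the test's hypothetical training output directories from the review comment, and the helper name is illustrative.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    # SHA-256 over the raw bytes, so two spm.model artifacts compare
    # equal only if they are byte-identical.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# In the determinism test (paths hypothetical):
# assert file_digest(save_path_1 / "spm.model") == file_digest(save_path_2 / "spm.model")
```

Hashing keeps the assertion failure message short compared to asserting on two large byte strings directly.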
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 436c3fd3-76ab-45aa-880b-3a3ea484b0cd

📥 Commits

Reviewing files that changed from the base of the PR and between a2dfd7a and 1bdeb40.

📒 Files selected for processing (4)
  • .gitignore
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • openverifiablellm/verify.py
  • tests/test_sentencepiece.py

@github-actions github-actions bot added size/L and removed size/L labels Mar 13, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
openverifiablellm/tokenizer/sentencepiece_tokenizer.py (1)

26-35: ⚠️ Potential issue | 🟠 Major

Verify that min_count is a real SentencePiece trainer option.

I can't find a min_count field in SentencePiece's upstream TrainerSpec, and the Python wrapper just forwards keyword args into the native trainer. If the pinned wheel doesn't support this key, min_frequency is still effectively misconfigured here, and the new tests would not catch it because they never assert that min_frequency changes the trained artifacts. Either map this to a confirmed SentencePiece option or fail fast when min_frequency != 1. (raw.githubusercontent.com)

Safe fallback until the upstream option is confirmed
         save_path.mkdir(parents=True, exist_ok=True)
 
+        if self.min_frequency != 1:
+            raise ValueError(
+                "SentencePieceTokenizer does not currently support min_frequency; use 1 until a supported trainer option is wired through."
+            )
+
         model_prefix = save_path / "spm"
 
         spm.SentencePieceTrainer.train(
             input=str(text_file),
             model_prefix=str(model_prefix),
             vocab_size=self.vocab_size,
-            min_count=self.min_frequency,
             pad_id=0,
             unk_id=1,
             bos_id=2,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py` around lines 26 - 35,
The call to spm.SentencePieceTrainer.train is using a nonstandard keyword
min_count (from our min_frequency) which may not be supported by the upstream
wrapper; update the SentencePiece tokenization code (around the
save_path/model_prefix and the spm.SentencePieceTrainer.train invocation in
sentencepiece_tokenizer.py) to either map our min_frequency to a confirmed
SentencePiece TrainerSpec option or, until you verify such mapping, add a
fail-fast check that raises a clear error if self.min_frequency != 1; reference
the min_frequency field and the spm.SentencePieceTrainer.train call so the check
runs before invoking train.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py`:
- Around line 49-60: Promote the runtime API methods encode, decode, and load
from SentencePieceTokenizer into the BaseTokenizer interface so
tokenizer-swapping code can rely on them: add abstract method signatures
encode(self, text: str) -> list, decode(self, ids: list) -> str, and load(self,
tokenizer_dir: Path) -> None to BaseTokenizer (or provide default
implementations that raise NotImplementedError), then have
SentencePieceTokenizer implement/override these methods (using its _check_loaded
and _load_model helpers) to preserve current behavior; ensure type hints match
and update any callers to depend on BaseTokenizer for runtime encode/decode/load
usage.

---

Duplicate comments:
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py`:
- Around line 26-35: The call to spm.SentencePieceTrainer.train is using a
nonstandard keyword min_count (from our min_frequency) which may not be
supported by the upstream wrapper; update the SentencePiece tokenization code
(around the save_path/model_prefix and the spm.SentencePieceTrainer.train
invocation in sentencepiece_tokenizer.py) to either map our min_frequency to a
confirmed SentencePiece TrainerSpec option or, until you verify such mapping,
add a fail-fast check that raises a clear error if self.min_frequency != 1;
reference the min_frequency field and the spm.SentencePieceTrainer.train call so
the check runs before invoking train.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c569e98c-903a-419e-afdb-571636ed3102

📥 Commits

Reviewing files that changed from the base of the PR and between 1bdeb40 and 323983c.

📒 Files selected for processing (2)
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • tests/test_sentencepiece.py

Comment on lines +49 to +60
def encode(self, text: str) -> list:
    self._check_loaded()
    return self._model.encode(text, out_type=int)

def decode(self, ids: list) -> str:
    self._check_loaded()
    return self._model.decode(ids)

def load(self, tokenizer_dir: Path):
    tokenizer_dir = Path(tokenizer_dir)
    self._load_model(tokenizer_dir)



🧹 Nitpick | 🔵 Trivial

Promote encode / decode / load into BaseTokenizer if tokenizer swapping is supported.

These methods only exist on SentencePieceTokenizer. BaseTokenizer still guarantees only train, get_vocab_path, and get_merges_path, so tokenizer-agnostic code cannot rely on the new runtime API and future tokenizer implementations will not be forced to keep parity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py` around lines 49 - 60,
Promote the runtime API methods encode, decode, and load from
SentencePieceTokenizer into the BaseTokenizer interface so tokenizer-swapping
code can rely on them: add abstract method signatures encode(self, text: str) ->
list, decode(self, ids: list) -> str, and load(self, tokenizer_dir: Path) ->
None to BaseTokenizer (or provide default implementations that raise
NotImplementedError), then have SentencePieceTokenizer implement/override these
methods (using its _check_loaded and _load_model helpers) to preserve current
behavior; ensure type hints match and update any callers to depend on
BaseTokenizer for runtime encode/decode/load usage.
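The promotion the reviewer suggests could look like the sketch below. BaseTokenizer's real signatures live in the repo; the abstract-method set here is assumed from the review comment (train, load, encode, decode), and get_vocab_path/get_merges_path are omitted for brevity.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseTokenizer(ABC):
    """Runtime API contract every tokenizer implementation must satisfy."""

    @abstractmethod
    def train(self, text_file: Path, save_path: Path) -> None:
        ...

    @abstractmethod
    def load(self, tokenizer_dir: Path) -> None:
        ...

    @abstractmethod
    def encode(self, text: str) -> list:
        ...

    @abstractmethod
    def decode(self, ids: list) -> str:
        ...
```

With these declared abstract, any future tokenizer subclass fails to instantiate until it implements the full runtime API, which enforces the parity the reviewer is after.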

@github-actions github-actions bot added size/L and removed size/L labels Mar 13, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_sentencepiece.py`:
- Around line 136-140: The test test_spm_get_merges_path_returns_none has a
stale expectation; update it to reflect SentencePieceTokenizer.get_merges_path
returning a path. Replace the assertion so it checks that
tokenizer.get_merges_path(tmp_path) equals tmp_path / "spm.model" (use the
SentencePieceTokenizer class and get_merges_path method to locate the code).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a6d89a0-f206-497d-8108-a8f1711a2833

📥 Commits

Reviewing files that changed from the base of the PR and between 323983c and c73d046.

📒 Files selected for processing (2)
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • tests/test_sentencepiece.py

@github-actions github-actions bot added size/L and removed size/L labels Mar 13, 2026

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Complete SentencePiece tokenizer — encode, decode, load and input validation

1 participant