
Feature/complete sentencepiece tokenizer#65

Open
Arpitsh7 wants to merge 8 commits into AOSSIE-Org:main from Arpitsh7:feature/complete-sentencepiece_tokenizer

Conversation

@Arpitsh7
Contributor

@Arpitsh7 Arpitsh7 commented Mar 11, 2026

Summary

Completes the SentencePieceTokenizer implementation introduced in #17 by adding encode(), decode(), and load() — making SentencePieceTokenizer fully usable in the pipeline.


Background

PR #17 introduced a modular tokenizer architecture with SentencePieceTokenizer as one of the implementations. However, the current SentencePieceTokenizer supports only training and lacks encode(), decode(), and load().

Since these methods are absent, the tokenizer cannot process text despite being trained, which prevents it from being used in downstream pipeline tasks.


Changes

openverifiablellm/tokenizer/sentencepiece_tokenizer.py

  • Implemented encode(text) -> list[int]
  • Implemented decode(ids) -> str
  • Implemented load(tokenizer_dir) — reloads from vocab.json + merges.txt
  • Added text_file.is_file() validation in train()
  • Added _check_loaded() guard for encode/decode
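The new runtime methods can be sketched roughly as below. This is a simplified sketch based on the bullet list above, not the PR's exact code: the lazy sentencepiece import, the spm.model file name, and the error messages are illustrative assumptions.

```python
from pathlib import Path


class SentencePieceTokenizer:
    """Sketch of the runtime API added by this PR (names assumed from the description)."""

    def __init__(self) -> None:
        self._model = None  # populated by load()/train()

    def _check_loaded(self) -> None:
        # Guard used by encode()/decode(): fail fast with a clear error
        # instead of an AttributeError on an unset model.
        if self._model is None:
            raise RuntimeError(
                "Tokenizer model not loaded. Call train() or load() first."
            )

    def load(self, tokenizer_dir: Path) -> None:
        # Lazy import so the class can be defined without sentencepiece installed.
        import sentencepiece as spm

        model_file = Path(tokenizer_dir) / "spm.model"
        if not model_file.is_file():
            raise FileNotFoundError(f"Missing model file: {model_file}")
        self._model = spm.SentencePieceProcessor(model_file=str(model_file))

    def encode(self, text: str) -> list:
        self._check_loaded()
        return self._model.encode(text, out_type=int)

    def decode(self, ids: list) -> str:
        self._check_loaded()
        return self._model.decode(ids)
```

Calling encode() or decode() before load() hits the guard and raises RuntimeError rather than failing obscurely deeper in the call.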

tests/test_sentencepiece.py ← new file

  • 18 tests covering training, encode/decode roundtrip, load, artifact paths, special tokens, determinism, and constructor validation

Testing

pytest tests/test_sentencepiece.py -v

All 18 tests pass.


Scope

Self-contained change. Touches only sentencepiece_tokenizer.py and tests/test_sentencepiece.py. No overlap with any open PR.

Related

Summary by CodeRabbit

  • New Features

    • Added a SentencePiece tokenizer: train from text, save/load persistent models, and encode/decode with validation and clear errors.
  • Tests

    • New comprehensive tests for tokenizer training, loading, encode/decode, artifact creation, special tokens, and deterministic behavior.
  • Chores

    • Updated .gitignore to exclude the standard Python venv directory.
  • Style

    • Minor formatting adjustments with no behavioral changes.

@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

Warning

Rate limit exceeded

@Arpitsh7 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 0 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 41c2df51-e82d-4c03-b243-2c777270637a

📥 Commits

Reviewing files that changed from the base of the PR and between c73d046 and becd3c1.

📒 Files selected for processing (1)
  • tests/test_sentencepiece.py

Walkthrough

Adds full SentencePieceTokenizer functionality (train, load, encode, decode, validation, artifact paths), new tests for tokenizer behavior and determinism, a .gitignore entry for venv/, and a minor formatting change in verify.py. (50 words)

Changes

Cohort / File(s) Summary

  • Gitignore (.gitignore): Added venv/ ignore pattern alongside existing *.venv.
  • SentencePiece Tokenizer (openverifiablellm/tokenizer/sentencepiece_tokenizer.py): Added SPM_MODEL_FILE/SPM_VOCAB_FILE constants; new constructor (vocab_size, min_frequency); implemented train, load, encode, decode, get_vocab_path, get_merges_path; added _load_model and _check_loaded helpers; input validation, directory creation, and FileNotFoundError/RuntimeError handling.
  • Verification Formatting (openverifiablellm/verify.py): Reformatted construction of the Skip CheckResult for the missing environment_hash case (style-only change).
  • Tests (tests/test_sentencepiece.py): New pytest module covering training, artifact generation (spm.model, spm.vocab), load behavior, encode/decode round-trips, error cases, special tokens, path helpers, and deterministic training checks.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐇 I learned to split the words with care and vim,
I save my model, vocab — tidy, trim, and prim,
I load, I encode, decode with joyful hops,
Tests check my hops and never let me stop,
A rabbit claps — tokens leap on a whim!

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Out of Scope Changes check: ❓ Inconclusive. Minor out-of-scope changes detected: the .gitignore modification (venv/ pattern) and the openverifiablellm/verify.py reformatting are unrelated to issue #51 requirements, though they are minimal and unlikely to cause issues. Resolution: consider isolating the .gitignore and verify.py changes into a separate PR to maintain focus on the core SentencePieceTokenizer completion objectives.

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title 'Feature/complete sentencepiece tokenizer' directly aligns with the main objective of completing the SentencePieceTokenizer implementation with encode, decode, and load functionality.
  • Linked Issues check: ✅ Passed. The PR successfully implements all key requirements from issue #51: encode() and decode() methods, load() for reloading artifacts, input validation (text_file existence, _check_loaded guard), special tokens in training, save_path directory creation, and comprehensive tests.


@github-actions github-actions bot added size/L and removed size/L labels Mar 11, 2026
@Arpitsh7
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions github-actions bot added size/L and removed size/L labels Mar 13, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py`:
- Around line 25-27: The constructor stores min_frequency but it is never
applied during training in the SentencePiece training path; update the training
routine (the method that calls SentencePieceTrainer.Train — referenced as the
training method that sets self._model) to use the stored value
(self._min_frequency) by passing it into the SentencePieceTrainer.Train call (or
equivalent training config) so the vocab pruning respects min_frequency; if the
underlying SentencePiece API does not accept a min_frequency parameter, make the
change fail-fast by raising a clear exception or logging an error when
min_frequency is non-default so callers are not silently misconfigured.
- Around line 140-145: get_merges_path currently returns None which breaks
polymorphism where callers expect a Path; change get_merges_path(self,
tokenizer_dir: Path) to return a Path object (e.g. tokenizer_dir / "merges" or
tokenizer_dir / "merges.txt") to maintain the artifact-path API for
BaseTokenizer consumers, and update the docstring to note SentencePiece doesn't
use merges but returns a placeholder Path that may not exist.

In `@tests/test_sentencepiece.py`:
- Around line 199-213: Extend the determinism test
test_spm_training_is_deterministic to also compare the trained binary model:
after training SentencePieceTokenizer instances (tokenizer_1, tokenizer_2) and
reading spm.vocab, open the corresponding "spm.model" files from save_path_1 and
save_path_2 in binary mode (e.g., read bytes or compute a hash) and assert they
are identical; ensure you reference the same file name "spm.model" and perform a
binary equality or hash equality check so the runtime model artifact is included
in the determinism assertion.
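The binary comparison the comment asks for can be done with a small hash helper like the sketch below; save_path_1 and save_path_2 are the test's hypothetical training output directories from the review comment, and the helper name is illustrative.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    # SHA-256 over the raw bytes, so two spm.model artifacts compare
    # equal only if they are byte-identical.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# In the determinism test (paths hypothetical):
# assert file_digest(save_path_1 / "spm.model") == file_digest(save_path_2 / "spm.model")
```

Hashing keeps the assertion failure message short compared to asserting on two large byte strings directly.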
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 436c3fd3-76ab-45aa-880b-3a3ea484b0cd

📥 Commits

Reviewing files that changed from the base of the PR and between a2dfd7a and 1bdeb40.

📒 Files selected for processing (4)
  • .gitignore
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • openverifiablellm/verify.py
  • tests/test_sentencepiece.py

@github-actions github-actions bot added size/L and removed size/L labels Mar 13, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
openverifiablellm/tokenizer/sentencepiece_tokenizer.py (1)

26-35: ⚠️ Potential issue | 🟠 Major

Verify that min_count is a real SentencePiece trainer option.

I can't find a min_count field in SentencePiece's upstream TrainerSpec, and the Python wrapper just forwards keyword args into the native trainer. If the pinned wheel doesn't support this key, min_frequency is still effectively misconfigured here, and the new tests would not catch it because they never assert that min_frequency changes the trained artifacts. Either map this to a confirmed SentencePiece option or fail fast when min_frequency != 1. (raw.githubusercontent.com)

Safe fallback until the upstream option is confirmed
         save_path.mkdir(parents=True, exist_ok=True)
 
+        if self.min_frequency != 1:
+            raise ValueError(
+                "SentencePieceTokenizer does not currently support min_frequency; use 1 until a supported trainer option is wired through."
+            )
+
         model_prefix = save_path / "spm"
 
         spm.SentencePieceTrainer.train(
             input=str(text_file),
             model_prefix=str(model_prefix),
             vocab_size=self.vocab_size,
-            min_count=self.min_frequency,
             pad_id=0,
             unk_id=1,
             bos_id=2,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py` around lines 26 - 35,
The call to spm.SentencePieceTrainer.train is using a nonstandard keyword
min_count (from our min_frequency) which may not be supported by the upstream
wrapper; update the SentencePiece tokenization code (around the
save_path/model_prefix and the spm.SentencePieceTrainer.train invocation in
sentencepiece_tokenizer.py) to either map our min_frequency to a confirmed
SentencePiece TrainerSpec option or, until you verify such mapping, add a
fail-fast check that raises a clear error if self.min_frequency != 1; reference
the min_frequency field and the spm.SentencePieceTrainer.train call so the check
runs before invoking train.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py`:
- Around line 49-60: Promote the runtime API methods encode, decode, and load
from SentencePieceTokenizer into the BaseTokenizer interface so
tokenizer-swapping code can rely on them: add abstract method signatures
encode(self, text: str) -> list, decode(self, ids: list) -> str, and load(self,
tokenizer_dir: Path) -> None to BaseTokenizer (or provide default
implementations that raise NotImplementedError), then have
SentencePieceTokenizer implement/override these methods (using its _check_loaded
and _load_model helpers) to preserve current behavior; ensure type hints match
and update any callers to depend on BaseTokenizer for runtime encode/decode/load
usage.

---

Duplicate comments:
In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py`:
- Around line 26-35: The call to spm.SentencePieceTrainer.train is using a
nonstandard keyword min_count (from our min_frequency) which may not be
supported by the upstream wrapper; update the SentencePiece tokenization code
(around the save_path/model_prefix and the spm.SentencePieceTrainer.train
invocation in sentencepiece_tokenizer.py) to either map our min_frequency to a
confirmed SentencePiece TrainerSpec option or, until you verify such mapping,
add a fail-fast check that raises a clear error if self.min_frequency != 1;
reference the min_frequency field and the spm.SentencePieceTrainer.train call so
the check runs before invoking train.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c569e98c-903a-419e-afdb-571636ed3102

📥 Commits

Reviewing files that changed from the base of the PR and between 1bdeb40 and 323983c.

📒 Files selected for processing (2)
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • tests/test_sentencepiece.py

Comment on lines +49 to +60
def encode(self, text: str) -> list:
    self._check_loaded()
    return self._model.encode(text, out_type=int)

def decode(self, ids: list) -> str:
    self._check_loaded()
    return self._model.decode(ids)

def load(self, tokenizer_dir: Path):
    tokenizer_dir = Path(tokenizer_dir)
    self._load_model(tokenizer_dir)



🧹 Nitpick | 🔵 Trivial

Promote encode / decode / load into BaseTokenizer if tokenizer swapping is supported.

These methods only exist on SentencePieceTokenizer. BaseTokenizer still guarantees only train, get_vocab_path, and get_merges_path, so tokenizer-agnostic code cannot rely on the new runtime API and future tokenizer implementations will not be forced to keep parity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/sentencepiece_tokenizer.py` around lines 49 - 60,
Promote the runtime API methods encode, decode, and load from
SentencePieceTokenizer into the BaseTokenizer interface so tokenizer-swapping
code can rely on them: add abstract method signatures encode(self, text: str) ->
list, decode(self, ids: list) -> str, and load(self, tokenizer_dir: Path) ->
None to BaseTokenizer (or provide default implementations that raise
NotImplementedError), then have SentencePieceTokenizer implement/override these
methods (using its _check_loaded and _load_model helpers) to preserve current
behavior; ensure type hints match and update any callers to depend on
BaseTokenizer for runtime encode/decode/load usage.
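The promotion the reviewer suggests could look like the sketch below. BaseTokenizer's real signatures live in the repo; the abstract-method set here is assumed from the review comment (train, load, encode, decode), and get_vocab_path/get_merges_path are omitted for brevity.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseTokenizer(ABC):
    """Runtime API contract every tokenizer implementation must satisfy."""

    @abstractmethod
    def train(self, text_file: Path, save_path: Path) -> None:
        ...

    @abstractmethod
    def load(self, tokenizer_dir: Path) -> None:
        ...

    @abstractmethod
    def encode(self, text: str) -> list:
        ...

    @abstractmethod
    def decode(self, ids: list) -> str:
        ...
```

With these declared abstract, any future tokenizer subclass fails to instantiate until it implements the full runtime API, which enforces the parity the reviewer is after.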

@github-actions github-actions bot added size/L and removed size/L labels Mar 13, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_sentencepiece.py`:
- Around line 136-140: The test test_spm_get_merges_path_returns_none has a
stale expectation; update it to reflect SentencePieceTokenizer.get_merges_path
returning a path. Replace the assertion so it checks that
tokenizer.get_merges_path(tmp_path) equals tmp_path / "spm.model" (use the
SentencePieceTokenizer class and get_merges_path method to locate the code).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a6d89a0-f206-497d-8108-a8f1711a2833

📥 Commits

Reviewing files that changed from the base of the PR and between 323983c and c73d046.

📒 Files selected for processing (2)
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • tests/test_sentencepiece.py

@github-actions github-actions bot added size/L and removed size/L labels Mar 13, 2026

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Complete SentencePiece tokenizer — encode, decode, load and input validation

1 participant