
feature: completed bpe and base #64

Open
Arpitsh7 wants to merge 9 commits into AOSSIE-Org:main from Arpitsh7:feature/complete_bpebase

Conversation


@Arpitsh7 Arpitsh7 commented Mar 11, 2026

Summary

Completes the BaseTokenizer abstract interface and BPETokenizer implementation introduced in #17 by adding encode(), decode(), and load() — making BPETokenizer fully usable in the pipeline.

Background

PR #17 introduced a modular tokenizer architecture. However, BaseTokenizer defines only train(), get_vocab_path(), and get_merges_path(). Since encode(), decode(), and load() are absent from the abstract interface, subclasses are under no obligation to implement them, and BPETokenizer cannot tokenize text even after training.

Changes

openverifiablellm/tokenizer/base.py

  • Added encode(), decode(), load() as abstract methods with full docstrings
  • Added constructor validation for vocab_size and min_frequency
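Taken together, the expanded interface might look like the following minimal sketch. Only the method names and the fact that the constructor validates vocab_size and min_frequency come from this PR; the default values and error messages here are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseTokenizer(ABC):
    """Hypothetical sketch of the expanded abstract interface."""

    def __init__(self, vocab_size: int = 32000, min_frequency: int = 2):
        # Constructor validation described in the PR summary; exact
        # defaults and messages are illustrative assumptions.
        if vocab_size <= 0:
            raise ValueError(f"vocab_size must be positive, got {vocab_size}")
        if min_frequency <= 0:
            raise ValueError(f"min_frequency must be positive, got {min_frequency}")
        self.vocab_size = vocab_size
        self.min_frequency = min_frequency

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Encode text into a list of integer token ids."""

    @abstractmethod
    def decode(self, ids: list[int]) -> str:
        """Decode a list of integer token ids back into text."""

    @abstractmethod
    def load(self, tokenizer_dir: Path) -> None:
        """Load a previously trained tokenizer from disk."""
```

With all three methods abstract, any subclass that omits one fails at instantiation time with a TypeError rather than silently lacking the capability.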

openverifiablellm/tokenizer/bpe_tokenizer.py

  • Implemented encode(text) -> list[int]
  • Implemented decode(ids) -> str
  • Implemented load(tokenizer_dir) — reloads from vocab.json + merges.txt
  • Added text_file.is_file() validation in train()
  • Added _check_loaded() guard for encode/decode
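The validation and guard behavior in the bullets above can be sketched as follows. This is a simplified stand-in: the real implementation wraps tokenizers.ByteLevelBPETokenizer, and the exact error messages here are assumptions; only the _check_loaded() guard and the text_file.is_file() check are described in the PR.

```python
from pathlib import Path


class BPETokenizer:
    """Simplified sketch of the new validation and guard logic only."""

    def __init__(self):
        self._tokenizer = None  # set by train() or load()

    def _check_loaded(self) -> None:
        # Shared guard for encode()/decode(): fail fast with a clear error
        # instead of an AttributeError on a None tokenizer.
        if self._tokenizer is None:
            raise RuntimeError("Tokenizer not loaded. Call train() or load() first.")

    def train(self, text_file: Path, save_path: Path) -> None:
        # is_file() rejects both missing paths and directories up front.
        if not text_file.is_file():
            raise FileNotFoundError(f"Training file not found: {text_file}")
        # ... train a ByteLevelBPETokenizer and write vocab.json / merges.txt

    def encode(self, text: str) -> list[int]:
        self._check_loaded()
        return self._tokenizer.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        self._check_loaded()
        return self._tokenizer.decode(ids)
```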

tests/test_bpe.py ← new file

  • 18 tests covering training, encode/decode roundtrip, load, artifact paths, special tokens, determinism, and constructor validation

Testing

pytest tests/test_bpe.py -v

All 18 tests pass.


Scope

Self-contained change. Touches only base.py, bpe_tokenizer.py, and tests/test_bpe.py. No overlap with any open PR.


Related

Summary by CodeRabbit

  • New Features

    • Full BPE tokenizer implementation: training, saving/loading, and encode/decode; public tokenizer API expanded with load/encode/decode and artifact path helpers.
  • Bug Fixes / Validation

    • Improved constructor/input validation and clearer runtime error messages for tokenizer operations.
  • Tests

    • Added comprehensive end-to-end tests for training, artifacts, save/load, encode/decode, determinism, and validation.
  • Chores

    • Broadened Python virtual environment ignore patterns in .gitignore.
  • Style

    • Minor formatting tweak to verification reporting output (no behavior change).


coderabbitai bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds abstract encode, decode, and load to BaseTokenizer; implements a full BPETokenizer (constructor, train, load, encode/decode, artifact helpers, validation); adds comprehensive end-to-end tests for the BPE tokenizer; minor .gitignore addition; small stylistic change in openverifiablellm/verify.py.

Changes

  • Configuration (/.gitignore): Added venv/ entry while retaining *.venv.
  • Base Tokenizer Interface (openverifiablellm/tokenizer/base.py): Added abstract methods encode(text) -> list[int], decode(ids) -> str, and load(tokenizer_dir); updated the train() docstring; adjusted get_vocab_path / get_merges_path signatures and docstrings.
  • BPETokenizer Implementation (openverifiablellm/tokenizer/bpe_tokenizer.py): New full BPE implementation: constructor with validation, train() (input checks, artifact creation), load() (restore from vocab.json / merges.txt), encode()/decode() guarded by _check_loaded(), artifact path helpers, and improved error messages.
  • Tests (tests/test_bpebase.py): New comprehensive tests covering training, artifact generation, load/save, encode/decode round-trips, determinism, special tokens, path helpers, error cases, and constructor validation.
  • Verify formatting tweak (openverifiablellm/verify.py): Stylistic rewrite of CheckResult/report addition and a minor comma change (no behavioral change).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant BPETokenizer
    participant FileSystem

    rect rgba(100,150,200,0.5)
    Note over Client,FileSystem: Training flow
    Client->>BPETokenizer: train(text_file, save_path)
    BPETokenizer->>BPETokenizer: validate inputs
    BPETokenizer->>FileSystem: create save_path
    BPETokenizer->>FileSystem: write `vocab.json` & `merges.txt`
    end

    rect rgba(150,200,100,0.5)
    Note over Client,FileSystem: Load and inference flow
    Client->>BPETokenizer: load(tokenizer_dir)
    BPETokenizer->>FileSystem: read `vocab.json` & `merges.txt`
    BPETokenizer->>BPETokenizer: initialize internal tokenizer
    Client->>BPETokenizer: encode(text)
    BPETokenizer->>BPETokenizer: _check_loaded()
    BPETokenizer-->>Client: return ids
    Client->>BPETokenizer: decode(ids)
    BPETokenizer->>BPETokenizer: _check_loaded()
    BPETokenizer-->>Client: return text
    end
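The two flows in the diagram can be exercised end to end with a toy stand-in: a character-level "tokenizer" that writes and reloads a vocab.json. This is purely illustrative scaffolding, not the real BPE logic; class and file names mirror the diagram, and everything else is an assumption.

```python
import json
import tempfile
from pathlib import Path


class ToyTokenizer:
    """Character-level stand-in mirroring the train/save/load/encode/decode flow."""

    def __init__(self):
        self.vocab = None

    def train(self, text_file: Path, save_path: Path) -> None:
        if not text_file.is_file():
            raise FileNotFoundError(text_file)
        text = text_file.read_text(encoding="utf-8")
        # Deterministic vocab: sorted unique characters mapped to ids.
        self.vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
        save_path.mkdir(parents=True, exist_ok=True)  # create once, before saving
        (save_path / "vocab.json").write_text(json.dumps(self.vocab), encoding="utf-8")

    def load(self, tokenizer_dir: Path) -> None:
        vocab_file = tokenizer_dir / "vocab.json"
        if not vocab_file.is_file():
            raise FileNotFoundError(f"vocab.json not found in {tokenizer_dir}")
        self.vocab = json.loads(vocab_file.read_text(encoding="utf-8"))

    def encode(self, text: str) -> list[int]:
        return [self.vocab[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        inverse = {i: ch for ch, i in self.vocab.items()}
        return "".join(inverse[i] for i in ids)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "corpus.txt").write_text("hello tokenizer", encoding="utf-8")
    tok = ToyTokenizer()
    tok.train(tmp / "corpus.txt", tmp / "artifacts")
    fresh = ToyTokenizer()
    fresh.load(tmp / "artifacts")
    assert fresh.decode(fresh.encode("hello")) == "hello"  # round-trip
```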

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐇 I nudged the vocab, pushed merges in line,
Trained in a burrow where tokens align.
Encode and decode — I taught them to sing,
Load from disk, round-trip — what joy they bring!
A happy rabbit hops: "Tokens, take wing!"

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Out of Scope Changes check (⚠️ Warning): The PR includes out-of-scope changes to .gitignore (venv/ and *.venv rules) and openverifiablellm/verify.py (stylistic formatting), unrelated to the core tokenizer objectives. Resolution: remove the changes to .gitignore and verify.py; limit modifications to base.py, bpe_tokenizer.py, and tests/test_bpebase.py as defined in issue #52.
  • Title check (❓ Inconclusive): The title 'feature:completed bpe and base' is vague and uses generic phrasing without clearly conveying what 'completed' means or what specific functionality was added. Resolution: use a more descriptive title like 'Add encode, decode, and load methods to BaseTokenizer and BPETokenizer' to clearly specify the key changes.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Linked Issues check (✅ Passed): All coding objectives from issue #52 are fully addressed: abstract methods (encode, decode, load) added to BaseTokenizer, implemented in BPETokenizer, train() hardened with validation, and a comprehensive test suite added.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@github-actions github-actions bot added size/L and removed size/L labels Mar 11, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/bpe_tokenizer.py (1)

48-64: ⚠️ Potential issue | 🟡 Minor

Remove duplicate save_path.mkdir() call.

save_path.mkdir(parents=True, exist_ok=True) is called twice (lines 48 and 60). The second call on line 60 is redundant since the directory was already created on line 48.

🔧 Proposed fix
         save_path.mkdir(parents=True, exist_ok=True)

         tokenizer = ByteLevelBPETokenizer()

         tokenizer.train(
             files=[str(text_file)],
             vocab_size=self.vocab_size,
             min_frequency=self.min_frequency,
             special_tokens=SPECIAL_TOKENS,
         )

-        # Must create directory BEFORE save_model() is called
-        save_path.mkdir(parents=True, exist_ok=True)
-
         tokenizer.save_model(str(save_path))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 48 - 64, The code
calls save_path.mkdir(parents=True, exist_ok=True) twice before saving the
tokenizer; remove the redundant second call to avoid duplication. In the block
where ByteLevelBPETokenizer is created and trained (tokenizer.train(...)), keep
the initial save_path.mkdir(...) before training and then call
tokenizer.save_model(str(save_path)) and set self._tokenizer = tokenizer without
calling save_path.mkdir a second time; update the snippet around
tokenizer.save_model and self._tokenizer to remove the duplicate
save_path.mkdir() invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/base.py`:
- Around line 45-48: The abstract method get_merges_path is missing a return
type annotation; update its signature to indicate it may return None (e.g., ->
Optional[Path] or -> Path | None) to match the docstring and be consistent with
get_vocab_path; also add the required import (from typing import Optional) if
using Optional, and update the declaration of get_merges_path(tokenizer_dir:
Path) accordingly in the tokenizer/base.py file.
- Around line 25-33: The abstract methods encode and decode in the tokenizer
base currently use the generic type hint list; update their signatures to use
explicit list[int] for token id lists (i.e., encode(self, text: str) ->
list[int] and decode(self, ids: list[int]) -> str) so static type-checkers and
readers know these are integer token ids; adjust any related docstrings if
needed and ensure subclasses implement the new annotations for consistency
(refer to the encode and decode method definitions in the Tokenizer base class).

In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 70-102: The encode and decode methods use broad type hints; update
their annotations to be specific lists of integers to match the base contract:
change encode(self, text: str) -> list to encode(self, text: str) -> list[int]
and change decode(self, ids: list) -> str to decode(self, ids: list[int]) ->
str; ensure the implementations still call self._check_loaded() and return
self._tokenizer.encode(text).ids and self._tokenizer.decode(ids) respectively
(references: encode, decode, _check_loaded, _tokenizer).

In `@tests/test_bpebase.py`:
- Around line 73-78: The test function test_bpe_train_raises_if_directory_passed
declares an unused fixture parameter sample_text_file; remove that parameter
from the function signature so it only accepts tmp_path (i.e., change def
test_bpe_train_raises_if_directory_passed(tmp_path):) and keep the body as-is to
intentionally pass a directory to tokenizer.train; ensure no other references to
sample_text_file remain in that test.
- Around line 158-174: In the tests test_bpe_load_raises_if_vocab_missing and
test_bpe_load_raises_if_merges_missing update the pytest.raises match strings to
use raw regexes with escaped dots (e.g., r"vocab\.json not found" and
r"merges\.txt not found") so the literal periods are matched; locate the two
with pytest.raises(FileNotFoundError, match=...) around the
BPETokenizer.load(...) calls and replace the match arguments accordingly.
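The escaping matters because pytest.raises treats match as a regex, and an unescaped dot matches any character, so the original patterns could pass against the wrong message. A quick stdlib illustration:

```python
import re

# Unescaped: "." matches any character, so this also matches "vocab_json".
assert re.search("vocab.json not found", "vocab_json not found")

# Raw string with escaped dot: only the literal filename matches.
assert re.search(r"vocab\.json not found", "vocab.json not found")
assert not re.search(r"vocab\.json not found", "vocab_json not found")
```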

---

Outside diff comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 48-64: The code calls save_path.mkdir(parents=True, exist_ok=True)
twice before saving the tokenizer; remove the redundant second call to avoid
duplication. In the block where ByteLevelBPETokenizer is created and trained
(tokenizer.train(...)), keep the initial save_path.mkdir(...) before training
and then call tokenizer.save_model(str(save_path)) and set self._tokenizer =
tokenizer without calling save_path.mkdir a second time; update the snippet
around tokenizer.save_model and self._tokenizer to remove the duplicate
save_path.mkdir() invocation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4fa3e8cf-ae6f-46d9-ac07-e5837a8c63b9

📥 Commits

Reviewing files that changed from the base of the PR and between a2dfd7a and 2f71909.

📒 Files selected for processing (4)
  • .gitignore
  • openverifiablellm/tokenizer/base.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py
  • tests/test_bpebase.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 98-99: The BPE tokenizer's decode method currently calls
self._tokenizer.decode(ids) which defaults to skip_special_tokens=True and drops
special token ids; change the implementation in BPETokenizer.decode (the method
that calls self._check_loaded() and then returns the decoded string) to call the
underlying tokenizers.Tokenizer.decode with skip_special_tokens=False so
decoding is lossless by default and preserve special tokens for round-trip
encode/decode.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4301a79a-31eb-4a64-a0e3-6ef01da909cb

📥 Commits

Reviewing files that changed from the base of the PR and between 2f71909 and 82b0209.

📒 Files selected for processing (1)
  • openverifiablellm/tokenizer/bpe_tokenizer.py

@Archit381
Member

@Arpitsh7 please make the requested CodeRabbit changes


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/base.py`:
- Around line 25-37: The SentencePieceTokenizer class is missing concrete
implementations for BaseTokenizer's abstract methods encode, decode, and load;
implement encode(self, text: str) -> list[int] to return token ids (use the
SentencePieceProcessor EncodeAsIds or equivalent on the internal model),
decode(self, ids: list[int]) -> str to return text (use DecodeIds/DecodePieces
on the processor), and load(self, tokenizer_dir: Path) to initialize and load
the SentencePiece model file into the tokenizer instance (e.g., create/assign
self.sp_model and call its Load with the model filepath); update
SentencePieceTokenizer to define these three methods so instantiation for
tokenizer_type == "sentencepiece" succeeds.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 117-118: In load(), remove the local redefinition of vocab_path
and merges_path and instead call the centralized helpers
get_vocab_path(tokenizer_dir) and get_merges_path(tokenizer_dir) (or the
parameter signatures used in this module) so path logic is single-sourced;
replace uses of the local vocab_path and merges_path variables with the returned
paths from those helper functions and ensure any downstream code still receives
the same Path objects.

In `@tests/test_bpebase.py`:
- Around line 233-236: The test currently compares only vocab.json between
save_path_1 and save_path_2 (vocab_1 == vocab_2); extend the determinism
assertion to also read and compare the merges.txt files from both outputs (e.g.,
read (save_path_1 / "merges.txt") and (save_path_2 / "merges.txt") into merges_1
and merges_2) and assert merges_1 == merges_2 so that both the vocabulary and
merge rules are validated for deterministic training.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 131488aa-2b4d-4c1a-8dd9-b2d1e4234f9a

📥 Commits

Reviewing files that changed from the base of the PR and between 82b0209 and dcdb8c4.

📒 Files selected for processing (4)
  • openverifiablellm/tokenizer/base.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py
  • openverifiablellm/verify.py
  • tests/test_bpebase.py

Comment on lines 25 to +37 (the new abstract methods added in base.py; the hunk replaces context around get_vocab_path):

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Encode text into a list of integer token ids."""
        pass

    @abstractmethod
    def decode(self, ids: list[int]) -> str:
        """Decode a list of integer token ids back into text."""
        pass

    @abstractmethod
    def load(self, tokenizer_dir: Path):
        """Load a previously trained tokenizer from disk."""

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "BaseTokenizer abstract method declarations:"
rg -nP '^\s*@abstractmethod\s*$|^\s*def (encode|decode|load)\(' openverifiablellm/tokenizer/base.py -C1

echo
echo "SentencePieceTokenizer implementations for the same methods:"
rg -nP '^\s*def (encode|decode|load)\(' openverifiablellm/tokenizer/sentencepiece_tokenizer.py -C1 || true

echo
echo "Factory path that instantiates SentencePieceTokenizer:"
rg -nP 'SentencePieceTokenizer|tokenizer_type == "sentencepiece"' openverifiablellm/tokenizer/factory.py -C2

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1296


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== SentencePieceTokenizer class definition and all methods ==="
cat -n openverifiablellm/tokenizer/sentencepiece_tokenizer.py

echo
echo "=== Checking parent class(es) ==="
rg -P 'class SentencePieceTokenizer' openverifiablellm/tokenizer/sentencepiece_tokenizer.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1084


SentencePieceTokenizer is missing required abstract method implementations.

SentencePieceTokenizer does not implement three abstract methods from BaseTokenizer: encode, decode, and load. This will cause instantiation to fail at runtime when the factory attempts to create the tokenizer for tokenizer_type == "sentencepiece".

Implement encode, decode, and load in SentencePieceTokenizer, or remove the abstract requirement until the class is fully implemented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/base.py` around lines 25 - 37, The
SentencePieceTokenizer class is missing concrete implementations for
BaseTokenizer's abstract methods encode, decode, and load; implement
encode(self, text: str) -> list[int] to return token ids (use the
SentencePieceProcessor EncodeAsIds or equivalent on the internal model),
decode(self, ids: list[int]) -> str to return text (use DecodeIds/DecodePieces
on the processor), and load(self, tokenizer_dir: Path) to initialize and load
the SentencePiece model file into the tokenizer instance (e.g., create/assign
self.sp_model and call its Load with the model filepath); update
SentencePieceTokenizer to define these three methods so instantiation for
tokenizer_type == "sentencepiece" succeeds.

Comment on lines +117 to +118
vocab_path = tokenizer_dir / "vocab.json"
merges_path = tokenizer_dir / "merges.txt"

🧹 Nitpick | 🔵 Trivial

Use path helper methods inside load() to avoid duplication.

Line 117–118 redefines artifact paths that are already centralized in get_vocab_path() / get_merges_path(). Reusing those helpers keeps path logic single-sourced.

♻️ Proposed refactor
-        vocab_path = tokenizer_dir / "vocab.json"
-        merges_path = tokenizer_dir / "merges.txt"
+        vocab_path = self.get_vocab_path(tokenizer_dir)
+        merges_path = self.get_merges_path(tokenizer_dir)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 117 - 118, In
load(), remove the local redefinition of vocab_path and merges_path and instead
call the centralized helpers get_vocab_path(tokenizer_dir) and
get_merges_path(tokenizer_dir) (or the parameter signatures used in this module)
so path logic is single-sourced; replace uses of the local vocab_path and
merges_path variables with the returned paths from those helper functions and
ensure any downstream code still receives the same Path objects.

Comment on lines +233 to +236
vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")

assert vocab_1 == vocab_2

🧹 Nitpick | 🔵 Trivial

Expand determinism assertion to include merges.txt.

Line 233–236 verifies only vocab.json. BPE tokenization behavior also depends on merge rules, so deterministic training should assert both artifacts are identical.

♻️ Proposed test hardening
     vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
     vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")
+    merges_1 = (save_path_1 / "merges.txt").read_text(encoding="utf-8")
+    merges_2 = (save_path_2 / "merges.txt").read_text(encoding="utf-8")

     assert vocab_1 == vocab_2
+    assert merges_1 == merges_2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_bpebase.py` around lines 233 - 236, The test currently compares
only vocab.json between save_path_1 and save_path_2 (vocab_1 == vocab_2); extend
the determinism assertion to also read and compare the merges.txt files from
both outputs (e.g., read (save_path_1 / "merges.txt") and (save_path_2 /
"merges.txt") into merges_1 and merges_2) and assert merges_1 == merges_2 so
that both the vocabulary and merge rules are validated for deterministic
training.


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/bpe_tokenizer.py (1)

51-56: ⚠️ Potential issue | 🔴 Critical

Fix special-token loss during save/load cycle.

Verification confirms that load() loses special-token metadata. After training with SPECIAL_TOKENS (line 55), <s> encodes as [0]. But after reloading from vocab.json and merges.txt (lines 130–133), <s> expands to [32, 87, 34] (three separate tokens). The ByteLevelBPETokenizer constructor restores vocabulary but not added-token configuration.

To fix, explicitly re-register special tokens in load():

reloaded = ByteLevelBPETokenizer(
    vocab=str(vocab_file),
    merges=str(merges_file),
)
for token in SPECIAL_TOKENS:
    reloaded.add_special_tokens([token])

Add a regression test encoding <s> after load() to confirm the fix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 51 - 56, The
loader loses special-token metadata: after training with SPECIAL_TOKENS the
ByteLevelBPETokenizer encodes "<s>" correctly, but load() that constructs
ByteLevelBPETokenizer(vocab=..., merges=...) does not re-register added tokens;
update the load() implementation to iterate SPECIAL_TOKENS and call
reloaded.add_special_tokens([token]) (reference ByteLevelBPETokenizer and
load()), and add a regression test that trains, saves, reloads via load(), then
asserts encoding of "<s>" matches the trained/special-token encoding to prevent
regressions.
♻️ Duplicate comments (2)
tests/test_bpebase.py (1)

233-236: 🧹 Nitpick | 🔵 Trivial

The determinism test should assert merges.txt too.

For BPE, identical vocab.json files are only half the contract; different merges.txt contents can still change tokenization. Compare both artifacts here so this test actually proves deterministic training.

♻️ Suggested hardening
     vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
     vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")
+    merges_1 = (save_path_1 / "merges.txt").read_text(encoding="utf-8")
+    merges_2 = (save_path_2 / "merges.txt").read_text(encoding="utf-8")

     assert vocab_1 == vocab_2
+    assert merges_1 == merges_2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_bpebase.py` around lines 233 - 236, The determinism test currently
only compares vocab.json (vocab_1/vocab_2); extend it to also read and compare
the BPE merges file by loading (save_path_1 / "merges.txt") and (save_path_2 /
"merges.txt") with encoding="utf-8" and adding an assertion that merges_1 ==
merges_2 so both artifacts (vocab.json and merges.txt) are verified for
deterministic training in tests/test_bpebase.py.
openverifiablellm/tokenizer/base.py (1)

25-37: ⚠️ Potential issue | 🔴 Critical

This interface change breaks existing tokenizer subclasses.

openverifiablellm/tokenizer/sentencepiece_tokenizer.py still does not implement encode(), decode(), or load(). After making these abstract here, SentencePieceTokenizer becomes non-instantiable and any sentencepiece path will now fail with TypeError. Please update that subclass in the same PR or defer making these methods abstract until all subclasses comply.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/base.py` around lines 25 - 37, The abstract
methods encode, decode, and load were added to TokenizerBase which breaks
instantiation of SentencePieceTokenizer because it does not implement these
methods; update the SentencePieceTokenizer class to implement
TokenizerBase.encode(self, text: str) -> list[int], TokenizerBase.decode(self,
ids: list[int]) -> str, and TokenizerBase.load(self, tokenizer_dir: Path) (or
remove the abstract decorator from TokenizerBase.encode/decode/load until all
subclasses are updated). Implementations should delegate to the underlying
sentencepiece processor (e.g., use the SentencePiece model's encode/decode and
loading logic already present in sentencepiece_tokenizer.py) and match the
declared signatures so SentencePieceTokenizer becomes instantiable again.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 51-56: The loader loses special-token metadata: after training
with SPECIAL_TOKENS the ByteLevelBPETokenizer encodes "<s>" correctly, but
load() that constructs ByteLevelBPETokenizer(vocab=..., merges=...) does not
re-register added tokens; update the load() implementation to iterate
SPECIAL_TOKENS and call reloaded.add_special_tokens([token]) (reference
ByteLevelBPETokenizer and load()), and add a regression test that trains, saves,
reloads via load(), then asserts encoding of "<s>" matches the
trained/special-token encoding to prevent regressions.

---

Duplicate comments:
In `@openverifiablellm/tokenizer/base.py`:
- Around line 25-37: The abstract methods encode, decode, and load were added to
TokenizerBase which breaks instantiation of SentencePieceTokenizer because it
does not implement these methods; update the SentencePieceTokenizer class to
implement TokenizerBase.encode(self, text: str) -> list[int],
TokenizerBase.decode(self, ids: list[int]) -> str, and TokenizerBase.load(self,
tokenizer_dir: Path) (or remove the abstract decorator from
TokenizerBase.encode/decode/load until all subclasses are updated).
Implementations should delegate to the underlying sentencepiece processor (e.g.,
use the SentencePiece model's encode/decode and loading logic already present in
sentencepiece_tokenizer.py) and match the declared signatures so
SentencePieceTokenizer becomes instantiable again.

In `@tests/test_bpebase.py`:
- Around line 233-236: The determinism test currently only compares vocab.json
(vocab_1/vocab_2); extend it to also read and compare the BPE merges file by
loading (save_path_1 / "merges.txt") and (save_path_2 / "merges.txt") with
encoding="utf-8" and adding an assertion that merges_1 == merges_2 so both
artifacts (vocab.json and merges.txt) are verified for deterministic training in
tests/test_bpebase.py.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b8dfc0ff-3b7b-4db6-870d-712b9aaef5a9

📥 Commits

Reviewing files that changed from the base of the PR and between dcdb8c4 and 1d049ff.

📒 Files selected for processing (3)
  • openverifiablellm/tokenizer/base.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py
  • tests/test_bpebase.py


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/bpe_tokenizer.py (1)

1-165: ⚠️ Potential issue | 🟡 Minor

Fix Ruff formatting issues to pass CI.

The pipeline indicates that Ruff format check failed. Run ruff format openverifiablellm/tokenizer/bpe_tokenizer.py to fix the code style issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 1 - 165, Ruff
flagged formatting issues in openverifiablellm/tokenizer/bpe_tokenizer.py; run
`ruff format openverifiablellm/tokenizer/bpe_tokenizer.py` (or apply the
equivalent auto-format) and commit the changes so BPETokenizer, SPECIAL_TOKENS,
and its methods (train, load, encode, decode, _check_loaded) conform to the
project's Ruff/PEP8 style; if you cannot run Ruff locally, adjust whitespace,
imports, line lengths and trailing commas in that file to match Ruff's rules,
re-run the linter, and push the formatted file to fix CI.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 61bc19ff-755d-43f7-b747-58bbd6229971

📥 Commits

Reviewing files that changed from the base of the PR and between 1d049ff and 088ec84.

📒 Files selected for processing (1)
  • openverifiablellm/tokenizer/bpe_tokenizer.py

@Arpitsh7
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Mar 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 146-148: The method signature for get_merges_path in
openverifiablellm/tokenizer/bpe_tokenizer.py currently returns Path but must
match the base class contract BaseTokenizer.get_merges_path() which declares
Path | None; change the signature of get_merges_path(self, tokenizer_dir: Path)
-> Path | None (or Optional[Path]) and keep returning the Path(tokenizer_dir) /
"merges.txt" so the implementation remains correct while the declared return
type matches the base class; add or adjust the typing import (from typing import
Optional) if needed.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: edfa00b2-4b65-422e-ba8a-ae28d700ea7b

📥 Commits

Reviewing files that changed from the base of the PR and between 1d049ff and 49c7ba6.

📒 Files selected for processing (1)
  • openverifiablellm/tokenizer/bpe_tokenizer.py

Comment on lines 146 to +148

     def get_merges_path(self, tokenizer_dir: Path) -> Path:
-        return tokenizer_dir / "merges.txt"
+        """Return path to merges.txt file."""
+        return Path(tokenizer_dir) / "merges.txt"
Contributor

⚠️ Potential issue | 🟡 Minor

Return type mismatch with base class.

The base class BaseTokenizer.get_merges_path() declares return type Path | None, while this implementation declares Path. Narrowing a return type in an override is technically LSP-compatible, but widening the annotation to match the base class keeps the declared contract uniform across subclasses and type-safe for callers working through the base interface.

♻️ Proposed fix
-    def get_merges_path(self, tokenizer_dir: Path) -> Path:
+    def get_merges_path(self, tokenizer_dir: Path) -> Path | None:
         """Return path to merges.txt file."""
         return Path(tokenizer_dir) / "merges.txt"
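The proposed fix can be shown end to end in a minimal, self-contained sketch (hypothetical classes mirroring the names in the comment, not the project's actual files): the base class declares `Optional[Path]`, the BPE override repeats that annotation, and the implementation still always returns a concrete `Path`.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class BaseTokenizer(ABC):
    @abstractmethod
    def get_merges_path(self, tokenizer_dir: Path) -> Optional[Path]:
        """Path to the merges file, or None for merge-free tokenizers."""


class BPETokenizer(BaseTokenizer):
    def get_merges_path(self, tokenizer_dir: Path) -> Optional[Path]:
        """Return path to merges.txt file."""
        return Path(tokenizer_dir) / "merges.txt"


p = BPETokenizer().get_merges_path(Path("out/tokenizer"))
print(p.name)  # merges.txt
```

`Optional[Path]` is used rather than `Path | None` so the annotation also evaluates at runtime on Python versions before 3.10; either spelling satisfies the base-class contract.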



Development

Successfully merging this pull request may close these issues.

[FEATURE]: Complete BPETokenizer and BaseTokenizer-Add encode, decode and load to BaseTokenizer and BPETokenizer

2 participants