
feature: completed bpe and base #64

Open
Arpitsh7 wants to merge 9 commits into AOSSIE-Org:main from Arpitsh7:feature/complete_bpebase

Conversation


@Arpitsh7 Arpitsh7 commented Mar 11, 2026

Summary

Completes the BaseTokenizer abstract interface and BPETokenizer implementation introduced in #17 by adding encode(), decode(), and load() — making BPETokenizer fully usable in the pipeline.

Background

PR #17 introduced a modular tokenizer architecture. However, BaseTokenizer defines only train(), get_vocab_path(), and get_merges_path(). Since encode(), decode(), and load() are absent from the abstract interface, subclasses are under no obligation to implement them, and BPETokenizer cannot tokenize text even after training.

Changes

openverifiablellm/tokenizer/base.py

  • Added encode(), decode(), load() as abstract methods with full docstrings
  • Added constructor validation for vocab_size and min_frequency
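Taken together, the expanded interface might look like the following minimal sketch. Only the method names and the fact that the constructor validates vocab_size and min_frequency come from this PR; the default values and error messages here are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseTokenizer(ABC):
    """Hypothetical sketch of the expanded abstract interface."""

    def __init__(self, vocab_size: int = 32000, min_frequency: int = 2):
        # Constructor validation described in the PR summary; exact
        # defaults and messages are illustrative assumptions.
        if vocab_size <= 0:
            raise ValueError(f"vocab_size must be positive, got {vocab_size}")
        if min_frequency <= 0:
            raise ValueError(f"min_frequency must be positive, got {min_frequency}")
        self.vocab_size = vocab_size
        self.min_frequency = min_frequency

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Encode text into a list of integer token ids."""

    @abstractmethod
    def decode(self, ids: list[int]) -> str:
        """Decode a list of integer token ids back into text."""

    @abstractmethod
    def load(self, tokenizer_dir: Path) -> None:
        """Load a previously trained tokenizer from disk."""
```

With all three methods abstract, any subclass that omits one fails at instantiation time with a TypeError rather than silently lacking the capability.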

openverifiablellm/tokenizer/bpe_tokenizer.py

  • Implemented encode(text) -> list[int]
  • Implemented decode(ids) -> str
  • Implemented load(tokenizer_dir) — reloads from vocab.json + merges.txt
  • Added text_file.is_file() validation in train()
  • Added _check_loaded() guard for encode/decode
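The validation and guard behavior in the bullets above can be sketched as follows. This is a simplified stand-in: the real implementation wraps tokenizers.ByteLevelBPETokenizer, and the exact error messages here are assumptions; only the _check_loaded() guard and the text_file.is_file() check are described in the PR.

```python
from pathlib import Path


class BPETokenizer:
    """Simplified sketch of the new validation and guard logic only."""

    def __init__(self):
        self._tokenizer = None  # set by train() or load()

    def _check_loaded(self) -> None:
        # Shared guard for encode()/decode(): fail fast with a clear error
        # instead of an AttributeError on a None tokenizer.
        if self._tokenizer is None:
            raise RuntimeError("Tokenizer not loaded. Call train() or load() first.")

    def train(self, text_file: Path, save_path: Path) -> None:
        # is_file() rejects both missing paths and directories up front.
        if not text_file.is_file():
            raise FileNotFoundError(f"Training file not found: {text_file}")
        # ... train a ByteLevelBPETokenizer and write vocab.json / merges.txt

    def encode(self, text: str) -> list[int]:
        self._check_loaded()
        return self._tokenizer.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        self._check_loaded()
        return self._tokenizer.decode(ids)
```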

tests/test_bpe.py ← new file

  • 18 tests covering training, encode/decode roundtrip, load, artifact paths, special tokens, determinism, and constructor validation

Testing

pytest tests/test_bpe.py -v

All 18 tests pass.


Scope

Self-contained change. Touches only base.py, bpe_tokenizer.py, and tests/test_bpe.py. No overlap with any open PR.


Related

Summary by CodeRabbit

  • New Features

    • Full BPE tokenizer implementation: training, saving/loading, and encode/decode; public tokenizer API expanded with load/encode/decode and artifact path helpers.
  • Bug Fixes / Validation

    • Improved constructor/input validation and clearer runtime error messages for tokenizer operations.
  • Tests

    • Added comprehensive end-to-end tests for training, artifacts, save/load, encode/decode, determinism, and validation.
  • Chores

    • Broadened Python virtual environment ignore patterns in .gitignore.
  • Style

    • Minor formatting tweak to verification reporting output (no behavior change).


coderabbitai bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds abstract encode, decode, and load to BaseTokenizer; implements a full BPETokenizer (constructor, train, load, encode/decode, artifact helpers, validation); adds comprehensive end-to-end tests for the BPE tokenizer; minor .gitignore addition; small stylistic change in openverifiablellm/verify.py.

Changes

  • Configuration (/.gitignore): Added venv/ entry while retaining *.venv.
  • Base Tokenizer Interface (openverifiablellm/tokenizer/base.py): Added abstract methods encode(text) -> list[int], decode(ids) -> str, and load(tokenizer_dir); updated the train() docstring; adjusted get_vocab_path / get_merges_path signatures and docstrings.
  • BPETokenizer Implementation (openverifiablellm/tokenizer/bpe_tokenizer.py): New full BPE implementation: constructor with validation, train() (input checks, artifact creation), load() (restore from vocab.json / merges.txt), encode()/decode() guarded by _check_loaded(), artifact path helpers, and improved error messages.
  • Tests (tests/test_bpebase.py): New comprehensive tests covering training, artifact generation, load/save, encode/decode round-trips, determinism, special tokens, path helpers, error cases, and constructor validation.
  • Verify formatting tweak (openverifiablellm/verify.py): Stylistic rewrite of CheckResult/report addition and a minor comma change (no behavioral change).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant BPETokenizer
    participant FileSystem

    rect rgba(100,150,200,0.5)
    Note over Client,FileSystem: Training flow
    Client->>BPETokenizer: train(text_file, save_path)
    BPETokenizer->>BPETokenizer: validate inputs
    BPETokenizer->>FileSystem: create save_path
    BPETokenizer->>FileSystem: write `vocab.json` & `merges.txt`
    end

    rect rgba(150,200,100,0.5)
    Note over Client,FileSystem: Load and inference flow
    Client->>BPETokenizer: load(tokenizer_dir)
    BPETokenizer->>FileSystem: read `vocab.json` & `merges.txt`
    BPETokenizer->>BPETokenizer: initialize internal tokenizer
    Client->>BPETokenizer: encode(text)
    BPETokenizer->>BPETokenizer: _check_loaded()
    BPETokenizer-->>Client: return ids
    Client->>BPETokenizer: decode(ids)
    BPETokenizer->>BPETokenizer: _check_loaded()
    BPETokenizer-->>Client: return text
    end
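The two flows in the diagram can be exercised end to end with a toy stand-in: a character-level "tokenizer" that writes and reloads a vocab.json. This is purely illustrative scaffolding, not the real BPE logic; class and file names mirror the diagram, and everything else is an assumption.

```python
import json
import tempfile
from pathlib import Path


class ToyTokenizer:
    """Character-level stand-in mirroring the train/save/load/encode/decode flow."""

    def __init__(self):
        self.vocab = None

    def train(self, text_file: Path, save_path: Path) -> None:
        if not text_file.is_file():
            raise FileNotFoundError(text_file)
        text = text_file.read_text(encoding="utf-8")
        # Deterministic vocab: sorted unique characters mapped to ids.
        self.vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
        save_path.mkdir(parents=True, exist_ok=True)  # create once, before saving
        (save_path / "vocab.json").write_text(json.dumps(self.vocab), encoding="utf-8")

    def load(self, tokenizer_dir: Path) -> None:
        vocab_file = tokenizer_dir / "vocab.json"
        if not vocab_file.is_file():
            raise FileNotFoundError(f"vocab.json not found in {tokenizer_dir}")
        self.vocab = json.loads(vocab_file.read_text(encoding="utf-8"))

    def encode(self, text: str) -> list[int]:
        return [self.vocab[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        inverse = {i: ch for ch, i in self.vocab.items()}
        return "".join(inverse[i] for i in ids)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "corpus.txt").write_text("hello tokenizer", encoding="utf-8")
    tok = ToyTokenizer()
    tok.train(tmp / "corpus.txt", tmp / "artifacts")
    fresh = ToyTokenizer()
    fresh.load(tmp / "artifacts")
    assert fresh.decode(fresh.encode("hello")) == "hello"  # round-trip
```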

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐇 I nudged the vocab, pushed merges in line,
Trained in a burrow where tokens align.
Encode and decode — I taught them to sing,
Load from disk, round-trip — what joy they bring!
A happy rabbit hops: "Tokens, take wing!"

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Out of Scope Changes check (⚠️ Warning): The PR includes out-of-scope changes to .gitignore (venv/ and *.venv rules) and openverifiablellm/verify.py (stylistic formatting), unrelated to the core tokenizer objectives. Resolution: remove the changes to .gitignore and verify.py; limit modifications to base.py, bpe_tokenizer.py, and tests/test_bpebase.py as defined in issue #52.
  • Title check (❓ Inconclusive): The title 'feature:completed bpe and base' is vague and uses generic phrasing without clearly conveying what 'completed' means or what specific functionality was added. Resolution: use a more descriptive title like 'Add encode, decode, and load methods to BaseTokenizer and BPETokenizer' to clearly specify the key changes.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Linked Issues check (✅ Passed): All coding objectives from issue #52 are fully addressed: abstract methods (encode, decode, load) added to BaseTokenizer, implemented in BPETokenizer, train() hardened with validation, and a comprehensive test suite added.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@github-actions github-actions bot added size/L and removed size/L labels Mar 11, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/bpe_tokenizer.py (1)

48-64: ⚠️ Potential issue | 🟡 Minor

Remove duplicate save_path.mkdir() call.

save_path.mkdir(parents=True, exist_ok=True) is called twice (lines 48 and 60). The second call on line 60 is redundant since the directory was already created on line 48.

🔧 Proposed fix
         save_path.mkdir(parents=True, exist_ok=True)

         tokenizer = ByteLevelBPETokenizer()

         tokenizer.train(
             files=[str(text_file)],
             vocab_size=self.vocab_size,
             min_frequency=self.min_frequency,
             special_tokens=SPECIAL_TOKENS,
         )

-        # Must create directory BEFORE save_model() is called
-        save_path.mkdir(parents=True, exist_ok=True)
-
         tokenizer.save_model(str(save_path))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 48 - 64, The code
calls save_path.mkdir(parents=True, exist_ok=True) twice before saving the
tokenizer; remove the redundant second call to avoid duplication. In the block
where ByteLevelBPETokenizer is created and trained (tokenizer.train(...)), keep
the initial save_path.mkdir(...) before training and then call
tokenizer.save_model(str(save_path)) and set self._tokenizer = tokenizer without
calling save_path.mkdir a second time; update the snippet around
tokenizer.save_model and self._tokenizer to remove the duplicate
save_path.mkdir() invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/base.py`:
- Around line 45-48: The abstract method get_merges_path is missing a return
type annotation; update its signature to indicate it may return None (e.g., ->
Optional[Path] or -> Path | None) to match the docstring and be consistent with
get_vocab_path; also add the required import (from typing import Optional) if
using Optional, and update the declaration of get_merges_path(tokenizer_dir:
Path) accordingly in the tokenizer/base.py file.
- Around line 25-33: The abstract methods encode and decode in the tokenizer
base currently use the generic type hint list; update their signatures to use
explicit list[int] for token id lists (i.e., encode(self, text: str) ->
list[int] and decode(self, ids: list[int]) -> str) so static type-checkers and
readers know these are integer token ids; adjust any related docstrings if
needed and ensure subclasses implement the new annotations for consistency
(refer to the encode and decode method definitions in the Tokenizer base class).

In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 70-102: The encode and decode methods use broad type hints; update
their annotations to be specific lists of integers to match the base contract:
change encode(self, text: str) -> list to encode(self, text: str) -> list[int]
and change decode(self, ids: list) -> str to decode(self, ids: list[int]) ->
str; ensure the implementations still call self._check_loaded() and return
self._tokenizer.encode(text).ids and self._tokenizer.decode(ids) respectively
(references: encode, decode, _check_loaded, _tokenizer).

In `@tests/test_bpebase.py`:
- Around line 73-78: The test function test_bpe_train_raises_if_directory_passed
declares an unused fixture parameter sample_text_file; remove that parameter
from the function signature so it only accepts tmp_path (i.e., change def
test_bpe_train_raises_if_directory_passed(tmp_path):) and keep the body as-is to
intentionally pass a directory to tokenizer.train; ensure no other references to
sample_text_file remain in that test.
- Around line 158-174: In the tests test_bpe_load_raises_if_vocab_missing and
test_bpe_load_raises_if_merges_missing update the pytest.raises match strings to
use raw regexes with escaped dots (e.g., r"vocab\.json not found" and
r"merges\.txt not found") so the literal periods are matched; locate the two
with pytest.raises(FileNotFoundError, match=...) around the
BPETokenizer.load(...) calls and replace the match arguments accordingly.
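The escaping matters because pytest.raises treats match as a regex, and an unescaped dot matches any character, so the original patterns could pass against the wrong message. A quick stdlib illustration:

```python
import re

# Unescaped: "." matches any character, so this also matches "vocab_json".
assert re.search("vocab.json not found", "vocab_json not found")

# Raw string with escaped dot: only the literal filename matches.
assert re.search(r"vocab\.json not found", "vocab.json not found")
assert not re.search(r"vocab\.json not found", "vocab_json not found")
```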

---

Outside diff comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 48-64: The code calls save_path.mkdir(parents=True, exist_ok=True)
twice before saving the tokenizer; remove the redundant second call to avoid
duplication. In the block where ByteLevelBPETokenizer is created and trained
(tokenizer.train(...)), keep the initial save_path.mkdir(...) before training
and then call tokenizer.save_model(str(save_path)) and set self._tokenizer =
tokenizer without calling save_path.mkdir a second time; update the snippet
around tokenizer.save_model and self._tokenizer to remove the duplicate
save_path.mkdir() invocation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4fa3e8cf-ae6f-46d9-ac07-e5837a8c63b9

📥 Commits

Reviewing files that changed from the base of the PR and between a2dfd7a and 2f71909.

📒 Files selected for processing (4)
  • .gitignore
  • openverifiablellm/tokenizer/base.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py
  • tests/test_bpebase.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 98-99: The BPE tokenizer's decode method currently calls
self._tokenizer.decode(ids) which defaults to skip_special_tokens=True and drops
special token ids; change the implementation in BPETokenizer.decode (the method
that calls self._check_loaded() and then returns the decoded string) to call the
underlying tokenizers.Tokenizer.decode with skip_special_tokens=False so
decoding is lossless by default and preserve special tokens for round-trip
encode/decode.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4301a79a-31eb-4a64-a0e3-6ef01da909cb

📥 Commits

Reviewing files that changed from the base of the PR and between 2f71909 and 82b0209.

📒 Files selected for processing (1)
  • openverifiablellm/tokenizer/bpe_tokenizer.py

@Archit381
Member

@Arpitsh7 please make the requested CodeRabbit changes


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/base.py`:
- Around line 25-37: The SentencePieceTokenizer class is missing concrete
implementations for BaseTokenizer's abstract methods encode, decode, and load;
implement encode(self, text: str) -> list[int] to return token ids (use the
SentencePieceProcessor EncodeAsIds or equivalent on the internal model),
decode(self, ids: list[int]) -> str to return text (use DecodeIds/DecodePieces
on the processor), and load(self, tokenizer_dir: Path) to initialize and load
the SentencePiece model file into the tokenizer instance (e.g., create/assign
self.sp_model and call its Load with the model filepath); update
SentencePieceTokenizer to define these three methods so instantiation for
tokenizer_type == "sentencepiece" succeeds.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 117-118: In load(), remove the local redefinition of vocab_path
and merges_path and instead call the centralized helpers
get_vocab_path(tokenizer_dir) and get_merges_path(tokenizer_dir) (or the
parameter signatures used in this module) so path logic is single-sourced;
replace uses of the local vocab_path and merges_path variables with the returned
paths from those helper functions and ensure any downstream code still receives
the same Path objects.

In `@tests/test_bpebase.py`:
- Around line 233-236: The test currently compares only vocab.json between
save_path_1 and save_path_2 (vocab_1 == vocab_2); extend the determinism
assertion to also read and compare the merges.txt files from both outputs (e.g.,
read (save_path_1 / "merges.txt") and (save_path_2 / "merges.txt") into merges_1
and merges_2) and assert merges_1 == merges_2 so that both the vocabulary and
merge rules are validated for deterministic training.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 131488aa-2b4d-4c1a-8dd9-b2d1e4234f9a

📥 Commits

Reviewing files that changed from the base of the PR and between 82b0209 and dcdb8c4.

📒 Files selected for processing (4)
  • openverifiablellm/tokenizer/base.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py
  • openverifiablellm/verify.py
  • tests/test_bpebase.py

Comment on lines 25 to +37 (the new abstract methods added in base.py; the hunk replaces context around get_vocab_path):

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Encode text into a list of integer token ids."""
        pass

    @abstractmethod
    def decode(self, ids: list[int]) -> str:
        """Decode a list of integer token ids back into text."""
        pass

    @abstractmethod
    def load(self, tokenizer_dir: Path):
        """Load a previously trained tokenizer from disk."""

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "BaseTokenizer abstract method declarations:"
rg -nP '^\s*@abstractmethod\s*$|^\s*def (encode|decode|load)\(' openverifiablellm/tokenizer/base.py -C1

echo
echo "SentencePieceTokenizer implementations for the same methods:"
rg -nP '^\s*def (encode|decode|load)\(' openverifiablellm/tokenizer/sentencepiece_tokenizer.py -C1 || true

echo
echo "Factory path that instantiates SentencePieceTokenizer:"
rg -nP 'SentencePieceTokenizer|tokenizer_type == "sentencepiece"' openverifiablellm/tokenizer/factory.py -C2

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1296


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== SentencePieceTokenizer class definition and all methods ==="
cat -n openverifiablellm/tokenizer/sentencepiece_tokenizer.py

echo
echo "=== Checking parent class(es) ==="
rg -P 'class SentencePieceTokenizer' openverifiablellm/tokenizer/sentencepiece_tokenizer.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1084


SentencePieceTokenizer is missing required abstract method implementations.

SentencePieceTokenizer does not implement three abstract methods from BaseTokenizer: encode, decode, and load. This will cause instantiation to fail at runtime when the factory attempts to create the tokenizer for tokenizer_type == "sentencepiece".

Implement encode, decode, and load in SentencePieceTokenizer, or remove the abstract requirement until the class is fully implemented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/base.py` around lines 25 - 37, The
SentencePieceTokenizer class is missing concrete implementations for
BaseTokenizer's abstract methods encode, decode, and load; implement
encode(self, text: str) -> list[int] to return token ids (use the
SentencePieceProcessor EncodeAsIds or equivalent on the internal model),
decode(self, ids: list[int]) -> str to return text (use DecodeIds/DecodePieces
on the processor), and load(self, tokenizer_dir: Path) to initialize and load
the SentencePiece model file into the tokenizer instance (e.g., create/assign
self.sp_model and call its Load with the model filepath); update
SentencePieceTokenizer to define these three methods so instantiation for
tokenizer_type == "sentencepiece" succeeds.

Comment on lines +117 to +118
vocab_path = tokenizer_dir / "vocab.json"
merges_path = tokenizer_dir / "merges.txt"

🧹 Nitpick | 🔵 Trivial

Use path helper methods inside load() to avoid duplication.

Line 117–118 redefines artifact paths that are already centralized in get_vocab_path() / get_merges_path(). Reusing those helpers keeps path logic single-sourced.

♻️ Proposed refactor
-        vocab_path = tokenizer_dir / "vocab.json"
-        merges_path = tokenizer_dir / "merges.txt"
+        vocab_path = self.get_vocab_path(tokenizer_dir)
+        merges_path = self.get_merges_path(tokenizer_dir)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 117 - 118, In
load(), remove the local redefinition of vocab_path and merges_path and instead
call the centralized helpers get_vocab_path(tokenizer_dir) and
get_merges_path(tokenizer_dir) (or the parameter signatures used in this module)
so path logic is single-sourced; replace uses of the local vocab_path and
merges_path variables with the returned paths from those helper functions and
ensure any downstream code still receives the same Path objects.

Comment on lines +233 to +236
vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")

assert vocab_1 == vocab_2

🧹 Nitpick | 🔵 Trivial

Expand determinism assertion to include merges.txt.

Line 233–236 verifies only vocab.json. BPE tokenization behavior also depends on merge rules, so deterministic training should assert both artifacts are identical.

♻️ Proposed test hardening
     vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
     vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")
+    merges_1 = (save_path_1 / "merges.txt").read_text(encoding="utf-8")
+    merges_2 = (save_path_2 / "merges.txt").read_text(encoding="utf-8")

     assert vocab_1 == vocab_2
+    assert merges_1 == merges_2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_bpebase.py` around lines 233 - 236, The test currently compares
only vocab.json between save_path_1 and save_path_2 (vocab_1 == vocab_2); extend
the determinism assertion to also read and compare the merges.txt files from
both outputs (e.g., read (save_path_1 / "merges.txt") and (save_path_2 /
"merges.txt") into merges_1 and merges_2) and assert merges_1 == merges_2 so
that both the vocabulary and merge rules are validated for deterministic
training.


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/bpe_tokenizer.py (1)

51-56: ⚠️ Potential issue | 🔴 Critical

Fix special-token loss during save/load cycle.

Verification confirms that load() loses special-token metadata. After training with SPECIAL_TOKENS (line 55), <s> encodes as [0]. But after reloading from vocab.json and merges.txt (lines 130–133), <s> expands to [32, 87, 34] (three separate tokens). The ByteLevelBPETokenizer constructor restores vocabulary but not added-token configuration.

To fix, explicitly re-register special tokens in load():

reloaded = ByteLevelBPETokenizer(
    vocab=str(vocab_file),
    merges=str(merges_file),
)
for token in SPECIAL_TOKENS:
    reloaded.add_special_tokens([token])

Add a regression test encoding <s> after load() to confirm the fix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 51 - 56, The
loader loses special-token metadata: after training with SPECIAL_TOKENS the
ByteLevelBPETokenizer encodes "<s>" correctly, but load() that constructs
ByteLevelBPETokenizer(vocab=..., merges=...) does not re-register added tokens;
update the load() implementation to iterate SPECIAL_TOKENS and call
reloaded.add_special_tokens([token]) (reference ByteLevelBPETokenizer and
load()), and add a regression test that trains, saves, reloads via load(), then
asserts encoding of "<s>" matches the trained/special-token encoding to prevent
regressions.
♻️ Duplicate comments (2)
tests/test_bpebase.py (1)

233-236: 🧹 Nitpick | 🔵 Trivial

The determinism test should assert merges.txt too.

For BPE, identical vocab.json files are only half the contract; different merges.txt contents can still change tokenization. Compare both artifacts here so this test actually proves deterministic training.

♻️ Suggested hardening
     vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
     vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")
+    merges_1 = (save_path_1 / "merges.txt").read_text(encoding="utf-8")
+    merges_2 = (save_path_2 / "merges.txt").read_text(encoding="utf-8")

     assert vocab_1 == vocab_2
+    assert merges_1 == merges_2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_bpebase.py` around lines 233 - 236, The determinism test currently
only compares vocab.json (vocab_1/vocab_2); extend it to also read and compare
the BPE merges file by loading (save_path_1 / "merges.txt") and (save_path_2 /
"merges.txt") with encoding="utf-8" and adding an assertion that merges_1 ==
merges_2 so both artifacts (vocab.json and merges.txt) are verified for
deterministic training in tests/test_bpebase.py.
openverifiablellm/tokenizer/base.py (1)

25-37: ⚠️ Potential issue | 🔴 Critical

This interface change breaks existing tokenizer subclasses.

openverifiablellm/tokenizer/sentencepiece_tokenizer.py still does not implement encode(), decode(), or load(). After making these abstract here, SentencePieceTokenizer becomes non-instantiable and any sentencepiece path will now fail with TypeError. Please update that subclass in the same PR or defer making these methods abstract until all subclasses comply.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/base.py` around lines 25 - 37, The abstract
methods encode, decode, and load were added to TokenizerBase which breaks
instantiation of SentencePieceTokenizer because it does not implement these
methods; update the SentencePieceTokenizer class to implement
TokenizerBase.encode(self, text: str) -> list[int], TokenizerBase.decode(self,
ids: list[int]) -> str, and TokenizerBase.load(self, tokenizer_dir: Path) (or
remove the abstract decorator from TokenizerBase.encode/decode/load until all
subclasses are updated). Implementations should delegate to the underlying
sentencepiece processor (e.g., use the SentencePiece model's encode/decode and
loading logic already present in sentencepiece_tokenizer.py) and match the
declared signatures so SentencePieceTokenizer becomes instantiable again.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 51-56: The loader loses special-token metadata: after training
with SPECIAL_TOKENS the ByteLevelBPETokenizer encodes "<s>" correctly, but
load() that constructs ByteLevelBPETokenizer(vocab=..., merges=...) does not
re-register added tokens; update the load() implementation to iterate
SPECIAL_TOKENS and call reloaded.add_special_tokens([token]) (reference
ByteLevelBPETokenizer and load()), and add a regression test that trains, saves,
reloads via load(), then asserts encoding of "<s>" matches the
trained/special-token encoding to prevent regressions.

---

Duplicate comments:
In `@openverifiablellm/tokenizer/base.py`:
- Around line 25-37: The abstract methods encode, decode, and load were added to
TokenizerBase which breaks instantiation of SentencePieceTokenizer because it
does not implement these methods; update the SentencePieceTokenizer class to
implement TokenizerBase.encode(self, text: str) -> list[int],
TokenizerBase.decode(self, ids: list[int]) -> str, and TokenizerBase.load(self,
tokenizer_dir: Path) (or remove the abstract decorator from
TokenizerBase.encode/decode/load until all subclasses are updated).
Implementations should delegate to the underlying sentencepiece processor (e.g.,
use the SentencePiece model's encode/decode and loading logic already present in
sentencepiece_tokenizer.py) and match the declared signatures so
SentencePieceTokenizer becomes instantiable again.

In `@tests/test_bpebase.py`:
- Around line 233-236: The determinism test currently only compares vocab.json
(vocab_1/vocab_2); extend it to also read and compare the BPE merges file by
loading (save_path_1 / "merges.txt") and (save_path_2 / "merges.txt") with
encoding="utf-8" and adding an assertion that merges_1 == merges_2 so both
artifacts (vocab.json and merges.txt) are verified for deterministic training in
tests/test_bpebase.py.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b8dfc0ff-3b7b-4db6-870d-712b9aaef5a9

📥 Commits

Reviewing files that changed from the base of the PR and between dcdb8c4 and 1d049ff.

📒 Files selected for processing (3)
  • openverifiablellm/tokenizer/base.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py
  • tests/test_bpebase.py


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/bpe_tokenizer.py (1)

1-165: ⚠️ Potential issue | 🟡 Minor

Fix Ruff formatting issues to pass CI.

The pipeline indicates that Ruff format check failed. Run ruff format openverifiablellm/tokenizer/bpe_tokenizer.py to fix the code style issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 1 - 165, Ruff
flagged formatting issues in openverifiablellm/tokenizer/bpe_tokenizer.py; run
`ruff format openverifiablellm/tokenizer/bpe_tokenizer.py` (or apply the
equivalent auto-format) and commit the changes so BPETokenizer, SPECIAL_TOKENS,
and its methods (train, load, encode, decode, _check_loaded) conform to the
project's Ruff/PEP8 style; if you cannot run Ruff locally, adjust whitespace,
imports, line lengths and trailing commas in that file to match Ruff's rules,
re-run the linter, and push the formatted file to fix CI.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 61bc19ff-755d-43f7-b747-58bbd6229971

📥 Commits

Reviewing files that changed from the base of the PR and between 1d049ff and 088ec84.

📒 Files selected for processing (1)
  • openverifiablellm/tokenizer/bpe_tokenizer.py

@Arpitsh7
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Mar 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 146-148: The method signature for get_merges_path in
openverifiablellm/tokenizer/bpe_tokenizer.py currently returns Path but must
match the base class contract BaseTokenizer.get_merges_path() which declares
Path | None; change the signature of get_merges_path(self, tokenizer_dir: Path)
-> Path | None (or Optional[Path]) and keep returning the Path(tokenizer_dir) /
"merges.txt" so the implementation remains correct while the declared return
type matches the base class; add or adjust the typing import (from typing import
Optional) if needed.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: edfa00b2-4b65-422e-ba8a-ae28d700ea7b

📥 Commits

Reviewing files that changed from the base of the PR and between 1d049ff and 49c7ba6.

📒 Files selected for processing (1)
  • openverifiablellm/tokenizer/bpe_tokenizer.py

Comment on lines 146 to +148

     def get_merges_path(self, tokenizer_dir: Path) -> Path:
-        return tokenizer_dir / "merges.txt"
+        """Return path to merges.txt file."""
+        return Path(tokenizer_dir) / "merges.txt"
Contributor

⚠️ Potential issue | 🟡 Minor

Return type mismatch with base class.

The base class BaseTokenizer.get_merges_path() declares return type Path | None, while this implementation declares Path. Narrowing a return type in an override is technically LSP-compatible, but widening the annotation to match the base class keeps the declared contract uniform across subclasses and type-safe for callers working through the base interface.

♻️ Proposed fix
-    def get_merges_path(self, tokenizer_dir: Path) -> Path:
+    def get_merges_path(self, tokenizer_dir: Path) -> Path | None:
         """Return path to merges.txt file."""
         return Path(tokenizer_dir) / "merges.txt"
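The proposed fix can be shown end to end in a minimal, self-contained sketch (hypothetical classes mirroring the names in the comment, not the project's actual files): the base class declares `Optional[Path]`, the BPE override repeats that annotation, and the implementation still always returns a concrete `Path`.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class BaseTokenizer(ABC):
    @abstractmethod
    def get_merges_path(self, tokenizer_dir: Path) -> Optional[Path]:
        """Path to the merges file, or None for merge-free tokenizers."""


class BPETokenizer(BaseTokenizer):
    def get_merges_path(self, tokenizer_dir: Path) -> Optional[Path]:
        """Return path to merges.txt file."""
        return Path(tokenizer_dir) / "merges.txt"


p = BPETokenizer().get_merges_path(Path("out/tokenizer"))
print(p.name)  # merges.txt
```

`Optional[Path]` is used rather than `Path | None` so the annotation also evaluates at runtime on Python versions before 3.10; either spelling satisfies the base-class contract.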



Development

Successfully merging this pull request may close these issues.

[FEATURE]: Complete BPETokenizer and BaseTokenizer-Add encode, decode and load to BaseTokenizer and BPETokenizer

2 participants