Conversation

Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable, and CodeRabbit's review commands can be used to manage reviews.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant BPETokenizer
    participant FileSystem
    rect rgba(100,150,200,0.5)
        Note over Client,FileSystem: Training flow
        Client->>BPETokenizer: train(text_file, save_path)
        BPETokenizer->>BPETokenizer: validate inputs
        BPETokenizer->>FileSystem: create save_path
        BPETokenizer->>FileSystem: write vocab.json & merges.txt
    end
    rect rgba(150,200,100,0.5)
        Note over Client,FileSystem: Load and inference flow
        Client->>BPETokenizer: load(tokenizer_dir)
        BPETokenizer->>FileSystem: read vocab.json & merges.txt
        BPETokenizer->>BPETokenizer: initialize internal tokenizer
        Client->>BPETokenizer: encode(text)
        BPETokenizer->>BPETokenizer: _check_loaded()
        BPETokenizer-->>Client: return ids
        Client->>BPETokenizer: decode(ids)
        BPETokenizer->>BPETokenizer: _check_loaded()
        BPETokenizer-->>Client: return text
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/bpe_tokenizer.py (1)
48-64: ⚠️ Potential issue | 🟡 Minor

Remove duplicate `save_path.mkdir()` call.

`save_path.mkdir(parents=True, exist_ok=True)` is called twice (lines 48 and 60). The second call on line 60 is redundant since the directory was already created on line 48.

🔧 Proposed fix

```diff
 save_path.mkdir(parents=True, exist_ok=True)

 tokenizer = ByteLevelBPETokenizer()
 tokenizer.train(
     files=[str(text_file)],
     vocab_size=self.vocab_size,
     min_frequency=self.min_frequency,
     special_tokens=SPECIAL_TOKENS,
 )

-# Must create directory BEFORE save_model() is called
-save_path.mkdir(parents=True, exist_ok=True)
-
 tokenizer.save_model(str(save_path))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 48 - 64, The code calls save_path.mkdir(parents=True, exist_ok=True) twice before saving the tokenizer; remove the redundant second call to avoid duplication. In the block where ByteLevelBPETokenizer is created and trained (tokenizer.train(...)), keep the initial save_path.mkdir(...) before training and then call tokenizer.save_model(str(save_path)) and set self._tokenizer = tokenizer without calling save_path.mkdir a second time; update the snippet around tokenizer.save_model and self._tokenizer to remove the duplicate save_path.mkdir() invocation.
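For context on why this is only a minor issue: with `exist_ok=True`, `mkdir` is idempotent, so the second call is a redundant no-op rather than a bug. A small stdlib sketch (the nested path is hypothetical):

```python
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
save_path = tmp / "tokenizer" / "out"  # hypothetical nested save directory

# parents=True creates missing ancestors; exist_ok=True suppresses
# FileExistsError, so repeating the call is safe but pointless
save_path.mkdir(parents=True, exist_ok=True)
save_path.mkdir(parents=True, exist_ok=True)  # no-op: directory already exists

print(save_path.is_dir())  # → True
```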
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer/base.py`:
- Around line 45-48: The abstract method get_merges_path is missing a return
type annotation; update its signature to indicate it may return None (e.g., ->
Optional[Path] or -> Path | None) to match the docstring and be consistent with
get_vocab_path; also add the required import (from typing import Optional) if
using Optional, and update the declaration of get_merges_path(tokenizer_dir:
Path) accordingly in the tokenizer/base.py file.
- Around line 25-33: The abstract methods encode and decode in the tokenizer
base currently use the generic type hint list; update their signatures to use
explicit list[int] for token id lists (i.e., encode(self, text: str) ->
list[int] and decode(self, ids: list[int]) -> str) so static type-checkers and
readers know these are integer token ids; adjust any related docstrings if
needed and ensure subclasses implement the new annotations for consistency
(refer to the encode and decode method definitions in the Tokenizer base class).
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 70-102: The encode and decode methods use broad type hints; update
their annotations to be specific lists of integers to match the base contract:
change encode(self, text: str) -> list to encode(self, text: str) -> list[int]
and change decode(self, ids: list) -> str to decode(self, ids: list[int]) ->
str; ensure the implementations still call self._check_loaded() and return
self._tokenizer.encode(text).ids and self._tokenizer.decode(ids) respectively
(references: encode, decode, _check_loaded, _tokenizer).
In `@tests/test_bpebase.py`:
- Around line 73-78: The test function test_bpe_train_raises_if_directory_passed
declares an unused fixture parameter sample_text_file; remove that parameter
from the function signature so it only accepts tmp_path (i.e., change def
test_bpe_train_raises_if_directory_passed(tmp_path):) and keep the body as-is to
intentionally pass a directory to tokenizer.train; ensure no other references to
sample_text_file remain in that test.
- Around line 158-174: In the tests test_bpe_load_raises_if_vocab_missing and
test_bpe_load_raises_if_merges_missing update the pytest.raises match strings to
use raw regexes with escaped dots (e.g., r"vocab\.json not found" and
r"merges\.txt not found") so the literal periods are matched; locate the two
with pytest.raises(FileNotFoundError, match=...) around the
BPETokenizer.load(...) calls and replace the match arguments accordingly.
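The escape matters because `pytest.raises(match=...)` treats the string as a regular expression (it is checked with `re.search`), and an unescaped `.` matches any character. A quick stdlib demonstration:

```python
import re

unescaped = "vocab.json not found"
escaped = r"vocab\.json not found"

# the unescaped dot matches ANY character, so a wrong filename still passes
assert re.search(unescaped, "vocabXjson not found")
# the escaped pattern only matches a literal period
assert re.search(escaped, "vocab.json not found")
assert not re.search(escaped, "vocabXjson not found")
print("escaped pattern rejects the false positive")
```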
---
Outside diff comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 48-64: The code calls save_path.mkdir(parents=True, exist_ok=True)
twice before saving the tokenizer; remove the redundant second call to avoid
duplication. In the block where ByteLevelBPETokenizer is created and trained
(tokenizer.train(...)), keep the initial save_path.mkdir(...) before training
and then call tokenizer.save_model(str(save_path)) and set self._tokenizer =
tokenizer without calling save_path.mkdir a second time; update the snippet
around tokenizer.save_model and self._tokenizer to remove the duplicate
save_path.mkdir() invocation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4fa3e8cf-ae6f-46d9-ac07-e5837a8c63b9
📒 Files selected for processing (4)
- .gitignore
- openverifiablellm/tokenizer/base.py
- openverifiablellm/tokenizer/bpe_tokenizer.py
- tests/test_bpebase.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 98-99: The BPE tokenizer's decode method currently calls
self._tokenizer.decode(ids) which defaults to skip_special_tokens=True and drops
special token ids; change the implementation in BPETokenizer.decode (the method
that calls self._check_loaded() and then returns the decoded string) to call the
underlying tokenizers.Tokenizer.decode with skip_special_tokens=False so
decoding is lossless by default and preserve special tokens for round-trip
encode/decode.
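The behavioral difference can be pictured with a toy decoder; the id table and flag below are a simplified stand-in for the real `tokenizers` behavior, not its actual API:

```python
# illustrative vocabulary only; real ids come from the trained tokenizer
ID_TO_TOKEN = {0: "<s>", 1: "hello", 2: "world", 3: "</s>"}
SPECIAL = {"<s>", "</s>"}

def decode(ids, skip_special_tokens=True):
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:  # the default silently drops special tokens
        tokens = [t for t in tokens if t not in SPECIAL]
    return " ".join(tokens)

print(decode([0, 1, 2, 3]))                             # → hello world
print(decode([0, 1, 2, 3], skip_special_tokens=False))  # → <s> hello world </s>
```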
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4301a79a-31eb-4a64-a0e3-6ef01da909cb
📒 Files selected for processing (1)
openverifiablellm/tokenizer/bpe_tokenizer.py
@Arpitsh7 do requested coderabbit changes
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer/base.py`:
- Around line 25-37: The SentencePieceTokenizer class is missing concrete
implementations for BaseTokenizer's abstract methods encode, decode, and load;
implement encode(self, text: str) -> list[int] to return token ids (use the
SentencePieceProcessor EncodeAsIds or equivalent on the internal model),
decode(self, ids: list[int]) -> str to return text (use DecodeIds/DecodePieces
on the processor), and load(self, tokenizer_dir: Path) to initialize and load
the SentencePiece model file into the tokenizer instance (e.g., create/assign
self.sp_model and call its Load with the model filepath); update
SentencePieceTokenizer to define these three methods so instantiation for
tokenizer_type == "sentencepiece" succeeds.
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 117-118: In load(), remove the local redefinition of vocab_path
and merges_path and instead call the centralized helpers
get_vocab_path(tokenizer_dir) and get_merges_path(tokenizer_dir) (or the
parameter signatures used in this module) so path logic is single-sourced;
replace uses of the local vocab_path and merges_path variables with the returned
paths from those helper functions and ensure any downstream code still receives
the same Path objects.
In `@tests/test_bpebase.py`:
- Around line 233-236: The test currently compares only vocab.json between
save_path_1 and save_path_2 (vocab_1 == vocab_2); extend the determinism
assertion to also read and compare the merges.txt files from both outputs (e.g.,
read (save_path_1 / "merges.txt") and (save_path_2 / "merges.txt") into merges_1
and merges_2) and assert merges_1 == merges_2 so that both the vocabulary and
merge rules are validated for deterministic training.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 131488aa-2b4d-4c1a-8dd9-b2d1e4234f9a
📒 Files selected for processing (4)
- openverifiablellm/tokenizer/base.py
- openverifiablellm/tokenizer/bpe_tokenizer.py
- openverifiablellm/verify.py
- tests/test_bpebase.py
```python
@abstractmethod
def get_vocab_path(self, tokenizer_dir: Path) -> Path: ...

@abstractmethod
def encode(self, text: str) -> list[int]:
    """Encode text into a list of integer token ids."""
    pass

@abstractmethod
def decode(self, ids: list[int]) -> str:
    """Decode a list of integer token ids back into text."""
    pass

@abstractmethod
def load(self, tokenizer_dir: Path):
    """Load a previously trained tokenizer from disk."""
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "BaseTokenizer abstract method declarations:"
rg -nP '^\s*@abstractmethod\s*$|^\s*def (encode|decode|load)\(' openverifiablellm/tokenizer/base.py -C1
echo
echo "SentencePieceTokenizer implementations for the same methods:"
rg -nP '^\s*def (encode|decode|load)\(' openverifiablellm/tokenizer/sentencepiece_tokenizer.py -C1 || true
echo
echo "Factory path that instantiates SentencePieceTokenizer:"
rg -nP 'SentencePieceTokenizer|tokenizer_type == "sentencepiece"' openverifiablellm/tokenizer/factory.py -C2
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1296
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "=== SentencePieceTokenizer class definition and all methods ==="
cat -n openverifiablellm/tokenizer/sentencepiece_tokenizer.py
echo
echo "=== Checking parent class(es) ==="
rg -P 'class SentencePieceTokenizer' openverifiablellm/tokenizer/sentencepiece_tokenizer.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1084
SentencePieceTokenizer is missing required abstract method implementations.
SentencePieceTokenizer does not implement three abstract methods from BaseTokenizer: encode, decode, and load. This will cause instantiation to fail at runtime when the factory attempts to create the tokenizer for tokenizer_type == "sentencepiece".
Implement encode, decode, and load in SentencePieceTokenizer, or remove the abstract requirement until the class is fully implemented.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/tokenizer/base.py` around lines 25 - 37, The
SentencePieceTokenizer class is missing concrete implementations for
BaseTokenizer's abstract methods encode, decode, and load; implement
encode(self, text: str) -> list[int] to return token ids (use the
SentencePieceProcessor EncodeAsIds or equivalent on the internal model),
decode(self, ids: list[int]) -> str to return text (use DecodeIds/DecodePieces
on the processor), and load(self, tokenizer_dir: Path) to initialize and load
the SentencePiece model file into the tokenizer instance (e.g., create/assign
self.sp_model and call its Load with the model filepath); update
SentencePieceTokenizer to define these three methods so instantiation for
tokenizer_type == "sentencepiece" succeeds.
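The runtime failure described here is standard `abc` behavior: instantiating a subclass that leaves any abstract method unimplemented raises `TypeError`. A minimal reproduction (class names mirror the report but are simplified):

```python
from abc import ABC, abstractmethod
from pathlib import Path

class BaseTokenizer(ABC):
    """Simplified stand-in for the project's base class."""
    @abstractmethod
    def encode(self, text: str) -> list[int]: ...

    @abstractmethod
    def decode(self, ids: list[int]) -> str: ...

    @abstractmethod
    def load(self, tokenizer_dir: Path): ...

class IncompleteTokenizer(BaseTokenizer):
    pass  # implements none of the abstract methods, like the reported subclass

try:
    IncompleteTokenizer()
    instantiated = True
except TypeError as exc:
    instantiated = False
    print(exc)  # message names the missing abstract methods
```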
```python
vocab_path = tokenizer_dir / "vocab.json"
merges_path = tokenizer_dir / "merges.txt"
```
🧹 Nitpick | 🔵 Trivial
Use path helper methods inside load() to avoid duplication.
Line 117–118 redefines artifact paths that are already centralized in get_vocab_path() / get_merges_path(). Reusing those helpers keeps path logic single-sourced.
♻️ Proposed refactor

```diff
-    vocab_path = tokenizer_dir / "vocab.json"
-    merges_path = tokenizer_dir / "merges.txt"
+    vocab_path = self.get_vocab_path(tokenizer_dir)
+    merges_path = self.get_merges_path(tokenizer_dir)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 117 - 118, In
load(), remove the local redefinition of vocab_path and merges_path and instead
call the centralized helpers get_vocab_path(tokenizer_dir) and
get_merges_path(tokenizer_dir) (or the parameter signatures used in this module)
so path logic is single-sourced; replace uses of the local vocab_path and
merges_path variables with the returned paths from those helper functions and
ensure any downstream code still receives the same Path objects.
```python
vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")

assert vocab_1 == vocab_2
```
🧹 Nitpick | 🔵 Trivial
Expand determinism assertion to include merges.txt.
Line 233–236 verifies only vocab.json. BPE tokenization behavior also depends on merge rules, so deterministic training should assert both artifacts are identical.
♻️ Proposed test hardening
```diff
 vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
 vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")
+merges_1 = (save_path_1 / "merges.txt").read_text(encoding="utf-8")
+merges_2 = (save_path_2 / "merges.txt").read_text(encoding="utf-8")
 assert vocab_1 == vocab_2
+assert merges_1 == merges_2
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")
merges_1 = (save_path_1 / "merges.txt").read_text(encoding="utf-8")
merges_2 = (save_path_2 / "merges.txt").read_text(encoding="utf-8")

assert vocab_1 == vocab_2
assert merges_1 == merges_2
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_bpebase.py` around lines 233 - 236, The test currently compares
only vocab.json between save_path_1 and save_path_2 (vocab_1 == vocab_2); extend
the determinism assertion to also read and compare the merges.txt files from
both outputs (e.g., read (save_path_1 / "merges.txt") and (save_path_2 /
"merges.txt") into merges_1 and merges_2) and assert merges_1 == merges_2 so
that both the vocabulary and merge rules are validated for deterministic
training.
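A stdlib sketch of the two-artifact comparison the prompt describes, using stand-in file contents rather than real training output:

```python
import tempfile
from pathlib import Path

def artifacts_identical(dir_1: Path, dir_2: Path) -> bool:
    """True only if BOTH vocab.json and merges.txt match byte-for-byte."""
    return all(
        (dir_1 / name).read_text(encoding="utf-8")
        == (dir_2 / name).read_text(encoding="utf-8")
        for name in ("vocab.json", "merges.txt")
    )

# simulate two training runs: same vocab, different merge rules
run_1, run_2 = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
for d in (run_1, run_2):
    (d / "vocab.json").write_text('{"a": 0}', encoding="utf-8")
(run_1 / "merges.txt").write_text("a b\n", encoding="utf-8")
(run_2 / "merges.txt").write_text("b a\n", encoding="utf-8")

print(artifacts_identical(run_1, run_2))  # vocab alone would pass; merges do not
```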
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/bpe_tokenizer.py (1)
51-56: ⚠️ Potential issue | 🔴 Critical

Fix special-token loss during save/load cycle.

Verification confirms that `load()` loses special-token metadata. After training with `SPECIAL_TOKENS` (line 55), `<s>` encodes as `[0]`. But after reloading from `vocab.json` and `merges.txt` (lines 130–133), `<s>` expands to `[32, 87, 34]` (three separate tokens). The `ByteLevelBPETokenizer` constructor restores vocabulary but not added-token configuration.

To fix, explicitly re-register special tokens in `load()`:

```python
reloaded = ByteLevelBPETokenizer(
    vocab=str(vocab_file),
    merges=str(merges_file),
)
for token in SPECIAL_TOKENS:
    reloaded.add_special_tokens([token])
```

Add a regression test encoding `<s>` after `load()` to confirm the fix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 51 - 56, The loader loses special-token metadata: after training with SPECIAL_TOKENS the ByteLevelBPETokenizer encodes "<s>" correctly, but load() that constructs ByteLevelBPETokenizer(vocab=..., merges=...) does not re-register added tokens; update the load() implementation to iterate SPECIAL_TOKENS and call reloaded.add_special_tokens([token]) (reference ByteLevelBPETokenizer and load()), and add a regression test that trains, saves, reloads via load(), then asserts encoding of "<s>" matches the trained/special-token encoding to prevent regressions.
♻️ Duplicate comments (2)
tests/test_bpebase.py (1)
233-236: 🧹 Nitpick | 🔵 Trivial

The determinism test should assert `merges.txt` too.

For BPE, identical `vocab.json` files are only half the contract; different `merges.txt` contents can still change tokenization. Compare both artifacts here so this test actually proves deterministic training.

♻️ Suggested hardening

```diff
 vocab_1 = (save_path_1 / "vocab.json").read_text(encoding="utf-8")
 vocab_2 = (save_path_2 / "vocab.json").read_text(encoding="utf-8")
+merges_1 = (save_path_1 / "merges.txt").read_text(encoding="utf-8")
+merges_2 = (save_path_2 / "merges.txt").read_text(encoding="utf-8")
 assert vocab_1 == vocab_2
+assert merges_1 == merges_2
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_bpebase.py` around lines 233 - 236, The determinism test currently only compares vocab.json (vocab_1/vocab_2); extend it to also read and compare the BPE merges file by loading (save_path_1 / "merges.txt") and (save_path_2 / "merges.txt") with encoding="utf-8" and adding an assertion that merges_1 == merges_2 so both artifacts (vocab.json and merges.txt) are verified for deterministic training in tests/test_bpebase.py.

openverifiablellm/tokenizer/base.py (1)
25-37: ⚠️ Potential issue | 🔴 Critical

This interface change breaks existing tokenizer subclasses.

`openverifiablellm/tokenizer/sentencepiece_tokenizer.py` still does not implement `encode()`, `decode()`, or `load()`. After making these abstract here, `SentencePieceTokenizer` becomes non-instantiable and any sentencepiece path will now fail with `TypeError`. Please update that subclass in the same PR or defer making these methods abstract until all subclasses comply.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/tokenizer/base.py` around lines 25 - 37, The abstract methods encode, decode, and load were added to TokenizerBase which breaks instantiation of SentencePieceTokenizer because it does not implement these methods; update the SentencePieceTokenizer class to implement TokenizerBase.encode(self, text: str) -> list[int], TokenizerBase.decode(self, ids: list[int]) -> str, and TokenizerBase.load(self, tokenizer_dir: Path) (or remove the abstract decorator from TokenizerBase.encode/decode/load until all subclasses are updated). Implementations should delegate to the underlying sentencepiece processor (e.g., use the SentencePiece model's encode/decode and loading logic already present in sentencepiece_tokenizer.py) and match the declared signatures so SentencePieceTokenizer becomes instantiable again.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 51-56: The loader loses special-token metadata: after training
with SPECIAL_TOKENS the ByteLevelBPETokenizer encodes "<s>" correctly, but
load() that constructs ByteLevelBPETokenizer(vocab=..., merges=...) does not
re-register added tokens; update the load() implementation to iterate
SPECIAL_TOKENS and call reloaded.add_special_tokens([token]) (reference
ByteLevelBPETokenizer and load()), and add a regression test that trains, saves,
reloads via load(), then asserts encoding of "<s>" matches the
trained/special-token encoding to prevent regressions.
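Why re-registration is needed: the serialized `vocab.json`/`merges.txt` pair doesn't record which strings were added as atomic special tokens. The toy model below simulates that loss; it is not the real `tokenizers` internals:

```python
# Simplified model: a tokenizer that only treats a string as one token
# when it is registered in `special_tokens`; otherwise it falls back to
# per-character ids (standing in for byte-level BPE splitting "<s>").
class ToyTokenizer:
    def __init__(self, special_tokens=()):
        self.special_tokens = set(special_tokens)

    def encode(self, text):
        if text in self.special_tokens:
            return [0]  # single atomic id for a registered special token
        return [ord(c) for c in text]  # split into pieces, like the reported bug

trained = ToyTokenizer(special_tokens=["<s>"])
reloaded = ToyTokenizer()  # special-token metadata lost on reload

print(trained.encode("<s>"))   # → [0]
print(reloaded.encode("<s>"))  # three ids: the token got split apart

# the fix mirrors the suggestion: re-register special tokens after reloading
reloaded.special_tokens.add("<s>")
print(reloaded.encode("<s>"))  # → [0]
```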
---
Duplicate comments:
In `@openverifiablellm/tokenizer/base.py`:
- Around line 25-37: The abstract methods encode, decode, and load were added to
TokenizerBase which breaks instantiation of SentencePieceTokenizer because it
does not implement these methods; update the SentencePieceTokenizer class to
implement TokenizerBase.encode(self, text: str) -> list[int],
TokenizerBase.decode(self, ids: list[int]) -> str, and TokenizerBase.load(self,
tokenizer_dir: Path) (or remove the abstract decorator from
TokenizerBase.encode/decode/load until all subclasses are updated).
Implementations should delegate to the underlying sentencepiece processor (e.g.,
use the SentencePiece model's encode/decode and loading logic already present in
sentencepiece_tokenizer.py) and match the declared signatures so
SentencePieceTokenizer becomes instantiable again.
In `@tests/test_bpebase.py`:
- Around line 233-236: The determinism test currently only compares vocab.json
(vocab_1/vocab_2); extend it to also read and compare the BPE merges file by
loading (save_path_1 / "merges.txt") and (save_path_2 / "merges.txt") with
encoding="utf-8" and adding an assertion that merges_1 == merges_2 so both
artifacts (vocab.json and merges.txt) are verified for deterministic training in
tests/test_bpebase.py.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b8dfc0ff-3b7b-4db6-870d-712b9aaef5a9
📒 Files selected for processing (3)
- openverifiablellm/tokenizer/base.py
- openverifiablellm/tokenizer/bpe_tokenizer.py
- tests/test_bpebase.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/bpe_tokenizer.py (1)
1-165: ⚠️ Potential issue | 🟡 Minor

Fix Ruff formatting issues to pass CI.

The pipeline indicates that the Ruff format check failed. Run `ruff format openverifiablellm/tokenizer/bpe_tokenizer.py` to fix the code style issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 1 - 165, Ruff flagged formatting issues in openverifiablellm/tokenizer/bpe_tokenizer.py; run `ruff format openverifiablellm/tokenizer/bpe_tokenizer.py` (or apply the equivalent auto-format) and commit the changes so BPETokenizer, SPECIAL_TOKENS, and its methods (train, load, encode, decode, _check_loaded) conform to the project's Ruff/PEP8 style; if you cannot run Ruff locally, adjust whitespace, imports, line lengths and trailing commas in that file to match Ruff's rules, re-run the linter, and push the formatted file to fix CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 1-165: Ruff flagged formatting issues in
openverifiablellm/tokenizer/bpe_tokenizer.py; run `ruff format
openverifiablellm/tokenizer/bpe_tokenizer.py` (or apply the equivalent
auto-format) and commit the changes so BPETokenizer, SPECIAL_TOKENS, and its
methods (train, load, encode, decode, _check_loaded) conform to the project's
Ruff/PEP8 style; if you cannot run Ruff locally, adjust whitespace, imports,
line lengths and trailing commas in that file to match Ruff's rules, re-run the
linter, and push the formatted file to fix CI.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 61bc19ff-755d-43f7-b747-58bbd6229971
📒 Files selected for processing (1)
openverifiablellm/tokenizer/bpe_tokenizer.py
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer/bpe_tokenizer.py`:
- Around line 146-148: The method signature for get_merges_path in
openverifiablellm/tokenizer/bpe_tokenizer.py currently returns Path but must
match the base class contract BaseTokenizer.get_merges_path() which declares
Path | None; change the signature of get_merges_path(self, tokenizer_dir: Path)
-> Path | None (or Optional[Path]) and keep returning the Path(tokenizer_dir) /
"merges.txt" so the implementation remains correct while the declared return
type matches the base class; add or adjust the typing import (from typing import
Optional) if needed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: edfa00b2-4b65-422e-ba8a-ae28d700ea7b
📒 Files selected for processing (1)
openverifiablellm/tokenizer/bpe_tokenizer.py
```diff
 def get_merges_path(self, tokenizer_dir: Path) -> Path:
-    return tokenizer_dir / "merges.txt"
+    """Return path to merges.txt file."""
+    return Path(tokenizer_dir) / "merges.txt"
```
Return type mismatch with base class.
The base class BaseTokenizer.get_merges_path() declares return type Path | None, but this implementation declares Path. While BPE always has merges, the return type should match the base class contract for type safety and LSP compliance.
♻️ Proposed fix

```diff
-    def get_merges_path(self, tokenizer_dir: Path) -> Path:
+    def get_merges_path(self, tokenizer_dir: Path) -> Path | None:
         """Return path to merges.txt file."""
         return Path(tokenizer_dir) / "merges.txt"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/tokenizer/bpe_tokenizer.py` around lines 146 - 148, The
method signature for get_merges_path in
openverifiablellm/tokenizer/bpe_tokenizer.py currently returns Path but must
match the base class contract BaseTokenizer.get_merges_path() which declares
Path | None; change the signature of get_merges_path(self, tokenizer_dir: Path)
-> Path | None (or Optional[Path]) and keep returning the Path(tokenizer_dir) /
"merges.txt" so the implementation remains correct while the declared return
type matches the base class; add or adjust the typing import (from typing import
Optional) if needed.
Summary

Completes the `BaseTokenizer` abstract interface and `BPETokenizer` implementation introduced in #17 by adding `encode()`, `decode()`, and `load()`, making `BPETokenizer` fully usable in the pipeline.

Background

PR #17 introduced a modular tokenizer architecture. However, `BaseTokenizer` only defines `train()`, `get_vocab_path()`, and `get_merges_path()`. Since `encode()`, `decode()`, and `load()` are absent from the abstract interface, subclasses have no obligation to implement them, and `BPETokenizer` cannot tokenize text despite being trained.

Changes

`openverifiablellm/tokenizer/base.py`
- `encode()`, `decode()`, `load()` as abstract methods with full docstrings
- `vocab_size` and `min_frequency`

`openverifiablellm/tokenizer/bpe_tokenizer.py`
- `encode(text) -> list[int]`
- `decode(ids) -> str`
- `load(tokenizer_dir)`: reloads from `vocab.json` + `merges.txt`
- `text_file.is_file()` validation in `train()`
- `_check_loaded()` guard for encode/decode

`tests/test_bpe.py`: new file

Testing

All 18 tests pass.

Scope

Self-contained change. Touches only `base.py`, `bpe_tokenizer.py`, and `tests/test_bpe.py`. No overlap with any open PR.

Related
Summary by CodeRabbit

- New Features
- Bug Fixes / Validation
- Tests
- Chores
- Style