[Feature]: add training pipeline (Llama architecture) and deterministic tokenization #59
Conversation
Walkthrough
Adds a deterministic, verifiable LLM stack: package exports, ModelConfig, rotary attention, a full model implementation with deterministic init, deterministic tokenization into uint16 binary + SHA-256, a deterministic training loop and dataloader, a pipeline orchestration script, tests, and a dependency/config update.
Changes
Sequence Diagram(s)
sequenceDiagram
actor User
participant Pipeline as scripts/run_pipeline.py
participant Preproc as utils.extract_text_from_xml
participant TokenizerTrain as train_tokenizer
participant Tokenize as tokenize_dataset
participant Train as openverifiablellm.train:train
participant Model as VerifiableLLM
participant DataLoader as DeterministicDataLoader
User->>Pipeline: run(dump_path)
Pipeline->>Preproc: extract_text_from_xml(dump_path)
Preproc-->>Pipeline: cleaned text + manifest
Pipeline->>TokenizerTrain: train_tokenizer(cleaned text)
TokenizerTrain-->>Pipeline: tokenizer artifacts
Pipeline->>Tokenize: tokenize_dataset(cleaned text, tokenizer_dir, tokens.bin)
Tokenize-->>Pipeline: tokens.bin + SHA256
Pipeline->>Train: train(data_path=tokens.bin, seed, config)
Train->>Model: VerifiableLLM(config)
Model->>Model: init_weights_deterministically(seed)
Train->>DataLoader: DeterministicDataLoader(tokens.bin, batch_size, seq_len, np_rng)
loop training steps
DataLoader->>Train: get_batch()
Train->>Model: forward(tokens, targets)
Model-->>Train: logits, loss
Train->>Train: backward + optimizer.step()
Train->>Train: save_checkpoint() @ intervals
end
Train-->>Pipeline: checkpoints + logs
Pipeline-->>User: done
sequenceDiagram
participant Input as tokens
participant Embed as Embedding
participant Layer as TransformerBlock
participant Attn as Attention
participant RoPE as RoPE utils
participant MLP as MLP
participant Output as lm_head
Input->>Embed: embed(tokens)
Embed->>Layer: hidden
loop per layer
Layer->>Attn: hidden, freqs_cis, mask
Attn->>RoPE: apply_rotary_emb(Q,K)
RoPE-->>Attn: rotated Q,K
Attn->>Layer: attn_out
Layer->>MLP: norm(attn_out)
MLP-->>Layer: mlp_out
Layer->>Layer: residual combine
end
Layer->>Output: final_normed_hidden
Output-->>Output: logits
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ Passed checks (4 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
14-17: ⚠️ Potential issue | 🟠 Major — Add missing runtime dependencies `numpy` and `torch` to project metadata.
This PR adds imports of `numpy` in `openverifiablellm/tokenize_data.py` and `train.py`, and `torch` in `openverifiablellm/train.py`, `openverifiablellm/model/attention.py`, and `test_model_init.py`. Both libraries are absent from `dependencies` in `pyproject.toml`, causing import failures on a clean install. Add version constraints for both packages.
Also applies to: 27-32
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 14 - 17, The pyproject.toml is missing runtime dependencies for numpy and torch which are imported in openverifiablellm/tokenize_data.py, openverifiablellm/train.py, openverifiablellm/model/attention.py and test_model_init.py; update the dependencies array to include appropriate version constraints for "numpy" and "torch" (pick compatible pinned or minimal versions consistent with the project and CI) so clean installs have these packages available at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/model/config.py`:
- Around line 17-19: In __post_init__ validate the attention invariants so
invalid configs fail early: ensure hidden_size % num_attention_heads == 0,
ensure num_attention_heads % num_key_value_heads == 0 (when num_key_value_heads
is set or defaulted), and ensure (hidden_size // num_attention_heads) is even
for RoPE; if any check fails raise a ValueError with a clear message mentioning
hidden_size, num_attention_heads and num_key_value_heads; implement these checks
inside the existing __post_init__ (referencing the attributes hidden_size,
num_attention_heads, num_key_value_heads) so callers hit a clear error instead
of downstream failures in attention.py.
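The validation described above can be sketched as a minimal `__post_init__`. The field names (`hidden_size`, `num_attention_heads`, `num_key_value_heads`) follow the review; the defaults and exact error messages here are illustrative, not the project's actual config:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    hidden_size: int = 512
    num_attention_heads: int = 8
    num_key_value_heads: Optional[int] = None

    def __post_init__(self):
        # Default GQA to full multi-head attention when unset.
        if self.num_key_value_heads is None:
            self.num_key_value_heads = self.num_attention_heads
        # Positivity first, so the modulo checks below can't ZeroDivisionError.
        if self.num_attention_heads <= 0 or self.num_key_value_heads <= 0:
            raise ValueError(
                f"num_attention_heads={self.num_attention_heads} and "
                f"num_key_value_heads={self.num_key_value_heads} must be positive"
            )
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError(
                f"hidden_size={self.hidden_size} must be divisible by "
                f"num_attention_heads={self.num_attention_heads}"
            )
        if self.num_attention_heads % self.num_key_value_heads != 0:
            raise ValueError(
                f"num_attention_heads={self.num_attention_heads} must be divisible "
                f"by num_key_value_heads={self.num_key_value_heads}"
            )
        if (self.hidden_size // self.num_attention_heads) % 2 != 0:
            raise ValueError(
                f"head_dim={self.hidden_size // self.num_attention_heads} "
                "(hidden_size // num_attention_heads) must be even for RoPE"
            )
```

Failing at construction time gives callers one readable error instead of a shape mismatch deep inside attention.py.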
In `@openverifiablellm/model/model.py`:
- Around line 60-65: The forward method currently slices self.freqs_cis by
seqlen but does not check that seqlen <= the precomputed RoPE window; add an
explicit guard at the top of forward (using the function name forward and
variables seqlen, self.freqs_cis and self.config.max_position_embeddings) that
raises a clear ValueError when seqlen > self.config.max_position_embeddings,
mentioning the offending seqlen and the configured max so callers get a
deterministic, actionable error instead of a downstream shape mismatch.
- Around line 100-106: The deterministic reinit currently only resets nn.Linear
and nn.Embedding, leaving RMSNorm.weight unchanged; update
init_weights_deterministically to also reinitialize RMSNorm parameters (e.g.,
detect the RMSNorm class or by name) and apply the same torch.nn.init.normal_
for its weight (and zeros_ for any bias if present) using
self.config.initializer_range so calling init_weights_deterministically after
seeding truly recreates the baseline; reference the method
init_weights_deterministically and the RMSNorm symbol (or its class name)
alongside existing nn.Linear and nn.Embedding cases when adding this branch.
- Around line 91-111: The RNG snapshot/restore currently saves cpu_rng_state via
torch.get_rng_state() and only the current CUDA device via
torch.cuda.get_rng_state(), which leaks changes on other GPUs; update the code
in the initialization block (the seed/restore logic around torch.manual_seed,
torch.cuda.manual_seed_all, and the module weight init loop in the Model class)
to call torch.cuda.get_rng_state_all() to capture all device states before
seeding and torch.cuda.set_rng_state_all() to restore them afterward, while
still restoring the CPU RNG via torch.set_rng_state(cpu_rng_state); ensure you
keep the same control flow (only call CUDA helpers if torch.cuda.is_available())
and preserve existing variable names/structure for clarity.
In `@openverifiablellm/tokenize_data.py`:
- Around line 36-45: The current loop reading fixed-size byte chunks with
f_in.read(chunk_size) and calling tokenizer.encode on each chunk (using
variables text_path, output_path, chunk_size, tokenizer.encode, encoded) can
split tokens across chunk boundaries; fix by switching to a boundary-aware
approach: read text in text-safe units (e.g., f_in.readline() or reading into a
string buffer and splitting on newline/whitespace), keep a remainder buffer of
the trailing partial word/subword between iterations, only call tokenizer.encode
on the complete portion and carry the incomplete suffix forward, then flush the
remaining buffered text at the end; also remove/replace any uint16 cast of token
IDs (the code that casts encoded IDs to uint16) with a safe dtype (e.g., uint32)
or add an explicit check that max(token_id) <= 65535 and raise a clear error so
IDs are not silently truncated.
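The range-check half of this fix can be sketched as below. `encode_chunk_to_uint16` is a hypothetical helper name, and the `'<u2'` dtype string pins the little-endian layout the PR description promises for `tokens.bin`:

```python
import numpy as np

MAX_UINT16 = np.iinfo(np.uint16).max  # 65535

def encode_chunk_to_uint16(token_ids):
    """Cast token IDs to little-endian uint16, refusing to truncate silently."""
    ids = np.asarray(token_ids, dtype=np.int64)
    if ids.size and ids.max() > MAX_UINT16:
        raise ValueError(
            f"token id {int(ids.max())} exceeds uint16 range ({MAX_UINT16}); "
            "use uint32 or retrain the tokenizer with a smaller vocab"
        )
    if ids.size and ids.min() < 0:
        raise ValueError("negative token id encountered")
    # '<u2' fixes little-endian byte order regardless of host architecture.
    return ids.astype("<u2").tobytes()
```

Without the explicit check, `astype(np.uint16)` would wrap out-of-range ids modulo 65536 and corrupt the dataset silently.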
In `@openverifiablellm/train.py`:
- Around line 48-54: In get_batch, fix the off-by-one bounds: compute max_start
= self.total_tokens - self.seq_len (not subtracting an extra 1) and call
self.np_rng.randint(0, max_start + 1, size=(self.batch_size,)) so the exclusive
upper bound of randint includes the last valid start; replace uses of max_idx
and the current randint call (symbols: max_idx, ix, self.np_rng.randint,
get_batch) accordingly so files with exactly seq_len+1 tokens produce valid
indices.
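Assuming `get_batch` slices `tokens[i : i + seq_len]` as inputs and the shifted `tokens[i + 1 : i + seq_len + 1]` as targets (the usual LM convention, not confirmed from the diff), the bounds work out as below; `sample_batch_starts` is a hypothetical stand-in for the indexing inside `get_batch`, and the invariant to test is that `tokens[i : i + seq_len + 1]` stays in range:

```python
import numpy as np

def sample_batch_starts(total_tokens: int, seq_len: int, batch_size: int, np_rng):
    """Return start indices i such that tokens[i : i + seq_len + 1] is in range."""
    # Each sample consumes seq_len inputs plus one shifted target token,
    # so the largest valid start is total_tokens - seq_len - 1.
    max_start = total_tokens - seq_len - 1
    if max_start < 0:
        raise ValueError(
            f"need at least seq_len + 1 = {seq_len + 1} tokens, got {total_tokens}"
        )
    # randint's upper bound is exclusive, so max_start + 1 includes the last start.
    return np_rng.randint(0, max_start + 1, size=(batch_size,))
```

With exactly `seq_len + 1` tokens this yields start index 0 only, instead of raising or sampling an out-of-range window.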
- Around line 18-30: In set_deterministic_seed(seed: int) ensure full CUDA
determinism by calling torch.use_deterministic_algorithms(True) after setting
the RNG seeds and cuDNN flags, and document/require that the
CUBLAS_WORKSPACE_CONFIG environment variable (e.g. CUBLAS_WORKSPACE_CONFIG=:16:8
or :4096:8) is set before CUDA initializes (i.e., before importing torch or
creating CUDA tensors) to guarantee deterministic cuBLAS behavior; update
logger.info in set_deterministic_seed to reflect that deterministic algorithms
are enabled.
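The ordering constraint is the subtle part: the cuBLAS workspace variable must be in the environment before CUDA initializes, i.e. before `torch` is imported anywhere in the process. A sketch, assuming the two documented workspace values; the lazy import below is only to make the ordering explicit in one snippet — in real code the env var is set at the very top of the entrypoint and `torch` is imported normally afterwards:

```python
import os

# Must run before `import torch` executes anywhere in the process;
# ":16:8" is the lower-memory alternative to ":4096:8".
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

def set_deterministic_seed(seed: int) -> None:
    import torch  # placed here only to emphasize env-var-before-torch ordering

    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Raises at runtime if a nondeterministic op is hit,
    # instead of letting runs silently diverge.
    torch.use_deterministic_algorithms(True)
```

`use_deterministic_algorithms(True)` turns missing determinism into a loud error, which is the behavior a verifiable-training pipeline wants.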
In `@scripts/run_pipeline.py`:
- Around line 62-69: The training invocation built in cmd4 calls
openverifiablellm.train without an explicit output directory so checkpoints
default to the caller CWD; update cmd4 to include the explicit --output_dir
argument pointing to the same checkpoints path used in the pipeline summary
(e.g., the variable representing project_root / "data" / "checkpoints") so
checkpoints are written to the advertised location, and make the same change for
the later command at lines 74-77 (the other cmd* that runs
openverifiablellm.train) so both run_command calls pass --output_dir
consistently.
In `@test_model_init.py`:
- Around line 12-33: Move the module-level test into a proper pytest test
function so assertions run under test discovery: create a test file in the tests
package (e.g., tests/test_model_init.py), wrap the existing code inside a
function named test_model_initialization_is_deterministic(), and keep the same
sequence using ModelConfig, VerifiableLLM, init_weights_deterministically and
get_model_hash; remove or stop executing any prints at import if desired and let
pytest assert hash1 == hash2 and hash1 != hash3 inside that test function so
pytest can collect and report failures correctly.
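The requested restructuring looks like the following. `_model_hash` is a self-contained stand-in for constructing `VerifiableLLM`, calling `init_weights_deterministically`, and hashing the weights with `get_model_hash`, so the pattern runs without the project installed; the real test would use those symbols instead:

```python
import hashlib
import random

def _model_hash(seed: int) -> str:
    # Stand-in for get_model_hash(model) after init_weights_deterministically(seed):
    # hash a seeded pseudo-weight vector so the determinism pattern is testable here.
    rng = random.Random(seed)
    fake_weights = bytes(rng.randrange(256) for _ in range(1024))
    return hashlib.sha256(fake_weights).hexdigest()

def test_model_initialization_is_deterministic():
    # pytest collects module-level functions named test_*; failed assertions
    # are reported as test failures instead of aborting at import time.
    hash1 = _model_hash(seed=42)
    hash2 = _model_hash(seed=42)
    hash3 = _model_hash(seed=43)
    assert hash1 == hash2, "same seed must reproduce identical weights"
    assert hash1 != hash3, "different seed must change the weights"
```

Moving the asserts inside a `test_*` function in `tests/` is what lets `pytest` discover, run, and report them.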
---
Outside diff comments:
In `@pyproject.toml`:
- Around line 14-17: The pyproject.toml is missing runtime dependencies for
numpy and torch which are imported in openverifiablellm/tokenize_data.py,
openverifiablellm/train.py, openverifiablellm/model/attention.py and
test_model_init.py; update the dependencies array to include appropriate version
constraints for "numpy" and "torch" (pick compatible pinned or minimal versions
consistent with the project and CI) so clean installs have these packages
available at runtime.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 331f4652-dfcc-487a-91d9-391dd3c1768d
📒 Files selected for processing (9)
openverifiablellm/model/__init__.py, openverifiablellm/model/attention.py, openverifiablellm/model/config.py, openverifiablellm/model/model.py, openverifiablellm/tokenize_data.py, openverifiablellm/train.py, pyproject.toml, scripts/run_pipeline.py, test_model_init.py
Hey @Archit381, can you review this PR? I resolved all the CodeRabbit issues.
Actionable comments posted: 8
♻️ Duplicate comments (1)
openverifiablellm/tokenize_data.py (1)
53-64: ⚠️ Potential issue | 🟠 Major — Whitespace-only buffering still is not ByteLevelBPETokenizer-safe.
At Line 60 the trailing separator stays in `complete_portion`, so the next chunk loses the leading-space context that GPT-2-style byte-level BPE uses. `encode(" hello")` is not equivalent to `encode(" ")` plus `encode("hello")`, and the fallback at Lines 62-64 still degenerates to arbitrary mid-token splits when a chunk has no whitespace. The emitted token stream can therefore still depend on chunking, not just corpus content.
In tokenizers 0.15.2, does ByteLevelBPETokenizer treat leading spaces as part of the following token (for example, is `encode(" hello")` equivalent to `encode(" ")` followed by `encode("hello")`)? Can BPE merges cross pre-tokenizer boundaries or chunk boundaries?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/tokenize_data.py` around lines 53 - 64, The current split keeps the separator character in complete_portion which loses the leading-space context for the next chunk and can break ByteLevelBPETokenizer; change the split so the separator stays at the start of remainder instead of at the end of complete_portion: compute split_idx = max(last_space_idx, last_newline_idx) and when split_idx != -1 set complete_portion = text_chunk[:split_idx] and remainder = text_chunk[split_idx:], and when split_idx == -1 (no separator) assign complete_portion = "" and remainder = text_chunk so you never emit a chunk that drops a leading-space token; refer to the variables/text handling in tokenize_data.py (text_chunk, last_space_idx, last_newline_idx, split_idx, complete_portion, remainder).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/model/attention.py`:
- Around line 17-22: In reshape_for_broadcast, replace the two assert statements
with explicit runtime checks that raise ValueError: verify ndim = x.ndim and
raise ValueError if ndim <= 1 (use a clear message referencing x.ndim), and
verify freqs_cis.shape == (x.shape[1], x.shape[-1]) raising ValueError that
includes the actual freqs_cis.shape and expected shape (derived from x.shape[1]
and x.shape[-1]); keep the rest of the function (shape computation and view)
unchanged and reference the function name reshape_for_broadcast and variable
names freqs_cis, x, ndim in the error messages for clarity.
In `@openverifiablellm/model/config.py`:
- Around line 18-34: The __post_init__ in model/config.py should explicitly
validate that num_attention_heads and num_key_value_heads are positive integers
before any division/modulo: add checks that num_attention_heads > 0 and
num_key_value_heads > 0 (after the existing defaulting of num_key_value_heads)
and raise clear ValueError messages if not; keep these checks before the
hidden_size % num_attention_heads and num_attention_heads % num_key_value_heads
validations so invalid zero or negative values produce a readable ValueError
instead of ZeroDivisionError or later failures when computing head_dim.
In `@openverifiablellm/model/model.py`:
- Around line 1-121: Run the code formatter ruff on this file to make CI happy:
apply `ruff format` (or your editor's ruff integration) to
openverifiablellm/model/model.py so whitespace, imports and line breaks are
normalized; ensure the formatted output preserves existing semantics in
classes/functions RMSNorm, MLP, TransformerBlock, VerifiableLLM and the
init_weights_deterministically method (no code changes, only formatting), then
stage the rewritten file and push the commit.
- Around line 98-121: The RNG restore logic must be placed inside a finally
block so any exception during initialization does not leak the seeded RNG state;
wrap the initialization loop that iterates self.modules() (checking nn.Linear,
nn.Embedding, RMSNorm) in try/finally, capture cpu_rng_state via
torch.get_rng_state() and gpu_rng_states via torch.cuda.get_rng_state_all()
(initialize gpu_rng_states to None if CUDA unavailable) before seeding, and in
the finally call torch.set_rng_state(cpu_rng_state) and, if gpu_rng_states is
not None, torch.cuda.set_rng_state_all(gpu_rng_states); re-raise any exception
after restore so behavior is unchanged.
- Around line 19-21: The forward method promotes activations to float32 because
self.weight is float32; to fix, ensure the weight is cast to the
input/normalized tensor dtype before multiplication: locate forward (and the
normalization call self._norm) and change the multiplication to use a
dtype-matched weight (e.g., cast self.weight to output/type of x via type_as or
to(x.dtype)) so mixed-precision (float16/bfloat16) is preserved; alternatively
ensure self.weight is stored/registered in the same dtype as the model tensors.
In `@openverifiablellm/train.py`:
- Around line 100-107: The ModelConfig currently hardcodes vocab_size=32000;
instead read and pass the tokenizer vocabulary size into Stage 4 and validate
dataset token ids against it before training. Locate where ModelConfig is
constructed (symbol: ModelConfig) and replace the literal 32000 with the
tokenizer/vocab source (e.g., args.tokenizer_vocab_size or a persisted value
loaded from the tokenizer or dataset metadata used in Stage 2), ensure the same
change is applied to the other ModelConfig instances around the 156-169 area,
and add a pre-training validation step that loads the max token id from
tokens.bin (or dataset metadata) and raises an error if max_id >= vocab_size so
out-of-range token ids are caught early.
In `@pyproject.toml`:
- Around line 14-18: Update the dependencies list entry for sentencepiece in
pyproject.toml to pin the exact version used in the lockfile (change
"sentencepiece" to "sentencepiece==0.2.1") so the manifest matches the lockfile;
locate the dependencies array containing "defusedxml", "sentencepiece",
"tokenizers==0.15.2" and replace the sentencepiece entry accordingly, then run
your dependency tool to verify the lockfile remains consistent.
In `@scripts/run_pipeline.py`:
- Around line 10-19: The subprocesses are running in the caller's CWD, causing
outputs and imports to be resolved incorrectly; update run_command to compute
the repository/project root (e.g., from the script's location or a passed-in
project_root) and call subprocess.run with the cwd argument set to that
project_root so every stage runs from the repo root; update any direct
subprocess.run calls in the same file (the other invocations referenced around
lines 33-70) to also pass cwd=project_root so file writes and module imports are
deterministic.
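The suggested `run_command` change can be sketched as below. Passing `project_root` explicitly is one way to do it; deriving it from the script location (e.g. `Path(__file__).resolve().parent.parent` for a script one level below the repo root) is an assumption about this repo's layout:

```python
import subprocess
from pathlib import Path

def run_command(cmd, project_root: Path) -> subprocess.CompletedProcess:
    # cwd pins every stage to the repo root, so relative output paths and
    # `python -m openverifiablellm.*` imports resolve the same way regardless
    # of where the user launched the pipeline from.
    return subprocess.run(
        cmd, cwd=project_root, check=True, capture_output=True, text=True
    )

# In run_pipeline.py the root could be derived once, e.g.:
# project_root = Path(__file__).resolve().parent.parent
```

Every direct `subprocess.run` call in the pipeline should then pass the same `cwd`, so all stages write artifacts under the same root.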
---
Duplicate comments:
In `@openverifiablellm/tokenize_data.py`:
- Around line 53-64: The current split keeps the separator character in
complete_portion which loses the leading-space context for the next chunk and
can break ByteLevelBPETokenizer; change the split so the separator stays at the
start of remainder instead of at the end of complete_portion: compute split_idx
= max(last_space_idx, last_newline_idx) and when split_idx != -1 set
complete_portion = text_chunk[:split_idx] and remainder =
text_chunk[split_idx:], and when split_idx == -1 (no separator) assign
complete_portion = "" and remainder = text_chunk so you never emit a chunk that
drops a leading-space token; refer to the variables/text handling in
tokenize_data.py (text_chunk, last_space_idx, last_newline_idx, split_idx,
complete_portion, remainder).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b4dcfb94-8aaa-433d-82bc-b0389e8d50a1
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
openverifiablellm/model/attention.py, openverifiablellm/model/config.py, openverifiablellm/model/model.py, openverifiablellm/tokenize_data.py, openverifiablellm/train.py, pyproject.toml, scripts/run_pipeline.py, tests/test_model_init.py, tests/test_verify.py
…cision, cwd, vocab_size
Actionable comments posted: 3
♻️ Duplicate comments (1)
openverifiablellm/tokenize_data.py (1)
58-67: ⚠️ Potential issue | 🟠 Major — This split still changes ByteLevel BPE ids.
`ByteLevelBPETokenizer` keeps leading whitespace with the following pretoken. With `complete_portion = text_chunk[: split_idx + 1]`, `"hello world"` is encoded as `"hello "` + `"world"` instead of `"hello"` + `" world"`, so the output still depends on chunk boundaries. The `split_idx == -1` path also still emits arbitrary whitespace-free slices.
💡 Suggested direction

```diff
-    if split_idx != -1:
-        complete_portion = text_chunk[: split_idx + 1]
-        remainder = text_chunk[split_idx + 1 :]
-    else:
-        complete_portion = text_chunk
-        remainder = ""
+    if split_idx != -1:
+        complete_portion = text_chunk[:split_idx]
+        remainder = text_chunk[split_idx:]
+    else:
+        complete_portion = ""
+        remainder = text_chunk
```

If an unbounded `remainder` is a concern, stream on a tokenizer-stable unit such as full lines/documents instead of arbitrary character chunks.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/tokenize_data.py` around lines 58 - 67, The current split includes the whitespace in complete_portion which changes ByteLevel BPE ids; update the split logic in tokenize_data.py so the remainder preserves leading whitespace (i.e., set complete_portion = text_chunk[:split_idx] and remainder = text_chunk[split_idx:] when split_idx != -1) so tokens that rely on leading spaces remain stable; additionally, avoid arbitrary whitespace-free slices on the split_idx == -1 path by either not splitting (complete_portion = text_chunk; remainder = "") or, better, switch to streaming/splitting on tokenizer-stable units (e.g., full lines/documents or newline-only splits) to guarantee stable tokenization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/train.py`:
- Around line 78-97: The checkpoint saved by save_checkpoint currently omits
model and dataset metadata; update the checkpoint dict in save_checkpoint to
include a serializable model_config (e.g., model_config or model.get_config() /
as dict) and a dataset identifier/hash (e.g., dataset_id or dataset_hash
produced when preparing the dataset), keep environment_fingerprint and
iteration, then torch.save that extended checkpoint and
compute_merkle_root(checkpoint_path) as before; reference the save_checkpoint
function, the checkpoint dict keys, compute_merkle_root, and
generate_environment_fingerprint to locate where to add model_config and dataset
identifier/hash.
In `@scripts/run_pipeline.py`:
- Around line 31-46: The existence check for the input dump uses a relative Path
that is validated in the caller cwd but then passed into a subprocess executed
with cwd=project_root, causing false negatives for relative paths; fix by
resolving dump_path to an absolute path before the exists() check and before
building cmd1 (e.g., replace dump_path = Path(args.dump_path) with a resolved
Path and use that resolved dump_path in cmd1), so functions and symbols to
update are parser.parse_args()/dump_path (resolve it), the exists() check, and
the cmd1 passed to run_command.
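The resolve-before-subprocess idea in a small helper; `resolve_dump_path` is a hypothetical name (the PR presumably does this inline on `args.dump_path`), but the order of operations is the point:

```python
from pathlib import Path

def resolve_dump_path(raw: str) -> Path:
    # Resolve relative to the *caller's* cwd before any subprocess switches
    # to cwd=project_root; an absolute path stays valid in both places.
    dump_path = Path(raw).expanduser().resolve()
    if not dump_path.exists():
        raise FileNotFoundError(f"dump file not found: {dump_path}")
    return dump_path
```

The resolved absolute path is then what gets interpolated into `cmd1`, so the existence check and the subprocess agree on which file is meant.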
In `@tests/test_model_init.py`:
- Around line 14-35: Add a same-instance reinitialization regression check
inside test_model_initialization_is_deterministic: after computing hash1 for
model1, mutate model1 (e.g., run one optimizer step or modify a parameter
tensor), then call model1.init_weights_deterministically(seed=42) again and
compute a new hash (via get_model_hash) and assert it equals the original hash1;
this ensures VerifiableLLM.init_weights_deterministically fully resets an
already-used instance.
---
Duplicate comments:
In `@openverifiablellm/tokenize_data.py`:
- Around line 58-67: The current split includes the whitespace in
complete_portion which changes ByteLevel BPE ids; update the split logic in
tokenize_data.py so the remainder preserves leading whitespace (i.e., set
complete_portion = text_chunk[:split_idx] and remainder = text_chunk[split_idx:]
when split_idx != -1) so tokens that rely on leading spaces remain stable;
additionally, avoid arbitrary whitespace-free slices on the split_idx == -1 path
by either not splitting (complete_portion = text_chunk; remainder = "") or,
better, switch to streaming/splitting on tokenizer-stable units (e.g., full
lines/documents or newline-only splits) to guarantee stable tokenization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: fc85a01f-fd1e-4142-bca9-ef01033efb9c
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (8)
openverifiablellm/model/attention.py, openverifiablellm/model/config.py, openverifiablellm/model/model.py, openverifiablellm/tokenize_data.py, openverifiablellm/train.py, pyproject.toml, scripts/run_pipeline.py, tests/test_model_init.py
…kpoints, reinit test, path resolution, vocab_size arg
♻️ Duplicate comments (2)
openverifiablellm/train.py (1)
87-93: ⚠️ Potential issue | 🟠 Major — Persist the dataset identifier in each checkpoint.
`checkpoint` now records `model_config`, but it still doesn't bind the artifact to the dataset hash / manifest root produced in Stages 1-2. As written, the Merkle root only proves checkpoint integrity, not which corpus/tokenization the weights came from.
💡 Suggested direction

```diff
-def save_checkpoint(
-    model: torch.nn.Module, optimizer: torch.optim.Optimizer, iteration: int, output_dir: Path
-):
+def save_checkpoint(
+    model: torch.nn.Module,
+    optimizer: torch.optim.Optimizer,
+    iteration: int,
+    output_dir: Path,
+    dataset_id: str,
+):
@@
     checkpoint = {
         "model_state_dict": model.state_dict(),
         "optimizer_state_dict": optimizer.state_dict(),
         "iteration": iteration,
         "model_config": asdict(model.config),
+        "dataset_id": dataset_id,
         "environment_fingerprint": generate_environment_fingerprint(),
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/train.py` around lines 87 - 93, The checkpoint currently serializes model_state_dict, optimizer_state_dict, iteration, model_config, and environment_fingerprint but omits the dataset identifier; update the checkpoint creation (the dict built where "checkpoint =" is assigned) to include the dataset manifest root/hash produced in Stages 1–2 (use the same variable name used earlier in the pipeline, e.g. dataset_manifest_root or dataset_hash), storing it under a clear key like "dataset_manifest_root" or "dataset_id" so each checkpoint cryptographically binds weights to the corpus/tokenization; ensure the value comes from the existing stage output (not recomputed) and is included alongside model_config and environment_fingerprint in the checkpoint dict.
openverifiablellm/tokenize_data.py (1)
56-64: ⚠️ Potential issue | 🟠 Major — Don't split inside the final whitespace run.
`split_idx` currently lands on the last `" "`/`"\n"` character, so a chunk ending with `" foo"` is emitted as `" "` + `" foo"`. That still changes the text spans seen by `ByteLevelBPETokenizer.encode()`, so `tokens.bin` can depend on chunk placement rather than corpus content. Move the boundary back to the start of the trailing whitespace run, and prefer `str.isspace()` so tabs / `\r\n` are handled too.
💡 Suggested direction

```diff
-    last_space_idx = text_chunk.rfind(" ")
-    last_newline_idx = text_chunk.rfind("\n")
-    split_idx = max(last_space_idx, last_newline_idx)
-
-    if split_idx != -1:
+    split_idx = len(text_chunk)
+    while split_idx > 0 and not text_chunk[split_idx - 1].isspace():
+        split_idx -= 1
+    while split_idx > 0 and text_chunk[split_idx - 1].isspace():
+        split_idx -= 1
+
+    if split_idx != len(text_chunk):
         complete_portion = text_chunk[:split_idx]
         remainder = text_chunk[split_idx:]
     else:
         complete_portion = ""
         remainder = text_chunk
```

For Hugging Face tokenizers 0.15.x, does ByteLevelBPETokenizer pre-tokenize consecutive whitespace as a single span, and can BPE merges cross pre-tokenizer boundaries?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/tokenize_data.py` around lines 56 - 64, The split currently picks the last space/newline index (variables text_chunk, last_space_idx, last_newline_idx, split_idx) which leaves the final whitespace char in the emitted chunk; instead find the start of the trailing whitespace run and split there using str.isspace() so tabs and CRLF are handled too. Replace the max(last_space_idx, last_newline_idx) logic with a backward scan from the chosen boundary to find the first index where text_chunk[i].isspace() stops (i.e. start of the trailing whitespace run), set split_idx to that start (or -1 if none), then set complete_portion = text_chunk[:split_idx] and remainder = text_chunk[split_idx:] as before. Ensure you handle the no-whitespace case the same as now.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@openverifiablellm/tokenize_data.py`:
- Around line 56-64: The split currently picks the last space/newline index
(variables text_chunk, last_space_idx, last_newline_idx, split_idx) which leaves
the final whitespace char in the emitted chunk; instead find the start of the
trailing whitespace run and split there using str.isspace() so tabs and CRLF are
handled too. Replace the max(last_space_idx, last_newline_idx) logic with a
backward scan from the chosen boundary to find the first index where
text_chunk[i].isspace() stops (i.e. start of the trailing whitespace run), set
split_idx to that start (or -1 if none), then set complete_portion =
text_chunk[:split_idx] and remainder = text_chunk[split_idx:] as before. Ensure
you handle the no-whitespace case the same as now.
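The whitespace-run scan written out as a runnable function (the name is illustrative). Note that at end of stream the final `remainder` still has to be encoded and flushed once, and a pathological chunk with no whitespace keeps growing the remainder — which is why streaming on full lines/documents is the more robust alternative:

```python
def split_on_trailing_whitespace_run(text_chunk: str):
    """Split a chunk so byte-level BPE never loses leading-space context.

    Returns (complete_portion, remainder): remainder starts at the beginning
    of the trailing whitespace run, so the whitespace that byte-level BPE
    attaches to the *next* word travels forward with it.
    """
    split_idx = len(text_chunk)
    # Walk back over the final (possibly partial) word...
    while split_idx > 0 and not text_chunk[split_idx - 1].isspace():
        split_idx -= 1
    # ...then over the whitespace run that precedes it.
    while split_idx > 0 and text_chunk[split_idx - 1].isspace():
        split_idx -= 1
    if split_idx == 0:
        # No safe boundary: carry the whole chunk forward as remainder.
        return "", text_chunk
    return text_chunk[:split_idx], text_chunk[split_idx:]
```

Because the separator run stays with the remainder, re-encoding the concatenated stream chunk-by-chunk sees the same pretoken spans as encoding the whole corpus at once.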
In `@openverifiablellm/train.py`:
- Around line 87-93: The checkpoint currently serializes model_state_dict,
optimizer_state_dict, iteration, model_config, and environment_fingerprint but
omits the dataset identifier; update the checkpoint creation (the dict built
where "checkpoint =" is assigned) to include the dataset manifest root/hash
produced in Stages 1–2 (use the same variable name used earlier in the pipeline,
e.g. dataset_manifest_root or dataset_hash), storing it under a clear key like
"dataset_manifest_root" or "dataset_id" so each checkpoint cryptographically
binds weights to the corpus/tokenization; ensure the value comes from the
existing stage output (not recomputed) and is included alongside model_config
and environment_fingerprint in the checkpoint dict.
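One way to bind checkpoints to the corpus, assuming the dataset identifier is simply the SHA-256 of `tokens.bin` (the pipeline may instead reuse the Merkle root already recorded in `dataset_manifest.json`). Names here are illustrative, and the `environment_fingerprint` field is stubbed out since its generator is project-specific:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream the file so a multi-GB tokens.bin never has to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def build_checkpoint(model_state, optimizer_state, iteration, model_config,
                     tokens_bin: Path, environment_fingerprint=None):
    # dataset_id binds the weights to the exact token stream they were trained
    # on; a verifier recomputes sha256(tokens.bin) and compares.
    return {
        "model_state_dict": model_state,
        "optimizer_state_dict": optimizer_state,
        "iteration": iteration,
        "model_config": model_config,
        "dataset_id": sha256_file(tokens_bin),
        "environment_fingerprint": environment_fingerprint,
    }
```

Reusing the hash computed in Stage 2 (rather than recomputing it here) additionally guarantees the checkpoint references the artifact the training loop actually read.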
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8a952ee6-d5d2-4a4b-b623-2b54f43c48d2
📒 Files selected for processing (4)
openverifiablellm/tokenize_data.py, openverifiablellm/train.py, scripts/run_pipeline.py, tests/test_model_init.py
Addressed Issues:
Fixes #55
Architecture and workflow
Screenshots/Recordings:
This is stage 1, where the XML is converted into clean wiki text.
Input file: `simplewiki-20260201-pages-articles-multistream.xml.bz2`. The resulting `wiki_clean.txt` is signed off by a `dataset_manifest.json` containing the dataset's Merkle root, proving the exact dataset composition before training.
This is the 2nd stage, where tokenization happens.
It consumes the `wiki_clean.txt` file. For identical results across x86 and ARM processors, tokens are strictly encoded as little-endian `uint16` values into memory-mapped `tokens.bin` files.
This is the training loop stage, where deterministic training happens.
It trains a LLaMA-style model using PyTorch with RoPE, RMSNorm, and Grouped Query Attention. `init_weights_deterministically(seed)` ensures reproducible CPU and CUDA weight initialization for verifiable model starting states. Important: right now I set max_steps = 10 to verify this PR, but in my actual training I used 70; you can easily change this in the run_pipeline code.
Testing
Additional Notes:
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Tests
Chores