
[Feature]: add training pipeline (Llama Architecture) and deterministic tokenization #59

Open
Shubhamx404 wants to merge 6 commits into AOSSIE-Org:main from Shubhamx404:feature/deterministic-pipeline

Conversation


@Shubhamx404 Shubhamx404 commented Mar 10, 2026

Addressed Issues:

Fixes #55

Architecture and workflow

comprehensive_verifiable_architecture_1773090166609

Screenshots/Recordings:

This is stage 1, where the XML dump is converted into clean wiki text.

file : simplewiki-20260201-pages-articles-multistream.xml.bz2

python -m openverifiablellm.utils C:\Users\shubh\Downloads\simplewiki-20260201-pages-articles-multistream.xml.bz2
  • I implemented a strict regex-based streaming parser to consume massive Wikipedia xml.bz2 dumps.
  • Instead of relying on random sampling or non-deterministic iterators, the text is extracted and cleansed while the dataset's SHA-256 hash is calculated synchronously.
  • The parsed wiki_clean.txt is signed off by a dataset_manifest.json containing the dataset's Merkle root, proving the exact dataset composition before training.
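The manifest idea above rests on hashing the cleaned dataset while it streams, never loading it whole. A minimal sketch of that pattern (the function name and chunk size are illustrative assumptions, not the PR's actual API):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in fixed-size chunks.

    Hypothetical helper sketching the manifest hashing described
    above; chunked updates keep memory flat on multi-GB dumps.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```

Because SHA-256 is incremental, the digest is identical to hashing the whole file at once, regardless of chunk size.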
image

This is stage 2, where tokenization happens.

Train tokenizer: python -c "from openverifiablellm.tokenizer import train_tokenizer; train_tokenizer('data/processed/wiki_clean.txt', 'data/tokenizer')"
Binary tokenization: python -m openverifiablellm.tokenize_data --text_path data/processed/wiki_clean.txt --tokenizer_path data/tokenizer --output_path data/tokens.bin
  • A BPE tokenizer is trained deterministically from the wiki_clean.txt file.
  • I constructed a memory-efficient encoder that reads the raw text in streaming chunks and outputs raw, mathematically predictable encoded tokens.
  • To prevent floating-point or endianness mismatches between x86 and ARM processors, tokens are strictly encoded as little-endian uint16 values into memory-mapped tokens.bin files.
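The little-endian uint16 guarantee above can be sketched with the standard library alone; the function name and the range check are illustrative assumptions rather than the PR's exact code:

```python
import struct

def encode_tokens_le_uint16(token_ids: list) -> bytes:
    """Pack token IDs as little-endian uint16, rejecting IDs > 65535.

    Hypothetical sketch of the cross-platform encoding described
    above; silently truncating oversized IDs would corrupt the stream.
    """
    for tid in token_ids:
        if not 0 <= tid <= 0xFFFF:
            raise ValueError(f"token id {tid} does not fit in uint16")
    # '<' forces little-endian output regardless of host byte order,
    # so x86 and ARM machines produce byte-identical tokens.bin files.
    return struct.pack(f"<{len(token_ids)}H", *token_ids)
```

The same bytes can later be memory-mapped and reinterpreted as uint16 on any platform, since the byte order is fixed in the file rather than inherited from the writer's CPU.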
image

This is the training-loop stage, where deterministic training happens.

Training loop: python -m openverifiablellm.train --data_path data/tokens.bin --batch_size 16 --seq_len 512 --max_steps 70 --log_interval 5 --save_interval 10
  • Implemented a minimal LLaMA model using PyTorch with RoPE, RMSNorm, and Grouped Query Attention.
  • Added init_weights_deterministically(seed) ensuring reproducible CPU and CUDA weight initialization for verifiable model starting states.
  • Implemented deterministic training with fixed seeds, memory-mapped datasets, and auditable checkpoints with Merkle root verification.
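The checkpoints above carry a Merkle root for auditability. A minimal sketch of a Merkle root over data chunks, assuming pairwise SHA-256 with the common convention of duplicating the last hash on odd-length levels (the PR's leaf size and odd-node handling may differ):

```python
import hashlib

def merkle_root(chunks: list) -> str:
    """Compute a Merkle root over byte chunks by pairwise SHA-256.

    Illustrative sketch only; the last hash is duplicated on
    odd-length levels, a convention the real code may not share.
    """
    if not chunks:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        if len(level) % 2:          # odd level: duplicate last node
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

A verifier who recomputes the root from a checkpoint's chunks can detect any single-byte tampering, while matching roots prove byte-identical artifacts.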

Important: for verifying this PR I currently set max_steps = 10, but in my own training I used 70; you can easily change this in the run_pipeline code.

image

Testing

image

Additional Notes:

Checklist

  • [x] My code follows the project's code style and conventions
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings or errors
  • [x] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [x] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Added a full transformer-based LLM with configurable model config, rotary positional embeddings, deterministic weight initialization, and token-to-logit inference.
    • New deterministic dataset tokenization tool and an end-to-end pipeline script to run preprocessing, tokenizer training, tokenization, and model training.
    • Deterministic training workflow with a reproducible data loader, checkpointing, and deterministic seed handling.
  • Tests

    • Tests added to verify deterministic model initialization and reproducible hashes.
  • Chores

    • Updated project dependencies and pytest warning configuration.


coderabbitai bot commented Mar 10, 2026

Walkthrough

Adds a deterministic verifiable LLM stack: package exports, ModelConfig, rotary-attention, full model implementation with deterministic init, deterministic tokenization into uint16 binary + SHA-256, deterministic training loop and dataloader, pipeline orchestration script, tests, and a dependency/config update.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Model package exports: `openverifiablellm/model/__init__.py` | Expose ModelConfig and VerifiableLLM via package public API (`__all__`). |
| Model config: `openverifiablellm/model/config.py` | Add ModelConfig dataclass with defaults and validation; computes head_dim and enforces divisibility/evenness for RoPE. |
| Attention module: `openverifiablellm/model/attention.py` | New RoPE utilities (precompute_freqs_cis, reshape_for_broadcast, apply_rotary_emb) and Attention nn.Module with grouped-kv support, masking, and multi-head projections. |
| Model implementation: `openverifiablellm/model/model.py` | Add RMSNorm, MLP, TransformerBlock, VerifiableLLM, and deterministic init_weights_deterministically; forward returns logits and optional loss; preserves RNG state during init. |
| Tokenization pipeline: `openverifiablellm/tokenize_data.py` | Add deterministic tokenizer-based chunked encoder CLI that writes little-endian uint16 .bin, validates token IDs, logs progress, and returns SHA-256 hash. |
| Training workflow: `openverifiablellm/train.py` | Add deterministic seed setup, DeterministicDataLoader (mmap uint16), checkpoint saving (with Merkle root), and train CLI orchestrating deterministic training loop. |
| Pipeline script: `scripts/run_pipeline.py` | New end-to-end pipeline runner calling preprocessing, tokenizer training, tokenization, and deterministic training via subprocess with centralized logging and error handling. |
| Tests: `tests/test_model_init.py`, `tests/test_verify.py` | Add deterministic-init tests asserting identical parameter hashes for same seed and update preprocessing test to generate manifest. |
| Project config: `pyproject.toml` | Add sentencepiece==0.2.1 dependency and pytest filterwarnings config for Swig deprecation warnings. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Pipeline as scripts/run_pipeline.py
    participant Preproc as utils.extract_text_from_xml
    participant TokenizerTrain as train_tokenizer
    participant Tokenize as tokenize_dataset
    participant Train as openverifiablellm.train:train
    participant Model as VerifiableLLM
    participant DataLoader as DeterministicDataLoader

    User->>Pipeline: run(dump_path)
    Pipeline->>Preproc: extract_text_from_xml(dump_path)
    Preproc-->>Pipeline: cleaned text + manifest
    Pipeline->>TokenizerTrain: train_tokenizer(cleaned text)
    TokenizerTrain-->>Pipeline: tokenizer artifacts
    Pipeline->>Tokenize: tokenize_dataset(cleaned text, tokenizer_dir, tokens.bin)
    Tokenize-->>Pipeline: tokens.bin + SHA256
    Pipeline->>Train: train(data_path=tokens.bin, seed, config)
    Train->>Model: VerifiableLLM(config)
    Model->>Model: init_weights_deterministically(seed)
    Train->>DataLoader: DeterministicDataLoader(tokens.bin, batch_size, seq_len, np_rng)
    loop training steps
        DataLoader->>Train: get_batch()
        Train->>Model: forward(tokens, targets)
        Model-->>Train: logits, loss
        Train->>Train: backward + optimizer.step()
        Train->>Train: save_checkpoint() @ intervals
    end
    Train-->>Pipeline: checkpoints + logs
    Pipeline-->>User: done
```
```mermaid
sequenceDiagram
    participant Input as tokens
    participant Embed as Embedding
    participant Layer as TransformerBlock
    participant Attn as Attention
    participant RoPE as RoPE utils
    participant MLP as MLP
    participant Output as lm_head

    Input->>Embed: embed(tokens)
    Embed->>Layer: hidden
    loop per layer
        Layer->>Attn: hidden, freqs_cis, mask
        Attn->>RoPE: apply_rotary_emb(Q,K)
        RoPE-->>Attn: rotated Q,K
        Attn->>Layer: attn_out
        Layer->>MLP: norm(attn_out)
        MLP-->>Layer: mlp_out
        Layer->>Layer: residual combine
    end
    Layer->>Output: final_normed_hidden
    Output-->>Output: logits
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

  • PR AOSSIE-Org/OpenVerifiableLLM#17: Adds deterministic tokenizer training and hashing utilities (compute_sha256) used by tokenize_data.py and the pipeline.

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I hop through bytes and RoPE-lined trails,
Seeds set firm so randomness pales,
Tokens march into a binary song,
Weights reset true — the hashes belong,
Reproducible hops, steady and strong.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly summarizes the main addition: a deterministic training pipeline using Llama architecture and tokenization, which aligns with the substantial code changes implementing these features. |
| Linked Issues check | ✅ Passed | All primary coding objectives from issue #55 are met: deterministic tokenization pipeline implemented in tokenize_data.py with SHA-256 hashing, minimal LLaMA architecture with RoPE/RMSNorm/GQA in model.py, deterministic weight initialization via init_weights_deterministically(), and training pipeline with checkpoints and Merkle root verification. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the feature: model architecture, tokenization, deterministic training, pipeline orchestration, and related configuration. No unrelated modifications detected outside the stated objectives. |


@Shubhamx404 Shubhamx404 changed the title [Feature] : add training pipeline(Llama model ) and deterministic tokenization [Feature] : add training pipeline(Llama architecture ) and deterministic tokenization Mar 10, 2026
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 10, 2026
@Shubhamx404 Shubhamx404 changed the title [Feature] : add training pipeline(Llama architecture ) and deterministic tokenization [Feature] : add training pipeline(Llama Architecture ) and deterministic tokenization Mar 10, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pyproject.toml (1)

14-17: ⚠️ Potential issue | 🟠 Major

Add missing runtime dependencies numpy and torch to project metadata.

This PR adds imports of numpy in openverifiablellm/tokenize_data.py and train.py, and torch in openverifiablellm/train.py, openverifiablellm/model/attention.py, and test_model_init.py. Both libraries are absent from dependencies in pyproject.toml, causing import failures on clean install. Add version constraints for both packages.

Also applies to: 27-32

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 14 - 17, The pyproject.toml is missing runtime
dependencies for numpy and torch which are imported in
openverifiablellm/tokenize_data.py, openverifiablellm/train.py,
openverifiablellm/model/attention.py and test_model_init.py; update the
dependencies array to include appropriate version constraints for "numpy" and
"torch" (pick compatible pinned or minimal versions consistent with the project
and CI) so clean installs have these packages available at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/model/config.py`:
- Around line 17-19: In __post_init__ validate the attention invariants so
invalid configs fail early: ensure hidden_size % num_attention_heads == 0,
ensure num_attention_heads % num_key_value_heads == 0 (when num_key_value_heads
is set or defaulted), and ensure (hidden_size // num_attention_heads) is even
for RoPE; if any check fails raise a ValueError with a clear message mentioning
hidden_size, num_attention_heads and num_key_value_heads; implement these checks
inside the existing __post_init__ (referencing the attributes hidden_size,
num_attention_heads, num_key_value_heads) so callers hit a clear error instead
of downstream failures in attention.py.

In `@openverifiablellm/model/model.py`:
- Around line 60-65: The forward method currently slices self.freqs_cis by
seqlen but does not check that seqlen <= the precomputed RoPE window; add an
explicit guard at the top of forward (using the function name forward and
variables seqlen, self.freqs_cis and self.config.max_position_embeddings) that
raises a clear ValueError when seqlen > self.config.max_position_embeddings,
mentioning the offending seqlen and the configured max so callers get a
deterministic, actionable error instead of a downstream shape mismatch.
- Around line 100-106: The deterministic reinit currently only resets nn.Linear
and nn.Embedding, leaving RMSNorm.weight unchanged; update
init_weights_deterministically to also reinitialize RMSNorm parameters (e.g.,
detect the RMSNorm class or by name) and apply the same torch.nn.init.normal_
for its weight (and zeros_ for any bias if present) using
self.config.initializer_range so calling init_weights_deterministically after
seeding truly recreates the baseline; reference the method
init_weights_deterministically and the RMSNorm symbol (or its class name)
alongside existing nn.Linear and nn.Embedding cases when adding this branch.
- Around line 91-111: The RNG snapshot/restore currently saves cpu_rng_state via
torch.get_rng_state() and only the current CUDA device via
torch.cuda.get_rng_state(), which leaks changes on other GPUs; update the code
in the initialization block (the seed/restore logic around torch.manual_seed,
torch.cuda.manual_seed_all, and the module weight init loop in the Model class)
to call torch.cuda.get_rng_state_all() to capture all device states before
seeding and torch.cuda.set_rng_state_all() to restore them afterward, while
still restoring the CPU RNG via torch.set_rng_state(cpu_rng_state); ensure you
keep the same control flow (only call CUDA helpers if torch.cuda.is_available())
and preserve existing variable names/structure for clarity.

In `@openverifiablellm/tokenize_data.py`:
- Around line 36-45: The current loop reading fixed-size byte chunks with
f_in.read(chunk_size) and calling tokenizer.encode on each chunk (using
variables text_path, output_path, chunk_size, tokenizer.encode, encoded) can
split tokens across chunk boundaries; fix by switching to a boundary-aware
approach: read text in text-safe units (e.g., f_in.readline() or reading into a
string buffer and splitting on newline/whitespace), keep a remainder buffer of
the trailing partial word/subword between iterations, only call tokenizer.encode
on the complete portion and carry the incomplete suffix forward, then flush the
remaining buffered text at the end; also remove/replace any uint16 cast of token
IDs (the code that casts encoded IDs to uint16) with a safe dtype (e.g., uint32)
or add an explicit check that max(token_id) <= 65535 and raise a clear error so
IDs are not silently truncated.

In `@openverifiablellm/train.py`:
- Around line 48-54: In get_batch, fix the off-by-one bounds: compute max_start
= self.total_tokens - self.seq_len (not subtracting an extra 1) and call
self.np_rng.randint(0, max_start + 1, size=(self.batch_size,)) so the exclusive
upper bound of randint includes the last valid start; replace uses of max_idx
and the current randint call (symbols: max_idx, ix, self.np_rng.randint,
get_batch) accordingly so files with exactly seq_len+1 tokens produce valid
indices.
- Around line 18-30: In set_deterministic_seed(seed: int) ensure full CUDA
determinism by calling torch.use_deterministic_algorithms(True) after setting
the RNG seeds and cuDNN flags, and document/require that the
CUBLAS_WORKSPACE_CONFIG environment variable (e.g. CUBLAS_WORKSPACE_CONFIG=:16:8
or :4096:8) is set before CUDA initializes (i.e., before importing torch or
creating CUDA tensors) to guarantee deterministic cuBLAS behavior; update
logger.info in set_deterministic_seed to reflect that deterministic algorithms
are enabled.

In `@scripts/run_pipeline.py`:
- Around line 62-69: The training invocation built in cmd4 calls
openverifiablellm.train without an explicit output directory so checkpoints
default to the caller CWD; update cmd4 to include the explicit --output_dir
argument pointing to the same checkpoints path used in the pipeline summary
(e.g., the variable representing project_root / "data" / "checkpoints") so
checkpoints are written to the advertised location, and make the same change for
the later command at lines 74-77 (the other cmd* that runs
openverifiablellm.train) so both run_command calls pass --output_dir
consistently.

In `@test_model_init.py`:
- Around line 12-33: Move the module-level test into a proper pytest test
function so assertions run under test discovery: create a test file in the tests
package (e.g., tests/test_model_init.py), wrap the existing code inside a
function named test_model_initialization_is_deterministic(), and keep the same
sequence using ModelConfig, VerifiableLLM, init_weights_deterministically and
get_model_hash; remove or stop executing any prints at import if desired and let
pytest assert hash1 == hash2 and hash1 != hash3 inside that test function so
pytest can collect and report failures correctly.

---

Outside diff comments:
In `@pyproject.toml`:
- Around line 14-17: The pyproject.toml is missing runtime dependencies for
numpy and torch which are imported in openverifiablellm/tokenize_data.py,
openverifiablellm/train.py, openverifiablellm/model/attention.py and
test_model_init.py; update the dependencies array to include appropriate version
constraints for "numpy" and "torch" (pick compatible pinned or minimal versions
consistent with the project and CI) so clean installs have these packages
available at runtime.
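The off-by-one bounds issue flagged for get_batch above comes down to how many tokens one training example consumes. One consistent way to get the arithmetic right, as a standalone sketch (the function and its shape are hypothetical, assuming each example needs seq_len inputs plus one shifted target token):

```python
import random

def sample_batch_starts(total_tokens: int, seq_len: int,
                        batch_size: int, rng: random.Random) -> list:
    """Sample start indices for (input, shifted-target) windows.

    Each example reads data[i : i + seq_len] as input and
    data[i + 1 : i + 1 + seq_len] as targets, so the last index
    touched is i + seq_len and the last valid start is
    total_tokens - seq_len - 1.
    """
    max_start = total_tokens - seq_len - 1
    if max_start < 0:
        raise ValueError("dataset shorter than seq_len + 1 tokens")
    # randrange's upper bound is exclusive, so add 1 to include max_start
    return [rng.randrange(0, max_start + 1) for _ in range(batch_size)]
```

With this arithmetic, a file of exactly seq_len + 1 tokens yields the single valid start index 0 instead of raising or indexing out of bounds.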

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 331f4652-dfcc-487a-91d9-391dd3c1768d

📥 Commits

Reviewing files that changed from the base of the PR and between 35ce9dc and f3d7194.

📒 Files selected for processing (9)
  • openverifiablellm/model/__init__.py
  • openverifiablellm/model/attention.py
  • openverifiablellm/model/config.py
  • openverifiablellm/model/model.py
  • openverifiablellm/tokenize_data.py
  • openverifiablellm/train.py
  • pyproject.toml
  • scripts/run_pipeline.py
  • test_model_init.py

@Shubhamx404 (Contributor, Author) commented:

Hey @Archit381, can you review this PR? I have resolved all the CodeRabbit issues.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (1)
openverifiablellm/tokenize_data.py (1)

53-64: ⚠️ Potential issue | 🟠 Major

Whitespace-only buffering still is not ByteLevelBPETokenizer-safe.

At Line 60 the trailing separator stays in complete_portion, so the next chunk loses the leading-space context that GPT-2-style byte-level BPE uses. encode(" hello") is not equivalent to encode(" ") plus encode("hello"), and the fallback at Line 62-Line 64 still degenerates to arbitrary mid-token splits when a chunk has no whitespace. The emitted token stream can therefore still depend on chunking, not just corpus content.

In tokenizers 0.15.2, does ByteLevelBPETokenizer treat leading spaces as part of the following token (for example, is `encode(" hello")` equivalent to `encode(" ")` followed by `encode("hello")`)? Can BPE merges cross pre-tokenizer boundaries or chunk boundaries?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenize_data.py` around lines 53 - 64, The current split
keeps the separator character in complete_portion which loses the leading-space
context for the next chunk and can break ByteLevelBPETokenizer; change the split
so the separator stays at the start of remainder instead of at the end of
complete_portion: compute split_idx = max(last_space_idx, last_newline_idx) and
when split_idx != -1 set complete_portion = text_chunk[:split_idx] and remainder
= text_chunk[split_idx:], and when split_idx == -1 (no separator) assign
complete_portion = "" and remainder = text_chunk so you never emit a chunk that
drops a leading-space token; refer to the variables/text handling in
tokenize_data.py (text_chunk, last_space_idx, last_newline_idx, split_idx,
complete_portion, remainder).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/model/attention.py`:
- Around line 17-22: In reshape_for_broadcast, replace the two assert statements
with explicit runtime checks that raise ValueError: verify ndim = x.ndim and
raise ValueError if ndim <= 1 (use a clear message referencing x.ndim), and
verify freqs_cis.shape == (x.shape[1], x.shape[-1]) raising ValueError that
includes the actual freqs_cis.shape and expected shape (derived from x.shape[1]
and x.shape[-1]); keep the rest of the function (shape computation and view)
unchanged and reference the function name reshape_for_broadcast and variable
names freqs_cis, x, ndim in the error messages for clarity.

In `@openverifiablellm/model/config.py`:
- Around line 18-34: The __post_init__ in model/config.py should explicitly
validate that num_attention_heads and num_key_value_heads are positive integers
before any division/modulo: add checks that num_attention_heads > 0 and
num_key_value_heads > 0 (after the existing defaulting of num_key_value_heads)
and raise clear ValueError messages if not; keep these checks before the
hidden_size % num_attention_heads and num_attention_heads % num_key_value_heads
validations so invalid zero or negative values produce a readable ValueError
instead of ZeroDivisionError or later failures when computing head_dim.

In `@openverifiablellm/model/model.py`:
- Around line 1-121: Run the code formatter ruff on this file to make CI happy:
apply `ruff format` (or your editor's ruff integration) to
openverifiablellm/model/model.py so whitespace, imports and line breaks are
normalized; ensure the formatted output preserves existing semantics in
classes/functions RMSNorm, MLP, TransformerBlock, VerifiableLLM and the
init_weights_deterministically method (no code changes, only formatting), then
stage the rewritten file and push the commit.
- Around line 98-121: The RNG restore logic must be placed inside a finally
block so any exception during initialization does not leak the seeded RNG state;
wrap the initialization loop that iterates self.modules() (checking nn.Linear,
nn.Embedding, RMSNorm) in try/finally, capture cpu_rng_state via
torch.get_rng_state() and gpu_rng_states via torch.cuda.get_rng_state_all()
(initialize gpu_rng_states to None if CUDA unavailable) before seeding, and in
the finally call torch.set_rng_state(cpu_rng_state) and, if gpu_rng_states is
not None, torch.cuda.set_rng_state_all(gpu_rng_states); re-raise any exception
after restore so behavior is unchanged.
- Around line 19-21: The forward method promotes activations to float32 because
self.weight is float32; to fix, ensure the weight is cast to the
input/normalized tensor dtype before multiplication: locate forward (and the
normalization call self._norm) and change the multiplication to use a
dtype-matched weight (e.g., cast self.weight to output/type of x via type_as or
to(x.dtype)) so mixed-precision (float16/bfloat16) is preserved; alternatively
ensure self.weight is stored/registered in the same dtype as the model tensors.

In `@openverifiablellm/train.py`:
- Around line 100-107: The ModelConfig currently hardcodes vocab_size=32000;
instead read and pass the tokenizer vocabulary size into Stage 4 and validate
dataset token ids against it before training. Locate where ModelConfig is
constructed (symbol: ModelConfig) and replace the literal 32000 with the
tokenizer/vocab source (e.g., args.tokenizer_vocab_size or a persisted value
loaded from the tokenizer or dataset metadata used in Stage 2), ensure the same
change is applied to the other ModelConfig instances around the 156-169 area,
and add a pre-training validation step that loads the max token id from
tokens.bin (or dataset metadata) and raises an error if max_id >= vocab_size so
out-of-range token ids are caught early.

In `@pyproject.toml`:
- Around line 14-18: Update the dependencies list entry for sentencepiece in
pyproject.toml to pin the exact version used in the lockfile (change
"sentencepiece" to "sentencepiece==0.2.1") so the manifest matches the lockfile;
locate the dependencies array containing "defusedxml", "sentencepiece",
"tokenizers==0.15.2" and replace the sentencepiece entry accordingly, then run
your dependency tool to verify the lockfile remains consistent.

In `@scripts/run_pipeline.py`:
- Around line 10-19: The subprocesses are running in the caller's CWD, causing
outputs and imports to be resolved incorrectly; update run_command to compute
the repository/project root (e.g., from the script's location or a passed-in
project_root) and call subprocess.run with the cwd argument set to that
project_root so every stage runs from the repo root; update any direct
subprocess.run calls in the same file (the other invocations referenced around
lines 33-70) to also pass cwd=project_root so file writes and module imports are
deterministic.

---

Duplicate comments:
In `@openverifiablellm/tokenize_data.py`:
- Around line 53-64: The current split keeps the separator character in
complete_portion which loses the leading-space context for the next chunk and
can break ByteLevelBPETokenizer; change the split so the separator stays at the
start of remainder instead of at the end of complete_portion: compute split_idx
= max(last_space_idx, last_newline_idx) and when split_idx != -1 set
complete_portion = text_chunk[:split_idx] and remainder =
text_chunk[split_idx:], and when split_idx == -1 (no separator) assign
complete_portion = "" and remainder = text_chunk so you never emit a chunk that
drops a leading-space token; refer to the variables/text handling in
tokenize_data.py (text_chunk, last_space_idx, last_newline_idx, split_idx,
complete_portion, remainder).
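The config-validation ordering suggested above (positivity before any modulo, then divisibility, then RoPE evenness) can be sketched with a hypothetical stand-in dataclass; names mirror the review prompt, not the PR's actual ModelConfig:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TinyModelConfig:
    """Hypothetical stand-in for ModelConfig, showing the
    __post_init__ check order suggested in the review above."""
    hidden_size: int = 256
    num_attention_heads: int = 8
    num_key_value_heads: Optional[int] = None

    def __post_init__(self):
        if self.num_key_value_heads is None:
            self.num_key_value_heads = self.num_attention_heads
        # Positivity first, so the modulo checks below never divide by zero
        if self.num_attention_heads <= 0 or self.num_key_value_heads <= 0:
            raise ValueError("head counts must be positive integers")
        if self.hidden_size % self.num_attention_heads:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        if self.num_attention_heads % self.num_key_value_heads:
            raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
        self.head_dim = self.hidden_size // self.num_attention_heads
        if self.head_dim % 2:
            raise ValueError("head_dim must be even for RoPE")
```

Failing fast here gives the caller a readable ValueError at construction time instead of a ZeroDivisionError or a shape mismatch deep inside attention.py.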

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b4dcfb94-8aaa-433d-82bc-b0389e8d50a1

📥 Commits

Reviewing files that changed from the base of the PR and between f3d7194 and 572cb93.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • openverifiablellm/model/attention.py
  • openverifiablellm/model/config.py
  • openverifiablellm/model/model.py
  • openverifiablellm/tokenize_data.py
  • openverifiablellm/train.py
  • pyproject.toml
  • scripts/run_pipeline.py
  • tests/test_model_init.py
  • tests/test_verify.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
openverifiablellm/tokenize_data.py (1)

58-67: ⚠️ Potential issue | 🟠 Major

This split still changes ByteLevel BPE ids.

ByteLevelBPETokenizer keeps leading whitespace with the following pretoken. With complete_portion = text_chunk[: split_idx + 1], "hello world" is encoded as "hello " + "world" instead of "hello" + " world", so the output still depends on chunk boundaries. The split_idx == -1 path also still emits arbitrary whitespace-free slices.

💡 Suggested direction
-            if split_idx != -1:
-                complete_portion = text_chunk[: split_idx + 1]
-                remainder = text_chunk[split_idx + 1 :]
-            else:
-                complete_portion = text_chunk
-                remainder = ""
+            if split_idx != -1:
+                complete_portion = text_chunk[:split_idx]
+                remainder = text_chunk[split_idx:]
+            else:
+                complete_portion = ""
+                remainder = text_chunk

If unbounded remainder is a concern, stream on a tokenizer-stable unit such as full lines/documents instead of arbitrary character chunks.
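The suggested direction above can be sketched as a standalone splitting helper: the separator stays at the start of the remainder so the next chunk keeps the leading space that byte-level BPE attaches to the following pretoken. The function name and signature are illustrative assumptions:

```python
def split_on_last_separator(text_chunk: str, remainder: str = "") -> tuple:
    """Split a streamed chunk so the separator travels with the remainder.

    Sketch of the fix suggested above: only the text before the last
    space/newline is safe to encode now; everything from the separator
    onward is carried into the next iteration.
    """
    buf = remainder + text_chunk
    split_idx = max(buf.rfind(" "), buf.rfind("\n"))
    if split_idx == -1:
        # No separator yet: encode nothing, carry everything forward
        return "", buf
    return buf[:split_idx], buf[split_idx:]
```

With this shape, "hello world" split mid-stream always re-encodes as "hello" now and " world" later, matching how a byte-level BPE tokenizer would segment the unsplit text; a final flush of the remainder is still needed at end of file.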

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenize_data.py` around lines 58 - 67, The current split
includes the whitespace in complete_portion which changes ByteLevel BPE ids;
update the split logic in tokenize_data.py so the remainder preserves leading
whitespace (i.e., set complete_portion = text_chunk[:split_idx] and remainder =
text_chunk[split_idx:] when split_idx != -1) so tokens that rely on leading
spaces remain stable; additionally, avoid arbitrary whitespace-free slices on
the split_idx == -1 path by either not splitting (complete_portion = text_chunk;
remainder = "") or, better, switch to streaming/splitting on tokenizer-stable
units (e.g., full lines/documents or newline-only splits) to guarantee stable
tokenization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/train.py`:
- Around line 78-97: The checkpoint saved by save_checkpoint currently omits
model and dataset metadata; update the checkpoint dict in save_checkpoint to
include a serializable model_config (e.g., model_config or model.get_config() /
as dict) and a dataset identifier/hash (e.g., dataset_id or dataset_hash
produced when preparing the dataset), keep environment_fingerprint and
iteration, then torch.save that extended checkpoint and
compute_merkle_root(checkpoint_path) as before; reference the save_checkpoint
function, the checkpoint dict keys, compute_merkle_root, and
generate_environment_fingerprint to locate where to add model_config and dataset
identifier/hash.

In `@scripts/run_pipeline.py`:
- Around line 31-46: The existence check for the input dump uses a relative Path
that is validated in the caller cwd but then passed into a subprocess executed
with cwd=project_root, causing false negatives for relative paths; fix by
resolving dump_path to an absolute path before the exists() check and before
building cmd1 (e.g., replace dump_path = Path(args.dump_path) with a resolved
Path and use that resolved dump_path in cmd1), so functions and symbols to
update are parser.parse_args()/dump_path (resolve it), the exists() check, and
the cmd1 passed to run_command.
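The path-resolution fix above can be sketched as a standalone helper. `resolve_dump_path` and `build_stage1_cmd` are hypothetical names introduced for illustration (the actual script inlines this logic around `parser.parse_args()`):

```python
import sys
from pathlib import Path


def resolve_dump_path(raw: str) -> Path:
    """Resolve the user-supplied dump path to an absolute path *before*
    any existence check or subprocess call, so a relative path remains
    valid even when the child process later runs with cwd=project_root."""
    dump_path = Path(raw).resolve()
    if not dump_path.exists():
        raise FileNotFoundError(f"Dump not found: {dump_path}")
    return dump_path


def build_stage1_cmd(dump_path: Path) -> list[str]:
    # The resolved (absolute) path is what goes into cmd1.
    return [sys.executable, "-m", "openverifiablellm.utils", str(dump_path)]
```

Calling `subprocess.run(build_stage1_cmd(p), cwd=project_root, check=True)` then behaves identically regardless of the caller's working directory.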

In `@tests/test_model_init.py`:
- Around line 14-35: Add a same-instance reinitialization regression check
inside test_model_initialization_is_deterministic: after computing hash1 for
model1, mutate model1 (e.g., run one optimizer step or modify a parameter
tensor), then call model1.init_weights_deterministically(seed=42) again and
compute a new hash (via get_model_hash) and assert it equals the original hash1;
this ensures VerifiableLLM.init_weights_deterministically fully resets an
already-used instance.
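The requested regression check can be sketched with a toy stand-in. `ToyModel`, `init_weights_deterministically`, and `get_model_hash` below are illustrative placeholders mirroring the names in the review comment, not the project's real `VerifiableLLM` API:

```python
import hashlib
import random


class ToyModel:
    """Stand-in for VerifiableLLM: a flat list of float parameters."""

    def __init__(self, n: int = 8):
        self.params = [0.0] * n

    def init_weights_deterministically(self, seed: int) -> None:
        # A fresh, seeded RNG guarantees identical weights on every call,
        # regardless of any state the instance accumulated in between.
        rng = random.Random(seed)
        self.params = [rng.random() for _ in self.params]


def get_model_hash(model: ToyModel) -> str:
    blob = ",".join(repr(p) for p in model.params).encode()
    return hashlib.sha256(blob).hexdigest()


def test_reinit_resets_used_instance() -> None:
    model = ToyModel()
    model.init_weights_deterministically(seed=42)
    hash1 = get_model_hash(model)

    model.params[0] += 1.0                 # simulate an optimizer step
    assert get_model_hash(model) != hash1  # state really changed

    model.init_weights_deterministically(seed=42)
    assert get_model_hash(model) == hash1  # reinit fully restores it
```

The key property is that reinitialization depends only on the seed, never on leftover instance state.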


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fc85a01f-fd1e-4142-bca9-ef01033efb9c

📥 Commits

Reviewing files that changed from the base of the PR and between 572cb93 and b684f73.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • openverifiablellm/model/attention.py
  • openverifiablellm/model/config.py
  • openverifiablellm/model/model.py
  • openverifiablellm/tokenize_data.py
  • openverifiablellm/train.py
  • pyproject.toml
  • scripts/run_pipeline.py
  • tests/test_model_init.py

…kpoints, reinit test, path resolution, vocab_size arg
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 10, 2026

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
openverifiablellm/train.py (1)

87-93: ⚠️ Potential issue | 🟠 Major

Persist the dataset identifier in each checkpoint.

checkpoint now records model_config, but it still doesn’t bind the artifact to the dataset hash / manifest root produced in Stages 1-2. As written, the Merkle root only proves checkpoint integrity, not which corpus/tokenization the weights came from.

💡 Suggested direction
-def save_checkpoint(
-    model: torch.nn.Module, optimizer: torch.optim.Optimizer, iteration: int, output_dir: Path
-):
+def save_checkpoint(
+    model: torch.nn.Module,
+    optimizer: torch.optim.Optimizer,
+    iteration: int,
+    output_dir: Path,
+    dataset_id: str,
+):
@@
     checkpoint = {
         "model_state_dict": model.state_dict(),
         "optimizer_state_dict": optimizer.state_dict(),
         "iteration": iteration,
         "model_config": asdict(model.config),
+        "dataset_id": dataset_id,
         "environment_fingerprint": generate_environment_fingerprint(),
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/train.py` around lines 87-93: the checkpoint currently
serializes model_state_dict, optimizer_state_dict, iteration, model_config, and
environment_fingerprint but omits the dataset identifier; update the checkpoint
creation (the dict built where "checkpoint =" is assigned) to include the
dataset manifest root/hash produced in Stages 1–2 (use the same variable name
used earlier in the pipeline, e.g. dataset_manifest_root or dataset_hash),
storing it under a clear key like "dataset_manifest_root" or "dataset_id" so
each checkpoint cryptographically binds weights to the corpus/tokenization;
ensure the value comes from the existing stage output (not recomputed) and is
included alongside model_config and environment_fingerprint in the checkpoint
dict.
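A minimal sketch of the suggested binding, with `json` standing in for `torch.save` and a plain file SHA-256 standing in for `compute_merkle_root`; all names besides the checkpoint keys are assumptions for illustration:

```python
import hashlib
import json
from pathlib import Path


def save_checkpoint(state: dict, iteration: int, model_config: dict,
                    dataset_id: str, output_dir: Path) -> str:
    """Persist a checkpoint that carries the Stage 1-2 dataset manifest
    root, so the artifact hash attests *both* the weights and the exact
    corpus/tokenization they were trained on."""
    checkpoint = {
        "model_state_dict": state,
        "iteration": iteration,
        "model_config": model_config,
        "dataset_id": dataset_id,   # manifest root produced in Stages 1-2
    }
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"ckpt_{iteration:06d}.json"
    path.write_text(json.dumps(checkpoint, sort_keys=True))
    # Stand-in for compute_merkle_root(checkpoint_path).
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

Because `dataset_id` sits inside the serialized file, any change to the corpus manifest changes the checkpoint hash, which is exactly the binding the review asks for.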
openverifiablellm/tokenize_data.py (1)

56-64: ⚠️ Potential issue | 🟠 Major

Don’t split inside the final whitespace run.

split_idx currently lands on the last " " / "\n" character, so a chunk ending with " foo" is emitted as " " + " foo". That still changes the text spans seen by ByteLevelBPETokenizer.encode(), so tokens.bin can depend on chunk placement rather than corpus content. Move the boundary back to the start of the trailing whitespace run, and prefer str.isspace() so tabs / \r\n are handled too.

💡 Suggested direction
-            last_space_idx = text_chunk.rfind(" ")
-            last_newline_idx = text_chunk.rfind("\n")
-            split_idx = max(last_space_idx, last_newline_idx)
-
-            if split_idx != -1:
+            split_idx = len(text_chunk)
+            while split_idx > 0 and not text_chunk[split_idx - 1].isspace():
+                split_idx -= 1
+            while split_idx > 0 and text_chunk[split_idx - 1].isspace():
+                split_idx -= 1
+
+            if split_idx != len(text_chunk):
                 complete_portion = text_chunk[:split_idx]
                 remainder = text_chunk[split_idx:]
             else:
                 complete_portion = ""
                 remainder = text_chunk
For Hugging Face tokenizers 0.15.x, does ByteLevelBPETokenizer pre-tokenize consecutive whitespace as a single span, and can BPE merges cross pre-tokenizer boundaries?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenize_data.py` around lines 56-64: the split currently
picks the last space/newline index (variables text_chunk, last_space_idx,
last_newline_idx, split_idx) which leaves the final whitespace char in the
emitted chunk; instead find the start of the trailing whitespace run and split
there using str.isspace() so tabs and CRLF are handled too. Replace the
max(last_space_idx, last_newline_idx) logic with a backward scan from the chosen
boundary to find the first index where text_chunk[i].isspace() stops (i.e. start
of the trailing whitespace run), set split_idx to that start (or -1 if none),
then set complete_portion = text_chunk[:split_idx] and remainder =
text_chunk[split_idx:] as before. Ensure you handle the no-whitespace case the
same as now.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8a952ee6-d5d2-4a4b-b623-2b54f43c48d2

📥 Commits

Reviewing files that changed from the base of the PR and between b684f73 and 6945b9a.

📒 Files selected for processing (4)
  • openverifiablellm/tokenize_data.py
  • openverifiablellm/train.py
  • scripts/run_pipeline.py
  • tests/test_model_init.py



Development

Successfully merging this pull request may close these issues.

[FEATURE]: Implement Deterministic Dataset Encoding Pipeline and Verifiable LLaMA Model (Model Architecture)

1 participant