Deterministic Preprocessing Verification Mode that re-runs dataset preprocessing and validates that all generated artifacts #25

Merged
Archit381 merged 13 commits into AOSSIE-Org:main from aniket866:preprocessing-verification
Mar 5, 2026

Conversation


@aniket866 aniket866 commented Mar 2, 2026

Addressed Issues:

Closes #13

Detailed PR report:

PR_Report_VerificationMode.pdf

This directly addresses Gap 5.1 from the framework paper — the missing cryptographic link between raw data and training-ready data.
Reference : 2503.22573v1.pdf

Short Description:

Before this PR, the pipeline could generate hashes and Merkle roots, but had no automated way to re-validate them.

This PR introduces a dedicated verification module for the OpenVerifiableLLM data pipeline. Its job is simple: given the same raw Wikipedia dump, re-run the entire preprocessing step from scratch and prove that the output produced today is byte-for-byte identical to the output recorded in the manifest at first run.
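The byte-for-byte guarantee described above boils down to "re-run, re-hash, compare". A minimal standalone sketch of that idea follows; `preprocess` here is a hypothetical stand-in for the real pipeline, not the project's actual code:

```python
import hashlib

def preprocess(raw: bytes) -> bytes:
    # Deterministic stand-in for the real cleaning step (hypothetical).
    return raw.strip().lower()

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

raw_dump = b"  Some RAW Wikipedia text  "

# First run: record the processed digest in a manifest.
manifest = {"processed_sha256": sha256_hex(preprocess(raw_dump))}

# Verification run: re-run from scratch and compare byte-for-byte.
recomputed = sha256_hex(preprocess(raw_dump))
status = "PASS" if recomputed == manifest["processed_sha256"] else "FAIL"
```

Because the preprocessing is deterministic, any divergence between `recomputed` and the manifest value proves the raw data, the code, or the recorded artifacts changed.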


Minimal Command Summary

  • Install dependency:
pip install defusedxml
  • Download:
python -c "import urllib.request; urllib.request.urlretrieve('https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2', 'simplewiki-latest-pages-articles.xml.bz2')"
  • Configure Python Path (Important)
  • If setup.py exists:

pip install -e .

  • If NOT using setup.py:

set PYTHONPATH=C:\Users\Hp\OpenVerifiableLLM   (replace with your own project location)

  • Test import:

python -c "import openverifiablellm.utils; print('OK')"


  • Preprocess:
python -m openverifiablellm.utils simplewiki-latest-pages-articles.xml.bz2
  • Verify:
python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2
  • Run tests:
    python test_verify.py -v

Detailed Run from scratch:

1. Install Required Dependency

pip install defusedxml

2. Download Wikipedia Dump

Recommended method (works inside CMD):

python -c "import urllib.request; urllib.request.urlretrieve('https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2', 'simplewiki-latest-pages-articles.xml.bz2'); print('Download complete')"

Verify download:

dir simplewiki-latest-pages-articles.xml.bz2

3. Configure Python Path (Important)

  • If setup.py exists:

pip install -e .

  • If NOT using setup.py:

set PYTHONPATH=C:\Users\Hp\OpenVerifiableLLM

  • Test import:

python -c "import openverifiablellm.utils; print('OK')"


4. Run Preprocessing

python -m openverifiablellm.utils simplewiki-latest-pages-articles.xml.bz2

This will:

  • Clean Wikipedia XML
  • Generate processed text file
  • Generate dataset_manifest.json
  • Compute SHA256 + Merkle roots
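As a rough illustration of the Merkle-root step above, the following standalone sketch hashes fixed-size chunks and combines them pairwise; the project's actual `compute_merkle_root` chunking and odd-level padding conventions may differ:

```python
import hashlib

def merkle_root(data: bytes, chunk_size: int = 4) -> str:
    # Hash each fixed-size chunk, then combine pairs until one root remains.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    level = [hashlib.sha256(c).hexdigest() for c in chunks]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node on odd levels (assumed convention)
            level.append(level[-1])
        level = [hashlib.sha256((a + b).encode()).hexdigest()
                 for a, b in zip(level[0::2], level[1::2])]
    return level[0]

# Identical input always yields the identical root; any byte change flips it.
root = merkle_root(b"deterministic preprocessing output")
```

A single-chunk input reduces to a plain SHA256 of that chunk, which is why the Merkle root and file hash agree for tiny files under this construction.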

After completion, verify files:

dir data
dir data\processed\

Expected:

  • data\dataset_manifest.json
  • data\processed\wiki_clean.txt

5. Run Verification (Core Step)

This re-processes the dataset and checks all hashes.

Run:

python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2

If everything is correct:
Result: ALL CHECKS PASSED

  • Optional JSON report:

python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2 --json verification_report.json
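If you consume the JSON report in CI, a filter like the following works. Note the schema used here (a top-level "checks" list with "name"/"status" fields) is an assumption modeled on the CLI's PASS/FAIL/SKIP summary, not a documented contract; check it against an actual verification_report.json.

```python
import json

# Hypothetical report shape; verify against a real verification_report.json.
report = json.loads("""
{"checks": [
    {"name": "raw_sha256", "status": "PASS"},
    {"name": "processed_sha256", "status": "PASS"},
    {"name": "processed_merkle_root", "status": "SKIP"}
]}
""")

failed = [c["name"] for c in report["checks"] if c["status"] == "FAIL"]
verdict = "ALL CHECKS PASSED" if not failed else "FAILED: " + ", ".join(failed)
```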


Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Added a deterministic preprocessing verification mode with a CLI that re-runs preprocessing, reports PASS/FAIL/SKIP checks, and can emit a human-readable or JSON verification report.
    • Improved XML text extraction to auto-detect and handle compressed (bz2) and uncompressed Wikipedia dumps.
    • Added a Wikipedia dump downloader with optional MD5 checksum verification and reuse of existing verified files.
  • Tests

    • Comprehensive test suite for verification flows (happy and failure paths) and new uncompressed XML extraction tests.


coderabbitai bot commented Mar 2, 2026

Walkthrough

Adds a deterministic preprocessing verification mode (API + CLI) that re-runs preprocessing in an isolated temp directory, recomputes artifacts (SHA256, optional Merkle roots, metadata, processed output), compares them to an existing manifest, and emits a structured VerificationReport of per-check PASS/FAIL/SKIP results.
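The per-check comparison can be pictured with this standalone re-creation of the `CheckResult`/`CheckStatus`/`_check_field` idea (names taken from the PR's walkthrough; the actual implementation in openverifiablellm/verify.py may differ):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class CheckStatus(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    SKIP = "SKIP"   # manifest field absent, so nothing to compare

@dataclass
class CheckResult:
    name: str
    status: CheckStatus
    expected: Optional[str] = None
    actual: Optional[str] = None

def check_field(name: str, expected, actual) -> CheckResult:
    # Missing expected value -> SKIP; otherwise strict equality decides PASS/FAIL.
    if expected is None:
        return CheckResult(name, CheckStatus.SKIP)
    status = CheckStatus.PASS if expected == actual else CheckStatus.FAIL
    return CheckResult(name, status, str(expected), str(actual))
```

Legacy manifests that lack a field naturally yield SKIP rather than FAIL, matching the walkthrough's three-state reporting.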

Changes

Cohort / File(s) Summary
Verification Core
openverifiablellm/verify.py
New deterministic verification implementation: verify_preprocessing(), VerificationReport, CheckResult, CheckStatus, helpers (_load_manifest, _check_field), CLI main(), report formatting and JSON export.
Verification Tests
tests/test_verify.py
Comprehensive unit/integration tests for verification flow (happy & failure paths), legacy-manifest behavior, and in-test stubs for preprocessing utilities.
Utilities (I/O & hashing)
openverifiablellm/utils.py, tests/test_util.py
extract_text_from_xml auto-detects bz2 vs uncompressed input; compute_sha256 signature adjusted to file_path= and tests updated; added uncompressed XML extraction test.
Download Script
scripts/download_dump.py
New script to download Wikimedia dumps with streaming progress, optional MD5 verification, URL builders, helpers, and CLI entry point.
Test adjustments / signatures
tests/*
Tests updated for compute_sha256(file_path=...) signature and to exercise new verification behavior; added test utilities and formatting tweaks.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as "verify CLI"
    participant Utils as "openverifiablellm.utils"
    participant Prep as "Preprocessing\n(Temp workspace)"
    participant Manifest as "dataset_manifest.json"
    participant Report as "VerificationReport"

    User->>CLI: run verify_preprocessing(input_dump, manifest_path)
    CLI->>Manifest: _load_manifest(manifest_path)
    Manifest-->>CLI: expected artifact metadata
    CLI->>Utils: read raw file / compute raw_sha256
    Utils-->>CLI: raw data / raw_sha256
    CLI->>Prep: run preprocessing in isolated temp dir
    Prep->>Utils: compute processed SHA256 and merkle roots
    Utils-->>Prep: computed artifacts
    CLI->>Report: compare expected vs actual (_check_field)
    Report-->>CLI: CheckResult entries (PASS/FAIL/SKIP)
    CLI->>User: print formatted summary and optional JSON

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I hopped through bytes and logs today,
Re-ran preprocessing, checked each array,
Hashes matched, merkle roots in view,
I stamped my paw — the checks went through,
Carrot for me, and a tidy report too.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately summarizes the main objective: implementing a deterministic preprocessing verification mode that re-runs and validates generated artifacts against the manifest.
  • Linked Issues check: ✅ Passed. The PR successfully implements all core requirements from issue #13: CLI entry point (main() function), manifest loading and comparison logic, artifact recomputation (SHA256, Merkle roots, metadata), PASS/FAIL reporting, and proper exit codes.
  • Out of Scope Changes check: ✅ Passed. All changes are scoped to verification requirements. The new download_dump.py script and utils.py modifications (bz2 auto-detection) directly support the verification workflow and are necessary dependencies for end-to-end functionality.


@aniket866 aniket866 changed the title from the minimal command summary to Deterministic Preprocessing Verification Mode that re-runs dataset preprocessing and validates that all generated artifacts Mar 2, 2026
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 2, 2026
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 323-337: The code mutates process-wide cwd via os.chdir(tmp_dir)
before calling utils.extract_text_from_xml, causing a concurrency race; replace
this pattern by invoking the extraction in an isolated subprocess (or
subprocess.run) with cwd=tmp_dir so the child process has its working directory
set explicitly and the parent process’ cwd is not modified. Locate the block
using os.chdir, original_cwd, and the call to utils.extract_text_from_xml and
change it to spawn a subprocess that runs the extraction (propagate
stdout/stderr or capture errors), map subprocess failures to the same
CheckResult (name="reprocessing_succeeded", status=CheckStatus.FAIL,
detail=...), and remove the global os.chdir/restore dance. Ensure any temp-dir
cleanup and error-to-report mapping remains intact.
- Around line 231-239: The try/except around _load_manifest(manifest_path) only
catches FileNotFoundError so a malformed manifest raising json.JSONDecodeError
will propagate; update the except to also catch json.JSONDecodeError (or
Exception if preferred) and add a failing CheckResult (name="manifest_exists" or
new name like "manifest_valid"), using CheckStatus.FAIL and include the error
details (str(exc)) so the report is generated instead of crashing; reference
_load_manifest, manifest_path, report, CheckResult, and CheckStatus when making
the change.

In `@tests/test_verify.py`:
- Around line 6-8: Update the docstring in tests/test_verify.py that currently
suggests running "python test_verify.py -v" to reference the correct invocation
and path—replace it with a runnable command such as "python -m pytest
tests/test_verify.py -v" (or simply "pytest tests/test_verify.py -v") so the
docstring accurately reflects running the test from the project root; update the
string at the top of tests/test_verify.py where the run command is documented.
- Around line 120-135: The test currently imports verify as a top-level module
(import verify as verify_module and from verify import ...), but the module
lives under the package namespace openverifiablellm
(openverifiablellm/verify.py); update the imports to import from
openverifiablellm.verify (e.g., import openverifiablellm.verify as verify_module
or from openverifiablellm.verify import CheckResult, CheckStatus,
VerificationReport, _check_field, _load_manifest, verify_preprocessing) so the
test uses the package-qualified module name and matches the stubbed package in
sys.modules.
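The cwd-isolation fix suggested above for verify.py can be sketched as follows; the child command here is a stand-in for the real call to utils.extract_text_from_xml, and only the pattern (subprocess with an explicit cwd, parent cwd untouched) is the point:

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Run the work in a child process whose cwd is the temp dir; the parent's
    # working directory is never mutated, so concurrent verifications are safe.
    result = subprocess.run(
        [sys.executable, "-c",
         "from pathlib import Path; Path('out.txt').write_text('done')"],
        cwd=tmp_dir, capture_output=True, text=True,
    )
    succeeded = (result.returncode == 0
                 and os.path.exists(os.path.join(tmp_dir, "out.txt")))
```

A nonzero return code or captured stderr would map to the failing CheckResult described in the review comment.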

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b253f1 and e8a2381.

📒 Files selected for processing (2)
  • openverifiablellm/verify.py
  • tests/test_verify.py

@github-actions github-actions bot added enhancement New feature or request size/XL and removed size/XL labels Mar 2, 2026

fix: resolve compute_sha256 signature mismatch in test_util.py
@Archit381 Archit381 self-requested a review March 5, 2026 12:24

Archit381 commented Mar 5, 2026

@aniket866 Can you look into the following changes:

  • Instead of creating a fake package, copying all the utils code into it, and then running the test in tests\test_verify.py, it's better to just import utils. That way, if utils is updated in the future, this test will verify whether it still works as intended
  • Instead of using an inline script to download the dump, can you create a dedicated Python script for it

Also a reminder to pull from main before you make these changes

@aniket866
Contributor Author

PR: Maintainer Review Fixes — Verification Mode

What Changed

1. tests/test_verify.py — Remove fake utils stub, import real package

After: Replaced entirely with a direct import of the real package.

# before — fake stub injected at runtime
UTILS_CODE = r"""
def compute_sha256(...): ...
def compute_merkle_root(...): ...
"""
_utils = types.ModuleType("openverifiablellm.utils")
exec(compile(UTILS_CODE, "<utils_stub>", "exec"), _utils.__dict__)
sys.modules["openverifiablellm.utils"] = _utils

# after — one real import

from openverifiablellm import utils


2. scripts/download_dump.py — Dedicated download script

Before: Download was a long inline one-liner passed to python -c.

After: A proper standalone script at scripts/download_dump.py with full CLI support.

python scripts/download_dump.py                              # latest simplewiki
python scripts/download_dump.py --wiki simplewiki            # explicit wiki
python scripts/download_dump.py --date 20260201             # specific snapshot
python scripts/download_dump.py --output-dir data/raw       # custom output folder
python scripts/download_dump.py --no-verify                 # skip MD5 check

Key features added over the one-liner:

  • MD5 checksum verification against Wikimedia's own checksum file
  • Progress bar showing MB downloaded
  • Skips re-download if file already exists and passes checksum
  • Prints the exact next command to run after download completes
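The checksum feature above amounts to a streamed MD5 over the downloaded file compared against Wikimedia's published digest. A minimal standalone version (function name hypothetical, not the script's actual API):

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream in 1 MiB chunks so multi-GB dumps never load into memory at once.
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo: write a small file and verify it against a known digest.
with tempfile.TemporaryDirectory() as d:
    dump = Path(d) / "dump.xml.bz2"
    dump.write_bytes(b"example payload")
    expected = hashlib.md5(b"example payload").hexdigest()
    verified = md5_of_file(dump) == expected
```

The same streamed loop also supports the skip-if-already-verified behavior: re-hash the existing file and only re-download on mismatch.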

Files Changed

File Change
tests/test_verify.py Removed fake stub — now imports real openverifiablellm.utils
scripts/download_dump.py New file — dedicated download script with checksum verification

How to Run — Full End-to-End

Step 1 — Install dependencies

pip install defusedxml

Step 2 — Download the Wikipedia dump

python scripts/download_dump.py

Or with options:

python scripts/download_dump.py --wiki simplewiki --date 20260201 --output-dir .

Step 3 — Run preprocessing

python -m openverifiablellm.utils simplewiki-latest-pages-articles.xml.bz2

Expected output:

INFO - Preprocessing complete. Output saved to data\processed\wiki_clean.txt
INFO - Manifest written to data\dataset_manifest.json

Step 4 — Run verification

python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2

Expected output:

============================================================
  Deterministic Preprocessing Verification Report

Step 5 — Run the test suite

python -m pytest tests/test_verify.py -v

Or without pytest:

python tests/test_verify.py -v

Expected output:

Ran 37 tests in ~0.4s
OK

Optional — Save verification report as JSON

python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2 --json report.json


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)

284-285: ⚠️ Potential issue | 🟡 Minor

Remove stale migration comments that contradict current code.

These comments (“function definition … should be deleted”) are obsolete and misleading during maintenance.

Suggested cleanup
-# Content before line 270 remains unchanged
-# Entire function definition from lines 270-314 should be deleted
 def verify_merkle_proof_from_file(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 284 - 285, Remove the stale
migration comment block in openverifiablellm.utils that says “# Entire function
definition from lines 270-314 should be deleted” (and any similar obsolete
migration notes) so the file no longer contains misleading comments; delete that
commented section and, if necessary, replace it with a brief, accurate comment
or remove the line entirely so only current, relevant documentation remains.
♻️ Duplicate comments (1)
openverifiablellm/verify.py (1)

233-241: ⚠️ Potential issue | 🟠 Major

Handle malformed manifest JSON as a reportable failure, not an exception.

json.JSONDecodeError from _load_manifest() is not caught here, so verification can terminate without producing a structured FAIL report.

Suggested fix
-    except FileNotFoundError as exc:
+    except FileNotFoundError as exc:
         report.add(CheckResult(
             name="manifest_exists",
             status=CheckStatus.FAIL,
             detail=str(exc),
         ))
         return report
+    except json.JSONDecodeError as exc:
+        report.add(CheckResult(
+            name="manifest_valid_json",
+            status=CheckStatus.FAIL,
+            detail=f"Invalid manifest JSON: {exc}",
+        ))
+        return report
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/verify.py` around lines 233 - 241, The code currently only
catches FileNotFoundError when calling _load_manifest(manifest_path) so a
json.JSONDecodeError will propagate; update the try/except around the
_load_manifest(manifest_path) call to also catch json.JSONDecodeError (or
Exception specific to JSON parsing) and, in that except block, add a CheckResult
with name="manifest_exists" (or a more specific name like
"manifest_valid_json"), status=CheckStatus.FAIL, and detail=str(exc) to report,
then return report; reference _load_manifest, manifest_path, report,
CheckResult, and CheckStatus to locate where to add this additional except
handler and reporting logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 270-277: The code reads manifest["chunk_size_bytes"] and passes it
straight to utils.compute_merkle_root (and similarly in the second block around
lines 374-383), which can crash if the value is non-integer or <=0; validate and
coerce the manifest chunk_size_bytes before use: in the block that sets
chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
(and the analogous block later), check that the value is an int (or can be
converted to int) and > 0, otherwise replace it with
utils.MERKLE_CHUNK_SIZE_BYTES (or record a validation error on report) and then
call utils.compute_merkle_root with the validated chunk_size; reference the
variables/handlers chunk_size, manifest["chunk_size_bytes"],
utils.compute_merkle_root, and _check_field to locate the fix.
- Around line 398-403: Add an explicit parity check for chunk_size_bytes by
calling _check_field with report, a descriptive field key (e.g.,
"manifest_chunk_size_bytes"), expected=manifest.get("chunk_size_bytes"),
actual=reproduced_manifest.get("chunk_size_bytes"), and detail="Chunk size in
bytes"; place this call alongside the existing _check_field invocation that
compares "preprocessing_version" so the verifier fails when chunk_size_bytes
diverge.

In `@scripts/download_dump.py`:
- Around line 6-8: Update the module-level docstring (and any user-facing help
text) in the download_dump script to accurately state that the script verifies
downloads using an MD5 checksum fetched from Wikimedia rather than SHA256; if
there are functions like download_dump(), verify_checksum() or fetch_checksum()
that still mention SHA256 in their docstrings or log messages, change those to
MD5 as well so the documentation matches the actual verification algorithm being
used.
- Around line 133-140: The function currently treats missing checksum metadata
as non-fatal by returning True when _fetch_expected_md5(checksum_url, dest.name)
yields None, which lets --verify silently succeed; change this so that when
checksum verification was explicitly requested (the --verify flag / verification
parameter), a missing expected value is treated as a failure — log an error via
logger.error (or a clearer warning) and return False; if verification was not
explicitly requested, keep the current behavior (log a warning and return True).
Locate the call to _fetch_expected_md5(checksum_url, dest.name) and branch based
on the verification flag/parameter (the --verify flag or function parameter)
instead of always returning True on None.
- Around line 267-268: The printed next-step command uses dest.name which loses
path context and produces an incorrect runnable command for custom --output-dir;
update the two print statements in scripts/download_dump.py to include the full
destination path (e.g., str(dest) or dest.resolve()) instead of dest.name so the
printed command (python -m openverifiablellm.utils <dest_path>) points to the
actual output directory and works from the project root; ensure the path is
properly stringified/quoted when inserted into the f-string.

In `@tests/test_verify.py`:
- Around line 186-202: Add a regression test in TestLoadManifest that ensures
_load_manifest handles malformed JSON gracefully: simulate a file named
"dataset_manifest.json" (or any manifest filename like "bad_manifest.json")
containing invalid JSON, call _load_manifest on its Path and assert that the
verifier returns a FAIL result (or raises a specific handled exception converted
to a fail) instead of letting an uncaught exception propagate; update or add a
test method (e.g., test_malformed_manifest_returns_fail) alongside existing
tests to create the bad file, invoke _load_manifest, and assert the expected
fail behavior.

---

Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 284-285: Remove the stale migration comment block in
openverifiablellm.utils that says “# Entire function definition from lines
270-314 should be deleted” (and any similar obsolete migration notes) so the
file no longer contains misleading comments; delete that commented section and,
if necessary, replace it with a brief, accurate comment or remove the line
entirely so only current, relevant documentation remains.

---

Duplicate comments:
In `@openverifiablellm/verify.py`:
- Around line 233-241: The code currently only catches FileNotFoundError when
calling _load_manifest(manifest_path) so a json.JSONDecodeError will propagate;
update the try/except around the _load_manifest(manifest_path) call to also
catch json.JSONDecodeError (or Exception specific to JSON parsing) and, in that
except block, add a CheckResult with name="manifest_exists" (or a more specific
name like "manifest_valid_json"), status=CheckStatus.FAIL, and detail=str(exc)
to report, then return report; reference _load_manifest, manifest_path, report,
CheckResult, and CheckStatus to locate where to add this additional except
handler and reporting logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1dc90ecd-1175-49ad-b635-3e4efff1d259

📥 Commits

Reviewing files that changed from the base of the PR and between e8a2381 and 9c1ed14.

📒 Files selected for processing (5)
  • openverifiablellm/utils.py
  • openverifiablellm/verify.py
  • scripts/download_dump.py
  • tests/test_util.py
  • tests/test_verify.py

Comment on lines +270 to +277
if "raw_merkle_root" in manifest:
chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
raw_merkle_actual = utils.compute_merkle_root(input_dump, chunk_size=chunk_size)
_check_field(
report, "raw_merkle_root",
expected=manifest["raw_merkle_root"],
actual=raw_merkle_actual,
detail=f"Merkle root of raw dump (chunk={chunk_size} bytes)",

⚠️ Potential issue | 🟠 Major

Validate chunk_size_bytes before Merkle recomputation to avoid verifier crashes.

chunk_size_bytes is read from manifest and passed directly to compute_merkle_root(). Non-integer or non-positive values can raise and abort verification.

Suggested fix
-        chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
+        chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
+        if not isinstance(chunk_size, int) or chunk_size <= 0:
+            report.add(CheckResult(
+                name="chunk_size_bytes",
+                status=CheckStatus.FAIL,
+                expected=str(utils.MERKLE_CHUNK_SIZE_BYTES),
+                actual=str(chunk_size),
+                detail="Manifest chunk_size_bytes must be a positive integer",
+            ))
+            return report

Also applies to: 374-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/verify.py` around lines 270 - 277, The code reads
manifest["chunk_size_bytes"] and passes it straight to utils.compute_merkle_root
(and similarly in the second block around lines 374-383), which can crash if the
value is non-integer or <=0; validate and coerce the manifest chunk_size_bytes
before use: in the block that sets chunk_size = manifest.get("chunk_size_bytes",
utils.MERKLE_CHUNK_SIZE_BYTES) (and the analogous block later), check that the
value is an int (or can be converted to int) and > 0, otherwise replace it with
utils.MERKLE_CHUNK_SIZE_BYTES (or record a validation error on report) and then
call utils.compute_merkle_root with the validated chunk_size; reference the
variables/handlers chunk_size, manifest["chunk_size_bytes"],
utils.compute_merkle_root, and _check_field to locate the fix.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
openverifiablellm/verify.py (2)

233-241: ⚠️ Potential issue | 🟠 Major

Malformed manifest JSON is still unhandled in the load step.

If dataset_manifest.json is present but invalid, json.JSONDecodeError will propagate and crash instead of producing a failing check result.

✅ Proposed fix
     try:
         manifest = _load_manifest(manifest_path)
     except FileNotFoundError as exc:
         report.add(CheckResult(
             name="manifest_exists",
             status=CheckStatus.FAIL,
             detail=str(exc),
         ))
         return report
+    except json.JSONDecodeError as exc:
+        report.add(CheckResult(
+            name="manifest_valid_json",
+            status=CheckStatus.FAIL,
+            detail=f"Manifest is not valid JSON: {exc}",
+        ))
+        return report
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/verify.py` around lines 233 - 241, The load step only
handles FileNotFoundError, so invalid JSON in _load_manifest(manifest_path) will
raise json.JSONDecodeError and crash; update the try/except around
_load_manifest to also catch json.JSONDecodeError (import json if not already)
and add a failing CheckResult to report (e.g. name="manifest_valid_json" or
similar) with status=CheckStatus.FAIL and detail=str(exc), then return report
just like the FileNotFoundError branch so malformed manifest yields a proper
failing check.

389-393: ⚠️ Potential issue | 🟠 Major

processed_merkle_root path lacks chunk_size_bytes validation.

Line 390 reads chunk_size_bytes and immediately uses it in compute_merkle_root. Non-integer or non-positive values can raise and abort verification.

🧩 Proposed fix
         if "processed_merkle_root" in manifest:
             chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
+            if not isinstance(chunk_size, int) or chunk_size <= 0:
+                report.add(CheckResult(
+                    name="chunk_size_bytes",
+                    status=CheckStatus.FAIL,
+                    expected=str(utils.MERKLE_CHUNK_SIZE_BYTES),
+                    actual=str(chunk_size),
+                    detail="Manifest chunk_size_bytes must be a positive integer",
+                ))
+                return report
             proc_merkle_actual = utils.compute_merkle_root(
                 reproduced_processed, chunk_size=chunk_size
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/verify.py` around lines 389 - 393, The manifest branch that
handles "processed_merkle_root" must validate the "chunk_size_bytes" value
before passing it to utils.compute_merkle_root: read
manifest.get("chunk_size_bytes"), coerce to an int (e.g., int(...) with safe
handling), ensure it is a positive integer (>0), and if coercion fails or the
value is non-positive, fall back to utils.MERKLE_CHUNK_SIZE_BYTES (or raise a
clear VerificationError if that’s preferred); then call
utils.compute_merkle_root(reproduced_processed,
chunk_size=validated_chunk_size). Also consider logging or returning a
descriptive error when the provided chunk_size is invalid to aid debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f627a475-414c-429b-b4fc-4c7e04bdf95b

📥 Commits

Reviewing files that changed from the base of the PR and between 9c1ed14 and 65722ed.

📒 Files selected for processing (2)
  • openverifiablellm/verify.py
  • scripts/download_dump.py

Comment on lines +288 to +293
_check_field(
report, "manifest_chunk_size_bytes",
expected=manifest.get("chunk_size_bytes"),
actual=reproduced_manifest.get("chunk_size_bytes"),
detail="Merkle chunk size used during preprocessing",
)
⚠️ Potential issue | 🔴 Critical

Undefined reproduced_manifest causes a runtime crash.

Line 291 uses reproduced_manifest before it is initialized (initialized at Line 411). This raises NameError whenever the raw Merkle branch executes and prevents report completion.

🐛 Proposed fix (move parity check after reproduced manifest is loaded)
-        _check_field(
-            report, "manifest_chunk_size_bytes",
-            expected=manifest.get("chunk_size_bytes"),
-            actual=reproduced_manifest.get("chunk_size_bytes"),
-            detail="Merkle chunk size used during preprocessing",
-        )
@@
         if reproduced_manifest_path.exists():
             with reproduced_manifest_path.open() as f:
                 reproduced_manifest = json.load(f)
 
+            _check_field(
+                report, "manifest_chunk_size_bytes",
+                expected=manifest.get("chunk_size_bytes"),
+                actual=reproduced_manifest.get("chunk_size_bytes"),
+                detail="Merkle chunk size used during preprocessing",
+            )
+
             _check_field(
                 report, "manifest_preprocessing_version",
                 expected=manifest.get("preprocessing_version"),
                 actual=reproduced_manifest.get("preprocessing_version"),
                 detail="Preprocessing version tag",
             )

Also applies to: 409-418

🧰 Tools
🪛 Ruff (0.15.2)

[error] 291-291: Undefined name reproduced_manifest

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/verify.py` around lines 288 - 293, The parity check calling
_check_field with reproduced_manifest is executed before reproduced_manifest is
created, causing a NameError; move the whole block that compares
manifest_chunk_size_bytes (the _check_field call referencing manifest and
reproduced_manifest) to after reproduced_manifest is initialized (the code that
assigns reproduced_manifest later in the function), so that reproduced_manifest
exists before being used and similarly relocate the other identical checks (the
block around lines 409-418) to follow the reproduced_manifest construction.

CHECKSUM_FILENAME_TEMPLATE = "{wiki}-{date}-md5sums.txt"

DEFAULT_WIKI = "simplewiki"
DEFAULT_DATE = "latest"
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if file exists and get basic info
ls -la scripts/download_dump.py 2>&1 | head -5
# Get total lines
wc -l scripts/download_dump.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 174


🏁 Script executed:

#!/bin/bash
# Read the relevant sections of the file
echo "=== Lines 40-50 (DEFAULT_DATE) ==="
sed -n '40,50p' scripts/download_dump.py

echo -e "\n=== Lines 110-120 (checksum matching) ==="
sed -n '110,120p' scripts/download_dump.py

echo -e "\n=== Lines 135-145 (more checksum code) ==="
sed -n '135,145p' scripts/download_dump.py

echo -e "\n=== Lines 185-195 (filename construction) ==="
sed -n '185,195p' scripts/download_dump.py

echo -e "\n=== Function _fetch_expected_md5 ==="
ast-grep --pattern $'def _fetch_expected_md5($_) {
  $$$
}'

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1549


🌐 Web query:

Wikimedia dumps checksum format latest alias YYYYMMDD

💡 Result:

  • Dump run directory (date format): project dumps are published under a run directory named YYYYMMDD/ (e.g. .../enwiki/20260201/). The project root also includes a latest/ directory. [1]
  • What latest/ means: latest/ is an alias pointing to the most recent successful run outputs (with filenames using -latest-). [2]
  • Checksum filenames (text):
    • In a dated run dir: <wiki>-<YYYYMMDD>-md5sums.txt and <wiki>-<YYYYMMDD>-sha1sums.txt (example: testwiki-20251001-md5sums.txt, testwiki-20251001-sha1sums.txt). [3]
    • In latest/: <wiki>-latest-md5sums.txt (and commonly the corresponding sha1sums file), e.g. enwikiversity-latest-md5sums.txt. [2]
  • Checksum file format (*.txt): flat text consisting of repeated <hash> <filename> pairs separated by whitespace (not necessarily one pair per line). [3]
  • Other checksum formats you may see: Wikimedia also provides checksum data as JSON for some runs/steps (e.g. md5sums.json, sha1sums.json). [4]
  • Adds/changes dumps note: their md5sums file is documented as containing the md5 for the dump-content files. [5]

Sources: [1] [2] [3] [4] [5]


🏁 Script executed:

#!/bin/bash
set -euo pipefail
url="https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-md5sums.txt"
echo "Fetching: $url"
curl -fsSL "$url" 2>&1 | head -20

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1606


🏁 Script executed:

#!/bin/bash
set -euo pipefail
url="https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-md5sums.txt"
echo "=== Checking for -latest- vs date-stamped filenames ==="
curl -fsSL "$url" 2>&1 | grep -o 'simplewiki-[^[:space:]]*-pages-articles\.xml\.bz2' | head -10

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 170


🏁 Script executed:

#!/bin/bash
# Check lines related to file deletion and download order
echo "=== Lines 200-220 (file deletion and download) ==="
sed -n '200,220p' scripts/download_dump.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 798


🏁 Script executed:

#!/bin/bash
# Get the full download_dump function to understand flow
echo "=== Finding download_dump function ==="
ast-grep --pattern $'def download_dump($_) {
  $$$
}'

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 110


Default latest flow fails checksum verification due to filename mismatch.

When DEFAULT_DATE="latest" is used, line 191 creates simplewiki-latest-pages-articles.xml.bz2. However, the Wikimedia checksum file in the latest/ directory contains date-stamped filenames (e.g., simplewiki-20260301-pages-articles.xml.bz2). Lines 113–117 perform exact string matching against the checksum file, so the lookup always returns None and verification fails on line 139, even for valid downloads. The file is then deleted (line 215) before the error is raised.

To fix this, modify _fetch_expected_md5 to map -latest- patterns to -\d{8}- patterns in the checksum file:

Proposed fix
+import re
@@
 def _fetch_expected_md5(checksum_url: str, dump_filename: str) -> str | None:
@@
+    latest_pattern = None
+    if "-latest-" in dump_filename:
+        wiki, suffix = dump_filename.split("-latest-", 1)
+        latest_pattern = re.compile(rf"^{re.escape(wiki)}-\d{{8}}-{re.escape(suffix)}$")
+
     # Format: "<md5hash>  <filename>\n"
     for line in content.splitlines():
         parts = line.strip().split(None, 1)
-        if len(parts) == 2 and parts[1].strip() == dump_filename:
-            return parts[0].strip()
+        if len(parts) != 2:
+            continue
+        candidate = parts[1].strip()
+        if candidate == dump_filename:
+            return parts[0].strip()
+        if latest_pattern and latest_pattern.match(candidate):
+            return parts[0].strip()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/download_dump.py` at line 43, The checksum lookup fails when
DEFAULT_DATE == "latest" because _fetch_expected_md5 expects exact filenames but
the checksum file uses date-stamped names; update _fetch_expected_md5 to detect
filenames containing "-latest-" and instead search the checksum content with a
regex that replaces "-latest-" with "-\d{8}-" (or otherwise matches an 8-digit
date) to find the correct checksum entry, return that md5 string as before, and
preserve existing behavior when no match is found so the downstream verification
in the download/verification flow still handles failures gracefully.

Comment on lines +203 to +205
logger.warning("Existing file failed checksum — re-downloading.")
dest.unlink()
else:
⚠️ Potential issue | 🟠 Major

Do not delete the existing dump before a verified replacement is ready.

Line 204 removes the current file before the new download is verified. If the subsequent download or checksum verification fails, users lose the only local copy.

🛡️ Proposed fix (download-to-temp then atomic replace)
-            logger.warning("Existing file failed checksum — re-downloading.")
-            dest.unlink()
+            logger.warning("Existing file failed checksum — downloading replacement before overwrite.")
@@
-    _download_file(dump_url, dest)
+    tmp_dest = dest.with_suffix(dest.suffix + ".tmp")
+    _download_file(dump_url, tmp_dest)
 
     if verify:
-        ok = _verify_checksum(dest, checksum_url)
+        ok = _verify_checksum(tmp_dest, checksum_url)
         if not ok:
-            dest.unlink(missing_ok=True)
+            tmp_dest.unlink(missing_ok=True)
             raise RuntimeError(
                 "Downloaded file failed MD5 verification. "
                 "The file has been removed. Please retry."
             )
 
-    return dest
+    tmp_dest.replace(dest)
+    return dest

Also applies to: 209-218

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/download_dump.py` around lines 203 - 205, The code currently deletes
the existing dump via dest.unlink() (see logger.warning and dest.unlink) before
attempting a new download, risking loss if the new download or checksum fails;
change the flow to download the new content to a temporary file (e.g.,
dest.with_suffix('.tmp') or similar) and perform the checksum verification
against that temp file, and only after successful verification atomically
replace the original (use os.replace/Path.rename) so the original remains if
anything fails; also ensure the temp file is removed on failure and do the same
pattern for the related download/verify block around lines 209-218.
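The download-to-temp-then-replace pattern generalizes beyond this script. In this sketch the `download` and `verify` callbacks stand in for the real `_download_file`/`_verify_checksum` pair, which are not reproduced here:

```python
import os
from pathlib import Path
from typing import Callable

def safe_replace(dest: Path, download: Callable[[Path], None],
                 verify: Callable[[Path], bool]) -> Path:
    """Download to a sibling temp file; replace dest only after verification."""
    tmp = dest.with_suffix(dest.suffix + ".tmp")
    try:
        download(tmp)
        if not verify(tmp):
            raise RuntimeError("Downloaded file failed verification; original kept.")
        os.replace(tmp, dest)  # atomic rename on POSIX and Windows
    finally:
        tmp.unlink(missing_ok=True)  # clean up the temp file on failure
    return dest
```

Because `os.replace` is atomic, the original dump is never in a half-written state and survives any failed download or checksum check.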



Successfully merging this pull request may close these issues.

Deterministic Preprocessing Verification Mode that re-runs dataset preprocessing and validates that all generated artifacts

4 participants