Deterministic Preprocessing Verification Mode that re-runs dataset preprocessing and validates that all generated artifacts match the recorded manifest (#25)
Walkthrough

Adds a deterministic preprocessing verification mode (API + CLI) that re-runs preprocessing in an isolated temp directory, recomputes artifacts (SHA256, optional Merkle roots, metadata, processed output), compares them to an existing manifest, and emits a structured VerificationReport of per-check PASS/FAIL/SKIP results.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as "verify CLI"
    participant Utils as "openverifiablellm.utils"
    participant Prep as "Preprocessing\n(Temp workspace)"
    participant Manifest as "dataset_manifest.json"
    participant Report as "VerificationReport"
    User->>CLI: run verify_preprocessing(input_dump, manifest_path)
    CLI->>Manifest: _load_manifest(manifest_path)
    Manifest-->>CLI: expected artifact metadata
    CLI->>Utils: read raw file / compute raw_sha256
    Utils-->>CLI: raw data / raw_sha256
    CLI->>Prep: run preprocessing in isolated temp dir
    Prep->>Utils: compute processed SHA256 and merkle roots
    Utils-->>Prep: computed artifacts
    CLI->>Report: compare expected vs actual (_check_field)
    Report-->>CLI: CheckResult entries (PASS/FAIL/SKIP)
    CLI->>User: print formatted summary and optional JSON
```
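The Merkle roots referenced in the diagram are computed over fixed-size chunks of a file's bytes. A minimal sketch follows; the chunking and odd-node pairing conventions here are assumptions for illustration, not necessarily what openverifiablellm.utils implements:

```python
import hashlib


def merkle_root(data: bytes, chunk_size: int = 1024) -> str:
    """Compute a Merkle root over fixed-size chunks of `data`.

    Sketch only: leaf hashing is SHA256 of each chunk; odd levels
    duplicate the last node. The real library may differ.
    """
    # Leaf hashes: SHA256 of each fixed-size chunk (empty input -> one empty leaf).
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    level = [hashlib.sha256(c).digest() for c in chunks]
    # Hash pairs upward until a single root remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate last node on odd-sized levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()
```

Because the root is deterministic, re-running preprocessing on identical input must reproduce the identical root, which is exactly what the verifier checks.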
* Install: pip install defusedxml
* Download: python -c "import urllib.request; urllib.request.urlretrieve('https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2', 'simplewiki-latest-pages-articles.xml.bz2')"
* Preprocess: python -m openverifiablellm.utils simplewiki-latest-pages-articles.xml.bz2
* Verify: python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2
* Run tests: python test_verify.py -v
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 323-337: The code mutates process-wide cwd via os.chdir(tmp_dir)
before calling utils.extract_text_from_xml, causing a concurrency race; replace
this pattern by invoking the extraction in an isolated subprocess (or
subprocess.run) with cwd=tmp_dir so the child process has its working directory
set explicitly and the parent process’ cwd is not modified. Locate the block
using os.chdir, original_cwd, and the call to utils.extract_text_from_xml and
change it to spawn a subprocess that runs the extraction (propagate
stdout/stderr or capture errors), map subprocess failures to the same
CheckResult (name="reprocessing_succeeded", status=CheckStatus.FAIL,
detail=...), and remove the global os.chdir/restore dance. Ensure any temp-dir
cleanup and error-to-report mapping remains intact.
- Around line 231-239: The try/except around _load_manifest(manifest_path) only
catches FileNotFoundError so a malformed manifest raising json.JSONDecodeError
will propagate; update the except to also catch json.JSONDecodeError (or
Exception if preferred) and add a failing CheckResult (name="manifest_exists" or
new name like "manifest_valid"), using CheckStatus.FAIL and include the error
details (str(exc)) so the report is generated instead of crashing; reference
_load_manifest, manifest_path, report, CheckResult, and CheckStatus when making
the change.
In `@tests/test_verify.py`:
- Around line 6-8: Update the docstring in tests/test_verify.py that currently
suggests running "python test_verify.py -v" to reference the correct invocation
and path—replace it with a runnable command such as "python -m pytest
tests/test_verify.py -v" (or simply "pytest tests/test_verify.py -v") so the
docstring accurately reflects running the test from the project root; update the
string at the top of tests/test_verify.py where the run command is documented.
- Around line 120-135: The test currently imports verify as a top-level module
(import verify as verify_module and from verify import ...), but the module
lives under the package namespace openverifiablellm
(openverifiablellm/verify.py); update the imports to import from
openverifiablellm.verify (e.g., import openverifiablellm.verify as verify_module
or from openverifiablellm.verify import CheckResult, CheckStatus,
VerificationReport, _check_field, _load_manifest, verify_preprocessing) so the
test uses the package-qualified module name and matches the stubbed package in
sys.modules.
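The first verify.py comment above asks for subprocess isolation in place of the process-wide os.chdir dance. A minimal sketch of that pattern; the extraction snippet itself (importing openverifiablellm.utils and calling extract_text_from_xml, per the comment) would be passed in as the `code` string:

```python
import subprocess
import sys


def run_in_dir(code: str, workdir: str) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child process whose working directory
    is `workdir`, leaving the parent process's cwd untouched."""
    return subprocess.run(
        [sys.executable, "-c", code],
        cwd=workdir,           # only the child sees this working directory
        capture_output=True,   # collect stdout/stderr for error reporting
        text=True,
    )
```

A nonzero returncode (with the captured stderr as detail) would then map to the same CheckResult(name="reprocessing_succeeded", status=CheckStatus.FAIL) the comment describes.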
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…nature-mismatch fix: resolve compute_sha256 signature mismatch in test_util.py
…changes Improve XML extraction logic and update tests
@aniket866 Can you look into the following changes:
Also a reminder to pull from main before you make these changes.
…t866/OpenVerifiableLLM into preprocessing-verification
PR: Maintainer Review Fixes — Verification Mode

What Changed
| File | Change |
|---|---|
| tests/test_verify.py | Removed fake stub — now imports real openverifiablellm.utils |
| scripts/download_dump.py | New file — dedicated download script with checksum verification |
How to Run — Full End-to-End
Step 1 — Install dependencies
pip install defusedxml
Step 2 — Download the Wikipedia dump
python scripts/download_dump.py
Or with options:
python scripts/download_dump.py --wiki simplewiki --date 20260201 --output-dir .
Step 3 — Run preprocessing
python -m openverifiablellm.utils simplewiki-latest-pages-articles.xml.bz2
Expected output:
INFO - Preprocessing complete. Output saved to data\processed\wiki_clean.txt
INFO - Manifest written to data\dataset_manifest.json
Step 4 — Run verification
python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2
Expected output:
============================================================
Deterministic Preprocessing Verification Report
Step 5 — Run the test suite
python -m pytest tests/test_verify.py -v
Or without pytest:
python tests/test_verify.py -v
Expected output:
Ran 37 tests in ~0.4s
OK
Optional — Save verification report as JSON
python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2 --json report.json
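The JSON report makes the verifier easy to gate on in CI. This sketch assumes a hypothetical schema in which the report serializes its checks as a list of objects with "name" and "status" fields; adapt the keys to the real VerificationReport serialization:

```python
import json
import sys


def gate_on_report(report_path: str) -> int:
    """Return a nonzero exit code if any check in the report FAILed.

    The {"checks": [{"name": ..., "status": ...}]} layout is an
    assumption, not the confirmed report schema.
    """
    with open(report_path) as f:
        report = json.load(f)
    failures = [c for c in report.get("checks", []) if c.get("status") == "FAIL"]
    for check in failures:
        print(f"FAILED: {check.get('name')}", file=sys.stderr)
    return 1 if failures else 0
```

A CI job could run the verifier with --json and then call gate_on_report to fail the build on any mismatch.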
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)
284-285: ⚠️ Potential issue | 🟡 Minor: Remove stale migration comments that contradict current code.
These comments (“function definition … should be deleted”) are obsolete and misleading during maintenance.
Suggested cleanup
-# Content before line 270 remains unchanged
-# Entire function definition from lines 270-314 should be deleted
 def verify_merkle_proof_from_file(

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 284 - 285, Remove the stale migration comment block in openverifiablellm.utils that says “# Entire function definition from lines 270-314 should be deleted” (and any similar obsolete migration notes) so the file no longer contains misleading comments; delete that commented section and, if necessary, replace it with a brief, accurate comment or remove the line entirely so only current, relevant documentation remains.
♻️ Duplicate comments (1)
openverifiablellm/verify.py (1)
233-241: ⚠️ Potential issue | 🟠 Major: Handle malformed manifest JSON as a reportable failure, not an exception.

json.JSONDecodeError from _load_manifest() is not caught here, so verification can terminate without producing a structured FAIL report.

Suggested fix

  except FileNotFoundError as exc:
      report.add(CheckResult(
          name="manifest_exists",
          status=CheckStatus.FAIL,
          detail=str(exc),
      ))
      return report
+ except json.JSONDecodeError as exc:
+     report.add(CheckResult(
+         name="manifest_valid_json",
+         status=CheckStatus.FAIL,
+         detail=f"Invalid manifest JSON: {exc}",
+     ))
+     return report

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/verify.py` around lines 233 - 241, The code currently only catches FileNotFoundError when calling _load_manifest(manifest_path) so a json.JSONDecodeError will propagate; update the try/except around the _load_manifest(manifest_path) call to also catch json.JSONDecodeError (or Exception specific to JSON parsing) and, in that except block, add a CheckResult with name="manifest_exists" (or a more specific name like "manifest_valid_json"), status=CheckStatus.FAIL, and detail=str(exc) to report, then return report; reference _load_manifest, manifest_path, report, CheckResult, and CheckStatus to locate where to add this additional except handler and reporting logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 270-277: The code reads manifest["chunk_size_bytes"] and passes it
straight to utils.compute_merkle_root (and similarly in the second block around
lines 374-383), which can crash if the value is non-integer or <=0; validate and
coerce the manifest chunk_size_bytes before use: in the block that sets
chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
(and the analogous block later), check that the value is an int (or can be
converted to int) and > 0, otherwise replace it with
utils.MERKLE_CHUNK_SIZE_BYTES (or record a validation error on report) and then
call utils.compute_merkle_root with the validated chunk_size; reference the
variables/handlers chunk_size, manifest["chunk_size_bytes"],
utils.compute_merkle_root, and _check_field to locate the fix.
- Around line 398-403: Add an explicit parity check for chunk_size_bytes by
calling _check_field with report, a descriptive field key (e.g.,
"manifest_chunk_size_bytes"), expected=manifest.get("chunk_size_bytes"),
actual=reproduced_manifest.get("chunk_size_bytes"), and detail="Chunk size in
bytes"; place this call alongside the existing _check_field invocation that
compares "preprocessing_version" so the verifier fails when chunk_size_bytes
diverge.
In `@scripts/download_dump.py`:
- Around line 6-8: Update the module-level docstring (and any user-facing help
text) in the download_dump script to accurately state that the script verifies
downloads using an MD5 checksum fetched from Wikimedia rather than SHA256; if
there are functions like download_dump(), verify_checksum() or fetch_checksum()
that still mention SHA256 in their docstrings or log messages, change those to
MD5 as well so the documentation matches the actual verification algorithm being
used.
- Around line 133-140: The function currently treats missing checksum metadata
as non-fatal by returning True when _fetch_expected_md5(checksum_url, dest.name)
yields None, which lets --verify silently succeed; change this so that when
checksum verification was explicitly requested (the --verify flag / verification
parameter), a missing expected value is treated as a failure — log an error via
logger.error (or a clearer warning) and return False; if verification was not
explicitly requested, keep the current behavior (log a warning and return True).
Locate the call to _fetch_expected_md5(checksum_url, dest.name) and branch based
on the verification flag/parameter (the --verify flag or function parameter)
instead of always returning True on None.
- Around line 267-268: The printed next-step command uses dest.name which loses
path context and produces an incorrect runnable command for custom --output-dir;
update the two print statements in scripts/download_dump.py to include the full
destination path (e.g., str(dest) or dest.resolve()) instead of dest.name so the
printed command (python -m openverifiablellm.utils <dest_path>) points to the
actual output directory and works from the project root; ensure the path is
properly stringified/quoted when inserted into the f-string.
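The MD5 docstring fix above concerns download-integrity checking against Wikimedia's published checksum files. A minimal streaming MD5 sketch (the helper name is illustrative, not the script's actual function):

```python
import hashlib


def md5_of_file(path: str, buf_size: int = 1 << 20) -> str:
    """Stream a file through MD5 in 1 MiB blocks.

    MD5 is used here only to match Wikimedia's published dump
    checksums (download integrity), not as a security primitive.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        # iter(callable, sentinel) reads until an empty bytes object.
        for block in iter(lambda: f.read(buf_size), b""):
            h.update(block)
    return h.hexdigest()
```

Comparing this digest against the entry fetched from the `<wiki>-<date>-md5sums.txt` file is what the script's verification step amounts to.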
In `@tests/test_verify.py`:
- Around line 186-202: Add a regression test in TestLoadManifest that ensures
_load_manifest handles malformed JSON gracefully: simulate a file named
"dataset_manifest.json" (or any manifest filename like "bad_manifest.json")
containing invalid JSON, call _load_manifest on its Path and assert that the
verifier returns a FAIL result (or raises a specific handled exception converted
to a fail) instead of letting an uncaught exception propagate; update or add a
test method (e.g., test_malformed_manifest_returns_fail) alongside existing
tests to create the bad file, invoke _load_manifest, and assert the expected
fail behavior.
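The suggested regression test can be sketched as follows. The _load_manifest here is a local stand-in (assumed to json.load the file) so the sketch is runnable; the real test would import it from openverifiablellm.verify and, once verify.py catches JSONDecodeError, assert a FAIL CheckResult instead of an exception:

```python
import json
import tempfile
import unittest
from pathlib import Path


def _load_manifest(path: Path) -> dict:
    """Local stand-in for openverifiablellm.verify._load_manifest."""
    with path.open() as f:
        return json.load(f)


class TestMalformedManifest(unittest.TestCase):
    def test_malformed_manifest_raises_decode_error(self):
        with tempfile.TemporaryDirectory() as td:
            bad = Path(td) / "dataset_manifest.json"
            bad.write_text("{not valid json")
            # Current behavior: the parse error propagates. After the
            # verify.py fix, assert on the report's FAIL entry instead.
            with self.assertRaises(json.JSONDecodeError):
                _load_manifest(bad)
```

Running it under pytest (pytest tests/test_verify.py -v) would exercise the malformed-manifest path alongside the existing TestLoadManifest cases.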
---
Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 284-285: Remove the stale migration comment block in
openverifiablellm.utils that says “# Entire function definition from lines
270-314 should be deleted” (and any similar obsolete migration notes) so the
file no longer contains misleading comments; delete that commented section and,
if necessary, replace it with a brief, accurate comment or remove the line
entirely so only current, relevant documentation remains.
---
Duplicate comments:
In `@openverifiablellm/verify.py`:
- Around line 233-241: The code currently only catches FileNotFoundError when
calling _load_manifest(manifest_path) so a json.JSONDecodeError will propagate;
update the try/except around the _load_manifest(manifest_path) call to also
catch json.JSONDecodeError (or Exception specific to JSON parsing) and, in that
except block, add a CheckResult with name="manifest_exists" (or a more specific
name like "manifest_valid_json"), status=CheckStatus.FAIL, and detail=str(exc)
to report, then return report; reference _load_manifest, manifest_path, report,
CheckResult, and CheckStatus to locate where to add this additional except
handler and reporting logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1dc90ecd-1175-49ad-b635-3e4efff1d259
📒 Files selected for processing (5)
openverifiablellm/utils.py, openverifiablellm/verify.py, scripts/download_dump.py, tests/test_util.py, tests/test_verify.py
if "raw_merkle_root" in manifest:
    chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
    raw_merkle_actual = utils.compute_merkle_root(input_dump, chunk_size=chunk_size)
    _check_field(
        report, "raw_merkle_root",
        expected=manifest["raw_merkle_root"],
        actual=raw_merkle_actual,
        detail=f"Merkle root of raw dump (chunk={chunk_size} bytes)",
Validate chunk_size_bytes before Merkle recomputation to avoid verifier crashes.
chunk_size_bytes is read from manifest and passed directly to compute_merkle_root(). Non-integer or non-positive values can raise and abort verification.
Suggested fix
- chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
+ chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
+ if not isinstance(chunk_size, int) or chunk_size <= 0:
+ report.add(CheckResult(
+ name="chunk_size_bytes",
+ status=CheckStatus.FAIL,
+ expected=str(utils.MERKLE_CHUNK_SIZE_BYTES),
+ actual=str(chunk_size),
+ detail="Manifest chunk_size_bytes must be a positive integer",
+ ))
+     return report

Also applies to: 374-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/verify.py` around lines 270 - 277, The code reads
manifest["chunk_size_bytes"] and passes it straight to utils.compute_merkle_root
(and similarly in the second block around lines 374-383), which can crash if the
value is non-integer or <=0; validate and coerce the manifest chunk_size_bytes
before use: in the block that sets chunk_size = manifest.get("chunk_size_bytes",
utils.MERKLE_CHUNK_SIZE_BYTES) (and the analogous block later), check that the
value is an int (or can be converted to int) and > 0, otherwise replace it with
utils.MERKLE_CHUNK_SIZE_BYTES (or record a validation error on report) and then
call utils.compute_merkle_root with the validated chunk_size; reference the
variables/handlers chunk_size, manifest["chunk_size_bytes"],
utils.compute_merkle_root, and _check_field to locate the fix.
Actionable comments posted: 3
♻️ Duplicate comments (2)
openverifiablellm/verify.py (2)
233-241: ⚠️ Potential issue | 🟠 Major: Malformed manifest JSON is still unhandled in the load step.

If dataset_manifest.json is present but invalid, json.JSONDecodeError will propagate and crash instead of producing a failing check result.

✅ Proposed fix

  try:
      manifest = _load_manifest(manifest_path)
  except FileNotFoundError as exc:
      report.add(CheckResult(
          name="manifest_exists",
          status=CheckStatus.FAIL,
          detail=str(exc),
      ))
      return report
+ except json.JSONDecodeError as exc:
+     report.add(CheckResult(
+         name="manifest_valid_json",
+         status=CheckStatus.FAIL,
+         detail=f"Manifest is not valid JSON: {exc}",
+     ))
+     return report

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/verify.py` around lines 233 - 241, The load step only handles FileNotFoundError, so invalid JSON in _load_manifest(manifest_path) will raise json.JSONDecodeError and crash; update the try/except around _load_manifest to also catch json.JSONDecodeError (import json if not already) and add a failing CheckResult to report (e.g. name="manifest_valid_json" or similar) with status=CheckStatus.FAIL and detail=str(exc), then return report just like the FileNotFoundError branch so malformed manifest yields a proper failing check.
389-393: ⚠️ Potential issue | 🟠 Major: processed_merkle_root path lacks chunk_size_bytes validation.

Line 390 reads chunk_size_bytes and immediately uses it in compute_merkle_root. Non-integer or non-positive values can raise and abort verification.

🧩 Proposed fix

  if "processed_merkle_root" in manifest:
      chunk_size = manifest.get("chunk_size_bytes", utils.MERKLE_CHUNK_SIZE_BYTES)
+     if not isinstance(chunk_size, int) or chunk_size <= 0:
+         report.add(CheckResult(
+             name="chunk_size_bytes",
+             status=CheckStatus.FAIL,
+             expected=str(utils.MERKLE_CHUNK_SIZE_BYTES),
+             actual=str(chunk_size),
+             detail="Manifest chunk_size_bytes must be a positive integer",
+         ))
+         return report
      proc_merkle_actual = utils.compute_merkle_root(
          reproduced_processed, chunk_size=chunk_size
      )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/verify.py` around lines 389 - 393, The manifest branch that handles "processed_merkle_root" must validate the "chunk_size_bytes" value before passing it to utils.compute_merkle_root: read manifest.get("chunk_size_bytes"), coerce to an int (e.g., int(...) with safe handling), ensure it is a positive integer (>0), and if coercion fails or the value is non-positive, fall back to utils.MERKLE_CHUNK_SIZE_BYTES (or raise a clear VerificationError if that’s preferred); then call utils.compute_merkle_root(reproduced_processed, chunk_size=validated_chunk_size). Also consider logging or returning a descriptive error when the provided chunk_size is invalid to aid debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 288-293: The parity check calling _check_field with
reproduced_manifest is executed before reproduced_manifest is created, causing a
NameError; move the whole block that compares manifest_chunk_size_bytes (the
_check_field call referencing manifest and reproduced_manifest) to after
reproduced_manifest is initialized (the code that assigns reproduced_manifest
later in the function), so that reproduced_manifest exists before being used and
similarly relocate the other identical checks (the block around lines 409-418)
to follow the reproduced_manifest construction.
In `@scripts/download_dump.py`:
- Around line 203-205: The code currently deletes the existing dump via
dest.unlink() (see logger.warning and dest.unlink) before attempting a new
download, risking loss if the new download or checksum fails; change the flow to
download the new content to a temporary file (e.g., dest.with_suffix('.tmp') or
similar) and perform the checksum verification against that temp file, and only
after successful verification atomically replace the original (use
os.replace/Path.rename) so the original remains if anything fails; also ensure
the temp file is removed on failure and do the same pattern for the related
download/verify block around lines 209-218.
- Line 43: The checksum lookup fails when DEFAULT_DATE == "latest" because
_fetch_expected_md5 expects exact filenames but the checksum file uses
date-stamped names; update _fetch_expected_md5 to detect filenames containing
"-latest-" and instead search the checksum content with a regex that replaces
"-latest-" with "-\d{8}-" (or otherwise matches an 8-digit date) to find the
correct checksum entry, return that md5 string as before, and preserve existing
behavior when no match is found so the downstream verification in the
download/verification flow still handles failures gracefully.
---
Duplicate comments:
In `@openverifiablellm/verify.py`:
- Around line 233-241: The load step only handles FileNotFoundError, so invalid
JSON in _load_manifest(manifest_path) will raise json.JSONDecodeError and crash;
update the try/except around _load_manifest to also catch json.JSONDecodeError
(import json if not already) and add a failing CheckResult to report (e.g.
name="manifest_valid_json" or similar) with status=CheckStatus.FAIL and
detail=str(exc), then return report just like the FileNotFoundError branch so
malformed manifest yields a proper failing check.
- Around line 389-393: The manifest branch that handles "processed_merkle_root"
must validate the "chunk_size_bytes" value before passing it to
utils.compute_merkle_root: read manifest.get("chunk_size_bytes"), coerce to an
int (e.g., int(...) with safe handling), ensure it is a positive integer (>0),
and if coercion fails or the value is non-positive, fall back to
utils.MERKLE_CHUNK_SIZE_BYTES (or raise a clear VerificationError if that’s
preferred); then call utils.compute_merkle_root(reproduced_processed,
chunk_size=validated_chunk_size). Also consider logging or returning a
descriptive error when the provided chunk_size is invalid to aid debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f627a475-414c-429b-b4fc-4c7e04bdf95b
📒 Files selected for processing (2)
openverifiablellm/verify.py, scripts/download_dump.py
_check_field(
    report, "manifest_chunk_size_bytes",
    expected=manifest.get("chunk_size_bytes"),
    actual=reproduced_manifest.get("chunk_size_bytes"),
    detail="Merkle chunk size used during preprocessing",
)
Undefined reproduced_manifest causes a runtime crash.
Line 291 uses reproduced_manifest before it is initialized (initialized at Line 411). This raises NameError whenever the raw Merkle branch executes and prevents report completion.
🐛 Proposed fix (move parity check after reproduced manifest is loaded)
- _check_field(
- report, "manifest_chunk_size_bytes",
- expected=manifest.get("chunk_size_bytes"),
- actual=reproduced_manifest.get("chunk_size_bytes"),
- detail="Merkle chunk size used during preprocessing",
- )
@@
if reproduced_manifest_path.exists():
with reproduced_manifest_path.open() as f:
reproduced_manifest = json.load(f)
+ _check_field(
+ report, "manifest_chunk_size_bytes",
+ expected=manifest.get("chunk_size_bytes"),
+ actual=reproduced_manifest.get("chunk_size_bytes"),
+ detail="Merkle chunk size used during preprocessing",
+ )
+
_check_field(
report, "manifest_preprocessing_version",
expected=manifest.get("preprocessing_version"),
actual=reproduced_manifest.get("preprocessing_version"),
detail="Preprocessing version tag",
)Also applies to: 409-418
🧰 Tools
🪛 Ruff (0.15.2)
[error] 291-291: Undefined name reproduced_manifest
(F821)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/verify.py` around lines 288 - 293, The parity check calling
_check_field with reproduced_manifest is executed before reproduced_manifest is
created, causing a NameError; move the whole block that compares
manifest_chunk_size_bytes (the _check_field call referencing manifest and
reproduced_manifest) to after reproduced_manifest is initialized (the code that
assigns reproduced_manifest later in the function), so that reproduced_manifest
exists before being used and similarly relocate the other identical checks (the
block around lines 409-418) to follow the reproduced_manifest construction.
CHECKSUM_FILENAME_TEMPLATE = "{wiki}-{date}-md5sums.txt"

DEFAULT_WIKI = "simplewiki"
DEFAULT_DATE = "latest"
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if file exists and get basic info
ls -la scripts/download_dump.py 2>&1 | head -5
# Get total lines
wc -l scripts/download_dump.py

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 174
🏁 Script executed:
#!/bin/bash
# Read the relevant sections of the file
echo "=== Lines 40-50 (DEFAULT_DATE) ==="
sed -n '40,50p' scripts/download_dump.py
echo -e "\n=== Lines 110-120 (checksum matching) ==="
sed -n '110,120p' scripts/download_dump.py
echo -e "\n=== Lines 135-145 (more checksum code) ==="
sed -n '135,145p' scripts/download_dump.py
echo -e "\n=== Lines 185-195 (filename construction) ==="
sed -n '185,195p' scripts/download_dump.py
echo -e "\n=== Function _fetch_expected_md5 ==="
ast-grep --pattern $'def _fetch_expected_md5($_) {
$$$
}'

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1549
🌐 Web query:
Wikimedia dumps checksum format latest alias YYYYMMDD
💡 Result:
- Dump run directory (date format): project dumps are published under a run directory named YYYYMMDD/ (e.g. .../enwiki/20260201/). The project root also includes a latest/ directory. [1]
- What latest/ means: latest/ is an alias pointing to the most recent successful run outputs (with filenames using -latest-). [2]
- Checksum filenames (text):
  - In a dated run dir: <wiki>-<YYYYMMDD>-md5sums.txt and <wiki>-<YYYYMMDD>-sha1sums.txt (example: testwiki-20251001-md5sums.txt, testwiki-20251001-sha1sums.txt). [3]
  - In latest/: <wiki>-latest-md5sums.txt (and commonly the corresponding sha1sums file), e.g. enwikiversity-latest-md5sums.txt. [2]
- Checksum file format (*.txt): flat text consisting of repeated <hash> <filename> pairs separated by whitespace (not necessarily one pair per line). [3]
- Other checksum formats you may see: Wikimedia also provides checksum data as JSON for some runs/steps (e.g. md5sums.json, sha1sums.json). [4]
- Adds/changes dumps note: their md5sums file is documented as containing the md5 for the dump-content files. [5]
Sources: [1] [2] [3] [4] [5]
🏁 Script executed:
#!/bin/bash
set -euo pipefail
url="https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-md5sums.txt"
echo "Fetching: $url"
curl -fsSL "$url" 2>&1 | head -20

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1606
🏁 Script executed:
#!/bin/bash
set -euo pipefail
url="https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-md5sums.txt"
echo "=== Checking for -latest- vs date-stamped filenames ==="
curl -fsSL "$url" 2>&1 | grep -o 'simplewiki-[^[:space:]]*-pages-articles\.xml\.bz2' | head -10

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 170
🏁 Script executed:
#!/bin/bash
# Check lines related to file deletion and download order
echo "=== Lines 200-220 (file deletion and download) ==="
sed -n '200,220p' scripts/download_dump.py

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 798
🏁 Script executed:
#!/bin/bash
# Get the full download_dump function to understand flow
echo "=== Finding download_dump function ==="
ast-grep --pattern $'def download_dump($_) {
$$$
}'

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 110
Default latest flow fails checksum verification due to filename mismatch.
When DEFAULT_DATE="latest" is used, line 191 creates simplewiki-latest-pages-articles.xml.bz2. However, the Wikimedia checksum file in the latest/ directory contains date-stamped filenames (e.g., simplewiki-20260301-pages-articles.xml.bz2). Lines 113–117 perform exact string matching against the checksum file, so the lookup always returns None and verification fails on line 139, even for valid downloads. The file is then deleted (line 215) before the error is raised.
To fix this, modify _fetch_expected_md5 to map -latest- patterns to -\d{8}- patterns in the checksum file:
Proposed fix
+import re
@@
def _fetch_expected_md5(checksum_url: str, dump_filename: str) -> str | None:
@@
+ latest_pattern = None
+ if "-latest-" in dump_filename:
+ wiki, suffix = dump_filename.split("-latest-", 1)
+ latest_pattern = re.compile(rf"^{re.escape(wiki)}-\d{{8}}-{re.escape(suffix)}$")
+
# Format: "<md5hash> <filename>\n"
for line in content.splitlines():
parts = line.strip().split(None, 1)
- if len(parts) == 2 and parts[1].strip() == dump_filename:
- return parts[0].strip()
+ if len(parts) != 2:
+ continue
+ candidate = parts[1].strip()
+ if candidate == dump_filename:
+ return parts[0].strip()
+ if latest_pattern and latest_pattern.match(candidate):
+ return parts[0].strip()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/download_dump.py` at line 43, The checksum lookup fails when
DEFAULT_DATE == "latest" because _fetch_expected_md5 expects exact filenames but
the checksum file uses date-stamped names; update _fetch_expected_md5 to detect
filenames containing "-latest-" and instead search the checksum content with a
regex that replaces "-latest-" with "-\d{8}-" (or otherwise matches an 8-digit
date) to find the correct checksum entry, return that md5 string as before, and
preserve existing behavior when no match is found so the downstream verification
in the download/verification flow still handles failures gracefully.
    logger.warning("Existing file failed checksum — re-downloading.")
    dest.unlink()
else:
Do not delete the existing dump before a verified replacement is ready.
Line 204 removes the current file before the new download is verified. If the subsequent download or checksum verification fails, users lose the only local copy.
🛡️ Proposed fix (download-to-temp then atomic replace)
- logger.warning("Existing file failed checksum — re-downloading.")
- dest.unlink()
+ logger.warning("Existing file failed checksum — downloading replacement before overwrite.")
@@
- _download_file(dump_url, dest)
+ tmp_dest = dest.with_suffix(dest.suffix + ".tmp")
+ _download_file(dump_url, tmp_dest)
if verify:
- ok = _verify_checksum(dest, checksum_url)
+ ok = _verify_checksum(tmp_dest, checksum_url)
if not ok:
- dest.unlink(missing_ok=True)
+ tmp_dest.unlink(missing_ok=True)
raise RuntimeError(
"Downloaded file failed MD5 verification. "
"The file has been removed. Please retry."
)
- return dest
+ tmp_dest.replace(dest)
+ return destAlso applies to: 209-218
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/download_dump.py` around lines 203 - 205, The code currently deletes
the existing dump via dest.unlink() (see logger.warning and dest.unlink) before
attempting a new download, risking loss if the new download or checksum fails;
change the flow to download the new content to a temporary file (e.g.,
dest.with_suffix('.tmp') or similar) and perform the checksum verification
against that temp file, and only after successful verification atomically
replace the original (use os.replace/Path.rename) so the original remains if
anything fails; also ensure the temp file is removed on failure and do the same
pattern for the related download/verify block around lines 209-218.
Addressed Issues:
Closes #13

Detailed PR report:
PR_Report_VerificationMode.pdf
This directly addresses Gap 5.1 from the framework paper — the missing cryptographic link between raw data and training-ready data.
Reference: 2503.22573v1.pdf
Short Description:
Before this PR, the pipeline could generate hashes and Merkle roots, but had no automated way to re-validate them.
This PR introduces a dedicated verification module for the OpenVerifiableLLM data pipeline. Its job is simple: given the same raw Wikipedia dump, re-run the entire preprocessing step from scratch and prove that the output produced today is byte-for-byte identical to the output recorded in the manifest at first run.
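The byte-for-byte claim reduces to comparing content digests of the processed output. A minimal sketch of the underlying check; helper names here are illustrative, not the project's actual API:

```python
import hashlib


def sha256_of_file(path: str, buf_size: int = 1 << 20) -> str:
    """Stream a file through SHA256 without loading it into memory,
    suitable for multi-GB dump and output files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(buf_size), b""):
            h.update(block)
    return h.hexdigest()


def outputs_match(recorded_sha256: str, reproduced_path: str) -> bool:
    """True iff today's reprocessed output equals the digest recorded
    in the manifest at first run (byte-for-byte identity)."""
    return sha256_of_file(reproduced_path) == recorded_sha256
```

Any single-byte divergence between the first run and the re-run changes the digest, so a PASS on this check is a strong equality guarantee.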
Minimal Command Summary
python -c "import urllib.request; urllib.request.urlretrieve('https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2', 'simplewiki-latest-pages-articles.xml.bz2')"
pip install -e .
set PYTHONPATH=C:\Users\Hp\OpenVerifiableLLM   (use your own project location)
python -c "import openverifiablellm.utils; print('OK')"
python test_verify.py -v

Detailed Run from scratch:
1. Install Required Dependency
pip install defusedxml

3. Download Wikipedia Dump
Recommended method (works inside CMD):
python -c "import urllib.request; urllib.request.urlretrieve('https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2', 'simplewiki-latest-pages-articles.xml.bz2'); print('Download complete')"

Verify download:
dir simplewiki-latest-pages-articles.xml.bz2

4. Configure Python Path (Important)
pip install -e .
set PYTHONPATH=C:\Users\Hp\OpenVerifiableLLM
python -c "import openverifiablellm.utils; print('OK')"

5. Run Preprocessing
python -m openverifiablellm.utils simplewiki-latest-pages-articles.xml.bz2

This will:
After completion, verify files:
dir data
dir data\processed\

Expected:
6. Run Verification (Core Step)
This re-processes the dataset and checks all hashes.
Run:
python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2

If everything is correct:
Result: ALL CHECKS PASSED
python -m openverifiablellm.verify simplewiki-latest-pages-articles.xml.bz2 --json verification_report.json

Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.