
Fix #19: refactor utils to support reading xml from both .bz2 and .xml files#20

Closed
itniuma2026 wants to merge 1 commit into AOSSIE-Org:main from itniuma2026:niuma/fix-19

Conversation


@itniuma2026 itniuma2026 commented Mar 2, 2026

Summary

Refactor extract_text_from_xml() in openverifiablellm/utils.py to support both .bz2 compressed and plain .xml files by detecting the file extension and opening accordingly.

Approach

  1. In openverifiablellm/utils.py, modify extract_text_from_xml() to check input_path.suffix.lower(): if .bz2, use bz2.open(input_path, 'rb'); if .xml, use open(input_path, 'rb'); otherwise, raise a ValueError with a descriptive message. Use a context manager (with file_handle as f) to wrap the existing ET.iterparse logic.
  2. In tests/test_util.py, add a test case that passes a plain .xml file to extract_text_from_xml() to verify it works correctly, and optionally add a test for an unsupported extension to verify the ValueError is raised.
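The extension dispatch in step 1 can be sketched as follows. Here open_dump is a hypothetical helper name used for illustration only (the PR implements this logic inline in extract_text_from_xml); the error message mirrors the one exercised by the tests:

```python
import bz2
from pathlib import Path
from typing import IO, Union

def open_dump(input_path: Union[str, Path]) -> IO[bytes]:
    """Return a binary handle for a .bz2-compressed or plain .xml dump.

    Raises ValueError for any other extension, before any filesystem
    side effects such as creating output directories.
    """
    input_path = Path(input_path)
    suffix = input_path.suffix.lower()
    if suffix == ".bz2":
        # Transparently decompress while reading
        return bz2.open(input_path, "rb")
    if suffix == ".xml":
        return open(input_path, "rb")
    raise ValueError("input_path must have .xml or .bz2 extension")
```

The returned handle can then wrap the existing ET.iterparse loop in a with statement, so both code paths share one parsing implementation.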

Files Changed

  • openverifiablellm/utils.py
  • tests/test_util.py

Related Issue

Fixes #19

Summary by CodeRabbit

  • New Features

    • Added text extraction and cleaning capabilities for Wikipedia XML dumps with automatic date parsing.
    • Implemented improved manifest generation system with enhanced data processing workflow.
  • Refactor

    • Streamlined hash computation logic for improved efficiency.
    • Reorganized data processing pipeline with modularized, publicly accessible functions.


coderabbitai bot commented Mar 2, 2026

Walkthrough

Refactored utils.py to modularize Wikipedia XML preprocessing and manifest generation workflows. Introduced five new public functions for text extraction, date parsing, text cleaning, and manifest generation. Replaced per-chunk Merkle hashing with inline computation and updated tests to validate XML extraction and manifest structure.

Changes

Cohort / File(s) Summary
Manifest Generation Refactor
openverifiablellm/utils.py
Introduced modularized public functions: compute_sha256() (dual-input file/data support), extract_text_from_xml() (XML/.bz2 extraction with file-type detection), extract_dump_date() (date parsing), clean_wikitext() (regex-based markup cleaning), and generate_manifest() (manifest creation with hash computation). Replaced per-chunk Merkle parent hashing with single-step inline SHA-256 computation. Updated manifest generation workflow to write dataset_manifest.json and return manifest object.
Test Suite Updates
tests/test_util.py
Removed Merkle-focused tests (roots, proofs, determinism, proof verification). Added test_extract_text_from_xml_plain_xml() and test_extract_text_from_xml_unsupported_extension() for XML extraction validation. Enhanced end-to-end extraction test with temporary file handling and project root patching. Added test_manifest_contains_merkle_fields() to verify the manifest structure contains non-null merkle_root fields.
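The extract_dump_date() function mentioned above is not shown in full in this view. Wikipedia dump filenames conventionally embed a YYYYMMDD stamp (e.g. enwiki-20240301-pages-articles-multistream.xml.bz2), so a minimal sketch of the date parsing might look like this; the actual implementation in utils.py may differ:

```python
import re
from datetime import datetime
from typing import Optional

def extract_dump_date(filename: str) -> Optional[str]:
    """Extract an ISO date from the first 8-digit YYYYMMDD token in a
    dump filename, or return None when no valid date is present."""
    match = re.search(r"(\d{8})", filename)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date().isoformat()
    except ValueError:
        # An 8-digit run that is not a real calendar date
        return None
```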

Sequence Diagram

sequenceDiagram
    participant Client
    participant extract_text_from_xml
    participant clean_wikitext
    participant compute_sha256
    participant generate_manifest
    participant FileSystem

    Client->>extract_text_from_xml: input_path (.bz2 or .xml)
    extract_text_from_xml->>FileSystem: detect file type
    extract_text_from_xml->>FileSystem: open and parse XML
    extract_text_from_xml->>clean_wikitext: raw extracted text
    clean_wikitext-->>extract_text_from_xml: cleaned text
    extract_text_from_xml->>FileSystem: write to data/processed/wiki_clean.txt
    
    Client->>generate_manifest: raw_path, processed_path, output_dir
    generate_manifest->>compute_sha256: read raw file
    compute_sha256-->>generate_manifest: raw file hash
    generate_manifest->>compute_sha256: read processed file
    compute_sha256-->>generate_manifest: processed file hash
    generate_manifest->>FileSystem: create output directory
    generate_manifest->>FileSystem: write dataset_manifest.json
    generate_manifest-->>Client: manifest object with hashes

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 Hashing files with newfound grace,
XML streams at quickened pace,
Manifests now cleanly born,
Text extracted, markup torn—
Verification's rabbit dawn! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Out of Scope Changes check ⚠️ Warning
    Explanation: The PR includes substantial additional changes beyond the scope of issue #19: new public functions (the compute_sha256 refactor, extract_dump_date, clean_wikitext, generate_manifest) and the removal of merkle-related tests not mentioned in the linked issue.
    Resolution: Remove the changes unrelated to issue #19 (new functions, manifest generation, merkle test removals) or create separate PRs for these refactoring efforts to maintain focused, reviewable changesets.
✅ Passed checks (3 passed)
  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title clearly and specifically describes the main refactoring change: adding support for reading XML from both .bz2 and .xml files, matching the core objective of the PR.
  • Linked Issues check ✅ Passed: The code changes implement automatic file type detection with extension-based logic for .bz2 and .xml files, raising ValueError for unsupported extensions, directly fulfilling the requirements of issue #19.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 155-175: The compute_sha256 function currently prefers data when
both file_path and data are provided, which can hide caller mistakes; update
compute_sha256 to validate inputs so exactly one source is used: if both
file_path and data are provided raise a ValueError (e.g., "Provide exactly one
of file_path or data"), if only data is provided keep current bytes hashing
behavior, if only file_path is provided keep current file-reading hashing
behavior, and if neither raise the existing ValueError. Ensure you reference the
compute_sha256 signature and its parameters file_path and data when making this
change.
- Around line 209-223: The code currently creates output_dir ("data/processed")
before validating the input_path extension; move the extension check (the suffix
== ".bz2" / ".xml" validation and the ValueError) to occur immediately after
computing suffix (input_path.suffix.lower()) and before creating output_dir or
calling output_dir.mkdir; only after validating and opening the file_handle (or
after successful selection of bz2.open / open) should you create output_dir and
set output_file; reference variables/functions input_path, suffix, file_handle,
output_dir, output_file, and extract_dump_date to locate the relevant logic in
utils.py.

In `@tests/test_util.py`:
- Around line 165-166: Update the test assertion that checks the ValueError
message from utils.extract_text_from_xml to escape literal dots in the regex:
change the match pattern to use escaped dots (e.g. r"input_path must have \.xml
or \.bz2 extension") so the regex only matches literal periods; modify the
pytest.raises(...) call accordingly to use the raw string with \.xml and \.bz2.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b253f1 and bc6b684.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • tests/test_util.py

Comment on lines +155 to 175

def compute_sha256(file_path: Union[str, Path, None] = None, data: Union[bytes, None] = None) -> str:
    """
    Compute SHA-256 hash of a file or raw bytes.
    """
    sha256 = hashlib.sha256()

    if data is not None:
        sha256.update(data)
        return sha256.hexdigest()

    if file_path is not None:
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        return sha256.hexdigest()

    raise ValueError("Either file_path or data must be provided")

(The diff hunk also interleaves removed lines from the previous bz2-only extract_text_from_xml body and the merkle-based generate_manifest; only the new compute_sha256 above is under review in this comment.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Enforce exactly one input source in compute_sha256.

Right now, passing both file_path and data silently hashes data. That can hide caller mistakes and produce the wrong digest source.

Proposed fix
 def compute_sha256(file_path: Union[str, Path, None] = None, data: Union[bytes, None] = None) -> str:
     """
     Compute SHA-256 hash of a file or raw bytes.
     """
     sha256 = hashlib.sha256()
+
+    if (file_path is None and data is None) or (file_path is not None and data is not None):
+        raise ValueError("Provide exactly one of file_path or data")
 
     if data is not None:
         sha256.update(data)
         return sha256.hexdigest()
 
     if file_path is not None:
         path = Path(file_path)
         if not path.exists():
             raise FileNotFoundError(f"File not found: {file_path}")
         with path.open("rb") as f:
             for chunk in iter(lambda: f.read(8192), b""):
                 sha256.update(chunk)
         return sha256.hexdigest()
-
-    raise ValueError("Either file_path or data must be provided")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def compute_sha256(file_path: Union[str, Path, None] = None, data: Union[bytes, None] = None) -> str:
    """
    Compute SHA-256 hash of a file or raw bytes.
    """
    sha256 = hashlib.sha256()

    if (file_path is None and data is None) or (file_path is not None and data is not None):
        raise ValueError("Provide exactly one of file_path or data")

    if data is not None:
        sha256.update(data)
        return sha256.hexdigest()

    if file_path is not None:
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        return sha256.hexdigest()
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 168-168: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 174-174: Avoid specifying long messages outside the exception class

(TRY003)


Comment on lines +209 to 223

    input_path = Path(input_path)
    output_dir = Path("data/processed")
    output_dir.mkdir(parents=True, exist_ok=True)
    output_file = output_dir / "wiki_clean.txt"

    dump_date = extract_dump_date(input_path.name)

    suffix = input_path.suffix.lower()
    if suffix == ".bz2":
        file_handle = bz2.open(input_path, "rb")
    elif suffix == ".xml":
        file_handle = open(input_path, "rb")
    else:
        raise ValueError("input_path must have .xml or .bz2 extension")

(Removed lines from elsewhere in the diff, interleaved by the diff view, are omitted here.)


⚠️ Potential issue | 🟡 Minor

Fail fast on unsupported extension before writing to disk.

The function currently creates data/processed even when the extension is invalid. Validation should happen before any filesystem side effects.

Proposed fix
 def extract_text_from_xml(input_path: Union[str, Path]) -> None:
@@
     input_path = Path(input_path)
-    output_dir = Path("data/processed")
-    output_dir.mkdir(parents=True, exist_ok=True)
-    output_file = output_dir / "wiki_clean.txt"
 
     dump_date = extract_dump_date(input_path.name)
 
     suffix = input_path.suffix.lower()
     if suffix == ".bz2":
         file_handle = bz2.open(input_path, "rb")
     elif suffix == ".xml":
         file_handle = open(input_path, "rb")
     else:
         raise ValueError("input_path must have .xml or .bz2 extension")
+
+    output_dir = Path("data/processed")
+    output_dir.mkdir(parents=True, exist_ok=True)
+    output_file = output_dir / "wiki_clean.txt"
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 222-222: Avoid specifying long messages outside the exception class

(TRY003)


Comment on lines +165 to +166

    with pytest.raises(ValueError, match="input_path must have .xml or .bz2 extension"):
        utils.extract_text_from_xml(input_file)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if the test file exists and examine the specific lines
if [ -f "tests/test_util.py" ]; then
  echo "=== File found: tests/test_util.py ==="
  wc -l tests/test_util.py
  echo ""
  echo "=== Lines 160-170 ==="
  sed -n '160,170p' tests/test_util.py | cat -n
else
  echo "File not found at tests/test_util.py"
  # Try to find the file
  find . -name "test_util.py" -type f
fi

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 641


🏁 Script executed:

# Let's also check what the actual error message is by looking at the implementation
# First find the utils module and the extract_text_from_xml function
find . -name "utils.py" -o -name "util.py" | head -20

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 100


🏁 Script executed:

# Search for the extract_text_from_xml function to understand what error message is raised
rg "extract_text_from_xml" -A 20 | head -50

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 2330


🏁 Script executed:

# Now let's examine the extract_text_from_xml function to see what error message it raises
sed -n '1,100p' ./openverifiablellm/utils.py | cat -n

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 3581


🏁 Script executed:

# Search for the extract_text_from_xml function and the error message
rg "def extract_text_from_xml" -A 30 ./openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1224


Escape literal dots in the match regex pattern for stricter error-message validation.

The unescaped . characters in the regex pattern will match any character, not just literal dots. The pattern should use \. to match literal dots only.
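A quick demonstration of the difference: the unescaped pattern also matches messages in which the dot positions hold arbitrary characters, so the test could pass against a subtly wrong error message.

```python
import re

loose = "input_path must have .xml or .bz2 extension"      # '.' is a metacharacter
strict = r"input_path must have \.xml or \.bz2 extension"  # '\.' matches a literal dot

msg = "input_path must have qxml or qbz2 extension"  # no literal dots at all

assert re.search(loose, msg)          # unescaped '.' happily matches 'q'
assert not re.search(strict, msg)     # escaped pattern rejects it
assert re.search(strict, "input_path must have .xml or .bz2 extension")
```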

Proposed fix
-    with pytest.raises(ValueError, match="input_path must have .xml or .bz2 extension"):
+    with pytest.raises(ValueError, match=r"input_path must have \.xml or \.bz2 extension"):
         utils.extract_text_from_xml(input_file)

Suggested change

    with pytest.raises(ValueError, match="input_path must have .xml or .bz2 extension"):
        utils.extract_text_from_xml(input_file)

    with pytest.raises(ValueError, match=r"input_path must have \.xml or \.bz2 extension"):
        utils.extract_text_from_xml(input_file)
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 165-165: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)


@Archit381
Member

@itniuma2026 This issue has already been addressed in #29

@Archit381 Archit381 closed this Mar 5, 2026