
Fix #19: refactor utils to support reading xml from both .bz2 and .xml files#20

Closed
itniuma2026 wants to merge 1 commit into AOSSIE-Org:main from itniuma2026:niuma/fix-19

Conversation


@itniuma2026 itniuma2026 commented Mar 2, 2026

Summary

Refactor extract_text_from_xml() in openverifiablellm/utils.py to support both .bz2 compressed and plain .xml files by detecting the file extension and opening accordingly.

Approach

  1. In openverifiablellm/utils.py, modify extract_text_from_xml() to check input_path.suffix.lower(): if .bz2, use bz2.open(input_path, 'rb'); if .xml, use open(input_path, 'rb'); otherwise, raise a ValueError with a descriptive message. Use a context manager (with file_handle as f) to wrap the existing ET.iterparse logic.
  2. In tests/test_util.py, add a test case that passes a plain .xml file to extract_text_from_xml() to verify it works correctly, and optionally add a test for an unsupported extension to verify the ValueError is raised.
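The extension dispatch in step 1 can be sketched as follows. Here open_dump is a hypothetical helper name used for illustration only (the PR implements this logic inline in extract_text_from_xml); the error message mirrors the one exercised by the tests:

```python
import bz2
from pathlib import Path
from typing import IO, Union

def open_dump(input_path: Union[str, Path]) -> IO[bytes]:
    """Return a binary handle for a .bz2-compressed or plain .xml dump.

    Raises ValueError for any other extension, before any filesystem
    side effects such as creating output directories.
    """
    input_path = Path(input_path)
    suffix = input_path.suffix.lower()
    if suffix == ".bz2":
        # Transparently decompress while reading
        return bz2.open(input_path, "rb")
    if suffix == ".xml":
        return open(input_path, "rb")
    raise ValueError("input_path must have .xml or .bz2 extension")
```

The returned handle can then wrap the existing ET.iterparse loop in a with statement, so both code paths share one parsing implementation.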

Files Changed

  • openverifiablellm/utils.py
  • tests/test_util.py

Related Issue

Fixes #19

Summary by CodeRabbit

  • New Features

    • Added text extraction and cleaning capabilities for Wikipedia XML dumps with automatic date parsing.
    • Implemented improved manifest generation system with enhanced data processing workflow.
  • Refactor

    • Streamlined hash computation logic for improved efficiency.
    • Reorganized data processing pipeline with modularized, publicly accessible functions.


coderabbitai bot commented Mar 2, 2026

Walkthrough

Refactored utils.py to modularize Wikipedia XML preprocessing and manifest generation workflows. Introduced five new public functions for text extraction, date parsing, text cleaning, and manifest generation. Replaced per-chunk Merkle hashing with inline computation and updated tests to validate XML extraction and manifest structure.

Changes

Cohort / File(s) Summary
Manifest Generation Refactor
openverifiablellm/utils.py
Introduced modularized public functions: compute_sha256() (dual-input file/data support), extract_text_from_xml() (XML/.bz2 extraction with file-type detection), extract_dump_date() (date parsing), clean_wikitext() (regex-based markup cleaning), and generate_manifest() (manifest creation with hash computation). Replaced per-chunk Merkle parent hashing with single-step inline SHA-256 computation. Updated manifest generation workflow to write dataset_manifest.json and return manifest object.
Test Suite Updates
tests/test_util.py
Removed Merkle-focused tests (roots, proofs, determinism, proof verification). Added test_extract_text_from_xml_plain_xml() and test_extract_text_from_xml_unsupported_extension() for XML extraction validation. Enhanced end-to-end extraction test with temporary file handling and project root patching. Added test_manifest_contains_merkle_fields() to verify the manifest structure contains non-null merkle_root fields.
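The extract_dump_date() function mentioned above is not shown in full in this view. Wikipedia dump filenames conventionally embed a YYYYMMDD stamp (e.g. enwiki-20240301-pages-articles-multistream.xml.bz2), so a minimal sketch of the date parsing might look like this; the actual implementation in utils.py may differ:

```python
import re
from datetime import datetime
from typing import Optional

def extract_dump_date(filename: str) -> Optional[str]:
    """Extract an ISO date from the first 8-digit YYYYMMDD token in a
    dump filename, or return None when no valid date is present."""
    match = re.search(r"(\d{8})", filename)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date().isoformat()
    except ValueError:
        # An 8-digit run that is not a real calendar date
        return None
```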

Sequence Diagram

sequenceDiagram
    participant Client
    participant extract_text_from_xml
    participant clean_wikitext
    participant compute_sha256
    participant generate_manifest
    participant FileSystem

    Client->>extract_text_from_xml: input_path (.bz2 or .xml)
    extract_text_from_xml->>FileSystem: detect file type
    extract_text_from_xml->>FileSystem: open and parse XML
    extract_text_from_xml->>clean_wikitext: raw extracted text
    clean_wikitext-->>extract_text_from_xml: cleaned text
    extract_text_from_xml->>FileSystem: write to data/processed/wiki_clean.txt
    
    Client->>generate_manifest: raw_path, processed_path, output_dir
    generate_manifest->>compute_sha256: read raw file
    compute_sha256-->>generate_manifest: raw file hash
    generate_manifest->>compute_sha256: read processed file
    compute_sha256-->>generate_manifest: processed file hash
    generate_manifest->>FileSystem: create output directory
    generate_manifest->>FileSystem: write dataset_manifest.json
    generate_manifest-->>Client: manifest object with hashes

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 Hashing files with newfound grace,
XML streams at quickened pace,
Manifests now cleanly born,
Text extracted, markup torn—
Verification's rabbit dawn! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Out of Scope Changes check ⚠️ Warning
    Explanation: The PR includes substantial additional changes beyond the scope of issue #19: new public functions (the compute_sha256 refactor, extract_dump_date, clean_wikitext, generate_manifest) and the removal of merkle-related tests not mentioned in the linked issue.
    Resolution: Remove the changes unrelated to issue #19 (new functions, manifest generation, merkle test removals) or create separate PRs for these refactoring efforts to maintain focused, reviewable changesets.
✅ Passed checks (3 passed)
  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title clearly and specifically describes the main refactoring change: adding support for reading XML from both .bz2 and .xml files, matching the core objective of the PR.
  • Linked Issues check ✅ Passed: The code changes implement automatic file type detection with extension-based logic for .bz2 and .xml files, raising ValueError for unsupported extensions, directly fulfilling the requirements of issue #19.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 155-175: The compute_sha256 function currently prefers data when
both file_path and data are provided, which can hide caller mistakes; update
compute_sha256 to validate inputs so exactly one source is used: if both
file_path and data are provided raise a ValueError (e.g., "Provide exactly one
of file_path or data"), if only data is provided keep current bytes hashing
behavior, if only file_path is provided keep current file-reading hashing
behavior, and if neither raise the existing ValueError. Ensure you reference the
compute_sha256 signature and its parameters file_path and data when making this
change.
- Around line 209-223: The code currently creates output_dir ("data/processed")
before validating the input_path extension; move the extension check (the suffix
== ".bz2" / ".xml" validation and the ValueError) to occur immediately after
computing suffix (input_path.suffix.lower()) and before creating output_dir or
calling output_dir.mkdir; only after validating and opening the file_handle (or
after successful selection of bz2.open / open) should you create output_dir and
set output_file; reference variables/functions input_path, suffix, file_handle,
output_dir, output_file, and extract_dump_date to locate the relevant logic in
utils.py.

In `@tests/test_util.py`:
- Around line 165-166: Update the test assertion that checks the ValueError
message from utils.extract_text_from_xml to escape literal dots in the regex:
change the match pattern to use escaped dots (e.g. r"input_path must have \.xml
or \.bz2 extension") so the regex only matches literal periods; modify the
pytest.raises(...) call accordingly to use the raw string with \.xml and \.bz2.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b253f1 and bc6b684.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • tests/test_util.py

Comment on lines +155 to 175

def compute_sha256(file_path: Union[str, Path, None] = None, data: Union[bytes, None] = None) -> str:
    """
    Compute SHA-256 hash of a file or raw bytes.
    """
    sha256 = hashlib.sha256()

    if data is not None:
        sha256.update(data)
        return sha256.hexdigest()

    if file_path is not None:
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        return sha256.hexdigest()

    raise ValueError("Either file_path or data must be provided")

(The diff hunk also interleaves removed lines from the previous bz2-only extract_text_from_xml body and the merkle-based generate_manifest; only the new compute_sha256 above is under review in this comment.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Enforce exactly one input source in compute_sha256.

Right now, passing both file_path and data silently hashes data. That can hide caller mistakes and produce the wrong digest source.

Proposed fix
 def compute_sha256(file_path: Union[str, Path, None] = None, data: Union[bytes, None] = None) -> str:
     """
     Compute SHA-256 hash of a file or raw bytes.
     """
     sha256 = hashlib.sha256()
+
+    if (file_path is None and data is None) or (file_path is not None and data is not None):
+        raise ValueError("Provide exactly one of file_path or data")
 
     if data is not None:
         sha256.update(data)
         return sha256.hexdigest()
 
     if file_path is not None:
         path = Path(file_path)
         if not path.exists():
             raise FileNotFoundError(f"File not found: {file_path}")
         with path.open("rb") as f:
             for chunk in iter(lambda: f.read(8192), b""):
                 sha256.update(chunk)
         return sha256.hexdigest()
-
-    raise ValueError("Either file_path or data must be provided")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def compute_sha256(file_path: Union[str, Path, None] = None, data: Union[bytes, None] = None) -> str:
    """
    Compute SHA-256 hash of a file or raw bytes.
    """
    sha256 = hashlib.sha256()

    if (file_path is None and data is None) or (file_path is not None and data is not None):
        raise ValueError("Provide exactly one of file_path or data")

    if data is not None:
        sha256.update(data)
        return sha256.hexdigest()

    if file_path is not None:
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        return sha256.hexdigest()
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 168-168: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 174-174: Avoid specifying long messages outside the exception class

(TRY003)


Comment on lines +209 to 223

    input_path = Path(input_path)
    output_dir = Path("data/processed")
    output_dir.mkdir(parents=True, exist_ok=True)
    output_file = output_dir / "wiki_clean.txt"

    dump_date = extract_dump_date(input_path.name)

    suffix = input_path.suffix.lower()
    if suffix == ".bz2":
        file_handle = bz2.open(input_path, "rb")
    elif suffix == ".xml":
        file_handle = open(input_path, "rb")
    else:
        raise ValueError("input_path must have .xml or .bz2 extension")

(Removed lines from elsewhere in the diff, interleaved by the diff view, are omitted here.)


⚠️ Potential issue | 🟡 Minor

Fail fast on unsupported extension before writing to disk.

The function currently creates data/processed even when the extension is invalid. Validation should happen before any filesystem side effects.

Proposed fix
 def extract_text_from_xml(input_path: Union[str, Path]) -> None:
@@
     input_path = Path(input_path)
-    output_dir = Path("data/processed")
-    output_dir.mkdir(parents=True, exist_ok=True)
-    output_file = output_dir / "wiki_clean.txt"
 
     dump_date = extract_dump_date(input_path.name)
 
     suffix = input_path.suffix.lower()
     if suffix == ".bz2":
         file_handle = bz2.open(input_path, "rb")
     elif suffix == ".xml":
         file_handle = open(input_path, "rb")
     else:
         raise ValueError("input_path must have .xml or .bz2 extension")
+
+    output_dir = Path("data/processed")
+    output_dir.mkdir(parents=True, exist_ok=True)
+    output_file = output_dir / "wiki_clean.txt"
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 222-222: Avoid specifying long messages outside the exception class

(TRY003)


Comment on lines +165 to +166

    with pytest.raises(ValueError, match="input_path must have .xml or .bz2 extension"):
        utils.extract_text_from_xml(input_file)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if the test file exists and examine the specific lines
if [ -f "tests/test_util.py" ]; then
  echo "=== File found: tests/test_util.py ==="
  wc -l tests/test_util.py
  echo ""
  echo "=== Lines 160-170 ==="
  sed -n '160,170p' tests/test_util.py | cat -n
else
  echo "File not found at tests/test_util.py"
  # Try to find the file
  find . -name "test_util.py" -type f
fi

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 641


🏁 Script executed:

# Let's also check what the actual error message is by looking at the implementation
# First find the utils module and the extract_text_from_xml function
find . -name "utils.py" -o -name "util.py" | head -20

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 100


🏁 Script executed:

# Search for the extract_text_from_xml function to understand what error message is raised
rg "extract_text_from_xml" -A 20 | head -50

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 2330


🏁 Script executed:

# Now let's examine the extract_text_from_xml function to see what error message it raises
sed -n '1,100p' ./openverifiablellm/utils.py | cat -n

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 3581


🏁 Script executed:

# Search for the extract_text_from_xml function and the error message
rg "def extract_text_from_xml" -A 30 ./openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1224


Escape literal dots in the match regex pattern for stricter error-message validation.

The unescaped . characters in the regex pattern will match any character, not just literal dots. The pattern should use \. to match literal dots only.
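A quick demonstration of the difference: the unescaped pattern also matches messages in which the dot positions hold arbitrary characters, so the test could pass against a subtly wrong error message.

```python
import re

loose = "input_path must have .xml or .bz2 extension"      # '.' is a metacharacter
strict = r"input_path must have \.xml or \.bz2 extension"  # '\.' matches a literal dot

msg = "input_path must have qxml or qbz2 extension"  # no literal dots at all

assert re.search(loose, msg)          # unescaped '.' happily matches 'q'
assert not re.search(strict, msg)     # escaped pattern rejects it
assert re.search(strict, "input_path must have .xml or .bz2 extension")
```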

Proposed fix
-    with pytest.raises(ValueError, match="input_path must have .xml or .bz2 extension"):
+    with pytest.raises(ValueError, match=r"input_path must have \.xml or \.bz2 extension"):
         utils.extract_text_from_xml(input_file)

Suggested change

    with pytest.raises(ValueError, match="input_path must have .xml or .bz2 extension"):
        utils.extract_text_from_xml(input_file)

    with pytest.raises(ValueError, match=r"input_path must have \.xml or \.bz2 extension"):
        utils.extract_text_from_xml(input_file)
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 165-165: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)


@Archit381
Member

@itniuma2026 This issue has already been addressed in #29

@Archit381 Archit381 closed this Mar 5, 2026