Fix #19: refactor utils to support reading xml from both .bz2 and .xml files #20
itniuma2026 wants to merge 1 commit into AOSSIE-Org:main from
Conversation
Walkthrough
Refactored

Changes

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant extract_text_from_xml
    participant clean_wikitext
    participant compute_sha256
    participant generate_manifest
    participant FileSystem
    Client->>extract_text_from_xml: input_path (.bz2 or .xml)
    extract_text_from_xml->>FileSystem: detect file type
    extract_text_from_xml->>FileSystem: open and parse XML
    extract_text_from_xml->>clean_wikitext: raw extracted text
    clean_wikitext-->>extract_text_from_xml: cleaned text
    extract_text_from_xml->>FileSystem: write to data/processed/wiki_clean.txt
    Client->>generate_manifest: raw_path, processed_path, output_dir
    generate_manifest->>compute_sha256: read raw file
    compute_sha256-->>generate_manifest: raw file hash
    generate_manifest->>compute_sha256: read processed file
    compute_sha256-->>generate_manifest: processed file hash
    generate_manifest->>FileSystem: create output directory
    generate_manifest->>FileSystem: write dataset_manifest.json
    generate_manifest-->>Client: manifest object with hashes
```
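The manifest leg of the diagram can be condensed into a runnable sketch. This is illustrative only: the field set is simplified to the two hash fields, and the function signatures here are assumptions, not the PR's exact code.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def compute_sha256(path):
    """Stream a file through SHA-256 in 8 KiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def generate_manifest(raw_path, processed_path, output_dir):
    """Hash both files and write dataset_manifest.json (simplified field set)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "raw_sha256": compute_sha256(raw_path),
        "processed_sha256": compute_sha256(processed_path),
    }
    (output_dir / "dataset_manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest

with tempfile.TemporaryDirectory() as d:
    raw = Path(d) / "dump.xml"
    processed = Path(d) / "wiki_clean.txt"
    raw.write_bytes(b"<page/>")
    processed.write_text("clean text")
    manifest = generate_manifest(raw, processed, Path(d) / "out")
```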
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 155-175: The compute_sha256 function currently prefers data when
both file_path and data are provided, which can hide caller mistakes; update
compute_sha256 to validate inputs so exactly one source is used: if both
file_path and data are provided raise a ValueError (e.g., "Provide exactly one
of file_path or data"), if only data is provided keep current bytes hashing
behavior, if only file_path is provided keep current file-reading hashing
behavior, and if neither raise the existing ValueError. Ensure you reference the
compute_sha256 signature and its parameters file_path and data when making this
change.
- Around line 209-223: The code currently creates output_dir ("data/processed")
before validating the input_path extension; move the extension check (the suffix
== ".bz2" / ".xml" validation and the ValueError) to occur immediately after
computing suffix (input_path.suffix.lower()) and before creating output_dir or
calling output_dir.mkdir; only after validating and opening the file_handle (or
after successful selection of bz2.open / open) should you create output_dir and
set output_file; reference variables/functions input_path, suffix, file_handle,
output_dir, output_file, and extract_dump_date to locate the relevant logic in
utils.py.
In `@tests/test_util.py`:
- Around line 165-166: Update the test assertion that checks the ValueError
message from utils.extract_text_from_xml to escape literal dots in the regex:
change the match pattern to use escaped dots (e.g. r"input_path must have \.xml
or \.bz2 extension") so the regex only matches literal periods; modify the
pytest.raises(...) call accordingly to use the raw string with \.xml and \.bz2.
The rendered diff interleaves two hunks of `openverifiablellm/utils.py`; de-interleaved, the new `compute_sha256` reads:

```python
def compute_sha256(file_path: Union[str, Path, None] = None, data: Union[bytes, None] = None) -> str:
    """
    Compute SHA-256 hash of a file or raw bytes.
    """
    sha256 = hashlib.sha256()

    if data is not None:
        sha256.update(data)
        return sha256.hexdigest()

    if file_path is not None:
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        return sha256.hexdigest()

    raise ValueError("Either file_path or data must be provided")
```

and the surrounding `extract_text_from_xml` / `generate_manifest` context reads:

```python
    input_path = Path(input_path)

    # Fixed output path
    project_root = Path.cwd()
    output_dir = project_root / "data" / "processed"
    output_dir.mkdir(parents=True, exist_ok=True)

    output_path = output_dir / "wiki_clean.txt"

    with bz2.open(input_path, "rb") as f:
        context = ET.iterparse(f, events=("end",))

        with open(output_path, "w", encoding="utf-8") as out:
            for _, elem in context:
                if elem.tag.endswith("page"):
                    text_elem = elem.find(".//{*}text")

                    if text_elem is not None and text_elem.text:
                        cleaned = clean_wikitext(text_elem.text)
                        if cleaned:
                            out.write(cleaned + "\n\n")

                    elem.clear()
    logger.info("Preprocessing complete. Output saved to %s", output_path)
    generate_manifest(input_path, output_path)


# generate data manifest
def generate_manifest(raw_path, processed_path):
    raw_path = Path(raw_path)
    processed_path = Path(processed_path)

    if not processed_path.exists():
        raise FileNotFoundError(
            f"Processed file not found at {processed_path}. Run preprocessing first."
        )

    manifest = {
        "wikipedia_dump": raw_path.name,
        "dump_date": extract_dump_date(raw_path.name),
        "raw_sha256": compute_sha256(file_path=raw_path),
        "processed_sha256": compute_sha256(file_path=processed_path),

        # ---------------- ADDED FIELDS ----------------
        "raw_merkle_root": compute_merkle_root(raw_path, chunk_size=MERKLE_CHUNK_SIZE_BYTES),
        "processed_merkle_root": compute_merkle_root(processed_path, chunk_size=MERKLE_CHUNK_SIZE_BYTES),
        "chunk_size_bytes": MERKLE_CHUNK_SIZE_BYTES,
        # ----------------------------------------------

        "preprocessing_version": "v1",
        "python_version": platform.python_version()
    }
    project_root = Path.cwd()
    manifest_path = project_root / "data" / "dataset_manifest.json"
    manifest_path.parent.mkdir(parents=True, exist_ok=True)

    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
🛠️ Refactor suggestion | 🟠 Major
Enforce exactly one input source in compute_sha256.
Right now, passing both file_path and data silently hashes data. That can hide caller mistakes and produce the wrong digest source.
Proposed fix

```diff
 def compute_sha256(file_path: Union[str, Path, None] = None, data: Union[bytes, None] = None) -> str:
     """
     Compute SHA-256 hash of a file or raw bytes.
     """
     sha256 = hashlib.sha256()
+
+    if (file_path is None and data is None) or (file_path is not None and data is not None):
+        raise ValueError("Provide exactly one of file_path or data")

     if data is not None:
         sha256.update(data)
         return sha256.hexdigest()

     if file_path is not None:
         path = Path(file_path)
         if not path.exists():
             raise FileNotFoundError(f"File not found: {file_path}")
         with path.open("rb") as f:
             for chunk in iter(lambda: f.read(8192), b""):
                 sha256.update(chunk)
         return sha256.hexdigest()
-
-    raise ValueError("Either file_path or data must be provided")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def compute_sha256(file_path: Union[str, Path, None] = None, data: Union[bytes, None] = None) -> str:
    """
    Compute SHA-256 hash of a file or raw bytes.
    """
    sha256 = hashlib.sha256()

    if (file_path is None and data is None) or (file_path is not None and data is not None):
        raise ValueError("Provide exactly one of file_path or data")

    if data is not None:
        sha256.update(data)
        return sha256.hexdigest()

    if file_path is not None:
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        return sha256.hexdigest()
```
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 168-168: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 174-174: Avoid specifying long messages outside the exception class
(TRY003)
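The stricter contract the review asks for behaves like this. A standalone sketch, not the committed code: the function name `compute_sha256_strict` is hypothetical, used here to avoid shadowing the real `compute_sha256`.

```python
import hashlib
from pathlib import Path

def compute_sha256_strict(file_path=None, data=None):
    """Hash exactly one source; ambiguous calls fail fast (hypothetical name)."""
    if (file_path is None) == (data is None):
        raise ValueError("Provide exactly one of file_path or data")
    sha256 = hashlib.sha256()
    if data is not None:
        sha256.update(data)
        return sha256.hexdigest()
    with Path(file_path).open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

# Hashing bytes still works as before:
digest = compute_sha256_strict(data=b"hello")

# Passing both sources is now rejected instead of silently hashing `data`:
try:
    compute_sha256_strict(file_path="dump.xml", data=b"hello")
    raised = False
except ValueError:
    raised = True
```

The `(file_path is None) == (data is None)` comparison covers both failure modes (neither provided, or both provided) in one check.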
```python
    input_path = Path(input_path)
    output_dir = Path("data/processed")
    output_dir.mkdir(parents=True, exist_ok=True)
    output_file = output_dir / "wiki_clean.txt"

    dump_date = extract_dump_date(input_path.name)

    suffix = input_path.suffix.lower()
    if suffix == ".bz2":
        file_handle = bz2.open(input_path, "rb")
    elif suffix == ".xml":
        file_handle = open(input_path, "rb")
    else:
        raise ValueError("input_path must have .xml or .bz2 extension")
```

From the adjacent `compute_sha256` hunk of the same diff:

```python
    if (data is None) == (file_path is None):
        raise ValueError(
            "Exactly one of 'data' or 'file_path' must be provided."
        )

    sha256 = hashlib.sha256()
```
Fail fast on unsupported extension before writing to disk.
The function currently creates data/processed even when the extension is invalid. Validation should happen before any filesystem side effects.
Proposed fix

```diff
 def extract_text_from_xml(input_path: Union[str, Path]) -> None:
@@
     input_path = Path(input_path)
-    output_dir = Path("data/processed")
-    output_dir.mkdir(parents=True, exist_ok=True)
-    output_file = output_dir / "wiki_clean.txt"
     dump_date = extract_dump_date(input_path.name)
     suffix = input_path.suffix.lower()
     if suffix == ".bz2":
         file_handle = bz2.open(input_path, "rb")
     elif suffix == ".xml":
         file_handle = open(input_path, "rb")
     else:
         raise ValueError("input_path must have .xml or .bz2 extension")
+
+    output_dir = Path("data/processed")
+    output_dir.mkdir(parents=True, exist_ok=True)
+    output_file = output_dir / "wiki_clean.txt"
```

🧰 Tools
🪛 Ruff (0.15.2)
[warning] 222-222: Avoid specifying long messages outside the exception class
(TRY003)
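The fail-fast ordering is easy to verify in isolation. A minimal sketch, assuming a simplified `preprocess` helper (hypothetical, not the PR's function) that validates the extension before touching the filesystem:

```python
import tempfile
from pathlib import Path

def preprocess(input_path, project_root):
    """Validate the extension before any filesystem side effects (illustrative)."""
    input_path = Path(input_path)
    suffix = input_path.suffix.lower()
    if suffix not in (".bz2", ".xml"):
        # Fail fast: no directories have been created yet.
        raise ValueError("input_path must have .xml or .bz2 extension")
    output_dir = Path(project_root) / "data" / "processed"
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir / "wiki_clean.txt"

with tempfile.TemporaryDirectory() as root:
    try:
        preprocess("dump.txt", root)
        raised = False
    except ValueError:
        raised = True
    # The invalid call left no data/processed directory behind:
    leftover = (Path(root) / "data").exists()
```

With validation moved first, a bad input raises without creating `data/processed`, which keeps repeated failed runs from littering the working directory.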
```python
        with pytest.raises(ValueError, match="input_path must have .xml or .bz2 extension"):
            utils.extract_text_from_xml(input_file)
```
🧩 Analysis chain

🏁 Script executed:

```shell
# Check if the test file exists and examine the specific lines
if [ -f "tests/test_util.py" ]; then
  echo "=== File found: tests/test_util.py ==="
  wc -l tests/test_util.py
  echo ""
  echo "=== Lines 160-170 ==="
  sed -n '160,170p' tests/test_util.py | cat -n
else
  echo "File not found at tests/test_util.py"
  # Try to find the file
  find . -name "test_util.py" -type f
fi
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 641

🏁 Script executed:

```shell
# Let's also check what the actual error message is by looking at the implementation
# First find the utils module and the extract_text_from_xml function
find . -name "utils.py" -o -name "util.py" | head -20
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 100

🏁 Script executed:

```shell
# Search for the extract_text_from_xml function to understand what error message is raised
rg "extract_text_from_xml" -A 20 | head -50
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 2330

🏁 Script executed:

```shell
# Now let's examine the extract_text_from_xml function to see what error message it raises
sed -n '1,100p' ./openverifiablellm/utils.py | cat -n
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 3581

🏁 Script executed:

```shell
# Search for the extract_text_from_xml function and the error message
rg "def extract_text_from_xml" -A 30 ./openverifiablellm/utils.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1224
Escape literal dots in the match regex pattern for stricter error-message validation.
The unescaped `.` characters in the regex pattern will match any character, not just literal dots. The pattern should use `\.` to match literal dots only.

Proposed fix

```diff
-        with pytest.raises(ValueError, match="input_path must have .xml or .bz2 extension"):
+        with pytest.raises(ValueError, match=r"input_path must have \.xml or \.bz2 extension"):
             utils.extract_text_from_xml(input_file)
```

📝 Committable suggestion
```python
        with pytest.raises(ValueError, match=r"input_path must have \.xml or \.bz2 extension"):
            utils.extract_text_from_xml(input_file)
```
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 165-165: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
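The difference the review is pointing at can be demonstrated directly. A standalone sketch using sample messages (the strings are made up for illustration):

```python
import re

loose = "input_path must have .xml or .bz2 extension"
strict = r"input_path must have \.xml or \.bz2 extension"

# Unescaped '.' matches any character, so this unrelated message also passes:
assert re.search(loose, "input_path must have Axml or Bbz2 extension")

# Escaped dots only match literal periods:
assert re.search(strict, "input_path must have .xml or .bz2 extension")
assert not re.search(strict, "input_path must have Axml or Bbz2 extension")
```

Since `pytest.raises(..., match=...)` feeds the pattern to `re.search`, the same looseness applies inside the test assertion.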
@itniuma2026 This issue has already been addressed in #29
Summary

Refactor `extract_text_from_xml()` in `openverifiablellm/utils.py` to support both `.bz2` compressed and plain `.xml` files by detecting the file extension and opening accordingly.

Approach

1. In `openverifiablellm/utils.py`, modify `extract_text_from_xml()` to check `input_path.suffix.lower()`. If `.bz2`, use `bz2.open(input_path, 'rb')`. If `.xml`, use `open(input_path, 'rb')`. Otherwise, raise a `ValueError` with a descriptive message. Use a context manager (`with file_handle as f`) to wrap the existing `ET.iterparse` logic.
2. In `tests/test_util.py`, add a test case that passes a plain `.xml` file to `extract_text_from_xml()` to verify it works correctly, and optionally add a test for an unsupported extension to verify the `ValueError` is raised.

Files Changed

- `openverifiablellm/utils.py`
- `tests/test_util.py`

Related Issue
Fixes #19
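The suffix-dispatch approach described above can be sketched end to end. This is an illustrative sketch: the helper name `open_dump` is hypothetical, and the PR inlines this logic inside `extract_text_from_xml()` rather than splitting it out.

```python
import bz2
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def open_dump(input_path):
    """Return a binary handle for a .xml or .bz2 input (hypothetical helper)."""
    input_path = Path(input_path)
    suffix = input_path.suffix.lower()
    if suffix == ".bz2":
        return bz2.open(input_path, "rb")
    if suffix == ".xml":
        return open(input_path, "rb")
    raise ValueError("input_path must have .xml or .bz2 extension")

xml_bytes = b"<mediawiki><page><title>A</title></page></mediawiki>"
tags = {}
with tempfile.TemporaryDirectory() as d:
    plain = Path(d) / "sample.xml"
    packed = Path(d) / "sample.xml.bz2"
    plain.write_bytes(xml_bytes)
    packed.write_bytes(bz2.compress(xml_bytes))
    for path in (plain, packed):
        # The same iterparse logic serves both inputs via the context manager:
        with open_dump(path) as f:
            tags[path.suffix] = [el.tag for _, el in ET.iterparse(f, events=("end",))]
```

Both handles expose the same binary read interface, so `ET.iterparse` is unaware of whether the source was compressed.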
Summary by CodeRabbit
New Features
Refactor