Handle malformed XML in extract_text_from_xml and add edge case test by tharunkumar4562 · Pull Request #71 · AOSSIE-Org/OpenVerifiableLLM

tharunkumar4562 · 2026-03-12T17:00:01Z

Addressed Issues:

Fixes #48

Screenshots/Recordings:

Before fix: malformed XML raised a low-level ParseError without context.

After fix: the error now includes a clear message indicating which
XML dump failed to parse.

Additional Notes:

While running the test suite locally, I noticed that extract_text_from_xml does not provide useful context when encountering malformed XML.

Changes in this PR:

Added error handling around ET.iterparse to catch xml.etree.ElementTree.ParseError.
Log a descriptive error message including the input file path.
Re-raise the error with additional context to improve debugging.
Added a new unit test test_extract_text_from_xml_malformed to verify behaviour when malformed XML is encountered.

This improves robustness of the XML extraction step and prevents unclear crashes when corrupted Wikipedia dumps are processed.

All tests pass locally.

Checklist

My code follows the project's code style and conventions
I have made corresponding changes to the documentation
My changes generate no new warnings or errors
I have joined the Discord server and I will share a link to this PR with the project maintainers there
I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

New Features
- Added hash computation for raw data and file inputs.
- Manifest generation now includes additional cryptographic fields (raw and processed merkle roots and chunk size).
Bug Fixes
- Improved XML parsing error handling with clearer failure reporting.
- In-loop memory cleanup during XML processing for better efficiency.
Tests
- Added test coverage for malformed XML error handling.

coderabbitai · 2026-03-12T17:00:19Z

Walkthrough

Added XML parse error handling and in-loop element cleanup in extract_text_from_xml; introduced compute_sha256_bytes() returning raw SHA-256 bytes; extended generate_manifest() to include raw_merkle_root, processed_merkle_root, and chunk_size_bytes; added a unit test asserting malformed XML raises a parse error.

Changes

Cohort / File(s)	Summary
XML parsing & cleanup `openverifiablellm/utils.py`	Wraps XML iteration in try/except to raise/descriptively log parse errors for malformed XML; calls `elem.clear()` after processing page elements to free memory.
Hashing & manifest `openverifiablellm/utils.py`	Added `compute_sha256_bytes(*, data, file_path) -> bytes` for raw SHA-256 output; `generate_manifest()` now emits `raw_merkle_root`, `processed_merkle_root`, and `chunk_size_bytes`.
Tests `tests/test_util.py`	Added `test_extract_text_from_xml_malformed()` to assert that malformed XML raises a parse error and contains an appropriate error message.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Improve XML extraction logic and update tests #29: Modifies extract_text_from_xml (adds auto-detection of bz2 vs uncompressed input) and thus overlaps parsing logic.
Deterministic Preprocessing Verification Mode that re-runs dataset preprocessing and validates that all generated artifacts #25: Introduces SHA-256/merkle root computations and verification logic that align with the new hashing and manifest fields.

Suggested labels

Python Lang

Suggested reviewers

Archit381

Poem

🐰 I nibble bytes and clear each tag,
Parse errors flagged — no silent lag.
Roots hashed crisp, chunks counted right,
Hops and tests keep pipelines bright. 🥕✨

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main changes: adding error handling for malformed XML and a corresponding edge-case test.
Linked Issues check	✅ Passed	The PR addresses all requirements from issue `#48`: adds error handling for malformed XML, includes a descriptive error message with file context, and adds a unit test verifying that malformed XML triggers an exception.
Out of Scope Changes check	✅ Passed	All changes are scoped to the PR objectives: error handling improvements in extract_text_from_xml, in-loop memory cleanup, new compute_sha256_bytes helper, and manifest enhancements are all reasonable supporting changes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 225-245: The current parser writes directly to output_path (the
with open(output_path, "w", ...) block) so a late ET.ParseError can leave a
partial or corrupted wiki_clean.txt; instead, write to a temporary file in the
same directory (using tempfile.NamedTemporaryFile or mkstemp) while parsing and
only atomically replace the final output_path with os.replace when parsing
completes successfully; ensure the temp file is opened with encoding="utf-8",
cleaned text is written there (via clean_wikitext), close and sync the temp file
before os.replace, and on exceptions (e.g., ET.ParseError) remove the temp file
and re-raise the error so the original output is preserved.

In `@tests/test_util.py`:
- Around line 195-200: Tighten the test so it only catches the XML parse failure
from defusedxml by using the concrete ParseError type instead of Exception and
assert the contextual message; update the pytest.raises(...) to expect
defusedxml.ElementTree.ParseError (import ParseError at top of
tests/test_util.py or reference defusedxml.ElementTree.ParseError) when calling
utils.extract_text_from_xml, and ensure the assertion checks that the exception
string contains the specific "Failed to parse XML" message produced by
extract_text_from_xml.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: af2b2c77-f0a8-4633-8515-c082d8e119c2

📥 Commits

Reviewing files that changed from the base of the PR and between c352df0 and 132592b.

📒 Files selected for processing (2)

openverifiablellm/utils.py
tests/test_util.py

coderabbitai · 2026-03-12T17:08:14Z

openverifiablellm/utils.py

+    try:
+        with open_func(input_path, "rb") as f:
+            context = ET.iterparse(f, events=("end",))
+
+            with open(output_path, "w", encoding="utf-8") as out:
+                for _, elem in context:
+                    if elem.tag.endswith("page"):
+                        text_elem = elem.find(".//{*}text")
+
+                        if text_elem is not None and text_elem.text:
+                            cleaned = clean_wikitext(text_elem.text)
+                            if cleaned:
+                                out.write(cleaned + "\n\n")
+
+                        elem.clear()
+    except ET.ParseError as e:
+        # provide context about which file failed to parse
+        msg = f"Failed to parse XML dump '{input_path}': {e}"
+        logger.error(msg)
+        # re-raise a new ParseError containing context
+        raise ET.ParseError(msg) from e


⚠️ Potential issue | 🟠 Major

Write the processed output atomically.

wiki_clean.txt is opened in place before parsing finishes. A late ParseError will therefore leave an empty/partial processed file behind and can also destroy a previously good output on rerun. That makes a failed preprocessing run look consumable to downstream steps.

🛠️ Suggested fix

open_func = bz2.open if is_bz2 else open + tmp_output_path = output_path.with_name(output_path.name + ".tmp") try: with open_func(input_path, "rb") as f: context = ET.iterparse(f, events=("end",)) - with open(output_path, "w", encoding="utf-8") as out: + with open(tmp_output_path, "w", encoding="utf-8") as out: for _, elem in context: if elem.tag.endswith("page"): text_elem = elem.find(".//{*}text") if text_elem is not None and text_elem.text: @@ elem.clear() + tmp_output_path.replace(output_path) except ET.ParseError as e: + if tmp_output_path.exists(): + tmp_output_path.unlink() # provide context about which file failed to parse msg = f"Failed to parse XML dump '{input_path}': {e}" logger.error(msg) # re-raise a new ParseError containing context raise ET.ParseError(msg) from e

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 225 - 245, The current parser writes directly to output_path (the with open(output_path, "w", ...) block) so a late ET.ParseError can leave a partial or corrupted wiki_clean.txt; instead, write to a temporary file in the same directory (using tempfile.NamedTemporaryFile or mkstemp) while parsing and only atomically replace the final output_path with os.replace when parsing completes successfully; ensure the temp file is opened with encoding="utf-8", cleaned text is written there (via clean_wikitext), close and sync the temp file before os.replace, and on exceptions (e.g., ET.ParseError) remove the temp file and re-raise the error so the original output is preserved.

coderabbitai · 2026-03-12T17:08:14Z

tests/test_util.py

+    # ensure the parse error bubbles up
+    with pytest.raises(Exception) as excinfo:
+        utils.extract_text_from_xml(input_file)
+
+    # elementtree ParseError is expected
+    assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and read the test file fd -name "test_util.py" -type f

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 302

🏁 Script executed:

# Read the test file around lines 195-200 head -220 tests/test_util.py | tail -50

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1529

🏁 Script executed:

# Also check the utils module to understand the implementation fd -name "util.py" -type f | grep -v test | head -5

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 302

🏁 Script executed:

find . -name "util.py" -type f | grep -v test

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 54

🏁 Script executed:

# Read the utils module to understand extract_text_from_xml implementation cat tests/test_util.py | head -30

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 771

🏁 Script executed:

# Find the utils module find . -path "*/utils.py" -o -path "*/util.py" | grep -v test | head -10

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 100

🏁 Script executed:

cat openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 15537

Tighten the assertion to catch only the specific parse failure.

This currently passes on any exception (e.g., FileNotFoundError, TypeError), so it doesn't actually protect the malformed-XML contract from regressions. Require the concrete ParseError from defusedxml.ElementTree and match the contextual error message.

🧪 Suggested fix

- # ensure the parse error bubbles up - with pytest.raises(Exception) as excinfo: - utils.extract_text_from_xml(input_file) - - # elementtree ParseError is expected - assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value) + with pytest.raises( + utils.ET.ParseError, + match=r"Failed to parse XML dump '.*malformed\.xml':", + ): + utils.extract_text_from_xml(input_file)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_util.py` around lines 195 - 200, Tighten the test so it only catches the XML parse failure from defusedxml by using the concrete ParseError type instead of Exception and assert the contextual message; update the pytest.raises(...) to expect defusedxml.ElementTree.ParseError (import ParseError at top of tests/test_util.py or reference defusedxml.ElementTree.ParseError) when calling utils.extract_text_from_xml, and ensure the assertion checks that the exception string contains the specific "Failed to parse XML" message produced by extract_text_from_xml.

coderabbitai

♻️ Duplicate comments (1)

tests/test_util.py (1)

195-200: ⚠️ Potential issue | 🟠 Major

Tighten the assertion to catch only the specific parse failure.

Using pytest.raises(Exception) is too broad—the test would pass on any exception (e.g., FileNotFoundError, TypeError), defeating the purpose of verifying the malformed-XML contract. The implementation in utils.py raises ET.ParseError with a guaranteed message prefix.

Use the concrete ParseError type and the match parameter to validate the contextual error message:

🧪 Suggested fix

-    # ensure the parse error bubbles up
-    with pytest.raises(Exception) as excinfo:
-        utils.extract_text_from_xml(input_file)
-
-    # elementtree ParseError is expected
-    assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value)
+    # ensure the parse error bubbles up with context
+    with pytest.raises(
+        utils.ET.ParseError,
+        match=r"Failed to parse XML dump '.*malformed\.xml':",
+    ):
+        utils.extract_text_from_xml(input_file)

,

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 195 - 200, The test currently uses
pytest.raises(Exception) which is too broad; change it to assert that
utils.extract_text_from_xml(input_file) raises the specific
xml.etree.ElementTree.ParseError and validate the message using
pytest.raises(..., match=...) to match the guaranteed "Failed to parse XML"
prefix; import xml.etree.ElementTree as ET in the test (or reference
ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse
XML") around the call to utils.extract_text_from_xml.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/test_util.py`:
- Around line 195-200: The test currently uses pytest.raises(Exception) which is
too broad; change it to assert that utils.extract_text_from_xml(input_file)
raises the specific xml.etree.ElementTree.ParseError and validate the message
using pytest.raises(..., match=...) to match the guaranteed "Failed to parse
XML" prefix; import xml.etree.ElementTree as ET in the test (or reference
ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse
XML") around the call to utils.extract_text_from_xml.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 83186ed0-b3fe-43dc-bca3-58ddd8dcb8d5

📥 Commits

Reviewing files that changed from the base of the PR and between 132592b and 723674a.

📒 Files selected for processing (1)

tests/test_util.py

Handle malformed XML in extract_text_from_xml and add edge case test

132592b

github-actions bot added backend python size/M first-time-contributor pending-coderabbit-review labels Mar 12, 2026

github-actions bot added size/M and removed size/M labels Mar 12, 2026

Apply Ruff formatting

723674a

github-actions bot added size/M and removed size/M labels Mar 12, 2026

coderabbitai bot requested changes Mar 12, 2026

View reviewed changes

github-actions bot added size/M and removed size/M labels Mar 12, 2026

coderabbitai bot reviewed Mar 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Handle malformed XML in extract_text_from_xml and add edge case test#71

Handle malformed XML in extract_text_from_xml and add edge case test#71
tharunkumar4562 wants to merge 2 commits intoAOSSIE-Org:mainfrom
tharunkumar4562:malformed-xml-fix

tharunkumar4562 commented Mar 12, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 12, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 12, 2026

Uh oh!

coderabbitai bot Mar 12, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tharunkumar4562 commented Mar 12, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues:

Screenshots/Recordings:

Additional Notes:

Checklist

⚠️ AI Notice - Important!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tharunkumar4562 commented Mar 12, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 12, 2026 •

edited

Loading