Skip to content

Handle malformed XML in extract_text_from_xml and add edge case test#71

Open
tharunkumar4562 wants to merge 2 commits intoAOSSIE-Org:mainfrom
tharunkumar4562:malformed-xml-fix
Open

Handle malformed XML in extract_text_from_xml and add edge case test#71
tharunkumar4562 wants to merge 2 commits intoAOSSIE-Org:mainfrom
tharunkumar4562:malformed-xml-fix

Conversation

@tharunkumar4562
Copy link
Contributor

@tharunkumar4562 tharunkumar4562 commented Mar 12, 2026

Addressed Issues:

Fixes #48

Screenshots/Recordings:

Before fix: malformed XML raised a low-level ParseError without context.

image

After fix: the error now includes a clear message indicating which
XML dump failed to parse.

image

Additional Notes:

While running the test suite locally, I noticed that extract_text_from_xml does not provide useful context when encountering malformed XML.

Changes in this PR:

  • Added error handling around ET.iterparse to catch xml.etree.ElementTree.ParseError.
  • Log a descriptive error message including the input file path.
  • Re-raise the error with additional context to improve debugging.
  • Added a new unit test test_extract_text_from_xml_malformed to verify behaviour when malformed XML is encountered.

This improves robustness of the XML extraction step and prevents unclear crashes when corrupted Wikipedia dumps are processed.

All tests pass locally.

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Added hash computation for raw data and file inputs.
    • Manifest generation now includes additional cryptographic fields (raw and processed merkle roots and chunk size).
  • Bug Fixes

    • Improved XML parsing error handling with clearer failure reporting.
    • In-loop memory cleanup during XML processing for better efficiency.
  • Tests

    • Added test coverage for malformed XML error handling.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 12, 2026

Walkthrough

Added XML parse error handling and in-loop element cleanup in extract_text_from_xml; introduced compute_sha256_bytes() returning raw SHA-256 bytes; extended generate_manifest() to include raw_merkle_root, processed_merkle_root, and chunk_size_bytes; added a unit test asserting malformed XML raises a parse error.

Changes

Cohort / File(s) Summary
XML parsing & cleanup
openverifiablellm/utils.py
Wraps XML iteration in try/except to raise/descriptively log parse errors for malformed XML; calls elem.clear() after processing page elements to free memory.
Hashing & manifest
openverifiablellm/utils.py
Added compute_sha256_bytes(*, data, file_path) -> bytes for raw SHA-256 output; generate_manifest() now emits raw_merkle_root, processed_merkle_root, and chunk_size_bytes.
Tests
tests/test_util.py
Added test_extract_text_from_xml_malformed() to assert that malformed XML raises a parse error and contains an appropriate error message.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I nibble bytes and clear each tag,
Parse errors flagged — no silent lag.
Roots hashed crisp, chunks counted right,
Hops and tests keep pipelines bright. 🥕✨

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main changes: adding error handling for malformed XML and a corresponding edge-case test.
Linked Issues check ✅ Passed The PR addresses all requirements from issue #48: adds error handling for malformed XML, includes a descriptive error message with file context, and adds a unit test verifying that malformed XML triggers an exception.
Out of Scope Changes check ✅ Passed All changes are scoped to the PR objectives: error handling improvements in extract_text_from_xml, in-loop memory cleanup, new compute_sha256_bytes helper, and manifest enhancements are all reasonable supporting changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added size/M and removed size/M labels Mar 12, 2026
@github-actions github-actions bot added size/M and removed size/M labels Mar 12, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 225-245: The current parser writes directly to output_path (the
with open(output_path, "w", ...) block) so a late ET.ParseError can leave a
partial or corrupted wiki_clean.txt; instead, write to a temporary file in the
same directory (using tempfile.NamedTemporaryFile or mkstemp) while parsing and
only atomically replace the final output_path with os.replace when parsing
completes successfully; ensure the temp file is opened with encoding="utf-8",
cleaned text is written there (via clean_wikitext), close and sync the temp file
before os.replace, and on exceptions (e.g., ET.ParseError) remove the temp file
and re-raise the error so the original output is preserved.

In `@tests/test_util.py`:
- Around line 195-200: Tighten the test so it only catches the XML parse failure
from defusedxml by using the concrete ParseError type instead of Exception and
assert the contextual message; update the pytest.raises(...) to expect
defusedxml.ElementTree.ParseError (import ParseError at top of
tests/test_util.py or reference defusedxml.ElementTree.ParseError) when calling
utils.extract_text_from_xml, and ensure the assertion checks that the exception
string contains the specific "Failed to parse XML" message produced by
extract_text_from_xml.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: af2b2c77-f0a8-4633-8515-c082d8e119c2

📥 Commits

Reviewing files that changed from the base of the PR and between c352df0 and 132592b.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • tests/test_util.py

Comment on lines +225 to +245
try:
with open_func(input_path, "rb") as f:
context = ET.iterparse(f, events=("end",))

with open(output_path, "w", encoding="utf-8") as out:
for _, elem in context:
if elem.tag.endswith("page"):
text_elem = elem.find(".//{*}text")

if text_elem is not None and text_elem.text:
cleaned = clean_wikitext(text_elem.text)
if cleaned:
out.write(cleaned + "\n\n")

elem.clear()
except ET.ParseError as e:
# provide context about which file failed to parse
msg = f"Failed to parse XML dump '{input_path}': {e}"
logger.error(msg)
# re-raise a new ParseError containing context
raise ET.ParseError(msg) from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Write the processed output atomically.

wiki_clean.txt is opened in place before parsing finishes. A late ParseError will therefore leave an empty/partial processed file behind and can also destroy a previously good output on rerun. That makes a failed preprocessing run look consumable to downstream steps.

🛠️ Suggested fix
     open_func = bz2.open if is_bz2 else open
+    tmp_output_path = output_path.with_name(output_path.name + ".tmp")

     try:
         with open_func(input_path, "rb") as f:
             context = ET.iterparse(f, events=("end",))

-            with open(output_path, "w", encoding="utf-8") as out:
+            with open(tmp_output_path, "w", encoding="utf-8") as out:
                 for _, elem in context:
                     if elem.tag.endswith("page"):
                         text_elem = elem.find(".//{*}text")

                         if text_elem is not None and text_elem.text:
@@

                         elem.clear()
+        tmp_output_path.replace(output_path)
     except ET.ParseError as e:
+        if tmp_output_path.exists():
+            tmp_output_path.unlink()
         # provide context about which file failed to parse
         msg = f"Failed to parse XML dump '{input_path}': {e}"
         logger.error(msg)
         # re-raise a new ParseError containing context
         raise ET.ParseError(msg) from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 225 - 245, The current parser writes
directly to output_path (the with open(output_path, "w", ...) block) so a late
ET.ParseError can leave a partial or corrupted wiki_clean.txt; instead, write to
a temporary file in the same directory (using tempfile.NamedTemporaryFile or
mkstemp) while parsing and only atomically replace the final output_path with
os.replace when parsing completes successfully; ensure the temp file is opened
with encoding="utf-8", cleaned text is written there (via clean_wikitext), close
and sync the temp file before os.replace, and on exceptions (e.g.,
ET.ParseError) remove the temp file and re-raise the error so the original
output is preserved.

Comment on lines +195 to +200
# ensure the parse error bubbles up
with pytest.raises(Exception) as excinfo:
utils.extract_text_from_xml(input_file)

# elementtree ParseError is expected
assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and read the test file
fd -name "test_util.py" -type f

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 302


🏁 Script executed:

# Read the test file around lines 195-200
head -220 tests/test_util.py | tail -50

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1529


🏁 Script executed:

# Also check the utils module to understand the implementation
fd -name "util.py" -type f | grep -v test | head -5

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 302


🏁 Script executed:

find . -name "util.py" -type f | grep -v test

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 54


🏁 Script executed:

# Read the utils module to understand extract_text_from_xml implementation
cat tests/test_util.py | head -30

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 771


🏁 Script executed:

# Find the utils module
find . -path "*/utils.py" -o -path "*/util.py" | grep -v test | head -10

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 100


🏁 Script executed:

cat openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 15537


Tighten the assertion to catch only the specific parse failure.

This currently passes on any exception (e.g., FileNotFoundError, TypeError), so it doesn't actually protect the malformed-XML contract from regressions. Require the concrete ParseError from defusedxml.ElementTree and match the contextual error message.

🧪 Suggested fix
-    # ensure the parse error bubbles up
-    with pytest.raises(Exception) as excinfo:
-        utils.extract_text_from_xml(input_file)
-
-    # elementtree ParseError is expected
-    assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value)
+    with pytest.raises(
+        utils.ET.ParseError,
+        match=r"Failed to parse XML dump '.*malformed\.xml':",
+    ):
+        utils.extract_text_from_xml(input_file)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 195 - 200, Tighten the test so it only
catches the XML parse failure from defusedxml by using the concrete ParseError
type instead of Exception and assert the contextual message; update the
pytest.raises(...) to expect defusedxml.ElementTree.ParseError (import
ParseError at top of tests/test_util.py or reference
defusedxml.ElementTree.ParseError) when calling utils.extract_text_from_xml, and
ensure the assertion checks that the exception string contains the specific
"Failed to parse XML" message produced by extract_text_from_xml.

@github-actions github-actions bot added size/M and removed size/M labels Mar 12, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/test_util.py (1)

195-200: ⚠️ Potential issue | 🟠 Major

Tighten the assertion to catch only the specific parse failure.

Using pytest.raises(Exception) is too broad—the test would pass on any exception (e.g., FileNotFoundError, TypeError), defeating the purpose of verifying the malformed-XML contract. The implementation in utils.py raises ET.ParseError with a guaranteed message prefix.

Use the concrete ParseError type and the match parameter to validate the contextual error message:

🧪 Suggested fix
-    # ensure the parse error bubbles up
-    with pytest.raises(Exception) as excinfo:
-        utils.extract_text_from_xml(input_file)
-
-    # elementtree ParseError is expected
-    assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value)
+    # ensure the parse error bubbles up with context
+    with pytest.raises(
+        utils.ET.ParseError,
+        match=r"Failed to parse XML dump '.*malformed\.xml':",
+    ):
+        utils.extract_text_from_xml(input_file)

,

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 195 - 200, The test currently uses
pytest.raises(Exception) which is too broad; change it to assert that
utils.extract_text_from_xml(input_file) raises the specific
xml.etree.ElementTree.ParseError and validate the message using
pytest.raises(..., match=...) to match the guaranteed "Failed to parse XML"
prefix; import xml.etree.ElementTree as ET in the test (or reference
ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse
XML") around the call to utils.extract_text_from_xml.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/test_util.py`:
- Around line 195-200: The test currently uses pytest.raises(Exception) which is
too broad; change it to assert that utils.extract_text_from_xml(input_file)
raises the specific xml.etree.ElementTree.ParseError and validate the message
using pytest.raises(..., match=...) to match the guaranteed "Failed to parse
XML" prefix; import xml.etree.ElementTree as ET in the test (or reference
ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse
XML") around the call to utils.extract_text_from_xml.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 83186ed0-b3fe-43dc-bca3-58ddd8dcb8d5

📥 Commits

Reviewing files that changed from the base of the PR and between 132592b and 723674a.

📒 Files selected for processing (1)
  • tests/test_util.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] : Add edge case test for extract_text_from_xml to handle malformed XML

1 participant