Handle malformed XML in extract_text_from_xml and add edge case test#71
Handle malformed XML in extract_text_from_xml and add edge case test#71tharunkumar4562 wants to merge 2 commits intoAOSSIE-Org:mainfrom
Conversation
WalkthroughAdded XML parse error handling and in-loop element cleanup in extract_text_from_xml; introduced compute_sha256_bytes() returning raw SHA-256 bytes; extended generate_manifest() to include raw_merkle_root, processed_merkle_root, and chunk_size_bytes; added a unit test asserting malformed XML raises a parse error. Changes
Sequence Diagram(s)(omitted) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 225-245: The current parser writes directly to output_path (the
with open(output_path, "w", ...) block) so a late ET.ParseError can leave a
partial or corrupted wiki_clean.txt; instead, write to a temporary file in the
same directory (using tempfile.NamedTemporaryFile or mkstemp) while parsing and
only atomically replace the final output_path with os.replace when parsing
completes successfully; ensure the temp file is opened with encoding="utf-8",
cleaned text is written there (via clean_wikitext), close and sync the temp file
before os.replace, and on exceptions (e.g., ET.ParseError) remove the temp file
and re-raise the error so the original output is preserved.
In `@tests/test_util.py`:
- Around line 195-200: Tighten the test so it only catches the XML parse failure
from defusedxml by using the concrete ParseError type instead of Exception and
assert the contextual message; update the pytest.raises(...) to expect
defusedxml.ElementTree.ParseError (import ParseError at top of
tests/test_util.py or reference defusedxml.ElementTree.ParseError) when calling
utils.extract_text_from_xml, and ensure the assertion checks that the exception
string contains the specific "Failed to parse XML" message produced by
extract_text_from_xml.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: af2b2c77-f0a8-4633-8515-c082d8e119c2
📒 Files selected for processing (2)
openverifiablellm/utils.pytests/test_util.py
| try: | ||
| with open_func(input_path, "rb") as f: | ||
| context = ET.iterparse(f, events=("end",)) | ||
|
|
||
| with open(output_path, "w", encoding="utf-8") as out: | ||
| for _, elem in context: | ||
| if elem.tag.endswith("page"): | ||
| text_elem = elem.find(".//{*}text") | ||
|
|
||
| if text_elem is not None and text_elem.text: | ||
| cleaned = clean_wikitext(text_elem.text) | ||
| if cleaned: | ||
| out.write(cleaned + "\n\n") | ||
|
|
||
| elem.clear() | ||
| except ET.ParseError as e: | ||
| # provide context about which file failed to parse | ||
| msg = f"Failed to parse XML dump '{input_path}': {e}" | ||
| logger.error(msg) | ||
| # re-raise a new ParseError containing context | ||
| raise ET.ParseError(msg) from e |
There was a problem hiding this comment.
Write the processed output atomically.
wiki_clean.txt is opened in place before parsing finishes. A late ParseError will therefore leave an empty/partial processed file behind and can also destroy a previously good output on rerun. That makes a failed preprocessing run look consumable to downstream steps.
🛠️ Suggested fix
open_func = bz2.open if is_bz2 else open
+ tmp_output_path = output_path.with_name(output_path.name + ".tmp")
try:
with open_func(input_path, "rb") as f:
context = ET.iterparse(f, events=("end",))
- with open(output_path, "w", encoding="utf-8") as out:
+ with open(tmp_output_path, "w", encoding="utf-8") as out:
for _, elem in context:
if elem.tag.endswith("page"):
text_elem = elem.find(".//{*}text")
if text_elem is not None and text_elem.text:
@@
elem.clear()
+ tmp_output_path.replace(output_path)
except ET.ParseError as e:
+ if tmp_output_path.exists():
+ tmp_output_path.unlink()
# provide context about which file failed to parse
msg = f"Failed to parse XML dump '{input_path}': {e}"
logger.error(msg)
# re-raise a new ParseError containing context
raise ET.ParseError(msg) from e🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 225 - 245, The current parser writes
directly to output_path (the with open(output_path, "w", ...) block) so a late
ET.ParseError can leave a partial or corrupted wiki_clean.txt; instead, write to
a temporary file in the same directory (using tempfile.NamedTemporaryFile or
mkstemp) while parsing and only atomically replace the final output_path with
os.replace when parsing completes successfully; ensure the temp file is opened
with encoding="utf-8", cleaned text is written there (via clean_wikitext), close
and sync the temp file before os.replace, and on exceptions (e.g.,
ET.ParseError) remove the temp file and re-raise the error so the original
output is preserved.
| # ensure the parse error bubbles up | ||
| with pytest.raises(Exception) as excinfo: | ||
| utils.extract_text_from_xml(input_file) | ||
|
|
||
| # elementtree ParseError is expected | ||
| assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the test file
fd -name "test_util.py" -type fRepository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 302
🏁 Script executed:
# Read the test file around lines 195-200
head -220 tests/test_util.py | tail -50Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1529
🏁 Script executed:
# Also check the utils module to understand the implementation
fd -name "util.py" -type f | grep -v test | head -5Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 302
🏁 Script executed:
find . -name "util.py" -type f | grep -v testRepository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 54
🏁 Script executed:
# Read the utils module to understand extract_text_from_xml implementation
cat tests/test_util.py | head -30Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 771
🏁 Script executed:
# Find the utils module
find . -path "*/utils.py" -o -path "*/util.py" | grep -v test | head -10Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 100
🏁 Script executed:
cat openverifiablellm/utils.pyRepository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 15537
Tighten the assertion to catch only the specific parse failure.
This currently passes on any exception (e.g., FileNotFoundError, TypeError), so it doesn't actually protect the malformed-XML contract from regressions. Require the concrete ParseError from defusedxml.ElementTree and match the contextual error message.
🧪 Suggested fix
- # ensure the parse error bubbles up
- with pytest.raises(Exception) as excinfo:
- utils.extract_text_from_xml(input_file)
-
- # elementtree ParseError is expected
- assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value)
+ with pytest.raises(
+ utils.ET.ParseError,
+ match=r"Failed to parse XML dump '.*malformed\.xml':",
+ ):
+ utils.extract_text_from_xml(input_file)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_util.py` around lines 195 - 200, Tighten the test so it only
catches the XML parse failure from defusedxml by using the concrete ParseError
type instead of Exception and assert the contextual message; update the
pytest.raises(...) to expect defusedxml.ElementTree.ParseError (import
ParseError at top of tests/test_util.py or reference
defusedxml.ElementTree.ParseError) when calling utils.extract_text_from_xml, and
ensure the assertion checks that the exception string contains the specific
"Failed to parse XML" message produced by extract_text_from_xml.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/test_util.py (1)
195-200:⚠️ Potential issue | 🟠 MajorTighten the assertion to catch only the specific parse failure.
Using
pytest.raises(Exception)is too broad—the test would pass on any exception (e.g.,FileNotFoundError,TypeError), defeating the purpose of verifying the malformed-XML contract. The implementation inutils.pyraisesET.ParseErrorwith a guaranteed message prefix.Use the concrete
ParseErrortype and thematchparameter to validate the contextual error message:🧪 Suggested fix
- # ensure the parse error bubbles up - with pytest.raises(Exception) as excinfo: - utils.extract_text_from_xml(input_file) - - # elementtree ParseError is expected - assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value) + # ensure the parse error bubbles up with context + with pytest.raises( + utils.ET.ParseError, + match=r"Failed to parse XML dump '.*malformed\.xml':", + ): + utils.extract_text_from_xml(input_file),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_util.py` around lines 195 - 200, The test currently uses pytest.raises(Exception) which is too broad; change it to assert that utils.extract_text_from_xml(input_file) raises the specific xml.etree.ElementTree.ParseError and validate the message using pytest.raises(..., match=...) to match the guaranteed "Failed to parse XML" prefix; import xml.etree.ElementTree as ET in the test (or reference ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse XML") around the call to utils.extract_text_from_xml.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/test_util.py`:
- Around line 195-200: The test currently uses pytest.raises(Exception) which is
too broad; change it to assert that utils.extract_text_from_xml(input_file)
raises the specific xml.etree.ElementTree.ParseError and validate the message
using pytest.raises(..., match=...) to match the guaranteed "Failed to parse
XML" prefix; import xml.etree.ElementTree as ET in the test (or reference
ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse
XML") around the call to utils.extract_text_from_xml.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 83186ed0-b3fe-43dc-bca3-58ddd8dcb8d5
📒 Files selected for processing (1)
tests/test_util.py
Addressed Issues:
Fixes #48
Screenshots/Recordings:
Before fix: malformed XML raised a low-level ParseError without context.
After fix: the error now includes a clear message indicating which
XML dump failed to parse.
Additional Notes:
While running the test suite locally, I noticed that
extract_text_from_xmldoes not provide useful context when encountering malformed XML.Changes in this PR:
ET.iterparseto catchxml.etree.ElementTree.ParseError.test_extract_text_from_xml_malformedto verify behaviour when malformed XML is encountered.This improves robustness of the XML extraction step and prevents unclear crashes when corrupted Wikipedia dumps are processed.
All tests pass locally.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Bug Fixes
Tests