Handle malformed XML in extract_text_from_xml and add edge case test#67
Handle malformed XML in extract_text_from_xml and add edge case test#67tharunkumar4562 wants to merge 17 commits intoAOSSIE-Org:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthroughextract_text_from_xml now writes to a temporary file and parses atomically with try/except that cleans temp artifacts and re-raises contextual Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan for PR comments
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 192-212: The code currently opens output_path before the XML parse
completes, leaving a truncated wiki_clean.txt if parsing fails; modify the logic
in the block using open_func, ET.iterparse, and clean_wikitext to write to a
temporary file (e.g., via tempfile.NamedTemporaryFile or a temp path) and only
atomically replace output_path with os.replace(temp_path, output_path) after the
parse loop finishes successfully, ensuring any ET.ParseError logs the input_path
(using logger.error) and the temp file is removed on exceptions; keep
ET.ParseError handling but do not write to the final output until the atomic
rename succeeds.
In `@tests/test_util.py`:
- Around line 156-174: The test should assert the specific exception type thrown
by utils.extract_text_from_xml instead of any Exception: change the
pytest.raises to expect xml.etree.ElementTree.ParseError (ET.ParseError) and
verify the exception message includes the input file path (input_file) or the
"Failed to parse XML" prefix to ensure the implementation's contextual error is
preserved; target the test function test_extract_text_from_xml_malformed and the
utils.extract_text_from_xml call when updating the assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5e43f95c-39b7-404f-b9cb-6c09abc7b12f
📒 Files selected for processing (2)
openverifiablellm/utils.pytests/test_util.py
… output on XML parse failure
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 217-222: The cleanup block currently swallows all exceptions from
temp_output_path.unlink(); change the broad except to only catch OSError (e.g.
except OSError as e:) and emit a warning or logger.warn including the
temp_output_path and exception details so the leftover .tmp is diagnosable,
while still not suppressing the original parse error (do not swallow other
exception types).
- Around line 223-224: The current re-raise creates a fresh ET.ParseError losing
parser metadata; instead, preserve the original exception's structured fields by
creating a new ET.ParseError with your contextual message and then copy
e.position and any other metadata (e.g., e.code) from the caught exception
before raising it (use the caught variable name e and ET.ParseError and include
msg as the new message). Ensure you copy attributes defensively (use getattr(e,
"position", None) and getattr(e, "code", None)) so downstream handlers still see
the original position/code on the raised exception.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: cc2aa65f-4f7f-4d36-8880-1cbd10195694
📒 Files selected for processing (1)
openverifiablellm/utils.py
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 212-215: The except block catching ET.ParseError should log the
full traceback so failures are diagnosable: replace the current
logger.error(msg) call in the ET.ParseError handler (the variables input_path
and msg are built there) with logger.exception(msg) or logger.error(msg,
exc_info=True) so the original stack/exception context is preserved when parsing
fails in that code path.
- Around line 235-236: The current code logs "Preprocessing complete" before
calling generate_manifest, which can produce misleading success logs if manifest
generation fails; move the logger.info call so it occurs after
generate_manifest(input_path, output_path) completes (or split into two logs:
one for extraction completion and another for successful manifest creation).
Specifically, update the sequence around generate_manifest, logger.info, and the
variables input_path/output_path so that logger.info("Preprocessing complete.
Output saved to %s", output_path) executes only after
generate_manifest(input_path, output_path) returns successfully.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9202e810-90b2-4631-948e-248db2253280
📒 Files selected for processing (1)
openverifiablellm/utils.py
Addressed Issues:
Fixes #48
Screenshots/Recordings:
Before fix: malformed XML raised a low-level ParseError without context.
After fix: the error now includes a clear message indicating which
XML dump failed to parse.
Additional Notes:
While running the test suite locally, I noticed that
extract_text_from_xmldoes not provide useful context when encountering malformed XML.Changes in this PR:
ET.iterparseto catchxml.etree.ElementTree.ParseError.test_extract_text_from_xml_malformedto verify behaviour when malformed XML is encountered.This improves robustness of the XML extraction step and prevents unclear crashes when corrupted Wikipedia dumps are processed.
All tests pass locally.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
Bug Fixes
Public API
Tests