Skip to content

Add malformed XML test and fix verify.py regression#49

Merged
Archit381 merged 6 commits intoAOSSIE-Org:mainfrom
Shubhamx404:fix-malformed-xml-test
Mar 17, 2026
Merged

Add malformed XML test and fix verify.py regression#49
Archit381 merged 6 commits intoAOSSIE-Org:mainfrom
Shubhamx404:fix-malformed-xml-test

Conversation

@Shubhamx404
Copy link
Contributor

@Shubhamx404 Shubhamx404 commented Mar 8, 2026

Addressed Issues:

Fixes #48

Summary

Added a missing edge case test for the preprocessing pipeline to ensure that
extract_text_from_xml fails fast when it encounters malformed XML. In the process, a regression in the verification test suite (
openverifiablellm/verify.py) was also fixed to allow all pipeline checks to pass successfully.

Problem

Currently, there are no tests enforcing the failure state for corrupted XML inputs. If a anyone accidentally suppresses the ET.ParseError during a refactor, it will silently pass also , feeding corrupted data into the verify.py pipeline causes a native Python crash, rather than gracefully reporting a pipeline failure.
he pipeline would silently ignore corrupted files and produce incomplete outputs

Problem in verify.py

when verify.py was used against manifest or when the pipeline hit corrupted data, the script crashed with UnboundLocalError instead of produce Verification report.

Solution

  • Created test_extract_text_from_xml_malformed_xmlthis genrate a broken XML file and feeds it to the pipeline while strictly asserting pytest.raises(ET.ParseError).

  • Fixed verification crash for file verify.py

Screenshots/Recordings:

  • here is demonstration of implemention
image
  • here pipeline first ran on the valid Wikipedia dataset, it created dataset_manifest.json in the data/ folder. The Expected values are read from this manifest, while the Actual values are computed in real time from the file currently passed to the script (e.g., invalid_wikipedia_dump.xml)

Checklist

  • [ * ] My code follows the project's code style and conventions
  • [ * ] I have made corresponding changes to the documentation
  • [ * ] My changes generate no new warnings or errors
  • [ * ] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [ * ] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • Bug Fixes

    • Verification now explicitly reports the chunk-size check as skipped for older manifests that lack that field, avoiding false failures and preserving pass/skip semantics.
  • Tests

    • Added a unit test ensuring malformed XML raises a parse error during extraction.
    • Updated verification tests to expect the legacy chunk-size check to be skipped with an explanatory note.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Make manifest_chunk_size_bytes comparison conditional in verification (skip with note when absent in legacy manifests) and add a unit test asserting utils.extract_text_from_xml raises ET.ParseError for malformed XML; update verification test expectations for legacy manifests.

Changes

Cohort / File(s) Summary
Manifest chunk-size check
openverifiablellm/verify.py
Change unconditional manifest_chunk_size_bytes comparison to run only when chunk_size_bytes exists in the manifest; add SKIP result with explanatory detail when the field is absent.
XML parse error test
tests/test_util.py
Add unit test that writes malformed XML and asserts defusedxml.ElementTree.ParseError is raised by utils.extract_text_from_xml.
Verification tests adjusted
tests/test_verify.py
Update legacy-manifest test to expect a SKIP entry for manifest_chunk_size_bytes with detail noting the field is absent in older manifests.

Sequence Diagram(s)

(Skipped — changes are small verification logic + unit test; no multi-component control flow to visualize.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

"🐰 I hopped through manifests, light and quick,
I skip what is missing and leave a small tick,
When XML crumbles and parsers fret,
My test will catch the broken set,
A nibble, a hop — then off I flit!"

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the two main changes: adding a malformed XML test and fixing a regression in verify.py.
Linked Issues check ✅ Passed The PR successfully addresses issue #48 by adding test_extract_text_from_xml_malformed_xml to verify ParseError is raised, and fixes the verify.py regression affecting manifest field comparison.
Out of Scope Changes check ✅ Passed All changes are directly related to the linked issue: malformed XML test addition and verify.py regression fix for manifest field handling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added size/S and removed size/S labels Mar 8, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 413-419: The manifest check for "chunk_size_bytes" should mirror
the behavior used for other optional fields: if "chunk_size_bytes" is present
call _check_field as currently done; otherwise add a CheckStatus.SKIP entry to
the report (use the same report key "manifest_chunk_size_bytes" and a brief
detail like "Merkle chunk size not present in manifest") so the absence is
explicitly recorded; update the code around the _check_field call that
references manifest["chunk_size_bytes"],
reproduced_manifest.get("chunk_size_bytes"), and the report to include this SKIP
branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 291635d0-780c-4aa3-a6ed-a177b1b9e64b

📥 Commits

Reviewing files that changed from the base of the PR and between c258e31 and d741214.

📒 Files selected for processing (2)
  • openverifiablellm/verify.py
  • tests/test_util.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions bot added size/S and removed size/S labels Mar 8, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 420-425: The new SKIP branch for manifest_chunk_size_bytes is
untested; update the existing test that checks
raw_merkle_root/processed_merkle_root to also assert that report contains a
CheckResult with name "manifest_chunk_size_bytes", status CheckStatus.SKIP, and
detail "Field absent from manifest (older version)". Locate where the test
inspects the report (the test calling verify and examining report.results) and
add an assertion that at least one entry matches CheckResult.name ==
"manifest_chunk_size_bytes" and CheckResult.status == CheckStatus.SKIP and
CheckResult.detail contains "Field absent from manifest (older version)" so
regressions on this legacy path are caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4eab8d5c-5c9b-4641-82c9-24be9e520269

📥 Commits

Reviewing files that changed from the base of the PR and between d741214 and e923a03.

📒 Files selected for processing (1)
  • openverifiablellm/verify.py

@Shubhamx404
Copy link
Contributor Author

hey @Archit381 can you please review this pr

@github-actions github-actions bot added size/S and removed size/S labels Mar 8, 2026
@github-actions github-actions bot added size/S and removed size/S labels Mar 17, 2026
@github-actions github-actions bot added size/S and removed size/S labels Mar 17, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 605-610: The report currently gets a duplicate SKIP for the same
check name "manifest_chunk_size_bytes" when raw_merkle_root exists but
chunk_size_bytes is absent; locate the block that calls
report.add(CheckResult(... name="manifest_chunk_size_bytes"
status=CheckStatus.SKIP ...)) following the raw_merkle_root branch and remove
that redundant else branch (or change its CheckResult name if it must remain
distinct), so only the earlier SKIP (the one that mentions the default) is
emitted; ensure the raw_merkle_root handling no longer adds a second
"manifest_chunk_size_bytes" entry.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e900fd5c-84f9-44be-ab1e-a971c3f01e00

📥 Commits

Reviewing files that changed from the base of the PR and between 573ad01 and b8b81af.

📒 Files selected for processing (3)
  • openverifiablellm/verify.py
  • tests/test_util.py
  • tests/test_verify.py

Comment on lines +605 to +610
else:
report.add(CheckResult(
name="manifest_chunk_size_bytes",
status=CheckStatus.SKIP,
detail="Field absent from manifest (older version)",
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there are any other instances of duplicate check names in verify.py tests
rg -n "manifest_chunk_size_bytes" --type py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 448


🏁 Script executed:

# Get context around the first occurrence (line 387)
sed -n '375,400p' openverifiablellm/verify.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1077


🏁 Script executed:

# Get context around the second occurrence (line 608)
sed -n '595,615p' openverifiablellm/verify.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 837


🏁 Script executed:

# Also check the test file to understand expected behavior
sed -n '365,380p' tests/test_verify.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 881


Remove the duplicate SKIP entry for manifest_chunk_size_bytes.

When raw_merkle_root is present in the manifest but chunk_size_bytes is absent, this else block at lines 605-610 will add a second SKIP entry for manifest_chunk_size_bytes. The first is already added at lines 387-391. Both SKIPs use the same check name but with different detail messages, resulting in duplicate entries in the report.

Consolidate the logic to avoid duplicate entries. Since the first SKIP at lines 387-391 already handles the case of missing chunk_size_bytes (with a more informative message that mentions the default), remove the second else block or rename the second check to distinguish it from the first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/verify.py` around lines 605 - 610, The report currently
gets a duplicate SKIP for the same check name "manifest_chunk_size_bytes" when
raw_merkle_root exists but chunk_size_bytes is absent; locate the block that
calls report.add(CheckResult(... name="manifest_chunk_size_bytes"
status=CheckStatus.SKIP ...)) following the raw_merkle_root branch and remove
that redundant else branch (or change its CheckResult name if it must remain
distinct), so only the earlier SKIP (the one that mentions the default) is
emitted; ensure the raw_merkle_root handling no longer adds a second
"manifest_chunk_size_bytes" entry.

@Shubhamx404
Copy link
Contributor Author

hii @Archit381 resolved merge conflict pls look
image

@Archit381 Archit381 merged commit 578bc79 into AOSSIE-Org:main Mar 17, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] : Add edge case test for extract_text_from_xml to handle malformed XML

2 participants