Add malformed XML test and fix verify.py regression by Shubhamx404 · Pull Request #49 · AOSSIE-Org/OpenVerifiableLLM

Shubhamx404 · 2026-03-08T07:03:13Z

Addressed Issues:

Fixes #48

Summary

Added a missing edge case test for the preprocessing pipeline to ensure that
extract_text_from_xml fails fast when it encounters malformed XML. In the process, a regression in the verification test suite (
openverifiablellm/verify.py) was also fixed to allow all pipeline checks to pass successfully.

Problem

Currently, there are no tests enforcing the failure state for corrupted XML inputs. If a anyone accidentally suppresses the ET.ParseError during a refactor, it will silently pass also , feeding corrupted data into the verify.py pipeline causes a native Python crash, rather than gracefully reporting a pipeline failure.
he pipeline would silently ignore corrupted files and produce incomplete outputs

Problem in `verify.py`

when verify.py was used against manifest or when the pipeline hit corrupted data, the script crashed with UnboundLocalError instead of produce Verification report.

Solution

Created test_extract_text_from_xml_malformed_xmlthis genrate a broken XML file and feeds it to the pipeline while strictly asserting pytest.raises(ET.ParseError).
Fixed verification crash for file verify.py

Screenshots/Recordings:

here is demonstration of implemention

here pipeline first ran on the valid Wikipedia dataset, it created dataset_manifest.json in the data/ folder. The Expected values are read from this manifest, while the Actual values are computed in real time from the file currently passed to the script (e.g., invalid_wikipedia_dump.xml)

Checklist

[ * ] My code follows the project's code style and conventions
[ * ] I have made corresponding changes to the documentation
[ * ] My changes generate no new warnings or errors
[ * ] I have joined the Discord server and I will share a link to this PR with the project maintainers there
[ * ] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

Bug Fixes
- Verification now explicitly reports the chunk-size check as skipped for older manifests that lack that field, avoiding false failures and preserving pass/skip semantics.
Tests
- Added a unit test ensuring malformed XML raises a parse error during extraction.
- Updated verification tests to expect the legacy chunk-size check to be skipped with an explanatory note.

coderabbitai · 2026-03-08T07:03:31Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Make manifest_chunk_size_bytes comparison conditional in verification (skip with note when absent in legacy manifests) and add a unit test asserting utils.extract_text_from_xml raises ET.ParseError for malformed XML; update verification test expectations for legacy manifests.

Changes

Cohort / File(s)	Summary
Manifest chunk-size check `openverifiablellm/verify.py`	Change unconditional `manifest_chunk_size_bytes` comparison to run only when `chunk_size_bytes` exists in the manifest; add SKIP result with explanatory detail when the field is absent.
XML parse error test `tests/test_util.py`	Add unit test that writes malformed XML and asserts `defusedxml.ElementTree.ParseError` is raised by `utils.extract_text_from_xml`.
Verification tests adjusted `tests/test_verify.py`	Update legacy-manifest test to expect a SKIP entry for `manifest_chunk_size_bytes` with detail noting the field is absent in older manifests.

Sequence Diagram(s)

(Skipped — changes are small verification logic + unit test; no multi-component control flow to visualize.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Implement Manifest Chain Tamper Detection for cryptographically linked dataset manifests. #53: Modifies openverifiablellm/verify.py and adjusts manifest field handling; closely related to the manifest_chunk_size_bytes skip logic.
Deterministic Preprocessing Verification Mode that re-runs dataset preprocessing and validates that all generated artifacts #25: Earlier adjustments to manifest field comparison and skipping logic for legacy manifests.

Suggested labels

Python Lang

Suggested reviewers

Archit381

Poem

"🐰 I hopped through manifests, light and quick,
I skip what is missing and leave a small tick,
When XML crumbles and parsers fret,
My test will catch the broken set,
A nibble, a hop — then off I flit!"

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the two main changes: adding a malformed XML test and fixing a regression in verify.py.
Linked Issues check	✅ Passed	The PR successfully addresses issue `#48` by adding test_extract_text_from_xml_malformed_xml to verify ParseError is raised, and fixes the verify.py regression affecting manifest field comparison.
Out of Scope Changes check	✅ Passed	All changes are directly related to the linked issue: malformed XML test addition and verify.py regression fix for manifest field handling.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 413-419: The manifest check for "chunk_size_bytes" should mirror
the behavior used for other optional fields: if "chunk_size_bytes" is present
call _check_field as currently done; otherwise add a CheckStatus.SKIP entry to
the report (use the same report key "manifest_chunk_size_bytes" and a brief
detail like "Merkle chunk size not present in manifest") so the absence is
explicitly recorded; update the code around the _check_field call that
references manifest["chunk_size_bytes"],
reproduced_manifest.get("chunk_size_bytes"), and the report to include this SKIP
branch.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 291635d0-780c-4aa3-a6ed-a177b1b9e64b

📥 Commits

Reviewing files that changed from the base of the PR and between c258e31 and d741214.

📒 Files selected for processing (2)

openverifiablellm/verify.py
tests/test_util.py

openverifiablellm/verify.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 420-425: The new SKIP branch for manifest_chunk_size_bytes is
untested; update the existing test that checks
raw_merkle_root/processed_merkle_root to also assert that report contains a
CheckResult with name "manifest_chunk_size_bytes", status CheckStatus.SKIP, and
detail "Field absent from manifest (older version)". Locate where the test
inspects the report (the test calling verify and examining report.results) and
add an assertion that at least one entry matches CheckResult.name ==
"manifest_chunk_size_bytes" and CheckResult.status == CheckStatus.SKIP and
CheckResult.detail contains "Field absent from manifest (older version)" so
regressions on this legacy path are caught.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4eab8d5c-5c9b-4641-82c9-24be9e520269

📥 Commits

Reviewing files that changed from the base of the PR and between d741214 and e923a03.

📒 Files selected for processing (1)

openverifiablellm/verify.py

Shubhamx404 · 2026-03-08T09:04:38Z

hey @Archit381 can you please review this pr

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 605-610: The report currently gets a duplicate SKIP for the same
check name "manifest_chunk_size_bytes" when raw_merkle_root exists but
chunk_size_bytes is absent; locate the block that calls
report.add(CheckResult(... name="manifest_chunk_size_bytes"
status=CheckStatus.SKIP ...)) following the raw_merkle_root branch and remove
that redundant else branch (or change its CheckResult name if it must remain
distinct), so only the earlier SKIP (the one that mentions the default) is
emitted; ensure the raw_merkle_root handling no longer adds a second
"manifest_chunk_size_bytes" entry.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e900fd5c-84f9-44be-ab1e-a971c3f01e00

📥 Commits

Reviewing files that changed from the base of the PR and between 573ad01 and b8b81af.

📒 Files selected for processing (3)

openverifiablellm/verify.py
tests/test_util.py
tests/test_verify.py

coderabbitai · 2026-03-17T20:36:22Z

openverifiablellm/verify.py

+            else:
+                report.add(CheckResult(
+                    name="manifest_chunk_size_bytes",
+                    status=CheckStatus.SKIP,
+                    detail="Field absent from manifest (older version)",
+                ))


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check if there are any other instances of duplicate check names in verify.py tests rg -n "manifest_chunk_size_bytes" --type py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 448

🏁 Script executed:

# Get context around the first occurrence (line 387) sed -n '375,400p' openverifiablellm/verify.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1077

🏁 Script executed:

# Get context around the second occurrence (line 608) sed -n '595,615p' openverifiablellm/verify.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 837

🏁 Script executed:

# Also check the test file to understand expected behavior sed -n '365,380p' tests/test_verify.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 881

Remove the duplicate SKIP entry for manifest_chunk_size_bytes.

When raw_merkle_root is present in the manifest but chunk_size_bytes is absent, this else block at lines 605-610 will add a second SKIP entry for manifest_chunk_size_bytes. The first is already added at lines 387-391. Both SKIPs use the same check name but with different detail messages, resulting in duplicate entries in the report.

Consolidate the logic to avoid duplicate entries. Since the first SKIP at lines 387-391 already handles the case of missing chunk_size_bytes (with a more informative message that mentions the default), remove the second else block or rename the second check to distinguish it from the first.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/verify.py` around lines 605 - 610, The report currently gets a duplicate SKIP for the same check name "manifest_chunk_size_bytes" when raw_merkle_root exists but chunk_size_bytes is absent; locate the block that calls report.add(CheckResult(... name="manifest_chunk_size_bytes" status=CheckStatus.SKIP ...)) following the raw_merkle_root branch and remove that redundant else branch (or change its CheckResult name if it must remain distinct), so only the earlier SKIP (the one that mentions the default) is emitted; ensure the raw_merkle_root handling no longer adds a second "manifest_chunk_size_bytes" entry.

Shubhamx404 · 2026-03-17T20:36:29Z

hii @Archit381 resolved merge conflict pls look

Add malformed XML edge case test and fix verify.py regression

d741214

github-actions bot added backend python size/S external-contributor pending-coderabbit-review labels Mar 8, 2026

github-actions bot added size/S and removed size/S labels Mar 8, 2026

coderabbitai bot requested changes Mar 8, 2026

View reviewed changes

openverifiablellm/verify.py Show resolved Hide resolved

Update openverifiablellm/verify.py

e923a03

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

github-actions bot added size/S and removed size/S labels Mar 8, 2026

coderabbitai bot requested changes Mar 8, 2026

View reviewed changes

coderabbitai bot approved these changes Mar 8, 2026

View reviewed changes

fixed coderabbit issue

0d0626a

github-actions bot added size/S and removed size/S labels Mar 8, 2026

Merge branch 'main' into fix-malformed-xml-test

573ad01

github-actions bot added size/S and removed size/S labels Mar 9, 2026

Archit381 mentioned this pull request Mar 17, 2026

Handle malformed XML in extract_text_from_xml and add edge case test #71

Closed

5 tasks

Merge main into fix-malformed-xml-test and resolve conflicts

b8b81af

github-actions bot added size/S and removed size/S labels Mar 17, 2026

style: run formatter to fix CI lint issue

c1a4ef7

github-actions bot added size/S and removed size/S labels Mar 17, 2026

coderabbitai bot requested changes Mar 17, 2026

View reviewed changes

Archit381 merged commit 578bc79 into AOSSIE-Org:main Mar 17, 2026
4 of 5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add malformed XML test and fix verify.py regression#49

Add malformed XML test and fix verify.py regression#49
Archit381 merged 6 commits intoAOSSIE-Org:mainfrom
Shubhamx404:fix-malformed-xml-test

Shubhamx404 commented Mar 8, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 8, 2026 •

edited

Loading

Reviews paused

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Shubhamx404 commented Mar 8, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 17, 2026

Uh oh!

Shubhamx404 commented Mar 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Shubhamx404 commented Mar 8, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues:

Summary

Problem

Problem in verify.py

Solution

Screenshots/Recordings:

Checklist

⚠️ AI Notice - Important!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Shubhamx404 commented Mar 8, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Shubhamx404 commented Mar 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Shubhamx404 commented Mar 8, 2026 •

edited by coderabbitai bot

Loading

Problem in `verify.py`

coderabbitai bot commented Mar 8, 2026 •

edited

Loading