feat: add streaming parser and incremental Merkle tree by Muneerali199 · Pull Request #39 · AOSSIE-Org/OpenVerifiableLLM

Muneerali199 · 2026-03-06T09:47:33Z

What this PR does

Adds streaming XML parser (O(1) memory via iterparse + elem.clear())
Adds IncrementalMerkleTree with Merkle Frontier (O(log N) time/space)
Adds benchmark.py to reproduce the Before/After comparison
Adds 50 passing unit tests

How to run the benchmark yourself

wget https://dumps.wikimedia.org/simplewiki/20260201/simplewiki-20260201-pages-articles-multistream.xml.bz2
python -m openverifiablellm.benchmark simplewiki-20260201-pages-articles-multistream.xml.bz2

Test results

pytest tests/test_merkle.py -v # 50/50 passed

Fix #31

Summary by CodeRabbit

New Features
- Added benchmarking tool to compare performance of different Wikipedia XML processing approaches
- Introduced incremental merkle tree implementation for memory-efficient append-only operations
- Extended utilities with streaming modes for processing large files with minimal memory overhead
Tests
- Added comprehensive test suite for merkle tree operations and XML integration

Archit381 · 2026-03-07T09:34:15Z

Pls add the analysis requested in the issue

Archit381 · 2026-03-09T08:35:04Z

Can you update the functions inside the current utils.py file to support streaming.

Also add benchmark.py in the scripts folder

- Add four streaming functions to utils.py: stream_chunks (yields raw byte chunks from a file), stream_sha256 (yields rolling hash pairs), compute_merkle_root_streaming (builds Merkle root from any bytes iterable), and stream_extract_text_from_xml (O(1)-memory generator over Wikipedia XML dumps) - Add scripts/benchmark.py as a thin CLI launcher that delegates to openverifiablellm.benchmark.main, keeping the scripts/ directory as plain entry-points with no duplicated logic

coderabbitai · 2026-03-09T08:41:36Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Introduces streaming data pipeline support with an incremental Merkle tree implementation enabling O(1) memory processing of large datasets. Adds IncrementalMerkleTree class, extends utility functions to support streaming modes, implements comprehensive benchmarking infrastructure, and provides test coverage validating streaming-vs-batch parity.

Changes

Cohort / File(s)	Summary
Incremental Merkle Tree Core `openverifiablellm/incremental_merkle.py`	New module implementing append-only Merkle tree using Merkle Frontier approach. Maintains frontier dictionary mapping depth to node hashes, supports O(log N) leaf appends with O(log N) space, and non-destructive root computation via frontier folding.
Utilities Enhancement `openverifiablellm/utils.py`	Extended compute_merkle_root, generate_merkle_proof, extract_text_from_xml, and compute_sha256 to support streaming modes. Added mutual exclusivity between file-path and chunks parameters. extract_text_from_xml now yields cleaned article texts via generator. compute_sha256 returns Union of string or generator for streaming. Added _sha256_hex helper and Generator/Iterable type imports.
Benchmarking Infrastructure `openverifiablellm/benchmark.py`, `scripts/benchmark.py`	New benchmark module comparing in-memory legacy vs streaming incremental Merkle tree approaches on large Wikipedia XML dumps. Measures wall-time, peak RAM, root hashes, and article counts via isolated subprocess trials. Renders results in terminal and Markdown table formats. CLI wrapper script delegates to benchmark.main.
Test Coverage `tests/test_merkle.py`	Comprehensive PyTest suite with 460 lines validating IncrementalMerkleTree against batch Merkle root reference implementation. Tests correctness across varied leaf counts, empty-tree behavior, frontier size invariants, determinism, hash formatting, and XML streaming integration. Includes parametric coverage and reference batch_merkle_root implementation.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant XMLStream as XML Stream<br/>(generator)
    participant IMT as IncrementalMerkleTree
    participant Frontier as Merkle Frontier<br/>(Dict)

    Client->>XMLStream: extract_text_from_xml(stream=True)
    
    loop for each article in stream
        XMLStream->>XMLStream: parse & clean text
        XMLStream-->>Client: yield cleaned_text
        Client->>IMT: append_leaf(cleaned_text)
        IMT->>IMT: hash_text → leaf_bytes
        IMT->>Frontier: check depth 0
        alt depth empty
            Frontier->>Frontier: store at depth 0
        else depth occupied
            IMT->>IMT: combine(stored, new) → parent
            IMT->>Frontier: remove depth, check depth+1
            Note over IMT,Frontier: Repeat until empty slot
        end
        IMT->>IMT: increment leaf_count
    end
    
    Client->>IMT: get_root_hash()
    IMT->>Frontier: fold from lowest to highest
    Note over IMT,Frontier: Promote partial subtrees<br/>via self-combination
    IMT-->>Client: return hex_root

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

PR #25 — Modifies extract_text_from_xml and compute_sha256 with streaming/IO behavior changes directly affecting these utility signatures.
PR #33 — Updates Merkle hashing utilities in utils.py for SHA-256 digest handling and root/proof computation aligned with benchmark changes.
PR #17 — Modifies compute_sha256 and extract_text_from_xml signatures and behavior overlapping with streaming support implementation.

Suggested labels

Python Lang

Suggested reviewers

Archit381

Poem

🐰 A frontier so fine, no tree need to grow,
Just hashes that flow where the bytes gently go,
Stream on and append with logarithmic grace,
One megabyte stands where gigabytes raced! ✨📊

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	Title clearly summarizes the main changes: adding a streaming parser and an incremental Merkle tree implementation.
Linked Issues check	✅ Passed	All requirements from issue `#31` are addressed: streaming XML parser, incremental Merkle tree with Merkle Frontier, updated hashing utilities, benchmark comparison tool, and comprehensive unit tests confirming implementation parity.
Out of Scope Changes check	✅ Passed	All changes directly support the PR objectives and issue `#31` requirements; no extraneous modifications detected outside the streaming pipeline and Merkle tree scope.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan for PR comments

Generate coding plan

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…ronment fingerprint import

Muneerali199 · 2026-03-09T08:50:23Z

sir @Archit381, Done — updated utils.py with streaming functions and added scripts/benchmark.py, now merged with main and ready for merge.

Archit381 · 2026-03-09T08:51:53Z

@Muneerali199 Make sure to unstage the old files.

Also, you have just added the functions in the util. Can you update the existing functions to support streaming instead of adding new functions for the same task

…ha256_hex helper; remove streaming_utils.py - All internal Merkle tree callers (compute_merkle_root, generate_merkle_proof, verify_merkle_proof) now use the private _sha256_hex() helper instead of the public compute_sha256() API, eliminating Union return-type checker errors. - Removed unused 'cast' from typing imports. - Removed stray comment left over from an earlier edit. - Deleted openverifiablellm/streaming_utils.py: its streaming logic is fully covered by extract_text_from_xml(stream=True) in utils.py, per mentor feedback to update existing functions rather than adding standalone streaming functions.

Muneerali199 · 2026-03-09T09:10:02Z

sir @Archit381 ,Done! Updated the existing functions in utils.py to support streaming directly — compute_sha256 now accepts stream=True, compute_merkle_root and generate_merkle_proof accept a chunks= iterable, and extract_text_from_xml accepts stream=True — no separate streaming functions added. Also removed streaming_utils.py from the PR. All 119 tests passing

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/benchmark.py`:
- Around line 39-45: Replace the unsafe xml.etree.ElementTree import in
benchmark.py with defusedxml.ElementTree so all XML parsing uses the safe
parser; specifically change the top-level import that assigns ET and update
usages in _run_old_way (the call(s) to ET.iterparse) to use the defused parser
(ET) so the function IncrementalMerkleTree/stream_text_from_xml callers remain
unchanged while preventing XML parser attacks.
- Around line 44-45: The import of stream_text_from_xml from
openverifiablellm.streaming_utils is broken because streaming_utils was removed;
update the import to use extract_text_from_xml from openverifiablellm.utils and
replace calls to stream_text_from_xml(...) with extract_text_from_xml(...,
stream=True). Specifically, remove the old import of stream_text_from_xml,
import extract_text_from_xml from openverifiablellm.utils alongside
clean_wikitext, and change the invocation sites (the code that previously called
stream_text_from_xml, e.g., in the benchmark function that processes XML at the
lines noted in the review) to call extract_text_from_xml(file_or_xml, ...,
stream=True) preserving the same arguments and handling the returned
generator/stream.

In `@tests/test_merkle.py`:
- Around line 106-206: Add a new integration test in TestIncrementalVsBatch that
reads a small XML fixture through extract_text_from_xml(..., stream=True), feeds
each extracted text into IncrementalMerkleTree.append_leaf, computes
batch_merkle_root over the same extracted texts, and asserts equality of the two
roots; place the test near the other root-equality tests, reference
extract_text_from_xml, IncrementalMerkleTree, and batch_merkle_root so the test
exercises the streaming path and benchmark wiring without slowing the suite.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f5572ca0-0790-4e67-be21-0fe041f81b1f

📥 Commits

Reviewing files that changed from the base of the PR and between 35ce9dc and edc2479.

📒 Files selected for processing (5)

openverifiablellm/benchmark.py
openverifiablellm/incremental_merkle.py
openverifiablellm/utils.py
scripts/benchmark.py
tests/test_merkle.py

openverifiablellm/benchmark.py

tests/test_merkle.py

…; add streaming XML integration tests - benchmark.py: replace unsafe xml.etree.ElementTree with defusedxml.ElementTree to prevent XML parser attacks (CodeRabbit critical finding). - benchmark.py: replace broken import of stream_text_from_xml (from the now-deleted streaming_utils.py) with extract_text_from_xml from openverifiablellm.utils, using stream=True to preserve identical streaming behaviour. - tests/test_merkle.py: add TestStreamingXmlIntegration class with two tests: (1) streaming root hash matches batch reference root for a tiny XML fixture, (2) streamed texts have wikitext markup stripped — exercises the full extract_text_from_xml(stream=True) + IncrementalMerkleTree pipeline.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/benchmark.py`:
- Around line 269-277: The benchmark currently calls _run_old_way(path) then
_run_new_way(path) sequentially which biases results due to OS/pagecache and
warm-up; modify the benchmark loop to run each approach in isolated subprocesses
and across multiple trials, alternating order per trial (e.g., trial i: order =
[_run_old_way, _run_new_way] if i%2==0 else reversed) or spawn a fresh Python
process that invokes a small entry function (e.g., run_benchmark_mode(mode,
path)) to ensure isolated memory/allocator and cold caches, collect
time/peak-RAM for each trial and aggregate (median or mean) before printing
comparative speed-up; ensure references to _run_old_way and _run_new_way and the
top-level runner that currently prints results are updated to call the
subprocess/alternating-trials flow and to report aggregated, order-independent
metrics.
- Around line 107-109: The code currently treats an empty list of leaves as a
valid Merkle root by computing hashlib.sha256(b"").hexdigest() (see where leaves
is checked and root_hex is set); change this so the empty-case is handled
distinctly: when leaves is empty return or propagate a sentinel (e.g. None) and
an article count instead of computing a root, and make the caller abort/report
explicitly when article count == 0; update both occurrences that set root_hex
from hashlib.sha256(b"") (the check around variable leaves and the second
occurrence mentioned) to follow this pattern so zero-article runs are surfaced
rather than producing a spurious matching root.

In `@tests/test_merkle.py`:
- Around line 381-386: The Beta fixture's revision uses a real XML <ref> element
so XML parsing removes it before clean_wikitext runs; update the test fixture
string in tests/test_merkle.py so the Beta revision text uses escaped markup
(&lt;ref&gt;ref&lt;/ref&gt;) instead of a literal <ref> element to ensure
text_elem.text contains the literal "<ref>..." sequence and thus exercises the
clean_wikitext ref-stripping logic referenced in the test assertion (see
clean_wikitext and where text_elem.text is read).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e52bbd43-9cb1-4408-bc25-e5535d45e1a5

📥 Commits

Reviewing files that changed from the base of the PR and between edc2479 and 6bafa67.

📒 Files selected for processing (2)

openverifiablellm/benchmark.py
tests/test_merkle.py

openverifiablellm/benchmark.py

tests/test_merkle.py

…nel, and fixture ref escaping - benchmark.py: run_benchmark() now spawns each measurement in an isolated subprocess via --_mode flag, alternating old/new order per trial to eliminate OS pagecache and allocator warm-up bias; reports median time and RAM across all trials (default 3). - benchmark.py: empty-article runs now return root=None (sentinel) instead of the spurious hashlib.sha256(b'').hexdigest(); callers abort with a clear error when article_count==0 so zero-article runs are never silently passed off as a valid benchmark result. - tests/test_merkle.py: fixed _WIKI_XML_FIXTURE Beta article to use XML-escaped markup (<ref>ref</ref>) so the literal '<ref>...' string reaches text_elem.text and actually exercises clean_wikitext's ref-stripping logic, rather than being silently consumed by the XML parser.

…emental-merkle # Conflicts: # openverifiablellm/utils.py

…st fixture - Apply ruff format to utils.py (trailing whitespace / blank-line fixes) - Fix import ordering in benchmark.py and test_merkle.py (ruff I001) - Remove unused 'io' import from test_merkle.py (ruff F401) - Pass write_manifest=True in test_verify.run_preprocessing so the manifest is actually written and all 37 verify tests pass

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/benchmark.py`:
- Around line 89-101: The _run_old_way function currently hardcodes
bz2.open(file_path, "rb") causing failures for plain .xml files; update
_run_old_way to use the same detection logic as extract_text_from_xml (or
delegate to it) so it opens bz2-compressed and uncompressed files transparently,
then iterate the parsed elements and apply clean_wikitext/append to all_texts as
before; reference _run_old_way, extract_text_from_xml, clean_wikitext, and
all_texts to locate the change.
- Around line 387-388: The current parsing uses json.loads(proc.stdout.strip())
which fails if stdout contains extra lines; change the logic around proc.stdout
to extract the last non-empty line (or scan lines for the first valid JSON blob)
before calling json.loads so you only parse the actual JSON payload; update the
code that sets data (the json.loads call) and the subsequent print that uses
data['time'] and data['ram'] accordingly, referencing proc.stdout, json.loads,
data and the print statement to locate and fix the parsing.

In `@openverifiablellm/utils.py`:
- Around line 190-268: The streaming generator in extract_text_from_xml may
leave the ET.iterparse iterator open on exceptions or when the caller stops
iteration; wrap the iteration in _generator inside a try/finally and in the
finally block call context.close() (guarded with AttributeError in case close()
isn't present) to ensure the iterparse internal state is released, and keep the
existing elem.clear() logic; reference the _generator function, the context
variable (context = ET.iterparse(...)), and clean_wikitext when making the
change.

In `@tests/test_merkle.py`:
- Around line 178-180: The test is coupling to the internal attribute _frontier;
add a public way to observe the frontier size or change the test to use public
behavior instead: either implement a read-only property like frontier_size on
the MerkleTree class (or a method get_frontier_size()) and assert
len(tree.frontier_size) == 1, or replace the assertion with an indirect check
(e.g., verify the tree root/height/number of stored nodes or that proofs/merkle
paths behave as expected for a power-of-two leaf count) using public methods
such as MerkleTree.root, MerkleTree.leaf_count, or proof generation methods
instead of accessing _frontier directly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: edaaeff9-bb42-4d34-a1ee-efcba5413e42

📥 Commits

Reviewing files that changed from the base of the PR and between 6bafa67 and 900c92d.

📒 Files selected for processing (4)

openverifiablellm/benchmark.py
openverifiablellm/utils.py
tests/test_merkle.py
tests/test_verify.py

openverifiablellm/benchmark.py

openverifiablellm/utils.py

tests/test_merkle.py

benchmark.py: - _run_old_way: replace hardcoded bz2.open() with magic-byte detection (same 3-byte BZh probe used by extract_text_from_xml) so plain .xml files are opened correctly instead of crashing - run_benchmark: replace json.loads(proc.stdout.strip()) with a reverse- scan that finds the last non-empty line parseable as JSON, tolerating any stray log/warning lines that may appear before the JSON payload utils.py: - extract_text_from_xml (stream=True): wrap the ET.iterparse loop in try/finally and call context.close() in the finally block so the iterparse internal state is released when the caller abandons the generator mid-stream or an exception propagates incremental_merkle.py: - Add public read-only property frontier_size -> int so callers and tests can observe frontier occupancy without accessing _frontier tests/test_merkle.py: - Replace all tree._frontier direct accesses with tree.frontier_size (covers test_root_hash_matches_batch_power_of_two, test_frontier_empty_when_empty, test_frontier_size_equals_popcount)

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/benchmark.py`:
- Line 237: The f-string that formats the root-hash row uses slicing on
hash_old_str and hash_new_str (see the string with `{hash_old_str[:16]}…` and
`{hash_new_str[:16]}…`) which produces awkward output when the value is the
sentinel "N/A (0 articles)"; change the formatting to handle that sentinel
explicitly: compute display_old = hash_old_str if hash_old_str.startswith("N/A")
else hash_old_str[:16] + "…" (and the same for display_new), then use
`{display_old}` and `{display_new}` in the f-string so the "N/A (0 articles)"
text is shown intact and truncated hashes keep the ellipsis.

In `@openverifiablellm/utils.py`:
- Around line 261-271: The batch-mode XML processing using ET.iterparse assigned
to variable context (inside the with open_func(input_path, "rb") as f: block)
lacks a try/finally to ensure context.close() on errors; update the loop in the
batch branch to wrap the for _, elem in context: iteration in a try/finally and
call context.close() in the finally block (while keeping existing logic that
finds text_elem, calls clean_wikitext(text_elem.text), writes to output_path,
and calls elem.clear()) so the iterparse iterator is always cleaned up on
exceptions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 96f9f971-f64e-49b3-b112-98e5b02e8868

📥 Commits

Reviewing files that changed from the base of the PR and between 900c92d and fae3413.

📒 Files selected for processing (4)

openverifiablellm/benchmark.py
openverifiablellm/incremental_merkle.py
openverifiablellm/utils.py
tests/test_merkle.py

coderabbitai · 2026-03-12T10:19:22Z

openverifiablellm/benchmark.py

+        "|-------------------------------|--------------------:|--------------------:|--------------------|",
+        f"| Wall-clock time (s)           | `{time_old:>10.3f}` | `{time_new:>10.3f}` | **{time_ratio:,.1f}× faster**    |",
+        f"| Peak RAM usage (MB)           | `{ram_old:>10.2f}` | `{ram_new:>10.2f}` | **{ram_ratio:,.1f}× less RAM**   |",
+        f"| Root hash                     | `{hash_old_str[:16]}…` | `{hash_new_str[:16]}…` | {'Match' if hashes_match else 'MISMATCH'}              |",


🧹 Nitpick | 🔵 Trivial

Potential display issue with truncated hash string.

When hash_old_str or hash_new_str is "N/A (0 articles)", the slice [:16] produces "N/A (0 articles)" (16 chars) followed by …, which could look awkward. Consider handling this case separately for cleaner output.

♻️ Suggested improvement

- f"| Root hash | `{hash_old_str[:16]}…` | `{hash_new_str[:16]}…` | {'Match' if hashes_match else 'MISMATCH'} |", + f"| Root hash | `{hash_old_str[:16]}{'…' if len(hash_old_str) > 16 else ''}` | `{hash_new_str[:16]}{'…' if len(hash_new_str) > 16 else ''}` | {'Match' if hashes_match else 'MISMATCH'} |",

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

f"| Root hash | `{hash_old_str[:16]}…` | `{hash_new_str[:16]}…` | {'Match' if hashes_match else 'MISMATCH'} |",

f"| Root hash | `{hash_old_str[:16]}{'…' if len(hash_old_str) > 16 else ''}` | `{hash_new_str[:16]}{'…' if len(hash_new_str) > 16 else ''}` | {'Match' if hashes_match else 'MISMATCH'} |",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/benchmark.py` at line 237, The f-string that formats the root-hash row uses slicing on hash_old_str and hash_new_str (see the string with `{hash_old_str[:16]}…` and `{hash_new_str[:16]}…`) which produces awkward output when the value is the sentinel "N/A (0 articles)"; change the formatting to handle that sentinel explicitly: compute display_old = hash_old_str if hash_old_str.startswith("N/A") else hash_old_str[:16] + "…" (and the same for display_new), then use `{display_old}` and `{display_new}` in the f-string so the "N/A (0 articles)" text is shown intact and truncated hashes keep the ellipsis.

coderabbitai · 2026-03-12T10:19:22Z

openverifiablellm/utils.py

    with open_func(input_path, "rb") as f:
        context = ET.iterparse(f, events=("end",))
-
        with open(output_path, "w", encoding="utf-8") as out:
            for _, elem in context:
                if elem.tag.endswith("page"):
                    text_elem = elem.find(".//{*}text")
-
                    if text_elem is not None and text_elem.text:
                        cleaned = clean_wikitext(text_elem.text)
                        if cleaned:
                            out.write(cleaned + "\n\n")
-
                    elem.clear()


⚠️ Potential issue | 🟡 Minor

Batch mode lacks iterparse cleanup on exceptions.

The streaming mode (lines 228-251) properly wraps the iterparse iteration in a try/finally to call context.close(), but the batch mode (lines 261-271) does not. If an exception occurs during batch processing, the iterparse context may not be properly released.

🛡️ Proposed fix to add cleanup in batch mode

with open_func(input_path, "rb") as f: context = ET.iterparse(f, events=("end",)) - with open(output_path, "w", encoding="utf-8") as out: - for _, elem in context: - if elem.tag.endswith("page"): - text_elem = elem.find(".//{*}text") - if text_elem is not None and text_elem.text: - cleaned = clean_wikitext(text_elem.text) - if cleaned: - out.write(cleaned + "\n\n") - elem.clear() + try: + with open(output_path, "w", encoding="utf-8") as out: + for _, elem in context: + if elem.tag.endswith("page"): + text_elem = elem.find(".//{*}text") + if text_elem is not None and text_elem.text: + cleaned = clean_wikitext(text_elem.text) + if cleaned: + out.write(cleaned + "\n\n") + elem.clear() + finally: + try: + context.close() + except AttributeError: + pass

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

with open_func(input_path, "rb") as f:

context = ET.iterparse(f, events=("end",))

with open(output_path, "w", encoding="utf-8") as out:

for _, elem in context:

if elem.tag.endswith("page"):

text_elem = elem.find(".//{*}text")

if text_elem is not None and text_elem.text:

cleaned = clean_wikitext(text_elem.text)

if cleaned:

out.write(cleaned + "\n\n")

elem.clear()

with open_func(input_path, "rb") as f:

context = ET.iterparse(f, events=("end",))

try:

with open(output_path, "w", encoding="utf-8") as out:

for _, elem in context:

if elem.tag.endswith("page"):

text_elem = elem.find(".//{*}text")

if text_elem is not None and text_elem.text:

cleaned = clean_wikitext(text_elem.text)

if cleaned:

out.write(cleaned + "\n\n")

elem.clear()

finally:

try:

context.close()

except AttributeError:

pass

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 261 - 271, The batch-mode XML processing using ET.iterparse assigned to variable context (inside the with open_func(input_path, "rb") as f: block) lacks a try/finally to ensure context.close() on errors; update the loop in the batch branch to wrap the for _, elem in context: iteration in a try/finally and call context.close() in the finally block (while keeping existing logic that finds text_elem, calls clean_wikitext(text_elem.text), writes to output_path, and calls elem.clear()) so the iterparse iterator is always cleaned up on exceptions.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

openverifiablellm/utils.py (1)

179-196: ⚠️ Potential issue | 🔴 Critical

This proof-building loop never shrinks the tree.

Each pair currently produces two parent nodes, so leaves does not reduce toward a single root. That yields invalid proof steps and can trap the loop at a constant level size.

🐛 Minimal fix

         next_level: List[bytes] = []
         for i in range(0, len(leaves), 2):
             combined = leaves[i] + leaves[i + 1]
-
-            next_level.append(bytes.fromhex(_sha256_hex(combined)))
-
-            parent_bytes = compute_sha256_bytes(data=combined)
-            next_level.append(parent_bytes)
+            next_level.append(compute_sha256_bytes(data=combined))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 179 - 196, The loop building the
Merkle tree never shrinks because each leaf pair appends two parent nodes; fix
the loop in the proof construction by producing exactly one parent per pair (use
a single consistent hash function—either bytes.fromhex(_sha256_hex(combined)) or
compute_sha256_bytes(data=combined) — not both) when populating next_level, and
after collapsing the level update index = index // 2 (or equivalent) before the
next iteration so sibling_index/index calculations remain correct; references:
leaves, proof, sibling_index, _sha256_hex, compute_sha256_bytes.

♻️ Duplicate comments (1)

openverifiablellm/utils.py (1)

327-337: ⚠️ Potential issue | 🟡 Minor

Restore iterparse cleanup in batch mode.

This reintroduces the earlier cleanup gap: if clean_wikitext() or out.write() raises, context.close() is skipped. The streaming branch already protects this with try/finally; batch mode should mirror it.

🛡️ Minimal fix

     with open_func(input_path, "rb") as f:
         context = ET.iterparse(f, events=("end",))
-        with open(output_path, "w", encoding="utf-8") as out:
-            for _, elem in context:
-                if elem.tag.endswith("page"):
-                    text_elem = elem.find(".//{*}text")
-                    if text_elem is not None and text_elem.text:
-                        cleaned = clean_wikitext(text_elem.text)
-                        if cleaned:
-                            out.write(cleaned + "\n\n")
-                    elem.clear()
+        try:
+            with open(output_path, "w", encoding="utf-8") as out:
+                for _, elem in context:
+                    if elem.tag.endswith("page"):
+                        text_elem = elem.find(".//{*}text")
+                        if text_elem is not None and text_elem.text:
+                            cleaned = clean_wikitext(text_elem.text)
+                            if cleaned:
+                                out.write(cleaned + "\n\n")
+                        elem.clear()
+        finally:
+            try:
+                context.close()
+            except AttributeError:
+                pass

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 327 - 337, The batch-mode iterparse
loop creates `context = ET.iterparse(f, events=("end",))` but does not guarantee
`context.close()` if `clean_wikitext()` or `out.write()` raises; wrap the `for
_, elem in context:` loop in a try/finally and call `context.close()` in the
finally block so the iterator is always closed (keep the existing `with
open_func(input_path, "rb") as f:` and `with open(output_path, "w",
encoding="utf-8") as out:` blocks as-is); reference `context`, `iterparse`,
`clean_wikitext`, and `out.write` when making the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 87-105: The code performs two passes over the file and uses a
local variable path outside its scope causing a NameError; remove the second
file-pass block that re-opens path and appends duplicate leaves. Instead build
leaves once by iterating _iter_chunks() and for each chunk append the computed
bytes (use compute_sha256_bytes or bytes.fromhex(_sha256_hex(chunk))
consistently). Delete the second with path.open(...) loop and any references to
path outside _iter_chunks(), leaving a single loop that populates leaves.
- Around line 111-121: The loop in the Merkle building code duplicates each
parent by appending two hashes (one from _sha256_hex and one from
compute_sha256_bytes), causing the tree to never shrink; fix by computing a
single parent hash from combined (left + right) and appending it only once to
next_level (remove the redundant next_level.append call and use either
compute_sha256_bytes or a single consistent hash function instead of both);
change code around variables leaves, next_level, combined, _sha256_hex and
compute_sha256_bytes so each pair produces exactly one parent_bytes appended.
- Around line 151-167: The code incorrectly does two passes and references path
when chunks is provided, causing a NameError and duplicated leaves; fix by
building the leaves in a single pass: if chunks is not None iterate over chunks
and append their hash bytes (use bytes.fromhex(_sha256_hex(chunk)) or
compute_sha256_bytes as appropriate) and do NOT open or read the file; otherwise
open Path(file_path) and read chunk_size once, computing leaf_bytes via
compute_sha256_bytes(data=chunk) and appending to leaves; remove the second with
path.open(...) block to avoid duplicate appends and the invalid reference to
path when chunks is passed.

---

Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 179-196: The loop building the Merkle tree never shrinks because
each leaf pair appends two parent nodes; fix the loop in the proof construction
by producing exactly one parent per pair (use a single consistent hash
function—either bytes.fromhex(_sha256_hex(combined)) or
compute_sha256_bytes(data=combined) — not both) when populating next_level, and
after collapsing the level update index = index // 2 (or equivalent) before the
next iteration so sibling_index/index calculations remain correct; references:
leaves, proof, sibling_index, _sha256_hex, compute_sha256_bytes.

---

Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 327-337: The batch-mode iterparse loop creates `context =
ET.iterparse(f, events=("end",))` but does not guarantee `context.close()` if
`clean_wikitext()` or `out.write()` raises; wrap the `for _, elem in context:`
loop in a try/finally and call `context.close()` in the finally block so the
iterator is always closed (keep the existing `with open_func(input_path, "rb")
as f:` and `with open(output_path, "w", encoding="utf-8") as out:` blocks
as-is); reference `context`, `iterparse`, `clean_wikitext`, and `out.write` when
making the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 737c2643-7f91-4366-bd50-71f51d7c8ffe

📥 Commits

Reviewing files that changed from the base of the PR and between fae3413 and 5240f16.

📒 Files selected for processing (1)

openverifiablellm/utils.py

coderabbitai · 2026-03-12T15:14:25Z

openverifiablellm/utils.py

+    def _iter_chunks() -> Iterator[bytes]:
+        if chunks is not None:
+            yield from chunks
+        else:
+            path = Path(file_path)  # type: ignore[arg-type]
+            with path.open("rb") as f:
+                while chunk := f.read(chunk_size):
+                    yield chunk
+
+
+    leaves: List[bytes] = []
+    for chunk in _iter_chunks():
+        leaves.append(bytes.fromhex(_sha256_hex(chunk)))

    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            # use new helper to directly get bytes
            leaf_bytes = compute_sha256_bytes(data=chunk)
            leaves.append(leaf_bytes)


⚠️ Potential issue | 🔴 Critical

Remove the stale second file pass.

path is local to _iter_chunks(), so Line 101 raises NameError before any root can be computed. In file mode this block would also re-read the file and duplicate every leaf.

🐛 Minimal fix

leaves: List[bytes] = [] for chunk in _iter_chunks(): leaves.append(bytes.fromhex(_sha256_hex(chunk))) - - with path.open("rb") as f: - while chunk := f.read(chunk_size): - # use new helper to directly get bytes - leaf_bytes = compute_sha256_bytes(data=chunk) - leaves.append(leaf_bytes)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _iter_chunks() -> Iterator[bytes]:

if chunks is not None:

yield from chunks

else:

path = Path(file_path) # type: ignore[arg-type]

with path.open("rb") as f:

while chunk := f.read(chunk_size):

yield chunk

leaves: List[bytes] = []

for chunk in _iter_chunks():

leaves.append(bytes.fromhex(_sha256_hex(chunk)))

with path.open("rb") as f:

while chunk := f.read(chunk_size):

# use new helper to directly get bytes

leaf_bytes = compute_sha256_bytes(data=chunk)

leaves.append(leaf_bytes)

def _iter_chunks() -> Iterator[bytes]:

if chunks is not None:

yield from chunks

else:

path = Path(file_path) # type: ignore[arg-type]

with path.open("rb") as f:

while chunk := f.read(chunk_size):

yield chunk

leaves: List[bytes] = []

for chunk in _iter_chunks():

leaves.append(bytes.fromhex(_sha256_hex(chunk)))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 87 - 105, The code performs two passes over the file and uses a local variable path outside its scope causing a NameError; remove the second file-pass block that re-opens path and appends duplicate leaves. Instead build leaves once by iterating _iter_chunks() and for each chunk append the computed bytes (use compute_sha256_bytes or bytes.fromhex(_sha256_hex(chunk)) consistently). Delete the second with path.open(...) loop and any references to path outside _iter_chunks(), leaving a single loop that populates leaves.

coderabbitai · 2026-03-12T15:14:25Z

openverifiablellm/utils.py

    while len(leaves) > 1:
-        next_level = []
+        next_level: List[bytes] = []
        for i in range(0, len(leaves), 2):
            left = leaves[i]
            right = leaves[i + 1] if i + 1 < len(leaves) else left
-
            combined = left + right
+
+            next_level.append(bytes.fromhex(_sha256_hex(combined)))
+
            parent_bytes = compute_sha256_bytes(data=combined)
            next_level.append(parent_bytes)


⚠️ Potential issue | 🔴 Critical

Append each parent hash only once.

This loop adds the same parent twice. With two leaves, the next level still has length 2, so while len(leaves) > 1 never converges and the root is wrong.

🐛 Minimal fix

for i in range(0, len(leaves), 2): left = leaves[i] right = leaves[i + 1] if i + 1 < len(leaves) else left combined = left + right - - next_level.append(bytes.fromhex(_sha256_hex(combined))) - - parent_bytes = compute_sha256_bytes(data=combined) - next_level.append(parent_bytes) + next_level.append(compute_sha256_bytes(data=combined))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 111 - 121, The loop in the Merkle building code duplicates each parent by appending two hashes (one from _sha256_hex and one from compute_sha256_bytes), causing the tree to never shrink; fix by computing a single parent hash from combined (left + right) and appending it only once to next_level (remove the redundant next_level.append call and use either compute_sha256_bytes or a single consistent hash function instead of both); change code around variables leaves, next_level, combined, _sha256_hex and compute_sha256_bytes so each pair produces exactly one parent_bytes appended.

coderabbitai · 2026-03-12T15:14:25Z

openverifiablellm/utils.py

+    leaves: List[bytes] = []
+
+
+    if chunks is not None:
+        for chunk in chunks:
+            leaves.append(bytes.fromhex(_sha256_hex(chunk)))
+    else:
+        path = Path(file_path)  # type: ignore[arg-type]
+        with path.open("rb") as f:
+            while chunk := f.read(chunk_size):
+                leaves.append(bytes.fromhex(_sha256_hex(chunk)))

    # Build leaf level
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            leaf_bytes = compute_sha256_bytes(data=chunk)
            leaves.append(leaf_bytes)


⚠️ Potential issue | 🔴 Critical

Build proof leaves in a single pass.

When chunks is used, Line 164 references path before assignment and fails immediately. In file mode, this second read duplicates every leaf, so chunk_index no longer points at the intended proof path.

🐛 Minimal fix

if chunks is not None: for chunk in chunks: leaves.append(bytes.fromhex(_sha256_hex(chunk))) else: path = Path(file_path) # type: ignore[arg-type] with path.open("rb") as f: while chunk := f.read(chunk_size): leaves.append(bytes.fromhex(_sha256_hex(chunk))) - - # Build leaf level - with path.open("rb") as f: - while chunk := f.read(chunk_size): - leaf_bytes = compute_sha256_bytes(data=chunk) - leaves.append(leaf_bytes)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

leaves: List[bytes] = []

if chunks is not None:

for chunk in chunks:

leaves.append(bytes.fromhex(_sha256_hex(chunk)))

else:

path = Path(file_path) # type: ignore[arg-type]

with path.open("rb") as f:

while chunk := f.read(chunk_size):

leaves.append(bytes.fromhex(_sha256_hex(chunk)))

# Build leaf level

with path.open("rb") as f:

while chunk := f.read(chunk_size):

leaf_bytes = compute_sha256_bytes(data=chunk)

leaves.append(leaf_bytes)

leaves: List[bytes] = []

if chunks is not None:

for chunk in chunks:

leaves.append(bytes.fromhex(_sha256_hex(chunk)))

else:

path = Path(file_path) # type: ignore[arg-type]

with path.open("rb") as f:

while chunk := f.read(chunk_size):

leaves.append(bytes.fromhex(_sha256_hex(chunk)))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 151 - 167, The code incorrectly does two passes and references path when chunks is provided, causing a NameError and duplicated leaves; fix by building the leaves in a single pass: if chunks is not None iterate over chunks and append their hash bytes (use bytes.fromhex(_sha256_hex(chunk)) or compute_sha256_bytes as appropriate) and do NOT open or read the file; otherwise open Path(file_path) and read chunk_size once, computing leaf_bytes via compute_sha256_bytes(data=chunk) and appending to leaves; remove the second with path.open(...) block to avoid duplicate appends and the invalid reference to path when chunks is passed.

feat: add streaming parser and incremental Merkle tree

8f1a0f8

github-actions bot added no-issue-linked backend python size/XL first-time-contributor pending-coderabbit-review enhancement New feature or request and removed size/XL labels Mar 6, 2026

github-actions bot added size/XL and removed size/XL labels Mar 9, 2026

fix: resolve merge conflict with main — keep streaming types and envi…

f3ec3eb

…ronment fingerprint import

github-actions bot added size/XL and removed size/XL labels Mar 9, 2026

coderabbitai bot requested changes Mar 9, 2026

View reviewed changes

openverifiablellm/benchmark.py Outdated Show resolved Hide resolved

openverifiablellm/benchmark.py Outdated Show resolved Hide resolved

tests/test_merkle.py Outdated Show resolved Hide resolved

github-actions bot removed the size/XL label Mar 9, 2026

github-actions bot added size/XL and removed size/XL labels Mar 9, 2026

coderabbitai bot requested changes Mar 9, 2026

View reviewed changes

openverifiablellm/benchmark.py Show resolved Hide resolved

openverifiablellm/benchmark.py Outdated Show resolved Hide resolved

tests/test_merkle.py Show resolved Hide resolved

github-actions bot added size/XL and removed size/XL labels Mar 9, 2026

Muneerali199 added 2 commits March 12, 2026 15:16

Merge remote-tracking branch 'upstream/main' into feat/streaming-incr…

394dca4

…emental-merkle # Conflicts: # openverifiablellm/utils.py

github-actions bot added size/XL and removed size/XL labels Mar 12, 2026

coderabbitai bot requested changes Mar 12, 2026

View reviewed changes

openverifiablellm/benchmark.py Outdated Show resolved Hide resolved

openverifiablellm/benchmark.py Outdated Show resolved Hide resolved

openverifiablellm/utils.py Show resolved Hide resolved

tests/test_merkle.py Outdated Show resolved Hide resolved

github-actions bot added size/XL and removed size/XL labels Mar 12, 2026

coderabbitai bot requested changes Mar 12, 2026

View reviewed changes

Merge branch 'main' into feat/streaming-incremental-merkle

5240f16

github-actions bot added size/XL and removed size/XL labels Mar 12, 2026

coderabbitai bot requested changes Mar 12, 2026

View reviewed changes

	f"\| Root hash \| `{hash_old_str[:16]}…` \| `{hash_new_str[:16]}…` \| {'Match' if hashes_match else 'MISMATCH'} \|",
	f"\| Root hash \| `{hash_old_str[:16]}{'…' if len(hash_old_str) > 16 else ''}` \| `{hash_new_str[:16]}{'…' if len(hash_new_str) > 16 else ''}` \| {'Match' if hashes_match else 'MISMATCH'} \|",

Uh oh!

Conversation

Muneerali199 commented Mar 6, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

How to run the benchmark yourself

Test results

Summary by CodeRabbit

Uh oh!

Archit381 commented Mar 7, 2026

Uh oh!

Archit381 commented Mar 9, 2026

Uh oh!

coderabbitai bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

Muneerali199 commented Mar 9, 2026

Uh oh!

Archit381 commented Mar 9, 2026

Uh oh!

Muneerali199 commented Mar 9, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Muneerali199 commented Mar 6, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 9, 2026 •

edited

Loading