Skip to content

feat: add streaming parser and incremental Merkle tree#39

Open
Muneerali199 wants to merge 10 commits intoAOSSIE-Org:mainfrom
Muneerali199:feat/streaming-incremental-merkle
Open

feat: add streaming parser and incremental Merkle tree#39
Muneerali199 wants to merge 10 commits intoAOSSIE-Org:mainfrom
Muneerali199:feat/streaming-incremental-merkle

Conversation

@Muneerali199
Copy link

@Muneerali199 Muneerali199 commented Mar 6, 2026

What this PR does

  • Adds streaming XML parser (O(1) memory via iterparse + elem.clear())
  • Adds IncrementalMerkleTree with Merkle Frontier (O(log N) time/space)
  • Adds benchmark.py to reproduce the Before/After comparison
  • Adds 50 passing unit tests

How to run the benchmark yourself

wget https://dumps.wikimedia.org/simplewiki/20260201/simplewiki-20260201-pages-articles-multistream.xml.bz2
python -m openverifiablellm.benchmark simplewiki-20260201-pages-articles-multistream.xml.bz2

Test results

pytest tests/test_merkle.py -v # 50/50 passed

Fix #31

Windows PowerShell 06-03-2026 15_13_50

Summary by CodeRabbit

  • New Features

    • Added benchmarking tool to compare performance of different Wikipedia XML processing approaches
    • Introduced incremental merkle tree implementation for memory-efficient append-only operations
    • Extended utilities with streaming modes for processing large files with minimal memory overhead
  • Tests

    • Added comprehensive test suite for merkle tree operations and XML integration

@Archit381
Copy link
Member

Pls add the analysis requested in the issue

@Archit381
Copy link
Member

Can you update the functions inside the current utils.py file to support streaming.

Also add benchmark.py in the scripts folder

- Add four streaming functions to utils.py: stream_chunks (yields raw
  byte chunks from a file), stream_sha256 (yields rolling hash pairs),
  compute_merkle_root_streaming (builds Merkle root from any bytes
  iterable), and stream_extract_text_from_xml (O(1)-memory generator
  over Wikipedia XML dumps)
- Add scripts/benchmark.py as a thin CLI launcher that delegates to
  openverifiablellm.benchmark.main, keeping the scripts/ directory
  as plain entry-points with no duplicated logic
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Introduces streaming data pipeline support with an incremental Merkle tree implementation enabling O(1) memory processing of large datasets. Adds IncrementalMerkleTree class, extends utility functions to support streaming modes, implements comprehensive benchmarking infrastructure, and provides test coverage validating streaming-vs-batch parity.

Changes

Cohort / File(s) Summary
Incremental Merkle Tree Core
openverifiablellm/incremental_merkle.py
New module implementing append-only Merkle tree using Merkle Frontier approach. Maintains frontier dictionary mapping depth to node hashes, supports O(log N) leaf appends with O(log N) space, and non-destructive root computation via frontier folding.
Utilities Enhancement
openverifiablellm/utils.py
Extended compute_merkle_root, generate_merkle_proof, extract_text_from_xml, and compute_sha256 to support streaming modes. Added mutual exclusivity between file-path and chunks parameters. extract_text_from_xml now yields cleaned article texts via generator. compute_sha256 returns Union of string or generator for streaming. Added _sha256_hex helper and Generator/Iterable type imports.
Benchmarking Infrastructure
openverifiablellm/benchmark.py, scripts/benchmark.py
New benchmark module comparing in-memory legacy vs streaming incremental Merkle tree approaches on large Wikipedia XML dumps. Measures wall-time, peak RAM, root hashes, and article counts via isolated subprocess trials. Renders results in terminal and Markdown table formats. CLI wrapper script delegates to benchmark.main.
Test Coverage
tests/test_merkle.py
Comprehensive PyTest suite with 460 lines validating IncrementalMerkleTree against batch Merkle root reference implementation. Tests correctness across varied leaf counts, empty-tree behavior, frontier size invariants, determinism, hash formatting, and XML streaming integration. Includes parametric coverage and reference batch_merkle_root implementation.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant XMLStream as XML Stream<br/>(generator)
    participant IMT as IncrementalMerkleTree
    participant Frontier as Merkle Frontier<br/>(Dict)

    Client->>XMLStream: extract_text_from_xml(stream=True)
    
    loop for each article in stream
        XMLStream->>XMLStream: parse & clean text
        XMLStream-->>Client: yield cleaned_text
        Client->>IMT: append_leaf(cleaned_text)
        IMT->>IMT: hash_text → leaf_bytes
        IMT->>Frontier: check depth 0
        alt depth empty
            Frontier->>Frontier: store at depth 0
        else depth occupied
            IMT->>IMT: combine(stored, new) → parent
            IMT->>Frontier: remove depth, check depth+1
            Note over IMT,Frontier: Repeat until empty slot
        end
        IMT->>IMT: increment leaf_count
    end
    
    Client->>IMT: get_root_hash()
    IMT->>Frontier: fold from lowest to highest
    Note over IMT,Frontier: Promote partial subtrees<br/>via self-combination
    IMT-->>Client: return hex_root
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • PR #25 — Modifies extract_text_from_xml and compute_sha256 with streaming/IO behavior changes directly affecting these utility signatures.
  • PR #33 — Updates Merkle hashing utilities in utils.py for SHA-256 digest handling and root/proof computation aligned with benchmark changes.
  • PR #17 — Modifies compute_sha256 and extract_text_from_xml signatures and behavior overlapping with streaming support implementation.

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 A frontier so fine, no tree need to grow,
Just hashes that flow where the bytes gently go,
Stream on and append with logarithmic grace,
One megabyte stands where gigabytes raced! ✨📊

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Title clearly summarizes the main changes: adding a streaming parser and an incremental Merkle tree implementation.
Linked Issues check ✅ Passed All requirements from issue #31 are addressed: streaming XML parser, incremental Merkle tree with Merkle Frontier, updated hashing utilities, benchmark comparison tool, and comprehensive unit tests confirming implementation parity.
Out of Scope Changes check ✅ Passed All changes directly support the PR objectives and issue #31 requirements; no extraneous modifications detected outside the streaming pipeline and Merkle tree scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan for PR comments
  • Generate coding plan

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
@Muneerali199
Copy link
Author

sir @Archit381, Done — updated utils.py with streaming functions and added scripts/benchmark.py, now merged with main and ready for merge.

@Archit381
Copy link
Member

@Muneerali199 Make sure to unstage the old files.

Also, you have just added the functions in the util. Can you update the existing functions to support streaming instead of adding new functions for the same task

…ha256_hex helper; remove streaming_utils.py

- All internal Merkle tree callers (compute_merkle_root, generate_merkle_proof,
  verify_merkle_proof) now use the private _sha256_hex() helper instead of the
  public compute_sha256() API, eliminating Union return-type checker errors.
- Removed unused 'cast' from typing imports.
- Removed stray comment left over from an earlier edit.
- Deleted openverifiablellm/streaming_utils.py: its streaming logic is fully
  covered by extract_text_from_xml(stream=True) in utils.py, per mentor feedback
  to update existing functions rather than adding standalone streaming functions.
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
@Muneerali199
Copy link
Author

sir @Archit381 ,Done! Updated the existing functions in utils.py to support streaming directly — compute_sha256 now accepts stream=True, compute_merkle_root and generate_merkle_proof accept a chunks= iterable, and extract_text_from_xml accepts stream=True — no separate streaming functions added. Also removed streaming_utils.py from the PR. All 119 tests passing

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/benchmark.py`:
- Around line 39-45: Replace the unsafe xml.etree.ElementTree import in
benchmark.py with defusedxml.ElementTree so all XML parsing uses the safe
parser; specifically change the top-level import that assigns ET and update
usages in _run_old_way (the call(s) to ET.iterparse) to use the defused parser
(ET) so the function IncrementalMerkleTree/stream_text_from_xml callers remain
unchanged while preventing XML parser attacks.
- Around line 44-45: The import of stream_text_from_xml from
openverifiablellm.streaming_utils is broken because streaming_utils was removed;
update the import to use extract_text_from_xml from openverifiablellm.utils and
replace calls to stream_text_from_xml(...) with extract_text_from_xml(...,
stream=True). Specifically, remove the old import of stream_text_from_xml,
import extract_text_from_xml from openverifiablellm.utils alongside
clean_wikitext, and change the invocation sites (the code that previously called
stream_text_from_xml, e.g., in the benchmark function that processes XML at the
lines noted in the review) to call extract_text_from_xml(file_or_xml, ...,
stream=True) preserving the same arguments and handling the returned
generator/stream.

In `@tests/test_merkle.py`:
- Around line 106-206: Add a new integration test in TestIncrementalVsBatch that
reads a small XML fixture through extract_text_from_xml(..., stream=True), feeds
each extracted text into IncrementalMerkleTree.append_leaf, computes
batch_merkle_root over the same extracted texts, and asserts equality of the two
roots; place the test near the other root-equality tests, reference
extract_text_from_xml, IncrementalMerkleTree, and batch_merkle_root so the test
exercises the streaming path and benchmark wiring without slowing the suite.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f5572ca0-0790-4e67-be21-0fe041f81b1f

📥 Commits

Reviewing files that changed from the base of the PR and between 35ce9dc and edc2479.

📒 Files selected for processing (5)
  • openverifiablellm/benchmark.py
  • openverifiablellm/incremental_merkle.py
  • openverifiablellm/utils.py
  • scripts/benchmark.py
  • tests/test_merkle.py

…; add streaming XML integration tests

- benchmark.py: replace unsafe xml.etree.ElementTree with defusedxml.ElementTree
  to prevent XML parser attacks (CodeRabbit critical finding).
- benchmark.py: replace broken import of stream_text_from_xml (from the now-deleted
  streaming_utils.py) with extract_text_from_xml from openverifiablellm.utils,
  using stream=True to preserve identical streaming behaviour.
- tests/test_merkle.py: add TestStreamingXmlIntegration class with two tests:
  (1) streaming root hash matches batch reference root for a tiny XML fixture,
  (2) streamed texts have wikitext markup stripped — exercises the full
  extract_text_from_xml(stream=True) + IncrementalMerkleTree pipeline.
@github-actions github-actions bot removed the size/XL label Mar 9, 2026
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/benchmark.py`:
- Around line 269-277: The benchmark currently calls _run_old_way(path) then
_run_new_way(path) sequentially which biases results due to OS/pagecache and
warm-up; modify the benchmark loop to run each approach in isolated subprocesses
and across multiple trials, alternating order per trial (e.g., trial i: order =
[_run_old_way, _run_new_way] if i%2==0 else reversed) or spawn a fresh Python
process that invokes a small entry function (e.g., run_benchmark_mode(mode,
path)) to ensure isolated memory/allocator and cold caches, collect
time/peak-RAM for each trial and aggregate (median or mean) before printing
comparative speed-up; ensure references to _run_old_way and _run_new_way and the
top-level runner that currently prints results are updated to call the
subprocess/alternating-trials flow and to report aggregated, order-independent
metrics.
- Around line 107-109: The code currently treats an empty list of leaves as a
valid Merkle root by computing hashlib.sha256(b"").hexdigest() (see where leaves
is checked and root_hex is set); change this so the empty-case is handled
distinctly: when leaves is empty return or propagate a sentinel (e.g. None) and
an article count instead of computing a root, and make the caller abort/report
explicitly when article count == 0; update both occurrences that set root_hex
from hashlib.sha256(b"") (the check around variable leaves and the second
occurrence mentioned) to follow this pattern so zero-article runs are surfaced
rather than producing a spurious matching root.

In `@tests/test_merkle.py`:
- Around line 381-386: The Beta fixture's revision uses a real XML <ref> element
so XML parsing removes it before clean_wikitext runs; update the test fixture
string in tests/test_merkle.py so the Beta revision text uses escaped markup
(&lt;ref&gt;ref&lt;/ref&gt;) instead of a literal <ref> element to ensure
text_elem.text contains the literal "<ref>..." sequence and thus exercises the
clean_wikitext ref-stripping logic referenced in the test assertion (see
clean_wikitext and where text_elem.text is read).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e52bbd43-9cb1-4408-bc25-e5535d45e1a5

📥 Commits

Reviewing files that changed from the base of the PR and between edc2479 and 6bafa67.

📒 Files selected for processing (2)
  • openverifiablellm/benchmark.py
  • tests/test_merkle.py

…nel, and fixture ref escaping

- benchmark.py: run_benchmark() now spawns each measurement in an isolated
  subprocess via --_mode flag, alternating old/new order per trial to eliminate
  OS pagecache and allocator warm-up bias; reports median time and RAM across
  all trials (default 3).
- benchmark.py: empty-article runs now return root=None (sentinel) instead of
  the spurious hashlib.sha256(b'').hexdigest(); callers abort with a clear error
  when article_count==0 so zero-article runs are never silently passed off as a
  valid benchmark result.
- tests/test_merkle.py: fixed _WIKI_XML_FIXTURE Beta article to use XML-escaped
  markup (&lt;ref&gt;ref&lt;/ref&gt;) so the literal '<ref>...' string reaches
  text_elem.text and actually exercises clean_wikitext's ref-stripping logic,
  rather than being silently consumed by the XML parser.
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
…emental-merkle

# Conflicts:
#	openverifiablellm/utils.py
…st fixture

- Apply ruff format to utils.py (trailing whitespace / blank-line fixes)
- Fix import ordering in benchmark.py and test_merkle.py (ruff I001)
- Remove unused 'io' import from test_merkle.py (ruff F401)
- Pass write_manifest=True in test_verify.run_preprocessing so the
  manifest is actually written and all 37 verify tests pass
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 12, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/benchmark.py`:
- Around line 89-101: The _run_old_way function currently hardcodes
bz2.open(file_path, "rb") causing failures for plain .xml files; update
_run_old_way to use the same detection logic as extract_text_from_xml (or
delegate to it) so it opens bz2-compressed and uncompressed files transparently,
then iterate the parsed elements and apply clean_wikitext/append to all_texts as
before; reference _run_old_way, extract_text_from_xml, clean_wikitext, and
all_texts to locate the change.
- Around line 387-388: The current parsing uses json.loads(proc.stdout.strip())
which fails if stdout contains extra lines; change the logic around proc.stdout
to extract the last non-empty line (or scan lines for the first valid JSON blob)
before calling json.loads so you only parse the actual JSON payload; update the
code that sets data (the json.loads call) and the subsequent print that uses
data['time'] and data['ram'] accordingly, referencing proc.stdout, json.loads,
data and the print statement to locate and fix the parsing.

In `@openverifiablellm/utils.py`:
- Around line 190-268: The streaming generator in extract_text_from_xml may
leave the ET.iterparse iterator open on exceptions or when the caller stops
iteration; wrap the iteration in _generator inside a try/finally and in the
finally block call context.close() (guarded with AttributeError in case close()
isn't present) to ensure the iterparse internal state is released, and keep the
existing elem.clear() logic; reference the _generator function, the context
variable (context = ET.iterparse(...)), and clean_wikitext when making the
change.

In `@tests/test_merkle.py`:
- Around line 178-180: The test is coupling to the internal attribute _frontier;
add a public way to observe the frontier size or change the test to use public
behavior instead: either implement a read-only property like frontier_size on
the MerkleTree class (or a method get_frontier_size()) and assert
len(tree.frontier_size) == 1, or replace the assertion with an indirect check
(e.g., verify the tree root/height/number of stored nodes or that proofs/merkle
paths behave as expected for a power-of-two leaf count) using public methods
such as MerkleTree.root, MerkleTree.leaf_count, or proof generation methods
instead of accessing _frontier directly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: edaaeff9-bb42-4d34-a1ee-efcba5413e42

📥 Commits

Reviewing files that changed from the base of the PR and between 6bafa67 and 900c92d.

📒 Files selected for processing (4)
  • openverifiablellm/benchmark.py
  • openverifiablellm/utils.py
  • tests/test_merkle.py
  • tests/test_verify.py

benchmark.py:
- _run_old_way: replace hardcoded bz2.open() with magic-byte detection
  (same 3-byte BZh probe used by extract_text_from_xml) so plain .xml
  files are opened correctly instead of crashing
- run_benchmark: replace json.loads(proc.stdout.strip()) with a reverse-
  scan that finds the last non-empty line parseable as JSON, tolerating
  any stray log/warning lines that may appear before the JSON payload

utils.py:
- extract_text_from_xml (stream=True): wrap the ET.iterparse loop in
  try/finally and call context.close() in the finally block so the
  iterparse internal state is released when the caller abandons the
  generator mid-stream or an exception propagates

incremental_merkle.py:
- Add public read-only property frontier_size -> int so callers and
  tests can observe frontier occupancy without accessing _frontier

tests/test_merkle.py:
- Replace all tree._frontier direct accesses with tree.frontier_size
  (covers test_root_hash_matches_batch_power_of_two,
  test_frontier_empty_when_empty, test_frontier_size_equals_popcount)
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 12, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/benchmark.py`:
- Line 237: The f-string that formats the root-hash row uses slicing on
hash_old_str and hash_new_str (see the string with `{hash_old_str[:16]}…` and
`{hash_new_str[:16]}…`) which produces awkward output when the value is the
sentinel "N/A (0 articles)"; change the formatting to handle that sentinel
explicitly: compute display_old = hash_old_str if hash_old_str.startswith("N/A")
else hash_old_str[:16] + "…" (and the same for display_new), then use
`{display_old}` and `{display_new}` in the f-string so the "N/A (0 articles)"
text is shown intact and truncated hashes keep the ellipsis.

In `@openverifiablellm/utils.py`:
- Around line 261-271: The batch-mode XML processing using ET.iterparse assigned
to variable context (inside the with open_func(input_path, "rb") as f: block)
lacks a try/finally to ensure context.close() on errors; update the loop in the
batch branch to wrap the for _, elem in context: iteration in a try/finally and
call context.close() in the finally block (while keeping existing logic that
finds text_elem, calls clean_wikitext(text_elem.text), writes to output_path,
and calls elem.clear()) so the iterparse iterator is always cleaned up on
exceptions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 96f9f971-f64e-49b3-b112-98e5b02e8868

📥 Commits

Reviewing files that changed from the base of the PR and between 900c92d and fae3413.

📒 Files selected for processing (4)
  • openverifiablellm/benchmark.py
  • openverifiablellm/incremental_merkle.py
  • openverifiablellm/utils.py
  • tests/test_merkle.py

"|-------------------------------|--------------------:|--------------------:|--------------------|",
f"| Wall-clock time (s) | `{time_old:>10.3f}` | `{time_new:>10.3f}` | **{time_ratio:,.1f}× faster** |",
f"| Peak RAM usage (MB) | `{ram_old:>10.2f}` | `{ram_new:>10.2f}` | **{ram_ratio:,.1f}× less RAM** |",
f"| Root hash | `{hash_old_str[:16]}…` | `{hash_new_str[:16]}…` | {'Match' if hashes_match else 'MISMATCH'} |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Potential display issue with truncated hash string.

When hash_old_str or hash_new_str is "N/A (0 articles)", the slice [:16] produces "N/A (0 articles)" (16 chars) followed by , which could look awkward. Consider handling this case separately for cleaner output.

♻️ Suggested improvement
-        f"| Root hash                     | `{hash_old_str[:16]}…` | `{hash_new_str[:16]}…` | {'Match' if hashes_match else 'MISMATCH'}              |",
+        f"| Root hash                     | `{hash_old_str[:16]}{'…' if len(hash_old_str) > 16 else ''}` | `{hash_new_str[:16]}{'…' if len(hash_new_str) > 16 else ''}` | {'Match' if hashes_match else 'MISMATCH'}              |",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
f"| Root hash | `{hash_old_str[:16]}` | `{hash_new_str[:16]}` | {'Match' if hashes_match else 'MISMATCH'} |",
f"| Root hash | `{hash_old_str[:16]}{'…' if len(hash_old_str) > 16 else ''}` | `{hash_new_str[:16]}{'…' if len(hash_new_str) > 16 else ''}` | {'Match' if hashes_match else 'MISMATCH'} |",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/benchmark.py` at line 237, The f-string that formats the
root-hash row uses slicing on hash_old_str and hash_new_str (see the string with
`{hash_old_str[:16]}…` and `{hash_new_str[:16]}…`) which produces awkward output
when the value is the sentinel "N/A (0 articles)"; change the formatting to
handle that sentinel explicitly: compute display_old = hash_old_str if
hash_old_str.startswith("N/A") else hash_old_str[:16] + "…" (and the same for
display_new), then use `{display_old}` and `{display_new}` in the f-string so
the "N/A (0 articles)" text is shown intact and truncated hashes keep the
ellipsis.

Comment on lines 261 to 271
with open_func(input_path, "rb") as f:
context = ET.iterparse(f, events=("end",))

with open(output_path, "w", encoding="utf-8") as out:
for _, elem in context:
if elem.tag.endswith("page"):
text_elem = elem.find(".//{*}text")

if text_elem is not None and text_elem.text:
cleaned = clean_wikitext(text_elem.text)
if cleaned:
out.write(cleaned + "\n\n")

elem.clear()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Batch mode lacks iterparse cleanup on exceptions.

The streaming mode (lines 228-251) properly wraps the iterparse iteration in a try/finally to call context.close(), but the batch mode (lines 261-271) does not. If an exception occurs during batch processing, the iterparse context may not be properly released.

🛡️ Proposed fix to add cleanup in batch mode
     with open_func(input_path, "rb") as f:
         context = ET.iterparse(f, events=("end",))
-        with open(output_path, "w", encoding="utf-8") as out:
-            for _, elem in context:
-                if elem.tag.endswith("page"):
-                    text_elem = elem.find(".//{*}text")
-                    if text_elem is not None and text_elem.text:
-                        cleaned = clean_wikitext(text_elem.text)
-                        if cleaned:
-                            out.write(cleaned + "\n\n")
-                    elem.clear()
+        try:
+            with open(output_path, "w", encoding="utf-8") as out:
+                for _, elem in context:
+                    if elem.tag.endswith("page"):
+                        text_elem = elem.find(".//{*}text")
+                        if text_elem is not None and text_elem.text:
+                            cleaned = clean_wikitext(text_elem.text)
+                            if cleaned:
+                                out.write(cleaned + "\n\n")
+                        elem.clear()
+        finally:
+            try:
+                context.close()
+            except AttributeError:
+                pass
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with open_func(input_path, "rb") as f:
context = ET.iterparse(f, events=("end",))
with open(output_path, "w", encoding="utf-8") as out:
for _, elem in context:
if elem.tag.endswith("page"):
text_elem = elem.find(".//{*}text")
if text_elem is not None and text_elem.text:
cleaned = clean_wikitext(text_elem.text)
if cleaned:
out.write(cleaned + "\n\n")
elem.clear()
with open_func(input_path, "rb") as f:
context = ET.iterparse(f, events=("end",))
try:
with open(output_path, "w", encoding="utf-8") as out:
for _, elem in context:
if elem.tag.endswith("page"):
text_elem = elem.find(".//{*}text")
if text_elem is not None and text_elem.text:
cleaned = clean_wikitext(text_elem.text)
if cleaned:
out.write(cleaned + "\n\n")
elem.clear()
finally:
try:
context.close()
except AttributeError:
pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 261 - 271, The batch-mode XML
processing using ET.iterparse assigned to variable context (inside the with
open_func(input_path, "rb") as f: block) lacks a try/finally to ensure
context.close() on errors; update the loop in the batch branch to wrap the for
_, elem in context: iteration in a try/finally and call context.close() in the
finally block (while keeping existing logic that finds text_elem, calls
clean_wikitext(text_elem.text), writes to output_path, and calls elem.clear())
so the iterparse iterator is always cleaned up on exceptions.

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 12, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)

179-196: ⚠️ Potential issue | 🔴 Critical

This proof-building loop never shrinks the tree.

Each pair currently produces two parent nodes, so leaves does not reduce toward a single root. That yields invalid proof steps and can trap the loop at a constant level size.

🐛 Minimal fix
         next_level: List[bytes] = []
         for i in range(0, len(leaves), 2):
             combined = leaves[i] + leaves[i + 1]
-
-            next_level.append(bytes.fromhex(_sha256_hex(combined)))
-
-            parent_bytes = compute_sha256_bytes(data=combined)
-            next_level.append(parent_bytes)
+            next_level.append(compute_sha256_bytes(data=combined))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 179 - 196, The loop building the
Merkle tree never shrinks because each leaf pair appends two parent nodes; fix
the loop in the proof construction by producing exactly one parent per pair (use
a single consistent hash function—either bytes.fromhex(_sha256_hex(combined)) or
compute_sha256_bytes(data=combined) — not both) when populating next_level, and
after collapsing the level update index = index // 2 (or equivalent) before the
next iteration so sibling_index/index calculations remain correct; references:
leaves, proof, sibling_index, _sha256_hex, compute_sha256_bytes.
♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)

327-337: ⚠️ Potential issue | 🟡 Minor

Restore iterparse cleanup in batch mode.

This reintroduces the earlier cleanup gap: if clean_wikitext() or out.write() raises, context.close() is skipped. The streaming branch already protects this with try/finally; batch mode should mirror it.

🛡️ Minimal fix
     with open_func(input_path, "rb") as f:
         context = ET.iterparse(f, events=("end",))
-        with open(output_path, "w", encoding="utf-8") as out:
-            for _, elem in context:
-                if elem.tag.endswith("page"):
-                    text_elem = elem.find(".//{*}text")
-                    if text_elem is not None and text_elem.text:
-                        cleaned = clean_wikitext(text_elem.text)
-                        if cleaned:
-                            out.write(cleaned + "\n\n")
-                    elem.clear()
+        try:
+            with open(output_path, "w", encoding="utf-8") as out:
+                for _, elem in context:
+                    if elem.tag.endswith("page"):
+                        text_elem = elem.find(".//{*}text")
+                        if text_elem is not None and text_elem.text:
+                            cleaned = clean_wikitext(text_elem.text)
+                            if cleaned:
+                                out.write(cleaned + "\n\n")
+                        elem.clear()
+        finally:
+            try:
+                context.close()
+            except AttributeError:
+                pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 327 - 337, The batch-mode iterparse
loop creates `context = ET.iterparse(f, events=("end",))` but does not guarantee
`context.close()` if `clean_wikitext()` or `out.write()` raises; wrap the `for
_, elem in context:` loop in a try/finally and call `context.close()` in the
finally block so the iterator is always closed (keep the existing `with
open_func(input_path, "rb") as f:` and `with open(output_path, "w",
encoding="utf-8") as out:` blocks as-is); reference `context`, `iterparse`,
`clean_wikitext`, and `out.write` when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 87-105: The code performs two passes over the file and uses a
local variable path outside its scope causing a NameError; remove the second
file-pass block that re-opens path and appends duplicate leaves. Instead build
leaves once by iterating _iter_chunks() and for each chunk append the computed
bytes (use compute_sha256_bytes or bytes.fromhex(_sha256_hex(chunk))
consistently). Delete the second with path.open(...) loop and any references to
path outside _iter_chunks(), leaving a single loop that populates leaves.
- Around line 111-121: The loop in the Merkle building code duplicates each
parent by appending two hashes (one from _sha256_hex and one from
compute_sha256_bytes), causing the tree to never shrink; fix by computing a
single parent hash from combined (left + right) and appending it only once to
next_level (remove the redundant next_level.append call and use either
compute_sha256_bytes or a single consistent hash function instead of both);
change code around variables leaves, next_level, combined, _sha256_hex and
compute_sha256_bytes so each pair produces exactly one parent_bytes appended.
- Around line 151-167: The code incorrectly does two passes and references path
when chunks is provided, causing a NameError and duplicated leaves; fix by
building the leaves in a single pass: if chunks is not None iterate over chunks
and append their hash bytes (use bytes.fromhex(_sha256_hex(chunk)) or
compute_sha256_bytes as appropriate) and do NOT open or read the file; otherwise
open Path(file_path) and read chunk_size once, computing leaf_bytes via
compute_sha256_bytes(data=chunk) and appending to leaves; remove the second with
path.open(...) block to avoid duplicate appends and the invalid reference to
path when chunks is passed.

---

Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 179-196: The loop building the Merkle tree never shrinks because
each leaf pair appends two parent nodes; fix the loop in the proof construction
by producing exactly one parent per pair (use a single consistent hash
function—either bytes.fromhex(_sha256_hex(combined)) or
compute_sha256_bytes(data=combined) — not both) when populating next_level, and
after collapsing the level update index = index // 2 (or equivalent) before the
next iteration so sibling_index/index calculations remain correct; references:
leaves, proof, sibling_index, _sha256_hex, compute_sha256_bytes.

---

Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 327-337: The batch-mode iterparse loop creates `context =
ET.iterparse(f, events=("end",))` but does not guarantee `context.close()` if
`clean_wikitext()` or `out.write()` raises; wrap the `for _, elem in context:`
loop in a try/finally and call `context.close()` in the finally block so the
iterator is always closed (keep the existing `with open_func(input_path, "rb")
as f:` and `with open(output_path, "w", encoding="utf-8") as out:` blocks
as-is); reference `context`, `iterparse`, `clean_wikitext`, and `out.write` when
making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 737c2643-7f91-4366-bd50-71f51d7c8ffe

📥 Commits

Reviewing files that changed from the base of the PR and between fae3413 and 5240f16.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

Comment on lines +87 to 105
def _iter_chunks() -> Iterator[bytes]:
if chunks is not None:
yield from chunks
else:
path = Path(file_path) # type: ignore[arg-type]
with path.open("rb") as f:
while chunk := f.read(chunk_size):
yield chunk


leaves: List[bytes] = []
for chunk in _iter_chunks():
leaves.append(bytes.fromhex(_sha256_hex(chunk)))

with path.open("rb") as f:
while chunk := f.read(chunk_size):
# use new helper to directly get bytes
leaf_bytes = compute_sha256_bytes(data=chunk)
leaves.append(leaf_bytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Remove the stale second file pass.

path is local to _iter_chunks(), so Line 101 raises NameError before any root can be computed. In file mode this block would also re-read the file and duplicate every leaf.

🐛 Minimal fix
     leaves: List[bytes] = []
     for chunk in _iter_chunks():
         leaves.append(bytes.fromhex(_sha256_hex(chunk)))
-
-    with path.open("rb") as f:
-        while chunk := f.read(chunk_size):
-            # use new helper to directly get bytes
-            leaf_bytes = compute_sha256_bytes(data=chunk)
-            leaves.append(leaf_bytes)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _iter_chunks() -> Iterator[bytes]:
if chunks is not None:
yield from chunks
else:
path = Path(file_path) # type: ignore[arg-type]
with path.open("rb") as f:
while chunk := f.read(chunk_size):
yield chunk
leaves: List[bytes] = []
for chunk in _iter_chunks():
leaves.append(bytes.fromhex(_sha256_hex(chunk)))
with path.open("rb") as f:
while chunk := f.read(chunk_size):
# use new helper to directly get bytes
leaf_bytes = compute_sha256_bytes(data=chunk)
leaves.append(leaf_bytes)
def _iter_chunks() -> Iterator[bytes]:
if chunks is not None:
yield from chunks
else:
path = Path(file_path) # type: ignore[arg-type]
with path.open("rb") as f:
while chunk := f.read(chunk_size):
yield chunk
leaves: List[bytes] = []
for chunk in _iter_chunks():
leaves.append(bytes.fromhex(_sha256_hex(chunk)))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 87 - 105, The code performs two
passes over the file and uses a local variable path outside its scope causing a
NameError; remove the second file-pass block that re-opens path and appends
duplicate leaves. Instead build leaves once by iterating _iter_chunks() and for
each chunk append the computed bytes (use compute_sha256_bytes or
bytes.fromhex(_sha256_hex(chunk)) consistently). Delete the second with
path.open(...) loop and any references to path outside _iter_chunks(), leaving a
single loop that populates leaves.

Comment on lines 111 to 121
while len(leaves) > 1:
next_level = []
next_level: List[bytes] = []
for i in range(0, len(leaves), 2):
left = leaves[i]
right = leaves[i + 1] if i + 1 < len(leaves) else left

combined = left + right

next_level.append(bytes.fromhex(_sha256_hex(combined)))

parent_bytes = compute_sha256_bytes(data=combined)
next_level.append(parent_bytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Append each parent hash only once.

This loop adds the same parent twice. With two leaves, the next level still has length 2, so while len(leaves) > 1 never converges and the root is wrong.

🐛 Minimal fix
         for i in range(0, len(leaves), 2):
             left = leaves[i]
             right = leaves[i + 1] if i + 1 < len(leaves) else left
             combined = left + right
-
-            next_level.append(bytes.fromhex(_sha256_hex(combined)))
-
-            parent_bytes = compute_sha256_bytes(data=combined)
-            next_level.append(parent_bytes)
+            next_level.append(compute_sha256_bytes(data=combined))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 111 - 121, The loop in the Merkle
building code duplicates each parent by appending two hashes (one from
_sha256_hex and one from compute_sha256_bytes), causing the tree to never
shrink; fix by computing a single parent hash from combined (left + right) and
appending it only once to next_level (remove the redundant next_level.append
call and use either compute_sha256_bytes or a single consistent hash function
instead of both); change code around variables leaves, next_level, combined,
_sha256_hex and compute_sha256_bytes so each pair produces exactly one
parent_bytes appended.

Comment on lines +151 to 167
leaves: List[bytes] = []


if chunks is not None:
for chunk in chunks:
leaves.append(bytes.fromhex(_sha256_hex(chunk)))
else:
path = Path(file_path) # type: ignore[arg-type]
with path.open("rb") as f:
while chunk := f.read(chunk_size):
leaves.append(bytes.fromhex(_sha256_hex(chunk)))

# Build leaf level
with path.open("rb") as f:
while chunk := f.read(chunk_size):
leaf_bytes = compute_sha256_bytes(data=chunk)
leaves.append(leaf_bytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Build proof leaves in a single pass.

When chunks is used, Line 164 references path before assignment and fails immediately. In file mode, this second read duplicates every leaf, so chunk_index no longer points at the intended proof path.

🐛 Minimal fix
     if chunks is not None:
         for chunk in chunks:
             leaves.append(bytes.fromhex(_sha256_hex(chunk)))
     else:
         path = Path(file_path)  # type: ignore[arg-type]
         with path.open("rb") as f:
             while chunk := f.read(chunk_size):
                 leaves.append(bytes.fromhex(_sha256_hex(chunk)))
-
-    # Build leaf level
-    with path.open("rb") as f:
-        while chunk := f.read(chunk_size):
-            leaf_bytes = compute_sha256_bytes(data=chunk)
-            leaves.append(leaf_bytes)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
leaves: List[bytes] = []
if chunks is not None:
for chunk in chunks:
leaves.append(bytes.fromhex(_sha256_hex(chunk)))
else:
path = Path(file_path) # type: ignore[arg-type]
with path.open("rb") as f:
while chunk := f.read(chunk_size):
leaves.append(bytes.fromhex(_sha256_hex(chunk)))
# Build leaf level
with path.open("rb") as f:
while chunk := f.read(chunk_size):
leaf_bytes = compute_sha256_bytes(data=chunk)
leaves.append(leaf_bytes)
leaves: List[bytes] = []
if chunks is not None:
for chunk in chunks:
leaves.append(bytes.fromhex(_sha256_hex(chunk)))
else:
path = Path(file_path) # type: ignore[arg-type]
with path.open("rb") as f:
while chunk := f.read(chunk_size):
leaves.append(bytes.fromhex(_sha256_hex(chunk)))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 151 - 167, The code incorrectly does
two passes and references path when chunks is provided, causing a NameError and
duplicated leaves; fix by building the leaves in a single pass: if chunks is not
None iterate over chunks and append their hash bytes (use
bytes.fromhex(_sha256_hex(chunk)) or compute_sha256_bytes as appropriate) and do
NOT open or read the file; otherwise open Path(file_path) and read chunk_size
once, computing leaf_bytes via compute_sha256_bytes(data=chunk) and appending to
leaves; remove the second with path.open(...) block to avoid duplicate appends
and the invalid reference to path when chunks is passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE/ARCHITECTURE]: Implement Streaming Data Pipeline and Incremental Merkle Trees for OOM-Free, Scalable Dataset Verification

2 participants