Merged
4 changes: 3 additions & 1 deletion .coderabbit.yaml
@@ -157,6 +157,8 @@ reviews:
- Confirm that the code meets the project's requirements and objectives
- Confirm that copyright years are up-to date whenever a file is changed
- Point out redundant obvious comments that do not add clarity to the code
- Ensure that comments are concise and suggest more concise comment statements if possible
- Discourage usage of verbose comment styles such as NatSpec
Comment on lines +160 to +161
⚠️ Potential issue | 🟡 Minor

Discouraging NatSpec globally conflicts with Solidity best practices.

reviews.instructions applies to all file types. NatSpec is the standard documentation format for Solidity contracts (used by solc --userdoc/--devdoc and block explorers like Etherscan). Discouraging it globally will produce review comments suppressing documentation in .sol files.

Scope this instruction to non-Solidity files, or remove the NatSpec mention:

🔧 Proposed fix
-    - Discourage usage of verbose comment styles such as NatSpec
+    - Discourage usage of verbose comment styles (except NatSpec in Solidity files, where it is standard)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- Ensure that comments are concise and suggest more concise comment statements if possible
- Discourage usage of verbose comment styles such as NatSpec
- Ensure that comments are concise and suggest more concise comment statements if possible
- Discourage usage of verbose comment styles (except NatSpec in Solidity files, where it is standard)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.coderabbit.yaml around lines 160 - 161, the reviews.instructions entry
currently discourages NatSpec globally; update the configuration so NatSpec is
not discouraged for Solidity files by scoping or removing the NatSpec
mention—specifically modify the reviews.instructions rule to either (a) apply
only when file extension is not ".sol" (or when language != "solidity"), or (b)
remove the "NatSpec" line altogether so Solidity's `@dev`/`@notice` style is
preserved; look for the reviews.instructions key in the YAML and change the
wording or add a file-type condition to exclude .sol files from the NatSpec
discouragement.

- Look for code duplication
- Suggest code completions when:
- seeing a TODO comment
@@ -275,4 +277,4 @@ reviews:
- Image optimization (appropriate size and format)
- Proper @2x and @3x variants for different screen densities
- SVG assets are optimized
- Font files are licensed and optimized
- Font files are licensed and optimized
3 changes: 3 additions & 0 deletions .gitignore
@@ -324,8 +324,11 @@ TSWLatexianTemp*
# option is specified. Footnotes are the stored in a file with suffix Notes.bib.
# Uncomment the next line to have this generated file ignored.
#*Notes.bib

data/
Comment on lines +327 to +328
🧹 Nitpick | 🔵 Trivial

Consider anchoring data/ to the repository root and grouping it with Python entries.

Two minor points:

  1. data/ without a leading / matches a directory named data at any depth in the repo. Since the project only uses a root-level data/ directory, use /data/ to limit the pattern to the root and avoid accidentally silencing a nested data/ directory elsewhere.

  2. This entry is currently placed inside the LaTeX auxiliary-files section. Moving it alongside the Python-specific entries (*.egg-info/, __pycache__/, etc.) at the bottom improves readability.

🔧 Proposed fix
-
-data/
 *.egg-info/
 __pycache__/
 *.pyc
 *.pyo
 *.pyd
+/data/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 327 - 328, update the .gitignore entry "data/" to
anchor it to the repository root by changing the pattern to "/data/" and move
this line out of the LaTeX auxiliary-files section into the Python ignores group
(near entries like "*.egg-info/" and "__pycache__/") so the pattern only matches
the top-level data directory and is grouped with related Python entries for
readability.
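The anchoring rule can be illustrated with a small Python model. This is a deliberately simplified sketch of gitignore directory-pattern semantics (plain names only — no wildcards or negation), not Git's actual matcher:

```python
def ignores_dir(pattern: str, rel_dir: str) -> bool:
    """Simplified gitignore model: does a directory pattern match rel_dir?

    A leading "/" anchors the pattern to the repository root; without it,
    a pattern like "data/" matches a directory named "data" at any depth.
    """
    name = pattern.strip("/")
    parts = rel_dir.strip("/").split("/")
    if pattern.startswith("/"):
        # Anchored: only the top-level directory (and its contents) match.
        return parts[0] == name
    # Unanchored: any path component may match.
    return name in parts

# Unanchored "data/" also silences a nested tests/data/ directory...
nested_hit = ignores_dir("data/", "tests/data")
# ...while anchored "/data/" leaves nested data/ directories alone.
root_only = ignores_dir("/data/", "tests/data")
```

Under this model the anchored form still ignores the root-level `data/` while no longer matching `tests/data/`, which is exactly the behavior the review asks for.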

*.egg-info/
__pycache__/
*.pyc
*.pyo
*.pyd
*.bz2
2 changes: 1 addition & 1 deletion README.md
@@ -277,4 +277,4 @@ Thanks a lot for spending your time helping TODO grow. Keep rocking 🥂

[![Contributors](https://contrib.rocks/image?repo=AOSSIE-Org/TODO)](https://github.com/AOSSIE-Org/TODO/graphs/contributors)

© 2025 AOSSIE
© 2025 AOSSIE
23 changes: 23 additions & 0 deletions examples/demo_util.py
@@ -0,0 +1,23 @@
import sys
import logging
from openverifiablellm.utils import extract_text_from_xml

logger = logging.getLogger(__name__)
🧹 Nitpick | 🔵 Trivial

logger is declared but never used in this file.

All logging output comes from inside extract_text_from_xml. Remove the module-level logger declaration.

♻️ Proposed fix
-logger = logging.getLogger(__name__)
-
 if __name__ == "__main__":

And remove the unused import logging line if logger is the only consumer — here it isn't, since logging.basicConfig in the __main__ block still needs the import, so only the logger line should go.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/demo_util.py` at line 8, remove the unused module-level logger
declaration and its import: delete the top-level "logger =
logging.getLogger(__name__)" (and remove the "import logging" line if nothing
else in the file uses logging) since all logging happens inside
extract_text_from_xml; leave extract_text_from_xml unchanged.


"""
Demo for preprocessing pipeline.

Run with:
python -m examples.demo_util examples\sample_wiki.xml.bz2
"""

if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python -m examples.demo_util <input_dump>")
sys.exit(1)

logging.basicConfig(
level=logging.INFO,
format="%(levelname)s - %(message)s"
)
extract_text_from_xml(sys.argv[1])
12 changes: 0 additions & 12 deletions examples/hash_demo.py

This file was deleted.

18 changes: 18 additions & 0 deletions examples/sample_wiki.py
@@ -0,0 +1,18 @@
import bz2

xml_content = """<?xml version="1.0" encoding="UTF-8"?>
<mediawiki>
<page>
<revision>
<text>
Hello <ref>citation</ref> world.
This is [[Python|programming language]]
{{Wikipedia }}is a free online encyclopedia.
</text>
</revision>
</page>
</mediawiki>
"""

with bz2.open("examples/sample_wiki.xml.bz2", "wt", encoding="utf-8") as f:
⚠️ Potential issue | 🟡 Minor

Hardcoded CWD-relative path will break when the script is not run from the project root.

"examples/sample_wiki.xml.bz2" is resolved against wherever Python is invoked. Running python sample_wiki.py from inside examples/ would attempt to create examples/examples/sample_wiki.xml.bz2 and fail, since that directory does not exist.

🔧 Proposed fix
-with bz2.open("examples/sample_wiki.xml.bz2", "wt", encoding="utf-8") as f:
+with bz2.open(Path(__file__).parent / "sample_wiki.xml.bz2", "wt", encoding="utf-8") as f:

Also add from pathlib import Path at the top.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/sample_wiki.py` at line 17, the bz2.open call uses a hardcoded
CWD-relative string "examples/sample_wiki.xml.bz2" which breaks when the script
isn't run from the repo root; import Path from pathlib at the top and construct
a file path relative to the script using Path(__file__).resolve().parent /
"sample_wiki.xml.bz2", then pass that path into bz2.open (it accepts path-like
objects) instead of the hardcoded string so the file is created next to the
script regardless of the current working directory.
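The fix generalizes to any script that writes files next to itself. A minimal, self-contained sketch — a temporary directory stands in for `Path(__file__).resolve().parent` so it runs anywhere:

```python
import bz2
import tempfile
from pathlib import Path

# Stand-in for Path(__file__).resolve().parent — in a real script the
# output lands next to the script, independent of the CWD.
script_dir = Path(tempfile.mkdtemp())
out_path = script_dir / "sample_wiki.xml.bz2"

with bz2.open(out_path, "wt", encoding="utf-8") as f:
    f.write("<mediawiki>demo</mediawiki>")

# Round-trip through bz2 to confirm the file was written where expected.
with bz2.open(out_path, "rt", encoding="utf-8") as f:
    content = f.read()
```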

f.write(xml_content)
3 changes: 0 additions & 3 deletions examples/sample_wiki.txt

This file was deleted.

Binary file added examples/sample_wiki.xml.bz2
Binary file not shown.
31 changes: 0 additions & 31 deletions openverifiablellm/dataset_hash.py

This file was deleted.

156 changes: 156 additions & 0 deletions openverifiablellm/utils.py
@@ -0,0 +1,156 @@
import bz2
import re
import defusedxml.ElementTree as ET
from pathlib import Path
import sys
from typing import Union
import hashlib
import logging
import json
import platform

logger = logging.getLogger(__name__)

# extract clean wikipage from actual wikipage
def extract_text_from_xml(input_path):
"""
Process a compressed Wikipedia XML dump into cleaned plain text.

Each <page> element is parsed, its revision text is extracted,
cleaned using `clean_wikitext()`, and appended to a single
output text file.

The processed output is saved to:
data/processed/wiki_clean.txt

Parameters
----------
input_path : str or Path
Path to the compressed Wikipedia XML (.bz2) dump file.

Output
------
Creates:
data/processed/wiki_clean.txt
"""
input_path = Path(input_path)

# Fixed output path
project_root = Path.cwd()
output_dir = project_root / "data" / "processed"
output_dir.mkdir(parents=True, exist_ok=True)

output_path = output_dir / "wiki_clean.txt"

with bz2.open(input_path, "rb") as f:
context = ET.iterparse(f, events=("end",))

with open(output_path, "w", encoding="utf-8") as out:
for _, elem in context:
if elem.tag.endswith("page"):
text_elem = elem.find(".//{*}text")

if text_elem is not None and text_elem.text:
cleaned = clean_wikitext(text_elem.text)
if cleaned:
out.write(cleaned + "\n\n")

elem.clear()
logger.info("Preprocessing complete. Output saved to %s", output_path)
generate_manifest(input_path,output_path)
Comment on lines +59 to +60
⚠️ Potential issue | 🟡 Minor

Log message emitted before generate_manifest completes.

logger.info("Preprocessing complete. Output saved to %s", ...) at Line 59 fires before generate_manifest at Line 60. If manifest generation fails (e.g., disk full, permission error), the log already claimed success. Move the info log after the manifest call, or split into separate messages.

🔧 Proposed fix
-    logger.info("Preprocessing complete. Output saved to %s", output_path)
-    generate_manifest(input_path,output_path)
+    generate_manifest(input_path, output_path)
+    logger.info("Preprocessing complete. Output: %s", output_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 59 - 60, the info log "Preprocessing
complete. Output saved to %s" is emitted before generate_manifest(input_path,
output_path) finishes, so move or defer that logger.info call until after
generate_manifest returns (or log a separate message before and a success
message after); update the code around generate_manifest(input_path,
output_path) and logger.info to ensure the success log references output_path
only after generate_manifest completes successfully and include an error/info
path if generate_manifest raises an exception.
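The ordering principle — claim success only after every dependent step has finished — can be sketched as below; `finalize` and its manifest write are hypothetical stand-ins, not project API:

```python
import json
import logging
import tempfile
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")
logger = logging.getLogger("demo")

def finalize(output_path: Path) -> Path:
    # Stand-in for generate_manifest: a write that could fail (disk full,
    # permission error), so it must run *before* the success log.
    manifest_path = output_path.with_suffix(".json")
    manifest_path.write_text(json.dumps({"output": output_path.name}), encoding="utf-8")
    # Only now is it safe to claim success.
    logger.info("Preprocessing complete. Output: %s", output_path)
    return manifest_path

out = Path(tempfile.mkdtemp()) / "wiki_clean.txt"
out.write_text("demo", encoding="utf-8")
manifest = finalize(out)
```

If the manifest write raises, the success line is never emitted, so the log can no longer claim completion for a half-finished run.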


# generate data manifest
def generate_manifest(raw_path, processed_path):
raw_path = Path(raw_path)
processed_path = Path(processed_path)

if not processed_path.exists():
raise FileNotFoundError(
f"Processed file not found at {processed_path}. Run preprocessing first."
)
Comment on lines +67 to +70
🧹 Nitpick | 🔵 Trivial

Ruff TRY003: move the exception message into a custom exception class.

The long inline message in raise FileNotFoundError(...) is flagged by the static analyser. Consider a dedicated exception or a shorter inline message.

♻️ Proposed fix
     if not processed_path.exists():
-        raise FileNotFoundError(
-            f"Processed file not found at {processed_path}. Run preprocessing first."
-        )
+        raise FileNotFoundError(f"Processed file not found: {processed_path}")
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 68-70: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 67 - 70, the inline
FileNotFoundError message at the check for processed_path should be moved into a
custom exception type: add a new exception class (e.g., ProcessedFileNotFound)
in the module that accepts the path and constructs the full message, then
replace raise FileNotFoundError(...) with raise
ProcessedFileNotFound(processed_path) so the long message lives in the exception
class (refer to the processed_path check and the new ProcessedFileNotFound
class).
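One way to satisfy TRY003 is a dedicated exception class that builds the message itself; `ProcessedFileNotFoundError` and `require_processed` are hypothetical names for illustration, not part of the module:

```python
from pathlib import Path

class ProcessedFileNotFoundError(FileNotFoundError):
    """Raised when the preprocessed output file is missing."""

    def __init__(self, path: Path) -> None:
        # The long message lives in the exception class, not at the raise site.
        super().__init__(f"Processed file not found at {path}. Run preprocessing first.")
        self.path = path

def require_processed(processed_path: Path) -> None:
    if not processed_path.exists():
        raise ProcessedFileNotFoundError(processed_path)  # short raise site

try:
    require_processed(Path("definitely/missing/wiki_clean.txt"))
    message = ""
except ProcessedFileNotFoundError as exc:
    message = str(exc)
```

Callers that already catch `FileNotFoundError` keep working, since the custom class subclasses it.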


manifest = {
"wikipedia_dump": raw_path.name,
"dump_date": extract_dump_date(raw_path.name),
"raw_sha256": compute_sha256(str(raw_path)),
"processed_sha256": compute_sha256(str(processed_path)),
Comment on lines +75 to +76
🧹 Nitpick | 🔵 Trivial

Unnecessary str() wrapping — compute_sha256 already accepts Union[str, Path].

♻️ Proposed fix
-        "raw_sha256": compute_sha256(str(raw_path)),
-        "processed_sha256": compute_sha256(str(processed_path)),
+        "raw_sha256": compute_sha256(raw_path),
+        "processed_sha256": compute_sha256(processed_path),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 75 - 76, the code is unnecessarily
converting Path objects to strings before calling compute_sha256; since
compute_sha256 accepts Union[str, Path], remove the redundant str() wrappers and
pass raw_path and processed_path directly to compute_sha256 (update the two
occurrences where compute_sha256(str(raw_path)) and
compute_sha256(str(processed_path)) are used).

"preprocessing_version": "v1",
"python_version": platform.python_version()
}
project_root = Path.cwd()
manifest_path = project_root / "data" / "dataset_manifest.json"
manifest_path.parent.mkdir(parents=True, exist_ok=True)

with open(manifest_path, "w") as f:
json.dump(manifest, f, indent=2)
Comment on lines +84 to +85
⚠️ Potential issue | 🟡 Minor

Missing encoding="utf-8" when writing the manifest file.

On Windows where the default locale encoding is not UTF-8, non-ASCII characters in any manifest field (e.g., a dump filename with non-ASCII characters) would silently corrupt the JSON output.

🔧 Proposed fix
-    with open(manifest_path, "w") as f:
+    with open(manifest_path, "w", encoding="utf-8") as f:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 84 - 85, the file write of the JSON
manifest uses open(manifest_path, "w") which on some platforms can use a
non-UTF-8 default; change the open call to explicitly specify UTF-8 encoding
(open(manifest_path, "w", encoding="utf-8")) and keep the json.dump(manifest, f,
indent=2) call (optionally add ensure_ascii=False to json.dump if you want
non-ASCII characters preserved instead of escaped) so manifest and manifest_path
writes are safe on Windows and other non-UTF-8 locales.
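A self-contained sketch of the safe write pattern — explicit `encoding="utf-8"` plus optional `ensure_ascii=False` — writing to a temporary directory rather than the project's data/:

```python
import json
import tempfile
from pathlib import Path

manifest = {
    "wikipedia_dump": "frwiki-20240101-pages-articles.xml.bz2",
    "note": "résumé",  # non-ASCII value a locale-dependent codec could mangle
}

manifest_path = Path(tempfile.mkdtemp()) / "dataset_manifest.json"

# Explicit encoding makes the bytes identical on every platform;
# ensure_ascii=False stores "résumé" literally instead of \u00e9 escapes.
with open(manifest_path, "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2, ensure_ascii=False)

round_tripped = json.loads(manifest_path.read_text(encoding="utf-8"))
```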


logger.info("Manifest written to %s", manifest_path)

# helpers
def compute_sha256(file_path: Union[str, Path]) -> str:
"""
Compute SHA256 hash of a file.

This provides a deterministic fingerprint of the dataset,
enabling reproducibility and verification.

Parameters
----------
file_path : Union[str, Path]
Path to the dataset file (string or Path-like).

Returns
-------
str
SHA256 hash string.
"""
path = Path(file_path)

sha256 = hashlib.sha256()

with path.open("rb") as f:
while chunk := f.read(8192):
sha256.update(chunk)

return sha256.hexdigest()
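The chunked loop above can be checked against a one-shot hash: both yield the same digest, but the chunked form keeps memory bounded for multi-gigabyte dumps. A standalone sketch using a small stand-in file:

```python
import hashlib
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "dump.bin"
path.write_bytes(b"wiki" * 25_000)  # ~100 KB stand-in for a dump file

# Chunked hashing, as in compute_sha256: constant memory per chunk.
sha_chunked = hashlib.sha256()
with path.open("rb") as f:
    while chunk := f.read(8192):
        sha_chunked.update(chunk)

# One-shot hashing: loads the whole file into memory first.
sha_oneshot = hashlib.sha256(path.read_bytes())
```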

def extract_dump_date(filename: str):
🧹 Nitpick | 🔵 Trivial

extract_dump_date is missing a return type annotation.

compute_sha256 (line 90) and clean_wikitext (line 124) both have return type annotations; extract_dump_date should be consistent.

♻️ Proposed fix
-def extract_dump_date(filename: str):
+def extract_dump_date(filename: str) -> str:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` at line 117, the function extract_dump_date lacks
a return type annotation; since the implementation always returns a string
(either a formatted date or the "unknown" sentinel), annotate it as `-> str`,
consistent with the compute_sha256 and clean_wikitext signatures.

parts = filename.split("-")
for part in parts:
if part.isdigit() and len(part) == 8:
return f"{part[:4]}-{part[4:6]}-{part[6:]}"
return "unknown"

def clean_wikitext(text: str) -> str:
"""
Basic deterministic wikitext cleaning.

Note:
This uses simple regex-based rules for speed and consistency.
It does NOT fully parse MediaWiki syntax.

Limitations:
- Deeply nested templates may not be fully removed.
- Some complex <ref /> cases may not be perfectly handled.
- This is not a complete MediaWiki parser.

These limitations are acceptable for lightweight, deterministic preprocessing.
"""
text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)
text = re.sub(r"<ref.*?>.*?</ref>", "", text, flags=re.DOTALL)
text = re.sub(r"<.*?>", "", text)
text = re.sub(r"\[\[.*?\|(.*?)\]\]", r"\1", text)
text = re.sub(r"\[\[(.*?)\]\]", r"\1", text)
text = re.sub(r"\s+", " ", text)
return text.strip()

if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python -m openverifiablellm.utils <input_dump>")
sys.exit(1)

logging.basicConfig(
level=logging.INFO,
format="%(levelname)s - %(message)s"
)
extract_text_from_xml(sys.argv[1])
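To see the regex pipeline and the date parser in action, the sketch below re-implements the same rules verbatim and runs them on text matching the sample dump from examples/sample_wiki.py:

```python
import re

def clean_wikitext(text: str) -> str:
    # Same deterministic rules, in the same order, as the function above.
    text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)        # {{templates}}
    text = re.sub(r"<ref.*?>.*?</ref>", "", text, flags=re.DOTALL)  # <ref>...</ref>
    text = re.sub(r"<.*?>", "", text)                               # leftover tags
    text = re.sub(r"\[\[.*?\|(.*?)\]\]", r"\1", text)               # [[target|label]] -> label
    text = re.sub(r"\[\[(.*?)\]\]", r"\1", text)                    # [[link]] -> link
    text = re.sub(r"\s+", " ", text)                                # collapse whitespace
    return text.strip()

def extract_dump_date(filename: str) -> str:
    # An 8-digit dash-separated token is treated as YYYYMMDD.
    for part in filename.split("-"):
        if part.isdigit() and len(part) == 8:
            return f"{part[:4]}-{part[4:6]}-{part[6:]}"
    return "unknown"

sample = (
    "Hello <ref>citation</ref> world. "
    "This is [[Python|programming language]] "
    "{{Wikipedia }}is a free online encyclopedia."
)
cleaned = clean_wikitext(sample)
date = extract_dump_date("enwiki-20240601-pages-articles.xml.bz2")
```

Note the order matters: templates are stripped before refs and remaining tags, and pipe-links before plain links, so each later rule sees only what earlier rules left behind.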
9 changes: 9 additions & 0 deletions pyproject.toml
@@ -11,6 +11,15 @@ authors = [
]
requires-python = ">=3.9"

dependencies= [
"defusedxml"
]

[project.optional-dependencies]
dev = [
"pytest"
]

[tool.setuptools.packages.find]
include = ["openverifiablellm*"]

36 changes: 0 additions & 36 deletions tests/test_dataset_hash.py

This file was deleted.
