[FEATURE]: Add contamination detection and config loader #60
DhruvK278 wants to merge 4 commits into AOSSIE-Org:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
Walkthrough
Adds a configurable contamination-detection subsystem: resolve benchmarks (CLI / YAML / defaults), load benchmark texts (Hugging Face or local), build/serialize an n-gram Bloom filter, perform two-stage contamination checks (Bloom + exact verify), redact contaminated XML chunks during extraction, and record contamination metadata in dataset manifests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI/Config
    participant Config as load_config()
    participant Fetch as fetch_benchmarks()
    participant HF as HuggingFace Datasets
    participant Local as Local Files
    participant Bloom as build_bloom_filter()
    participant Extract as extract_text_from_xml()
    participant Check as check_contamination()
    participant Manifest as generate_manifest()
    CLI->>Config: load_config(cli_benchmarks)
    Config->>Fetch: provide BenchmarkConfig
    Fetch->>HF: optionally load HF datasets
    Fetch->>Local: optionally load local files (JSONL/CSV)
    Fetch-->>Bloom: return benchmark_texts
    Bloom->>Bloom: build or load cached Bloom filter (serialize & metadata)
    Bloom-->>Extract: provide bloom_filter, benchmarks_used
    Extract->>Check: check_contamination(chunk, bloom_filter)
    Check->>Check: n-gram generation + Bloom lookup
    Check->>Check: exact substring verification against benchmark_texts
    Check-->>Extract: contamination result
    Extract->>Extract: redact contaminated chunks, count redactions
    Extract-->>Manifest: pass contamination metadata (benchmarks_used, redacted_chunks)
    Manifest->>Manifest: write dataset_manifest.json with contamination fields
```
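In code, the two-stage check the diagram describes amounts to: a cheap probabilistic membership test first, exact verification second. A minimal sketch, using a plain Python set as a stand-in for the real `rbloom` Bloom filter (function names mirror the PR's API, but the bodies here are illustrative, not the actual implementation):

```python
def get_ngrams(text, n=13):
    """Whitespace-tokenised n-grams over lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def check_contamination(chunk, bloom, benchmark_texts, n=13):
    """Stage 1: Bloom lookup (may false-positive); stage 2: exact verification."""
    for ngram in get_ngrams(chunk, n):
        if ngram in bloom:                      # probabilistic hit
            for bench in benchmark_texts:
                if ngram in bench.lower():      # exact confirmation
                    return True
    return False

# Build a stand-in "Bloom filter" from benchmark texts (a plain set here)
benchmarks = ["the quick brown fox jumps over the lazy dog"]
bloom = {g for t in benchmarks for g in get_ngrams(t, n=5)}
```

The set never false-positives, so with a real Bloom filter the stage-2 exact check is what prevents spurious redactions.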
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/contamination.py`:
- Around line 153-158: The cached Bloom filter loading uses filter_path and
Bloom.load_bytes(raw, _bloom_hash) without validating the inputs that produced
it, so different runs can reuse a stale filter; update the cache handling to
validate metadata (benchmark sources, config.n, normalization/hash version)
before returning: when saving a Bloom (created in the code paths that call
Bloom.create or Bloom.add), persist a small metadata blob (inputs hash,
config.n, normalization/version) alongside the filter file, and when loading via
the block that uses filter_path and Bloom.load_bytes, read and compare that
metadata to current values and only return the loaded bloom if they
match—otherwise delete or ignore the cache and rebuild the Bloom to ensure
Bloom.load_bytes and _bloom_hash are only used for matching input state.
- Around line 65-69: The current except block around load_dataset silences
dataset/network/auth failures by returning texts; instead propagate the failure:
remove the "return texts" and either re-raise the caught exception (raise) or
wrap and raise a descriptive custom exception (e.g., DatasetLoadError) so the
run fails, or record the per-source failure into the manifest/state before
raising; update the block that references load_dataset, dataset_id,
logger.warning and texts in contamination.py to either propagate the error or
persist a failure entry so callers can handle it rather than getting an empty
corpus.
- Line 66: The call to load_dataset(dataset_id, trust_remote_code=True) uses
untrusted user input and enables remote code execution; remove the hardcoded
trust_remote_code=True and instead gate it behind an explicit opt-in flag (e.g.
a config/CLI boolean like allow_trust_remote_code or trust_remote_code)
defaulting to False, then pass trust_remote_code only when that flag is true;
update the code path that calls load_dataset (reference: load_dataset and
dataset_id in contamination.py) to consult the new flag and ensure users must
explicitly enable it before remote code is allowed.
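The two comments above (propagate load failures; gate `trust_remote_code` behind an opt-in flag) can be sketched together. `DatasetLoadError` and the injectable `loader` parameter are illustrative additions for testability, not part of the PR:

```python
import logging

logger = logging.getLogger(__name__)

class DatasetLoadError(RuntimeError):
    """Raised when a benchmark source cannot be loaded."""

def load_hf_texts(dataset_id, allow_trust_remote_code=False, loader=None):
    """Load benchmark rows without running remote code by default,
    and propagate failures instead of returning an empty corpus."""
    if loader is None:
        from datasets import load_dataset  # lazy import, as in the PR
        loader = load_dataset
    try:
        # remote code only runs when the caller explicitly opted in
        return loader(dataset_id, trust_remote_code=allow_trust_remote_code)
    except Exception as exc:
        logger.warning("Could not load HF dataset '%s': %s", dataset_id, exc)
        raise DatasetLoadError(dataset_id) from exc  # fail closed, not open
```

Failing closed matters here: a silently empty corpus disables contamination checking without any signal in the run's output.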
In `@openverifiablellm/utils.py`:
- Around line 131-136: The manifest generation currently omits contamination
metadata when benchmarks_used is None, making redacted runs indistinguishable
from unchecked runs; update the calls to generate_manifest (and the
generate_manifest implementation if needed) so that when contamination checking
is enabled you always pass a structured contamination/config/result block (e.g.,
a contamination or contamination_metadata argument) even if benchmarks_used is
None, and ensure generate_manifest writes that block to the manifest; apply the
same change to the other call sites noted (the call around lines 169-171) so the
manifest always contains the contamination check config and outcome when
checking is performed.
- Around line 86-90: Current lazy-import logic silently disables contamination
checking when only one of bloom_filter or benchmark_texts is provided; update
the block that sets check_contamination to validate configuration and fail fast:
if exactly one of bloom_filter or benchmark_texts is None (i.e., (bloom_filter
is None) != (benchmark_texts is None)), raise a clear configuration error
(ValueError or a project-specific ConfigurationError) stating both must be
supplied to enable contamination checking, otherwise perform the import from
openverifiablellm.contamination and set check_contamination as before.
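The fail-fast validation could look like the following sketch (the function name is hypothetical; only the XOR check and the lazy import mirror the review's suggestion):

```python
def resolve_contamination_checker(bloom_filter, benchmark_texts):
    """Return a checker, or None when checking is off; fail fast when
    only half of the required inputs were supplied."""
    if (bloom_filter is None) != (benchmark_texts is None):
        raise ValueError(
            "Both bloom_filter and benchmark_texts must be supplied "
            "to enable contamination checking"
        )
    if bloom_filter is None:
        return None  # contamination checking disabled
    from openverifiablellm.contamination import check_contamination  # lazy import
    return check_contamination
```

The XOR condition is the key point: a half-configured caller gets an immediate error rather than a run that silently skips the check.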
In `@pyproject.toml`:
- Around line 14-19: Remove "datasets", "pyyaml", and "rbloom" from the
top-level dependencies array in pyproject.toml and add them under an optional
extra named "contamination" (e.g., project.optional-dependencies.contamination
or tool.poetry.extras.contamination depending on your pyproject format); keep
"defusedxml" in core deps. This aligns with the lazy-imports used around
check_contamination and the conditional use of bloom_filter and benchmark_texts
so users who never enable contamination won't install those packages. Ensure the
extra lists the same package names (add version pins if project uses them) and
run a quick packaging/installation test to confirm optional extra installs when
requested.
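A possible shape for the suggested extra, assuming a PEP 621-style pyproject.toml (version pins omitted; adjust to the project's actual format):

```toml
[project]
dependencies = [
    "defusedxml",
]

[project.optional-dependencies]
contamination = [
    "datasets",
    "pyyaml",
    "rbloom",
]
```

Users would then opt in with `pip install openverifiablellm[contamination]`, matching the lazy imports in the extraction path.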
In `@tests/test_contamination.py`:
- Around line 96-103: The variable bloom1 in test_serialisation_roundtrip is
unused and triggers an F841 lint error; remove the unused binding by replacing
the standalone build_bloom_filter call with either an assertion on its
side-effect (e.g. confirm the file was created or that subsequent
build_bloom_filter(read) returns expected state) or simply call
build_bloom_filter(texts, config) without assigning to bloom1. Update references
in the test to use bloom2 and the existing get_ngrams loop; ensure any necessary
side-effect checks reference the build_bloom_filter invocation (use file
existence or config-backed state) rather than an unused bloom1 variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: af5c1849-6dbc-409e-a594-04eedda3423e
📒 Files selected for processing (6)
openverifiablellm/config.py, openverifiablellm/contamination.py, openverifiablellm/utils.py, pyproject.toml, tests/test_contamination.py, tests/test_util.py
@DhruvK278 Pull from main, fix the conflicts and check again. We are using
Introduce contamination detection pipeline and configuration handling.
- Adds openverifiablellm/config.py to load benchmark lists (CLI, YAML, defaults) and the BenchmarkConfig dataclass.
- Adds openverifiablellm/contamination.py implementing n-gram generation, benchmark loaders (HF and local JSONL/CSV), Bloom filter construction/serialization (rbloom), and a two-stage contamination check (Bloom + exact verification).
- Updates openverifiablellm/utils.py to adjust compute_sha256/compute_merkle_root call signatures, integrate optional contamination checking into extract_text_from_xml, track redacted chunks, and extend generate_manifest to include contamination metadata and Merkle fields.
- Updates pyproject.toml to add runtime deps: datasets, pyyaml, rbloom.
- Adds comprehensive tests (tests/test_contamination.py) for n-grams, Bloom filter behavior, contamination detection, manifest fields, and backward compatibility; adjusts tests/test_util.py to match the renamed compute_sha256 parameter.

Overall goal: enable detection and recording of dataset contamination when preprocessing large corpora.
Force-pushed from 0652703 to 3b8746a
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_util.py (1)
89-104: 🧹 Nitpick | 🔵 Trivial. Drop the duplicated hash assertions.
These lines exercise the exact same path twice and do not add coverage. Keeping a single call/assert per case will make the tests easier to read and less noisy when they fail.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_util.py` around lines 89 - 104, Remove the duplicated redundant calls/assertions in the tests that call utils.compute_sha256 twice for the same inputs: in the first test remove the repeated assignment to actual (duplicate compute_sha256 call) and in test_different_content_different_hash remove the second identical assert; keep a single compute_sha256 invocation and a single assert per case to avoid duplicate coverage and noisy failures (references: utils.compute_sha256, test_different_content_different_hash).
♻️ Duplicate comments (3)
openverifiablellm/contamination.py (2)
153-158: ⚠️ Potential issue | 🟠 Major. Invalidate the Bloom cache when the inputs change.
The shared `filter_path` is reused solely on file existence. Switching benchmark sources, `config.n`, or even normalization logic can therefore load a stale filter and silently misclassify chunks. Persist cache metadata alongside the Bloom bytes and rebuild when it no longer matches the current inputs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/contamination.py` around lines 153 - 158, The Bloom cache is being loaded solely by file existence (filter_path) which can return stale data when inputs change; update the logic around Bloom.load_bytes and _bloom_hash to persist and validate cache metadata (e.g., source id, config.n, normalization/version hash) alongside the raw bytes and, on load, compare the stored metadata against current inputs and normalization parameters and rebuild the Bloom (recompute via the code path that creates a new Bloom) if any mismatch is detected; ensure the metadata is written whenever creating/updating the cache so subsequent runs can validate before returning the cached Bloom.
64-69: ⚠️ Potential issue | 🔴 Critical. Do not execute arbitrary HF dataset code or fail open on load errors.
`BenchmarkConfig.benchmarks` is populated from CLI/YAML input, so `load_dataset(dataset_id, trust_remote_code=True)` lets untrusted dataset IDs run remote code. The broad `except` then converts auth/network/schema failures into `[]`, which silently disables contamination checking for that source. Default `trust_remote_code` to `False` and re-raise after logging instead of returning an empty corpus.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/contamination.py` around lines 64 - 69, The code currently calls load_dataset(dataset_id, trust_remote_code=True) and swallows all exceptions returning an empty texts list; change the call to load_dataset(dataset_id, trust_remote_code=False) to avoid running untrusted remote code, and after catching Exception in the load_dataset block use logger.warning("Could not load HF dataset '%s': %s", dataset_id, exc) (or similar) but re-raise the exception instead of returning texts so callers know loading failed; keep the texts variable initialization and only return it in the normal success path (i.e., remove the early return in the except block) to ensure contamination checks do not silently skip on load errors.
pyproject.toml (1)
14-19: ⚠️ Potential issue | 🔴 Critical. Restore `tokenizers` to the runtime dependencies.
`openverifiablellm/tokenizer/bpe_tokenizer.py` directly imports `ByteLevelBPETokenizer` from the `tokenizers` package, and `openverifiablellm/tokenizer/factory.py` instantiates `BPETokenizer` when `tokenizer_type == "bpe"`. Removing `tokenizers` from `project.dependencies` will cause an `ImportError` at runtime when the BPE tokenizer path is used.
📦 Minimal fix

```diff
 dependencies = [
     "defusedxml",
+    "tokenizers==0.15.2",
     "datasets",
     "pyyaml",
     "rbloom"
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 14 - 19, The pyproject.toml removed the runtime dependency on tokenizers but openverifiablellm/tokenizer/bpe_tokenizer.py imports ByteLevelBPETokenizer and openverifiablellm/tokenizer/factory.py may instantiate BPETokenizer when tokenizer_type == "bpe", causing ImportError at runtime; restore "tokenizers" in the dependencies list in pyproject.toml (add "tokenizers" to the dependencies array) so ByteLevelBPETokenizer and BPETokenizer can be imported/instantiated at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/contamination.py`:
- Around line 113-128: get_ngrams currently allows non-positive n which leads to
producing empty-string n-grams; modify get_ngrams to validate its n argument and
raise a ValueError for n <= 0. In the get_ngrams function add a guard at the
start (before normalization/tokenisation) that checks if n is an integer > 0 and
raises ValueError("n must be a positive integer") or similar to prevent
inserting empty n-grams into downstream functions like build_bloom_filter and
check_contamination.
In `@openverifiablellm/utils.py`:
- Around line 40-46: In the Merkle leaf construction and empty-case return,
remove the duplicated calls to compute_sha256 so each chunk is hashed only once
and the empty case returns a single compute_sha256 result; specifically, in the
block that sets leaf_hex (currently calling compute_sha256 twice) keep only one
call to compute_sha256 when assigning leaf_hex and appending
bytes.fromhex(leaf_hex) to leaves, and in the empty-list branch ensure only one
return compute_sha256(data=b"") remains; references to look for: leaf_hex,
compute_sha256, leaves.
- Around line 295-299: The manifest currently stores only benchmarks_used and
redacted_chunks which is insufficient to reproduce the contamination snapshot;
update the code that builds manifest (where
manifest["contamination_checks_passed"] = benchmarks_used and
manifest["redacted_chunks"] = redacted_chunks) to persist per-source snapshot
metadata: for each entry in benchmarks_used record dataset identifier, dataset
config/split, exact version or commit/tag (for HF datasets use dataset
build/version or datestamp), local file path if used, and a content digest
(e.g., sha256) of the dataset file or snapshot; also attach redaction details
per-source (e.g., redacted_chunks keyed by source) so the manifest contains
source->(version, path, hash, redacted_chunks) enabling exact reproduction.
- Around line 162-168: The module entrypoint still calls extract_text_from_xml
with a write_manifest argument but the function signature removed it, causing a
TypeError; update extract_text_from_xml to accept write_manifest (e.g., add
write_manifest=False to the parameter list) or add **kwargs to the signature so
the CLI call no longer fails, and ensure any write_manifest behavior (if needed)
is handled or ignored inside extract_text_from_xml; reference the
extract_text_from_xml function to locate and change its signature accordingly.
---
Outside diff comments:
In `@tests/test_util.py`:
- Around line 89-104: Remove the duplicated redundant calls/assertions in the
tests that call utils.compute_sha256 twice for the same inputs: in the first
test remove the repeated assignment to actual (duplicate compute_sha256 call)
and in test_different_content_different_hash remove the second identical assert;
keep a single compute_sha256 invocation and a single assert per case to avoid
duplicate coverage and noisy failures (references: utils.compute_sha256,
test_different_content_different_hash).
---
Duplicate comments:
In `@openverifiablellm/contamination.py`:
- Around line 153-158: The Bloom cache is being loaded solely by file existence
(filter_path) which can return stale data when inputs change; update the logic
around Bloom.load_bytes and _bloom_hash to persist and validate cache metadata
(e.g., source id, config.n, normalization/version hash) alongside the raw bytes
and, on load, compare the stored metadata against current inputs and
normalization parameters and rebuild the Bloom (recompute via the code path that
creates a new Bloom) if any mismatch is detected; ensure the metadata is written
whenever creating/updating the cache so subsequent runs can validate before
returning the cached Bloom.
- Around line 64-69: The code currently calls load_dataset(dataset_id,
trust_remote_code=True) and swallows all exceptions returning an empty texts
list; change the call to load_dataset(dataset_id, trust_remote_code=False) to
avoid running untrusted remote code, and after catching Exception in the
load_dataset block use logger.warning("Could not load HF dataset '%s': %s",
dataset_id, exc) (or similar) but re-raise the exception instead of returning
texts so callers know loading failed; keep the texts variable initialization and
only return it in the normal success path (i.e., remove the early return in the
except block) to ensure contamination checks do not silently skip on load
errors.
In `@pyproject.toml`:
- Around line 14-19: The pyproject.toml removed the runtime dependency on
tokenizers but openverifiablellm/tokenizer/bpe_tokenizer.py imports
ByteLevelBPETokenizer and openverifiablellm/tokenizer/factory.py may instantiate
BPETokenizer when tokenizer_type == "bpe", causing ImportError at runtime;
restore "tokenizers" in the dependencies list in pyproject.toml (add
"tokenizers" to the dependencies array) so ByteLevelBPETokenizer and
BPETokenizer can be imported/instantiated at runtime.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0703ff93-109e-4739-8fed-c98c72ea29bc
📒 Files selected for processing (6)
openverifiablellm/config.py, openverifiablellm/contamination.py, openverifiablellm/utils.py, pyproject.toml, tests/test_contamination.py, tests/test_util.py
```python
leaf_hex = compute_sha256(data=chunk)
leaf_hex = compute_sha256(data=chunk)
leaves.append(bytes.fromhex(leaf_hex))

if not leaves:
    return compute_sha256(data=b"")
    return compute_sha256(data=b"")
```
🧹 Nitpick | 🔵 Trivial
Remove the duplicated Merkle hash calls.
The second compute_sha256(...) on Lines 41 and 56 is redundant, and the second return compute_sha256(data=b"") on Line 46 is unreachable. Output stays the same, but large dumps now pay the hashing cost twice.
Also applies to: 55-56
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 40 - 46, In the Merkle leaf
construction and empty-case return, remove the duplicated calls to
compute_sha256 so each chunk is hashed only once and the empty case returns a
single compute_sha256 result; specifically, in the block that sets leaf_hex
(currently calling compute_sha256 twice) keep only one call to compute_sha256
when assigning leaf_hex and appending bytes.fromhex(leaf_hex) to leaves, and in
the empty-list branch ensure only one return compute_sha256(data=b"") remains;
references to look for: leaf_hex, compute_sha256, leaves.
```python
# --- contamination metadata ---
if benchmarks_used is not None:
    manifest["contamination_checks_passed"] = benchmarks_used
    manifest["redacted_chunks"] = redacted_chunks
```
Persist benchmark snapshot metadata, not just names.
["gsm8k", "cais/mmlu"] plus redacted_chunks is not enough to reproduce the contamination corpus later. HF datasets can move, and local files need at least a path/hash. Without per-source revision or content digest data, the manifest cannot prove which benchmark snapshot was actually checked.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 295 - 299, The manifest currently
stores only benchmarks_used and redacted_chunks which is insufficient to
reproduce the contamination snapshot; update the code that builds manifest
(where manifest["contamination_checks_passed"] = benchmarks_used and
manifest["redacted_chunks"] = redacted_chunks) to persist per-source snapshot
metadata: for each entry in benchmarks_used record dataset identifier, dataset
config/split, exact version or commit/tag (for HF datasets use dataset
build/version or datestamp), local file path if used, and a content digest
(e.g., sha256) of the dataset file or snapshot; also attach redaction details
per-source (e.g., redacted_chunks keyed by source) so the manifest contains
source->(version, path, hash, redacted_chunks) enabling exact reproduction.
Not the final push. Adding the required changes.
Actionable comments posted: 3
♻️ Duplicate comments (3)
openverifiablellm/contamination.py (1)
118-133: ⚠️ Potential issue | 🟠 Major. Reject non-positive n-gram sizes.
`n <= 0` still generates empty-string n-grams here. Once `build_bloom_filter()` inserts `""`, `check_contamination()` can start flagging unrelated text because `"" in any_string` is always true. Guard this before normalisation/tokenisation.
Proposed fix

```diff
 def get_ngrams(text: str, n: int = 13) -> List[str]:
 @@
-    normalised = _normalise(text)
+    if n <= 0:
+        raise ValueError("n must be a positive integer")
+
+    normalised = _normalise(text)
     tokens = normalised.split()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/contamination.py` around lines 118 - 133, get_ngrams currently accepts n <= 0 and will generate empty-string n-grams which then pollute bloom filters; add an explicit guard at the top of get_ngrams to reject non-positive n (e.g. raise ValueError when n <= 0) before any normalization/tokenisation, so build_bloom_filter and check_contamination never see an empty-string n-gram; reference the get_ngrams function and ensure callers (e.g. build_bloom_filter / check_contamination paths) will receive the error rather than inserting "" into the filter.
openverifiablellm/utils.py (2)
40-46: 🧹 Nitpick | 🔵 Trivial. Remove the repeated SHA-256 work in Merkle construction.
Lines 40/41 and 55/56 hash the same bytes twice, and Line 46 is unreachable after the first return. The result is unchanged, but large dumps now pay the hashing cost twice for every leaf and parent.
Proposed fix

```diff
 while chunk := f.read(chunk_size):
     # reuse compute_sha256
     leaf_hex = compute_sha256(data=chunk)
-    leaf_hex = compute_sha256(data=chunk)
     leaves.append(bytes.fromhex(leaf_hex))
@@
 if not leaves:
     return compute_sha256(data=b"")
-    return compute_sha256(data=b"")
@@
 combined = left + right
 parent_hex = compute_sha256(data=combined)
-parent_hex = compute_sha256(data=combined)
 next_level.append(bytes.fromhex(parent_hex))
```

Also applies to: 55-56
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 40 - 46, The Merkle construction is performing duplicate SHA-256 work and has an unreachable return: call compute_sha256 exactly once per input chunk (replace the two identical calls that produce leaf_hex with a single call and append bytes.fromhex(leaf_hex) once), and remove the duplicated return so the empty-tree case returns compute_sha256(data=b"") only once; apply the same deduping fix where parent hashes are computed (ensure each parent node is hashed a single time via compute_sha256 and no duplicated calls remain).
263-270: ⚠️ Potential issue | 🟠 Major. Persist benchmark snapshot metadata, not just names.
The manifest payload built here only carries `benchmarks_used` and `redacted_chunks`. That is still not enough to reproduce which benchmark snapshot was checked later, especially for mutable Hugging Face datasets or local files. Please include per-source revision/path/content digest in `contamination_metadata`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 263 - 270, The current contamination_metadata only stores benchmark names; change it to persist per-benchmark snapshot metadata (not just names) so the check is reproducible: when building contamination_metadata in the check_contamination branch, replace benchmarks_used list of strings with a list of dicts containing unique snapshot fields (e.g., "name"/"id", "source" (hf/local), "path" or "config", "revision"/"commit" if available, and a content_digest/hash for the snapshot or file); compute content_digest for local files (e.g., sha256 of file bytes) and capture dataset revision/sha or dataset fingerprint for HuggingFace datasets, and ensure redacted_chunks remains included; update any callers that expect benchmarks_used to handle the new list-of-dicts structure (look for uses of contamination_metadata and the variable benchmarks_used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/config.py`:
- Around line 50-55: The parsing currently accepts an empty benchmarks list or
entries that are only whitespace/comma, leading to an empty effective filter;
update the parsing of the "benchmarks" value (the local variable benchmarks in
openverifiablellm/config.py) to strip and filter out empty strings (e.g.,
split/trim multi-value CLI input and remove items where not b.strip()) and
require that at least one non-empty benchmark remains; if none remain, either
raise a ValueError or fall back to the default benchmark list. Apply the same
validation/fallback logic to the other benchmarks-handling block referenced
around lines 71-76 so both code paths reject/handle empty selections
consistently.
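A sketch of the suggested parsing and fallback behaviour (`DEFAULT_BENCHMARKS` and the function name are placeholders, not the project's real defaults):

```python
DEFAULT_BENCHMARKS = ["gsm8k", "cais/mmlu"]  # hypothetical default list

def parse_benchmarks(raw, defaults=DEFAULT_BENCHMARKS):
    """Trim multi-value CLI/YAML input, drop empty entries, and fall back
    to the defaults when nothing usable remains."""
    if raw is None:
        return list(defaults)
    if isinstance(raw, str):
        raw = raw.split(",")
    cleaned = [b.strip() for b in raw if b and b.strip()]
    return cleaned if cleaned else list(defaults)
```

Whether empty input should fall back or raise is a policy choice; the comment above allows either, and the sketch shows the fallback variant.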
In `@openverifiablellm/contamination.py`:
- Around line 39-45: The loop over config.benchmarks currently falls back to
_load_hf when a path has a local suffix but the file is missing; change the
logic in the for benchmark in config.benchmarks loop to treat a benchmark that
ends with .jsonl or .csv as a required local file: use Path(benchmark) and if
path.suffix in (".jsonl", ".csv") then if not path.is_file() raise a clear
FileNotFoundError (or ValueError) referencing the benchmark string instead of
calling _load_hf, otherwise call _load_local(path); only call _load_hf for
entries that do not have a local suffix.
- Around line 243-250: The current exact-verification step uses substring
matching ("if ngram in nb") which allows false positives across token
boundaries; change it to compare whole-token n-grams by tokenizing each
normalised benchmark (use the same tokenization used to build
`ngrams`/`_normalise`) and check whether any contiguous token window equals the
`ngram` (or alternatively pad both `nb` and `ngram` with explicit boundary
markers before doing `in`); update the loop in the verification code that
iterates over `ngrams`, `normalised_benchmarks`, and `nb` to perform token-level
comparison (e.g., split `nb` into tokens and compare `"
".join(tokens[i:i+len_ngram]) == ngram`) so matches respect token boundaries.
---
Duplicate comments:
In `@openverifiablellm/contamination.py`:
- Around line 118-133: get_ngrams currently accepts n <= 0 and will generate
empty-string n-grams which then pollute bloom filters; add an explicit guard at
the top of get_ngrams to reject non-positive n (e.g. raise ValueError when n <=
0) before any normalization/tokenisation, so build_bloom_filter and
check_contamination never see an empty-string n-gram; reference the get_ngrams
function and ensure callers (e.g. build_bloom_filter / check_contamination
paths) will receive the error rather than inserting "" into the filter.
In `@openverifiablellm/utils.py`:
- Around line 40-46: The Merkle construction is performing duplicate SHA-256
work and has an unreachable return: call compute_sha256 exactly once per input
chunk (replace the two identical calls that produce leaf_hex with a single call
and append bytes.fromhex(leaf_hex) once), and remove the duplicated return so
the empty-tree case returns compute_sha256(data=b"") only once; apply the same
deduping fix where parent hashes are computed (ensure each parent node is hashed
a single time via compute_sha256 and no duplicated calls remain).
- Around line 263-270: The current contamination_metadata only stores benchmark
names; change it to persist per-benchmark snapshot metadata (not just names) so
the check is reproducible: when building contamination_metadata in the
check_contamination branch, replace benchmarks_used list of strings with a list
of dicts containing unique snapshot fields (e.g., "name"/"id", "source"
(hf/local), "path" or "config", "revision"/"commit" if available, and a
content_digest/hash for the snapshot or file); compute content_digest for local
files (e.g., sha256 of file bytes) and capture dataset revision/sha or dataset
fingerprint for HuggingFace datasets, and ensure redacted_chunks remains
included; update any callers that expect benchmarks_used to handle the new
list-of-dicts structure (look for uses of contamination_metadata and the
variable benchmarks_used).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e6d5f329-2beb-4897-a11c-e00dd2ddaefe
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
openverifiablellm/config.py, openverifiablellm/contamination.py, openverifiablellm/utils.py, pyproject.toml, tests/test_contamination.py
Actionable comments posted: 2
♻️ Duplicate comments (2)
openverifiablellm/contamination.py (2)
120-135: ⚠️ Potential issue | 🟠 Major. Reject non-positive `n` values.
With `n <= 0`, this starts generating empty-string windows. Once `build_bloom_filter()` inserts `""`, later checks can report contamination on unrelated text.
Suggested fix

```diff
 def get_ngrams(text: str, n: int = 13) -> List[str]:
     """
     Generate whitespace-tokenised n-grams from text.
 @@
     """
+    if n <= 0:
+        raise ValueError("n must be a positive integer")
+
     normalised = _normalise(text)
     tokens = normalised.split()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/contamination.py` around lines 120 - 135, In get_ngrams, reject non-positive n values by validating the n parameter at the start of the function (in function get_ngrams) and raising a ValueError for n <= 0; keep the existing normalization/tokenisation logic and only change to short-circuit invalid n before splitting/producing n-grams so build_bloom_filter and downstream contamination checks never receive the empty-string n-gram.
161-173: ⚠️ Potential issue | 🟠 Major: Encode text boundaries in inputs_hash.
Hashing the benchmark texts back-to-back makes different corpora share the same cache key, e.g. ["ab", "c"] and ["a", "bc"]. That can silently reuse a stale Bloom cache after row-splitting or extraction changes.
Suggested fix:
     hasher = hashlib.sha256()
-    for text in benchmark_texts:
-        hasher.update(text.encode("utf-8"))
+    hasher.update(len(benchmark_texts).to_bytes(8, "little"))
+    for text in benchmark_texts:
+        encoded = text.encode("utf-8")
+        hasher.update(len(encoded).to_bytes(8, "little"))
+        hasher.update(encoded)
     inputs_hash = hasher.hexdigest()
🤖 Prompt for AI Agents
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/config.py`:
- Around line 47-63: The code currently only catches yaml.YAMLError while
opening/reading the benchmarks file, so OSError from yaml_path.open(...) will
escape; update the exception handling around the yaml_path.open/ safe_load block
(the try that surrounds yaml_path.open and yaml.safe_load) to catch both
yaml.YAMLError and OSError (e.g., except (yaml.YAMLError, OSError) as exc) and
adjust the logger.warning call to reflect "failed to read/parse %s: %s" (or
similar) so unreadable files are treated the same as malformed YAML and the
function falls back to defaults and returns None.
In `@openverifiablellm/contamination.py`:
- Around line 200-214: The code currently writes a Bloom filter and metadata
even when ngram_count remains 0 (no extractable n-grams from
benchmark_texts/get_ngrams with config.n), producing a useless cache; change the
logic after the n-gram loop to check if ngram_count == 0 and, in that case, do
not write filter_path or meta_path, instead log a clear warning/error (via
logger) and return or raise an exception so callers know no valid filter was
produced; update references around ngram_count, benchmark_texts, get_ngrams,
bloom, filter_path, meta_path, and current_meta to implement this guard.
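A hedged sketch of that guard, assuming a callable stands in for the code that writes `filter_path` and `meta_path`; `EmptyFilterError` and `write_filter_if_nonempty` are illustrative names, not the module's real API.

```python
import logging

logger = logging.getLogger("contamination")


class EmptyFilterError(RuntimeError):
    """Raised when the benchmark corpus yields no n-grams."""


def write_filter_if_nonempty(ngram_count: int, write_cache) -> None:
    """Only persist the Bloom cache when it actually holds n-grams.

    write_cache stands in for the code that writes filter_path and
    meta_path; it is never called when ngram_count is zero.
    """
    if ngram_count == 0:
        logger.warning(
            "No n-grams extracted from benchmark texts; "
            "not writing Bloom filter cache"
        )
        raise EmptyFilterError("benchmark corpus produced no n-grams")
    write_cache()
```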
---
Duplicate comments:
In `@openverifiablellm/contamination.py`:
- Around line 120-135: In get_ngrams, reject non-positive n values by validating
the n parameter at the start of the function (in function get_ngrams) and
raising a ValueError for n <= 0; keep the existing normalization/tokenisation
logic and only change to short-circuit invalid n before splitting/producing
n-grams so build_bloom_filter and downstream contamination checks never receive
the empty-string n-gram.
- Around line 161-173: The inputs_hash currently concatenates benchmark_texts
without boundaries (see hasher and inputs_hash), causing collisions like
["ab","c"] vs ["a","bc"]; fix by encoding text boundaries before updating the
hasher in the loop inside contamination.py (where
hasher.update(text.encode("utf-8")) is called) — e.g., update with a stable
separator or length-prefix for each text (encode and include a delimiter or the
length for each element) so the sequence/order and boundaries are unambiguous
when computing inputs_hash used in current_meta.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a21ea797-6740-40c6-b462-fb836e68da83
📒 Files selected for processing (2)
openverifiablellm/config.py, openverifiablellm/contamination.py
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
openverifiablellm/contamination.py (1)
120-135: ⚠️ Potential issue | 🟠 Major: Reject non-positive n-gram sizes.
The function does not validate that n is positive. With n <= 0, the function would generate malformed n-grams (empty strings for n=0, or excessive n-grams for negative values), which could corrupt the Bloom filter.
🛡️ Proposed fix:
 def get_ngrams(text: str, n: int = 13) -> List[str]:
     """
     Generate whitespace-tokenised n-grams from text.

     The text is normalised (lowercased, punctuation stripped) before
     tokenisation. If the text contains fewer than *n* tokens the result
     is an empty list.
     """
+    if n <= 0:
+        raise ValueError("n must be a positive integer")
+
     normalised = _normalise(text)
     tokens = normalised.split()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/contamination.py` around lines 120 - 135, get_ngrams currently accepts non-positive n and can produce malformed results; add validation at the start of get_ngrams to raise a ValueError for n <= 0, e.g., check the parameter and raise ValueError("n must be a positive integer") before calling _normalise; keep the rest of the logic (normalised = _normalise(text), tokens = normalised.split(), length check and list comprehension) unchanged so callers of get_ngrams are protected from zero/negative n values.
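Putting the repeated suggestion together, a self-contained sketch of `get_ngrams` with the validation in place. The `_normalise` regex here is an assumption about what "lowercased, punctuation stripped" means, not the module's exact implementation.

```python
import re
from typing import List


def _normalise(text: str) -> str:
    """Lowercase and strip punctuation (assumed normalisation)."""
    return re.sub(r"[^\w\s]", "", text.lower())


def get_ngrams(text: str, n: int = 13) -> List[str]:
    """Whitespace-tokenised n-grams, rejecting non-positive n up front."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    tokens = _normalise(text).split()
    if len(tokens) < n:
        return []
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```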
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/config.py`:
- Around line 31-43: Update the _load_from_yaml function docstring Returns
section to explicitly state the return type and semantics: that it returns
Optional[List[str]] — a list of benchmark identifiers (e.g., "gsm8k",
"cais/mmlu", or local file paths) when the YAML is present and valid, or None
when the file is missing, unreadable, or does not contain a "benchmarks" list;
include the return type annotation and a brief note about when None is returned.
In `@tests/test_contamination.py`:
- Around line 19-60: Add a new test method named test_invalid_n_raises to
TestGetNgrams that calls get_ngrams with non-positive n values and asserts it
raises ValueError (e.g., use pytest.raises for n=0 and n=-1); this ensures
get_ngrams validates the n parameter and fails for invalid inputs.
---
Duplicate comments:
In `@openverifiablellm/contamination.py`:
- Around line 120-135: get_ngrams currently accepts non-positive n and can
produce malformed results; add validation at the start of get_ngrams to raise a
ValueError for n <= 0, e.g., check the parameter and raise ValueError("n must be
a positive integer") before calling _normalise; keep the rest of the logic
(normalised = _normalise(text), tokens = normalised.split(), length check and
list comprehension) unchanged so callers of get_ngrams are protected from
zero/negative n values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e7ba798d-a3ec-4285-9a19-56f2da618843
📒 Files selected for processing (3)
openverifiablellm/config.py, openverifiablellm/contamination.py, tests/test_contamination.py
Force-pushed b2a76fe to 15c6e2a
|
@Archit381 Please review
✅ Actions performed: Review triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/contamination.py`:
- Around line 251-259: The hot path inside check_contamination rebuilds
normalised_benchmarks on every Bloom hit (see normalised_benchmarks, _normalise,
benchmark_texts), causing repeated O(number of benchmarks) work; precompute the
set of normalized benchmark texts once (e.g., in the preprocessing entry point
that calls extract_text_from_xml) and change check_contamination to accept a
precomputed_normalised_benchmarks parameter instead of constructing it inline,
then use that set for the inner exact-match loop to avoid rebuilding it on each
invocation.
- Around line 164-167: The current inputs_hash is computed by concatenating raw
benchmark_texts bytes which can collide across different segmentations; update
the hashing in the function that computes inputs_hash (use the hasher variable
near the loop over benchmark_texts) to include an unambiguous delimiter or
length-prefix for each text (e.g., write each text length then the text or
serialize each entry with a structured encoding) before hasher.update, so the
resulting inputs_hash cannot be forged by different concatenations of the same
bytes.
- Around line 72-84: The current load in contamination.py uses
load_dataset(dataset_id, trust_remote_code=trust_remote_code) which returns all
splits and pulls non-evaluation data into texts; change this to load only
evaluation splits (e.g., pass split="test" or "validation" as appropriate) or
make the split selection configurable via a new parameter (e.g.,
evaluation_splits or split_map) so that GSM8K uses "test" and MMLU uses
"test"/"validation"; adjust the call site that constructs ds (still handling
exceptions and raising DatasetLoadError) and then iterate over only those chosen
splits to populate texts (respecting _TEXT_FIELDS and the existing texts append
logic).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 674750de-288e-4aa4-83eb-9d730e04de24
📒 Files selected for processing (3)
openverifiablellm/config.py, openverifiablellm/contamination.py, tests/test_contamination.py
try:
    ds = load_dataset(dataset_id, trust_remote_code=trust_remote_code)
except Exception as exc:
    logger.error("Could not load HF dataset '%s': %s", dataset_id, exc)
    raise DatasetLoadError(f"Failed to load HF dataset '{dataset_id}'") from exc

splits = ds.values() if hasattr(ds, "values") else [ds]
for split in splits:
    for row in split:
        for field in _TEXT_FIELDS:
            if field in row and row[field]:
                texts.append(str(row[field]))
                break
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
For Hugging Face datasets, what does load_dataset("<dataset_id>") return when split is omitted, and what splits are available for the gsm8k and cais/mmlu datasets?
💡 Result:
- What load_dataset("<dataset_id>") returns when split is omitted: a datasets.DatasetDict containing all available splits (a dict mapping split names like "train" and "test" to datasets.Dataset objects). [1][2]
- openai/gsm8k available splits: train and test (for both configs, main and socratic). [3]
- cais/mmlu available splits: auxiliary_train, dev, validation, test. [4]
Restrict HF dataset loading to evaluation splits.
load_dataset(dataset_id) without a split parameter returns all available splits. This means GSM8K and MMLU benchmarks pull non-evaluation data (e.g., train splits for GSM8K; auxiliary_train, dev for MMLU) into the contamination corpus. Processing non-eval splits contradicts the feature goal of detecting contamination against evaluation benchmarks and results in false redactions. Restrict to explicit evaluation splits (test for GSM8K; test/validation for MMLU) or make the split selection configurable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/contamination.py` around lines 72 - 84, The current load in
contamination.py uses load_dataset(dataset_id,
trust_remote_code=trust_remote_code) which returns all splits and pulls
non-evaluation data into texts; change this to load only evaluation splits
(e.g., pass split="test" or "validation" as appropriate) or make the split
selection configurable via a new parameter (e.g., evaluation_splits or
split_map) so that GSM8K uses "test" and MMLU uses "test"/"validation"; adjust
the call site that constructs ds (still handling exceptions and raising
DatasetLoadError) and then iterate over only those chosen splits to populate
texts (respecting _TEXT_FIELDS and the existing texts append logic).
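One way to implement the configurable split selection the comment asks for. The `DEFAULT_EVAL_SPLITS` mapping and `eval_splits_for` helper are assumptions sketched from the review; the GSM8K/MMLU defaults come from the split listing in this thread.

```python
from typing import Dict, List

# Assumed defaults: GSM8K evaluates on "test"; MMLU on "test"/"validation".
DEFAULT_EVAL_SPLITS: Dict[str, List[str]] = {
    "gsm8k": ["test"],
    "cais/mmlu": ["test", "validation"],
}


def eval_splits_for(dataset_id: str,
                    split_map: Dict[str, List[str]] = DEFAULT_EVAL_SPLITS) -> List[str]:
    """Return only the evaluation splits to load for a benchmark."""
    return list(split_map.get(dataset_id, ["test"]))


# Usage with Hugging Face datasets (requires network access):
# from datasets import load_dataset
# for split in eval_splits_for("gsm8k"):
#     ds = load_dataset("gsm8k", "main", split=split)
```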
hasher = hashlib.sha256()
for text in benchmark_texts:
    hasher.update(text.encode("utf-8"))
inputs_hash = hasher.hexdigest()
There was a problem hiding this comment.
Make inputs_hash unambiguous across benchmark boundaries.
This digest is built from raw text bytes back-to-back, so different corpora can collide here (["ab", "c"] and ["a", "bc"] hash identically). That means Line 184 can accept a Bloom cache built from different benchmark contents. Hash a structured encoding or length-prefix each text before updating the digest.
Proposed fix
- hasher = hashlib.sha256()
- for text in benchmark_texts:
-     hasher.update(text.encode("utf-8"))
- inputs_hash = hasher.hexdigest()
+ payload = json.dumps(
+     benchmark_texts,
+     ensure_ascii=False,
+     separators=(",", ":"),
+ )
+ inputs_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/contamination.py` around lines 164 - 167, The current
inputs_hash is computed by concatenating raw benchmark_texts bytes which can
collide across different segmentations; update the hashing in the function that
computes inputs_hash (use the hasher variable near the loop over
benchmark_texts) to include an unambiguous delimiter or length-prefix for each
text (e.g., write each text length then the text or serialize each entry with a
structured encoding) before hasher.update, so the resulting inputs_hash cannot
be forged by different concatenations of the same bytes.
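A length-prefix variant of the fix, shown as a standalone helper; `boundary_safe_hash` is an illustrative name for the logic that would replace the loop above.

```python
import hashlib
from typing import List


def boundary_safe_hash(benchmark_texts: List[str]) -> str:
    """Length-prefix every text so ["ab", "c"] and ["a", "bc"] differ."""
    hasher = hashlib.sha256()
    hasher.update(len(benchmark_texts).to_bytes(8, "little"))
    for text in benchmark_texts:
        encoded = text.encode("utf-8")
        hasher.update(len(encoded).to_bytes(8, "little"))
        hasher.update(encoded)
    return hasher.hexdigest()
```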
normalised_benchmarks: Optional[Set[str]] = None

for ngram in ngrams:
    if ngram in bloom_filter:
        if normalised_benchmarks is None:
            normalised_benchmarks = {_normalise(bt) for bt in benchmark_texts}

        for nb in normalised_benchmarks:
            if f" {ngram} " in f" {nb} ":
There was a problem hiding this comment.
Cache the normalised benchmark corpus outside this hot path.
extract_text_from_xml() calls check_contamination() for every chunk, but this block rebuilds the full normalised benchmark set after each Bloom hit. On large dumps that turns the exact-verification stage into repeated O(number of benchmarks) work and will dominate preprocessing time. Precompute/cache the normalised benchmark texts once per run and pass them into check_contamination().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/contamination.py` around lines 251 - 259, The hot path
inside check_contamination rebuilds normalised_benchmarks on every Bloom hit
(see normalised_benchmarks, _normalise, benchmark_texts), causing repeated
O(number of benchmarks) work; precompute the set of normalized benchmark texts
once (e.g., in the preprocessing entry point that calls extract_text_from_xml)
and change check_contamination to accept a precomputed_normalised_benchmarks
parameter instead of constructing it inline, then use that set for the inner
exact-match loop to avoid rebuilding it on each invocation.
fix lint issues and resolve conflicts
closes #50
Description
This PR introduces a Data Contamination Detection feature to the OpenVerifiableLLM preprocessing pipeline.
The goal of this feature is to automatically detect and redact text chunks from Wikipedia dumps that overlap with evaluation benchmark datasets (e.g., GSM8K, MMLU). This prevents benchmark contamination when training LLMs on the generated data. The detection uses an efficient two-stage approach: a fast n-gram matching mechanism backed by a Bloom filter, followed by an exact string match check to eliminate false positives.
Key Changes Made:
1. Configuration & Dependencies:
- Added datasets, pyyaml, and rbloom to pyproject.toml.
- Added openverifiablellm/config.py to handle benchmark list loading via a BenchmarkConfig dataclass. Supports loading from CLI arguments (--benchmarks), a local benchmarks.yaml file, or defaults to ["gsm8k", "cais/mmlu"].
2. Core Contamination Detection (
openverifiablellm/contamination.py):
- fetch_benchmarks to load datasets either from Hugging Face or local JSONL/CSV files.
- get_ngrams to generate and normalize sliding-window n-grams (default 13-grams).
- build_bloom_filter to construct, populate, and serialize an efficiently sized rbloom.Bloom filter to disk.
- check_contamination to perform the two-stage contamination check (Bloom filter first, exact match on hit).
3. Pipeline Integration (
openverifiablellm/utils.py):
- Updated extract_text_from_xml to accept optional contamination checking parameters. If triggered, contaminated text chunks are silently skipped and tracked.
- Updated generate_manifest to include "contamination_checks_passed" and "redacted_chunks" metadata fields.
- Fixed compute_sha256 and compute_merkle_root calls throughout the codebase to correctly use the mandated keyword arguments (data= or file_path=).
4. Testing & Verification:
- Added tests/test_contamination.py containing 17 comprehensive unit tests covering n-gram generation, bloom filter serialization, positive/negative contamination detection, and pipeline backward compatibility.
- Updated tests/test_util.py to align with the compute_sha256 keyword argument fixes.
Screenshots/Recordings:
Additional Notes:
- Uses rbloom for fast Rust-backed Bloom filter construction. Implemented a custom SHA-256-based hash function _bloom_hash to make the Bloom filter serializable via save_bytes() and load_bytes().
- The contamination parameters of extract_text_from_xml and generate_manifest are optional. Calling python -m openverifiablellm.utils <input_dump> will continue to work normally without contamination checking.
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Improvements
Tests
Dependencies