Fix substring-match direction in perturbation scorer; add gate/threshold flags #87
Open
dangng2004 wants to merge 3 commits into
…ld flags

The `fuzzy` method's substring prefilter computed `cov(perturbed, quote)`, which asks "is most of the perturbed paragraph contained in the reviewer's quote?" But reviewers quote *from* the paper, so the natural direction is `cov(quote, perturbed)`. The buggy direction systematically rejected paragraph-level perturbations (because reviewers can't quote 75% of a long paragraph), which is why the legacy fuzzy scorer had near-zero recall on prose-level errors.

Changes:
- Drop the legacy `_substring_match` function.
- Add `_llm_substring_gate(quote, perturbed)` with the correct direction (default threshold 0.7), usable as an optional pre-filter for any method via the new `--substring-gate` flag.
- Add `--threshold` flag to expose the LLM-judge cutoff (default still 3).
- `score_review` no longer hardcodes `score >= 3`; it threads `threshold` and `substring_gate` through.
- Reduce `max_tokens=8192` -> `max_tokens=16` on the judge call (it returns a single integer; 8192 was triggering OpenRouter credit-reservation failures on tight keys).
- Add `--score-subdir` flag to `generate_report.py` so it can aggregate alternate scoring-mode subdirs (default still "llm" for back-compat).
- Score JSONs now persist `match_mode`, `threshold`, `substring_gate`.

Adds `rapidfuzz` and `sentence-transformers` to the install (needed by the legacy fuzzy + semantic methods).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
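To make the direction issue concrete, here is a minimal sketch. It assumes `cov(a, b)` means "the fraction of `a`'s words that also appear, in order, in `b`"; that definition, the word-level tokenization, and the sample strings are illustrative assumptions, not the scorer's actual code.

```python
from difflib import SequenceMatcher


def cov(a: str, b: str) -> float:
    """Fraction of a's words that also occur, in order, in b.

    Hypothetical stand-in for the scorer's coverage function.
    """
    a_words, b_words = a.lower().split(), b.lower().split()
    if not a_words:
        return 0.0
    matcher = SequenceMatcher(None, a_words, b_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a_words)


# A paragraph-level perturbation and a short verbatim quote from a review.
perturbed = (
    "We train the model for 100 epochs with a learning rate of 0.1 "
    "and report accuracy averaged over five random seeds on the held-out test split"
)
quote = "a learning rate of 0.1"

legacy = cov(perturbed, quote)  # how much of the paragraph sits inside the quote
fixed = cov(quote, perturbed)   # how much of the quote sits inside the paragraph

print(f"legacy direction: {legacy:.2f}")
print(f"fixed direction:  {fixed:.2f}")
```

Under these assumptions the legacy direction can never clear a 0.75 cutoff for a short quote against a long paragraph, while the fixed direction scores a verbatim quote at full coverage, comfortably above the new 0.7 default.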
reports/ is regenerated from results/; data/ is locally-generated input that users supply for their own runs. Neither belongs in the repo. The previously-tracked reports/ files (README, combined, revision notes, surface_3models) were removed in the preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- benchmarks/perturbation/configs/* — ephemeral per-domain configs are
locally generated; keep only the canonical templates (default.yaml,
coarse_{medium,short}.yaml, surface_errors{,_medium}.yaml) via `!`
exceptions.
- benchmarks/conference_study/{manifests,papers,results} — these are
symlinks in this worktree (existing trailing-slash patterns matched
dirs only); add the slashless variants.
- benchmarks/experimental_perturbations/ — exploratory sweep outputs,
large and not part of the published benchmark.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- The `fuzzy` method's substring prefilter computed coverage in the wrong direction (`cov(perturbed, quote)` instead of `cov(quote, perturbed)`). The original direction asked "is most of the perturbed paragraph contained in the reviewer's quote?", but reviewers quote from the paper, so the natural direction is the inverse. The bug systematically rejected paragraph-level perturbations (reviewers can't quote 75% of a long paragraph), which is why the legacy fuzzy scorer had near-zero recall on prose-level errors.
- `openaireview score` gains `--threshold` (LLM-judge cutoff, default 3) and `--substring-gate` (cheap pre-filter, default off).
- `generate_report.py` gains `--score-subdir` so alternate scoring modes can coexist under `paper_NNN/score/<subdir>/`.
- Ignore `benchmarks/perturbation/reports/` and `benchmarks/perturbation/data/`; remove 4 previously-tracked report files.

Other small changes:
- Drop the legacy `_substring_match` function (replaced by the correctly-directed `_llm_substring_gate`).
- Reduce `max_tokens=8192` → `16` on the judge call (it returns a single integer; 8192 was triggering OpenRouter credit-reservation 402s on tighter keys).
- Score JSONs now persist `match_mode`, `threshold`, `substring_gate`.
- Add `rapidfuzz` and `sentence-transformers` as benchmark deps (needed by the existing fuzzy + semantic methods).

Test plan
- `openaireview score --help` shows the new `--threshold` and `--substring-gate` flags
- `openaireview score <manifest> <review> --method llm --threshold 4 --substring-gate` produces a JSON with the new fields populated
- `python benchmarks/perturbation/generate_report.py <dir> --score-subdir llm` matches existing behavior (default unchanged)
- `--method fuzzy` (no gate) on a known TP: gets past the prefilter
- `--substring-gate`: confirms the gate at 0.7 still passes verbatim-quote TPs

🤖 Generated with Claude Code
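For reviewers who want a mental model of how `--threshold` and `--substring-gate` interact, the sketch below shows one plausible composition: the gate runs first as a cheap check, and only quotes that pass it reach the judge. The names `quote_coverage`, `is_detected`, and `stub_judge`, the set-based coverage, and the returned dict layout are all hypothetical; the real `score_review` may be structured differently.

```python
from typing import Callable, Optional


def quote_coverage(quote: str, perturbed: str) -> float:
    """Share of the quote's words found in the perturbed text (hypothetical)."""
    words = quote.lower().split()
    if not words:
        return 0.0
    vocabulary = set(perturbed.lower().split())
    return sum(w in vocabulary for w in words) / len(words)


def is_detected(
    quote: str,
    perturbed: str,
    judge: Callable[[str, str], int],
    threshold: int = 3,
    substring_gate: bool = False,
    gate_cutoff: float = 0.7,
) -> dict:
    """Apply the optional gate, then compare the judge score to the threshold."""
    score: Optional[int] = None
    matched = False
    gated_out = substring_gate and quote_coverage(quote, perturbed) < gate_cutoff
    if not gated_out:
        score = judge(quote, perturbed)  # the expensive call is skipped when gated out
        matched = score >= threshold
    return {
        "matched": matched,
        "score": score,
        "match_mode": "llm",
        "threshold": threshold,
        "substring_gate": substring_gate,
    }


judge_calls = []


def stub_judge(quote: str, perturbed: str) -> int:
    """Stand-in for the LLM judge; always returns a borderline score of 3."""
    judge_calls.append(quote)
    return 3


paragraph = "we use a learning rate of 0.1"
relaxed = is_detected("learning rate of 0.1", paragraph, stub_judge)
strict = is_detected("learning rate of 0.1", paragraph, stub_judge, threshold=4)
gated = is_detected("the dataset is too small", paragraph, stub_judge, substring_gate=True)
```

In this sketch a borderline score of 3 counts as a match at the default threshold but not at `threshold=4`, and a quote with no overlap is rejected by the gate without spending a judge call.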