Fix substring-match direction in perturbation scorer; add gate/threshold flags by dangng2004 · Pull Request #87 · ChicagoHAI/OpenAIReview

dangng2004 · 2026-05-18T02:44:33Z

Summary

Bug fix: the perturbation scorer's substring prefilter computed coverage in the wrong direction (cov(perturbed, quote) instead of cov(quote, perturbed)). The original direction asked "is most of the perturbed paragraph contained in the reviewer's quote?" — but reviewers quote from the paper, so the natural direction is the inverse. The bug systematically rejected paragraph-level perturbations (reviewers can't quote 75% of a long paragraph), which is why the legacy fuzzy scorer had near-zero recall on prose-level errors.
New flags: openaireview score gains --threshold (LLM-judge cutoff, default 3) and --substring-gate (cheap pre-filter, default off). generate_report.py gains --score-subdir so alternate scoring modes can coexist under paper_NNN/score/<subdir>/.
Cleanup: ignore benchmarks/perturbation/reports/ and benchmarks/perturbation/data/; remove 4 previously-tracked report files.
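
To make the direction issue concrete, here is a minimal sketch of an asymmetric coverage function. It is not the PR's implementation: the `coverage` helper, the character-level matching via `difflib`, and the sample paragraph are all assumptions for illustration. The actual scorer may measure coverage differently (for example with `rapidfuzz`).

```python
from difflib import SequenceMatcher


def coverage(needle: str, haystack: str) -> float:
    """Fraction of `needle`'s characters that fall in blocks shared with `haystack`.

    Hypothetical helper: the measure is asymmetric, so argument order matters.
    """
    if not needle:
        return 0.0
    matcher = SequenceMatcher(None, needle, haystack, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(needle)


# Illustrative data only: a long perturbed paragraph and a short reviewer quote.
paragraph = (
    "We train the model for 100 epochs with a learning rate of 0.01. "
    "The validation loss plateaus after roughly 40 epochs, and we report "
    "the checkpoint with the lowest validation loss on the held-out split."
)
quote = "a learning rate of 0.01"

fixed = coverage(quote, paragraph)  # cov(quote, perturbed): how much of the quote is in the paragraph
buggy = coverage(paragraph, quote)  # cov(perturbed, quote): how much of the paragraph is in the quote

print(f"cov(quote, perturbed) = {fixed:.2f}")
print(f"cov(perturbed, quote) = {buggy:.2f}")
```

Under these assumptions a verbatim quote is fully covered in the corrected direction, while the original direction yields a small fraction for any long paragraph, so a cutoff such as 0.7 or 0.75 would reject it.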

Other small changes:

Drop the legacy _substring_match function (replaced by the correctly-directed _llm_substring_gate).
Lower max_tokens on the judge call from 8192 to 16 (the judge returns a single integer; reserving 8192 tokens was triggering OpenRouter credit-reservation 402 errors on keys with tight credit limits).
Score JSONs now persist match_mode, threshold, substring_gate.
Add rapidfuzz and sentence-transformers as benchmark deps (needed by the existing fuzzy + semantic methods).

Test plan

openaireview score --help shows the new --threshold and --substring-gate flags
openaireview score <manifest> <review> --method llm --threshold 4 --substring-gate produces a JSON with the new fields populated
python benchmarks/perturbation/generate_report.py <dir> --score-subdir llm matches existing behavior (default unchanged)
Sanity-check: run scoring with --method fuzzy (no gate) on a known TP — gets past the prefilter
Sanity-check: same with --substring-gate — confirms gate at 0.7 still passes verbatim-quote TPs
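
The flags exercised in the test plan could be declared roughly as follows. This is a sketch assuming an `argparse`-based CLI. The `build_score_parser` name, the `--method` choices and its default, and the positional argument names are assumptions; only `--threshold` (default 3) and `--substring-gate` (default off) come from the PR description.

```python
import argparse


def build_score_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for `openaireview score`; the real CLI may differ."""
    parser = argparse.ArgumentParser(prog="openaireview score")
    parser.add_argument("manifest", help="perturbation manifest (assumed positional)")
    parser.add_argument("review", help="review to score (assumed positional)")
    # Method names are taken from the PR text; the default is an assumption.
    parser.add_argument("--method", choices=["llm", "fuzzy", "semantic"], default="llm")
    # LLM-judge cutoff: scores at or above this value count as a match.
    parser.add_argument("--threshold", type=int, default=3)
    # Cheap substring pre-filter, off unless requested.
    parser.add_argument("--substring-gate", action="store_true")
    return parser


args = build_score_parser().parse_args(
    ["manifest.json", "review.json", "--method", "llm", "--threshold", "4", "--substring-gate"]
)
defaults = build_score_parser().parse_args(["manifest.json", "review.json"])
print(args.threshold, args.substring_gate)
print(defaults.threshold, defaults.substring_gate)
```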

🤖 Generated with Claude Code

…ld flags

The `fuzzy` method's substring prefilter computed `cov(perturbed, quote)`, which asks "is most of the perturbed paragraph contained in the reviewer's quote?" But reviewers quote *from* the paper, so the natural direction is `cov(quote, perturbed)`. The buggy direction systematically rejected paragraph-level perturbations (because reviewers can't quote 75% of a long paragraph), which is why the legacy fuzzy scorer had near-zero recall on prose-level errors.

Changes:

- Drop the legacy `_substring_match` function.
- Add `_llm_substring_gate(quote, perturbed)` with the correct direction (default threshold 0.7), usable as an optional pre-filter for any method via the new `--substring-gate` flag.
- Add `--threshold` flag to expose the LLM-judge cutoff (default still 3).
- `score_review` no longer hardcodes `score >= 3`; threads `threshold` and `substring_gate` through.
- Drop `max_tokens=8192` -> `max_tokens=16` on the judge call (it returns a single integer; 8192 was triggering OpenRouter credit-reservation failures on tight keys).
- Add `--score-subdir` flag to `generate_report.py` so it can aggregate alternate scoring-mode subdirs (default still "llm" for back-compat).
- Score JSONs now persist `match_mode`, `threshold`, `substring_gate`.

Adds `rapidfuzz` and `sentence-transformers` to the install (needed by the legacy fuzzy + semantic methods).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
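
One way to picture how `threshold` and `substring_gate` thread through the match decision is sketched below. This is not the code from `score_review`: `decide_match`, `MatchDecision`, and the injected `judge` and `coverage` callables are hypothetical names, and running the gate before the judge call is an assumption based on the description of the gate as a cheap pre-filter.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MatchDecision:
    matched: bool
    judge_score: Optional[int]  # None when the gate rejected the pair before judging
    threshold: int
    substring_gate: bool


def decide_match(
    quote: str,
    perturbed: str,
    judge: Callable[[str, str], int],
    coverage: Callable[[str, str], float],
    threshold: int = 3,
    substring_gate: bool = False,
    gate_threshold: float = 0.7,
) -> MatchDecision:
    """Hypothetical decision step: optional gate first, then the configurable cutoff."""
    # Gate direction follows the fix: how much of the quote appears in the paragraph.
    if substring_gate and coverage(quote, perturbed) < gate_threshold:
        return MatchDecision(False, None, threshold, substring_gate)
    score = judge(quote, perturbed)
    # The cutoff is a parameter rather than a hardcoded `score >= 3`.
    return MatchDecision(score >= threshold, score, threshold, substring_gate)


# Stubs standing in for the LLM judge and the coverage measure.
def fake_judge(quote: str, perturbed: str) -> int:
    return 3


def low_coverage(quote: str, perturbed: str) -> float:
    return 0.2


default = decide_match("q", "p", fake_judge, low_coverage)
strict = decide_match("q", "p", fake_judge, low_coverage, threshold=4)
gated = decide_match("q", "p", fake_judge, low_coverage, substring_gate=True)
print(default.matched, strict.matched, gated.matched)
```

Keeping the threshold and gate setting on the returned record mirrors the PR's point that score JSONs now persist these values, so results from different scoring modes remain distinguishable.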

reports/ is regenerated from results/; data/ is locally-generated input that users supply for their own runs. Neither belongs in the repo. The previously-tracked reports/ files (README, combined, revision notes, surface_3models) were removed in the preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- benchmarks/perturbation/configs/*: ephemeral per-domain configs are locally generated; keep only the canonical templates (default.yaml, coarse_{medium,short}.yaml, surface_errors{,_medium}.yaml) via `!` exceptions.
- benchmarks/conference_study/{manifests,papers,results}: these are symlinks in this worktree (existing trailing-slash patterns matched dirs only); add the slashless variants.
- benchmarks/experimental_perturbations/: exploratory sweep outputs, large and not part of the published benchmark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dangng2004 and others added 3 commits May 17, 2026 21:43

dangng2004 wants to merge 3 commits into main from scoring-debug
