Fix substring-match direction in perturbation scorer; add gate/threshold flags#87

Open
dangng2004 wants to merge 3 commits into main from scoring-debug

@dangng2004
Contributor

Summary

  • Bug fix: the perturbation scorer's substring prefilter computed coverage in the wrong direction (cov(perturbed, quote) instead of cov(quote, perturbed)). The original direction asked "is most of the perturbed paragraph contained in the reviewer's quote?" — but reviewers quote from the paper, so the natural direction is the inverse. The bug systematically rejected paragraph-level perturbations (reviewers can't quote 75% of a long paragraph), which is why the legacy fuzzy scorer had near-zero recall on prose-level errors.
  • New flags: openaireview score gains --threshold (LLM-judge cutoff, default 3) and --substring-gate (cheap pre-filter, default off). generate_report.py gains --score-subdir so alternate scoring modes can coexist under paper_NNN/score/<subdir>/.
  • Cleanup: ignore benchmarks/perturbation/reports/ and benchmarks/perturbation/data/; remove 4 previously-tracked report files.
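
To make the direction issue concrete, here is a minimal sketch of an asymmetric coverage function. The PR does not show how `cov` is implemented, so the definition below (fraction of the first argument's characters that fall in blocks shared with the second, computed with `difflib`) is an assumption for illustration only, as are the example strings.

```python
from difflib import SequenceMatcher


def cov(a: str, b: str) -> float:
    """Fraction of characters in `a` covered by blocks that also occur in `b`.

    Hypothetical stand-in for the scorer's coverage measure; the real
    implementation may be token-based or use a different matcher.
    """
    if not a:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a)


# A short verbatim quote taken from a much longer perturbed paragraph.
quote = "the learning rate was set to 0.1"
perturbed = (
    "We trained all models for 50 epochs with the Adam optimizer, and "
    + quote
    + " for every run. Results are averaged over five random seeds and "
    "reported with standard deviations in the appendix tables."
)

# Corrected direction: how much of the quote is found in the paragraph?
print(cov(quote, perturbed))  # 1.0, the quote is fully contained

# Buggy direction: how much of the paragraph is found in the quote?
print(cov(perturbed, quote))  # far below 0.7, so a prefilter would reject it
```

Under this assumed definition, a verbatim quote always scores full coverage in the corrected direction, while the buggy direction shrinks as the paragraph grows, which is consistent with the recall loss on paragraph-level perturbations described above.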

Other small changes:

  • Drop the legacy _substring_match function (replaced by the correctly-directed _llm_substring_gate).
  • Lower max_tokens from 8192 to 16 on the judge call (it returns a single integer; 8192 was triggering OpenRouter credit-reservation 402s on tighter keys).
  • Score JSONs now persist match_mode, threshold, substring_gate.
  • Add rapidfuzz and sentence-transformers as benchmark deps (needed by the existing fuzzy + semantic methods).
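
One way the threshold and the optional gate could fit together is sketched below. This is not the PR's `score_review` code: the function names, the `judge` callable, and the returned dictionary keys are all hypothetical, and the coverage measure is an assumed `difflib`-based stand-in. Only the defaults (threshold 3, gate off, gate cutoff 0.7) come from the description.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional, Union


def quote_coverage(quote: str, perturbed: str) -> float:
    """Assumed coverage measure: share of the quote found in the perturbed text."""
    if not quote:
        return 0.0
    blocks = SequenceMatcher(None, quote, perturbed, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / len(quote)


def decide(
    quote: str,
    perturbed: str,
    judge: Callable[[str, str], int],  # hypothetical: returns an integer score
    threshold: int = 3,
    substring_gate: bool = False,
    gate_threshold: float = 0.7,
) -> Dict[str, Union[bool, Optional[int]]]:
    """Apply the optional cheap gate first, then the judge cutoff."""
    if substring_gate and quote_coverage(quote, perturbed) < gate_threshold:
        # Gate rejects the pair, so the (expensive) judge is never called.
        return {"matched": False, "judge_score": None, "gated": True}
    score = judge(quote, perturbed)
    return {"matched": score >= threshold, "judge_score": score, "gated": False}


# Usage with a fake judge that always answers 3.
fake_judge = lambda quote, perturbed: 3
print(decide("abc", "xxabcxx", fake_judge))               # matched at the default threshold
print(decide("abc", "xxabcxx", fake_judge, threshold=4))  # not matched at a stricter cutoff
```

The point of running the gate before the judge is that a rejected pair costs no model call; with the gate off, the behavior reduces to a plain `score >= threshold` check.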

Test plan

  • openaireview score --help shows the new --threshold and --substring-gate flags
  • openaireview score <manifest> <review> --method llm --threshold 4 --substring-gate produces a JSON with the new fields populated
  • python benchmarks/perturbation/generate_report.py <dir> --score-subdir llm matches existing behavior (default unchanged)
  • Sanity check: run scoring with --method fuzzy (no gate) on a known true positive (TP) and confirm it gets past the prefilter
  • Sanity check: repeat with --substring-gate and confirm the gate at 0.7 still passes verbatim-quote TPs
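
The "new fields populated" check could be scripted along these lines. This is a sketch that assumes the three fields sit at the top level of the score JSON; the actual layout is not shown in this PR, and the sample values (including the `match_mode` string) are placeholders.

```python
import json
from typing import List

# Field names taken from the PR description; their nesting is an assumption.
REQUIRED_FIELDS = ("match_mode", "threshold", "substring_gate")


def missing_score_fields(score_json_text: str) -> List[str]:
    """Return the required field names absent from a score JSON document."""
    data = json.loads(score_json_text)
    return [name for name in REQUIRED_FIELDS if name not in data]


# Placeholder document standing in for a real score file.
sample = json.dumps({"match_mode": "placeholder", "threshold": 4, "substring_gate": True})
print(missing_score_fields(sample))  # []
```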

🤖 Generated with Claude Code

dangng2004 and others added 3 commits May 17, 2026 21:43
…ld flags

The `fuzzy` method's substring prefilter computed `cov(perturbed, quote)`,
which asks "is most of the perturbed paragraph contained in the reviewer's
quote?" — but reviewers quote *from* the paper, so the natural direction is
`cov(quote, perturbed)`. The buggy direction systematically rejected
paragraph-level perturbations (because reviewers can't quote 75% of a long
paragraph), which is why the legacy fuzzy scorer had near-zero recall on
prose-level errors.

Changes:
- Drop the legacy `_substring_match` function.
- Add `_llm_substring_gate(quote, perturbed)` with the correct direction
  (default threshold 0.7), usable as an optional pre-filter for any method
  via the new `--substring-gate` flag.
- Add `--threshold` flag to expose the LLM-judge cutoff (default still 3).
- `score_review` no longer hardcodes `score >= 3`; threads `threshold` and
  `substring_gate` through.
- Lower `max_tokens=8192` -> `max_tokens=16` on the judge call (it returns
  a single integer; 8192 was triggering OpenRouter credit-reservation
  failures on tight keys).
- Add `--score-subdir` flag to `generate_report.py` so it can aggregate
  alternate scoring-mode subdirs (default still "llm" for back-compat).
- Score JSONs now persist `match_mode`, `threshold`, `substring_gate`.

Adds `rapidfuzz` and `sentence-transformers` to the install (needed by
the legacy fuzzy + semantic methods).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reports/ is regenerated from results/; data/ is locally-generated input
that users supply for their own runs. Neither belongs in the repo.

The previously-tracked reports/ files (README, combined, revision notes,
surface_3models) were removed in the preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- benchmarks/perturbation/configs/* — ephemeral per-domain configs are
  locally generated; keep only the canonical templates (default.yaml,
  coarse_{medium,short}.yaml, surface_errors{,_medium}.yaml) via `!`
  exceptions.
- benchmarks/conference_study/{manifests,papers,results} — these are
  symlinks in this worktree (existing trailing-slash patterns matched
  dirs only); add the slashless variants.
- benchmarks/experimental_perturbations/ — exploratory sweep outputs,
  large and not part of the published benchmark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>