ci: benchmark regression gate — zero tolerance, fail PR on any score drop by jobordu · Pull Request #1 · nForma-AI/nf-benchmark

jobordu · 2026-04-16T16:23:48Z

Summary

Adds baseline caching: main branch runs save baseline.json to GitHub Cache
PRs restore the main baseline and run with --compare-baseline --baseline-tolerance 0
Any drop in benchmark pass rate fails the PR — zero tolerance
Also brings in --skip-heatmap, --skip-proximity, --no-coderlm, --no-auto-commit in solver args (21× speedup: ~600s → ~14s per challenge)
Node 22, npm ci, 180-min job timeout, cancel-in-progress concurrency
Results + baseline uploaded as artifact on every run (30-day retention)

Test plan

Merge to main → confirm baseline is saved to cache and artifact is uploaded
Open a PR that regresses a challenge → confirm job fails with comparison output
Open a neutral PR → confirm job passes

🤖 Generated with Claude Code

- scorer.cjs: add LAYER_ALIASES mapping benchmark conceptual names to nf-solve canonical keys (c_to_e→git_heatmap, f_to_f→formal_lint, f_to_g→per_model_gates, etc.) - scorer.cjs: fix scoreLayerZero to treat skipped layers (pre=-1, post=-1) as passing — t_to_c is skipped in --fast mode - scorer.cjs: implement custom scoring as detection_only for end-to-end challenges - mutator.cjs: fix json-field-modify with json_path "$" (root replacement) to return mutation.value directly - challenges/01-requirements.json: BENCH-001 uses detection_only (residual_layer_zero is wrong when mutation *adds* a gap) - challenges/08-convergence.json: BENCH-082 uses detection_only (residual_decreased can't pass in --report-only mode); replace ADD_3_SMALL_GAPS placeholder with 3 real requirement objects Score improved from 36.5% (73/200) → 100.0% (205/205) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…core, baseline lock, timing Replace `detection_only` keyword matching (which always passed because nf-solve always emits "residual") with `residual_layer_increased`: a challenge passes iff the target layer's residual actually increased after the mutation is injected. Falls back to keyword matching only when residual_vector is unavailable. Add `reduction_score` (continuous metric: fraction of total residual reduced) to every scoring method and aggregate it as `avgReductionScore` in the report — gives a signal that doesn't collapse to binary pass/fail. Add `--save-baseline` / `--compare-baseline` flags with configurable tolerance (default 5pp) so CI can gate on regression rather than just pass/fail. Add `execution_time_ms` per challenge in results and in the per-run report JSON. Fix stale test assertions: 100 → 205 challenges, add unit tests covering the new residual_layer_increased path and LAYER_ALIASES alias resolution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…null-mutation, multi-layer, parallel, trend, calibrate

- Rewrites benchmark.yml with baseline cache (save on main, compare on PRs) - PRs fail when pass rate drops more than 2 percentage points vs main - Node 22, npm ci, 180-min timeout, cancel-in-progress concurrency - Results + baseline uploaded as artifact on every run (30-day retention) - runner.cjs: add --skip-heatmap, --skip-proximity, --no-coderlm, --no-auto-commit to solve args - nf-benchmark.cjs: pass focus layer through to all runSolve/runSolveFull calls Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…lures Raw failures are expected — CI tracks score trends, not a pass-all gate. The benchmark exits 1 only when --compare-baseline detects a regression. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jobordu and others added 7 commits April 14, 2026 10:28

feat(benchmark): all 7 improvements — fix_and_verify, TLA+ snapshot, …

8808898

…null-mutation, multi-layer, parallel, trend, calibrate

fix(ci): zero tolerance — any benchmark score drop fails the PR

f1d3ed7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: add package-lock.json (required for npm ci in CI)

af3fd70

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jobordu merged commit 6153b54 into main Apr 16, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: benchmark regression gate — zero tolerance, fail PR on any score drop#1

ci: benchmark regression gate — zero tolerance, fail PR on any score drop#1
jobordu merged 7 commits into
mainfrom
ci/benchmark-regression-gate

jobordu commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jobordu commented Apr 16, 2026

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant