Add --trend flag to benchmark.py by ScuttleBot · Pull Request #266 · pinchbench/skill

ScuttleBot · 2026-04-09T16:37:25Z

Closes #107

Wires RunTrendAnalyzer (merged in #104) into the CLI as a first-class flag.

Changes

--trend (boolean) — run trend analysis after benchmark completes, scanning the output directory for prior runs of the same model
--trend-window N (default 10) — how many recent runs to include in the OLS slope fit
--trend-threshold (default -0.5) — slope (%/run) below which regression is flagged
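
For readers unfamiliar with the slope fit these flags tune, here is a minimal sketch of an ordinary least-squares slope over the most recent runs. It is not the RunTrendAnalyzer implementation: the function names `ols_slope` and `is_regression` are hypothetical, and it assumes scores are percentages ordered oldest to newest.

```python
from typing import Sequence


def ols_slope(scores: Sequence[float]) -> float:
    """Least-squares slope of scores against run index 0..n-1 (units: score per run)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def is_regression(scores: Sequence[float], window: int = 10, threshold: float = -0.5) -> bool:
    """Flag a regression when the slope over the last `window` runs falls below `threshold`."""
    recent = list(scores)[-window:]  # assumes window >= 2
    return ols_slope(recent) < threshold
```

With the default threshold of -0.5, a model losing one point per run would be flagged, while a flat score history would not.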

Usage

# Basic: analyze trends with defaults
python benchmark.py --model anthropic/claude-sonnet-4 --trend

# Custom window and threshold
python benchmark.py --model anthropic/claude-sonnet-4 --trend --trend-window 5 --trend-threshold -1.0

The analysis runs after scores are logged but before upload, so regression warnings appear in the terminal output alongside the score summary.

🤖 This PR was opened by @olearycrew's OpenClaw bot. Please review carefully!

Copilot

Pull request overview

Adds first-class CLI support in scripts/benchmark.py to run post-benchmark trend regression detection using the existing RunTrendAnalyzer utility (from PR #104), aligning with issue #107.

Changes:

Adds --trend flag to run trend analysis after the benchmark finishes (and before upload).
Adds --trend-window and --trend-threshold options to configure the analysis window and regression detection threshold.
Wires RunTrendAnalyzer(...).run(model=args.model) into the post-run results flow.
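
A sketch of how the three flags might be declared, based on the defaults listed in this PR. The `--trend-window` definition mirrors the diff excerpt in the review; the other help strings, `build_parser`, and treating `--model` as required are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="benchmark.py")
    # Assumption: --model is required; the real script may differ.
    parser.add_argument("--model", required=True, help="Model identifier to benchmark")
    parser.add_argument(
        "--trend",
        action="store_true",
        help="Run trend analysis after the benchmark completes",
    )
    parser.add_argument(
        "--trend-window",
        type=int,
        default=10,
        help="Number of recent runs to include in trend analysis (default: 10)",
    )
    parser.add_argument(
        "--trend-threshold",
        type=float,
        default=-0.5,
        # A literal % in argparse help text is written as %%.
        help="Slope (%%/run) below which a regression is flagged (default: -0.5)",
    )
    return parser


args = build_parser().parse_args(
    ["--model", "anthropic/claude-sonnet-4", "--trend",
     "--trend-window", "5", "--trend-threshold", "-1.0"]
)
```

argparse accepts `-1.0` as the value of `--trend-threshold` here because no declared option looks like a negative number.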

scripts/benchmark.py

Copilot · 2026-04-09T16:41:18Z

scripts/benchmark.py

+    parser.add_argument(
+        "--trend-window",
+        type=int,
+        default=10,
+        help="Number of recent runs to include in trend analysis (default: 10)",
+    )


--trend-window accepts any int, including 0/negative values; with the current slicing (pts[-self.window:]) a value of 0 analyzes all runs and negative values produce surprising slices. Consider validating --trend-window as >= 2 (or >= 1, but then run() will always produce no output) and failing fast via parser.error(...) to avoid confusing behavior.
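
To make the slicing concern concrete, the sketch below reproduces the `pts[-window:]` behavior on a plain list and shows one way to fail fast with `parser.error(...)`. The helper names `last_window` and `validate_trend_window` are hypothetical, and the error message wording is an assumption.

```python
import argparse
from typing import List


def last_window(pts: List[float], window: int) -> List[float]:
    """Same slice the review quotes: pts[-window:]."""
    return pts[-window:]


def validate_trend_window(parser: argparse.ArgumentParser, window: int) -> int:
    """Reject windows too small to fit a slope; parser.error exits with status 2."""
    if window < 2:
        parser.error("--trend-window must be >= 2")
    return window


pts = [1.0, 2.0, 3.0, 4.0, 5.0]
all_runs = last_window(pts, 0)        # -0 == 0, so this is pts[0:], i.e. every run
odd_slice = last_window(pts, -3)      # pts[3:], the last two runs rather than three
```

Calling `validate_trend_window` right after `parse_args()` keeps the check next to the other CLI handling and prints the usual usage message on failure.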

scripts/benchmark.py

Wire RunTrendAnalyzer into benchmark.py via a new --trend flag. When passed, analyzes score trends for the benchmarked model after results are written, logging regression/improvement before upload.

Additional flags --trend-window (default 10) and --trend-threshold (default -0.5) allow tuning the analysis parameters.

Usage:
python benchmark.py --model anthropic/claude-sonnet-4 --trend
python benchmark.py --model anthropic/claude-sonnet-4 --trend --trend-window 5

- Fix help text: %%%%/run -> %/run (argparse doesn't need escaping)
- Validate --trend-window >= 2 to avoid confusing behavior
- Wrap trend analysis in try/except so failures don't abort upload
- Skip in_progress runs in lib_trend.py to avoid skewed regression detection
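
Two of these fixes can be sketched in a few lines: guarding the analysis so an exception cannot block the upload, and dropping unfinished runs before fitting. This is an illustration only. The `status` key, the `completed_runs` and `run_trend_safely` helpers, and the log message are assumptions, not the code in lib_trend.py or benchmark.py.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("benchmark")


def completed_runs(runs: List[Dict]) -> List[Dict]:
    """Drop runs still marked in_progress (the 'status' key is hypothetical)."""
    return [r for r in runs if r.get("status") != "in_progress"]


def run_trend_safely(analyze: Callable[[], None]) -> bool:
    """Run the analysis; log and swallow any failure so the caller can still upload."""
    try:
        analyze()
    except Exception:
        logger.exception("Trend analysis failed; continuing with upload")
        return False
    return True
```

A later commit in this PR also moves the analyzer import inside the try block, so an import failure would be caught the same way.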

olearycrew requested a review from Copilot April 9, 2026 16:38

Copilot started reviewing on behalf of olearycrew April 9, 2026 16:38

Copilot AI reviewed Apr 9, 2026

olearycrew added 2 commits April 9, 2026 12:50

olearycrew force-pushed the scuttlebot/add-trend-flag branch from e1fe047 to 33b55d0 Compare April 9, 2026 16:50

olearycrew added 2 commits April 9, 2026 12:57

Remove unused top-level import (now lazy-loaded inside try block)

a3ae51b

Escape % in argparse help string

78e82cd
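
Some background on why this commit was needed: argparse expands help strings with %-formatting (which is how `%(default)s` works), so a literal percent sign has to be written as `%%`. The sketch below checks whether a given help string renders; `help_renders` is a hypothetical helper and the help wording is an assumption.

```python
import argparse


def help_renders(help_text: str) -> bool:
    """Return True if argparse can format a parser whose option uses this help text."""
    try:
        parser = argparse.ArgumentParser(prog="benchmark.py")
        parser.add_argument(
            "--trend-threshold", type=float, default=-0.5, help=help_text
        )
        parser.format_help()
    except (ValueError, TypeError):
        # Depending on the Python version, the bad format string is rejected
        # either when the argument is added or when the help is formatted.
        return False
    return True


escaped_ok = help_renders("Slope (%%/run) below which regression is flagged")
unescaped_ok = help_renders("Slope (%/run) below which regression is flagged")
```

Here `escaped_ok` should be true and `unescaped_ok` false, since `%/` is not a valid format specifier.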

olearycrew merged commit cadfce3 into main Apr 9, 2026
1 check passed

olearycrew merged 4 commits into main from scuttlebot/add-trend-flag

3 participants
