Add --trend flag to benchmark.py #266

Merged
olearycrew merged 4 commits into main from scuttlebot/add-trend-flag
Apr 9, 2026
Conversation

@ScuttleBot commented Apr 9, 2026

Closes #107

Wires RunTrendAnalyzer (merged in #104) into the CLI as a first-class flag.

Changes

  • --trend (boolean): run trend analysis after the benchmark completes, scanning the output directory for prior runs of the same model
  • --trend-window N (default 10): number of most recent runs to include in the OLS slope fit
  • --trend-threshold (default -0.5): slope (%/run) below which a regression is flagged
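
To make the window and threshold concrete, here is a minimal sketch of an OLS slope fit over the most recent runs, compared against the threshold. It is not the RunTrendAnalyzer code from #104. The names `ols_slope` and `is_regression` are hypothetical, and it assumes scores are percentages ordered oldest to newest.

```python
from typing import Sequence


def ols_slope(scores: Sequence[float]) -> float:
    """Least-squares slope of scores against run index 0..n-1 (units: %/run)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_regression(scores: Sequence[float], window: int = 10, threshold: float = -0.5) -> bool:
    """Hypothetical check: fit the last `window` scores and flag a slope below `threshold`."""
    recent = list(scores)[-window:]
    return ols_slope(recent) < threshold
```

With the defaults, a model losing one point per run (slope -1.0) would be flagged, while a flat series would not.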

Usage

# Basic: analyze trends with defaults
python benchmark.py --model anthropic/claude-sonnet-4 --trend

# Custom window and threshold
python benchmark.py --model anthropic/claude-sonnet-4 --trend --trend-window 5 --trend-threshold -1.0

The analysis runs after scores are logged but before upload, so regression warnings appear in the terminal output alongside the score summary.


🤖 This PR was opened by @olearycrew's OpenClaw bot. Please review carefully!

Copilot AI left a comment

Pull request overview

Adds first-class CLI support in scripts/benchmark.py to run post-benchmark trend regression detection using the existing RunTrendAnalyzer utility (from PR #104), aligning with issue #107.

Changes:

  • Adds --trend flag to run trend analysis after the benchmark finishes (and before upload).
  • Adds --trend-window and --trend-threshold options to configure the analysis window and regression detection threshold.
  • Wires RunTrendAnalyzer(...).run(model=args.model) into the post-run results flow.

Comment on lines +260 to +265
parser.add_argument(
    "--trend-window",
    type=int,
    default=10,
    help="Number of recent runs to include in trend analysis (default: 10)",
)
Copilot AI Apr 9, 2026

--trend-window accepts any int, including 0/negative values; with the current slicing (pts[-self.window:]) a value of 0 analyzes all runs and negative values produce surprising slices. Consider validating --trend-window as >= 2 (or >= 1, but then run() will always produce no output) and failing fast via parser.error(...) to avoid confusing behavior.
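
The slicing behavior described here, and the suggested fail-fast check, can be sketched as follows. This is an illustration rather than the code that was merged: `parse_args` is a hypothetical wrapper, and validating the window even when --trend is not passed is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser showing only the trend flags and the >= 2 check."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--trend", action="store_true")
    parser.add_argument("--trend-window", type=int, default=10)
    args = parser.parse_args(argv)
    if args.trend_window < 2:
        # parser.error prints usage plus the message to stderr and exits with status 2
        parser.error("--trend-window must be >= 2")
    return args


pts = [1, 2, 3, 4, 5]
assert pts[-0:] == pts            # -0 == 0, so a window of 0 keeps every run
assert pts[-(-2):] == [3, 4, 5]   # a window of -2 drops the two oldest runs instead
```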

Wire RunTrendAnalyzer into benchmark.py via a new --trend flag.
When passed, analyzes score trends for the benchmarked model after
results are written, logging regression/improvement before upload.

Additional flags --trend-window (default 10) and --trend-threshold
(default -0.5) allow tuning the analysis parameters.

Usage:
  python benchmark.py --model anthropic/claude-sonnet-4 --trend
  python benchmark.py --model anthropic/claude-sonnet-4 --trend --trend-window 5
- Fix help text: %%%%/run -> %/run (argparse doesn't need escaping)
- Validate --trend-window >= 2 to avoid confusing behavior
- Wrap trend analysis in try/except so failures don't abort upload
- Skip in_progress runs in lib_trend.py to avoid skewed regression detection
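
The last two fixes can be sketched roughly as below. This is a guess at the shape of the change, not the merged code: the `status` key on run records, and the helper names `completed_runs` and `run_trend_safely`, are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("benchmark")


def completed_runs(runs):
    """Drop runs still marked in_progress (the 'status' key is an assumed schema)."""
    return [r for r in runs if r.get("status") != "in_progress"]


def run_trend_safely(analyze, model):
    """Call analyze(model); log and swallow any failure so the caller can still upload."""
    try:
        analyze(model)
        return True
    except Exception:
        logger.exception("Trend analysis failed for %s; continuing", model)
        return False
```

Catching the broad `Exception` is deliberate in this sketch, since trend output is advisory and the upload step should not depend on it.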
@olearycrew force-pushed the scuttlebot/add-trend-flag branch from e1fe047 to 33b55d0 on April 9, 2026 16:50
@olearycrew merged commit cadfce3 into main on Apr 9, 2026
1 check passed
Development

Successfully merging this pull request may close these issues.

Add --trend flag to benchmark.py for post-run auto-analysis

3 participants