Merged
Pull request overview
Adds first-class CLI support in scripts/benchmark.py to run post-benchmark trend regression detection using the existing RunTrendAnalyzer utility (from PR #104), aligning with issue #107.
Changes:
- Adds a --trend flag to run trend analysis after the benchmark finishes (and before upload).
- Adds --trend-window and --trend-threshold options to configure the analysis window and regression detection threshold.
- Wires RunTrendAnalyzer(...).run(model=args.model) into the post-run results flow.
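A minimal sketch of how the three flags might be declared, assuming a plain argparse parser. `build_parser` is a hypothetical helper, and every help string except the `--trend-window` one quoted in the review is illustrative rather than taken from the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper; the real scripts/benchmark.py may be structured differently.
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument("--model", required=True, help="Model identifier to benchmark")
    parser.add_argument(
        "--trend",
        action="store_true",
        help="Run trend analysis after the benchmark finishes",
    )
    parser.add_argument(
        "--trend-window",
        type=int,
        default=10,
        help="Number of recent runs to include in trend analysis (default: 10)",
    )
    parser.add_argument(
        "--trend-threshold",
        type=float,
        default=-0.5,
        # argparse %-formats help strings, so a literal percent sign is written as %%.
        help="Slope in %%/run below which a regression is flagged (default: -0.5)",
    )
    return parser


args = build_parser().parse_args(
    ["--model", "anthropic/claude-sonnet-4", "--trend", "--trend-window", "5"]
)
print(args.trend, args.trend_window, args.trend_threshold)
```

Using `action="store_true"` keeps `--trend` opt-in, so existing invocations behave as before unless the flag is passed.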
Comment on lines +260 to +265
    parser.add_argument(
        "--trend-window",
        type=int,
        default=10,
        help="Number of recent runs to include in trend analysis (default: 10)",
    )
--trend-window accepts any int, including 0/negative values; with the current slicing (pts[-self.window:]) a value of 0 analyzes all runs and negative values produce surprising slices. Consider validating --trend-window as >= 2 (or >= 1, but then run() will always produce no output) and failing fast via parser.error(...) to avoid confusing behavior.
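The slicing behaviour described in this comment, and one way to fail fast, can be sketched as follows. The list `pts` and the helper `parse_args` are illustrative stand-ins, not code from lib_trend.py or benchmark.py, and the error message wording is an assumption.

```python
import argparse

pts = [1, 2, 3, 4, 5]

# window = 0: -0 == 0, so the slice starts at index 0 and keeps every run.
print(pts[-0:])     # [1, 2, 3, 4, 5]
# window = -1: the slice becomes pts[1:], which drops the OLDEST run instead.
print(pts[-(-1):])  # [2, 3, 4, 5]
# window = 3: the intended behaviour, the three most recent runs.
print(pts[-3:])     # [3, 4, 5]


def parse_args(argv):
    # Hypothetical wrapper showing validation right after parsing.
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument("--trend-window", type=int, default=10)
    args = parser.parse_args(argv)
    if args.trend_window < 2:
        # parser.error prints usage plus the message to stderr and exits with status 2.
        parser.error("--trend-window must be >= 2")
    return args
```

Validating after `parse_args` keeps `type=int` simple; a custom `type` callable raising `argparse.ArgumentTypeError` would be an alternative.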
Wire RunTrendAnalyzer into benchmark.py via a new --trend flag. When passed, analyzes score trends for the benchmarked model after results are written, logging regression/improvement before upload.

Additional flags --trend-window (default 10) and --trend-threshold (default -0.5) allow tuning the analysis parameters.

Usage:
    python benchmark.py --model anthropic/claude-sonnet-4 --trend
    python benchmark.py --model anthropic/claude-sonnet-4 --trend --trend-window 5
- Fix help text: %%%%/run -> %/run (argparse doesn't need escaping)
- Validate --trend-window >= 2 to avoid confusing behavior
- Wrap trend analysis in try/except so failures don't abort upload
- Skip in_progress runs in lib_trend.py to avoid skewed regression detection
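The last two fixes might look roughly like the sketch below. It assumes each run is a dict with a "status" field and that the analyzer is invoked as a callable taking `model=`; `completed_runs` and `run_trend_safely` are hypothetical names, and the real lib_trend.py schema may differ.

```python
import logging

logger = logging.getLogger("benchmark")


def completed_runs(runs):
    # Assumption: a run still being written carries status "in_progress".
    return [r for r in runs if r.get("status") != "in_progress"]


def run_trend_safely(analyze, model):
    """Call analyze(model=...); log instead of raising so upload can proceed."""
    try:
        analyze(model=model)
        return True
    except Exception:
        logger.exception("Trend analysis failed; continuing to upload")
        return False


def broken_analyzer(model):
    # Stand-in for an analyzer that fails, e.g. on unreadable history.
    raise RuntimeError("no history for " + model)


ok = run_trend_safely(broken_analyzer, "anthropic/claude-sonnet-4")
runs = [{"score": 80.0, "status": "done"}, {"score": 0.0, "status": "in_progress"}]
print(ok, completed_runs(runs))
```

Catching `Exception` rather than using a bare `except` lets `KeyboardInterrupt` and `SystemExit` still propagate.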
e1fe047 to 33b55d0
Closes #107
Wires RunTrendAnalyzer (merged in #104) into the CLI as a first-class flag.

Changes
- --trend (boolean): run trend analysis after benchmark completes, scanning the output directory for prior runs of the same model
- --trend-window N (default 10): how many recent runs to include in the OLS slope fit
- --trend-threshold (default -0.5): slope (%/run) below which regression is flagged

Usage
    python benchmark.py --model anthropic/claude-sonnet-4 --trend
    python benchmark.py --model anthropic/claude-sonnet-4 --trend --trend-window 5
The analysis runs after scores are logged but before upload, so regression warnings appear in the terminal output alongside the score summary.
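For readers unfamiliar with the slope test, here is a minimal sketch of an ordinary least-squares fit over the most recent runs. It assumes the x-axis is the run index and that "below" means a strict comparison; `ols_slope` and `is_regression` are hypothetical names and are not taken from RunTrendAnalyzer.

```python
def ols_slope(scores):
    """Least-squares slope of scores against run index 0..n-1."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(scores))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def is_regression(scores, window=10, threshold=-0.5):
    # Keep only the most recent `window` runs, then compare slope to threshold.
    recent = scores[-window:]
    return ols_slope(recent) < threshold


scores = [80.0, 79.0, 78.0, 77.0, 76.0]  # drops one point per run
print(ols_slope(scores))      # -1.0
print(is_regression(scores))  # True
```

With the defaults above, a model losing one percentage point per run is flagged, while a flat series is not.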
🤖 This PR was opened by @olearycrew's OpenClaw bot. Please review carefully!