feature/export benchmark aime#65

Merged
benjibc merged 11 commits into main from feature/export-benchmark-aime
Aug 12, 2025

@benjibc benjibc commented Aug 12, 2025

  • benchmarks: add export_benchmark, direct runner, suites.aime25 and wire AIME as aime25_low; add module runner
  • bench: robust auto-import of suites; explicit fallback import for known names
  • bench: ensure runtime annotations (drop future); fix EvaluationRow type check in suite
  • bench: set extra_body.reasoning_effort rather than extra_body.reasoning.effort for Fireworks
  • bench: consolidate AIME export to suites only; set num_runs=8 default
  • bench: add --max-tokens and --max-concurrency overrides; plumb into runner
  • add aime benchmark
  • bench: add GPQA exported benchmark suite; default low effort, num_runs=8
  • bench: export name aime25; default low effort with max_tokens=131000; keep num_runs=8
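
The parameter-related commits above (flat `extra_body.reasoning_effort`, the `max_tokens=131000` default, and the `--max-tokens` / `--max-concurrency` overrides) can be sketched roughly as follows. This is an illustrative sketch only, not the code merged in this PR: the helper names `build_completion_params`, `parse_overrides`, and `apply_overrides` are hypothetical, and the real runner may structure this differently.

```python
import argparse


def build_completion_params(effort="low", max_tokens=131000):
    """Hypothetical helper: assemble completion parameters for one run."""
    return {
        "max_tokens": max_tokens,
        # Flat key, as the PR describes: extra_body.reasoning_effort
        "extra_body": {"reasoning_effort": effort},
        # The nested form the PR moved away from would look like:
        # "extra_body": {"reasoning": {"effort": effort}}
    }


def parse_overrides(argv):
    """Hypothetical CLI parser exposing the two override flags named in the PR."""
    parser = argparse.ArgumentParser(description="illustrative benchmark flags")
    parser.add_argument("--max-tokens", type=int, default=None)
    parser.add_argument("--max-concurrency", type=int, default=None)
    return parser.parse_args(argv)


def apply_overrides(params, args):
    """Return a copy of params with any CLI-provided max_tokens applied."""
    merged = dict(params)
    if args.max_tokens is not None:
        merged["max_tokens"] = args.max_tokens
    return merged
```

Leaving both flags defaulted to `None` lets the suite's own defaults stand unless the caller explicitly overrides them. In this sketch `--max-concurrency` is only parsed, since how it is plumbed into the runner is not shown in the PR description.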

@benjibc benjibc force-pushed the feature/export-benchmark-aime branch from 56e6846 to ff5fae7 Compare August 12, 2025 07:01
…n eval_protocol/benchmarks/suites/aime25.py as source of truth
@benjibc benjibc merged commit a5e1479 into main Aug 12, 2025
5 of 7 checks passed
@benjibc benjibc deleted the feature/export-benchmark-aime branch August 12, 2025 07:36