Add AIME2025, GPQA, HealthBench evaluation_test suites; unify row-limiting via pytest flag; clean up examples #44

Merged
benjibc merged 3 commits into main from implement_aime_gpqa_health
Aug 10, 2025

Commits

Commits on Aug 10, 2025