
Add AIME2025, GPQA, HealthBench evaluation_test suites; unify row-limiting via pytest flag; clean up examples #44

Merged: benjibc merged 3 commits into main from implement_aime_gpqa_health on Aug 10, 2025
Conversation

Contributor

@benjibc benjibc commented Aug 9, 2025

TL;DR

We get a summarized printout at the end of each run:

EP Summary | suite=test_aime2025_pointwise model=fireworks_ai/accounts/fireworks/models/gpt-oss-120b agg=0.446 ci95=[0.383,0.509] runs=8 rows=240
EP Metrics | exact_match=0.446 ci95=[0.383,0.509]
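For reference, the agg/ci95 figures in a summary line like the one above can be computed with a sample mean and a normal-approximation confidence interval. This is an illustrative sketch, not eval_protocol's actual aggregation code:

```python
# Illustrative: mean and normal-approximation 95% CI over per-row scores,
# similar in spirit to the agg/ci95 figures in the EP Summary line.
# Not the actual eval_protocol implementation.
import math
from typing import List, Tuple


def mean_ci95(scores: List[float]) -> Tuple[float, float, float]:
    """Return (mean, ci_low, ci_high) for a list of scores in [0, 1]."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance; zero when there is only one score.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1) if n > 1 else 0.0
    half = 1.96 * math.sqrt(var / n)  # normal approximation
    return mean, mean - half, mean + half
```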
  • Motivation

    • Reimplement gpt-oss AIME2025 chat completion under eval_protocol using evaluation_test (pytest).
    • Add GPQA and HealthBench tests with clean adapters and no hidden globals.
    • Improve developer experience: single, discoverable control for dataset size; fast CI defaults; full-dataset toggle.
  • What changed

    • Tests
      • AIME2025: examples/aime2025_chat_completion/tests/test_evaluation.py
        • Uses Hugging Face JSONL URLs (opencompass/AIME2025) via input_dataset + a tiny dataset_adapter.
        • Scoring uses _extract_boxed_text + normalization.
      • GPQA: examples/gpqa/tests/test_evaluation.py
        • Loads official gpqa_diamond.csv via Hugging Face, builds prompts with A/B/C/D.
        • Stores ground-truth in a hidden system message (kept out of model prompt) for clean lookup.
      • HealthBench: examples/healthbench/tests/test_evaluation.py
        • Compact in-memory prompts and rubrics; demonstrates rubric-based scoring.
    • DX improvements
      • Central row limiting via a pytest plugin and a single flag:
        • --ep-max-rows=N or --ep-max-rows=all (sets EP_MAX_DATASET_ROWS internally)
        • The evaluation_test decorator applies this uniformly to both URL datasets and in-memory input_messages.
      • Default max_dataset_rows=2 in decorators for CI speed; change once via the flag to scale to full runs.
      • Fireworks via LiteLLM with fireworks_ai/ model prefix; no custom rollout needed.
    • Library changes
      • eval_protocol/pytest/evaluation_test.py: Apply centralized EP_MAX_DATASET_ROWS override to input_dataset and input_messages.
      • eval_protocol/pytest/plugin.py: new pytest plugin with --ep-max-rows flag.
      • pyproject.toml: registers the plugin via pytest entry points.
      • eval_protocol/common_utils.py: load_jsonl now streams from HTTP URLs (strict errors).
    • Cleanup
      • Removed unused YAML configs and sample JSONL files.
      • Deleted custom Fireworks rollout (default rollout now works with fireworks_ai/).
  • Developer experience

    • Quick (CI-friendly): all tests default to 2 rows
      • pytest -q examples/aime2025_chat_completion/tests/test_evaluation.py
      • pytest -q examples/gpqa/tests/test_evaluation.py
      • pytest -q examples/healthbench/tests/test_evaluation.py
    • Full dataset: one flag
      • pytest --ep-max-rows=all …
      • Or partial: pytest --ep-max-rows=50 …
    • Requirements: FIREWORKS_API_KEY set. HF datasets will be fetched directly.
  • Checklist

    • Tests pass with defaults
    • Full-dataset path validated
    • Removed unused files/configs
    • Docs updated in AIME README for pytest usage
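The row-limiting plugin described above can be sketched roughly as follows. The `--ep-max-rows` flag and `EP_MAX_DATASET_ROWS` environment variable are named in this PR; the function bodies are an illustrative reconstruction, not the actual contents of eval_protocol/pytest/plugin.py:

```python
# Sketch of a pytest plugin exposing a single --ep-max-rows flag that the
# evaluation_test decorator reads via EP_MAX_DATASET_ROWS. Flag and env var
# names are from the PR; the implementation details are assumptions.
import os


def pytest_addoption(parser):
    parser.addoption(
        "--ep-max-rows",
        action="store",
        default=None,
        help="Limit datasets to N rows, or 'all' for the full dataset.",
    )


def pytest_configure(config):
    value = config.getoption("--ep-max-rows")
    if value is None:
        return  # keep each decorator's own max_dataset_rows default
    if value != "all":
        int(value)  # validate early: must be an integer or "all"
    # The decorator interprets "all" as "no cap" and integers as the cap.
    os.environ["EP_MAX_DATASET_ROWS"] = value
```

With this in place, `pytest --ep-max-rows=50` applies the same 50-row cap to URL-backed datasets and in-memory `input_messages` alike.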

@benjibc benjibc marked this pull request as ready for review August 10, 2025 05:26
Comment on lines +155 to +161
# Treat multiple dataset paths as a single combined dataset rather than
# parameterizing over each path separately. This produces one summary
# that reflects the aggregate of all provided files (e.g., AIME I+II).
if input_dataset is not None:
    datasets: List[Optional[List[DatasetPathParam]]] = [input_dataset]  # type: ignore
else:
    datasets = [None]
Collaborator
is this always the case? Is there any time we want to separate rollouts to be per-dataset?

Comment on lines +374 to +375
# Optional: print and/or persist a summary artifact for CI
try:
Collaborator

nit, can we move all of this into a function. evaluation_test decorator code is getting huge in one function

@@ -15,16 +17,39 @@ def load_jsonl(file_path: str) -> List[Dict[str, Any]]:
Returns an empty list if the file is not found or if errors occur during parsing.
Collaborator

Suggested change
Returns an empty list if the file is not found or if errors occur during parsing.
Returns an empty list if the file is not found or if errors occur during parsing. Supports HTTP URLs and local file paths.
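For context, a URL-aware load_jsonl along the lines described in this PR might look like the following. This is a minimal sketch; the actual function in eval_protocol/common_utils.py may differ (e.g., in how local-file errors are handled):

```python
# Minimal sketch of a load_jsonl that streams from HTTP(S) URLs as well as
# local paths, as this PR describes. The real implementation in
# eval_protocol/common_utils.py may differ in details.
import json
import urllib.request
from typing import Any, Dict, List


def load_jsonl(file_path: str) -> List[Dict[str, Any]]:
    """Load JSONL records from a local path or an HTTP(S) URL."""
    if file_path.startswith(("http://", "https://")):
        # Strict errors for remote fetches: HTTP failures raise.
        with urllib.request.urlopen(file_path) as resp:
            lines = resp.read().decode("utf-8").splitlines()
    else:
        with open(file_path, "r", encoding="utf-8") as f:
            lines = f.read().splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```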

@dphuang2 dphuang2 left a comment

awesome, thanks for the pytest plugin too

@benjibc benjibc merged commit 55005a1 into main Aug 10, 2025
7 checks passed
@benjibc benjibc deleted the implement_aime_gpqa_health branch August 10, 2025 17:45