
Add AIME2025, GPQA, HealthBench evaluation_test suites; unify row-limiting via pytest flag; clean up examples #44

Merged: benjibc merged 3 commits into main from implement_aime_gpqa_health on Aug 10, 2025
Conversation

Contributor

@benjibc benjibc commented Aug 9, 2025

TL;DR

We get a summarized printout at the end of each run:

EP Summary | suite=test_aime2025_pointwise model=fireworks_ai/accounts/fireworks/models/gpt-oss-120b agg=0.446 ci95=[0.383,0.509] runs=8 rows=240
EP Metrics | exact_match=0.446 ci95=[0.383,0.509]
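For reference, the agg/ci95 figures in a summary line like the one above can be computed with a sample mean and a normal-approximation confidence interval. This is an illustrative sketch, not eval_protocol's actual aggregation code:

```python
# Illustrative: mean and normal-approximation 95% CI over per-row scores,
# similar in spirit to the agg/ci95 figures in the EP Summary line.
# Not the actual eval_protocol implementation.
import math
from typing import List, Tuple


def mean_ci95(scores: List[float]) -> Tuple[float, float, float]:
    """Return (mean, ci_low, ci_high) for a list of scores in [0, 1]."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance; zero when there is only one score.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1) if n > 1 else 0.0
    half = 1.96 * math.sqrt(var / n)  # normal approximation
    return mean, mean - half, mean + half
```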
  • Motivation

    • Reimplement gpt-oss AIME2025 chat completion under eval_protocol using evaluation_test (pytest).
    • Add GPQA and HealthBench tests with clean adapters and no hidden globals.
    • Improve developer experience: single, discoverable control for dataset size; fast CI defaults; full-dataset toggle.
  • What changed

    • Tests
      • AIME2025: examples/aime2025_chat_completion/tests/test_evaluation.py
        • Uses Hugging Face JSONL URLs (opencompass/AIME2025) via input_dataset + a tiny dataset_adapter.
        • Scoring uses _extract_boxed_text + normalization.
      • GPQA: examples/gpqa/tests/test_evaluation.py
        • Loads official gpqa_diamond.csv via Hugging Face, builds prompts with A/B/C/D.
        • Stores ground-truth in a hidden system message (kept out of model prompt) for clean lookup.
      • HealthBench: examples/healthbench/tests/test_evaluation.py
        • Compact in-memory prompts and rubrics; demonstrates rubric-based scoring.
    • DX improvements
      • Central row limiting via a pytest plugin and a single flag:
        • --ep-max-rows=N or --ep-max-rows=all (sets EP_MAX_DATASET_ROWS internally)
        • The evaluation_test decorator applies this uniformly to both URL datasets and in-memory input_messages.
      • Default max_dataset_rows=2 in decorators for CI speed; change once via the flag to scale to full runs.
      • Fireworks via LiteLLM with fireworks_ai/ model prefix; no custom rollout needed.
    • Library changes
      • eval_protocol/pytest/evaluation_test.py: Apply centralized EP_MAX_DATASET_ROWS override to input_dataset and input_messages.
      • eval_protocol/pytest/plugin.py: new pytest plugin with --ep-max-rows flag.
      • pyproject.toml: registers the plugin via pytest entry points.
      • eval_protocol/common_utils.py: load_jsonl now streams from HTTP URLs (strict errors).
    • Cleanup
      • Removed unused YAML configs and sample JSONL files.
      • Deleted custom Fireworks rollout (default rollout now works with fireworks_ai/).
  • Developer experience

    • Quick (CI-friendly): all tests default to 2 rows
      • pytest -q examples/aime2025_chat_completion/tests/test_evaluation.py
      • pytest -q examples/gpqa/tests/test_evaluation.py
      • pytest -q examples/healthbench/tests/test_evaluation.py
    • Full dataset: one flag
      • pytest --ep-max-rows=all …
      • Or partial: pytest --ep-max-rows=50 …
    • Requirements: FIREWORKS_API_KEY set. HF datasets will be fetched directly.
  • Checklist

    • Tests pass with defaults
    • Full-dataset path validated
    • Removed unused files/configs
    • Docs updated in AIME README for pytest usage
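The row-limiting plugin described above can be sketched roughly as follows. The `--ep-max-rows` flag and `EP_MAX_DATASET_ROWS` environment variable are named in this PR; the function bodies are an illustrative reconstruction, not the actual contents of eval_protocol/pytest/plugin.py:

```python
# Sketch of a pytest plugin exposing a single --ep-max-rows flag that the
# evaluation_test decorator reads via EP_MAX_DATASET_ROWS. Flag and env var
# names are from the PR; the implementation details are assumptions.
import os


def pytest_addoption(parser):
    parser.addoption(
        "--ep-max-rows",
        action="store",
        default=None,
        help="Limit datasets to N rows, or 'all' for the full dataset.",
    )


def pytest_configure(config):
    value = config.getoption("--ep-max-rows")
    if value is None:
        return  # keep each decorator's own max_dataset_rows default
    if value != "all":
        int(value)  # validate early: must be an integer or "all"
    # The decorator interprets "all" as "no cap" and integers as the cap.
    os.environ["EP_MAX_DATASET_ROWS"] = value
```

With this in place, `pytest --ep-max-rows=50` applies the same 50-row cap to URL-backed datasets and in-memory `input_messages` alike.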

@benjibc benjibc marked this pull request as ready for review August 10, 2025 05:26
Comment on lines +155 to +161
# Treat multiple dataset paths as a single combined dataset rather than
# parameterizing over each path separately. This produces one summary
# that reflects the aggregate of all provided files (e.g., AIME I+II).
if input_dataset is not None:
    datasets: List[Optional[List[DatasetPathParam]]] = [input_dataset]  # type: ignore
else:
    datasets = [None]
Collaborator
is this always the case? Is there any time we want to separate rollouts to be per-dataset?

Comment on lines +374 to +375
# Optional: print and/or persist a summary artifact for CI
try:
Collaborator

nit, can we move all of this into a function. evaluation_test decorator code is getting huge in one function

@@ -15,16 +17,39 @@ def load_jsonl(file_path: str) -> List[Dict[str, Any]]:
Returns an empty list if the file is not found or if errors occur during parsing.
Collaborator

Suggested change
Returns an empty list if the file is not found or if errors occur during parsing.
Returns an empty list if the file is not found or if errors occur during parsing. Supports HTTP URLs and local file paths.
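For context, a URL-aware load_jsonl along the lines described in this PR might look like the following. This is a minimal sketch; the actual function in eval_protocol/common_utils.py may differ (e.g., in how local-file errors are handled):

```python
# Minimal sketch of a load_jsonl that streams from HTTP(S) URLs as well as
# local paths, as this PR describes. The real implementation in
# eval_protocol/common_utils.py may differ in details.
import json
import urllib.request
from typing import Any, Dict, List


def load_jsonl(file_path: str) -> List[Dict[str, Any]]:
    """Load JSONL records from a local path or an HTTP(S) URL."""
    if file_path.startswith(("http://", "https://")):
        # Strict errors for remote fetches: HTTP failures raise.
        with urllib.request.urlopen(file_path) as resp:
            lines = resp.read().decode("utf-8").splitlines()
    else:
        with open(file_path, "r", encoding="utf-8") as f:
            lines = f.read().splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```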

@dphuang2 dphuang2 left a comment

awesome, thanks for the pytest plugin too

@benjibc benjibc merged commit 55005a1 into main Aug 10, 2025
7 checks passed
@benjibc benjibc deleted the implement_aime_gpqa_health branch August 10, 2025 17:45