…iting via pytest flag; clean up examples
xzrderek
approved these changes
Aug 10, 2025
dphuang2
reviewed
Aug 10, 2025
Comment on lines +155 to +161
    # Treat multiple dataset paths as a single combined dataset rather than
    # parameterizing over each path separately. This produces one summary
    # that reflects the aggregate of all provided files (e.g., AIME I+II).
    if input_dataset is not None:
        datasets: List[Optional[List[DatasetPathParam]]] = [input_dataset]  # type: ignore
    else:
        datasets = [None]
Collaborator
Is this always the case? Is there ever a time we want separate rollouts per dataset?
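To make the question concrete, here is a minimal sketch of the two behaviours being compared: combining all paths into one parameter set versus parameterizing over each path. The function name `build_dataset_params`, the `combine` switch, and the file names are hypothetical and not part of the PR; paths are plain strings here rather than `DatasetPathParam`.

```python
from typing import List, Optional


def build_dataset_params(
    input_dataset: Optional[List[str]],
    combine: bool = True,  # hypothetical switch, not part of the PR
) -> List[Optional[List[str]]]:
    """Return the list of dataset parameter sets a test would run over."""
    if input_dataset is None:
        return [None]
    if combine:
        # One parameter set holding every path: one aggregate summary.
        return [list(input_dataset)]
    # One parameter set per path: one summary per dataset.
    return [[path] for path in input_dataset]


paths = ["aime_part_1.jsonl", "aime_part_2.jsonl"]  # hypothetical file names
print(build_dataset_params(paths))
print(build_dataset_params(paths, combine=False))
```

With `combine=True` the two files are evaluated as a single run; with `combine=False` each file becomes its own run, which is the per-dataset separation the comment asks about.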
dphuang2
reviewed
Aug 10, 2025
Comment on lines +374 to +375
    # Optional: print and/or persist a summary artifact for CI
    try:
Collaborator
Nit: can we move all of this into a function? The evaluation_test decorator code is getting huge in one function.
dphuang2
reviewed
Aug 10, 2025
eval_protocol/common_utils.py
Outdated
@@ -15,16 +17,39 @@ def load_jsonl(file_path: str) -> List[Dict[str, Any]]:
    Returns an empty list if the file is not found or if errors occur during parsing.
Collaborator
Suggested change
- Returns an empty list if the file is not found or if errors occur during parsing.
+ Returns an empty list if the file is not found or if errors occur during parsing. Supports HTTP URLs and local file paths.
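As a rough illustration of the behaviour this docstring describes, the sketch below reads JSONL from either an HTTP(S) URL or a local path. It is not the PR's implementation: the name `load_jsonl_sketch` is hypothetical, and the split between strict errors for URLs and an empty list for local failures is an assumption based on the docstring and the PR description.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def load_jsonl_sketch(source: str) -> List[Dict[str, Any]]:
    """Load JSONL records from an HTTP(S) URL or a local file path."""
    if source.startswith(("http://", "https://")):
        records: List[Dict[str, Any]] = []
        # Remote source: iterate the response line by line and let network
        # or parse errors propagate to the caller (assumed "strict" mode).
        with urlopen(source) as response:
            for raw_line in response:
                line = raw_line.decode("utf-8").strip()
                if line:
                    records.append(json.loads(line))
        return records
    # Local source: return [] when the file is missing or a line is invalid,
    # matching the documented behaviour.
    try:
        with open(source, "r", encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
    except (FileNotFoundError, json.JSONDecodeError):
        return []
```

Blank lines are skipped in both branches, so a trailing newline at the end of the file does not produce a parse error.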
dphuang2
approved these changes
Aug 10, 2025
Collaborator
dphuang2
left a comment
awesome, thanks for the pytest plugin too
TL;DR
We get a summarized printout at the end.
Motivation
What changed
- examples/aime2025_chat_completion/tests/test_evaluation.py
- … opencompass/AIME2025) via input_dataset + a tiny dataset_adapter.
- … _extract_boxed_text + normalization.
- examples/gpqa/tests/test_evaluation.py
- … gpqa_diamond.csv via Hugging Face, builds prompts with A/B/C/D.
- examples/healthbench/tests/test_evaluation.py
- … --ep-max-rows=N or --ep-max-rows=all (sets EP_MAX_DATASET_ROWS internally)
- … input_messages.
- … max_dataset_rows=2 in decorators for CI speed; change once via the flag to scale to full runs.
- … fireworks_ai/ model prefix; no custom rollout needed.
- eval_protocol/pytest/evaluation_test.py: Apply centralized EP_MAX_DATASET_ROWS override to input_dataset and input_messages.
- eval_protocol/pytest/plugin.py: new pytest plugin with --ep-max-rows flag.
- pyproject.toml: registers the plugin via pytest entry points.
- eval_protocol/common_utils.py: load_jsonl now streams from HTTP URLs (strict errors).
- … fireworks_ai/).
Developer experience
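A flag such as --ep-max-rows that sets EP_MAX_DATASET_ROWS could be wired up roughly as follows. This is a sketch and not the contents of eval_protocol/pytest/plugin.py: the helper `normalize_max_rows`, the option group name, and the choice to store `all` as the literal string "all" are assumptions. Only the `pytest_addoption` and `pytest_configure` hooks and the flag and variable names come from pytest and the PR description.

```python
import os
from typing import Optional


def normalize_max_rows(value: Optional[str]) -> Optional[str]:
    """Validate the flag value; return None when the flag was not given."""
    if value is None:
        return None
    text = value.strip().lower()
    if text == "all":
        return "all"  # assumed sentinel meaning "no row limit"
    rows = int(text)  # raises ValueError on non-numeric input
    if rows <= 0:
        raise ValueError("--ep-max-rows must be a positive integer or 'all'")
    return str(rows)


def pytest_addoption(parser):
    """pytest hook: register the command-line flag."""
    group = parser.getgroup("eval-protocol")  # group name is an assumption
    group.addoption(
        "--ep-max-rows",
        action="store",
        default=None,
        help="Limit dataset rows: an integer N, or 'all' for no limit.",
    )


def pytest_configure(config):
    """pytest hook: export the flag so test code can read one central value."""
    normalized = normalize_max_rows(config.getoption("--ep-max-rows"))
    if normalized is not None:
        os.environ["EP_MAX_DATASET_ROWS"] = normalized
```

pytest discovers installed plugins through the `pytest11` entry-point group, which is the mechanism a pyproject.toml registration relies on. Exporting the value through an environment variable keeps the override in one place, so decorators can keep a small default and still be scaled up from the command line.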
pytest -q examples/aime2025_chat_completion/tests/test_evaluation.py -q
pytest -q examples/gpqa/tests/test_evaluation.py -q
pytest -q examples/healthbench/tests/test_evaluation.py -q
pytest --ep-max-rows=all …
pytest --ep-max-rows=50 …
… FIREWORKS_API_KEY set. HF datasets will be fetched directly.
Checklist