[evals] bump eval runner to v1.3, add v1.3 results placeholder#18
[evals] bump eval runner to v1.3, add v1.3 results placeholder#18Obsidian68 merged 1 commit intofeat/integrationfrom
Conversation
- Update RESULTS_PATH from v1.2.json to v1.3.json - Update version string from 1.2.0 to 1.3.0 - Add evals/results/v1.3.json placeholder (populated on integration) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 85afc24321
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| GOLDEN_SET_PATH = Path(__file__).parent / "golden_set" / "queries.jsonl" | ||
| RESULTS_PATH = Path(__file__).parent / "results" / "v1.2.json" | ||
| RESULTS_PATH = Path(__file__).parent / "results" / "v1.3.json" |
There was a problem hiding this comment.
Keep sweep version in sync with the v1.3 results target
Changing RESULTS_PATH to v1.3.json here also affects evals/sweep.py because it imports this constant (from evals.runner import ... RESULTS_PATH), but sweep still hardcodes "version": "1.2.0" in its output. After this commit, running the sweep will write a 1.2.0 payload into the v1.3 results file, which can corrupt versioned eval artifacts and downstream comparisons that trust file/version consistency.
Useful? React with 👍 / 👎.
Summary
evals/runner.py: RESULTS_PATH changed fromv1.2.jsontov1.3.json, version string changed from"1.2.0"to"1.3.0"evals/results/v1.3.jsonplaceholder with version"1.3.0", zero metrics, empty sweep_resultsKnown limitations
evals/sweep.pyline 152 still has"version": "1.2.0"— out of scope for this branch, needs updating on integrationTest plan
uv run pytest— 277 tests passuv run ruff check evals/— lint cleanuv run ruff format --check evals/— format cleanevals/results/v1.3.jsoncontains valid JSON with version"1.3.0"python -m evals.runnerruns end-to-end against live v1.3 server (requires running server)🤖 Generated with Claude Code