ci: Improve autoevals CI and fix failing Python CI #195
Abhijeet Prasad (AbhiPrasad) merged 14 commits into `main` from `abhi-autoevals-ci-improvements` on Jun 8, 2026.
Commits
- `6994a34` ci: add Node 24 to JS workflow (AbhiPrasad)
- `284cc22` ci: improve lint workflow (AbhiPrasad)
- `28b0a5d` ci: improve Python workflow coverage (AbhiPrasad)
- `304f956` chore(ci): update workflow checkout action (AbhiPrasad)
- `0ed2a93` ci: use uv for Python installs (AbhiPrasad)
- `9fdaa9c` test: support updated braintrust oai API (AbhiPrasad)
- `b2d1a58` ci: migrate Python project to uv (AbhiPrasad)
- `da38557` chore: AGENTS.md it up (AbhiPrasad)
- `6b09c2e` dev engines remove to align with js (AbhiPrasad)
- `60af5b2` docs: update Python publishing commands for uv (AbhiPrasad)
- `cc86444` chore: remove redundant setup.py (AbhiPrasad)
- `f6ca1e3` pin versions (AbhiPrasad)
- `32890f1` fix when run (AbhiPrasad)
- `4ad3a13` fix: add missing pytest plugins (AbhiPrasad)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -3,6 +3,7 @@ name: Enforce pnpm | ||
| on: | ||
| pull_request: | ||
| push: | ||
| branches: [main] | ||
| jobs: | ||
| reject-npm-lockfile: | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,107 @@ | ||
| # AGENTS.md | ||
| This file provides guidance to coding agents when working with code in this repository. `AGENTS.md` is the source of truth; `CLAUDE.md` is a symlink for compatibility. | ||
| ## Project Overview | ||
| Autoevals is a dual-language library (TypeScript + Python) for evaluating AI model outputs. It provides LLM-as-a-judge evaluations, heuristic scorers (Levenshtein distance), and statistical metrics (BLEU). Developed by Braintrust. | ||
| ## Commands | ||
| ### TypeScript (in root directory) | ||
| ```bash | ||
| pnpm install --frozen-lockfile # Install dependencies | ||
| pnpm run build # Build JS (outputs to jsdist/) | ||
| pnpm run test # Run all JS tests with vitest | ||
| pnpm run test -- js/llm.test.ts # Run single test file | ||
| pnpm run test -- -t "test name" # Run specific test by name | ||
| ``` | ||
| ### Python (from root directory) | ||
| Python dependency management uses `uv` and the project metadata in `pyproject.toml`/`uv.lock`. | ||
| ```bash | ||
| make develop # Set up .venv with dev + scipy extras and install pre-commit | ||
| source env.sh # Activate the .venv | ||
| uv sync --extra dev --extra scipy # Sync local dev dependencies | ||
| uv sync --all-extras # Sync all optional dependency groups (CI-style) | ||
| uv run --extra dev --extra scipy pytest # Run Python tests | ||
| uv run --extra dev --extra scipy pytest py/autoevals/test_llm.py # Run single test file | ||
| uv run --extra dev --extra scipy pytest -k "test_name" # Run tests matching pattern | ||
| uv run --all-extras python -m build # Build Python package | ||
| uv run --all-extras python -m twine check dist/* # Check package metadata | ||
| ``` | ||
| ### Linting | ||
| ```bash | ||
| uv run --extra dev pre-commit run --all-files # Run all linters (black, ruff, prettier, codespell) | ||
| pre-commit run --all-files # Also works after make develop/source env.sh | ||
| make fixup # Same as above | ||
| ``` | ||
| ## Architecture | ||
| ### Dual Implementation Pattern | ||
| The library maintains parallel implementations in TypeScript (`js/`) and Python (`py/autoevals/`). Both share: | ||
| - The same evaluation templates (`templates/*.yaml`) | ||
| - The same `Score` interface: `{name, score (0-1), metadata}` | ||
| - The same scorer names and behavior | ||
| ### Key Modules (both languages) | ||
| - `llm.ts` / `llm.py` - LLM-as-a-judge scorers (Factuality, Battle, ClosedQA, Humor, Security, Sql, Summary, Translation) | ||
| - `ragas.ts` / `ragas.py` - RAG evaluation metrics (ContextRelevancy, Faithfulness, AnswerRelevancy, etc.) | ||
| - `string.ts` / `string.py` - Text similarity (Levenshtein, EmbeddingSimilarity) | ||
| - `json.ts` / `json.py` - JSON validation and diff | ||
| - `oai.ts` / `oai.py` - OpenAI client wrapper with caching | ||
| - `score.ts` / `score.py` - Core Score type and Scorer base class | ||
| ### Template System | ||
| YAML templates in `templates/` define LLM classifier prompts. Templates use Mustache syntax (`{{variable}}`). The `LLMClassifier` class loads these templates and handles: | ||
| - Prompt rendering with chain-of-thought (CoT) suffix | ||
| - Tool-based response parsing via `select_choice` function | ||
| - Score mapping from choice letters to numeric scores | ||
| ### Python Scorer Pattern | ||
| ```python | ||
| class Scorer(ABC): | ||
|     def eval(self, output, expected=None, **kwargs) -> Score  # Sync | ||
|     async def eval_async(self, output, expected=None, **kwargs)  # Async | ||
|     def __call__(...)  # Alias for eval() | ||
| ``` | ||
| ### TypeScript Scorer Pattern | ||
| ```typescript | ||
| type Scorer<Output, Extra> = ( | ||
|   args: ScorerArgs<Output, Extra>, | ||
| ) => Score | Promise<Score>; | ||
| // All scorers are async functions | ||
| ``` | ||
| ## CI and Releases | ||
| - Publishing is handled by trusted publishing workflows documented in `docs/PUBLISHING.md`. | ||
| - JavaScript and Python package versions must stay in sync between `package.json` and `py/autoevals/version.py`; CI enforces this via `.github/workflows/version-sync.yaml` and `.github/scripts/check_version_sync.py`. | ||
| ## Environment Variables | ||
| Tests require: | ||
| - `OPENAI_API_KEY` or `BRAINTRUST_API_KEY` - For LLM-based evaluations | ||
| - `OPENAI_BASE_URL` (optional) - Custom API endpoint | ||
| ## Testing Notes | ||
| - Python tests use `pytest` with `respx` for HTTP mocking | ||
| - TypeScript tests use `vitest` with `msw` for HTTP mocking | ||
| - Tests that call real LLM APIs need valid API keys | ||
| - Test files are colocated: `test_*.py` (Python), `*.test.ts` (TypeScript) |
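
The "Python Scorer Pattern" in the AGENTS.md diff is only an outline, so a minimal runnable sketch may help show how `eval`, `eval_async`, and `__call__` relate. Everything here is an assumption made for illustration: the `Score` dataclass is a stand-in for the `{name, score (0-1), metadata}` shape described in the diff, and the `_score` hook and `ExactMatch` scorer are hypothetical names, not the library's actual implementation.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Score:
    """Hypothetical stand-in for the Score shape: {name, score (0-1), metadata}."""

    name: str
    score: Optional[float]
    metadata: dict = field(default_factory=dict)


class Scorer(ABC):
    """Illustrative base class only; the real class may be structured differently."""

    @abstractmethod
    def _score(self, output: Any, expected: Any = None, **kwargs: Any) -> Score:
        """Hypothetical hook that subclasses implement."""

    def eval(self, output: Any, expected: Any = None, **kwargs: Any) -> Score:
        # Synchronous entry point.
        return self._score(output, expected, **kwargs)

    async def eval_async(self, output: Any, expected: Any = None, **kwargs: Any) -> Score:
        # This sketch has no I/O, so the async path simply reuses the sync hook.
        return self._score(output, expected, **kwargs)

    def __call__(self, output: Any, expected: Any = None, **kwargs: Any) -> Score:
        # Alias for eval(), as described in the diff.
        return self.eval(output, expected, **kwargs)


class ExactMatch(Scorer):
    """Hypothetical heuristic scorer: 1.0 when output equals expected, else 0.0."""

    def _score(self, output: Any, expected: Any = None, **kwargs: Any) -> Score:
        return Score(name="ExactMatch", score=1.0 if output == expected else 0.0)


scorer = ExactMatch()
sync_result = scorer("Paris", expected="Paris")
async_result = asyncio.run(scorer.eval_async("Paris", expected="London"))
```

Calling the scorer directly and calling `eval()` go through the same code path in this sketch, which is one straightforward way to satisfy the "alias" note in the outline.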
I think this means fork PRs would fail, since they don't have access to our secrets. Not sure if that's a big deal, but I wanted to point it out just in case.
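
One common way to handle the concern raised in this comment is to skip tests that need real credentials when none are present, since GitHub does not expose repository secrets to workflows triggered by pull requests from forks. The sketch below is only an illustration and is not confirmed to be what this PR does: `has_llm_credentials`, `requires_llm_credentials`, and `LiveLLMTest` are hypothetical names, and it uses the standard library's `unittest` (which `pytest` can also collect) to stay self-contained. The variable names come from the "Environment Variables" section of the AGENTS.md diff.

```python
import os
import unittest
from typing import Mapping

# Variable names taken from the "Environment Variables" section of AGENTS.md.
CREDENTIAL_VARS = ("OPENAI_API_KEY", "BRAINTRUST_API_KEY")


def has_llm_credentials(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when at least one credential variable is set and non-empty."""
    return any(env.get(name) for name in CREDENTIAL_VARS)


# Hypothetical decorator: skips a test when no credentials are available,
# which is the situation a fork PR would be in.
requires_llm_credentials = unittest.skipUnless(
    has_llm_credentials(),
    "no OPENAI_API_KEY or BRAINTRUST_API_KEY available (e.g. a fork PR)",
)


class LiveLLMTest(unittest.TestCase):
    """Hypothetical test case standing in for tests that call real LLM APIs."""

    @requires_llm_credentials
    def test_live_call(self) -> None:
        # A real test would call the API here; this placeholder only marks the spot.
        self.assertTrue(has_llm_credentials())
```

With a guard like this, tests that rely on HTTP mocking would still run on fork PRs, while tests that call real APIs would be reported as skipped rather than failed.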