A lightweight system to run, compare, and visualize evaluation suites for Large Language Models.
## Quick Start

```bash
# Install dependencies
uv sync

# Run a single suite
uv run python llm_eval.py --suite datasets/examples/basic.yaml

# Run all suites
uv run python llm_eval.py --all-suites

# Run with a specific model
uv run python llm_eval.py --all-suites -m gpt-4o

# Run with a system prompt
uv run python llm_eval.py --suite datasets/examples/basic.yaml --system-prompt example
```

## Usage

```
python llm_eval.py [OPTIONS]

Options:
  -s, --suite PATH             Path to a YAML suite file
  -a, --all-suites             Run all suites in datasets/examples/
  --suites-dir PATH            Custom directory for suites (used with --all-suites)
  -m, --model MODEL            Model to evaluate (repeatable; default: gpt-4o-mini)
  --system-prompt NAME         System prompt name (e.g., 'example')
  --system-prompt-version V    Specific version (e.g., 'v1'); defaults to latest
  -l, --list                   List stored runs
  -c, --compare BASE CURR      Compare two runs by ID
```

## Project Structure

```
├── llm_eval.py            # CLI entry point
├── datasets/examples/     # Evaluation suite YAML files
├── system_prompts/        # Versioned system prompts
├── src/
│   ├── api/               # FastAPI server for web UI
│   ├── clients/           # Model clients (OpenAI, etc.)
│   ├── runner/            # Evaluation runner and comparison
│   ├── scorers/           # Scoring strategies (rules, etc.)
│   ├── store/             # Result storage (local JSON)
│   ├── prompts/           # System prompt management
│   └── utils/             # Utilities (git, etc.)
├── web/                   # React frontend
└── tests/                 # Test suite
```

## Evaluation Suites

Suites are YAML files that define test cases:

```yaml
id: basic
version: "1.0"
title: Basic
description: Quick-start suite with one case from each category
cases:
  - id: addition-simple
    category: instruction-following
    prompt: "What is 2 + 2? Answer with just the number."
    expected:
      contains: "4"
      max_length: 10
```

### Scoring Rules

Each case's `expected` block combines simple rules (a scoring sketch follows the list):

- `contains` / `not_contains`: String matching
- `contains_any` / `contains_all`: Multiple strings
- `max_length` / `min_length`: Response length
- `valid_json`: JSON parsing
- `json_has_keys`: Required JSON keys
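For orientation, here is a minimal sketch of how these rules could be checked against a model response. The rule names come from the list above, but the function, its signature, and the pass/fail combination logic are illustrative assumptions; the project's real scorers live in `src/scorers/`.

```python
import json

def check_rules(response: str, expected: dict) -> dict[str, bool]:
    """Evaluate a response against a case's `expected` rules.

    Illustrative sketch only; the actual scorer interface in
    src/scorers/ may differ.
    """
    results: dict[str, bool] = {}
    for rule, arg in expected.items():
        if rule == "contains":
            results[rule] = arg in response
        elif rule == "not_contains":
            results[rule] = arg not in response
        elif rule == "contains_any":
            results[rule] = any(s in response for s in arg)
        elif rule == "contains_all":
            results[rule] = all(s in response for s in arg)
        elif rule == "max_length":
            results[rule] = len(response) <= arg
        elif rule == "min_length":
            results[rule] = len(response) >= arg
        elif rule == "valid_json":
            results[rule] = _parses_as_json(response)
        elif rule == "json_has_keys":
            results[rule] = _parses_as_json(response) and all(
                k in json.loads(response) for k in arg
            )
    return results

def _parses_as_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Assuming a case passes only when every rule holds:
# all(check_rules("4", {"contains": "4", "max_length": 10}).values())  # True
```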
## System Prompts

Store versioned system prompts in `system_prompts/`:

```
system_prompts/
├── example-v1.txt
├── assistant-v1.txt
└── assistant-v2.txt
```

Select a prompt with `--system-prompt example` (the latest version is used by default) and pin a specific version with `--system-prompt-version v1`.
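The `-vN` filename suffix makes "latest" resolution a simple directory scan. Here is a hedged sketch of how that lookup could work; the function name and behavior are assumptions, and the real logic lives in `src/prompts/`:

```python
import re
from pathlib import Path

def resolve_prompt(name: str, version: str | None = None,
                   root: Path = Path("system_prompts")) -> Path:
    """Find `<name>-<version>.txt`, defaulting to the highest version.

    Illustrative sketch; the actual resolver in src/prompts/ may differ.
    """
    pattern = re.compile(rf"^{re.escape(name)}-v(\d+)\.txt$")
    candidates = {
        int(m.group(1)): path
        for path in root.glob(f"{name}-v*.txt")
        if (m := pattern.match(path.name))
    }
    if not candidates:
        raise FileNotFoundError(f"no prompt files for {name!r} in {root}")
    key = int(version.lstrip("v")) if version else max(candidates)
    try:
        return candidates[key]
    except KeyError:
        raise FileNotFoundError(f"{name}-v{key}.txt not found in {root}") from None

# resolve_prompt("assistant")        -> system_prompts/assistant-v2.txt
# resolve_prompt("assistant", "v1")  -> system_prompts/assistant-v1.txt
```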
## Web UI

Start the API and frontend:

```bash
# Terminal 1: API server
uv run uvicorn src.api.server:app --reload

# Terminal 2: Frontend dev server
cd web && npm run dev
```

The dashboard shows:

- Suite cards with pass-rate charts
- Revision-based X-axis (r1, r2, r3, ...)
- Regression/improvement markers (thresholds sketched below):
  - 🟢 Green: +5% or more improvement
  - 🟠 Orange: -5% to -10% minor regression
  - 🔴 Red: -10% or worse major regression
- Tooltips with run metadata (model, commit, date)
- A featured "basic" suite card at full width
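The marker thresholds amount to a small classification of the pass-rate delta between two runs. Sketched in Python for brevity (the dashboard itself is React code in `web/`); the function name and the "no marker" band are assumptions:

```python
def change_marker(base_rate: float, curr_rate: float) -> str:
    """Classify a pass-rate change using the thresholds above.

    Rates are fractions in [0, 1]; the delta is in percentage points.
    Illustrative only.
    """
    delta = (curr_rate - base_rate) * 100
    if delta >= 5:
        return "green"   # +5% or more: improvement
    if delta <= -10:
        return "red"     # -10% or worse: major regression
    if delta <= -5:
        return "orange"  # -5% to -10%: minor regression
    return "none"        # assumed: changes within +/-5% get no marker

# change_marker(0.80, 0.90) -> "green"
# change_marker(0.80, 0.73) -> "orange"
# change_marker(0.80, 0.65) -> "red"
```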
## Result Storage

Runs are stored in `.eval_runs/` as JSON files. Each record includes (see the sketch after this list):

- Revision number (globally sequential)
- Git commit hash (auto-detected)
- Model and system prompt info
- Pass/fail results with scores
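Here is a minimal sketch of reading those records back for a quick summary. The key names (`revision`, `model`, `commit`, `results`, `passed`) are assumptions for illustration; check `src/store/` for the real schema:

```python
import json
from pathlib import Path

def summarize_runs(runs_dir: Path = Path(".eval_runs")) -> None:
    """Print a one-line summary per stored run (assumed key names)."""
    for path in sorted(runs_dir.glob("*.json")):
        run = json.loads(path.read_text())
        results = run.get("results", [])
        passed = sum(1 for r in results if r.get("passed"))
        print(
            f"r{run.get('revision', '?')} "
            f"model={run.get('model', 'unknown')} "
            f"commit={str(run.get('commit', 'n/a'))[:7]} "
            f"{passed}/{len(results)} passed"
        )

# summarize_runs()  # e.g. "r3 model=gpt-4o-mini commit=a1b2c3d 11/12 passed"
```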
## Configuration

Create a `.env` file in the project root with:

```
OPENAI_API_KEY=sk-...
```
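If you call the evaluator's dependencies directly (for example in an ad-hoc script), the key has to reach the environment first. A sketch assuming `python-dotenv` is available; the project itself may load `.env` another way:

```python
import os
from dotenv import load_dotenv  # python-dotenv; assumed, not confirmed by the project
from openai import OpenAI

load_dotenv()  # copies OPENAI_API_KEY from .env into the environment
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
client = OpenAI()  # the OpenAI SDK reads OPENAI_API_KEY from the environment by default
```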
## Development

```bash
# Run tests
uv run pytest tests/ -v

# Build frontend
cd web && npm run build
```

## License

MIT