A lightweight system to run, compare, and visualize evaluation suites for Large Language Models.
## Quick Start

```bash
# Install dependencies
uv sync

# Run a single suite
uv run python llm_eval.py --suite datasets/examples/basic.yaml

# Run all suites
uv run python llm_eval.py --all-suites

# Run with a specific model
uv run python llm_eval.py --all-suites -m gpt-4o

# Run with a system prompt
uv run python llm_eval.py --suite datasets/examples/basic.yaml --system-prompt example
```

## Usage

```
python llm_eval.py [OPTIONS]

Options:
  -s, --suite PATH             Path to a YAML suite file
  -a, --all-suites             Run all suites in datasets/examples/
  --suites-dir PATH            Custom directory for suites (used with --all-suites)
  -m, --model MODEL            Model to evaluate (repeatable; default: gpt-4o-mini)
  --system-prompt NAME         System prompt name (e.g., 'example')
  --system-prompt-version V    Specific version (e.g., 'v1'); defaults to latest
  -l, --list                   List stored runs
  -c, --compare BASE CURR      Compare two runs by ID
```

## Project Structure

```
├── llm_eval.py            # CLI entry point
├── datasets/examples/     # Evaluation suite YAML files
├── system_prompts/        # Versioned system prompts
├── src/
│   ├── api/               # FastAPI server for web UI
│   ├── clients/           # Model clients (OpenAI, etc.)
│   ├── runner/            # Evaluation runner and comparison
│   ├── scorers/           # Scoring strategies (rules, etc.)
│   ├── store/             # Result storage (local JSON)
│   ├── prompts/           # System prompt management
│   └── utils/             # Utilities (git, etc.)
├── web/                   # React frontend
└── tests/                 # Test suite
```

## Evaluation Suites

Suites are YAML files that define test cases:

```yaml
id: basic
version: "1.0"
title: Basic
description: Quick-start suite with one case from each category
cases:
  - id: addition-simple
    category: instruction-following
    prompt: "What is 2 + 2? Answer with just the number."
    expected:
      contains: "4"
      max_length: 10
```

### Scoring Rules

Each case's `expected` block combines simple rules (a scoring sketch follows the list):

- `contains` / `not_contains`: String matching
- `contains_any` / `contains_all`: Multiple strings
- `max_length` / `min_length`: Response length
- `valid_json`: JSON parsing
- `json_has_keys`: Required JSON keys
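For orientation, here is a minimal sketch of how these rules could be checked against a model response. The rule names come from the list above, but the function, its signature, and the pass/fail combination logic are illustrative assumptions; the project's real scorers live in `src/scorers/`.

```python
import json

def check_rules(response: str, expected: dict) -> dict[str, bool]:
    """Evaluate a response against a case's `expected` rules.

    Illustrative sketch only; the actual scorer interface in
    src/scorers/ may differ.
    """
    results: dict[str, bool] = {}
    for rule, arg in expected.items():
        if rule == "contains":
            results[rule] = arg in response
        elif rule == "not_contains":
            results[rule] = arg not in response
        elif rule == "contains_any":
            results[rule] = any(s in response for s in arg)
        elif rule == "contains_all":
            results[rule] = all(s in response for s in arg)
        elif rule == "max_length":
            results[rule] = len(response) <= arg
        elif rule == "min_length":
            results[rule] = len(response) >= arg
        elif rule == "valid_json":
            results[rule] = _parses_as_json(response)
        elif rule == "json_has_keys":
            results[rule] = _parses_as_json(response) and all(
                k in json.loads(response) for k in arg
            )
    return results

def _parses_as_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Assuming a case passes only when every rule holds:
# all(check_rules("4", {"contains": "4", "max_length": 10}).values())  # True
```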
## System Prompts

Store versioned system prompts in `system_prompts/`:

```
system_prompts/
├── example-v1.txt
├── assistant-v1.txt
└── assistant-v2.txt
```

Select a prompt with `--system-prompt example` (the latest version is used by default) and pin a specific version with `--system-prompt-version v1`.
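The `-vN` filename suffix makes "latest" resolution a simple directory scan. Here is a hedged sketch of how that lookup could work; the function name and behavior are assumptions, and the real logic lives in `src/prompts/`:

```python
import re
from pathlib import Path

def resolve_prompt(name: str, version: str | None = None,
                   root: Path = Path("system_prompts")) -> Path:
    """Find `<name>-<version>.txt`, defaulting to the highest version.

    Illustrative sketch; the actual resolver in src/prompts/ may differ.
    """
    pattern = re.compile(rf"^{re.escape(name)}-v(\d+)\.txt$")
    candidates = {
        int(m.group(1)): path
        for path in root.glob(f"{name}-v*.txt")
        if (m := pattern.match(path.name))
    }
    if not candidates:
        raise FileNotFoundError(f"no prompt files for {name!r} in {root}")
    key = int(version.lstrip("v")) if version else max(candidates)
    try:
        return candidates[key]
    except KeyError:
        raise FileNotFoundError(f"{name}-v{key}.txt not found in {root}") from None

# resolve_prompt("assistant")        -> system_prompts/assistant-v2.txt
# resolve_prompt("assistant", "v1")  -> system_prompts/assistant-v1.txt
```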
## Web UI

Start the API and frontend:

```bash
# Terminal 1: API server
uv run uvicorn src.api.server:app --reload

# Terminal 2: Frontend dev server
cd web && npm run dev
```

The dashboard shows:

- Suite cards with pass-rate charts
- Revision-based X-axis (r1, r2, r3, ...)
- Regression/improvement markers (thresholds sketched below):
  - 🟢 Green: +5% or more improvement
  - 🟠 Orange: -5% to -10% minor regression
  - 🔴 Red: -10% or worse major regression
- Tooltips with run metadata (model, commit, date)
- A featured "basic" suite card at full width
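The marker thresholds amount to a small classification of the pass-rate delta between two runs. Sketched in Python for brevity (the dashboard itself is React code in `web/`); the function name and the "no marker" band are assumptions:

```python
def change_marker(base_rate: float, curr_rate: float) -> str:
    """Classify a pass-rate change using the thresholds above.

    Rates are fractions in [0, 1]; the delta is in percentage points.
    Illustrative only.
    """
    delta = (curr_rate - base_rate) * 100
    if delta >= 5:
        return "green"   # +5% or more: improvement
    if delta <= -10:
        return "red"     # -10% or worse: major regression
    if delta <= -5:
        return "orange"  # -5% to -10%: minor regression
    return "none"        # assumed: changes within +/-5% get no marker

# change_marker(0.80, 0.90) -> "green"
# change_marker(0.80, 0.73) -> "orange"
# change_marker(0.80, 0.65) -> "red"
```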
## Result Storage

Runs are stored in `.eval_runs/` as JSON files. Each record includes (see the sketch after this list):

- Revision number (globally sequential)
- Git commit hash (auto-detected)
- Model and system prompt info
- Pass/fail results with scores
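Here is a minimal sketch of reading those records back for a quick summary. The key names (`revision`, `model`, `commit`, `results`, `passed`) are assumptions for illustration; check `src/store/` for the real schema:

```python
import json
from pathlib import Path

def summarize_runs(runs_dir: Path = Path(".eval_runs")) -> None:
    """Print a one-line summary per stored run (assumed key names)."""
    for path in sorted(runs_dir.glob("*.json")):
        run = json.loads(path.read_text())
        results = run.get("results", [])
        passed = sum(1 for r in results if r.get("passed"))
        print(
            f"r{run.get('revision', '?')} "
            f"model={run.get('model', 'unknown')} "
            f"commit={str(run.get('commit', 'n/a'))[:7]} "
            f"{passed}/{len(results)} passed"
        )

# summarize_runs()  # e.g. "r3 model=gpt-4o-mini commit=a1b2c3d 11/12 passed"
```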
## Configuration

Create a `.env` file in the project root with:

```
OPENAI_API_KEY=sk-...
```
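If you call the evaluator's dependencies directly (for example in an ad-hoc script), the key has to reach the environment first. A sketch assuming `python-dotenv` is available; the project itself may load `.env` another way:

```python
import os
from dotenv import load_dotenv  # python-dotenv; assumed, not confirmed by the project
from openai import OpenAI

load_dotenv()  # copies OPENAI_API_KEY from .env into the environment
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
client = OpenAI()  # the OpenAI SDK reads OPENAI_API_KEY from the environment by default
```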
## Development

```bash
# Run tests
uv run pytest tests/ -v

# Build frontend
cd web && npm run build
```

## License

MIT