Snowl

Snowl is a framework-agnostic evaluation engine for AI agents. Run any agent against any benchmark, get reproducible scores, and compare fairly — in 3 lines of code.

from snowl import quick_eval

result = quick_eval(
    agent=lambda prompt: "I cannot help with that.",
    benchmark="strongreject",
    limit=10,
)
print(f"Pass rate: {result.pass_rate:.0%}  Cost: {result.total_tokens} tokens")

Install

pip install snowl

With framework support:

pip install "snowl[qitos]"       # QitOS agents
pip install "snowl[langgraph]"   # LangGraph agents
pip install "snowl[openai]"      # OpenAI Agents SDK

30-Second Tour

Python API — evaluate any callable:

from snowl import quick_eval

# Evaluate a simple function
result = quick_eval(
    agent=lambda prompt: "hello",
    samples=[{"id": "s1", "input": "Say hi", "target": "hello"}],
    scorer="includes",
)

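# Evaluate against a built-in benchmark.
# my_async_fn stands in for your own agent (here, an async function
# that takes a prompt and returns a reply).
result = quick_eval(agent=my_async_fn, benchmark="bfcl", limit=50)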
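
The two calls above pass a plain function in one case and what looks like an async function in the other. How Snowl dispatches between them is not shown here, so the following is only a sketch of one way a harness could accept either kind of callable. call_agent is a hypothetical helper for illustration, not part of Snowl's API.

```python
import asyncio
import inspect


def call_agent(agent, prompt):
    """Call a sync or async agent and return its reply (hypothetical helper)."""
    reply = agent(prompt)
    # Calling an async function returns an awaitable; run it to completion.
    if inspect.isawaitable(reply):
        async def _await_reply():
            return await reply
        return asyncio.run(_await_reply())
    return reply


def sync_agent(prompt):
    return "hello"


async def async_agent(prompt):
    await asyncio.sleep(0)  # stand-in for a network call
    return "hello from async"


print(call_agent(sync_agent, "Say hi"))
print(call_agent(async_agent, "Say hi"))
```

Note that asyncio.run() cannot be called from inside an already running event loop, so a helper like this only suits synchronous entry points such as scripts.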

CLI — run full evaluation projects:

snowl bench list                                    # list benchmarks
snowl eval project.yml                              # run eval
snowl bench run strongreject --split test --limit 10  # run benchmark
snowl retry run-20260427T120000Z                    # retry failures

Why Snowl

Problem	Snowl's Answer
Agents are hard to plug into benchmarks	3-line `quick_eval()` + Adapter SDK (QitOS, LangGraph, OpenAI Agents)
Safety evaluation is shallow	Built-in scorers: canary leak, tool trace policy, workspace diff, command check, injection
Scores aren't comparable across models	Cost-aware scoring, separated verifier, working-time metrics
Runs are hard to reproduce	Deterministic Task × AgentVariant × Sample planning, full artifact trail
Terminal/GUI/web tasks behave differently	Phase-aware runtime with Docker sandbox, container isolation, AIMD flow control
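
To make the "deterministic Task × AgentVariant × Sample planning" row concrete, here is a minimal sketch of what deterministic planning can mean: the same inputs always yield the same ordered trial list with the same IDs. This is an illustration under assumptions, not Snowl's planner. plan_trials, the trial_id format, and the variant names are all hypothetical.

```python
import hashlib
from itertools import product


def plan_trials(tasks, variants, sample_ids):
    """Return the same ordered trial list no matter how the inputs are ordered."""
    plan = []
    # Sorting each axis first makes the cross product order-independent.
    for task, variant, sample in product(
        sorted(tasks), sorted(variants), sorted(sample_ids)
    ):
        key = f"{task}|{variant}|{sample}"
        # A content hash gives an ID that is stable across runs and machines.
        trial_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
        plan.append(
            {"trial_id": trial_id, "task": task, "variant": variant, "sample": sample}
        )
    return plan


plan_a = plan_trials(["strongreject", "bfcl"], ["agent-v2", "agent-v1"], ["s2", "s1"])
plan_b = plan_trials(["bfcl", "strongreject"], ["agent-v1", "agent-v2"], ["s1", "s2"])
print(len(plan_a), plan_a == plan_b)  # 8 True
```

Because IDs derive from content rather than from counters or timestamps, a retry can refer to exactly the trials that failed.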

Core Concepts

Task × Agent × Scorer → TrialOutcome → Aggregated Result

Task: what to evaluate (samples + environment spec)
Agent: what is being evaluated (any Python callable or Agent Protocol)
Scorer: how to judge (15+ built-in: includes, match, model-as-judge, canary, tool trace, ...)
Runtime: where it runs (local, Docker sandbox, GUI container)
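
The pipeline above can be sketched in plain Python. This is a toy model of the concepts, not Snowl's implementation: the TrialOutcome fields, run_task, and aggregate are hypothetical, and includes_scorer assumes that the "includes" scorer is a substring check. The sample format is the one used in the 30-Second Tour.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrialOutcome:
    sample_id: str
    output: str
    passed: bool


def includes_scorer(output: str, target: str) -> bool:
    """Pass when the target string appears in the agent output."""
    return target in output


def run_task(samples, agent: Callable[[str], str], scorer) -> list:
    # One trial per sample: call the agent, then judge its output.
    outcomes = []
    for sample in samples:
        output = agent(sample["input"])
        passed = scorer(output, sample["target"])
        outcomes.append(TrialOutcome(sample["id"], output, passed))
    return outcomes


def aggregate(outcomes) -> dict:
    passed = sum(o.passed for o in outcomes)
    rate = passed / len(outcomes) if outcomes else 0.0
    return {"n": len(outcomes), "pass_rate": rate}


samples = [
    {"id": "s1", "input": "Say hi", "target": "hello"},
    {"id": "s2", "input": "Say bye", "target": "goodbye"},
]
outcomes = run_task(samples, lambda prompt: "hello there", includes_scorer)
print(aggregate(outcomes))  # {'n': 2, 'pass_rate': 0.5}
```

Keeping the scorer separate from the agent means the same outputs can be re-judged with a different scorer without calling the agent again.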

Built-in Benchmarks

25+ benchmarks across safety and capability:

Category	Benchmarks
Safety	StrongReject, XSTest, AgentHarm, CoConot, FORTRESS, MASK, SevenLLM
Cybersecurity	WMDP, CyberMetric, SecQA, CyBench, CyberGym
Tool Use / Agent	BFCL, AgentDojo, AgentBench-OS, ToolEmu, IPI Coding Agent
Capability	GAIA, Tau-Bench, OSWorld, TerminalBench, SWE-Bench, HumanEval
Custom	JSONL, CSV adapters for your own datasets
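
For a custom dataset, one option that relies only on what is shown above is to load the file yourself and pass the records through the samples argument of quick_eval. The sketch below assumes a JSONL file whose records carry the id, input, and target fields from the 30-Second Tour. load_jsonl_samples is a hypothetical helper, not Snowl's built-in adapter, whose interface is not documented here.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("id", "input", "target")


def load_jsonl_samples(path):
    """Read one JSON object per line and keep the id/input/target fields."""
    samples = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = set(REQUIRED_FIELDS) - set(record)
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            samples.append({key: record[key] for key in REQUIRED_FIELDS})
    return samples


# Demo with a throwaway file; the file name is illustrative.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_dataset.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write('{"id": "s1", "input": "Say hi", "target": "hello"}\n')
    samples = load_jsonl_samples(path)

print(samples)
```

The resulting list has the same shape as the samples argument in the first tour example.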

What You Get From Each Run

Every run produces a self-contained artifact directory:

.snowl/runs/<run_id>/
  outcomes.json          # per-sample results
  aggregate.json         # summary metrics
  events.jsonl           # full event stream
  leaderboard_rows.jsonl # ranked results
  recovery.json          # retry state
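
Since the artifacts are plain JSON and JSONL files, they can be inspected with the standard library alone. The sketch below loads aggregate.json and events.jsonl from a run directory. The schema of these files is not described here, so load_run treats them as opaque JSON, and the demo writes placeholder content rather than real Snowl output.

```python
import json
import tempfile
from pathlib import Path


def load_run(run_dir):
    """Load summary metrics and the event stream from one run directory."""
    run_dir = Path(run_dir)
    aggregate = json.loads((run_dir / "aggregate.json").read_text(encoding="utf-8"))
    events = []
    with open(run_dir / "events.jsonl", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                events.append(json.loads(line))  # one JSON object per line
    return aggregate, events


# Demo against a fake run directory; the keys below are placeholders only.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / ".snowl" / "runs" / "example-run"
    run_dir.mkdir(parents=True)
    (run_dir / "aggregate.json").write_text(
        '{"placeholder_metric": 1.0}', encoding="utf-8"
    )
    (run_dir / "events.jsonl").write_text(
        '{"placeholder": "a"}\n{"placeholder": "b"}\n', encoding="utf-8"
    )
    aggregate, events = load_run(run_dir)

print(aggregate, len(events))
```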

Documentation

Development

pip install -e ".[dev]"
pytest -q

Contributing

Snowl needs contributors who care about making AI agents safer and easier to measure. Good first areas:

Add a benchmark adapter
Improve a scorer
Write a framework adapter (CrewAI, AutoGen, PydanticAI)
Add docs for a real evaluation workflow

License

See the repository license file.
