Snowl

Snowl is a framework-agnostic evaluation engine for AI agents. Run any agent against any benchmark, get reproducible scores, and compare fairly — in 3 lines of code.

from snowl import quick_eval

result = quick_eval(
    agent=lambda prompt: "I cannot help with that.",
    benchmark="strongreject",
    limit=10,
)
print(f"Pass rate: {result.pass_rate:.0%}  Cost: {result.total_tokens} tokens")

Install

pip install snowl

With framework support:

pip install snowl[qitos]       # QitOS agents
pip install snowl[langgraph]   # LangGraph agents
pip install snowl[openai]      # OpenAI Agents SDK

30-Second Tour

Python API — evaluate any callable:

from snowl import quick_eval

# Evaluate a simple function
result = quick_eval(
    agent=lambda prompt: "hello",
    samples=[{"id": "s1", "input": "Say hi", "target": "hello"}],
    scorer="includes",
)

# Evaluate against a built-in benchmark
# (my_async_fn is your own async callable, defined elsewhere)
result = quick_eval(agent=my_async_fn, benchmark="bfcl", limit=50)
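
The benchmark call above passes `my_async_fn` without defining it. A minimal sketch of what such a callable might look like follows, assuming the async form uses the same prompt-in, string-out contract as the lambda examples (the README does not spell out the signature, so treat this as an assumption). The block does not import Snowl; it only defines the callable and checks it on its own.

```python
import asyncio


async def my_async_fn(prompt: str) -> str:
    """Hypothetical async agent: takes a prompt string, returns a reply string."""
    # Stand-in for a real model or tool call that would be awaited here.
    await asyncio.sleep(0)
    return f"echo: {prompt}"


# Sanity-check the callable on its own before wiring it into an evaluation.
reply = asyncio.run(my_async_fn("Say hi"))
print(reply)
```

Testing the agent in isolation first makes it easier to tell agent bugs apart from evaluation failures.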

CLI — run full evaluation projects:

snowl bench list                                      # list benchmarks
snowl eval project.yml                                # run eval
snowl bench run strongreject --split test --limit 10  # run benchmark
snowl retry run-20260427T120000Z                      # retry failures

Why Snowl

| Problem | Snowl's Answer |
| --- | --- |
| Agents are hard to plug into benchmarks | 3-line quick_eval() + Adapter SDK (QitOS, LangGraph, OpenAI Agents) |
| Safety evaluation is shallow | Built-in scorers: canary leak, tool trace policy, workspace diff, command check, injection |
| Scores aren't comparable across models | Cost-aware scoring, separated verifier, working-time metrics |
| Runs are hard to reproduce | Deterministic Task × AgentVariant × Sample planning, full artifact trail |
| Terminal/GUI/web tasks behave differently | Phase-aware runtime with Docker sandbox, container isolation, AIMD flow control |
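
To illustrate what deterministic Task × AgentVariant × Sample planning can mean in practice, here is a rough sketch. It is not Snowl's planner: `plan_trials`, the `seed` parameter, the variant names, and the hashed trial IDs are all hypothetical. The idea shown is that sorting each axis and deriving IDs from content makes the plan independent of input order.

```python
import hashlib
from itertools import product


def plan_trials(tasks, variants, sample_ids, seed=0):
    """Return a reproducible list of (task, variant, sample_id, trial_id) tuples."""
    trials = []
    # Sort each axis so the plan does not depend on the order of the inputs.
    for task, variant, sample_id in product(
        sorted(tasks), sorted(variants), sorted(sample_ids)
    ):
        # Derive a stable ID from the trial's content rather than a counter.
        key = f"{seed}:{task}:{variant}:{sample_id}".encode()
        trial_id = hashlib.sha256(key).hexdigest()[:12]
        trials.append((task, variant, sample_id, trial_id))
    return trials


# Hypothetical variant names; the same inputs in a different order give the same plan.
plan_a = plan_trials(["strongreject"], ["variant-a", "variant-b"], ["s2", "s1"])
plan_b = plan_trials(["strongreject"], ["variant-b", "variant-a"], ["s1", "s2"])
print(len(plan_a), plan_a == plan_b)
```

Content-derived IDs also let a retry target exactly the trials that failed, since the same trial always maps to the same ID.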

Core Concepts

Task × Agent × Scorer → TrialOutcome → Aggregated Result
  • Task: what to evaluate (samples + environment spec)
  • Agent: what is being evaluated (any Python callable or Agent Protocol)
  • Scorer: how to judge (15+ built-in: includes, match, model-as-judge, canary, tool trace, ...)
  • Runtime: where it runs (local, Docker sandbox, GUI container)
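
The Task × Agent × Scorer flow above can be sketched in plain Python. This is an illustration of the concepts only, not Snowl's internals: the `TrialOutcome` fields, `run_task`, `aggregate`, and the case-insensitive behavior of the "includes" check are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrialOutcome:
    sample_id: str
    output: str
    passed: bool


def includes_scorer(output: str, target: str) -> bool:
    """Pass when the target appears in the output (case-insensitive by assumption)."""
    return target.lower() in output.lower()


def run_task(samples, agent: Callable[[str], str],
             scorer: Callable[[str, str], bool]):
    """Run the agent on every sample and score each output against its target."""
    outcomes = []
    for sample in samples:
        output = agent(sample["input"])
        passed = scorer(output, sample["target"])
        outcomes.append(TrialOutcome(sample["id"], output, passed))
    return outcomes


def aggregate(outcomes):
    """Collapse per-sample outcomes into summary metrics."""
    total = len(outcomes)
    passed = sum(o.passed for o in outcomes)
    return {"total": total, "passed": passed,
            "pass_rate": passed / total if total else 0.0}


samples = [
    {"id": "s1", "input": "Say hi", "target": "hello"},
    {"id": "s2", "input": "Say bye", "target": "goodbye"},
]
outcomes = run_task(samples, agent=lambda prompt: "Hello there", scorer=includes_scorer)
summary = aggregate(outcomes)
print(summary)
```

Keeping the scorer separate from the agent means the same outputs can be re-judged with a different scorer without re-running the agent.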

Built-in Benchmarks

25+ benchmarks across safety and capability:

| Category | Benchmarks |
| --- | --- |
| Safety | StrongReject, XSTest, AgentHarm, CoConot, FORTRESS, MASK, SevenLLM |
| Cybersecurity | WMDP, CyberMetric, SecQA, CyBench, CyberGym |
| Tool Use / Agent | BFCL, AgentDojo, AgentBench-OS, ToolEmu, IPI Coding Agent |
| Capability | GAIA, Tau-Bench, OSWorld, TerminalBench, SWE-Bench, HumanEval |
| Custom | JSONL, CSV adapters for your own datasets |
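
For custom datasets, one way to prepare a JSONL file is to load it into the same `id` / `input` / `target` sample shape used in the 30-Second Tour. The sketch below is a standalone loader, not Snowl's JSONL adapter; `load_jsonl_samples` and the required-field check are hypothetical.

```python
import json
import os
import tempfile


def load_jsonl_samples(path):
    """Read one JSON object per line and check the fields the tour example uses."""
    samples = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"id", "input", "target"} - record.keys()
            if missing:
                raise ValueError(f"{path}:{line_no} missing fields: {sorted(missing)}")
            samples.append(record)
    return samples


# Demo with a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write('{"id": "s1", "input": "Say hi", "target": "hello"}\n\n')
        fh.write('{"id": "s2", "input": "Say bye", "target": "goodbye"}\n')
    samples = load_jsonl_samples(path)

print([s["id"] for s in samples])
```

The resulting list has the shape that the `samples=` argument takes in the earlier example; whether the built-in adapter expects additional fields is not stated in this README.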

What You Get From Each Run

Every run produces a self-contained artifact directory:

.snowl/runs/<run_id>/
  outcomes.json          # per-sample results
  aggregate.json         # summary metrics
  events.jsonl           # full event stream
  leaderboard_rows.jsonl # ranked results
  recovery.json          # retry state
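
A quick way to check a run directory is to confirm the files listed above exist and to count the lines in `events.jsonl`. The sketch below assumes only the file names shown in the listing; the schema of each file is not documented here, so the demo writes placeholder contents, and `inspect_run` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "outcomes.json",
    "aggregate.json",
    "events.jsonl",
    "leaderboard_rows.jsonl",
    "recovery.json",
]


def inspect_run(run_dir):
    """Report missing artifacts, the event count, and the raw aggregate payload."""
    run_dir = Path(run_dir)
    report = {
        "missing": [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
    }
    events_path = run_dir / "events.jsonl"
    if events_path.is_file():
        with events_path.open(encoding="utf-8") as fh:
            report["event_count"] = sum(1 for line in fh if line.strip())
    aggregate_path = run_dir / "aggregate.json"
    if aggregate_path.is_file():
        report["aggregate"] = json.loads(aggregate_path.read_text(encoding="utf-8"))
    return report


# Demo against a fake run directory with placeholder contents (not real Snowl output).
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / ".snowl" / "runs" / "demo-run"
    run_dir.mkdir(parents=True)
    (run_dir / "aggregate.json").write_text('{"placeholder": true}', encoding="utf-8")
    (run_dir / "events.jsonl").write_text('{"n": 1}\n{"n": 2}\n', encoding="utf-8")
    report = inspect_run(run_dir)

print(report)
```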

Development

pip install -e ".[dev]"
pytest -q

Contributing

Snowl needs contributors who care about making AI agents safer and easier to measure. Good first areas:

  • Add a benchmark adapter
  • Improve a scorer
  • Write a framework adapter (CrewAI, AutoGen, PydanticAI)
  • Add docs for a real evaluation workflow

License

See the repository license file.
