Open-source tooling for agent evals, regression testing, trace packaging, failure clustering, and dataset slicing.
AgentReliabilityKit is a focused monorepo for a specific gap in the LLM agent stack: teams can build agents, but still struggle to replay failures, turn real traces into reusable eval assets, cluster recurring failure modes, and produce stable train/eval/test slices from the same evidence.
A lot of agent repos optimize for demos, orchestration, or UI. Fewer repos help with the reliability loop after a run goes wrong.
This repo is built around that loop:
- capture a real run
- replay or diff it in CI
- package it into a reusable eval artifact
- cluster repeated failures across runs or releases
- slice the same artifact into reproducible datasets
That makes AgentReliabilityKit closer to an eval-and-reliability toolkit than a general agent framework.
- AgentCI for replay-first regression testing of tool-using agents
- TracePack for turning real traces into reusable benchmark packs
- FailMap for clustering recurring failures and comparing releases
- PackSlice for balanced train/eval/test splits from the same pack
- a root automation flow that proves the whole chain works together
AgentCI -> record and diff trajectories
TracePack -> turn trajectories into reusable benchmark packs
FailMap -> cluster failures, compare releases, generate triage issues, bundle work
PackSlice -> split packs into balanced train/eval/test datasets
Run the end-to-end repo demo with:
./scripts/run_automation_demo.sh /tmp/agentreliabilitykit-demo
The output is intentionally machine-readable. A successful run gives you a root manifest.json plus per-tool artifacts:
manifest.json
agentci-summary.json
agentci-regression.json
tracepack-scan.json
tracepack-build.json
tracepack-inspect.json
tracepack-pack/
  manifest.json
  cases/
failmap-cluster.json
failmap-clusters.json
failmap-summary.json
packslice-split.json
packslice-summary.json
packslice/
  summary.json
  train/
  eval/
  test/
The root manifest.json is the single best entrypoint for CI jobs, dashboards, or downstream automation that needs to discover the whole artifact set.
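As a rough sketch of how a downstream job might consume that output directory: the manifest's internal schema is not described here, so the snippet below treats manifest.json as opaque JSON and simply loads it alongside the other top-level JSON artifacts. The helper name `load_artifacts` is hypothetical and not part of any tool in this repo.

```python
import json
from pathlib import Path


def load_artifacts(out_dir):
    """Load manifest.json plus every other top-level JSON artifact in out_dir.

    Hypothetical helper. Returns {"manifest": ..., "artifacts": {filename: parsed}}.
    """
    root = Path(out_dir)
    manifest_path = root / "manifest.json"
    if not manifest_path.is_file():
        # A missing root manifest is treated as a failed or incomplete run.
        raise FileNotFoundError(f"no manifest.json in {root}")

    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    artifacts = {}
    for path in sorted(root.glob("*.json")):
        if path.name == "manifest.json":
            continue
        artifacts[path.name] = json.loads(path.read_text(encoding="utf-8"))
    return {"manifest": manifest, "artifacts": artifacts}
```

Failing fast on a missing manifest.json keeps a CI job from silently reporting on a partial artifact set.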
cd projects/agentci
python -m venv .venv
source .venv/bin/activate
pip install -e .
python examples/math_agent.py
agentci diff examples/math_episode.json examples/math_episode_candidate.json
agentci diff-html examples/math_episode.json examples/math_episode_candidate.json examples/math_diff.html
agentci assert-regression examples/math_episode.json examples/math_episode_latency_candidate.json --ignore-diff-prefix metric:latency_ms
agentci detect-flaky examples/math_episode.json examples/math_episode_latency_candidate.json examples/math_episode_candidate.json
cd projects/tracepack
python -m venv .venv
source .venv/bin/activate
pip install -e .
python examples/make_sample_episodes.py
tracepack scan examples/source_episodes --json
tracepack build examples/source_episodes examples/demo_pack --only-failures --redact --max-per-signature 1
tracepack inspect examples/demo_pack --json
cd projects/failmap
python -m venv .venv
source .venv/bin/activate
pip install -e .
failmap compare examples/baseline_clusters.json examples/candidate_clusters.json examples/compare.json
failmap issue-drafts examples/compare.json examples/issues --rules examples/triage_rules.json
failmap issue-bundle examples/issues examples/bundle
failmap compare-summary examples/compare.json --json
cd projects/packslice
python -m venv .venv
source .venv/bin/activate
pip install -e .
packslice split examples/sample_pack examples/split_demo --group-by signature
packslice summarize examples/split_demo --json
packslice markdown examples/split_demo examples/split_demo/REPORT.md
All four tools support machine-readable CLI output, so they can be chained in CI without scraping terminal prose:
agentci summarize projects/agentci/examples/math_episode.json --json
tracepack scan projects/tracepack/examples/source_episodes --json
failmap summarize projects/failmap/examples/clusters.json --json
packslice summarize projects/packslice/examples/split_demo --json
That is the core design choice of the repo: artifacts first, dashboards and release checks second.
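One way to chain the `--json` commands above from a Python CI step is to run each command and parse its stdout. This is a minimal sketch, not something shipped in the repo: `run_json` is a hypothetical helper, and it assumes each command prints a single JSON document to stdout.

```python
import json
import subprocess


def run_json(cmd):
    """Run a CLI command and parse its stdout as JSON.

    check=True raises CalledProcessError on a non-zero exit code, so a
    failing tool stops the pipeline instead of yielding partial output.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

For example, `run_json(["tracepack", "scan", "projects/tracepack/examples/source_episodes", "--json"])` would return the parsed scan result, assuming the `tracepack` CLI is installed and on PATH.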
Path: projects/agentci
Replay-first regression testing for tool-using LLM agents, with portable episode traces, HTML diff reports, and pytest-friendly regression assertions.
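To make the idea of a regression assertion with ignored diff prefixes concrete, here is a small sketch. It is not AgentCI's implementation: it assumes an episode has already been flattened into a dict of string keys (the key names below are hypothetical), and `diff_keys` and `assert_no_regression` are illustrative names.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit None


def diff_keys(baseline, candidate, ignore_prefixes=()):
    """Return the sorted keys whose values differ, skipping ignored prefixes."""
    changed = []
    for key in sorted(set(baseline) | set(candidate)):
        if any(key.startswith(prefix) for prefix in ignore_prefixes):
            continue
        if baseline.get(key, _MISSING) != candidate.get(key, _MISSING):
            changed.append(key)
    return changed


def assert_no_regression(baseline, candidate, ignore_prefixes=()):
    """Raise AssertionError if any non-ignored key drifted (pytest-friendly)."""
    changed = diff_keys(baseline, candidate, ignore_prefixes)
    if changed:
        raise AssertionError(f"episode drifted on: {', '.join(changed)}")
```

Ignoring a prefix such as `metric:latency_ms` lets a test tolerate noisy measurements while still failing when a tool call or final answer changes.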
Path: projects/tracepack
Build reusable benchmark packs from real agent traces, with recursive redaction, case labels, jsonl/chat export, and signature-aware sampling for eval pipelines.
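Recursive redaction can be pictured as a walk over nested dicts and lists that masks sensitive keys at any depth. The sketch below only illustrates that idea: the key list, the placeholder string, and the `redact` function are assumptions, and TracePack's actual redaction policy may differ.

```python
REDACTED = "[REDACTED]"
# Hypothetical key list; a real policy would be configurable.
SENSITIVE_KEYS = frozenset({"api_key", "authorization", "password", "token"})


def redact(value, sensitive_keys=SENSITIVE_KEYS):
    """Return a copy of value with sensitive dict entries masked at any depth."""
    if isinstance(value, dict):
        return {
            key: REDACTED
            if str(key).lower() in sensitive_keys
            else redact(item, sensitive_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive_keys) for item in value]
    return value  # scalars pass through unchanged
```

Returning a copy rather than mutating in place leaves the source trace intact, which matters when the same trace is also used for replay.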
Path: projects/failmap
Cluster recurring agent failures from TracePack packs, compare releases, generate issue-ready triage drafts with rules-driven routing, bundle them for planning, and track failure trends across snapshots.
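A release comparison can be reduced, in its simplest form, to counting failure signatures in two snapshots and bucketing the differences. The following sketch assumes each failure has already been reduced to a signature string (the example strings are hypothetical); it is not FailMap's clustering logic, and `compare_signatures` is an illustrative name.

```python
from collections import Counter


def compare_signatures(baseline, candidate):
    """Bucket failure signatures into new, resolved, and changed counts."""
    base, cand = Counter(baseline), Counter(candidate)
    shared = sorted(set(base) & set(cand))
    return {
        "new": sorted(set(cand) - set(base)),
        "resolved": sorted(set(base) - set(cand)),
        # Only signatures present in both snapshots whose count moved.
        "delta": {sig: cand[sig] - base[sig] for sig in shared if cand[sig] != base[sig]},
    }
```

A report shaped like this maps naturally onto triage: "new" entries are candidates for issue drafts, while "delta" highlights failures that got more or less frequent between releases.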
Path: projects/packslice
Create balanced train/eval/test splits from TracePack packs with distribution-aware, label-aware, and chronological slicing modes.
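To show why grouping by signature matters for reproducible splits, here is a sketch of a deterministic, group-aware assignment: every case sharing a signature lands in the same split, and the result does not depend on input order. The ratios, field names, and function names are assumptions, and hash-based assignment only approximates the target shares; it does not reproduce PackSlice's balancing.

```python
import hashlib

# Illustrative shares; not PackSlice defaults.
DEFAULT_RATIOS = (("train", 0.8), ("eval", 0.1), ("test", 0.1))


def assign_split(group, ratios=DEFAULT_RATIOS):
    """Map a group key to a split name deterministically via a stable hash."""
    digest = hashlib.sha256(group.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a fraction of the unit interval.
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, share in ratios:
        cumulative += share
        if point < cumulative:
            return name
    return ratios[-1][0]  # guard against floating-point rounding


def split_cases(cases, key="signature", ratios=DEFAULT_RATIOS):
    """Split cases so that all cases sharing case[key] stay together."""
    splits = {name: [] for name, _ in ratios}
    for case in cases:
        splits[assign_split(case[key], ratios)].append(case)
    return splits
```

Keeping a signature inside one split avoids leaking near-duplicate failures from train into eval or test.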
The most useful agent infra repos are usually:
- painkiller products, not toy abstractions
- compatible with existing stacks
- demoable in a few minutes
- useful to both researchers and production teams
AgentReliabilityKit is built around that rule.
projects/
  agentci/      replay-first regression testing
  tracepack/    trace-to-benchmark packaging
  failmap/      failure clustering and release comparison
  packslice/    balanced dataset splitting for trace packs
.github/
  workflows/    monorepo CI
- repo-level walkthrough: docs/automation.md
- contributor guide: CONTRIBUTING.md
- issue and PR templates: .github/
- security policy: SECURITY.md
- support guidance: SUPPORT.md
- public roadmap: ROADMAP.md
- changelog: CHANGELOG.md
- discussions: GitHub Discussions
- social preview source: docs/assets/agentreliabilitykit-social-preview.svg
For the longer view, see ROADMAP.md.
- add more AgentCI integrations and richer HTML diff reports
- strengthen TracePack redaction policies, labeling workflows, and export formats
- add richer FailMap issue templates, trend views, and release-to-release drilldowns
- expand PackSlice with temporal and label-aware slicing
- add more focused projects around agent eval infra, failure mining, and trajectory analytics
MIT