A Claude Code skill that adds a rubric-based evaluation layer to any existing agent project. Framework-agnostic — works with PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents SDK, the raw Anthropic SDK, or anything else that exposes `run(prompt) -> result`.
You bring the agent. The skill gives you:
- A scoring rubric (3-5 dimensions with concrete level descriptors, not "good" / "bad"; see the sketch after this list)
- A test suite of inputs with ≥3 reference-graded cases (for calibration)
- A judge prompt for LLM-as-a-judge with 2-3 calibration examples
- An eval harness that runs your agent on the cases, sends output to the judge, and aggregates:
  - per-dimension averages
  - weighted overall score
  - pass rate
  - leniency vs. the human references (flags if the judge is systematically too strict or too lenient)
- A per-subject markdown report and, for multi-subject benchmarks, a self-contained HTML dashboard (leaderboard + radar + per-case heatmap + failure-category breakdown).
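For the rubric item above, a "concrete level descriptor" spells out what each score level looks like for a dimension. Here is a hypothetical sketch of one dimension, shown as the Python structure a harness might load (the skill writes this into rubrics/main.yaml; the field names and wording are illustrative only):

```python
# Hypothetical rubric dimension: concrete level descriptors, not "good"/"bad".
faithfulness = {
    "name": "faithfulness",
    "weight": 0.4,  # weights across all 3-5 dimensions should sum to 1.0
    "levels": {
        1.0: "Every claim is supported by the retrieved sources; nothing invented.",
        0.5: "Mostly grounded, but at least one claim lacks a supporting source.",
        0.0: "Key claims are unsupported or contradict the sources.",
    },
}
```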
This is a Claude Code skill. To use it as your own user-level skill, clone it into your Claude skills directory:

```
git clone <repo-url> ~/.claude/skills/eval-layer
```

Then from any Claude Code session: `/eval-layer <your agent path>`.
"Vibes-based" evals fail silently. You ship a change, the agent feels better, but you have no numbers. This skill produces numbers you can trust — including a judge-calibration signal (leniency) so you know when the judge itself is drifting, not just the agent.
Two measurements, not one:
- Weighted score — how good the output is, per the rubric, on [0, 1]
- Leniency — `mean(judge_score − human_reference_score)` on [-1, +1]
  - abs < 0.10 → well-calibrated
  - 0.10 – 0.25 → slight bias, monitor
  - abs > 0.25 → recalibrate before trusting scores
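Both numbers are cheap to compute once the judge and the human references have scored the same cases. A minimal sketch, with illustrative function names and toy numbers rather than the generated harness's actual API:

```python
def weighted_score(dim_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted overall score on [0, 1]; dimension weights are expected to sum to 1.0."""
    return sum(weights[d] * dim_scores[d] for d in weights)

def leniency(judge_scores: list[float], reference_scores: list[float]) -> float:
    """mean(judge_score - human_reference_score) over reference-graded cases, on [-1, +1]."""
    diffs = [j - r for j, r in zip(judge_scores, reference_scores)]
    return sum(diffs) / len(diffs)

bias = leniency([0.82, 0.64, 0.71], [0.75, 0.60, 0.70])  # toy numbers
if abs(bias) > 0.25:
    print("recalibrate before trusting scores")
elif abs(bias) >= 0.10:
    print("slight bias, monitor")
else:
    print("well-calibrated")
```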
Every framework adapter returns the same shape. This is what makes cross-framework comparison honest:
```python
{
    "recommendation": <schema instance or None>,  # agent's structured output
    "latency_ms": int,                            # wall-clock including tool execution
    "tool_calls": int,                            # count of tool invocations
    "input_tokens": int | None,                   # summed across turns
    "output_tokens": int | None,
    "model_id": str,
    "error": str | None,
}
```

If a framework doesn't expose a field (Strands 0.1.x doesn't expose tokens, for example), pass None — don't fabricate. The harness handles None defensively throughout.
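As an illustration, an adapter for a generic agent exposing run(prompt) -> result could look like the sketch below. Every attribute read from result (output, tool_calls, token counts, model_id) is an assumption; map them to whatever your framework actually returns, and keep None where it exposes nothing:

```python
import time

def run_subject(agent, prompt: str) -> dict:
    """Adapter sketch for any agent exposing run(prompt) -> result."""
    start = time.perf_counter()
    try:
        result = agent.run(prompt)
        return {
            "recommendation": getattr(result, "output", None),
            "latency_ms": int((time.perf_counter() - start) * 1000),
            "tool_calls": len(getattr(result, "tool_calls", []) or []),
            "input_tokens": getattr(result, "input_tokens", None),   # None if usage isn't exposed
            "output_tokens": getattr(result, "output_tokens", None),
            "model_id": getattr(result, "model_id", "unknown"),
            "error": None,
        }
    except Exception as exc:
        # Never fabricate: record the failure and let the harness handle it.
        return {
            "recommendation": None,
            "latency_ms": int((time.perf_counter() - start) * 1000),
            "tool_calls": 0,
            "input_tokens": None,
            "output_tokens": None,
            "model_id": "unknown",
            "error": str(exc),
        }
```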
- In Claude Code, invoke the skill against your project:

  ```
  /eval-layer Add an eval layer to /path/to/your/agent
  ```

- Claude reads the agent, proposes a rubric, and on your confirmation generates:

  ```
  your-project/
    evals/
      eval_harness.py
      rubrics/main.yaml
      prompts/judge.md
      test_cases/seed.yaml
      reports/
        raw/<subject>.jsonl
        eval_<subject>_<ts>.md
        framework-comparison.html   # multi-subject only
  ```

- Smoke-test one case:

  ```
  python evals/eval_harness.py --framework <subject> --test-case easy-01 -v
  ```

- Full run:

  ```
  python evals/eval_harness.py --framework <subject>
  ```

- Multi-subject sweep + dashboard:

  ```
  python evals/eval_harness.py --framework all
  python evals/make_html_report.py
  ```
Every harness generated by this skill exposes these:
| Flag | Purpose |
|---|---|
| `--framework NAME` | which subject to run (or `all`) |
| `--test-case ID` | run only one case — essential for debugging adapters |
| `-v` / `--verbose` | print per-case metadata as it runs |
| `--trials N` | run each case N times (pass@k / variance) |
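A sketch of the matching argparse surface; the flag names follow the table above, everything else is illustrative:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # CLI sketch only; the generated harness may organize this differently.
    p = argparse.ArgumentParser(description="eval harness")
    p.add_argument("--framework", required=True, help="subject to run, or 'all'")
    p.add_argument("--test-case", help="run a single case by ID (debugging adapters)")
    p.add_argument("-v", "--verbose", action="store_true", help="print per-case metadata")
    p.add_argument("--trials", type=int, default=1, help="run each case N times (pass@k / variance)")
    return p
```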
```
eval-layer/
├── SKILL.md                                  # entry point — the process Claude follows
├── references/
│   ├── rubric-design.md                      # dimension catalog, scales, leniency thresholds, anti-patterns
│   ├── judge-prompts.md                      # judge prompt template + calibration techniques
│   ├── judge-robustness.md                   # defensive JSON parse, retry-once, never-drop-a-result
│   ├── framework-adapters.md                 # copy-paste recipes per framework (w/ the metadata contract)
│   ├── structured-output-troubleshooting.md  # the three Bedrock Opus errors and the two-stage fix
│   ├── cross-subject-benchmarking.md         # multi-subject flow (models / frameworks / prompts)
│   └── html-report-template.html             # self-contained Chart.js dashboard template
└── README.md
```
SKILL.md is the top-level recipe (loaded when the skill is invoked). The references/ files are loaded on demand for the relevant step.
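For example, the defensive parse that judge-robustness.md describes (tolerant JSON parse, retry once, never drop a result) reduces to something like the sketch below; the generated parse_judge_response helper may differ in its details:

```python
import json
import re

def parse_judge_response(text: str) -> dict | None:
    """Defensive parse of the judge's JSON output (sketch only)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Judges often wrap JSON in prose or code fences; salvage the first {...} span.
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return None  # caller retries once, then records the case rather than dropping it
```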
If the agent targets Claude on Bedrock, the default path for structured output is the two-stage pattern, not `response_format` / `output_type`. Single-stage structured output fails with three distinct errors on Bedrock Opus 4.x across LangGraph, Strands, and OpenAI Agents SDK:
- LangGraph — "This model does not support assistant message prefill"
- Strands — "No valid tool use or tool use input was found in the Bedrock response"
- OpenAI Agents + LiteLLM — "minItems values other than 0 or 1 are not supported"
The fix:
- Stage 1: run the agent conversationally — tools enabled, no `output_type`.
- Stage 2: feed the final text into a fresh structured-output call.

See `references/structured-output-troubleshooting.md`.
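In outline, the two-stage pattern is just two calls. In the sketch below, run_agent and extract stand in for your framework's own functions; they are assumptions, not anything this skill ships:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def two_stage(
    run_agent: Callable[[str], str],        # Stage 1: conversational run, tools enabled, no output_type
    extract: Callable[[str, type], T],      # Stage 2: fresh structured-output call, no tools
    prompt: str,
    schema: type,
) -> T:
    """Two-stage structured output for Claude on Bedrock (hypothetical helper names)."""
    final_text = run_agent(prompt)       # the agent answers in free text
    return extract(final_text, schema)   # a separate call maps that text onto the schema
```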
Before handing off an eval produced by this skill:
- 3–5 dimensions, weights sum to 1.0
- Concrete level descriptors (not "good"/"bad")
- 2–3 calibration examples in the judge prompt
- ≥3 test cases with `reference_scores` for leniency
- Harness uses the defensive `parse_judge_response` helper
- Harness outputs the 7-field metadata contract
- `--framework`, `--test-case`, `-v`, `--trials` flags present
- Single-case smoke test passes end-to-end
- If Bedrock: two-stage structured output is the default
- If multi-subject: HTML dashboard renders with all subjects
