A Claude Code skill that adds a rubric-based evaluation layer to any existing agent project. Framework-agnostic — works with PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents SDK, the raw Anthropic SDK, or anything else that exposes `run(prompt) -> result`.
You bring the agent. The skill gives you:
- A scoring rubric (3-5 dimensions with concrete level descriptors, not "good" / "bad"; see the sketch after this list)
- A test suite of inputs with ≥3 reference-graded cases (for calibration)
- A judge prompt for LLM-as-a-judge with 2-3 calibration examples
- An eval harness that runs your agent on the cases, sends output to the judge, and aggregates:
  - per-dimension averages
  - weighted overall score
  - pass rate
  - leniency vs. the human references (flags if the judge is systematically too strict or too lenient)
- A per-subject markdown report and, for multi-subject benchmarks, a self-contained HTML dashboard (leaderboard + radar + per-case heatmap + failure-category breakdown).
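For the rubric item above, a "concrete level descriptor" spells out what each score level looks like for a dimension. Here is a hypothetical sketch of one dimension, shown as the Python structure a harness might load (the skill writes this into rubrics/main.yaml; the field names and wording are illustrative only):

```python
# Hypothetical rubric dimension: concrete level descriptors, not "good"/"bad".
faithfulness = {
    "name": "faithfulness",
    "weight": 0.4,  # weights across all 3-5 dimensions should sum to 1.0
    "levels": {
        1.0: "Every claim is supported by the retrieved sources; nothing invented.",
        0.5: "Mostly grounded, but at least one claim lacks a supporting source.",
        0.0: "Key claims are unsupported or contradict the sources.",
    },
}
```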
This is a Claude Code skill. To use it as your own user-level skill, clone it into your Claude skills directory:

```
git clone <repo-url> ~/.claude/skills/eval-layer
```

Then from any Claude Code session: `/eval-layer <your agent path>`.
"Vibes-based" evals fail silently. You ship a change, the agent feels better, but you have no numbers. This skill produces numbers you can trust — including a judge-calibration signal (leniency) so you know when the judge itself is drifting, not just the agent.
Two measurements, not one:
- Weighted score — how good the output is, per the rubric, on [0, 1]
- Leniency — `mean(judge_score − human_reference_score)` on [-1, +1]
  - abs < 0.10 → well-calibrated
  - 0.10 – 0.25 → slight bias, monitor
  - abs > 0.25 → recalibrate before trusting scores
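Both numbers are cheap to compute once the judge and the human references have scored the same cases. A minimal sketch, with illustrative function names and toy numbers rather than the generated harness's actual API:

```python
def weighted_score(dim_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted overall score on [0, 1]; dimension weights are expected to sum to 1.0."""
    return sum(weights[d] * dim_scores[d] for d in weights)

def leniency(judge_scores: list[float], reference_scores: list[float]) -> float:
    """mean(judge_score - human_reference_score) over reference-graded cases, on [-1, +1]."""
    diffs = [j - r for j, r in zip(judge_scores, reference_scores)]
    return sum(diffs) / len(diffs)

bias = leniency([0.82, 0.64, 0.71], [0.75, 0.60, 0.70])  # toy numbers
if abs(bias) > 0.25:
    print("recalibrate before trusting scores")
elif abs(bias) >= 0.10:
    print("slight bias, monitor")
else:
    print("well-calibrated")
```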
Every framework adapter returns the same shape. This is what makes cross-framework comparison honest:
```python
{
    "recommendation": <schema instance or None>,  # agent's structured output
    "latency_ms": int,                            # wall-clock including tool execution
    "tool_calls": int,                            # count of tool invocations
    "input_tokens": int | None,                   # summed across turns
    "output_tokens": int | None,
    "model_id": str,
    "error": str | None,
}
```

If a framework doesn't expose a field (Strands 0.1.x doesn't expose tokens, for example), pass None — don't fabricate. The harness handles None defensively throughout.
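As an illustration, an adapter for a generic agent exposing run(prompt) -> result could look like the sketch below. Every attribute read from result (output, tool_calls, token counts, model_id) is an assumption; map them to whatever your framework actually returns, and keep None where it exposes nothing:

```python
import time

def run_subject(agent, prompt: str) -> dict:
    """Adapter sketch for any agent exposing run(prompt) -> result."""
    start = time.perf_counter()
    try:
        result = agent.run(prompt)
        return {
            "recommendation": getattr(result, "output", None),
            "latency_ms": int((time.perf_counter() - start) * 1000),
            "tool_calls": len(getattr(result, "tool_calls", []) or []),
            "input_tokens": getattr(result, "input_tokens", None),   # None if usage isn't exposed
            "output_tokens": getattr(result, "output_tokens", None),
            "model_id": getattr(result, "model_id", "unknown"),
            "error": None,
        }
    except Exception as exc:
        # Never fabricate: record the failure and let the harness handle it.
        return {
            "recommendation": None,
            "latency_ms": int((time.perf_counter() - start) * 1000),
            "tool_calls": 0,
            "input_tokens": None,
            "output_tokens": None,
            "model_id": "unknown",
            "error": str(exc),
        }
```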
- In Claude Code, invoke the skill against your project:

  ```
  /eval-layer Add an eval layer to /path/to/your/agent
  ```

- Claude reads the agent, proposes a rubric, and on your confirmation generates:

  ```
  your-project/
    evals/
      eval_harness.py
      rubrics/main.yaml
      prompts/judge.md
      test_cases/seed.yaml
      reports/
        raw/<subject>.jsonl
        eval_<subject>_<ts>.md
        framework-comparison.html   # multi-subject only
  ```

- Smoke-test one case:

  ```
  python evals/eval_harness.py --framework <subject> --test-case easy-01 -v
  ```

- Full run:

  ```
  python evals/eval_harness.py --framework <subject>
  ```

- Multi-subject sweep + dashboard:

  ```
  python evals/eval_harness.py --framework all
  python evals/make_html_report.py
  ```
Every harness generated by this skill exposes these:
| Flag | Purpose |
|---|---|
| `--framework NAME` | which subject to run (or `all`) |
| `--test-case ID` | run only one case — essential for debugging adapters |
| `-v` / `--verbose` | print per-case metadata as it runs |
| `--trials N` | run each case N times (pass@k / variance) |
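A sketch of the matching argparse surface; the flag names follow the table above, everything else is illustrative:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # CLI sketch only; the generated harness may organize this differently.
    p = argparse.ArgumentParser(description="eval harness")
    p.add_argument("--framework", required=True, help="subject to run, or 'all'")
    p.add_argument("--test-case", help="run a single case by ID (debugging adapters)")
    p.add_argument("-v", "--verbose", action="store_true", help="print per-case metadata")
    p.add_argument("--trials", type=int, default=1, help="run each case N times (pass@k / variance)")
    return p
```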
```
eval-layer/
├── SKILL.md                                  # entry point — the process Claude follows
├── references/
│   ├── rubric-design.md                      # dimension catalog, scales, leniency thresholds, anti-patterns
│   ├── judge-prompts.md                      # judge prompt template + calibration techniques
│   ├── judge-robustness.md                   # defensive JSON parse, retry-once, never-drop-a-result
│   ├── framework-adapters.md                 # copy-paste recipes per framework (w/ the metadata contract)
│   ├── structured-output-troubleshooting.md  # the three Bedrock Opus errors and the two-stage fix
│   ├── cross-subject-benchmarking.md         # multi-subject flow (models / frameworks / prompts)
│   └── html-report-template.html             # self-contained Chart.js dashboard template
└── README.md
```
SKILL.md is the top-level recipe (loaded when the skill is invoked). The references/ files are loaded on demand for the relevant step.
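For example, the defensive parse that judge-robustness.md describes (tolerant JSON parse, retry once, never drop a result) reduces to something like the sketch below; the generated parse_judge_response helper may differ in its details:

```python
import json
import re

def parse_judge_response(text: str) -> dict | None:
    """Defensive parse of the judge's JSON output (sketch only)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Judges often wrap JSON in prose or code fences; salvage the first {...} span.
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return None  # caller retries once, then records the case rather than dropping it
```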
If the agent targets Claude on Bedrock, the default path for structured output is the two-stage pattern, not `response_format` / `output_type`. Single-stage structured output fails with three distinct errors on Bedrock Opus 4.x across LangGraph, Strands, and OpenAI Agents SDK:
- LangGraph — "This model does not support assistant message prefill"
- Strands — "No valid tool use or tool use input was found in the Bedrock response"
- OpenAI Agents + LiteLLM — "minItems values other than 0 or 1 are not supported"
The fix:
- Stage 1: run the agent conversationally — tools enabled, no `output_type`.
- Stage 2: feed the final text into a fresh structured-output call.

See `references/structured-output-troubleshooting.md`.
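In outline, the two-stage pattern is just two calls. In the sketch below, run_agent and extract stand in for your framework's own functions; they are assumptions, not anything this skill ships:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def two_stage(
    run_agent: Callable[[str], str],        # Stage 1: conversational run, tools enabled, no output_type
    extract: Callable[[str, type], T],      # Stage 2: fresh structured-output call, no tools
    prompt: str,
    schema: type,
) -> T:
    """Two-stage structured output for Claude on Bedrock (hypothetical helper names)."""
    final_text = run_agent(prompt)       # the agent answers in free text
    return extract(final_text, schema)   # a separate call maps that text onto the schema
```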
Before handing off an eval produced by this skill:
- 3–5 dimensions, weights sum to 1.0
- Concrete level descriptors (not "good"/"bad")
- 2–3 calibration examples in the judge prompt
- ≥3 test cases with `reference_scores` for leniency
- Harness uses the defensive `parse_judge_response` helper
- Harness outputs the 7-field metadata contract
- `--framework`, `--test-case`, `-v`, `--trials` flags present
- Single-case smoke test passes end-to-end
- If Bedrock: two-stage structured output is the default
- If multi-subject: HTML dashboard renders with all subjects
