eval-layer

(Logo: pixel-art balance scale tilting back and forth)

A Claude Code skill that adds a rubric-based evaluation layer to any existing agent project. Framework-agnostic — works with PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents SDK, the raw Anthropic SDK, or anything else that exposes run(prompt) -> result.

You bring the agent. The skill gives you:

  1. A scoring rubric (3-5 dimensions with concrete level descriptors, not "good" / "bad")
  2. A test suite of inputs with ≥3 reference-graded cases (for calibration)
  3. A judge prompt for LLM-as-a-judge with 2-3 calibration examples
  4. An eval harness that runs your agent on the cases, sends output to the judge, and aggregates (see the sketch after this list):
    • per-dimension averages
    • weighted overall score
    • pass rate
    • leniency vs. the human references (flags if the judge is systematically too strict or too lenient)
  5. A per-subject markdown report and, for multi-subject benchmarks, a self-contained HTML dashboard (leaderboard + radar + per-case heatmap + failure-category breakdown).
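
A rough sketch of the aggregation step in item 4, assuming each judged case carries per-dimension scores on [0, 1] and the rubric defines a weight per dimension. The function and field names and the 0.7 pass threshold are illustrative, not the skill's actual defaults:

from statistics import mean

def aggregate(cases, weights, pass_threshold=0.7):
    """cases: list of {"scores": {dimension: float}} records as returned by the judge."""
    dims = list(weights)
    # Per-dimension averages across all judged cases.
    per_dim = {d: mean(c["scores"][d] for c in cases) for d in dims}
    # Weighted overall score per case, then pass rate against a threshold.
    overall = [sum(weights[d] * c["scores"][d] for d in dims) for c in cases]
    return {
        "per_dimension": per_dim,
        "weighted_score": mean(overall),
        "pass_rate": sum(s >= pass_threshold for s in overall) / len(overall),
    }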

Installation

This is a Claude Code skill. To use it as your own user-level skill, clone it into your Claude skills directory:

git clone <repo-url> ~/.claude/skills/eval-layer

Then from any Claude Code session: /eval-layer <your agent path>.


Why this exists

"Vibes-based" evals fail silently. You ship a change, the agent feels better, but you have no numbers. This skill produces numbers you can trust — including a judge-calibration signal (leniency) so you know when the judge itself is drifting, not just the agent.

Two measurements, not one:

  • Weighted score — how good the output is, per the rubric, on [0, 1]
  • Leniency = mean(judge_score − human_reference_score), on [-1, +1] (sketched below)
    • abs < 0.10 → well-calibrated
    • 0.10 – 0.25 → slight bias, monitor
    • abs > 0.25 → recalibrate before trusting scores
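
In code, the leniency signal might look something like this (a minimal sketch, assuming both the judge scores and the human reference scores are weighted overall scores on [0, 1]; the thresholds are the ones listed above):

def leniency(judge_scores, reference_scores):
    """Mean signed gap between the judge and the human references, on [-1, +1]."""
    gaps = [j - r for j, r in zip(judge_scores, reference_scores)]
    bias = sum(gaps) / len(gaps)
    if abs(bias) < 0.10:
        verdict = "well-calibrated"
    elif abs(bias) <= 0.25:
        verdict = "slight bias, monitor"
    else:
        verdict = "recalibrate before trusting scores"
    return bias, verdict

A positive bias means the judge is systematically more lenient than the human references; a negative one means it is stricter.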

The 7-field metadata contract

Every framework adapter returns the same shape. This is what makes cross-framework comparison honest:

{
    "recommendation": <schema instance or None>,  # agent's structured output
    "latency_ms":      int,                       # wall-clock including tool execution
    "tool_calls":      int,                       # count of tool invocations
    "input_tokens":    int | None,                # summed across turns
    "output_tokens":   int | None,
    "model_id":        str,
    "error":           str | None,
}

If a framework doesn't expose a field (Strands 0.1.x doesn't expose tokens, for example), pass None — don't fabricate. The harness handles None defensively throughout.
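
As a sketch of what an adapter might look like for a framework that only exposes run(prompt) -> result. The attribute names pulled off the result object are assumptions for illustration; real adapters differ per framework (see references/framework-adapters.md):

import time

def run_subject(agent, prompt, model_id):
    """Wrap any run(prompt) -> result agent in the 7-field metadata contract."""
    start = time.monotonic()
    try:
        result = agent.run(prompt)
        return {
            "recommendation": getattr(result, "output", None),  # structured output, if any
            "latency_ms": int((time.monotonic() - start) * 1000),
            "tool_calls": len(getattr(result, "tool_calls", []) or []),
            "input_tokens": None,   # pass None if the framework doesn't expose tokens
            "output_tokens": None,
            "model_id": model_id,
            "error": None,
        }
    except Exception as exc:
        # An error never drops the case: return the same shape with the error recorded.
        return {
            "recommendation": None,
            "latency_ms": int((time.monotonic() - start) * 1000),
            "tool_calls": 0,
            "input_tokens": None,
            "output_tokens": None,
            "model_id": model_id,
            "error": str(exc),
        }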


Usage

  1. In Claude Code, invoke the skill against your project:

    /eval-layer Add an eval layer to /path/to/your/agent
    
  2. Claude reads the agent, proposes a rubric, and on your confirmation generates:

    your-project/
      evals/
        eval_harness.py
        rubrics/main.yaml
        prompts/judge.md
        test_cases/seed.yaml
        reports/
          raw/<subject>.jsonl
          eval_<subject>_<ts>.md
          framework-comparison.html   # multi-subject only
    
  3. Smoke-test one case:

    python evals/eval_harness.py --framework <subject> --test-case easy-01 -v
    
  4. Full run:

    python evals/eval_harness.py --framework <subject>
    
  5. Multi-subject sweep + dashboard:

    python evals/eval_harness.py --framework all
    python evals/make_html_report.py
    

Required harness flags

Every harness generated by this skill exposes these:

Flag                Purpose
--framework NAME    which subject to run (or all)
--test-case ID      run only one case — essential for debugging adapters
-v / --verbose      print per-case metadata as it runs
--trials N          run each case N times (pass@k / variance)
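
The CLI surface is easy to sketch with argparse; generated harnesses will differ in defaults and wiring, but these four flags are the contract:

import argparse

parser = argparse.ArgumentParser(description="Rubric-based eval harness")
parser.add_argument("--framework", required=True, help="subject to run, or 'all'")
parser.add_argument("--test-case", help="run only this case ID (adapter debugging)")
parser.add_argument("-v", "--verbose", action="store_true", help="print per-case metadata")
parser.add_argument("--trials", type=int, default=1, help="run each case N times (pass@k / variance)")
args = parser.parse_args()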

Repository layout

eval-layer/
├── SKILL.md                                        # entry point — the process Claude follows
├── references/
│   ├── rubric-design.md                            # dimension catalog, scales, leniency thresholds, anti-patterns
│   ├── judge-prompts.md                            # judge prompt template + calibration techniques
│   ├── judge-robustness.md                         # defensive JSON parse, retry-once, never-drop-a-result
│   ├── framework-adapters.md                       # copy-paste recipes per framework (w/ the metadata contract)
│   ├── structured-output-troubleshooting.md        # the three Bedrock Opus errors and the two-stage fix
│   ├── cross-subject-benchmarking.md               # multi-subject flow (models / frameworks / prompts)
│   └── html-report-template.html                   # self-contained Chart.js dashboard template
└── README.md

SKILL.md is the top-level recipe (loaded when the skill is invoked). The references/ files are loaded on demand for the relevant step.
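
references/judge-robustness.md covers the defensive JSON parse and retry-once behaviour that the validation checklist below asks for. A minimal sketch of the idea, with a return shape that is an assumption rather than the skill's exact helper:

import json
import re

def parse_judge_response(text):
    """Pull the first JSON object out of a judge reply; never raise, never drop a result."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0)), None
        except json.JSONDecodeError as exc:
            return None, f"invalid JSON: {exc}"
    return None, "no JSON object found"

The harness can retry the judge call once when the error element is non-None, and record the error on the case instead of dropping it.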


Bedrock gotcha

If the agent targets Claude on Bedrock, the default path for structured output is the two-stage pattern, not response_format / output_type. Single-stage structured output fails with three distinct errors on Bedrock Opus 4.x across LangGraph, Strands, and OpenAI Agents SDK:

  • LangGraph: "This model does not support assistant message prefill"
  • Strands: "No valid tool use or tool use input was found in the Bedrock response"
  • OpenAI Agents + LiteLLM: "minItems values other than 0 or 1 are not supported"

The fix:

Stage 1: run the agent conversationally — tools enabled, no output_type.
Stage 2: feed the final text into a fresh structured-output call.

See references/structured-output-troubleshooting.md.
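
The shape of the two-stage pattern, framework-neutral. The names here are placeholders for whatever your framework calls its conversational run and its schema-constrained call; the per-framework recipes live in the troubleshooting reference:

def run_two_stage(agent, extractor, prompt):
    """Stage 1: conversational run with tools, no output_type.
    Stage 2: a fresh structured-output call over the final text only."""
    final_text = agent.run(prompt)       # tools enabled, plain text out
    return extractor.run(final_text)     # no tools, schema-constrained output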


Validation checklist

Before handing off an eval produced by this skill:

  • 3–5 dimensions, weights sum to 1.0
  • Concrete level descriptors (not "good"/"bad")
  • 2–3 calibration examples in the judge prompt
  • ≥3 test cases with reference_scores for leniency
  • Harness uses the defensive parse_judge_response helper
  • Harness outputs the 7-field metadata contract
  • --framework, --test-case, -v, --trials flags present
  • Single-case smoke test passes end-to-end
  • If Bedrock: two-stage structured output is the default
  • If multi-subject: HTML dashboard renders with all subjects
