From 1eb187a292d448b7532cd8f89d9b3f92612b609e Mon Sep 17 00:00:00 2001 From: Sebastian Wessel Date: Fri, 8 May 2026 13:11:44 +0200 Subject: [PATCH] doc: agentic research eval --- agent-evaluation-research/README.md | 55 + agent-evaluation-research/REPORT.md | 1091 ++++++++++++++ .../diagrams/01-three-layers.svg | 84 ++ .../diagrams/02-qualops-architecture.svg | 159 ++ .../diagrams/03-eval-cadence.svg | 91 ++ agent-evaluation-research/report.html | 1338 +++++++++++++++++ .../sources/01-foundations.md | 355 +++++ .../sources/02-frameworks.md | 775 ++++++++++ .../sources/03-toolcalling-and-trajectory.md | 519 +++++++ .../sources/04-improvement.md | 543 +++++++ 10 files changed, 5010 insertions(+) create mode 100644 agent-evaluation-research/README.md create mode 100644 agent-evaluation-research/REPORT.md create mode 100644 agent-evaluation-research/diagrams/01-three-layers.svg create mode 100644 agent-evaluation-research/diagrams/02-qualops-architecture.svg create mode 100644 agent-evaluation-research/diagrams/03-eval-cadence.svg create mode 100644 agent-evaluation-research/report.html create mode 100644 agent-evaluation-research/sources/01-foundations.md create mode 100644 agent-evaluation-research/sources/02-frameworks.md create mode 100644 agent-evaluation-research/sources/03-toolcalling-and-trajectory.md create mode 100644 agent-evaluation-research/sources/04-improvement.md diff --git a/agent-evaluation-research/README.md b/agent-evaluation-research/README.md new file mode 100644 index 00000000..c3221d08 --- /dev/null +++ b/agent-evaluation-research/README.md @@ -0,0 +1,55 @@ +# Agent Evaluation Research — folder index + +This folder contains the deliverables for the QualOps agent-evaluation research project. + +**Goal**: a comprehensive report on state-of-the-art approaches to measure and improve LLM-agent performance and reliability, plus a generalized concept we can apply to QualOps and similar projects. 
+ +**Out of scope**: continuous online self-improvement in production. We deploy fixed versions; improvement is offline between releases. + +## Read in this order + +1. **[REPORT.md](REPORT.md)** — main report (markdown, ~12,000 words). Synthesizes all four dossiers into a coherent narrative for a mixed engineering + leadership audience. Contains the executive summary, the QualOps Approach (Part 6), prerequisites, and the adoption roadmap. +2. **[report.html](report.html)** — interactive HTML version with sidebar navigation, embedded mermaid diagrams, and styled tables. Open in a browser. Print-to-PDF works well. +3. **Standalone diagrams** in [`diagrams/`](diagrams/) — three SVGs of the core concepts (three layers, QualOps architecture, two-tier eval cadence). These are extracted from the report for reuse in slides or external documents. + +## Source dossiers + +The main report is built on top of four primary-source research dossiers, each of which stands on its own as a deeper reference: + +- **[sources/01-foundations.md](sources/01-foundations.md)** (~5,000 words) — Foundational concepts: agent vs. LLM eval, the three layers, dimensions of quality, the eval lifecycle, LLM-as-judge methodology and biases, statistical rigor, recent academic directions. +- **[sources/02-frameworks.md](sources/02-frameworks.md)** (~5,800 words) — Eval / observability framework landscape: deep dive on Langfuse, LangSmith, DeepEval, RAGAS, Braintrust, OpenAI Evals, Anthropic Console, Phoenix/Arize, W&B Weave, MLflow, Patronus, Promptfoo, Inspect AI, plus AgentOps / Helicone / LangWatch. Comparison matrix, fit-by-team-profile, CI integration patterns. +- **[sources/03-toolcalling-and-trajectory.md](sources/03-toolcalling-and-trajectory.md)** (~5,200 words) — Tool-calling and trajectory eval specifically: tool-call accuracy decomposition, trajectory metrics, outcome vs. 
process, major benchmarks (BFCL, τ-bench, ToolBench, WebArena, AgentBench, GAIA, the SWE-bench family, MLAgentBench, AppWorld, DevAI), code-agent specifics, agent-as-judge, replay testing, calibration under non-determinism, decision guide. +- **[sources/04-improvement.md](sources/04-improvement.md)** (~5,500 words) — Systematic agent improvement: error analysis methodology (Hamel Husain / Eugene Yan / Shreya Shankar), eval-driven loop, prompt engineering as discipline, automated prompt optimization (DSPy/MIPROv2, TextGrad, AdalFlow, SAMMO), few-shot mining, sub-agent decomposition and Skills, tool design, context engineering, reflection patterns, routing, fine-tuning / DPO, regression suites, data flywheel, code-review-agent specific patterns. + +Each dossier ends with an annotated reference list of 30–50 primary sources (arXiv papers, vendor docs, practitioner blogs). + +## Folder layout + +``` +agent-evaluation-research/ +├── README.md # this file +├── REPORT.md # main synthesis report +├── report.html # interactive HTML version +├── diagrams/ +│ ├── 01-three-layers.svg # the three-layer eval concept +│ ├── 02-qualops-architecture.svg # pipeline + eval + improvement +│ └── 03-eval-cadence.svg # two-tier cadence +├── sources/ +│ ├── 01-foundations.md +│ ├── 02-frameworks.md +│ ├── 03-toolcalling-and-trajectory.md +│ └── 04-improvement.md +├── assets/ # placeholder for additional assets +└── drafts/ # placeholder for working drafts +``` + +## TL;DR for someone with five minutes + +- **What is the right approach?** Three layers (component / trajectory / outcome), two tiers (per-PR fast gate + nightly capability eval), one curated golden set (50–200 real PRs). +- **Tooling for QualOps specifically**: keep Langfuse; add Promptfoo for the per-PR CI gate; add Inspect AI for nightly capability eval. Don't migrate. +- **Improvement loop**: open-coding → axial coding → frequency-weighted prioritization. 
Apply the cheapest fix first (prompt → few-shot → tool → context → sub-agent → optimizer → fine-tune). +- **Stage-by-stage**: tool-call F1 for Analyze; location precision/recall for Review; SWE-bench-style harness for Fix; schema validation for Report; agreement-with-humans for Judge. +- **Statistical discipline**: pass^k with k≥5; paired comparison vs baseline; report 95% CI; refresh judge calibration quarterly. +- **Adoption**: ~10 weeks of phased engineering effort + ~$4,500/mo steady-state LLM costs. + +Detailed reasoning for each of these is in [REPORT.md](REPORT.md). diff --git a/agent-evaluation-research/REPORT.md b/agent-evaluation-research/REPORT.md new file mode 100644 index 00000000..03357f72 --- /dev/null +++ b/agent-evaluation-research/REPORT.md @@ -0,0 +1,1091 @@ +# Evaluating and Improving LLM Agents + +**A state-of-the-art report on agent accuracy, evaluations, and offline improvement — with a generalized approach for QualOps and similar projects.** + +*QualOps Research · May 8, 2026* + +--- + +## Document map + +This report is structured for two audiences. The **executive summary** and **Part 1** (Why this matters) are written for leadership; the rest is written for the engineering team that will implement the eval and improvement program. The **QualOps Approach** in Part 6 is the actionable plan distilled from everything before it. References and a glossary close the document. 
+ +| Part | Title | Primary audience | +|---|---|---| +| 0 | Executive summary | Leadership + engineering | +| 1 | Why this matters for QualOps | Leadership | +| 2 | Foundations of agent evaluation | Engineering | +| 3 | Evaluating tool-calling and workflow agents | Engineering | +| 4 | The framework landscape | Engineering | +| 5 | Systematically improving agents (offline) | Engineering | +| 6 | **The QualOps Approach** — generalized concept | Both | +| 7 | Prerequisites and adoption roadmap | Both | +| 8 | Risks, open questions, and what we left out | Both | +| 9 | Appendix: glossary, references, dossier links | Engineering | + +--- + +## 0. Executive summary + +LLM agent quality is not a property of the model alone — it is a property of the *system*: the model, the prompts, the tools, the context routing, and the harness, all together. As QualOps has matured into a multi-stage agentic pipeline (Analyze → Review → Fix → Report → Judge), the unit that matters has become the **trajectory** the agent takes through tool calls, not just the final PR comment. Evaluating that trajectory and systematically improving it is now the bottleneck for accuracy, cost, and trust. + +The good news: the field has converged on a small, well-supported playbook. We do **not** need to invent it. The shape of the recommendation is: + +1. **Score the agent on three layers, every release.** Component-level (did each tool call fire correctly?), trajectory-level (was the path coherent?), outcome-level (did the final report and patch hold up?). Single end-to-end scores hide too much. +2. **Use a small, curated golden set, not a giant crawl.** 50–200 carefully labeled real PRs, refreshed quarterly, beat 10,000 synthetic ones. Anthropic's, Sierra's, and Cognition's published guidance all converge on this. +3. **Apply the right tool to the right stage.** Deterministic tests for the Fix stage (SWE-bench-style "apply patch, run tests"). Tool-call F1 and AST-match for Analyze. 
LLM-as-judge with structured rubrics for Review and Report. Agent-as-judge for the Judge stage's meta-eval. +4. **Run a two-tier eval cadence.** A fast per-PR gate (~3–5 min, deterministic asserts on each stage) and a slower nightly or weekly capability eval (~30–60 min, full pipeline + LLM judges + paired statistical comparison against a baseline). Both should fail-loud. +5. **Improve through structured error analysis.** Open-coding 30–50 failing traces, axial-coding into a 5–15 bucket taxonomy, and prioritizing by `frequency × severity × fixability` is the single highest-ROI activity in the loop. The next-cheapest fixes — prompt edits, few-shot mining, tool surface cleanup, context engineering — beat fine-tuning in almost every case until the prompt surface is exhausted. +6. **Keep what works and add what is missing.** QualOps already runs Langfuse with datasets, scorers, presets, and LLM-as-judge — that is the right foundation. The two gaps worth filling are (a) a **per-PR CI gate** with developer-friendly diffs (Promptfoo's GitHub Action) and (b) a **nightly capability eval** harness (Inspect AI or hand-rolled with `agentevals` patterns) that scores the full pipeline end-to-end on a held-out fixture set. Migration away from Langfuse is not recommended *unless we discover a specific feature gap* — the marginal UX wins of LangSmith / Braintrust do not justify a closed-source migration today. + +If implemented in full, this gives QualOps a release process where every prompt, skill, tool, or model change produces a quantitative, auditable delta against a known baseline — and where regressions are caught before they reach a customer's pull request. + +--- + +## 1. Why this matters for QualOps + +### 1.1 What QualOps is + +QualOps is an AI-powered code review tool built on the Claude Agent SDK. It runs in CI on every pull request and produces structured findings — comments, GitHub Checks annotations, severity-ranked reports, and (in agentic mode) suggested fixes. 
The system is organised as a multi-stage pipeline: + +``` + ┌──────────┐ ┌─────────┐ ┌─────┐ ┌────────┐ ┌───────┐ +PR diff ───────► │ Analyze │───►│ Review │───►│ Fix │───►│ Report │───►│ Judge │───► CI status + └──────────┘ └─────────┘ └─────┘ └────────┘ └───────┘ + detect per-file suggest aggregate quality + changed findings patch + format gate + files (pass/fail) +``` + +In agentic mode, sub-agents (security, dependency, breaking-change) operate in parallel within the Review stage, and the Judge stage acts as an internal LLM-as-judge over the final report. + +### 1.2 What is at stake + +A code review is a **trust artifact**. A false positive — a flagged "vulnerability" that is not real — wastes developer time and erodes confidence in every subsequent finding. A false negative — a missed real bug — defeats the purpose of the tool. A confidently miscalibrated severity label routes attention away from the issues that actually matter. None of these failures are catastrophic individually, but they compound across thousands of PRs. + +For a tool that runs in production CI, the consequences of letting accuracy drift quietly are: + +- **Customer churn.** Reviewers turn the bot off. Recovering trust is expensive. +- **Hidden regressions.** A prompt change that fixes one issue and silently regresses three others accumulates over months. +- **Cost overruns.** Without per-stage cost-quality measurement, model upgrades produce big bills with unclear value. +- **Audit risk.** Enterprise customers increasingly require evidence that the tool's outputs are validated. + +The fix is not "more model" — it is **disciplined evaluation and structured improvement**. The community converged on this consensus through 2024–2026; this report makes it concrete for QualOps. 
+ +### 1.3 What we already have + +QualOps ships with a working evaluation suite: Langfuse-backed dataset runs, multiple presets (`fast`, `default`, `sonnet-agentic`, `thorough`), CRB-derived golden datasets across five real repos (Sentry, Grafana, Cal.com, Discourse, Keycloak), and a configurable LLM-as-judge scoring stage. This puts QualOps ahead of most teams shipping agentic products today. The gaps we identify in this report are deliberate next steps, not foundations. + +### 1.4 What "out of scope" means + +The brief that motivated this report explicitly excludes **online self-improvement in production** — RLHF, continuous training, online policy updates. QualOps deploys fixed versions. Improvement happens between releases, on labeled traces, in CI. Everything in this report is consistent with that model. + +--- + +## 2. Foundations of agent evaluation + +### 2.1 An agent is not a function + +A classical LLM eval treats the model as a function `f(prompt) → completion` and grades the completion against a reference (BLEU, ROUGE, exact match, regex). The unit of evaluation is one input/output pair. + +An agent eval treats the agent as a **stateful policy** `π` that interacts with an environment via tools. The unit of evaluation is a **trajectory** — an ordered record of states, actions (tool calls), and observations: + +``` +τ = (s₀, a₀, o₀, s₁, a₁, o₁, …, sₙ) +``` + +Every benchmark surveyed for this report agrees: agent evaluation requires assessing not just the terminal answer but the *path* taken to reach it. Anthropic's engineering team frames the same shift as moving from "single-output grading" to "behavior verification across many turns" ([*Demystifying evals for AI agents*](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)). + +### 2.2 The three layers of agent evaluation + +Modern taxonomies (Yu et al. 
2025; LangChain documentation; Anthropic engineering) recognize three nested layers of evaluation: + +| Layer | Question | Typical metric | +|---|---|---| +| **Component-level** | Does each sub-skill (retriever, single tool call, sub-agent) work in isolation? | Tool-match rate, parameter F1, retrieval recall@k | +| **Trajectory-level** | Is the path of reasoning + actions valid, efficient, and faithful? | Plan correctness, trajectory edit distance, tool-call F1 over the sequence | +| **Outcome / end-to-end** | Did the agent achieve the user goal? | Task success, unit-test pass rate (SWE-bench), human rating | + +For QualOps, all three layers exist naturally: + +- **Component**: did the Analyze stage `read_file` with the right path? Did the security sub-agent emit a well-formed finding? +- **Trajectory**: did the Review stage's parallel sub-agents converge to a coherent set of findings without redundant tool calls? +- **Outcome**: did the suggested fix actually fix the bug, and did the final report match what a human reviewer would write? + +Reporting only the outcome is the most common mistake. An agent can land on a correct PR comment after twelve irrelevant tool calls and faulty reasoning along the way. Outcome metrics miss this. + +### 2.3 Dimensions of agent quality + +Stanford's HELM framework canonicalized seven core metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) and showed that accuracy alone hides serious failure modes. For an agentic code-review system, the dimensions that matter most are: + +| Dimension | Why it matters for QualOps | How to measure | +|---|---|---| +| **Accuracy / task success** | The headline number. | Exact match, unit-test pass, human rating | +| **Faithfulness / groundedness** | A finding must be supported by an actual line of the diff or repo, never invented. 
**Dominant for code review.** | Atomic-claim NLI; "every claim must cite a file:line" guardrail | +| **Completeness** | Did the agent find all the issues a human would? | Recall against an annotated PR review | +| **Calibration** | Severity labels must be trustworthy for triage. | Expected calibration error (ECE), Brier score | +| **Robustness** | Stable under prompt perturbation, weird diffs, large files. | Performance under paraphrase/typo/adversarial suites | +| **Determinism / consistency** | Same PR → same review (or stable distribution). | Output variance across N samples; pass^k | +| **Latency** | CI gates have time budgets. | p50/p95/p99 wall-clock per stage | +| **Cost** | $ per PR. | Tokens × price + tool-call costs | +| **Safety** | Should not leak secrets or follow injected instructions in untrusted source. | Red-team pass rate | + +Two notes specific to QualOps: + +- **Faithfulness is dominant.** A hallucinated finding is more damaging than a missed one because it erodes reviewer trust. The RAG literature's faithfulness method — extract atomic claims, verify each against the source — generalizes directly. Concretely: every claim in the report must cite a file:line; if it cannot, drop the claim. ("Citations as guardrail" is the established pattern.) +- **Calibration matters for triage.** If QualOps emits a severity label, ECE quantifies whether "high" findings are actually higher priority. LLMs are systemically overconfident under verbalized prompting; consistency-based confidence (sample N, measure agreement) is more reliable than asking the model to rate its own confidence. + +### 2.4 The evaluation lifecycle + +Mature teams converge on a recognizable loop ([Husain](https://hamel.dev/blog/posts/evals/); [Yan et al., O'Reilly](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/)): + +```mermaid +flowchart LR + A[1. Error analysis<br/>on real traces] --> B[2. Codify failure<br/>modes as rubric] + B --> C[3. Add to golden set<br/>+ regression suite] + C --> D[4. Run evals in CI;<br/>block on regression] + D --> E[5. Ship + monitor<br/>in production] + E --> F[6. Sample drift,<br/>online judge] + F --> A +``` + +Tactics endorsed across primary sources: + +- **Start small.** Anthropic's engineering team writes that "20–50 simple tasks drawn from real failures is a great start." Simon Willison: "if you're passing 100% of your evals, you're not challenging your system enough." +- **Eval-driven development.** Treat evals like unit tests: write them first, fail them, then build the change that passes them. Every production failure becomes a new eval row. +- **Golden datasets are curated, not crawled.** They should reflect real production distribution, include known failure cases, and cover edge cases discovered in error analysis. For QualOps: 50–200 real PRs spanning languages, sizes, change types, and labeled failure modes. +- **Regression tests on every PR.** Run the eval suite in CI on each prompt or code change. Block merges on stat-sig regression of any axis. +- **Online monitoring** samples production traffic (5–10% is a common heuristic) and runs an LLM-judge fleet asynchronously to flag drift. + +### 2.5 LLM-as-judge: the workhorse, with caveats + +Zheng et al.'s [*Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena*](https://arxiv.org/abs/2306.05685) (NeurIPS 2023) showed that GPT-4 acting as a judge agreed with human preference at over 80% — roughly the same as inter-human agreement. This legitimized LLM-as-judge as a primary evaluation method, and it is now the workhorse of every eval framework on the market. + +The same paper, and a flood of follow-ups, identify recurring failure modes.
The biases QualOps must mitigate: + +| Bias | What happens | Mitigation | +|---|---|---| +| **Position bias** | Judge prefers whichever answer appears first | Swap order, score both, average | +| **Verbosity bias** | Longer answers rated higher | Constrain length in rubric; normalize | +| **Self-preference** | Judge prefers outputs from its own model family | Use a different model as judge; ensemble across providers | +| **Familiarity / low-perplexity bias** | Judge favors text it would have generated | Down-weight low-perplexity samples | +| **Sycophancy** | Judge follows hints in the prompt about which is "better" | Blind the judge to source | +| **Fallacy oversight** | Judge accepts confident-sounding wrong reasoning | Require step-by-step grading; use process supervision | + +Hamel Husain's *Using LLM-as-a-Judge* guide recommends **binary pass/fail rubrics** over Likert scales for production judges, because binary judges are easier to calibrate against humans and easier to debug. For open-ended quality grading (e.g. "is this PR comment helpful?"), pairwise comparison with order-swapping outperforms pointwise scoring. + +#### When LLM-as-judge fails + +Eugene Yan's survey of two dozen judge papers ([eugeneyan.com](https://eugeneyan.com/writing/llm-evaluators/)) flags cases where LLM-judge is unreliable: + +- Tasks requiring deep domain expertise the judge lacks. +- Tasks where the judge would have to do work harder than the generator (judging a math proof when the judge cannot do the math). +- Highly subjective tasks where humans disagree among themselves. + +For these cases, three escalation paths exist: **process reward models** (PRMs) that score each reasoning step; **agent-as-judge** that gives the judge tool access to verify claims; and **periodic human review** of a calibration sample. We use all three at appropriate points in the QualOps Approach (Part 6). 
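The order-swapping mitigation for position bias listed in the bias table of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not QualOps code: the `Judge` callable signature, the `"first"` / `"second"` / `"tie"` verdict strings, and the two toy judges are all hypothetical. For categorical verdicts, "average" is interpreted here as "count a win only when both presentation orders agree".

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, question: str, a: str, b: str) -> str:
    """Return 'a', 'b', or 'tie' after judging both presentation orders."""
    forward = judge(question, a, b)   # candidate a is shown first
    backward = judge(question, b, a)  # candidate b is shown first

    # Map positional verdicts back to candidate labels.
    forward_pick = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_pick = {"first": "b", "second": "a"}.get(backward, "tie")

    # A win only counts when both orders agree; disagreement signals
    # that position, not content, drove the verdict.
    return forward_pick if forward_pick == backward_pick else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always picks the first slot."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge that is order-independent (it looks only at length)."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


biased_result = debiased_pairwise(always_first, "q", "x", "y")
stable_result = debiased_pairwise(prefers_longer, "q", "a much longer answer", "short")
```

With the position-biased toy judge the two orders disagree, so the verdict collapses to a tie, while the order-independent judge yields the same winner in both orders. Note that `prefers_longer` is itself an example of verbosity bias, which order swapping does not remove; that bias still needs the rubric-level mitigation from the table.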
+ +### 2.6 Trajectory and process evaluation + +A code-review agent can produce a correct final report by accident — having issued ten irrelevant tool calls and reasoned incorrectly along the way. Outcome metrics miss this. Process evaluation asks: *was every intermediate step justified?* + +Three families of techniques exist: + +- **Step-wise correctness.** Score each step on (a) whether the tool was appropriate, (b) whether the arguments were valid, (c) whether the output was used. Aggregating gives a step success rate. +- **Plan-level metrics.** Treat the reference trajectory as a set or multiset of expected steps. Compute precision = |predicted ∩ reference| / |predicted|, recall = |predicted ∩ reference| / |reference|, F1. +- **Edit distance.** Treat both trajectories as strings of tool tokens; Levenshtein distance, optionally weighted by argument similarity. Continuous score. + +OpenAI's *Let's Verify Step by Step* (Lightman et al. 2023) showed that **process supervision** beats outcome supervision for training reward models on math; the methodology has since been extended to reasoning PRMs (R-PRM, ThinkPRM) and to coding agents. For QualOps, the directly applicable lesson is: **score the trajectory, not just the final report.** + +### 2.7 Statistical rigor — why "vibes" fail + +With N=10 examples and a stochastic model, a swing of ±20% in pass rate is normal noise. Most teams operate at this scale and overclaim improvements that are within the noise floor. The minimum statistical discipline: + +- **Paired comparisons.** Run model A and model B on the *same* set of examples; the per-example difference cancels per-example variance. McNemar's test for binary outcomes; paired bootstrap for any metric. +- **Confidence intervals.** For binary pass/fail with sample mean p and N samples, the 95% CI half-width is roughly `1.96 × sqrt(p(1-p)/N)`. To distinguish 80% from 85% accuracy at 95% confidence you need ~1000 samples — and most teams have far fewer. 
This is why **paired** matters. +- **Multiple runs per task.** Even at temperature 0, modern serving stacks are non-deterministic (batched inference, kernel non-determinism). Plan for ≥5 runs per task and report mean ± std. +- **pass^k reporting.** Sierra's τ-bench introduced the *probability of succeeding k times in a row* as the reliability metric. For QualOps it answers a real question: "is this prompt change reliable enough to deploy?" +- **Bradley-Terry / Elo for pairwise rankings.** When the metric is "which model wins this pair?", fit a latent skill rating per model; resample pairs with bootstrap to get CIs on each rating. This is what Chatbot Arena does. + +Anthropic's [*A Statistical Approach to Model Evals*](https://www.anthropic.com/research/statistical-approach-to-model-evals) is the most accessible engineer-facing walkthrough of these techniques. + +### 2.8 Recent academic directions worth knowing + +- **Process reward models (PRMs)** — beyond math, PRMs are now applied to coding agents and multi-step retrieval. The trend is from discriminative classifiers toward **generative / reasoning PRMs** that produce a rationale before scoring. +- **Self-consistency, self-refinement, debate** — sampling multiple reasoning paths and majority-voting; agents critiquing and revising their own output; multi-agent debate as scalable oversight. +- **Constitutional methods** — written rubrics the model uses to critique and revise its own outputs (RLAIF). Useful for *evaluation* too: the constitution doubles as a rubric. +- **Agent-as-judge** — the newest frontier. Rather than a static LLM judge, the judge is an *agent* with tools who can re-read the source, run tests, and verify intermediate steps. Zhuge et al. 2024 ("Agent-as-a-Judge") report ~90% human agreement and ~97% cost reduction vs. human evaluation. **This is directly relevant to QualOps**: the existing Judge stage already smells like agent-as-judge applied internally. + +--- + +## 3. 
Evaluating tool-calling and workflow agents + +QualOps is a **tool-calling workflow agent**, not a chatbot. Conversational eval techniques (turn-level helpfulness, persona consistency) are largely irrelevant. What matters is whether the agent picks the right tools, in the right order, with the right arguments, and produces the right final artifact. This part summarizes the techniques that map cleanly onto QualOps's pipeline. + +### 3.1 What "tool-call accuracy" actually means + +"Tool-call accuracy" is a deceptively flat label. In practice it decomposes into a stack of sub-metrics, each measuring a different failure mode: + +- **Exact match** — predicted call equals gold call byte-for-byte. Brittle: `{"path": "src/foo.py"}` vs `{"path": "./src/foo.py"}` are equivalent but exact-match scores them as wrong. +- **AST match** (the BFCL standard) — parse the call into name + (arg-name, arg-value) pairs and match structurally. Argument-order independent, basic format normalization. +- **Semantic match** — an LLM judge or custom equality function decides whether two argument values are functionally identical. +- **Argument F1** — per-call precision/recall on argument names and values. Distinguishes "wrong tool" from "right tool, wrong argument." +- **Tool-call F1** (set-level) — over a multiset of (tool, args) pairs across the trajectory. +- **Multi-call ordering** — exact match, in-order subsequence, any-order set, or edit distance. +- **Hallucinated tools** — the agent invents a tool that doesn't exist or arguments that don't apply. BFCL has a dedicated *irrelevance* category. +- **Missed tools** — the agent answered from its own (often outdated) knowledge instead of calling the available tool. Recall is the natural metric. Production teams flag under-tooling as one of the top failure modes. +- **Idempotency / collateral damage** — a side-effecting call (post comment, write file) was issued multiple times, or a tool was called that mutated unintended state. 
AppWorld's eval explicitly penalizes collateral damage. +- **Parallel calls** — agents (Claude in particular) can emit multiple tool calls in one turn. Scorers must handle a *bag* of tool calls per turn, not a list. + +For QualOps these manifest stage by stage: + +| Stage | Most relevant tool-call metric | +|---|---| +| Analyze | Tool-call F1 against expected `read_file` / `grep` set; under-tooling rate | +| Review | Argument F1 on finding location (file, line range); hallucinated-tool detection | +| Fix | AST match on patch primitives; idempotency check on `apply_patch` | +| Report | Schema validation; structured-output conformance | +| Judge | Argument F1 on severity labels; calibration vs human gold | + +### 3.2 Trajectory and plan evaluation + +A trajectory is the ordered record of (action, observation) pairs. Evaluating it answers two distinct questions: + +- **Q1 — Did the agent get to the goal?** (outcome, goal-completion) +- **Q2 — Did it follow a sensible path?** (process, plan quality) + +These are orthogonal: an agent can stumble to the right answer through a 47-step random walk, or it can take an optimal 3-step path that ends in the wrong final state. Scoring rules in increasing leniency: + +``` +trajectory exact match < in-order match < any-order match < edit distance +[strictest] [most lenient] +``` + +The Sierra / τ-bench position is explicit: in production they care primarily about **goal database state** — they compare the post-conversation DB to an annotated goal DB. Cursor's `CursorBench` flips this: they care about path quality (code style, efficiency, interaction) too because users *experience* the path. + +For QualOps the recommendation is **hybrid**: outcome at the gate (Fix patch must pass tests; final Report must validate schema), plus per-stage trajectory metrics for diagnostics and ranking when outcomes are roughly equal. + +### 3.3 Outcome vs. 
process — when to use each + +| Aspect | Outcome eval | Process eval | +|---|---|---| +| Data needed | A goal-state checker (unit test, schema, regex) | Reference trajectories (expensive) or judge model | +| Cost | Cheap, deterministic | Expensive or noisy | +| Catches "lucky shortcuts" | No — agent can game it | Yes | +| Catches plan inefficiency | No | Yes | +| Penalizes equivalent-but-different paths | No (good) | Yes (bad — risks rewarding rote imitation) | + +Pitfalls of pure outcome: + +- **Reward hacking via lucky shortcut.** An agent finds a single-line trick that passes the FAIL_TO_PASS test but doesn't actually fix the bug class. SWE-bench Verified mitigates this by also requiring PASS_TO_PASS. +- **Spec ambiguity.** OpenAI annotators rejected ~30% of original SWE-bench instances for ambiguous specs or wrong test patches when building Verified. +- **Non-reproducibility.** Stochastic tools make the goal state non-deterministic. + +Pitfalls of pure process: + +- **Path rigidity** — penalizing a faster, equivalent path. Anthropic's eval blog calls this out as the most common pitfall they see. +- **Reference-trajectory bias** — human authors write idealized trajectories that don't reflect how an LLM actually thinks; comparing against them rewards mimicry over capability. + +### 3.4 Major benchmarks worth knowing for QualOps + +QualOps will not adopt these benchmarks wholesale, but their **methodologies** transfer directly. The most relevant ones: + +| Benchmark | Year | Why it matters for QualOps | +|---|---|---| +| **BFCL v3** ([leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html)) | 2024 | AST match + executable accuracy methodology; the right framework for scoring per-stage tool calls. | +| **τ-bench** ([Sierra](https://sierra.ai/blog/benchmarking-ai-agents)) | 2024 | pass^k metric for reliability under non-determinism. Directly transferable. 
| +| **SWE-bench Verified** ([swebench.com](https://www.swebench.com/verified.html))† | 2024 | Apply patch + FAIL_TO_PASS + PASS_TO_PASS test harness. *The* template for evaluating QualOps's Fix stage. | +| **SWE-bench Live** ([swe-bench-live](https://swe-bench-live.github.io/)) | 2025 | 50 freshly verified GitHub issues per month. Contamination-free source of code-review test cases. **Now the recommended SWE-bench variant** for fresh, uncontaminated cases. | +| **SWE-bench Pro** ([Scale AI](https://github.com/scaleapi/SWE-bench_Pro-os)) | 2025 | Long-horizon, enterprise-scale. GPT-5 23.3% / Claude Opus 4.1 23.1% — i.e. enterprise-scale code-agent tasks remain hard. | +| **AppWorld** ([appworld.dev](https://appworld.dev/)) | 2024 | State-based eval with collateral-damage check. The model for any side-effecting QualOps action. | +| **DevAI / Agent-as-a-Judge** ([repo](https://github.com/metauto-ai/agent-as-a-judge)) | 2024 | Methodology for using an agent (with tools) as the judge. Maps to QualOps's Judge stage. | +| **TRAJECT-Bench** | 2025 | Trajectory-quality metrics over outcomes. | +| **Holistic Agent Leaderboard** ([HAL](https://arxiv.org/pdf/2510.11977)) | 2025 | Variance-decomposed reporting; good model for our internal dashboards. | + +The single most influential idea for QualOps is **SWE-bench's "tests as oracle"**: apply the agent's patch, run FAIL_TO_PASS + PASS_TO_PASS, classify. Fully deterministic, fully outcome-based, ignores how the agent got there, resists most reward hacking. We adopt the *methodology* directly for the Fix stage in Part 6. + +> **† Note on SWE-bench Verified status (May 2026):** OpenAI publicly deprecated SWE-bench Verified on Feb 23, 2026, citing flawed test patches and contamination concerns. 
The benchmark's *methodology* (apply patch, run FAIL_TO_PASS + PASS_TO_PASS) remains the gold standard for code-agent evaluation, but for a fresh, contamination-free dataset prefer **SWE-bench Live** (50 newly-verified issues per month) or **SWE-bench Pro**. Internal harnesses built on the methodology are unaffected; only the specific 500-instance frozen dataset is no longer recommended as a leaderboard target. + +### 3.5 Code-agent specific evaluation + +For QualOps's Review stage (no patch, just a comment), the analog of SWE-bench's pattern is: + +1. **Finding location precision/recall** — did the agent flag the right line and file? +2. **Finding-class match** — did it categorize correctly (security vs perf vs style)? +3. **Finding–PR alignment** — does the finding correspond to something the human reviewer also flagged? + +Test execution is the gold-standard oracle, but tests don't catch every flavor of bad fix: + +- **Style / readability regression** — tests pass, but the diff is ugly, over-broad, or violates project conventions. +- **Performance regression** — tests pass but quietly add an O(n²). +- **Security regression** — tests pass but the patch introduces a new vuln. + +These need additional graders: linter / formatter delta, perf benchmark, CodeQL/Semgrep diff, LLM-judge with explicit criteria. Cursor's CursorBench explicitly grades "code quality" and "efficiency" alongside correctness for this reason. + +### 3.6 Agent-as-judge + +The newest frontier. Zhuge et al.'s [*Agent-as-a-Judge*](https://arxiv.org/abs/2410.10934) (Oct 2024; ICML 2025) replaces the LLM judge with an *agent* judge that can read code, run tools, and verify intermediate steps. They release **DevAI**, 55 AI-dev tasks with 365 hierarchical requirements, and report: + +- ~90% agreement with human expert (vs ~70% for plain LLM-judge). +- ~97% cost reduction (86 h / $1,297 → ~2 h / $31). 
+ +Works well when: + +- The artifact is open-ended (no unit tests possible) — "is this PR comment helpful and accurate?" +- Evaluation requires looking at intermediate steps — "did the agent actually verify this finding by reading the file, or hallucinate it?" +- You have a structured rubric the judge can iterate over. + +Doesn't help when: + +- The judge shares the candidate's biases (same model family — self-preference). +- Stakes require human sign-off anyway. +- A simple deterministic check exists. + +For QualOps: an *external* agent-as-judge is well suited to grading "was this PR review good?" — give it the diff, the agent's findings, the actual human-merged PR, and a rubric, and let it use grep/file-read tools to verify each finding against the code. This is one of the most directly applicable techniques in the literature. Crucially, run the judge on a **different model family** than the Review stage to avoid self-preference. + +### 3.7 Replay testing and recorded traces + +Pattern (used by Braintrust, LangSmith, Arize Phoenix, Anthropic, Cognition): + +1. Capture every production run as a trace (inputs, all tool calls, all outputs, final artifact). +2. Tag interesting traces — failures, edge cases, customer escalations — into a regression set. +3. On every prompt / model / harness change, replay each trace: feed the same input, *but stub tool calls with the recorded outputs*, and observe whether the agent makes equivalent decisions. +4. Diff: were tool-call sequences equivalent? Did the final artifact differ? + +For QualOps: every PR review you ship is already a trajectory. Sample some, freeze them, and you have a regression suite that tracks model and prompt drift better than any synthetic benchmark. AgentRR (arXiv 2505.17716) formalizes the pattern. 
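
Steps 3 and 4 of the replay pattern above can be sketched in a few lines. This is an illustration under stated assumptions, not the API of any product named in this section: `RecordedCall`, `ReplayStubs`, `replay`, the trace shape, and the two toy agents are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


def call_key(tool: str, args: dict) -> str:
    """Canonical identity of a tool call, independent of argument order."""
    return json.dumps({"tool": tool, "args": args}, sort_keys=True)


@dataclass
class RecordedCall:
    """One tool call captured in a trace (hypothetical trace shape)."""
    tool: str
    args: dict
    output: Any


class ReplayStubs:
    """Answers tool calls from the recording instead of running real tools."""

    def __init__(self, recorded: list[RecordedCall]):
        self._outputs = {call_key(c.tool, c.args): c.output for c in recorded}
        self.observed: list[str] = []

    def call(self, tool: str, **args: Any) -> Any:
        key = call_key(tool, args)
        self.observed.append(key)
        # An unrecorded call means the agent diverged; return an error payload
        # so the run can finish and the divergence shows up in the diff.
        return self._outputs.get(key, {"error": f"no recording for {tool}"})


def replay(agent: Callable, trace_input: Any,
           recorded: list[RecordedCall], recorded_artifact: Any) -> dict:
    """Re-run `agent` on a frozen trace and diff its behaviour (steps 3-4)."""
    stubs = ReplayStubs(recorded)
    artifact = agent(trace_input, stubs.call)
    expected = [call_key(c.tool, c.args) for c in recorded]
    return {
        "same_tool_sequence": stubs.observed == expected,
        "same_artifact": artifact == recorded_artifact,
        "unexpected_calls": [k for k in stubs.observed if k not in expected],
    }


# Toy agents standing in for two prompt versions of the same stage.
def agent_v1(diff: str, call_tool: Callable) -> str:
    source = call_tool("read_file", path="src/foo.py")
    return f"reviewed {len(source)} chars"


def agent_v2(diff: str, call_tool: Callable) -> str:
    call_tool("grep", pattern="foo")  # new behaviour not present in the trace
    return "reviewed 0 chars"


trace = [RecordedCall("read_file", {"path": "src/foo.py"}, "def foo(): pass")]
unchanged = replay(agent_v1, "some diff", trace, "reviewed 15 chars")
drifted = replay(agent_v2, "some diff", trace, "reviewed 15 chars")
```

Exact sequence equality is the strictest possible comparison and will flag equivalent-but-different paths, the rigidity pitfall described in §3.3. Where call order does not matter, comparing the calls as a set is probably the better default.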
+ +### 3.8 Decision guide: situation → technique + +| Situation | Use this technique | +|---|---| +| Single-call function selection | BFCL-style **AST match** + **argument F1** | +| Multi-step deterministic workflow | **Trajectory in-order match** + **state-based eval** | +| Parallel tool calls in one turn | **Set-equality** match (bag of calls) | +| Tool arguments are free-form text | **Action similarity** (embedding or LLM-judge) | +| Side-effecting tools | **State-based eval with collateral-damage check** | +| Output is a code patch | **SWE-bench harness**: apply patch + FAIL_TO_PASS + PASS_TO_PASS | +| Output is open-ended text | **Agent-as-judge** with structured rubric | +| Detect hallucinated tools | **Schema validation** + tool-name whitelist | +| Detect missed tools | **Recall** against reference trajectory | +| Variance/reliability | **pass^k** with k≥5; report mean + 95% CI | +| Catching prompt/model regressions | **Recorded-trace replay** with tool stubs | +| Long-horizon multi-stage agent (QualOps) | **Hybrid**: per-stage tool-call F1 + per-stage state checks + end-to-end outcome + agent-as-judge on final report | + +--- + +## 4. The framework landscape + +The eval / observability tooling ecosystem moved fast through 2025–2026. The major shifts since early 2025: **OpenAI acquired Promptfoo in March 2026** (MIT license preserved), **Langfuse landed observation-level LLM-as-judge** in February 2026, the **Claude Agent SDK** (formerly Claude Code SDK) became the default Anthropic agent harness, and **Inspect AI** (UK AISI) reached production-grade adoption inside frontier labs. The recommendations below reflect the May 2026 state. + +### 4.1 The shortlist for QualOps + +For a small team running CI-gated tool-calling agents in a Node/TS codebase already on Langfuse, five tools matter: + +1. **Langfuse** *(incumbent — keep)*. MIT, self-host, observation-level LLM-as-judge (Feb 2026), boolean/categorical scoring. Production references include Canva. 
Strong TS + Python parity. +2. **Promptfoo** *(add as CI gate)*. MIT, OpenAI-acquired but license preserved, first-class GitHub Action with PR-comment diffs, Claude Agent SDK provider. Lowest-effort per-PR gating in the space. +3. **Inspect AI** *(add for nightly capability evals)*. MIT, used internally by Anthropic, DeepMind, Grok. Agent Bridge wraps the QualOps agent without modifying it. Python-only is fine for nightly. +4. **LangSmith** *(only if a wall is hit)*. Best out-of-the-box trajectory primitives via `agentevals`. Closed-source; per-trace pricing penalizes verbose agentic apps. +5. **Braintrust** *(only if non-engineers must contribute test cases)*. Notion's reference deployment is real; the diff UI is polished. Closed-source; hybrid-only self-host. + +We deliberately exclude DeepEval, RAGAS, OpenAI Evals API, Phoenix, W&B Weave, MLflow, Patronus, AgentOps, Helicone, and LangWatch from the shortlist for QualOps's profile — not because they are bad, but because they don't out-perform the shortlist on the dimensions that matter to a small TS/Node team in CI on Claude. The full landscape is in `sources/02-frameworks.md`. + +### 4.2 Comparison matrix + +Legend: ✓ = yes / strong, ≈ = partial / caveat, ✗ = no / weak. 
+ +| Framework | OSS | Self-host | Trajectory eval | Tool-call scoring | LLM-judge built in | Online prod eval | CI integration | TS-native | Py-native | Free tier | +|---|---|---|---|---|---|---|---|---|---|---| +| **Langfuse** | ✓ MIT | ✓ free | ≈ DIY | ✓ (Feb 26 obs-level) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ 50k units/mo | +| **LangSmith** | ✗ | ✓ Plus+ | ✓ `agentevals` | ✓ | ✓ 30+ templates | ✓ | ✓ | ✓ | ✓ | ✓ 5k traces/mo | +| **DeepEval / Confident AI** | ✓ Apache | ≈ lib only | ✓ (Py) | ✓ ToolCorrectness | ✓ G-Eval, DAG | ✓ (CAI) | ✓ pytest | ≈ thin | ✓ | ✓ | +| **Braintrust** | ✗ | ≈ enterprise | ≈ DIY | ✓ in UI | ✓ | ✓ | ✓ Action | ✓ | ✓ | ✓ 1M spans/mo | +| **OpenAI Evals API** | ✓ repo | ≈ | ≈ DIY | ≈ DIY | ✓ model graders | ≈ | ≈ | ≈ via SDK | ✓ | ≈ paid API | +| **Anthropic Console** | ≈ SDK | ✗ | ✗ | ✗ inspect only | ✗ | ✗ | ✗ | ✓ | ✓ | with API | +| **Phoenix / Arize AX** | ✓ Apache | ✓ | ≈ DIY | ✓ OpenInference | ✓ | ✓ AX | ✓ | ✓ | ✓ | ✓ | +| **W&B Weave** | ✓ SDK | ≈ enterprise | ≈ | ≈ | ✓ | ≈ | ✓ | ≈ | ✓ | ✓ | +| **MLflow GenAI** | ✓ Apache | ✓ | ≈ | ≈ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | +| **Promptfoo** | ✓ MIT | ✓ | ≈ custom | ≈ Claude SDK | ✓ llm-rubric | ✗ | ✓ best-in-class | ✓ | ✓ | ✓ | +| **Inspect AI** | ✓ MIT | ✓ | ✓ sandbox | ✓ MCP, built-in | ✓ | ✗ | ≈ custom | ✗ | ✓ | ✓ | +| **Patronus** | ✗ | ✗ | ✗ judge only | ≈ | ✓ specialized | ✓ | ≈ | ✓ | ✓ | sales | + +### 4.3 What each fits + +| Team profile | Recommended primary | Add-ons | +|---|---|---| +| **Small team, CI-gated, Node/TS, Claude (= QualOps)** | **Langfuse** | + Promptfoo (CI) + Inspect AI (nightly) | +| Small/mid team, Python-only, RAG-heavy | DeepEval + RAGAS | + Phoenix or Langfuse | +| Large org, many agents, dedicated SREs | Braintrust (product) + Phoenix/Arize AX (platform) | — | +| LangChain / LangGraph shop | LangSmith | — | +| Frontier-lab / safety org | Inspect AI | + custom storage | +| OpenAI-only shop | OpenAI Evals API + Promptfoo | — | +| Already on W&B for ML | W&B Weave | + RAGAS / 
DeepEval metrics |
+
+### 4.4 CI integration patterns
+
+The cleanest CI pattern for QualOps is **two-tier**:
+
+1. **Per-PR (fast tier, ~3–5 min)** — Promptfoo YAML with ~30 small assertions on the output of each pipeline stage; runs as a required GitHub check. Posts a PR comment with the diff vs. main.
+2. **Nightly / weekly (slow tier, ~30–60 min)** — Langfuse experiment over a 100–200 item dataset, running the full pipeline, with LLM-as-judge scorers on the final report and tool-call F1 scorers on each stage. Plus a quarterly **Inspect AI** capability eval against held-out fixture repos.
+
+This gives developers fast PR feedback and the team a slower, deeper truth.
+
+### 4.5 Real-world reference deployments
+
+- **Canva** — Langfuse production reference for AI design features.
+- **Notion** — Braintrust deployment, 70 AI engineers, 10× increase in caught issues per day going from JSONL files to Braintrust workflows.
+- **Stripe, Vercel, Zapier, Airtable** — Braintrust customers per their marketing.
+- **Etsy, Gamma** — Patronus AI case studies for multimodal LLM-judge.
+- **Anthropic, DeepMind, Grok** — Inspect AI users (per UK AISI announcement and Hamel Husain's notes).
+- **OpenAI and Anthropic** — both ship Promptfoo as part of their internal eval pipelines (per Promptfoo's GitHub README).
+
+---
+
+## 5. Systematically improving agents (offline)
+
+Once an evaluation harness exists, the bottleneck for agent quality is not "more model" — it is the **discipline of systematic improvement**. Out of scope for QualOps: online RLHF or continuous self-tuning in production. In scope: structured offline iteration between releases.
+
+### 5.1 The eval-driven improvement loop
+
+```mermaid
+flowchart TD
+    A[Production / staging traces] --> B[Sample failures + passes]
+    B --> C[Open coding<br/>free-text notes]
+    C --> D[Axial coding<br/>cluster into taxonomy]
+    D --> E[Prioritize by<br/>frequency × severity × fixability]
+    E --> F{Pick top<br/>bucket}
+    F --> G[Hypothesize fix:<br/>prompt? context? tool?<br/>sub-agent? model? SFT?]
+    G --> H[Implement smallest<br/>change that could fix it]
+    H --> I[Run eval set<br/>regression + targeted]
+    I --> J{Delta positive?<br/>No regression?}
+    J -- No --> K[Discard or refine]
+    K --> G
+    J -- Yes --> L[Promote prompt/skill version]
+    L --> M[Ship behind gate]
+    M --> N[Collect new traces]
+    N --> A
+```
+
+Four properties worth preserving:
+
+1. **Failures are routed back to the eval set**, not just fixed. Otherwise the regression suite never grows.
+2. **Hypothesis is logged separately from the diff.** "I changed the system prompt because X" matters when the next eval shows a regression six weeks later.
+3. **One change at a time.** Multivariate changes make the delta unattributable and stall debugging.
+4. **The eval set is versioned with the code.** A score is a meaningful claim only when it names both versions, for example v3 of the eval set against v3 of the prompt.
+
+### 5.2 Error analysis: open coding → axial coding → frequency
+
+Error analysis is the single most cited improvement technique in the modern eval literature. The method, popularized by Hamel Husain, Eugene Yan, and Shreya Shankar, borrows from grounded-theory qualitative research:
+
+- **Pass 1 — Open coding (bottom-up).** Sit with raw traces. For each failed example (and a sample of passing ones), write a free-text note describing what went wrong. Critically, do not pre-define categories. Husain emphasizes that top-down taxonomies — "this is a hallucination", "this is a refusal" — bias annotators toward generic ML categories that miss domain-specific failure modes. Bottom-up coding at NurtureBoss surfaced "date handling" as the dominant failure class; fixing it lifted that subtask from 33% to 95% accuracy.
+- **Pass 2 — Axial coding.** Group the open-coded notes into a small set of error categories ("axes"). Run an LLM-assisted clustering pass over the notes, then have a human review the proposed taxonomy. Output: an error taxonomy of typically 5–15 categories with frequency counts.
+- **Frequency-weighted prioritization.** Rank by `frequency × severity × fixability`, where severity is the business cost of the failure. Spend engineering effort top-down.
The classic mistake is fixing rare-but-vivid failures because they are easier to remember.
+
+A QualOps-shaped taxonomy might look like:
+
+| ID | Category | Stage | Frequency | Severity | Fixability | Priority |
+|---|---|---|---|---|---|---|
+| E1 | False positive on idiomatic style | Review | 32% | Low | High | P1 |
+| E2 | Missed null-deref across files | Analyze | 18% | High | Medium | P1 |
+| E3 | Fix proposed wrong import | Fix | 11% | Medium | High | P2 |
+| E4 | Judge rated harmless nit as "blocker" | Judge | 9% | Medium | High | P2 |
+| E5 | Refused on large diff | Analyze | 4% | High | Low | P3 |
+
+Each row produces (a) a deterministic regression test, (b) a candidate fix hypothesis, (c) optionally an eval-set sample addition.
+
+### 5.3 The hierarchy of fixes
+
+When a bucket is picked, work the cheapest plausible fix first:
+
+```mermaid
+flowchart TD
+    Start([Eval reveals failure]) --> Q1{What's the<br/>error type?}
+
+    Q1 -->|Format / schema<br/>violation| F1[Tighten output format<br/>in prompt; add example]
+    Q1 -->|Misunderstood<br/>instruction| F2[Restructure prompt:<br/>role/task/format/guardrails]
+    Q1 -->|Missing domain<br/>knowledge| F3[Add Skill or<br/>retrieval tool]
+    Q1 -->|Hallucinated<br/>fact / API| F4[Add citation requirement<br/>+ retrieval tool]
+    Q1 -->|Wrong tool used /<br/>tool confusion| F5[Tool design: names,<br/>descriptions, consolidate]
+    Q1 -->|Tool returned ok,<br/>agent ignored result| F6[Tool description +<br/>output summarization]
+    Q1 -->|Reasoning chain broke| F7[Extended thinking<br/>or sub-agent split]
+    Q1 -->|Style / over-flagging| F8[Few-shot mining:<br/>contrastive don't-flag]
+    Q1 -->|Cross-file blindness| F9[Code-graph tool;<br/>orchestrator-worker]
+    Q1 -->|Long-context recall| F10[Context engineering:<br/>prune, dynamic load]
+    Q1 -->|Judge miscalibrated| F11[Refresh judge<br/>calibration; verifier model]
+    Q1 -->|Plateau across many<br/>error types| F12[Prompt optimizer<br/>DSPy / AdalFlow]
+    Q1 -->|Latency/cost plateau| F13[Routing or<br/>distillation SFT]
+    Q1 -->|Persistent + structured<br/>+ many examples| F14[DPO from preference<br/>pairs; otherwise SFT]
+```
+
+The order of operations in 2026, from cheapest to most expensive, is:
+
+1. Better prompt structure / restructure.
+2. Better few-shot examples (especially contrastive negatives).
+3. Better tools / context / Skills.
+4. Sub-agent decomposition.
+5. Automated prompt optimization (DSPy/MIPROv2 or AdalFlow).
+6. Routing or model swap.
+7. Distillation SFT or DPO from preference pairs.
+
+The vast majority of agent-quality wins come before fine-tuning is required.
+
+### 5.4 Prompt engineering as iteration discipline
+
+The era of "prompt is whatever string is in `messages[0]`" is over. Anthropic's own guides for Claude 4.x ("Prompting best practices", "Effective context engineering for AI agents") and OpenAI's GPT-5 prompt cookbook converge on the same skeleton:
+
+```
+[Role] You are .
+[Task] Goal in 1–2 sentences.
+[Context] Static background, dynamic retrieval, tool surface.
+[Examples] Few-shot, ideally diverse and including hard cases.
+[Format] Output schema (XML / JSON / markdown sections).
+[Guardrails] Out-of-scope behaviors, refusal triggers, escalation.
+```
+
+For Claude specifically: XML-tag structuring (``, ``, ``), tool definitions in the system message, instructions in the user turn, "think step by step" or extended thinking for multi-stage tasks.
+
+**Prompt-as-code** principles:
+
+- Prompts live in the repo, not a UI. Every change is a PR.
+- Each prompt version gets a content hash; logs reference it.
+- Promotion gates: dev → staging → prod, with eval thresholds at each boundary.
+- Full execution context is versioned: prompt + model + temperature + tool list + retrieval config. A prompt that worked on Sonnet 4.0 may regress on 4.7.
+- Two-axis A/B: by prompt version and by traffic slice.
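
The content-hash and versioned-context principles above can be combined into one identifier. The following is a minimal sketch, assuming the execution context fits in a JSON-serializable dict; `prompt_version_id`, the field names, and the model strings are illustrative placeholders rather than anything QualOps currently ships.

```python
import hashlib
import json


def prompt_version_id(prompt: str, model: str, temperature: float,
                      tools: list[str], retrieval: dict) -> str:
    """Short content hash over the full execution context of a prompt."""
    context = {
        "prompt": prompt,
        "model": model,
        "temperature": temperature,
        # Assumption: tool order is not significant, so sort for stability.
        "tools": sorted(tools),
        "retrieval": retrieval,
    }
    # Canonical JSON (sorted keys) so equal contexts always hash equally.
    canonical = json.dumps(context, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


base = dict(
    prompt="You are a code reviewer.",
    model="example-model-a",  # placeholder, not a real model ID
    temperature=0.0,
    tools=["read_file", "grep"],
    retrieval={"top_k": 5},
)
v1 = prompt_version_id(**base)
v1_reordered = prompt_version_id(**{**base, "tools": ["grep", "read_file"]})
v2 = prompt_version_id(**{**base, "model": "example-model-b"})
```

Here `v1` and `v1_reordered` are identical while `v2` differs, so a model swap shows up as a new version even when the prompt text is untouched. If tool order is something you want to track, drop the `sorted` call.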
+ +### 5.5 Automated prompt optimization + +A menu, not a stack — pick what fits the stage: + +| Tool | Mechanism | When it shines | When it fails | +|---|---|---|---| +| **APE** (2022) | LLM proposes candidates, scored on held-out set | Single-step tasks, clear metric | Multi-stage agents; metric noise | +| **OPRO** (2023) | LLM shown previous prompts + scores, asked to write a better one | Math / reasoning, binary metric | Long prompts; tool-using agents | +| **Promptbreeder** (2023) | Evolutionary, mutates task prompts AND mutation prompts | Cheap, plentiful evals | Cost; doesn't optimize tool use | +| **DSPy / MIPROv2** (2024) | "Programs not prompts": joint optimization of instructions + few-shot via Bayesian search | Multi-stage pipelines (perfect for QualOps) | Requires writing pipeline in DSPy idioms | +| **TextGrad** (2024) | "Backpropagation through text": LLM-generated textual feedback as gradient | Composable systems with critic-able output | Setting up textual loss; cost | +| **AdalFlow** (2024) | PyTorch-style auto-diff over LLM workflows; combines TextGrad + DSPy bootstrapping | Single library covering both directions | Newer, smaller community than DSPy | +| **SAMMO** (Microsoft 2024) | Structural mutation operators over function-graph prompts | Long structured prompts (manuals, policies) | Tasks needing example mining more than surgery | + +Practical guidance for QualOps: + +- For the **Review** and **Judge** stages — both narrow, scorable on labeled PRs — DSPy/MIPROv2 is the best fit. +- For the **Fix** stage, where the output is code and "correctness" requires running tests, AdalFlow + TextGrad-style textual feedback over `tests pass / fail / lint` is the better mental model. +- All of these need a *cheap, fast metric*. Build it before reaching for an optimizer. + +### 5.6 Few-shot mining: the under-used lever + +Three lessons from the 2024–25 literature: + +1. 
**Quality dominates quantity.** A handful of well-chosen demonstrations beat dozens of mediocre ones. A single noisy example can reduce accuracy. +2. **Diversity matters more than similarity.** Three near-duplicates teach less than three diverse, on-topic examples. +3. **Dynamic retrieval > static set** for heterogeneous inputs. + +Mining recipe for QualOps: + +1. Take the labeled error taxonomy from §5.2. +2. For each high-priority bucket, sample 2–3 *clean* fixed examples — input PR, ideal review comments, ideal Fix output. These become "canonical" few-shots. +3. Build a vector index over these examples keyed by diff features (language, file types, lines changed, presence of tests). +4. At inference, retrieve top-k examples and inject them. +5. **Add contrastive examples**: pairs of `(borderline diff, correct minimal review)` so the agent learns where to *not* comment. Code-review agents over-flag by default; contrastive negatives are the cure. +6. Recompute the index when the eval set grows. + +### 5.7 Tool design: the most under-appreciated lever + +Anthropic's *Writing tools for agents* is the canonical reference. The highlights: + +- **Naming**: namespace by service (`github_list_prs`, not `list_prs`); verb_object form. +- **Description is a prompt.** It is read by the model on every call. Explicit about *when* to use, what inputs are valid, what outputs to expect, what the tool will NOT do. +- **Schema with examples.** For complex inputs, an `input_examples` field beats prose. +- **Consolidate, don't proliferate.** Fewer, more capable tools beat many narrow ones. A `code_search(query, kind)` is better than four `find_function`, `find_class`, `find_import`, `find_callsite` tools — the model gets confused choosing among overlapping options. +- **Return high-signal text.** Stable identifiers beat opaque internal IDs. Pruned/summarized output beats firehose dumps; agents pay tokens to read tool returns. 
+- **Error messages are pedagogy.** "Error 422" teaches the agent nothing. "Error: file path must be relative to repo root; you passed an absolute path; try `src/foo.py`." enables self-correction. +- **Observable side effects.** If a tool mutates state, return the new state in the response. + +A QualOps tool-surface audit checklist: + +- [ ] Each tool's description has "use this when" and "do NOT use this when". +- [ ] No two tools overlap without explicit disambiguation. +- [ ] Error returns are actionable text, not numeric codes. +- [ ] Tool count per stage ≤ 7 (the empirical comfort zone for Claude Sonnet/Opus). +- [ ] Long outputs are paginated, not truncated mid-token. +- [ ] One canonical "search the code graph" tool, not five. + +### 5.8 Context engineering: curate, don't accumulate + +"Context engineering is the delicate art and science of filling the context window with just the right information for the next step" — Andrej Karpathy. + +The big findings: + +- **Context rot.** Recall and reasoning degrade as token count grows, well before the nominal limit. Larger context windows are not free. +- **Lost-in-the-middle** (Liu et al., Stanford). Performance is U-shaped: instructions at the very start or very end win; buried middle loses. Critical guardrails should be at one of the ends. +- **Instruction hierarchy.** When system, user, and tool-output instructions conflict, models tend to follow the most recent and most concretely worded. Conflict-free design beats stacking. + +Tactics: + +- **Curate, don't accumulate.** Strip stale tool outputs, archived plans, unused docs. +- **Dynamic instruction loading** (= Skills): inject language-specific or domain-specific rules only when the input matches. +- **Retrieval beats stuffing.** A 2K-token retrieved excerpt of the right file beats a 50K dump. +- **Plan persistence.** Long agent runs benefit from an explicit `plan.md`-style scratchpad refreshed each turn. + +For QualOps: never feed the entire repo. 
Feed (a) the diff, (b) the immediate symbol context (callers/callees of touched symbols), (c) project conventions for the relevant language, and nothing else. Add a tool the agent can call when it needs more. + +### 5.9 Sub-agent decomposition and Skills + +Anthropic's *Building Effective Agents* defines the patterns to reach for *before* assuming a fully autonomous agent is needed: + +- **Prompt chaining** — fixed pipeline of LLM calls. (QualOps's Analyze → Review → Fix → Report is one.) +- **Routing** — classifier sends the request to a specialist prompt. +- **Parallelization** — same input to N specialists, voted/aggregated. +- **Orchestrator-workers** — central LLM dispatches dynamically; subtasks not predeterminable. Coding tasks specifically. +- **Evaluator-optimizer** — generator + critic loop until acceptance criterion met. + +When to split a monolithic prompt: + +- One prompt is doing two qualitatively different jobs and your error taxonomy shows error types from both. +- Required tools differ across phases. +- One phase needs a stronger model than another. +- Prompt has crossed ~3–5K tokens and instructions are starting to interfere. + +Anthropic's **Skills** mechanism is filesystem-based, on-demand context bundles loaded only when relevant. Pattern for QualOps: + +- Each language (TS, Python, Go, Rust) is a Skill containing language-specific review heuristics. +- Each error-taxonomy bucket with stable rules can become a Skill. +- The Judge is a sub-agent with a tighter system prompt and only the report-shaped tools. + +Caveat: orchestrator-worker architectures use **10–15× more tokens** than a single agent. Reach for them when the accuracy gain justifies the cost. + +### 5.10 Routing and model selection + +The economics are dramatic: industry routers report 30–85% cost reduction with quality flat or slightly improved. Three strategies: + +- **Predictive (offline classifier).** A small model or feature classifier picks a tier. 
Fast, cheap, training data needed. +- **Cascading.** Try Haiku first; if low-confidence or fail, escalate to Sonnet, then Opus. No training; latency penalty when escalation triggers. +- **Mixture-of-agents.** Multiple models answer in parallel; aggregator synthesizes. Highest quality, highest cost. + +For QualOps, a reasonable default cascade: + +1. **Diff size / language detector** (no LLM) — small CSS/text changes go to Haiku-tier; backend logic to Sonnet; cross-cutting refactors and security-sensitive paths to Opus. +2. **Confidence escape hatch** — if Judge rejects with "ambiguous", upgrade and re-run Review. +3. **Per-stage routing** — Analyze and Report can be smaller models; Review and Fix usually want strong; Judge wants strong (or at least *different*). + +This pairs cleanly with prompt-as-code: each prompt version pins its model, and routing is "which version-id do we use for this input." + +### 5.11 Reflection patterns: when they help, when they hurt + +The core papers — *Self-Refine*, *Reflexion*, *CRITIC* — established that having the model critique its own output and try again improves accuracy on a wide range of tasks without weight updates. But: + +**Reflection is net positive when:** + +- The metric is expensive to compute by humans but cheap for an LLM judge. +- Errors are recognizable after the fact (the agent often spots its own mistake when prompted). +- You can afford ~2× tokens per task. + +**Reflection is a trap when:** + +- The error mode is "confidently wrong with no internal signal" — the critic agrees with the bad output. +- Per-call latency matters more than quality. +- The critic is the same model with the same context — same blind spots. + +Concrete recipes for QualOps: + +- **Judge as critic.** The Judge stage is already an evaluator-optimizer pattern. Make it explicit: Judge can return `accept` / `reject_with_reason`, and on reject re-run Review (bounded retries: 2 max). 
+- **Verifier model trick.** Run Fix with Sonnet, Judge with Opus. Different models reduce shared blindspots. +- **Test execution as ground-truth critic.** For Fix outputs, the actual unit-test result is the highest-quality verifier you will ever have. Use it. + +### 5.12 Fine-tuning, distillation, DPO + +The order of operations: prompts → few-shot → tools → context → sub-agents → optimizers → *then* fine-tuning. When SFT pays off (offline only — in scope): + +- The task has a stable shape, you have ≥1K labeled examples, prompt iteration has plateaued. +- Latency matters and you want to compress an Opus-level prompt into a smaller fine-tuned Sonnet/Haiku. +- You want to teach a tool-use *trajectory pattern*, not just a knowledge cut. + +Two relevant offline techniques: + +- **Distillation from agent traces.** Run the strong agent on a curated set, record (input, plan, tool calls, output), and SFT a smaller model. Recent work (Structured Agent Distillation, 2025) preserves >90% of teacher quality at <20% the cost on narrow domains. +- **DPO / KTO from preference pairs.** From your eval set you have `(input, accepted_output, rejected_output)` triples — exactly the DPO format. KTO works with thumbs-up/down rather than paired preferences. Both are offline and fit our policy. + +What to avoid: premature fine-tuning. The cost is real (training infra, drift eval, regression risk) and the gains often replicate cheaper prompt changes. + +### 5.13 The data flywheel + +Even though we ship fixed versions, *the next version* benefits from production traces: + +``` +production traces → sample → human-label (or LLM-pre-label + human review) + → error analysis → (eval set growth) + (few-shot mining) + (DPO pairs) + → next release +``` + +Practical suggestions: + +- **Trace everything.** Per stage: input, prompt version, tool calls, model version, output, judge verdict, downstream signal (was the comment dismissed by the human reviewer? was the fix merged?). 
+- **Stratified sampling.** Don't sample uniformly; over-sample low-confidence traces and traces where the judge disagreed with downstream human action.
+- **Decompose labels.** Holistic "is this good?" labels are noisy. Split into dimensions (correctness, severity, conciseness, style fit) and label each separately. Inter-rater agreement goes up.
+- **Few-shot mining loop.** Newly-labeled "exemplary" traces are first-class candidates for the dynamic few-shot index. Newly-labeled bad traces become eval-set additions and DPO negatives.
+
+---
+
+## 6. The QualOps Approach — generalized concept
+
+This is the synthesis: a concrete, opinionated approach to evaluating and improving QualOps (and other agentic projects with similar shape) based on everything in Parts 2–5.
+
+### 6.1 Architecture: evals integrated into the QualOps pipeline
+
+```mermaid
+flowchart LR
+    subgraph Pipeline["QualOps pipeline (per PR)"]
+        P0[PR diff] --> P1[Analyze]
+        P1 --> P2[Review]
+        P2 --> P3[Fix]
+        P3 --> P4[Report]
+        P4 --> P5[Judge]
+        P5 --> P6[CI status]
+    end
+
+    subgraph EvalLayer["Eval layer (CI-gated)"]
+        E1[Tool-call F1<br/>per stage]
+        E2[Schema validation<br/>+ guardrails]
+        E3[SWE-bench-style<br/>test harness]
+        E4[Agent-as-judge<br/>on Review/Report]
+        E5[pass^k reliability]
+    end
+
+    subgraph Storage["Trace + dataset storage"]
+        S1[(Langfuse<br/>traces, datasets,<br/>experiments)]
+    end
+
+    subgraph Improve["Offline improvement loop"]
+        I1[Sample + open-code<br/>failures]
+        I2[Axial code into<br/>taxonomy]
+        I3[Pick top bucket]
+        I4[Apply smallest fix]
+        I5[Re-run eval, gate]
+        I6[Promote prompt<br/>version + ship]
+    end
+
+    P1 -.spans.-> S1
+    P2 -.spans.-> S1
+    P3 -.spans.-> S1
+    P4 -.spans.-> S1
+    P5 -.spans.-> S1
+
+    S1 --> E1
+    S1 --> E2
+    S1 --> E3
+    S1 --> E4
+    S1 --> E5
+
+    E1 --> I1
+    E2 --> I1
+    E3 --> I1
+    E4 --> I1
+    E5 --> I1
+
+    I1 --> I2 --> I3 --> I4 --> I5 --> I6
+    I6 -.new prompt version.-> Pipeline
+```
+
+Three concerns are kept structurally separate: the **pipeline** (what runs in production), the **eval layer** (what scores it), and the **improvement loop** (what changes it between releases). All three are connected through the trace + dataset store, which is QualOps's existing Langfuse instance.
+
+### 6.2 Stage-by-stage eval matrix
+
+| Stage | Primary eval technique | Secondary | Reliability metric |
+|---|---|---|---|
+| **Analyze** | Tool-call F1 against expected `read_file`/`grep` set per fixture PR | Under-tooling rate; hallucinated-tool detection | pass^5 |
+| **Review** | Location precision/recall on flagged lines + finding-class accuracy | Agent-as-judge on textual quality | pass^5 |
+| **Fix** | SWE-bench-style harness: apply patch, FAIL_TO_PASS + PASS_TO_PASS | Linter / formatter delta; perf benchmark on regression-sensitive PRs | pass^3 |
+| **Report** | Schema validation on emitted JSON | LLM-judge on narrative coherence; faithfulness check (every claim has file:line) | pass^5 |
+| **Judge** | Agreement rate vs. 
held-out human labels | Calibration error (ECE) on severity | pass^5 | +| **End-to-end** | Composite score (weighted across stages) + human-rated hold-out | Agent-as-judge with cross-model setup | pass^5 + 95% CI | + +### 6.3 Tooling stack + +| Layer | Choice | Rationale | +|---|---|---| +| Trace + dataset store | **Langfuse** *(keep)* | MIT, self-host, observation-level evals, already wired in | +| Per-PR CI gate | **Promptfoo** *(add)* | Best-in-class GitHub Action, YAML config, Claude Agent SDK provider | +| Nightly capability eval | **Inspect AI** *(add)* | Used by Anthropic/DeepMind/Grok; Agent Bridge wraps QualOps unmodified | +| LLM judge | Claude Opus + GPT-5 cross-judge for Review/Report; Claude Sonnet for cheaper paths | Cross-model judging mitigates self-preference bias | +| Code-graph queries (improvement) | Internal index or [Sverklo](https://github.com/sverklo/sverklo)-style MCP server | For Greptile-style cross-file analysis when we add it | +| Statistics | Custom (numpy/scipy) — bootstrap CIs, McNemar | Anthropic's *Statistical Approach to Model Evals* methodology | + +We are deliberately **not migrating away from Langfuse** to LangSmith or Braintrust. The marginal UX wins do not justify a closed-source migration for a small team given Langfuse's current feature set. 
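
The pass^5 and pass^3 entries in the §6.2 matrix would live in the custom statistics layer listed in the table above. Below is a minimal sketch of the estimator, assuming the τ-bench definition: pass^k is the chance that k independent runs of the same task all succeed, estimated from n recorded runs with c successes as C(c, k) / C(n, k) and averaged over tasks. The function names and the sample data are illustrative.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate P(all k runs succeed) for one task from `trials` recorded runs."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(c, k) is 0 when c < k, so a single failure zeroes pass^n.
    return comb(successes, k) / comb(trials, k)


def suite_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over every task in the suite."""
    per_task = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(per_task) / len(per_task)


# Two fixture PRs, five runs each (made-up outcomes).
runs = {
    "pr-101": [True, True, True, True, True],
    "pr-102": [True, True, True, True, False],
}
pass_1 = suite_pass_hat_k(runs, k=1)  # ordinary mean success rate
pass_5 = suite_pass_hat_k(runs, k=5)  # share of tasks that never failed
```

On this toy data one flaky run out of ten barely moves pass^1 but halves pass^5, which is why the matrix tracks pass^k rather than a plain average.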
+ +### 6.4 Two-tier eval cadence + +```mermaid +gantt + title QualOps eval cadence + dateFormat HH:mm + axisFormat %M min + + section Per-PR (fast tier) + Promptfoo YAML asserts :a1, 00:00, 5m + Per-stage tool-call F1 :a2, after a1, 1m + Schema + guardrail asserts :a3, after a2, 1m + PR comment with diff vs main :a4, after a3, 1m + + section Nightly (slow tier) + Langfuse experiment full pipeline :b1, 00:00, 25m + LLM-as-judge per stage :b2, after b1, 10m + Agent-as-judge on final Report :b3, after b2, 15m + pass^k variance (5 reps) :b4, after b3, 20m + Slack + dashboard update :b5, after b4, 1m + + section Weekly (capability tier) + Inspect AI on held-out fixture set :c1, 00:00, 60m + SWE-bench-style on Fix stage :c2, after c1, 60m + Drift report :c3, after c2, 5m +``` + +**Per-PR (3–5 min, blocking):** Promptfoo YAML with ~30 assertions. Fast feedback for engineers. PR comment shows diff vs. main branch. + +**Nightly (~30–60 min, non-blocking with alerting):** Langfuse experiment over 100–200 item dataset, full pipeline, LLM-as-judge scorers, pass^5. Posts to Slack and writes to the Langfuse dashboard. + +**Weekly (~2 hours, non-blocking):** Inspect AI capability eval against held-out fixture repos. SWE-bench-style harness on Fix stage. Generates a drift report. + +### 6.5 Statistical discipline + +For every release, the eval layer must produce: + +- Mean ± 95% CI for the headline metric (composite end-to-end score). +- Per-stage mean ± std across pass^5 runs. +- Paired-comparison delta against the previous release (McNemar's test for binary; paired bootstrap for continuous). +- Variance decomposition: sampling, prompt, judge, data. + +Releases ship only if: + +- All deterministic regression assertions pass. +- The composite score is within 3% of baseline (or improved with stat-sig). +- No individual stage regresses by more than 5% with stat-sig. 
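The paired comparison required in 6.5 can be sketched with the standard library alone. This is a minimal illustration, not the QualOps implementation: the function names `mcnemar_exact_p` and `paired_bootstrap_ci` are hypothetical, and it assumes each fixture was run under both the previous release and the candidate, with results aligned item by item.

```python
import math
import random


def mcnemar_exact_p(baseline, candidate):
    """Two-sided exact McNemar p-value for paired pass/fail results."""
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(baseline, candidate) if x and not y)  # regressions
    c = sum(1 for x, y in zip(baseline, candidate) if not x and y)  # fixes
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-item delta (candidate - baseline) with a percentile bootstrap CI."""
    rng = random.Random(seed)
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    # Resample the per-item deltas, not the two score lists independently,
    # so the pairing between runs is preserved.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(deltas) / n, lo, hi
```

One possible reading of the release rule: treat a stage as regressed "with stat-sig" when the upper end of the bootstrap interval on its delta is below zero, and treat the composite as "within 3%" when the mean delta is no lower than -0.03 on a 0 to 1 scale. Whether the 3% is absolute or relative is not stated in the criteria, so that choice is an assumption.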
+ +### 6.6 Improvement cadence + +The team runs the improvement loop on a regular schedule: + +| Cadence | Activity | +|---|---| +| **Continuous** | Trace every production PR; LLM pre-label; human review of low-confidence and dismissed-by-reviewer cases | +| **Weekly (1 day)** | Pick the top error bucket from the latest taxonomy; implement smallest fix; gate eval; ship if green | +| **Monthly** | Re-cluster the error taxonomy from the last month's traces; refresh the few-shot index | +| **Quarterly** | Refresh judge calibration set against a fresh human-rated sample (50–100 traces); refresh hold-out fixture repos for capability evals | +| **Per major version** | Re-run end-to-end against the entire historical eval set; publish a release report with score deltas | + +### 6.7 The error taxonomy template (starting point) + +Initialise from `sources/04-improvement.md` §1.4 and refine after the first 30–50 traces: + +| Bucket | Description | Likely first fix | +|---|---|---| +| Format / schema | Output didn't match required structure | Tighten format + add example in prompt | +| Misunderstood instruction | Agent did the wrong thing despite clear request | Restructure prompt; explicit guardrails | +| Missing domain knowledge | Agent didn't know a project convention | Skill or retrieval tool | +| Hallucinated finding | Agent flagged something not in the diff | Citation requirement + retrieval | +| Wrong tool / tool confusion | Agent picked the wrong tool | Tool design: names, descriptions | +| Tool result ignored | Tool returned data, agent answered as if it didn't | Tool description + output summarization | +| Reasoning chain broke | Agent lost track over many turns | Extended thinking or sub-agent split | +| Style / over-flagging | Too many low-value findings | Few-shot mining: contrastive don't-flag | +| Cross-file blindness | Missed a bug requiring multi-file context | Code-graph tool; orchestrator-worker | +| Long-context recall | Forgot something stated earlier | 
Context engineering | +| Judge miscalibrated | Severity inflation or deflation | Refresh calibration; verifier model | +| Plateau | Many error types, no single fix | Prompt optimizer | +| Latency / cost | Quality OK but too slow / expensive | Routing or distillation SFT | + +### 6.8 Generalizing to other projects + +The same approach generalizes to other tool-calling / workflow agentic projects. The pattern, abstracted from QualOps: + +1. Define stages and the artifact each produces. +2. Pick a primary eval technique per stage based on artifact shape (deterministic check → schema + tests; structured fields → field-level F1; open text → LLM-judge or agent-as-judge). +3. Build a 50–200 item curated golden set from real production traces. +4. Wire a per-stage tracing layer (Langfuse-equivalent) with consistent span semantics (OpenInference is the OTEL standard). +5. Build a two-tier eval: fast deterministic gate per change; slow LLM-judge / capability eval per night or week. +6. Run the error analysis loop weekly. Maintain the taxonomy as a living document. +7. Promote prompts and skills as versioned code; gate releases on paired statistical comparison. + +What changes from project to project is the artifact (code patch vs. SQL query vs. customer email), the deterministic checks (`pytest` vs. `EXPLAIN` vs. spam classifier), and the rubric for the LLM-judge. The skeleton stays. + +--- + +## 7. 
Prerequisites and adoption roadmap + +### 7.1 Prerequisites + +Before adopting the approach in Part 6, QualOps needs: + +| Prerequisite | Status today | Action | +|---|---|---| +| Trace storage with span / observation primitives | **Have** (Langfuse) | None | +| Per-stage tracing in code (Analyze/Review/Fix/Report/Judge as distinct spans) | **Mostly have** | Audit; ensure consistent span names + attributes | +| Versioned prompts in repo (not in UI) | **Have** (`evals/qualopsrc/`) | None | +| Prompt-as-code promotion infra (content hashes, dev → staging → prod gates) | **Partial** (presets exist; gating infra implicit) | Add explicit promotion workflow + version pinning | +| A starter golden set of real PRs with labels | **Partial** (CRB datasets exist, internal labels TBD) | Label 50 internal PRs with finding-level + fix-level annotations | +| **Held-out / contamination control** (split management, fresh fixtures) | **Don't have** | Stand up split policy + lock evaluation set per release; rotate fixtures using SWE-bench Live monthly drops | +| LLM-as-judge wiring with binary rubrics | **Have** (Judge stage) | Add cross-model judge variant | +| **Cross-model judge access (GPT-5 + Claude Opus)** | **Don't have** | Procure GPT-5 API credentials, budget headroom for cross-model judging (~$500/mo at planned volume) | +| **Ongoing human calibration label capacity** (50–100 traces/quarter) | **Don't have** | Designate annotators (rotation across reviewers); lightweight labeling tooling | +| CI runner with secrets for LLM API calls | **Have** | None | +| Per-stage tool-call F1 scorer | **Don't have** | Implement (~1 week) | +| SWE-bench-style harness for Fix stage (using the methodology, not the deprecated dataset) | **Don't have** | Implement (~2 weeks); seed from SWE-bench Live + internal PRs | +| Agent-as-judge on Report stage | **Don't have** | Implement (~1 week) | +| Promptfoo per-PR gate | **Don't have** | Add Promptfoo + GitHub Action (~3 days) | +| Inspect AI 
nightly / weekly | **Don't have** | Add Inspect AI + Agent Bridge (~1 week) | +| Statistical comparison framework | **Don't have** | Adopt Anthropic's `statistical-approach-to-model-evals` recipe (~3 days) | +| Ownership: who runs the eval loop? | TBD | Designate a part-time eval lead | + +### 7.2 Adoption roadmap + +A phased rollout, ~3 months end-to-end: + +```mermaid +gantt + title QualOps eval program — phased rollout + dateFormat YYYY-MM-DD + axisFormat %b %d + + section Phase 1 — Foundations (4 weeks) + Audit per-stage tracing :p1a, 2026-05-12, 5d + Label 50 internal PRs :p1b, after p1a, 10d + Implement tool-call F1 scorer :p1c, after p1a, 7d + Implement schema validators :p1d, after p1c, 3d + Statistical comparison helpers :p1e, after p1d, 3d + + section Phase 2 — CI gate (3 weeks) + Add Promptfoo + GitHub Action :p2a, after p1e, 5d + Author 30 per-stage assertions :p2b, after p2a, 5d + Wire PR-comment diff :p2c, after p2b, 2d + Soft-gate dry run, then enforce :p2d, after p2c, 5d + + section Phase 3 — Fix harness + judge (4 weeks) + SWE-bench-style harness for Fix :p3a, after p2d, 10d + Mine SWE-bench Verified for code-quality cases :p3b, after p3a, 5d + Agent-as-judge on Report :p3c, after p3b, 7d + Cross-model judge wiring :p3d, after p3c, 3d + + section Phase 4 — Capability eval (2 weeks) + Inspect AI + Agent Bridge :p4a, after p3d, 5d + Held-out fixture repo set :p4b, after p4a, 5d + Weekly drift dashboard :p4c, after p4b, 2d + + section Phase 5 — Improvement cadence (ongoing) + First error-analysis pass (30 traces) :p5a, after p4c, 5d + First taxonomy + priorities :p5b, after p5a, 3d + Weekly improvement loop :p5c, after p5b, 30d +``` + +Phase milestones: + +- **End of Phase 1**: per-stage tool-call F1 visible in Langfuse on every dataset run. +- **End of Phase 2**: every PR to QualOps's own repo has an automated Promptfoo gate that posts a comment with deltas. 
+- **End of Phase 3**: Fix stage is graded by a deterministic test harness; Report stage is graded by a cross-model agent-as-judge. +- **End of Phase 4**: weekly capability eval against held-out repos, with a drift report. +- **End of Phase 5 (rolling)**: weekly improvement loop ships measurable deltas. + +### 7.3 Effort and cost estimate + +| Phase | Engineering effort | Recurring cost (LLM tokens / month) | +|---|---|---| +| 1 — Foundations | ~3 weeks | ~$200 | +| 2 — CI gate | ~2 weeks | ~$300 (per-PR judge calls) | +| 3 — Fix harness + judge | ~3.5 weeks | ~$1,500 (test-running + cross-model judge) | +| 4 — Capability eval | ~1.5 weeks | ~$2,000 (weekly full pipeline × 100 fixtures) | +| 5 — Improvement loop | ~1 day/week ongoing | ~$500 (taxonomy regen, judge calibration) | + +Total: ~10 weeks of engineering effort spread across the program, plus ~$4,500/month in steady-state LLM costs. These numbers will move with model pricing. + +--- + +## 8. Risks, open questions, and what we left out + +### 8.1 Risks + +- **Eval set leakage.** If the same PRs feed both eval and training (when SFT is added), the score is meaningless. Mitigation: strict held-out splits; SWE-bench Live for fresh data. +- **Judge drift.** As both judge and judged models update, judge scores drift. Mitigation: refresh the human-labeled calibration set quarterly. +- **Reward hacking on the Fix harness.** A patch that monkey-patches `pytest` to skip tests, or `import sys; sys.exit(0)` on early exit. Mitigation: PASS_TO_PASS check; code-quality grader on the diff itself. +- **Overfitting prompts to the eval set.** Especially with prompt optimizers. Mitigation: held-out validation set; rotate the eval set periodically. +- **Cost overruns.** Weekly capability evals over 100+ fixtures add up. Mitigation: routing — use Haiku for the broad eval, escalate to Sonnet/Opus only for low-confidence cases; cap turn budgets. 
+- **Self-preference in same-model judge.** Mitigation: cross-model judge, ideally on a different family (Claude judging GPT, GPT judging Claude). + +### 8.2 Open questions + +- **Are LLM judges good enough as the *only* signal?** Zheng et al. suggest yes for chat (>80% agreement with human preference). Production teams hedge by combining judges with periodic human review. We follow the hedged approach. +- **Process supervision vs. outcome supervision** for training data. Process wins for math; how to label process at scale for fuzzy domains like code review remains open. We rely on outcome (test pass) where possible, agent-as-judge where not. +- **Benchmark validity.** Recent audits ([Zhu et al. 2025](https://arxiv.org/pdf/2507.02825)) show many published benchmarks have leakage, mis-graded items, or task-validity problems. Internal benchmarks should explicitly audit both outcome validity (the grader reports success iff the task was actually solved) and task validity (a task is solvable iff the agent has the target capability). +- **Calibration for tool-using agents specifically.** Most calibration work targets factual QA. The QualOps team may need to invent its own severity-calibration methodology, especially around the Judge stage. + +### 8.3 What we deliberately left out + +Per the brief, we excluded: + +- **Online RLHF / continuous self-tuning in production.** Out of scope; we deploy fixed versions. +- **Human evaluation infrastructure beyond a calibration set.** We assume human evaluation is a periodic activity, not a continuous one. Building a Mechanical Turk-style human-in-the-loop platform is a separate project. +- **Compliance / regulatory eval.** Some industries require formal audit trails (FDA, financial). QualOps doesn't currently target these markets; if it does, an additional eval layer will be needed. +- **Prompt-injection / jailbreak red-teaming at scale.** Promptfoo includes a basic red-team suite; full adversarial robustness is its own program. + +--- + +## 9. 
Appendix + +### 9.1 Glossary + +- **Agent-as-judge** — using an LLM with tool access (rather than a static LLM judge) to evaluate another agent's output. +- **AST match** — comparing tool calls structurally as parsed trees, allowing argument-order independence. +- **BFCL** — Berkeley Function-Calling Leaderboard. Tool-call accuracy benchmark. +- **Calibration / ECE** — how well a model's confidence matches its empirical accuracy. Expected calibration error (ECE) is the standard metric. +- **DPO / KTO** — Direct Preference Optimization / Kahneman-Tversky Optimization. Offline preference-learning techniques. +- **DSPy** — Stanford NLP framework treating LLM workflows as programs; MIPROv2 is its current optimizer. +- **FAIL_TO_PASS / PASS_TO_PASS** — SWE-bench's two test sets: tests that should pass after the fix, and tests that should still pass. +- **G-Eval** — LLM-judge methodology with chain-of-thought rubric prompting (Liu et al. 2023). +- **Golden trace / golden set** — curated reference traces or examples used as the eval baseline. +- **HELM** — Stanford's holistic evaluation framework. +- **Inspect AI** — UK AISI's research-grade Python eval framework. +- **LLM-as-judge** — using an LLM to score another LLM's output. Workhorse of modern eval. +- **MIPROv2** — DSPy's joint instruction + few-shot optimizer. +- **Open coding / axial coding** — qualitative-research method for building error taxonomies bottom-up. +- **OpenInference** — OTEL-based semantic conventions for LLM/agent traces. +- **pass@k vs pass^k** — succeed at least once in k trials vs. succeed every time in k trials. The latter is the reliability metric. +- **Process Reward Model (PRM)** — model that scores each reasoning step rather than only the final outcome. +- **Promptfoo** — MIT-licensed CLI/library for prompt and agent eval; OpenAI-acquired March 2026. +- **ReAct** — Reason + Act loop pattern for agents. 
+- **SWE-bench / SWE-bench Verified / SWE-bench Live** — code-agent benchmarks based on real GitHub issues + unit tests. +- **τ-bench** — Sierra's tool-agent-user multi-turn benchmark; introduced pass^k. +- **Trajectory** — ordered record of (state, action, observation) triples for an agent run. + +### 9.2 Where to read more + +The four research dossiers compiled for this report contain the full primary-source citations: + +- `sources/01-foundations.md` — Foundational concepts, taxonomy, lifecycle, LLM-as-judge, statistical rigor. +- `sources/02-frameworks.md` — Framework landscape: Langfuse, LangSmith, DeepEval, Braintrust, Promptfoo, Inspect AI, Phoenix, and others. +- `sources/03-toolcalling-and-trajectory.md` — Tool-call and trajectory eval; benchmarks (BFCL, τ-bench, SWE-bench, AppWorld); agent-as-judge; replay testing. +- `sources/04-improvement.md` — Error analysis, prompt optimization, few-shot mining, tool design, context engineering, sub-agent decomposition, fine-tuning. + +Each dossier ends with an annotated reference list of 30–50 primary sources. + +### 9.3 Top recommended reads (start here) + +If you read nothing else from the dossiers, read these: + +1. [Anthropic — *Demystifying evals for AI agents*](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) — the canonical Anthropic engineering blog on agent eval. +2. [Hamel Husain & Shreya Shankar — *LLM Evals: Everything You Need to Know*](https://hamel.dev/blog/posts/evals-faq/) — the practitioner playbook. +3. [Hamel Husain — *A Field Guide to Rapidly Improving AI Products*](https://hamel.dev/blog/posts/field-guide/) — the error-analysis methodology in concrete form. +4. [Zheng et al. 2023 — *Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena*](https://arxiv.org/abs/2306.05685) — the foundational LLM-judge paper. +5. [Yao et al. 2024 — *τ-bench*](https://arxiv.org/abs/2406.12045) — the tool-agent-user benchmark and pass^k methodology. +6. [Jimenez et al. 
2023 — *SWE-bench*](https://arxiv.org/abs/2310.06770) — execution-based code-agent grading. +7. [Zhuge et al. 2024 — *Agent-as-a-Judge*](https://arxiv.org/abs/2410.10934) — the agentic-judge frontier. +8. [Anthropic — *Building effective agents*](https://www.anthropic.com/research/building-effective-agents) — workflow vs. agent patterns. +9. [Anthropic — *Writing tools for agents*](https://www.anthropic.com/engineering/writing-tools-for-agents) — tool design principles. +10. [Anthropic — *Effective context engineering for AI agents*](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) — context engineering. + +### 9.4 Companion files + +- `REPORT.md` — this document. +- `report.html` — interactive HTML rendering with rich diagrams. +- `sources/01-foundations.md` — foundations dossier. +- `sources/02-frameworks.md` — frameworks dossier. +- `sources/03-toolcalling-and-trajectory.md` — tool-calling dossier. +- `sources/04-improvement.md` — improvement dossier. +- `diagrams/` — standalone SVG renderings of the key diagrams. + +--- + +*End of report.* diff --git a/agent-evaluation-research/diagrams/01-three-layers.svg b/agent-evaluation-research/diagrams/01-three-layers.svg new file mode 100644 index 00000000..32b5542d --- /dev/null +++ b/agent-evaluation-research/diagrams/01-three-layers.svg @@ -0,0 +1,84 @@ + + + + + + + + + + + + + + + + + + + +The three layers of agent evaluation +Component → trajectory → outcome. Reporting only the outcome hides the bug you actually care about. + + + + +Component-level + +Question +Does each sub-skill +work in isolation? + +Examples +• Tool-match rate +• Parameter F1 +• Retrieval recall@k +• Schema validation + +QualOps mapping +Did Analyze read_file +with right path? Did the +finding match schema? + + + + +Trajectory-level + +Question +Is the path of reasoning ++ actions valid, efficient, +and faithful? + +Examples +• Plan correctness +• Trajectory edit distance +• Tool-call F1 over seq. 
+• Step-level grounding + +QualOps mapping +Did Review's parallel +sub-agents converge +without redundant calls? + + + + +Outcome-level + +Question +Did the agent achieve +the user goal? + +Examples +• Task success +• Unit-test pass rate + (SWE-bench style) +• Human rating + +QualOps mapping +Did the suggested fix +actually fix the bug? +Did the Report match? + + diff --git a/agent-evaluation-research/diagrams/02-qualops-architecture.svg b/agent-evaluation-research/diagrams/02-qualops-architecture.svg new file mode 100644 index 00000000..8fd5f73b --- /dev/null +++ b/agent-evaluation-research/diagrams/02-qualops-architecture.svg @@ -0,0 +1,159 @@ + + + + + + + + + + + + + + + + +QualOps eval architecture +Pipeline + eval layer + offline improvement loop, all linked through the trace store. + + + +QUALOPS PIPELINE (per PR) + + + + +PR diff + + +Analyze +detect changed + + +Review +findings + + +Fix +suggest patch + + +Report +aggregate + + +Judge +quality gate + + +CI status + + + + + + + + + + + + + + +Langfuse +traces · datasets · experiments · scores + + + + + + + + + +spans + + + +EVAL LAYER (CI-gated) + + + +Tool-call F1 +per stage (BFCL-style) + + +Schema + guardrails +deterministic asserts + + +SWE-bench harness +apply patch + tests + + +Agent-as-judge +on Review/Report + + +pass^k reliability + 95% CI (paired comparison) +5+ runs per task; McNemar / paired bootstrap + + + + +OFFLINE IMPROVEMENT LOOP (between releases) + + + +Sample + open-code +30-50 failed traces + + +Axial-code → taxonomy +5-15 buckets, prioritized + + +Pick top bucket +freq × severity × fix + + +Apply smallest fix +prompt → tool → SFT + + +Re-run eval, gate, promote prompt version, ship +stat-sig improvement, no regression + + + + + + + + + + + + +new prompt version + + + + +Pipeline (production) + + +Eval layer (scoring) + + +Improvement (offline) + + +Trace store (Langfuse) + + +Three concerns kept structurally separate. The trace store is the single source of truth. 
+ + diff --git a/agent-evaluation-research/diagrams/03-eval-cadence.svg b/agent-evaluation-research/diagrams/03-eval-cadence.svg new file mode 100644 index 00000000..2d75bce6 --- /dev/null +++ b/agent-evaluation-research/diagrams/03-eval-cadence.svg @@ -0,0 +1,91 @@ + + + + +Two-tier eval cadence +Fast feedback per PR, deeper truth per night, capability assurance per week. + + + +PER-PR · 3–5 min · BLOCKING +Promptfoo YAML asserts on each pipeline stage. PR-comment diff vs main. Gate the merge. + + + +Promptfoo asserts +~30 per-stage + + +Tool-call F1 +spot regressions early + + +Schema + guardrails +deterministic + + +PR-comment diff +vs main baseline + + +Block on regression +required check + + + + +NIGHTLY · 30–60 min · NON-BLOCKING + ALERTS +Langfuse experiment, full pipeline, LLM-as-judge scorers, paired stat comparison vs previous baseline. + + + +100–200 dataset items +CRB + internal PRs + + +LLM-as-judge per stage +cross-model + + +Agent-as-judge +on final Report + + +pass^5 reliability +5 runs / task + + +Slack + dashboard +drift alerts + + + + +WEEKLY · ~2 hours · CAPABILITY EVAL +Inspect AI on held-out fixture repos. SWE-bench-style harness on Fix stage. Drift report. + + + +Inspect AI bridge +unmodified harness + + +Held-out repos +contamination-free + + +SWE-bench-style +FAIL_TO_PASS + PASS_TO_PASS + + +SWE-bench Live mining +50/month fresh + + +Drift report +to leadership + + +Engineers get fast feedback. The team gets deeper nightly truth. Leadership gets weekly capability assurance. + + diff --git a/agent-evaluation-research/report.html b/agent-evaluation-research/report.html new file mode 100644 index 00000000..333ce754 --- /dev/null +++ b/agent-evaluation-research/report.html @@ -0,0 +1,1338 @@ + + + + + +Evaluating and Improving LLM Agents — QualOps Research Report + + + + +
+ + +
+ +
+

Evaluating and Improving LLM Agents

+

A state-of-the-art report on agent accuracy, evaluations, and offline improvement — with a generalized approach for QualOps and similar projects.

+

QualOps Research · May 8, 2026 · Engineering · Leadership

+
+ +
Document map

| Part | Title | Audience |
|---|---|---|
| 0 | Executive summary | Leadership + engineering |
| 1 | Why this matters for QualOps | Leadership |
| 2 | Foundations of agent evaluation | Engineering |
| 3 | Evaluating tool-calling and workflow agents | Engineering |
| 4 | The framework landscape | Engineering |
| 5 | Systematically improving agents (offline) | Engineering |
| 6 | The QualOps Approach: generalized concept | Both |
| 7 | Prerequisites and adoption roadmap | Both |
| 8 | Risks, open questions, what we left out | Both |
| 9 | Appendix: glossary, references, dossier links | Engineering |
+
+ +

0 · Executive summary

+ +

LLM agent quality is not a property of the model alone — it is a property of the system: the model, the prompts, the tools, the context routing, and the harness, all together. As QualOps has matured into a multi-stage agentic pipeline (Analyze → Review → Fix → Report → Judge), the unit that matters has become the trajectory the agent takes through tool calls, not just the final PR comment.

+ +
+The shape of the recommendation  
+1. Score the agent on three layers, every release.
+2. Use a small, curated golden set (50–200 PRs), not a giant crawl.
+3. Apply the right tool to the right stage (deterministic tests for Fix; tool-call F1 for Analyze; LLM-judge for Review/Report; agent-as-judge for Judge).
+4. Run a two-tier eval cadence: per-PR fast gate + nightly capability eval.
+5. Improve through structured error analysis (open coding → axial coding → frequency-weighted prioritization).
+6. Keep Langfuse, add Promptfoo (per-PR CI gate) and Inspect AI (nightly capability eval). +
+ +
+
Golden set size
50–200
curated real PRs, refreshed quarterly
+
Per-PR gate
3–5 min
Promptfoo YAML asserts, blocking
+
Nightly cadence
~30–60 min
Langfuse experiments, full pipeline + LLM judges
+
Reliability metric
pass^5
probability of passing 5 runs in a row
+
+ +

1 · Why this matters for QualOps

+ +

1.1 The QualOps pipeline

+ +

QualOps is an AI-powered code review tool built on the Claude Agent SDK. It runs in CI on every pull request and produces structured findings — comments, GitHub Checks annotations, severity-ranked reports, and (in agentic mode) suggested fixes.

+ +
flowchart LR
    P0[PR diff] --> P1[Analyze]
    P1 --> P2[Review]
    P2 --> P3[Fix]
    P3 --> P4[Report]
    P4 --> P5[Judge]
    P5 --> P6[CI status]
    classDef stage fill:#1c232c,stroke:#5b9eff,color:#e6edf3
    class P1,P2,P3,P4,P5 stage
+ +

1.2 What is at stake

+ +

A code review is a trust artifact. A false positive wastes developer time and erodes confidence in every subsequent finding. A false negative defeats the purpose of the tool. A confidently miscalibrated severity label routes attention away from the issues that actually matter. None of these failures are catastrophic individually, but they compound across thousands of PRs.

+ +

Without disciplined evaluation, four risks follow: customer churn (reviewers turn the bot off), hidden regressions (a prompt change fixes one issue and silently regresses three), cost overruns (model upgrades produce big bills with unclear value), and audit risk for enterprise customers.

+ +

1.3 What we already have

+ +

QualOps ships with a working evaluation suite: Langfuse-backed dataset runs, multiple presets (fast, default, sonnet-agentic, thorough), CRB-derived golden datasets across five real repos (Sentry, Grafana, Cal.com, Discourse, Keycloak), and a configurable LLM-as-judge scoring stage. This puts QualOps ahead of most teams shipping agentic products today. The gaps identified here are deliberate next steps, not foundations.

+ +

2 · Foundations of agent evaluation

+ +

2.1 An agent is not a function

+ +

A classical LLM eval treats the model as a function f(prompt) → completion. An agent eval treats the agent as a stateful policy π that interacts with an environment via tools. The unit of evaluation is a trajectory:

+ +
τ = (s₀, a₀, o₀, s₁, a₁, o₁, …, sₙ)
+ +

Every benchmark surveyed for this report agrees: agent evaluation requires assessing not just the terminal answer but the path taken to reach it.
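As a concrete reading of the trajectory τ above, the sketch below models it as a list of (state, action, observation) steps plus a final state. The class and field names are illustrative assumptions, not QualOps or Langfuse types, and the action shape (`{"tool": ..., "args": ...}`) is likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Step:
    state: Any        # s_t: what the agent knew before acting
    action: dict      # a_t: assumed shape {"tool": name, "args": {...}}
    observation: Any  # o_t: what the environment returned


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_state: Any = None  # s_n, the terminal state an outcome eval inspects

    def tool_sequence(self) -> list[str]:
        """Ordered tool names, the input most trajectory metrics start from."""
        return [step.action["tool"] for step in self.steps]
```

Outcome-level checks read only `final_state`; component-level and trajectory-level checks read `steps`. Keeping both on one record is what lets a single trace be scored on all three layers.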

+ +

2.2 The three layers of agent evaluation

+ +
+
+

Component-level

+

Q: Does each sub-skill (single tool call, retriever, sub-agent) work in isolation?

+

Metrics: Tool-match rate, parameter F1, retrieval recall@k

+
+
+

Trajectory-level

+

Q: Is the path of reasoning + actions valid, efficient, and faithful?

+

Metrics: Plan correctness, edit distance, tool-call F1 over sequence

+
+
+

Outcome-level

+

Q: Did the agent achieve the user goal?

+

Metrics: Task success, unit-test pass rate, human rating

+
+
+ +

For QualOps all three layers exist naturally:

+
    +
  • Component: did the Analyze stage read_file with the right path?
  • +
  • Trajectory: did the Review stage's parallel sub-agents converge without redundant tool calls?
  • +
  • Outcome: did the suggested fix actually fix the bug?
  • +
+ +

2.3 Dimensions of agent quality

| Dimension | Why it matters for QualOps | How to measure |
|---|---|---|
| Accuracy / task success | The headline number | Exact match, unit-test pass, human rating |
| Faithfulness / groundedness | Dominant for code review. A hallucinated finding is worse than a missed one | Atomic-claim NLI; citations as guardrail |
| Completeness | Did the agent find all the issues a human would? | Recall against an annotated PR review |
| Calibration | Severity labels must be trustworthy for triage | ECE, Brier score |
| Robustness | Stable under prompt perturbation, weird diffs | Performance under paraphrase / typo suites |
| Determinism / consistency | Same PR → same review | Output variance across N samples; pass^k |
| Latency | CI gates have time budgets | p50/p95/p99 wall-clock per stage |
| Cost | $ per PR | Tokens × price + tool-call costs |
+ +
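The calibration row names ECE and the Brier score. A minimal sketch of both follows, assuming each finding carries a confidence in [0, 1] and a binary correctness label from human review; the equal-width binning and the default of 10 bins are common choices, not something the report prescribes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared error between confidence and the 0/1 outcome."""
    pairs = list(zip(confidences, correct))
    return sum((conf - (1.0 if ok else 0.0)) ** 2 for conf, ok in pairs) / len(pairs)
```

For example, four findings each reported at 0.9 confidence, of which three are correct, give an ECE of about 0.15: the model is overconfident by 15 points in that bin.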
+Faithfulness is dominant for code review. A hallucinated finding is more damaging than a missed one. Every claim in the report must cite a file:line; if it cannot, drop the claim. This is the "citations as guardrail" pattern. +
+ +
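The "citations as guardrail" pattern can be sketched as a deterministic filter. This is an illustration under assumptions: the citation shape (`path.ext:line`), the regex, and the function name `filter_grounded` are all hypothetical, and the extra check that the cited file is part of the diff is one possible way to catch findings about code the PR never touched.

```python
import re

# Assumed citation shape: a path with a file extension, a colon, a line number.
CITATION = re.compile(r"(?P<path>[\w./-]+\.\w+):(?P<line>\d+)")


def filter_grounded(claims, changed_files):
    """Split claims into (kept, dropped) based on file:line citations."""
    kept, dropped = [], []
    for claim in claims:
        cited_paths = [m.group("path") for m in CITATION.finditer(claim)]
        # Keep a claim only if it cites something and every cited file
        # is actually in the diff under review.
        if cited_paths and all(path in changed_files for path in cited_paths):
            kept.append(claim)
        else:
            dropped.append(claim)
    return kept, dropped
```

A claim with no citation and a claim citing a file outside the diff are both dropped; the dropped list is worth logging, since its size is a cheap proxy for the hallucinated-finding rate.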

2.4 The evaluation lifecycle

+ +
flowchart LR
    A[1. Error analysis on real traces] --> B[2. Codify failure modes as rubric]
    B --> C[3. Add to golden set + regression suite]
    C --> D[4. Run evals in CI; block on regression]
    D --> E[5. Ship + monitor in production]
    E --> F[6. Sample drift, online judge]
    F --> A
    classDef step fill:#1c232c,stroke:#5b9eff,color:#e6edf3
    class A,B,C,D,E,F step
+ +

Tactics endorsed across primary sources: start small (Anthropic: "20–50 simple tasks drawn from real failures is a great start"); treat evals like unit tests; route failures back to the eval set; gate on stat-sig regression.

+ +

2.5 LLM-as-judge — the workhorse, with caveats

+ +

Zheng et al.'s Judging LLM-as-a-Judge (NeurIPS 2023) showed that GPT-4 acting as a judge agreed with human preference at over 80% — roughly the same as inter-human agreement. This legitimized LLM-as-judge as a primary evaluation method.

| Bias | What happens | Mitigation |
|---|---|---|
| Position bias | Judge prefers whichever appears first | Swap order, score both, average |
| Verbosity bias | Longer answers rated higher | Length constraint in rubric |
| Self-preference | Judge prefers outputs from its own family | Cross-model judge ensemble |
| Familiarity bias | Judge favors text it would have generated | Down-weight low-perplexity samples |
| Sycophancy | Judge follows hints in the prompt | Blind the judge to source |
| Fallacy oversight | Judge accepts confident-sounding wrong reasoning | Step-by-step grading; process supervision |
+ +
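The position-bias mitigation in the table (swap order, score both, average) is simple enough to sketch. The `judge` callable here is a stand-in assumption: any function that takes two candidate outputs and returns the probability, in [0, 1], that the first one is better. A real judge would wrap an LLM call.

```python
from typing import Callable


def debiased_preference(judge: Callable[[str, str], float], a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    forward = judge(a, b)          # a shown first
    backward = 1.0 - judge(b, a)   # b shown first, converted to preference for a
    return (forward + backward) / 2
```

If the judge adds a constant bonus to whichever answer appears first, the bonus enters the two terms with opposite signs and cancels in the average. Non-constant position effects do not cancel exactly, so this reduces the bias rather than eliminating it.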

2.6 Trajectory and process evaluation

+ +

Process evaluation asks: was every intermediate step justified? Three families: step-wise correctness, plan-level precision/recall, and edit distance. OpenAI's Let's Verify Step by Step (Lightman et al. 2023) showed that process supervision beats outcome supervision for training reward models on math reasoning (the MATH dataset).

+ +

2.7 Statistical rigor — why "vibes" fail

+ +

With N=10 examples and a stochastic model, a swing of ±20% in pass rate is normal noise. The minimum statistical discipline:

+
    +
  • Paired comparisons. Run model A and B on the same examples; per-example difference cancels per-example variance.
  • +
  • Confidence intervals. 95% CI half-width ≈ 1.96 × √(p(1-p)/N). To distinguish 80% from 85% at 95% confidence you need ~1000 samples.
  • +
  • Multiple runs per task. Even at temperature 0, modern serving stacks are non-deterministic. Plan for ≥5 runs per task.
  • +
  • pass^k reporting. Sierra's τ-bench: probability of succeeding k times in a row.
  • +
  • Bradley-Terry / Elo for pairwise rankings (what Chatbot Arena does).
  • +
+ +
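The arithmetic behind the bullets above can be checked directly. The sketch below uses the same normal-approximation formula as the text, plus a pass^k estimator of the form C(c, k) / C(n, k) for c successes in n trials, which is the estimator I understand τ-bench to use; treat that attribution as an assumption. Function names are illustrative.

```python
import math


def ci_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation CI for a pass rate p over n items."""
    return z * math.sqrt(p * (1 - p) / n)


def samples_for_half_width(p, half_width, z=1.96):
    """Smallest n whose CI half-width is at most `half_width`."""
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)


def pass_hat_k(successes, trials, k):
    """Estimate of P(all k independent runs succeed) from `trials` runs of one task."""
    if k > trials:
        raise ValueError("k cannot exceed the number of trials")
    return math.comb(successes, k) / math.comb(trials, k)
```

With p = 0.8 and N = 10 the half-width is about 0.25, which is why a swing of 20 points on ten examples is noise. Separating 80% from 85% needs each interval to be about ±2.5 points, and `samples_for_half_width(0.8, 0.025)` gives 984, matching the "~1000 samples" figure. For pass^5 with exactly five runs, the estimator is 1 only if all five runs pass and 0 otherwise.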

3 · Evaluating tool-calling and workflow agents

+ +

QualOps is a tool-calling workflow agent, not a chatbot. What matters is whether the agent picks the right tools, in the right order, with the right arguments, and produces the right final artifact.

+ +

3.1 What "tool-call accuracy" actually means

+ +

"Tool-call accuracy" is a deceptively flat label. It decomposes into a stack of sub-metrics:

| Metric | What it measures |
|---|---|
| Exact match | Predicted call equals gold byte-for-byte. Brittle. |
| AST match | Parse into name + (arg-name, arg-value); structural equality |
| Semantic match | LLM-judge or custom equality on argument values |
| Argument F1 | Per-call precision/recall on argument names + values |
| Tool-call F1 | Set-level over multiset of (tool, args) pairs |
| Multi-call ordering | Exact / in-order / any-order / edit distance |
| Hallucinated tools | Agent invents a tool that doesn't exist |
| Missed tools | Agent answered from knowledge instead of calling tool |
| Idempotency / collateral damage | Side-effecting call repeated; unintended state change |
| Parallel calls | Bag of tool calls per turn (not list) |
+ +
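As a rough illustration of two rows from the table (tool-call F1 over a multiset and hallucinated-tool detection), the sketch below assumes each call is a dict with hypothetical `name` and `args` keys. Canonicalizing with sorted keys gives an order-insensitive structural comparison in the spirit of AST match; it is not BFCL's actual implementation.

```python
import json
from collections import Counter


def canonical(call: dict) -> str:
    """Structural form of a call: tool name plus arguments, key order ignored."""
    return json.dumps({"name": call["name"], "args": call.get("args", {})}, sort_keys=True)


def tool_call_f1(predicted: list, gold: list) -> dict:
    """Precision, recall and F1 over the multiset of (tool, args) pairs."""
    pred = Counter(canonical(c) for c in predicted)
    ref = Counter(canonical(c) for c in gold)
    overlap = sum((pred & ref).values())  # multiset intersection
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def hallucinated_tools(predicted: list, known_tools: set) -> list:
    """Tool names the agent called that are not in the registered tool set."""
    return sorted({c["name"] for c in predicted} - known_tools)
```

Because the comparison is on canonical strings, `{"path": "src", "pattern": "x"}` and `{"pattern": "x", "path": "src"}` count as the same call, while a differing argument value counts as both a false positive and a false negative.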

3.2 Trajectory evaluation

+ +

Two orthogonal questions:

+
    +
  • Q1 — Did the agent get to the goal? (outcome)
  • +
  • Q2 — Did it follow a sensible path? (process)
  • +
+ +

An agent can stumble to the right answer through a 47-step random walk, or take an optimal 3-step path that ends in the wrong final state. Scoring rules in increasing leniency: trajectory exact match → in-order match → any-order match → edit distance.
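The four scoring rules can be sketched over trajectories represented as lists of step labels (here, tool names). The exact semantics are an assumption: in-order match is taken to mean "the reference steps appear in order, extra steps allowed", and any-order match to mean "every reference step appears at least as often as in the reference".

```python
from collections import Counter


def exact_match(pred: list, ref: list) -> bool:
    """Same steps, same order, nothing extra."""
    return list(pred) == list(ref)


def in_order_match(pred: list, ref: list) -> bool:
    """Reference steps occur in order inside pred; extra steps are tolerated."""
    remaining = iter(pred)
    return all(step in remaining for step in ref)  # `in` consumes the iterator


def any_order_match(pred: list, ref: list) -> bool:
    """Every reference step occurs in pred, order ignored."""
    return not (Counter(ref) - Counter(pred))


def edit_distance(pred: list, ref: list) -> int:
    """Levenshtein distance between the two step sequences."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (p != r)))
        prev = cur
    return prev[-1]
```

A trajectory with one harmless extra `list_dir` step fails exact match but passes in-order match with an edit distance of 1, which is usually the behavior wanted for agents that explore a little before acting.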

+ +

3.3 Outcome vs. process — when to use each

+ + + + + + + + + + +
| Aspect | Outcome eval | Process eval |
|---|---|---|
| Data needed | Goal-state checker (test, schema, regex) | Reference trajectories or judge model |
| Cost | Cheap, deterministic | Expensive or noisy |
| Catches lucky shortcuts | No (agent can game it) | Yes |
| Catches plan inefficiency | No | Yes |
| Penalizes equivalent paths | No (good) | Yes (bad: risks rewarding mimicry) |
+ +

3.4 Major benchmarks worth knowing for QualOps

+ + + + + + + + + + + + + +
| Benchmark | Year | Why it matters for QualOps |
|---|---|---|
| BFCL v3 | 2024 | AST match + executable accuracy methodology; right framework for per-stage tool calls |
| τ-bench | 2024 | pass^k metric for reliability under non-determinism. Directly transferable |
| SWE-bench Verified † | 2024 | Apply patch + FAIL_TO_PASS + PASS_TO_PASS. The methodology template for QualOps's Fix stage |
| SWE-bench Live | 2025 | 50 freshly verified GitHub issues per month. Now the recommended SWE-bench variant |
| SWE-bench Pro | 2025 | Long-horizon, enterprise-scale. GPT-5 23.3% / Claude Opus 4.1 23.1% |
| AppWorld | 2024 | State-based eval with collateral-damage check |
| DevAI / Agent-as-a-Judge | 2024 | Methodology for using an agent (with tools) as the judge |
| HAL | 2025 | Variance-decomposed reporting |
+ +
+The single most influential idea for QualOps is SWE-bench's "tests as oracle": apply the agent's patch, run FAIL_TO_PASS + PASS_TO_PASS, classify. Fully deterministic, fully outcome-based, ignores how the agent got there, resists most reward hacking. We adopt the methodology directly for the Fix stage. +
+ +
+† Note on SWE-bench Verified status (May 2026): OpenAI publicly deprecated SWE-bench Verified on Feb 23, 2026, citing flawed test patches and contamination concerns. The benchmark's methodology remains the gold standard, but for a fresh, contamination-free dataset prefer SWE-bench Live (50 newly-verified issues per month) or SWE-bench Pro. Internal harnesses built on the methodology are unaffected. +
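The "tests as oracle" classification reduces to two set checks once the patch has been applied and the tests have run. The sketch below assumes the test runner has already produced a mapping from test id to pass/fail; the function name and the `regression` / `unresolved` labels are illustrative, not SWE-bench's own vocabulary. Tests missing from the results are counted as failures, so a patch cannot win by making tests disappear.

```python
def classify_patch(results: dict, fail_to_pass: list, pass_to_pass: list) -> str:
    """Classify a patched run from per-test outcomes.

    results       maps test id -> True (passed) / False (failed)
    fail_to_pass  tests that failed before the patch and must now pass
    pass_to_pass  tests that passed before the patch and must still pass
    """
    fixed = all(results.get(test, False) for test in fail_to_pass)
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    if fixed and unbroken:
        return "resolved"
    if not unbroken:
        return "regression"  # the patch broke (or removed) a previously passing test
    return "unresolved"      # nothing broke, but the target tests still fail
```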
+ +

3.5 Code-agent specific evaluation

+ +

For QualOps's Review stage (no patch, just a comment), the analog of SWE-bench's pattern is:

+
    +
  1. Finding location precision/recall: did the agent flag the right line and file?
  2. Finding-class match: did it categorize correctly (security vs perf vs style)?
  3. Finding–PR alignment: does the finding correspond to something the human reviewer also flagged?
+ +

Tests don't catch every flavor of bad fix: style/readability regressions, performance regressions, security regressions. These need additional graders.
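Finding location precision/recall for the Review stage can be sketched as follows. The representation is an assumption: each finding is a `(file, line)` tuple, and a hypothetical `tolerance` parameter lets a finding count as a hit when it lands within a few lines of a human-flagged location.

```python
def location_scores(predicted: set, gold: set, tolerance: int = 0) -> tuple:
    """Return (precision, recall) of predicted (file, line) findings against gold ones."""

    def near(a: tuple, b: tuple) -> bool:
        # Same file, and line numbers within the allowed tolerance.
        return a[0] == b[0] and abs(a[1] - b[1]) <= tolerance

    hit_pred = sum(1 for p in predicted if any(near(p, g) for g in gold))
    hit_gold = sum(1 for g in gold if any(near(p, g) for p in predicted))
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall
```

A small tolerance (say 2 lines) is often a reasonable choice because reviewers and agents frequently anchor the same issue on adjacent lines of a hunk.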

+ +

3.6 Agent-as-judge

+ +

Zhuge et al.'s Agent-as-a-Judge (Oct 2024; ICML 2025) replaces the LLM judge with an agent judge that can read code, run tools, and verify intermediate steps:

+ +
+
Human agreement
~90%
vs ~70% for plain LLM-judge
+
Cost reduction
~97%
86 h / $1,297 → ~2 h / $31
+
+ +

For QualOps: an external agent-as-judge is well suited to grading "was this PR review good?" — give it the diff, the agent's findings, the human-merged PR, and a rubric, and let it use grep/file-read tools to verify each finding against the code. Run the judge on a different model family than the Review stage to avoid self-preference.

+ +

3.7 Replay testing and recorded traces

+ +

Pattern (used by Braintrust, LangSmith, Phoenix, Anthropic, Cognition):

+
    +
  1. Capture every production run as a trace.
  2. Tag failures, edge cases, customer escalations into a regression set.
  3. On every prompt/model/harness change, replay each trace with stubbed tool outputs.
  4. Diff: were tool-call sequences equivalent? Did the final artifact differ?
+ +
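Steps 3 and 4 of the replay pattern can be sketched with a stub that serves recorded tool outputs. Everything here is hypothetical scaffolding: the trace format (`name`, `args`, `output`), the `ReplayStub` class, and the convention that the agent is a callable receiving a `call_tool` function. None of it is the API of the vendors listed above.

```python
import json


class ReplayStub:
    """Serves recorded tool outputs instead of touching real tools."""

    def __init__(self, recorded: list):
        self._outputs = {self._key(s["name"], s["args"]): s["output"] for s in recorded}
        self.calls = []   # (name, args) pairs made during the replay
        self.misses = []  # calls that had no recorded output

    @staticmethod
    def _key(name: str, args: dict) -> str:
        return json.dumps([name, args], sort_keys=True)

    def call(self, name: str, **args):
        self.calls.append((name, args))
        key = self._key(name, args)
        if key not in self._outputs:
            self.misses.append((name, args))
            return f"REPLAY_MISS: no recorded output for {name}"
        return self._outputs[key]


def replay(agent, recorded: list) -> dict:
    """Run the agent against stubbed tools and diff its calls with the recording."""
    stub = ReplayStub(recorded)
    artifact = agent(stub.call)
    old_calls = [(s["name"], s["args"]) for s in recorded]
    return {
        "same_calls": stub.calls == old_calls,
        "misses": stub.misses,
        "artifact": artifact,
    }
```

A replay miss is itself a signal: it means the changed prompt or model asked for information the original run never fetched, so the trace cannot be replayed faithfully and needs a live re-run.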

3.8 Decision guide: situation → technique

+ + + + + + + + + + + + + + + + + +
| Situation | Technique |
|---|---|
| Single-call function selection | BFCL-style AST match + argument F1 |
| Multi-step deterministic workflow | Trajectory in-order match + state-based eval |
| Parallel tool calls in one turn | Set-equality match (bag of calls) |
| Tool arguments are free-form text | Action similarity (embedding or LLM-judge) |
| Side-effecting tools | State-based eval with collateral-damage check |
| Output is a code patch | SWE-bench harness: apply + FAIL_TO_PASS + PASS_TO_PASS |
| Output is open-ended text | Agent-as-judge with structured rubric |
| Detect hallucinated tools | Schema validation + tool-name whitelist |
| Detect missed tools | Recall against reference trajectory |
| Variance/reliability | pass^k with k≥5; mean + 95% CI |
| Catching prompt/model regressions | Recorded-trace replay with tool stubs |
| Long-horizon multi-stage agent (QualOps) | Hybrid: per-stage F1 + per-stage state checks + outcome + agent-as-judge |
+ +

4 · The framework landscape

+ +

The eval / observability tooling ecosystem moved fast through 2025–2026. The major shifts since early 2025: OpenAI acquired Promptfoo in March 2026 (MIT license preserved), Langfuse landed observation-level LLM-as-judge in February 2026, and Inspect AI reached production-grade adoption inside frontier labs.

+ +

4.1 The shortlist for QualOps

+ +
    +
  1. Langfuse (incumbent, keep). MIT, self-host, observation-level LLM-as-judge, boolean/categorical scoring. Production references include Canva. TS + Python parity.
  2. Promptfoo (add as CI gate). MIT, OpenAI-acquired but license preserved, first-class GitHub Action with PR-comment diffs, Claude Agent SDK provider.
  3. Inspect AI (add for nightly capability evals). MIT, used by Anthropic/DeepMind/Grok. Agent Bridge wraps the QualOps agent without modification.
  4. LangSmith (only if a wall is hit). Best out-of-the-box trajectory primitives via agentevals. Closed-source.
  5. Braintrust (only if non-engineers must contribute test cases). Notion's reference deployment is real. Closed-source; hybrid-only self-host.
+ +

4.2 Comparison matrix

+ +

Legend: yes/strong   partial/caveat   no/weak

+ + + + + + + + + + + + + + + + + +
FrameworkOSSSelf-hostTrajectoryTool-callLLM judgeOnlineCITSPyFree tier
Langfuse✓ MIT50k/mo
LangSmith✓ Plus+5k/mo
DeepEval✓ Py✓ pytest
Braintrust1M spans
Phoenix / Arize
Promptfoo✓ MIT✓ best
Inspect AI✓ MIT
OpenAI Evals API✓ repopaid API
W&B Weave✓ SDK
MLflow GenAI
RAGAS
Patronussales
+ +

4.3 Fit by team profile

+ + + + + + + + + + + + +
| Team profile | Recommended primary | Add-ons |
|---|---|---|
| Small team, CI-gated, Node/TS, Claude (= QualOps) | Langfuse | + Promptfoo (CI) + Inspect AI (nightly) |
| Small/mid team, Python-only, RAG-heavy | DeepEval + RAGAS | + Phoenix or Langfuse |
| Large org, many agents, dedicated SREs | Braintrust (product) + Phoenix/Arize AX (platform) | |
| LangChain / LangGraph shop | LangSmith | |
| Frontier-lab / safety org | Inspect AI | + custom storage |
| OpenAI-only shop | OpenAI Evals API + Promptfoo | |
| Already on W&B for ML | W&B Weave | + RAGAS / DeepEval metrics |
+ +

4.4 CI integration patterns

+ +

The cleanest CI pattern for QualOps is two-tier:

+
    +
  1. Per-PR (3–5 min): Promptfoo YAML with ~30 small assertions on the output of each pipeline stage; required GitHub check; PR-comment diff vs. main.
  2. Nightly (~30–60 min): Langfuse experiment over a 100–200 item dataset, full pipeline, LLM-as-judge scorers + tool-call F1 scorers per stage. A weekly Inspect AI capability eval against held-out fixture repos runs on top of this (see 6.4).
+ +

5 · Systematically improving agents (offline)

+ +

5.1 The eval-driven improvement loop

+ +
+flowchart TD + A[Production / staging traces] --> B[Sample failures + passes] + B --> C[Open coding
free-text notes] + C --> D[Axial coding
cluster into taxonomy] + D --> E[Prioritize by
frequency × severity × fixability] + E --> F{Pick top
bucket} + F --> G[Hypothesize fix:
prompt / context / tool /
sub-agent / model / SFT] + G --> H[Implement smallest
change that could fix it] + H --> I[Run eval set
regression + targeted] + I --> J{Delta positive?
No regression?} + J -- No --> K[Discard or refine] + K --> G + J -- Yes --> L[Promote prompt version] + L --> M[Ship behind gate] + M --> N[Collect new traces] + N --> A + classDef step fill:#1c232c,stroke:#5b9eff,color:#e6edf3 + classDef decision fill:#1a1f26,stroke:#ffb454,color:#e6edf3 + class A,B,C,D,E,G,H,I,K,L,M,N step + class F,J decision +
+ +

5.2 Error analysis: open coding → axial coding → frequency

+ +

Error analysis is the most frequently cited improvement technique. It is borrowed from grounded-theory qualitative research:

+ +
    +
  • Pass 1 — Open coding (bottom-up). Sit with raw traces. Free-text notes describing what went wrong. Critically, do not pre-define categories. Bottom-up coding at NurtureBoss surfaced "date handling" as the dominant failure class and lifted that subtask from 33% → 95% accuracy.
  • +
  • Pass 2 — Axial coding. Group notes into 5–15 categories. LLM-assisted clustering pass + human review.
  • +
  • Frequency-weighted prioritization. Rank by frequency × severity (business cost) × fixability. Spend engineering effort top-down.
  • +
+ + + + + + + + + + +
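| ID | Category | Stage | Frequency | Severity | Fixability | Priority |
|---|---|---|---|---|---|---|
| E1 | False positive on idiomatic style | Review | 32% | Low | High | P1 |
| E2 | Missed null-deref across files | Analyze | 18% | High | Medium | P1 |
| E3 | Fix proposed wrong import | Fix | 11% | Medium | High | P2 |
| E4 | Judge rated harmless nit as "blocker" | Judge | 9% | Medium | High | P2 |
| E5 | Refused on large diff | Analyze | 4% | High | Low | P3 |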
+ +
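One way to make the frequency × severity × fixability ranking mechanical is to map the ordinal labels to numbers and multiply. The 1/2/3 weights below are an assumption chosen purely for illustration; the report does not prescribe a scale, and a team would tune these against its own cost model.

```python
# Hypothetical ordinal weights; not taken from the report.
LEVEL = {"Low": 1, "Medium": 2, "High": 3}


def priority_score(frequency: float, severity: str, fixability: str) -> float:
    """frequency is the share of failing traces in the bucket (0..1)."""
    return frequency * LEVEL[severity] * LEVEL[fixability]


def rank(buckets: list) -> list:
    """Sort error buckets from highest to lowest priority score."""
    return sorted(
        buckets,
        key=lambda b: priority_score(b["frequency"], b["severity"], b["fixability"]),
        reverse=True,
    )


# The five buckets from the example taxonomy above.
TAXONOMY = [
    {"id": "E1", "frequency": 0.32, "severity": "Low", "fixability": "High"},
    {"id": "E2", "frequency": 0.18, "severity": "High", "fixability": "Medium"},
    {"id": "E3", "frequency": 0.11, "severity": "Medium", "fixability": "High"},
    {"id": "E4", "frequency": 0.09, "severity": "Medium", "fixability": "High"},
    {"id": "E5", "frequency": 0.04, "severity": "High", "fixability": "Low"},
]
```

Under these weights the buckets rank E2, E1, E3, E4, E5, which lines up with the P1 / P2 / P3 grouping in the table: a frequent but low-severity bucket (E1) and a rarer high-severity one (E2) both land at the top.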

5.3 The hierarchy of fixes (decision tree)

+ +
+flowchart TD + Start([Eval reveals failure]) --> Q1{What's the
error type?} + + Q1 -->|Format / schema
violation| F1[Tighten output format
in prompt; add example] + Q1 -->|Misunderstood
instruction| F2[Restructure prompt:
role/task/format/guardrails] + Q1 -->|Missing domain
knowledge| F3[Add Skill or
retrieval tool] + Q1 -->|Hallucinated
fact / API| F4[Citation requirement
+ retrieval tool] + Q1 -->|Wrong tool used| F5[Tool design: names,
descriptions, consolidate] + Q1 -->|Tool result ignored| F6[Tool description +
output summarization] + Q1 -->|Reasoning chain broke| F7[Extended thinking
or sub-agent split] + Q1 -->|Style / over-flagging| F8[Few-shot mining:
contrastive don't-flag] + Q1 -->|Cross-file blindness| F9[Code-graph tool;
orchestrator-worker] + Q1 -->|Long-context recall| F10[Context engineering:
prune, dynamic load] + Q1 -->|Judge miscalibrated| F11[Refresh calibration;
verifier model] + Q1 -->|Plateau across types| F12[Prompt optimizer
DSPy / AdalFlow] + Q1 -->|Latency/cost plateau| F13[Routing or
distillation SFT] + Q1 -->|Persistent + structured| F14[DPO from preference
pairs; otherwise SFT] + + F1 --> R[Re-run eval, gate, ship] + F2 --> R + F3 --> R + F4 --> R + F5 --> R + F6 --> R + F7 --> R + F8 --> R + F9 --> R + F10 --> R + F11 --> R + F12 --> R + F13 --> R + F14 --> R + classDef step fill:#1c232c,stroke:#5b9eff,color:#e6edf3 + classDef decision fill:#1a1f26,stroke:#ffb454,color:#e6edf3 + classDef start fill:#1a1f26,stroke:#7ee787,color:#e6edf3 + class Start start + class Q1 decision + class F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11,F12,F13,F14,R step +
+ +

The order of operations in 2026, cheapest to most expensive: prompts → few-shot → tools → context → sub-agents → optimizers → routing → distillation/DPO. Fine-tuning is the long game, worth starting only once the prompt surface is exhausted.

+ +

5.4 Prompt engineering as iteration discipline

+ +

Anthropic's Claude 4.x guides and OpenAI's GPT-5 cookbook converge on the same prompt skeleton:

+ +
[Role]      You are <persona, scope>.
+[Task]      Goal in 1–2 sentences.
+[Context]   Static background, dynamic retrieval, tool surface.
+[Examples]  Few-shot, ideally diverse and including hard cases.
+[Format]    Output schema (XML / JSON / markdown sections).
+[Guardrails] Out-of-scope behaviors, refusal triggers, escalation.
+ +

For Claude specifically: XML-tag structuring (<example>, <context>, <task>), tool definitions in system message, instructions in user turn, "think step by step" or extended thinking for multi-stage tasks.

+ +

5.5 Automated prompt optimization

+ + + + + + + + + + +
| Tool | Mechanism | Best for |
|---|---|---|
| DSPy / MIPROv2 | Programs not prompts; joint Bayesian opt of instructions + few-shot | Multi-stage pipelines (perfect for QualOps) |
| TextGrad | Backpropagation through text via LLM-generated feedback | Composable systems with critic-able output |
| AdalFlow | PyTorch-style auto-diff combining TextGrad + DSPy bootstrapping | Single library covering both directions |
| SAMMO | Structural mutation operators over function-graph prompts | Long structured prompts (manuals, policies) |
| OPRO / Promptbreeder / APE | Earlier generations | Mostly historical; folded into DSPy/AdalFlow |
+ +

5.6 Few-shot mining

+ +

Three lessons from 2024–25 literature:

+
    +
  1. Quality dominates quantity. A handful of well-chosen examples beats dozens of mediocre ones.
  2. Diversity matters more than similarity.
  3. Dynamic retrieval > static set for heterogeneous inputs.
+ +

For QualOps: build a vector index over canonical good examples keyed by diff features; retrieve top-k at inference. Add contrastive examples: pairs of (borderline diff, correct minimal review) so the agent learns where not to comment. Code-review agents over-flag by default; contrastive negatives are the cure.
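Dynamic few-shot retrieval can be sketched with plain cosine similarity. This assumes the diff features have already been embedded into numeric vectors by some upstream step (not shown), and that the index is a small in-memory list; the `vector` and `example` keys are hypothetical. A production setup would more likely use a vector store, and might add a diversity term given the second lesson above.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_examples(query: list, index: list, k: int = 3) -> list:
    """Return the k examples whose vectors are most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query, item["vector"]), reverse=True)
    return [item["example"] for item in ranked[:k]]
```

Contrastive "don't flag" examples can live in the same index; they are retrieved like any other example and simply carry an empty or minimal review as their target.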

+ +

5.7 Tool design — the most under-appreciated lever

+ +

From Anthropic's Writing tools for agents:

+ +
    +
  • Naming: namespace by service; verb_object form.
  • +
  • Description is a prompt. Read on every call. Be explicit about when to use, what inputs are valid, what the tool will NOT do.
  • +
  • Schema with examples.
  • +
  • Consolidate, don't proliferate. Fewer, more capable tools beat many narrow ones.
  • +
  • Return high-signal text. Stable identifiers; pruned/summarized output.
  • +
  • Error messages are pedagogy. "Error: file path must be relative to repo root; you passed an absolute path; try src/foo.py" enables self-correction.
  • +
  • Observable side effects. If a tool mutates state, return the new state.
  • +
+ +
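The "error messages are pedagogy" and "return high-signal text" points can be illustrated with a toy file-reading tool. The sketch is self-contained by assumption: the repository is modeled as an in-memory dict from relative path to content, and `repo_read_file` is a hypothetical name following the namespace plus verb_object convention above.

```python
from pathlib import PurePosixPath


def repo_read_file(path: str, files: dict) -> dict:
    """Read a file by repo-relative path; on failure, say how to fix the call."""
    p = PurePosixPath(path)
    if p.is_absolute() or ".." in p.parts:
        # Suggest a concrete relative path with the same file name, if one exists.
        matches = sorted(f for f in files if PurePosixPath(f).name == p.name)
        hint = f"; try {matches[0]}" if matches else ""
        return {
            "ok": False,
            "error": "file path must be relative to repo root and stay inside it; "
                     f"you passed {path!r}{hint}",
        }
    if path not in files:
        known = ", ".join(sorted(files)[:5])
        return {"ok": False, "error": f"no such file {path!r}; known files include: {known}"}
    # Echo the stable identifier (the path) alongside the content.
    return {"ok": True, "path": path, "content": files[path]}
```

The design choice is that every failure names the rule that was violated and, where possible, a concrete next call, so the agent can self-correct in one turn instead of guessing.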

5.8 Context engineering — curate, don't accumulate

+ +
"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." — Andrej Karpathy
+ +
    +
  • Context rot. Recall and reasoning degrade as token count grows, well before the nominal limit.
  • +
  • Lost-in-the-middle (Liu et al., Stanford). Performance is U-shaped: instructions at the very start or end win; buried middle loses.
  • +
  • Instruction hierarchy. When system, user, and tool instructions conflict, models follow the most recent and most concretely worded.
  • +
+ +

For QualOps: never feed the entire repo. Feed (a) the diff, (b) the immediate symbol context (callers/callees of touched symbols), (c) project conventions for the relevant language, and nothing else. Add a tool the agent can call when it needs more.

+ +

5.9 Sub-agent decomposition and Skills

+ +

Anthropic's Building Effective Agents patterns: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer. Reach for sub-agent decomposition when one prompt is doing two qualitatively different jobs, required tools differ across phases, one phase needs a stronger model, or the prompt has crossed ~3–5K tokens. Caveat: orchestrator-worker uses 10–15× more tokens than a single agent.

+ +

5.10 Routing and model selection

+ +

Industry routers report 30–85% cost reduction with quality flat or slightly improved. For QualOps, a reasonable default cascade:

+ +
    +
  1. Diff size / language detector (no LLM): small CSS/text → Haiku-tier; backend logic → Sonnet; cross-cutting refactors → Opus.
  2. Confidence escape hatch: if Judge rejects with "ambiguous", upgrade and re-run Review.
  3. Per-stage routing: Analyze and Report can be smaller; Review and Fix usually want strong; Judge wants strong (or at least different).
+ +
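The first two steps of the cascade need no LLM at all and can be sketched as a pure function. The thresholds, the extension list, and the `small` / `medium` / `large` tier names (standing in for Haiku-tier, Sonnet, and Opus) are all assumptions for illustration and would need tuning on real traffic.

```python
import os

# Hypothetical settings; tune against observed quality and cost.
LOW_RISK_EXTENSIONS = {".css", ".md", ".txt"}
TIERS = ["small", "medium", "large"]  # e.g. Haiku-tier, Sonnet, Opus


def pick_tier(changed_files: list, changed_lines: int) -> str:
    """Route a diff to a model tier using only cheap, deterministic features."""
    extensions = {os.path.splitext(f)[1].lower() for f in changed_files}
    if extensions and extensions <= LOW_RISK_EXTENSIONS and changed_lines <= 200:
        return "small"
    if len(changed_files) > 20 or changed_lines > 1000:
        return "large"  # cross-cutting refactor
    return "medium"


def escalate(tier: str) -> str:
    """Escape hatch: move one tier up when the Judge reports an ambiguous result."""
    return TIERS[min(TIERS.index(tier) + 1, len(TIERS) - 1)]
```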

5.11 Reflection patterns

+ +
+Reflection is a trap when (a) the error mode is "confidently wrong with no internal signal" (the critic agrees with the bad output), (b) per-call latency matters more than quality, or (c) the critic is the same model with the same context as the actor (same blind spots). +
+ +

The verifier model trick: run Fix with Sonnet, Judge with Opus. Different models share fewer blind spots, which makes this a cheap win in practice.

+ +

5.12 Fine-tuning, distillation, DPO

+ +

The order of operations: prompts → few-shot → tools → context → sub-agents → optimizers → then fine-tuning. Two relevant offline techniques:

+
    +
  • Distillation from agent traces. Run the strong agent on a curated set, record traces, SFT a smaller model. Recent work preserves >90% of teacher quality at <20% of the cost.
  • +
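  • DPO / KTO from preference pairs. Reviewed traces yield (input, accepted, rejected) triples, which is exactly the DPO format. Draw them from a training split that is disjoint from the held-out eval set, otherwise the leakage risk in 8.1 applies.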
  • +
+ +
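Turning reviewed traces into preference pairs is mostly bookkeeping. The sketch below assumes each record carries hypothetical `input_id`, `input`, `output`, and `accepted` fields, and emits the `prompt` / `chosen` / `rejected` layout that preference-tuning libraries commonly expect (treat the exact field names as an assumption to check against whichever trainer is used). Inputs that belong to the held-out eval split are skipped so training data never overlaps with the eval set.

```python
from collections import defaultdict


def build_preference_pairs(records: list, held_out_ids: set) -> list:
    """Pair every accepted output with every rejected output for the same input."""
    by_input = defaultdict(lambda: {"accepted": [], "rejected": []})
    for record in records:
        if record["input_id"] in held_out_ids:
            continue  # never train on the held-out eval split
        bucket = "accepted" if record["accepted"] else "rejected"
        by_input[record["input_id"]][bucket].append(record)

    pairs = []
    for group in by_input.values():
        for good in group["accepted"]:
            for bad in group["rejected"]:
                pairs.append(
                    {"prompt": good["input"], "chosen": good["output"], "rejected": bad["output"]}
                )
    return pairs
```

Inputs with only accepted or only rejected outputs produce no pairs; those are candidates for KTO-style training, which works from unpaired good/bad labels.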

6 · The QualOps Approach — generalized concept

+ +

This is the synthesis: a concrete, opinionated approach for QualOps and other tool-calling agentic projects with similar shape.

+ +

6.1 Architecture: evals integrated into the QualOps pipeline

+ +
+flowchart LR + subgraph Pipeline["QualOps pipeline (per PR)"] + P0[PR diff] --> P1[Analyze] + P1 --> P2[Review] + P2 --> P3[Fix] + P3 --> P4[Report] + P4 --> P5[Judge] + P5 --> P6[CI status] + end + + subgraph EvalLayer["Eval layer (CI-gated)"] + E1[Tool-call F1
per stage] + E2[Schema validation
+ guardrails] + E3[SWE-bench-style
test harness] + E4[Agent-as-judge
on Review/Report] + E5[pass^k reliability] + end + + subgraph Storage["Trace + dataset storage"] + S1[(Langfuse
traces, datasets,
experiments)] + end + + subgraph Improve["Offline improvement loop"] + I1[Sample + open-code
failures] + I2[Axial code into
taxonomy] + I3[Pick top bucket] + I4[Apply smallest fix] + I5[Re-run eval, gate] + I6[Promote prompt
version + ship] + end + + P1 -.spans.-> S1 + P2 -.spans.-> S1 + P3 -.spans.-> S1 + P4 -.spans.-> S1 + P5 -.spans.-> S1 + + S1 --> E1 + S1 --> E2 + S1 --> E3 + S1 --> E4 + S1 --> E5 + + E1 --> I1 + E2 --> I1 + E3 --> I1 + E4 --> I1 + E5 --> I1 + + I1 --> I2 --> I3 --> I4 --> I5 --> I6 + I6 -.new prompt version.-> Pipeline + classDef stage fill:#1c232c,stroke:#5b9eff,color:#e6edf3 + classDef eval fill:#1a1f26,stroke:#7ee787,color:#e6edf3 + classDef improve fill:#1a1f26,stroke:#ffb454,color:#e6edf3 + classDef store fill:#1a1f26,stroke:#ff7b72,color:#e6edf3 + class P1,P2,P3,P4,P5 stage + class E1,E2,E3,E4,E5 eval + class I1,I2,I3,I4,I5,I6 improve + class S1 store +
+ +

Three concerns kept structurally separate: the pipeline (production), the eval layer (scoring), and the improvement loop (between releases).

+ +

6.2 Stage-by-stage eval matrix

+ + + + + + + + + + + +
| Stage | Primary eval technique | Secondary | Reliability metric |
|---|---|---|---|
| Analyze | Tool-call F1 against expected read_file/grep set | Under-tooling rate; hallucinated-tool detection | pass^5 |
| Review | Location precision/recall + finding-class accuracy | Agent-as-judge on textual quality | pass^5 |
| Fix | SWE-bench-style harness: apply patch, FAIL_TO_PASS + PASS_TO_PASS | Linter / formatter delta; perf benchmark | pass^3 |
| Report | Schema validation | LLM-judge on coherence; faithfulness check | pass^5 |
| Judge | Agreement rate vs. held-out human labels | Calibration error (ECE) on severity | pass^5 |
| End-to-end | Composite weighted score + human-rated hold-out | Cross-model agent-as-judge | pass^5 + 95% CI |
+ +

6.3 Tooling stack

+ + + + + + + + + + +
| Layer | Choice | Rationale |
|---|---|---|
| Trace + dataset store | Langfuse (keep) | MIT, self-host, observation-level evals, already wired in |
| Per-PR CI gate | Promptfoo (add) | Best-in-class GitHub Action, YAML config, Claude Agent SDK provider |
| Nightly capability eval | Inspect AI (add) | Used by Anthropic/DeepMind/Grok; Agent Bridge wraps QualOps unmodified |
| LLM judge | Claude Opus + GPT-5 cross-judge for Review/Report; Sonnet for cheaper paths | Cross-model judging mitigates self-preference bias |
| Statistics | Custom (numpy/scipy): bootstrap CIs, McNemar | Anthropic's Statistical Approach methodology |
+ +
+We deliberately do not migrate away from Langfuse to LangSmith or Braintrust. The marginal UX wins do not justify a closed-source migration for a small team given Langfuse's current feature set. +
+ +

6.4 Two-tier eval cadence

+ +
+
Per-PR (fast)
3–5 min
Promptfoo YAML, ~30 asserts, blocking PR check, comment with diff
+
Nightly (slow)
~30–60 min
Langfuse experiment, 100–200 dataset items, LLM-as-judge, pass^5
+
Weekly (capability)
~2 hours
Inspect AI on held-out fixtures, SWE-bench-style on Fix, drift report
+
+ +

6.5 Statistical discipline

+ +

Every release, the eval layer must produce: mean ± 95% CI on the headline metric; per-stage mean ± std across the five runs behind pass^5; paired-comparison delta vs. the previous release (McNemar / paired bootstrap); and a variance decomposition (sampling, prompt, judge, data).

+ +

Releases ship only if: all deterministic regression assertions pass; the composite score is within 3% of baseline (or shows a statistically significant improvement); and no individual stage shows a statistically significant regression of more than 5%.
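The paired comparison and the composite-score gate can be sketched with an exact McNemar test, which needs nothing beyond the standard library. Several things here are assumptions: per-example results are booleans aligned by index between baseline and candidate, the "within 3%" rule is read as 3 absolute percentage points, and `release_gate` is a hypothetical name. It covers only the composite-score condition, not the per-stage rule.

```python
from math import comb


def mcnemar_exact(baseline: list, candidate: list) -> float:
    """Two-sided exact McNemar p-value on paired pass/fail results."""
    lost = sum(1 for b, c in zip(baseline, candidate) if b and not c)    # regressions
    gained = sum(1 for b, c in zip(baseline, candidate) if c and not b)  # fixes
    n = lost + gained  # only discordant pairs carry information
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(lost, gained) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def release_gate(baseline: list, candidate: list,
                 max_drop: float = 0.03, alpha: float = 0.05) -> dict:
    """Ship if the pass rate did not drop by more than max_drop and did not
    regress significantly on the paired test."""
    delta = sum(candidate) / len(candidate) - sum(baseline) / len(baseline)
    p_value = mcnemar_exact(baseline, candidate)
    significant_regression = delta < 0 and p_value < alpha
    return {
        "delta": delta,
        "p_value": p_value,
        "ship": delta >= -max_drop and not significant_regression,
    }
```

Because the test looks only at examples where the two versions disagree, it is far more sensitive than comparing two independent pass rates, which is the point of the paired-comparison rule in 2.7.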

+ +

6.6 Improvement cadence

+ + + + + + + + + + +
| Cadence | Activity |
|---|---|
| Continuous | Trace every production PR; LLM pre-label; human review of low-confidence and dismissed-by-reviewer cases |
| Weekly (1 day) | Pick the top error bucket; implement smallest fix; gate eval; ship if green |
| Monthly | Re-cluster the error taxonomy from the last month's traces; refresh the few-shot index |
| Quarterly | Refresh judge calibration set against a fresh human-rated sample (50–100 traces); refresh hold-out fixture repos |
| Per major version | Re-run end-to-end against the entire historical eval set; publish a release report with score deltas |
+ +

6.7 Generalizing to other projects

+ +

The same approach generalizes to other tool-calling / workflow agentic projects. The pattern, abstracted from QualOps:

+
    +
  1. Define stages and the artifact each produces.
  2. Pick a primary eval technique per stage based on artifact shape.
  3. Build a 50–200 item curated golden set from real production traces.
  4. Wire a per-stage tracing layer with consistent span semantics (OpenInference provides OTEL-based semantic conventions for this).
  5. Build a two-tier eval: fast deterministic gate per change; slow LLM-judge / capability eval per night or week.
  6. Run the error analysis loop weekly. Maintain the taxonomy as a living document.
  7. Promote prompts and skills as versioned code; gate releases on paired statistical comparison.
+ +

7 · Prerequisites and adoption roadmap

+ +

7.1 Prerequisites

+ + + + + + + + + + + + + + + + + + + + + + +
| Prerequisite | Status today | Action |
|---|---|---|
| Trace storage with span / observation primitives | Have (Langfuse) | None |
| Per-stage tracing in code (Analyze/Review/Fix/Report/Judge as distinct spans) | Mostly have | Audit; ensure consistent span names + attributes |
| Versioned prompts in repo | Have (evals/qualopsrc/) | None |
| Prompt-as-code promotion infra (content hashes, dev → staging → prod gates) | Partial | Add explicit promotion workflow + version pinning |
| A starter golden set of real PRs with labels | Partial (CRB datasets exist, internal labels TBD) | Label 50 internal PRs |
| Held-out / contamination control (split mgmt, fresh fixtures) | Don't have | Stand up split policy; rotate fixtures via SWE-bench Live monthly drops |
| LLM-as-judge wiring with binary rubrics | Have (Judge stage) | Add cross-model judge variant |
| Cross-model judge access (GPT-5 + Claude Opus) | Don't have | Procure GPT-5 API credentials + budget headroom (~$500/mo) |
| Ongoing human calibration label capacity (50–100 traces/quarter) | Don't have | Designate annotators; lightweight labeling tooling |
| CI runner with secrets for LLM API calls | Have | None |
| Per-stage tool-call F1 scorer | Don't have | Implement (~1 week) |
| SWE-bench-style harness for Fix stage | Don't have | Implement (~2 weeks); seed from SWE-bench Live + internal PRs |
| Agent-as-judge on Report stage | Don't have | Implement (~1 week) |
| Promptfoo per-PR gate | Don't have | Add Promptfoo + GitHub Action (~3 days) |
| Inspect AI nightly / weekly | Don't have | Add Inspect AI + Agent Bridge (~1 week) |
| Statistical comparison framework | Don't have | Adopt Anthropic's recipe (~3 days) |
| Ownership: who runs the eval loop? | TBD | Designate a part-time eval lead |
+ +

7.2 Adoption roadmap (~3 months)

+ +
+gantt + title QualOps eval program — phased rollout + dateFormat YYYY-MM-DD + axisFormat %b %d + + section Phase 1 Foundations + Audit per-stage tracing :p1a, 2026-05-12, 5d + Label 50 internal PRs :p1b, after p1a, 10d + Implement tool-call F1 scorer :p1c, after p1a, 7d + Schema validators :p1d, after p1c, 3d + Stat comparison helpers :p1e, after p1d, 3d + + section Phase 2 CI gate + Promptfoo + GitHub Action :p2a, after p1e, 5d + 30 per-stage assertions :p2b, after p2a, 5d + PR-comment diff :p2c, after p2b, 2d + Soft-gate dry run, then enforce :p2d, after p2c, 5d + + section Phase 3 Fix + judge + SWE-bench-style harness for Fix :p3a, after p2d, 10d + Mine SWE-bench Verified :p3b, after p3a, 5d + Agent-as-judge on Report :p3c, after p3b, 7d + Cross-model judge wiring :p3d, after p3c, 3d + + section Phase 4 Capability eval + Inspect AI + Agent Bridge :p4a, after p3d, 5d + Held-out fixture repo set :p4b, after p4a, 5d + Weekly drift dashboard :p4c, after p4b, 2d + + section Phase 5 Improvement + First error-analysis pass :p5a, after p4c, 5d + First taxonomy + priorities :p5b, after p5a, 3d + Weekly improvement loop :p5c, after p5b, 30d +
+ +

7.3 Effort and cost estimate

+ + + + + + + + + + + +
| Phase | Engineering effort | Recurring cost (LLM tokens / month) |
|---|---|---|
| 1: Foundations | ~3 weeks | ~$200 |
| 2: CI gate | ~2 weeks | ~$300 |
| 3: Fix harness + judge | ~3.5 weeks | ~$1,500 |
| 4: Capability eval | ~1.5 weeks | ~$2,000 |
| 5: Improvement loop | ~1 day/week ongoing | ~$500 |
| Total | ~10 weeks | ~$4,500/mo steady-state |
+ +

8 · Risks, open questions, and what we left out

+ +

8.1 Risks

+ +
    +
  • Eval set leakage. If the same PRs feed both eval and SFT, the score is meaningless. Mitigation: strict held-out splits; SWE-bench Live for fresh data.
  • +
  • Judge drift. As both judge and judged models update, judge scores drift. Mitigation: refresh human-labeled calibration set quarterly.
  • +
  • Reward hacking on the Fix harness. A patch that monkey-patches pytest to skip tests. Mitigation: PASS_TO_PASS check; code-quality grader on the diff itself.
  • +
  • Overfitting prompts to the eval set. Especially with prompt optimizers. Mitigation: held-out validation set; rotate eval set periodically.
  • +
  • Cost overruns. Weekly capability evals over 100+ fixtures add up. Mitigation: routing — Haiku for broad eval, escalate to Sonnet/Opus for low-confidence cases; cap turn budgets.
  • +
  • Self-preference in same-model judge. Mitigation: cross-model judge, ideally on a different family.
  • +
+ +

8.2 Open questions

+ +
    +
  • Are LLM judges good enough as the only signal? Production teams hedge by combining judges with periodic human review.
  • +
  • Process supervision vs outcome supervision for training data. Process wins for math; how to label process at scale for fuzzy domains like code review remains open.
  • +
  • Benchmark validity. Recent audits show many published benchmarks have leakage, mis-graded items, or task-validity problems.
  • +
  • Calibration for tool-using agents specifically. Most calibration work targets factual QA. QualOps may need to invent its own severity-calibration methodology.
  • +
+ +

8.3 What we deliberately left out

+ +
    +
  • Online RLHF / continuous self-tuning in production. Out of scope; we deploy fixed versions.
  • +
  • Human evaluation infrastructure beyond a calibration set. A separate project.
  • +
  • Compliance / regulatory eval. If QualOps targets regulated industries, an additional eval layer will be needed.
  • +
  • Prompt-injection / jailbreak red-teaming at scale. Promptfoo includes a basic suite; full adversarial robustness is its own program.
  • +
+ +

9 · Appendix

+ +

9.1 Glossary

+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Term | Definition |
|---|---|
| Agent-as-judge | Using an LLM with tool access to evaluate another agent's output |
| AST match | Comparing tool calls structurally as parsed trees |
| BFCL | Berkeley Function-Calling Leaderboard |
| Calibration / ECE | How well a model's confidence matches empirical accuracy |
| DPO / KTO | Direct Preference Optimization / Kahneman-Tversky Optimization |
| DSPy | Stanford NLP framework treating LLM workflows as programs |
| FAIL_TO_PASS / PASS_TO_PASS | SWE-bench's two test sets |
| G-Eval | LLM-judge with chain-of-thought rubric prompting |
| Golden trace / golden set | Curated reference traces used as the eval baseline |
| HELM | Stanford's holistic evaluation framework |
| Inspect AI | UK AISI's research-grade Python eval framework |
| LLM-as-judge | Using an LLM to score another LLM's output |
| MIPROv2 | DSPy's joint instruction + few-shot optimizer |
| Open / axial coding | Qualitative-research method for building error taxonomies |
| OpenInference | OTEL-based semantic conventions for LLM/agent traces |
| pass@k vs pass^k | Succeed at least once in k trials vs. succeed every time |
| Process Reward Model (PRM) | Model that scores each reasoning step |
| Promptfoo | MIT-licensed CLI/library for prompt and agent eval |
| ReAct | Reason + Act loop pattern for agents |
| SWE-bench | Code-agent benchmark based on real GitHub issues + unit tests |
| τ-bench | Sierra's tool-agent-user multi-turn benchmark; introduced pass^k |
| Trajectory | Ordered record of (state, action, observation) for an agent run |
+ +

9.2 Top recommended reads

+ +
    +
  1. Anthropic, Demystifying evals for AI agents
  2. Hamel Husain & Shreya Shankar, LLM Evals: Everything You Need to Know
  3. Hamel Husain, A Field Guide to Rapidly Improving AI Products
  4. Zheng et al. 2023, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  5. Yao et al. 2024, τ-bench
  6. Jimenez et al. 2023, SWE-bench
  7. Zhuge et al. 2024, Agent-as-a-Judge
  8. Anthropic, Building Effective Agents
  9. Anthropic, Writing tools for agents
  10. Anthropic, Effective context engineering for AI agents
+ +

9.3 Companion files

+ + + +
+QualOps Research · May 8, 2026 · Synthesised from four primary-source dossiers totaling ~21,500 words. Built with mermaid for diagrams. Print-friendly: use your browser's print-to-PDF. +
+ +
+
+ + + +
diff --git a/agent-evaluation-research/sources/01-foundations.md b/agent-evaluation-research/sources/01-foundations.md
new file mode 100644
index 00000000..6524a239
--- /dev/null
+++ b/agent-evaluation-research/sources/01-foundations.md
@@ -0,0 +1,355 @@
+# Foundations of LLM Agent Evaluation
+
+*Research dossier for the QualOps internal report, Section 01: Foundations*
+*Compiled May 8, 2026*
+
+## Executive Summary
+
+Evaluating an LLM-based agent is qualitatively different from evaluating a single LLM completion. Where classical LLM evals score one model output against a reference, agent evals must judge a *trajectory* (a multi-step sequence of plans, tool calls, observations, and self-corrections) under partially-observable, stochastic execution. The field has converged on a layered taxonomy: **component-level** evals (does the retriever / tool-call / sub-prompt do its job?), **trajectory-level** evals (is the reasoning path valid, efficient, and faithful?), and **end-to-end / outcome-level** evals (did the agent solve the task?). Around this skeleton, a quality framework has emerged covering accuracy, faithfulness, completeness, robustness, calibration, latency, cost, safety, and determinism. The evaluation lifecycle blends offline golden datasets, regression suites, and CI gates with online monitoring and drift detection. **LLM-as-judge** (Zheng et al. 2023) has become the dominant cheap-and-scalable method, but its biases (position, verbosity, self-preference) and the recent rise of **process reward models** and **agent-as-judge** systems are reshaping how teams measure quality. For a tool-using, multi-stage code-review pipeline like QualOps, the practical implications are: invest early in trajectory + tool-call F1 metrics, build a small (50-200) curated golden set of real PRs, gate releases on paired statistical comparisons, and run an LLM-judge fleet with debiasing guardrails. This document surveys the academic foundations underpinning all of those choices.
+
+---
+
+## 1. Definitions and Taxonomy
+
+### 1.1 Agent eval vs. LLM eval
+
+A classical **LLM eval** treats the model as a function `f(prompt) -> completion` and scores the completion against a reference (BLEU, ROUGE, exact match) or a rubric. The unit of evaluation is one I/O pair.
+
+An **agent eval** treats the agent as a stateful policy `π` that interacts with an environment via tools. The unit of evaluation is a **trajectory** `τ = (s₀, a₀, o₀, s₁, a₁, o₁, …, sₙ)` where each `sᵢ` is a state (context + memory), `aᵢ` an action (tool call or final answer), and `oᵢ` an observation. Every benchmark surveyed agrees that agent eval requires assessing not just the terminal answer but the *path* taken to reach it ([SAP-Samples KDD 2025 tutorial](https://sap-samples.github.io/llm-agents-eval-tutorial/)). The Anthropic engineering team frames the same shift as moving from "single-output grading" to "behavior verification across many turns" ([Anthropic, "Demystifying evals for AI agents"](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)).
+
+### 1.2 The three layers
+
+Modern agent-eval taxonomies (e.g., the survey of [Yu et al. 2025, "Evaluation and Benchmarking of LLM Agents"](https://arxiv.org/html/2507.21504v1) and the LangChain framework) recognize three nested layers:
+
+| Layer | Question | Typical metric |
+|---|---|---|
+| **Component-level** | Does each sub-skill (retriever, planner, single tool call, sub-agent) work in isolation? | Tool-match rate, parameter F1, retrieval recall@k |
+| **Trajectory-level** | Is the path of reasoning + actions valid, efficient, and faithful? | Plan correctness, trajectory edit distance, step-level reward (PRM), tool-call F1 over the sequence |
+| **Outcome / end-to-end** | Did the agent achieve the user goal? | Task success, unit-test pass rate (SWE-bench), human rating |
+
+LangChain's documentation makes this explicit: "trajectory evaluators look at intermediate steps; output evaluators look only at the final response" ([LangChain trajectory eval docs](https://docs.langchain.com/langsmith/trajectory-evals)).
+
+### 1.3 Other axes
+
+- **Single-turn vs. multi-turn.** Single-turn evals score one user-input → agent-output pair. Multi-turn evals score a conversation or a tool-using session. MT-Bench was the first widely-cited multi-turn LLM eval ([Zheng et al. 2023](https://arxiv.org/abs/2306.05685)).
+- **Online vs. offline.** Offline evals run on a frozen dataset before deployment; online evals run on live traffic and feed back into monitoring ([Langfuse, LLM Eval 101](https://langfuse.com/blog/2025-03-04-llm-evaluation-101-best-practices-and-challenges)). They are complementary: "offline tests catch regressions before code reaches staging; online monitors surface drift, abuse, and cost spikes" ([Label Studio, online vs. offline](https://labelstud.io/learningcenter/offline-evaluation-vs-online-evaluation-when-to-use-each/)).
+- **Reference-based vs. reference-free.** Reference-based metrics (BLEU, exact match, unit-test pass) compare against a known-good answer. Reference-free metrics (LLM-judge rubrics, faithfulness scores, perplexity) require no gold answer, which is essential when ground truth is expensive or undefined ([Eugene Yan, "LLM-Evaluators"](https://eugeneyan.com/writing/llm-evaluators/)).
+
+---
+
+## 2. Core Dimensions of Agent Quality
+
+Stanford's HELM framework canonicalized seven core metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) and showed that scoring on accuracy alone hides serious failure modes ([HELM, Liang et al. 2022](https://arxiv.org/abs/2211.09110)). Below are the dimensions that matter most for an agentic code-review system.
+ +| Dimension | Definition | How it's typically measured | +|---|---|---| +| **Accuracy / task success** | Did the agent produce the correct outcome? | Exact match, unit tests pass (SWE-bench style), human rating | +| **Faithfulness / groundedness** | Are claims supported by the retrieved/observed context? | Claim-level NLI vs. context, RAGAS faithfulness ([Ragas docs](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/)) | +| **Completeness** | Did the agent address all parts of the request? | Aspect-coverage rubric, recall on a checklist of expected findings | +| **Helpfulness** | Was the response actionable and useful to the user? | Pairwise human preference, LLM-judge rubric | +| **Robustness** | Stable under prompt perturbation, adversarial input, distribution shift? | Performance under paraphrase / typo / adversarial prompt suites | +| **Calibration** | Do confidence scores match empirical accuracy? | ECE (expected calibration error), Brier score; verbalized vs. logit-based confidence ([Geng et al. 2025 survey](https://arxiv.org/abs/2503.15850)) | +| **Latency** | Wall-clock time per task / step | p50/p95/p99 latency, time-to-first-token | +| **Cost** | $ per task | Tokens × price + tool-call costs | +| **Safety** | Avoidance of harmful, biased, or policy-violating outputs | Red-team pass rate, toxicity classifiers ([Perez et al. 2022](https://arxiv.org/abs/2202.03286)) | +| **Determinism / consistency** | Same input → same output (or stable distribution)? | Output variance across N samples at T=0; self-consistency rate ([Wang et al. 2022](https://arxiv.org/abs/2203.11171)) | + +Two notes specific to the QualOps Analyze→Review→Fix→Report→Judge pipeline: + +1. **Faithfulness is the dominant dimension for code review.** A "hallucinated" finding (a vulnerability that doesn't exist in the diff) is more damaging than a missed one because it erodes reviewer trust. 
Faithfulness for code agents = "every claim in the report is grounded in actual lines of the diff or repo." The RAG literature's faithfulness metric ([deepset blog](https://www.deepset.ai/blog/rag-llm-evaluation-groundedness)) generalizes: extract atomic claims, verify each against the source. +2. **Calibration matters for triage.** If QualOps emits a severity label, ECE quantifies whether "high" findings are actually higher-priority. LLMs are systemically overconfident under verbalized prompting ([Geng et al. 2025](https://arxiv.org/abs/2503.15850)); consistency-based confidence (sample N, measure agreement) is more reliable. + +--- + +## 3. The Evaluation Lifecycle + +Hamel Husain's widely-cited guides describe the evals loop that mature teams converge on ([Husain, "Your AI Product Needs Evals"](https://hamel.dev/blog/posts/evals/); [Husain & Shankar, "LLM Evals: Everything You Need to Know"](https://hamel.dev/blog/posts/evals-faq/)): + +``` + +-----------------+ +-------------------+ +------------------+ + | 1. Error | --> | 2. Codify failure | --> | 3. Add eval to | + | analysis on | | modes as | | golden set | + | real logs | | rubric items | | | + +-----------------+ +-------------------+ +------------------+ + ^ | + | v + +-----------------+ +-------------------+ +------------------+ + | 6. Online | <-- | 5. Ship + monitor | <-- | 4. Run regression| + | drift / | | in production | | suite in CI; | + | sample-judge | | | | block on regr.| + +-----------------+ +-------------------+ +------------------+ +``` + +Key tactics endorsed across primary sources: + +- **Start small.** Anthropic's engineering team writes that "20-50 simple tasks drawn from real failures is a great start" ([Anthropic, demystifying evals](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)). Simon Willison echoes: "if you're passing 100% of your evals, you're not challenging your system enough" ([Willison, evals tag](https://simonwillison.net/tags/evals/)). 
+- **Eval-driven development.** Treat evals like unit tests: write them first, fail them, then build the change that passes them. The O'Reilly "What We Learned from a Year of Building with LLMs" team (Yan, Bischof, Frye, Husain, Liu, Shankar) describes evals as a "data flywheel" — every production failure becomes a new eval row ([O'Reilly Part I](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/)). +- **Golden datasets are curated, not crawled.** They should reflect real production distribution, include known failure cases, and cover edge cases discovered in error analysis. For QualOps this means: 50-200 real PRs spanning languages, sizes, change types, and labeled failure modes. +- **Regression tests on every PR.** Run the full eval suite in CI on each prompt or code change. Block merges on stat-sig regression of any axis. +- **A/B comparisons are paired and statistical.** See section 7. +- **Online monitoring** samples production traffic (5-10% is a common heuristic) and runs an LLM-judge fleet asynchronously to flag drift ([Langfuse 2025](https://langfuse.com/blog/2025-03-04-llm-evaluation-101-best-practices-and-challenges)). Drift signals: rising malformed-output rate, latency creep, judge-score distribution shift. + +--- + +## 4. LLM-as-Judge + +### 4.1 The seminal work + +[Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (NeurIPS 2023)](https://arxiv.org/abs/2306.05685) introduced two artifacts that became the standard: + +- **MT-Bench**: 80 multi-turn questions across 8 categories, scored 1-10 by GPT-4 acting as judge. +- **Chatbot Arena**: a crowdsourced pairwise battle platform with Bradley-Terry / Elo ratings. + +Their headline empirical finding: GPT-4 as judge agreed with human preference at 80%+ — roughly the same as inter-human agreement. This legitimized LLM-judge as a primary evaluation method. 
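
As a hedged illustration of how a judge-versus-human agreement figure like the one above could be checked on an internal sample, the sketch below computes raw agreement for paired verdicts, plus Cohen's kappa as a chance-corrected companion. The function names and the toy labels are assumptions made for this example; they are not taken from Zheng et al., and kappa is an addition that the paper's headline number does not use.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge verdict equals the human verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(judge_counts[k] * human_counts[k] for k in labels) / (n * n)
    if expected == 1.0:  # both raters always emit the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy pairwise verdicts ("A" or "B" wins) on ten hypothetical comparisons.
judge_votes = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
human_votes = ["A", "B", "B", "B", "A", "B", "A", "A", "A", "A"]

print(round(agreement_rate(judge_votes, human_votes), 2))  # 0.8
print(round(cohens_kappa(judge_votes, human_votes), 3))    # 0.583
```

Raw agreement is the number usually quoted, but it can look high when one label dominates; reporting kappa alongside it is one way to guard against that.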
+ +### 4.2 Known biases and mitigations + +The same paper, and a flood of follow-ups, identify recurring failure modes: + +| Bias | What happens | Common mitigation | +|---|---|---| +| **Position bias** | Judge prefers whichever answer appears first (or second) regardless of content | Swap order, score both, average; use "robustness rate" metric ([Shi et al. 2024, position bias study](https://arxiv.org/html/2406.07791v9)) | +| **Verbosity bias** | Longer answers rated higher | Constrain length in rubric; normalize for length | +| **Self-preference / self-enhancement** | Judge prefers outputs from its own family | Use a different model as judge; ensemble across providers ([Panickssery et al. 2024](https://arxiv.org/abs/2410.21819)) | +| **Familiarity / low-perplexity bias** | Judge favors text it would have generated itself | Down-weight low-perplexity samples | +| **Sycophancy / sentiment bias** | Judge follows hints in the prompt about which is "better" | Blind the judge to source | +| **Fallacy oversight** | Judge accepts confident-sounding wrong reasoning | Require step-by-step grading; use process supervision | + +The CALM framework ([Ye et al. 2024, "Justice or Prejudice"](https://arxiv.org/html/2410.02736v1)) catalogs 12 distinct biases and shows that even GPT-4 fails to be position-consistent on ~30% of comparisons. + +### 4.3 Pairwise vs. pointwise + +- **Pointwise** (a.k.a. single-answer grading): judge scores one answer on a rubric. Cheap, parallelizable, but suffers from anchoring and rubric drift. +- **Pairwise**: judge picks the better of two. More aligned with human preference, easier to calibrate, but O(N²) for full ranking — usually solved by Bradley-Terry on sampled pairs. +- **Reference-based** (judge sees ground truth): highest agreement with humans, but requires gold answers. 
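
The order-swapping mitigation for position bias can be sketched for the pairwise setting as follows. This is a minimal sketch under stated assumptions: `Judge` is a hypothetical callable standing in for a real LLM call, and treating a flipped verdict as a tie is one common variant of "score both orders", not the only one (averaging scores is another).

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: (question, first_answer, second_answer)
# returns "first" or "second". A real judge would call an LLM here.
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings; return "A", "B", or "tie" if the verdict flips."""
    a_wins_when_first = judge(question, answer_a, answer_b) == "first"
    a_wins_when_second = judge(question, answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"  # verdict depended on position, so it is not trusted


def position_consistency(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Share of (question, answer_a, answer_b) items with an order-stable verdict."""
    stable = sum(debiased_pairwise(judge, q, a, b) != "tie" for q, a, b in items)
    return stable / len(items)


# Two stub judges for illustration only.
def always_first(question: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def longer_wins(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"  # verbosity bias


pairs = [("q1", "short", "a much longer answer"), ("q2", "medium text", "x")]
print(debiased_pairwise(always_first, *pairs[0]))  # tie
print(debiased_pairwise(longer_wins, *pairs[0]))   # B
print(position_consistency(always_first, pairs))   # 0.0
```

Note that the `longer_wins` stub is perfectly position-consistent while still being biased: swapping order only addresses position bias, so verbosity and self-preference need their own controls.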
+ +Husain's "Using LLM-as-a-Judge" guide recommends **binary pass/fail rubrics** over Likert scales for production judges, because binary judges are easier to calibrate against humans and easier to debug ([Husain, LLM Judge guide](https://hamel.dev/blog/posts/llm-judge/)). + +### 4.4 Specialized judge models + +- **G-Eval** ([Liu et al. 2023](https://arxiv.org/abs/2303.16634)): use GPT-4 with chain-of-thought rubrics and form-filling. 0.514 Spearman with humans on summarization — beats prior NLG metrics. +- **Prometheus** ([Kim et al. 2024](https://github.com/prometheus-eval/prometheus-eval)): open-weight Llama-2-13B fine-tuned on 100K GPT-4-generated rubric grades. Designed for fine-grained, rubric-driven evaluation. +- **JudgeLM** (7B/13B/33B): trained on a high-quality preference dataset with explicit bias mitigation in fine-tuning. +- **Auto-J**: generative evaluator that returns both score and critique across many task scenarios. + +A cautionary follow-up — [Huang et al. 2024, "An Empirical Study of LLM-as-a-Judge"](https://arxiv.org/html/2403.02839) — found that fine-tuned judges (JudgeLM, PandaLM, Auto-J, Prometheus) achieve high in-domain scores but fail to generalize, lagging GPT-4 on fairness, generalization, and aspect-specific evaluation. **For now, frontier models remain the most reliable judges.** + +### 4.5 When LLM-judge fails + +Eugene Yan's survey of two dozen judge papers ([Yan, "Evaluating LLM-Evaluators"](https://eugeneyan.com/writing/llm-evaluators/)) flags the cases where LLM-judge is unreliable: + +- Tasks requiring deep domain expertise (medicine, law, security) the judge model lacks. +- Tasks where the judge would have to do work harder than the generator (e.g., judging a math proof when the judge can't do the math). +- Highly subjective / aesthetic tasks where humans disagree among themselves. + +``` + +------------------+ + | Need a judgement | + +--------+---------+ + | + +-------------+-------------+ + | | + Have ground truth? 
No ground truth + | | + v v + +--------------+ +-----------------------+ + | Programmatic | | Is task within | + | check (tests,| | judge model's domain? | + | exact match) | +-----------+-----------+ + +--------------+ | + +----------+----------+ + | | + v v + +---------------+ +-------------------+ + | LLM-judge OK; | | Need human review | + | calibrate vs. | | or specialist | + | humans, watch | | judge (PRM, agent | + | for biases | | -as-judge) | + +---------------+ +-------------------+ +``` + +--- + +## 5. Trajectory and Process Evaluation + +### 5.1 Why outcome-only is insufficient + +A code-review agent could produce a correct final report by accident — having issued ten irrelevant tool calls and reasoned incorrectly along the way. Outcome metrics miss this. Process evaluation asks: *was every intermediate step justified?* + +### 5.2 Process Reward Models (PRMs) + +OpenAI's [Lightman et al. 2023, "Let's Verify Step by Step"](https://arxiv.org/abs/2305.20050) demonstrated that **process supervision** (label each reasoning step correct/incorrect) outperforms **outcome supervision** (label only the final answer) for training reward models on math problems. The process-supervised model solved 78% of MATH (vs. ~70% with outcome supervision). They released the [PRM800K dataset](https://github.com/openai/prm800k) of 800K step-level human labels. + +Recent advances: + +- [**The Lessons of Developing PRMs in Mathematical Reasoning** (Zheng et al. 2025)](https://arxiv.org/abs/2501.07301): documents what scales (data quality, step granularity) and what doesn't. +- [**R-PRM: Reasoning-Driven Process Reward Modeling** (Wu et al. 2025)](https://arxiv.org/abs/2503.21295): generative PRM that produces a rationale before scoring; +11.9 F1 on ProcessBench, +8.5 on PRMBench. +- [**Process Reward Models That Think** (Khalifa et al. 
2025)](https://arxiv.org/abs/2504.16828): "ThinkPRM" verifies each step with an explicit verification CoT, reaching strong performance with orders-of-magnitude less labeled data. + +### 5.3 Trajectory-level metrics for tool-using agents + +LangChain's `agentevals` package and Arize's trajectory eval docs codify a practical metric set ([agentevals GitHub](https://github.com/langchain-ai/agentevals); [Arize trajectory docs](https://arize.com/docs/ax/evaluate/evaluators/trace-and-session-evals/trace-level-evaluations/agent-trajectory-evaluations)): + +- **Trajectory match (strict / unordered / superset)**: compare actual tool-call sequence to a reference. +- **Tool-call F1**: precision/recall on the multiset of (tool_name, key_args) tuples. +- **Plan correctness**: did the agent decompose the task correctly? (Often LLM-judged.) +- **Step-level grounding**: each step's reasoning is supported by prior context/observations. +- **Efficiency**: number of steps / tool calls relative to optimal; redundant-call rate. +- **Recovery**: did the agent recover from a tool error? + +[**TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents** (2026)](https://arxiv.org/html/2602.21230v1) and the [Holistic Agent Leaderboard (2025)](https://arxiv.org/pdf/2510.11977) extend these into multi-dimensional rubrics covering completeness, faithfulness, and exploration breadth. + +### 5.4 ReAct-trace specifics + +For agents using the ReAct pattern (Thought → Action → Observation loop), evaluation typically scores: (a) whether each Thought is logically derived from prior Observations, (b) whether the Action follows from the Thought, and (c) whether the loop terminates appropriately. Lilian Weng frames the two dominant ReAct failure modes as **inefficient planning** (long trajectory, no convergence) and **hallucination** (consecutive identical actions yielding the same observation) ([Weng, "LLM-Powered Autonomous Agents"](https://lilianweng.github.io/posts/2023-06-23-agent/)). 
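
The tool-call F1 metric listed in section 5.3 can be sketched as a multiset comparison. This is an illustrative sketch, not the `agentevals` or Arize implementation: the `normalize` helper, the choice of key arguments, and the tool names are all assumptions made for the example.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

# A call is reduced to (tool_name, sorted key-argument pairs) so it is hashable.
ToolCall = Tuple[str, Tuple[Tuple[str, Any], ...]]


def normalize(name: str, args: Dict[str, Any], key_args: Iterable[str]) -> ToolCall:
    """Keep only the arguments that matter for matching; ignore the rest."""
    kept = tuple(sorted((k, args[k]) for k in key_args if k in args))
    return (name, kept)


def tool_call_f1(predicted: List[ToolCall], reference: List[ToolCall]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over the multiset of normalized tool calls."""
    pred, ref = Counter(predicted), Counter(reference)
    # Multiset intersection: a repeated call only matches as often as it
    # appears in the reference, so redundant calls lower precision.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0  # also covers the both-empty case
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical trajectory for a code-review task.
reference = [
    normalize("read_file", {"path": "a.py"}, ["path"]),
    normalize("run_tests", {}, []),
]
predicted = [
    normalize("read_file", {"path": "a.py", "encoding": "utf-8"}, ["path"]),
    normalize("read_file", {"path": "a.py"}, ["path"]),  # redundant repeat
    normalize("grep", {"pattern": "TODO"}, ["pattern"]),
]

precision, recall, f1 = tool_call_f1(predicted, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.333 0.5 0.4
```

Because this variant ignores ordering, it corresponds to the unordered flavour of trajectory match; a strict-order comparison would need a sequence alignment instead of a `Counter`.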
+ +--- + +## 6. Important Benchmarks + +Conceptual coverage only — deep tool-calling/coding benchmarks (SWE-bench Verified, Aider, etc.) are covered in a sister document. + +| Benchmark | What it measures | Methodological contribution | +|---|---|---| +| [**HELM** (Liang et al. 2022)](https://arxiv.org/abs/2211.09110) | 16 scenarios × 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) on foundation models | Top-down "scenarios × metrics" matrix; standardized prompting; full transparency of raw completions | +| [**BIG-bench** (Srivastava et al. 2022)](https://arxiv.org/abs/2206.04615) | 204 diverse tasks contributed by 450 authors | Crowdsourced, programmatic + JSON tasks, focus on tasks "beyond current capability" | +| [**MMLU-Pro** (Wang et al. 2024)](https://arxiv.org/abs/2406.01574) | Reasoning-focused multi-task understanding; 12K questions, 14 subjects, 10 options | Reduces prompt sensitivity from 4-5% (MMLU) to 2%; CoT actually helps (unlike on MMLU) | +| [**AgentBench** (Liu et al. 2023, ICLR'24)](https://arxiv.org/abs/2308.03688) | LLM-as-Agent across 8 environments (OS, DB, KG, card games, web, etc.) | First multi-environment agent benchmark; exposed long-horizon reasoning as the bottleneck | +| [**GAIA** (Mialon et al. 2023)](https://arxiv.org/abs/2311.12983) | 466 real-world assistant tasks needing reasoning + multimodality + web + tools | Designed so questions are easy for humans (92%) but hard for AIs (15% for GPT-4 + plugins); 3 difficulty tiers | +| [**SWE-bench** (Jimenez et al. 2023, ICLR'24)](https://arxiv.org/abs/2310.06770) | 2,294 real GitHub issues across 12 Python repos; agent must produce a passing patch | Unit-test-as-truth (no LLM-judge needed); inspired SWE-bench Verified, SWE-bench Pro | +| [**MLAgentBench** (Huang et al. 
2023)](https://arxiv.org/abs/2310.03302) | 13 ML experimentation tasks; agent must improve a model end-to-end | Open-ended research task evaluation; ReAct framework baseline | +| [**τ-bench** (Yao et al. 2024, Sierra)](https://arxiv.org/abs/2406.12045) | Tool-Agent-User interaction in retail/airline domains; user simulated by LLM | First widely-adopted multi-turn tool benchmark with policy adherence; Pass^k metric | + +**Methodology lessons that transfer to QualOps:** + +- HELM's "scenarios × metrics" matrix is a useful template — define your QualOps scenarios (Python bugfix PRs, JS feature PRs, refactor PRs, security-sensitive PRs…) and grade each on the same 7-9 dimensions. +- SWE-bench's insight — *tests are ground truth* — applies directly: when QualOps fixes a bug, the existing test suite is the cheapest, most reliable judge. +- GAIA's tiered difficulty and human-baseline anchoring is a discipline against benchmark inflation. +- τ-bench's Pass^k (probability of passing on all k independent runs) is a strong reliability/determinism metric for stochastic agents. + +[**"Establishing Best Practices for Building Rigorous Agentic Benchmarks"** (Zhuge et al. 2025)](https://arxiv.org/pdf/2507.02825) argues many published benchmarks fail two basic validity checks: **outcome validity** (test failure ⇎ task failure — SWE-bench-Verified is flagged because incorrect patches sometimes pass tests) and **task validity** (a task is solvable iff the agent has the target capability). Internal benchmarks should explicitly audit both. + +--- + +## 7. Statistical Rigor + +Why "vibes-based evals" fail: with N=10 examples and a stochastic model, swing of ±20% in pass rate is normal noise. Cameron Wolfe's ["Applying Statistics to LLM Evaluations"](https://cameronrwolfe.substack.com/p/stats-llm-evals) walks through the core math. + +### 7.1 Sample size and CIs + +For binary pass/fail with sample mean p and N samples, the 95% CI half-width is roughly `1.96 × sqrt(p(1-p)/N)`. 
To distinguish 80% from 85% accuracy at 95% confidence, you need ~1000 samples. Most teams have far fewer; this is why **paired comparisons** matter: + +### 7.2 Paired comparisons + +Run model A and model B on the *same* set of examples; the per-example difference cancels per-example variance. McNemar's test (for binary outcomes) or paired bootstrap (for any metric) yields tight CIs even on N=50-100. + +### 7.3 Bradley-Terry / Elo for pairwise rankings + +When the metric is "which model wins this pair?", the Bradley-Terry model fits a latent skill rating per model from observed pairwise outcomes ([Wikipedia: Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model); [Stanford Stats 200 lecture notes](https://web.stanford.edu/class/archive/stats/stats200/stats200.1172/Lecture24.pdf)). + +- **Elo** updates ratings online with a learning rate; recent matches dominate. +- **Bradley-Terry** is the offline MLE; more stable, no recency bias ([Hippocampus's Garden, Elo vs BT](https://hippocampus-garden.com/elo_vs_bt/)). +- **Bootstrap CIs**: Chatbot Arena resamples the pairwise vote set 1000× and refits BT, producing 95% CIs on each rating ([Chatbot Arena paper, Chiang et al. 2024](https://arxiv.org/pdf/2403.04132)). + +### 7.4 Eval variance sources + +- **Sampling variance**: stochastic decoding (T>0). +- **Prompt variance**: small wording changes flip 5-20% of judgments ([MMLU-Pro analysis](https://arxiv.org/abs/2406.01574)). +- **Judge variance**: different judge models (or different runs of the same judge) disagree. +- **Data variance**: the eval set itself is a sample of the production distribution. + +A practical rule: **report all four** when you publish an eval result internally. The Holistic Agent Leaderboard formalizes this with explicit variance decomposition ([HAL 2025](https://arxiv.org/pdf/2510.11977)). + +--- + +## 8. 
Recent Academic Directions (2024-2026) + +### 8.1 Process reward models everywhere + +Beyond math, PRMs are being applied to coding agents, web agents, and multi-step retrieval. The trend is from discriminative classifiers toward **generative / reasoning PRMs** that produce a rationale ([R-PRM](https://arxiv.org/abs/2503.21295), [ThinkPRM / "Process Reward Models That Think"](https://arxiv.org/abs/2504.16828)). + +### 8.2 Self-consistency, self-refinement, debate + +- **Self-consistency** ([Wang et al. 2022](https://arxiv.org/abs/2203.11171)): sample N reasoning paths, take majority answer. +18 points on GSM8K. Doubles as a calibration signal. +- **Self-refinement / critique loops**: agent reviews and revises its own output. Risk of degradation if the critique is wrong. +- **Multi-agent debate** ([Irving et al. 2018](https://arxiv.org/abs/1805.00899); [Khan et al. 2024](https://arxiv.org/html/2603.05293)): two agents argue both sides; a (weaker) judge decides. Khan et al. show debate lets weaker judges accurately evaluate stronger debaters — a candidate scalable-oversight technique. +- **Doubly-Efficient Debate** ([Brown-Cohen et al. 2023](https://arxiv.org/abs/2311.14125)): theoretical guarantees on debate as alignment mechanism. + +### 8.3 Constitutional methods + +[Constitutional AI (Bai et al. 2022)](https://arxiv.org/abs/2212.08073) replaces human harmfulness labels with a written "constitution" the model uses to critique and revise its own outputs (RLAIF). Useful for *evaluation* too: the constitution doubles as a rubric. Critiques (digi-con, others) note its quality is bounded by the constitution's quality and that it may "embed subjective priorities" of the developer. + +### 8.4 Automated red-teaming + +[Perez et al. 2022, "Red Teaming Language Models with Language Models"](https://arxiv.org/abs/2202.03286) used an LM to generate adversarial prompts that elicit harms from a target LM, finding tens of thousands of failures in a 280B chatbot. 
OpenAI's [Diverse and Effective Red Teaming with Auto-generated Rewards](https://cdn.openai.com/papers/diverse-and-effective-red-teaming.pdf) extends this with RL and diversity rewards. For QualOps, the analog is **synthetic adversarial PRs** designed to elicit false positives or missed bugs. + +### 8.5 Simulation-based evaluation + +τ-bench's user simulator is the canonical example: an LLM plays the user, executing scripted goals, while the agent must follow domain policy. Simulations let you generate effectively unlimited multi-turn evaluation traffic at known ground truth ([Sierra τ-bench post](https://sierra.ai/blog/tau-bench-shaping-development-evaluation-agents)). + +### 8.6 Agent-as-judge + +The newest frontier. [**Agent-as-a-Judge: Evaluate Agents with Agents** (Zhuge et al. 2024)](https://arxiv.org/abs/2410.10934) replaces the LLM judge with an *agent* judge that can read code, run tools, and verify intermediate steps — closing the gap between "static text grading" and "dynamic behavior verification." They report agent-judge results approaching human reliability while costing far less. [**When AIs Judge AIs** (Yu et al. 2025)](https://arxiv.org/abs/2508.02994) surveys the rapid ecosystem evolution from single-LLM judges → multi-agent debate frameworks. [**A Survey on Agent-as-a-Judge** (2026)](https://arxiv.org/html/2601.05111v1) consolidates the methodology. + +This is highly relevant to QualOps: the "Judge" stage in QualOps's pipeline already *is* an agent-as-judge over the Review/Fix output. Drawing on this literature, key design choices include: (a) give the judge agent independent tool access (re-run tests, re-read source) rather than just the diff, (b) use a different model family for the judge to avoid self-preference, (c) calibrate against a periodic human-review sample. + +--- + +## Open Questions and Controversies + +1. **Are LLM judges good enough as the primary signal?** Zheng et al. say yes (>80% human agreement). 
The "fine-tuned judges fail to generalize" results ([Huang et al. 2024](https://arxiv.org/html/2403.02839)) say maybe not. Production teams hedge by combining judges with periodic human review. +2. **Process supervision vs. outcome supervision for training.** Process wins for math, but it is not yet clear how to label process at scale for fuzzy domains like code review (what's a "correct" intermediate step when reviewing a PR?). +3. **Benchmark validity.** Recent audits ([Zhuge et al. 2025](https://arxiv.org/pdf/2507.02825); [Berkeley RDI on trustworthy benchmarks](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/)) show many published benchmarks have leakage, mis-graded items, or task-validity problems. The "SWE-Bench Illusion" paper ([2025](https://arxiv.org/html/2506.12286v3)) argues SOTA models are partly memorizing, not reasoning. +4. **Self-preference and ecosystem effects.** As more training data is judge-generated, judges may drift toward systematic preferences that the next generation of models is trained to satisfy — a feedback loop with unclear long-term effects. +5. **Cost of rigor.** Bootstrap CIs, paired evals, multi-judge ensembles, and red-team suites are expensive. Teams routinely skip them; "vibe checks" remain common despite being known to fail. + +--- + +## References + +- [**Anthropic, "Demystifying evals for AI agents" (2026)**](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) — Engineering blog with practical agent-eval strategies, "20-50 tasks is enough to start" rule. +- [**Bai et al. 2022, "Constitutional AI: Harmlessness from AI Feedback" (arXiv:2212.08073)**](https://arxiv.org/abs/2212.08073) — Anthropic's foundational paper on RLAIF. +- [**Brown-Cohen et al. 2023, "Scalable AI Safety via Doubly-Efficient Debate" (arXiv:2311.14125)**](https://arxiv.org/abs/2311.14125) — Theoretical extension of debate as scalable oversight. +- [**Chiang et al. 
2024, "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference"**](https://arxiv.org/pdf/2403.04132) — Bradley-Terry methodology for crowdsourced LLM ranking with bootstrap CIs. +- [**Geng et al. 2025, "Uncertainty Quantification and Confidence Calibration in LLMs: A Survey" (arXiv:2503.15850)**](https://arxiv.org/abs/2503.15850) — Comprehensive survey of calibration methods (logit, sampling, verbalized). +- [**Huang et al. 2023, "MLAgentBench" (arXiv:2310.03302)**](https://arxiv.org/abs/2310.03302) — 13-task ML experimentation benchmark for AI research agents. +- [**Huang et al. 2024, "An Empirical Study of LLM-as-a-Judge" (arXiv:2403.02839)**](https://arxiv.org/html/2403.02839) — Shows fine-tuned judges (JudgeLM, Prometheus, Auto-J, PandaLM) underperform GPT-4 on generalization. +- [**Husain & Shankar, "LLM Evals: Everything You Need to Know" (Hamel's Blog, 2026)**](https://hamel.dev/blog/posts/evals-faq/) — Comprehensive FAQ from the AI Evals Maven course. +- [**Husain, "Your AI Product Needs Evals" (2024)**](https://hamel.dev/blog/posts/evals/) — Most-cited practitioner guide to building eval pipelines from scratch. +- [**Husain, "Using LLM-as-a-Judge for Evaluation" (2024)**](https://hamel.dev/blog/posts/llm-judge/) — Argues for binary pass/fail rubrics; debiasing tactics from 30+ deployments. +- [**Irving et al. 2018, "AI Safety via Debate" (arXiv:1805.00899)**](https://arxiv.org/abs/1805.00899) — Foundational paper on debate as alignment mechanism. +- [**Jimenez et al. 2023, "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (arXiv:2310.06770)**](https://arxiv.org/abs/2310.06770) — 2,294 real GitHub issues; unit tests as ground truth. ICLR 2024 oral. +- [**Khalifa et al. 2025, "Process Reward Models That Think" (arXiv:2504.16828)**](https://arxiv.org/abs/2504.16828) — Generative PRM with verification CoT; data-efficient. +- [**Khan et al. 
2024, "Knowledge Divergence and the Value of Debate for Scalable Oversight"**](https://arxiv.org/html/2603.05293) — Empirical evidence that debate helps weaker judges evaluate stronger debaters. +- [**Liang et al. 2022, "Holistic Evaluation of Language Models" (HELM, arXiv:2211.09110)**](https://arxiv.org/abs/2211.09110) — Stanford CRFM; canonical scenarios × metrics framework. +- [**Lightman et al. 2023, "Let's Verify Step by Step" (arXiv:2305.20050)**](https://arxiv.org/abs/2305.20050) — OpenAI's process-vs-outcome supervision study; PRM800K dataset. +- [**Liu et al. 2023, "G-Eval: NLG Evaluation using GPT-4" (arXiv:2303.16634)**](https://arxiv.org/abs/2303.16634) — CoT + form-filling rubric prompting; flagged self-preference bias for the first time. +- [**Liu et al. 2023, "AgentBench: Evaluating LLMs as Agents" (arXiv:2308.03688)**](https://arxiv.org/abs/2308.03688) — First multi-environment LLM-agent benchmark; ICLR'24. +- [**Mialon et al. 2023, "GAIA: A Benchmark for General AI Assistants" (arXiv:2311.12983)**](https://arxiv.org/abs/2311.12983) — Meta/HuggingFace; tiered tasks with human baseline of 92%, GPT-4 at 15%. +- [**Panickssery et al. 2024, "Self-Preference Bias in LLM-as-a-Judge" (arXiv:2410.21819)**](https://arxiv.org/abs/2410.21819) — Quantifies self-preference; ties bias to text familiarity / perplexity. +- [**Perez et al. 2022, "Red Teaming Language Models with Language Models" (arXiv:2202.03286)**](https://arxiv.org/abs/2202.03286) — DeepMind; automated adversarial prompt generation. +- [**Shi et al. 2024, "Judging the Judges: Position Bias" (arXiv:2406.07791)**](https://arxiv.org/html/2406.07791v9) — Systematic study of position bias and "robustness rate" metric. +- [**Srivastava et al. 2022, "Beyond the Imitation Game (BIG-bench)" (arXiv:2206.04615)**](https://arxiv.org/abs/2206.04615) — 204 collaborative tasks; programmatic + JSON formats. +- [**Wang et al.
2022, "Self-Consistency Improves Chain-of-Thought" (arXiv:2203.11171)**](https://arxiv.org/abs/2203.11171) — Sample-and-marginalize decoding; +18 GSM8K. Foundational for consistency-based confidence. +- [**Wang et al. 2024, "MMLU-Pro" (arXiv:2406.01574)**](https://arxiv.org/abs/2406.01574) — Reasoning-focused, prompt-robust replacement for MMLU. +- [**Weng, "Extrinsic Hallucinations in LLMs" (Lil'Log, 2024)**](https://lilianweng.github.io/posts/2024-07-07-hallucination/) — Defines in-context vs. extrinsic hallucination; survey of mitigation. +- [**Weng, "LLM-Powered Autonomous Agents" (Lil'Log, 2023)**](https://lilianweng.github.io/posts/2023-06-23-agent/) — Canonical reference on agent architectures and ReAct failure modes. +- [**Willison, "Evals" tag on simonwillison.net**](https://simonwillison.net/tags/evals/) — 37+ posts on practical eval engineering. +- [**Wu et al. 2025, "R-PRM: Reasoning-Driven PRM" (arXiv:2503.21295)**](https://arxiv.org/abs/2503.21295) — Generative PRM with rationales; +11.9 F1 on ProcessBench. +- [**Yan, "Evaluating the Effectiveness of LLM-Evaluators" (eugeneyan.com, 2024)**](https://eugeneyan.com/writing/llm-evaluators/) — Survey of two dozen LLM-judge papers; when LLM-judge fails. +- [**Yan, "Patterns for Building LLM-based Systems & Products" (eugeneyan.com, 2023)**](https://eugeneyan.com/writing/llm-patterns/) — Seven patterns including evals; widely cited reference. +- [**Yan et al., "What We Learned from a Year of Building with LLMs" Parts I-III (O'Reilly, 2024)**](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/) — Practitioner consolidation of evals, ops, and strategy lessons. +- [**Yao et al. 2024, "τ-bench: Tool-Agent-User Interaction" (arXiv:2406.12045)**](https://arxiv.org/abs/2406.12045) — Sierra; multi-turn tool benchmark with simulated users and Pass^k metric. +- [**Ye et al. 2024, "Justice or Prejudice? 
Quantifying Biases in LLM-as-a-Judge" (arXiv:2410.02736)**](https://arxiv.org/html/2410.02736v1) — Catalogs 12 biases; CALM evaluation framework. +- [**Yu et al. 2025, "Evaluation and Benchmarking of LLM Agents: A Survey" (arXiv:2507.21504)**](https://arxiv.org/html/2507.21504v1) — Two-dimensional taxonomy: objectives × process. +- [**Yu et al. 2025, "When AIs Judge AIs: Agent-as-a-Judge for LLMs" (arXiv:2508.02994)**](https://arxiv.org/abs/2508.02994) — Survey of evolution from single-LLM judges to multi-agent debate. +- [**Zheng et al. 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (arXiv:2306.05685, NeurIPS 2023)**](https://arxiv.org/abs/2306.05685) — The seminal LLM-as-judge paper; introduced MT-Bench and Chatbot Arena; documented position/verbosity/self-enhancement biases. +- [**Zheng et al. 2025, "The Lessons of Developing Process Reward Models" (arXiv:2501.07301)**](https://arxiv.org/abs/2501.07301) — Practical lessons on PRM data quality and granularity. +- [**Zhuge et al. 2024, "Agent-as-a-Judge: Evaluate Agents with Agents" (arXiv:2410.10934)**](https://arxiv.org/abs/2410.10934) — Replaces static LLM judge with agentic judge; approaches human reliability. +- [**Zhuge et al. 2025, "Establishing Best Practices for Building Rigorous Agentic Benchmarks" (arXiv:2507.02825)**](https://arxiv.org/pdf/2507.02825) — Outcome-validity and task-validity audit framework. +- [**Holistic Agent Leaderboard (2025, arXiv:2510.11977)**](https://arxiv.org/pdf/2510.11977) — Variance-decomposed agent leaderboard methodology. +- [**Cameron Wolfe, "Applying Statistics to LLM Evaluations" (Substack)**](https://cameronrwolfe.substack.com/p/stats-llm-evals) — Practitioner-friendly walkthrough of CIs, paired tests, and Bradley-Terry. +- [**LangChain, "Trajectory Evaluations" docs**](https://docs.langchain.com/langsmith/trajectory-evals) and [**agentevals package**](https://github.com/langchain-ai/agentevals) — Concrete trajectory-eval API and metric implementations. 
+- [**Arize, "Agent Trajectory Evaluations"**](https://arize.com/docs/ax/evaluate/evaluators/trace-and-session-evals/trace-level-evaluations/agent-trajectory-evaluations) — Production trajectory-evaluation patterns. +- [**Ragas docs, "Available Metrics"**](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/) — Standard reference for faithfulness / answer-relevancy / context metrics. +- [**KDD 2025 Tutorial: Evaluation & Benchmarking of LLM Agents (SAP)**](https://sap-samples.github.io/llm-agents-eval-tutorial/) — Conference-grade tutorial covering the full taxonomy. diff --git a/agent-evaluation-research/sources/02-frameworks.md b/agent-evaluation-research/sources/02-frameworks.md new file mode 100644 index 00000000..d8f540be --- /dev/null +++ b/agent-evaluation-research/sources/02-frameworks.md @@ -0,0 +1,775 @@ +# Agent / LLM Evaluation Frameworks: Landscape Dossier + +*Compiled May 8, 2026 for the QualOps internal research report.* + +## 1. Scope and motivation + +QualOps is an AI code-review tool with a multi-stage agentic pipeline (Analyze, Review, Fix, Report, Judge), built on the Claude Agent SDK, executed in CI on every pull request. It is fundamentally a **tool-calling workflow agent**: success depends on whether each stage emits the right tool calls, with the right arguments, in the right order, and produces a final review whose claims match ground-truth. The product already uses **Langfuse** for tracing, dataset runs, scorers, prompt management, and LLM-as-judge. + +This dossier is not a sales pitch for any tool. It is a fair comparison so the team can either justify staying on Langfuse, swap to a tool that is better at one specific dimension, or layer a second tool on top (e.g. add `promptfoo` for CI gating while keeping Langfuse for tracing). 
For each framework we capture: identity, license, deployment model, primitives, agent / tool-call eval support, integration story (Python and TS/Node), pricing, and notable strengths and weaknesses, ending in a comparison matrix, fit-by-team-profile, CI patterns, and real-world case studies. + +The space is moving fast. Two notable shifts since early 2025: OpenAI acquired Promptfoo in March 2026, and the Claude Agent SDK (formerly Claude Code SDK) became the default Anthropic agent harness. Some 2024-era guidance is already stale; we cite May 2026 docs where possible. + +--- + +## 2. Tool-by-tool analysis + +### 2.1 Langfuse (incumbent) + +**What it is.** Open-source LLM engineering platform: tracing, datasets, experiments, scores / scorers, prompt management, playground, online LLM-as-a-judge. Integrates via OpenTelemetry plus native SDKs for Python and TypeScript. + +**Maintainer / license.** Langfuse GmbH (YC W23). Core product is **MIT-licensed**; SCIM, audit logs, retention policies, enterprise SLAs require a commercial Enterprise Edition license. + +**Deployment model.** Both. Langfuse Cloud (Hobby free, Core, Pro, Enterprise tiers) and full self-host on your own infra (Docker / Helm / AWS Marketplace). Self-host has no usage gate or license key for the OSS feature set. + +**Primitives offered.** +- **Traces / observations**: spans, nested generations, tool calls, retrievals; OpenTelemetry-compatible. +- **Datasets / DatasetItems**: input + expected output rows; versioned. +- **Experiments / DatasetRuns**: a `task` function maps a DatasetItem to an output, an `evaluator` function scores it, and the run is persisted with run-level aggregates. +- **Scores**: universal eval primitive (NUMERIC, CATEGORICAL, BOOLEAN, TEXT) with name, value, optional comment, attached to a trace, observation, or dataset run. 
+- **LLM-as-a-judge**: managed online evaluators that can score traces, observations, or experiments; categorical and boolean output landed in early 2026; observation-level evals (Feb 2026 changelog) let judges score individual tool calls or retrievals rather than only whole traces. +- **Prompt management**: versioned prompts with labels (`production`, `staging`, etc.), pull from SDK, optional GitHub sync. + +**Agent / tool-call eval.** Strong for trace-level inspection of tool calls. The Feb 2026 observation-level eval feature is exactly the primitive needed to score "did the Review stage call `read_file` with the right path" as a separate metric from "is the final Report correct." For *trajectory* assertions (this exact ordered sequence of calls), Langfuse does not ship a built-in `trajectory_match` evaluator the way LangSmith does; you write it yourself as a Python or TS evaluator function and post a score. + +**Integration: TS / Node.** First-class. `@langfuse/tracing`, `@langfuse/otel`, `@opentelemetry/sdk-node` packages. Supports decorator, context manager, or manual `span.startObservation({ asType: 'tool' })`. `observeOpenAI()` wrapper for OpenAI tool-calling. Anthropic / Claude Agent SDK works through OTEL or manual spans. + +**Integration: Python.** First-class, slightly more mature than TS. `@observe` decorator, dataset SDK, Experiments SDK with `run_experiment(...)` plus an evaluator function list. + +**Pricing.** Self-host free for the OSS feature set (you pay only ClickHouse + Postgres + app infra, ~$3-4k/mo at medium scale per third-party estimates). Cloud Hobby free at 50k units/mo, Core / Pro starting around $59-199/mo, Enterprise from ~$2,499/mo with custom volume. + +**Strengths.** +- OSS, MIT, no vendor lock-in. +- ClickHouse-backed, scales to billions of observations (Canva is the headline production reference; 2,300+ companies cited). +- TS and Python parity is genuinely close. 
+- Observation-level online evals: Langfuse is one of the few platforms that treats tool calls as first-class scoring targets. +- Decoupled scorer model: any external scorer (RAGAS, DeepEval, custom) can post a Score via the SDK; you are not boxed into Langfuse's scorers. + +**Weaknesses.** +- No built-in trajectory-match evaluator (write your own). +- Comparison UI for experiments is functional but less polished than Braintrust's diff viewer. +- Online prod evals require you to wire up the judge config and pay judge LLM cost yourself. +- Self-hosting is operationally non-trivial (ClickHouse + Postgres + workers + Redis). + +--- + +### 2.2 LangSmith (LangChain) + +**What it is.** Closed-source observability + eval platform from LangChain, designed around LangChain / LangGraph but framework-agnostic. Datasets, evaluators (offline and online), prompt hub, deployments, agent trajectory evals. + +**Maintainer / license.** LangChain Inc. Proprietary SaaS; self-host available on Plus / Enterprise. + +**Deployment.** SaaS (US, EU regions) and self-hosted (Plus / Enterprise tiers). No OSS edition. + +**Primitives.** +- Traces with multi-turn / thread support. +- Datasets with versioning and splits. +- Evaluator templates: 30+ prebuilt (safety, response quality, **trajectory**, multimodal). +- Online evaluators that run on production traces. +- **Agent trajectory evaluators** (`agentevals` package, MIT-licensed, Python and TypeScript): `create_trajectory_match_evaluator` / `createTrajectoryMatchEvaluator` with strict, unordered, superset, and subset modes, plus an LLM-judged trajectory variant. +- Multi-turn evaluators that score whole user-agent threads. + +**Agent / tool-call eval.** Best-in-class out of the box. The `agentevals` library is purpose-built for tool-call sequences: feed in two lists of OpenAI-format messages, get a 0/1 or LLM-judged score on whether the trajectories match. This is exactly the QualOps "did Stage X call the right tools" use case.
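
To make the four match modes named above concrete, here is a minimal, dependency-free sketch. It is not the `agentevals` implementation and does not use its API: the `ToolCall` shape, the `trajectory_match` function, and the exact semantics of each mode (in particular, "superset" meaning the actual run contains at least every expected call, and "subset" meaning it contains no call outside the expected set) are assumptions made for illustration. The tool names in the example are hypothetical as well.

```python
import json
from collections import Counter
from typing import Any

# Hypothetical representation: one tool call = (tool_name, JSON-serializable arguments).
ToolCall = tuple[str, dict[str, Any]]


def _key(call: ToolCall) -> tuple[str, str]:
    """Turn a tool call into a hashable, key-order-independent form."""
    name, args = call
    return name, json.dumps(args, sort_keys=True)


def trajectory_match(actual: list[ToolCall], expected: list[ToolCall], mode: str = "strict") -> bool:
    """Compare two tool-call trajectories under one of four assumed modes."""
    a = [_key(c) for c in actual]
    e = [_key(c) for c in expected]
    if mode == "strict":
        # Same calls, same arguments, same order.
        return a == e
    count_a, count_e = Counter(a), Counter(e)
    if mode == "unordered":
        # Same multiset of calls, order ignored.
        return count_a == count_e
    if mode == "superset":
        # Actual run contains at least every expected call (extras allowed).
        return all(count_a[k] >= n for k, n in count_e.items())
    if mode == "subset":
        # Actual run makes no call beyond the expected ones (omissions allowed).
        return all(count_e[k] >= n for k, n in count_a.items())
    raise ValueError(f"unknown mode: {mode}")


# Illustrative data only: a Review-stage run that made one extra exploratory call.
expected = [("read_file", {"path": "src/app.ts"}), ("post_comment", {"line": 12})]
actual = [("list_files", {}), *expected]

for mode in ("strict", "unordered", "superset", "subset"):
    print(mode, trajectory_match(actual, expected, mode))
```

In a Langfuse-centred setup, a function like this would typically run inside a custom evaluator and its boolean result would be posted as a score; which mode to gate on is a product decision, since strict matching penalizes harmless exploratory calls while superset matching tolerates them.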
+ +**Integration: TS / Node.** First-class. `langsmith` npm package, `agentevals` npm package. Works with any framework, but the LangChain JS SDK gets richest integrations. + +**Integration: Python.** First-class. Same. + +**Pricing (May 2026).** Developer plan free with 5k traces/mo and 1 seat. Plus plan: paid seats, 10k base traces/mo included, $2.50 per 1k base traces (14-day retention) or $5.00 per 1k extended traces (400-day retention). Enterprise custom. Self-host requires Plus or Enterprise. + +**Strengths.** +- Best built-in agent trajectory eval primitives in the ecosystem. +- Mature: 2+ years older than most competitors, deep eval template library. +- Tight LangGraph integration if you ever want a managed agent runtime. + +**Weaknesses.** +- Closed-source. +- Per-trace pricing penalizes verbose agentic apps (each PR run can produce hundreds of spans). +- Self-host is gated behind paid tier. +- LangChain ecosystem gravity: framework-agnostic in theory, opinionated in practice. + +--- + +### 2.3 DeepEval (Confident AI) + +**What it is.** Open-source pytest-style LLM eval framework with 40+ metrics: G-Eval (custom rubric judge), DAG (decision-graph judge), faithfulness, answer relevancy, RAG metrics, agentic metrics (task completion, tool correctness, tool argument correctness, agent trajectory). + +**Maintainer / license.** Confident AI. Apache 2.0. + +**Deployment.** OSS library + Confident AI commercial platform (managed datasets, traces, online eval, regression dashboard). + +**Primitives.** +- `LLMTestCase` and `MLLMTestCase` classes, plug into `pytest` via `deepeval test run ...`. +- **G-Eval**: research-backed metric (Liu et al.) that lets you describe a custom criterion in natural language and get a 0-1 score with chain-of-thought reasoning. +- **DAG metric** for deterministic compound criteria. +- **Agent metrics**: `TaskCompletionMetric`, `ToolCorrectnessMetric`, agentic flow eval. +- Synthetic dataset generation. 
+- Confident AI platform: tracing (Python `@observe`, JS/TS `observe()` wrapper), online evaluation, regression detection, A/B comparison UI. + +**Agent / tool-call eval.** Strong. `ToolCorrectnessMetric` checks whether expected tools were called, optionally including argument matching. Agent trajectory metrics are documented under "AI Agent Evaluation." + +**Integration: TS / Node.** Available via `deepeval-ts` (npm), positioned as a TypeScript client to Confident AI; integrates with the Vercel AI SDK `experimental_telemetry`. Less mature than Python: the JS/TS client primarily covers tracing and posting eval data; the rich metric library still lives in Python. + +**Integration: Python.** First-class. This is the home language. + +**Pricing.** OSS free. Confident AI: free tier, paid plans (per Confident AI docs). + +**Strengths.** +- Pytest fits engineer muscle memory; CI integration is one `pytest` command. +- Deep metric catalog including agentic ones. +- G-Eval is widely cited and battle-tested. + +**Weaknesses.** +- TS support is thin compared to Python. +- Confident AI managed platform is less polished than Braintrust / Langfuse if you want a hosted UI. +- Some metrics ship default judge prompts that you must override for domain accuracy (general LLM-eval critique; see Hamel's "Revenge of the Data Scientist"). + +--- + +### 2.4 RAGAS + +**What it is.** RAG-specific metrics library, originating from a 2023 arXiv paper. Reference implementations of context precision, context recall, faithfulness, answer relevancy, plus newer agent-flavoured metrics (tool-call accuracy, topic adherence, agent goal accuracy). + +**Maintainer / license.** Exploding Gradients team. Apache 2.0. + +**Deployment.** OSS Python library; no platform. + +**Primitives.** +- `MultiTurnSample` / `SingleTurnSample` data classes. +- Reference-free RAG metrics (the original differentiator). +- Synthetic test-set generation. 
+- Optional integrations with Langfuse, LangSmith, Phoenix to post scores. + +**Agent / tool-call eval.** Limited. RAGAS added a handful of agent metrics (e.g. `ToolCallAccuracy`, `AgentGoalAccuracy`) but it is not its strength. For QualOps these would supplement, not replace, a primary eval framework. + +**Integration: TS / Node.** None official. Python only. + +**Integration: Python.** First-class. + +**Pricing.** OSS free. + +**Strengths.** +- Best-in-class for **RAG** metrics specifically. +- Cheap to adopt as a *metric library* alongside Langfuse / LangSmith / Phoenix. +- Active research lineage. + +**Weaknesses.** +- Python-only. +- Not a full platform: no UI, no dataset versioning, no observability. +- Agent eval is a recent bolt-on, not primary. +- For a coding-agent product like QualOps where retrieval is not the bottleneck, RAGAS is largely irrelevant. + +--- + +### 2.5 Braintrust + +**What it is.** Eval-as-code SaaS with a polished comparison / playground UI. Datasets, experiments, prompt playground, online eval, log capture, alerts. Storage layer is Brainstore, a custom OLAP database for AI traces. + +**Maintainer / license.** Braintrust Data Inc. Proprietary. $80M Series B in February 2026. + +**Deployment.** SaaS primarily. Self-host is an enterprise hybrid model (control plane in Braintrust cloud, API + Brainstore in your VPC); not OSS. + +**Primitives.** +- Eval-as-code SDK (`Eval(...)` in Python or TS) where you define a dataset, a task function, and a list of scorers. +- Playground for prompt + model + dataset matrix testing. +- Comparison UI (side-by-side diff between two experiments) is the headline differentiator; non-technical users can review experiments and contribute test cases. +- Online evals on production logs, alerting on regressions. +- Custom scorers in Python or TS, plus library of built-ins. + +**Agent / tool-call eval.** Good. Comparison UI surfaces tool-call diffs across runs. 
No trajectory-match library on par with `agentevals`, but the SDK-first design makes writing one easy. + +**Integration: TS / Node.** First-class. `braintrust` npm package, full eval-as-code parity with Python. + +**Integration: Python.** First-class. + +**Pricing.** Free tier 1M trace spans, 10k scores, unlimited users, 14-day retention. Pro from $249/mo. Enterprise custom; self-host is enterprise-only. + +**Strengths.** +- Best-in-class comparison UI; the experiment diff view is well-loved. +- TS and Python parity. +- High-profile production users: **Notion** (70 AI engineers, 10x increase in issues caught per day), **Stripe**, **Vercel**, **Zapier**, **Airtable**. +- Strong CI/CD story; pre-built GitHub Action. + +**Weaknesses.** +- Closed-source. +- Self-host gated to enterprise only and is hybrid, not pure on-prem. +- Pricing scales with span volume; verbose agents get expensive. +- Online prod eval debugging is reportedly weaker than Arize / Langfuse. + +--- + +### 2.6 OpenAI Evals (OSS) and Evals API (hosted) + +**What it is.** Two products under one banner: (a) the original `openai/evals` GitHub repo, an OSS framework + benchmark registry; (b) the **Evals API** on platform.openai.com, a hosted product where you upload datasets and configure graders. + +**Maintainer / license.** OpenAI. OSS repo MIT-licensed; hosted Evals API is a paid OpenAI platform feature. + +**Deployment.** OSS repo runs anywhere. Evals API is OpenAI cloud only. + +**Primitives.** +- Datasets uploaded via API. +- **Graders**: `string_check`, `text_similarity`, `python` (sandboxed Python grader function), `label_model` (LLM classification grader), `score_model` (LLM scoring with rubric). +- Templating: `{{ var }}` substitution into grader prompts. +- Agent evals guide (added 2025) with tool-trajectory grading patterns. + +**Agent / tool-call eval.** The Evals API ships an "Evaluate agent workflows" guide with patterns for tool-call grading. 
The python grader can directly assert "expected tool sequence == actual tool sequence." Less batteries-included than `agentevals`. + +**Integration: TS / Node.** Via the OpenAI SDK only; same surface across languages. + +**Integration: Python.** Native; the OSS repo is Python. + +**Pricing.** OSS free. Evals API billed against OpenAI usage (tokens for graders + storage). + +**Strengths.** +- Hosted Evals API integrates trivially with OpenAI ecosystem and stored chat completions. +- Python grader is a clean escape hatch. +- Brand and continuity (OpenAI is unlikely to abandon it). + +**Weaknesses.** +- Hosted evals are OpenAI-cloud-only; multi-vendor teams running Claude as the primary model must adapt. +- OSS repo activity has slowed; the framework is more of a benchmark archive than an active platform. +- No tracing / observability layer. + +--- + +### 2.7 Anthropic eval tooling (Workbench, Claude Console, Claude Agent SDK patterns) + +**What it is.** Anthropic does not ship a standalone evaluation product. Instead they offer: +- **Claude Console**: dashboard with trace inspection, integration analytics, debugging UI for tool calls and decisions; ships with Claude Agent SDK. +- **Workbench / Prompt Improver**: web UI for prompt iteration with sample-by-sample comparison. +- **Claude Agent SDK**: programmatic harness (TS and Python) with tool execution; pairs with the engineering blog post "Demystifying evals for AI agents" which recommends the Tasks / Trials / Transcripts pattern with 20-50 hand-curated tasks. +- **Managed Agents** (public beta April 2026): hosted agent runtime; observability surfaced via Console. + +**Maintainer / license.** Anthropic. Proprietary. Claude Agent SDK is Apache 2.0. + +**Deployment.** SaaS only. + +**Primitives.** +- Console traces and analytics for Agent SDK runs. +- No first-party dataset / experiment / scorer abstraction. 
The Anthropic blog explicitly recommends using third-party eval frameworks; their position is "we will instrument and trace, you bring or build the eval harness." + +**Agent / tool-call eval.** Console shows tool calls and lets you inspect failure modes, but scoring is not a built-in primitive. The "Demystifying evals" blog post is the canonical Anthropic recommendation: build small (20-50) hand-curated task sets, run multiple trials, grade transcripts with a mix of programmatic and LLM-as-judge methods, and treat eval as iterative. + +**Integration: TS / Node.** Claude Agent SDK has a TypeScript SDK and a Python SDK; both hook into Console traces by default. + +**Pricing.** Console included with Claude API access. + +**Strengths.** +- Best-in-class for *guidance and patterns* via the engineering blog. +- Tightest fit for Claude Agent SDK users (which QualOps is). +- Console is "free" if you are already on Claude. + +**Weaknesses.** +- Not a full eval platform; you still need Langfuse / Braintrust / similar for datasets, experiments, scoring infra. +- No CI gating, no programmatic scorer API (as of May 2026). + +For QualOps specifically: this is a **complement** to Langfuse, not a replacement. The Anthropic blog patterns inform what we put inside the Langfuse experiments. + +--- + +### 2.8 Phoenix (OSS) and Arize AX (commercial) + +**What it is.** Phoenix is the OSS, OTEL-based observability + eval framework from Arize AI. Arize AX is the enterprise SaaS counterpart (managed infrastructure, alerts, online evals, agent copilots, compliance). + +**Maintainer / license.** Arize AI. Phoenix is Apache 2.0. + +**Deployment.** Phoenix self-host (Docker, Python `phoenix.launch_app()`, K8s), or Phoenix Cloud (managed). Arize AX is SaaS. + +**Primitives.** +- OpenTelemetry-native tracing with **OpenInference** semantic conventions (AGENT and TOOL spans are first-class). +- Datasets, experiments, evaluators (LLM-as-judge, code, human label). 
+- Built-in instrumentation for Claude Agent SDK (Python and TS), OpenAI Agents SDK, LangGraph, Vercel AI SDK, Mastra, CrewAI, LlamaIndex, DSPy. +- Phoenix CLI for piping traces / datasets / prompts to Cursor, Claude Code, Codex CLI, Gemini CLI. + +**Agent / tool-call eval.** Strong. OpenInference semantic conventions explicitly model AGENT, TOOL, RETRIEVER, CHAIN, LLM span kinds. Phoenix UI can score any span. Out-of-the-box Claude Agent SDK instrumentation captures tool spans automatically. + +**Integration: TS / Node.** Yes. `@arizeai/openinference-instrumentation-anthropic`, Vercel AI SDK auto-instrumentation, OTEL pipeline. + +**Integration: Python.** First-class. + +**Pricing.** Phoenix OSS is free. Arize AX is custom enterprise pricing. + +**Strengths.** +- Most rigorous OTEL story; the "switch your backend without changing instrumentation" pitch holds. +- Vendor-agnostic: ingests traces from anywhere that speaks OpenInference / OTLP. +- Strong agent semantic conventions. +- OSS self-host is genuinely usable, not crippleware. + +**Weaknesses.** +- UI is functional but less polished than Braintrust's. +- Eval features are less mature than the observability features. +- Two-product split (Phoenix vs AX) creates feature gaps you only discover at sales time. + +For QualOps: a credible replacement candidate for Langfuse if OTEL portability is a hard requirement. Otherwise, Langfuse covers the same ground with arguably better experiment ergonomics. + +--- + +### 2.9 Weights & Biases Weave + +**What it is.** W&B's GenAI sub-product: tracing (`@weave.op` decorator), evaluations (`weave.Evaluation`), prompt experimentation, cost / latency tracking. Sits on top of W&B's existing experiment-tracking infra. + +**Maintainer / license.** Weights & Biases (acquired by CoreWeave in 2025). Apache 2.0 SDK; managed service is paid. + +**Deployment.** SaaS primarily; W&B Server self-host available on enterprise plans.
+ +**Primitives.** +- `@weave.op` decorator auto-traces any Python (and TS) function call, including nested LLM calls. +- `weave.Evaluation` class with dataset + scorers + model. +- Built-in scorers, custom scorers, LLM judges. +- Comparison view across model / prompt / config combinations. + +**Agent / tool-call eval.** Possible but not a focus. Trace UI shows nested ops; no trajectory-match primitive shipped. + +**Integration: TS / Node.** TypeScript SDK exists but is younger and less complete than Python. + +**Integration: Python.** First-class; this is the home. + +**Pricing.** Free tier with limits; paid via W&B contracts. + +**Strengths.** +- If your org already uses W&B for ML experiment tracking, the on-ramp is one decorator. +- Cost / latency tracking out of the box. + +**Weaknesses.** +- Eval is not the center of the product; Braintrust / Langfuse have richer eval workflows. +- TS SDK lags Python. +- W&B's pricing is opaque and notoriously rises with scale. + +--- + +### 2.10 MLflow LLM evals + +**What it is.** Databricks-led OSS ML platform that added a GenAI evaluation track in 2024-2025. `mlflow.genai.evaluate()` is the primary entry point. + +**Maintainer / license.** Databricks + community. Apache 2.0. + +**Deployment.** OSS self-host or Databricks managed. Most "MLflow GenAI" features are available OSS, with deeper agent monitoring on Databricks. + +**Primitives.** +- Built-in scorers (correctness, relevance, safety, helpfulness) and LLM judges. +- Custom judges with prompt + rubric. +- Automatic evaluation: judges run on traces and multi-turn conversations as they are logged. +- Dataset and experiment objects (inherited from classic MLflow). +- "Evaluation-Driven Development" framing. + +**Agent / tool-call eval.** Has agent-aware scorers; Databricks-specific docs reference scoring tool decisions. Less battle-tested than LangSmith trajectory evals. + +**Integration: TS / Node.** Limited. MLflow's GenAI tracing has a TS package but is Python-first. 
+ +**Integration: Python.** First-class. + +**Pricing.** OSS free; Databricks pricing for the managed product. + +**Strengths.** +- Natural fit if you already use MLflow for classical ML pipelines. +- Open standard, self-hostable. +- Databricks gravity if your data lives there. + +**Weaknesses.** +- Outside Databricks shops, mindshare is low. +- TS support is weak. +- Platform is broad; GenAI evals are one feature among many, not the focus. + +--- + +### 2.11 Patronus AI + +**What it is.** Judge-as-a-service. Fine-tuned evaluator models (notably **GLIDER**, a 3.8B parameter rubric-following judge with ~1s latency) plus a managed platform for hallucination detection, safety, RAG metrics, and custom rubric scoring. Recently added the first multimodal LLM-as-judge. + +**Maintainer / license.** Patronus AI (Series A startup). Proprietary models; open SDKs. + +**Deployment.** SaaS API + MCP server. No OSS judge models. + +**Primitives.** +- Pre-trained judge endpoints (hallucination, safety, format, custom rubric). +- GLIDER as a low-latency rubric-driven judge. +- Tracing + alerting via the platform. +- Reported 91% agreement with human judgments on benchmarks they publish. + +**Agent / tool-call eval.** Indirect. You score the agent output with a Patronus judge; Patronus does not own the trace / dataset layer. Etsy and Gamma are headline case studies. + +**Integration: TS / Node.** Via REST + SDKs; supported. + +**Integration: Python.** First-class. + +**Pricing.** Per-call, not publicly listed; sales-driven. + +**Strengths.** +- Specialized judge models give faster, cheaper scoring than calling GPT-5 / Claude as judge. +- Strong on multimodal evaluation if you ever need it. +- MCP-server form factor fits the 2026 toolchain. + +**Weaknesses.** +- You still need a host platform for traces, datasets, experiments. +- Vendor lock-in on judge models; opaque inner workings of GLIDER. +- For text-only code review like QualOps, the multimodal angle is irrelevant. 
+ +Patronus is a **scorer plug-in**, not a Langfuse replacement. + +--- + +### 2.12 Promptfoo + +**What it is.** CLI + library for prompt and agent eval, configured in a single `promptfooconfig.yaml`. Test cases as YAML, asserts as YAML, providers as YAML. Used by OpenAI and Anthropic in their own pipelines. Strong red-team / security-test suite. **OpenAI acquired Promptfoo in March 2026**; product remains MIT-licensed and open source. + +**Maintainer / license.** Promptfoo (now OpenAI). MIT. + +**Deployment.** Local CLI + optional cloud. Trivially self-hostable. + +**Primitives.** +- YAML config: prompts, providers, test cases (`vars`), asserts (`equals`, `contains`, `llm-rubric`, `javascript`, `python`, `model-graded-closedqa`, etc.). +- CLI: `promptfoo eval -c promptfooconfig.yaml -o results.json`. +- Web viewer for results. +- GitHub Action: `promptfoo/promptfoo-action` posts a PR comment with diff. +- Red-team / pentesting suite (prompt injection, jailbreak, data leakage). + +**Agent / tool-call eval.** Supports asserting on tool calls via custom JS / Python asserts and via the agent providers. Claude Agent SDK is a first-class provider. Less semantically rich than `agentevals` or DeepEval's `ToolCorrectnessMetric`, but flexible. + +**Integration: TS / Node.** First-class. Runs via `npx promptfoo`. Custom asserts can be JS. + +**Integration: Python.** Custom asserts can be Python; otherwise it is a CLI you call from anywhere. + +**Pricing.** OSS free. Promptfoo Cloud / Enterprise tiers exist. + +**Strengths.** +- Lowest-effort CI gating in the entire space. Drop a YAML, add the GitHub Action, done. +- 350k developers, 130k MAU, 25% Fortune 500 reach as of acquisition announcement: serious adoption. +- Red-team / security suite is a genuine bonus for any code-review tool that processes untrusted input. +- MIT license, no platform required. + +**Weaknesses.** +- YAML scales poorly past a few hundred test cases; you end up generating it. 
+- Not an observability platform. No live traces. +- "Used by OpenAI" headline aside, the OpenAI acquisition introduces some governance uncertainty for Anthropic-first shops. + +For QualOps: serious candidate as a **CI-only thin layer**, complementary to Langfuse, especially for prompt-level regression gates and security-style tests. + +--- + +### 2.13 Inspect AI (UK AI Safety Institute) + +**What it is.** Research-grade Python eval framework for frontier models, originally built by UK AISI in collaboration with Meridian Labs. Adopted by Anthropic, DeepMind, and Grok for internal evals. Three core abstractions: `Dataset`, `Solver`, `Scorer`. + +**Maintainer / license.** UK AISI. MIT. + +**Deployment.** OSS Python library; runs anywhere. Optional Inspect View web UI. + +**Primitives.** +- `Dataset` (samples with input + target + metadata). +- `Solver` (chain of pluggable steps that produce an answer; includes ReAct, Deep Agent, custom agents). +- `Scorer` (programmatic or LLM judge). +- `SandboxEnvironment` abstraction for executing model-generated code or shell commands safely. +- Built-in tools (`bash`, `python`, `text editor`, `web search`, `web browse`, computer-use), MCP tool support, custom tool support. +- `inspect_evals` companion repo with 200+ pre-built evaluations (capability, agentic, reasoning, knowledge). +- Built-in agent primitives: ReAct, Deep Agent (long-horizon, sub-agents, memory), and **Agent Bridge** for plugging in external agents (Claude Code, Codex CLI, Gemini CLI). +- May 8 2026 (today): community contributions move to a `/register/` folder with YAML-based PR submission. + +**Agent / tool-call eval.** Best-in-class for *agentic capability evals* (frontier-model style). Sandbox-grounded tool execution + scorers that can inspect transcripts make it well-suited to "did the agent navigate this codebase correctly" tasks. Hamel Husain endorses it for serious eval work. + +**Integration: TS / Node.** None. Python only. 
+ +**Integration: Python.** First-class. This is the home. + +**Pricing.** Free. + +**Strengths.** +- Used by every frontier lab; battle-tested at the hardest end of the spectrum. +- Sandbox + Agent Bridge means you can wrap Claude Code or QualOps' own agent and run capability evals on it without modifying the agent. +- 200+ pre-built evals as a starting library. +- Excellent reproducibility: experiment artifacts are deterministic. + +**Weaknesses.** +- Python only. +- Research / safety framing rather than product-CI framing; ergonomics for "fail the PR if score drops" are DIY. +- No managed observability; you persist to local files / your own backend. + +For QualOps: a strong **research / nightly capability eval** option, complementary to Langfuse for trace-driven product eval. + +--- + +### 2.14 Niche options: AgentOps, Helicone, LangWatch + +**AgentOps.** Agent-first observability, supports 400+ LLMs and major frameworks, "time-travel debugging" (replay agent state). Reasonable choice if you have many heterogeneous agents to debug. Reported ~12% overhead. + +**Helicone.** Proxy-based observability (Apache 2.0). Zero-SDK-changes onboarding through a gateway. **As of 2026 the upstream platform is in maintenance mode**; existing users have a 6-12 month migration window. Avoid for new deployments. + +**LangWatch.** Real-time LLM observability with built-in evaluations and accurate cost attribution. Less mature than Langfuse, but a credible option for teams that want a single product instead of an OSS-plus-judges stack. + +None of these is a serious primary candidate for QualOps given the Langfuse incumbency, but they are reasonable to mention to stakeholders who ask. + +--- + +## 3. Comparison matrix + +Legend: white check (yes / strong), warning (partial / caveat), red X (no / weak). 
+ +| Framework | OSS license | Self-host | Agent trajectory eval | Tool-call scoring | LLM-as-judge built in | Online prod eval | CI integration | TS-native | Py-native | Dataset versioning | A/B comparison UI | Cost (free tier) | +|---|---|---|---|---|---|---|---|---|---|---|---|---| +| **Langfuse** | white check (MIT) | white check (free) | warning (DIY trajectory match) | white check (observation-level evals, Feb 2026) | white check | white check | white check (experiment-action / SDK) | white check | white check | white check | warning (basic diff) | white check (50k units/mo) | +| **LangSmith** | red X | white check (Plus+) | white check (`agentevals`) | white check | white check (30+ templates) | white check | white check | white check | white check | white check | white check | white check (5k traces/mo) | +| **DeepEval / Confident AI** | white check (Apache 2.0 lib) | warning (lib yes, platform paid) | white check (Python) | white check (`ToolCorrectnessMetric`) | white check (G-Eval, DAG) | white check (Confident AI) | white check (`pytest`) | warning (thin) | white check | white check | white check (Confident AI) | white check (OSS lib) | +| **RAGAS** | white check (Apache 2.0) | white check | warning (recent) | warning (`ToolCallAccuracy`) | white check (judges per metric) | red X | warning (via host) | red X | white check | red X | red X | white check | +| **Braintrust** | red X | warning (enterprise hybrid) | warning (DIY in SDK) | white check (good in UI) | white check | white check | white check (GitHub Action) | white check | white check | white check | white check (best-in-class) | white check (1M spans/mo) | +| **OpenAI Evals API** | white check (OSS repo MIT) | white check (repo) / red X (API) | warning (DIY via python grader) | warning (DIY) | white check (model graders) | warning | warning | warning (via SDK) | white check | white check | warning | white check (OSS) / paid (API) | +| **Anthropic Console / Agent SDK** | warning (SDK Apache 2.0) 
| red X (Console SaaS) | red X | red X (inspect only) | red X | red X | red X | white check | white check | red X | red X | included with Claude API | +| **Phoenix / Arize AX** | white check (Phoenix Apache 2.0) | white check (Phoenix) | warning (span-level, DIY) | white check (OpenInference TOOL spans) | white check | white check (AX) | white check | white check | white check | white check | warning | white check | +| **W&B Weave** | white check (SDK Apache 2.0) | warning (W&B Server enterprise) | warning | warning | white check | warning | white check | warning (younger TS) | white check | white check | white check | white check (limits) | +| **MLflow GenAI** | white check (Apache 2.0) | white check | warning | warning | white check | white check (auto-eval) | white check | red X | white check | white check | warning | white check | +| **Patronus AI** | red X | red X | red X (judge only) | warning (custom rubric) | white check (specialized models) | white check | warning | white check | white check | red X | red X | sales | +| **Promptfoo** | white check (MIT) | white check | warning (custom asserts) | warning (custom asserts, Claude SDK provider) | white check (`llm-rubric`) | red X | white check (best-in-class GitHub Action) | white check | white check | warning (YAML files) | white check (web viewer) | white check | +| **Inspect AI** | white check (MIT) | white check | white check (sandbox + scorer) | white check (built-in tools, MCP) | white check | red X | warning (custom) | red X | white check | white check | warning (Inspect View) | white check | +| **AgentOps** | warning (SDK OSS) | warning | white check (time-travel) | white check | white check | white check | warning | white check | white check | white check | white check | white check | +| **Helicone** | white check (Apache 2.0) | white check | red X | warning | warning | white check | warning | white check | white check | white check | warning | maintenance mode | +| **LangWatch** | white check 
(Apache 2.0 lib) | warning | warning | warning | white check | white check | warning | white check | white check | white check | white check | white check | + +--- + +## 4. Good fit / bad fit by team profile + +### 4.1 Small team, CI-gated, Node/TS app, on Claude (== QualOps) + +**Recommended primary**: stay on **Langfuse**, complement with **Promptfoo** in CI for fast prompt-regression gating. + +Why: Langfuse's MIT license + self-host + ClickHouse + observation-level LLM-as-judge (Feb 2026) covers tracing, datasets, experiments, online evals, and prompt management for free at small scale. Promptfoo's GitHub Action gives a per-PR YAML-driven gate with a PR comment view that Langfuse alone doesn't deliver. The two integrate: Promptfoo runs deterministic / fast-judge asserts on each PR; Langfuse stores the longer-running experiments and production traces and lets you spot drift over time. + +Add **Inspect AI** (or write your own using `agentevals` patterns) for nightly capability evals against a held-out task set. Use Anthropic's "Demystifying evals" Tasks / Trials / Transcripts pattern as the structural blueprint inside Langfuse experiments. + +Do **not** rip and replace Langfuse for LangSmith or Braintrust unless you discover a specific feature gap; the marginal UX wins do not justify a closed-source migration for a small team. + +### 4.2 Small / mid team, Python-only, RAG-heavy + +**Primary**: **DeepEval** + **RAGAS** (metric library) + your choice of host (Phoenix or Langfuse). + +Why: RAGAS for the canonical RAG metrics, DeepEval's pytest harness for CI gating, Phoenix for OTEL-based observability and OSS self-host. Avoid Braintrust unless you specifically want the comparison UI. + +### 4.3 Large org, many agents, dedicated SREs, vendor procurement OK + +**Primary**: **Braintrust** for product teams, **Phoenix / Arize AX** for the platform / SRE team. 
+ +Why: Braintrust's diff UI scales to 70+ engineers (Notion case study) and lets product / non-engineering reviewers contribute test cases. Arize AX gives the SRE team enterprise compliance, alerts, online eval, and an OTEL pipeline. The two coexist: Braintrust for pre-deploy eval, Arize for post-deploy observability. + +### 4.4 LangChain / LangGraph shop + +**Primary**: **LangSmith**. + +Why: deepest framework integration, best agent trajectory primitives, and the multi-turn evaluators map directly to LangGraph state machines. Per-trace pricing is a constraint; budget accordingly. + +### 4.5 Frontier-lab / safety-focused / research org + +**Primary**: **Inspect AI**. + +Why: it is what the labs already use. Pair with custom storage for any required trace persistence. + +### 4.6 OpenAI-only shop + +**Primary**: **OpenAI Evals API** + **Promptfoo**. + +Why: Evals API is the path of least resistance for OpenAI-stored chat completions. Promptfoo for CI. No real downside until you go multi-model. + +### 4.7 Already on W&B for ML, adding LLM features + +**Primary**: **W&B Weave**, supplemented by RAGAS / DeepEval metrics if needed. + +Why: amortized onboarding cost. Re-evaluate after 12 months when LLM features outgrow ML features. + +### 4.8 Bad fits to avoid + +- **Helicone** for new deployments (maintenance mode). +- **Patronus** as a primary platform (it is a scorer, not a platform). +- **OpenAI Evals OSS repo** as an active framework (slow upstream activity; treat it as a benchmark archive). +- **Single-tool maximalism**: every successful 2025-2026 case study (Notion / Stripe / Canva) combines tools. Plan for two or three. + +--- + +## 5. CI integration patterns + +### 5.1 Langfuse + +**Pattern**: GitHub Actions calls a Python or TS script that runs the Langfuse `experiment-action` or the SDK `runExperiment(...)`. The experiment iterates a Dataset, calls the agent (the QualOps `Analyze->Review->Fix` pipeline), and runs evaluators. 
The script asserts on aggregate scores and exits non-zero on regression. + +```yaml +# .github/workflows/qualops-eval.yml +on: pull_request +jobs: + eval: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-node@v4 + - run: npm ci + - run: npx tsx scripts/run-langfuse-eval.ts + env: + LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }} + LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }} + LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }} + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} +``` + +The script posts a PR comment with the dataset run URL and asserts `mean_score >= baseline - tolerance`. + +### 5.2 LangSmith + +**Pattern**: `langsmith` SDK has `pytest` integration; you write tests as `@pytest.mark.langsmith` decorated functions; CI runs `pytest`. For Node: `langsmith` SDK `evaluate(...)` API in any test framework. + +### 5.3 DeepEval + +**Pattern**: pytest. Single command: + +```yaml +- run: deepeval test run tests/eval_qualops.py +``` + +Tests use `assert_test(test_case, [ToolCorrectnessMetric(), GEval(...)])`. Confident AI auto-stores the run. + +### 5.4 Braintrust + +**Pattern**: Pre-built `braintrust-eval` GitHub Action; runs `braintrust eval src/evals/*.eval.ts` and posts a PR comment with the experiment diff URL. + +### 5.5 Promptfoo + +**Pattern**: `promptfoo/promptfoo-action`. Diffs `promptfooconfig.yaml` test results between base and PR branches, posts PR comment. + +```yaml +- uses: promptfoo/promptfoo-action@v1 + with: + openai-api-key: ${{ secrets.OPENAI_API_KEY }} + anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }} + config: promptfooconfig.yaml +``` + +### 5.6 OpenAI Evals API + +**Pattern**: REST call from CI. `POST /v1/evals` to create, `POST /v1/evals/{id}/runs` to run. Poll for completion, fail the job on threshold violation. + +### 5.7 Phoenix + +**Pattern**: Phoenix CLI. `phoenix experiments run --dataset ... --task ... --evaluators ...`. CI script asserts on returned metrics. 
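
Several of the patterns above (5.1, 5.6, 5.7) end in the same step: a CI script compares an aggregate score with a baseline and fails the job on regression. The following is a minimal, tool-agnostic sketch of that gate, assuming the per-item scores have already been fetched from whichever platform is in use. `regression_gate`, `exit_code`, and the default tolerance are illustrative names and values, not part of any framework's API.

```python
from statistics import mean


def regression_gate(scores, baseline, tolerance=0.02):
    """Apply the rule mean_score >= baseline - tolerance.

    Returns (passed, current_mean). The tolerance default is an
    arbitrary placeholder; tune it to the noise level of the judge.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    current = mean(scores)
    return current >= baseline - tolerance, current


def exit_code(scores, baseline, tolerance=0.02):
    """Map the gate result to a process exit code (0 = pass, 1 = regression)."""
    passed, current = regression_gate(scores, baseline, tolerance)
    status = "PASS" if passed else "FAIL"
    print(
        f"{status}: mean_score={current:.3f} "
        f"baseline={baseline:.3f} tolerance={tolerance:.3f}"
    )
    return 0 if passed else 1
```

In a CI job the script would typically finish with `sys.exit(exit_code(scores, baseline))`, so that a regression marks the check as failed.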
+ +### 5.8 Inspect AI + +**Pattern**: `inspect eval my_task.py --model anthropic/claude-sonnet-4-5 --log-dir ./logs`. Parse the JSON log file for pass/fail. + +### 5.9 GitLab CI + +All of the above port directly. The only adjustment is using `rules` and `script` blocks; no tool ships a GitLab-only integration. + +### 5.10 Recommended pattern for QualOps + +A two-tier setup: + +1. **Per-PR (fast tier, ~2-5 min)**: Promptfoo YAML with ~30 small assertions on the output of each pipeline stage; runs as a required GitHub check. +2. **Nightly (slow tier, ~30-60 min)**: Langfuse experiment over a 100-200 item dataset, running the full pipeline, with LLM-as-judge scorers on the final report and `ToolCorrectness`-style scorers on each stage. Posts a Slack message and writes to the Langfuse experiments dashboard. + +This gives developers fast PR feedback and the team a slower, deeper nightly truth. + +--- + +## 6. Real-world usage and case studies + +- **Canva** runs production AI design features through Langfuse. Headline reference on `langfuse.com`. +- **Notion** built its AI evaluation system on Braintrust; 70 engineers, 10x increase in caught issues per day going from JSONL files to Braintrust workflows; ZenML LLMOps Database has the full write-up. +- **Stripe**, **Vercel**, **Zapier**, **Airtable** are listed Braintrust customers per Braintrust marketing pages. +- **Etsy** uses Patronus AI's multimodal LLM-as-judge for image captioning quality; Patronus published the case study. +- **Gamma** uses Patronus for automated evals and rigorous experimentation; Patronus published the case study. +- **Anthropic, DeepMind, Grok** are documented users of Inspect AI (per UK AISI announcement and Hamel Husain's notes). Anthropic also documents in "Demystifying evals for AI agents" how internal teams build small (20-50 item) eval task sets. +- **OpenAI and Anthropic** both ship Promptfoo as part of their internal eval pipelines per the Promptfoo GitHub README. 
+- **AI Engineer Europe 2026** (April 8-10, London) had an Evals & Observability track. **AI Engineer World's Fair 2026** (June 29-July 2, San Francisco) is the upcoming flagship. +- **Hamel Husain & Shreya Shankar** teach the "AI Evals For Engineers & PMs" course on Maven; Bryan Bischof of Hex AI gave the "Failure as a Funnel" talk at Data Council 2025 on agent failure-mode analysis. +- **Vanishing Gradients podcast** Episode 60: "10 Things I Hate About AI Evals with Hamel Husain" - argues against generic off-the-shelf eval frameworks and for application-specific metrics. + +--- + +## 7. Recommendations for QualOps + +**Keep**: Langfuse as the primary eval and observability backbone. The Feb 2026 observation-level LLM-as-judge feature and the recent boolean / categorical scoring added exactly the primitives a tool-calling pipeline needs. + +**Add**: Promptfoo as a thin per-PR CI gate. YAML config can live in the repo; tests run in 2-5 minutes; the PR comment view is a developer-experience win Langfuse alone does not provide. + +**Add (optional)**: a small Inspect AI task set (20-50 hand-curated tasks) for capability eval, modeled on Anthropic's "Demystifying evals" pattern. Run it on a schedule (nightly or weekly), not per-PR. Inspect's `Agent Bridge` lets you wrap the QualOps agent without modifying it. + +**Consider only if a specific gap appears**: +- LangSmith if you find yourself reimplementing trajectory-match logic and the procurement / closed-source tradeoff is acceptable. +- Braintrust if non-engineering reviewers (product, design) need to contribute test cases through a UI. +- Phoenix if a customer or compliance requirement demands strict OTEL portability. + +**Do not**: replace Langfuse outright. The cost-benefit math for a small team in CI on Node/TS does not justify it given Langfuse's current feature set. + +--- + +## 8.
References + +### Framework documentation + +- [Langfuse - Evaluation overview](https://langfuse.com/docs/evaluation/overview) - canonical eval docs index. +- [Langfuse - Datasets](https://langfuse.com/docs/evaluation/experiments/datasets) - dataset and experiment data model. +- [Langfuse - LLM-as-a-Judge](https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge) - online judge configuration. +- [Langfuse - Observation-level evals (Feb 2026)](https://langfuse.com/changelog/2026-02-13-observation-level-evals) - tool-call-level scoring landed. +- [Langfuse - Boolean LLM-as-a-Judge Scores (Apr 2026)](https://langfuse.com/changelog/2026-04-08-boolean-llm-as-a-judge-scores) - boolean output added. +- [Langfuse - TypeScript SDK overview](https://langfuse.com/docs/observability/sdk/typescript/overview) - Node integration. +- [Langfuse - Self-hosted pricing](https://langfuse.com/pricing-self-host) - MIT-licensed self-host detail. +- [Langfuse - GitHub repo](https://github.com/langfuse/langfuse) - source code. +- [LangSmith - Evaluation docs](https://docs.langchain.com/langsmith/evaluation) - canonical eval docs. +- [LangSmith - Trajectory evaluations](https://docs.langchain.com/langsmith/trajectory-evals) - agent trajectory primitives. +- [agentevals GitHub repo](https://github.com/langchain-ai/agentevals) - readymade trajectory evaluators (Python + TS). +- [LangSmith - Insights Agent + Multi-turn Evals](https://blog.langchain.com/insights-agent-multiturn-evals-langsmith/) - multi-turn eval announcement. +- [LangSmith - Pricing](https://www.langchain.com/pricing) - 2026 tiers. +- [DeepEval homepage](https://deepeval.com/) - product entry point. +- [DeepEval - AI agent evaluation guide](https://deepeval.com/guides/guides-ai-agent-evaluation) - agent metrics walkthrough. +- [DeepEval GitHub repo](https://github.com/confident-ai/deepeval) - source. +- [deepeval-ts on npm](https://www.npmjs.com/package/deepeval-ts) - TypeScript client. 
+- [Confident AI - JS/TS observability](https://documentation.confident-ai.com/llm-observability/integrations/typescript) - TS tracing wrapper. +- [RAGAS docs](https://docs.ragas.io/en/stable/) - canonical metrics docs. +- [RAGAS - Available metrics](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/) - including agent metrics. +- [Ragas arXiv paper](https://arxiv.org/abs/2309.15217) - 2023 source paper. +- [Braintrust homepage](https://www.braintrust.dev/) - product. +- [Braintrust - Pricing](https://www.braintrust.dev/pricing) - tiers. +- [Braintrust - Notion case study](https://www.braintrust.dev/blog/notion) - 70 engineers, 10x more issues caught. +- [OpenAI - Evals API guide](https://developers.openai.com/api/docs/guides/evals) - hosted Evals API. +- [OpenAI - Agent evals guide](https://developers.openai.com/api/docs/guides/agent-evals) - agent workflow eval patterns. +- [OpenAI - Graders reference](https://developers.openai.com/api/docs/guides/graders) - all grader types. +- [OpenAI - openai/evals GitHub repo](https://github.com/openai/evals) - OSS framework + benchmark registry. +- [Anthropic - Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) - the canonical agent-eval guide from Anthropic engineering. +- [Anthropic - Building agents with the Claude Agent SDK](https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk) - SDK patterns. +- [Claude Agent SDK overview](https://code.claude.com/docs/en/agent-sdk/overview) - SDK docs. +- [Phoenix homepage](https://phoenix.arize.com/) - Phoenix landing. +- [Phoenix GitHub repo](https://github.com/Arize-ai/phoenix) - source. +- [Phoenix - Claude Agent SDK (TypeScript) integration](https://arize.com/docs/phoenix/integrations/typescript/claude-agent-sdk) - direct integration. +- [Phoenix - Claude Agent SDK (Python) integration](https://arize.com/docs/phoenix/integrations/python/claude-agent-sdk) - Python. 
+- [Phoenix CLI](https://arize.com/docs/phoenix/sdk-api-reference/typescript/arizeai-phoenix-cli) - CLI for piping traces and datasets. +- [W&B Weave docs](https://docs.wandb.ai/weave) - Weave product docs. +- [W&B Weave GitHub](https://github.com/wandb/weave) - source. +- [MLflow GenAI evals](https://mlflow.org/genai/evaluations) - LLM and agent evaluation entry point. +- [MLflow - Automatic evaluation](https://mlflow.org/docs/latest/genai/eval-monitor/automatic-evaluations/) - judges on traces. +- [Patronus AI homepage](https://www.patronus.ai/) - product. +- [Patronus AI - LLM judges docs](https://docs.patronus.ai/docs/tutorials/evals/llm_judges) - judge integration. +- [Patronus AI - Etsy case study](https://www.patronus.ai/case-studies/etsy-leveraging-patronus-ais-multimodal-llm-as-a-judge-to-optimize-image-captionin) - production reference. +- [Promptfoo homepage / docs](https://www.promptfoo.dev/) - canonical entry. +- [Promptfoo - GitHub Action integration](https://www.promptfoo.dev/docs/integrations/github-action/) - PR-comment workflow. +- [promptfoo-action GitHub repo](https://github.com/promptfoo/promptfoo-action) - the Action source. +- [Promptfoo - Claude Agent SDK provider](https://www.promptfoo.dev/docs/providers/claude-agent-sdk/) - first-class support. +- [OpenAI - Acquiring Promptfoo announcement](https://openai.com/index/openai-to-acquire-promptfoo/) - March 2026, MIT license preserved. +- [Inspect AI homepage](https://inspect.aisi.org.uk/) - UK AISI framework. +- [Inspect AI - GitHub repo](https://github.com/UKGovernmentBEIS/inspect_ai) - source. +- [Inspect AI - Agents docs](https://inspect.aisi.org.uk/agents.html) - agent eval patterns including Agent Bridge. +- [Inspect Evals (200+ pre-built)](https://github.com/UKGovernmentBEIS/inspect_evals) - companion eval registry. +- [Helicone homepage](https://www.helicone.ai/) - observability gateway. 
+- [LangWatch blog - 4 best monitoring tools](https://langwatch.ai/blog/4-best-tools-for-monitoring-llm-agentapplications-in-2026) - 2026 landscape. +- [AgentOps overview (15 platforms compared)](https://aimultiple.com/agentic-monitoring) - includes AgentOps positioning. + +### Practitioner blogs and talks + +- [Hamel Husain - LLM Evals: Everything You Need to Know](https://hamel.dev/blog/posts/evals-faq/) - Jan 2026 FAQ; the most up-to-date practitioner take. +- [Hamel Husain - Selecting the Right AI Evals Tool](https://hamel.dev/blog/posts/eval-tools/) - tool-by-tool opinion. +- [Hamel Husain - Inspect AI notes](https://hamel.dev/notes/llm/evals/inspect.html) - Inspect endorsement. +- [Hamel Husain - Using LLM-as-a-Judge guide](https://hamel.dev/blog/posts/llm-judge/) - canonical LLM-as-judge how-to. +- [Hamel Husain - The Revenge of the Data Scientist](https://hamel.dev/blog/posts/revenge/) - March 2026 critique of generic eval frameworks. +- [Eugene Yan - Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) - the foundational survey. +- [Eugene Yan - An LLM-as-Judge Won't Save The Product](https://eugeneyan.com/writing/eval-process/) - process over tooling. +- [Vanishing Gradients podcast Ep 60 - 10 Things I Hate About AI Evals with Hamel Husain](https://vanishinggradients.fireside.fm/60) - opinionated practitioner discussion. +- [AI Engineer Europe 2026 schedule](https://www.ai.engineer/europe/schedule) - Evals & Observability track. +- [AI Engineer World's Fair 2026](https://www.ai.engineer/worldsfair) - June 29-July 2, San Francisco. +- [ZenML LLMOps Database - Notion AI feature evaluation](https://www.zenml.io/llmops-database/building-a-scalable-ai-feature-evaluation-system) - Notion + Braintrust deep dive. +- [Niklas Heidloff - Evaluating Agents via LLM-as-a-Judge in Langfuse](https://heidloff.net/article/langfuse-evaluations/) - applied Langfuse agent eval walkthrough. 
+- [LangChain blog - Agent Evaluation Readiness Checklist](https://www.langchain.com/blog/agent-evaluation-readiness-checklist) - useful checklist regardless of platform. +- [Pragmatic Engineer - A pragmatic guide to LLM evals for devs](https://newsletter.pragmaticengineer.com/p/evals) - dev-oriented overview. +- [O'Reilly - Evals for AI Engineers (book)](https://www.oreilly.com/library/view/evals-for-ai/9798341660717/) - 2025 book-length treatment. diff --git a/agent-evaluation-research/sources/03-toolcalling-and-trajectory.md b/agent-evaluation-research/sources/03-toolcalling-and-trajectory.md new file mode 100644 index 00000000..25ceef16 --- /dev/null +++ b/agent-evaluation-research/sources/03-toolcalling-and-trajectory.md @@ -0,0 +1,519 @@ +# 03 — Tool-Calling, Trajectory, and Workflow-Agent Evaluation + +*Research dossier for the QualOps internal report. Compiled May 8, 2026.* + +## Executive Summary + +QualOps is a workflow / tool-calling agent (Analyze → Review → Fix → Report → Judge) built on the Claude Agent SDK. It does not chat — it reads files, runs greps, spawns subagents, hits APIs, and emits structured findings. Evaluating it well therefore requires a *trajectory-aware* and *outcome-aware* evaluation stack, not chatbot-style scoring. The dominant techniques in 2024–2026 fall into five families: (1) **tool-call accuracy** (AST/exec/argument-F1, à la BFCL); (2) **trajectory metrics** (exact-match, in-order, any-order, edit distance, step success rate); (3) **outcome-grounded evaluation** in sandboxed worlds (τ-bench, AppWorld, WebArena, SWE-bench, MLAgentBench); (4) **agent-as-judge** for open-ended artifacts where unit tests don't exist; and (5) **replay/recorded-trace regression** plus pass^k reliability under non-determinism. 
For a code-review agent the most directly transferable patterns are SWE-bench-style execution-based grading on patches, AppWorld-style state-based scoring of side effects, BFCL-style tool-call AST checks on the structured PR-comment JSON, recorded-trace replay tied to real PRs, and an agent-as-judge for the qualitative "is this review good?" question. This dossier expands each of these, surveys the major benchmarks, and ends with a decision guide mapping QualOps's pipeline stages to specific eval techniques. + +--- + +## 1. What "Tool-Call Accuracy" Actually Means + +"Tool-call accuracy" is a deceptively flat label. In practice it decomposes into a stack of sub-metrics, each measuring a different failure mode. Treat them as orthogonal axes and report all of them — a single scalar will mask the bug you actually care about. + +### 1.1 Exact match vs. semantic match +The simplest score: did the agent emit a tool call whose name + JSON argument blob equals the gold reference, byte-for-byte? This is brittle: `{"path": "src/foo.py"}` vs. `{"path": "./src/foo.py"}` are equivalent but exact-match scores them as wrong. **AST match** (BFCL's term) parses the call into name + (arg-name, arg-value) pairs and matches structurally, allowing argument-order independence and basic type/format normalization. **Semantic match** goes further: an LLM judge or a custom equality function decides whether two argument values are functionally identical (e.g. `"src/foo.py"` ≡ `"./src/foo.py"` ≡ absolute path of same file). + +### 1.2 Argument F1 +For a single call with multiple arguments you can score arguments individually: +- **parameter-name F1**: did the agent pick the right parameter names? +- **parameter-value F1 / accuracy**: given the right name, is the value right? + +Frameworks like Ragas, DeepEval (`ArgumentCorrectnessMetric`), and LangChain trajectory evals all expose this granularity. It matters when partial credit is meaningful — e.g. 
the agent called `grep` with the right `pattern` but wrong `path`; you want to know whether the model is failing at *tool selection* or *argument extraction*. + +### 1.3 Multi-call ordering +Many tasks require an ordered sequence (login → list → write). Possible scoring rules, in increasing leniency: +- **Trajectory exact match**: identical sequence, identical arguments. +- **In-order match**: the predicted trajectory contains the reference sequence as a (possibly non-contiguous) subsequence; extra calls allowed. +- **Any-order match**: predicted trajectory contains the reference set; order doesn't matter. +- **Edit distance / Levenshtein over tool-call sequences**: continuous score reflecting how far off the trajectory is. + +These are codified in Google Cloud's Vertex AI agent eval, LangChain's LangSmith trajectory evals, and AWS Strands. Pick the strictest mode the domain allows; for QualOps, ordering matters between Analyze→Review but is irrelevant within the parallel grep calls of a single phase. + +### 1.4 Partial credit +For a 12-step trajectory, a binary pass/fail loses signal. Partial credit comes from: +- **Step success rate**: fraction of individual steps that executed without error and matched the expected step type. +- **Plan precision/recall**: precision = (predicted steps that appear in reference) / (predicted steps); recall = (reference steps that appear in predicted) / (reference). F1 is their harmonic mean. This is the basis for Ragas's `ToolCallF1`. +- **Optimal path ratio**: actual steps / minimal-known steps. > 1 means the agent took detours. + +### 1.5 Idempotency +Some tools have side effects (write file, post comment, call a credit-card API). Evaluation must distinguish *first call* from *redundant repeat call*. AppWorld's state-based scoring penalizes "collateral damage" — unintended state changes — explicitly; this catches an agent that re-sends the same PR comment three times or double-posts findings. 
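
The matching modes in 1.3 and the plan precision/recall in 1.4 can be sketched in a few lines. This is a minimal illustration that compares tool names only and ignores arguments; the function names and the example tool names are hypothetical and do not mirror any specific framework's API.

```python
from collections import Counter


def exact_match(pred, ref):
    """Strictest mode: identical sequence of tool names."""
    return list(pred) == list(ref)


def in_order_match(pred, ref):
    """Reference appears as a (possibly non-contiguous) subsequence of pred."""
    remaining = iter(pred)
    # `step in remaining` consumes the iterator up to the first hit,
    # so later reference steps can only match later predicted steps.
    return all(step in remaining for step in ref)


def any_order_match(pred, ref):
    """Every reference call appears in pred (as a multiset); order is ignored."""
    return not (Counter(ref) - Counter(pred))


def plan_prf(pred, ref):
    """Multiset precision, recall, and F1 of predicted steps against reference."""
    overlap = sum((Counter(pred) & Counter(ref)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = ["read_file", "grep", "post_comment"]
predicted = ["read_file", "list_dir", "grep", "post_comment"]
print(exact_match(predicted, reference))      # extra call breaks exact match
print(in_order_match(predicted, reference))   # but order is preserved
print(plan_prf(predicted, reference))
```

A production scorer would also compare arguments (for example with AST or semantic matching from 1.1) before counting a step as matched.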
+ +### 1.6 Parallel calls +Agents (and Claude in particular) can emit multiple tool calls in one turn. Scorers must handle a *bag* of tool calls per turn, not a list. The BFCL "parallel" and "parallel-multiple" categories test exactly this: was the right *set* produced regardless of order within the bag? + +### 1.7 Hallucinated tools +The agent invents a tool that doesn't exist (`run_pylint_v2`, when only `run_pylint` is registered) or invents arguments (`--strict-mode` when there's no such flag). This shows up as: tool-name mismatch against the registry, schema-validation failure on arguments, or an `executable accuracy = 0` outcome. BFCL has a dedicated **irrelevance / relevance detection** category specifically to penalize models that fabricate calls when none was needed. + +### 1.8 Missed tools +The flip side: the model should have called a tool but didn't, answering from its own (often outdated) knowledge instead. Recall is the natural metric. In eval frameworks this is "under-tooling"; production teams (Sierra, Anthropic's blog) flag it as one of the top failure modes for production agents because it produces confident-looking wrong answers with no trace to debug. + +--- + +## 2. Trajectory / Plan Evaluation + +A trajectory is the ordered record of (thought, action, observation) triples (or just (action, observation) if you don't expose CoT). Evaluating it answers two distinct questions: + +**Q1 — Did the agent get to the goal?** (outcome / goal-completion). +**Q2 — Did it follow a sensible path?** (process / plan quality). + +These are orthogonal: an agent can stumble to the right answer through a 47-step random walk, or it can take an optimal 3-step path that ends in the wrong final state. + +### 2.1 Step-wise correctness +For each step *i* in the predicted trajectory, score whether step *i* (a) matches the corresponding reference step type, (b) used a sensible tool, (c) used valid arguments. Aggregating gives a **step success rate**. 
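
The hallucinated-tool check from 1.7 and a simple step success rate from 2.1 can be approximated with a registry lookup. The sketch below assumes each call is a dict with `name` and `args` keys and that the registry lists required and optional parameter names; the registry contents and function names are hypothetical. It checks schema validity only, not whether a step matches the reference step type.

```python
# Hypothetical tool registry: tool name -> allowed parameter names.
REGISTRY = {
    "run_pylint": {"required": {"path"}, "optional": {"rcfile"}},
    "grep": {"required": {"pattern", "path"}, "optional": set()},
}


def classify_call(call, registry=REGISTRY):
    """Label one tool call as ok, hallucinated_tool, or invalid_arguments."""
    spec = registry.get(call["name"])
    if spec is None:
        return "hallucinated_tool"  # tool name is not registered
    args = set(call.get("args", {}))
    allowed = spec["required"] | spec["optional"]
    missing_required = not spec["required"] <= args
    unknown_args = not args <= allowed
    if missing_required or unknown_args:
        return "invalid_arguments"  # includes invented flags
    return "ok"


def step_success_rate(calls, registry=REGISTRY):
    """Fraction of calls that pass the registry check."""
    if not calls:
        return 0.0
    ok = sum(classify_call(c, registry) == "ok" for c in calls)
    return ok / len(calls)


trajectory = [
    {"name": "grep", "args": {"pattern": "TODO", "path": "src"}},
    {"name": "run_pylint_v2", "args": {"path": "src"}},           # invented tool
    {"name": "run_pylint", "args": {"path": "src", "strict_mode": True}},  # invented flag
    {"name": "run_pylint", "args": {"path": "src"}},
]
print([classify_call(c) for c in trajectory])
print(step_success_rate(trajectory))
```

Because parallel calls in one turn form a bag rather than a list (1.6), the rate is deliberately order-independent.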
+ +### 2.2 Plan-level precision and recall over expected steps +Treat the reference trajectory as a *set* (or *multiset*) of expected steps. Compute: +- precision = |predicted ∩ reference| / |predicted| +- recall = |predicted ∩ reference| / |reference| +- F1 as usual. + +This is the "any-order" view; tighter variants impose order-aware matching with bipartite alignment. + +### 2.3 Edit distance over trajectories +Treat both trajectories as strings of tool tokens and compute Levenshtein distance, optionally weighted by argument-similarity. Yields a continuous score that reflects "how far off" rather than binary right/wrong. Useful for regression dashboards. + +### 2.4 Goal-conditional vs. path-conditional +- **Goal completion rate** (a.k.a. task success rate): pure outcome. Did the final state / final artifact match the spec? +- **Optimal-path-conformance**: did it follow the canonical path? Two agents can both hit GCR=1 yet differ wildly in cost, latency, and safety. + +The Sierra / τ-bench position is explicit: in production they care primarily about **goal database state** — they compare the post-conversation DB to an annotated goal DB and call it a day. Cursor's `CursorBench` flips this: they care about path quality (code style, efficiency, interaction) too because users *experience* the path. + +### 2.5 Convergence +Fraction of runs that reach a satisfactory terminal state without manual intervention or hitting a turn-budget cap. A separate signal from success: an agent that gets to the answer 80% of the time but loops 20% of the time is operationally distinct from one that gets there 80% and cleanly fails 20%. + +--- + +## 3. Outcome vs. Process Evaluation + +### 3.1 The trade-off +| Aspect | Outcome eval | Process eval | +| --- | --- | --- | +| Data needed | A goal-state checker (unit test, DB diff, regex on output). | Reference trajectories (expensive to author) or a judge model. | +| Scaling | Cheap and deterministic. 
| Expensive (many references) or noisy (judge variance). | +| Catches "lucky shortcuts" | No — agent can game it. | Yes. | +| Catches plan inefficiency | No. | Yes. | +| Penalizes equivalent-but-different paths | No (good). | Yes (bad — risks rewarding rote imitation). | + +### 3.2 When to use each +- **Outcome only**: when the goal is fully and cheaply specifiable (compile, pass tests, DB matches). SWE-bench, MLE-bench, τ-bench, AppWorld all rely heavily on this. +- **Process only**: when there is no ground-truth artifact (open-ended writing, exploratory research). Agent-as-judge thrives here. +- **Hybrid**: production reality. Use outcome for the gate (must pass) and process metrics for diagnostics and ranking when outcomes are roughly equal. + +### 3.3 Pitfalls of pure outcome +- **Reward hacking via lucky shortcut**: An agent finds a single-line trick that passes the FAIL_TO_PASS test but doesn't actually fix the bug class. The SWE-bench harness partly mitigates this by also requiring PASS_TO_PASS, i.e. nothing that previously passed may break. +- **Spec ambiguity**: the goal-state check is wrong or under-specified, so the agent gets credit for the wrong reason. (See SWE-bench → SWE-bench Verified: OpenAI's annotation pass filtered out roughly two-thirds (68.3%) of the SWE-bench samples it screened, for underspecified issue statements or unit tests that unfairly reject valid solutions.) +- **Non-reproducibility**: stochastic tools (a flaky web page) make the goal-state non-deterministic. + +### 3.4 Pitfalls of pure process +- **Path rigidity**: penalizing a faster, equivalent path. Anthropic's eval blog lists rigid grading of this kind among the top pitfalls they see. +- **Reference-trajectory bias**: human authors write idealized trajectories that don't reflect how an LLM actually thinks; comparing against them rewards mimicry over capability. + +--- + +## 4. Major Benchmarks for Tool-Calling and Workflow Agents + +A reference table is at the end of the section. Detailed methodology + criticism follows. 
+ +### 4.1 Berkeley Function-Calling Leaderboard (BFCL) +- **v1 (Feb 2024)**: 2,000+ Q/A pairs across Java, JS, Python. Evaluates **AST accuracy** (parse the predicted call, check name + arg-name + arg-type match against gold) and **executable accuracy** (run the call in a sandbox; check return value or HTTP response). Categories: simple, multiple, parallel, parallel-multiple, REST, irrelevance. +- **v2 (Aug 2024)**: live, user-contributed, real-world functions; addresses contamination/freshness criticism. +- **v3 (Sep 2024)**: adds **multi-turn** and **multi-step** function calling. Replaces parameter AST matching for these tasks with **state-based and response-based evaluation** — i.e. after the model executes its sequence, the actual API system state (e.g. file system, mock CRM) is compared to ground truth. This is a meaningful methodological shift: BFCL v3 effectively becomes an outcome-eval for trajectory tasks. Datasets are hand-curated (API codebase → graph edges → tasks → human-labeled trajectories). +- **v4 (2025)**: ongoing, broader API surface and additional categories. +- **Known criticisms**: (a) AST match can disagree with semantic equivalence (Databricks's "Beyond the Leaderboard" post); (b) the multi-turn slice is small; (c) evaluation prompts deliberately avoid ReAct/CoT scaffolds, so leaderboard rank may not predict harness performance. + +### 4.2 τ-bench / τ²-bench / τ³-bench (Sierra) +- **Setup**: A simulated user (LLM-driven) chats with the agent over many turns. Agent has access to a small set of domain APIs (retail and airline domains in v1) plus a written policy document. Tasks are scenarios with an annotated goal database state. +- **Eval**: post-conversation DB compared to goal DB. Plus a **pass^k** metric — probability the agent succeeds *k* times in a row, exposing reliability/variance. +- **Findings (June 2024)**: GPT-4o solves <50% of retail; pass^8 in retail is <25% — i.e. consistency, not headline accuracy, is the bottleneck. 
+- **τ²-bench**: adds telecom domain, dual-control (user must take actions too), more realistic policies. +- **τ³-bench (2025)**: adds knowledge retrieval and voice modality. +- **Why it matters for QualOps**: pass^k methodology directly transfers — for a code-review agent, "did it produce the same finding 8 times in a row?" is a more honest measure than a single run. + +### 4.3 ToolBench / ToolLLM (OpenBMB) +- **Construction**: 16,464 RapidAPI APIs across 49 categories, ~120K instruction–API training/eval pairs. Three evaluation splits: I1 (single-tool), I2 (intra-category multi-tool), I3 (intra-collection multi-tool). +- **Solution paths**: annotated via DFSDT (depth-first search-based decision tree). +- **Evaluator**: ToolEval, an LLM-as-judge that scores both pass-rate and "win-rate" of one model against another. +- **Criticism**: API instability — many RapidAPI endpoints went stale or were paywalled; Tsinghua/Alibaba's StableToolBench paper (ACL 2024) addresses this with mocked APIs and stable evaluation. +- **Note**: confusingly there are two unrelated "ToolBench" projects — OpenBMB's (the major one) and SambaNova's earlier eval suite. + +### 4.4 API-Bank +- **Scope**: 73 APIs, 314 tool-use dialogues, 753 annotated API calls. +- **Evaluation**: runnable — calls are dispatched to mocked or real APIs and outputs scored. +- **Levels**: tests (1) ability to call a given, relevant API, (2) ability to retrieve the right API from a registry, (3) ability to plan multi-step calls. +- **Why it matters**: smaller and more curated than ToolBench, easier to mine for test cases. + +### 4.5 WebArena / VisualWebArena / WorkArena +- **WebArena (Carnegie Mellon, Aug 2023)**: full-stack reproducible mock websites (e-commerce, GitLab, Reddit-clone, content management). Tasks are end-to-end user goals; a programmatic checker verifies the final DB / UI state. GPT-4 baseline 14.4%; humans 78.2%. 
+- **VisualWebArena (2024)**: 910 tasks requiring visual understanding (image search, visual product matching). +- **WorkArena / WorkArena++ (ServiceNow)**: 33–682 enterprise SaaS tasks (ServiceNow ticketing, knowledge management). WorkArena++ adds compositional / verification tasks. +- **WebArena-Verified (ServiceNow, 2025)**: cleaned subset addressing the original benchmark's ambiguous-task problem. +- **Relevance to QualOps**: low — QualOps doesn't drive a browser. But the *evaluation methodology* (declarative goal-state checkers, programmatic + LLM-judge hybrids) is fully transferable. + +### 4.6 AgentBench (THUDM, ICLR 2024) +- **Eight environments**: OS (bash), DB (SQL), KG (knowledge graph), digital card game, lateral thinking puzzles, house-holding (ALFWorld), web shopping, web browsing (Mind2Web). +- **Metrics**: success rate per environment + an aggregated score; F1 / reward where appropriate. +- **Architecture**: server-client + Docker, so each task runs in an isolated container. +- **Use as a template**: AgentBench is less a leaderboard (somewhat dated) and more a *reference architecture* for how to spin up many environments behind a uniform agent-side API. + +### 4.7 GAIA (Meta, 2023) +- **466 hand-crafted general-assistant questions**, three difficulty levels. Each has a unique factual answer (string/number) so grading is exact-match string comparison — robust and cheap. +- **Tools used**: web browsing, file inspection, multimodal. +- **Headline gap**: humans 92%, GPT-4 + plugins 15% (at launch). 2025 frontier agents (e.g. H2O.ai's h2oGPTe) hit 65–75% on the dev set. +- **Why it matters**: shows you can get rigorous outcome eval out of *general* tasks if every answer is a checkable string. + +### 4.8 SWE-bench family — the most relevant for QualOps +- **SWE-bench (Princeton, 2023)**: 2,294 GitHub issue + PR pairs from 12 popular Python repos. Agent receives issue text + a snapshot of the repo. Submits a patch. 
Patch is applied and the repo's test suite is run with two test sets: **FAIL_TO_PASS** (the tests added by the original PR — must now pass) and **PASS_TO_PASS** (existing tests — must still pass). Resolves-issue rate is the headline metric. This is **execution-based, outcome-only, deterministic, and forgiving of any path** — the gold standard pattern for code-agent eval. +- **SWE-bench Lite (2024)**: 300 instances filtered for shorter, more contained patches. The fast-feedback subset most groups iterate on. +- **SWE-bench Verified (OpenAI, Aug 2024)**: 500 instances, human-validated for clear specs and correct hidden test sets. Has effectively replaced the original full set as the headline benchmark — most leaderboards quote Verified. +- **SWE-bench Multimodal (2024)**: front-end / JS issues with screenshots; eval is **kept private** to prevent contamination. +- **SWE-bench Multilingual (2025)**: 300 tasks across 9 languages. +- **SWE-bench Live (2025–2026)**: rolling release of 50 newly-verified issues per month, scraped from active repos. Directly addresses contamination — frontier models can't have seen the test set during training. +- **SWE-bench Pro (Scale AI, Sep 2025)**: 1,865 instances (731 public + 858 held-out + 276 commercial), 41 repos including enterprise codebases. Patches are larger (avg 107 lines, 4.1 files) and tasks are long-horizon. GPT-5 23.3%, Claude Opus 4.1 23.1% at launch — i.e. enterprise-scale code-agent tasks remain hard. +- **Criticism / mutation work**: the "Saving SWE-Bench" paper (Oct 2025) proposes systematic mutation of test cases to detect lucky-shortcut hacks; recommended reading if QualOps wants to harden its own eval. +- **Direct relevance to QualOps**: the SWE-bench harness — clone repo, apply agent's patch, run tests, classify by FAIL_TO_PASS / PASS_TO_PASS — is **the** template for evaluating the Fix stage of QualOps. 
Recommendation: mine SWE-bench Verified for cases where the original PR was a code-quality fix (refactor, lint cleanup, type fix) and use those as QualOps regression tests. + +### 4.9 MLAgentBench / MLE-bench +- **MLAgentBench (Stanford, 2023)**: 13 ML experimentation tasks; agent acts via a ReAct loop with read/write/execute. Outcome metric: improvement over a baseline model on a held-out set. +- **MLE-bench (OpenAI, Oct 2024)**: 75 Kaggle competitions; agent must produce a submission CSV. Outcome metric: medal-level performance against the human Kaggle leaderboard. +- **Pattern of interest**: agents act in a real shell with real Python; eval is purely outcome-based on a held-out scorer. This is the "give the agent a sandbox and grade what comes out" pattern par excellence. + +### 4.10 AppWorld (Stony Brook, ACL 2024 best resource paper) +- **Engine**: 9 simulated apps (Venmo, Spotify, Gmail, etc.) with 457 APIs and 100 fictional users; 60K LoC environment, 40K LoC benchmark. +- **Tasks**: 750 natural-language tasks (e.g. "split last weekend's Venmo charges with my roommates"). +- **Eval**: **state-based unit tests** — checks both that the goal state is reached *and* that no unintended state changed (collateral damage). MCP-compatible as of 2025. +- **GPT-4o**: ~49% normal, ~30% challenge. +- **Why it matters**: the cleanest available demonstration of state-based programmatic eval for tool-calling agents, including idempotency / collateral-damage checks. + +### 4.11 Other 2025–2026 releases worth knowing +- **TRAJECT-Bench (2025)**: focuses on trajectory-quality metrics rather than just outcomes. +- **WABER (Microsoft Research, 2025)**: web-agent reliability/efficiency benchmark, builds on WebArena with formal reliability bounds. +- **ARE (Meta FAIR, Sep 2025)**: scalable agent environments + auto-generated evals. +- **Efficient Agents (Aug 2025)**: small-model agents on GAIA at lower cost — useful baseline if you care about $/task. 
+- **FHIR-AgentBench (Sep 2025)**: domain-specific (healthcare interoperability) — example of how to build a vertical eval if QualOps later wants a "code-review-specific" benchmark. + +### 4.12 Comparison Table + +| Benchmark | Year | Focus | Size | Scoring | Link | +| --- | --- | --- | --- | --- | --- | +| BFCL v3 | 2024 | Function-call accuracy + multi-turn | 2,200+ | AST + execution + state-based | https://gorilla.cs.berkeley.edu/leaderboard.html | +| τ-bench | 2024 | Tool-agent-user dialog | 2 domains, ~165 tasks | DB-state + pass^k | https://github.com/sierra-research/tau-bench | +| τ²-bench | 2025 | Multi-actor dual-control | 3 domains | DB-state + pass^k | https://github.com/sierra-research/tau2-bench | +| ToolBench (OpenBMB) | 2023 | API tool use at scale | 16K APIs / 120K pairs | LLM-judge (ToolEval) | https://github.com/OpenBMB/ToolBench | +| API-Bank | 2023 | Tool retrieval + planning | 73 APIs / 314 dialogues | runnable + match | https://openreview.net/forum?id=o2HBfgY20b | +| WebArena | 2023 | Web tasks | 812 tasks | programmatic state checks | https://webarena.dev/ | +| VisualWebArena | 2024 | Multimodal web | 910 tasks | programmatic | https://jykoh.com/vwa | +| WorkArena++ | 2024 | Enterprise SaaS | 33–682 tasks | execution-based | https://github.com/ServiceNow/WorkArena | +| AgentBench | 2024 | Multi-environment | 8 envs | success rate / F1 | https://github.com/THUDM/AgentBench | +| GAIA | 2023 | General assistant | 466 Qs | exact-match string | https://huggingface.co/datasets/gaia-benchmark/GAIA | +| SWE-bench | 2023 | Code: GH issues | 2,294 | run unit tests | https://www.swebench.com/ | +| SWE-bench Verified | 2024 | Code, validated | 500 | run unit tests | https://www.swebench.com/verified.html | +| SWE-bench Multimodal | 2024 | Code + screenshots | private | run unit tests | https://www.swebench.com/multimodal.html | +| SWE-bench Multilingual | 2025 | Code, 9 languages | 300 | run unit tests | 
https://www.swebench.com/multilingual-leaderboard.html | +| SWE-bench Live | 2025 | Fresh GH issues/mo | rolling | run unit tests | https://swe-bench-live.github.io/ | +| SWE-bench Pro | 2025 | Long-horizon enterprise | 1,865 | run unit tests | https://github.com/scaleapi/SWE-bench_Pro-os | +| MLAgentBench | 2023 | ML experimentation | 13 tasks | outcome metric | https://github.com/snap-stanford/MLAgentBench | +| MLE-bench | 2024 | Kaggle competitions | 75 | leaderboard rank | https://github.com/openai/mle-bench | +| AppWorld | 2024 | Daily-life apps | 750 tasks | state-based unit tests | https://appworld.dev/ | +| DevAI (Agent-as-Judge) | 2024 | AI dev tasks | 55 / 365 reqs | agent judge | https://github.com/metauto-ai/agent-as-a-judge | +| TRAJECT-Bench | 2025 | Trajectory quality | n/a | trajectory metrics | https://www.emergentmind.com/topics/traject-bench | + +--- + +## 5. Code-Agent-Specific Evaluation (most relevant for QualOps) + +### 5.1 Test-execution as oracle +The SWE-bench pattern is the single most influential idea in code-agent eval: *the agent's output is graded by running tests*. Concretely: +1. Apply the agent's patch to a clean repo snapshot. +2. Run the FAIL_TO_PASS set — these are tests that exercised the bug; they must now pass. +3. Run the PASS_TO_PASS set — pre-existing tests that must still pass (no regressions). +4. Resolved = both sets pass. + +This is fully deterministic, fully outcome-based, ignores how the agent got there, and resists most reward hacking — a "fix" that monkey-patches the test harness or `import sys; sys.exit(0)`s usually breaks PASS_TO_PASS. + +### 5.2 pass@k vs. pass^k for code agents +- **pass@k** (Codex/HumanEval origin): agent gets *k* attempts and the task counts as solved if any one passes. Reflects "given *k* retries, can it succeed at least once?" +- **pass^k** (Sierra τ-bench): agent must succeed *k* times in a row, i.e. on all *k* independent attempts. Reflects "is it reliable enough to deploy?" 
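To make the contrast concrete, here is a small sketch that estimates both metrics from the same batch of runs. It assumes *n* independent runs of one task with *c* successes, and uses the combinatorial estimators as we understand them from the Codex/HumanEval paper (pass@k) and the τ-bench paper (pass^k); the function names are illustrative, not taken from either codebase.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k runs, drawn from n runs with c successes, passes."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made up only of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k runs, drawn from n runs with c successes, pass (pass^k)."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up only of successes.
    return comb(c, k) / comb(n, k)


# An agent that succeeded on 4 of 8 runs of the same task.
n, c = 8, 4
summary = {
    k: (pass_at_k(n, c, k), pass_hat_k(n, c, k))
    for k in (1, 4)
}
```

With 4 successes out of 8 runs, both metrics equal 0.5 at k=1, but at k=4 pass@k rises to 69/70 while pass^k falls to 1/70. The same run data can therefore look excellent or undeployable depending on which metric is reported.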
+ +For QualOps, **pass^k is the more honest measure** — a code reviewer that catches the bug 50% of the time is not deployable, even if pass@4 looks great. + +### 5.3 "Did the suggested fix actually fix the bug?" +Three increasingly strict variants for QualOps's Fix stage: +1. **Patch applies cleanly**: trivial syntactic check. +2. **Patch passes new tests** (FAIL_TO_PASS analog): need to author tests that pin the bug, or mine them from existing PRs. +3. **Patch is semantically equivalent to the human PR**: harder; needs LLM-judge or AST-diff with allowable equivalences. + +For Review-stage findings (no patch, just a comment) the analog is: +1. **Finding location precision/recall**: did the agent flag the right line / file? +2. **Finding-class match**: did it categorize the issue correctly (security vs. perf vs. style)? +3. **Finding–PR alignment**: does the finding correspond to something the human reviewer also flagged? + +### 5.4 Patch correctness beyond tests +Tests don't catch every flavor of bad fix: +- **Style / readability regression**: tests pass, but the diff is ugly, over-broad, or violates project conventions. +- **Performance regression**: tests pass but quietly add an O(n²). +- **Security regression**: tests pass but the patch introduces a new vuln. + +These need additional graders: linter / formatter delta, perf benchmark, CodeQL/Semgrep diff, LLM-judge with explicit criteria. Cursor's CursorBench explicitly grades "code quality" and "efficiency" alongside correctness for this reason. + +### 5.5 Cognition's approach (Devin) +Cognition's blog "A review of OpenAI's o1 and how we evaluate coding agents" describes their internal `cognition-golden` benchmark: real-task-pattern tasks with full development environments where evaluator agents (with Devin's own tools — bash, browser, editor) autonomously judge outcomes. 
They describe two complementary axes: (1) deterministic evaluators (compilers, linters, tests) — preferred when applicable; (2) **agent-evaluators** that look at the final state and judge open-endedly. They also use simulated users for the questioning behavior. This is a strong model for QualOps: deterministic checks for what's deterministic; agent judges for the rest. + +### 5.6 Datasets to mine for QualOps test cases +- SWE-bench Verified — filter for issues labeled `code-quality`, `refactor`, `style`, `type`, `lint`. +- SWE-bench Live monthly drops — fresh, uncontaminated. +- SWE-bench Multilingual — if QualOps targets multiple languages. +- AppWorld — if you want to test the *workflow harness* on non-code tasks. +- Your own internal PR history — by far the highest-signal source. Convert past PRs into (pre-PR repo state, issue or commit message, set of human review comments, accepted patch). This becomes a proprietary `qualops-golden`. + +--- + +## 6. Agent-as-Judge + +### 6.1 Original paper +Zhuge et al., *Agent-as-a-Judge: Evaluate Agents with Agents* (arXiv 2410.10934, Oct 2024; ICML 2025). Core proposal: instead of an LLM-as-judge that sees only the final answer, give the *judge* itself agentic capabilities — tools, file system access, the ability to run code, the ability to inspect intermediate steps in the candidate agent's transcript. They release **DevAI**, 55 AI-dev tasks with 365 hierarchical requirements, and show: +- Agreement with human expert ~90% (vs. ~70% for plain LLM-judge). +- Cost reduction ~97% (86 h / $1,297 → ~2 h / $31). + +### 6.2 Why it works (and when it doesn't) +**Works well when:** +- The artifact is open-ended (no unit tests possible) — "is this PR comment helpful and accurate?" +- Evaluation requires looking at intermediate steps — "did the agent actually verify this finding by reading the file, or hallucinate it?" +- You have a structured rubric the judge can iterate over (DevAI's hierarchical requirements). 
+ +**Doesn't help when:** +- The judge shares the candidate's biases (same model family — self-preference bias). +- Stakes require human sign-off anyway. +- A simple deterministic check exists — using a judge is just extra cost and noise. + +### 6.3 Vs. plain LLM-as-judge +Plain LLM-judge: candidate produces final artifact → judge LLM sees (input, artifact, rubric) → returns score. +Agent-as-judge: judge can also call tools — open files, run greps, execute the artifact, inspect the candidate's own trace. Higher fidelity but more expensive and harder to make deterministic. + +### 6.4 Pitfalls +- **Self-preference bias**: judge prefers outputs from its own model family. Mitigate with judge ensembles or cross-model judging. +- **Spec leakage**: if the judge sees the rubric verbatim, the candidate (if it also sees rubric-derived prompts) can game it. +- **Variance**: agent judges have higher variance than rubric-grader pipelines; budget for n=3+ runs. + +### 6.5 For QualOps +The Judge stage in your pipeline already smells like agent-as-judge applied internally. For evaluation, an *external* agent-as-judge is well suited to grading "was this PR review good?" — give it the diff, the agent's findings, the actual human-merged PR, and a rubric, and let it use bash/grep to verify each finding against the code. + +--- + +## 7. Simulation-Based / Sandbox / Replay Evaluation + +### 7.1 Mocked or recorded tools +Two flavors: +- **Mocked**: hand-written or generated fake implementations (StableToolBench's approach to dead RapidAPI endpoints; AppWorld's whole engine; τ-bench's API mocks). Pro: deterministic, reproducible. Con: doesn't catch integration bugs. +- **Recorded**: capture real tool I/O once, replay forever. Like VCR cassettes for HTTP. Pro: realism. Con: brittle to tool drift; cassettes go stale. + +### 7.2 Replay testing as regression +Pattern (used by Braintrust, LangSmith, Arize Phoenix, internally by Anthropic, Cognition): +1. 
Capture every production run as a trace (inputs, all tool calls, all outputs, final artifact). +2. Tag interesting traces — failures, edge cases, customer escalations — into a regression set. +3. On every prompt / model / harness change, replay each trace: feed the same input, *but stub tool calls with the recorded outputs*, and observe whether the agent makes equivalent decisions. +4. Diff: were tool-call sequences equivalent? Did the final artifact differ? + +Sakura Sky's "Trustworthy AI Agents: Deterministic Replay" article describes this exactly. AgentRR (arXiv 2505.17716, May 2025) formalizes the record-and-replay paradigm. + +For QualOps: every PR review you ship is already a trajectory. Sample some, freeze them, and you have a regression suite that tracks model / prompt drift better than any synthetic benchmark. + +### 7.3 Counterfactual replay +Replay with one variable changed: same input, swap the model; same model, perturb the prompt; same model and prompt, swap one tool's response to test robustness. The "Seeing the Whole Elephant" failure-attribution paper (arXiv 2604.22708) uses this for failure attribution in multi-agent systems. + +### 7.4 Hybrid: synthetic environments behind real harness +Pattern: build a controlled environment (a fixture repo with a known set of bugs) but run the production agent harness against it unmodified. SWE-bench is exactly this. AppWorld is exactly this. For QualOps, a fixture repo with N seeded bugs of various classes is cheap and gives a stable baseline. + +--- + +## 8. Trajectory-Level Metric Glossary (know-by-name) + +- **Trajectory exact match** — predicted == reference, identical tool calls in identical order. Strictest. +- **Trajectory in-order match** — reference is an ordered subsequence of predicted; extra calls allowed. +- **Trajectory any-order match** — reference is a subset of predicted; order-agnostic. +- **Tool-call F1** (a.k.a. 
ToolCallF1) — set-level precision/recall over (tool, args) pairs; harmonic mean. +- **Argument F1 / Argument Correctness** — per-call argument-level precision/recall. +- **Tool selection accuracy** — given the right step boundary, did it pick the right tool name (ignore args). +- **Action similarity** — embedding-based or LLM-judged similarity between action and reference action. Useful when arguments are free-form text (e.g. PR-comment body). +- **AgentBench success rate** — task completion fraction per environment. +- **AST match (BFCL)** — parsed-call structural equality; argument-order-agnostic. +- **Execution match (BFCL)** — call executed in sandbox returns ground-truth value. +- **Goal Completion Rate / Task Success Rate** — final-state binary outcome. +- **Step Success Rate** — fraction of trajectory steps that succeed. +- **Convergence** — fraction of runs reaching a terminal state without timeout/intervention. +- **Optimal Path Ratio** — actual_steps / minimal_steps. +- **pass@k** — succeed at least once in *k* trials. +- **pass^k** — succeed every time in *k* trials (Sierra). +- **State-based eval (BFCL v3 / AppWorld / τ-bench)** — compare environment state after run to gold state, optionally penalizing collateral damage. +- **Response-based eval (BFCL v3)** — compare model's natural-language reply for keyword/semantic match. + +--- + +## 9. Calibration Under Non-Determinism + +### 9.1 Why agents are non-deterministic +Even at temperature 0, modern serving stacks are non-deterministic (batched inference, kernel non-determinism on GPUs — see "Non-Determinism of 'Deterministic' LLM Settings", arXiv 2408.04667). On top of that: tool outputs change (search results, web pages, time), and many agents deliberately sample with temperature > 0. + +### 9.2 Implications for evaluation +Single-run pass/fail is statistically meaningless beyond a coarse signal. 
Reliable evaluation needs: +- **Multiple runs per task** — minimum 3 to compute a stable mean; 5–10 for sharper variance estimates; 30+ if you need confidence intervals on small effects. +- **pass^k reporting** — alongside pass@1; the gap is informative. +- **Confidence intervals** — Anthropic's "A Statistical Approach to Model Evals" (2024) walks through bootstrap CIs and the math for correctly comparing two model scores. tl;dr: a 5-point gap on 200 tasks is usually within the noise floor; reporters routinely overclaim. +- **Voting / self-consistency** — at evaluation time you can also use majority-vote over n runs as the candidate's "answer" (separately from how the production system runs). +- **Fixed seeds where possible** — partial mitigation; not a substitute for repetition. + +### 9.3 Practical defaults +- For tracked metrics: ≥5 runs per task, report mean and std. +- For decisions ("ship this prompt change"): require a statistically significant improvement, not a single-point gain. +- Cap turn budgets per run to keep variance bounded. +- Snapshot tool outputs (replay) for the regression suite so non-determinism is isolated to model variance only. + +--- + +## 10. Practical Patterns from Production Teams + +### 10.1 Anthropic — "Demystifying evals for AI agents" (engineering blog, Jan 2026) +- Taxonomy of graders: **code-based**, **model-based** (LLM-judge), **human**. +- Score aggregation modes: weighted, binary (all-must-pass), hybrid. +- Eval the **harness + model**, not the model alone — the harness (Claude Agent SDK in QualOps's case) is part of the system under test. +- Top pitfalls called out: rigid grading that punishes equivalent answers; ambiguous task specs; stochastic tasks that can't be reproduced. +- Bloom — Anthropic's open-source automated behavioral eval tool — and the agent-autonomy work (https://www.anthropic.com/research/measuring-agent-autonomy) are companion reads. 
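The three aggregation modes named in the list above can be sketched as follows. This is an illustration under assumptions, not Anthropic's implementation: grader scores are taken to be floats in [0, 1], and "hybrid" is interpreted here as must-pass gates on some graders plus a weighted mean over the rest. All function names, grader names, and weights are hypothetical.

```python
def weighted(scores: dict, weights: dict) -> float:
    """Weighted mean of grader scores."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total


def binary(scores: dict, threshold: float = 1.0) -> bool:
    """All-must-pass: every grader has to reach the threshold."""
    return all(score >= threshold for score in scores.values())


def hybrid(scores: dict, weights: dict, gates: set) -> float:
    """Assumed hybrid: gate graders must fully pass, the rest are averaged."""
    if any(scores[name] < 1.0 for name in gates):
        return 0.0
    rest = {name: s for name, s in scores.items() if name not in gates}
    return weighted(rest, weights) if rest else 1.0


# Hypothetical grader outputs for one QualOps-style run.
scores = {"tests_pass": 1.0, "schema_valid": 1.0, "judge_quality": 0.6, "style": 0.8}
weights = {"tests_pass": 1, "schema_valid": 1, "judge_quality": 3, "style": 1}

overall_weighted = weighted(scores, weights)
overall_binary = binary(scores)
overall_hybrid = hybrid(scores, weights, gates={"tests_pass", "schema_valid"})
```

The gated form keeps deterministic checks (tests, schema) as hard requirements while still letting softer judge scores rank runs that clear the gate.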
+ +### 10.2 Sierra — τ-bench in production +- Sierra runs τ-bench-style internal benchmarks for every customer-facing agent before deploy. +- Their public position: pass^k is the production-relevant metric; pass@1 hides a 2× reliability gap. +- They use simulated-user dialogs in eval because that matches their product surface. + +### 10.3 Cognition — Devin +- `cognition-golden` internal benchmark with train/test split; train side used for self-improvement loops, test side as a hold-out. +- Hybrid evaluator stack: deterministic (tests, compilers, linters) where possible; agent-evaluators (with Devin's tools) for open-ended judgment. +- Simulated users that can answer Devin's clarifying questions, modeling the realistic case where the agent has missing info. +- Public SWE-bench technical report (https://cognition.ai/blog/swe-bench-technical-report) details how they audit their own pipeline for contamination and harness drift. + +### 10.4 Cursor — CursorBench +- Multi-axis grading: solution correctness, code quality, efficiency, interaction quality. +- Offline (`CursorBench`) + on-policy live-traffic A/B catches a class of regressions where the agent looks correct to a grader but feels worse to a user. +- Public post: https://cursor.com/blog/cursorbench. + +### 10.5 Sourcegraph (Cody, Amp) +- Heavy emphasis on retrieval correctness — did the agent pull the right context before answering? Treat retrieval as a first-class tool call and evaluate it with precision/recall. +- Codebase-graph awareness as eval signal: did the agent's edits respect call-graph dependencies? + +### 10.6 Replit +- Standard pytest/Jest test running for outcome eval on user code, but no built-in agent-specific eval harness — relies on user-defined tests as oracle. + +### 10.7 Amazon Q Developer +- Public SWE-bench numbers as the headline (e.g. 38.8% on SWE-bench Verified at one point). 
+- Internally: trace-grading (capture full agent traces, run rubric graders) plus latency / cost / resource-efficiency metrics alongside accuracy. + +### 10.8 OpenAI +- `openai/evals` — generic LLM eval framework, not agent-specific but extended. +- Agent-Evals API (Platform): trace grading + structured rubric scorers; their guide explicitly recommends recording every model and tool call to a trace and grading the trace, not just the final answer. + +### 10.9 Common pattern across teams +Every serious production team converges to roughly the same five-layer stack: +1. Unit-style assertions on tool calls (BFCL-style). +2. End-to-end execution evals on synthetic fixtures (AppWorld / SWE-bench style). +3. Recorded-trace replay as regression net. +4. LLM-judge or agent-judge for open-ended quality. +5. Live-traffic A/B and human review on the long tail. + +--- + +## 11. Decision Guide — "If your agent does X, evaluate with Y" + +| Situation | Use this technique | +| --- | --- | +| Single-call function selection (one tool, one argument set) | BFCL-style **AST match** + **argument F1** | +| Multi-step deterministic workflow (login → list → action) | **Trajectory in-order match** + **state-based eval** of final environment | +| Parallel tool calls in one turn | **Set-equality** match (bag of calls); never order-sensitive | +| Tool arguments are free-form text (e.g. 
PR comment body) | **Action similarity** (embedding or LLM-judge), not exact match | +| Side-effecting tools (writes, posts, deletes) | **State-based eval with collateral-damage check** (AppWorld pattern) | +| Output is a code patch | **SWE-bench harness**: apply patch + FAIL_TO_PASS + PASS_TO_PASS | +| Output is open-ended text (review summary, design doc) | **Agent-as-judge** with structured rubric | +| Need to detect hallucinated tools | **Schema validation** + tool-name whitelist + irrelevance category | +| Need to detect missed tools | **Recall** against reference trajectory; track under-tooling rate | +| Variance/reliability concern | **pass^k** with k≥5; report mean + 95% CI | +| Catching prompt/model regressions | **Recorded-trace replay** with tool stubs | +| Validating against fresh / contamination-free data | **SWE-bench Live** or your own freshly-mined PRs | +| Long-horizon multi-stage agent (QualOps's case) | Hybrid: per-stage tool-call F1 + per-stage state checks + end-to-end outcome + agent-as-judge on final report | + +### 11.1 Specific recipe for QualOps +- **Analyze stage**: tool-call F1 against a reference set of grep/file-read calls per fixture repo. Penalize hallucinated tools, track under-call rate. +- **Review stage**: location precision/recall on flagged lines; finding-class accuracy; agent-as-judge on textual quality of comments. +- **Fix stage**: SWE-bench-style harness — apply suggested patch, run repo tests, FAIL_TO_PASS + PASS_TO_PASS. Plus linter/formatter delta to catch style regressions. +- **Report stage**: schema validation on emitted JSON; agent-as-judge for narrative coherence. +- **Judge stage** (QualOps's own internal judge): meta-evaluate by comparing the internal Judge's pass/fail call to a held-out human-labeled pass/fail. Agreement rate is your meta-judge metric. 
+- **Across stages**: pass^5 over a fixed set of 50–200 fixture PRs; recorded-trace replay on the last 200 production PRs as regression net; statistical CIs reported on every comparison. + +--- + +## 12. References + +### Primary papers +- Patil, Mao, et al. **The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models.** ICML 2025 / OpenReview. https://openreview.net/forum?id=2GmDdhBdDk — BFCL v1–v3 methodology and evaluation correlations. +- Yao et al. **τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.** arXiv 2406.12045. https://arxiv.org/abs/2406.12045 — pass^k metric, simulated-user dialog eval. +- Qin et al. **ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.** ICLR 2024. https://arxiv.org/abs/2307.16789 — RapidAPI-scale tool use, DFSDT, ToolEval. +- Li et al. **API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs.** OpenReview. https://openreview.net/forum?id=o2HBfgY20b — runnable API eval, 73 APIs. +- Zhou et al. **WebArena: A Realistic Web Environment for Building Autonomous Agents.** arXiv 2307.13854. https://arxiv.org/abs/2307.13854 — programmatic state checks for web agents. +- Koh et al. **VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks.** https://jykoh.com/vwa — multimodal extension of WebArena. +- Drouin et al. **WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?** ServiceNow Research — enterprise SaaS workflows. +- Liu et al. **AgentBench: Evaluating LLMs as Agents.** ICLR 2024. https://arxiv.org/abs/2308.03688 — 8-environment benchmark, server-client architecture. +- Mialon et al. **GAIA: A Benchmark for General AI Assistants.** arXiv 2311.12983. https://arxiv.org/abs/2311.12983 — exact-match string grading for general-assistant tasks. +- Jimenez et al. **SWE-bench: Can Language Models Resolve Real-World GitHub Issues?** ICLR 2024. 
https://www.swebench.com/ — execution-based code-agent grading. +- OpenAI. **Introducing SWE-bench Verified.** https://openai.com/index/introducing-swe-bench-verified/ — 500-instance human-validated subset. +- Scale AI. **SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?** arXiv 2509.16941. https://arxiv.org/abs/2509.16941 — enterprise-scale code agent benchmark. +- **SWE-bench Live.** https://swe-bench-live.github.io/ — rolling fresh-issue release (50/month). +- Huang et al. **MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation.** arXiv 2310.03302. https://arxiv.org/abs/2310.03302 — ML research agent eval. +- Chan et al. **MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering.** arXiv 2410.07095. https://arxiv.org/abs/2410.07095 — Kaggle-grounded ML-agent benchmark. +- Trivedi et al. **AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents.** ACL 2024 best resource paper. https://arxiv.org/abs/2407.18901 — 750 tasks with state-based + collateral-damage eval. +- Zhuge et al. **Agent-as-a-Judge: Evaluate Agents with Agents.** arXiv 2410.10934. https://arxiv.org/abs/2410.10934 — agentic judges, DevAI benchmark. +- Liu et al. **Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation.** arXiv 2510.08996. https://arxiv.org/abs/2510.08996 — adversarial mutations to detect lucky shortcuts. +- Atil et al. **Non-Determinism of 'Deterministic' LLM Settings.** arXiv 2408.04667. https://arxiv.org/html/2408.04667v5 — why temperature=0 isn't enough. +- Zhang et al. **AgentRR: Get Experience from Practice — LLM Agents with Record & Replay.** arXiv 2505.17716. https://arxiv.org/abs/2505.17716 — formal record-and-replay paradigm. +- **Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems.** arXiv 2604.22708. https://arxiv.org/html/2604.22708v1 — counterfactual replay for failure attribution. 
+ +### Engineering / production blogs +- Anthropic. **Demystifying evals for AI agents.** https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents — taxonomy of graders, pitfalls. +- Anthropic. **A Statistical Approach to Model Evals.** https://www.anthropic.com/research/statistical-approach-to-model-evals — bootstrap CIs and significance for model comparisons. +- Anthropic. **Building agents with the Claude Agent SDK.** https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk — harness + model evaluation framing. +- Anthropic. **Bloom: an open source tool for automated behavioral evaluations.** https://www.anthropic.com/research/bloom — open-source eval tool. +- Anthropic. **Measuring AI Agent Autonomy in Practice.** https://www.anthropic.com/research/measuring-agent-autonomy — autonomy-axis eval framing. +- Sierra. **τ-Bench: Benchmarking AI agents for the real-world.** https://sierra.ai/blog/benchmarking-ai-agents — production rationale, pass^k motivation. +- Sierra. **τ³-Bench: Advancing agent evaluation to knowledge and voice.** https://sierra.ai/blog/bench-advancing-agent-benchmarking-to-knowledge-and-voice — third-gen extension. +- Cognition. **A review of OpenAI's o1 and how we evaluate coding agents.** https://cognition.ai/blog/evaluating-coding-agents — `cognition-golden`, hybrid evaluators. +- Cognition. **SWE-bench technical report.** https://cognition.ai/blog/swe-bench-technical-report — Devin's SWE-bench audit. +- Cognition. **Devin's 2025 Performance Review.** https://cognition.ai/blog/devin-annual-performance-review-2025 — production lessons. +- Cursor. **CursorBench: How we compare model quality.** https://cursor.com/blog/cursorbench — multi-axis quality grading + live A/B. +- Databricks. **Beyond the Leaderboard: Unpacking Function Calling Evaluation.** https://www.databricks.com/blog/unpacking-function-calling-eval — critique of pure AST match. +- AWS. 
**Reinventing the Amazon Q Developer agent for software development.** https://aws.amazon.com/blogs/devops/reinventing-the-amazon-q-developer-agent-for-software-development/ — production SWE-bench numbers and trace grading. +- AWS. **Evaluating AI agents for production: Strands Evals.** https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-for-production-a-practical-guide-to-strands-evals/ — trajectory-evaluator pattern. +- OpenAI Developers. **Testing Agent Skills Systematically with Evals.** https://developers.openai.com/blog/eval-skills — trace grading. +- OpenAI Developers. **Evaluate agent workflows.** https://developers.openai.com/api/docs/guides/agent-evals — agent-eval API guide. +- Google Cloud. **A methodical approach to agent evaluation.** https://cloud.google.com/blog/topics/developers-practitioners/a-methodical-approach-to-agent-evaluation — Vertex AI trajectory metrics. +- LangChain. **LLM Evaluation Framework: Trajectories vs. Outputs.** https://www.langchain.com/articles/llm-evaluation-framework — trajectory-vs-outcome framing. +- LangChain. **How to evaluate your agent with trajectory evaluations.** https://docs.langchain.com/langsmith/trajectory-evals — exact/in-order/any-order metrics. +- DeepEval. **Argument Correctness metric.** https://deepeval.com/docs/metrics-argument-correctness — argument-level grading. +- DeepEval. **Tool Correctness metric.** https://deepeval.com/docs/metrics-tool-correctness — tool selection grading. +- Ragas. **Agentic / tool-use metrics.** https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/ — ToolCallF1, parameter-name F1. +- Sakura Sky. **Trustworthy AI Agents: Deterministic Replay.** https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/ — deterministic replay primitive. +- Braintrust. **Evaluating agents with trace-driven insights.** https://medium.com/@braintrustdata/evaluating-agents-with-trace-driven-insights-9ad3bfed820e — trace-as-eval pattern. 
+- The Context Lab. **The Non-Determinism Problem: What It Takes to Evaluate Agents Reliably.** https://www.thecontextlab.ai/blog/non-determinism-problem-evaluating-agents-reliably — operational guidance for variance. +- Toloka. **Tau-Bench extension: benchmarking policy-aware agents in realistic settings.** https://toloka.ai/blog/tau-bench-extension-benchmarking-policy-aware-agents-in-realistic-settings/ — policy-compliance extension. +- Arize AI. **How to Evaluate Tool-Calling Agents.** https://arize.com/blog/how-to-evaluate-tool-calling-agents/ — observability-flavored eval guide. +- Galileo. **Agent Evaluation Framework With Metrics, Rubrics, and Benchmarks.** https://galileo.ai/blog/agent-evaluation-framework-metrics-rubrics-benchmarks — metric taxonomy. + +### Surveys +- **Evaluation and Benchmarking of LLM Agents: A Survey.** ACM Computing Surveys / arXiv 2507.21504. https://arxiv.org/html/2507.21504v1 — broad 2025 survey. +- **A Survey on LLM-as-a-Judge.** arXiv 2411.15594. https://arxiv.org/html/2411.15594v4 — companion survey on judge models. +- **A Survey on Agent-as-a-Judge.** arXiv 2601.05111. https://arxiv.org/html/2601.05111v1 — agent-judge specifically. +- **When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs.** arXiv 2508.02994. https://arxiv.org/html/2508.02994v1 — practitioner-oriented summary. + +### Leaderboards (live) +- BFCL v4 leaderboard. https://gorilla.cs.berkeley.edu/leaderboard.html +- SWE-bench leaderboards (all variants). https://www.swebench.com/ +- SWE-bench Live. https://swe-bench-live.github.io/ +- SWE-bench Pro public. https://labs.scale.com/leaderboard/swe_bench_pro_public +- τ²-Bench Telecom (Artificial Analysis). https://artificialanalysis.ai/evaluations/tau2-bench +- AppWorld leaderboard. 
https://github.com/StonyBrookNLP/appworld-leaderboard diff --git a/agent-evaluation-research/sources/04-improvement.md b/agent-evaluation-research/sources/04-improvement.md new file mode 100644 index 00000000..ec282419 --- /dev/null +++ b/agent-evaluation-research/sources/04-improvement.md @@ -0,0 +1,543 @@ +# Systematically Improving LLM Agents from Eval Results + +A research dossier for the QualOps code-review agent (Analyze → Review → Fix → Report → Judge), built on the Claude Agent SDK and deployed as fixed versions in CI. Scope is **offline improvement**: take eval failures, reason about them, change the system, re-run evals, ship. Online RLHF and continuous self-tuning in production are explicitly out of scope. + +Author: QualOps Research, May 2026 + +--- + +## Executive summary + +Once an evaluation harness exists, the bottleneck for agent quality is not "more model"; it is the discipline of systematic improvement. The community has converged on a recognizable loop: collect failure traces; perform open coding (annotate freely) and axial coding (cluster into a taxonomy) à la Hamel Husain and Eugene Yan; pick the largest fixable cluster; choose the smallest intervention that plausibly fixes it (prompt edit → context change → tool redesign → sub-agent split → few-shot mining → fine-tune → model swap); run the eval; gate on regression. Around that loop, four newer disciplines are now mainstream: **prompt-as-code** (version control with promotion gates), **automatic prompt optimization** (DSPy/MIPROv2, TextGrad, OPRO, Promptbreeder, AdalFlow, SAMMO), **context engineering** (curating the working set rather than stuffing the window), and **eval-gated CI** (golden traces and snapshot tests).
For a multi-stage code-review agent like QualOps, the highest-leverage early moves are usually (1) error taxonomy on real PR traces, (2) per-stage rather than end-to-end evals, (3) tool-surface cleanup, (4) targeted few-shot mining from confirmed-failure PRs, and (5) routing easy diffs to a cheaper model. Fine-tuning and DPO on traces are the long-game once the prompt/context surface is exhausted. + +--- + +## 1. Error analysis methodology + +The single most cited technique in the modern eval literature is structured **error analysis**, popularized by Hamel Husain (in "Your AI product needs evals" and "A Field Guide to Rapidly Improving AI Products"), Eugene Yan, and Shreya Shankar. The method is borrowed from grounded-theory qualitative research: open coding → axial coding → frequency-weighted prioritization. + +### The two-pass coding method + +**Pass 1 — Open coding (bottom-up).** Sit with raw traces. For each failed (and a sample of passing) example, write a free-text note describing what went wrong. Critically, do not pre-define categories; let them emerge. Hamel emphasizes that top-down taxonomies — "this is a hallucination", "this is a refusal" — bias annotators toward generic ML categories that miss domain-specific failure modes. Bottom-up coding at NurtureBoss surfaced "date handling" as the dominant failure class and lifted that subtask from 33% to 95% accuracy. + +**Pass 2 — Axial coding.** Group the open-coded notes into a small set of error categories ("axes"). Hamel recommends an LLM-assisted clustering pass over the notes, then human review of the proposed taxonomy. Output: an error taxonomy of typically 5–15 categories with frequency counts. + +**Frequency-weighted prioritization.** Rank categories by `frequency × business cost × fixability`. Spend engineering effort top-down. The classic mistake is fixing rare-but-vivid failures because they are easier to remember. 
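
The ranking rule above can be sketched in a few lines of Python. This is an illustrative sketch, not QualOps code: the `ErrorCategory` type, the ordinal-to-weight mappings for severity (used here as a proxy for business cost) and fixability, and the sample numbers are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Assumed ordinal weights; calibrate these against your own cost model.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0}
FIXABILITY_WEIGHT = {"low": 0.25, "medium": 0.5, "high": 1.0}


@dataclass(frozen=True)
class ErrorCategory:
    """One row of an error taxonomy (hypothetical structure)."""
    name: str
    frequency: float  # share of sampled failures, 0.0 to 1.0
    severity: str     # "low" | "medium" | "high", proxy for business cost
    fixability: str   # "low" | "medium" | "high"

    def priority_score(self) -> float:
        # frequency x business cost x fixability
        return (
            self.frequency
            * SEVERITY_WEIGHT[self.severity]
            * FIXABILITY_WEIGHT[self.fixability]
        )


def rank_categories(categories):
    """Return categories sorted from highest to lowest priority score."""
    return sorted(categories, key=lambda c: c.priority_score(), reverse=True)


if __name__ == "__main__":
    taxonomy = [
        ErrorCategory("rare but vivid crash", 0.02, "high", "low"),
        ErrorCategory("frequent style false positive", 0.30, "low", "high"),
        ErrorCategory("missed cross-file bug", 0.18, "high", "medium"),
    ]
    for cat in rank_categories(taxonomy):
        print(f"{cat.priority_score():.3f}  {cat.name}")
```

With these made-up weights the rare, hard-to-fix crash lands at the bottom of the list even though it is the most memorable, which is the bias the multiplicative score is meant to counter.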
+ +### A concrete error-taxonomy template (QualOps-shaped) + +| ID | Category | Stage | Open-coding signature | Frequency | Severity | Fixability | Priority | +|----|----------|-------|----------------------|-----------|----------|-----------|----------| +| E1 | False positive on idiomatic style | Review | "agent flagged ternary as 'unreadable'" | 32% | Low | High | P1 | +| E2 | Missed null-deref across files | Analyze | "didn't load callee" | 18% | High | Medium | P1 | +| E3 | Fix proposed wrong import | Fix | "imported from wrong module" | 11% | Medium | High | P2 | +| E4 | Judge rated harmless nit as "blocker" | Judge | "severity inflated" | 9% | Medium | High | P2 | +| E5 | Refused on large diff | Analyze | ">200 file diff truncated" | 4% | High | Low | P3 | + +Each row produces (a) a deterministic regression test, (b) a candidate fix hypothesis, and (c) optionally an eval-set sample to add to the next harness pass. + +### Eugene Yan's complementary framing + +Yan's "Task-Specific LLM Evals That Do & Don't Work" emphasizes that evals must be tied to *behaviors a human PM would care about*, not generic NLP scores. He argues for a four-axis decomposition (correctness, instruction-following, factuality, coherence) but explicitly says these are starting points; your taxonomy must be domain-specific. His "Patterns for Building LLM-based Systems" is the canonical write-up of LLM-as-judge tradeoffs. + +### Shreya Shankar's "validators of validators" + +Shankar's research (e.g. EvalGen, "Who Validates the Validators?") makes the rigorous case that LLM-as-judge scorers themselves drift from human preferences and need their own calibration set. For QualOps, this means whatever Judge stage you use must have a periodically refreshed gold set of human-judged examples.
+ +--- + +## 2. The eval-driven development loop + +```mermaid +flowchart TD + A[Production / staging traces] --> B[Sample failures + passes] + B --> C[Open coding<br/>free-text notes] + C --> D[Axial coding<br/>cluster into taxonomy] + D --> E[Prioritize by<br/>frequency x severity x fixability] + E --> F{Pick top<br/>bucket} + F --> G[Hypothesize fix:<br/>prompt? context? tool?<br/>sub-agent? model? SFT?] + G --> H[Implement smallest<br/>change that could fix it] + H --> I[Run eval set<br/>regression + targeted] + I --> J{Delta<br/>positive?<br/>No regression?} + J -- No --> K[Discard or refine hypothesis] + K --> G + J -- Yes --> L[Promote prompt/skill version] + L --> M[Ship behind gate] + M --> N[Collect new traces] + N --> A +```
+ +The loop has four properties worth preserving: + +1. **Failures are routed back to the eval set**, not just fixed. Otherwise the regression suite never grows. +2. **Hypothesis is logged separately from the diff.** "I changed the system prompt because X" matters when the next eval shows a regression six weeks later. +3. **One change at a time.** Multivariate changes make the delta unattributable and stall debugging. +4. **The eval set is versioned with the code.** A passing score on v3 of the eval against v3 of the prompt is the only meaningful claim. + +--- + +## 3. Prompt engineering as an iteration discipline + +The era of "prompt is whatever string is in `messages[0]`" is over. Anthropic's own prompt-engineering guides for Claude 4.x ("Prompting best practices", "Effective context engineering for AI agents") and OpenAI's GPT-5 prompt cookbook converge on the same skeleton: + +``` +[Role] You are <role>. +[Task] Goal in 1–2 sentences. +[Context] Static background, dynamic retrieval, tool surface. +[Examples] Few-shot, ideally diverse and including hard cases. +[Format] Output schema (XML / JSON / markdown sections). +[Guardrails] Out-of-scope behaviors, refusal triggers, escalation. +``` + +Anthropic specifically recommends XML-tag structuring for Claude, placing tool definitions in the system message and instructions in the user turn, and using "think step by step" (or extended thinking) for multi-stage tasks. + +### Prompt-as-code + +Treat prompts as first-class source. Concretely: + +- **Version control**: prompts live in the repo, not a UI. Every change is a PR. +- **Immutable IDs**: each prompt version gets a content hash; logs reference it. +- **Promotion gates**: dev → staging → prod, with eval thresholds at each boundary.
+- **Full execution context is versioned**: the prompt, the model, the temperature, the tool list, and the retrieval config, all together. A prompt version that worked on Sonnet 4.0 may regress on 4.7. +- **Two-axis A/B**: by prompt version and by traffic slice. Braintrust, Langfuse, LangSmith, PromptLayer, and LaunchDarkly all implement this. +- **Linting**: detect missing tags, contradictory instructions, redundant guardrails. + +### Structured prompt scaffolding patterns + +- **Decomposition tags.** Wrap the diff, the relevant prior code, and the rules each in its own named XML tag. Claude was trained on XML-tagged data, so this hits the model where it is most reliable. +- **Output contract first.** State the output schema before the body of the request; models frequently forget late-stated formats. +- **Negative instructions.** "Do not invent function names not present in the tagged context" is more effective than leaving the constraint implicit. +- **Self-check footer.** "Before producing your answer, list the assumptions you made; if any are uncertain, mark `LOW_CONFIDENCE`." This is a poor-man's reflection (see §9). + +--- + +## 4. Automated prompt optimization + +This is a crowded field; treat the entries below as a menu, not a stack. + +| Tool | Year | Mechanism | When it shines | When it fails | +|------|------|-----------|----------------|---------------| +| **APE** (Zhou et al.) | 2022 | LLM proposes prompt candidates, scored on a held-out set, keep best. | Single-step tasks with clear metric. | Multi-stage agents; metric noise. | +| **OPRO** (Google) | 2023 | LLM is shown previous prompts + scores, asked to write a better one. Iterative. | Math / reasoning where metric is binary. | Long prompts; tool-using agents. | +| **Promptbreeder** (DeepMind) | 2023 | Evolutionary: mutates both task prompts AND the mutation prompts. Beats OPRO on GSM8K (83.9% vs 80.2%). | Tasks where you can run thousands of evals cheaply. | Cost; doesn't optimize tool use.
| +| **DSPy / MIPROv2** (Stanford NLP) | 2024 | "Programs not prompts": you write modules with signatures, MIPROv2 jointly optimizes instructions + few-shot demos via Bayesian search over bootstrapped traces. | Multi-stage pipelines (perfect for QualOps). Strong with metric-driven optimization. | Requires writing your pipeline in DSPy idioms. | +| **TextGrad** (Stanford / Yuksekgonul) | 2024 | "Backpropagation through text": LLM-generated textual feedback used as gradient through the program. | Composable systems where each step has a critic-able output. | Setting up the textual loss; cost. | +| **Trace** (Microsoft) | 2024 | Generalizes TextGrad: traces execution, propagates feedback as updates to *any* parameter (prompt, code, tool spec). | Heterogeneous pipelines (prompt + tools + retriever). | Early-stage, few production case studies. | +| **AdalFlow** (SylphAI) | 2024 | PyTorch-style auto-diff over LLM workflows; combines TextGrad-style gradients + DSPy bootstrapping. Reports SOTA accuracy on prompt opt benchmarks. | Teams that want a single library covering both directions. | Newer, smaller community than DSPy. | +| **SAMMO** (Microsoft Research) | 2024 | Treats prompts as function graphs; mutation operators over structure (move section, delete example, paraphrase). | Long structured prompts (manuals, policies). | Tasks needing example mining more than structural surgery. | + +### Practical guidance for QualOps + +- For the **Review** and **Judge** stages — both narrow, scorable on labeled PRs — DSPy/MIPROv2 is the best fit. Write the stage as a `Module`, define a metric on the eval set, run MIPROv2. +- For the **Fix** stage, where the output is code (and "correctness" requires running tests), AdalFlow + TextGrad-style textual feedback over `tests pass / fail / lint` is the better mental model. +- Promptbreeder/OPRO are mostly historical interest now; their results have been folded into MIPROv2. +- All of these need a *cheap, fast metric*. 
Build it before reaching for an optimizer. + +--- + +## 5. Few-shot example mining and ICL improvements + +Few-shot examples are the highest-leverage, lowest-risk knob in the system. The 2024–25 literature has three clear lessons: + +1. **Quality dominates quantity.** A handful of well-chosen demonstrations beat dozens of mediocre ones. Cleanlab and others show that even a single noisy example can reduce accuracy. +2. **Diversity matters more than similarity.** A retrieved set of three near-duplicates teaches less than three diverse but on-topic examples. +3. **Dynamic retrieval > static set** for heterogeneous inputs. Encode each candidate example, encode the live query, retrieve k-NN, inject as few-shot. + +### Mining recipe for a code-review agent + +1. Take the labeled error taxonomy from §1. +2. For each high-priority bucket (E1, E2, …), sample 2–3 *clean* fixed examples — input PR, ideal review comments, ideal Fix output. These become "canonical" few-shots. +3. Build a vector index over these canonical examples keyed by diff features (language, file types, lines changed, presence of tests). +4. At inference, retrieve top-k examples from the index and inject them. +5. Add **contrastive examples**: pairs of `(borderline diff, correct minimal review)` so the agent learns where to *not* comment. Code-review agents over-flag by default; contrastive negatives are the cure. +6. Recompute the index when the eval set grows. + +Anthropic's own context-engineering guide flags few-shot curation as one of the three highest-leverage activities; in their phrasing, "diverse, canonical examples" is the goal, not exhaustive coverage. + +--- + +## 6. Skill / sub-agent decomposition + +Anthropic's "Building Effective Agents" essay defines the small set of patterns you should reach for *before* assuming you need a fully autonomous agent. They are: + +- **Prompt chaining** — fixed pipeline of LLM calls. (QualOps's Analyze → Review → Fix → Report is one.) 
+- **Routing** — classifier sends the request to a specialist prompt. +- **Parallelization** — same input to N specialists, voted/aggregated. +- **Orchestrator-workers** — central LLM dispatches dynamically; subtasks not predeterminable. Anthropic calls out coding tasks specifically — the number/nature of files to touch is unknowable up front. +- **Evaluator-optimizer** — generator + critic loop until acceptance criterion met. + +### When to split a monolithic prompt + +You should consider sub-agent decomposition when: + +- One prompt is doing two qualitatively different jobs ("review code AND format the report") and your error taxonomy shows error types from both jobs. +- Required tools differ across phases (Analyze needs static-analysis tools; Report just needs Markdown). +- One phase needs a stronger model than another. +- The prompt has crossed ~3–5K tokens and instructions are starting to interfere ("instruction hierarchy" decay; later instructions overpower earlier ones). + +### Anthropic's Skills mechanism + +Skills (released 2025, expanded through 2026) are filesystem-based, on-demand context bundles — instructions, scripts, reference material — that the agent loads only when relevant. The cwc-workshops example reduced a 400-line monolithic inventory-agent prompt by extracting policies into Skills + delegating arithmetic to a code-execution tool + introducing a callable sub-agent. The relevant pattern for QualOps: + +- Each language (TS, Python, Go, Rust) is a Skill containing language-specific review heuristics. +- Each error-taxonomy bucket with stable rules can become a Skill. +- The Judge is a sub-agent with a tighter system prompt and only the report-shaped tools. + +Caveat: orchestrator-worker architectures use **10–15× more tokens** than a single agent. Reach for them when the accuracy gain justifies the cost — typically when single-agent eval scores plateau and the failure analysis shows clean phase boundaries. 
+ +### Code-review-agent specific decomposition (2026 state of the art) + +- **Greptile v3** uses parallel sub-agents on top of the Claude Agent SDK to trace dependencies across files and check git history. Reports 82% bug catch vs CodeRabbit's 44% in independent benchmarks (with more false positives). +- **Qodo 2.0** (Feb 2026) shipped an explicit multi-agent review architecture and reports outperforming seven competitors. +- **CodeRabbit** stays single-agent / PR-scoped, optimizing for speed and conciseness. + +The pattern: the more *cross-file* your reviews need to be, the more the orchestrator-worker pattern (with a code-graph indexer as a tool) pays for itself. + +--- + +## 7. Tool design + +Tool design is the single most under-appreciated lever. Anthropic's "Writing tools for agents" guide is the canonical reference; the highlights are: + +- **Naming**: namespace by service (`github_list_prs`, not `list_prs`); verb_object form. +- **Description is a prompt.** It is read by the model on every call. Be explicit about *when* to use the tool, what inputs are valid, what outputs to expect, and importantly what the tool will NOT do. +- **Schema with examples.** For complex inputs, the `input_examples` field beats prose. JSON Schema is necessary but not sufficient. +- **Consolidate, don't proliferate.** Fewer, more capable tools beat many narrow ones. A `code_search(query, kind)` is better than four `find_function`, `find_class`, `find_import`, `find_callsite` tools — the model gets confused choosing among overlapping options. +- **Return high-signal text.** Stable identifiers beat opaque internal IDs. Pruned/summarized output beats firehose dumps; agents pay tokens to read tool returns. +- **Error messages are pedagogy.** A tool that returns "Error 422" teaches the agent nothing. "Error: file path must be relative to repo root; you passed an absolute path; try `src/foo.py`." enables self-correction. 
+- **Observable side effects.** If a tool mutates state, return the new state in the response so the agent doesn't have to call a follow-up read. + +### QualOps-specific tool audit checklist + +- [ ] Each tool's description has a "use this when" and a "do NOT use this when". +- [ ] No two tools have overlapping use cases without explicit disambiguation in their descriptions. +- [ ] Error returns are actionable text, not numeric codes. +- [ ] Tool count per stage ≤ 7 (the empirical comfort zone for Claude Sonnet/Opus). +- [ ] Long outputs (file contents, AST dumps) are paginated, not truncated mid-token. +- [ ] One canonical "search the code graph" tool, not five. + +--- + +## 8. Context engineering + +"Context engineering is the delicate art and science of filling the context window with just the right information for the next step" — Andrej Karpathy. The framing has been formalized through 2025–26 by Anthropic ("Effective context engineering for AI agents"), Lilian Weng, and the LangChain team. The CPU/RAM analogy is now standard: the model is the CPU; the context window is RAM; what you load is the engineering. + +### The big findings + +- **Context rot.** Recall and reasoning degrade as token count grows, well before the nominal limit. Databricks observed correctness loss at 32K for Llama 3.1 405B; smaller models earlier. Larger context windows are not free. +- **Lost-in-the-middle (Liu et al., Stanford).** Performance is U-shaped: instructions at the very start or very end of the context win; buried middle loses. Critical guardrails should be at one of the ends — Anthropic's recommendation is system prompt for stable rules, last user turn for the immediate ask. +- **Instruction hierarchy.** When system, user, and tool-output instructions conflict, models tend to follow the most recent and most concretely worded. Conflict-free design beats stacking. + +### Tactics + +- **Curate, don't accumulate.** At each step ask: "is this token earning its place?" 
Strip stale tool outputs, archived plans, unused docs. +- **Dynamic instruction loading** (= Skills): only inject the language-specific or domain-specific rules when the input matches. +- **Retrieval beats stuffing.** A 2K-token retrieved excerpt of the right file beats a 50K dump of the directory. +- **Working memory vs reference memory.** Working memory (immediate plan, last tool result) goes in-context; reference memory (project guidelines, codebase facts) goes behind a retrieval tool. +- **Plan persistence.** Long agent runs benefit from an explicit `plan.md`-style scratchpad maintained by the orchestrator, refreshed each turn rather than relying on the model to remember what it decided fifteen turns ago. + +For QualOps specifically: never feed the entire repo. Feed (a) the diff, (b) the immediate symbol context (callers/callees of touched symbols), (c) project conventions for the relevant language, and nothing else. Add a tool the agent can call when it needs more. + +--- + +## 9. Self-refinement and reflective patterns + +The core papers — **Self-Refine** (Madaan et al.), **Reflexion** (Shinn et al.), **CRITIC** (Gou et al.) — established that having the model critique its own output and try again improves accuracy on a wide range of tasks without weight updates. The pattern is a generate-critique-refine loop, optionally with a separate verifier model. + +### When reflection is net positive + +- The metric is *expensive to compute by humans* but *cheap for an LLM judge*. Code review fits well: rerunning unit tests is cheap; a reviewer disagreeing about a comment is expensive. +- Errors are *recognizable after the fact* (the agent often spots its own mistake when prompted). Reasoning errors, format violations, missed edge cases. +- You can afford ~2× tokens per task. + +### When reflection is a trap + +- The error mode is "confidently wrong with no internal signal" — the critic agrees with the bad output. Hallucinations of API names are typical. 
+- Per-call latency matters more than quality (interactive use). +- The critic is the same model with the same context as the actor — same blind spots. + +### Concrete recipes for QualOps + +- **Judge as critic.** The Judge stage is already an evaluator-optimizer pattern. Make it explicit: Judge can return `accept` / `reject_with_reason`, and on `reject_with_reason` re-run Review (bounded retries: 2 max). +- **Verifier model trick.** Run the Fix stage with Sonnet, the Judge with Opus. Different models reduce shared blindspots; this is empirically the cheap win. +- **Test execution as ground-truth critic.** For Fix outputs, the actual unit-test result is the highest-quality verifier you will ever have. Use it. + +--- + +## 10. Routing and model selection + +Routing classifies an input and dispatches to the cheapest model that can handle it. The economics are dramatic: industry routers report 30–85% cost reduction with quality flat or slightly improved. + +### Routing strategies + +- **Predictive (offline classifier).** A small model or feature-based classifier inspects the input, predicts difficulty, picks a tier. Fast, cheap, training data needed. +- **Cascading.** Try Haiku first; if it returns low-confidence or fails a check, escalate to Sonnet, then Opus. Easy to implement with no training; latency penalty when escalation triggers. +- **Mixture-of-agents (MoA).** Multiple models answer in parallel; an aggregator synthesizes. Highest quality, highest cost. + +### For QualOps + +A reasonable default cascade: + +1. **Diff size / language detector** (no LLM) — small CSS/text changes go to Haiku-tier; backend logic changes to Sonnet; cross-cutting refactors and security-sensitive paths to Opus. +2. **Confidence escape hatch** — if the Judge stage rejects with reason "ambiguous", upgrade and re-run Review. +3. **Per-stage routing** — Analyze and Report can be small models; Review and Fix usually want strong; Judge wants strong (or at least *different*). 
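
The first two steps of that cascade (a static detector, then a one-tier upgrade when the Judge reports ambiguity) can be sketched as below. Everything concrete here is a hypothetical placeholder rather than QualOps's actual configuration: the tier labels, the `DiffInfo` fields, the file suffixes and path prefixes, and the numeric thresholds.

```python
from dataclasses import dataclass
from typing import List

# Cheapest to strongest; assumed tier labels, not real model IDs.
TIERS = ["haiku", "sonnet", "opus"]

# Hypothetical heuristics; tune them against your own eval set.
LOW_RISK_SUFFIXES = (".css", ".md", ".txt")
SENSITIVE_PREFIXES = ("auth/", "crypto/", "payments/")
CROSS_CUTTING_FILE_COUNT = 20
SMALL_CHANGE_MAX_LINES = 50


@dataclass(frozen=True)
class DiffInfo:
    paths: List[str]
    lines_changed: int


def initial_tier(diff: DiffInfo) -> str:
    """Step 1: pick a starting tier from static diff features (no LLM call)."""
    if any(p.startswith(SENSITIVE_PREFIXES) for p in diff.paths):
        return "opus"  # security-sensitive paths
    if len(diff.paths) >= CROSS_CUTTING_FILE_COUNT:
        return "opus"  # cross-cutting refactor
    small = diff.lines_changed <= SMALL_CHANGE_MAX_LINES
    low_risk = bool(diff.paths) and all(
        p.endswith(LOW_RISK_SUFFIXES) for p in diff.paths
    )
    if small and low_risk:
        return "haiku"  # small CSS/text change
    return "sonnet"     # default for ordinary logic changes


def escalate(tier: str) -> str:
    """Move one tier up, saturating at the strongest tier."""
    idx = TIERS.index(tier)
    return TIERS[min(idx + 1, len(TIERS) - 1)]


def tier_after_judge(tier: str, judge_reason: str) -> str:
    """Step 2: upgrade only when the Judge rejected with reason 'ambiguous'."""
    return escalate(tier) if judge_reason == "ambiguous" else tier
```

Keeping the detector free of LLM calls means the routing decision itself costs nothing and is trivially unit-testable; per-stage routing (step 3) would then be a lookup from stage name to a minimum tier layered on top of `initial_tier`.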
+ +This pairs well with prompt-as-code: each prompt version pins its model, and routing is just "which version-id do we use for this input". + +--- + +## 11. Fine-tuning vs. prompt engineering + +The overwhelming majority of agent-quality wins come before fine-tuning is required. The order of operations in 2026 is: + +1. Better prompt + scaffold. +2. Better few-shot examples. +3. Better tools / context. +4. Sub-agent decomposition. +5. Automated prompt optimization (DSPy/AdalFlow). +6. *Then* consider fine-tuning. + +### When SFT pays off (offline only — in our scope) + +- The task has a stable shape, you have ≥1K labeled examples, prompt iteration has plateaued. +- Latency matters and you want to compress an Opus-level prompt into a smaller fine-tuned Sonnet/Haiku. +- You want to teach a tool-use *trajectory pattern*, not just a knowledge cut. + +### Two relevant offline techniques + +- **Distillation from agent traces.** Run the strong agent on a curated set, record the (input, plan, tool calls, output) trace, and SFT a smaller model on those traces. Recent work (Structured Agent Distillation, 2025; Agent Fine-tuning through Distillation, 2025) shows you can preserve >90% of teacher quality at <20% of the cost on narrow domains. +- **DPO / KTO from preference pairs.** From your eval set you have `(input, accepted_output, rejected_output)` triples — exactly the DPO format. KTO is more flexible: it works with thumbs-up/down signals rather than paired preferences. Both are offline and weight-update style; both are in scope under our policy because they happen between releases, not during one. + +### What to avoid + +- Online RLHF / continuous self-training in production. Out of scope. +- Premature fine-tuning. The cost is real (training infra, eval against drift, regression risk) and the gains can often be matched by cheaper prompt changes. + +--- + +## 12. Regression suites and gating + +Improvements regress. The mechanism that prevents this is the regression suite, gated in CI.
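
A gate of this kind might look roughly like the sketch below. It is only an illustration under stated assumptions: `ci_gate`, `GateResult`, and their parameters are hypothetical names, the 3% relative-drop default is a starting placeholder, and the judge scores are assumed to be plain floats computed elsewhere for the main branch and the candidate branch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]  # kept even on override so the decision stays auditable


def ci_gate(
    deterministic_failures: list[str],
    judge_score_main: float,
    judge_score_branch: float,
    max_relative_drop: float = 0.03,  # assumed starting threshold
    override_note: str | None = None,
) -> GateResult:
    """Fail on any deterministic regression or on a large judge-score drop."""
    reasons = [f"deterministic test failed: {name}" for name in deterministic_failures]

    # Relative drop versus main; skipped when main has no usable score.
    if judge_score_main > 0:
        drop = (judge_score_main - judge_score_branch) / judge_score_main
        if drop > max_relative_drop:
            reasons.append(f"judge score dropped {drop:.1%} vs main")

    # An explicit reviewer note lets the build pass, but the reasons are recorded.
    if reasons and override_note:
        return GateResult(True, reasons + [f"overridden: {override_note}"])
    return GateResult(not reasons, reasons)
```

Returning the reasons alongside the verdict is a design choice that makes a delta report easy to render in the CI log, and it keeps overrides visible rather than silent.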
+ +### Three layers + +- **Unit / assertion tests.** Cheap, deterministic, not LLM-judged. Example: "given diff D1, the Review output must contain `null check`". Authored from error analysis (§1). Goal: ratchet — once we fix it, it stays fixed. +- **Golden traces / snapshot tests.** Record the entire tool-call sequence and final output for a known PR. On change, diff the new run against the snapshot. Tools: EvalView (open-source, Playwright-style for agents), Braintrust, Confident AI / DeepEval. Particularly catches *behavioral* regressions (tool order, retry count) that string-level evals miss. +- **LLM-as-judge eval set.** Larger, runs less often (per-PR or per-night), produces a holistic score. Used for trend tracking and gate thresholds. + +### CI gate design + +- Fail the build if any deterministic regression test breaks. +- Fail if the LLM-judge score drops by more than X% (3% is a common starting threshold) versus the main branch. +- Surface a delta report: which prompt/tool/skill changed, which eval categories moved. +- Always allow override by an explicit reviewer comment ("accepted regression on E1 because we accepted E5 win"), tracked. + +### Don't forget calibration + +LLM-judge scores drift as both judge and judged models update. Refresh the human-labeled calibration set quarterly; otherwise your pass/fail line wanders. + +--- + +## 13. Data flywheel + +Even though we ship fixed versions, *the next version* benefits from production traces. Shankar's "Data Flywheels for LLM Applications" is the reference text. The flywheel: + +``` +production traces -> sample -> human-label (or LLM-pre-label + human review) + -> error analysis -> (eval set growth) + (few-shot mining) + (DPO pairs) + -> next release +``` + +Practical suggestions for QualOps: + +- **Trace everything.** Per stage: input, prompt version, tool calls, model version, output, judge verdict, downstream signal (was the comment dismissed by the human reviewer? was the fix merged?). 
+- **Stratified sampling.** Don't sample uniformly; over-sample low-confidence traces and traces where the judge disagreed with downstream human action. +- **Decompose labels.** Shankar's specific finding: holistic "is this good?" labels are noisy. Split into dimensions (correctness, severity, conciseness, style fit) and label each separately. Inter-rater agreement goes up. +- **Few-shot mining loop.** Newly-labeled "exemplary" traces are first-class candidates for the dynamic few-shot index (§5). Newly-labeled bad traces become eval-set additions and DPO negatives. + +--- + +## 14. Code-review-agent specific patterns observed in the field + +Patterns that recur across Cursor, Sourcegraph Cody, GitHub Copilot Code Review, Greptile, CodeRabbit, Qodo (formerly Codium), and Anthropic's own Claude code reviews: + +1. **Code-graph indexing as a tool, not as in-context.** All serious players index the repo (symbols, calls, types, blame) and expose retrieval rather than dumping into context. Greptile's graph-of-the-repo and Sourcegraph Cody's code-search are the explicit cases. +2. **Diff-aware vs project-aware.** CodeRabbit optimizes the PR-diff scope; Greptile is project-aware. Project-aware catches more cross-file bugs at the cost of more false positives. Pick deliberately and tune the noise budget. +3. **Multi-pass review.** A first pass identifies candidate issues; a second pass filters / merges / re-ranks them by severity. This is the orchestrator-worker pattern with a small team of specialist sub-agents (security, performance, style, correctness) plus a deduplicator. +4. **Severity gating.** False-positive aversion drives most user satisfaction. Apply a calibrated severity threshold *after* generation, dropping anything below `medium`. Easier to tune than asking the model to suppress at generation time. +5. **Conventions and rules as Skills/configs.** Per-language, per-repo style guides loaded only when relevant. +6. 
**Use the test suite as oracle for Fix.** Anything that can run tests should. The Fix stage's hard ground truth is "tests still pass and the bug repro test now passes". This collapses into a deterministic eval and is the single biggest accuracy lever. +7. **Severity-aware routing.** Cheap model for nits; strong model for security and concurrency. Mirrors §10. +8. **Don't trust your own confidence.** Agents are systematically over-confident on cross-file claims. Force the agent to cite a file:line for every claim; if it can't, drop the claim. (This is the "citations as guardrail" pattern.) + +--- + +## Improvement playbook (step-by-step) + +A concrete process for QualOps when an eval reveals issues. + +1. **Lock the eval.** Pin model versions, prompt versions, eval-set version. Reproduce the failures. If you can't reproduce, your eval has variance you must fix first. +2. **Sample and open-code.** Take 30–50 failing traces (and 10–20 passing for contrast). Annotate freely — don't pre-categorize. Two annotators where possible; compare notes. +3. **Axial code into a taxonomy.** Cluster the notes. LLM-assist with human review. Aim for 5–15 buckets. Tag every example with one or more bucket IDs. +4. **Frequency-prioritize.** Pick the top bucket by `frequency × severity × fixability`. +5. **Localize to a stage.** Which of Analyze / Review / Fix / Report / Judge is producing the error? If it's a chain effect, the *earliest* stage is usually the place to fix. +6. **Pick the smallest fix that could plausibly work.** In rough cost order: + - Prompt edit (clarify, add negative example, restructure). + - Few-shot addition (drop in 2–3 canonical examples). + - Context edit (load a Skill, prune noise). + - Tool surface change (better description, error message, schema). + - Sub-agent split. + - Run a prompt optimizer (DSPy/MIPROv2 or AdalFlow) on the stage. + - Model swap or routing rule. + - SFT / DPO from collected traces. +7. **Implement one change.** Tag the prompt version. 
Note the hypothesis. +8. **Run targeted + regression evals.** Targeted = the bucket you are fixing. Regression = full eval set. Both must pass the gate. +9. **Decide.** Ship if delta positive and no regression. Else, refine hypothesis or rollback. +10. **Feed back.** New traces enter the labeling queue; new fixed examples enter the few-shot index; new failure modes enter the taxonomy. +11. **Cadence.** Run this loop weekly on the largest bucket; re-cluster the taxonomy monthly; refresh judge calibration quarterly. + +--- + +## Decision tree: given this error, try this fix first + +```mermaid +flowchart TD + Start([Eval reveals failure]) --> Q1{What's the<br/>error type?} + + Q1 -->|Format / schema<br/>violation| F1[Tighten output<br/>format in prompt;<br/>add format example] + Q1 -->|Misunderstood<br/>instruction| F2[Restructure prompt:<br/>role/task/format/<br/>guardrails] + Q1 -->|Missing domain<br/>knowledge| F3[Add Skill or<br/>retrieval tool;<br/>NOT just stuff context] + Q1 -->|Hallucinated<br/>fact / API| F4[Add citation<br/>requirement; add<br/>retrieval tool] + Q1 -->|Wrong tool used /<br/>tool confusion| F5[Tool design:<br/>names, descriptions,<br/>consolidate overlap] + Q1 -->|Tool returned ok,<br/>agent ignored result| F6[Tool description<br/>+ tool output<br/>summarization] + Q1 -->|Reasoning chain<br/>broke| F7[Add extended thinking<br/>or sub-agent split] + Q1 -->|Style / tone /<br/>over-flagging| F8[Few-shot mining:<br/>contrastive examples<br/>of don't-flag cases] + Q1 -->|Cross-file<br/>blindness| F9[Add code-graph tool;<br/>orchestrator-worker<br/>pattern] + Q1 -->|Long-context<br/>recall failure| F10[Context engineering:<br/>prune, dynamic load,<br/>move key info to ends] + Q1 -->|Judge<br/>miscalibrated| F11[Refresh judge<br/>calibration set;<br/>verifier model] + Q1 -->|Single-stage<br/>plateau across many<br/>error types| F12[Run prompt optimizer<br/>DSPy / AdalFlow] + Q1 -->|Latency / cost<br/>plateau, accuracy ok| F13[Routing or<br/>distillation SFT] + Q1 -->|Error mode is<br/>persistent + structured<br/>+ many examples| F14[DPO from preference<br/>pairs; otherwise SFT] + + F1 --> R[Re-run eval, gate, ship] + F2 --> R + F3 --> R + F4 --> R + F5 --> R + F6 --> R + F7 --> R + F8 --> R + F9 --> R + F10 --> R + F11 --> R + F12 --> R + F13 --> R + F14 --> R +``` + +--- + +## References + +### Error analysis and the eval-driven loop +- [Hamel Husain — Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/) — canonical write-up of why and how. +- [Hamel Husain — A Field Guide to Rapidly Improving AI Products](https://hamel.dev/blog/posts/field-guide/) — open/axial coding, frequency prioritization, NurtureBoss case study (33%→95%). +- [Hamel Husain — Why is "error analysis" so important](https://hamel.dev/blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html) — concrete walkthrough. +- [Hamel Husain — Doing Error Analysis Before Writing Tests](https://hamel.dev/notes/llm/officehours/erroranalysis.html) — order-of-operations point. +- [Hamel Husain & Shreya Shankar — LLM Evals FAQ (Jan 2026)](https://hamel.dev/blog/posts/evals-faq/) — current consolidated reference. +- [Eugene Yan — Task-Specific LLM Evals That Do & Don't Work](https://eugeneyan.com/writing/evals/) — pragmatic taxonomy. +- [Eugene Yan — Patterns for Building LLM-based Systems](https://eugeneyan.com/writing/llm-patterns/) — system-level patterns. +- [Eugene Yan — Evaluating LLM-Evaluators (LLM-as-Judge)](https://eugeneyan.com/writing/llm-evaluators/) — calibration of judges. +- [Husain, Yan, Bischof, Frye, Liu, Shankar — What We Learned from a Year of Building with LLMs (Part I, II, III)](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/) — multi-author field report; tactical/operational/strategic split. +- [Shreya Shankar — Data Flywheels for LLM Applications](https://www.sh-reya.com/blog/ai-engineering-flywheel/) — production-trace flywheel.
+- [Langfuse — Error Analysis to Evaluate LLM Applications](https://langfuse.com/blog/2025-08-29-error-analysis-to-evaluate-llm-applications) — tooling-side perspective. + +### Prompt engineering and prompt-as-code +- [Anthropic — Prompting best practices for Claude](https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices) — XML, system vs user, thinking. +- [Anthropic — Interactive Prompt Engineering Tutorial](https://github.com/anthropics/prompt-eng-interactive-tutorial) — 9-chapter hands-on. +- [Anthropic — Effective context engineering for AI agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) — core 2025 doc. +- [Langfuse — Prompt Version Control](https://langfuse.com/docs/prompt-management/features/prompt-version-control) — versioning mechanics. +- [Braintrust — What is prompt versioning](https://www.braintrust.dev/articles/what-is-prompt-versioning) — practitioner guide. +- [LaunchDarkly — Prompt Versioning & Management Guide](https://launchdarkly.com/blog/prompt-versioning-and-management/) — env promotion patterns. + +### Automatic prompt optimization +- [DSPy — Optimizers overview](https://dspy.ai/learn/optimization/optimizers/) — Stanford NLP framework. +- [DSPy — MIPROv2 API](https://dspy.ai/api/optimizers/MIPROv2/) — joint instruction + few-shot. +- [DeepWiki — MIPROv2 internals](https://deepwiki.com/stanfordnlp/dspy/4.4-miprov2:-instruction-and-parameter-optimization) — bootstrap, propose, Bayesian search. +- [Promptbreeder paper (arXiv 2309.16797)](https://arxiv.org/pdf/2309.16797) — DeepMind, evolutionary prompt opt. +- [APE — Automatic Prompt Engineer](https://www.promptingguide.ai/techniques/ape) — Zhou et al., the foundational work. +- [TextGrad](https://tailoredai.substack.com/p/automating-prompt-optimisation-a) — textual backprop. 
+- [SAMMO — Microsoft Research](https://www.microsoft.com/en-us/research/blog/sammo-a-general-purpose-framework-for-prompt-optimization/) — structure-aware metaprompt opt. +- [SAMMO repo](https://github.com/microsoft/sammo) — code. +- [AdalFlow — SylphAI](https://github.com/SylphAI-Inc/AdalFlow) — PyTorch-style auto-diff for LLM apps. +- [Cameron Wolfe — Automatic Prompt Optimization survey](https://cameronrwolfe.substack.com/p/automatic-prompt-optimization) — landscape overview. + +### Few-shot ICL improvements +- [Learning to Retrieve In-Context Examples (arXiv 2307.07164)](https://arxiv.org/html/2307.07164v2) — dense retrieval for examples. +- [Cleanlab — Reliable Few-Shot Prompts](https://cleanlab.ai/blog/learn/reliable-fewshot-prompts/) — noisy-example hazards. +- [Many-Shot In-Context Learning (arXiv 2404.11018)](https://arxiv.org/pdf/2404.11018) — the long-context regime. +- [PromptHub — Few-Shot Prompting Guide](https://www.prompthub.us/blog/the-few-shot-prompting-guide) — practical patterns. + +### Agent architecture +- [Anthropic — Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) — workflows + agents taxonomy. +- [Anthropic — Building Effective AI Agents (PDF resource hub)](https://resources.anthropic.com/building-effective-ai-agents) — extended cookbook with case studies (Coinbase, Intercom, Thomson Reuters). +- [Anthropic — Equipping agents with Agent Skills](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills) — Skills mechanism. +- [Anthropic — Agent Skills overview](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) — docs. +- [Anthropic Subagents — Claude Code Docs](https://docs.anthropic.com/en/docs/claude-code/sub-agents) — custom subagent how-to. +- [Anthropic CWC workshops](https://github.com/anthropics/cwc-workshops) — 400-line-prompt → skills + subagents refactor walkthrough. 
+ +### Tool design +- [Anthropic — Writing tools for agents](https://www.anthropic.com/engineering/writing-tools-for-agents) — naming, schema, response design. +- [Anthropic — Implement tool use](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use) — schema docs. +- [Anthropic — Advanced tool use](https://www.anthropic.com/engineering/advanced-tool-use) — input_examples and friends. + +### Context engineering +- [Anthropic — Effective context engineering for AI agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) — primary source. +- [LangChain — Context Engineering for Agents](https://blog.langchain.com/context-engineering-for-agents/) — practitioner overview. +- [Lost in the Middle (Liu et al., Stanford)](https://arxiv.org/abs/2307.03172) — U-shaped recall. +- [Decoder — Context engineering vs prompt engineering](https://the-decoder.com/anthropic-claims-context-engineering-beats-prompt-engineering-when-managing-ai-agents/) — the framing shift. + +### Reflection / self-refinement +- [Self-Refine (arXiv 2303.17651)](https://arxiv.org/abs/2303.17651) — generate-critique-refine. +- [Reflexion (arXiv 2303.11366)](https://arxiv.org/abs/2303.11366) — verbal RL via self-reflection. +- [CRITIC (arXiv 2305.11738)](https://arxiv.org/abs/2305.11738) — tool-augmented critique. +- [DeepLearning.AI — Agentic Design Patterns: Reflection](https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-2-reflection/) — Andrew Ng's writeup. +- [Self-Reflection in LLM Agents (arXiv 2405.06682)](https://arxiv.org/abs/2405.06682) — empirical effect on problem-solving. + +### Routing / model selection +- [IBM Research — LLM routing](https://research.ibm.com/blog/LLM-routers) — predictive vs cascading vs nonpredictive. +- [vLLM Semantic Router](https://vllm-semantic-router.com/) — open-source mixture-of-models. 
+- [Patronus — AI Agent Routing best practices](https://www.patronus.ai/ai-agent-development/ai-agent-routing) — operational guide. +- [arXiv 2509.07571 — Generalized Routing](https://arxiv.org/html/2509.07571v1) — model + agent orchestration. + +### Fine-tuning, distillation, DPO/KTO +- [Direct Preference Optimization (arXiv 2305.18290)](https://arxiv.org/abs/2305.18290) — Rafailov et al. +- [KTO (arXiv 2402.01306)](https://arxiv.org/pdf/2402.01306) — Kahneman-Tversky alignment. +- [HuggingFace — Preference Tuning with DPO methods](https://huggingface.co/blog/pref-tuning) — practical DPO/IPO/KTO. +- [OpenAI — Supervised fine-tuning guide](https://developers.openai.com/api/docs/guides/supervised-fine-tuning) — SFT mechanics. +- [Structured Agent Distillation (arXiv 2505.13820)](https://arxiv.org/html/2505.13820v3) — segment Reason vs Action spans. +- [Agent Fine-tuning through Distillation (arXiv 2510.00482)](https://arxiv.org/html/2510.00482) — domain microagents. +- [Distilling LLM Agent into Small Models](https://github.com/Nardien/agent-distillation) — repo + paper. + +### Regression suites and CI gating +- [EvalView — Golden Traces docs](https://github.com/hidai25/eval-view/blob/main/docs/GOLDEN_TRACES.md) — snapshot/regression for agents. +- [Braintrust — Eval-driven development](https://www.braintrust.dev/articles/eval-driven-development) — gate design. +- [DeepEval](https://github.com/confident-ai/deepeval) — open-source LLM eval framework. +- [Evaluation-Driven Development of LLM Agents (arXiv 2411.13768)](https://arxiv.org/html/2411.13768v3) — process model + reference architecture. +- [Pragmatic Engineer — A pragmatic guide to LLM evals](https://newsletter.pragmaticengineer.com/p/evals) — engineering walkthrough. + +### Code-review-agent specifics +- [Anthropic — SWE-bench Sonnet engineering](https://www.anthropic.com/engineering/swe-bench-sonnet) — what changed at the model level. 
+- [Anthropic — Claude Opus 4.7 release](https://www.anthropic.com/news/claude-opus-4-7) — current frontier numbers. +- [Greptile vs CodeRabbit (Greptile blog)](https://www.greptile.com/greptile-vs-coderabbit) — vendor comparison; multi-agent review architecture. +- [Qodo (formerly Codium) — AI Code Review](https://www.qodo.ai/blog/ai-code-review/) — Qodo 2.0 multi-agent architecture. +- [Sverklo](https://github.com/sverklo/sverklo) — open-source code-graph MCP server; pattern reference. +- [FindSkill — Claude Code Review vs Bugbot vs Greptile vs CodeRabbit (May 2026)](https://findskill.ai/blog/claude-code-review-vs-cursor-bugbot-greptile-coderabbit/) — current head-to-head. + +### Cross-cutting community +- [Lenny's Newsletter — Evals, error analysis, better prompts (Hamel Husain)](https://www.lennysnewsletter.com/p/evals-error-analysis-and-better-prompts) — accessible interview. +- [Humanloop — Why Your AI Product Needs Evals (Hamel Husain)](https://humanloop.com/blog/why-your-product-needs-evals) — interview transcript. + +--- + +*End of dossier.*