Evaluating and Improving LLM Agents
A state-of-the-art report on agent accuracy, evaluations, and offline improvement, with a generalized approach for QualOps and similar projects.

Document map

| Part | Title | Audience |
|---|---|---|
| 0 | Executive summary | Leadership + engineering |
| 1 | Why this matters for QualOps | Leadership |
| 2 | Foundations of agent evaluation | Engineering |
| 3 | Evaluating tool-calling and workflow agents | Engineering |
| 4 | The framework landscape | Engineering |
| 5 | Systematically improving agents (offline) | Engineering |
| 6 | The QualOps Approach — generalized concept | Both |
| 7 | Prerequisites and adoption roadmap | Both |
| 8 | Risks, open questions, what we left out | Both |
| 9 | Appendix: glossary, references, dossier links | Engineering |
0 · Executive summary
LLM agent quality is not a property of the model alone; it is a property of the system: the model, the prompts, the tools, the context routing, and the harness, all together. As QualOps has matured into a multi-stage agentic pipeline (Analyze → Review → Fix → Report → Judge), the unit that matters has become the trajectory the agent takes through tool calls, not just the final PR comment.

1. Score the agent on three layers, every release.
2. Use a small, curated golden set (50–200 PRs), not a giant crawl.
3. Apply the right tool to the right stage (deterministic tests for Fix; tool-call F1 for Analyze; LLM-judge for Review/Report; agent-as-judge for Judge).
4. Run a two-tier eval cadence: per-PR fast gate + nightly capability eval.
5. Improve through structured error analysis (open coding → axial coding → frequency-weighted prioritization).
6. Keep Langfuse, add Promptfoo (per-PR CI gate) and Inspect AI (nightly capability eval).
1 · Why this matters for QualOps
1.1 The QualOps pipeline

QualOps is an AI-powered code review tool built on the Claude Agent SDK. It runs in CI on every pull request and produces structured findings: comments, GitHub Checks annotations, severity-ranked reports, and (in agentic mode) suggested fixes.

1.2 What is at stake

A code review is a trust artifact. A false positive wastes developer time and erodes confidence in every subsequent finding. A false negative defeats the purpose of the tool. A confidently miscalibrated severity label routes attention away from the issues that actually matter. None of these failures is catastrophic individually, but they compound across thousands of PRs.

Without disciplined evaluation, the risks are customer churn (reviewers turn the bot off), hidden regressions (a prompt change fixes one issue and silently regresses three), cost overruns (model upgrades produce big bills with unclear value), and audit risk for enterprise customers.

1.3 What we already have

QualOps ships with a working evaluation suite: Langfuse-backed dataset runs, multiple presets (fast, default, sonnet-agentic, thorough), CRB-derived golden datasets across five real repos (Sentry, Grafana, Cal.com, Discourse, Keycloak), and a configurable LLM-as-judge scoring stage. This puts QualOps ahead of most teams shipping agentic products today. The gaps identified here are deliberate next steps, not foundations.
2 · Foundations of agent evaluation
2.1 An agent is not a function

A classical LLM eval treats the model as a function f(prompt) → completion. An agent eval treats the agent as a stateful policy π that interacts with an environment via tools. The unit of evaluation is a trajectory:

τ = (s₀, a₀, o₀, s₁, a₁, o₁, …, sₙ)

where sᵢ is the state, aᵢ the action (for example, a tool call), and oᵢ the resulting observation.

Every benchmark surveyed for this report agrees: agent evaluation requires assessing not just the terminal answer but the path taken to reach it.
2.2 The three layers of agent evaluation

| Layer | Question | Metrics |
|---|---|---|
| Component-level | Does each sub-skill (single tool call, retriever, sub-agent) work in isolation? | Tool-match rate, parameter F1, retrieval recall@k |
| Trajectory-level | Is the path of reasoning + actions valid, efficient, and faithful? | Plan correctness, edit distance, tool-call F1 over sequence |
| Outcome-level | Did the agent achieve the user goal? | Task success, unit-test pass rate, human rating |

For QualOps all three layers exist naturally:

- Component: did the Analyze stage call `read_file` with the right path?
- Trajectory: did the Review stage's parallel sub-agents converge without redundant tool calls?
- Outcome: did the suggested fix actually fix the bug?
2.3 Dimensions of agent quality

| Dimension | Why it matters for QualOps | How to measure |
|---|---|---|
| Accuracy / task success | The headline number | Exact match, unit-test pass, human rating |
| Faithfulness / groundedness | Dominant for code review. A hallucinated finding is worse than a missed one | Atomic-claim NLI; citations as guardrail |
| Completeness | Did the agent find all the issues a human would? | Recall against an annotated PR review |
| Calibration | Severity labels must be trustworthy for triage | ECE, Brier score |
| Robustness | Stable under prompt perturbation, weird diffs | Performance under paraphrase / typo suites |
| Determinism / consistency | Same PR → same review | Output variance across N samples; pass^k |
| Latency | CI gates have time budgets | p50/p95/p99 wall-clock per stage |
| Cost | $ per PR | Tokens × price + tool-call costs |
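
The calibration row can be made concrete with a small sketch. It assumes, hypothetically, that each finding carries a confidence in [0, 1] that it is a real issue and that a human label (1 = real, 0 = not) is available; neither field is confirmed by the QualOps schema. The binning scheme (ten equal-width bins) is one common choice for ECE, not the only one.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - observed accuracy| over equal-width bins."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, o))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(o for _, o in bucket) / len(bucket)
            ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical findings: four at 0.9 confidence (three real), two at 0.6 (one real).
    conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
    real = [1, 1, 1, 0, 1, 0]
    print(round(brier_score(conf, real), 4), round(expected_calibration_error(conf, real), 4))
```

Lower is better for both numbers; a well-calibrated agent whose 0.9-confidence findings are right 90% of the time drives ECE toward zero.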
2.4 The evaluation lifecycle

[Figure: the evaluation lifecycle as a loop. 1. … on real traces → 2. Codify failure modes as rubric → 3. Add to golden set + regression suite → 4. Run evals in CI; block on regression → 5. Ship + monitor in production → 6. Sample drift, online judge → back to step 1.]

Tactics endorsed across primary sources: start small (Anthropic: "20–50 simple tasks drawn from real failures is a great start"); treat evals like unit tests; route failures back to the eval set; gate on stat-sig regression.

2.5 LLM-as-judge: the workhorse, with caveats

Zheng et al.'s Judging LLM-as-a-Judge (NeurIPS 2023) showed that GPT-4 acting as a judge agreed with human preference at over 80%, roughly the same as inter-human agreement. This legitimized LLM-as-judge as a primary evaluation method. The caveats are a set of known judge biases, each with a standard mitigation:

| Bias | What happens | Mitigation |
|---|---|---|
| Position bias | Judge prefers whichever appears first | Swap order, score both, average |
| Verbosity bias | Longer answers rated higher | Length constraint in rubric |
| Self-preference | Judge prefers outputs from its own family | Cross-model judge ensemble |
| Familiarity bias | Judge favors text it would have generated | Down-weight low-perplexity samples |
| Sycophancy | Judge follows hints in the prompt | Blind the judge to source |
| Fallacy oversight | Judge accepts confident-sounding wrong reasoning | Step-by-step grading; process supervision |
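
As an illustration of the position-bias mitigation in the table (swap order, score both, average), here is a minimal sketch. The `Judge` callable and its return convention (a preference in [0, 1] for whichever answer is shown first) are assumptions for the example, not the interface of any particular judge library.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> score in [0, 1],
# where the score is the judge's preference for the FIRST answer shown.
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Return the preference for answer_a, averaged over both presentation orders."""
    a_shown_first = judge(question, answer_a, answer_b)  # preference for A
    b_shown_first = judge(question, answer_b, answer_a)  # preference for B
    # Convert the second call into a preference for A before averaging.
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0


def first_position_judge(question: str, first: str, second: str) -> float:
    """Toy stand-in for an LLM judge that always favors whatever it sees first."""
    return 0.8


if __name__ == "__main__":
    # A purely position-biased judge nets out to roughly "no preference" after swapping.
    print(debiased_preference(first_position_judge, "q", "answer A", "answer B"))
```

The cost is two judge calls per comparison; the benefit is that a judge with no real preference cannot tip the result by position alone.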
2.6 Trajectory and process evaluation

Process evaluation asks: was every intermediate step justified? Three families: step-wise correctness, plan-level precision/recall, and edit distance. OpenAI's Let's Verify Step by Step (Lightman et al. 2023) showed that process supervision beats outcome supervision for training reward models.

2.7 Statistical rigor: why "vibes" fail

With N=10 examples and a stochastic model, a swing of ±20 percentage points in pass rate is normal noise. The minimum statistical discipline:

- Paired comparisons. Run model A and B on the same examples; the per-example difference cancels per-example variance.
- Confidence intervals. 95% CI half-width ≈ `1.96 × √(p(1-p)/N)`. To distinguish 80% from 85% at 95% confidence you need ~1000 samples.
- Multiple runs per task. Even at temperature 0, modern serving stacks are non-deterministic. Plan for ≥5 runs per task.
- pass^k reporting. Sierra's τ-bench: the probability of succeeding on all k of k independent attempts at the same task.
- Bradley-Terry / Elo for pairwise rankings (what Chatbot Arena does).
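
The confidence-interval arithmetic above can be checked with a few lines of Python. This is a sketch using the normal approximation quoted in the list; the function names are illustrative.

```python
import math


def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a confidence interval for a pass rate."""
    return z * math.sqrt(p * (1.0 - p) / n)


def samples_for_half_width(p: float, target: float, z: float = 1.96) -> int:
    """Smallest N whose half-width is at most `target` (same approximation)."""
    return math.ceil(z * z * p * (1.0 - p) / (target * target))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(ci_half_width(0.8, n), 3))
    # N needed for a +/-2.5 point interval around an 80% pass rate.
    print(samples_for_half_width(0.8, 0.025))
```

At an 80% pass rate the half-width is roughly 25 points with N=10 and roughly 2.5 points with N=1000, which is why separating 80% from 85% takes on the order of a thousand samples unless the comparison is paired.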
3 · Evaluating tool-calling and workflow agents

QualOps is a tool-calling workflow agent, not a chatbot. What matters is whether the agent picks the right tools, in the right order, with the right arguments, and produces the right final artifact.

3.1 What "tool-call accuracy" actually means

"Tool-call accuracy" is a deceptively flat label. It decomposes into a stack of sub-metrics:

| Metric | What it measures |
|---|---|
| Exact match | Predicted call equals gold byte-for-byte. Brittle. |
| AST match | Parse into name + (arg-name, arg-value); structural equality |
| Semantic match | LLM-judge or custom equality on argument values |
| Argument F1 | Per-call precision/recall on argument names + values |
| Tool-call F1 | Set-level over multiset of (tool, args) pairs |
| Multi-call ordering | Exact / in-order / any-order / edit distance |
| Hallucinated tools | Agent invents a tool that doesn't exist |
| Missed tools | Agent answered from knowledge instead of calling tool |
| Idempotency / collateral damage | Side-effecting call repeated; unintended state change |
| Parallel calls | Bag of tool calls per turn (not list) |
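
A minimal sketch of the tool-call F1 row: precision, recall, and F1 over the multiset of (tool, args) pairs. The call representation and the exact-equality rule on arguments are assumptions for illustration; a real scorer might relax argument matching to AST or semantic match as described in the table.

```python
import json
from collections import Counter
from typing import Any, Iterable

ToolCall = tuple[str, dict[str, Any]]  # (tool name, arguments)


def _key(call: ToolCall) -> tuple[str, str]:
    """Canonical, hashable form of a call; argument key order must not matter."""
    name, args = call
    return name, json.dumps(args, sort_keys=True)


def tool_call_f1(predicted: Iterable[ToolCall], gold: Iterable[ToolCall]) -> dict[str, float]:
    """Precision, recall and F1 over the multiset of (tool, args) pairs."""
    pred_counts = Counter(_key(c) for c in predicted)
    gold_counts = Counter(_key(c) for c in gold)
    matched = sum((pred_counts & gold_counts).values())  # multiset intersection
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gold = [("read_file", {"path": "a.py"}), ("read_file", {"path": "b.py"}),
            ("grep", {"pattern": "foo"})]
    predicted = [("read_file", {"path": "a.py"}), ("grep", {"pattern": "foo"}),
                 ("read_file", {"path": "c.py"}), ("read_file", {"path": "a.py"})]
    print(tool_call_f1(predicted, gold))
```

Because the comparison is on multisets, the repeated `read_file` on `a.py` counts against precision instead of being silently deduplicated.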
3.2 Trajectory evaluation

Two orthogonal questions:

- Q1: Did the agent get to the goal? (outcome)
- Q2: Did it follow a sensible path? (process)

An agent can stumble to the right answer through a 47-step random walk, or take an optimal 3-step path that ends in the wrong final state. Scoring rules in increasing leniency: trajectory exact match → in-order match → any-order match → edit distance.
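
The four scoring rules can be sketched over sequences of tool names. The precise definitions below (in-order as "gold is a subsequence of the prediction", any-order as multiset containment) are one reasonable reading, stated here as an assumption; frameworks differ in the details.

```python
from collections import Counter
from typing import Sequence


def exact_match(pred: Sequence[str], gold: Sequence[str]) -> bool:
    """Same steps, same order, nothing extra."""
    return list(pred) == list(gold)


def in_order_match(pred: Sequence[str], gold: Sequence[str]) -> bool:
    """Gold is a subsequence of pred: extra steps allowed, order preserved."""
    remaining = iter(pred)
    return all(step in remaining for step in gold)


def any_order_match(pred: Sequence[str], gold: Sequence[str]) -> bool:
    """Every gold step appears in pred at least as often, in any order."""
    return not (Counter(gold) - Counter(pred))


def edit_distance(pred: Sequence[str], gold: Sequence[str]) -> int:
    """Levenshtein distance between the two step sequences."""
    prev = list(range(len(gold) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, g in enumerate(gold, start=1):
            curr.append(min(prev[j] + 1,                # drop a predicted step
                            curr[j - 1] + 1,            # add a missing gold step
                            prev[j - 1] + (p != g)))    # substitute one step
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    gold = ["read_file", "grep", "read_file"]
    detour = ["read_file", "list_dir", "grep", "read_file"]
    shuffled = ["grep", "read_file", "read_file"]
    for name, pred in (("detour", detour), ("shuffled", shuffled)):
        print(name, exact_match(pred, gold), in_order_match(pred, gold),
              any_order_match(pred, gold), edit_distance(pred, gold))
```

Edit distance yields a graded score instead of a pass/fail, which is useful for tracking whether a change moved trajectories closer to or further from the reference.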

3.3 Outcome vs. process: when to use each

| Aspect | Outcome eval | Process eval |
|---|---|---|
| Data needed | Goal-state checker (test, schema, regex) | Reference trajectories or judge model |
| Cost | Cheap, deterministic | Expensive or noisy |
| Catches lucky shortcuts | No — agent can game it | Yes |
| Catches plan inefficiency | No | Yes |
| Penalizes equivalent paths | No (good) | Yes (bad — risks rewarding mimicry) |
3.4 Major benchmarks worth knowing for QualOps

| Benchmark | Year | Why it matters for QualOps |
|---|---|---|
| BFCL v3 | 2024 | AST match + executable accuracy methodology; right framework for per-stage tool calls |
| τ-bench | 2024 | pass^k metric for reliability under non-determinism. Directly transferable |
| SWE-bench Verified † | 2024 | Apply patch + FAIL_TO_PASS + PASS_TO_PASS. The methodology template for QualOps's Fix stage |
| SWE-bench Live | 2025 | 50 freshly verified GitHub issues per month. Now the recommended SWE-bench variant |
| SWE-bench Pro | 2025 | Long-horizon, enterprise-scale. GPT-5 23.3% / Claude Opus 4.1 23.1% |
| AppWorld | 2024 | State-based eval with collateral-damage check |
| DevAI / Agent-as-a-Judge | 2024 | Methodology for using an agent (with tools) as the judge |
| HAL | 2025 | Variance-decomposed reporting |
3.5 Code-agent specific evaluation

For QualOps's Review stage (no patch, just a comment), the analog of SWE-bench's pattern is:

- Finding location precision/recall: did the agent flag the right line and file?
- Finding-class match: did it categorize correctly (security vs perf vs style)?
- Finding–PR alignment: does the finding correspond to something the human reviewer also flagged?
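
A sketch of how the three checks above might be scored against human-annotated findings. The `Finding` shape, the greedy one-to-one matching, and the three-line tolerance are all illustrative assumptions, not the QualOps data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    category: str  # e.g. "security", "perf", "style"


def match_findings(predicted: list[Finding], gold: list[Finding],
                   line_tolerance: int = 3) -> list[tuple[Finding, Finding]]:
    """Greedy one-to-one matching: same file, line within the tolerance."""
    unmatched_gold = list(gold)
    pairs = []
    for p in predicted:
        for g in unmatched_gold:
            if p.path == g.path and abs(p.line - g.line) <= line_tolerance:
                pairs.append((p, g))
                unmatched_gold.remove(g)  # each human finding can be matched once
                break
    return pairs


def review_scores(predicted: list[Finding], gold: list[Finding],
                  line_tolerance: int = 3) -> dict[str, float]:
    """Location precision/recall plus category accuracy on the matched pairs."""
    pairs = match_findings(predicted, gold, line_tolerance)
    tp = len(pairs)
    return {
        "location_precision": tp / len(predicted) if predicted else 0.0,
        "location_recall": tp / len(gold) if gold else 0.0,
        "class_accuracy": sum(p.category == g.category for p, g in pairs) / tp if tp else 0.0,
    }


if __name__ == "__main__":
    human = [Finding("a.py", 10, "security"), Finding("b.py", 40, "perf")]
    agent = [Finding("a.py", 12, "security"), Finding("b.py", 41, "style"),
             Finding("c.py", 5, "style")]
    print(review_scores(agent, human))
```

Location recall against the human review doubles as the alignment check: an agent finding with no human counterpart lowers precision, and a human finding with no agent counterpart lowers recall.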

For the Fix stage, tests don't catch every flavor of bad fix: style/readability regressions, performance regressions, security regressions. These need additional graders.

3.6 Agent-as-judge

Zhuge et al.'s Agent-as-a-Judge (Oct 2024; ICML 2025) replaces the LLM judge with an agent judge that can read code, run tools, and verify intermediate steps.

For QualOps: an external agent-as-judge is well suited to grading "was this PR review good?" Give it the diff, the agent's findings, the human-merged PR, and a rubric, and let it use grep/file-read tools to verify each finding against the code. Run the judge on a different model family than the Review stage to avoid self-preference.

3.7 Replay testing and recorded traces

Pattern (used by Braintrust, LangSmith, Phoenix, Anthropic, Cognition):

- Capture every production run as a trace.
- Tag failures, edge cases, customer escalations into a regression set.
- On every prompt/model/harness change, replay each trace with stubbed tool outputs.
- Diff: were tool-call sequences equivalent? Did the final artifact differ?
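
The replay-and-diff steps can be sketched as follows. The trace layout (`input`, `calls`, `artifact`) and the agent signature are hypothetical, chosen only to show the mechanics; they do not mirror the trace schema of any tool named above.

```python
import json
from typing import Any, Callable

ToolExecutor = Callable[[str, dict[str, Any]], str]


def make_stub_tools(recorded_calls: list[dict[str, Any]]) -> tuple[ToolExecutor, list]:
    """Build a tool executor that serves recorded outputs instead of touching real systems."""
    outputs = {(c["tool"], json.dumps(c["args"], sort_keys=True)): c["output"]
               for c in recorded_calls}
    log: list[tuple[str, dict[str, Any]]] = []

    def call_tool(tool: str, args: dict[str, Any]) -> str:
        log.append((tool, args))
        key = (tool, json.dumps(args, sort_keys=True))
        # A call that was never recorded is itself a signal that behavior changed.
        return outputs.get(key, "ERROR: no recorded output for this call")

    return call_tool, log


def replay(agent: Callable[[str, ToolExecutor], str], trace: dict[str, Any]) -> dict[str, bool]:
    """Re-run the agent on a recorded input and diff calls and artifact against the trace."""
    call_tool, log = make_stub_tools(trace["calls"])
    artifact = agent(trace["input"], call_tool)
    recorded_sequence = [(c["tool"], c["args"]) for c in trace["calls"]]
    return {"same_tool_sequence": log == recorded_sequence,
            "same_artifact": artifact == trace["artifact"]}


def toy_agent(pr_input: str, call_tool: ToolExecutor) -> str:
    """Stand-in for the real agent: reads one file and reports its size."""
    body = call_tool("read_file", {"path": pr_input})
    return f"{len(body)} chars reviewed"


if __name__ == "__main__":
    trace = {"input": "src/app.py",
             "calls": [{"tool": "read_file", "args": {"path": "src/app.py"},
                        "output": "print('hi')"}],
             "artifact": "11 chars reviewed"}
    print(replay(toy_agent, trace))
```

Exact equality is the strictest possible diff; in practice the comparison would likely be relaxed to one of the trajectory match rules from section 3.2 and a judge-based comparison of the artifact.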
3.8 Decision guide: situation → technique

| Situation | Technique |
|---|---|
| Single-call function selection | BFCL-style AST match + argument F1 |
| Multi-step deterministic workflow | Trajectory in-order match + state-based eval |
| Parallel tool calls in one turn | Set-equality match (bag of calls) |
| Tool arguments are free-form text | Action similarity (embedding or LLM-judge) |
| Side-effecting tools | State-based eval with collateral-damage check |
| Output is a code patch | SWE-bench harness: apply + FAIL_TO_PASS + PASS_TO_PASS |
| Output is open-ended text | Agent-as-judge with structured rubric |
| Detect hallucinated tools | Schema validation + tool-name whitelist |
| Detect missed tools | Recall against reference trajectory |
| Variance/reliability | pass^k with k≥5; mean + 95% CI |
| Catching prompt/model regressions | Recorded-trace replay with tool stubs |
| Long-horizon multi-stage agent (QualOps) | Hybrid: per-stage F1 + per-stage state checks + outcome + agent-as-judge |
4 · The framework landscape

The eval / observability tooling ecosystem moved fast through 2025–2026. The major shifts since early 2025: OpenAI acquired Promptfoo in March 2026 (MIT license preserved), Langfuse landed observation-level LLM-as-judge in February 2026, and Inspect AI reached production-grade adoption inside frontier labs.

4.1 The shortlist for QualOps

- Langfuse (incumbent, keep). MIT, self-host, observation-level LLM-as-judge, boolean/categorical scoring. Production references include Canva. TS + Python parity.
- Promptfoo (add as CI gate). MIT, OpenAI-acquired but license preserved, first-class GitHub Action with PR-comment diffs, Claude Agent SDK provider.
- Inspect AI (add for nightly capability evals). MIT, used by Anthropic/DeepMind/Grok. Agent Bridge wraps the QualOps agent without modification.
- LangSmith (only if a wall is hit). Best out-of-the-box trajectory primitives via `agentevals`. Closed-source.
- Braintrust (only if non-engineers must contribute test cases). Notion's reference deployment is real. Closed-source; hybrid-only self-host.
4.2 Comparison matrix

Legend: ✓ yes/strong · ≈ partial/caveat · ✗ no/weak

| Framework | OSS | Self-host | Trajectory | Tool-call | LLM judge | Online | CI | TS | Py | Free tier |
|---|---|---|---|---|---|---|---|---|---|---|
| Langfuse | ✓ MIT | ✓ | ≈ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50k/mo |
| LangSmith | ✗ | ✓ Plus+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 5k/mo |
| DeepEval | ✓ | ≈ | ✓ Py | ✓ | ✓ | ✓ | ✓ pytest | ≈ | ✓ | ✓ |
| Braintrust | ✗ | ≈ | ≈ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 1M spans |
| Phoenix / Arize | ✓ | ✓ | ≈ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Promptfoo | ✓ MIT | ✓ | ≈ | ≈ | ✓ | ✗ | ✓ best | ✓ | ✓ | ✓ |
| Inspect AI | ✓ MIT | ✓ | ✓ | ✓ | ✓ | ✗ | ≈ | ✗ | ✓ | ✓ |
| OpenAI Evals API | ✓ repo | ≈ | ≈ | ≈ | ✓ | ≈ | ≈ | ≈ | ✓ | paid API |
| W&B Weave | ✓ SDK | ≈ | ≈ | ≈ | ✓ | ≈ | ✓ | ≈ | ✓ | ✓ |
| MLflow GenAI | ✓ | ✓ | ≈ | ≈ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
| RAGAS | ✓ | ✓ | ≈ | ≈ | ✓ | ✗ | ≈ | ✗ | ✓ | ✓ |
| Patronus | ✗ | ✗ | ✗ | ≈ | ✓ | ✓ | ≈ | ✓ | ✓ | sales |
4.3 Fit by team profile

| Team profile | Recommended primary | Add-ons |
|---|---|---|
| Small team, CI-gated, Node/TS, Claude (= QualOps) | Langfuse | + Promptfoo (CI) + Inspect AI (nightly) |
| Small/mid team, Python-only, RAG-heavy | DeepEval + RAGAS | + Phoenix or Langfuse |
| Large org, many agents, dedicated SREs | Braintrust (product) + Phoenix/Arize AX (platform) | — |
| LangChain / LangGraph shop | LangSmith | — |
| Frontier-lab / safety org | Inspect AI | + custom storage |
| OpenAI-only shop | OpenAI Evals API + Promptfoo | — |
| Already on W&B for ML | W&B Weave | + RAGAS / DeepEval metrics |
4.4 CI integration patterns

The cleanest CI pattern for QualOps is two-tier:

- Per-PR (3–5 min): Promptfoo YAML with ~30 small assertions on the output of each pipeline stage; required GitHub check; PR-comment diff vs. main.
- Nightly / weekly (~30–60 min): Langfuse experiment over a 100–200 item dataset, full pipeline, LLM-as-judge scorers + tool-call F1 scorers per stage. Plus quarterly Inspect AI capability eval against held-out fixture repos.
5 · Systematically improving agents (offline)

5.1 The eval-driven improvement loop

[Figure: the eval-driven improvement loop. … free-text notes → Axial coding (cluster into taxonomy) → Prioritize by frequency × severity × fixability → Pick top bucket → Hypothesize fix (prompt / context / tool / sub-agent / model / SFT) → Implement smallest change that could fix it → Run eval set (regression + targeted) → Delta positive? No regression? If no: discard or refine, then return to hypothesizing a fix. If yes: Promote prompt version → Ship behind gate → Collect new traces → back to the start of the loop.]

5.2 Error analysis: open coding → axial coding → frequency

The single most cited improvement technique. Borrowed from grounded-theory qualitative research:

- Pass 1: Open coding (bottom-up). Sit with raw traces. Write free-text notes describing what went wrong. Critically, do not pre-define categories. Bottom-up coding at NurtureBoss surfaced "date handling" as the dominant failure class and lifted that subtask from 33% → 95% accuracy.
- Pass 2: Axial coding. Group notes into 5–15 categories. LLM-assisted clustering pass + human review.
- Frequency-weighted prioritization. Rank by `frequency × business cost × fixability`. Spend engineering effort top-down.

| ID | Category | Stage | Frequency | Severity | Fixability | Priority |
|---|---|---|---|---|---|---|
| E1 | False positive on idiomatic style | Review | 32% | Low | High | P1 |
| E2 | Missed null-deref across files | Analyze | 18% | High | Medium | P1 |
| E3 | Fix proposed wrong import | Fix | 11% | Medium | High | P2 |
| E4 | Judge rated harmless nit as "blocker" | Judge | 9% | Medium | High | P2 |
| E5 | Refused on large diff | Analyze | 4% | High | Low | P3 |
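
To show how the ranking rule could be applied to the table above, here is a small sketch. The mapping of Low/Medium/High to 1/2/3 is an assumption made for the example; the report does not specify how the ordinal labels are weighted.

```python
# Assumed ordinal weights; the source table only gives Low / Medium / High labels.
LEVEL = {"Low": 1, "Medium": 2, "High": 3}

# (id, frequency, severity, fixability) copied from the example taxonomy.
BUCKETS = [
    ("E1", 0.32, "Low", "High"),
    ("E2", 0.18, "High", "Medium"),
    ("E3", 0.11, "Medium", "High"),
    ("E4", 0.09, "Medium", "High"),
    ("E5", 0.04, "High", "Low"),
]


def priority_score(frequency: float, severity: str, fixability: str) -> float:
    """frequency x business cost x fixability, with ordinal labels mapped to weights."""
    return frequency * LEVEL[severity] * LEVEL[fixability]


def rank_buckets(buckets: list[tuple[str, float, str, str]]) -> list[tuple[str, float]]:
    """Return (id, score) pairs sorted from highest to lowest priority."""
    scored = [(bid, priority_score(freq, sev, fix)) for bid, freq, sev, fix in buckets]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for bucket_id, score in rank_buckets(BUCKETS):
        print(bucket_id, round(score, 2))
```

Under these weights the ordering is E2, E1, E3, E4, E5, which agrees with the P1/P2/P3 tiers in the table; different weights could reorder buckets within a tier.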
5.3 The hierarchy of fixes (decision tree)

| Error type | Fix |
|---|---|
| Format / schema violation | Tighten output format in prompt; add example |
| Misunderstood instruction | Restructure prompt: role/task/format/guardrails |
| Missing domain knowledge | Add Skill or retrieval tool |
| Hallucinated fact / API | Citation requirement + retrieval tool |
| Wrong tool used | Tool design: names, descriptions, consolidate |
| Tool result ignored | Tool description + output summarization |
| Reasoning chain broke | Extended thinking or sub-agent split |
| Style / over-flagging | Few-shot mining: contrastive don't-flag |
| Cross-file blindness | Code-graph tool; orchestrator-worker |
| Long-context recall | Context engineering: prune, dynamic load |
| Judge miscalibrated | Refresh calibration; verifier model |
| Plateau across types | Prompt optimizer (DSPy / AdalFlow) |
| Latency/cost plateau | Routing or distillation SFT |
| Persistent + structured | DPO from preference pairs; otherwise SFT |

Every branch ends the same way: re-run eval, gate, ship.

The order of operations in 2026, cheapest to most expensive: prompts → few-shot → tools → context → sub-agents → optimizers → routing → distillation/DPO. Fine-tuning is the long-game once the prompt surface is exhausted.

5.4 Prompt engineering as iteration discipline

Anthropic's Claude 4.x guides and OpenAI's GPT-5 cookbook converge on the same prompt skeleton:

    [Role] You are <persona, scope>.
    [Task] Goal in 1–2 sentences.
    [Context] Static background, dynamic retrieval, tool surface.
    [Examples] Few-shot, ideally diverse and including hard cases.
    [Format] Output schema (XML / JSON / markdown sections).
    [Guardrails] Out-of-scope behaviors, refusal triggers, escalation.

For Claude specifically: XML-tag structuring (`<example>`, `<context>`, `<task>`), tool definitions in the system message, instructions in the user turn, and "think step by step" or extended thinking for multi-stage tasks.
5.5 Automated prompt optimization

| Tool | Mechanism | Best for |
|---|---|---|
| DSPy / MIPROv2 | Programs not prompts; joint Bayesian opt of instructions + few-shot | Multi-stage pipelines (perfect for QualOps) |
| TextGrad | Backpropagation through text via LLM-generated feedback | Composable systems with critic-able output |
| AdalFlow | PyTorch-style auto-diff combining TextGrad + DSPy bootstrapping | Single library covering both directions |
| SAMMO | Structural mutation operators over function-graph prompts | Long structured prompts (manuals, policies) |
| OPRO / Promptbreeder / APE | Earlier generations | Mostly historical; folded into DSPy/AdalFlow |
5.6 Few-shot mining

Three lessons from 2024–25 literature:

- Quality dominates quantity. A handful beat dozens.
- Diversity matters more than similarity.
- Dynamic retrieval > static set for heterogeneous inputs.

For QualOps: build a vector index over canonical good examples keyed by diff features; retrieve top-k at inference. Add contrastive examples: pairs of (borderline diff, correct minimal review) so the agent learns where not to comment. Code-review agents over-flag by default; contrastive negatives are the cure.
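
The retrieve-top-k step can be illustrated with a toy in-memory index. The three-dimensional feature vectors below are hand-made placeholders; in practice they would come from an embedding model or a diff-feature extractor, neither of which is specified here.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_examples(query_vec: Sequence[float],
                   index: list[tuple[Sequence[float], str]], k: int = 2) -> list[str]:
    """Return the k stored examples whose feature vectors are closest to the query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [example for _, example in scored[:k]]


if __name__ == "__main__":
    # Hypothetical features: [is_python, touches_sql, touches_css]
    index = [
        ([1, 1, 0], "example review: SQL built by string concatenation in Python"),
        ([1, 0, 0], "example review: idiomatic Python, no comment needed"),
        ([0, 0, 1], "example review: CSS specificity clash"),
    ]
    for example in top_k_examples([1, 1, 0], index, k=2):
        print(example)
```

The second stored example plays the role of a contrastive negative: a case where the correct review is to say nothing.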
5.7 Tool design — the most under-appreciated lever

From Anthropic's Writing tools for agents:

- Naming: namespace by service; verb_object form.
- Description is a prompt. Read on every call. Be explicit about when to use, what inputs are valid, what the tool will NOT do.
- Schema with examples.
- Consolidate, don't proliferate. Fewer, more capable tools beat many narrow ones.
- Return high-signal text. Stable identifiers; pruned/summarized output.
- Error messages are pedagogy. "Error: file path must be relative to repo root; you passed an absolute path; try `src/foo.py`" enables self-correction.
- Observable side effects. If a tool mutates state, return the new state.
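
A sketch of a file-reading tool written with the "error messages are pedagogy" and "return high-signal text" guidelines in mind. The function name, parameters, and truncation limit are illustrative assumptions, not the actual QualOps tool.

```python
from pathlib import Path, PurePosixPath


def read_file_tool(repo_root: str, path: str, max_chars: int = 4000) -> str:
    """Hypothetical read_file tool: every return value is text the agent can act on."""
    requested = PurePosixPath(path)
    if requested.is_absolute():
        message = "Error: file path must be relative to repo root; you passed an absolute path"
        try:
            # Offer the corrected path when the absolute path sits inside the repo.
            suggestion = requested.relative_to(PurePosixPath(repo_root))
            return f"{message}; try {suggestion}"
        except ValueError:
            return message
    if ".." in requested.parts:
        return "Error: path must stay inside the repo; remove the '..' segments"
    target = Path(repo_root) / path
    if not target.is_file():
        return f"Error: {path} not found under repo root; check the spelling and directory"
    text = target.read_text(encoding="utf-8", errors="replace")
    if len(text) > max_chars:
        # Prune long output but say so, so the agent knows more content exists.
        return text[:max_chars] + f"\n[truncated: first {max_chars} of {len(text)} characters shown]"
    return text


if __name__ == "__main__":
    print(read_file_tool("/repo", "/repo/src/foo.py"))
```

Returning errors as ordinary strings, instead of raising, keeps the correction hint inside the text the model reads on its next turn.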
5.8 Context engineering — curate, don't accumulate

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." (Andrej Karpathy)

- Context rot. Recall and reasoning degrade as token count grows, well before the nominal limit.
- Lost-in-the-middle (Liu et al., Stanford). Performance is U-shaped: instructions at the very start or end win; buried middle loses.
- Instruction hierarchy. When system, user, and tool instructions conflict, models follow the most recent and most concretely worded.

For QualOps: never feed the entire repo. Feed (a) the diff, (b) the immediate symbol context (callers/callees of touched symbols), (c) project conventions for the relevant language, and nothing else. Add a tool the agent can call when it needs more.

5.9 Sub-agent decomposition and Skills

Anthropic's Building Effective Agents patterns: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer. Reach for sub-agent decomposition when one prompt is doing two qualitatively different jobs, required tools differ across phases, one phase needs a stronger model, or the prompt has crossed ~3–5K tokens. Caveat: orchestrator-worker uses 10–15× more tokens than a single agent.

5.10 Routing and model selection

Industry routers report 30–85% cost reduction with quality flat or slightly improved. For QualOps, a reasonable default cascade:

- Diff size / language detector (no LLM): small CSS/text → Haiku-tier; backend logic → Sonnet; cross-cutting refactors → Opus.
- Confidence escape hatch: if Judge rejects with "ambiguous", upgrade and re-run Review.
- Per-stage routing: Analyze and Report can be smaller; Review and Fix usually want strong; Judge wants strong (or at least different).
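
The first two steps of the cascade could look like the rule-based sketch below. The `DiffStats` fields, the extension list, and every numeric threshold are placeholder assumptions to be tuned against real traffic; the tier names are labels, not model identifiers.

```python
from dataclasses import dataclass

TIERS = ["haiku", "sonnet", "opus"]               # cheapest to strongest
LOW_RISK_EXTENSIONS = {".css", ".md", ".txt"}     # assumed "small CSS/text" set


@dataclass
class DiffStats:
    changed_lines: int
    files_touched: int
    extensions: set[str]


def route_review_model(diff: DiffStats) -> str:
    """Pick a model tier from cheap diff features, with no LLM call involved."""
    only_low_risk = bool(diff.extensions) and diff.extensions <= LOW_RISK_EXTENSIONS
    if only_low_risk and diff.changed_lines < 200:
        return "haiku"
    if diff.files_touched > 20 or diff.changed_lines > 1500:
        return "opus"   # treated here as a cross-cutting refactor
    return "sonnet"


def escalate(tier: str) -> str:
    """Confidence escape hatch: move one tier up, capped at the strongest tier."""
    return TIERS[min(TIERS.index(tier) + 1, len(TIERS) - 1)]


if __name__ == "__main__":
    small_css = DiffStats(changed_lines=40, files_touched=1, extensions={".css"})
    tier = route_review_model(small_css)
    print(tier, "->", escalate(tier))
```

Keeping the router deterministic makes its decisions easy to log as span attributes and to evaluate offline alongside the stages it feeds.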
5.11 Reflection patterns

The verifier model trick: run Fix with Sonnet, Judge with Opus. Different models reduce shared blind spots; empirically this is the cheap win.

5.12 Fine-tuning, distillation, DPO

The order of operations: prompts → few-shot → tools → context → sub-agents → optimizers → then fine-tuning. Two relevant offline techniques:

- Distillation from agent traces. Run the strong agent on a curated set, record traces, SFT a smaller model. Recent work preserves >90% of teacher quality at <20% the cost.
- DPO / KTO from preference pairs. From the eval set you have `(input, accepted, rejected)` triples, exactly the DPO format.

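
A sketch of turning eval rows into preference records. The input field names follow the triple named above; the output keys `prompt` / `chosen` / `rejected` are a common convention in preference-tuning tooling, but the exact schema should be checked against whichever trainer is used.

```python
import json
from typing import Any, Optional


def to_dpo_records(eval_rows: list[dict[str, Optional[str]]]) -> list[dict[str, Any]]:
    """Keep rows that have both an accepted and a distinct rejected output."""
    records = []
    for row in eval_rows:
        accepted, rejected = row.get("accepted"), row.get("rejected")
        if accepted and rejected and accepted != rejected:
            records.append({"prompt": row["input"], "chosen": accepted, "rejected": rejected})
    return records


def to_jsonl(records: list[dict[str, Any]]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    rows = [
        {"input": "diff 1", "accepted": "No issues found.", "rejected": "Nit: rename x to y."},
        {"input": "diff 2", "accepted": "Possible null dereference at line 4.", "rejected": None},
    ]
    print(to_jsonl(to_dpo_records(rows)))
```

Rows without a rejected counterpart are dropped here; they remain usable for SFT-style distillation.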
6 · The QualOps Approach — generalized concept

This is the synthesis: a concrete, opinionated approach for QualOps and other tool-calling agentic projects with similar shape.

6.1 Architecture: evals integrated into the QualOps pipeline

[Figure: evals integrated into the QualOps pipeline. The pipeline stages (P1–P5) emit spans to trace and dataset storage (Langfuse: traces, datasets, experiments). Storage feeds the eval layer: … per stage; schema validation + guardrails; SWE-bench-style test harness; agent-as-judge on Review/Report; pass^k reliability. Eval results feed the offline improvement loop: sample + open-code failures → axial code into taxonomy → pick top bucket → apply smallest fix → re-run eval, gate → promote prompt version + ship, which sends a new prompt version back to the pipeline.]

Three concerns kept structurally separate: the pipeline (production), the eval layer (scoring), and the improvement loop (between releases).

6.2 Stage-by-stage eval matrix

| Stage | Primary eval technique | Secondary | Reliability metric |
|---|---|---|---|
| Analyze | Tool-call F1 against expected read_file/grep set | Under-tooling rate; hallucinated-tool detection | pass^5 |
| Review | Location precision/recall + finding-class accuracy | Agent-as-judge on textual quality | pass^5 |
| Fix | SWE-bench-style harness: apply patch, FAIL_TO_PASS + PASS_TO_PASS | Linter / formatter delta; perf benchmark | pass^3 |
| Report | Schema validation | LLM-judge on coherence; faithfulness check | pass^5 |
| Judge | Agreement rate vs. held-out human labels | Calibration error (ECE) on severity | pass^5 |
| End-to-end | Composite weighted score + human-rated hold-out | Cross-model agent-as-judge | pass^5 + 95% CI |
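
The pass^k reliability column can be estimated from n repeated runs per task. The sketch below uses the combinatorial estimator reported with τ-bench (the mean over tasks of C(c, k) / C(n, k), where c is the number of successful runs); treat the data layout as an assumption.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Estimate pass^k: the chance that k independent runs of a task all succeed.

    successes_per_task[i] is the number of successful runs (out of n) on task i.
    """
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if not successes_per_task:
        raise ValueError("need at least one task")
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Three hypothetical tasks, five runs each: 5, 4 and 2 successes.
    successes = [5, 4, 2]
    for k in (1, 3, 5):
        print(k, round(pass_hat_k(successes, n=5, k=k), 3))
```

With k=1 this reduces to the ordinary mean pass rate; as k grows, only tasks that succeed on nearly every run keep contributing, which is what makes the metric a reliability measure.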
6.3 Tooling stack

| Layer | Choice | Rationale |
|---|---|---|
| Trace + dataset store | Langfuse (keep) | MIT, self-host, observation-level evals, already wired in |
| Per-PR CI gate | Promptfoo (add) | Best-in-class GitHub Action, YAML config, Claude Agent SDK provider |
| Nightly capability eval | Inspect AI (add) | Used by Anthropic/DeepMind/Grok; Agent Bridge wraps QualOps unmodified |
| LLM judge | Claude Opus + GPT-5 cross-judge for Review/Report; Sonnet for cheaper paths | Cross-model judging mitigates self-preference bias |
| Statistics | Custom (numpy/scipy) — bootstrap CIs, McNemar | Anthropic's Statistical Approach methodology |
6.4 Two-tier eval cadence

6.5 Statistical discipline

Every release the eval layer must produce: mean ± 95% CI on the headline metric; per-stage mean ± std across pass^5; paired-comparison delta vs. previous release (McNemar / paired bootstrap); variance decomposition (sampling, prompt, judge, data).

Releases ship only if: all deterministic regression assertions pass; composite score within 3% of baseline (or stat-sig improved); no individual stage regresses by more than 5% with stat-sig.
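
A sketch of the paired comparison behind the per-stage rule, using an exact McNemar test on pass/fail outcomes for the same eval items under the baseline and the candidate. The 5-point drop and the 0.05 significance level are taken as defaults for illustration; the function names are hypothetical.

```python
from math import comb


def mcnemar_exact(baseline: list[bool], candidate: list[bool]) -> float:
    """Two-sided exact McNemar p-value on paired pass/fail outcomes."""
    b = sum(x and not y for x, y in zip(baseline, candidate))  # passed before, fails now
    c = sum(y and not x for x, y in zip(baseline, candidate))  # failed before, passes now
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def stage_blocks_release(baseline: list[bool], candidate: list[bool],
                         max_drop: float = 0.05, alpha: float = 0.05) -> bool:
    """True if the stage regressed by more than max_drop and the change is significant."""
    drop = sum(baseline) / len(baseline) - sum(candidate) / len(candidate)
    return drop > max_drop and mcnemar_exact(baseline, candidate) < alpha


if __name__ == "__main__":
    # 100 paired items: 2 newly failing, 12 newly passing, 86 passing in both runs.
    old = [True] * 2 + [False] * 12 + [True] * 86
    new = [False] * 2 + [True] * 12 + [True] * 86
    print(round(mcnemar_exact(old, new), 4), stage_blocks_release(old, new))
```

Only the discordant pairs carry information, which is why a paired test can detect a shift with far fewer items than two independent confidence intervals would need.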

6.6 Improvement cadence

| Cadence | Activity |
|---|---|
| Continuous | Trace every production PR; LLM pre-label; human review of low-confidence and dismissed-by-reviewer cases |
| Weekly (1 day) | Pick the top error bucket; implement smallest fix; gate eval; ship if green |
| Monthly | Re-cluster the error taxonomy from the last month's traces; refresh the few-shot index |
| Quarterly | Refresh judge calibration set against a fresh human-rated sample (50–100 traces); refresh hold-out fixture repos |
| Per major version | Re-run end-to-end against the entire historical eval set; publish a release report with score deltas |
6.7 Generalizing to other projects

The same approach generalizes to other tool-calling / workflow agentic projects. The pattern, abstracted from QualOps:

- Define stages and the artifact each produces.
- Pick a primary eval technique per stage based on artifact shape.
- Build a 50–200 item curated golden set from real production traces.
- Wire a per-stage tracing layer with consistent span semantics (OpenInference is one OpenTelemetry-compatible convention).
- Build a two-tier eval: fast deterministic gate per change; slow LLM-judge / capability eval per night or week.
- Run the error analysis loop weekly. Maintain the taxonomy as a living document.
- Promote prompts and skills as versioned code; gate releases on paired statistical comparison.
7 · Prerequisites and adoption roadmap

7.1 Prerequisites

| Prerequisite | Status today | Action |
|---|---|---|
| Trace storage with span / observation primitives | Have (Langfuse) | None |
| Per-stage tracing in code (Analyze/Review/Fix/Report/Judge as distinct spans) | Mostly have | Audit; ensure consistent span names + attributes |
| Versioned prompts in repo | Have (evals/qualopsrc/) | None |
| Prompt-as-code promotion infra (content hashes, dev → staging → prod gates) | Partial | Add explicit promotion workflow + version pinning |
| A starter golden set of real PRs with labels | Partial (CRB datasets exist, internal labels TBD) | Label 50 internal PRs |
| Held-out / contamination control (split mgmt, fresh fixtures) | Don't have | Stand up split policy; rotate fixtures via SWE-bench Live monthly drops |
| LLM-as-judge wiring with binary rubrics | Have (Judge stage) | Add cross-model judge variant |
| Cross-model judge access (GPT-5 + Claude Opus) | Don't have | Procure GPT-5 API credentials + budget headroom (~$500/mo) |
| Ongoing human calibration label capacity (50–100 traces/quarter) | Don't have | Designate annotators; lightweight labeling tooling |
| CI runner with secrets for LLM API calls | Have | None |
| Per-stage tool-call F1 scorer | Don't have | Implement (~1 week) |
| SWE-bench-style harness for Fix stage | Don't have | Implement (~2 weeks); seed from SWE-bench Live + internal PRs |
| Agent-as-judge on Report stage | Don't have | Implement (~1 week) |
| Promptfoo per-PR gate | Don't have | Add Promptfoo + GitHub Action (~3 days) |
| Inspect AI nightly / weekly | Don't have | Add Inspect AI + Agent Bridge (~1 week) |
| Statistical comparison framework | Don't have | Adopt Anthropic's recipe (~3 days) |
| Ownership: who runs the eval loop? | TBD | Designate a part-time eval lead |
7.2 Adoption roadmap (~3 months)

7.3 Effort and cost estimate

| Phase | Engineering effort | Recurring cost (LLM tokens / month) |
|---|---|---|
| 1 — Foundations | ~3 weeks | ~$200 |
| 2 — CI gate | ~2 weeks | ~$300 |
| 3 — Fix harness + judge | ~3.5 weeks | ~$1,500 |
| 4 — Capability eval | ~1.5 weeks | ~$2,000 |
| 5 — Improvement loop | ~1 day/week ongoing | ~$500 |
| Total | ~10 weeks (plus ~1 day/week ongoing) | ~$4,500/mo steady-state |
8 · Risks, open questions, and what we left out
8.1 Risks

- Eval set leakage. If the same PRs feed both eval and SFT, the score is meaningless. Mitigation: strict held-out splits; SWE-bench Live for fresh data.
- Judge drift. As both the judge and the judged models update, judge scores drift. Mitigation: refresh the human-labeled calibration set quarterly.
- Reward hacking on the Fix harness. Example: a patch that monkey-patches `pytest` to skip tests. Mitigation: PASS_TO_PASS check; code-quality grader on the diff itself.
- Overfitting prompts to the eval set. Especially likely with prompt optimizers. Mitigation: held-out validation set; rotate the eval set periodically.
- Cost overruns. Weekly capability evals over 100+ fixtures add up. Mitigation: model routing (Haiku for broad eval, escalating to Sonnet/Opus for low-confidence cases) and capped turn budgets.
- Self-preference in same-model judge. A judge tends to favor outputs from its own model. Mitigation: cross-model judge, ideally from a different family.
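
The PASS_TO_PASS mitigation for reward hacking can be illustrated with a small resolve check. This is a sketch, assuming the harness yields a mapping from test id to a status string; the function name `is_resolved` and the status values are hypothetical. A fix counts as resolved only if every FAIL_TO_PASS and every PASS_TO_PASS test reports `"passed"`, so a skipped or missing test counts against the patch.

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """Return True only if every listed test ran and passed.

    results: dict mapping test id -> status string, e.g. "passed",
             "failed", or "skipped" (hypothetical harness output).
    fail_to_pass: tests that failed before the fix and must now pass.
    pass_to_pass: tests that passed before the fix and must still pass.
    """
    def all_passed(test_ids):
        # A missing or skipped test is treated as not passed, so a patch
        # that disables tests cannot satisfy the check.
        return all(results.get(test_id) == "passed" for test_id in test_ids)

    return all_passed(fail_to_pass) and all_passed(pass_to_pass)


if __name__ == "__main__":
    f2p = ["test_bug_is_fixed"]
    p2p = ["test_existing_behavior"]
    honest = {"test_bug_is_fixed": "passed", "test_existing_behavior": "passed"}
    hacked = {"test_bug_is_fixed": "passed", "test_existing_behavior": "skipped"}
    print(is_resolved(honest, f2p, p2p), is_resolved(hacked, f2p, p2p))
```

This check alone does not catch a patch that rewrites test bodies, which is why the risk above also pairs it with a grader on the diff itself.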
8.2 Open questions

- Are LLM judges good enough as the only signal? Production teams hedge by combining judges with periodic human review.
- Process supervision vs. outcome supervision for training data. Process supervision wins for math; how to label process at scale for fuzzy domains like code review remains open.
- Benchmark validity. Recent audits show many published benchmarks have leakage, mis-graded items, or task-validity problems.
- Calibration for tool-using agents specifically. Most calibration work targets factual QA. QualOps may need to invent its own severity-calibration methodology.
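
As a possible starting point for the calibration question, the standard Expected Calibration Error can be computed over review findings. This is a sketch under assumptions the report does not state: each finding carries a confidence in [0, 1], and a human label says whether the finding was correct. It is the generic equal-width-bin ECE, not a severity-specific methodology.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Four findings reported at 0.9 confidence, three confirmed by a human.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A value near zero means stated confidence tracks observed accuracy; with the 50 to 100 labeled traces per quarter planned in 7.1, fewer bins than ten may give more stable estimates.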
8.3 What we deliberately left out

- Online RLHF / continuous self-tuning in production. Out of scope; we deploy fixed versions.
- Human evaluation infrastructure beyond a calibration set. A separate project.
- Compliance / regulatory eval. If QualOps targets regulated industries, an additional eval layer will be needed.
- Prompt-injection / jailbreak red-teaming at scale. Promptfoo includes a basic suite; full adversarial robustness is its own program.
9 · Appendix
9.1 Glossary

| Term | Definition |
|---|---|
| Agent-as-judge | Using an LLM with tool access to evaluate another agent's output |
| AST match | Comparing tool calls structurally as parsed trees |
| BFCL | Berkeley Function-Calling Leaderboard |
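| Calibration / ECE | How well a model's confidence matches empirical accuracy; Expected Calibration Error (ECE) measures the gap |
| DPO / KTO | Direct Preference Optimization / Kahneman-Tversky Optimization; fine-tuning methods that learn from preference pairs (DPO) or binary desirable/undesirable labels (KTO) |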
| DSPy | Stanford NLP framework treating LLM workflows as programs |
| FAIL_TO_PASS / PASS_TO_PASS | SWE-bench's two test sets: tests the fix must turn from failing to passing, and tests that must keep passing |
| G-Eval | LLM-judge with chain-of-thought rubric prompting |
| Golden trace / golden set | Curated reference traces used as the eval baseline |
| HELM | Stanford's holistic evaluation framework |
| Inspect AI | UK AISI's research-grade Python eval framework |
| LLM-as-judge | Using an LLM to score another LLM's output |
| MIPROv2 | DSPy's joint instruction + few-shot optimizer |
| Open / axial coding | Qualitative-research method for building error taxonomies |
| OpenInference | OTEL-based semantic conventions for LLM/agent traces |
| pass@k vs pass^k | Succeed at least once in k trials vs. succeed in all k trials |
| Process Reward Model (PRM) | Model that scores each reasoning step |
| Promptfoo | MIT-licensed CLI/library for prompt and agent eval |
| ReAct | Reason + Act loop pattern for agents |
| SWE-bench | Code-agent benchmark based on real GitHub issues + unit tests |
| τ-bench | Sierra's tool-agent-user multi-turn benchmark; introduced pass^k |
| Trajectory | Ordered record of (state, action, observation) for an agent run |
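
The pass@k and pass^k entries above can be made concrete with the usual combinatorial estimators. The sketch assumes `n` independent trials of one task with `c` successes; the function names are illustrative. pass@k is estimated as the chance that a random size-k subset of the trials contains at least one success, and pass^k as the chance that all k are successes.

```python
from math import comb


def _check(n, c, k):
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n, c, k):
    """Estimated probability that at least one of k sampled trials succeeds."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Estimated probability that all k sampled trials succeed (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 trials of one task, 5 successes, evaluated at k = 2.
    print(pass_at_k(10, 5, 2), pass_hat_k(10, 5, 2))
```

For an agent that runs on every pull request, pass^k is usually the more relevant number, since it penalizes tasks the agent solves only some of the time.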
9.2 Top recommended reads

- Anthropic, "Demystifying evals for AI agents"
- Hamel Husain & Shreya Shankar, "LLM Evals: Everything You Need to Know"
- Hamel Husain, "A Field Guide to Rapidly Improving AI Products"
- Zheng et al. 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"
- Yao et al. 2024, "τ-bench"
- Jimenez et al. 2023, "SWE-bench"
- Zhuge et al. 2024, "Agent-as-a-Judge"
- Anthropic, "Building Effective Agents"
- Anthropic, "Writing tools for agents"
- Anthropic, "Effective context engineering for AI agents"
9.3 Companion files

- REPORT.md: the markdown version of this report
- sources/01-foundations.md: foundations dossier (~5,000 words)
- sources/02-frameworks.md: frameworks dossier (~5,800 words)
- sources/03-toolcalling-and-trajectory.md: tool-calling dossier (~5,200 words)
- sources/04-improvement.md: improvement dossier (~5,500 words)