From d93c775b55f76affcc40a98917722e90a74b0863 Mon Sep 17 00:00:00 2001
From: Suhani Nagpal
Date: Thu, 7 May 2026 14:46:16 +0530
Subject: [PATCH] docs(evaluation): add Agent as Judge type to eval-types page

---
 .../docs/evaluation/concepts/eval-types.mdx | 75 +++++++++++++++++++++----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/pages/docs/evaluation/concepts/eval-types.mdx b/src/pages/docs/evaluation/concepts/eval-types.mdx
index 0b015696..b81a0a1a 100644
--- a/src/pages/docs/evaluation/concepts/eval-types.mdx
+++ b/src/pages/docs/evaluation/concepts/eval-types.mdx
@@ -1,11 +1,13 @@
 ---
 title: "Eval Types"
-description: "The four evaluation methods in Future AGI: LLM as Judge, Deterministic, Statistical Metric, and LLM as Ranker, and how modality affects which ones apply."
+description: "The five evaluation methods in Future AGI: LLM as Judge, LLM as Ranker, Agent as Judge, Deterministic, and Statistical Metric, and how modality affects which ones apply."
 ---
 
 ## About
 
-Every eval template in Future AGI uses one of four evaluation methods to produce a result. The method determines how the eval computes its output, whether a judge model is required, and what kind of result to expect. Choosing the right type for your use case gives you the right balance of accuracy, speed, and cost.
+Every eval template in Future AGI uses one of five evaluation methods to produce a result. The method determines how the eval computes its output, whether a judge model is required, and what kind of result to expect. Choosing the right type for your use case gives you the right balance of accuracy, speed, and cost.
+
+Internally, these five methods collapse to three canonical types used by the API and the database: `llm` (LLM as Judge, LLM as Ranker), `code` (Deterministic and Statistical Metric), and `agent` (Agent as Judge).
 
 ---
 
@@ -35,7 +37,15 @@ Computed directly from the text using code or string logic. No model is called a
 
 **Returns**: pass/fail only. No reason field.
 
-**Examples**: Is JSON, Is Email, Contains Valid Link, No Invalid Links, One Line.
+**Examples**:
+
+| Category | Templates |
+|---|---|
+| Format validation | Is JSON, Is Email, Is Code, Is URL, JSON Schema, JSON Validation |
+| Substring checks | Contains, Contains Any, Contains All, Contains None, Starts With, Ends With, Equals |
+| Length and shape | Length Greater Than, Length Less Than, Length Between, Word Count In Range, One Line |
+| Link validation | Contains Valid Link, No Invalid Links |
+| Pattern matching | Regex |
 
 **Best for:**
 - Format validation (valid JSON, email address, URL presence)
@@ -59,11 +69,33 @@ Computes a numeric score using an algorithm applied to the output and a referenc
 | Levenshtein Similarity | Character edit distance between output and reference |
 | Numeric Similarity | Numerical difference between output and reference |
 | Embedding Similarity | Semantic vector similarity between output and reference |
+| Fuzzy Match | Approximate string match against an expected answer |
+| Ground Truth Match | Whether the output matches a reference ground truth |
 | Semantic List Contains | Whether output contains phrases semantically similar to a reference list |
 | Recall@K, Precision@K, NDCG@K, MRR, Hit Rate | Retrieval quality for RAG pipelines |
 | FID Score | Distribution similarity between sets of real and generated images |
 | CLIP Score | Alignment between an image and its text description |
 
+Most statistical metrics require a reference value (a ground-truth answer, a target list, or a relevance label set). Provide it through the eval config when running the eval.
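+
+To make this concrete, here is a minimal standalone sketch of how a Levenshtein-style similarity between an output and a reference can be computed. It illustrates the metric, not the platform's implementation:
+
+```python
+def levenshtein_similarity(output: str, reference: str) -> float:
+    """Illustrative only: normalised similarity in [0, 1]; 1.0 = identical."""
+    m, n = len(output), len(reference)
+    prev = list(range(n + 1))  # edit distances against the empty output prefix
+    for i in range(1, m + 1):
+        curr = [i] + [0] * n
+        for j in range(1, n + 1):
+            cost = 0 if output[i - 1] == reference[j - 1] else 1
+            curr[j] = min(prev[j] + 1,          # deletion
+                          curr[j - 1] + 1,      # insertion
+                          prev[j - 1] + cost)   # substitution
+        prev = curr
+    return 1.0 - prev[n] / max(m, n, 1)
+```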
+
 **Best for:**
 - Benchmarking against a ground-truth reference answer
 - RAG retrieval quality (recall, precision, ranking)
@@ -88,9 +120,27 @@ A variant of LLM as Judge where instead of scoring a single response, the model
 
 ---
 
+## Agent as Judge
+
+A specialised evaluation agent runs an iterative loop instead of a single LLM call. It can call tools through MCP connectors, search the web, retrieve from a knowledge base, and reason over multiple turns before returning a verdict. Use it when a single-shot judge cannot decide on its own because the eval needs external evidence or multi-step verification.
+
+**Requires a judge model.** Tool or MCP connectors, and optionally a knowledge base, must also be configured for the evaluator to use.
+
+**Returns**: a result (pass/fail, score, or category) and a plain-language **reason** that can cite the tools and sources consulted during the run.
+
+**Examples**: Custom evals authored as agent evaluators, fact verification with web lookup, knowledge-base-grounded compliance checks.
+
+**Best for:**
+- Fact verification that requires up-to-date or external information
+- Multi-step policy or compliance checks that a single prompt cannot express
+- Evals that should ground judgment in a curated knowledge base
+- Higher-confidence judgments where accuracy outweighs speed and cost
+
+---
+
 ## Modality
 
-In addition to the four types above, evals also vary by the kind of input they accept:
+In addition to the five methods above, evals also vary by the kind of input they accept:
 
 | Modality | What it evaluates | Example evals |
 |---|---|---|
@@ -105,18 +155,19 @@ Multimodal evals (image, audio, conversation) require a judge model that support
 
 ## Quick reference
 
-| Type | Judge model required | Returns reason | No API key possible |
-|---|---|---|---|
-| LLM as Judge | Yes | Yes | No |
-| Deterministic | No | No | Yes |
-| Statistical Metric | No (most) | No | Yes (most) |
-| LLM as Ranker | Yes | No | No |
+| Type | Canonical type | Judge model required | Tools / KB | Returns reason | No API key possible |
+|---|---|---|---|---|---|
+| LLM as Judge | llm | Yes | No | Yes | No |
+| LLM as Ranker | llm | Yes | No | No | No |
+| Agent as Judge | agent | Yes | Yes | Yes | No |
+| Deterministic | code | No | No | No | Yes |
+| Statistical Metric | code | No (most) | No | No | Yes (most) |
 
 ---
 
 ## Next steps
 
 - [Built-in evals](/docs/evaluation/builtin): Full list with evaluation method and required inputs for each template.
-- [Create custom evals](/docs/evaluation/features/custom): Custom evals always use LLM as Judge.
-- [Judge models](/docs/evaluation/concepts/judge-models): Choose the right model for LLM as Judge and LLM as Ranker evals.
+- [Create custom evals](/docs/evaluation/features/custom): Custom evals can be authored as LLM as Judge or Agent as Judge.
+- [Judge models](/docs/evaluation/concepts/judge-models): Choose the right model for LLM as Judge, LLM as Ranker, and Agent as Judge evals.
 - [Eval groups](/docs/evaluation/features/groups): Combine different eval types and run them together in one pass.