From d93c775b55f76affcc40a98917722e90a74b0863 Mon Sep 17 00:00:00 2001
From: Suhani Nagpal
Date: Thu, 7 May 2026 14:46:16 +0530
Subject: [PATCH] docs(evaluation): add Agent as Judge type to eval-types page

---
 .../docs/evaluation/concepts/eval-types.mdx | 75 +++++++++++++++++++++----
 1 file changed, 63 insertions(+), 12 deletions(-)

diff --git a/src/pages/docs/evaluation/concepts/eval-types.mdx b/src/pages/docs/evaluation/concepts/eval-types.mdx
index 0b015696..b81a0a1a 100644
--- a/src/pages/docs/evaluation/concepts/eval-types.mdx
+++ b/src/pages/docs/evaluation/concepts/eval-types.mdx
@@ -1,11 +1,13 @@
 ---
 title: "Eval Types"
-description: "The four evaluation methods in Future AGI: LLM as Judge, Deterministic, Statistical Metric, and LLM as Ranker, and how modality affects which ones apply."
+description: "The five evaluation methods in Future AGI: LLM as Judge, LLM as Ranker, Agent as Judge, Deterministic, and Statistical Metric, and how modality affects which ones apply."
 ---
 
 ## About
 
-Every eval template in Future AGI uses one of four evaluation methods to produce a result. The method determines how the eval computes its output, whether a judge model is required, and what kind of result to expect. Choosing the right type for your use case gives you the right balance of accuracy, speed, and cost.
+Every eval template in Future AGI uses one of five evaluation methods to produce a result. The method determines how the eval computes its output, whether a judge model is required, and what kind of result to expect. Choosing the right type for your use case gives you the right balance of accuracy, speed, and cost.
+
+Internally, these five methods collapse to three canonical types used by the API and the database: `llm` (LLM as Judge, LLM as Ranker), `code` (Deterministic and Statistical Metric), and `agent` (Agent as Judge).
 
 ---
 
@@ -35,7 +37,15 @@ Computed directly from the text using code or string logic. No model is called a
 
 **Returns**: pass/fail only. No reason field.
 
-**Examples**: Is JSON, Is Email, Contains Valid Link, No Invalid Links, One Line.
+**Examples**:
+
+| Category | Templates |
+|---|---|
+| Format validation | Is JSON, Is Email, Is Code, Is URL, JSON Schema, JSON Validation |
+| Substring checks | Contains, Contains Any, Contains All, Contains None, Starts With, Ends With, Equals |
+| Length and shape | Length Greater Than, Length Less Than, Length Between, Word Count In Range, One Line |
+| Link validation | Contains Valid Link, No Invalid Links |
+| Pattern matching | Regex |
 
 **Best for:**
 - Format validation (valid JSON, email address, URL presence)
@@ -59,11 +69,33 @@ Computes a numeric score using an algorithm applied to the output and a referenc
 | Levenshtein Similarity | Character edit distance between output and reference |
 | Numeric Similarity | Numerical difference between output and reference |
 | Embedding Similarity | Semantic vector similarity between output and reference |
+| Fuzzy Match | Approximate string match against an expected answer |
+| Ground Truth Match | Whether the output matches a reference ground truth |
 | Semantic List Contains | Whether output contains phrases semantically similar to a reference list |
 | Recall@K, Precision@K, NDCG@K, MRR, Hit Rate | Retrieval quality for RAG pipelines |
 | FID Score | Distribution similarity between sets of real and generated images |
 | CLIP Score | Alignment between an image and its text description |
 
+Most statistical metrics require a reference value (a ground-truth answer, a target list, or a relevance label set). Provide it through the eval config when running the eval.
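+
+To make this concrete, here is a minimal standalone sketch of how a Levenshtein-style similarity between an output and a reference can be computed. It illustrates the metric, not the platform's implementation:
+
+```python
+def levenshtein_similarity(output: str, reference: str) -> float:
+    """Illustrative only: normalised similarity in [0, 1]; 1.0 = identical."""
+    m, n = len(output), len(reference)
+    prev = list(range(n + 1))  # edit distances against the empty output prefix
+    for i in range(1, m + 1):
+        curr = [i] + [0] * n
+        for j in range(1, n + 1):
+            cost = 0 if output[i - 1] == reference[j - 1] else 1
+            curr[j] = min(prev[j] + 1,          # deletion
+                          curr[j - 1] + 1,      # insertion
+                          prev[j - 1] + cost)   # substitution
+        prev = curr
+    return 1.0 - prev[n] / max(m, n, 1)
+```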
+
 **Best for:**
 - Benchmarking against a ground-truth reference answer
 - RAG retrieval quality (recall, precision, ranking)
@@ -88,9 +120,27 @@ A variant of LLM as Judge where instead of scoring a single response, the model
 
 ---
 
+## Agent as Judge
+
+A specialised evaluation agent runs an iterative loop instead of a single LLM call. It can call tools through MCP connectors, search the web, retrieve from a knowledge base, and reason over multiple turns before returning a verdict. Use it when a single-shot judge cannot decide on its own because the eval needs external evidence or multi-step verification.
+
+**Requires a judge model.** Tool or MCP connectors, and optionally a knowledge base, must also be configured for the evaluator to use.
+
+**Returns**: a result (pass/fail, score, or category) and a plain-language **reason** that can cite the tools and sources consulted during the run.
+
+**Examples**: Custom evals authored as agent evaluators, fact verification with web lookup, knowledge-base-grounded compliance checks.
+
+**Best for:**
+- Fact verification that requires up-to-date or external information
+- Multi-step policy or compliance checks that a single prompt cannot express
+- Evals that should ground judgment in a curated knowledge base
+- Higher-confidence judgments where accuracy outweighs speed and cost
+
+---
+
 ## Modality
 
-In addition to the four types above, evals also vary by the kind of input they accept:
+In addition to the five methods above, evals also vary by the kind of input they accept:
 
 | Modality | What it evaluates | Example evals |
 |---|---|---|
@@ -105,18 +155,19 @@ Multimodal evals (image, audio, conversation) require a judge model that support
 
 ## Quick reference
 
-| Type | Judge model required | Returns reason | No API key possible |
-|---|---|---|---|
-| LLM as Judge | Yes | Yes | No |
-| Deterministic | No | No | Yes |
-| Statistical Metric | No (most) | No | Yes (most) |
-| LLM as Ranker | Yes | No | No |
+| Type | Canonical type | Judge model required | Tools / KB | Returns reason | No API key possible |
+|---|---|---|---|---|---|
+| LLM as Judge | llm | Yes | No | Yes | No |
+| LLM as Ranker | llm | Yes | No | No | No |
+| Agent as Judge | agent | Yes | Yes | Yes | No |
+| Deterministic | code | No | No | No | Yes |
+| Statistical Metric | code | No (most) | No | No | Yes (most) |
 
 ---
 
 ## Next steps
 
 - [Built-in evals](/docs/evaluation/builtin): Full list with evaluation method and required inputs for each template.
-- [Create custom evals](/docs/evaluation/features/custom): Custom evals always use LLM as Judge.
-- [Judge models](/docs/evaluation/concepts/judge-models): Choose the right model for LLM as Judge and LLM as Ranker evals.
+- [Create custom evals](/docs/evaluation/features/custom): Custom evals can be authored as LLM as Judge or Agent as Judge.
+- [Judge models](/docs/evaluation/concepts/judge-models): Choose the right model for LLM as Judge, LLM as Ranker, and Agent as Judge evals.
 - [Eval groups](/docs/evaluation/features/groups): Combine different eval types and run them together in one pass.