
docs(evaluation): revamp evals documentation for new eval system #648

Draft
KarthikAvinashFI wants to merge 5 commits into dev from karthikavinash/th-4638-evals-revamp-doc

Conversation


@KarthikAvinashFI commented on May 8, 2026

Summary

Brings the evaluation docs in line with the post-revamp platform. Replaces the old four-method taxonomy (LLM as Judge / Deterministic / Statistical Metric / LLM as Ranker) with what the UI actually shows today: Agents, LLM-As-A-Judge, and Code. Adds new concept and feature pages for things that were undocumented (composite evals, versioning, ground truth, error localization, test playground, data injection, output types in their new label-based form). Rewrites the trace and simulation eval guides around the actual Tasks and Create a Simulation flows.
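
A minimal sketch of how the three output types might be modeled, for orientation; every name here is hypothetical and does not come from the codebase:

```ts
// Hypothetical shapes for the three output types (Pass/fail, Scoring, Choices).
// Names are illustrative only; the platform's actual types may differ.
type PassFailResult = {
  kind: "pass_fail";
  passed: boolean;
  reason?: string;
};

type ScoringResult = {
  kind: "scoring";
  label: string; // label-based scoring, e.g. "Good" / "Poor" (assumed labels)
  score: number; // numeric score associated with the label
};

type ChoicesResult = {
  kind: "choices";
  choice: string; // one of a fixed set of configured choices
};

type EvalResult = PassFailResult | ScoringResult | ChoicesResult;
```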

Linear: TH-4638

What changed

Concepts (under evaluation/concepts/)

Rewritten: eval-types, eval-templates, eval-results, judge-models, understanding-evaluation.
New: output-types, data-injection, composite-evals, versioning.
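
To illustrate the composite-evals concept (a composite aggregates the results of its child evals), a hedged TypeScript sketch; the aggregation set and function names are assumptions, not the platform's API:

```ts
// Hypothetical aggregation over child eval scores for a composite eval.
// "mean" | "min" | "max" are assumed aggregation functions.
type Aggregation = "mean" | "min" | "max";

function aggregateChildren(childScores: number[], fn: Aggregation): number {
  if (childScores.length === 0) {
    throw new Error("composite eval has no child results");
  }
  switch (fn) {
    case "mean":
      return childScores.reduce((sum, s) => sum + s, 0) / childScores.length;
    case "min":
      return Math.min(...childScores);
    case "max":
      return Math.max(...childScores);
  }
}

// Example: three children scored 0.8, 0.6, and 1.0; the mean is ~0.8.
const compositeScore = aggregateChildren([0.8, 0.6, 1.0], "mean");
```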

Features (under evaluation/features/)

Rewritten: custom, evaluate.
New: test-playground, error-localization, ground-truth.
Minor: custom-models (added trace projects to the surfaces list).

Surface-specific eval guides (outside evaluation/)

`observe/features/evals` rewritten around the Tasks flow (Basic Info / Evaluations / Filters / Scheduling) and the Historical data / New incoming data run modes.
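
To make the documented shape concrete, a sketch of what a task configuration covers; the field names below are assumptions derived from the UI section names, not the API schema:

```ts
// Hypothetical task configuration mirroring the four UI sections
// (Basic Info / Evaluations / Filters / Scheduling) and the two run modes.
// Field names are illustrative; the real schema lives in the backend.
type RunMode = "historical" | "new_incoming"; // assumed identifiers

interface EvalTaskConfig {
  // Basic Info
  name: string;
  description?: string;
  // Evaluations: which evals the task runs against matching traces
  evaluationIds: string[];
  // Filters: narrow which traces the task applies to
  filters?: Record<string, string>;
  // Scheduling: run over existing traces, or as new traces arrive
  runMode: RunMode;
  schedule?: string; // e.g. a cron expression, if scheduled (assumed)
}
```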

`quickstart/running-evals-in-simulation` aligned with the 4-step Create a Simulation wizard (Add simulation details, Choose Scenario(s), Select Evaluations, Summary) and updated mapping fields.

Navigation

`src/lib/navigation.ts` updated to include the 4 new concept pages and 3 new feature pages in the sidebar.
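
For reviewers unfamiliar with the sidebar config, the additions look roughly like this; the surrounding structure is a sketch, and only the page slugs come from this PR:

```ts
// Sketch of the sidebar additions in src/lib/navigation.ts.
// The object shape is assumed; the slugs match the new pages in this PR.
const evaluationConcepts = [
  // ...existing concept pages...
  { title: "Output Types", href: "/docs/evaluation/concepts/output-types" },
  { title: "Data Injection", href: "/docs/evaluation/concepts/data-injection" },
  { title: "Composite Evals", href: "/docs/evaluation/concepts/composite-evals" },
  { title: "Versioning", href: "/docs/evaluation/concepts/versioning" },
];

const evaluationFeatures = [
  // ...existing feature pages...
  { title: "Test Playground", href: "/docs/evaluation/features/test-playground" },
  { title: "Error Localization", href: "/docs/evaluation/features/error-localization" },
  { title: "Ground Truth", href: "/docs/evaluation/features/ground-truth" },
];
```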

Removed

`eval-groups.mdx` and all references. The Groups feature is no longer reachable from the main UI navigation (`/dashboard/evaluations` renders `EvalsListView` directly, without the wrapper that has the Groups tab).

Style guide compliance

- All concept pages start with `## About`; no UI walkthrough screenshots in concept pages.
- All feature pages have one screenshot placeholder per major step: 25 placeholders in total, marked as `{/* SCREENSHOT NEEDED: ... */}` MDX comments. Run `grep -rn "SCREENSHOT NEEDED" src/pages/docs/` to list them.
- No em-dashes, no marketing language, no bold headings.
- Internal terms (AgentLoop, Falcon AI Loop, Temporal, Celery, RestrictedPython, nsjail, VLLM, and class names from `agentic_eval/` or `ee/`) do not appear in any doc.

Verification

Every concrete claim was cross-checked against the live frontend and backend:

- `EvalCreatePage.jsx`, `ModelSelector.jsx`, `OutputTypeConfig.jsx`, `TestPlayground.jsx`, `CompositeDetailPanel.jsx`, `EvalGroundTruthTab.jsx`, `EvalDetailPage.jsx`.
- `TaskConfigPanel.jsx`, `TaskSchedulingSection.jsx`, `EvalsTasksViewV2.jsx`, `TaskListView.jsx`.
- `CreateRunTestPage.jsx`, `TestEvaluationPage.jsx`, `RunTestsContent.jsx`.
- `DevelopBarRightSection.jsx`, `DevelopEvaluationDrawer.jsx`.
- `config-navigation.jsx`, `routes/sections/dashboard.jsx`, `ConfigNavData.jsx`.
- API schemas in `model_hub/types.py` and URL routes in `model_hub/urls.py`, `sdk/urls.py`, `tracer/urls.py`.

Test plan

- `pnpm audit-links`: passes (0 broken nav links, 0 broken content links).
- `pnpm build`: passes (all 18 docs render).
- Spot-check each new and rewritten page in `pnpm dev`.
- Click through each documented flow in the dashboard to confirm the steps still match the UI.
- Replace the 25 `{/* SCREENSHOT NEEDED: ... */}` placeholders with real screenshots before un-drafting.

Commits

Aligns the evaluation docs with the post-revamp platform: three eval
types (Agents / LLM-As-A-Judge / Code), three output types
(Pass/fail / Scoring / Choices), composite templates, versioning,
ground truth, error localization, and updated apply flows for
datasets, trace projects (now via Tasks), and simulation.

Concepts (rewritten / new):
- eval-types: 3-type taxonomy matching the create-page tabs
- eval-templates: built-in vs custom, single vs composite, versioning
- eval-results: result formats per output type
- judge-models: Turing models + bring-your-own
- understanding-evaluation: surfaces and how it all fits
- output-types (new): Pass/fail, Scoring (label-based), Choices
- data-injection (new): the six Context options
- composite-evals (new): aggregation functions and child axis
- versioning (new): Set as Default, Restore Version, pinning

Features (rewritten / new):
- custom: full create flow for all 3 types with field reference
- evaluate: dataset apply flow + SDK
- test-playground (new): four source modes, AI generate
- error-localization (new): toggle, run lifecycle, SDK
- ground-truth (new): upload, mapping, embedding statuses

Surface-specific updates:
- observe/features/evals: rewritten around the Tasks page flow
  (Basic Info / Evaluations / Filters / Scheduling)
- quickstart/running-evals-in-simulation: aligned with the
  4-step Create a Simulation wizard

Eval Groups was removed from docs as the feature is no longer
exposed in the main UI navigation.

TH-4638

Adds reference pages for built-in evals that were missing documentation
(deterministic, statistical, and agent-mode templates). Also fixes the
Detect Hallucination input requirement.

Adds rows for the freshly generated reference pages so users can find
them from the Built-in Evals catalog.
