feat(evaluation): migrate evaluation harnesses from playground by miguelgfierro · Pull Request #279 · fireflyframework/fireflyframework-agentic

miguelgfierro · 2026-06-19T07:14:34Z

Problem

The first migration of evaluation code into fireflyframework_agentic/evaluation/ brought over too much infrastructure alongside the metrics: a CLI (flyeval), a five-gate framework (G1–G5), a registry/corpus/matcher pipeline, a scorecard renderer, champion persistence, run-config snapshotting, and statistical helpers. The actual value — the LLM judge metric functions — was buried under 10 supporting files.

The goal of this PR (revised) is to gut that infrastructure and keep only the measurement code.

What We Keep

All metric functions from both evaluation systems:

Flyradar G4:

[D] deterministic: source_coverage, excerpt_fill_rate
[E] embedding-based: semantic_recovery
[J] LLM judge: faithfulness, numeric_temporal_fidelity, citation_relevance, nc_semantic_precision, fabricated_entity, contradiction, open_gap, actionability, severity_calibration, answer_relevancy, surface_deduplication, comparative_vs_champion (champion passed as optional parameter — no persistence)

Flycanon:

Custom: contains_answer, addresses_question (median of N LLM calls per item)
RAGAS: answer_correctness, answer_relevancy, faithfulness, context_recall, context_precision

Retrieval (lab/retrieval_metrics.py): unchanged.

What We Delete

From fireflyframework_agentic/evaluation/:

| File | Reason |
|---|---|
| `cli.py` | `flyeval` CLI: experiment orchestration, not measurement |
| `gates.py` | G1–G5 gate framework: pipeline infrastructure |
| `corpus.py` | Corpus loader: pipeline infrastructure |
| `registry.py` | Registry management: pipeline infrastructure |
| `matcher.py` | Anchored matching utilities: pipeline infrastructure |
| `scorecard.py` | Scorecard renderer: reporting, not measurement |
| `run_config_snapshot.py` | Run config capture: pipeline infrastructure |
| `models.py` | `EvalConfig`, `GateVerdict`: only used by deleted files |
| `stats.py` | `aa_band`, `aggregate_grounding`: only used by deleted files |
| `champion.py` | Champion persistence; `comparative_vs_champion` accepts champion data as a parameter instead |

Tests for deleted modules also removed: test_champion.py, test_gates.py, test_matcher.py, test_stats.py.

Target Package Layout

fireflyframework_agentic/evaluation/
├── __init__.py       # exports: EvalContext, AdvisoryReport, all metric functions
├── judge_client.py   # JudgeClient — async LLM scoring client (httpx.AsyncClient)
└── judge.py          # ALL metric functions + EvalContext + AdvisoryReport

Three files. No CLI. No gates. No registry.

Unified Interface

Every metric — flyradar [D], [E], [J], flycanon custom, and RAGAS — shares the same async signature:

async def metric_name(item: dict, ctx: EvalContext) -> float | None

item is a plain dict with a normalized schema:

{
    "question": str,
    "answer": str,
    "reference": str,
    "contexts": list[str],
    # flyradar extras (optional):
    "sources": list[str],
    "excerpts": list[str],
    # for comparative_vs_champion (optional):
    "champion_answer": str | None,
}
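The schema above could also be written down as a `TypedDict`, so a type checker flags misspelled keys while callers keep passing plain dicts. This is only a sketch: `EvalItem` and `missing_required` are hypothetical names that do not appear in the PR, and treating `question`, `answer`, `reference`, and `contexts` as the required keys is an assumption read off the comments in the schema.

```python
from typing import List, Optional, TypedDict


class _RequiredKeys(TypedDict):
    question: str
    answer: str
    reference: str
    contexts: List[str]


class EvalItem(_RequiredKeys, total=False):
    # flyradar extras (optional)
    sources: List[str]
    excerpts: List[str]
    # only read by comparative_vs_champion (optional)
    champion_answer: Optional[str]


def missing_required(item: dict) -> List[str]:
    """Return the assumed-required keys absent from a plain-dict item, in schema order."""
    return [k for k in ("question", "answer", "reference", "contexts") if k not in item]
```

A harness could call `missing_required(item)` before scoring and skip or report items that come back non-empty.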

EvalContext is a Pydantic model carrying all dependencies:

from pydantic import BaseModel, ConfigDict

class EvalContext(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    client: JudgeClient                     # async LLM call client (all metrics)
    embedder: OllamaEmbedder | None = None  # [E] metrics + RAGAS embeddings
    runs: int = 3                           # flycanon multi-run median

No ragas_llm / ragas_embeddings — RAGAS metrics wrap ctx.client and ctx.embedder in LangChain adapters internally. Callers see one client, one embedder.

Composable type alias:

Metric = Callable[[dict, EvalContext], Awaitable[float | None]]

Example:

ctx = EvalContext(client=JudgeClient(model="claude-sonnet-4-6", api_key=KEY))
metrics: list[Metric] = [faithfulness, contains_answer, answer_correctness]
scores = await asyncio.gather(*[m(item, ctx) for m in metrics])
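The composition pattern in the example can be tried without an API key. The sketch below swaps in stand-ins: `StubContext`, `contains_reference`, `needs_contexts`, and `score_item` are hypothetical names, and the two toy metrics are deterministic placeholders rather than the PR's LLM-backed metrics. It shows only that any callable matching the shared signature composes through `asyncio.gather`.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional


@dataclass
class StubContext:
    """Stand-in for EvalContext; the real one carries a JudgeClient."""
    runs: int = 1


Metric = Callable[[dict, StubContext], Awaitable[Optional[float]]]


async def contains_reference(item: dict, ctx: StubContext) -> Optional[float]:
    # Toy substring check, not the PR's contains_answer (which asks an LLM).
    return 1.0 if item["reference"].lower() in item["answer"].lower() else 0.0


async def needs_contexts(item: dict, ctx: StubContext) -> Optional[float]:
    # Returns None when the metric cannot be computed for this item.
    if not item.get("contexts"):
        return None
    return 1.0


async def score_item(item: dict, ctx: StubContext, metrics: List[Metric]) -> dict:
    # gather preserves input order, so scores line up with the metric list.
    scores = await asyncio.gather(*[m(item, ctx) for m in metrics])
    return {m.__name__: s for m, s in zip(metrics, scores)}
```

Calling `asyncio.run(score_item(item, StubContext(), [contains_reference, needs_contexts]))` returns one entry per metric, keyed by function name.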

`judge_client.py`

Contains only JudgeClient — a thin async HTTP client for Anthropic/OpenAI/Ollama scoring calls:

class JudgeClient:
    async def chat_json(self, system: str, user: str, max_tokens: int = 200) -> dict: ...

Uses httpx.AsyncClient. Handles 429/5xx retry with Retry-After parsing. No embedding logic — embeddings come from embeddings/providers/ollama.py (existing async OllamaEmbedder). cosine_similarity imported from embeddings/similarity.py.
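For readers unfamiliar with Retry-After handling: the header may carry either a number of seconds or an HTTP date. The helper below is a standard-library sketch of that parsing step, not the code in `judge_client.py`; the name `retry_after_seconds` and the choice to return `None` for unparseable values are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value to a non-negative delay in seconds.

    Returns None when the header is absent or unparseable, so the caller can
    fall back to its own backoff schedule.
    """
    if header is None:
        return None
    value = header.strip()
    if value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry immediately", never a negative sleep.
    return max(0.0, (when - now).total_seconds())
```

A retry loop would sleep for the returned delay on a 429 or 5xx response and use exponential backoff when the helper returns `None`.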

Dependencies

evaluation optional extra changes:

Remove: scipy (only used by deleted stats.py)
Add: ragas, langchain-anthropic, langchain-ollama
Keep: numpy

No changes to embeddings/ — existing async OllamaEmbedder used as-is.

Test plan

pytest tests/unit/evaluation/test_judge.py tests/unit/lab/test_retrieval_metrics.py — all passing
Each metric callable independently with a mocked EvalContext
asyncio.gather(*[m(item, ctx) for m in metrics]) composes correctly across families
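The "mocked EvalContext" item in the plan can be illustrated with the standard library alone. In this sketch `toy_judge_metric` and `run_with_mock` are hypothetical, the context is a `SimpleNamespace` rather than the real Pydantic model, and the `{"score": ...}` reply shape is an assumption; only the `chat_json(system, user)` call mirrors the signature given in the description.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock


async def toy_judge_metric(item: dict, ctx) -> float:
    # Hypothetical metric: asks the judge for {"score": ...} and returns it.
    reply = await ctx.client.chat_json(system="Score 0-1.", user=item["answer"])
    return float(reply["score"])


def run_with_mock(score: float) -> float:
    # AsyncMock makes chat_json awaitable without any network access.
    client = AsyncMock()
    client.chat_json.return_value = {"score": score}
    ctx = SimpleNamespace(client=client, embedder=None)
    result = asyncio.run(toy_judge_metric({"answer": "Paris"}, ctx))
    # Verify the metric awaited the judge exactly once.
    client.chat_json.assert_awaited_once()
    return result
```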

…try point (#268)

* feat(evaluation): add evaluation subpackage __init__ with gate/champion/judge/retrieval exports
* feat(evaluation): add EvalConfig and GateVerdict models
* feat(evaluation): add evaluation optional-deps and flyeval CLI entry point to pyproject.toml
* feat(evaluation): note evaluation as optional subpackage in top-level __init__ docstring

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add matcher primitives (anchored, matches, source_stem, tokens)
* feat(evaluation): add statistics helpers (aa_band, aggregate_grounding, left_skew_flag)
* feat(evaluation): export matcher and stats primitives from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add corpus loader and evidence verification module
* feat(evaluation): add lean-1 registry loader and RegistryItem/Registry models
* feat(evaluation): re-export corpus and registry symbols from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add G1-G5 gate framework (GateResult, run_gates, g2_recall_precision)
* feat(evaluation): export g2_recall_precision from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add scorecard renderer
* feat(evaluation): export render_scorecard, verdict, VERDICT_PROMOTE/HOLD from scorecard module

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add JudgeClient and OllamaEmbedder (judge_client.py)
* feat(evaluation): add AdvisoryReport and run_judge with [D]/[E]/[J] metric families (judge.py)
* feat(evaluation): import cosine from judge_client in matcher.py
* feat(evaluation): export JudgeClient, OllamaEmbedder, build_embedder, cosine from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add ChampionRecord and champion management functions
* feat(evaluation): add run_config_snapshot for flyradar run configuration capture
* feat(evaluation): add flyeval CLI with gate, aa-band, day-zero, invalidate subcommands

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

)

* feat(lab): add retrieval_metrics module with compute_retrieval_metrics and RetrieverMetrics
* feat(lab): export RetrieverMetrics and compute_retrieval_metrics from lab package
* feat(evaluation): import RetrieverMetrics and compute_retrieval_metrics from lab.retrieval_metrics

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add flyradar gate evaluation example
* feat(evaluation): add flycanon RAG retrieval evaluation example

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

… metrics (#277)

* feat(evaluation): add tests/unit/evaluation package init
* feat(evaluation): add unit tests for matcher (anchored, source_stem, tokens, matches)
* feat(evaluation): add unit tests for stats (aa_band, aggregate_grounding, left_skew_flag)
* feat(evaluation): add unit tests for gates (GateResult, verdict, render_scorecard, g5_no_regression)
* feat(evaluation): add unit tests for champion (ChampionRecord, load/save/invalidate, input_hash)
* feat(evaluation): add unit tests for retrieval_metrics (compute_retrieval_metrics, RetrieverMetrics)
* feat(evaluation): fix boundary test for left_skew_flag (floating-point precision)
* feat(evaluation): fix no_answer_rate test to match implementation behaviour

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add evaluation package documentation
* docs(evaluation): mention evaluation subpackage in README

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

fix(evaluation): resolve PR gate failures (lint, CI extras, remove flyradar example)

The top-level docstring still described the deleted gate/champion/challenger infrastructure. Correct it to match the shipped surface: LLM-as-judge metrics, RAGAS, and retrieval metrics.

Mirror flycanon's embedding_service factory: add build_embedder(spec) resolving a '<provider>:<model>' spec to a fireflyframework_agentic embedder (8 providers, deferred per-provider imports). Widen EvalContext.embedder to BaseEmbedder and feed it into RAGAS via LangchainEmbeddingsWrapper, so the evaluator embeds with the same provider as the pipeline. Removes the broken AnthropicEmbeddings branch. Rename _make_ragas_embeddings -> _build_embeddings to decouple the name from RAGAS for future refactoring.
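The '<provider>:<model>' convention has one subtlety worth spelling out: model names can contain colons themselves. The helper below is a sketch of the parsing step only; `split_spec` is a hypothetical name, and both the first-colon split and the lower-casing of the provider are assumptions about how `build_embedder` might behave, not a description of it.

```python
from typing import Tuple


def split_spec(spec: str) -> Tuple[str, str]:
    """Split '<provider>:<model>' on the first colon only.

    Model names may themselves contain colons (Ollama tags such as
    'nomic-embed-text:latest'), so everything after the first colon is the model.
    """
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>:<model>', got {spec!r}")
    # Lower-casing the provider is an assumption made for this sketch.
    return provider.lower(), model
```

A factory would then dispatch on the returned provider and import that provider's embedder only inside the matching branch, which is what "deferred per-provider imports" suggests.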

The azure provider was grouped with openai and built a public-OpenAI ChatOpenAI client (api.openai.com + OPENAI_API_KEY), sending the azure deployment name as an OpenAI model id. Split azure out to AzureChatOpenAI using AZURE_OPENAI_ENDPOINT/AZURE_OPENAI_API_KEY/AZURE_OPENAI_API_VERSION, mirroring judge_client._azure, and add langchain-openai to the [evaluation] extra so the openai/azure paths import.

Drop the dead 'calibrated' field (only ever set to False, never read) and the 'details' field (never written or read), and rewrite the docstring to remove the 'G4 output / GateResult' gate-era framing. Keeps the live fields: judge_model, same_provider_caveat, runs, metrics, errors.

Revert the top-level __init__.py docstring addition: it duplicated the README and docs/evaluation.md, pulled in unrelated lab/experiments, and already went stale (it described the deleted gates). This leaves the package root untouched by the PR.

Replace the hand-rolled multi-provider httpx client with the framework's FireflyAgent (pydantic-ai, a core dep). JudgeClient.chat_json(system,user)->dict becomes judge(system,user,output_type)->validated pydantic model; each of the 13 call-shapes gets a typed output model, so the LLM's structured output is schema-checked instead of parsed via _first_json_object. Agents are built lazily and cached per (system, output_type, max_tokens); temperature pinned to 0.0; retries handled by FireflyAgent/pydantic-ai. Deletes the bespoke _anthropic/_openai/_azure/_ollama methods, _first_json_object, _env, and _coerce_float. Fixes the _gather_chat bug: failed judge calls no longer collapse to {} and get scored as verdicts; they now propagate and are recorded in report.errors (new _judge_all). Adds tests for agent caching, failure propagation, and the previously-untested run_judge orchestrator.
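Two ideas in this commit message, the per-key agent cache and the failure handling, can be sketched in isolation. The code below is not the PR's implementation: `AgentCache`, `judge_all`, and the injected `factory` are hypothetical stand-ins for whatever builds a FireflyAgent, and the error-string format is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple


class AgentCache:
    """Build one agent per (system, output_type, max_tokens) and reuse it."""

    def __init__(self, factory: Callable[[str, type, int], Any]) -> None:
        self._factory = factory
        self._agents: Dict[Tuple[str, type, int], Any] = {}

    def get(self, system: str, output_type: type, max_tokens: int = 1024) -> Any:
        key = (system, output_type, max_tokens)
        if key not in self._agents:  # lazy: built on first use only
            self._agents[key] = self._factory(system, output_type, max_tokens)
        return self._agents[key]


async def judge_all(calls: Dict[str, Awaitable[Any]], errors: List[str]) -> Dict[str, Any]:
    """Await every call; keep successes and record failures instead of
    turning them into empty verdicts."""
    names = list(calls)
    results = await asyncio.gather(*calls.values(), return_exceptions=True)
    ok: Dict[str, Any] = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            errors.append(f"{name}: {result}")
        else:
            ok[name] = result
    return ok
```

The point of the second function is that a failed call never reaches the scoring code, so it cannot be mistaken for a verdict of "nothing found".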

Align the run_judge output DTO with the framework convention — *Result/*Report types (EvalReport, EvalResult, BenchmarkResult, PipelineResult, ...) and the module's own EvalContext are all pydantic BaseModel, leaving AdvisoryReport the lone dataclass. Switching gains free model_dump_json() for logging/persistence at no cost (internal output, mutated in place).

After the FireflyAgent refactor the client shrank to ~90 lines and is used only by judge.py. Fold JudgeClient + parse_model + same_provider into judge.py and drop the separate file — no standalone transport to justify it anymore. Public imports are unchanged (still re-exported from the package).

Replace the 35 one-symbol 'from X import (Y as Y)' re-export blocks with three grouped imports and an explicit __all__, matching the agents/__init__ convention. __all__ marks the public re-exports so ruff doesn't flag them as unused.

Switch the example default and doc snippets from the pinned claude-haiku-4-5-20251001 to the floating claude-haiku-4-5 alias.

…t None

_dedup now renumbers the surviving one-per-source entries to contiguous 1..N document positions; previously multi-chunk-per-source inputs left rank gaps that made recall/precision/nDCG/MRR/MAP understate (gold behind same-source chunks was scored at its chunk rank, not its document rank).

no_answer_rate now coalesces a present-but-None 'answer' instead of calling None.strip().
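A small worked example makes the rank-gap problem concrete. The sketch below assumes each retrieved hit is a dict with `source` and `rank` keys; that shape, and the names `dedup_by_source`, `reciprocal_rank`, and `is_no_answer`, are illustrative assumptions rather than the module's actual API.

```python
from typing import Dict, List


def dedup_by_source(hits: List[Dict]) -> List[Dict]:
    """Keep the best-ranked chunk per source and renumber ranks to 1..N."""
    seen = set()
    docs: List[Dict] = []
    for hit in sorted(hits, key=lambda h: h["rank"]):
        if hit["source"] in seen:
            continue
        seen.add(hit["source"])
        # Document rank = position among surviving sources, not the chunk rank.
        docs.append({**hit, "rank": len(docs) + 1})
    return docs


def reciprocal_rank(docs: List[Dict], gold_source: str) -> float:
    """1/rank of the gold source, or 0.0 when it was not retrieved."""
    for doc in docs:
        if doc["source"] == gold_source:
            return 1.0 / doc["rank"]
    return 0.0


def is_no_answer(item: Dict) -> bool:
    # `or ""` also covers a key that is present but None.
    return not (item.get("answer") or "").strip()
```

With three chunks from `a.pdf` ranked 1 to 3 and the gold `b.pdf` at chunk rank 4, the gold document is the second distinct source, so its reciprocal rank is 1/2 after renumbering rather than the 1/4 implied by the stale chunk rank.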

…n/runs

run_judge gains metrics='all'|'basic'|'process_mining' so callers pick the domain-agnostic LLM/RAG answer-quality metrics, the flyradar discovery-report metrics, or both. Families exported as BASIC_METRICS / PROCESS_MINING_METRICS.

Also fix the median-of-runs bug: temperature is pinned to 0.0, so re-running each RAG metric ctx.runs times de-noised nothing and doubled calls. contains_answer / addresses_question now score once; the dead _median_runs/_numeric_leaves/_set_leaf helpers, EvalContext.runs, AdvisoryReport.runs, and the example --runs flag are removed.
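The family selector can be pictured as a lookup from the `metrics` argument to a tuple of metrics. The sketch below is illustrative only: `select_metrics` is a hypothetical helper, the tuples hold metric names instead of callables, their membership is a placeholder (the commit message does not list what each family contains), and defaulting to 'all' is an assumption.

```python
from typing import Dict, Tuple

# Placeholder family contents; the PR's BASIC_METRICS / PROCESS_MINING_METRICS
# hold metric callables, and their exact membership is not listed here.
BASIC: Tuple[str, ...] = ("answer_correctness", "contains_answer")
PROCESS_MINING: Tuple[str, ...] = ("source_coverage", "fabricated_entity")

_FAMILIES: Dict[str, Tuple[str, ...]] = {
    "basic": BASIC,
    "process_mining": PROCESS_MINING,
    "all": BASIC + PROCESS_MINING,
}


def select_metrics(metrics: str = "all") -> Tuple[str, ...]:
    """Resolve a family name to its metrics, rejecting unknown names early."""
    try:
        return _FAMILIES[metrics]
    except KeyError:
        raise ValueError(f"metrics must be one of {sorted(_FAMILIES)}, got {metrics!r}") from None
```

Failing fast on an unknown family name avoids silently running zero metrics and reporting an empty result.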

…ight)

Rewrite semantic_recovery's docstring/comments around the lexical -> vector -> hybrid framing and rename its output keys for clarity: recovered_recall -> hybrid_recall, recovered -> vector_recovered (lexical_recall unchanged). Drop the stale registry/nc_items transition comments.

PR-gate fixes: apply ruff format (judge.py/test_judge.py/example were unformatted), and move the pyright type:ignore onto the AzureChatOpenAI api_key argument line so the SecretStr arg-type error is actually suppressed (the call spans lines, so the opening-line ignore did not cover it).

Both groups are LLM-as-judge; relabel the catalog by who writes the rubric — 'Custom-rubric judge' (our prompts via ctx.client) vs 'RAGAS library' (the ragas package's standardized metrics, some + embeddings) — and add a framing note.

Expand both docstrings and add a comparison table + guidance to docs/evaluation.md: finding-level entailment against own citations (tally) vs claim-level grounding of a RAG answer against retrieved contexts (normalized float).

…xtras

Every metric now returns {"score": float | None, **extra} via a _scored() helper, so results compare apples-to-apples: read result["score"] for the headline number and the remaining keys for the breakdown. Derived a score where one is natural (recall/precision/calibration rates, hybrid recall, dedup rate); score is None for pure defect counts (fabricated_entity, contradiction) and free-text/structured probes (open_gap, comparative_vs_champion). RAG Q&A and RAGAS metrics now return {score} instead of a bare float. Updated tests and the example accordingly.
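The uniform return shape is easy to picture with a few lines of code. This is a sketch of the convention, not the PR's `_scored()` helper: `scored`, `recall_result`, and `defect_count_result` are hypothetical, and the extra keys (`found`, `total`, `defects`) are made up for illustration.

```python
from typing import Any, Dict, Optional


def scored(score: Optional[float], **extra: Any) -> Dict[str, Any]:
    """Put the headline number under 'score' and the breakdown beside it."""
    return {"score": score, **extra}


def recall_result(found: int, total: int) -> Dict[str, Any]:
    # A rate has a natural score; guard the empty denominator with None.
    return scored(found / total if total else None, found=found, total=total)


def defect_count_result(defects: int) -> Dict[str, Any]:
    # Pure defect counts carry no score, only the tally.
    return scored(None, defects=defects)
```

Because every result has a `score` key, a caller can aggregate with `[r["score"] for r in results if r["score"] is not None]` without knowing which metric produced each entry.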

miguelgfierro and others added 11 commits June 18, 2026 23:33

feat(evaluation): add scorecard renderer (#272)

d964ba1

* feat(evaluation): add scorecard renderer
* feat(evaluation): export render_scorecard, verdict, VERDICT_PROMOTE/HOLD from scorecard module

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

feat(examples): add flyradar and flycanon evaluation examples (#276)

0acac37

* feat(evaluation): add flyradar gate evaluation example
* feat(evaluation): add flycanon RAG retrieval evaluation example

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

docs(evaluation): add evaluation package documentation (#278)

f79439b

* feat(evaluation): add evaluation package documentation
* docs(evaluation): mention evaluation subpackage in README

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

github-code-quality Bot found potential problems Jun 19, 2026

miguelgfierro added 3 commits June 19, 2026 09:24

remove examples/flyradar_eval_example.py

a1d28a5

ci: add --extra evaluation to typecheck and test sync steps

6161718

fix(evaluation): resolve all ruff lint errors (import sort, SIM108, B905, N806, UP035)

203134c

miguelgfierro mentioned this pull request Jun 19, 2026

fix(evaluation): resolve PR gate failures (lint, CI extras, remove flyradar example) #280

Merged

Merge pull request #280 from fireflyframework/fix/eval-ci-gate

ceaba78

fix(evaluation): resolve PR gate failures (lint, CI extras, remove flyradar example)

github-code-quality Bot found potential problems Jun 19, 2026

Comment thread fireflyframework_agentic/evaluation/corpus.py Fixed

miguelgfierro added 12 commits June 19, 2026 13:18

chore(evaluation): delete cli.py

9c3555d

chore(evaluation): delete gates.py

e9fd965

chore(evaluation): delete corpus.py

38c3f60

chore(evaluation): delete registry.py

f819923

chore(evaluation): delete matcher.py

3bc0786

chore(evaluation): delete scorecard.py

9c43a32

chore(evaluation): delete run_config_snapshot.py

a3673b5

chore(evaluation): delete models.py

a51115e

chore(evaluation): delete stats.py

5074d14

chore(evaluation): delete champion.py

8716be9

chore(evaluation): delete test_champion.py

5c8fe8e

chore(evaluation): delete test_gates.py

fdc0277

miguelgfierro added 14 commits June 29, 2026 10:57

docs: fix stale evaluation subpackage description in package docstring

9c78ae5

The top-level docstring still described the deleted gate/champion/challenger infrastructure. Correct it to match the shipped surface: LLM-as-judge metrics, RAGAS, and retrieval metrics.

docs(evaluation): use claude-haiku-4-5 alias in example and guide

d8a48d5

Switch the example default and doc snippets from the pinned claude-haiku-4-5-20251001 to the floating claude-haiku-4-5 alias.

Merge remote-tracking branch 'origin/main' into feat/evaluation-framework

9098d92

# Conflicts:
#   .github/workflows/pr-gate.yml
#   uv.lock

docs(evaluation): document metric-family selection; drop runs/median

60d9bf1

github-code-quality Bot found potential problems Jun 30, 2026

Comment thread tests/unit/evaluation/test_judge.py

client = MagicMock(spec=JudgeClient)

client.model_spec = "anthropic:claude-sonnet-4-6"

async def mock_judge(system, user, output_type, max_tokens=1024):

miguelgfierro added 7 commits June 30, 2026 08:54

docs(evaluation): order RAGAS library before custom-rubric RAG Q&A

09d664c

docs(evaluation): rename judge sub-groups to LLM-as-a-Judge (Custom rubric / RAGAS)

6db83d7

docs(evaluation): condense faithfulness vs ragas_faithfulness explanation

faed413

docs(evaluation): restore fuller faithfulness/ragas_faithfulness docstrings

c985264

Keep the condensed docs/evaluation.md note but bring the docstrings back to their detailed form.

miguelgfierro marked this pull request as ready for review June 30, 2026 08:15

miguelgfierro added 5 commits June 30, 2026 11:24

docs(evaluation): make faithfulness/ragas_faithfulness table rows self-distinguishing

d96278c

finding-level over own citations (tally) vs claim-level over retrieved contexts (float).

docs(evaluation): add return-shape convention note (float vs collection-audit dict)

ff77ada

docs(evaluation): document uniform {score, ...} return across the catalog

1609533

Update every Returns cell to lead with score, rewrite the return-shapes note for the uniform shape, fix the signature/quick-start/semantic_recovery/faithfulness notes, and drop the stale latency mention.

chore: bump version to 26.06.14

4dbadd9

miguelgfierro merged commit 5fd1217 into main Jun 30, 2026
9 checks passed

miguelgfierro deleted the feat/evaluation-framework branch June 30, 2026 10:56

miguelgfierro merged 91 commits into main from feat/evaluation-framework
