Merged
91 commits
a2d6770
feat(evaluation): add evaluation subpackage skeleton and pyproject en…
miguelgfierro Jun 18, 2026
8676b6a
feat(evaluation): add matcher primitives and statistics helpers (#269)
miguelgfierro Jun 18, 2026
8eb2110
feat(evaluation): add corpus loader and registry modules (#270)
miguelgfierro Jun 18, 2026
ee64cfa
feat(evaluation): add G1-G5 gate framework (#271)
miguelgfierro Jun 18, 2026
d964ba1
feat(evaluation): add scorecard renderer (#272)
miguelgfierro Jun 18, 2026
09cfc34
feat(evaluation): add LLM-as-judge and judge client (#273)
miguelgfierro Jun 18, 2026
1906ede
feat(evaluation): add champion tracking and flyeval CLI (#274)
miguelgfierro Jun 18, 2026
4ab1d85
feat(lab): add retrieval metrics (hit@k, recall@k, MRR, MAP, nDCG) (#…
miguelgfierro Jun 18, 2026
0acac37
feat(examples): add flyradar and flycanon evaluation examples (#276)
miguelgfierro Jun 18, 2026
cc048cf
test(evaluation): add unit tests for evaluation package and retrieval…
miguelgfierro Jun 18, 2026
f79439b
docs(evaluation): add evaluation package documentation (#278)
miguelgfierro Jun 18, 2026
a1d28a5
remove examples/flyradar_eval_example.py
miguelgfierro Jun 19, 2026
6161718
ci: add --extra evaluation to typecheck and test sync steps
miguelgfierro Jun 19, 2026
203134c
fix(evaluation): resolve all ruff lint errors (import sort, SIM108, B…
miguelgfierro Jun 19, 2026
ceaba78
Merge pull request #280 from fireflyframework/fix/eval-ci-gate
miguelgfierro Jun 19, 2026
9c3555d
chore(evaluation): delete cli.py
miguelgfierro Jun 19, 2026
e9fd965
chore(evaluation): delete gates.py
miguelgfierro Jun 19, 2026
38c3f60
chore(evaluation): delete corpus.py
miguelgfierro Jun 19, 2026
f819923
chore(evaluation): delete registry.py
miguelgfierro Jun 19, 2026
3bc0786
chore(evaluation): delete matcher.py
miguelgfierro Jun 19, 2026
9c43a32
chore(evaluation): delete scorecard.py
miguelgfierro Jun 19, 2026
a3673b5
chore(evaluation): delete run_config_snapshot.py
miguelgfierro Jun 19, 2026
a51115e
chore(evaluation): delete models.py
miguelgfierro Jun 19, 2026
5074d14
chore(evaluation): delete stats.py
miguelgfierro Jun 19, 2026
8716be9
chore(evaluation): delete champion.py
miguelgfierro Jun 19, 2026
5c8fe8e
chore(evaluation): delete test_champion.py
miguelgfierro Jun 19, 2026
fdc0277
chore(evaluation): delete test_gates.py
miguelgfierro Jun 19, 2026
0732f85
chore(evaluation): delete test_matcher.py
miguelgfierro Jun 19, 2026
f769ef1
chore(evaluation): delete test_stats.py
miguelgfierro Jun 19, 2026
2516052
feat(evaluation): rewrite judge_client.py as async (httpx.AsyncClient)
miguelgfierro Jun 19, 2026
5609ab6
feat(evaluation): rewrite judge.py — async metrics + EvalContext + fl…
miguelgfierro Jun 19, 2026
7799185
feat(evaluation): slim __init__.py to 3-file exports
miguelgfierro Jun 19, 2026
9526f43
chore(evaluation): update pyproject.toml — drop scipy, add ragas deps…
miguelgfierro Jun 19, 2026
d567552
test(evaluation): add unit tests for judge.py metrics
miguelgfierro Jun 19, 2026
0dd9bac
chore: merge feat/evaluation-framework, keep simplification
miguelgfierro Jun 19, 2026
561f9b5
Merge pull request #282 from fireflyframework/feat/eval-simplification
miguelgfierro Jun 19, 2026
5646974
fix(lab): type-annotate out dict, remove quoted return type in retrie…
miguelgfierro Jun 19, 2026
582d1c0
fix(lab): remove unused import math, fix import sort in test_retrieva…
miguelgfierro Jun 19, 2026
3e62b1f
fix(evaluation): add type: ignore for pyright errors on RAGAS/langcha…
miguelgfierro Jun 19, 2026
a7e44d1
Merge pull request #283 from fireflyframework/chore/eval-ci-fixes
miguelgfierro Jun 19, 2026
6dd8575
Merge remote-tracking branch 'origin/main' into chore/sync-dev-with-main
miguelgfierro Jun 19, 2026
3679dbc
refactor(evaluation): move retrieval_metrics.py from lab/ to evaluation/
miguelgfierro Jun 19, 2026
6bce374
refactor(evaluation): update imports — retrieval_metrics now in evalu…
miguelgfierro Jun 19, 2026
9229c43
refactor(evaluation): move test_retrieval_metrics.py to tests/unit/ev…
miguelgfierro Jun 19, 2026
4d9353d
Merge pull request #284 from fireflyframework/refactor/move-retrieval…
miguelgfierro Jun 19, 2026
6cdd3db
refactor(evaluation): replace RetrieverMetrics class with plain funct…
miguelgfierro Jun 19, 2026
3a3c35f
refactor(evaluation): update __init__.py exports — replace RetrieverM…
miguelgfierro Jun 19, 2026
26bfe3b
test(evaluation): rewrite test_retrieval_metrics for individual metri…
miguelgfierro Jun 19, 2026
b029d36
Merge pull request #285 from fireflyframework/refactor/retrieval-metr…
miguelgfierro Jun 19, 2026
feadcbd
Remove compute_retrieval_metrics() and KS constant from retrieval_met…
miguelgfierro Jun 19, 2026
d54814f
Remove compute_retrieval_metrics export from evaluation __init__
miguelgfierro Jun 19, 2026
0853698
Remove test_compute_retrieval_metrics_* tests
miguelgfierro Jun 19, 2026
a7b1b91
Update flycanon_eval_example to use plain metric functions instead of…
miguelgfierro Jun 19, 2026
0c911b3
Apply ruff format to retrieval_metrics.py
miguelgfierro Jun 19, 2026
ef16882
Apply ruff format to test_retrieval_metrics.py
miguelgfierro Jun 19, 2026
5a9926b
Merge pull request #286 from fireflyframework/refactor/drop-compute-r…
miguelgfierro Jun 19, 2026
e9e97d1
fix(evaluation): deepcopy base in _median_runs to prevent mutation of…
miguelgfierro Jun 25, 2026
eeb315f
fix(evaluation): strip ragas_ prefix in _ragas_score column lookup so…
miguelgfierro Jun 25, 2026
ef092be
fix(evaluation): use provider-appropriate embeddings in _make_ragas_e…
miguelgfierro Jun 25, 2026
d25d2ce
fix(evaluation): use n_gold as MAP denominator instead of min(n_gold, k)
miguelgfierro Jun 25, 2026
c690977
refactor(examples): replace flycanon_eval_example with simpler generi…
miguelgfierro Jun 25, 2026
efdcbf7
refactor(examples): replace rag_eval_example with llm_eval_example us…
miguelgfierro Jun 25, 2026
a1519b9
fix(examples): use SI units only in sample reference
miguelgfierro Jun 25, 2026
d3f53a7
docs(evaluation): rewrite guide around metrics, drop deleted gate pip…
miguelgfierro Jun 29, 2026
a6db4ff
refactor(evaluation): drop mean_latency_ms — telemetry, not a quality…
miguelgfierro Jun 29, 2026
9c78ae5
docs: fix stale evaluation subpackage description in package docstring
miguelgfierro Jun 29, 2026
0d2476b
feat(evaluation): build RAGAS embeddings from the framework embedder
miguelgfierro Jun 29, 2026
7c14351
fix(evaluation): use AzureChatOpenAI for the azure RAGAS LLM
miguelgfierro Jun 29, 2026
c02b10c
refactor(evaluation): strip gate-era baggage from AdvisoryReport
miguelgfierro Jun 29, 2026
4d262ce
docs: drop optional-subpackages block from package docstring
miguelgfierro Jun 29, 2026
2dc1054
refactor(evaluation): back JudgeClient with FireflyAgent + typed outputs
miguelgfierro Jun 29, 2026
dd86d74
refactor(evaluation): make AdvisoryReport a pydantic model
miguelgfierro Jun 29, 2026
30e5fa6
refactor(evaluation): merge judge_client into judge
miguelgfierro Jun 29, 2026
648b20e
style(evaluation): simplify __init__ to grouped imports + __all__
miguelgfierro Jun 29, 2026
d8a48d5
docs(evaluation): use claude-haiku-4-5 alias in example and guide
miguelgfierro Jun 29, 2026
9098d92
Merge remote-tracking branch 'origin/main' into feat/evaluation-frame…
miguelgfierro Jun 30, 2026
6a497a2
fix(evaluation): re-densify dedup ranks + guard no_answer_rate agains…
miguelgfierro Jun 30, 2026
59899ed
feat(evaluation): select metric family in run_judge; drop no-op media…
miguelgfierro Jun 30, 2026
60d9bf1
docs(evaluation): document metric-family selection; drop runs/median
miguelgfierro Jun 30, 2026
7183b51
fix(evaluation): clarify semantic_recovery; fix PR gate (format + pyr…
miguelgfierro Jun 30, 2026
ac808fc
docs(evaluation): clarify custom-rubric vs RAGAS LLM-judge grouping
miguelgfierro Jun 30, 2026
09d664c
docs(evaluation): order RAGAS library before custom-rubric RAG Q&A
miguelgfierro Jun 30, 2026
6db83d7
docs(evaluation): rename judge sub-groups to LLM-as-a-Judge (Custom r…
miguelgfierro Jun 30, 2026
6fcfce0
docs(evaluation): detail faithfulness vs ragas_faithfulness difference
miguelgfierro Jun 30, 2026
faed413
docs(evaluation): condense faithfulness vs ragas_faithfulness explana…
miguelgfierro Jun 30, 2026
c985264
docs(evaluation): restore fuller faithfulness/ragas_faithfulness docs…
miguelgfierro Jun 30, 2026
d96278c
docs(evaluation): make faithfulness/ragas_faithfulness table rows sel…
miguelgfierro Jun 30, 2026
ff77ada
docs(evaluation): add return-shape convention note (float vs collecti…
miguelgfierro Jun 30, 2026
5942455
feat(evaluation): uniform metric return — leading score float, then e…
miguelgfierro Jun 30, 2026
1609533
docs(evaluation): document uniform {score, ...} return across the cat…
miguelgfierro Jun 30, 2026
4dbadd9
chore: bump version to 26.06.14
miguelgfierro Jun 30, 2026
4 changes: 2 additions & 2 deletions .github/workflows/pr-gate.yml
Original file line number Diff line number Diff line change
@@ -57,7 +57,7 @@ jobs:
       - uses: actions/setup-python@v6
         with:
           python-version: '3.13'
-      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings
+      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings --extra evaluation
       - run: uv run pyright

   test:
@@ -72,7 +72,7 @@ jobs:
       - uses: actions/setup-python@v6
         with:
           python-version: '3.13'
-      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings --extra script-execution
+      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings --extra evaluation --extra script-execution
       - run: uv run pytest -m "not nightly" --cov --cov-report=term-missing

   build:
7 changes: 7 additions & 0 deletions README.md
@@ -318,6 +318,12 @@ create your own components; the framework discovers them via duck typing.
`EvalDataset` loads/saves test cases from JSON. `ModelComparison` runs the
same prompts across multiple agents for side-by-side analysis.

- **Evaluation** — LLM-as-judge metrics (faithfulness, relevancy, answer correctness,
RAGAS, …) and deterministic retrieval metrics (recall@k, MRR, MAP, nDCG, …) for
assessing LLM and pipeline outputs. Each metric is a plain function you call directly.
Install with `pip install "fireflyframework-agentic[evaluation]"`.
See [docs/evaluation.md](docs/evaluation.md) for the full guide.

> **Optional developer tooling.** `fireflyframework_agentic.experiments` (A/B
> experiments) and `fireflyframework_agentic.lab` (offline evaluation /
> benchmarking) are leaf modules — nothing in the core imports them and they add
@@ -751,6 +757,7 @@ Detailed guides for each module:
- [Secure Script Execution](docs/execution.md) — Deny-by-default Monty sandbox, static safety pre-screen, `SecureScriptRunner`, Firefly Code Mode
- [Experiments](docs/experiments.md) — A/B testing, variant comparison
- [Lab](docs/lab.md) — Benchmarks, datasets, evaluators
- [Evaluation](docs/evaluation.md) — LLM-as-judge metrics, RAGAS, retrieval metrics
- Studio — moved to [fireflyframework-agentic-studio](https://github.com/fireflyframework/fireflyframework-agentic-studio)
---

329 changes: 329 additions & 0 deletions docs/evaluation.md

Large diffs are not rendered by default.
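
The README entry above names recall@k, MRR, MAP and nDCG as the deterministic retrieval metrics. Since the docs/evaluation.md diff is not rendered on this page, here is a minimal standalone sketch of the textbook definitions with binary relevance. It is not the package's implementation: the function names, signatures and plain-float return values are assumptions made for illustration, and the ranked list is assumed to contain no duplicates. The MAP normalisation by the number of gold items mirrors what commit d25d2ce describes.

```python
from __future__ import annotations

import math
from collections.abc import Sequence


def recall_at_k(ranked: Sequence[str], gold: set[str], k: int) -> float:
    """Fraction of gold items that appear in the top-k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def reciprocal_rank(ranked: Sequence[str], gold: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], gold: set[str], k: int) -> float:
    """Average precision over the top-k, divided by the number of gold items."""
    if not gold:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in gold:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / len(gold)


def ndcg_at_k(ranked: Sequence[str], gold: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in gold
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]  # hypothetical retrieved ids, best first
    gold = {"d1", "d2", "d9"}  # hypothetical relevant ids
    print(recall_at_k(ranked, gold, 3), reciprocal_rank(ranked, gold))
```

Averaging `reciprocal_rank` and `average_precision` over a set of queries gives MRR and MAP respectively.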

119 changes: 119 additions & 0 deletions examples/llm_eval_example.py
@@ -0,0 +1,119 @@
# Copyright 2026 Firefly Software Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""LLM-as-judge evaluation example.

Score a set of Q&A pairs using the evaluation metrics:
- contains_answer — does the answer contain the correct information?
- addresses_question — does the answer directly address what was asked?

Usage::

    python examples/llm_eval_example.py --model anthropic:claude-haiku-4-5

    # Or score from a JSONL file instead of the built-in sample data:
    python examples/llm_eval_example.py \\
        --model anthropic:claude-haiku-4-5 \\
        --items-file items.jsonl

Items JSONL format — one JSON object per line::

    {"question": "...", "answer": "...", "reference": "..."}
"""

from __future__ import annotations

import argparse
import asyncio
import json
from pathlib import Path

from fireflyframework_agentic.evaluation import (
    EvalContext,
    JudgeClient,
    addresses_question,
    contains_answer,
)

# Sample data used when no --items-file is provided.
_SAMPLE_ITEMS = [
    {
        "question": "What is the boiling point of water at sea level?",
        "reference": "Water boils at 100 °C at standard atmospheric pressure.",
        "answer": "Water boils at 100 degrees Celsius at sea level.",
    },
    {
        "question": "Who wrote Romeo and Juliet?",
        "reference": "Romeo and Juliet was written by William Shakespeare around 1594–1596.",
        "answer": "It was written by Shakespeare.",
    },
    {
        "question": "What is the capital of France?",
        "reference": "The capital of France is Paris.",
        "answer": "The weather in France is generally mild.",
    },
]


async def score_items(items: list[dict], ctx: EvalContext) -> list[dict]:
    tasks = [(contains_answer(item, ctx), addresses_question(item, ctx)) for item in items]
    pairs = await asyncio.gather(*[asyncio.gather(ca, aq) for ca, aq in tasks])
    return [
        {
            "question": item["question"],
            "contains_answer": ca["score"] if ca else None,
            "addresses_question": aq["score"] if aq else None,
        }
        for item, (ca, aq) in zip(items, pairs, strict=True)
    ]


async def main(args: argparse.Namespace) -> None:
    if args.items_file:
        lines = Path(args.items_file).read_text(encoding="utf-8").strip().splitlines()
        items = [json.loads(line) for line in lines if line.strip()]
    else:
        items = _SAMPLE_ITEMS

    ctx = EvalContext(client=JudgeClient(args.model))
    results = await score_items(items, ctx)

    print(f"\n{'Question':<45} {'contains':>8} {'addresses':>9}")
    print("-" * 63)
    for r in results:
        q = r["question"][:43] + ".." if len(r["question"]) > 45 else r["question"]
        ca = f"{r['contains_answer']:.2f}" if r["contains_answer"] is not None else " n/a"
        aq = f"{r['addresses_question']:.2f}" if r["addresses_question"] is not None else " n/a"
        print(f"{q:<45} {ca:>8} {aq:>9}")

    # Average only over items where both metrics returned a score, so a missing
    # addresses_question value cannot raise a TypeError in the sums below.
    scored = [
        r
        for r in results
        if r["contains_answer"] is not None and r["addresses_question"] is not None
    ]
    if scored:
        avg_ca = sum(r["contains_answer"] for r in scored) / len(scored)
        avg_aq = sum(r["addresses_question"] for r in scored) / len(scored)
        print("-" * 63)
        print(f"{'Average':<45} {avg_ca:>8.2f} {avg_aq:>9.2f}")
    print(f"\n{len(scored)} of {len(items)} items scored.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Score Q&A pairs with LLM-as-judge metrics.")
    parser.add_argument(
        "--model",
        default="anthropic:claude-haiku-4-5",
        help="Judge model spec (provider:model).",
    )
    parser.add_argument(
        "--items-file", default=None, help="Optional JSONL file of {question, answer, reference} items."
    )
    asyncio.run(main(parser.parse_args()))
91 changes: 91 additions & 0 deletions fireflyframework_agentic/evaluation/__init__.py
@@ -0,0 +1,91 @@
"""Evaluation metrics for LLM and pipeline outputs.

LLM-as-judge metrics (``judge``), the spec-driven embedder factory (``embedder``),
and deterministic retrieval metrics (``retrieval_metrics``).
"""

from fireflyframework_agentic.evaluation.embedder import build_embedder
from fireflyframework_agentic.evaluation.judge import (
    BASIC_METRICS,
    PROCESS_MINING_METRICS,
    AdvisoryReport,
    EvalContext,
    JudgeClient,
    Metric,
    actionability,
    addresses_question,
    answer_correctness,
    answer_relevancy,
    citation_relevance,
    comparative_vs_champion,
    contains_answer,
    context_precision,
    context_recall,
    contradiction,
    excerpt_fill_rate,
    fabricated_entity,
    faithfulness,
    nc_semantic_precision,
    numeric_temporal_fidelity,
    open_gap,
    parse_model,
    ragas_faithfulness,
    run_judge,
    same_provider,
    semantic_recovery,
    severity_calibration,
    source_coverage,
    surface_deduplication,
)
from fireflyframework_agentic.evaluation.retrieval_metrics import (
    citation_precision,
    hit_at_k,
    map_score,
    mrr,
    ndcg,
    no_answer_rate,
    precision_at_k,
    recall_at_k,
)

__all__ = [
    "BASIC_METRICS",
    "PROCESS_MINING_METRICS",
    "AdvisoryReport",
    "EvalContext",
    "JudgeClient",
    "Metric",
    "actionability",
    "addresses_question",
    "answer_correctness",
    "answer_relevancy",
    "build_embedder",
    "citation_precision",
    "citation_relevance",
    "comparative_vs_champion",
    "contains_answer",
    "context_precision",
    "context_recall",
    "contradiction",
    "excerpt_fill_rate",
    "fabricated_entity",
    "faithfulness",
    "hit_at_k",
    "map_score",
    "mrr",
    "nc_semantic_precision",
    "ndcg",
    "no_answer_rate",
    "numeric_temporal_fidelity",
    "open_gap",
    "parse_model",
    "precision_at_k",
    "ragas_faithfulness",
    "recall_at_k",
    "run_judge",
    "same_provider",
    "semantic_recovery",
    "severity_calibration",
    "source_coverage",
    "surface_deduplication",
]
82 changes: 82 additions & 0 deletions fireflyframework_agentic/evaluation/embedder.py
@@ -0,0 +1,82 @@
# Copyright 2026 Firefly Software Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Resolve a ``<provider>:<model>`` spec to a framework embedder.

Mirrors flycanon's ``embedding_service._build_embedder``: one branch per
provider shipped by ``fireflyframework_agentic.embeddings``. Per-provider
imports are deferred so a spec that never touches a given provider doesn't
require its SDK to be installed.
"""

from __future__ import annotations

import os

from fireflyframework_agentic.embeddings.base import BaseEmbedder


def build_embedder(spec: str, *, dimensions: int | None = None, batch_size: int = 64) -> BaseEmbedder:
    """Build a framework embedder from a ``"<provider>:<model>"`` spec.

    Supported providers: openai, azure, cohere, google, mistral, voyage,
    bedrock, ollama. Raises ``ValueError`` on a malformed spec or unknown
    provider.
    """
    if ":" not in spec:
        raise ValueError(f"embedder spec must be '<provider>:<model>' (got {spec!r})")
    provider, _, model = spec.partition(":")
    p = provider.strip().lower()
    if p == "openai":
        from fireflyframework_agentic.embeddings.providers.openai import OpenAIEmbedder  # noqa: PLC0415

        return OpenAIEmbedder(model=model, dimensions=dimensions, batch_size=batch_size)
    if p in ("azure", "azure-openai"):
        from fireflyframework_agentic.embeddings.providers.azure import AzureEmbedder  # noqa: PLC0415

        return AzureEmbedder(
            model=model,
            dimensions=dimensions,
            batch_size=batch_size,
            azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT", ""),
        )
    if p == "cohere":
        from fireflyframework_agentic.embeddings.providers.cohere import CohereEmbedder  # noqa: PLC0415

        return CohereEmbedder(model=model, dimensions=dimensions, batch_size=batch_size)
    if p in ("google", "gemini"):
        from fireflyframework_agentic.embeddings.providers.google import GoogleEmbedder  # noqa: PLC0415

        return GoogleEmbedder(model=model, dimensions=dimensions, batch_size=batch_size)
    if p == "mistral":
        from fireflyframework_agentic.embeddings.providers.mistral import MistralEmbedder  # noqa: PLC0415

        return MistralEmbedder(model=model, dimensions=dimensions, batch_size=batch_size)
    if p == "voyage":
        from fireflyframework_agentic.embeddings.providers.voyage import VoyageEmbedder  # noqa: PLC0415

        return VoyageEmbedder(model=model, dimensions=dimensions, batch_size=batch_size)
    if p == "bedrock":
        from fireflyframework_agentic.embeddings.providers.bedrock import BedrockEmbedder  # noqa: PLC0415

        return BedrockEmbedder(model=model, dimensions=dimensions, batch_size=batch_size)
    if p == "ollama":
        from fireflyframework_agentic.embeddings.providers.ollama import OllamaEmbedder  # noqa: PLC0415

        base_url = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
        return OllamaEmbedder(model=model, dimensions=dimensions, base_url=base_url, batch_size=batch_size)
    raise ValueError(
        f"unknown embedding provider {provider!r}; supported: "
        "openai, azure, cohere, google, mistral, voyage, bedrock, ollama"
    )
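
The spec handling in `build_embedder` can be isolated from the provider SDKs to see how it behaves on edge cases. The sketch below is a hypothetical helper, `parse_embedder_spec`, written for illustration only and not part of the package. It copies the three steps visible in the diff: reject a spec without a colon, split on the first colon with `str.partition`, and normalise the provider name, including the `azure-openai` and `gemini` aliases the function accepts. The model names in the usage lines are placeholders.

```python
from __future__ import annotations

# Aliases and provider names taken from the branches of build_embedder above.
_ALIASES = {"azure-openai": "azure", "gemini": "google"}
_SUPPORTED = ("openai", "azure", "cohere", "google", "mistral", "voyage", "bedrock", "ollama")


def parse_embedder_spec(spec: str) -> tuple[str, str]:
    """Split '<provider>:<model>' into a canonical provider key and a model name."""
    if ":" not in spec:
        raise ValueError(f"embedder spec must be '<provider>:<model>' (got {spec!r})")
    # partition splits on the FIRST colon only, so the model part may itself
    # contain colons (for example a tag suffix).
    provider, _, model = spec.partition(":")
    key = provider.strip().lower()
    key = _ALIASES.get(key, key)
    if key not in _SUPPORTED:
        raise ValueError(f"unknown embedding provider {provider!r}; supported: {', '.join(_SUPPORTED)}")
    return key, model


if __name__ == "__main__":
    print(parse_embedder_spec("OpenAI:some-embedding-model"))
    print(parse_embedder_spec("ollama:some-model:latest"))
    print(parse_embedder_spec("gemini:some-model"))
```

One consequence of splitting on the first colon is that a spec such as `"openai:"` passes the colon check with an empty model name, so callers may want to validate the model part separately.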