Add verified evaluation harness for /query/verified #1
Motivation
- Add an evaluation harness for `/query/verified` to validate claim-level groundedness and evidence integrity.
- Run against `AI_PROVIDER=fake` so eval results are stable for CI and local smoke tests.
- Write eval outputs under `scripts/eval/out/`.
- Keep `/query` eval behavior unchanged and reuse the existing debug/source helpers.
Description

- Added `scripts/eval/golden_verified.json`, seeded with answerable and insufficient-evidence cases for verification testing.
- Added `scripts/eval/run_eval_verified.py`, which posts to `/query/verified`, validates answer/citation expectations, claim shapes, verdicts, score bounds, and evidence integrity, and computes verification metrics (see the sketch after this list).
- The runner writes `scripts/eval/out/eval_verified_results.json` and `scripts/eval/out/eval_verified_report.md`; the `Makefile` gains an `eval-verified` target and the `README.md` docs now cover `make eval-verified`.
- The harness runs against `AI_PROVIDER=fake`, the runner exits non-zero if any case fails, and the existing `scripts/eval/run_eval.py` runtime logic was not changed.
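A minimal sketch of the per-case checks described above, assuming hypothetical field names (`question`, `expect_answerable`, `citations`, `claims`, `verdict`, `score`) and a local service URL; the PR does not show the actual golden-file schema or response shape, so treat these names and the allowed verdict set as assumptions rather than the real implementation:

```python
# Hypothetical sketch: the field names and verdict set below are assumptions,
# not the PR's actual schema for golden_verified.json or /query/verified.
import json
import sys
import requests

BASE_URL = "http://localhost:8000"  # assumed local service address
ALLOWED_VERDICTS = {"supported", "unsupported", "insufficient"}  # assumed set

def check_case(case: dict) -> list[str]:
    """Post one golden case to /query/verified and return a list of failure messages."""
    resp = requests.post(f"{BASE_URL}/query/verified", json={"question": case["question"]})
    if resp.status_code != 200:
        return [f"HTTP {resp.status_code}"]
    body = resp.json()
    failures = []

    # Answer/citation expectation: answerable cases should come back with citations.
    if case.get("expect_answerable") and not body.get("citations"):
        failures.append("expected citations for answerable case")

    # Claim shape, verdict, and score-bound checks.
    for i, claim in enumerate(body.get("claims", [])):
        if "text" not in claim or "verdict" not in claim:
            failures.append(f"claim {i}: missing text/verdict")
        if claim.get("verdict") not in ALLOWED_VERDICTS:
            failures.append(f"claim {i}: unexpected verdict {claim.get('verdict')!r}")
        score = claim.get("score")
        if score is not None and not 0.0 <= score <= 1.0:
            failures.append(f"claim {i}: score {score} out of [0, 1]")
    return failures

if __name__ == "__main__":
    with open("scripts/eval/golden_verified.json") as f:
        cases = json.load(f)
    failed = {c["question"]: errs for c in cases if (errs := check_case(c))}
    print(json.dumps(failed, indent=2))
    sys.exit(1 if failed else 0)
```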
Testing

- Ran `AI_PROVIDER=fake DEBUG=true docker compose up --build` followed by `make eval` and `make eval-verified` to validate behavior locally.
- Results are written under `scripts/eval/out/`, and the runner is designed to fail the process with a non-zero exit if any case fails (see the sketch after this list).
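A sketch of the output and exit behavior, assuming a hypothetical per-case result shape; only the output file names (`eval_verified_results.json`, `eval_verified_report.md`) and the non-zero-exit requirement come from the PR, everything else is illustrative:

```python
# Hypothetical sketch of the output/exit behavior; the result shape and summary
# fields are assumptions, only the output file names come from the PR.
import json
import sys
from pathlib import Path

def write_outputs(results: list[dict], out_dir: str = "scripts/eval/out") -> int:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    passed = sum(1 for r in results if not r["failures"])
    summary = {"total": len(results), "passed": passed, "failed": len(results) - passed}

    # Machine-readable results for CI.
    (out / "eval_verified_results.json").write_text(
        json.dumps({"summary": summary, "cases": results}, indent=2)
    )

    # Human-readable Markdown report.
    lines = ["# Verified eval report", "", f"Passed {passed}/{len(results)} cases", ""]
    for r in results:
        status = "PASS" if not r["failures"] else "FAIL: " + "; ".join(r["failures"])
        lines.append(f"- {r['question']}: {status}")
    (out / "eval_verified_report.md").write_text("\n".join(lines) + "\n")

    # Non-zero exit if any case failed, so `make eval-verified` fails the build.
    return 0 if passed == len(results) else 1

if __name__ == "__main__":
    sys.exit(write_outputs([{"question": "example", "failures": []}]))
```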