
Add verified evaluation harness for /query/verified #1

Open
IgnazioDS wants to merge 1 commit into main from codex/add-evaluation-harness-for-/query/verified

Conversation

@IgnazioDS
Owner

Motivation

  • Provide an automated evaluation harness for the verification endpoint /query/verified to validate claim-level groundedness and evidence integrity.
  • Produce deterministic verification checks when AI_PROVIDER=fake so eval results are stable for CI and local smoke tests.
  • Emit both JSON and Markdown reports for auditability and debugging under scripts/eval/out/.
  • Keep the existing /query eval behavior unchanged and reuse the existing debug/source helpers.

Description

  • Added a new dataset scripts/eval/golden_verified.json seeded with answerable and insufficient-evidence cases for verification testing.
  • Implemented the runner scripts/eval/run_eval_verified.py which posts to /query/verified, validates answer/citation expectations, claim shapes, verdicts, score bounds, evidence integrity, and computes verification metrics.
  • Wrote outputs to scripts/eval/out/eval_verified_results.json and scripts/eval/out/eval_verified_report.md, added an eval-verified target to the Makefile, and documented make eval-verified in README.md.
  • All checks are deterministic under AI_PROVIDER=fake, the runner exits non-zero if any case fails, and existing scripts/eval/run_eval.py runtime logic was not changed.
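The per-claim validation described above (claim shapes, verdicts, score bounds, evidence integrity) can be sketched as follows. This is an illustrative sketch only: the field names ("text", "verdict", "score", "evidence") and the verdict vocabulary are assumptions for illustration, not the actual /query/verified response schema used by run_eval_verified.py.

```python
# Hypothetical verdict values; the real runner's set may differ.
ALLOWED_VERDICTS = {"supported", "unsupported", "insufficient_evidence"}

def validate_claim(claim: dict) -> list[str]:
    """Return validation errors for a single claim object (empty list = pass)."""
    errors = []
    # Claim shape: every claim must carry these fields.
    for field in ("text", "verdict", "score"):
        if field not in claim:
            errors.append(f"missing field: {field}")
    # Verdict must come from the allowed vocabulary.
    if claim.get("verdict") not in ALLOWED_VERDICTS:
        errors.append(f"unexpected verdict: {claim.get('verdict')!r}")
    # Score bounds: a confidence-style score in [0, 1].
    score = claim.get("score")
    if not (isinstance(score, (int, float)) and 0.0 <= score <= 1.0):
        errors.append(f"score out of bounds: {score!r}")
    # Evidence integrity: a "supported" claim must cite at least one source.
    if claim.get("verdict") == "supported" and not claim.get("evidence"):
        errors.append("supported claim has no evidence")
    return errors
```

Under AI_PROVIDER=fake the endpoint's responses are deterministic, so checks of this form produce stable pass/fail results across runs.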

Testing

  • No automated tests were executed as part of this change.
  • The change is additive (new files and a new Make target) and does not modify existing eval runtime behavior or existing unit tests.
  • To validate locally, start the stack with AI_PROVIDER=fake DEBUG=true docker compose up --build, then run make eval and make eval-verified.
  • Created outputs are written to scripts/eval/out/ and the runner is designed to fail the process with non-zero exit if any case fails.
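The fail-fast behavior noted above (non-zero exit when any case fails) can be sketched as below. The result-record field names ("id", "errors") are hypothetical and not taken from the actual runner.

```python
import sys

def exit_code(results: list[dict]) -> int:
    """Return 1 if any case recorded errors, else 0, mirroring the runner's contract."""
    failed = [r["id"] for r in results if r.get("errors")]
    if failed:
        print(f"FAILED cases: {', '.join(failed)}")
        return 1
    print(f"All {len(results)} cases passed")
    return 0

if __name__ == "__main__":
    # Example: a single passing case yields exit status 0, so CI treats the run as green.
    sys.exit(exit_code([{"id": "answerable-1", "errors": []}]))
```

Because the process exit status carries pass/fail, make eval-verified can gate CI without parsing the JSON or Markdown reports.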

Codex Task

