
Add eval analysis scripts and precision sample#65

Merged
CodeNinjaSarthak merged 2 commits into main from feature/eval-analysis-scripts
Apr 26, 2026

@CodeNinjaSarthak

Owner

Summary

  • Adds the analysis and statistical scripts used to produce the benchmark results in the README, plus the human-annotated precision sample backing the 52% precision claim.

Files

| File | Purpose |
| --- | --- |
| eval/analyze_multihop.py | Per-error-type breakdown for multi-hop failures (wrong_speaker, outdated_fact, reasoning_fail, missing_fact) |
| eval/analyze_sh_drop.py | Dev vs. held-out single-hop comparison; isolates selection effect from run-to-run variance |
| eval/bootstrap_ci.py | Bootstrap 95% CI + paired significance testing against RAG baseline |
| eval/sample_facts_for_precision.py | Stratified sampler over user_ids; produces the CSV below |
| eval/results/precision_sample.csv | Human-annotated 100-fact sample; is_relevant column filled during manual review (52% precision) |
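
For reviewers unfamiliar with the technique, the kind of computation that bootstrap_ci.py is described as performing can be sketched roughly as below. This is not the script itself: the function names, the resample count, the percentile method, and the assumption that results reduce to one numeric score per question are all illustrative choices.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-question scores (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the questions with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def paired_bootstrap_pvalue(system, baseline, n_resamples=2000, seed=0):
    """Two-sided paired bootstrap test on per-question differences (hypothetical helper)."""
    if len(system) != len(baseline):
        raise ValueError("paired test needs one score per question for both systems")
    rng = random.Random(seed)
    diffs = [s - b for s, b in zip(system, baseline)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Centre the differences so resamples are drawn under the null of zero mean difference.
    centred = [d - observed for d in diffs]
    extreme = 0
    for _ in range(n_resamples):
        resampled = sum(rng.choices(centred, k=n)) / n
        if abs(resampled) >= abs(observed):
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)
```

Pairing on the same questions matters here: the test resamples per-question differences rather than each system independently, so question difficulty cancels out of the comparison.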

Test plan

  • python eval/bootstrap_ci.py runs without error against existing results JSONs
  • python eval/analyze_multihop.py runs without error
  • python eval/analyze_sh_drop.py runs without error
  • precision_sample.csv opens cleanly and is_relevant column is present
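
The last check could be automated along the lines of the sketch below, which confirms the is_relevant column exists and recomputes precision from it. The label encoding (1/0, true/false, yes/no) and the helper name are assumptions; the actual CSV may use different values or extra columns.

```python
import csv
import io


def precision_from_csv(handle, column="is_relevant"):
    """Return (relevant, total, precision) from an annotated CSV (hypothetical helper)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"missing required column: {column}")
    truthy = {"1", "true", "yes", "y"}  # assumed label encoding
    falsy = {"0", "false", "no", "n"}   # assumed label encoding
    relevant = total = 0
    # Data rows start on line 2 because line 1 is the header.
    for row_number, row in enumerate(reader, start=2):
        value = (row[column] or "").strip().lower()
        if value in truthy:
            relevant += 1
        elif value not in falsy:
            raise ValueError(f"row {row_number}: unrecognised label {value!r}")
        total += 1
    if total == 0:
        raise ValueError("no annotated rows found")
    return relevant, total, relevant / total


# Toy data for illustration only; not taken from precision_sample.csv.
toy = io.StringIO("fact,is_relevant\nlikes tea,1\nowns a car,0\nlives in Pune,1\nhas a dog,0\n")
relevant, total, precision = precision_from_csv(toy)
```

Against the real file, the same function would be called on `open("eval/results/precision_sample.csv", newline="")`.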

- analyze_multihop.py: per-error-type breakdown for multi-hop failures
- analyze_sh_drop.py: dev vs held-out single-hop comparison
- bootstrap_ci.py: bootstrap 95% CI + paired significance testing
- sample_facts_for_precision.py: stratified sampling for precision eval
- precision_sample.csv: human-annotated 100-fact precision sample (52%)
- analyze_sh_drop.py: remove unused Counter import, fix bare f-string
- bootstrap_ci.py: rewrite lambda as def (E731), fix two bare f-strings
- sample_facts_for_precision.py: remove unused os import, sort imports (I001)
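
As a rough illustration of the stratified sampling that sample_facts_for_precision.py is described as doing, one possible approach is sketched below. The equal-allocation strategy, the dict-shaped fact records, and the function name are assumptions and may differ from the script in this PR.

```python
import random
from collections import defaultdict


def stratified_sample(facts, total, key="user_id", seed=0):
    """Draw an equal share from each user, then top up at random (hypothetical helper)."""
    if not facts:
        return []
    rng = random.Random(seed)
    strata = defaultdict(list)
    for fact in facts:
        strata[fact[key]].append(fact)
    per_stratum = total // len(strata)
    sample, leftovers = [], []
    # Iterate in sorted key order so the result is reproducible for a given seed.
    for user_id in sorted(strata):
        items = strata[user_id][:]
        rng.shuffle(items)
        sample.extend(items[:per_stratum])
        leftovers.extend(items[per_stratum:])
    # Fill any remaining slots from facts not already chosen.
    rng.shuffle(leftovers)
    sample.extend(leftovers[: total - len(sample)])
    return sample
```

Stratifying by user keeps a single prolific user from dominating the annotated sample, which would otherwise bias the precision estimate toward that user's facts.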
@CodeNinjaSarthak CodeNinjaSarthak merged commit 1947034 into main Apr 26, 2026
2 checks passed
@CodeNinjaSarthak CodeNinjaSarthak deleted the feature/eval-analysis-scripts branch April 26, 2026 14:58