improvement/week2: extraction context, speaker attribution, prompt optimization, v2 metrics by CodeNinjaSarthak · Pull Request #69 · CodeNinjaSarthak/eidetic-memory

CodeNinjaSarthak · 2026-06-04T11:34:57Z

Summary

Extraction context pipeline: rolling conversation summary (refreshed every 15 turns), recent-message window (last 10), and per-speaker Current Speaker header; speaker misattribution dropped from 60% to 15% of extraction errors (attribution accuracy 97.0%)
OpenAI provider: added alongside Azure OpenAI, conditionally enabled via credentials
Generation prompt optimization: removed the 5-6 word answer cap, tightened the "I don't know" threshold, and added explicit max_tokens limits; the two-pass firing rate dropped from 45.4% to 1.7%, for an average of 1.02 LLM calls per query
v2 benchmark results: 66.6% overall QA accuracy on LoCoMo (n=1540), +10.3 pp over v1; temporal 75.1%
Figures: added generation scripts and rendered PDFs for paper
Docs: fixed markdown code-fence wrappers in ARCHITECTURE.md and README.md; updated result.md with v2 metrics and incremental contribution breakdown
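
The extraction context described in the first summary item can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the class name `ExtractionContext`, the `summarize` callback, and the exact header wording are assumptions made for the example; only the window size (10), the summary interval (15 turns), and the Current Speaker header come from the description above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable

WINDOW_SIZE = 10     # recent messages passed to extraction (from the PR summary)
SUMMARY_EVERY = 15   # turns between summary refreshes (from the PR summary)


@dataclass
class ExtractionContext:
    """Hypothetical holder for the context handed to the extraction prompt."""

    # Assumed callback: (previous summary, new turns) -> updated summary.
    # In practice this would be an LLM call; here it is injected.
    summarize: Callable[[str, list], str]
    summary: str = ""
    recent: deque = field(default_factory=lambda: deque(maxlen=WINDOW_SIZE))
    pending: list = field(default_factory=list)

    def add_turn(self, speaker: str, text: str) -> None:
        line = f"{speaker}: {text}"
        self.recent.append(line)   # deque drops the oldest line past 10
        self.pending.append(line)
        if len(self.pending) >= SUMMARY_EVERY:
            # Fold the last 15 turns into the rolling summary.
            self.summary = self.summarize(self.summary, self.pending)
            self.pending = []

    def render(self, current_speaker: str) -> str:
        """Build the context block that precedes the message being extracted."""
        parts = [f"Current Speaker: {current_speaker}"]
        if self.summary:
            parts.append(f"Conversation summary: {self.summary}")
        parts.append("Recent messages:")
        parts.extend(self.recent)
        return "\n".join(parts)
```

Naming the current speaker explicitly in the header is what lets the extraction prompt write facts under the speaker's name instead of a generic 'User'.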

Test plan

Run make check (lint + 152 tests) on target machine
Verify .env.development has correct provider keys before running eval
Spot-check ARCHITECTURE.md and README.md render correctly on GitHub
Confirm result.md accuracy numbers match eval/results/prompt_fix_full_v1.json

…timization
- Add rolling 10-message window and per-session conversation summary (every 15 turns) to extraction pipeline; passes context to ExtractionPipeline
- Add current_speaker parameter through manager → extraction → prompt; fixes 29/48 speaker misattribution errors at extraction time
- Update extraction.txt: use speaker name directly in facts, not 'User'
- Add content full-text index to Qdrant collection (_ensure_collection)
- Add text_search() to QdrantMemoryStore; add reciprocal_rank_fusion() helper to eval (built but not used: BM25 hurt on dev set, reverted)
- Fix AsyncQdrantClient timeout (5s → 60s); add ResponseHandlingException to _RETRYABLE_ERRORS
- Rewrite ANSWER_SYSTEM_PROMPT: remove 5-6 word cap, add ALL-items rule, tighten I-dont-know threshold
- Rewrite OPEN_DOMAIN_SYSTEM_PROMPT: replace rambling instruction with direct/concise answer guidance
- Add max_tokens: generation 200/350, rephrase 50, judge 50
- Remove AzureService from eval; unify on azure_client direct calls
- Add eval/failure_analysis.py for failure mode breakdown

Results (n=1540): 56.3% → 66.6% overall (+10.3 pp)
Temporal: 64.2% → 75.1% | Multi-hop: 40.6% → 51.0%
Held-out (n=718): 55.0% → 65.2% (+10.2 pp)
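
For readers unfamiliar with the reciprocal_rank_fusion() helper mentioned above, the standard technique looks roughly like this. The signature, the `k=60` default, and the tie-breaking rule are assumptions for illustration; the helper actually committed to eval may differ.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[Hashable]], k: int = 60
) -> list:
    """Fuse several ranked ID lists into one list, best first.

    Each ID scores sum(1 / (k + rank)) over the rankings it appears in,
    with rank starting at 1. k=60 is the commonly used constant and is
    an assumption here, not a value taken from this PR.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by string form for determinism.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


# Hypothetical usage: fuse a dense-vector ranking with a full-text ranking.
dense_hits = ["a", "b", "c"]
text_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, text_hits])
```

Because the fusion uses ranks only, it does not need the dense and full-text scores to be on a comparable scale.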

- Add 'openai' to llm_provider Literal; add openai_api_key, openai_model settings
- Add OpenAIService (AsyncOpenAI) mirroring AzureService interface
- Wire OpenAIService into API dependencies factory
- eval_qa_accuracy.py: auto-detect provider from env vars (OPENAI_API_KEY takes precedence over AZURE_OPENAI_* credentials)
- ingest_locomo_production.py: same conditional provider selection
- Update .env.development.example with both Azure and OpenAI options

Judges can now run with just OPENAI_API_KEY=sk-... without Azure setup
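
The provider auto-detection described above could look something like the following sketch. The function name `detect_provider`, the returned string "azure", and the prefix check on AZURE_OPENAI_ are assumptions; the commit message only states that OPENAI_API_KEY takes precedence over the AZURE_OPENAI_* credentials.

```python
import os
from typing import Mapping, Optional


def detect_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick an LLM provider from environment variables (hypothetical helper).

    OPENAI_API_KEY wins when set; otherwise any non-empty AZURE_OPENAI_*
    variable selects Azure. The exact Azure variable names are not given
    in the PR, so only the prefix is checked here.
    """
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if any(name.startswith("AZURE_OPENAI_") and value for name, value in env.items()):
        return "azure"
    raise RuntimeError(
        "No LLM credentials found: set OPENAI_API_KEY or AZURE_OPENAI_* variables"
    )
```

Failing loudly when neither set of credentials is present surfaces a misconfigured .env.development before an eval run starts.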

- eval/figures/plot_category_comparison.py: per-category bar chart
- eval/figures/plot_ablation_progression.py: ablation study bars
- eval/figures/plot_per_conv_heatmap.py: per-conversation delta heatmap (reads prompt_fix_full_v1.json + rag_baseline_all10.json for v2 deltas)
- eval/figures/plot_efficiency_frontier.py: accuracy vs LLM calls scatter
- figures/*.pdf: rendered outputs for paper Figures 2-5

Reproduce with: ~/eval-venv/bin/python eval/figures/plot_*.py

Precision re-measurement (v2):
- Stratified sample of 100 facts (10/conv, 5/speaker) hand-labeled
- Overall precision: 83% (Wilson CI [74.5, 89.1]) vs 52% in v1
- Speaker attribution accuracy: 97% (Wilson CI [91.5, 99.0])
- Misattribution rate dropped from 60% to 15% of errors
- Labels: eval/v2_precision_sample_labeled.csv

Two-pass firing rate (v2):
- Added two_pass_fired tracking to eval_qa_accuracy.py
- Dev set (n=233): fires on 1.7% of non-OD queries
- Average LLM calls/query: 1.02 (vs claimed 1.9 from v1)
- Generation prompt fix eliminated nearly all hedging responses

Paper updates required (not in this commit; LaTeX only):
- §2.2: replace 52% precision with v2 measurement (83%)
- §2.6: update firing rate (45.4% → 1.7%) and avg calls (1.9 → 1.02)
- Abstract/Intro/Table 2/Conclusion: update '1.9' → '1.02' everywhere

Efficiency frontier figure:
- Updated Eidetic Memory v2 x-position: 1.9 → 1.02
- Fixed label overlap near x=1.0 cluster
- Regenerated figures/efficiency_frontier.pdf

Bootstrap significance (v2 vs RAG):
- p < 0.0001 confirmed (B=10000, observed diff = 22.37 pp)
- Claim in §3 Setup remains valid
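
The Wilson intervals quoted above can be checked with a few lines of Python. This is a standalone sketch using the textbook Wilson score formula with z = 1.96 (a 95% interval is assumed, since the commit message does not state the confidence level); the function name is hypothetical and this is not the script used for the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion, as fractions in [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 83 of 100 facts labeled correct; 97 of 100 attributed to the right speaker.
precision_ci = wilson_interval(83, 100)     # about (0.745, 0.891)
attribution_ci = wilson_interval(97, 100)   # about (0.915, 0.990)
```

With n = 100, both results agree with the reported [74.5, 89.1] and [91.5, 99.0] to one decimal place.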

…lt.md
Remove erroneous code-fence wrappers from ARCHITECTURE.md and README.md. Update result.md with v2 metrics, per-component accuracy, and incremental contribution breakdown (extraction context + generation prompt optimization).

Remove unused sys import, fix bare f-strings, rename unused loop vars to _reason, add strict=False to zip calls, rewrite dict() as literals, and fix import ordering in figure scripts.

CodeNinjaSarthak added 6 commits June 2, 2026 21:29

fix: resolve ruff lint errors in eval scripts to unblock CI

24b9f62


CodeNinjaSarthak merged commit c8f058a into main Jun 4, 2026
2 checks passed

CodeNinjaSarthak deleted the improvement/week2 branch June 4, 2026 11:41

CodeNinjaSarthak merged 6 commits into main from improvement/week2
