improvement/week2: extraction context, speaker attribution, prompt optimization, v2 metrics #69
Merged
…timization

- Add rolling 10-message window and per-session conversation summary (every 15 turns) to extraction pipeline; passes context to ExtractionPipeline
- Add current_speaker parameter through manager → extraction → prompt; fixes 29/48 speaker misattribution errors at extraction time
- Update extraction.txt: use speaker name directly in facts, not 'User'
- Add content full-text index to Qdrant collection (_ensure_collection)
- Add text_search() to QdrantMemoryStore; add reciprocal_rank_fusion() helper to eval (built but not used: BM25 hurt on dev set, reverted)
- Fix AsyncQdrantClient timeout (5s → 60s); add ResponseHandlingException to _RETRYABLE_ERRORS
- Rewrite ANSWER_SYSTEM_PROMPT: remove 5-6 word cap, add ALL-items rule, tighten I-dont-know threshold
- Rewrite OPEN_DOMAIN_SYSTEM_PROMPT: replace rambling instruction with direct/concise answer guidance
- Add max_tokens: generation 200/350, rephrase 50, judge 50
- Remove AzureService from eval; unify on azure_client direct calls
- Add eval/failure_analysis.py for failure mode breakdown

Results (n=1540): 56.3% → 66.6% overall (+10.3 pp)
Temporal: 64.2% → 75.1% | Multi-hop: 40.6% → 51.0%
Held-out (n=718): 55.0% → 65.2% (+10.2 pp)
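For reviewers, the rolling 10-message window and the every-15-turns summary described in the commit above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SessionContext`, `add_turn`, `extraction_context`, and the `summarize` callback are hypothetical names, and the real pipeline presumably calls an LLM where the callback sits.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from typing import Callable

WINDOW_SIZE = 10     # rolling window size, from the commit message
SUMMARY_EVERY = 15   # summary refresh interval in turns, from the commit message


@dataclass
class SessionContext:
    """Hypothetical per-session holder for extraction context."""

    # Assumed callback: (previous_summary, new_lines) -> updated summary.
    summarize: Callable[[str, list[str]], str]
    window: deque = field(default_factory=lambda: deque(maxlen=WINDOW_SIZE))
    pending: list = field(default_factory=list)
    summary: str = ""
    turns: int = 0

    def add_turn(self, speaker: str, text: str) -> None:
        line = f"{speaker}: {text}"
        self.window.append(line)   # deque drops the oldest line past 10
        self.pending.append(line)  # lines not yet folded into the summary
        self.turns += 1
        if self.turns % SUMMARY_EVERY == 0:
            self.summary = self.summarize(self.summary, self.pending)
            self.pending = []

    def extraction_context(self, current_speaker: str) -> dict:
        # Everything the extraction prompt would receive for this turn.
        return {
            "summary": self.summary,
            "recent_messages": list(self.window),
            "current_speaker": current_speaker,
        }
```

Passing `current_speaker` alongside the window mirrors the second bullet: the extraction prompt can name the speaker directly instead of falling back to 'User'.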
- Add 'openai' to llm_provider Literal; add openai_api_key, openai_model settings
- Add OpenAIService (AsyncOpenAI) mirroring AzureService interface
- Wire OpenAIService into API dependencies factory
- eval_qa_accuracy.py: auto-detect provider from env vars (OPENAI_API_KEY takes precedence over AZURE_OPENAI_* credentials)
- ingest_locomo_production.py: same conditional provider selection
- Update .env.development.example with both Azure and OpenAI options

Judges can now run with just OPENAI_API_KEY=sk-... without Azure setup
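The provider auto-detection rule (OPENAI_API_KEY wins over Azure credentials) might look something like the sketch below. The function name `detect_provider` is hypothetical, and the specific Azure variable names `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` are assumptions; the commit only says `AZURE_OPENAI_*`.

```python
import os
from typing import Mapping, Optional


def detect_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick an LLM provider from environment variables (illustrative only)."""
    env = os.environ if env is None else env

    # OPENAI_API_KEY takes precedence, as stated in the commit message.
    if env.get("OPENAI_API_KEY"):
        return "openai"

    # Assumed Azure variable names; the real scripts may check others.
    if env.get("AZURE_OPENAI_API_KEY") and env.get("AZURE_OPENAI_ENDPOINT"):
        return "azure"

    raise RuntimeError("No LLM provider credentials found in environment")
```

Accepting an explicit mapping keeps the precedence rule testable without mutating the process environment.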
- eval/figures/plot_category_comparison.py: per-category bar chart
- eval/figures/plot_ablation_progression.py: ablation study bars
- eval/figures/plot_per_conv_heatmap.py: per-conversation delta heatmap (reads prompt_fix_full_v1.json + rag_baseline_all10.json for v2 deltas)
- eval/figures/plot_efficiency_frontier.py: accuracy vs LLM calls scatter
- figures/*.pdf: rendered outputs for paper Figures 2-5

Reproduce with: ~/eval-venv/bin/python eval/figures/plot_*.py
Precision re-measurement (v2):
- Stratified sample of 100 facts (10/conv, 5/speaker) hand-labeled
- Overall precision: 83% (Wilson CI [74.5, 89.1]) vs 52% in v1
- Speaker attribution accuracy: 97% (Wilson CI [91.5, 99.0])
- Misattribution rate dropped from 60% to 15% of errors
- Labels: eval/v2_precision_sample_labeled.csv

Two-pass firing rate (v2):
- Added two_pass_fired tracking to eval_qa_accuracy.py
- Dev set (n=233): fires on 1.7% of non-OD queries
- Average LLM calls/query: 1.02 (vs claimed 1.9 from v1)
- Generation prompt fix eliminated nearly all hedging responses

Paper updates required (not in this commit; LaTeX only):
- §2.2: replace 52% precision with v2 measurement (83%)
- §2.6: update firing rate (45.4% → 1.7%) and avg calls (1.9 → 1.02)
- Abstract/Intro/Table 2/Conclusion: update '1.9' → '1.02' everywhere

Efficiency frontier figure:
- Updated Eidetic Memory v2 x-position: 1.9 → 1.02
- Fixed label overlap near x=1.0 cluster
- Regenerated figures/efficiency_frontier.pdf

Bootstrap significance (v2 vs RAG):
- p < 0.0001 confirmed (B=10000, observed diff = 22.37 pp)
- Claim in §3 Setup remains valid
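To check the Wilson intervals quoted above, a standalone calculation along these lines can be used. It assumes a 95% confidence level and 83/100 and 97/100 as the underlying counts (implied by the 100-fact sample); `wilson_interval` is an illustrative helper, not a function from this repo.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    for label, k in [("precision", 83), ("speaker attribution", 97)]:
        lo, hi = wilson_interval(k, 100)
        print(f"{label}: [{lo * 100:.1f}, {hi * 100:.1f}]")
```

Under those assumptions the script should agree with the reported intervals to one decimal place.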
…lt.md

Remove erroneous code-fence wrappers from ARCHITECTURE.md and README.md. Update result.md with v2 metrics, per-component accuracy, and incremental contribution breakdown (extraction context + generation prompt optimization).
Remove unused sys import, fix bare f-strings, rename unused loop vars to _reason, add strict=False to zip calls, rewrite dict() as literals, and fix import ordering in figure scripts.
Summary
- Current Speaker header: reduces speaker misattribution from 60% → 15% of errors (attribution accuracy 97.0%)

Test plan
- make check (lint + 152 tests) on target machine
- .env.development has correct provider keys before running eval
- eval/results/prompt_fix_full_v1.json