improvement/week2: extraction context, speaker attribution, prompt optimization, v2 metrics#69

Merged
CodeNinjaSarthak merged 6 commits into main from improvement/week2 on Jun 4, 2026

@CodeNinjaSarthak

Owner

Summary

  • Extraction context pipeline: rolling conversation summary (refreshed every 15 turns), recent-message window (last 10 messages), and a per-speaker Current Speaker header; speaker misattribution drops from 60% to 15% of errors (attribution accuracy 97.0%)
  • OpenAI provider: added alongside Azure OpenAI, conditionally enabled via credentials
  • Generation prompt optimization: removed output-length cap, tightened "I don't know" threshold, added explicit token limits; two-pass firing rate dropped from 45.4% to 1.7%, for an average of 1.02 LLM calls per query
  • v2 benchmark results: 66.6% overall QA accuracy on LoCoMo (n=1540), +10.3 pp over v1; temporal 75.1%
  • Figures: added generation scripts and rendered PDFs for paper
  • Docs: fixed markdown code-fence wrappers in ARCHITECTURE.md and README.md; updated result.md with v2 metrics and incremental contribution breakdown
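
The extraction context in the first bullet can be pictured with a minimal sketch. The class name `RollingContext`, the `summarize` callable, and the rendered layout are all hypothetical and are not taken from the PR; only the window size (10), the summary interval (15 turns), and the Current Speaker header come from the description above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple

WINDOW_SIZE = 10    # recent-message window, per the PR description
SUMMARY_EVERY = 15  # summary refresh interval in turns, per the PR description

Turn = Tuple[str, str]  # (speaker, text)


@dataclass
class RollingContext:
    """Hypothetical container for the context handed to extraction."""

    # Assumed interface: takes the previous summary and the new turns,
    # returns an updated summary (in practice this would be an LLM call).
    summarize: Callable[[str, List[Turn]], str]
    recent: Deque[Turn] = field(default_factory=lambda: deque(maxlen=WINDOW_SIZE))
    pending: List[Turn] = field(default_factory=list)
    summary: str = ""
    turns: int = 0

    def add_turn(self, speaker: str, text: str) -> None:
        self.turns += 1
        self.recent.append((speaker, text))  # deque drops the oldest turn itself
        self.pending.append((speaker, text))
        if self.turns % SUMMARY_EVERY == 0:
            self.summary = self.summarize(self.summary, self.pending)
            self.pending = []

    def render(self, current_speaker: str) -> str:
        """Build the text block placed ahead of the message being extracted."""
        lines = [f"Current Speaker: {current_speaker}"]
        if self.summary:
            lines.append(f"Conversation summary: {self.summary}")
        lines.append("Recent messages:")
        lines.extend(f"{speaker}: {text}" for speaker, text in self.recent)
        return "\n".join(lines)


def stub_summarize(previous: str, new_turns: List[Turn]) -> str:
    # Stand-in for a real summarizer so the sketch runs offline.
    return f"{previous} [+{len(new_turns)} turns]".strip()


ctx = RollingContext(summarize=stub_summarize)
for i in range(16):
    ctx.add_turn("Alice" if i % 2 == 0 else "Bob", f"message {i}")
header = ctx.render("Alice")
```

Naming the speaker in a header, rather than leaving the model to infer it from the transcript, is one plausible reason the attribution errors fall; the PR reports the effect but not the mechanism.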

Test plan

  • Run make check (lint + 152 tests) on target machine
  • Verify .env.development has correct provider keys before running eval
  • Spot-check ARCHITECTURE.md and README.md render correctly on GitHub
  • Confirm result.md accuracy numbers match eval/results/prompt_fix_full_v1.json

…timization

- Add rolling 10-message window and per-session conversation summary (every
  15 turns) to extraction pipeline; passes context to ExtractionPipeline
- Add current_speaker parameter through manager → extraction → prompt;
  fixes 29/48 speaker misattribution errors at extraction time
- Update extraction.txt: use speaker name directly in facts, not 'User'
- Add content full-text index to Qdrant collection (_ensure_collection)
- Add text_search() to QdrantMemoryStore; add reciprocal_rank_fusion()
  helper to eval (built but not used — BM25 hurt on dev set, reverted)
- Fix AsyncQdrantClient timeout (5s → 60s); add ResponseHandlingException
  to _RETRYABLE_ERRORS
- Rewrite ANSWER_SYSTEM_PROMPT: remove 5-6 word cap, add ALL-items rule,
  tighten I-don't-know threshold
- Rewrite OPEN_DOMAIN_SYSTEM_PROMPT: replace rambling instruction with
  direct/concise answer guidance
- Add max_tokens: generation 200/350, rephrase 50, judge 50
- Remove AzureService from eval; unify on azure_client direct calls
- Add eval/failure_analysis.py for failure mode breakdown

Results (n=1540): 56.3% → 66.6% overall (+10.3 pp)
Temporal: 64.2% → 75.1% | Multi-hop: 40.6% → 51.0%
Held-out (n=718): 55.0% → 65.2% (+10.2 pp)
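
For reference, the reciprocal rank fusion helper mentioned above (built, then left unused) presumably follows the standard formulation. The sketch below is that generic version, not the PR's code: the function name `rrf_sketch`, its signature, and the constant `k = 60` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rrf_sketch(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Fuse several ranked lists of ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is a conventional default, assumed here.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Illustrative ids only: one dense (vector) ranking and one text-search ranking.
dense_hits = ["a", "b", "c"]
text_hits = ["b", "d", "a"]
fused = rrf_sketch([dense_hits, text_hits])
```

Here `b` ranks first because it sits near the top of both lists, which is the behaviour fusion is meant to reward.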
- Add 'openai' to llm_provider Literal; add openai_api_key, openai_model settings
- Add OpenAIService (AsyncOpenAI) mirroring AzureService interface
- Wire OpenAIService into API dependencies factory
- eval_qa_accuracy.py: auto-detect provider from env vars (OPENAI_API_KEY
  takes precedence over AZURE_OPENAI_* credentials)
- ingest_locomo_production.py: same conditional provider selection
- Update .env.development.example with both Azure and OpenAI options

Judges can now run with only OPENAI_API_KEY=sk-... set; no Azure setup is required
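
The provider precedence described above can be sketched as follows. `detect_provider` is a hypothetical name, and matching Azure credentials by the `AZURE_OPENAI_` prefix is an assumption made so that no specific variable names have to be guessed; the actual scripts may check particular variables.

```python
import os
from typing import Mapping, Optional


def detect_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "openai" or "azure" based on which credentials are present.

    OPENAI_API_KEY takes precedence over AZURE_OPENAI_* credentials,
    as stated in the commit message.
    """
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    # Assumption: any non-empty AZURE_OPENAI_* variable counts as Azure setup.
    if any(key.startswith("AZURE_OPENAI_") and value for key, value in env.items()):
        return "azure"
    raise RuntimeError("No LLM credentials found: set OPENAI_API_KEY or AZURE_OPENAI_* variables")


# Both sets of credentials present: OpenAI wins.
# AZURE_OPENAI_EXAMPLE is a made-up variable name used only for illustration.
both = {"OPENAI_API_KEY": "placeholder", "AZURE_OPENAI_EXAMPLE": "placeholder"}
chosen = detect_provider(both)
```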
- eval/figures/plot_category_comparison.py: per-category bar chart
- eval/figures/plot_ablation_progression.py: ablation study bars
- eval/figures/plot_per_conv_heatmap.py: per-conversation delta heatmap
  (reads prompt_fix_full_v1.json + rag_baseline_all10.json for v2 deltas)
- eval/figures/plot_efficiency_frontier.py: accuracy vs LLM calls scatter
- figures/*.pdf: rendered outputs for paper Figures 2-5

Reproduce with: for f in eval/figures/plot_*.py; do ~/eval-venv/bin/python "$f"; done
Precision re-measurement (v2):
- Stratified sample of 100 facts (10/conv, 5/speaker) hand-labeled
- Overall precision: 83% (Wilson CI [74.5, 89.1]) vs 52% in v1
- Speaker attribution accuracy: 97% (Wilson CI [91.5, 99.0])
- Misattribution rate dropped from 60% to 15% of errors
- Labels: eval/v2_precision_sample_labeled.csv
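
The Wilson intervals quoted above can be checked with a few lines of standard-library Python. This is a sketch of the textbook Wilson score interval at 95% confidence (z = 1.96); the function name `wilson_interval` is hypothetical, and the counts 83/100 and 97/100 are read off the percentages and sample size stated in the commit message.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Overall precision: 83 correct out of 100 labeled facts.
precision_lo, precision_hi = wilson_interval(83, 100)
# Speaker attribution: 97 correct out of 100.
attribution_lo, attribution_hi = wilson_interval(97, 100)

print(f"precision:   [{precision_lo * 100:.1f}, {precision_hi * 100:.1f}]")      # [74.5, 89.1]
print(f"attribution: [{attribution_lo * 100:.1f}, {attribution_hi * 100:.1f}]")  # [91.5, 99.0]
```

Both results match the intervals reported in the commit message.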

Two-pass firing rate (v2):
- Added two_pass_fired tracking to eval_qa_accuracy.py
- Dev set (n=233): fires on 1.7% of non-OD queries
- Average LLM calls/query: 1.02 (vs. the 1.9 claimed for v1)
- Generation prompt fix eliminated nearly all hedging responses
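
The link between the firing rate and the call count can be illustrated with a small sketch. It assumes each query costs one generation call plus one extra call when the second pass fires, and that rephrase and judge calls are not counted; the PR does not spell out its accounting. The record layout and `two_pass_stats` are hypothetical, and the records are synthetic, not the eval data.

```python
from typing import Iterable, Mapping, Tuple


def two_pass_stats(records: Iterable[Mapping[str, bool]]) -> Tuple[float, float]:
    """Return (firing_rate, average_llm_calls) over per-query records.

    Assumption: one call per query, plus one more when two_pass_fired is true.
    """
    rows = list(records)
    if not rows:
        raise ValueError("no records")
    fired = sum(1 for row in rows if row.get("two_pass_fired"))
    firing_rate = fired / len(rows)
    average_calls = (len(rows) + fired) / len(rows)
    return firing_rate, average_calls


# Synthetic example: 1000 queries, of which 17 trigger the second pass.
synthetic = [{"two_pass_fired": i < 17} for i in range(1000)]
rate, avg_calls = two_pass_stats(synthetic)
print(f"firing rate {rate * 100:.1f}%, average calls {avg_calls:.2f}")
```

Under this accounting a 1.7% firing rate gives about 1.02 calls per query, consistent with the figure reported above.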

Paper updates required (not in this commit — LaTeX only):
- §2.2: replace 52% precision with v2 measurement (83%)
- §2.6: update firing rate (45.4% → 1.7%) and avg calls (1.9 → 1.02)
- Abstract/Intro/Table 2/Conclusion: update '1.9' → '1.02' everywhere

Efficiency frontier figure:
- Updated Eidetic Memory v2 x-position: 1.9 → 1.02
- Fixed label overlap near x=1.0 cluster
- Regenerated figures/efficiency_frontier.pdf

Bootstrap significance (v2 vs RAG):
- p < 0.0001 confirmed (B=10000, observed diff = 22.37 pp)
- Claim in §3 Setup remains valid
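
A paired bootstrap test of this kind can be sketched as below. The exact procedure behind the reported p-value is not shown in the PR, so this is one common variant under stated assumptions: per-question 0/1 correctness for both systems, resampling questions with replacement, and a one-sided p-value equal to the share of resamples in which the first system is not ahead. `paired_bootstrap` and the sample data are hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(
    system_a: Sequence[int],
    system_b: Sequence[int],
    n_boot: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed accuracy difference, one-sided bootstrap p-value)."""
    if len(system_a) != len(system_b) or not system_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(system_a)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    deltas = [a - b for a, b in zip(system_a, system_b)]
    observed = sum(deltas) / n
    not_better = 0
    for _ in range(n_boot):
        resample = rng.choices(deltas, k=n)  # resample questions with replacement
        if sum(resample) <= 0:
            not_better += 1
    return observed, not_better / n_boot


# Synthetic per-question correctness: A gets 80/100 right, B gets 50/100.
a_correct = [1] * 80 + [0] * 20
b_correct = [1] * 50 + [0] * 50
diff, p_value = paired_bootstrap(a_correct, b_correct, n_boot=2_000)
```

With B resamples the smallest non-zero p-value is 1/B, so an outcome where no resample favours the baseline is reported as p < 1/B, which for B = 10000 is the p < 0.0001 bound quoted above.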
…lt.md

Remove erroneous code-fence wrappers from ARCHITECTURE.md and README.md.
Update result.md with v2 metrics, per-component accuracy, and incremental
contribution breakdown (extraction context + generation prompt optimization).
Remove unused sys import, fix bare f-strings, rename unused loop vars
to _reason, add strict=False to zip calls, rewrite dict() as literals,
and fix import ordering in figure scripts.
@CodeNinjaSarthak CodeNinjaSarthak merged commit c8f058a into main Jun 4, 2026
2 checks passed
@CodeNinjaSarthak CodeNinjaSarthak deleted the improvement/week2 branch June 4, 2026 11:41