An experiment testing whether NER/POS-tagged text, converted to a structured fact database with graph relationships (entities, predicates, trigrams), can improve retrieval on the LongMemEval benchmark over pure vector search.
LongMemEval — Full 500 questions (deterministic, reproduced 3× at 500/500):
| System | R@5 | R@10 | Misses@5 |
|---|---|---|---|
| Vector-only baseline (= MemPalace raw) | 96.6% | 98.2% | 17 |
| MemPalace hybrid_v4 | 98.6% | 99.8% | 7 |
| Retaining v1 (Prolog-based) | 98.8% | 99.6% | 6 |
| Retaining v2 (pure-Python logic) | 100.0% | 100.0% | 0 |
500/500 questions answered correctly at R@5 and R@10. Reproduced identically across three independent runs (identical metrics to the fourth decimal).
Perfect 100% R@5 on all six question categories: knowledge-update (78), multi-session (133), temporal-reasoning (133), single-session-user (70), single-session-assistant (56), single-session-preference (30).
Replaced SWI-Prolog (pyswip) with pure-Python logic engine — same concept (inverted-index fact matching), ~10× less complexity, no IPC overhead.
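The inverted-index fact matching that replaced Prolog can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not the actual `logic_engine.py`; the class and method names are hypothetical:

```python
from collections import defaultdict

class LogicKB:
    """Toy inverted-index fact store: each term of a
    (subject, predicate, object) fact maps to the sessions asserting it."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> {session_id, ...}

    def assert_fact(self, session_id, subj, pred, obj):
        for term in (subj, pred, obj):
            self.index[term.lower()].add(session_id)

    def query(self, terms):
        """Score sessions by how many query terms they match."""
        scores = defaultdict(int)
        for term in terms:
            for sid in self.index.get(term.lower(), ()):
                scores[sid] += 1
        return sorted(scores.items(), key=lambda kv: -kv[1])

kb = LogicKB()
kb.assert_fact("s1", "user", "bought", "power bank")
kb.assert_fact("s2", "user", "bought", "iPad case")
print(kb.query(["bought", "power bank"]))  # → [('s1', 2), ('s2', 1)]
```

No unification, no backtracking: a dict of sets plus a counter reproduces the behavior Prolog was providing, which is why swapping engines cost no accuracy.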
Key v2 improvements that fixed all 6 of v1's misses:
- **Noun-phrase embedding bridge** — Embeds extracted NPs from each session and computes similarity to the question's NPs. Bridges semantic gaps like "battery life phone" → "portable power bank" (dist=1.58 in ChromaDB space). Fixed `09d032c9`.
- **Logic score dampening** — When many sessions share the same logic score (>30% concentration at max), dampen the boost. Prevents noisy generic matches from displacing correct answers. Fixed `06f04340`.
- **Nostalgia/memory theme as high-precision signal** — Moved from the capped low-precision category to the uncapped high-precision category. Fixed `d6233ab6`.
- **Rank preservation injection** — When a vec-top-5 session gets displaced to hybrid rank 6-12 AND an intruder from vec rank ≥8 entered the top-5, inject the displaced session back at position 5. Fires only when an intruder exists (prevents false positives). Fixed `60bf93ed_abs`, `6d550036`, `d24813b1`.
- **Keyword overlap threshold** — Require ≥20% keyword overlap (roughly ≥2 keyword matches) before applying the keyword boost. Prevents single-word incidental matches ("case" matching "display case" instead of "iPad case") from noise-boosting irrelevant sessions.
- **Temporal-NP bridge** — For temporal questions, identifies sessions in the date window, then runs NP embedding similarity within that filtered set. Discriminates between 14 same-date sessions by topical relevance. Fixed `gpt4_8279ba03` ("kitchen appliance" ↔ "smoker").
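The noun-phrase bridge above reduces to a best-pairwise-similarity computation between question NPs and session NPs. A minimal sketch follows; the real system embeds NPs with ChromaDB's embedder, so the bag-of-words `embed` here is only a stand-in, and all function names are illustrative:

```python
import math
from collections import Counter

def embed(text):
    """Stand-in for a real sentence embedding (the experiment uses
    ChromaDB embeddings); bag-of-words for illustration only."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def np_bridge_score(question_nps, session_nps):
    """Best pairwise NP similarity between question and session."""
    return max(
        (cosine(embed(q), embed(s)) for q in question_nps for s in session_nps),
        default=0.0,
    )

score = np_bridge_score(["battery life phone"],
                        ["portable power bank", "phone charger"])
```

The temporal-NP bridge is the same computation run only over sessions that survive the date-window filter.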
```
Session text → spaCy NLP → entities, POS, relations, temporal, preferences
      ↓
Pure-Python LogicKB ← inverted-index facts
      ↓
Question → spaCy NLP → Logic queries → structural scores ─────┐
        → ChromaDB   → vector distances ──────────────────────┤
        → regex      → temporal/keyword/phrase signals ───────┤
        → ChromaDB   → NP embedding bridge scores ────────────┤
                                                              ↓
                                         Fused scoring → ranked results
                                                              ↓
                                       Rank preservation → final ranking
```
Low-precision boosts (capped at 0.15-0.30): keyword overlap, logic scores, preference doc proximity, NP bridge, ordering signals.
High-precision boosts (uncapped): temporal date matching, weekday matching, quoted phrase presence, person name matching, nostalgia theme detection.
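The capped/uncapped split can be expressed as a small fusion rule. This is a hypothetical sketch of the idea (exact per-signal caps and weights live in `bench_v2.py`); lower fused scores rank higher:

```python
def fuse(vec_distance, low_boosts, high_boosts, cap=0.30):
    """Illustrative fusion: low-precision boosts are summed then capped
    (cap value hypothetical); high-precision boosts apply in full."""
    boost = min(sum(low_boosts), cap) + sum(high_boosts)
    return vec_distance - boost

# Many weak signals cannot out-boost one strong signal:
noisy = fuse(1.20, low_boosts=[0.15, 0.15, 0.15], high_boosts=[])   # 0.90
exact = fuse(1.20, low_boosts=[], high_boosts=[0.60])               # 0.60
```

The cap is what keeps keyword noise and generic logic matches from ever displacing a session carrying a high-precision signal such as an exact date or quoted phrase.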
Vector anchoring: Vec-top-5 sessions get a graduated distance bonus (10% for #1, down to 3.2% for #5) to prevent regressions. Extra protection for sessions with zero boosts (embedding-only matches).
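The graduated bonus might look like the following. Only the endpoints (10% for rank 1, 3.2% for rank 5) are stated in the writeup; the geometric decay between them is an assumption for illustration:

```python
def anchor_bonus(vec_rank, dist):
    """Graduated distance bonus for vec-top-5 sessions. Endpoints from
    the writeup; the decay shape is an illustrative interpolation."""
    if not 1 <= vec_rank <= 5:
        return 0.0
    r = (0.032 / 0.10) ** 0.25   # chosen so rank 5 yields 3.2%
    return dist * 0.10 * r ** (vec_rank - 1)
```

Subtracting this bonus from a vec-top-5 session's fused score makes the hybrid pipeline reluctant to demote what pure vector search already got right.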
Rank preservation: Post-hoc injection to recover vec-top-5 sessions displaced by noise-driven boosts, triggered only when an intruder from deep vec rank enters the top-5.
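The injection rule can be reconstructed as a post-processing pass over the two rankings. This sketch assumes both rankings are lists of session ids, best first; the exact window sizes mirror the description above but the code itself is illustrative:

```python
def preserve_rank(hybrid_ids, vec_ids):
    """If a vec-top-5 session fell to hybrid rank 6-12 while a deep-vec
    intruder (vec rank >= 8) entered the hybrid top-5, re-inject the
    displaced session at position 5. Illustrative reconstruction."""
    top5, rest = hybrid_ids[:5], hybrid_ids[5:]
    intruder = any(vec_ids.index(s) >= 7 for s in top5 if s in vec_ids)
    if not intruder:
        return hybrid_ids           # no intruder → never fire
    for s in vec_ids[:5]:
        if s in rest[:7]:           # displaced to hybrid rank 6-12
            out = [x for x in hybrid_ids if x != s]
            out.insert(4, s)        # inject back at position 5
            return out
    return hybrid_ids

vec = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
hyb = ["b", "c", "d", "i", "e", "a", "f", "g", "h", "j"]  # "i" intruded, "a" displaced
fixed = preserve_rank(hyb, vec)
```

Gating on the intruder condition is what keeps the rule from firing when the hybrid reordering was legitimate.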
```bash
cd retaining
python3 -m venv .venv && source .venv/bin/activate
pip install spacy chromadb
python -m spacy download en_core_web_sm

# Download LongMemEval data
curl -fsSL -o /tmp/longmemeval_s_cleaned.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json

# Full benchmark (v2 — pure-Python, no SWI-Prolog needed)
python bench_v2.py /tmp/longmemeval_s_cleaned.json --mode vector   # 96.6% R@5, ~5 min
python bench_v2.py /tmp/longmemeval_s_cleaned.json --mode hybrid   # 100.0% R@5, ~50 min

# Legacy v1 (requires SWI-Prolog + pyswip)
python bench.py /tmp/longmemeval_s_cleaned.json --mode hybrid      # 98.8% R@5
```

Requirements: Python 3.9+, ~300MB for data. No API keys, no GPU, no SWI-Prolog (v2 only).
| File | Purpose |
|---|---|
| `bench_v2.py` | Main benchmark — v2 hybrid retrieval (pure-Python logic) |
| `logic_engine.py` | Pure-Python logic KB — inverted-index fact matching |
| `nlp_extract.py` | spaCy NER/POS/dep pipeline → structured facts |
| `bench.py` | Legacy v1 benchmark (requires SWI-Prolog) |
| `prolog_kb.py` | Legacy SWI-Prolog KB (v1 only) |
| `ablation.py` | Full ablation study: 16 configurations × 500 questions |
The #1 finding from this experiment: Prolog/logic engines are NOT the value-add. The Prolog-alone mode scored only 68.0% R@5 — far below the 96.6% vector-only baseline. What matters is structured NLP extraction enriching the retrieval documents. The logic engine's own contribution is modest (~0.4pp); the real wins come from:
- NER-enriched synthetic documents (gives vector search richer surface)
- Temporal date arithmetic (calendar math embeddings can't do)
- Noun-phrase embedding bridge (connects category-level queries to specific instances)
- Rank preservation (prevents boost-driven displacement of correct answers)
The v2 pure-Python logic engine replaces Prolog without loss of accuracy, proving the "logic programming" framing was architecturally irrelevant — it's really just inverted-index fact matching with weighted scoring.
