fix(entity-resolver): stamp cooccurrences with event_date, not now() #1247
aliu-ronin wants to merge 1 commit into vectorize-io:main
Conversation
`entity_cooccurrences.last_cooccurred` was always set to `datetime.now(UTC)` at flush time. For real-time retains that's fine — event time ≈ ingest time — but any corpus **backfilled in a single session** (for example, migrating from another memory system) collapses every co-occurrence onto the import moment. The dashboard's entity-graph recency heat then shows a one-or-two-day range regardless of how far the underlying knowledge actually spans, and downstream consumers of the column lose the timeline dimension entirely.

The tuples flowing into `_link_units_to_entities_batch_impl` already carried the per-unit `fact_date` alongside `(unit_id, entity_id)` — it was just being discarded at the call site (the `_fact_date` underscore). This change wires the event date through:

- `_CooccurrencePair` grows an `event_date` field.
- `link_units_to_entities_batch` accepts both the legacy `(unit_id, entity_id)` tuples and the new `(unit_id, entity_id, event_date)` form, so external callers aren't forced to migrate in lockstep.
- `_link_units_to_entities_batch_impl` builds a per-unit event-date map and attaches the unit's date to every co-occurrence pair emitted from that unit.
- `flush_pending_stats` aggregates per-pair event dates and INSERTs the observed maximum, falling back to `now()` only when no event date was carried (preserving the pre-fix semantics for real-time retains).
- Both in-repo callers (`retain/orchestrator.py` and `retain/link_utils.py`) pass the `fact_date` they were already holding.

A new Alembic migration repairs historical rows by recomputing `last_cooccurred` from `MAX(COALESCE(mentioned_at, occurred_start, created_at))` over `unit_entities × memory_units`, so operators don't have to run a manual backfill to see the fix in their dashboards.

Regression coverage added in `test_entity_resolver.py` asserts a historical `event_date` survives the link → flush round-trip.
End-to-end verification on a running instance

Deployed the writer patch to a live Hindsight container (v1.x image, same codebase locations as this PR targets), then retained a test memory with a historical timestamp:

POST /v1/default/banks/main/memories
{
  "items": [{
    "content": "In June 2024, Fitzgerald Blackwood met Gwendolyn Stormrider at the Reykjavik summit to discuss the Chronos Protocol.",
    "timestamp": "2024-06-15T12:00:00Z"
  }]
}

Result in the DB:
All 4 fresh entity pairs —
nicoloboschi left a comment
Good fix — well-motivated, well-documented PR. The core change is sound and the risk is low since last_cooccurred is only consumed by the entity graph visualization endpoint (tiebreaker in ORDER BY + UI recency heat), not by the recall/entity-resolution path at all.
A few items below, one blocker on the migration chain.
revision: str = "b5d4e3f2a1c9"
down_revision: str | Sequence[str] | None = "z1u2v3w4x5y6"
branch_labels: str | Sequence[str] | None = None
Blocker: down_revision creates a divergent branch.
z1u2v3w4x5y6 already has two children (a2v3w4x5y6z7 and aa2b3c4d5e6f). Adding this as a third child will break the migration graph — Alembic will refuse to run with multiple unmerged heads.
The current chain head on main is i4j5k6l7m8n9. This migration needs to be re-parented to descend from that (or whatever the head is at merge time).
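For intuition, the multiple-heads failure can be reproduced on a toy revision graph (a simplified sketch, not Alembic's actual data structures; the revision ids are the ones from this thread):

```python
def find_heads(down_revisions):
    """down_revisions maps revision -> its down_revision (parent).

    A head is any revision that no other revision names as its parent.
    Alembic refuses a plain `alembic upgrade head` when more than one
    unmerged head exists.
    """
    parents = set(down_revisions.values())
    return sorted(r for r in down_revisions if r not in parents)

# z1u2v3w4x5y6 already has two children; adding b5d4e3f2a1c9 as a third
# leaves three unmerged heads in this simplified graph.
graph = {
    "z1u2v3w4x5y6": None,
    "a2v3w4x5y6z7": "z1u2v3w4x5y6",
    "aa2b3c4d5e6f": "z1u2v3w4x5y6",
    "b5d4e3f2a1c9": "z1u2v3w4x5y6",  # this PR's migration, as submitted
}
print(find_heads(graph))
```

Re-parenting `b5d4e3f2a1c9` under the current chain head collapses the graph back to a single head.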
    GROUP BY 1, 2
) sub
WHERE ec.entity_id_1 = sub.e1 AND ec.entity_id_2 = sub.e2
"""
This self-join (ue1 × ue2 on the same unit_id) is O(n²) per unit in the number of entities. For banks with many entities per unit this could be slow. Worth adding a comment warning operators about expected runtime on large banks, or consider batching by bank_id with a WHERE clause so the UPDATE doesn't lock the whole table at once.
Not blocking since it's a one-time migration, but something to be aware of.
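One shape the per-bank batching could take (a sqlite-backed sketch of the idea only — the table layout, column names, and the constant date are simplified assumptions, not the real migration's SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_cooccurrences (
        bank_id TEXT, entity_id_1 TEXT, entity_id_2 TEXT, last_cooccurred TEXT
    );
    INSERT INTO entity_cooccurrences VALUES
        ('bank_a', 'e1', 'e2', NULL),
        ('bank_b', 'e3', 'e4', NULL);
""")

# One UPDATE per bank: each statement touches (and locks) only that bank's
# rows, instead of one long transaction holding the whole table.
banks = [r[0] for r in conn.execute(
    "SELECT DISTINCT bank_id FROM entity_cooccurrences")]
for bank_id in banks:
    conn.execute(
        "UPDATE entity_cooccurrences SET last_cooccurred = ? WHERE bank_id = ?",
        ("2026-03-30T00:00:00Z", bank_id))
    conn.commit()  # release locks between batches

print(conn.execute(
    "SELECT COUNT(*) FROM entity_cooccurrences "
    "WHERE last_cooccurred IS NOT NULL").fetchone()[0])
```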
# callers surface `None`, which must not clobber a real datetime from another
# caller for the same unit.
_SENTINEL_MISSING: typing.Any = object()
|
|
Nit: `typing.Any` defeats static analysis — any comparison with the sentinel will type-check as valid even if surrounding code regresses. Consider `_SENTINEL_MISSING: typing.Final = object()` instead, which preserves the intent and gives the type checker something to work with.
Also, `import typing` was added solely for this — the existing codebase uses `from typing import ...`. A `from typing import Final` would be more consistent.
latest = c.event_date
coo_agg[pair] = (prev_count + 1, latest)
|
|
now = datetime.now(UTC)
Nit: the "keep latest non-None `event_date`" logic appears both here (folding `coo_agg` in `flush_pending_stats`) and in `_link_units_to_entities_batch_impl` (building `cooccurrence_pairs`). Both do the same `if event_date is not None and (prev is None or event_date > prev)` pattern. A small helper like `_later_date(a, b) -> datetime | None` would reduce the surface area for the two copies diverging.
if entity_id_1 == entity_id_2:
    continue
# Ensure consistent ordering (entity_id_1 < entity_id_2)
# Canonical ordering (entity_id_1 < entity_id_2) matches the
The first two branches are identical:

if prev is _SENTINEL_MISSING:
    cooccurrence_pairs[key] = event_date
elif prev is None:
    cooccurrence_pairs[key] = event_date  # same

Could simplify to:

if prev is _SENTINEL_MISSING or prev is None:
    cooccurrence_pairs[key] = event_date
elif event_date is not None and event_date > prev:
    cooccurrence_pairs[key] = event_date

Minor — the current form is arguably more explicit about intent, so up to you.
    assert row["last_cooccurred"] == historical, (
        f"expected last_cooccurred == {historical}, got {row['last_cooccurred']}"
    )
finally:
The finally block cleans up memory_units and entities but doesn't explicitly delete from unit_entities or entity_cooccurrences. If the FK on entity_cooccurrences → entities isn't ON DELETE CASCADE, the DELETE FROM entities will fail and the cooccurrence row leaks. Even if cascades handle it, explicitly cleaning up entity_cooccurrences and unit_entities first (in the right order) would be safer and make the teardown self-documenting.
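A sketch of child-first teardown against a simplified sqlite schema (the table names come from the PR, but the columns and FK layout here are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE entities (id TEXT PRIMARY KEY);
    CREATE TABLE entity_cooccurrences (
        entity_id_1 TEXT REFERENCES entities(id),
        entity_id_2 TEXT REFERENCES entities(id)
    );
    INSERT INTO entities VALUES ('e1'), ('e2');
    INSERT INTO entity_cooccurrences VALUES ('e1', 'e2');
""")

# Deleting parents first fails when the FK is not ON DELETE CASCADE:
try:
    conn.execute("DELETE FROM entities")
except sqlite3.IntegrityError as exc:
    print("parent-first delete rejected:", exc)

# Children first, then parents, succeeds regardless of cascade settings
# and makes the teardown order self-documenting.
conn.execute("DELETE FROM entity_cooccurrences")
conn.execute("DELETE FROM entities")
```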
Summary

`entity_cooccurrences.last_cooccurred` is always set to `datetime.now(UTC)` at flush time, regardless of the source memory unit's event date. For real-time retains this is fine (ingest time ≈ event time), but any bank backfilled in a single session — e.g. migrating from another memory system — collapses every co-occurrence onto the import moment. The control plane's entity-graph recency heat then shows a one-or-two-day range regardless of how far the underlying knowledge actually spans.

Example: on a bank migrated on 2026-04-24, every edge's `lastCooccurred` was `2026-04-23` or `2026-04-24`, even though the source memories' `mentioned_at`/`occurred_start` correctly reflect the original 2026-03 timeline.

Interestingly, the tuples flowing into `_link_units_to_entities_batch_impl` already carried the per-unit `fact_date` alongside `(unit_id, entity_id)` — it was just being discarded at the call site (the `_fact_date` underscore). This change wires it through.

Changes

Writer path

- `_CooccurrencePair` grows an `event_date: datetime | None` field.
- `link_units_to_entities_batch` accepts both the legacy `(unit_id, entity_id)` tuples and the new `(unit_id, entity_id, event_date)` form, so external callers aren't forced to migrate in lockstep.
- `_link_units_to_entities_batch_impl` builds a per-unit event-date map and attaches the unit's date to every co-occurrence pair emitted from that unit. A `_SENTINEL_MISSING` guard prevents a duplicate legacy-caller row (`event_date=None`) from clobbering a real datetime from another caller on the same unit.
- `flush_pending_stats` aggregates per-pair event dates and `INSERT`s the observed maximum. It falls back to `datetime.now(UTC)` only when no event date was carried, so real-time retains preserve their pre-fix semantics exactly.

In-repo callers

- `retain/orchestrator.py:290` — pass `fact_date` (was already in the tuple as `_fact_date`).
- `retain/link_utils.py:429` — same.

Backfill migration

`b5d4e3f2a1c9_backfill_entity_cooccurrences_event_time.py` recomputes `last_cooccurred` from `MAX(COALESCE(mentioned_at, occurred_start, created_at))` over `unit_entities × memory_units`, so operators don't have to run a manual SQL backfill to see the fix on their existing data. The downgrade path is a no-op — the pre-fix value was `now()` at the time of write and isn't recoverable, and rolling the code back is enough to revert behavior for subsequent retains.

Test plan

- `test_entity_resolver.py`: inserts a memory unit with a historical `mentioned_at`, calls `link_units_to_entities_batch` + `flush_pending_stats`, asserts `entity_cooccurrences.last_cooccurred == historical`.
- `python -m py_compile` passes on the three touched engine files, the migration, and the test file.
- … `now()`, matching pre-fix behavior).

Repro (before)
After (with migration applied)
# "2026-03-30T..." / "2026-03-29T..." / ...