fix(entity-resolver): stamp cooccurrences with event_date, not now()#1247

Open
aliu-ronin wants to merge 1 commit into vectorize-io:main from aliu-ronin:fix/entity-cooccurrences-event-time

Conversation

@aliu-ronin (Contributor)

Summary

entity_cooccurrences.last_cooccurred is always set to datetime.now(UTC) at flush time, regardless of the source memory unit's event date. For real-time retains this is fine (ingest time ≈ event time), but any bank backfilled in a single session — e.g. migrating from another memory system — collapses every co-occurrence onto the import moment. The control plane's entity-graph recency heat then shows a one-or-two-day range regardless of how far the underlying knowledge actually spans.

Example: on a bank migrated on 2026-04-24, every edge's lastCooccurred was 2026-04-23 or 2026-04-24, even though the source memories' mentioned_at / occurred_start correctly reflect the original 2026-03 timeline.

Interestingly, the tuples flowing into _link_units_to_entities_batch_impl already carried the per-unit fact_date alongside (unit_id, entity_id) — it was just being discarded at the call site (bound as _fact_date, the leading underscore marking it unused). This change wires it through.

Changes

Writer path

  • _CooccurrencePair grows an event_date: datetime | None field.
  • link_units_to_entities_batch accepts both the legacy (unit_id, entity_id) tuples and the new (unit_id, entity_id, event_date) form, so external callers aren't forced to migrate in lockstep.
  • _link_units_to_entities_batch_impl builds a per-unit event-date map and attaches the unit's date to every co-occurrence pair emitted from that unit. A _SENTINEL_MISSING guard prevents a duplicate legacy-caller row (event_date=None) from clobbering a real datetime from another caller on the same unit.
  • flush_pending_stats aggregates per-pair event dates and INSERTs the observed maximum. It falls back to datetime.now(UTC) only when no event date was carried, so real-time retains preserve their pre-fix semantics exactly.

In-repo callers

  • retain/orchestrator.py:290 — pass fact_date (was already in the tuple as _fact_date).
  • retain/link_utils.py:429 — same.

Backfill migration

b5d4e3f2a1c9_backfill_entity_cooccurrences_event_time.py recomputes last_cooccurred from MAX(COALESCE(mentioned_at, occurred_start, created_at)) over unit_entities × memory_units, so operators don't have to run a manual SQL backfill to see the fix on their existing data. The downgrade path is a no-op — the pre-fix value was now() at the time of write and isn't recoverable, and rolling the code back is enough to revert behavior for subsequent retains.
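The recompute's COALESCE/MAX semantics, restated as a small Python sketch (the actual migration does this in SQL; function and variable names here are illustrative only):

```python
from datetime import datetime, timezone

def effective_event_date(mentioned_at, occurred_start, created_at):
    # Mirrors COALESCE(mentioned_at, occurred_start, created_at):
    # first non-NULL wins, with ingest time (created_at) as last resort.
    for candidate in (mentioned_at, occurred_start, created_at):
        if candidate is not None:
            return candidate
    return None

ingest = datetime(2026, 4, 24, tzinfo=timezone.utc)
event = datetime(2026, 3, 15, tzinfo=timezone.utc)

# Unit with a real event timestamp: ingest time is ignored.
assert effective_event_date(event, None, ingest) == event
# Unit with no event metadata: falls back to created_at.
assert effective_event_date(None, None, ingest) == ingest

# last_cooccurred for a pair is then the MAX over all units linking it.
unit_dates = [event, datetime(2026, 3, 30, tzinfo=timezone.utc)]
assert max(unit_dates) == datetime(2026, 3, 30, tzinfo=timezone.utc)
```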

Test plan

  • New regression in test_entity_resolver.py: inserts a memory unit with a historical mentioned_at, calls link_units_to_entities_batch + flush_pending_stats, asserts entity_cooccurrences.last_cooccurred == historical.
  • python -m py_compile passes on the three touched engine files, the migration, and the test file.
  • Backward compat: passing the old two-tuple shape still works (falls back to now(), matching pre-fix behavior).
  • Manually verified on a migrated bank: running the migration moves the entity-graph recency heat range from "one-day spike" back to the true multi-week event-time span.

Repro (before)

```shell
curl -s "http://localhost:8000/v1/default/banks/main/entities/graph?limit=3" \
  | jq '.edges[].data.lastCooccurred'
# "2026-04-24T06:47:08+00:00"
# "2026-04-24T06:45:12+00:00"
# "2026-04-24T06:39:02+00:00"   ← all today, despite 03-xx memories
```

After (with migration applied):

```shell
# "2026-03-30T..." / "2026-03-29T..." / ...
```

@aliu-ronin (Contributor, Author)

End-to-end verification on a running instance

Deployed the writer patch to a live Hindsight container (v1.x image, same codebase locations as this PR targets), then retained a test memory with a historical timestamp:

POST /v1/default/banks/main/memories

```json
{
  "items": [{
    "content": "In June 2024, Fitzgerald Blackwood met Gwendolyn Stormrider at the Reykjavik summit to discuss the Chronos Protocol.",
    "timestamp": "2024-06-15T12:00:00Z"
  }]
}
```

Result in the DB:

| field | value |
| --- | --- |
| memory_units.mentioned_at | 2024-06-15 12:00+00 (matches timestamp) |
| memory_units.occurred_start | 2024-06-15 00:00+00 (extracted by the LLM as the event day) |
| memory_units.created_at | 2026-04-24 08:35+00 (ingest moment) |
| entity_cooccurrences.last_cooccurred | 2024-06-15 00:00+00 (event time, not ingest time) |

All 4 fresh entity pairs — (Fitzgerald Blackwood, Gwendolyn Stormrider), (Fitzgerald Blackwood, Chronos Protocol), (Fitzgerald Blackwood, Reykjavik summit), (Gwendolyn Stormrider, Chronos Protocol) — carry last_cooccurred = 2024-06-15. Without this patch they would all be 2026-04-24 (the ingest moment).

@nicoloboschi (Collaborator) left a comment

Good fix — well-motivated, well-documented PR. The core change is sound and the risk is low since last_cooccurred is only consumed by the entity graph visualization endpoint (tiebreaker in ORDER BY + UI recency heat), not by the recall/entity-resolution path at all.

A few items below, one blocker on the migration chain.


revision: str = "b5d4e3f2a1c9"
down_revision: str | Sequence[str] | None = "z1u2v3w4x5y6"
branch_labels: str | Sequence[str] | None = None

Blocker: down_revision creates a divergent branch.

z1u2v3w4x5y6 already has two children (a2v3w4x5y6z7 and aa2b3c4d5e6f). Adding this as a third child will break the migration graph — Alembic will refuse to run with multiple unmerged heads.

The current chain head on main is i4j5k6l7m8n9. This migration needs to be re-parented to descend from that (or whatever the head is at merge time).

GROUP BY 1, 2
) sub
WHERE ec.entity_id_1 = sub.e1 AND ec.entity_id_2 = sub.e2
"""

This self-join (ue1 × ue2 on the same unit_id) is O(n²) per unit in the number of entities. For banks with many entities per unit this could be slow. Worth adding a comment warning operators about expected runtime on large banks, or consider batching by bank_id with a WHERE clause so the UPDATE doesn't lock the whole table at once.

Not blocking since it's a one-time migration, but something to be aware of.
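One possible shape for the per-bank batching (a sketch, not a drop-in: it assumes memory_units carries a bank_id column, and `execute` stands in for whatever DB handle the migration has in scope; table and column names follow the migration):

```python
# Hypothetical per-bank batching of the backfill UPDATE. Each statement
# touches only one bank's rows, keeping lock scope and runtime bounded.
BANK_BATCH_SQL = """
UPDATE entity_cooccurrences AS ec
SET last_cooccurred = sub.max_event
FROM (
    SELECT ue1.entity_id AS e1, ue2.entity_id AS e2,
           MAX(COALESCE(mu.mentioned_at, mu.occurred_start, mu.created_at))
               AS max_event
    FROM unit_entities ue1
    JOIN unit_entities ue2
      ON ue1.unit_id = ue2.unit_id AND ue1.entity_id < ue2.entity_id
    JOIN memory_units mu ON mu.id = ue1.unit_id
    WHERE mu.bank_id = :bank_id
    GROUP BY 1, 2
) sub
WHERE ec.entity_id_1 = sub.e1 AND ec.entity_id_2 = sub.e2
"""

def backfill_in_batches(execute, bank_ids):
    # `execute(sql, params)` is the migration's DB handle (e.g. an
    # Alembic connection); one UPDATE per bank instead of one big one.
    for bank_id in bank_ids:
        execute(BANK_BATCH_SQL, {"bank_id": bank_id})
```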

# callers surface `None`, which must not clobber a real datetime from another
# caller for the same unit.
_SENTINEL_MISSING: typing.Any = object()


Nit: typing.Any defeats static analysis — any comparison with the sentinel will type-check as valid even if surrounding code regresses. Consider _SENTINEL_MISSING: typing.Final = object() instead, which preserves the intent and gives the type checker something to work with.

Also, import typing was added solely for this — the existing codebase uses from typing import .... A from typing import Final would be more consistent.

latest = c.event_date
coo_agg[pair] = (prev_count + 1, latest)

now = datetime.now(UTC)

Nit: the "keep latest non-None event_date" logic appears both here (folding coo_agg in flush_pending_stats) and in _link_units_to_entities_batch_impl (building cooccurrence_pairs). Both do the same if event_date is not None and (prev is None or event_date > prev) pattern. A small helper like _later_date(a, b) -> datetime | None would reduce the surface area for the two copies diverging.

if entity_id_1 == entity_id_2:
    continue
# Ensure consistent ordering (entity_id_1 < entity_id_2)
# Canonical ordering (entity_id_1 < entity_id_2) matches the

The first two branches are identical:

if prev is _SENTINEL_MISSING:
    cooccurrence_pairs[key] = event_date
elif prev is None:
    cooccurrence_pairs[key] = event_date  # same

Could simplify to:

if prev is _SENTINEL_MISSING or prev is None:
    cooccurrence_pairs[key] = event_date
elif event_date is not None and event_date > prev:
    cooccurrence_pairs[key] = event_date

Minor — the current form is arguably more explicit about intent, so up to you.

    assert row["last_cooccurred"] == historical, (
        f"expected last_cooccurred == {historical}, got {row['last_cooccurred']}"
    )
finally:

The finally block cleans up memory_units and entities but doesn't explicitly delete from unit_entities or entity_cooccurrences. If the FK from entity_cooccurrences to entities isn't ON DELETE CASCADE, the DELETE FROM entities will fail and the co-occurrence row leaks. Even if cascades handle it, explicitly cleaning up entity_cooccurrences and unit_entities first (in that order) would be safer and make the teardown self-documenting.
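A possible FK-safe order, child tables first (assumes the FK layout described above):

```python
# Hypothetical teardown order: delete referencing rows before the rows
# they reference, so cleanup works with or without ON DELETE CASCADE.
TEARDOWN_ORDER = [
    "entity_cooccurrences",  # references entities
    "unit_entities",         # references memory_units and entities
    "memory_units",
    "entities",
]
```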
