fix(entity-resolver): stamp cooccurrences with event_date, not now()#1247

Open
aliu-ronin wants to merge 1 commit into vectorize-io:main from aliu-ronin:fix/entity-cooccurrences-event-time

Conversation

@aliu-ronin (Contributor)

Summary

entity_cooccurrences.last_cooccurred is always set to datetime.now(UTC) at flush time, regardless of the source memory unit's event date. For real-time retains this is fine (ingest time ≈ event time), but any bank backfilled in a single session — e.g. migrating from another memory system — collapses every co-occurrence onto the import moment. The control plane's entity-graph recency heat then shows a one-or-two-day range regardless of how far the underlying knowledge actually spans.

Example: on a bank migrated on 2026-04-24, every edge's lastCooccurred was 2026-04-23 or 2026-04-24, even though the source memories' mentioned_at / occurred_start correctly reflect the original 2026-03 timeline.

Interestingly, the tuples flowing into _link_units_to_entities_batch_impl already carried the per-unit fact_date alongside (unit_id, entity_id) — it was just being discarded at the call site (bound as _fact_date, the leading underscore marking it unused). This change wires it through.

Changes

Writer path

  • _CooccurrencePair grows an event_date: datetime | None field.
  • link_units_to_entities_batch accepts both the legacy (unit_id, entity_id) tuples and the new (unit_id, entity_id, event_date) form, so external callers aren't forced to migrate in lockstep.
  • _link_units_to_entities_batch_impl builds a per-unit event-date map and attaches the unit's date to every co-occurrence pair emitted from that unit. A _SENTINEL_MISSING guard prevents a duplicate legacy-caller row (event_date=None) from clobbering a real datetime from another caller on the same unit.
  • flush_pending_stats aggregates per-pair event dates and INSERTs the observed maximum. It falls back to datetime.now(UTC) only when no event date was carried, so real-time retains preserve their pre-fix semantics exactly.

In-repo callers

  • retain/orchestrator.py:290 — pass fact_date (was already in the tuple as _fact_date).
  • retain/link_utils.py:429 — same.

Backfill migration

b5d4e3f2a1c9_backfill_entity_cooccurrences_event_time.py recomputes last_cooccurred from MAX(COALESCE(mentioned_at, occurred_start, created_at)) over unit_entities × memory_units, so operators don't have to run a manual SQL backfill to see the fix on their existing data. The downgrade path is a no-op — the pre-fix value was now() at the time of write and isn't recoverable, and rolling the code back is enough to revert behavior for subsequent retains.
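The recompute's COALESCE/MAX semantics, restated as a small Python sketch (the actual migration does this in SQL; function and variable names here are illustrative only):

```python
from datetime import datetime, timezone

def effective_event_date(mentioned_at, occurred_start, created_at):
    # Mirrors COALESCE(mentioned_at, occurred_start, created_at):
    # first non-NULL wins, with ingest time (created_at) as last resort.
    for candidate in (mentioned_at, occurred_start, created_at):
        if candidate is not None:
            return candidate
    return None

ingest = datetime(2026, 4, 24, tzinfo=timezone.utc)
event = datetime(2026, 3, 15, tzinfo=timezone.utc)

# Unit with a real event timestamp: ingest time is ignored.
assert effective_event_date(event, None, ingest) == event
# Unit with no event metadata: falls back to created_at.
assert effective_event_date(None, None, ingest) == ingest

# last_cooccurred for a pair is then the MAX over all units linking it.
unit_dates = [event, datetime(2026, 3, 30, tzinfo=timezone.utc)]
assert max(unit_dates) == datetime(2026, 3, 30, tzinfo=timezone.utc)
```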

Test plan

  • New regression in test_entity_resolver.py: inserts a memory unit with a historical mentioned_at, calls link_units_to_entities_batch + flush_pending_stats, asserts entity_cooccurrences.last_cooccurred == historical.
  • python -m py_compile passes on the three touched engine files, the migration, and the test file.
  • Backward compat: passing the old two-tuple shape still works (falls back to now(), matching pre-fix behavior).
  • Manually verified on a migrated bank: running the migration moves the entity-graph recency heat range from "one-day spike" back to the true multi-week event-time span.

Repro (before)

```shell
curl -s "http://localhost:8000/v1/default/banks/main/entities/graph?limit=3" \
  | jq '.edges[].data.lastCooccurred'
# "2026-04-24T06:47:08+00:00"
# "2026-04-24T06:45:12+00:00"
# "2026-04-24T06:39:02+00:00"   ← all today, despite 03-xx memories
```

After (with migration applied):

```shell
# "2026-03-30T..." / "2026-03-29T..." / ...
```

@aliu-ronin (Contributor, Author)

End-to-end verification on a running instance

Deployed the writer patch to a live Hindsight container (v1.x image, same codebase locations as this PR targets), then retained a test memory with a historical timestamp:

POST /v1/default/banks/main/memories

```json
{
  "items": [{
    "content": "In June 2024, Fitzgerald Blackwood met Gwendolyn Stormrider at the Reykjavik summit to discuss the Chronos Protocol.",
    "timestamp": "2024-06-15T12:00:00Z"
  }]
}
```

Result in the DB:

| field | value |
| --- | --- |
| memory_units.mentioned_at | 2024-06-15 12:00+00 (matches timestamp) |
| memory_units.occurred_start | 2024-06-15 00:00+00 (extracted by the LLM as the event day) |
| memory_units.created_at | 2026-04-24 08:35+00 (ingest moment) |
| entity_cooccurrences.last_cooccurred | 2024-06-15 00:00+00 (event time, not ingest time) |

All 4 fresh entity pairs — (Fitzgerald Blackwood, Gwendolyn Stormrider), (Fitzgerald Blackwood, Chronos Protocol), (Fitzgerald Blackwood, Reykjavik summit), (Gwendolyn Stormrider, Chronos Protocol) — carry last_cooccurred = 2024-06-15. Without this patch they would all be 2026-04-24 (the ingest moment).

@nicoloboschi (Collaborator) left a comment

Good fix — well-motivated, well-documented PR. The core change is sound and the risk is low since last_cooccurred is only consumed by the entity graph visualization endpoint (tiebreaker in ORDER BY + UI recency heat), not by the recall/entity-resolution path at all.

A few items below, one blocker on the migration chain.


revision: str = "b5d4e3f2a1c9"
down_revision: str | Sequence[str] | None = "z1u2v3w4x5y6"
branch_labels: str | Sequence[str] | None = None

Blocker: down_revision creates a divergent branch.

z1u2v3w4x5y6 already has two children (a2v3w4x5y6z7 and aa2b3c4d5e6f). Adding this as a third child will break the migration graph — Alembic will refuse to run with multiple unmerged heads.

The current chain head on main is i4j5k6l7m8n9. This migration needs to be re-parented to descend from that (or whatever the head is at merge time).

GROUP BY 1, 2
) sub
WHERE ec.entity_id_1 = sub.e1 AND ec.entity_id_2 = sub.e2
"""

This self-join (ue1 × ue2 on the same unit_id) is O(n²) per unit in the number of entities. For banks with many entities per unit this could be slow. Worth adding a comment warning operators about expected runtime on large banks, or consider batching by bank_id with a WHERE clause so the UPDATE doesn't lock the whole table at once.

Not blocking since it's a one-time migration, but something to be aware of.
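One possible shape for the per-bank batching (a sketch, not a drop-in: it assumes memory_units carries a bank_id column, and `execute` stands in for whatever DB handle the migration has in scope; table and column names follow the migration):

```python
# Hypothetical per-bank batching of the backfill UPDATE. Each statement
# touches only one bank's rows, keeping lock scope and runtime bounded.
BANK_BATCH_SQL = """
UPDATE entity_cooccurrences AS ec
SET last_cooccurred = sub.max_event
FROM (
    SELECT ue1.entity_id AS e1, ue2.entity_id AS e2,
           MAX(COALESCE(mu.mentioned_at, mu.occurred_start, mu.created_at))
               AS max_event
    FROM unit_entities ue1
    JOIN unit_entities ue2
      ON ue1.unit_id = ue2.unit_id AND ue1.entity_id < ue2.entity_id
    JOIN memory_units mu ON mu.id = ue1.unit_id
    WHERE mu.bank_id = :bank_id
    GROUP BY 1, 2
) sub
WHERE ec.entity_id_1 = sub.e1 AND ec.entity_id_2 = sub.e2
"""

def backfill_in_batches(execute, bank_ids):
    # `execute(sql, params)` is the migration's DB handle (e.g. an
    # Alembic connection); one UPDATE per bank instead of one big one.
    for bank_id in bank_ids:
        execute(BANK_BATCH_SQL, {"bank_id": bank_id})
```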

# callers surface `None`, which must not clobber a real datetime from another
# caller for the same unit.
_SENTINEL_MISSING: typing.Any = object()


Nit: typing.Any defeats static analysis — any comparison with the sentinel will type-check as valid even if surrounding code regresses. Consider _SENTINEL_MISSING: typing.Final = object() instead, which preserves the intent and gives the type checker something to work with.

Also, import typing was added solely for this — the existing codebase uses from typing import .... A from typing import Final would be more consistent.

latest = c.event_date
coo_agg[pair] = (prev_count + 1, latest)

now = datetime.now(UTC)

Nit: the "keep latest non-None event_date" logic appears both here (folding coo_agg in flush_pending_stats) and in _link_units_to_entities_batch_impl (building cooccurrence_pairs). Both do the same if event_date is not None and (prev is None or event_date > prev) pattern. A small helper like _later_date(a, b) -> datetime | None would reduce the surface area for the two copies diverging.

if entity_id_1 == entity_id_2:
    continue
# Ensure consistent ordering (entity_id_1 < entity_id_2)
# Canonical ordering (entity_id_1 < entity_id_2) matches the

The first two branches are identical:

if prev is _SENTINEL_MISSING:
    cooccurrence_pairs[key] = event_date
elif prev is None:
    cooccurrence_pairs[key] = event_date  # same

Could simplify to:

if prev is _SENTINEL_MISSING or prev is None:
    cooccurrence_pairs[key] = event_date
elif event_date is not None and event_date > prev:
    cooccurrence_pairs[key] = event_date

Minor — the current form is arguably more explicit about intent, so up to you.

    assert row["last_cooccurred"] == historical, (
        f"expected last_cooccurred == {historical}, got {row['last_cooccurred']}"
    )
finally:

The finally block cleans up memory_units and entities but doesn't explicitly delete from unit_entities or entity_cooccurrences. If the FK from entity_cooccurrences to entities isn't ON DELETE CASCADE, the DELETE FROM entities will fail and the co-occurrence row leaks. Even if cascades handle it, explicitly cleaning up entity_cooccurrences and unit_entities first (in that order) would be safer and make the teardown self-documenting.
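A possible FK-safe order, child tables first (assumes the FK layout described above):

```python
# Hypothetical teardown order: delete referencing rows before the rows
# they reference, so cleanup works with or without ON DELETE CASCADE.
TEARDOWN_ORDER = [
    "entity_cooccurrences",  # references entities
    "unit_entities",         # references memory_units and entities
    "memory_units",
    "entities",
]
```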
