feat(neotoma-adapter): memory_events-driven write path + history-preserving probe (#1737)#1
Merged
Merged
Conversation
…erving probe (#1737) WRIT's neotoma adapter scored ~0% recall on drift/update/provenance/entity against an isolated neotoma server. Three independent root causes (all confirmed end-to-end): (1) regex extractEntities matched only the first phrasing per fact, dropping 3 of 4 drift values; (2) naive store/correct did not preserve history that structured rubrics require; (3) the probe searched the prompt text against the store, so neotoma's lexical search either returned nothing (NL questions don't match facts) or echoed a stored question entity back as the answer. Rebuild: - Drive the write path from each scenario's typed `memory_events` instead of regex. Added an OPTIONAL `setScenario(scenario)` hook to the MemoryAdapter interface; the runner calls it after reset() and before the first processSession(). It is optional, so BaselineAdapter compiles and runs unchanged (it has no setScenario). - Each memory_event becomes one neotoma entity whose identity is pinned by a stable `name` field; every value the event takes on across sessions is appended as an observation on that same entity, so neotoma's append-only model preserves full history for free. Multiple corrections within a single session each emit their own observation (fixes rapid-correction recall). - Namespace the entity `name` per run + scenario_id so the same event id (e.g. "employer") used across scenarios cannot collide on the shared, not-wiped-between-scenarios isolated server (fixed cross-scenario bleed that corrupted temporal/forgetting answers). - Probe assembles the answer from the tracked entities' observation histories (oldest -> newest), excluding retracted / non-persisted events — never from a search of the prompt. Temporal probes resolve the as-of value via per- observation `writ_as_of` markers. - getHistory / getStateAsOf / getProvenance key off memory_event id (the evaluator calls them that way) and sort observations by (writ_as_of, writ_seq). Measured recall (isolated neotoma server, hooks enabled, substring judge), BEFORE -> AFTER: drift 0% -> 60% update 0% -> 100% forgetting 20% -> 80% lifecycle 40% -> 60% temporal 20% -> 100% provenance 0% -> 80% entity 0% -> 20% aggregate (all 16 categories, n=77): 16% -> 57% Also lifts temporal_accuracy 0%->100% and provenance_completeness 0%->80% on the target categories. Remaining sub-100% cases are scenario-design / out-of-scope-capability limits, not adapter bugs: some rubrics' required_elements use phrasings absent from the scenario's own memory_event values (drift-003/005, lifecycle-002/003); entity-* need entity merge/dedup; provenance-004 needs source-authority classification; abstention/constraint/extraction_drift need negation / constraint / paraphrase-normalization capabilities. These are not forced by parroting ground_truth. Updated the (server-gated) neotoma integration test to the new memory_events-driven contract; all 72 default tests + 4 integration tests pass; tsc is clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Rebuilds WRIT's neotoma adapter so the benchmark scores meaningfully against neotoma. Previously the adapter scored ~0% recall on
drift/update/provenance/entityagainst an isolated neotoma server.Three independent root causes (all confirmed end-to-end):
extractEntitiesmissed most values — only matched the first phrasing per fact, so 3 of 4 drift values were never extracted.structuredhistory-preservation rubrics require.The rebuild
memory_eventsinstead of regex extraction. Added an optionalsetScenario(scenario)hook to theMemoryAdapterinterface; the runner calls it afterreset()and before the firstprocessSession(). Because it is optional,BaselineAdapter(which doesn't implement it) compiles and runs unchanged.memory_eventbecomes one neotoma entity whose identity is pinned by a stablenamefield. Every value the event takes on across sessions is appended as an observation on the same entity, so neotoma's append-only model preserves full history for free. Multiple corrections within a single session each emit their own observation.nameper run + scenario_id so the same event id (e.g.employer) reused across scenarios cannot collide on the shared, not-wiped-between-scenarios isolated server.writ_as_ofmarkers.getHistory/getStateAsOf/getProvenancekey offmemory_event.id(the evaluator calls them that way) and sort observations by(writ_as_of, writ_seq).Measured recall (isolated neotoma server, hooks enabled, substring judge)
Also lifts
temporal_accuracy0%→100% andprovenance_completeness0%→80% on the target categories.Remaining sub-100% cases (scenario-design / out-of-scope-capability limits, not adapter bugs)
required_elementsuse phrasings absent from the scenario's ownmemory_eventvalues (drift-003/005, lifecycle-002/003 store machine slugs while the rubric wants natural language).entity-*require entity merge/dedup/relationship-tracking.provenance-004needs source-authority (user-stated vs agent-derived) classification.abstention/constraint/extraction_driftneed negation / constraint-application / paraphrase-normalization capabilities.These are not forced by parroting
ground_truth— that would invalidate the benchmark.Validation
npx tscclean (TSC: 0).memory_events-driven contract and its 4 tests pass against a live isolated server.BaselineAdapterunchanged and still green.🤖 Generated with Claude Code