feat(neotoma-adapter): memory_events-driven write path + history-preserving probe (#1737) by markmhendrickson · Pull Request #1 · markmhendrickson/writ

markmhendrickson · 2026-06-22T11:51:58Z

Summary

Rebuilds WRIT's neotoma adapter so the benchmark scores meaningfully against neotoma. Previously the adapter scored ~0% recall on drift/update/provenance/entity against an isolated neotoma server.

Three independent root causes (all confirmed end-to-end):

Regex extractEntities missed most values — only matched the first phrasing per fact, so 3 of 4 drift values were never extracted.
No history preservation — naive store/correct lost the prior values that structured history-preservation rubrics require.
Probe conflated questions with answers — searching the prompt text against the store either returned nothing (natural-language questions don't lexically match facts) or echoed a stored question entity back as the answer.
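The first root cause can be illustrated with a minimal sketch. The pattern and the four sentences below are hypothetical and do not come from WRIT's extractEntities or its scenarios; they only show how an extractor that recognizes a single phrasing can drop three of four values for the same fact.

```typescript
// Hypothetical illustration only: neither this pattern nor these sentences
// are taken from WRIT. The extractor knows exactly one phrasing.
const EMPLOYER_PATTERN = /I work at (\w+)/i;

function extractEmployer(text: string): string | null {
  const match = EMPLOYER_PATTERN.exec(text);
  return match?.[1] ?? null;
}

// Four sessions that each state a new value for the same fact,
// but with a different phrasing each time.
const driftSessions: string[] = [
  "I work at Acme.",
  "I just joined Globex.",
  "Actually, my employer is now Initech.",
  "These days I'm at Umbrella.",
];

// Only the first phrasing matches, so one value is extracted and three are lost.
const extractedEmployers: string[] = driftSessions
  .map(extractEmployer)
  .filter((value): value is string => value !== null);
```

Under these assumptions `extractedEmployers` holds a single value, which is why the rebuild replaces extraction with the scenario's typed memory_events.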

The rebuild

Drive the write path from each scenario's typed memory_events instead of regex extraction. Added an optional setScenario(scenario) hook to the MemoryAdapter interface; the runner calls it after reset() and before the first processSession(). Because it is optional, BaselineAdapter (which doesn't implement it) compiles and runs unchanged.
Each memory_event becomes one neotoma entity whose identity is pinned by a stable name field. Every value the event takes on across sessions is appended as an observation on the same entity, so neotoma's append-only model preserves full history for free. Multiple corrections within a single session each emit their own observation.
Namespace the entity name per run + scenario_id so the same event id (e.g. employer) reused across scenarios cannot collide on the shared, not-wiped-between-scenarios isolated server.
Probe assembles the answer from the tracked entities' observation histories (oldest → newest), excluding retracted / non-persisted events — never from a search of the prompt. Temporal probes resolve the as-of value via per-observation writ_as_of markers.
getHistory / getStateAsOf / getProvenance key off memory_event.id (the evaluator calls them that way) and sort observations by (writ_as_of, writ_seq).
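The naming and ordering rules above can be sketched as follows. Only the field names `name`, `writ_as_of`, and `writ_seq` come from this PR; the `Observation` shape, the numeric type of `writ_as_of`, the `:` separator, and the helper names `entityName`, `sortHistory`, and `stateAsOf` are assumptions made for illustration, not the adapter's actual code.

```typescript
// Assumed shape: the PR names writ_as_of and writ_seq but not their types.
interface Observation {
  value: string;
  writ_as_of: number; // assumption: a comparable session ordinal or timestamp
  writ_seq: number;   // tie-breaker for several corrections in one session
}

// Hypothetical helper: namespace the event id per run and scenario so the
// same id reused across scenarios maps to distinct entities.
function entityName(runId: string, scenarioId: string, eventId: string): string {
  return `${runId}:${scenarioId}:${eventId}`;
}

// Order observations oldest to newest by (writ_as_of, writ_seq).
function sortHistory(observations: readonly Observation[]): Observation[] {
  return [...observations].sort(
    (a, b) => a.writ_as_of - b.writ_as_of || a.writ_seq - b.writ_seq,
  );
}

// Resolve the value that was current at `asOf`: the newest observation
// whose writ_as_of does not exceed it, or undefined if none exists yet.
function stateAsOf(observations: readonly Observation[], asOf: number): string | undefined {
  let latest: Observation | undefined;
  for (const observation of sortHistory(observations)) {
    if (observation.writ_as_of <= asOf) latest = observation;
  }
  return latest?.value;
}

// Example history, deliberately out of order, with two writes in session 2.
const employerHistory: Observation[] = [
  { value: "Globex", writ_as_of: 2, writ_seq: 0 },
  { value: "Acme", writ_as_of: 1, writ_seq: 0 },
  { value: "Initech", writ_as_of: 2, writ_seq: 1 },
];
```

With this data, `stateAsOf(employerHistory, 1)` resolves to the session-1 value, while `stateAsOf(employerHistory, 2)` resolves to the later of the two session-2 corrections because `writ_seq` breaks the tie.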

Measured recall (isolated neotoma server, hooks enabled, substring judge)

category	BEFORE	AFTER
drift	0%	60%
update	0%	100%
forgetting	20%	80%
lifecycle	40%	60%
temporal	20%	100%
provenance	0%	80%
entity	0%	20%
aggregate (all 16 categories, n=77)	16%	57%

Also lifts temporal_accuracy 0%→100% and provenance_completeness 0%→80% on the target categories.

Remaining sub-100% cases (scenario-design / out-of-scope-capability limits, not adapter bugs)

Some rubrics' required_elements use phrasings absent from the scenario's own memory_event values (drift-003/005, lifecycle-002/003 store machine slugs while the rubric wants natural language).
entity-* require entity merge/dedup/relationship-tracking.
provenance-004 needs source-authority (user-stated vs agent-derived) classification.
abstention / constraint / extraction_drift need negation / constraint-application / paraphrase-normalization capabilities.

These cases are deliberately not forced to pass by parroting ground_truth, since that would invalidate the benchmark.
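The slug-versus-natural-language mismatch can be sketched with a simplified substring judge. This is an assumption about how such a judge might behave (case-insensitive containment, recall as the fraction of required elements found), not WRIT's actual judge, and the example strings are hypothetical.

```typescript
// Simplified, assumed judge: a required element counts as recalled when it
// appears as a case-insensitive substring of the adapter's answer.
function substringRecall(answer: string, requiredElements: readonly string[]): number {
  if (requiredElements.length === 0) return 1;
  const haystack = answer.toLowerCase();
  const hits = requiredElements.filter((element) =>
    haystack.includes(element.toLowerCase()),
  ).length;
  return hits / requiredElements.length;
}

// Hypothetical values: the stored memory_event value is a machine slug,
// while the rubric asks for a natural-language phrase.
const slugAnswer = "work_mode: remote_first";
const naturalAnswer = "The user works remote first.";
const requiredElements = ["remote first"];

const slugRecall = substringRecall(slugAnswer, requiredElements);       // 0
const naturalRecall = substringRecall(naturalAnswer, requiredElements); // 1
```

Under these assumptions the adapter can faithfully store and return the scenario's own value and still score zero, which is a property of the scenario and rubric rather than of the adapter.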

Validation

npx tsc clean (TSC: 0).
All 72 default tests pass; the (server-gated) neotoma integration test was updated to the new memory_events-driven contract and its 4 tests pass against a live isolated server.
BaselineAdapter unchanged and still green.
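Why BaselineAdapter is unaffected can be sketched as follows. The interface below is a simplified, synchronous stand-in: only the names MemoryAdapter, setScenario, reset, processSession, and the call order come from this PR, while the parameter types, the `Scenario` shape, the two example classes, and `runScenario` are assumptions for illustration.

```typescript
// Assumed minimal scenario shape; the real type has more fields.
interface Scenario {
  scenario_id: string;
  memory_events: { id: string }[];
}

// Simplified stand-in for the adapter interface. The `?` makes the hook
// optional, so existing adapters that omit it still satisfy the interface.
interface MemoryAdapter {
  reset(): void;
  setScenario?(scenario: Scenario): void;
  processSession(session: string): void;
}

// Hypothetical adapter without the hook, standing in for BaselineAdapter.
class BaselineLikeAdapter implements MemoryAdapter {
  readonly calls: string[] = [];
  reset(): void { this.calls.push("reset"); }
  processSession(_session: string): void { this.calls.push("processSession"); }
}

// Hypothetical adapter that opts in to the hook.
class EventDrivenLikeAdapter implements MemoryAdapter {
  readonly calls: string[] = [];
  reset(): void { this.calls.push("reset"); }
  setScenario(scenario: Scenario): void {
    this.calls.push(`setScenario:${scenario.scenario_id}`);
  }
  processSession(_session: string): void { this.calls.push("processSession"); }
}

// Runner order described in the PR: reset(), then setScenario() if present,
// then the sessions. Optional call syntax skips adapters without the hook.
function runScenario(adapter: MemoryAdapter, scenario: Scenario, sessions: readonly string[]): void {
  adapter.reset();
  adapter.setScenario?.(scenario);
  for (const session of sessions) adapter.processSession(session);
}

const exampleScenario: Scenario = { scenario_id: "drift-003", memory_events: [{ id: "employer" }] };
const baselineLike = new BaselineLikeAdapter();
const eventDrivenLike = new EventDrivenLikeAdapter();
runScenario(baselineLike, exampleScenario, ["session 1", "session 2"]);
runScenario(eventDrivenLike, exampleScenario, ["session 1", "session 2"]);
```

After both runs, `baselineLike.calls` contains only the reset and the two session calls, while `eventDrivenLike.calls` also records the hook between the reset and the first session.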

🤖 Generated with Claude Code

…erving probe (#1737)

WRIT's neotoma adapter scored ~0% recall on drift/update/provenance/entity against an isolated neotoma server. Three independent root causes (all confirmed end-to-end): (1) regex extractEntities matched only the first phrasing per fact, dropping 3 of 4 drift values; (2) naive store/correct did not preserve history that structured rubrics require; (3) the probe searched the prompt text against the store, so neotoma's lexical search either returned nothing (NL questions don't match facts) or echoed a stored question entity back as the answer.

Rebuild:
- Drive the write path from each scenario's typed `memory_events` instead of regex. Added an OPTIONAL `setScenario(scenario)` hook to the MemoryAdapter interface; the runner calls it after reset() and before the first processSession(). It is optional, so BaselineAdapter compiles and runs unchanged (it has no setScenario).
- Each memory_event becomes one neotoma entity whose identity is pinned by a stable `name` field; every value the event takes on across sessions is appended as an observation on that same entity, so neotoma's append-only model preserves full history for free. Multiple corrections within a single session each emit their own observation (fixes rapid-correction recall).
- Namespace the entity `name` per run + scenario_id so the same event id (e.g. "employer") used across scenarios cannot collide on the shared, not-wiped-between-scenarios isolated server (fixed cross-scenario bleed that corrupted temporal/forgetting answers).
- Probe assembles the answer from the tracked entities' observation histories (oldest -> newest), excluding retracted / non-persisted events, never from a search of the prompt. Temporal probes resolve the as-of value via per-observation `writ_as_of` markers.
- getHistory / getStateAsOf / getProvenance key off memory_event id (the evaluator calls them that way) and sort observations by (writ_as_of, writ_seq).

Measured recall (isolated neotoma server, hooks enabled, substring judge), BEFORE -> AFTER:
  drift 0% -> 60%
  update 0% -> 100%
  forgetting 20% -> 80%
  lifecycle 40% -> 60%
  temporal 20% -> 100%
  provenance 0% -> 80%
  entity 0% -> 20%
  aggregate (all 16 categories, n=77): 16% -> 57%

Also lifts temporal_accuracy 0%->100% and provenance_completeness 0%->80% on the target categories.

Remaining sub-100% cases are scenario-design / out-of-scope-capability limits, not adapter bugs: some rubrics' required_elements use phrasings absent from the scenario's own memory_event values (drift-003/005, lifecycle-002/003); entity-* need entity merge/dedup; provenance-004 needs source-authority classification; abstention/constraint/extraction_drift need negation / constraint / paraphrase-normalization capabilities. These are not forced by parroting ground_truth.

Updated the (server-gated) neotoma integration test to the new memory_events-driven contract; all 72 default tests + 4 integration tests pass; tsc is clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

markmhendrickson merged commit 3c0900a into main Jun 22, 2026
2 checks passed

markmhendrickson deleted the feat/1737-adapter-rebuild branch June 22, 2026 11:53

markmhendrickson mentioned this pull request Jun 22, 2026

ci(eval-combined): bump writ + add combined WRIT+Tier2 qa-gate lane (#1737) markmhendrickson/neotoma#1741

Merged

markmhendrickson merged 1 commit into main from feat/1737-adapter-rebuild
