feat: two-pass critique extractor — bench-validated, parked by hungtranphamminh · Pull Request #214 · MystenLabs/MemWal

hungtranphamminh · 2026-06-01T17:38:28Z

Status

This PR is opened as a draft and will be closed without merging.

End-to-end implementation of the opt-in two-pass critique extractor, fully benched against pre-declared acceptance criteria. The result did not clear the gate, so this is intentionally NOT a ship candidate. The PR exists as a discoverable record of the attempt + the parking rationale.

See the commit message body for the full implementation summary, bench numbers, and reasoning.

Why it's parked

Pre-declared acceptance criteria (set before bench started):

Primary: LME single_session_assistant ≥ +5 J vs the timestamp-anchor baseline
Constraint 1: no category regresses > 5 J
Constraint 2: neither overall drops > 2-3 J below baseline

Bench result on top of the current dev tip vs the prior archived timestamp-anchor baseline:

Metric	Baseline	Critique	Δ	Gate
LME single_session_assistant ⭐	80.89	84.11	+3.21	❌ (missed +5 by 1.79)
LOCOMO overall	69.95	68.76	−1.19	✅ (inside noise)
LME overall	78.42	77.02	−1.40	✅ (inside noise)
Worst category regression	—	LOCOMO single_hop	−4.47	✅ (inside per-cat noise)

Per-category pattern: gains on single-turn categories (single_session_user +2.14, single_session_assistant +3.21, preference +2.00), losses on cross-turn categories (multi_session −3.31, temporal −2.82, knowledge_update −3.53, LOCOMO single_hop −4.47). The critic strips facts it cannot verify in the current turn — wrong move for facts whose justification lives in other turns.

Cost: doubles per-analyze LLM call count (LOCOMO ingest 1500s → 2680s, LME 3500s → 7000s).

The nature of critique — what this mechanism can and can't do

The critic runs as a second LLM call on the same /api/analyze request as the first pass. It sees:

The original input message (the conversation turn being analyzed)
The first-pass extracted facts
The optional <context occurred_at> anchor and <related_memories> block

It does NOT see:

Other turns in the conversation
Other sessions in the namespace
Memories stored from earlier turns that the user is now contradicting or building on
The reasoning the first-pass extractor used to decide a fact was important

This is the structural reality of server-side critique: per-call, stateless, scope-limited to what the current /api/analyze call passes in. The critic is asked to verify whether a fact is supported — but the only "support" it has access to is the current input plus what the first pass produced.

That constraint shapes every result we measured.

What the constraint means in practice

Critique helps when the fact's justification is in the current input. Single-turn categories — single_session_user, single_session_assistant, preference — all gained (+2.14, +3.21, +2.00). The critic looks at "user said X" and "first-pass fact says X-prime" and can verify the relationship cleanly because both are in front of it.

Critique hurts when the fact's justification is in other turns. Cross-turn categories — multi_session, knowledge_update, temporal, LOCOMO single_hop — all lost (−3.31, −3.53, −2.82, −4.47). The first pass extracted a fact whose validity depends on earlier conversation context. The critic, having only the current turn, sees an unsupported claim and removes it. To the critic, the fact looks like a hallucination; in reality the first pass had context the critic doesn't.

The redistribution pattern in the numbers (+9.20 across 4 categories, −16.74 across 7) is the direct shadow of this mechanism property. We're not measuring random noise; we're measuring the cost of having a verifier that doesn't share the extractor's context window.

What this rules out as a "fix"

A better prompt won't fix the cross-turn regression. The critic isn't making prompt-correctable mistakes; it's correctly applying its "verify from source" instruction to inputs that don't contain the relevant source. Better prompts can tune aggressiveness, but they can't give the critic context it doesn't have.
Default-off opt-in doesn't fix it either. Callers who flip the flag based on "second pass = higher quality" reasonably expect quality to improve. For the typical multi-turn conversational use case, our data says it doesn't. Shipping a flag whose docs would have to say "this makes recall worse on the kinds of queries most users actually run" isn't a feature improvement.

When this mechanism becomes worth shipping

Critique is the right tool when the use case happens to match what critique can do — and the cost ratio is acceptable for that use case. Concretely:

Input is naturally single-turn (one document, one note, one isolated piece of text in, facts out). Cross-turn regression doesn't apply because there are no other turns. The single-turn category gains become a real win, not offset by losses elsewhere.
Precision matters more than throughput. The 2× LLM cost is acceptable because every extracted fact is worth verifying. The marginal cost per fact is small relative to the cost of missing or fabricating one.
Per-call latency tolerant. Critique roughly doubles P50 latency on analyze. Use cases where ingestion runs in batch or background are fine; interactive paths where the user is waiting are harder.

The substrate is ready when a use case matching those properties appears — re-bench against the new baseline, validate the cost/benefit math for the specific call pattern, ship behind a flag with a clear "use this when..." guide that matches the data.

Other things that would justify reopening the conversation later:

A first-pass extractor weak enough to leave critique meaningful headroom (e.g. a future prompt rewrite or model swap that drops the baseline by enough to change the math).
A critique design that gives the second pass more context than the current per-call scope allows — for example, critiquing only the small subset of facts the first pass marked highest-importance, so the marginal LLM spend is concentrated on the facts that matter most. Different feature, but the substrate (trait method, prompt asset, SDK plumbing) on this branch can be amended rather than rebuilt.

What's in the diff

Server (Rust): AnalyzeRequest.extract_with_critique: bool, Extractor::extract_with_critique trait method (signature symmetric with extract_with_context), LlmExtractor impl with 4-5 structured user-message critique payload, new critique.v1 prompt with six rules (injection guard, temporal preservation, recounted-past carve-out, strip-wrongly-stamped-stable, anti-leak, strict output), prompt_versions.critique on /health, 10 new unit tests.
SDKs (TS + Python): opt-in extractWithCritique / extract_with_critique field, omitted from wire when false. 2 new Python SDK tests.
Bench harness: --extract-with-critique flag, config plumbing, banner on enable, flag recorded in result-artifact config.
Release metadata intentionally stripped (changesets, version bumps, CHANGELOG entries) because this implementation is not being released.

Side validation

Bench surfaced 25 transient 5xx events across both runs (10 LOCOMO, 15 LME). Every one correctly classified by is_upstream_status_transient + envelope detection + null-content → 503 handling already on dev. Zero dropped turns. Real-world load test of the transient-failure handling confirmed it works as designed.

What happens next

This PR closes once CI confirms green.
Branch stays available indefinitely so any future cycle that wants to revisit critique has the substrate ready.
Decision context lives in the Linear ticket WALM-56 for cross-reference.

End-to-end implementation of the opt-in two-pass critique extractor. Implemented, fully tested, benched against locked acceptance criteria, and parked because the result did not clear the pre-declared gate. This commit exists as recoverable evidence of the attempt, not as a release candidate. Release metadata (changesets for the TS SDK + openclaw, Python SDK version bump + CHANGELOG entry) was intentionally NOT included so the branch's purpose is unambiguous: this is the implementation that backed the bench validation, not code intended to ship. ## What's in here Server (Rust): - `AnalyzeRequest.extract_with_critique: bool` with serde(default). - `Extractor` trait gains `extract_with_critique(text, related_memories, occurred_at)`, signature symmetric with `extract_with_context`. Default impl falls through to `extract_with_context` so mocks and non-LLM impls inherit a sane fallback. - `LlmExtractor::extract_with_critique` runs `extract_with_context` for the first pass, then builds a structured 4-5 user-message critique payload (system + optional related_memories + optional occurred_at + Original input + First-pass extracted facts) — same per-message pattern the timestamp anchor uses, not an inline-concat approach. - New `critique.v1` prompt at `services/server/src/services/prompts/ critique.txt`. 30 lines, six rules: injection guard, preserve resolved dates, preserve historical dates from recounted events, strip wrongly-stamped stable facts, anti-leak (no tag emission), strict BUCKET<TAB>FACT_TEXT output with NONE for empty. - `FACT_EXTRACTION_CRITIQUE_PROMPT_VERSION = "critique.v1"` constant surfaced on `/health` as `prompt_versions.critique` so bench artifacts can pin the prompt version they ran against. - New `render_extracted_facts_for_prompt` helper renders first-pass facts back as BUCKET<TAB>TEXT, with `escape_for_prompt_context` applied — defence in depth against fact text containing structural markers. - 10 new Rust tests (six prompt-content pins, three helper pins including escape, one trait-default fallthrough). SDKs (TS + Python): - TS: `AnalyzeOptions.extractWithCritique?: boolean` plumbed through `analyze()` and `analyzeAndWait()`. Wire field omitted when not true. - Python: `extract_with_critique: bool = False` kwarg on async + sync `analyze()` and `analyze_and_wait()`. Same omit-when-false semantics. Two new SDK tests pin flag-true sends and flag-false omits. Bench harness: - `--extract-with-critique` CLI flag on `ingest` and `full` subcommands. - Flag flows config → stage_ingest → client.analyze; recorded in the result artifact's `config` block for attribution. - User-facing banner printed on enable so it's loud in the runlog. - `/health` validation reads `prompt_versions.critique` when present. ## Why this is parked (not opened for merge) Locked acceptance criteria (set before bench started): - Primary: LME single_session_assistant ≥ +5 J vs the timestamp- anchor baseline. - Constraint 1: no category regresses > 5 J. - Constraint 2: neither overall drops > 2-3 J below baseline. Bench result on top of the current dev tip, measured against the prior archived timestamp-anchor baseline: - LME single_session_assistant: 80.89 → 84.11 = +3.21 J → missed primary target by 1.79 J. - LOCOMO overall: 69.95 → 68.76 = −1.19 J (inside noise) - LME overall: 78.42 → 77.02 = −1.40 J (inside noise) Per-category pattern: gains on single-turn categories (single_session_user +2.14, single_session_assistant +3.21, preference +2.00), losses on cross-turn categories (multi_session −3.31, temporal −2.82, knowledge_update −3.53, LOCOMO single_hop −4.47). The critic strips facts it cannot verify in the current turn — wrong move for facts whose justification lives in other turns. Doubles the per-analyze LLM call count (LOCOMO ingest 1500s → 2680s, LME 3500s → 7000s). Not enough lift to justify the cost, especially since the gain redistributes from cross-turn recall (the majority of real-world queries) to single-turn recall. ## Side validation Bench surfaced 25 transient 5xx events across both runs (10 LOCOMO, 15 LME). Every one correctly classified by `is_upstream_status_transient` + envelope detection + null-content → 503 handling that the upstream classifier ships on dev. Zero dropped turns. Real-world load test of the transient-failure handling confirmed it works as designed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: two-pass critique extractor — bench-validated, parked#214

feat: two-pass critique extractor — bench-validated, parked#214
hungtranphamminh wants to merge 1 commit into
devfrom
feat/WALM-56-critique-restack

hungtranphamminh commented Jun 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

hungtranphamminh commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Status

Why it's parked

The nature of critique — what this mechanism can and can't do

What the constraint means in practice

What this rules out as a "fix"

When this mechanism becomes worth shipping

What's in the diff

Side validation

What happens next

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hungtranphamminh commented Jun 1, 2026 •

edited

Loading