Skip to content

feat: two-pass critique extractor — bench-validated, parked#214

Draft
hungtranphamminh wants to merge 1 commit into
devfrom
feat/WALM-56-critique-restack
Draft

feat: two-pass critique extractor — bench-validated, parked#214
hungtranphamminh wants to merge 1 commit into
devfrom
feat/WALM-56-critique-restack

Conversation

@hungtranphamminh

@hungtranphamminh hungtranphamminh commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Status

This PR is opened as a draft and will be closed without merging.

End-to-end implementation of the opt-in two-pass critique extractor, fully benched against pre-declared acceptance criteria. The result did not clear the gate, so this is intentionally NOT a ship candidate. The PR exists as a discoverable record of the attempt + the parking rationale.

See the commit message body for the full implementation summary, bench numbers, and reasoning.

Why it's parked

Pre-declared acceptance criteria (set before bench started):

  • Primary: LME single_session_assistant+5 J vs the timestamp-anchor baseline
  • Constraint 1: no category regresses > 5 J
  • Constraint 2: neither overall drops > 2-3 J below baseline

Bench result on top of the current dev tip vs the prior archived timestamp-anchor baseline:

Metric Baseline Critique Δ Gate
LME single_session_assistant ⭐ 80.89 84.11 +3.21 ❌ (missed +5 by 1.79)
LOCOMO overall 69.95 68.76 −1.19 ✅ (inside noise)
LME overall 78.42 77.02 −1.40 ✅ (inside noise)
Worst category regression LOCOMO single_hop −4.47 ✅ (inside per-cat noise)

Per-category pattern: gains on single-turn categories (single_session_user +2.14, single_session_assistant +3.21, preference +2.00), losses on cross-turn categories (multi_session −3.31, temporal −2.82, knowledge_update −3.53, LOCOMO single_hop −4.47). The critic strips facts it cannot verify in the current turn — wrong move for facts whose justification lives in other turns.

Cost: doubles per-analyze LLM call count (LOCOMO ingest 1500s → 2680s, LME 3500s → 7000s).

The nature of critique — what this mechanism can and can't do

The critic runs as a second LLM call on the same /api/analyze request as the first pass. It sees:

  • The original input message (the conversation turn being analyzed)
  • The first-pass extracted facts
  • The optional <context occurred_at> anchor and <related_memories> block

It does NOT see:

  • Other turns in the conversation
  • Other sessions in the namespace
  • Memories stored from earlier turns that the user is now contradicting or building on
  • The reasoning the first-pass extractor used to decide a fact was important

This is the structural reality of server-side critique: per-call, stateless, scope-limited to what the current /api/analyze call passes in. The critic is asked to verify whether a fact is supported — but the only "support" it has access to is the current input plus what the first pass produced.

That constraint shapes every result we measured.

What the constraint means in practice

Critique helps when the fact's justification is in the current input. Single-turn categories — single_session_user, single_session_assistant, preference — all gained (+2.14, +3.21, +2.00). The critic looks at "user said X" and "first-pass fact says X-prime" and can verify the relationship cleanly because both are in front of it.

Critique hurts when the fact's justification is in other turns. Cross-turn categories — multi_session, knowledge_update, temporal, LOCOMO single_hop — all lost (−3.31, −3.53, −2.82, −4.47). The first pass extracted a fact whose validity depends on earlier conversation context. The critic, having only the current turn, sees an unsupported claim and removes it. To the critic, the fact looks like a hallucination; in reality the first pass had context the critic doesn't.

The redistribution pattern in the numbers (+9.20 across 4 categories, −16.74 across 7) is the direct shadow of this mechanism property. We're not measuring random noise; we're measuring the cost of having a verifier that doesn't share the extractor's context window.

What this rules out as a "fix"

  • A better prompt won't fix the cross-turn regression. The critic isn't making prompt-correctable mistakes; it's correctly applying its "verify from source" instruction to inputs that don't contain the relevant source. Better prompts can tune aggressiveness, but they can't give the critic context it doesn't have.
  • Default-off opt-in doesn't fix it either. Callers who flip the flag based on "second pass = higher quality" reasonably expect quality to improve. For the typical multi-turn conversational use case, our data says it doesn't. Shipping a flag whose docs would have to say "this makes recall worse on the kinds of queries most users actually run" isn't a feature improvement.

When this mechanism becomes worth shipping

Critique is the right tool when the use case happens to match what critique can do — and the cost ratio is acceptable for that use case. Concretely:

  • Input is naturally single-turn (one document, one note, one isolated piece of text in, facts out). Cross-turn regression doesn't apply because there are no other turns. The single-turn category gains become a real win, not offset by losses elsewhere.
  • Precision matters more than throughput. The 2× LLM cost is acceptable because every extracted fact is worth verifying. The marginal cost per fact is small relative to the cost of missing or fabricating one.
  • Per-call latency tolerant. Critique roughly doubles P50 latency on analyze. Use cases where ingestion runs in batch or background are fine; interactive paths where the user is waiting are harder.

The substrate is ready when a use case matching those properties appears — re-bench against the new baseline, validate the cost/benefit math for the specific call pattern, ship behind a flag with a clear "use this when..." guide that matches the data.

Other things that would justify reopening the conversation later:

  • A first-pass extractor weak enough to leave critique meaningful headroom (e.g. a future prompt rewrite or model swap that drops the baseline by enough to change the math).
  • A critique design that gives the second pass more context than the current per-call scope allows — for example, critiquing only the small subset of facts the first pass marked highest-importance, so the marginal LLM spend is concentrated on the facts that matter most. Different feature, but the substrate (trait method, prompt asset, SDK plumbing) on this branch can be amended rather than rebuilt.

What's in the diff

  • Server (Rust): AnalyzeRequest.extract_with_critique: bool, Extractor::extract_with_critique trait method (signature symmetric with extract_with_context), LlmExtractor impl with 4-5 structured user-message critique payload, new critique.v1 prompt with six rules (injection guard, temporal preservation, recounted-past carve-out, strip-wrongly-stamped-stable, anti-leak, strict output), prompt_versions.critique on /health, 10 new unit tests.
  • SDKs (TS + Python): opt-in extractWithCritique / extract_with_critique field, omitted from wire when false. 2 new Python SDK tests.
  • Bench harness: --extract-with-critique flag, config plumbing, banner on enable, flag recorded in result-artifact config.
  • Release metadata intentionally stripped (changesets, version bumps, CHANGELOG entries) because this implementation is not being released.

Side validation

Bench surfaced 25 transient 5xx events across both runs (10 LOCOMO, 15 LME). Every one correctly classified by is_upstream_status_transient + envelope detection + null-content → 503 handling already on dev. Zero dropped turns. Real-world load test of the transient-failure handling confirmed it works as designed.

What happens next

  • This PR closes once CI confirms green.
  • Branch stays available indefinitely so any future cycle that wants to revisit critique has the substrate ready.
  • Decision context lives in the Linear ticket WALM-56 for cross-reference.

End-to-end implementation of the opt-in two-pass critique extractor.
Implemented, fully tested, benched against locked acceptance criteria,
and parked because the result did not clear the pre-declared gate.

This commit exists as recoverable evidence of the attempt, not as a
release candidate. Release metadata (changesets for the TS SDK +
openclaw, Python SDK version bump + CHANGELOG entry) was intentionally
NOT included so the branch's purpose is unambiguous: this is the
implementation that backed the bench validation, not code intended
to ship.

## What's in here

Server (Rust):
- `AnalyzeRequest.extract_with_critique: bool` with serde(default).
- `Extractor` trait gains `extract_with_critique(text, related_memories,
  occurred_at)`, signature symmetric with `extract_with_context`.
  Default impl falls through to `extract_with_context` so mocks and
  non-LLM impls inherit a sane fallback.
- `LlmExtractor::extract_with_critique` runs `extract_with_context` for
  the first pass, then builds a structured 4-5 user-message critique
  payload (system + optional related_memories + optional occurred_at
  + Original input + First-pass extracted facts) — same per-message
  pattern the timestamp anchor uses, not an inline-concat approach.
- New `critique.v1` prompt at `services/server/src/services/prompts/
  critique.txt`. 30 lines, six rules: injection guard, preserve
  resolved dates, preserve historical dates from recounted events,
  strip wrongly-stamped stable facts, anti-leak (no tag emission),
  strict BUCKET<TAB>FACT_TEXT output with NONE for empty.
- `FACT_EXTRACTION_CRITIQUE_PROMPT_VERSION = "critique.v1"` constant
  surfaced on `/health` as `prompt_versions.critique` so bench
  artifacts can pin the prompt version they ran against.
- New `render_extracted_facts_for_prompt` helper renders first-pass
  facts back as BUCKET<TAB>TEXT, with `escape_for_prompt_context`
  applied — defence in depth against fact text containing structural
  markers.
- 10 new Rust tests (six prompt-content pins, three helper pins
  including escape, one trait-default fallthrough).

SDKs (TS + Python):
- TS: `AnalyzeOptions.extractWithCritique?: boolean` plumbed through
  `analyze()` and `analyzeAndWait()`. Wire field omitted when not true.
- Python: `extract_with_critique: bool = False` kwarg on async + sync
  `analyze()` and `analyze_and_wait()`. Same omit-when-false semantics.
  Two new SDK tests pin flag-true sends and flag-false omits.

Bench harness:
- `--extract-with-critique` CLI flag on `ingest` and `full` subcommands.
- Flag flows config → stage_ingest → client.analyze; recorded in the
  result artifact's `config` block for attribution.
- User-facing banner printed on enable so it's loud in the runlog.
- `/health` validation reads `prompt_versions.critique` when present.

## Why this is parked (not opened for merge)

Locked acceptance criteria (set before bench started):
- Primary: LME single_session_assistant ≥ +5 J vs the timestamp-
  anchor baseline.
- Constraint 1: no category regresses > 5 J.
- Constraint 2: neither overall drops > 2-3 J below baseline.

Bench result on top of the current dev tip, measured against the
prior archived timestamp-anchor baseline:
- LME single_session_assistant: 80.89 → 84.11 = +3.21 J
  → missed primary target by 1.79 J.
- LOCOMO overall: 69.95 → 68.76 = −1.19 J (inside noise)
- LME overall: 78.42 → 77.02 = −1.40 J (inside noise)

Per-category pattern: gains on single-turn categories
(single_session_user +2.14, single_session_assistant +3.21, preference
+2.00), losses on cross-turn categories (multi_session −3.31, temporal
−2.82, knowledge_update −3.53, LOCOMO single_hop −4.47). The critic
strips facts it cannot verify in the current turn — wrong move for
facts whose justification lives in other turns.

Doubles the per-analyze LLM call count (LOCOMO ingest 1500s → 2680s,
LME 3500s → 7000s). Not enough lift to justify the cost, especially
since the gain redistributes from cross-turn recall (the majority of
real-world queries) to single-turn recall.

## Side validation

Bench surfaced 25 transient 5xx events across both runs (10 LOCOMO,
15 LME). Every one correctly classified by `is_upstream_status_transient`
+ envelope detection + null-content → 503 handling that the upstream
classifier ships on dev. Zero dropped turns. Real-world load test of
the transient-failure handling confirmed it works as designed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant