feat: two-pass critique extractor — bench-validated, parked#214
Draft
hungtranphamminh wants to merge 1 commit into
Draft
feat: two-pass critique extractor — bench-validated, parked#214hungtranphamminh wants to merge 1 commit into
hungtranphamminh wants to merge 1 commit into
Conversation
End-to-end implementation of the opt-in two-pass critique extractor. Implemented, fully tested, benched against locked acceptance criteria, and parked because the result did not clear the pre-declared gate. This commit exists as recoverable evidence of the attempt, not as a release candidate. Release metadata (changesets for the TS SDK + openclaw, Python SDK version bump + CHANGELOG entry) was intentionally NOT included so the branch's purpose is unambiguous: this is the implementation that backed the bench validation, not code intended to ship. ## What's in here Server (Rust): - `AnalyzeRequest.extract_with_critique: bool` with serde(default). - `Extractor` trait gains `extract_with_critique(text, related_memories, occurred_at)`, signature symmetric with `extract_with_context`. Default impl falls through to `extract_with_context` so mocks and non-LLM impls inherit a sane fallback. - `LlmExtractor::extract_with_critique` runs `extract_with_context` for the first pass, then builds a structured 4-5 user-message critique payload (system + optional related_memories + optional occurred_at + Original input + First-pass extracted facts) — same per-message pattern the timestamp anchor uses, not an inline-concat approach. - New `critique.v1` prompt at `services/server/src/services/prompts/ critique.txt`. 30 lines, six rules: injection guard, preserve resolved dates, preserve historical dates from recounted events, strip wrongly-stamped stable facts, anti-leak (no tag emission), strict BUCKET<TAB>FACT_TEXT output with NONE for empty. - `FACT_EXTRACTION_CRITIQUE_PROMPT_VERSION = "critique.v1"` constant surfaced on `/health` as `prompt_versions.critique` so bench artifacts can pin the prompt version they ran against. - New `render_extracted_facts_for_prompt` helper renders first-pass facts back as BUCKET<TAB>TEXT, with `escape_for_prompt_context` applied — defence in depth against fact text containing structural markers. - 10 new Rust tests (six prompt-content pins, three helper pins including escape, one trait-default fallthrough). SDKs (TS + Python): - TS: `AnalyzeOptions.extractWithCritique?: boolean` plumbed through `analyze()` and `analyzeAndWait()`. Wire field omitted when not true. - Python: `extract_with_critique: bool = False` kwarg on async + sync `analyze()` and `analyze_and_wait()`. Same omit-when-false semantics. Two new SDK tests pin flag-true sends and flag-false omits. Bench harness: - `--extract-with-critique` CLI flag on `ingest` and `full` subcommands. - Flag flows config → stage_ingest → client.analyze; recorded in the result artifact's `config` block for attribution. - User-facing banner printed on enable so it's loud in the runlog. - `/health` validation reads `prompt_versions.critique` when present. ## Why this is parked (not opened for merge) Locked acceptance criteria (set before bench started): - Primary: LME single_session_assistant ≥ +5 J vs the timestamp- anchor baseline. - Constraint 1: no category regresses > 5 J. - Constraint 2: neither overall drops > 2-3 J below baseline. Bench result on top of the current dev tip, measured against the prior archived timestamp-anchor baseline: - LME single_session_assistant: 80.89 → 84.11 = +3.21 J → missed primary target by 1.79 J. - LOCOMO overall: 69.95 → 68.76 = −1.19 J (inside noise) - LME overall: 78.42 → 77.02 = −1.40 J (inside noise) Per-category pattern: gains on single-turn categories (single_session_user +2.14, single_session_assistant +3.21, preference +2.00), losses on cross-turn categories (multi_session −3.31, temporal −2.82, knowledge_update −3.53, LOCOMO single_hop −4.47). The critic strips facts it cannot verify in the current turn — wrong move for facts whose justification lives in other turns. Doubles the per-analyze LLM call count (LOCOMO ingest 1500s → 2680s, LME 3500s → 7000s). Not enough lift to justify the cost, especially since the gain redistributes from cross-turn recall (the majority of real-world queries) to single-turn recall. ## Side validation Bench surfaced 25 transient 5xx events across both runs (10 LOCOMO, 15 LME). Every one correctly classified by `is_upstream_status_transient` + envelope detection + null-content → 503 handling that the upstream classifier ships on dev. Zero dropped turns. Real-world load test of the transient-failure handling confirmed it works as designed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Status
This PR is opened as a draft and will be closed without merging.
End-to-end implementation of the opt-in two-pass critique extractor, fully benched against pre-declared acceptance criteria. The result did not clear the gate, so this is intentionally NOT a ship candidate. The PR exists as a discoverable record of the attempt + the parking rationale.
See the commit message body for the full implementation summary, bench numbers, and reasoning.
Why it's parked
Pre-declared acceptance criteria (set before bench started):
single_session_assistant≥ +5 J vs the timestamp-anchor baselineBench result on top of the current
devtip vs the prior archived timestamp-anchor baseline:Per-category pattern: gains on single-turn categories (single_session_user +2.14, single_session_assistant +3.21, preference +2.00), losses on cross-turn categories (multi_session −3.31, temporal −2.82, knowledge_update −3.53, LOCOMO single_hop −4.47). The critic strips facts it cannot verify in the current turn — wrong move for facts whose justification lives in other turns.
Cost: doubles per-analyze LLM call count (LOCOMO ingest 1500s → 2680s, LME 3500s → 7000s).
The nature of critique — what this mechanism can and can't do
The critic runs as a second LLM call on the same
/api/analyzerequest as the first pass. It sees:<context occurred_at>anchor and<related_memories>blockIt does NOT see:
This is the structural reality of server-side critique: per-call, stateless, scope-limited to what the current
/api/analyzecall passes in. The critic is asked to verify whether a fact is supported — but the only "support" it has access to is the current input plus what the first pass produced.That constraint shapes every result we measured.
What the constraint means in practice
Critique helps when the fact's justification is in the current input. Single-turn categories —
single_session_user,single_session_assistant,preference— all gained (+2.14, +3.21, +2.00). The critic looks at "user said X" and "first-pass fact says X-prime" and can verify the relationship cleanly because both are in front of it.Critique hurts when the fact's justification is in other turns. Cross-turn categories —
multi_session,knowledge_update,temporal, LOCOMOsingle_hop— all lost (−3.31, −3.53, −2.82, −4.47). The first pass extracted a fact whose validity depends on earlier conversation context. The critic, having only the current turn, sees an unsupported claim and removes it. To the critic, the fact looks like a hallucination; in reality the first pass had context the critic doesn't.The redistribution pattern in the numbers (+9.20 across 4 categories, −16.74 across 7) is the direct shadow of this mechanism property. We're not measuring random noise; we're measuring the cost of having a verifier that doesn't share the extractor's context window.
What this rules out as a "fix"
When this mechanism becomes worth shipping
Critique is the right tool when the use case happens to match what critique can do — and the cost ratio is acceptable for that use case. Concretely:
analyze. Use cases where ingestion runs in batch or background are fine; interactive paths where the user is waiting are harder.The substrate is ready when a use case matching those properties appears — re-bench against the new baseline, validate the cost/benefit math for the specific call pattern, ship behind a flag with a clear "use this when..." guide that matches the data.
Other things that would justify reopening the conversation later:
What's in the diff
AnalyzeRequest.extract_with_critique: bool,Extractor::extract_with_critiquetrait method (signature symmetric withextract_with_context),LlmExtractorimpl with 4-5 structured user-message critique payload, newcritique.v1prompt with six rules (injection guard, temporal preservation, recounted-past carve-out, strip-wrongly-stamped-stable, anti-leak, strict output),prompt_versions.critiqueon/health, 10 new unit tests.extractWithCritique/extract_with_critiquefield, omitted from wire when false. 2 new Python SDK tests.--extract-with-critiqueflag, config plumbing, banner on enable, flag recorded in result-artifact config.Side validation
Bench surfaced 25 transient 5xx events across both runs (10 LOCOMO, 15 LME). Every one correctly classified by
is_upstream_status_transient+ envelope detection + null-content → 503 handling already ondev. Zero dropped turns. Real-world load test of the transient-failure handling confirmed it works as designed.What happens next