fix: replay reasoning + web_search_call without orphan in chatd #24721
ibetitsmike wants to merge 3 commits into main from
Conversation
When the OpenAI Responses API returns reasoning + a provider-executed
`web_search_call` in the same turn (e.g. with gpt-5.5 at xhigh effort), a
follow-up turn that replays the full conversation history produced a 400:

Item 'ws_xxx' of type 'web_search_call' was provided without its
required 'reasoning' item: 'rs_xxx'.
Root cause: coder/fantasy unconditionally skipped reasoning items during
replay, even when store=true and another item required pairing. The
companion fix in coder/fantasy#30 emits an `item_reference` for the stored
reasoning ItemID alongside the following `item_reference(ws_*)`, so the
pair is preserved.
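The pairing rule the fix restores can be modeled in a few lines. This is a minimal sketch with illustrative types (not the real fantasy ones) showing how a store-gated `item_reference` keeps the reasoning/`web_search_call` pair intact on replay, and how skipping the reasoning item orphans the reference that follows it:

```go
package main

import "fmt"

// item is a minimal stand-in for a Responses API conversation item; the
// field names are illustrative, not the real fantasy types.
type item struct {
	ID   string // e.g. "rs_..." (reasoning), "ws_..." (web_search_call)
	Type string
}

// replayReferences sketches the fixed replay behavior: with store=true a
// reasoning item is replayed as an item_reference instead of being skipped,
// so the web_search_call that follows keeps its required pairing. With
// store=false the reasoning is skipped, which is exactly the orphan case
// the defensive sanitizer has to catch.
func replayReferences(history []item, store bool) []string {
	var refs []string
	for _, it := range history {
		if it.Type == "reasoning" && !store {
			continue // nothing stored server-side to reference
		}
		refs = append(refs, "item_reference:"+it.ID)
	}
	return refs
}

func main() {
	h := []item{
		{ID: "rs_1", Type: "reasoning"},
		{ID: "ws_1", Type: "web_search_call"},
	}
	fmt.Println(replayReferences(h, true))  // pair preserved
	fmt.Println(replayReferences(h, false)) // ws_1 reference is orphaned
}
```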
Changes:
- Bump the `charm.land/fantasy` replace pin to the merge of coder/fantasy#30.
- Extend `coderd/x/chatd/chattest` with `ValidateResponsesAPIInput`, a
  Responses API input validator that mirrors the live OpenAI 400s for
  the orphan `web_search_call` and orphan `function_call` cases.
- Add `TestOpenAIReasoningWithWebSearchRoundTrip` and
  `TestOpenAIReasoningWithWebSearchRoundTripStoreFalse` against a fake
  OpenAI server. The tests use two model configs to disable chain mode
  on turn 2 and force the replay path where the bug surfaced.
- Add `SanitizeOpenAIOrphanToolMessages` in chatprompt as defensive
  belt-and-suspenders. It drops tool messages that follow a
  reasoning-only assistant message (the residual failure mode that
  produces 'No tool output found for function call call_xxx' on the
  next turn). It mirrors `SanitizeAnthropicProviderToolCalls` and runs
  at both the prompt-build and reload sites in chatd.go.
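The sanitizer's drop rule can be sketched as follows. The message shape and field names are simplified stand-ins for the real chatprompt types, assuming the behavior described in the bullet above (drop a reasoning-only assistant message together with the contiguous tool run that follows it):

```go
package main

import "fmt"

// msg is a simplified chat message; Role and ReasoningOnly stand in for
// the real fantasy message parts the sanitizer inspects.
type msg struct {
	Role          string // "user", "assistant", "tool"
	ReasoningOnly bool   // assistant message whose parts are all reasoning
}

// sanitizeOrphanToolMessages drops a reasoning-only assistant message and
// the contiguous run of tool messages that follows it, mirroring the
// described SanitizeOpenAIOrphanToolMessages behavior (names are assumed).
func sanitizeOrphanToolMessages(in []msg) []msg {
	out := make([]msg, 0, len(in))
	for i := 0; i < len(in); i++ {
		m := in[i]
		if m.Role == "assistant" && m.ReasoningOnly {
			// Find the end of the contiguous tool run after the assistant.
			j := i + 1
			for j < len(in) && in[j].Role == "tool" {
				j++
			}
			if j > i+1 { // only drop when an orphan tool run actually follows
				i = j - 1
				continue
			}
		}
		out = append(out, m)
	}
	return out
}

func main() {
	history := []msg{
		{Role: "user"},
		{Role: "assistant", ReasoningOnly: true},
		{Role: "tool"},
		{Role: "tool"},
		{Role: "user"},
	}
	fmt.Println(len(sanitizeOrphanToolMessages(history))) // only the two user messages survive
}
```

Note that appending the survivors directly (plain `append`) keeps the two bracketing non-tool messages as separate turns rather than merging them.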
Verified manually: with the unfixed pin, the new integration test
fails with the production error string. With the fix, both Store=true
and Store=false paths complete the follow-up turn cleanly.
This change was generated by a Coder Agent on behalf of @ibetitsmike.
Addresses Tier 1 findings from the deep-review pass:

- chatprompt.go: invert `isReasoningOnlyAssistantMessage` so unknown or pointer-typed parts default to 'not reasoning-only' (fail closed instead of fail open). Also use `fantasy.AsMessagePart` so pointer variants are normalized.
- chatprompt.go: drop the orphan tool RUN, not just one tool message. An assistant followed by [tool, tool, ...] now has all contiguous tool messages dropped together with the assistant.
- chatprompt.go: sanitize via plain `append` (not `appendSanitizedMessage`) so the two non-tool messages bracketing a dropped run keep their turn boundary instead of getting merged into a single message.
- chatloop.go: also run `SanitizeOpenAIOrphanToolMessages` at the pre-request boundary, mirroring the `SanitizeAnthropicProviderToolCalls` symmetry. Closes the asymmetry the Go Architect flagged.
- openai_responses_validation.go: document the `responsesInputKind` enum values (Message is intentional scaffolding for future rules).
- openai_reasoning_websearch_replay_test.go: drop the `openAIValidationError` wrapper in favor of `xerrors.New`; capture step 2's input items and assert chain mode was actually disabled (proving the test exercises the replay path the bug surfaces on); use `ptr.Ref` consistently; use `fantasyopenai.Name` instead of the bare 'openai' literal in the chatprompt sanitizer table; rename `parts` -> `partTypes`; rename `Turn` -> `Step` to match existing integration test terminology.
- chatprompt_test.go: add cases for empty assistant content, pointer reasoning parts, and contiguous tool runs; rename `wantAsstDropped` -> `wantAssistantDropped` to match the field it pins.
- openai_responses_validation_test.go: rename `cases` -> `testCases` for consistency with the rest of the chatd test suite.
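The fail-closed inversion in the first finding can be sketched like this; the part type here is illustrative (the real code normalizes fantasy MessagePart variants via `fantasy.AsMessagePart`, which is not reproduced):

```go
package main

import "fmt"

// part mimics a message part with a type tag; the real code inspects
// fantasy MessagePart variants (value and pointer forms).
type part struct{ Type string }

// isReasoningOnlyAssistantMessage shows the fail-closed inversion: any
// unrecognized part type returns false ("not reasoning-only"), so a future
// part kind can never cause the sanitizer to drop real content.
func isReasoningOnlyAssistantMessage(parts []part) bool {
	if len(parts) == 0 {
		return false // empty content is not treated as reasoning-only
	}
	for _, p := range parts {
		if p.Type != "reasoning" {
			return false // unknown or non-reasoning part: fail closed
		}
	}
	return true
}

func main() {
	fmt.Println(isReasoningOnlyAssistantMessage([]part{{Type: "reasoning"}}))                  // true
	fmt.Println(isReasoningOnlyAssistantMessage([]part{{Type: "reasoning"}, {Type: "text"}})) // false
}
```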
Several deep-review reviewers (Structural Analyst P2, Edge Case Analyst P3, Contract Auditor P3) flagged that `SanitizeOpenAIOrphanToolMessages` was gated on provider == 'openai' exactly, missing Azure OpenAI and openai-compat backends that route through the same fantasy openai package and hit the same Responses API pairing rules. Replace the gate with `isOpenAIResponsesProvider`, which recognizes 'openai', 'azure', and 'openai-compat'. Add table-driven test cases proving the sanitizer fires for all three.
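A sketch of the widened gate, using only the three provider names given above (the function body here is an assumption about shape, not the actual implementation):

```go
package main

import "fmt"

// isOpenAIResponsesProvider reports whether a provider routes through the
// fantasy openai package and therefore hits Responses API pairing rules.
// The three recognized names are taken from the PR text.
func isOpenAIResponsesProvider(provider string) bool {
	switch provider {
	case "openai", "azure", "openai-compat":
		return true
	default:
		return false
	}
}

func main() {
	for _, p := range []string{"openai", "azure", "openai-compat", "anthropic"} {
		fmt.Println(p, isOpenAIResponsesProvider(p))
	}
}
```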
Disclaimer: opened by a Coder Agent on behalf of @ibetitsmike.
Fixes the production OpenAI Responses API failure where a follow-up turn after a
reasoning + `web_search_call` step is rejected with the orphan `web_search_call` 400. This trips on
gpt-5.5 `xhigh` (and any reasoning model) when web search runs in step 1 and full-history replay (rather than chain mode) is used in step 2.

Fix

- Bump the `charm.land/fantasy` replace pin to include "fix(providers/openai): emit item_reference for reasoning when store=true" (fantasy#30), which emits an `item_reference` for stored reasoning items so they pair with a following `web_search_call` (or `function_call`) on replay.
- Extend the `chattest` fake server via `ValidateResponsesAPIInput`. Cover both the orphan `web_search_call` (the primary user-reported error) and the orphan `function_call` (secondary failure mode).
- Add `TestOpenAIReasoningWithWebSearchRoundTrip` and the `Store=false` variant. The tests use two model configs to disable chain mode on step 2 and force the full-history replay path where the bug surfaces. Step 2's input is captured and asserted to contain the historical assistant turn so the test fails closed if chain-mode detection ever changes.
- Add `SanitizeOpenAIOrphanToolMessages` in chatprompt. It drops reasoning-only assistant messages together with any contiguous tool-message run that follows. Active for `openai`, `azure`, and `openai-compat` providers (all three route through the same Responses API). It runs at the prompt-build, reload, and chatloop pre-request boundaries to mirror `SanitizeAnthropicProviderToolCalls`.

This change supersedes closed #23402.
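The validator's lenient shape detection (detailed in the implementation notes below) can be sketched with an ID-prefix fallback. The function name and return values here are illustrative, not the real `ValidateResponsesAPIInput` internals; only the prefix table comes from the PR text:

```go
package main

import (
	"fmt"
	"strings"
)

// classifyItemID sketches the fallback used when an item_reference arrives
// as a bare {"id": "..."} with no "type" field (the SDK marks the field
// omitzero): infer the item's shape from its ID prefix.
func classifyItemID(id string) string {
	switch {
	case strings.HasPrefix(id, "ws_"):
		return "web_search_call"
	case strings.HasPrefix(id, "rs_"):
		return "reasoning"
	case strings.HasPrefix(id, "msg_"):
		return "message"
	case strings.HasPrefix(id, "fc_"):
		return "function_call"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(classifyItemID("ws_abc"), classifyItemID("rs_abc"), classifyItemID("zz_1"))
}
```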
Verification

- With the unfixed pin, the new integration test fails with the production error string: `Item 'ws_xxx' of type 'web_search_call' was provided without its required 'reasoning' item: 'rs_*'.`
- Both `Store=true` and `Store=false` follow-up turns complete cleanly.
- `go test ./coderd/x/chatd/...` (full suite, race) passes.
- `make pre-commit` (gen / fmt / lint / build) passes.

Implementation plan and decision log
Cycle A — coder/fantasy#30 (merged via this PR's pin)
Replaced fantasy's unconditional `continue` for `ContentTypeReasoning` with a store-gated `item_reference` emission using `GetReasoningMetadata`. Inline `OfReasoning` replay was intentionally avoided so #181's regression doesn't return.

Cycle B — chatd integration test
The integration test had to defeat chain mode (which short-circuits replay via
`previous_response_id`). Approach taken: create two model configs and dispatch step 2 against the second. The `chainInfo.modelConfigID == modelConfig.ID` check then fails, and chatd sends full history through fantasy, where the bug surfaced. Step 2's input items are captured and asserted to contain the historical turn so the test fails closed if chain-mode detection changes.

`ValidateResponsesAPIInput` detects `item_reference` shapes leniently because the openai-go SDK marks the `type` field as `omitzero`, so fantasy serializes references as bare `{id}`. The validator falls back to ID-prefix shape detection (`ws_`, `rs_`, `msg_`, `fc_`).

Cycle C — defensive sanitizer
The sanitizer drops tool messages that follow reasoning-only assistant messages. It mirrors the Anthropic sanitizer pattern (zero-allocation no-op fast path, structured stats, warn-level log). It activates for
`openai`, `azure`, and `openai-compat` providers via `isOpenAIResponsesProvider`, and is wired at all three boundaries that currently invoke `SanitizeAnthropicProviderToolCalls`: prompt-build, reload, and chatloop pre-request.

Decisions
- `item_reference` over inline `OfReasoning`: inline replay re-introduces the failure mode that #181 ("feat: Add resource detection on project import") fixed for plain reasoning replay.
- `function_call_output` orphan when reasoning is stripped (e.g. missing `ItemID`).
- `isReasoningOnlyAssistantMessage` uses `fantasy.AsMessagePart` (handles pointer + value variants) and falls back to `return false` for any unrecognized part type, so a future fantasy MessagePart addition cannot silently cause data loss.
- Plain `append` (not `appendSanitizedMessage`) so dropping the orphan pair does not collapse the two non-tool messages bracketing the run into one.

Deep-review pass
A multi-reviewer deep review (Test Auditor, Edge Case Analyst, Contract Auditor, Structural Analyst, Go Architect, Concurrency Reviewer, Performance Analyst, Duplication Checker, Modernization Reviewer, Style Reviewer) ran against the initial commit. All P2 and P3 structural findings were addressed:
- Provider gating widened via `isOpenAIResponsesProvider`.
- Pointer-typed reasoning parts handled with `fantasy.AsMessagePart` + inverted predicate.
- Turn-boundary merge in `appendSanitizedMessage`. Fixed by switching to plain `append`.
- Test hygiene (`ptr.Ref` consistency, `Step` over `Turn` terminology, drop the unnecessary error wrapper).

Follow-up items called out by reviewers but not in scope here:
- The `Prompt` field in `OpenAIRequest` is dead code from a prior PR (Structural Analyst observation).