
fix: replay reasoning + web_search_call without orphan in chatd #24721

Open
ibetitsmike wants to merge 3 commits into main from
mike/fix-openai-reasoning-replay-orphan

Conversation

@ibetitsmike (Collaborator) commented Apr 25, 2026

Disclaimer: opened by a Coder Agent on behalf of @ibetitsmike.

Fixes the production OpenAI Responses API failure where a follow-up turn after a reasoning + web_search_call step is rejected with:

Item 'ws_xxx' of type 'web_search_call' was provided without its
required 'reasoning' item: 'rs_xxx'.

This trips on gpt-5.5 xhigh (and any reasoning model) when web search runs in step 1 and full-history replay (rather than chain mode) is used in step 2.

Fix

  • Bump the charm.land/fantasy replace pin to include "fix(providers/openai): emit item_reference for reasoning when store=true" (fantasy#30), which emits an item_reference for stored reasoning items so they pair with a following web_search_call (or function_call) on replay.
  • Mirror live OpenAI's pairing rule in the chattest fake server via ValidateResponsesAPIInput. Cover both the orphan web_search_call (primary user-reported error) and orphan function_call (secondary failure mode).
  • Add TestOpenAIReasoningWithWebSearchRoundTrip and the Store=false variant. The tests use two model configs to disable chain mode on step 2 and force the full-history replay path where the bug surfaces. Step 2's input is captured and asserted to contain the historical assistant turn so the test fails closed if chain-mode detection ever changes.
  • Add a defensive SanitizeOpenAIOrphanToolMessages in chatprompt. It drops reasoning-only assistant messages together with any contiguous tool-message run that follows. Active for openai, azure, and openai-compat providers (all three route through the same Responses API). It runs at the prompt-build, reload, and chatloop pre-request boundaries to mirror SanitizeAnthropicProviderToolCalls.
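The sanitizer's core rule can be sketched as follows. The `Message`/`Part` types, the `"tool"` role string, and the function names here are simplified stand-ins for illustration, not the actual chatd/fantasy types:

```go
package main

import "fmt"

// Part and Message are hypothetical stand-ins for the real chatd
// message types.
type Part struct{ Type string } // "reasoning", "text", ...
type Message struct {
	Role  string // "assistant", "tool", "user"
	Parts []Part
}

// isReasoningOnly reports whether an assistant message carries nothing
// but reasoning parts. Any unknown part type returns false, so the
// predicate fails closed and the message is kept.
func isReasoningOnly(m Message) bool {
	if m.Role != "assistant" || len(m.Parts) == 0 {
		return false
	}
	for _, p := range m.Parts {
		if p.Type != "reasoning" {
			return false
		}
	}
	return true
}

// sanitize drops each reasoning-only assistant message together with
// the entire contiguous run of tool messages that follows it, so no
// later orphan tool message survives to trip the same OpenAI 400.
func sanitize(msgs []Message) []Message {
	out := make([]Message, 0, len(msgs))
	for i := 0; i < len(msgs); i++ {
		if isReasoningOnly(msgs[i]) {
			// Skip the assistant message and every tool message after it.
			for i+1 < len(msgs) && msgs[i+1].Role == "tool" {
				i++
			}
			continue
		}
		out = append(out, msgs[i])
	}
	return out
}

func main() {
	msgs := []Message{
		{Role: "user", Parts: []Part{{Type: "text"}}},
		{Role: "assistant", Parts: []Part{{Type: "reasoning"}}},
		{Role: "tool"},
		{Role: "tool"},
		{Role: "user", Parts: []Part{{Type: "text"}}},
	}
	fmt.Println(len(sanitize(msgs))) // prints 2: only the two user turns remain
}
```

Note the plain `append` into `out`: the two user messages bracketing the dropped run stay separate messages rather than being merged into one.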

This change supersedes closed #23402.

Verification

  • Manual red phase: with the unfixed pin, the new integration test fails, reproducing the exact production error: Item 'ws_xxx' of type 'web_search_call' was provided without its required 'reasoning' item: 'rs_*'.
  • Green phase: with the fixed pin, both Store=true and Store=false follow-up turns complete cleanly.
  • go test ./coderd/x/chatd/... (full suite, race) passes.
  • make pre-commit (gen / fmt / lint / build) passes.
Implementation plan and decision log

Cycle A — coder/fantasy#30 (merged via this PR's pin)

Replaced fantasy's unconditional continue for ContentTypeReasoning with a store-gated item_reference emission using GetReasoningMetadata. Inline OfReasoning replay was intentionally avoided so #181's regression doesn't return.
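The store-gated rule can be sketched like this. The types and function below are hypothetical stand-ins; the real change lives in fantasy's openai provider and goes through `GetReasoningMetadata` and openai-go input types:

```go
package main

import "fmt"

// ReasoningPart is a hypothetical stand-in for a reasoning content part.
type ReasoningPart struct {
	ItemID string // "rs_..." when OpenAI stored the reasoning item
}

// InputItem is a hypothetical stand-in for a Responses API input item.
type InputItem struct {
	Type string
	ID   string
}

// replayReasoning sketches the store-gated rule from fantasy#30: when
// store=true and the reasoning part carries a stored ItemID, emit an
// item_reference so a following web_search_call or function_call keeps
// its required pairing. Otherwise skip the part — reasoning is never
// replayed inline, to avoid re-introducing the regression #181 fixed.
func replayReasoning(p ReasoningPart, store bool) (InputItem, bool) {
	if !store || p.ItemID == "" {
		return InputItem{}, false
	}
	return InputItem{Type: "item_reference", ID: p.ItemID}, true
}

func main() {
	item, ok := replayReasoning(ReasoningPart{ItemID: "rs_123"}, true)
	fmt.Println(ok, item.Type, item.ID) // true item_reference rs_123
	_, ok = replayReasoning(ReasoningPart{ItemID: "rs_123"}, false)
	fmt.Println(ok) // false: store=false falls back to skipping
}
```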

Cycle B — chatd integration test

The integration test had to defeat chain mode (which short-circuits replay via previous_response_id). Approach taken: create two model configs and dispatch step 2 against the second. The chainInfo.modelConfigID == modelConfig.ID check then fails, and chatd sends full history through fantasy, where the bug surfaced. Step 2's input items are captured and asserted to contain the historical turn so the test fails closed if chain-mode detection changes.

ValidateResponsesAPIInput detects item_reference shape leniently because the openai-go SDK marks the type field as omitzero, so fantasy serializes references as bare {id}. The validator falls back to ID-prefix shape detection (ws_, rs_, msg_, fc_).
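The fallback heuristic amounts to prefix sniffing on the item ID. A minimal sketch (the function name and the exact mapping are illustrative, not the validator's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// kindFromID falls back to ID-prefix sniffing when a serialized item
// arrives as a bare {"id": "..."} with no type field (the openai-go
// SDK marks the type field as omitzero). The prefixes mirror the IDs
// the live API hands back for each item kind.
func kindFromID(id string) string {
	switch {
	case strings.HasPrefix(id, "ws_"):
		return "web_search_call"
	case strings.HasPrefix(id, "rs_"):
		return "reasoning"
	case strings.HasPrefix(id, "msg_"):
		return "message"
	case strings.HasPrefix(id, "fc_"):
		return "function_call"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(kindFromID("ws_abc"), kindFromID("rs_abc"), kindFromID("tool_abc"))
	// prints: web_search_call reasoning unknown
}
```

As the review noted, this is fragile against future SDK changes to ID shapes, which is why it is flagged as a follow-up rather than a guarantee.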

Cycle C — defensive sanitizer

Sanitizer drops tool messages that follow reasoning-only assistant messages. It mirrors the Anthropic sanitizer pattern (zero-allocation no-op fast path, structured stats, warn-level log). It activates for openai, azure, and openai-compat providers via isOpenAIResponsesProvider. Wired at all three boundaries that currently invoke SanitizeAnthropicProviderToolCalls: prompt-build, reload, and chatloop pre-request.

Decisions

  • item_reference over inline OfReasoning: inline replay re-introduces the failure mode that #181 fixed for plain reasoning replay.
  • Defensive sanitizer kept despite Cycle A solving the primary failure mode: belt-and-suspenders against the residual function_call_output orphan when reasoning is stripped (e.g. missing ItemID).
  • Two-model-configs test approach over a custom multi-step tool: smaller test surface, no synthetic tool stack, deterministic chain-mode bypass.
  • isReasoningOnlyAssistantMessage uses fantasy.AsMessagePart (handles pointer + value variants) and falls back to return false for any unrecognized part type so a future fantasy MessagePart addition cannot silently cause data loss.
  • Drop the entire contiguous tool-message run after a reasoning-only assistant, not just the immediate neighbor. Avoids leaving a later orphan tool message that would still trip the same OpenAI 400.
  • Use plain append (not appendSanitizedMessage) so dropping the orphan pair does not collapse the two non-tool messages bracketing the run into one.
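The pairing rule the fake server enforces can be approximated as: every web_search_call or function_call in the replayed input must follow the reasoning item it was emitted with. A simplified sketch — the real ValidateResponsesAPIInput matches the specific required reasoning ID, while this version only checks that some reasoning item immediately precedes the call:

```go
package main

import "fmt"

// inputItem is a simplified stand-in for a Responses API input item.
type inputItem struct {
	Kind string // "reasoning", "web_search_call", "function_call", "message"
	ID   string
}

// firstOrphan returns the first call item that is not immediately
// preceded by a reasoning item, mirroring (loosely) the live 400:
// "Item 'ws_xxx' ... was provided without its required 'reasoning' item".
func firstOrphan(items []inputItem) (inputItem, bool) {
	for i, it := range items {
		if it.Kind != "web_search_call" && it.Kind != "function_call" {
			continue
		}
		if i == 0 || items[i-1].Kind != "reasoning" {
			return it, true
		}
	}
	return inputItem{}, false
}

func main() {
	// Reasoning stripped on replay: the web_search_call is orphaned.
	orphan, ok := firstOrphan([]inputItem{
		{Kind: "message", ID: "msg_1"},
		{Kind: "web_search_call", ID: "ws_1"},
	})
	fmt.Println(ok, orphan.ID) // true ws_1

	// Reasoning replayed (e.g. as an item_reference): pair is intact.
	_, ok = firstOrphan([]inputItem{
		{Kind: "reasoning", ID: "rs_1"},
		{Kind: "web_search_call", ID: "ws_1"},
	})
	fmt.Println(ok) // false
}
```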
Deep-review pass

A multi-reviewer deep review (Test Auditor, Edge Case Analyst, Contract Auditor, Structural Analyst, Go Architect, Concurrency Reviewer, Performance Analyst, Duplication Checker, Modernization Reviewer, Style Reviewer) ran against the initial commit. All P2 and P3 structural findings were addressed:

  • P2 (Structural Analyst, Edge Case Analyst, Contract Auditor): sanitizer didn't cover Azure / openai-compat. Fixed via isOpenAIResponsesProvider.
  • P2 (Go Architect): sanitizer asymmetric with Anthropic at the chatloop pre-request boundary. Fixed.
  • P3 (Test Auditor, Edge Case Analyst, Contract Auditor, Structural Analyst, Go Architect): type-switch failed open for unknown / pointer parts. Fixed via fantasy.AsMessagePart + invert predicate.
  • P3 (Edge Case Analyst): sanitizer dropped only one tool message in a contiguous run. Fixed.
  • P3 (Structural Analyst): user turns merged across the dropped run via appendSanitizedMessage. Fixed by switching to plain append.
  • P3 (Structural Analyst): integration test depended on chain-mode internals without an assertion. Fixed via captured-input check.
  • Multiple style nits applied (constants over literals, ptr.Ref consistency, Step over Turn terminology, drop unnecessary error wrapper).

Follow-up items called out by reviewers but not in scope here:

  • ID-prefix sniffing in the validator is fragile against future openai-go SDK changes (Test Auditor Obs).
  • Prompt field in OpenAIRequest is dead code from a prior PR (Structural Analyst Obs).
  • Map-iteration order in unpaired-function-call validator is non-deterministic when multiple orphans exist (Edge Case Analyst, Performance Analyst Obs); fine for current tests.
  • If a third provider needs a similar sanitizer, factor out the shared shape (Duplication Checker Obs).

When the OpenAI Responses API returns reasoning + a provider-executed
web_search_call in the same turn (e.g. with gpt-5.5 at xhigh effort), a
follow-up turn that replays full conversation history produced a 400:

    Item 'ws_xxx' of type 'web_search_call' was provided without its
    required 'reasoning' item: 'rs_xxx'.

Root cause: coder/fantasy unconditionally skipped reasoning items during
replay even when store=true and another item required pairing. The
companion fix in coder/fantasy#30 emits an item_reference for the stored
reasoning ItemID alongside the following item_reference(ws_*) so the
pair is preserved.

Changes:

- Bump charm.land/fantasy replace pin to the merge of coder/fantasy#30.
- Extend coderd/x/chatd/chattest with ValidateResponsesAPIInput, a
  Responses API input validator that mirrors the live OpenAI 400s for
  the orphan web_search_call and orphan function_call cases.
- Add TestOpenAIReasoningWithWebSearchRoundTrip and
  TestOpenAIReasoningWithWebSearchRoundTripStoreFalse against a fake
  OpenAI server. The tests use two model configs to disable chain mode
  on turn 2 and force the replay path where the bug surfaced.
- Add SanitizeOpenAIOrphanToolMessages in chatprompt as defensive
  belt-and-suspenders. It drops tool messages that follow a
  reasoning-only assistant message (the residual failure mode that
  produces 'No tool output found for function call call_xxx' on the
  next turn). It mirrors SanitizeAnthropicProviderToolCalls and runs
  at both prompt-build and reload sites in chatd.go.

Verified manually: with the unfixed pin, the new integration test
fails with the production error string. With the fix, both Store=true
and Store=false paths complete the follow-up turn cleanly.

This change was generated by a Coder Agent on behalf of @ibetitsmike.
Addresses Tier 1 findings from the deep-review pass:

- chatprompt.go: invert isReasoningOnlyAssistantMessage so unknown or
  pointer-typed parts default to 'not reasoning-only' (fail closed
  instead of fail open). Also use fantasy.AsMessagePart so pointer
  variants are normalized.
- chatprompt.go: drop the orphan tool RUN, not just one tool message.
  An assistant followed by [tool, tool, ...] now has all contiguous
  tool messages dropped together with the assistant.
- chatprompt.go: sanitize via plain append (not appendSanitizedMessage)
  so the two non-tool messages bracketing a dropped run keep their
  turn boundary instead of getting merged into a single message.
- chatloop.go: also run SanitizeOpenAIOrphanToolMessages at the
  pre-request boundary, mirroring SanitizeAnthropicProviderToolCalls
  symmetry. Closes the asymmetry the Go Architect flagged.
- openai_responses_validation.go: document the responsesInputKind enum
  values (Message is intentional scaffolding for future rules).
- openai_reasoning_websearch_replay_test.go: drop the
  openAIValidationError wrapper in favor of xerrors.New; capture
  step 2's input items and assert chain mode was actually disabled
  (proves the test exercises the replay path the bug surfaces on);
  use ptr.Ref consistently; use fantasyopenai.Name instead of the
  bare 'openai' literal in the chatprompt sanitizer table; rename
  parts -> partTypes; rename Turn -> Step to match existing
  integration test terminology.
- chatprompt_test.go: add cases for empty assistant content, pointer
  reasoning parts, and contiguous tool runs; rename wantAsstDropped
  -> wantAssistantDropped to match the field it pins.
- openai_responses_validation_test.go: rename cases -> testCases for
  consistency with the rest of the chatd test suite.

This change was generated by a Coder Agent on behalf of @ibetitsmike.
Several deep-review reviewers (Structural Analyst P2, Edge Case Analyst
P3, Contract Auditor P3) flagged that SanitizeOpenAIOrphanToolMessages
was gated on provider == 'openai' exactly, missing Azure OpenAI and
openai-compat backends that route through the same fantasy openai
package and hit the same Responses API pairing rules.

Replace the gate with isOpenAIResponsesProvider, which recognizes
'openai', 'azure', and 'openai-compat'. Add table-driven test cases
proving the sanitizer fires for all three.

This change was generated by a Coder Agent on behalf of @ibetitsmike.
@ibetitsmike ibetitsmike changed the title fix(coderd/x/chatd): replay reasoning + web_search_call without orphan fix: replay reasoning + web_search_call without orphan in chatd Apr 25, 2026
@ibetitsmike ibetitsmike marked this pull request as ready for review April 25, 2026 03:24