
fix: replay reasoning + web_search_call without orphan in chatd #24721

Open
ibetitsmike wants to merge 3 commits into main from
mike/fix-openai-reasoning-replay-orphan

Conversation

@ibetitsmike (Collaborator) commented Apr 25, 2026

Disclaimer: opened by a Coder Agent on behalf of @ibetitsmike.

Fixes the production OpenAI Responses API failure where a follow-up turn after a reasoning + web_search_call step is rejected with:

Item 'ws_xxx' of type 'web_search_call' was provided without its
required 'reasoning' item: 'rs_xxx'.

This trips on gpt-5.5 xhigh (and any reasoning model) when web search runs in step 1 and full-history replay (rather than chain mode) is used in step 2.

Fix

  • Bump the charm.land/fantasy replace pin to include "fix(providers/openai): emit item_reference for reasoning when store=true" (fantasy#30), which emits an item_reference for stored reasoning items so they pair with a following web_search_call (or function_call) on replay.
  • Mirror live OpenAI's pairing rule in the chattest fake server via ValidateResponsesAPIInput. Cover both the orphan web_search_call (primary user-reported error) and orphan function_call (secondary failure mode).
  • Add TestOpenAIReasoningWithWebSearchRoundTrip and the Store=false variant. The tests use two model configs to disable chain mode on step 2 and force the full-history replay path where the bug surfaces. Step 2's input is captured and asserted to contain the historical assistant turn so the test fails closed if chain-mode detection ever changes.
  • Add a defensive SanitizeOpenAIOrphanToolMessages in chatprompt. It drops reasoning-only assistant messages together with any contiguous tool-message run that follows. Active for openai, azure, and openai-compat providers (all three route through the same Responses API). It runs at the prompt-build, reload, and chatloop pre-request boundaries to mirror SanitizeAnthropicProviderToolCalls.
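The sanitizer's core rule can be sketched as follows. The `Message`/`Part` types, the `"tool"` role string, and the function names here are simplified stand-ins for illustration, not the actual chatd/fantasy types:

```go
package main

import "fmt"

// Part and Message are hypothetical stand-ins for the real chatd
// message types.
type Part struct{ Type string } // "reasoning", "text", ...
type Message struct {
	Role  string // "assistant", "tool", "user"
	Parts []Part
}

// isReasoningOnly reports whether an assistant message carries nothing
// but reasoning parts. Any unknown part type returns false, so the
// predicate fails closed and the message is kept.
func isReasoningOnly(m Message) bool {
	if m.Role != "assistant" || len(m.Parts) == 0 {
		return false
	}
	for _, p := range m.Parts {
		if p.Type != "reasoning" {
			return false
		}
	}
	return true
}

// sanitize drops each reasoning-only assistant message together with
// the entire contiguous run of tool messages that follows it, so no
// later orphan tool message survives to trip the same OpenAI 400.
func sanitize(msgs []Message) []Message {
	out := make([]Message, 0, len(msgs))
	for i := 0; i < len(msgs); i++ {
		if isReasoningOnly(msgs[i]) {
			// Skip the assistant message and every tool message after it.
			for i+1 < len(msgs) && msgs[i+1].Role == "tool" {
				i++
			}
			continue
		}
		out = append(out, msgs[i])
	}
	return out
}

func main() {
	msgs := []Message{
		{Role: "user", Parts: []Part{{Type: "text"}}},
		{Role: "assistant", Parts: []Part{{Type: "reasoning"}}},
		{Role: "tool"},
		{Role: "tool"},
		{Role: "user", Parts: []Part{{Type: "text"}}},
	}
	fmt.Println(len(sanitize(msgs))) // prints 2: only the two user turns remain
}
```

Note the plain `append` into `out`: the two user messages bracketing the dropped run stay separate messages rather than being merged into one.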

This change supersedes closed #23402.

Verification

  • Manual red phase: with the unfixed pin, the new integration test fails, reproducing the exact production error: Item 'ws_xxx' of type 'web_search_call' was provided without its required 'reasoning' item: 'rs_*'.
  • Green phase: with the fixed pin, both Store=true and Store=false follow-up turns complete cleanly.
  • go test ./coderd/x/chatd/... (full suite, race) passes.
  • make pre-commit (gen / fmt / lint / build) passes.
Implementation plan and decision log

Cycle A — coder/fantasy#30 (merged via this PR's pin)

Replaced fantasy's unconditional continue for ContentTypeReasoning with a store-gated item_reference emission using GetReasoningMetadata. Inline OfReasoning replay was intentionally avoided so #181's regression doesn't return.
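The store-gated rule can be sketched like this. The types and function below are hypothetical stand-ins; the real change lives in fantasy's openai provider and goes through `GetReasoningMetadata` and openai-go input types:

```go
package main

import "fmt"

// ReasoningPart is a hypothetical stand-in for a reasoning content part.
type ReasoningPart struct {
	ItemID string // "rs_..." when OpenAI stored the reasoning item
}

// InputItem is a hypothetical stand-in for a Responses API input item.
type InputItem struct {
	Type string
	ID   string
}

// replayReasoning sketches the store-gated rule from fantasy#30: when
// store=true and the reasoning part carries a stored ItemID, emit an
// item_reference so a following web_search_call or function_call keeps
// its required pairing. Otherwise skip the part — reasoning is never
// replayed inline, to avoid re-introducing the regression #181 fixed.
func replayReasoning(p ReasoningPart, store bool) (InputItem, bool) {
	if !store || p.ItemID == "" {
		return InputItem{}, false
	}
	return InputItem{Type: "item_reference", ID: p.ItemID}, true
}

func main() {
	item, ok := replayReasoning(ReasoningPart{ItemID: "rs_123"}, true)
	fmt.Println(ok, item.Type, item.ID) // true item_reference rs_123
	_, ok = replayReasoning(ReasoningPart{ItemID: "rs_123"}, false)
	fmt.Println(ok) // false: store=false falls back to skipping
}
```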

Cycle B — chatd integration test

The integration test had to defeat chain mode (which short-circuits replay via previous_response_id). Approach taken: create two model configs and dispatch step 2 against the second. The chainInfo.modelConfigID == modelConfig.ID check then fails, and chatd sends full history through fantasy, where the bug surfaced. Step 2's input items are captured and asserted to contain the historical turn so the test fails closed if chain-mode detection changes.

ValidateResponsesAPIInput detects item_reference shape leniently because the openai-go SDK marks the type field as omitzero, so fantasy serializes references as bare {id}. The validator falls back to ID-prefix shape detection (ws_, rs_, msg_, fc_).
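The fallback heuristic amounts to prefix sniffing on the item ID. A minimal sketch (the function name and the exact mapping are illustrative, not the validator's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// kindFromID falls back to ID-prefix sniffing when a serialized item
// arrives as a bare {"id": "..."} with no type field (the openai-go
// SDK marks the type field as omitzero). The prefixes mirror the IDs
// the live API hands back for each item kind.
func kindFromID(id string) string {
	switch {
	case strings.HasPrefix(id, "ws_"):
		return "web_search_call"
	case strings.HasPrefix(id, "rs_"):
		return "reasoning"
	case strings.HasPrefix(id, "msg_"):
		return "message"
	case strings.HasPrefix(id, "fc_"):
		return "function_call"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(kindFromID("ws_abc"), kindFromID("rs_abc"), kindFromID("tool_abc"))
	// prints: web_search_call reasoning unknown
}
```

As the review noted, this is fragile against future SDK changes to ID shapes, which is why it is flagged as a follow-up rather than a guarantee.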

Cycle C — defensive sanitizer

Sanitizer drops tool messages that follow reasoning-only assistant messages. It mirrors the Anthropic sanitizer pattern (zero-allocation no-op fast path, structured stats, warn-level log). It activates for openai, azure, and openai-compat providers via isOpenAIResponsesProvider. Wired at all three boundaries that currently invoke SanitizeAnthropicProviderToolCalls: prompt-build, reload, and chatloop pre-request.

Decisions

  • item_reference over inline OfReasoning: inline replay re-introduces the failure mode that #181 fixed for plain reasoning replay.
  • Defensive sanitizer kept despite Cycle A solving the primary failure mode: belt-and-suspenders against the residual function_call_output orphan when reasoning is stripped (e.g. missing ItemID).
  • Two-model-configs test approach over a custom multi-step tool: smaller test surface, no synthetic tool stack, deterministic chain-mode bypass.
  • isReasoningOnlyAssistantMessage uses fantasy.AsMessagePart (handles pointer + value variants) and falls back to return false for any unrecognized part type so a future fantasy MessagePart addition cannot silently cause data loss.
  • Drop the entire contiguous tool-message run after a reasoning-only assistant, not just the immediate neighbor. Avoids leaving a later orphan tool message that would still trip the same OpenAI 400.
  • Use plain append (not appendSanitizedMessage) so dropping the orphan pair does not collapse the two non-tool messages bracketing the run into one.
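The pairing rule the fake server enforces can be approximated as: every web_search_call or function_call in the replayed input must follow the reasoning item it was emitted with. A simplified sketch — the real ValidateResponsesAPIInput matches the specific required reasoning ID, while this version only checks that some reasoning item immediately precedes the call:

```go
package main

import "fmt"

// inputItem is a simplified stand-in for a Responses API input item.
type inputItem struct {
	Kind string // "reasoning", "web_search_call", "function_call", "message"
	ID   string
}

// firstOrphan returns the first call item that is not immediately
// preceded by a reasoning item, mirroring (loosely) the live 400:
// "Item 'ws_xxx' ... was provided without its required 'reasoning' item".
func firstOrphan(items []inputItem) (inputItem, bool) {
	for i, it := range items {
		if it.Kind != "web_search_call" && it.Kind != "function_call" {
			continue
		}
		if i == 0 || items[i-1].Kind != "reasoning" {
			return it, true
		}
	}
	return inputItem{}, false
}

func main() {
	// Reasoning stripped on replay: the web_search_call is orphaned.
	orphan, ok := firstOrphan([]inputItem{
		{Kind: "message", ID: "msg_1"},
		{Kind: "web_search_call", ID: "ws_1"},
	})
	fmt.Println(ok, orphan.ID) // true ws_1

	// Reasoning replayed (e.g. as an item_reference): pair is intact.
	_, ok = firstOrphan([]inputItem{
		{Kind: "reasoning", ID: "rs_1"},
		{Kind: "web_search_call", ID: "ws_1"},
	})
	fmt.Println(ok) // false
}
```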
Deep-review pass

A multi-reviewer deep review (Test Auditor, Edge Case Analyst, Contract Auditor, Structural Analyst, Go Architect, Concurrency Reviewer, Performance Analyst, Duplication Checker, Modernization Reviewer, Style Reviewer) ran against the initial commit. All P2 and P3 structural findings were addressed:

  • P2 (Structural Analyst, Edge Case Analyst, Contract Auditor): sanitizer didn't cover Azure / openai-compat. Fixed via isOpenAIResponsesProvider.
  • P2 (Go Architect): sanitizer asymmetric with Anthropic at the chatloop pre-request boundary. Fixed.
  • P3 (Test Auditor, Edge Case Analyst, Contract Auditor, Structural Analyst, Go Architect): type-switch failed open for unknown / pointer parts. Fixed via fantasy.AsMessagePart + invert predicate.
  • P3 (Edge Case Analyst): sanitizer dropped only one tool message in a contiguous run. Fixed.
  • P3 (Structural Analyst): user turns merged across the dropped run via appendSanitizedMessage. Fixed by switching to plain append.
  • P3 (Structural Analyst): integration test depended on chain-mode internals without an assertion. Fixed via captured-input check.
  • Multiple style nits applied (constants over literals, ptr.Ref consistency, Step over Turn terminology, drop unnecessary error wrapper).

Follow-up items called out by reviewers but not in scope here:

  • ID-prefix sniffing in the validator is fragile against future openai-go SDK changes (Test Auditor Obs).
  • Prompt field in OpenAIRequest is dead code from a prior PR (Structural Analyst Obs).
  • Map-iteration order in unpaired-function-call validator is non-deterministic when multiple orphans exist (Edge Case Analyst, Performance Analyst Obs); fine for current tests.
  • If a third provider needs a similar sanitizer, factor out the shared shape (Duplication Checker Obs).

When the OpenAI Responses API returns reasoning + a provider-executed
web_search_call in the same turn (e.g. with gpt-5.5 at xhigh effort), a
follow-up turn that replays full conversation history produced a 400:

    Item 'ws_xxx' of type 'web_search_call' was provided without its
    required 'reasoning' item: 'rs_xxx'.

Root cause: coder/fantasy unconditionally skipped reasoning items during
replay even when store=true and another item required pairing. The
companion fix in coder/fantasy#30 emits an item_reference for the stored
reasoning ItemID alongside the following item_reference(ws_*) so the
pair is preserved.

Changes:

- Bump charm.land/fantasy replace pin to the merge of coder/fantasy#30.
- Extend coderd/x/chatd/chattest with ValidateResponsesAPIInput, a
  Responses API input validator that mirrors the live OpenAI 400s for
  the orphan web_search_call and orphan function_call cases.
- Add TestOpenAIReasoningWithWebSearchRoundTrip and
  TestOpenAIReasoningWithWebSearchRoundTripStoreFalse against a fake
  OpenAI server. The tests use two model configs to disable chain mode
  on turn 2 and force the replay path where the bug surfaced.
- Add SanitizeOpenAIOrphanToolMessages in chatprompt as defensive
  belt-and-suspenders. It drops tool messages that follow a
  reasoning-only assistant message (the residual failure mode that
  produces 'No tool output found for function call call_xxx' on the
  next turn). It mirrors SanitizeAnthropicProviderToolCalls and runs
  at both prompt-build and reload sites in chatd.go.

Verified manually: with the unfixed pin, the new integration test
fails with the production error string. With the fix, both Store=true
and Store=false paths complete the follow-up turn cleanly.

This change was generated by a Coder Agent on behalf of @ibetitsmike.
Addresses Tier 1 findings from the deep-review pass:

- chatprompt.go: invert isReasoningOnlyAssistantMessage so unknown or
  pointer-typed parts default to 'not reasoning-only' (fail closed
  instead of fail open). Also use fantasy.AsMessagePart so pointer
  variants are normalized.
- chatprompt.go: drop the orphan tool RUN, not just one tool message.
  An assistant followed by [tool, tool, ...] now has all contiguous
  tool messages dropped together with the assistant.
- chatprompt.go: sanitize via plain append (not appendSanitizedMessage)
  so the two non-tool messages bracketing a dropped run keep their
  turn boundary instead of getting merged into a single message.
- chatloop.go: also run SanitizeOpenAIOrphanToolMessages at the
  pre-request boundary, mirroring SanitizeAnthropicProviderToolCalls
  symmetry. Closes the asymmetry the Go Architect flagged.
- openai_responses_validation.go: document the responsesInputKind enum
  values (Message is intentional scaffolding for future rules).
- openai_reasoning_websearch_replay_test.go: drop the
  openAIValidationError wrapper in favor of xerrors.New; capture
  step 2's input items and assert chain mode was actually disabled
  (proves the test exercises the replay path the bug surfaces on);
  use ptr.Ref consistently; use fantasyopenai.Name instead of the
  bare 'openai' literal in the chatprompt sanitizer table; rename
  parts -> partTypes; rename Turn -> Step to match existing
  integration test terminology.
- chatprompt_test.go: add cases for empty assistant content, pointer
  reasoning parts, and contiguous tool runs; rename wantAsstDropped
  -> wantAssistantDropped to match the field it pins.
- openai_responses_validation_test.go: rename cases -> testCases for
  consistency with the rest of the chatd test suite.

This change was generated by a Coder Agent on behalf of @ibetitsmike.
Several deep-review reviewers (Structural Analyst P2, Edge Case Analyst
P3, Contract Auditor P3) flagged that SanitizeOpenAIOrphanToolMessages
was gated on provider == 'openai' exactly, missing Azure OpenAI and
openai-compat backends that route through the same fantasy openai
package and hit the same Responses API pairing rules.

Replace the gate with isOpenAIResponsesProvider, which recognizes
'openai', 'azure', and 'openai-compat'. Add table-driven test cases
proving the sanitizer fires for all three.

This change was generated by a Coder Agent on behalf of @ibetitsmike.
@ibetitsmike ibetitsmike changed the title fix(coderd/x/chatd): replay reasoning + web_search_call without orphan fix: replay reasoning + web_search_call without orphan in chatd Apr 25, 2026
@ibetitsmike ibetitsmike marked this pull request as ready for review April 25, 2026 03:24