fix(foreman): retry transient SSE/transport errors in the OAI client (#815)#816
Open
Defilan wants to merge 1 commit into
Open
fix(foreman): retry transient SSE/transport errors in the OAI client (#815)#816Defilan wants to merge 1 commit into
Defilan wants to merge 1 commit into
Conversation
The OAI client's Chat method already retried on ErrTruncatedToolCallArguments and per-request header timeouts. However, mid-stream transport disconnects (e.g. io.ErrUnexpectedEOF, connection reset by peer) were not classified as retryable, causing a single transient blip on a long run to fail the entire AgenticTask and discard all prior turns. Add isRetryableTransportError to classify transport-level errors as retryable: - io.ErrUnexpectedEOF and io.EOF mid-stream - net.Error timeouts (already handled by isRetryableTimeout, but included for completeness in the standalone function) - syscall.ECONNRESET and "connection reset by peer" Genuine API errors (4xx/5xx JSON error bodies) are NOT retryable and pass through unchanged. The existing retry loop with exponential backoff and bounded attempts (maxRetries) is reused, so the behavior is consistent with the existing retry semantics. Add two tests: - TestClient_Chat_RetriesTransportErrorThenSucceeds: verifies a mid-stream connection drop is retried and the full response is returned - TestClient_Chat_TransportErrorNotRetriedOnAPIError: verifies a 400 API error is NOT retried Fixes defilantech#815 Signed-off-by: Foreman Bot <chris@mahercode.io>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Retries transient transport errors in the OAI streaming client instead of failing the entire AgenticTask on a single mid-stream disconnect.
Why
Fixes #815
A dropped SSE response mid-completion (
read SSE stream: unexpected EOF,oai/client.go) bubbled up as a retry-lessExecutorError, discarding a whole multi-turn run. On long gateway-routed coder runs this is costly: two ~50-turn Strix runs died this way (turn 36, turn 50). The model is fine and the same request succeeds on a retry.How
In
pkg/foreman/agent/oai/client.go, classify transport-level errors as retryable and fold them into the client's existing Chat retry loop (alongsideErrTruncatedToolCallArgumentsand timeout retries), so they get the existing bounded exponential backoff:isRetryableTransportError:io.ErrUnexpectedEOF, mid-streamio.EOF,net.Errortimeouts,syscall.ECONNRESET, and the"connection reset by peer"string form.client_test.go) cover: an httptest server that drops the stream then succeeds on retry, and a non-retryable API error that is not retried.Surgical: no client restructuring, reuses the established retry/backoff path.
Provenance (transparency)
Authored by a Foreman coder run on the AMD Strix node (dense Qwopus-27B, gateway-routed), then hand-verified for scope (
oai/only) and correctness. GO + deterministic verify gate GATE-PASS.Note: deploying this makes long gateway-routed runs resilient to the very class of drop that has been failing them, which unblocks reliably completing larger harness changes (e.g. #813).
Checklist
make testpasses (verify gate GATE-PASS)make lintpasses (verify gate)