fix(foreman): retry transient SSE/transport errors in the OAI client (#815) #816

Open
Defilan wants to merge 1 commit into defilantech:main from Defilan:foreman/issue-815-sse-retry

@Defilan Defilan commented Jun 23, 2026

Member

What

Retries transient transport errors in the OAI streaming client instead of failing the entire AgenticTask on a single mid-stream disconnect.

Why

Fixes #815

An SSE response dropped mid-completion ("read SSE stream: unexpected EOF", raised in oai/client.go) surfaced as an ExecutorError with no retry, discarding the whole multi-turn run. On long gateway-routed coder runs this is costly: two ~50-turn Strix runs died this way (at turn 36 and turn 50). The model itself is healthy in these cases, and the same request succeeds when retried.

How

In pkg/foreman/agent/oai/client.go, classify transport-level errors as retryable and fold them into the client's existing Chat retry loop (alongside ErrTruncatedToolCallArguments and timeout retries), so they get the existing bounded exponential backoff:

  • New isRetryableTransportError: io.ErrUnexpectedEOF, mid-stream io.EOF, net.Error timeouts, syscall.ECONNRESET, and the "connection reset by peer" string form.
  • Genuine API errors (4xx/5xx JSON error bodies) are explicitly not retried and pass through unchanged.
  • Tests (client_test.go) cover: an httptest server that drops the stream then succeeds on retry, and a non-retryable API error that is not retried.

Surgical: no client restructuring, reuses the established retry/backoff path.

Provenance (transparency)

Authored by a Foreman coder run on the AMD Strix node (dense Qwopus-27B, gateway-routed), then hand-verified for scope (changes confined to oai/) and correctness. Result: GO, with the deterministic verify gate reporting GATE-PASS.

Note: deploying this makes long gateway-routed runs resilient to the very class of drop that has been failing them, which unblocks reliably completing larger harness changes (e.g. #813).

Checklist

  • Tests added/updated
  • make test passes (verify gate GATE-PASS)
  • make lint passes (verify gate)
  • Conventional-commit message, DCO signed off
  • Documentation updated (if user-facing change) — internal client behavior, no user-facing docs

The OAI client's Chat method already retried on ErrTruncatedToolCallArguments
and per-request header timeouts. However, mid-stream transport disconnects
(e.g. io.ErrUnexpectedEOF, connection reset by peer) were not classified as
retryable, causing a single transient blip on a long run to fail the entire
AgenticTask and discard all prior turns.

Add isRetryableTransportError to classify transport-level errors as retryable:
- io.ErrUnexpectedEOF and io.EOF mid-stream
- net.Error timeouts (already handled by isRetryableTimeout, but included for
  completeness in the standalone function)
- syscall.ECONNRESET and "connection reset by peer"

Genuine API errors (4xx/5xx JSON error bodies) are NOT retryable and pass
through unchanged. The existing retry loop with exponential backoff and
bounded attempts (maxRetries) is reused, so the behavior is consistent with
the existing retry semantics.

Add two tests:
- TestClient_Chat_RetriesTransportErrorThenSucceeds: verifies a mid-stream
  connection drop is retried and the full response is returned
- TestClient_Chat_TransportErrorNotRetriedOnAPIError: verifies a 400 API
  error is NOT retried

Fixes defilantech#815

Signed-off-by: Foreman Bot <chris@mahercode.io>
codecov Bot commented Jun 23, 2026


Codecov Report

❌ Patch coverage is 42.85714% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/foreman/agent/oai/client.go | 42.85% | 4 Missing and 4 partials ⚠️ |

Development

Successfully merging this pull request may close these issues.

[BUG] foreman-agent: transient SSE stream drop fails the whole task instead of retrying
