fix: make HTTP 429 retryable when no fallback model, respect Retry-After header by simon-agent-go-expert · Pull Request #2096 · docker/docker-agent

simon-agent-go-expert · 2026-03-13T10:18:08Z

Summary

Fixes #2095

When the model provider returns HTTP 429 (Too Many Requests):

With fallback models configured → skip to the next fallback immediately (existing behaviour preserved)
Without fallback models → retry the same model, honouring the Retry-After response header (capped at 60 s), falling back to exponential backoff when the header is absent

HTTP 500 was already retryable via IsRetryableStatusCode — no change needed.

What changed

`pkg/modelerrors/modelerrors.go`

ExtractRetryAfter(err error) time.Duration — extracts Retry-After from *anthropic.Error and *openai.Error (both expose *http.Response); supports integer-seconds and HTTP-date formats
parseRetryAfterHeader(value string) time.Duration — parses the header per RFC 7231 §7.1.3
ClassifyModelError(err error) (retryable, rateLimited bool, retryAfter time.Duration) — unified classifier used by the retry loop; existing IsRetryableModelError / IsRetryableStatusCode unchanged for backward compatibility
MaxRetryAfterWait constant (60 s cap) to prevent a misbehaving server from blocking the agent indefinitely

`pkg/runtime/fallback.go`

Replaced both IsRetryableModelError call sites with ClassifyModelError
Extracted handleModelError helper method to avoid duplicating classification logic across stream-creation and stream-handling error paths
Added retryDecision type (Continue / Break / Return) for clarity

Tests

pkg/modelerrors/modelerrors_test.go: TestClassifyModelError, TestExtractRetryAfter, TestParseRetryAfterHeader
pkg/runtime/fallback_test.go: TestFallback429WithFallbacksSkipsToNextModel, TestFallback429WithoutFallbacksRetriesSameModel, TestFallback429WithoutFallbacksExhaustsRetries, TestFallback500RetryableWithBackoff

Note on HTTP 500

HTTP 500 was already handled as retryable by IsRetryableStatusCode(500) == true before this PR. The requirement is satisfied by the existing code; this PR adds an explicit test (TestFallback500RetryableWithBackoff) to document and protect that behaviour.

…ter header When a model returns HTTP 429 (Too Many Requests): - If fallback models are configured: skip to next fallback immediately (existing behavior preserved) - If no fallbacks are configured: retry the same model using the Retry-After header duration (capped at 60s), falling back to exponential backoff if the header is absent HTTP 500 was already handled as retryable via IsRetryableStatusCode — no change needed. New additions to pkg/modelerrors: - ExtractRetryAfter(err): extracts Retry-After duration from *anthropic.Error and *openai.Error responses - parseRetryAfterHeader(value): parses integer-seconds or HTTP-date Retry-After values - ClassifyModelError(err): unified classifier returning (retryable, rateLimited bool, retryAfter time.Duration) - MaxRetryAfterWait constant (60s cap) pkg/runtime/fallback.go changes: - Replace both IsRetryableModelError call sites with ClassifyModelError - Add handleModelError helper to avoid duplicating classification logic - 429 with fallbacks → break to next model (unchanged behavior) - 429 without fallbacks → sleep(min(retryAfter, 60s) or backoff), then continue Fixes docker#2095 Assisted-By: docker-agent

docker-agent

Review Summary

Assessment: 🟢 APPROVE

Summary

This PR successfully implements HTTP 429 retry logic with Retry-After header support. The code review found no bugs in the changed code.

Key changes reviewed:

✅ ExtractRetryAfter correctly extracts and parses Retry-After headers from both Anthropic and OpenAI SDK errors
✅ parseRetryAfterHeader properly handles both integer-seconds and HTTP-date formats per RFC 7231
✅ ClassifyModelError provides unified error classification with proper separation of concerns (retryable vs rate-limited)
✅ handleModelError in fallback.go correctly implements the retry-vs-fallback decision logic
✅ 60-second cap on Retry-After prevents indefinite blocking
✅ Comprehensive test coverage for all new functions and edge cases

The implementation is well-structured, properly tested, and maintains backward compatibility with existing error handling functions.

No issues found — the code is ready to merge.

pkg/modelerrors/modelerrors.go

…ter to providers Addresses two inline review comments on PR docker#2096: 1. Comment 1 (line 35): Merge the standalone maxRetryAfterWait and MaxRetryAfterWait constants into the existing backoff constants block. Eliminates the two-step declaration. 2. Comment 2 (line 313): Move provider-specific Retry-After header extraction logic out of the shared modelerrors package and into each provider package: - pkg/model/provider/anthropic/adapter.go: ExtractRetryAfter + parseRetryAfterHeader - pkg/model/provider/openai/retry.go: ExtractRetryAfter + parseRetryAfterHeader - pkg/model/provider/retry.go: top-level ExtractRetryAfter dispatcher - modelerrors.ClassifyModelError now accepts retryAfter time.Duration as a parameter (caller supplies it from provider.ExtractRetryAfter) instead of calling into SDKs itself - modelerrors.ParseRetryAfterHeader exported for general use (header parsing is generic) - fallback.go: calls provider.ExtractRetryAfter(err) then passes result to ClassifyModelError Tests updated accordingly: - modelerrors_test.go: updated ClassifyModelError call sites, removed SDK helper functions - anthropic/retry_test.go: new test file covering ExtractRetryAfter and parseRetryAfterHeader - openai/retry_test.go: new test file covering ExtractRetryAfter and parseRetryAfterHeader Assisted-By: docker-agent

simonferquel · 2026-03-13T10:40:42Z

pkg/model/provider/openai/retry.go

+
+// ExtractRetryAfter extracts the Retry-After wait duration from an OpenAI API error.
+// Returns 0 if the error is not an *openai.Error, has no response, or has no Retry-After header.
+func ExtractRetryAfter(err error) time.Duration {


This seems to be the same exact same as the anthropic implementation.
I like the idea of having each provider customize the retry after extraction, but I'd also like to avoid code duplication

rumpl · 2026-03-13T10:43:26Z

pkg/model/provider/retry.go

+// It delegates to each provider package to keep provider-specific SDK knowledge
+// local to each provider.
+// Returns 0 if no Retry-After header is found or the error type is unsupported.
+func ExtractRetryAfter(err error) time.Duration {


We shouldn't need this. Providers could return a typed error that contains everything we need so the check would become (in fallbac) a errors.As or whatever

…rors at Recv() Addresses review feedback from rumpl and simonferquel on PR docker#2096. The previous approach extracted Retry-After metadata in a dispatcher function that type-switched on provider SDK error types (`*anthropic.Error`, `*openai.Error`). This kept all providers coupled to the modelerrors package and duplicated parsing logic. New design: 1. `modelerrors.StatusError` — a thin typed wrapper carrying: - StatusCode int - RetryAfter time.Duration (parsed from Retry-After header) - Err error (original SDK error, accessible via Unwrap) 2. `modelerrors.WrapHTTPError(statusCode, resp, err)` — wraps any provider error in StatusError; parses Retry-After from resp.Header if non-nil. 3. `modelerrors.ParseRetryAfterHeader(value)` — shared RFC 7231 §7.1.3 parser (integer seconds + HTTP-date). No SDK imports. 4. `modelerrors.ClassifyModelError(err)` — single-arg; uses errors.As to find *StatusError in the chain; falls back to message/pattern matching for providers not yet wrapped (e.g. Bedrock). 5. Three new wrap.go files (one per provider family), each a single function: - pkg/model/provider/anthropic/wrap.go — wrapAnthropicError(*anthropic.Error) - pkg/model/provider/oaistream/wrap.go — WrapOpenAIError(*openai.Error) (exported) - pkg/model/provider/gemini/wrap.go — wrapGeminiError(*genai.APIError) 6. Wrap sites — one-line change at Recv() in each adapter: - anthropic/adapter.go, beta_adapter.go - oaistream/adapter.go - openai/response_stream.go (reuses oaistream.WrapOpenAIError) - gemini/adapter.go No changes to fallback.go logic — ClassifyModelError signature stays `(err error)` and retry/rate-limit decisions are unchanged. Assisted-By: docker-agent

simonferquel · 2026-03-13T11:16:25Z

I like the new approach, letting @rumpl to take the final decision

simonferquel · 2026-03-13T11:17:03Z

/review

triggering docker agent for good measure

docker-agent

Review Summary

Assessment: 🟢 APPROVE

Summary

This PR implements HTTP 429 retry logic with Retry-After header support when no fallback models are configured. The implementation is well-structured and comprehensive:

Strengths:

✅ Proper error classification with ClassifyModelError that distinguishes retryable errors, rate limits, and extracts retry delays
✅ Correct Retry-After header parsing supporting both integer-seconds and HTTP-date formats per RFC 7231
✅ Sensible 60-second cap on Retry-After to prevent indefinite blocking
✅ Clean refactoring of fallback.go with extracted handleModelError helper to avoid code duplication
✅ Comprehensive test coverage including edge cases (429 with/without fallbacks, 500 retries, header parsing)
✅ Backward compatibility preserved for existing IsRetryableModelError and IsRetryableStatusCode functions
✅ Proper error wrapping throughout the provider adapters

No bugs found in the changed code. The implementation correctly handles the distinction between rate-limited errors (429) with and without fallback models, and properly respects the Retry-After header when present.

Findings

No issues to report. The code is production-ready.

Automated review by docker-agent PR reviewer

simon-agent-go-expert requested a review from a team as a code owner March 13, 2026 10:18

docker-agent bot reviewed Mar 13, 2026

View reviewed changes

rumpl reviewed Mar 13, 2026

View reviewed changes

pkg/modelerrors/modelerrors.go Outdated Show resolved Hide resolved

rumpl reviewed Mar 13, 2026

View reviewed changes

pkg/modelerrors/modelerrors.go Outdated Show resolved Hide resolved

simonferquel reviewed Mar 13, 2026

View reviewed changes

rumpl reviewed Mar 13, 2026

View reviewed changes

simon-agent-go-expert force-pushed the fix/retryable-429-500 branch from 96806ae to 3d44b7c Compare March 13, 2026 11:07

docker-agent bot reviewed Mar 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make HTTP 429 retryable when no fallback model, respect Retry-After header#2096

fix: make HTTP 429 retryable when no fallback model, respect Retry-After header#2096
simon-agent-go-expert wants to merge 2 commits intodocker:mainfrom
simon-agent-go-expert:fix/retryable-429-500

simon-agent-go-expert commented Mar 13, 2026

Uh oh!

docker-agent bot left a comment

Uh oh!

Uh oh!

Uh oh!

simonferquel Mar 13, 2026

Uh oh!

rumpl Mar 13, 2026

Uh oh!

simonferquel commented Mar 13, 2026

Uh oh!

simonferquel commented Mar 13, 2026

Uh oh!

docker-agent bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

simon-agent-go-expert commented Mar 13, 2026

Summary

What changed

pkg/modelerrors/modelerrors.go

pkg/runtime/fallback.go

Tests

Note on HTTP 500

Uh oh!

docker-agent bot left a comment

Choose a reason for hiding this comment

Review Summary

Assessment: 🟢 APPROVE

Summary

Uh oh!

Uh oh!

Uh oh!

simonferquel Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

rumpl Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

simonferquel commented Mar 13, 2026

Uh oh!

simonferquel commented Mar 13, 2026

Uh oh!

docker-agent bot left a comment

Choose a reason for hiding this comment

Review Summary

Assessment: 🟢 APPROVE

Summary

Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

`pkg/modelerrors/modelerrors.go`

`pkg/runtime/fallback.go`