Skip to content

fix: make HTTP 429 retryable when no fallback model, respect Retry-After header#2096

Open
simon-agent-go-expert wants to merge 2 commits intodocker:mainfrom
simon-agent-go-expert:fix/retryable-429-500
Open

fix: make HTTP 429 retryable when no fallback model, respect Retry-After header#2096
simon-agent-go-expert wants to merge 2 commits intodocker:mainfrom
simon-agent-go-expert:fix/retryable-429-500

Conversation

@simon-agent-go-expert
Copy link

Summary

Fixes #2095

When the model provider returns HTTP 429 (Too Many Requests):

  • With fallback models configured → skip to the next fallback immediately (existing behaviour preserved)
  • Without fallback models → retry the same model, honouring the Retry-After response header (capped at 60 s), falling back to exponential backoff when the header is absent

HTTP 500 was already retryable via IsRetryableStatusCode — no change needed.

What changed

pkg/modelerrors/modelerrors.go

  • ExtractRetryAfter(err error) time.Duration — extracts Retry-After from *anthropic.Error and *openai.Error (both expose *http.Response); supports integer-seconds and HTTP-date formats
  • parseRetryAfterHeader(value string) time.Duration — parses the header per RFC 7231 §7.1.3
  • ClassifyModelError(err error) (retryable, rateLimited bool, retryAfter time.Duration) — unified classifier used by the retry loop; existing IsRetryableModelError / IsRetryableStatusCode unchanged for backward compatibility
  • MaxRetryAfterWait constant (60 s cap) to prevent a misbehaving server from blocking the agent indefinitely

pkg/runtime/fallback.go

  • Replaced both IsRetryableModelError call sites with ClassifyModelError
  • Extracted handleModelError helper method to avoid duplicating classification logic across stream-creation and stream-handling error paths
  • Added retryDecision type (Continue / Break / Return) for clarity

Tests

  • pkg/modelerrors/modelerrors_test.go: TestClassifyModelError, TestExtractRetryAfter, TestParseRetryAfterHeader
  • pkg/runtime/fallback_test.go: TestFallback429WithFallbacksSkipsToNextModel, TestFallback429WithoutFallbacksRetriesSameModel, TestFallback429WithoutFallbacksExhaustsRetries, TestFallback500RetryableWithBackoff

Note on HTTP 500

HTTP 500 was already handled as retryable by IsRetryableStatusCode(500) == true before this PR. The requirement is satisfied by the existing code; this PR adds an explicit test (TestFallback500RetryableWithBackoff) to document and protect that behaviour.

…ter header

When a model returns HTTP 429 (Too Many Requests):
- If fallback models are configured: skip to next fallback immediately (existing behavior preserved)
- If no fallbacks are configured: retry the same model using the Retry-After header duration (capped at 60s), falling back to exponential backoff if the header is absent

HTTP 500 was already handled as retryable via IsRetryableStatusCode — no change needed.

New additions to pkg/modelerrors:
- ExtractRetryAfter(err): extracts Retry-After duration from *anthropic.Error and *openai.Error responses
- parseRetryAfterHeader(value): parses integer-seconds or HTTP-date Retry-After values
- ClassifyModelError(err): unified classifier returning (retryable, rateLimited bool, retryAfter time.Duration)
- MaxRetryAfterWait constant (60s cap)

pkg/runtime/fallback.go changes:
- Replace both IsRetryableModelError call sites with ClassifyModelError
- Add handleModelError helper to avoid duplicating classification logic
- 429 with fallbacks → break to next model (unchanged behavior)
- 429 without fallbacks → sleep(min(retryAfter, 60s) or backoff), then continue

Fixes docker#2095

Assisted-By: docker-agent
@simon-agent-go-expert simon-agent-go-expert requested a review from a team as a code owner March 13, 2026 10:18
Copy link

@docker-agent docker-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Assessment: 🟢 APPROVE

Summary

This PR successfully implements HTTP 429 retry logic with Retry-After header support. The code review found no bugs in the changed code.

Key changes reviewed:

  • ExtractRetryAfter correctly extracts and parses Retry-After headers from both Anthropic and OpenAI SDK errors
  • parseRetryAfterHeader properly handles both integer-seconds and HTTP-date formats per RFC 7231
  • ClassifyModelError provides unified error classification with proper separation of concerns (retryable vs rate-limited)
  • handleModelError in fallback.go correctly implements the retry-vs-fallback decision logic
  • ✅ 60-second cap on Retry-After prevents indefinite blocking
  • ✅ Comprehensive test coverage for all new functions and edge cases

The implementation is well-structured, properly tested, and maintains backward compatibility with existing error handling functions.

No issues found — the code is ready to merge.

simon-agent-go-expert added a commit to simon-agent-go-expert/cagent that referenced this pull request Mar 13, 2026
…ter to providers

Addresses two inline review comments on PR docker#2096:

1. Comment 1 (line 35): Merge the standalone maxRetryAfterWait and MaxRetryAfterWait constants
   into the existing backoff constants block. Eliminates the two-step declaration.

2. Comment 2 (line 313): Move provider-specific Retry-After header extraction logic out of
   the shared modelerrors package and into each provider package:
   - pkg/model/provider/anthropic/adapter.go: ExtractRetryAfter + parseRetryAfterHeader
   - pkg/model/provider/openai/retry.go:      ExtractRetryAfter + parseRetryAfterHeader
   - pkg/model/provider/retry.go:             top-level ExtractRetryAfter dispatcher
   - modelerrors.ClassifyModelError now accepts retryAfter time.Duration as a parameter
     (caller supplies it from provider.ExtractRetryAfter) instead of calling into SDKs itself
   - modelerrors.ParseRetryAfterHeader exported for general use (header parsing is generic)
   - fallback.go: calls provider.ExtractRetryAfter(err) then passes result to ClassifyModelError

Tests updated accordingly:
   - modelerrors_test.go: updated ClassifyModelError call sites, removed SDK helper functions
   - anthropic/retry_test.go: new test file covering ExtractRetryAfter and parseRetryAfterHeader
   - openai/retry_test.go:    new test file covering ExtractRetryAfter and parseRetryAfterHeader

Assisted-By: docker-agent

// ExtractRetryAfter extracts the Retry-After wait duration from an OpenAI API error.
// Returns 0 if the error is not an *openai.Error, has no response, or has no Retry-After header.
func ExtractRetryAfter(err error) time.Duration {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the same exact same as the anthropic implementation.
I like the idea of having each provider customize the retry after extraction, but I'd also like to avoid code duplication

// It delegates to each provider package to keep provider-specific SDK knowledge
// local to each provider.
// Returns 0 if no Retry-After header is found or the error type is unsupported.
func ExtractRetryAfter(err error) time.Duration {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need this. Providers could return a typed error that contains everything we need so the check would become (in fallbac) a errors.As or whatever

…rors at Recv()

Addresses review feedback from rumpl and simonferquel on PR docker#2096.

The previous approach extracted Retry-After metadata in a dispatcher function
that type-switched on provider SDK error types (`*anthropic.Error`,
`*openai.Error`). This kept all providers coupled to the modelerrors package
and duplicated parsing logic.

New design:

1. `modelerrors.StatusError` — a thin typed wrapper carrying:
   - StatusCode int
   - RetryAfter time.Duration  (parsed from Retry-After header)
   - Err error                 (original SDK error, accessible via Unwrap)

2. `modelerrors.WrapHTTPError(statusCode, resp, err)` — wraps any provider
   error in StatusError; parses Retry-After from resp.Header if non-nil.

3. `modelerrors.ParseRetryAfterHeader(value)` — shared RFC 7231 §7.1.3
   parser (integer seconds + HTTP-date). No SDK imports.

4. `modelerrors.ClassifyModelError(err)` — single-arg; uses errors.As to
   find *StatusError in the chain; falls back to message/pattern matching for
   providers not yet wrapped (e.g. Bedrock).

5. Three new wrap.go files (one per provider family), each a single function:
   - pkg/model/provider/anthropic/wrap.go  — wrapAnthropicError(*anthropic.Error)
   - pkg/model/provider/oaistream/wrap.go  — WrapOpenAIError(*openai.Error) (exported)
   - pkg/model/provider/gemini/wrap.go     — wrapGeminiError(*genai.APIError)

6. Wrap sites — one-line change at Recv() in each adapter:
   - anthropic/adapter.go, beta_adapter.go
   - oaistream/adapter.go
   - openai/response_stream.go (reuses oaistream.WrapOpenAIError)
   - gemini/adapter.go

No changes to fallback.go logic — ClassifyModelError signature stays
`(err error)` and retry/rate-limit decisions are unchanged.

Assisted-By: docker-agent
@simonferquel
Copy link
Contributor

I like the new approach, letting @rumpl to take the final decision

@simonferquel
Copy link
Contributor

/review

triggering docker agent for good measure

Copy link

@docker-agent docker-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Assessment: 🟢 APPROVE

Summary

This PR implements HTTP 429 retry logic with Retry-After header support when no fallback models are configured. The implementation is well-structured and comprehensive:

Strengths:

  • ✅ Proper error classification with ClassifyModelError that distinguishes retryable errors, rate limits, and extracts retry delays
  • ✅ Correct Retry-After header parsing supporting both integer-seconds and HTTP-date formats per RFC 7231
  • ✅ Sensible 60-second cap on Retry-After to prevent indefinite blocking
  • ✅ Clean refactoring of fallback.go with extracted handleModelError helper to avoid code duplication
  • ✅ Comprehensive test coverage including edge cases (429 with/without fallbacks, 500 retries, header parsing)
  • ✅ Backward compatibility preserved for existing IsRetryableModelError and IsRetryableStatusCode functions
  • ✅ Proper error wrapping throughout the provider adapters

No bugs found in the changed code. The implementation correctly handles the distinction between rate-limited errors (429) with and without fallback models, and properly respects the Retry-After header when present.

Findings

No issues to report. The code is production-ready.


Automated review by docker-agent PR reviewer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handle HTTP 429 and 500 errors as retryable in model provider

3 participants