[FEATURE]: Add provider timeouts and jittered retry handling for LLM API calls #28832

@michaelmanly

Description

Feature hasn't been suggested before.

  • I have verified this feature I'm about to request hasn't been suggested before.

Describe the enhancement you want to request

Summary

This change improves production reliability for LLM provider calls by adding safer timeout and retry behavior around OpenAI, Anthropic, and Gemini requests.

Problem

Several provider paths currently appear to rely on the caller's context for cancellation and on provider-specific retry logic:

  • OpenAI, Anthropic, and Gemini calls pass ctx directly into SDK calls without applying a provider-layer timeout
  • title generation uses context.Background(), so it can continue after the parent request is cancelled
  • retry backoff appears deterministic rather than jittered
  • retry classification differs across providers and misses some transient failures
  • Gemini streaming retry behavior can enter an inconsistent state after stream errors
  • Gemini client init can return a provider with a nil client after logging an error
  • some Gemini tool-call JSON parse/marshal errors are ignored

Changes

  • Add provider-layer timeout/deadline handling around LLM API calls
  • Derive title-generation context from the parent request with a short timeout
  • Replace deterministic retry delay with jittered backoff
  • Centralize retry classification for transient provider failures
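  • Treat common transient failures as retryable, such as:
    • HTTP 429 (Too Many Requests)
    • HTTP 500 (Internal Server Error)
    • HTTP 502 (Bad Gateway)
    • HTTP 503 (Service Unavailable)
    • HTTP 504 (Gateway Timeout)
    • transport resets/timeouts where detectable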
  • Ensure Gemini stream retry exits and restarts cleanly instead of continuing in a partial stream state
  • Return Gemini client initialization errors instead of creating a provider with a nil client
  • Add structured warnings for ignored Gemini tool-call JSON conversion errors

Why this matters

These changes reduce the chance of:

  • hung provider calls tying up goroutines/session slots
  • orphaned title-generation requests continuing after cancellation
  • synchronized retry storms during provider 429/5xx incidents
  • avoidable user-visible failures during short provider outages
  • duplicate/truncated Gemini streaming output after retryable stream errors
  • nil-client runtime failures after Gemini initialization errors
  • silent tool-call degradation from ignored JSON conversion failures

Notes

This change does not add provider/model fallback routing yet. That can be handled separately as a follow-up, because it needs policy decisions around cost, model compatibility, and output behavior.
