Feature hasn't been suggested before.
Describe the enhancement you want to request
Summary
This PR improves production reliability for LLM provider calls by adding safer timeout and retry behavior around OpenAI, Anthropic, and Gemini requests.
Problem
A few provider paths currently appear to depend on caller-provided context behavior and provider-specific retry handling:
- OpenAI, Anthropic, and Gemini calls pass `ctx` directly into SDK calls without applying a provider-layer timeout
- title generation uses `context.Background()`, so it can continue after the parent request is cancelled
- retry backoff appears deterministic rather than jittered
- retry classification differs across providers and misses some transient failures
- Gemini streaming retry behavior can enter an inconsistent state after stream errors
- Gemini client init can return a provider with a nil client after logging an error
- some Gemini tool-call JSON parse/marshal errors are ignored
Changes
- Add provider-layer timeout/deadline handling around LLM API calls
- Derive title-generation context from the parent request with a short timeout
- Replace deterministic retry delay with jittered backoff
- Centralize retry classification for transient provider failures
- Include common retryable cases such as:
  - `429`
  - `500`
  - `502`
  - `503`
  - `504`
  - transport resets/timeouts where detectable
- Ensure Gemini stream retry exits and restarts cleanly instead of continuing in a partial stream state
- Return Gemini client initialization errors instead of creating a provider with a nil client
- Add structured warnings for ignored Gemini tool-call JSON conversion errors
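To make the retry items above concrete, here is a minimal sketch of a shared retry classifier plus a jittered backoff. It is not the PR's actual code: `isRetryableStatus`, `isRetryableError`, and `backoffWithJitter` are hypothetical names, the transport checks are one plausible reading of "where detectable", and "full jitter" (a uniform delay between zero and an exponential ceiling) is an assumed strategy.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"math/rand"
	"net"
	"syscall"
	"time"
)

// isRetryableStatus reports whether an HTTP status code is one of the
// transient cases listed in this issue: 429, 500, 502, 503, 504.
func isRetryableStatus(code int) bool {
	switch code {
	case 429, 500, 502, 503, 504:
		return true
	}
	return false
}

// isRetryableError is a hypothetical transport-level classifier.
// Caller cancellation is never retried; timeouts and connection resets are.
// Note that context.DeadlineExceeded satisfies net.Error with Timeout() == true,
// so a retry loop should still check the parent context before trying again.
func isRetryableError(err error) bool {
	if err == nil || errors.Is(err, context.Canceled) {
		return false
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return errors.Is(err, io.ErrUnexpectedEOF) || errors.Is(err, syscall.ECONNRESET)
}

// backoffWithJitter returns a random delay in [0, min(max, base*2^attempt)].
// Randomizing the whole interval keeps clients that failed at the same moment
// from retrying at the same moment.
func backoffWithJitter(attempt int, base, max time.Duration) time.Duration {
	ceiling := base
	for i := 0; i < attempt; i++ {
		if ceiling > max/2 { // doubling would exceed max (and could overflow)
			ceiling = max
			break
		}
		ceiling *= 2
	}
	if ceiling > max {
		ceiling = max
	}
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(attempt, backoffWithJitter(attempt, 200*time.Millisecond, 5*time.Second))
	}
	fmt.Println(isRetryableStatus(503), isRetryableError(io.ErrUnexpectedEOF))
}
```

Keeping the classifier in one place means each provider only has to map its SDK error to a status code or a wrapped transport error, rather than carrying its own retry list.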
Why this matters
These changes reduce the chance of:
- hung provider calls tying up goroutines/session slots
- orphaned title-generation requests continuing after cancellation
- synchronized retry storms during provider 429/5xx incidents
- avoidable user-visible failures during short provider outages
- duplicate/truncated Gemini streaming output after retryable stream errors
- nil-client runtime failures after Gemini initialization errors
- silent tool-call degradation from ignored JSON conversion failures
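As an illustration of the hung-call and orphaned-title-generation points, the sketch below derives the title-generation context from the parent request instead of `context.Background()`. The names `generateTitle`, `slowProvider`, and `titleStopsWithParent` are hypothetical, and the 10-second `titleTimeout` is an assumed value; the issue only asks for "a short timeout".

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// titleTimeout is an assumed value, not one taken from the PR.
const titleTimeout = 10 * time.Second

// generateTitle is a hypothetical stand-in for the title-generation path.
// The derived context ends when the parent is cancelled or when the timeout
// elapses, whichever happens first. call represents the provider SDK request
// and is expected to honor ctx.
func generateTitle(parent context.Context, call func(context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(parent, titleTimeout)
	defer cancel() // release the timer as soon as the call returns
	return call(ctx)
}

// slowProvider simulates a provider call that takes five seconds unless its
// context ends earlier.
func slowProvider(ctx context.Context) (string, error) {
	select {
	case <-time.After(5 * time.Second):
		return "generated title", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// titleStopsWithParent reports whether title generation stops with
// context.Canceled when the parent request has already been cancelled.
func titleStopsWithParent() bool {
	parent, cancel := context.WithCancel(context.Background())
	cancel() // simulate the user abandoning the request
	_, err := generateTitle(parent, slowProvider)
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println("stops with parent:", titleStopsWithParent())
}
```

The same wrapper shape could serve as the provider-layer timeout for regular completion calls, with a longer deadline chosen per provider or per request type.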
Notes
This PR does not add provider/model fallback routing yet. That can be handled separately as a follow-up because it needs policy decisions around cost, model compatibility, and output behavior.