Multi-API-Key Load Balancing for LLM Providers
Is your feature request related to a problem?
Yes. For individual users on free or low-tier plans of the OpenAI, Azure, or Google AI APIs, or enterprise users running self-hosted model service platforms, a single API key is severely constrained by strict RPM (Requests Per Minute) and TPM (Tokens Per Minute) rate limits - typically single or double digits for RPM.
In our team's practical scenarios, running DeepWiki on large-scale repositories and handling subsequent RAG indexing/querying demands can easily consume over 100,000 tokens per minute. When DeepWiki is deployed as a service for development teams with 20+ concurrent users, at least 70 LLM requests per minute are required. Supporting only one API key per provider prevents cost-sensitive individual and enterprise users from deploying DeepWiki effectively.
Describe the solution you'd like
Implement a multi-API-key configuration system with an intelligent load-balancing algorithm that distributes model requests and token consumption across multiple API keys, working around per-key rate limits without requiring users to increase their spending.
Key Features:
- Multiple Key Support: Allow users to configure multiple API keys per provider (e.g., `OPENAI_API_KEYS=key1,key2,key3`)
- Load Balancing Strategy (see the sketch after this list):
  - Primary criterion: least-used key (by request count)
  - Tiebreaker: least recently used timestamp
  - Per-provider independent tracking
- Configuration Flexibility:
  - Environment variables with comma-separated values
  - JSON configuration file support
  - Dynamic key rotation without service restart
- Monitoring & Observability:
  - Real-time key usage statistics
  - Balance ratio metrics
  - Per-key performance tracking
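For illustration, here is a minimal sketch of the selection strategy described above (least-used key, ties broken by least-recently-used, tracked independently per provider). The names `KeyBalancer` and `acquire` are hypothetical and not taken from the actual implementation:

```python
import time
from dataclasses import dataclass


@dataclass
class KeyStats:
    requests: int = 0
    last_used: float = 0.0


class KeyBalancer:
    """Per-provider tracker that picks the least-used key, LRU as tiebreaker."""

    def __init__(self, keys: list[str]):
        self.stats = {key: KeyStats() for key in keys}

    def acquire(self) -> str:
        # Choose the key with the fewest requests; break ties by oldest last_used.
        key = min(self.stats, key=lambda k: (self.stats[k].requests, self.stats[k].last_used))
        self.stats[key].requests += 1
        self.stats[key].last_used = time.time()
        return key


# One balancer per provider keeps usage tracking independent.
balancers = {
    "openai": KeyBalancer(["sk-key1", "sk-key2", "sk-key3"]),
    "google": KeyBalancer(["AIza-key1", "AIza-key2"]),
}
api_key = balancers["openai"].acquire()  # -> least-used OpenAI key
```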
Benefits:
- ✅ Users can combine multiple free/low-tier keys to achieve higher effective rate limits
- ✅ Cost remains controlled while bypassing single-key rate limit restrictions
- ✅ Improved service reliability and availability
- ✅ Horizontal scaling capability for enterprise deployments
Describe alternatives you've considered
Alternative 1: Request Queue with Delayed Retry
Instead of load balancing across multiple keys, implement a request queue that automatically retries failed requests after the rate limit window resets.
Implementation (sketched below):
- Maintain a FIFO queue for all LLM requests
- When rate limit error occurs, calculate wait time based on rate limit window
- Automatically retry requests after the wait period
- Apply exponential backoff for subsequent failures
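A rough sketch of this queued-retry approach, included only for comparison; `RateLimitError`, `call_llm`, and `process_queue` are placeholders, not real DeepWiki or provider APIs:

```python
import asyncio


class RateLimitError(Exception):
    """Placeholder for a provider-specific rate-limit exception."""


async def process_queue(queue: asyncio.Queue, call_llm, max_retries: int = 5):
    """Drain a FIFO queue, retrying rate-limited requests with exponential backoff."""
    while True:
        request = await queue.get()
        delay = 30.0  # initial wait roughly matching a per-minute rate-limit window
        for _ in range(max_retries):
            try:
                await call_llm(request)
                break
            except RateLimitError:
                await asyncio.sleep(delay)  # request sits idle, adding user-visible latency
                delay *= 2                  # exponential backoff on repeated failures
        queue.task_done()
```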
Why This Doesn't Work:
- ❌ Poor User Experience: Users face significant delays (30-60 seconds per rate limit hit)
- ❌ Unpredictable Latency: Response times become highly variable and unreliable
- ❌ Queue Buildup: During high concurrency, queues grow without bound, leading to request timeouts
- ❌ Resource Waste: Server resources are tied up maintaining queue state and retry timers
- ❌ No Scalability: Doesn't solve the fundamental throughput problem, just delays it
- ❌ Cascade Failures: Long queues can cause cascading failures when requests timeout while waiting
Additional context
✅ Completed - This feature has been fully implemented and a PR will be submitted later:
- Multi-key configuration via environment variables and JSON
- Load balancing with least-used + LRU strategy
- Real-time monitoring and statistics
- Full backward compatibility with single-key setups
- Comprehensive testing and documentation
Performance Metrics (from testing):
- 10 concurrent requests distributed across 5 keys
- Load balance ratio: 80%+ (1.0 = perfect balance; one possible definition is sketched after this list)
- Zero request failures due to rate limiting
- Predictable, low-latency responses
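The exact formula behind the balance ratio is not defined here; one plausible definition, assumed purely for illustration, is the minimum per-key request count divided by the maximum:

```python
def balance_ratio(request_counts: list[int]) -> float:
    """Assumed metric: min/max of per-key request counts; 1.0 means perfectly even."""
    return min(request_counts) / max(request_counts) if max(request_counts) else 1.0


print(balance_ratio([2, 2, 2, 2, 2]))  # 1.0  - 10 requests spread evenly over 5 keys
print(balance_ratio([4, 2, 1, 2, 1]))  # 0.25 - heavily skewed distribution
```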
Configuration Example:
```bash
# .env file
OPENAI_API_KEYS=sk-key1,sk-key2,sk-key3,sk-key4,sk-key5
GOOGLE_API_KEYS=AIza-key1,AIza-key2,AIza-key3
```

api/config/api_keys.json:

```json
{
  "openai": {
    "keys": ["${OPENAI_API_KEYS}"]
  },
  "google": {
    "keys": ["${GOOGLE_API_KEYS}"]
  }
}
```
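A hypothetical loader sketch showing how such a file could be expanded, resolving the `${...}` placeholders from comma-separated environment variables; `load_provider_keys` is an illustrative name, not the submitted implementation:

```python
import json
import os
import re


def load_provider_keys(path: str = "api/config/api_keys.json") -> dict[str, list[str]]:
    """Read the JSON config and expand ${ENV_VAR} entries into per-provider key lists."""
    with open(path) as f:
        config = json.load(f)

    providers: dict[str, list[str]] = {}
    for provider, entry in config.items():
        keys: list[str] = []
        for item in entry.get("keys", []):
            match = re.fullmatch(r"\$\{(\w+)\}", item)
            if match:
                # Split the referenced environment variable on commas, dropping blanks.
                raw = os.environ.get(match.group(1), "")
                keys.extend(k.strip() for k in raw.split(",") if k.strip())
            else:
                keys.append(item)
        providers[provider] = keys
    return providers
```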