fix: add retry with exponential backoff to weather MCP tool#222
fix: add retry with exponential backoff to weather MCP tool#222
Conversation
The weather tool calls external Open-Meteo APIs that can timeout or return transient errors in CI environments. This caused E2E test failures across all kagenti PRs. Changes: - Use requests.Session with urllib3 Retry strategy (3 retries, exponential backoff, retry on 429/5xx) - Increase timeout from 10s to 30s (configurable via WEATHER_REQUEST_TIMEOUT) - Catch RequestException and return user-friendly error instead of crashing — lets the LLM retry the tool call gracefully - All retry params configurable via env vars (WEATHER_MAX_RETRIES, WEATHER_BACKOFF_FACTOR) Signed-off-by: Ladislav Smola <lsmola@redhat.com>
ChatOpenAI with default settings crashes on MaaS 503 responses (rate limit / service overload). Add max_retries=5 for automatic retry with backoff, and request_timeout=60 for slow MaaS endpoints. Signed-off-by: Ladislav Smola <lsmola@redhat.com>
pdettori
left a comment
There was a problem hiding this comment.
Good fix addressing a real CI reliability problem. The retry strategy and error handling are well-implemented — proper use of urllib3.Retry, sensible status code list, and user-friendly error messages that let the LLM retry gracefully.
One must-fix: unused import time is failing lint CI. Everything else is suggestions.
Areas reviewed: Python, Security, Commit conventions
Commits: 2, both signed-off ✓
CI: lint failing (unused import), all other checks passing
mcp/weather_tool/weather_tool.py
Outdated
| import logging | ||
| import os | ||
| import sys | ||
| import time |
There was a problem hiding this comment.
must-fix: import time is unused — this is causing the lint CI failure. Remove this line.
| return session | ||
|
|
||
|
|
||
| @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True}) |
There was a problem hiding this comment.
suggestion: _resilient_session() creates a new Session on every tool call. Since the retry config uses module-level constants, consider creating a single module-level session (_session = _resilient_session()) and reusing it. Avoids re-mounting adapters per call and requests.Session is thread-safe. Not blocking — current approach works correctly.
| openai_api_key=config.llm_api_key, | ||
| openai_api_base=config.llm_api_base, | ||
| temperature=0, | ||
| max_retries=5, |
There was a problem hiding this comment.
nit: max_retries=5 and request_timeout=60 are hardcoded. The weather tool side uses env vars for configurability. Consider the same pattern here for consistency, or add a brief comment explaining the values (e.g., "MaaS endpoints can be slow under load").
The has_valid_api_key check only allowed localhost but Kind clusters use 'dockerhost' for Ollama, and cluster-internal LiteLLM proxies use *.svc.cluster.local addresses. Both accept any API key. Also adds host.docker.internal for Docker Desktop compatibility. Signed-off-by: Ladislav Smola <lsmola@redhat.com>
…retry - Remove unused `import time` (lint failure) - Make HTTP session a module-level singleton for connection reuse - Make LLM max_retries and request_timeout configurable via env vars (LLM_MAX_RETRIES, LLM_REQUEST_TIMEOUT) - Accept dummy API keys for dockerhost and *.svc.cluster.local Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Summary
requests.Sessionwithurllib3.Retrystrategy (3 retries, exponential backoff)WEATHER_REQUEST_TIMEOUT)RequestExceptionand return user-friendly message instead of crashingWEATHER_MAX_RETRIES,WEATHER_BACKOFF_FACTOR,WEATHER_REQUEST_TIMEOUTContext
The weather tool calls external Open-Meteo APIs that frequently timeout from CI runners, causing E2E test failures across ALL kagenti PRs. The tool had no error handling — any failure crashed the entire tool call.