fix: add retry with exponential backoff to weather MCP tool #222

Merged
pdettori merged 5 commits into main from fix/weather-tool-resilience on Apr 9, 2026
Conversation

Contributor

@Ladas Ladas commented Apr 8, 2026

Summary

  • Add requests.Session with urllib3.Retry strategy (3 retries, exponential backoff)
  • Retry automatically on 429, 500, 502, 503, 504 status codes
  • Increase timeout from 10s to 30s (configurable via WEATHER_REQUEST_TIMEOUT)
  • Catch RequestException and return user-friendly message instead of crashing
  • All params configurable: WEATHER_MAX_RETRIES, WEATHER_BACKOFF_FACTOR, WEATHER_REQUEST_TIMEOUT

Context

The weather tool calls external Open-Meteo APIs that can time out or return transient errors in CI environments, causing E2E test failures across all kagenti PRs. The tool had no error handling: any failure crashed the entire tool call.

ChatOpenAI with default settings crashes on MaaS 503 responses
(rate limit / service overload). Add max_retries=5 for automatic
retry with backoff, and request_timeout=60 for slow MaaS endpoints.

Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Contributor

@pdettori pdettori left a comment


Good fix addressing a real CI reliability problem. The retry strategy and error handling are well-implemented — proper use of urllib3.Retry, sensible status code list, and user-friendly error messages that let the LLM retry gracefully.

One must-fix: unused import time is failing lint CI. Everything else is suggestions.

Areas reviewed: Python, Security, Commit conventions
Commits: 2, both signed-off ✓
CI: lint failing (unused import), all other checks passing

import logging
import os
import sys
import time

must-fix: import time is unused — this is causing the lint CI failure. Remove this line.

return session


@mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})

suggestion: _resilient_session() creates a new Session on every tool call. Since the retry config uses module-level constants, consider creating a single module-level session (_session = _resilient_session()) and reusing it. Avoids re-mounting adapters per call and requests.Session is thread-safe. Not blocking — current approach works correctly.

openai_api_key=config.llm_api_key,
openai_api_base=config.llm_api_base,
temperature=0,
max_retries=5,

nit: max_retries=5 and request_timeout=60 are hardcoded. The weather tool side uses env vars for configurability. Consider the same pattern here for consistency, or add a brief comment explaining the values (e.g., "MaaS endpoints can be slow under load").
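One way to apply the weather-tool pattern here is to read the two values from env vars with the current hardcoded numbers as defaults. The env var names `LLM_MAX_RETRIES` and `LLM_REQUEST_TIMEOUT` are the ones the follow-up commit in this PR ends up using; the helper function is a hypothetical shape, shown as a kwargs builder so the `ChatOpenAI(...)` call itself stays unchanged:

```python
import os


def llm_client_kwargs() -> dict:
    """Retry/timeout kwargs for ChatOpenAI, configurable via env vars."""
    return {
        "temperature": 0,
        # Retries 503s from MaaS (rate limit / overload) with backoff.
        "max_retries": int(os.getenv("LLM_MAX_RETRIES", "5")),
        # MaaS endpoints can be slow under load.
        "request_timeout": float(os.getenv("LLM_REQUEST_TIMEOUT", "60")),
    }


# Usage sketch (config and key/base handling as in the existing code):
# llm = ChatOpenAI(openai_api_key=config.llm_api_key,
#                  openai_api_base=config.llm_api_base,
#                  **llm_client_kwargs())
```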

Ladas added 3 commits April 9, 2026 19:21
The has_valid_api_key check only allowed localhost but Kind clusters
use 'dockerhost' for Ollama, and cluster-internal LiteLLM proxies
use *.svc.cluster.local addresses. Both accept any API key.

Also adds host.docker.internal for Docker Desktop compatibility.

Signed-off-by: Ladislav Smola <lsmola@redhat.com>
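The relaxed check this commit describes could be sketched as below. The function name `has_valid_api_key` is from the commit message; the exact signature and the fallback rule for remote endpoints are assumptions for illustration:

```python
from urllib.parse import urlparse

# Endpoints that do not verify API keys, so a dummy value is acceptable:
# local Ollama, Kind's dockerhost alias, and Docker Desktop's host alias.
_NO_KEY_HOSTS = {"localhost", "127.0.0.1", "dockerhost", "host.docker.internal"}


def has_valid_api_key(api_base: str, api_key: str) -> bool:
    """Hypothetical sketch: accept any key for local/in-cluster endpoints."""
    host = urlparse(api_base).hostname or ""
    if host in _NO_KEY_HOSTS or host.endswith(".svc.cluster.local"):
        return True  # cluster-internal LiteLLM proxies accept any key
    return bool(api_key)  # remote endpoints still need a real key (assumed)
```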
…retry

- Remove unused `import time` (lint failure)
- Make HTTP session a module-level singleton for connection reuse
- Make LLM max_retries and request_timeout configurable via env vars
  (LLM_MAX_RETRIES, LLM_REQUEST_TIMEOUT)
- Accept dummy API keys for dockerhost and *.svc.cluster.local

Signed-off-by: Ladislav Smola <lsmola@redhat.com>

@pdettori pdettori left a comment


LGTM

@pdettori pdettori merged commit e194691 into main Apr 9, 2026
10 checks passed
