feat(ai-service): export Prometheus metrics for circuit breaker (closes #134) #194

Merged
kilodesodiq-arch merged 1 commit into ChainForgee:main from Moonwalker-rgb:feat/134-circuit-breaker-metrics
Jun 23, 2026
Conversation

@Moonwalker-rgb

What

Closes #134. Adds Prometheus metrics for the AI service circuit breaker so that breaker trips caused by unhealthy providers show up in dashboards instead of being discovered through user-visible degradation.

Changes

  • Adds three Prometheus instruments updated on every breaker transition:
    Metric                                 Type       Description
    circuit_breaker_state                  Gauge      Current state (0=CLOSED, 1=HALF_OPEN, 2=OPEN), labeled by breaker_name
    circuit_breaker_failure_count_total    Counter    Cumulative failures per breaker
    circuit_breaker_recovery_time_seconds  Histogram  Time spent OPEN before transitioning to HALF_OPEN
  • Metric updates happen inside the existing _lock that already guards breaker state, so the exported values can never diverge from the underlying state.
  • The initial state (CLOSED) is published in __init__ so every instantiated breaker appears in the gauge even before any traffic flows.
  • Exports CIRCUIT_STATE_CLOSED|HALF_OPEN|OPEN constants and a set_circuit_state(...) helper to keep call sites readable.
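
The changes above can be sketched roughly as follows. This is a minimal illustration and not the merged code: the `CircuitBreaker` internals, the constructor parameters (`failure_threshold`, `recovery_timeout`, `clock`), the method names, the `set_circuit_state(breaker_name, state)` signature, and the `breaker_name` label on the counter and histogram are all assumptions. The private `CollectorRegistry` is only there to keep the example self-contained; the PR reuses the registry in metrics.py.

```python
import threading
import time

from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

CIRCUIT_STATE_CLOSED, CIRCUIT_STATE_HALF_OPEN, CIRCUIT_STATE_OPEN = 0, 1, 2

# Assumption: a private registry keeps this sketch self-contained.
registry = CollectorRegistry()
LABELS = ["breaker_name"]  # confirmed for the gauge only; assumed for the others

state_gauge = Gauge(
    "circuit_breaker_state", "0=CLOSED, 1=HALF_OPEN, 2=OPEN", LABELS, registry=registry
)
failure_counter = Counter(
    "circuit_breaker_failure_count_total", "Cumulative failures per breaker",
    LABELS, registry=registry,
)
recovery_histogram = Histogram(
    "circuit_breaker_recovery_time_seconds", "Seconds spent OPEN before HALF_OPEN",
    LABELS, registry=registry,
)


def set_circuit_state(breaker_name: str, state: int) -> None:
    """Publish the numeric state for one breaker (hypothetical signature)."""
    state_gauge.labels(breaker_name=breaker_name).set(state)


class CircuitBreaker:
    """Hypothetical breaker; only the metric wiring mirrors the PR description."""

    def __init__(self, name, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._lock = threading.Lock()
        self._state = CIRCUIT_STATE_CLOSED
        self._failures = 0
        self._opened_at = None
        # Publish CLOSED immediately so the breaker is visible before any traffic.
        set_circuit_state(name, CIRCUIT_STATE_CLOSED)

    def _transition(self, state):
        # Caller must hold self._lock, so state and gauge change together.
        self._state = state
        set_circuit_state(self.name, state)

    def record_failure(self):
        with self._lock:
            failure_counter.labels(breaker_name=self.name).inc()
            self._failures += 1
            tripped = self._state == CIRCUIT_STATE_HALF_OPEN or (
                self._state == CIRCUIT_STATE_CLOSED
                and self._failures >= self.failure_threshold
            )
            if tripped:
                self._opened_at = self._clock()
                self._transition(CIRCUIT_STATE_OPEN)

    def record_success(self):
        with self._lock:
            self._failures = 0
            if self._state == CIRCUIT_STATE_HALF_OPEN:
                self._transition(CIRCUIT_STATE_CLOSED)

    def allow_request(self):
        with self._lock:
            if self._state == CIRCUIT_STATE_OPEN:
                elapsed = self._clock() - self._opened_at
                if elapsed < self.recovery_timeout:
                    return False
                # OPEN -> HALF_OPEN: record how long the breaker stayed open.
                recovery_histogram.labels(breaker_name=self.name).observe(elapsed)
                self._transition(CIRCUIT_STATE_HALF_OPEN)
            return True
```

Because every metric call sits inside the `with self._lock:` block, a concurrent caller cannot observe the internal state changing without the gauge changing in the same critical section. Values can be read back with `registry.get_sample_value("circuit_breaker_state", {"breaker_name": "demo"})`.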

Tests

  • New TestCircuitBreakerMetrics class with five focused tests:
    • Initial state published
    • Failure increments counter / gauge flips on threshold
    • OPEN → HALF_OPEN updates histogram and gauge
    • HALF_OPEN → CLOSED via success
    • HALF_OPEN → OPEN via failure
  • All 9 tests in tests/test_circuit_breaker.py pass locally in ~1.5s.
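
For readers unfamiliar with asserting on Prometheus metrics, here is a hedged sketch of how two of these tests could read exported values. It is not the code in tests/test_circuit_breaker.py: the class name, the stand-in metric objects, and the per-test `CollectorRegistry` are assumptions, and the transitions are simulated by setting the instruments directly rather than by driving a real breaker.

```python
import unittest

from prometheus_client import CollectorRegistry, Gauge, Histogram


class TestCircuitBreakerMetricsSketch(unittest.TestCase):
    def setUp(self):
        # A fresh registry per test keeps samples from leaking between tests
        # and avoids duplicate-registration errors.
        self.registry = CollectorRegistry()
        self.state = Gauge(
            "circuit_breaker_state", "Breaker state",
            ["breaker_name"], registry=self.registry,
        )
        self.recovery = Histogram(
            "circuit_breaker_recovery_time_seconds", "Seconds spent OPEN",
            ["breaker_name"], registry=self.registry,
        )

    def sample(self, name):
        # get_sample_value returns None if the labeled series does not exist yet.
        return self.registry.get_sample_value(name, {"breaker_name": "demo"})

    def test_initial_state_published(self):
        self.assertIsNone(self.sample("circuit_breaker_state"))
        self.state.labels(breaker_name="demo").set(0)  # what __init__ would do
        self.assertEqual(self.sample("circuit_breaker_state"), 0.0)

    def test_open_to_half_open_updates_histogram_and_gauge(self):
        self.state.labels(breaker_name="demo").set(2)  # OPEN
        # Stand-in for the OPEN -> HALF_OPEN transition after 12.5 seconds.
        self.recovery.labels(breaker_name="demo").observe(12.5)
        self.state.labels(breaker_name="demo").set(1)  # HALF_OPEN
        self.assertEqual(self.sample("circuit_breaker_state"), 1.0)
        self.assertEqual(
            self.sample("circuit_breaker_recovery_time_seconds_count"), 1.0
        )
        self.assertEqual(
            self.sample("circuit_breaker_recovery_time_seconds_sum"), 12.5
        )
```

The histogram is checked through its `_count` and `_sum` samples, which is usually less brittle than asserting on individual buckets.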

Compatibility

  • Additive only. Existing CircuitBreaker callers are untouched.
  • Uses the existing metrics.py registry; no new dependencies, no new registry.

Notes

  • circuit_breaker_failure_count_total is cumulative since the last reset, matching Prometheus counter semantics. Operators should query it with rate() or increase() over a window rather than read the absolute value.
  • The histogram uses the prometheus_client default buckets, which top out at 10 seconds, so longer recovery times all fall into the +Inf bucket. If that proves too coarse, pass buckets=[1, 5, 10, 30, 60, 120].
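
To illustrate the bucket note, the sketch below declares the histogram with the suggested boundaries and checks where a 45-second recovery lands. The private registry, the `breaker_name` label on the histogram, and the `bucket` helper are assumptions made for the example.

```python
from prometheus_client import CollectorRegistry, Histogram

# Largest finite boundary among the library defaults (the last entry is +Inf).
largest_default = max(b for b in Histogram.DEFAULT_BUCKETS if b != float("inf"))

registry = CollectorRegistry()  # assumption: private registry for the demo
recovery = Histogram(
    "circuit_breaker_recovery_time_seconds",
    "Seconds spent OPEN before HALF_OPEN",
    ["breaker_name"],
    buckets=[1, 5, 10, 30, 60, 120],  # boundaries suggested in the PR notes
    registry=registry,
)
recovery.labels(breaker_name="demo").observe(45.0)


def bucket(le: str):
    """Cumulative count of observations <= the given upper bound (hypothetical helper)."""
    return registry.get_sample_value(
        "circuit_breaker_recovery_time_seconds_bucket",
        {"breaker_name": "demo", "le": le},
    )


print(largest_default)                 # largest finite default boundary, in seconds
print(bucket("30.0"), bucket("60.0"))  # counts below and above the 45 s observation
```

Histogram buckets are cumulative, so the 45-second observation is counted in the 60-second bucket and every larger one, but not in the 30-second bucket.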

ChainForgee#134)

Adds three Prometheus instruments updated on every circuit breaker
transition, so unhealthy providers surface in dashboards rather than
from user-visible degradation:

  - Gauge     circuit_breaker_state                 (0=CLOSED, 1=HALF_OPEN, 2=OPEN)
  - Counter   circuit_breaker_failure_count_total   (cumulative failures)
  - Histogram circuit_breaker_recovery_time_seconds (OPEN -> HALF_OPEN lag)

Metrics are emitted inside the same lock that guards state, so the
exported values can never diverge from the underlying state. Failure
count tracks cumulative failures since the last reset, matching the
Prometheus counter contract documented in the issue. The initial state
is published in __init__ so every instantiated breaker appears in the
gauge even before any traffic flows.

Contributor Author

cc @gbengaeben @CodeMayor — recent contributors to app/ai-service and project tooling. Could one of you (or another ai-service owner) take a first pass? Specifically curious whether the gauge encoding (0/1/2) and the cumulative counter semantics fit how app/backend already scrapes these metrics.

@kilodesodiq-arch left a comment

LGTM

@kilodesodiq-arch merged commit 8681df0 into ChainForgee:main Jun 23, 2026
7 checks passed
Development

Successfully merging this pull request may close these issues.

[LOW] AI service circuit breaker has no metrics export