Epic / consolidation. This merges several smaller observability & reliability tasks into one high-priority, mainnet-blocking initiative. Supersedes #563, #570, #571, #572, #576, #577, #579.
Why this matters (growth/scale)
A platform meant for thousands of users cannot grow on top of a system that's blind to its own failures. Reliability is a growth feature: every outage or silent failure during a high-traffic campaign launch burns user trust and operator confidence. This initiative makes Trivela observable and self-defending so it can scale without surprise outages.
Goal
Ship production-grade observability + runtime reliability: dashboards, alerts, a live canary, request deadlines, pool/saturation visibility, and graceful shutdown — wired to SLOs.
Scope (merged work items)
- Grafana dashboards as code (`observability/dashboards/*`) for API, DB/pools, RPC/breaker, and indexer; provisioned + reproducible. (was feat: Grafana dashboards as code (API, DB, RPC, indexer, KPIs) #577, feat: Indexer observability dashboard (lag, throughput, errors) #563)
- Prometheus alerting rules with `promtool` tests + routing. (was feat: Prometheus alerting rules (latency, errors, RPC, indexer lag, pools) #576)
- Synthetic `register→credit→claim` canary on testnet + uptime checks; alert within minutes on failure (also flags stale README contract IDs). (was feat: Synthetic uptime + register→credit→claim canary monitoring #579)
- Per-route request timeouts and deadlines with `AbortSignal` propagation to DB/RPC/storage; abort on client disconnect; 504 on deadline. (was feat: Per-route request timeouts, deadlines & cancellation propagation #570)
- `pool_in_use/idle/waiting` metrics; acquire timeouts; typed `503 POOL_SATURATED` instead of hanging. (was feat: Connection-pool saturation metrics & fast-fail safeguards #571)

Acceptance criteria
- … `promtool` tests pass).

Verification
- `promtool test rules` in CI
- Force a canary failure
- Load test driving pool saturation
- SIGTERM-under-load deploy test

Priority: high · Difficulty: hard · Effort: L · mainnet blocker
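
To make the per-route deadline item in the scope concrete, here is a minimal TypeScript sketch of deadline and cancellation propagation. It is an illustration under assumptions, not the project's implementation: `DeadlineExceeded`, `withDeadline`, `slowQuery`, and `handle` are hypothetical names, and `slowQuery` stands in for any DB/RPC/storage call that accepts an `AbortSignal`.

```typescript
// Hypothetical sketch: all names below are illustrative, not Trivela code.

// Error used as the abort reason when a route's deadline passes.
class DeadlineExceeded extends Error {
  readonly status = 504;
  constructor(ms: number) {
    super(`deadline of ${ms} ms exceeded`);
    this.name = "DeadlineExceeded";
  }
}

// Runs `work` with a signal that aborts when the deadline passes or when the
// optional parent signal (e.g. a client disconnect) aborts.
// Note: this only cancels work that actually honors the signal.
async function withDeadline<T>(
  ms: number,
  work: (signal: AbortSignal) => Promise<T>,
  parent?: AbortSignal,
): Promise<T> {
  const controller = new AbortController();
  const onParentAbort = () => controller.abort(parent?.reason);
  if (parent?.aborted) onParentAbort();
  else parent?.addEventListener("abort", onParentAbort, { once: true });

  const timer = setTimeout(() => controller.abort(new DeadlineExceeded(ms)), ms);
  try {
    return await work(controller.signal);
  } finally {
    // Always release the timer and listener so nothing leaks per request.
    clearTimeout(timer);
    parent?.removeEventListener("abort", onParentAbort);
  }
}

// Stand-in for a DB/RPC/storage call that honors cancellation.
function slowQuery(durationMs: number, signal: AbortSignal): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason);
      return;
    }
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve("rows");
    }, durationMs);
    function onAbort(): void {
      clearTimeout(timer); // stop the pending work instead of letting it run on
      reject(signal.reason);
    }
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// Maps the outcome to an HTTP status: 200 on success, 504 on deadline.
async function handle(durationMs: number, deadlineMs: number): Promise<number> {
  try {
    await withDeadline(deadlineMs, (signal) => slowQuery(durationMs, signal));
    return 200;
  } catch (err) {
    if (err instanceof DeadlineExceeded) return 504;
    throw err;
  }
}
```

In a real handler, the parent signal would be aborted from the request's close/disconnect event, so a dropped client cancels downstream work through the same path as an expired deadline.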