[EPIC] Production-grade observability & runtime reliability (mainnet blocker) #650

@joelpeace48-cell

Description

Epic / consolidation. This merges several smaller observability & reliability tasks into one high-priority, mainnet-blocking initiative. Supersedes #563, #570, #571, #572, #576, #577, #579.

Why this matters (growth/scale)

A platform meant for thousands of users cannot grow on top of a system that's blind to its own failures. Reliability is a growth feature: every outage or silent failure during a high-traffic campaign launch burns user trust and operator confidence. This initiative makes Trivela observable and self-defending so it can scale without surprise outages.

Goal

Ship production-grade observability and runtime reliability: dashboards, alerts, a live canary, request deadlines, pool/saturation visibility, and graceful shutdown, all wired to SLOs.

Scope (merged work items)

Acceptance criteria

  • Dashboards render against live metrics; alerts fire on synthetic breaches (promtool tests pass).
  • A broken core journey is detected by the canary within minutes.
  • Slow upstreams return timely 504s; saturated pools fast-fail with a typed 503; rolling deploys drop no work.
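
To make the third criterion concrete, here is a minimal sketch of a request deadline and a fast-failing bounded pool. It is an illustration under stated assumptions, not Trivela's actual implementation: the class names, error codes, and the `withDeadline` and `BoundedPool` helpers are all hypothetical, and a real service would map these errors to HTTP responses in its own error-handling layer.

```typescript
// Hypothetical typed errors: names, codes, and fields are assumptions for this sketch.
class DeadlineExceededError extends Error {
  readonly status = 504;
  readonly code = "UPSTREAM_DEADLINE_EXCEEDED";
  constructor(deadlineMs: number) {
    super(`upstream did not respond within ${deadlineMs} ms`);
    this.name = "DeadlineExceededError";
  }
}

class PoolSaturatedError extends Error {
  readonly status = 503;
  readonly code = "POOL_SATURATED";
  constructor(limit: number) {
    super(`all ${limit} pool slots are in use`);
    this.name = "PoolSaturatedError";
  }
}

// Run an upstream call with a deadline. If the deadline passes first, the
// AbortSignal is triggered and the caller gets a DeadlineExceededError (504).
function withDeadline<T>(
  work: (signal: AbortSignal) => Promise<T>,
  deadlineMs: number,
): Promise<T> {
  const controller = new AbortController();
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort();
      reject(new DeadlineExceededError(deadlineMs));
    }, deadlineMs);
    // Promise.resolve().then(...) also captures synchronous throws from work().
    Promise.resolve()
      .then(() => work(controller.signal))
      .then(resolve, reject)
      .finally(() => clearTimeout(timer));
  });
}

// A pool that rejects immediately when full instead of queueing without bound.
class BoundedPool {
  private inFlight = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Fraction of slots in use (0 to 1); a candidate gauge for a saturation dashboard.
  get saturation(): number {
    return this.inFlight / this.limit;
  }

  acquire(): void {
    if (this.inFlight >= this.limit) {
      throw new PoolSaturatedError(this.limit);
    }
    this.inFlight += 1;
  }

  release(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }

  // Acquire a slot, run the task, and always release the slot afterwards.
  async run<T>(task: () => Promise<T>): Promise<T> {
    this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}
```

Because both errors carry a `status` and a `code`, an error handler can translate them into a 504 or 503 response without string matching, and clients can tell a saturated pool apart from a generic failure.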

Verification

  • promtool test rules in CI.
  • Force a canary failure.
  • Load test driving pool saturation.
  • SIGTERM-under-load deploy test.
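
For the SIGTERM-under-load test, the behaviour under test could look roughly like the sketch below. It uses only Node's built-in `node:http` module (`server.close`, `closeIdleConnections`, and `closeAllConnections`, the latter two available since Node 18.2). The `drain`, `installSigtermHandler`, and `startServer` helpers, the 10-second grace period, and the placeholder request handler are assumptions made for illustration.

```typescript
import { createServer, Server } from "node:http";

// Stop accepting new connections and wait for in-flight requests to finish.
// Resolves "drained" if everything completed within graceMs, otherwise
// force-closes the remaining connections and resolves "forced".
function drain(server: Server, graceMs: number): Promise<"drained" | "forced"> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => {
      server.closeAllConnections();
      resolve("forced");
    }, graceMs);
    server.close(() => {
      clearTimeout(timer);
      resolve("drained");
    });
    // Idle keep-alive sockets would otherwise delay close() on Node 18.
    server.closeIdleConnections();
  });
}

// On SIGTERM, drain and set the exit code. Other open handles (for example
// database pools) must be closed separately for the process to exit.
function installSigtermHandler(server: Server, graceMs = 10_000): void {
  process.once("SIGTERM", () => {
    void drain(server, graceMs).then((outcome) => {
      process.exitCode = outcome === "drained" ? 0 : 1;
    });
  });
}

// Hypothetical wiring; the handler is a placeholder for the real app.
function startServer(port: number): Server {
  const server = createServer((_req, res) => {
    res.end("ok");
  });
  installSigtermHandler(server);
  return server.listen(port);
}
```

A deploy test can then send SIGTERM while a load generator is running and assert that no request ends in a connection reset and that the process exits with code 0.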

Priority: high · Difficulty: hard · Effort: L · mainnet blocker

Metadata

Assignees

No one assigned

    Labels

    area: backend (Backend API (Node/Express))
    difficulty: hard (Larger or subtle changes)
    enhancement (New feature or request)
    epic (Large initiative bundling multiple work items)
    infra (Deployment, docker, runtime)
    observability (Logs, metrics, tracing)
    priority: high (High-priority, high-impact work)
