Add alert contacts and managed delivery#75
Merged
added 13 commits
April 25, 2026 11:23
Replaces the Family 5 stub ("legacy bridge to WPCOM") with the actual
Phase 3.x design: managed channels for human destinations (email,
PagerDuty, Slack, Teams), parallel to the webhooks API rather than a
generalization of it.
Key design decisions captured:
- Boundary with webhooks: alert contacts deliver Jetmon-rendered
notifications through Jetmon-owned transports; webhooks deliver
the raw signed event stream for the consumer to render. Customers
who want generic outbound webhooks register a webhook, not an
alert contact — explicitly noted with deferral rationale.
- Filter shape: site_filter + min_severity (no event-type filter, no
state filter). Severity gate is the primary control surface, with
Critical default to avoid accidental noise.
- Per-contact max_per_hour cap as pager-storm insurance.
- Send-test endpoint (POST /alert-contacts/{id}/test) for verifying
newly-configured contacts.
- Email delivery via swappable Sender interface with wpcom / smtp /
stub implementations — wpcom for production, smtp for docker
integration tests, stub for unit tests.
- Plaintext credential storage in destination JSON, same rationale as
webhook secrets (outbound dispatch needs the raw value).
- Worker placement intentionally parallel (internal/alerting/) rather
than extending internal/webhooks/.
ROADMAP additions under "Deferred from Phase 3.x":
- SMS, OpsGenie, quiet hours, alert ack, alert grouping, generic
webhook transport — each with deferral rationale and revisit trigger.
- WPCOM notification migration: documented as the cleanup that pays
off the deliverer-binary extraction, gated on alert contacts proving
out in production.
Multi-binary-split section: explicit revisit point added for unifying
internal/alerting/ and internal/webhooks/ at deliverer-binary extraction
time, with rationale for why the split is right today (no shared
abstraction with only one known concrete user) and what enables the
later unification (third transport, two known shapes).
The first draft used placeholder severity names (Info/Notice/Warning/Critical).
The actual codebase severity ladder is Up < Warning < Degraded < SeemsDown <
Down (uint8 0-4 in internal/eventstore). Aligned the JSON string names,
default min_severity, severity gate description, PagerDuty mapping table, and
schema column type to match what the code actually uses.
jetmon_alert_contacts mirrors jetmon_webhooks but with a simpler filter
model: site_filter (JSON, empty = all sites) + min_severity (TINYINT matching
internal/eventstore.Severity*, default 4 = Down) + a max_per_hour rate cap.
destination is JSON because each transport has a different shape (email
address, PagerDuty integration key, Slack/Teams webhook URL); plaintext
storage rationale matches jetmon_webhooks.secret — outbound dispatch needs
the raw value at every send.

jetmon_alert_deliveries is a per-fire record table mirroring
jetmon_webhook_deliveries, with the same fan-in shape (one transition → many
deliveries; one contact gets at most one delivery per transition via
uk_alert_transition). Adds a severity column so the dispatcher can render
transport-specific severity (PagerDuty critical/warning/info) without
re-deriving it from the payload.

jetmon_alert_dispatch_progress is a high-water-mark table for the alert
delivery worker, identical in shape to jetmon_webhook_dispatch_progress.
Establishes the package boundary parallel to internal/webhooks: types,
sentinel errors, severity name <-> uint8 helpers, transport identifier
enum, and the Dispatcher interface that every concrete transport
implements. No DB or transport code yet — that's the next two tasks.
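The severity name <-> uint8 helpers and the Dispatcher interface described
above can be sketched roughly as follows. This is an illustrative
reconstruction, not the actual internal/alerting code: the string names, the
Notification fields, and the Send signature are assumptions based on the
descriptions in this PR.

```go
package main

import "fmt"

// Severity mirrors internal/eventstore's uint8 ladder:
// Up < Warning < Degraded < SeemsDown < Down.
type Severity uint8

const (
	SeverityUp        Severity = 0
	SeverityWarning   Severity = 1
	SeverityDegraded  Severity = 2
	SeveritySeemsDown Severity = 3
	SeverityDown      Severity = 4
)

// severityNames holds assumed JSON string names for each level.
var severityNames = map[Severity]string{
	SeverityUp: "up", SeverityWarning: "warning", SeverityDegraded: "degraded",
	SeveritySeemsDown: "seems_down", SeverityDown: "down",
}

// SeverityName converts the uint8 value to its JSON string name.
func SeverityName(s Severity) (string, error) {
	name, ok := severityNames[s]
	if !ok {
		return "", fmt.Errorf("unknown severity %d", s)
	}
	return name, nil
}

// ParseSeverity is the inverse mapping, used when validating API input.
func ParseSeverity(name string) (Severity, error) {
	for s, n := range severityNames {
		if n == name {
			return s, nil
		}
	}
	return 0, fmt.Errorf("unknown severity name %q", name)
}

// Notification is a placeholder for the rendered shape handed to transports.
type Notification struct {
	SiteID   int64
	Severity Severity
	Reason   string
	IsTest   bool
}

// Dispatcher is implemented by every concrete transport
// (email, PagerDuty, Slack, Teams).
type Dispatcher interface {
	Send(n Notification) (statusCode int, responseBody string, err error)
}
```

The two-way string mapping keeps the API surface stable even if the numeric
values ever shift, since only the names are serialized.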
AlertContact.Matches encodes the full filter rule:
site_id ∈ SiteFilter (or empty = match all sites)
AND (
new_severity >= MinSeverity // escalation / sustained
OR (prev_severity >= MinSeverity // recovery from a
AND new_severity == SeverityUp) // previously-paging state
)
Recovery firing requires both prev and new severity because Matches
doesn't see the transition reason. The unit tests exercise both halves
of the gate, the recovery edge cases, the empty-site-filter "match all"
semantic, and the AND across all dimensions.
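The filter rule above can be sketched as a small predicate. The struct and
field names here are assumptions for illustration; only the boolean logic is
taken from the rule as written.

```go
package main

// AlertContact holds just the filter fields Matches needs; a sketch of the
// real internal/alerting struct, not a copy of it.
type AlertContact struct {
	SiteFilter  []int64 // empty = match all sites
	MinSeverity uint8   // 0=Up … 4=Down
}

const severityUp uint8 = 0

// Matches ANDs the site gate with the severity gate. The severity gate fires
// on escalation/sustained states at or above MinSeverity, and on recovery
// (new == Up) from a previously-paging state. prev is required because
// Matches doesn't see the transition reason.
func (c AlertContact) Matches(prevSeverity, newSeverity uint8, siteID int64) bool {
	if len(c.SiteFilter) > 0 {
		found := false
		for _, id := range c.SiteFilter {
			if id == siteID {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	if newSeverity >= c.MinSeverity {
		return true // escalation or sustained paging state
	}
	// Recovery edge: only fires if the previous state was itself paging.
	return prevSeverity >= c.MinSeverity && newSeverity == severityUp
}
```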
Notification is the rendered shape passed to Dispatcher.Send — the
worker builds it once per delivery from the frozen-at-fire-time
payload, transports translate it into channel-specific representations
(PagerDuty events-v2, Slack Block Kit, Teams Adaptive Card, email
subject + body).
Email is unique among alert-contact transports in that there is no "post to
this URL" — it requires a sender. emailDispatcher owns the rendering
(subject + plain + HTML body) and delegates the actual send to a Sender, so
swapping transports is a config change and rendering logic stays in one
place. Three Sender implementations:
- StubSender — records messages in memory + log line; default for
  EMAIL_TRANSPORT="stub" and the basis of unit tests.
- SMTPSender — net/smtp send; for dev/staging with MailHog.
- WPCOMSender — Bearer-token POST to a WPCOM-owned email endpoint; same
  shape as the existing internal/wpcom client.
Rendering covers severity name, site URL, event id/type, state, reason, and
timestamp. Subject prefixes signal recovery and test sends. HTML output
escapes untrusted fields (site URL, reason) against XSS in HTML email
clients.
Config additions: EMAIL_TRANSPORT (default "stub"), EMAIL_FROM,
WPCOM_EMAIL_ENDPOINT/AUTH_TOKEN, SMTP_HOST/PORT/USERNAME/PASSWORD/USE_TLS.
Validation is lazy — it happens at Dispatcher construction time, not at
config load — so a misconfigured email block doesn't prevent the binary
from starting if alerting is disabled.
Tests cover subject variants, plain/HTML body rendering, HTML escaping,
recovery and test banners, dispatcher destination parsing (happy path and
three rejection cases), MIME multipart shape, WPCOM HTTP request shape
(Bearer auth, JSON body, status surfacing), and StubSender record/reset
behavior.
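The Sender seam described above might look like the following sketch. The
EmailMessage fields and method names are assumptions; the point is the shape
of the interface and how StubSender supports the record/reset behavior the
tests rely on.

```go
package main

import (
	"fmt"
	"sync"
)

// EmailMessage is the rendered unit the email dispatcher hands to a Sender.
// Field names are illustrative, not the exact internal/alerting types.
type EmailMessage struct {
	To, Subject, PlainBody, HTMLBody string
}

// Sender abstracts the actual delivery so the dispatcher owns rendering and
// the transport (wpcom / smtp / stub) is a config choice.
type Sender interface {
	Send(msg EmailMessage) error
}

// StubSender records messages in memory plus a log line; the default for
// EMAIL_TRANSPORT="stub" and the basis of unit tests.
type StubSender struct {
	mu   sync.Mutex
	sent []EmailMessage
}

func (s *StubSender) Send(msg EmailMessage) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sent = append(s.sent, msg)
	fmt.Printf("stub email to %s: %s\n", msg.To, msg.Subject)
	return nil
}

// Sent returns a copy of everything recorded so far.
func (s *StubSender) Sent() []EmailMessage {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]EmailMessage, len(s.sent))
	copy(out, s.sent)
	return out
}

// Reset clears the record between test cases.
func (s *StubSender) Reset() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sent = nil
}
```

A real SMTPSender or WPCOMSender would satisfy the same one-method interface,
which is what makes the transport a pure config swap.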
Three concrete Dispatcher implementations, each rendering the Notification
into the channel-specific request shape:
- PagerDuty (Events API v2): event_action of "trigger" or "resolve" based
  on the recovery flag; dedup_key derived from the Jetmon event id so all
  transitions of the same incident group under one alert. Severity mapping:
  Down/SeemsDown → critical, Degraded → warning, Warning → info. Test sends
  use a distinct dedup_key so they never accidentally resolve a real alert.
- Slack: Block Kit message with severity-keyed emoji header, fallback text
  for old clients / mobile previews, and a fields section with site id,
  event id, state, reason, time. Recovery and test variants swap the header.
- Teams: Adaptive Card 1.4 wrapped in the message envelope expected by
  incoming-webhook URLs. FactSet for the same fields Slack uses.
A shared postJSON helper handles serialization, request building, and status
interpretation: 4xx/5xx returns an error so the worker schedules a retry.
Response bodies are read with a 4 KB cap and truncated to last_response's
column width before being returned.
24 tests cover happy paths, recovery action mapping (PagerDuty resolve),
test-send dedup isolation, severity ordering, bad destination rejection
(empty fields, malformed JSON), and upstream error surfacing.
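The PagerDuty-side mappings described above are small enough to sketch. The
function names and the "jetmon-" dedup-key prefixes are assumptions for
illustration; the severity thresholds and the trigger/resolve split come
straight from the description.

```go
package main

import "fmt"

// pdSeverity maps the Jetmon ladder (0=Up … 4=Down) to PagerDuty Events API
// v2 severities: Down/SeemsDown → critical, Degraded → warning, Warning → info.
func pdSeverity(sev uint8) string {
	switch {
	case sev >= 3: // SeemsDown, Down
		return "critical"
	case sev == 2: // Degraded
		return "warning"
	default: // Warning (Up transitions are recoveries, not pages)
		return "info"
	}
}

// pdEventAction picks "resolve" for recovery transitions, "trigger" otherwise.
func pdEventAction(isRecovery bool) string {
	if isRecovery {
		return "resolve"
	}
	return "trigger"
}

// pdDedupKey groups all transitions of one incident under one alert. Test
// sends get a distinct key so they can never resolve a real alert. The key
// format shown here is hypothetical.
func pdDedupKey(eventID int64, isTest bool) string {
	if isTest {
		return fmt.Sprintf("jetmon-test-%d", eventID)
	}
	return fmt.Sprintf("jetmon-event-%d", eventID)
}
```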
internal/alerting/contacts.go provides Create/Get/List/ListActive/
Update/Delete and a separate LoadDestination helper for the worker.
Mirrors the webhooks CRUD shape but with the simpler filter model
(site_filter + min_severity, no event-type/state filter) and a
transport-aware destination validator: each transport requires its
own field shape (email→address, pagerduty→integration_key,
slack/teams→webhook_url).
destination_preview is the last 4 chars of the credential field for
the contact's transport, so operators can identify a contact without
exposing the full credential. The destination itself is loaded
separately by the worker via LoadDestination, kept off the
AlertContact struct so it can't leak through serialization.
Update intentionally cannot change the transport — the destination
shape is transport-specific and validating cross-transport changes
is brittle. Operators who want to switch transports delete and
re-create.
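The destination_preview derivation is simple; a sketch (helper name assumed)
makes the "last 4 chars, never the whole credential" contract concrete:

```go
package main

// destinationPreview returns the last 4 characters of the contact's
// credential field so operators can identify a contact without exposing the
// full value. Credentials of 4 characters or fewer are returned whole; the
// exact short-credential behavior is an assumption.
func destinationPreview(credential string) string {
	r := []rune(credential)
	if len(r) <= 4 {
		return credential
	}
	return string(r[len(r)-4:])
}
```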
API surface in internal/api/handlers_alerts.go: POST/GET/LIST/PATCH/
DELETE on /api/v1/alert-contacts plus POST /alert-contacts/{id}/test.
The send-test endpoint loads the contact, fetches the destination,
and dispatches a synthetic IsTest=true notification through the
configured Dispatcher with a 15s timeout. Returns the transport's
status_code and response_body so operators can verify connectivity;
transport errors surface as 502.
Server gains an alertDispatchers map (alerting.Transport → Dispatcher)
populated via SetAlertDispatchers — this is the same map the worker
will use, so a successful send-test exercises the real production
code path. nil map (alerting disabled) → 503 transport_not_configured.
12 new sqlmock tests cover create happy-path with default min_severity,
transport rejection (sms), missing-field rejection per transport,
invalid severity name (Critical instead of Down), get/update/delete
happy + 404, and send-test happy / transport-error / no-dispatcher
cases.
internal/alerting/deliveries.go provides the per-fire delivery store
mirroring internal/webhooks/deliveries.go: Enqueue (INSERT IGNORE on
uk_alert_transition), ClaimReady, MarkDelivered, ScheduleRetry,
GetDelivery, ListDeliveries, RetryDelivery for the operator manual-retry
path. Adds MarkSuppressed for the rate-cap exit — terminal,
status='abandoned', last_status_code=429, last_response identifies
why so deliveries don't look like normal abandonments in the audit
trail.
internal/alerting/worker.go runs two background goroutines:
- dispatchTick polls jetmon_event_transitions on a high-water mark
in jetmon_alert_dispatch_progress, then for each new transition
matches every active contact via Matches(prev_severity,
next_severity, site_id) — the recovery edge fires correctly
because we read both severity columns from the transition row.
- deliverTick claims pending deliveries respecting the per-contact
in-flight cap, looks up the contact + destination, builds a
Notification (frozen-payload-derived; site URL looked up live for
display only), and dispatches through the configured Dispatcher
map. Same retry ladder as webhooks (1m/5m/30m/1h/6h then abandon).
Per-contact rate cap enforced via an in-memory sliding window
(rateLimitWindow). When the cap is hit, the delivery is marked
suppressed rather than retried — by the time the cap reopens the
alert is stale, so retrying just delays the inevitable. Single-instance
caveat documented inline; matches the existing multi-instance row-claim
caveat in webhooks.
14 unit tests cover the retry schedule, defaults, in-flight cap (per-
contact and isolation), zero-count cleanup, rate-window admission and
expiry, contact isolation in the rate window, and the transition reason
to alert event type mapping. Race-clean.
GET /api/v1/alert-contacts/{id}/deliveries with status filter and
cursor pagination; POST /api/v1/alert-contacts/{id}/deliveries/
{delivery_id}/retry to reset abandoned rows back to pending. Same
shapes as the webhook delivery endpoints.
cmd/jetmon2/main.go now boots the alerting worker alongside the
webhooks worker (gated on API_PORT > 0). buildAlertDispatchers
constructs the per-transport map from runtime config: the three
HTTP-based transports (PagerDuty, Slack, Teams) always wire up;
email picks wpcom/smtp/stub via EMAIL_TRANSPORT and falls back to
stub if unset. The same dispatcher map is shared with the API
server via SetAlertDispatchers, so a successful send-test exercises
the real production code path.
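The dispatcher-map wiring can be sketched as follows. The Transport values
and the stand-in dispatcher structs are illustrative; the wiring rule — HTTP
transports always present, email mode selected by EMAIL_TRANSPORT with a
stub fallback — is from the description above.

```go
package main

// Transport and Dispatcher are simplified stand-ins for internal/alerting's
// real types.
type Transport string

const (
	TransportEmail     Transport = "email"
	TransportPagerDuty Transport = "pagerduty"
	TransportSlack     Transport = "slack"
	TransportTeams     Transport = "teams"
)

type Dispatcher interface{ Name() string }

type httpDispatcher struct{ transport Transport }

func (d httpDispatcher) Name() string { return string(d.transport) }

type emailDispatcher struct{ mode string }

func (d emailDispatcher) Name() string { return "email/" + d.mode }

// buildAlertDispatchers mirrors the wiring described above: the three
// HTTP-based transports always wire up; email picks wpcom/smtp/stub from
// EMAIL_TRANSPORT and falls back to stub if unset or unrecognized. A sketch,
// not the actual cmd/jetmon2 code.
func buildAlertDispatchers(emailTransport string) map[Transport]Dispatcher {
	mode := emailTransport
	switch mode {
	case "wpcom", "smtp", "stub":
		// recognized mode, keep it
	default:
		mode = "stub" // unset or unknown → stub
	}
	return map[Transport]Dispatcher{
		TransportPagerDuty: httpDispatcher{TransportPagerDuty},
		TransportSlack:     httpDispatcher{TransportSlack},
		TransportTeams:     httpDispatcher{TransportTeams},
		TransportEmail:     emailDispatcher{mode},
	}
}
```

Sharing this one map between the worker and the API server is what makes
send-test exercise the production dispatch path.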
5 sqlmock tests cover list happy path, bad status rejection, retry
happy path, wrong-contact 404 cross-check, and not-abandoned 409.
Verified end-to-end in docker: registered an email + a slack alert
contact, ran the send-test endpoint (Block Kit message landed at
the test target with the 🔍 test emoji), inserted a failing
test site, watched the worker enqueue + dispatch alert.opened ->
alert.state_changed -> alert.closed transitions through the slack
transport (status 200 round trip), stopped the target to exercise
the retry path (delivery rescheduled with last_response captured
the transport error), force-marked one delivery abandoned and
exercised the manual retry endpoint (200 with status=pending; second
call against pending row returns 409 delivery_not_retryable as
expected). Test data cleaned up.
ClaimReady now follows its SELECT with a per-row UPDATE pushing
next_attempt_at to NOW+60s before the dispatch goroutine even starts.
Without this, the 1-second deliver tick was re-claiming any still-in-flight
row up to the per-contact in-flight cap (default 3), producing concurrent
dispatches that all read d.Attempt at claim time, computed retry delays from
the same stale value, and ran ScheduleRetry's `attempt = attempt + 1` SQL
three times in a row.

The visible effect: the documented 1m / 5m / 30m / 1h / 6h ladder was
collapsing to roughly 1m → 1h → abandoned, ~1h total instead of the intended
7h36m. A transient consumer outage during a deploy window could exhaust all
retries before the consumer was back.

The dispatch goroutine still owns the real next_attempt_at: it overwrites
the soft lock with NULL on success or the actual retry time on failure. If a
goroutine panics without recovery (or the process is OOM-killed
mid-dispatch), the 60s soft lock expires and the row becomes claimable again
— natural recovery without operator intervention.

Same fix in both internal/webhooks and internal/alerting (they share the bug
because alerting's worker was modeled on webhooks). Tests verify the
contract: SELECT + N per-row UPDATEs, and an idle tick with no candidates
issues no UPDATE traffic. Multi-instance row-claim (SELECT ... FOR UPDATE
SKIP LOCKED) is still tracked alongside the deliverer-binary extraction in
ROADMAP.
CHANGELOG gets a comprehensive "v2 branch — site health platform" section
ahead of the rewrite section, covering everything that has landed on this
branch: event sourcing, the internal REST API, Phase 3 webhooks, Phase 3.x
alert contacts, verifier hardening, and the soft-lock worker fix from the
previous commit. Marked clearly as not drop-in with Jetmon 1 (separate from
the rewrite branch).

AGENTS.md's Key Files table picks up internal/api, internal/apikeys,
internal/webhooks, internal/alerting, plus references to API.md and
ROADMAP.md so future agents have a route to the design docs. The
architecture diagram is intentionally left as-is — it reflects the drop-in
rewrite, and the v2 architecture is still mid-flight pending the
deliverer-binary extraction.
…otency
Three small hardenings caught while reviewing the recently-added code:
1) alerting.Update now validates label (must be non-empty) and
max_per_hour (must be >= 0) at input time, before the DB lookup.
Previously an empty label PATCH would silently persist and a
negative max_per_hour would surface as a generic 500 from MySQL's
INT UNSIGNED constraint instead of a clean 422. Validations that
don't need the existing row run first so obviously bad bodies
don't pay for a round-trip.
2) buildMIMEMessage and renderEmailSubject now strip CR/LF from
anything that becomes a MIME header value (From, To, Subject,
site URL in the subject). Defense-in-depth: monitor_url is
operator-controlled but the column doesn't enforce CRLF-free,
so a malicious or accidental URL with embedded CRLF could have
added new header lines (Bcc:, X-headers, etc.) in outbound email.
Body content with newlines is intentionally unaffected.
3) POST /api/v1/alert-contacts/{id}/test now goes through
withIdempotency like the other write POSTs (and like
webhooks/{id}/rotate-secret). A retried "click to test" during a
network blip no longer double-pages the destination.
Tests: 2 new for the Update validation rejections (no DB hit because
validation fires first), 2 new for the MIME header strip (subject
strip + asserting no new header lines are created when the input
contains CRLF; body CRLF passes through unchanged).
API.md picks up a one-paragraph note on the send-test endpoint explaining
its Idempotency-Key support — same shape as the rest of the write surface,
called out specifically because operators are the typical caller (a "click
to test" UX expectation).

CHANGELOG gets a "Polish" subsection under the v2 branch entry covering the
three hardenings just landed: Update input validation, MIME header CRLF
stripping, and idempotent send-test.
Stacked PR 3 of 9.
Base: stack-02-webhooks
Head: stack-03-alert-contacts
Previous PR: #74
Summary: the internal/alerting package, wired into jetmon2.
Review notes: This PR builds on the webhook delivery shape from #74 and
applies the same transition-consumer pattern to managed alert contacts.