Harden shadow-state projection and delivery paths by chrisbliss18 · Pull Request #76 · Automattic/jetmon

chrisbliss18 · 2026-04-28T14:21:43Z

Stacked PR 4 of 9.

Base: stack-03-alert-contacts
Head: stack-04-shadow-projection-hardening
Previous PR: #75

Summary:

Fixes timing overflow on partially failed checks.
Replaces the unsafe DB update flag with legacy status projection configuration.
Derives API site state from open v2 events and documents the shadow-state rollout.
Adds MySQL 5.7 compatibility, startup warnings, delivery claim-loop hardening, health-check behavior, audit improvements, and related coverage.
Refreshes architecture docs and API reference alignment.

Review notes:
This PR is mostly correctness and operational hardening around the API, event projection, and outbound delivery paths introduced by the earlier stack entries.

Each httptrace phase has a Start hook and a Done hook. When the connection fails mid-way (TCP refused, TLS handshake failure, hostname resolution then disconnect, etc.), the Start hook fires but the matching Done hook never does — leaving its *End timestamp as the zero time.Time. The recording code then computed zero_time.Sub(real_time), which is roughly the negative of the original timestamp's nanoseconds — a huge negative duration that overflows the INT NULL columns in jetmon_check_history. The visible symptom was a flood of repeat log lines on every check round whenever a monitored site refused TCP: orchestrator: record history blog_id=N: Error 1264 (22003): Out of range value for column 'dns_ms' at row 1 The fix is one line per phase: only record a duration if BOTH the Start and Done hooks fired. A failed phase reports zero rather than a misleading negative value. Zero is the right reporting because the phase didn't successfully complete — there is no "duration" to report. Regression test added to TestCheckConnectionRefused asserting all three phase durations are non-negative on a connection-refused target. Without the fix, TCP would be ~ -unix-nanos.

The previous diagram described the original drop-in rewrite — three boxes (orchestrator, check pool, gRPC server) talking to the same internal channels. The v2 branch has three more independently-scaling concerns the diagram didn't show: the REST API server, the webhook delivery worker, and the alerting delivery worker — all consumers of jetmon_event_transitions via the eventstore. Updated diagram shows the layered shape: - Top tier: orchestrator + check pool + veriflier transport (the "monitor what's out there" half) - Middle: eventstore as the single writer for events / transitions - Bottom tier: API + webhook worker + alerting worker (the "tell the world about it" half) Plus inline component descriptions for each new package and an explicit forward-looking note about the deliverer-binary split tracked in ROADMAP.md, so future agents have a route from "I see two delivery workers" to "yes, intentionally — they unify when the binary extracts." WPCOM is shown as the legacy notification path coexisting with alert contacts, matching the design decision documented in API.md "Family 5 → Relationship to legacy WPCOM notifications."

Adds docs/adr/ with a README explaining the format and seven ADRs covering the load-bearing decisions on this branch — the kind of "why is X like this" question that has been re-explained more than once in code review, commit messages, and inline design rationale. Each ADR is short (Status / Context / Decision / Consequences, plus Alternatives where useful) and cross-links to the others and to the relevant code paths. 0001 — Event-sourced state model with dedicated transitions table 0002 — Internal-only API behind a gateway 0003 — Plaintext credential storage for outbound dispatch 0004 — Stripe-style HMAC-SHA256 webhook signatures 0005 — Pull-only webhook and alerting delivery 0006 — Separate internal/alerting and internal/webhooks packages 0007 — Soft-lock claim vs SELECT ... FOR UPDATE SKIP LOCKED AGENTS.md picks up a Key Files row pointing future agents at docs/adr/. The README explains the conventions: append-only after acceptance, one decision per ADR, cross-link generously, don't backfill speculatively. No code changes; pure docs.

API key rotation uses revoked_at for two different operational shapes: immediate revocation sets it to now, while a rolling rotation may set it in the future so the old token keeps working until consumers have deployed the replacement. Lookup treated any non-NULL revoked_at as terminal. That made the documented grace-window path impossible because the old key was rejected immediately instead of at the configured cutoff time. Change Lookup to reject only when revoked_at is now or in the past. Add sqlmock coverage for both sides of the boundary: a future revoked_at remains valid and updates last_used_at, while a past revoked_at returns ErrKeyRevoked.

The shadow-v2-state migration needs Jetmon 2 to keep writing the v1 site_status / last_status_change projection while legacy readers are still in flight. The old DB_UPDATES_ENABLE name and JETMON_UNSAFE_DB_UPDATES startup guard described a local-only safety valve, which is the opposite of the production bridge we now need. Add LEGACY_STATUS_PROJECTION_ENABLE as the explicit migration switch, default it on, and keep DB_UPDATES_ENABLE as a deprecated alias so older configs keep their behavior until they are rewritten. If both keys are present, the new name wins. The orchestrator now uses config.LegacyStatusProjectionEnabled() for every compatibility projection write, and the fatal unsafe-env guard is removed from startup. Config tests cover the default, alias behavior, and new-key precedence; orchestrator tests opt out via the new field.

During the shadow-v2-state migration, jetpack_monitor_sites remains the table we page through for site identity and configuration. It must not remain the API's source of truth for current incident state, especially once the legacy site_status projection is intentionally disabled. List responses now reflect the worst open jetmon_events row for each returned site before state/severity filtering runs. Single-site scans use the legacy site_status fallback only while LEGACY_STATUS_PROJECTION_ENABLE is true; with projection disabled, no open v2 event means the API reports Up even if the old v1 column is stale. Administrative event closes now project site_status back to running inside the same close transaction when the compatibility projection is enabled and no active events remain. That removes the separate maybeProjectRunning follow-up write and keeps manual close / trigger-now behavior aligned with the eventstore invariant. Tests cover active-event list rollups, stale legacy status when projection is disabled, and the revised transaction shape for close paths.

The code now supports the migration shape we settled on: v2 event tables own incident state, while jetpack_monitor_sites stays in place as the legacy site/config table and temporary compatibility projection. The docs needed to say that directly instead of carrying forward the old DB_UPDATES_ENABLE / unsafe-local-test framing. Add ADR 0008 for the shadow-v2-state decision, including the context, rejected alternatives, rollout consequences, and the rule that legacy site_status is a projection rather than source of truth. README, ARCHITECTURE, EVENTS, CHANGELOG, AGENTS.md, ROADMAP, config.readme, and config-sample.json now use LEGACY_STATUS_PROJECTION_ENABLE and describe when it can be disabled. API.md also documents two adjacent migration details that matter operationally: future revoked_at values are key-rotation grace windows, and the current single-binary deployment should run API/webhook/alert delivery workers on only one active instance per DB cluster until transactional row claiming or a deliverer split lands. No behavior changes in this commit; it is the written contract for operators, reviewers, and future agents working on the v2 branch.

Three follow-ups from a review of the morning's shadow-v2-state work. Active-event rollup (handlers_sites.go) used a ROW_NUMBER() OVER (PARTITION BY ...) window function. Window functions require MySQL 8.0+, which conflicts with the documented "MySQL 5.7+ in production" target in .claude/rules/general-guidelines.md. Replace the query with a flat SELECT over open events for the bounded blog_id set and reduce to the worst (highest severity, earliest started_at) per blog_id in Go. The IN list is capped by the API's max page size (200) and a site rarely has more than one open event, so the in-Go reduction is cheap. Adds tests covering severity-wins and severity-tie-with-earlier-started-at picks. API key cutoffs (apikeys.go) now share half-open semantics: a key is valid for times strictly before the cutoff and rejected at or after it. Previously revoked_at used !Before(now) and expires_at used After(now), so a key was rejected at exactly revoked_at but still valid at exactly expires_at. Pick the strict-less-than convention to match JWT/OAuth `exp` behavior and document the invariant in the code and API.md. Adds tests for the boundary on both sides of expires_at. Projection state visibility (cmd/jetmon2/main.go) now logs `config: legacy_status_projection=enabled|disabled` at startup and adds the same line to `./jetmon2 validate-config` output. The fatal JETMON_UNSAFE_DB_UPDATES guard is intentionally not reinstated; the ADR-0008 default is projection-enabled in production, and the guard fights that default. Loud observability is the right shape here.

Housekeeping pass over LSP-flagged diagnostics that were unrelated to today's earlier shadow-state work but worth a sweep while the analysis tooling was attached. - internal/api/handlers_events.go: drop the unused handleListAllEvents handler. Not routed and the 'future admin tooling' rationale would better belong with the actual route when one is added. - internal/api/middleware.go: remove the unused scopeAdmin alias and the unused ctx parameter from (*Server).audit; audit.Log doesn't take context today, so threading it was misleading. All five call sites updated. - internal/api/testhelp_test.go: remove unused invokeWithMux helper and rename the unused server arg in invokeAuthed to _ so the signature is honest without churning every call site. - internal/api/handlers_sites_test.go: drop the no-op trimSQL helper. - cmd/jetmon2/main.go: writePIDFile now uses fmt.Appendf to build the pid bytes directly, and the local repeat helper is replaced with strings.Repeat. main_test.go drops the now-redundant TestRepeat that covered the removed helper. - go.mod: promote github.com/DATA-DOG/go-sqlmock to a direct dependency (it is imported by package tests, not just transitively). go test ./... and go vet ./... both clean.

Whitespace alignment only — no functional change. The set was flagged by `gofmt -l .` and consisted entirely of struct-field and var-block alignment differences in files that hadn't been re-run through gofmt since they last grew a wider field. Repo is now `gofmt -l` clean.

The previous housekeeping commit deleted handleListAllEvents, which was the only caller that passed nil to listEvents. The function still carried a `siteID *int64` parameter and a `WHERE 1=1` shape so the optional `AND blog_id = ?` clause could be appended conditionally. With every caller now guaranteed to pass a real blog id, change the parameter to `siteID int64`, start the WHERE with `blog_id = ?`, and seed the args slice with siteID directly. Tests updated to match the new SQL shape.

Several top-level docs still described the Veriflier path as a live gRPC server and API.md as a not-yet-implemented proposal. That no longer matches the current packages: internal/veriflier is JSON-over-HTTP, the API/webhook/alerting surfaces exist, and eventstore owns event transitions while audit remains operational context. Update README, PROJECT, ARCHITECTURE, and API to describe the current transport, package map, table set, and coarse read/write scopes. Also fix the developer build instructions for veriflier2 so they build into bin/ rather than targeting the parent directory, which fails from the repo root. Verified with: - go test ./... - go vet ./... - go build -o bin/jetmon2 ./cmd/jetmon2/ - go build -o bin/veriflier2 ./veriflier2/cmd/ - git diff --check

API.md still carried a few design-era details after the architecture docs cleanup. Some of those details described behavior that is not routed today or response shapes that no longer match the handler structs, which makes the file risky as an implementation reference for internal consumers. Update the API reference to match internal/api and the related webhook and alerting packages: one per-key rate-limit bucket, POST-only idempotency, current site create and trigger-now payloads, current SLA/stat response shapes, implemented webhook and alert-contact route lists, and the actual webhook/alert delivery retry and suppression behavior. Also mark the OpenAPI endpoint as planned rather than live. Verified with: - go test ./... - go vet ./... - git diff --check

EMAIL_TRANSPORT previously accepted any string and the alerting path fell through to the stub sender for unrecognized values. That made a typo look like a working configuration while silently avoiding the intended email delivery path. Teach config validation which email transports are supported and enforce the required SMTP or WPCOM settings for the active mode. Document the transport keys in the sample config and operator docs, and cover the default stub behavior, invalid values, required transport fields, and sample config loading in tests. Verified with: - go test ./internal/config ./cmd/jetmon2 ./internal/alerting ./internal/api - go test ./... - go vet ./... - git diff --check - make build - make build-veriflier

The webhook and alerting delivery queues already pushed next_attempt_at forward after selecting ready rows, but ClaimReady still returned every selected row even when the soft-lock UPDATE did not affect anything. If another claimant moved the row first, the stale row could still be dispatched by this worker. Repeat the readiness predicate in the soft-lock UPDATE and check RowsAffected before returning a claimed delivery. This keeps overlapping claim attempts from doing duplicate dispatch work while preserving the existing soft-lock model until the future SELECT FOR UPDATE SKIP LOCKED path lands. Add matching regression coverage for webhook and alert-contact deliveries so a row that loses the soft-lock race is skipped rather than returned to the deliver loop. Verified with: - go test ./internal/webhooks ./internal/alerting - go test ./... - go vet ./... - git diff --check - make build - make build-veriflier

EMAIL_TRANSPORT validation already rejects unknown values and enforces the required SMTP / WPCOM settings, but it accepts "stub" (and the empty-string compatibility alias) silently. An operator who configures alert contacts with transport="email" while leaving EMAIL_TRANSPORT as the default ends up with rendered messages that go to the log instead of the destination. There is no startup signal that this is happening. Emit `email_transport=<mode>` at startup alongside the existing legacy_status_projection line, and add a WARN line when the mode is "stub" (or empty) so the missing delivery path is visible in logs. `./jetmon2 validate-config` surfaces the same INFO + WARN pair so the warning is also visible during pre-deploy validation. The emailTransportLabel helper canonicalizes the empty-string alias to "stub" so logs and validate-config show one name regardless of which form an operator wrote.

ClaimReady reuses candidates' backing array via `claimed := candidates[:0]` to drop rows that lost the soft-lock race without a second allocation. The pattern is safe here because the write index is always <= the read index and the candidates slice is not returned in its original form, but that contract is not obvious to a reader skimming the loop. Add a two-line comment in both webhook and alert-contact deliveries.go so the next reader does not relax the invariant by accident.

The startup warning added for stub email transport depends on small command-package helpers and dispatcher wiring. That behavior was covered by package compilation but not by focused tests, so a future cleanup could accidentally change the operator-facing mode labels or drop the email dispatcher without a narrow failure. Add tests for the EMAIL_TRANSPORT label and delivery classification helpers, including the empty-string compatibility alias. Also verify that buildAlertDispatchers wires every managed alert transport and that the stub email dispatcher can render and send successfully. Verified with: - go test ./cmd/jetmon2 - go test ./... - go vet ./... - git diff --check - make build - make build-veriflier

The recent EMAIL_TRANSPORT startup warning made stub email mode visible in logs and validate-config output, but the operator docs still described stub mostly as a unit-test sender and README/PROJECT still claimed validate-config checked Veriflier reachability and WPCOM certificate state. Update the API reference, README, config reference, PROJECT, and changelog so the documented operator behavior matches the current code: stub is the default log-only sender, startup and validate-config warn when email contacts will not deliver mail, validate-config reports Veriflier configuration as context rather than failing on reachability, and the delivery soft-lock changelog now includes the stale-claim RowsAffected fix. Verified with: - go test ./... - go vet ./... - git diff --check - make build - make build-veriflier

README and PROJECT still described a broader operator dashboard than the code currently serves, including a live System Health Map, Veriflier status, slowest-site lists, and dependency health details that are not wired into the dashboard publisher today. Narrow the operator-facing docs to the metrics and WPCOM circuit state that the dashboard actually renders, and recast the broader health grid as future work with the existing /api/health shape but no live publisher yet. Also update the validate-config inline comment so it says the Veriflier list is operator context rather than a reachability check. Verified with: - go test ./cmd/jetmon2 ./internal/dashboard - go test ./... - go vet ./... - git diff --check - make build - make build-veriflier

The API health handler assumed Server.db was always populated and called PingContext directly. That is true for normal startup, but the server constructor accepts a nil DB and the existing nil-DB test only verified that the route compiled rather than exercising the handler behavior it described. Return the same db_unavailable 503 envelope when the database handle is missing, and turn the no-op test into a real regression test for that path. This keeps /api/v1/health predictable even if a caller builds the API server before wiring storage. Verified with: - go test ./internal/api - go test ./... - go vet ./... - git diff --check - make build - make build-veriflier

Chris Jean added 21 commits April 25, 2026 12:50

This was referenced Apr 28, 2026

Refresh docs, tooling, and coverage #77

Merged

Internal REST API foundation: events, webhooks, alerting #72

Merged

chrisbliss18 changed the base branch from stack-03-alert-contacts to v2 April 28, 2026 14:53

chrisbliss18 merged commit 4a8ffc6 into v2 Apr 28, 2026

chrisbliss18 deleted the stack-04-shadow-projection-hardening branch April 28, 2026 15:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harden shadow-state projection and delivery paths#76

Harden shadow-state projection and delivery paths#76
chrisbliss18 merged 21 commits into
v2from
stack-04-shadow-projection-hardening

chrisbliss18 commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chrisbliss18 commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant