Skip to content

Harden shadow-state projection and delivery paths#76

Merged
chrisbliss18 merged 21 commits into
v2from
stack-04-shadow-projection-hardening
Apr 28, 2026
Merged

Harden shadow-state projection and delivery paths#76
chrisbliss18 merged 21 commits into
v2from
stack-04-shadow-projection-hardening

Conversation

@chrisbliss18
Copy link
Copy Markdown
Contributor

Stacked PR 4 of 9.

Base: stack-03-alert-contacts
Head: stack-04-shadow-projection-hardening
Previous PR: #75

Summary:

  • Fixes timing overflow on partially failed checks.
  • Replaces the unsafe DB update flag with legacy status projection configuration.
  • Derives API site state from open v2 events and documents the shadow-state rollout.
  • Adds MySQL 5.7 compatibility, startup warnings, delivery claim-loop hardening, health-check behavior, audit improvements, and related coverage.
  • Refreshes architecture docs and API reference alignment.

Review notes:
This PR is mostly correctness and operational hardening around the API, event projection, and outbound delivery paths introduced by the earlier stack entries.

Chris Jean added 21 commits April 25, 2026 12:50
Each httptrace phase has a Start hook and a Done hook. When the
connection fails mid-way (TCP refused, TLS handshake failure,
hostname resolution then disconnect, etc.), the Start hook fires but
the matching Done hook never does — leaving its *End timestamp as
the zero time.Time. The recording code then computed
zero_time.Sub(real_time), which is roughly the negative of the
original timestamp's nanoseconds — a huge negative duration that
overflows the INT NULL columns in jetmon_check_history.

The visible symptom was a flood of repeat log lines on every check
round whenever a monitored site refused TCP:

  orchestrator: record history blog_id=N: Error 1264 (22003):
    Out of range value for column 'dns_ms' at row 1

The fix is one line per phase: only record a duration if BOTH the
Start and Done hooks fired. A failed phase reports zero rather than
a misleading negative value. Zero is the right reporting because
the phase didn't successfully complete — there is no "duration" to
report.

Regression test added to TestCheckConnectionRefused asserting all
three phase durations are non-negative on a connection-refused
target. Without the fix, TCP would be ~ -unix-nanos.
The previous diagram described the original drop-in rewrite — three
boxes (orchestrator, check pool, gRPC server) talking to the same
internal channels. The v2 branch has three more independently-scaling
concerns the diagram didn't show: the REST API server, the webhook
delivery worker, and the alerting delivery worker — all consumers
of jetmon_event_transitions via the eventstore.

Updated diagram shows the layered shape:
  - Top tier: orchestrator + check pool + veriflier transport (the
    "monitor what's out there" half)
  - Middle: eventstore as the single writer for events / transitions
  - Bottom tier: API + webhook worker + alerting worker (the
    "tell the world about it" half)

Plus inline component descriptions for each new package and an
explicit forward-looking note about the deliverer-binary split
tracked in ROADMAP.md, so future agents have a route from "I see
two delivery workers" to "yes, intentionally — they unify when the
binary extracts."

WPCOM is shown as the legacy notification path coexisting with
alert contacts, matching the design decision documented in API.md
"Family 5 → Relationship to legacy WPCOM notifications."
Adds docs/adr/ with a README explaining the format and seven ADRs
covering the load-bearing decisions on this branch — the kind of
"why is X like this" question that has been re-explained more than
once in code review, commit messages, and inline design rationale.

Each ADR is short (Status / Context / Decision / Consequences, plus
Alternatives where useful) and cross-links to the others and to the
relevant code paths.

  0001 — Event-sourced state model with dedicated transitions table
  0002 — Internal-only API behind a gateway
  0003 — Plaintext credential storage for outbound dispatch
  0004 — Stripe-style HMAC-SHA256 webhook signatures
  0005 — Pull-only webhook and alerting delivery
  0006 — Separate internal/alerting and internal/webhooks packages
  0007 — Soft-lock claim vs SELECT ... FOR UPDATE SKIP LOCKED

AGENTS.md picks up a Key Files row pointing future agents at
docs/adr/. The README explains the conventions: append-only after
acceptance, one decision per ADR, cross-link generously, don't
backfill speculatively.

No code changes; pure docs.
API key rotation uses revoked_at for two different operational shapes: immediate revocation sets it to now, while a rolling rotation may set it in the future so the old token keeps working until consumers have deployed the replacement.

Lookup treated any non-NULL revoked_at as terminal. That made the documented grace-window path impossible because the old key was rejected immediately instead of at the configured cutoff time.

Change Lookup to reject only when revoked_at is now or in the past. Add sqlmock coverage for both sides of the boundary: a future revoked_at remains valid and updates last_used_at, while a past revoked_at returns ErrKeyRevoked.
The shadow-v2-state migration needs Jetmon 2 to keep writing the v1 site_status / last_status_change projection while legacy readers are still in flight. The old DB_UPDATES_ENABLE name and JETMON_UNSAFE_DB_UPDATES startup guard described a local-only safety valve, which is the opposite of the production bridge we now need.

Add LEGACY_STATUS_PROJECTION_ENABLE as the explicit migration switch, default it on, and keep DB_UPDATES_ENABLE as a deprecated alias so older configs keep their behavior until they are rewritten. If both keys are present, the new name wins.

The orchestrator now uses config.LegacyStatusProjectionEnabled() for every compatibility projection write, and the fatal unsafe-env guard is removed from startup. Config tests cover the default, alias behavior, and new-key precedence; orchestrator tests opt out via the new field.
During the shadow-v2-state migration, jetpack_monitor_sites remains the table we page through for site identity and configuration. It must not remain the API's source of truth for current incident state, especially once the legacy site_status projection is intentionally disabled.

List responses now reflect the worst open jetmon_events row for each returned site before state/severity filtering runs. Single-site scans use the legacy site_status fallback only while LEGACY_STATUS_PROJECTION_ENABLE is true; with projection disabled, no open v2 event means the API reports Up even if the old v1 column is stale.

Administrative event closes now project site_status back to running inside the same close transaction when the compatibility projection is enabled and no active events remain. That removes the separate maybeProjectRunning follow-up write and keeps manual close / trigger-now behavior aligned with the eventstore invariant.

Tests cover active-event list rollups, stale legacy status when projection is disabled, and the revised transaction shape for close paths.
The code now supports the migration shape we settled on: v2 event tables own incident state, while jetpack_monitor_sites stays in place as the legacy site/config table and temporary compatibility projection. The docs needed to say that directly instead of carrying forward the old DB_UPDATES_ENABLE / unsafe-local-test framing.

Add ADR 0008 for the shadow-v2-state decision, including the context, rejected alternatives, rollout consequences, and the rule that legacy site_status is a projection rather than source of truth. README, ARCHITECTURE, EVENTS, CHANGELOG, AGENTS.md, ROADMAP, config.readme, and config-sample.json now use LEGACY_STATUS_PROJECTION_ENABLE and describe when it can be disabled.

API.md also documents two adjacent migration details that matter operationally: future revoked_at values are key-rotation grace windows, and the current single-binary deployment should run API/webhook/alert delivery workers on only one active instance per DB cluster until transactional row claiming or a deliverer split lands.

No behavior changes in this commit; it is the written contract for operators, reviewers, and future agents working on the v2 branch.
Three follow-ups from a review of the morning's shadow-v2-state work.

Active-event rollup (handlers_sites.go) used a ROW_NUMBER() OVER
(PARTITION BY ...) window function. Window functions require MySQL 8.0+,
which conflicts with the documented "MySQL 5.7+ in production" target in
.claude/rules/general-guidelines.md. Replace the query with a flat
SELECT over open events for the bounded blog_id set and reduce to the
worst (highest severity, earliest started_at) per blog_id in Go. The IN
list is capped by the API's max page size (200) and a site rarely has
more than one open event, so the in-Go reduction is cheap. Adds tests
covering severity-wins and severity-tie-with-earlier-started-at picks.

API key cutoffs (apikeys.go) now share half-open semantics: a key is
valid for times strictly before the cutoff and rejected at or after it.
Previously revoked_at used !Before(now) and expires_at used After(now),
so a key was rejected at exactly revoked_at but still valid at exactly
expires_at. Pick the strict-less-than convention to match JWT/OAuth `exp`
behavior and document the invariant in the code and API.md. Adds tests
for the boundary on both sides of expires_at.

Projection state visibility (cmd/jetmon2/main.go) now logs
`config: legacy_status_projection=enabled|disabled` at startup and adds
the same line to `./jetmon2 validate-config` output. The fatal
JETMON_UNSAFE_DB_UPDATES guard is intentionally not reinstated; the
ADR-0008 default is projection-enabled in production, and the guard
fights that default. Loud observability is the right shape here.
Housekeeping pass over LSP-flagged diagnostics that were unrelated to
today's earlier shadow-state work but worth a sweep while the analysis
tooling was attached.

- internal/api/handlers_events.go: drop the unused handleListAllEvents
  handler. Not routed and the 'future admin tooling' rationale would
  better belong with the actual route when one is added.
- internal/api/middleware.go: remove the unused scopeAdmin alias and
  the unused ctx parameter from (*Server).audit; audit.Log doesn't take
  context today, so threading it was misleading. All five call sites
  updated.
- internal/api/testhelp_test.go: remove unused invokeWithMux helper and
  rename the unused server arg in invokeAuthed to _ so the signature is
  honest without churning every call site.
- internal/api/handlers_sites_test.go: drop the no-op trimSQL helper.
- cmd/jetmon2/main.go: writePIDFile now uses fmt.Appendf to build the
  pid bytes directly, and the local repeat helper is replaced with
  strings.Repeat. main_test.go drops the now-redundant TestRepeat that
  covered the removed helper.
- go.mod: promote github.com/DATA-DOG/go-sqlmock to a direct dependency
  (it is imported by package tests, not just transitively).

go test ./... and go vet ./... both clean.
Whitespace alignment only — no functional change. The set was flagged
by `gofmt -l .` and consisted entirely of struct-field and var-block
alignment differences in files that hadn't been re-run through gofmt
since they last grew a wider field. Repo is now `gofmt -l` clean.
The previous housekeeping commit deleted handleListAllEvents, which was
the only caller that passed nil to listEvents. The function still
carried a `siteID *int64` parameter and a `WHERE 1=1` shape so the
optional `AND blog_id = ?` clause could be appended conditionally.

With every caller now guaranteed to pass a real blog id, change the
parameter to `siteID int64`, start the WHERE with `blog_id = ?`, and
seed the args slice with siteID directly. Tests updated to match the
new SQL shape.
Several top-level docs still described the Veriflier path as a live
gRPC server and API.md as a not-yet-implemented proposal. That no
longer matches the current packages: internal/veriflier is
JSON-over-HTTP, the API/webhook/alerting surfaces exist, and eventstore
owns event transitions while audit remains operational context.

Update README, PROJECT, ARCHITECTURE, and API to describe the current
transport, package map, table set, and coarse read/write scopes. Also
fix the developer build instructions for veriflier2 so they build into
bin/ rather than targeting the parent directory, which fails from the
repo root.

Verified with:
- go test ./...
- go vet ./...
- go build -o bin/jetmon2 ./cmd/jetmon2/
- go build -o bin/veriflier2 ./veriflier2/cmd/
- git diff --check
API.md still carried a few design-era details after the architecture docs
cleanup. Some of those details described behavior that is not routed today
or response shapes that no longer match the handler structs, which makes the
file risky as an implementation reference for internal consumers.

Update the API reference to match internal/api and the related webhook and
alerting packages: one per-key rate-limit bucket, POST-only idempotency,
current site create and trigger-now payloads, current SLA/stat response
shapes, implemented webhook and alert-contact route lists, and the actual
webhook/alert delivery retry and suppression behavior. Also mark the
OpenAPI endpoint as planned rather than live.

Verified with:
- go test ./...
- go vet ./...
- git diff --check
EMAIL_TRANSPORT previously accepted any string and the alerting path fell through to the stub sender for unrecognized values. That made a typo look like a working configuration while silently avoiding the intended email delivery path.

Teach config validation which email transports are supported and enforce the required SMTP or WPCOM settings for the active mode. Document the transport keys in the sample config and operator docs, and cover the default stub behavior, invalid values, required transport fields, and sample config loading in tests.

Verified with:

- go test ./internal/config ./cmd/jetmon2 ./internal/alerting ./internal/api

- go test ./...

- go vet ./...

- git diff --check

- make build

- make build-veriflier
The webhook and alerting delivery queues already pushed next_attempt_at forward after selecting ready rows, but ClaimReady still returned every selected row even when the soft-lock UPDATE did not affect anything. If another claimant moved the row first, the stale row could still be dispatched by this worker.

Repeat the readiness predicate in the soft-lock UPDATE and check RowsAffected before returning a claimed delivery. This keeps overlapping claim attempts from doing duplicate dispatch work while preserving the existing soft-lock model until the future SELECT FOR UPDATE SKIP LOCKED path lands.

Add matching regression coverage for webhook and alert-contact deliveries so a row that loses the soft-lock race is skipped rather than returned to the deliver loop.

Verified with:

- go test ./internal/webhooks ./internal/alerting

- go test ./...

- go vet ./...

- git diff --check

- make build

- make build-veriflier
EMAIL_TRANSPORT validation already rejects unknown values and enforces
the required SMTP / WPCOM settings, but it accepts "stub" (and the
empty-string compatibility alias) silently. An operator who configures
alert contacts with transport="email" while leaving EMAIL_TRANSPORT as
the default ends up with rendered messages that go to the log instead
of the destination. There is no startup signal that this is happening.

Emit `email_transport=<mode>` at startup alongside the existing
legacy_status_projection line, and add a WARN line when the mode is
"stub" (or empty) so the missing delivery path is visible in logs.
`./jetmon2 validate-config` surfaces the same INFO + WARN pair so the
warning is also visible during pre-deploy validation.

The emailTransportLabel helper canonicalizes the empty-string alias to
"stub" so logs and validate-config show one name regardless of which
form an operator wrote.
ClaimReady reuses candidates' backing array via `claimed := candidates[:0]`
to drop rows that lost the soft-lock race without a second allocation.
The pattern is safe here because the write index is always <= the read
index and the candidates slice is not returned in its original form, but
that contract is not obvious to a reader skimming the loop. Add a
two-line comment in both webhook and alert-contact deliveries.go so the
next reader does not relax the invariant by accident.
The startup warning added for stub email transport depends on small command-package helpers and dispatcher wiring. That behavior was covered by package compilation but not by focused tests, so a future cleanup could accidentally change the operator-facing mode labels or drop the email dispatcher without a narrow failure.

Add tests for the EMAIL_TRANSPORT label and delivery classification helpers, including the empty-string compatibility alias. Also verify that buildAlertDispatchers wires every managed alert transport and that the stub email dispatcher can render and send successfully.

Verified with:
- go test ./cmd/jetmon2
- go test ./...
- go vet ./...
- git diff --check
- make build
- make build-veriflier
The recent EMAIL_TRANSPORT startup warning made stub email mode visible in logs and validate-config output, but the operator docs still described stub mostly as a unit-test sender and README/PROJECT still claimed validate-config checked Veriflier reachability and WPCOM certificate state.

Update the API reference, README, config reference, PROJECT, and changelog so the documented operator behavior matches the current code: stub is the default log-only sender, startup and validate-config warn when email contacts will not deliver mail, validate-config reports Veriflier configuration as context rather than failing on reachability, and the delivery soft-lock changelog now includes the stale-claim RowsAffected fix.

Verified with:
- go test ./...
- go vet ./...
- git diff --check
- make build
- make build-veriflier
README and PROJECT still described a broader operator dashboard than the code currently serves, including a live System Health Map, Veriflier status, slowest-site lists, and dependency health details that are not wired into the dashboard publisher today.

Narrow the operator-facing docs to the metrics and WPCOM circuit state that the dashboard actually renders, and recast the broader health grid as future work with the existing /api/health shape but no live publisher yet. Also update the validate-config inline comment so it says the Veriflier list is operator context rather than a reachability check.

Verified with:
- go test ./cmd/jetmon2 ./internal/dashboard
- go test ./...
- go vet ./...
- git diff --check
- make build
- make build-veriflier
The API health handler assumed Server.db was always populated and called PingContext directly. That is true for normal startup, but the server constructor accepts a nil DB and the existing nil-DB test only verified that the route compiled rather than exercising the handler behavior it described.

Return the same db_unavailable 503 envelope when the database handle is missing, and turn the no-op test into a real regression test for that path. This keeps /api/v1/health predictable even if a caller builds the API server before wiring storage.

Verified with:
- go test ./internal/api
- go test ./...
- go vet ./...
- git diff --check
- make build
- make build-veriflier
@chrisbliss18 chrisbliss18 changed the base branch from stack-03-alert-contacts to v2 April 28, 2026 14:53
@chrisbliss18 chrisbliss18 merged commit 4a8ffc6 into v2 Apr 28, 2026
@chrisbliss18 chrisbliss18 deleted the stack-04-shadow-projection-hardening branch April 28, 2026 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant