Refresh docs, tooling, and coverage by chrisbliss18 · Pull Request #77 · Automattic/jetmon

chrisbliss18 · 2026-04-28T14:21:53Z

Stacked PR 5 of 9.

Base: stack-04-shadow-projection-hardening
Head: stack-05-docs-tooling-coverage
Previous PR: #76

Summary:

Adds the top-level docs index and refreshes README/project docs for the v2 health-platform story.
Clarifies public API, Veriflier transport, Makefile, dashboard, and probe-agent architecture notes.
Makes Go resolution and writable build cache behavior explicit in build targets.
Expands coverage across core packages and tightens API/audit tests.
Prioritizes remaining roadmap work.

Review notes:
This PR is a documentation, tooling, and test-coverage layer after the main API and delivery behavior has landed.

The v3 probe-agent discussion should be preserved without changing the v2 production target. The important constraint is sequencing: deploy and stabilize v2 first, gather production data, then revisit whether Jetmon should evolve beyond the current main-server-plus-Veriflier confirmation model. Add a planning note that records the v1 -> v2 -> v3 migration framing, the production data to collect during v2, the current v2 baseline, candidate future architectures, and the current recommendation to treat a central scheduler plus regional probe agents as the leading v3 option. Link the note from ROADMAP so it stays visible but clearly parked until after v2 is stable.

Verified with:
- git diff --check
- rg -n '[[:blank:]]$' ROADMAP.md docs/v3-probe-agent-architecture-options.md

The docs directory now contains both accepted ADRs and future-facing planning notes. Without a top-level index, the new v3 probe-agent architecture note is only discoverable through ROADMAP or by listing files. Add docs/README.md to explain the distinction between architecture decisions and planning notes, point readers at the ADR index, and link the v3 probe-agent architecture options note.

Verified with:
- git diff --check
- rg -n '[[:blank:]]$' docs/README.md
- go test ./...
- go vet ./...

The roadmap still described the public REST API as if Jetmon had no API surface at all, which now conflicts with the internal /api/v1 work. Reframe the item around the remaining customer-facing contract: gateway semantics, tenant ownership, sanitized errors, public rate limits, and payloads safe for external tooling.

Verified with:
- git diff --check
- go test ./...
- go vet ./...

Several docs and comments still implied that Monitor-to-Veriflier traffic had already moved to gRPC. The implementation is intentionally JSON-over-HTTP today, with the proto contract kept for a future generated gRPC swap. Clarify the wording around the existing transport while leaving the legacy config names alone for compatibility.

Verified with:
- git diff --check
- go test ./...
- go vet ./...
- make build
- make build-veriflier

The default make target should build the currently implemented binaries without requiring the future gRPC generation toolchain. Generated Veriflier stubs are still an explicit make generate step, but JSON-over-HTTP is the active transport today. This keeps make all usable in normal development environments that do not have protoc installed.

Verified with:
- git diff --check
- make all
- go test ./...
- go vet ./...

The Makefile now keeps code generation separate from the default build, but the developer docs still showed direct go build commands. Point README and AGENTS at make all for the normal two-binary build, and call out that make generate is only needed when working on the future generated gRPC Veriflier transport.

Verified with:
- git diff --check
- make all
- go test ./...
- go vet ./...

The Unreleased changelog now covers the build-target and documentation cleanup that landed after the main v2 platform notes. Capture the Makefile default target, separated code generation, docs index, v3 probe-agent architecture note, Veriflier transport wording, and public API roadmap clarification so release prep does not lose those smaller changes.

Verified with:
- git diff --check
- make all
- go test ./...
- go vet ./...

The local shell can have a valid Go installation at /usr/local/go/bin/go without exposing go on PATH. That made the documented Makefile targets fragile in the same kind of environment where direct go commands failed. Route build, test, race-test, and lint targets through a configurable GO variable. The default still prefers the Go binary on PATH, falls back to /usr/local/go/bin/go when available, and remains overrideable for other local layouts.

Verified with:
- git diff --check
- make all
- make test
- make lint
- make test-race

Rejected API requests already received an X-Request-ID response header, but the audit call used the original request rather than the request carrying the generated context. That left api_access metadata without the same request_id operators see in the response. Carry the request context into auth, scope, and rate-limit rejection audit paths, and add coverage that fails if rejected-request audit metadata loses its request id.

Verified with:
- git diff --check
- make all
- make lint
- make test
- make test-race

The Makefile can now find Go when it lives at /usr/local/go/bin/go, but the same restricted environments may also have an unwritable home-directory Go build cache. Route build, test, lint, and race-test through an overrideable GOCACHE defaulting to /tmp/jetmon-go-cache, and document the override in the developer docs and changelog.

Verified with:
- git diff --check
- make all
- make lint
- make test
- make test-race

The previous fix carried the contextualized request into rejection audit calls so the request_id metadata was no longer empty. The helper still discovered the id by reading r's context, which means a future caller that hands in a request without the middleware-attached context would silently regress the same bug. Take reqID as an explicit parameter to (*Server).audit and update all call sites in requireScope. The success-path audit call previously passed the original r (no context attached), so its audit row had been missing request_id all along; the explicit-parameter shape fixes that incidentally. Broaden TestRequireScopeAuditsRejectedRequestWithRequestID into a table over the four single-request rejection paths (missing token, invalid token, revoked token, insufficient scope) so a future change can't drop the request id from any of them without a narrow failure.
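
To illustrate why the explicit parameter is more robust, here is a minimal sketch that contrasts the two helper shapes. It is not the Jetmon code: auditRow, withRequestID, auditFromContext, and auditExplicit are hypothetical names, and the real (*Server).audit signature may differ.

```go
package main

import (
	"context"
	"fmt"
)

// ctxKey is a hypothetical context key for the middleware-generated request id.
type ctxKey struct{}

// auditRow is a hypothetical stand-in for one api_access audit record.
type auditRow struct {
	Status    int
	Note      string
	RequestID string
}

// withRequestID mimics middleware attaching a generated id to the context.
func withRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// auditFromContext is the fragile shape: it looks the id up itself, so a
// caller that passes a context without the id silently records "".
func auditFromContext(ctx context.Context, status int, note string) auditRow {
	id, _ := ctx.Value(ctxKey{}).(string)
	return auditRow{Status: status, Note: note, RequestID: id}
}

// auditExplicit is the explicit-parameter shape: the caller must supply the
// id, so forgetting it is visible at the call site.
func auditExplicit(reqID string, status int, note string) auditRow {
	return auditRow{Status: status, Note: note, RequestID: reqID}
}

// demoRows reproduces the bug (original context passed by mistake) next to
// the explicit-parameter call.
func demoRows() (fragile, explicit auditRow) {
	orig := context.Background()
	withID := withRequestID(orig, "req-123")
	reqID, _ := withID.Value(ctxKey{}).(string)

	fragile = auditFromContext(orig, 401, "missing token") // wrong context
	explicit = auditExplicit(reqID, 401, "missing token")
	return fragile, explicit
}

func main() {
	fragile, explicit := demoRows()
	fmt.Printf("from context: %+v\n", fragile)
	fmt.Printf("explicit:     %+v\n", explicit)
}
```

In this sketch the context-reading helper records an empty request id when handed the original context, while the explicit form cannot lose it without the call site changing.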

The two 503 paths described the same operational condition (database not available to the API server) with different vocabulary: "database handle is not configured" vs. "database ping failed: <err>". The first form leaks the implementation cause to the unauthenticated health endpoint without giving load balancers or external monitors anything they can act on. Both paths now lead with "database not reachable", with the ping path appending the underlying driver error for operator debugging. The public error code (db_unavailable) and HTTP status (503) are unchanged, so consumers branching on `code` are unaffected.
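
As a rough sketch of the unified wording, the two 503 paths could share a leading message like this. The healthError type, the dbHealth function, and the ": " separator before the driver error are assumptions for illustration; only the db_unavailable code, the 503 status, and the "database not reachable" lead come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// leadMsg is the shared leading message for both 503 paths.
const leadMsg = "database not reachable"

// errPing is a made-up driver error used only for the demonstration.
var errPing = errors.New("dial tcp 127.0.0.1:3306: connect: connection refused")

// healthError is a hypothetical shape for the health endpoint's error body.
type healthError struct {
	Status  int
	Code    string
	Message string
}

// dbHealth returns nil when the database is usable. Both failure paths keep
// the same status and code and start with the same message; only the ping
// path appends the underlying error for operator debugging.
func dbHealth(configured bool, pingErr error) *healthError {
	switch {
	case !configured:
		return &healthError{Status: 503, Code: "db_unavailable", Message: leadMsg}
	case pingErr != nil:
		return &healthError{Status: 503, Code: "db_unavailable", Message: leadMsg + ": " + pingErr.Error()}
	}
	return nil
}

func main() {
	fmt.Printf("%+v\n", *dbHealth(false, nil))
	fmt.Printf("%+v\n", *dbHealth(true, errPing))
	fmt.Println(dbHealth(true, nil) == nil)
}
```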

The stub email dispatcher returns 250 to mirror SMTP's "Requested mail action okay, completed" reply so the audit row reads the same shape regardless of which email transport actually fired. The bare 250 in the test assertion read as a magic number. A short comment explains the convention so the next reader does not strip or "normalize" it.

The earlier table-driven test fixed regression coverage for the four single-request rejection paths in requireScope. The success path and the rate-limit path were still untested for audit-row content, even though the latest fix made request_id preservation a guarantee on all six audit call sites. Add audit Init + ExpectExec to TestRequireScopeAllowsValidToken so the status=200 / note="" row is asserted with a non-empty request_id. Add the same to TestRequireScopeRateLimit429 across both requests (allowed plus rate-limited) so the 429 path locks in note="rate limited" with a distinct request_id from the allowed row. The audit-row regression net now covers every s.audit call site in requireScope.

audit.Log used db.Exec with no context, so an audit insert against a wedged DB could block the calling goroutine indefinitely and could not be cancelled when the orchestrator shut down or the API server drained. Take ctx as the first argument to Log and switch to ExecContext. Callers pick the lifetime that fits:
- The orchestrator's auditLog wrapper passes o.ctx, so audit writes honor orchestrator shutdown alongside the rest of the round loop.
- The API middleware's (*Server).audit derives a fresh context.Background context with a 5s write timeout. Background rather than r.Context() preserves the existing "audit fires regardless of client disconnect" semantic (audit is for the operator, not the caller), and the timeout caps any wedged-DB hang.

Add a regression test that confirms a canceled ctx surfaces context.Canceled from Log instead of silently succeeding, and update the existing nil-DB and event-type tests to pass a context. Two existing API middleware tests were already asserting audit-row content, so they cover the threaded API end-to-end alongside the new audit-package test.
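
The context threading described above can be sketched as follows. This is an illustration under assumptions, not the audit package itself: logAudit stands in for audit.Log, the execer interface and wedgedDB fake are hypothetical, and the table and column names in the INSERT are invented. The demo uses a short timeout in place of the 5s write timeout so it finishes quickly.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// execer is the subset of *sql.DB this sketch needs; *sql.DB satisfies it.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// wedgedDB is a hypothetical fake that never answers until ctx ends.
type wedgedDB struct{}

func (wedgedDB) ExecContext(ctx context.Context, _ string, _ ...any) (sql.Result, error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// logAudit stands in for audit.Log: ctx comes first and bounds the insert.
// The table and column names are invented for this sketch.
func logAudit(ctx context.Context, db execer, eventType, detail string) error {
	_, err := db.ExecContext(ctx,
		"INSERT INTO audit_log (event_type, detail) VALUES (?, ?)",
		eventType, detail)
	return err
}

// auditFromAPI mirrors the API-side choice: derive from context.Background
// so a client disconnect cannot cancel the write, and cap it with a timeout.
func auditFromAPI(db execer, timeout time.Duration, eventType, detail string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return logAudit(ctx, db, eventType, detail)
}

// canceledSurfaces reports whether an already-canceled ctx yields context.Canceled.
func canceledSurfaces() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return errors.Is(logAudit(ctx, wedgedDB{}, "api_access", "canceled"), context.Canceled)
}

// wedgedTimesOut reports whether the timeout caps a wedged-DB hang.
func wedgedTimesOut() bool {
	err := auditFromAPI(wedgedDB{}, 50*time.Millisecond, "api_access", "wedged")
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("canceled ctx surfaces context.Canceled:", canceledSurfaces())
	fmt.Println("wedged DB hits the write timeout:", wedgedTimesOut())
}
```

An orchestrator-style caller would instead pass its own long-lived context to logAudit, so that shutdown cancels any in-flight audit write.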

The previous overview framed Jetmon 2 as a Go rewrite of v1 with a backwards-compatible payload. That undersells where the branch actually landed: v2 is a full event-sourced health platform on top of the same Veriflier-confirmed detection core. Rewrite the overview to lead with the platform shift, add a dedicated "What's new in v2" section with a side-by-side capability table and five concrete bragging points (HMAC webhooks with the documented retry ladder, idempotent write endpoints, half-open API key cutoffs, embedded migrations + validate-config, and MySQL 5.7+ compatibility), and update the architecture diagram + prose to show the eventstore, REST API, webhook worker, and alert-contact worker that sit alongside the orchestrator. Keep the WPCOM notification flow table intact and mark it as the legacy compatibility path that the shadow-state migration is preserving until consumers cut over. No code or config changes; the rest of the README (installation, configuration, database, developer/tester/admin/HE sections) stands as written.

Site list pagination fetches one extra raw row so it can decide whether to return page.next. State and severity filters run after active-event rollups, so a filtered page could remove the sentinel row before cursor calculation and incorrectly report page.next as null. Track the raw fetched window separately from the filtered response. When the filtered response consumes fewer than the requested limit but the raw query still fetched a sentinel row, advance the cursor past that raw row so clients can keep walking the filtered result set.
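
One way to picture the fix is to derive page.next from the raw window rather than from the filtered items. The sketch below is a simplified, in-memory reading of that idea, not the actual handler: site, page, fetchRaw, listSites, and walk are hypothetical names, and here the cursor is set to the last raw row inside the window, so the sentinel row is fetched again as the first row of the next page. The real implementation may position the cursor differently.

```go
package main

import "fmt"

// site is a hypothetical row; State stands in for the post-rollup filter field.
type site struct {
	ID    int
	State string
}

// page is a hypothetical response; Next is nil when there are no more pages.
type page struct {
	Items []site
	Next  *int
}

// fetchRaw stands in for the DB query: rows with ID > after, ordered by ID,
// capped at n rows.
func fetchRaw(all []site, after, n int) []site {
	var out []site
	for _, s := range all {
		if s.ID > after && len(out) < n {
			out = append(out, s)
		}
	}
	return out
}

// listSites fetches limit+1 raw rows, filters afterwards, and decides
// whether there is a next page from the raw window, not the filtered items.
func listSites(all []site, after, limit int, state string) page {
	raw := fetchRaw(all, after, limit+1)
	hasMore := len(raw) > limit // the sentinel row was fetched
	if hasMore {
		raw = raw[:limit]
	}
	var p page
	for _, s := range raw {
		if state == "" || s.State == state {
			p.Items = append(p.Items, s)
		}
	}
	if hasMore {
		next := raw[len(raw)-1].ID // cursor from the raw window
		p.Next = &next
	}
	return p
}

// walk follows Next until it is nil and collects the matching IDs.
func walk(all []site, limit int, state string) []int {
	var ids []int
	after := 0
	for {
		p := listSites(all, after, limit, state)
		for _, s := range p.Items {
			ids = append(ids, s.ID)
		}
		if p.Next == nil {
			return ids
		}
		after = *p.Next
	}
}

func main() {
	all := []site{{1, "up"}, {2, "up"}, {3, "down"}, {4, "up"}, {5, "down"}}
	// The first page filters down to zero items but still reports a cursor.
	fmt.Println(walk(all, 2, "down")) // prints [3 5]
}
```

With a limit of 2 and a "down" filter, the first page is empty after filtering, yet the client still receives a cursor and reaches sites 3 and 5 on later pages.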

Add focused unit coverage for the packages that still had deterministic gaps: eventstore mutations, DB query helpers, API delivery handlers, webhook and alerting repositories, worker result paths, metrics emission, verifier config/check wrappers, and small CLI/config helpers. The tests stay on sqlmock, httptest, and in-memory checks so they exercise production behavior without requiring MySQL or external services. The local coverage profile moved total statement coverage from 51.9% to 68.6%, with the largest gains in eventstore, DB, metrics, audit, alerting, webhooks, and apikeys.

Add the remaining deterministic coverage around the recent API and worker test pass. Webhook and alert-contact list handlers now cover success and database-error responses, and the API server helpers cover shutdown-before-listen plus the http.ErrServerClosed sentinel. Also cover idempotency cache cleanup, streaming Flush passthrough, the transaction-aware site-status projection helper, and the worker start/stop path for webhook and alert delivery. These stay in unit-test territory and avoid taking on the process entrypoints or long-running orchestration loops that need a broader integration harness.
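
For readers unfamiliar with the two server-helper cases named above, the following sketch shows the standard-library behavior they rely on. The run and shutdownBeforeListen helpers are hypothetical and are not the Jetmon API server helpers; they only demonstrate that net/http returns http.ErrServerClosed once Shutdown has been called, and that treating that sentinel as a clean exit is a common pattern.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// run serves until the server stops and treats http.ErrServerClosed as a
// clean exit rather than a failure.
func run(srv *http.Server) error {
	err := srv.ListenAndServe()
	if errors.Is(err, http.ErrServerClosed) {
		return nil
	}
	return err
}

// shutdownBeforeListen calls Shutdown on a server that never started.
// ListenAndServe then returns http.ErrServerClosed without serving, which
// run maps to nil.
func shutdownBeforeListen() error {
	srv := &http.Server{Addr: "127.0.0.1:0"}
	if err := srv.Shutdown(context.Background()); err != nil {
		return err
	}
	return run(srv)
}

func main() {
	fmt.Println("shutdown-before-listen error:", shutdownBeforeListen())
}
```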

Add an explicit P0/P1/P2 queue so the remaining v2 hardening, post-v2 delivery work, and v3 probe-agent decisions are easy to sequence. Also update the public API roadmap text to distinguish the already-implemented internal API from the customer-facing contract work that still needs tenant scoping, redaction, OpenAPI generation, and public-contract tests.

Chris Jean added 20 commits April 27, 2026 15:40

This was referenced Apr 28, 2026

Add deliverer, OpenAPI, and v2 decision metrics #78

Merged

Internal REST API foundation: events, webhooks, alerting #72

Merged

chrisbliss18 changed the base branch from stack-04-shadow-projection-hardening to v2 April 28, 2026 14:54

chrisbliss18 merged commit d95c8f8 into v2 Apr 28, 2026

chrisbliss18 deleted the stack-05-docs-tooling-coverage branch April 28, 2026 15:08
