Refresh docs, tooling, and coverage #77

Merged
chrisbliss18 merged 20 commits into v2 from stack-05-docs-tooling-coverage on Apr 28, 2026

@chrisbliss18
Contributor

Stacked PR 5 of 9.

Base: stack-04-shadow-projection-hardening
Head: stack-05-docs-tooling-coverage
Previous PR: #76

Summary:

  • Adds the top-level docs index and refreshes README/project docs for the v2 health-platform story.
  • Clarifies public API, Veriflier transport, Makefile, dashboard, and probe-agent architecture notes.
  • Makes Go resolution and writable build cache behavior explicit in build targets.
  • Expands coverage across core packages and tightens API/audit tests.
  • Prioritizes remaining roadmap work.

Review notes:
This PR is a documentation, tooling, and test-coverage layer after the main API and delivery behavior has landed.

Chris Jean added 20 commits April 27, 2026 15:40
The v3 probe-agent discussion should be preserved without changing the v2 production target. The important constraint is sequencing: deploy and stabilize v2 first, gather production data, then revisit whether Jetmon should evolve beyond the current main-server-plus-Veriflier confirmation model.

Add a planning note that records the v1 -> v2 -> v3 migration framing, the production data to collect during v2, the current v2 baseline, candidate future architectures, and the current recommendation to treat a central scheduler plus regional probe agents as the leading v3 option. Link the note from ROADMAP so it stays visible but clearly parked until after v2 is stable.

Verified with:
- git diff --check
- rg -n '[[:blank:]]$' ROADMAP.md docs/v3-probe-agent-architecture-options.md

The docs directory now contains both accepted ADRs and future-facing planning notes. Without a top-level index, the new v3 probe-agent architecture note is only discoverable through ROADMAP or by listing files.

Add docs/README.md to explain the distinction between architecture decisions and planning notes, point readers at the ADR index, and link the v3 probe-agent architecture options note.

Verified with:
- git diff --check
- rg -n '[[:blank:]]$' docs/README.md
- go test ./...
- go vet ./...

The roadmap still described the public REST API as if Jetmon had no API surface at all, which now conflicts with the internal /api/v1 work.

Reframe the item around the remaining customer-facing contract: gateway semantics, tenant ownership, sanitized errors, public rate limits, and payloads safe for external tooling.

Verified with:
- git diff --check
- go test ./...
- go vet ./...

Several docs and comments still implied that Monitor-to-Veriflier traffic had already moved to gRPC. The implementation is intentionally JSON-over-HTTP today, with the proto contract kept for a future generated gRPC swap.

Clarify the wording around the existing transport while leaving the legacy config names alone for compatibility.

Verified with:
- git diff --check
- go test ./...
- go vet ./...
- make build
- make build-veriflier

The default make target should build the currently implemented binaries without requiring the future gRPC generation toolchain.

Generated Veriflier stubs are still an explicit make generate step, but JSON-over-HTTP is the active transport today. This keeps make all usable in normal development environments that do not have protoc installed.

Verified with:
- git diff --check
- make all
- go test ./...
- go vet ./...

The Makefile now keeps code generation separate from the default build, but the developer docs still showed direct go build commands.

Point README and AGENTS at make all for the normal two-binary build, and call out that make generate is only needed when working on the future generated gRPC Veriflier transport.

Verified with:
- git diff --check
- make all
- go test ./...
- go vet ./...

The Unreleased changelog now covers the build-target and documentation cleanup that landed after the main v2 platform notes.

Capture the Makefile default target, separated code generation, docs index, v3 probe-agent architecture note, Veriflier transport wording, and public API roadmap clarification so release prep does not lose those smaller changes.

Verified with:
- git diff --check
- make all
- go test ./...
- go vet ./...

The local shell can have a valid Go installation at /usr/local/go/bin/go without exposing go on PATH. That made the documented Makefile targets fragile in the same kind of environment where direct go commands failed.

Route build, test, race-test, and lint targets through a configurable GO variable. The default still prefers the Go binary on PATH, falls back to /usr/local/go/bin/go when available, and remains overridable for other local layouts.

Verified with:
- git diff --check
- make all
- make test
- make lint
- make test-race

Rejected API requests already received an X-Request-ID response header, but the audit call used the original request rather than the request carrying the generated context. That left api_access metadata without the same request_id operators see in the response.

Carry the request context into auth, scope, and rate-limit rejection audit paths, and add coverage that fails if rejected-request audit metadata loses its request id.

Verified with:
- git diff --check
- make all
- make lint
- make test
- make test-race

The Makefile can now find Go when it lives at /usr/local/go/bin/go, but the same restricted environments may also have an unwritable home-directory Go build cache.

Route build, test, lint, and race-test through an overridable GOCACHE defaulting to /tmp/jetmon-go-cache, and document the override in the developer docs and changelog.

Verified with:
- git diff --check
- make all
- make lint
- make test
- make test-race

The previous fix carried the contextualized request into rejection audit
calls so the request_id metadata was no longer empty. The helper still
discovered the id by reading r's context, which means a future caller
that hands in a request without the middleware-attached context would
silently regress the same bug.

Take reqID as an explicit parameter to (*Server).audit and update all
call sites in requireScope. The success-path audit call previously
passed the original r (no context attached), so its audit row had been
missing request_id all along; the explicit-parameter shape fixes that
incidentally.

Broaden TestRequireScopeAuditsRejectedRequestWithRequestID into a
table over the four single-request rejection paths (missing token,
invalid token, revoked token, insufficient scope) so a future change
can't drop the request id from any of them without a narrow failure.
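
The explicit-parameter shape described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `auditRow` type, the in-memory `rows` sink, the `req-N` id format, the `/api/v1/sites` path, and the `rejectedRequestIDs` helper are all hypothetical, and only the missing-token rejection path and the success path are modeled.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// auditRow is a hypothetical stand-in for one api_access audit record.
type auditRow struct {
	Path      string
	RequestID string
	Status    int
	Note      string
}

// Server is a simplified stand-in for the API server. Rows are kept in
// memory instead of a database, so this sketch is not safe for concurrent use.
type Server struct {
	nextID atomic.Int64
	rows   []auditRow
}

// audit receives reqID explicitly, so it never has to rediscover the id
// from r's context and cannot silently record an empty one.
func (s *Server) audit(r *http.Request, reqID string, status int, note string) {
	s.rows = append(s.rows, auditRow{Path: r.URL.Path, RequestID: reqID, Status: status, Note: note})
}

// requireScope generates one request id, exposes it in the response
// header, and hands the same value to every audit call site.
func (s *Server) requireScope(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		reqID := fmt.Sprintf("req-%d", s.nextID.Add(1)) // assumed id format
		w.Header().Set("X-Request-ID", reqID)
		if r.Header.Get("Authorization") == "" {
			s.audit(r, reqID, http.StatusUnauthorized, "missing token")
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		s.audit(r, reqID, http.StatusOK, "")
		next.ServeHTTP(w, r)
	})
}

// rejectedRequestIDs sends one token-less request and returns the id seen
// in the response header alongside the id stored in the audit row.
func rejectedRequestIDs() (header, audited string) {
	s := &Server{}
	h := s.requireScope(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/v1/sites", nil))
	return rec.Header().Get("X-Request-ID"), s.rows[0].RequestID
}

func main() {
	header, audited := rejectedRequestIDs()
	fmt.Println(header == audited)
}
```

Because the id is a plain argument, a caller that forgets to pass it fails to compile instead of writing an audit row with an empty request_id.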
The two 503 paths described the same operational condition (database
not available to the API server) with different vocabulary: "database
handle is not configured" vs. "database ping failed: <err>". The first
form leaks the implementation cause to the unauthenticated health
endpoint without giving load balancers or external monitors anything
they can act on.

Both paths now lead with "database not reachable", with the ping path
appending the underlying driver error for operator debugging. The
public error code (db_unavailable) and HTTP status (503) are unchanged,
so consumers branching on `code` are unaffected.
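
A rough sketch of the unified wording, assuming a hypothetical `healthError` type and `dbUnavailable` helper (neither name comes from the codebase). Both 503 paths share the same lead phrase, code, and status, and only the ping path appends the driver error:

```go
package main

import (
	"errors"
	"fmt"
)

// healthError is a hypothetical response shape for the health endpoint.
type healthError struct {
	Status  int
	Code    string
	Message string
}

// dbUnavailable builds the 503 body for both failure paths. A nil pingErr
// models the "no database handle" case; a non-nil pingErr models a failed
// ping and is appended for operator debugging.
func dbUnavailable(pingErr error) healthError {
	msg := "database not reachable"
	if pingErr != nil {
		msg = fmt.Sprintf("%s: %v", msg, pingErr)
	}
	return healthError{Status: 503, Code: "db_unavailable", Message: msg}
}

// examplePingFailure shows the ping path with an illustrative driver error.
func examplePingFailure() healthError {
	return dbUnavailable(errors.New("dial tcp: connection refused"))
}

func main() {
	fmt.Println(dbUnavailable(nil).Message)
	fmt.Println(examplePingFailure().Message)
}
```

Consumers that branch on `code` see `db_unavailable` in both cases; only the human-readable message differs.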
The stub email dispatcher returns 250 to mirror SMTP's "Requested mail
action okay, completed" reply so the audit row reads the same shape
regardless of which email transport actually fired. The bare 250 in
the test assertion read as a magic number. A short comment explains
the convention so the next reader does not strip or "normalize" it.

The earlier table-driven test fixed regression coverage for the four
single-request rejection paths in requireScope. The success path and
the rate-limit path were still untested for audit-row content, even
though the latest fix made request_id preservation a guarantee on all
six audit call sites.

Add audit Init + ExpectExec to TestRequireScopeAllowsValidToken so the
status=200 / note="" row is asserted with a non-empty request_id. Add
the same to TestRequireScopeRateLimit429 across both requests (allowed
plus rate-limited) so the 429 path locks in note="rate limited" with a
distinct request_id from the allowed row.

The audit-row regression net now covers every s.audit call site in
requireScope.

audit.Log used db.Exec with no context, so an audit insert against a
wedged DB could block the calling goroutine indefinitely and could not
be cancelled when the orchestrator shut down or the API server drained.

Take ctx as the first argument to Log and switch to ExecContext.
Callers pick the lifetime that fits:

- The orchestrator's auditLog wrapper passes o.ctx, so audit writes
  honor orchestrator shutdown alongside the rest of the round loop.
- The API middleware's (*Server).audit derives a fresh
  context.Background context with a 5s write timeout. Background
  rather than r.Context() preserves the existing "audit fires
  regardless of client disconnect" semantic — audit is for the
  operator, not the caller — and the timeout caps any wedged-DB hang.

Add a regression test that confirms a canceled ctx surfaces
context.Canceled from Log instead of silently succeeding, and update
the existing nil-DB and event-type tests to pass a context. Two
existing API middleware tests were already asserting audit-row
content, so they cover the threaded API end-to-end alongside the new
audit-package test.
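
The context threading described above might look roughly like this. It is a sketch under stated assumptions: the real `Log` signature is not shown in this PR, so the `execer` interface, the `audit_log` table and columns, the `apiAudit` wrapper, and the `wedgedDB` fake are hypothetical. The 5s timeout and the choice of `context.Background` follow the commit message.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// execer is the subset of *sql.DB this sketch needs; *sql.DB satisfies it.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// Log writes one audit row. Because it uses ExecContext, a canceled or
// expired ctx aborts the insert instead of blocking the caller forever.
func Log(ctx context.Context, db execer, eventType, detail string) error {
	// Table and column names are assumptions for illustration.
	_, err := db.ExecContext(ctx,
		"INSERT INTO audit_log (event_type, detail) VALUES (?, ?)", eventType, detail)
	return err
}

const auditWriteTimeout = 5 * time.Second

// apiAudit mirrors the API middleware choice: start from Background so a
// client disconnect does not cancel the write, but cap it with a timeout.
func apiAudit(db execer, eventType, detail string) error {
	ctx, cancel := context.WithTimeout(context.Background(), auditWriteTimeout)
	defer cancel()
	return Log(ctx, db, eventType, detail)
}

// wedgedDB simulates a database that never answers: it blocks until the
// context is done and then reports the context's error.
type wedgedDB struct{}

func (wedgedDB) ExecContext(ctx context.Context, _ string, _ ...any) (sql.Result, error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// canceledIsSurfaced reports whether Log returns context.Canceled when it
// is handed an already-canceled context.
func canceledIsSurfaced() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err := Log(ctx, wedgedDB{}, "api_access", "demo")
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println(canceledIsSurfaced())
}
```

An orchestrator-side caller would pass its own long-lived context in place of the timeout context, so audit writes stop when the orchestrator shuts down.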
The previous overview framed Jetmon 2 as a Go rewrite of v1 with a
backwards-compatible payload. That undersells where the branch
actually landed: v2 is a full event-sourced health platform on top of
the same Veriflier-confirmed detection core.

Rewrite the overview to lead with the platform shift, add a dedicated
"What's new in v2" section with a side-by-side capability table and
five concrete bragging points (HMAC webhooks with the documented
retry ladder, idempotent write endpoints, half-open API key cutoffs,
embedded migrations + validate-config, and MySQL 5.7+ compatibility),
and update the architecture diagram + prose to show the eventstore,
REST API, webhook worker, and alert-contact worker that sit alongside
the orchestrator. Keep the WPCOM notification flow table intact and
mark it as the legacy compatibility path that the shadow-state
migration is preserving until consumers cut over.

No code or config changes; the rest of the README (installation,
configuration, database, developer/tester/admin/HE sections) stands as
written.

Site list pagination fetches one extra raw row so it can decide whether to return page.next. State and severity filters run after active-event rollups, so a filtered page could remove the sentinel row before cursor calculation and incorrectly report page.next as null.

Track the raw fetched window separately from the filtered response. When the filtered response consumes fewer than the requested limit but the raw query still fetched a sentinel row, advance the cursor past that raw row so clients can keep walking the filtered result set.
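
One way to keep the cursor decision on the raw window is sketched below. The `site` and `page` types and the `listPage` helper are hypothetical, and the sketch assumes the cursor is the ID of the last raw row inside the requested window; the real handler may pick a different row to advance past.

```go
package main

import "fmt"

// site is a hypothetical, trimmed-down list row.
type site struct {
	ID    int64
	State string
}

// page is a hypothetical response shape; Next is nil on the last page.
type page struct {
	Sites []site
	Next  *int64
}

// listPage expects raw to hold up to limit+1 rows ordered by ID. The extra
// row is the sentinel that proves more data exists. The sentinel check runs
// on the raw rows, before filtering, so a filter cannot hide it.
func listPage(raw []site, limit int, keep func(site) bool) page {
	hasMore := limit > 0 && len(raw) > limit
	window := raw
	if hasMore {
		window = raw[:limit]
	}
	var out []site
	for _, s := range window {
		if keep(s) {
			out = append(out, s)
		}
	}
	p := page{Sites: out}
	if hasMore {
		// Advance past the last raw row in the window, even if the filter
		// dropped it, so the client can keep walking the result set.
		last := window[len(window)-1].ID
		p.Next = &last
	}
	return p
}

func main() {
	raw := []site{{1, "up"}, {2, "down"}, {3, "up"}} // limit 2 plus sentinel
	p := listPage(raw, 2, func(s site) bool { return s.State == "down" })
	fmt.Println(len(p.Sites), p.Next != nil)
}
```

With this shape a filtered page can legitimately return fewer rows than the limit, or none, while still carrying a non-null cursor.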
Add focused unit coverage for the packages that still had deterministic gaps: eventstore mutations, DB query helpers, API delivery handlers, webhook and alerting repositories, worker result paths, metrics emission, verifier config/check wrappers, and small CLI/config helpers.

The tests stay on sqlmock, httptest, and in-memory checks so they exercise production behavior without requiring MySQL or external services. The local coverage profile moved total statement coverage from 51.9% to 68.6%, with the largest gains in eventstore, DB, metrics, audit, alerting, webhooks, and apikeys.
Add the remaining deterministic coverage around the recent API and worker test pass. Webhook and alert-contact list handlers now cover success and database-error responses, and the API server helpers cover shutdown-before-listen plus the http.ErrServerClosed sentinel.

Also cover idempotency cache cleanup, streaming Flush passthrough, the transaction-aware site-status projection helper, and the worker start/stop path for webhook and alert delivery. These stay in unit-test territory and avoid taking on the process entrypoints or long-running orchestration loops that need a broader integration harness.
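
For context on the shutdown-before-listen and `http.ErrServerClosed` cases mentioned above, here is a small sketch of the standard-library behavior involved. The helper names `cleanServeError` and `shutdownBeforeListen` are hypothetical and do not come from the Jetmon API server.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// cleanServeError treats http.ErrServerClosed as a normal shutdown and
// passes every other error through unchanged.
func cleanServeError(err error) error {
	if errors.Is(err, http.ErrServerClosed) {
		return nil
	}
	return err
}

// shutdownBeforeListen calls Shutdown on a server that never started and
// then returns whatever ListenAndServe reports afterwards.
func shutdownBeforeListen() error {
	srv := &http.Server{Addr: "127.0.0.1:0"}
	if err := srv.Shutdown(context.Background()); err != nil {
		return err
	}
	// After Shutdown, net/http refuses to serve and returns the sentinel.
	return srv.ListenAndServe()
}

func main() {
	raw := shutdownBeforeListen()
	fmt.Println(errors.Is(raw, http.ErrServerClosed), cleanServeError(raw) == nil)
}
```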
Add an explicit P0/P1/P2 queue so the remaining v2 hardening, post-v2 delivery work, and v3 probe-agent decisions are easy to sequence.

Also update the public API roadmap text to distinguish the already-implemented internal API from the customer-facing contract work that still needs tenant scoping, redaction, OpenAPI generation, and public-contract tests.
@chrisbliss18 chrisbliss18 changed the base branch from stack-04-shadow-projection-hardening to v2 April 28, 2026 14:54
@chrisbliss18 chrisbliss18 merged commit d95c8f8 into v2 Apr 28, 2026
@chrisbliss18 chrisbliss18 deleted the stack-05-docs-tooling-coverage branch April 28, 2026 15:08