Add production rollout hardening #80

Merged: chrisbliss18 merged 8 commits into v2 from stack-08-rollout-hardening on Apr 28, 2026

@chrisbliss18
Contributor

Stacked PR 8 of 9.

Base: stack-07-gateway-tenant
Head: stack-08-rollout-hardening
Previous PR: #79

Summary:

  • Adds pinned bucket rollout mode and preflight command.
  • Adds dynamic rollout ownership checks.
  • Splits detection metrics by outcome source.
  • Lists legacy projection drift details and tracks WPCOM notification parity metrics.
  • Surfaces rollout checks in validate-config and rollout health in the dashboard.

Review notes:
This PR focuses on migration confidence and observability. It should be reviewed with the v1 to v2 rollout plan in mind.

Chris Jean added 8 commits April 27, 2026 21:55
Add migration-only pinned bucket config so a v2 host can replace a v1 host on the same static range without participating in jetmon_hosts dynamic ownership.

Document the host-by-host v1-to-v2 rollout, rollback path, and transition from pinned ranges to normal v2 dynamic ownership.
Add jetmon2 rollout pinned-check so operators can validate the host-local invariants before replacing a v1 static-bucket host: pinned range configured, legacy projection writes enabled, API ownership called out, no jetmon_hosts ownership row, active site count reported, and projection drift at zero.

Document the command in the pinned rollout runbook and admin docs, and cover the new DB/query helpers plus command runner behavior with focused tests.
Add jetmon2 rollout dynamic-check for the post-pinned cutover so operators can verify that pinned config has been removed, legacy projection remains enabled, and jetmon_hosts covers the full bucket range with fresh active rows and no gaps or overlaps.

Document the coordinated dynamic-ownership transition in the rollout runbook and admin docs, with focused tests for the success path and failure cases.
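
The "no gaps or overlaps" coverage rule in the dynamic-check description can be illustrated with a small sketch. This is not the jetmon2 implementation: the function name, the tuple representation of ownership rows, and the bucket bounds are all hypothetical, and it assumes stale or inactive rows have already been filtered out and that every range lies inside the total bucket range.

```python
from typing import List, Tuple


def coverage_problems(
    ranges: List[Tuple[int, int]], total_min: int, total_max: int
) -> List[str]:
    """Return human-readable problems; an empty list means full, exact coverage.

    `ranges` holds inclusive (bucket_min, bucket_max) pairs, one per fresh
    active ownership row (a hypothetical shape, not the real table schema).
    """
    problems: List[str] = []
    expected = total_min  # lowest bucket that is not yet covered
    for lo, hi in sorted(ranges):
        if lo > hi:
            problems.append(f"invalid range {lo}-{hi}")
            continue
        if lo > expected:
            # Buckets between the covered prefix and this range have no owner.
            problems.append(f"gap {expected}-{lo - 1}")
        elif lo < expected:
            # This range starts inside buckets that are already owned.
            problems.append(f"overlap {lo}-{min(hi, expected - 1)}")
        expected = max(expected, hi + 1)
    if expected <= total_max:
        problems.append(f"gap {expected}-{total_max}")
    return problems
```

Sorting first means a single pass is enough: each range only needs to be compared against the end of the covered prefix, not against every other range.
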
Add low-cardinality StatsD series that split local failure, Seems Down, confirmed-down, false-alarm, and probe-cleared outcomes by failure class. Add per-Veriflier-host RPC and vote series so v2 production can show regional latency and agreement patterns for the later v3 architecture decision.

Document the new evidence metrics in the operator and architecture docs, and cover the metric naming and emission paths with orchestrator tests.
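
As a rough illustration of what "low-cardinality" means for these series, the sketch below builds metric names from two closed sets of values. The `detection.` prefix, the failure class names, and both helper functions are assumptions made for the example and are not taken from this PR; only the five outcome names come from the commit message, and the `name:value|c` line is the standard StatsD counter format.

```python
# Outcome names follow the commit message; the failure classes are hypothetical.
OUTCOMES = frozenset(
    {"local_failure", "seems_down", "confirmed_down", "false_alarm", "probe_cleared"}
)
FAILURE_CLASSES = frozenset({"timeout", "dns", "tls", "connect", "http_status"})


def detection_metric(outcome: str, failure_class: str) -> str:
    """Build a bounded metric name: at most len(OUTCOMES) * (len(FAILURE_CLASSES) + 1) series."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    # Unknown classes collapse into one bucket so a new error string
    # cannot create an unbounded number of series.
    cls = failure_class if failure_class in FAILURE_CLASSES else "other"
    return f"detection.{outcome}.{cls}"


def statsd_counter(name: str, value: int = 1) -> str:
    """Format one StatsD counter datagram payload."""
    return f"{name}:{value}|c"
```

Keeping free-form values such as hostnames of monitored sites or raw error messages out of the metric name is what keeps the series count bounded.
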
Add jetmon2 rollout projection-drift so rollout failures can show the exact active site rows where the legacy site_status projection disagrees with the authoritative HTTP event state. The report supports pinned defaults, full dynamic-range defaults, explicit bucket ranges, and row limits.

Document the drift investigation path in the migration runbook and admin docs, and cover the DB query plus CLI range/report behavior with focused tests.
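
The drift report described above amounts to comparing two views of the same sites. The following sketch shows that comparison in memory; the real command queries the database, and the `SiteRow` type, the field names, the status strings, and the choice to skip sites with no authoritative state are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class SiteRow:
    site_id: int
    bucket: int
    projected_status: str  # value held in the legacy site_status projection


def find_drift(
    rows: Iterable[SiteRow],
    authoritative: Dict[int, str],
    bucket_min: int,
    bucket_max: int,
    limit: Optional[int] = None,
) -> List[Tuple[int, str, str]]:
    """Return (site_id, projected, authoritative) for rows that disagree."""
    drift: List[Tuple[int, str, str]] = []
    for row in sorted(rows, key=lambda r: r.site_id):
        if limit is not None and len(drift) >= limit:
            break  # honor the row limit before doing more work
        if not bucket_min <= row.bucket <= bucket_max:
            continue  # outside the bucket range being inspected
        expected = authoritative.get(row.site_id)
        if expected is not None and expected != row.projected_status:
            drift.append((row.site_id, row.projected_status, expected))
    return drift
```

An empty result for the host's bucket range corresponds to the "projection drift at zero" condition that the pinned preflight expects.
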
Emit StatsD counters for legacy WPCOM notification attempts, deliveries, retries, errors, and final failures. The counters include status-specific splits for down, running, and confirmed_down so rollout dashboards can catch delivery regressions without changing notification behavior.

Document the parity metrics in the operator and v3 architecture docs, and cover the emission paths with orchestrator tests.
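
One way to picture how attempt, delivery, retry, error, and final-failure counters relate is a retry loop that only counts and never alters the send logic. This is a sketch under stated assumptions: the `wpcom.notify.` metric names, the `send` callable, and the exact counting semantics (a retry is any attempt after the first, an error is any attempt that does not deliver) are hypothetical and may differ from what the orchestrator emits.

```python
from collections import Counter
from typing import Callable


def _bump(counters: Counter, event: str, status: str) -> None:
    """Increment the aggregate counter and its status-specific split."""
    counters[f"wpcom.notify.{event}"] += 1
    counters[f"wpcom.notify.{event}.{status}"] += 1


def notify_with_counters(
    send: Callable[[], bool], status: str, max_attempts: int, counters: Counter
) -> bool:
    """Call send() up to max_attempts times, recording what happened."""
    for attempt in range(1, max_attempts + 1):
        if attempt > 1:
            _bump(counters, "retry", status)
        _bump(counters, "attempt", status)
        if send():
            _bump(counters, "delivered", status)
            return True
        _bump(counters, "error", status)
    # Every attempt failed, so record one final failure.
    _bump(counters, "failed", status)
    return False
```

With these semantics, a dashboard can compare `delivered` against `attempt` minus `retry` per status to spot a delivery regression during a host replacement.
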
Make validate-config print the rollout preflight command that matches the configured bucket ownership mode, followed by the projection drift report command operators should run during rollout.

Also display Veriflier addresses through the canonical transport-port helper so deprecated grpc_port aliases resolve the same way they do at runtime, and document the expanded validate-config output.
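
The point of routing display through one helper is that validate-config and the runtime cannot disagree about which port a Veriflier uses. A minimal sketch of that idea follows; only the `grpc_port` alias is named in this PR, while the canonical `port` key, the `host` key, the precedence order, and both function names are assumptions.

```python
from typing import Any, Mapping, Optional


def transport_port(cfg: Mapping[str, Any]) -> Optional[int]:
    """Resolve the transport port, preferring the canonical key over the alias."""
    # "port" is a hypothetical canonical key; "grpc_port" is the deprecated alias.
    for key in ("port", "grpc_port"):
        value = cfg.get(key)
        if value:
            return int(value)
    return None


def veriflier_address(cfg: Mapping[str, Any]) -> str:
    """Format the address from the same resolution the runtime would use."""
    port = transport_port(cfg)
    host = str(cfg["host"])
    return f"{host}:{port}" if port is not None else host
```

Because both the display path and the connection path would call `transport_port`, a config that still uses the deprecated alias prints the address that is actually dialed.
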
Add live operator-dashboard health entries for MySQL, configured Verifliers, WPCOM circuit state, StatsD initialization, and log/stats directory writes so cutover operators can separate site downtime from monitor impairment.

Reorder the P0 roadmap around rollout health and rehearsal, then update the v1-to-v2 runbooks to validate config before stopping v1 and to check dashboard guardrails during each host replacement.
@chrisbliss18 chrisbliss18 changed the base branch from stack-07-gateway-tenant to v2 April 28, 2026 14:55
@chrisbliss18 chrisbliss18 merged commit 3617fdf into v2 Apr 28, 2026
@chrisbliss18 chrisbliss18 deleted the stack-08-rollout-hardening branch April 28, 2026 15:08