Add production rollout hardening #80
Merged
added 8 commits
April 27, 2026 21:55
Add migration-only pinned bucket config so a v2 host can replace a v1 host on the same static range without participating in jetmon_hosts dynamic ownership. Document the host-by-host v1-to-v2 rollout, rollback path, and transition from pinned ranges to normal v2 dynamic ownership.
Add jetmon2 rollout pinned-check so operators can validate the host-local invariants before replacing a v1 static-bucket host: pinned range configured, legacy projection writes enabled, API ownership called out, no jetmon_hosts ownership row, active site count reported, and projection drift at zero. Document the command in the pinned rollout runbook and admin docs, and cover the new DB/query helpers plus command runner behavior with focused tests.
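The invariants listed in this commit message can be pictured as a small preflight function. The following is only an illustrative sketch in Python, not the jetmon2 implementation: the `HostState` type, its field names, and the failure strings are hypothetical, and the API ownership item is left out because the commit describes it as something called out to the operator rather than a pass/fail check.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HostState:
    """Hypothetical snapshot of host-local facts gathered before cutover."""
    pinned_range: Optional[Tuple[int, int]]  # (min_bucket, max_bucket) or None
    legacy_projection_writes: bool
    has_jetmon_hosts_row: bool
    active_sites: int
    drift_rows: int


def pinned_check(state: HostState) -> List[str]:
    """Return the list of failed invariants; an empty list means ready."""
    failures = []
    if state.pinned_range is None:
        failures.append("pinned range not configured")
    if not state.legacy_projection_writes:
        failures.append("legacy projection writes disabled")
    if state.has_jetmon_hosts_row:
        failures.append("host already has a jetmon_hosts ownership row")
    if state.drift_rows != 0:
        failures.append(f"projection drift is {state.drift_rows}, expected 0")
    return failures
```

Returning every failure instead of stopping at the first one lets an operator fix all problems in one pass before replacing the v1 host.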
Add jetmon2 rollout dynamic-check for the post-pinned cutover so operators can verify that pinned config has been removed, legacy projection remains enabled, and jetmon_hosts covers the full bucket range with fresh active rows and no gaps or overlaps. Document the coordinated dynamic-ownership transition in the rollout runbook and admin docs, with focused tests for the success path and failure cases.
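The "no gaps or overlaps" condition is a standard interval coverage check. A minimal Python sketch follows, assuming inclusive bucket ranges; `BucketRange`, `check_coverage`, and the message wording are hypothetical names for illustration, and the freshness check on `jetmon_hosts` rows is not modeled here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BucketRange:
    """Hypothetical ownership row: one host owns buckets lo..hi inclusive."""
    host: str
    lo: int
    hi: int


def check_coverage(ranges: Iterable[BucketRange],
                   total_min: int, total_max: int) -> List[str]:
    """Return coverage problems; an empty list means full, exact coverage."""
    problems = []
    expected = total_min  # next bucket that still needs an owner
    for r in sorted(ranges, key=lambda r: (r.lo, r.hi)):
        if r.lo > r.hi:
            problems.append(f"invalid range for {r.host}: {r.lo}-{r.hi}")
            continue
        if r.lo > expected:
            problems.append(f"gap: buckets {expected}-{r.lo - 1} unowned")
        elif r.lo < expected:
            problems.append(
                f"overlap: {r.host} starts at {r.lo}, expected {expected}")
        expected = max(expected, r.hi + 1)
    if expected <= total_max:
        problems.append(f"gap: buckets {expected}-{total_max} unowned")
    return problems
```

Sorting by range start and tracking the next expected bucket finds both gaps and overlaps in a single pass.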
Add low-cardinality StatsD series that split local failure, Seems Down, confirmed-down, false-alarm, and probe-cleared outcomes by failure class. Add per-Veriflier-host RPC and vote series so v2 production can show regional latency and agreement patterns for the later v3 architecture decision. Document the new evidence metrics in the operator and architecture docs, and cover the metric naming and emission paths with orchestrator tests.
Add jetmon2 rollout projection-drift so rollout failures can show the exact active site rows where the legacy site_status projection disagrees with the authoritative HTTP event state. The report supports pinned defaults, full dynamic-range defaults, explicit bucket ranges, and row limits. Document the drift investigation path in the migration runbook and admin docs, and cover the DB query plus CLI range/report behavior with focused tests.
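As a rough illustration of what a drift report computes, the sketch below compares a projected status with an authoritative status for active sites in a bucket range and applies a row limit. It is written in Python over in-memory dictionaries as an assumption for readability; the real command queries the database, and the `DriftRow` type and the `site_id`, `bucket`, `active`, `projected`, and `authoritative` keys are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional


class DriftRow(NamedTuple):
    site_id: int
    bucket: int
    projected: str       # value held in the legacy projection
    authoritative: str   # value derived from the event state


def find_drift(sites: Iterable[dict], lo: int, hi: int,
               limit: Optional[int] = None) -> List[DriftRow]:
    """Return active sites in buckets lo..hi whose two states disagree."""
    rows = []
    for s in sorted(sites, key=lambda s: s["site_id"]):
        if not s["active"] or not (lo <= s["bucket"] <= hi):
            continue
        if s["projected"] != s["authoritative"]:
            rows.append(DriftRow(s["site_id"], s["bucket"],
                                 s["projected"], s["authoritative"]))
            if limit is not None and len(rows) >= limit:
                break
    return rows
```

An empty result corresponds to the "projection drift at zero" condition that the rollout checks expect.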
Emit StatsD counters for legacy WPCOM notification attempts, deliveries, retries, errors, and final failures. The counters include status-specific splits for down, running, and confirmed_down so rollout dashboards can catch delivery regressions without changing notification behavior. Document the parity metrics in the operator and v3 architecture docs, and cover the emission paths with orchestrator tests.
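For readers unfamiliar with StatsD, a counter is sent as a plain-text line of the form `name:value|c`. The Python sketch below shows how an aggregate counter and a status-specific split could be formatted together. The metric prefix and the outcome names are hypothetical; only the three statuses (`down`, `running`, `confirmed_down`) come from the commit message.

```python
from typing import List

STATUSES = ("down", "running", "confirmed_down")
# Hypothetical outcome names; the real series names may differ.
OUTCOMES = ("attempt", "delivered", "retry", "error", "failed")


def counter_lines(prefix: str, outcome: str, status: str,
                  count: int = 1) -> List[str]:
    """Format an aggregate counter and its per-status split as StatsD lines."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    return [
        f"{prefix}.{outcome}:{count}|c",           # all statuses combined
        f"{prefix}.{outcome}.{status}:{count}|c",  # status-specific split
    ]
```

Restricting both dimensions to fixed tuples keeps the number of distinct series bounded, which is the usual meaning of low cardinality in this context.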
Make validate-config print the rollout preflight command that matches the configured bucket ownership mode, followed by the projection drift report command operators should run during rollout. Also display Veriflier addresses through the canonical transport-port helper so deprecated grpc_port aliases resolve the same way they do at runtime, and document the expanded validate-config output.
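The value of routing display through the same helper as runtime is that a deprecated key cannot resolve differently in the two places. A small Python sketch of that idea follows. It is not the project's helper: the canonical key name `port`, the precedence of `port` over `grpc_port`, and the caller-supplied default are all assumptions made for illustration.

```python
def resolve_port(entry: dict, default: int) -> int:
    """Pick the port for one Veriflier entry.

    Assumption: the canonical key is "port" and wins over the deprecated
    "grpc_port" alias; the caller supplies the fallback default.
    """
    for key in ("port", "grpc_port"):
        value = entry.get(key)
        if value:
            return int(value)
    return default


def display_address(entry: dict, default: int) -> str:
    """Format host:port using the same resolver that runtime would use."""
    return f"{entry['host']}:{resolve_port(entry, default)}"
```

Because both the printed address and the dialed address would come from `resolve_port`, a config that still uses the alias shows the address that would actually be contacted.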
Add live operator-dashboard health entries for MySQL, configured Verifliers, WPCOM circuit state, StatsD initialization, and log/stats directory writes so cutover operators can separate site downtime from monitor impairment. Reorder the P0 roadmap around rollout health and rehearsal, then update the v1-to-v2 runbooks to validate config before stopping v1 and to check dashboard guardrails during each host replacement.
This was referenced Apr 28, 2026
Stacked PR 8 of 9.
Base: stack-07-gateway-tenant
Head: stack-08-rollout-hardening
Previous PR: #79
Summary:
validate-config and rollout health in the dashboard.
Review notes:
This PR focuses on migration confidence and observability. It should be reviewed with the v1 to v2 rollout plan in mind.