Add per-table successful write Prometheus metrics #164
Open
millerjp wants to merge 5 commits into datastax:main
Introduce ZDM_TARGET_CONSISTENCY_LEVEL config option that overrides the consistency level for all requests forwarded to the target cluster. The origin cluster always receives the original client-requested consistency level, preserving the consistency contract on the source of truth.
This is useful during migration when the target is being populated via dual writes. A weaker CL such as LOCAL_ONE on the target reduces the risk of write failures caused by target-side instability (node outages, streaming, compaction pressure). Target data can be repaired after migration, so temporary under-replication is acceptable.
The feature is strictly opt-in: when the config is absent, empty, or unset, the proxy forwards requests with the original client CL (existing behaviour preserved). Invalid values are rejected at startup. A WARN log is emitted when the override is active.
Verified end-to-end against Cassandra 5.0.6 via system_traces: inline Query, prepared Execute, and Batch writes all show the overridden CL on the target while origin retains the client-requested CL.
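The override behaviour described above can be sketched roughly as follows. This is only an illustration, not the proxy's actual code: the Consistency type, ParseTargetCL, EffectiveCL, the accepted set of level names, and the case-insensitive parsing are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Consistency is a hypothetical stand-in for the proxy's consistency level type.
type Consistency string

// validLevels is an assumed set of accepted names; the real option may accept
// a different set.
var validLevels = map[Consistency]bool{
	"ANY": true, "ONE": true, "TWO": true, "THREE": true,
	"QUORUM": true, "ALL": true, "LOCAL_QUORUM": true,
	"EACH_QUORUM": true, "LOCAL_ONE": true,
}

// ParseTargetCL returns nil when the option is unset or empty (opt-in
// behaviour) and an error for unrecognized values, so startup can fail fast.
// Upper-casing the input is an assumption of this sketch.
func ParseTargetCL(raw string) (*Consistency, error) {
	trimmed := strings.ToUpper(strings.TrimSpace(raw))
	if trimmed == "" {
		return nil, nil
	}
	cl := Consistency(trimmed)
	if !validLevels[cl] {
		return nil, fmt.Errorf("invalid ZDM_TARGET_CONSISTENCY_LEVEL %q", raw)
	}
	return &cl, nil
}

// EffectiveCL picks the consistency level to forward to a cluster: the
// override applies only to the target, origin always keeps the client CL.
func EffectiveCL(clientCL Consistency, override *Consistency, isTarget bool) Consistency {
	if isTarget && override != nil {
		return *override
	}
	return clientCL
}
```

With an override of LOCAL_ONE and a client request at QUORUM, EffectiveCL yields LOCAL_ONE for the target and QUORUM for origin; with no override both clusters see QUORUM.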
Query system_auth.roles on both origin and target control connections at startup to check if the configured user is a superuser. If so, log a WARN explaining that superuser authentication in Cassandra requires QUORUM consistency internally, which increases the risk of auth failures during node instability.
The check is best-effort: if auth is not enabled, or the query fails for any reason (e.g. permission denied, Astra-specific behavior, table not present), it is silently skipped. This ensures no impact on platforms where system_auth.roles may not be accessible.
Verified against Cassandra 5.0.6 with PasswordAuthenticator enabled:
- superuser 'cassandra' triggers WARN on both ORIGIN and TARGET
- non-superuser 'app_user' produces no warning (query fails silently because non-superusers cannot read system_auth.roles)
- auth-disabled clusters produce no warning (check skipped)
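A minimal sketch of the best-effort decision, assuming a hypothetical RoleQuerier abstraction over the control connection. The interface, the staticRoles fake, and ShouldWarnSuperuser are illustrative names and not part of the proxy:

```go
package main

import "errors"

// RoleQuerier is a hypothetical abstraction over a control connection.
// A real implementation would run something like:
//   SELECT is_superuser FROM system_auth.roles WHERE role = ?
type RoleQuerier interface {
	IsSuperuser(role string) (bool, error)
}

// ShouldWarnSuperuser reports whether a WARN should be logged. Any failure
// (auth disabled, query error, missing table) results in no warning.
func ShouldWarnSuperuser(q RoleQuerier, authEnabled bool, user string) bool {
	if !authEnabled || q == nil {
		return false // check skipped when auth is not enabled
	}
	super, err := q.IsSuperuser(user)
	if err != nil {
		return false // best-effort: swallow the error silently
	}
	return super
}

// staticRoles is a fake querier for illustration; roles missing from the
// map simulate a query that fails (e.g. permission denied).
type staticRoles struct {
	supers map[string]bool
}

func (s staticRoles) IsSuperuser(role string) (bool, error) {
	super, ok := s.supers[role]
	if !ok {
		return false, errors.New("role not readable")
	}
	return super, nil
}
```

In this sketch the caller would run the check once per cluster (ORIGIN and TARGET) and log the warning only when the function returns true.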
Verifies via system_traces.sessions on real Cassandra clusters that:
- Inline INSERT at QUORUM: origin trace shows QUORUM (unchanged), target trace shows LOCAL_ONE (overridden)
- Prepared INSERT at QUORUM: same verification
- Batch INSERT at QUORUM: same verification
Skipped on Cassandra < 3.0 (system_traces parameters map format differs in older versions).
Introduce proxy_write_success_total counter with labels {cluster,
keyspace, table} that tracks successful writes per cluster, per
keyspace, and per table. The counter is incremented independently when
each cluster's response arrives — origin increments when origin
responds, target increments when target responds.
This provides visibility into which tables are being written to and
whether both clusters are keeping up during migration. During a target
outage, origin counters continue to increment while target counters
flatline, making it easy to identify the scope of any data divergence.
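As a rough illustration of how the divergence could be read off the two counters, the sketch below subtracts target from origin per table. WriteTarget here is a simplified stand-in, MissedWrites is a hypothetical helper, and the maps stand for counter values scraped from each cluster's series:

```go
package main

// WriteTarget identifies the keyspace and table a write touches
// (simplified stand-in for the type introduced in this PR).
type WriteTarget struct {
	Keyspace string
	Table    string
}

// MissedWrites returns, per table, how many more successful writes origin
// recorded than target. Tables where target has caught up are omitted.
func MissedWrites(origin, target map[WriteTarget]uint64) map[WriteTarget]uint64 {
	missed := make(map[WriteTarget]uint64)
	for wt, o := range origin {
		// A table absent from the target map reads as zero writes.
		if t := target[wt]; o > t {
			missed[wt] = o - t
		}
	}
	return missed
}
```

Since each cluster's counter is incremented when that cluster responds, a small transient gap can appear while responses are in flight; a gap that keeps growing is the signal of interest.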
Implementation:
- WriteTarget type carries keyspace/table through the request lifecycle
- GetWriteTargets() on RequestInfo interface, populated at parse time
- For inline queries: extracted from ANTLR parse result
- For prepared statements: stored on PrepareRequestInfo during PREPARE,
  accessed via PreparedData cache at EXECUTE time
- For batch statements: per-child table extraction for both inline
  and prepared children, with deduplication within the batch
- Counter cache on MetricHandler with double-checked locking
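The deduplication and counter-cache items above could look roughly like the sketch below. It uses only the standard library rather than the Prometheus client, and CounterCache, DedupTargets, and RecordSuccess are hypothetical names. That a batch increments each table's counter once is an assumption drawn from "deduplication within the batch".

```go
package main

import (
	"sync"
	"sync/atomic"
)

// WriteTarget identifies the keyspace and table a write touches.
type WriteTarget struct {
	Keyspace string
	Table    string
}

// DedupTargets removes repeated tables, preserving first-seen order.
func DedupTargets(targets []WriteTarget) []WriteTarget {
	seen := make(map[WriteTarget]struct{}, len(targets))
	out := make([]WriteTarget, 0, len(targets))
	for _, t := range targets {
		if _, dup := seen[t]; dup {
			continue
		}
		seen[t] = struct{}{}
		out = append(out, t)
	}
	return out
}

type counterKey struct {
	Cluster string
	Target  WriteTarget
}

// CounterCache lazily creates one counter per {cluster, keyspace, table}.
type CounterCache struct {
	mu       sync.RWMutex
	counters map[counterKey]*atomic.Uint64
}

func NewCounterCache() *CounterCache {
	return &CounterCache{counters: make(map[counterKey]*atomic.Uint64)}
}

// get uses double-checked locking: a cheap read-locked lookup first, then a
// second lookup under the write lock before creating the counter.
func (c *CounterCache) get(key counterKey) *atomic.Uint64 {
	c.mu.RLock()
	ctr, ok := c.counters[key]
	c.mu.RUnlock()
	if ok {
		return ctr
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if ctr, ok = c.counters[key]; ok {
		return ctr // another goroutine created it in the meantime
	}
	ctr = new(atomic.Uint64)
	c.counters[key] = ctr
	return ctr
}

// RecordSuccess increments the counter once per distinct table in a request.
func (c *CounterCache) RecordSuccess(cluster string, targets []WriteTarget) {
	for _, t := range DedupTargets(targets) {
		c.get(counterKey{Cluster: cluster, Target: t}).Add(1)
	}
}

// Value returns the current count for a cluster and table.
func (c *CounterCache) Value(cluster string, t WriteTarget) uint64 {
	return c.get(counterKey{Cluster: cluster, Target: t}).Load()
}
```

The read lock keeps the common path (counter already exists) cheap under concurrent writes, while the re-check under the write lock prevents two goroutines from registering the same counter twice.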
Testing:
- Unit tests for metric creation, caching, and concurrent access
- CCM integration tests covering inline INSERT/UPDATE/DELETE, prepared
INSERT/UPDATE/DELETE, counter UPDATE (inline and prepared), batch
with inline/prepared/mixed children, and counter batches
Summary
Adds a new Prometheus counter proxy_write_success_total{cluster, keyspace, table} that tracks successful writes per cluster, per keyspace, and per table. The counter is incremented independently when each cluster responds: origin increments when origin responds, target increments when target responds.
Depends on: #163 (target consistency level override). This PR is based on that branch.
Use case
During migration, this metric provides visibility into which tables are being written to and whether both clusters are keeping up.
Example Prometheus output
During a target outage, the target counters stop incrementing while origin continues:
The difference (14,112) tells you exactly how many writes to my_ks.users the target missed.
Implementation
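- WriteTarget type carries keyspace/table through the request lifecycle
- GetWriteTargets() on RequestInfo interface, populated at parse time
- For inline queries: extracted from ANTLR parse result
- For prepared statements: stored on PrepareRequestInfo during PREPARE, accessed via PreparedData cache at EXECUTE time
- For batch statements: per-child table extraction for both inline and prepared children, with deduplication within the batch
- Counter cache on MetricHandler with double-checked locking (same pattern as per-node metrics)
Test plan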
Resolves #165