Add per-table successful write Prometheus metrics#164

Open
millerjp wants to merge 5 commits into datastax:main from axonops:upstream-pr-per-table-metrics
Conversation

millerjp commented Apr 4, 2026

Summary

Adds a new Prometheus counter proxy_write_success_total{cluster, keyspace, table} that tracks successful writes per cluster, per keyspace, and per table. The counter is incremented independently when each cluster responds — origin increments when origin responds, target increments when target responds.

Depends on: #163 (target consistency level override) — this PR is based on that branch.

Use case

During migration, this metric provides visibility into:

  • Which tables are being written to on each cluster
  • Whether both clusters are keeping up (counts should be equal)
  • During a target outage, origin counters keep ticking while target flatlines — making it easy to identify which tables have diverged and need repair

Example Prometheus output

# HELP zdm_proxy_write_success_total Running total of successful writes per cluster, keyspace and table
# TYPE zdm_proxy_write_success_total counter
zdm_proxy_write_success_total{cluster="origin",keyspace="my_ks",table="users"} 148392
zdm_proxy_write_success_total{cluster="origin",keyspace="my_ks",table="events"} 52841
zdm_proxy_write_success_total{cluster="origin",keyspace="my_ks",table="audit_log"} 7203
zdm_proxy_write_success_total{cluster="target",keyspace="my_ks",table="users"} 148392
zdm_proxy_write_success_total{cluster="target",keyspace="my_ks",table="events"} 52841
zdm_proxy_write_success_total{cluster="target",keyspace="my_ks",table="audit_log"} 7203

During a target outage, the target counters stop incrementing while origin continues:

zdm_proxy_write_success_total{cluster="origin",keyspace="my_ks",table="users"} 162504
zdm_proxy_write_success_total{cluster="target",keyspace="my_ks",table="users"} 148392

The difference (14,112) tells you exactly how many writes to my_ks.users the target missed.
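Assuming the metric is scraped with the zdm_ prefix shown above, that divergence can be computed directly in PromQL rather than by hand (the query below is a sketch, not part of this PR):

```promql
# Per-table count of writes the target has missed relative to origin
sum by (keyspace, table) (zdm_proxy_write_success_total{cluster="origin"})
  -
sum by (keyspace, table) (zdm_proxy_write_success_total{cluster="target"})
```

A value that grows during a target outage and then holds steady after recovery gives the repair scope per table.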

Implementation

  • WriteTarget type carries keyspace/table through the request lifecycle
  • GetWriteTargets() on RequestInfo interface, populated at parse time
  • For inline queries: extracted from ANTLR parse result
  • For prepared statements: stored on PrepareRequestInfo during PREPARE, accessed via PreparedData cache at EXECUTE time
  • For batch statements: per-child table extraction for both inline and prepared children, with deduplication
  • Counter cache on MetricHandler with double-checked locking (the same pattern already used for per-node metrics)
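The double-checked locking pattern mentioned above can be sketched as follows. This is a minimal standalone illustration, not the proxy's actual code: the `counter` type stands in for a Prometheus counter, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counter stands in for a prometheus.Counter.
type counter struct{ n atomic.Int64 }

func (c *counter) Inc() { c.n.Add(1) }

// key is the label set: one counter per (cluster, keyspace, table).
type key struct{ cluster, keyspace, table string }

// metricCache caches counters using double-checked locking: a cheap
// read-locked fast path, then a re-check under the write lock before
// creating, so concurrent callers never create duplicate counters.
type metricCache struct {
	mu       sync.RWMutex
	counters map[key]*counter
}

func newMetricCache() *metricCache {
	return &metricCache{counters: make(map[key]*counter)}
}

func (m *metricCache) getOrCreate(cluster, keyspace, table string) *counter {
	k := key{cluster, keyspace, table}

	// Fast path: most calls hit an existing counter under RLock.
	m.mu.RLock()
	c, ok := m.counters[k]
	m.mu.RUnlock()
	if ok {
		return c
	}

	// Slow path: take the write lock and re-check, since another
	// goroutine may have created the counter in the meantime.
	m.mu.Lock()
	defer m.mu.Unlock()
	if c, ok := m.counters[k]; ok {
		return c
	}
	c = &counter{}
	m.counters[k] = c
	return c
}

func main() {
	cache := newMetricCache()
	a := cache.getOrCreate("origin", "my_ks", "users")
	b := cache.getOrCreate("origin", "my_ks", "users")
	fmt.Println(a == b) // true: same counter instance for the same labels
	a.Inc()
	fmt.Println(b.n.Load()) // 1: increments are visible through either handle
}
```

The fast path keeps the steady-state cost to a single RLock per write, which matters since this runs on every successful response.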

Test plan

  • Unit tests for metric counter creation, caching, and concurrent access
  • CCM integration tests covering the full permutation matrix:
    • Inline: INSERT, UPDATE, DELETE, counter UPDATE
    • Prepared: INSERT, UPDATE, DELETE, counter UPDATE
    • Batch inline: multi-table, update+delete
    • Batch prepared: multi-table, update+delete
    • Batch mixed: inline + prepared children
    • Counter batch: inline and prepared
    • Data verification on both clusters
  • All existing unit tests updated and passing

Resolves #165

millerjp added 5 commits on April 3, 2026 at 12:42

Introduce ZDM_TARGET_CONSISTENCY_LEVEL config option that overrides the
consistency level for all requests forwarded to the target cluster. The
origin cluster always receives the original client-requested consistency
level, preserving the consistency contract on the source of truth.

This is useful during migration when the target is being populated via
dual writes. A weaker CL such as LOCAL_ONE on the target reduces the risk
of write failures caused by target-side instability (node outages,
streaming, compaction pressure). Target data can be repaired after
migration, so temporary under-replication is acceptable.

The feature is strictly opt-in: when the config is absent, empty, or
unset, the proxy forwards requests with the original client CL (existing
behaviour preserved). Invalid values are rejected at startup. A WARN log
is emitted when the override is active.

Verified end-to-end against Cassandra 5.0.6 via system_traces: inline
Query, prepared Execute, and Batch writes all show the overridden CL on
the target while origin retains the client-requested CL.

Query system_auth.roles on both origin and target control connections
at startup to check if the configured user is a superuser. If so, log
a WARN explaining that superuser authentication in Cassandra requires
QUORUM consistency internally, which increases the risk of auth failures
during node instability.

The check is best-effort: if auth is not enabled, or the query fails
for any reason (e.g. permission denied, Astra-specific behavior, table
not present), it is silently skipped. This ensures no impact on
platforms where system_auth.roles may not be accessible.

Verified against Cassandra 5.0.6 with PasswordAuthenticator enabled:
- superuser 'cassandra' triggers WARN on both ORIGIN and TARGET
- non-superuser 'app_user' produces no warning (query fails silently
  because non-superusers cannot read system_auth.roles)
- auth-disabled clusters produce no warning (check skipped)

Verifies via system_traces.sessions on real Cassandra clusters that:
- Inline INSERT at QUORUM: origin trace shows QUORUM (unchanged),
  target trace shows LOCAL_ONE (overridden)
- Prepared INSERT at QUORUM: same verification
- Batch INSERT at QUORUM: same verification

Skipped on Cassandra < 3.0 (system_traces parameters map format
differs in older versions).

Introduce proxy_write_success_total counter with labels {cluster,
keyspace, table} that tracks successful writes per cluster, per
keyspace, and per table. The counter is incremented independently when
each cluster's response arrives — origin increments when origin
responds, target increments when target responds.

This provides visibility into which tables are being written to and
whether both clusters are keeping up during migration. During a target
outage, origin counters continue to increment while target counters
flatline, making it easy to identify the scope of any data divergence.

Implementation:
- WriteTarget type carries keyspace/table through the request lifecycle
- GetWriteTargets() on RequestInfo interface, populated at parse time
- For inline queries: extracted from ANTLR parse result
- For prepared statements: stored on PrepareRequestInfo during PREPARE,
  accessed via PreparedData cache at EXECUTE time
- For batch statements: per-child table extraction for both inline
  and prepared children, with deduplication within the batch
- Counter cache on MetricHandler with double-checked locking

Testing:
- Unit tests for metric creation, caching, and concurrent access
- CCM integration tests covering inline INSERT/UPDATE/DELETE, prepared
  INSERT/UPDATE/DELETE, counter UPDATE (inline and prepared), batch
  with inline/prepared/mixed children, and counter batches

Development

Successfully merging this pull request may close these issues.

Improve resilience and observability during migration with target cluster instability
