Add per-table successful write Prometheus metrics#164

Open
millerjp wants to merge 5 commits into datastax:main from axonops:upstream-pr-per-table-metrics
Conversation

millerjp commented Apr 4, 2026

Summary

Adds a new Prometheus counter proxy_write_success_total{cluster, keyspace, table} that tracks successful writes per cluster, per keyspace, and per table. The counter is incremented independently when each cluster responds — origin increments when origin responds, target increments when target responds.

Depends on: #163 (target consistency level override) — this PR is based on that branch.

Use case

During migration, this metric provides visibility into:

  • Which tables are being written to on each cluster
  • Whether both clusters are keeping up (counts should be equal)
  • During a target outage, origin counters keep ticking while target flatlines — making it easy to identify which tables have diverged and need repair

Example Prometheus output

# HELP zdm_proxy_write_success_total Running total of successful writes per cluster, keyspace and table
# TYPE zdm_proxy_write_success_total counter
zdm_proxy_write_success_total{cluster="origin",keyspace="my_ks",table="users"} 148392
zdm_proxy_write_success_total{cluster="origin",keyspace="my_ks",table="events"} 52841
zdm_proxy_write_success_total{cluster="origin",keyspace="my_ks",table="audit_log"} 7203
zdm_proxy_write_success_total{cluster="target",keyspace="my_ks",table="users"} 148392
zdm_proxy_write_success_total{cluster="target",keyspace="my_ks",table="events"} 52841
zdm_proxy_write_success_total{cluster="target",keyspace="my_ks",table="audit_log"} 7203

During a target outage, the target counters stop incrementing while origin continues:

zdm_proxy_write_success_total{cluster="origin",keyspace="my_ks",table="users"} 162504
zdm_proxy_write_success_total{cluster="target",keyspace="my_ks",table="users"} 148392

The difference (14,112) tells you exactly how many writes to my_ks.users the target missed.
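Assuming the metric is scraped with the zdm_ prefix shown above, that divergence can be computed directly in PromQL rather than by hand (the query below is a sketch, not part of this PR):

```promql
# Per-table count of writes the target has missed relative to origin
sum by (keyspace, table) (zdm_proxy_write_success_total{cluster="origin"})
  -
sum by (keyspace, table) (zdm_proxy_write_success_total{cluster="target"})
```

A value that grows during a target outage and then holds steady after recovery gives the repair scope per table.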

Implementation

  • WriteTarget type carries keyspace/table through the request lifecycle
  • GetWriteTargets() on RequestInfo interface, populated at parse time
  • For inline queries: extracted from ANTLR parse result
  • For prepared statements: stored on PrepareRequestInfo during PREPARE, accessed via PreparedData cache at EXECUTE time
  • For batch statements: per-child table extraction for both inline and prepared children, with deduplication
  • Counter cache on MetricHandler with double-checked locking (the same pattern already used for per-node metrics)
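The double-checked locking pattern mentioned above can be sketched as follows. This is a minimal standalone illustration, not the proxy's actual code: the `counter` type stands in for a Prometheus counter, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counter stands in for a prometheus.Counter.
type counter struct{ n atomic.Int64 }

func (c *counter) Inc() { c.n.Add(1) }

// key is the label set: one counter per (cluster, keyspace, table).
type key struct{ cluster, keyspace, table string }

// metricCache caches counters using double-checked locking: a cheap
// read-locked fast path, then a re-check under the write lock before
// creating, so concurrent callers never create duplicate counters.
type metricCache struct {
	mu       sync.RWMutex
	counters map[key]*counter
}

func newMetricCache() *metricCache {
	return &metricCache{counters: make(map[key]*counter)}
}

func (m *metricCache) getOrCreate(cluster, keyspace, table string) *counter {
	k := key{cluster, keyspace, table}

	// Fast path: most calls hit an existing counter under RLock.
	m.mu.RLock()
	c, ok := m.counters[k]
	m.mu.RUnlock()
	if ok {
		return c
	}

	// Slow path: take the write lock and re-check, since another
	// goroutine may have created the counter in the meantime.
	m.mu.Lock()
	defer m.mu.Unlock()
	if c, ok := m.counters[k]; ok {
		return c
	}
	c = &counter{}
	m.counters[k] = c
	return c
}

func main() {
	cache := newMetricCache()
	a := cache.getOrCreate("origin", "my_ks", "users")
	b := cache.getOrCreate("origin", "my_ks", "users")
	fmt.Println(a == b) // true: same counter instance for the same labels
	a.Inc()
	fmt.Println(b.n.Load()) // 1: increments are visible through either handle
}
```

The fast path keeps the steady-state cost to a single RLock per write, which matters since this runs on every successful response.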

Test plan

  • Unit tests for metric counter creation, caching, and concurrent access
  • CCM integration tests covering the full permutation matrix:
    • Inline: INSERT, UPDATE, DELETE, counter UPDATE
    • Prepared: INSERT, UPDATE, DELETE, counter UPDATE
    • Batch inline: multi-table, update+delete
    • Batch prepared: multi-table, update+delete
    • Batch mixed: inline + prepared children
    • Counter batch: inline and prepared
    • Data verification on both clusters
  • All existing unit tests updated and passing

Resolves #165

millerjp added 5 commits on April 3, 2026 at 12:42

Introduce ZDM_TARGET_CONSISTENCY_LEVEL config option that overrides the
consistency level for all requests forwarded to the target cluster. The
origin cluster always receives the original client-requested consistency
level, preserving the consistency contract on the source of truth.

This is useful during migration when the target is being populated via
dual writes. A weaker CL such as LOCAL_ONE on the target reduces the risk
of write failures caused by target-side instability (node outages,
streaming, compaction pressure). Target data can be repaired after
migration, so temporary under-replication is acceptable.

The feature is strictly opt-in: when the config is absent, empty, or
unset, the proxy forwards requests with the original client CL (existing
behaviour preserved). Invalid values are rejected at startup. A WARN log
is emitted when the override is active.

Verified end-to-end against Cassandra 5.0.6 via system_traces: inline
Query, prepared Execute, and Batch writes all show the overridden CL on
the target while origin retains the client-requested CL.

Query system_auth.roles on both origin and target control connections
at startup to check if the configured user is a superuser. If so, log
a WARN explaining that superuser authentication in Cassandra requires
QUORUM consistency internally, which increases the risk of auth failures
during node instability.

The check is best-effort: if auth is not enabled, or the query fails
for any reason (e.g. permission denied, Astra-specific behavior, table
not present), it is silently skipped. This ensures no impact on
platforms where system_auth.roles may not be accessible.

Verified against Cassandra 5.0.6 with PasswordAuthenticator enabled:
- superuser 'cassandra' triggers WARN on both ORIGIN and TARGET
- non-superuser 'app_user' produces no warning (query fails silently
  because non-superusers cannot read system_auth.roles)
- auth-disabled clusters produce no warning (check skipped)

Verifies via system_traces.sessions on real Cassandra clusters that:
- Inline INSERT at QUORUM: origin trace shows QUORUM (unchanged),
  target trace shows LOCAL_ONE (overridden)
- Prepared INSERT at QUORUM: same verification
- Batch INSERT at QUORUM: same verification

Skipped on Cassandra < 3.0 (system_traces parameters map format
differs in older versions).

Introduce proxy_write_success_total counter with labels {cluster,
keyspace, table} that tracks successful writes per cluster, per
keyspace, and per table. The counter is incremented independently when
each cluster's response arrives — origin increments when origin
responds, target increments when target responds.

This provides visibility into which tables are being written to and
whether both clusters are keeping up during migration. During a target
outage, origin counters continue to increment while target counters
flatline, making it easy to identify the scope of any data divergence.

Implementation:
- WriteTarget type carries keyspace/table through the request lifecycle
- GetWriteTargets() on RequestInfo interface, populated at parse time
- For inline queries: extracted from ANTLR parse result
- For prepared statements: stored on PrepareRequestInfo during PREPARE,
  accessed via PreparedData cache at EXECUTE time
- For batch statements: per-child table extraction for both inline
  and prepared children, with deduplication within the batch
- Counter cache on MetricHandler with double-checked locking

Testing:
- Unit tests for metric creation, caching, and concurrent access
- CCM integration tests covering inline INSERT/UPDATE/DELETE, prepared
  INSERT/UPDATE/DELETE, counter UPDATE (inline and prepared), batch
  with inline/prepared/mixed children, and counter batches

Development

Successfully merging this pull request may close these issues.

Improve resilience and observability during migration with target cluster instability
