Add new `slo` Maven module implementing the workload contract of ydb-platform/ydb-slo-action. The workload is a KV read/write driver with per-operation HDR latency histograms and OTLP metric export, designed to be packaged into a Docker image and run as the `current` and `baseline` containers under chaos. Schema, query shape and metric names mirror the SLO workloads in ydb-go-sdk and ydb-js-sdk so reports across SDKs are directly comparable.

The CI that drives this image lives in ydb-java-sdk; the SDK is built from source and pinned into the examples build via the multi-stage Dockerfile.

Behaviour notes:
- Metrics export every second via OTLP HTTP, with a single batch callback that snapshots p50/p95/p99 from a shared HDR histogram and resets it. This avoids JIT-warmup outliers being permanently stuck in cumulative percentiles.
- Latency is recorded only for successful operations. Failure latency is dominated by retry budgets and timeouts and would mask real SDK performance under chaos. Counters and `sdk_errors_total` still cover both branches, so availability is computed correctly.
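The snapshot-and-reset behaviour in the notes above can be sketched roughly as follows. This is a minimal illustration, not the module's code: `WindowedLatency` is a hypothetical class, and it stores raw samples in a plain array as a stand-in for the HDR histogram so that the example needs only the JDK. Percentiles use the nearest-rank method, which is an assumption of this sketch.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the shared HDR histogram: keeps raw samples for one export window. */
final class WindowedLatency {
    private long[] samples = new long[1024];
    private int size;

    /** Records the latency of one successful operation, in microseconds. */
    synchronized void recordSuccess(long micros) {
        if (size == samples.length) {
            samples = Arrays.copyOf(samples, size * 2);
        }
        samples[size++] = micros;
    }

    /** Returns {p50, p95, p99} for the current window, then clears the window. */
    synchronized long[] snapshotAndReset() {
        long[] window = Arrays.copyOf(samples, size);
        size = 0; // the reset: the next export only sees new samples
        Arrays.sort(window);
        return new long[] {percentile(window, 50), percentile(window, 95), percentile(window, 99)};
    }

    /** Nearest-rank percentile over a sorted array; returns 0 for an empty window. */
    private static long percentile(long[] sorted, int p) {
        if (sorted.length == 0) {
            return 0;
        }
        int rank = (int) (((long) p * sorted.length + 99) / 100); // ceil(p * n / 100)
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Because the window is cleared on every snapshot, a slow warm-up sample affects only the export interval in which it occurred. Failed operations never reach `recordSuccess`, so they influence counters but not the percentiles.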
Thanks for the review, Alex! Force-pushed the fixes.
Your comments
Copilot comments I agree with (also addressed)
Copilot comments I didn't apply
End-to-end re-verification
Ran the full stack (
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
    writeOnce(generator.generate());
    writesIssued.incrementAndGet();
`writesIssued` is incremented unconditionally after `writeOnce(...)`, even when the write fails. That makes `readOnce(writesIssued.get())` select ids that may never have been written, contradicting the comment that reads target only known-to-exist ids and reducing how often reads exercise row deserialization under chaos. Consider incrementing only on successful writes (e.g., make `writeOnce` return a boolean/`Status`), or adjust the keyspace tracking logic to reflect the max generated id (and update the comment accordingly).
Suggested change:

    - writeOnce(generator.generate());
    - writesIssued.incrementAndGet();
    + Status status = writeOnce(generator.generate());
    + if (status.isSuccess()) {
    +     writesIssued.incrementAndGet();
    + }
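The idea behind the suggestion can be shown in isolation with a small, self-contained sketch. `KeyspaceTracker` and its method names are hypothetical, and the write itself is passed in as a callback that reports success as a boolean; the actual `writeOnce` signature and the SDK's `Status` type are not reproduced here.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongPredicate;

/** Hypothetical helper: counts only the writes that are known to have succeeded. */
final class KeyspaceTracker {
    private final AtomicLong writesSucceeded = new AtomicLong();

    /** Runs one write through the supplied callback and counts it only on success. */
    boolean write(long id, LongPredicate writeOnce) {
        boolean ok = writeOnce.test(id);
        if (ok) {
            writesSucceeded.incrementAndGet();
        }
        return ok;
    }

    /** Upper bound that readers may use when picking an id to read. */
    long readableUpperBound() {
        return writesSucceeded.get();
    }
}
```

Note that if the generator advances its id on failed writes as well, a success count narrows the read range but does not guarantee that every id below it exists; tracking the set or maximum of successfully written ids would be needed for that.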
Summary
Adds a new `slo` Maven module that implements the workload contract of `ydb-platform/ydb-slo-action`. The workload is a KV read/write driver that records per-operation HDR latency histograms and pushes OTLP metrics, so it can be packaged into a Docker image and run as the `current` and `baseline` containers under chaos.

Schema, query shape and metric names mirror the SLO workloads in `ydb-go-sdk` and `ydb-js-sdk` so reports across SDKs are directly comparable.

The CI that drives this image lives in ydb-java-sdk (companion PR: ydb-platform/ydb-java-sdk#TBD). The SDK is built from source and pinned into the examples build via the multi-stage Dockerfile, so no SNAPSHOT publishing is needed.
What's in the module
Metrics emitted
All metrics carry a `ref` label sourced from `WORKLOAD_REF` (set by the action), so the report action can split current vs baseline series.

| Metric | Labels |
|---|---|
| `sdk_operations_total` | `operation_type`, `operation_status` |
| `sdk_errors_total` | `operation_type`, `error_kind` |
| `sdk_retry_attempts_total` | `operation_type`, `operation_status` |
| `sdk_pending_operations` | `operation_type` |
| `sdk_operation_latency_p50/p95/p99_seconds` | `operation_type`, `operation_status="success"` |

Notable design decisions
- Each export snapshots p50/p95/p99 from the shared HDR histogram and then calls `histogram.reset()`. Without reset, JIT-warmup outliers stay permanently in the cumulative HDR and p99 stops reflecting current performance; this is the same approach as in `ydb-js-sdk`.
- Latency is recorded only for successful operations. Counters (`sdk_operations_total`, `sdk_errors_total`) still cover both branches, so availability is computed correctly. The `metrics.yaml` shipped with the SLO action filters on `operation_status="success"` already, so this matches what's consumed.
- […] `GrpcTransport`/`QueryClient` per run and uses a `SessionRetryContext` for every operation. This is the recommended Java SDK usage pattern.

Local verification
End-to-end smoke tested via the `deploy/` directory shipped with `ydb-slo-action`:

After a 2-minute run with chaos enabled (graceful + instant restart of database nodes):
- […] `ydb/transport_unavailable`, matching the chaos events.
- […] `ref=local-current` label.

Companion PR
CI in ydb-platform/ydb-java-sdk (workflow + build script): TBD.