
Add SLO workload module for ydb-slo-action#61

Open
polRk wants to merge 1 commit into master from slo-kv-workload

Conversation


@polRk polRk commented Apr 26, 2026

Summary

Adds a new slo Maven module that implements the workload contract of ydb-platform/ydb-slo-action. The workload is a KV read/write driver that records per-operation HDR latency histograms and pushes OTLP metrics, so it can be packaged into a Docker image and run as the current and baseline containers under chaos.

Schema, query shape and metric names mirror the SLO workloads in ydb-go-sdk and ydb-js-sdk so reports across SDKs are directly comparable.

The CI that drives this image lives in ydb-java-sdk (companion PR: ydb-platform/ydb-java-sdk#TBD). The SDK is built from source and pinned into the examples build via the multi-stage Dockerfile, so no SNAPSHOT publishing is needed.

What's in the module

slo/
├── Dockerfile                           multi-stage build (SDK + workload)
├── pom.xml
├── README.md
└── src/main/
    ├── java/tech/ydb/slo/
    │   ├── Config.java                  reads action env vars
    │   ├── Main.java                    entry point + JCommander parsing
    │   ├── Metrics.java                 OTLP exporter + HDR histograms
    │   ├── RateLimiter.java             RPS pacer
    │   └── kv/
    │       ├── KvWorkload.java          setup / run / teardown
    │       ├── KvWorkloadParams.java    `--read-rps`, `--prefill-count`, …
    │       ├── Row.java
    │       └── RowGenerator.java
    └── resources/log4j2.xml

Metrics emitted

All metrics carry a ref label sourced from WORKLOAD_REF (set by the action), so the report action can split current vs baseline series.
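
For illustration, attaching that label with the OpenTelemetry API could look roughly like this (a sketch with a hypothetical RefLabel holder, not the module's actual code):

import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;

final class RefLabel {
    // Read once from the environment variable set by ydb-slo-action and attached to
    // every measurement, e.g. operationsTotal.add(1, RefLabel.REF), so the report
    // action can split current vs baseline series.
    static final Attributes REF = Attributes.of(
            AttributeKey.stringKey("ref"),
            System.getenv().getOrDefault("WORKLOAD_REF", "unknown"));
}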

Metric                                     Type             Labels
sdk_operations_total                       counter          operation_type, operation_status
sdk_errors_total                           counter          operation_type, error_kind
sdk_retry_attempts_total                   counter          operation_type, operation_status
sdk_pending_operations                     up/down counter  operation_type
sdk_operation_latency_p50/p95/p99_seconds  gauge            operation_type, operation_status="success"

Notable design decisions

  • Percentile gauges reset each export. A single OTel batch callback snapshots p50/p95/p99 atomically and then calls histogram.reset(). Without reset, JIT-warmup outliers stay permanently in the cumulative HDR and p99 stops reflecting current performance; this is the same approach as in ydb-js-sdk (see the sketch after this list).
  • Latency is recorded only for successful operations. Failure latency is dominated by retry budgets and timeouts; mixing it into percentiles produces noisy multi-second spikes during chaos that mask real SDK regressions. Counters (sdk_operations_total, sdk_errors_total) still cover both branches, so availability is computed correctly. The metrics.yaml shipped with the SLO action filters on operation_status="success" already, so this matches what's consumed.
  • Single shared transport per workload run. The workload opens one GrpcTransport/QueryClient per run and uses a SessionRetryContext for every operation. This is the recommended Java SDK usage pattern.
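
A minimal sketch of the snapshot-and-reset callback from the first bullet, assuming an OpenTelemetry Meter and an HdrHistogram shared with the worker threads; class names, metric names and units here are illustrative, and the per-operation attributes are omitted:

import java.time.Duration;

import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.ObservableDoubleMeasurement;
import org.HdrHistogram.Histogram;
import org.HdrHistogram.SynchronizedHistogram;

final class LatencyGauges {
    // Worker threads record successful-operation latency here.
    private final SynchronizedHistogram histogram = new SynchronizedHistogram(3);

    void register(Meter meter) {
        ObservableDoubleMeasurement p50 = meter.gaugeBuilder("sdk.operation.latency.p50.seconds").buildObserver();
        ObservableDoubleMeasurement p95 = meter.gaugeBuilder("sdk.operation.latency.p95.seconds").buildObserver();
        ObservableDoubleMeasurement p99 = meter.gaugeBuilder("sdk.operation.latency.p99.seconds").buildObserver();

        // One batch callback per export: snapshot all three percentiles from the same
        // histogram state, then reset so warm-up outliers don't stick around forever.
        meter.batchCallback(() -> {
            Histogram snapshot = histogram.copy();
            histogram.reset();
            p50.record(snapshot.getValueAtPercentile(50.0) / 1_000_000.0);
            p95.record(snapshot.getValueAtPercentile(95.0) / 1_000_000.0);
            p99.record(snapshot.getValueAtPercentile(99.0) / 1_000_000.0);
        }, p50, p95, p99);
    }

    void record(Duration latency) {
        histogram.recordValue(latency.toNanos() / 1_000); // stored in microseconds
    }
}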

Local verification

End-to-end smoke tested via the deploy/ directory shipped with ydb-slo-action:

WORKLOAD_NAME=java-query-kv \
WORKLOAD_DURATION=120 \
WORKLOAD_CURRENT_IMAGE=ydb-app-current \
WORKLOAD_CURRENT_REF=local-current \
WORKLOAD_CURRENT_COMMAND="--read-rps 100 --write-rps 10 --prefill-count 100" \
docker compose -f deploy/compose.yml \
  --profile telemetry --profile workload-current --profile chaos up -d

After a 2-minute run with chaos enabled (graceful and instant restarts of database nodes):

  • read p99 ≈ 2.6 ms / write p99 ≈ 9.8 ms — no chaos-induced tail drag, since timing-out ops aren't recorded.
  • 3 read errors + 3 write errors classified as ydb/transport_unavailable, matching the chaos events.
  • read availability = 99.99 %, write availability = 100 % over the 2-minute window.
  • All 6 expected metric series visible in Prometheus with the ref=local-current label.

Companion PR

CI in ydb-platform/ydb-java-sdk (workflow + build script): TBD.


Add new `slo` Maven module implementing the workload contract of
ydb-platform/ydb-slo-action. The workload is a KV read/write driver with
per-operation HDR latency histograms and OTLP metric export, designed to
be packaged into a Docker image and run as the `current` and `baseline`
containers under chaos.

Schema, query shape and metric names mirror the SLO workloads in
ydb-go-sdk and ydb-js-sdk so reports across SDKs are directly
comparable. The CI that drives this image lives in ydb-java-sdk —
the SDK is built from source and pinned into the examples build via
the multi-stage Dockerfile.

Behaviour notes:
- Metrics export every second via OTLP HTTP, with a single batch
  callback that snapshots p50/p95/p99 from a shared HDR histogram and
  resets it. This avoids JIT-warmup outliers being permanently stuck
  in cumulative percentiles.
- Latency is recorded only for successful operations. Failure latency
  is dominated by retry budgets and timeouts and would mask real SDK
  performance under chaos. Counters and `sdk_errors_total` still cover
  both branches, so availability is computed correctly.
@polRk polRk force-pushed the slo-kv-workload branch from 3f0077a to 1a5b2a2 on April 27, 2026 at 18:16

polRk commented Apr 27, 2026

Thanks for the review, Alex! Force-pushed the fixes.

Your comments

  • RateLimiter → Guava. Done. Dropped my custom RateLimiter.java and switched to com.google.common.util.concurrent.RateLimiter. Guava is already pulled in transitively via ydb-sdk-query (grpc-netty-shaded), so no new dependency in slo/pom.xml.
  • LogManager.shutdown() instead of Thread.sleep(100). Done in Main.finally — the sleep hack is gone.
  • Move the loop inside the pool task to reduce thread count. Done. The two slo-{read,write}-driver threads are gone; each worker now pulls a permit from the shared Guava RateLimiter and executes the operation inline (see the sketch after this list). This also removes the newFixedThreadPool unbounded queue that Copilot flagged; there's no queue any more. Per-worker termination is handled by awaitTermination(duration + graceSeconds), so the pool drains naturally as workers hit the deadline.
  • thenApply(Result::getStatus). Done. Plus switched writeRowInternal from supplyResult to supplyStatus so the whole chain works in CompletableFuture<Status> without manual Result.success/fail wrapping.
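
Roughly the per-worker shape described above, sketched around Guava's RateLimiter; the pool wiring and method names are illustrative rather than the module's actual code:

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.RateLimiter;

final class WorkerPool {
    // One shared limiter paces every worker; each worker blocks on acquire() and then
    // runs the operation inline, so there is no task queue and backpressure comes
    // from the limiter itself.
    static void run(int workers, double rps, Duration duration, long graceSeconds,
                    Runnable operation) throws InterruptedException {
        RateLimiter limiter = RateLimiter.create(rps);
        long endNanos = System.nanoTime() + duration.toNanos();

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                while (System.nanoTime() < endNanos) {
                    limiter.acquire();  // blocks until a permit is available
                    operation.run();    // read or write, e.g. joining the CompletableFuture<Status> chain
                }
            });
        }
        pool.shutdown();
        // Workers drain naturally as they hit the deadline; no explicit cancellation needed.
        pool.awaitTermination(duration.toSeconds() + graceSeconds, TimeUnit.SECONDS);
    }
}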

Copilot comments I agree with (also addressed)

  • readOnce() reading only id = 0: real bug, good catch. RowGenerator.nextId was never advanced because prefill + writes both called generate(id) with an explicit id. Replaced generator.peekNextId() with params.prefillCount() + writesIssued.get() and removed peekNextId() altogether; the generator is now constructed with nextId = prefillCount so writes continue from there naturally.
  • WORKLOAD_DURATION=0 not actually unlimited: fixed — endNanos = duration > 0 ? now + duration : Long.MAX_VALUE.
  • --read-timeout-ms / --write-timeout-ms were no-ops: fixed. They're now threaded through ExecuteQuerySettings.newBuilder().withRequestTimeout(Duration.ofMillis(...)) for both read and write queries (a small sketch follows this list). During the chaos smoke test this made ydb/client_cancelled errors show up in sdk_errors_total alongside ydb/transport_unavailable, confirming the timeouts actually fire now.
  • tablePath sanitization: applied sanitize() to config.workloadName() as well, not just ref.
  • README metric names (dotted vs underscored): added a paragraph explaining OTel → Prometheus name conversion and switched the table to Prometheus form, which is what you actually query.
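
For reference, threading the timeout flags into per-request settings could look like the sketch below; the helper is hypothetical and the import path is assumed from the query module, but ExecuteQuerySettings.newBuilder().withRequestTimeout(...) is the call named above:

import java.time.Duration;

import tech.ydb.query.settings.ExecuteQuerySettings;

final class QueryTimeouts {
    // Built once per operation kind from --read-timeout-ms / --write-timeout-ms and
    // passed to the session's query execution call for both the read and write paths.
    static ExecuteQuerySettings withTimeout(long timeoutMs) {
        return ExecuteQuerySettings.newBuilder()
                .withRequestTimeout(Duration.ofMillis(timeoutMs))
                .build();
    }
}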

Copilot comments I didn't apply

  • RateLimiter with rps <= 0 being an infinite stream — moot now that we're on Guava + per-worker model. workerCount(rps) == 0 explicitly skips pool creation for that branch (useful for read-only or write-only smoke tests), and the Guava limiter doesn't have the "0 means unlimited" footgun.
  • Unbounded pool queue — also moot: the queue is gone, backpressure comes from the limiter blocking the worker.

End-to-end re-verification

Ran the full stack (ydb-slo-action/deploy/compose.yml with telemetry + workload-current + chaos profiles) for 3 minutes at 200 read RPS / 20 write RPS:

  • 36045 read ops (200.3 RPS, 12 errors) / 3615 write ops (20.1 RPS, 4 errors).
  • Errors split into ydb/transport_unavailable (chaos restart) and ydb/client_cancelled (timeout) — the latter proves the --*-timeout-ms flags work.
  • Availability: read 99.97 %, write 99.90 % over the 3-minute window.
  • sdk_operation_latency_p99_seconds exists only with operation_status="success" — error samples don't pollute the histogram. Instant steady-state p99 read = 2.02 ms / write = 4.56 ms; peak-over-window p99 shows chaos spikes up to ~1.6 s on read and ~1.2 s on write right after node restarts, which is exactly the signal the SLO report is supposed to catch.
  • No memory pressure under extended chaos (no OOM, no queue buildup — the per-worker model has no shared queue).


Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.



Comment thread slo/src/main/java/tech/ydb/slo/kv/KvWorkload.java
Comment on lines +240 to +241
writeOnce(generator.generate());
writesIssued.incrementAndGet();

Copilot AI Apr 27, 2026


writesIssued is incremented unconditionally after writeOnce(...), even when the write fails. That makes readOnce(writesIssued.get()) select ids that may never have been written, contradicting the comment that reads target only known-to-exist ids and reducing how often reads exercise row deserialization under chaos. Consider incrementing only on successful writes (e.g., make writeOnce return a boolean/Status), or adjust the keyspace tracking logic to reflect the max generated id (and update the comment accordingly).

Suggested change
-writeOnce(generator.generate());
-writesIssued.incrementAndGet();
+Status status = writeOnce(generator.generate());
+if (status.isSuccess()) {
+    writesIssued.incrementAndGet();
+}
