
Add SLO workload module for ydb-slo-action#61

Open
polRk wants to merge 1 commit into master from slo-kv-workload

Conversation


@polRk polRk commented Apr 26, 2026

Summary

Adds a new slo Maven module that implements the workload contract of ydb-platform/ydb-slo-action. The workload is a KV read/write driver that records per-operation HDR latency histograms and pushes OTLP metrics, so it can be packaged into a Docker image and run as the current and baseline containers under chaos.

Schema, query shape and metric names mirror the SLO workloads in ydb-go-sdk and ydb-js-sdk so reports across SDKs are directly comparable.

The CI that drives this image lives in ydb-java-sdk (companion PR: ydb-platform/ydb-java-sdk#TBD). The SDK is built from source and pinned into the examples build via the multi-stage Dockerfile, so no SNAPSHOT publishing is needed.

What's in the module

slo/
├── Dockerfile                           multi-stage build (SDK + workload)
├── pom.xml
├── README.md
└── src/main/
    ├── java/tech/ydb/slo/
    │   ├── Config.java                  reads action env vars
    │   ├── Main.java                    entry point + JCommander parsing
    │   ├── Metrics.java                 OTLP exporter + HDR histograms
    │   ├── RateLimiter.java             RPS pacer
    │   └── kv/
    │       ├── KvWorkload.java          setup / run / teardown
    │       ├── KvWorkloadParams.java    `--read-rps`, `--prefill-count`, …
    │       ├── Row.java
    │       └── RowGenerator.java
    └── resources/log4j2.xml

Metrics emitted

All metrics carry a ref label sourced from WORKLOAD_REF (set by the action), so the report action can split current vs baseline series.
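
For illustration, attaching that label with the OpenTelemetry API could look roughly like this (a sketch with a hypothetical RefLabel holder, not the module's actual code):

import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;

final class RefLabel {
    // Read once from the environment variable set by ydb-slo-action and attached to
    // every measurement, e.g. operationsTotal.add(1, RefLabel.REF), so the report
    // action can split current vs baseline series.
    static final Attributes REF = Attributes.of(
            AttributeKey.stringKey("ref"),
            System.getenv().getOrDefault("WORKLOAD_REF", "unknown"));
}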

Metric                                     Type             Labels
sdk_operations_total                       counter          operation_type, operation_status
sdk_errors_total                           counter          operation_type, error_kind
sdk_retry_attempts_total                   counter          operation_type, operation_status
sdk_pending_operations                     up/down counter  operation_type
sdk_operation_latency_p50/p95/p99_seconds  gauge            operation_type, operation_status="success"

Notable design decisions

  • Percentile gauges reset each export. A single OTel batch callback snapshots p50/p95/p99 atomically and then calls histogram.reset(). Without reset, JIT-warmup outliers stay permanently in the cumulative HDR and p99 stops reflecting current performance; this is the same approach as in ydb-js-sdk (see the sketch after this list).
  • Latency is recorded only for successful operations. Failure latency is dominated by retry budgets and timeouts; mixing it into percentiles produces noisy multi-second spikes during chaos that mask real SDK regressions. Counters (sdk_operations_total, sdk_errors_total) still cover both branches, so availability is computed correctly. The metrics.yaml shipped with the SLO action filters on operation_status="success" already, so this matches what's consumed.
  • Single shared transport per workload run. The workload opens one GrpcTransport/QueryClient per run and uses a SessionRetryContext for every operation. This is the recommended Java SDK usage pattern.
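
A minimal sketch of the snapshot-and-reset callback from the first bullet, assuming an OpenTelemetry Meter and an HdrHistogram shared with the worker threads; class names, metric names and units here are illustrative, and the per-operation attributes are omitted:

import java.time.Duration;

import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.ObservableDoubleMeasurement;
import org.HdrHistogram.Histogram;
import org.HdrHistogram.SynchronizedHistogram;

final class LatencyGauges {
    // Worker threads record successful-operation latency here.
    private final SynchronizedHistogram histogram = new SynchronizedHistogram(3);

    void register(Meter meter) {
        ObservableDoubleMeasurement p50 = meter.gaugeBuilder("sdk.operation.latency.p50.seconds").buildObserver();
        ObservableDoubleMeasurement p95 = meter.gaugeBuilder("sdk.operation.latency.p95.seconds").buildObserver();
        ObservableDoubleMeasurement p99 = meter.gaugeBuilder("sdk.operation.latency.p99.seconds").buildObserver();

        // One batch callback per export: snapshot all three percentiles from the same
        // histogram state, then reset so warm-up outliers don't stick around forever.
        meter.batchCallback(() -> {
            Histogram snapshot = histogram.copy();
            histogram.reset();
            p50.record(snapshot.getValueAtPercentile(50.0) / 1_000_000.0);
            p95.record(snapshot.getValueAtPercentile(95.0) / 1_000_000.0);
            p99.record(snapshot.getValueAtPercentile(99.0) / 1_000_000.0);
        }, p50, p95, p99);
    }

    void record(Duration latency) {
        histogram.recordValue(latency.toNanos() / 1_000); // stored in microseconds
    }
}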

Local verification

End-to-end smoke tested via the deploy/ directory shipped with ydb-slo-action:

WORKLOAD_NAME=java-query-kv \
WORKLOAD_DURATION=120 \
WORKLOAD_CURRENT_IMAGE=ydb-app-current \
WORKLOAD_CURRENT_REF=local-current \
WORKLOAD_CURRENT_COMMAND="--read-rps 100 --write-rps 10 --prefill-count 100" \
docker compose -f deploy/compose.yml \
  --profile telemetry --profile workload-current --profile chaos up -d

After a 2-minute run with chaos enabled (graceful and instant restarts of database nodes):

  • read p99 ≈ 2.6 ms / write p99 ≈ 9.8 ms — no chaos-induced tail drag, since timing-out ops aren't recorded.
  • 3 read errors + 3 write errors classified as ydb/transport_unavailable, matching the chaos events.
  • read availability = 99.99 %, write availability = 100 % over the 2-minute window.
  • All 6 expected metric series visible in Prometheus with the ref=local-current label.

Companion PR

CI in ydb-platform/ydb-java-sdk (workflow + build script): TBD.


Add new `slo` Maven module implementing the workload contract of
ydb-platform/ydb-slo-action. The workload is a KV read/write driver with
per-operation HDR latency histograms and OTLP metric export, designed to
be packaged into a Docker image and run as the `current` and `baseline`
containers under chaos.

Schema, query shape and metric names mirror the SLO workloads in
ydb-go-sdk and ydb-js-sdk so reports across SDKs are directly
comparable. The CI that drives this image lives in ydb-java-sdk —
the SDK is built from source and pinned into the examples build via
the multi-stage Dockerfile.

Behaviour notes:
- Metrics export every second via OTLP HTTP, with a single batch
  callback that snapshots p50/p95/p99 from a shared HDR histogram and
  resets it. This avoids JIT-warmup outliers being permanently stuck
  in cumulative percentiles.
- Latency is recorded only for successful operations. Failure latency
  is dominated by retry budgets and timeouts and would mask real SDK
  performance under chaos. Counters and `sdk_errors_total` still cover
  both branches, so availability is computed correctly.
@polRk polRk force-pushed the slo-kv-workload branch from 3f0077a to 1a5b2a2 on April 27, 2026 at 18:16

polRk commented Apr 27, 2026

Thanks for the review, Alex! Force-pushed the fixes.

Your comments

  • RateLimiter → Guava. Done. Dropped my custom RateLimiter.java and switched to com.google.common.util.concurrent.RateLimiter. Guava is already pulled in transitively via ydb-sdk-query (grpc-netty-shaded), so no new dependency in slo/pom.xml.
  • LogManager.shutdown() instead of Thread.sleep(100). Done in Main.finally — the sleep hack is gone.
  • Move the loop inside the pool task to reduce thread count. Done. The two slo-{read,write}-driver threads are gone; each worker now pulls a permit from the shared Guava RateLimiter and executes the operation inline (see the sketch after this list). This also removes the newFixedThreadPool unbounded queue that Copilot flagged; there's no queue any more. Per-worker termination is handled by awaitTermination(duration + graceSeconds), so the pool drains naturally as workers hit the deadline.
  • thenApply(Result::getStatus). Done. Plus switched writeRowInternal from supplyResult to supplyStatus so the whole chain works in CompletableFuture<Status> without manual Result.success/fail wrapping.
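
Roughly the per-worker shape described above, sketched around Guava's RateLimiter; the pool wiring and method names are illustrative rather than the module's actual code:

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.RateLimiter;

final class WorkerPool {
    // One shared limiter paces every worker; each worker blocks on acquire() and then
    // runs the operation inline, so there is no task queue and backpressure comes
    // from the limiter itself.
    static void run(int workers, double rps, Duration duration, long graceSeconds,
                    Runnable operation) throws InterruptedException {
        RateLimiter limiter = RateLimiter.create(rps);
        long endNanos = System.nanoTime() + duration.toNanos();

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                while (System.nanoTime() < endNanos) {
                    limiter.acquire();  // blocks until a permit is available
                    operation.run();    // read or write, e.g. joining the CompletableFuture<Status> chain
                }
            });
        }
        pool.shutdown();
        // Workers drain naturally as they hit the deadline; no explicit cancellation needed.
        pool.awaitTermination(duration.toSeconds() + graceSeconds, TimeUnit.SECONDS);
    }
}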

Copilot comments I agree with (also addressed)

  • readOnce() reading only id = 0: real bug, good catch. RowGenerator.nextId was never advanced because prefill + writes both called generate(id) with an explicit id. Replaced generator.peekNextId() with params.prefillCount() + writesIssued.get() and removed peekNextId() altogether; the generator is now constructed with nextId = prefillCount so writes continue from there naturally.
  • WORKLOAD_DURATION=0 not actually unlimited: fixed — endNanos = duration > 0 ? now + duration : Long.MAX_VALUE.
  • --read-timeout-ms / --write-timeout-ms were no-ops: fixed. They're now threaded through ExecuteQuerySettings.newBuilder().withRequestTimeout(Duration.ofMillis(...)) for both read and write queries (a small sketch follows this list). During the chaos smoke test this made ydb/client_cancelled errors show up in sdk_errors_total alongside ydb/transport_unavailable, confirming the timeouts actually fire now.
  • tablePath sanitization: applied sanitize() to config.workloadName() as well, not just ref.
  • README metric names (dotted vs underscored): added a paragraph explaining OTel → Prometheus name conversion and switched the table to Prometheus form, which is what you actually query.
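
For reference, threading the timeout flags into per-request settings could look like the sketch below; the helper is hypothetical and the import path is assumed from the query module, but ExecuteQuerySettings.newBuilder().withRequestTimeout(...) is the call named above:

import java.time.Duration;

import tech.ydb.query.settings.ExecuteQuerySettings;

final class QueryTimeouts {
    // Built once per operation kind from --read-timeout-ms / --write-timeout-ms and
    // passed to the session's query execution call for both the read and write paths.
    static ExecuteQuerySettings withTimeout(long timeoutMs) {
        return ExecuteQuerySettings.newBuilder()
                .withRequestTimeout(Duration.ofMillis(timeoutMs))
                .build();
    }
}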

Copilot comments I didn't apply

  • RateLimiter with rps <= 0 being an infinite stream — moot now that we're on Guava + per-worker model. workerCount(rps) == 0 explicitly skips pool creation for that branch (useful for read-only or write-only smoke tests), and the Guava limiter doesn't have the "0 means unlimited" footgun.
  • Unbounded pool queue — also moot: the queue is gone, backpressure comes from the limiter blocking the worker.

End-to-end re-verification

Ran the full stack (ydb-slo-action/deploy/compose.yml with telemetry + workload-current + chaos profiles) for 3 minutes at 200 read RPS / 20 write RPS:

  • 36045 read ops (200.3 RPS, 12 errors) / 3615 write ops (20.1 RPS, 4 errors).
  • Errors split into ydb/transport_unavailable (chaos restart) and ydb/client_cancelled (timeout) — the latter proves the --*-timeout-ms flags work.
  • Availability: read 99.97 %, write 99.90 % over the 3-minute window.
  • sdk_operation_latency_p99_seconds exists only with operation_status="success" — error samples don't pollute the histogram. Instant steady-state p99 read = 2.02 ms / write = 4.56 ms; peak-over-window p99 shows chaos spikes up to ~1.6 s on read and ~1.2 s on write right after node restarts, which is exactly the signal the SLO report is supposed to catch.
  • No memory pressure under extended chaos (no OOM, no queue buildup — the per-worker model has no shared queue).


Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.



Comment thread slo/src/main/java/tech/ydb/slo/kv/KvWorkload.java
Comment on lines +240 to +241
writeOnce(generator.generate());
writesIssued.incrementAndGet();

Copilot AI Apr 27, 2026


writesIssued is incremented unconditionally after writeOnce(...), even when the write fails. That makes readOnce(writesIssued.get()) select ids that may never have been written, contradicting the comment that reads target only known-to-exist ids and reducing how often reads exercise row deserialization under chaos. Consider incrementing only on successful writes (e.g., make writeOnce return a boolean/Status), or adjust the keyspace tracking logic to reflect the max generated id (and update the comment accordingly).

Suggested change
-writeOnce(generator.generate());
-writesIssued.incrementAndGet();
+Status status = writeOnce(generator.generate());
+if (status.isSuccess()) {
+    writesIssued.incrementAndGet();
+}
