From 8faa8b177b726e77feb1bbabf0bb74127cadb551 Mon Sep 17 00:00:00 2001
From: Dustin Cote
Date: Wed, 17 Jun 2026 10:14:24 -0400
Subject: [PATCH] Clarify v0-vs-v1 latency metric semantics and low-traffic percentiles

Two OpenMetrics doc gaps surfaced by a customer alerting false alarm
after migrating from the v0 query endpoint to v1 OpenMetrics:

- Migration guide: add a caution that v0 service_latency_sum/count is an
  average (~p50) and _bucket is a count, not a percentile. Comparing
  either against v1 _p95/_p99 reports higher values for identical
  traffic. Includes safe-migration steps and a pointer to the p99
  latency SLO.
- Metrics reference: add a note that percentile metrics on low-traffic
  namespaces are computed from small per-minute samples, so a single
  slow request dominates p50/p95/p99. Recommends gating latency alerts
  on a minimum request count, and notes that pre-calculated percentiles
  cannot be re-aggregated into an accurate longer-window percentile.

Co-Authored-By: Claude Opus 4.8
---
 .../metrics/openmetrics/metrics-reference.mdx |  8 ++++++++
 .../metrics/openmetrics/migration-guide.mdx   | 20 +++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/docs/cloud/metrics/openmetrics/metrics-reference.mdx b/docs/cloud/metrics/openmetrics/metrics-reference.mdx
index da8682d920..e00a3fa272 100644
--- a/docs/cloud/metrics/openmetrics/metrics-reference.mdx
+++ b/docs/cloud/metrics/openmetrics/metrics-reference.mdx
@@ -40,6 +40,14 @@ All metrics are stored as 1 minute aggregates. Rate metrics are therefore per-se
 
 :::
 
+:::note Percentile metrics on low-traffic namespaces
+
+Percentile metrics (`*_p50` / `_p95` / `_p99`) are calculated from the requests observed in each 1-minute aggregation window. On a namespace with few requests per minute, that sample is small, so a single slow request dominates every percentile and p50, p95, and p99 converge toward the slowest observed request. Tail percentiles generally need roughly 20 or more samples per window before they are statistically meaningful; below that, values vary widely.
+
+For example, a low-volume namespace that starts one Workflow every few minutes can report a several-hundred-millisecond `StartWorkflowExecution` `temporal_cloud_v1_service_latency_p95` that reflects a single request, not systemic latency. When alerting on percentile latency for low-traffic namespaces, gate the alert on a minimum request count (for example, [`temporal_cloud_v1_service_request_count`](#temporal_cloud_v1_service_request_count)) so that windows with too few samples don't trigger it. These percentiles are pre-calculated per 1-minute window and cannot be re-aggregated into an accurate longer-window percentile, so widening your evaluation window does not by itself make a sparse sample meaningful.
+
+:::
+
 ### Common Labels
 
 All metrics include these base labels:
diff --git a/docs/cloud/metrics/openmetrics/migration-guide.mdx b/docs/cloud/metrics/openmetrics/migration-guide.mdx
index 32b3c86a33..f69386a6e7 100644
--- a/docs/cloud/metrics/openmetrics/migration-guide.mdx
+++ b/docs/cloud/metrics/openmetrics/migration-guide.mdx
@@ -137,6 +137,26 @@ accurately aggregated_. For example:
 - ✅ Can still view individual namespace/task queue percentiles accurately
 - ✅ More accurate percentile calculations for individual series, especially with outliers
 
+:::caution Don't compare a v0 average against a v1 percentile
+
+The v0 latency metrics are a histogram, not a percentile. Dividing `temporal_cloud_v0_service_latency_sum` by
+`temporal_cloud_v0_service_latency_count` yields an **average** (roughly a p50), and a single
+`temporal_cloud_v0_service_latency_bucket{le="..."}` series only **counts** requests below a threshold. Neither is a p95
+or p99.
+
+If your v0 alert compared an average (or a raw `_sum` / `_bucket` value) against a latency threshold, switching to
+`temporal_cloud_v1_service_latency_p95` / `_p99` reports higher values for identical traffic. This is a measurement
+change, not a latency regression.
+
+To migrate latency alerts safely:
+
+- Compare like-for-like. To reproduce a former average-based alert, start on `temporal_cloud_v1_service_latency_p50`,
+  then move to `_p95` / `_p99` deliberately once you have set an appropriate threshold.
+- Confirm which percentile your SLO targets. The Temporal Cloud [latency SLO](/cloud/service-availability#latency) is a
+  **p99**; alerting on p95 against a p99 threshold trips earlier than the SLO.
+
+:::
+
 ### 4\. Authentication Setup
 
 **Before**: mTLS certificates with customer-specific endpoint