diff --git a/docs/cloud/metrics/openmetrics/metrics-reference.mdx b/docs/cloud/metrics/openmetrics/metrics-reference.mdx
index da8682d920..e00a3fa272 100644
--- a/docs/cloud/metrics/openmetrics/metrics-reference.mdx
+++ b/docs/cloud/metrics/openmetrics/metrics-reference.mdx
@@ -40,6 +40,14 @@ All metrics are stored as 1 minute aggregates. Rate metrics are therefore per-se
 
 :::
 
+:::note Percentile metrics on low-traffic namespaces
+
+Percentile metrics (`*_p50` / `*_p95` / `*_p99`) are calculated from the requests observed in each 1-minute aggregation window. On a namespace with few requests per minute, that sample is small, so a single slow request can dominate the tail percentiles, and with only one or two requests in the window p50, p95, and p99 all converge on the slowest observed request. Tail percentiles generally need roughly 20 or more samples per window before they are statistically meaningful; below that, values vary widely.
+
+For example, a low-volume namespace that starts one Workflow every few minutes can report a `temporal_cloud_v1_service_latency_p95` of several hundred milliseconds for `StartWorkflowExecution` that reflects a single request, not systemic latency. When alerting on percentile latency for low-traffic namespaces, gate the alert on a minimum request count (for example, [`temporal_cloud_v1_service_request_count`](#temporal_cloud_v1_service_request_count)) so that windows with too few samples don't trigger it. These percentiles are pre-calculated per 1-minute window and cannot be re-aggregated into an accurate longer-window percentile, so widening your evaluation window does not by itself make a sparse sample meaningful.
+
+:::
+
 ### Common Labels
 
 All metrics include these base labels:
diff --git a/docs/cloud/metrics/openmetrics/migration-guide.mdx b/docs/cloud/metrics/openmetrics/migration-guide.mdx
index 32b3c86a33..f69386a6e7 100644
--- a/docs/cloud/metrics/openmetrics/migration-guide.mdx
+++ b/docs/cloud/metrics/openmetrics/migration-guide.mdx
@@ -137,6 +137,26 @@ accurately aggregated_. For example:
 - ✅ Can still view individual namespace/task queue percentiles accurately
 - ✅ More accurate percentile calculations for individual series, especially with outliers
 
+:::caution Don't compare a v0 average against a v1 percentile
+
+The v0 latency metrics are a histogram, not a percentile. Dividing `temporal_cloud_v0_service_latency_sum` by
+`temporal_cloud_v0_service_latency_count` yields an **average** (a mean, only loosely comparable to a p50), and a single
+`temporal_cloud_v0_service_latency_bucket{le="..."}` series only **counts** requests at or below a threshold. Neither is
+a p95 or p99.
+
+If your v0 alert compared an average (or a raw `_sum` / `_bucket` value) against a latency threshold, switching to
+`temporal_cloud_v1_service_latency_p95` / `_p99` typically reports higher values for identical traffic. This is a
+measurement change, not a latency regression.
+
+To migrate latency alerts safely:
+
+- Compare like-for-like. To reproduce a former average-based alert, start on `temporal_cloud_v1_service_latency_p50`,
+  then move to `_p95` / `_p99` deliberately once you have set an appropriate threshold.
+- Confirm which percentile your SLO targets. The Temporal Cloud [latency SLO](/cloud/service-availability#latency) is a
+  **p99**; p95 never exceeds p99, so a p95 alert set at the p99 threshold fires only after the SLO is already breached.
+
+:::
+
 ### 4\. Authentication Setup
 
 **Before**: mTLS certificates with customer-specific endpoint
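
Reviewer-side illustration, not part of the patch: a minimal Python sketch of the two points the added notes make, namely that a sparse window lets one slow request set the tail percentiles, and that a mean is not a tail percentile. It assumes a simple nearest-rank percentile and made-up latencies. `nearest_rank_percentile`, `should_alert`, the sample values, and the `min_requests=20` default are hypothetical names and numbers chosen for illustration; they are not Temporal Cloud's aggregation method or API.

```python
import statistics


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling, 1-based rank
    return ordered[rank - 1]


def should_alert(window, threshold_ms, min_requests=20):
    """Gate a p95 alert on sample count so sparse windows cannot trigger it."""
    if len(window) < min_requests:
        return False
    return nearest_rank_percentile(window, 95) > threshold_ms


# Hypothetical latencies in milliseconds for a single 1-minute window.
sparse_window = [12, 15, 480]        # 3 requests, one of them slow
busy_window = [12] * 199 + [480]     # 200 requests, the same slow one

for name, window in (("sparse", sparse_window), ("busy", busy_window)):
    mean = statistics.fmean(window)
    p50, p95, p99 = (nearest_rank_percentile(window, p) for p in (50, 95, 99))
    print(f"{name}: n={len(window)} mean={mean:.1f} p50={p50} p95={p95} p99={p99}")
```

In the sparse window the single 480 ms request sets both p95 and p99, while in the busy window it leaves every percentile at 12 ms and only nudges the mean. `should_alert(sparse_window, 200)` returns `False` because the window holds fewer than 20 requests, which mirrors gating an alert on a minimum request count.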