8 changes: 8 additions & 0 deletions docs/cloud/metrics/openmetrics/metrics-reference.mdx
@@ -40,6 +40,14 @@ All metrics are stored as 1 minute aggregates. Rate metrics are therefore per-se

:::

:::note Percentile metrics on low-traffic namespaces

Percentile metrics (`*_p50`, `*_p95`, `*_p99`) are calculated from the requests observed in each 1-minute aggregation window. On a namespace with few requests per minute, that sample is small, so a single slow request can dominate the tail percentiles; with only one or two requests in a window, p50, p95, and p99 all converge toward the slowest observed request. Tail percentiles generally need roughly 20 or more samples per window before they are statistically meaningful; below that, values vary widely from window to window.

For example, a low-volume namespace that starts one Workflow every few minutes can report a `temporal_cloud_v1_service_latency_p95` of several hundred milliseconds for `StartWorkflowExecution` that reflects a single request, not systemic latency. When alerting on percentile latency for low-traffic namespaces, gate the alert on a minimum request count (for example, using [`temporal_cloud_v1_service_request_count`](#temporal_cloud_v1_service_request_count)) so that windows with too few samples don't trigger it. These percentiles are pre-calculated per 1-minute window and cannot be re-aggregated into an accurate longer-window percentile, so widening your evaluation window does not by itself make a sparse sample meaningful.

:::
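
The effect described in the note can be sketched in a few lines of Python. This is an illustration only: the nearest-rank percentile method, the sample latencies, the 500 ms threshold, and the `min_requests=20` gate are all assumptions chosen for the example, not a description of how Temporal Cloud computes or evaluates these metrics.

```python
def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile (an assumed method, for illustration only).

    `pct` is an integer percentile such as 50, 95, or 99.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def should_alert(p95_ms, request_count, threshold_ms=500.0, min_requests=20):
    """Hypothetical gate: fire only if the window has enough samples
    AND the reported percentile exceeds the threshold."""
    return request_count >= min_requests and p95_ms > threshold_ms


# A sparse 1-minute window: three fast requests and one slow one (ms).
sparse_window = [40.0, 45.0, 50.0, 900.0]
sparse_p50 = nearest_rank_percentile(sparse_window, 50)
sparse_p95 = nearest_rank_percentile(sparse_window, 95)
sparse_p99 = nearest_rank_percentile(sparse_window, 99)

# A window with a single request: every percentile is that one request.
single_window = [900.0]
single_p50 = nearest_rank_percentile(single_window, 50)
single_p99 = nearest_rank_percentile(single_window, 99)

# The gate suppresses the sparse window but lets a busy window through.
sparse_fires = should_alert(sparse_p95, len(sparse_window))
busy_fires = should_alert(900.0, 25)
```

In this sketch the sparse window's p95 and p99 both equal the single slow request, while the gate keeps the alert from firing because only four requests were observed. The same idea carries over to an alerting rule that combines the latency percentile with a request-count condition.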

### Common Labels

All metrics include these base labels:
20 changes: 20 additions & 0 deletions docs/cloud/metrics/openmetrics/migration-guide.mdx
@@ -137,6 +137,26 @@ accurately aggregated_. For example:
- ✅ Can still view individual namespace/task queue percentiles accurately
- ✅ More accurate percentile calculations for individual series, especially with outliers

:::caution Don't compare a v0 average against a v1 percentile

The v0 latency metrics are a histogram, not a percentile. Dividing `temporal_cloud_v0_service_latency_sum` by
`temporal_cloud_v0_service_latency_count` yields an **average** (a mean, which is not a p50 and is pulled upward by slow
outliers), and a single `temporal_cloud_v0_service_latency_bucket{le="..."}` series only **counts** requests at or below
a threshold. Neither is a p95 or p99.

If your v0 alert compared an average (or a raw `_sum` / `_bucket` value) against a latency threshold, switching to
`temporal_cloud_v1_service_latency_p95` / `_p99` typically reports higher values for identical traffic. This is a
measurement change, not a latency regression.

To migrate latency alerts safely:

- Compare like-for-like. To reproduce a former average-based alert, start on `temporal_cloud_v1_service_latency_p50`,
the closest v1 counterpart (a median rather than a mean, so expect some difference), then move to `_p95` / `_p99`
deliberately once you have set an appropriate threshold.
- Confirm which percentile your SLO targets. The Temporal Cloud [latency SLO](/cloud/service-availability#latency) is a
**p99**; because p95 never exceeds p99 for the same series, alerting on p95 against a p99 threshold trips later than the
SLO and can miss a breach.

:::
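
To make the difference concrete, the sketch below derives a v0-style average and bucket count and v1-style percentiles from the same set of requests. The latency values, the 100 ms bucket boundary, and the nearest-rank percentile method are assumptions made up for the example; they are not taken from Temporal Cloud.

```python
def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile (an assumed method, for illustration only)."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def cumulative_bucket(samples, le):
    """Count of samples at or below `le`, like a histogram `_bucket` series."""
    return sum(1 for s in samples if s <= le)


# Hypothetical traffic: 90 fast requests, 9 slower ones, 1 outlier (ms).
latencies_ms = [20.0] * 90 + [400.0] * 9 + [2000.0]

# v0-style values derived from the histogram.
latency_sum = sum(latencies_ms)
latency_count = len(latencies_ms)
mean_ms = latency_sum / latency_count               # _sum / _count
under_100ms = cumulative_bucket(latencies_ms, 100.0)  # _bucket{le="100"}

# v1-style values: percentiles of the same requests.
p50_ms = nearest_rank_percentile(latencies_ms, 50)
p95_ms = nearest_rank_percentile(latencies_ms, 95)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

# A hypothetical 100 ms threshold behaves very differently per metric.
threshold_ms = 100.0
mean_alert = mean_ms > threshold_ms
p95_alert = p95_ms > threshold_ms
```

With these made-up numbers the mean sits between the p50 and the p95, and the bucket value is a request count rather than a latency. A threshold tuned against the average therefore fires as soon as it is applied unchanged to p95, even though the traffic is identical.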

### 4. Authentication Setup

**Before**: mTLS certificates with customer-specific endpoint