diff --git a/.gitignore b/.gitignore
index 6851ff6e..84343993 100644
--- a/.gitignore
+++ b/.gitignore
@@ -416,3 +416,7 @@ Chart.lock
site/
operator/src/coverage
operator/src/webhook_coverage
+
+# Local helm dev: `make` / `helm dependency update` produces a flattened copy
+# of the chart under operator/src/ — not part of source.
+operator/src/documentdb-chart/
diff --git a/docs/operator-public-documentation/preview/monitoring/metrics.md b/docs/operator-public-documentation/preview/monitoring/metrics.md
new file mode 100644
index 00000000..7f464d9d
--- /dev/null
+++ b/docs/operator-public-documentation/preview/monitoring/metrics.md
@@ -0,0 +1,358 @@
+---
+title: Metrics Reference
+description: Detailed reference of all metrics available when monitoring DocumentDB clusters, with PromQL examples.
+tags:
+ - monitoring
+ - metrics
+ - prometheus
+ - opentelemetry
+---
+
+# Metrics Reference
+
+This page documents the key metrics available when monitoring a DocumentDB cluster, organized by source. Each section includes the metric name, description, labels, and example PromQL queries.
+
+## Container Resource Metrics
+
+These metrics are scraped directly from the kubelet/cAdvisor interface by Prometheus. They cover CPU, memory, network, and filesystem for the **postgres**, **documentdb-gateway**, and **otel-collector** containers in each DocumentDB pod. No collector sidecar is required for these — Prometheus reads them straight from each node's kubelet.
+
+### CPU
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `container_cpu_usage_seconds_total` | Counter | Cumulative CPU time consumed in seconds |
+| `container_spec_cpu_quota` | Gauge | CPU quota (microseconds per `cpu_period`) |
+| `container_spec_cpu_period` | Gauge | CPU CFS scheduling period (microseconds) |
+
+**Common labels:** `namespace`, `pod`, `container`, `node`
+
+#### Example Query
+
+CPU usage rate per container over 5 minutes:
+
+```promql
+rate(container_cpu_usage_seconds_total{
+ container=~"postgres|documentdb-gateway",
+ pod=~".*documentdb.*"
+}[5m])
+```
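+
+If the containers have CPU limits set, usage can also be expressed as a percentage of the limit by combining the quota and period gauges. A sketch (note that `container_spec_cpu_quota` is `-1` when no limit is configured, which makes the ratio meaningless):
+
+```promql
+(
+  rate(container_cpu_usage_seconds_total{
+    container=~"postgres|documentdb-gateway",
+    pod=~".*documentdb.*"
+  }[5m])
+  / on (namespace, pod, container)
+  (
+    container_spec_cpu_quota{
+      container=~"postgres|documentdb-gateway",
+      pod=~".*documentdb.*"
+    }
+    /
+    container_spec_cpu_period{
+      container=~"postgres|documentdb-gateway",
+      pod=~".*documentdb.*"
+    }
+  )
+) * 100
+```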
+
+### Memory
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `container_memory_working_set_bytes` | Gauge | Current working set memory (bytes) |
+| `container_memory_rss` | Gauge | Resident set size (bytes) |
+| `container_memory_cache` | Gauge | Page cache memory (bytes) |
+| `container_spec_memory_limit_bytes` | Gauge | Memory limit (bytes) |
+
+**Common labels:** `namespace`, `pod`, `container`, `node`
+
+#### Example Query
+
+Memory utilization as a percentage of limit:
+
+```promql
+(container_memory_working_set_bytes{
+ container=~"postgres|documentdb-gateway",
+ pod=~".*documentdb.*"
+}
+/ container_spec_memory_limit_bytes{
+ container=~"postgres|documentdb-gateway",
+ pod=~".*documentdb.*"
+}) * 100
+```
+
+### Network
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `container_network_receive_bytes_total` | Counter | Bytes received |
+| `container_network_transmit_bytes_total` | Counter | Bytes transmitted |
+
+**Common labels:** `namespace`, `pod`, `interface`
+
+#### Example Queries
+
+Network throughput (bytes/sec) per pod:
+
+```promql
+sum by (pod) (
+ rate(container_network_receive_bytes_total{
+ pod=~".*documentdb.*"
+ }[5m])
+ + rate(container_network_transmit_bytes_total{
+ pod=~".*documentdb.*"
+ }[5m])
+)
+```
+
+### Filesystem
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `container_fs_usage_bytes` | Gauge | Filesystem usage (bytes) |
+| `container_fs_reads_bytes_total` | Counter | Filesystem read bytes |
+| `container_fs_writes_bytes_total` | Counter | Filesystem write bytes |
+
+**Common labels:** `namespace`, `pod`, `container`, `device`
+
+#### Example Queries
+
+Disk I/O rate for the postgres container:
+
+```promql
+rate(container_fs_writes_bytes_total{
+ container="postgres",
+ pod=~".*documentdb.*"
+}[5m])
+```
+
+## Gateway Metrics
+
+The DocumentDB Gateway exports application-level metrics via OTLP (OpenTelemetry Protocol) push. The gateway sidecar injector automatically sets `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_METRICS_ENABLED=true` on each gateway container, so metrics are exported without manual configuration. Per-pod attribution (`k8s.pod.name`) is added downstream by the collector's resource processor.
+
+Metrics are exported to an OpenTelemetry Collector, which converts them to Prometheus format via the `prometheus` exporter.
+
+!!! note "Gateway metric names may change between versions"
+ The metrics below are emitted by the DocumentDB Gateway binary, which is versioned independently from the operator. Metric names, labels, and semantics may change between gateway releases. Always verify metric availability against the gateway version deployed in your cluster.
+
+### Operations
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `db_client_operations_total` | Counter | Total MongoDB operations processed |
+| `db_client_operation_duration_seconds_total` | Counter | Cumulative operation duration (can be broken down by `db_operation_phase`) |
+
+**Common labels:** `db_operation_name` (e.g., `Find`, `Insert`, `Update`, `Aggregate`, `Delete`), `db_namespace`, `db_system_name`, `pod` (originating pod), `error_type` (set on failed operations)
+
+**Phase labels** (on `db_client_operation_duration_seconds_total`): `db_operation_phase` — values include `pg_query`, `cursor_iteration`, `bson_serialization`, `command_parsing`. An empty phase value represents the total operation duration.
+
+#### Example Queries
+
+Operations per second by command type:
+
+```promql
+sum by (db_operation_name) (
+ rate(db_client_operations_total[1m])
+)
+```
+
+Average latency per operation (milliseconds):
+
+```promql
+sum by (db_operation_name) (
+ rate(db_client_operation_duration_seconds_total{db_operation_phase=""}[1m])
+) / sum by (db_operation_name) (
+ rate(db_client_operations_total[1m])
+) * 1000
+```
+
+Error rate as a percentage:
+
+```promql
+sum(rate(db_client_operations_total{error_type!=""}[1m]))
+/ sum(rate(db_client_operations_total[1m])) * 100
+```
+
+Time spent in each operation phase per second:
+
+```promql
+sum by (db_operation_phase) (
+ rate(db_client_operation_duration_seconds_total{
+ db_operation_phase!=""
+ }[1m])
+)
+```
+
+### Request/Response Size
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `db_client_request_size_bytes_total` | Counter | Cumulative request payload size |
+| `db_client_response_size_bytes_total` | Counter | Cumulative response payload size |
+
+**Common labels:** `pod` (originating pod)
+
+#### Example Queries
+
+Average request throughput (bytes/sec):
+
+```promql
+sum(rate(db_client_request_size_bytes_total[1m]))
+```
+
+## Operator Metrics (controller-runtime)
+
+The DocumentDB operator binary exposes standard controller-runtime metrics on its metrics endpoint. These track reconciliation performance and work queue health.
+
+### Reconciliation
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `controller_runtime_reconcile_total` | Counter | Total reconciliations |
+| `controller_runtime_reconcile_errors_total` | Counter | Total reconciliation errors |
+| `controller_runtime_reconcile_time_seconds` | Histogram | Time spent in reconciliation |
+
+**Common labels:** `controller` (e.g., `documentdb-controller`, `backup-controller`, `scheduled-backup-controller`, `certificate-controller`, `pv-controller`), `result` (`success`, `error`, `requeue`, `requeue_after`)
+
+#### Example Queries
+
+Reconciliation error rate by controller:
+
+```promql
+sum by (controller) (
+ rate(controller_runtime_reconcile_errors_total[5m])
+)
+```
+
+P95 reconciliation latency for the DocumentDB controller:
+
+```promql
+histogram_quantile(0.95,
+ sum by (le) (
+ rate(controller_runtime_reconcile_time_seconds_bucket{
+ controller="documentdb-controller"
+ }[5m])
+ )
+)
+```
+
+Reconciliation throughput (reconciles/sec):
+
+```promql
+sum by (controller) (
+ rate(controller_runtime_reconcile_total[5m])
+)
+```
+
+### Work Queue
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `workqueue_depth` | Gauge | Current number of items in the queue |
+| `workqueue_adds_total` | Counter | Total items added |
+| `workqueue_queue_duration_seconds` | Histogram | Time items spend in queue before processing |
+| `workqueue_work_duration_seconds` | Histogram | Time spent processing items |
+| `workqueue_retries_total` | Counter | Total retries |
+
+**Common labels:** `name` (queue name, maps to controller name)
+
+#### Example Queries
+
+Work queue depth by controller:
+
+```promql
+workqueue_depth{name=~"documentdb-controller|backup-controller|scheduled-backup-controller|certificate-controller"}
+```
+
+Average time items spend waiting in queue:
+
+```promql
+rate(workqueue_queue_duration_seconds_sum{name="documentdb-controller"}[5m])
+/ rate(workqueue_queue_duration_seconds_count{name="documentdb-controller"}[5m])
+```
+
+## CNPG / PostgreSQL Metrics
+
+The `cnpg_*` metrics below come from CloudNative-PG's built-in Prometheus endpoint, which the DocumentDB operator does **not** enable by default. They are only available if you manually configure CNPG monitoring on the underlying Cluster resource.
+
+For the full CNPG metrics list, see the [CloudNative-PG monitoring docs](https://cloudnative-pg.io/documentation/current/monitoring/).
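+
+As a sketch, CNPG's built-in exporter is typically enabled through the `monitoring` stanza of the underlying `Cluster` resource (the field below follows the CNPG v1 API and requires the Prometheus Operator CRDs; note that the DocumentDB operator owns the `Cluster` object, so manual edits may be reconciled away):
+
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+  name: my-cluster   # hypothetical cluster name
+spec:
+  monitoring:
+    enablePodMonitor: true   # creates a PodMonitor for the :9187 instance-manager endpoint
+```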
+
+### Replication
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `cnpg_pg_replication_lag` | Gauge | Replication lag in seconds (CNPG) |
+| `postgresql_replication_data_delay_bytes` | Gauge | Replication data delay in bytes (OTel PG receiver) |
+
+#### Example Queries
+
+Replication lag per pod:
+
+```promql
+cnpg_pg_replication_lag{pod=~".*documentdb.*"}
+```
+
+### Connections
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `cnpg_pg_stat_activity_count` | Gauge | Active backend connections by state (CNPG) |
+| `postgresql_backends` | Gauge | Number of backends (OTel PG receiver) |
+| `postgresql_connection_max` | Gauge | Maximum connections (OTel PG receiver) |
+
+#### Example Queries
+
+Active connections by state:
+
+```promql
+sum by (state) (
+ cnpg_pg_stat_activity_count{pod=~".*documentdb.*"}
+)
+```
+
+Backend utilization:
+
+```promql
+postgresql_backends / postgresql_connection_max * 100
+```
+
+### Storage
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `cnpg_pg_database_size_bytes` | Gauge | Total database size (CNPG) |
+| `postgresql_db_size_bytes` | Gauge | Database size (OTel PG receiver) |
+| `postgresql_wal_age_seconds` | Gauge | WAL age (OTel PG receiver) |
+
+#### Example Queries
+
+Database size in GiB:
+
+```promql
+postgresql_db_size_bytes / 1024 / 1024 / 1024
+```
+
+### Operations
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `postgresql_commits_total` | Counter | Total committed transactions |
+| `postgresql_rollbacks_total` | Counter | Total rolled-back transactions |
+| `postgresql_operations_total` | Counter | Row operations (labels: `operation`) |
+
+#### Example Queries
+
+Transaction rate:
+
+```promql
+rate(postgresql_commits_total[1m])
+```
+
+Row operations per second by type:
+
+```promql
+sum by (operation) (rate(postgresql_operations_total[1m]))
+```
+
+### Cluster Health
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `cnpg_collector_up` | Gauge | 1 if the CNPG metrics collector is running |
+| `cnpg_pg_postmaster_start_time` | Gauge | PostgreSQL start timestamp |
+
+#### Example Queries
+
+Detect pods where the metrics collector is down:
+
+```promql
+cnpg_collector_up{pod=~".*documentdb.*"} == 0
+```
+
+## OpenTelemetry vs Prometheus Metric Names
+
+The current architecture scrapes container/node metrics directly from kubelet/cAdvisor, so all dashboards and queries in this repo use the **Prometheus/cAdvisor** naming convention (`container_cpu_usage_seconds_total`, `container_memory_working_set_bytes`, `container_network_receive_bytes_total`, …).
+
+If you forward the same data through an OpenTelemetry pipeline (for example, in a multi-tenant cloud setup), the OTel `kubeletstats` receiver uses different names (`k8s.container.cpu.time`, `k8s.container.memory.usage`, etc.) and the OTel Prometheus exporter converts dots to underscores. Adjust your queries accordingly when crossing pipelines.
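+
+For example, the same CPU question asked against each pipeline might look roughly like this (the OTel names are indicative only; exact metric and label names depend on the `kubeletstats` receiver version and exporter settings):
+
+```promql
+# cAdvisor naming (used throughout this repo)
+rate(container_cpu_usage_seconds_total{container="postgres"}[5m])
+
+# OTel kubeletstats → prometheus exporter naming (dots become underscores)
+rate(k8s_container_cpu_time{k8s_container_name="postgres"}[5m])
+```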
diff --git a/docs/operator-public-documentation/preview/monitoring/overview.md b/docs/operator-public-documentation/preview/monitoring/overview.md
new file mode 100644
index 00000000..339541b8
--- /dev/null
+++ b/docs/operator-public-documentation/preview/monitoring/overview.md
@@ -0,0 +1,228 @@
+---
+title: Monitoring Overview
+description: How to monitor DocumentDB clusters using the in-pod OpenTelemetry Collector sidecar, Prometheus, and Grafana.
+tags:
+ - monitoring
+ - observability
+ - metrics
+ - opentelemetry
+---
+
+# Monitoring Overview
+
+This guide describes how to monitor DocumentDB clusters running on Kubernetes using the operator's built-in OpenTelemetry Collector sidecar, Prometheus, and Grafana.
+
+## Prerequisites
+
+- A running Kubernetes cluster with the DocumentDB operator installed
+- [Helm 3](https://helm.sh/docs/intro/install/) for deploying Prometheus and Grafana
+- [kubectl](https://kubernetes.io/docs/tasks/tools/) configured for your cluster
+- [`jq`](https://jqlang.github.io/jq/) for processing JSON in verification commands
+
+## Architecture
+
+When `spec.monitoring.enabled: true` is set on a `DocumentDB` resource, the operator instructs the CNPG sidecar-injector plugin to add an **`otel-collector` container to every cluster pod**. The collector runs alongside the `postgres` and `documentdb-gateway` containers and exposes a Prometheus `/metrics` endpoint that any scraper (Prometheus, Datadog Agent, Grafana Alloy, …) can consume.
+
+```mermaid
+graph TB
+ subgraph pod["DocumentDB Pod (one per cluster instance)"]
+ pg["postgres :5432"]
+ gw["documentdb-gateway :10260
+        OTLP push → localhost:4317"]
+ otel["otel-collector sidecar
+        otlp :4317 in / prometheus :9188 out"]
+ gw -. OTLP gRPC .-> otel
+ end
+
+ prom["Prometheus
+    (annotation-based pod discovery)"]
+ cadv["kubelet / cAdvisor
+    (per-node)"]
+ grafana["Grafana"]
+
+ prom -- "scrape :9188" --> otel
+ prom -- "/metrics/cadvisor" --> cadv
+ prom --> grafana
+```
+
+Key points:
+
+- **One collector per pod** — no central Deployment, ExternalName bridge, or per-instance Service is required.
+- **Gateway → sidecar push** — the sidecar-injector sets `OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4317` and `OTEL_METRICS_ENABLED=true` on the gateway container so it pushes its OTLP metrics to the co-located collector. Per-pod attribution (`k8s.pod.name`) is added downstream by the collector's `resource` processor on every metric.
+- **Prometheus discovery via pod annotations** — the injector adds `prometheus.io/scrape=true`, `prometheus.io/port=`, and `prometheus.io/path=/metrics` annotations so a standard pod-annotation scrape config works out of the box.
+- **Container & node metrics come from kubelet/cAdvisor directly** — Prometheus scrapes them natively; no DaemonSet collector is needed.
+
+### Enabling the sidecar
+
+```yaml
+apiVersion: db.microsoft.com/preview
+kind: DocumentDB
+metadata:
+ name: my-cluster
+spec:
+ monitoring:
+ enabled: true
+ exporter:
+ prometheus:
+ port: 9188 # the port the sidecar exposes /metrics on; pick a port distinct from CNPG's instance manager (9187)
+```
+
+Once applied, the operator generates an `OpenTelemetryConfig` ConfigMap and the sidecar-injector adds the `otel-collector` container to every CNPG instance pod. Pods then report 3/3 containers ready (`postgres` + `documentdb-gateway` + `otel-collector`).
+
+### Collector pipeline (current)
+
+The collector ships with a minimal pipeline:
+
+| Stage | Components |
+|-------|------------|
+| Receivers | `otlp` (gRPC `127.0.0.1:4317`, loopback only — gateway and collector share the pod network namespace) and `sqlquery` (a stub `documentdb.postgres.up` query) |
+| Processors | `batch`, `resource` (adds `documentdb.cluster`, `k8s.namespace.name`, `k8s.pod.name` to every metric) |
+| Exporters | `prometheus` on the configured port (default `8888`; the playground uses `9188` to avoid collision with CNPG's instance manager on `9187`) |
+
+The pipeline is deep-merged from an embedded static config (`base_config.yaml`) and a dynamic config rendered by the operator. Changes to either trigger a content-hash update on the ConfigMap; the sidecar-injector compares hashes and rolls pods only when the config actually changes.
+
+## Prometheus Integration
+
+### Scraping the in-pod sidecar
+
+The operator sets these annotations on every DocumentDB pod when monitoring is enabled, so a single pod-annotation-based scrape job in Prometheus is sufficient:
+
+```yaml
+- job_name: documentdb-otel-sidecar
+ kubernetes_sd_configs:
+ - role: pod
+ relabel_configs:
+ - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
+ action: keep
+ regex: "true"
+ - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
+ action: replace
+ regex: ([^:]+)(?::\d+)?;(\d+)
+ replacement: $1:$2
+ target_label: __address__
+ - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
+ action: replace
+ target_label: __metrics_path__
+ regex: (.+)
+```
+
+For Prometheus Operator users, a `PodMonitor` selecting the same labels achieves the equivalent effect.
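+
+A sketch of such a `PodMonitor`; the label selector and port name below are assumptions, so match them to what the operator actually sets on your pods:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: documentdb-otel-sidecar
+  namespace: documentdb-preview-ns
+spec:
+  selector:
+    matchLabels:
+      app: documentdb-preview      # assumption: use your DocumentDB pods' labels
+  podMetricsEndpoints:
+    - port: metrics                # assumes the sidecar's container port is named "metrics"
+      path: /metrics
+```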
+
+### Container & node metrics
+
+Use Prometheus's native `kubelet` and `cadvisor` scrape jobs (no collector required):
+
+```yaml
+- job_name: kubelet-cadvisor
+ kubernetes_sd_configs:
+ - role: node
+ scheme: https
+ tls_config: { insecure_skip_verify: true }
+ bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
+ metrics_path: /metrics/cadvisor
+ relabel_configs:
+ - target_label: __address__
+ replacement: kubernetes.default.svc:443
+ - source_labels: [__meta_kubernetes_node_name]
+ regex: (.+)
+ target_label: __metrics_path__
+ replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
+```
+
+### Operator metrics
+
+The DocumentDB operator exposes a controller-runtime metrics endpoint. By default:
+
+- **Bind address**: controlled by `--metrics-bind-address` (default `0`, disabled)
+- **Secure mode**: `--metrics-secure=true` serves via HTTPS with authn/authz
+- **Certificates**: supply `--metrics-cert-path` for custom TLS, otherwise self-signed certs are generated
+
+Enable the endpoint by setting the bind address (e.g. `:8443`) and creating a `Service` + `ServiceMonitor` to scrape it.
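+
+A hedged sketch of that wiring; the names, labels, and port below are assumptions, so align them with your operator Deployment:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: documentdb-operator-metrics
+  namespace: documentdb-operator
+  labels:
+    app: documentdb-operator       # assumption: match the operator Deployment's labels
+spec:
+  selector:
+    app: documentdb-operator
+  ports:
+    - name: https-metrics
+      port: 8443
+      targetPort: 8443
+---
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: documentdb-operator
+  namespace: documentdb-operator
+spec:
+  selector:
+    matchLabels:
+      app: documentdb-operator
+  endpoints:
+    - port: https-metrics
+      scheme: https
+      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
+      tlsConfig:
+        insecureSkipVerify: true   # or provide the CA that signed --metrics-cert-path
+```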
+
+## Key Metrics
+
+### Gateway application metrics (via OTLP push to the sidecar)
+
+| Metric | Description |
+|--------|-------------|
+| `db_client_operations_total` | Total MongoDB operations by command type |
+| `db_client_operation_duration_seconds_total` | Cumulative operation latency |
+| `db_client_request_size_bytes_total` | Cumulative request payload size |
+| `db_client_response_size_bytes_total` | Cumulative response payload size |
+
+All gateway metrics carry a `pod` label (set by the Prometheus scrape relabel from `__meta_kubernetes_pod_name`) identifying the originating pod.
+
+### Container & node metrics (cAdvisor)
+
+| Metric | Description |
+|--------|-------------|
+| `container_cpu_usage_seconds_total` | Cumulative CPU time consumed (rate it for CPU usage) |
+| `container_memory_working_set_bytes` | Working-set memory (matches OOM accounting) |
+| `container_memory_rss` | Resident set size |
+| `container_network_receive_bytes_total` | Network bytes received (pod-level) |
+| `container_network_transmit_bytes_total` | Network bytes transmitted (pod-level) |
+| `container_fs_usage_bytes` | Filesystem usage per container |
+
+Filter by `namespace`, `pod`, and `container` labels (e.g. `container="documentdb-gateway"` or `container="postgres"`).
+
+### PostgreSQL metrics
+
+The CNPG instance manager exposes a separate `/metrics` endpoint (port `9187`) that can be scraped directly; it is independent of the OTel sidecar pipeline.
+
+### Controller-runtime metrics
+
+| Metric | Description |
+|--------|-------------|
+| `controller_runtime_reconcile_total` | Total reconciliations by controller and result |
+| `controller_runtime_reconcile_errors_total` | Total reconciliation errors |
+| `controller_runtime_reconcile_time_seconds` | Reconciliation duration histogram |
+| `workqueue_depth` | Current depth of the work queue |
+
+## Telemetry Playground
+
+The [`documentdb-playground/telemetry/local/`](https://github.com/documentdb/documentdb-kubernetes-operator/tree/main/documentdb-playground/telemetry/local) directory contains a self-contained Kind-based reference implementation:
+
+- 3-instance DocumentDB HA cluster (1 primary + 2 streaming replicas) with `spec.monitoring.enabled: true`
+- Prometheus configured with pod-annotation discovery + native kubelet/cAdvisor scrape
+- Grafana with two pre-built dashboards (Gateway, Container & Node Resources)
+- Traffic generator for demo workload
+- Operator chart installed **from the local working tree** so in-tree operator changes are exercised
+
+```bash
+cd documentdb-playground/telemetry/local
+./scripts/deploy.sh
+./scripts/validate.sh
+```
+
+See its [README](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/telemetry/local/README.md) for full instructions.
+
+## Verification
+
+After deploying the monitoring stack, confirm metrics are flowing:
+
+```bash
+NS=documentdb-preview-ns
+
+# 1. Pods are 3/3 (postgres + gateway + otel-collector)
+kubectl get pods -n $NS -l app=documentdb-preview
+
+# 2. The otel-collector sidecar exists on each pod
+kubectl get pod -n $NS -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{","}{end}{"\n"}{end}'
+
+# 3. Prometheus scrape target is UP (port-forward first)
+kubectl port-forward svc/prometheus 9090:9090 -n observability &
+curl -s 'http://localhost:9090/api/v1/query?query=up{job="documentdb-otel-sidecar"}' | jq '.data.result'
+
+# 4. Gateway metrics are present
+curl -s 'http://localhost:9090/api/v1/query?query=db_client_operations_total' | jq '.data.result | length'
+
+# 5. cAdvisor container metrics are present
+curl -s 'http://localhost:9090/api/v1/query?query=container_cpu_usage_seconds_total{namespace="documentdb-preview-ns"}' | jq '.data.result | length'
+```
+
+If no metrics appear, check:
+
+- `spec.monitoring.enabled: true` is set on the `DocumentDB` resource
+- Pods are 3/3; if not, check `kubectl logs deploy/documentdb-operator -n documentdb-operator` and the sidecar-injector logs
+- The sidecar is healthy: `kubectl logs <pod-name> -c otel-collector -n $NS`
+- Prometheus has RBAC to read pods, nodes, `nodes/proxy`, and `nodes/metrics`
+
+## Next Steps
+
+- [Metrics Reference](metrics.md) — detailed metric descriptions and PromQL examples
diff --git a/documentdb-playground/telemetry/local/README.md b/documentdb-playground/telemetry/local/README.md
new file mode 100644
index 00000000..aa761470
--- /dev/null
+++ b/documentdb-playground/telemetry/local/README.md
@@ -0,0 +1,243 @@
+# DocumentDB Telemetry Playground (Local)
+
+A metrics-focused observability stack for DocumentDB running on a local Kind cluster. Deploys a 3-instance HA cluster with the in-pod OTel Collector sidecar enabled and pre-configured Grafana dashboards for **gateway** and **container/node** metrics out of the box.
+
+## Prerequisites
+
+- **Docker** (running)
+- **kind** ≥ v0.20 — [install](https://kind.sigs.k8s.io/docs/user/quick-start/#installation)
+- **kubectl**
+- **Helm 3** — [install](https://helm.sh/docs/intro/install/)
+- **jq** — for JSON processing in deploy scripts
+
+> **⚠️ Gateway image requirement.** Out of the box, this playground pins
+> `udsmiley/documentdb-gateway-otel:k8s-pgmongo-main-latest` in
+> `k8s/documentdb/cluster.yaml`. The default upstream gateway image does **not**
+> yet emit OTLP `db_client_*` metrics — that instrumentation lives in the
+> pgmongo project and has not been published in an upstream release. Once an
+> upstream release ships with OTel support, swap the pin back to the default.
+> See [Gateway image](#gateway-image) for build instructions.
+
+## Quick Start
+
+```bash
+cd documentdb-playground/telemetry/local
+
+# 1. Deploy everything (Kind cluster + operator from this branch + observability + DocumentDB + traffic)
+./scripts/deploy.sh
+
+# 2. Open Grafana (admin/admin, anonymous access enabled)
+kubectl port-forward svc/grafana 3000:3000 -n observability --context kind-documentdb-telemetry
+# → http://localhost:3000 (Dashboards are in the "DocumentDB" folder)
+
+# 3. Open Prometheus (optional)
+kubectl port-forward svc/prometheus 9090:9090 -n observability --context kind-documentdb-telemetry
+# → http://localhost:9090
+
+# 4. Validate data is flowing
+./scripts/validate.sh
+```
+
+`deploy.sh` is idempotent — re-running it after a failure will skip already-completed steps.
+
+The operator chart is installed **from this branch** (`operator/documentdb-helm-chart/`), not from the public Helm repo, so any in-tree operator changes (e.g. updates to `base_config.yaml`) are exercised end-to-end.
+
+### What gets deployed
+
+| Component | Namespace | Description |
+|-----------|-----------|-------------|
+| Kind cluster | — | 4-node cluster (1 control-plane + 3 workers) with local registry |
+| cert-manager | `cert-manager` | TLS certificate management |
+| DocumentDB operator | `documentdb-operator` | Operator + CNPG (Helm chart from this branch) |
+| DocumentDB HA cluster | `documentdb-preview-ns` | 1 primary + 2 streaming replicas |
+| OTel Collector sidecar | `documentdb-preview-ns` | One per pod, injected by the operator's CNPG sidecar plugin when `spec.monitoring.enabled=true` |
+| Prometheus | `observability` | Metrics storage + alerting rules; scrapes pods via annotation discovery + kubelet/cAdvisor directly |
+| Grafana | `observability` | Dashboards (Gateway + Internals) |
+| Traffic generators | `documentdb-preview-ns` | Read/write workload via mongosh |
+
+There is **no central OTel Collector Deployment** and **no per-node DaemonSet** — every signal comes either from the per-pod sidecar (gateway metrics) or straight from kubelet/cAdvisor (container/node metrics).
+
+### Gateway image
+
+The playground pins a community-built gateway image
+(`udsmiley/documentdb-gateway-otel:k8s-pgmongo-main-latest`) that includes the
+OTLP metrics exporter required by the dashboards. Two reasons that pin exists:
+
+1. The upstream `documentdb-gateway` image (built from this repo's `main`) does
+ not yet enable OTLP metrics — it gates initialization behind
+ `OTEL_METRICS_ENABLED=true` (set by the operator's sidecar-injector) **and**
+ the OTel exporter library being linked in (only true for builds based on a
+ recent pgmongo `oss/main`).
+2. Once an upstream release lands with OTel metrics, edit
+ `k8s/documentdb/cluster.yaml` and either remove the `gatewayImage:` line
+ (to fall back to the operator default) or point it at the new tag.
+
+#### Building your own from pgmongo
+
+```bash
+# 1. Build the deb + emulator-shape image from pgmongo
+cd ~/repos/pgmongo/oss
+./packaging/build_packages.sh --os deb13 --pg 17 --output-dir downloaded-artifacts
+docker build \
+ -f packaging/gateway/docker/Dockerfile_documentdb_local \
+ --build-arg DEB_PACKAGE_REL_PATH=downloaded-artifacts/.deb \
+ --build-arg POSTGRES_VERSION=17 \
+ --build-arg BASE_IMAGE=debian:trixie-slim \
+ -t my-gateway:emulator .
+
+# 2. Re-wrap as the slim K8s shape using this repo's Dockerfile
+cd ~/repos/documentdb-kubernetes-operator
+docker build \
+ -f .github/dockerfiles/Dockerfile_gateway_public_image \
+ --build-arg SOURCE_IMAGE=my-gateway:emulator \
+ -t localhost:5001/documentdb-gateway:dev .
+docker push localhost:5001/documentdb-gateway:dev
+
+# 3. Point cluster.yaml at it
+# gatewayImage: "localhost:5001/documentdb-gateway:dev"
+```
+
+## Architecture
+
+```mermaid
+graph TB
+ subgraph cluster["Kind Cluster (documentdb-telemetry)"]
+ subgraph obs["observability namespace"]
+ prometheus["Prometheus
+        annotation discovery + kubelet/cAdvisor scrape"]
+ grafana["Grafana
+        :3000
+        Gateway + Internals dashboards"]
+ prometheus --> grafana
+ end
+
+ subgraph docdb["documentdb-preview-ns"]
+ subgraph pod1["Pod: preview-1 (primary)"]
+ pg1["postgres :5432"]
+ gw1["documentdb-gateway :10260
+            OTLP push → :4317"]
+ otel1["otel-collector sidecar
+            OTLP :4317 in / Prom :9188 out"]
+ gw1 -. OTLP .-> otel1
+ end
+ subgraph pod2["Pod: preview-2 (replica)"]
+ pg2["postgres :5432"]
+ gw2["documentdb-gateway :10260"]
+ otel2["otel-collector sidecar"]
+ gw2 -. OTLP .-> otel2
+ end
+ subgraph pod3["Pod: preview-3 (replica)"]
+ pg3["postgres :5432"]
+ gw3["documentdb-gateway :10260"]
+ otel3["otel-collector sidecar"]
+ gw3 -. OTLP .-> otel3
+ end
+ traffic_rw["Traffic Gen (RW)"]
+ traffic_ro["Traffic Gen (RO)"]
+ end
+
+ traffic_rw --> gw1
+ traffic_ro --> gw2
+ traffic_ro --> gw3
+ prometheus -- "scrape :9188 (annotation)" --> otel1
+ prometheus -- "scrape :9188 (annotation)" --> otel2
+ prometheus -- "scrape :9188 (annotation)" --> otel3
+ prometheus -- "/metrics/cadvisor" --> kubelet["kubelet (each node)"]
+ end
+
+ user["Browser"] --> grafana
+```
+
+## Directory Layout
+
+```
+local/
+├── scripts/
+│ ├── setup-kind.sh # Creates Kind cluster + local registry
+│ ├── deploy.sh # One-command full deployment
+│ ├── validate.sh # Health check — verifies sidecar + data flow
+│ └── teardown.sh # Deletes cluster and proxy containers
+├── k8s/
+│ ├── observability/ # Namespace, Prometheus (with annotation discovery + kubelet scrape), Grafana
+│ ├── documentdb/ # DocumentDB CR (with spec.monitoring.enabled) + credentials
+│ └── traffic/ # Traffic generator services + jobs
+└── dashboards/
+ ├── gateway.json # Gateway metrics dashboard
+ └── internals.json # Container & node resources dashboard
+```
+
+## Dashboards
+
+Two dashboards are auto-provisioned in the **DocumentDB** folder:
+
+| Dashboard | Description |
+|-----------|-------------|
+| **Gateway** | Request rates, average latency, error rates, document throughput, request/response sizes, gateway container CPU/memory and pod network I/O |
+| **Internals** | Container CPU / memory (working set, RSS) / filesystem usage, pod network rx/tx, node-level memory available, sidecar pod count |
+
+Dashboards auto-refresh every 30 seconds. Edits made in the Grafana UI persist until the pod restarts.
+
+## Alerting Rules
+
+Prometheus includes sample alerting rules:
+
+| Alert | Condition |
+|-------|-----------|
+| **GatewayHighErrorRate** | Error rate > 5% for 5 minutes |
+| **GatewayDown** | No gateway metrics for 2 minutes |
+| **ContainerHighMemory** | Informational — container memory observed |
+
+View firing alerts at `http://localhost:9090/alerts` (after port-forwarding Prometheus).
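+
+As an illustration, the high-error-rate rule could be written along these lines; the thresholds mirror the table above, but the rule actually shipped in `k8s/observability/` may differ:
+
+```yaml
+groups:
+  - name: documentdb-gateway
+    rules:
+      - alert: GatewayHighErrorRate
+        expr: |
+          sum(rate(db_client_operations_total{error_type!=""}[5m]))
+            / sum(rate(db_client_operations_total[5m])) > 0.05
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Gateway error rate above 5% for 5 minutes"
+```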
+
+## Validation
+
+After deployment, verify everything is working:
+
+```bash
+./scripts/validate.sh
+```
+
+This checks: pods running, the `otel-collector` sidecar is injected on every DocumentDB pod, Prometheus has active targets, the sidecar scrape job is UP, and gateway + cAdvisor metrics are present.
+
+## Restarting Traffic Generators
+
+Traffic generators run as Kubernetes Jobs. To restart them:
+
+```bash
+CONTEXT="kind-documentdb-telemetry"
+NS="documentdb-preview-ns"
+
+# Delete completed jobs
+kubectl delete job traffic-generator-rw traffic-generator-ro -n $NS --context $CONTEXT --ignore-not-found
+
+# Re-apply
+kubectl apply -f k8s/traffic/ --context $CONTEXT
+```
+
+## Teardown
+
+```bash
+./scripts/teardown.sh
+```
+
+This deletes the Kind cluster and any proxy containers. The local Docker registry is kept for reuse.
+
+## Troubleshooting
+
+**Gateway metrics missing (`db_client_operations_total` = 0)**
+
+- Check that traffic generators are running: `kubectl get jobs -n documentdb-preview-ns --context kind-documentdb-telemetry`. If completed, restart them (see [Restarting Traffic Generators](#restarting-traffic-generators)).
+- Verify the gateway image includes the OpenTelemetry metrics instrumentation; the gateway must be built from a version that emits the `db_client_*` metrics (the playground pins such a `gatewayImage` in `k8s/documentdb/cluster.yaml` for this reason).
+- Verify the sidecar is healthy: `kubectl logs documentdb-preview-1 -c otel-collector -n documentdb-preview-ns`. The sidecar should be listening on `127.0.0.1:4317` (gRPC, loopback only) and serving `/metrics` on the configured Prometheus port.
+
+**OTel sidecar not injected**
+
+- Confirm `spec.monitoring.enabled: true` is set on the `DocumentDB` CR.
+- Check the operator logs for the OTel ConfigMap reconciliation: `kubectl logs -n documentdb-operator deploy/documentdb-operator | grep -i otel`.
+- Confirm pods have 3/3 containers: `kubectl get pods -n documentdb-preview-ns -l app=documentdb-preview`.
+
+**`deploy.sh` fails at "Installing DocumentDB operator"**
+
+- Ensure Helm chart dependencies can be fetched: `cd operator/documentdb-helm-chart && helm dependency update`.
+- Ensure you have internet access for the CNPG Helm dependency.
+
+**Pods stuck in `Pending` or `ImagePullBackOff`**
+
+- Check Docker has enough resources allocated (recommended: 8GB RAM, 4 CPUs).
+- Verify the Kind node image exists: `docker images kindest/node:v1.35.0`.
+
diff --git a/documentdb-playground/telemetry/local/dashboards/gateway.json b/documentdb-playground/telemetry/local/dashboards/gateway.json
new file mode 100644
index 00000000..a5fbb56f
--- /dev/null
+++ b/documentdb-playground/telemetry/local/dashboards/gateway.json
@@ -0,0 +1,710 @@
+{
+ "uid": "documentdb-gateway",
+ "title": "DocumentDB Gateway",
+ "tags": [
+ "documentdb",
+ "gateway"
+ ],
+ "timezone": "browser",
+ "editable": true,
+ "graphTooltip": 1,
+ "time": {
+ "from": "now-1h",
+ "to": "now"
+ },
+ "templating": {
+ "list": [
+ {
+ "current": {
+ "selected": true,
+ "text": "All",
+ "value": "$__all"
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "definition": "label_values(db_client_operations_total, pod)",
+ "hide": 0,
+ "includeAll": true,
+ "multi": true,
+ "name": "instance",
+ "options": [],
+ "query": {
+ "query": "label_values(db_client_operations_total, pod)"
+ },
+ "refresh": 2,
+ "regex": "",
+ "skipUrlSync": false,
+ "sort": 1,
+ "type": "query",
+ "label": "Instance",
+ "allValue": ".*"
+ }
+ ]
+ },
+ "links": [
+ {
+ "title": "DocumentDB Internals",
+ "type": "link",
+ "icon": "bolt",
+ "url": "/d/documentdb-internals/documentdb-internals",
+ "tooltip": "Database & Infrastructure metrics"
+ }
+ ],
+ "panels": [
+ {
+ "type": "stat",
+ "title": "Operations/sec",
+ "id": 1,
+ "gridPos": {
+ "h": 4,
+ "w": 6,
+ "x": 0,
+ "y": 0
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "unit": "ops",
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ },
+ {
+ "color": "yellow",
+ "value": 1000
+ },
+ {
+ "color": "red",
+ "value": 5000
+ }
+ ]
+ }
+ }
+ },
+ "options": {
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ]
+ },
+ "colorMode": "value",
+ "graphMode": "area"
+ },
+ "targets": [
+ {
+ "refId": "A",
+ "expr": "sum(rate(db_client_operations_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "ops/sec"
+ }
+ ]
+ },
+ {
+ "type": "stat",
+ "title": "Avg Latency (ms)",
+ "id": 2,
+ "gridPos": {
+ "h": 4,
+ "w": 6,
+ "x": 6,
+ "y": 0
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "unit": "ms",
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ },
+ {
+ "color": "yellow",
+ "value": 100
+ },
+ {
+ "color": "red",
+ "value": 500
+ }
+ ]
+ }
+ }
+ },
+ "options": {
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ]
+ },
+ "colorMode": "value",
+ "graphMode": "area"
+ },
+ "targets": [
+ {
+ "refId": "A",
+ "expr": "sum(rate(db_client_operation_duration_seconds_total{pod=~\"$instance\",db_operation_phase=\"\"}[1m])) / sum(rate(db_client_operations_total{pod=~\"$instance\"}[1m])) * 1000",
+ "legendFormat": "avg latency"
+ }
+ ]
+ },
+ {
+ "type": "stat",
+ "title": "Error Rate %",
+ "id": 3,
+ "gridPos": {
+ "h": 4,
+ "w": 6,
+ "x": 12,
+ "y": 0
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "unit": "percent",
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ },
+ {
+ "color": "yellow",
+ "value": 1
+ },
+ {
+ "color": "red",
+ "value": 5
+ }
+ ]
+ }
+ }
+ },
+ "options": {
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ]
+ },
+ "colorMode": "value",
+ "graphMode": "area"
+ },
+ "targets": [
+ {
+ "refId": "A",
+ "expr": "sum(rate(db_client_operations_total{pod=~\"$instance\",error_type!=\"\"}[1m])) / sum(rate(db_client_operations_total{pod=~\"$instance\"}[1m])) * 100",
+ "legendFormat": "error rate"
+ }
+ ]
+ },
+ {
+ "type": "row",
+ "title": "Traffic & Performance",
+ "collapsed": true,
+ "panels": [
+ {
+ "type": "timeseries",
+ "title": "Ops/sec by Operation",
+ "id": 14,
+ "gridPos": {
+ "x": 0,
+ "y": 0,
+ "w": 12,
+ "h": 8
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "unit": "ops",
+ "custom": {
+ "drawStyle": "line",
+ "lineInterpolation": "smooth",
+ "fillOpacity": 10,
+ "spanNulls": false
+ }
+ }
+ },
+ "options": {
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ },
+ "legend": {
+ "displayMode": "list",
+ "placement": "bottom"
+ }
+ },
+ "targets": [
+ {
+ "refId": "A",
+ "expr": "sum by (db_operation_name) (rate(db_client_operations_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "{{db_operation_name}}"
+ }
+ ]
+ },
+ {
+ "type": "timeseries",
+ "title": "Latency by Operation",
+ "id": 15,
+ "gridPos": {
+ "x": 12,
+ "y": 0,
+ "w": 12,
+ "h": 8
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "unit": "ms",
+ "custom": {
+ "drawStyle": "line",
+ "lineInterpolation": "smooth",
+ "fillOpacity": 10,
+ "spanNulls": false
+ }
+ }
+ },
+ "options": {
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ },
+ "legend": {
+ "displayMode": "list",
+ "placement": "bottom"
+ }
+ },
+ "targets": [
+ {
+ "refId": "A",
+ "expr": "sum by (db_operation_name) (rate(db_client_operation_duration_seconds_total{pod=~\"$instance\",db_operation_phase=\"\"}[1m])) / sum by (db_operation_name) (rate(db_client_operations_total{pod=~\"$instance\"}[1m])) * 1000",
+ "legendFormat": "{{db_operation_name}}"
+ }
+ ]
+ },
+ {
+ "type": "timeseries",
+ "title": "Documents/sec",
+ "id": 8,
+ "gridPos": {
+ "x": 0,
+ "y": 8,
+ "w": 8,
+ "h": 8
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "unit": "docs/s",
+ "custom": {
+ "drawStyle": "line",
+ "lineInterpolation": "smooth",
+ "fillOpacity": 10,
+ "spanNulls": false
+ }
+ }
+ },
+ "options": {
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ },
+ "legend": {
+ "displayMode": "list",
+ "placement": "bottom"
+ }
+ },
+ "targets": [
+ {
+ "refId": "A",
+ "expr": "sum(rate(db_client_documents_returned_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "returned"
+ },
+ {
+ "refId": "B",
+ "expr": "sum(rate(db_client_documents_inserted_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "inserted"
+ },
+ {
+ "refId": "C",
+ "expr": "sum(rate(db_client_documents_updated_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "updated"
+ },
+ {
+ "refId": "D",
+ "expr": "sum(rate(db_client_documents_deleted_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "deleted"
+ }
+ ]
+ },
+ {
+ "type": "timeseries",
+ "title": "Documents/sec by Collection",
+ "id": 9,
+ "gridPos": {
+ "x": 8,
+ "y": 8,
+ "w": 8,
+ "h": 8
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "unit": "docs/s",
+ "custom": {
+ "lineWidth": 1,
+ "fillOpacity": 30,
+ "showPoints": "never",
+ "stacking": {
+ "mode": "normal",
+ "group": "A"
+ }
+ }
+ },
+ "overrides": []
+ },
+ "options": {
+ "legend": {
+ "displayMode": "table",
+ "placement": "bottom",
+ "calcs": [
+ "lastNotNull"
+ ]
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "expr": "sum by (db_collection_name) (rate(db_client_documents_returned_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "{{db_collection_name}} returned",
+ "refId": "A"
+ },
+ {
+ "expr": "sum by (db_collection_name) (rate(db_client_documents_inserted_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "{{db_collection_name}} inserted",
+ "refId": "B"
+ },
+ {
+ "expr": "sum by (db_collection_name) (rate(db_client_documents_updated_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "{{db_collection_name}} updated",
+ "refId": "C"
+ },
+ {
+ "expr": "sum by (db_collection_name) (rate(db_client_documents_deleted_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "{{db_collection_name}} deleted",
+ "refId": "D"
+ }
+ ]
+ },
+ {
+ "type": "timeseries",
+ "title": "Request & Response Size",
+ "id": 16,
+ "gridPos": {
+ "x": 16,
+ "y": 8,
+ "w": 8,
+ "h": 8
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "unit": "bytes",
+ "custom": {
+ "drawStyle": "line",
+ "lineInterpolation": "smooth",
+ "fillOpacity": 10,
+ "spanNulls": false
+ }
+ }
+ },
+ "options": {
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ },
+ "legend": {
+ "displayMode": "list",
+ "placement": "bottom"
+ }
+ },
+ "targets": [
+ {
+ "refId": "A",
+ "expr": "sum(rate(db_client_request_size_bytes_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "request"
+ },
+ {
+ "refId": "B",
+ "expr": "sum(rate(db_client_response_size_bytes_total{pod=~\"$instance\"}[1m]))",
+ "legendFormat": "response"
+ }
+ ]
+ },
+ {
+ "type": "timeseries",
+ "title": "Errors by Type",
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum by (error_type) (rate(db_client_operations_total{pod=~\"$instance\", error_type!=\"\"}[5m]))",
+ "legendFormat": "{{error_type}}",
+ "refId": "A"
+ }
+ ],
+ "fieldConfig": {
+ "defaults": {
+ "unit": "ops",
+ "custom": {
+ "lineWidth": 1,
+ "fillOpacity": 30,
+ "showPoints": "never",
+ "stacking": {
+ "mode": "normal",
+ "group": "A"
+ }
+ }
+ },
+ "overrides": []
+ },
+ "options": {
+ "legend": {
+ "displayMode": "table",
+ "placement": "bottom",
+ "calcs": [
+ "lastNotNull"
+ ]
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "gridPos": {
+ "x": 0,
+ "y": 16,
+ "w": 24,
+ "h": 8
+ }
+ }
+ ],
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 4
+ }
+ },
+ {
+ "type": "row",
+ "title": "Resource Usage",
+ "collapsed": true,
+ "panels": [
+ {
+ "type": "timeseries",
+ "title": "Gateway CPU",
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "rate(container_cpu_usage_seconds_total{namespace=\"documentdb-preview-ns\", container=\"documentdb-gateway\"}[1m])",
+ "legendFormat": "{{pod}}",
+ "refId": "A"
+ }
+ ],
+ "fieldConfig": {
+ "defaults": {
+ "unit": "percentunit",
+ "custom": {
+ "lineWidth": 1,
+ "fillOpacity": 10,
+ "showPoints": "never"
+ },
+ "min": 0
+ },
+ "overrides": []
+ },
+ "options": {
+ "legend": {
+ "displayMode": "table",
+ "placement": "bottom",
+ "calcs": [
+ "lastNotNull"
+ ]
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "gridPos": {
+ "x": 0,
+ "y": 0,
+ "w": 8,
+ "h": 8
+ }
+ },
+ {
+ "type": "timeseries",
+ "title": "Gateway Memory",
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "container_memory_working_set_bytes{namespace=\"documentdb-preview-ns\", container=\"documentdb-gateway\"}",
+ "legendFormat": "{{pod}}",
+ "refId": "A"
+ }
+ ],
+ "fieldConfig": {
+ "defaults": {
+ "unit": "bytes",
+ "custom": {
+ "lineWidth": 1,
+ "fillOpacity": 10,
+ "showPoints": "never"
+ }
+ },
+ "overrides": []
+ },
+ "options": {
+ "legend": {
+ "displayMode": "table",
+ "placement": "bottom",
+ "calcs": [
+ "lastNotNull"
+ ]
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "gridPos": {
+ "x": 8,
+ "y": 0,
+ "w": 8,
+ "h": 8
+ }
+ },
+ {
+ "type": "timeseries",
+ "title": "Pod Network I/O",
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum by (pod) (rate(container_network_receive_bytes_total{namespace=\"documentdb-preview-ns\"}[1m]))",
+ "legendFormat": "{{pod}} rx",
+ "refId": "A"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum by (pod) (rate(container_network_transmit_bytes_total{namespace=\"documentdb-preview-ns\"}[1m]))",
+ "legendFormat": "{{pod}} tx",
+ "refId": "B"
+ }
+ ],
+ "fieldConfig": {
+ "defaults": {
+ "unit": "Bps",
+ "custom": {
+ "lineWidth": 1,
+ "fillOpacity": 10,
+ "showPoints": "never"
+ }
+ },
+ "overrides": []
+ },
+ "options": {
+ "legend": {
+ "displayMode": "table",
+ "placement": "bottom",
+ "calcs": [
+ "lastNotNull"
+ ]
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "gridPos": {
+ "x": 16,
+ "y": 0,
+ "w": 8,
+ "h": 8
+ }
+ }
+ ],
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 5
+ }
+ }
+ ],
+ "refresh": "10s"
+}
\ No newline at end of file
diff --git a/documentdb-playground/telemetry/local/dashboards/internals.json b/documentdb-playground/telemetry/local/dashboards/internals.json
new file mode 100644
index 00000000..0826e830
--- /dev/null
+++ b/documentdb-playground/telemetry/local/dashboards/internals.json
@@ -0,0 +1,174 @@
+{
+ "title": "DocumentDB Internals - Container & Node Resources",
+ "description": "Container CPU / memory / network and node-level resources from cAdvisor and kubelet.",
+ "uid": "documentdb-internals",
+ "schemaVersion": 39,
+ "version": 1,
+ "refresh": "30s",
+ "tags": ["documentdb", "internals"],
+ "time": { "from": "now-30m", "to": "now" },
+ "templating": {
+ "list": [
+ {
+ "name": "namespace",
+ "type": "constant",
+ "current": { "selected": false, "text": "documentdb-preview-ns", "value": "documentdb-preview-ns" },
+ "query": "documentdb-preview-ns",
+ "hide": 2
+ },
+ {
+ "name": "pod",
+ "type": "query",
+ "datasource": { "type": "prometheus", "uid": "prometheus" },
+ "query": "label_values(container_cpu_usage_seconds_total{namespace=\"$namespace\", container!=\"\", container!=\"POD\"}, pod)",
+ "refresh": 2,
+ "multi": true,
+ "includeAll": true,
+ "current": { "selected": true, "text": "All", "value": "$__all" }
+ }
+ ]
+ },
+ "panels": [
+ {
+ "type": "row",
+ "title": "Container CPU & Memory",
+ "id": 1,
+ "gridPos": { "x": 0, "y": 0, "w": 24, "h": 1 },
+ "collapsed": false
+ },
+ {
+ "type": "timeseries",
+ "title": "Container CPU usage (cores)",
+ "id": 2,
+ "datasource": { "type": "prometheus", "uid": "prometheus" },
+ "gridPos": { "x": 0, "y": 1, "w": 12, "h": 8 },
+ "fieldConfig": { "defaults": { "unit": "none" } },
+ "targets": [
+ {
+ "expr": "sum by (pod, container) (rate(container_cpu_usage_seconds_total{namespace=\"$namespace\", pod=~\"$pod\", container!=\"\", container!=\"POD\"}[1m]))",
+ "legendFormat": "{{pod}} / {{container}}",
+ "refId": "A"
+ }
+ ]
+ },
+ {
+ "type": "timeseries",
+ "title": "Container memory working set (bytes)",
+ "id": 3,
+ "datasource": { "type": "prometheus", "uid": "prometheus" },
+ "gridPos": { "x": 12, "y": 1, "w": 12, "h": 8 },
+ "fieldConfig": { "defaults": { "unit": "bytes" } },
+ "targets": [
+ {
+ "expr": "sum by (pod, container) (container_memory_working_set_bytes{namespace=\"$namespace\", pod=~\"$pod\", container!=\"\", container!=\"POD\"})",
+ "legendFormat": "{{pod}} / {{container}}",
+ "refId": "A"
+ }
+ ]
+ },
+ {
+ "type": "timeseries",
+ "title": "Container memory RSS (bytes)",
+ "id": 4,
+ "datasource": { "type": "prometheus", "uid": "prometheus" },
+ "gridPos": { "x": 0, "y": 9, "w": 12, "h": 8 },
+ "fieldConfig": { "defaults": { "unit": "bytes" } },
+ "targets": [
+ {
+ "expr": "sum by (pod, container) (container_memory_rss{namespace=\"$namespace\", pod=~\"$pod\", container!=\"\", container!=\"POD\"})",
+ "legendFormat": "{{pod}} / {{container}}",
+ "refId": "A"
+ }
+ ]
+ },
+ {
+ "type": "timeseries",
+ "title": "Container filesystem write throughput",
+ "description": "Per-pod write bytes/sec to container filesystem. cAdvisor in recent kubelet builds no longer exports container_fs_usage_bytes; rate of container_fs_writes_bytes_total is the closest available signal.",
+ "id": 5,
+ "datasource": { "type": "prometheus", "uid": "prometheus" },
+ "gridPos": { "x": 12, "y": 9, "w": 12, "h": 8 },
+ "fieldConfig": { "defaults": { "unit": "Bps" } },
+ "targets": [
+ {
+ "expr": "sum by (pod) (rate(container_fs_writes_bytes_total{namespace=\"$namespace\", pod=~\"$pod\"}[5m]))",
+ "legendFormat": "{{pod}}",
+ "refId": "A"
+ }
+ ]
+ },
+ {
+ "type": "row",
+ "title": "Pod Network",
+ "id": 6,
+ "gridPos": { "x": 0, "y": 17, "w": 24, "h": 1 },
+ "collapsed": false
+ },
+ {
+ "type": "timeseries",
+ "title": "Network bytes received (per pod)",
+ "id": 7,
+ "datasource": { "type": "prometheus", "uid": "prometheus" },
+ "gridPos": { "x": 0, "y": 18, "w": 12, "h": 8 },
+ "fieldConfig": { "defaults": { "unit": "Bps" } },
+ "targets": [
+ {
+ "expr": "sum by (pod) (rate(container_network_receive_bytes_total{namespace=\"$namespace\", pod=~\"$pod\"}[1m]))",
+ "legendFormat": "{{pod}}",
+ "refId": "A"
+ }
+ ]
+ },
+ {
+ "type": "timeseries",
+ "title": "Network bytes transmitted (per pod)",
+ "id": 8,
+ "datasource": { "type": "prometheus", "uid": "prometheus" },
+ "gridPos": { "x": 12, "y": 18, "w": 12, "h": 8 },
+ "fieldConfig": { "defaults": { "unit": "Bps" } },
+ "targets": [
+ {
+ "expr": "sum by (pod) (rate(container_network_transmit_bytes_total{namespace=\"$namespace\", pod=~\"$pod\"}[1m]))",
+ "legendFormat": "{{pod}}",
+ "refId": "A"
+ }
+ ]
+ },
+ {
+ "type": "row",
+ "title": "Node",
+ "id": 9,
+ "gridPos": { "x": 0, "y": 26, "w": 24, "h": 1 },
+ "collapsed": false
+ },
+ {
+ "type": "timeseries",
+ "title": "Node memory available (bytes)",
+ "id": 10,
+ "datasource": { "type": "prometheus", "uid": "prometheus" },
+ "gridPos": { "x": 0, "y": 27, "w": 12, "h": 8 },
+ "fieldConfig": { "defaults": { "unit": "bytes" } },
+ "targets": [
+ {
+ "expr": "sum by (node) (machine_memory_bytes) - sum by (node) (container_memory_working_set_bytes{container=\"\"})",
+ "legendFormat": "{{node}}",
+ "refId": "A"
+ }
+ ]
+ },
+ {
+ "type": "stat",
+ "title": "Pods up (DocumentDB cluster)",
+ "id": 11,
+ "datasource": { "type": "prometheus", "uid": "prometheus" },
+ "gridPos": { "x": 12, "y": 27, "w": 12, "h": 8 },
+ "fieldConfig": { "defaults": { "unit": "none" } },
+ "targets": [
+ {
+ "expr": "count(up{job=\"documentdb-otel-sidecar\", namespace=\"$namespace\"} == 1)",
+ "refId": "A"
+ }
+ ]
+ }
+ ]
+}
diff --git a/documentdb-playground/telemetry/local/k8s/documentdb/cluster.yaml b/documentdb-playground/telemetry/local/k8s/documentdb/cluster.yaml
new file mode 100644
index 00000000..6115f5c7
--- /dev/null
+++ b/documentdb-playground/telemetry/local/k8s/documentdb/cluster.yaml
@@ -0,0 +1,51 @@
+# ⚠️ DEMO/PLAYGROUND ONLY — credentials below are for local development.
+# Do NOT use these passwords in production.
+apiVersion: v1
+kind: Namespace
+metadata:
+ name: documentdb-preview-ns
+---
+apiVersion: v1
+kind: Secret
+metadata:
+ name: documentdb-credentials
+ namespace: documentdb-preview-ns
+type: Opaque
+stringData:
+ username: demo_user
+ password: DemoPassword100
+---
+apiVersion: documentdb.io/preview
+kind: DocumentDB
+metadata:
+ name: documentdb-preview
+ namespace: documentdb-preview-ns
+spec:
+ nodeCount: 1
+ instancesPerNode: 3
+ documentDbCredentialSecret: documentdb-credentials
+ # The default gateway image (built from this repo's main) does NOT yet emit
+ # OTLP `db_client_*` metrics — that instrumentation lives in pgmongo and is
+ # not yet in a public release. We pin a community-built image here so the
+ # playground works out-of-box; replace with your own once an upstream release
+ # ships with OTel metrics support. See README §Gateway image for build steps.
+ gatewayImage: "udsmiley/documentdb-gateway-otel:k8s-pgmongo-main-latest"
+ exposeViaService:
+ serviceType: ClusterIP
+ resource:
+ storage:
+ pvcSize: 5Gi
+ sidecarInjectorPluginName: cnpg-i-sidecar-injector.documentdb.io
+ # Enable the OTel Collector sidecar (one per pod). The sidecar:
+ # - receives OTLP metrics from the documentdb-gateway on localhost:4317
+ # - exposes Prometheus /metrics on the configured port
+ # Prometheus discovers the sidecar via pod annotations injected by the operator.
+ #
+ # Port note: containers in the same pod share a network namespace, so this
+ # port must NOT collide with anything CNPG already binds. CNPG's instance
+ # manager uses 9187 for its own /metrics endpoint, so we use 9188 here.
+ monitoring:
+ enabled: true
+ exporter:
+ prometheus:
+ port: 9188
diff --git a/documentdb-playground/telemetry/local/k8s/observability/grafana.yaml b/documentdb-playground/telemetry/local/k8s/observability/grafana.yaml
new file mode 100644
index 00000000..5fd5cc23
--- /dev/null
+++ b/documentdb-playground/telemetry/local/k8s/observability/grafana.yaml
@@ -0,0 +1,98 @@
+# ============================================================
+# Grafana - Dashboards & Visualization
+# ⚠️ DEMO/PLAYGROUND ONLY — not for production use.
+# Anonymous admin access (GF_AUTH_ANONYMOUS_ORG_ROLE: Admin) and
+# the default admin password are intentional for local dev
+# convenience. Do NOT deploy this configuration in production.
+# ============================================================
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ name: grafana
+ namespace: observability
+spec:
+ replicas: 1
+ selector:
+ matchLabels:
+ app: grafana
+ template:
+ metadata:
+ labels:
+ app: grafana
+ spec:
+ containers:
+ - name: grafana
+ image: grafana/grafana:11.6.0
+ ports:
+ - containerPort: 3000
+ env:
+ - name: GF_SECURITY_ADMIN_PASSWORD
+ value: admin
+ - name: GF_AUTH_ANONYMOUS_ENABLED
+ value: "true"
+ - name: GF_AUTH_ANONYMOUS_ORG_ROLE
+ value: Admin
+ volumeMounts:
+ - name: datasources
+ mountPath: /etc/grafana/provisioning/datasources
+ - name: dashboard-provisioning
+ mountPath: /etc/grafana/provisioning/dashboards
+ - name: dashboards
+ mountPath: /var/lib/grafana/dashboards
+ volumes:
+ - name: datasources
+ configMap:
+ name: grafana-datasources
+ - name: dashboard-provisioning
+ configMap:
+ name: grafana-dashboard-provisioning
+ - name: dashboards
+ configMap:
+ name: grafana-dashboards
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: grafana-datasources
+ namespace: observability
+data:
+ datasources.yaml: |
+ apiVersion: 1
+ datasources:
+ - name: Prometheus
+ type: prometheus
+ uid: prometheus
+ access: proxy
+ url: http://prometheus.observability.svc:9090
+ isDefault: true
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: grafana-dashboard-provisioning
+ namespace: observability
+data:
+ dashboards.yaml: |
+ apiVersion: 1
+ providers:
+ - name: default
+ orgId: 1
+ folder: DocumentDB
+ type: file
+ disableDeletion: false
+ editable: true
+ options:
+ path: /var/lib/grafana/dashboards
+ foldersFromFilesStructure: false
+---
+apiVersion: v1
+kind: Service
+metadata:
+ name: grafana
+ namespace: observability
+spec:
+ selector:
+ app: grafana
+ type: ClusterIP
+ ports:
+ - port: 3000
diff --git a/documentdb-playground/telemetry/local/k8s/observability/namespace.yaml b/documentdb-playground/telemetry/local/k8s/observability/namespace.yaml
new file mode 100644
index 00000000..4f75b8c5
--- /dev/null
+++ b/documentdb-playground/telemetry/local/k8s/observability/namespace.yaml
@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+ name: observability
diff --git a/documentdb-playground/telemetry/local/k8s/observability/prometheus.yaml b/documentdb-playground/telemetry/local/k8s/observability/prometheus.yaml
new file mode 100644
index 00000000..b463b4a8
--- /dev/null
+++ b/documentdb-playground/telemetry/local/k8s/observability/prometheus.yaml
@@ -0,0 +1,237 @@
+# ============================================================
+# Prometheus - Metrics Store
+# ============================================================
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: prometheus
+ namespace: observability
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+ name: prometheus
+rules:
+ - apiGroups: [""]
+ resources: ["pods", "services", "endpoints", "nodes", "nodes/metrics", "nodes/proxy"]
+ verbs: ["get", "list", "watch"]
+ - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
+ verbs: ["get"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+ name: prometheus
+roleRef:
+ apiGroup: rbac.authorization.k8s.io
+ kind: ClusterRole
+ name: prometheus
+subjects:
+ - kind: ServiceAccount
+ name: prometheus
+ namespace: observability
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ name: prometheus
+ namespace: observability
+spec:
+ replicas: 1
+ selector:
+ matchLabels:
+ app: prometheus
+ template:
+ metadata:
+ labels:
+ app: prometheus
+ spec:
+ serviceAccountName: prometheus
+ containers:
+ - name: prometheus
+ image: prom/prometheus:v3.3.0
+ args:
+ - --config.file=/etc/prometheus/prometheus.yml
+ - --storage.tsdb.retention.time=1d
+ - --web.enable-lifecycle
+ ports:
+ - containerPort: 9090
+ volumeMounts:
+ - name: config
+ mountPath: /etc/prometheus
+ volumes:
+ - name: config
+ configMap:
+ name: prometheus-config
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: prometheus-config
+ namespace: observability
+data:
+ prometheus.yml: |
+ global:
+ scrape_interval: 15s
+
+ # Alerting rules
+ rule_files:
+ - /etc/prometheus/alerts.yml
+
+ scrape_configs:
+ # OTel Collector sidecars injected into each DocumentDB pod by the
+ # operator (spec.monitoring.enabled=true). The sidecar-injector adds
+ # prometheus.io/scrape, prometheus.io/port, prometheus.io/path
+ # annotations on each pod, which we honour here.
+ - job_name: documentdb-otel-sidecar
+ kubernetes_sd_configs:
+ - role: pod
+ namespaces:
+ names: ['documentdb-preview-ns']
+ relabel_configs:
+ - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
+ action: keep
+ regex: "true"
+ - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
+ action: replace
+ target_label: __metrics_path__
+ regex: (.+)
+ - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
+ action: replace
+ target_label: __address__
+ regex: ([^:]+)(?::\d+)?;(\d+)
+ replacement: $1:$2
+ - source_labels: [__meta_kubernetes_namespace]
+ target_label: namespace
+ - source_labels: [__meta_kubernetes_pod_name]
+ target_label: pod
+ - source_labels: [__meta_kubernetes_pod_label_cnpg_io_cluster]
+ target_label: cnpg_cluster
+
+ # CNPG operator-managed instance metrics (replication slot, backup, etc.)
+ # exposed by the CNPG instance manager on each Postgres pod, port 9187.
+ # NOTE: The OTel collector sidecar exposes its own /metrics on port 9188
+ # (see cluster.yaml) precisely to avoid colliding with this port.
+ - job_name: cnpg
+ kubernetes_sd_configs:
+ - role: pod
+ namespaces:
+ names: ['documentdb-preview-ns']
+ relabel_configs:
+ - source_labels: [__meta_kubernetes_pod_label_cnpg_io_cluster]
+ regex: documentdb-preview
+ action: keep
+ # The CNPG instance-manager container and the OTel collector sidecar
+ # both expose Prometheus /metrics on the same pod (different ports
+ # — 9187 vs 9188). Filter to the instance-manager port by name so
+ # we don't accidentally double-scrape the sidecar via this job.
+ - source_labels: [__meta_kubernetes_pod_container_port_name]
+ regex: metrics
+ action: keep
+ - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_container_port_number]
+ target_label: __address__
+ separator: ":"
+
+ # Node-level container metrics via kubelet's built-in cAdvisor endpoint.
+ # Replaces the previous OTel kubeletstats DaemonSet — Prometheus can scrape
+ # kubelet directly with no extra collector deployment.
+ - job_name: kubelet-cadvisor
+ scheme: https
+ tls_config:
+ ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
+ insecure_skip_verify: true
+ bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
+ kubernetes_sd_configs:
+ - role: node
+ relabel_configs:
+ - target_label: __address__
+ replacement: kubernetes.default.svc:443
+ - source_labels: [__meta_kubernetes_node_name]
+ regex: (.+)
+ target_label: __metrics_path__
+ replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
+
+ # Kubelet's own /metrics endpoint (process-level metrics for each kubelet).
+ - job_name: kubelet
+ scheme: https
+ tls_config:
+ ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
+ insecure_skip_verify: true
+ bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
+ kubernetes_sd_configs:
+ - role: node
+ relabel_configs:
+ - target_label: __address__
+ replacement: kubernetes.default.svc:443
+ - source_labels: [__meta_kubernetes_node_name]
+ regex: (.+)
+ target_label: __metrics_path__
+ replacement: /api/v1/nodes/$1/proxy/metrics
+
+ # Uncomment to scrape DocumentDB operator controller-runtime metrics.
+ # Requires the operator to be started with --metrics-bind-address=:8443.
+ # - job_name: documentdb-operator
+ # scheme: https
+ # tls_config:
+ # insecure_skip_verify: true
+ # kubernetes_sd_configs:
+ # - role: pod
+ # namespaces:
+ # names: ['documentdb-operator']
+ # relabel_configs:
+ # - source_labels: [__meta_kubernetes_pod_label_app]
+ # regex: documentdb
+ # action: keep
+ # - source_labels: [__meta_kubernetes_pod_ip]
+ # target_label: __address__
+ # replacement: $1:8443
+
+ alerts.yml: |
+ groups:
+ - name: documentdb
+ rules:
+ - alert: GatewayHighErrorRate
+ expr: |
+ (
+ sum(rate(db_client_operations_total{error_type!=""}[5m]))
+ / sum(rate(db_client_operations_total[5m]))
+ ) * 100 > 5
+ for: 5m
+ labels:
+ severity: warning
+ annotations:
+ summary: "Gateway error rate above 5%"
+ description: "{{ $value | printf \"%.1f\" }}% of gateway operations are failing."
+
+ - alert: GatewayDown
+ expr: absent(db_client_operations_total) == 1
+ for: 2m
+ labels:
+ severity: critical
+ annotations:
+ summary: "No gateway metrics — gateway may be down"
+ description: "The db_client_operations_total metric has been absent for 2 minutes."
+
+ - alert: ContainerHighMemory
+ expr: |
+ container_memory_working_set_bytes{
+ namespace="documentdb-preview-ns",
+ container!="",container!="POD"
+ } > 1073741824
+ for: 5m
+ labels:
+ severity: warning
+ annotations:
+ summary: "Container working-set memory above 1Gi"
+ description: "{{ $labels.pod }}/{{ $labels.container }} working set: {{ $value | humanize1024 }}B."
+---
+apiVersion: v1
+kind: Service
+metadata:
+ name: prometheus
+ namespace: observability
+spec:
+ selector:
+ app: prometheus
+ ports:
+ - port: 9090
diff --git a/documentdb-playground/telemetry/local/k8s/traffic/traffic-generator.yaml b/documentdb-playground/telemetry/local/k8s/traffic/traffic-generator.yaml
new file mode 100644
index 00000000..3ece3989
--- /dev/null
+++ b/documentdb-playground/telemetry/local/k8s/traffic/traffic-generator.yaml
@@ -0,0 +1,294 @@
+---
+# ⚠️ DEMO/PLAYGROUND ONLY — hardcoded passwords below are for local development.
+# Do NOT use these credentials in production.
+#
+# Service to expose gateway port 10260 on PRIMARY
+apiVersion: v1
+kind: Service
+metadata:
+ name: documentdb-preview-gateway
+ namespace: documentdb-preview-ns
+spec:
+ selector:
+ cnpg.io/cluster: documentdb-preview
+ cnpg.io/instanceRole: primary
+ ports:
+ - name: gateway
+ port: 10260
+ targetPort: 10260
+ type: ClusterIP
+---
+# Service to expose gateway port 10260 on REPLICAS
+apiVersion: v1
+kind: Service
+metadata:
+ name: documentdb-preview-gateway-ro
+ namespace: documentdb-preview-ns
+spec:
+ selector:
+ cnpg.io/cluster: documentdb-preview
+ cnpg.io/instanceRole: replica
+ ports:
+ - name: gateway
+ port: 10260
+ targetPort: 10260
+ type: ClusterIP
+---
+# Traffic generator: writes go to the primary, reads go to the replicas
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: traffic-generator-script
+ namespace: documentdb-preview-ns
+data:
+ generate-traffic.js: |
+ // Traffic generator for DocumentDB telemetry demo
+ // Writes go to primary (RW), reads go to replicas (RO)
+
+ const DB_NAME = "telemetry_demo";
+ const COLLECTION = "events";
+ const BATCH_SIZE = 5;
+ const ITERATIONS = 2250;
+ const SLEEP_MS = 800;
+
+ // This script runs on the PRIMARY connection
+ db = db.getSiblingDB(DB_NAME);
+
+ print("=== DocumentDB Traffic Generator (Primary - Writes) ===");
+ print(`Target: ${DB_NAME}.${COLLECTION}`);
+ print(`Iterations: ${ITERATIONS}, Batch: ${BATCH_SIZE}, Sleep: ${SLEEP_MS}ms`);
+
+ const categories = ["auth", "api", "database", "network", "system"];
+ const severities = ["info", "warn", "error", "critical"];
+ const sources = ["web-server", "api-gateway", "worker", "scheduler", "monitor"];
+
+ function randomChoice(arr) {
+ return arr[Math.floor(Math.random() * arr.length)];
+ }
+
+ function generateEvent() {
+ return {
+ timestamp: new Date(),
+ category: randomChoice(categories),
+ severity: randomChoice(severities),
+ source: randomChoice(sources),
+ message: "Event " + Math.random().toString(36).substring(7),
+ duration_ms: Math.floor(Math.random() * 2000),
+ statusCode: randomChoice([200, 200, 200, 201, 400, 404, 500])
+ };
+ }
+
+ for (let i = 0; i < ITERATIONS; i++) {
+ try {
+ // WRITES: insert documents
+ for (let j = 0; j < BATCH_SIZE; j++) {
+ db[COLLECTION].insertOne(generateEvent());
+ }
+
+ // WRITES: update
+ db[COLLECTION].updateMany(
+ { severity: "info", source: randomChoice(sources) },
+ { $set: { processed: true } }
+ );
+
+ // READ on primary too (some mixed workload)
+ db[COLLECTION].countDocuments({ source: randomChoice(sources) });
+
+ // ERROR GENERATORS (~10% of iterations)
+ if (i % 10 === 0) {
+ try {
+ // Invalid update operator
+ db[COLLECTION].updateOne({ _id: 1 }, { $badOp: { x: 1 } });
+ } catch (e) { /* expected */ }
+
+ try {
+ // Duplicate key on the _id index (_id always has a unique index)
+ db[COLLECTION].insertOne({ _id: "deliberate-dup-" + (i % 3) });
+ } catch (e) { /* expected */ }
+
+ try {
+ // Query non-existent collection with invalid pipeline
+ db.getSiblingDB("telemetry_demo")["no_such_coll"].aggregate([
+ { $merge: { into: { db: "admin", coll: "forbidden" } } }
+ ]).toArray();
+ } catch (e) { /* expected */ }
+
+ try {
+ // Invalid regex
+ db[COLLECTION].find({ message: { $regex: "[invalid" } }).toArray();
+ } catch (e) { /* expected */ }
+ }
+
+ // Periodic cleanup
+ if (i % 100 === 0 && i > 0) {
+ db[COLLECTION].deleteMany({ processed: true });
+ print(`[${i}/${ITERATIONS}] Cleanup done`);
+ }
+
+ if (i % 25 === 0) {
+ print(`[${i}/${ITERATIONS}] OK`);
+ }
+
+ sleep(SLEEP_MS);
+ } catch (e) {
+ print(`[${i}/${ITERATIONS}] Error: ${e.message}`);
+ sleep(2000);
+ }
+ }
+
+ print("=== Primary traffic complete ===");
+
+ generate-reads.js: |
+ // Read-only traffic for replica instances
+ // Runs against the gateway-ro service (load-balanced across replicas)
+
+ const DB_NAME = "telemetry_demo";
+ const COLLECTION = "events";
+ const ITERATIONS = 2250;
+ const SLEEP_MS = 600;
+
+ db = db.getSiblingDB(DB_NAME);
+
+ print("=== DocumentDB Traffic Generator (Replicas - Reads) ===");
+ print(`Target: ${DB_NAME}.${COLLECTION}`);
+ print(`Iterations: ${ITERATIONS}, Sleep: ${SLEEP_MS}ms`);
+
+ const categories = ["auth", "api", "database", "network", "system"];
+ const severities = ["info", "warn", "error", "critical"];
+ const sources = ["web-server", "api-gateway", "worker", "scheduler", "monitor"];
+
+ function randomChoice(arr) {
+ return arr[Math.floor(Math.random() * arr.length)];
+ }
+
+ for (let i = 0; i < ITERATIONS; i++) {
+ try {
+ // Simple find
+ db[COLLECTION].find({ category: randomChoice(categories) }).limit(10).toArray();
+
+ // Find with sort
+ db[COLLECTION].find({ severity: randomChoice(severities) }).sort({ timestamp: -1 }).limit(10).toArray();
+
+ // Count
+ db[COLLECTION].countDocuments({ source: randomChoice(sources) });
+
+ // Aggregate pipeline
+ db[COLLECTION].aggregate([
+ { $match: { category: randomChoice(categories) } },
+ { $group: { _id: "$severity", count: { $sum: 1 }, avg_duration: { $avg: "$duration_ms" } } },
+ { $sort: { count: -1 } }
+ ]).toArray();
+
+ // Distinct
+ db[COLLECTION].distinct("source", { severity: randomChoice(severities) });
+
+ if (i % 25 === 0) {
+ print(`[${i}/${ITERATIONS}] OK`);
+ }
+
+ sleep(SLEEP_MS);
+ } catch (e) {
+ print(`[${i}/${ITERATIONS}] Error: ${e.message}`);
+ sleep(2000);
+ }
+ }
+
+ print("=== Replica traffic complete ===");
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+ name: traffic-generator-rw
+ namespace: documentdb-preview-ns
+spec:
+ backoffLimit: 3
+ template:
+ metadata:
+ labels:
+ app: traffic-generator
+ spec:
+ restartPolicy: OnFailure
+ containers:
+ - name: traffic-gen
+ image: mongodb/mongodb-community-server:7.0.30-ubuntu2204
+ env:
+ - name: DOCDB_USER
+ valueFrom:
+ secretKeyRef:
+ name: documentdb-credentials
+ key: username
+ - name: DOCDB_PASSWORD
+ valueFrom:
+ secretKeyRef:
+ name: documentdb-credentials
+ key: password
+ command:
+ - mongosh
+ - "--host"
+ - "documentdb-preview-gateway.documentdb-preview-ns.svc.cluster.local"
+ - "--port"
+ - "10260"
+ - "--tls"
+ - "--tlsAllowInvalidCertificates"
+ - "-u"
+ - "$(DOCDB_USER)"
+ - "-p"
+ - "$(DOCDB_PASSWORD)"
+ - "--file"
+ - "/scripts/generate-traffic.js"
+ volumeMounts:
+ - name: script
+ mountPath: /scripts
+ volumes:
+ - name: script
+ configMap:
+ name: traffic-generator-script
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+ name: traffic-generator-ro
+ namespace: documentdb-preview-ns
+spec:
+ backoffLimit: 3
+ template:
+ metadata:
+ labels:
+ app: traffic-generator
+ spec:
+ restartPolicy: OnFailure
+ containers:
+ - name: traffic-gen
+ image: mongodb/mongodb-community-server:7.0.30-ubuntu2204
+ env:
+ - name: DOCDB_USER
+ valueFrom:
+ secretKeyRef:
+ name: documentdb-credentials
+ key: username
+ - name: DOCDB_PASSWORD
+ valueFrom:
+ secretKeyRef:
+ name: documentdb-credentials
+ key: password
+ command:
+ - mongosh
+ - "--host"
+ - "documentdb-preview-gateway-ro.documentdb-preview-ns.svc.cluster.local"
+ - "--port"
+ - "10260"
+ - "--tls"
+ - "--tlsAllowInvalidCertificates"
+ - "-u"
+ - "$(DOCDB_USER)"
+ - "-p"
+ - "$(DOCDB_PASSWORD)"
+ - "--file"
+ - "/scripts/generate-reads.js"
+ volumeMounts:
+ - name: script
+ mountPath: /scripts
+ volumes:
+ - name: script
+ configMap:
+ name: traffic-generator-script
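`randomChoice()` in the generator script picks uniformly, so listing 200 three times in the `statusCode` candidates weights traffic toward success (roughly 3/7 odds of a 200). A standalone Python sketch of that weighting trick (the manifest itself uses mongosh JavaScript):

```python
import random

# Mirrors randomChoice() in generate-traffic.js: repeating a value in the
# candidate list biases a uniform pick toward it.
STATUS_CODES = [200, 200, 200, 201, 400, 404, 500]

def random_choice(arr, rng=random):
    return arr[rng.randrange(len(arr))]

rng = random.Random(42)  # seeded so the demo is reproducible
picks = [random_choice(STATUS_CODES, rng) for _ in range(1000)]
success_share = sum(c < 400 for c in picks) / len(picks)
print(f"success share: {success_share:.2f}")  # expected near 4/7
```

The same approach keeps the error generators simple: no weight tables, just duplicated entries in a plain list.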
diff --git a/documentdb-playground/telemetry/local/scripts/deploy.sh b/documentdb-playground/telemetry/local/scripts/deploy.sh
new file mode 100755
index 00000000..4769d01d
--- /dev/null
+++ b/documentdb-playground/telemetry/local/scripts/deploy.sh
@@ -0,0 +1,90 @@
+#!/bin/bash
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+LOCAL_DIR="$(dirname "$SCRIPT_DIR")"
+REPO_ROOT="$(cd "$LOCAL_DIR/../../.." && pwd)"
+CLUSTER_NAME="${CLUSTER_NAME:-documentdb-telemetry}"
+CONTEXT="kind-${CLUSTER_NAME}"
+# Path to the operator Helm chart in this repo. Built from the same branch
+# so the playground exercises in-tree operator changes (e.g. base_config.yaml
+# updates) without needing a published release.
+OPERATOR_CHART_DIR="${OPERATOR_CHART_DIR:-${REPO_ROOT}/operator/documentdb-helm-chart}"
+
+echo "=== DocumentDB Telemetry Playground ==="
+
+# Step 1: Create Kind cluster
+echo "[1/6] Setting up Kind cluster..."
+"$SCRIPT_DIR/setup-kind.sh"
+
+# Step 2: Wait for cluster
+echo "[2/6] Waiting for cluster to be ready..."
+kubectl wait --for=condition=Ready nodes --all --context "$CONTEXT" --timeout=120s
+
+# Step 3: Install cert-manager + DocumentDB operator (from this branch)
+echo "[3/6] Installing cert-manager and DocumentDB operator..."
+if helm list -n documentdb-operator --kube-context "$CONTEXT" 2>/dev/null | grep -q documentdb-operator; then
+ echo " DocumentDB operator already installed, skipping."
+else
+ # cert-manager
+ if kubectl get namespace cert-manager --context "$CONTEXT" &>/dev/null; then
+ echo " cert-manager already installed, skipping."
+ else
+ echo " Installing cert-manager..."
+ helm repo add jetstack https://charts.jetstack.io --force-update 2>/dev/null
+ helm repo update jetstack 2>/dev/null
+ helm install cert-manager jetstack/cert-manager \
+ --namespace cert-manager \
+ --create-namespace \
+ --set installCRDs=true \
+ --kube-context "$CONTEXT" \
+ --wait --timeout 120s
+ fi
+
+ # DocumentDB operator from the local chart in this branch.
+ echo " Building chart dependencies..."
+ ( cd "$OPERATOR_CHART_DIR" && helm dependency update >/dev/null )
+ echo " Installing DocumentDB operator from ${OPERATOR_CHART_DIR}..."
+ helm install documentdb-operator "$OPERATOR_CHART_DIR" \
+ --namespace documentdb-operator \
+ --create-namespace \
+ --kube-context "$CONTEXT" \
+ --wait --timeout 180s
+fi
+
+# Step 4: Deploy observability stack (Prometheus + Grafana only — no central collector;
+# every DocumentDB pod runs its own OTel Collector sidecar via spec.monitoring).
+echo "[4/6] Deploying observability stack..."
+kubectl apply -f "$LOCAL_DIR/k8s/observability/namespace.yaml" --context "$CONTEXT"
+kubectl apply -f "$LOCAL_DIR/k8s/observability/" --context "$CONTEXT"
+
+# Create dashboard ConfigMap from JSON files
+echo " Loading Grafana dashboards..."
+kubectl create configmap grafana-dashboards \
+ --namespace=observability \
+ --from-file=gateway.json="$LOCAL_DIR/dashboards/gateway.json" \
+ --from-file=internals.json="$LOCAL_DIR/dashboards/internals.json" \
+ --context "$CONTEXT" \
+ --dry-run=client -o yaml | kubectl apply -f - --context "$CONTEXT"
+
+# Step 5: Deploy DocumentDB
+# spec.monitoring.enabled triggers the operator to create an OTel ConfigMap
+# and inject the otel-collector sidecar via the CNPG sidecar-injector plugin.
+# The CNPG-managed -app secret is reused for the sidecar's PG creds —
+# no dedicated monitoring role is needed.
+echo "[5/6] Deploying DocumentDB..."
+kubectl apply -f "$LOCAL_DIR/k8s/documentdb/" --context "$CONTEXT"
+
+echo " Waiting for observability stack..."
+kubectl wait --for=condition=Available deployment --all -n observability --context "$CONTEXT" --timeout=180s
+
+# Step 6: Deploy traffic generators
+echo "[6/6] Deploying traffic generators..."
+echo " Waiting for DocumentDB pods (this may take a few minutes)..."
+kubectl wait --for=condition=Ready pod -l app=documentdb-preview -n documentdb-preview-ns --context "$CONTEXT" --timeout=300s 2>/dev/null || echo " (DocumentDB pods not ready yet - deploy traffic manually later)"
+kubectl apply -f "$LOCAL_DIR/k8s/traffic/" --context "$CONTEXT"
+
+echo ""
+echo "=== Deployment Complete ==="
+echo "Grafana: kubectl port-forward svc/grafana 3000:3000 -n observability --context $CONTEXT"
+echo "Prometheus: kubectl port-forward svc/prometheus 9090:9090 -n observability --context $CONTEXT"
+echo "Validate: ./scripts/validate.sh"
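Both the cert-manager and operator installs in deploy.sh are guarded by a check-then-install test so the script can be re-run safely. A small Python sketch of that idempotence pattern, with a hypothetical marker file standing in for `helm list | grep -q`:

```python
import tempfile
from pathlib import Path

# Hypothetical marker file standing in for "helm list | grep -q <release>".
marker = Path(tempfile.gettempdir()) / "demo-component.installed"
marker.unlink(missing_ok=True)  # start clean for the demo

def install_once(log: list) -> None:
    """Install only if no prior install is detected (safe to re-run)."""
    if marker.exists():
        log.append("already installed, skipping.")
    else:
        log.append("installing...")  # real work: helm install ... --wait
        marker.touch()

log = []
install_once(log)  # first run performs the install
install_once(log)  # second run is a no-op
print(log)
marker.unlink()
```

The guard is what lets deploy.sh be the single entry point: a failed run can simply be re-invoked without tearing anything down first.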
diff --git a/documentdb-playground/telemetry/local/scripts/setup-kind.sh b/documentdb-playground/telemetry/local/scripts/setup-kind.sh
new file mode 100755
index 00000000..0f061c8a
--- /dev/null
+++ b/documentdb-playground/telemetry/local/scripts/setup-kind.sh
@@ -0,0 +1,76 @@
+#!/bin/bash
+set -euo pipefail
+
+# Configuration
+CLUSTER_NAME="${CLUSTER_NAME:-documentdb-telemetry}"
+REG_NAME="kind-registry"
+REG_PORT="5001"
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+LOCAL_DIR="$(dirname "$SCRIPT_DIR")"
+K8S_VERSION="${K8S_VERSION:-v1.35.0}"
+
+echo "=== DocumentDB Telemetry - Kind Cluster Setup ==="
+
+# 1. Create registry container unless it already exists
+if [ "$(docker inspect -f '{{.State.Running}}' "${REG_NAME}" 2>/dev/null || true)" = 'true' ]; then
+ echo "Registry '${REG_NAME}' already running"
+elif docker inspect "${REG_NAME}" &>/dev/null; then
+ echo "Starting existing registry '${REG_NAME}'..."
+ docker start "${REG_NAME}"
+else
+ echo "Creating local registry on port ${REG_PORT}..."
+ docker run -d --restart=always -p "127.0.0.1:${REG_PORT}:5000" --network bridge --name "${REG_NAME}" registry:2
+fi
+
+# 2. Create Kind cluster if it doesn't exist
+if kind get clusters 2>/dev/null | grep -q "^${CLUSTER_NAME}$"; then
+ echo "Kind cluster '${CLUSTER_NAME}' already exists"
+else
+ echo "Creating Kind cluster '${CLUSTER_NAME}' with 3 worker nodes..."
+ cat <<EOF | kind create cluster --name "${CLUSTER_NAME}" --image "kindest/node:${K8S_VERSION}" --config=-
+kind: Cluster
+apiVersion: kind.x-k8s.io/v1alpha4
+containerdConfigPatches:
+- |-
+  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:${REG_PORT}"]
+    endpoint = ["http://${REG_NAME}:5000"]
+nodes:
+- role: control-plane
+- role: worker
+- role: worker
+- role: worker
+EOF
+fi
+
+# 3. Connect the registry container to the Kind network so nodes can pull from it
+docker network connect "kind" "${REG_NAME}" 2>/dev/null || true
+
+echo "Done. Registry container kept for reuse."
diff --git a/documentdb-playground/telemetry/local/scripts/validate.sh b/documentdb-playground/telemetry/local/scripts/validate.sh
new file mode 100755
index 00000000..36486de8
--- /dev/null
+++ b/documentdb-playground/telemetry/local/scripts/validate.sh
@@ -0,0 +1,116 @@
+#!/bin/bash
+set -euo pipefail
+CLUSTER_NAME="${CLUSTER_NAME:-documentdb-telemetry}"
+CONTEXT="kind-${CLUSTER_NAME}"
+PASS=0
+FAIL=0
+
+green() { echo -e "\033[32m✓ $1\033[0m"; PASS=$((PASS + 1)); }
+red() { echo -e "\033[31m✗ $1\033[0m"; FAIL=$((FAIL + 1)); }
+warn() { echo -e "\033[33m⚠ $1\033[0m"; }
+
+echo "=== DocumentDB Telemetry Playground - Validation ==="
+echo ""
+
+# 1. Check observability deployments (no central OTel collector — every
+# DocumentDB pod runs its own sidecar via spec.monitoring).
+echo "--- Observability Stack ---"
+for deploy in prometheus grafana; do
+ if kubectl get deployment "$deploy" -n observability --context "$CONTEXT" &>/dev/null; then
+ ready=$(kubectl get deployment "$deploy" -n observability --context "$CONTEXT" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo "0")
+ if [ "${ready:-0}" -ge 1 ]; then
+ green "$deploy is running"
+ else
+ red "$deploy is not ready (readyReplicas=${ready:-0})"
+ fi
+ else
+ red "$deploy deployment not found"
+ fi
+done
+
+# 2. Check DocumentDB pods AND that the otel-collector sidecar is injected.
+echo ""
+echo "--- DocumentDB ---"
+running=$(kubectl get pods -l app=documentdb-preview -n documentdb-preview-ns --context "$CONTEXT" --no-headers 2>/dev/null | grep -c "Running" || true)
+if [ "$running" -ge 1 ]; then
+ green "DocumentDB pods running ($running)"
+else
+ red "No DocumentDB pods running"
+fi
+
+# Verify each running pod has 3/3 containers (postgres + documentdb-gateway + otel-collector).
+short=$(kubectl get pods -l app=documentdb-preview -n documentdb-preview-ns --context "$CONTEXT" \
+ -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].name}{"\n"}{end}' 2>/dev/null || true)
+missing_sidecar=0
+while IFS= read -r line; do
+ [ -z "$line" ] && continue
+ if ! echo "$line" | grep -q "otel-collector"; then
+ red "pod $(echo "$line" | awk '{print $1}') is missing the otel-collector sidecar"
+ missing_sidecar=$((missing_sidecar + 1))
+ fi
+done <<< "$short"
+if [ "$missing_sidecar" -eq 0 ] && [ "$running" -ge 1 ]; then
+ green "otel-collector sidecar injected on all DocumentDB pods"
+fi
+
+# 3. Check Prometheus targets and key metrics
+echo ""
+echo "--- Data Flow ---"
+PROM_POD=$(kubectl get pod -l app=prometheus -n observability --context "$CONTEXT" -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
+if [ -n "$PROM_POD" ]; then
+ query() {
+ kubectl exec "$PROM_POD" -n observability --context "$CONTEXT" -- \
+ wget -qO- "http://localhost:9090/api/v1/query?query=$1" 2>/dev/null || echo ""
+ }
+
+ # Prometheus has at least one UP target?
+ target_up=$(query "up")
+ if echo "$target_up" | grep -q '"value"'; then
+ up_count=$(echo "$target_up" | grep -o '"value":\[' | wc -l)
+ green "Prometheus has $up_count active targets"
+ else
+ red "Cannot query Prometheus targets"
+ fi
+
+ # Sidecar scrape job is up?
+ sidecar_up=$(query 'up{job="documentdb-otel-sidecar"}')
+ if echo "$sidecar_up" | grep -q '"value":\["[^"]*","1"\]'; then
+ green "OTel sidecar scrape targets are UP"
+ else
+ warn "OTel sidecar scrape targets not UP yet (sidecar may still be starting)"
+ fi
+
+ # Gateway metric (pushed via OTLP into the sidecar, exported as Prometheus).
+ # Wait up to ~2 min for the first export interval + traffic to start.
+ echo "Waiting up to 120s for gateway OTLP metrics to appear..."
+ gateway_metric_found=0
+ for _ in $(seq 1 24); do
+ if query "db_client_operations_total" | grep -q '"result":\[{'; then
+ gateway_metric_found=1
+ break
+ fi
+ sleep 5
+ done
+ if [ "$gateway_metric_found" -eq 1 ]; then
+ green "Gateway metric db_client_operations_total present"
+ else
+ red "Gateway OTLP metrics absent after 120s — likely the configured gatewayImage lacks OTel instrumentation. See README §Gateway image."
+ fi
+
+ # cAdvisor / kubelet container metric (replaces former kubeletstats coverage)
+ if query "container_cpu_usage_seconds_total" | grep -q '"result":\[{'; then
+ green "cAdvisor container metric (container_cpu_usage_seconds_total) present"
+ else
+ warn "No cAdvisor metrics yet"
+ fi
+else
+ red "Prometheus pod not found"
+fi
+
+# Summary
+echo ""
+echo "=== Results: $PASS passed, $FAIL failed ==="
+if [ "$FAIL" -gt 0 ]; then
+ echo "Some checks failed. Components may still be starting up — retry in a minute."
+ exit 1
+fi
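validate.sh's wait for `db_client_operations_total` is a bounded poll: 24 attempts at 5 s intervals, about 120 s total. The same shape in a small Python sketch, with an injectable sleep so the example runs instantly:

```python
import time

def wait_for(check, attempts=24, interval_s=5, sleep=time.sleep):
    """Poll check() up to `attempts` times, pausing between tries.

    Mirrors the 24 x 5s (~120s) loop in validate.sh: returns True as soon
    as check() succeeds, False once the retry budget is exhausted.
    """
    for _ in range(attempts):
        if check():
            return True
        sleep(interval_s)
    return False

# Demo: the "metric" appears on the fourth probe; sleeps are recorded,
# not actually slept, so this finishes immediately.
probes = iter([False, False, False, True])
waited = []
ok = wait_for(lambda: next(probes), sleep=waited.append)
print(ok, sum(waited))
```

Bounding the wait and degrading to a warning (rather than hanging) keeps the validation script usable right after deploy, before the first OTLP export interval has elapsed.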
diff --git a/mkdocs.yml b/mkdocs.yml
index 378b477d..8e0bcf16 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -55,6 +55,9 @@ nav:
- Overview: preview/high-availability/overview.md
- Local HA: preview/high-availability/local-ha.md
- Advanced Configuration: preview/advanced-configuration/README.md
+ - Monitoring:
+ - Overview: preview/monitoring/overview.md
+ - Metrics Reference: preview/monitoring/metrics.md
- Multi-Region Deployment:
- Overview: preview/multi-region-deployment/overview.md
- Setup Guide: preview/multi-region-deployment/setup.md