feat(telemetry): local telemetry playground using injected OTel sidecar (#342) #286
udsmicrosoft wants to merge 3 commits into documentdb:main
Conversation
Review — PR #286: Local Telemetry Playground + Monitoring Docs

This PR has two parts: (1) a code change to the sidecar injector adding OTel resource attributes, and (2) documentation + playground files for monitoring. Reviewing both.

Code Change:
| Check | Result |
|---|---|
| Env var ordering (`POD_NAME` before `OTEL_RESOURCE_ATTRIBUTES`) | ✅ Correct — K8s resolves `$(VAR)` in declaration order |
| Downward API `metadata.name` field path | ✅ Valid |
| OTel semantic convention `service.instance.id` | ✅ Matches OTel spec |
| Insertion point (after `OTEL_EXPORTER_OTLP_ENDPOINT`, before credentials) | ✅ Clean |
| Existing tests cover this code path | ❌ No lifecycle_test.go found — see Major #3 |
The code change is correct and well-placed.
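For reference, a minimal sketch of the resulting env block on the gateway container — reconstructed from the checks above, not copied from lifecycle.go:

```yaml
# Illustrative only; the real injection happens in Go code, not a manifest.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://localhost:4317
  - name: POD_NAME                  # Downward API
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: OTEL_RESOURCE_ATTRIBUTES  # $(POD_NAME) resolves only because
    value: service.instance.id=$(POD_NAME)  # POD_NAME is declared earlier
```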
🔴 Critical (1)
1. PostgreSQL receiver will fail to authenticate — password is "unused"
The OTel collector config contains:
```yaml
postgresql:
  endpoint: documentdb-preview-rw.documentdb-preview-ns.svc.cluster.local:5432
  username: postgres
  password: "unused"
```

CNPG generates and manages the postgres superuser password in a Kubernetes Secret (typically `<cluster-name>-superuser`). The hardcoded "unused" password will cause the postgresql receiver to fail SQL authentication, meaning all PostgreSQL-level metrics (replication lag, backends, DB size, commits, rollbacks) documented in metrics.md will not actually be collected.
Fix options:

- Mount the CNPG superuser secret into the OTel collector pod and reference it via `${env:PG_MONITOR_PASSWORD}`
- Create a dedicated monitoring role with limited permissions and a known password
- Add a setup step to the README that creates the monitoring credentials

This also applies to `PG_MONITOR_USER: postgres` / `PG_MONITOR_PASSWORD: unused` in the collector deployment env vars.
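A sketch of the first fix option; the secret name `documentdb-preview-superuser` is an assumption based on CNPG's `<cluster-name>-superuser` convention and should be verified against the actual cluster:

```yaml
# Collector Deployment fragment (sketch, not the PR's actual manifest).
env:
  - name: PG_MONITOR_USER
    value: postgres
  - name: PG_MONITOR_PASSWORD
    valueFrom:
      secretKeyRef:
        name: documentdb-preview-superuser   # assumed CNPG secret name
        key: password
```

The collector config can then reference it as `${env:PG_MONITOR_PASSWORD}`.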
🟠 Major (3)
2. Monitoring docs missing front matter (title, description, tags)
Both overview.md and metrics.md lack YAML front matter. Per the documentation standards, every page should include:
```yaml
---
title: Monitoring Overview
description: How to monitor DocumentDB clusters using OpenTelemetry, Prometheus, and Grafana.
tags:
  - monitoring
  - observability
  - metrics
---
```

This affects search ranking, AI discoverability, and consistency with other documentation pages.
3. No tests for the lifecycle.go code change
The sidecar injector's lifecycle.go has no test file. The new POD_NAME and OTEL_RESOURCE_ATTRIBUTES env vars are injected into every DocumentDB gateway container — a functional change that affects all deployments. While the change is straightforward, a unit test would prevent regression (e.g., if someone reorders the env vars, breaking $(POD_NAME) resolution).
Suggestion: Add a test in operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle_test.go that verifies:
- `POD_NAME` env var is present with `FieldRef` to `metadata.name`
- `OTEL_RESOURCE_ATTRIBUTES` is present and contains `service.instance.id=$(POD_NAME)`
- `POD_NAME` appears before `OTEL_RESOURCE_ATTRIBUTES` in the env slice
4. Traffic generator and observability stack contain hardcoded passwords
Multiple files use plaintext passwords:
- `documentdb-ha.yaml`: `password: DemoPassword100`
- `traffic-generator.yaml`: `-p DemoPassword100`
- `observability-stack.yaml`: `GF_SECURITY_ADMIN_PASSWORD: admin`, `GF_AUTH_ANONYMOUS_ORG_ROLE: Admin`
For a playground/demo this is acceptable, but:

- The Grafana anonymous admin access (`GF_AUTH_ANONYMOUS_ENABLED: true` + `GF_AUTH_ANONYMOUS_ORG_ROLE: Admin`) should have a comment warning this is for demo use only
- Consider adding a `⚠️ DO NOT USE IN PRODUCTION` header comment in `observability-stack.yaml`
🟡 Minor (5)
5. Gateway metric names cannot be verified against source code
The docs reference metrics like db_client_operations_total, gateway_client_connections_active, db_client_connection_active, etc. These are emitted by the gateway binary (separate repo), not the operator Go code. I cannot verify these metric names against this repository.
Recommendation: Add a note in metrics.md indicating these metric names are from the DocumentDB Gateway and may change between versions.
6. Collector deployed as Deployment but kubeletstats only covers one node
The kubeletstats receiver with K8S_NODE_NAME env var only scrapes the node where the collector pod runs. In the 4-node Kind cluster (1 control-plane + 3 workers), you'll miss kubelet metrics from 3 of 4 nodes.
Recommendation: Add a note in the observability-stack.yaml or overview.md that kubeletstats coverage is limited in Deployment mode (vs DaemonSet).
7. Kind config uses containerd v1 registry path
The kind-config.yaml uses [plugins."io.containerd.grpc.v1.cri".registry]. With kindest/node:v1.35.0 and containerd 2.x, this path may have changed to io.containerd.cri.v1.images. Verify against the target Kind version.
8. K8S_VERSION defaults to v1.35.0 in setup-kind.sh
This pins to a very specific Kind image. If the image isn't available, the script fails. Consider defaulting to a known-good version or adding a version check.
9. The documentdb-preview-rw endpoint is hardcoded in collector config
The OTel collector config hardcodes documentdb-preview-rw.documentdb-preview-ns.svc.cluster.local:5432. This only works for a cluster named documentdb-preview in namespace documentdb-preview-ns. Document this coupling or make it configurable via env vars.
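One way to decouple this, sketched with hypothetical `PG_HOST`/`PG_PORT` env vars (`${env:VAR}` substitution is supported by the OTel Collector config loader):

```yaml
# Sketch: assemble the endpoint from env vars set on the collector
# Deployment instead of hardcoding the service DNS name.
receivers:
  postgresql:
    endpoint: ${env:PG_HOST}:${env:PG_PORT}   # hypothetical env vars
    username: ${env:PG_MONITOR_USER}
    password: ${env:PG_MONITOR_PASSWORD}
```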
✅ What's Good
- Clean code change: The `POD_NAME`/`OTEL_RESOURCE_ATTRIBUTES` injection is well-implemented with correct K8s env var ordering
- Complete observability stack: Tempo (traces) + Loki (logs) + Prometheus (metrics) + Grafana (dashboards) + OTel Collector — full three-pillar observability
- ExternalName service bridge: Clever cross-namespace OTLP routing from DocumentDB namespace to observability namespace
- Comprehensive metrics reference: `metrics.md` covers container, gateway, operator, and PostgreSQL metrics with useful PromQL examples
- Architecture diagram: Clear ASCII art showing data flow from pods → collector → backends → Grafana
- OTel naming table: OpenTelemetry ↔ Prometheus metric name mapping table is very helpful
- Traffic generator: Well-designed mongosh script with varied CRUD operations, periodic cleanup, and error handling
- mkdocs.yml change is clean: Only adds nav entries, no extension conflicts with other PRs
- Grafana dashboard JSON: Pre-built 1528-line dashboard — users get instant value
- RBAC for collector: Proper ClusterRole with nodes/stats, pods, services, metrics access
Summary
| Severity | Count | Items |
|---|---|---|
| 🔴 Critical | 1 | PostgreSQL receiver auth will fail (password: "unused") |
| 🟠 Major | 3 | Missing front matter, no lifecycle tests, hardcoded passwords need warnings |
| 🟡 Minor | 5 | Unverifiable gateway metrics, kubeletstats single-node, containerd v2, K8s version pin, hardcoded endpoint |
| ✅ | — | Code change correct, complete stack, clean mkdocs, good architecture |
Verdict: The code change to lifecycle.go is correct and adds genuine value (per-instance metric attribution). The observability stack is comprehensive and well-designed. Fix the PostgreSQL receiver authentication (Critical #1) — without it, half the documented metrics won't actually be collected. Add front matter to the docs and ideally a unit test for the env var injection.
Pull request overview
Adds a local “telemetry playground” for running a DocumentDB HA demo on Kind with a full observability stack, and introduces monitoring documentation pages (with MkDocs nav updates) describing the telemetry/metrics architecture and references. It also updates the CNPG sidecar injector to set per-pod OpenTelemetry resource attributes for better per-instance attribution.
Changes:

- Add `documentdb-playground/telemetry/local/demo` (Kind setup/deploy/teardown scripts, k8s manifests, traffic generator, Grafana dashboards).
- Add Monitoring docs (`overview.md`, `metrics.md`) and update `mkdocs.yml` nav + Mermaid fenced blocks support.
- Update sidecar injector to inject `POD_NAME` and `OTEL_RESOURCE_ATTRIBUTES` (and enable tracing via env var).
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.
Summary per file:
| File | Description |
|---|---|
| operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go | Injects additional OTEL-related env vars into the gateway sidecar for per-instance attribution. |
| mkdocs.yml | Adds Monitoring section to nav and enables Mermaid code fences via pymdownx.superfences. |
| documentdb-playground/telemetry/local/scripts/teardown.sh | Adds Kind cluster teardown + proxy container cleanup. |
| documentdb-playground/telemetry/local/scripts/setup-kind.sh | Adds Kind cluster + local registry bootstrap script. |
| documentdb-playground/telemetry/local/scripts/deploy.sh | Adds one-command deploy flow for observability stack + DocumentDB + traffic. |
| documentdb-playground/telemetry/local/k8s/traffic/traffic-generator.yaml | Adds RW/RO traffic generator Jobs and gateway Services for demo load. |
| documentdb-playground/telemetry/local/k8s/observability/tempo.yaml | Adds Tempo deployment/service for traces backend. |
| documentdb-playground/telemetry/local/k8s/observability/prometheus.yaml | Adds Prometheus deployment/service and scrape config for the collector. |
| documentdb-playground/telemetry/local/k8s/observability/otel-collector.yaml | Adds OTel Collector deployment, RBAC, and pipelines for metrics/traces/logs. |
| documentdb-playground/telemetry/local/k8s/observability/namespace.yaml | Adds observability namespace manifest. |
| documentdb-playground/telemetry/local/k8s/observability/loki.yaml | Adds Loki deployment/service for logs backend. |
| documentdb-playground/telemetry/local/k8s/observability/grafana.yaml | Adds Grafana deployment/service plus datasource & dashboard provisioning. |
| documentdb-playground/telemetry/local/k8s/documentdb/collector-bridge.yaml | Adds ExternalName bridge Service so gateway OTLP endpoint resolves across namespaces. |
| documentdb-playground/telemetry/local/k8s/documentdb/cluster.yaml | Adds demo DocumentDB namespace, credentials, and DocumentDB CR for 3-instance HA. |
| documentdb-playground/telemetry/local/dashboards/internals.json | Adds “Internals” Grafana dashboard JSON. |
| documentdb-playground/telemetry/local/dashboards/gateway.json | Adds “Gateway” Grafana dashboard JSON. |
| documentdb-playground/telemetry/local/README.md | Documents local playground usage and architecture (diagram + steps). |
| docs/operator-public-documentation/preview/monitoring/overview.md | Adds monitoring architecture + setup/verification guidance. |
| docs/operator-public-documentation/preview/monitoring/metrics.md | Adds metrics reference and PromQL examples across signal sources. |
> - **Helm 3** — [install](https://helm.sh/docs/intro/install/)
> - **jq** — for credential copying
>
> !!! important "Gateway OTEL support required"

These docs don't use the mkbuild stuff, so the `!!! important` tag doesn't work here.
> ./scripts/validate.sh
>
> # Tear down
> ./scripts/teardown.sh

Maybe this could have a separate heading for cleanup instead of being in quickstart.

I see you do have a section for it, so I think it can be removed here.
WentingWu666666 left a comment
docs/operator-public-documentation/preview/monitoring/overview.md (lines 23-30)
Instead of asking users to build their own gateway image from a specific branch, it would be better to publish a pre-built gateway image (e.g., to ghcr.io/documentdb/...) that includes the OTEL instrumentation. Requiring users to build the image themselves adds friction and is error-prone; most users won't have the build toolchain set up for the gateway. Could we provide a ready-to-use image with telemetry support, or at least track an issue to do so before GA?
Great work on the documentation, dashboards, and overall polish! The metrics reference and alerting rules are really nice. I'd like to discuss the collector architecture before we merge. The current design uses a central Deployment + DaemonSet, which has some limitations for production:

1. Replica blind spot: the collector's `postgresql` receiver targets only the `-rw` (primary) service, so replica instances aren't scraped.
2. Hardcoded endpoint: the PG endpoint `documentdb-preview-rw.documentdb-preview-ns.svc.cluster.local:5432` only works for one cluster name and namespace.
3. Cross-namespace coupling: the ExternalName bridge service is needed because the collector lives in a different namespace. This adds operational complexity and breaks if the CR is renamed.

Alternative: OTel Collector as a sidecar (injected by the CNPG plugin). I've been prototyping this approach.

Scope suggestion: for the initial merge, I'd suggest we reduce scope to VM/container metrics only (CPU, memory, network, filesystem).

Happy to walk through the sidecar POC and discuss tradeoffs; the docs and dashboards from this PR would pair perfectly with the sidecar approach.
…sidecar

- Operator: add OTLP gRPC :4317 receiver to embedded base_config.yaml and include it in the generated metrics pipeline.
- Sidecar-injector: when monitoring is enabled, set POD_NAME (downward API) and OTEL_RESOURCE_ATTRIBUTES=service.instance.id=$(POD_NAME) on the gateway container alongside the existing OTEL_EXPORTER_OTLP_ENDPOINT.
- Playground: remove central OTel Collector, per-node DaemonSet, per-instance Services, ExternalName bridge, and the monitoring PG role.
- Cluster manifest now sets spec.monitoring.enabled=true so each pod gets an in-pod otel-collector container; Prometheus discovers them via the pod annotations the injector adds.
- Prometheus config switched to pod-annotation discovery + native kubelet / cAdvisor scrape; deploy.sh installs operator chart from this branch.
- Dashboards remapped to cAdvisor metric names; Postgres-internal panels removed (deferred to a follow-up that extends base_config.yaml SQL).
- Public monitoring docs and playground README rewritten for the sidecar architecture; Postgres metrics explicitly marked deferred.

Signed-off-by: urismiley <urismiley@microsoft.com>
…review

- lifecycle.go: inject OTEL_METRICS_ENABLED=true on gateway container (pgmongo gates OTel init behind it); extract injectGatewayOTelEnv helper with name-based dedup; add lifecycle_test.go (5 cases).
- base_config.yaml: bind OTLP receiver to 127.0.0.1:4317.
- cluster.yaml: pin gateway image with OTel support; sidecar prom port 9188.
- dashboards: pivot Instance variable on pod label (Prometheus rewrites service.instance.id to exported_instance).
- validate.sh: poll up to 120s for db_client_operations_total and fail red.
- README/prometheus.yaml: document 9187/9188 split; add Gateway image section.
- .gitignore: ignore generated operator/src/documentdb-chart/.

Signed-off-by: urismiley <urismiley@microsoft.com>
Suggestion: Add a `kubeletstats` receiver to the sidecar collector.

Currently, per-container CPU/memory metrics (cAdvisor) are only available via the Prometheus kubelet scrape path. Since the OTel Collector sidecar already supports both push (OTLP exporter) and pull (Prometheus exporter on `:9188`), container resource metrics could flow through it as well:

```yaml
receivers:
  otlp: ...        # gateway app metrics (existing)
  kubeletstats:    # container resource metrics (new)
    auth_type: serviceAccount
    collection_interval: 30s
    metric_groups: [pod, container]
```

This gives users pod- and container-level resource metrics from the same collector. The only additional requirement is RBAC for the pod's ServiceAccount to read `nodes/stats`. With this change, the sidecar would cover both application and container metrics.
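A minimal sketch of that RBAC piece, with an illustrative role name (the central collector's ClusterRole in this PR already grants similar access):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-sidecar-kubeletstats    # illustrative name
rules:
  - apiGroups: [""]
    resources: ["nodes/stats"]       # kubelet stats API read by kubeletstats
    verbs: ["get"]
```

A matching ClusterRoleBinding to the DocumentDB pod's ServiceAccount would complete it.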
Suggestion: Split this PR into smaller, focused PRs by metric category.

This PR covers a lot of ground (2900+ lines, 21 files). The operator core changes (sidecar injection, OTel base_config, lifecycle env var injection) look complete. I'd suggest landing that as-is and then adding metrics support incrementally by category.

Each can be its own PR with matching docs, dashboards, and playground updates. Easier to review, easier to revert, and it unblocks the foundational pieces while the rest iterates.
Summary
Adds a self-contained local telemetry playground (Kind + Prometheus + Grafana + dashboards + traffic generators) and aligns it with the injected OTel Collector sidecar introduced by #342.
Architecture (re-aligned on #342)
Every DocumentDB pod runs a per-pod OTel Collector sidecar injected by the CNPG sidecar-injector plugin (`spec.monitoring.enabled=true`). There is no central OTel Collector Deployment and no per-node DaemonSet in this stack:

- `otlp` receiver on `127.0.0.1:4317` → re-exported via the `prometheus` exporter on `:9188`.
- The CNPG instance-manager's metrics on `:9187` are scraped directly.
operator/)internal/otel/base_config.yaml— adds anotlpreceiver bound to127.0.0.1:4317(loopback only; gateway and collector share pod netns) and wires it into themetricspipeline alongside the existingsqlqueryreceiver.cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go— when injecting the sidecar, also adds OTel env vars to the gateway container:POD_NAME(downward API)OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317OTEL_METRICS_ENABLED=true← required: the gateway gates OTel init off by default and checks this env varOTEL_RESOURCE_ATTRIBUTES=service.instance.id=$(POD_NAME)Injection is idempotent across CREATE and PATCH via name-based dedup — without this, repeated reconciles double-append entries and CNPG fails with
Pod is invalid: spec: Forbidden: pod updates may not change fields other than ....New
lifecycle_test.gocovers: all-appended, metrics-enabled-present, preserves-existing, idempotent re-invocation, no-gateway-container.internal/otel/config_test.goextended for the newotlpreceiver and pipeline wiring.Playground (
documentdb-playground/telemetry/local/)One-command deploy:
./scripts/deploy.sh— Kind cluster + cert-manager + operator (from this branch) + observability stack + DocumentDB HA + traffic generators.Dashboards' Instance variable pivots on the
podlabel (set by the collector'sresourceprocessor) — Prometheus's scrapeinstancelabel clobbers OTel'sservice.instance.id.Sidecar Prometheus port is 9188 (not the conventional 9187, which collides with the CNPG instance-manager in the same pod netns).
validate.shpolls up to 120s fordb_client_operations_totaland fails red if absent — surfaces gateway-image issues clearly instead of warning.The default upstream gateway image does not yet emit OTLP
db_client_*metrics — that instrumentation lives in pgmongo and has not been published in an upstream release. The playground'scluster.yamltherefore pins:Once an upstream release ships with OTel metrics, drop the pin to fall back to the operator default. README §Gateway image documents the build recipe (pgmongo
oss→ emulator-shape image → re-wrap with this repo'sDockerfile_gateway_public_imagefor the K8s slim shape).Documentation (
docs/.../monitoring/)base_config.yaml, ServiceMonitor examplesmkdocs.ymlnavigation