
feat(telemetry): local Kind playground for gateway + container metrics, plus monitoring docs #351

Draft
udsmicrosoft wants to merge 1 commit into documentdb:main from udsmicrosoft:users/urismiley/telemetry-playground-gateway

@udsmicrosoft
Collaborator

Summary

Adds an end-to-end telemetry playground for DocumentDB on a local Kind cluster, plus user-facing monitoring documentation.

The playground deploys four things: the operator (from this branch's in-tree Helm chart); a 3-instance HA DocumentDB cluster with spec.monitoring.enabled=true, which causes the operator's CNPG sidecar plugin to inject a per-pod OTel Collector; Prometheus and Grafana; and traffic generators. Two pre-built Grafana dashboards ship out of the box:

  • Gateway — OTLP-pushed db_client_* metrics (request rate, latency, error rate, in-flight)
  • Internals — container/node metrics scraped directly from kubelet/cAdvisor

There is no central OTel Collector and no per-node DaemonSet — every signal lives in the per-pod sidecar (gateway) or comes straight from kubelet (container/node).
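One quick way to confirm that the sidecar-pushed gateway metrics actually reach Prometheus is to list the metric names it knows about and filter on the db_client_ prefix. The sketch below is illustrative only: the helper name list_metrics_with_prefix is hypothetical, and the usage comment assumes Prometheus has been port-forwarded to localhost:9090, which this PR does not state. The only Prometheus feature relied on is the standard label-values HTTP endpoint.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads a Prometheus "label values" JSON response on
# stdin and prints every metric name that starts with the given prefix
# (default: db_client_), one per line, sorted and de-duplicated.
list_metrics_with_prefix() {
  local prefix="${1:-db_client_}"
  # grep exits 1 when nothing matches; "|| true" keeps "set -e" from aborting.
  { grep -o "\"${prefix}[A-Za-z0-9_:]*\"" || true; } | tr -d '"' | sort -u
}

# Usage (assumption: Prometheus is reachable on localhost:9090):
#   curl -s 'http://localhost:9090/api/v1/label/__name__/values' \
#     | list_metrics_with_prefix db_client_
```

An empty result would suggest that the gateway image in use does not emit the OTLP metrics, or that the sidecar is not being scraped.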

What's included

  • documentdb-playground/telemetry/local/
    • scripts/{setup-kind,deploy,validate,teardown}.sh — idempotent end-to-end automation
    • k8s/observability/{namespace,prometheus,grafana}.yaml
    • k8s/documentdb/cluster.yaml — HA cluster with monitoring enabled
    • k8s/traffic/traffic-generator.yaml — mongosh read/write workload
    • dashboards/{gateway,internals}.json — pre-provisioned Grafana dashboards
    • README.md — quick-start, architecture, troubleshooting
  • docs/operator-public-documentation/preview/monitoring/{overview,metrics}.md + mkdocs.yml nav entry
  • .gitignore entry for operator/src/documentdb-chart/ (generated by helm dependency update)

Gateway image caveat

The default upstream gateway image does not yet emit OTLP db_client_* metrics — that instrumentation lives in the pgmongo project and has not been published in an upstream release. The playground temporarily pins udsmiley/documentdb-gateway-otel:k8s-pgmongo-main-latest in k8s/documentdb/cluster.yaml. Once an OTel-enabled gateway image is published upstream and the operator's DOCUMENTDB_VERSION is bumped, the pin can be removed. This caveat is called out at the top of the playground README.
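To find the temporary pin when the time comes to remove it, a plain search for the image repository is enough. This is a minimal sketch; the helper name find_gateway_pin is hypothetical, and it deliberately matches only the image string quoted above, so it makes no assumption about which field of cluster.yaml carries it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints "line-number:line" for every line of FILE that
# references the temporarily pinned gateway image. Returns non-zero when no
# line matches, i.e. once the pin has been removed.
find_gateway_pin() {
  local file="$1"
  grep -n 'udsmiley/documentdb-gateway-otel' "$file"
}

# Usage, from documentdb-playground/telemetry/local:
#   find_gateway_pin k8s/documentdb/cluster.yaml || echo "pin removed"
```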

How to validate

cd documentdb-playground/telemetry/local
./scripts/deploy.sh
./scripts/validate.sh
kubectl port-forward svc/grafana 3000:3000 -n observability --context kind-documentdb-telemetry
# → http://localhost:3000  (admin/admin, Dashboards → DocumentDB folder)

deploy.sh is idempotent. Re-runs after a failure pick up where they left off.
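For readers unfamiliar with the pattern, one common way to make a deploy script resumable is to pair each step with a cheap existence check and skip the step when the check passes. The sketch below illustrates that idea and is not taken from deploy.sh: run_step, cluster_exists, and create_cluster are hypothetical names, and the Kind cluster name is inferred from the kind-documentdb-telemetry context used above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_step NAME CHECK APPLY
# Runs APPLY only when CHECK fails, so steps that already completed are
# skipped on a re-run. CHECK and APPLY are names of shell functions.
run_step() {
  local name="$1" check="$2" apply="$3"
  if "$check" >/dev/null 2>&1; then
    echo "skip: ${name}"
  else
    echo "run: ${name}"
    "$apply"
  fi
}

# Example step (hypothetical): create the Kind cluster only if it is missing.
cluster_exists() { kind get clusters 2>/dev/null | grep -x 'documentdb-telemetry' >/dev/null; }
create_cluster() { kind create cluster --name documentdb-telemetry; }

# Usage:
#   run_step "kind cluster" cluster_exists create_cluster
```

Because each check inspects the real state rather than a progress file, a re-run after a partial failure repeats only the steps whose results are missing.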

Scope

This PR is playground + docs only — no operator code is changed. CI gates that exercise operator code/tests are unaffected.

…trics

Adds documentdb-playground/telemetry/local/ — an end-to-end observability
stack for DocumentDB on Kind:

- Kind cluster + cert-manager + operator (from in-tree Helm chart)
- DocumentDB HA cluster (1 primary + 2 replicas) with monitoring enabled
- Per-pod OTel Collector sidecar (provided by the operator)
- Prometheus + Grafana in an 'observability' namespace
- Two pre-built Grafana dashboards: Gateway and Internals
- Traffic generators (mongosh read/write workload) for demoable signal
- Idempotent deploy.sh / setup-kind.sh / validate.sh / teardown.sh

Adds user-facing monitoring documentation:

- docs/operator-public-documentation/preview/monitoring/overview.md
- docs/operator-public-documentation/preview/monitoring/metrics.md
- mkdocs.yml nav entry under the preview section

Adds .gitignore entry for operator/src/documentdb-chart/, the flattened
chart copy produced by 'helm dependency update' during local dev.

Signed-off-by: urismiley <urismiley@microsoft.com>
udsmicrosoft force-pushed the users/urismiley/telemetry-playground-gateway branch from de190e9 to e7721f6 (April 22, 2026 16:21)