feat(telemetry): local Kind playground for gateway + container metrics, plus monitoring docs #351
Draft
udsmicrosoft wants to merge 1 commit into documentdb:main from
…trics

Adds documentdb-playground/telemetry/local/, an end-to-end observability stack for DocumentDB on Kind:
- Kind cluster + cert-manager + operator (from in-tree Helm chart)
- DocumentDB HA cluster (1 primary + 2 replicas) with monitoring enabled
- Per-pod OTel Collector sidecar (provided by the operator)
- Prometheus + Grafana in an 'observability' namespace
- Two pre-built Grafana dashboards: Gateway and Internals
- Traffic generators (mongosh read/write workload) for demoable signal
- Idempotent deploy.sh / setup-kind.sh / validate.sh / teardown.sh

Adds user-facing monitoring documentation:
- docs/operator-public-documentation/preview/monitoring/overview.md
- docs/operator-public-documentation/preview/monitoring/metrics.md
- mkdocs.yml nav entry under the preview section

Adds .gitignore entry for operator/src/documentdb-chart/, the flattened chart copy produced by 'helm dependency update' during local dev.

Signed-off-by: urismiley <urismiley@microsoft.com>
Summary
Adds an end-to-end telemetry playground for DocumentDB on a local Kind cluster, plus user-facing monitoring documentation.
The playground deploys the operator from this branch's in-tree Helm chart, a 3-instance HA DocumentDB cluster with `spec.monitoring.enabled=true` (which causes the operator's CNPG sidecar plugin to inject a per-pod OTel Collector), Prometheus + Grafana, and traffic generators. Two pre-built Grafana dashboards, Gateway and Internals, ship out of the box; the Gateway dashboard covers the `db_client_*` metrics (request rate, latency, error rate, in-flight).
There is no central OTel Collector and no per-node DaemonSet: every signal lives in the per-pod sidecar (gateway) or comes straight from kubelet (container/node).
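One way to spot-check the per-pod sidecar injection described above is to list each pod's container names and flag any pod that lacks the collector. The sketch below is only illustrative: the helper `pods_missing_sidecar`, the container name `otel-collector`, and the namespace placeholder are assumptions, not names taken from this PR, and it assumes the sidecar appears under `.spec.containers` rather than as an init container.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR).
# Reads lines of "pod<TAB>space-separated container names" on stdin and
# prints the name of every pod that does NOT have the given container.
pods_missing_sidecar() {
  sidecar="$1"
  # Split each line on TAB only, so the container list stays in one field.
  while IFS="$(printf '\t')" read -r pod containers; do
    case " $containers " in
      *" $sidecar "*) ;;      # sidecar present, nothing to report
      *) echo "$pod" ;;       # sidecar absent
    esac
  done
}

# Usage against a live cluster (namespace and container name are placeholders):
# kubectl get pods -n <namespace> \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
#   | pods_missing_sidecar otel-collector
```

An empty result would mean every listed pod carries the container; the real container name should be checked against the operator's output before relying on this.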
What's included
- `documentdb-playground/telemetry/local/`
  - `scripts/{setup-kind,deploy,validate,teardown}.sh`: idempotent end-to-end automation
  - `k8s/observability/{namespace,prometheus,grafana}.yaml`
  - `k8s/documentdb/cluster.yaml`: HA cluster with monitoring enabled
  - `k8s/traffic/traffic-generator.yaml`: mongosh read/write workload
  - `dashboards/{gateway,internals}.json`: pre-provisioned Grafana dashboards
  - `README.md`: quick-start, architecture, troubleshooting
- `docs/operator-public-documentation/preview/monitoring/{overview,metrics}.md` + `mkdocs.yml` nav entry
- `.gitignore` entry for `operator/src/documentdb-chart/` (generated by `helm dependency update`)
Gateway image caveat
The default upstream gateway image does not yet emit OTLP `db_client_*` metrics: that instrumentation lives in the pgmongo project and has not been published in an upstream release. The playground temporarily pins `udsmiley/documentdb-gateway-otel:k8s-pgmongo-main-latest` in `k8s/documentdb/cluster.yaml`. Once an OTel-enabled gateway image is published upstream and the operator's `DOCUMENTDB_VERSION` is bumped, the pin can be removed. This caveat is called out at the top of the playground README.
How to validate
`deploy.sh` is idempotent. Re-runs after a failure pick up where they left off.
Scope
This PR is playground + docs only — no operator code is changed. CI gates that exercise operator code/tests are unaffected.
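To illustrate the resume-after-failure behaviour claimed for `deploy.sh` under "How to validate", here is a minimal sketch of one way a script can skip steps that already succeeded. It is not the PR's `deploy.sh`: the `run_step` helper, the marker-file directory, and the step names are hypothetical, and a real Kubernetes script would more likely check cluster state directly than keep marker files.

```shell
#!/bin/sh
# Hypothetical sketch of an idempotent step runner (NOT the PR's deploy.sh).
set -eu

# Marker files record which steps have completed; the location is an assumption.
STATE_DIR="${STATE_DIR:-$(mktemp -d)}"

# run_step NAME CMD...: run CMD unless NAME already completed.
# The marker is written only when CMD succeeds, so a failed step is retried
# on the next run while earlier successful steps are skipped.
run_step() {
  name="$1"; shift
  if [ -e "$STATE_DIR/$name.done" ]; then
    echo "skip $name"
    return 0
  fi
  "$@" && touch "$STATE_DIR/$name.done" && echo "done $name"
}

# Placeholder steps; `true` stands in for the real work.
run_step cluster true     # prints "done cluster"
run_step operator true    # prints "done operator"
run_step cluster true     # prints "skip cluster"
```

Because `set -e` is active, a failing step aborts the script before its marker is written, which is what lets a re-run continue from the failed step.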