feat(otel): receive gateway OTLP metrics in the per-pod sidecar#350
feat(otel): receive gateway OTLP metrics in the per-pod sidecar#350udsmicrosoft wants to merge 3 commits intodocumentdb:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Connects the documentdb-gateway’s OTLP metrics emission to the per-pod OpenTelemetry Collector sidecar (injected by the CNPG sidecar-injector) by adding an OTLP receiver to the collector config and injecting the required OTEL env vars into the gateway container.
Changes:
- Add an
otlpreceiver (gRPC, 127.0.0.1:4317) to the embedded OTel Collector base config and wire it into themetricspipeline. - Update OTel config generation/tests to include the new receiver and pipeline wiring.
- Extend the sidecar-injector lifecycle to idempotently inject OTEL-related env vars into the
documentdb-gatewaycontainer, with new unit tests.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| operator/src/internal/otel/base_config.yaml | Adds an OTLP gRPC receiver on loopback and keeps existing sqlquery receiver. |
| operator/src/internal/otel/config.go | Wires otlp receiver into the dynamic metrics pipeline. |
| operator/src/internal/otel/config_test.go | Asserts otlp receiver exists and pipeline includes sqlquery + otlp. |
| operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go | Replaces direct env append with idempotent gateway OTEL env injection helper. |
| operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle_test.go | Adds tests for env injection order, dedup, idempotency, and no-op behavior. |
|
Consider dropping OTEL_RESOURCE_ATTRIBUTES (and its POD_NAME dependency) The OTel Collector's resource processor already attaches
So Removing this simplifies the env var list from 4 2 ( |
Wires the documentdb-gateway's OTLP/gRPC metric output into the
per-pod OTel Collector sidecar so the sidecar's prometheus exporter
can re-export gateway metrics alongside the existing sqlquery output.
operator/src/internal/otel:
- base_config.yaml: add an otlp receiver bound to 127.0.0.1:4317
(loopback only; gateway and collector share pod netns) and add it
to the metrics pipeline alongside sqlquery.
- config.go / config_test.go: register the receiver in the default
pipeline list and assert the wiring.
operator/cnpg-plugins/sidecar-injector/internal/lifecycle:
- lifecycle.go: when injecting the sidecar, also append four OTel
env vars to the documentdb-gateway container so it can push to the
sidecar:
POD_NAME (downward API),
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317,
OTEL_METRICS_ENABLED=true (the gateway gates OTel init off by
default and reads this env var to enable it),
OTEL_RESOURCE_ATTRIBUTES=service.instance.id=$(POD_NAME).
Injection is idempotent across CREATE and PATCH operations via
a name-based dedup map; without that, repeated reconciles would
double-append entries and CNPG would reject the patch with
'Pod is invalid: spec: Forbidden: pod updates may not change
fields other than ...'.
- lifecycle_test.go: new file. Five table-driven tests cover
all-appended ordering (POD_NAME must precede the var that
interpolates $(POD_NAME)), OTEL_METRICS_ENABLED presence
(regression guard), pre-existing env preserved, idempotent
re-invocation produces a byte-equal Env slice, and no-op when no
gateway container is present.
Signed-off-by: urismiley <urismiley@microsoft.com>
- Use 127.0.0.1 (not localhost) for OTEL_EXPORTER_OTLP_ENDPOINT so it matches the collector's IPv4-only receiver bind on dual-stack nodes. - Correct the gatewayContainerName comment: gateway pushes OTLP to the sidecar collector, not the other way around. - gofmt the gatewayOTelEnvVars block. Signed-off-by: urismiley <urismiley@microsoft.com>
…UTES on gateway The OTel Collector's resource processor already attaches k8s.pod.name to every metric flowing through the pipeline (including OTLP-received gateway metrics), so service.instance.id=$(POD_NAME) set via OTEL_RESOURCE_ATTRIBUTES is redundant. Drops gateway env vars from 4 to 2, removes the POD_NAME intermediate, and removes the POD_NAME -> OTEL_RESOURCE_ATTRIBUTES ordering constraint. Signed-off-by: urismiley <urismiley@microsoft.com>
518ad7b to
2a448e8
Compare
xgerman
left a comment
There was a problem hiding this comment.
Review
Verdict: /approve with two follow-ups. No blocking issues.
Note: the PR description is stale against the current commit (
2a448e8). The code no longer injectsPOD_NAMEorOTEL_RESOURCE_ATTRIBUTES; onlyOTEL_EXPORTER_OTLP_ENDPOINTandOTEL_METRICS_ENABLEDare set, and pod identity is picked up via the collector'sresourceprocessor. The POD_NAME-ordering andservice.instance.idcaveats in the PR body are moot against the real diff. Worth updating the PR description before merge.
🟠 Major — worth addressing before or right after merge
1. OTEL_METRICS_ENABLED=true is unconditional — no CRD opt-out — operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go:391-402
The dedup logic preserves user-set values, but there is no way for a cluster operator to opt out of gateway OTLP emission short of hand-setting the env var on the CNPG pod template. Enabling monitoring now silently enables gateway metrics emission — a behavior change for existing clusters with monitoring already on. Suggest a Spec.Monitoring.Gateway.MetricsEnabled *bool knob, or at minimum a CHANGELOG entry (there is none) + a godoc note on MonitoringSpec.
2. Gateway-image env-var contract is uncovenanted and unversioned — lifecycle.go:393-398
The operator now hard-codes knowledge of two env vars whose semantics live in the external gateway repo. If the gateway image renames or retypes either variable, metrics stop silently: the collector comes up clean, the prometheus exporter returns 200, and scrapers see zero db_client_* series with no error surface. Suggest (a) a minimum-gateway-version note near the constants, and (b) an e2e smoke asserting ≥1 db_client_* series reaches the prometheus exporter — bare presence is enough to catch the rename class.
3. Pipeline assembled only when exporters configured; otlp receiver is orphaned in the no-exporter case — operator/src/internal/otel/config.go:130-156
Pre-existing structural issue that this PR widens. When len(exporterNames)==0, generateDynamicConfig writes no service: block, but static.yaml now declares both sqlquery and otlp receivers with no pipeline referencing them. The OTel Collector refuses to start with "receiver declared but not used". Before this PR the same was true for sqlquery alone, so behavior doesn't change — but the new tests all run through paths with an exporter, so the no-exporter case remains untested. If monitoring is ever enabled with Spec.Exporter == nil (schema allows it), the collector crash-loops. At minimum add a regression test; ideally add CRD validation requiring ≥1 exporter, or fall back to a nop/debug exporter.
🟡 Minor
- #4
lifecycle_test.go:70-85— When a user pre-sets a non-loopbackOTEL_EXPORTER_OTLP_ENDPOINT, the sidecar's127.0.0.1:4317receiver sits empty. Injector shouldlog.Printfa warning when dedup skips a key — zero-cost observability for a subtle misconfiguration. - #5
lifecycle_test.go:87-103— Idempotency test doesn't cover the realistic PATCH case where another admission controller has already inserted env vars. Add a case starting with a pre-existingFOO=barand assert it stays at index 0 after a second invocation. - #6
config_test.go:45-55— AddExpect(protocols).NotTo(HaveKey("http"))to lock the protocol surface; consider validating the merged static+dynamic YAML with a real collector parser fixture. - #7
lifecycle.go:422-425—map[string]bool→ idiomatic Go usesmap[string]struct{}for a set. Cosmetic. - #8 — No doc update for
documentdb-playground/telemetry/and no CHANGELOG. The PR advertises newdb_client_*series but doesn't document them.
🟢 Nit
- #9
gatewayOTelEnvVars()— allocates per call. Package-levelvarwould be marginally cleaner and lets the test reference the source of truth directly. - #10
lifecycle.go:352-355— "traces pipeline TBD" should be// TODO(#<issue>)with a tracked issue, not a bare promissory note.
Correctly handled (worth crediting)
- Loopback binding
127.0.0.1:4317inbase_config.yamlis explicitly not0.0.0.0, and collector internal telemetry isLevel: "none"atconfig.go:142— no accidental0.0.0.0:8888dual-listener leak. ✅ - Config-drift → rolling restart already handled via
HashConfigMapData→ plugin parameter diff → CNPG restart annotation. ✅ - Sidecar-presence gate:
injectGatewayOTelEnvis called inside the "OTel sidecar injected" branch, so clusters without monitoring don't get connection-refused on 4317. ✅ - Idempotency design: name-based dedup with byte-equal
reflect.DeepEqualis the right primitive — correct for CNPG's serialized-spec comparison and future-proof against re-introducing$(VAR)env-ref interpolation. ✅ - No secrets, no PII: the PR doesn't set
OTEL_RESOURCE_ATTRIBUTESat all, and the collector's resource processor emits only cluster/namespace/pod identity. ✅
Thanks for the clean refactor between revisions — relying on the collector resource processor instead of hand-rolled env-var interpolation is the right call.
Reviewed with the GitHub Copilot code-review agent on commit 2a448e8.
Summary
Wires the documentdb-gateway's OTLP/gRPC metric output into the per-pod OTel
Collector sidecar that the CNPG sidecar-injector plugin already injects. After
this change, the sidecar's existing prometheus exporter re-exports the
gateway's
db_client_*metrics alongside the existingdocumentdb.postgres.upsqlquery output.
Changes
operator/src/internal/otel/base_config.yaml— add anotlpreceiver bound to127.0.0.1:4317(loopback only; the gateway and collector share the pod network namespace,
so loopback is sufficient and avoids exposing the listener pod-wide). Wire
the receiver into the
metricspipeline alongside the existingsqlquery.config.go— register the receiver in the default pipeline list.config_test.go— assert the receiver definition (port 4317, gRPCprotocol) and the pipeline wiring (
ConsistOf("sqlquery", "otlp")).operator/cnpg-plugins/sidecar-injector/internal/lifecycle/When the sidecar-injector injects the OTel sidecar, also append four OTel env
vars to the
documentdb-gatewaycontainer so the gateway can push its metricsto the co-located sidecar:
POD_NAMEmetadata.nameOTEL_EXPORTER_OTLP_ENDPOINThttp://localhost:4317OTEL_METRICS_ENABLEDtrueOTEL_RESOURCE_ATTRIBUTESservice.instance.id=$(POD_NAME)OTEL_METRICS_ENABLED=trueis required: the gateway gates OTel initializationoff by default and reads this env var (or a JSON
TelemetryOptionsblock) atstartup. Without it, the gateway logs
MeterProvider.GlobalSetis neveremitted and zero
db_client_*metrics flow.Injection is idempotent across CREATE and PATCH via a name-based dedup
map. Without that, repeated reconciles double-append entries and CNPG rejects
the patch with
Pod is invalid: spec: Forbidden: pod updates may not change fields other than ....Tests
New
lifecycle_test.gocovers:OTEL_RESOURCE_ATTRIBUTESso$(POD_NAME)resolves per Kubernetes env-varordering rules).
OTEL_METRICS_ENABLED=truepresent (regression guard — without it thegateway silently disables OTLP export).
Envslice (the actualCNPG-PATCH scenario).
config_test.goextended for the newotlpreceiver and pipeline.Verification
End-to-end verified on Kind: gateway OTLP metrics flow through the sidecar's
OTLP receiver and are exported by the prometheus exporter; Prometheus scrapes
the sidecar and surfaces the full
db_client_*series with per-pod labels.Notes for reviewers
127.0.0.1:4317, not0.0.0.0, because it's onlyever consumed from inside the same pod. Tighter binding, no functional
difference.
OTEL_RESOURCE_ATTRIBUTESwith a different value,service.instance.idwill not be merged in. Documented in the function comment.