Skip to content

feat(otel): receive gateway OTLP metrics in the per-pod sidecar#350

Open
udsmicrosoft wants to merge 3 commits intodocumentdb:mainfrom
udsmicrosoft:users/urismiley/telemetry-otlp-receiver
Open

feat(otel): receive gateway OTLP metrics in the per-pod sidecar#350
udsmicrosoft wants to merge 3 commits intodocumentdb:mainfrom
udsmicrosoft:users/urismiley/telemetry-otlp-receiver

Conversation

@udsmicrosoft
Copy link
Copy Markdown
Collaborator

Summary

Wires the documentdb-gateway's OTLP/gRPC metric output into the per-pod OTel
Collector sidecar that the CNPG sidecar-injector plugin already injects. After
this change, the sidecar's existing prometheus exporter re-exports the
gateway's db_client_* metrics alongside the existing documentdb.postgres.up
sqlquery output.

Changes

operator/src/internal/otel/

  • base_config.yaml — add an otlp receiver bound to 127.0.0.1:4317
    (loopback only; the gateway and collector share the pod network namespace,
    so loopback is sufficient and avoids exposing the listener pod-wide). Wire
    the receiver into the metrics pipeline alongside the existing sqlquery.
  • config.go — register the receiver in the default pipeline list.
  • config_test.go — assert the receiver definition (port 4317, gRPC
    protocol) and the pipeline wiring (ConsistOf("sqlquery", "otlp")).

operator/cnpg-plugins/sidecar-injector/internal/lifecycle/

When the sidecar-injector injects the OTel sidecar, also append four OTel env
vars to the documentdb-gateway container so the gateway can push its metrics
to the co-located sidecar:

Env var Value
POD_NAME downward API metadata.name
OTEL_EXPORTER_OTLP_ENDPOINT http://localhost:4317
OTEL_METRICS_ENABLED true
OTEL_RESOURCE_ATTRIBUTES service.instance.id=$(POD_NAME)

OTEL_METRICS_ENABLED=true is required: the gateway gates OTel initialization
off by default and reads this env var (or a JSON TelemetryOptions block) at
startup. Without it, the gateway logs MeterProvider.GlobalSet is never
emitted and zero db_client_* metrics flow.

Injection is idempotent across CREATE and PATCH via a name-based dedup
map. Without that, repeated reconciles double-append entries and CNPG rejects
the patch with Pod is invalid: spec: Forbidden: pod updates may not change fields other than ....

Tests

New lifecycle_test.go covers:

  1. All four envs appended in declaration order (POD_NAME must precede
    OTEL_RESOURCE_ATTRIBUTES so $(POD_NAME) resolves per Kubernetes env-var
    ordering rules).
  2. OTEL_METRICS_ENABLED=true present (regression guard — without it the
    gateway silently disables OTLP export).
  3. Pre-existing env vars preserved (dedup-by-name, no overwrite).
  4. Idempotent re-invocation produces a byte-equal Env slice (the actual
    CNPG-PATCH scenario).
  5. No-op when no gateway container is present.

config_test.go extended for the new otlp receiver and pipeline.

Verification

$ go test ./operator/src/internal/otel/...
ok  github.com/documentdb/documentdb-operator/internal/otel0.051s

$ go test ./operator/cnpg-plugins/sidecar-injector/internal/lifecycle/...
ok  github.com/documentdb/cnpg-i-sidecar-injector/internal/lifecycle0.010s

End-to-end verified on Kind: gateway OTLP metrics flow through the sidecar's
OTLP receiver and are exported by the prometheus exporter; Prometheus scrapes
the sidecar and surfaces the full db_client_* series with per-pod labels.

Notes for reviewers

  • The OTLP listener is on 127.0.0.1:4317, not 0.0.0.0, because it's only
    ever consumed from inside the same pod. Tighter binding, no functional
    difference.
  • The dedup is name-only. If a downstream user pre-sets
    OTEL_RESOURCE_ATTRIBUTES with a different value, service.instance.id
    will not be merged in. Documented in the function comment.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Connects the documentdb-gateway’s OTLP metrics emission to the per-pod OpenTelemetry Collector sidecar (injected by the CNPG sidecar-injector) by adding an OTLP receiver to the collector config and injecting the required OTEL env vars into the gateway container.

Changes:

  • Add an otlp receiver (gRPC, 127.0.0.1:4317) to the embedded OTel Collector base config and wire it into the metrics pipeline.
  • Update OTel config generation/tests to include the new receiver and pipeline wiring.
  • Extend the sidecar-injector lifecycle to idempotently inject OTEL-related env vars into the documentdb-gateway container, with new unit tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
operator/src/internal/otel/base_config.yaml Adds an OTLP gRPC receiver on loopback and keeps existing sqlquery receiver.
operator/src/internal/otel/config.go Wires otlp receiver into the dynamic metrics pipeline.
operator/src/internal/otel/config_test.go Asserts otlp receiver exists and pipeline includes sqlquery + otlp.
operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go Replaces direct env append with idempotent gateway OTEL env injection helper.
operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle_test.go Adds tests for env injection order, dedup, idempotency, and no-op behavior.

Comment thread operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go Outdated
Comment thread operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go Outdated
Comment thread operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go Outdated
@udsmicrosoft udsmicrosoft marked this pull request as draft April 21, 2026 18:32
@udsmicrosoft udsmicrosoft marked this pull request as ready for review April 21, 2026 20:28
@WentingWu666666
Copy link
Copy Markdown
Collaborator

Consider dropping OTEL_RESOURCE_ATTRIBUTES (and its POD_NAME dependency)

The OTel Collector's resource processor already attaches k8s.pod.name to every metric flowing through the pipeline including gateway metrics arriving via OTLP. See config.go L86:

go {"key": "k8s.pod.name", "value": "${POD_NAME}", "action": "upsert"},

So service.instance.id=$(POD_NAME) set via OTEL_RESOURCE_ATTRIBUTES is redundant the pod name is already present on every exported metric as k8s.pod.name.

Removing this simplifies the env var list from 4 2 (OTEL_EXPORTER_OTLP_ENDPOINT + OTEL_METRICS_ENABLED), eliminates the POD_NAME intermediate env var, and removes the ordering constraint (POD_NAME must precede OTEL_RESOURCE_ATTRIBUTES for $(POD_NAME) interpolation).

Wires the documentdb-gateway's OTLP/gRPC metric output into the
per-pod OTel Collector sidecar so the sidecar's prometheus exporter
can re-export gateway metrics alongside the existing sqlquery output.

operator/src/internal/otel:
  - base_config.yaml: add an otlp receiver bound to 127.0.0.1:4317
    (loopback only; gateway and collector share pod netns) and add it
    to the metrics pipeline alongside sqlquery.
  - config.go / config_test.go: register the receiver in the default
    pipeline list and assert the wiring.

operator/cnpg-plugins/sidecar-injector/internal/lifecycle:
  - lifecycle.go: when injecting the sidecar, also append four OTel
    env vars to the documentdb-gateway container so it can push to the
    sidecar:
      POD_NAME (downward API),
      OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317,
      OTEL_METRICS_ENABLED=true (the gateway gates OTel init off by
        default and reads this env var to enable it),
      OTEL_RESOURCE_ATTRIBUTES=service.instance.id=$(POD_NAME).
    Injection is idempotent across CREATE and PATCH operations via
    a name-based dedup map; without that, repeated reconciles would
    double-append entries and CNPG would reject the patch with
    'Pod is invalid: spec: Forbidden: pod updates may not change
    fields other than ...'.
  - lifecycle_test.go: new file. Five table-driven tests cover
    all-appended ordering (POD_NAME must precede the var that
    interpolates $(POD_NAME)), OTEL_METRICS_ENABLED presence
    (regression guard), pre-existing env preserved, idempotent
    re-invocation produces a byte-equal Env slice, and no-op when no
    gateway container is present.

Signed-off-by: urismiley <urismiley@microsoft.com>
- Use 127.0.0.1 (not localhost) for OTEL_EXPORTER_OTLP_ENDPOINT so it
  matches the collector's IPv4-only receiver bind on dual-stack nodes.
- Correct the gatewayContainerName comment: gateway pushes OTLP to the
  sidecar collector, not the other way around.
- gofmt the gatewayOTelEnvVars block.

Signed-off-by: urismiley <urismiley@microsoft.com>
…UTES on gateway

The OTel Collector's resource processor already attaches k8s.pod.name to
every metric flowing through the pipeline (including OTLP-received gateway
metrics), so service.instance.id=$(POD_NAME) set via OTEL_RESOURCE_ATTRIBUTES
is redundant. Drops gateway env vars from 4 to 2, removes the POD_NAME
intermediate, and removes the POD_NAME -> OTEL_RESOURCE_ATTRIBUTES ordering
constraint.

Signed-off-by: urismiley <urismiley@microsoft.com>
@udsmicrosoft udsmicrosoft force-pushed the users/urismiley/telemetry-otlp-receiver branch from 518ad7b to 2a448e8 Compare April 22, 2026 16:20
Copy link
Copy Markdown
Collaborator

@xgerman xgerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Verdict: /approve with two follow-ups. No blocking issues.

Note: the PR description is stale against the current commit (2a448e8). The code no longer injects POD_NAME or OTEL_RESOURCE_ATTRIBUTES; only OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_METRICS_ENABLED are set, and pod identity is picked up via the collector's resource processor. The POD_NAME-ordering and service.instance.id caveats in the PR body are moot against the real diff. Worth updating the PR description before merge.

🟠 Major — worth addressing before or right after merge

1. OTEL_METRICS_ENABLED=true is unconditional — no CRD opt-outoperator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go:391-402
The dedup logic preserves user-set values, but there is no way for a cluster operator to opt out of gateway OTLP emission short of hand-setting the env var on the CNPG pod template. Enabling monitoring now silently enables gateway metrics emission — a behavior change for existing clusters with monitoring already on. Suggest a Spec.Monitoring.Gateway.MetricsEnabled *bool knob, or at minimum a CHANGELOG entry (there is none) + a godoc note on MonitoringSpec.

2. Gateway-image env-var contract is uncovenanted and unversionedlifecycle.go:393-398
The operator now hard-codes knowledge of two env vars whose semantics live in the external gateway repo. If the gateway image renames or retypes either variable, metrics stop silently: the collector comes up clean, the prometheus exporter returns 200, and scrapers see zero db_client_* series with no error surface. Suggest (a) a minimum-gateway-version note near the constants, and (b) an e2e smoke asserting ≥1 db_client_* series reaches the prometheus exporter — bare presence is enough to catch the rename class.

3. Pipeline assembled only when exporters configured; otlp receiver is orphaned in the no-exporter caseoperator/src/internal/otel/config.go:130-156
Pre-existing structural issue that this PR widens. When len(exporterNames)==0, generateDynamicConfig writes no service: block, but static.yaml now declares both sqlquery and otlp receivers with no pipeline referencing them. The OTel Collector refuses to start with "receiver declared but not used". Before this PR the same was true for sqlquery alone, so behavior doesn't change — but the new tests all run through paths with an exporter, so the no-exporter case remains untested. If monitoring is ever enabled with Spec.Exporter == nil (schema allows it), the collector crash-loops. At minimum add a regression test; ideally add CRD validation requiring ≥1 exporter, or fall back to a nop/debug exporter.

🟡 Minor

  • #4 lifecycle_test.go:70-85 — When a user pre-sets a non-loopback OTEL_EXPORTER_OTLP_ENDPOINT, the sidecar's 127.0.0.1:4317 receiver sits empty. Injector should log.Printf a warning when dedup skips a key — zero-cost observability for a subtle misconfiguration.
  • #5 lifecycle_test.go:87-103 — Idempotency test doesn't cover the realistic PATCH case where another admission controller has already inserted env vars. Add a case starting with a pre-existing FOO=bar and assert it stays at index 0 after a second invocation.
  • #6 config_test.go:45-55 — Add Expect(protocols).NotTo(HaveKey("http")) to lock the protocol surface; consider validating the merged static+dynamic YAML with a real collector parser fixture.
  • #7 lifecycle.go:422-425map[string]bool → idiomatic Go uses map[string]struct{} for a set. Cosmetic.
  • #8 — No doc update for documentdb-playground/telemetry/ and no CHANGELOG. The PR advertises new db_client_* series but doesn't document them.

🟢 Nit

  • #9 gatewayOTelEnvVars() — allocates per call. Package-level var would be marginally cleaner and lets the test reference the source of truth directly.
  • #10 lifecycle.go:352-355 — "traces pipeline TBD" should be // TODO(#<issue>) with a tracked issue, not a bare promissory note.

Correctly handled (worth crediting)

  • Loopback binding 127.0.0.1:4317 in base_config.yaml is explicitly not 0.0.0.0, and collector internal telemetry is Level: "none" at config.go:142 — no accidental 0.0.0.0:8888 dual-listener leak. ✅
  • Config-drift → rolling restart already handled via HashConfigMapData → plugin parameter diff → CNPG restart annotation. ✅
  • Sidecar-presence gate: injectGatewayOTelEnv is called inside the "OTel sidecar injected" branch, so clusters without monitoring don't get connection-refused on 4317. ✅
  • Idempotency design: name-based dedup with byte-equal reflect.DeepEqual is the right primitive — correct for CNPG's serialized-spec comparison and future-proof against re-introducing $(VAR) env-ref interpolation. ✅
  • No secrets, no PII: the PR doesn't set OTEL_RESOURCE_ATTRIBUTES at all, and the collector's resource processor emits only cluster/namespace/pod identity. ✅

Thanks for the clean refactor between revisions — relying on the collector resource processor instead of hand-rolled env-var interpolation is the right call.


Reviewed with the GitHub Copilot code-review agent on commit 2a448e8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants