31 commits
db6ad55
added Appinsights doc
rjayaswal Jan 8, 2026
0165f5c
dixed PII
rjayaswal Feb 2, 2026
7e64d7f
Implement Application Insights telemetry integration
rjayaswal Feb 3, 2026
453da4e
feat: Add Application Insights telemetry integration
rjayaswal Feb 12, 2026
ded55fa
Merge branch 'main' into users/rjayaswal/appinsights-implementation
Ritvik-Jayaswal Feb 25, 2026
615a3da
Address Copilot PR review comments for telemetry implementation
rjayaswal Feb 25, 2026
c5b0719
Remove accidentally added github-secrets-telemetry-setup.md
rjayaswal Mar 9, 2026
c3b92fe
feat: Inject App Insights connection string in release workflow
rjayaswal Mar 9, 2026
b795272
Merge branch 'main' into users/rjayaswal/appinsights-implementation
Ritvik-Jayaswal Mar 9, 2026
4481b4f
Merge main into users/rjayaswal/appinsights-implementation
rjayaswal Mar 19, 2026
244f5b9
docs: Add telemetry GUID strategy design and update metrics spec
rjayaswal Apr 2, 2026
08dd7a5
docs: Make telemetry GUID strategy document neutral
rjayaswal Apr 2, 2026
81ccaf7
docs: Present GUID strategy as greenfield design
rjayaswal Apr 2, 2026
05f942e
docs: Add Option 3 Hybrid Cloud-Native ID Strategy
rjayaswal Apr 3, 2026
831453c
docs: Add telemetry testing section to AGENTS.md
rjayaswal Apr 3, 2026
195eb55
Merge branch 'main' into users/rjayaswal/appinsights-implementation
Ritvik-Jayaswal Apr 3, 2026
131d925
fix: Fix Helm template issues in operator deployment
rjayaswal Apr 3, 2026
da1d1a8
docs: Add sample cluster for telemetry E2E testing
rjayaswal Apr 3, 2026
bf654e0
feat: Add missing telemetry calls and periodic metrics heartbeat
rjayaswal Apr 6, 2026
a4f9acf
refactor: Decouple telemetry, use metadata.uid, implement Option 3 cl…
rjayaswal Apr 6, 2026
858ce00
refactor: Remove unused guid.go and annotation constants
rjayaswal Apr 6, 2026
e10692b
test: Add unit tests for telemetry package
rjayaswal Apr 6, 2026
8ba1a3e
chore: Remove accidentally committed coverage file
rjayaswal Apr 6, 2026
d21272d
feat: Buffer periodic metrics with 15min collection and hourly flush
rjayaswal Apr 6, 2026
a68ed04
test: Add workflow to verify telemetry secret and injection
rjayaswal Apr 6, 2026
e480ef3
chore: Remove test-telemetry-secret workflow from PR
rjayaswal Apr 6, 2026
c1695bc
test: Increase telemetry test coverage to 55%
rjayaswal Apr 6, 2026
e7b9e7d
Merge main and address PR review comments
rjayaswal Apr 9, 2026
9e569b0
docs: Fix link to Microsoft App Insights connection strings docs
rjayaswal Apr 9, 2026
8a9b9b9
refactor: Switch from AppInsights SDK to OpenTelemetry SDK
rjayaswal Apr 10, 2026
ddd25f5
Merge main into users/rjayaswal/appinsights-implementation
rjayaswal Apr 10, 2026
7 changes: 7 additions & 0 deletions .github/workflows/release_images.yml
@@ -113,6 +113,13 @@ jobs:
echo "values.yaml content:"
cat operator/documentdb-helm-chart/values.yaml

- name: Inject telemetry connection string
if: ${{ secrets.APPINSIGHTS_CONNECTION_STRING != '' }}
run: |
echo "Injecting Application Insights connection string for telemetry"
# Use sed to update the connectionString field in values.yaml
sed -i 's|connectionString: ""|connectionString: "${{ secrets.APPINSIGHTS_CONNECTION_STRING }}"|g' operator/documentdb-helm-chart/values.yaml

- name: Set chart version
run: |
echo "CHART_VERSION=${{ github.event.inputs.version }}" >> $GITHUB_ENV
6 changes: 6 additions & 0 deletions .github/workflows/release_operator.yml
@@ -121,6 +121,12 @@ jobs:
echo "Chart.yaml after update:"
cat operator/documentdb-helm-chart/Chart.yaml

- name: Inject telemetry connection string
if: ${{ secrets.APPINSIGHTS_CONNECTION_STRING != '' }}
run: |
echo "Injecting Application Insights connection string for telemetry"
sed -i 's|connectionString: ""|connectionString: "${{ secrets.APPINSIGHTS_CONNECTION_STRING }}"|g' operator/documentdb-helm-chart/values.yaml

- name: Verify values.yaml has explicit documentDbVersion
run: |
DOCDB_VERSION=$(grep 'documentDbVersion:' operator/documentdb-helm-chart/values.yaml | sed 's/.*"\(.*\)".*/\1/')
25 changes: 25 additions & 0 deletions AGENTS.md
@@ -371,6 +371,31 @@ Types:
- Mock external dependencies appropriately
- Ensure tests are idempotent and isolated

### Telemetry Testing

The operator includes Application Insights telemetry. For design details, see:
- [docs/designs/appinsights-metrics.md](docs/designs/appinsights-metrics.md) - Metrics design and implementation
- [docs/designs/telemetry-guid-strategy.md](docs/designs/telemetry-guid-strategy.md) - GUID strategy options

**Quick E2E Test:**
```bash
# 1. Set instrumentation key (use test App Insights resource)
export APPINSIGHTS_INSTRUMENTATIONKEY="your-instrumentation-key"

# 2. Deploy operator to AKS/Kind cluster with telemetry enabled
helm install documentdb-operator ./operator/documentdb-helm-chart \
  --namespace documentdb-operator --create-namespace \
  --set telemetry.enabled=true \
  --set telemetry.instrumentationKey=$APPINSIGHTS_INSTRUMENTATIONKEY

# 3. Create a DocumentDB cluster to trigger telemetry events
kubectl apply -f documentdb-playground/telemetry/sample-appinsights-cluster.yaml

# 4. Verify in Azure Portal → Application Insights → Live Metrics / Logs
```

For full observability setup (Prometheus, Grafana, OpenTelemetry), see `documentdb-playground/telemetry/`.

### Code Review

For thorough code reviews, reference the code review agent:
5 changes: 4 additions & 1 deletion docs/designs/appinsights-metrics.md
@@ -3,6 +3,9 @@
## Overview
This document specifies all telemetry data points to be collected by Application Insights for the DocumentDB Kubernetes Operator. These metrics provide operational insights, usage patterns, and error tracking for operator deployments.

### Cluster ID Generation
Cluster IDs (`cluster_id`) are generated using a deterministic SHA-256 hash of `namespace + cluster_name`. This ensures consistent IDs across operator restarts without requiring persistence. See [telemetry-guid-strategy.md](telemetry-guid-strategy.md) for details.
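
The deterministic ID described above could be derived roughly as follows. This is a minimal sketch, not the operator's actual code: the function name `clusterID`, the `/` separator between namespace and cluster name, and the hex encoding of the digest are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// clusterID returns a deterministic identifier for a cluster.
// Assumption: a "/" separator is placed between the two parts so that
// ("a", "bc") and ("ab", "c") cannot hash to the same value. The design
// text only says "namespace + cluster_name".
func clusterID(namespace, clusterName string) string {
	sum := sha256.Sum256([]byte(namespace + "/" + clusterName))
	return hex.EncodeToString(sum[:])
}

func main() {
	first := clusterID("prod", "orders-db")
	second := clusterID("prod", "orders-db")
	fmt.Println(first == second) // same inputs always give the same ID
	fmt.Println(len(first))      // a SHA-256 digest is 64 hex characters
}
```

Because the ID depends only on its inputs, nothing has to be persisted: an operator restart recomputes the same value.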

---

## 1. Operator Lifecycle Metrics
@@ -21,7 +24,7 @@ This document specifies all telemetry data points to be collected by Application
- **Metric**: `operator.health.status`
- **Value**: `1` (healthy) or `0` (unhealthy)
- **Frequency**: Every 60 seconds
- **Dimensions**: `pod_name`, `namespace`
- **Dimensions**: `pod_name`, `namespace_hash`

---

146 changes: 146 additions & 0 deletions docs/designs/telemetry-configuration.md
@@ -0,0 +1,146 @@
# Application Insights Telemetry Configuration

This document describes how to configure Application Insights telemetry collection for the DocumentDB Kubernetes Operator.

## Overview

The DocumentDB Operator can send telemetry data to Azure Application Insights to help monitor operator health, track cluster lifecycle events, and diagnose issues. All telemetry is designed with privacy in mind: no personally identifiable information (PII) is collected.

## Configuration

### Environment Variables

Configure telemetry by setting these environment variables in the operator deployment:

| Variable | Description | Required |
|----------|-------------|----------|
| `APPINSIGHTS_INSTRUMENTATIONKEY` | Application Insights instrumentation key | Yes (or connection string) |
| `APPLICATIONINSIGHTS_CONNECTION_STRING` | Application Insights connection string (alternative to instrumentation key) | Yes (or instrumentation key) |
Comment on lines +17 to +18
Collaborator

APPINSIGHTS vs APPLICATIONINSIGHTS. We might stick to one name.

Collaborator Author

@Ritvik-Jayaswal Ritvik-Jayaswal Apr 9, 2026

These are actually official Microsoft names that the AppInsights Go SDK expects. I included both because a user could use either one to pass in their own string. I added a section in the design doc that explains this a bit more.

APPINSIGHTS_INSTRUMENTATIONKEY: takes a bare instrumentation key (just the GUID, e.g., f5614a64-9358-44db-b19b-18a2eb54f623). This is the legacy method from the original Application Insights SDK. The Go SDK we use (microsoft/ApplicationInsights-Go) reads this variable.

APPLICATIONINSIGHTS_CONNECTION_STRING: takes a full connection string (e.g., InstrumentationKey=f5614a64-...;IngestionEndpoint=https://westus2.in.applicationinsights.azure.com/). This is the modern connection-string format documented at https://learn.microsoft.com/en-us/azure/azure-monitor/app/connection-strings

Collaborator

⚠️ microsoft/ApplicationInsights-Go is archived and unsupported

The Go SDK this PR depends on (github.com/microsoft/ApplicationInsights-Go) is archived and explicitly unsupported:

"This SDK is NOT currently maintained or supported by Microsoft. Azure Monitor only provides support when using our supported SDKs, and this SDK does not yet meet that standard."

Known gaps listed by Microsoft themselves:

  • No operation correlation
  • No sampling support
  • No automatic collection
  • No offline storage

Building a new telemetry feature on an archived, unsupported dependency is risky: any bugs or security issues won't get upstream fixes, and it may break with future App Insights backend changes.

Recommended alternative: OpenTelemetry Go SDK + OTel Collector sidecar

Microsoft's recommended path for Go is the standard OpenTelemetry Go SDK (go.opentelemetry.io/otel) + an OTel Collector with the Azure Monitor exporter. This is:

  • ✅ Actively maintained (CNCF project)
  • ✅ Microsoft's official recommendation for languages without a supported App Insights SDK
  • ✅ Vendor-neutral — just swap the exporter if we ever need to export elsewhere
  • ✅ Supports connection strings, regional endpoints, Entra ID auth

We already have an OTel Collector sidecar architecture in progress — see PR #286 by Uri, where I've left a review comment recommending this approach. Your telemetry PR would build on top of that foundation: the operator emits custom events/metrics via the OTel SDK, and the collector routes them to App Insights. This unifies the telemetry story — one SDK (OTel) for both user-facing monitoring and product telemetry.

I'd suggest coordinating with Uri to get PR #286 merged first. If needed, I can help create a simpler PR to add the collector sidecar to unblock your work here.

| `DOCUMENTDB_TELEMETRY_ENABLED` | Set to `false` to disable telemetry collection | No (default: `true`) |

> **Note on naming convention:** `APPINSIGHTS_INSTRUMENTATIONKEY` and `APPLICATIONINSIGHTS_CONNECTION_STRING` are the
> [official Microsoft Application Insights SDK environment variable names](https://learn.microsoft.com/en-us/azure/azure-monitor/app/connection-strings).
> The naming difference (`APPINSIGHTS_` vs `APPLICATIONINSIGHTS_`) reflects Microsoft's SDK conventions, not an inconsistency in this project.
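
To make the interaction between the three variables concrete, here is a minimal sketch of how they might be resolved at startup. It is an illustration under stated assumptions, not the operator's implementation: the `telemetryConfig` type, the `resolveTelemetry` function, the precedence of the connection string over the bare key, and the normalisation of a bare key into connection-string form are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// telemetryConfig is a hypothetical type used only in this sketch.
type telemetryConfig struct {
	Enabled          bool
	ConnectionString string
}

// resolveTelemetry reads the documented variables through getenv
// (os.Getenv in a real process; injected here so the logic is testable).
func resolveTelemetry(getenv func(string) string) telemetryConfig {
	// An explicit opt-out wins over any configured key.
	if strings.EqualFold(strings.TrimSpace(getenv("DOCUMENTDB_TELEMETRY_ENABLED")), "false") {
		return telemetryConfig{}
	}
	// Assumption: the connection string takes precedence when both are set.
	if cs := getenv("APPLICATIONINSIGHTS_CONNECTION_STRING"); cs != "" {
		return telemetryConfig{Enabled: true, ConnectionString: cs}
	}
	if key := getenv("APPINSIGHTS_INSTRUMENTATIONKEY"); key != "" {
		// Assumption: a bare key is normalised into connection-string form.
		return telemetryConfig{Enabled: true, ConnectionString: "InstrumentationKey=" + key}
	}
	// Neither variable is set, so telemetry stays off.
	return telemetryConfig{}
}

func main() {
	env := map[string]string{
		"APPINSIGHTS_INSTRUMENTATIONKEY": "00000000-0000-0000-0000-000000000000",
	}
	cfg := resolveTelemetry(func(name string) string { return env[name] })
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the decision logic easy to unit test.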

### Helm Chart Configuration

When installing via Helm, you can configure telemetry in your values.yaml:

```yaml
# values.yaml
telemetry:
  enabled: true
  instrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"
  # Or use connection string:
  # connectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."
  # Or use an existing secret containing APPINSIGHTS_INSTRUMENTATIONKEY / APPLICATIONINSIGHTS_CONNECTION_STRING:
  # existingSecret: "documentdb-operator-telemetry"
```

### Kubernetes Secret

For production deployments, store the instrumentation key in a Kubernetes secret:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: documentdb-operator-telemetry
  namespace: documentdb-system
type: Opaque
stringData:
  APPINSIGHTS_INSTRUMENTATIONKEY: "YOUR-INSTRUMENTATION-KEY-HERE"
Collaborator

Another underscore would be good separation: APPINSIGHTS_INSTRUMENTATION_KEY

Collaborator Author

see above comment

```

Then reference it in the operator deployment:

```yaml
envFrom:
  - secretRef:
      name: documentdb-operator-telemetry
```

## Privacy & Data Collection

### What We Collect

The operator collects anonymous, aggregated telemetry data including:

- **Operator lifecycle**: Startup events, health status, version information
- **Cluster operations**: Create, update, delete events (with timing metrics)
- **Backup operations**: Backup creation, completion, and expiration events
- **Error tracking**: Categorized errors (no raw error messages with sensitive data)
- **Performance metrics**: Reconciliation duration, API call latency

### What We DON'T Collect

To protect your privacy, we explicitly do NOT collect:

- Cluster names, namespace names, or any user-provided resource names
- Connection strings, passwords, or credentials
- IP addresses or hostnames
- Storage class names (may contain organizational information)
- Raw error messages (only categorized error types)
- Container image names

### Privacy Protection Mechanisms

1. **GUIDs Instead of Names**: All resources are identified by auto-generated GUIDs stored in annotations (`telemetry.documentdb.io/cluster-id`)
2. **Hashed Namespaces**: Namespace names are SHA-256 hashed before transmission
3. **Categorized Data**: Values like PVC sizes are categorized (small/medium/large) instead of exact values
4. **Error Sanitization**: Error messages are stripped of potential PII and truncated
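
Two of the mechanisms above, categorized values and error sanitization, can be sketched as follows. This is only an illustration: the function names, the 50 GiB and 500 GiB thresholds, the IPv4-only redaction pattern, and the `[redacted]` placeholder are assumptions, and real sanitization would likely need to cover more kinds of sensitive data.

```go
package main

import (
	"fmt"
	"regexp"
)

// pvcSizeCategory buckets a volume size so the exact value is never sent.
// The thresholds are invented for this sketch.
func pvcSizeCategory(sizeGiB int64) string {
	switch {
	case sizeGiB < 50:
		return "small"
	case sizeGiB < 500:
		return "medium"
	default:
		return "large"
	}
}

// ipv4Pattern matches dotted-quad addresses; it is deliberately simple.
var ipv4Pattern = regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)

// sanitizeError redacts IPv4 addresses and truncates the message to at
// most maxLen characters (counted in runes, so UTF-8 is never split).
func sanitizeError(msg string, maxLen int) string {
	msg = ipv4Pattern.ReplaceAllString(msg, "[redacted]")
	runes := []rune(msg)
	if len(runes) > maxLen {
		return string(runes[:maxLen])
	}
	return msg
}

func main() {
	fmt.Println(pvcSizeCategory(10), pvcSizeCategory(100), pvcSizeCategory(1000))
	fmt.Println(sanitizeError("dial tcp 10.0.0.12:5432: connection refused", 200))
}
```

Bucketing and redaction both happen before anything leaves the process, so the raw values never appear in the telemetry payload.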

## Disabling Telemetry

To completely disable telemetry collection:

1. **Via environment variable**:
   ```yaml
   env:
     - name: DOCUMENTDB_TELEMETRY_ENABLED
       value: "false"
   ```

2. **Via Helm** (at install time):
   ```yaml
   telemetry:
     enabled: false
   ```
Comment on lines +103 to +108
Collaborator

Can they turn off telemetry on an already provisioned/running cluster?

Collaborator Author

Yes, users can run `helm upgrade --set telemetry.enabled=false`. I can add this to the documentation if needed.


3. **Via Helm upgrade** (on an already running cluster):
   ```bash
   helm upgrade documentdb-operator ./operator/documentdb-helm-chart \
     --namespace documentdb-operator \
     --set telemetry.enabled=false
   ```
   This restarts the operator pod with telemetry disabled. DocumentDB clusters see no data loss or downtime; only the operator pod restarts.

4. **Don't provide instrumentation key**: If no `APPINSIGHTS_INSTRUMENTATIONKEY` or `APPLICATIONINSIGHTS_CONNECTION_STRING` is set, telemetry is automatically disabled.

## Telemetry Events Reference

See [appinsights-metrics.md](appinsights-metrics.md) for the complete specification of all telemetry events and metrics collected.

## Troubleshooting

### Telemetry Not Being Sent

1. Verify the instrumentation key is correctly configured:
   ```bash
   kubectl get deployment documentdb-operator -n documentdb-system -o yaml | grep -A5 APPINSIGHTS
   ```

2. Check operator logs for telemetry initialization:
   ```bash
   kubectl logs -n documentdb-system -l app=documentdb-operator | grep -i telemetry
   ```

3. Verify network connectivity to the Application Insights ingestion endpoint (`dc.services.visualstudio.com` by default, or the `IngestionEndpoint` host from your connection string)
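
When a connection string is used, the host to check in the connectivity step may be a regional endpoint rather than the global one. The sketch below shows one way to find it. The function name `ingestionHost` is hypothetical, and falling back to `dc.services.visualstudio.com` when no `IngestionEndpoint` is present is an assumption based on the endpoint named in this section.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ingestionHost extracts the host of the IngestionEndpoint entry from a
// "Key=Value;Key=Value" connection string.
func ingestionHost(connectionString string) (string, error) {
	for _, part := range strings.Split(connectionString, ";") {
		key, value, ok := strings.Cut(strings.TrimSpace(part), "=")
		if !ok || !strings.EqualFold(key, "IngestionEndpoint") {
			continue
		}
		endpoint, err := url.Parse(value)
		if err != nil {
			return "", err
		}
		return endpoint.Host, nil
	}
	// Assumption: with no explicit endpoint, use the global host named above.
	return "dc.services.visualstudio.com", nil
}

func main() {
	cs := "InstrumentationKey=00000000-0000-0000-0000-000000000000;" +
		"IngestionEndpoint=https://westus2.in.applicationinsights.azure.com/"
	host, err := ingestionHost(cs)
	if err != nil {
		fmt.Println("invalid connection string:", err)
		return
	}
	fmt.Println(host) // westus2.in.applicationinsights.azure.com
}
```

The resulting host is the one that egress rules or network policies for the operator pod need to allow.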

### High Cardinality Warnings

If you see warnings about high cardinality dimensions, this indicates too many unique values for a dimension. The telemetry system automatically samples high-frequency events to mitigate this.

## Support

For issues related to telemetry collection, please open an issue on the [GitHub repository](https://github.com/documentdb/documentdb-kubernetes-operator/issues).