Support Envoy AI Gateway observability (SWIP-10)#13772
- New layer: ENVOY_AI_GATEWAY
- MAL rules for OTLP metrics: service and instance level aggregates, per-provider and per-model breakdowns (38 metrics total)
- LAL rules for access log sampling (error responses, high token cost)
- UI dashboard templates: root, service, instance with Log tabs
- OTel receiver: convert data point attribute dots to underscores, change LABEL_MAPPINGS to fallback-only (preserve service_name tag)
- SampleFamily: add toString() and debugDump() for debugging
- E2E test: docker-compose with ai-gateway-cli + Ollama
- SWIP-10 doc updated: use OTEL_SERVICE_NAME for service identity, explicit job_name for MAL routing
- OTel metric receiver: change LABEL_MAPPINGS to fallback-only; an explicit job_name in resource attributes takes precedence over the service.name mapping
- OTel log handler: prefer service.instance.id (OTel spec) over service.instance, with backward-compatible fallback
- OTel metric receiver: convert data point attribute dots to underscores (same as resource attributes and metric names)
- MAL rules: use service_name for service identity instead of aigw_service
- Docker-compose: use OTEL_SERVICE_NAME for per-deployment service name, explicit job_name for MAL routing
- Service dashboard: add metrics to InstanceList widget
- SampleFamily: add debugDump() for MAL debugging
- SWIP-10: updated entity model docs
The service.name → job_name mapping in the OTLP metric receiver was added for AWS Firehose (commit 29556a0), but the AWS rules filter by Namespace, not job_name. No existing integration uses this mapping. Removing it ensures service.name is preserved as the service_name tag (via dot-to-underscore conversion), available for MAL expSuffix. Integrations that need job_name should set it explicitly.

Remaining fallback mappings (all used by Prometheus-based integrations):
- job → job_name
- net.host.name → node_identifier_host_name
- host.name → node_identifier_host_name

MAL checker: 1268/1268 rules pass.
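The fallback-only behavior described in this commit message can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the receiver's actual code: the class `LabelNormalizer` and the method `buildLabels` are hypothetical names, and the rule that the first matching source wins when both `net.host.name` and `host.name` are present is an assumption of this sketch. Only the three mappings themselves come from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class LabelNormalizer {
    // Remaining fallback mappings: source attribute -> target label.
    private static final Map<String, String> FALLBACK_LABEL_MAPPINGS = new LinkedHashMap<>();

    static {
        FALLBACK_LABEL_MAPPINGS.put("job", "job_name");
        FALLBACK_LABEL_MAPPINGS.put("net.host.name", "node_identifier_host_name");
        FALLBACK_LABEL_MAPPINGS.put("host.name", "node_identifier_host_name");
    }

    static Map<String, String> buildLabels(Map<String, String> resourceAttributes) {
        Map<String, String> labels = new LinkedHashMap<>();
        // First pass: keep every attribute, converting dots to underscores,
        // so service.name survives as service_name.
        for (Map.Entry<String, String> e : resourceAttributes.entrySet()) {
            labels.put(e.getKey().replace('.', '_'), e.getValue());
        }
        // Second pass: copy a source value to its target label only when
        // the target is still absent (explicit values take precedence).
        for (Map.Entry<String, String> m : FALLBACK_LABEL_MAPPINGS.entrySet()) {
            String value = resourceAttributes.get(m.getKey());
            if (value != null) {
                labels.putIfAbsent(m.getValue(), value);
            }
        }
        return labels;
    }
}
```

With this sketch, a resource carrying both `service.name` and an explicit `job_name` keeps the explicit `job_name` and also exposes `service_name`, while a Prometheus-style resource that only carries `job` still gets `job_name` filled in.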
Setup guide with OTLP configuration, Kubernetes GatewayConfig example, metric reference tables (service, provider, model, instance), and access log sampling policy.
…place GenAI
- OTel receiver doc: update label conversion rules (dots to underscores now applies to both resource and data point attributes), document fallback label mappings
- LAL doc: add OTLP log attribute mapping section (service.name, service.instance.id, service.layer)
- Marketplace: add GenAI category with Virtual GenAI and Envoy AI Gateway
// service.instance.id is the OTel standard resource attribute for instance identity
// https://opentelemetry.io/docs/specs/semconv/resource/#service
// Fall back to service.instance for backward compatibility
final var instanceId = attributes.getOrDefault("service.instance.id", "");
final var serviceInstance = instanceId.isEmpty()
    ? attributes.getOrDefault("service.instance", "")
    : instanceId;
@kezhenxu94 I checked the OTLP standard, and service.instance.id seems to be the official attribute. I am not sure why this used service.instance only.
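To make the fallback order in this thread easy to try in isolation, here is a small self-contained sketch of the same lookup. The class `InstanceIdResolver` and the method `resolveInstance` are hypothetical names used only for illustration; the attribute keys are the ones discussed above.

```java
import java.util.Map;

// Hypothetical helper, for illustration only.
final class InstanceIdResolver {
    static String resolveInstance(Map<String, String> attributes) {
        // Prefer the OTel standard attribute.
        String instanceId = attributes.getOrDefault("service.instance.id", "");
        // Fall back to the legacy key when the standard one is missing or empty.
        return instanceId.isEmpty()
            ? attributes.getOrDefault("service.instance", "")
            : instanceId;
    }
}
```

Called with only `service.instance` set, the sketch returns the legacy value; with both keys set, `service.instance.id` wins; with neither, it returns an empty string.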
/**
 * Fallback label mappings: if the target label (value) is absent in resource attributes,
 * copy the source label (key) value as the target. The source label is always kept as-is
 * (with dots converted to underscores by the first pass).
 */
private static final Map<String, String> FALLBACK_LABEL_MAPPINGS =
    ImmutableMap
        .<String, String>builder()
        .put("net.host.name", "node_identifier_host_name")
        .put("host.name", "node_identifier_host_name")
        .put("job", "job_name")
        .put("service.name", "job_name")
        .build();
@pg-yang I did some research, but can't find why the service name is used as the job name. This seems to be an old bug, introduced a long time ago by your AWS Firehose PR.
Pull request overview
Adds first-class observability support for Envoy AI Gateway (SWIP-10) in SkyWalking by introducing a dedicated layer, MAL/LAL processing rules, UI dashboard templates, OTLP receiver tweaks for GenAI semantic conventions, docs, and an E2E compose-based verification case.
Changes:
- Add new monitored layer ENVOY_AI_GATEWAY and register UI templates for root/service/instance dashboards.
- Add MAL rules for GenAI OTLP metrics and LAL rules for OTLP access-log sampling.
- Improve OTLP receiver label normalization (dot-to-underscore) and update documentation + E2E coverage.
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| test/e2e-v2/cases/envoy-ai-gateway/expected/service.yml | Expected service discovery (layer verification). |
| test/e2e-v2/cases/envoy-ai-gateway/expected/metrics-has-value.yml | Generic “metric has value” expectation template. |
| test/e2e-v2/cases/envoy-ai-gateway/expected/metrics-has-value-label.yml | Expectation template for labeled metrics (e.g., percentiles). |
| test/e2e-v2/cases/envoy-ai-gateway/envoy-ai-gateway-cases.yaml | swctl verification queries for service + key metrics. |
| test/e2e-v2/cases/envoy-ai-gateway/e2e.yaml | Compose-based E2E scenario definition and trigger. |
| test/e2e-v2/cases/envoy-ai-gateway/docker-compose.yml | Compose stack (OAP/BanyanDB + ai-gateway-cli + Ollama). |
| oap-server/server-starter/src/main/resources/ui-initialized-templates/menu.yaml | Adds UI menu entry for Envoy AI Gateway. |
| oap-server/server-starter/src/main/resources/ui-initialized-templates/envoy_ai_gateway/envoy-ai-gateway-service.json | Service dashboard template (overview/providers/models/log/instances). |
| oap-server/server-starter/src/main/resources/ui-initialized-templates/envoy_ai_gateway/envoy-ai-gateway-root.json | Root dashboard template listing AI Gateway services. |
| oap-server/server-starter/src/main/resources/ui-initialized-templates/envoy_ai_gateway/envoy-ai-gateway-instance.json | Instance dashboard template (per-pod view). |
| oap-server/server-starter/src/main/resources/otel-rules/envoy-ai-gateway/gateway-service.yaml | MAL rules for service-scope GenAI metrics. |
| oap-server/server-starter/src/main/resources/otel-rules/envoy-ai-gateway/gateway-instance.yaml | MAL rules for instance-scope GenAI metrics. |
| oap-server/server-starter/src/main/resources/lal/envoy-ai-gateway.yaml | LAL rule for OTLP access log sampling/extraction. |
| oap-server/server-starter/src/main/resources/application.yml | Enables Envoy AI Gateway MAL/LAL rules by default. |
| oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/.../OpenTelemetryMetricRequestProcessor.java | Dot→underscore normalization for resource + point attributes; fallback-only resource label mapping. |
| oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/.../OpenTelemetryLogHandler.java | Prefer service.instance.id with fallback to service.instance. |
| oap-server/server-core/src/main/java/.../UITemplateInitializer.java | Registers ENVOY_AI_GATEWAY UI template folder. |
| oap-server/server-core/src/main/java/.../Layer.java | Adds new ENVOY_AI_GATEWAY layer enum entry. |
| oap-server/analyzer/meter-analyzer/src/main/java/.../SampleFamily.java | Adds toString() + debugDump() to help MAL debugging. |
| docs/menu.yml | Adds docs navigation entry for Envoy AI Gateway monitoring. |
| docs/en/swip/SWIP-10/SWIP.md | Updates SWIP-10 design text to current OTLP attribute mapping/routing. |
| docs/en/swip/SWIP-10/kind-test-setup.sh | Removes local Kind helper script from SWIP-10. |
| docs/en/swip/SWIP-10/kind-test-resources.yaml | Removes local Kind resources manifest from SWIP-10. |
| docs/en/setup/backend/opentelemetry-receiver.md | Documents label conversion + fallback mappings behavior. |
| docs/en/setup/backend/marketplace.md | Adds Envoy AI Gateway to the Marketplace “GenAI” section. |
| docs/en/setup/backend/backend-envoy-ai-gateway-monitoring.md | New setup guide for Envoy AI Gateway OTLP integration. |
| docs/en/concepts-and-designs/lal.md | Documents OTLP resource attribute mapping used by LAL (incl. instance id fallback). |
| docs/en/changes/changes.md | Changelog entry for Envoy AI Gateway observability + related receiver changes. |
- OTel receiver: use String.replace(char,char) instead of replaceAll regex
- OTel receiver buildLabels: add merge function to handle dot-to-underscore key collisions
- LAL rule: guard tag-to-Integer casts for missing/empty/dash values
- SWIP doc: fix stale service.name → job_name reference
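The key-collision fix in this commit can be illustrated with a short sketch. It assumes the labels are collected with `Collectors.toMap`, which throws `IllegalStateException` on duplicate keys unless a merge function is supplied. The class `PointLabelNormalizer`, the method `normalize`, and the choice to keep the first value on a collision are all assumptions of this sketch, not confirmed details of the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class PointLabelNormalizer {
    static Map<String, String> normalize(Map<String, String> attributes) {
        return attributes.entrySet().stream().collect(Collectors.toMap(
            // String.replace(char, char) avoids compiling a regex per key.
            e -> e.getKey().replace('.', '_'),
            Map.Entry::getValue,
            // Merge function: two source keys may normalize to the same label.
            // Assumption: keep the value seen first.
            (first, second) -> first,
            LinkedHashMap::new));
    }
}
```

For example, an input holding both `gen_ai.request.model` and `gen_ai_request_model` normalizes to a single `gen_ai_request_model` label instead of failing the whole batch.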
Support Envoy AI Gateway observability (SWIP-10)
- If this is a non-trivial feature, paste the links/URLs to the design doc. docs/en/swip/SWIP-10/SWIP.md
- Update the documentation to include this new feature. docs/en/setup/backend/backend-envoy-ai-gateway-monitoring.md
- Tests (including UT, IT, E2E) are added to verify the new feature. test/e2e-v2/cases/envoy-ai-gateway/ (docker-compose with ai-gateway-cli + Ollama)
- Update the CHANGES log.
- If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #xxx.
Summary
Add Envoy AI Gateway as a new monitored layer (ENVOY_AI_GATEWAY) in SkyWalking, receiving GenAI metrics and access logs via OTLP push from the AI Gateway.
Changes:
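- New layer: ENVOY_AI_GATEWAY(46, true) in Layer.java
- MAL rules for OTLP metrics: service and instance level aggregates, per-provider and per-model breakdowns including token usage, latency, TTFT, TPOT
- UI dashboard templates: root, service, instance (Overview/Providers/Models/Log tabs)
- OTel metric receiver: change LABEL_MAPPINGS to fallback-only; service.name preserved as service_name tag
- Remove service.name → job_name mapping (MAL checker: 1268/1268 rules pass)
- OTel log handler: prefer service.instance.id (OTel spec) with fallback to service.instance
- UITemplateInitializer: register ENVOY_AI_GATEWAY template folder
- SampleFamily: add toString() and debugDump() for MAL debugging
- E2E test: docker-compose with ai-gateway-cli + Ollama (qwen2.5:0.5b)

Screenshots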
Root Dashboard
Service Dashboard - Overview
Service Dashboard - Providers/Models
Instance Dashboard