Support Envoy AI Gateway observability (SWIP-10)#13772

Open
wu-sheng wants to merge 8 commits into master from feature/swip-10-envoy-ai-gateway


@wu-sheng wu-sheng commented Mar 31, 2026

Support Envoy AI Gateway observability (SWIP-10)

  • If this is a non-trivial feature, paste the links/URLs to the design doc.

    • SWIP-10: docs/en/swip/SWIP-10/SWIP.md
  • Update the documentation to include this new feature.

    • docs/en/setup/backend/backend-envoy-ai-gateway-monitoring.md
  • Tests(including UT, IT, E2E) are added to verify the new feature.

    • test/e2e-v2/cases/envoy-ai-gateway/ (docker-compose with ai-gateway-cli + Ollama)
  • Update the CHANGES log.

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #xxx.

Summary

Add Envoy AI Gateway as a new monitored layer (ENVOY_AI_GATEWAY) in SkyWalking, receiving
GenAI metrics and access logs via OTLP push from the AI Gateway.

Changes:

  • New layer ENVOY_AI_GATEWAY(46, true) in Layer.java
  • MAL rules: 2 rule files (service + instance) with 38 metrics total — aggregates, per-provider
    and per-model breakdowns including token usage, latency, TTFT, TPOT
  • LAL rules: access log sampling (errors, upstream failures, high token cost)
  • UI dashboards: root (with doc link), service (Overview/Providers/Models/Log/Instances tabs),
    instance (Overview/Providers/Models/Log tabs)
  • OTel receiver fixes:
    • Convert data point attribute dots to underscores (consistent with resource attributes)
    • Change LABEL_MAPPINGS to fallback-only — service.name preserved as service_name tag
    • Remove unused service.name → job_name mapping (MAL checker: 1268/1268 rules pass)
    • OTLP log handler: prefer service.instance.id (OTel spec) with fallback to service.instance
  • UITemplateInitializer: register ENVOY_AI_GATEWAY template folder
  • SampleFamily: add toString() and debugDump() for MAL debugging
  • Documentation: setup guide, OTel receiver label conversion, LAL OTLP mapping, marketplace GenAI
  • E2e test: docker-compose with ai-gateway-cli + Ollama (qwen2.5:0.5b)

Screenshots

Root Dashboard

ai-1

Service Dashboard - Overview

ai-2

Service Dashboard - Providers/Models

ai-3

Instance Dashboard

ai-4

- New layer: ENVOY_AI_GATEWAY
- MAL rules for OTLP metrics: service and instance level aggregates,
  per-provider and per-model breakdowns (38 metrics total)
- LAL rules for access log sampling (error responses, high token cost)
- UI dashboard templates: root, service, instance with Log tabs
- OTel receiver: convert data point attribute dots to underscores,
  change LABEL_MAPPINGS to fallback-only (preserve service_name tag)
- SampleFamily: add toString() and debugDump() for debugging
- E2e test: docker-compose with ai-gateway-cli + Ollama
- SWIP-10 doc updated: use OTEL_SERVICE_NAME for service identity,
  explicit job_name for MAL routing
- OTel metric receiver: change LABEL_MAPPINGS to fallback-only — explicit
  job_name in resource attributes takes precedence over service.name mapping
- OTel log handler: prefer service.instance.id (OTel spec) over service.instance
  with backward-compatible fallback
- OTel metric receiver: convert data point attribute dots to underscores
  (same as resource attributes and metric names)
- MAL rules: use service_name for service identity instead of aigw_service
- Docker-compose: use OTEL_SERVICE_NAME for per-deployment service name,
  explicit job_name for MAL routing
- Service dashboard: add metrics to InstanceList widget
- SampleFamily: add debugDump() for MAL debugging
- SWIP-10: updated entity model docs

The service.name → job_name mapping in OTLP metric receiver was added
for AWS Firehose (commit 29556a0) but AWS rules filter by Namespace,
not job_name. No existing integration uses this mapping.

Removing it ensures service.name is preserved as service_name tag
(via dot-to-underscore conversion), available for MAL expSuffix.
Integrations that need job_name should set it explicitly.

Remaining fallback mappings (all used by Prometheus-based integrations):
- job → job_name
- net.host.name → node_identifier_host_name
- host.name → node_identifier_host_name

MAL checker: 1268/1268 rules pass.
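
The fallback-only behavior described above can be sketched roughly as follows. This is a minimal illustration, not the receiver's actual code: the class name `FallbackLabelSketch`, the method `buildLabels`, and the sample attribute values are hypothetical. Only the three mappings are taken from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FallbackLabelSketch {
    // Fallback mappings from the commit message: source attribute -> target label.
    static final Map<String, String> FALLBACKS = new LinkedHashMap<>();

    static {
        FALLBACKS.put("job", "job_name");
        FALLBACKS.put("net.host.name", "node_identifier_host_name");
        FALLBACKS.put("host.name", "node_identifier_host_name");
    }

    // Hypothetical helper: builds metric labels from OTLP resource attributes.
    static Map<String, String> buildLabels(Map<String, String> resourceAttributes) {
        Map<String, String> labels = new LinkedHashMap<>();
        // First pass: keep every attribute, converting dots to underscores,
        // so service.name survives as the service_name label.
        resourceAttributes.forEach((key, value) -> labels.put(key.replace('.', '_'), value));
        // Second pass: fill a target label only when it is still absent,
        // so an explicit job_name always takes precedence.
        FALLBACKS.forEach((source, target) -> {
            String value = resourceAttributes.get(source);
            if (value != null) {
                labels.putIfAbsent(target, value);
            }
        });
        return labels;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("service.name", "my-gateway");      // illustrative value
        attrs.put("job_name", "envoy-ai-gateway");    // explicit routing label
        attrs.put("job", "prometheus-job");           // would only be used as a fallback
        System.out.println(buildLabels(attrs));
    }
}
```

With these sample attributes, the explicit `job_name` is left untouched and `service.name` is exposed as `service_name`, which is the behavior MAL `expSuffix` relies on according to the message above.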

Setup guide with OTLP configuration, Kubernetes GatewayConfig example,
metric reference tables (service, provider, model, instance), and
access log sampling policy.
…place GenAI

- OTel receiver doc: update label conversion rules (dots to underscores
  now applies to both resource and data point attributes), document
  fallback label mappings
- LAL doc: add OTLP log attribute mapping section (service.name,
  service.instance.id, service.layer)
- Marketplace: add GenAI category with Virtual GenAI and Envoy AI Gateway
@wu-sheng wu-sheng force-pushed the feature/swip-10-envoy-ai-gateway branch from c41a03c to e892421 on March 31, 2026 07:51
Comment on lines +107 to +113
// service.instance.id is the OTel standard resource attribute for instance identity
// https://opentelemetry.io/docs/specs/semconv/resource/#service
// Fall back to service.instance for backward compatibility
final var instanceId = attributes.getOrDefault("service.instance.id", "");
final var serviceInstance = instanceId.isEmpty()
    ? attributes.getOrDefault("service.instance", "")
    : instanceId;

wu-sheng (Member, Author) commented:

@kezhenxu94 I checked the OTel standard, and service.instance.id seems to be the official attribute. I am not sure why this used service.instance only.

Comment on lines +73 to 84
/**
 * Fallback label mappings: if the target label (value) is absent in resource attributes,
 * copy the source label (key) value as the target. The source label is always kept as-is
 * (with dots converted to underscores by the first pass).
 */
private static final Map<String, String> FALLBACK_LABEL_MAPPINGS =
    ImmutableMap
        .<String, String>builder()
        .put("net.host.name", "node_identifier_host_name")
        .put("host.name", "node_identifier_host_name")
        .put("job", "job_name")
        .put("service.name", "job_name")
        .build();

wu-sheng (Member, Author) commented:

@pg-yang I did some research, but I can't find why the service name is used as the job name. This seems to be an old bug, introduced a long time ago by your AWS Firehose PR.

@wu-sheng wu-sheng requested a review from wankai123 March 31, 2026 07:55
@wu-sheng wu-sheng added the backend (OAP backend related) and feature (New feature) labels Mar 31, 2026
@wu-sheng wu-sheng added this to the 10.4.0 milestone Mar 31, 2026
Copilot AI left a comment

Pull request overview

Adds first-class observability support for Envoy AI Gateway (SWIP-10) in SkyWalking by introducing a dedicated layer, MAL/LAL processing rules, UI dashboard templates, OTLP receiver tweaks for GenAI semantic conventions, docs, and an E2E compose-based verification case.

Changes:

  • Add new monitored layer ENVOY_AI_GATEWAY and register UI templates for root/service/instance dashboards.
  • Add MAL rules for GenAI OTLP metrics and LAL rules for OTLP access-log sampling.
  • Improve OTLP receiver label normalization (dot-to-underscore) and update documentation + E2E coverage.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| test/e2e-v2/cases/envoy-ai-gateway/expected/service.yml | Expected service discovery (layer verification). |
| test/e2e-v2/cases/envoy-ai-gateway/expected/metrics-has-value.yml | Generic “metric has value” expectation template. |
| test/e2e-v2/cases/envoy-ai-gateway/expected/metrics-has-value-label.yml | Expectation template for labeled metrics (e.g., percentiles). |
| test/e2e-v2/cases/envoy-ai-gateway/envoy-ai-gateway-cases.yaml | swctl verification queries for service + key metrics. |
| test/e2e-v2/cases/envoy-ai-gateway/e2e.yaml | Compose-based E2E scenario definition and trigger. |
| test/e2e-v2/cases/envoy-ai-gateway/docker-compose.yml | Compose stack (OAP/BanyanDB + ai-gateway-cli + Ollama). |
| oap-server/server-starter/src/main/resources/ui-initialized-templates/menu.yaml | Adds UI menu entry for Envoy AI Gateway. |
| oap-server/server-starter/src/main/resources/ui-initialized-templates/envoy_ai_gateway/envoy-ai-gateway-service.json | Service dashboard template (overview/providers/models/log/instances). |
| oap-server/server-starter/src/main/resources/ui-initialized-templates/envoy_ai_gateway/envoy-ai-gateway-root.json | Root dashboard template listing AI Gateway services. |
| oap-server/server-starter/src/main/resources/ui-initialized-templates/envoy_ai_gateway/envoy-ai-gateway-instance.json | Instance dashboard template (per-pod view). |
| oap-server/server-starter/src/main/resources/otel-rules/envoy-ai-gateway/gateway-service.yaml | MAL rules for service-scope GenAI metrics. |
| oap-server/server-starter/src/main/resources/otel-rules/envoy-ai-gateway/gateway-instance.yaml | MAL rules for instance-scope GenAI metrics. |
| oap-server/server-starter/src/main/resources/lal/envoy-ai-gateway.yaml | LAL rule for OTLP access log sampling/extraction. |
| oap-server/server-starter/src/main/resources/application.yml | Enables Envoy AI Gateway MAL/LAL rules by default. |
| oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/.../OpenTelemetryMetricRequestProcessor.java | Dot→underscore normalization for resource + point attributes; fallback-only resource label mapping. |
| oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/.../OpenTelemetryLogHandler.java | Prefer service.instance.id with fallback to service.instance. |
| oap-server/server-core/src/main/java/.../UITemplateInitializer.java | Registers ENVOY_AI_GATEWAY UI template folder. |
| oap-server/server-core/src/main/java/.../Layer.java | Adds new ENVOY_AI_GATEWAY layer enum entry. |
| oap-server/analyzer/meter-analyzer/src/main/java/.../SampleFamily.java | Adds toString() + debugDump() to help MAL debugging. |
| docs/menu.yml | Adds docs navigation entry for Envoy AI Gateway monitoring. |
| docs/en/swip/SWIP-10/SWIP.md | Updates SWIP-10 design text to current OTLP attribute mapping/routing. |
| docs/en/swip/SWIP-10/kind-test-setup.sh | Removes local Kind helper script from SWIP-10. |
| docs/en/swip/SWIP-10/kind-test-resources.yaml | Removes local Kind resources manifest from SWIP-10. |
| docs/en/setup/backend/opentelemetry-receiver.md | Documents label conversion + fallback mappings behavior. |
| docs/en/setup/backend/marketplace.md | Adds Envoy AI Gateway to the Marketplace “GenAI” section. |
| docs/en/setup/backend/backend-envoy-ai-gateway-monitoring.md | New setup guide for Envoy AI Gateway OTLP integration. |
| docs/en/concepts-and-designs/lal.md | Documents OTLP resource attribute mapping used by LAL (incl. instance id fallback). |
| docs/en/changes/changes.md | Changelog entry for Envoy AI Gateway observability + related receiver changes. |

@wu-sheng wu-sheng force-pushed the feature/swip-10-envoy-ai-gateway branch from d986f0a to 0ea5231 on March 31, 2026 08:14
- OTel receiver: use String.replace(char,char) instead of replaceAll regex
- OTel receiver buildLabels: add merge function to handle dot-to-underscore
  key collisions
- LAL rule: guard tag-to-Integer casts for missing/empty/dash values
- SWIP doc: fix stale service.name → job_name reference
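
Two of the fixes above (the merge function for key collisions and the guarded Integer cast) can be illustrated with a small sketch. It is not the code from this PR: the class `LabelCollisionSketch`, both method names, the "first value wins" policy, and the default value are assumptions made for illustration, and the real LAL guard is written in the LAL DSL rather than Java.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class LabelCollisionSketch {
    // Normalizes keys with String.replace(char, char), which does not compile a regex.
    // Without a merge function, Collectors.toMap throws IllegalStateException when
    // "a.b" and "a_b" collapse to the same key. Keeping the first value is an
    // assumption of this sketch. Values are assumed to be non-null.
    static Map<String, String> normalize(Map<String, String> attributes) {
        return attributes.entrySet().stream().collect(Collectors.toMap(
            entry -> entry.getKey().replace('.', '_'),
            Map.Entry::getValue,
            (first, second) -> first,
            LinkedHashMap::new));
    }

    // Guarded parse for tag values: missing, empty, "-" or non-numeric input
    // yields the caller-supplied default instead of throwing.
    static int parseOrDefault(String raw, int defaultValue) {
        if (raw == null || raw.isEmpty() || "-".equals(raw)) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("gen_ai.request.model", "model-a");   // illustrative values
        attrs.put("gen_ai_request_model", "model-b");   // collides after normalization
        System.out.println(normalize(attrs));
        System.out.println(parseOrDefault("-", 0));
    }
}
```

Which value should win on a collision is a design choice. The sketch keeps the first one seen so that the result is deterministic for an ordered input.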
@wu-sheng wu-sheng force-pushed the feature/swip-10-envoy-ai-gateway branch from 0ea5231 to 3e067bf on March 31, 2026 08:39