8 changes: 7 additions & 1 deletion docs/en/changes/changes.md
@@ -169,7 +169,13 @@
* Support Virtual-GenAI monitoring.
* Fix on-demand pod log parsing failure by replacing invalid `DateTimeFormatter` pattern with `ISO_OFFSET_DATE_TIME`.
* Fix Zipkin receiver compatibility with the `application/x-protobuf` Content-Type.

* Support Envoy AI Gateway observability (SWIP-10): new `ENVOY_AI_GATEWAY` layer with MAL/LAL rules
for GenAI metrics (token usage, latency, TTFT, TPOT) and access log sampling via OTLP.
* OTel metric receiver: convert data point attribute dots to underscores (consistent with resource attributes
and metric names). Label mappings are now fallback-only — explicit `job_name` in resource attributes takes
precedence over the `service.name` fallback.
* OTel log handler: prefer `service.instance.id` (OTel spec) over `service.instance` with fallback.
* Add `SampleFamily.debugDump()` for MAL debugging.

#### UI
* Fix the missing icon in the new native trace view.
13 changes: 13 additions & 0 deletions docs/en/concepts-and-designs/lal.md
@@ -8,6 +8,19 @@ The LAL config files are in YAML format, and are located under directory `lal`.
set `log-analyzer/default/lalFiles` in the `application.yml` file or set environment variable `SW_LOG_LAL_FILES` to
activate specific LAL config files.

## OTLP log attribute mapping

When logs arrive via the OTLP receiver, resource attributes are mapped to `LogData` fields:

| Resource attribute | LogData field | Notes |
|---|---|---|
| `service.name` | `service` | SkyWalking service name |
| `service.instance.id` | `serviceInstance` | OTel standard ([spec](https://opentelemetry.io/docs/specs/semconv/resource/#service)). Falls back to `service.instance` for backward compatibility. |
| `service.layer` | `layer` | Routes to the LAL rule with matching `layer` declaration |

Log record attributes are available via `tag("attribute_name")` in LAL rules. Attribute keys
retain their original names (dots are NOT converted to underscores in log attributes).
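
The mapping in the table above can be sketched in Python as follows. This is an illustration only, not the OAP implementation (OAP is written in Java). The function name `map_resource_to_log_data` and the dictionary shape are hypothetical, and the sketch assumes the fallback is triggered only when `service.instance.id` is absent.

```python
def map_resource_to_log_data(resource_attrs: dict) -> dict:
    """Map OTLP resource attributes to LogData-style fields (illustrative only)."""
    # Prefer the OTel-standard key; fall back to the legacy key when it is absent.
    # Assumption: "absent" means the key is missing from the resource attributes.
    instance = resource_attrs.get("service.instance.id")
    if instance is None:
        instance = resource_attrs.get("service.instance")
    return {
        "service": resource_attrs.get("service.name"),
        "serviceInstance": instance,
        "layer": resource_attrs.get("service.layer"),
    }


legacy = map_resource_to_log_data({
    "service.name": "my-ai-gateway",
    "service.instance": "legacy-instance",
    "service.layer": "ENVOY_AI_GATEWAY",
})
print(legacy["serviceInstance"])  # legacy-instance
```

When both keys are present, the sketch returns the value of `service.instance.id`, matching the preference stated in the table.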

## Layer
Layer should be declared in the LAL script to represent the analysis scope of the logs.

104 changes: 104 additions & 0 deletions docs/en/setup/backend/backend-envoy-ai-gateway-monitoring.md
@@ -0,0 +1,104 @@
# Envoy AI Gateway Monitoring

## Envoy AI Gateway observability via OTLP

[Envoy AI Gateway](https://aigateway.envoyproxy.io/) is a gateway/proxy for AI/LLM API traffic
(OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Gemini, etc.) built on top of Envoy Proxy.
It natively emits GenAI metrics and access logs via OTLP, following
[OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/).

SkyWalking receives OTLP metrics and logs directly on its gRPC port (11800) — no OpenTelemetry
Collector is needed between the AI Gateway and SkyWalking OAP.

### Prerequisites
- [Envoy AI Gateway](https://aigateway.envoyproxy.io/) deployed. See the
[Envoy AI Gateway getting started](https://aigateway.envoyproxy.io/docs/getting-started/) for installation.

### Data flow
1. Envoy AI Gateway processes LLM API requests and records GenAI metrics (token usage, latency, TTFT, TPOT).
2. The AI Gateway pushes metrics and access logs via OTLP gRPC to SkyWalking OAP.
3. SkyWalking OAP parses metrics with [MAL](../../concepts-and-designs/mal.md) rules and access logs
with [LAL](../../concepts-and-designs/lal.md) rules.

### Set up

The MAL rules (`envoy-ai-gateway/*`) and LAL rules (`envoy-ai-gateway`) are enabled by default
in SkyWalking OAP. No OAP-side configuration is needed.

Configure the AI Gateway to push OTLP to SkyWalking by setting these environment variables:

| Env Var | Value | Purpose |
|---------|-------|---------|
| `OTEL_SERVICE_NAME` | Per-deployment gateway name (e.g., `my-ai-gateway`) | SkyWalking service name |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://skywalking-oap:11800` | SkyWalking OAP gRPC receiver |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | `grpc` | OTLP transport |
| `OTEL_METRICS_EXPORTER` | `otlp` | Enable OTLP metrics push |
| `OTEL_LOGS_EXPORTER` | `otlp` | Enable OTLP access log push |
| `OTEL_RESOURCE_ATTRIBUTES` | See below | Routing + instance + layer |

**Required resource attributes** (in `OTEL_RESOURCE_ATTRIBUTES`):
- `job_name=envoy-ai-gateway` — Fixed routing tag for MAL/LAL rules. Same for all AI Gateway deployments.
- `service.instance.id=<instance-id>` — Instance identity. In Kubernetes, use the pod name via Downward API.
- `service.layer=ENVOY_AI_GATEWAY` — Routes access logs to the AI Gateway LAL rules.

**Example:**
```bash
OTEL_SERVICE_NAME=my-ai-gateway
OTEL_EXPORTER_OTLP_ENDPOINT=http://skywalking-oap:11800
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_RESOURCE_ATTRIBUTES=job_name=envoy-ai-gateway,service.instance.id=pod-abc123,service.layer=ENVOY_AI_GATEWAY
```
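
As a rough aid for deployment scripts, the value of `OTEL_RESOURCE_ATTRIBUTES` can be composed and checked with a few lines of Python. This is a minimal sketch: the helper names `build_resource_attributes` and `missing_required` are hypothetical, and the parsing assumes simple `key=value` pairs separated by commas, with no commas or escaping inside values.

```python
REQUIRED_KEYS = ("job_name", "service.instance.id", "service.layer")


def build_resource_attributes(instance_id: str) -> str:
    """Compose the OTEL_RESOURCE_ATTRIBUTES value for one gateway instance."""
    attrs = {
        "job_name": "envoy-ai-gateway",        # fixed routing tag for MAL/LAL rules
        "service.instance.id": instance_id,    # e.g. the pod name
        "service.layer": "ENVOY_AI_GATEWAY",   # routes access logs to the LAL rules
    }
    return ",".join(f"{key}={value}" for key, value in attrs.items())


def missing_required(value: str) -> list:
    """Return the required keys that are absent from a key=value,key=value string."""
    pairs = (item.split("=", 1) for item in value.split(",") if "=" in item)
    present = {key.strip() for key, _ in pairs}
    return [key for key in REQUIRED_KEYS if key not in present]


print(build_resource_attributes("pod-abc123"))
# job_name=envoy-ai-gateway,service.instance.id=pod-abc123,service.layer=ENVOY_AI_GATEWAY
print(missing_required("job_name=envoy-ai-gateway,service.layer=ENVOY_AI_GATEWAY"))
# ['service.instance.id']
```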

### Supported Metrics

SkyWalking observes the AI Gateway as a `LAYER: ENVOY_AI_GATEWAY` service. Each gateway deployment
is a service, and each pod is an instance. Metrics include per-provider and per-model breakdowns.

#### Service Metrics

| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Request CPM | calls/min | meter_envoy_ai_gw_request_cpm | Requests per minute |
| Request Latency Avg | ms | meter_envoy_ai_gw_request_latency_avg | Average request duration |
| Request Latency Percentile | ms | meter_envoy_ai_gw_request_latency_percentile | P50/P75/P90/P95/P99 |
| Input Token Rate | tokens/min | meter_envoy_ai_gw_input_token_rate | Input (prompt) tokens per minute |
| Output Token Rate | tokens/min | meter_envoy_ai_gw_output_token_rate | Output (completion) tokens per minute |
| TTFT Avg | ms | meter_envoy_ai_gw_ttft_avg | Time to First Token (streaming only) |
| TTFT Percentile | ms | meter_envoy_ai_gw_ttft_percentile | P50/P75/P90/P95/P99 TTFT |
| TPOT Avg | ms | meter_envoy_ai_gw_tpot_avg | Time Per Output Token (streaming only) |
| TPOT Percentile | ms | meter_envoy_ai_gw_tpot_percentile | P50/P75/P90/P95/P99 TPOT |

#### Provider Breakdown Metrics

| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Provider Request CPM | calls/min | meter_envoy_ai_gw_provider_request_cpm | Requests by provider |
| Provider Token Rate | tokens/min | meter_envoy_ai_gw_provider_token_rate | Token rate by provider |
| Provider Latency Avg | ms | meter_envoy_ai_gw_provider_latency_avg | Latency by provider |

#### Model Breakdown Metrics

| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Model Request CPM | calls/min | meter_envoy_ai_gw_model_request_cpm | Requests by model |
| Model Token Rate | tokens/min | meter_envoy_ai_gw_model_token_rate | Token rate by model |
| Model Latency Avg | ms | meter_envoy_ai_gw_model_latency_avg | Latency by model |
| Model TTFT Avg | ms | meter_envoy_ai_gw_model_ttft_avg | TTFT by model |
| Model TPOT Avg | ms | meter_envoy_ai_gw_model_tpot_avg | TPOT by model |

#### Instance Metrics

All service-level metrics are also available per instance (pod) with the `meter_envoy_ai_gw_instance_` prefix,
including per-provider and per-model breakdowns.

### Access Log Sampling

The LAL rules apply a sampling policy to reduce storage:
- **Error responses** (HTTP status >= 400) — always persisted.
- **Upstream failures** — always persisted.
- **High token cost** (>= 10,000 total tokens) — persisted for cost anomaly detection.
- Normal successful responses with low token counts are dropped.

The token threshold can be adjusted in `lal/envoy-ai-gateway.yaml`.
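
The sampling policy above amounts to a simple decision function. The Python sketch below only illustrates the logic; the shipped rules are written in LAL, not Python. The function and parameter names are hypothetical, and how an upstream failure is detected is not specified here, so it is passed in as a plain boolean.

```python
TOKEN_THRESHOLD = 10_000  # total tokens; mirrors the default stated above


def should_persist(status_code: int, upstream_failed: bool, total_tokens: int) -> bool:
    """Decide whether an access log entry is kept (illustrative only)."""
    if status_code >= 400:
        return True   # error responses are always persisted
    if upstream_failed:
        return True   # upstream failures are always persisted
    if total_tokens >= TOKEN_THRESHOLD:
        return True   # high token cost is kept for cost anomaly detection
    return False      # normal, low-cost successful responses are dropped


print(should_persist(200, False, 350))     # False
print(should_persist(200, False, 10_000))  # True
print(should_persist(503, False, 0))       # True
```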
1 change: 1 addition & 0 deletions docs/en/setup/backend/marketplace.md
@@ -12,6 +12,7 @@ SkyWalking provides ready-to-use monitoring capabilities for a wide range of tec
- **Infrastructure** - Linux and Windows server monitoring
- **Cloud Services** - AWS EKS, S3, DynamoDB, API Gateway, and more
- **Gateways** - Nginx, APISIX, Kong monitoring
- **GenAI** - [Virtual GenAI](../service-agent/virtual-genai.md) for agent-based LLM call monitoring, [Envoy AI Gateway](backend-envoy-ai-gateway-monitoring.md) for infrastructure-side AI traffic observability
- **Databases** - MySQL, PostgreSQL, Redis, Elasticsearch, MongoDB, ClickHouse, and more
- **Message Queues** - Kafka, RabbitMQ, Pulsar, RocketMQ, ActiveMQ
- **Browser** - Real user monitoring for web applications
21 changes: 20 additions & 1 deletion docs/en/setup/backend/opentelemetry-receiver.md
@@ -26,7 +26,26 @@ The receiver adds label with key `node_identifier_host_name` to the collected da
and its value is from `net.host.name` (or `host.name` for some OTLP versions) resource attributes defined in OpenTelemetry proto,
for identification of the metric data.

**Notice:** In the resource scope, dots (.) in the attributes' key names are converted to underscores (_), whereas in the metrics scope, they are not converted.
**Label name conversion:** Dots (`.`) in attribute key names are converted to underscores (`_`) for both
resource attributes and data point (metric-level) attributes. For example, `gen_ai.token.type` becomes
`gen_ai_token_type` in MAL rules. Metric names also undergo the same conversion (e.g.,
`gen_ai.client.token.usage` becomes `gen_ai_client_token_usage`).
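
The conversion is a plain character replacement. The Python sketch below reproduces the two examples from the paragraph above; the helper name `to_mal_name` is hypothetical and the code is not taken from the receiver.

```python
def to_mal_name(name: str) -> str:
    """Convert an OTel attribute key or metric name to its MAL form."""
    return name.replace(".", "_")


print(to_mal_name("gen_ai.token.type"))          # gen_ai_token_type
print(to_mal_name("gen_ai.client.token.usage"))  # gen_ai_client_token_usage
```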

**Fallback label mappings:** The following resource attributes are copied to alternative label names
if the target does not already exist. These are fallback-only — if the target label is already present
in the resource attributes, the fallback is skipped.

| Source | Target | Notes |
|---|---|---|
| `service.name` | `job_name` | The [OTel Collector Prometheus Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/README.md) automatically converts the Prometheus `job` label to `service.name`. This fallback ensures it is available as `job_name` for MAL rule filtering. |
| `net.host.name` | `node_identifier_host_name` | Legacy: used by VM/Windows MAL rules |
| `host.name` | `node_identifier_host_name` | Legacy: used by VM/Windows MAL rules |

When `job_name` is set explicitly in `OTEL_RESOURCE_ATTRIBUTES` (e.g., by Envoy AI Gateway),
it takes precedence and the `service.name` fallback is skipped.

**Note:** The `net.host.name` and `host.name` mappings are legacy. New integrations should use
the natural dot-to-underscore conversion (e.g., `host.name` → `host_name` in MAL rules).
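
The fallback-only behavior can be sketched in Python as below. This is an illustration under stated assumptions, not the receiver's code: the function name `apply_fallbacks` is hypothetical, the sketch works on the original (dotted) attribute keys, and the relative precedence of `net.host.name` over `host.name` is assumed from the order of the table rows.

```python
# (source attribute, target label) pairs, in the order of the table above.
FALLBACKS = [
    ("service.name", "job_name"),
    ("net.host.name", "node_identifier_host_name"),
    ("host.name", "node_identifier_host_name"),
]


def apply_fallbacks(resource_attrs: dict) -> dict:
    """Copy source attributes to target labels only when the target is absent."""
    labels = dict(resource_attrs)
    for source, target in FALLBACKS:
        if target not in labels and source in labels:
            labels[target] = labels[source]
    return labels


# Explicit job_name (as set by Envoy AI Gateway) wins over the service.name fallback.
explicit = apply_fallbacks({"service.name": "my-ai-gateway", "job_name": "envoy-ai-gateway"})
print(explicit["job_name"])  # envoy-ai-gateway

# Without an explicit job_name, service.name is copied.
scraped = apply_fallbacks({"service.name": "vm-monitoring", "host.name": "node-1"})
print(scraped["job_name"])                   # vm-monitoring
print(scraped["node_identifier_host_name"])  # node-1
```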

| Description | Configuration File | Data Source |
|-----------------------------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|