
Support virtual gen ai analysis for otlp and zipkin traces #13767

Open
peachisai wants to merge 7 commits into apache:master from peachisai:Support-Virtual-GenAI-analysis-for-OTLP-and-Zipkin-traces

Conversation

@peachisai
Contributor

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
  • Update the CHANGES log.

@peachisai
Contributor Author

peachisai commented Mar 29, 2026

open a new PR for #13766

@wu-sheng
Member

Why open a new one? Usually keeping updating the old one is better to keep context and change discussion clear.

@wu-sheng wu-sheng added the backend (OAP backend related), feature (New feature), and tracing (Distributed tracing) labels Mar 29, 2026
@wu-sheng wu-sheng added this to the 10.4.0 milestone Mar 29, 2026
@peachisai
Contributor Author

peachisai commented Mar 29, 2026

Why open a new one? Usually keeping updating the old one is better to keep context and change discussion clear.

I accidentally overwrote some code from the previous commit, so I've started a new branch.
That's why I commented on the old PR in this session, to link the previous context.

@wu-sheng
Member

OK. You should be able to just revert the wrong changes. That's fine.
Let's keep the discussion on this.

Member

@wu-sheng wu-sheng left a comment

Good progress from #13766 — most bugs are fixed. A few remaining issues:

Bug

  1. Inverted error status logic
    Line 227:
    metrics.setStatus(StringUtil.isNotBlank(getZipkinSpanTagValue(tags, "error")));
    This sets status=true (success) when the error tag is present, and status=false (failure) when it's absent. It's inverted. The original extractMetricsFromSWSpan uses metrics.setStatus(!span.getIsError()) — so status=true means success. Should be:
    metrics.setStatus(StringUtil.isBlank(getZipkinSpanTagValue(tags, "error")));
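The inversion can be demonstrated with a minimal, self-contained sketch. The isBlank/isNotBlank methods below are simplified stand-ins for SkyWalking's StringUtil, and the tag maps are hypothetical:

```java
import java.util.Map;

public class ErrorStatusDemo {
    // Simplified stand-ins for SkyWalking's StringUtil helpers.
    public static boolean isBlank(String s) { return s == null || s.trim().isEmpty(); }
    public static boolean isNotBlank(String s) { return !isBlank(s); }

    public static void main(String[] args) {
        Map<String, String> failedSpan = Map.of("error", "true"); // span that failed
        Map<String, String> okSpan = Map.of();                    // span that succeeded

        // Buggy: status=true ("success") exactly when the error tag IS present.
        boolean buggy = isNotBlank(failedSpan.get("error"));
        // Fixed: status=true only when there is no error tag.
        boolean fixed = isBlank(failedSpan.get("error"));

        System.out.println(buggy); // true  -- failed span reported as success
        System.out.println(fixed); // false -- failed span correctly reported as failure
        System.out.println(isBlank(okSpan.get("error"))); // true -- clean span is a success
    }
}
```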

Design

  1. SpanForward still has duplicate source conversion methods
    SpanForward has its own toVirtualGenAIServiceMeta(), toVirtualGenAIInstance(), toProviderAccess(), toModelAccess() (lines 474-515) that are never called. processGenAILogic() correctly uses transferToSources() from the service. These 4 dead methods should be removed.

  2. Cost tag mutation still lives in SpanForward
    The setEstimatedCost() method in SpanForward mutates the ZipkinSpan tags before the span is persisted. This is fine functionally, but BigDecimal.valueOf(totalEstimatedCost) takes a long (the value is already Math.round()-ed in the analyzer), so dividing an already-rounded integer by 1,000,000 loses the fractional cost. Consider computing the cost display value from the raw double totalCost before rounding, or passing the raw cost through GenAIMetrics.

  3. NamingControl passed as parameter to transferToSources() is awkward API design
    NamingControl is a core service — the analyzer could hold a reference to it (injected at construction time or via module manager) instead of requiring every caller to pass it. This would simplify the interface.
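The round-then-divide ordering in the cost point above can be shown with a minimal sketch; the cost value and its unit are hypothetical, only the order of operations comes from the review:

```java
public class CostPrecisionDemo {
    public static void main(String[] args) {
        // Hypothetical raw cost in millionths of a currency unit.
        double totalCost = 1234.56;

        // Current path: Math.round() first, then divide -- the fractional
        // part is discarded before the division ever happens.
        long roundedFirst = Math.round(totalCost);        // 1235
        double displayLossy = roundedFirst / 1_000_000.0;

        // Suggested path: keep the raw double until display time.
        double displayExact = totalCost / 1_000_000.0;

        System.out.println(displayLossy == displayExact); // false -- precision was lost
    }
}
```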

Minor

  1. Changes log trailing space: "* Support virtual GenAI analysis for otlp and zipkin traces ." — trailing space before the period.

  2. Missing trailing newline in otlp-virtual-genai.yaml and zipkin-virtual-genai.yaml (both end without newline per the diff).

  3. toVirtualGenAIInstance inconsistency: In transferToSources() the instance's serviceName is set to namingControl.formatServiceName(metrics.getProviderName()), but the original VirtualGenAIProcessor.toInstance() set it to metrics.getProviderName() without formatting. Since NamingControl.formatServiceName may truncate, this could cause service name mismatches between the service and instance registrations when the provider name is very long. Verify this is intentional. (The same difference exists for the ServiceMeta.name in the original vs new code — in the original VirtualGenAIProcessor, serviceName on the instance was NOT formatted, but the ServiceMeta.name was.)

Member

@wu-sheng wu-sheng left a comment

Please fix the following issues:

  1. Bug: Inverted error status logic. StringUtil.isNotBlank(getZipkinSpanTagValue(tags, "error")) sets status=true when the error tag IS present, which means "success when there's an error". Should be StringUtil.isBlank(...).

  2. Dead code — The 4 private methods toVirtualGenAIServiceMeta(), toVirtualGenAIInstance(), toProviderAccess(), toModelAccess() in SpanForward are never called. processGenAILogic() already delegates to transferToSources(). Please remove them.

  3. Cost precision loss. setEstimatedCost() takes metrics.getTotalEstimatedCost(), which is a long (already rounded via Math.round(totalCost)), then divides by 1,000,000. The rounding already lost the fractional cost. Consider passing the raw double totalCost through GenAIMetrics so the display value retains precision.

  4. API design. transferToSources(GenAIMetrics, NamingControl) requires every caller to pass NamingControl. Since GenAIMeterAnalyzer is a singleton service, inject NamingControl at construction time and simplify the signature to transferToSources(GenAIMetrics).
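The suggested constructor injection could be sketched as follows; the class bodies are simplified stand-ins for the PR's GenAIMeterAnalyzer and GenAIMetrics and for SkyWalking's NamingControl, and the formatting rule is hypothetical:

```java
// Simplified stand-in for SkyWalking's NamingControl service.
interface NamingControl {
    String formatServiceName(String name);
}

// Simplified stand-in for the PR's GenAIMetrics.
class GenAIMetrics {
    private final String providerName;
    GenAIMetrics(String providerName) { this.providerName = providerName; }
    String getProviderName() { return providerName; }
}

class GenAIMeterAnalyzer {
    private final NamingControl namingControl;

    // NamingControl is injected once at construction time...
    GenAIMeterAnalyzer(NamingControl namingControl) {
        this.namingControl = namingControl;
    }

    // ...so callers no longer pass it on every invocation.
    String transferToSources(GenAIMetrics metrics) {
        return namingControl.formatServiceName(metrics.getProviderName());
    }
}

public class InjectionDemo {
    public static void main(String[] args) {
        NamingControl naming = name -> name.toLowerCase(); // hypothetical formatting rule
        GenAIMeterAnalyzer analyzer = new GenAIMeterAnalyzer(naming);
        System.out.println(analyzer.transferToSources(new GenAIMetrics("OpenAI"))); // openai
    }
}
```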

wu-sheng
wu-sheng previously approved these changes Mar 29, 2026
Member

@wu-sheng wu-sheng left a comment

All 4 issues from the previous review are fixed:

  1. Error status logic — now correctly uses StringUtil.isBlank(...) (success = no error tag)
  2. Dead code removed — the 4 unused toVirtualGenAI* / toProviderAccess / toModelAccess methods deleted from SpanForward
  3. Cost precision — GenAIMetrics.totalEstimatedCost changed to double, Math.round() deferred to transferToSources() when writing to GenAIProviderAccess/GenAIModelAccess (which need long), and SpanForward.setEstimatedCost() now receives the raw double for the Zipkin tag
  4. NamingControl injected — now injected into GenAIMeterAnalyzer constructor via GenAIAnalyzerModuleProvider.prepare(), transferToSources() signature simplified

Minor nit (non-blocking): extractMetricsFromSWSpan still has metrics.setTotalEstimatedCost(Math.round(totalCost)) which rounds then widens back to double — should be just totalCost like the Zipkin path. And the changes.md trailing space "traces ." is still there.

@wu-sheng
Member

And you need to fix the code style issue.

Member

@wu-sheng wu-sheng left a comment

LGTM. All previous issues addressed:

  • extractMetricsFromSWSpan now stores raw totalCost (no premature rounding)
  • NamingControl injected via setNamingControl() in start() phase — correct lifecycle for SkyWalking module system (prepare() creates the service, start() wires cross-module deps)
  • Private methods use the instance field directly instead of parameter passing
  • Changes.md trailing space fixed

Member

@wu-sheng wu-sheng left a comment

Two suggestions:

1. Share the Python app between OTLP and Zipkin e2e tests

The two Python scripts (otlp-virtual-genai/openai-call.py and zipkin-virtual-genai/openai-call.py) and Dockerfiles are nearly identical — only the exporter differs (3 lines).

Use opentelemetry-instrument (auto-instrumentation CLI) to eliminate all OTel boilerplate from the script. The script becomes pure business logic (just OpenAI client calls), and the exporter is selected entirely via env vars:

  • One shared openai-call.py (no TracerProvider, no exporter code)
  • One shared Dockerfile.python (install opentelemetry-distro + both exporters)
  • OTLP docker-compose: OTEL_TRACES_EXPORTER=otlp, entrypoint opentelemetry-instrument python3 /openai-call.py
  • Zipkin docker-compose: OTEL_TRACES_EXPORTER=zipkin, entrypoint opentelemetry-instrument python3 /openai-call.py

The expected files are also identical and could be shared.

2. Documentation: clarify which semantic convention is used

virtual-genai.md should clarify that the tag keys follow the OpenTelemetry GenAI Semantic Conventions. Currently the "Span Contract" section only describes the SkyWalking native agent contract (Exit span, layer == GENAI). For OTLP/Zipkin sources, there's no "Exit span" or "layer" — the OAP detects GenAI spans by checking for the gen_ai.response.model tag.

The doc should distinguish:

  • SkyWalking native agent: Exit span + SpanLayer.GenAI + gen_ai.* tags
  • OTLP / Zipkin: Any span with gen_ai.response.model tag present (following OTel GenAI Semantic Conventions)

Also note: the current OTel Python instrumentation emits gen_ai.system (old semconv) by default, not gen_ai.provider.name (new semconv). SkyWalking falls back to model-name prefix matching when gen_ai.provider.name is absent. Consider also reading gen_ai.system as a fallback for broader compatibility.
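The suggested fallback chain could look like this sketch. The resolveProvider helper and the model-prefix rule are hypothetical; only the tag keys come from the OpenTelemetry GenAI semantic conventions:

```java
import java.util.Map;

public class ProviderTagFallbackDemo {
    // Hypothetical resolution order: new semconv key first, then the
    // deprecated gen_ai.system key, then model-name prefix matching.
    public static String resolveProvider(Map<String, String> tags) {
        String provider = tags.get("gen_ai.provider.name"); // current OTel semconv
        if (provider == null) {
            provider = tags.get("gen_ai.system");           // deprecated semconv, still emitted by OTel Python
        }
        if (provider == null) {
            String model = tags.getOrDefault("gen_ai.response.model", "");
            if (model.startsWith("gpt-")) {                 // illustrative prefix rule only
                provider = "openai";
            }
        }
        return provider;
    }

    public static void main(String[] args) {
        System.out.println(resolveProvider(Map.of("gen_ai.system", "openai")));         // openai
        System.out.println(resolveProvider(Map.of("gen_ai.response.model", "gpt-4o"))); // openai
    }
}
```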

@wu-sheng
Member

@peachisai Most of it is good. Could you polish it a little further?

@peachisai
Contributor Author

@wu-sheng
OK, I can.
Regarding the gen_ai.system tag, it is deprecated, so I don't think we need to support it. I'm considering opening a new PR in the OpenTelemetry community to update this tag to follow the latest semantic conventions.
