Support virtual gen ai analysis for otlp and zipkin traces #13767
peachisai wants to merge 7 commits into apache:master from
Conversation
open a new PR for #13766

Why open a new one? Usually keeping the old one updated is better: it keeps the context and change discussion clear.

I accidentally overwrote some code from the previous commit, so I started a new branch.

OK. You should be able to just revert the wrong changes. That's fine.
wu-sheng
left a comment
Good progress from #13766 — most bugs are fixed. A few remaining issues:
Bug

- Inverted error status logic

  Line 227: `metrics.setStatus(StringUtil.isNotBlank(getZipkinSpanTagValue(tags, "error")));` sets `status=true` (success) when the `error` tag is present, and `status=false` (failure) when it's absent. It's inverted. The original `extractMetricsFromSWSpan` uses `metrics.setStatus(!span.getIsError())` — so `status=true` means success. Should be: `metrics.setStatus(StringUtil.isBlank(getZipkinSpanTagValue(tags, "error")));`
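The corrected mapping can be sketched in a few self-contained lines (the plain tag map and `isBlank` helper stand in for `getZipkinSpanTagValue` and SkyWalking's `StringUtil`; the convention is the one from `extractMetricsFromSWSpan`, where `status == true` means success):

```java
import java.util.Map;

public class ErrorStatusDemo {
    // Stand-in for SkyWalking's StringUtil.isBlank.
    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // status == true means success, matching the native path's
    // metrics.setStatus(!span.getIsError()) convention: success is
    // the ABSENCE of a non-blank "error" tag.
    static boolean statusOf(Map<String, String> tags) {
        return isBlank(tags.get("error"));
    }

    public static void main(String[] args) {
        System.out.println(statusOf(Map.of()));                // true: no error tag -> success
        System.out.println(statusOf(Map.of("error", "true"))); // false: error tag present -> failure
    }
}
```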
Design

- `SpanForward` still has duplicate source conversion methods

  `SpanForward` has its own `toVirtualGenAIServiceMeta()`, `toVirtualGenAIInstance()`, `toProviderAccess()`, `toModelAccess()` (lines 474-515) that are never called — `processGenAILogic()` correctly uses `transferToSources()` from the service. These 4 dead methods should be removed.

- Cost tag mutation still lives in `SpanForward`

  The `setEstimatedCost()` method in `SpanForward` mutates the `ZipkinSpan` tags before it's persisted. This is fine functionally, but `BigDecimal.valueOf(totalEstimatedCost)` takes a `long` (already `Math.round()`-ed in the analyzer), so dividing an already-rounded integer by 1,000,000 loses most of the precision. Consider computing the cost display value from the raw `double totalCost` before rounding, or passing the raw cost through `GenAIMetrics`.

- `NamingControl` passed as a parameter to `transferToSources()` is awkward API design

  `NamingControl` is a core service — the analyzer could hold a reference to it (injected at construction time or via the module manager) instead of requiring every caller to pass it. This would simplify the interface.
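The precision issue can be demonstrated in a few lines (the `0.42` cost value and the micro-unit scale are illustrative, not taken from the PR):

```java
import java.math.BigDecimal;

public class CostPrecisionDemo {
    public static void main(String[] args) {
        double totalCost = 0.42;  // hypothetical raw cost from the analyzer

        // Current path: Math.round() to long first, then divide for display.
        long rounded = Math.round(totalCost);  // 0 -- the fractional cost is already gone
        BigDecimal fromRounded = BigDecimal.valueOf(rounded)
                .divide(BigDecimal.valueOf(1_000_000));

        // Suggested path: divide the raw double, round only where a long is required.
        BigDecimal fromRaw = BigDecimal.valueOf(totalCost)
                .divide(BigDecimal.valueOf(1_000_000));

        System.out.println(fromRounded);  // 0
        System.out.println(fromRaw);      // 4.2E-7
    }
}
```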
Minor

- Changes log trailing space: `"* Support virtual GenAI analysis for otlp and zipkin traces ."` — trailing space before the period.

- Missing trailing newline in `otlp-virtual-genai.yaml` and `zipkin-virtual-genai.yaml` (both end without a newline per the diff).

- `toVirtualGenAIInstance` inconsistency: in `transferToSources()` the instance's `serviceName` is set to `namingControl.formatServiceName(metrics.getProviderName())`, but the original `VirtualGenAIProcessor.toInstance()` set it to `metrics.getProviderName()` without formatting. Since `NamingControl.formatServiceName` may truncate, this could cause service name mismatches between the service and instance registrations when the provider name is very long. Verify this is intentional. (The same difference exists for the `ServiceMeta.name` in the original vs new code — in the original `VirtualGenAIProcessor`, `serviceName` on the instance was NOT formatted, but the `ServiceMeta.name` was.)
wu-sheng
left a comment
Please fix the following issues:

- Bug: Inverted error status logic — `StringUtil.isNotBlank(getZipkinSpanTagValue(tags, "error"))` sets `status=true` when the error tag IS present, which means "success when there's an error". Should be `StringUtil.isBlank(...)`.

- Dead code — the 4 private methods `toVirtualGenAIServiceMeta()`, `toVirtualGenAIInstance()`, `toProviderAccess()`, `toModelAccess()` in `SpanForward` are never called. `processGenAILogic()` already delegates to `transferToSources()`. Please remove them.

- Cost precision loss — `setEstimatedCost()` takes `metrics.getTotalEstimatedCost()`, which is a `long` (already rounded via `Math.round(totalCost)`), then divides by 1,000,000. The rounding has already lost the fractional cost. Consider passing the raw `double totalCost` through `GenAIMetrics` so the display value retains precision.

- API design — `transferToSources(GenAIMetrics, NamingControl)` requires every caller to pass `NamingControl`. Since `GenAIMeterAnalyzer` is a singleton service, inject `NamingControl` at construction time and simplify the signature to `transferToSources(GenAIMetrics)`.
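The suggested constructor injection could be sketched like this (all class bodies are hypothetical stubs; only the names come from the review):

```java
// Stand-in for SkyWalking's core NamingControl service.
class NamingControl {
    String formatServiceName(String name) {
        return name;  // the real service may truncate or normalize
    }
}

// Minimal stub of the metrics carrier referenced in the review.
class GenAIMetrics {
    private final String providerName;
    GenAIMetrics(String providerName) { this.providerName = providerName; }
    String getProviderName() { return providerName; }
}

class GenAIMeterAnalyzer {
    private final NamingControl namingControl;

    // NamingControl injected once at construction (e.g. by the module
    // provider), so callers no longer pass it on every call.
    GenAIMeterAnalyzer(NamingControl namingControl) {
        this.namingControl = namingControl;
    }

    // Simplified signature: transferToSources(GenAIMetrics).
    String transferToSources(GenAIMetrics metrics) {
        return namingControl.formatServiceName(metrics.getProviderName());
    }
}

public class InjectionDemo {
    public static void main(String[] args) {
        GenAIMeterAnalyzer analyzer = new GenAIMeterAnalyzer(new NamingControl());
        System.out.println(analyzer.transferToSources(new GenAIMetrics("openai")));
    }
}
```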
wu-sheng
left a comment
All 4 issues from the previous review are fixed:

- Error status logic — now correctly uses `StringUtil.isBlank(...)` (success = no error tag)
- Dead code removed — the 4 unused `toVirtualGenAI*`/`toProviderAccess`/`toModelAccess` methods deleted from `SpanForward`
- Cost precision — `GenAIMetrics.totalEstimatedCost` changed to `double`, `Math.round()` deferred to `transferToSources()` when writing to `GenAIProviderAccess`/`GenAIModelAccess` (which need `long`), and `SpanForward.setEstimatedCost()` now receives the raw `double` for the Zipkin tag
- `NamingControl` injected — now injected into the `GenAIMeterAnalyzer` constructor via `GenAIAnalyzerModuleProvider.prepare()`, and the `transferToSources()` signature simplified

Minor nit (non-blocking): `extractMetricsFromSWSpan` still has `metrics.setTotalEstimatedCost(Math.round(totalCost))`, which rounds and then widens back to `double` — it should be just `totalCost`, like the Zipkin path. And the changes.md trailing space (`"traces ."`) is still there.
And you need to fix the code style issue.
wu-sheng
left a comment
LGTM. All previous issues addressed:

- `extractMetricsFromSWSpan` now stores the raw `totalCost` (no premature rounding)
- `NamingControl` injected via `setNamingControl()` in the `start()` phase — the correct lifecycle for the SkyWalking module system (`prepare()` creates the service, `start()` wires cross-module deps)
- Private methods use the instance field directly instead of parameter passing
- Changes.md trailing space fixed
wu-sheng
left a comment
Two suggestions:

1. Share the Python app between the OTLP and Zipkin e2e tests

The two Python scripts (`otlp-virtual-genai/openai-call.py` and `zipkin-virtual-genai/openai-call.py`) and their Dockerfiles are nearly identical — only the exporter differs (3 lines).

Use `opentelemetry-instrument` (the auto-instrumentation CLI) to eliminate all the OTel boilerplate from the script. The script becomes pure business logic (just OpenAI client calls), and the exporter is selected entirely via env vars:

- One shared `openai-call.py` (no TracerProvider, no exporter code)
- One shared `Dockerfile.python` (installs `opentelemetry-distro` plus both exporters)
- OTLP docker-compose: `OTEL_TRACES_EXPORTER=otlp`, entrypoint `opentelemetry-instrument python3 /openai-call.py`
- Zipkin docker-compose: `OTEL_TRACES_EXPORTER=zipkin`, entrypoint `opentelemetry-instrument python3 /openai-call.py`

The expected files are also identical and could be shared.
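Sketched as a compose fragment (service names, ports, and endpoints are illustrative; the `OTEL_*` variables are standard OpenTelemetry SDK configuration):

```yaml
# otlp-virtual-genai/docker-compose.yml (sketch)
services:
  openai-call:
    build:
      dockerfile: Dockerfile.python   # shared between both e2e cases
    environment:
      OTEL_SERVICE_NAME: openai-call
      OTEL_TRACES_EXPORTER: otlp
      OTEL_EXPORTER_OTLP_ENDPOINT: http://oap:4317   # illustrative endpoint
    entrypoint: opentelemetry-instrument python3 /openai-call.py

# zipkin-virtual-genai/docker-compose.yml would differ only in:
#      OTEL_TRACES_EXPORTER: zipkin
#      OTEL_EXPORTER_ZIPKIN_ENDPOINT: http://oap:9411/api/v2/spans
```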
2. Documentation: clarify which semantic convention is used
`virtual-genai.md` should clarify that the tag keys follow the OpenTelemetry GenAI Semantic Conventions. Currently the "Span Contract" section only describes the SkyWalking native agent contract (Exit span, layer == GENAI). For OTLP/Zipkin sources there is no "Exit span" or "layer" — the OAP detects GenAI spans by checking for the `gen_ai.response.model` tag.
The doc should distinguish:

- SkyWalking native agent: Exit span + `SpanLayer.GenAI` + `gen_ai.*` tags
- OTLP / Zipkin: any span with the `gen_ai.response.model` tag present (following the OTel GenAI Semantic Conventions)
Also note: the current OTel Python instrumentation emits `gen_ai.system` (old semconv) by default, not `gen_ai.provider.name` (new semconv). SkyWalking falls back to model-name prefix matching when `gen_ai.provider.name` is absent. Consider also reading `gen_ai.system` as a fallback for broader compatibility.
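The suggested fallback could look like this (a sketch; `resolveProvider` and the plain tag map are illustrative stand-ins for the OAP's tag handling):

```java
import java.util.Map;

public class ProviderTagFallbackDemo {
    // Prefer the new semconv key, then fall back to the old key emitted by
    // current OTel Python instrumentation. A null return means "not found",
    // where the caller would fall back to model-name prefix matching.
    static String resolveProvider(Map<String, String> tags) {
        String provider = tags.get("gen_ai.provider.name");  // new semconv
        if (provider == null || provider.isEmpty()) {
            provider = tags.get("gen_ai.system");            // old semconv
        }
        return provider;
    }

    public static void main(String[] args) {
        // Old-semconv-only span still resolves.
        System.out.println(resolveProvider(Map.of("gen_ai.system", "openai")));
        // New key wins when both are present.
        System.out.println(resolveProvider(
            Map.of("gen_ai.provider.name", "openai", "gen_ai.system", "legacy")));
    }
}
```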
@peachisai Most are good. Could you polish a little further?
@wu-sheng
CHANGES log.