Stamp Honeycomb SampleRate from tracestate rate hint and export 5xx error tail for unsampled traces#4562
Stamp Honeycomb SampleRate from tracestate rate hint and export 5xx error tail for unsampled traces#4562erik-the-implementer wants to merge 3 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4562 +/- ##
==========================================
- Coverage 58.11% 58.06% -0.05%
==========================================
Files 369 369
Lines 40459 40459
Branches 11471 11467 -4
==========================================
- Hits 23513 23493 -20
- Misses 16871 16891 +20
Partials 75 75
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Claude Code ReviewSummaryThis PR teaches the sync service to honor an upstream head-sampling rate hint ( What's Working Well
Issues FoundCritical (Must Fix)None. Important (Should Fix)None. Suggestions (Nice to Have)None remaining — both suggestions from iteration 2 were addressed (see below). Issue ConformanceNo linked issue (flagged per project convention — PRs should reference the issue/incident they address). The PR description is exceptionally thorough: motivation (the ~20x undercount and ~95% missing error traces), a full per-request behavior matrix, scope decisions, and the known pre-existing mid-stream-crash limitation inherited from #4561. The implementation matches the described matrix, and the changeset ( The one CI signal is a Previous Review StatusAddressed (commit
Remaining: Nothing. No new code paths were introduced in this iteration, and nothing regressed. Review iteration: 3 | 2026-06-11 |
…tail
The Cloudflare worker in front of Electric head-samples successful
requests at 1:N (currently 1:20) and keeps all >= 500 responses at
sampleRate 1, weighting its exported spans accordingly. Electric only
received the sampled/not-sampled bit via traceparent, so:
1. Electric spans carried no SampleRate attribute and electric-region
aggregates under-reported worker traffic ~20x;
2. ~95% of error traces had no server-side spans at all, because the
worker's keep-on-error decision happens at export time, after a
traceparent with sampled=0 was already propagated.
The worker now sends its rate hint on every proxied request via the
W3C tracestate header (member: `electric=rate:<N>`). This commit makes
Electric honor it:
* TraceContextPlug parses the hint (missing/unparseable/rate < 1 hints
are ignored) and stashes it in the conn together with the remote
parent span context and its sampled flag.
* For sampled remote parents, ServeShapePlug stamps `SampleRate` = N
(status < 500) or `SampleRate` = 1 (status >= 500) on the
Plug_shape_get root span once the response status is known; the
shape_get.plug.stream_chunk child spans get the same attribute at
creation.
* For unsampled remote parents, a 5xx response now synthesizes a
single root request span at response time with `SampleRate` = 1,
carrying the standard root-span attributes, backdated to the request
start, and parented onto the remote span context with the sampled
trace flag forced on — so the parent-based sampler records it and it
lands in the same trace as the worker's kept-on-error spans.
Unsampled successful requests still export nothing.
Direct traffic (no remote parent) is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f1158ed to
7fe2939
Compare
Note
Stacked on #4561 (
flatten-shape-get-spans). The PR base is set to that branch until it merges; only the last commit (Stamp Honeycomb SampleRate from tracestate hint and export 5xx error tail) is new here.Motivation
The Cloudflare worker in front of Electric Cloud head-samples successful requests at 1:20 and keeps all >= 500 responses at sampleRate 1, weighting its exported spans accordingly. Electric only ever received the sampled/not-sampled bit via
traceparent, which causes two problems today:SampleRateattribute, so Honeycomb aggregates over electric-region spans under-report worker traffic ~20x relative to the worker's own (weighted) spans.traceparentwithsampled=0— and Electric's parent-based sampler then drops every span for the request.The worker now sends its sampling rate alongside
traceparenton every proxied request (regardless of the sampled flag) via the W3Ctracestateheader:Honeycomb's OTLP ingest natively reads an integer span attribute named
SampleRateand weights aggregates by it.Behavior matrix (per shape GET request)
SampleRateattributerate:NSampleRate = Non the root span and thestream_chunkchild spansrate:NSampleRate = 1(error responses ignore the hint — mirrors the worker's keep-all-errors-at-rate-1 semantics)Plug_shape_getspan synthesized and exported withSampleRate = 1, in the same trace as the worker's spansrate:<1Implementation
Electric.Plug.TraceContextPlug— after the existingtraceparentextraction, parses theelectric=rate:Nmember out of the (already-decoded) tracestate carried by the extracted span context, and stores%{parent_span_ctx, parent_sampled?, sample_rate_hint}inconn.private. Exposestrace_context/1andsample_rate_attrs/2for downstream plugs.Electric.Plug.ServeShapePlug—emit_shape_telemetry/1(which runs on success, halt and caught-error paths once the response status is known) now ends withstamp_sample_rate/1:SampleRateattribute to the recording root span (attributes may be set any time before span end);Electric.Shapes.Api.Response.send_stream/2— the per-requestshape_get.plug.stream_chunkchild spans (which survive the flattening in Flatten 1:1 shape GET child spans into Plug_shape_get root-span attributes #4561 because they are 1:n, not 1:1) get the sameSampleRateattribute at creation, from the stashed hint and the response status.Electric.Telemetry.OpenTelemetry.export_unsampled_remote_span/4— the error-tail mechanism. Since the parent-based sampler means no recording span exists during a remote-unsampled request, the span is synthesized when the 5xx response is emitted: a copy of the extracted remote parent span context with the W3Csampledtrace flag forced on is installed as the parent in a fresh OTel context, and:otel_tracer.start_span/4is called under it. The parent-based sampler (default{parent_based, %{root: always_on}}— verified inotel_sampler_parent_based.erl) then takes itsremote_parent_sampled→always_onbranch and records the span.Synthesized-span fidelity (no SDK compromises needed)
trace_id/parent_span_idequal the inboundtraceparentvalues.ServeShapePlug.start_telemetry_span/1already records inconn.private[:electric_telemetry_span](the OTel SDK stores span start/end as native monotonic time, so the value is directly usable); end time is "now", so the duration matches the request.open_telemetry_attrs/1puts on a normalPlug_shape_getspan (final request/response state), plusSampleRate = 1and an error status (error_strwhen available,HTTP <status>otherwise).Notes / scope decisions
emit_shape_telemetry/1is skipped (inheritedPlug.ErrorHandlerlimitation documented in Flatten 1:1 shape GET child spans into Plug_shape_get root-span attributes #4561's base) — but in that case the committed status was non-5xx to begin with, so no error-tail span is lost.Sampler'sexclude_spanssemantics and the flattening work from the base branch are untouched.MIX_TARGET=application, no OTel SDK): the propagator no-ops, so no trace context is ever stored and every path is unchanged.Tests
test/electric/plug/trace_context_plug_test.exs(new): tracestate parsing — valid/absent/malformed hints,rate:0/negative/non-integer rejection, multi-member tracestate, whitespace, sampled-flag extraction,sample_rate_attrs/2status mapping (200/304/404/499 → N; 500/503 → 1; no hint →%{}).test/electric/plug/serve_shape_plug_sample_rate_test.exs(new,async: false, exports real spans to the test process viaotel_simple_processor+otel_exporter_pid):SampleRate = 20(andstream_chunkchild spans too);SampleRate = 1;Plug_shape_getspan withSampleRate = 1,trace_idandparent_span_idmatching the inboundtraceparent,start_time <= end_time, error status, standard root attributes;SampleRate(200 and 5xx).:opentelemetryapp with a pid exporter requires erasing the OTel API's per-application tracer persistent-term cache, because cached tracers embed the old provider's (empty) processor pipeline andcreate_application_tracerskeeps stale entries across restarts.test/electric/plug,test/electric/telemetryandtest/electric/shapes/apidirectories pass (235 tests, 0 failures), including the files touched on the base branch.mix compile --warnings-as-errorsclean on both theapplicationtarget and the default (embedded) target;mix format --check-formattedclean.🤖 Generated with Claude Code