Skip to content

Stamp Honeycomb SampleRate from tracestate rate hint and export 5xx error tail for unsampled traces#4562

Open
erik-the-implementer wants to merge 3 commits into
mainfrom
tracestate-sample-rate
Open

Stamp Honeycomb SampleRate from tracestate rate hint and export 5xx error tail for unsampled traces#4562
erik-the-implementer wants to merge 3 commits into
mainfrom
tracestate-sample-rate

Conversation

@erik-the-implementer

Copy link
Copy Markdown
Contributor

Note

Stacked on #4561 (flatten-shape-get-spans). The PR base is set to that branch until it merges; only the last commit (Stamp Honeycomb SampleRate from tracestate hint and export 5xx error tail) is new here.

Motivation

The Cloudflare worker in front of Electric Cloud head-samples successful requests at 1:20 and keeps all >= 500 responses at sampleRate 1, weighting its exported spans accordingly. Electric only ever received the sampled/not-sampled bit via traceparent, which causes two problems today:

  1. ~20x undercount: Electric spans carry no SampleRate attribute, so Honeycomb aggregates over electric-region spans under-report worker traffic ~20x relative to the worker's own (weighted) spans.
  2. Missing server-side error traces: ~95% of error traces have no Electric spans at all. The worker's keep-on-error decision happens at export time — after it already propagated traceparent with sampled=0 — and Electric's parent-based sampler then drops every span for the request.

The worker now sends its sampling rate alongside traceparent on every proxied request (regardless of the sampled flag) via the W3C tracestate header:

tracestate: electric=rate:<positive-integer>    # currently rate:20

Honeycomb's OTLP ingest natively reads an integer span attribute named SampleRate and weights aggregates by it.

Behavior matrix (per shape GET request)

Remote parent Rate hint Response status Behavior
none (direct traffic) any Unchanged — spans recorded, no SampleRate attribute
sampled=1 rate:N < 500 Spans exported as today plus SampleRate = N on the root span and the stream_chunk child spans
sampled=1 rate:N >= 500 Spans exported as today plus SampleRate = 1 (error responses ignore the hint — mirrors the worker's keep-all-errors-at-rate-1 semantics)
sampled=0 any < 500 Unchanged — no spans (this is the volume win, deliberately not regressed)
sampled=0 any >= 500 New: exactly one root Plug_shape_get span synthesized and exported with SampleRate = 1, in the same trace as the worker's spans
any unparseable / rate:<1 any Hint ignored — behaves as if no hint was sent

Implementation

  • Electric.Plug.TraceContextPlug — after the existing traceparent extraction, parses the electric=rate:N member out of the (already-decoded) tracestate carried by the extracted span context, and stores %{parent_span_ctx, parent_sampled?, sample_rate_hint} in conn.private. Exposes trace_context/1 and sample_rate_attrs/2 for downstream plugs.
  • Electric.Plug.ServeShapePlugemit_shape_telemetry/1 (which runs on success, halt and caught-error paths once the response status is known) now ends with stamp_sample_rate/1:
    • sampled remote parent → adds the SampleRate attribute to the recording root span (attributes may be set any time before span end);
    • unsampled remote parent + status >= 500 → synthesizes the error-tail span (below);
    • no remote parent → no-op.
  • Electric.Shapes.Api.Response.send_stream/2 — the per-request shape_get.plug.stream_chunk child spans (which survive the flattening in Flatten 1:1 shape GET child spans into Plug_shape_get root-span attributes #4561 because they are 1:n, not 1:1) get the same SampleRate attribute at creation, from the stashed hint and the response status.
  • Electric.Telemetry.OpenTelemetry.export_unsampled_remote_span/4 — the error-tail mechanism. Since the parent-based sampler means no recording span exists during a remote-unsampled request, the span is synthesized when the 5xx response is emitted: a copy of the extracted remote parent span context with the W3C sampled trace flag forced on is installed as the parent in a fresh OTel context, and :otel_tracer.start_span/4 is called under it. The parent-based sampler (default {parent_based, %{root: always_on}} — verified in otel_sampler_parent_based.erl) then takes its remote_parent_sampledalways_on branch and records the span.

Synthesized-span fidelity (no SDK compromises needed)

  • Same trace, real parent: the span keeps the inbound trace_id and is parented on the worker's span id — no span-link fallback was necessary. Tests assert trace_id/parent_span_id equal the inbound traceparent values.
  • Correct timing: start time is backdated to the request start using the monotonic timestamp ServeShapePlug.start_telemetry_span/1 already records in conn.private[:electric_telemetry_span] (the OTel SDK stores span start/end as native monotonic time, so the value is directly usable); end time is "now", so the duration matches the request.
  • Standard attributes: it carries the same root-span attributes open_telemetry_attrs/1 puts on a normal Plug_shape_get span (final request/response state), plus SampleRate = 1 and an error status (error_str when available, HTTP <status> otherwise).

Notes / scope decisions

  • The synthesized error tail covers the root span only; per-chunk child spans of remote-unsampled 5xx requests stay dropped (they don't exist retroactively). Root-only is the useful unit here — it carries the full request attributes.
  • The error tail triggers on the hint-independent condition "remote parent unsampled + status >= 500" (per the behavior matrix), so 5xx responses on traces head-sampled away by any W3C-compliant upstream become visible server-side.
  • Known pre-existing edge: when a crash happens mid-stream after the response has been committed, emit_shape_telemetry/1 is skipped (inherited Plug.ErrorHandler limitation documented in Flatten 1:1 shape GET child spans into Plug_shape_get root-span attributes #4561's base) — but in that case the committed status was non-5xx to begin with, so no error-tail span is lost.
  • Sampler's exclude_spans semantics and the flattening work from the base branch are untouched.
  • Embedded usage (no MIX_TARGET=application, no OTel SDK): the propagator no-ops, so no trace context is ever stored and every path is unchanged.

Tests

  • test/electric/plug/trace_context_plug_test.exs (new): tracestate parsing — valid/absent/malformed hints, rate:0/negative/non-integer rejection, multi-member tracestate, whitespace, sampled-flag extraction, sample_rate_attrs/2 status mapping (200/304/404/499 → N; 500/503 → 1; no hint → %{}).
  • test/electric/plug/serve_shape_plug_sample_rate_test.exs (new, async: false, exports real spans to the test process via otel_simple_processor + otel_exporter_pid):
    • sampled parent + 200 → root span has SampleRate = 20 (and stream_chunk child spans too);
    • sampled parent + 5xx → SampleRate = 1;
    • unsampled parent + 5xx → exactly one exported Plug_shape_get span with SampleRate = 1, trace_id and parent_span_id matching the inbound traceparent, start_time <= end_time, error status, standard root attributes;
    • unsampled parent + 200 → nothing exported;
    • no remote parent → spans exported without SampleRate (200 and 5xx).
    • Test-infra note: restarting the :opentelemetry app with a pid exporter requires erasing the OTel API's per-application tracer persistent-term cache, because cached tracers embed the old provider's (empty) processor pipeline and create_application_tracers keeps stale entries across restarts.
  • Full test/electric/plug, test/electric/telemetry and test/electric/shapes/api directories pass (235 tests, 0 failures), including the files touched on the base branch.
  • mix compile --warnings-as-errors clean on both the application target and the default (embedded) target; mix format --check-formatted clean.

🤖 Generated with Claude Code

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.06%. Comparing base (0947230) to head (7fe2939).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4562      +/-   ##
==========================================
- Coverage   58.11%   58.06%   -0.05%     
==========================================
  Files         369      369              
  Lines       40459    40459              
  Branches    11471    11467       -4     
==========================================
- Hits        23513    23493      -20     
- Misses      16871    16891      +20     
  Partials       75       75              
Flag Coverage Δ
packages/agents 71.37% <ø> (ø)
packages/agents-mcp 77.54% <ø> (ø)
packages/agents-mobile 75.49% <ø> (ø)
packages/agents-runtime 82.12% <ø> (-0.04%) ⬇️
packages/agents-server 74.78% <ø> (-0.20%) ⬇️
packages/agents-server-ui 6.25% <ø> (ø)
packages/electric-ax 46.42% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.71% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 58.06% <ø> (-0.05%) ⬇️
unit-tests 58.06% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR teaches the sync service to honor an upstream head-sampling rate hint (tracestate: electric=rate:N) by stamping Honeycomb's SampleRate attribute on shape-GET spans, and to synthesize a single root span for 5xx responses whose remote parent was head-sampled away. The implementation is clean, well-scoped, and unusually well-tested. (Reviews the PR's own commits; the PR is stacked on #4561.)

What's Working Well

  • OTel SDK semantics are used correctly, not fought. export_unsampled_remote_span/4 forces the sampled bit on a copy of the remote parent's span_ctx and starts the span under a fresh :otel_ctx.new(), so it does not pollute the process's current-span stack. Relying on the default {parent_based, %{root: always_on}} sampler taking its remote_parent_sampled branch is the right mechanism.
  • Timestamp units are consistent. start_time flows from System.monotonic_time() (native) and the end time from :opentelemetry.timestamp() (also native) — same clock, so the synthesized span's duration is real.
  • Embedded / non-telemetry safety. Every new code path is gated behind a non-nil trace_context/1, which is only populated when the W3C propagator extracts a parent. In embedded mode the propagator no-ops, so the new paths short-circuit. The Record.is_record(...) guards can't FunctionClauseError on :undefined.
  • Defensive parsing. sample_rate_hint/1's with chain rejects rate:0, negatives, floats, trailing garbage, and the empty value, all exhaustively covered by the table-driven test.
  • Hot-path discipline. In Response.send_stream/2 the sample_rate_attrs map is computed once before the chunk reduction and only Map.put-merged per chunk — no per-chunk recompute.
  • Test coverage is excellent. The behavior matrix maps 1:1 onto real exported-span assertions, plus the unit-level trace_context_plug_test.exs.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

None remaining — both suggestions from iteration 2 were addressed (see below).

Issue Conformance

No linked issue (flagged per project convention — PRs should reference the issue/incident they address). The PR description is exceptionally thorough: motivation (the ~20x undercount and ~95% missing error traces), a full per-request behavior matrix, scope decisions, and the known pre-existing mid-stream-crash limitation inherited from #4561. The implementation matches the described matrix, and the changeset (@core/sync-service: patch) is present and accurately scoped. No scope creep.

The one CI signal is a codecov report of a single failing test (Electric.Shapes.ConsumerTest hibernate/suspend) — a known flake in main (28% flake rate, unrelated to this PR's telemetry changes).

Previous Review Status

Addressed (commit 7fe29396c): Both iteration-2 suggestions are now resolved.

  1. Mixed key types in the stream_chunk attributes mapMap.put(sample_rate_attrs, :chunk_size, ...) became Map.put(sample_rate_attrs, "chunk_size", ...) (response.ex:445), so the map is now uniformly string-keyed ("SampleRate" + "chunk_size"). Verified no code or test reads this span attribute by its key, so the atom-to-string change is purely cosmetic and exported attribute names are unchanged. 👍
  2. Global-state mutation in the integration testrestart_opentelemetry/1 now carries a clear # NOTE: block documenting that it mutates VM-global telemetry state, why async: false + on_exit keep it contained, and the constraint that tests in the module must not assume externally-configured OTel state. 👍

Remaining: Nothing. No new code paths were introduced in this iteration, and nothing regressed.


Review iteration: 3 | 2026-06-11

Base automatically changed from flatten-shape-get-spans to main June 11, 2026 13:10
alco and others added 3 commits June 11, 2026 15:15
…tail

The Cloudflare worker in front of Electric head-samples successful
requests at 1:N (currently 1:20) and keeps all >= 500 responses at
sampleRate 1, weighting its exported spans accordingly. Electric only
received the sampled/not-sampled bit via traceparent, so:

 1. Electric spans carried no SampleRate attribute and electric-region
    aggregates under-reported worker traffic ~20x;
 2. ~95% of error traces had no server-side spans at all, because the
    worker's keep-on-error decision happens at export time, after a
    traceparent with sampled=0 was already propagated.

The worker now sends its rate hint on every proxied request via the
W3C tracestate header (member: `electric=rate:<N>`). This commit makes
Electric honor it:

 * TraceContextPlug parses the hint (missing/unparseable/rate < 1 hints
   are ignored) and stashes it in the conn together with the remote
   parent span context and its sampled flag.

 * For sampled remote parents, ServeShapePlug stamps `SampleRate` = N
   (status < 500) or `SampleRate` = 1 (status >= 500) on the
   Plug_shape_get root span once the response status is known; the
   shape_get.plug.stream_chunk child spans get the same attribute at
   creation.

 * For unsampled remote parents, a 5xx response now synthesizes a
   single root request span at response time with `SampleRate` = 1,
   carrying the standard root-span attributes, backdated to the request
   start, and parented onto the remote span context with the sampled
   trace flag forced on — so the parent-based sampler records it and it
   lands in the same trace as the worker's kept-on-error spans.
   Unsampled successful requests still export nothing.

Direct traffic (no remote parent) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alco alco force-pushed the tracestate-sample-rate branch from f1158ed to 7fe2939 Compare June 11, 2026 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants