
feat(telemetry): per-shape response_size.bytes histogram#4144

Draft
alco wants to merge 2 commits into main from
sv1466-response-size-metric



@alco alco commented Apr 21, 2026

Summary

Adds a per-shape response size histogram so operators can attribute HTTP response payload volume to individual shapes.

  • Emits a new [:electric, :shape, :response_size] telemetry event from ServeShapePlug.end_telemetry_span/_, tagged with root_table, is_live and stack_id.
  • Registers it as a Telemetry.Metrics.distribution (electric.shape.response_size.bytes, unit :byte), which the existing OtelMetricExporter reporter publishes as an OTel histogram.
  • Bytes are pulled from conn.assigns[:streaming_bytes_sent], the same value already fed into the existing [:electric, :plug, :serve_shape] event; no new bookkeeping on the hot path.
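For readers unfamiliar with the pattern, the bullets above can be sketched roughly as follows. This is not the code from this PR: the module name `ResponseSizeSketch` and its functions are hypothetical, and the sketch assumes the `:telemetry` and `:telemetry_metrics` packages are available at runtime.

```elixir
defmodule ResponseSizeSketch do
  # Hypothetical module for illustration; not the actual ServeShapePlug code.
  @event [:electric, :shape, :response_size]

  # Build the event name, measurements and metadata from plain values.
  # A missing byte count is treated as 0.
  def event_payload(bytes_sent, root_table, live?, stack_id) do
    measurements = %{bytes: bytes_sent || 0}
    metadata = %{root_table: root_table, is_live: live?, stack_id: stack_id}
    {@event, measurements, metadata}
  end

  # Emit the event; needs the :telemetry application to be running.
  def emit(bytes_sent, root_table, live?, stack_id) do
    {event, measurements, metadata} =
      event_payload(bytes_sent, root_table, live?, stack_id)

    :telemetry.execute(event, measurements, metadata)
  end

  # Metric definition that a Telemetry.Metrics reporter can subscribe to.
  def metrics do
    [
      Telemetry.Metrics.distribution("electric.shape.response_size.bytes",
        event_name: @event,
        measurement: :bytes,
        unit: :byte,
        tags: [:root_table, :is_live, :stack_id]
      )
    ]
  end
end
```

Keeping the payload construction separate from the `:telemetry.execute/3` call is only a convenience of the sketch: it makes the tag set easy to inspect without a running reporter.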

Motivated by operational visibility work — without this, there is no way to tell which shape is responsible for a request-handler binary-memory spike.

Refs stratovolt#1466.

Why a new event rather than re-tagging the existing one

[:electric, :plug, :serve_shape] already feeds a latency distribution (with a keep filter dropping live requests) and span attributes. Layering high-cardinality shape tags onto it would risk affecting those consumers. A dedicated event keeps the intent and the tagging surface isolated.

Test plan

  • mix compile in packages/sync-service and packages/electric-telemetry
  • mix test test/electric/plug/serve_shape_plug_test.exs in packages/sync-service
  • mix test in packages/electric-telemetry
  • Manual check in a stack that the histogram shows up on the OTel exporter once a request has been served

Emit a new `[:electric, :shape, :response_size]` telemetry event from
`ServeShapePlug.end_telemetry_span/_`, tagged with `root_table`,
`is_live` and `stack_id`, and register it as a `distribution` metric
(`electric.shape.response_size.bytes`, unit: :byte) so the OTel
exporter publishes it as a histogram.

This lets operators attribute response payload volume to individual
shapes and distinguish initial-snapshot traffic from live long-polls
without having to correlate with traces.
Session ID: 2026-04-21--sv1466--shape-response-size-metric
@alco alco added the claude label Apr 21, 2026

claude Bot commented Apr 21, 2026

Claude Code Review

Summary

This PR adds a per-shape electric.shape.response_size.bytes OTel histogram by emitting a new [:electric, :shape, :response_size] telemetry event from ServeShapePlug and registering it as a distribution in StackTelemetry. The approach is clean and well-motivated: it reuses the already-computed streaming_bytes_sent value and keeps the new event isolated from the existing [:electric, :plug, :serve_shape] event. Ready to merge with one test gap addressed.

What's Working Well

  • Sound separation of concerns: the dedicated event avoids tainting the existing latency distribution with high-cardinality shape tags, which is the right call.
  • No hot-path cost: streaming_bytes_sent is already tracked; the new event is a cheap metadata attachment.
  • get_root_table/2 fallback chain is well-designed — parsed params first, raw query params as fallback, nil as last resort.
  • Changeset file included and covers both changed packages.
  • Local variable extraction (stack_id, bytes_sent) makes end_telemetry_span cleaner.

Issues Found

Important (Should Fix)

Missing test coverage for the new telemetry event

File: packages/sync-service/test/electric/plug/serve_shape_plug_test.exs

The new [:electric, :shape, :response_size] event is not tested. The codebase has a clear pattern for testing telemetry events (see admission_control_test.exs:207-239). A test should verify that a served shape request emits the event with the correct measurements and metadata.

Minimal example following existing patterns:

test "emits response_size telemetry event", %{...} do
  test_pid = self()
  ref = make_ref()
  handler_id = "test-response-size-#{inspect(ref)}"

  :telemetry.attach(
    handler_id,
    [:electric, :shape, :response_size],
    fn _event, measurements, metadata, _config ->
      send(test_pid, {:response_size, measurements, metadata})
    end,
    nil
  )

  # Detach on exit so the handler does not leak into other tests if an assertion fails.
  on_exit(fn -> :telemetry.detach(handler_id) end)

  # ... make a request ...

  assert_receive {:response_size, %{bytes: bytes}, %{root_table: table, is_live: false, stack_id: _}}
  assert is_integer(bytes) and bytes >= 0
  assert is_binary(table)
end

The mix test check in the test plan confirms existing tests pass but doesn't verify the new behaviour is tested.

Suggestions (Nice to Have)

get_live_mode/1 called twice

File: packages/sync-service/lib/electric/plug/serve_shape_plug.ex:306,322

stack_id and bytes_sent were extracted into local variables, but get_live_mode(assigns) is still called twice. Minor consistency nit:

stack_id = get_in(conn.assigns, [:config, :stack_id])
bytes_sent = assigns[:streaming_bytes_sent] || 0
live = get_live_mode(assigns)   # extract once

nil root_table in metric labels

File: packages/sync-service/lib/electric/plug/serve_shape_plug.ex:350

When neither parsed params nor raw query params have a table key, get_root_table/2 returns nil. Depending on the OTel backend, this may surface as root_table="" or root_table="nil". An explicit fallback string avoids ambiguity:

defp get_root_table(_assigns, _conn), do: "unknown"

Consider custom histogram buckets for byte measurements

File: packages/electric-telemetry/lib/electric/telemetry/stack_telemetry.ex:89

The new distribution uses default OTel histogram buckets (tuned for latency, e.g. [0, 5, 10, 25, 50, …, 10000]). For response sizes in bytes, virtually every response will land in the top bucket, reducing histogram utility. Custom buckets like 1 KB → 100 MB would make the metric much more actionable for operators. This is a follow-up consideration if the histogram turns out noisy in practice.
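As a rough illustration of what byte-oriented boundaries could look like, the sketch below generates boundaries that grow by a factor of 4 from 1 KiB and stop at or below 100 MiB. The module name `ByteBucketsSketch`, the growth factor and the limits are all assumptions chosen for the example. How the list is handed to the reporter depends on the reporter's own options and is not shown here.

```elixir
defmodule ByteBucketsSketch do
  # Hypothetical helper; the factor of 4 and the 1 KiB / 100 MiB limits
  # are illustrative choices, not values taken from this PR.
  @kib 1024
  @max_bytes 100 * 1024 * 1024

  # Upper bounds of the buckets, in bytes, in ascending order.
  def boundaries do
    @kib
    |> Stream.iterate(&(&1 * 4))
    |> Enum.take_while(&(&1 <= @max_bytes))
  end

  # Index of the first bucket whose upper bound is >= bytes.
  # Returns length(boundaries()) for the overflow bucket.
  def bucket_index(bytes) do
    bounds = boundaries()
    Enum.find_index(bounds, &(bytes <= &1)) || length(bounds)
  end
end
```

With these boundaries a 2 MB response lands in a middle bucket rather than in the overflow bucket, which is the property the suggestion above is after.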

Issue Conformance

No linked issue in this repository (refs an external private issue stratovolt#1466). The PR description is clear and self-contained, explaining both what was added and why a separate event was chosen over augmenting the existing one. The implementation matches the stated scope.

The absence of a linked public issue is worth noting per project convention — flagging for awareness rather than blocking.


Review iteration: 1 | 2026-04-21


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2a4b6b5cf


Comment on lines +347 to +348
defp get_root_table(_assigns, %Conn{query_params: %{"table" => table}}) when is_binary(table),
do: table

P1: Stop labeling histogram with raw table query params

This fallback uses conn.query_params["table"] directly, so requests that fail validation still emit [:electric, :shape, :response_size] with attacker-controlled root_table values (the plug's error paths still call halt/1, which ends telemetry). Because root_table is a metric tag, varying this query param can create unbounded time-series cardinality and put pressure on OTel/StatsD backends; this defeats the low-cardinality goal noted in the comment. Use only validated table values (or a bounded fallback like nil/"unknown") for metric tags.
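One possible shape for that fix is sketched below. The module name `RootTableTagSketch` and the `:validated_table` assigns key are hypothetical; where the validated table name actually lives in this codebase is not shown in this thread. The point is only that the raw query params are never consulted for the tag value.

```elixir
defmodule RootTableTagSketch do
  # Hypothetical helper; :validated_table is an assumed assigns key.
  @unknown "unknown"

  # Use the table name only if it has passed request validation.
  def root_table_tag(%{validated_table: table}, _raw_query_params)
      when is_binary(table),
      do: table

  # Anything else, including requests that failed validation, maps to
  # one bounded fallback value so the tag cardinality stays limited.
  def root_table_tag(_assigns, _raw_query_params), do: @unknown
end
```

A request that fails validation then contributes to a single `"unknown"` series regardless of what the caller put in the `table` query param.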


codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.31%. Comparing base (365dd17) to head (e2a4b6b).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4144      +/-   ##
==========================================
- Coverage   89.20%   85.31%   -3.90%     
==========================================
  Files          25       40      +15     
  Lines        2520     3077     +557     
  Branches      636      641       +5     
==========================================
+ Hits         2248     2625     +377     
- Misses        270      450     +180     
  Partials        2        2              
| Flag | Coverage Δ | |
|---|---|---|
| electric-telemetry | 67.68% <ø> (?) | |
| elixir | 67.68% <ø> (?) | |
| packages/experimental | 87.73% <ø> (ø) | |
| packages/react-hooks | 86.48% <ø> (ø) | |
| packages/start | 82.83% <ø> (ø) | |
| packages/typescript-client | 94.30% <ø> (ø) | |
| packages/y-electric | 56.05% <ø> (ø) | |
| typescript | 89.20% <ø> (ø) | |
| unit-tests | 85.31% <ø> (-3.90%) | ⬇️ |


@alco alco marked this pull request as draft April 21, 2026 23:58