Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Progress

## Timeline

- 2026-04-21 — Session started. Worktree created at
`~/code/electric-sql/worktrees/electric/sv1466-response-size-metric`.
- Read `ServeShapePlug.end_telemetry_span/_`. Confirmed
`conn.assigns[:streaming_bytes_sent]` is accessible; existing event
`[:electric, :plug, :serve_shape]` already reports bytes but without
`root_table` tag.
- Looked at `ElectricTelemetry.StackTelemetry.metrics/1`: metrics are declared
via `Telemetry.Metrics` macros (distribution/counter/sum/last_value). The
OTel exporter (`OtelMetricExporter`) converts distributions to histograms.
That's the established pattern.
- Decision: emit a **separate** telemetry event rather than augmenting the
existing `[:electric, :plug, :serve_shape]` event. Reasons:
1. That event is already used for a duration distribution (with a `keep`
filter dropping live requests) and is consumed by spans; adding
high-cardinality tags to it risks affecting other reporters.
2. A dedicated event makes the histogram's intent obvious and independent
of tracing plumbing.
- `root_table` source: `conn.query_params["table"]` or `request.params.table`
(same sources used for the `shape.root_table` span attribute).
- `is_live` source: existing `get_live_mode/1` private helper.

## Operational issues

- Git identity on the freshly cloned worktree defaulted to Oleksii's global
config; had to re-apply the `~/agents/github/erik/.gitconfig` settings
(`user.name`, `user.email`, `user.signingkey`, `commit.gpgsign`, `gpg.format`)
on the local clone. Possible improvement: make `gclone` honour per-agent
gitconfig includes.
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Task: Per-shape response-size metrics

Session ID: `2026-04-21--sv1466--shape-response-size-metric`
Upstream issue: electric-sql/stratovolt#1466

## Problem

Operators lack per-shape visibility into response sizes. Request-handler binary
memory grows significantly and it's unclear which shapes contribute. The data
needed for a per-shape histogram already exists on
`conn.assigns[:streaming_bytes_sent]` inside `ServeShapePlug.end_telemetry_span/_`.

## Fix

Emit a new telemetry event `[:electric, :shape, :response_size]` with a `bytes`
measurement, and labels `root_table`, `is_live`, `stack_id`.

Register it as a `distribution` metric (exported as OTel histogram via
`OtelMetricExporter`) at the name `electric.shape.response_size.bytes`,
keeping all three tags.

## Scope

- `packages/sync-service/lib/electric/plug/serve_shape_plug.ex`: emit
the new event alongside the existing `[:electric, :plug, :serve_shape]`
event, pulling `root_table` from the loaded request params and `is_live` from
`get_live_mode/1`.
- `packages/electric-telemetry/lib/electric/telemetry/stack_telemetry.ex`:
register the distribution metric with `tags: [:root_table, :is_live, :stack_id]`
and `unit: :byte`.
- Add a changeset entry.

## Out of scope

- Any correlation with `process.bin_memory.total` — handled separately.
- Adjusting existing `[:electric, :plug, :serve_shape]` event (used by other
metrics and traces).
6 changes: 6 additions & 0 deletions .changeset/shape-response-size-histogram.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
---
'@core/sync-service': patch
'@core/electric-telemetry': patch
---

Add per-shape `electric.shape.response_size.bytes` histogram metric tagged with `root_table`, `is_live` and `stack_id`, letting operators attribute response payload volume to individual shapes.
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,10 @@ defmodule ElectricTelemetry.StackTelemetry do
unit: {:native, :millisecond},
keep: fn metadata -> metadata[:live] != true end
),
distribution("electric.shape.response_size.bytes",
unit: :byte,
tags: [:root_table, :is_live, :stack_id]
),
distribution("electric.shape_cache.create_snapshot_task.stop.duration",
unit: {:native, :millisecond}
),
Expand Down
32 changes: 30 additions & 2 deletions packages/sync-service/lib/electric/plug/serve_shape_plug.ex
Original file line number Diff line number Diff line change
Expand Up @@ -291,11 +291,14 @@ defmodule Electric.Plug.ServeShapePlug do
# is the place to assign them because we keep this plug last in the "plug pipeline" defined
# in this module.
defp end_telemetry_span(%Conn{assigns: assigns} = conn, _ \\ nil) do
stack_id = get_in(conn.assigns, [:config, :stack_id])
bytes_sent = assigns[:streaming_bytes_sent] || 0

OpenTelemetry.execute(
[:electric, :plug, :serve_shape],
%{
count: 1,
bytes: assigns[:streaming_bytes_sent] || 0,
bytes: bytes_sent,
monotonic_time: System.monotonic_time(),
duration: System.monotonic_time() - conn.private[:electric_telemetry_span][:start_time]
},
Expand All @@ -304,7 +307,20 @@ defmodule Electric.Plug.ServeShapePlug do
shape_handle: get_handle(assigns) || conn.query_params["handle"],
client_ip: conn.remote_ip,
status: conn.status,
stack_id: get_in(conn.assigns, [:config, :stack_id])
stack_id: stack_id
}
)

# Per-shape response size histogram. Tagged by `root_table`, `is_live`
# and `stack_id` so operators can attribute payload volume to individual
# shapes and tell initial snapshots apart from live long-polls.
:telemetry.execute(
[:electric, :shape, :response_size],
%{bytes: bytes_sent},
%{
root_table: get_root_table(assigns, conn),
is_live: get_live_mode(assigns),
stack_id: stack_id
}
)

Expand All @@ -321,6 +337,18 @@ defmodule Electric.Plug.ServeShapePlug do
defp get_live_mode(%{request: %{params: %{live: live}}}), do: live
defp get_live_mode(_), do: false

# Used as a metric label for the response-size histogram. The goal is low
# cardinality (one value per configured shape root table), so we stick with
# whatever the client sent as the `table` query parameter or the already
# parsed request params.
defp get_root_table(%{request: %{params: %{table: table}}}, _conn) when is_binary(table),
do: table

defp get_root_table(_assigns, %Conn{query_params: %{"table" => table}}) when is_binary(table),
do: table
Comment on lines +347 to +348
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop labeling histogram with raw table query params

This fallback uses conn.query_params["table"] directly, so requests that fail validation still emit [:electric, :shape, :response_size] with attacker-controlled root_table values (the plug’s error paths still call halt/1, which ends telemetry). Because root_table is a metric tag, varying this query param can create unbounded time-series cardinality and put pressure on Otel/StatsD backends; this defeats the low-cardinality goal noted in the comment. Use only validated table values (or a bounded fallback like nil/"unknown") for metric tags.

Useful? React with 👍 / 👎.


defp get_root_table(_assigns, _conn), do: nil

defp add_span_attrs_from_conn(conn) do
conn
|> open_telemetry_attrs()
Expand Down
Loading