diff --git a/.agent-tasks/2026-04-21--sv1466--shape-response-size-metric/progress.md b/.agent-tasks/2026-04-21--sv1466--shape-response-size-metric/progress.md new file mode 100644 index 0000000000..9a999d3a94 --- /dev/null +++ b/.agent-tasks/2026-04-21--sv1466--shape-response-size-metric/progress.md @@ -0,0 +1,32 @@ +# Progress + +## Timeline + +- 2026-04-21 — Session started. Worktree created at + `~/code/electric-sql/worktrees/electric/sv1466-response-size-metric`. +- Read `ServeShapePlug.end_telemetry_span/_`. Confirmed + `conn.assigns[:streaming_bytes_sent]` is accessible; existing event + `[:electric, :plug, :serve_shape]` already reports bytes but without + `root_table` tag. +- Looked at `ElectricTelemetry.StackTelemetry.metrics/1`: metrics are declared + via `Telemetry.Metrics` macros (distribution/counter/sum/last_value). The + OTel exporter (`OtelMetricExporter`) converts distributions to histograms. + That's the established pattern. +- Decision: emit a **separate** telemetry event rather than augmenting the + existing `[:electric, :plug, :serve_shape]` event. Reasons: + 1. That event is already used for a duration distribution (with a `keep` + filter dropping live requests) and is consumed by spans; adding + high-cardinality tags to it risks affecting other reporters. + 2. A dedicated event makes the histogram's intent obvious and independent + of tracing plumbing. +- `root_table` source: `conn.query_params["table"]` or `request.params.table` + (same sources used for the `shape.root_table` span attribute). +- `is_live` source: existing `get_live_mode/1` private helper. + +## Operational issues + +- Git identity on the freshly cloned worktree defaulted to Oleksii's global + config; had to re-apply the `~/agents/github/erik/.gitconfig` settings + (`user.name`, `user.email`, `user.signingkey`, `commit.gpgsign`, `gpg.format`) + on the local clone. Possible improvement: make `gclone` honour per-agent + gitconfig includes. diff --git a/.agent-tasks/2026-04-21--sv1466--shape-response-size-metric/task.md b/.agent-tasks/2026-04-21--sv1466--shape-response-size-metric/task.md new file mode 100644 index 0000000000..6b0e067f69 --- /dev/null +++ b/.agent-tasks/2026-04-21--sv1466--shape-response-size-metric/task.md @@ -0,0 +1,37 @@ +# Task: Per-shape response-size metrics + +Session ID: `2026-04-21--sv1466--shape-response-size-metric` +Upstream issue: electric-sql/stratovolt#1466 + +## Problem + +Operators lack per-shape visibility into response sizes. Request-handler binary +memory grows significantly and it's unclear which shapes contribute. The data +needed for a per-shape histogram already exists on +`conn.assigns[:streaming_bytes_sent]` inside `ServeShapePlug.end_telemetry_span/_`. + +## Fix + +Emit a new telemetry event `[:electric, :shape, :response_size]` with a `bytes` +measurement, and labels `root_table`, `is_live`, `stack_id`. + +Register it as a `distribution` metric (exported as OTel histogram via +`OtelMetricExporter`) at the name `electric.shape.response_size.bytes`, +keeping all three tags. + +## Scope + +- `packages/sync-service/lib/electric/plug/serve_shape_plug.ex`: emit + the new event alongside the existing `[:electric, :plug, :serve_shape]` + event, pulling `root_table` from the loaded request params and `is_live` from + `get_live_mode/1`. +- `packages/electric-telemetry/lib/electric/telemetry/stack_telemetry.ex`: + register the distribution metric with `tags: [:root_table, :is_live, :stack_id]` + and `unit: :byte`. +- Add a changeset entry. + +## Out of scope + +- Any correlation with `process.bin_memory.total` — handled separately. +- Adjusting existing `[:electric, :plug, :serve_shape]` event (used by other + metrics and traces). diff --git a/.changeset/shape-response-size-histogram.md b/.changeset/shape-response-size-histogram.md new file mode 100644 index 0000000000..448f18afdb --- /dev/null +++ b/.changeset/shape-response-size-histogram.md @@ -0,0 +1,6 @@ +--- +'@core/sync-service': patch +'@core/electric-telemetry': patch +--- + +Add per-shape `electric.shape.response_size.bytes` histogram metric tagged with `root_table`, `is_live` and `stack_id`, letting operators attribute response payload volume to individual shapes. diff --git a/packages/electric-telemetry/lib/electric/telemetry/stack_telemetry.ex b/packages/electric-telemetry/lib/electric/telemetry/stack_telemetry.ex index 58679fe5c8..10a4a0f4a9 100644 --- a/packages/electric-telemetry/lib/electric/telemetry/stack_telemetry.ex +++ b/packages/electric-telemetry/lib/electric/telemetry/stack_telemetry.ex @@ -86,6 +86,10 @@ defmodule ElectricTelemetry.StackTelemetry do unit: {:native, :millisecond}, keep: fn metadata -> metadata[:live] != true end ), + distribution("electric.shape.response_size.bytes", + unit: :byte, + tags: [:root_table, :is_live, :stack_id] + ), distribution("electric.shape_cache.create_snapshot_task.stop.duration", unit: {:native, :millisecond} ), diff --git a/packages/sync-service/lib/electric/plug/serve_shape_plug.ex b/packages/sync-service/lib/electric/plug/serve_shape_plug.ex index ba4311bd2f..95c6abd5fb 100644 --- a/packages/sync-service/lib/electric/plug/serve_shape_plug.ex +++ b/packages/sync-service/lib/electric/plug/serve_shape_plug.ex @@ -291,11 +291,14 @@ defmodule Electric.Plug.ServeShapePlug do # is the place to assign them because we keep this plug last in the "plug pipeline" defined # in this module. defp end_telemetry_span(%Conn{assigns: assigns} = conn, _ \\ nil) do + stack_id = get_in(conn.assigns, [:config, :stack_id]) + bytes_sent = assigns[:streaming_bytes_sent] || 0 + OpenTelemetry.execute( [:electric, :plug, :serve_shape], %{ count: 1, - bytes: assigns[:streaming_bytes_sent] || 0, + bytes: bytes_sent, monotonic_time: System.monotonic_time(), duration: System.monotonic_time() - conn.private[:electric_telemetry_span][:start_time] }, @@ -304,7 +307,20 @@ defmodule Electric.Plug.ServeShapePlug do shape_handle: get_handle(assigns) || conn.query_params["handle"], client_ip: conn.remote_ip, status: conn.status, - stack_id: get_in(conn.assigns, [:config, :stack_id]) + stack_id: stack_id + } + ) + + # Per-shape response size histogram. Tagged by `root_table`, `is_live` + # and `stack_id` so operators can attribute payload volume to individual + # shapes and tell initial snapshots apart from live long-polls. + :telemetry.execute( + [:electric, :shape, :response_size], + %{bytes: bytes_sent}, + %{ + root_table: get_root_table(assigns, conn), + is_live: get_live_mode(assigns), + stack_id: stack_id } ) @@ -321,6 +337,18 @@ defmodule Electric.Plug.ServeShapePlug do defp get_live_mode(%{request: %{params: %{live: live}}}), do: live defp get_live_mode(_), do: false + # Used as a metric label for the response-size histogram. The goal is low + # cardinality (one value per configured shape root table), so we stick with + # whatever the client sent as the `table` query parameter or the already + # parsed request params. + defp get_root_table(%{request: %{params: %{table: table}}}, _conn) when is_binary(table), + do: table + + defp get_root_table(_assigns, %Conn{query_params: %{"table" => table}}) when is_binary(table), + do: table + + defp get_root_table(_assigns, _conn), do: nil + defp add_span_attrs_from_conn(conn) do conn |> open_telemetry_attrs()