Centralize telemetry collector error handling via Poller.safe_invoke/3#4148
Centralize telemetry collector error handling via Poller.safe_invoke/3#4148
Conversation
telemetry_poller removes a measurement permanently from its polling list after the first failure, so transient collector errors (GenServer restart races, ETS tables not yet created, DB unavailable) silently disable metrics for the lifetime of the poller. Wrap every MFA built by ElectricTelemetry.Poller.periodic_measurements/2 in safe_invoke/3, which absorbs :noproc/:timeout/:shutdown/:normal exits and ArgumentError silently and logs any other failure as a warning tagged with the offending MFA. The measurement always returns :ok to telemetry_poller and stays on the polling list. Strip now-redundant defensive code from count_shapes/2 (with-fallthrough) and report_retained_wal_size/3 (try/catch on :noproc and catch-all). Refs electric-sql/alco-agent-tasks#32
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13b3967b18
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| %{total: num_shapes, indexed: indexed_shapes, unindexed: unindexed_shapes} = | ||
| Electric.ShapeCache.shape_counts(stack_id) |
There was a problem hiding this comment.
Keep active-shapes emission independent of shape-count failures
This strict match will raise whenever Electric.ShapeCache.shape_counts/1 returns :error (the current behavior when shape-cache ETS is unavailable during startup/restart), which aborts the function before [:electric, :shapes, :active_shapes] is emitted. Previously only the total-shapes metric was skipped and active-shapes still reported; now both are skipped for that tick, creating avoidable gaps in consumer-count telemetry.
Useful? React with 👍 / 👎.
| %Postgrex.Result{rows: [[pg_wal_offset, retained_wal_size, confirmed_flush_lsn_lag]]} = | ||
| Postgrex.query!( |
There was a problem hiding this comment.
Preserve non-raising behavior for retained-WAL collector
Removing local try/catch means this function now raises on transient failures (Postgrex.query! exit/errors or empty rows mismatches), while it previously returned :ok and logged. safe_invoke/3 only protects calls that go through the telemetry poller wrapper, so any direct invocation of report_retained_wal_size/3 (including direct-use paths of default_periodic_measurements/1) now crashes unexpectedly.
Useful? React with 👍 / 👎.
Claude Code ReviewSummaryThe follow-up commit ( What's Working Well
Issues FoundImportant (Should Fix)
File: The :builtin ->
module.builtin_periodic_measurements(telemetry_opts)These MFAs are added to the poller list without going through Fix (same as previous review): :builtin ->
module.builtin_periodic_measurements(telemetry_opts)
|> Enum.map(&wrap_mfa/1)Observability regression in File: The restored catch
kind, reason when kind in [:error, :exit] ->
Logger.debug("report_retained_wal_size/3 skipped: #{kind} #{inspect(reason)}")
:ok
endBecause the function catches everything and returns Suggested fix: Keep the catch but restore warning level for unexpected errors, silently ignoring only the known transient exits: catch
:exit, {reason, _} when reason in [:noproc, :timeout, :shutdown, :normal] ->
:ok
:exit, reason when reason in [:noproc, :shutdown, :normal] ->
:ok
kind, reason ->
Logger.warning(
"Failed to query retained WAL size: #{kind} #{inspect(reason)}",
stack_id: stack_id,
slot_name: slot_name
)
:ok
endSuggestions (Nice to Have)
Issue ConformanceNo linked public issue (refs Previous Review Status
Review iteration: 2 | 2026-04-22 |
…tting - count_shapes/2: restore `with` fallthrough on `:error` from shape_counts and emit `active_shapes` independently so a shape-cache outage doesn't drop both metrics for the tick (Codex feedback). - report_retained_wal_size/3: restore try/catch around the Postgrex call so direct callers (including the stack-down regression test) don't crash on transient DB/pool failures (Codex feedback). - Reformat poller.ex and poller_test.exs per `mix format`. safe_invoke/3 stays as the backstop for unexpected errors through the poller wrapper; these local handlers ensure graceful partial emission for known stack-startup states.
Summary
ElectricTelemetry.Poller.safe_invoke/3and wraps every MFA built byperiodic_measurements/2in it, so transient collector failures no longer cause:telemetry_pollerto permanently drop a measurement from its polling list.:noproc/:timeout/:shutdown/:normalexits andArgumentError(typical startup/restart races); logs a warning tagged with the MFA for anything else.count_shapes/2(with-fallthrough) andreport_retained_wal_size/3(try/catch).Background
:telemetry_pollercatches MFA exceptions but returnserrorto its internal loop, which removes the measurement permanently. A single GenServer restart race or ETS-table startup race is enough to silently disable a metric for the lifetime of the poller. See issue electric-sql/alco-agent-tasks#32 for the full audit and design.Note on semantics
User-supplied periodic measurement functions passed to
ElectricTelemetryno longer have their exceptions propagated up to:telemetry_poller's own error logger — they are now caught and logged viaElectricTelemetry.Pollerinstead. This is called out in the changeset.Test plan
ElectricTelemetry.PollerTestcovers all catch clauses + the wrapping logic inperiodic_measurements/2mix compileclean for@core/sync-serviceand@core/electric-telemetryRefs electric-sql/alco-agent-tasks#32