Skip to content

Centralize telemetry collector error handling via Poller.safe_invoke/3#4148

Draft
alco wants to merge 2 commits intomainfrom
erik/issue-32-safe-invoke
Draft

Centralize telemetry collector error handling via Poller.safe_invoke/3#4148
alco wants to merge 2 commits intomainfrom
erik/issue-32-safe-invoke

Conversation

@alco
Copy link
Copy Markdown
Member

@alco alco commented Apr 21, 2026

Summary

  • Adds ElectricTelemetry.Poller.safe_invoke/3 and wraps every MFA built by periodic_measurements/2 in it, so transient collector failures no longer cause :telemetry_poller to permanently drop a measurement from its polling list.
  • Silently absorbs :noproc/:timeout/:shutdown/:normal exits and ArgumentError (typical startup/restart races); logs a warning tagged with the MFA for anything else.
  • Strips now-redundant defensive code from count_shapes/2 (with-fallthrough) and report_retained_wal_size/3 (try/catch).

Background

:telemetry_poller catches MFA exceptions but returns error to its internal loop, which removes the measurement permanently. A single GenServer restart race or ETS-table startup race is enough to silently disable a metric for the lifetime of the poller. See issue electric-sql/alco-agent-tasks#32 for the full audit and design.

Note on semantics

User-supplied periodic measurement functions passed to ElectricTelemetry no longer have their exceptions propagated up to :telemetry_poller's own error logger — they are now caught and logged via ElectricTelemetry.Poller instead. This is called out in the changeset.

Test plan

  • New ElectricTelemetry.PollerTest covers all catch clauses + the wrapping logic in periodic_measurements/2
  • mix compile clean for @core/sync-service and @core/electric-telemetry
  • Existing telemetry / stack-supervisor tests pass in CI

Refs electric-sql/alco-agent-tasks#32

telemetry_poller removes a measurement permanently from its polling list
after the first failure, so transient collector errors (GenServer restart
races, ETS tables not yet created, DB unavailable) silently disable metrics
for the lifetime of the poller.

Wrap every MFA built by ElectricTelemetry.Poller.periodic_measurements/2 in
safe_invoke/3, which absorbs :noproc/:timeout/:shutdown/:normal exits and
ArgumentError silently and logs any other failure as a warning tagged with
the offending MFA. The measurement always returns :ok to telemetry_poller
and stays on the polling list.

Strip now-redundant defensive code from count_shapes/2 (with-fallthrough)
and report_retained_wal_size/3 (try/catch on :noproc and catch-all).

Refs electric-sql/alco-agent-tasks#32
@alco alco added the claude label Apr 21, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 21, 2026

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
2120 1 2119 1
View the top 1 failed test(s) by shortest run time
Elixir.Electric.Replication.PublicationManagerTest::test component restarts handles relation tracker restart
Stack Traces | 0.155s run time
9) test component restarts handles relation tracker restart (Electric.Replication.PublicationManagerTest)
     .../electric/replication/publication_manager_test.exs:503
     ** (exit) exited in: GenServer.call({:via, Registry, {:"Electric.ProcessRegistry:Electric.Replication.PublicationManagerTest test component restarts handles relation tracker restart", {Electric.Replication.PublicationManager.RelationTracker, nil}}}, {:remove_shape, "36883596-1776816318788225"}, 5000)
         ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
     code: PublicationManager.remove_shape(ctx.stack_id, shape_handle)
     stacktrace:
       (elixir 1.19.5) lib/gen_server.ex:1135: GenServer.call/3
       (electric 1.5.1) .../replication/publication_manager/relation_tracker.ex:73: Electric.Replication.PublicationManager.RelationTracker.remove_shape/2
       .../electric/replication/publication_manager_test.exs:522: (test)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13b3967b18

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +30 to +31
%{total: num_shapes, indexed: indexed_shapes, unindexed: unindexed_shapes} =
Electric.ShapeCache.shape_counts(stack_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep active-shapes emission independent of shape-count failures

This strict match will raise whenever Electric.ShapeCache.shape_counts/1 returns :error (the current behavior when shape-cache ETS is unavailable during startup/restart), which aborts the function before [:electric, :shapes, :active_shapes] is emitted. Previously only the total-shapes metric was skipped and active-shapes still reported; now both are skipped for that tick, creating avoidable gaps in consumer-count telemetry.

Useful? React with 👍 / 👎.

Comment on lines +73 to +74
%Postgrex.Result{rows: [[pg_wal_offset, retained_wal_size, confirmed_flush_lsn_lag]]} =
Postgrex.query!(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve non-raising behavior for retained-WAL collector

Removing local try/catch means this function now raises on transient failures (Postgrex.query! exit/errors or empty rows mismatches), while it previously returned :ok and logged. safe_invoke/3 only protects calls that go through the telemetry poller wrapper, so any direct invocation of report_retained_wal_size/3 (including direct-use paths of default_periodic_measurements/1) now crashes unexpectedly.

Useful? React with 👍 / 👎.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

Claude Code Review

Summary

The follow-up commit (275e0d9a6) addressed the failing test by restoring local defensive handling in both collector functions. The PR is closer to mergeable, but the :builtin wrapping gap from the previous review remains unaddressed, and the fix introduced a new observability regression in report_retained_wal_size/3.

What's Working Well

  • The test failure was fixed correctly: count_shapes/2 retains the with fallthrough so :error from a cold ETS table does not raise, and report_retained_wal_size/3 has a restored try/catch.
  • The count_shapes/2 comment update clarifies that active_shapes is always emitted independently of total_shapes.
  • safe_invoke/3 implementation and its test coverage remain solid.

Issues Found

Important (Should Fix)

:builtin measurements still not wrapped in safe_invoke

File: packages/electric-telemetry/lib/electric/telemetry/poller.ex:30-31

The :builtin branch is unchanged from the previous review iteration:

:builtin ->
  module.builtin_periodic_measurements(telemetry_opts)

These MFAs are added to the poller list without going through wrap/3, so any crash in a builtin measurement still permanently removes it from the polling list. All other paths (bare atoms, anonymous functions, {m, f, a} tuples) correctly go through wrap/3. The :builtin path is the odd one out.

Fix (same as previous review):

:builtin ->
  module.builtin_periodic_measurements(telemetry_opts)
  |> Enum.map(&wrap_mfa/1)

Observability regression in report_retained_wal_size/3

File: packages/sync-service/lib/electric/stack_supervisor/telemetry.ex:106-111

The restored try/catch catches all :error and :exit kinds at Logger.debug:

catch
  kind, reason when kind in [:error, :exit] ->
    Logger.debug("report_retained_wal_size/3 skipped: #{kind} #{inspect(reason)}")
    :ok
end

Because the function catches everything and returns :ok, safe_invoke never sees an error and never emits a warning. Unexpected failures (DB connection errors, query failures, result mismatches) are now silently swallowed at debug level. Before this PR those unexpected failures were logged at Logger.warning with stack_id and slot_name metadata. The safe_invoke centralized warning layer only fires when the inner function actually raises; by re-adding the local catch this function opts out of that mechanism entirely.

Suggested fix: Keep the catch but restore warning level for unexpected errors, silently ignoring only the known transient exits:

catch
  :exit, {reason, _} when reason in [:noproc, :timeout, :shutdown, :normal] ->
    :ok

  :exit, reason when reason in [:noproc, :shutdown, :normal] ->
    :ok

  kind, reason ->
    Logger.warning(
      "Failed to query retained WAL size: #{kind} #{inspect(reason)}",
      stack_id: stack_id,
      slot_name: slot_name
    )
    :ok
end

Suggestions (Nice to Have)

ArgumentError silencing in safe_invoke still uncommentedpackages/electric-telemetry/lib/electric/telemetry/poller.ex:73-74. Carried over from previous review.


Issue Conformance

No linked public issue (refs alco-agent-tasks#32). Remains a non-blocking concern.


Previous Review Status

Issue Status
Critical: Failing test for cold-stack measurements Fixed — local defensive handling restored
Important: :builtin measurements not wrapped Not addressed
Suggestion: Log context regression in safe_invoke warnings Partially regressed — warning downgraded to debug
Suggestion: ArgumentError comment Not addressed

Review iteration: 2 | 2026-04-22

…tting

- count_shapes/2: restore `with` fallthrough on `:error` from shape_counts
  and emit `active_shapes` independently so a shape-cache outage doesn't
  drop both metrics for the tick (Codex feedback).
- report_retained_wal_size/3: restore try/catch around the Postgrex call
  so direct callers (including the stack-down regression test) don't
  crash on transient DB/pool failures (Codex feedback).
- Reformat poller.ex and poller_test.exs per `mix format`.

safe_invoke/3 stays as the backstop for unexpected errors through the
poller wrapper; these local handlers ensure graceful partial emission
for known stack-startup states.
@alco alco marked this pull request as draft April 22, 2026 00:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants