Throttle replication event retry logging and scrub event payloads by erik-the-implementer · Pull Request #4566 · electric-sql/electric

erik-the-implementer · 2026-06-11T13:43:59Z

Fixes the log-flooding half of #4563.

A failing replication event is retried every 50ms (@event_retry_delay) for up to 10 minutes (@max_event_retry_time), and every attempt logged at error level — up to ~12,000 error logs for a single incident. In Electric Cloud each of those lines becomes a Sentry event (capture_log_messages: true), which is how one tenant produced 7,000+ Sentry events from a single incident. Each message also embedded the full inspected event — potentially megabytes of customer row data copied into the logs every 50ms.

Changes

The first failure and one progress update per 10s (@retry_log_interval) log at error level; every attempt in between logs at debug level with the previous full detail.
The throttle window is wall-clock time tracked in the client state (reset on successful delivery), not retry budget: a fast failure like :noproc consumes no measurable budget, so budget-based throttling would never fire.
Error-level messages carry the event identity (xid/LSN/relation instead of the full payload) and a scrubbed failure reason: any TransactionFragment/Relation embedded in the exit reason (e.g. as arguments in a :noproc tuple) is replaced with a one-line summary before inspecting, so row data never reaches error-level logs. The scrubber is safe for improper lists and leaves other structs untouched.
The budget-exhaustion messages get the same identity + scrubbed-reason treatment (the subsequent raise/exit is unchanged).

Side observation (behavior unchanged in this PR): because the retry budget only counts time spent inside attempts, sub-millisecond failures like :noproc barely consume it, so the "10-minute" budget can run much longer in wall-clock terms for fast failures.

Test

A new test drives the real noproc retry loop against a database and asserts that ~10 retry attempts produce exactly one error-level log line, that the line identifies the event by xid without containing the inserted row value, and that the event is still delivered once the handler comes back.

🤖 Generated with Claude Code

A failing replication event is retried every @event_retry_delay (50ms) for up to @max_event_retry_time (10 minutes) and every attempt was logged at error level — up to ~12000 error logs per incident, flooding log output and any error tracker fed from it (the cloud ships every Logger.error line to Sentry). Each message also embedded the full inspected event, copying potentially megabytes of customer row data into the logs on every 50ms attempt. Now the first failure and one progress update per @retry_log_interval (10s) are logged at error level, carrying the event identity (xid/lsn/relation) and the failure reason with any embedded replication events replaced by their one-line summary. The attempts in between log the full detail at debug level. The throttle window is wall-clock time tracked in the client state, because a fast failure such as :noproc consumes no measurable retry budget. Fixes part of #4563. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Make scrub_events safe for improper lists and leave other structs untouched, and apply mix format to the retry logging call site. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov · 2026-06-11T13:45:27Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.36%. Comparing base (1219cb4) to head (29a5fcb).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #4566       +/-   ##
===========================================
- Coverage   70.96%   58.36%   -12.60%     
===========================================
  Files          83      370      +287     
  Lines        9856    40652    +30796     
  Branches     3124    11548     +8424     
===========================================
+ Hits         6994    23728    +16734     
- Misses       2844    16850    +14006     
- Partials       18       74       +56

Flag	Coverage Δ
packages/agents	`71.37% <ø> (ø)`
packages/agents-mcp	`77.54% <ø> (?)`
packages/agents-mobile	`75.49% <ø> (?)`
packages/agents-runtime	`82.45% <ø> (?)`
packages/agents-server	`75.06% <ø> (+0.22%)`	⬆️
packages/agents-server-ui	`6.25% <ø> (?)`
packages/electric-ax	`46.42% <ø> (ø)`
packages/experimental	`87.73% <ø> (?)`
packages/react-hooks	`86.48% <ø> (?)`
packages/start	`82.83% <ø> (?)`
packages/typescript-client	`91.83% <ø> (?)`
packages/y-electric	`56.05% <ø> (?)`
typescript	`58.36% <ø> (-12.60%)`	⬇️
unit-tests	`58.36% <ø> (-12.60%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

claude · 2026-06-11T13:50:13Z

Claude Code Review

Summary

This PR fixes the log-flooding half of #4563 by throttling replication event-retry logging and scrubbing event payloads out of error-level logs. The implementation is focused, well-commented, and well-tested. No blocking issues found — overall a clean, defensive change.

What's Working Well

Right throttle dimension. Tracking the throttle window in wall-clock time (last_retry_error_log) rather than retry budget correctly handles the :noproc fast-failure case the PR is built around, where budget-based throttling would essentially never fire. The reasoning is captured in the comment at replication_client.ex:570-582.
Defense in depth on the scrubber. Even when scrub_events/1 misses an embedding shape, inspect_scrubbed/1 caps output with limit: 100, printable_limit: 2048 (replication_client.ex:615-616), so a missed event can't dump megabytes of row data — it's bounded to a few KB. Good belt-and-suspenders.
Improper-list safety. The head/tail recursion in scrub_events/1 (:628) correctly handles improper lists that can appear in arbitrary exit reasons, and the %_{} clause leaves unrelated structs untouched rather than crashing.
Lazy debug logging. Using Logger.debug(fn -> ... end) (:595) avoids the expensive Exception.format/inspect work entirely when debug is disabled — the common production case.
Solid test. replication_client_test.exs:254 drives the real noproc retry loop, asserts exactly one error line over ~10 attempts, confirms the row value ("a private row value") is absent while xid= is present, and verifies the event is still delivered once the handler recovers. The 500ms window vs. 10s throttle interval makes the "exactly one" assertion deterministic.
Typespec for the new last_retry_error_log field updated (:84); changeset present with the correct @core/sync-service package name.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

1. Exhaustion-path error log drops the stacktrace and kind on the dispatching path

File: packages/sync-service/lib/electric/postgres/replication_client.ex:560-563

The budget-exhaustion log for the dispatching path changed from Exception.format(kind, reason, __STACKTRACE__) to describe_event(event) <> inspect_scrubbed(reason). This is correct for scrubbing, but it also drops the kind (:error/:exit/:throw) and the stacktrace from that error-level message — diagnostics that were previously present. In production (debug off) that per-attempt detail is no longer captured anywhere except the subsequent :erlang.raise/3, which does preserve the trace via the crash report. So the information isn't fully lost, but operators reading the "Exhausted retry budget" line alone now get less. If the crash report reliably captures it this is fine — worth a conscious confirmation rather than an accident.

2. scrub_events/1 doesn't recurse into maps or into non-event struct fields

File: packages/sync-service/lib/electric/postgres/replication_client.ex:619-629

scrub_events/1 recurses through tuples and lists, but the final scrub_events(other) clause returns maps as-is and scrub_events(%_{}) returns other structs without descending into their fields. An event embedded as a map value, or nested inside another struct's field (e.g. a custom exception struct carrying the event), would not be replaced by its one-line summary. In practice the targeted :noproc/$gen_call exit reasons embed the event in tuple/list arg shapes, which are handled, and the inspect limits bound any leak to ~2KB regardless — so this is a minor completeness note, not a correctness bug. Flagging it only in case the embedding shape ever changes.

Issue Conformance

The PR links #4563 and explicitly scopes itself to the log-flooding half. The implementation matches the description precisely: first failure + per-10s progress at error level with scrubbed identity, intermediate attempts at debug with full detail, and scrubbed reasons on both retry and exhaustion paths. The author also calls out a deliberately out-of-scope side observation (the budget being wall-clock-insensitive for fast failures), appropriately left unchanged. No scope creep. The companion issue is reasonably specified given the incident context described.

Previous Review Status

Not applicable — first review of this PR.

Review iteration: 1 | 2026-06-11

alco and others added 2 commits June 11, 2026 15:24

fix: harden exit-reason scrubbing and reformat

29a5fcb

Make scrub_events safe for improper lists and leave other structs untouched, and apply mix format to the retry logging call site. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

erik-the-implementer added the claude label Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Throttle replication event retry logging and scrub event payloads#4566

Throttle replication event retry logging and scrub event payloads#4566
erik-the-implementer wants to merge 2 commits into
mainfrom
fix/replication-retry-log-throttle

erik-the-implementer commented Jun 11, 2026

Uh oh!

codecov Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

erik-the-implementer commented Jun 11, 2026

Changes

Test

Uh oh!

codecov Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot commented Jun 11, 2026

Claude Code Review

Summary

What's Working Well

Issues Found

Critical (Must Fix)

Important (Should Fix)

Suggestions (Nice to Have)

Issue Conformance

Previous Review Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 11, 2026 •

edited

Loading