Skip to content

Throttle replication event retry logging and scrub event payloads#4566

Draft
erik-the-implementer wants to merge 2 commits into
mainfrom
fix/replication-retry-log-throttle
Draft

Throttle replication event retry logging and scrub event payloads#4566
erik-the-implementer wants to merge 2 commits into
mainfrom
fix/replication-retry-log-throttle

Conversation

@erik-the-implementer

Copy link
Copy Markdown
Contributor

Fixes the log-flooding half of #4563.

A failing replication event is retried every 50ms (@event_retry_delay) for up to 10 minutes (@max_event_retry_time), and every attempt logged at error level — up to ~12,000 error logs for a single incident. In Electric Cloud each of those lines becomes a Sentry event (capture_log_messages: true), which is how one tenant produced 7,000+ Sentry events from a single incident. Each message also embedded the full inspected event — potentially megabytes of customer row data copied into the logs every 50ms.

Changes

  • The first failure and one progress update per 10s (@retry_log_interval) log at error level; every attempt in between logs at debug level with the previous full detail.
  • The throttle window is wall-clock time tracked in the client state (reset on successful delivery), not retry budget: a fast failure like :noproc consumes no measurable budget, so budget-based throttling would never fire.
  • Error-level messages carry the event identity (xid/LSN/relation instead of the full payload) and a scrubbed failure reason: any TransactionFragment/Relation embedded in the exit reason (e.g. as arguments in a :noproc tuple) is replaced with a one-line summary before inspecting, so row data never reaches error-level logs. The scrubber is safe for improper lists and leaves other structs untouched.
  • The budget-exhaustion messages get the same identity + scrubbed-reason treatment (the subsequent raise/exit is unchanged).

Side observation (behavior unchanged in this PR): because the retry budget only counts time spent inside attempts, sub-millisecond failures like :noproc barely consume it, so the "10-minute" budget can run much longer in wall-clock terms for fast failures.

Test

A new test drives the real noproc retry loop against a database and asserts that ~10 retry attempts produce exactly one error-level log line, that the line identifies the event by xid without containing the inserted row value, and that the event is still delivered once the handler comes back.

🤖 Generated with Claude Code

alco and others added 2 commits June 11, 2026 15:24
A failing replication event is retried every @event_retry_delay (50ms)
for up to @max_event_retry_time (10 minutes) and every attempt was
logged at error level — up to ~12000 error logs per incident, flooding
log output and any error tracker fed from it (the cloud ships every
Logger.error line to Sentry). Each message also embedded the full
inspected event, copying potentially megabytes of customer row data
into the logs on every 50ms attempt.

Now the first failure and one progress update per @retry_log_interval
(10s) are logged at error level, carrying the event identity
(xid/lsn/relation) and the failure reason with any embedded replication
events replaced by their one-line summary. The attempts in between log
the full detail at debug level. The throttle window is wall-clock time
tracked in the client state, because a fast failure such as :noproc
consumes no measurable retry budget.

Fixes part of #4563.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Make scrub_events safe for improper lists and leave other structs untouched, and apply mix format to the retry logging call site.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.36%. Comparing base (1219cb4) to head (29a5fcb).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4566       +/-   ##
===========================================
- Coverage   70.96%   58.36%   -12.60%     
===========================================
  Files          83      370      +287     
  Lines        9856    40652    +30796     
  Branches     3124    11548     +8424     
===========================================
+ Hits         6994    23728    +16734     
- Misses       2844    16850    +14006     
- Partials       18       74       +56     
Flag Coverage Δ
packages/agents 71.37% <ø> (ø)
packages/agents-mcp 77.54% <ø> (?)
packages/agents-mobile 75.49% <ø> (?)
packages/agents-runtime 82.45% <ø> (?)
packages/agents-server 75.06% <ø> (+0.22%) ⬆️
packages/agents-server-ui 6.25% <ø> (?)
packages/electric-ax 46.42% <ø> (ø)
packages/experimental 87.73% <ø> (?)
packages/react-hooks 86.48% <ø> (?)
packages/start 82.83% <ø> (?)
packages/typescript-client 91.83% <ø> (?)
packages/y-electric 56.05% <ø> (?)
typescript 58.36% <ø> (-12.60%) ⬇️
unit-tests 58.36% <ø> (-12.60%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR fixes the log-flooding half of #4563 by throttling replication event-retry logging and scrubbing event payloads out of error-level logs. The implementation is focused, well-commented, and well-tested. No blocking issues found — overall a clean, defensive change.

What's Working Well

  • Right throttle dimension. Tracking the throttle window in wall-clock time (last_retry_error_log) rather than retry budget correctly handles the :noproc fast-failure case the PR is built around, where budget-based throttling would essentially never fire. The reasoning is captured in the comment at replication_client.ex:570-582.
  • Defense in depth on the scrubber. Even when scrub_events/1 misses an embedding shape, inspect_scrubbed/1 caps output with limit: 100, printable_limit: 2048 (replication_client.ex:615-616), so a missed event can't dump megabytes of row data — it's bounded to a few KB. Good belt-and-suspenders.
  • Improper-list safety. The head/tail recursion in scrub_events/1 (:628) correctly handles improper lists that can appear in arbitrary exit reasons, and the %_{} clause leaves unrelated structs untouched rather than crashing.
  • Lazy debug logging. Using Logger.debug(fn -> ... end) (:595) avoids the expensive Exception.format/inspect work entirely when debug is disabled — the common production case.
  • Solid test. replication_client_test.exs:254 drives the real noproc retry loop, asserts exactly one error line over ~10 attempts, confirms the row value ("a private row value") is absent while xid= is present, and verifies the event is still delivered once the handler recovers. The 500ms window vs. 10s throttle interval makes the "exactly one" assertion deterministic.
  • Typespec for the new last_retry_error_log field updated (:84); changeset present with the correct @core/sync-service package name.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

1. Exhaustion-path error log drops the stacktrace and kind on the dispatching path

File: packages/sync-service/lib/electric/postgres/replication_client.ex:560-563

The budget-exhaustion log for the dispatching path changed from Exception.format(kind, reason, __STACKTRACE__) to describe_event(event) <> inspect_scrubbed(reason). This is correct for scrubbing, but it also drops the kind (:error/:exit/:throw) and the stacktrace from that error-level message — diagnostics that were previously present. In production (debug off) that per-attempt detail is no longer captured anywhere except the subsequent :erlang.raise/3, which does preserve the trace via the crash report. So the information isn't fully lost, but operators reading the "Exhausted retry budget" line alone now get less. If the crash report reliably captures it this is fine — worth a conscious confirmation rather than an accident.

2. scrub_events/1 doesn't recurse into maps or into non-event struct fields

File: packages/sync-service/lib/electric/postgres/replication_client.ex:619-629

scrub_events/1 recurses through tuples and lists, but the final scrub_events(other) clause returns maps as-is and scrub_events(%_{}) returns other structs without descending into their fields. An event embedded as a map value, or nested inside another struct's field (e.g. a custom exception struct carrying the event), would not be replaced by its one-line summary. In practice the targeted :noproc/$gen_call exit reasons embed the event in tuple/list arg shapes, which are handled, and the inspect limits bound any leak to ~2KB regardless — so this is a minor completeness note, not a correctness bug. Flagging it only in case the embedding shape ever changes.

Issue Conformance

The PR links #4563 and explicitly scopes itself to the log-flooding half. The implementation matches the description precisely: first failure + per-10s progress at error level with scrubbed identity, intermediate attempts at debug with full detail, and scrubbed reasons on both retry and exhaustion paths. The author also calls out a deliberately out-of-scope side observation (the budget being wall-clock-insensitive for fast failures), appropriately left unchanged. No scope creep. The companion issue is reasonably specified given the incident context described.

Previous Review Status

Not applicable — first review of this PR.


Review iteration: 1 | 2026-06-11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants