Throttle replication event retry logging and scrub event payloads#4566
Throttle replication event retry logging and scrub event payloads#4566erik-the-implementer wants to merge 2 commits into
Conversation
A failing replication event is retried every @event_retry_delay (50ms) for up to @max_event_retry_time (10 minutes) and every attempt was logged at error level — up to ~12000 error logs per incident, flooding log output and any error tracker fed from it (the cloud ships every Logger.error line to Sentry). Each message also embedded the full inspected event, copying potentially megabytes of customer row data into the logs on every 50ms attempt. Now the first failure and one progress update per @retry_log_interval (10s) are logged at error level, carrying the event identity (xid/lsn/relation) and the failure reason with any embedded replication events replaced by their one-line summary. The attempts in between log the full detail at debug level. The throttle window is wall-clock time tracked in the client state, because a fast failure such as :noproc consumes no measurable retry budget. Fixes part of #4563. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Make scrub_events safe for improper lists and leave other structs untouched, and apply mix format to the retry logging call site. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4566 +/- ##
===========================================
- Coverage 70.96% 58.36% -12.60%
===========================================
Files 83 370 +287
Lines 9856 40652 +30796
Branches 3124 11548 +8424
===========================================
+ Hits 6994 23728 +16734
- Misses 2844 16850 +14006
- Partials 18 74 +56
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Claude Code ReviewSummaryThis PR fixes the log-flooding half of #4563 by throttling replication event-retry logging and scrubbing event payloads out of error-level logs. The implementation is focused, well-commented, and well-tested. No blocking issues found — overall a clean, defensive change. What's Working Well
Issues FoundCritical (Must Fix)None. Important (Should Fix)None. Suggestions (Nice to Have)1. Exhaustion-path error log drops the stacktrace and File: The budget-exhaustion log for the dispatching path changed from 2. File:
Issue ConformanceThe PR links #4563 and explicitly scopes itself to the log-flooding half. The implementation matches the description precisely: first failure + per-10s progress at error level with scrubbed identity, intermediate attempts at debug with full detail, and scrubbed reasons on both retry and exhaustion paths. The author also calls out a deliberately out-of-scope side observation (the budget being wall-clock-insensitive for fast failures), appropriately left unchanged. No scope creep. The companion issue is reasonably specified given the incident context described. Previous Review StatusNot applicable — first review of this PR. Review iteration: 1 | 2026-06-11 |
Fixes the log-flooding half of #4563.
A failing replication event is retried every 50ms (
@event_retry_delay) for up to 10 minutes (@max_event_retry_time), and every attempt logged at error level — up to ~12,000 error logs for a single incident. In Electric Cloud each of those lines becomes a Sentry event (capture_log_messages: true), which is how one tenant produced 7,000+ Sentry events from a single incident. Each message also embedded the full inspected event — potentially megabytes of customer row data copied into the logs every 50ms.Changes
@retry_log_interval) log at error level; every attempt in between logs at debug level with the previous full detail.:noprocconsumes no measurable budget, so budget-based throttling would never fire.xid/LSN/relation instead of the full payload) and a scrubbed failure reason: anyTransactionFragment/Relationembedded in the exit reason (e.g. as arguments in a:noproctuple) is replaced with a one-line summary before inspecting, so row data never reaches error-level logs. The scrubber is safe for improper lists and leaves other structs untouched.Side observation (behavior unchanged in this PR): because the retry budget only counts time spent inside attempts, sub-millisecond failures like
:noprocbarely consume it, so the "10-minute" budget can run much longer in wall-clock terms for fast failures.Test
A new test drives the real noproc retry loop against a database and asserts that ~10 retry attempts produce exactly one error-level log line, that the line identifies the event by xid without containing the inserted row value, and that the event is still delivered once the handler comes back.
🤖 Generated with Claude Code