Don't crash ShapeLogCollector on introspection failures in the replication path#4565
Don't crash ShapeLogCollector on introspection failures in the replication path#4565erik-the-implementer wants to merge 2 commits into
Conversation
…eplication path
The replication hot path (Partitions, pk-column lookup, shape restore)
only handled {:error, :connection_not_available} from the inspector.
Every other legal inspector return — :table_not_found for dropped
tables, in-band DB errors like out-of-memory or statement_timeout, and
connection-class errors returned in-band by Postgrex — crashed the
ShapeLogCollector, took down the one_for_all shapes supervision tree
and put the replication client into its 10-minute noproc retry loop.
- Partitions.handle_relation/2: ignore :table_not_found, propagate
other errors instead of raising a CaseClauseError
- Partitions.add_shape/3: return {:error, reason} instead of raising
- pk_cols_of_relation: key changes on the full record when the table
was dropped before introspection (same fallback as PK-less tables)
- ShapeLogCollector: log and reply with the retryable
:connection_not_available error for unexpected introspection
failures instead of crashing
- shape restore: retry introspection in place while the connection
pool is unavailable instead of crashing with a MatchError
- DirectInspector: classify in-band connection-class query errors
(e.g. "ssl recv: closed") as :connection_not_available
Fixes part of #4563.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…red handle_relation Upstream #4560 split relation handling into handle_relation/publish_relation; the generic introspection-error clause now lives on the Partitions.handle_relation case in the new structure. Kept both our introspection-failure test and the new upstream relation tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
Claude Code ReviewSummaryThis PR hardens the replication hot path ( What's Working Well
Issues FoundCritical (Must Fix)None. Important (Should Fix)Permanent introspection errors become an unbounded paused-replication loop File: Both hot-path handlers map every non-connection This is a deliberate and defensible trade (alive-and-paused beats crash-looping the supervision tree), and you already flag the log-spam half as a follow-up. Worth calling out explicitly that, unlike the restore path (which has Suggestions (Nice to Have)
Issue ConformanceNo issue is formally linked via GitHub, but the PR body and changeset reference #4563 with a detailed breakdown of the Sentry crash signatures being addressed ( Previous Review StatusFirst review of this PR — no prior Claude review to reconcile. Review iteration: 1 | 2026-06-11 |
Fixes the crash family from #4563: the replication hot path (
Partitions, pk-column lookup, shape restore) only handled{:error, :connection_not_available}from the inspector, so every other legal inspector return crashed the ShapeLogCollector, took down the:one_for_allshapes supervision tree, and put the replication client into its 10-minute noproc retry loop.Changes
Partitions.handle_relation/2: handles:table_not_found(table dropped/renamed between the WAL record and introspection — nothing left to track, mirroring the existingadd_shapebehavior) and propagates other errors instead of crashing with aCaseClauseError.Partitions.add_shape/3: returns{:error, reason}instead of raising aRuntimeErroron unexpected introspection errors; the shape-registration path propagates the error to the caller.pk_cols_of_relation: when the table was dropped before introspection, changes are keyed on the full record (the same fallback used for tables without a primary key) instead of crashing inUtils.map_while_ok; handling of the dropped table is left to the affected shapes further down the stack.{:error, :connection_not_available}— the one error the replication client knows how to recover from — pausing replication until introspection succeeds instead of crashing the collector.MatchErrorin a restart loop. Skipping the shape was not an option: nothing re-registers restored shapes later, so a skipped shape would silently stop receiving updates. If the error persists past ~10s the restore still fails, but with a descriptive error.DirectInspector: connection-class errors returned in-band by Postgrex (e.g."ssl recv: closed") are classified as:connection_not_availablevia the existingElectric.DbConnectionErrorclassifier, so they take the established recovery path instead of leaking through as strings.Production context
In Electric Cloud these crashes account for most of the "retry budget" Sentry noise:
{:case_clause, :table_not_found},{:case_clause, {:error, "ERROR 53200 (out_of_memory) ..."}},query_canceled,"ssl recv: closed", the restoreMatchError, and — downstream of all of them — thousands of(EXIT) no processdispatch errors per incident while the supervision tree restarts. See #4563 for the breakdown.Notes for review
partitions_test,shape_log_collector_testand a newdirect_inspector_test).🤖 Generated with Claude Code