Don't crash ShapeLogCollector on introspection failures in the replication path by erik-the-implementer · Pull Request #4565 · electric-sql/electric

erik-the-implementer · 2026-06-11T13:43:55Z

Fixes the crash family from #4563: the replication hot path (Partitions, pk-column lookup, shape restore) only handled {:error, :connection_not_available} from the inspector, so every other legal inspector return crashed the ShapeLogCollector, took down the :one_for_all shapes supervision tree, and put the replication client into its 10-minute noproc retry loop.

Changes

Partitions.handle_relation/2: handles :table_not_found (table dropped/renamed between the WAL record and introspection — nothing left to track, mirroring the existing add_shape behavior) and propagates other errors instead of crashing with a CaseClauseError.
Partitions.add_shape/3: returns {:error, reason} instead of raising a RuntimeError on unexpected introspection errors; the shape-registration path propagates the error to the caller.
pk_cols_of_relation: when the table was dropped before introspection, changes are keyed on the full record (the same fallback used for tables without a primary key) instead of crashing in Utils.map_while_ok; handling of the dropped table is left to the affected shapes further down the stack.
ShapeLogCollector: unexpected introspection failures in the relation and transaction paths log a warning with the real reason and reply with {:error, :connection_not_available} — the one error the replication client knows how to recover from — pausing replication until introspection succeeds instead of crashing the collector.
Shape restore: retries introspection in place (100ms × 100) when the connection pool isn't available instead of crashing with a MatchError in a restart loop. Skipping the shape was not an option: nothing re-registers restored shapes later, so a skipped shape would silently stop receiving updates. If the error persists past ~10s the restore still fails, but with a descriptive error.
DirectInspector: connection-class errors returned in-band by Postgrex (e.g. "ssl recv: closed") are classified as :connection_not_available via the existing Electric.DbConnectionError classifier, so they take the established recovery path instead of leaking through as strings.

Production context

In Electric Cloud these crashes account for most of the "retry budget" Sentry noise: {:case_clause, :table_not_found}, {:case_clause, {:error, "ERROR 53200 (out_of_memory) ..."}}, query_canceled, "ssl recv: closed", the restore MatchError, and — downstream of all of them — thousands of (EXIT) no process dispatch errors per incident while the supervision tree restarts. See #4563 for the breakdown.

Notes for review

All new behavior is covered by tests written before the fixes (9 new tests across partitions_test, shape_log_collector_test and a new direct_inspector_test).
One known rough edge: a persistently failing introspection now produces one ShapeLogCollector warning per replication-client retry cycle (throttled only by the introspection query duration). Rate-limiting that warning could be a follow-up; the companion PR throttles the replication client's own retry logging.

🤖 Generated with Claude Code

…eplication path The replication hot path (Partitions, pk-column lookup, shape restore) only handled {:error, :connection_not_available} from the inspector. Every other legal inspector return — :table_not_found for dropped tables, in-band DB errors like out-of-memory or statement_timeout, and connection-class errors returned in-band by Postgrex — crashed the ShapeLogCollector, took down the one_for_all shapes supervision tree and put the replication client into its 10-minute noproc retry loop. - Partitions.handle_relation/2: ignore :table_not_found, propagate other errors instead of raising a CaseClauseError - Partitions.add_shape/3: return {:error, reason} instead of raising - pk_cols_of_relation: key changes on the full record when the table was dropped before introspection (same fallback as PK-less tables) - ShapeLogCollector: log and reply with the retryable :connection_not_available error for unexpected introspection failures instead of crashing - shape restore: retry introspection in place while the connection pool is unavailable instead of crashing with a MatchError - DirectInspector: classify in-band connection-class query errors (e.g. "ssl recv: closed") as :connection_not_available Fixes part of #4563. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…red handle_relation Upstream #4560 split relation handling into handle_relation/publish_relation; the generic introspection-error clause now lives on the Partitions.handle_relation case in the new structure. Kept both our introspection-failure test and the new upstream relation tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov · 2026-06-11T13:51:56Z

❌ 1 Tests Failed:

Tests completed	Failed	Passed	Skipped
4325	1	4324	55

View the top 1 failed test(s) by shortest run time

test/runtime-dsl.test.ts > N: wake primitives verification > N4: ctx.agent.run receives the wake payload and performs a second run on child completion

Stack Traces | 30.1s run time

Error: Timeout (30000ms) waiting for completed runs=2 on /idle-wake-parent-n3/wake-agent-1
[
  {
    "args": {},
    "entityType": "idle-wake-parent-n3",
    "operation": "insert",
    "type": "entity_created"
  },
  {
    "from": "/principal/system%3Aruntime-dsl-test",
    "operation": "insert",
    "payload": "spawn wake-agent-child-1",
    "type": "inbox"
  },
  {
    "key": "run-0",
    "operation": "insert",
    "status": "started",
    "type": "run"
  },
  {
    "key": "step-0",
    "operation": "insert",
    "status": "started",
    "stepNumber": 1,
    "type": "step"
  },
  {
    "key": "msg-0",
    "operation": "insert",
    "status": "streaming",
    "type": "text"
  },
  {
    "delta": "spawned:wake-agent-child-1:wake.type=inbox",
    "key": "msg-0:0",
    "operation": "insert",
    "runId": "run-0",
    "textId": "msg-0",
    "type": "text_delta"
  },
  {
    "key": "msg-0",
    "operation": "update",
    "status": "completed",
    "type": "text"
  },
  {
    "finishReason": "stop",
    "key": "step-0",
    "operation": "update",
    "status": "completed",
    "stepNumber": 1,
    "type": "step"
  },
  {
    "finishReason": "stop",
    "key": "run-0",
    "operation": "update",
    "status": "completed",
    "type": "run"
  },
  {
    "key": "child:idle-wake-child-n3:wake-agent-child-1",
    "manifest": {
      "entityType": "idle-wake-child-n3",
      "entityUrl": "/idle-wake-child-n3/wake-agent-child-1",
      "id": "wake-agent-child-1",
      "key": "child:idle-wake-child-n3:wake-agent-child-1",
      "kind": "child",
      "wake": {
        "includeResponse": true,
        "on": "runFinished"
      }
    },
    "operation": "insert",
    "type": "manifest"
  }
]
 ❯ waitForHistory test/runtime-dsl.ts:666:11
 ❯ test/runtime-dsl.ts:789:35
 ❯ test/runtime-dsl.test.ts:6315:29

To view more test analytics, go to the Test Analytics Dashboard
_{📋 Got 3 mins? Take this short survey to help us improve Test Analytics.}

claude · 2026-06-11T13:56:37Z

Claude Code Review

Summary

This PR hardens the replication hot path (Partitions, pk-column lookup, ShapeLogCollector, shape restore, DirectInspector) against the full set of inspector return values, replacing CaseClauseError/RuntimeError/MatchError crashes with graceful handling. It directly addresses the most common crash family in Electric's history — non-exhaustive pattern matching on introspection results — and is well-scoped, well-tested, and includes a changeset. Overall: strong, ready to merge with a couple of minor considerations.

What's Working Well

Exhaustive pattern matching against the inspector contract. I verified lib/electric/postgres/inspector.ex — every load_relation_oid/load_relation_info/load_column_info callback is specced as {:ok, ...} | :table_not_found | {:error, term()}. The new clauses in partitions.ex, pk_cols_of_relation, and the collector now cover all three shapes, so the previously-crashing :table_not_found and arbitrary {:error, _} paths are handled. This is exactly the "single most frequent crash cause" the project guidelines call out.
:table_not_found → key-on-full-record fallback in pk_cols_of_relation reuses the existing no-PK behavior rather than inventing new semantics — minimal blast radius.
Restore retry rationale is sound and documented. Skipping a restored shape would silently stop its updates (nothing re-registers it), so retrying-then-crashing-with-context is the right trade-off over silent data loss. The comment explains it well.
Error-contract broadening is safe at every caller. I traced the broadened {:error, term()} from Partitions.add_shape through the batched registration path (shape_log_collector.ex:454) to setup_effects.ex:37, which already has a generic {:error, error} catch-all. No caller relied on the narrow :connection_not_available-only contract.
DirectInspector.normalize_query_error/1 correctly routes in-band DBConnection.ConnectionError structs (e.g. "ssl recv: closed") through the existing Electric.DbConnectionError classifier so they take the established retry path instead of leaking as opaque strings.
Tests written against the contract, including a pure-async direct_inspector_test, process-stays-alive assertions, and ConstInspector for deterministic error injection. Changeset present and correctly scoped to @core/sync-service (patch).

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

Permanent introspection errors become an unbounded paused-replication loop

File: packages/sync-service/lib/electric/replication/shape_log_collector.ex:599, 707

Both hot-path handlers map every non-connection {:error, reason} to {:error, :connection_not_available}, which the replication client recovers from by pausing and redelivering the same event indefinitely. For a genuinely transient error (out-of-memory, statement_timeout, query_canceled) this is the right call. But for a permanent introspection failure, replication is now stuck forever — making no progress while emitting one Logger.warning per redelivery cycle.

This is a deliberate and defensible trade (alive-and-paused beats crash-looping the supervision tree), and you already flag the log-spam half as a follow-up. Worth calling out explicitly that, unlike the restore path (which has @restore_max_retries), the hot path has no upper bound — a stuck stream is only observable via the (unthrottled) warning. Consider, as follow-up: (a) throttling the warning, and (b) a metric/telemetry signal for "replication paused on introspection failure" so a permanently-stuck stream is alertable rather than just log-greppable.

Suggestions (Nice to Have)

restore_partitions_for_shape blocks the GenServer via Process.sleep (shape_log_collector.ex:344). This runs in handle_continue(:restore_shapes), so blocking is acceptable during startup, and it's bounded (~10s, logged once on attempt 1). No change needed — just flagging that if the DB is down at restart, the collector is unresponsive for up to ~10s before crashing and being restarted to try again. The bound and one-shot logging are the right calls.
Classification proxy (direct_inspector.ex:111): from_error(err).type != :unknown works because all DBConnection.ConnectionError-matching clauses are connection-class with retry_may_fix?: true. If a future non-retryable ConnectionError type is added, it would be silently treated as :connection_not_available. Keying on retry_may_fix? would be marginally more intention-revealing, but the current form is fine.
Restore test asserts liveness only (shape_log_collector_test.exs:147): the test confirms the collector survives flaky introspection but doesn't assert the shape's partition info was actually populated after the retry succeeds. A follow-up assertion on restored partition state would make the test prove the recovery, not just the survival.

Issue Conformance

No issue is formally linked via GitHub, but the PR body and changeset reference #4563 with a detailed breakdown of the Sentry crash signatures being addressed ({:case_clause, :table_not_found}, the out-of-memory case_clause, query_canceled, "ssl recv: closed", the restore MatchError, and downstream (EXIT) no process). The implementation maps cleanly onto each named signature. Recommend adding a Fixes #4563 linkage so the issue auto-closes.

Previous Review Status

First review of this PR — no prior Claude review to reconcile.

Review iteration: 1 | 2026-06-11

erik-the-implementer added the claude label Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't crash ShapeLogCollector on introspection failures in the replication path#4565

Don't crash ShapeLogCollector on introspection failures in the replication path#4565
erik-the-implementer wants to merge 2 commits into
mainfrom
fix/replication-introspection-error-handling

erik-the-implementer commented Jun 11, 2026

Uh oh!

codecov Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

erik-the-implementer commented Jun 11, 2026

Changes

Production context

Notes for review

Uh oh!

codecov Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

❌ 1 Tests Failed:

Uh oh!

claude Bot commented Jun 11, 2026

Claude Code Review

Summary

What's Working Well

Issues Found

Critical (Must Fix)

Important (Should Fix)

Suggestions (Nice to Have)

Issue Conformance

Previous Review Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 11, 2026 •

edited

Loading