Skip to content

Don't crash ShapeLogCollector on introspection failures in the replication path#4565

Draft
erik-the-implementer wants to merge 2 commits into
mainfrom
fix/replication-introspection-error-handling
Draft

Don't crash ShapeLogCollector on introspection failures in the replication path#4565
erik-the-implementer wants to merge 2 commits into
mainfrom
fix/replication-introspection-error-handling

Conversation

@erik-the-implementer

Copy link
Copy Markdown
Contributor

Fixes the crash family from #4563: the replication hot path (Partitions, pk-column lookup, shape restore) only handled {:error, :connection_not_available} from the inspector, so every other legal inspector return crashed the ShapeLogCollector, took down the :one_for_all shapes supervision tree, and put the replication client into its 10-minute noproc retry loop.

Changes

  • Partitions.handle_relation/2: handles :table_not_found (table dropped/renamed between the WAL record and introspection — nothing left to track, mirroring the existing add_shape behavior) and propagates other errors instead of crashing with a CaseClauseError.
  • Partitions.add_shape/3: returns {:error, reason} instead of raising a RuntimeError on unexpected introspection errors; the shape-registration path propagates the error to the caller.
  • pk_cols_of_relation: when the table was dropped before introspection, changes are keyed on the full record (the same fallback used for tables without a primary key) instead of crashing in Utils.map_while_ok; handling of the dropped table is left to the affected shapes further down the stack.
  • ShapeLogCollector: unexpected introspection failures in the relation and transaction paths log a warning with the real reason and reply with {:error, :connection_not_available} — the one error the replication client knows how to recover from — pausing replication until introspection succeeds instead of crashing the collector.
  • Shape restore: retries introspection in place (100ms × 100) when the connection pool isn't available instead of crashing with a MatchError in a restart loop. Skipping the shape was not an option: nothing re-registers restored shapes later, so a skipped shape would silently stop receiving updates. If the error persists past ~10s the restore still fails, but with a descriptive error.
  • DirectInspector: connection-class errors returned in-band by Postgrex (e.g. "ssl recv: closed") are classified as :connection_not_available via the existing Electric.DbConnectionError classifier, so they take the established recovery path instead of leaking through as strings.

Production context

In Electric Cloud these crashes account for most of the "retry budget" Sentry noise: {:case_clause, :table_not_found}, {:case_clause, {:error, "ERROR 53200 (out_of_memory) ..."}}, query_canceled, "ssl recv: closed", the restore MatchError, and — downstream of all of them — thousands of (EXIT) no process dispatch errors per incident while the supervision tree restarts. See #4563 for the breakdown.

Notes for review

  • All new behavior is covered by tests written before the fixes (9 new tests across partitions_test, shape_log_collector_test and a new direct_inspector_test).
  • One known rough edge: a persistently failing introspection now produces one ShapeLogCollector warning per replication-client retry cycle (throttled only by the introspection query duration). Rate-limiting that warning could be a follow-up; the companion PR throttles the replication client's own retry logging.

🤖 Generated with Claude Code

…eplication path

The replication hot path (Partitions, pk-column lookup, shape restore)
only handled {:error, :connection_not_available} from the inspector.
Every other legal inspector return — :table_not_found for dropped
tables, in-band DB errors like out-of-memory or statement_timeout, and
connection-class errors returned in-band by Postgrex — crashed the
ShapeLogCollector, took down the one_for_all shapes supervision tree
and put the replication client into its 10-minute noproc retry loop.

- Partitions.handle_relation/2: ignore :table_not_found, propagate
  other errors instead of raising a CaseClauseError
- Partitions.add_shape/3: return {:error, reason} instead of raising
- pk_cols_of_relation: key changes on the full record when the table
  was dropped before introspection (same fallback as PK-less tables)
- ShapeLogCollector: log and reply with the retryable
  :connection_not_available error for unexpected introspection
  failures instead of crashing
- shape restore: retry introspection in place while the connection
  pool is unavailable instead of crashing with a MatchError
- DirectInspector: classify in-band connection-class query errors
  (e.g. "ssl recv: closed") as :connection_not_available

Fixes part of #4563.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…red handle_relation

Upstream #4560 split relation handling into handle_relation/publish_relation; the generic introspection-error clause now lives on the Partitions.handle_relation case in the new structure. Kept both our introspection-failure test and the new upstream relation tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
4325 1 4324 55
View the top 1 failed test(s) by shortest run time
test/runtime-dsl.test.ts > N: wake primitives verification > N4: ctx.agent.run receives the wake payload and performs a second run on child completion
Stack Traces | 30.1s run time
Error: Timeout (30000ms) waiting for completed runs=2 on /idle-wake-parent-n3/wake-agent-1
[
  {
    "args": {},
    "entityType": "idle-wake-parent-n3",
    "operation": "insert",
    "type": "entity_created"
  },
  {
    "from": "/principal/system%3Aruntime-dsl-test",
    "operation": "insert",
    "payload": "spawn wake-agent-child-1",
    "type": "inbox"
  },
  {
    "key": "run-0",
    "operation": "insert",
    "status": "started",
    "type": "run"
  },
  {
    "key": "step-0",
    "operation": "insert",
    "status": "started",
    "stepNumber": 1,
    "type": "step"
  },
  {
    "key": "msg-0",
    "operation": "insert",
    "status": "streaming",
    "type": "text"
  },
  {
    "delta": "spawned:wake-agent-child-1:wake.type=inbox",
    "key": "msg-0:0",
    "operation": "insert",
    "runId": "run-0",
    "textId": "msg-0",
    "type": "text_delta"
  },
  {
    "key": "msg-0",
    "operation": "update",
    "status": "completed",
    "type": "text"
  },
  {
    "finishReason": "stop",
    "key": "step-0",
    "operation": "update",
    "status": "completed",
    "stepNumber": 1,
    "type": "step"
  },
  {
    "finishReason": "stop",
    "key": "run-0",
    "operation": "update",
    "status": "completed",
    "type": "run"
  },
  {
    "key": "child:idle-wake-child-n3:wake-agent-child-1",
    "manifest": {
      "entityType": "idle-wake-child-n3",
      "entityUrl": "/idle-wake-child-n3/wake-agent-child-1",
      "id": "wake-agent-child-1",
      "key": "child:idle-wake-child-n3:wake-agent-child-1",
      "kind": "child",
      "wake": {
        "includeResponse": true,
        "on": "runFinished"
      }
    },
    "operation": "insert",
    "type": "manifest"
  }
]
 ❯ waitForHistory test/runtime-dsl.ts:666:11
 ❯ test/runtime-dsl.ts:789:35
 ❯ test/runtime-dsl.test.ts:6315:29

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR hardens the replication hot path (Partitions, pk-column lookup, ShapeLogCollector, shape restore, DirectInspector) against the full set of inspector return values, replacing CaseClauseError/RuntimeError/MatchError crashes with graceful handling. It directly addresses the most common crash family in Electric's history — non-exhaustive pattern matching on introspection results — and is well-scoped, well-tested, and includes a changeset. Overall: strong, ready to merge with a couple of minor considerations.

What's Working Well

  • Exhaustive pattern matching against the inspector contract. I verified lib/electric/postgres/inspector.ex — every load_relation_oid/load_relation_info/load_column_info callback is specced as {:ok, ...} | :table_not_found | {:error, term()}. The new clauses in partitions.ex, pk_cols_of_relation, and the collector now cover all three shapes, so the previously-crashing :table_not_found and arbitrary {:error, _} paths are handled. This is exactly the "single most frequent crash cause" the project guidelines call out.
  • :table_not_found → key-on-full-record fallback in pk_cols_of_relation reuses the existing no-PK behavior rather than inventing new semantics — minimal blast radius.
  • Restore retry rationale is sound and documented. Skipping a restored shape would silently stop its updates (nothing re-registers it), so retrying-then-crashing-with-context is the right trade-off over silent data loss. The comment explains it well.
  • Error-contract broadening is safe at every caller. I traced the broadened {:error, term()} from Partitions.add_shape through the batched registration path (shape_log_collector.ex:454) to setup_effects.ex:37, which already has a generic {:error, error} catch-all. No caller relied on the narrow :connection_not_available-only contract.
  • DirectInspector.normalize_query_error/1 correctly routes in-band DBConnection.ConnectionError structs (e.g. "ssl recv: closed") through the existing Electric.DbConnectionError classifier so they take the established retry path instead of leaking as opaque strings.
  • Tests written against the contract, including a pure-async direct_inspector_test, process-stays-alive assertions, and ConstInspector for deterministic error injection. Changeset present and correctly scoped to @core/sync-service (patch).

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

Permanent introspection errors become an unbounded paused-replication loop

File: packages/sync-service/lib/electric/replication/shape_log_collector.ex:599, 707

Both hot-path handlers map every non-connection {:error, reason} to {:error, :connection_not_available}, which the replication client recovers from by pausing and redelivering the same event indefinitely. For a genuinely transient error (out-of-memory, statement_timeout, query_canceled) this is the right call. But for a permanent introspection failure, replication is now stuck forever — making no progress while emitting one Logger.warning per redelivery cycle.

This is a deliberate and defensible trade (alive-and-paused beats crash-looping the supervision tree), and you already flag the log-spam half as a follow-up. Worth calling out explicitly that, unlike the restore path (which has @restore_max_retries), the hot path has no upper bound — a stuck stream is only observable via the (unthrottled) warning. Consider, as follow-up: (a) throttling the warning, and (b) a metric/telemetry signal for "replication paused on introspection failure" so a permanently-stuck stream is alertable rather than just log-greppable.

Suggestions (Nice to Have)

  • restore_partitions_for_shape blocks the GenServer via Process.sleep (shape_log_collector.ex:344). This runs in handle_continue(:restore_shapes), so blocking is acceptable during startup, and it's bounded (~10s, logged once on attempt 1). No change needed — just flagging that if the DB is down at restart, the collector is unresponsive for up to ~10s before crashing and being restarted to try again. The bound and one-shot logging are the right calls.
  • Classification proxy (direct_inspector.ex:111): from_error(err).type != :unknown works because all DBConnection.ConnectionError-matching clauses are connection-class with retry_may_fix?: true. If a future non-retryable ConnectionError type is added, it would be silently treated as :connection_not_available. Keying on retry_may_fix? would be marginally more intention-revealing, but the current form is fine.
  • Restore test asserts liveness only (shape_log_collector_test.exs:147): the test confirms the collector survives flaky introspection but doesn't assert the shape's partition info was actually populated after the retry succeeds. A follow-up assertion on restored partition state would make the test prove the recovery, not just the survival.

Issue Conformance

No issue is formally linked via GitHub, but the PR body and changeset reference #4563 with a detailed breakdown of the Sentry crash signatures being addressed ({:case_clause, :table_not_found}, the out-of-memory case_clause, query_canceled, "ssl recv: closed", the restore MatchError, and downstream (EXIT) no process). The implementation maps cleanly onto each named signature. Recommend adding a Fixes #4563 linkage so the issue auto-closes.

Previous Review Status

First review of this PR — no prior Claude review to reconcile.


Review iteration: 1 | 2026-06-11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants