Handle endpos pipeline shutdown cleanly and reset sequences on follow by teknogeek0 · Pull Request #40 · planetscale/pgcopydb

teknogeek0 · 2026-06-22T14:17:09Z

Problem

In follow mode, the receive → transform → apply processes are connected by Unix pipes. When the apply process reaches endpos it exits and closes its read end of the pipe. Upstream processes still writing trailing messages then hit EPIPE and exit non-zero. Because the supervisor (follow_wait_subprocesses) ANDs every child's exit status, a migration that completed correctly at endpos was reported as a failure.
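The failure mode above can be reproduced in miniature. The following is a minimal Python sketch, not pgcopydb code: `write_after_reader_exit` and `overall_success` are hypothetical names, and the aggregation function only mimics the "AND every child's exit status" behavior described for follow_wait_subprocesses. It assumes a POSIX system.

```python
import errno
import os


def write_after_reader_exit() -> int:
    """Return the errno seen when writing to a pipe whose reader has gone away."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # stands in for the apply process exiting at endpos
    try:
        # Python ignores SIGPIPE by default, so the write fails with EPIPE
        # (raised as BrokenPipeError) instead of killing the process.
        os.write(write_fd, b"trailing message\n")
    except BrokenPipeError as exc:
        return exc.errno
    finally:
        os.close(write_fd)
    return 0


def overall_success(child_statuses: dict) -> bool:
    """Pre-fix style aggregation: succeed only if every child exited with 0."""
    return all(status == 0 for status in child_statuses.values())


if __name__ == "__main__":
    print(write_after_reader_exit() == errno.EPIPE)
    # apply finished cleanly, but the upstream writers died on EPIPE
    print(overall_success({"receive": 1, "transform": 1, "apply": 0}))
```

With this aggregation, a clean apply exit is not enough: any upstream process that dies on the broken pipe turns the whole run into a reported failure.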

A customer reported exactly this:

pgcopydb seems to have a race condition when it encounters the endpos sentinel whereby the replayer says "ok done", and quits. The json→sql transformer then tries to write to the pipe, EPIPE dies, and then pgcopydb thinks the whole thing failed.

The false failure had a second, quieter consequence: it skipped the end-of-migration sequence reset, so target sequences were left at their initial base-copy values (logical decoding does not replicate sequences). The same gap existed for the standalone pgcopydb follow command used to resume CDC after a crash — it never reset sequences at all.

Fix

Two layers of defense, since this involves customer data:

- Child side: transform and receive treat an EPIPE on the downstream pipe as a clean shutdown only when endpos has been durably reached for the last message they processed. In every other case (endpos unset, or not yet reached) a broken pipe is still a failure.

- Supervisor backstop: follow_wait_subprocesses declares overall success when the apply process exited cleanly and endpos has been durably applied (endpos <= replay_lsn), regardless of upstream teardown noise.

The apply process is authoritative — it exits cleanly only after durably applying through endpos and syncing replay_lsn. So the backstop cannot mask a genuine pre-endpos failure: if apply crashed, or endpos was not reached, success is left false and the failure propagates so the operator can resume. The two layers are belt-and-suspenders: the child-side gate handles the common case locally, and the supervisor gate guarantees a completed migration is never reported as failed even if an upstream process exits non-zero for another teardown reason.
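The two gates can be sketched as pure functions. This is an illustrative Python model, not the C implementation in this PR: `parse_lsn`, `epipe_is_clean_shutdown`, and `follow_succeeded` are hypothetical names, and "durably reached" is approximated here by a plain LSN comparison. The only PostgreSQL fact relied on is the textual LSN format (two hexadecimal numbers separated by a slash, high 32 bits first).

```python
from typing import Optional


def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '0/24E3F88' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def epipe_is_clean_shutdown(endpos: Optional[str], last_lsn: Optional[str]) -> bool:
    """Child-side gate: EPIPE is clean only if endpos is set and already reached."""
    if endpos is None or last_lsn is None:
        return False  # endpos unset: a broken pipe is still a failure
    return parse_lsn(last_lsn) >= parse_lsn(endpos)


def follow_succeeded(statuses: dict, endpos: Optional[str],
                     replay_lsn: Optional[str]) -> bool:
    """Supervisor backstop: apply is authoritative once endpos <= replay_lsn."""
    if all(status == 0 for status in statuses.values()):
        return True  # normal clean shutdown of every child
    if statuses.get("apply") != 0:
        return False  # apply crashed: never override
    if endpos is None or replay_lsn is None:
        return False  # nothing to gate the override on
    return parse_lsn(endpos) <= parse_lsn(replay_lsn)


if __name__ == "__main__":
    noisy = {"receive": 1, "transform": 1, "apply": 0}
    print(follow_succeeded(noisy, "0/2000", "0/2000"))  # endpos applied
    print(follow_succeeded(noisy, "0/2000", "0/1000"))  # endpos not reached
```

In this model the override can only turn a failure into a success when apply itself exited with 0 and the replayed position covers endpos, which is the property the gating argument above relies on.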

Sequences

follow_reset_sequences is now also run by the standalone pgcopydb follow command once endpos is durably reached. This mirrors what clone --follow already does at the end of its run, and makes a resumed CDC run (pgcopydb follow --resume) that catches up to endpos correctly update target sequences to current source values. The reset is gated on endpos being reached, so an interrupted continuous follow (no endpos, or stopped early by a signal) does not advance sequences ahead of the data actually applied.

Interaction with the reconnect/backoff flow

The downstream-EPIPE path is separate from the source-reconnect path: the existing exponential backoff loop fires only on source connection loss, while a broken downstream pipe is handled in its own branch. This change only affects how that downstream branch reports its outcome at endpos; it does not touch the reconnect window, backoff timing, or permissions-error handling.

Testing

Full CDC / follow / unit suites pass on PG18:

cdc-wal2json, cdc-test-decoding, follow-wal2json, follow-defer-indexes, follow-defer-validate-fks, cdc-endpos-between-transaction, endpos-in-multi-wal-txn, cdc-low-level, cdc-message-handling, follow-data-only, follow-9.6, follow-target-reconnect, follow-standby, cdc-filtering, unit — all green.

follow-target-reconnect in particular confirms the reconnect/backoff behavior is unchanged.

Note

The endpos shutdown race is timing-dependent (pipe buffer fill, data volume), so it does not reproduce deterministically in CI — every test run completes via the normal clean-shutdown path. The change is verified not to regress any existing behavior; the correctness of the endpos handling rests on the gating analysis above (apply is authoritative; the override is strictly gated on endpos <= replay_lsn).

In follow mode the receive, transform, and apply processes are connected by Unix pipes. When the apply process reaches endpos it exits and closes its read end of the pipe. Upstream processes that are still writing trailing messages then hit EPIPE and exit non-zero, and the supervisor ANDs every child's status, so a migration that completed correctly at endpos was reported as a failure.

This adds two layers of handling:

- Child side: transform and receive treat an EPIPE on the downstream pipe as a clean shutdown only when endpos has been durably reached for the last message they processed. In every other case (endpos unset, or not yet reached) a broken pipe is still a failure.

- Supervisor backstop: follow_wait_subprocesses declares overall success when the apply process exited cleanly and endpos has been durably applied (endpos <= replay_lsn), regardless of upstream teardown noise. The apply process is authoritative, so this cannot mask a genuine pre-endpos failure: if apply crashed or endpos was not reached, the failure still propagates.

The false failure also skipped the end-of-migration sequence reset. follow_reset_sequences is now also run by the standalone "pgcopydb follow" command once endpos is durably reached, so a resumed CDC run that catches up to endpos updates target sequences to current source values. Previously only "clone --follow" reset sequences, leaving resume-after-crash with stale sequences.

Adds a deterministic regression test for the sequence reset performed by the standalone `pgcopydb follow` command when it reaches endpos (the path used by resume-cdc helpers, which previously did not reset sequences). The test clones pagila, advances rental_rental_id_seq on the source by inserting rows, sets endpos, and runs `pgcopydb follow --resume`. Because CDC replays the inserts with OVERRIDING SYSTEM VALUE (explicit ids that do not advance the target sequence), the target sequence only catches up to the source if follow_reset_sequences runs at endpos. The test asserts the target sequence advanced from the snapshot value to match the source. Verified the test fails (target stuck at the snapshot value) when the reset is removed, and passes with it in place.
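The reasoning behind the test can be illustrated with a toy model. The sketch below is plain Python, not PostgreSQL and not the actual test: `ToySequence`, `ToyTable`, and `replay_and_maybe_reset` are hypothetical names, and the model only captures the one property the test depends on, namely that rows replayed with explicit ids do not advance the target sequence.

```python
class ToySequence:
    """Minimal stand-in for a sequence: nextval advances, setval overwrites."""

    def __init__(self) -> None:
        self.last_value = 0

    def nextval(self) -> int:
        self.last_value += 1
        return self.last_value

    def setval(self, value: int) -> None:
        self.last_value = value


class ToyTable:
    def __init__(self) -> None:
        self.seq = ToySequence()
        self.ids = []

    def insert_default(self) -> int:
        """Insert on the source: the id comes from the sequence."""
        new_id = self.seq.nextval()
        self.ids.append(new_id)
        return new_id

    def insert_explicit(self, row_id: int) -> None:
        """Replayed insert with an explicit id: the sequence is untouched."""
        self.ids.append(row_id)


def replay_and_maybe_reset(reset_at_endpos: bool):
    source, target = ToyTable(), ToyTable()
    for _ in range(3):  # rows present at snapshot time
        target.insert_explicit(source.insert_default())
    target.seq.setval(source.seq.last_value)  # base copy sets the sequence once
    for _ in range(2):  # rows inserted on the source afterwards, replayed by CDC
        target.insert_explicit(source.insert_default())
    if reset_at_endpos:  # models the sequence reset performed at endpos
        target.seq.setval(source.seq.last_value)
    return source.seq.last_value, target.seq.last_value


if __name__ == "__main__":
    print(replay_and_maybe_reset(False))  # (5, 3): target stuck at snapshot value
    print(replay_and_maybe_reset(True))   # (5, 5): target matches the source
```

The two outcomes correspond to the two runs described above: without the reset the target sequence stays at the snapshot value, and with it the target matches the source.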

teknogeek0 added 2 commits June 22, 2026 10:16

teknogeek0 merged commit 50f630c into main Jun 22, 2026
90 checks passed

teknogeek0 mentioned this pull request Jun 22, 2026

Handle endpos pipeline shutdown cleanly and reset sequences on follow teknogeek0/pgcopydb#66

Open

teknogeek0 merged 2 commits into main from fix/cdc-endpos-shutdown
