Add retry with backoff to follow pipeline on target connection failure by teknogeek0 · Pull Request #35 · planetscale/pgcopydb

teknogeek0 · 2026-06-01T15:25:43Z

Summary

Adds retry-with-backoff to the --follow (CDC) pipeline so a dropped target connection no longer fails the migration. When the apply/replay side loses its connection to the target mid-follow, it now reconnects with exponential backoff and resumes streaming, instead of terminating the clone.
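
For readers unfamiliar with the pattern, here is a minimal sketch of reconnecting with capped exponential backoff. It is not the code from this PR: `RetryPolicy`, `backoff_delay_ms`, `connect_with_backoff`, and the demo helpers are hypothetical names, and the attempt count and delays are assumed values. The connect and sleep steps are passed in as callbacks so the loop can be exercised without a database.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy; the values pgcopydb actually uses are not stated here. */
typedef struct RetryPolicy
{
	int maxAttempts;          /* total attempts, including the first one */
	unsigned int baseDelayMs; /* delay after the first failure */
	unsigned int maxDelayMs;  /* cap applied to every delay */
} RetryPolicy;

typedef bool (*ConnectFn)(void *ctx);     /* returns true once connected */
typedef void (*SleepFn)(unsigned int ms); /* injected so tests need not wait */

/* Delay after failed attempt number `attempt` (0-based): base * 2^attempt, capped. */
static unsigned int
backoff_delay_ms(const RetryPolicy *policy, int attempt)
{
	unsigned int delay = policy->baseDelayMs;

	for (int i = 0; i < attempt; i++)
	{
		/* stop doubling before the product could exceed the cap */
		if (delay > policy->maxDelayMs / 2)
		{
			return policy->maxDelayMs;
		}
		delay *= 2;
	}
	return delay < policy->maxDelayMs ? delay : policy->maxDelayMs;
}

/* Try to connect up to maxAttempts times, sleeping between failed attempts. */
static bool
connect_with_backoff(const RetryPolicy *policy, ConnectFn connect,
					 void *ctx, SleepFn sleepMs)
{
	assert(policy != NULL && connect != NULL && sleepMs != NULL);

	for (int attempt = 0; attempt < policy->maxAttempts; attempt++)
	{
		if (connect(ctx))
		{
			return true;
		}
		/* no point sleeping after the final failure */
		if (attempt + 1 < policy->maxAttempts)
		{
			sleepMs(backoff_delay_ms(policy, attempt));
		}
	}
	return false;
}

/* Demo stand-ins: a target that refuses the first N attempts, and a
 * sleep function that only records how long it was asked to wait. */
typedef struct FakeTarget
{
	int failuresLeft;
	int attempts;
} FakeTarget;

static bool
fake_connect(void *ctx)
{
	FakeTarget *target = (FakeTarget *) ctx;

	target->attempts++;
	if (target->failuresLeft > 0)
	{
		target->failuresLeft--;
		return false;
	}
	return true;
}

static unsigned int totalSleptMs = 0;

static void
record_sleep(unsigned int ms)
{
	totalSleptMs += ms;
}
```

With a policy of 5 attempts, a 100 ms base, and a 1000 ms cap, a target that refuses two attempts is reached on the third, after waits of 100 ms and 200 ms. A real caller would pass a function that reopens the target connection and a sleep that actually blocks.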

Includes several correctness fixes uncovered while building the reconnect path:

Fix a pipeline spin loop and a stale replication-origin session on reconnect.
Fix replication-origin holder termination (parse the holder PID from the error message).
Code layout: connection-retry helpers live in pgsql.c/pgsql.h; the reconnect/backoff loop is in ld_replay.c; supporting changes are in ld_stream.c, ld_apply.c, and follow.c.
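
The PID-parsing fix mentioned above can be illustrated with a short sketch. It assumes the server error text contains the phrase "for PID " followed by the process ID of the session holding the replication origin. The exact wording varies across PostgreSQL versions, so the sketch matches only that phrase. `parse_holder_pid` is a hypothetical name, not the helper added in this PR.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the PID that follows "for PID " in an error message.
 * Returns false when the marker is missing or no valid positive
 * integer follows it; *pid is only written on success.
 * The message format is an assumption, see the note above.
 */
static bool
parse_holder_pid(const char *message, int *pid)
{
	static const char marker[] = "for PID ";

	assert(pid != NULL);

	const char *found = message != NULL ? strstr(message, marker) : NULL;

	if (found == NULL)
	{
		return false;
	}

	const char *digits = found + sizeof(marker) - 1;
	char *end = NULL;

	errno = 0;
	long value = strtol(digits, &end, 10);

	/* reject "no digits", overflow, and non-positive values */
	if (end == digits || errno == ERANGE || value <= 0 || value > INT_MAX)
	{
		return false;
	}

	*pid = (int) value;
	return true;
}
```

Once the PID is known, the stale holder could be ended with PostgreSQL's `pg_terminate_backend(pid)`. The PR text does not say which mechanism it uses, so treat that as one plausible option.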

Test plan

New follow-target-reconnect suite: drives a follow migration, severs the target connection mid-stream via a netblock container, and asserts the pipeline reconnects and completes. Added to the Tier-3 CI matrix.
PGVERSION=16 make build — clean (compiles against the current FK/defer-index CDC code).
PGVERSION=16 make tests — full suite green on PG16, including follow-target-reconnect.

Note (out of scope)

While validating, I observed that blob-snapshot-release has a pre-existing, intermittent snapshot-release race on main: the source snapshot can be released before a COPY worker runs SET TRANSACTION SNAPSHOT, which produces ERROR: invalid snapshot identifier. It failed roughly 1 run in 5 on clean main without this change, so it is unrelated to this PR. Flagging it as a separate issue to track.

Forward-port of dev follow-target-reconnect work: when the target connection drops mid-follow, the apply/replay pipeline now retries with exponential backoff instead of failing the migration. Includes fixes for the pipeline spin loop, stale replication-origin session on reconnect, and replication-origin holder termination, plus the follow-target-reconnect test suite (simulates a target outage via a netblock container). timescaledb stays disabled in CI (upstream failure).

The Docker workflow ran on every PR and every push to main, rebuilding (and on main, pushing) the container image. The test workflow already builds its own image per job, so the PR/main builds are redundant CI cost — and they were failing intermittently on GitHub Actions Cache 504s. Restrict the workflow to version-tag pushes (the release flow) plus manual dispatch.

teknogeek0 added 2 commits June 1, 2026 10:49

teknogeek0 merged commit e073ebc into main Jun 1, 2026
81 checks passed

teknogeek0 deleted the fix/follow-target-reconnect branch June 1, 2026 19:22

teknogeek0 mentioned this pull request Jun 8, 2026 in Release v0.19.0 #38 (merged, 2 tasks)

teknogeek0 merged 2 commits into main from fix/follow-target-reconnect
