Add retry with backoff to follow pipeline on target connection failure #35

Merged: teknogeek0 merged 2 commits into main from fix/follow-target-reconnect on Jun 1, 2026

Conversation

@teknogeek0

Collaborator

Summary

Adds retry-with-backoff to the --follow (CDC) pipeline so a dropped target connection no longer fails the migration. When the apply/replay side loses its connection to the target mid-follow, it now reconnects with exponential backoff and resumes streaming, instead of terminating the clone.
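As a rough illustration of the reconnect behaviour described above, the following is a minimal sketch of a capped exponential-backoff loop. It is not the code from this PR: `RetryPolicy`, `connect_with_backoff`, `backoff_delay_ms`, the callback types, and the `FakeTarget` test double are all hypothetical names, and the delay values are arbitrary. The connect and sleep steps are injected as callbacks so the loop can be exercised without a real target database.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical retry policy; field names and values are illustrative only. */
typedef struct RetryPolicy
{
	int maxAttempts;        /* total connection attempts before giving up */
	unsigned baseDelayMs;   /* delay after the first failed attempt */
	unsigned maxDelayMs;    /* upper bound on any single delay */
} RetryPolicy;

typedef bool (*ConnectFn)(void *context);
typedef void (*SleepFn)(unsigned delayMs, void *context);

/* Delay after failed attempt number `attempt` (0-based): base * 2^attempt, capped. */
static unsigned
backoff_delay_ms(const RetryPolicy *policy, int attempt)
{
	unsigned delay = policy->baseDelayMs;

	for (int i = 0; i < attempt; i++)
	{
		/* stop doubling before exceeding the cap (also avoids overflow) */
		if (delay > policy->maxDelayMs / 2)
		{
			return policy->maxDelayMs;
		}
		delay *= 2;
	}
	return delay < policy->maxDelayMs ? delay : policy->maxDelayMs;
}

/* Try to connect up to maxAttempts times, sleeping between failed attempts. */
static bool
connect_with_backoff(const RetryPolicy *policy,
					 ConnectFn try_connect,
					 SleepFn sleep_ms,
					 void *context)
{
	assert(policy != NULL && try_connect != NULL && sleep_ms != NULL);

	for (int attempt = 0; attempt < policy->maxAttempts; attempt++)
	{
		if (try_connect(context))
		{
			return true;    /* connected: the caller can resume streaming */
		}

		/* no point sleeping after the final attempt */
		if (attempt + 1 < policy->maxAttempts)
		{
			sleep_ms(backoff_delay_ms(policy, attempt), context);
		}
	}
	return false;           /* retries exhausted: the caller decides how to fail */
}

/* Test double: a target that refuses the first `failuresLeft` connections. */
typedef struct FakeTarget
{
	int failuresLeft;
	int attempts;
	unsigned sleptMs;
} FakeTarget;

static bool
fake_connect(void *context)
{
	FakeTarget *target = context;

	target->attempts++;
	if (target->failuresLeft > 0)
	{
		target->failuresLeft--;
		return false;
	}
	return true;
}

/* Records the requested delay instead of actually sleeping. */
static void
fake_sleep(unsigned delayMs, void *context)
{
	FakeTarget *target = context;

	target->sleptMs += delayMs;
}
```

Capping the delay keeps a long outage from pushing the wait between attempts arbitrarily high, and bounding the attempt count means a target that never comes back still ends in a clear failure rather than an endless loop.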

Includes several correctness fixes uncovered while building the reconnect path:

  • Fix a pipeline spin loop and a stale replication-origin session on reconnect.
  • Fix replication-origin holder termination (parse the holder PID from the error message).

Code layout: connection-retry helpers in pgsql.c/pgsql.h; reconnect/backoff loop in ld_replay.c; supporting changes in ld_stream.c, ld_apply.c, follow.c.
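To make the holder-termination fix more concrete, here is a hedged sketch of extracting a PID from an error message. It assumes the server's message contains the phrase "is already active for PID <n>"; the exact wording is an assumption and may differ between PostgreSQL versions. The function name `parse_origin_holder_pid` is hypothetical and is not taken from this PR.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the PID of the session holding a replication origin from an error
 * message. The marker text below is an assumption about the message wording.
 * Returns true and sets *pid on success; returns false if no PID is found.
 */
static bool
parse_origin_holder_pid(const char *errorMessage, int *pid)
{
	static const char marker[] = "is already active for PID ";
	const char *start;
	char *end = NULL;
	long value;

	assert(pid != NULL);

	if (errorMessage == NULL)
	{
		return false;
	}

	start = strstr(errorMessage, marker);
	if (start == NULL)
	{
		return false;       /* some other error: nothing to parse */
	}
	start += strlen(marker);

	errno = 0;
	value = strtol(start, &end, 10);

	/* reject missing digits, overflow, and non-positive or oversized PIDs */
	if (end == start || errno == ERANGE || value <= 0 || value > INT_MAX)
	{
		return false;
	}

	*pid = (int) value;
	return true;
}
```

A caller could then pass the parsed PID to something like `pg_terminate_backend()` on the target before retrying the origin setup; whether the PR does exactly that is not stated here.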

Test plan

  • New follow-target-reconnect suite: drives a follow migration, severs the target connection mid-stream via a netblock container, and asserts the pipeline reconnects and completes. Added to the Tier-3 CI matrix.
  • PGVERSION=16 make build — clean (compiles against the current FK/defer-index CDC code).
  • PGVERSION=16 make tests — full suite green on PG16, including follow-target-reconnect.

Note (out of scope)

While validating, observed that blob-snapshot-release has a pre-existing, intermittent snapshot-release race on main (the source snapshot can be released before a COPY worker runs SET TRANSACTION SNAPSHOT, giving ERROR: invalid snapshot identifier). Measured ~1-in-5 on clean main without this change, so it is unrelated to this PR — flagging it as a separate issue to track.

Forward-port of dev follow-target-reconnect work: when the target
connection drops mid-follow, the apply/replay pipeline now retries with
exponential backoff instead of failing the migration. Includes fixes for
the pipeline spin loop, stale replication-origin session on reconnect,
and replication-origin holder termination, plus the follow-target-reconnect
test suite (simulates a target outage via a netblock container).

timescaledb stays disabled in CI (upstream failure).

The Docker workflow ran on every PR and every push to main, rebuilding
(and on main, pushing) the container image. The test workflow already
builds its own image per job, so the PR/main builds are redundant CI
cost — and they were failing intermittently on GitHub Actions Cache 504s.

Restrict the workflow to version-tag pushes (the release flow) plus
manual dispatch.

@teknogeek0 merged commit e073ebc into main on Jun 1, 2026
81 checks passed
@teknogeek0 deleted the fix/follow-target-reconnect branch on June 1, 2026, 19:22