Add retry with backoff to follow pipeline on target connection failure #35
Merged
Forward-port of dev follow-target-reconnect work: when the target connection drops mid-follow, the apply/replay pipeline now retries with exponential backoff instead of failing the migration. Includes fixes for the pipeline spin loop, stale replication-origin session on reconnect, and replication-origin holder termination, plus the follow-target-reconnect test suite (simulates a target outage via a netblock container). timescaledb stays disabled in CI (upstream failure).
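The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration of exponential backoff with a cap, not the code from this PR: `RetryPolicy`, `backoff_delay_ms`, `connect_with_backoff`, `FlakyTarget`, and the two callbacks are hypothetical names, and the delay values are assumptions. The sleep is injected as a callback so the loop can be exercised without real waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical retry policy; the PR's actual values are not stated. */
typedef struct RetryPolicy
{
	uint32_t baseDelayMs;   /* delay before the second attempt */
	uint32_t maxDelayMs;    /* cap applied to every delay */
	int maxAttempts;        /* total connection attempts allowed */
} RetryPolicy;

typedef bool (*ConnectFn)(void *ctx);
typedef void (*SleepFn)(uint32_t ms, void *ctx);

/* Delay after failed attempt number `attempt` (0-based): base * 2^attempt, capped. */
uint32_t
backoff_delay_ms(const RetryPolicy *policy, int attempt)
{
	assert(policy != NULL && attempt >= 0);

	uint64_t delay = policy->baseDelayMs;

	/* Stop doubling once the cap is reached, which also avoids overflow. */
	for (int i = 0; i < attempt && delay < policy->maxDelayMs; i++)
	{
		delay *= 2;
	}

	return delay > policy->maxDelayMs ? policy->maxDelayMs : (uint32_t) delay;
}

/* Try to connect up to maxAttempts times, sleeping between failed attempts. */
bool
connect_with_backoff(const RetryPolicy *policy, ConnectFn tryConnect,
					 SleepFn sleepMs, void *ctx, int *attemptsUsed)
{
	for (int attempt = 0; attempt < policy->maxAttempts; attempt++)
	{
		if (attemptsUsed != NULL)
		{
			*attemptsUsed = attempt + 1;
		}

		if (tryConnect(ctx))
		{
			return true;
		}

		/* No point sleeping after the final failed attempt. */
		if (attempt + 1 < policy->maxAttempts)
		{
			sleepMs(backoff_delay_ms(policy, attempt), ctx);
		}
	}

	return false;
}

/* Stand-in for a target that is unreachable for a few attempts. */
typedef struct FlakyTarget
{
	int failuresLeft;
	uint64_t totalSleptMs;
} FlakyTarget;

bool
flaky_connect(void *ctx)
{
	FlakyTarget *target = (FlakyTarget *) ctx;

	if (target->failuresLeft > 0)
	{
		target->failuresLeft--;
		return false;
	}
	return true;
}

/* Records the requested delay instead of actually sleeping. */
void
record_sleep(uint32_t ms, void *ctx)
{
	((FlakyTarget *) ctx)->totalSleptMs += ms;
}
```

With a base delay of 100 ms and a cap of 1000 ms, a target that rejects three attempts is reached on the fourth, after requested sleeps of 100, 200, and 400 ms. A production loop would presumably also distinguish retryable connection errors from permanent ones and reset any per-session state on reconnect, which this sketch leaves out.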
The Docker workflow ran on every PR and every push to main, rebuilding (and on main, pushing) the container image. The test workflow already builds its own image per job, so the PR/main builds are redundant CI cost — and they were failing intermittently on GitHub Actions Cache 504s. Restrict the workflow to version-tag pushes (the release flow) plus manual dispatch.
Summary
Adds retry-with-backoff to the --follow (CDC) pipeline so a dropped target connection no longer fails the migration. When the apply/replay side loses its connection to the target mid-follow, it now reconnects with exponential backoff and resumes streaming, instead of terminating the clone.
Includes several correctness fixes uncovered while building the reconnect path:
pgsql.c/pgsql.h; reconnect/backoff loop in ld_replay.c; supporting changes in ld_stream.c, ld_apply.c, follow.c.
Test plan
- follow-target-reconnect suite: drives a follow migration, severs the target connection mid-stream via a netblock container, and asserts the pipeline reconnects and completes. Added to the Tier-3 CI matrix.
- PGVERSION=16 make build: clean (compiles against the current FK/defer-index CDC code).
- PGVERSION=16 make tests: full suite green on PG16, including follow-target-reconnect.
Note (out of scope)
While validating, observed that blob-snapshot-release has a pre-existing, intermittent snapshot-release race on main (the source snapshot can be released before a COPY worker runs SET TRANSACTION SNAPSHOT, giving ERROR: invalid snapshot identifier). Measured ~1-in-5 on clean main without this change, so it is unrelated to this PR; flagging it as a separate issue to track.