feat(postgresql): graceful primary switchover by renecannao · Pull Request #99 · ProxySQL/orchestrator

renecannao · 2026-04-18T13:11:03Z

Summary

Adds planned/graceful primary switchover for PostgreSQL, reusing the existing graceful-master-takeover / graceful-master-takeover-auto CLI commands and API endpoints.

Switchover flow (implemented in go/logic/topology_recovery_postgresql.go):

Validate cluster / designated standby (PostgreSQLGetBestStandbyForPromotion when auto=true)
Run PreGracefulTakeoverProcesses
Set primary read-only (ALTER SYSTEM + terminate non-replication backends)
Capture primary WAL LSN
Wait for the designated standby to catch up to that LSN
Promote the designated standby (pg_promote())
Reconfigure remaining standbys to point at the new primary
Set primary_conninfo on the demoted primary (no restart)
Run PostGracefulTakeoverProcesses

The demoted primary requires an operator-managed restart with standby.signal — automate via PostGracefulTakeoverProcesses hooks.

New inst operations (go/inst/instance_topology_postgresql.go):

PostgreSQLSetReadOnly
PostgreSQLGetCurrentWALLSN
PostgreSQLWaitForStandbyLSN
PostgreSQLRepositionAsStandby

Docs: docs/database-providers.md, docs/user-manual.md, and design/plan specs under docs/superpowers/.

Tests: functional test in tests/functional/test-postgresql.sh covers

Basic graceful switchover (API returns OK, primary changes)
Deeper verification (new primary accepts writes, demoted primary has primary_conninfo)
Negative cases (bogus cluster, bogus designated host)
Round-trip (switch back) with operator-hook simulation
Unchanged: dead-primary failover still works after the switchover sequence

CI: removed continue-on-error from the PostgreSQL functional job — PG regressions now fail the build.

Test plan

Functional tests pass on PostgreSQL 15, 16, 17 (GHA matrix)
MySQL graceful takeover remains unaffected (regression)
Dead-primary failover still triggers after the switchover + round-trip
Unit tests pass (go test ./go/...)

Summary by CodeRabbit

New Features
- PostgreSQL graceful primary switchover is now supported via CLI/API (graceful-master-takeover and graceful-master-takeover-auto commands)
Documentation
- Updated PostgreSQL documentation describing the switchover workflow and operational setup requirements
- Added design specification for planned primary switchover capability
Tests
- Added comprehensive functional tests validating PostgreSQL graceful switchover scenarios
- PostgreSQL test failures now properly fail the CI job instead of continuing

Covers planned/graceful switchover support for PostgreSQL, including new instance operations (set read-only, WAL LSN catch-up, reposition as standby), orchestration flow, and CLI/API dispatch via the existing graceful-master-takeover commands.

9 tasks covering instance operations, orchestration, CLI/API dispatch, functional tests, documentation updates, and GitHub issue creation.

Returns the current WAL write position (LSN) of a PostgreSQL primary instance, needed to synchronize standbys before promotion during graceful switchover.

…over Polls a PostgreSQL standby until its replay LSN reaches or exceeds the target LSN, or the timeout expires. Used during graceful switchover to ensure the standby has caught up before promotion.

…chover Configures a demoted PostgreSQL primary to replicate from a new primary by updating primary_conninfo via ALTER SYSTEM. The operator handles the actual restart and standby.signal creation via hooks.

…tchover When the cluster master is PostgreSQL (ProviderType == "postgresql"), GracefulMasterTakeover now dispatches to PostgreSQLGracefulPrimarySwitchover instead of running the MySQL-specific code path.

Runs graceful-master-takeover-auto via the API before the destructive failover test. Verifies the primary changes after switchover.

- Escape single quotes in primary_conninfo to prevent SQL breakage when PostgreSQLTopologyPassword contains quotes (both PostgreSQLReconfigureStandby and PostgreSQLRepositionAsStandby) - Move AttemptRecoveryRegistration before mutations to prevent concurrent switchover races - Use 3x ReasonableMaintenanceReplicationLagSeconds for catch-up timeout since the primary is already read-only

Adds deeper verification of the graceful switchover result (new primary is writable, demoted primary has primary_conninfo set), negative-case tests (bogus cluster, bogus designated host), and a round-trip that reactivates the demoted primary as a live standby and switches back. Removes continue-on-error from the PostgreSQL functional job so PG test regressions now fail CI.

coderabbitai · 2026-04-18T13:11:20Z

📝 Walkthrough

Walkthrough

Implements PostgreSQL graceful primary switchover by introducing four instance-level operations (set read-only, capture WAL LSN, wait for standby replication, reposition as standby) and orchestration logic that validates topology, coordinates the switchover process, and routes PostgreSQL clusters through dedicated switchover logic.

Changes

Cohort / File(s)	Summary
Documentation & Specs `docs/database-providers.md`, `docs/user-manual.md`, `docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md`, `docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md`	Updated PostgreSQL documentation to declare graceful switchover now supported; added design spec and implementation plan describing the switchover flow (read-only → WAL capture → standby catch-up → promotion → reconfiguration) and `PostGracefulTakeoverProcesses` hook responsibility for demoted-primary restart.
PostgreSQL Instance Operations `go/inst/instance_topology_postgresql.go`, `go/inst/instance_topology_postgresql_test.go`	Added four exported functions (`PostgreSQLSetReadOnly`, `PostgreSQLGetCurrentWALLSN`, `PostgreSQLWaitForStandbyLSN`, `PostgreSQLRepositionAsStandby`) and their input-validation tests; fixed `PostgreSQLReconfigureStandby` to escape single quotes in `primary_conninfo` SQL generation.
Orchestration & Routing `go/logic/topology_recovery.go`, `go/logic/topology_recovery_postgresql.go`	Modified `GracefulMasterTakeover` to route PostgreSQL clusters to new `PostgreSQLGracefulPrimarySwitchover` function; implemented switchover orchestration with topology validation, standby selection (explicit/auto), pre/post hooks, read-only enforcement, LSN synchronization, promotion, replication reconfiguration, and explicit rollback on failure.
CI & Testing `.github/workflows/functional.yml`, `tests/functional/test-postgresql.sh`	Removed `continue-on-error: true` from PostgreSQL test step; expanded functional tests to validate graceful switchover via API, verify primary identity changes, confirm write capability, check demoted primary reposition, and test round-trip switchback with `standby.signal` restart sequence.

Sequence Diagram

sequenceDiagram
    participant Orch as Orchestrator
    participant Prime as Current Primary
    participant Stby as Designated Standby
    participant OthStby as Other Standbys

    Orch->>Prime: Set read-only (ALTER SYSTEM)
    Prime->>Prime: Terminate non-replication connections
    Orch->>Prime: Capture current WAL LSN (pg_current_wal_lsn)
    Prime-->>Orch: Return target LSN

    rect rgba(100, 150, 200, 0.5)
        Orch->>Stby: Poll pg_last_wal_replay_lsn() >= target
        Stby-->>Orch: LSN reached / timeout
    end

    Orch->>Stby: Promote standby (pg_promote)
    Orch->>OthStby: Reconfigure primary_conninfo (ALTER SYSTEM)
    Orch->>Prime: Set primary_conninfo to new primary (ALTER SYSTEM)
    
    Orch-->>Orch: Register TopologyRecovery<br/>Run PostGracefulTakeoverProcesses
    Orch->>Prime: (External) Restart with standby.signal

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related issues

feat: PostgreSQL graceful primary switchover (planned failover) #98: Directly implements the same PostgreSQL graceful primary switchover feature with identical function signatures and orchestration flow.

Possibly related PRs

Add built-in ProxySQL failover hooks #36: Overlaps in GracefulMasterTakeover control flow routing and pre/post-failover hook integration patterns.
CI: MySQL version matrix (5.7-9.6) and named channels tests #85: Shares modifications to .github/workflows/functional.yml and tests/functional/test-postgresql.sh functional test infrastructure.
CI: Advanced failover tests and PostgreSQL version matrix #86: Related through PostgreSQL functional test and CI setup overlaps.

Poem

🐰 A switchover smooth, from primary to standby's grace,
Replication LSNs race to keep the pace,
Read-only guards the window tight,
Promotion swift—the torch burns bright!
PostgreSQL planned, no crash in sight! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 64.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(postgresql): graceful primary switchover' directly and accurately describes the main change: implementing graceful primary switchover functionality for PostgreSQL.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/postgresql-graceful-switchover

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tests/functional/test-postgresql.sh (1)

295-331: ⚠️ Potential issue | 🟠 Major

Don’t hard-code pgprimary unless the round-trip restored it as primary.

Because this script runs without set -e, a failed round-trip still falls through to Line 331 and stops pgprimary, even if the current primary is still pgstandby1. Gate the dead-primary section on FINAL_PRIMARY being pgprimary, or dynamically stop the actual current primary container.

Test-flow guard suggestion

             if [ "$FINAL_PRIMARY" = "172.30.0.20" ] || [ "$FINAL_PRIMARY" = "pgprimary" ]; then
                 pass "Round-trip complete: pgprimary is primary again"
+                DEAD_PRIMARY_CONTAINER="pgprimary"
             else
                 fail "Round-trip incomplete: primary is '$FINAL_PRIMARY' (expected pgprimary)"
+                DEAD_PRIMARY_CONTAINER=""
             fi
@@
 echo "--- Failover test: kill current primary ---"
@@
-echo "Stopping pgprimary container..."
-$COMPOSE stop pgprimary
+if [ -z "${DEAD_PRIMARY_CONTAINER:-}" ]; then
+    skip "Dead-primary test skipped — pgprimary was not restored as primary"
+else
+    echo "Stopping ${DEAD_PRIMARY_CONTAINER} container..."
+    $COMPOSE stop "$DEAD_PRIMARY_CONTAINER"
+fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-postgresql.sh` around lines 295 - 331, The script
currently always proceeds to "Stopping pgprimary container..." even when the
round-trip did not restore pgprimary; update the dead-primary section to guard
on FINAL_PRIMARY or stop the actual current primary: check the FINAL_PRIMARY
variable (and/or its resolved container name) before stopping pgprimary, e.g. if
[ "$FINAL_PRIMARY" = "172.30.0.20" ] || [ "$FINAL_PRIMARY" = "pgprimary" ] then
stop pgprimary else stop the container corresponding to FINAL_PRIMARY (e.g.
derive container name like "pgstandby1" from the IP or use FINAL_PRIMARY
directly) so the stop command targets the actual primary instead of hard-coding
pgprimary; adjust the echo and stop/restart invocation that currently refer to
"pgprimary" to use the chosen container variable.

go/inst/instance_topology_postgresql.go (1)

100-115: ⚠️ Potential issue | 🟠 Major

Escape primary_conninfo values at the libpq layer, not just the SQL layer.

strings.ReplaceAll(..., "'", "''") protects the outer SQL literal, but primary_conninfo itself is a libpq keyword/value string. Passwords, usernames, or other values containing spaces, backslashes, or quotes must be quoted and escaped according to libpq rules (\' for quotes, \\ for backslashes), not just SQL rules. Build conninfo with per-value quoting/escaping, then escape the resulting SQL literal.

This issue occurs in two places: lines 101-115 and lines 313-318. Consider extracting the conninfo construction into a helper function to avoid duplication.

Safer helper-based construction

+func postgreSQLConnInfoValue(value string) string {
+	if value == "" || strings.ContainsAny(value, " \t\r\n'\\") {
+		value = strings.ReplaceAll(value, `\`, `\\`)
+		value = strings.ReplaceAll(value, `'`, `\'`)
+		return "'" + value + "'"
+	}
+	return value
+}
+
+func postgreSQLPrimaryConnInfo(newPrimaryKey *InstanceKey) string {
+	return fmt.Sprintf(
+		"host=%s port=%d user=%s password=%s sslmode=%s application_name=%s",
+		postgreSQLConnInfoValue(newPrimaryKey.Hostname),
+		newPrimaryKey.Port,
+		postgreSQLConnInfoValue(config.Config.PostgreSQLTopologyUser),
+		postgreSQLConnInfoValue(config.Config.PostgreSQLTopologyPassword),
+		postgreSQLConnInfoValue(config.Config.PostgreSQLSSLMode),
+		postgreSQLConnInfoValue("orchestrator"),
+	)
+}
+
+func postgreSQLStringLiteral(value string) string {
+	return strings.ReplaceAll(value, "'", "''")
+}
+
 	// Build new primary_conninfo
-	newConnInfo := fmt.Sprintf(
-		"host=%s port=%d user=%s password=%s sslmode=%s application_name=orchestrator",
-		newPrimaryKey.Hostname,
-		newPrimaryKey.Port,
-		config.Config.PostgreSQLTopologyUser,
-		config.Config.PostgreSQLTopologyPassword,
-		config.Config.PostgreSQLSSLMode,
-	)
+	newConnInfo := postgreSQLPrimaryConnInfo(newPrimaryKey)
@@
-	escapedConnInfo := strings.ReplaceAll(newConnInfo, "'", "''")
+	escapedConnInfo := postgreSQLStringLiteral(newConnInfo)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@go/inst/instance_topology_postgresql.go` around lines 100 - 115, The conninfo
string (newConnInfo) is being SQL-escaped but not libpq-escaped; change
construction in PostgreSQLReconfigureStandby to build each libpq key=value
element with proper libpq quoting/escaping (escape single quote as \' and
backslash as \\ and wrap values containing spaces or special chars in single
quotes), extract that logic into a helper (e.g., buildLibpqConnInfo or
quoteLibpqValue) and use it to produce the conninfo, then still SQL-escape the
final conninfo for the ALTER SYSTEM statement (escapedConnInfo =
strings.ReplaceAll(conninfo, "'", "''")); apply the same helper at the second
occurrence mentioned to remove duplication.

🧹 Nitpick comments (1)

docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md (1)

86-90: Add defensive validation for readOnlyValue.

While readOnlyValue is currently controlled (only "on" or "off"), adding an explicit validation check would make the code more defensive against future modifications and clearer about the expected values.

🛡️ Proposed defensive check

 	readOnlyValue := "off"
 	if readOnly {
 		readOnlyValue = "on"
 	}
+	if readOnlyValue != "on" && readOnlyValue != "off" {
+		return nil, fmt.Errorf("PostgreSQLSetReadOnly: invalid readOnlyValue: %s", readOnlyValue)
+	}
 
 	log.Infof("PostgreSQLSetReadOnly: setting default_transaction_read_only = %s on %+v", readOnlyValue, *instanceKey)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around
lines 86 - 90, The code in PostgreSQLSetReadOnly uses readOnlyValue directly in
the ALTER SYSTEM SQL; add a defensive validation before db.Exec to ensure
readOnlyValue is only "on" or "off" (or whatever canonical set you expect) and
return a descriptive error via log.Errore if it is invalid; locate the log.Infof
call and the subsequent db.Exec(fmt.Sprintf("ALTER SYSTEM SET
default_transaction_read_only = %s", readOnlyValue)) and insert the validation
check there to short-circuit on bad values and avoid executing malformed SQL.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/database-providers.md`:
- Around line 183-190: Update the PostgreSQL graceful takeover docs to state
that in addition to the pg_monitor role, the orchestrator user must have
superuser privileges or equivalent role-based permissions to run ALTER SYSTEM,
call pg_reload_conf(), and execute pg_terminate_backend(); mention these
specific operations and recommend granting either superuser or targeted
role-based privileges (or extension of pg_monitor) before using the CLI/API
endpoints (`graceful-master-takeover`, `graceful-master-takeover-auto`), and add
a note explaining that an operator-managed restart (triggered via
`PostGracefulTakeoverProcesses`) is still required for the demoted primary.

In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md`:
- Line 15: Change the task headings from level 3 to level 2 so heading levels
increment correctly: replace each "### Task N: `..." with "## Task N: `..." for
all task headings (e.g., "Task 1: `PostgreSQLSetReadOnly` — Unit Test +
Implementation" through Tasks 1-9) so the document goes H1 → H2 → H3 correctly.
- Around line 507-525: Move the AttemptRecoveryRegistration call so it executes
before mutating the topology (before calling inst.PostgreSQLPromoteStandby);
i.e., call AttemptRecoveryRegistration(&analysisEntry, false, false) and
set/initialize topologyRecovery (with the same error-handling fallback) prior to
invoking PostgreSQLPromoteStandby, then after promotion assign
topologyRecovery.SuccessorKey = &promotedInstance.Key; keep the
PostgreSQLSetReadOnly undo logic and all existing error branches but reorder
them so registration happens before the promotion to avoid races.
- Around line 700-736: Add explicit language specifiers to the fenced code
blocks that contain the PostgreSQL switchover text so markdown renderers apply
proper highlighting: replace the three-backtick fences before the block
containing "Graceful master takeover (planned switchover)..." and before "###
Graceful Primary Switchover", and the two occurrences around "Graceful master
takeover is not yet supported for PostgreSQL." / "Graceful master takeover is
supported for PostgreSQL..." with language-tagged fences (e.g., ```markdown).
Locate these blocks by searching for the exact phrases "Graceful master takeover
(planned switchover) is not yet implemented for PostgreSQL. Only unplanned
failover (dead primary) is supported.", "### Graceful Primary Switchover", and
"Graceful master takeover is not yet supported for PostgreSQL." and update their
opening ``` to ```markdown.
- Around line 338-351: The ALTER SYSTEM command interpolates newConnInfo
directly and can break or be exploited if fields contain single quotes; before
calling db.Exec in PostgreSQLRepositionAsStandby, escape single quotes in
newConnInfo (e.g., replace every "'" with "''") and use the escaped value when
building the SQL string for primary_conninfo; ensure the strings package is
imported and update the db.Exec call that sets primary_conninfo to use the
escapedConnInfo variable.

In `@docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md`:
- Around line 18-85: Update the design doc to match the implementation: change
the "three new functions" header to "four new functions" and ensure all four
(PostgreSQLSetReadOnly, PostgreSQLGetCurrentWALLSN, PostgreSQLWaitForStandbyLSN,
PostgreSQLRepositionAsStandby) are listed; change the
PostgreSQLGracefulPrimarySwitchover signature description to match the actual
return values implemented (adjust the noted return tuple to the real types/count
used in PostgreSQLGracefulPrimarySwitchover); and update the dispatch snippet to
match the code's provider check (use ProviderType == "postgresql" early return
logic as implemented rather than the inst.PostgreSQL variant) so the spec aligns
exactly with the code paths and symbols named in the repo.

In `@go/logic/topology_recovery_postgresql.go`:
- Around line 239-243: The code currently logs failures from
inst.PostgreSQLRepositionAsStandby(...) but switchover still returns success;
change this to surface the error: when
PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key)
returns a non-nil err, call AuditTopologyRecovery with the error and then
propagate the error out of the surrounding function (or mark clusterPrimary as
lost in topologyRecovery and return a wrapped error) instead of only logging;
update the surrounding function signature/return path to return that error so
callers see the failure (refer to PostgreSQLRepositionAsStandby,
AuditTopologyRecovery, topologyRecovery, clusterPrimary, promotedInstance).
- Around line 180-223: After AttemptRecoveryRegistration succeeds you must
register a deferred failure resolver to always call resolveRecovery(...) on
error paths and avoid leaving topologyRecovery unresolved; add a defer that
checks a local resolved bool and if false calls
resolveRecovery(topologyRecovery) (and AuditTopologyRecovery as appropriate),
ensure you only set topologyRecovery.SuccessorKey = &designatedStandby.Key after
successful pre-promotion steps or clear it before any early return, and mark
resolved = true immediately after a successful resolveRecovery(...) so the defer
skips double-resolution; key symbols: AttemptRecoveryRegistration,
topologyRecovery, SuccessorKey, resolveRecovery, AuditTopologyRecovery.

In `@go/logic/topology_recovery.go`:
- Around line 2124-2128: The CLI panics when PostgreSQLGracefulPrimarySwitchover
returns nil promotedMasterCoordinates because handlers in go/app/cli.go
unconditionally dereference *promotedMasterCoordinates; update the CLI handlers
that call GracefulMasterTakeover (the blocks around the current dereferences of
promotedMasterCoordinates) to check for nil before dereferencing and handle the
nil case (e.g., print a message or omit coordinates) or alternatively ensure
PostgreSQLGracefulPrimarySwitchover always returns a non-nil coordinates struct;
reference PostgreSQLGracefulPrimarySwitchover, GracefulMasterTakeover, and the
promotedMasterCoordinates variable to locate the code to change.

---

Outside diff comments:
In `@go/inst/instance_topology_postgresql.go`:
- Around line 100-115: The conninfo string (newConnInfo) is being SQL-escaped
but not libpq-escaped; change construction in PostgreSQLReconfigureStandby to
build each libpq key=value element with proper libpq quoting/escaping (escape
single quote as \' and backslash as \\ and wrap values containing spaces or
special chars in single quotes), extract that logic into a helper (e.g.,
buildLibpqConnInfo or quoteLibpqValue) and use it to produce the conninfo, then
still SQL-escape the final conninfo for the ALTER SYSTEM statement
(escapedConnInfo = strings.ReplaceAll(conninfo, "'", "''")); apply the same
helper at the second occurrence mentioned to remove duplication.

In `@tests/functional/test-postgresql.sh`:
- Around line 295-331: The script currently always proceeds to "Stopping
pgprimary container..." even when the round-trip did not restore pgprimary;
update the dead-primary section to guard on FINAL_PRIMARY or stop the actual
current primary: check the FINAL_PRIMARY variable (and/or its resolved container
name) before stopping pgprimary, e.g. if [ "$FINAL_PRIMARY" = "172.30.0.20" ] ||
[ "$FINAL_PRIMARY" = "pgprimary" ] then stop pgprimary else stop the container
corresponding to FINAL_PRIMARY (e.g. derive container name like "pgstandby1"
from the IP or use FINAL_PRIMARY directly) so the stop command targets the
actual primary instead of hard-coding pgprimary; adjust the echo and
stop/restart invocation that currently refer to "pgprimary" to use the chosen
container variable.

---

Nitpick comments:
In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md`:
- Around line 86-90: The code in PostgreSQLSetReadOnly uses readOnlyValue
directly in the ALTER SYSTEM SQL; add a defensive validation before db.Exec to
ensure readOnlyValue is only "on" or "off" (or whatever canonical set you
expect) and return a descriptive error via log.Errore if it is invalid; locate
the log.Infof call and the subsequent db.Exec(fmt.Sprintf("ALTER SYSTEM SET
default_transaction_read_only = %s", readOnlyValue)) and insert the validation
check there to short-circuit on bad values and avoid executing malformed SQL.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e088dff-bc56-4ee5-8b1f-aa81497ff5c9

📥 Commits

Reviewing files that changed from the base of the PR and between 4c92815 and a594961.

📒 Files selected for processing (10)

.github/workflows/functional.yml
docs/database-providers.md
docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md
docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md
docs/user-manual.md
go/inst/instance_topology_postgresql.go
go/inst/instance_topology_postgresql_test.go
go/logic/topology_recovery.go
go/logic/topology_recovery_postgresql.go
tests/functional/test-postgresql.sh

💤 Files with no reviewable changes (1)

.github/workflows/functional.yml

coderabbitai · 2026-04-18T13:16:06Z

+- **Graceful master takeover** (planned switchover) is supported for PostgreSQL.
+  The same CLI commands and API endpoints (`graceful-master-takeover`,
+  `graceful-master-takeover-auto`) work for PostgreSQL clusters. The switchover
+  sets the primary read-only, waits for the designated standby to catch up via
+  WAL LSN, promotes the standby, reconfigures remaining standbys, and configures
+  the demoted primary's `primary_conninfo`. The demoted primary requires an
+  operator-managed restart with `standby.signal` — use
+  `PostGracefulTakeoverProcesses` hooks to automate this step.


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PostgreSQL privileges required for ALTER SYSTEM pg_reload_conf pg_terminate_backend pg_signal_backend

💡 Result:

PostgreSQL privileges required: - ALTER SYSTEM: Superuser or users granted the ALTER SYSTEM privilege on the specific configuration parameter (via GRANT ALTER SYSTEM ON PARAMETER ...). {value: Superuser or ALTER SYSTEM privilege on parameter, citations: [1,2,4], confidence: high} - pg_reload_conf: Superuser or EXECUTE privilege granted on the pg_catalog.pg_reload_conf function (via GRANT EXECUTE ON FUNCTION pg_catalog.pg_reload_conf TO role;). These signaling functions are restricted to superusers by default but access may be granted using GRANT. {value: Superuser or EXECUTE on pg_reload_conf, citations: [6,8,11], confidence: high} - pg_terminate_backend(pid): Superuser, or the calling role is a member of the target backend's role, or has privileges of the pg_signal_backend role. Only superusers can terminate superuser backends. {value: Superuser, member of target role, or pg_signal_backend role, citations: [6,7,11], confidence: high} - pg_signal_backend(pid): This appears to refer to the pg_signal_backend predefined role (not a function). Membership in pg_signal_backend role allows signaling (cancel/terminate) other backends (via pg_cancel_backend/pg_terminate_backend). Superusers bypass this. {value: pg_signal_backend role membership, citations: [6,7,13], confidence: high}

Citations:

1: https://www.postgresql.org/docs/current/sql-altersystem.html

2: https://www.postgresql.org/docs/current/ddl-priv.html

3: https://www.postgresql.org/docs/current/static/sql-altersystem.html

4: https://www.postgresql.org/docs/current/functions-admin.html

5: https://www.postgresql.org/docs/current/static/predefined-roles.html

6: https://runebook.dev/en/docs/postgresql/functions-admin/pg_reload_conf

7: http://www.postgresql.org/docs/current/interactive/functions-admin.html

8: https://www.postgresql.org/docs/current/predefined-roles.html

🏁 Script executed:

# First, let's read the specific documentation section mentioned head -200 docs/database-providers.md | tail -50

Repository: ProxySQL/orchestrator

Length of output: 2911

🏁 Script executed:

# Search for pg_monitor mentions in the documentation rg "pg_monitor" docs/database-providers.md -B 5 -A 5

Repository: ProxySQL/orchestrator

Length of output: 688

🏁 Script executed:

# Find the PostgreSQL instance topology implementation fd "instance_topology_postgresql" --type f

Repository: ProxySQL/orchestrator

Length of output: 149

🏁 Script executed:

# Search for ALTER SYSTEM, pg_reload_conf, pg_terminate_backend in codebase rg "ALTER SYSTEM|pg_reload_conf|pg_terminate_backend" --type go -B 2 -A 2

Repository: ProxySQL/orchestrator

Length of output: 5706

🏁 Script executed:

# Check the exact context around the pg_monitor requirement to see where prerequisites are documented rg -B 10 "pg_monitor role:" docs/database-providers.md

Repository: ProxySQL/orchestrator

Length of output: 47

🏁 Script executed:

# Verify if there are any existing privilege requirements documented for graceful takeover nearby rg -A 20 "Graceful master takeover" docs/database-providers.md | head -40

Repository: ProxySQL/orchestrator

Length of output: 1039

Add privilege requirements to PostgreSQL graceful takeover documentation.

The graceful takeover implementation requires additional PostgreSQL privileges beyond pg_monitor. The operations (ALTER SYSTEM, pg_reload_conf(), pg_terminate_backend()) require superuser access or equivalent role-based privileges. Update the documentation to specify these requirements so operators can configure the orchestrator user with the necessary permissions before attempting graceful takeover.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/database-providers.md` around lines 183 - 190, Update the PostgreSQL graceful takeover docs to state that in addition to the pg_monitor role, the orchestrator user must have superuser privileges or equivalent role-based permissions to run ALTER SYSTEM, call pg_reload_conf(), and execute pg_terminate_backend(); mention these specific operations and recommend granting either superuser or targeted role-based privileges (or extension of pg_monitor) before using the CLI/API endpoints (`graceful-master-takeover`, `graceful-master-takeover-auto`), and add a note explaining that an operator-managed restart (triggered via `PostGracefulTakeoverProcesses`) is still required for the demoted primary.

coderabbitai · 2026-04-18T13:16:06Z

+
+---
+
+### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation


⚠️ Potential issue | 🟡 Minor

Use h2 (##) for task headings instead of h3 (###).

The task headings jump from h1 (line 1) to h3 (line 15), skipping h2. Markdown best practices require heading levels to increment by one level at a time.

📝 Proposed fix

Change all task headings from h3 to h2:

-### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation +## Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation

Apply this change to all task headings (Tasks 1-9).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation

## Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 15-15: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` at line 15, Change the task headings from level 3 to level 2 so heading levels increment correctly: replace each "### Task N: `..." with "## Task N: `..." for all task headings (e.g., "Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation" through Tasks 1-9) so the document goes H1 → H2 → H3 correctly.

coderabbitai · 2026-04-18T13:16:06Z

+	newConnInfo := fmt.Sprintf(
+		"host=%s port=%d user=%s password=%s sslmode=%s application_name=orchestrator",
+		newPrimaryKey.Hostname,
+		newPrimaryKey.Port,
+		config.Config.PostgreSQLTopologyUser,
+		config.Config.PostgreSQLTopologyPassword,
+		config.Config.PostgreSQLSSLMode,
+	)
+
+	log.Infof("PostgreSQLRepositionAsStandby: configuring %+v to replicate from %+v", *instanceKey, *newPrimaryKey)
+
+	if _, err := db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", newConnInfo)); err != nil {
+		return log.Errore(fmt.Errorf("PostgreSQLRepositionAsStandby: ALTER SYSTEM failed on %+v: %v", *instanceKey, err))
+	}


⚠️ Potential issue | 🔴 Critical

Critical: Escape single quotes in primary_conninfo to prevent SQL errors and injection.

The newConnInfo string is interpolated directly into the SQL command without escaping. If newPrimaryKey.Hostname or config.Config.PostgreSQLTopologyPassword contain single quotes, the ALTER SYSTEM command will fail or could be vulnerable to SQL injection.

The PR objectives note this was fixed post-review: "escape single quotes in primary_conninfo". The plan should be updated to show the correct implementation.

🔒 Proposed fix with quote escaping

newConnInfo := fmt.Sprintf( "host=%s port=%d user=%s password=%s sslmode=%s application_name=orchestrator", newPrimaryKey.Hostname, newPrimaryKey.Port, config.Config.PostgreSQLTopologyUser, config.Config.PostgreSQLTopologyPassword, config.Config.PostgreSQLSSLMode, ) + // Escape single quotes for SQL string literal + newConnInfo = strings.ReplaceAll(newConnInfo, "'", "''") log.Infof("PostgreSQLRepositionAsStandby: configuring %+v to replicate from %+v", *instanceKey, *newPrimaryKey) if _, err := db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", newConnInfo)); err != nil {

Note: Add "strings" to imports if not already present.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around lines 338 - 351, The ALTER SYSTEM command interpolates newConnInfo directly and can break or be exploited if fields contain single quotes; before calling db.Exec in PostgreSQLRepositionAsStandby, escape single quotes in newConnInfo (e.g., replace every "'" with "''") and use the escaped value when building the SQL string for primary_conninfo; ensure the strings package is imported and update the db.Exec call that sets primary_conninfo to use the escapedConnInfo variable.

coderabbitai · 2026-04-18T13:16:06Z

+	// --- Promote designated standby ---
+	promotedInstance, err := inst.PostgreSQLPromoteStandby(&designatedStandby.Key)
+	if err != nil {
+		// Undo read-only before aborting
+		_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
+		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: promotion of %+v failed: %v", designatedStandby.Key, err)
+	}
+	log.Infof("PostgreSQLGracefulPrimarySwitchover: promoted %+v to primary", promotedInstance.Key)
+
+	// --- Register recovery ---
+	topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
+	if err != nil || topologyRecovery == nil {
+		_ = log.Warningf("PostgreSQLGracefulPrimarySwitchover: error registering recovery: %v", err)
+		// Continue — promotion already happened
+		topologyRecovery = &TopologyRecovery{
+			AnalysisEntry: analysisEntry,
+		}
+	}
+	topologyRecovery.SuccessorKey = &promotedInstance.Key


⚠️ Potential issue | 🟠 Major

Race condition: Move AttemptRecoveryRegistration before promotion to avoid registration races.

AttemptRecoveryRegistration (line 517) occurs after the standby is promoted (line 508). This creates a window where the topology has mutated but the recovery isn't registered, potentially leading to inconsistent state or duplicate recovery attempts.

The PR objectives confirm this issue was fixed post-review: "move AttemptRecoveryRegistration before mutations to avoid races". The plan should reflect the corrected order.

📋 Proposed fix to move registration before mutations

Move the registration block before the promotion step:

} - // --- Promote designated standby --- + // --- Register recovery --- + topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false) + if err != nil || topologyRecovery == nil { + // Undo read-only before aborting + _, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false) + return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: recovery registration failed: %v", err) + } + + // --- Promote designated standby --- promotedInstance, err := inst.PostgreSQLPromoteStandby(&designatedStandby.Key) if err != nil { // Undo read-only before aborting _, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false) return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: promotion of %+v failed: %v", designatedStandby.Key, err) } log.Infof("PostgreSQLGracefulPrimarySwitchover: promoted %+v to primary", promotedInstance.Key) - // --- Register recovery --- - topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false) - if err != nil || topologyRecovery == nil { - _ = log.Warningf("PostgreSQLGracefulPrimarySwitchover: error registering recovery: %v", err) - // Continue — promotion already happened - topologyRecovery = &TopologyRecovery{ - AnalysisEntry: analysisEntry, - } - } topologyRecovery.SuccessorKey = &promotedInstance.Key

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around lines 507 - 525, Move the AttemptRecoveryRegistration call so it executes before mutating the topology (before calling inst.PostgreSQLPromoteStandby); i.e., call AttemptRecoveryRegistration(&analysisEntry, false, false) and set/initialize topologyRecovery (with the same error-handling fallback) prior to invoking PostgreSQLPromoteStandby, then after promotion assign topologyRecovery.SuccessorKey = &promotedInstance.Key; keep the PostgreSQLSetReadOnly undo logic and all existing error branches but reorder them so registration happens before the promotion to avoid races.

coderabbitai · 2026-04-18T13:16:06Z

+```
+Graceful master takeover (planned switchover) is not yet implemented for PostgreSQL. Only unplanned failover (dead primary) is supported.
+```
+
+Replace with:
+```
+### Graceful Primary Switchover
+
+Graceful master takeover (planned switchover) is supported for PostgreSQL. The same CLI commands
+and API endpoints used for MySQL (`graceful-master-takeover`, `graceful-master-takeover-auto`) work
+for PostgreSQL clusters. The switchover flow:
+
+1. Sets the primary read-only via `ALTER SYSTEM SET default_transaction_read_only = on` and terminates
+   existing non-replication connections.
+2. Captures the primary's current WAL LSN and waits for the designated standby to catch up.
+3. Promotes the designated standby via `pg_promote()`.
+4. Reconfigures remaining standbys to replicate from the new primary.
+5. Configures the demoted primary's `primary_conninfo` to point at the new primary.
+
+**Important:** The demoted primary requires a restart with `standby.signal` to complete its conversion
+to a standby. Use `PostGracefulTakeoverProcesses` hooks to automate this step.
+```
+
+- [ ] **Step 2: Update user-manual.md**
+
+Find the line (around line 699) that says:
+```
+Graceful master takeover is not yet supported for PostgreSQL.
+```
+
+Replace with:
+```
+Graceful master takeover is supported for PostgreSQL. Use `graceful-master-takeover` or
+`graceful-master-takeover-auto` CLI commands, or the equivalent API endpoints. The demoted
+primary requires an operator-managed restart with `standby.signal` — configure this via
+`PostGracefulTakeoverProcesses` hooks.
+```


⚠️ Potential issue | 🟡 Minor

Add language specifiers to fenced code blocks for better rendering.

The fenced code blocks at lines 700, 705, 726, and 731 lack language specifiers, which affects syntax highlighting and rendering in documentation viewers.

📝 Proposed fix to add language specifiers

Line 700:

Find the line (around line 183-184) that says: -``` +```markdown Graceful master takeover (planned switchover) is not yet implemented for PostgreSQL. Only unplanned failover (dead primary) is supported.

Line 705: ```diff Replace with: -``` +```markdown ### Graceful Primary Switchover

Line 726:

Find the line (around line 699) that says: -``` +```markdown Graceful master takeover is not yet supported for PostgreSQL.

Line 731: ```diff Replace with: -``` +```markdown Graceful master takeover is supported for PostgreSQL. Use `graceful-master-takeover` or

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 700-700: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

[warning] 705-705: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

[warning] 726-726: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

[warning] 731-731: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around lines 700 - 736, Add explicit language specifiers to the fenced code blocks that contain the PostgreSQL switchover text so markdown renderers apply proper highlighting: replace the three-backtick fences before the block containing "Graceful master takeover (planned switchover)..." and before "### Graceful Primary Switchover", and the two occurrences around "Graceful master takeover is not yet supported for PostgreSQL." / "Graceful master takeover is supported for PostgreSQL..." with language-tagged fences (e.g., ```markdown). Locate these blocks by searching for the exact phrases "Graceful master takeover (planned switchover) is not yet implemented for PostgreSQL. Only unplanned failover (dead primary) is supported.", "### Graceful Primary Switchover", and "Graceful master takeover is not yet supported for PostgreSQL." and update their opening ``` to ```markdown.

coderabbitai · 2026-04-18T13:16:07Z

+Three new functions in `go/inst/instance_topology_postgresql.go`:
+
+#### `PostgreSQLSetReadOnly(instanceKey *InstanceKey, readOnly bool) (*Instance, error)`
+
+- Connects to the instance via SQL
+- Executes `ALTER SYSTEM SET default_transaction_read_only = on|off`
+- Calls `SELECT pg_reload_conf()`
+- When setting read-only: queries `pg_stat_activity` for non-replication, non-orchestrator backends and calls `pg_terminate_backend()` on each to close the write window
+- Re-reads and returns the instance
+
+#### `PostgreSQLGetCurrentWALLSN(instanceKey *InstanceKey) (string, error)`
+
+- Connects to the primary
+- Returns result of `SELECT pg_current_wal_lsn()::text`
+
+#### `PostgreSQLWaitForStandbyLSN(standbyKey *InstanceKey, targetLSN string, timeout time.Duration) error`
+
+- Polls `SELECT pg_last_wal_replay_lsn()` on the standby
+- Compares using `SELECT '%s'::pg_lsn >= '%s'::pg_lsn` (PostgreSQL-native LSN comparison)
+- Polls every 500ms up to the configured timeout
+- Returns error on timeout
+
+#### `PostgreSQLRepositionAsStandby(instanceKey *InstanceKey, newPrimaryKey *InstanceKey) error`
+
+- Connects to the old primary
+- Sets `primary_conninfo` via `ALTER SYSTEM` pointing to the new primary (same logic as existing `PostgreSQLReconfigureStandby`)
+- Calls `SELECT pg_reload_conf()`
+- Does NOT restart the instance — the actual restart and `standby.signal` creation is the operator's responsibility via `PostGracefulTakeoverProcesses` hook
+
+### Switchover Orchestration
+
+New function `PostgreSQLGracefulPrimarySwitchover(clusterName string, designatedKey *InstanceKey, auto bool) (*TopologyRecovery, *inst.BinlogCoordinates, error)` in `go/logic/topology_recovery_postgresql.go`:
+
+**Flow:**
+
+1. **Validate** — Read cluster primary, read standbys, determine designated standby. When `auto=true`, use `PostgreSQLGetBestStandbyForPromotion()`. When manual, require the user to specify or have exactly one standby.
+2. **Execute `PreGracefulTakeoverProcesses`** — Abort on failure.
+3. **Set primary read-only** — `PostgreSQLSetReadOnly(primary, true)`. Terminates existing non-replication connections.
+4. **Capture target LSN** — `PostgreSQLGetCurrentWALLSN(primary)`.
+5. **Wait for standby catch-up** — `PostgreSQLWaitForStandbyLSN(designated, targetLSN, timeout)`.
+6. **Promote designated standby** — `PostgreSQLPromoteStandby(designated)` (existing function).
+7. **Reconfigure remaining standbys** — `PostgreSQLReconfigureStandby(standby, newPrimary)` for each non-designated standby (existing function).
+8. **Configure demoted primary** — `PostgreSQLRepositionAsStandby(oldPrimary, newPrimary)`. Sets `primary_conninfo` only; operator handles restart via hook.
+9. **Execute `PostGracefulTakeoverProcesses`**.
+10. **Return** TopologyRecovery with successor key.
+
+**Error handling:**
+
+- If promotion fails after setting read-only: undo via `PostgreSQLSetReadOnly(primary, false)`, abort.
+- If standby catch-up times out: undo read-only, abort with error.
+
+**Note on flat topology:** PostgreSQL topologies are flat (no cascading replication detected), so there is no sibling relocation step (unlike MySQL which may need to reparent replicas under the designated replica).
+
+### CLI/API Dispatch
+
+No changes to `cli.go` or `api.go`. The existing commands and endpoints are reused.
+
+**In `go/logic/topology_recovery.go`:**
+
+Add a provider check in `GracefulMasterTakeover()`:
+
+```go
+if clusterMaster.DatabaseProvider == inst.PostgreSQL {
+    return PostgreSQLGracefulPrimarySwitchover(clusterName, designatedKey, auto)
+}
+```
+
+Add a PostgreSQL branch in `getGracefulMasterTakeoverDesignatedInstance()` to use `PostgreSQLGetBestStandbyForPromotion()` when `auto=true`.


⚠️ Potential issue | 🟡 Minor

Bring the design spec back in sync with the implementation.

This spec says “three” functions while listing four, documents a three-return-value PostgreSQLGracefulPrimarySwitchover(...), and shows a dispatch branch that differs from the implemented ProviderType == "postgresql" early return. Please update the spec so it remains useful as implementation documentation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md` around lines 18 - 85, Update the design doc to match the implementation: change the "three new functions" header to "four new functions" and ensure all four (PostgreSQLSetReadOnly, PostgreSQLGetCurrentWALLSN, PostgreSQLWaitForStandbyLSN, PostgreSQLRepositionAsStandby) are listed; change the PostgreSQLGracefulPrimarySwitchover signature description to match the actual return values implemented (adjust the noted return tuple to the real types/count used in PostgreSQLGracefulPrimarySwitchover); and update the dispatch snippet to match the code's provider check (use ProviderType == "postgresql" early return logic as implemented rather than the inst.PostgreSQL variant) so the spec aligns exactly with the code paths and symbols named in the repo.

coderabbitai · 2026-04-18T13:16:07Z

+	topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
+	if topologyRecovery == nil {
+		AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("found an active or recent recovery on %+v. Will not issue another graceful switchover.", clusterName))
+		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: active or recent recovery exists for %+v, aborting", clusterName)
+	}
+	topologyRecovery.SuccessorKey = &designatedStandby.Key
+
+	// --- Execute PreGracefulTakeoverProcesses ---
+	if err := executeProcesses(config.Config.PreGracefulTakeoverProcesses, "PreGracefulTakeoverProcesses", topologyRecovery, true); err != nil {
+		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: PreGracefulTakeoverProcesses failed: %v", err)
+	}
+
+	// --- Set primary read-only ---
+	log.Infof("PostgreSQLGracefulPrimarySwitchover: setting %+v read-only", clusterPrimary.Key)
+	if _, err := inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, true); err != nil {
+		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed to set read-only on %+v: %v", clusterPrimary.Key, err)
+	}
+
+	// --- Capture target LSN ---
+	targetLSN, err := inst.PostgreSQLGetCurrentWALLSN(&clusterPrimary.Key)
+	if err != nil {
+		// Undo read-only before aborting
+		_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
+		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed to get WAL LSN from %+v: %v", clusterPrimary.Key, err)
+	}
+	log.Infof("PostgreSQLGracefulPrimarySwitchover: target LSN is %s", targetLSN)
+
+	// --- Wait for standby to catch up ---
+	// Use 3x the maintenance lag threshold since the primary is already read-only
+	// and waiting longer only extends the maintenance window, not data risk.
+	catchUpTimeout := 3 * time.Duration(config.Config.ReasonableMaintenanceReplicationLagSeconds) * time.Second
+	if err := inst.PostgreSQLWaitForStandbyLSN(&designatedStandby.Key, targetLSN, catchUpTimeout); err != nil {
+		// Undo read-only before aborting
+		_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
+		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: standby catch-up failed: %v", err)
+	}
+
+	// --- Promote designated standby ---
+	promotedInstance, err := inst.PostgreSQLPromoteStandby(&designatedStandby.Key)
+	if err != nil {
+		// Undo read-only before aborting
+		_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
+		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: promotion of %+v failed: %v", designatedStandby.Key, err)
+	}


⚠️ Potential issue | 🟠 Major

Resolve the registered recovery on all abort paths.

After AttemptRecoveryRegistration, these early returns can leave an active/recent recovery record unresolved. Add a deferred failure resolver after registration, and avoid leaving SuccessorKey populated for failed pre-promotion attempts.

Failure cleanup pattern

topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false) if topologyRecovery == nil { AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("found an active or recent recovery on %+v. Will not issue another graceful switchover.", clusterName)) return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: active or recent recovery exists for %+v, aborting", clusterName) } - topologyRecovery.SuccessorKey = &designatedStandby.Key + resolved := false + defer func() { + if topologyRecovery != nil && !resolved && err != nil { + topologyRecovery.SuccessorKey = nil + _ = topologyRecovery.AddError(err) + _ = resolveRecovery(topologyRecovery, nil) + } + }() + topologyRecovery.SuccessorKey = &designatedStandby.Key @@ - if err := executeProcesses(config.Config.PreGracefulTakeoverProcesses, "PreGracefulTakeoverProcesses", topologyRecovery, true); err != nil { - return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: PreGracefulTakeoverProcesses failed: %v", err) + if err = executeProcesses(config.Config.PreGracefulTakeoverProcesses, "PreGracefulTakeoverProcesses", topologyRecovery, true); err != nil { + return topologyRecovery, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: PreGracefulTakeoverProcesses failed: %v", err) }

Set resolved = true immediately after the successful resolveRecovery(...) call.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@go/logic/topology_recovery_postgresql.go` around lines 180 - 223, After AttemptRecoveryRegistration succeeds you must register a deferred failure resolver to always call resolveRecovery(...) on error paths and avoid leaving topologyRecovery unresolved; add a defer that checks a local resolved bool and if false calls resolveRecovery(topologyRecovery) (and AuditTopologyRecovery as appropriate), ensure you only set topologyRecovery.SuccessorKey = &designatedStandby.Key after successful pre-promotion steps or clear it before any early return, and mark resolved = true immediately after a successful resolveRecovery(...) so the defer skips double-resolution; key symbols: AttemptRecoveryRegistration, topologyRecovery, SuccessorKey, resolveRecovery, AuditTopologyRecovery.

coderabbitai · 2026-04-18T13:16:07Z

+	// --- Configure demoted primary for repositioning ---
+	AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key))
+	if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil {
+		AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err))
+	}


⚠️ Potential issue | 🟠 Major

Surface demoted-primary reconfiguration failure.

If PostgreSQLRepositionAsStandby fails, the old primary lacks the primary_conninfo that the documented restart hook depends on, but the switchover still returns success. Record this as an error and return it, or at least mark the demoted primary as lost so operators do not get a false OK.

Propagate the post-promotion configuration error

AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key)) if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil { AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err)) + topologyRecovery.LostReplicas.AddKey(clusterPrimary.Key) + _ = topologyRecovery.AddError(err) + return topologyRecovery, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed configuring demoted primary %+v: %v", clusterPrimary.Key, err) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// --- Configure demoted primary for repositioning ---

AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key))

if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil {

AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err))

}

// --- Configure demoted primary for repositioning ---

AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key))

if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil {

AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err))

topologyRecovery.LostReplicas.AddKey(clusterPrimary.Key)

_ = topologyRecovery.AddError(err)

return topologyRecovery, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed configuring demoted primary %+v: %v", clusterPrimary.Key, err)

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@go/logic/topology_recovery_postgresql.go` around lines 239 - 243, The code currently logs failures from inst.PostgreSQLRepositionAsStandby(...) but switchover still returns success; change this to surface the error: when PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key) returns a non-nil err, call AuditTopologyRecovery with the error and then propagate the error out of the surrounding function (or mark clusterPrimary as lost in topologyRecovery and return a wrapped error) instead of only logging; update the surrounding function signature/return path to return that error so callers see the failure (refer to PostgreSQLRepositionAsStandby, AuditTopologyRecovery, topologyRecovery, clusterPrimary, promotedInstance).

coderabbitai · 2026-04-18T13:16:07Z

+	// PostgreSQL: dispatch to PostgreSQL-specific implementation
+	if clusterMaster.ProviderType == "postgresql" {
+		topologyRecovery, err := PostgreSQLGracefulPrimarySwitchover(clusterName, designatedKey, auto)
+		return topologyRecovery, nil, err
+	}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Inspect callers of GracefulMasterTakeover and usages of the returned coordinates. # Expectation: Callers either ignore the coordinates for PostgreSQL or guard nil before use. rg -nP -C4 '\bGracefulMasterTakeover\s*\(' rg -nP -C4 '\bpromotedMasterCoordinates\b|\bpromoted.*Coordinates\b'

Repository: ProxySQL/orchestrator

Length of output: 16005

Fix nil pointer dereference in CLI when PostgreSQL graceful takeover returns nil coordinates.

The PostgreSQL dispatch in GracefulMasterTakeover returns nil for promotedMasterCoordinates. The API endpoint safely discards this value, but the CLI handlers in go/app/cli.go at lines 1518-1519 and 1534-1535 unconditionally dereference it with *promotedMasterCoordinates without nil guards. This will panic for PostgreSQL clusters invoked via CLI.

Either:

Return non-nil coordinates from PostgreSQLGracefulPrimarySwitchover, or

Add nil checks in CLI before dereferencing (lines 1518, 1519, 1534, 1535)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@go/logic/topology_recovery.go` around lines 2124 - 2128, The CLI panics when PostgreSQLGracefulPrimarySwitchover returns nil promotedMasterCoordinates because handlers in go/app/cli.go unconditionally dereference *promotedMasterCoordinates; update the CLI handlers that call GracefulMasterTakeover (the blocks around the current dereferences of promotedMasterCoordinates) to check for nil before dereferencing and handle the nil case (e.g., print a message or omit coordinates) or alternatively ensure PostgreSQLGracefulPrimarySwitchover always returns a non-nil coordinates struct; reference PostgreSQLGracefulPrimarySwitchover, GracefulMasterTakeover, and the promotedMasterCoordinates variable to locate the code to change.

gemini-code-assist

Code Review

This pull request implements graceful primary switchover for PostgreSQL, introducing new instance operations for read-only mode, WAL LSN tracking, and primary repositioning. The orchestration logic is integrated into the existing GracefulMasterTakeover flow, supported by updated documentation and new functional tests. Feedback highlights the need for robust escaping in PostgreSQL connection strings and notes that pg_terminate_backend requires elevated privileges beyond the documented pg_monitor role. Additionally, the recovery registration should be tightened to prevent concurrent conflicts, and the orchestration function signature needs to align with the design specification's return types.

gemini-code-assist · 2026-04-18T13:16:16Z

+	escapedConnInfo := strings.ReplaceAll(newConnInfo, "'", "''")
+	_, err = db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", escapedConnInfo))


The primary_conninfo string is built using fmt.Sprintf which does not handle escaping of values that might contain spaces or special characters. While single quotes are escaped for the SQL statement, they are not properly escaped for the PostgreSQL connection string format itself (which uses backslashes for internal quotes). Furthermore, if the password contains spaces, it must be wrapped in single quotes within the connection string. This logic should be moved to a helper that correctly formats the connection string for PostgreSQL.

gemini-code-assist · 2026-04-18T13:16:17Z

+		_, err := db.Exec(`
+			SELECT pg_terminate_backend(pid)
+			FROM pg_stat_activity
+			WHERE pid <> pg_backend_pid()
+			  AND backend_type NOT IN ('walsender', 'walreceiver', 'autovacuum worker', 'logical replication launcher')
+			  AND backend_type <> 'background worker'
+			  AND datname IS NOT NULL
+		`)


The pg_terminate_backend function requires pg_signal_backend or superuser privileges. The documentation in docs/database-providers.md only mentions granting pg_monitor, which is insufficient for this operation. Additionally, the termination of backends is critical for ensuring a zero-data-loss switchover by closing the write window; therefore, failure to terminate backends should likely be treated as a fatal error rather than just a warning.

gemini-code-assist · 2026-04-18T13:16:17Z

+	}
+
+	// --- Register recovery to prevent concurrent switchover attempts ---
+	topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)


Using false, false for AttemptRecoveryRegistration is risky for a manual graceful switchover. If an automated recovery is already active for the cluster, this call will return the existing recovery object and the switchover logic will proceed to demote the primary while another recovery is in progress. It is safer to use true for failIfAlreadyRegistered to ensure the switchover only proceeds if the cluster is in a stable state.

gemini-code-assist · 2026-04-18T13:16:17Z

+// PostgreSQL primary to a designated standby. It sets the primary read-only,
+// waits for the standby to catch up, promotes the standby, reconfigures
+// remaining standbys, and configures the demoted primary for repositioning.
+func PostgreSQLGracefulPrimarySwitchover(clusterName string, designatedKey *inst.InstanceKey, auto bool) (topologyRecovery *TopologyRecovery, err error) {


The function signature for PostgreSQLGracefulPrimarySwitchover does not match the design specification in docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md, which states it should return (*TopologyRecovery, *inst.BinlogCoordinates, error). The current implementation only returns (*TopologyRecovery, error), causing the dispatch in go/logic/topology_recovery.go to return nil for the promoted master coordinates.

renecannao added 12 commits April 18, 2026 09:25

docs: add implementation plan for PostgreSQL graceful primary switchover

31ffa1b

9 tasks covering instance operations, orchestration, CLI/API dispatch, functional tests, documentation updates, and GitHub issue creation.

feat(postgresql): add PostgreSQLSetReadOnly for graceful switchover

44c12e9

feat(postgresql): add PostgreSQLGetCurrentWALLSN for graceful switchover

51ce741

Returns the current WAL write position (LSN) of a PostgreSQL primary instance, needed to synchronize standbys before promotion during graceful switchover.

feat(postgresql): add PostgreSQLWaitForStandbyLSN for graceful switch…

d6c4a9f

…over Polls a PostgreSQL standby until its replay LSN reaches or exceeds the target LSN, or the timeout expires. Used during graceful switchover to ensure the standby has caught up before promotion.

feat(postgresql): add PostgreSQLRepositionAsStandby for graceful swit…

1dfea47

…chover Configures a demoted PostgreSQL primary to replicate from a new primary by updating primary_conninfo via ALTER SYSTEM. The operator handles the actual restart and standby.signal creation via hooks.

feat(postgresql): add PostgreSQLGracefulPrimarySwitchover orchestration

19b510b

feat(postgresql): dispatch graceful-master-takeover to PostgreSQL swi…

ae749c7

…tchover When the cluster master is PostgreSQL (ProviderType == "postgresql"), GracefulMasterTakeover now dispatches to PostgreSQLGracefulPrimarySwitchover instead of running the MySQL-specific code path.

test(postgresql): add functional test for graceful primary switchover

2ea800e

Runs graceful-master-takeover-auto via the API before the destructive failover test. Verifies the primary changes after switchover.

docs: update PostgreSQL docs to reflect graceful switchover support

a7c0456

coderabbitai Bot reviewed Apr 18, 2026

View reviewed changes

gemini-code-assist Bot reviewed Apr 18, 2026

View reviewed changes

renecannao merged commit 4457c7a into master Apr 18, 2026
13 of 17 checks passed

renecannao mentioned this pull request Apr 18, 2026

fix(tests): detect PG graceful switchover at PostgreSQL level #100

Open

3 tasks


		---

		### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation

	### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation
	## Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation

		escapedConnInfo := strings.ReplaceAll(newConnInfo, "'", "''")
		_, err = db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", escapedConnInfo))

Conversation

renecannao commented Apr 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

renecannao commented Apr 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 18, 2026 •

edited

Loading