Skip to content

feat(postgresql): graceful primary switchover#99

Merged
renecannao merged 12 commits intomasterfrom
feat/postgresql-graceful-switchover
Apr 18, 2026
Merged

feat(postgresql): graceful primary switchover#99
renecannao merged 12 commits intomasterfrom
feat/postgresql-graceful-switchover

Conversation

@renecannao
Copy link
Copy Markdown

@renecannao renecannao commented Apr 18, 2026

Summary

Adds planned/graceful primary switchover for PostgreSQL, reusing the existing graceful-master-takeover / graceful-master-takeover-auto CLI commands and API endpoints.

Switchover flow (implemented in go/logic/topology_recovery_postgresql.go):

  1. Validate cluster / designated standby (PostgreSQLGetBestStandbyForPromotion when auto=true)
  2. Run PreGracefulTakeoverProcesses
  3. Set primary read-only (ALTER SYSTEM + terminate non-replication backends)
  4. Capture primary WAL LSN
  5. Wait for the designated standby to catch up to that LSN
  6. Promote the designated standby (pg_promote())
  7. Reconfigure remaining standbys to point at the new primary
  8. Set primary_conninfo on the demoted primary (no restart)
  9. Run PostGracefulTakeoverProcesses

The demoted primary requires an operator-managed restart with standby.signal — automate via PostGracefulTakeoverProcesses hooks.

New inst operations (go/inst/instance_topology_postgresql.go):

  • PostgreSQLSetReadOnly
  • PostgreSQLGetCurrentWALLSN
  • PostgreSQLWaitForStandbyLSN
  • PostgreSQLRepositionAsStandby

Docs: docs/database-providers.md, docs/user-manual.md, and design/plan specs under docs/superpowers/.

Tests: functional test in tests/functional/test-postgresql.sh covers

  • Basic graceful switchover (API returns OK, primary changes)
  • Deeper verification (new primary accepts writes, demoted primary has primary_conninfo)
  • Negative cases (bogus cluster, bogus designated host)
  • Round-trip (switch back) with operator-hook simulation
  • Unchanged: dead-primary failover still works after the switchover sequence

CI: removed continue-on-error from the PostgreSQL functional job — PG regressions now fail the build.

Test plan

  • Functional tests pass on PostgreSQL 15, 16, 17 (GHA matrix)
  • MySQL graceful takeover remains unaffected (regression)
  • Dead-primary failover still triggers after the switchover + round-trip
  • Unit tests pass (go test ./go/...)

Summary by CodeRabbit

  • New Features

    • PostgreSQL graceful primary switchover is now supported via CLI/API (graceful-master-takeover and graceful-master-takeover-auto commands)
  • Documentation

    • Updated PostgreSQL documentation describing the switchover workflow and operational setup requirements
    • Added design specification for planned primary switchover capability
  • Tests

    • Added comprehensive functional tests validating PostgreSQL graceful switchover scenarios
    • PostgreSQL test failures now properly fail the CI job instead of continuing

Covers planned/graceful switchover support for PostgreSQL, including
new instance operations (set read-only, WAL LSN catch-up, reposition
as standby), orchestration flow, and CLI/API dispatch via the existing
graceful-master-takeover commands.
9 tasks covering instance operations, orchestration, CLI/API dispatch,
functional tests, documentation updates, and GitHub issue creation.
Returns the current WAL write position (LSN) of a PostgreSQL primary
instance, needed to synchronize standbys before promotion during
graceful switchover.
…over

Polls a PostgreSQL standby until its replay LSN reaches or exceeds the
target LSN, or the timeout expires. Used during graceful switchover to
ensure the standby has caught up before promotion.
…chover

Configures a demoted PostgreSQL primary to replicate from a new primary
by updating primary_conninfo via ALTER SYSTEM. The operator handles the
actual restart and standby.signal creation via hooks.
…tchover

When the cluster master is PostgreSQL (ProviderType == "postgresql"),
GracefulMasterTakeover now dispatches to PostgreSQLGracefulPrimarySwitchover
instead of running the MySQL-specific code path.
Runs graceful-master-takeover-auto via the API before the destructive
failover test. Verifies the primary changes after switchover.
- Escape single quotes in primary_conninfo to prevent SQL breakage
  when PostgreSQLTopologyPassword contains quotes (both
  PostgreSQLReconfigureStandby and PostgreSQLRepositionAsStandby)
- Move AttemptRecoveryRegistration before mutations to prevent
  concurrent switchover races
- Use 3x ReasonableMaintenanceReplicationLagSeconds for catch-up
  timeout since the primary is already read-only
Adds deeper verification of the graceful switchover result (new primary
is writable, demoted primary has primary_conninfo set), negative-case
tests (bogus cluster, bogus designated host), and a round-trip that
reactivates the demoted primary as a live standby and switches back.

Removes continue-on-error from the PostgreSQL functional job so PG
test regressions now fail CI.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 18, 2026

📝 Walkthrough

Walkthrough

Implements PostgreSQL graceful primary switchover by introducing four instance-level operations (set read-only, capture WAL LSN, wait for standby replication, reposition as standby) and orchestration logic that validates topology, coordinates the switchover process, and routes PostgreSQL clusters through dedicated switchover logic.

Changes

Cohort / File(s) Summary
Documentation & Specs
docs/database-providers.md, docs/user-manual.md, docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md, docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md
Updated PostgreSQL documentation to declare graceful switchover now supported; added design spec and implementation plan describing the switchover flow (read-only → WAL capture → standby catch-up → promotion → reconfiguration) and PostGracefulTakeoverProcesses hook responsibility for demoted-primary restart.
PostgreSQL Instance Operations
go/inst/instance_topology_postgresql.go, go/inst/instance_topology_postgresql_test.go
Added four exported functions (PostgreSQLSetReadOnly, PostgreSQLGetCurrentWALLSN, PostgreSQLWaitForStandbyLSN, PostgreSQLRepositionAsStandby) and their input-validation tests; fixed PostgreSQLReconfigureStandby to escape single quotes in primary_conninfo SQL generation.
Orchestration & Routing
go/logic/topology_recovery.go, go/logic/topology_recovery_postgresql.go
Modified GracefulMasterTakeover to route PostgreSQL clusters to new PostgreSQLGracefulPrimarySwitchover function; implemented switchover orchestration with topology validation, standby selection (explicit/auto), pre/post hooks, read-only enforcement, LSN synchronization, promotion, replication reconfiguration, and explicit rollback on failure.
CI & Testing
.github/workflows/functional.yml, tests/functional/test-postgresql.sh
Removed continue-on-error: true from PostgreSQL test step; expanded functional tests to validate graceful switchover via API, verify primary identity changes, confirm write capability, check demoted primary reposition, and test round-trip switchback with standby.signal restart sequence.

Sequence Diagram

sequenceDiagram
    participant Orch as Orchestrator
    participant Prime as Current Primary
    participant Stby as Designated Standby
    participant OthStby as Other Standbys

    Orch->>Prime: Set read-only (ALTER SYSTEM)
    Prime->>Prime: Terminate non-replication connections
    Orch->>Prime: Capture current WAL LSN (pg_current_wal_lsn)
    Prime-->>Orch: Return target LSN

    rect rgba(100, 150, 200, 0.5)
        Orch->>Stby: Poll pg_last_wal_replay_lsn() >= target
        Stby-->>Orch: LSN reached / timeout
    end

    Orch->>Stby: Promote standby (pg_promote)
    Orch->>OthStby: Reconfigure primary_conninfo (ALTER SYSTEM)
    Orch->>Prime: Set primary_conninfo to new primary (ALTER SYSTEM)
    
    Orch-->>Orch: Register TopologyRecovery<br/>Run PostGracefulTakeoverProcesses
    Orch->>Prime: (External) Restart with standby.signal
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related issues

Possibly related PRs

Poem

🐰 A switchover smooth, from primary to standby's grace,
Replication LSNs race to keep the pace,
Read-only guards the window tight,
Promotion swift—the torch burns bright!
PostgreSQL planned, no crash in sight! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 64.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(postgresql): graceful primary switchover' directly and accurately describes the main change: implementing graceful primary switchover functionality for PostgreSQL.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/postgresql-graceful-switchover

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/functional/test-postgresql.sh (1)

295-331: ⚠️ Potential issue | 🟠 Major

Don’t hard-code pgprimary unless the round-trip restored it as primary.

Because this script runs without set -e, a failed round-trip still falls through to Line 331 and stops pgprimary, even if the current primary is still pgstandby1. Gate the dead-primary section on FINAL_PRIMARY being pgprimary, or dynamically stop the actual current primary container.

Test-flow guard suggestion
             if [ "$FINAL_PRIMARY" = "172.30.0.20" ] || [ "$FINAL_PRIMARY" = "pgprimary" ]; then
                 pass "Round-trip complete: pgprimary is primary again"
+                DEAD_PRIMARY_CONTAINER="pgprimary"
             else
                 fail "Round-trip incomplete: primary is '$FINAL_PRIMARY' (expected pgprimary)"
+                DEAD_PRIMARY_CONTAINER=""
             fi
@@
 echo "--- Failover test: kill current primary ---"
@@
-echo "Stopping pgprimary container..."
-$COMPOSE stop pgprimary
+if [ -z "${DEAD_PRIMARY_CONTAINER:-}" ]; then
+    skip "Dead-primary test skipped — pgprimary was not restored as primary"
+else
+    echo "Stopping ${DEAD_PRIMARY_CONTAINER} container..."
+    $COMPOSE stop "$DEAD_PRIMARY_CONTAINER"
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-postgresql.sh` around lines 295 - 331, The script
currently always proceeds to "Stopping pgprimary container..." even when the
round-trip did not restore pgprimary; update the dead-primary section to guard
on FINAL_PRIMARY or stop the actual current primary: check the FINAL_PRIMARY
variable (and/or its resolved container name) before stopping pgprimary, e.g. if
[ "$FINAL_PRIMARY" = "172.30.0.20" ] || [ "$FINAL_PRIMARY" = "pgprimary" ] then
stop pgprimary else stop the container corresponding to FINAL_PRIMARY (e.g.
derive container name like "pgstandby1" from the IP or use FINAL_PRIMARY
directly) so the stop command targets the actual primary instead of hard-coding
pgprimary; adjust the echo and stop/restart invocation that currently refer to
"pgprimary" to use the chosen container variable.
go/inst/instance_topology_postgresql.go (1)

100-115: ⚠️ Potential issue | 🟠 Major

Escape primary_conninfo values at the libpq layer, not just the SQL layer.

strings.ReplaceAll(..., "'", "''") protects the outer SQL literal, but primary_conninfo itself is a libpq keyword/value string. Passwords, usernames, or other values containing spaces, backslashes, or quotes must be quoted and escaped according to libpq rules (\' for quotes, \\ for backslashes), not just SQL rules. Build conninfo with per-value quoting/escaping, then escape the resulting SQL literal.

This issue occurs in two places: lines 101-115 and lines 313-318. Consider extracting the conninfo construction into a helper function to avoid duplication.

Safer helper-based construction
+func postgreSQLConnInfoValue(value string) string {
+	if value == "" || strings.ContainsAny(value, " \t\r\n'\\") {
+		value = strings.ReplaceAll(value, `\`, `\\`)
+		value = strings.ReplaceAll(value, `'`, `\'`)
+		return "'" + value + "'"
+	}
+	return value
+}
+
+func postgreSQLPrimaryConnInfo(newPrimaryKey *InstanceKey) string {
+	return fmt.Sprintf(
+		"host=%s port=%d user=%s password=%s sslmode=%s application_name=%s",
+		postgreSQLConnInfoValue(newPrimaryKey.Hostname),
+		newPrimaryKey.Port,
+		postgreSQLConnInfoValue(config.Config.PostgreSQLTopologyUser),
+		postgreSQLConnInfoValue(config.Config.PostgreSQLTopologyPassword),
+		postgreSQLConnInfoValue(config.Config.PostgreSQLSSLMode),
+		postgreSQLConnInfoValue("orchestrator"),
+	)
+}
+
+func postgreSQLStringLiteral(value string) string {
+	return strings.ReplaceAll(value, "'", "''")
+}
+
 	// Build new primary_conninfo
-	newConnInfo := fmt.Sprintf(
-		"host=%s port=%d user=%s password=%s sslmode=%s application_name=orchestrator",
-		newPrimaryKey.Hostname,
-		newPrimaryKey.Port,
-		config.Config.PostgreSQLTopologyUser,
-		config.Config.PostgreSQLTopologyPassword,
-		config.Config.PostgreSQLSSLMode,
-	)
+	newConnInfo := postgreSQLPrimaryConnInfo(newPrimaryKey)
@@
-	escapedConnInfo := strings.ReplaceAll(newConnInfo, "'", "''")
+	escapedConnInfo := postgreSQLStringLiteral(newConnInfo)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go/inst/instance_topology_postgresql.go` around lines 100 - 115, The conninfo
string (newConnInfo) is being SQL-escaped but not libpq-escaped; change
construction in PostgreSQLReconfigureStandby to build each libpq key=value
element with proper libpq quoting/escaping (escape single quote as \' and
backslash as \\ and wrap values containing spaces or special chars in single
quotes), extract that logic into a helper (e.g., buildLibpqConnInfo or
quoteLibpqValue) and use it to produce the conninfo, then still SQL-escape the
final conninfo for the ALTER SYSTEM statement (escapedConnInfo =
strings.ReplaceAll(conninfo, "'", "''")); apply the same helper at the second
occurrence mentioned to remove duplication.
🧹 Nitpick comments (1)
docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md (1)

86-90: Add defensive validation for readOnlyValue.

While readOnlyValue is currently controlled (only "on" or "off"), adding an explicit validation check would make the code more defensive against future modifications and clearer about the expected values.

🛡️ Proposed defensive check
 	readOnlyValue := "off"
 	if readOnly {
 		readOnlyValue = "on"
 	}
+	if readOnlyValue != "on" && readOnlyValue != "off" {
+		return nil, fmt.Errorf("PostgreSQLSetReadOnly: invalid readOnlyValue: %s", readOnlyValue)
+	}
 
 	log.Infof("PostgreSQLSetReadOnly: setting default_transaction_read_only = %s on %+v", readOnlyValue, *instanceKey)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around
lines 86 - 90, The code in PostgreSQLSetReadOnly uses readOnlyValue directly in
the ALTER SYSTEM SQL; add a defensive validation before db.Exec to ensure
readOnlyValue is only "on" or "off" (or whatever canonical set you expect) and
return a descriptive error via log.Errore if it is invalid; locate the log.Infof
call and the subsequent db.Exec(fmt.Sprintf("ALTER SYSTEM SET
default_transaction_read_only = %s", readOnlyValue)) and insert the validation
check there to short-circuit on bad values and avoid executing malformed SQL.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/database-providers.md`:
- Around line 183-190: Update the PostgreSQL graceful takeover docs to state
that in addition to the pg_monitor role, the orchestrator user must have
superuser privileges or equivalent role-based permissions to run ALTER SYSTEM,
call pg_reload_conf(), and execute pg_terminate_backend(); mention these
specific operations and recommend granting either superuser or targeted
role-based privileges (or extension of pg_monitor) before using the CLI/API
endpoints (`graceful-master-takeover`, `graceful-master-takeover-auto`), and add
a note explaining that an operator-managed restart (triggered via
`PostGracefulTakeoverProcesses`) is still required for the demoted primary.

In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md`:
- Line 15: Change the task headings from level 3 to level 2 so heading levels
increment correctly: replace each "### Task N: `..." with "## Task N: `..." for
all task headings (e.g., "Task 1: `PostgreSQLSetReadOnly` — Unit Test +
Implementation" through Tasks 1-9) so the document goes H1 → H2 → H3 correctly.
- Around line 507-525: Move the AttemptRecoveryRegistration call so it executes
before mutating the topology (before calling inst.PostgreSQLPromoteStandby);
i.e., call AttemptRecoveryRegistration(&analysisEntry, false, false) and
set/initialize topologyRecovery (with the same error-handling fallback) prior to
invoking PostgreSQLPromoteStandby, then after promotion assign
topologyRecovery.SuccessorKey = &promotedInstance.Key; keep the
PostgreSQLSetReadOnly undo logic and all existing error branches but reorder
them so registration happens before the promotion to avoid races.
- Around line 700-736: Add explicit language specifiers to the fenced code
blocks that contain the PostgreSQL switchover text so markdown renderers apply
proper highlighting: replace the three-backtick fences before the block
containing "Graceful master takeover (planned switchover)..." and before "###
Graceful Primary Switchover", and the two occurrences around "Graceful master
takeover is not yet supported for PostgreSQL." / "Graceful master takeover is
supported for PostgreSQL..." with language-tagged fences (e.g., ```markdown).
Locate these blocks by searching for the exact phrases "Graceful master takeover
(planned switchover) is not yet implemented for PostgreSQL. Only unplanned
failover (dead primary) is supported.", "### Graceful Primary Switchover", and
"Graceful master takeover is not yet supported for PostgreSQL." and update their
opening ``` to ```markdown.
- Around line 338-351: The ALTER SYSTEM command interpolates newConnInfo
directly and can break or be exploited if fields contain single quotes; before
calling db.Exec in PostgreSQLRepositionAsStandby, escape single quotes in
newConnInfo (e.g., replace every "'" with "''") and use the escaped value when
building the SQL string for primary_conninfo; ensure the strings package is
imported and update the db.Exec call that sets primary_conninfo to use the
escapedConnInfo variable.

In `@docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md`:
- Around line 18-85: Update the design doc to match the implementation: change
the "three new functions" header to "four new functions" and ensure all four
(PostgreSQLSetReadOnly, PostgreSQLGetCurrentWALLSN, PostgreSQLWaitForStandbyLSN,
PostgreSQLRepositionAsStandby) are listed; change the
PostgreSQLGracefulPrimarySwitchover signature description to match the actual
return values implemented (adjust the noted return tuple to the real types/count
used in PostgreSQLGracefulPrimarySwitchover); and update the dispatch snippet to
match the code's provider check (use ProviderType == "postgresql" early return
logic as implemented rather than the inst.PostgreSQL variant) so the spec aligns
exactly with the code paths and symbols named in the repo.

In `@go/logic/topology_recovery_postgresql.go`:
- Around line 239-243: The code currently logs failures from
inst.PostgreSQLRepositionAsStandby(...) but switchover still returns success;
change this to surface the error: when
PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key)
returns a non-nil err, call AuditTopologyRecovery with the error and then
propagate the error out of the surrounding function (or mark clusterPrimary as
lost in topologyRecovery and return a wrapped error) instead of only logging;
update the surrounding function signature/return path to return that error so
callers see the failure (refer to PostgreSQLRepositionAsStandby,
AuditTopologyRecovery, topologyRecovery, clusterPrimary, promotedInstance).
- Around line 180-223: After AttemptRecoveryRegistration succeeds you must
register a deferred failure resolver to always call resolveRecovery(...) on
error paths and avoid leaving topologyRecovery unresolved; add a defer that
checks a local resolved bool and if false calls
resolveRecovery(topologyRecovery) (and AuditTopologyRecovery as appropriate),
ensure you only set topologyRecovery.SuccessorKey = &designatedStandby.Key after
successful pre-promotion steps or clear it before any early return, and mark
resolved = true immediately after a successful resolveRecovery(...) so the defer
skips double-resolution; key symbols: AttemptRecoveryRegistration,
topologyRecovery, SuccessorKey, resolveRecovery, AuditTopologyRecovery.

In `@go/logic/topology_recovery.go`:
- Around line 2124-2128: The CLI panics when PostgreSQLGracefulPrimarySwitchover
returns nil promotedMasterCoordinates because handlers in go/app/cli.go
unconditionally dereference *promotedMasterCoordinates; update the CLI handlers
that call GracefulMasterTakeover (the blocks around the current dereferences of
promotedMasterCoordinates) to check for nil before dereferencing and handle the
nil case (e.g., print a message or omit coordinates) or alternatively ensure
PostgreSQLGracefulPrimarySwitchover always returns a non-nil coordinates struct;
reference PostgreSQLGracefulPrimarySwitchover, GracefulMasterTakeover, and the
promotedMasterCoordinates variable to locate the code to change.

---

Outside diff comments:
In `@go/inst/instance_topology_postgresql.go`:
- Around line 100-115: The conninfo string (newConnInfo) is being SQL-escaped
but not libpq-escaped; change construction in PostgreSQLReconfigureStandby to
build each libpq key=value element with proper libpq quoting/escaping (escape
single quote as \' and backslash as \\ and wrap values containing spaces or
special chars in single quotes), extract that logic into a helper (e.g.,
buildLibpqConnInfo or quoteLibpqValue) and use it to produce the conninfo, then
still SQL-escape the final conninfo for the ALTER SYSTEM statement
(escapedConnInfo = strings.ReplaceAll(conninfo, "'", "''")); apply the same
helper at the second occurrence mentioned to remove duplication.

In `@tests/functional/test-postgresql.sh`:
- Around line 295-331: The script currently always proceeds to "Stopping
pgprimary container..." even when the round-trip did not restore pgprimary;
update the dead-primary section to guard on FINAL_PRIMARY or stop the actual
current primary: check the FINAL_PRIMARY variable (and/or its resolved container
name) before stopping pgprimary, e.g. if [ "$FINAL_PRIMARY" = "172.30.0.20" ] ||
[ "$FINAL_PRIMARY" = "pgprimary" ] then stop pgprimary else stop the container
corresponding to FINAL_PRIMARY (e.g. derive container name like "pgstandby1"
from the IP or use FINAL_PRIMARY directly) so the stop command targets the
actual primary instead of hard-coding pgprimary; adjust the echo and
stop/restart invocation that currently refer to "pgprimary" to use the chosen
container variable.

---

Nitpick comments:
In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md`:
- Around line 86-90: The code in PostgreSQLSetReadOnly uses readOnlyValue
directly in the ALTER SYSTEM SQL; add a defensive validation before db.Exec to
ensure readOnlyValue is only "on" or "off" (or whatever canonical set you
expect) and return a descriptive error via log.Errore if it is invalid; locate
the log.Infof call and the subsequent db.Exec(fmt.Sprintf("ALTER SYSTEM SET
default_transaction_read_only = %s", readOnlyValue)) and insert the validation
check there to short-circuit on bad values and avoid executing malformed SQL.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e088dff-bc56-4ee5-8b1f-aa81497ff5c9

📥 Commits

Reviewing files that changed from the base of the PR and between 4c92815 and a594961.

📒 Files selected for processing (10)
  • .github/workflows/functional.yml
  • docs/database-providers.md
  • docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md
  • docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md
  • docs/user-manual.md
  • go/inst/instance_topology_postgresql.go
  • go/inst/instance_topology_postgresql_test.go
  • go/logic/topology_recovery.go
  • go/logic/topology_recovery_postgresql.go
  • tests/functional/test-postgresql.sh
💤 Files with no reviewable changes (1)
  • .github/workflows/functional.yml

Comment on lines +183 to +190
- **Graceful master takeover** (planned switchover) is supported for PostgreSQL.
The same CLI commands and API endpoints (`graceful-master-takeover`,
`graceful-master-takeover-auto`) work for PostgreSQL clusters. The switchover
sets the primary read-only, waits for the designated standby to catch up via
WAL LSN, promotes the standby, reconfigures remaining standbys, and configures
the demoted primary's `primary_conninfo`. The demoted primary requires an
operator-managed restart with `standby.signal` — use
`PostGracefulTakeoverProcesses` hooks to automate this step.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PostgreSQL privileges required for ALTER SYSTEM pg_reload_conf pg_terminate_backend pg_signal_backend

💡 Result:

PostgreSQL privileges required: - ALTER SYSTEM: Superuser or users granted the ALTER SYSTEM privilege on the specific configuration parameter (via GRANT ALTER SYSTEM ON PARAMETER ...). {value: Superuser or ALTER SYSTEM privilege on parameter, citations: [1,2,4], confidence: high} - pg_reload_conf: Superuser or EXECUTE privilege granted on the pg_catalog.pg_reload_conf function (via GRANT EXECUTE ON FUNCTION pg_catalog.pg_reload_conf TO role;). These signaling functions are restricted to superusers by default but access may be granted using GRANT. {value: Superuser or EXECUTE on pg_reload_conf, citations: [6,8,11], confidence: high} - pg_terminate_backend(pid): Superuser, or the calling role is a member of the target backend's role, or has privileges of the pg_signal_backend role. Only superusers can terminate superuser backends. {value: Superuser, member of target role, or pg_signal_backend role, citations: [6,7,11], confidence: high} - pg_signal_backend(pid): This appears to refer to the pg_signal_backend predefined role (not a function). Membership in pg_signal_backend role allows signaling (cancel/terminate) other backends (via pg_cancel_backend/pg_terminate_backend). Superusers bypass this. {value: pg_signal_backend role membership, citations: [6,7,13], confidence: high}

Citations:


🏁 Script executed:

# First, let's read the specific documentation section mentioned
head -200 docs/database-providers.md | tail -50

Repository: ProxySQL/orchestrator

Length of output: 2911


🏁 Script executed:

# Search for pg_monitor mentions in the documentation
rg "pg_monitor" docs/database-providers.md -B 5 -A 5

Repository: ProxySQL/orchestrator

Length of output: 688


🏁 Script executed:

# Find the PostgreSQL instance topology implementation
fd "instance_topology_postgresql" --type f

Repository: ProxySQL/orchestrator

Length of output: 149


🏁 Script executed:

# Search for ALTER SYSTEM, pg_reload_conf, pg_terminate_backend in codebase
rg "ALTER SYSTEM|pg_reload_conf|pg_terminate_backend" --type go -B 2 -A 2

Repository: ProxySQL/orchestrator

Length of output: 5706


🏁 Script executed:

# Check the exact context around the pg_monitor requirement to see where prerequisites are documented
rg -B 10 "pg_monitor role:" docs/database-providers.md

Repository: ProxySQL/orchestrator

Length of output: 47


🏁 Script executed:

# Verify if there are any existing privilege requirements documented for graceful takeover nearby
rg -A 20 "Graceful master takeover" docs/database-providers.md | head -40

Repository: ProxySQL/orchestrator

Length of output: 1039


Add privilege requirements to PostgreSQL graceful takeover documentation.

The graceful takeover implementation requires additional PostgreSQL privileges beyond pg_monitor. The operations (ALTER SYSTEM, pg_reload_conf(), pg_terminate_backend()) require superuser access or equivalent role-based privileges. Update the documentation to specify these requirements so operators can configure the orchestrator user with the necessary permissions before attempting graceful takeover.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/database-providers.md` around lines 183 - 190, Update the PostgreSQL
graceful takeover docs to state that in addition to the pg_monitor role, the
orchestrator user must have superuser privileges or equivalent role-based
permissions to run ALTER SYSTEM, call pg_reload_conf(), and execute
pg_terminate_backend(); mention these specific operations and recommend granting
either superuser or targeted role-based privileges (or extension of pg_monitor)
before using the CLI/API endpoints (`graceful-master-takeover`,
`graceful-master-takeover-auto`), and add a note explaining that an
operator-managed restart (triggered via `PostGracefulTakeoverProcesses`) is
still required for the demoted primary.


---

### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use h2 (##) for task headings instead of h3 (###).

The task headings jump from h1 (line 1) to h3 (line 15), skipping h2. Markdown best practices require heading levels to increment by one level at a time.

📝 Proposed fix

Change all task headings from h3 to h2:

-### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation
+## Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation

Apply this change to all task headings (Tasks 1-9).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation
## Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 15-15: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` at line
15, Change the task headings from level 3 to level 2 so heading levels increment
correctly: replace each "### Task N: `..." with "## Task N: `..." for all task
headings (e.g., "Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation"
through Tasks 1-9) so the document goes H1 → H2 → H3 correctly.

Comment on lines +338 to +351
newConnInfo := fmt.Sprintf(
"host=%s port=%d user=%s password=%s sslmode=%s application_name=orchestrator",
newPrimaryKey.Hostname,
newPrimaryKey.Port,
config.Config.PostgreSQLTopologyUser,
config.Config.PostgreSQLTopologyPassword,
config.Config.PostgreSQLSSLMode,
)

log.Infof("PostgreSQLRepositionAsStandby: configuring %+v to replicate from %+v", *instanceKey, *newPrimaryKey)

if _, err := db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", newConnInfo)); err != nil {
return log.Errore(fmt.Errorf("PostgreSQLRepositionAsStandby: ALTER SYSTEM failed on %+v: %v", *instanceKey, err))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Escape single quotes in primary_conninfo to prevent SQL errors and injection.

The newConnInfo string is interpolated directly into the SQL command without escaping. If newPrimaryKey.Hostname or config.Config.PostgreSQLTopologyPassword contain single quotes, the ALTER SYSTEM command will fail or could be vulnerable to SQL injection.

The PR objectives note this was fixed post-review: "escape single quotes in primary_conninfo". The plan should be updated to show the correct implementation.

🔒 Proposed fix with quote escaping
 	newConnInfo := fmt.Sprintf(
 		"host=%s port=%d user=%s password=%s sslmode=%s application_name=orchestrator",
 		newPrimaryKey.Hostname,
 		newPrimaryKey.Port,
 		config.Config.PostgreSQLTopologyUser,
 		config.Config.PostgreSQLTopologyPassword,
 		config.Config.PostgreSQLSSLMode,
 	)
+	// Escape single quotes for SQL string literal
+	newConnInfo = strings.ReplaceAll(newConnInfo, "'", "''")
 
 	log.Infof("PostgreSQLRepositionAsStandby: configuring %+v to replicate from %+v", *instanceKey, *newPrimaryKey)
 
 	if _, err := db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", newConnInfo)); err != nil {

Note: Add "strings" to imports if not already present.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around
lines 338 - 351, The ALTER SYSTEM command interpolates newConnInfo directly and
can break or be exploited if fields contain single quotes; before calling
db.Exec in PostgreSQLRepositionAsStandby, escape single quotes in newConnInfo
(e.g., replace every "'" with "''") and use the escaped value when building the
SQL string for primary_conninfo; ensure the strings package is imported and
update the db.Exec call that sets primary_conninfo to use the escapedConnInfo
variable.

Comment on lines +507 to +525
// --- Promote designated standby ---
promotedInstance, err := inst.PostgreSQLPromoteStandby(&designatedStandby.Key)
if err != nil {
// Undo read-only before aborting
_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: promotion of %+v failed: %v", designatedStandby.Key, err)
}
log.Infof("PostgreSQLGracefulPrimarySwitchover: promoted %+v to primary", promotedInstance.Key)

// --- Register recovery ---
topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
if err != nil || topologyRecovery == nil {
_ = log.Warningf("PostgreSQLGracefulPrimarySwitchover: error registering recovery: %v", err)
// Continue — promotion already happened
topologyRecovery = &TopologyRecovery{
AnalysisEntry: analysisEntry,
}
}
topologyRecovery.SuccessorKey = &promotedInstance.Key
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Race condition: Move AttemptRecoveryRegistration before promotion to avoid registration races.

AttemptRecoveryRegistration (line 517) occurs after the standby is promoted (line 508). This creates a window where the topology has mutated but the recovery isn't registered, potentially leading to inconsistent state or duplicate recovery attempts.

The PR objectives confirm this issue was fixed post-review: "move AttemptRecoveryRegistration before mutations to avoid races". The plan should reflect the corrected order.

📋 Proposed fix to move registration before mutations

Move the registration block before the promotion step:

 	}
 
-	// --- Promote designated standby ---
+	// --- Register recovery ---
+	topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
+	if err != nil || topologyRecovery == nil {
+		// Undo read-only before aborting
+		_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
+		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: recovery registration failed: %v", err)
+	}
+
+	// --- Promote designated standby ---
 	promotedInstance, err := inst.PostgreSQLPromoteStandby(&designatedStandby.Key)
 	if err != nil {
 		// Undo read-only before aborting
 		_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
 		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: promotion of %+v failed: %v", designatedStandby.Key, err)
 	}
 	log.Infof("PostgreSQLGracefulPrimarySwitchover: promoted %+v to primary", promotedInstance.Key)
 
-	// --- Register recovery ---
-	topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
-	if err != nil || topologyRecovery == nil {
-		_ = log.Warningf("PostgreSQLGracefulPrimarySwitchover: error registering recovery: %v", err)
-		// Continue — promotion already happened
-		topologyRecovery = &TopologyRecovery{
-			AnalysisEntry: analysisEntry,
-		}
-	}
 	topologyRecovery.SuccessorKey = &promotedInstance.Key
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around
lines 507 - 525, Move the AttemptRecoveryRegistration call so it executes before
mutating the topology (before calling inst.PostgreSQLPromoteStandby); i.e., call
AttemptRecoveryRegistration(&analysisEntry, false, false) and set/initialize
topologyRecovery (with the same error-handling fallback) prior to invoking
PostgreSQLPromoteStandby, then after promotion assign
topologyRecovery.SuccessorKey = &promotedInstance.Key; keep the
PostgreSQLSetReadOnly undo logic and all existing error branches but reorder
them so registration happens before the promotion to avoid races.

Comment on lines +700 to +736
```
Graceful master takeover (planned switchover) is not yet implemented for PostgreSQL. Only unplanned failover (dead primary) is supported.
```

Replace with:
```
### Graceful Primary Switchover

Graceful master takeover (planned switchover) is supported for PostgreSQL. The same CLI commands
and API endpoints used for MySQL (`graceful-master-takeover`, `graceful-master-takeover-auto`) work
for PostgreSQL clusters. The switchover flow:

1. Sets the primary read-only via `ALTER SYSTEM SET default_transaction_read_only = on` and terminates
existing non-replication connections.
2. Captures the primary's current WAL LSN and waits for the designated standby to catch up.
3. Promotes the designated standby via `pg_promote()`.
4. Reconfigures remaining standbys to replicate from the new primary.
5. Configures the demoted primary's `primary_conninfo` to point at the new primary.

**Important:** The demoted primary requires a restart with `standby.signal` to complete its conversion
to a standby. Use `PostGracefulTakeoverProcesses` hooks to automate this step.
```

- [ ] **Step 2: Update user-manual.md**

Find the line (around line 699) that says:
```
Graceful master takeover is not yet supported for PostgreSQL.
```

Replace with:
```
Graceful master takeover is supported for PostgreSQL. Use `graceful-master-takeover` or
`graceful-master-takeover-auto` CLI commands, or the equivalent API endpoints. The demoted
primary requires an operator-managed restart with `standby.signal` — configure this via
`PostGracefulTakeoverProcesses` hooks.
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add language specifiers to fenced code blocks for better rendering.

The fenced code blocks at lines 700, 705, 726, and 731 lack language specifiers, which affects syntax highlighting and rendering in documentation viewers.

📝 Proposed fix to add language specifiers

Line 700:

 Find the line (around line 183-184) that says:
-```
+```markdown
 Graceful master takeover (planned switchover) is not yet implemented for PostgreSQL. Only unplanned failover (dead primary) is supported.

Line 705:
```diff
 Replace with:
-```
+```markdown
 ### Graceful Primary Switchover

Line 726:

 Find the line (around line 699) that says:
-```
+```markdown
 Graceful master takeover is not yet supported for PostgreSQL.

Line 731:
```diff
 Replace with:
-```
+```markdown
 Graceful master takeover is supported for PostgreSQL. Use `graceful-master-takeover` or
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 700-700: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 705-705: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 726-726: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 731-731: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around
lines 700 - 736, Add explicit language specifiers to the fenced code blocks that
contain the PostgreSQL switchover text so markdown renderers apply proper
highlighting: replace the three-backtick fences before the block containing
"Graceful master takeover (planned switchover)..." and before "### Graceful
Primary Switchover", and the two occurrences around "Graceful master takeover is
not yet supported for PostgreSQL." / "Graceful master takeover is supported for
PostgreSQL..." with language-tagged fences (e.g., ```markdown). Locate these
blocks by searching for the exact phrases "Graceful master takeover (planned
switchover) is not yet implemented for PostgreSQL. Only unplanned failover (dead
primary) is supported.", "### Graceful Primary Switchover", and "Graceful master
takeover is not yet supported for PostgreSQL." and update their opening ``` to
```markdown.

Comment on lines +18 to +85
Three new functions in `go/inst/instance_topology_postgresql.go`:

#### `PostgreSQLSetReadOnly(instanceKey *InstanceKey, readOnly bool) (*Instance, error)`

- Connects to the instance via SQL
- Executes `ALTER SYSTEM SET default_transaction_read_only = on|off`
- Calls `SELECT pg_reload_conf()`
- When setting read-only: queries `pg_stat_activity` for non-replication, non-orchestrator backends and calls `pg_terminate_backend()` on each to close the write window
- Re-reads and returns the instance

#### `PostgreSQLGetCurrentWALLSN(instanceKey *InstanceKey) (string, error)`

- Connects to the primary
- Returns result of `SELECT pg_current_wal_lsn()::text`

#### `PostgreSQLWaitForStandbyLSN(standbyKey *InstanceKey, targetLSN string, timeout time.Duration) error`

- Polls `SELECT pg_last_wal_replay_lsn()` on the standby
- Compares using `SELECT '%s'::pg_lsn >= '%s'::pg_lsn` (PostgreSQL-native LSN comparison)
- Polls every 500ms up to the configured timeout
- Returns error on timeout

#### `PostgreSQLRepositionAsStandby(instanceKey *InstanceKey, newPrimaryKey *InstanceKey) error`

- Connects to the old primary
- Sets `primary_conninfo` via `ALTER SYSTEM` pointing to the new primary (same logic as existing `PostgreSQLReconfigureStandby`)
- Calls `SELECT pg_reload_conf()`
- Does NOT restart the instance — the actual restart and `standby.signal` creation is the operator's responsibility via `PostGracefulTakeoverProcesses` hook

### Switchover Orchestration

New function `PostgreSQLGracefulPrimarySwitchover(clusterName string, designatedKey *InstanceKey, auto bool) (*TopologyRecovery, *inst.BinlogCoordinates, error)` in `go/logic/topology_recovery_postgresql.go`:

**Flow:**

1. **Validate** — Read cluster primary, read standbys, determine designated standby. When `auto=true`, use `PostgreSQLGetBestStandbyForPromotion()`. When manual, require the user to specify or have exactly one standby.
2. **Execute `PreGracefulTakeoverProcesses`** — Abort on failure.
3. **Set primary read-only** — `PostgreSQLSetReadOnly(primary, true)`. Terminates existing non-replication connections.
4. **Capture target LSN** — `PostgreSQLGetCurrentWALLSN(primary)`.
5. **Wait for standby catch-up** — `PostgreSQLWaitForStandbyLSN(designated, targetLSN, timeout)`.
6. **Promote designated standby** — `PostgreSQLPromoteStandby(designated)` (existing function).
7. **Reconfigure remaining standbys** — `PostgreSQLReconfigureStandby(standby, newPrimary)` for each non-designated standby (existing function).
8. **Configure demoted primary** — `PostgreSQLRepositionAsStandby(oldPrimary, newPrimary)`. Sets `primary_conninfo` only; operator handles restart via hook.
9. **Execute `PostGracefulTakeoverProcesses`**.
10. **Return** TopologyRecovery with successor key.

**Error handling:**

- If promotion fails after setting read-only: undo via `PostgreSQLSetReadOnly(primary, false)`, abort.
- If standby catch-up times out: undo read-only, abort with error.

**Note on flat topology:** PostgreSQL topologies are flat (no cascading replication detected), so there is no sibling relocation step (unlike MySQL which may need to reparent replicas under the designated replica).

### CLI/API Dispatch

No changes to `cli.go` or `api.go`. The existing commands and endpoints are reused.

**In `go/logic/topology_recovery.go`:**

Add a provider check in `GracefulMasterTakeover()`:

```go
if clusterMaster.DatabaseProvider == inst.PostgreSQL {
return PostgreSQLGracefulPrimarySwitchover(clusterName, designatedKey, auto)
}
```

Add a PostgreSQL branch in `getGracefulMasterTakeoverDesignatedInstance()` to use `PostgreSQLGetBestStandbyForPromotion()` when `auto=true`.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Bring the design spec back in sync with the implementation.

This spec says “three” functions while listing four, documents a three-return-value PostgreSQLGracefulPrimarySwitchover(...), and shows a dispatch branch that differs from the implemented ProviderType == "postgresql" early return. Please update the spec so it remains useful as implementation documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md`
around lines 18 - 85, Update the design doc to match the implementation: change
the "three new functions" header to "four new functions" and ensure all four
(PostgreSQLSetReadOnly, PostgreSQLGetCurrentWALLSN, PostgreSQLWaitForStandbyLSN,
PostgreSQLRepositionAsStandby) are listed; change the
PostgreSQLGracefulPrimarySwitchover signature description to match the actual
return values implemented (adjust the noted return tuple to the real types/count
used in PostgreSQLGracefulPrimarySwitchover); and update the dispatch snippet to
match the code's provider check (use ProviderType == "postgresql" early return
logic as implemented rather than the inst.PostgreSQL variant) so the spec aligns
exactly with the code paths and symbols named in the repo.

Comment on lines +180 to +223
topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
if topologyRecovery == nil {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("found an active or recent recovery on %+v. Will not issue another graceful switchover.", clusterName))
return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: active or recent recovery exists for %+v, aborting", clusterName)
}
topologyRecovery.SuccessorKey = &designatedStandby.Key

// --- Execute PreGracefulTakeoverProcesses ---
if err := executeProcesses(config.Config.PreGracefulTakeoverProcesses, "PreGracefulTakeoverProcesses", topologyRecovery, true); err != nil {
return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: PreGracefulTakeoverProcesses failed: %v", err)
}

// --- Set primary read-only ---
log.Infof("PostgreSQLGracefulPrimarySwitchover: setting %+v read-only", clusterPrimary.Key)
if _, err := inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, true); err != nil {
return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed to set read-only on %+v: %v", clusterPrimary.Key, err)
}

// --- Capture target LSN ---
targetLSN, err := inst.PostgreSQLGetCurrentWALLSN(&clusterPrimary.Key)
if err != nil {
// Undo read-only before aborting
_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed to get WAL LSN from %+v: %v", clusterPrimary.Key, err)
}
log.Infof("PostgreSQLGracefulPrimarySwitchover: target LSN is %s", targetLSN)

// --- Wait for standby to catch up ---
// Use 3x the maintenance lag threshold since the primary is already read-only
// and waiting longer only extends the maintenance window, not data risk.
catchUpTimeout := 3 * time.Duration(config.Config.ReasonableMaintenanceReplicationLagSeconds) * time.Second
if err := inst.PostgreSQLWaitForStandbyLSN(&designatedStandby.Key, targetLSN, catchUpTimeout); err != nil {
// Undo read-only before aborting
_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: standby catch-up failed: %v", err)
}

// --- Promote designated standby ---
promotedInstance, err := inst.PostgreSQLPromoteStandby(&designatedStandby.Key)
if err != nil {
// Undo read-only before aborting
_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: promotion of %+v failed: %v", designatedStandby.Key, err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Resolve the registered recovery on all abort paths.

After AttemptRecoveryRegistration, these early returns can leave an active/recent recovery record unresolved. Add a deferred failure resolver after registration, and avoid leaving SuccessorKey populated for failed pre-promotion attempts.

Failure cleanup pattern
 	topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
 	if topologyRecovery == nil {
 		AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("found an active or recent recovery on %+v. Will not issue another graceful switchover.", clusterName))
 		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: active or recent recovery exists for %+v, aborting", clusterName)
 	}
-	topologyRecovery.SuccessorKey = &designatedStandby.Key
+	resolved := false
+	defer func() {
+		if topologyRecovery != nil && !resolved && err != nil {
+			topologyRecovery.SuccessorKey = nil
+			_ = topologyRecovery.AddError(err)
+			_ = resolveRecovery(topologyRecovery, nil)
+		}
+	}()
+	topologyRecovery.SuccessorKey = &designatedStandby.Key
@@
-	if err := executeProcesses(config.Config.PreGracefulTakeoverProcesses, "PreGracefulTakeoverProcesses", topologyRecovery, true); err != nil {
-		return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: PreGracefulTakeoverProcesses failed: %v", err)
+	if err = executeProcesses(config.Config.PreGracefulTakeoverProcesses, "PreGracefulTakeoverProcesses", topologyRecovery, true); err != nil {
+		return topologyRecovery, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: PreGracefulTakeoverProcesses failed: %v", err)
 	}

Set resolved = true immediately after the successful resolveRecovery(...) call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go/logic/topology_recovery_postgresql.go` around lines 180 - 223, After
AttemptRecoveryRegistration succeeds you must register a deferred failure
resolver to always call resolveRecovery(...) on error paths and avoid leaving
topologyRecovery unresolved; add a defer that checks a local resolved bool and
if false calls resolveRecovery(topologyRecovery) (and AuditTopologyRecovery as
appropriate), ensure you only set topologyRecovery.SuccessorKey =
&designatedStandby.Key after successful pre-promotion steps or clear it before
any early return, and mark resolved = true immediately after a successful
resolveRecovery(...) so the defer skips double-resolution; key symbols:
AttemptRecoveryRegistration, topologyRecovery, SuccessorKey, resolveRecovery,
AuditTopologyRecovery.

Comment on lines +239 to +243
// --- Configure demoted primary for repositioning ---
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key))
if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Surface demoted-primary reconfiguration failure.

If PostgreSQLRepositionAsStandby fails, the old primary lacks the primary_conninfo that the documented restart hook depends on, but the switchover still returns success. Record this as an error and return it, or at least mark the demoted primary as lost so operators do not get a false OK.

Propagate the post-promotion configuration error
 	AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key))
 	if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil {
 		AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err))
+		topologyRecovery.LostReplicas.AddKey(clusterPrimary.Key)
+		_ = topologyRecovery.AddError(err)
+		return topologyRecovery, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed configuring demoted primary %+v: %v", clusterPrimary.Key, err)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// --- Configure demoted primary for repositioning ---
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key))
if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err))
}
// --- Configure demoted primary for repositioning ---
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key))
if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err))
topologyRecovery.LostReplicas.AddKey(clusterPrimary.Key)
_ = topologyRecovery.AddError(err)
return topologyRecovery, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed configuring demoted primary %+v: %v", clusterPrimary.Key, err)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go/logic/topology_recovery_postgresql.go` around lines 239 - 243, The code
currently logs failures from inst.PostgreSQLRepositionAsStandby(...) but
switchover still returns success; change this to surface the error: when
PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key)
returns a non-nil err, call AuditTopologyRecovery with the error and then
propagate the error out of the surrounding function (or mark clusterPrimary as
lost in topologyRecovery and return a wrapped error) instead of only logging;
update the surrounding function signature/return path to return that error so
callers see the failure (refer to PostgreSQLRepositionAsStandby,
AuditTopologyRecovery, topologyRecovery, clusterPrimary, promotedInstance).

Comment on lines +2124 to +2128
// PostgreSQL: dispatch to PostgreSQL-specific implementation
if clusterMaster.ProviderType == "postgresql" {
topologyRecovery, err := PostgreSQLGracefulPrimarySwitchover(clusterName, designatedKey, auto)
return topologyRecovery, nil, err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect callers of GracefulMasterTakeover and usages of the returned coordinates.
# Expectation: Callers either ignore the coordinates for PostgreSQL or guard nil before use.

rg -nP -C4 '\bGracefulMasterTakeover\s*\('
rg -nP -C4 '\bpromotedMasterCoordinates\b|\bpromoted.*Coordinates\b'

Repository: ProxySQL/orchestrator

Length of output: 16005


Fix nil pointer dereference in CLI when PostgreSQL graceful takeover returns nil coordinates.

The PostgreSQL dispatch in GracefulMasterTakeover returns nil for promotedMasterCoordinates. The API endpoint safely discards this value, but the CLI handlers in go/app/cli.go at lines 1518-1519 and 1534-1535 unconditionally dereference it with *promotedMasterCoordinates without nil guards. This will panic for PostgreSQL clusters invoked via CLI.

Either:

  • Return non-nil coordinates from PostgreSQLGracefulPrimarySwitchover, or
  • Add nil checks in CLI before dereferencing (lines 1518, 1519, 1534, 1535)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go/logic/topology_recovery.go` around lines 2124 - 2128, The CLI panics when
PostgreSQLGracefulPrimarySwitchover returns nil promotedMasterCoordinates
because handlers in go/app/cli.go unconditionally dereference
*promotedMasterCoordinates; update the CLI handlers that call
GracefulMasterTakeover (the blocks around the current dereferences of
promotedMasterCoordinates) to check for nil before dereferencing and handle the
nil case (e.g., print a message or omit coordinates) or alternatively ensure
PostgreSQLGracefulPrimarySwitchover always returns a non-nil coordinates struct;
reference PostgreSQLGracefulPrimarySwitchover, GracefulMasterTakeover, and the
promotedMasterCoordinates variable to locate the code to change.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements graceful primary switchover for PostgreSQL, introducing new instance operations for read-only mode, WAL LSN tracking, and primary repositioning. The orchestration logic is integrated into the existing GracefulMasterTakeover flow, supported by updated documentation and new functional tests. Feedback highlights the need for robust escaping in PostgreSQL connection strings and notes that pg_terminate_backend requires elevated privileges beyond the documented pg_monitor role. Additionally, the recovery registration should be tightened to prevent concurrent conflicts, and the orchestration function signature needs to align with the design specification's return types.

Comment on lines +114 to +115
escapedConnInfo := strings.ReplaceAll(newConnInfo, "'", "''")
_, err = db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", escapedConnInfo))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The primary_conninfo string is built using fmt.Sprintf which does not handle escaping of values that might contain spaces or special characters. While single quotes are escaped for the SQL statement, they are not properly escaped for the PostgreSQL connection string format itself (which uses backslashes for internal quotes). Furthermore, if the password contains spaces, it must be wrapped in single quotes within the connection string. This logic should be moved to a helper that correctly formats the connection string for PostgreSQL.

Comment on lines +221 to +228
_, err := db.Exec(`
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
AND backend_type NOT IN ('walsender', 'walreceiver', 'autovacuum worker', 'logical replication launcher')
AND backend_type <> 'background worker'
AND datname IS NOT NULL
`)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The pg_terminate_backend function requires pg_signal_backend or superuser privileges. The documentation in docs/database-providers.md only mentions granting pg_monitor, which is insufficient for this operation. Additionally, the termination of backends is critical for ensuring a zero-data-loss switchover by closing the write window; therefore, failure to terminate backends should likely be treated as a fatal error rather than just a warning.

}

// --- Register recovery to prevent concurrent switchover attempts ---
topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using false, false for AttemptRecoveryRegistration is risky for a manual graceful switchover. If an automated recovery is already active for the cluster, this call will return the existing recovery object and the switchover logic will proceed to demote the primary while another recovery is in progress. It is safer to use true for failIfAlreadyRegistered to ensure the switchover only proceeds if the cluster is in a stable state.

// PostgreSQL primary to a designated standby. It sets the primary read-only,
// waits for the standby to catch up, promotes the standby, reconfigures
// remaining standbys, and configures the demoted primary for repositioning.
func PostgreSQLGracefulPrimarySwitchover(clusterName string, designatedKey *inst.InstanceKey, auto bool) (topologyRecovery *TopologyRecovery, err error) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The function signature for PostgreSQLGracefulPrimarySwitchover does not match the design specification in docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md, which states it should return (*TopologyRecovery, *inst.BinlogCoordinates, error). The current implementation only returns (*TopologyRecovery, error), causing the dispatch in go/logic/topology_recovery.go to return nil for the promoted master coordinates.

@renecannao renecannao merged commit 4457c7a into master Apr 18, 2026
13 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant