feat(postgresql): graceful primary switchover#99
Conversation
Covers planned/graceful switchover support for PostgreSQL, including new instance operations (set read-only, WAL LSN catch-up, reposition as standby), orchestration flow, and CLI/API dispatch via the existing graceful-master-takeover commands.
9 tasks covering instance operations, orchestration, CLI/API dispatch, functional tests, documentation updates, and GitHub issue creation.
Returns the current WAL write position (LSN) of a PostgreSQL primary instance, needed to synchronize standbys before promotion during graceful switchover.
…over Polls a PostgreSQL standby until its replay LSN reaches or exceeds the target LSN, or the timeout expires. Used during graceful switchover to ensure the standby has caught up before promotion.
…chover Configures a demoted PostgreSQL primary to replicate from a new primary by updating primary_conninfo via ALTER SYSTEM. The operator handles the actual restart and standby.signal creation via hooks.
…tchover When the cluster master is PostgreSQL (ProviderType == "postgresql"), GracefulMasterTakeover now dispatches to PostgreSQLGracefulPrimarySwitchover instead of running the MySQL-specific code path.
Runs graceful-master-takeover-auto via the API before the destructive failover test. Verifies the primary changes after switchover.
- Escape single quotes in primary_conninfo to prevent SQL breakage when PostgreSQLTopologyPassword contains quotes (both PostgreSQLReconfigureStandby and PostgreSQLRepositionAsStandby) - Move AttemptRecoveryRegistration before mutations to prevent concurrent switchover races - Use 3x ReasonableMaintenanceReplicationLagSeconds for catch-up timeout since the primary is already read-only
Adds deeper verification of the graceful switchover result (new primary is writable, demoted primary has primary_conninfo set), negative-case tests (bogus cluster, bogus designated host), and a round-trip that reactivates the demoted primary as a live standby and switches back. Removes continue-on-error from the PostgreSQL functional job so PG test regressions now fail CI.
📝 WalkthroughWalkthroughImplements PostgreSQL graceful primary switchover by introducing four instance-level operations (set read-only, capture WAL LSN, wait for standby replication, reposition as standby) and orchestration logic that validates topology, coordinates the switchover process, and routes PostgreSQL clusters through dedicated switchover logic. Changes
Sequence DiagramsequenceDiagram
participant Orch as Orchestrator
participant Prime as Current Primary
participant Stby as Designated Standby
participant OthStby as Other Standbys
Orch->>Prime: Set read-only (ALTER SYSTEM)
Prime->>Prime: Terminate non-replication connections
Orch->>Prime: Capture current WAL LSN (pg_current_wal_lsn)
Prime-->>Orch: Return target LSN
rect rgba(100, 150, 200, 0.5)
Orch->>Stby: Poll pg_last_wal_replay_lsn() >= target
Stby-->>Orch: LSN reached / timeout
end
Orch->>Stby: Promote standby (pg_promote)
Orch->>OthStby: Reconfigure primary_conninfo (ALTER SYSTEM)
Orch->>Prime: Set primary_conninfo to new primary (ALTER SYSTEM)
Orch-->>Orch: Register TopologyRecovery<br/>Run PostGracefulTakeoverProcesses
Orch->>Prime: (External) Restart with standby.signal
Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/functional/test-postgresql.sh (1)
295-331:⚠️ Potential issue | 🟠 MajorDon’t hard-code
pgprimaryunless the round-trip restored it as primary.Because this script runs without
set -e, a failed round-trip still falls through to Line 331 and stopspgprimary, even if the current primary is stillpgstandby1. Gate the dead-primary section onFINAL_PRIMARYbeingpgprimary, or dynamically stop the actual current primary container.Test-flow guard suggestion
if [ "$FINAL_PRIMARY" = "172.30.0.20" ] || [ "$FINAL_PRIMARY" = "pgprimary" ]; then pass "Round-trip complete: pgprimary is primary again" + DEAD_PRIMARY_CONTAINER="pgprimary" else fail "Round-trip incomplete: primary is '$FINAL_PRIMARY' (expected pgprimary)" + DEAD_PRIMARY_CONTAINER="" fi @@ echo "--- Failover test: kill current primary ---" @@ -echo "Stopping pgprimary container..." -$COMPOSE stop pgprimary +if [ -z "${DEAD_PRIMARY_CONTAINER:-}" ]; then + skip "Dead-primary test skipped — pgprimary was not restored as primary" +else + echo "Stopping ${DEAD_PRIMARY_CONTAINER} container..." + $COMPOSE stop "$DEAD_PRIMARY_CONTAINER" +fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/test-postgresql.sh` around lines 295 - 331, The script currently always proceeds to "Stopping pgprimary container..." even when the round-trip did not restore pgprimary; update the dead-primary section to guard on FINAL_PRIMARY or stop the actual current primary: check the FINAL_PRIMARY variable (and/or its resolved container name) before stopping pgprimary, e.g. if [ "$FINAL_PRIMARY" = "172.30.0.20" ] || [ "$FINAL_PRIMARY" = "pgprimary" ] then stop pgprimary else stop the container corresponding to FINAL_PRIMARY (e.g. derive container name like "pgstandby1" from the IP or use FINAL_PRIMARY directly) so the stop command targets the actual primary instead of hard-coding pgprimary; adjust the echo and stop/restart invocation that currently refer to "pgprimary" to use the chosen container variable.go/inst/instance_topology_postgresql.go (1)
100-115:⚠️ Potential issue | 🟠 MajorEscape
primary_conninfovalues at the libpq layer, not just the SQL layer.
strings.ReplaceAll(..., "'", "''")protects the outer SQL literal, butprimary_conninfoitself is a libpq keyword/value string. Passwords, usernames, or other values containing spaces, backslashes, or quotes must be quoted and escaped according to libpq rules (\'for quotes,\\for backslashes), not just SQL rules. Build conninfo with per-value quoting/escaping, then escape the resulting SQL literal.This issue occurs in two places: lines 101-115 and lines 313-318. Consider extracting the conninfo construction into a helper function to avoid duplication.
Safer helper-based construction
+func postgreSQLConnInfoValue(value string) string { + if value == "" || strings.ContainsAny(value, " \t\r\n'\\") { + value = strings.ReplaceAll(value, `\`, `\\`) + value = strings.ReplaceAll(value, `'`, `\'`) + return "'" + value + "'" + } + return value +} + +func postgreSQLPrimaryConnInfo(newPrimaryKey *InstanceKey) string { + return fmt.Sprintf( + "host=%s port=%d user=%s password=%s sslmode=%s application_name=%s", + postgreSQLConnInfoValue(newPrimaryKey.Hostname), + newPrimaryKey.Port, + postgreSQLConnInfoValue(config.Config.PostgreSQLTopologyUser), + postgreSQLConnInfoValue(config.Config.PostgreSQLTopologyPassword), + postgreSQLConnInfoValue(config.Config.PostgreSQLSSLMode), + postgreSQLConnInfoValue("orchestrator"), + ) +} + +func postgreSQLStringLiteral(value string) string { + return strings.ReplaceAll(value, "'", "''") +} + // Build new primary_conninfo - newConnInfo := fmt.Sprintf( - "host=%s port=%d user=%s password=%s sslmode=%s application_name=orchestrator", - newPrimaryKey.Hostname, - newPrimaryKey.Port, - config.Config.PostgreSQLTopologyUser, - config.Config.PostgreSQLTopologyPassword, - config.Config.PostgreSQLSSLMode, - ) + newConnInfo := postgreSQLPrimaryConnInfo(newPrimaryKey) @@ - escapedConnInfo := strings.ReplaceAll(newConnInfo, "'", "''") + escapedConnInfo := postgreSQLStringLiteral(newConnInfo)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@go/inst/instance_topology_postgresql.go` around lines 100 - 115, The conninfo string (newConnInfo) is being SQL-escaped but not libpq-escaped; change construction in PostgreSQLReconfigureStandby to build each libpq key=value element with proper libpq quoting/escaping (escape single quote as \' and backslash as \\ and wrap values containing spaces or special chars in single quotes), extract that logic into a helper (e.g., buildLibpqConnInfo or quoteLibpqValue) and use it to produce the conninfo, then still SQL-escape the final conninfo for the ALTER SYSTEM statement (escapedConnInfo = strings.ReplaceAll(conninfo, "'", "''")); apply the same helper at the second occurrence mentioned to remove duplication.
🧹 Nitpick comments (1)
docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md (1)
86-90: Add defensive validation forreadOnlyValue.While
readOnlyValueis currently controlled (only "on" or "off"), adding an explicit validation check would make the code more defensive against future modifications and clearer about the expected values.🛡️ Proposed defensive check
readOnlyValue := "off" if readOnly { readOnlyValue = "on" } + if readOnlyValue != "on" && readOnlyValue != "off" { + return nil, fmt.Errorf("PostgreSQLSetReadOnly: invalid readOnlyValue: %s", readOnlyValue) + } log.Infof("PostgreSQLSetReadOnly: setting default_transaction_read_only = %s on %+v", readOnlyValue, *instanceKey)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around lines 86 - 90, The code in PostgreSQLSetReadOnly uses readOnlyValue directly in the ALTER SYSTEM SQL; add a defensive validation before db.Exec to ensure readOnlyValue is only "on" or "off" (or whatever canonical set you expect) and return a descriptive error via log.Errore if it is invalid; locate the log.Infof call and the subsequent db.Exec(fmt.Sprintf("ALTER SYSTEM SET default_transaction_read_only = %s", readOnlyValue)) and insert the validation check there to short-circuit on bad values and avoid executing malformed SQL.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/database-providers.md`:
- Around line 183-190: Update the PostgreSQL graceful takeover docs to state
that in addition to the pg_monitor role, the orchestrator user must have
superuser privileges or equivalent role-based permissions to run ALTER SYSTEM,
call pg_reload_conf(), and execute pg_terminate_backend(); mention these
specific operations and recommend granting either superuser or targeted
role-based privileges (or extension of pg_monitor) before using the CLI/API
endpoints (`graceful-master-takeover`, `graceful-master-takeover-auto`), and add
a note explaining that an operator-managed restart (triggered via
`PostGracefulTakeoverProcesses`) is still required for the demoted primary.
In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md`:
- Line 15: Change the task headings from level 3 to level 2 so heading levels
increment correctly: replace each "### Task N: `..." with "## Task N: `..." for
all task headings (e.g., "Task 1: `PostgreSQLSetReadOnly` — Unit Test +
Implementation" through Tasks 1-9) so the document goes H1 → H2 → H3 correctly.
- Around line 507-525: Move the AttemptRecoveryRegistration call so it executes
before mutating the topology (before calling inst.PostgreSQLPromoteStandby);
i.e., call AttemptRecoveryRegistration(&analysisEntry, false, false) and
set/initialize topologyRecovery (with the same error-handling fallback) prior to
invoking PostgreSQLPromoteStandby, then after promotion assign
topologyRecovery.SuccessorKey = &promotedInstance.Key; keep the
PostgreSQLSetReadOnly undo logic and all existing error branches but reorder
them so registration happens before the promotion to avoid races.
- Around line 700-736: Add explicit language specifiers to the fenced code
blocks that contain the PostgreSQL switchover text so markdown renderers apply
proper highlighting: replace the three-backtick fences before the block
containing "Graceful master takeover (planned switchover)..." and before "###
Graceful Primary Switchover", and the two occurrences around "Graceful master
takeover is not yet supported for PostgreSQL." / "Graceful master takeover is
supported for PostgreSQL..." with language-tagged fences (e.g., ```markdown).
Locate these blocks by searching for the exact phrases "Graceful master takeover
(planned switchover) is not yet implemented for PostgreSQL. Only unplanned
failover (dead primary) is supported.", "### Graceful Primary Switchover", and
"Graceful master takeover is not yet supported for PostgreSQL." and update their
opening ``` to ```markdown.
- Around line 338-351: The ALTER SYSTEM command interpolates newConnInfo
directly and can break or be exploited if fields contain single quotes; before
calling db.Exec in PostgreSQLRepositionAsStandby, escape single quotes in
newConnInfo (e.g., replace every "'" with "''") and use the escaped value when
building the SQL string for primary_conninfo; ensure the strings package is
imported and update the db.Exec call that sets primary_conninfo to use the
escapedConnInfo variable.
In `@docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md`:
- Around line 18-85: Update the design doc to match the implementation: change
the "three new functions" header to "four new functions" and ensure all four
(PostgreSQLSetReadOnly, PostgreSQLGetCurrentWALLSN, PostgreSQLWaitForStandbyLSN,
PostgreSQLRepositionAsStandby) are listed; change the
PostgreSQLGracefulPrimarySwitchover signature description to match the actual
return values implemented (adjust the noted return tuple to the real types/count
used in PostgreSQLGracefulPrimarySwitchover); and update the dispatch snippet to
match the code's provider check (use ProviderType == "postgresql" early return
logic as implemented rather than the inst.PostgreSQL variant) so the spec aligns
exactly with the code paths and symbols named in the repo.
In `@go/logic/topology_recovery_postgresql.go`:
- Around line 239-243: The code currently logs failures from
inst.PostgreSQLRepositionAsStandby(...) but switchover still returns success;
change this to surface the error: when
PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key)
returns a non-nil err, call AuditTopologyRecovery with the error and then
propagate the error out of the surrounding function (or mark clusterPrimary as
lost in topologyRecovery and return a wrapped error) instead of only logging;
update the surrounding function signature/return path to return that error so
callers see the failure (refer to PostgreSQLRepositionAsStandby,
AuditTopologyRecovery, topologyRecovery, clusterPrimary, promotedInstance).
- Around line 180-223: After AttemptRecoveryRegistration succeeds you must
register a deferred failure resolver to always call resolveRecovery(...) on
error paths and avoid leaving topologyRecovery unresolved; add a defer that
checks a local resolved bool and if false calls
resolveRecovery(topologyRecovery) (and AuditTopologyRecovery as appropriate),
ensure you only set topologyRecovery.SuccessorKey = &designatedStandby.Key after
successful pre-promotion steps or clear it before any early return, and mark
resolved = true immediately after a successful resolveRecovery(...) so the defer
skips double-resolution; key symbols: AttemptRecoveryRegistration,
topologyRecovery, SuccessorKey, resolveRecovery, AuditTopologyRecovery.
In `@go/logic/topology_recovery.go`:
- Around line 2124-2128: The CLI panics when PostgreSQLGracefulPrimarySwitchover
returns nil promotedMasterCoordinates because handlers in go/app/cli.go
unconditionally dereference *promotedMasterCoordinates; update the CLI handlers
that call GracefulMasterTakeover (the blocks around the current dereferences of
promotedMasterCoordinates) to check for nil before dereferencing and handle the
nil case (e.g., print a message or omit coordinates) or alternatively ensure
PostgreSQLGracefulPrimarySwitchover always returns a non-nil coordinates struct;
reference PostgreSQLGracefulPrimarySwitchover, GracefulMasterTakeover, and the
promotedMasterCoordinates variable to locate the code to change.
---
Outside diff comments:
In `@go/inst/instance_topology_postgresql.go`:
- Around line 100-115: The conninfo string (newConnInfo) is being SQL-escaped
but not libpq-escaped; change construction in PostgreSQLReconfigureStandby to
build each libpq key=value element with proper libpq quoting/escaping (escape
single quote as \' and backslash as \\ and wrap values containing spaces or
special chars in single quotes), extract that logic into a helper (e.g.,
buildLibpqConnInfo or quoteLibpqValue) and use it to produce the conninfo, then
still SQL-escape the final conninfo for the ALTER SYSTEM statement
(escapedConnInfo = strings.ReplaceAll(conninfo, "'", "''")); apply the same
helper at the second occurrence mentioned to remove duplication.
In `@tests/functional/test-postgresql.sh`:
- Around line 295-331: The script currently always proceeds to "Stopping
pgprimary container..." even when the round-trip did not restore pgprimary;
update the dead-primary section to guard on FINAL_PRIMARY or stop the actual
current primary: check the FINAL_PRIMARY variable (and/or its resolved container
name) before stopping pgprimary, e.g. if [ "$FINAL_PRIMARY" = "172.30.0.20" ] ||
[ "$FINAL_PRIMARY" = "pgprimary" ] then stop pgprimary else stop the container
corresponding to FINAL_PRIMARY (e.g. derive container name like "pgstandby1"
from the IP or use FINAL_PRIMARY directly) so the stop command targets the
actual primary instead of hard-coding pgprimary; adjust the echo and
stop/restart invocation that currently refer to "pgprimary" to use the chosen
container variable.
---
Nitpick comments:
In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md`:
- Around line 86-90: The code in PostgreSQLSetReadOnly uses readOnlyValue
directly in the ALTER SYSTEM SQL; add a defensive validation before db.Exec to
ensure readOnlyValue is only "on" or "off" (or whatever canonical set you
expect) and return a descriptive error via log.Errore if it is invalid; locate
the log.Infof call and the subsequent db.Exec(fmt.Sprintf("ALTER SYSTEM SET
default_transaction_read_only = %s", readOnlyValue)) and insert the validation
check there to short-circuit on bad values and avoid executing malformed SQL.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9e088dff-bc56-4ee5-8b1f-aa81497ff5c9
📒 Files selected for processing (10)
.github/workflows/functional.ymldocs/database-providers.mddocs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.mddocs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.mddocs/user-manual.mdgo/inst/instance_topology_postgresql.gogo/inst/instance_topology_postgresql_test.gogo/logic/topology_recovery.gogo/logic/topology_recovery_postgresql.gotests/functional/test-postgresql.sh
💤 Files with no reviewable changes (1)
- .github/workflows/functional.yml
| - **Graceful master takeover** (planned switchover) is supported for PostgreSQL. | ||
| The same CLI commands and API endpoints (`graceful-master-takeover`, | ||
| `graceful-master-takeover-auto`) work for PostgreSQL clusters. The switchover | ||
| sets the primary read-only, waits for the designated standby to catch up via | ||
| WAL LSN, promotes the standby, reconfigures remaining standbys, and configures | ||
| the demoted primary's `primary_conninfo`. The demoted primary requires an | ||
| operator-managed restart with `standby.signal` — use | ||
| `PostGracefulTakeoverProcesses` hooks to automate this step. |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
PostgreSQL privileges required for ALTER SYSTEM pg_reload_conf pg_terminate_backend pg_signal_backend
💡 Result:
PostgreSQL privileges required: - ALTER SYSTEM: Superuser or users granted the ALTER SYSTEM privilege on the specific configuration parameter (via GRANT ALTER SYSTEM ON PARAMETER ...). {value: Superuser or ALTER SYSTEM privilege on parameter, citations: [1,2,4], confidence: high} - pg_reload_conf: Superuser or EXECUTE privilege granted on the pg_catalog.pg_reload_conf function (via GRANT EXECUTE ON FUNCTION pg_catalog.pg_reload_conf TO role;). These signaling functions are restricted to superusers by default but access may be granted using GRANT. {value: Superuser or EXECUTE on pg_reload_conf, citations: [6,8,11], confidence: high} - pg_terminate_backend(pid): Superuser, or the calling role is a member of the target backend's role, or has privileges of the pg_signal_backend role. Only superusers can terminate superuser backends. {value: Superuser, member of target role, or pg_signal_backend role, citations: [6,7,11], confidence: high} - pg_signal_backend(pid): This appears to refer to the pg_signal_backend predefined role (not a function). Membership in pg_signal_backend role allows signaling (cancel/terminate) other backends (via pg_cancel_backend/pg_terminate_backend). Superusers bypass this. {value: pg_signal_backend role membership, citations: [6,7,13], confidence: high}
Citations:
- 1: https://www.postgresql.org/docs/current/sql-altersystem.html
- 2: https://www.postgresql.org/docs/current/ddl-priv.html
- 3: https://www.postgresql.org/docs/current/static/sql-altersystem.html
- 4: https://www.postgresql.org/docs/current/functions-admin.html
- 5: https://www.postgresql.org/docs/current/static/predefined-roles.html
- 6: https://runebook.dev/en/docs/postgresql/functions-admin/pg_reload_conf
- 7: http://www.postgresql.org/docs/current/interactive/functions-admin.html
- 8: https://www.postgresql.org/docs/current/predefined-roles.html
🏁 Script executed:
# First, let's read the specific documentation section mentioned
head -200 docs/database-providers.md | tail -50Repository: ProxySQL/orchestrator
Length of output: 2911
🏁 Script executed:
# Search for pg_monitor mentions in the documentation
rg "pg_monitor" docs/database-providers.md -B 5 -A 5Repository: ProxySQL/orchestrator
Length of output: 688
🏁 Script executed:
# Find the PostgreSQL instance topology implementation
fd "instance_topology_postgresql" --type fRepository: ProxySQL/orchestrator
Length of output: 149
🏁 Script executed:
# Search for ALTER SYSTEM, pg_reload_conf, pg_terminate_backend in codebase
rg "ALTER SYSTEM|pg_reload_conf|pg_terminate_backend" --type go -B 2 -A 2Repository: ProxySQL/orchestrator
Length of output: 5706
🏁 Script executed:
# Check the exact context around the pg_monitor requirement to see where prerequisites are documented
rg -B 10 "pg_monitor role:" docs/database-providers.mdRepository: ProxySQL/orchestrator
Length of output: 47
🏁 Script executed:
# Verify if there are any existing privilege requirements documented for graceful takeover nearby
rg -A 20 "Graceful master takeover" docs/database-providers.md | head -40Repository: ProxySQL/orchestrator
Length of output: 1039
Add privilege requirements to PostgreSQL graceful takeover documentation.
The graceful takeover implementation requires additional PostgreSQL privileges beyond pg_monitor. The operations (ALTER SYSTEM, pg_reload_conf(), pg_terminate_backend()) require superuser access or equivalent role-based privileges. Update the documentation to specify these requirements so operators can configure the orchestrator user with the necessary permissions before attempting graceful takeover.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/database-providers.md` around lines 183 - 190, Update the PostgreSQL
graceful takeover docs to state that in addition to the pg_monitor role, the
orchestrator user must have superuser privileges or equivalent role-based
permissions to run ALTER SYSTEM, call pg_reload_conf(), and execute
pg_terminate_backend(); mention these specific operations and recommend granting
either superuser or targeted role-based privileges (or extension of pg_monitor)
before using the CLI/API endpoints (`graceful-master-takeover`,
`graceful-master-takeover-auto`), and add a note explaining that an
operator-managed restart (triggered via `PostGracefulTakeoverProcesses`) is
still required for the demoted primary.
|
|
||
| --- | ||
|
|
||
| ### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation |
There was a problem hiding this comment.
Use h2 (##) for task headings instead of h3 (###).
The task headings jump from h1 (line 1) to h3 (line 15), skipping h2. Markdown best practices require heading levels to increment by one level at a time.
📝 Proposed fix
Change all task headings from h3 to h2:
-### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation
+## Task 1: `PostgreSQLSetReadOnly` — Unit Test + ImplementationApply this change to all task headings (Tasks 1-9).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ### Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation | |
| ## Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation |
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 15-15: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` at line
15, Change the task headings from level 3 to level 2 so heading levels increment
correctly: replace each "### Task N: `..." with "## Task N: `..." for all task
headings (e.g., "Task 1: `PostgreSQLSetReadOnly` — Unit Test + Implementation"
through Tasks 1-9) so the document goes H1 → H2 → H3 correctly.
| newConnInfo := fmt.Sprintf( | ||
| "host=%s port=%d user=%s password=%s sslmode=%s application_name=orchestrator", | ||
| newPrimaryKey.Hostname, | ||
| newPrimaryKey.Port, | ||
| config.Config.PostgreSQLTopologyUser, | ||
| config.Config.PostgreSQLTopologyPassword, | ||
| config.Config.PostgreSQLSSLMode, | ||
| ) | ||
|
|
||
| log.Infof("PostgreSQLRepositionAsStandby: configuring %+v to replicate from %+v", *instanceKey, *newPrimaryKey) | ||
|
|
||
| if _, err := db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", newConnInfo)); err != nil { | ||
| return log.Errore(fmt.Errorf("PostgreSQLRepositionAsStandby: ALTER SYSTEM failed on %+v: %v", *instanceKey, err)) | ||
| } |
There was a problem hiding this comment.
Critical: Escape single quotes in primary_conninfo to prevent SQL errors and injection.
The newConnInfo string is interpolated directly into the SQL command without escaping. If newPrimaryKey.Hostname or config.Config.PostgreSQLTopologyPassword contain single quotes, the ALTER SYSTEM command will fail or could be vulnerable to SQL injection.
The PR objectives note this was fixed post-review: "escape single quotes in primary_conninfo". The plan should be updated to show the correct implementation.
🔒 Proposed fix with quote escaping
newConnInfo := fmt.Sprintf(
"host=%s port=%d user=%s password=%s sslmode=%s application_name=orchestrator",
newPrimaryKey.Hostname,
newPrimaryKey.Port,
config.Config.PostgreSQLTopologyUser,
config.Config.PostgreSQLTopologyPassword,
config.Config.PostgreSQLSSLMode,
)
+ // Escape single quotes for SQL string literal
+ newConnInfo = strings.ReplaceAll(newConnInfo, "'", "''")
log.Infof("PostgreSQLRepositionAsStandby: configuring %+v to replicate from %+v", *instanceKey, *newPrimaryKey)
if _, err := db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", newConnInfo)); err != nil {Note: Add "strings" to imports if not already present.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around
lines 338 - 351, The ALTER SYSTEM command interpolates newConnInfo directly and
can break or be exploited if fields contain single quotes; before calling
db.Exec in PostgreSQLRepositionAsStandby, escape single quotes in newConnInfo
(e.g., replace every "'" with "''") and use the escaped value when building the
SQL string for primary_conninfo; ensure the strings package is imported and
update the db.Exec call that sets primary_conninfo to use the escapedConnInfo
variable.
| // --- Promote designated standby --- | ||
| promotedInstance, err := inst.PostgreSQLPromoteStandby(&designatedStandby.Key) | ||
| if err != nil { | ||
| // Undo read-only before aborting | ||
| _, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false) | ||
| return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: promotion of %+v failed: %v", designatedStandby.Key, err) | ||
| } | ||
| log.Infof("PostgreSQLGracefulPrimarySwitchover: promoted %+v to primary", promotedInstance.Key) | ||
|
|
||
| // --- Register recovery --- | ||
| topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false) | ||
| if err != nil || topologyRecovery == nil { | ||
| _ = log.Warningf("PostgreSQLGracefulPrimarySwitchover: error registering recovery: %v", err) | ||
| // Continue — promotion already happened | ||
| topologyRecovery = &TopologyRecovery{ | ||
| AnalysisEntry: analysisEntry, | ||
| } | ||
| } | ||
| topologyRecovery.SuccessorKey = &promotedInstance.Key |
There was a problem hiding this comment.
Race condition: Move AttemptRecoveryRegistration before promotion to avoid registration races.
AttemptRecoveryRegistration (line 517) occurs after the standby is promoted (line 508). This creates a window where the topology has mutated but the recovery isn't registered, potentially leading to inconsistent state or duplicate recovery attempts.
The PR objectives confirm this issue was fixed post-review: "move AttemptRecoveryRegistration before mutations to avoid races". The plan should reflect the corrected order.
📋 Proposed fix to move registration before mutations
Move the registration block before the promotion step:
}
- // --- Promote designated standby ---
+ // --- Register recovery ---
+ topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
+ if err != nil || topologyRecovery == nil {
+ // Undo read-only before aborting
+ _, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
+ return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: recovery registration failed: %v", err)
+ }
+
+ // --- Promote designated standby ---
promotedInstance, err := inst.PostgreSQLPromoteStandby(&designatedStandby.Key)
if err != nil {
// Undo read-only before aborting
_, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false)
return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: promotion of %+v failed: %v", designatedStandby.Key, err)
}
log.Infof("PostgreSQLGracefulPrimarySwitchover: promoted %+v to primary", promotedInstance.Key)
- // --- Register recovery ---
- topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
- if err != nil || topologyRecovery == nil {
- _ = log.Warningf("PostgreSQLGracefulPrimarySwitchover: error registering recovery: %v", err)
- // Continue — promotion already happened
- topologyRecovery = &TopologyRecovery{
- AnalysisEntry: analysisEntry,
- }
- }
topologyRecovery.SuccessorKey = &promotedInstance.Key🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around
lines 507 - 525, Move the AttemptRecoveryRegistration call so it executes before
mutating the topology (before calling inst.PostgreSQLPromoteStandby); i.e., call
AttemptRecoveryRegistration(&analysisEntry, false, false) and set/initialize
topologyRecovery (with the same error-handling fallback) prior to invoking
PostgreSQLPromoteStandby, then after promotion assign
topologyRecovery.SuccessorKey = &promotedInstance.Key; keep the
PostgreSQLSetReadOnly undo logic and all existing error branches but reorder
them so registration happens before the promotion to avoid races.
| ``` | ||
| Graceful master takeover (planned switchover) is not yet implemented for PostgreSQL. Only unplanned failover (dead primary) is supported. | ||
| ``` | ||
|
|
||
| Replace with: | ||
| ``` | ||
| ### Graceful Primary Switchover | ||
|
|
||
| Graceful master takeover (planned switchover) is supported for PostgreSQL. The same CLI commands | ||
| and API endpoints used for MySQL (`graceful-master-takeover`, `graceful-master-takeover-auto`) work | ||
| for PostgreSQL clusters. The switchover flow: | ||
|
|
||
| 1. Sets the primary read-only via `ALTER SYSTEM SET default_transaction_read_only = on` and terminates | ||
| existing non-replication connections. | ||
| 2. Captures the primary's current WAL LSN and waits for the designated standby to catch up. | ||
| 3. Promotes the designated standby via `pg_promote()`. | ||
| 4. Reconfigures remaining standbys to replicate from the new primary. | ||
| 5. Configures the demoted primary's `primary_conninfo` to point at the new primary. | ||
|
|
||
| **Important:** The demoted primary requires a restart with `standby.signal` to complete its conversion | ||
| to a standby. Use `PostGracefulTakeoverProcesses` hooks to automate this step. | ||
| ``` | ||
|
|
||
| - [ ] **Step 2: Update user-manual.md** | ||
|
|
||
| Find the line (around line 699) that says: | ||
| ``` | ||
| Graceful master takeover is not yet supported for PostgreSQL. | ||
| ``` | ||
|
|
||
| Replace with: | ||
| ``` | ||
| Graceful master takeover is supported for PostgreSQL. Use `graceful-master-takeover` or | ||
| `graceful-master-takeover-auto` CLI commands, or the equivalent API endpoints. The demoted | ||
| primary requires an operator-managed restart with `standby.signal` — configure this via | ||
| `PostGracefulTakeoverProcesses` hooks. | ||
| ``` |
There was a problem hiding this comment.
Add language specifiers to fenced code blocks for better rendering.
The fenced code blocks at lines 700, 705, 726, and 731 lack language specifiers, which affects syntax highlighting and rendering in documentation viewers.
📝 Proposed fix to add language specifiers
Line 700:
Find the line (around line 183-184) that says:
-```
+```markdown
Graceful master takeover (planned switchover) is not yet implemented for PostgreSQL. Only unplanned failover (dead primary) is supported.
Line 705:
```diff
Replace with:
-```
+```markdown
### Graceful Primary Switchover
Line 726:
Find the line (around line 699) that says:
-```
+```markdown
Graceful master takeover is not yet supported for PostgreSQL.
Line 731:
```diff
Replace with:
-```
+```markdown
Graceful master takeover is supported for PostgreSQL. Use `graceful-master-takeover` or
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 700-700: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 705-705: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 726-726: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 731-731: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/plans/2026-04-18-postgresql-graceful-switchover.md` around
lines 700 - 736, Add explicit language specifiers to the fenced code blocks that
contain the PostgreSQL switchover text so markdown renderers apply proper
highlighting: replace the three-backtick fences before the block containing
"Graceful master takeover (planned switchover)..." and before "### Graceful
Primary Switchover", and the two occurrences around "Graceful master takeover is
not yet supported for PostgreSQL." / "Graceful master takeover is supported for
PostgreSQL..." with language-tagged fences (e.g., ```markdown). Locate these
blocks by searching for the exact phrases "Graceful master takeover (planned
switchover) is not yet implemented for PostgreSQL. Only unplanned failover (dead
primary) is supported.", "### Graceful Primary Switchover", and "Graceful master
takeover is not yet supported for PostgreSQL." and update their opening ``` to
```markdown.
| Three new functions in `go/inst/instance_topology_postgresql.go`: | ||
|
|
||
| #### `PostgreSQLSetReadOnly(instanceKey *InstanceKey, readOnly bool) (*Instance, error)` | ||
|
|
||
| - Connects to the instance via SQL | ||
| - Executes `ALTER SYSTEM SET default_transaction_read_only = on|off` | ||
| - Calls `SELECT pg_reload_conf()` | ||
| - When setting read-only: queries `pg_stat_activity` for non-replication, non-orchestrator backends and calls `pg_terminate_backend()` on each to close the write window | ||
| - Re-reads and returns the instance | ||
|
|
||
| #### `PostgreSQLGetCurrentWALLSN(instanceKey *InstanceKey) (string, error)` | ||
|
|
||
| - Connects to the primary | ||
| - Returns result of `SELECT pg_current_wal_lsn()::text` | ||
|
|
||
| #### `PostgreSQLWaitForStandbyLSN(standbyKey *InstanceKey, targetLSN string, timeout time.Duration) error` | ||
|
|
||
| - Polls `SELECT pg_last_wal_replay_lsn()` on the standby | ||
| - Compares using `SELECT '%s'::pg_lsn >= '%s'::pg_lsn` (PostgreSQL-native LSN comparison) | ||
| - Polls every 500ms up to the configured timeout | ||
| - Returns error on timeout | ||
|
|
||
| #### `PostgreSQLRepositionAsStandby(instanceKey *InstanceKey, newPrimaryKey *InstanceKey) error` | ||
|
|
||
| - Connects to the old primary | ||
| - Sets `primary_conninfo` via `ALTER SYSTEM` pointing to the new primary (same logic as existing `PostgreSQLReconfigureStandby`) | ||
| - Calls `SELECT pg_reload_conf()` | ||
| - Does NOT restart the instance — the actual restart and `standby.signal` creation is the operator's responsibility via `PostGracefulTakeoverProcesses` hook | ||
|
|
||
| ### Switchover Orchestration | ||
|
|
||
| New function `PostgreSQLGracefulPrimarySwitchover(clusterName string, designatedKey *InstanceKey, auto bool) (*TopologyRecovery, *inst.BinlogCoordinates, error)` in `go/logic/topology_recovery_postgresql.go`: | ||
|
|
||
| **Flow:** | ||
|
|
||
| 1. **Validate** — Read cluster primary, read standbys, determine designated standby. When `auto=true`, use `PostgreSQLGetBestStandbyForPromotion()`. When manual, require the user to specify or have exactly one standby. | ||
| 2. **Execute `PreGracefulTakeoverProcesses`** — Abort on failure. | ||
| 3. **Set primary read-only** — `PostgreSQLSetReadOnly(primary, true)`. Terminates existing non-replication connections. | ||
| 4. **Capture target LSN** — `PostgreSQLGetCurrentWALLSN(primary)`. | ||
| 5. **Wait for standby catch-up** — `PostgreSQLWaitForStandbyLSN(designated, targetLSN, timeout)`. | ||
| 6. **Promote designated standby** — `PostgreSQLPromoteStandby(designated)` (existing function). | ||
| 7. **Reconfigure remaining standbys** — `PostgreSQLReconfigureStandby(standby, newPrimary)` for each non-designated standby (existing function). | ||
| 8. **Configure demoted primary** — `PostgreSQLRepositionAsStandby(oldPrimary, newPrimary)`. Sets `primary_conninfo` only; operator handles restart via hook. | ||
| 9. **Execute `PostGracefulTakeoverProcesses`**. | ||
| 10. **Return** TopologyRecovery with successor key. | ||
|
|
||
| **Error handling:** | ||
|
|
||
| - If promotion fails after setting read-only: undo via `PostgreSQLSetReadOnly(primary, false)`, abort. | ||
| - If standby catch-up times out: undo read-only, abort with error. | ||
|
|
||
| **Note on flat topology:** PostgreSQL topologies are flat (no cascading replication detected), so there is no sibling relocation step (unlike MySQL which may need to reparent replicas under the designated replica). | ||
|
|
||
| ### CLI/API Dispatch | ||
|
|
||
| No changes to `cli.go` or `api.go`. The existing commands and endpoints are reused. | ||
|
|
||
| **In `go/logic/topology_recovery.go`:** | ||
|
|
||
| Add a provider check in `GracefulMasterTakeover()`: | ||
|
|
||
| ```go | ||
| if clusterMaster.DatabaseProvider == inst.PostgreSQL { | ||
| return PostgreSQLGracefulPrimarySwitchover(clusterName, designatedKey, auto) | ||
| } | ||
| ``` | ||
|
|
||
| Add a PostgreSQL branch in `getGracefulMasterTakeoverDesignatedInstance()` to use `PostgreSQLGetBestStandbyForPromotion()` when `auto=true`. |
There was a problem hiding this comment.
Bring the design spec back in sync with the implementation.
This spec says “three” functions while listing four, documents a three-return-value PostgreSQLGracefulPrimarySwitchover(...), and shows a dispatch branch that differs from the implemented ProviderType == "postgresql" early return. Please update the spec so it remains useful as implementation documentation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md`
around lines 18 - 85, Update the design doc to match the implementation: change
the "three new functions" header to "four new functions" and ensure all four
(PostgreSQLSetReadOnly, PostgreSQLGetCurrentWALLSN, PostgreSQLWaitForStandbyLSN,
PostgreSQLRepositionAsStandby) are listed; change the
PostgreSQLGracefulPrimarySwitchover signature description to match the actual
return values implemented (adjust the noted return tuple to the real types/count
used in PostgreSQLGracefulPrimarySwitchover); and update the dispatch snippet to
match the code's provider check (use ProviderType == "postgresql" early return
logic as implemented rather than the inst.PostgreSQL variant) so the spec aligns
exactly with the code paths and symbols named in the repo.
| topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false) | ||
| if topologyRecovery == nil { | ||
| AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("found an active or recent recovery on %+v. Will not issue another graceful switchover.", clusterName)) | ||
| return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: active or recent recovery exists for %+v, aborting", clusterName) | ||
| } | ||
| topologyRecovery.SuccessorKey = &designatedStandby.Key | ||
|
|
||
| // --- Execute PreGracefulTakeoverProcesses --- | ||
| if err := executeProcesses(config.Config.PreGracefulTakeoverProcesses, "PreGracefulTakeoverProcesses", topologyRecovery, true); err != nil { | ||
| return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: PreGracefulTakeoverProcesses failed: %v", err) | ||
| } | ||
|
|
||
| // --- Set primary read-only --- | ||
| log.Infof("PostgreSQLGracefulPrimarySwitchover: setting %+v read-only", clusterPrimary.Key) | ||
| if _, err := inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, true); err != nil { | ||
| return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed to set read-only on %+v: %v", clusterPrimary.Key, err) | ||
| } | ||
|
|
||
| // --- Capture target LSN --- | ||
| targetLSN, err := inst.PostgreSQLGetCurrentWALLSN(&clusterPrimary.Key) | ||
| if err != nil { | ||
| // Undo read-only before aborting | ||
| _, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false) | ||
| return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed to get WAL LSN from %+v: %v", clusterPrimary.Key, err) | ||
| } | ||
| log.Infof("PostgreSQLGracefulPrimarySwitchover: target LSN is %s", targetLSN) | ||
|
|
||
| // --- Wait for standby to catch up --- | ||
| // Use 3x the maintenance lag threshold since the primary is already read-only | ||
| // and waiting longer only extends the maintenance window, not data risk. | ||
| catchUpTimeout := 3 * time.Duration(config.Config.ReasonableMaintenanceReplicationLagSeconds) * time.Second | ||
| if err := inst.PostgreSQLWaitForStandbyLSN(&designatedStandby.Key, targetLSN, catchUpTimeout); err != nil { | ||
| // Undo read-only before aborting | ||
| _, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false) | ||
| return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: standby catch-up failed: %v", err) | ||
| } | ||
|
|
||
| // --- Promote designated standby --- | ||
| promotedInstance, err := inst.PostgreSQLPromoteStandby(&designatedStandby.Key) | ||
| if err != nil { | ||
| // Undo read-only before aborting | ||
| _, _ = inst.PostgreSQLSetReadOnly(&clusterPrimary.Key, false) | ||
| return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: promotion of %+v failed: %v", designatedStandby.Key, err) | ||
| } |
There was a problem hiding this comment.
Resolve the registered recovery on all abort paths.
After AttemptRecoveryRegistration, these early returns can leave an active/recent recovery record unresolved. Add a deferred failure resolver after registration, and avoid leaving SuccessorKey populated for failed pre-promotion attempts.
Failure cleanup pattern
topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false)
if topologyRecovery == nil {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("found an active or recent recovery on %+v. Will not issue another graceful switchover.", clusterName))
return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: active or recent recovery exists for %+v, aborting", clusterName)
}
- topologyRecovery.SuccessorKey = &designatedStandby.Key
+ resolved := false
+ defer func() {
+ if topologyRecovery != nil && !resolved && err != nil {
+ topologyRecovery.SuccessorKey = nil
+ _ = topologyRecovery.AddError(err)
+ _ = resolveRecovery(topologyRecovery, nil)
+ }
+ }()
+ topologyRecovery.SuccessorKey = &designatedStandby.Key
@@
- if err := executeProcesses(config.Config.PreGracefulTakeoverProcesses, "PreGracefulTakeoverProcesses", topologyRecovery, true); err != nil {
- return nil, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: PreGracefulTakeoverProcesses failed: %v", err)
+ if err = executeProcesses(config.Config.PreGracefulTakeoverProcesses, "PreGracefulTakeoverProcesses", topologyRecovery, true); err != nil {
+ return topologyRecovery, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: PreGracefulTakeoverProcesses failed: %v", err)
}Set resolved = true immediately after the successful resolveRecovery(...) call.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@go/logic/topology_recovery_postgresql.go` around lines 180 - 223, After
AttemptRecoveryRegistration succeeds you must register a deferred failure
resolver to always call resolveRecovery(...) on error paths and avoid leaving
topologyRecovery unresolved; add a defer that checks a local resolved bool and
if false calls resolveRecovery(topologyRecovery) (and AuditTopologyRecovery as
appropriate), ensure you only set topologyRecovery.SuccessorKey =
&designatedStandby.Key after successful pre-promotion steps or clear it before
any early return, and mark resolved = true immediately after a successful
resolveRecovery(...) so the defer skips double-resolution; key symbols:
AttemptRecoveryRegistration, topologyRecovery, SuccessorKey, resolveRecovery,
AuditTopologyRecovery.
| // --- Configure demoted primary for repositioning --- | ||
| AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key)) | ||
| if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil { | ||
| AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err)) | ||
| } |
There was a problem hiding this comment.
Surface demoted-primary reconfiguration failure.
If PostgreSQLRepositionAsStandby fails, the old primary lacks the primary_conninfo that the documented restart hook depends on, but the switchover still returns success. Record this as an error and return it, or at least mark the demoted primary as lost so operators do not get a false OK.
Propagate the post-promotion configuration error
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key))
if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err))
+ topologyRecovery.LostReplicas.AddKey(clusterPrimary.Key)
+ _ = topologyRecovery.AddError(err)
+ return topologyRecovery, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed configuring demoted primary %+v: %v", clusterPrimary.Key, err)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // --- Configure demoted primary for repositioning --- | |
| AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key)) | |
| if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil { | |
| AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err)) | |
| } | |
| // --- Configure demoted primary for repositioning --- | |
| AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("configuring demoted primary %+v to replicate from %+v", clusterPrimary.Key, promotedInstance.Key)) | |
| if err := inst.PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key); err != nil { | |
| AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error configuring demoted primary %+v: %v", clusterPrimary.Key, err)) | |
| topologyRecovery.LostReplicas.AddKey(clusterPrimary.Key) | |
| _ = topologyRecovery.AddError(err) | |
| return topologyRecovery, fmt.Errorf("PostgreSQLGracefulPrimarySwitchover: failed configuring demoted primary %+v: %v", clusterPrimary.Key, err) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@go/logic/topology_recovery_postgresql.go` around lines 239 - 243, The code
currently logs failures from inst.PostgreSQLRepositionAsStandby(...) but
switchover still returns success; change this to surface the error: when
PostgreSQLRepositionAsStandby(&clusterPrimary.Key, &promotedInstance.Key)
returns a non-nil err, call AuditTopologyRecovery with the error and then
propagate the error out of the surrounding function (or mark clusterPrimary as
lost in topologyRecovery and return a wrapped error) instead of only logging;
update the surrounding function signature/return path to return that error so
callers see the failure (refer to PostgreSQLRepositionAsStandby,
AuditTopologyRecovery, topologyRecovery, clusterPrimary, promotedInstance).
| // PostgreSQL: dispatch to PostgreSQL-specific implementation | ||
| if clusterMaster.ProviderType == "postgresql" { | ||
| topologyRecovery, err := PostgreSQLGracefulPrimarySwitchover(clusterName, designatedKey, auto) | ||
| return topologyRecovery, nil, err | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Inspect callers of GracefulMasterTakeover and usages of the returned coordinates.
# Expectation: Callers either ignore the coordinates for PostgreSQL or guard nil before use.
rg -nP -C4 '\bGracefulMasterTakeover\s*\('
rg -nP -C4 '\bpromotedMasterCoordinates\b|\bpromoted.*Coordinates\b'Repository: ProxySQL/orchestrator
Length of output: 16005
Fix nil pointer dereference in CLI when PostgreSQL graceful takeover returns nil coordinates.
The PostgreSQL dispatch in GracefulMasterTakeover returns nil for promotedMasterCoordinates. The API endpoint safely discards this value, but the CLI handlers in go/app/cli.go at lines 1518-1519 and 1534-1535 unconditionally dereference it with *promotedMasterCoordinates without nil guards. This will panic for PostgreSQL clusters invoked via CLI.
Either:
- Return non-nil coordinates from
PostgreSQLGracefulPrimarySwitchover, or - Add nil checks in CLI before dereferencing (lines 1518, 1519, 1534, 1535)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@go/logic/topology_recovery.go` around lines 2124 - 2128, The CLI panics when
PostgreSQLGracefulPrimarySwitchover returns nil promotedMasterCoordinates
because handlers in go/app/cli.go unconditionally dereference
*promotedMasterCoordinates; update the CLI handlers that call
GracefulMasterTakeover (the blocks around the current dereferences of
promotedMasterCoordinates) to check for nil before dereferencing and handle the
nil case (e.g., print a message or omit coordinates) or alternatively ensure
PostgreSQLGracefulPrimarySwitchover always returns a non-nil coordinates struct;
reference PostgreSQLGracefulPrimarySwitchover, GracefulMasterTakeover, and the
promotedMasterCoordinates variable to locate the code to change.
There was a problem hiding this comment.
Code Review
This pull request implements graceful primary switchover for PostgreSQL, introducing new instance operations for read-only mode, WAL LSN tracking, and primary repositioning. The orchestration logic is integrated into the existing GracefulMasterTakeover flow, supported by updated documentation and new functional tests. Feedback highlights the need for robust escaping in PostgreSQL connection strings and notes that pg_terminate_backend requires elevated privileges beyond the documented pg_monitor role. Additionally, the recovery registration should be tightened to prevent concurrent conflicts, and the orchestration function signature needs to align with the design specification's return types.
| escapedConnInfo := strings.ReplaceAll(newConnInfo, "'", "''") | ||
| _, err = db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", escapedConnInfo)) |
There was a problem hiding this comment.
The primary_conninfo string is built using fmt.Sprintf which does not handle escaping of values that might contain spaces or special characters. While single quotes are escaped for the SQL statement, they are not properly escaped for the PostgreSQL connection string format itself (which uses backslashes for internal quotes). Furthermore, if the password contains spaces, it must be wrapped in single quotes within the connection string. This logic should be moved to a helper that correctly formats the connection string for PostgreSQL.
| _, err := db.Exec(` | ||
| SELECT pg_terminate_backend(pid) | ||
| FROM pg_stat_activity | ||
| WHERE pid <> pg_backend_pid() | ||
| AND backend_type NOT IN ('walsender', 'walreceiver', 'autovacuum worker', 'logical replication launcher') | ||
| AND backend_type <> 'background worker' | ||
| AND datname IS NOT NULL | ||
| `) |
There was a problem hiding this comment.
The pg_terminate_backend function requires pg_signal_backend or superuser privileges. The documentation in docs/database-providers.md only mentions granting pg_monitor, which is insufficient for this operation. Additionally, the termination of backends is critical for ensuring a zero-data-loss switchover by closing the write window; therefore, failure to terminate backends should likely be treated as a fatal error rather than just a warning.
| } | ||
|
|
||
| // --- Register recovery to prevent concurrent switchover attempts --- | ||
| topologyRecovery, err = AttemptRecoveryRegistration(&analysisEntry, false, false) |
There was a problem hiding this comment.
Using false, false for AttemptRecoveryRegistration is risky for a manual graceful switchover. If an automated recovery is already active for the cluster, this call will return the existing recovery object and the switchover logic will proceed to demote the primary while another recovery is in progress. It is safer to use true for failIfAlreadyRegistered to ensure the switchover only proceeds if the cluster is in a stable state.
| // PostgreSQL primary to a designated standby. It sets the primary read-only, | ||
| // waits for the standby to catch up, promotes the standby, reconfigures | ||
| // remaining standbys, and configures the demoted primary for repositioning. | ||
| func PostgreSQLGracefulPrimarySwitchover(clusterName string, designatedKey *inst.InstanceKey, auto bool) (topologyRecovery *TopologyRecovery, err error) { |
There was a problem hiding this comment.
The function signature for PostgreSQLGracefulPrimarySwitchover does not match the design specification in docs/superpowers/specs/2026-04-18-postgresql-graceful-switchover-design.md, which states it should return (*TopologyRecovery, *inst.BinlogCoordinates, error). The current implementation only returns (*TopologyRecovery, error), causing the dispatch in go/logic/topology_recovery.go to return nil for the promoted master coordinates.
Summary
Adds planned/graceful primary switchover for PostgreSQL, reusing the existing
graceful-master-takeover/graceful-master-takeover-autoCLI commands and API endpoints.Switchover flow (implemented in
go/logic/topology_recovery_postgresql.go):PostgreSQLGetBestStandbyForPromotionwhenauto=true)PreGracefulTakeoverProcessesALTER SYSTEM+ terminate non-replication backends)pg_promote())primary_conninfoon the demoted primary (no restart)PostGracefulTakeoverProcessesThe demoted primary requires an operator-managed restart with
standby.signal— automate viaPostGracefulTakeoverProcesseshooks.New inst operations (
go/inst/instance_topology_postgresql.go):PostgreSQLSetReadOnlyPostgreSQLGetCurrentWALLSNPostgreSQLWaitForStandbyLSNPostgreSQLRepositionAsStandbyDocs:
docs/database-providers.md,docs/user-manual.md, and design/plan specs underdocs/superpowers/.Tests: functional test in
tests/functional/test-postgresql.shcoversprimary_conninfo)CI: removed
continue-on-errorfrom the PostgreSQL functional job — PG regressions now fail the build.Test plan
go test ./go/...)Summary by CodeRabbit
New Features
graceful-master-takeoverandgraceful-master-takeover-autocommands)Documentation
Tests