Skip to content
Open
28 changes: 28 additions & 0 deletions pgcopydb-helpers/AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ All scripts read connection strings from `~/.env`:
```bash
export PGCOPYDB_SOURCE_PGURI='postgresql://user:pass@source-host:5432/dbname'
export PGCOPYDB_TARGET_PGURI='postgresql://user:pass@target-host:5432/dbname'
export SLACK_WEBHOOK_URL='https://hooks.slack.com/services/...' # optional, for Slack alerts
```

## Script Reference
Expand Down Expand Up @@ -189,6 +190,32 @@ Wrapper that runs `run-migration.sh` inside a detached `screen` session named "m

### Monitoring

#### `slack-migration-alerts.sh`

Sends Slack alerts for migration events. Runs from cron (default every 2 min) and fires each alert exactly once. Requires `SLACK_WEBHOOK_URL` in `~/.env`.

```bash
~/slack-migration-alerts.sh --test # send a test message to verify the webhook
~/slack-migration-alerts.sh --setup # install cron job (default every 2 min)
~/slack-migration-alerts.sh --setup --interval N # custom interval (1-59 min)
~/slack-migration-alerts.sh --uninstall # remove the cron job
```

**Alerts fired:**
- Process stopped unexpectedly (fires once per running→stopped transition)
- New ERROR lines in `migration.log` (fires once per new batch)
- Initial copy completed — data, indexes, constraints, sequences, and post-data done; CDC phase is starting (fires once)
- Migration completed successfully (fires once)
- Migration failed with non-zero exit code (fires once)

All alerts include the PlanetScale branch ID (parsed from `PGCOPYDB_TARGET_PGURI`) and migration progress context (tables, data GB, runtime). Alert messages do not include raw log content.

**State file:** `$MIGRATION_DIR/.notify-state` — resets automatically when a new migration directory is created.

**Webhook:** set `SLACK_WEBHOOK_URL` in `~/.env`

---

#### `check-migration-status.sh`

Displays a full migration progress dashboard: phase completion status, table/index/constraint copy progress, CDC streaming, error counts, runtime, and active database operations on the target.
Expand Down Expand Up @@ -427,6 +454,7 @@ sqlite3 ~/migration_*/schema/filter.db \
- Run ~/start-migration-screen.sh to begin
- Monitor with ~/check-migration-status.sh (initial copy phase)
- Monitor with ~/check-cdc-status.sh (CDC catch-up phase)
- Run ~/slack-migration-alerts.sh --setup to enable Slack alerts (optional)

3. CUTOVER (when CDC is caught up)
- Stop writes to source
Expand Down
11 changes: 10 additions & 1 deletion pgcopydb-helpers/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,7 @@ SELECT pg_reload_conf();
```bash
export PGCOPYDB_SOURCE_PGURI='postgresql://user:pass@source-host:5432/dbname'
export PGCOPYDB_TARGET_PGURI='postgresql://user:pass@target-host:5432/dbname'
export SLACK_WEBHOOK_URL='https://hooks.slack.com/services/...' # optional, for Slack alerts
```

3. **Customize `~/filters.ini`** to exclude schemas, tables, and extensions that should not be migrated. See [Filter Configuration](#filter-configuration) below.
Expand Down Expand Up @@ -172,6 +173,13 @@ Once the initial copy completes and CDC is streaming, check replication progress

When `check-cdc-status.sh` reports **"CDC IS CAUGHT UP"** (apply backlog < 100 MB), you are ready for cutover.

To receive Slack alerts for migration events (errors, initial copy completion, success/failure), set up the monitor separately. Requires `SLACK_WEBHOOK_URL` in `~/.env`:

```bash
~/slack-migration-alerts.sh --test # verify webhook before installing
~/slack-migration-alerts.sh --setup # install cron job (default every 2 min)
```

### 4. Cut Over

1. **Stop writes** to the source database (maintenance mode, read-only, connection drain, etc.).
Expand Down Expand Up @@ -395,9 +403,10 @@ sqlite3 ~/migration_*/schema/filter.db "SELECT COUNT(*) FROM s_depend;"
| `fix-replica-identity.sh` | Prepare | Set REPLICA IDENTITY FULL on tables without primary keys |
| `filters.ini` | Prepare | pgcopydb filter configuration |
| `run-migration.sh` | Migrate | Start a pgcopydb clone --follow migration |
| `start-migration-screen.sh` | Migrate | Run the migration in a screen session |
| `start-migration-screen.sh` | Migrate | Run the migration in a detached screen session. |
| `check-migration-status.sh` | Monitor | Migration progress dashboard |
| `check-cdc-status.sh` | Monitor | CDC replication progress and health |
| `slack-migration-alerts.sh` | Monitor | Slack alerts |
| `resume-migration.sh` | Recovery | Resume an interrupted migration (full clone + CDC) |
| `resume-cdc.sh` | Recovery | Resume only the CDC phase (skips clone) |
| `target-clean.sh` | Recovery | Wipe target database for re-migration (prompts for confirmation) |
Expand Down
280 changes: 280 additions & 0 deletions pgcopydb-helpers/slack-migration-alerts.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,280 @@
#!/bin/bash
#
# slack-migration-alerts.sh — Slack alerts for pgcopydb migration failures and errors
#
# Runs from cron. State is stored inside the migration directory so it resets
# automatically when a new migration starts. Each unique event fires once only.
#
# SETUP
# 0. Add SLACK_WEBHOOK_URL to ~/.env:
# export SLACK_WEBHOOK_URL='https://hooks.slack.com/services/...'
#
# 1. Test the webhook:
# ~/slack-migration-alerts.sh --test
#
# 2. Install the cron job (default 2 min interval):
# ~/slack-migration-alerts.sh --setup
# ~/slack-migration-alerts.sh --setup --interval 5
#
# 3. Remove the cron job when done:
# ~/slack-migration-alerts.sh --uninstall
#
# ALERTS FIRED
# - pgcopydb process stopped unexpectedly (fires once per transition)
# - New ERROR lines in migration.log (fires once per new batch)
# - Initial copy completed (data + indexes + constraints + post-data; fires once)
# - Migration completed successfully (fires once)
# - Migration failed with non-zero exit code (fires once)
#
# State file: $MIGRATION_DIR/.notify-state (inside the migration directory)
# Cron output is discarded; run manually to see output
#

set -uo pipefail

# ── Flag parsing ───────────────────────────────────────────────────
INTERVAL=2
ACTION=""

while [ $# -gt 0 ]; do
case "$1" in
--interval)
INTERVAL="${2:?--interval requires a value (1-59)}"
shift 2
;;
--setup|--uninstall|--test)
ACTION="${1#--}"
shift
;;
*)
echo "Unknown argument: $1" >&2
echo "Usage: $0 [--setup [--interval N]] | [--uninstall] | [--test]" >&2
exit 1
;;
esac
done

# ── Load environment ───────────────────────────────────────────────
set +u
set -a
source ~/.env
set +a
set -u

# ── Parse PlanetScale branch ID ────────────────────────────────────
# Username format in connection string: pscale_api_xxx.BRANCH_ID
_u="${PGCOPYDB_TARGET_PGURI:-}"; _u="${_u#*://}"; _u="${_u%%@*}"; _u="${_u%%:*}"
PS_BRANCH_ID="${_u##*.}"; unset _u

# ── Slack helper ───────────────────────────────────────────────────
slack_send() {
local text="$1"
local safe
safe="${text//\\/\\\\}"
safe="${safe//\"/\\\"}"
safe="${safe//$'\n'/\\n}"

local http_code
http_code=$(curl -s -o /dev/null -w "%{http_code}" \
-X POST \
-H 'Content-type: application/json' \
--data "{\"text\":\"${safe}\"}" \
"$SLACK_WEBHOOK_URL" 2>/dev/null) || http_code="000"

if [ "$http_code" = "200" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') SENT: $text"
else
echo "$(date '+%Y-%m-%d %H:%M:%S') ERROR: Slack returned HTTP $http_code" >&2
return 1
fi
}

# ── SQLite helper ──────────────────────────────────────────────────
db_query() {
sqlite3 "$DB" "$1" 2>/dev/null || echo "${2:-0}"
}

# ── --test ─────────────────────────────────────────────────────────
if [ "$ACTION" = "test" ]; then
if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
echo "ERROR: SLACK_WEBHOOK_URL is not set in ~/.env"
exit 1
fi
HOST=$(hostname -s 2>/dev/null || hostname)
slack_send ":white_check_mark: Migration Monitor test from *${HOST}* — webhook working!" || exit 1
exit 0
fi

# ── --setup ────────────────────────────────────────────────────────
if [ "$ACTION" = "setup" ]; then
if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
echo "ERROR: SLACK_WEBHOOK_URL is not set in ~/.env"
exit 1
fi
if ! [[ "$INTERVAL" =~ ^[1-9][0-9]?$ ]] || [ "$INTERVAL" -gt 59 ]; then
echo "ERROR: --interval must be 1-59 (got: $INTERVAL)"
exit 1
fi
echo "Verifying webhook..."
slack_send ":rocket: Migration monitor started for branch *${PS_BRANCH_ID:-unknown}* — Slack notifications active" || {
echo "ERROR: Webhook verification failed — cron job not installed" >&2
exit 1
}
SCRIPT="$HOME/slack-migration-alerts.sh"
CRON_LINE="*/${INTERVAL} * * * * ${SCRIPT} > /dev/null 2>&1"
( crontab -l 2>/dev/null | grep -v "notify-migration.sh" | grep -v "slack-migration-alerts.sh" || true
echo "$CRON_LINE"
) | crontab -
echo "Cron job installed (every ${INTERVAL} min):"
echo " $CRON_LINE"
exit 0
fi

# ── --uninstall ────────────────────────────────────────────────────
if [ "$ACTION" = "uninstall" ]; then
( crontab -l 2>/dev/null | grep -v "notify-migration.sh" | grep -v "slack-migration-alerts.sh" || true ) | crontab -
echo "Cron job removed."
exit 0
fi

# ── Guard ──────────────────────────────────────────────────────────
if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') SKIP: SLACK_WEBHOOK_URL not set in ~/.env"
exit 0
fi

# ── Find migration directory ───────────────────────────────────────
MIGRATION_DIR=$(ls -dt "$HOME"/migration_* 2>/dev/null | head -1 || true)
if [ -z "$MIGRATION_DIR" ]; then
exit 0
fi

LOG="$MIGRATION_DIR/migration.log"
DB="$MIGRATION_DIR/schema/source.db"
STATE="$MIGRATION_DIR/.notify-state"

if [ ! -f "$LOG" ]; then
exit 0
fi

# ── Load state from previous run ──────────────────────────────────
# Stored inside the migration directory — resets automatically when
# a new migration starts (new directory = no state file).
LAST_ERROR_COUNT=0
LAST_STATUS="unknown"
LAST_INITIAL_COPY_NOTIFIED="false"
LAST_COMPLETION_NOTIFIED="false"

if [ -f "$STATE" ]; then
# shellcheck source=/dev/null
source "$STATE" 2>/dev/null || true
fi

# ── Current state from log ─────────────────────────────────────────
PROC_RUNNING=false
if ps aux | grep -q "[p]gcopydb.*clone"; then
PROC_RUNNING=true
fi

MIGRATION_SUCCEEDED=false
MIGRATION_FAILED=false

INITIAL_COPY_DONE=false
if grep -q "All step are now done" "$LOG" 2>/dev/null; then
INITIAL_COPY_DONE=true
fi

if grep -q "Migration SUCCEEDED" "$LOG" 2>/dev/null; then
MIGRATION_SUCCEEDED=true
fi

EXIT_LINE=$(grep "Exit code:" "$LOG" 2>/dev/null | tail -1 || true)
if [ -n "$EXIT_LINE" ] && ! echo "$EXIT_LINE" | grep -q "Exit code: 0"; then
MIGRATION_FAILED=true
fi

if [ "$MIGRATION_SUCCEEDED" = true ]; then
CURRENT_STATUS="succeeded"
elif [ "$MIGRATION_FAILED" = true ]; then
CURRENT_STATUS="failed"
elif [ "$PROC_RUNNING" = true ]; then
CURRENT_STATUS="running"
else
CURRENT_STATUS="stopped"
fi

CURRENT_ERROR_COUNT=$(grep -c " ERROR " "$LOG" 2>/dev/null || true)
CURRENT_ERROR_COUNT=$(( ${CURRENT_ERROR_COUNT:-0} + 0 ))

# ── Context from SQLite for richer alert messages ──────────────────
TABLES_DONE=$(db_query "SELECT COUNT(*) FROM summary WHERE tableoid IS NOT NULL AND done_time_epoch IS NOT NULL;")
NONSPLIT=$(db_query "SELECT COUNT(*) FROM s_table t WHERE NOT EXISTS (SELECT 1 FROM s_table_part p WHERE p.oid = t.oid);")
SPLIT_PARTS=$(db_query "SELECT COUNT(*) FROM s_table_part;")
TABLES_TOTAL=$(( NONSPLIT + SPLIT_PARTS ))
BYTES=$(db_query "SELECT COALESCE(SUM(bytes),0) FROM summary WHERE tableoid IS NOT NULL;")
GB=$(echo "scale=1; $BYTES / 1024 / 1024 / 1024" | bc 2>/dev/null || echo "0")
INDEXES_DONE=$(db_query "SELECT COUNT(DISTINCT indexoid) FROM summary WHERE indexoid IS NOT NULL AND done_time_epoch IS NOT NULL;")
INDEXES_TOTAL=$(db_query "SELECT COUNT(*) FROM s_index;")
CONSTRAINTS_DONE=$(db_query "SELECT COUNT(DISTINCT conoid) FROM summary WHERE conoid IS NOT NULL AND done_time_epoch IS NOT NULL;")
CONSTRAINTS_TOTAL=$(db_query "SELECT COUNT(*) FROM s_constraint;")

# ── Evaluate and notify ────────────────────────────────────────────
NOTIFIED_INITIAL_COPY="$LAST_INITIAL_COPY_NOTIFIED"
NOTIFIED_COMPLETION="$LAST_COMPLETION_NOTIFIED"

if [ "$INITIAL_COPY_DONE" = true ] && [ "$LAST_INITIAL_COPY_NOTIFIED" = "false" ]; then
DIR_EPOCH=$(stat -c %Y "$MIGRATION_DIR" 2>/dev/null || date +%s)
SECS=$(( $(date +%s) - DIR_EPOCH ))
RUNTIME=$(printf "%dh %02dm" $(( SECS/3600 )) $(( (SECS%3600)/60 )))
msg=":large_green_circle: *Initial copy completed — CDC phase starting*"
msg+=$'\n'"Branch: *${PS_BRANCH_ID}* | Runtime: ${RUNTIME} | Data: ${GB} GB"
msg+=$'\n'"Tables: ${TABLES_DONE}/${TABLES_TOTAL} | Indexes: ${INDEXES_DONE}/${INDEXES_TOTAL} | Constraints: ${CONSTRAINTS_DONE}/${CONSTRAINTS_TOTAL}"
if slack_send "$msg"; then
NOTIFIED_INITIAL_COPY="true"
fi

elif [ "$CURRENT_STATUS" = "succeeded" ] && [ "$LAST_COMPLETION_NOTIFIED" = "false" ]; then
DIR_EPOCH=$(stat -c %Y "$MIGRATION_DIR" 2>/dev/null || date +%s)
SECS=$(( $(date +%s) - DIR_EPOCH ))
RUNTIME=$(printf "%dh %02dm" $(( SECS/3600 )) $(( (SECS%3600)/60 )))
msg=":white_check_mark: *Migration completed successfully*"
msg+=$'\n'"Branch: *${PS_BRANCH_ID}* | Runtime: ${RUNTIME} | Data: ${GB} GB"
msg+=$'\n'"Tables: ${TABLES_DONE}/${TABLES_TOTAL} | Dir: ${MIGRATION_DIR}"
if slack_send "$msg"; then
NOTIFIED_COMPLETION="true"
fi

elif [ "$CURRENT_STATUS" = "failed" ] && [ "$LAST_COMPLETION_NOTIFIED" = "false" ]; then
msg=":red_circle: *Migration FAILED*"
msg+=$'\n'"Branch: *${PS_BRANCH_ID}* | Tables: ${TABLES_DONE}/${TABLES_TOTAL} | Data: ${GB} GB"
msg+=$'\n'"Check migration.log for error details"
msg+=$'\n'"Dir: ${MIGRATION_DIR}"
if slack_send "$msg"; then
NOTIFIED_COMPLETION="true"
fi

elif [ "$CURRENT_STATUS" = "stopped" ] && [ "$LAST_STATUS" = "running" ]; then
msg=":red_circle: *Migration process stopped unexpectedly*"
msg+=$'\n'"Branch: *${PS_BRANCH_ID}* | Tables: ${TABLES_DONE}/${TABLES_TOTAL} | Data: ${GB} GB"
msg+=$'\n'"Run: tail -50 ${LOG}"
slack_send "$msg" || true
fi

if [ "$CURRENT_ERROR_COUNT" -gt "$LAST_ERROR_COUNT" ]; then
NEW_COUNT=$(( CURRENT_ERROR_COUNT - LAST_ERROR_COUNT ))
if [ "$CURRENT_STATUS" = "stopped" ]; then
msg=":red_circle: *${NEW_COUNT} new error(s) in migration log — process is not running*"
else
msg=":warning: *${NEW_COUNT} new error(s) in migration log*"
fi
msg+=$'\n'"Branch: *${PS_BRANCH_ID}* | Total errors: ${CURRENT_ERROR_COUNT} | Tables: ${TABLES_DONE}/${TABLES_TOTAL}"
slack_send "$msg" || true
fi

# ── Save state ─────────────────────────────────────────────────────
cat > "$STATE" <<EOF
LAST_ERROR_COUNT=${CURRENT_ERROR_COUNT}
LAST_STATUS=${CURRENT_STATUS}
LAST_INITIAL_COPY_NOTIFIED=${NOTIFIED_INITIAL_COPY}
LAST_COMPLETION_NOTIFIED=${NOTIFIED_COMPLETION}
EOF
5 changes: 3 additions & 2 deletions pgcopydb-helpers/start-migration-screen.sh
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ screen -dmS migration bash -c '~/run-migration.sh; echo "Press enter to exit.";

echo "Migration started in screen session 'migration'"
echo ""
echo "To watch: screen -r migration"
echo "To detach: Ctrl-A then D"
echo "To watch: screen -r migration"
echo "To detach: Ctrl-A then D"
echo "To check status: ~/check-migration-status.sh"
echo "────────────────────────────────────────────────────────"