K8SPG-771: Backup snapshots#1418

Open
mayankshah1607 wants to merge 98 commits into main from K8SPG-771
Conversation

@mayankshah1607
Member

@mayankshah1607 mayankshah1607 commented Jan 21, 2026

CHANGE DESCRIPTION

Adds support for taking backups using the Kubernetes VolumeSnapshot API. Volume snapshots enable fast backups taken directly by the CSI driver.

This PR adds support for offline snapshot backups only.

  1. Enable snapshots for your cluster:
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster1
spec:
  backups:
    volumeSnapshots:
      mode: offline
      className: VOLUME-SNAPSHOT-CLASS
  2. Trigger a snapshot backup via PerconaPGBackup:
apiVersion: pgv2.percona.com/v2
kind: PerconaPGBackup
metadata:
  name: backup1
spec:
  pgCluster: cluster1
  method: volumeSnapshot

The operator will:

  • Select a healthy replica pod and temporarily suspend it.
  • Create a VolumeSnapshot of its data PVC.
  • Resume the suspended replica.
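
The sequence above can be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not the operator's actual code: the `Cluster` interface, its method names, `takeOfflineSnapshot`, and the `fakeCluster` test double are all hypothetical. The point it tries to show is that the replica should be resumed even when the snapshot step fails.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a hypothetical abstraction over the three operator steps.
type Cluster interface {
	HealthyReplica() (string, error)
	Suspend(pod string) error
	Snapshot(pod string) (string, error)
	Resume(pod string) error
}

// takeOfflineSnapshot suspends a replica, snapshots it, and always resumes it.
func takeOfflineSnapshot(c Cluster) (snap string, err error) {
	pod, err := c.HealthyReplica()
	if err != nil {
		return "", fmt.Errorf("select replica: %w", err)
	}
	if err := c.Suspend(pod); err != nil {
		return "", fmt.Errorf("suspend %s: %w", pod, err)
	}
	// Resume runs on every path after a successful suspend, including failures.
	defer func() {
		if rerr := c.Resume(pod); rerr != nil {
			err = errors.Join(err, fmt.Errorf("resume %s: %w", pod, rerr))
		}
	}()
	snap, err = c.Snapshot(pod)
	if err != nil {
		return "", fmt.Errorf("snapshot %s: %w", pod, err)
	}
	return snap, nil
}

// fakeCluster is an in-memory stand-in that records the order of calls.
type fakeCluster struct {
	calls        []string
	failSnapshot bool
}

func (f *fakeCluster) HealthyReplica() (string, error) {
	f.calls = append(f.calls, "select")
	return "replica-0", nil
}

func (f *fakeCluster) Suspend(pod string) error {
	f.calls = append(f.calls, "suspend")
	return nil
}

func (f *fakeCluster) Snapshot(pod string) (string, error) {
	f.calls = append(f.calls, "snapshot")
	if f.failSnapshot {
		return "", errors.New("simulated CSI driver error")
	}
	return "snap-of-" + pod, nil
}

func (f *fakeCluster) Resume(pod string) error {
	f.calls = append(f.calls, "resume")
	return nil
}

func main() {
	f := &fakeCluster{failSnapshot: true}
	snap, err := takeOfflineSnapshot(f)
	fmt.Println("snapshot:", snap, "error:", err, "calls:", f.calls)
}
```

Using a deferred resume keeps the replica from staying suspended if snapshot creation returns early with an error.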

To take scheduled snapshots, enable snapshots for your cluster and add a cron schedule:

apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster1
spec:
  backups:
    volumeSnapshots:
      mode: offline
      className: VOLUME-SNAPSHOT-CLASS
      schedule: "0 0 * * 6"

To restore a snapshot in place, use PerconaPGRestore:

apiVersion: pgv2.percona.com/v2
kind: PerconaPGRestore
metadata:
  name: restore3
spec:
  pgCluster: cluster1
  volumeSnapshotBackupName: backup1

It is also possible to restore a snapshot and then perform point-in-time recovery (PITR) using pgBackRest, as long as the backups in both sources belong to the same timeline:

apiVersion: pgv2.percona.com/v2
kind: PerconaPGRestore
metadata:
  name: restore3
spec:
  pgCluster: cluster1
  repoName: repo1
  volumeSnapshotBackupName: backup1
  options:
  - --type=time
  - --target="2026-02-09 10:17:51.899712+0000"

Restoring to a new cluster can be done by directly specifying the dataSource in the instance volume claim spec:

apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster1
spec:
  crVersion: 2.9.0
  instances:
  - name: instance1
    dataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      dataSource:
        apiGroup: snapshot.storage.k8s.io
        kind: VolumeSnapshot
        name: SNAPSHOT-NAME

We plan to add an online mode later, in which the operator can snapshot a live primary without suspending it.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support the oldest and newest supported PG versions?
  • Does the change support the oldest and newest supported Kubernetes versions?

Signed-off-by: Mayank Shah <mayank.shah@percona.com>
@CLAassistant

CLAassistant commented Jan 21, 2026

CLA assistant check
All committers have signed the CLA.

mayankshah1607 and others added 10 commits January 22, 2026 10:30
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Copilot AI review requested due to automatic review settings February 11, 2026 13:02
Copilot AI left a comment

Pull request overview

Copilot reviewed 77 out of 80 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

pkg/apis/pgv2.percona.com/v2/perconapgcluster_types.go:1

  • Corrected spelling of 'storate' to 'storage.k8s.io' in the PR description example.

@@ -1,10 +1,9 @@
-#!/bin/bash
+#!/bin/sh
Contributor

any reason to do this?

Member Author

We don't need any bash-specific features in this script, so using /bin/sh is better for portability (even though our images are expected to support bash). Do you think we need bash?

	return "", errors.Wrap(err, "failed to get backup target pod")
}

// TODO: should this be optional, since this can take a while on large datasets?
Contributor

Yeah, I discussed this with the PG team yesterday, and it looks like it can (and should) be made optional.


func (e *offlineExec) checkpoint(ctx context.Context, instanceName string) error {
	exec := func(_ context.Context, stdin io.Reader, stdout, stderr io.Writer, command ...string) error {
		return e.podExec(ctx, e.cluster.GetNamespace(), instanceName+"-0", naming.ContainerDatabase, stdin, stdout, stderr, command...)
Contributor

I see, thank you

Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Copilot AI review requested due to automatic review settings February 12, 2026 11:28
Copilot AI left a comment

Pull request overview

Copilot reviewed 77 out of 80 changed files in this pull request and generated 1 comment.


Comment on lines +122 to 130
if ptr.Deref(pgBackup.Spec.RepoName, "") == "" {
	if updErr := pgBackup.UpdateStatus(ctx, r.Client, func(bcp *v2.PerconaPGBackup) {
		bcp.Status.State = v2.BackupFailed
		bcp.Status.Error = "repoName is required when method is 'pgbackrest'"
	}); updErr != nil {
		return reconcile.Result{}, fmt.Errorf("failed to update backup status: %w", updErr)
	}
	pgCluster = nil
	return reconcile.Result{}, errors.New("'repoName' is required when method is 'pgbackrest'")
}
Copilot AI Feb 12, 2026

After setting the backup status to BackupFailed for missing repoName, the reconciler returns a non-nil error. This will cause unnecessary error logs/requeues even though the failure is already persisted in status. Consider returning reconcile.Result{} with nil error (or a normal requeue delay if you need finalizers to run) after updating the status.

@egegunes egegunes added this to the v2.9.0 milestone Feb 12, 2026
egegunes previously approved these changes Feb 13, 2026
Copilot AI left a comment

Pull request overview

Copilot reviewed 77 out of 80 changed files in this pull request and generated 5 comments.

Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Copilot AI review requested due to automatic review settings February 13, 2026 09:34
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 77 out of 80 changed files in this pull request and generated 1 comment.

	return fmt.Errorf("checkpoint failed: %s", stderr)
}

log.Info("checkpoint executed", "stdout", stdout, "stderr", stderr)
Copilot AI Feb 13, 2026

The logging for successful checkpoint (line 105) includes both stdout and stderr, which could potentially log sensitive information. While checkpoint output is typically safe, consider whether logging the full output is necessary or if a simple success message would suffice.

@JNKPercona
Collaborator

Test Name                       Result  Time
backup-enable-disable           passed  00:07:04
builtin-extensions              passed  00:05:11
custom-envs                     passed  00:18:40
custom-extensions               passed  00:13:47
custom-tls                      passed  00:04:50
database-init-sql               passed  00:02:21
demand-backup                   passed  00:24:04
demand-backup-offline-snapshot  passed  00:13:47
finalizers                      passed  00:03:51
init-deploy                     passed  00:02:45
huge-pages                      passed  00:03:06
monitoring                      passed  00:06:48
monitoring-pmm3                 passed  00:08:26
one-pod                         passed  00:05:43
operator-self-healing           passed  00:08:00
pitr                            passed  00:11:29
scaling                         passed  00:05:12
scheduled-backup                passed  00:25:28
self-healing                    passed  00:08:54
sidecars                        passed  00:02:35
standby-pgbackrest              passed  00:11:48
standby-streaming               passed  00:09:14
start-from-backup               passed  00:10:29
tablespaces                     passed  00:07:02
telemetry-transfer              passed  00:03:30
upgrade-consistency             passed  00:05:18
upgrade-minor                   passed  00:05:02
users                           passed  00:04:44

Summary          Value
Tests Run        28/28
Job Duration     01:32:41
Total Test Time  03:59:25

commit: 216ebbe
image: perconalab/percona-postgresql-operator:PR-1418-216ebbe23

5 participants