
postgres database controllers #1760

Draft
mploski wants to merge 63 commits into develop from feature/database-controllers

Conversation


@mploski mploski commented Mar 9, 2026

Description

Introduces the MVP PostgreSQL controller stack to the Splunk Operator. Rather than
implementing PostgreSQL cluster management from scratch, this PR builds a Kubernetes-native
wrapper on top of CloudNativePG (CNPG) — a production-grade
PostgreSQL operator. The Splunk Operator layer adds three new CRDs and two new controllers
that translate high-level declarative intent (cluster classes, logical databases, managed
roles) into CNPG primitives, while handling the Splunk-specific lifecycle concerns:
credential provisioning, connection metadata publishing, and multi-phase status tracking.


Key Changes

api/v4/postgresclusterclass_types.go (new)

Defines the Go API schema for PostgresClusterClass — a cluster-scoped CRD that
standardises PostgreSQL cluster configuration (instance count, storage, Postgres version,
resource limits, pg_hba rules, connection pooler) across environments. Operators define
dev and prod classes; teams reference them by name.

api/v4/postgrescluster_types.go (new)

Defines the Go API schema for PostgresCluster — the namespaced CRD teams use to request
a PostgreSQL cluster. Declares spec fields for merging class defaults with per-cluster
overrides, and status fields for provisioner reference, connection pooler state, and
managed role tracking.

api/v4/postgresdatabase_types.go (new)

Defines the Go API schema for PostgresDatabase — the CRD that declares one or more
logical databases on a PostgresCluster, including reclaim policy and the list of
database definitions. Status tracks per-database secret and ConfigMap references.

internal/controller/postgresoperator_common_types.go (new)

Shared constants, phase enums, condition types, and condition reasons used across both
controllers. Single source of truth for retry delays, finalizer names, endpoint suffixes,
and all status strings.
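Such a shared-constants file often looks like the sketch below. The ClusterReady and PoolerReady condition names appear elsewhere in this PR; the phase strings, finalizer name, and endpoint suffixes here are illustrative assumptions:

```go
package main

import "fmt"

const (
	// Condition types shared by both controllers (ClusterReady and
	// PoolerReady are named in this PR; the rest follow the same pattern).
	conditionClusterReady = "ClusterReady"
	conditionPoolerReady  = "PoolerReady"

	// Illustrative phases reported in .status.phase.
	phasePending      = "Pending"
	phaseProvisioning = "Provisioning"
	phaseReady        = "Ready"
	phaseDeleting     = "Deleting"

	// Hypothetical finalizer and endpoint-suffix conventions.
	finalizerName    = "enterprise.splunk.com/postgres-finalizer"
	rwEndpointSuffix = "-rw"
	roEndpointSuffix = "-ro"
)

func main() {
	fmt.Println(conditionClusterReady, phaseReady, finalizerName)
}
```

Keeping these strings in one file means a renamed condition or phase only has to change in one place, which is the "single source of truth" goal stated above.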

internal/controller/postgrescluster_controller.go (new)

Reconciles PostgresCluster CRs against CNPG:

  • Provisions a PostgreSQL cluster by translating the PostgresCluster spec into a CNPG
    cluster, merging any defaults inherited from the referenced PostgresClusterClass with
    per-cluster overrides
  • Manages the optional connection pooler (PgBouncer) lifecycle alongside the cluster,
    enabling high-throughput workloads to share a bounded pool of database connections
  • Handles safe deletion by cleaning up all dependent CNPG resources before removing the CR,
    preventing orphaned infrastructure
  • Exposes observable status conditions (ClusterReady, PoolerReady) so the
    PostgresDatabase controller and operators can wait on a confirmed healthy cluster before
    proceeding

internal/controller/postgresdatabase_controller.go (new)

Reconciles PostgresDatabase CRs against CNPG:

  • Ensures the target PostgresCluster is healthy before creating any database resources,
    so partial provisioning is never left in an inconsistent state
  • Provisions database roles (admin and read-write) by configuring them on the cluster —
    each role is scoped to its own database to enforce least-privilege access
  • Generates secure credentials for each role and stores them as Kubernetes Secrets,
    ready to be mounted by workloads that need database access
  • Publishes connection details (host, port, database name, usernames, and pooler endpoints
    when available) into a ConfigMap per database, giving consuming services a stable
    reference without hardcoding hostnames
  • Provisions the logical databases on the PostgreSQL cluster and respects a configurable
    deletion policy that controls whether the database is physically dropped when the CR is
    deleted
  • Exposes observable status conditions (ClusterReady, SecretsReady, ConfigMapsReady,
    UsersReady, DatabasesReady) so operators can track exactly which phase reconciliation
    is in at any point
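A minimal sketch of the phase-gating idea behind those conditions, using the condition names listed above; the gating order and helper function are illustrative, not the controller's actual code:

```go
package main

import "fmt"

// Condition names from this PR, in the order they are listed above.
// Whether the real controller sequences them exactly this way is an assumption.
var phaseOrder = []string{
	"ClusterReady",
	"SecretsReady",
	"ConfigMapsReady",
	"UsersReady",
	"DatabasesReady",
}

// firstUnmet returns the first condition that is not yet true,
// or "" once reconciliation is complete.
func firstUnmet(status map[string]bool) string {
	for _, c := range phaseOrder {
		if !status[c] {
			return c
		}
	}
	return ""
}

func main() {
	status := map[string]bool{"ClusterReady": true}
	fmt.Println(firstUnmet(status)) // the next condition to work on
}
```

Operators watching the CR's status conditions get the same information: the first False condition pinpoints which phase reconciliation is stuck in.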

cmd/main.go

Registers both new controllers with the manager.

config/rbac/

Nine new RBAC files: admin, editor, and viewer roles for PostgresCluster,
PostgresClusterClass, and PostgresDatabase.

config/samples/

Five sample CRs: dev and prod cluster classes, default and override cluster
instances, and a PostgresDatabase example.


Testing and Verification

Manual Testing

End-to-end flow verified by applying the full CR chain in order:

# Deploy Class
kubectl apply -f config/samples/enterprise_v4_postgresclusterclass_dev.yaml
# Deploy Postgres Cluster
kubectl apply -f config/samples/enterprise_v4_postgrescluster_default.yaml
# Deploy Postgres Databases
kubectl apply -f config/samples/enterprise_v4_postgresdatabase.yaml

1. Verify the PostgresCluster is ready

kubectl get postgrescluster postgresql-cluster-dev -n default

2. Verify CNPG provisioned the underlying cluster

kubectl get clusters.postgresql.cnpg.io postgresql-cluster-dev -n default

3. Verify the PostgresDatabase status and conditions

kubectl get postgresdatabase splunk-databases -n default
kubectl get postgresdatabase splunk-databases -n default \
  -o jsonpath='{.status.conditions}' | jq

4. Verify CNPG Database CRs were created for each declared database

kubectl get databases.postgresql.cnpg.io -n default

5. Inspect Secrets and ConfigMaps created per database

# Secrets: {postgresdatabase-name}-{db-name}-{role}
kubectl get secret splunk-databases-kvstore-admin -n default -o jsonpath='{.data}' | jq
kubectl get secret splunk-databases-kvstore-rw -n default -o jsonpath='{.data}' | jq

# ConfigMaps: {postgresdatabase-name}-{db-name}
kubectl get configmap splunk-databases-kvstore -n default -o jsonpath='{.data}' | jq
kubectl get configmap splunk-databases-analytics -n default -o jsonpath='{.data}' | jq
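The naming convention shown in the comments above can be captured in small helpers like the following sketch. The 253-character cap is the Kubernetes DNS-subdomain limit on object names; adding that check here is an assumption (a reviewer suggestion), not something this PR currently does:

```go
package main

import "fmt"

// userSecretName builds {postgresdatabase-name}-{db-name}-{role}, guarding
// against the 253-char DNS-subdomain limit on Kubernetes object names.
func userSecretName(postgresDBName, dbName, role string) (string, error) {
	name := fmt.Sprintf("%s-%s-%s", postgresDBName, dbName, role)
	if len(name) > 253 {
		return "", fmt.Errorf("secret name %q exceeds 253 characters", name)
	}
	return name, nil
}

// configMapName builds {postgresdatabase-name}-{db-name}.
func configMapName(postgresDBName, dbName string) string {
	return fmt.Sprintf("%s-%s", postgresDBName, dbName)
}

func main() {
	name, _ := userSecretName("splunk-databases", "kvstore", "rw")
	fmt.Println(name)                                        // splunk-databases-kvstore-rw
	fmt.Println(configMapName("splunk-databases", "kvstore")) // splunk-databases-kvstore
}
```

Deriving both the created object names and the status references from one helper keeps the two from drifting apart, which is the rationale given in the controller code.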

6. Connect to the database using credentials and endpoints from the ConfigMap and Secret

HOST=$(kubectl get configmap splunk-databases-kvstore -n default \
  -o jsonpath='{.data.rw-host}')
DB=$(kubectl get configmap splunk-databases-kvstore -n default \
  -o jsonpath='{.data.dbname}')
USER=$(kubectl get configmap splunk-databases-kvstore -n default \
  -o jsonpath='{.data.rw-user}')
PASS=$(kubectl get secret splunk-databases-kvstore-rw -n default \
  -o jsonpath='{.data.password}' | base64 -d)

psql "postgresql://$USER:$PASS@$HOST/$DB"
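For a client doing the same thing programmatically, here is a small Go sketch that assembles the connection URL with proper escaping, which matters if a generated password contains URL-reserved characters. The host, user, and database values are placeholders modeled on the samples above:

```go
package main

import (
	"fmt"
	"net/url"
)

// postgresURL builds a postgresql:// URL, letting net/url percent-encode
// any reserved characters in the credentials.
func postgresURL(user, pass, host, db string) string {
	u := url.URL{
		Scheme: "postgresql",
		User:   url.UserPassword(user, pass),
		Host:   host,
		Path:   "/" + db,
	}
	return u.String()
}

func main() {
	// Placeholder values; in practice these come from the ConfigMap and Secret.
	fmt.Println(postgresURL("kvstore_rw", "p@ssword", "splunk-databases-rw.default.svc", "kvstore"))
}
```

Note that the shell version above splices `$PASS` straight into the URL, which breaks for passwords containing `@` or `/`; escaping on the client side avoids that class of failure.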

Additional scenarios verified:

  • pooler-rw-host and pooler-ro-host keys appear in ConfigMap only when the cluster
    has connection pooling enabled
  • Controller fully reconciles on first apply with no hang after finalizer write
  • Deleting a PostgresDatabase CR triggers clean-up respecting the deletionPolicy
    declared per database (Retain preserves data, Delete drops the database)

Automated Tests — Planned Next

Unit and integration tests covering the full reconciliation lifecycle, error paths, and
status condition transitions are scheduled as the next deliverable for this feature branch.

Related Issues

Jira tickets, GitHub issues, Support tickets...

PR Checklist

  • Code changes adhere to the project's coding standards.
  • Relevant unit and integration tests are included.
  • Documentation has been updated accordingly.
  • All tests pass locally.
  • The PR description follows the project's guidelines.

mploski and others added 30 commits January 29, 2026 15:53

postgresql logical db controller

Postgres Cluster controller with base functionality.
* addition of connection pooler types

* moved ConnectionPooler to CNPGConfig

* created connectionPoolerEnabled in cluster config

* delete connectionpooler_types.go
remove semver check, add messages extended description
fix validation logic for storage and postgresversion

fix infinite loop on reconciliation

Add secrets management for roles and its use in cluster
Collaborator

@kubabuczak kubabuczak left a comment

Please rebase with develop. For some reason, part of the CI checks were not executed (it could be because of merge conflicts).

Collaborator

Why is this file part of this MR?

Collaborator Author

@mploski mploski Mar 24, 2026

Removing; this is an orphan, it shouldn't be here.

Collaborator

Should this file be part of this MR?

Collaborator Author

Removing; this is an orphan, it shouldn't be here.

Collaborator

I'm confused, because this is part of current develop:

# Additional volumeMounts will be added by patches for webhook and metrics certs

@@ -28,6 +28,16 @@ spec:
kind: ClusterMaster
name: clustermasters.enterprise.splunk.com
version: v3
- description: DatabaseClass is the Schema for the databaseclasses API.
Collaborator

Hmm.. I have no idea what this comment is referring to.

Collaborator

Are those validation tests used anywhere?
I would say they are good candidates for kuttl-based testing.

Collaborator Author

This will be moved to Ginkgo integration/e2e testing once we have it. We have already started this work.

// Note: CNPG PoolerStatus only tracks scheduled instances, not ready pods.
func (r *PostgresClusterReconciler) isPoolerReady(pooler *cnpgv1.Pooler, err error) bool {
if err != nil {
return false
Collaborator

Agree - instead you should check if pooler != nil


if role.PasswordSecretRef != nil {
cnpgRole.PasswordSecret = &cnpgv1.LocalObjectReference{
Name: role.PasswordSecretRef.Name,
Collaborator

LocalObjectReference doesn't contain namespace

endpoints := resolveClusterEndpoints(cluster, cnpgCluster, postgresDB.Namespace)
if err := r.reconcileUserConfigMaps(ctx, postgresDB, endpoints); err != nil {
if statusErr := updateStatus(configMapsReady, metav1.ConditionFalse, reasonConfigMapsCreationFailed,
fmt.Sprintf("Failed to reconcile ConfigMaps: %v", err), provisioningDBPhase); statusErr != nil {
Collaborator

In general it's better not to expose internal Go errors in k8s resources (apply this everywhere).

Comment on lines +378 to +382
func (r *PostgresDatabaseReconciler) verifyRolesReady(
ctx context.Context,
expectedUsers []string,
cnpgCluster *cnpgv1.Cluster,
) ([]string, error) {
Collaborator

[nit] Hmm.. if it's only verification, why are we returning a string slice of not-ready entries? Maybe it should be named differently.

Collaborator

Same for verifyDatabasesReady

Collaborator Author

It is verification, but we return the list of not-ready databases so we can update the status of our CR with that list and users can see what is not ready. Aside from that, it could be a bool. Do you think this pattern is not accurate?

// userSecretName gives both secret creation and status wiring a single source of truth
// for naming — eliminating any risk of the two sides drifting out of sync.
func userSecretName(postgresDBName, dbName, role string) string {
return fmt.Sprintf("%s-%s-%s", postgresDBName, dbName, role)
Collaborator

I would add some prefix/postfix to reduce the possibility of conflict (applies to all derived names).

Also, there is a length limit on the name of each k8s resource - maybe we need to check for it.

Collaborator Author

We explicitly removed the prefix/postfix to simplify the logic. The secret is named by the postgres cluster name and db-name, which provides the needed uniqueness within one namespace. Per https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names, the maximum size for resource names is 253 chars, which should be sufficient.

Collaborator

@kubabuczak kubabuczak left a comment

Some review feedback from AI: some of it makes a lot of sense, some of it needs further discovery.

Secret deletion breaks retained CNPG clusters

When ClusterDeletionPolicy: Retain, the finalizer removes the owner reference from the CNPG Cluster but not from the superuser Secret. The Secret is cascade-deleted with PostgresCluster, leaving the retained CNPG cluster referencing a non-existent secret.

Samples kustomization references non-existent files

config/samples/kustomization.yaml references enterprise_v4_database.yaml, enterprise_v4_databaseclass.yaml, and enterprise_v4_postgrescluster.yaml — none of which exist. This will break kustomize build.

Typo in type name: PosgresClusterClassConfig

The type is named PosgresClusterClassConfig (missing a t in Postgres) in postgresclusterclass_types.go. This propagates into the generated CRD and deepcopy code.

Broken print columns in PostgresClusterClass CRD

Print columns reference .spec.postgresClusterConfig.* but the actual spec field serializes as .spec.config.*. kubectl get postgresclusterclass will show empty columns.

No conflict handling for finalizer updates in PostgresDatabase

PostgresCluster handles apierrors.IsConflict when adding finalizers; PostgresDatabase does not, causing unnecessary reconcile failures on resource version conflicts.

Status updates use stale objects

updateStatus in PostgresCluster mutates the in-memory object and calls r.Status().Update(). Under concurrent updates, this can overwrite changes. Should use Patch with client.MergeFrom.

ConnectionPoolerConfig field is dead code

PostgresCluster spec includes ConnectionPoolerConfig with a comment implying it has effect, but a CEL validation rule (!has(self.connectionPoolerConfig)) prevents it from ever being set. Either remove the field or the validation.

No minLength validation on clusterRef.name and database name

Both fields accept empty strings, which would cause the reconciler to create invalid CNPG resources.

poolerExists swallows errors

The function treats all non-NotFound API errors as "pooler does not exist" (returns false), hiding real API issues and potentially causing incorrect behavior.

PostgresClusterClass immutability not enforced

Description says "immutable after creation" but there are no CEL rules preventing mutations to the spec after initial creation.

Sample comment contradicts value

enterprise_v4_postgrescluster_override.yaml has a comment saying clusterDeletionPolicy: Retain but the actual value is Delete.

JSON tag inconsistency: ConfigMapRef serializes as "configMap"

In DatabaseInfo, other refs use the *Ref suffix in JSON tags (databaseRef, adminUserSecretRef, rwUserSecretRef) but ConfigMapRef serializes as configMap.

No pattern validation on database names

DatabaseDefinition.Name has MaxLength: 30 but no pattern restricting to valid PostgreSQL identifiers.

PostgresClusterClass missing resource path in marker

Should have +kubebuilder:resource:path=postgresclusterclasses,scope=Cluster for consistency with other types.

Database CRDs not included in preserveUnknownFields patch

Existing CRDs are patched with preserveUnknownFields: false but the three new database CRDs are not.

Status updated unconditionally

syncStatus and updateStatus always call r.Status().Update() even when nothing changed, creating unnecessary API calls and potential conflicts.

Role removal not propagated

When a role is removed from spec.managedRoles, it is not patched as Ensure: absent in CNPG. Roles can only be added, not removed (acknowledged via TODO).

ObservedGeneration early-exit may miss state regression

If external state changes (e.g., CNPG cluster deleted) but the CR's generation hasn't changed, status won't be updated.

@limak9182 limak9182 force-pushed the feature/database-controllers branch from bf24bbb to 6bd518c on March 17, 2026 09:14
@@ -0,0 +1,203 @@
/*
Copyright 2026.
Collaborator

[nit] I would double-check what the preferred option is for the copyright year range.

Collaborator Author

@mploski mploski Mar 24, 2026

Who might give us this info? What is the policy for the SOK itself?

7 participants