
pg-cdc

PostgreSQL change data capture → Parquet. Stream WAL changes into typed, compacted Parquet files on S3, GCS, or filesystem. Single Go binary. No CGO. No Kafka.


Why pg-cdc?

Change data capture from PostgreSQL is usually delivered by Debezium + Kafka + a warehouse loader — three moving parts with their own JVM, schema registry, and ops cost. pg-cdc replaces that stack with a single Go binary that reads WAL via native PostgreSQL logical replication, writes typed Parquet directly to S3/GCS/filesystem, and (optionally) registers tables in AWS Glue. No Kafka. No Java. No warehouse to pre-provision.

What does it solve?

A PostgreSQL CDC server that streams WAL changes into typed, compacted Parquet files in cloud storage. Follows the native PostgreSQL replica pattern: one publication, one replication slot, one streaming consumer. Suitable for teams who want CDC output as Parquet for downstream warehouses, data lakes, or analytical engines — without running Kafka, Debezium, or a warehouse loader.

Architecture

pg-cdc is the cloud-storage and data-governance layer. pg-warehouse is the developer-friendly local analytics platform that consumes its output.

pg-cdc produces typed, compacted Parquet files in cloud storage. pg-warehouse consumes them. The contract between them is defined in burnside-go — a shared manifest spec with base snapshots, delta epochs, and role-based access profiles derived from PostgreSQL ACLs.

Multiple developers pull from the same CDC stream independently. No shared DuckDB files, no locking, no contention.

Internally, pg-cdc uses hexagonal architecture with clean port/adapter separation. CLI commands (Cobra) call services that depend only on port interfaces. Adapters for PostgreSQL (source), Parquet writer, filesystem/S3/GCS sinks, SQLite state, and Glue catalog implement those interfaces. New sinks or catalog backends plug in without changing business logic.
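
A minimal sketch of that port/adapter split, with illustrative names (the actual pg-cdc interfaces are not shown here): the service depends only on the port, and any adapter satisfying it plugs in.

```go
package main

import "fmt"

// Hypothetical port: anything that can persist a batch of rows.
type Sink interface {
	Write(table string, rows int) error
}

// One adapter per backend; each satisfies the same port.
type FilesystemSink struct{ Root string }

func (s FilesystemSink) Write(table string, rows int) error {
	fmt.Printf("fs: wrote %d rows for %s under %s\n", rows, table, s.Root)
	return nil
}

// A service depends only on the port, never on a concrete adapter,
// so an S3 or GCS adapter drops in without touching this code.
type Pipeline struct{ Out Sink }

func (p Pipeline) Flush(table string, rows int) error {
	return p.Out.Write(table, rows)
}

func main() {
	p := Pipeline{Out: FilesystemSink{Root: "/var/lib/pg-cdc/output"}}
	_ = p.Flush("public.users", 3)
}
```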

Quick comparison

|  | pg-cdc | Debezium + Kafka | AWS DMS | Fivetran / Airbyte |
| --- | --- | --- | --- | --- |
| Output format | Parquet (typed, columnar) | Avro/JSON via Kafka | Multiple, via engine | Warehouse-native |
| Infrastructure | Single Go binary | Kafka + Connect + JVM | Managed AWS service | SaaS |
| Logical replication | Native pglogrepl | Debezium connector | Proprietary | Via connectors |
| Sinks | Filesystem, S3, GCS | Kafka topics | S3, Redshift, others | Warehouses |
| Catalog | AWS Glue (optional) | None | Glue | Varies |
| Cloud required | No | No (but heavy) | Yes | Yes |
| Language / runtime | Go, no CGO | Java (JVM) | Managed | Managed |

How does it work?

```
PostgreSQL ──WAL──→ pg-cdc ──Parquet──→ Cloud Storage
                      │                  (S3 / GCS / filesystem)
                      │
                      └──→ AWS Glue (optional: table catalog)
```

pg-cdc uses PostgreSQL's native logical replication:

| Stage | Output |
| --- | --- |
| init | Snapshot every table → base Parquet + manifest + (optional) Glue catalog entries |
| start | Stream WAL → append-only delta Parquet epochs per table |
| compact | Merge deltas → new base snapshot (applies inserts/updates/deletes; soft-deletes on 30d TTL) |

State survives restarts: LSN position, epoch watermarks, and table metadata live in a local SQLite state file.
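For illustration, the WAL position held in that state file is PostgreSQL's 64-bit LSN, usually shown as two hex halves separated by a slash. A sketch of the conversion used for resume-from-checkpoint arithmetic (pg-cdc itself relies on pglogrepl's types for this):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseLSN converts PostgreSQL's textual LSN form ("0/16B3748")
// into the 64-bit WAL position: high 32 bits / low 32 bits, both hex.
func ParseLSN(s string) (uint64, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("bad LSN %q", s)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return hi<<32 | lo, nil
}

func main() {
	pos, _ := ParseLSN("0/16B3748")
	fmt.Println(pos) // 23803720
}
```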

Documentation

| Doc | Description |
| --- | --- |
| Getting Started | Install, configure, run a first pipeline |
| Configuration | Full YAML reference |
| Init | Snapshot phase details |
| Streaming | WAL streaming mechanics |
| Compaction | Base + delta model, TTL semantics |
| Operations | Production run-book, health checks, troubleshooting |
| Commercial Edition | Closed-source governance / ACL extensions |

Install

Download binary — see Releases for Linux (amd64/arm64), macOS (arm64), and Windows (amd64):

```shell
# Linux (amd64)
curl -fsSL https://github.com/burnside-project/pg-cdc/releases/latest/download/pg-cdc_linux_amd64.tar.gz | tar xz
sudo install -m 0755 pg-cdc-linux-amd64 /usr/local/bin/pg-cdc

# macOS (Apple Silicon)
curl -fsSL https://github.com/burnside-project/pg-cdc/releases/latest/download/pg-cdc_darwin_arm64.tar.gz | tar xz
sudo install -m 0755 pg-cdc-darwin-arm64 /usr/local/bin/pg-cdc
```

Build from source:

```shell
git clone https://github.com/burnside-project/pg-cdc.git
cd pg-cdc
make build
```

Quickstart

1. Configure — create pg-cdc.yml:

```yaml
source:
  postgres:
    url: "postgresql://cdc_user@host:5432/db"
    schemas: ["public"]

storage:
  type: filesystem              # filesystem | s3 | gcs
  path: /var/lib/pg-cdc/output/

replication:
  publication: pg_cdc_pub
  slot: pg_cdc_slot

flush:
  interval_sec: 10
  max_rows: 1000
```

2. Discover tables — confirm pg-cdc can see what you expect:

$ pg-cdc discover

3. Initialize — snapshot base tables + create the replication slot:

$ pg-cdc init --config pg-cdc.yml

4. Start streaming — tail WAL into Parquet deltas:

$ pg-cdc start --config pg-cdc.yml

5. Check health — LSN lag, epochs, per-table status:

$ pg-cdc status

See the Operations guide for production deployment patterns.

Features

Source

  • PostgreSQL logical replication (pgx/v5, pglogrepl)
  • Per-schema discovery
  • Declarative table include/exclude rules
  • Tag-based table policy (e.g., pii, ephemeral tags → include/exclude)

Output

  • Typed Parquet (pure Go, no CGO)
  • Base snapshots + append-only delta epochs
  • Compaction into new base (applies I/U/D; soft-deletes on 30d TTL)
  • Manifest file per table (schema + epoch ordering)

Sinks

  • Filesystem
  • S3
  • GCS

Catalog

  • AWS Glue (optional; register manifest tables without re-snapshotting)

Operations

  • Single static binary, no CGO
  • Linux amd64/arm64, macOS arm64, Windows amd64
  • SQLite state tracking (LSN, epoch watermarks, table metadata)
  • Role → table → column ACL discovery from PostgreSQL GRANTs
  • Automated RC releases on every push to main; stable releases via workflow dispatch

Commands

| Command | What it does |
| --- | --- |
| init | Snapshot tables → base Parquet + manifest + (optional) Glue catalog |
| start | Stream WAL → delta Parquet epochs |
| compact | Merge deltas → new base snapshot (applies I/U/D; soft-deletes on TTL) |
| status | Health: lag, LSN, epochs, tables |
| discover | List tables from Postgres |
| discover --acl | Show role → table → column access map from PostgreSQL GRANTs |
| teardown | Drop publication + replication slot |
| catalog register | Register manifest tables in Glue without re-snapshotting |
| version | Print version |

Full reference in docs/08-operations.md.

Configuration

Production example with S3 + Glue and tag-based policy:

```yaml
source:
  postgres:
    url: "postgresql://cdc_user:${PGCDC_PASSWORD}@host:5432/db"
    schemas: ["public"]

storage:
  type: s3
  bucket: my-warehouse
  prefix: cdc/
  region: us-west-2

catalog:
  type: glue
  database: my_db
  region: us-west-2

tables:
  exclude: ["public.tbl_sessions"]
  tags:
    pii: ["public.tbl_cc", "billing.*"]
    ephemeral: ["*.tbl_session*"]
  policy:
    pii: exclude
    ephemeral: exclude
    untagged: include
```
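
The tag policy above resolves per table: a table matching a tagged pattern takes that tag's action, and anything unmatched falls back to the untagged action. A sketch using Go's path.Match globbing as a stand-in for pg-cdc's real matcher:

```go
package main

import (
	"fmt"
	"path"
)

// resolve applies tag patterns and a tag→action policy to a table name.
// Glob semantics here are path.Match's, used only as an approximation.
func resolve(table string, tags map[string][]string, policy map[string]string) string {
	for tag, patterns := range tags {
		for _, p := range patterns {
			if ok, _ := path.Match(p, table); ok {
				return policy[tag]
			}
		}
	}
	return policy["untagged"]
}

func main() {
	tags := map[string][]string{
		"pii":       {"public.tbl_cc", "billing.*"},
		"ephemeral": {"*.tbl_session*"},
	}
	policy := map[string]string{"pii": "exclude", "ephemeral": "exclude", "untagged": "include"}
	fmt.Println(resolve("billing.invoices", tags, policy)) // exclude
	fmt.Println(resolve("public.orders", tags, policy))    // include
}
```

Note that with overlapping patterns across tags, map iteration order would make the winner nondeterministic; a real matcher would need a defined precedence.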

Full reference: docs/02-configuration.md.

Open Core

The open-source edition covers the full CDC pipeline: logical replication, base/delta Parquet output, compaction, three sinks, Glue catalog, and SQLite-backed state. Production governance, compliance, and access-control features are commercial:

  • Layer-2 tag governance (policy-as-code)
  • DynamoDB-backed ACL registry with versioned audit trail
  • AWS Lake Formation reconciliation (acl diff, acl sync)
  • Emergency-override workflows with expiry
  • Terraform stack for IAM / OIDC / governance provisioning
  • Extended CLI: pg-cdc acl register|get|set|diff|sync|list
  • HIPAA-ready deployment topology

Governance in action

The commercial edition extends open-core CDC output with a tag-based governance layer. End to end:

1. pg-cdc writes Parquet + manifest to cloud storage — one prefix per table, plus a top-level manifest.json describing schema and epoch ordering:

S3 Parquet output — per-table prefixes and manifest.json

2. pg-cdc registers Parquet tables in Glue Data Catalog — the open-source catalog register command populates the catalog for downstream query engines:

Glue Data Catalog with CDC-registered Parquet tables

3. Operators apply ACL changes via GitHub Actions workflow dispatch — the commercial edition ships a click-ops UI that wraps pg-cdc acl set. Tag keys and values are constrained to the Layer-1 taxonomy; an audit reason (≥8 chars) is required before the workflow runs:

GitHub Actions workflow dispatch form for applying ACL tag changes

4. Governance intent is stored in a DynamoDB ACL registry — each workflow run writes a versioned record, tagged (e.g. sensitivity: internal) and stamped with actor, reason, and timestamp:

DynamoDB ACL registry item — versioned policy record with tags

5. Tags are reconciled as LF-Tags on Glue tables: pg-cdc acl sync drives Lake Formation's tag-based access control from the registry:

Lake Formation LF-Tags on a CDC table

Details: docs/commercial-edition.md.

Related repos

| Repo | Role |
| --- | --- |
| pg-cdc (this repo) | CDC server — WAL streaming, Parquet writing, compaction |
| burnside-go | Shared types — manifest spec, storage interface |
| pg-warehouse | Local-first analytical engine that can consume CDC output |

Tech stack

| Layer | Technology |
| --- | --- |
| Language | Go 1.25 (pure Go, no CGO) |
| CLI | Cobra |
| PostgreSQL | pgx/v5, pglogrepl |
| Parquet | parquet-go (pure Go) |
| State | SQLite (modernc.org/sqlite) |
| Storage | Filesystem, S3, GCS |
| Platforms | Linux amd64/arm64, macOS arm64, Windows amd64 |

Versioning

Release candidates auto-increment on every push to main: v0.1.0-rc1, v0.1.0-rc2, ...

Stable releases are promoted from RCs via the Release workflow dispatch.

Community

License

Apache License 2.0 — Copyright 2025-2026 Burnside Project
