
pg-cdc

PostgreSQL change data capture → Parquet. Stream WAL changes into typed, compacted Parquet files on S3, GCS, or filesystem. Single Go binary. No CGO. No Kafka.


Why pg-cdc?

Change data capture from PostgreSQL is usually delivered by Debezium + Kafka + a warehouse loader — three moving parts with their own JVM, schema registry, and ops cost. pg-cdc replaces that stack with a single Go binary that reads WAL via native PostgreSQL logical replication, writes typed Parquet directly to S3/GCS/filesystem, and (optionally) registers tables in AWS Glue. No Kafka. No Java. No warehouse to pre-provision.

What does it solve?

A PostgreSQL CDC server that streams WAL changes into typed, compacted Parquet files in cloud storage. Follows the native PostgreSQL replica pattern: one publication, one replication slot, one streaming consumer. Suitable for teams who want CDC output as Parquet for downstream warehouses, data lakes, or analytical engines — without running Kafka, Debezium, or a warehouse loader.

Architecture

pg-cdc is the cloud-storage and data-governance layer. pg-warehouse is the developer-friendly local analytics platform that consumes its output.

pg-cdc produces typed, compacted Parquet files in cloud storage. pg-warehouse consumes them. The contract between them is defined in burnside-go — a shared manifest spec with base snapshots, delta epochs, and role-based access profiles derived from PostgreSQL ACLs.

Multiple developers pull from the same CDC stream independently. No shared DuckDB files, no locking, no contention.

Internally, pg-cdc uses hexagonal architecture with clean port/adapter separation. CLI commands (Cobra) call services that depend only on port interfaces. Adapters for PostgreSQL (source), Parquet writer, filesystem/S3/GCS sinks, SQLite state, and Glue catalog implement those interfaces. New sinks or catalog backends plug in without changing business logic.
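
A minimal sketch of that port/adapter split, with illustrative names (the actual pg-cdc interfaces are not shown here): the service depends only on the port, and any adapter satisfying it plugs in.

```go
package main

import "fmt"

// Hypothetical port: anything that can persist a batch of rows.
type Sink interface {
	Write(table string, rows int) error
}

// One adapter per backend; each satisfies the same port.
type FilesystemSink struct{ Root string }

func (s FilesystemSink) Write(table string, rows int) error {
	fmt.Printf("fs: wrote %d rows for %s under %s\n", rows, table, s.Root)
	return nil
}

// A service depends only on the port, never on a concrete adapter,
// so an S3 or GCS adapter drops in without touching this code.
type Pipeline struct{ Out Sink }

func (p Pipeline) Flush(table string, rows int) error {
	return p.Out.Write(table, rows)
}

func main() {
	p := Pipeline{Out: FilesystemSink{Root: "/var/lib/pg-cdc/output"}}
	_ = p.Flush("public.users", 3)
}
```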

Quick comparison

|  | pg-cdc | Debezium + Kafka | AWS DMS | Fivetran / Airbyte |
| --- | --- | --- | --- | --- |
| Output format | Parquet (typed, columnar) | Avro/JSON via Kafka | Multiple, via engine | Warehouse-native |
| Infrastructure | Single Go binary | Kafka + Connect + JVM | Managed AWS service | SaaS |
| Logical replication | Native pglogrepl | Debezium connector | Proprietary | Via connectors |
| Sinks | Filesystem, S3, GCS | Kafka topics | S3, Redshift, others | Warehouses |
| Catalog | AWS Glue (optional) | None | Glue | Varies |
| Cloud required | No | No (but heavy) | Yes | Yes |
| Language / runtime | Go, no CGO | Java (JVM) | Managed | Managed |

How does it work?

```
PostgreSQL ──WAL──→ pg-cdc ──Parquet──→ Cloud Storage
                      │                  (S3 / GCS / filesystem)
                      │
                      └──→ AWS Glue (optional: table catalog)
```

pg-cdc uses PostgreSQL's native logical replication:

| Stage | Output |
| --- | --- |
| init | Snapshot every table → base Parquet + manifest + (optional) Glue catalog entries |
| start | Stream WAL → append-only delta Parquet epochs per table |
| compact | Merge deltas → new base snapshot (applies inserts/updates/deletes; soft-deletes on 30d TTL) |

State survives restarts: LSN position, epoch watermarks, and table metadata live in a local SQLite state file.
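For illustration, the WAL position held in that state file is PostgreSQL's 64-bit LSN, usually shown as two hex halves separated by a slash. A sketch of the conversion used for resume-from-checkpoint arithmetic (pg-cdc itself relies on pglogrepl's types for this):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseLSN converts PostgreSQL's textual LSN form ("0/16B3748")
// into the 64-bit WAL position: high 32 bits / low 32 bits, both hex.
func ParseLSN(s string) (uint64, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("bad LSN %q", s)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return hi<<32 | lo, nil
}

func main() {
	pos, _ := ParseLSN("0/16B3748")
	fmt.Println(pos) // 23803720
}
```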

Documentation

| Doc | Description |
| --- | --- |
| Getting Started | Install, configure, run a first pipeline |
| Configuration | Full YAML reference |
| Init | Snapshot phase details |
| Streaming | WAL streaming mechanics |
| Compaction | Base + delta model, TTL semantics |
| Operations | Production run-book, health checks, troubleshooting |
| Commercial Edition | Closed-source governance / ACL extensions |

Install

Download binary — see Releases for Linux (amd64/arm64), macOS (arm64), and Windows (amd64):

```shell
# Linux (amd64)
curl -fsSL https://github.com/burnside-project/pg-cdc/releases/latest/download/pg-cdc_linux_amd64.tar.gz | tar xz
sudo install -m 0755 pg-cdc-linux-amd64 /usr/local/bin/pg-cdc

# macOS (Apple Silicon)
curl -fsSL https://github.com/burnside-project/pg-cdc/releases/latest/download/pg-cdc_darwin_arm64.tar.gz | tar xz
sudo install -m 0755 pg-cdc-darwin-arm64 /usr/local/bin/pg-cdc
```

Build from source:

```shell
git clone https://github.com/burnside-project/pg-cdc.git
cd pg-cdc
make build
```

Quickstart

1. Configure — create pg-cdc.yml:

```yaml
source:
  postgres:
    url: "postgresql://cdc_user@host:5432/db"
    schemas: ["public"]

storage:
  type: filesystem              # filesystem | s3 | gcs
  path: /var/lib/pg-cdc/output/

replication:
  publication: pg_cdc_pub
  slot: pg_cdc_slot

flush:
  interval_sec: 10
  max_rows: 1000
```

2. Discover tables — confirm pg-cdc can see what you expect:

$ pg-cdc discover

3. Initialize — snapshot base tables + create the replication slot:

$ pg-cdc init --config pg-cdc.yml

4. Start streaming — tail WAL into Parquet deltas:

$ pg-cdc start --config pg-cdc.yml

5. Check health — LSN lag, epochs, per-table status:

$ pg-cdc status

See the Operations guide for production deployment patterns.

Features

Source

  • PostgreSQL logical replication (pgx/v5, pglogrepl)
  • Per-schema discovery
  • Declarative table include/exclude rules
  • Tag-based table policy (e.g., pii, ephemeral tags → include/exclude)

Output

  • Typed Parquet (pure Go, no CGO)
  • Base snapshots + append-only delta epochs
  • Compaction into new base (applies I/U/D; soft-deletes on 30d TTL)
  • Manifest file per table (schema + epoch ordering)

Sinks

  • Filesystem
  • S3
  • GCS

Catalog

  • AWS Glue (optional; register manifest tables without re-snapshotting)

Operations

  • Single static binary, no CGO
  • Linux amd64/arm64, macOS arm64, Windows amd64
  • SQLite state tracking (LSN, epoch watermarks, table metadata)
  • Role → table → column ACL discovery from PostgreSQL GRANTs
  • Automated RC releases on every push to main; stable releases via workflow dispatch

Commands

| Command | What it does |
| --- | --- |
| init | Snapshot tables → base Parquet + manifest + (optional) Glue catalog |
| start | Stream WAL → delta Parquet epochs |
| compact | Merge deltas → new base snapshot (applies I/U/D; soft-deletes on TTL) |
| status | Health: lag, LSN, epochs, tables |
| discover | List tables from Postgres |
| discover --acl | Show role → table → column access map from PostgreSQL GRANTs |
| teardown | Drop publication + replication slot |
| catalog register | Register manifest tables in Glue without re-snapshotting |
| version | Print version |

Full reference in docs/08-operations.md.

Configuration

Production example with S3 + Glue and tag-based policy:

```yaml
source:
  postgres:
    url: "postgresql://cdc_user:${PGCDC_PASSWORD}@host:5432/db"
    schemas: ["public"]

storage:
  type: s3
  bucket: my-warehouse
  prefix: cdc/
  region: us-west-2

catalog:
  type: glue
  database: my_db
  region: us-west-2

tables:
  exclude: ["public.tbl_sessions"]
  tags:
    pii: ["public.tbl_cc", "billing.*"]
    ephemeral: ["*.tbl_session*"]
  policy:
    pii: exclude
    ephemeral: exclude
    untagged: include
```
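
The tag policy above resolves per table: a table matching a tagged pattern takes that tag's action, and anything unmatched falls back to the untagged action. A sketch using Go's path.Match globbing as a stand-in for pg-cdc's real matcher:

```go
package main

import (
	"fmt"
	"path"
)

// resolve applies tag patterns and a tag→action policy to a table name.
// Glob semantics here are path.Match's, used only as an approximation.
func resolve(table string, tags map[string][]string, policy map[string]string) string {
	for tag, patterns := range tags {
		for _, p := range patterns {
			if ok, _ := path.Match(p, table); ok {
				return policy[tag]
			}
		}
	}
	return policy["untagged"]
}

func main() {
	tags := map[string][]string{
		"pii":       {"public.tbl_cc", "billing.*"},
		"ephemeral": {"*.tbl_session*"},
	}
	policy := map[string]string{"pii": "exclude", "ephemeral": "exclude", "untagged": "include"}
	fmt.Println(resolve("billing.invoices", tags, policy)) // exclude
	fmt.Println(resolve("public.orders", tags, policy))    // include
}
```

Note that with overlapping patterns across tags, map iteration order would make the winner nondeterministic; a real matcher would need a defined precedence.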

Full reference: docs/02-configuration.md.

Open Core

The open-source edition covers the full CDC pipeline: logical replication, base/delta Parquet output, compaction, three sinks, Glue catalog, and SQLite-backed state. Production governance, compliance, and access-control features are commercial:

  • Layer-2 tag governance (policy-as-code)
  • DynamoDB-backed ACL registry with versioned audit trail
  • AWS Lake Formation reconciliation (acl diff, acl sync)
  • Emergency-override workflows with expiry
  • Terraform stack for IAM / OIDC / governance provisioning
  • Extended CLI: pg-cdc acl register|get|set|diff|sync|list
  • HIPAA-ready deployment topology

Governance in action

The commercial edition extends open-core CDC output with a tag-based governance layer. End to end:

1. pg-cdc writes Parquet + manifest to cloud storage — one prefix per table, plus a top-level manifest.json describing schema and epoch ordering:

S3 Parquet output — per-table prefixes and manifest.json

2. pg-cdc registers Parquet tables in Glue Data Catalog — the open-source catalog register command populates the catalog for downstream query engines:

Glue Data Catalog with CDC-registered Parquet tables

3. Operators apply ACL changes via GitHub Actions workflow dispatch — the commercial edition ships a click-ops UI that wraps pg-cdc acl set. Tag keys and values are constrained to the Layer-1 taxonomy; an audit reason (≥8 chars) is required before the workflow runs:

GitHub Actions workflow dispatch form for applying ACL tag changes

4. Governance intent is stored in a DynamoDB ACL registry — each workflow run writes a versioned record, tagged (e.g. sensitivity: internal) and stamped with actor, reason, and timestamp:

DynamoDB ACL registry item — versioned policy record with tags

5. Tags are reconciled as LF-Tags on Glue tables: pg-cdc acl sync drives Lake Formation's tag-based access control from the registry:

Lake Formation LF-Tags on a CDC table

Details: docs/commercial-edition.md.

Related repos

| Repo | Role |
| --- | --- |
| pg-cdc (this repo) | CDC server — WAL streaming, Parquet writing, compaction |
| burnside-go | Shared types — manifest spec, storage interface |
| pg-warehouse | Local-first analytical engine that can consume CDC output |

Tech stack

| Layer | Technology |
| --- | --- |
| Language | Go 1.25 (pure Go, no CGO) |
| CLI | Cobra |
| PostgreSQL | pgx/v5, pglogrepl |
| Parquet | parquet-go (pure Go) |
| State | SQLite (modernc.org/sqlite) |
| Storage | Filesystem, S3, GCS |
| Platforms | Linux amd64/arm64, macOS arm64, Windows amd64 |

Versioning

Release candidates auto-increment on every push to main: v0.1.0-rc1, v0.1.0-rc2, ...

Stable releases are promoted from RCs via the Release workflow dispatch.

Community

License

Apache License 2.0 — Copyright 2025-2026 Burnside Project
