pg-cdc
PostgreSQL change data capture → Parquet. Stream WAL changes into typed, compacted Parquet files on S3, GCS, or filesystem. Single Go binary. No CGO. No Kafka.
Change data capture from PostgreSQL is usually delivered by Debezium + Kafka + a warehouse loader — three moving parts with their own JVM, schema registry, and ops cost. pg-cdc replaces that stack with a single Go binary that reads WAL via native PostgreSQL logical replication, writes typed Parquet directly to S3/GCS/filesystem, and (optionally) registers tables in AWS Glue. No Kafka. No Java. No warehouse to pre-provision.
A PostgreSQL CDC server that streams WAL changes into typed, compacted Parquet files in cloud storage. Follows the native PostgreSQL replica pattern: one publication, one replication slot, one streaming consumer. Suitable for teams who want CDC output as Parquet for downstream warehouses, data lakes, or analytical engines — without running Kafka, Debezium, or a warehouse loader.
pg-cdc is the cloud-storage + data-governance layer. pg-warehouse is the developer-friendly local feature and analytics platform that consumes its output.
pg-cdc produces typed, compacted Parquet files in cloud storage. pg-warehouse consumes them. The contract between them is defined in burnside-go — a shared manifest spec with base snapshots, delta epochs, and role-based access profiles derived from PostgreSQL ACLs.
Multiple developers pull from the same CDC stream independently. No shared DuckDB files, no locking, no contention.
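The manifest contract described above can be pictured with a small sketch. The type names, JSON keys, and object paths below are hypothetical and chosen only for illustration; the authoritative definitions live in burnside-go. The sketch shows a consumer deriving its read order: the base snapshot first, then delta epochs in ascending order.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Column, Epoch, and Manifest are hypothetical shapes, not the burnside-go types.
type Column struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

type Epoch struct {
	Seq  int    `json:"seq"`  // assumed: monotonically increasing epoch number
	Path string `json:"path"` // assumed: object key of the delta Parquet file
}

type Manifest struct {
	Table  string   `json:"table"`
	Schema []Column `json:"schema"`
	Base   string   `json:"base"`
	Deltas []Epoch  `json:"deltas"`
}

// ReadOrder returns the files a consumer would load: the base snapshot,
// followed by every delta sorted by epoch number.
func (m Manifest) ReadOrder() []string {
	deltas := append([]Epoch(nil), m.Deltas...) // copy so the manifest is not mutated
	sort.Slice(deltas, func(i, j int) bool { return deltas[i].Seq < deltas[j].Seq })
	out := []string{m.Base}
	for _, d := range deltas {
		out = append(out, d.Path)
	}
	return out
}

func main() {
	m := Manifest{
		Table:  "public.orders",
		Schema: []Column{{Name: "id", Type: "int64"}, {Name: "total", Type: "double"}},
		Base:   "public.orders/base/000000.parquet",
		Deltas: []Epoch{
			{Seq: 2, Path: "public.orders/delta/000002.parquet"},
			{Seq: 1, Path: "public.orders/delta/000001.parquet"},
		},
	}
	b, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(b))
	fmt.Println(m.ReadOrder())
}
```

Because each consumer only reads immutable files in a defined order, several developers can work from the same output without coordinating with each other.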
Internally, pg-cdc uses hexagonal architecture with clean port/adapter separation. CLI commands (Cobra) call services that depend only on port interfaces. Adapters for PostgreSQL (source), Parquet writer, filesystem/S3/GCS sinks, SQLite state, and Glue catalog implement those interfaces. New sinks or catalog backends plug in without changing business logic.
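As a rough illustration of the port/adapter split, the sketch below defines a storage port and a service that depends only on it. The names `Sink`, `MemorySink`, `FlushService`, and the key layout are assumptions made for this example and are not taken from the pg-cdc codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// Sink is a hypothetical port: the only storage contract the service layer sees.
type Sink interface {
	Put(key string, data []byte) error
}

// MemorySink is a stand-in adapter. A filesystem, S3, or GCS adapter would
// implement the same interface.
type MemorySink struct {
	objects map[string][]byte
}

func NewMemorySink() *MemorySink {
	return &MemorySink{objects: map[string][]byte{}}
}

func (s *MemorySink) Put(key string, data []byte) error {
	if key == "" {
		return errors.New("empty object key")
	}
	s.objects[key] = append([]byte(nil), data...) // store a private copy
	return nil
}

// FlushService holds business logic and knows nothing about concrete sinks.
type FlushService struct {
	sink Sink
}

// WriteEpoch stores one delta file under an assumed <table>/delta/<epoch> key.
func (f FlushService) WriteEpoch(table string, epoch int, payload []byte) error {
	key := fmt.Sprintf("%s/delta/%06d.parquet", table, epoch)
	return f.sink.Put(key, payload)
}

func main() {
	sink := NewMemorySink()
	svc := FlushService{sink: sink}
	if err := svc.WriteEpoch("public.orders", 1, []byte("parquet bytes")); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println(len(sink.objects), "object(s) written")
}
```

Swapping `MemorySink` for another adapter changes only the wiring in `main`, which is the property the hexagonal layout is meant to provide.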
| | pg-cdc | Debezium + Kafka | AWS DMS | Fivetran / Airbyte |
|---|---|---|---|---|
| Output format | Parquet (typed, columnar) | Avro/JSON via Kafka | Multiple, via engine | Warehouse-native |
| Infrastructure | Single Go binary | Kafka + Connect + JVM | Managed AWS service | SaaS |
| Logical replication | Native pglogrepl | Debezium connector | Proprietary | Via connectors |
| Sinks | Filesystem, S3, GCS | Kafka topics | S3, Redshift, others | Warehouses |
| Catalog | AWS Glue (optional) | None | Glue | Varies |
| Cloud required | No | No (but heavy) | Yes | Yes |
| Language / runtime | Go, no CGO | Java (JVM) | Managed | Managed |
PostgreSQL ──WAL──→ pg-cdc ──Parquet──→ Cloud Storage
                      │                 (S3 / GCS / filesystem)
                      │
                      └──→ AWS Glue (optional: table catalog)
pg-cdc uses PostgreSQL's native logical replication:
| Stage | Output |
|---|---|
| `init` | Snapshot every table → base Parquet + manifest + (optional) Glue catalog entries |
| `start` | Stream WAL → append-only delta Parquet epochs per table |
| `compact` | Merge deltas → new base snapshot (applies inserts/updates/deletes; soft-deletes on 30d TTL) |
State survives restarts: LSN position, epoch watermarks, and table metadata live in a local SQLite state file.
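The compaction stage in the table above can be sketched as a fold of delta changes over the base snapshot. This is a simplified model under stated assumptions, not pg-cdc's implementation: rows are keyed by a single integer primary key, a delete leaves a soft-delete marker, and markers older than 30 days are purged on the next compaction. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

type Op int

const (
	Insert Op = iota
	Update
	Delete
)

// Change is one row-level event from a delta epoch (hypothetical shape).
type Change struct {
	Op    Op
	Key   int64
	Value string
	At    time.Time
}

// Row is one row of the base snapshot. DeletedAt is nil for live rows.
type Row struct {
	Value     string
	DeletedAt *time.Time
}

// softDeleteTTL mirrors the 30d TTL mentioned above; the purge rule is an assumption.
const softDeleteTTL = 30 * 24 * time.Hour

// Compact applies changes in order to a copy of base, then drops rows whose
// soft-delete marker is older than softDeleteTTL relative to now.
func Compact(base map[int64]Row, changes []Change, now time.Time) map[int64]Row {
	out := make(map[int64]Row, len(base))
	for k, v := range base {
		out[k] = v
	}
	for _, c := range changes {
		switch c.Op {
		case Insert, Update:
			out[c.Key] = Row{Value: c.Value} // also clears any earlier delete marker
		case Delete:
			if r, ok := out[c.Key]; ok {
				at := c.At
				r.DeletedAt = &at
				out[c.Key] = r
			}
		}
	}
	for k, r := range out {
		if r.DeletedAt != nil && now.Sub(*r.DeletedAt) > softDeleteTTL {
			delete(out, k)
		}
	}
	return out
}

// day returns a fixed reference date shifted by n days, to keep the example deterministic.
func day(n int) time.Time {
	return time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC).AddDate(0, 0, n)
}

func main() {
	base := map[int64]Row{1: {Value: "a"}, 2: {Value: "b"}}
	changes := []Change{
		{Op: Update, Key: 1, Value: "a2", At: day(1)},
		{Op: Delete, Key: 2, At: day(0)},
		{Op: Insert, Key: 3, Value: "c", At: day(2)},
	}
	for k, r := range Compact(base, changes, day(45)) {
		fmt.Println(k, r.Value, r.DeletedAt != nil)
	}
}
```

In this model the new base is built as a fresh map, which matches the idea of writing a new snapshot rather than mutating the old one in place.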
| Doc | Description |
|---|---|
| Getting Started | Install, configure, run a first pipeline |
| Configuration | Full YAML reference |
| Init | Snapshot phase details |
| Streaming | WAL streaming mechanics |
| Compaction | Base + delta model, TTL semantics |
| Operations | Production run-book, health checks, troubleshooting |
| Commercial Edition | Closed-source governance / ACL extensions |
Download binary — see Releases for Linux (amd64/arm64), macOS (arm64), and Windows (amd64):
# Linux (amd64)
curl -fsSL https://github.com/burnside-project/pg-cdc/releases/latest/download/pg-cdc_linux_amd64.tar.gz | tar xz
sudo install -m 0755 pg-cdc-linux-amd64 /usr/local/bin/pg-cdc
# macOS (Apple Silicon)
curl -fsSL https://github.com/burnside-project/pg-cdc/releases/latest/download/pg-cdc_darwin_arm64.tar.gz | tar xz
sudo install -m 0755 pg-cdc-darwin-arm64 /usr/local/bin/pg-cdc
Build from source:
git clone https://github.com/burnside-project/pg-cdc.git
cd pg-cdc
make build
1. Configure. Create pg-cdc.yml:
source:
postgres:
url: "postgresql://cdc_user@host:5432/db"
schemas: ["public"]
storage:
type: filesystem # filesystem | s3 | gcs
path: /var/lib/pg-cdc/output/
replication:
publication: pg_cdc_pub
slot: pg_cdc_slot
flush:
interval_sec: 10
max_rows: 1000
2. Discover tables. Confirm pg-cdc can see what you expect:
$ pg-cdc discover
3. Initialize. Snapshot base tables and create the replication slot:
$ pg-cdc init --config pg-cdc.yml
4. Start streaming. Tail WAL into Parquet deltas:
$ pg-cdc start --config pg-cdc.yml
5. Check health. LSN lag, epochs, per-table status:
$ pg-cdc status
See the Operations guide for production deployment patterns.
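The LSN lag reported by a health check can be understood as a byte distance between two WAL positions. The sketch below parses PostgreSQL's textual LSN format (two hexadecimal numbers separated by a slash, the high and low 32 bits) and subtracts. It uses only the standard library and is an illustration of the arithmetic; how `pg-cdc status` computes lag is not specified here, and the function names are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseLSN converts a textual LSN such as "16/B374D848" into a 64-bit WAL
// position: the part before the slash is the high 32 bits, the part after
// is the low 32 bits.
func ParseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("invalid LSN %q: missing slash", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return h<<32 | l, nil
}

// LagBytes reports how many bytes of WAL lie between the consumer's
// confirmed position and the server's current position.
func LagBytes(current, confirmed string) (uint64, error) {
	c, err := ParseLSN(current)
	if err != nil {
		return 0, err
	}
	f, err := ParseLSN(confirmed)
	if err != nil {
		return 0, err
	}
	if f > c {
		return 0, nil // consumer is not behind
	}
	return c - f, nil
}

func main() {
	lag, err := LagBytes("16/B374D848", "16/B374D000")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("lag in bytes:", lag)
}
```

A steadily growing value suggests the consumer is falling behind and the replication slot is retaining WAL on the source server.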
Source
- PostgreSQL logical replication (pgx/v5, pglogrepl)
- Per-schema discovery
- Declarative table include/exclude rules
- Tag-based table policy (e.g., `pii`, `ephemeral` tags → include/exclude)
Output
- Typed Parquet (pure Go, no CGO)
- Base snapshots + append-only delta epochs
- Compaction into new base (applies I/U/D; soft-deletes on 30d TTL)
- Manifest file per table (schema + epoch ordering)
Sinks
- Filesystem
- S3
- GCS
Catalog
- AWS Glue (optional; register manifest tables without re-snapshotting)
Operations
- Single static binary, no CGO
- Linux amd64/arm64, macOS arm64, Windows amd64
- SQLite state tracking (LSN, epoch watermarks, table metadata)
- Role → table → column ACL discovery from PostgreSQL GRANTs
- Automated RC releases on every push to main; stable releases via workflow dispatch
| Command | What it does |
|---|---|
| `init` | Snapshot tables → base Parquet + manifest + (optional) Glue catalog |
| `start` | Stream WAL → delta Parquet epochs |
| `compact` | Merge deltas → new base snapshot (applies I/U/D; soft-deletes on TTL) |
| `status` | Health: lag, LSN, epochs, tables |
| `discover` | List tables from Postgres |
| `discover --acl` | Show role → table → column access map from PostgreSQL GRANTs |
| `teardown` | Drop publication + replication slot |
| `catalog register` | Register manifest tables in Glue without re-snapshotting |
| `version` | Print version |
Full reference in docs/08-operations.md.
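To give a feel for the role → table → column map that `discover --acl` reports, here is a minimal sketch that groups flattened grant rows into a nested map. The `Grant` struct is a hypothetical input shape (PostgreSQL exposes comparable rows through `information_schema.column_privileges`); how pg-cdc actually queries and models GRANTs is not shown in this README.

```go
package main

import (
	"fmt"
	"sort"
)

// Grant is a hypothetical flattened privilege row: one role may read one
// column of one table.
type Grant struct {
	Role   string
	Table  string
	Column string
}

// ACLMap is role → table → sorted list of readable columns.
type ACLMap map[string]map[string][]string

// BuildACL groups grants by role and table, dropping duplicate rows and
// sorting column names so the output is stable.
func BuildACL(grants []Grant) ACLMap {
	acl := ACLMap{}
	seen := map[Grant]bool{}
	for _, g := range grants {
		if seen[g] {
			continue
		}
		seen[g] = true
		if acl[g.Role] == nil {
			acl[g.Role] = map[string][]string{}
		}
		acl[g.Role][g.Table] = append(acl[g.Role][g.Table], g.Column)
	}
	for _, tables := range acl {
		for _, cols := range tables {
			sort.Strings(cols)
		}
	}
	return acl
}

func main() {
	acl := BuildACL([]Grant{
		{Role: "analyst", Table: "public.orders", Column: "total"},
		{Role: "analyst", Table: "public.orders", Column: "id"},
		{Role: "support", Table: "public.users", Column: "email"},
	})
	for role, tables := range acl {
		for table, cols := range tables {
			fmt.Println(role, table, cols)
		}
	}
}
```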
Production example with S3 + Glue and tag-based policy:
source:
postgres:
url: "postgresql://cdc_user:${PGCDC_PASSWORD}@host:5432/db"
schemas: ["public"]
storage:
type: s3
bucket: my-warehouse
prefix: cdc/
region: us-west-2
catalog:
type: glue
database: my_db
region: us-west-2
tables:
exclude: ["public.tbl_sessions"]
tags:
pii: ["public.tbl_cc", "billing.*"]
ephemeral: ["*.tbl_session*"]
policy:
pii: exclude
ephemeral: exclude
untagged: include
Full reference: docs/02-configuration.md.
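One way to read the tag-based policy in the example above is as a small decision function per table. The sketch below is an interpretation, not the documented behavior: it assumes shell-style glob matching on `schema.table` names (via Go's `path.Match`), that an explicit exclude wins over everything, and that any matching tag with an `exclude` policy removes the table. The `Policy` type and its fields are hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// Policy is a hypothetical in-memory form of the tables/tags/policy config.
type Policy struct {
	Exclude  []string            // explicit table globs to skip
	Tags     map[string][]string // tag → table globs
	Actions  map[string]string   // tag → "include" or "exclude"
	Untagged string              // action for tables that match no tag
}

// matchAny reports whether table matches at least one glob pattern.
func matchAny(patterns []string, table string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, table); err == nil && ok {
			return true
		}
	}
	return false
}

// Included applies an assumed precedence: explicit exclude, then any tag
// whose action is "exclude", then tagged-and-allowed, then the untagged default.
func (p Policy) Included(table string) bool {
	if matchAny(p.Exclude, table) {
		return false
	}
	tagged := false
	for tag, patterns := range p.Tags {
		if !matchAny(patterns, table) {
			continue
		}
		tagged = true
		if p.Actions[tag] == "exclude" {
			return false
		}
	}
	if tagged {
		return true
	}
	return p.Untagged == "include"
}

func main() {
	p := Policy{
		Exclude: []string{"public.tbl_sessions"},
		Tags: map[string][]string{
			"pii":       {"public.tbl_cc", "billing.*"},
			"ephemeral": {"*.tbl_session*"},
		},
		Actions:  map[string]string{"pii": "exclude", "ephemeral": "exclude"},
		Untagged: "include",
	}
	for _, t := range []string{"public.orders", "billing.invoices", "audit.tbl_session_log"} {
		fmt.Println(t, p.Included(t))
	}
}
```

With the configuration shown, only tables that match neither the exclude list nor an excluded tag are streamed.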
The open-source edition covers the full CDC pipeline: logical replication, base/delta Parquet output, compaction, three sinks, Glue catalog, and SQLite-backed state. Production governance, compliance, and access-control features are commercial:
- Layer-2 tag governance (policy-as-code)
- DynamoDB-backed ACL registry with versioned audit trail
- AWS Lake Formation reconciliation (`acl diff`, `acl sync`)
- Emergency-override workflows with expiry
- Terraform stack for IAM / OIDC / governance provisioning
- Extended CLI: `pg-cdc acl register|get|set|diff|sync|list`
- HIPAA-ready deployment topology
The commercial edition extends open-core CDC output with a tag-based governance layer. End to end:
1. pg-cdc writes Parquet + manifest to cloud storage: one prefix per table, plus a top-level manifest.json describing schema and epoch ordering.
2. pg-cdc registers Parquet tables in the Glue Data Catalog: the open-source catalog register command populates the catalog for downstream query engines.
3. Operators apply ACL changes via GitHub Actions workflow dispatch: the commercial edition ships a click-ops UI that wraps pg-cdc acl set. Tag keys and values are constrained to the Layer-1 taxonomy, and an audit reason (≥8 chars) is required before the workflow runs.
4. Governance intent is stored in a DynamoDB ACL registry: each workflow run writes a versioned record, tagged (e.g. sensitivity: internal) and stamped with actor, reason, and timestamp.
5. Tags are reconciled as LF-Tags on Glue tables: pg-cdc acl sync drives Lake Formation's tag-based access control from the registry.
Details: docs/commercial-edition.md.
| Repo | Role |
|---|---|
| pg-cdc (this repo) | CDC server — WAL streaming, Parquet writing, compaction |
| burnside-go | Shared types — manifest spec, storage interface |
| pg-warehouse | Local-first analytical engine that can consume CDC output |
| Layer | Technology |
|---|---|
| Language | Go 1.25 (pure Go, no CGO) |
| CLI | Cobra |
| PostgreSQL | pgx/v5, pglogrepl |
| Parquet | parquet-go (pure Go) |
| State | SQLite (modernc.org/sqlite) |
| Storage | Filesystem, S3, GCS |
| Platforms | Linux amd64/arm64, macOS arm64, Windows amd64 |
Release candidates auto-increment on every push to main: v0.1.0-rc1, v0.1.0-rc2, ...
Stable releases are promoted from RCs via the Release workflow dispatch.
- GitHub Issues — bugs and feature requests
- GitHub Discussions — questions and ideas
- Contributing — development setup and guidelines
- Code of Conduct
- Security Policy
Apache License 2.0 — Copyright 2025-2026 Burnside Project




