
Add ClickHouse destination#658

Merged
jmqd merged 100 commits into main from jm/clickhouse
May 4, 2026

Conversation

@jmqd
Contributor

@jmqd jmqd commented Apr 7, 2026

Adds ClickHouse as a destination to etl.

Replicates Postgres tables into append-only MergeTree tables, with cdc_operation and cdc_lsn columns appended to every row. Both initial copy and streaming are supported, encoded directly as RowBinary (rationale in encoding.rs). Per-INSERT byte budgeting keeps peak memory bounded for large copies.
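The per-INSERT byte budgeting can be sketched roughly as below. This is an illustrative stand-in, not the PR's actual API: `InsertBatcher`, the budget constant, and the flush counter are assumed names, and the real destination would send the buffered RowBinary body to ClickHouse on each flush.

```rust
// Hedged sketch of per-INSERT byte budgeting during the initial copy:
// accumulate encoded rows and start a new INSERT whenever the buffer
// would exceed a byte budget, so peak memory stays bounded.

const INSERT_BYTE_BUDGET: usize = 8 * 1024 * 1024; // assumed 8 MiB budget

struct InsertBatcher {
    buf: Vec<u8>,
    flushed: usize, // number of INSERTs issued so far
}

impl InsertBatcher {
    fn new() -> Self {
        Self { buf: Vec::new(), flushed: 0 }
    }

    fn push_row(&mut self, encoded_row: &[u8]) {
        if !self.buf.is_empty() && self.buf.len() + encoded_row.len() > INSERT_BYTE_BUDGET {
            self.flush();
        }
        self.buf.extend_from_slice(encoded_row);
    }

    fn flush(&mut self) {
        if !self.buf.is_empty() {
            // The real destination would POST self.buf as one RowBinary
            // INSERT here; this sketch just counts flushes.
            self.flushed += 1;
            self.buf.clear();
        }
    }
}

fn main() {
    let mut b = InsertBatcher::new();
    let row = vec![0u8; 3 * 1024 * 1024]; // 3 MiB encoded row
    for _ in 0..4 {
        b.push_row(&row);
    }
    b.flush();
    // Two rows fit per 8 MiB INSERT, so four rows yield two flushes.
    assert_eq!(b.flushed, 2);
}
```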

Schema evolution / DDL is handled by diffing against the previous snapshot and replaying it through idempotent DDL statements, so an interrupted run is recoverable on restart. RowBinary nullable flags are derived from the actual ClickHouse schema rather than the source, since ClickHouse forces newly-added columns to Nullable(T) regardless of the upstream NOT NULL constraint. Identifiers are quoted at every SQL boundary, and a validate_connectivity probe lets the etl-api ping/check for health similarly to Iceberg.
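Quoting identifiers at every SQL boundary might look like the following conservative sketch. Rather than reproduce ClickHouse's exact escaping rules, this stand-in wraps names in backticks and rejects identifiers containing quoting metacharacters; the PR's real implementation may escape instead, and the function name is assumed.

```rust
// Hedged sketch: quote an identifier for interpolation into ClickHouse SQL.
// Conservative by design: names containing a backtick or backslash are
// rejected outright instead of escaped.

fn quote_identifier(name: &str) -> Result<String, String> {
    if name.contains('`') || name.contains('\\') {
        return Err(format!(
            "identifier {name:?} contains quoting metacharacters"
        ));
    }
    Ok(format!("`{name}`"))
}

fn main() {
    // Spaces and mixed case are safe once quoted.
    assert_eq!(quote_identifier("user events").unwrap(), "`user events`");
    // Suspicious names fail loudly rather than producing malformed SQL.
    assert!(quote_identifier("bad`name").is_err());
}
```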

Coverage: 20 unit tests on the encoder/schema/client helpers plus 15 integration tests covering initial copy of all supported types, insert/update/delete streaming (with key-only DELETE tombstone expansion), restart, intermediate flushing, multi-table concurrency, truncate, and schema-change scenarios.

@jmqd jmqd changed the title Jm/clickhouse ClickHouse destination Apr 7, 2026
bnjjj and others added 10 commits April 7, 2026 17:08
Signed-off-by: Benjamin <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin <5719034+bnjjj@users.noreply.github.com>
Signed-off-by: Benjamin <5719034+bnjjj@users.noreply.github.com>
jmqd added 3 commits April 7, 2026 17:52
Main introduced AsyncResult parameters on the Destination trait
(truncate_table, write_table_rows, write_events). Update the
ClickHouse implementation to conform, add missing K8sClient trait
stubs, and fix imports that drifted during rebase.
These were carried over during the ClickHouse rebase but the startup
path on main no longer calls them, triggering dead_code warnings.
@coveralls

coveralls commented Apr 7, 2026

Coverage Status

coverage: 76.444% (+0.3%) from 76.129% — jm/clickhouse into main

Silently skipping tests when env vars are missing gives false
confidence. The tests should fail explicitly so missing configuration
is always noticed.
jmqd added 4 commits April 9, 2026 15:49
Exercises encoding edge cases the all_types test does not cover:
- NULL scalars (nullable columns with SQL NULL)
- NULL elements inside arrays (ARRAY[1, NULL, 3])
- Empty strings distinct from NULLs
- Single-element arrays (varint length = 0x01)
- Multi-byte UTF-8 (emoji, CJK) in scalars and arrays
- Zero integer distinct from NULL
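The NULL-vs-empty-string and varint-length cases above follow from how RowBinary encodes values: strings are length-prefixed with an unsigned LEB128 varint, and Nullable(T) values carry a leading flag byte (0x01 for NULL, 0x00 for present). The sketch below is an illustrative stand-in, not the PR's encoding.rs.

```rust
// Hedged sketch of two RowBinary building blocks exercised by these tests:
// LEB128 varint length prefixes and the Nullable(T) flag byte.

fn write_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit: more bytes follow
    }
}

fn write_nullable_string(value: Option<&str>, out: &mut Vec<u8>) {
    match value {
        None => out.push(0x01), // NULL marker, no payload
        Some(s) => {
            out.push(0x00); // non-NULL marker
            write_varint(s.len() as u64, out);
            out.extend_from_slice(s.as_bytes());
        }
    }
}

fn main() {
    let mut buf = Vec::new();
    write_nullable_string(Some("hi"), &mut buf);
    assert_eq!(buf, [0x00, 0x02, b'h', b'i']); // single-byte varint length
    buf.clear();
    write_nullable_string(None, &mut buf);
    assert_eq!(buf, [0x01]); // NULL is one flag byte
    buf.clear();
    write_nullable_string(Some(""), &mut buf);
    assert_eq!(buf, [0x00, 0x00]); // empty string is distinct from NULL
}
```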
The docker-compose stack already starts a ClickHouse container but
the test env vars were not set, causing the integration tests to
panic. Wire up the same defaults used by docker-compose.
delete_dynamic_replicator_secrets deleted BigQuery, Iceberg, and
Ducklake secrets but not ClickHouse, leaving orphaned K8s secrets
when ClickHouse pipelines were deleted or changed destination type.
Previously this silently wrote a zero-length string, corrupting data
without any signal. Now returns a ConversionError so the problem is
surfaced to the operator.
The previous ErrorKind::Unknown discarded the JoinError, losing the
panic message. Now preserves the error as a detail string and uses a
more specific error kind matching the DuckLake convention.
@jmqd jmqd changed the title ClickHouse destination Add ClickHouse destination Apr 10, 2026
jmqd added 4 commits April 13, 2026 15:07
Uses REPLICA IDENTITY FULL so Postgres sends all column values in the
DELETE event. Inserts two rows, deletes one, and verifies the correct
row appears with cdc_operation = "DELETE", old data preserved, and a
positive LSN — while the other row remains untouched.
# Conflicts:
#	etl-api/Cargo.toml
#	etl-destinations/Cargo.toml
#	etl-examples/Cargo.toml
#	etl-examples/README.md
Contributor

@farazdagi farazdagi left a comment


Really great work!

Comment thread etl-api/src/configs/destination.rs Outdated
Comment thread etl-api/src/configs/destination.rs Outdated
Comment thread etl-api/src/k8s/core.rs
Comment thread etl-api/src/validation/validators.rs
Comment thread etl-examples/src/bin/clickhouse.rs Outdated
Comment thread etl-destinations/src/clickhouse/core.rs Outdated
Comment thread etl-destinations/src/clickhouse/core.rs
Comment thread etl-destinations/src/clickhouse/core.rs Outdated
Contributor

@iambriccardo iambriccardo left a comment


LGTM!

jmqd and others added 7 commits May 1, 2026 12:50
Co-authored-by: Victor Farazdagi <simple.square@gmail.com>
The merge from main brought `etl-replicator/src/init/destination.rs`,
whose `destination_name` match did not include the `ClickHouse` variant
that exists on this branch. Build broke with E0004 non-exhaustive
patterns. Add the missing arm.
@jmqd jmqd force-pushed the jm/clickhouse branch from 5e22ac0 to 81fbd10 Compare May 1, 2026 04:58
Contributor

@farazdagi farazdagi left a comment


Really good work!

Comment thread etl-api/src/k8s/http.rs Outdated
Comment thread etl-destinations/src/clickhouse/client.rs
Comment thread etl-destinations/src/clickhouse/client.rs Outdated
Comment thread etl-destinations/src/clickhouse/client.rs
Comment thread etl-destinations/src/clickhouse/core.rs Outdated
Comment thread etl-destinations/src/clickhouse/core.rs Outdated
Comment thread etl-destinations/src/clickhouse/core.rs
Comment thread etl-replicator/configuration/base.yaml Outdated
Comment thread etl-replicator/configuration/dev.yaml Outdated
Comment thread etl-replicator/configuration/prod.yaml Outdated
jmqd added 7 commits May 1, 2026 18:15
The `clickhouse_pipeline::*` integration tests call
`spawn_source_database()` and so share the source Postgres cluster, but
they were missing from the `shared-pg` filter in `.config/nextest.toml`
and `xtask/src/commands/nextest.rs`. Under `cargo xtask nextest run`
they were therefore landing in the non-pg lane and running in parallel
against whichever cluster `TESTS_DATABASE_PORT` happens to point at,
defeating the `max-threads = 1` serialization the test group is
supposed to provide and risking timing-sensitive contention under
load. Add `clickhouse_pipeline` to both filter copies.
ClickHouse `Date` is a UInt16 day offset from 1970-01-01 and only
covers 1970-01-01..=2149-06-06. Postgres `date` has a much wider range,
and the previous `date_to_days(...).clamp(...)` silently turned
pre-1970 dates into 1970-01-01 and far-future dates into the ClickHouse
maximum, corrupting historical and edge-case data without surfacing
anything to the operator.

Switch the schema mapping for `Type::DATE` and `Type::DATE_ARRAY` to
ClickHouse `Date32`, swap `ClickHouseValue::Date(u16)` for
`ClickHouseValue::Date32(i32)`, and convert dates through a fallible
`date_to_date32_days` that returns `ConversionError` when the value
falls outside `Date32`'s `1900-01-01..=2299-12-31` range. Failing the
batch is strictly better than silent corruption.

`cell_to_clickhouse_value` and `array_cell_to_clickhouse_values` now
return `EtlResult` so the conversion error can propagate; both
RowBinary call sites in `core.rs` thread the error up via `?`.
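The fallible conversion described above can be sketched as below. The day-offset constants follow from the Date32 range (1900-01-01 is 25,567 days before the Unix epoch; 2299-12-31 is 120,529 days after), while the error type and function signature are illustrative stand-ins for the crate's `ConversionError` and `date_to_date32_days`.

```rust
// Hedged sketch: validate a Postgres date (given as days from the Unix
// epoch) against ClickHouse Date32's supported range before encoding,
// returning an error instead of silently clamping out-of-range dates.

const DATE32_MIN_DAYS_FROM_UNIX_EPOCH: i32 = -25_567; // 1900-01-01
const DATE32_MAX_DAYS_FROM_UNIX_EPOCH: i32 = 120_529; // 2299-12-31

#[derive(Debug, PartialEq)]
struct ConversionError(String);

fn date_to_date32_days(days_from_unix_epoch: i64) -> Result<i32, ConversionError> {
    let range = i64::from(DATE32_MIN_DAYS_FROM_UNIX_EPOCH)
        ..=i64::from(DATE32_MAX_DAYS_FROM_UNIX_EPOCH);
    if range.contains(&days_from_unix_epoch) {
        Ok(days_from_unix_epoch as i32)
    } else {
        Err(ConversionError(format!(
            "date {days_from_unix_epoch} days from epoch is outside Date32 range"
        )))
    }
}

fn main() {
    assert_eq!(date_to_date32_days(-1), Ok(-1)); // 1969-12-31, now representable
    assert_eq!(date_to_date32_days(0), Ok(0)); // 1970-01-01
    assert!(date_to_date32_days(-30_000).is_err()); // before 1900-01-01: fail, don't clamp
}
```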
Add `pre_1970_and_far_future_dates_round_trip`, an integration test
that runs the pipeline against a Postgres `date not null` column
populated with the Date32 boundaries (1900-01-01, 1969-12-31,
1970-01-01, 2024-01-15, 2299-12-31) and asserts each value lands in
ClickHouse as the expected signed day offset. This pins the new Date32
behavior so the previous silent-clamp regression cannot return.

Update `AllTypesRow.date_col` from `u16` to `i32` to match the
`Date32` schema the destination now produces.
Address review on the date-boundaries test:

- Rename `DATE_1900_01_01_DAYS` -> `DATE32_MIN_DAYS_FROM_UNIX_EPOCH`
  and `DATE_2299_12_31_DAYS` -> `DATE32_MAX_DAYS_FROM_UNIX_EPOCH` so the
  constant names describe their semantic role (the Date32 boundaries)
  rather than the underlying date.
- Drop `DATE_1969_12_31_DAYS = -1`; the literal `-1` reads more clearly
  inline with a description in the assertion message.
@jmqd
Contributor Author

jmqd commented May 4, 2026

Great work, left a few comments.

  • The new clickhouse_pipeline tests call spawn_source_database() but are not included in the shared-pg nextest filter. Under cargo xtask nextest run, they run in the non-pg lane instead of the sharded/serialized Postgres lanes, contrary to the test-group comment and likely to make CI scheduling flaky under load.

Good catch, fixed in cd486b8

  • Could we avoid clamping Postgres date values when encoding to ClickHouse? ClickHouse Date is a UInt16 day offset from 1970-01-01 and only supports roughly 1970-01-01..=2149-06-06, while Postgres date has a much wider range. The current date_to_days(...).clamp(...) would silently turn pre-1970 dates into 1970-01-01 and far-future dates into ClickHouse’s max date. I think we should either map Postgres date to Date32 and still error outside the Date32 range, or keep Date but return a conversion error when the value is not representable. Silent corruption feels worse than failing the batch here.

Yeah, this also makes sense. Fixed in 6e54249 and 1fe9f2b

jmqd added 5 commits May 4, 2026 13:05
`etl-replicator/configuration/{base,dev,prod}.yaml` are not on main and
only ever held placeholder credentials (`password: password`) plus
Kubernetes service URLs. The deployment model is explicitly to provide
configuration at runtime via `APP_CONFIG_DIR` (see Dockerfile and
README), so committed dummy configs serve no purpose and risk
normalising checked-in placeholder secrets.
Missed in the Date -> Date32 migration. CI surfaced it as
`SchemaMismatch: attempting to (de)serialize ClickHouse type Date32 as
u16` when `delete_with_default_replica_identity` ran. Match the row
struct to the new schema.
@jmqd jmqd enabled auto-merge (squash) May 4, 2026 07:38
@jmqd jmqd merged commit 02a7072 into main May 4, 2026
14 checks passed
@jmqd jmqd deleted the jm/clickhouse branch May 4, 2026 07:43


5 participants