Conversation
Signed-off-by: Benjamin <5719034+bnjjj@users.noreply.github.com>
Main introduced AsyncResult parameters on the Destination trait (truncate_table, write_table_rows, write_events). Update the ClickHouse implementation to conform, add missing K8sClient trait stubs, and fix imports that drifted during rebase.
These were carried over during the ClickHouse rebase but the startup path on main no longer calls them, triggering dead_code warnings.
Silently skipping tests when env vars are missing gives false confidence. The tests should fail explicitly so missing configuration is always noticed.
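A minimal sketch of the fail-loud pattern, with a hypothetical `TESTS_CLICKHOUSE_URL` variable standing in for whatever the test setup actually reads:

```rust
use std::env;

// Hypothetical helper: panic with an actionable message instead of
// silently skipping when the configuration is absent.
fn clickhouse_test_url() -> String {
    env::var("TESTS_CLICKHOUSE_URL").unwrap_or_else(|_| {
        panic!("TESTS_CLICKHOUSE_URL must be set to run the ClickHouse integration tests")
    })
}
```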
Exercises encoding edge cases the all_types test does not cover (a fixture sketch follows the list):
- NULL scalars (nullable columns with SQL NULL)
- NULL elements inside arrays (ARRAY[1, NULL, 3])
- Empty strings distinct from NULLs
- Single-element arrays (varint length = 0x01)
- Multi-byte UTF-8 (emoji, CJK) in scalars and arrays
- Zero integer distinct from NULL
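A sketch of the kind of fixture such a test might insert, using tokio-postgres; the table and column names are illustrative, not the test's actual schema:

```rust
// Hypothetical fixture exercising the edge cases listed above.
async fn insert_edge_case_rows(client: &tokio_postgres::Client) -> Result<(), tokio_postgres::Error> {
    client
        .batch_execute(
            "CREATE TABLE encoding_edge_cases (txt text, ints int[], tags text[], n int);
             INSERT INTO encoding_edge_cases VALUES
                 (NULL, ARRAY[1, NULL, 3], NULL, 0),           -- NULL scalar, NULL array element, zero vs NULL
                 ('', ARRAY[42], ARRAY['🐘'], NULL),           -- empty string vs NULL, single-element arrays
                 ('héllo 漢字', NULL, ARRAY['🚀', '中文'], 7); -- multi-byte UTF-8 in scalars and arrays",
        )
        .await
}
```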
The docker-compose stack already starts a ClickHouse container but the test env vars were not set, causing the integration tests to panic. Wire up the same defaults used by docker-compose.
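A sketch of what that wiring might look like, assuming hypothetical variable names and ClickHouse's stock HTTP port 8123 as the docker-compose default:

```rust
use std::env;

// Hypothetical helpers: fall back to the values the docker-compose stack
// exposes so a plain test run works against the local containers.
fn clickhouse_host() -> String {
    env::var("TESTS_CLICKHOUSE_HOST").unwrap_or_else(|_| "localhost".to_string())
}

fn clickhouse_port() -> u16 {
    env::var("TESTS_CLICKHOUSE_PORT")
        .map(|p| p.parse::<u16>().expect("TESTS_CLICKHOUSE_PORT must be a valid port"))
        .unwrap_or(8123) // ClickHouse's default HTTP port
}
```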
delete_dynamic_replicator_secrets deleted BigQuery, Iceberg, and Ducklake secrets but not ClickHouse, leaving orphaned K8s secrets when ClickHouse pipelines were deleted or changed destination type.
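A sketch of the shape of the fix; `delete_secret` and the secret-name helpers are illustrative, not the crate's actual `K8sClient` API:

```rust
// Hypothetical names throughout: the point is that ClickHouse joins the
// per-destination secrets cleaned up when a pipeline is deleted or its
// destination type changes.
for secret_name in [
    bigquery_secret_name(pipeline_id),
    iceberg_secret_name(pipeline_id),
    ducklake_secret_name(pipeline_id),
    clickhouse_secret_name(pipeline_id), // previously missing, leaving orphans
] {
    k8s_client.delete_secret(&secret_name).await?;
}
```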
Previously this silently wrote a zero-length string, corrupting data without any signal. Now returns a ConversionError so the problem is surfaced to the operator.
The previous ErrorKind::Unknown discarded the JoinError, losing the panic message. Now preserves the error as a detail string and uses a more specific error kind matching the DuckLake convention.
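A sketch of the pattern, with the crate's error type modeled by a local stand-in so the example is self-contained:

```rust
use tokio::task::JoinError;

// Stand-in for the crate's EtlError/ErrorKind; field names are illustrative.
#[derive(Debug)]
struct EtlError {
    kind: &'static str,
    description: &'static str,
    detail: String,
}

fn join_error_to_etl(join_err: JoinError) -> EtlError {
    EtlError {
        kind: "destination_error",                // more specific than the old catch-all Unknown
        description: "ClickHouse background task failed",
        detail: join_err.to_string(),             // preserves the panic message
    }
}
```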
Uses REPLICA IDENTITY FULL so Postgres sends all column values in the DELETE event. Inserts two rows, deletes one, and verifies the correct row appears with cdc_operation = "DELETE", old data preserved, and a positive LSN — while the other row remains untouched.
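A sketch of the setup, using tokio-postgres with illustrative table names:

```rust
// Hypothetical test seeding. With REPLICA IDENTITY FULL, Postgres includes
// every old column value in the DELETE event rather than only the key.
async fn seed_delete_scenario(source: &tokio_postgres::Client) -> Result<(), tokio_postgres::Error> {
    source
        .batch_execute(
            "CREATE TABLE users (id bigint PRIMARY KEY, name text NOT NULL);
             ALTER TABLE users REPLICA IDENTITY FULL;
             INSERT INTO users VALUES (1, 'keep-me'), (2, 'delete-me');
             DELETE FROM users WHERE id = 2;",
        )
        .await
}
```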
# Conflicts:
#	etl-api/Cargo.toml
#	etl-destinations/Cargo.toml
#	etl-examples/Cargo.toml
#	etl-examples/README.md
farazdagi reviewed Apr 30, 2026
Co-authored-by: Victor Farazdagi <simple.square@gmail.com>
The merge from main brought `etl-replicator/src/init/destination.rs`, whose `destination_name` match did not include the `ClickHouse` variant that exists on this branch. Build broke with E0004 non-exhaustive patterns. Add the missing arm.
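A sketch of the fix; only the ClickHouse arm is known from the commit, so the enum and the other variants are minimal stand-ins:

```rust
// Stand-in for the real config enum in etl-replicator/src/init/destination.rs.
enum DestinationConfig {
    Memory,
    BigQuery,
    Iceberg,
    ClickHouse,
}

fn destination_name(config: &DestinationConfig) -> &'static str {
    match config {
        DestinationConfig::Memory => "memory",
        DestinationConfig::BigQuery => "bigquery",
        DestinationConfig::Iceberg => "iceberg",
        DestinationConfig::ClickHouse => "clickhouse", // the arm E0004 flagged as missing
    }
}
```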
bnjjj reviewed May 1, 2026
The `clickhouse_pipeline::*` integration tests call `spawn_source_database()` and so share the source Postgres cluster, but they were missing from the `shared-pg` filter in `.config/nextest.toml` and `xtask/src/commands/nextest.rs`. Under `cargo xtask nextest run` they were therefore landing in the non-pg lane and running in parallel against whichever cluster `TESTS_DATABASE_PORT` happens to point at, defeating the `max-threads = 1` serialization the test group is supposed to provide and risking timing-sensitive contention under load. Add `clickhouse_pipeline` to both filter copies.
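A sketch of the `.config/nextest.toml` side; the `shared-pg` group name comes from the comment above, while the exact filter list is an approximation:

```toml
# Hypothetical shape: clickhouse_pipeline joins the serialized shared-pg lane.
[test-groups.shared-pg]
max-threads = 1

[[profile.default.overrides]]
filter = 'binary(bigquery_pipeline) | binary(iceberg_pipeline) | binary(clickhouse_pipeline)'
test-group = 'shared-pg'
```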
ClickHouse `Date` is a UInt16 day offset from 1970-01-01 and only covers 1970-01-01..=2149-06-06. Postgres `date` has a much wider range, and the previous `date_to_days(...).clamp(...)` silently turned pre-1970 dates into 1970-01-01 and far-future dates into the ClickHouse maximum, corrupting historical and edge-case data without surfacing anything to the operator.

Switch the schema mapping for `Type::DATE` and `Type::DATE_ARRAY` to ClickHouse `Date32`, swap `ClickHouseValue::Date(u16)` for `ClickHouseValue::Date32(i32)`, and convert dates through a fallible `date_to_date32_days` that returns `ConversionError` when the value falls outside `Date32`'s `1900-01-01..=2299-12-31` range. Failing the batch is strictly better than silent corruption.

`cell_to_clickhouse_value` and `array_cell_to_clickhouse_values` now return `EtlResult` so the conversion error can propagate; both RowBinary call sites in `core.rs` thread the error up via `?`.
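A sketch of the fallible conversion using chrono; the real helper returns the crate's `ConversionError`, modeled here as a `String` so the example stands alone:

```rust
use chrono::NaiveDate;

// Date32 day-offset bounds relative to 1970-01-01.
const DATE32_MIN_DAYS: i64 = -25_567; // 1900-01-01
const DATE32_MAX_DAYS: i64 = 120_529; // 2299-12-31

fn date_to_date32_days(date: NaiveDate) -> Result<i32, String> {
    let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).expect("valid epoch");
    let days = (date - epoch).num_days();
    if (DATE32_MIN_DAYS..=DATE32_MAX_DAYS).contains(&days) {
        Ok(days as i32)
    } else {
        // Previously this value was silently clamped; now the batch fails.
        Err(format!("date {date} falls outside ClickHouse Date32's range"))
    }
}
```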
Add `pre_1970_and_far_future_dates_round_trip`, an integration test that runs the pipeline against a Postgres `date not null` column populated with the Date32 boundaries (1900-01-01, 1969-12-31, 1970-01-01, 2024-01-15, 2299-12-31) and asserts each value lands in ClickHouse as the expected signed day offset. This pins the new Date32 behavior so the previous silent-clamp regression cannot return. Update `AllTypesRow.date_col` from `u16` to `i32` to match the `Date32` schema the destination now produces.
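For reference, those boundary dates as signed day offsets from 1970-01-01 (computed from the calendar, not copied from the test):

```rust
// Expected Date32 day offsets for the five fixture dates.
let expected: [i32; 5] = [
    -25_567, // 1900-01-01, Date32 minimum
    -1,      // 1969-12-31, the day before the epoch
    0,       // 1970-01-01
    19_737,  // 2024-01-15
    120_529, // 2299-12-31, Date32 maximum
];
```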
Address review on the date-boundaries test:
- Rename `DATE_1900_01_01_DAYS` -> `DATE32_MIN_DAYS_FROM_UNIX_EPOCH` and `DATE_2299_12_31_DAYS` -> `DATE32_MAX_DAYS_FROM_UNIX_EPOCH` so the constant names describe their semantic role (the Date32 boundaries) rather than the underlying date.
- Drop `DATE_1969_12_31_DAYS = -1`; the literal `-1` reads more clearly inline with a description in the assertion message.
Good catch, fixed in cd486b8
`etl-replicator/configuration/{base,dev,prod}.yaml` are not on main and
only ever held placeholder credentials (`password: password`) plus
Kubernetes service URLs. The deployment model is explicitly to provide
configuration at runtime via `APP_CONFIG_DIR` (see Dockerfile and
README), so committed dummy configs serve no purpose and risk
normalising checked-in placeholder secrets.
Missed in the Date -> Date32 migration. CI surfaced it as `SchemaMismatch: attempting to (de)serialize ClickHouse type Date32 as u16` when `delete_with_default_replica_identity` ran. Match the row struct to the new schema.
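A sketch of the change, assuming the test reads rows back with the `clickhouse` crate's derive; other columns elided:

```rust
use clickhouse::Row;
use serde::Deserialize;

#[derive(Row, Deserialize)]
struct AllTypesRow {
    // Date32 deserializes as a signed day offset from 1970-01-01, so the
    // u16 that matched the old Date column becomes an i32.
    date_col: i32,
    // ...remaining columns unchanged
}
```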
Adds ClickHouse as a destination to etl.
Replicates Postgres tables into append-only MergeTree tables, with
`cdc_operation` and `cdc_lsn` columns appended to every row. Both initial copy and streaming are supported, encoded directly as `RowBinary` (rationale in `encoding.rs`). Per-`INSERT` byte budgeting keeps peak memory bounded for large copies. Schema evolution / DDL is handled by diffing against the previous snapshot and replaying it through idempotent DDL statements, so an interrupted run is recoverable on restart.

`RowBinary` nullable flags are derived from the actual ClickHouse schema rather than the source, since ClickHouse forces newly-added columns to `Nullable(T)` regardless of the upstream NOT NULL constraint. Identifiers are quoted at every SQL boundary, and a `validate_connectivity` probe lets the `etl-api` ping/check for health similarly to Iceberg.

Coverage: 20 unit tests on the encoder/schema/client helpers plus 15 integration tests covering initial copy of all supported types, insert/update/delete streaming (with key-only DELETE tombstone expansion), restart, intermediate flushing, multi-table concurrency, truncate, and schema-change scenarios.
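For context, a minimal sketch of `RowBinary`'s wire rules for a `Nullable(String)` column; this illustrates the format itself, not the actual encoder in `encoding.rs`:

```rust
// Nullable(T) is a one-byte null marker followed by T's encoding when present;
// String is an unsigned LEB128 varint length prefix followed by raw UTF-8.
fn write_nullable_string(buf: &mut Vec<u8>, value: Option<&str>) {
    match value {
        None => buf.push(0x01), // 1 = NULL, nothing else follows
        Some(s) => {
            buf.push(0x00); // 0 = a value follows
            write_varuint(buf, s.len() as u64);
            buf.extend_from_slice(s.as_bytes());
        }
    }
}

fn write_varuint(buf: &mut Vec<u8>, mut v: u64) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            buf.push(byte);
            return;
        }
        buf.push(byte | 0x80);
    }
}
```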