- **CheckpointPersistence**: optional; defaults to `true`. When enabled, polling sources (PostgreSQL, ClickHouse, Trino) persist read position to a ConfigMap, reducing duplicates on restart. Set to `false` to disable.
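For example, a DataFlow that opts out of checkpoint persistence might look like the following sketch (the API group/version and the name are illustrative assumptions; only `checkpointPersistence` is taken from this doc):

```yaml
apiVersion: dataflow.example.com/v1   # assumed group/version
kind: DataFlow
metadata:
  name: orders                        # hypothetical name
spec:
  checkpointPersistence: false        # default is true; omit the field to keep persistence enabled
  # source/sink configuration omitted
```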
Secrets can be referenced via `SecretRef` in the spec; the operator resolves them before writing the spec into the ConfigMap.
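A hypothetical spec fragment using a secret reference (the exact shape of `SecretRef` is an assumption; the operator substitutes the value before writing the resolved spec):

```yaml
source:
  type: postgresql
  password:
    secretRef:          # assumed field layout
      name: pg-credentials
      key: password
```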
For each DataFlow `<name>` in a namespace:
| ConfigMap | `dataflow-<name>-checkpoint` | Stores read position for polling sources (default). Omitted when `checkpointPersistence: false`. |
| Deployment | `dataflow-<name>` | One replica; pod runs the **processor** container. |
| ServiceAccount, Role, RoleBinding | `dataflow-<name>-processor` | RBAC for processor to read/write checkpoint ConfigMap (default). Omitted when `checkpointPersistence: false`. |
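The per-DataFlow Role might grant roughly the following (a sketch for a DataFlow named `orders`, not the operator's exact generated manifest):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dataflow-orders-processor
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["dataflow-orders-checkpoint"]  # scoped to the checkpoint ConfigMap
    verbs: ["get", "update", "patch"]              # assumed verb set
```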
The processor container:
The operator uses a **ClusterRole** (and **ClusterRoleBinding** to its ServiceAccount) that allows it to:
- Create/patch **events**.
- Read **secrets** (for resolution).
- Create/update/delete **ConfigMaps** and **Deployments** in the same namespaces as DataFlow resources.
- When checkpoint persistence is enabled: create **ServiceAccounts**, **Roles**, and **RoleBindings** for processor pods to access the checkpoint ConfigMap.
See the Helm templates (e.g. `clusterrole.yaml`, `clusterrolebinding.yaml`) for the exact rules.
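The rules above correspond roughly to a ClusterRole like this sketch (verbs and API groups are a plausible reading of the list above, not the exact Helm output):

```yaml
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]   # read-only, for secret resolution
  - apiGroups: [""]
    resources: ["configmaps", "serviceaccounts"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["roles", "rolebindings"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
```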
### Optional: GUI
flowchart LR
API["API Server"]
CRD["DataFlow CRD"]
Operator["Operator Pod"]
CMSpec["ConfigMap spec"]
CMCheckpoint["ConfigMap checkpoint"]
Dep["Deployment"]
Proc["Processor Pod"]
Ext["Kafka / PostgreSQL / Trino / Nessie"]
User -->|"apply DataFlow"| API
API --> CRD
Operator -->|watch| CRD
Operator -->|create/update| CMSpec
Operator -->|create/update| CMCheckpoint
Operator -->|create/update| Dep
Dep --> Proc
Proc -->|mount spec| CMSpec
Proc -->|read/write checkpoint| CMCheckpoint
Proc -->|connect| Ext
```
For each DataFlow, the controller runs the following steps (on create, update, or when owned resources change):
1. **Get DataFlow**
If not found, return. If **DeletionTimestamp** is set: delete the Deployment, ConfigMaps (spec and checkpoint), and processor RBAC (cleanup), update status to `Stopped`, then return.
2. **Resolve secrets**
Use **SecretResolver** to substitute all `SecretRef` fields in the spec with values from Kubernetes Secrets. Result: **resolved spec**.
3. **ConfigMap**
Create or update the ConfigMap `dataflow-<name>-spec` with key `spec.json` = JSON of the resolved spec. Set controller reference to the DataFlow.
4. **Checkpoint ConfigMap and RBAC** (when `checkpointPersistence` is not `false`; default: enabled)
Create ConfigMap `dataflow-<name>-checkpoint` and RBAC (ServiceAccount, Role, RoleBinding) so the processor pod can read/write the checkpoint. The processor persists source read position (lastReadID, lastReadChangeTime) there, reducing duplicates on restart.
5. **Deployment**
Create or update the Deployment `dataflow-<name>`: processor image, volume from the spec ConfigMap, args and env as above. When checkpoint persistence is enabled, set `serviceAccountName` so the pod uses the dedicated ServiceAccount. Use resources/affinity from DataFlow spec if set. Set controller reference to the DataFlow.
6. **Deployment status**
Read the Deployment; set DataFlow status **Phase** and **Message** from it (e.g. `Running` when `ReadyReplicas > 0`, `Pending` when replicas are starting, `Error` when no replicas).
7. **Update DataFlow status**
Write Phase, Message, and other status fields back to the DataFlow resource (with retry on conflict).
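The Deployment produced in step 5 might contain a pod spec along these lines (a sketch for a DataFlow named `orders`; the container name and mount path are assumptions):

```yaml
spec:
  replicas: 1
  template:
    spec:
      serviceAccountName: dataflow-orders-processor  # set only when checkpoint persistence is enabled
      containers:
        - name: processor
          volumeMounts:
            - name: spec
              mountPath: /etc/dataflow               # assumed mount path
      volumes:
        - name: spec
          configMap:
            name: dataflow-orders-spec               # the resolved-spec ConfigMap
```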
### Reconcile Loop Diagram
```mermaid
flowchart TD
A[Get DataFlow] --> B{Deleted?}
B -->|Yes| C[Cleanup Deployment, ConfigMaps, RBAC]
C --> D[Update Status Stopped]
B -->|No| E[Resolve Secrets]
E --> F[Create or Update ConfigMap]
F --> F2{CheckpointPersistence?}
F2 -->|Yes| F3[Create Checkpoint ConfigMap and RBAC]
F2 -->|No| G
F3 --> G[Create or Update Deployment]
G --> H[Read Deployment Status]
H --> I[Update DataFlow Status]
```
It reads the spec from the file, builds a **Processor** from it, and runs it.
The **Processor** (in `internal/processor/processor.go`) is built from the spec and contains:
- **Source**: a **SourceConnector** (Kafka, PostgreSQL, Trino, or Nessie) — `Connect`, `Read`, `Close`. By default, polling sources load the initial checkpoint from the ConfigMap and save it after each successful sink write (debounced). Disable with `checkpointPersistence: false`.
- **Sink**: a **SinkConnector** for the main destination — `Connect`, `Write`, `Close`.
- **Error sink** (optional): another SinkConnector for failed writes.
- **Transformations**: an ordered list of **Transformer** implementations (timestamp, flatten, filter, mask, router, select, remove, snakeCase, camelCase).
## Summary
- **Kubernetes**: You declare a **DataFlow** CR; the **operator** reconciles it into a **ConfigMap** (spec) and a **Deployment** (processor pod). By default, a second ConfigMap and RBAC are created for checkpoint storage (set `checkpointPersistence: false` to disable). RBAC and optional GUI complete the picture.
- **Reconciliation**: Get DataFlow → resolve secrets → update ConfigMap → update Deployment → reflect Deployment status in DataFlow status.
- **Runtime**: Each **processor** pod runs a single pipeline: source → read channel → transformations → write to main (and optionally error and router) sinks, using pluggable connectors and a fixed set of transformations.
---

`docs/en/connectors.md`:
- **Change Tracking**: By default tracks changes via `updated_at` column (or `changeTrackingColumn`), captures both INSERTs and UPDATEs
- **Auto-create Table**: When `autoCreateTable: true`, creates the table with CDC-friendly schema (`id SERIAL PRIMARY KEY`, `created_at`, `updated_at`) if it doesn't exist. Creation happens at Connect time.
- **Schema notation**: Table name supports `schema.table` format (e.g. `public.products`)
- **Checkpoint persistence**: By default, read position (lastReadChangeTime) is persisted to ConfigMap; on restart, reading resumes from the last position. Set `checkpointPersistence: false` in spec to store only in memory. For pg→pg flows, enable `upsertMode: true` in sink to update duplicates instead of inserting them again.
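A minimal PostgreSQL source with persistence disabled might look like this sketch (`checkpointPersistence`, `changeTrackingColumn`, and the `schema.table` notation are from this doc; other field names may differ from the actual spec):

```yaml
spec:
  checkpointPersistence: false    # fall back to in-memory read position
  source:
    type: postgresql
    table: public.products        # schema.table notation
    changeTrackingColumn: updated_at
```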
| **Kafka** | Consumer group (Kafka) | Resumes from last committed offset. No duplicates if offset was committed after sink write. |
| **PostgreSQL** | ConfigMap (default); in-memory when `checkpointPersistence: false` | By default resumes from last position. Without persistence: re-reads from beginning. |
| **ClickHouse** | ConfigMap (default); in-memory when `checkpointPersistence: false` | By default resumes from last position. Without persistence: re-reads from beginning. |
| **Trino** | ConfigMap (default); in-memory when `checkpointPersistence: false` | By default resumes from last position. Without persistence: re-reads from beginning. |
### Kafka Source
The Kafka consumer commits offset **only after** the message is successfully written to the sink.
By default, read position (lastReadID, lastReadChangeTime) is stored **only in memory**. On pod crash:
- State is lost.
- On restart, the source re-reads from the beginning (or from a wrong position).
- **Duplicates** or **gaps** are possible depending on when the crash occurred.
**Checkpoint persistence** is enabled by default. The read position is persisted to a ConfigMap. On restart, the source resumes from the last committed position, reducing duplicates. Set `checkpointPersistence: false` in spec to disable.
!!! warning "Idempotent sink required"
    For polling sources, always configure an **idempotent sink** (UPSERT, ReplacingMergeTree) to handle duplicates safely.
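For example, a pg→pg flow might make the sink idempotent like this (`upsertMode` is from this doc; the other field names are assumptions):

```yaml
sink:
  type: postgresql
  table: public.products_copy
  upsertMode: true    # update duplicates instead of inserting them again
```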
On SIGTERM (e.g., pod eviction, node drain):
Ensure `terminationGracePeriodSeconds` is sufficient for large batches to flush (default: 600 seconds).
## Checkpoint Persistence
!!! note "Enabled by default"
    The `checkpointPersistence` field in the DataFlow spec defaults to `true`. You do not need to set it explicitly — checkpoint persistence is enabled for all DataFlows with polling sources.
Checkpoint persistence is **enabled by default**. The read position (lastReadID, lastReadChangeTime) is persisted to ConfigMap `dataflow-<name>-checkpoint`. On processor restart, polling sources (PostgreSQL, ClickHouse, Trino) resume from the last committed position, reducing duplicates.
The controller creates the ConfigMap and RBAC (ServiceAccount, Role, RoleBinding) for the processor. Checkpoint is saved with debounce (every 30 seconds) and on graceful shutdown.
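The persisted checkpoint might look roughly like the following (the ConfigMap name and field names come from this doc, but the key layout and JSON encoding are assumptions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dataflow-orders-checkpoint   # for a hypothetical DataFlow named "orders"
data:
  checkpoint: |
    {"lastReadID": "12345", "lastReadChangeTime": "2024-01-01T00:00:00Z"}
```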