[fix](cdc) Cap debezium ChangeEventQueue with a heap-adaptive byte limit to avoid OOM#64511
[fix](cdc) Cap debezium ChangeEventQueue with a heap-adaptive byte limit to avoid OOM#64511JNSimba wants to merge 3 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
|
run buildall |
There was a problem hiding this comment.
Pull request overview
Introduces a heap-adaptive byte bound for Debezium’s ChangeEventQueue to prevent OOMs when wide rows make the existing count-only bound (max.queue.size) insufficient. The change is centralized in ConfigUtil.getDefaultDebeziumProps() so it applies to both the MySQL and PostgreSQL CDC readers, and is validated with unit tests including a system-property override.
Changes:
- Add a default Debezium queue byte cap computed as
clamp(maxHeap/16, 64MB, 256MB). - Add a system-property escape hatch
-Dcdc.max.queue.size.in.bytes=<bytes>(<= 0 disables). - Add unit tests covering the clamp and override/disable behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java | Computes and injects Debezium max.queue.size.in.bytes with an adaptive default and sysprop override. |
| fs_brokers/cdc_client/src/test/java/org/apache/doris/cdcclient/utils/ConfigUtilTest.java | Adds tests asserting clamp bounds and sysprop override/disable behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
I found one blocking issue in the new override path: malformed values for the new cdc.max.queue.size.in.bytes system property are silently ignored, so the escape hatch can run with an unexpected adaptive cap instead of the operator's intended bound.
Critical checkpoint conclusions:
- Goal and tests: the default byte cap is wired into Debezium properties and numeric override cases are tested, but malformed override behavior is currently unsafe and untested.
- Scope: the implementation is small and focused in ConfigUtil plus unit tests.
- Concurrency and lifecycle: no new shared mutable runtime state beyond reading a JVM system property during config creation; no new locking or lifecycle risk found.
- Configuration: this PR adds an operational config escape hatch; invalid values should fail fast instead of falling back silently.
- Compatibility and parallel paths: the shared helper feeds both MySQL and PostgreSQL source config factories, and upstream Flink CDC passes maxQueueSizeInBytes into Debezium ChangeEventQueue for both paths.
- Testing and results: no regression output is involved; add a negative unit test for a malformed override when fixing this.
- Observability, performance, transaction, persistence, and security-sensitive behavior: no additional concerns found for this PR.
- User focus: no additional user-provided focus points were present.
|
/review |
|
run buildall |
There was a problem hiding this comment.
I completed another pass over the full current diff and did not find any additional distinct inline issues beyond the already-open thread on cdc.max.queue.size.in.bytes malformed override handling. That thread still appears relevant to the current head because malformed and overflow values are logged and ignored, then the adaptive cap is used; I am not duplicating the same issue inline.
Critical checkpoint conclusions:
- Goal and tests: the PR wires a Debezium queue byte cap into the shared default properties path used by both CDC readers, with tests for default clamp, numeric override, disable, and malformed fallback. The existing malformed-override thread remains the only behavior concern.
- Scope: the code change is small and focused in
ConfigUtilplus unit tests. - Concurrency and lifecycle: no new mutable shared state is introduced beyond reading a JVM system property during property construction. Existing CDC reader concurrency is unchanged.
- Configuration: a new JVM-level operational escape hatch is added. Current parsing behavior for invalid values is already covered by the existing review thread.
- Compatibility and parallel paths: no storage/protocol compatibility issue found. I verified the helper feeds both MySQL and PostgreSQL reader paths, and Flink CDC 3.6.0 passes
getMaxQueueSizeInBytes()intoChangeEventQueue.Builderfor both connectors. - Special checks and observability: the warning on invalid override is observable, but whether invalid config should continue is the open existing concern.
- Tests and results: no regression outputs are involved. The current CI shows
Build Cdc Clientsucceeded; I did not run Maven locally because this runner review is constrained to the checkout and Maven would otherwise use external caches. - Performance and memory: the byte cap removes the previous count-only unbounded-by-bytes queue risk. I did not find a separate concrete regression in the cap calculation.
- Transactions, persistence, FE-BE variable passing, security-sensitive behavior: not applicable to this PR.
- User focus: no additional user-provided review focus was present.
What problem does this PR solve?
The cdc_client builds debezium's
ChangeEventQueuewith only a count-based bound (max.queue.size=8192) while the byte bound (max.queue.size.in.bytes) defaults to0(disabled). With wide rows (e.g. ~2MB each), the in-memory queue can grow to2MB * 8192 ≈ 16GBand OOM the process. Both PostgreSQL and MySQL paths build the queue fromgetMaxQueueSizeInBytes(), so a single property covers both, and it applies to both the snapshot and streaming phases.What this PR does
Set a heap-adaptive byte cap on the queue buffer in
ConfigUtil.getDefaultDebeziumProps(), which is shared by the Postgres and MySQL source readers:clamp(heap/16, 64MB, 256MB): heap 1G -> 64MB, 2G -> 128MB, >= 4G -> 256MB.-Dcdc.max.queue.size.in.bytes=<bytes>overrides the adaptive value (absolute bytes;<= 0disables the byte bound).Narrow tables are unaffected: 8192 rows stay well under 64MB, so the count bound is reached first and behavior is unchanged.
Release note
None