[fix](cdc) Cap debezium ChangeEventQueue with a heap-adaptive byte limit to avoid OOM #64511

Open
JNSimba wants to merge 3 commits into apache:master from JNSimba:fix-cdc-25937-queue-bytes-limit

Conversation

@JNSimba JNSimba commented Jun 15, 2026

Member

What problem does this PR solve?

The cdc_client builds Debezium's ChangeEventQueue with only a count-based bound (max.queue.size=8192), while the byte bound (max.queue.size.in.bytes) defaults to 0 (disabled). With wide rows (e.g. ~2MB each), the in-memory queue can grow to 2MB * 8192 ≈ 16GB and OOM the process. Both the PostgreSQL and MySQL paths build the queue from getMaxQueueSizeInBytes(), so a single property covers both connectors in both the snapshot and streaming phases.
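
As a quick check of the arithmetic above, the worst-case footprint of a count-only bound is simply the average row size times the queue capacity. The class and method names below are hypothetical and are not part of the PR; this is only a sketch of the calculation.

```java
// Hypothetical helper, not code from this PR: estimates how large a
// count-bounded queue can grow when no byte bound is configured.
final class QueueWorstCaseSketch {

    // Worst case: every slot in the queue holds a row of avgRowBytes.
    static long worstCaseBytes(long avgRowBytes, int maxQueueSize) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(avgRowBytes, (long) maxQueueSize);
    }

    public static void main(String[] args) {
        long twoMb = 2L * 1024 * 1024;            // ~2MB wide row from the description
        long bytes = worstCaseBytes(twoMb, 8192); // max.queue.size=8192
        System.out.println(bytes / (1024L * 1024 * 1024) + " GiB"); // prints "16 GiB"
    }
}
```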

What this PR does

Set a heap-adaptive byte cap on the queue buffer in ConfigUtil.getDefaultDebeziumProps(), which is shared by the Postgres and MySQL source readers:

  • Default cap is clamp(heap/16, 64MB, 256MB): heap 1G -> 64MB, 2G -> 128MB, >= 4G -> 256MB.
  • The cap is intentionally conservative because a single cdc_client JVM can run many queues concurrently (one per split, across multiple jobs), and the real batching/backpressure happens downstream in the sink rather than in this queue.
  • Escape hatch: -Dcdc.max.queue.size.in.bytes=<bytes> overrides the adaptive value (absolute bytes; <= 0 disables the byte bound).

Narrow tables are unaffected: 8192 rows stay well under 64MB, so the count bound is reached first and behavior is unchanged.
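
The default described above can be sketched as a plain clamp on one sixteenth of the maximum heap. This is a minimal illustration, not the actual ConfigUtil code: the class name, method name, and constants are assumptions, and reading the heap size via Runtime.getRuntime().maxMemory() is also an assumption about how the PR obtains it.

```java
// Hypothetical sketch of clamp(heap/16, 64MB, 256MB); names are illustrative only.
final class QueueByteCapSketch {
    static final long MB = 1024L * 1024L;
    static final long MIN_CAP_BYTES = 64 * MB;   // floor from the PR description
    static final long MAX_CAP_BYTES = 256 * MB;  // ceiling from the PR description

    // Returns heap/16, raised to the floor or lowered to the ceiling as needed.
    static long adaptiveCapBytes(long maxHeapBytes) {
        long sixteenth = maxHeapBytes / 16;
        return Math.max(MIN_CAP_BYTES, Math.min(MAX_CAP_BYTES, sixteenth));
    }

    public static void main(String[] args) {
        // Assumption: the JVM's max heap is taken from Runtime.maxMemory().
        long heap = Runtime.getRuntime().maxMemory();
        System.out.println("heap=" + heap / MB + "MB -> cap="
                + adaptiveCapBytes(heap) / MB + "MB");
    }
}
```

With the 64MB floor and max.queue.size=8192, the byte bound only takes effect once rows average more than 64MB / 8192 = 8KB, which is consistent with narrow tables hitting the count bound first.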

Release note

None

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba JNSimba requested a review from Copilot June 15, 2026 07:13
JNSimba commented Jun 15, 2026

Member Author

/review

JNSimba commented Jun 15, 2026

Member Author

run buildall

Copilot AI left a comment

Contributor

Pull request overview

Introduces a heap-adaptive byte bound for Debezium’s ChangeEventQueue to prevent OOMs when wide rows make the existing count-only bound (max.queue.size) insufficient. The change is centralized in ConfigUtil.getDefaultDebeziumProps() so it applies to both the MySQL and PostgreSQL CDC readers, and is validated with unit tests including a system-property override.

Changes:

  • Add a default Debezium queue byte cap computed as clamp(maxHeap/16, 64MB, 256MB).
  • Add a system-property escape hatch -Dcdc.max.queue.size.in.bytes=<bytes> (<= 0 disables).
  • Add unit tests covering the clamp and override/disable behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java | Computes and injects Debezium max.queue.size.in.bytes with an adaptive default and sysprop override. |
| fs_brokers/cdc_client/src/test/java/org/apache/doris/cdcclient/utils/ConfigUtilTest.java | Adds tests asserting clamp bounds and sysprop override/disable behavior. |

@github-actions github-actions Bot left a comment

Contributor

I found one blocking issue in the new override path: malformed values for the new cdc.max.queue.size.in.bytes system property are silently ignored, so the escape hatch can run with an unexpected adaptive cap instead of the operator's intended bound.

Critical checkpoint conclusions:

  • Goal and tests: the default byte cap is wired into Debezium properties and numeric override cases are tested, but malformed override behavior is currently unsafe and untested.
  • Scope: the implementation is small and focused in ConfigUtil plus unit tests.
  • Concurrency and lifecycle: no new shared mutable runtime state beyond reading a JVM system property during config creation; no new locking or lifecycle risk found.
  • Configuration: this PR adds an operational config escape hatch; invalid values should fail fast instead of falling back silently.
  • Compatibility and parallel paths: the shared helper feeds both MySQL and PostgreSQL source config factories, and upstream Flink CDC passes maxQueueSizeInBytes into Debezium ChangeEventQueue for both paths.
  • Testing and results: no regression output is involved; add a negative unit test for a malformed override when fixing this.
  • Observability, performance, transaction, persistence, and security-sensitive behavior: no additional concerns found for this PR.
  • User focus: no additional user-provided focus points were present.
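
One way to make the override fail fast, as this review asks, is to reject anything that is not a valid long rather than falling back to the adaptive cap. The sketch below is only an illustration of that idea: the class and method names are hypothetical, the error message wording is invented, and the PR's actual handling may differ.

```java
import java.util.OptionalLong;

// Hypothetical fail-fast parser for -Dcdc.max.queue.size.in.bytes; not code from this PR.
final class QueueByteOverrideSketch {
    static final String OVERRIDE_PROP = "cdc.max.queue.size.in.bytes";

    // Returns empty when the property is unset (caller uses the adaptive cap),
    // the parsed value when it is a valid long (<= 0 means "disable the byte bound"),
    // and throws when the value is malformed or out of range.
    static OptionalLong parseOverride(String raw) {
        if (raw == null) {
            return OptionalLong.empty();
        }
        try {
            // Long.parseLong rejects non-numeric text, blanks, and values beyond the long range.
            return OptionalLong.of(Long.parseLong(raw.trim()));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "Invalid value for -D" + OVERRIDE_PROP + ": '" + raw
                            + "' (expected absolute bytes as an integer)", e);
        }
    }

    public static void main(String[] args) {
        OptionalLong override = parseOverride(System.getProperty(OVERRIDE_PROP));
        System.out.println(override.isPresent()
                ? "override=" + override.getAsLong()
                : "no override, use adaptive cap");
    }
}
```

The trade-off is that a typo such as `64MB` stops the process at startup instead of quietly running with a cap the operator did not choose.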

JNSimba commented Jun 15, 2026

Member Author

/review

JNSimba commented Jun 15, 2026

Member Author

run buildall

@github-actions github-actions Bot left a comment

Contributor

I completed another pass over the full current diff and did not find any additional distinct inline issues beyond the already-open thread on cdc.max.queue.size.in.bytes malformed override handling. That thread still appears relevant to the current head because malformed and overflow values are logged and ignored, then the adaptive cap is used; I am not duplicating the same issue inline.

Critical checkpoint conclusions:

  • Goal and tests: the PR wires a Debezium queue byte cap into the shared default properties path used by both CDC readers, with tests for default clamp, numeric override, disable, and malformed fallback. The existing malformed-override thread remains the only behavior concern.
  • Scope: the code change is small and focused in ConfigUtil plus unit tests.
  • Concurrency and lifecycle: no new mutable shared state is introduced beyond reading a JVM system property during property construction. Existing CDC reader concurrency is unchanged.
  • Configuration: a new JVM-level operational escape hatch is added. Current parsing behavior for invalid values is already covered by the existing review thread.
  • Compatibility and parallel paths: no storage/protocol compatibility issue found. I verified the helper feeds both MySQL and PostgreSQL reader paths, and Flink CDC 3.6.0 passes getMaxQueueSizeInBytes() into ChangeEventQueue.Builder for both connectors.
  • Special checks and observability: the warning on invalid override is observable, but whether invalid config should continue is the open existing concern.
  • Tests and results: no regression outputs are involved. The current CI shows Build Cdc Client succeeded; I did not run Maven locally because this runner review is constrained to the checkout and Maven would otherwise use external caches.
  • Performance and memory: the byte cap removes the previous count-only unbounded-by-bytes queue risk. I did not find a separate concrete regression in the cap calculation.
  • Transactions, persistence, FE-BE variable passing, security-sensitive behavior: not applicable to this PR.
  • User focus: no additional user-provided review focus was present.
