maintainer: quiesce control plane during remove handoff by wlwilliamx · Pull Request #4828 · pingcap/ticdc

wlwilliamx · 2026-04-15T07:31:41Z

What problem does this PR solve?

Issue Number: close #4827

What is changed and how it works?

This PR stops the old maintainer from continuing ordinary control-plane work after RemoveMaintainer starts.

It does that by:

disabling heartbeat self-healing while the maintainer is removing;
suppressing node-change, block-status, and non-close resend handling during removing;
quiescing the operator controller so only the DDL trigger close path can keep running;
adding regression tests for removing-state heartbeat handling, removing-state resend suppression, and operator quiescing.
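
The gating described in the list above can be pictured with a minimal sketch. It assumes a single atomic removing flag and a hypothetical workKind classification; the type and function names here (maintainer, workKind, allowed, handle) are illustrative only and are not taken from the TiCDC code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// workKind is a hypothetical classification of control-plane work.
type workKind int

const (
	heartbeatSelfHeal workKind = iota
	nodeChange
	blockStatus
	resend
	closeResend // resend that belongs to the close path
)

// maintainer is an illustrative stand-in, not the TiCDC type.
type maintainer struct {
	removing atomic.Bool
	done     []workKind // work that was actually carried out
}

// startRemoving marks the start of the remove handoff.
func (m *maintainer) startRemoving() { m.removing.Store(true) }

// allowed reports whether a kind of work may still run.
// Once removing is set, only close-path resends pass the gate.
func (m *maintainer) allowed(k workKind) bool {
	return !m.removing.Load() || k == closeResend
}

// handle runs the work if the gate allows it and reports whether it ran.
func (m *maintainer) handle(k workKind) bool {
	if !m.allowed(k) {
		return false
	}
	m.done = append(m.done, k)
	return true
}

func main() {
	m := &maintainer{}
	fmt.Println(m.handle(nodeChange)) // runs before removal starts
	m.startRemoving()
	fmt.Println(m.handle(heartbeatSelfHeal), m.handle(nodeChange), m.handle(closeResend))
}
```

Putting the check in one place keeps every entry point (heartbeat, node change, block status, resend) subject to the same rule, which is roughly what the regression tests listed above would exercise.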

Validation:

go test ./maintainer/...

Check List

Tests

Unit test
Manual test

Questions

Will it cause performance regression or break compatibility?

No. The change only tightens old-maintainer behavior during the remove handoff window and prevents stale control-plane work from continuing after shutdown starts.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a bug where a removing TiCDC maintainer could still reschedule or recreate dispatchers after shutdown handoff started.

Freeze ordinary scheduling once RemoveMaintainer starts so the old maintainer can only finish the DDL trigger close path. This avoids late heartbeat, barrier, node-change, and operator activity from recreating dispatchers after shutdown begins.

ti-chi-bot · 2026-04-15T07:31:44Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot · 2026-04-15T07:31:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kennytm for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-04-15T07:31:48Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b37e9a08-be37-440f-b912-3cfefbf05b43

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


gemini-code-assist

Code Review

This pull request introduces a quiescing mechanism for the maintainer and its operator controller to ensure a stable shutdown and handoff process. By entering a removing mode, the maintainer suppresses ordinary scheduling, self-healing, and legacy control-plane traffic while allowing critical DDL trigger operations to complete. The review feedback correctly identifies several opportunities to optimize locking efficiency in the OperatorController, specifically by consolidating quiescing checks and operator lookups into single lock blocks to reduce contention and redundant acquisitions in performance-critical paths.

gemini-code-assist · 2026-04-15T07:33:38Z

+	if !oc.isOperatorAllowed(id) {
+		return
+	}
 	oc.mu.RLock()
 	op, ok := oc.operators[id]
 	oc.mu.RUnlock()


The current implementation of UpdateOperatorStatus performs redundant locking by calling isOperatorAllowed (which acquires a read lock) followed by another manual read lock acquisition. This can be optimized by using isOperatorAllowedLocked within a single lock block.

Suggested change

-	if !oc.isOperatorAllowed(id) {
-		return
-	}
-	oc.mu.RLock()
-	op, ok := oc.operators[id]
-	oc.mu.RUnlock()
+	oc.mu.RLock()
+	if !oc.isOperatorAllowedLocked(id) {
+		oc.mu.RUnlock()
+		return
+	}
+	op, ok := oc.operators[id]
+	oc.mu.RUnlock()
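
The consolidation suggested above can be sketched in isolation. This is a minimal, self-contained illustration under stated assumptions: the controller type, its quiescing flag, the allowed set, and the lookupAllowed and quiesce helpers are hypothetical stand-ins and do not reproduce the real OperatorController. Only the pattern matters here: the check and the map lookup happen under one read lock.

```go
package main

import (
	"fmt"
	"sync"
)

type operator struct{ name string }

// controller is a hypothetical, simplified stand-in for an operator controller.
type controller struct {
	mu        sync.RWMutex
	quiescing bool
	allowed   map[string]bool // IDs still permitted while quiescing
	operators map[string]*operator
}

// isOperatorAllowedLocked must be called with c.mu held (read or write).
func (c *controller) isOperatorAllowedLocked(id string) bool {
	return !c.quiescing || c.allowed[id]
}

// lookupAllowed performs the quiescing check and the operator lookup
// inside a single read-locked critical section.
func (c *controller) lookupAllowed(id string) (*operator, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if !c.isOperatorAllowedLocked(id) {
		return nil, false
	}
	op, ok := c.operators[id]
	return op, ok
}

// quiesce switches the controller into quiescing mode and records
// which operator IDs may keep running.
func (c *controller) quiesce(ids ...string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.quiescing = true
	c.allowed = make(map[string]bool, len(ids))
	for _, id := range ids {
		c.allowed[id] = true
	}
}

func main() {
	c := &controller{operators: map[string]*operator{
		"ddl-trigger": {name: "close"},
		"table-1":     {name: "move"},
	}}
	_, ok := c.lookupAllowed("table-1")
	fmt.Println("before quiesce, table-1 visible:", ok)
	c.quiesce("ddl-trigger")
	_, ok = c.lookupAllowed("table-1")
	fmt.Println("after quiesce, table-1 visible:", ok)
	_, ok = c.lookupAllowed("ddl-trigger")
	fmt.Println("after quiesce, ddl-trigger visible:", ok)
}
```

Besides saving one lock acquisition, a single critical section also means the quiescing state cannot change between the check and the lookup, which the two-step version does not guarantee.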

gemini-code-assist · 2026-04-15T07:33:39Z

 	ops := oc.GetAllOperators()

 	for _, op := range ops {
+		if !oc.isOperatorAllowed(op.ID()) {
+			continue
+		}


The GetMinCheckpointTs function is in the hot path and currently suffers from significant lock contention. It first copies all operators into a slice (acquiring a lock) and then repeatedly acquires and releases a read lock for every single operator in the loop via isOperatorAllowed.

Iterating over the map directly while holding the lock once is much more efficient and avoids unnecessary allocations and lock bouncing.

func (oc *Controller) GetMinCheckpointTs(minCheckpointTs uint64) uint64 {
	oc.mu.RLock()
	defer oc.mu.RUnlock()
	for id, op := range oc.operators {
		if !oc.isOperatorAllowedLocked(id) {
			continue
		}
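
The snippet above is cut off in the review comment, so the following is not a completion of it. It is a separate, minimal sketch of the same idea (iterate the map once under a single read lock and skip disallowed operators), assuming a hypothetical op type that carries a checkpointTs field and a hypothetical allowed set. None of these names come from the TiCDC code.

```go
package main

import (
	"fmt"
	"sync"
)

// op is a hypothetical operator that only carries a checkpoint timestamp.
type op struct{ checkpointTs uint64 }

// controller is an illustrative stand-in, not the real operator controller.
type controller struct {
	mu        sync.RWMutex
	quiescing bool
	allowed   map[int]bool // IDs still permitted while quiescing
	operators map[int]op
}

// isOperatorAllowedLocked must be called with c.mu held.
func (c *controller) isOperatorAllowedLocked(id int) bool {
	return !c.quiescing || c.allowed[id]
}

// minCheckpointTs returns the smaller of start and the checkpoint of every
// allowed operator, taking the read lock exactly once and copying nothing.
func (c *controller) minCheckpointTs(start uint64) uint64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	result := start
	for id, o := range c.operators {
		if !c.isOperatorAllowedLocked(id) {
			continue
		}
		if o.checkpointTs < result {
			result = o.checkpointTs
		}
	}
	return result
}

func main() {
	c := &controller{operators: map[int]op{
		1: {checkpointTs: 50},
		2: {checkpointTs: 30},
		3: {checkpointTs: 80},
	}}
	fmt.Println(c.minCheckpointTs(100)) // all operators count

	c.quiescing = true
	c.allowed = map[int]bool{1: true}
	fmt.Println(c.minCheckpointTs(100)) // only operator 1 counts
}
```

Compared with copying all operators into a slice and re-locking per element, this form allocates nothing and holds the read lock for one short pass, at the cost of keeping writers out for the duration of the loop.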

gemini-code-assist · 2026-04-15T07:33:39Z

+	if !oc.isOperatorAllowed(op.ID()) {
+		log.Info("skip remove operator while controller is quiescing",
+			zap.String("role", oc.role),
+			zap.Stringer("changefeedID", oc.changefeedID),
+			zap.String("dispatcherID", op.ID().String()),
+			zap.String("operator", op.String()))
+		return
+	}
 	oc.mu.RLock()
 	old, ok := oc.operators[op.ID()]
 	oc.mu.RUnlock()


Similar to other methods in this controller, removeReplicaSet performs redundant locking. Consolidating the quiescing check and the operator lookup into a single lock block improves performance.

	oc.mu.RLock()
	if !oc.isOperatorAllowedLocked(op.ID()) {
		oc.mu.RUnlock()
		log.Info("skip remove operator while controller is quiescing",
			zap.String("role", oc.role),
			zap.Stringer("changefeedID", oc.changefeedID),
			zap.String("dispatcherID", op.ID().String()),
			zap.String("operator", op.String()))
		return
	}
	old, ok := oc.operators[op.ID()]
	oc.mu.RUnlock()

ti-chi-bot Bot added the do-not-merge/needs-triage-completed, release-note (denotes a PR that will be considered when it comes time to generate release notes), and do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) labels on Apr 15, 2026.

ti-chi-bot Bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Apr 15, 2026.

gemini-code-assist Bot reviewed Apr 15, 2026

wlwilliamx wants to merge 1 commit into pingcap:master from wlwilliamx:codex/maintainer-failover-issue-20260415
