
maintainer: quiesce control plane during remove handoff #4828

Draft
wlwilliamx wants to merge 1 commit into pingcap:master from wlwilliamx:codex/maintainer-failover-issue-20260415


@wlwilliamx
Collaborator

What problem does this PR solve?

Issue Number: close #4827

What is changed and how it works?

This PR stops the old maintainer from continuing ordinary control-plane work after RemoveMaintainer starts.

It does that by:

  • disabling heartbeat self-healing while the maintainer is removing;
  • suppressing node-change, block-status, and non-close resend handling during removing;
  • quiescing the operator controller so only the DDL trigger close path can keep running;
  • adding regression tests for removing-state heartbeat handling, removing-state resend suppression, and operator quiescing.
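
The suppression rules listed above can be pictured as a single gate that every incoming control-plane event passes through. The following is only a minimal sketch under stated assumptions: the `maintainer` struct, the `eventKind` values, and the `shouldHandle` and `startRemove` helpers are hypothetical names invented for illustration and are not the TiCDC types or the code in this PR. Heartbeat handling is left out because the PR only disables its self-healing part rather than dropping heartbeats.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// eventKind is a hypothetical classification of control-plane events.
type eventKind int

const (
	eventNodeChange eventKind = iota
	eventBlockStatus
	eventResend      // ordinary (non-close) resend
	eventCloseResend // resend that belongs to the close path
)

// maintainer is an illustrative stand-in, not the real TiCDC maintainer.
type maintainer struct {
	removing atomic.Bool
	handled  []eventKind
}

// startRemove marks the beginning of the remove handoff.
func (m *maintainer) startRemove() { m.removing.Store(true) }

// shouldHandle reports whether an event may still be processed.
// Before removal everything passes; afterwards only the close path does.
func (m *maintainer) shouldHandle(k eventKind) bool {
	if !m.removing.Load() {
		return true
	}
	return k == eventCloseResend
}

// handle records the event if the gate lets it through.
func (m *maintainer) handle(k eventKind) bool {
	if !m.shouldHandle(k) {
		return false
	}
	m.handled = append(m.handled, k)
	return true
}

func main() {
	m := &maintainer{}
	fmt.Println(m.handle(eventNodeChange)) // accepted before removal starts
	m.startRemove()
	fmt.Println(m.handle(eventNodeChange))  // suppressed while removing
	fmt.Println(m.handle(eventCloseResend)) // close path is still allowed
}
```

Keeping the decision in one predicate makes the removing-state behavior easy to cover with table-style regression tests, which is roughly what the last bullet describes.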

Validation:

  • go test ./maintainer/...

Check List

Tests

  • Unit test
  • Manual test

Questions

Will it cause performance regression or break compatibility?

No. The change only tightens old-maintainer behavior during the remove handoff window and prevents stale control-plane work from continuing after shutdown starts.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a bug where a removing TiCDC maintainer could still reschedule or recreate dispatchers after shutdown handoff started.

Freeze ordinary scheduling once RemoveMaintainer starts so the old
maintainer can only finish the DDL trigger close path. This avoids
late heartbeat, barrier, node-change, and operator activity from
recreating dispatchers after shutdown begins.
@ti-chi-bot

ti-chi-bot Bot commented Apr 15, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Apr 15, 2026
@ti-chi-bot

ti-chi-bot Bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kennytm for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai Bot commented Apr 15, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b37e9a08-be37-440f-b912-3cfefbf05b43

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a quiescing mechanism for the maintainer and its operator controller to ensure a stable shutdown and handoff process. By entering a removing mode, the maintainer suppresses ordinary scheduling, self-healing, and legacy control-plane traffic while allowing critical DDL trigger operations to complete. The review feedback correctly identifies several opportunities to optimize locking efficiency in the OperatorController, specifically by consolidating quiescing checks and operator lookups into single lock blocks to reduce contention and redundant acquisitions in performance-critical paths.

Comment on lines +223 to 228
if !oc.isOperatorAllowed(id) {
	return
}
oc.mu.RLock()
op, ok := oc.operators[id]
oc.mu.RUnlock()

medium

The current implementation of UpdateOperatorStatus performs redundant locking by calling isOperatorAllowed (which acquires a read lock) followed by another manual read lock acquisition. This can be optimized by using isOperatorAllowedLocked within a single lock block.

Suggested change
- if !oc.isOperatorAllowed(id) {
- 	return
- }
- oc.mu.RLock()
- op, ok := oc.operators[id]
- oc.mu.RUnlock()
+ oc.mu.RLock()
+ if !oc.isOperatorAllowedLocked(id) {
+ 	oc.mu.RUnlock()
+ 	return
+ }
+ op, ok := oc.operators[id]
+ oc.mu.RUnlock()
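
The suggestion relies on the usual Go convention of pairing a self-locking method with a `...Locked` variant that expects the caller to already hold the mutex. The sketch below illustrates that convention in isolation. Everything in it is an assumption made for the example: the `controller` struct, its `quiescing` and `allowed` fields, and the `lookupAllowed` and `quiesce` helpers are hypothetical and do not reflect how the PR's `OperatorController` is actually laid out.

```go
package main

import (
	"fmt"
	"sync"
)

// controller is an illustrative stand-in for an operator controller.
type controller struct {
	mu        sync.RWMutex
	quiescing bool
	allowed   map[string]bool   // IDs still allowed while quiescing (hypothetical)
	operators map[string]string // operator ID -> description (hypothetical)
}

// isOperatorAllowedLocked must be called with c.mu held (read or write).
func (c *controller) isOperatorAllowedLocked(id string) bool {
	return !c.quiescing || c.allowed[id]
}

// isOperatorAllowed acquires the read lock itself.
func (c *controller) isOperatorAllowed(id string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.isOperatorAllowedLocked(id)
}

// lookupAllowed performs the quiescing check and the map lookup
// under one read lock, so both see the same controller state.
func (c *controller) lookupAllowed(id string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if !c.isOperatorAllowedLocked(id) {
		return "", false
	}
	op, ok := c.operators[id]
	return op, ok
}

// quiesce switches the controller into quiescing mode, keeping only
// the given operator IDs usable.
func (c *controller) quiesce(keep ...string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.quiescing = true
	c.allowed = make(map[string]bool, len(keep))
	for _, id := range keep {
		c.allowed[id] = true
	}
}

func main() {
	c := &controller{operators: map[string]string{"t1": "move", "ddl": "close"}}
	c.quiesce("ddl")
	_, okMove := c.lookupAllowed("t1")
	_, okClose := c.lookupAllowed("ddl")
	fmt.Println(okMove, okClose)
}
```

Besides saving one lock acquisition, the single-lock form closes the window in which the controller could start quiescing between the check and the lookup. The `Locked` variant is also needed for safety: calling the self-locking `isOperatorAllowed` while already holding `RLock` would be a recursive read lock, which `sync.RWMutex` prohibits because it can deadlock when a writer is waiting.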

Comment on lines 276 to +281
ops := oc.GetAllOperators()

for _, op := range ops {
	if !oc.isOperatorAllowed(op.ID()) {
		continue
	}

medium

The GetMinCheckpointTs function is in the hot path and currently suffers from significant lock contention. It first copies all operators into a slice (acquiring a lock) and then repeatedly acquires and releases a read lock for every single operator in the loop via isOperatorAllowed.

Iterating over the map directly while holding the lock once is much more efficient and avoids unnecessary allocations and lock bouncing.

func (oc *Controller) GetMinCheckpointTs(minCheckpointTs uint64) uint64 {
	oc.mu.RLock()
	defer oc.mu.RUnlock()

	for id, op := range oc.operators {
		if !oc.isOperatorAllowedLocked(id) {
			continue
		}

Comment on lines +400 to 410
if !oc.isOperatorAllowed(op.ID()) {
	log.Info("skip remove operator while controller is quiescing",
		zap.String("role", oc.role),
		zap.Stringer("changefeedID", oc.changefeedID),
		zap.String("dispatcherID", op.ID().String()),
		zap.String("operator", op.String()))
	return
}
oc.mu.RLock()
old, ok := oc.operators[op.ID()]
oc.mu.RUnlock()

medium

Similar to other methods in this controller, removeReplicaSet performs redundant locking. Consolidating the quiescing check and the operator lookup into a single lock block improves performance.

oc.mu.RLock()
if !oc.isOperatorAllowedLocked(op.ID()) {
	oc.mu.RUnlock()
	log.Info("skip remove operator while controller is quiescing",
		zap.String("role", oc.role),
		zap.Stringer("changefeedID", oc.changefeedID),
		zap.String("dispatcherID", op.ID().String()),
		zap.String("operator", op.String()))
	return
}
old, ok := oc.operators[op.ID()]
oc.mu.RUnlock()


Labels

  • do-not-merge/needs-triage-completed
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

maintainer remove handoff can still recreate dispatchers after shutdown starts

1 participant