maintainer: quiesce control plane during remove handoff#4828
maintainer: quiesce control plane during remove handoff#4828wlwilliamx wants to merge 1 commit intopingcap:masterfrom
Conversation
Freeze ordinary scheduling once RemoveMaintainer starts so the old maintainer can only finish the DDL trigger close path. This avoids late heartbeat, barrier, node-change, and operator activity from recreating dispatchers after shutdown begins.
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a quiescing mechanism for the maintainer and its operator controller to ensure a stable shutdown and handoff process. By entering a removing mode, the maintainer suppresses ordinary scheduling, self-healing, and legacy control-plane traffic while allowing critical DDL trigger operations to complete. The review feedback correctly identifies several opportunities to optimize locking efficiency in the OperatorController, specifically by consolidating quiescing checks and operator lookups into single lock blocks to reduce contention and redundant acquisitions in performance-critical paths.
| if !oc.isOperatorAllowed(id) { | ||
| return | ||
| } | ||
| oc.mu.RLock() | ||
| op, ok := oc.operators[id] | ||
| oc.mu.RUnlock() |
There was a problem hiding this comment.
The current implementation of UpdateOperatorStatus performs redundant locking by calling isOperatorAllowed (which acquires a read lock) followed by another manual read lock acquisition. This can be optimized by using isOperatorAllowedLocked within a single lock block.
| if !oc.isOperatorAllowed(id) { | |
| return | |
| } | |
| oc.mu.RLock() | |
| op, ok := oc.operators[id] | |
| oc.mu.RUnlock() | |
| oc.mu.RLock() | |
| if !oc.isOperatorAllowedLocked(id) { | |
| oc.mu.RUnlock() | |
| return | |
| } | |
| op, ok := oc.operators[id] | |
| oc.mu.RUnlock() |
| ops := oc.GetAllOperators() | ||
|
|
||
| for _, op := range ops { | ||
| if !oc.isOperatorAllowed(op.ID()) { | ||
| continue | ||
| } |
There was a problem hiding this comment.
The GetMinCheckpointTs function is in the hot path and currently suffers from significant lock contention. It first copies all operators into a slice (acquiring a lock) and then repeatedly acquires and releases a read lock for every single operator in the loop via isOperatorAllowed.
Iterating over the map directly while holding the lock once is much more efficient and avoids unnecessary allocations and lock bouncing.
func (oc *Controller) GetMinCheckpointTs(minCheckpointTs uint64) uint64 {
oc.mu.RLock()
defer oc.mu.RUnlock()
for id, op := range oc.operators {
if !oc.isOperatorAllowedLocked(id) {
continue
}| if !oc.isOperatorAllowed(op.ID()) { | ||
| log.Info("skip remove operator while controller is quiescing", | ||
| zap.String("role", oc.role), | ||
| zap.Stringer("changefeedID", oc.changefeedID), | ||
| zap.String("dispatcherID", op.ID().String()), | ||
| zap.String("operator", op.String())) | ||
| return | ||
| } | ||
| oc.mu.RLock() | ||
| old, ok := oc.operators[op.ID()] | ||
| oc.mu.RUnlock() |
There was a problem hiding this comment.
Similar to other methods in this controller, removeReplicaSet performs redundant locking. Consolidating the quiescing check and the operator lookup into a single lock block improves performance.
oc.mu.RLock()
if !oc.isOperatorAllowedLocked(op.ID()) {
oc.mu.RUnlock()
log.Info("skip remove operator while controller is quiescing",
zap.String("role", oc.role),
zap.Stringer("changefeedID", oc.changefeedID),
zap.String("dispatcherID", op.ID().String()),
zap.String("operator", op.String()))
return
}
old, ok := oc.operators[op.ID()]
oc.mu.RUnlock()
What problem does this PR solve?
Issue Number: close #4827
What is changed and how it works?
This PR stops the old maintainer from continuing ordinary control-plane work after
RemoveMaintainerstarts.It does that by:
Validation:
go test ./maintainer/...Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No. The change only tightens old-maintainer behavior during the remove handoff window and prevents stale control-plane work from continuing after shutdown starts.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note