Pause DDL during eligible TiDB upgrades (#6904)#6914
Conversation
Document the release-1.x design for coordinating TiDB smooth-upgrade mode around TiDB StatefulSet rolling upgrades. The design records the TiDB HTTP API contract, version matrix, annotation-based pause state, and verification expectations before implementation starts.\n\nConstraint: release-1.x supports only classic TiDB kernel, so Premium/keyspace behavior is out of scope\nConstraint: omx team requires a clean leader worktree before launching worker worktrees\nRejected: Store active pause state only in status conditions | controller workflow state must survive status recalculation and restarts\nConfidence: high\nScope-risk: narrow\nTested: git diff --check for design proposal and plan\nNot-tested: implementation tests not run; design-only commit
Coordinate TiDB smooth-upgrade state around classic TiDB rolling version upgrades so user DDL is paused only for switch-controlled version pairs and resumed after all TiDB members are updated and healthy. The implementation keeps the rollout gate retryable with controller-owned annotations and makes the TiDB HTTP API client accept TiDB's JSON string responses.\n\nConstraint: release-1.x scope is classic TiDB only; Premium, keyspace, and TiCI behavior remain out of scope\nRejected: Call smooth-upgrade APIs for every version upgrade | unsupported and auto-supported version pairs must preserve existing rolling behavior\nRejected: Clear annotations by omitting keys from a merge patch | Kubernetes JSON merge patch requires null values to delete existing annotations\nConfidence: medium\nScope-risk: moderate\nDirective: Do not advance TiDB rolling partitions before StartUpgrade succeeds for switch-controlled version pairs\nTested: go test ./pkg/controller ./pkg/manager/member -count=1\nTested: go test ./tests/e2e/util/proxiedtidbclient -count=1\nNot-tested: golangci-lint run ./pkg/controller ./pkg/manager/member failed before linting because installed golangci-lint was built with go1.24 but repo targets go1.25.8
- avoid mutating shared http.Client in upgrade(); create a fresh client
with the TLS transport preserved and the correct per-operation timeout
- patch only the four smooth-upgrade annotation keys instead of the full
tc.Annotations map, preventing loss of user-set annotations on conflict
- fix chooseHealthyTiDBOrdinal to scan tc.Status.TiDB.Members by name
instead of iterating 0..Replicas-1, correctly handling advanced
StatefulSet delete slots with non-contiguous ordinals
- add TestTiDBUpgraderSmoothUpgradeActiveAnnotationSkipsDuplicate to
cover the missing "active annotation avoids duplicate start" test case
- use any instead of interface{} in patch helpers (modernize)
- use %q consistently for body text in upgrade error messages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three tests covering the annotation lifecycle for smooth upgrade DDL pause: - happy path: switch-controlled pair (v7.5.0→v7.5.3) sets annotations during rolling upgrade and clears them after finish - unsupported pair (v6.5.10→v8.1.0): verifies no pause annotation is set - operator restart: confirms finish still runs and annotations are cleared after the controller-manager pod is deleted mid-upgrade Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Classifying a downgrade pair (e.g. v7.5.0 -> v7.4.0) as switch-controlled would incorrectly call /upgrade/start before rolling to the older image. Guard against this by returning Unsupported when target <= source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix error format string order in tidb_control.go: put HTTP status before body text (%d %q) to match HTTP convention - Add unit test for stale annotation recovery path in ensureSmoothUpgradeStarted - Add direct imageTag edge case tests (digest ref, no tag, empty tag) - Add cross-reference comment in e2e test to flag annotation key constants that must stay in sync with smooth_upgrade.go Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
/test pull-e2e-kind-tidbcluster |
|
/lgtm |
|
@fgksgf: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: fgksgf The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/retest |
This is an automated cherry-pick of #6904
What is changed
This PR adds TiDB smooth-upgrade coordination to TiDB Operator v1 classic TiDB upgrades on
release-1.x:POST /upgrade/startbefore eligible TiDB rolling version upgrades begin200start response before allowing TiDB pods to rollTidbClusterannotationsPOST /upgrade/finishafter the upgraded TiDB StatefulSet is fully updated, Normal, and members are healthyDesign notes
The implementation follows
docs/design-proposals/2026-05-14-pause-ddl-during-tidb-upgrade.md.Important behavior:
/upgrade/startuses a 30s client timeout because TiDB waits up to 10s for the DDL owner to sync upgrading state before returning success.success!and duplicate-success responses.tidb.pingcap.com/smooth-upgrade-ddl-pausedtidb.pingcap.com/smooth-upgrade-source-versiontidb.pingcap.com/smooth-upgrade-target-versiontidb.pingcap.com/smooth-upgrade-started-at/upgrade/finish, deleting the annotation keys, and requeueing.Tests