Pause DDL during eligible TiDB upgrades#6904
Conversation
|
Skipping CI for Draft Pull Request. |
Document the release-1.x design for coordinating TiDB smooth-upgrade mode around TiDB StatefulSet rolling upgrades. The design records the TiDB HTTP API contract, version matrix, annotation-based pause state, and verification expectations before implementation starts.\n\nConstraint: release-1.x supports only classic TiDB kernel, so Premium/keyspace behavior is out of scope\nConstraint: omx team requires a clean leader worktree before launching worker worktrees\nRejected: Store active pause state only in status conditions | controller workflow state must survive status recalculation and restarts\nConfidence: high\nScope-risk: narrow\nTested: git diff --check for design proposal and plan\nNot-tested: implementation tests not run; design-only commit
Coordinate TiDB smooth-upgrade state around classic TiDB rolling version upgrades so user DDL is paused only for switch-controlled version pairs and resumed after all TiDB members are updated and healthy. The implementation keeps the rollout gate retryable with controller-owned annotations and makes the TiDB HTTP API client accept TiDB's JSON string responses.\n\nConstraint: release-1.x scope is classic TiDB only; Premium, keyspace, and TiCI behavior remain out of scope\nRejected: Call smooth-upgrade APIs for every version upgrade | unsupported and auto-supported version pairs must preserve existing rolling behavior\nRejected: Clear annotations by omitting keys from a merge patch | Kubernetes JSON merge patch requires null values to delete existing annotations\nConfidence: medium\nScope-risk: moderate\nDirective: Do not advance TiDB rolling partitions before StartUpgrade succeeds for switch-controlled version pairs\nTested: go test ./pkg/controller ./pkg/manager/member -count=1\nTested: go test ./tests/e2e/util/proxiedtidbclient -count=1\nNot-tested: golangci-lint run ./pkg/controller ./pkg/manager/member failed before linting because installed golangci-lint was built with go1.24 but repo targets go1.25.8
- avoid mutating shared http.Client in upgrade(); create a fresh client
with the TLS transport preserved and the correct per-operation timeout
- patch only the four smooth-upgrade annotation keys instead of the full
tc.Annotations map, preventing loss of user-set annotations on conflict
- fix chooseHealthyTiDBOrdinal to scan tc.Status.TiDB.Members by name
instead of iterating 0..Replicas-1, correctly handling advanced
StatefulSet delete slots with non-contiguous ordinals
- add TestTiDBUpgraderSmoothUpgradeActiveAnnotationSkipsDuplicate to
cover the missing "active annotation avoids duplicate start" test case
- use any instead of interface{} in patch helpers (modernize)
- use %q consistently for body text in upgrade error messages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
0510155 to
7b4bace
Compare
Three tests covering the annotation lifecycle for smooth upgrade DDL pause: - happy path: switch-controlled pair (v7.5.0→v7.5.3) sets annotations during rolling upgrade and clears them after finish - unsupported pair (v6.5.10→v8.1.0): verifies no pause annotation is set - operator restart: confirms finish still runs and annotations are cleared after the controller-manager pod is deleted mid-upgrade Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fgksgf
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Changes requested
The overall smooth-upgrade flow is well covered for the happy path, but I found two recovery/correctness issues that can leave TiDB in upgrade mode unexpectedly: reverse/downgrade version changes can be classified as switch-controlled smooth upgrades, and /upgrade/start is called before the durable annotation marker is persisted.
Validation: the targeted test command from the PR description passes locally:
go test ./pkg/controller ./pkg/manager/member ./tests/e2e/util/proxiedtidbclient -count=1
- HuaxiClaw
Additional issues1. Error format string argument order ( return fmt.Errorf("tidb upgrade %s error response %q:%v URL: %s", op, bodyText, res.StatusCode, url)Format specifiers return fmt.Errorf("tidb upgrade %s error response %d %q URL: %s", op, res.StatusCode, bodyText, url)Same issue on line 399 (unexpected response). 2. Missing unit test for stale/mismatched annotation recovery path
3. Annotation constants duplicated in e2e test without sync guard
4.
|
Classifying a downgrade pair (e.g. v7.5.0 -> v7.4.0) as switch-controlled would incorrectly call /upgrade/start before rolling to the older image. Guard against this by returning Unsupported when target <= source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix error format string order in tidb_control.go: put HTTP status before body text (%d %q) to match HTTP convention - Add unit test for stale annotation recovery path in ensureSmoothUpgradeStarted - Add direct imageTag edge case tests (digest ref, no tag, empty tag) - Add cross-reference comment in e2e test to flag annotation key constants that must stay in sync with smooth_upgrade.go Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@tennix: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| if err != nil { | ||
| return err | ||
| } | ||
| if err := u.deps.TiDBControl.FinishUpgrade(tc, ordinal); err != nil { |
There was a problem hiding this comment.
This mismatch branch treats every active annotation that does not match the current old/new StatefulSet image pair as stale and calls /upgrade/finish immediately. That is unsafe if the user changes spec.version while a smooth-upgrade rolling update is still in progress: the annotation may describe the first target, while newSet now describes the second target. This code would finish and clear the pause before the first rolling update has completed, briefly allowing user DDL in a mixed-version cluster before the next reconcile starts the new pause.
Please distinguish stale state from an active pause for an in-progress previous target. For example, keep the existing annotation and block/defer the new version change until the annotated target has finished, or only run this recovery after verifying the cluster is not still rolling.
- HuaxiClaw
fgksgf
left a comment
There was a problem hiding this comment.
LGTM. The downgrade/rollback classification issue has been addressed, and the duplicate /upgrade/start recovery path is acceptable given TiDB's idempotent duplicate-start response and the retrying annotation patch.
Validated locally:
git diff --check
go test ./pkg/controller ./pkg/manager/member ./tests/e2e/util/proxiedtidbclient -count=1
I also checked the current failing Prow jobs; the failures are Docker Hub 429 pull-rate-limit errors while building e2e images, not caused by this PR.
- HuaxiClaw
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fgksgf The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/cherry-pick release-1.6 |
|
@tennix: new pull request created to branch DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
What is changed
This PR adds TiDB smooth-upgrade coordination to TiDB Operator v1 classic TiDB upgrades on
release-1.x:POST /upgrade/startbefore eligible TiDB rolling version upgrades begin200start response before allowing TiDB pods to rollTidbClusterannotationsPOST /upgrade/finishafter the upgraded TiDB StatefulSet is fully updated, Normal, and members are healthyDesign notes
The implementation follows
docs/design-proposals/2026-05-14-pause-ddl-during-tidb-upgrade.md.Important behavior:
/upgrade/startuses a 30s client timeout because TiDB waits up to 10s for the DDL owner to sync upgrading state before returning success.success!and duplicate-success responses.tidb.pingcap.com/smooth-upgrade-ddl-pausedtidb.pingcap.com/smooth-upgrade-source-versiontidb.pingcap.com/smooth-upgrade-target-versiontidb.pingcap.com/smooth-upgrade-started-at/upgrade/finish, deleting the annotation keys, and requeueing.Tests