Skip to content

Pause DDL during eligible TiDB upgrades (#6904)#6914

Open
ti-chi-bot wants to merge 6 commits into
pingcap:release-1.6from
ti-chi-bot:cherry-pick-6904-to-release-1.6
Open

Pause DDL during eligible TiDB upgrades (#6904)#6914
ti-chi-bot wants to merge 6 commits into
pingcap:release-1.6from
ti-chi-bot:cherry-pick-6904-to-release-1.6

Conversation

@ti-chi-bot
Copy link
Copy Markdown
Member

This is an automated cherry-pick of #6904

What is changed

This PR adds TiDB smooth-upgrade coordination to TiDB Operator v1 classic TiDB upgrades on release-1.x:

  • call TiDB POST /upgrade/start before eligible TiDB rolling version upgrades begin
  • wait for the 200 start response before allowing TiDB pods to roll
  • persist the active pause window in TidbCluster annotations
  • call POST /upgrade/finish after the upgraded TiDB StatefulSet is fully updated, Normal, and members are healthy
  • clear smooth-upgrade annotations only after finish succeeds
  • skip unsupported and auto-supported smooth-upgrade version pairs

Design notes

The implementation follows docs/design-proposals/2026-05-14-pause-ddl-during-tidb-upgrade.md.

Important behavior:

  • /upgrade/start uses a 30s client timeout because TiDB waits up to 10s for the DDL owner to sync upgrading state before returning success.
  • TiDB's success responses are JSON string bodies, so the client decodes them before matching success! and duplicate-success responses.
  • Active pause state is stored in annotations:
    • tidb.pingcap.com/smooth-upgrade-ddl-paused
    • tidb.pingcap.com/smooth-upgrade-source-version
    • tidb.pingcap.com/smooth-upgrade-target-version
    • tidb.pingcap.com/smooth-upgrade-started-at
  • Stale/mismatched annotations are handled by idempotently calling /upgrade/finish, deleting the annotation keys, and requeueing.

Tests

PASS git diff --check
PASS go test ./pkg/controller ./pkg/manager/member ./tests/e2e/util/proxiedtidbclient -count=1

tennix and others added 6 commits May 23, 2026 01:28
Document the release-1.x design for coordinating TiDB smooth-upgrade mode around TiDB StatefulSet rolling upgrades. The design records the TiDB HTTP API contract, version matrix, annotation-based pause state, and verification expectations before implementation starts.\n\nConstraint: release-1.x supports only classic TiDB kernel, so Premium/keyspace behavior is out of scope\nConstraint: omx team requires a clean leader worktree before launching worker worktrees\nRejected: Store active pause state only in status conditions | controller workflow state must survive status recalculation and restarts\nConfidence: high\nScope-risk: narrow\nTested: git diff --check for design proposal and plan\nNot-tested: implementation tests not run; design-only commit
Coordinate TiDB smooth-upgrade state around classic TiDB rolling version upgrades so user DDL is paused only for switch-controlled version pairs and resumed after all TiDB members are updated and healthy. The implementation keeps the rollout gate retryable with controller-owned annotations and makes the TiDB HTTP API client accept TiDB's JSON string responses.\n\nConstraint: release-1.x scope is classic TiDB only; Premium, keyspace, and TiCI behavior remain out of scope\nRejected: Call smooth-upgrade APIs for every version upgrade | unsupported and auto-supported version pairs must preserve existing rolling behavior\nRejected: Clear annotations by omitting keys from a merge patch | Kubernetes JSON merge patch requires null values to delete existing annotations\nConfidence: medium\nScope-risk: moderate\nDirective: Do not advance TiDB rolling partitions before StartUpgrade succeeds for switch-controlled version pairs\nTested: go test ./pkg/controller ./pkg/manager/member -count=1\nTested: go test ./tests/e2e/util/proxiedtidbclient -count=1\nNot-tested: golangci-lint run ./pkg/controller ./pkg/manager/member failed before linting because installed golangci-lint was built with go1.24 but repo targets go1.25.8
- avoid mutating shared http.Client in upgrade(); create a fresh client
  with the TLS transport preserved and the correct per-operation timeout
- patch only the four smooth-upgrade annotation keys instead of the full
  tc.Annotations map, preventing loss of user-set annotations on conflict
- fix chooseHealthyTiDBOrdinal to scan tc.Status.TiDB.Members by name
  instead of iterating 0..Replicas-1, correctly handling advanced
  StatefulSet delete slots with non-contiguous ordinals
- add TestTiDBUpgraderSmoothUpgradeActiveAnnotationSkipsDuplicate to
  cover the missing "active annotation avoids duplicate start" test case
- use any instead of interface{} in patch helpers (modernize)
- use %q consistently for body text in upgrade error messages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three tests covering the annotation lifecycle for smooth upgrade DDL pause:
- happy path: switch-controlled pair (v7.5.0→v7.5.3) sets annotations during
  rolling upgrade and clears them after finish
- unsupported pair (v6.5.10→v8.1.0): verifies no pause annotation is set
- operator restart: confirms finish still runs and annotations are cleared
  after the controller-manager pod is deleted mid-upgrade

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Classifying a downgrade pair (e.g. v7.5.0 -> v7.4.0) as
switch-controlled would incorrectly call /upgrade/start before
rolling to the older image. Guard against this by returning
Unsupported when target <= source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix error format string order in tidb_control.go: put HTTP status
  before body text (%d %q) to match HTTP convention
- Add unit test for stale annotation recovery path in ensureSmoothUpgradeStarted
- Add direct imageTag edge case tests (digest ref, no tag, empty tag)
- Add cross-reference comment in e2e test to flag annotation key constants
  that must stay in sync with smooth_upgrade.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tennix
Copy link
Copy Markdown
Member

tennix commented May 25, 2026

/test pull-e2e-kind-tidbcluster

@fgksgf
Copy link
Copy Markdown
Member

fgksgf commented May 25, 2026

/lgtm

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 25, 2026

@fgksgf: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fgksgf
Once this PR has been reviewed and has the lgtm label, please ask for approval from tennix. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liubog2008
Copy link
Copy Markdown
Member

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants