Skip to content

Pause DDL during eligible TiDB upgrades#6904

Merged
ti-chi-bot[bot] merged 6 commits into
pingcap:release-1.xfrom
tennix:feature/release-1.x-pause-ddl
May 21, 2026
Merged

Pause DDL during eligible TiDB upgrades#6904
ti-chi-bot[bot] merged 6 commits into
pingcap:release-1.xfrom
tennix:feature/release-1.x-pause-ddl

Conversation

@tennix
Copy link
Copy Markdown
Member

@tennix tennix commented May 18, 2026

What is changed

This PR adds TiDB smooth-upgrade coordination to TiDB Operator v1 classic TiDB upgrades on release-1.x:

  • call TiDB POST /upgrade/start before eligible TiDB rolling version upgrades begin
  • wait for the 200 start response before allowing TiDB pods to roll
  • persist the active pause window in TidbCluster annotations
  • call POST /upgrade/finish after the upgraded TiDB StatefulSet is fully updated, Normal, and members are healthy
  • clear smooth-upgrade annotations only after finish succeeds
  • skip unsupported and auto-supported smooth-upgrade version pairs

Design notes

The implementation follows docs/design-proposals/2026-05-14-pause-ddl-during-tidb-upgrade.md.

Important behavior:

  • /upgrade/start uses a 30s client timeout because TiDB waits up to 10s for the DDL owner to sync upgrading state before returning success.
  • TiDB's success responses are JSON string bodies, so the client decodes them before matching success! and duplicate-success responses.
  • Active pause state is stored in annotations:
    • tidb.pingcap.com/smooth-upgrade-ddl-paused
    • tidb.pingcap.com/smooth-upgrade-source-version
    • tidb.pingcap.com/smooth-upgrade-target-version
    • tidb.pingcap.com/smooth-upgrade-started-at
  • Stale/mismatched annotations are handled by idempotently calling /upgrade/finish, deleting the annotation keys, and requeueing.

Tests

PASS git diff --check
PASS go test ./pkg/controller ./pkg/manager/member ./tests/e2e/util/proxiedtidbclient -count=1

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 18, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

tennix and others added 3 commits May 18, 2026 14:03
Document the release-1.x design for coordinating TiDB smooth-upgrade mode around TiDB StatefulSet rolling upgrades. The design records the TiDB HTTP API contract, version matrix, annotation-based pause state, and verification expectations before implementation starts.\n\nConstraint: release-1.x supports only classic TiDB kernel, so Premium/keyspace behavior is out of scope\nConstraint: omx team requires a clean leader worktree before launching worker worktrees\nRejected: Store active pause state only in status conditions | controller workflow state must survive status recalculation and restarts\nConfidence: high\nScope-risk: narrow\nTested: git diff --check for design proposal and plan\nNot-tested: implementation tests not run; design-only commit
Coordinate TiDB smooth-upgrade state around classic TiDB rolling version upgrades so user DDL is paused only for switch-controlled version pairs and resumed after all TiDB members are updated and healthy. The implementation keeps the rollout gate retryable with controller-owned annotations and makes the TiDB HTTP API client accept TiDB's JSON string responses.\n\nConstraint: release-1.x scope is classic TiDB only; Premium, keyspace, and TiCI behavior remain out of scope\nRejected: Call smooth-upgrade APIs for every version upgrade | unsupported and auto-supported version pairs must preserve existing rolling behavior\nRejected: Clear annotations by omitting keys from a merge patch | Kubernetes JSON merge patch requires null values to delete existing annotations\nConfidence: medium\nScope-risk: moderate\nDirective: Do not advance TiDB rolling partitions before StartUpgrade succeeds for switch-controlled version pairs\nTested: go test ./pkg/controller ./pkg/manager/member -count=1\nTested: go test ./tests/e2e/util/proxiedtidbclient -count=1\nNot-tested: golangci-lint run ./pkg/controller ./pkg/manager/member failed before linting because installed golangci-lint was built with go1.24 but repo targets go1.25.8
- avoid mutating shared http.Client in upgrade(); create a fresh client
  with the TLS transport preserved and the correct per-operation timeout
- patch only the four smooth-upgrade annotation keys instead of the full
  tc.Annotations map, preventing loss of user-set annotations on conflict
- fix chooseHealthyTiDBOrdinal to scan tc.Status.TiDB.Members by name
  instead of iterating 0..Replicas-1, correctly handling advanced
  StatefulSet delete slots with non-contiguous ordinals
- add TestTiDBUpgraderSmoothUpgradeActiveAnnotationSkipsDuplicate to
  cover the missing "active annotation avoids duplicate start" test case
- use any instead of interface{} in patch helpers (modernize)
- use %q consistently for body text in upgrade error messages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tennix tennix force-pushed the feature/release-1.x-pause-ddl branch from 0510155 to 7b4bace Compare May 18, 2026 21:04
@tennix tennix marked this pull request as ready for review May 18, 2026 21:05
@ti-chi-bot ti-chi-bot Bot requested a review from howardlau1999 May 18, 2026 21:05
Three tests covering the annotation lifecycle for smooth upgrade DDL pause:
- happy path: switch-controlled pair (v7.5.0→v7.5.3) sets annotations during
  rolling upgrade and clears them after finish
- unsupported pair (v6.5.10→v8.1.0): verifies no pause annotation is set
- operator restart: confirms finish still runs and annotations are cleared
  after the controller-manager pod is deleted mid-upgrade

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@fgksgf fgksgf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Changes requested

The overall smooth-upgrade flow is well covered for the happy path, but I found two recovery/correctness issues that can leave TiDB in upgrade mode unexpectedly: reverse/downgrade version changes can be classified as switch-controlled smooth upgrades, and /upgrade/start is called before the durable annotation marker is persisted.

Validation: the targeted test command from the PR description passes locally:

go test ./pkg/controller ./pkg/manager/member ./tests/e2e/util/proxiedtidbclient -count=1
  • HuaxiClaw

Comment thread pkg/manager/member/smooth_upgrade.go
Comment thread pkg/manager/member/smooth_upgrade.go
@fgksgf
Copy link
Copy Markdown
Member

fgksgf commented May 20, 2026

Additional issues

1. Error format string argument order (pkg/controller/tidb_control.go:397,399)

return fmt.Errorf("tidb upgrade %s error response %q:%v URL: %s", op, bodyText, res.StatusCode, url)

Format specifiers %q:%v map to bodyText, res.StatusCode, producing "body":400. HTTP convention is status before body. Suggested fix:

return fmt.Errorf("tidb upgrade %s error response %d %q URL: %s", op, res.StatusCode, bodyText, url)

Same issue on line 399 (unexpected response).

2. Missing unit test for stale/mismatched annotation recovery path

ensureSmoothUpgradeStarted (smooth_upgrade.go:715-729) handles stale annotations: finish stale upgrade, clear annotations, requeue. This is the most complex recovery scenario but has no unit test. TestTiDBUpgraderSmoothUpgradeStart only tests the happy path and the already-active-annotation skip.

3. Annotation constants duplicated in e2e test without sync guard

tests/e2e/tidbcluster/tidbcluster.go:3523-3527 re-declares the 3 annotation key constants from smooth_upgrade.go. If keys change in the implementation, e2e tests silently break against stale keys. Either export the constants or add a cross-reference comment in both locations.

4. imageTag() edge cases untested

imageTag() handles digest references (@sha256:...), missing tags, and empty tags. smooth_upgrade_test.go only tests it indirectly via detectTiDBVersionUpgrade with valid images. Missing direct coverage for:

  • pingcap/tidb@sha256:abc -> ""

  • pingcap/tidb (no tag) -> ""

  • pingcap/tidb: (empty tag) -> ""

  • HuaxiClaw

tennix and others added 2 commits May 20, 2026 15:35
Classifying a downgrade pair (e.g. v7.5.0 -> v7.4.0) as
switch-controlled would incorrectly call /upgrade/start before
rolling to the older image. Guard against this by returning
Unsupported when target <= source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix error format string order in tidb_control.go: put HTTP status
  before body text (%d %q) to match HTTP convention
- Add unit test for stale annotation recovery path in ensureSmoothUpgradeStarted
- Add direct imageTag edge case tests (digest ref, no tag, empty tag)
- Add cross-reference comment in e2e test to flag annotation key constants
  that must stay in sync with smooth_upgrade.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 21, 2026

@tennix: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-e2e-kind-across-kubernetes cce87a2 link false /test pull-e2e-kind-across-kubernetes
pull-e2e-kind-tidbcluster cce87a2 link false /test pull-e2e-kind-tidbcluster
pull-e2e-kind-basic e9ece76 link false /test pull-e2e-kind-basic
pull-e2e-kind-br e9ece76 link false /test pull-e2e-kind-br

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

if err != nil {
return err
}
if err := u.deps.TiDBControl.FinishUpgrade(tc, ordinal); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mismatch branch treats every active annotation that does not match the current old/new StatefulSet image pair as stale and calls /upgrade/finish immediately. That is unsafe if the user changes spec.version while a smooth-upgrade rolling update is still in progress: the annotation may describe the first target, while newSet now describes the second target. This code would finish and clear the pause before the first rolling update has completed, briefly allowing user DDL in a mixed-version cluster before the next reconcile starts the new pause.

Please distinguish stale state from an active pause for an in-progress previous target. For example, keep the existing annotation and block/defer the new version change until the annotated target has finished, or only run this recovery after verifying the cluster is not still rolling.

  • HuaxiClaw

Copy link
Copy Markdown
Member

@fgksgf fgksgf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The downgrade/rollback classification issue has been addressed, and the duplicate /upgrade/start recovery path is acceptable given TiDB's idempotent duplicate-start response and the retrying annotation patch.

Validated locally:

git diff --check
go test ./pkg/controller ./pkg/manager/member ./tests/e2e/util/proxiedtidbclient -count=1

I also checked the current failing Prow jobs; the failures are Docker Hub 429 pull-rate-limit errors while building e2e images, not caused by this PR.

  • HuaxiClaw

@ti-chi-bot ti-chi-bot Bot added the lgtm label May 21, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fgksgf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 21, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-20 04:34:58.14888854 +0000 UTC m=+325227.653019216: ✖️🔁 reset by fgksgf.
  • 2026-05-21 01:53:50.434279254 +0000 UTC m=+1961.075136057: ☑️ agreed by fgksgf.

@ti-chi-bot ti-chi-bot Bot added the approved label May 21, 2026
@ti-chi-bot ti-chi-bot Bot merged commit a7e85cb into pingcap:release-1.x May 21, 2026
11 of 15 checks passed
@tennix tennix deleted the feature/release-1.x-pause-ddl branch May 21, 2026 02:12
@tennix
Copy link
Copy Markdown
Member Author

tennix commented May 23, 2026

/cherry-pick release-1.6

@ti-chi-bot
Copy link
Copy Markdown
Member

@tennix: new pull request created to branch release-1.6: #6914.

Details

In response to this:

/cherry-pick release-1.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants