[regression-test](backup-restore) wait for colocate group to stabilize before asserting COLOCATE plan #64532

Open
shuke987 wants to merge 1 commit into apache:branch-4.1 from shuke987:fix-backup-restore-colocate-wait-stable

@shuke987

Collaborator

Problem

test_backup_restore_colocate_with_partition (in regression-test/suites/backup_restore/test_backup_restore_colocate.groovy) flakily fails right after a RESTORE:

Explain and check failed, expect contains 'COLOCATE', but actual explain string is:
  HAS_COLO_PLAN_NODE: false
  3:VHASH JOIN(331)
  |  join op: INNER JOIN(BROADCAST)[]
  TABLE: ..._db_new..._table1

(Reproduced deterministically in an isolated run on a 4-BE / force-3 cluster; the failure was in the restore-to-new-db case.)

Root cause: a timing bug in the test case, not a plan regression

After RESTORE the restored colocate group needs a moment to become stable, and the planner only emits a COLOCATE join once the group is stable (otherwise it falls back to BROADCAST/shuffle). The suite ran explain ... contains("COLOCATE") immediately after waitAllRestoreFinish, racing the stabilization. The existing checkColocateTabletHealth (a single-shot ColocateMismatchNum == 0 assert) sat after the assertion, so it didn't gate the explain.
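
The race described above can be illustrated with a toy model. This is only a sketch: `FakeColocateGroup`, `checks_until_stable`, and the simplified plan strings are hypothetical stand-ins, not Doris internals or real explain output.

```python
# Toy model only: every name and plan string here is a hypothetical stand-in.
class FakeColocateGroup:
    """A colocate group that reports stable only after a number of checks."""

    def __init__(self, checks_until_stable):
        self.remaining = checks_until_stable

    def is_stable(self):
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True


def explain(group):
    """Mimic the behavior described above: COLOCATE only once the group is stable."""
    return "INNER JOIN(COLOCATE)" if group.is_stable() else "INNER JOIN(BROADCAST)"


group = FakeColocateGroup(checks_until_stable=2)
first_plan = explain(group)   # taken right after the restore finishes
second_plan = explain(group)  # still too early in this model
third_plan = explain(group)   # taken once the group has settled
```

In this model an assertion on `first_plan` fails even though nothing is wrong with the planner, while the same assertion on `third_plan` passes.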

Fix

  • Add a bounded poll, waitColocatePlan(query) (60 × 1s), that waits until the explain plan actually contains COLOCATE, and call it before each contains("COLOCATE") assertion.
  • Turn checkColocateTabletHealth into a bounded poll as well, so the health check waits for stabilization instead of racing it.
  • Apply both changes symmetrically to both suites in the file; leave the notContains("COLOCATE") assertions untouched.
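
The bounded-poll pattern behind these changes can be sketched as follows. The actual helpers are Groovy functions in the suite and are not shown on this page, so this is a generic illustration: `wait_for`, `check_colocate_plan`, and the injected `explain_fn` and `sleep` parameters are hypothetical names, with the 60 × 1s budget taken from the description.

```python
import time


def wait_for(condition, attempts=60, interval_s=1.0, sleep=time.sleep):
    """Poll `condition` up to `attempts` times; return True as soon as it holds."""
    for attempt in range(attempts):
        if condition():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)  # no sleep after the final attempt
    return False


def check_colocate_plan(explain_fn, **poll_kwargs):
    """Wait for COLOCATE to appear, then run the unchanged assertion.

    `explain_fn` is a hypothetical callable returning the explain string.
    """
    wait_for(lambda: "COLOCATE" in explain_fn(), **poll_kwargs)
    plan = explain_fn()
    assert "COLOCATE" in plan, (
        f"expect contains 'COLOCATE', but actual explain string is: {plan}"
    )
```

Because the poll only delays the check and the assertion runs regardless of the poll's result, a plan that never becomes COLOCATE still raises `AssertionError` once the attempts are exhausted. Passing `sleep=lambda _: None` makes the helper easy to exercise without real delays.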

If COLOCATE never appears within the timeout, the assertion still fails, so a genuine regression would not be masked.

Verification

Run isolated on the cluster where the original case failed deterministically: the fixed case passes (Test 1 suites, failed 0 suites).

…e before asserting COLOCATE plan

test_backup_restore_colocate_with_partition flakily fails right after a
RESTORE with `expect contains 'COLOCATE', but actual explain string is ...
INNER JOIN(BROADCAST) / HAS_COLO_PLAN_NODE: false`.

After RESTORE the restored colocate group needs time to become stable, and
the planner only emits a COLOCATE join once the group is stable. The suite
ran `explain ... contains("COLOCATE")` immediately after waitAllRestoreFinish,
racing the stabilization, while checkColocateTabletHealth (a single-shot
ColocateMismatchNum==0 assert) sat after the assertion.

Add a bounded poll (waitColocatePlan) that waits for the explain plan to
contain COLOCATE before each contains("COLOCATE") assertion, and make
checkColocateTabletHealth a bounded poll too. If COLOCATE never shows up
within the timeout the assertion still fails, so no buggy behavior is masked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shuke987 shuke987 requested a review from yiguolei as a code owner June 15, 2026 09:54
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@shuke987

Collaborator Author

run buildall
