[regression-test](backup-restore) wait for colocate group to stabilize before asserting COLOCATE plan #64532
Open
shuke987 wants to merge 1 commit into
…e before asserting COLOCATE plan
test_backup_restore_colocate_with_partition flakily fails right after a
RESTORE with `expect contains 'COLOCATE', but actual explain string is ...
INNER JOIN(BROADCAST) / HAS_COLO_PLAN_NODE: false`.
After RESTORE the restored colocate group needs time to become stable, and
the planner only emits a COLOCATE join once the group is stable. The suite
ran `explain ... contains("COLOCATE")` immediately after waitAllRestoreFinish,
racing the stabilization, while checkColocateTabletHealth (a single-shot
ColocateMismatchNum==0 assert) sat after the assertion.
Add a bounded poll (waitColocatePlan) that waits for the explain plan to
contain COLOCATE before each contains("COLOCATE") assertion, and make
checkColocateTabletHealth a bounded poll too. If COLOCATE never shows up
within the timeout the assertion still fails, so no buggy behavior is masked.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
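The bounded poll described in the commit message can be sketched as follows. This is a minimal illustration in Python rather than the suite's Groovy, and it is not the PR's actual code: `wait_colocate_plan`, `make_fake_explain`, the injectable `sleep` parameter, and the `INNER JOIN(COLOCATE)` plan text are assumptions made for the example. Only the idea of polling the explain output for `COLOCATE` within a fixed budget comes from the PR.

```python
import time
from typing import Callable


def wait_colocate_plan(
    explain: Callable[[], str],
    attempts: int = 60,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `explain` until the plan text contains COLOCATE or attempts run out.

    Returns the last plan text seen. The caller still asserts on it, so a plan
    that never becomes COLOCATE fails the test just as it did before.
    """
    plan = ""
    for attempt in range(attempts):
        plan = explain()
        if "COLOCATE" in plan:
            break
        # Do not sleep after the final attempt; the budget is already spent.
        if attempt < attempts - 1:
            sleep(interval_s)
    return plan


def make_fake_explain(unstable_calls: int) -> Callable[[], str]:
    """Hypothetical stand-in for EXPLAIN against a cluster.

    The first `unstable_calls` calls report a BROADCAST join (group not yet
    stable); later calls report a made-up COLOCATE plan line.
    """
    calls = {"n": 0}

    def explain() -> str:
        calls["n"] += 1
        if calls["n"] <= unstable_calls:
            return "INNER JOIN(BROADCAST)"
        return "INNER JOIN(COLOCATE)"

    return explain


if __name__ == "__main__":
    # Poll without real sleeping, then assert exactly as the suite would.
    plan = wait_colocate_plan(make_fake_explain(3), sleep=lambda s: None)
    assert "COLOCATE" in plan
```

Returning the last plan instead of raising inside the helper keeps the original `contains("COLOCATE")` assertion as the single point of failure, which is what prevents the wait from hiding a real planner regression.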
Collaborator, Author: run buildall
Problem
`test_backup_restore_colocate_with_partition` (in `regression-test/suites/backup_restore/test_backup_restore_colocate.groovy`) flakily fails right after a `RESTORE` with `expect contains 'COLOCATE', but actual explain string is ... INNER JOIN(BROADCAST) / HAS_COLO_PLAN_NODE: false`. The failure was reproduced deterministically in an isolated run on a 4-BE / force-3 cluster, on the restore-to-new-db case.
Root cause: a test-case timing bug, not a plan regression
After `RESTORE`, the restored colocate group needs a moment to become stable, and the planner only emits a `COLOCATE` join once the group is stable (otherwise it falls back to `BROADCAST`/shuffle). The suite ran `explain ... contains("COLOCATE")` immediately after `waitAllRestoreFinish`, racing the stabilization. The existing `checkColocateTabletHealth` (a single-shot `ColocateMismatchNum == 0` assert) sat after the assertion, so it didn't gate the explain.
Fix
- Add a bounded poll `waitColocatePlan(query)` (60 × 1s) that waits for the explain plan to actually contain `COLOCATE`, and call it before each `contains("COLOCATE")` assertion.
- Turn `checkColocateTabletHealth` into a bounded poll as well, so the health check waits for stabilization instead of racing it.
- `notContains("COLOCATE")` assertions are untouched.
If `COLOCATE` never appears within the timeout, the assertion still fails, so a genuine regression would not be masked.
Verification
Run isolated on the cluster where the original case failed deterministically: the fixed case passes (`Test 1 suites, failed 0 suites`).
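Turning the single-shot `ColocateMismatchNum == 0` assert into a bounded poll might look roughly like the sketch below. It is written in Python for illustration, not taken from the suite's Groovy code: the function name `check_colocate_tablet_health`, the `mismatch_num` callback, the injectable `sleep`, and the default budget of 60 checks at 1 s intervals are all assumptions, since the PR text does not state the health check's timeout.

```python
import time
from typing import Callable, Optional


def check_colocate_tablet_health(
    mismatch_num: Callable[[], int],
    attempts: int = 60,          # assumed budget; not stated in the PR
    interval_s: float = 1.0,     # assumed interval; not stated in the PR
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Wait until the mismatch count reaches 0, or fail after `attempts` checks.

    `mismatch_num` is a hypothetical callback that returns the current
    ColocateMismatchNum for the group under test.
    """
    last: Optional[int] = None
    for attempt in range(attempts):
        last = mismatch_num()
        if last == 0:
            return  # group is healthy; let the test continue
        if attempt < attempts - 1:
            sleep(interval_s)
    # The poll is bounded: an unhealthy group still fails the test.
    raise AssertionError(
        f"ColocateMismatchNum is still {last} after {attempts} checks"
    )


if __name__ == "__main__":
    # A group that stabilizes on the third check passes.
    readings = iter([2, 1, 0])
    check_colocate_tablet_health(lambda: next(readings), sleep=lambda s: None)

    # A group that never stabilizes still raises once the budget is spent.
    try:
        check_colocate_tablet_health(lambda: 1, attempts=3, sleep=lambda s: None)
    except AssertionError as err:
        print("failed as expected:", err)
```

The second call in the usage section is the point of the design: the poll only absorbs the stabilization delay, and a mismatch that persists past the budget surfaces as the same kind of assertion failure as before.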