branch-4.0: [fix](test) fix branch-4.0 P0 regression failures (build #198126)#64525
Open
morningman wants to merge 9 commits into
Open
branch-4.0: [fix](test) fix branch-4.0 P0 regression failures (build #198126)#64525morningman wants to merge 9 commits into
morningman wants to merge 9 commits into
Conversation
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
Author
|
run buildall |
### What problem does this PR solve? When `INSERT INTO ... SELECT` is forwarded from a follower FE to the master FE, `SHOW LOAD` could show an empty `JobDetails`, such as `ScannedRows=0`, `LoadBytes=0`, `TaskNumber=0`, and empty backend lists. The root cause is that the insert load job is registered with a real `jobId`, but when coordinator creation falls back to the regular `Coordinator` / `CloudCoordinator` path, that `jobId` was not passed into the coordinator. Therefore, the coordinator kept the default `jobId=-1` and did not initialize or update the corresponding `LoadManager` progress. The load job was still recorded as `FINISHED`, but its `LoadStatistic` remained empty when `SHOW LOAD` rendered `JobDetails`. This PR preserves the insert `jobId` in the regular `Coordinator` and `CloudCoordinator` fallback paths, so `initJobProgress()` and `updateJobProgress()` update the same `InsertLoadJob` that is later recorded and displayed by `SHOW LOAD`. (cherry picked from commit a53679b)
…3859) Issue Number: None Related PR: None Problem Summary: The previous ANN index-only scan regression coverage inferred whether source vector columns were skipped by comparing ScanBytes from query profiles. That made the test hard to review and could miss cases where both query shapes still read the source column. Replace that coverage with a dedicated debug-point regression that directly fails if the embedding column is read in index-only scenarios, including a remapped reader-schema case where the source slot index differs from the storage column id. Remove the old profile-based suites and generated output. None - Test: Manual test - git diff --cached --check - Regression test not run per request; an earlier attempt was blocked by Maven writing to /Users/roanhe/.m2/repository under the sandbox - Behavior changed: No - Does this need documentation: No Issue Number: close #xxx Related PR: #xxx Problem Summary: None - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into --> (cherry picked from commit 6e5198b)
…pache#64238) The scanner_profile suite asserts that `actualRows` is backfilled onto the PhysicalOlapScan node, but the regex `PhysicalOlapScan\[scanner_profile\]` can never match: FE renders the node as `PhysicalOlapScan[<id>] ( table= scanner_profile, ... actualRows=N )` (the bracket holds the numeric nereids id, not the table name). matcher.find() is always false regardless of the backfilled value, so the case fails deterministically. Upstream apache/doris apache#64238 (952aede) fixes this regex to `PhysicalOlapScan[^\n]*scanner_profile[^\n]*actualRows=(\d+)`. This is a minimal port of ONLY that regex change. The rest of apache#64238 (a BE CPU-time counter rename `MaxFindRecvrTime`->`MaxFindRecvrCpuTime` plus new `TaskCpuTime`/`ScannerCpuTime` profile assertions and BE unit tests) is an unrelated cosmetic cleanup; it is intentionally NOT backported to avoid shipping a BE change and profile-naming assertions to a release branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…126) test_temp_table asserted `3 * replicaNum == show_tablets_result.size()`, deriving replicaNum by regex-parsing force_olap_table_replication_allocation with `/:(\d+)/`. That detection is wrong on clusters that force replication=3: the canonical allocation string `tag.location.default: 3` has a space after the colon so the regex never matches (replicaNum stuck at 1), and the separate force_olap_table_replication_num config is never consulted. On the 3-replica TeamCity cluster SHOW TABLETS returned 9 rows (3 tablets x 3 replicas) while the test expected 3. Keep verifying the real replica count (so a wrong/missing replica is still caught) but fix the detection: - assert there are exactly 3 distinct tablets (3 partitions x 1 bucket), and - assert total rows == 3 * replicaNum, where replicaNum is derived robustly and mirrors FE precedence (PropertyAnalyzer.analyzeReplicaAllocation): the force_olap_table_replication_allocation per-tag counts are summed first with a whitespace-tolerant regex `/:\s*(\d+)/`, falling back to force_olap_table_replication_num. Same defect exists on apache/master; forward-port there too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dex_id (build #198126) restore_reset_index_id is a single process-global, master-only, mutable FE config read at RESTORE execution time (OlapTable.resetIdsForRestore). Three suites mutate it: - backup_restore/test_backup_restore_reset_index_id (sets true, then asserts the restored index id was reset to a new value) - backup_restore/test_backup_restore_inverted_idx (sets false / true) - restore_p0/test_validate_restore_inverted_idx (sets false / true) Because the regression framework runs NORMAL suites in parallel, one suite can flip the shared global config between another suite's BACKUP and RESTORE. In build #198126 a concurrent suite set restore_reset_index_id=false after test_backup_restore_reset_index_id had set it true, so that suite's RESTORE took the else-branch, kept the original index id, and the line-142 assertion failed (old id == new id). Add these suites to the `nonConcurrent` group so they run serially in the single-thread SINGLE phase and can no longer race on the shared global config. The FE reset logic itself is correct; this is purely a test-isolation fix. Same design exists on apache/master; forward-port there too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 8a076ce)
…only_scan debug_point cases These 3 cases were cherry-picked from master (apache#63859). The session variable enable_condition_cache (condition cache feature) does not exist on branch-4.0, so 'set enable_condition_cache=false' fails with 'Unknown system variable'. Since the feature is absent, the branch is already in the cache-disabled state the test expects, so removing the line is a semantic no-op and restores the cases on this branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
c2254f3 to
d58d717
Compare
Contributor
Author
|
run buildall |
Contributor
FE UT Coverage ReportIncrement line coverage |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Stabilizes the
branch-4.0P0 regression pipeline by fixing test failures observed inregression build #198126 (12 failed). This PR addresses 7 of them.
Backports of apache/master fixes
[fix](load) fix empty statistics for forwarded INSERT— real product fix: aforwarded
INSERT ... SELECTreported emptyJobDetails(ScannedRows=0) because thenon-Nereids-distribute fallback
Coordinatorwas built without thejobId. Fixesload_p0/insert/test_insert_statistic.[test](regression) Add debug point ANN index-only scan test— replaces thefragile
ScanBytes-profile-based ANN suites (an unreliable index-only proxy) withdebug-point suites. Fixes
ann_index_p0/ann_index_only_scan_distance_exprandann_index_p0/ann_index_only_scan_metric_direction(both were new failures).[fix](case) fix insert_group_commit_into_max_filter_ratio— cherry-pick of the(currently open) master PR; aligns the case with the
enable_insert_strict/enable_strict_castdecoupling ([fix](insert) enable_insert_strict should not affect semantic of enable_strict_cast #63794). Fixesinsert_p0/insert_group_commit_into_max_filter_ratio.scanner_profileactualRowsregex fix (PhysicalOlapScan[id]is rendered with the numeric nereids id, not the table name). The BE CPU-counter rename in
[fix](be) Rename CPU time profile counter #64238 is intentionally NOT backported to a release branch. Fixes
query_profile/scanner_profile.Test robustness / isolation fixes (the same defect also exists on apache/master)
temp_table_p0/test_temp_table: the replica-count detection was broken — the regex/:(\d+)/missed the space in the canonical
tag.location.default: 3, andforce_olap_table_replication_numwas never consulted, so on a 3-replica cluster
SHOW TABLETSreturned 9 rows while the testexpected 3. Detection is now whitespace-tolerant, sums the allocation per-tag counts (mirroring
PropertyAnalyzer.analyzeReplicaAllocationprecedence) and falls back toforce_olap_table_replication_num, while still asserting the real replica count.backup_restore/test_backup_restore_reset_index_id(+test_backup_restore_inverted_idxandrestore_p0/test_validate_restore_inverted_idx): cross-suite race on the process-globalrestore_reset_index_idFE config — a concurrent suite flipped it tofalsebetween thissuite's BACKUP and RESTORE. The three suites that mutate this global are now
nonConcurrentso they run serially. The FE reset logic itself is correct; this is purely test isolation.
Release note
None
Check List (For Committer)
Check List (For Reviewer who provide ASF feature)