Skip to content

branch-4.0: [fix](test) fix branch-4.0 P0 regression failures (build #198126)#64525

Open
morningman wants to merge 9 commits into
apache:branch-4.0from
morningman:tc-198126-backports
Open

branch-4.0: [fix](test) fix branch-4.0 P0 regression failures (build #198126)#64525
morningman wants to merge 9 commits into
apache:branch-4.0from
morningman:tc-198126-backports

Conversation

@morningman

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Stabilizes the branch-4.0 P0 regression pipeline by fixing test failures observed in
regression build #198126 (12 failed). This PR addresses 7 of them.

Backports of apache/master fixes

Test robustness / isolation fixes (the same defect also exists on apache/master)

  • temp_table_p0/test_temp_table: the replica-count detection was broken — the regex /:(\d+)/
    missed the space in the canonical tag.location.default: 3, and force_olap_table_replication_num
    was never consulted, so on a 3-replica cluster SHOW TABLETS returned 9 rows while the test
    expected 3. Detection is now whitespace-tolerant, sums the allocation per-tag counts (mirroring
    PropertyAnalyzer.analyzeReplicaAllocation precedence) and falls back to
    force_olap_table_replication_num, while still asserting the real replica count.
  • backup_restore/test_backup_restore_reset_index_id (+ test_backup_restore_inverted_idx and
    restore_p0/test_validate_restore_inverted_idx): cross-suite race on the process-global
    restore_reset_index_id FE config — a concurrent suite flipped it to false between this
    suite's BACKUP and RESTORE. The three suites that mutate this global are now nonConcurrent
    so they run serially. The FE reset logic itself is correct; this is purely test isolation.

Release note

None

Check List (For Committer)

  • Test

Check List (For Reviewer who provide ASF feature)

  • Performance regression

@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman morningman changed the title [fix](test) fix branch-4.0 P0 regression failures (build #198126) branch-4.0: [fix](test) fix branch-4.0 P0 regression failures (build #198126) Jun 15, 2026
@morningman

Copy link
Copy Markdown
Contributor Author

run buildall

sollhui and others added 9 commits June 15, 2026 22:59
### What problem does this PR solve?

When `INSERT INTO ... SELECT` is forwarded from a follower FE to the
master FE, `SHOW LOAD` could show an empty `JobDetails`, such as
`ScannedRows=0`, `LoadBytes=0`, `TaskNumber=0`, and empty backend lists.

The root cause is that the insert load job is registered with a real
`jobId`, but when coordinator creation falls back to the regular
`Coordinator` / `CloudCoordinator` path, that `jobId` was not passed
into the coordinator. Therefore, the coordinator kept the default
`jobId=-1` and did not initialize or update the corresponding
`LoadManager` progress. The load job was still recorded as `FINISHED`,
but its `LoadStatistic` remained empty when `SHOW LOAD` rendered
`JobDetails`.

This PR preserves the insert `jobId` in the regular `Coordinator` and
`CloudCoordinator` fallback paths, so `initJobProgress()` and
`updateJobProgress()` update the same `InsertLoadJob` that is later
recorded and displayed by `SHOW LOAD`.

(cherry picked from commit a53679b)
…3859)

Issue Number: None

Related PR: None

Problem Summary: The previous ANN index-only scan regression coverage
inferred whether source vector columns were skipped by comparing
ScanBytes from query profiles. That made the test hard to review and
could miss cases where both query shapes still read the source column.
Replace that coverage with a dedicated debug-point regression that
directly fails if the embedding column is read in index-only scenarios,
including a remapped reader-schema case where the source slot index
differs from the storage column id. Remove the old profile-based suites
and generated output.

None

- Test: Manual test
    - git diff --cached --check
- Regression test not run per request; an earlier attempt was blocked by
Maven writing to /Users/roanhe/.m2/repository under the sandbox
- Behavior changed: No
- Does this need documentation: No

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

None

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->

(cherry picked from commit 6e5198b)
…pache#64238)

The scanner_profile suite asserts that `actualRows` is backfilled onto the
PhysicalOlapScan node, but the regex `PhysicalOlapScan\[scanner_profile\]`
can never match: FE renders the node as `PhysicalOlapScan[<id>] ( table=
scanner_profile, ... actualRows=N )` (the bracket holds the numeric nereids
id, not the table name). matcher.find() is always false regardless of the
backfilled value, so the case fails deterministically.

Upstream apache/doris apache#64238 (952aede) fixes this regex to
`PhysicalOlapScan[^\n]*scanner_profile[^\n]*actualRows=(\d+)`. This is a
minimal port of ONLY that regex change. The rest of apache#64238 (a BE CPU-time
counter rename `MaxFindRecvrTime`->`MaxFindRecvrCpuTime` plus new
`TaskCpuTime`/`ScannerCpuTime` profile assertions and BE unit tests) is an
unrelated cosmetic cleanup; it is intentionally NOT backported to avoid
shipping a BE change and profile-naming assertions to a release branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…126)

test_temp_table asserted `3 * replicaNum == show_tablets_result.size()`,
deriving replicaNum by regex-parsing force_olap_table_replication_allocation
with `/:(\d+)/`. That detection is wrong on clusters that force replication=3:
the canonical allocation string `tag.location.default: 3` has a space after
the colon so the regex never matches (replicaNum stuck at 1), and the separate
force_olap_table_replication_num config is never consulted. On the 3-replica
TeamCity cluster SHOW TABLETS returned 9 rows (3 tablets x 3 replicas) while
the test expected 3.

Keep verifying the real replica count (so a wrong/missing replica is still
caught) but fix the detection:
- assert there are exactly 3 distinct tablets (3 partitions x 1 bucket), and
- assert total rows == 3 * replicaNum, where replicaNum is derived robustly and
  mirrors FE precedence (PropertyAnalyzer.analyzeReplicaAllocation): the
  force_olap_table_replication_allocation per-tag counts are summed first with a
  whitespace-tolerant regex `/:\s*(\d+)/`, falling back to
  force_olap_table_replication_num.

Same defect exists on apache/master; forward-port there too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dex_id (build #198126)

restore_reset_index_id is a single process-global, master-only, mutable FE
config read at RESTORE execution time (OlapTable.resetIdsForRestore). Three
suites mutate it:
- backup_restore/test_backup_restore_reset_index_id (sets true, then asserts
  the restored index id was reset to a new value)
- backup_restore/test_backup_restore_inverted_idx (sets false / true)
- restore_p0/test_validate_restore_inverted_idx (sets false / true)

Because the regression framework runs NORMAL suites in parallel, one suite can
flip the shared global config between another suite's BACKUP and RESTORE. In
build #198126 a concurrent suite set restore_reset_index_id=false after
test_backup_restore_reset_index_id had set it true, so that suite's RESTORE
took the else-branch, kept the original index id, and the line-142 assertion
failed (old id == new id).

Add these suites to the `nonConcurrent` group so they run serially in the
single-thread SINGLE phase and can no longer race on the shared global config.
The FE reset logic itself is correct; this is purely a test-isolation fix. Same
design exists on apache/master; forward-port there too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…only_scan debug_point cases

These 3 cases were cherry-picked from master (apache#63859). The session
variable enable_condition_cache (condition cache feature) does not exist
on branch-4.0, so 'set enable_condition_cache=false' fails with
'Unknown system variable'. Since the feature is absent, the branch is
already in the cache-disabled state the test expects, so removing the
line is a semantic no-op and restores the cases on this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@morningman morningman force-pushed the tc-198126-backports branch from c2254f3 to d58d717 Compare June 15, 2026 12:59
@morningman

Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen

Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants