[refactor](table) Refactor table and file reader#63893
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#63893 Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema. ### Release note None ### Check List (For Author) - Test: Unit Test - Added BE UT cases in table_reader_test and parquet_reader_test. - Ran git diff --check. - Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17. - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: #63893 Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema. ### Release note None ### Check List (For Author) - Test: Unit Test - Added BE UT cases in table_reader_test and parquet_reader_test. - Ran git diff --check. - Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17. - Behavior changed: No - Does this need documentation: No ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
837cc56 to
475e48a
Compare
|
run buildall |
TPC-H: Total hot run time: 29107 ms |
FE Regression Coverage ReportIncrement line coverage |
TPC-DS: Total hot run time: 168765 ms |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29068 ms |
TPC-DS: Total hot run time: 171113 ms |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29390 ms |
FE Regression Coverage ReportIncrement line coverage |
TPC-DS: Total hot run time: 168294 ms |
|
run buildall |
1 similar comment
|
run buildall |
TPC-H: Total hot run time: 29426 ms |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 29178 ms |
TPC-DS: Total hot run time: 168644 ms |
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
I found a few blocking rollout and operational issues.
Critical checkpoint conclusions:
- Goal and scope: this PR adds the FileScannerV2/format_v2 reader stack with broad tests, but it also changes default behavior for supported external query scans and Iceberg batch planning.
- Data correctness: in the reviewed row/delete/aggregate paths, delete predicates and aggregate pushdown have the expected guards; I did not find an additional concrete data-loss issue there.
- Version and compatibility: thrift option/enum additions are append-only, but the FE default for
enable_file_scanner_v2conflicts with its documented and thrift defaults. - FE/BE variables:
enable_file_scanner_v2is forwarded to BE and currently enables V2 for all supported non-load scan ranges by default. - Parallel paths and rollback: the old
FileScannerremains available only when the session variable is false or the split is unsupported; default true removes rollback-by-default for supported scans. - Transactions and persistence: no transaction or storage write-path change was identified.
- Concurrency and lifecycle: no new shared mutable-state or scanner lifetime issue was identified in the reviewed paths.
- Performance and observability: the per-file warning-level debug log and the batch-mode default flip are problematic.
- Tests: substantial BE unit and regression coverage was added; I did not run full Doris test suites in this runner.
git diff --checkreports trailing whitespace in one generated.outfile. - Security: no security review was requested, and I am not making a security finding.
- User focus: no additional user-provided focus was present.
| @@ -1112,6 +1113,11 @@ public static double getHotValueThreshold() { | |||
| "FileScanNode 扫描数据的最大并发,默认为 16", "The max threads to read data of FileScanNode, default 16"}) | |||
| public int maxFileScannersConcurrency = 16; | |||
|
|
|||
| @VarAttrDef.VarAttr(name = ENABLE_FILE_SCANNER_V2, needForward = true, description = { | |||
| "开启后 FileScanNode 会在支持的查询场景使用 FileScannerV2,默认关闭", | |||
| "When enabled, FileScanNode uses FileScannerV2 for supported query scans. Disabled by default."}) | |||
There was a problem hiding this comment.
The variable is documented as disabled by default and the thrift field default is also false, but this initializes every FE session to true. toThrift() always forwards that value, and FileScanLocalState::_init_scanners selects FileScannerV2 whenever it is true and all splits are supported, so this makes the new reader stack the default for supported query scans. For a refactor of this size, that removes the intended rollback-by-default path. Please default this to false, or update the release/compatibility plan if default enablement is intentional.
| "When enabled, FileScanNode uses FileScannerV2 for supported query scans. Disabled by default."}) | |
| public boolean enableFileScannerV2 = false; |
| @@ -2873,10 +2879,9 @@ public static boolean isEagerAggregationOnJoin() { | |||
| public static final String ENABLE_MC_LIMIT_SPLIT_OPTIMIZATION = "enable_mc_limit_split_optimization"; | |||
| @VarAttrDef.VarAttr( | |||
| name = ENABLE_EXTERNAL_TABLE_BATCH_MODE, | |||
| fuzzy = true, | |||
| description = {"使能外表的 batch mode 功能", "Enable the batch mode function of the external table."}, | |||
| needForward = true) | |||
There was a problem hiding this comment.
This silently flips enable_external_table_batch_mode from true to false and removes it from fuzzy testing. IcebergScanNode.isBatchMode() returns false as soon as this session variable is false, so existing Iceberg scans stop using batch mode by default. That is a user-visible planner/performance change while the PR release note says None; please either keep the previous default and fuzzy coverage, or document the behavior change with dedicated tests.
| _data_reader.block_template.insert( | ||
| {column.type->create_column(), column.type, column.name}); | ||
| } | ||
| LOG(WARNING) << "TableReader debug: " << debug_string(); |
There was a problem hiding this comment.
This warning is emitted every time a file reader is opened, and debug_string() includes the projected schema, table filters, column predicates, conjunct debug strings, and full column mapping state. With V2 selected for supported scans, a normal external query over many files will write one warning per file and can flood BE logs. Please remove this or gate it behind debug/VLOG-level logging.
### What problem does this PR solve? Issue Number: close #xxx Related PR: #63893 Problem Summary: TeamCity external regression build 970191 still had several expected output files using old timestamp values. The new parquet timestamp semantics return the corrected values for the affected external table cases, including Hive, Iceberg, Paimon, and TVF parquet result files. This commit refreshes the corresponding regression expected outputs from the observed CI results and keeps unrelated non-timestamp failures untouched. ### Release note None ### Check List (For Author) - Test: Manual test - Compared TeamCity build 970191 failure details with the updated expected output files. Full regression test was not rerun locally. - Behavior changed: No - Does this need documentation: No
|
run buildall |
TPC-H: Total hot run time: 29516 ms |
TPC-DS: Total hot run time: 175874 ms |
ClickBench: Total hot run time: 25.86 s |
FE Regression Coverage ReportIncrement line coverage |
struct type should be use name mode with nested type
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: New parquet profile definitions and wiring were split across ParquetReader, ParquetScan, and column reader headers. This made ParquetReader own counter initialization, pruning counter updates, and scheduler sub-profile assembly directly even though parquet_profile.h already existed for profile-related types. This change centralizes the new parquet RuntimeProfile counter ownership in parquet_profile.h/.cpp and keeps ParquetReader responsible only for invoking the profile helper methods.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh for touched files.
- Ran git diff --check.
- Tried ./run-be-ut.sh --run '--filter=NewParquetReaderTest.*', but local CMake compiler detection failed before building Doris because /opt/homebrew/opt/llvm@16/bin/clang++ could not link a simple program: ld: library 'c++' not found.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Align format_v2 implementation namespaces with the format_v2 ownership boundary. Parquet, Hive, Paimon, Iceberg, and JDBC implementations now live under doris::format subnamespaces, while shared format_v2 expression helpers live under doris::format. Call sites and tests were updated to use the new namespace layout.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on the modified BE files
- Ran git diff --check
- Ran namespace residue scans for old doris::parquet/hive/paimon/jdbc/iceberg namespaces and duplicate format::format references
- Attempted targeted BE UT with ./run-be-ut.sh --run '--filter=NewParquetReaderTest.*:ParquetColumnReaderTest.*:TableReaderTest.*:CastTest.*:DeletePredicateTest.*:EqualityDeletePredicateTest.*', but local CMake compiler detection failed before Doris code compiled because /opt/homebrew/opt/llvm@16/bin/clang++ could not link libc++: ld: library 'c++' not found
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: #63893 Problem Summary: The new parquet reader reports timestamp values with the updated INT96 timestamp interpretation for existing external parquet coverage. This commit updates the affected regression expected outputs from the latest TeamCity P0 and external regression real outputs. Doris parquet export/write cases with suspicious timestamp offsets are intentionally excluded because those require separate writer-side analysis. ### Release note None ### Check List (For Author) - Test: Manual test - Validated modified expected rows against TeamCity builds 970619 and 970620 failure logs, and ran `git diff --check`. - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: #63893 Problem Summary: The new parquet reader did not map TIMESTAMP(NANOS) logical columns to a supported Doris timestamp type, and DATETIMEV2 decoded INT64 timestamp values only handled millis and micros. As a result Hive parquet timestamp nanos data was materialized as NULL instead of the expected timestamp values. This change maps parquet timestamp nanos to DATETIMEV2(6), decodes nanos by truncating to microseconds, and adds decoded-value coverage for DATETIMEV2 nanos. It also refreshes the external TVF group4 expected output for a parquet file containing BC timestamp values that Doris cannot represent, where the new reader correctly returns NULL for those rows. ### Release note None ### Check List (For Author) - Test: Manual test - Ran `git diff --check`. - Verified the relevant parquet files with DuckDB to confirm timestamp nanos and BC timestamp source values. - Attempted `./run-be-ut.sh --run '--filter=DataTypeSerDeDecodedValuesTest.*'`, but local CMake failed before compiling tests because the macOS toolchain cannot link a simple C++ program: `ld: library 'c++' not found`. - Behavior changed: Yes. The new parquet reader now reads TIMESTAMP(NANOS) values as DATETIMEV2(6) instead of producing NULL through unsupported conversion. - Does this need documentation: No
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 29030 ms |
TPC-DS: Total hot run time: 175389 ms |
ClickBench: Total hot run time: 25.33 s |
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Refine the new Parquet reader row group pruning flow so scan range filtering is applied before more expensive statistics, dictionary, bloom filter, and page index pruning. Also document the Parquet reader, scan scheduler, statistics pruning, and nested column reader APIs, and update affected namespace references in BE tests.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on touched BE C++ files and git diff --check locally.
- Started BE UT validation on Fedora with NewParquetReaderTest.* and ParquetBloomFilterPruningTest.*; fixed compile issues found during validation. Full rerun was interrupted before completion by follow-up history cleanup request.
- Behavior changed: No
- Does this need documentation: No
d4159b1 to
e3379ea
Compare
Rewrite comments for the entry-point and foundational modules: parquet_reader.h: - Class-level doc: role boundary, lifecycle (init→get_schema→open→get_block→close) - TableReader calling relationship explained - Each method and field annotated parquet_type.h: - ParquetExtraTypeInfo: each variant documented - ParquetTypeDescriptor: full field-by-field descriptions - Three-level resolution priority (logical→converted→physical) explained - resolve_parquet_type / supports_record_reader / decoded_value_kind docs parquet_column_schema.h: - Class-level doc: design decisions (wrapper folding, nullable, Dremel levels) - All fields grouped into sections (identifier / type / levels / children) - Each field annotated with its role and valid domain (PRIMITIVE vs complex) parquet_column_schema.cpp: - SchemaBuildContext fields annotated parquet_file_context.cpp: - DorisRandomAccessFile adapter class documented parquet_profile.h: - All Profile structs with section-based Chinese comments - Counter groups organized (RG pruning / page skip / batch read / column reader / file ops / decompress & cache / decode / others) Co-Authored-By: Claude <noreply@anthropic.com>
5291051 to
974f56b
Compare
|
run buildall |
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)