feat: introduce table scan, split and plan #99
Open
lszskye wants to merge 1 commit into
Introduce Table Source Module: Scan, Split and Plan
Summary
Add the table/source module, which provides the public API and core implementation for table scan, split generation, and plan creation. This module serves as the entry point for batch and streaming read operations, supporting split serialization/deserialization compatible with Java Paimon for cross-language interoperability.

New Classes

Public API (include/paimon/table/source/)
- Split: Base class for input splits used by batch computation engines. Supports binary serialization and deserialization compatible with the Java version, enabling cross-process and cross-language split transmission. A Split can be either a DataSplit (for direct data file reads) or an IndexedSplit (for reads leveraging global indexes).
- DataSplit: Extends Split for direct data file reading scenarios. Contains SimpleDataFileMeta describing each file's path, size, row count, sequence numbers, schema, level, creation time, and optional delete row count. Provides a file list accessor for append table reads.
- Plan: Interface representing the result of a TableScan. Exposes the generated splits and the associated snapshot ID (or nullopt for empty tables).
- TableScan: Scanner interface that reads table metadata and produces a Plan. Created from a ScanContext, it serves as the primary entry point for initiating table scan operations in both batch and streaming modes.
- TableRead: Given a Split or a list of Splits, creates BatchReader instances for reading data. Manages memory allocation through a shared MemoryPool.
- StartupMode: Specifies the startup mode. Supports Default, LatestFull, Latest, FromSnapshot, FromSnapshotFull, and FromTimestamp modes, each with different semantics for batch vs. streaming sources. Provides string conversion and parsing.

Internal Implementation (src/paimon/core/table/source/)
- AbstractTableScan: Abstract base class above FileStoreScan that provides input split generation logic. Implements the CreateStartingScanner method, which routes to different StartingScanner implementations based on the configured StartupMode, handling snapshot lookup, timestamp-based resolution, and tag-based scanning.
- DataSplitImpl: Concrete implementation of DataSplit with full serialization support. Tracks partition, bucket, file metadata, deletion files, streaming flag, and raw-convertibility. Includes a Builder pattern for construction and supports multiple DataFileMeta serializer versions (v9, v10, v12, legacy).
- SplitGenerator: Generates split groups from DataFileMeta lists. Produces SplitGroups that distinguish between raw-convertible groups (directly readable without merge) and non-raw-convertible groups (requiring merge-tree processing). Provides separate entry points for batch and streaming split generation.
- DeletionFile: Represents a deletion vector index file associated with a data file. Contains path, offset, length, and optional cardinality (number of deleted rows). Supports versioned serialization/deserialization (v3 and v4+) with list-level serialize/deserialize helpers.
- ScanMode: Enum specifying which part of a snapshot to scan: ALL (complete data files) or DELTA (only newly changed files).
- PlanImpl: Concrete implementation of the Plan interface. Holds a snapshot ID and a vector of splits, and provides a static EmptyPlan() factory method for empty results.

New Tests
- deletion_file_test.cpp
- split_generator_test.cpp
- startup_mode_test.cpp
- table_scan_test.cpp
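
For readers who want a feel for the API shape, here is a minimal, self-contained sketch of two behaviors described above: StartupMode string conversion and parsing, and a Plan whose snapshot ID is nullopt when the table is empty. It is illustrative only and is not the code in this PR. The function names ToString and FromString, the accessor names SnapshotId and Splits, the placeholder Split struct, and the lowercase hyphenated string spellings (assumed to follow the Java Paimon option values) are all assumptions.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical mirror of the StartupMode values listed in the PR description.
enum class StartupMode {
  Default, LatestFull, Latest, FromSnapshot, FromSnapshotFull, FromTimestamp
};

struct ModeName {
  StartupMode mode;
  std::string_view name;
};

// Assumed spellings; the real module may use different strings.
inline constexpr ModeName kModeNames[] = {
    {StartupMode::Default, "default"},
    {StartupMode::LatestFull, "latest-full"},
    {StartupMode::Latest, "latest"},
    {StartupMode::FromSnapshot, "from-snapshot"},
    {StartupMode::FromSnapshotFull, "from-snapshot-full"},
    {StartupMode::FromTimestamp, "from-timestamp"},
};

// Converts a mode to its string form via a linear table lookup.
std::string ToString(StartupMode mode) {
  for (const auto& entry : kModeNames) {
    if (entry.mode == mode) return std::string(entry.name);
  }
  return "unknown";
}

// Parses a string into a mode; returns nullopt for unrecognized input.
std::optional<StartupMode> FromString(std::string_view text) {
  for (const auto& entry : kModeNames) {
    if (entry.name == text) return entry.mode;
  }
  return std::nullopt;
}

// Placeholder for the real Split hierarchy (DataSplit, IndexedSplit).
struct Split {
  std::string description;
};

// Sketch of the Plan interface: splits plus an optional snapshot ID.
class Plan {
 public:
  virtual ~Plan() = default;
  virtual std::optional<int64_t> SnapshotId() const = 0;
  virtual const std::vector<std::shared_ptr<Split>>& Splits() const = 0;
};

// Sketch of PlanImpl with the static EmptyPlan() factory mentioned above.
class PlanImpl : public Plan {
 public:
  PlanImpl(std::optional<int64_t> snapshot_id,
           std::vector<std::shared_ptr<Split>> splits)
      : snapshot_id_(snapshot_id), splits_(std::move(splits)) {}

  // An empty plan carries no snapshot ID and no splits.
  static std::shared_ptr<Plan> EmptyPlan() {
    return std::make_shared<PlanImpl>(std::nullopt,
                                      std::vector<std::shared_ptr<Split>>{});
  }

  std::optional<int64_t> SnapshotId() const override { return snapshot_id_; }
  const std::vector<std::shared_ptr<Split>>& Splits() const override {
    return splits_;
  }

 private:
  std::optional<int64_t> snapshot_id_;
  std::vector<std::shared_ptr<Split>> splits_;
};
```

Under these assumptions, a caller would check SnapshotId().has_value() before treating a plan as tied to a snapshot, and would handle a nullopt result from FromString as an invalid configuration value rather than falling back silently to Default.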