feat: introduce table scan, split and plan #99

Open
lszskye wants to merge 1 commit into apache:main from lszskye:p12-5

lszskye commented Jun 18, 2026

Contributor

Introduce Table Source Module: Scan, Split and Plan

Summary

Add the table/source module, which provides the public API and core implementation for table scan, split generation, and plan creation. This module is the entry point for batch and streaming reads. Its split serialization and deserialization are compatible with Java Paimon, so splits can be exchanged across languages.

New Classes

Public API (include/paimon/table/source/)

Split — Base class for input splits used by batch computation engines. Supports binary serialization and deserialization compatible with the Java version, enabling cross-process and cross-language split transmission. A Split can be either a DataSplit (for direct data file reads) or an IndexedSplit (for reads leveraging global indexes).

DataSplit — Extends Split for direct data file reads. Contains a SimpleDataFileMeta for each file, describing its path, size, row count, sequence numbers, schema, level, creation time, and optional deleted-row count. Provides a file list accessor for append table reads.

Plan — Interface representing the result of a TableScan. Exposes the generated splits and the associated snapshot ID (or nullopt for empty tables).

TableScan — Scanner interface that reads table metadata and produces a Plan. Created from a ScanContext, it serves as the primary entry point for initiating table scan operations in both batch and streaming modes.

TableRead — Given a Split or a list of Splits, creates BatchReader instances for reading data. Manages memory allocation through a shared MemoryPool.

StartupMode — Specifies where a scan starts reading. Supports Default, LatestFull, Latest, FromSnapshot, FromSnapshotFull, and FromTimestamp modes, each with different semantics for batch and streaming sources. Provides conversion to and parsing from strings.
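
The string conversion and parsing described for StartupMode could look roughly like the sketch below. It is illustrative only: the `sketch` namespace, the `ToString` and `FromString` names, and the lowercase hyphenated spellings (modeled on the values of Java Paimon's `scan.mode` option) are assumptions, not the signatures added by this PR.

```cpp
#include <cassert>
#include <optional>
#include <string>

namespace sketch {

// Mirrors the modes listed in the PR description.
enum class StartupMode {
    Default, LatestFull, Latest, FromSnapshot, FromSnapshotFull, FromTimestamp
};

// Assumed spellings. A single table drives both directions so they cannot drift apart.
struct ModeName {
    StartupMode mode;
    const char* name;
};

const ModeName kModeNames[] = {
    {StartupMode::Default, "default"},
    {StartupMode::LatestFull, "latest-full"},
    {StartupMode::Latest, "latest"},
    {StartupMode::FromSnapshot, "from-snapshot"},
    {StartupMode::FromSnapshotFull, "from-snapshot-full"},
    {StartupMode::FromTimestamp, "from-timestamp"},
};

// Returns the textual form of a mode.
inline std::string ToString(StartupMode mode) {
    for (const ModeName& entry : kModeNames) {
        if (entry.mode == mode) return entry.name;
    }
    assert(false && "every StartupMode must have an entry in kModeNames");
    return "unknown";
}

// Parses a textual mode; std::nullopt signals an unrecognized value.
inline std::optional<StartupMode> FromString(const std::string& text) {
    for (const ModeName& entry : kModeNames) {
        if (text == entry.name) return entry.mode;
    }
    return std::nullopt;
}

}  // namespace sketch
```

Returning `std::optional` from the parser keeps the "unknown mode" case explicit at the call site; the real module may well report it through a status or result type instead.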

Internal Implementation (src/paimon/core/table/source/)

AbstractTableScan — Abstract base class layered on top of FileStoreScan that provides the input split generation logic. Implements CreateStartingScanner, which selects a StartingScanner implementation based on the configured StartupMode and handles snapshot lookup, timestamp-based resolution, and tag-based scanning.

DataSplitImpl — Concrete implementation of DataSplit with full serialization support. Tracks partition, bucket, file metadata, deletion files, streaming flag, and raw-convertibility. Includes a Builder pattern for construction and supports multiple DataFileMeta serializer versions (v9, v10, v12, legacy).

SplitGenerator — Generates split groups from DataFileMeta lists. Produces SplitGroups that distinguish between raw-convertible groups (directly readable without merge) and non-raw-convertible groups (requiring merge-tree processing). Provides separate entry points for batch and streaming split generation.
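
To make the raw-convertible distinction concrete, here is a deliberately simplified sketch. It assumes that files whose key ranges overlap must be merged together, and that a group holding a single file can be read directly. `FileMeta`, `SplitGroup`, and `GroupByKeyOverlap` are hypothetical stand-ins; the actual SplitGenerator works on DataFileMeta and may apply further rules (for example file levels or target split sizes) that are not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical, reduced stand-in for DataFileMeta: a name and an inclusive key range.
struct FileMeta {
    std::string name;
    int min_key;
    int max_key;
};

// A group of files that must be read together.
struct SplitGroup {
    std::vector<FileMeta> files;
    bool raw_convertible;  // true when the group can be read without merging
};

// Groups files whose key ranges overlap (directly or transitively).
inline std::vector<SplitGroup> GroupByKeyOverlap(std::vector<FileMeta> files) {
    std::sort(files.begin(), files.end(),
              [](const FileMeta& a, const FileMeta& b) { return a.min_key < b.min_key; });

    std::vector<SplitGroup> groups;
    int group_max = 0;  // largest max_key seen in the current group
    for (const FileMeta& file : files) {
        assert(file.min_key <= file.max_key);
        if (groups.empty() || file.min_key > group_max) {
            // No overlap with the current group: start a new one.
            groups.push_back(SplitGroup{{}, true});
            group_max = file.max_key;
        } else {
            group_max = std::max(group_max, file.max_key);
        }
        groups.back().files.push_back(file);
    }

    // In this sketch, only single-file groups are treated as raw-convertible.
    for (SplitGroup& group : groups) {
        group.raw_convertible = group.files.size() == 1;
    }
    return groups;
}

}  // namespace sketch
```

With files covering keys [1, 5], [3, 8], and [10, 12], the first two land in one group that needs merging, while the third forms its own raw-convertible group.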

DeletionFile — Represents a deletion vector index file associated with a data file. Contains path, offset, length, and optional cardinality (number of deleted rows). Supports versioned serialization/deserialization (v3 and v4+) with list-level serialize/deserialize helpers.
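
The versioned handling of the optional cardinality can be sketched as follows. The byte layout here (big-endian 64-bit integers, a length-prefixed path, and -1 as the "no cardinality" sentinel) is an assumption chosen for illustration and is not claimed to match the Java-compatible wire format; the `sketch` namespace and the function names are likewise hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

namespace sketch {

struct DeletionFile {
    std::string path;
    int64_t offset;
    int64_t length;
    std::optional<int64_t> cardinality;  // number of deleted rows, if known
};

using Bytes = std::vector<uint8_t>;

// Appends a 64-bit integer in big-endian order.
inline void WriteInt64(Bytes& out, int64_t value) {
    const uint64_t bits = static_cast<uint64_t>(value);
    for (int shift = 56; shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(bits >> shift));
    }
}

// Reads a big-endian 64-bit integer and advances pos.
inline int64_t ReadInt64(const Bytes& in, size_t& pos) {
    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        bits = (bits << 8) | in.at(pos++);  // at() throws on truncated input
    }
    return static_cast<int64_t>(bits);
}

// Version 3 omits the cardinality; version 4 and later append it.
inline Bytes Serialize(const DeletionFile& file, int version) {
    assert(version >= 3);
    Bytes out;
    WriteInt64(out, static_cast<int64_t>(file.path.size()));
    out.insert(out.end(), file.path.begin(), file.path.end());
    WriteInt64(out, file.offset);
    WriteInt64(out, file.length);
    if (version >= 4) {
        WriteInt64(out, file.cardinality.value_or(-1));  // -1 means "unknown"
    }
    return out;
}

inline DeletionFile Deserialize(const Bytes& in, int version) {
    assert(version >= 3);
    size_t pos = 0;
    DeletionFile file;
    const size_t path_size = static_cast<size_t>(ReadInt64(in, pos));
    if (path_size > in.size() - pos) throw std::out_of_range("truncated path");
    file.path.assign(in.begin() + pos, in.begin() + pos + path_size);
    pos += path_size;
    file.offset = ReadInt64(in, pos);
    file.length = ReadInt64(in, pos);
    if (version >= 4) {
        const int64_t raw = ReadInt64(in, pos);
        if (raw >= 0) file.cardinality = raw;
    }
    return file;
}

}  // namespace sketch
```

Passing the version explicitly to both functions mirrors the idea that the reader must know which layout the writer used; a version 3 round trip necessarily drops the cardinality.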

ScanMode — Enum specifying which part of a snapshot to scan: ALL (complete data files) or DELTA (only newly changed files).

PlanImpl — Concrete implementation of the Plan interface. Holds a snapshot ID and a vector of splits, and provides a static EmptyPlan() factory method for empty results.
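
The relationship between Plan, PlanImpl, and the empty-plan factory might be sketched like this. Everything below lives in a hypothetical `sketch` namespace: the accessor names `SnapshotId` and `Splits`, the use of `std::shared_ptr`, and the placeholder `Split` type are assumptions, and only the `EmptyPlan()` name and the nullopt-for-empty-table behavior come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

namespace sketch {

// Placeholder for the real Split base class.
struct Split {
    virtual ~Split() = default;
};

// Result of a table scan: the splits plus the snapshot they were planned from.
class Plan {
public:
    virtual ~Plan() = default;
    // std::nullopt when the table is empty and no snapshot exists.
    virtual std::optional<int64_t> SnapshotId() const = 0;
    virtual const std::vector<std::shared_ptr<Split>>& Splits() const = 0;
};

class PlanImpl : public Plan {
public:
    PlanImpl(std::optional<int64_t> snapshot_id, std::vector<std::shared_ptr<Split>> splits)
        : snapshot_id_(snapshot_id), splits_(std::move(splits)) {}

    // Factory for a scan that found nothing to read.
    static std::shared_ptr<Plan> EmptyPlan() {
        return std::make_shared<PlanImpl>(std::nullopt,
                                          std::vector<std::shared_ptr<Split>>{});
    }

    std::optional<int64_t> SnapshotId() const override { return snapshot_id_; }
    const std::vector<std::shared_ptr<Split>>& Splits() const override { return splits_; }

private:
    std::optional<int64_t> snapshot_id_;
    std::vector<std::shared_ptr<Split>> splits_;
};

// Usage: an empty table yields a plan with no snapshot ID and no splits.
inline void DemoEmptyPlan() {
    std::shared_ptr<Plan> plan = PlanImpl::EmptyPlan();
    assert(!plan->SnapshotId().has_value());
    assert(plan->Splits().empty());
}

}  // namespace sketch
```

Callers can then treat "no snapshot" and "snapshot with zero splits" as distinct cases by checking the optional before iterating the splits.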

New Tests

  • deletion_file_test.cpp
  • split_generator_test.cpp
  • startup_mode_test.cpp
  • table_scan_test.cpp

@lxy-9602

Contributor

+1
