feat: introduce table scan, split and plan by lszskye · Pull Request #99 · apache/paimon-cpp

lszskye · 2026-06-18T03:13:56Z

Introduce Table Source Module: Scan, Split and Plan

Summary

Add the `table/source` module, which provides the public API and core implementation for table scan, split generation, and plan creation. The module is the entry point for batch and streaming reads, and its split serialization and deserialization are compatible with Java Paimon for cross-language interoperability.

New Classes

Public API (`include/paimon/table/source/`)

Split — Base class for input splits used by batch computation engines. Supports binary serialization and deserialization compatible with the Java version, enabling cross-process and cross-language split transmission. A Split can be either a DataSplit (for direct data file reads) or an IndexedSplit (for reads leveraging global indexes).

DataSplit — Extends Split for direct data file reading scenarios. Contains SimpleDataFileMeta describing each file's path, size, row count, sequence numbers, schema, level, creation time, and optional delete row count. Provides a file list accessor for append table reads.

Plan — Interface representing the result of a TableScan. Exposes the generated splits and the associated snapshot ID (or nullopt for empty tables).

TableScan — Scanner interface that reads table metadata and produces a Plan. Created from a ScanContext, it serves as the primary entry point for initiating table scan operations in both batch and streaming modes.

TableRead — Given a Split or a list of Splits, creates BatchReader instances for reading data. Manages memory allocation through a shared MemoryPool.

StartupMode — Specifies the startup mode. Supports Default, LatestFull, Latest, FromSnapshot, FromSnapshotFull, and FromTimestamp modes, each with different semantics for batch vs. streaming sources. Provides string conversion and parsing.
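To illustrate the string conversion and parsing mentioned for StartupMode, here is a minimal sketch. The function names `StartupModeToString` and `StartupModeFromString` are hypothetical, and the kebab-case strings are assumed to mirror the `scan.mode` option values of Java Paimon; the header added by this PR may differ on both points.

```cpp
#include <array>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Stand-in for the enum described above; the real declaration lives in the PR.
enum class StartupMode {
  Default, LatestFull, Latest, FromSnapshot, FromSnapshotFull, FromTimestamp
};

// Assumed string forms (modeled on Java Paimon's `scan.mode` values).
constexpr std::array<std::pair<StartupMode, std::string_view>, 6> kModeNames = {{
    {StartupMode::Default, "default"},
    {StartupMode::LatestFull, "latest-full"},
    {StartupMode::Latest, "latest"},
    {StartupMode::FromSnapshot, "from-snapshot"},
    {StartupMode::FromSnapshotFull, "from-snapshot-full"},
    {StartupMode::FromTimestamp, "from-timestamp"},
}};

// Hypothetical helper: enum value to its string form.
std::string StartupModeToString(StartupMode mode) {
  for (const auto& [candidate, name] : kModeNames) {
    if (candidate == mode) return std::string(name);
  }
  return "unknown";
}

// Hypothetical helper: parse a string, returning nullopt for unknown input.
std::optional<StartupMode> StartupModeFromString(std::string_view text) {
  for (const auto& [candidate, name] : kModeNames) {
    if (name == text) return candidate;
  }
  return std::nullopt;
}
```

Keeping both directions on one lookup table means a newly added mode cannot be printable but unparsable, or the reverse.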

Internal Implementation (`src/paimon/core/table/source/`)

AbstractTableScan — Abstract base class layered on top of FileStoreScan that provides the input split generation logic. Its CreateStartingScanner method selects a StartingScanner implementation based on the configured StartupMode, covering snapshot lookup, timestamp-based resolution, and tag-based scanning.

DataSplitImpl — Concrete implementation of DataSplit with full serialization support. Tracks partition, bucket, file metadata, deletion files, streaming flag, and raw-convertibility. Includes a Builder pattern for construction and supports multiple DataFileMeta serializer versions (v9, v10, v12, legacy).

SplitGenerator — Generates split groups from DataFileMeta lists. Produces SplitGroups that distinguish between raw-convertible groups (directly readable without merge) and non-raw-convertible groups (requiring merge-tree processing). Provides separate entry points for batch and streaming split generation.
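As a rough illustration of how data files can be packed into split groups, the sketch below shows the simplest case only: ordered bin-packing of append-style files, where every resulting group is assumed to be raw-convertible. The types `FileMeta` and `SplitGroup`, the function `PackForOrdered`, and the `target_split_size` and `open_file_cost` parameters are hypothetical stand-ins, not the signatures in this PR, and the key-range overlap analysis needed for merge-tree groups is not modeled.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for DataFileMeta: just a name and a size.
struct FileMeta {
  std::string name;
  int64_t size_bytes;
};

// A group of files that one split will read together.
struct SplitGroup {
  std::vector<FileMeta> files;
  bool raw_convertible;  // true: readable directly, no merge required
};

// Packs files, in their given order, into groups whose total weight stays
// within target_split_size. Each file weighs at least open_file_cost so that
// many tiny files do not end up in a single split.
std::vector<SplitGroup> PackForOrdered(const std::vector<FileMeta>& files,
                                       int64_t target_split_size,
                                       int64_t open_file_cost) {
  std::vector<SplitGroup> groups;
  std::vector<FileMeta> bin;
  int64_t bin_weight = 0;
  for (const auto& file : files) {
    const int64_t weight = std::max(file.size_bytes, open_file_cost);
    if (!bin.empty() && bin_weight + weight > target_split_size) {
      groups.push_back({std::move(bin), /*raw_convertible=*/true});
      bin.clear();  // reset the moved-from vector to a known empty state
      bin_weight = 0;
    }
    bin.push_back(file);
    bin_weight += weight;
  }
  if (!bin.empty()) groups.push_back({std::move(bin), /*raw_convertible=*/true});
  return groups;
}
```

A file larger than the target still gets a group of its own, since a bin is only flushed when it already holds at least one file.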

DeletionFile — Represents a deletion vector index file associated with a data file. Contains path, offset, length, and optional cardinality (number of deleted rows). Supports versioned serialization/deserialization (v3 and v4+) with list-level serialize/deserialize helpers.
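The versioned serialization described for DeletionFile can be pictured with the sketch below. It is not the wire format used by this PR or by Java Paimon: the byte layout (big-endian 64-bit integers, a length-prefixed path, and `-1` as the marker for a missing cardinality) is an assumption chosen for illustration, and `Serialize`, `Deserialize`, `PutInt64`, and `GetInt64` are hypothetical names. The only property taken from the description is that the older version (v3) carries no cardinality while v4 and later do.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the DeletionFile described above.
struct DeletionFile {
  std::string path;
  int64_t offset = 0;
  int64_t length = 0;
  std::optional<int64_t> cardinality;  // number of deleted rows, if known
};

using Bytes = std::vector<uint8_t>;

// Appends a 64-bit value in big-endian byte order (assumed layout).
void PutInt64(Bytes& out, int64_t value) {
  const auto bits = static_cast<uint64_t>(value);
  for (int shift = 56; shift >= 0; shift -= 8) {
    out.push_back(static_cast<uint8_t>(bits >> shift));
  }
}

// Reads a big-endian 64-bit value and advances pos.
int64_t GetInt64(const Bytes& in, std::size_t& pos) {
  if (pos + 8 > in.size()) throw std::out_of_range("truncated input");
  uint64_t bits = 0;
  for (int i = 0; i < 8; ++i) bits = (bits << 8) | in[pos++];
  return static_cast<int64_t>(bits);
}

Bytes Serialize(const DeletionFile& file, int version) {
  Bytes out;
  PutInt64(out, static_cast<int64_t>(file.path.size()));
  out.insert(out.end(), file.path.begin(), file.path.end());
  PutInt64(out, file.offset);
  PutInt64(out, file.length);
  // Only v4 and later carry the cardinality; -1 marks "absent" (assumption).
  if (version >= 4) PutInt64(out, file.cardinality.value_or(-1));
  return out;
}

DeletionFile Deserialize(const Bytes& in, int version) {
  std::size_t pos = 0;
  const int64_t path_len = GetInt64(in, pos);
  if (path_len < 0 || static_cast<uint64_t>(path_len) > in.size() - pos) {
    throw std::out_of_range("bad path length");
  }
  DeletionFile file;
  const auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
  file.path.assign(first, first + static_cast<std::ptrdiff_t>(path_len));
  pos += static_cast<std::size_t>(path_len);
  file.offset = GetInt64(in, pos);
  file.length = GetInt64(in, pos);
  if (version >= 4) {
    const int64_t cardinality = GetInt64(in, pos);
    if (cardinality >= 0) file.cardinality = cardinality;
  }
  return file;
}
```

Passing the version explicitly to the reader is what lets a newer reader consume payloads written by an older writer: the v3 path stops before the cardinality field and leaves it unset.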

ScanMode — Enum specifying which part of a snapshot to scan: ALL (complete data files) or DELTA (only newly changed files).

PlanImpl — Concrete implementation of the Plan interface. Holds a snapshot ID and a vector of splits, and provides a static EmptyPlan() factory method for empty results.
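The relationship between Plan, PlanImpl, and the empty-table case can be sketched as follows. These are simplified stand-ins written for illustration, not the declarations in this PR: the accessor names `SnapshotId` and `Splits`, the use of `std::shared_ptr`, and the empty `Split` type are assumptions. What comes from the description is that a plan holds a vector of splits and a snapshot ID that is nullopt for empty tables, and that `EmptyPlan()` is a static factory.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Placeholder for the Split base class.
struct Split {
  virtual ~Split() = default;
};

// Result of a table scan: the splits plus the snapshot they came from.
class Plan {
 public:
  virtual ~Plan() = default;
  // nullopt when the table has no snapshot yet (empty table).
  virtual std::optional<int64_t> SnapshotId() const = 0;
  virtual const std::vector<std::shared_ptr<Split>>& Splits() const = 0;
};

class PlanImpl : public Plan {
 public:
  PlanImpl(std::optional<int64_t> snapshot_id,
           std::vector<std::shared_ptr<Split>> splits)
      : snapshot_id_(snapshot_id), splits_(std::move(splits)) {}

  // Factory for the "nothing to read" result.
  static std::shared_ptr<Plan> EmptyPlan() {
    return std::make_shared<PlanImpl>(std::nullopt,
                                      std::vector<std::shared_ptr<Split>>{});
  }

  std::optional<int64_t> SnapshotId() const override { return snapshot_id_; }
  const std::vector<std::shared_ptr<Split>>& Splits() const override {
    return splits_;
  }

 private:
  std::optional<int64_t> snapshot_id_;
  std::vector<std::shared_ptr<Split>> splits_;
};
```

Using `std::optional` for the snapshot ID lets callers tell "no snapshot exists" apart from any valid ID without reserving a sentinel value.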

New Tests

deletion_file_test.cpp
split_generator_test.cpp
startup_mode_test.cpp
table_scan_test.cpp

lxy-9602 · 2026-06-18T09:53:29Z

+1

feat: introduce table scan, split and plan

8364c73

lszskye wants to merge 1 commit into apache:main from lszskye:p12-5
