feat(format): introduce blob file format#98

Open
zjw1111 wants to merge 1 commit into apache:main from zjw1111:migrate/blob-format

zjw1111 commented Jun 18, 2026

Contributor

Purpose

Linked issue: N/A

Introduce the blob file format implementation migrated from Alibaba Paimon C++.

Migrated files:

  • src/paimon/format/blob/blob_file_batch_reader.cpp
  • src/paimon/format/blob/blob_file_batch_reader.h
  • src/paimon/format/blob/blob_file_batch_reader_test.cpp
  • src/paimon/format/blob/blob_file_format.h
  • src/paimon/format/blob/blob_file_format_factory.cpp
  • src/paimon/format/blob/blob_file_format_factory.h
  • src/paimon/format/blob/blob_file_format_factory_test.cpp
  • src/paimon/format/blob/blob_format_writer.cpp
  • src/paimon/format/blob/blob_format_writer.h
  • src/paimon/format/blob/blob_format_writer_test.cpp
  • src/paimon/format/blob/blob_reader_builder.h
  • src/paimon/format/blob/blob_stats_extractor.cpp
  • src/paimon/format/blob/blob_stats_extractor.h
  • src/paimon/format/blob/blob_stats_extractor_test.cpp
  • src/paimon/format/blob/blob_writer_builder.h
  • src/paimon/format/blob/blob_writer_builder_test.cpp

No extra dependency files were migrated. src/paimon/format/blob/CMakeLists.txt was intentionally not migrated.

The Alibaba headers were converted to ASF license headers. No LICENSE or NOTICE updates were required for this batch.

External contributor analysis found no requirement for a Co-authored-by trailer or a thank-you comment.

Validation performed:

  • Ran check_migration_batch.py --skip-deps.
  • Ran analyze_external_contributors.py.
  • Ran git diff --check.
  • Ran git diff --no-index --check /dev/null <file> for each migrated new file.
  • Ran git diff --cached --check.

Tests

API and Format

This PR adds blob file format source and test files under src/paimon/format/blob. It does not change public headers under include/.

Documentation

No documentation changes.

Generative AI tooling

Migrated-by: OpenAI Codex

Copilot AI review requested due to automatic review settings June 18, 2026 03:10

Copilot AI left a comment

Pull request overview

This PR introduces a new blob file format implementation under src/paimon/format/blob, migrated from Alibaba Paimon C++, adding the writer/reader path plus factory integration and test coverage for core behaviors.

Changes:

  • Added BlobFormatWriter and BlobFileBatchReader implementations for writing/reading .blob files.
  • Added BlobFileFormat + BlobFileFormatFactory and builder classes to integrate the format into the existing FileFormat/factory architecture.
  • Added unit tests covering writer/reader behavior, stats extraction, and factory wiring.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 12 comments.

Summary per file

| File | Description |
| --- | --- |
| src/paimon/format/blob/blob_writer_builder.h | Writer builder that wires options/memory pool/fs into blob writer creation. |
| src/paimon/format/blob/blob_writer_builder_test.cpp | Tests writer builder error handling and successful build path. |
| src/paimon/format/blob/blob_stats_extractor.h | Declares blob stats extractor implementation. |
| src/paimon/format/blob/blob_stats_extractor.cpp | Implements stats extraction from blob files (row count + column stats). |
| src/paimon/format/blob/blob_stats_extractor_test.cpp | Tests stats extraction for valid/invalid schemas and missing files. |
| src/paimon/format/blob/blob_reader_builder.h | Reader builder that constructs BlobFileBatchReader from an InputStream. |
| src/paimon/format/blob/blob_format_writer.h | Declares blob format writer and its file layout behavior. |
| src/paimon/format/blob/blob_format_writer.cpp | Implements blob writing (bin encoding, index/footer, CRC). |
| src/paimon/format/blob/blob_format_writer_test.cpp | End-to-end tests for blob write/read, nulls, size checks, etc. |
| src/paimon/format/blob/blob_file_format.h | Implements FileFormat for identifier blob. |
| src/paimon/format/blob/blob_file_format_factory.h | Declares factory for constructing blob file format instances. |
| src/paimon/format/blob/blob_file_format_factory.cpp | Implements and registers the blob file format factory. |
| src/paimon/format/blob/blob_file_format_factory_test.cpp | Tests factory identifier and basic format creation. |
| src/paimon/format/blob/blob_file_batch_reader.h | Declares blob batch reader + documents blob binary layout. |
| src/paimon/format/blob/blob_file_batch_reader.cpp | Implements blob batch reader, schema validation, bitmap pushdown, batch materialization. |
| src/paimon/format/blob/blob_file_batch_reader_test.cpp | Tests reader correctness, bitmap pushdown, row numbering, and invalid scenarios. |

Comment on lines +55 to +75
    PAIMON_ASSIGN_OR_RAISE(uint64_t file_size, input_stream->Length());
    PAIMON_RETURN_NOT_OK(
        input_stream->Seek(file_size - BlobDefs::kBlobFileHeaderLength, FS_SEEK_SET));
    int8_t header[BlobDefs::kBlobFileHeaderLength];
    PAIMON_ASSIGN_OR_RAISE(
        int32_t actual_size,
        input_stream->Read(reinterpret_cast<char*>(header), BlobDefs::kBlobFileHeaderLength));
    if (actual_size != BlobDefs::kBlobFileHeaderLength) {
        return Status::Invalid(
            fmt::format("actual read size {} not match with expect header length {}", actual_size,
                        BlobDefs::kBlobFileHeaderLength));
    }
    int8_t version = header[4];
    if (version != BlobDefs::kFileVersion) {
        return Status::Invalid(fmt::format(
            "create blob format reader failed. unsupported blob file version: {}", version));
    }
    int32_t index_length = GetIndexLength(header, 0);
    PAIMON_RETURN_NOT_OK(input_stream->Seek(
        file_size - BlobDefs::kBlobFileHeaderLength - index_length, FS_SEEK_SET));
    std::vector<char> index_bytes(index_length, '\0');
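
The snippet above subtracts BlobDefs::kBlobFileHeaderLength from an unsigned file_size before seeking, and then uses a signed index_length as a vector size. A minimal standalone sketch of the same trailer parse with explicit bounds checks might look like the following. The 5-byte trailer size, the version value 1, the little-endian length encoding, and the names ParseFooter, FooterInfo, kFooterLength and kSupportedVersion are all assumptions made for illustration; none of them is taken from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical constants; the real values live in BlobDefs and are not shown in the review.
constexpr std::size_t kFooterLength = 5;       // assumed: 4-byte index length + 1-byte version
constexpr std::uint8_t kSupportedVersion = 1;  // assumed

struct FooterInfo {
    std::uint32_t index_length;  // size of the index block that precedes the trailer
    std::size_t index_offset;    // absolute offset where the index block starts
};

// Parses the trailing bytes of an in-memory file image.
// Returns std::nullopt and fills `error` when the image is malformed.
std::optional<FooterInfo> ParseFooter(const std::vector<std::uint8_t>& file,
                                      std::string* error) {
    // Check the size first: subtracting from an unsigned size wraps around on short files.
    if (file.size() < kFooterLength) {
        *error = "file shorter than trailer: " + std::to_string(file.size());
        return std::nullopt;
    }
    const std::size_t body_size = file.size() - kFooterLength;
    const std::uint8_t* footer = file.data() + body_size;

    if (footer[4] != kSupportedVersion) {
        *error = "unsupported version: " + std::to_string(footer[4]);
        return std::nullopt;
    }

    // Assumed little-endian encoding of the index length in bytes 0..3.
    const std::uint32_t index_length = static_cast<std::uint32_t>(footer[0]) |
                                       (static_cast<std::uint32_t>(footer[1]) << 8) |
                                       (static_cast<std::uint32_t>(footer[2]) << 16) |
                                       (static_cast<std::uint32_t>(footer[3]) << 24);

    // The index has to fit inside the bytes that precede the trailer.
    if (index_length > body_size) {
        *error = "index length " + std::to_string(index_length) + " exceeds file body";
        return std::nullopt;
    }
    return FooterInfo{index_length, body_size - index_length};
}
```

Decoding the length as unsigned and comparing it against the remaining body size rejects both truncated files and corrupted length fields before any seek or allocation happens.
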
Comment on lines +138 to +140
    static PAIMON_UNIQUE_PTR<Bytes> kMagicNumberBytes =
        IntegerToLittleEndian<int32_t>(BlobDefs::kMagicNumber, pool_);
    PAIMON_RETURN_NOT_OK(WriteWithCrc32(kMagicNumberBytes->data(), kMagicNumberBytes->size()));
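
A function-local static is initialized only once, on the first call. If pool_ is a per-writer member (as the trailing underscore suggests), the bytes above would stay tied to the first writer's pool for the lifetime of the process. One way to sidestep that, sketched here under the assumption that the magic number is a plain 32-bit value, is to encode it into a stack array on each call. ToLittleEndian is a hypothetical helper, not a function from the PR.

```cpp
#include <array>
#include <cstdint>

// Encodes a 32-bit integer as four little-endian bytes.
// The result lives on the stack, so no memory pool or static state is involved.
std::array<std::uint8_t, 4> ToLittleEndian(std::int32_t value) {
    // Work on the unsigned representation so the shifts are well defined for negative values.
    const auto u = static_cast<std::uint32_t>(value);
    return {static_cast<std::uint8_t>(u & 0xFFu),
            static_cast<std::uint8_t>((u >> 8) & 0xFFu),
            static_cast<std::uint8_t>((u >> 16) & 0xFFu),
            static_cast<std::uint8_t>((u >> 24) & 0xFFu)};
}
```

Four bytes are cheap to recompute, so caching them in a static buys little here.
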
Comment on lines +155 to +163
    PAIMON_ASSIGN_OR_RAISE(int32_t actual_read_len, in->Read(tmp_buffer_->data(), read_len));
    if (static_cast<uint32_t>(actual_read_len) != read_len) {
        return Status::Invalid("actual read length {}, not match with expect length {}",
                               actual_read_len, read_len);
    }
    PAIMON_RETURN_NOT_OK(WriteWithCrc32(tmp_buffer_->data(), actual_read_len));
    total_read_length += actual_read_len;
    read_len = static_cast<uint32_t>(
        std::min<uint64_t>(file_length - total_read_length, tmp_buffer_->size()));
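
The loop above copies a blob through a fixed-size buffer and feeds each chunk into a running CRC. Unlike the other call sites quoted in this review, its error path passes the placeholders and arguments straight to Status::Invalid rather than through fmt::format. Whether Status::Invalid formats its arguments is not shown here. The sketch below illustrates the same chunked-copy pattern in isolation, with the error message built explicitly. Crc32Update and CopyWithCrc are hypothetical names, and the sketch assumes the common reflected CRC-32 polynomial 0xEDB88320, which may differ from what WriteWithCrc32 uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), written out to keep the sketch dependency-free.
// `crc` is the running value; pass 0 for the first chunk and the previous result afterwards.
std::uint32_t Crc32Update(std::uint32_t crc, const char* data, std::size_t len) {
    crc = ~crc;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<std::uint8_t>(data[i]);
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; apply the polynomial only when the bit shifted out was set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Appends `src` to `dst` in chunks of at most `chunk_size` bytes, updating the CRC per chunk.
// Returns false and fills `error` with a fully formatted message on invalid input.
bool CopyWithCrc(const std::string& src, std::size_t chunk_size, std::string* dst,
                 std::uint32_t* crc, std::string* error) {
    if (chunk_size == 0) {
        // Build the message explicitly instead of passing placeholders through.
        *error = "chunk size must be positive, got " + std::to_string(chunk_size);
        return false;
    }
    *crc = 0;
    std::size_t copied = 0;
    while (copied < src.size()) {
        // Same shape as the snippet: next length is min(remaining, buffer size).
        const std::size_t len = std::min(src.size() - copied, chunk_size);
        dst->append(src, copied, len);
        *crc = Crc32Update(*crc, src.data() + copied, len);
        copied += len;
    }
    return true;
}
```

Because the CRC is carried from chunk to chunk, the final value does not depend on the chunk size, which makes the copy easy to verify against a one-shot checksum.
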
Comment on lines +60 to +62
    if (pool == nullptr) {
        return Status::Invalid("blob format writer create failed. pool is nullptr");
    }
Comment on lines +57 to +60
    ColumnStatsVector result_stats;
    result_stats.push_back(
        ColumnStats::CreateStringColumnStats(std::nullopt, std::nullopt, /*null_count=*/0));
    PAIMON_ASSIGN_OR_RAISE(uint64_t num_rows, blob_reader->GetNumberOfRows());
Comment on lines +159 to +160
auto dir = paimon::test::UniqueTestDirectory::Create();
std::string table_path = dir->Str();
Comment on lines +271 to +272
auto dir = paimon::test::UniqueTestDirectory::Create();
std::string table_path = dir->Str();
Comment on lines +296 to +297
auto dir = paimon::test::UniqueTestDirectory::Create();
std::string table_path = dir->Str();
Comment on lines +321 to +322
auto dir = paimon::test::UniqueTestDirectory::Create();
std::string table_path = dir->Str();
Comment on lines +344 to +345
auto dir = paimon::test::UniqueTestDirectory::Create();
std::string table_path = dir->Str();
@lxy-9602

Contributor

+1
