feat(format): introduce blob file format #98
Open
zjw1111 wants to merge 1 commit into
Pull request overview
This PR introduces a new blob file format implementation under src/paimon/format/blob, migrated from Alibaba Paimon C++, adding the writer/reader path plus factory integration and test coverage for core behaviors.
Changes:
- Added `BlobFormatWriter` and `BlobFileBatchReader` implementations for writing/reading `.blob` files.
- Added `BlobFileFormat` + `BlobFileFormatFactory` and builder classes to integrate the format into the existing `FileFormat`/factory architecture.
- Added unit tests covering writer/reader behavior, stats extraction, and factory wiring.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/paimon/format/blob/blob_writer_builder.h | Writer builder that wires options/memory pool/fs into blob writer creation. |
| src/paimon/format/blob/blob_writer_builder_test.cpp | Tests writer builder error handling and successful build path. |
| src/paimon/format/blob/blob_stats_extractor.h | Declares blob stats extractor implementation. |
| src/paimon/format/blob/blob_stats_extractor.cpp | Implements stats extraction from blob files (row count + column stats). |
| src/paimon/format/blob/blob_stats_extractor_test.cpp | Tests stats extraction for valid/invalid schemas and missing files. |
| src/paimon/format/blob/blob_reader_builder.h | Reader builder that constructs BlobFileBatchReader from an InputStream. |
| src/paimon/format/blob/blob_format_writer.h | Declares blob format writer and its file layout behavior. |
| src/paimon/format/blob/blob_format_writer.cpp | Implements blob writing (bin encoding, index/footer, CRC). |
| src/paimon/format/blob/blob_format_writer_test.cpp | End-to-end tests for blob write/read, nulls, size checks, etc. |
| src/paimon/format/blob/blob_file_format.h | Implements FileFormat for identifier blob. |
| src/paimon/format/blob/blob_file_format_factory.h | Declares factory for constructing blob file format instances. |
| src/paimon/format/blob/blob_file_format_factory.cpp | Implements and registers the blob file format factory. |
| src/paimon/format/blob/blob_file_format_factory_test.cpp | Tests factory identifier and basic format creation. |
| src/paimon/format/blob/blob_file_batch_reader.h | Declares blob batch reader + documents blob binary layout. |
| src/paimon/format/blob/blob_file_batch_reader.cpp | Implements blob batch reader, schema validation, bitmap pushdown, batch materialization. |
| src/paimon/format/blob/blob_file_batch_reader_test.cpp | Tests reader correctness, bitmap pushdown, row numbering, and invalid scenarios. |
Comment on lines +55 to +75

    PAIMON_ASSIGN_OR_RAISE(uint64_t file_size, input_stream->Length());
    PAIMON_RETURN_NOT_OK(
        input_stream->Seek(file_size - BlobDefs::kBlobFileHeaderLength, FS_SEEK_SET));
    int8_t header[BlobDefs::kBlobFileHeaderLength];
    PAIMON_ASSIGN_OR_RAISE(
        int32_t actual_size,
        input_stream->Read(reinterpret_cast<char*>(header), BlobDefs::kBlobFileHeaderLength));
    if (actual_size != BlobDefs::kBlobFileHeaderLength) {
        return Status::Invalid(
            fmt::format("actual read size {} not match with expect header length {}", actual_size,
                        BlobDefs::kBlobFileHeaderLength));
    }
    int8_t version = header[4];
    if (version != BlobDefs::kFileVersion) {
        return Status::Invalid(fmt::format(
            "create blob format reader failed. unsupported blob file version: {}", version));
    }
    int32_t index_length = GetIndexLength(header, 0);
    PAIMON_RETURN_NOT_OK(input_stream->Seek(
        file_size - BlobDefs::kBlobFileHeaderLength - index_length, FS_SEEK_SET));
    std::vector<char> index_bytes(index_length, '\0');
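
One thing worth checking in the snippet above: `file_size` is a `uint64_t`, so `file_size - BlobDefs::kBlobFileHeaderLength` wraps around when the stream is shorter than the trailing header, and `index_length` is taken directly from bytes in the file. A minimal sketch of a bounds check that could run before each seek follows. The names `kFooterLength`, `FooterOffset`, and `IndexOffset` are hypothetical, and the value 5 is only a placeholder for `BlobDefs::kBlobFileHeaderLength`, whose real value is not shown in this PR excerpt.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical placeholder for BlobDefs::kBlobFileHeaderLength (real value
// not shown in the reviewed snippet).
constexpr uint64_t kFooterLength = 5;

// Returns the offset at which the trailing header starts, or std::nullopt
// when the file is too short to contain one. Checking first avoids unsigned
// wrap-around in `file_size - kFooterLength`.
std::optional<uint64_t> FooterOffset(uint64_t file_size) {
    if (file_size < kFooterLength) {
        return std::nullopt;
    }
    return file_size - kFooterLength;
}

// Returns the offset at which the index starts, or std::nullopt when the
// index length is negative or larger than the bytes preceding the header.
std::optional<uint64_t> IndexOffset(uint64_t file_size, int32_t index_length) {
    const std::optional<uint64_t> footer = FooterOffset(file_size);
    if (!footer || index_length < 0 ||
        static_cast<uint64_t>(index_length) > *footer) {
        return std::nullopt;
    }
    return *footer - static_cast<uint64_t>(index_length);
}
```

In the reader, an empty result would map to a `Status::Invalid(...)` return instead of a seek to a wrapped offset.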
Comment on lines +138 to +140

    static PAIMON_UNIQUE_PTR<Bytes> kMagicNumberBytes =
        IntegerToLittleEndian<int32_t>(BlobDefs::kMagicNumber, pool_);
    PAIMON_RETURN_NOT_OK(WriteWithCrc32(kMagicNumberBytes->data(), kMagicNumberBytes->size()));
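
The snippet above caches the encoded magic number in a function-local `static` that is initialized from the member `pool_`. A function-local static is initialized once, on the first call, so the cached bytes may stay tied to whichever writer's pool was used first. One possible alternative, sketched under the assumption that a four-byte stack buffer is acceptable here, is to encode the value without any pool. `ToLittleEndian` is a hypothetical helper, not the project's `IntegerToLittleEndian`.

```cpp
#include <array>
#include <cstdint>

// Encodes a 32-bit value as four little-endian bytes on the stack, so no
// memory pool (and no static tied to one) is involved.
std::array<char, 4> ToLittleEndian(int32_t value) {
    const uint32_t v = static_cast<uint32_t>(value);
    return {static_cast<char>(v & 0xFF),
            static_cast<char>((v >> 8) & 0xFF),
            static_cast<char>((v >> 16) & 0xFF),
            static_cast<char>((v >> 24) & 0xFF)};
}
```

The writer could then pass `bytes.data()` and `bytes.size()` to `WriteWithCrc32`, assuming that function accepts a `const char*` and a length as the reviewed call suggests.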
Comment on lines +155 to +163

    PAIMON_ASSIGN_OR_RAISE(int32_t actual_read_len, in->Read(tmp_buffer_->data(), read_len));
    if (static_cast<uint32_t>(actual_read_len) != read_len) {
        return Status::Invalid("actual read length {}, not match with expect length {}",
                               actual_read_len, read_len);
    }
    PAIMON_RETURN_NOT_OK(WriteWithCrc32(tmp_buffer_->data(), actual_read_len));
    total_read_length += actual_read_len;
    read_len = static_cast<uint32_t>(
        std::min<uint64_t>(file_length - total_read_length, tmp_buffer_->size()));
Comment on lines +60 to +62

    if (pool == nullptr) {
        return Status::Invalid("blob format writer create failed. pool is nullptr");
    }
Comment on lines +57 to +60

    ColumnStatsVector result_stats;
    result_stats.push_back(
        ColumnStats::CreateStringColumnStats(std::nullopt, std::nullopt, /*null_count=*/0));
    PAIMON_ASSIGN_OR_RAISE(uint64_t num_rows, blob_reader->GetNumberOfRows());
Comment on lines +159 to +160

    auto dir = paimon::test::UniqueTestDirectory::Create();
    std::string table_path = dir->Str();

Comment on lines +271 to +272

    auto dir = paimon::test::UniqueTestDirectory::Create();
    std::string table_path = dir->Str();

Comment on lines +296 to +297

    auto dir = paimon::test::UniqueTestDirectory::Create();
    std::string table_path = dir->Str();

Comment on lines +321 to +322

    auto dir = paimon::test::UniqueTestDirectory::Create();
    std::string table_path = dir->Str();

Comment on lines +344 to +345

    auto dir = paimon::test::UniqueTestDirectory::Create();
    std::string table_path = dir->Str();
Contributor
+1
Purpose
Linked issue: N/A
Introduce the blob file format implementation migrated from Alibaba Paimon C++.
Migrated files:
- `src/paimon/format/blob/blob_file_batch_reader.cpp`
- `src/paimon/format/blob/blob_file_batch_reader.h`
- `src/paimon/format/blob/blob_file_batch_reader_test.cpp`
- `src/paimon/format/blob/blob_file_format.h`
- `src/paimon/format/blob/blob_file_format_factory.cpp`
- `src/paimon/format/blob/blob_file_format_factory.h`
- `src/paimon/format/blob/blob_file_format_factory_test.cpp`
- `src/paimon/format/blob/blob_format_writer.cpp`
- `src/paimon/format/blob/blob_format_writer.h`
- `src/paimon/format/blob/blob_format_writer_test.cpp`
- `src/paimon/format/blob/blob_reader_builder.h`
- `src/paimon/format/blob/blob_stats_extractor.cpp`
- `src/paimon/format/blob/blob_stats_extractor.h`
- `src/paimon/format/blob/blob_stats_extractor_test.cpp`
- `src/paimon/format/blob/blob_writer_builder.h`
- `src/paimon/format/blob/blob_writer_builder_test.cpp`

No extra dependency files were migrated. `src/paimon/format/blob/CMakeLists.txt` was intentionally not migrated.

The Alibaba headers were converted to ASF license headers. No `LICENSE` or `NOTICE` updates were required for this batch.

External contributor analysis found no `Co-authored-by` trailer or thank-you comment requirement.

Validation performed:
- `check_migration_batch.py --skip-deps`.
- `analyze_external_contributors.py`.
- `git diff --check`.
- `git diff --no-index --check /dev/null <file>` for each migrated new file.
- `git diff --cached --check`.

Tests
API and Format
This PR adds blob file format source and test files under `src/paimon/format/blob`. It does not change public headers under `include/`.
Documentation
No documentation changes.
Generative AI tooling
Migrated-by: OpenAI Codex