feat(format): introduce avro writer components #88

Open
zjw1111 wants to merge 1 commit into apache:main from zjw1111:migrate/avro-writer
zjw1111 (Contributor) commented Jun 17, 2026

Purpose

This PR introduces the exact-scope Avro writer, batch reader, writer builder, and stats extractor files migrated from the Alibaba Paimon C++ repository.

Migrated files:

  • src/paimon/format/avro/avro_file_batch_reader.h
  • src/paimon/format/avro/avro_file_batch_reader.cpp
  • src/paimon/format/avro/avro_file_batch_reader_test.cpp
  • src/paimon/format/avro/avro_format_writer.h
  • src/paimon/format/avro/avro_format_writer.cpp
  • src/paimon/format/avro/avro_format_writer_test.cpp
  • src/paimon/format/avro/avro_writer_builder.h
  • src/paimon/format/avro/avro_writer_builder_test.cpp
  • src/paimon/format/avro/avro_stats_extractor.h
  • src/paimon/format/avro/avro_stats_extractor.cpp
  • src/paimon/format/avro/avro_stats_extractor_test.cpp

No additional dependency files are included in this exact-scope migration.

Tests

git diff --cached --check

API and Format

This patch introduces Avro writer-related implementation and test files. It does not change public headers under include/, storage format, or protocol definitions.

Documentation

No user-facing documentation is added.

Generative AI tooling

Migrated-by: OpenAI Codex

Copilot AI review requested due to automatic review settings June 17, 2026 03:37

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces Avro format support by adding core writer/reader components, a stats extractor skeleton, and a suite of tests to validate end-to-end behavior.

Changes:

  • Added AvroWriterBuilder to configure Avro codec and zstd compression level from options.
  • Added AvroFormatWriter and AvroFileBatchReader implementations with new integration/unit tests.
  • Added AvroStatsExtractor implementation (currently returns null stats) with tests covering multiple Arrow types.
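
The codec and level selection described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Codec` is a hypothetical stand-in for `::avro::Codec`, `ToCodec` and `CompressionLevel` are illustrative names, and the `zstd.level` option key and the set of accepted names other than `zstd` are assumptions.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for ::avro::Codec (the real enum lives in Avro C++).
enum class Codec { kNull, kDeflate, kSnappy, kZstd };

// Map a compression name to a codec. An empty optional plays the role of the
// "unknown compression" error status seen in the diff excerpts.
std::optional<Codec> ToCodec(const std::string& name) {
    if (name == "null") return Codec::kNull;
    if (name == "deflate") return Codec::kDeflate;
    if (name == "snappy") return Codec::kSnappy;
    if (name == "zstd") return Codec::kZstd;
    return std::nullopt;
}

// Assumed option key; the actual key used by the PR is not shown in the source.
constexpr char kZstdLevelKey[] = "zstd.level";

// Only zstd carries a compression level in this sketch. Every other codec, and
// zstd without an explicit option, yields an empty optional.
std::optional<int32_t> CompressionLevel(Codec codec,
                                        const std::map<std::string, std::string>& options) {
    if (codec != Codec::kZstd) return std::nullopt;
    auto it = options.find(kZstdLevelKey);
    if (it == options.end()) return std::nullopt;
    return static_cast<int32_t>(std::stoi(it->second));
}
```

Returning an empty optional for non-zstd codecs mirrors the test excerpt further down, where a writer using a non-zstd codec is expected to have no compression level set.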

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| src/paimon/format/avro/avro_writer_builder.h | Adds builder logic for Avro writer creation and codec/level selection. |
| src/paimon/format/avro/avro_writer_builder_test.cpp | Adds tests for codec mapping and codec/level propagation into writer. |
| src/paimon/format/avro/avro_format_writer.h | Declares Avro FormatWriter implementation. |
| src/paimon/format/avro/avro_format_writer.cpp | Implements Avro writer creation, add/flush/finish, and target-size checks. |
| src/paimon/format/avro/avro_format_writer_test.cpp | Adds writer tests for batch sizing, multiple writes, and size estimation. |
| src/paimon/format/avro/avro_file_batch_reader.h | Declares Avro FileBatchReader implementation. |
| src/paimon/format/avro/avro_file_batch_reader.cpp | Implements Avro batch reading, schema projection, and row counting. |
| src/paimon/format/avro/avro_file_batch_reader_test.cpp | Adds reader tests for nulls, batch sizes, types, timestamps, maps, and row counts. |
| src/paimon/format/avro/avro_stats_extractor.h | Declares Avro stats extractor interface. |
| src/paimon/format/avro/avro_stats_extractor.cpp | Implements stats extraction scaffolding (null stats) and file-info path. |
| src/paimon/format/avro/avro_stats_extractor_test.cpp | Adds tests ensuring extractor returns expected “null stats” for various types. |

Comment on lines +21 to +25
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
Comment on lines +52 to +54
AvroWriterBuilder(const std::shared_ptr<arrow::Schema>& schema, int32_t batch_size,
                  const std::map<std::string, std::string>& options)
    : pool_(GetDefaultPool()), schema_(schema), options_(options) {}
return Status::Invalid("unknown compression " + file_compression);
}
}
Result<std::optional<int32_t>> GetAvroCompressionLevel(const ::avro::Codec& codec) {
Comment on lines +28 to +29
ASSERT_OK_AND_ASSIGN(::avro::Codec zstd_codec,
                     AvroWriterBuilder::ToAvroCompressionKind("zstd"));
Comment on lines +63 to +66
ASSERT_OK_AND_ASSIGN(auto file_writer, builder.Build(nullptr, "zstd"));
auto* avro_file_writer = dynamic_cast<AvroFormatWriter*>(file_writer.get());
ASSERT_EQ(avro_file_writer->writer_->codec_, ::avro::Codec::SNAPPY_CODEC);
ASSERT_EQ(avro_file_writer->writer_->compressionLevel_, std::nullopt);
return Status::Invalid("unknown compression " + file_compression);
}
}
Result<std::optional<int32_t>> GetAvroCompressionLevel(const ::avro::Codec& codec) {
std::unique_ptr<ArrowArray> c_array = std::make_unique<ArrowArray>();
std::unique_ptr<ArrowSchema> c_schema = std::make_unique<ArrowSchema>();
PAIMON_RETURN_NOT_OK_FROM_ARROW(arrow::ExportArray(*array, c_array.get(), c_schema.get()));
return make_pair(std::move(c_array), std::move(c_schema));
Comment on lines +134 to +137
data_str.append(fmt::format(R"([{}, {}, {}, {}, {}, "str_{}", "bin_{}"])", "true", i,
i * 100000000000L, i * 0.12, i * 123.45678901, i, i));
} else if (i % 3 == 1) {
data_str.append(fmt::format(R"([{}, -{}, -{}, -{}, -{}, "string_{}", "binary_{}"])",
Comment on lines +132 to +133
PAIMON_RETURN_NOT_OK(Flush());
return Status::OK();
Comment on lines +188 to +192
ScopeGuard stream_guard([this, current_pos]() -> void {
    // reset input stream position to original position
    Status status = input_stream_->Seek(current_pos, SeekOrigin::FS_SEEK_SET);
    (void)status;
});
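
The excerpt above relies on a scope guard to restore the stream position once the enclosing function returns. A minimal sketch of that pattern, under stated assumptions: `SimpleScopeGuard`, `FakeStream`, and `BytesToEnd` are hypothetical stand-ins written for illustration and are not the project's `ScopeGuard` or input stream types.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical minimal guard: runs the stored callable when it goes out of scope.
class SimpleScopeGuard {
 public:
    explicit SimpleScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
    ~SimpleScopeGuard() {
        if (fn_) fn_();
    }
    SimpleScopeGuard(const SimpleScopeGuard&) = delete;
    SimpleScopeGuard& operator=(const SimpleScopeGuard&) = delete;

 private:
    std::function<void()> fn_;
};

// Hypothetical in-memory stream that only tracks a position.
struct FakeStream {
    int64_t pos = 0;
    void Seek(int64_t p) { pos = p; }
};

// Moves the stream to `end` to measure a distance, then lets the guard restore
// the original position on every exit path, including early returns.
int64_t BytesToEnd(FakeStream& stream, int64_t end) {
    const int64_t saved = stream.pos;
    SimpleScopeGuard restore([&stream, saved]() { stream.Seek(saved); });
    stream.Seek(end);
    // The return value is computed before `restore` is destroyed.
    return stream.pos - saved;
}
```

One design point worth noting: the lambda in the excerpt discards the `Status` returned by `Seek`, so a failed restore goes unreported. A guard cannot propagate an error from a destructor, so callers that need to detect this would have to seek back explicitly instead.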