fix(format): reset avro reader on schema change#374

Open
zjw1111 wants to merge 2 commits into alibaba:main from zjw1111:codex/fix-avro-set-read-schema-reset

zjw1111 commented Jun 17, 2026

Collaborator

Purpose

Linked issue: none

Previously, AvroFileBatchReader::SetReadSchema() reset the reported row numbers and the Arrow builder but left the underlying Avro DataFileReaderBase at its current read position. If a caller consumed part of a file and then changed the read schema, the next batch continued from the old Avro reader position while reporting row number 0.

This patch recreates the Avro data file reader when SetReadSchema() succeeds, so the next NextBatch() call reads from the first row as required by the FileBatchReader contract.
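
To make the failure mode concrete, here is a minimal sketch of the old and the patched behavior using toy classes. ToySequentialReader and ToyBatchReader are hypothetical stand-ins, not the real avro::DataFileReaderBase or AvroFileBatchReader, and the toy tracks the reported row number as a plain counter starting at 0, which is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a forward-only file reader.
class ToySequentialReader {
 public:
  explicit ToySequentialReader(const std::vector<int>* rows) : rows_(rows) {}
  bool HasNext() const { return pos_ < rows_->size(); }
  int Next() { return (*rows_)[pos_++]; }

 private:
  const std::vector<int>* rows_;
  std::size_t pos_ = 0;
};

// Hypothetical batch reader. `recreate_on_schema_change` selects the
// old behavior (false) or the patched behavior (true).
class ToyBatchReader {
 public:
  struct Batch {
    std::uint64_t first_row;  // row number reported for the first value
    std::vector<int> values;
  };

  ToyBatchReader(std::vector<int> rows, bool recreate_on_schema_change)
      : rows_(std::move(rows)),
        reader_(&rows_),
        recreate_on_schema_change_(recreate_on_schema_change) {}

  // The inner reader points at rows_, so copying would dangle.
  ToyBatchReader(const ToyBatchReader&) = delete;
  ToyBatchReader& operator=(const ToyBatchReader&) = delete;

  Batch NextBatch(std::size_t max_rows) {
    Batch batch{next_row_, {}};
    while (batch.values.size() < max_rows && reader_.HasNext()) {
      batch.values.push_back(reader_.Next());
    }
    next_row_ += batch.values.size();
    return batch;
  }

  void SetReadSchema() {
    next_row_ = 0;  // both variants reset the reported row number
    if (recreate_on_schema_change_) {
      // Patched behavior: start over with a fresh underlying reader.
      reader_ = ToySequentialReader(&rows_);
    }
  }

 private:
  std::vector<int> rows_;
  ToySequentialReader reader_;
  bool recreate_on_schema_change_;
  std::uint64_t next_row_ = 0;
};
```

With the flag off, a batch read after SetReadSchema() reports first row 0 but carries values from the middle of the data; with the flag on, the reported row number and the values agree again.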

Tests

git diff --cached --check

cmake --build build --target paimon-avro-format-test -j64

./build/release/paimon-avro-format-test --gtest_filter=AvroFileBatchReaderTest.TestSetReadSchemaResetsReaderToFirstRow

./build/release/paimon-avro-format-test

API and Format

No public API, storage format, or protocol changes.

Documentation

No.

Generative AI tooling

Generated-by: OpenAI Codex

Copilot AI review requested due to automatic review settings June 17, 2026 06:59

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Resets the underlying Avro reader when AvroFileBatchReader::SetReadSchema() is called so subsequent NextBatch() reads restart from the beginning and row numbering aligns with the FileBatchReader contract.

Changes:

  • Recreate the underlying avro::DataFileReaderBase on successful schema changes.
  • Rebuild the Arrow ArrayBuilder and update field projection on schema changes.
  • Add a regression test ensuring schema changes reset the reader back to the first row.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/paimon/format/avro/avro_file_batch_reader.cpp | Recreates the Avro reader and resets internal state on SetReadSchema() success. |
| src/paimon/format/avro/avro_file_batch_reader_test.cpp | Adds a test asserting SetReadSchema() causes reading to restart from row 0. |

Comment on lines +154 to +165
    PAIMON_ASSIGN_OR_RAISE(std::unique_ptr<::avro::DataFileReaderBase> reader,
                           CreateDataFileReader(input_stream_, pool_));

    if (reader_) {
        reader_->close();
    }
    reader_ = std::move(reader);
    read_fields_projection_ = std::move(read_fields_projection);
    array_builder_ = std::move(array_builder);
    previous_first_row_ = std::numeric_limits<uint64_t>::max();
    next_row_to_read_ = std::numeric_limits<uint64_t>::max();
    close_ = false;
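
The snippet above creates the new reader before closing the old one and before touching any member state. A minimal sketch of why that ordering helps follows, assuming a fallible factory: if creation fails, nothing has been committed and the existing reader stays usable. ToyReader, CreateToyReader, and ToyBatchReader are hypothetical names for illustration; the real code uses PAIMON_ASSIGN_OR_RAISE and CreateDataFileReader, and it resets its row counters to a max-value sentinel rather than to 0 as this toy does.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical underlying reader.
struct ToyReader {
  std::string schema;
  bool closed = false;
  void Close() { closed = true; }
};

// Hypothetical fallible factory: an empty schema simulates a failure
// and yields nullptr.
std::unique_ptr<ToyReader> CreateToyReader(const std::string& schema) {
  if (schema.empty()) return nullptr;
  auto reader = std::make_unique<ToyReader>();
  reader->schema = schema;
  return reader;
}

class ToyBatchReader {
 public:
  // Returns true on success. On failure no member is modified.
  bool SetReadSchema(const std::string& schema) {
    // Step 1: do everything that can fail, using only local variables.
    std::unique_ptr<ToyReader> fresh = CreateToyReader(schema);
    if (!fresh) return false;

    // Step 2: commit. Nothing below can fail.
    if (reader_) reader_->Close();
    reader_ = std::move(fresh);
    next_row_ = 0;
    return true;
  }

  // Simulates consuming `rows` rows from the current reader.
  void Advance(std::uint64_t rows) { next_row_ += rows; }

  const ToyReader* reader() const { return reader_.get(); }
  std::uint64_t next_row() const { return next_row_; }

 private:
  std::unique_ptr<ToyReader> reader_;
  std::uint64_t next_row_ = 0;
};
```

In this toy, a failed SetReadSchema("") after three rows have been consumed leaves the old reader open and the row counter at 3, while a later successful call swaps in a fresh reader and resets the counter.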

auto read_schema = arrow::schema({arrow::field("f1", arrow::int32())});
std::unique_ptr<ArrowSchema> c_schema = std::make_unique<ArrowSchema>();
ASSERT_TRUE(arrow::ExportSchema(*read_schema, c_schema.get()).ok());

@lxy-9602 lxy-9602 left a comment

Collaborator

+1


3 participants