[Parquet] Split byte-array batches transparently when i32 offset would overflow#9504
vigneshsiva11 wants to merge 2 commits into apache:main from
Conversation
…pache#7973)

When reading Parquet byte-array columns (Utf8 / Binary) into Arrow arrays with 32-bit offsets, the reader previously returned an error the moment accumulated data in a batch exceeded 2 GiB. This made large Parquet row groups (e.g. NLP datasets with long text columns) completely unreadable with default settings.

This commit makes the Parquet reader treat batch_size as a *target* rather than a hard limit: when the next value would overflow the i32 offset type, the decoder stops early and returns the partial batch. The decoder's internal position is left at the unread value, so the following read_records() call resumes seamlessly - no rows are lost, duplicated, or reordered.

Key changes
-----------

* OffsetBuffer::would_overflow(data_len) - new inline helper that uses checked_add to safely detect whether appending data_len bytes would exceed the capacity of offset type I, without any mutation.
* ByteArrayDecoderPlain::read - checks would_overflow before each try_push and breaks out of the loop when true; fixes max_remaining_values accounting to subtract actual reads rather than requested reads.
* ByteArrayDecoderDeltaLength::read - same pattern; advances length_offset and data_offset only by what was actually consumed.
* ByteArrayDecoderDelta::read - checks would_overflow inside the callback closure and uses an overflow flag to distinguish a clean stop from a genuine error.
* ByteArrayDecoderDictionary::read - processes dictionary keys one at a time via decoder.read(1, ...) so that the DictIndexDecoder never advances past an unconsumed key on overflow.

Tests
-----

* test_would_overflow - unit test for the new helper covering both i32 and i64 offset types, including the usize overflow edge case.
* test_plain_decoder_partial_read - confirms a 3-value PLAIN page is correctly split across two read() calls with no data lost.

Fixes apache#7973
Pull request overview
This PR updates the Parquet → Arrow byte-array decoding path to avoid failing when building Arrow arrays with 32-bit offsets would exceed the representable offset range, by stopping early and allowing subsequent read_records() calls to resume from the same value.
Changes:
- Add `OffsetBuffer::would_overflow(data_len)` to detect offset-range overflow before mutating buffers.
- Update byte-array decoders (PLAIN, DELTA_LENGTH, DELTA_BYTE_ARRAY, DICTIONARY) to stop decoding before offset overflow and return a partial count.
- Add unit tests for `would_overflow` and for multi-call reading behavior in the plain decoder.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| parquet/src/arrow/buffer/offset_buffer.rs | Adds would_overflow helper and unit tests validating overflow boundaries. |
| parquet/src/arrow/array_reader/byte_array.rs | Uses would_overflow to stop early across byte-array decoding implementations; adds a decoder test. |
```rust
let src_lengths = &self.lengths[self.length_offset..self.length_offset + to_read];

let total_bytes: usize = src_lengths.iter().map(|x| *x as usize).sum();
output.values.reserve(total_bytes);

let mut current_offset = self.data_offset;
let mut read = 0;
for length in src_lengths {
    let value_len = *length as usize;
    let end_offset = current_offset + value_len;

    // Stop early rather than overflow the 32-bit offset buffer.
    // length_offset / data_offset are only advanced by what was
    // actually consumed, so the next read() resumes correctly.
    if output.would_overflow(value_len) {
        break;
    }
```
output.values.reserve(total_bytes) happens before any would_overflow checks. For large total_bytes (the exact scenario this PR targets), this can attempt to reserve multi‑GiB memory and potentially abort/OOM even though the decoder will stop early. Consider capping the reservation to the remaining representable offset range (e.g., min(total_bytes, max_appendable_bytes)) or deferring reservation until after you’ve determined how many values will actually be appended in this call.
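The reviewer's suggestion can be sketched as a small standalone function. `capped_reservation` and its parameter names are hypothetical, not the crate's actual API; the point is only to show clamping the reservation to the headroom an `i32` offset buffer still has:

```rust
// Sketch of the suggested fix: never reserve more than the bytes still
// representable by the i32 offset type, so a multi-GiB `total_bytes` from
// a large page cannot trigger a huge allocation that the decoder would
// abandon anyway when it stops early.
fn capped_reservation(values_len: usize, total_bytes: usize) -> usize {
    // Remaining bytes before offsets would exceed i32::MAX.
    let max_appendable = (i32::MAX as usize).saturating_sub(values_len);
    total_bytes.min(max_appendable)
}

fn main() {
    // Small batch with plenty of headroom: reserve everything requested.
    assert_eq!(capped_reservation(0, 4096), 4096);
    // Near the 2 GiB limit: reserve only the remaining headroom.
    assert_eq!(capped_reservation(i32::MAX as usize - 100, 1 << 20), 100);
    println!("ok");
}
```

The same clamp could feed `output.values.reserve(...)` in place of the unconditional `reserve(total_bytes)`.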
```rust
// Process one key at a time so we can stop cleanly when the output
// buffer would overflow. When the closure returns Err the
// DictIndexDecoder does NOT advance its position, so the same key will
// be retried in the next batch – no values are lost or skipped.
while total_read < len {
    let mut overflow = false;
    let n = self.decoder.read(1, |keys| {
        let key = keys[0] as usize;
        if key + 1 >= dict_offsets.len() {
            return Err(general_err!(
                "dictionary key beyond bounds of dictionary: 0..{}",
                dict_offsets.len().saturating_sub(1)
            ));
        }
        let start = dict_offsets[key].as_usize();
        let end = dict_offsets[key + 1].as_usize();
        let value = &dict_values[start..end];

        if output.would_overflow(value.len()) {
            overflow = true;
            return Err(general_err!("index overflow decoding byte array"));
        }
        // Dictionary values were validated at dictionary-page decode time.
        output.try_push(value, false)
    });
```
This dictionary path now calls self.decoder.read(1, ...) in a loop, which adds substantial per-value overhead compared to decoding indices in batches. If possible, consider decoding a larger chunk of keys at a time and stopping without losing already-consumed keys (e.g., by tracking how many keys in the provided slice were successfully processed and advancing the decoder by that amount), or otherwise document/justify the performance tradeoff since dictionary pages are often hot paths.
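The batched alternative the reviewer describes can be sketched as follows. `push_until_full`, `dict_lens`, and `remaining` are simplified stand-ins (not the crate's real types) for the dictionary value lengths and the remaining i32 offset headroom; the idea is to consume a whole slice of keys and report how many were actually pushed, so the caller could advance the real `DictIndexDecoder` by exactly that amount:

```rust
// Decode a chunk of keys, push values until the next one would overflow,
// and return how many keys were actually consumed. Unconsumed keys stay
// available for the next read() call, so no values are lost or skipped.
fn push_until_full(keys: &[i32], dict_lens: &[usize], remaining: &mut usize) -> usize {
    let mut consumed = 0;
    for &key in keys {
        let len = dict_lens[key as usize];
        if len > *remaining {
            // Stop before overflow; this key stays unconsumed and will be
            // retried on the next call.
            break;
        }
        *remaining -= len;
        consumed += 1;
    }
    consumed
}

fn main() {
    let dict_lens = [4usize, 8, 16];
    // Pretend only 20 bytes of offset headroom remain.
    let mut remaining = 20;
    let keys = [0, 1, 2, 0]; // values of 4, 8, 16, and 4 bytes
    let n = push_until_full(&keys, &dict_lens, &mut remaining);
    // 4 + 8 = 12 fits; the 16-byte value would exceed the remaining 8 bytes.
    assert_eq!(n, 2);
    assert_eq!(remaining, 8);
    println!("ok");
}
```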
```diff
 if self.validate_utf8 {
-    output.check_valid_utf8(initial_values_length)?;
+    output.check_valid_utf8(initial_values_length)?
```
The if self.validate_utf8 { ... } block isn’t terminated with a semicolon before Ok(read), which makes this function fail to compile. Add a trailing ; after the if block (or after the check_valid_utf8(...)? inside it).
```diff
-output.check_valid_utf8(initial_values_length)?
+output.check_valid_utf8(initial_values_length)?;
```
```diff
 if self.validate_utf8 {
-    output.check_valid_utf8(initial_values_length)?;
+    output.check_valid_utf8(initial_values_length)?
```
The if self.validate_utf8 { ... } block is missing a terminating semicolon before Ok(read), so this won’t compile. Add ; after the if block (or after check_valid_utf8(...)? inside the block).
```diff
-output.check_valid_utf8(initial_values_length)?
+output.check_valid_utf8(initial_values_length)?;
```
```rust
let n = self.decoder.read(1, |keys| {
    let key = keys[0] as usize;
    if key + 1 >= dict_offsets.len() {
        return Err(general_err!(
            "dictionary key beyond bounds of dictionary: 0..{}",
            dict_offsets.len().saturating_sub(1)
        ));
    }
```
let key = keys[0] as usize; can wrap negative dictionary indices to huge usize values, and key + 1 can overflow (e.g. key == -1) causing a panic in debug builds. Convert with usize::try_from(keys[0]) (returning an error on negatives) and avoid key + 1 overflow by comparing key >= dict_offsets.len().saturating_sub(1) or using checked_add(1).
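The hardening the reviewer suggests can be sketched as a standalone function; `checked_key` is a hypothetical helper (not part of the PR) that keeps the same bounds semantics as the original `key + 1 >= dict_offsets.len()` check while rejecting negative keys and avoiding the `+ 1` overflow:

```rust
// Validate a raw dictionary key: reject negatives via usize::try_from and
// avoid `key + 1` overflow with checked_add, keeping the same bounds rule
// as the original code (key is valid only when key + 1 < dict_offsets_len).
fn checked_key(raw: i32, dict_offsets_len: usize) -> Result<usize, String> {
    let key = usize::try_from(raw)
        .map_err(|_| format!("negative dictionary key: {raw}"))?;
    match key.checked_add(1) {
        Some(end) if end < dict_offsets_len => Ok(key),
        _ => Err(format!(
            "dictionary key beyond bounds of dictionary: 0..{}",
            dict_offsets_len.saturating_sub(1)
        )),
    }
}

fn main() {
    assert_eq!(checked_key(0, 3), Ok(0));       // in bounds
    assert!(checked_key(2, 3).is_err());        // key + 1 == len: out of bounds
    assert!(checked_key(-1, 3).is_err());       // negative: rejected, no wrap
    println!("ok");
}
```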
Which issue does this PR close?
Fixes apache#7973.
Rationale for this change
When reading Parquet byte-array columns (Utf8 / Binary) into Arrow arrays with 32-bit offsets, the reader errors with "index overflow decoding byte array" as soon as the accumulated string/binary data in a single batch exceeds 2 GiB (i32::MAX bytes).
With the default batch_size of 8,192 rows, this means any column where the average value is larger than ~256 KB cannot be read at all, even though the file is perfectly valid and both pyarrow and DuckDB handle it fine by splitting internally.
The correct fix, as discussed in the issue, is for the Parquet reader to treat `batch_size` as a target rather than a hard limit and emit a smaller `RecordBatch` whenever the next value would overflow the offset type.
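The overflow check at the heart of this fix can be sketched as a standalone function. `would_overflow_i32` is a hypothetical stand-in for the generic `OffsetBuffer::would_overflow`, hard-coded to i32 offsets:

```rust
// Returns true if appending `data_len` bytes after `current_end` bytes of
// accumulated data would exceed what an i32 offset can represent. Uses
// checked_add so the usize addition itself cannot wrap.
fn would_overflow_i32(current_end: usize, data_len: usize) -> bool {
    match current_end.checked_add(data_len) {
        // The usize addition overflowed: certainly too large for i32.
        None => true,
        Some(new_end) => new_end > i32::MAX as usize,
    }
}

fn main() {
    // Plenty of headroom: no overflow.
    assert!(!would_overflow_i32(0, 1024));
    // Landing exactly on i32::MAX is still representable.
    assert!(!would_overflow_i32(i32::MAX as usize - 10, 10));
    // One byte past i32::MAX overflows.
    assert!(would_overflow_i32(i32::MAX as usize, 1));
    // The usize overflow edge case the PR's unit test covers.
    assert!(would_overflow_i32(1, usize::MAX));
    println!("ok");
}
```

Because the check mutates nothing, a decoder can call it before every `try_push` and break out of its loop cleanly, which is exactly the pattern the PR applies to all four decoders.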
What changes are included in this PR?
**parquet/src/arrow/buffer/offset_buffer.rs**

`OffsetBuffer::would_overflow(data_len: usize) -> bool` - an inline, zero-allocation helper that uses `checked_add` to safely test whether appending `data_len` bytes would exceed the representable range of offset type `I`, without mutating any state.

**parquet/src/arrow/array_reader/byte_array.rs**

All four byte-array decoders are updated to call `would_overflow` before each `try_push`. When the check fires the decoder breaks out of its loop and returns the partial count. The decoder's internal position is left pointing at the value that didn't fit, so the next `read_records()` call resumes from exactly that value - no rows are lost, duplicated, or reordered.

- `ByteArrayDecoderPlain::read` - checks `would_overflow` before `try_push`; fixes `max_remaining_values` to subtract actual reads, not requested reads
- `ByteArrayDecoderDeltaLength::read` - advances `length_offset` / `data_offset` only by what was consumed
- `ByteArrayDecoderDelta::read` - checks `would_overflow` inside the callback closure; uses an `overflow` flag to distinguish a clean stop from a real error
- `ByteArrayDecoderDictionary::read` - processes keys one at a time via `decoder.read(1, …)` so `DictIndexDecoder` never advances past an unconsumed key

Are these changes tested?
Yes:
- `test_would_overflow` - unit test for the new helper covering both `i32` and `i64` offset types, including the `usize::MAX` edge case.
- `test_plain_decoder_partial_read` - confirms that a 3-value PLAIN page is correctly split across two `read()` calls with no data lost or duplicated.
Are there any user-facing changes?
No breaking changes. Users who previously hit "index overflow decoding byte array" with large string/binary columns will now get their data returned across multiple `RecordBatch`es transparently, with no API or schema changes required.