Flink: Avoid per-row allocations in Parquet array and map writers by wombatu-kun · Pull Request #16791 · apache/iceberg

wombatu-kun · 2026-06-12T08:15:07Z

FlinkParquetWriters' ArrayDataWriter and MapDataWriter allocate per row when writing array and map columns. For every value, elements(...) / pairs(...) create a fresh iterator, whose constructor in turn calls ArrayData.createElementGetter(...) for the element (and both key and value for maps). For nullable element/value types createElementGetter returns a capturing null-checking wrapper, so it allocates too; the map iterator additionally allocates a ReusableEntry per row.

The element/key/value getters depend only on the column types, which are fixed at writer construction, so they are now built once in the writer. The iterators themselves are reused: the parent RepeatedWriter / RepeatedKeyValueWriter fully consumes the iterator inside a single write() call and never retains it (it drains it in a while loop), and writers are single-threaded, so one reusable iterator instance per writer (reset on each call, with the map's ReusableEntry allocated once) is safe. Nested collections use distinct writer instances, so no iterator is ever re-entered.

The change is identical across the supported Flink versions, so it is applied to v1.20, v2.0, and v2.1 in this PR.

Benchmark

JMH (JDK 17, -prof gc, SingleShotTime), end-to-end Flink Parquet write of 1,000,000 rows of id: long, tags: array<int>, props: map<string, long>, measured for non-nullable and nullable (optional) element/value types.

Collection elements	Before	After	Delta
non-nullable	508.3 MB/op	436.3 MB/op	-14.2%
nullable	421.7 MB/op	349.7 MB/op	-17.1%

(Allocation per 1,000,000-row write.) Wall-clock time was unchanged within noise; this is an allocation / GC-pressure reduction on collection-heavy writes. Existing TestFlinkParquetWriter / TestFlinkParquetReader round-trip coverage (arrays, maps, nested structs, required and optional, dictionary and fallback encodings) passes unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Flink: Avoid per-row allocations in Parquet array and map writers

463a5f2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added the flink label Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flink: Avoid per-row allocations in Parquet array and map writers#16791

Flink: Avoid per-row allocations in Parquet array and map writers#16791
wombatu-kun wants to merge 1 commit into
apache:mainfrom
wombatu-kun:flink-parquet-collection-getter-hoist

wombatu-kun commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wombatu-kun commented Jun 12, 2026

Benchmark

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant