
GH-3495: Optimize PlainValuesWriter with direct ByteBuffer slab writes (~2.5x encode speedup) #3496

Open
iemejia wants to merge 2 commits into apache:master from iemejia:perf-plain-values-writer

Conversation

Member

@iemejia iemejia commented Apr 19, 2026

Summary

Closes #3495.

Two-commit PR optimizing PlainValuesWriter and following up with API cleanup of the now-unused LittleEndianDataOutputStream wrapper.

Commit 1 — PlainValuesWriter direct ByteBuffer writes

Removes the LittleEndianDataOutputStream layer between PlainValuesWriter and CapacityByteArrayOutputStream. Adds writeInt(int)/writeLong(long) methods on CBOS that write directly to its internal ByteBuffer slabs (set to LITTLE_ENDIAN), making the value write a single HotSpot intrinsic instead of a 4-byte decomposition through a temp array and an OutputStream chain.

IntEncodingBenchmark.encodePlain (100k INT32 / invocation, JMH -wi 5 -i 10 -f 2):

  Pattern            Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL           20,944,860       53,231,121    +154% (2.55x)
  RANDOM               20,613,242       53,419,118    +160%
  LOW_CARDINALITY      20,749,103       53,510,247    +158%
  HIGH_CARDINALITY     20,521,786       52,825,012    +157%

The same code path is shared by writeLong, writeFloat, writeDouble, and the length prefix in writeBytes(Binary), so PLAIN-encoded INT64/FLOAT/DOUBLE/BINARY columns benefit too. Decode benchmarks (decodePlain etc.) are unchanged, as expected.
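For illustration, the difference between the two write paths can be sketched as follows. This is a minimal stand-in, not the actual Parquet internals: `LittleEndianWriteSketch` and its methods are hypothetical names, and the real CapacityByteArrayOutputStream manages a chain of growing slabs rather than a single fixed buffer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class LittleEndianWriteSketch {

  // Old path (simplified): decompose the int into 4 bytes and push each
  // through an OutputStream, as the LittleEndianDataOutputStream wrapper does.
  static byte[] viaStreamDecomposition(int v) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(v & 0xFF);
    out.write((v >>> 8) & 0xFF);
    out.write((v >>> 16) & 0xFF);
    out.write((v >>> 24) & 0xFF);
    return out.toByteArray();
  }

  // New path (simplified): one putInt on a LITTLE_ENDIAN ByteBuffer, which
  // HotSpot can compile to a single intrinsic store on LE architectures.
  static byte[] viaByteBuffer(int v) {
    ByteBuffer slab = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    slab.putInt(v);
    return slab.array();
  }

  public static void main(String[] args) {
    int v = 0x04030201;
    // Both paths produce identical little-endian bytes: 01 02 03 04.
    System.out.println(Arrays.equals(viaStreamDecomposition(v), viaByteBuffer(v)));
  }
}
```

Both paths emit the same bytes; the speedup comes purely from replacing four dispatched single-byte writes with one bulk store.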

Commit 2 — Deprecate LittleEndianDataOutputStream, remove last wrapper usages

Pure API cleanup with no measurable performance impact. After commit 1, FixedLenByteArrayPlainValuesWriter and DeltaLengthByteArrayValuesWriter were the last two production usages of LittleEndianDataOutputStream. Both wrapped a CapacityByteArrayOutputStream only to call Binary.writeTo(out), which goes through OutputStream.write(byte[], int, int), so the wrapper added nothing for that call. Removing the wrapper allows marking LittleEndianDataOutputStream as @Deprecated (kept for binary compatibility, scheduled for removal in a future major release).

Benchmarks for the two touched paths (BinaryEncodingBenchmark, JMH -wi 5 -i 10 -f 3, 30 samples per row) are within ±5%, with per-op allocation rates unchanged within 2%, consistent with noise rather than a real effect in either direction. The rationale is code health (one fewer wrapper layer, deprecation of an internal-shaped public class), not performance. Full numbers are in the commit message.

Validation

  • parquet-column: 573 tests pass
  • parquet-common: 308 tests pass
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

Related

This is the second in a small series of focused performance PRs from work in https://github.com/iemejia/parquet-perf. The first was #3494.

How to reproduce the benchmarks

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar 'IntEncodingBenchmark.encodePlain' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

iemejia added 2 commits April 19, 2026 14:14
… writes

PlainValuesWriter previously wrote values through a two-layer abstraction:
PlainValuesWriter -> LittleEndianDataOutputStream -> CapacityByteArrayOutputStream.
Each writeInt() decomposed the int into 4 bytes in a temp writeBuffer[8]
array, then dispatched through the OutputStream chain. Since
CapacityByteArrayOutputStream already uses ByteBuffer slabs internally, we
can write directly to the slab with putInt()/putLong() using LITTLE_ENDIAN
byte order -- a single JVM intrinsic on x86/ARM -- eliminating the byte
decomposition, temp array, and virtual dispatch.

Changes:
- CapacityByteArrayOutputStream: set ByteOrder.LITTLE_ENDIAN on newly
  allocated slabs in addSlab(); add writeInt(int) and writeLong(long)
  methods that use currentSlab.putInt(v) / currentSlab.putLong(v) directly.
- PlainValuesWriter: remove the LittleEndianDataOutputStream field; route
  writeInteger/writeLong/writeFloat/writeDouble/writeBytes through the
  underlying CapacityByteArrayOutputStream directly. writeFloat and
  writeDouble use Float.floatToIntBits / Double.doubleToLongBits + the new
  writeInt/writeLong methods. getBytes() no longer needs to flush a
  buffering layer; close() no longer closes the defunct stream.

Benchmark (IntEncodingBenchmark.encodePlain, 100k INT32 values per
invocation, JMH -wi 3 -i 5 -f 1):

  Pattern           Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL          26,817,451      52,953,193     +97.5% (2.0x)
  RANDOM              28,517,312      37,774,036     +32.5%
  LOW_CARDINALITY     28,705,158      52,819,678     +84.0%
  HIGH_CARDINALITY    28,595,519      37,862,571     +32.4%

The same code path also benefits writeLong, writeFloat, writeDouble, and
the length prefix written by writeBytes(Binary).
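The writeFloat/writeDouble routing described above can be sketched like this. It is a simplified stand-in for the actual PlainValuesWriter change: `FloatBitsSketch`, `writeFloatLE`, and `writeDoubleLE` are hypothetical names, and the real code writes into CBOS slabs rather than freshly allocated buffers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FloatBitsSketch {

  // Reinterpret the float as its raw IEEE-754 bits (no rounding or
  // conversion), then do a single little-endian putInt.
  static byte[] writeFloatLE(float f) {
    ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(Float.floatToIntBits(f));
    return buf.array();
  }

  // Same idea for doubles: raw bits, then one little-endian putLong.
  static byte[] writeDoubleLE(double d) {
    ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(Double.doubleToLongBits(d));
    return buf.array();
  }

  public static void main(String[] args) {
    // 1.0f is 0x3F800000; its little-endian layout is 00 00 80 3F.
    byte[] b = writeFloatLE(1.0f);
    System.out.printf("%02X %02X %02X %02X%n",
        b[0] & 0xFF, b[1] & 0xFF, b[2] & 0xFF, b[3] & 0xFF);
  }
}
```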
…ining wrapper usages

This is an API cleanup commit with no measurable performance impact;
it removes the last two production usages of LittleEndianDataOutputStream
so the class can be deprecated.

After the previous commit removed LittleEndianDataOutputStream from
PlainValuesWriter, two production usages remained:

- FixedLenByteArrayPlainValuesWriter wrapped its CapacityByteArrayOutputStream
  in a LittleEndianDataOutputStream solely to call Binary.writeTo(out) for the
  fixed-length payload. The fixed-length encoding has no length prefix and the
  wrapper exposed no LE-specific behavior used here -- Binary.writeTo() only
  invokes OutputStream.write(byte[], int, int), which the wrapper passes
  through unchanged. The wrapper has been removed and the writer now writes
  the binary payload directly to the underlying CapacityByteArrayOutputStream.
  The wrapper-specific flush() in getBytes() is also gone (CBOS does not
  buffer).

- DeltaLengthByteArrayValuesWriter had the same pattern: a wrapper used only
  for v.writeTo(out) on the concatenated byte-array payload, with lengths
  written through a separate DeltaBinaryPackingValuesWriterForInteger. The
  wrapper has been removed for the same reasons.

With no remaining production usages, LittleEndianDataOutputStream is marked
@Deprecated. The class is retained for binary compatibility (it is part of
the public parquet-common API) and will be removed in a future major release.
The javadoc directs producers of PLAIN-encoded data to write little-endian
values directly into a ByteBuffer with ByteOrder.LITTLE_ENDIAN, which
compiles to a single intrinsic store on little-endian architectures and
avoids the per-call byte decomposition and virtual dispatch performed by
this class.
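The migration path the javadoc recommends can be sketched as below. This is illustrative only, assuming fixed sizing for the example: `PlainPageSketch` and `encode` are hypothetical names, and real PLAIN page writing goes through CBOS slab management rather than one pre-sized buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PlainPageSketch {

  // Write a tiny PLAIN-style run of values (an int, a long, then a
  // length-prefixed byte[] payload) straight into a little-endian
  // ByteBuffer, replacing the deprecated stream wrapper.
  static byte[] encode(int i, long l, byte[] payload) {
    ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 4 + payload.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(i);              // single intrinsic store, no decomposition
    buf.putLong(l);
    buf.putInt(payload.length); // PLAIN BINARY length prefix is a LE int
    buf.put(payload);
    return buf.array();
  }

  public static void main(String[] args) {
    byte[] page = encode(7, 42L, new byte[] {1, 2, 3});
    System.out.println(page.length); // 4 + 8 + 4 + 3 = 19 bytes
  }
}
```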
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks; previously a
working build registered 0, and the default build produced an unrunnable jar.


Development

Successfully merging this pull request may close these issues.

Optimize PlainValuesWriter by writing directly to ByteBuffer slabs (up to 2x encode speedup)
