GH-3495: Optimize PlainValuesWriter with direct ByteBuffer slab writes (~2.5x encode speedup) #3496

Open

iemejia wants to merge 2 commits into apache:master from
Conversation
… writes

PlainValuesWriter previously wrote values through a two-layer abstraction: PlainValuesWriter -> LittleEndianDataOutputStream -> CapacityByteArrayOutputStream. Each `writeInt()` decomposed the int into 4 bytes in a temp `writeBuffer[8]` array, then dispatched through the OutputStream chain. Since CapacityByteArrayOutputStream already uses ByteBuffer slabs internally, we can write directly to the slab with `putInt()`/`putLong()` using LITTLE_ENDIAN byte order -- a single JVM intrinsic on x86/ARM -- eliminating the byte decomposition, temp array, and virtual dispatch.

Changes:

- CapacityByteArrayOutputStream: set `ByteOrder.LITTLE_ENDIAN` on newly allocated slabs in `addSlab()`; add `writeInt(int)` and `writeLong(long)` methods that use `currentSlab.putInt(v)` / `currentSlab.putLong(v)` directly.
- PlainValuesWriter: remove the LittleEndianDataOutputStream field; route `writeInteger`/`writeLong`/`writeFloat`/`writeDouble`/`writeBytes` through the underlying CapacityByteArrayOutputStream directly. `writeFloat` and `writeDouble` use `Float.floatToIntBits` / `Double.doubleToLongBits` plus the new `writeInt`/`writeLong` methods. `getBytes()` no longer needs to flush a buffering layer; `close()` no longer closes the defunct stream.

Benchmark (`IntEncodingBenchmark.encodePlain`, 100k INT32 values per invocation, JMH `-wi 3 -i 5 -f 1`):

| Pattern | Before (ops/s) | After (ops/s) | Improvement |
|---|---|---|---|
| SEQUENTIAL | 26,817,451 | 52,953,193 | +97.5% (2.0x) |
| RANDOM | 28,517,312 | 37,774,036 | +32.5% |
| LOW_CARDINALITY | 28,705,158 | 52,819,678 | +84.0% |
| HIGH_CARDINALITY | 28,595,519 | 37,862,571 | +32.4% |

The same code path also benefits `writeLong`, `writeFloat`, `writeDouble`, and the length prefix written by `writeBytes(Binary)`.
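The equivalence the commit relies on can be sketched in isolation. The class and method names below are illustrative, not Parquet code: `writeIntDirect` stands in for the new slab `putInt` path, `writeIntDecomposed` for the old per-byte path, and the float check shows why routing through `Float.floatToIntBits` is byte-identical to a direct little-endian float write.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class PlainWriteSketch {
    // New path (sketch): a single putInt on a LITTLE_ENDIAN slab,
    // which HotSpot compiles to one store on little-endian hardware.
    static byte[] writeIntDirect(int v) {
        ByteBuffer slab = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        slab.putInt(v);
        return slab.array();
    }

    // Old path (sketch): decompose the int into 4 bytes, lowest byte first.
    static byte[] writeIntDecomposed(int v) {
        return new byte[] {
            (byte) v, (byte) (v >>> 8), (byte) (v >>> 16), (byte) (v >>> 24)
        };
    }

    public static void main(String[] args) {
        int v = 0x12345678;
        // Both paths produce the same little-endian byte sequence.
        System.out.println(Arrays.equals(writeIntDirect(v), writeIntDecomposed(v)));

        // Floats routed through their raw int bits match putFloat byte for byte.
        ByteBuffer viaBits = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        viaBits.putInt(Float.floatToIntBits(1.5f));
        ByteBuffer viaFloat = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        viaFloat.putFloat(1.5f);
        System.out.println(Arrays.equals(viaBits.array(), viaFloat.array()));
    }
}
```

Because the output bytes are identical, the change is purely a fast path: readers see the same PLAIN encoding as before.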
…ining wrapper usages

This is an API cleanup commit with no measurable performance impact; it removes the last two production usages of LittleEndianDataOutputStream so the class can be deprecated.

After the previous commit removed LittleEndianDataOutputStream from PlainValuesWriter, two production usages remained:

- FixedLenByteArrayPlainValuesWriter wrapped its CapacityByteArrayOutputStream in a LittleEndianDataOutputStream solely to call `Binary.writeTo(out)` for the fixed-length payload. The fixed-length encoding has no length prefix and the wrapper exposed no LE-specific behavior used here -- `Binary.writeTo()` only invokes `OutputStream.write(byte[], int, int)`, which the wrapper passes through unchanged. The wrapper has been removed and the writer now writes the binary payload directly to the underlying CapacityByteArrayOutputStream. The wrapper-specific `flush()` in `getBytes()` is also gone (CBOS does not buffer).
- DeltaLengthByteArrayValuesWriter had the same pattern: a wrapper used only for `v.writeTo(out)` on the concatenated byte-array payload, with lengths written through a separate DeltaBinaryPackingValuesWriterForInteger. The wrapper has been removed for the same reasons.

With no remaining production usages, LittleEndianDataOutputStream is marked `@Deprecated`. The class is retained for binary compatibility (it is part of the public parquet-common API) and will be removed in a future major release. The javadoc directs producers of PLAIN-encoded data to write little-endian values directly into a ByteBuffer with `ByteOrder.LITTLE_ENDIAN`, which compiles to a single intrinsic store on little-endian architectures and avoids the per-call byte decomposition and virtual dispatch performed by this class.
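Why the wrapper is removable can be shown with a minimal sketch. `FilterOutputStream` stands in for `LittleEndianDataOutputStream` and `writePayload` mirrors what `Binary.writeTo(out)` does (a plain `write(byte[], int, int)`); all names here are illustrative, not Parquet code.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

public class WrapperRemovalSketch {
    // Mirrors Binary.writeTo(out): a bulk byte-array write, no endianness logic.
    static void writePayload(OutputStream out, byte[] payload) throws IOException {
        out.write(payload, 0, payload.length);
    }

    // New shape: write straight to the sink.
    static byte[] direct(byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writePayload(sink, payload);
        return sink.toByteArray();
    }

    // Old shape: a pass-through wrapper in front of the same sink.
    static byte[] throughWrapper(byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream wrapper = new FilterOutputStream(sink);
        writePayload(wrapper, payload);
        wrapper.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {10, 20, 30, 40};
        // The wrapper contributes nothing to a bulk write: the bytes that
        // reach the sink are identical either way.
        System.out.println(Arrays.equals(direct(payload), throughWrapper(payload)));
    }
}
```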
Force-pushed from 0165e61 to 6964ccb.
arouel approved these changes on Apr 19, 2026.
This was referenced on Apr 19, 2026: Optimize ByteStreamSplitValuesWriter: remove per-value allocation and batch single-byte writes #3503 (Open)
iemejia added a commit to iemejia/parquet-java that referenced this pull request on Apr 19, 2026:

… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor configuration and the AppendingTransformer entries for BenchmarkList / CompilerHints. As a result, the shaded jar built from master fails at runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds jmh-generator-annprocess to maven-compiler-plugin's annotation processor paths, and adds AppendingTransformer entries for META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.
- Adds 11 JMH benchmarks covering the encode/decode paths used by the pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504, apache#3506, apache#3510), so reviewers can reproduce the reported numbers and detect regressions: IntEncodingBenchmark, BinaryEncodingBenchmark, ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark, FixedLenByteArrayEncodingBenchmark, FileReadBenchmark, FileWriteBenchmark, RowGroupFlushBenchmark, ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a working build, or unrunnable at all from a default build).
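The two pom changes described above might look roughly like the sketch below, assuming the usual JMH and shade-plugin coordinates; `${jmh.version}` and the surrounding plugin configuration are placeholders, not the commit's exact contents.

```xml
<!-- maven-compiler-plugin: run the JMH annotation processor so that
     META-INF/BenchmarkList is generated at compile time -->
<annotationProcessorPaths>
  <path>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>${jmh.version}</version>
  </path>
</annotationProcessorPaths>

<!-- maven-shade-plugin: concatenate the JMH metadata files instead of
     letting one jar's copy silently overwrite the others -->
<transformers>
  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
    <resource>META-INF/BenchmarkList</resource>
  </transformer>
  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
    <resource>META-INF/CompilerHints</resource>
  </transformer>
</transformers>
```

Without the AppendingTransformer entries, the shade plugin keeps only one `META-INF/BenchmarkList` from the merged jars, which is how a "working" build can still register zero benchmarks.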
Summary
Closes #3495.
Two-commit PR optimizing `PlainValuesWriter` and following up with API cleanup of the now-unused `LittleEndianDataOutputStream` wrapper.

Commit 1 — `PlainValuesWriter` direct ByteBuffer writes

Removes the `LittleEndianDataOutputStream` layer between `PlainValuesWriter` and `CapacityByteArrayOutputStream`. Adds `writeInt(int)` / `writeLong(long)` methods on CBOS that write directly to its internal `ByteBuffer` slabs (set to `LITTLE_ENDIAN`), making the value write a single HotSpot intrinsic instead of a 4-byte decomposition through a temp array and an `OutputStream` chain. Benchmarked with `IntEncodingBenchmark.encodePlain` (100k INT32 / invocation, JMH `-wi 5 -i 10 -f 2`).

The same code path is shared by `writeLong`, `writeFloat`, `writeDouble`, and the length prefix in `writeBytes(Binary)`, so PLAIN-encoded `INT64`/`FLOAT`/`DOUBLE`/`BINARY` columns benefit too. Decode benchmarks (`decodePlain` etc.) are unchanged, as expected.

Commit 2 — Deprecate `LittleEndianDataOutputStream`, remove last wrapper usages

Pure API cleanup, no measurable performance impact. After commit 1, `FixedLenByteArrayPlainValuesWriter` and `DeltaLengthByteArrayValuesWriter` were the last two production usages of `LittleEndianDataOutputStream`. Both wrapped a `CapacityByteArrayOutputStream` only to call `Binary.writeTo(out)`, which goes through `OutputStream.write(byte[], int, int)`; the wrapper added nothing for that call. Removing the wrapper allows marking `LittleEndianDataOutputStream` as `@Deprecated` (kept for binary compatibility, scheduled for removal in a future major release).

Benchmarks for the two touched paths (`BinaryEncodingBenchmark`, JMH `-wi 5 -i 10 -f 3`, 30 samples per row) are within ±5%, with allocation rates per op unchanged within 2%, consistent with noise rather than a real effect either way. The rationale is code health (one fewer wrapper layer, deprecation of an internal-shaped public class), not performance. Full numbers are in the commit message.

Validation

- `parquet-column`: 573 tests pass
- `parquet-common`: 308 tests pass
- Run with `-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true`

Related
This is the second in a small series of focused performance PRs from work in https://github.com/iemejia/parquet-perf. The first was #3494.
How to reproduce the benchmarks
The JMH benchmarks cited above are being added to `parquet-benchmarks` in #3512. Once that lands, reproduce by comparing runs against `master` (baseline) and this branch (optimized).
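For orientation, a typical invocation of a JMH shaded jar looks like the sketch below. The jar path and flag values are assumptions matching the parameters quoted in the summary, not the exact commands from #3512; only the standard JMH CLI flags (`-wi`, `-i`, `-f`, `-rf`, `-rff`) are relied on.

```shell
# Build the shaded benchmark jar (module and path are assumptions)
mvn -pl parquet-benchmarks -am package -DskipTests

# Run the cited benchmark on master, then repeat on this branch and diff
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
  IntEncodingBenchmark.encodePlain \
  -wi 5 -i 10 -f 2 \
  -rf json -rff baseline.json
```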