
GH-3495: Optimize PlainValuesWriter with direct ByteBuffer slab writes (~2.5x encode speedup) #3496

Open
iemejia wants to merge 2 commits into apache:master from iemejia:perf-plain-values-writer

Conversation

Member

@iemejia iemejia commented Apr 19, 2026

Summary

Closes #3495.

Two-commit PR optimizing PlainValuesWriter and following up with API cleanup of the now-unused LittleEndianDataOutputStream wrapper.

Commit 1 — PlainValuesWriter direct ByteBuffer writes

Removes the LittleEndianDataOutputStream layer between PlainValuesWriter and CapacityByteArrayOutputStream. Adds writeInt(int)/writeLong(long) methods on CBOS that write directly to its internal ByteBuffer slabs (set to LITTLE_ENDIAN), making the value write a single HotSpot intrinsic instead of a 4-byte decomposition through a temp array and an OutputStream chain.

IntEncodingBenchmark.encodePlain (100k INT32 / invocation, JMH -wi 5 -i 10 -f 2):

  Pattern            Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL           20,944,860       53,231,121    +154% (2.55x)
  RANDOM               20,613,242       53,419,118    +160%
  LOW_CARDINALITY      20,749,103       53,510,247    +158%
  HIGH_CARDINALITY     20,521,786       52,825,012    +157%

The same code path is shared by writeLong, writeFloat, writeDouble, and the length prefix in writeBytes(Binary), so PLAIN-encoded INT64/FLOAT/DOUBLE/BINARY columns benefit too. Decode benchmarks (decodePlain etc.) are unchanged, as expected.
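For illustration, the difference between the two write paths can be sketched as follows. This is a minimal stand-in, not the actual Parquet internals: `LittleEndianWriteSketch` and its methods are hypothetical names, and the real CapacityByteArrayOutputStream manages a chain of growing slabs rather than a single fixed buffer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class LittleEndianWriteSketch {

  // Old path (simplified): decompose the int into 4 bytes and push each
  // through an OutputStream, as the LittleEndianDataOutputStream wrapper does.
  static byte[] viaStreamDecomposition(int v) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(v & 0xFF);
    out.write((v >>> 8) & 0xFF);
    out.write((v >>> 16) & 0xFF);
    out.write((v >>> 24) & 0xFF);
    return out.toByteArray();
  }

  // New path (simplified): one putInt on a LITTLE_ENDIAN ByteBuffer, which
  // HotSpot can compile to a single intrinsic store on LE architectures.
  static byte[] viaByteBuffer(int v) {
    ByteBuffer slab = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    slab.putInt(v);
    return slab.array();
  }

  public static void main(String[] args) {
    int v = 0x04030201;
    // Both paths produce identical little-endian bytes: 01 02 03 04.
    System.out.println(Arrays.equals(viaStreamDecomposition(v), viaByteBuffer(v)));
  }
}
```

Both paths emit the same bytes; the speedup comes purely from replacing four dispatched single-byte writes with one bulk store.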

Commit 2 — Deprecate LittleEndianDataOutputStream, remove last wrapper usages

Pure API cleanup with no measurable performance impact. After commit 1, FixedLenByteArrayPlainValuesWriter and DeltaLengthByteArrayValuesWriter were the last two production usages of LittleEndianDataOutputStream. Both wrapped a CapacityByteArrayOutputStream only to call Binary.writeTo(out), which goes through OutputStream.write(byte[], int, int), so the wrapper added nothing for that call. Removing the wrapper allows marking LittleEndianDataOutputStream as @Deprecated (kept for binary compatibility, scheduled for removal in a future major release).

Benchmarks for the two touched paths (BinaryEncodingBenchmark, JMH -wi 5 -i 10 -f 3, 30 samples per row) are within ±5%, with per-op allocation rates unchanged within 2%, consistent with noise rather than a real effect in either direction. The rationale is code health (one fewer wrapper layer, deprecation of an internal-shaped public class), not performance. Full numbers are in the commit message.

Validation

  • parquet-column: 573 tests pass
  • parquet-common: 308 tests pass
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

Related

This is the second in a small series of focused performance PRs from work in https://github.com/iemejia/parquet-perf. The first was #3494.

How to reproduce the benchmarks

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar 'IntEncodingBenchmark.encodePlain' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

iemejia added 2 commits April 19, 2026 14:14
… writes

PlainValuesWriter previously wrote values through a two-layer abstraction:
PlainValuesWriter -> LittleEndianDataOutputStream -> CapacityByteArrayOutputStream.
Each writeInt() decomposed the int into 4 bytes in a temp writeBuffer[8]
array, then dispatched through the OutputStream chain. Since
CapacityByteArrayOutputStream already uses ByteBuffer slabs internally, we
can write directly to the slab with putInt()/putLong() using LITTLE_ENDIAN
byte order -- a single JVM intrinsic on x86/ARM -- eliminating the byte
decomposition, temp array, and virtual dispatch.

Changes:
- CapacityByteArrayOutputStream: set ByteOrder.LITTLE_ENDIAN on newly
  allocated slabs in addSlab(); add writeInt(int) and writeLong(long)
  methods that use currentSlab.putInt(v) / currentSlab.putLong(v) directly.
- PlainValuesWriter: remove the LittleEndianDataOutputStream field; route
  writeInteger/writeLong/writeFloat/writeDouble/writeBytes through the
  underlying CapacityByteArrayOutputStream directly. writeFloat and
  writeDouble use Float.floatToIntBits / Double.doubleToLongBits + the new
  writeInt/writeLong methods. getBytes() no longer needs to flush a
  buffering layer; close() no longer closes the defunct stream.

Benchmark (IntEncodingBenchmark.encodePlain, 100k INT32 values per
invocation, JMH -wi 3 -i 5 -f 1):

  Pattern           Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL          26,817,451      52,953,193     +97.5% (2.0x)
  RANDOM              28,517,312      37,774,036     +32.5%
  LOW_CARDINALITY     28,705,158      52,819,678     +84.0%
  HIGH_CARDINALITY    28,595,519      37,862,571     +32.4%

The same code path also benefits writeLong, writeFloat, writeDouble, and
the length prefix written by writeBytes(Binary).
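The writeFloat/writeDouble routing described above can be sketched like this. It is a simplified stand-in for the actual PlainValuesWriter change: `FloatBitsSketch`, `writeFloatLE`, and `writeDoubleLE` are hypothetical names, and the real code writes into CBOS slabs rather than freshly allocated buffers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FloatBitsSketch {

  // Reinterpret the float as its raw IEEE-754 bits (no rounding or
  // conversion), then do a single little-endian putInt.
  static byte[] writeFloatLE(float f) {
    ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(Float.floatToIntBits(f));
    return buf.array();
  }

  // Same idea for doubles: raw bits, then one little-endian putLong.
  static byte[] writeDoubleLE(double d) {
    ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(Double.doubleToLongBits(d));
    return buf.array();
  }

  public static void main(String[] args) {
    // 1.0f is 0x3F800000; its little-endian layout is 00 00 80 3F.
    byte[] b = writeFloatLE(1.0f);
    System.out.printf("%02X %02X %02X %02X%n",
        b[0] & 0xFF, b[1] & 0xFF, b[2] & 0xFF, b[3] & 0xFF);
  }
}
```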
…ining wrapper usages

This is an API cleanup commit with no measurable performance impact;
it removes the last two production usages of LittleEndianDataOutputStream
so the class can be deprecated.

After the previous commit removed LittleEndianDataOutputStream from
PlainValuesWriter, two production usages remained:

- FixedLenByteArrayPlainValuesWriter wrapped its CapacityByteArrayOutputStream
  in a LittleEndianDataOutputStream solely to call Binary.writeTo(out) for the
  fixed-length payload. The fixed-length encoding has no length prefix and the
  wrapper exposed no LE-specific behavior used here -- Binary.writeTo() only
  invokes OutputStream.write(byte[], int, int), which the wrapper passes
  through unchanged. The wrapper has been removed and the writer now writes
  the binary payload directly to the underlying CapacityByteArrayOutputStream.
  The wrapper-specific flush() in getBytes() is also gone (CBOS does not
  buffer).

- DeltaLengthByteArrayValuesWriter had the same pattern: a wrapper used only
  for v.writeTo(out) on the concatenated byte-array payload, with lengths
  written through a separate DeltaBinaryPackingValuesWriterForInteger. The
  wrapper has been removed for the same reasons.

With no remaining production usages, LittleEndianDataOutputStream is marked
@Deprecated. The class is retained for binary compatibility (it is part of
the public parquet-common API) and will be removed in a future major release.
The javadoc directs producers of PLAIN-encoded data to write little-endian
values directly into a ByteBuffer with ByteOrder.LITTLE_ENDIAN, which
compiles to a single intrinsic store on little-endian architectures and
avoids the per-call byte decomposition and virtual dispatch performed by
this class.
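The migration path the javadoc recommends can be sketched as below. This is illustrative only, assuming fixed sizing for the example: `PlainPageSketch` and `encode` are hypothetical names, and real PLAIN page writing goes through CBOS slab management rather than one pre-sized buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PlainPageSketch {

  // Write a tiny PLAIN-style run of values (an int, a long, then a
  // length-prefixed byte[] payload) straight into a little-endian
  // ByteBuffer, replacing the deprecated stream wrapper.
  static byte[] encode(int i, long l, byte[] payload) {
    ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 4 + payload.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(i);              // single intrinsic store, no decomposition
    buf.putLong(l);
    buf.putInt(payload.length); // PLAIN BINARY length prefix is a LE int
    buf.put(payload);
    return buf.array();
  }

  public static void main(String[] args) {
    byte[] page = encode(7, 42L, new byte[] {1, 2, 3});
    System.out.println(page.length); // 4 + 8 + 4 + 3 = 19 bytes
  }
}
```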
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks; previously a
working build registered 0, and the default build produced an unrunnable jar.


Development

Successfully merging this pull request may close these issues.

Optimize PlainValuesWriter by writing directly to ByteBuffer slabs (up to 2x encode speedup)
