GH-3509: Optimize BinaryPlainValuesReader by reading directly from ByteBuffer #3510
Open
iemejia wants to merge 1 commit into apache:master from
Conversation
iemejia added a commit to iemejia/parquet-java that referenced this pull request on Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor configuration and the AppendingTransformer entries for BenchmarkList / CompilerHints. As a result, the shaded jar built from master fails at runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds jmh-generator-annprocess to maven-compiler-plugin's annotation processor paths, and adds AppendingTransformer entries for META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.
- Adds 11 JMH benchmarks covering the encode/decode paths used by the pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504, apache#3506, apache#3510), so reviewers can reproduce the reported numbers and detect regressions: IntEncodingBenchmark, BinaryEncodingBenchmark, ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark, FixedLenByteArrayEncodingBenchmark, FileReadBenchmark, FileWriteBenchmark, RowGroupFlushBenchmark, ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a working build, or unrunnable at all from a default build).
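For illustration, the two pom fixes described in this commit message would look roughly like the following fragment. This is a sketch, not the actual diff: it assumes the usual maven-compiler-plugin and maven-shade-plugin configuration sections already exist in parquet-benchmarks/pom.xml, and that a `jmh.version` property is defined.

```xml
<!-- In maven-compiler-plugin <configuration>: register the JMH annotation
     processor so META-INF/BenchmarkList is generated at compile time -->
<annotationProcessorPaths>
  <path>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>${jmh.version}</version>
  </path>
</annotationProcessorPaths>

<!-- In maven-shade-plugin <configuration>: merge JMH metadata files
     instead of letting one jar's copy overwrite the others -->
<transformers>
  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
    <resource>META-INF/BenchmarkList</resource>
  </transformer>
  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
    <resource>META-INF/CompilerHints</resource>
  </transformer>
</transformers>
```

Without the AppendingTransformer entries, shading silently drops or truncates the JMH metadata, which is exactly the "Unable to find the resource: /META-INF/BenchmarkList" failure described above.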
Rationale for this change
`BinaryPlainValuesReader.readBytes()` is the hot-path decoder for `BINARY` (and `STRING`) columns using `PLAIN` encoding. The current implementation funnels every length read through `BytesUtils.readIntLittleEndian(InputStream)` and every value slice through `ByteBufferInputStream.slice(int)`. Two issues per value:

- `BytesUtils.readIntLittleEndian(InputStream)` calls `in.read()` four times — full `IOException` plumbing and virtual dispatch on `ByteBufferInputStream` (`SingleBufferInputStream` / `MultiBufferInputStream`).
- `in.slice(length)` is also a virtual dispatch on `ByteBufferInputStream` for every value.

If the page is materialised as a `MultiBufferInputStream` the cost is even higher because each slice may have to walk a buffer list.

What changes are included in this PR?
Replace the `ByteBufferInputStream` field with a single `ByteBuffer` set up once in `initFromPage`. The length prefix is then a single `ByteBuffer.getInt()` (one bounds check, JIT-friendly little-endian intrinsic, no `IOException` plumbing) and each value slice is a direct `ByteBuffer.slice()` instead of a virtual `ByteBufferInputStream.slice(int)`.

Trade-off: when the input is a `MultiBufferInputStream` the upfront `stream.slice(available)` call may consolidate the page into a single fresh `ByteBuffer`. This is one allocation per page in exchange for inlined per-value reads — a clear win whenever the page contains more than a handful of values.

Benchmark
`BinaryEncodingBenchmark.decodePlain` (100k values per invocation, JDK 18, JMH `-wi 5 -i 10 -f 3`, 30 samples per row):

| cardinality | stringLen | Before (ops/s) | After (ops/s) | Improvement |
|---|---|---|---|---|
| HIGH | 10 | 23,114,969 | 27,126,384 | +17.4% (1.17x) |
| HIGH | 100 | 20,516,861 | 22,200,091 | +8.2% (1.08x) |
| HIGH | 1000 | 7,069,927 | 7,679,070 | +8.6% (1.09x) |
| LOW | 10 | 22,885,778 | 26,459,404 | +15.6% (1.16x) |
| LOW | 100 | 20,349,900 | 22,158,675 | +8.9% (1.09x) |
| LOW | 1000 | 6,279,616 | 7,500,811 | +19.4% (1.19x) |

Per-op allocation is unchanged (~88 B/op = the returned `Binary` + the per-value `ByteBuffer` slice). The improvement is largest at small string lengths because the per-value fixed cost (length read + slice) dominates more there; at 1000-byte values the cost is increasingly dominated by the value-bytes work downstream rather than the length read itself, but the gain is still ~9–19% even there.
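To make the shape of the new decode loop concrete, here is a minimal, self-contained sketch. The class and method names are hypothetical; the real implementation lives in `BinaryPlainValuesReader`, returns `Binary` values rather than `String`s, and takes its buffer from `initFromPage`. The sketch shows only the technique: one little-endian `getInt()` per value followed by a zero-copy `ByteBuffer.slice()`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PlainBinaryDecodeSketch {

  // Build a tiny PLAIN-encoded BINARY page: each value is a 4-byte
  // little-endian length prefix followed by the value bytes.
  static ByteBuffer encode(String... values) {
    int size = 0;
    for (String v : values) {
      size += 4 + v.getBytes(StandardCharsets.UTF_8).length;
    }
    ByteBuffer page = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
    for (String v : values) {
      byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
      page.putInt(bytes.length).put(bytes);
    }
    page.flip();
    return page;
  }

  // Decode loop in the style of the optimized reader: one getInt() per
  // value (no InputStream, no IOException plumbing), then a ByteBuffer
  // slice over the value bytes instead of a virtual stream.slice(int).
  static List<String> decode(ByteBuffer page) {
    List<String> out = new ArrayList<>();
    while (page.hasRemaining()) {
      int length = page.getInt();              // little-endian length prefix
      ByteBuffer value = page.slice();         // zero-copy view of the rest
      value.limit(length);                     // clamp the view to this value
      page.position(page.position() + length); // advance past the value
      byte[] bytes = new byte[length];
      value.get(bytes);
      out.add(new String(bytes, StandardCharsets.UTF_8));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(decode(encode("foo", "quux"))); // [foo, quux]
  }
}
```

The per-value work reduces to a bounds-checked `getInt()` and a slice, which is what makes the loop easy for the JIT to inline compared with four `in.read()` calls behind a virtual dispatch.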
Are these changes tested?
Yes. All 573 `parquet-column` tests pass. No new test was added because the returned `Binary` values are byte-identical to before (covered by existing `BinaryPlainValuesReader` round-trip tests).

Are there any user-facing changes?
No. Only an internal reader optimization. No public API, file format, or configuration change.
Closes #3509
Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504, #3506.
How to reproduce the benchmarks
The JMH benchmarks cited above are being added to `parquet-benchmarks` in #3512. Once that lands, the benchmarks can be reproduced from the shaded jar; compare runs against `master` (baseline) and this branch (optimized).