
GH-3509: Optimize BinaryPlainValuesReader by reading directly from ByteBuffer #3510

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-binary-plain-reader

Conversation


@iemejia iemejia commented Apr 19, 2026

Rationale for this change

BinaryPlainValuesReader.readBytes() is the hot-path decoder for BINARY (and STRING) columns using PLAIN encoding. The current implementation funnels every length read through BytesUtils.readIntLittleEndian(InputStream) and every value slice through ByteBufferInputStream.slice(int):

public Binary readBytes() {
  try {
    int length = BytesUtils.readIntLittleEndian(in);
    return Binary.fromConstantByteBuffer(in.slice(length));
  } catch (IOException | RuntimeException e) {
    throw new ParquetDecodingException("could not read bytes at offset " + in.position(), e);
  }
}

Two issues per value:

  1. BytesUtils.readIntLittleEndian(InputStream) calls in.read() four times — full IOException plumbing and virtual dispatch on ByteBufferInputStream (SingleBufferInputStream / MultiBufferInputStream).
  2. in.slice(length) is also a virtual dispatch on ByteBufferInputStream for every value.

If the page is materialised as a MultiBufferInputStream the cost is even higher because each slice may have to walk a buffer list.
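To make the per-value fixed cost concrete, the length read boils down to something like the following. This is a simplified, self-contained sketch of what BytesUtils.readIntLittleEndian(InputStream) does, not the exact Parquet source; the class name and EOF handling are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LittleEndianSketch {
  // Simplified sketch of the per-value length read on the old path: four
  // virtual in.read() calls, each returning a single byte, assembled
  // little-endian, with IOException plumbing on every read.
  static int readIntLittleEndian(InputStream in) throws IOException {
    int b0 = in.read();
    int b1 = in.read();
    int b2 = in.read();
    int b3 = in.read();
    if ((b0 | b1 | b2 | b3) < 0) {
      throw new IOException("EOF while reading length prefix");
    }
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
  }

  public static void main(String[] args) throws IOException {
    // 0x04030201 encoded little-endian: low byte first.
    InputStream in = new ByteArrayInputStream(new byte[] {0x01, 0x02, 0x03, 0x04});
    System.out.println(Integer.toHexString(readIntLittleEndian(in))); // prints "4030201"
  }
}
```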

What changes are included in this PR?

Replace the ByteBufferInputStream field with a single ByteBuffer set up once in initFromPage. The length prefix is then a single ByteBuffer.getInt() (one bounds check, JIT-friendly little-endian intrinsic, no IOException plumbing) and each value slice is a direct ByteBuffer.slice() instead of a virtual ByteBufferInputStream.slice(int).
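The new read loop can be sketched roughly as follows. This is a standalone model, not the PR's code: the real reader returns Binary via Binary.fromConstantByteBuffer rather than a raw ByteBuffer, and the class and method names here are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class PlainBinaryReaderSketch {
  // The page buffer, captured once per page (analogous to initFromPage).
  private final ByteBuffer page;

  PlainBinaryReaderSketch(ByteBuffer page) {
    // Parquet PLAIN length prefixes are little-endian.
    this.page = page.order(ByteOrder.LITTLE_ENDIAN);
  }

  ByteBuffer readValue() {
    int length = page.getInt();              // one bounds check, intrinsic LE read
    ByteBuffer value = page.slice();         // zero-copy view starting at the value
    value.limit(length);                     // trim the view to the value bytes
    page.position(page.position() + length); // advance past the value
    return value;
  }

  public static void main(String[] args) {
    byte[] hi = "hi".getBytes(StandardCharsets.US_ASCII);
    ByteBuffer page = ByteBuffer.allocate(4 + hi.length).order(ByteOrder.LITTLE_ENDIAN);
    page.putInt(hi.length);
    page.put(hi);
    page.flip();

    ByteBuffer value = new PlainBinaryReaderSketch(page).readValue();
    byte[] out = new byte[value.remaining()];
    value.get(out);
    System.out.println(new String(out, StandardCharsets.US_ASCII)); // prints "hi"
  }
}
```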

Trade-off: when the input is a MultiBufferInputStream the upfront stream.slice(available) call may consolidate the page into a single fresh ByteBuffer. This is one allocation per page in exchange for inlined per-value reads — a clear win whenever the page contains more than a handful of values.
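The consolidation trade-off amounts to paying one copy up front so that every later read is a cheap single-buffer operation. A hypothetical helper illustrating this (the real code relies on stream.slice(available), which per the description above only allocates a fresh buffer in the multi-buffer case):

```java
import java.nio.ByteBuffer;
import java.util.List;

public class ConsolidateSketch {
  // Copy a page that arrived as several buffers into one contiguous
  // ByteBuffer: a single allocation per page, in exchange for inlined
  // single-buffer reads on every value afterwards.
  static ByteBuffer consolidate(List<ByteBuffer> buffers) {
    int total = 0;
    for (ByteBuffer b : buffers) {
      total += b.remaining();
    }
    ByteBuffer page = ByteBuffer.allocate(total); // the one per-page allocation
    for (ByteBuffer b : buffers) {
      page.put(b.duplicate()); // duplicate() so source positions are untouched
    }
    page.flip();
    return page;
  }

  public static void main(String[] args) {
    ByteBuffer a = ByteBuffer.wrap(new byte[] {1, 2});
    ByteBuffer b = ByteBuffer.wrap(new byte[] {3, 4, 5});
    System.out.println(consolidate(List.of(a, b)).remaining()); // prints "5"
  }
}
```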

Benchmark

BinaryEncodingBenchmark.decodePlain (100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  cardinality  stringLength  Before (ops/s)  After (ops/s)  Δ
  HIGH         10            23.11M          27.13M         +17.4% (1.17x)
  HIGH         100           20.52M          22.20M          +8.2% (1.08x)
  HIGH         1000           7.07M           7.68M          +8.6% (1.09x)
  LOW          10            22.89M          26.46M         +15.6% (1.16x)
  LOW          100           20.35M          22.16M          +8.9% (1.09x)
  LOW          1000           6.28M           7.50M         +19.4% (1.19x)

Per-op allocation is unchanged (~88 B/op = the returned Binary + the per-value ByteBuffer slice).

The improvement is largest at small string lengths because the per-value fixed cost (length read + slice) dominates more there; at 1000-byte values the cost is increasingly dominated by the value-bytes work downstream rather than the length read itself, but the gain is still ~9–19% even there.

Are these changes tested?

Yes. All 573 parquet-column tests pass. No new test was added because the returned Binary values are byte-identical to before (covered by existing BinaryPlainValuesReader round-trip tests).

Are there any user-facing changes?

No. Only an internal reader optimization. No public API, file format, or configuration change.

Closes #3509

Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504, #3506.

How to reproduce the benchmarks

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar 'BinaryEncodingBenchmark.decodePlain' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

GH-3509: Optimize BinaryPlainValuesReader by reading directly from ByteBuffer

BinaryPlainValuesReader.readBytes is the hot-path decoder for BINARY (and
STRING) columns using PLAIN encoding. The current implementation funnels
every length read through BytesUtils.readIntLittleEndian(InputStream),
which calls in.read() four times with full IOException plumbing and
virtual dispatch on ByteBufferInputStream, and slices every value through
a virtual ByteBufferInputStream.slice(int).

This change replaces the ByteBufferInputStream field with a single
ByteBuffer set up once in initFromPage. The length prefix is then a
single ByteBuffer.getInt() (one bounds check, JIT-friendly little-endian
intrinsic, no IOException plumbing) and each value slice is a direct
ByteBuffer.slice() instead of a virtual ByteBufferInputStream.slice(int).

When the input is a MultiBufferInputStream the upfront stream.slice(available)
call may consolidate the page into a single fresh ByteBuffer. This is one
allocation per page in exchange for inlined per-value reads, which is a
clear win whenever the page contains more than a handful of values.

Benchmark (BinaryEncodingBenchmark.decodePlain, 100k values per
invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  cardinality stringLen  Before (ops/s)   After (ops/s)   Improvement
  HIGH        10           23,114,969       27,126,384    +17.4% (1.17x)
  HIGH        100          20,516,861       22,200,091     +8.2% (1.08x)
  HIGH        1000          7,069,927        7,679,070     +8.6% (1.09x)
  LOW         10           22,885,778       26,459,404    +15.6% (1.16x)
  LOW         100          20,349,900       22,158,675     +8.9% (1.09x)
  LOW         1000          6,279,616        7,500,811    +19.4% (1.19x)

Per-op allocation is unchanged (~88 B/op = the returned Binary + the
per-value ByteBuffer slice). The improvement is largest at small string
lengths because the per-value fixed cost dominates more there.

All 573 parquet-column tests pass.
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a
working build, or unrunnable at all from a default build).
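For reference, the two pom fixes described above typically look like the fragments below. This is a sketch of standard JMH and maven-shade-plugin configuration, not the exact diff from that commit; the ${jmh.version} property name is an assumption.

```xml
<!-- maven-compiler-plugin: run the JMH annotation processor so
     META-INF/BenchmarkList is generated at compile time -->
<annotationProcessorPaths>
  <path>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>${jmh.version}</version> <!-- assumed property name -->
  </path>
</annotationProcessorPaths>

<!-- maven-shade-plugin: merge JMH metadata instead of dropping it -->
<transformers>
  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
    <resource>META-INF/BenchmarkList</resource>
  </transformer>
  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
    <resource>META-INF/CompilerHints</resource>
  </transformer>
</transformers>
```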