
GH-3509: Optimize BinaryPlainValuesReader by reading directly from ByteBuffer #3510

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-binary-plain-reader

Conversation


@iemejia iemejia commented Apr 19, 2026

Rationale for this change

BinaryPlainValuesReader.readBytes() is the hot-path decoder for BINARY (and STRING) columns using PLAIN encoding. The current implementation funnels every length read through BytesUtils.readIntLittleEndian(InputStream) and every value slice through ByteBufferInputStream.slice(int):

public Binary readBytes() {
  try {
    int length = BytesUtils.readIntLittleEndian(in);
    return Binary.fromConstantByteBuffer(in.slice(length));
  } catch (IOException | RuntimeException e) {
    throw new ParquetDecodingException("could not read bytes at offset " + in.position(), e);
  }
}

Two issues per value:

  1. BytesUtils.readIntLittleEndian(InputStream) calls in.read() four times — full IOException plumbing and virtual dispatch on ByteBufferInputStream (SingleBufferInputStream / MultiBufferInputStream).
  2. in.slice(length) is also a virtual dispatch on ByteBufferInputStream for every value.

If the page is materialised as a MultiBufferInputStream the cost is even higher because each slice may have to walk a buffer list.
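To make the per-value fixed cost concrete, the length read boils down to something like the following. This is a simplified, self-contained sketch of what BytesUtils.readIntLittleEndian(InputStream) does, not the exact Parquet source; the class name and EOF handling are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LittleEndianSketch {
  // Simplified sketch of the per-value length read on the old path: four
  // virtual in.read() calls, each returning a single byte, assembled
  // little-endian, with IOException plumbing on every read.
  static int readIntLittleEndian(InputStream in) throws IOException {
    int b0 = in.read();
    int b1 = in.read();
    int b2 = in.read();
    int b3 = in.read();
    if ((b0 | b1 | b2 | b3) < 0) {
      throw new IOException("EOF while reading length prefix");
    }
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
  }

  public static void main(String[] args) throws IOException {
    // 0x04030201 encoded little-endian: low byte first.
    InputStream in = new ByteArrayInputStream(new byte[] {0x01, 0x02, 0x03, 0x04});
    System.out.println(Integer.toHexString(readIntLittleEndian(in))); // prints "4030201"
  }
}
```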

What changes are included in this PR?

Replace the ByteBufferInputStream field with a single ByteBuffer set up once in initFromPage. The length prefix is then a single ByteBuffer.getInt() (one bounds check, JIT-friendly little-endian intrinsic, no IOException plumbing) and each value slice is a direct ByteBuffer.slice() instead of a virtual ByteBufferInputStream.slice(int).
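The new read loop can be sketched roughly as follows. This is a standalone model, not the PR's code: the real reader returns Binary via Binary.fromConstantByteBuffer rather than a raw ByteBuffer, and the class and method names here are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class PlainBinaryReaderSketch {
  // The page buffer, captured once per page (analogous to initFromPage).
  private final ByteBuffer page;

  PlainBinaryReaderSketch(ByteBuffer page) {
    // Parquet PLAIN length prefixes are little-endian.
    this.page = page.order(ByteOrder.LITTLE_ENDIAN);
  }

  ByteBuffer readValue() {
    int length = page.getInt();              // one bounds check, intrinsic LE read
    ByteBuffer value = page.slice();         // zero-copy view starting at the value
    value.limit(length);                     // trim the view to the value bytes
    page.position(page.position() + length); // advance past the value
    return value;
  }

  public static void main(String[] args) {
    byte[] hi = "hi".getBytes(StandardCharsets.US_ASCII);
    ByteBuffer page = ByteBuffer.allocate(4 + hi.length).order(ByteOrder.LITTLE_ENDIAN);
    page.putInt(hi.length);
    page.put(hi);
    page.flip();

    ByteBuffer value = new PlainBinaryReaderSketch(page).readValue();
    byte[] out = new byte[value.remaining()];
    value.get(out);
    System.out.println(new String(out, StandardCharsets.US_ASCII)); // prints "hi"
  }
}
```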

Trade-off: when the input is a MultiBufferInputStream the upfront stream.slice(available) call may consolidate the page into a single fresh ByteBuffer. This is one allocation per page in exchange for inlined per-value reads — a clear win whenever the page contains more than a handful of values.
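The consolidation trade-off amounts to paying one copy up front so that every later read is a cheap single-buffer operation. A hypothetical helper illustrating this (the real code relies on stream.slice(available), which per the description above only allocates a fresh buffer in the multi-buffer case):

```java
import java.nio.ByteBuffer;
import java.util.List;

public class ConsolidateSketch {
  // Copy a page that arrived as several buffers into one contiguous
  // ByteBuffer: a single allocation per page, in exchange for inlined
  // single-buffer reads on every value afterwards.
  static ByteBuffer consolidate(List<ByteBuffer> buffers) {
    int total = 0;
    for (ByteBuffer b : buffers) {
      total += b.remaining();
    }
    ByteBuffer page = ByteBuffer.allocate(total); // the one per-page allocation
    for (ByteBuffer b : buffers) {
      page.put(b.duplicate()); // duplicate() so source positions are untouched
    }
    page.flip();
    return page;
  }

  public static void main(String[] args) {
    ByteBuffer a = ByteBuffer.wrap(new byte[] {1, 2});
    ByteBuffer b = ByteBuffer.wrap(new byte[] {3, 4, 5});
    System.out.println(consolidate(List.of(a, b)).remaining()); // prints "5"
  }
}
```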

Benchmark

BinaryEncodingBenchmark.decodePlain (100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  cardinality  stringLength  Before (ops/s)  After (ops/s)  Δ
  HIGH         10            23.11M          27.13M         +17.4% (1.17x)
  HIGH         100           20.52M          22.20M          +8.2% (1.08x)
  HIGH         1000           7.07M           7.68M          +8.6% (1.09x)
  LOW          10            22.89M          26.46M         +15.6% (1.16x)
  LOW          100           20.35M          22.16M          +8.9% (1.09x)
  LOW          1000           6.28M           7.50M         +19.4% (1.19x)

Per-op allocation is unchanged (~88 B/op = the returned Binary + the per-value ByteBuffer slice).

The improvement is largest at small string lengths because the per-value fixed cost (length read + slice) dominates more there; at 1000-byte values the cost is increasingly dominated by the value-bytes work downstream rather than the length read itself, but the gain is still ~9–19% even there.

Are these changes tested?

Yes. All 573 parquet-column tests pass. No new test was added because the returned Binary values are byte-identical to before (covered by existing BinaryPlainValuesReader round-trip tests).

Are there any user-facing changes?

No. Only an internal reader optimization. No public API, file format, or configuration change.

Closes #3509

Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504, #3506.

How to reproduce the benchmarks

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar 'BinaryEncodingBenchmark.decodePlain' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

GH-3509: Optimize BinaryPlainValuesReader by reading directly from ByteBuffer

BinaryPlainValuesReader.readBytes is the hot-path decoder for BINARY (and
STRING) columns using PLAIN encoding. The current implementation funnels
every length read through BytesUtils.readIntLittleEndian(InputStream),
which calls in.read() four times with full IOException plumbing and
virtual dispatch on ByteBufferInputStream, and slices every value through
a virtual ByteBufferInputStream.slice(int).

This change replaces the ByteBufferInputStream field with a single
ByteBuffer set up once in initFromPage. The length prefix is then a
single ByteBuffer.getInt() (one bounds check, JIT-friendly little-endian
intrinsic, no IOException plumbing) and each value slice is a direct
ByteBuffer.slice() instead of a virtual ByteBufferInputStream.slice(int).

When the input is a MultiBufferInputStream the upfront stream.slice(available)
call may consolidate the page into a single fresh ByteBuffer. This is one
allocation per page in exchange for inlined per-value reads, which is a
clear win whenever the page contains more than a handful of values.

Benchmark (BinaryEncodingBenchmark.decodePlain, 100k values per
invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  cardinality stringLen  Before (ops/s)   After (ops/s)   Improvement
  HIGH        10           23,114,969       27,126,384    +17.4% (1.17x)
  HIGH        100          20,516,861       22,200,091     +8.2% (1.08x)
  HIGH        1000          7,069,927        7,679,070     +8.6% (1.09x)
  LOW         10           22,885,778       26,459,404    +15.6% (1.16x)
  LOW         100          20,349,900       22,158,675     +8.9% (1.09x)
  LOW         1000          6,279,616        7,500,811    +19.4% (1.19x)

Per-op allocation is unchanged (~88 B/op = the returned Binary + the
per-value ByteBuffer slice). The improvement is largest at small string
lengths because the per-value fixed cost dominates more there.

All 573 parquet-column tests pass.
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a
working build, or unrunnable at all from a default build).
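For reference, the two pom fixes described above typically look like the fragments below. This is a sketch of standard JMH and maven-shade-plugin configuration, not the exact diff from that commit; the ${jmh.version} property name is an assumption.

```xml
<!-- maven-compiler-plugin: run the JMH annotation processor so
     META-INF/BenchmarkList is generated at compile time -->
<annotationProcessorPaths>
  <path>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>${jmh.version}</version> <!-- assumed property name -->
  </path>
</annotationProcessorPaths>

<!-- maven-shade-plugin: merge JMH metadata instead of dropping it -->
<transformers>
  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
    <resource>META-INF/BenchmarkList</resource>
  </transformer>
  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
    <resource>META-INF/CompilerHints</resource>
  </transformer>
</transformers>
```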