Describe the enhancement requested
BinaryPlainValuesReader.readBytes() is the hot-path decoder for BINARY (and STRING) columns using PLAIN encoding. The current implementation funnels every length read through BytesUtils.readIntLittleEndian(InputStream) and every value slice through ByteBufferInputStream.slice(int):
public Binary readBytes() {
  try {
    int length = BytesUtils.readIntLittleEndian(in);
    return Binary.fromConstantByteBuffer(in.slice(length));
  } catch (IOException | RuntimeException e) {
    throw new ParquetDecodingException("could not read bytes at offset " + in.position(), e);
  }
}
Two issues per value:
- BytesUtils.readIntLittleEndian(InputStream) calls in.read() four times. Each call goes through a try / IOException plumbing path and a virtual dispatch on ByteBufferInputStream (typically resolved as either SingleBufferInputStream or MultiBufferInputStream).
- in.slice(length) is also a virtual dispatch on ByteBufferInputStream for every value.
If the page is materialised as a MultiBufferInputStream the cost is even higher because each slice may have to walk a buffer list.
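To make the per-value cost concrete, here is a minimal sketch that contrasts a four-call little-endian read from a plain java.io.InputStream with a single ByteBuffer.getInt() on a little-endian buffer. It uses only JDK classes. Both method names are hypothetical and written for illustration; this is not the actual BytesUtils source.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class LengthPrefixReadSketch {
  // Hypothetical stand-in for a stream-based little-endian int read:
  // four separate read() calls, each of which may throw IOException.
  static int readIntLittleEndianFromStream(InputStream in) throws IOException {
    int b0 = in.read();
    int b1 = in.read();
    int b2 = in.read();
    int b3 = in.read();
    // read() returns -1 at end of stream, so any negative value means a short read.
    if ((b0 | b1 | b2 | b3) < 0) {
      throw new EOFException("stream ended inside a 4-byte length prefix");
    }
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
  }

  // Buffer-based equivalent: one call, one bounds check, no checked exception.
  static int readIntLittleEndianFromBuffer(ByteBuffer buffer) {
    return buffer.order(ByteOrder.LITTLE_ENDIAN).getInt();
  }
}
```

Both methods decode the same four bytes to the same int; the difference is only in how many calls and exception paths sit between the caller and the bytes.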
JMH (BinaryEncodingBenchmark.decodePlain, 100k values per invocation, JDK 18, -wi 5 -i 10 -f 3, 30 samples) on master:
| cardinality | stringLength | ops/s |
|---|---|---|
| HIGH | 10 | 23.11M |
| HIGH | 100 | 20.52M |
| HIGH | 1000 | 7.07M |
| LOW | 10 | 22.89M |
| LOW | 100 | 20.35M |
| LOW | 1000 | 6.28M |
Proposal
Replace the ByteBufferInputStream field with a single ByteBuffer set up once in initFromPage:
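@Override
public void initFromPage(int valueCount, ByteBufferInputStream stream) throws IOException {
  int available = stream.available();
  // Take the rest of the page as one little-endian buffer; an empty page gets an empty buffer.
  this.buffer = available > 0
      ? stream.slice(available).order(ByteOrder.LITTLE_ENDIAN)
      : ByteBuffer.allocate(0).order(ByteOrder.LITTLE_ENDIAN);
}

@Override
public Binary readBytes() {
  // 4-byte little-endian length prefix
  int length = buffer.getInt();
  // Zero-copy view of the value bytes that follow the prefix
  ByteBuffer valueSlice = buffer.slice();
  valueSlice.limit(length);
  // Advance past the value so the next call starts at the next prefix
  buffer.position(buffer.position() + length);
  return Binary.fromConstantByteBuffer(valueSlice);
}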
The length prefix is now a single ByteBuffer.getInt() (one bounds check, no IOException plumbing, JIT-friendly intrinsic on little-endian buffers) and each value slice is a direct ByteBuffer.slice() instead of a virtual ByteBufferInputStream.slice(int).
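The read loop above can be tried outside Parquet with a self-contained sketch that uses only java.nio. It assumes the PLAIN layout for BYTE_ARRAY values (a 4-byte little-endian length followed by the value bytes); the class and method names are hypothetical, and values are returned as plain ByteBuffer slices rather than Binary.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class PlainBinaryDecodeSketch {
  // Build a test page: for each value, a 4-byte little-endian length, then the bytes.
  static ByteBuffer encode(String... values) {
    int size = 0;
    for (String v : values) {
      size += 4 + v.getBytes(StandardCharsets.UTF_8).length;
    }
    ByteBuffer page = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
    for (String v : values) {
      byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
      page.putInt(bytes.length).put(bytes);
    }
    page.flip();
    return page;
  }

  // Same steps as the proposed readBytes(): read the length, slice, advance.
  static ByteBuffer readValue(ByteBuffer page) {
    int length = page.getInt();
    ByteBuffer valueSlice = page.slice();
    valueSlice.limit(length);
    page.position(page.position() + length);
    return valueSlice;
  }

  // Decode every value remaining in the page as a UTF-8 string.
  static List<String> decodeAll(ByteBuffer page) {
    List<String> out = new ArrayList<>();
    while (page.hasRemaining()) {
      out.add(StandardCharsets.UTF_8.decode(readValue(page)).toString());
    }
    return out;
  }
}
```

Each returned slice shares the page's backing storage, so no value bytes are copied until a caller asks for them.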
The trade-off: when the input is a MultiBufferInputStream the upfront stream.slice(available) call may consolidate the page into a single fresh ByteBuffer. This is one allocation per page in exchange for inlined per-value reads, which is a clear win whenever the page contains more than a handful of values.
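As a rough picture of what consolidating a multi-buffer page costs, the sketch below copies the remaining bytes of several buffers into one freshly allocated buffer. The helper `consolidate` is hypothetical and is not how ByteBufferInputStream is implemented; it only illustrates the one-allocation, one-copy-per-page shape of the trade-off.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.List;

final class ConsolidateSketch {
  // Hypothetical helper: merge the remaining bytes of all parts into one buffer.
  static ByteBuffer consolidate(List<ByteBuffer> parts) {
    int total = 0;
    for (ByteBuffer part : parts) {
      total += part.remaining();
    }
    ByteBuffer merged = ByteBuffer.allocate(total);
    for (ByteBuffer part : parts) {
      // duplicate() keeps the caller's position and limit untouched.
      merged.put(part.duplicate());
    }
    merged.flip();
    return merged.order(ByteOrder.LITTLE_ENDIAN);
  }
}
```

After consolidation, a length prefix that straddled two source buffers can be read with a single getInt(), which is what lets the per-value path stay branch-free.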
Expected speedup (same JMH config):
| cardinality | stringLength | Before | After | Δ |
|---|---|---|---|---|
| HIGH | 10 | 23.11M | 27.13M | +17.4% (1.17x) |
| HIGH | 100 | 20.52M | 22.20M | +8.2% (1.08x) |
| HIGH | 1000 | 7.07M | 7.68M | +8.6% (1.09x) |
| LOW | 10 | 22.89M | 26.46M | +15.6% (1.16x) |
| LOW | 100 | 20.35M | 22.16M | +8.9% (1.09x) |
| LOW | 1000 | 6.28M | 7.50M | +19.4% (1.19x) |
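For readers who want to re-derive the Δ column, it appears to be the relative throughput change, after / before - 1, expressed in percent. A small sketch of that arithmetic, with a hypothetical helper name:

```java
final class SpeedupSketch {
  // Relative throughput change in percent, rounded to one decimal place.
  static double deltaPercent(double before, double after) {
    return Math.round((after / before - 1.0) * 1000.0) / 10.0;
  }
}
```

Applied to the HIGH/10 row, deltaPercent(23.11, 27.13) gives 17.4, matching the table.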
Allocation per op is unchanged (~88 B/op = the returned Binary + the per-value ByteBuffer slice).
The improvement is +15.6% to +17.4% at 10-byte values, where the per-value fixed cost (length read + slice) dominates. At 1000-byte values the cost is increasingly dominated by the downstream copy/compare of the value bytes rather than by the read itself, yet the measured gain is still +8.6% (HIGH) to +19.4% (LOW), so the improvement does not shrink uniformly with string length.
Scope
- Single file change to parquet-column/src/main/java/org/apache/parquet/column/values/plain/BinaryPlainValuesReader.java.
- No public-API change; only the implementation of readBytes(), skip(), and initFromPage() is rewritten.
- All 573 parquet-column tests pass.
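The rewritten skip() is not shown in this issue. A minimal sketch of what a buffer-based skip could look like, assuming it mirrors readBytes() without creating a slice; the class name and the `remaining()` accessor are hypothetical and exist only to make the example self-contained:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PlainBinarySkipSketch {
  private final ByteBuffer buffer;

  PlainBinarySkipSketch(ByteBuffer page) {
    // Independent little-endian view over the page, as in initFromPage.
    this.buffer = page.slice().order(ByteOrder.LITTLE_ENDIAN);
  }

  // Assumed shape of skip(): read the length prefix, then jump over the value bytes.
  void skip() {
    int length = buffer.getInt();
    buffer.position(buffer.position() + length);
  }

  // Bytes left in the page; used here only to observe progress.
  int remaining() {
    return buffer.remaining();
  }
}
```

Under this assumption a skipped value allocates nothing, since no slice or Binary is created for it.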
Relation
Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494 (PlainValuesReader), #3496 (PlainValuesWriter), #3500 (Binary.hashCode cache), #3504 (BSS writer), #3506 (BSS reader).