Describe the enhancement requested
ByteStreamSplitValuesReader is the symmetric reader for BYTE_STREAM_SPLIT-encoded FLOAT, DOUBLE, INT32, and INT64 columns. On initFromPage it eagerly transposes the entire page from stream-split layout (elementSizeInBytes separate streams of valuesCount bytes each) back to interleaved layout (valuesCount elements of elementSizeInBytes bytes each). The current loop is:

    private byte[] decodeData(ByteBuffer encoded, int valuesCount) {
      byte[] decoded = new byte[encoded.limit()];
      int destByteIndex = 0;
      for (int srcValueIndex = 0; srcValueIndex < valuesCount; ++srcValueIndex) {
        for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
          decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
        }
      }
      return decoded;
    }

Two issues on the hot path:
- Every read goes through ByteBuffer.get(int), which does per-call bounds checks and dispatches through HeapByteBuffer/DirectByteBuffer virtual methods.
- The stream offset (stream * valuesCount) is recomputed for every byte, even though it does not depend on srcValueIndex and takes only elementSizeInBytes distinct values per page.
For a 100k-value FLOAT or INT32 page that is 400k ByteBuffer.get(int) calls; for a DOUBLE or INT64 page it is 800k.
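To make the two layouts concrete, here is a minimal round-trip sketch for 4-byte values. The class name BssLayoutDemo and the encode helper are hypothetical and exist only for illustration; decode mirrors the current loop with elementSizeInBytes fixed at 4. It assumes values are stored little-endian, so stream 0 holds the least significant byte of every value.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BssLayoutDemo {
    // Hypothetical helper: scatter each float's 4 bytes into 4 separate streams.
    static byte[] encode(float[] values) {
        int n = values.length;
        byte[] out = new byte[n * 4];
        for (int i = 0; i < n; i++) {
            int bits = Float.floatToRawIntBits(values[i]);
            for (int stream = 0; stream < 4; stream++) {
                // Stream k holds byte k (least significant first) of every value.
                out[stream * n + i] = (byte) (bits >>> (8 * stream));
            }
        }
        return out;
    }

    // Same transposition as the current loop, with elementSizeInBytes == 4.
    static byte[] decode(ByteBuffer encoded, int valuesCount) {
        byte[] decoded = new byte[valuesCount * 4];
        int dest = 0;
        for (int i = 0; i < valuesCount; i++) {
            for (int stream = 0; stream < 4; stream++, dest++) {
                decoded[dest] = encoded.get(i + stream * valuesCount);
            }
        }
        return decoded;
    }

    public static void main(String[] args) {
        float[] values = {1.0f, -2.5f, 3.25f};
        byte[] split = encode(values);
        byte[] interleaved = decode(ByteBuffer.wrap(split), values.length);
        // Reading the interleaved bytes as little-endian floats recovers the input.
        ByteBuffer view = ByteBuffer.wrap(interleaved).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < values.length; i++) {
            System.out.println(view.getFloat(i * 4));
        }
    }
}
```

Each decoded byte costs one ByteBuffer.get(int) call in this form, which is where the 4 × valuesCount call count above comes from.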
JMH (new ByteStreamSplitDecodingBenchmark, 100k values per invocation, JDK 18, -wi 5 -i 10 -f 3, 30 samples) on master:
| Type | ops/s |
|---|---|
| Float | 47.80M |
| Double | 26.32M |
| Int | 47.07M |
| Long | 26.80M |
Proposal
Restructure decodeData in ByteStreamSplitValuesReader:
- Drop down to a byte[] view of the encoded buffer. When encoded.hasArray() is true (the typical case), use the backing array directly with the correct base offset; otherwise copy once with a single get(byte[]) call. This eliminates the per-byte ByteBuffer.get(int) bounds check and virtual dispatch.
- Specialize loops for the common element sizes (4 and 8). Hoist all stream * valuesCount offsets out of the inner loop into local ints (s0..s3 for floats/ints, s0..s7 for doubles/longs), and write each output slot exactly once in a single sequential pass. The reads come from elementSizeInBytes concurrent sequential streams, which modern hardware prefetchers handle well (typically 8–16 tracked streams per core).
- Generic fallback for arbitrary element sizes (FIXED_LEN_BYTE_ARRAY of any width).
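The three steps above can be sketched roughly as follows. This is an illustrative standalone version, not the actual patch: the class name BssDecodeSketch is hypothetical, elementSizeInBytes is passed as a parameter instead of being read from a field, and the sketch assumes the page data starts at the buffer's current position (the current loop uses absolute indices, which is the same thing when position() is 0).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BssDecodeSketch {
    static byte[] decode(ByteBuffer encoded, int valuesCount, int elementSizeInBytes) {
        int totalBytes = valuesCount * elementSizeInBytes;
        byte[] src;
        int base;
        if (encoded.hasArray()) {
            // Writable heap buffer: read the backing array in place.
            src = encoded.array();
            base = encoded.arrayOffset() + encoded.position();
        } else {
            // Direct or read-only buffer: one bulk copy. duplicate() leaves
            // the caller's position untouched.
            src = new byte[totalBytes];
            encoded.duplicate().get(src);
            base = 0;
        }
        byte[] decoded = new byte[totalBytes];
        if (elementSizeInBytes == 4) {
            // Stream start offsets hoisted out of the loop.
            int s0 = base, s1 = s0 + valuesCount, s2 = s1 + valuesCount, s3 = s2 + valuesCount;
            for (int i = 0, d = 0; i < valuesCount; ++i, d += 4) {
                decoded[d] = src[s0 + i];
                decoded[d + 1] = src[s1 + i];
                decoded[d + 2] = src[s2 + i];
                decoded[d + 3] = src[s3 + i];
            }
        } else if (elementSizeInBytes == 8) {
            int s0 = base, s1 = s0 + valuesCount, s2 = s1 + valuesCount, s3 = s2 + valuesCount;
            int s4 = s3 + valuesCount, s5 = s4 + valuesCount, s6 = s5 + valuesCount, s7 = s6 + valuesCount;
            for (int i = 0, d = 0; i < valuesCount; ++i, d += 8) {
                decoded[d] = src[s0 + i];
                decoded[d + 1] = src[s1 + i];
                decoded[d + 2] = src[s2 + i];
                decoded[d + 3] = src[s3 + i];
                decoded[d + 4] = src[s4 + i];
                decoded[d + 5] = src[s5 + i];
                decoded[d + 6] = src[s6 + i];
                decoded[d + 7] = src[s7 + i];
            }
        } else {
            // Generic fallback (one possible shape): same transposition as the
            // original loop, but over the plain array.
            int d = 0;
            for (int i = 0; i < valuesCount; ++i) {
                for (int stream = 0; stream < elementSizeInBytes; ++stream, ++d) {
                    decoded[d] = src[base + stream * valuesCount + i];
                }
            }
        }
        return decoded;
    }

    public static void main(String[] args) {
        // Two 4-byte values stored as four streams of two bytes each.
        byte[] split = {0x11, 0x21, 0x12, 0x22, 0x13, 0x23, 0x14, 0x24};
        System.out.println(Arrays.toString(decode(ByteBuffer.wrap(split), 2, 4)));
        // prints [17, 18, 19, 20, 33, 34, 35, 36]
    }
}
```

Both specialized branches write decoded strictly left to right, so the only non-sequential access pattern is the set of 4 or 8 read cursors, one per stream.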
Expected speedup (same JMH config):
| Type | Before | After | Δ |
|---|---|---|---|
| Float | 47.80M | 162.29M | +240% (3.4x) |
| Double | 26.32M | 66.00M | +151% (2.5x) |
| Int | 47.07M | 162.18M | +245% (3.5x) |
| Long | 26.80M | 66.00M | +146% (2.5x) |
Scope
- Single file change to parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java.
- No public-API change; only the private decodeData helper is rewritten.
- All 573 parquet-column tests pass; 51 BSS-specific tests pass.
Relation
Symmetric companion to #3504 (writer-side BSS optimization). Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504.