Optimize ByteStreamSplitValuesReader page transposition #3505

@iemejia

Describe the enhancement requested

ByteStreamSplitValuesReader is the symmetric reader for BYTE_STREAM_SPLIT-encoded FLOAT, DOUBLE, INT32, and INT64 columns. On initFromPage it eagerly transposes the entire page from stream-split layout (elementSizeInBytes separate streams of valuesCount bytes each) back to interleaved layout (valuesCount elements of elementSizeInBytes bytes each). The current loop is:

private byte[] decodeData(ByteBuffer encoded, int valuesCount) {
  byte[] decoded = new byte[encoded.limit()];
  int destByteIndex = 0;
  for (int srcValueIndex = 0; srcValueIndex < valuesCount; ++srcValueIndex) {
    for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
      decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
    }
  }
  return decoded;
}
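To make the two layouts concrete, here is a minimal sketch on plain byte arrays. The class and method names (`BssLayoutDemo`, `split`, `interleave`) are hypothetical and are not part of parquet-column; `interleave` mirrors the indexing of the loop above without the `ByteBuffer`.

```java
import java.util.Arrays;

final class BssLayoutDemo {

  // Encode direction: byte k of value i lands at k * valuesCount + i,
  // so each of the elementSize streams holds one byte per value.
  static byte[] split(byte[] interleaved, int elementSize) {
    int valuesCount = interleaved.length / elementSize;
    byte[] out = new byte[interleaved.length];
    for (int i = 0; i < valuesCount; i++) {
      for (int k = 0; k < elementSize; k++) {
        out[k * valuesCount + i] = interleaved[i * elementSize + k];
      }
    }
    return out;
  }

  // Decode direction: same index arithmetic as decodeData, on a byte[].
  static byte[] interleave(byte[] streams, int elementSize) {
    int valuesCount = streams.length / elementSize;
    byte[] out = new byte[streams.length];
    int dest = 0;
    for (int i = 0; i < valuesCount; i++) {
      for (int k = 0; k < elementSize; k++) {
        out[dest++] = streams[i + k * valuesCount];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Two 4-byte values: 11 12 13 14 and 21 22 23 24 (hex).
    byte[] interleaved = {0x11, 0x12, 0x13, 0x14, 0x21, 0x22, 0x23, 0x24};
    byte[] streams = split(interleaved, 4);
    // prints [17, 33, 18, 34, 19, 35, 20, 36]
    System.out.println(Arrays.toString(streams));
    // prints true
    System.out.println(Arrays.equals(interleave(streams, 4), interleaved));
  }
}
```

With two 4-byte values, the four streams are two bytes long each, and decoding visits all four streams once per output value.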

Two issues on the hot path:

  1. Every read goes through ByteBuffer.get(int), which does per-call bounds checks and dispatches through HeapByteBuffer/DirectByteBuffer virtual methods.
  2. The stream offset (stream * valuesCount) is recomputed for every value even though it is independent of srcValueIndex and takes only elementSizeInBytes distinct values per page.

For a 100k-value FLOAT or INT32 page that is 400k ByteBuffer.get(int) calls; for a DOUBLE or INT64 page it is 800k.

JMH (new ByteStreamSplitDecodingBenchmark, 100k values per invocation, JDK 18, -wi 5 -i 10 -f 3, 30 samples) on master:

| Type   | ops/s  |
|--------|--------|
| Float  | 47.80M |
| Double | 26.32M |
| Int    | 47.07M |
| Long   | 26.80M |

Proposal

Restructure decodeData in ByteStreamSplitValuesReader:

  1. Drop down to a byte[] view of the encoded buffer. When encoded.hasArray() is true (the typical case), use the backing array directly with the correct base offset; otherwise copy once with a single get(byte[]) call. This eliminates the per-byte ByteBuffer.get(int) bounds check and virtual dispatch.

  2. Specialize loops for the common element sizes (4 and 8). Hoist all stream * valuesCount offsets out of the inner loop into local ints (s0..s3 for floats/ints, s0..s7 for doubles/longs), and write each output slot exactly once in a single sequential pass. The reads come from elementSizeInBytes concurrent sequential streams, which modern hardware prefetchers handle well (typically 8–16 tracked streams per core).

  3. Keep a generic fallback loop for arbitrary element sizes (FIXED_LEN_BYTE_ARRAY of any width).
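
The three steps above could look roughly like the following. This is a sketch under stated assumptions, not the actual patch: the class name `BssDecodeSketch` is hypothetical, `elementSizeInBytes` is passed as a parameter instead of read from a field, and the payload is assumed to start at `encoded.position()` and span `valuesCount * elementSizeInBytes` bytes. Only the 4-byte specialization is written out; an 8-byte one would mirror it with `s0..s7`.

```java
import java.nio.ByteBuffer;

final class BssDecodeSketch {

  // Assumption: the stream-split payload starts at encoded.position()
  // and is exactly valuesCount * elementSizeInBytes bytes long.
  static byte[] decodeData(ByteBuffer encoded, int valuesCount, int elementSizeInBytes) {
    final int totalBytes = valuesCount * elementSizeInBytes;
    final byte[] src;
    final int base;
    if (encoded.hasArray()) {
      // Step 1a: heap buffer, read the backing array in place.
      src = encoded.array();
      base = encoded.arrayOffset() + encoded.position();
    } else {
      // Step 1b: direct or read-only buffer, one bulk copy.
      // duplicate() keeps the caller's position unchanged.
      src = new byte[totalBytes];
      encoded.duplicate().get(src);
      base = 0;
    }

    final byte[] decoded = new byte[totalBytes];
    if (elementSizeInBytes == 4) {
      // Step 2: stream offsets hoisted into locals; each output slot
      // is written exactly once, in increasing order.
      final int s0 = base;
      final int s1 = base + valuesCount;
      final int s2 = base + 2 * valuesCount;
      final int s3 = base + 3 * valuesCount;
      for (int i = 0, d = 0; i < valuesCount; ++i, d += 4) {
        decoded[d] = src[s0 + i];
        decoded[d + 1] = src[s1 + i];
        decoded[d + 2] = src[s2 + i];
        decoded[d + 3] = src[s3 + i];
      }
    } else {
      // Step 3: generic fallback; offsets are computed once per page.
      final int[] offsets = new int[elementSizeInBytes];
      for (int k = 0; k < elementSizeInBytes; ++k) {
        offsets[k] = base + k * valuesCount;
      }
      int d = 0;
      for (int i = 0; i < valuesCount; ++i) {
        for (int k = 0; k < elementSizeInBytes; ++k) {
          decoded[d++] = src[offsets[k] + i];
        }
      }
    }
    return decoded;
  }
}
```

Read-only heap buffers report `hasArray() == false`, so they take the bulk-copy branch along with direct buffers.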

Expected speedup (same JMH config):

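| Type   | Before (ops/s) | After (ops/s) | Δ             |
|--------|----------------|---------------|---------------|
| Float  | 47.80M         | 162.29M       | +240% (3.4x)  |
| Double | 26.32M         | 66.00M        | +151% (2.5x)  |
| Int    | 47.07M         | 162.18M       | +245% (3.4x)  |
| Long   | 26.80M         | 66.00M        | +146% (2.5x)  |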

Scope

  • Single file change to parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java.
  • No public-API change; only the private decodeData helper is rewritten.
  • All 573 parquet-column tests pass; 51 BSS-specific tests pass.

Relation

Symmetric companion to #3504 (writer-side BSS optimization). Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504.
