Optimize ByteStreamSplitValuesReader page transposition

### Describe the enhancement requested

`ByteStreamSplitValuesReader` is the symmetric reader for `BYTE_STREAM_SPLIT`-encoded `FLOAT`, `DOUBLE`, `INT32`, and `INT64` columns. On `initFromPage` it eagerly transposes the entire page from stream-split layout (`elementSizeInBytes` separate streams of `valuesCount` bytes each) back to interleaved layout (`valuesCount` elements of `elementSizeInBytes` bytes each). The current loop is:

```java
private byte[] decodeData(ByteBuffer encoded, int valuesCount) {
  byte[] decoded = new byte[encoded.limit()];
  int destByteIndex = 0;
  for (int srcValueIndex = 0; srcValueIndex < valuesCount; ++srcValueIndex) {
    for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
      decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
    }
  }
  return decoded;
}
```

Two issues on the hot path:

1. Every read goes through `ByteBuffer.get(int)`, which does per-call bounds checks and dispatches through `HeapByteBuffer`/`DirectByteBuffer` virtual methods.
2. The stream offset (`stream * valuesCount`) is recomputed for every value, although it depends only on `stream` and takes just `elementSizeInBytes` distinct values per page.

For a 100k-value `FLOAT` or `INT32` page that is 400k `ByteBuffer.get(int)` calls; for a `DOUBLE` or `INT64` page it is 800k.
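To make the layout and the call count concrete, here is a minimal, self-contained sketch of the same index mapping on a toy page. The class name `BssLayoutDemo` and its helpers are hypothetical and exist only for illustration; unlike the real reader, the element size is passed as a parameter rather than read from a field.

```java
import java.nio.ByteBuffer;

final class BssLayoutDemo {
  // Same index mapping as the loop quoted above: byte `s` of value `v`
  // is read from encoded[v + s * valuesCount] and written to
  // decoded[v * elementSizeInBytes + s].
  static byte[] transpose(ByteBuffer encoded, int valuesCount, int elementSizeInBytes) {
    byte[] decoded = new byte[valuesCount * elementSizeInBytes];
    for (int v = 0; v < valuesCount; ++v) {
      for (int s = 0; s < elementSizeInBytes; ++s) {
        decoded[v * elementSizeInBytes + s] = encoded.get(v + s * valuesCount);
      }
    }
    return decoded;
  }

  // Number of absolute ByteBuffer.get(int) calls made by transpose().
  static long getCalls(int valuesCount, int elementSizeInBytes) {
    return (long) valuesCount * elementSizeInBytes;
  }

  // Two 4-byte values A = {10,11,12,13} and B = {20,21,22,23}.
  // Stream-split layout stores byte 0 of every value, then byte 1, and so on.
  static byte[] demo() {
    byte[] encoded = {10, 20, 11, 21, 12, 22, 13, 23};
    return transpose(ByteBuffer.wrap(encoded), 2, 4);
  }
}
```

`BssLayoutDemo.demo()` returns `{10, 11, 12, 13, 20, 21, 22, 23}`, and `getCalls(100_000, 4)` gives the 400k figure quoted above.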

JMH (new `ByteStreamSplitDecodingBenchmark`, 100k values per invocation, JDK 18, `-wi 5 -i 10 -f 3`, 30 samples) on master:

| Type   | ops/s   |
|--------|--------:|
| Float  | 47.80M  |
| Double | 26.32M  |
| Int    | 47.07M  |
| Long   | 26.80M  |

### Proposal

Restructure `decodeData` in `ByteStreamSplitValuesReader`:

1. **Drop down to a `byte[]` view** of the encoded buffer. When `encoded.hasArray()` is true (the typical case), use the backing array directly with the correct base offset; otherwise copy once with a single `get(byte[])` call. This eliminates the per-byte `ByteBuffer.get(int)` bounds check and virtual dispatch.

2. **Specialize loops for the common element sizes (4 and 8)**. Hoist all `stream * valuesCount` offsets out of the inner loop into local ints (`s0..s3` for floats/ints, `s0..s7` for doubles/longs), and write each output slot exactly once in a single sequential pass. The reads come from `elementSizeInBytes` concurrent sequential streams, which modern hardware prefetchers handle well (typically 8–16 tracked streams per core).

3. **Generic fallback** for arbitrary element sizes (`FIXED_LEN_BYTE_ARRAY` of any width).
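The three steps above could look roughly like the following. This is a sketch under stated assumptions, not the code from the actual patch: the class name `BssDecodeSketch` is hypothetical, `elementSizeInBytes` is passed as a parameter instead of being a field, and the encoded data is assumed to start at the buffer's current position with at least `valuesCount * elementSizeInBytes` bytes remaining.

```java
import java.nio.ByteBuffer;

final class BssDecodeSketch {
  static byte[] decodeData(ByteBuffer encoded, int valuesCount, int elementSizeInBytes) {
    int totalBytes = valuesCount * elementSizeInBytes;
    byte[] src;
    int base;
    if (encoded.hasArray()) {
      // Step 1, fast path: read the backing array directly.
      src = encoded.array();
      base = encoded.arrayOffset() + encoded.position();
    } else {
      // Step 1, slow path: one bulk copy; duplicate() leaves the caller's position untouched.
      src = new byte[totalBytes];
      encoded.duplicate().get(src);
      base = 0;
    }
    byte[] decoded = new byte[totalBytes];
    if (elementSizeInBytes == 4) {
      // Step 2: stream start offsets hoisted out of the loop.
      int s0 = base, s1 = s0 + valuesCount, s2 = s1 + valuesCount, s3 = s2 + valuesCount;
      for (int i = 0, d = 0; i < valuesCount; ++i, d += 4) {
        decoded[d] = src[s0 + i];
        decoded[d + 1] = src[s1 + i];
        decoded[d + 2] = src[s2 + i];
        decoded[d + 3] = src[s3 + i];
      }
    } else if (elementSizeInBytes == 8) {
      int s0 = base, s1 = s0 + valuesCount, s2 = s1 + valuesCount, s3 = s2 + valuesCount;
      int s4 = s3 + valuesCount, s5 = s4 + valuesCount, s6 = s5 + valuesCount, s7 = s6 + valuesCount;
      for (int i = 0, d = 0; i < valuesCount; ++i, d += 8) {
        decoded[d] = src[s0 + i];
        decoded[d + 1] = src[s1 + i];
        decoded[d + 2] = src[s2 + i];
        decoded[d + 3] = src[s3 + i];
        decoded[d + 4] = src[s4 + i];
        decoded[d + 5] = src[s5 + i];
        decoded[d + 6] = src[s6 + i];
        decoded[d + 7] = src[s7 + i];
      }
    } else {
      // Step 3: generic fallback, one stream at a time with a strided write.
      for (int stream = 0; stream < elementSizeInBytes; ++stream) {
        int s = base + stream * valuesCount;
        for (int i = 0, d = stream; i < valuesCount; ++i, d += elementSizeInBytes) {
          decoded[d] = src[s + i];
        }
      }
    }
    return decoded;
  }
}
```

Note that `hasArray()` returns false for direct buffers and for read-only heap buffers, so both take the single bulk-copy path. The fallback here iterates stream-by-stream for brevity; a value-major loop with precomputed offsets would be an equally valid choice.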

Expected speedup (same JMH config):

| Type   | Before  | After    | Δ              |
|--------|--------:|---------:|---------------:|
| Float  | 47.80M  | 162.29M  | **+240% (3.4x)** |
| Double | 26.32M  | 66.00M   | **+151% (2.5x)** |
| Int    | 47.07M  | 162.18M  | **+245% (3.4x)** |
| Long   | 26.80M  | 66.00M   | **+146% (2.5x)** |

### Scope

- Single file change to `parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java`.
- No public-API change; only the `private decodeData` helper is rewritten.
- All 573 `parquet-column` tests pass; 51 BSS-specific tests pass.

### Relation

Symmetric companion to #3504 (writer-side BSS optimization). Part of a small series of focused performance PRs from work in [parquet-perf](https://github.com/iemejia/parquet-perf). Previous: #3494, #3496, #3500, #3504.
