Optimize BinaryPlainValuesReader by reading directly from ByteBuffer

### Describe the enhancement requested

`BinaryPlainValuesReader.readBytes()` is the hot-path decoder for `BINARY` columns (including those annotated as `STRING`) using `PLAIN` encoding. The current implementation funnels every length read through `BytesUtils.readIntLittleEndian(InputStream)` and every value slice through `ByteBufferInputStream.slice(int)`:

```java
public Binary readBytes() {
  try {
    int length = BytesUtils.readIntLittleEndian(in);
    return Binary.fromConstantByteBuffer(in.slice(length));
  } catch (IOException | RuntimeException e) {
    throw new ParquetDecodingException("could not read bytes at offset " + in.position(), e);
  }
}
```

Two issues per value:

1. `BytesUtils.readIntLittleEndian(InputStream)` calls `in.read()` four times. Each call goes through a `try` / `IOException` plumbing path and a virtual dispatch on `ByteBufferInputStream` (typically resolved as either `SingleBufferInputStream` or `MultiBufferInputStream`).
2. `in.slice(length)` is also a virtual dispatch on `ByteBufferInputStream` for every value.

If the page is materialised as a `MultiBufferInputStream` the cost is even higher because each slice may have to walk a buffer list.
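
To make the per-value cost concrete, here is a minimal sketch that uses only JDK classes. `readIntLittleEndianFromStream` is a hypothetical stand-in that mirrors the four-`read()` pattern described above; it is not the actual `BytesUtils` source. It is compared with a single `ByteBuffer.getInt()` on a little-endian buffer, which is the shape of read the proposal below relies on.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LengthPrefixReadSketch {

  // Hypothetical stand-in for the stream-based path: four read() calls,
  // each a virtual call that is declared to throw IOException.
  static int readIntLittleEndianFromStream(InputStream in) throws IOException {
    int b0 = in.read();
    int b1 = in.read();
    int b2 = in.read();
    int b3 = in.read();
    if ((b0 | b1 | b2 | b3) < 0) {
      throw new EOFException(); // read() returned -1 somewhere
    }
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
  }

  // Buffer-based path: one bounds-checked getInt() on a little-endian buffer.
  static int readIntLittleEndianFromBuffer(ByteBuffer buffer) {
    return buffer.getInt();
  }

  public static void main(String[] args) throws IOException {
    byte[] encoded = {0x0A, 0x00, 0x00, 0x00}; // 10 as a little-endian int
    int fromStream = readIntLittleEndianFromStream(new ByteArrayInputStream(encoded));
    int fromBuffer = readIntLittleEndianFromBuffer(
        ByteBuffer.wrap(encoded).order(ByteOrder.LITTLE_ENDIAN));
    System.out.println(fromStream + " " + fromBuffer); // both paths decode the same value
  }
}
```

Both paths return the same integer; the difference is only in how many calls and checks sit between the caller and the bytes.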

JMH (`BinaryEncodingBenchmark.decodePlain`, 100k values per invocation, JDK 18, `-wi 5 -i 10 -f 3`, 30 samples) on master:

| cardinality | stringLength | ops/s   |
|------------:|-------------:|--------:|
| HIGH        | 10           | 23.11M  |
| HIGH        | 100          | 20.52M  |
| HIGH        | 1000         | 7.07M   |
| LOW         | 10           | 22.89M  |
| LOW         | 100          | 20.35M  |
| LOW         | 1000         | 6.28M   |

### Proposal

Replace the `ByteBufferInputStream` field with a single `ByteBuffer` set up once in `initFromPage`:

```java
@Override
public void initFromPage(int valueCount, ByteBufferInputStream stream) throws IOException {
  // Take the remainder of the page as one little-endian ByteBuffer, once per page.
  int available = stream.available();
  this.buffer = available > 0
      ? stream.slice(available).order(ByteOrder.LITTLE_ENDIAN)
      : ByteBuffer.allocate(0).order(ByteOrder.LITTLE_ENDIAN);
}

@Override
public Binary readBytes() {
  int length = buffer.getInt();                // 4-byte little-endian length prefix
  ByteBuffer valueSlice = buffer.slice();      // view starting at the value bytes
  valueSlice.limit(length);                    // restrict the view to this value
  buffer.position(buffer.position() + length); // advance to the next length prefix
  return Binary.fromConstantByteBuffer(valueSlice);
}
```

The length prefix is now a single `ByteBuffer.getInt()` (one bounds check, no `IOException` plumbing, JIT-friendly intrinsic on little-endian buffers) and each value slice is a direct `ByteBuffer.slice()` instead of a virtual `ByteBufferInputStream.slice(int)`.
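
The read loop can be tried outside Parquet with plain JDK buffers. The following is a sketch under stated assumptions: `PlainBinaryDecodeSketch`, `encode`, `readValue`, and `decodeAll` are illustrative names, and `readValue` returns a `ByteBuffer` view where the real reader would wrap it in a `Binary`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PlainBinaryDecodeSketch {

  // Writes each value as [4-byte little-endian length][bytes].
  static ByteBuffer encode(List<String> values) {
    int size = 0;
    for (String v : values) {
      size += 4 + v.getBytes(StandardCharsets.UTF_8).length;
    }
    ByteBuffer out = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
    for (String v : values) {
      byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
      out.putInt(bytes.length).put(bytes);
    }
    out.flip();
    return out;
  }

  // Same steps as the proposed readBytes(), minus the Binary wrapper.
  static ByteBuffer readValue(ByteBuffer buffer) {
    int length = buffer.getInt();
    ByteBuffer valueSlice = buffer.slice();      // zero-copy view at the value
    valueSlice.limit(length);
    buffer.position(buffer.position() + length); // move past the value
    return valueSlice;
  }

  static List<String> decodeAll(ByteBuffer page) {
    List<String> result = new ArrayList<>();
    while (page.hasRemaining()) {
      result.add(StandardCharsets.UTF_8.decode(readValue(page)).toString());
    }
    return result;
  }

  public static void main(String[] args) {
    ByteBuffer page = encode(List.of("parquet", "", "plain"));
    System.out.println(decodeAll(page));
  }
}
```

Note that in this sketch a corrupt length prefix surfaces as a `BufferUnderflowException` or `IllegalArgumentException` from the JDK rather than as a wrapped decoding exception.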

The trade-off: when the input is a `MultiBufferInputStream` the upfront `stream.slice(available)` call may consolidate the page into a single fresh `ByteBuffer`. This is one allocation per page in exchange for inlined per-value reads, which is a clear win whenever the page contains more than a handful of values.
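
As an illustration of that trade-off, the sketch below merges several chunks into one buffer. It is an assumption about the general shape of such consolidation, not the `MultiBufferInputStream` implementation; `ConsolidateSketch` and `consolidate` are hypothetical names. A single chunk is shared without copying, while several chunks cost one allocation and one copy.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class ConsolidateSketch {

  // Hypothetical helper: returns one little-endian buffer covering all chunks.
  static ByteBuffer consolidate(List<ByteBuffer> chunks) {
    if (chunks.size() == 1) {
      // Single chunk: zero-copy view, no allocation of backing storage.
      return chunks.get(0).slice().order(ByteOrder.LITTLE_ENDIAN);
    }
    int total = 0;
    for (ByteBuffer chunk : chunks) {
      total += chunk.remaining();
    }
    ByteBuffer merged = ByteBuffer.allocate(total).order(ByteOrder.LITTLE_ENDIAN);
    for (ByteBuffer chunk : chunks) {
      merged.put(chunk.duplicate()); // duplicate() leaves the caller's position untouched
    }
    merged.flip();
    return merged;
  }

  public static void main(String[] args) {
    // A length prefix (5) split across two chunks, followed by the value "hello".
    ByteBuffer first = ByteBuffer.wrap(new byte[] {5, 0});
    byte[] tail = new byte[] {0, 0, 'h', 'e', 'l', 'l', 'o'};
    ByteBuffer page = consolidate(List.of(first, ByteBuffer.wrap(tail)));
    int length = page.getInt();
    byte[] value = new byte[length];
    page.get(value);
    System.out.println(length + " " + new String(value, StandardCharsets.US_ASCII));
  }
}
```

Once the page is one contiguous buffer, a length prefix that straddled two chunks can be read with a single `getInt()`.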

Expected speedup (same JMH config):

| cardinality | stringLength | Before  | After   | Δ       |
|------------:|-------------:|--------:|--------:|--------:|
| HIGH        | 10           | 23.11M  | 27.13M  | **+17.4% (1.17x)** |
| HIGH        | 100          | 20.52M  | 22.20M  | **+8.2% (1.08x)**  |
| HIGH        | 1000         | 7.07M   | 7.68M   | **+8.6% (1.09x)**  |
| LOW         | 10           | 22.89M  | 26.46M  | **+15.6% (1.16x)** |
| LOW         | 100          | 20.35M  | 22.16M  | **+8.9% (1.09x)**  |
| LOW         | 1000         | 6.28M   | 7.50M   | **+19.4% (1.19x)** |

Allocation per op is unchanged (~88 B/op = the returned `Binary` + the per-value `ByteBuffer` slice).

The improvement is consistently large at small string lengths (+15.6% to +17.4% at length 10) because the per-value fixed cost (length read + slice) dominates there. At 1000-byte values the cost is increasingly dominated by the value-bytes copy/compare downstream rather than the read itself, yet the measured gain is still ~9–19%; the smallest gains (~8–9%) appear at length 100.

### Scope

- Single file change to `parquet-column/src/main/java/org/apache/parquet/column/values/plain/BinaryPlainValuesReader.java`.
- No public-API change; only the implementations of `readBytes()`, `skip()`, and `initFromPage()` are rewritten.
- All 573 `parquet-column` tests pass.
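
The issue text does not show the rewritten `skip()`. A plausible buffer-based shape, written here against a plain JDK `ByteBuffer` and labeled as an assumption rather than the actual patch, reads the length prefix and jumps over the value without creating a slice. `SkipSketch` is a hypothetical class name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SkipSketch {

  // Assumed shape of a buffer-based skip(): read the length prefix,
  // then advance the position past the value bytes.
  static void skip(ByteBuffer buffer) {
    int length = buffer.getInt();
    buffer.position(buffer.position() + length);
  }

  public static void main(String[] args) {
    ByteBuffer page = ByteBuffer.allocate(13).order(ByteOrder.LITTLE_ENDIAN);
    page.putInt(3).put("abc".getBytes(StandardCharsets.US_ASCII));
    page.putInt(2).put("hi".getBytes(StandardCharsets.US_ASCII));
    page.flip();

    skip(page);                         // jump over "abc"
    System.out.println(page.getInt());  // length prefix of the second value
  }
}
```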

### Relation

Part of a small series of focused performance PRs from work in [parquet-perf](https://github.com/iemejia/parquet-perf). Previous: #3494 (PlainValuesReader), #3496 (PlainValuesWriter), #3500 (Binary.hashCode cache), #3504 (BSS writer), #3506 (BSS reader).
