GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar by iemejia · Pull Request #3512 · apache/parquet-java

iemejia · 2026-04-19T20:08:55Z

Summary

Resolves #3511.

The parquet-benchmarks shaded jar built from current master is non-functional — it fails at runtime with RuntimeException: Unable to find the resource: /META-INF/BenchmarkList. This PR fixes that and adds 11 JMH benchmarks covering the encode/decode paths exercised by the open performance PRs, so reviewers can reproduce the reported numbers.

What's broken on master

parquet-benchmarks/pom.xml is missing two pieces of configuration:

maven-compiler-plugin lacks the annotationProcessorPaths / annotationProcessors config for jmh-generator-annprocess, so the JMH annotation processor never runs and META-INF/BenchmarkList / META-INF/CompilerHints are never generated.
maven-shade-plugin lacks AppendingTransformer entries for those two resources, so even if generated they would be dropped during shading.

Both problems are fixed in this PR.

Benchmarks added

11 new files in parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/:

| Benchmark | Coverage |
| --- | --- |
| `IntEncodingBenchmark` | int encode/decode: PLAIN, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, RLE, DICTIONARY |
| `BinaryEncodingBenchmark` | Binary write/read paths, parameterized on length and cardinality |
| `ByteStreamSplitEncodingBenchmark` / `ByteStreamSplitDecodingBenchmark` | BSS for float / double / int / long |
| `FixedLenByteArrayEncodingBenchmark` | FLBA encode/decode |
| `FileReadBenchmark` / `FileWriteBenchmark` | CPU-focused file-level benchmarks |
| `RowGroupFlushBenchmark` | Flush path |
| `ConcurrentReadWriteBenchmark` | Multi-threaded read/write throughput |
| `BlackHoleOutputFile` | `OutputFile` that discards bytes; isolates CPU from I/O |
| `TestDataFactory` | Shared data-generation utilities |

Validation

After this PR, the shaded jar is runnable and registers 87 benchmarks:

$ ./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
$ java -jar parquet-benchmarks/target/parquet-benchmarks.jar -l | wc -l
87

Sanity check — IntEncodingBenchmark.decodePlain reproduces the master baseline cited in #3493/#3494 (~91M ops/s on JDK 21, JMH 1.37, 3 warmup + 5 measurement iterations):

Benchmark                            (dataPattern)   Mode  Cnt         Score         Error  Units
IntEncodingBenchmark.decodePlain        SEQUENTIAL  thrpt    5  93528419.575 ± 1472148.214  ops/s
IntEncodingBenchmark.decodePlain            RANDOM  thrpt    5  90908523.483 ± 1978982.394  ops/s
IntEncodingBenchmark.decodePlain   LOW_CARDINALITY  thrpt    5  92672978.255 ± 2071927.851  ops/s
IntEncodingBenchmark.decodePlain  HIGH_CARDINALITY  thrpt    5  90770177.655 ± 2427904.955  ops/s

Out of scope (deferred)

Modernization of the existing ReadBenchmarks / WriteBenchmarks / NestedNullWritingBenchmarks (Hadoop-free LocalInputFile, parameterization, JMH-idiomatic state setup) is a separate concern and will be proposed in a follow-up PR.

Follow-up

Once this lands, each open perf PR (#3494, #3496, #3500, #3504, #3506, #3510) will be updated with a one-line "How to reproduce" snippet referencing the relevant *Benchmark class.

Replace fastutil's *2IntLinkedOpenHashMap with the plain *2IntOpenHashMap plus a separate primitive-typed list to track insertion order in the five dictionary writers (binary, long, double, float, int).

The Linked variant was used because the dictionary page must be emitted in insertion order, but it pays an avoidable cost on every put: two extra long fields per slot (prev, next), 3-4 scattered writes per insert to fix up the doubly-linked list, and re-stitching on rehash. None of this is vectorizable.

With the plain map plus an append-only list, the hash map is a pure id lookup with the smallest possible slot, and the list is contiguous and cache-friendly to iterate at flush time. Both candidates are fastutil primitive-keyed maps, so this is not a boxing change. The win is structural: an ordering guarantee that was being paid for on every insert is replaced with an explicit append-only list that provides it more cheaply.

Benchmark results (BinaryEncodingBenchmark.encodeDictionary, IntEncodingBenchmark.encodeDictionary - added in apache#3512):

- encodeDictionary (binary, high cardinality, short strings): +23-42%
- encodeDictionary (int, high cardinality): ~+2x
- low-cardinality cases: flat (linked-list overhead doesn't matter when there are few inserts)

No public API change. No file format change. Behavior is identical: dictionary pages emit values in the same order.

Validation: parquet-column 573 tests pass. Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
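The "plain map plus append-only list" idea in this commit message can be illustrated with a minimal sketch. It is not the code from the commit: `java.util.HashMap` stands in for fastutil's primitive-keyed map so the example stays dependency-free (which means it boxes, unlike the real change), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: value -> dictionary id lookup, with insertion order kept in a
 * separate append-only int array instead of a linked hash map.
 * Assumption: HashMap replaces fastutil's Int2IntOpenHashMap for illustration.
 */
final class InsertionOrderedIntDictionary {
  private final Map<Integer, Integer> idByValue = new HashMap<>();
  private int[] valuesInOrder = new int[16];
  private int size = 0;

  /** Returns the existing id for value, or assigns the next id and appends the value. */
  int idFor(int value) {
    Integer existing = idByValue.get(value);
    if (existing != null) {
      return existing;
    }
    int id = size;
    if (size == valuesInOrder.length) {
      valuesInOrder = Arrays.copyOf(valuesInOrder, size * 2);
    }
    valuesInOrder[size++] = value; // contiguous append, iterated at flush time
    idByValue.put(value, id);      // the map is only an id lookup
    return id;
  }

  /** Distinct values in first-seen order, as a dictionary page would emit them. */
  int[] dictionaryValues() {
    return Arrays.copyOf(valuesInOrder, size);
  }
}
```

Calling `idFor(42)`, `idFor(7)`, `idFor(42)` yields ids 0, 1, 0, and `dictionaryValues()` then returns the values in the order they were first seen.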

steveloughran

I'm only just learning effective JMH myself; these all LGTM. My one request is that temp files are put under target/ so that a ./run.sh all command puts everything in the existing temporary directory tree.

steveloughran · 2026-04-22T17:20:59Z

+  }
+
+  @Benchmark
+  @OperationsPerInvocation(VALUE_COUNT)


I didn't know of this trick until I saw your PR; adopted in #3452 earlier today.

steveloughran · 2026-04-22T17:24:08Z

+ * A no-op {@link OutputFile} that discards all written data.
+ * Useful for isolating CPU/encoding cost from filesystem I/O in write benchmarks.
+ */
+public final class BlackHoleOutputFile implements OutputFile {


does this act as a black hole for the benchmarks? or would passing in the blackhole to the constructor and having it used on L62 and L67 be best?
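For readers unfamiliar with the pattern under discussion, here is a minimal, standalone sketch of a sink that discards written bytes while still tracking the position. It is not the PR's `BlackHoleOutputFile` and does not implement Parquet's `OutputFile`; the class name and `getPos()` accessor are hypothetical.

```java
import java.io.OutputStream;

/**
 * Sketch of a byte sink that drops everything written to it but keeps a
 * running position, so callers that record offsets still see sane values.
 * Hypothetical class; not taken from the PR.
 */
final class DiscardingPositionStream extends OutputStream {
  private long pos = 0;

  @Override
  public void write(int b) {
    pos++; // drop the byte, advance the logical offset
  }

  @Override
  public void write(byte[] b, int off, int len) {
    if (off < 0 || len < 0 || len > b.length - off) {
      throw new IndexOutOfBoundsException();
    }
    pos += len; // no copy and no I/O
  }

  /** Number of bytes "written" so far. */
  long getPos() {
    return pos;
  }
}
```

A sink like this removes filesystem cost from the measurement. Whether it also protects the written data from dead-code elimination, which is what a JMH `Blackhole` is for, is the question raised in the comment above, and this sketch does not settle it.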

steveloughran · 2026-04-22T18:34:48Z

+
+  @Setup(Level.Trial)
+  public void setup() throws IOException {
+    tempFile = File.createTempFile("parquet-read-bench-", ".parquet");


there's a constant BenchmarkFiles.TARGET_DIR which defines the dir for benchmarks; it puts them under target/ so maven will clean them up. I used that in my PR so killing a test run in my IDE didn't leave cruft around...I'd recommend the same.
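A minimal sketch of that suggestion, assuming a directory under `target/`. The `TARGET_DIR` value and helper name below are hypothetical; the actual `BenchmarkFiles.TARGET_DIR` constant may have a different type and location.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: create benchmark temp files under target/ so `mvn clean` removes them. */
final class BenchmarkTempFiles {
  // Hypothetical location standing in for BenchmarkFiles.TARGET_DIR.
  static final Path TARGET_DIR = Paths.get("target", "benchmark-tmp");

  static Path createTempParquetFile(String prefix) throws IOException {
    Files.createDirectories(TARGET_DIR); // no-op if it already exists
    return Files.createTempFile(TARGET_DIR, prefix, ".parquet");
  }
}
```

In a `@Setup(Level.Trial)` method this would replace `File.createTempFile(...)`, which writes to the system temp directory and leaves files behind when a run is killed.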

…luesWriter

Two related changes in the DELTA_BYTE_ARRAY write path:

1. DeltaLengthByteArrayValuesWriter: drop the unused LittleEndianDataOutputStream wrapper. Binary.writeTo(arrayOut) works directly with the underlying CapacityByteArrayOutputStream; the LE wrapper added an extra layer of dispatch on every value but never used any LE functionality (writeInt/writeLong/etc.). Add a new writeBytes(byte[], int, int) overload so callers that already have the raw bytes can avoid allocating a Binary wrapper.
2. DeltaByteArrayWriter: tighten suffixWriter field type to DeltaLengthByteArrayValuesWriter (it's always constructed as one) so the new writeBytes(byte[], int, int) overload is callable. Replace the suffix call with the raw-bytes overload, eliminating the per-value Binary.slice() allocation.

Benchmark results (BinaryEncodingBenchmark.encodeDeltaByteArray and encodeDeltaLengthByteArray, added in apache#3512):

- encodeDeltaByteArray (LOW cardinality, len=10): +33% to +55%
- encodeDeltaLengthByteArray (LOW card, len=10): +18% to +21%
- long-string cases: flat (per-value alloc amortized away)

No public API change. No file format change.

Validation: parquet-column 573 tests pass. Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.

steveloughran · 2026-05-13T18:44:14Z

+  public void readIntegers(int[] dest, int offset, int count) {
+    try {
+      // Batch-decode dictionary IDs, then batch-lookup
+      int[] ids = new int[count];


You could have a lambda expression to catch and translate all of these, as per org.apache.hadoop.util.functional.FunctionalIO.
If you are happy with UncheckedIOException, you can use that as is, and submit a PR saying "this shouldn't be private", which it shouldn't be.
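The catch-and-translate lambda pattern suggested here could look roughly like the sketch below. It is a standalone illustration written against the JDK only; the `UncheckedIO` and `IOCallable` names are hypothetical and are not Hadoop's `FunctionalIO` API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch: run a body that may throw IOException and rethrow it unchecked. */
final class UncheckedIO {
  /** A supplier-like callback that is allowed to throw IOException. */
  @FunctionalInterface
  interface IOCallable<T> {
    T call() throws IOException;
  }

  /** Invokes the body, translating any IOException to UncheckedIOException. */
  static <T> T uncheck(IOCallable<T> body) {
    try {
      return body.call();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private UncheckedIO() {}
}
```

A reader method could then wrap its body as `UncheckedIO.uncheck(() -> { ...; return null; })` instead of repeating a `try`/`catch` block in every batch method.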

… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor configuration and the AppendingTransformer entries for BenchmarkList / CompilerHints. As a result, the shaded jar built from master fails at runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds jmh-generator-annprocess to maven-compiler-plugin's annotation processor paths, and adds AppendingTransformer entries for META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.
- Adds 11 JMH benchmarks covering the encode/decode paths used by the pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504, apache#3506, apache#3510), so reviewers can reproduce the reported numbers and detect regressions: IntEncodingBenchmark, BinaryEncodingBenchmark, ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark, FixedLenByteArrayEncodingBenchmark, FileReadBenchmark, FileWriteBenchmark, RowGroupFlushBenchmark, ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a working build, or unrunnable at all from a default build).

Pre-generate deterministic rows for the file and concurrent benchmarks so row construction does not skew the timed section, and make the encoding benchmarks include real dictionary-page and dictionary-decode work instead of only value buffers. Split synthetic RLE dictionary-index decoding into its own benchmark and encode generated binary payloads as UTF-8 explicitly so benchmark inputs stay consistent across runs and platforms.

Make the dictionary encode/decode benchmarks symmetric by routing both sides through a shared EncodedDictionary helper, guard against the dictionary writer falling back to plain encoding (which previously NPE'd in BinaryEncodingBenchmark setup for high-cardinality long strings), and drop redundant close() calls after toDictPageAndClose(). Share the pre-generated row array across threads in ConcurrentReadWriteBenchmark via Scope.Benchmark, eliminating 4x heap duplication and a now-unnecessary ThreadData inner class. Centralize the RNG seed as TestDataFactory.DEFAULT_SEED and add seed-overload variants for the int and binary generators so generators in the same setup no longer share a Random and silently depend on call order. Wrap the RLE encoder in try-with-resources and validate that LOW_CARDINALITY_DISTINCT fits within the configured bit width.
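The seed handling described above (one central default seed, and each generator owning its own `Random`) can be sketched as follows. The method names and the seed value are hypothetical and are not copied from `TestDataFactory`.

```java
import java.util.Random;

/** Sketch: deterministic generators that each own a Random built from an explicit seed. */
final class SeededData {
  // Hypothetical value; the commit only says the seed is centralized.
  static final long DEFAULT_SEED = 42L;

  /** Full-range random ints; the same seed always yields the same array. */
  static int[] randomInts(int count, long seed) {
    Random random = new Random(seed); // private to this call, so call order is irrelevant
    int[] out = new int[count];
    for (int i = 0; i < count; i++) {
      out[i] = random.nextInt();
    }
    return out;
  }

  /** Ints drawn from [0, distinct), for low-cardinality inputs. */
  static int[] lowCardinalityInts(int count, int distinct, long seed) {
    Random random = new Random(seed);
    int[] out = new int[count];
    for (int i = 0; i < count; i++) {
      out[i] = random.nextInt(distinct);
    }
    return out;
  }

  private SeededData() {}
}
```

Because no `Random` instance is shared, adding or reordering generator calls in a setup method does not change the data any other generator produces.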

Benchmarks raw compress/decompress throughput for each supported codec (SNAPPY, ZSTD, LZ4_RAW, GZIP) at page sizes 8KB, 64KB, and 256KB using the heap-based CodecFactory path. Input data mixes sequential, repeated, low-range random, and full random patterns for realistic compression ratios.

- Add RLE encodeDictionaryIds benchmark to cover par9 encoder pack32Values fast path (previously only decode was benchmarked)
- Trim CompressionBenchmark page sizes to boundary conditions (64K, 1MB) to cut redundant mid-points
- Increase FileRead/FileWriteBenchmark SS iterations (warmup 3->5, measurement 5->10) for better statistical stability
- Increase RowGroupFlushBenchmark iterations (warmup 2->3, measurement 3->5) for improved confidence with 2 param combos

Restore decodeDictionaryIdsBatch and decodeValuesReaderBatch from the original cherry-pick, now that the par13 batch read APIs are available. Add decodeValuesReader for production-path (ValuesReader wrapper) coverage.

Five RLE benchmark methods now cover par9 and par13:

- encodeDictionaryIds: RLE encoder pack32Values fast path
- decodeDictionaryIds: per-value RLE decoder
- decodeDictionaryIdsBatch: batch RLE decoder via readInts()
- decodeValuesReader: per-value via ValuesReader wrapper
- decodeValuesReaderBatch: batch via ValuesReader.readIntegers()

…/DOUBLE) Covers per-value and batch decode paths for PlainValuesReader across all four numeric primitive types. Uses pre-allocated destination arrays to avoid per-invocation allocation noise in batch measurements.

Add decodeFloatBatch, decodeDoubleBatch, decodeIntBatch, decodeLongBatch benchmarks with pre-allocated destination arrays to measure readXxx(dest, offset, count) throughput for all four BSS primitive types.

BooleanEncodingBenchmark exercises both encoding paths across six data patterns: ALL_TRUE, ALL_FALSE, ALTERNATING, RANDOM, MOSTLY_TRUE_99, MOSTLY_FALSE_99.

Key findings (100K values):

- Encode: V1 PLAIN is data-independent (~880M ops/s). V2 RLE ranges from 2,344M (ALL_FALSE, +166%) to 192M (RANDOM, -78%).
- Decode: V2 RLE always >= V1 PLAIN, from +154% (ALL_FALSE) to +7% (ALTERNATING). The RLE decode penalty for random data is negligible.

The severe RLE encode penalty for random data (4.6x slower than PLAIN) suggests the V1/V2 split is well-justified: V2 RLE is ideal for the common case of skewed boolean columns, while V1 PLAIN is safer for high-entropy data.

…2 RLE) Adds decodePlainV1Batch and decodeRleV2Batch benchmark methods that exercise the new readBooleans() batch API. Uses a pre-allocated boolean[] destination array to isolate decode throughput from allocation overhead.

…ypes

Covers encode + decode (scalar and batch) paths for all four type-specific dictionary implementations: Long2IntOpenHashMap, Float2IntOpenHashMap, Double2IntOpenHashMap, and Object2IntOpenHashMap (for FLBA). Two data patterns exercise low-cardinality (100 distinct values, ~100% hit rate) and high-cardinality (all unique, stresses hash map growth).

Also adds TestDataFactory generators for long[], float[], double[], and fixed-length Binary[] data with configurable cardinality.

Characterization results (100K values, JDK 25, Compiler Blackholes):

- Batch decode shows +60-67% over scalar for LONG/FLOAT/DOUBLE
- HIGH_CARDINALITY encode is 6-7x slower than LOW (hash map pressure)
- FLBA encode is 14-108M ops/s (Binary hashing overhead)

…ecode

Rewrites FixedLenByteArrayEncodingBenchmark from a single encodePlain() method to full coverage of all four FLBA-supported encodings (PLAIN, DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT, DICTIONARY) with both encode and decode benchmarks. Adds parameterized fixedLength (2=FLOAT16, 12=INT96, 16=UUID) and dataPattern (RANDOM, LOW_CARDINALITY) axes.

Characterization results (100K values, JDK 25, fixedLength=16):

- Dictionary decode: 543M ops/s (fastest, avoids 16B copy per value)
- PLAIN decode: 184M ops/s (slice + Binary wrapping)
- BSS/Delta decode: ~87M ops/s (byte scatter/prefix overhead)
- BSS excels at fixedLength=2: 368M ops/s (trivial 2-stream transpose)

LZ4_RAW was optimized (+47-77% decompress throughput) and has a micro-benchmark in CompressionBenchmark, but was missing from the end-to-end file read/write benchmarks. Adding it enables direct comparison with SNAPPY, ZSTD, and GZIP at the full-file level.

Add encodePlainV1Batch and encodeRleV2Batch benchmarks that exercise the new writeBooleans() batch encoding path, complementing the existing scalar encode benchmarks.

…LOAT/DOUBLE) New PlainEncodingBenchmark class with scalar vs batch comparison for all four numeric types. Also adds encodePlainBatch to IntEncodingBenchmark for consistency.

…dePlainBatch) Exercises the new readBinaries()/writeBinaries() batch APIs for FIXED_LEN_BYTE_ARRAY PLAIN encoding. Results: decode batch +165-245%, encode batch +19-81%.

- Add brotli-codec dependency to parquet-benchmarks (profile-gated, x86_64 only)
- Include BROTLI in @Param codec list alongside SNAPPY, ZSTD, LZ4_RAW, GZIP
- Add jitpack.io repository for brotli-codec resolution

Bypass the Hadoop BrotliCodec/stream wrapper for BROTLI compression and decompression by using org.meteogroup.jbrotli's native JNI bindings directly with ByteBuffer support via reflection (brotli-codec remains runtime scope). This eliminates intermediate buffer copies and the BrotliStreamCompressor state machine overhead.

Changes:

- DirectCodecFactory: Add BrotliDirectCompressor (quality=1, matching Hadoop default) and BrotliDirectDecompressor using one-shot jbrotli API via reflection
- Load native library eagerly with graceful fallback to Hadoop codec path
- CompressionBenchmark: Switch from heap CodecFactory to DirectCodecFactory to benchmark the actual production code path

Results at 64KB page size:

- Compress: 6,746 -> 9,662 ops/s (1.43x speedup)
- Decompress: 2,534 -> 2,786 ops/s (1.10x speedup)

Replace per-value getXxx(offset) loops with position()+asXxxBuffer().get() bulk copy in readFloats/readDoubles/readIntegers/readLongs. The decoded data buffer is a contiguous heap byte[] in LE order, making view buffer bulk reads a single memcpy via Unsafe.copyMemory.

Benchmark results (100K values, BSS FLOAT batch):

- Before: ~1,228M ops/s
- After: ~1,442M ops/s (+17%)

INT32/INT64/DOUBLE show negligible change because BSS invocation cost is dominated by page transposition in initFromPage, not the read loop.
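The difference between the per-value loop and the view-buffer bulk copy described in this commit message can be sketched with plain `java.nio`. The class and method names are hypothetical, and the sketch assumes the input is a heap `byte[]` holding little-endian floats, as the message states.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch: two ways to copy little-endian floats out of a byte[] into a float[]. */
final class BulkFloatRead {
  /** Per-value path: one absolute getFloat call per element. */
  static void readFloatsLoop(byte[] data, int byteOffset, float[] dest, int destOffset, int count) {
    ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < count; i++) {
      dest[destOffset + i] = buf.getFloat(byteOffset + i * Float.BYTES);
    }
  }

  /** Bulk path: position the byte buffer, then copy through a FloatBuffer view. */
  static void readFloatsBulk(byte[] data, int byteOffset, float[] dest, int destOffset, int count) {
    ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    buf.position(byteOffset);
    // The view starts at the current position and inherits the byte order.
    buf.asFloatBuffer().get(dest, destOffset, count);
  }

  private BulkFloatRead() {}
}
```

Both methods produce the same output; the bulk form hands the whole range to the JDK in one call, which is what lets it be compiled into a block copy on suitable platforms.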

…ns()

Replace ByteBitPackingValuesReader delegation in BooleanPlainValuesReader with direct bit extraction from the page byte[]. The scalar path uses a single array access + shift + mask instead of the 8-element int[] buffer and packer dispatch. The batch path (readBooleans) unrolls 8 booleans per byte with constant masks.

For RLE (V2), add a native readBooleans() method that uses Arrays.fill for RLE runs (constant-time for uniform data) and direct int-to-boolean conversion for packed groups, avoiding the intermediate int[] allocation of the readInts() path.

Benchmark results (1M values, JDK 25, Compiler Blackholes):

- V1 PLAIN scalar: ~620M -> ~1,528-1,618M ops/s (+150%)
- V1 PLAIN batch: ALL_TRUE/FALSE ~5,000M (+680%), RANDOM 2,757M (+337%)
- V2 RLE batch: ALL_TRUE/FALSE ~190B (fill), RANDOM 1,335M (+93%)

Replace the per-bit unrolled extraction loop with a static boolean[256][8] lookup table + System.arraycopy. Each byte maps to its 8 pre-decoded booleans, and the 8-byte copy is emitted by HotSpot as a single 64-bit load/store pair, the boolean equivalent of asIntBuffer().get() for ints.

For RLE PACKED groups (bitWidth=1), bypass the int[] intermediate and read directly from the raw packed bytes via the same lookup table.

This makes batch decode throughput independent of data pattern:

- V1 PLAIN batch RANDOM: 2,757M -> 5,047M ops/s (+83%)
- V2 RLE batch RANDOM: 1,335M -> 1,618M ops/s (+21%)
- V2 RLE batch MOSTLY_TRUE_99: 3,205M -> 3,745M ops/s (+17%)
- Uniform patterns (ALL_TRUE/FALSE): unchanged (still Arrays.fill)
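A minimal sketch of the lookup-table technique named in this commit message, assuming bits are packed least-significant-bit first within each byte. The class and method names are hypothetical, and the real reader also handles page offsets and RLE framing that are omitted here.

```java
/** Sketch: decode bit-packed booleans with a 256-entry table and System.arraycopy. */
final class BooleanLookupDecode {
  // TABLE[b][i] is bit i (LSB first, an assumption of this sketch) of byte value b.
  private static final boolean[][] TABLE = new boolean[256][8];

  static {
    for (int b = 0; b < 256; b++) {
      for (int bit = 0; bit < 8; bit++) {
        TABLE[b][bit] = ((b >>> bit) & 1) != 0;
      }
    }
  }

  /** Decodes the first count booleans from packed into dest[0..count). */
  static void decode(byte[] packed, boolean[] dest, int count) {
    int fullBytes = count / 8;
    for (int i = 0; i < fullBytes; i++) {
      // One table row is the 8 decoded values of this byte.
      System.arraycopy(TABLE[packed[i] & 0xFF], 0, dest, i * 8, 8);
    }
    int remaining = count % 8;
    if (remaining > 0) {
      System.arraycopy(TABLE[packed[fullBytes] & 0xFF], 0, dest, fullBytes * 8, remaining);
    }
  }

  private BooleanLookupDecode() {}
}
```

The work per byte is a table index and a fixed-size copy with no data-dependent branches, which is consistent with the message's observation that throughput stops depending on the data pattern.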

…king

Refactor BooleanPlainValuesWriter to pack bits directly into bytes instead of delegating through ByteBitPackingValuesWriter and the generic int[8]-based ByteBasedBitPackingEncoder.

Add batch writeBooleans() API to ValuesWriter with optimized overrides:

- PLAIN: processes 8 booleans at a time into single bytes with OR/shift, eliminating the per-value method call chain and int[] intermediate.
- RLE: pre-scans for runs >= 8 to emit RLE directly, fills partial bit-packed groups from run boundaries to avoid spurious padding.

PLAIN scalar improves +69% (890M -> 1,500M ops/s) from the refactoring. PLAIN batch: +184% over old scalar (2,528M for RANDOM). RLE batch: +278% for ALL_FALSE, +95% for MOSTLY_*, +36% for ALTERNATING.
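The PLAIN batch path described above (eight booleans folded into one byte with OR/shift) can be sketched as below. The names are hypothetical, bits are assumed to be packed least-significant-bit first, and the sketch returns a fresh array rather than writing to a page stream as the real writer would.

```java
/** Sketch: pack booleans into bytes, 8 values per byte, LSB first (assumed bit order). */
final class BooleanPack {
  static byte[] pack(boolean[] values) {
    byte[] out = new byte[(values.length + 7) / 8];
    int fullBytes = values.length / 8;
    int v = 0;
    // Fast path: build each full byte from 8 values with constant masks.
    for (int i = 0; i < fullBytes; i++, v += 8) {
      out[i] = (byte) ((values[v] ? 0x01 : 0)
          | (values[v + 1] ? 0x02 : 0)
          | (values[v + 2] ? 0x04 : 0)
          | (values[v + 3] ? 0x08 : 0)
          | (values[v + 4] ? 0x10 : 0)
          | (values[v + 5] ? 0x20 : 0)
          | (values[v + 6] ? 0x40 : 0)
          | (values[v + 7] ? 0x80 : 0));
    }
    // Tail: fewer than 8 values left, remaining high bits stay zero.
    for (int bit = 0; v < values.length; v++, bit++) {
      if (values[v]) {
        out[fullBytes] |= (byte) (1 << bit);
      }
    }
    return out;
  }

  private BooleanPack() {}
}
```

For the ten values `T F T F F F F F F T` this produces the two bytes `0x05` and `0x02`.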

…riteDoubles with bulk ByteBuffer view transfers

Add bulk write methods to CapacityByteArrayOutputStream (writeInts, writeLongs, writeFloats, writeDoubles) that use IntBuffer/LongBuffer/FloatBuffer/DoubleBuffer view puts to transfer entire arrays in one operation, amortizing capacity checks across the batch. Add corresponding batch APIs to ValuesWriter (with scalar default) and optimized overrides in PlainValuesWriter.

Performance improvement (100K values, JDK 25):

- INT32: 566M -> 2,809M ops/s (+396%)
- FLOAT: 540M -> 2,818M ops/s (+422%)
- INT64: 479M -> 1,306M ops/s (+173%)
- DOUBLE: 442M -> 1,275M ops/s (+189%)
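The view-buffer bulk put mentioned in this commit message can be illustrated with `java.nio` alone. This sketch writes into a freshly allocated buffer instead of `CapacityByteArrayOutputStream`, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch: encode an int[] range as little-endian bytes with one IntBuffer put. */
final class BulkIntWrite {
  static byte[] writeIntsBulk(int[] values, int offset, int count) {
    // A single capacity decision for the whole batch.
    ByteBuffer buf = ByteBuffer.allocate(count * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    // The IntBuffer view inherits the byte order; put() transfers the range in one call.
    buf.asIntBuffer().put(values, offset, count);
    return buf.array();
  }

  private BulkIntWrite() {}
}
```

Compared with calling a per-value write in a loop, the capacity check and the byte-order handling happen once per batch rather than once per element.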

- ValuesReader.readBinaries() / ValuesWriter.writeBinaries() default impls
- FixedLenByteArrayPlainValuesReader: bulk slice() with fixed-offset Binary views
- FixedLenByteArrayPlainValuesWriter: chunked bulk write() amortizing stream overhead
- ByteStreamSplitValuesReader: optimized array-based decode with unrolled loops for element sizes 2, 4, 8, 12, 16
- ByteStreamSplitValuesReaderForFLBA: batch readBinaries() with single advanceByteOffset
- FixedLenByteArrayEncodingBenchmark: full FLBA benchmark suite with batch variants
- Add TestDataFactory and BenchmarkEncodingUtils helper classes
- Fix JMH annotation processor config in pom.xml for Maven Compiler 3.14+

… writes

Replace per-value scatterBytes() in FixedLenByteArrayByteStreamSplitValuesWriter with a BATCH_SIZE=64 buffered scatter pattern:

- Accumulate byte values into per-stream batch buffers
- Flush as bulk write(byte[], 0, count) to each stream
- Eliminates N*elementSize individual stream.write(byte) calls per batch
- Adds writeBinaries() batch override for FLBA BSS writer

Performance improvement: FLBA size=2 +85%, size=16 +160% (vs per-byte scatter).
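The buffered scatter pattern listed above can be sketched as follows. This is an illustration, not the writer from the commit: it uses `ByteArrayOutputStream` for the per-stream sinks, and the class and method names are hypothetical. Only the batch size of 64 comes from the message.

```java
import java.io.ByteArrayOutputStream;

/** Sketch: byte-stream-split encoding with per-stream batch buffers. */
final class BatchedByteStreamSplit {
  private static final int BATCH_SIZE = 64; // value taken from the commit message

  private final int elementSize;
  private final ByteArrayOutputStream[] streams; // stream s holds byte s of every value
  private final byte[][] batch;                  // one small buffer per stream
  private int batched = 0;

  BatchedByteStreamSplit(int elementSize) {
    this.elementSize = elementSize;
    this.streams = new ByteArrayOutputStream[elementSize];
    this.batch = new byte[elementSize][BATCH_SIZE];
    for (int s = 0; s < elementSize; s++) {
      streams[s] = new ByteArrayOutputStream();
    }
  }

  /** Scatters one fixed-length value into the batch buffers. */
  void write(byte[] value) {
    if (value.length != elementSize) {
      throw new IllegalArgumentException("expected " + elementSize + " bytes");
    }
    for (int s = 0; s < elementSize; s++) {
      batch[s][batched] = value[s];
    }
    if (++batched == BATCH_SIZE) {
      flush();
    }
  }

  /** One bulk write per stream instead of one write per byte. */
  private void flush() {
    for (int s = 0; s < elementSize; s++) {
      streams[s].write(batch[s], 0, batched);
    }
    batched = 0;
  }

  /** Concatenates the streams in order: all byte 0s, then all byte 1s, and so on. */
  byte[] toByteArray() {
    flush();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (ByteArrayOutputStream stream : streams) {
      out.write(stream.toByteArray(), 0, stream.size());
    }
    return out.toByteArray();
  }
}
```

For `elementSize` = 2 and the values `{1,2}`, `{3,4}`, `{5,6}`, the output is `1 3 5 2 4 6`.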

iemejia · 2026-05-17T22:42:48Z

Closing — the JMH benchmarks and shaded jar fix from this PR have been distributed into the individual encoding/compression PRs, each of which now includes its own benchmarks:

GH-3530: Optimize PLAIN encoding and decoding with direct ByteBuffer I/O #3565 — PLAIN (PlainEncodingBenchmark, PlainDecodingBenchmark)
GH-3530: Optimize DICTIONARY encoding/decoding data structures and use ByteBuffer #3566 — DICTIONARY (DictionaryEncodingBenchmark, DictionaryDecodingBenchmark)
GH-3530: Optimize DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, and DELTA_BYTE_ARRAY encoding/decoding #3567 — DELTA (DeltaBinaryPackedEncodingBenchmark, DeltaBinaryPackedDecodingBenchmark, DeltaByteArrayEncodingBenchmark, DeltaByteArrayDecodingBenchmark, DeltaLengthByteArrayEncodingBenchmark, DeltaLengthByteArrayDecodingBenchmark, LongDeltaDecodingBenchmark)
GH-3530: Optimize RLE hybrid encoder/decoder scalar hot-path performance #3568 — RLE (RleEncodingBenchmark, RleDecodingBenchmark, RleDictionaryIndexDecodingBenchmark)
GH-3530: Optimize BYTE_STREAM_SPLIT encoding/decoding #3569 — BYTE_STREAM_SPLIT (BssEncodingBenchmark, BssDecodingBenchmark)
GH-3530: Bypass Hadoop codec abstraction to optimize compression performance #3570 — Compression (CompressionBenchmark, CpuReadBenchmark, CpuWriteBenchmark, FileReadBenchmark, FileWriteBenchmark, ConcurrentReadWriteBenchmark)
GH-3530: Eagerly release column buffers during row group flush #3571 — Row Group Flush (RowGroupFlushBenchmark)

I initially submitted a series of small, focused PRs thinking they'd be easier to review. In practice the sheer number (~16 PRs, with more pending) made things harder to follow — even for me. I've regrouped the changes by encoding type / performance area so that each PR is self-contained with its own benchmarks and test coverage, which should make review and performance analysis much more straightforward.

Apologies for the churn. Thank you.

iemejia force-pushed the perf-benchmarks branch from 668caf7 to 2404a29 on April 19, 2026 20:17

iemejia mentioned this pull request Apr 20, 2026

Optimize dictionary writers by replacing fastutil Linked maps with OpenHashMap + ArrayList #3513

Closed

iemejia mentioned this pull request Apr 20, 2026

GH-3513: Optimize dictionary writers with OpenHashMap + ArrayList (up to ~70x encodeDictionary) #3514

Closed

steveloughran reviewed Apr 22, 2026

iemejia mentioned this pull request May 6, 2026

Apache Parquet Java Performance Improvements #3530

Open

iemejia force-pushed the perf-benchmarks branch from 404ed02 to 19f343e on May 11, 2026 22:14

steveloughran suggested changes May 13, 2026

iemejia added 7 commits May 13, 2026 21:31

apacheGH-3522: Add batch read APIs to ValuesReader hierarchy (from pa…

bf599cd

…r13)

Override readIntegers() in RLE ValuesReader to delegate to batch deco…

17e20bd

…der (from par13)

iemejia added 20 commits May 13, 2026 21:31

Add batch decode benchmarks to ByteStreamSplitDecodingBenchmark

ecb854e

Add batch encode benchmarks for BOOLEAN writeBooleans() API

67ef35b

Add batch encode benchmarks for PLAIN numeric encoding (INT32/INT64/F…

cb0ba5b

…LOAT/DOUBLE) New PlainEncodingBenchmark class with scalar vs batch comparison for all four numeric types. Also adds encodePlainBatch to IntEncodingBenchmark for consistency.

Add FLBA PLAIN batch encode/decode benchmarks (encodePlainBatch, deco…

7f08762

…dePlainBatch) Exercises the new readBinaries()/writeBinaries() batch APIs for FIXED_LEN_BYTE_ARRAY PLAIN encoding. Results: decode batch +165-245%, encode batch +19-81%.

Add BROTLI to CompressionBenchmark codec parameter list

af56526

iemejia force-pushed the perf-benchmarks branch from b58fc2a to 165bf49 on May 13, 2026 19:31

iemejia marked this pull request as draft May 15, 2026 09:35

iemejia closed this May 17, 2026

iemejia deleted the perf-benchmarks branch May 17, 2026 22:56

iemejia wants to merge 27 commits into apache:master from iemejia:perf-benchmarks