GH-3499: Cache hashCode() for non-reused Binary instances (up to 73x dictionary-encode speedup) #3500
Open
iemejia wants to merge 1 commit into apache:master from
Conversation
PLAIN_DICTIONARY encoding of BINARY columns repeatedly hashes Binary keys during dictionary map lookups, but the existing Binary.hashCode() implementations (in ByteArraySliceBackedBinary, ByteArrayBackedBinary, and ByteBufferBackedBinary) recompute the hash byte-by-byte on every call. For columns with many repeated values this is the dominant cost of encodeDictionary; we observed up to a 73x slowdown vs. the cached version on the existing JMH benchmark.

Cache the hash code in a single int field on Binary. Reused Binary instances (those whose backing array can be mutated by the producer between calls) do not cache, preserving the existing mutable-buffer semantics.

Thread safety follows the java.lang.String.hashCode() idiom: the cache is a single int field with sentinel value 0 meaning "not yet computed". Two threads racing on the first hashCode() call may both compute and write the same deterministic value, which is benign. A Binary whose true hash equals 0 is recomputed on every call (acceptably rare and still correct). No volatile or synchronization is needed; both the field load and the field store are atomic per the JLS, and the value is deterministic given the immutable byte content.

Implementation notes:

- The cache field is package-private (not private) so the three nested Binary subclasses can read it directly in their hashCode() hot path, avoiding the extra method-call layer that would otherwise be needed, since inherited private fields are not accessible from nested subclasses.
- A package-private cacheHashCode(int) helper centralises the isBackingBytesReused check on the slow path.
- New tests in TestBinary cover (a) a cached and stable hashCode for the three constant Binary impls, and (b) a reused Binary not returning a stale hash after the backing buffer is replaced.
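The caching idiom described above can be sketched as follows. This is a hedged, minimal illustration: the class, field, and constructor names (CachedBytes, backingReused) are hypothetical stand-ins, not the actual Binary API.

```java
// Hedged sketch of the String.hashCode()-style caching idiom described above.
// Names are illustrative, not the actual parquet-java Binary API.
class CachedBytes {
    private final byte[] bytes;          // backing array, treated as immutable here
    private final boolean backingReused; // true: producer may swap/mutate the bytes
    int cachedHashCode;                  // 0 is the "not yet computed" sentinel

    CachedBytes(byte[] bytes, boolean backingReused) {
        this.bytes = bytes;
        this.backingReused = backingReused;
    }

    @Override
    public int hashCode() {
        int h = cachedHashCode;          // a single atomic int read (JLS 17.7)
        if (h == 0) {                    // slow path: compute, then maybe cache
            for (byte b : bytes) {
                h = 31 * h + b;          // same recurrence String.hashCode() uses
            }
            if (!backingReused) {        // reused buffers must never cache
                cachedHashCode = h;      // benign race: racing writers store the same value
            }
        }
        return h;
    }
}
```

A value whose true hash happens to be 0 never populates the cache and is simply recomputed on each call, which remains correct.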
Benchmark (BinaryEncodingBenchmark.encodeDictionary, 100k BINARY values per invocation, JMH -wi 5 -i 10 -f 3, 30 samples per row):

| Param | Before (ops/s) | After (ops/s) | Improvement |
|---|---|---|---|
| LOW / 10 | 13,170,110 | 20,203,480 | +53% (1.53x) |
| LOW / 100 | 2,955,460 | 18,048,610 | +511% (6.11x) |
| LOW / 1000 | 300,693 | 21,933,470 | +7193% (72.9x) |
| HIGH / 10 | 847,657 | 1,336,238 | +58% (1.58x) |
| HIGH / 100 | 418,327 | 1,323,284 | +216% (3.16x) |
| HIGH / 1000 | 72,553 | 1,296,679 | +1687% (17.9x) |

The relative gain grows with string length because the per-value hash cost (byte-loop length) grows linearly while the cached lookup is O(1). LOW cardinality benefits even more because each unique key is hashed many more times (once per insertion check across the 100k values).

Negative control: BinaryEncodingBenchmark.encodePlain (which writes Binary without dictionary lookups, so does not exercise hashCode) is unchanged within +/- 2.5% across all parameter combinations. Allocation rate per operation is identical between baseline and optimized (7.36 B/op for LOW/10, etc.), confirming the speedup comes from CPU saved on hashing rather than reduced allocations.

All 575 parquet-column tests pass (was 573; +2 new tests for the cache).
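The mutable-buffer semantics the change preserves can be sketched as follows. MutableBytes is a hypothetical stand-in for a reused Binary, not the actual parquet-java code: because the producer swaps the backing array between values, any cached hash would go stale, so reused instances must recompute on every call.

```java
// Hedged sketch of why reused instances skip the cache. MutableBytes is a
// hypothetical illustration, not the actual Binary/TestBinary code.
class MutableBytes {
    private byte[] bytes;

    MutableBytes(byte[] bytes) {
        this.bytes = bytes;
    }

    // Simulates the producer reusing this instance for the next value.
    void setBytes(byte[] newBytes) {
        this.bytes = newBytes;
    }

    @Override
    public int hashCode() {
        int h = 0;
        for (byte b : bytes) {
            h = 31 * h + b; // recomputed on every call: no cache to go stale
        }
        return h;
    }
}
```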
arouel
reviewed
Apr 19, 2026
arouel
left a comment
I use a similar optimization already in a patched parquet-column version on my side and verified the improvement.
arouel
approved these changes
Apr 19, 2026
Force-pushed from e1c3ed9 to a8152c9
This was referenced Apr 19, 2026
Optimize ByteStreamSplitValuesWriter: remove per-value allocation and batch single-byte writes
#3503
Open
iemejia
added a commit
to iemejia/parquet-java
that referenced
this pull request
Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor configuration and the AppendingTransformer entries for BenchmarkList / CompilerHints. As a result, the shaded jar built from master fails at runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds jmh-generator-annprocess to maven-compiler-plugin's annotation processor paths, and adds AppendingTransformer entries for META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.
- Adds 11 JMH benchmarks covering the encode/decode paths used by the pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504, apache#3506, apache#3510), so reviewers can reproduce the reported numbers and detect regressions: IntEncodingBenchmark, BinaryEncodingBenchmark, ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark, FixedLenByteArrayEncodingBenchmark, FileReadBenchmark, FileWriteBenchmark, RowGroupFlushBenchmark, ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a working build, or unrunnable at all from a default build).
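The shade-plugin fix described above typically looks like the following pom fragment. This is a sketch under assumptions: plugin version, executions, and the rest of the configuration are omitted; only the resource names come from the commit message.

```xml
<!-- Sketch: append (merge) JMH metadata files across jars instead of
     letting a single jar entry win, so the shaded jar stays runnable. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>META-INF/BenchmarkList</resource>
      </transformer>
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>META-INF/CompilerHints</resource>
      </transformer>
    </transformers>
  </configuration>
</plugin>
```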
Summary
Closes #3499.
Caches Binary.hashCode() per instance for non-reused (immutable-backing) Binary values. Eliminates repeated full-buffer hash recomputation during PLAIN_DICTIONARY encoding, where the same key is hashed many times across a 100k-value page. Reused (mutable-backing) instances skip the cache to preserve their existing semantics.

Uses the java.lang.String.hashCode() idiom (a single int field with sentinel 0 meaning "not yet computed") so the cache is race-safe without volatile: concurrent first calls compute the same deterministic value, and either ordering is correct.

Benchmark

BinaryEncodingBenchmark.encodeDictionary, 100k BINARY values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3 (30 samples per row). The relative gain grows with string length (per-call hash work is O(N), cache lookup is O(1)) and with low cardinality (each unique key is hashed many more times).
Negative control: encodePlain (writes Binary without dictionary lookups, so doesn't exercise hashCode) is unchanged within ±2.5% across all parameter combinations. Allocation rate per op (gc.alloc.rate.norm) is identical between baseline and optimized; the speedup is pure CPU saved.

Implementation notes
- transient int cachedHashCode on Binary, package-private so the three nested subclasses can read it directly on the hot path (inherited private fields are not accessible from nested subclasses without a method-call indirection).
- The !isBackingBytesReused check lives inside a small package-private cacheHashCode(int) helper that runs only on the cache-miss path.
- New tests in TestBinary:
  - testHashCodeCachedForConstantBinary: constant Binary returns a stable hashCode, equal across the three impls (ByteArraySliceBackedBinary, ByteArrayBackedBinary, ByteBufferBackedBinary).
  - testHashCodeNotCachedForReusedBinary: reused Binary returns the new hash after the backing buffer is replaced.

Validation
- parquet-column: 575 tests pass (was 573; +2 new tests for the cache).
- Run with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.

Related
This is the third in a small series of focused performance PRs from work in https://github.com/iemejia/parquet-perf. Previous: #3494 (PlainValuesReader), #3496 (PlainValuesWriter).
How to reproduce the benchmarks
The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce by comparing runs against master (baseline) and this branch (optimized).