Malkovsky commented Apr 20, 2026
- Implemented 512-bit target excess positions based on 16-bit window expansion
- Implemented 512-bit target excess positions based on two-stage 4-bit lookups
The two-stage 4-bit lookup approach is 3.6-5.5x faster than the 16-bit expansion approach (16-24 ns vs 86 ns per 512-bit block). It uses vpshufb for 4-bit LUT lookups, a byte-level exclusive prefix sum on __m128i, and pdep for result interleaving.
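For readers unfamiliar with the two-stage idea, here is a portable scalar sketch on a single 64-bit word. It is not the SIMD code from this PR: it assumes that a set bit contributes +1 and a clear bit -1 to the running excess, that bits are consumed from least significant to most significant, and that the query asks for the first position where the excess equals the target. The names `first_excess_naive`, `first_excess_nibbles`, `NibbleTables` and `build_tables` are hypothetical. Stage 1 uses a per-nibble total to maintain an exclusive prefix sum; stage 2 uses a per-nibble table to locate the hit inside the nibble.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumption: set bit = +1, clear bit = -1, scanned from the least significant bit.

// Bit-by-bit reference: index of the first bit after which excess == target, or -1.
int first_excess_naive(std::uint64_t w, int target) {
    int excess = 0;
    for (int i = 0; i < 64; ++i) {
        excess += ((w >> i) & 1u) ? 1 : -1;
        if (excess == target) return i;
    }
    return -1;
}

struct NibbleTables {
    std::array<int, 16> total{};                 // stage 1: excess of a whole nibble
    std::array<std::array<int, 9>, 16> first{};  // stage 2: first hit for need in [-4, 4], else -1
};

NibbleTables build_tables() {
    NibbleTables t;
    for (unsigned nib = 0; nib < 16; ++nib) {
        t.first[nib].fill(-1);
        int e = 0;
        for (int j = 0; j < 4; ++j) {
            e += ((nib >> j) & 1u) ? 1 : -1;
            if (t.first[nib][e + 4] < 0) t.first[nib][e + 4] = j;  // keep the earliest hit
        }
        t.total[nib] = e;
    }
    return t;
}

// Two-stage lookup: exclusive prefix sum over nibbles, then an in-nibble table lookup.
int first_excess_nibbles(std::uint64_t w, int target) {
    assert(target >= -64 && target <= 64);
    static const NibbleTables t = build_tables();
    int base = 0;  // excess accumulated before the current nibble
    for (int k = 0; k < 16; ++k) {
        unsigned nib = static_cast<unsigned>((w >> (4 * k)) & 0xF);
        int need = target - base;  // excess still required inside this nibble
        if (need >= -4 && need <= 4) {
            int j = t.first[nib][need + 4];
            if (j >= 0) return 4 * k + j;
        }
        base += t.total[nib];
    }
    return -1;
}
```

In the vectorized version described above, the per-nibble tables would presumably be served by vpshufb and the running `base` by the byte-level prefix sum; the scalar loop only shows the data flow.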
    const size_t num_blocks = 4096;

    std::mt19937_64 rng(42);
    std::vector<std::array<uint64_t, 8>> blocks(num_blocks);
SUGGESTION: Benchmark setup can be reused to reduce noise
The RNG fill and vector allocation happen for each benchmark invocation and each Arg value. Consider moving blocks (and the RNG seeding) to a static/fixture to reduce setup overhead and improve signal-to-noise in the measurements.
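One way to act on this suggestion, sketched with the standard library only: generate the blocks once and hand every benchmark a reference to the same data. The helper names `make_blocks` and `shared_blocks`, their signatures, and the `Block` alias are assumptions for illustration, not the code in this PR; the seed and block count are taken from the snippet under review.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using Block = std::array<std::uint64_t, 8>;  // one 512-bit block

// Hypothetical helper: fill num_blocks blocks from a fixed-seed generator.
std::vector<Block> make_blocks(std::size_t num_blocks, std::uint64_t seed = 42) {
    std::mt19937_64 rng(seed);
    std::vector<Block> blocks(num_blocks);
    for (auto& block : blocks) {
        for (auto& word : block) word = rng();
    }
    return blocks;
}

// Hypothetical accessor: the data is built on first use and reused afterwards,
// so repeated benchmark invocations do not pay for allocation or RNG fill again.
const std::vector<Block>& shared_blocks() {
    static const std::vector<Block> blocks = make_blocks(4096);
    assert(!blocks.empty());
    return blocks;
}
```

A benchmark body would then read from `shared_blocks()` instead of constructing its own vector, which keeps setup out of every invocation and guarantees that all variants see identical input.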
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Files Reviewed (2 files)
Reviewed by gpt-5.2-codex-20260114 · 294,719 tokens
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main      #39      +/-   ##
==========================================
+ Coverage   86.35%   87.34%   +0.98%
==========================================
  Files          11       12       +1
  Lines        2850     3136     +286
  Branches      562      606      +44
==========================================
+ Hits         2461     2739     +278
- Misses        254      262       +8
  Partials      135      135
…rks, add overflow boundary test
- Add detailed comment explaining why int8 arithmetic in base + pos_j computation is safe despite the boundary value 128 wrapping to -128 (neither -128 nor 128 equals 0, so cmpeq_epi8 produces no false positive)
- Refactor benchmarks to share block generation via make_blocks() helper
- Add OverflowBoundary test covering all x in [-64,64] with varying prefix fills to exercise the int8 arithmetic boundary cases
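The wraparound argument in the first bullet can be checked with a small scalar model. This is a sketch under stated assumptions, not the SIMD code: it assumes both operands lie in [-64, 64], so the true sum lies in [-128, 128], and it models 8-bit wrapping addition with the hypothetical helper `wrap_add_i8`. Under that assumption the wrapped sum is 0 exactly when the true sum is 0, so a compare-with-zero on the wrapped bytes cannot report a false match.

```cpp
#include <cassert>

// Hypothetical model of 8-bit two's-complement wrapping addition:
// reduce the true sum modulo 256, then map [128, 255] to [-128, -1].
int wrap_add_i8(int a, int b) {
    int v = ((a + b) % 256 + 256) % 256;
    return v >= 128 ? v - 256 : v;
}

// Returns true if, for all operands in [lo, hi], the wrapped sum is zero
// exactly when the true sum is zero (no false positives, no false negatives).
bool zero_compare_is_exact(int lo, int hi) {
    for (int a = lo; a <= hi; ++a) {
        for (int b = lo; b <= hi; ++b) {
            bool wrapped_zero = wrap_add_i8(a, b) == 0;
            bool true_zero = (a + b) == 0;
            if (wrapped_zero != true_zero) return false;
        }
    }
    return true;
}
```

The only sum in this range that leaves the int8 domain is 128, which wraps to -128, and neither value is 0. The guarantee disappears once true sums can reach ±256, which is why the operand range matters.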