x86_64: Replace rej_uniform intrinsics with assembly#1014
x86_64: Replace rej_uniform intrinsics with assembly#1014
Conversation
Replace the AVX2 intrinsics implementation of rej_uniform with hand-written x86_64 assembly, resolving #926. The assembly follows the same algorithmic structure as the intrinsics version: load 32 bytes, vpermq to rearrange 64-bit lanes, vpshufb to extract 8x 3-byte groups, mask to 23 bits, compare against MLDSA_Q, then use the lookup table to compact valid coefficients. Key design decisions: - Table passed as parameter (consistent with aarch64 approach), avoiding external symbol references for simpasm compatibility - All constants constructed from immediates (no .rodata section), enabling future HOL-Light formal verification - Register name #defines with #undef cleanup for SCU builds - CBMC contract on assembly function declaration (following mlkem-native) - vzeroupper at function exit to avoid AVX-SSE transition penalties Also adds poly_uniform to the component benchmark. Signed-off-by: jakemas <jakemas@amazon.com>
There was a problem hiding this comment.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
113118 cycles |
113013 cycles |
1.00 |
ML-DSA-44 sign |
355649 cycles |
355605 cycles |
1.00 |
ML-DSA-44 verify |
117801 cycles |
117682 cycles |
1.00 |
ML-DSA-65 keypair |
196381 cycles |
196214 cycles |
1.00 |
ML-DSA-65 sign |
589557 cycles |
588943 cycles |
1.00 |
ML-DSA-65 verify |
194604 cycles |
194375 cycles |
1.00 |
ML-DSA-87 keypair |
322210 cycles |
322148 cycles |
1.00 |
ML-DSA-87 sign |
752493 cycles |
752763 cycles |
1.00 |
ML-DSA-87 verify |
320055 cycles |
319900 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
212361 cycles |
212622 cycles |
1.00 |
ML-DSA-44 sign |
760716 cycles |
760066 cycles |
1.00 |
ML-DSA-44 verify |
228743 cycles |
228987 cycles |
1.00 |
ML-DSA-65 keypair |
379384 cycles |
379665 cycles |
1.00 |
ML-DSA-65 sign |
1250617 cycles |
1249827 cycles |
1.00 |
ML-DSA-65 verify |
371531 cycles |
372045 cycles |
1.00 |
ML-DSA-87 keypair |
604335 cycles |
605426 cycles |
1.00 |
ML-DSA-87 sign |
1593243 cycles |
1591413 cycles |
1.00 |
ML-DSA-87 verify |
618270 cycles |
617375 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 3rd gen (c6a)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
66830 cycles |
68874 cycles |
0.97 |
ML-DSA-44 sign |
184077 cycles |
187594 cycles |
0.98 |
ML-DSA-44 verify |
65562 cycles |
68993 cycles |
0.95 |
ML-DSA-65 keypair |
111959 cycles |
119089 cycles |
0.94 |
ML-DSA-65 sign |
292002 cycles |
299488 cycles |
0.98 |
ML-DSA-65 verify |
108472 cycles |
115385 cycles |
0.94 |
ML-DSA-87 keypair |
185520 cycles |
203754 cycles |
0.91 |
ML-DSA-87 sign |
379630 cycles |
396462 cycles |
0.96 |
ML-DSA-87 verify |
177291 cycles |
196231 cycles |
0.90 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton4
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
68316 cycles |
68121 cycles |
1.00 |
ML-DSA-44 sign |
202487 cycles |
202429 cycles |
1.00 |
ML-DSA-44 verify |
70722 cycles |
70691 cycles |
1.00 |
ML-DSA-65 keypair |
121061 cycles |
121050 cycles |
1.00 |
ML-DSA-65 sign |
331574 cycles |
332242 cycles |
1.00 |
ML-DSA-65 verify |
117810 cycles |
118169 cycles |
1.00 |
ML-DSA-87 keypair |
198140 cycles |
198283 cycles |
1.00 |
ML-DSA-87 sign |
427941 cycles |
428124 cycles |
1.00 |
ML-DSA-87 verify |
194637 cycles |
194645 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 3rd gen (c6a) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
134578 cycles |
135123 cycles |
1.00 |
ML-DSA-44 sign |
523923 cycles |
523989 cycles |
1.00 |
ML-DSA-44 verify |
147640 cycles |
147421 cycles |
1.00 |
ML-DSA-65 keypair |
228634 cycles |
227032 cycles |
1.01 |
ML-DSA-65 sign |
864042 cycles |
860343 cycles |
1.00 |
ML-DSA-65 verify |
236700 cycles |
234883 cycles |
1.01 |
ML-DSA-87 keypair |
371955 cycles |
371568 cycles |
1.00 |
ML-DSA-87 sign |
1080535 cycles |
1079389 cycles |
1.00 |
ML-DSA-87 verify |
383811 cycles |
383403 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Intel Xeon 3rd gen (c6i)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
56863 cycles |
56287 cycles |
1.01 |
ML-DSA-44 sign |
181063 cycles |
181562 cycles |
1.00 |
ML-DSA-44 verify |
61140 cycles |
61061 cycles |
1.00 |
ML-DSA-65 keypair |
98291 cycles |
98770 cycles |
1.00 |
ML-DSA-65 sign |
298368 cycles |
299116 cycles |
1.00 |
ML-DSA-65 verify |
100343 cycles |
100251 cycles |
1.00 |
ML-DSA-87 keypair |
152430 cycles |
153265 cycles |
0.99 |
ML-DSA-87 sign |
354719 cycles |
355417 cycles |
1.00 |
ML-DSA-87 verify |
153124 cycles |
153884 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton4 (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
128315 cycles |
128272 cycles |
1.00 |
ML-DSA-44 sign |
447513 cycles |
447600 cycles |
1.00 |
ML-DSA-44 verify |
138123 cycles |
144678 cycles |
0.95 |
ML-DSA-65 keypair |
220541 cycles |
220481 cycles |
1.00 |
ML-DSA-65 sign |
726484 cycles |
726951 cycles |
1.00 |
ML-DSA-65 verify |
222926 cycles |
223461 cycles |
1.00 |
ML-DSA-87 keypair |
366142 cycles |
366604 cycles |
1.00 |
ML-DSA-87 sign |
927541 cycles |
927414 cycles |
1.00 |
ML-DSA-87 verify |
374016 cycles |
373875 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton3
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
72353 cycles |
72235 cycles |
1.00 |
ML-DSA-44 sign |
212424 cycles |
212375 cycles |
1.00 |
ML-DSA-44 verify |
75754 cycles |
75714 cycles |
1.00 |
ML-DSA-65 keypair |
127646 cycles |
127612 cycles |
1.00 |
ML-DSA-65 sign |
351030 cycles |
350845 cycles |
1.00 |
ML-DSA-65 verify |
125627 cycles |
125755 cycles |
1.00 |
ML-DSA-87 keypair |
205980 cycles |
208476 cycles |
0.99 |
ML-DSA-87 sign |
444778 cycles |
450018 cycles |
0.99 |
ML-DSA-87 verify |
205601 cycles |
205843 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Intel Xeon 3rd gen (c6i) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
157499 cycles |
157541 cycles |
1.00 |
ML-DSA-44 sign |
549244 cycles |
549413 cycles |
1.00 |
ML-DSA-44 verify |
169448 cycles |
168865 cycles |
1.00 |
ML-DSA-65 keypair |
268437 cycles |
268818 cycles |
1.00 |
ML-DSA-65 sign |
903422 cycles |
903672 cycles |
1.00 |
ML-DSA-65 verify |
275283 cycles |
274680 cycles |
1.00 |
ML-DSA-87 keypair |
448241 cycles |
448464 cycles |
1.00 |
ML-DSA-87 sign |
1158654 cycles |
1157970 cycles |
1.00 |
ML-DSA-87 verify |
458704 cycles |
458043 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 4th gen (c7a)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
42142 cycles |
40662 cycles |
1.04 |
ML-DSA-44 sign |
134317 cycles |
132808 cycles |
1.01 |
ML-DSA-44 verify |
44844 cycles |
43607 cycles |
1.03 |
ML-DSA-65 keypair |
72940 cycles |
71859 cycles |
1.02 |
ML-DSA-65 sign |
213861 cycles |
213367 cycles |
1.00 |
ML-DSA-65 verify |
73729 cycles |
72847 cycles |
1.01 |
ML-DSA-87 keypair |
107003 cycles |
109237 cycles |
0.98 |
ML-DSA-87 sign |
250851 cycles |
254550 cycles |
0.99 |
ML-DSA-87 verify |
107681 cycles |
109371 cycles |
0.98 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 4th gen (c7a) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
120754 cycles |
120325 cycles |
1.00 |
ML-DSA-44 sign |
447570 cycles |
447576 cycles |
1.00 |
ML-DSA-44 verify |
130511 cycles |
130561 cycles |
1.00 |
ML-DSA-65 keypair |
205040 cycles |
205018 cycles |
1.00 |
ML-DSA-65 sign |
728790 cycles |
729474 cycles |
1.00 |
ML-DSA-65 verify |
210029 cycles |
209605 cycles |
1.00 |
ML-DSA-87 keypair |
337610 cycles |
336678 cycles |
1.00 |
ML-DSA-87 sign |
925517 cycles |
924223 cycles |
1.00 |
ML-DSA-87 verify |
347563 cycles |
347399 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton3 (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
138744 cycles |
138561 cycles |
1.00 |
ML-DSA-44 sign |
483982 cycles |
484140 cycles |
1.00 |
ML-DSA-44 verify |
148574 cycles |
162388 cycles |
0.91 |
ML-DSA-65 keypair |
241921 cycles |
241950 cycles |
1.00 |
ML-DSA-65 sign |
792702 cycles |
792591 cycles |
1.00 |
ML-DSA-65 verify |
240763 cycles |
241288 cycles |
1.00 |
ML-DSA-87 keypair |
396106 cycles |
397138 cycles |
1.00 |
ML-DSA-87 sign |
1013453 cycles |
1013569 cycles |
1.00 |
ML-DSA-87 verify |
403446 cycles |
403178 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton2
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
113189 cycles |
113255 cycles |
1.00 |
ML-DSA-44 sign |
355791 cycles |
356042 cycles |
1.00 |
ML-DSA-44 verify |
117978 cycles |
117969 cycles |
1.00 |
ML-DSA-65 keypair |
196342 cycles |
196623 cycles |
1.00 |
ML-DSA-65 sign |
589183 cycles |
589242 cycles |
1.00 |
ML-DSA-65 verify |
194553 cycles |
194559 cycles |
1.00 |
ML-DSA-87 keypair |
322537 cycles |
322281 cycles |
1.00 |
ML-DSA-87 sign |
753613 cycles |
753546 cycles |
1.00 |
ML-DSA-87 verify |
320115 cycles |
320070 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton2 (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
213219 cycles |
212521 cycles |
1.00 |
ML-DSA-44 sign |
761553 cycles |
760970 cycles |
1.00 |
ML-DSA-44 verify |
241351 cycles |
234237 cycles |
1.03 |
ML-DSA-65 keypair |
380573 cycles |
379762 cycles |
1.00 |
ML-DSA-65 sign |
1252452 cycles |
1252199 cycles |
1.00 |
ML-DSA-65 verify |
372839 cycles |
371797 cycles |
1.00 |
ML-DSA-87 keypair |
607341 cycles |
604584 cycles |
1.00 |
ML-DSA-87 sign |
1596680 cycles |
1595561 cycles |
1.00 |
ML-DSA-87 verify |
619175 cycles |
618927 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 verify |
241351 cycles |
234237 cycles |
1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
CBMC Results (ML-DSA-44)Full Results (177 proofs)
|
CBMC Results (ML-DSA-87)Full Results (177 proofs)
|
CBMC Results (ML-DSA-65)Full Results (177 proofs)
|
There was a problem hiding this comment.
Intel Xeon 4th gen (c7i)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
34764 cycles |
34374 cycles |
1.01 |
ML-DSA-44 sign |
120113 cycles |
120132 cycles |
1.00 |
ML-DSA-44 verify |
38092 cycles |
38166 cycles |
1.00 |
ML-DSA-65 keypair |
61138 cycles |
60500 cycles |
1.01 |
ML-DSA-65 sign |
201844 cycles |
199945 cycles |
1.01 |
ML-DSA-65 verify |
62783 cycles |
62429 cycles |
1.01 |
ML-DSA-87 keypair |
93501 cycles |
94486 cycles |
0.99 |
ML-DSA-87 sign |
236815 cycles |
239500 cycles |
0.99 |
ML-DSA-87 verify |
95619 cycles |
96894 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Intel Xeon 4th gen (c7i) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
93930 cycles |
93842 cycles |
1.00 |
ML-DSA-44 sign |
333310 cycles |
333119 cycles |
1.00 |
ML-DSA-44 verify |
100022 cycles |
100025 cycles |
1.00 |
ML-DSA-65 keypair |
159902 cycles |
160115 cycles |
1.00 |
ML-DSA-65 sign |
543114 cycles |
543227 cycles |
1.00 |
ML-DSA-65 verify |
160989 cycles |
161060 cycles |
1.00 |
ML-DSA-87 keypair |
266666 cycles |
266874 cycles |
1.00 |
ML-DSA-87 sign |
704974 cycles |
706010 cycles |
1.00 |
ML-DSA-87 verify |
270510 cycles |
269779 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
42142 cycles |
40662 cycles |
1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
Summary
Resolves #926 and #418 (?)
rej_uniformwith hand-written x86_64 assembly.rodatasection), enabling future HOL-Light formal verification#defines with#undefcleanup for SCU builds (following mlkem-native pattern)poly_uniformto component benchmarkML-DSA's 23-bit coefficients require 32-bit lanes, which naturally fills a 256-bit YMM register for 8 elements per iteration. This led to the choice of AVX2 over SSE — with SSE's 128-bit registers and 32-bit lanes, we'd only get 4 coefficients per iteration vs 8 with AVX2.
Performance
AMD EPYC 3rd gen (c6a) — opt
Proof
I'd like to get the CBMC/Hol_Light proof in with this PR -- I'm investigating now. Ideally, I need stable assembly to work from, other wise modeling instructions for them not to be used is wasteful. Currently modeling
VMOVMSKPS,VPMOVZXBD, andVZEROUPPER, and testing out proof framework.