x86_64: Replace rej_uniform intrinsics with assembly #1014

Draft
jakemas wants to merge 1 commit into main from jakemas/rej-uniform-asm

@jakemas jakemas commented Apr 3, 2026

Summary

Resolves #926 and #418 (?)

  • Replace AVX2 intrinsics implementation of rej_uniform with hand-written x86_64 assembly
  • Table passed as parameter (consistent with aarch64 approach), avoiding external symbol references for simpasm compatibility
  • All constants constructed from immediates (no .rodata section), enabling future HOL-Light formal verification
  • Register name #defines with #undef cleanup for SCU builds (following mlkem-native pattern)
  • Adds poly_uniform to component benchmark
  • HOL-Light proof infrastructure included (bytecode, table definition, proof skeleton, Makefile)
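For readers unfamiliar with the single-compilation-unit (SCU) concern behind the register-name bullet, here is a minimal C sketch. It is not taken from this PR: the macro name `count` and both function names are hypothetical, and the macro aliases a plain constant rather than a register so that the example compiles as ordinary C. It shows why a file-local `#define` is followed by an `#undef` when several source files end up in one translation unit.

```c
#include <assert.h>

/* --- Stand-in for the body of a first source file (hypothetical) --- */
#define count 3 /* file-local alias; an assembly file would alias a register name */
static int sketch_file_a_value(void)
{
    return count * 2; /* expands to 3 * 2 */
}
#undef count /* cleanup: the alias must not leak into files included later */

/* --- Stand-in for a second file included later in the same unit --- */
static int sketch_file_b_value(int count) /* 'count' is an ordinary identifier again */
{
    assert(count >= 0);
    return count + 1;
}
```

Without the `#undef`, the parameter declaration in the second function would expand to `int 3` and the combined unit would fail to compile.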

ML-DSA's 23-bit coefficients require 32-bit lanes, so a 256-bit YMM register holds 8 coefficients per iteration. This motivated the choice of AVX2 over SSE: with SSE's 128-bit XMM registers and 32-bit lanes, each iteration would yield only 4 coefficients instead of 8.

Performance

AMD EPYC 3rd gen (c6a) (opt)

| Benchmark | Before (cycles) | After (cycles) | Change |
|---|---|---|---|
| ML-DSA-44 keypair | 68,874 | 66,828 | -3% |
| ML-DSA-44 sign | 187,594 | 184,181 | -2% |
| ML-DSA-44 verify | 68,993 | 65,665 | -5% |
| ML-DSA-65 keypair | 119,089 | 112,640 | -5% |
| ML-DSA-65 sign | 299,488 | 294,836 | -2% |
| ML-DSA-65 verify | 115,385 | 108,494 | -6% |
| ML-DSA-87 keypair | 203,754 | 185,518 | -9% |
| ML-DSA-87 sign | 396,462 | 378,579 | -5% |
| ML-DSA-87 verify | 196,231 | 177,157 | -10% |
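The Change column appears to be the relative difference between the two cycle counts, rounded to a whole percent. A minimal sketch of that arithmetic follows, assuming the figures are cycle counts and that rounding is to the nearest integer; the function names are illustrative and not part of the benchmark tooling.

```c
#include <assert.h>

/* Relative change in percent; a negative value means fewer cycles (faster). */
static double sketch_percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Round to the nearest whole percent, with halves rounded away from zero. */
static long sketch_round_percent(double pct)
{
    return (long)(pct < 0.0 ? pct - 0.5 : pct + 0.5);
}
```

For the ML-DSA-44 keypair row, `sketch_percent_change(68874, 66828)` is about -2.97, which rounds to the -3% shown in the table.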

Proof

I'd like to get the CBMC/HOL-Light proof in with this PR; I'm investigating now. Ideally, I need stable assembly to work from; otherwise, modeling instructions that end up unused is wasted effort. I am currently modeling VMOVMSKPS, VPMOVZXBD, and VZEROUPPER, and testing out the proof framework.

Replace the AVX2 intrinsics implementation of rej_uniform with
hand-written x86_64 assembly, resolving #926.

The assembly follows the same algorithmic structure as the intrinsics
version: load 32 bytes, vpermq to rearrange 64-bit lanes, vpshufb to
extract 8x 3-byte groups, mask to 23 bits, compare against MLDSA_Q,
then use the lookup table to compact valid coefficients.
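As a scalar point of reference for the steps above, the following C sketch performs the same rejection sampling one coefficient at a time: read 3 bytes little-endian, mask to 23 bits, and keep the value only if it is below the modulus. It is an illustration, not the code in this PR; the names `sketch_rej_uniform` and `SKETCH_Q` are made up here, and the modulus value is the ML-DSA prime q = 8380417.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_Q 8380417 /* ML-DSA modulus q = 2^23 - 2^13 + 1 */

/* Fill r[0..len) with values < q sampled from buf; returns how many were accepted. */
static unsigned sketch_rej_uniform(int32_t *r, unsigned len,
                                   const uint8_t *buf, size_t buflen)
{
    unsigned ctr = 0;
    size_t pos = 0;

    assert(r != NULL && buf != NULL);
    while (ctr < len && pos + 3 <= buflen) {
        /* Assemble one 24-bit little-endian group. */
        uint32_t t = (uint32_t)buf[pos]
                   | ((uint32_t)buf[pos + 1] << 8)
                   | ((uint32_t)buf[pos + 2] << 16);
        pos += 3;
        t &= 0x7FFFFF; /* keep the low 23 bits */
        if (t < SKETCH_Q) {
            r[ctr++] = (int32_t)t; /* accept; otherwise the candidate is dropped */
        }
    }
    return ctr;
}
```

The vectorized version handles 8 such 3-byte groups per iteration, so the comparison yields an 8-bit acceptance mask instead of a single branch.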

Key design decisions:
- Table passed as parameter (consistent with aarch64 approach),
  avoiding external symbol references for simpasm compatibility
- All constants constructed from immediates (no .rodata section),
  enabling future HOL-Light formal verification
- Register name #defines with #undef cleanup for SCU builds
- CBMC contract on assembly function declaration (following mlkem-native)
- vzeroupper at function exit to avoid AVX-SSE transition penalties
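To illustrate the table-as-parameter decision, here is a hedged C model of mask-driven compaction. It assumes a table layout in which row `m` lists the positions of the set bits of `m` in ascending order, padded with zeros; the actual layout used in this PR may differ. The names `sketch_build_table` and `sketch_compact` are hypothetical, and the scalar loop stands in for a vector lane permutation.

```c
#include <assert.h>
#include <stdint.h>

/* Build a flat 256 x 8 table: row m holds the positions of the set bits of m. */
static void sketch_build_table(uint8_t *table)
{
    for (unsigned m = 0; m < 256; m++) {
        unsigned k = 0;
        for (unsigned i = 0; i < 8; i++) {
            if (m & (1u << i)) {
                table[m * 8 + k++] = (uint8_t)i;
            }
        }
        while (k < 8) {
            table[m * 8 + k++] = 0; /* padding; these slots are ignored by the caller */
        }
    }
}

/* Move the lanes selected by mask to the front of out; returns the number kept.
 * The table is received as an argument rather than referenced as a global symbol. */
static unsigned sketch_compact(int32_t out[8], const int32_t lanes[8],
                               unsigned mask, const uint8_t *table)
{
    unsigned kept = 0;

    assert(mask < 256);
    for (unsigned i = 0; i < 8; i++) {
        out[i] = lanes[table[mask * 8 + i]]; /* models a per-lane permutation */
        kept += (mask >> i) & 1u;            /* population count of the mask */
    }
    return kept;
}
```

Only the first `kept` entries of `out` are meaningful; a caller would advance its output pointer by that amount after each iteration.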

Also adds poly_uniform to the component benchmark.

Signed-off-by: jakemas <jakemas@amazon.com>
@jakemas jakemas requested a review from a team as a code owner April 3, 2026 04:11
@jakemas jakemas marked this pull request as draft April 3, 2026 04:11
@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 113118 cycles 113013 cycles 1.00
ML-DSA-44 sign 355649 cycles 355605 cycles 1.00
ML-DSA-44 verify 117801 cycles 117682 cycles 1.00
ML-DSA-65 keypair 196381 cycles 196214 cycles 1.00
ML-DSA-65 sign 589557 cycles 588943 cycles 1.00
ML-DSA-65 verify 194604 cycles 194375 cycles 1.00
ML-DSA-87 keypair 322210 cycles 322148 cycles 1.00
ML-DSA-87 sign 752493 cycles 752763 cycles 1.00
ML-DSA-87 verify 320055 cycles 319900 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 212361 cycles 212622 cycles 1.00
ML-DSA-44 sign 760716 cycles 760066 cycles 1.00
ML-DSA-44 verify 228743 cycles 228987 cycles 1.00
ML-DSA-65 keypair 379384 cycles 379665 cycles 1.00
ML-DSA-65 sign 1250617 cycles 1249827 cycles 1.00
ML-DSA-65 verify 371531 cycles 372045 cycles 1.00
ML-DSA-87 keypair 604335 cycles 605426 cycles 1.00
ML-DSA-87 sign 1593243 cycles 1591413 cycles 1.00
ML-DSA-87 verify 618270 cycles 617375 cycles 1.00

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 66830 cycles 68874 cycles 0.97
ML-DSA-44 sign 184077 cycles 187594 cycles 0.98
ML-DSA-44 verify 65562 cycles 68993 cycles 0.95
ML-DSA-65 keypair 111959 cycles 119089 cycles 0.94
ML-DSA-65 sign 292002 cycles 299488 cycles 0.98
ML-DSA-65 verify 108472 cycles 115385 cycles 0.94
ML-DSA-87 keypair 185520 cycles 203754 cycles 0.91
ML-DSA-87 sign 379630 cycles 396462 cycles 0.96
ML-DSA-87 verify 177291 cycles 196231 cycles 0.90

@oqs-bot oqs-bot left a comment

Graviton4
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 68316 cycles 68121 cycles 1.00
ML-DSA-44 sign 202487 cycles 202429 cycles 1.00
ML-DSA-44 verify 70722 cycles 70691 cycles 1.00
ML-DSA-65 keypair 121061 cycles 121050 cycles 1.00
ML-DSA-65 sign 331574 cycles 332242 cycles 1.00
ML-DSA-65 verify 117810 cycles 118169 cycles 1.00
ML-DSA-87 keypair 198140 cycles 198283 cycles 1.00
ML-DSA-87 sign 427941 cycles 428124 cycles 1.00
ML-DSA-87 verify 194637 cycles 194645 cycles 1.00

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 134578 cycles 135123 cycles 1.00
ML-DSA-44 sign 523923 cycles 523989 cycles 1.00
ML-DSA-44 verify 147640 cycles 147421 cycles 1.00
ML-DSA-65 keypair 228634 cycles 227032 cycles 1.01
ML-DSA-65 sign 864042 cycles 860343 cycles 1.00
ML-DSA-65 verify 236700 cycles 234883 cycles 1.01
ML-DSA-87 keypair 371955 cycles 371568 cycles 1.00
ML-DSA-87 sign 1080535 cycles 1079389 cycles 1.00
ML-DSA-87 verify 383811 cycles 383403 cycles 1.00

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 56863 cycles 56287 cycles 1.01
ML-DSA-44 sign 181063 cycles 181562 cycles 1.00
ML-DSA-44 verify 61140 cycles 61061 cycles 1.00
ML-DSA-65 keypair 98291 cycles 98770 cycles 1.00
ML-DSA-65 sign 298368 cycles 299116 cycles 1.00
ML-DSA-65 verify 100343 cycles 100251 cycles 1.00
ML-DSA-87 keypair 152430 cycles 153265 cycles 0.99
ML-DSA-87 sign 354719 cycles 355417 cycles 1.00
ML-DSA-87 verify 153124 cycles 153884 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 128315 cycles 128272 cycles 1.00
ML-DSA-44 sign 447513 cycles 447600 cycles 1.00
ML-DSA-44 verify 138123 cycles 144678 cycles 0.95
ML-DSA-65 keypair 220541 cycles 220481 cycles 1.00
ML-DSA-65 sign 726484 cycles 726951 cycles 1.00
ML-DSA-65 verify 222926 cycles 223461 cycles 1.00
ML-DSA-87 keypair 366142 cycles 366604 cycles 1.00
ML-DSA-87 sign 927541 cycles 927414 cycles 1.00
ML-DSA-87 verify 374016 cycles 373875 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton3
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 72353 cycles 72235 cycles 1.00
ML-DSA-44 sign 212424 cycles 212375 cycles 1.00
ML-DSA-44 verify 75754 cycles 75714 cycles 1.00
ML-DSA-65 keypair 127646 cycles 127612 cycles 1.00
ML-DSA-65 sign 351030 cycles 350845 cycles 1.00
ML-DSA-65 verify 125627 cycles 125755 cycles 1.00
ML-DSA-87 keypair 205980 cycles 208476 cycles 0.99
ML-DSA-87 sign 444778 cycles 450018 cycles 0.99
ML-DSA-87 verify 205601 cycles 205843 cycles 1.00

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 157499 cycles 157541 cycles 1.00
ML-DSA-44 sign 549244 cycles 549413 cycles 1.00
ML-DSA-44 verify 169448 cycles 168865 cycles 1.00
ML-DSA-65 keypair 268437 cycles 268818 cycles 1.00
ML-DSA-65 sign 903422 cycles 903672 cycles 1.00
ML-DSA-65 verify 275283 cycles 274680 cycles 1.00
ML-DSA-87 keypair 448241 cycles 448464 cycles 1.00
ML-DSA-87 sign 1158654 cycles 1157970 cycles 1.00
ML-DSA-87 verify 458704 cycles 458043 cycles 1.00

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 42142 cycles 40662 cycles 1.04
ML-DSA-44 sign 134317 cycles 132808 cycles 1.01
ML-DSA-44 verify 44844 cycles 43607 cycles 1.03
ML-DSA-65 keypair 72940 cycles 71859 cycles 1.02
ML-DSA-65 sign 213861 cycles 213367 cycles 1.00
ML-DSA-65 verify 73729 cycles 72847 cycles 1.01
ML-DSA-87 keypair 107003 cycles 109237 cycles 0.98
ML-DSA-87 sign 250851 cycles 254550 cycles 0.99
ML-DSA-87 verify 107681 cycles 109371 cycles 0.98

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 120754 cycles 120325 cycles 1.00
ML-DSA-44 sign 447570 cycles 447576 cycles 1.00
ML-DSA-44 verify 130511 cycles 130561 cycles 1.00
ML-DSA-65 keypair 205040 cycles 205018 cycles 1.00
ML-DSA-65 sign 728790 cycles 729474 cycles 1.00
ML-DSA-65 verify 210029 cycles 209605 cycles 1.00
ML-DSA-87 keypair 337610 cycles 336678 cycles 1.00
ML-DSA-87 sign 925517 cycles 924223 cycles 1.00
ML-DSA-87 verify 347563 cycles 347399 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 138744 cycles 138561 cycles 1.00
ML-DSA-44 sign 483982 cycles 484140 cycles 1.00
ML-DSA-44 verify 148574 cycles 162388 cycles 0.91
ML-DSA-65 keypair 241921 cycles 241950 cycles 1.00
ML-DSA-65 sign 792702 cycles 792591 cycles 1.00
ML-DSA-65 verify 240763 cycles 241288 cycles 1.00
ML-DSA-87 keypair 396106 cycles 397138 cycles 1.00
ML-DSA-87 sign 1013453 cycles 1013569 cycles 1.00
ML-DSA-87 verify 403446 cycles 403178 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton2
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 113189 cycles 113255 cycles 1.00
ML-DSA-44 sign 355791 cycles 356042 cycles 1.00
ML-DSA-44 verify 117978 cycles 117969 cycles 1.00
ML-DSA-65 keypair 196342 cycles 196623 cycles 1.00
ML-DSA-65 sign 589183 cycles 589242 cycles 1.00
ML-DSA-65 verify 194553 cycles 194559 cycles 1.00
ML-DSA-87 keypair 322537 cycles 322281 cycles 1.00
ML-DSA-87 sign 753613 cycles 753546 cycles 1.00
ML-DSA-87 verify 320115 cycles 320070 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 213219 cycles 212521 cycles 1.00
ML-DSA-44 sign 761553 cycles 760970 cycles 1.00
ML-DSA-44 verify 241351 cycles 234237 cycles 1.03
ML-DSA-65 keypair 380573 cycles 379762 cycles 1.00
ML-DSA-65 sign 1252452 cycles 1252199 cycles 1.00
ML-DSA-65 verify 372839 cycles 371797 cycles 1.00
ML-DSA-87 keypair 607341 cycles 604584 cycles 1.00
ML-DSA-87 sign 1596680 cycles 1595561 cycles 1.00
ML-DSA-87 verify 619175 cycles 618927 cycles 1.00

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 verify 241351 cycles 234237 cycles 1.03

oqs-bot commented Apr 3, 2026

CBMC Results (ML-DSA-44)

Full Results (177 proofs)
Proof Status Current Previous Change
**TOTAL** 2074s 1959s +5.9%
mld_attempt_signature_generation 238s 227s +5%
polyvecl_pointwise_acc_montgomery_c 228s 204s +12%
poly_pointwise_montgomery_c 167s 147s +14%
rej_uniform_native 155s 141s +10%
sign_verify_internal 126s 122s +3%
mld_invntt_layer 92s 85s +8%
mld_ct_memcmp 81s 73s +11%
mld_ntt_layer 57s 54s +6%
sign_signature_internal 33s 32s +3%
polyvec_matrix_expand 29s 26s +12%
poly_chknorm_c 24s 21s +14%
fqmul 21s 20s +5%
rej_uniform 21s 20s +5%
poly_uniform_eta_4x 18s 17s +6%
polyeta_unpack 18s 16s +12%
poly_uniform_4x 17s 17s +0%
polyvec_matrix_pointwise_montgomery 16s 12s +33%
polyt0_unpack 15s 14s +7%
mld_compute_t0_t1_tr_from_sk_components 14s 13s +8%
polymat_permute_bitrev_to_custom 14s 15s -7%
rej_uniform_c 14s 14s +0%
keccakf1600x4_permute_native 13s 12s +8%
mld_ntt_butterfly_block 13s 11s +18%
polyvec_matrix_expand_serial 13s 11s +18%
poly_add 12s 11s +9%
polyveck_power2round 12s 11s +9%
polyz_unpack_c 12s 11s +9%
keccak_absorb_once_x4 10s 12s -17%
keccakf1600_permute_native 10s 8s +25%
mld_polyvecl_permute_bitrev_to_custom_native 10s 8s +25%
sign 9s 7s +29%
keccakf1600_permute 8s 8s +0%
poly_invntt_tomont_c 8s 6s +33%
polyveck_add 8s 8s +0%
keccak_absorb 7s 7s +0%
mld_prepare_domain_separation_prefix 7s 2s +250%
poly_caddq_c 7s 6s +17%
poly_uniform 7s 4s +75%
polyveck_caddq 7s 4s +75%
keccak_squeeze 6s 4s +50%
keccak_squeezeblocks_x4 6s 6s +0%
mld_compute_pack_z 6s 6s +0%
poly_challenge 6s 6s +0%
poly_chknorm 6s 3s +100%
poly_power2round 6s 5s +20%
polyveck_decompose 6s 4s +50%
polyveck_ntt 6s 9s -33%
polyveck_sub 6s 6s +0%
polyveck_use_hint 6s 10s -40%
polyw1_pack 6s 1s +500%
shake256_release 6s 1s +500%
unpack_sk 6s 7s -14%
mld_check_pct 5s 7s -29%
mld_ct_cmask_nonzero_u32 5s 2s +150%
pack_sig_z 5s 5s +0%
poly_pointwise_montgomery_native 5s 3s +67%
poly_uniform_gamma1 5s 3s +67%
poly_use_hint_c 5s 4s +25%
polyt0_pack 5s 3s +67%
polyveck_chknorm 5s 4s +25%
polyveck_pack_eta 5s 4s +25%
polyvecl_chknorm 5s 4s +25%
polyvecl_ntt 5s 5s +0%
polyvecl_permute_bitrev_to_custom 5s 3s +67%
polyvecl_pointwise_acc_montgomery_native 5s 3s +67%
shake256_squeeze 5s 3s +67%
sign_keypair_internal 5s 6s -17%
sign_signature_extmu 5s 6s -17%
sign_signature_pre_hash_internal 5s 5s +0%
unpack_hints 5s 6s -17%
unpack_sig 5s 3s +67%
intt_native_x86_64 4s 4s +0%
keccakf1600_xor_bytes 4s 2s +100%
mld_ct_abs_i32 4s 2s +100%
mld_h 4s 5s -20%
mld_sample_s1_s2 4s 2s +100%
mld_sample_s1_s2_serial 4s 3s +33%
mld_value_barrier_i64 4s 6s -33%
pack_pk 4s 2s +100%
poly_caddq 4s 5s -20%
poly_caddq_native 4s 4s +0%
poly_decompose_c 4s 5s -20%
poly_make_hint 4s 4s +0%
poly_ntt 4s 2s +100%
poly_uniform_gamma1_4x 4s 5s -20%
polyveck_invntt_tomont 4s 5s -20%
polyveck_shiftl 4s 6s -33%
polyveck_unpack_eta 4s 3s +33%
polyz_unpack 4s 1s +300%
polyz_unpack_native 4s 2s +100%
rej_eta 4s 5s -20%
rej_eta_c 4s 4s +0%
rej_eta_native 4s 4s +0%
shake128_finalize 4s 3s +33%
shake128_init 4s 2s +100%
shake128_squeeze 4s 3s +33%
shake128x4_absorb_once 4s 4s +0%
shake256_absorb 4s 3s +33%
shake256x4_absorb_once 4s 4s +0%
sign_keypair 4s 3s +33%
sign_pk_from_sk 4s 8s -50%
sign_signature 4s 8s -50%
sign_signature_pre_hash_shake256 4s 6s -33%
sign_verify 4s 5s -20%
sign_verify_extmu 4s 3s +33%
unpack_pk 4s 3s +33%
use_hint 4s 3s +33%
caddq 3s 3s +0%
decompose 3s 4s -25%
fqscale 3s 2s +50%
keccak_init 3s 2s +50%
keccakf1600_xor_bytes (big endian) 3s 2s +50%
make_hint 3s 4s -25%
mld_ct_cmask_neg_i32 3s 2s +50%
mld_ct_cmask_nonzero_u8 3s 4s -25%
mld_ct_get_optblocker_u32 3s 3s +0%
mld_keccakf1600_extract_bytes 3s 3s +0%
montgomery_reduce 3s 4s -25%
ntt_native_aarch64 3s 4s -25%
ntt_native_x86_64 3s 5s -40%
pack_sig_c_h 3s 3s +0%
pack_sk 3s 3s +0%
poly_caddq_native_aarch64 3s 5s -40%
poly_chknorm_native 3s 2s +50%
poly_chknorm_native_aarch64 3s 4s -25%
poly_decompose 3s 3s +0%
poly_decompose_native 3s 2s +50%
poly_invntt_tomont_native 3s 3s +0%
poly_pointwise_montgomery 3s 3s +0%
poly_reduce 3s 2s +50%
poly_shiftl 3s 3s +0%
poly_sub 3s 4s -25%
poly_uniform_eta 3s 7s -57%
poly_use_hint_native 3s 5s -40%
polyeta_pack 3s 3s +0%
polyt1_pack 3s 3s +0%
polyt1_unpack 3s 2s +50%
polyveck_make_hint 3s 2s +50%
polyveck_pack_w1 3s 4s -25%
polyveck_pointwise_poly_montgomery 3s 4s -25%
polyveck_reduce 3s 5s -40%
polyveck_unpack_t0 3s 4s -25%
polyvecl_pack_eta 3s 3s +0%
polyvecl_pointwise_acc_montgomery 3s 6s -50%
polyvecl_uniform_gamma1 3s 3s +0%
polyvecl_unpack_eta 3s 4s -25%
polyvecl_unpack_z 3s 5s -40%
power2round 3s 3s +0%
reduce32 3s 3s +0%
shake128_absorb 3s 2s +50%
shake128_release 3s 3s +0%
shake128x4_squeezeblocks 3s 3s +0%
shake256 3s 5s -40%
sign_verify_pre_hash_internal 3s 3s +0%
sign_verify_pre_hash_shake256 3s 5s -40%
keccak_finalize 2s 3s -33%
keccakf1600_extract_bytes (big endian) 2s 2s +0%
keccakf1600x4_permute 2s 3s -33%
keccakf1600x4_xor_bytes 2s 3s -33%
mld_ct_get_optblocker_i64 2s 1s +100%
mld_ct_get_optblocker_u8 2s 2s +0%
mld_value_barrier_u32 2s 2s +0%
mld_value_barrier_u8 2s 5s -60%
poly_invntt_tomont 2s 6s -67%
poly_ntt_c 2s 2s +0%
poly_ntt_native 2s 2s +0%
poly_use_hint 2s 3s -33%
polyveck_pack_t0 2s 3s -33%
polyvecl_uniform_gamma1_serial 2s 3s -33%
polyz_pack 2s 6s -67%
shake256_finalize 2s 1s +100%
shake256_init 2s 2s +0%
shake256x4_squeezeblocks 2s 5s -60%
sign_open 2s 8s -75%
sys_check_capability 2s 3s -33%
keccakf1600x4_extract_bytes 1s 2s -50%
mld_ct_sel_int32 1s 3s -67%

oqs-bot commented Apr 3, 2026

CBMC Results (ML-DSA-87)

Full Results (177 proofs)
Proof Status Current Previous Change
**TOTAL** 2545s 2631s -3.3%
sign_verify_internal 323s 327s -1%
polyvecl_pointwise_acc_montgomery_c 257s 260s -1%
mld_attempt_signature_generation 222s 231s -4%
polyvec_matrix_expand 173s 173s +0%
poly_pointwise_montgomery_c 153s 155s -1%
rej_uniform_native 142s 146s -3%
mld_invntt_layer 90s 99s -9%
polyvec_matrix_expand_serial 81s 80s +1%
mld_ct_memcmp 68s 71s -4%
polyveck_decompose 56s 56s +0%
mld_ntt_layer 52s 52s +0%
sign_signature_internal 50s 54s -7%
polymat_permute_bitrev_to_custom 45s 46s -2%
mld_compute_t0_t1_tr_from_sk_components 25s 28s -11%
rej_uniform 21s 20s +5%
fqmul 20s 21s -5%
poly_chknorm_c 20s 22s -9%
polyeta_unpack 16s 16s +0%
poly_uniform_4x 15s 14s +7%
poly_uniform_eta_4x 14s 17s -18%
polyveck_use_hint 14s 12s +17%
keccakf1600x4_permute_native 13s 14s -7%
polyvec_matrix_pointwise_montgomery 13s 14s -7%
rej_uniform_c 13s 16s -19%
mld_ntt_butterfly_block 12s 12s +0%
polyt0_unpack 12s 15s -20%
mld_polyvecl_permute_bitrev_to_custom_native 11s 10s +10%
keccak_absorb_once_x4 10s 11s -9%
poly_add 10s 13s -23%
polyveck_reduce 10s 10s +0%
polyveck_shiftl 10s 8s +25%
keccak_squeezeblocks_x4 9s 7s +29%
keccakf1600_permute 9s 8s +12%
mld_compute_pack_z 9s 7s +29%
polyveck_add 9s 9s +0%
polyveck_sub 9s 9s +0%
polyvecl_ntt 9s 13s -31%
polyz_unpack_c 9s 8s +12%
keccakf1600_permute_native 8s 8s +0%
poly_invntt_tomont_c 8s 8s +0%
polyveck_power2round 8s 7s +14%
keccakf1600x4_xor_bytes 7s 3s +133%
mld_check_pct 7s 7s +0%
poly_decompose_c 7s 11s -36%
poly_uniform_eta 7s 6s +17%
polyveck_caddq 7s 8s -12%
polyveck_pointwise_poly_montgomery 7s 6s +17%
reduce32 7s 3s +133%
sign 7s 9s -22%
sign_pk_from_sk 7s 8s -12%
keccak_absorb 6s 10s -40%
mld_sample_s1_s2 6s 6s +0%
mld_sample_s1_s2_serial 6s 8s -25%
polyveck_invntt_tomont 6s 10s -40%
polyveck_make_hint 6s 5s +20%
polyveck_ntt 6s 6s +0%
polyvecl_unpack_z 6s 4s +50%
shake256_squeeze 6s 2s +200%
keccakf1600_extract_bytes (big endian) 5s 3s +67%
keccakf1600_xor_bytes 5s 4s +25%
mld_ct_cmask_nonzero_u8 5s 4s +25%
poly_caddq 5s 2s +150%
poly_power2round 5s 8s -38%
poly_uniform 5s 4s +25%
poly_uniform_gamma1 5s 4s +25%
polyveck_chknorm 5s 6s -17%
polyveck_pack_t0 5s 3s +67%
polyvecl_pointwise_acc_montgomery_native 5s 6s -17%
shake128_squeeze 5s 2s +150%
sign_verify_pre_hash_internal 5s 4s +25%
unpack_hints 5s 7s -29%
unpack_sig 5s 2s +150%
intt_native_x86_64 4s 4s +0%
keccak_finalize 4s 1s +300%
keccak_init 4s 2s +100%
keccakf1600x4_extract_bytes 4s 4s +0%
make_hint 4s 3s +33%
mld_h 4s 4s +0%
poly_caddq_native_aarch64 4s 3s +33%
poly_challenge 4s 3s +33%
poly_chknorm_native 4s 2s +100%
poly_invntt_tomont_native 4s 2s +100%
poly_ntt 4s 3s +33%
poly_ntt_c 4s 3s +33%
poly_sub 4s 3s +33%
poly_use_hint_c 4s 2s +100%
polyeta_pack 4s 2s +100%
polyt0_pack 4s 3s +33%
polyveck_unpack_t0 4s 6s -33%
polyvecl_pack_eta 4s 3s +33%
polyw1_pack 4s 3s +33%
rej_eta_native 4s 3s +33%
shake256_absorb 4s 2s +100%
sign_keypair 4s 4s +0%
sign_keypair_internal 4s 5s -20%
sign_signature 4s 6s -33%
sign_signature_pre_hash_shake256 4s 3s +33%
sign_verify 4s 4s +0%
unpack_pk 4s 6s -33%
unpack_sk 4s 6s -33%
use_hint 4s 4s +0%
caddq 3s 3s +0%
fqscale 3s 2s +50%
keccak_squeeze 3s 4s -25%
mld_ct_cmask_nonzero_u32 3s 1s +200%
mld_keccakf1600_extract_bytes 3s 4s -25%
mld_prepare_domain_separation_prefix 3s 3s +0%
mld_value_barrier_i64 3s 2s +50%
mld_value_barrier_u32 3s 1s +200%
mld_value_barrier_u8 3s 3s +0%
ntt_native_aarch64 3s 3s +0%
ntt_native_x86_64 3s 3s +0%
pack_sig_z 3s 5s -40%
poly_caddq_c 3s 4s -25%
poly_chknorm 3s 3s +0%
poly_chknorm_native_aarch64 3s 6s -50%
poly_decompose_native 3s 3s +0%
poly_make_hint 3s 3s +0%
poly_pointwise_montgomery 3s 4s -25%
poly_reduce 3s 5s -40%
poly_uniform_gamma1_4x 3s 7s -57%
poly_use_hint 3s 4s -25%
poly_use_hint_native 3s 3s +0%
polyt1_unpack 3s 2s +50%
polyveck_pack_w1 3s 5s -40%
polyveck_unpack_eta 3s 5s -40%
polyvecl_permute_bitrev_to_custom 3s 3s +0%
polyvecl_pointwise_acc_montgomery 3s 3s +0%
polyvecl_uniform_gamma1 3s 4s -25%
polyvecl_unpack_eta 3s 4s -25%
polyz_unpack_native 3s 3s +0%
rej_eta 3s 4s -25%
rej_eta_c 3s 7s -57%
shake128_absorb 3s 1s +200%
shake128_finalize 3s 3s +0%
shake128_init 3s 2s +50%
shake128x4_squeezeblocks 3s 3s +0%
shake256_init 3s 1s +200%
shake256_release 3s 3s +0%
sign_open 3s 2s +50%
sign_signature_extmu 3s 4s -25%
sign_signature_pre_hash_internal 3s 6s -50%
sign_verify_extmu 3s 5s -40%
sign_verify_pre_hash_shake256 3s 4s -25%
sys_check_capability 3s 3s +0%
decompose 2s 2s +0%
keccakf1600_xor_bytes (big endian) 2s 1s +100%
keccakf1600x4_permute 2s 4s -50%
mld_ct_cmask_neg_i32 2s 2s +0%
mld_ct_get_optblocker_u32 2s 4s -50%
mld_ct_get_optblocker_u8 2s 1s +100%
mld_ct_sel_int32 2s 1s +100%
montgomery_reduce 2s 2s +0%
pack_pk 2s 5s -60%
pack_sig_c_h 2s 3s -33%
pack_sk 2s 2s +0%
poly_caddq_native 2s 3s -33%
poly_decompose 2s 4s -50%
poly_invntt_tomont 2s 2s +0%
poly_ntt_native 2s 2s +0%
poly_shiftl 2s 3s -33%
polyveck_pack_eta 2s 4s -50%
polyvecl_chknorm 2s 5s -60%
polyvecl_uniform_gamma1_serial 2s 6s -67%
polyz_unpack 2s 3s -33%
power2round 2s 5s -60%
shake128x4_absorb_once 2s 3s -33%
shake256 2s 2s +0%
shake256_finalize 2s 1s +100%
shake256x4_absorb_once 2s 4s -50%
shake256x4_squeezeblocks 2s 4s -50%
mld_ct_abs_i32 1s 2s -50%
mld_ct_get_optblocker_i64 1s 4s -75%
poly_pointwise_montgomery_native 1s 4s -75%
polyt1_pack 1s 5s -80%
polyz_pack 1s 2s -50%
shake128_release 1s 6s -83%

oqs-bot commented Apr 3, 2026

CBMC Results (ML-DSA-65)

Full Results (177 proofs)
Proof Status Current Previous Change
**TOTAL** 2597s 2384s +8.9%
sign_verify_internal 360s 332s +8%
mld_attempt_signature_generation 290s 263s +10%
polyvecl_pointwise_acc_montgomery_c 213s 178s +20%
poly_pointwise_montgomery_c 174s 150s +16%
rej_uniform_native 158s 145s +9%
polyvec_matrix_expand 134s 120s +12%
mld_invntt_layer 103s 95s +8%
mld_ct_memcmp 83s 75s +11%
polyvec_matrix_expand_serial 71s 67s +6%
mld_ntt_layer 58s 54s +7%
sign_signature_internal 38s 34s +12%
polymat_permute_bitrev_to_custom 32s 30s +7%
mld_compute_t0_t1_tr_from_sk_components 30s 26s +15%
fqmul 22s 20s +10%
poly_chknorm_c 22s 21s +5%
rej_uniform 22s 20s +10%
rej_uniform_c 18s 15s +20%
poly_uniform_eta_4x 16s 21s -24%
poly_uniform_4x 15s 17s -12%
mld_ntt_butterfly_block 14s 13s +8%
polyt0_unpack 14s 13s +8%
polyvecl_chknorm 14s 13s +8%
keccakf1600x4_permute_native 13s 13s +0%
polyvec_matrix_pointwise_montgomery 13s 13s +0%
polyveck_decompose 13s 13s +0%
polyveck_sub 13s 10s +30%
poly_add 11s 12s -8%
polyveck_caddq 11s 8s +38%
polyveck_ntt 11s 13s -15%
polyveck_use_hint 11s 8s +38%
polyveck_add 10s 9s +11%
polyveck_power2round 10s 11s -9%
polyveck_reduce 10s 6s +67%
sign_pk_from_sk 10s 8s +25%
keccak_absorb_once_x4 9s 9s +0%
keccakf1600_permute_native 9s 8s +12%
poly_invntt_tomont_c 9s 6s +50%
polyveck_invntt_tomont 9s 10s -10%
polyvecl_ntt 9s 7s +29%
keccakf1600_permute 8s 7s +14%
mld_check_pct 8s 8s +0%
mld_polyvecl_permute_bitrev_to_custom_native 8s 6s +33%
polyeta_unpack 8s 6s +33%
sign 8s 5s +60%
keccak_absorb 7s 8s -12%
keccak_squeezeblocks_x4 7s 8s -12%
mld_sample_s1_s2 7s 5s +40%
poly_decompose_c 7s 7s +0%
poly_use_hint_c 7s 3s +133%
polyt1_pack 7s 4s +75%
polyveck_pointwise_poly_montgomery 7s 7s +0%
polyveck_shiftl 7s 8s -12%
sign_open 7s 5s +40%
sign_signature_pre_hash_shake256 7s 3s +133%
poly_caddq_c 6s 7s -14%
poly_power2round 6s 4s +50%
polyt0_pack 6s 5s +20%
polyveck_make_hint 6s 8s -25%
sign_keypair_internal 6s 4s +50%
sign_signature 6s 4s +50%
sign_verify_extmu 6s 5s +20%
fqscale 5s 2s +150%
mld_compute_pack_z 5s 9s -44%
mld_ct_cmask_nonzero_u32 5s 5s +0%
mld_h 5s 5s +0%
poly_uniform 5s 8s -38%
poly_use_hint_native 5s 3s +67%
polyveck_unpack_eta 5s 6s -17%
polyveck_unpack_t0 5s 2s +150%
rej_eta_native 5s 5s +0%
sign_signature_extmu 5s 5s +0%
sign_verify_pre_hash_shake256 5s 3s +67%
unpack_hints 5s 5s +0%
unpack_sk 5s 5s +0%
intt_native_x86_64 4s 4s +0%
mld_prepare_domain_separation_prefix 4s 5s -20%
mld_sample_s1_s2_serial 4s 4s +0%
mld_value_barrier_u32 4s 3s +33%
pack_sig_c_h 4s 3s +33%
pack_sk 4s 3s +33%
poly_chknorm_native 4s 3s +33%
poly_make_hint 4s 3s +33%
poly_ntt_native 4s 2s +100%
poly_shiftl 4s 5s -20%
poly_uniform_eta 4s 5s -20%
poly_uniform_gamma1 4s 4s +0%
poly_uniform_gamma1_4x 4s 5s -20%
polyt1_unpack 4s 2s +100%
polyveck_chknorm 4s 4s +0%
polyveck_pack_eta 4s 1s +300%
polyveck_pack_t0 4s 3s +33%
polyw1_pack 4s 3s +33%
polyz_unpack_native 4s 2s +100%
rej_eta 4s 3s +33%
rej_eta_c 4s 3s +33%
shake128_finalize 4s 2s +100%
shake128_init 4s 1s +300%
shake256_init 4s 2s +100%
sign_keypair 4s 6s -33%
sign_signature_pre_hash_internal 4s 5s -20%
sign_verify 4s 4s +0%
unpack_pk 4s 5s -20%
use_hint 4s 3s +33%
caddq 3s 3s +0%
decompose 3s 2s +50%
keccak_finalize 3s 2s +50%
keccakf1600_extract_bytes (big endian) 3s 4s -25%
keccakf1600_xor_bytes 3s 3s +0%
keccakf1600_xor_bytes (big endian) 3s 1s +200%
keccakf1600x4_permute 3s 3s +0%
keccakf1600x4_xor_bytes 3s 2s +50%
make_hint 3s 2s +50%
mld_ct_cmask_neg_i32 3s 3s +0%
mld_ct_get_optblocker_i64 3s 2s +50%
mld_ct_get_optblocker_u32 3s 5s -40%
mld_ct_get_optblocker_u8 3s 2s +50%
mld_ct_sel_int32 3s 3s +0%
mld_keccakf1600_extract_bytes 3s 3s +0%
mld_value_barrier_i64 3s 1s +200%
mld_value_barrier_u8 3s 2s +50%
montgomery_reduce 3s 2s +50%
ntt_native_x86_64 3s 2s +50%
poly_caddq_native_aarch64 3s 2s +50%
poly_challenge 3s 3s +0%
poly_chknorm_native_aarch64 3s 3s +0%
poly_decompose 3s 2s +50%
poly_decompose_native 3s 4s -25%
poly_ntt 3s 2s +50%
poly_ntt_c 3s 2s +50%
poly_pointwise_montgomery_native 3s 3s +0%
poly_reduce 3s 2s +50%
poly_use_hint 3s 2s +50%
polyvecl_permute_bitrev_to_custom 3s 2s +50%
polyvecl_pointwise_acc_montgomery 3s 4s -25%
polyvecl_pointwise_acc_montgomery_native 3s 4s -25%
polyvecl_uniform_gamma1 3s 2s +50%
polyvecl_unpack_eta 3s 3s +0%
polyvecl_unpack_z 3s 4s -25%
polyz_pack 3s 3s +0%
power2round 3s 4s -25%
shake128_squeeze 3s 3s +0%
shake256_squeeze 3s 2s +50%
sign_verify_pre_hash_internal 3s 5s -40%
unpack_sig 3s 5s -40%
keccak_init 2s 4s -50%
keccak_squeeze 2s 1s +100%
keccakf1600x4_extract_bytes 2s 3s -33%
mld_ct_cmask_nonzero_u8 2s 3s -33%
ntt_native_aarch64 2s 3s -33%
pack_sig_z 2s 5s -60%
poly_caddq 2s 3s -33%
poly_caddq_native 2s 3s -33%
poly_chknorm 2s 3s -33%
poly_invntt_tomont 2s 1s +100%
poly_invntt_tomont_native 2s 6s -67%
poly_pointwise_montgomery 2s 5s -60%
poly_sub 2s 2s +0%
polyeta_pack 2s 2s +0%
polyveck_pack_w1 2s 4s -50%
polyvecl_pack_eta 2s 3s -33%
polyvecl_uniform_gamma1_serial 2s 5s -60%
polyz_unpack_c 2s 2s +0%
reduce32 2s 2s +0%
shake128_release 2s 2s +0%
shake128x4_absorb_once 2s 4s -50%
shake256 2s 2s +0%
shake256_finalize 2s 2s +0%
shake256_release 2s 3s -33%
shake256x4_squeezeblocks 2s 2s +0%
sys_check_capability 2s 4s -50%
mld_ct_abs_i32 1s 2s -50%
pack_pk 1s 4s -75%
polyz_unpack 1s 3s -67%
shake128_absorb 1s 2s -50%
shake128x4_squeezeblocks 1s 3s -67%
shake256_absorb 1s 2s -50%
shake256x4_absorb_once 1s 2s -50%

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 34764 cycles 34374 cycles 1.01
ML-DSA-44 sign 120113 cycles 120132 cycles 1.00
ML-DSA-44 verify 38092 cycles 38166 cycles 1.00
ML-DSA-65 keypair 61138 cycles 60500 cycles 1.01
ML-DSA-65 sign 201844 cycles 199945 cycles 1.01
ML-DSA-65 verify 62783 cycles 62429 cycles 1.01
ML-DSA-87 keypair 93501 cycles 94486 cycles 0.99
ML-DSA-87 sign 236815 cycles 239500 cycles 0.99
ML-DSA-87 verify 95619 cycles 96894 cycles 0.99

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)
Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 93930 cycles 93842 cycles 1.00
ML-DSA-44 sign 333310 cycles 333119 cycles 1.00
ML-DSA-44 verify 100022 cycles 100025 cycles 1.00
ML-DSA-65 keypair 159902 cycles 160115 cycles 1.00
ML-DSA-65 sign 543114 cycles 543227 cycles 1.00
ML-DSA-65 verify 160989 cycles 161060 cycles 1.00
ML-DSA-87 keypair 266666 cycles 266874 cycles 1.00
ML-DSA-87 sign 704974 cycles 706010 cycles 1.00
ML-DSA-87 verify 270510 cycles 269779 cycles 1.00

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 6539a79 Previous: 9ee2f35 Ratio
ML-DSA-44 keypair 42142 cycles 40662 cycles 1.04

Development

Successfully merging this pull request may close these issues.

AVX2: Replace intrinsics implementation of rej_uniform with assembly
