x86_64: Eliminate use_hint 32/88 intrinsics #940

Open
willieyz wants to merge 2 commits into main from eliminate-use_hint_32_88-intrinsics

@willieyz (Contributor) commented Feb 3, 2026

We also tried unrolling the loops mld_poly_use_hint_88_avx2_loop and mld_poly_use_hint_32_avx2_loop in both files by a factor of 4. However, the benchmark results below show that unrolling provides no performance benefit, so we decided to keep the current, non-unrolled version.

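  • bench components
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| mld_poly_caddq (avg) | AVX2 intrinsics | no-opt | 821 | 781 | 789 | |
| mld_poly_caddq (avg) | x86_64 asm | no-opt | 847 | 786 | 787 | |
| mld_poly_caddq (avg) | Δ (%) | no-opt | +3.17% | +0.64% | -0.25% | |
| mld_poly_caddq (avg) | AVX2 intrinsics | opt | 210 | 147 | 143 | |
| mld_poly_caddq (avg) | x86_64 asm | opt | 220 | 153 | 155 | |
| mld_poly_caddq (avg) | x86_64 asm (unroll) | opt | 273 | 154 | 156 | unroll by 4 |
| mld_poly_caddq (avg) | Δ (%) | opt | +4.76% | +4.08% | +8.39% | |
| mld_poly_caddq (avg) | Δ (%) (unroll) | opt | +30.00% | +4.76% | +9.09% | unroll by 4 |

  • bench
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| keypair cycles (avg) | AVX2 intrinsics | no-opt | 127436 | 218610 | 360739 | baseline (main) |
| keypair cycles (avg) | x86_64 asm | no-opt | 127459 | 217604 | 367118 | |
| keypair cycles (avg) | Δ (%) | no-opt | +0.02% | -0.46% | +1.77% | |
| keypair cycles (avg) | AVX2 intrinsics | opt | 56955 | 98362 | 157869 | baseline (main) |
| keypair cycles (avg) | x86_64 asm | opt | 59747 | 102961 | 165706 | |
| keypair cycles (avg) | x86_64 asm (unroll) | opt | 59483 | 104732 | 166654 | |
| keypair cycles (avg) | Δ (%) | opt | +4.90% | +4.68% | +4.96% | |
| keypair cycles (avg) | Δ (%) (unroll) | opt | +4.44% | +6.48% | +5.56% | unroll by 4 |
| sign cycles (avg) | AVX2 intrinsics | no-opt | 451922 | 756003 | 958151 | baseline (main) |
| sign cycles (avg) | x86_64 asm | no-opt | 452833 | 752512 | 974497 | |
| sign cycles (avg) | Δ (%) | no-opt | +0.20% | -0.46% | +1.71% | |
| sign cycles (avg) | AVX2 intrinsics | opt | 170370 | 281545 | 347924 | baseline (main) |
| sign cycles (avg) | x86_64 asm | opt | 178564 | 294843 | 362677 | |
| sign cycles (avg) | x86_64 asm (unroll) | opt | 177251 | 300667 | 366158 | |
| sign cycles (avg) | Δ (%) | opt | +4.81% | +4.72% | +4.24% | |
| sign cycles (avg) | Δ (%) (unroll) | opt | +4.04% | +6.79% | +5.24% | unroll by 4 |
| verify cycles (avg) | AVX2 intrinsics | no-opt | 134113 | 220671 | 363234 | baseline (main) |
| verify cycles (avg) | x86_64 asm | no-opt | 134633 | 220015 | 369763 | |
| verify cycles (avg) | Δ (%) | no-opt | +0.39% | -0.30% | +1.80% | |
| verify cycles (avg) | AVX2 intrinsics | opt | 60234 | 98904 | 156281 | baseline (main) |
| verify cycles (avg) | x86_64 asm | opt | 63140 | 103682 | 164376 | |
| verify cycles (avg) | x86_64 asm (unroll) | opt | 62822 | 105719 | 164028 | |
| verify cycles (avg) | Δ (%) | opt | +4.82% | +4.83% | +5.18% | |
| verify cycles (avg) | Δ (%) (unroll) | opt | +4.30% | +6.89% | +4.96% | unroll by 4 |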
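
As a worked check of the Δ (%) formula used in the tables above, the following minimal Python sketch recomputes two of the reported deltas from the opt rows of the component benchmark (ML-DSA-44). The function name delta_percent is hypothetical and only serves this illustration; the cycle counts are copied from the table.

```python
def delta_percent(asm_cycles: float, avx2_cycles: float) -> float:
    """Relative change of the asm result against the AVX2 intrinsics baseline, in percent."""
    return (asm_cycles - avx2_cycles) / avx2_cycles * 100.0


# Cycle counts copied from the "opt" rows of the component benchmark (ML-DSA-44).
avx2 = 210
asm = 220
asm_unrolled = 273

print(f"asm vs AVX2:            {delta_percent(asm, avx2):+.2f}%")
print(f"asm (unroll 4) vs AVX2: {delta_percent(asm_unrolled, avx2):+.2f}%")
```

A positive value means the assembly version needed more cycles than the intrinsics baseline, which is why the unrolled variant (+30.00% for ML-DSA-44 in the component benchmark) was not adopted.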

@oqs-bot (Contributor) commented Feb 3, 2026

CBMC Results (ML-DSA-87)

Full Results (179 proofs)
Proof Status Current Previous Change
**TOTAL** 2655s 2394s +10.9%
polyvecl_pointwise_acc_montgomery_c 308s 260s +18%
mld_attempt_signature_generation 233s 221s +5%
sign_verify_internal 217s 203s +7%
polyvec_matrix_expand 192s 176s +9%
poly_pointwise_montgomery_c 182s 142s +28%
rej_uniform_native 156s 138s +13%
mld_invntt_layer 104s 92s +13%
mld_ct_memcmp 84s 72s +17%
polyvec_matrix_expand_serial 82s 80s +2%
polyveck_decompose 59s 55s +7%
mld_ntt_layer 57s 52s +10%
polymat_permute_bitrev_to_custom 47s 46s +2%
sign_signature_internal 40s 38s +5%
mld_compute_t0_t1_tr_from_sk_components 25s 26s -4%
rej_uniform 23s 25s -8%
fqmul 21s 18s +17%
poly_chknorm_c 19s 19s +0%
polyeta_unpack 19s 17s +12%
poly_uniform_4x 16s 13s +23%
poly_uniform_eta_4x 16s 15s +7%
polyt0_unpack 16s 14s +14%
keccakf1600x4_permute_native 15s 12s +25%
rej_uniform_c 14s 13s +8%
mld_ntt_butterfly_block 13s 11s +18%
mld_sample_s1_s2 13s 6s +117%
poly_add 13s 11s +18%
polyvec_matrix_pointwise_montgomery 13s 12s +8%
polyz_unpack_c 12s 7s +71%
sign_pk_from_sk 12s 7s +71%
keccak_absorb_once_x4 11s 12s -8%
polyvecl_ntt 11s 9s +22%
mld_polyvecl_permute_bitrev_to_custom_native 10s 14s -29%
polyveck_add 10s 9s +11%
polyveck_power2round 10s 6s +67%
keccak_squeezeblocks_x4 9s 8s +12%
keccakf1600_permute_native 9s 10s -10%
poly_decompose_c 9s 8s +12%
polyveck_pointwise_poly_montgomery_s2 9s 5s +80%
polyveck_reduce 9s 9s +0%
polyveck_sub 9s 7s +29%
polyveck_use_hint 9s 8s +12%
unpack_sk 9s 9s +0%
keccakf1600_permute 8s 8s +0%
poly_invntt_tomont_c 8s 6s +33%
polyveck_caddq 8s 7s +14%
polyveck_ntt 8s 7s +14%
polyveck_pointwise_poly_montgomery 8s 9s -11%
sign 8s 6s +33%
mld_check_pct 7s 9s -22%
mld_sample_s1_s2_serial 7s 8s -12%
poly_ntt_c 7s 2s +250%
poly_uniform_eta 7s 7s +0%
polyveck_pointwise_poly_montgomery_t0 7s 4s +75%
unpack_hints 7s 6s +17%
keccak_absorb 6s 6s +0%
pack_pk 6s 4s +50%
polyeta_pack 6s 2s +200%
polyvecl_chknorm 6s 4s +50%
rej_eta_native 6s 4s +50%
intt_native_x86_64 5s 6s -17%
mld_compute_pack_z 5s 7s -29%
mld_ct_sel_int32 5s 2s +150%
mld_h 5s 4s +25%
pack_sk 5s 4s +25%
poly_challenge 5s 8s -38%
poly_chknorm 5s 3s +67%
poly_power2round 5s 8s -38%
polyveck_shiftl 5s 7s -29%
polyvecl_permute_bitrev_to_custom 5s 3s +67%
polyvecl_uniform_gamma1 5s 2s +150%
polyvecl_uniform_gamma1_serial 5s 6s -17%
polyvecl_unpack_z 5s 4s +25%
polyw1_pack 5s 2s +150%
sign_keypair_internal 5s 6s -17%
sign_open 5s 5s +0%
sign_signature_extmu 5s 4s +25%
sign_signature_pre_hash_internal 5s 4s +25%
sign_verify 5s 4s +25%
caddq 4s 3s +33%
make_hint 4s 2s +100%
mld_ct_cmask_nonzero_u32 4s 2s +100%
mld_prepare_domain_separation_prefix 4s 4s +0%
ntt_native_x86_64 4s 2s +100%
poly_caddq_c 4s 5s -20%
poly_caddq_native 4s 2s +100%
poly_chknorm_native_aarch64 4s 1s +300%
poly_invntt_tomont 4s 2s +100%
poly_reduce 4s 4s +0%
poly_shiftl 4s 3s +33%
poly_sub 4s 3s +33%
poly_uniform 4s 3s +33%
poly_uniform_gamma1_4x 4s 3s +33%
poly_use_hint_c 4s 3s +33%
poly_use_hint_native 4s 3s +33%
polyt0_pack 4s 2s +100%
polyt1_unpack 4s 2s +100%
polyveck_chknorm 4s 6s -33%
polyveck_invntt_tomont 4s 4s +0%
polyveck_make_hint 4s 5s -20%
polyveck_pack_eta 4s 2s +100%
polyveck_unpack_eta 4s 4s +0%
polyvecl_pointwise_acc_montgomery_native 4s 1s +300%
polyz_unpack 4s 2s +100%
polyz_unpack_native 4s 3s +33%
power2round 4s 3s +33%
rej_eta_c 4s 3s +33%
shake128_init 4s 4s +0%
shake256_init 4s 4s +0%
sign_signature 4s 5s -20%
sign_verify_extmu 4s 3s +33%
sign_verify_pre_hash_shake256 4s 4s +0%
sys_check_capability 4s 4s +0%
unpack_sig 4s 3s +33%
use_hint 4s 4s +0%
decompose 3s 2s +50%
fqscale 3s 4s -25%
keccakf1600x4_extract_bytes 3s 1s +200%
keccakf1600x4_xor_bytes 3s 3s +0%
mld_ct_cmask_nonzero_u8 3s 3s +0%
mld_ct_get_optblocker_u32 3s 1s +200%
mld_ct_get_optblocker_u8 3s 3s +0%
pack_sig_z 3s 5s -40%
poly_caddq 3s 4s -25%
poly_caddq_native_aarch64 3s 4s -25%
poly_chknorm_native 3s 3s +0%
poly_decompose 3s 2s +50%
poly_decompose_native 3s 2s +50%
poly_ntt 3s 4s -25%
poly_ntt_native 3s 2s +50%
poly_pointwise_montgomery 3s 3s +0%
poly_pointwise_montgomery_native 3s 2s +50%
polyt1_pack 3s 3s +0%
polyveck_pack_t0 3s 3s +0%
polyveck_pack_w1 3s 2s +50%
polyveck_unpack_t0 3s 2s +50%
polyvecl_pointwise_acc_montgomery 3s 2s +50%
polyvecl_unpack_eta 3s 2s +50%
shake128x4_absorb_once 3s 2s +50%
shake256_release 3s 1s +200%
shake256x4_absorb_once 3s 2s +50%
shake256x4_squeezeblocks 3s 3s +0%
sign_keypair 3s 3s +0%
sign_verify_pre_hash_internal 3s 4s -25%
keccak_finalize 2s 3s -33%
keccak_init 2s 3s -33%
keccakf1600_extract_bytes (big endian) 2s 2s +0%
keccakf1600x4_permute 2s 1s +100%
mld_ct_abs_i32 2s 2s +0%
mld_value_barrier_i64 2s 1s +100%
mld_value_barrier_u32 2s 1s +100%
mld_value_barrier_u8 2s 2s +0%
montgomery_reduce 2s 2s +0%
ntt_native_aarch64 2s 4s -50%
pack_sig_c_h 2s 4s -50%
poly_invntt_tomont_native 2s 9s -78%
poly_uniform_gamma1 2s 2s +0%
poly_use_hint 2s 3s -33%
polyvecl_pack_eta 2s 5s -60%
polyz_pack 2s 4s -50%
reduce32 2s 2s +0%
rej_eta 2s 2s +0%
shake128_absorb 2s 4s -50%
shake128_release 2s 3s -33%
shake128_squeeze 2s 1s +100%
shake128x4_squeezeblocks 2s 3s -33%
shake256 2s 2s +0%
shake256_absorb 2s 2s +0%
shake256_finalize 2s 4s -50%
shake256_squeeze 2s 2s +0%
sign_signature_pre_hash_shake256 2s 3s -33%
unpack_pk 2s 4s -50%
keccak_squeeze 1s 3s -67%
keccakf1600_xor_bytes 1s 3s -67%
keccakf1600_xor_bytes (big endian) 1s 2s -50%
mld_ct_cmask_neg_i32 1s 3s -67%
mld_ct_get_optblocker_i64 1s 2s -50%
mld_keccakf1600_extract_bytes 1s 2s -50%
poly_make_hint 1s 3s -67%
shake128_finalize 1s 3s -67%

@oqs-bot (Contributor) commented Feb 3, 2026

CBMC Results (ML-DSA-44)

Full Results (179 proofs)
Proof Status Current Previous Change
**TOTAL** 1995s 2009s -0.7%
mld_attempt_signature_generation 245s 251s -2%
polyvecl_pointwise_acc_montgomery_c 202s 204s -1%
sign_verify_internal 182s 182s +0%
poly_pointwise_montgomery_c 146s 157s -7%
rej_uniform_native 139s 144s -3%
mld_invntt_layer 84s 86s -2%
mld_ct_memcmp 71s 75s -5%
mld_ntt_layer 53s 53s +0%
poly_chknorm_c 21s 19s +11%
rej_uniform 21s 23s -9%
sign_signature_internal 21s 22s -5%
fqmul 20s 18s +11%
polyvec_matrix_expand 20s 19s +5%
poly_uniform_eta_4x 18s 18s +0%
poly_uniform_4x 16s 16s +0%
polyeta_unpack 16s 18s -11%
rej_uniform_c 14s 14s +0%
polymat_permute_bitrev_to_custom 13s 13s +0%
polyt0_unpack 13s 13s +0%
polyvec_matrix_pointwise_montgomery 13s 13s +0%
polyveck_power2round 13s 14s -7%
keccak_absorb_once_x4 12s 10s +20%
keccakf1600x4_permute_native 12s 13s -8%
mld_compute_t0_t1_tr_from_sk_components 12s 17s -29%
polyz_unpack_c 12s 12s +0%
mld_ntt_butterfly_block 11s 12s -8%
mld_polyvecl_permute_bitrev_to_custom_native 11s 8s +38%
poly_add 11s 10s +10%
polyvec_matrix_expand_serial 11s 13s -15%
unpack_sk 9s 6s +50%
keccakf1600_permute 8s 9s -11%
keccakf1600_permute_native 8s 7s +14%
polyveck_add 8s 7s +14%
polyveck_decompose 8s 5s +60%
polyveck_use_hint 8s 6s +33%
sign 8s 8s +0%
mld_prepare_domain_separation_prefix 7s 5s +40%
polyveck_chknorm 7s 6s +17%
sign_verify_pre_hash_shake256 7s 4s +75%
keccak_absorb 6s 7s -14%
keccak_squeezeblocks_x4 6s 5s +20%
mld_check_pct 6s 7s -14%
mld_compute_pack_z 6s 7s -14%
poly_invntt_tomont_c 6s 5s +20%
poly_ntt_native 6s 4s +50%
poly_power2round 6s 4s +50%
polyveck_caddq 6s 3s +100%
polyveck_invntt_tomont 6s 6s +0%
sign_open 6s 4s +50%
sign_signature_pre_hash_internal 6s 5s +20%
sign_verify 6s 4s +50%
decompose 5s 3s +67%
mld_h 5s 2s +150%
mld_sample_s1_s2 5s 5s +0%
mld_sample_s1_s2_serial 5s 7s -29%
poly_caddq_c 5s 5s +0%
poly_use_hint_c 5s 5s +0%
polyveck_make_hint 5s 5s +0%
polyveck_pointwise_poly_montgomery_t0 5s 5s +0%
polyveck_shiftl 5s 6s -17%
polyveck_sub 5s 5s +0%
polyvecl_uniform_gamma1_serial 5s 2s +150%
polyvecl_unpack_z 5s 3s +67%
power2round 5s 2s +150%
rej_eta_native 5s 6s -17%
sign_pk_from_sk 5s 10s -50%
sign_verify_extmu 5s 3s +67%
unpack_hints 5s 6s -17%
use_hint 5s 3s +67%
intt_native_x86_64 4s 2s +100%
keccakf1600_extract_bytes (big endian) 4s 2s +100%
keccakf1600_xor_bytes (big endian) 4s 3s +33%
mld_ct_get_optblocker_u8 4s 3s +33%
mld_value_barrier_u32 4s 2s +100%
montgomery_reduce 4s 5s -20%
pack_sig_z 4s 4s +0%
poly_caddq_native 4s 2s +100%
poly_challenge 4s 4s +0%
poly_chknorm_native_aarch64 4s 7s -43%
poly_decompose 4s 3s +33%
poly_decompose_c 4s 4s +0%
poly_ntt 4s 4s +0%
poly_pointwise_montgomery 4s 3s +33%
poly_pointwise_montgomery_native 4s 2s +100%
poly_shiftl 4s 5s -20%
poly_uniform_eta 4s 5s -20%
poly_uniform_gamma1 4s 3s +33%
poly_use_hint_native 4s 3s +33%
polyt0_pack 4s 5s -20%
polyveck_ntt 4s 4s +0%
polyveck_pack_eta 4s 4s +0%
polyveck_pack_t0 4s 2s +100%
polyveck_pointwise_poly_montgomery_s2 4s 4s +0%
polyveck_reduce 4s 5s -20%
polyveck_unpack_eta 4s 2s +100%
polyveck_unpack_t0 4s 4s +0%
polyvecl_ntt 4s 3s +33%
polyvecl_pointwise_acc_montgomery 4s 2s +100%
polyw1_pack 4s 1s +300%
polyz_unpack 4s 2s +100%
polyz_unpack_native 4s 2s +100%
reduce32 4s 4s +0%
shake256_absorb 4s 2s +100%
shake256x4_squeezeblocks 4s 4s +0%
sign_keypair_internal 4s 7s -43%
sign_signature 4s 5s -20%
unpack_sig 4s 3s +33%
fqscale 3s 3s +0%
keccak_finalize 3s 3s +0%
keccak_init 3s 2s +50%
keccakf1600x4_extract_bytes 3s 2s +50%
keccakf1600x4_permute 3s 3s +0%
mld_ct_abs_i32 3s 3s +0%
mld_ct_cmask_nonzero_u32 3s 4s -25%
mld_ct_cmask_nonzero_u8 3s 2s +50%
mld_keccakf1600_extract_bytes 3s 2s +50%
ntt_native_x86_64 3s 3s +0%
pack_sk 3s 5s -40%
poly_caddq 3s 3s +0%
poly_caddq_native_aarch64 3s 2s +50%
poly_decompose_native 3s 6s -50%
poly_make_hint 3s 4s -25%
poly_ntt_c 3s 3s +0%
poly_sub 3s 3s +0%
poly_uniform 3s 6s -50%
poly_uniform_gamma1_4x 3s 5s -40%
poly_use_hint 3s 2s +50%
polyt1_pack 3s 2s +50%
polyt1_unpack 3s 3s +0%
polyveck_pointwise_poly_montgomery 3s 4s -25%
polyvecl_chknorm 3s 3s +0%
polyvecl_pack_eta 3s 3s +0%
polyvecl_pointwise_acc_montgomery_native 3s 3s +0%
polyvecl_unpack_eta 3s 2s +50%
polyz_pack 3s 3s +0%
rej_eta 3s 1s +200%
shake128x4_absorb_once 3s 2s +50%
shake128x4_squeezeblocks 3s 3s +0%
shake256 3s 3s +0%
shake256_finalize 3s 2s +50%
sign_keypair 3s 3s +0%
sign_signature_extmu 3s 3s +0%
sign_signature_pre_hash_shake256 3s 3s +0%
sign_verify_pre_hash_internal 3s 2s +50%
sys_check_capability 3s 3s +0%
unpack_pk 3s 3s +0%
keccakf1600_xor_bytes 2s 3s -33%
keccakf1600x4_xor_bytes 2s 1s +100%
make_hint 2s 3s -33%
mld_ct_cmask_neg_i32 2s 3s -33%
mld_ct_get_optblocker_i64 2s 2s +0%
mld_ct_sel_int32 2s 2s +0%
ntt_native_aarch64 2s 3s -33%
pack_pk 2s 3s -33%
pack_sig_c_h 2s 3s -33%
poly_chknorm_native 2s 3s -33%
poly_invntt_tomont_native 2s 2s +0%
polyveck_pack_w1 2s 2s +0%
polyvecl_permute_bitrev_to_custom 2s 2s +0%
polyvecl_uniform_gamma1 2s 2s +0%
rej_eta_c 2s 4s -50%
shake128_absorb 2s 3s -33%
shake128_init 2s 3s -33%
shake128_release 2s 2s +0%
shake128_squeeze 2s 4s -50%
shake256_init 2s 1s +100%
caddq 1s 3s -67%
keccak_squeeze 1s 3s -67%
mld_ct_get_optblocker_u32 1s 3s -67%
mld_value_barrier_i64 1s 2s -50%
mld_value_barrier_u8 1s 3s -67%
poly_chknorm 1s 5s -80%
poly_invntt_tomont 1s 3s -67%
poly_reduce 1s 3s -67%
polyeta_pack 1s 3s -67%
shake128_finalize 1s 2s -50%
shake256_release 1s 2s -50%
shake256_squeeze 1s 2s -50%
shake256x4_absorb_once 1s 2s -50%

@oqs-bot (Contributor) commented Feb 3, 2026

CBMC Results (ML-DSA-65)

Full Results (179 proofs)
Proof Status Current Previous Change
**TOTAL** 2467s 2424s +1.8%
sign_verify_internal 334s 330s +1%
mld_attempt_signature_generation 284s 277s +3%
polyvecl_pointwise_acc_montgomery_c 187s 188s -1%
poly_pointwise_montgomery_c 168s 157s +7%
rej_uniform_native 149s 140s +6%
polyvec_matrix_expand 121s 123s -2%
mld_invntt_layer 98s 94s +4%
mld_ct_memcmp 82s 73s +12%
polyvec_matrix_expand_serial 68s 66s +3%
mld_ntt_layer 56s 53s +6%
polymat_permute_bitrev_to_custom 30s 29s +3%
sign_signature_internal 28s 30s -7%
mld_compute_t0_t1_tr_from_sk_components 27s 26s +4%
fqmul 24s 19s +26%
poly_chknorm_c 20s 19s +5%
rej_uniform 19s 21s -10%
rej_uniform_c 17s 16s +6%
poly_uniform_eta_4x 16s 19s -16%
poly_uniform_4x 15s 13s +15%
keccakf1600x4_permute_native 13s 12s +8%
poly_add 13s 11s +18%
polyt0_unpack 13s 14s -7%
polyveck_decompose 13s 12s +8%
mld_check_pct 12s 11s +9%
mld_ntt_butterfly_block 12s 12s +0%
polyvec_matrix_pointwise_montgomery 12s 10s +20%
polyvecl_chknorm 12s 11s +9%
polyveck_sub 11s 13s -15%
keccak_absorb_once_x4 10s 10s +0%
keccakf1600_permute_native 10s 8s +25%
polyveck_add 10s 11s -9%
polyveck_caddq 10s 8s +25%
polyveck_power2round 10s 9s +11%
keccakf1600_permute 9s 8s +12%
mld_compute_pack_z 9s 6s +50%
poly_decompose_native 9s 6s +50%
polyveck_ntt 9s 9s +0%
poly_invntt_tomont_c 8s 6s +33%
polyveck_pointwise_poly_montgomery 8s 10s -20%
polyveck_shiftl 8s 7s +14%
polyveck_use_hint 8s 8s +0%
polyvecl_ntt 8s 7s +14%
unpack_sk 8s 7s +14%
keccak_absorb 7s 8s -12%
keccak_squeezeblocks_x4 7s 8s -12%
poly_challenge 7s 4s +75%
poly_decompose_c 7s 7s +0%
polyeta_unpack 7s 8s -12%
polyveck_invntt_tomont 7s 6s +17%
sign_pk_from_sk 7s 10s -30%
sign_verify_pre_hash_shake256 7s 5s +40%
mld_polyvecl_permute_bitrev_to_custom_native 6s 8s -25%
polyveck_pack_t0 6s 2s +200%
polyveck_pointwise_poly_montgomery_s2 6s 9s -33%
polyveck_pointwise_poly_montgomery_t0 6s 5s +20%
polyveck_reduce 6s 7s -14%
polyz_unpack 6s 2s +200%
mld_h 5s 5s +0%
mld_sample_s1_s2_serial 5s 4s +25%
poly_caddq_c 5s 4s +25%
poly_pointwise_montgomery_native 5s 6s -17%
poly_power2round 5s 4s +25%
poly_uniform_gamma1_4x 5s 5s +0%
poly_use_hint_c 5s 3s +67%
polyveck_chknorm 5s 6s -17%
polyveck_make_hint 5s 4s +25%
polyveck_unpack_t0 5s 3s +67%
polyvecl_pointwise_acc_montgomery_native 5s 6s -17%
rej_eta_c 5s 3s +67%
sign 5s 7s -29%
sign_signature 5s 2s +150%
sign_signature_pre_hash_shake256 5s 6s -17%
keccak_squeeze 4s 4s +0%
keccakf1600_extract_bytes (big endian) 4s 2s +100%
keccakf1600x4_xor_bytes 4s 2s +100%
mld_ct_abs_i32 4s 2s +100%
ntt_native_x86_64 4s 4s +0%
pack_sig_c_h 4s 2s +100%
poly_chknorm 4s 3s +33%
poly_invntt_tomont 4s 6s -33%
poly_invntt_tomont_native 4s 4s +0%
poly_pointwise_montgomery 4s 2s +100%
poly_uniform_eta 4s 6s -33%
polyeta_pack 4s 3s +33%
polyt0_pack 4s 5s -20%
polyt1_pack 4s 3s +33%
polyveck_pack_w1 4s 2s +100%
polyz_pack 4s 3s +33%
polyz_unpack_c 4s 4s +0%
rej_eta 4s 4s +0%
rej_eta_native 4s 2s +100%
shake256_absorb 4s 2s +100%
sign_keypair_internal 4s 7s -43%
sign_open 4s 4s +0%
sign_signature_pre_hash_internal 4s 4s +0%
sign_verify_extmu 4s 3s +33%
unpack_hints 4s 6s -33%
unpack_pk 4s 4s +0%
decompose 3s 3s +0%
intt_native_x86_64 3s 3s +0%
keccak_init 3s 2s +50%
keccakf1600_xor_bytes (big endian) 3s 3s +0%
keccakf1600x4_extract_bytes 3s 1s +200%
keccakf1600x4_permute 3s 5s -40%
mld_ct_cmask_nonzero_u32 3s 1s +200%
mld_ct_cmask_nonzero_u8 3s 5s -40%
mld_ct_get_optblocker_i64 3s 3s +0%
mld_prepare_domain_separation_prefix 3s 4s -25%
mld_sample_s1_s2 3s 8s -62%
mld_value_barrier_i64 3s 1s +200%
mld_value_barrier_u32 3s 1s +200%
mld_value_barrier_u8 3s 2s +50%
montgomery_reduce 3s 3s +0%
ntt_native_aarch64 3s 3s +0%
pack_pk 3s 4s -25%
pack_sig_z 3s 5s -40%
poly_caddq_native_aarch64 3s 5s -40%
poly_chknorm_native 3s 2s +50%
poly_chknorm_native_aarch64 3s 4s -25%
poly_decompose 3s 1s +200%
poly_make_hint 3s 4s -25%
poly_ntt_native 3s 3s +0%
poly_reduce 3s 2s +50%
poly_shiftl 3s 2s +50%
poly_sub 3s 2s +50%
poly_uniform 3s 6s -50%
poly_uniform_gamma1 3s 3s +0%
poly_use_hint 3s 3s +0%
poly_use_hint_native 3s 5s -40%
polyveck_pack_eta 3s 3s +0%
polyveck_unpack_eta 3s 2s +50%
polyvecl_pack_eta 3s 3s +0%
polyvecl_permute_bitrev_to_custom 3s 4s -25%
polyvecl_pointwise_acc_montgomery 3s 2s +50%
polyvecl_uniform_gamma1_serial 3s 3s +0%
polyvecl_unpack_eta 3s 2s +50%
power2round 3s 3s +0%
shake128_release 3s 1s +200%
shake128x4_absorb_once 3s 2s +50%
shake128x4_squeezeblocks 3s 4s -25%
shake256 3s 2s +50%
shake256x4_absorb_once 3s 8s -62%
shake256x4_squeezeblocks 3s 1s +200%
sign_signature_extmu 3s 6s -50%
sign_verify 3s 2s +50%
sign_verify_pre_hash_internal 3s 6s -50%
sys_check_capability 3s 3s +0%
unpack_sig 3s 2s +50%
caddq 2s 5s -60%
fqscale 2s 3s -33%
keccak_finalize 2s 3s -33%
keccakf1600_xor_bytes 2s 1s +100%
make_hint 2s 3s -33%
mld_ct_cmask_neg_i32 2s 6s -67%
mld_ct_get_optblocker_u32 2s 2s +0%
mld_ct_sel_int32 2s 3s -33%
mld_keccakf1600_extract_bytes 2s 2s +0%
pack_sk 2s 3s -33%
poly_caddq 2s 3s -33%
poly_caddq_native 2s 5s -60%
poly_ntt 2s 4s -50%
poly_ntt_c 2s 5s -60%
polyt1_unpack 2s 4s -50%
polyvecl_uniform_gamma1 2s 3s -33%
polyvecl_unpack_z 2s 3s -33%
polyw1_pack 2s 5s -60%
polyz_unpack_native 2s 3s -33%
reduce32 2s 3s -33%
shake128_finalize 2s 3s -33%
shake128_init 2s 2s +0%
shake128_squeeze 2s 2s +0%
shake256_init 2s 5s -60%
shake256_release 2s 1s +100%
sign_keypair 2s 5s -60%
use_hint 2s 3s -33%
mld_ct_get_optblocker_u8 1s 1s +0%
shake128_absorb 1s 2s -50%
shake256_finalize 1s 2s -50%
shake256_squeeze 1s 2s -50%

@willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 9 times, most recently from 1ea9d5f to 8a19e9a (February 5, 2026 06:05)
@willieyz marked this pull request as ready for review (February 5, 2026 06:39)
@willieyz requested a review from a team as a code owner (February 5, 2026 06:39)
@willieyz marked this pull request as draft (February 5, 2026 07:19)
@github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 46205 cycles 46203 cycles 1.00
ML-DSA-44 sign 131278 cycles 131278 cycles 1
ML-DSA-44 verify 47765 cycles 47768 cycles 1.00
ML-DSA-65 keypair 81014 cycles 81024 cycles 1.00
ML-DSA-65 sign 215785 cycles 215787 cycles 1.00
ML-DSA-65 verify 80057 cycles 80052 cycles 1.00
ML-DSA-87 keypair 132158 cycles 132151 cycles 1.00
ML-DSA-87 sign 276862 cycles 276816 cycles 1.00
ML-DSA-87 verify 130418 cycles 130384 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 114213 cycles 114155 cycles 1.00
ML-DSA-44 sign 418158 cycles 417994 cycles 1.00
ML-DSA-44 verify 122319 cycles 122262 cycles 1.00
ML-DSA-65 keypair 195508 cycles 195499 cycles 1.00
ML-DSA-65 sign 682497 cycles 682470 cycles 1.00
ML-DSA-65 verify 197760 cycles 197741 cycles 1.00
ML-DSA-87 keypair 322642 cycles 322656 cycles 1.00
ML-DSA-87 sign 864585 cycles 864584 cycles 1.00
ML-DSA-87 verify 328628 cycles 328653 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

Intel Xeon 4th gen (c7i)

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 34909 cycles 34607 cycles 1.01
ML-DSA-44 sign 120375 cycles 120704 cycles 1.00
ML-DSA-44 verify 38205 cycles 38101 cycles 1.00
ML-DSA-65 keypair 60968 cycles 61787 cycles 0.99
ML-DSA-65 sign 202493 cycles 204750 cycles 0.99
ML-DSA-65 verify 62726 cycles 62947 cycles 1.00
ML-DSA-87 keypair 94450 cycles 94143 cycles 1.00
ML-DSA-87 sign 241633 cycles 240274 cycles 1.01
ML-DSA-87 verify 96451 cycles 95109 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 94555 cycles 94592 cycles 1.00
ML-DSA-44 sign 333735 cycles 333857 cycles 1.00
ML-DSA-44 verify 99826 cycles 99864 cycles 1.00
ML-DSA-65 keypair 159716 cycles 159928 cycles 1.00
ML-DSA-65 sign 544638 cycles 544846 cycles 1.00
ML-DSA-65 verify 160752 cycles 160968 cycles 1.00
ML-DSA-87 keypair 267459 cycles 267912 cycles 1.00
ML-DSA-87 sign 709420 cycles 709152 cycles 1.00
ML-DSA-87 verify 270024 cycles 270923 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 276468 cycles 277102 cycles 1.00
ML-DSA-44 sign 818650 cycles 810656 cycles 1.01
ML-DSA-44 verify 276672 cycles 278882 cycles 0.99
ML-DSA-65 keypair 475323 cycles 478906 cycles 0.99
ML-DSA-65 sign 1367640 cycles 1360800 cycles 1.01
ML-DSA-65 verify 459822 cycles 466415 cycles 0.99
ML-DSA-87 keypair 825623 cycles 818822 cycles 1.01
ML-DSA-87 sign 1873209 cycles 1878770 cycles 1.00
ML-DSA-87 verify 800938 cycles 794467 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

AMD EPYC 3rd gen (c6a)

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 69188 cycles 69341 cycles 1.00
ML-DSA-44 sign 188711 cycles 188628 cycles 1.00
ML-DSA-44 verify 69609 cycles 69167 cycles 1.01
ML-DSA-65 keypair 119110 cycles 119048 cycles 1.00
ML-DSA-65 sign 301012 cycles 300972 cycles 1.00
ML-DSA-65 verify 115433 cycles 115129 cycles 1.00
ML-DSA-87 keypair 202783 cycles 202705 cycles 1.00
ML-DSA-87 sign 393591 cycles 393401 cycles 1.00
ML-DSA-87 verify 194240 cycles 194477 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

Intel Xeon 3rd gen (c6i)

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 56699 cycles 57040 cycles 0.99
ML-DSA-44 sign 182012 cycles 183077 cycles 0.99
ML-DSA-44 verify 61080 cycles 61515 cycles 0.99
ML-DSA-65 keypair 99136 cycles 98855 cycles 1.00
ML-DSA-65 sign 302786 cycles 300890 cycles 1.01
ML-DSA-65 verify 101006 cycles 100170 cycles 1.01
ML-DSA-87 keypair 154691 cycles 153387 cycles 1.01
ML-DSA-87 sign 357607 cycles 356600 cycles 1.00
ML-DSA-87 verify 155516 cycles 153458 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

Graviton4

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 68195 cycles 68176 cycles 1.00
ML-DSA-44 sign 203778 cycles 203661 cycles 1.00
ML-DSA-44 verify 70887 cycles 70749 cycles 1.00
ML-DSA-65 keypair 120728 cycles 120835 cycles 1.00
ML-DSA-65 sign 334605 cycles 334759 cycles 1.00
ML-DSA-65 verify 117912 cycles 118016 cycles 1.00
ML-DSA-87 keypair 198206 cycles 198256 cycles 1.00
ML-DSA-87 sign 431061 cycles 431078 cycles 1.00
ML-DSA-87 verify 194668 cycles 194587 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 135143 cycles 136158 cycles 0.99
ML-DSA-44 sign 526833 cycles 531116 cycles 0.99
ML-DSA-44 verify 147425 cycles 148648 cycles 0.99
ML-DSA-65 keypair 226627 cycles 226842 cycles 1.00
ML-DSA-65 sign 859933 cycles 861270 cycles 1.00
ML-DSA-65 verify 234683 cycles 235270 cycles 1.00
ML-DSA-87 keypair 370822 cycles 370874 cycles 1.00
ML-DSA-87 sign 1078880 cycles 1077097 cycles 1.00
ML-DSA-87 verify 383211 cycles 382857 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

AMD EPYC 4th gen (c7a)

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 40627 cycles 40587 cycles 1.00
ML-DSA-44 sign 133192 cycles 136713 cycles 0.97
ML-DSA-44 verify 43499 cycles 43374 cycles 1.00
ML-DSA-65 keypair 72359 cycles 71982 cycles 1.01
ML-DSA-65 sign 214367 cycles 214626 cycles 1.00
ML-DSA-65 verify 73011 cycles 73104 cycles 1.00
ML-DSA-87 keypair 108917 cycles 108890 cycles 1.00
ML-DSA-87 sign 254496 cycles 253022 cycles 1.01
ML-DSA-87 verify 114127 cycles 110459 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 157441 cycles 157064 cycles 1.00
ML-DSA-44 sign 552409 cycles 549451 cycles 1.01
ML-DSA-44 verify 169081 cycles 168897 cycles 1.00
ML-DSA-65 keypair 269113 cycles 268697 cycles 1.00
ML-DSA-65 sign 905233 cycles 905890 cycles 1.00
ML-DSA-65 verify 274920 cycles 274888 cycles 1.00
ML-DSA-87 keypair 448473 cycles 448496 cycles 1.00
ML-DSA-87 sign 1160497 cycles 1159979 cycles 1.00
ML-DSA-87 verify 458580 cycles 458091 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

Graviton3

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 72272 cycles 72243 cycles 1.00
ML-DSA-44 sign 213477 cycles 213451 cycles 1.00
ML-DSA-44 verify 75713 cycles 75744 cycles 1.00
ML-DSA-65 keypair 127604 cycles 127603 cycles 1.00
ML-DSA-65 sign 353407 cycles 353426 cycles 1.00
ML-DSA-65 verify 125750 cycles 125745 cycles 1.00
ML-DSA-87 keypair 208441 cycles 208481 cycles 1.00
ML-DSA-87 sign 452287 cycles 452641 cycles 1.00
ML-DSA-87 verify 205856 cycles 205909 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

Graviton4 (no-opt)

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 128174 cycles 128347 cycles 1.00
ML-DSA-44 sign 448073 cycles 448111 cycles 1.00
ML-DSA-44 verify 138265 cycles 144871 cycles 0.95
ML-DSA-65 keypair 220367 cycles 220834 cycles 1.00
ML-DSA-65 sign 729443 cycles 729991 cycles 1.00
ML-DSA-65 verify 223253 cycles 223754 cycles 1.00
ML-DSA-87 keypair 366585 cycles 367262 cycles 1.00
ML-DSA-87 sign 928832 cycles 929744 cycles 1.00
ML-DSA-87 verify 373916 cycles 374445 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot (Contributor) left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Details
Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-44 keypair 120691 cycles 120833 cycles 1.00
ML-DSA-44 sign 449407 cycles 449229 cycles 1.00
ML-DSA-44 verify 130264 cycles 130297 cycles 1.00
ML-DSA-65 keypair 204729 cycles 204649 cycles 1.00
ML-DSA-65 sign 730192 cycles 731243 cycles 1.00
ML-DSA-65 verify 210457 cycles 210085 cycles 1.00
ML-DSA-87 keypair 338245 cycles 337488 cycles 1.00
ML-DSA-87 sign 925719 cycles 929314 cycles 1.00
ML-DSA-87 verify 346809 cycles 346839 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 6 times, most recently from f820653 to 91cddd6 (February 24, 2026 10:16)
@willieyz marked this pull request as draft (February 25, 2026 01:24)
@willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 3 times, most recently from d3f2de3 to e5ad167 (February 25, 2026 03:49)
@willieyz marked this pull request as ready for review (February 25, 2026 03:56)
@willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch from e5ad167 to d278675 (March 24, 2026 07:56)
@mkannwischer self-assigned this (Apr 1, 2026)
@mkannwischer (Contributor) commented

@willieyz, can you rebase this PR, please?

@willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch from d278675 to 726fc9d (April 4, 2026 18:31)
@willieyz (Contributor, Author) commented Apr 5, 2026

Hello @mkannwischer, thank you for reviewing. I have rebased the branch on top of main.

@mkannwischer force-pushed the eliminate-use_hint_32_88-intrinsics branch from 726fc9d to 1c31329 (April 6, 2026 01:51)
@oqs-bot (Contributor) left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 1c31329 Previous: 0f8b8e0 Ratio
ML-DSA-87 verify 114127 cycles 110459 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.
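
The Ratio column and the alert above can be reproduced with a short sketch. It assumes the ratio is current cycles divided by previous cycles and that an alert fires when the unrounded ratio exceeds the quoted threshold of 1.03; the exact comparison the benchmark action performs is not shown in this thread, and the helper names are hypothetical.

```python
ALERT_THRESHOLD = 1.03  # threshold quoted in the alert text


def ratio(current_cycles: int, previous_cycles: int) -> float:
    """Current / previous; values above 1.0 mean the current commit is slower."""
    return current_cycles / previous_cycles


def is_regression(current_cycles: int, previous_cycles: int,
                  threshold: float = ALERT_THRESHOLD) -> bool:
    """Assumed rule: flag a regression when the unrounded ratio exceeds the threshold."""
    return ratio(current_cycles, previous_cycles) > threshold


# Rows copied from the AMD EPYC 4th gen (c7a) table for commit 1c31329 vs 0f8b8e0.
rows = {
    "ML-DSA-87 verify": (114127, 110459),
    "ML-DSA-87 sign": (254496, 253022),
    "ML-DSA-44 sign": (133192, 136713),
}

for name, (current, previous) in rows.items():
    flag = "ALERT" if is_regression(current, previous) else "ok"
    print(f"{name}: ratio {ratio(current, previous):.2f} ({flag})")
```

Under this assumed rule only ML-DSA-87 verify is flagged, matching the single row listed in the alert.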

@mkannwischer force-pushed the eliminate-use_hint_32_88-intrinsics branch from 1c31329 to 5437035 (April 6, 2026 02:27)
willieyz added 2 commits April 6, 2026 10:27
This commit adds poly_use_hint to bench --components for benchmarking
the performance impact of the changes to:
- poly_use_hint_32
- poly_use_hint_88

Signed-off-by: willieyz <willie.zhao@chelpis.com>

This commit replaces the AVX2 intrinsics implementation of
poly_use_hint_32 and poly_use_hint_88 with an x86_64 assembly version.
This is part of the effort to enable HOL-Light proofs.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
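
For readers unfamiliar with the operation being ported, here is a scalar Python model of the per-coefficient use_hint computation. It is a sketch under stated assumptions, not the code from this PR: it assumes the 32 and 88 suffixes refer to gamma2 = (q − 1)/32 and gamma2 = (q − 1)/88, and it follows the constants of the public Dilithium reference C code rather than the AVX2 or assembly routines. The names decompose and use_hint and the gamma2_divisor parameter are illustrative.

```python
Q = 8380417  # ML-DSA modulus


def decompose(a: int, gamma2_divisor: int) -> tuple[int, int]:
    """Split a in [0, Q) into high bits a1 and a centered remainder a0."""
    gamma2 = (Q - 1) // gamma2_divisor
    a1 = (a + 127) >> 7
    if gamma2_divisor == 32:
        a1 = (a1 * 1025 + (1 << 21)) >> 22
        a1 &= 15
    elif gamma2_divisor == 88:
        a1 = (a1 * 11275 + (1 << 23)) >> 24
        a1 ^= ((43 - a1) >> 31) & a1  # maps a1 == 44 back to 0
    else:
        raise ValueError("gamma2_divisor must be 32 or 88")
    a0 = a - a1 * 2 * gamma2
    a0 -= (((Q - 1) // 2 - a0) >> 31) & Q  # center a0 around zero
    return a1, a0


def use_hint(a: int, hint: int, gamma2_divisor: int) -> int:
    """Return the high bits of a, corrected by one step when hint is set."""
    a1, a0 = decompose(a, gamma2_divisor)
    if hint == 0:
        return a1
    if gamma2_divisor == 32:
        return (a1 + 1) & 15 if a0 > 0 else (a1 - 1) & 15
    # gamma2_divisor == 88: high bits wrap around in the range 0..43
    if a0 > 0:
        return 0 if a1 == 43 else a1 + 1
    return 43 if a1 == 0 else a1 - 1


print(use_hint(0, 1, 32), use_hint(0, 1, 88))
```

A vectorized implementation applies the same arithmetic to several coefficients per iteration, so a model like this can serve as a test oracle when comparing the intrinsics and assembly versions.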
@mkannwischer force-pushed the eliminate-use_hint_32_88-intrinsics branch from 5437035 to c53c97f (April 6, 2026 02:27)
@mkannwischer (Contributor) left a comment

Thanks @willieyz. I made a couple of small changes to the comments (to align them with the intrinsics). Now I am happy with the changes. The performance degradation is unfortunate, but we can revisit that in a follow-up.

@hanno-becker, @jakemas, could you also take a look?

@mkannwischer changed the title from "Eliminate use_hint 32/88 intrinsics" to "x86_64: Eliminate use_hint 32/88 intrinsics" (Apr 6, 2026)
Development

Successfully merging this pull request may close these issues.

AVX2: Replace intrinsics implementation of poly_use_hint with assembly

4 participants