
Add AVX2 SAD and SadFour functions for motion estimation#3933

Open
rbenaley wants to merge 1 commit into cisco:master from rbenaley:avx2-sad-functions

@rbenaley

This adds AVX2-optimized SAD functions for the block sizes used in motion estimation: 16x16, 16x8, 8x16, and 8x8, in both simple and SadFour (4-reference) variants.

The existing SATD functions already have AVX2 implementations in satd_sad.asm, but the corresponding SAD functions were missing. The SadFour variants are critical for the diamond search pattern, where computing SAD against four neighbor positions in a single call avoids redundant source block loads.
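For readers unfamiliar with the four-reference idea, here is a minimal scalar sketch of what a SAD and a SadFour routine compute. The names `BlockSad` and `BlockSadFour`, the template parameters, and the output order (above, below, left, right of the reference position) are assumptions made for illustration and are not taken from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical scalar reference: sum of absolute differences over a W x H block.
template <int W, int H>
int32_t BlockSad(const uint8_t* src, int32_t srcStride,
                 const uint8_t* ref, int32_t refStride) {
  assert(src != nullptr && ref != nullptr);
  int32_t sad = 0;
  for (int y = 0; y < H; ++y) {
    for (int x = 0; x < W; ++x) {
      sad += std::abs(int(src[x]) - int(ref[x]));
    }
    src += srcStride;
    ref += refStride;
  }
  return sad;
}

// Hypothetical four-reference variant. Assumed output order:
// out[0] = one row above ref, out[1] = one row below,
// out[2] = one pixel left,    out[3] = one pixel right.
// The caller must guarantee a one-pixel border around the reference block.
template <int W, int H>
void BlockSadFour(const uint8_t* src, int32_t srcStride,
                  const uint8_t* ref, int32_t refStride, int32_t out[4]) {
  assert(src != nullptr && ref != nullptr && out != nullptr);
  out[0] = out[1] = out[2] = out[3] = 0;
  for (int y = 0; y < H; ++y) {
    const uint8_t* s = src + y * srcStride;
    const uint8_t* r = ref + y * refStride;
    for (int x = 0; x < W; ++x) {
      const int p = s[x];  // source pixel is read once and reused four times
      out[0] += std::abs(p - int(r[x - refStride]));
      out[1] += std::abs(p - int(r[x + refStride]));
      out[2] += std::abs(p - int(r[x - 1]));
      out[3] += std::abs(p - int(r[x + 1]));
    }
  }
}
```

Calling `BlockSad` four times with shifted reference pointers gives the same four values, but reads every source pixel four times; the combined routine reads it once, which is the saving described above.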

The 16-wide functions use vinserti128 to pack two rows into a 256-bit ymm register, processing them with a single vpsadbw. All loops are fully unrolled with %rep.
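The two-rows-per-register technique can be sketched with compiler intrinsics instead of assembly. This is only an approximation of the idea, not the code in satd_sad.asm: the function names are hypothetical, the loop is left for the compiler to unroll rather than expanded with %rep, and the `target("avx2")` attribute and `__builtin_cpu_supports` check assume GCC or Clang on x86.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_X86_AVX2 1
#endif

// Scalar fallback, also used as the reference result.
int32_t Sad16x16Scalar(const uint8_t* src, int32_t srcStride,
                       const uint8_t* ref, int32_t refStride) {
  int32_t sad = 0;
  for (int y = 0; y < 16; ++y) {
    for (int x = 0; x < 16; ++x) {
      sad += std::abs(int(src[x]) - int(ref[x]));
    }
    src += srcStride;
    ref += refStride;
  }
  return sad;
}

#ifdef SKETCH_X86_AVX2
// Row y goes into the low 128-bit lane, row y+1 into the high lane (vinserti128).
__attribute__((target("avx2")))
static inline __m256i LoadTwoRows(const uint8_t* p, int32_t stride) {
  const __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
  const __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + stride));
  return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

__attribute__((target("avx2")))
int32_t Sad16x16Avx2(const uint8_t* src, int32_t srcStride,
                     const uint8_t* ref, int32_t refStride) {
  __m256i acc = _mm256_setzero_si256();
  for (int y = 0; y < 16; y += 2) {
    const __m256i s = LoadTwoRows(src, srcStride);
    const __m256i r = LoadTwoRows(ref, refStride);
    // vpsadbw: one instruction yields four 64-bit partial sums covering two rows.
    acc = _mm256_add_epi64(acc, _mm256_sad_epu8(s, r));
    src += 2 * srcStride;
    ref += 2 * refStride;
  }
  // Horizontal reduction of the four partial sums.
  __m128i sum = _mm_add_epi64(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
  sum = _mm_add_epi64(sum, _mm_unpackhi_epi64(sum, sum));
  return _mm_cvtsi128_si32(sum);
}
#endif

// Picks the AVX2 path only when the running CPU reports support for it.
int32_t Sad16x16(const uint8_t* src, int32_t srcStride,
                 const uint8_t* ref, int32_t refStride) {
  assert(src != nullptr && ref != nullptr);
#ifdef SKETCH_X86_AVX2
  if (__builtin_cpu_supports("avx2")) {
    return Sad16x16Avx2(src, srcStride, ref, refStride);
  }
#endif
  return Sad16x16Scalar(src, srcStride, ref, refStride);
}
```

A 16x16 block needs eight `vpsadbw` operations in this form, against sixteen when each row is processed separately in an xmm register. The largest possible result is 256 * 255 = 65280, so the 32-bit return value cannot overflow.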

Files changed:

  • codec/common/x86/satd_sad.asm — 8 functions + 5 macros (+386 lines)
  • codec/common/inc/sad_common.h — declarations (+12 lines)
  • codec/encoder/core/src/sample.cpp — function pointer registration (+11 lines)

These optimizations were developed for Vauban, an open-source privileged access management (PAM) bastion that uses OpenH264 for real-time H.264 encoding of RDP desktop sessions streamed to web browsers. Enabling AVX2 across the encoder (including these new SAD functions) reduced CPU usage per session by approximately 50% on an Intel Xeon E-2246G running FreeBSD.

For a detailed technical writeup covering the encoding pipeline context, implementation choices, and performance measurements, see:
https://github.com/rbenaley/Vauban/blob/main/docs/technical/Vauban_OpenH264_AVX2_Optimizations_EN(1.0).md

Implement AVX2-optimized SAD for block sizes 16x16, 16x8, 8x16, 8x8
(simple and SadFour variants). The 16-wide functions use vinserti128
to pack two rows into a ymm register, processing them with a single
vpsadbw. SadFour variants compute SAD against four reference positions
simultaneously, avoiding redundant source loads during diamond search.

All code is guarded by %ifdef HAVE_AVX2 / WELS_CPU_AVX2 and selected
at runtime via CPUID detection.
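The runtime selection can be pictured as a function pointer table that starts with C implementations and is overridden when the CPU flag is set. The sketch below is illustrative only: `SadTable`, `InitSadTable`, `kCpuAvx2`, and the instrumented stand-in for the AVX2 routines are hypothetical names, not the identifiers used in sample.cpp, and the compile-time HAVE_AVX2 guard is omitted so the example builds anywhere.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

using SadFunc = int32_t (*)(const uint8_t*, int32_t, const uint8_t*, int32_t);

enum BlockSize { kBlock16x16, kBlock16x8, kBlock8x16, kBlock8x8, kBlockCount };

// Hypothetical feature bit standing in for WELS_CPU_AVX2.
constexpr uint32_t kCpuAvx2 = 1u << 0;

// Counts calls that went through the stand-in "AVX2" path (for demonstration).
int g_avx2Calls = 0;

// Portable C implementation, W pixels wide and H rows high.
template <int W, int H>
int32_t SadC(const uint8_t* src, int32_t srcStride,
             const uint8_t* ref, int32_t refStride) {
  int32_t sad = 0;
  for (int y = 0; y < H; ++y) {
    for (int x = 0; x < W; ++x) {
      sad += std::abs(int(src[x]) - int(ref[x]));
    }
    src += srcStride;
    ref += refStride;
  }
  return sad;
}

// Stand-in for an assembly routine: it forwards to the C version so the
// sketch runs on any machine, and records that the fast path was chosen.
template <int W, int H>
int32_t SadAvx2StandIn(const uint8_t* src, int32_t srcStride,
                       const uint8_t* ref, int32_t refStride) {
  ++g_avx2Calls;
  return SadC<W, H>(src, srcStride, ref, refStride);
}

struct SadTable {
  SadFunc sad[kBlockCount];
};

// Register the C defaults first, then override when the flag is present.
void InitSadTable(SadTable* table, uint32_t cpuFlags) {
  assert(table != nullptr);
  table->sad[kBlock16x16] = SadC<16, 16>;
  table->sad[kBlock16x8] = SadC<16, 8>;
  table->sad[kBlock8x16] = SadC<8, 16>;
  table->sad[kBlock8x8] = SadC<8, 8>;
  if (cpuFlags & kCpuAvx2) {
    table->sad[kBlock16x16] = SadAvx2StandIn<16, 16>;
    table->sad[kBlock16x8] = SadAvx2StandIn<16, 8>;
    table->sad[kBlock8x16] = SadAvx2StandIn<8, 16>;
    table->sad[kBlock8x8] = SadAvx2StandIn<8, 8>;
  }
}
```

Callers index the table by block size and never test CPU features themselves, so a CPU without AVX2 keeps the defaults. The SadFour variants would presumably be registered in a second table of the same shape.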