Add AVX2 SAD and SadFour functions for motion estimation by rbenaley · Pull Request #3933 · cisco/openh264

rbenaley · 2026-02-25T15:09:20Z

This adds AVX2-optimized SAD functions for the block sizes used in motion estimation: 16x16, 16x8, 8x16, and 8x8, in both simple and SadFour (4-reference) variants.

The existing SATD functions already have AVX2 implementations in satd_sad.asm, but the corresponding SAD functions were missing. The SadFour variants are critical for the diamond search pattern, where computing SAD against four neighbor positions in a single call avoids redundant source block loads.
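To make the SadFour idea concrete, here is a minimal scalar sketch, not the code from this PR. The names `SadBlock` and `SadFourBlock` are hypothetical, and so is the neighbor order (up, down, left, right). It reads each source pixel once and compares it against all four neighboring reference positions, which is the saving that a four-reference call is meant to provide.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical scalar reference: plain SAD between a source block and one
// reference block.
static int32_t SadBlock(const uint8_t* src, int32_t srcStride,
                        const uint8_t* ref, int32_t refStride,
                        int width, int height) {
  assert(width > 0 && height > 0);
  int32_t sad = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x)
      sad += std::abs(static_cast<int>(src[x]) - static_cast<int>(ref[x]));
    src += srcStride;
    ref += refStride;
  }
  return sad;
}

// Hypothetical four-reference variant. Assumed output order:
// sad[0] = one row up, sad[1] = one row down, sad[2] = one pixel left,
// sad[3] = one pixel right. The caller must guarantee that a one-pixel
// border around the reference block is readable.
static void SadFourBlock(const uint8_t* src, int32_t srcStride,
                         const uint8_t* ref, int32_t refStride,
                         int width, int height, int32_t sad[4]) {
  assert(width > 0 && height > 0);
  const uint8_t* up = ref - refStride;
  const uint8_t* down = ref + refStride;
  const uint8_t* left = ref - 1;
  const uint8_t* right = ref + 1;
  sad[0] = sad[1] = sad[2] = sad[3] = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      const int s = src[x];  // each source pixel is loaded once, used four times
      sad[0] += std::abs(s - static_cast<int>(up[x]));
      sad[1] += std::abs(s - static_cast<int>(down[x]));
      sad[2] += std::abs(s - static_cast<int>(left[x]));
      sad[3] += std::abs(s - static_cast<int>(right[x]));
    }
    src += srcStride;
    up += refStride;
    down += refStride;
    left += refStride;
    right += refStride;
  }
}
```

Calling `SadFourBlock` once should give the same four values as four separate `SadBlock` calls at the shifted reference positions, while touching the source block only once.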

The 16-wide functions use vinserti128 to pack two rows into a 256-bit ymm register, processing them with a single vpsadbw. All loops are fully unrolled with %rep.
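The two-rows-per-register technique can be sketched with C++ intrinsics. This is an illustration only: the PR itself is written in NASM and fully unrolled with %rep, whereas the sketch below uses a loop. The function name `Sad16xN` is hypothetical. `_mm256_inserti128_si256` maps to vinserti128 and `_mm256_sad_epu8` to vpsadbw. When the file is not compiled with AVX2 enabled, the sketch falls back to a scalar loop so that it still builds and runs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: SAD of a 16-pixel-wide block with an even row count.
static int32_t Sad16xN(const uint8_t* src, int32_t srcStride,
                       const uint8_t* ref, int32_t refStride, int height) {
  assert(height > 0 && height % 2 == 0);
#if defined(__AVX2__)
  __m256i acc = _mm256_setzero_si256();
  for (int y = 0; y < height; y += 2) {
    // Row y goes into the low 128 bits, row y+1 into the high 128 bits.
    __m256i s = _mm256_castsi128_si256(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(src)));
    s = _mm256_inserti128_si256(
        s, _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + srcStride)), 1);
    __m256i r = _mm256_castsi128_si256(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(ref)));
    r = _mm256_inserti128_si256(
        r, _mm_loadu_si128(reinterpret_cast<const __m128i*>(ref + refStride)), 1);
    // One vpsadbw covers both rows: four 64-bit partial sums, one per 8 bytes.
    acc = _mm256_add_epi64(acc, _mm256_sad_epu8(s, r));
    src += 2 * srcStride;
    ref += 2 * refStride;
  }
  // Horizontal reduction of the four partial sums.
  __m128i sum = _mm_add_epi64(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
  sum = _mm_add_epi64(sum, _mm_unpackhi_epi64(sum, sum));
  return _mm_cvtsi128_si32(sum);
#else
  // Scalar fallback so the sketch runs without AVX2 support.
  int32_t sad = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < 16; ++x)
      sad += std::abs(static_cast<int>(src[x]) - static_cast<int>(ref[x]));
    src += srcStride;
    ref += refStride;
  }
  return sad;
#endif
}
```

For a 16x16 block this needs 8 vpsadbw operations instead of the 16 that a one-row-per-xmm approach would use.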

Files changed:

codec/common/x86/satd_sad.asm — 8 functions + 5 macros (+386 lines)
codec/common/inc/sad_common.h — declarations (+12 lines)
codec/encoder/core/src/sample.cpp — function pointer registration (+11 lines)

These optimizations were developed for Vauban, an open-source privileged access management (PAM) bastion that uses OpenH264 for real-time H.264 encoding of RDP desktop sessions streamed to web browsers. Enabling AVX2 across the encoder (including these new SAD functions) reduced CPU usage per session by approximately 50% on an Intel Xeon E-2246G running FreeBSD.

For a detailed technical writeup covering the encoding pipeline context, implementation choices, and performance measurements, see:
https://github.com/rbenaley/Vauban/blob/main/docs/technical/Vauban_OpenH264_AVX2_Optimizations_EN(1.0).md

Implement AVX2-optimized SAD for block sizes 16x16, 16x8, 8x16, 8x8 (simple and SadFour variants). The 16-wide functions use vinserti128 to pack two rows into a ymm register, processing them with a single vpsadbw. SadFour variants compute SAD against four reference positions simultaneously, avoiding redundant source loads during diamond search. All code is guarded by %ifdef HAVE_AVX2 / WELS_CPU_AVX2 and selected at runtime via CPUID detection.
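The runtime selection described above can be pictured as a function pointer table that starts with scalar defaults and is overridden when the CPU reports AVX2. The sketch below is an assumption about the general pattern, not the registration code in sample.cpp: `kCpuFlagAvx2`, `SadTable`, `InitSadTable`, `SadScalar`, and `SadAvx2Stub` are all hypothetical names, and the "AVX2" kernel is a stub that forwards to the scalar version so the example runs on any machine.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for a CPUID-derived capability bit.
constexpr uint32_t kCpuFlagAvx2 = 1u << 0;

enum BlockSize { kBlock16x16, kBlock16x8, kBlock8x16, kBlock8x8, kBlockCount };

using SadFunc = int32_t (*)(const uint8_t*, int32_t, const uint8_t*, int32_t);

// Scalar default for a W x H block.
template <int W, int H>
int32_t SadScalar(const uint8_t* src, int32_t srcStride,
                  const uint8_t* ref, int32_t refStride) {
  int32_t sad = 0;
  for (int y = 0; y < H; ++y) {
    for (int x = 0; x < W; ++x)
      sad += std::abs(static_cast<int>(src[x]) - static_cast<int>(ref[x]));
    src += srcStride;
    ref += refStride;
  }
  return sad;
}

// Placeholder for an optimized kernel; it only forwards to the scalar code.
template <int W, int H>
int32_t SadAvx2Stub(const uint8_t* src, int32_t srcStride,
                    const uint8_t* ref, int32_t refStride) {
  return SadScalar<W, H>(src, srcStride, ref, refStride);
}

struct SadTable {
  SadFunc sad[kBlockCount];
};

// Install scalar defaults, then override them if the AVX2 flag is set.
void InitSadTable(SadTable* table, uint32_t cpuFlags) {
  assert(table != nullptr);
  table->sad[kBlock16x16] = SadScalar<16, 16>;
  table->sad[kBlock16x8] = SadScalar<16, 8>;
  table->sad[kBlock8x16] = SadScalar<8, 16>;
  table->sad[kBlock8x8] = SadScalar<8, 8>;
  if (cpuFlags & kCpuFlagAvx2) {
    table->sad[kBlock16x16] = SadAvx2Stub<16, 16>;
    table->sad[kBlock16x8] = SadAvx2Stub<16, 8>;
    table->sad[kBlock8x16] = SadAvx2Stub<8, 16>;
    table->sad[kBlock8x8] = SadAvx2Stub<8, 8>;
  }
}
```

Callers then go through the table, for example `table.sad[kBlock16x16](src, srcStride, ref, refStride)`, and never need to know which implementation was selected.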

rbenaley mentioned this pull request Feb 26, 2026

Add AVX2 optimizations for SAD, SadFour, and intra prediction (encoder) ralfbiedert/openh264-rs#92

Open

rbenaley wants to merge 1 commit into cisco:master from rbenaley:avx2-sad-functions
