Performance tuning for YOLO-like workloads by Human9000-bit · Pull Request #9 · enthropy7/YSCV

Human9000-bit · 2026-06-27T07:04:39Z

Thanks to enthropy7's benchmark refactoring, we can now see which kernel path each convolution takes.
Running BENCH_COOLDOWN=0 cargo run --release --example bench_yolo --features=blas gives the following profile of YOLO models (on x86_64, YMMV):

11.43ms  [1, 80, 80, 64] → [1, 80, 80, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.0/cv2.0.1/conv/Conv
10.66ms  [1, 80, 80, 64] → [1, 80, 80, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.0/cv2.0.0/conv/Conv
7.63ms  [1, 160, 160, 16] → [1, 160, 160, 8]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.2/m.0/cv1/conv/Conv
5.48ms  [1, 40, 40, 128] → [1, 40, 40, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.1/cv2.1.0/conv/Conv
5.07ms  [1, 80, 80, 32] → [1, 80, 80, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.4/m.0/cv1/conv/Conv
4.86ms  [1, 160, 160, 8] → [1, 160, 160, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.2/m.0/cv2/conv/
3.90ms  [1, 80, 80, 16] → [1, 80, 80, 32]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.16/m.0/cv2/conv/
3.78ms  [1, 20, 20, 256] → [1, 20, 20, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.2/cv2.2.0/conv/
3.68ms  [1, 40, 40, 64] → [1, 40, 40, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.1/cv2.1.1/conv/
3.60ms  [1, 80, 80, 32] → [1, 80, 80, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.16/m.0/cv1/conv/Conv

So the winograd 3x3 path takes most of the compute time.
In the initial state (5faba42), the winograd path was scalar code with only a vectorized GEMM function inside.

Human9000-bit · 2026-06-27T07:14:11Z

Vectorized the hottest function on the winograd path, winograd_input_tile.

enthropy7 · 2026-06-27T18:04:52Z

I measured this optimization and the speed stayed the same. After careful review, the reason for the lack of speedup in the benches is that winograd_input_tile is only the input transform, the beginning of the work in a winograd conv. The batched GEMM dominates, and this PR doesn't touch it. Vectorizing the input tile can't move the needle.
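A rough operation count illustrates why the input transform cannot move the needle. The sketch below assumes the F(2x2, 3x3) variant (4x4 tiles, matching the 16 transform positions), counts 32 additions per tile and input channel for the input transform, and treats an addition and a multiply-accumulate as equally expensive. The function names are hypothetical, the output transform is ignored, and the numbers are an estimate, not a measurement of this code base.

```rust
/// Additions needed for the input transform of one 4x4 tile across `c_in` channels
/// (32 additions per channel for F(2x2, 3x3)).
fn input_transform_adds(c_in: usize) -> usize {
    32 * c_in
}

/// Multiply-accumulates for one tile in the GEMM stage:
/// one (c_in x c_out) pass for each of the 16 transform positions.
fn gemm_macs(c_in: usize, c_out: usize) -> usize {
    16 * c_in * c_out
}

/// Amdahl's law: overall speedup when a fraction `p` of the runtime becomes `s` times faster.
fn amdahl(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // The [1, 80, 80, 64] -> [1, 80, 80, 64] layer from the profile.
    let (c_in, c_out) = (64, 64);
    let t = input_transform_adds(c_in) as f64;
    let g = gemm_macs(c_in, c_out) as f64;
    // Simplification: one add costs the same as one MAC.
    let p = t / (t + g);
    println!("input transform share: {:.1}%", 100.0 * p);
    println!("upper bound with an 8x faster transform: {:.3}x", amdahl(p, 8.0));
}
```

Under these assumptions the input transform is about 3% of the per-tile work for a 64-to-64-channel layer, so even a large speedup of that stage stays within a few percent overall.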

Two issues that matter more than the change itself:

The new benchmark doesn't actually exercise winograd. bench_winograd_conv_modes calls conv2d_nhwc (no padding). I confirmed with the dispatch recorder that this routes to im2col-gemm, not winograd. Winograd only fires via conv2d_nhwc_padded (SAME padding). So the winograd bench measures im2col and would show no change from this PR. To actually measure winograd, call conv2d_nhwc_padded (or go through the runner with a padded 3×3 conv).

On aarch64, winograd isn't used for YOLO at all. The runner intercepts every 3×3 group=1 conv with the indirect kernel (conv2d_nhwc_indirect_padded) before winograd is reached. So the NEON winograd path has no effect on YOLO on ARM - I didn't bench the Orange Pi because it would be a guaranteed null result.
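The two routing issues above can be written down as a toy model, which a bench could use to assert that it measures the path it claims to. This is only a sketch of the rules stated in this comment, not the actual dispatcher: the `Path` and `ConvCall` types and `expected_path` are hypothetical, and the stride-1, group-1 condition for winograd and the "every unpadded call goes to im2col" rule are assumptions.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Path {
    Im2colGemm,
    Winograd3x3,
    IndirectPadded,
    Other, // anything this comment does not describe
}

#[derive(Clone, Copy)]
struct ConvCall {
    kernel: (usize, usize),
    stride: (usize, usize),
    group: usize,
    same_padding: bool,          // true when entering via conv2d_nhwc_padded
    via_runner_on_aarch64: bool, // true when the aarch64 runner issues the conv
}

/// Toy model of the routing described above (assumed, not taken from the source code).
fn expected_path(c: ConvCall) -> Path {
    let is_3x3 = c.kernel == (3, 3);
    // On aarch64 the runner intercepts 3x3 group=1 convs before winograd is reached.
    if c.via_runner_on_aarch64 && is_3x3 && c.group == 1 {
        return Path::IndirectPadded;
    }
    // conv2d_nhwc (no padding) was observed to route to im2col-gemm.
    if !c.same_padding {
        return Path::Im2colGemm;
    }
    // Assumption: winograd needs a 3x3, stride-1, group-1 conv.
    if is_3x3 && c.stride == (1, 1) && c.group == 1 {
        return Path::Winograd3x3;
    }
    Path::Other
}

fn main() {
    let bench = ConvCall {
        kernel: (3, 3),
        stride: (1, 1),
        group: 1,
        same_padding: false, // what bench_winograd_conv_modes currently does
        via_runner_on_aarch64: false,
    };
    println!("unpadded bench call: {:?}", expected_path(bench));
    let fixed = ConvCall { same_padding: true, ..bench };
    println!("padded bench call:   {:?}", expected_path(fixed));
}
```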

WHAT SHOULD WE DO TO ACTUALLY BOOST WINOGRAD?

(This is really important, and I just haven't had time to implement it amid the rest of my work on my fw, so I would be very grateful for it.)

x86

The winograd GEMM is at gemm_conv.rs:488:

for a in 0..16 {
    ////
    super::super::matmul::blas_sgemm(v_slice, u_slice, m_slice, n_tiles, c_in, c_out);
}

Three things to do, biggest first:

It calls raw blas_sgemm, not matmul_2d_slices_fused_maybe_packed, so it bypasses our fast packed kernels (mr12/mr6/avx512) and the fused bias/activation epilogue. Route it through the fused entry. This is the main win.
The 16 GEMMs run sequentially. The positions a ∈ 0..16 are independent, so parallelize them (par_iter / rayon scope) or fold them into one batched GEMM.
The weights U are re-packed on every inference; conv weights are static, so pack them once via pack_b_for_session and pass packed_b in.
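The second point can be sketched as follows. To stay self-contained, this example uses std scoped threads instead of rayon and a naive `sgemm` as a stand-in for the real GEMM entry; the buffer layout (V as [16][n_tiles][c_in], U as [16][c_in][c_out], M as [16][n_tiles][c_out]) and the name `winograd_gemm_parallel` are assumptions, not the crate's actual code.

```rust
use std::thread;

/// Naive row-major C = A (m x k) * B (k x n). Stand-in for the real GEMM call.
fn sgemm(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/// Runs the 16 independent per-position GEMMs concurrently.
/// Assumed layout: V [16][n_tiles][c_in], U [16][c_in][c_out], M [16][n_tiles][c_out].
fn winograd_gemm_parallel(
    v: &[f32], u: &[f32], m: &mut [f32],
    n_tiles: usize, c_in: usize, c_out: usize,
) {
    let (v_len, u_len, m_len) = (n_tiles * c_in, c_in * c_out, n_tiles * c_out);
    assert!(m_len > 0);
    assert_eq!(v.len(), 16 * v_len);
    assert_eq!(u.len(), 16 * u_len);
    assert_eq!(m.len(), 16 * m_len);
    thread::scope(|s| {
        // Each position owns a disjoint output chunk, so no synchronization is needed.
        for (a, m_slice) in m.chunks_mut(m_len).enumerate() {
            let v_slice = &v[a * v_len..(a + 1) * v_len];
            let u_slice = &u[a * u_len..(a + 1) * u_len];
            s.spawn(move || sgemm(v_slice, u_slice, m_slice, n_tiles, c_in, c_out));
        }
    });
}

fn main() {
    let (n_tiles, c_in, c_out) = (5, 3, 4);
    let v: Vec<f32> = (0..16 * n_tiles * c_in).map(|i| (i % 7) as f32).collect();
    let u: Vec<f32> = (0..16 * c_in * c_out).map(|i| (i % 5) as f32 - 2.0).collect();

    let mut par = vec![0.0f32; 16 * n_tiles * c_out];
    winograd_gemm_parallel(&v, &u, &mut par, n_tiles, c_in, c_out);

    // Sequential reference, mirroring the existing `for a in 0..16` loop.
    let mut seq = vec![0.0f32; par.len()];
    let (vl, ul, ml) = (n_tiles * c_in, c_in * c_out, n_tiles * c_out);
    for a in 0..16 {
        sgemm(&v[a * vl..(a + 1) * vl], &u[a * ul..(a + 1) * ul],
              &mut seq[a * ml..(a + 1) * ml], n_tiles, c_in, c_out);
    }
    assert_eq!(par, seq);
    println!("parallel and sequential results match");
}
```

Spawning 16 OS threads per conv is only for illustration; a real implementation would presumably reuse a thread pool or fold the positions into one batched GEMM, as suggested above.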

ARM

conv2d_nhwc_indirect_padded (gemm_conv.rs:18): this is the path YOLO's 3×3 convs actually take on aarch64. Its body is plain for b / oy / ox loops with no parallelism (no par_chunks_mut_dispatch). On a 4-core A53 that idles three cores, so parallelizing over output rows is an easy win of up to ~4× before any tiling work.
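Parallelizing over output rows could look roughly like this. The sketch is a naive 3×3, stride-1, SAME-padded NHWC conv for batch 1, not the indirect kernel itself; it uses std scoped threads in place of par_chunks_mut_dispatch, and the function names and the [ky][kx][c_in][c_out] weight layout are assumptions made for illustration.

```rust
use std::thread;

/// Computes the output rows starting at `y0` that fit into `out_rows`.
/// Input is [h][wd][c_in], weights are [ky][kx][c_in][c_out] (assumed layout).
fn conv3x3_rows(
    input: &[f32], w: &[f32], out_rows: &mut [f32], y0: usize,
    h: usize, wd: usize, c_in: usize, c_out: usize,
) {
    let rows = out_rows.len() / (wd * c_out);
    for r in 0..rows {
        let oy = y0 + r;
        for ox in 0..wd {
            for co in 0..c_out {
                let mut acc = 0.0f32;
                for ky in 0..3 {
                    for kx in 0..3 {
                        // Input coordinates under one pixel of zero padding.
                        let iy = (oy + ky) as isize - 1;
                        let ix = (ox + kx) as isize - 1;
                        if iy < 0 || ix < 0 || iy >= h as isize || ix >= wd as isize {
                            continue;
                        }
                        let in_base = (iy as usize * wd + ix as usize) * c_in;
                        let w_base = (ky * 3 + kx) * c_in * c_out;
                        for ci in 0..c_in {
                            acc += input[in_base + ci] * w[w_base + ci * c_out + co];
                        }
                    }
                }
                out_rows[(r * wd + ox) * c_out + co] = acc;
            }
        }
    }
}

/// Splits the output into contiguous row bands, one thread per band.
fn conv3x3_parallel(
    input: &[f32], w: &[f32], out: &mut [f32],
    h: usize, wd: usize, c_in: usize, c_out: usize, n_threads: usize,
) {
    assert!(n_threads > 0 && h > 0 && wd > 0 && c_out > 0);
    assert_eq!(out.len(), h * wd * c_out);
    let rows_per = (h + n_threads - 1) / n_threads; // ceiling division
    let chunk = rows_per * wd * c_out;
    thread::scope(|s| {
        for (i, out_rows) in out.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                conv3x3_rows(input, w, out_rows, i * rows_per, h, wd, c_in, c_out)
            });
        }
    });
}

fn main() {
    let (h, wd, c_in, c_out) = (5, 4, 2, 3);
    let input: Vec<f32> = (0..h * wd * c_in).map(|i| (i % 5) as f32).collect();
    let w: Vec<f32> = (0..9 * c_in * c_out).map(|i| (i % 3) as f32 - 1.0).collect();

    let mut par = vec![0.0f32; h * wd * c_out];
    conv3x3_parallel(&input, &w, &mut par, h, wd, c_in, c_out, 4);

    // Single-threaded reference over all rows.
    let mut seq = vec![0.0f32; h * wd * c_out];
    conv3x3_rows(&input, &w, &mut seq, 0, h, wd, c_in, c_out);
    assert_eq!(par, seq);
    println!("row-parallel result matches the serial result");
}
```

Because each thread writes a disjoint band of output rows and only reads the shared input and weights, no locking is needed; the actual speedup on a 4-core A53 would still have to be measured.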

And of course fix the bench to call conv2d_nhwc_padded so it measures winograd at all.

Thanks for your work. I'm looking forward to helping you!

Human9000-bit added 2 commits June 27, 2026 12:25

add yolo-like conv benchmarks

d2f738e

vectorize winograd_input_tile

40a7288

Human9000-bit force-pushed the yolo-tuning branch from 239c3b7 to 40a7288 Compare June 27, 2026 07:26

Human9000-bit wants to merge 2 commits into enthropy7:main from Human9000-bit:yolo-tuning
