Performance tuning for YOLO-like workloads#9

Draft
Human9000-bit wants to merge 2 commits into
enthropy7:mainfrom
Human9000-bit:yolo-tuning

@Human9000-bit Human9000-bit commented Jun 27, 2026

Contributor

Thanks to enthropy7's benchmark refactoring, we can now see which kernel path each convolution takes.
Running BENCH_COOLDOWN=0 cargo run --release --example bench_yolo --features=blas gives the following profile for YOLO models (on x86_64, YMMV):

11.43ms  [1, 80, 80, 64] → [1, 80, 80, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.0/cv2.0.1/conv/Conv
10.66ms  [1, 80, 80, 64] → [1, 80, 80, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.0/cv2.0.0/conv/Conv
7.63ms  [1, 160, 160, 16] → [1, 160, 160, 8]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.2/m.0/cv1/conv/Conv
5.48ms  [1, 40, 40, 128] → [1, 40, 40, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.1/cv2.1.0/conv/Conv
5.07ms  [1, 80, 80, 32] → [1, 80, 80, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.4/m.0/cv1/conv/Conv
4.86ms  [1, 160, 160, 8] → [1, 160, 160, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.2/m.0/cv2/conv/
3.90ms  [1, 80, 80, 16] → [1, 80, 80, 32]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.16/m.0/cv2/conv/
3.78ms  [1, 20, 20, 256] → [1, 20, 20, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.2/cv2.2.0/conv/
3.68ms  [1, 40, 40, 64] → [1, 40, 40, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.1/cv2.1.1/conv/
3.60ms  [1, 80, 80, 32] → [1, 80, 80, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.16/m.0/cv1/conv/Conv

So the winograd-3x3 path takes most of the compute time.
In the initial state (5faba42), the winograd path was scalar code, with only the GEMM function inside it vectorized.

@Human9000-bit

Contributor Author

Vectorized the hottest function on the winograd path, winograd_input_tile.
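
For context on where an input transform sits in a Winograd convolution, here is a minimal 1-D sketch of F(2, 3): two outputs of a 3-tap filter computed from four inputs. It is an illustration only, not the code in this PR. The function names are hypothetical, and the link to this crate rests on an assumption: that the 3x3 path uses the 2-D F(2x2, 3x3) variant, which nests this scheme in both directions and yields 4x4 = 16 transformed positions per tile.

```rust
/// Reference: direct 1-D correlation, 4 inputs, 3 taps, 2 outputs.
fn direct_f23(d: [f32; 4], g: [f32; 3]) -> [f32; 2] {
    [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]
}

/// Stage 1: input transform (B^T d). Additions and subtractions only.
fn input_transform(d: [f32; 4]) -> [f32; 4] {
    [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
}

/// Filter transform (G g). Weights are static, so this can be done once.
fn filter_transform(g: [f32; 3]) -> [f32; 4] {
    [
        g[0],
        0.5 * (g[0] + g[1] + g[2]),
        0.5 * (g[0] - g[1] + g[2]),
        g[2],
    ]
}

/// Stage 3: output transform (A^T m).
fn output_transform(m: [f32; 4]) -> [f32; 2] {
    [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]
}

/// Full F(2, 3): transform, multiply per position, transform back.
fn winograd_f23(d: [f32; 4], g: [f32; 3]) -> [f32; 2] {
    let v = input_transform(d);
    let u = filter_transform(g);
    // Stage 2: one multiply per position. With input and output channels,
    // each position becomes a matrix product instead of a scalar product.
    let m = [v[0] * u[0], v[1] * u[1], v[2] * u[2], v[3] * u[3]];
    output_transform(m)
}
```

With channels present, stage 2 grows with c_in * c_out per position, while the input transform stays a handful of additions per input element.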

@enthropy7

Owner

I measured this optimization and the speed stayed the same. After careful review, the reason for the lack of speedup in the benchmarks is that winograd_input_tile is only the input transform, the beginning of the work in a winograd conv. The batched GEMM dominates, and this PR doesn't touch it, so vectorizing the input tile can't move the needle.
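
A rough way to quantify this is Amdahl's law: if a stage takes a fraction p of the total time and gets s times faster, the overall speedup is 1 / ((1 - p) + p / s). The sketch below is only that formula. The function name is hypothetical, and any fraction plugged into it is an illustrative assumption, not a measurement from this PR.

```rust
/// Amdahl's law: overall speedup when a stage that takes `fraction` of the
/// total runtime is made `stage_speedup` times faster.
/// `fraction` must be in [0, 1] and `stage_speedup` must be > 0.
fn overall_speedup(fraction: f64, stage_speedup: f64) -> f64 {
    assert!((0.0..=1.0).contains(&fraction), "fraction must be in [0, 1]");
    assert!(stage_speedup > 0.0, "stage_speedup must be positive");
    1.0 / ((1.0 - fraction) + fraction / stage_speedup)
}

/// Upper bound on the overall speedup, reached when the stage becomes free.
fn speedup_ceiling(fraction: f64) -> f64 {
    assert!((0.0..1.0).contains(&fraction), "fraction must be in [0, 1)");
    1.0 / (1.0 - fraction)
}
```

For example, if the input transform were 10% of the conv time (a made-up share), overall_speedup(0.10, 4.0) is about 1.08 and speedup_ceiling(0.10) is about 1.11, which is easily lost in benchmark noise.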

Two issues that matter more than the change itself:

The new benchmark doesn't actually exercise winograd. bench_winograd_conv_modes calls conv2d_nhwc (no padding). I confirmed with the dispatch recorder that that routes to im2col-gemm, not winograd. Winograd only fires via conv2d_nhwc_padded (SAME padding). So the winograd bench measures im2col and would show no change from this PR. To actually measure winograd, call conv2d_nhwc_padded (or go through the runner with a padded 3×3 conv).

On aarch64, winograd isn't used for YOLO at all. The runner intercepts every 3×3 group=1 conv with the indirect kernel (conv2d_nhwc_indirect_padded) before winograd is reached. So the NEON winograd path has no effect on YOLO on ARM - I didn't bench the Orange Pi because it would be a guaranteed null result.


WHAT SHOULD WE DO TO ACTUALLY BOOST WINOGRAD?

(It's really important, and I just haven't had time to implement it during all my work on my fw, so I would be very grateful for it.)

x86

the winograd GEMM at gemm_conv.rs:488:

for a in 0..16 {
    ////
    super::super::matmul::blas_sgemm(v_slice, u_slice, m_slice, n_tiles, c_in, c_out);
}

Three things to do, biggest first:

  • It calls raw blas_sgemm, not matmul_2d_slices_fused_maybe_packed, so it bypasses our fast packed kernels (mr12/mr6/avx512) and the fused bias/activation epilogue. Route it through the fused entry. This is the main win.
  • 16 GEMMs run sequentially. The positions a ∈ 0..16 are independent - parallelize them (par_iter / rayon scope) or fold into one batched GEMM.
  • The weights U are re-packed every inference; conv weights are static, thus pack once via pack_b_for_session and pass packed_b in.
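
To illustrate the second point, here is a minimal sketch of running the 16 per-position GEMMs concurrently. It makes several assumptions that are not taken from the crate: the buffers are laid out as [position][row][col], naive_sgemm is a hypothetical stand-in for the real kernel (it only mirrors the argument order of the call above), and std::thread::scope is used in place of rayon so the snippet needs no dependencies.

```rust
use std::thread;

const POSITIONS: usize = 16;

/// Hypothetical stand-in kernel: m[rows x n] = v[rows x k] * u[k x n], row-major.
fn naive_sgemm(v: &[f32], u: &[f32], m: &mut [f32], rows: usize, k: usize, n: usize) {
    for r in 0..rows {
        for c in 0..n {
            let mut acc = 0.0f32;
            for i in 0..k {
                acc += v[r * k + i] * u[i * n + c];
            }
            m[r * n + c] = acc;
        }
    }
}

/// Baseline: one GEMM per position, one after another.
fn winograd_gemms_sequential(
    v: &[f32], u: &[f32], m: &mut [f32],
    n_tiles: usize, c_in: usize, c_out: usize,
) {
    for a in 0..POSITIONS {
        let v_slice = &v[a * n_tiles * c_in..(a + 1) * n_tiles * c_in];
        let u_slice = &u[a * c_in * c_out..(a + 1) * c_in * c_out];
        let m_slice = &mut m[a * n_tiles * c_out..(a + 1) * n_tiles * c_out];
        naive_sgemm(v_slice, u_slice, m_slice, n_tiles, c_in, c_out);
    }
}

/// Same work, one scoped thread per position. Each position writes a disjoint
/// chunk of `m`, so chunks_mut hands out non-overlapping mutable slices.
fn winograd_gemms_parallel(
    v: &[f32], u: &[f32], m: &mut [f32],
    n_tiles: usize, c_in: usize, c_out: usize,
) {
    assert!(n_tiles * c_out > 0, "output chunks must be non-empty");
    assert_eq!(m.len(), POSITIONS * n_tiles * c_out);
    thread::scope(|s| {
        for (a, m_slice) in m.chunks_mut(n_tiles * c_out).enumerate() {
            let v_slice = &v[a * n_tiles * c_in..(a + 1) * n_tiles * c_in];
            let u_slice = &u[a * c_in * c_out..(a + 1) * c_in * c_out];
            s.spawn(move || naive_sgemm(v_slice, u_slice, m_slice, n_tiles, c_in, c_out));
        }
    });
}
```

Spawning 16 OS threads per conv is only for demonstration. A real change would more likely reuse an existing thread pool, and it should not nest parallelism if the GEMM kernel is already multi-threaded.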

ARM

conv2d_nhwc_indirect_padded (gemm_conv.rs:18): this is the path YOLO's 3×3 convs actually take on aarch64. Its body is plain for b / oy / ox loops with no parallelism (no par_chunks_mut_dispatch). On a 4-core A53 that idles three cores - parallelizing over output rows is an easy ~4× before any tiling work.
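
A minimal sketch of what parallelizing over output rows could look like, under stated assumptions: batch size 1, stride 1, a 3×3 kernel with one pixel of zero padding, weights laid out as [ky][kx][c_in][c_out], and std::thread::scope instead of the crate's par_chunks_mut_dispatch (whose signature is not shown in this thread). All function names are hypothetical. This is a plain direct convolution, not the indirect kernel itself.

```rust
use std::thread;

/// Computes output row `oy` of a 3x3, stride-1, zero-padded NHWC conv.
/// input: [h][w][c_in], weights: [3][3][c_in][c_out], out_row: [w][c_out].
fn conv3x3_row(
    input: &[f32], weights: &[f32], out_row: &mut [f32],
    oy: usize, h: usize, w: usize, c_in: usize, c_out: usize,
) {
    for ox in 0..w {
        for co in 0..c_out {
            let mut acc = 0.0f32;
            for ky in 0..3 {
                for kx in 0..3 {
                    // Coordinates in the padded frame; skip the zero border.
                    let (py, px) = (oy + ky, ox + kx);
                    if py < 1 || py > h || px < 1 || px > w {
                        continue;
                    }
                    let (iy, ix) = (py - 1, px - 1);
                    for ci in 0..c_in {
                        acc += input[(iy * w + ix) * c_in + ci]
                            * weights[((ky * 3 + kx) * c_in + ci) * c_out + co];
                    }
                }
            }
            out_row[ox * c_out + co] = acc;
        }
    }
}

/// Baseline: rows computed one after another.
fn conv3x3_sequential(
    input: &[f32], weights: &[f32], out: &mut [f32],
    h: usize, w: usize, c_in: usize, c_out: usize,
) {
    for (oy, out_row) in out.chunks_mut(w * c_out).enumerate() {
        conv3x3_row(input, weights, out_row, oy, h, w, c_in, c_out);
    }
}

/// Output rows split into contiguous bands, one scoped thread per band.
fn conv3x3_parallel(
    input: &[f32], weights: &[f32], out: &mut [f32],
    h: usize, w: usize, c_in: usize, c_out: usize, n_threads: usize,
) {
    assert!(n_threads > 0 && h > 0 && w * c_out > 0);
    let rows_per_band = (h + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for (band, out_band) in out.chunks_mut(rows_per_band * w * c_out).enumerate() {
            s.spawn(move || {
                for (i, out_row) in out_band.chunks_mut(w * c_out).enumerate() {
                    let oy = band * rows_per_band + i;
                    conv3x3_row(input, weights, out_row, oy, h, w, c_in, c_out);
                }
            });
        }
    });
}
```

Rows are independent because each output row only reads the input and writes its own slice, so no synchronization is needed beyond joining the threads. How close the real kernel gets to the core count depends on memory bandwidth and is something only a benchmark on the target board can show.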

And of course fix the bench to call conv2d_nhwc_padded so it measures winograd at all.

Thanks for your work. I'm looking forward to helping you!
