Performance tuning for YOLO-like workloads #9
Vectorized the hottest function on the winograd path, the
239c3b7 to 40a7288
I measured this optimization and the speed stayed the same. After careful review, the lack of speedup on the benches comes down to two issues that matter more than the change itself:

The new benchmark doesn't actually exercise winograd. bench_winograd_conv_modes calls

On aarch64, winograd isn't used for YOLO at all. The runner intercepts every 3×3 group=1 conv with the indirect kernel (conv2d_nhwc_indirect_padded) before winograd is reached. So the NEON winograd path has no effect on YOLO on ARM. I didn't bench the Orange Pi because it would be a guaranteed null result.

WHAT SHOULD WE DO TO ACTUALLY BOOST WINOGRAD?

x86: the winograd GEMM at

Three things to do, biggest first:

ARM: conv2d_nhwc_indirect_padded (gemm_conv.rs:18): this is the path YOLO's 3×3 convs actually take on aarch64. Its body is plain for b / oy / ox loops with no parallelism (no par_chunks_mut_dispatch). On a 4-core A53 that leaves three cores idle, so parallelizing over output rows is an easy ~4× before any tiling work.

And of course fix the bench to call conv2d_nhwc_padded so it measures winograd at all.

Thanks for your work. I'm looking forward to helping you further!
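The row-parallel suggestion for the ARM path can be sketched roughly as below. This is not the body of conv2d_nhwc_indirect_padded, and it does not use par_chunks_mut_dispatch (their signatures are not shown in this thread). It is a stand-in under stated assumptions: a naive 3×3, stride-1, group=1 convolution over an already padded NHWC input with batch size 1, split into bands of output rows with std::thread::scope. The names conv3x3_rows and conv3x3_parallel are hypothetical.

```rust
use std::thread;

/// Computes a contiguous band of output rows, starting at row `oy0`.
/// Layouts (assumed for this sketch): `input` is [h + 2, w + 2, c_in] (pre-padded),
/// `weights` is [3, 3, c_in, c_out], `out_rows` is [rows, w, c_out].
fn conv3x3_rows(
    input: &[f32],
    weights: &[f32],
    out_rows: &mut [f32],
    oy0: usize,
    w: usize,
    c_in: usize,
    c_out: usize,
) {
    let in_w = w + 2;
    for (r, out_row) in out_rows.chunks_mut(w * c_out).enumerate() {
        let oy = oy0 + r;
        for ox in 0..w {
            for co in 0..c_out {
                let mut acc = 0.0f32;
                for ky in 0..3 {
                    for kx in 0..3 {
                        let in_base = ((oy + ky) * in_w + (ox + kx)) * c_in;
                        let w_base = (ky * 3 + kx) * c_in * c_out;
                        for ci in 0..c_in {
                            acc += input[in_base + ci] * weights[w_base + ci * c_out + co];
                        }
                    }
                }
                out_row[ox * c_out + co] = acc;
            }
        }
    }
}

/// Splits the output into bands of whole rows and runs one scoped thread per band.
/// Bands are disjoint `&mut` slices, so no locking is needed.
pub fn conv3x3_parallel(
    input: &[f32],
    weights: &[f32],
    output: &mut [f32],
    h: usize,
    w: usize,
    c_in: usize,
    c_out: usize,
    threads: usize,
) {
    assert_eq!(input.len(), (h + 2) * (w + 2) * c_in);
    assert_eq!(weights.len(), 9 * c_in * c_out);
    assert_eq!(output.len(), h * w * c_out);
    if output.is_empty() {
        return;
    }
    let threads = threads.max(1);
    let rows_per_band = ((h + threads - 1) / threads).max(1);
    let band_len = rows_per_band * w * c_out;
    thread::scope(|s| {
        for (i, band) in output.chunks_mut(band_len).enumerate() {
            s.spawn(move || {
                conv3x3_rows(input, weights, band, i * rows_per_band, w, c_in, c_out)
            });
        }
    });
}
```

Calling conv3x3_parallel with threads = 1 gives the serial baseline, and with threads = 4 it matches the 4-core A53 case mentioned above. Each output row depends only on read-only input and weights, which is why splitting by output rows needs no synchronization; the actual speedup still depends on memory bandwidth and has to be measured.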
Thanks to enthropy7's benchmark refactoring, we can now see which kernel path each convolution takes.
Via
BENCH_COOLDOWN=0 cargo run --release --example bench_yolo --features=blas
we get a profile of the YOLO models (on x86_64, YMMV) showing that the winograd 3x3 path takes most of the computation time.
In the initial state (5faba42), the winograd path was scalar, with only a vectorized GEMM function inside.
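For readers unfamiliar with what the scalar part of a winograd path does, here is a minimal one-dimensional F(2,3) sketch in scalar Rust. It is not the code from this repository, and the tile size the repository actually uses is not stated here, so F(2,3) is an assumption chosen for brevity. The function names are hypothetical. It produces two outputs of a 3-tap filter using four multiplications instead of six; in a full 2D implementation the multiplications become the GEMM step, and the surrounding additions are the input, filter and output transforms.

```rust
/// Filter transform for F(2,3): done once per filter, outside the hot loop.
fn transform_filter(g: [f32; 3]) -> [f32; 4] {
    [
        g[0],
        0.5 * (g[0] + g[1] + g[2]),
        0.5 * (g[0] - g[1] + g[2]),
        g[2],
    ]
}

/// Winograd F(2,3): two outputs from four inputs using four multiplications.
/// `u` is the transformed filter returned by `transform_filter`.
fn winograd_f2_3(d: [f32; 4], u: [f32; 4]) -> [f32; 2] {
    // Input transform fused with the element-wise products.
    let m1 = (d[0] - d[2]) * u[0];
    let m2 = (d[1] + d[2]) * u[1];
    let m3 = (d[2] - d[1]) * u[2];
    let m4 = (d[1] - d[3]) * u[3];
    // Output transform.
    [m1 + m2 + m3, m2 - m3 - m4]
}

/// Direct reference: six multiplications for the same two outputs.
fn direct_f2_3(d: [f32; 4], g: [f32; 3]) -> [f32; 2] {
    [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]
}
```

The transforms are pure additions and subtractions over small fixed-size tiles, which is the kind of code that stays scalar unless it is written with SIMD in mind.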