An academic project on accelerating Neural Network training by optimizing the GEMM kernel on multi-core CPUs and GPUs. (NTUA)
Updated Dec 16, 2025 - Python
Profile-driven FP32 CUDA GEMM optimization: naive --> tiled --> coalesced --> register-blocked --> bank-padded, benchmarked against cuBLAS.
High-performance CUDA matrix multiplication kernels - shared memory tiling, register blocking, Roofline Model analysis. Benchmarked against cuBLAS.
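As a rough illustration of the tiling idea these projects name, the sketch below compares a naive triple-loop GEMM with a blocked version in plain Python. It is not taken from any of the listed repositories; the names `naive_gemm`, `tiled_gemm`, and `tile` are hypothetical. It shows only the loop structure. In a CUDA kernel, each tile would typically be staged in shared memory and processed by a thread block.

```python
def naive_gemm(a, b):
    """Reference GEMM: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def tiled_gemm(a, b, tile=2):
    """Blocked GEMM (hypothetical sketch): walk C in tile x tile blocks.

    For each output block, the k dimension is also split into chunks of
    size `tile`, so only a small sub-block of A and B is touched at a time.
    This is the access pattern that shared-memory tiling exploits on a GPU.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # block row of C
        for j0 in range(0, n, tile):          # block column of C
            for k0 in range(0, k, tile):      # chunk of the reduction dimension
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(k0, min(k0 + tile, k)):
                        a_ip = a[i][p]        # reused across the inner j loop
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += a_ip * b[p][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_gemm(a, b, tile=2) == naive_gemm(a, b))
```

Checking the blocked result against the naive reference plays the same role here that comparing against cuBLAS plays for the CUDA kernels: the optimized variant must first agree with a trusted baseline before its speed is measured.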