We want to verify that implementing a complex, highly optimized kernel as a DPP really does allow the preceding and continuation kernels to be fused into it, making those kernels almost free and the overall pipeline significantly faster.
We found interesting open-source kernels to experiment with here: https://github.com/karpathy/llm.c/blob/master/dev/cuda/attention_forward.cu
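As a minimal sketch of the effect we want to measure (an illustrative toy, not the DPP mechanism itself and not the llm.c attention kernels): the unfused baseline launches a "previous" kernel, a heavy main kernel, and a "continuation" kernel, paying launch overhead and a global-memory round trip for each, while the fused version folds the pre- and post-processing into the main kernel's load and store. The kernel names and the `tanhf` stand-in for the expensive work are hypothetical.

```cuda
#include <cuda_runtime.h>

// Baseline: three separate launches, three passes over global memory.
__global__ void scale_kernel(float* x, float s, int n) {      // "previous" kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
__global__ void heavy_kernel(float* x, int n) {               // stand-in for the expensive kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = tanhf(x[i]);
}
__global__ void bias_kernel(float* x, float b, int n) {       // "continuation" kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

// Fused: pre- and post-processing ride along with the heavy kernel's
// load and store, so they cost essentially nothing extra.
__global__ void fused_kernel(float* x, float s, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = tanhf(x[i] * s) + b;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    int threads = 256, blocks = (n + threads - 1) / threads;

    // Unfused: three launches.
    scale_kernel<<<blocks, threads>>>(d, 2.0f, n);
    heavy_kernel<<<blocks, threads>>>(d, n);
    bias_kernel<<<blocks, threads>>>(d, 1.0f, n);

    // Fused: one launch, one pass.
    fused_kernel<<<blocks, threads>>>(d, 2.0f, 1.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Timing the two variants (e.g. with CUDA events) is the kind of comparison we want to reproduce on the real attention kernels.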