We want to verify that implementing a complex, highly optimized kernel as a DPP really does allow the preceding and continuation kernels to be fused into it, making those kernels almost free and the overall pipeline significantly faster.
We found interesting open-source kernels to experiment with here: https://github.com/karpathy/llm.c/blob/master/dev/cuda/attention_forward.cu
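As a minimal sketch of the effect we want to measure (an illustrative toy, not the DPP mechanism itself and not the llm.c attention kernels): the unfused baseline launches a "previous" kernel, a heavy main kernel, and a "continuation" kernel, paying launch overhead and a global-memory round trip for each, while the fused version folds the pre- and post-processing into the main kernel's load and store. The kernel names and the `tanhf` stand-in for the expensive work are hypothetical.

```cuda
#include <cuda_runtime.h>

// Baseline: three separate launches, three passes over global memory.
__global__ void scale_kernel(float* x, float s, int n) {      // "previous" kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
__global__ void heavy_kernel(float* x, int n) {               // stand-in for the expensive kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = tanhf(x[i]);
}
__global__ void bias_kernel(float* x, float b, int n) {       // "continuation" kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

// Fused: pre- and post-processing ride along with the heavy kernel's
// load and store, so they cost essentially nothing extra.
__global__ void fused_kernel(float* x, float s, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = tanhf(x[i] * s) + b;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    int threads = 256, blocks = (n + threads - 1) / threads;

    // Unfused: three launches.
    scale_kernel<<<blocks, threads>>>(d, 2.0f, n);
    heavy_kernel<<<blocks, threads>>>(d, n);
    bias_kernel<<<blocks, threads>>>(d, 1.0f, n);

    // Fused: one launch, one pass.
    fused_kernel<<<blocks, threads>>>(d, 2.0f, 1.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Timing the two variants (e.g. with CUDA events) is the kind of comparison we want to reproduce on the real attention kernels.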