sd_fft: AVX512 implementation for some functions #2715
Open
user202729 wants to merge 2 commits into
Contributor
Author
https://www.texmacs.org/joris/ntt/ntt.html claims 51.4 ns (Xeon) and 16.5 ns (Zen 4) for r=64, so I guess there is still room for improvement.
Collaborator
I believe Joris optimized through SLPs (straight-line programs), i.e. no branching at all, within the Mathemagix framework (i.e. their own compiler).
Contributor
Author
The
29a6814 to acbd4dd
This PR defines new types `vec8dz` and `vec16dz`, corresponding to AVX512 register types, and uses them to speed up some functions in `sd_fft.c`.
Only a limited number of functions in `sd_fft.c` are modified. This is mostly to discuss what interface we want in `machine_vectors.h`.
Note that `vec8d` is already taken. Changing `vec8d` from two `ymm` registers to one `zmm` register is not unconditionally profitable, since there would not be enough instruction-level parallelism.
To limit the amount of code changes, I keep the original transformed element ordering, which in particular means `sd_fft_store_vec8dz_basecase_order` ends up rather ugly (it takes only 24 cycles, though).
Original:
After first commit:
bits=6 shows 22% improvement and bits=8 shows 17% improvement.
bits=7 is unmodified (will need more work).
After second commit:
bits=8 shows 8% additional improvement.
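To make the `vec8d` versus `vec8dz` trade-off from the description concrete, here is a minimal sketch. It assumes GCC/Clang vector extensions, and every `demo_*` name is hypothetical, not FLINT's actual definitions in `machine_vectors.h`. It contrasts eight doubles held as two 256-bit halves with eight doubles held in one 512-bit vector.

```c
/* Hypothetical illustration only: the demo_* names are NOT FLINT's types.
   GCC/Clang vector extensions are used so this compiles with or without
   AVX512 enabled (the compiler lowers wide vectors as needed). */

typedef double demo_vec4d  __attribute__((vector_size(32))); /* ymm-sized */
typedef double demo_vec8dz __attribute__((vector_size(64))); /* zmm-sized */

/* Eight doubles stored as two independent 256-bit halves. */
typedef struct { demo_vec4d lo, hi; } demo_vec8d;

/* Two independent 256-bit additions: the CPU can overlap them. */
static void demo_vec8d_add(demo_vec8d *r, const demo_vec8d *a,
                           const demo_vec8d *b)
{
    r->lo = a->lo + b->lo;
    r->hi = a->hi + b->hi;
}

/* One 512-bit addition: fewer instructions, but also fewer independent
   operations in flight unless the caller processes several vectors. */
static void demo_vec8dz_add(demo_vec8dz *r, const demo_vec8dz *a,
                            const demo_vec8dz *b)
{
    *r = *a + *b;
}
```

Both functions compute the same eight sums. Whether the single-register form is faster depends on how many independent operations the surrounding code keeps in flight, which is the concern raised about redefining `vec8d`.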
The temp benchmark file (AI).
Use perf events to count the number of CPU cycles.
This is more reliable than `rdtscp` since the clock frequency varies.
You might need to run `sudo sysctl kernel.perf_event_paranoid=-1` first.
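For readers unfamiliar with the perf-events approach, the following is a minimal sketch of cycle counting through the Linux `perf_event_open` system call. It is not the benchmark file from this PR, and the `cycle_counter_*` helper names are made up for illustration. It assumes Linux with hardware counters available; opening the counter can fail in containers, in virtual machines, or under a restrictive `perf_event_paranoid` setting.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helpers, not part of FLINT or of the PR's benchmark. */

/* Open a user-space CPU-cycle counter for the calling thread.
   Returns a file descriptor, or -1 (with errno set) on failure. */
static int cycle_counter_open(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        /* start stopped; enabled explicitly later */
    attr.exclude_kernel = 1;  /* count user-space cycles only */
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: this thread on any CPU; no group, no flags. */
    return (int) syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Reset the counter to zero and start counting. Returns 0 on success. */
static int cycle_counter_start(int fd)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) == -1)
        return -1;
    return ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == -1 ? -1 : 0;
}

/* Stop counting and store the cycle count in *cycles.
   Returns 0 on success, -1 on failure. */
static int cycle_counter_stop(int fd, uint64_t *cycles)
{
    if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) == -1)
        return -1;
    return read(fd, cycles, sizeof *cycles) == (ssize_t) sizeof *cycles
               ? 0 : -1;
}
```

A caller would open the counter once, wrap the code under test between `cycle_counter_start` and `cycle_counter_stop`, and close the descriptor with `close` afterwards. Unlike a timestamp read, the count reflects core cycles actually spent, so it does not drift when the clock frequency changes.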