At the time of writing, `IteratorILP` does not lead to good code generation on iterators of tuples and references. This is highly unfortunate because tuples are omnipresent in iterator-based code, due to `Iterator::zip()` and friends, and references are also omnipresent, as they appear as soon as one writes `collection.iter()` without thinking.
The root cause of this problem is that `IteratorILP` works by alternately reading `STREAMS` inputs from the input iterator and processing these `STREAMS` inputs. Between these two steps, the inputs are, at the time of writing, buffered into an internal array of type `[Iterator::Item; STREAMS]`.
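To make this buffering pattern concrete, here is a minimal sketch of the read-then-process structure described above, specialized to a plain `f32` sum for simplicity. It is not the actual `IteratorILP` implementation, just an illustration of where the `[Iterator::Item; STREAMS]` buffer comes in:

```rust
/// Illustration only: sum an iterator of f32 using STREAMS accumulators,
/// alternating between reading a batch of inputs and processing it.
fn sum_ilp<const STREAMS: usize>(mut iter: impl Iterator<Item = f32>) -> f32 {
    let mut accumulators = [0.0f32; STREAMS];
    loop {
        // Step 1: read up to STREAMS inputs into an [Iterator::Item; STREAMS] buffer
        let mut buffer = [0.0f32; STREAMS];
        let mut filled = 0;
        while filled < STREAMS {
            match iter.next() {
                Some(item) => {
                    buffer[filled] = item;
                    filled += 1;
                }
                None => break,
            }
        }
        // Step 2: process the buffered inputs, one per accumulator
        for (acc, item) in accumulators.iter_mut().zip(&buffer[..filled]) {
            *acc += *item;
        }
        // The iterator is exhausted once a batch comes up short
        if filled < STREAMS {
            break;
        }
    }
    accumulators.into_iter().sum()
}
```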
As long as `Iterator::Item` is a basic numeric value type, such as `i8`, `f32` or even `Simd<T, N>`, this translates into manipulating something like a `[f32; STREAMS]`, which the compiler backend is very good at handling, and all is well.
But if `Iterator::Item` is instead a tuple type like `(f32, u32)`, the compiler ends up manipulating `[(f32, u32); STREAMS]`, which naively translates into a memory layout that interleaves the two elements of each tuple. The compiler backend is not smart enough to automatically transform this interleaved data into non-interleaved `([f32; STREAMS], [u32; STREAMS])` data upon observing that the buffer never escapes to the outside world, and thus important optimizations that rely on a contiguous memory layout, such as autovectorization, are lost.
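For illustration, here is what the two layouts look like for a hypothetical `STREAMS = 4` and `Iterator::Item = (f32, u32)` (the variable names are made up for the example):

```rust
const STREAMS: usize = 4;

fn layouts() {
    // What the internal buffer amounts to today: an array of tuples, whose
    // memory layout interleaves the f32 and u32 of each item
    let buffer_of_tuples: [(f32, u32); STREAMS] = [(0.0, 0); STREAMS];

    // The layout that autovectorization would prefer: a tuple of arrays, where
    // each field of the item lives in its own contiguous array
    let tuple_of_buffers: ([f32; STREAMS], [u32; STREAMS]) =
        (buffer_of_tuples.map(|(x, _)| x), buffer_of_tuples.map(|(_, n)| n));
    assert_eq!(tuple_of_buffers, ([0.0; STREAMS], [0; STREAMS]));
}
```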
Similarly, when `Iterator::Item` is a reference type like `&f32`, the compiler ends up building an `[&f32; STREAMS]` array of input pointers, which is inefficient to access and obviously also breaks autovectorization. It would be wiser to build a `[f32; STREAMS]` array of input copies instead, but the compiler backend is not smart enough to perform this transformation during its optimization passes.
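The copy-based workaround suggested above could look roughly like this, as a sketch under the assumption that the pointee type is `Copy` (the helper name is made up for the example):

```rust
const STREAMS: usize = 4;

/// Hypothetical helper: fill a buffer of values, not references, from an
/// iterator of &f32 (illustration only)
fn next_batch<'a>(iter: &mut impl Iterator<Item = &'a f32>) -> Option<[f32; STREAMS]> {
    let mut buffer = [0.0f32; STREAMS];
    for slot in &mut buffer {
        // Dereference eagerly, so the buffer holds a contiguous [f32; STREAMS]
        // instead of an [&f32; STREAMS] full of pointers
        *slot = *iter.next()?;
    }
    Some(buffer)
}
```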
To address these two important performance deficiencies, it would be great if `IteratorILP` could detect such known-problematic patterns in the input data stream and adapt its input buffer and accumulator layout accordingly. For example, `IteratorILP` could detect iterators of references to `Copy` types and switch to internally buffering copies of inputs instead of the references themselves. And it could detect iterators of tuples and switch to tuples of input buffers instead of buffers of input tuples.
But this is not as easy as it sounds:
- At the time of writing, stable Rust does not have specialization. It is therefore not possible to elegantly provide a default `IteratorILP` implementation that works for the largest possible number of iterator types, but gets overridden for specific iterator types that need codegen tweaks. Our options are to...
  - Restrict this feature to nightly compilers and gate it behind some nightly-specific flag like `nightly`. This will result in a massive performance difference between nightly and non-nightly builds, which is not great, and it will also result in a higher maintenance burden, as nightly features regularly break without warning (that's their whole point).
  - Do a major release with a massive API break that restricts `IteratorILP` implementations to specific types, instead of providing a blanket impl for all supported iterator types. By knowing which iterator item types we're dealing with, we can optimize more. But losing support for all the types we didn't explicitly think about/add support for would be a great shame.
  - Have several "backends" for each operation that can work with all possible iterator item types, but are most optimized for a particular type. Select the right implementation for a particular item type with code along these lines...
    ```rust
    if const { is_tuple::<Self::Item>() } {
        tuple_impl(/* ... */)
    } else if const { is_ref::<Self::Item>() } {
        ref_impl(/* ... */)
    } else {
        default_impl(/* ... */)
    }
    ```
    ...and let the const-propagation optimization do its magic and eliminate the unused code paths. This has the major benefit of working on stable Rust without restricting operation to specific types, but it will result in code bloat and make specialized impls harder to write, as they must compile and work even for types where they are inadequate and will never be used.
- Tuples can nest, and will commonly do so in idiomatic code (think zips of zips). This will result in fugly types like `(T, (U, (V, W)))`. In principle, we could blindly codegen the associated buffer into a `([T; STREAMS], ([U; STREAMS], ([V; STREAMS], [W; STREAMS])))`, using as much recursion as necessary to traverse the hierarchy when reading and writing the buffer, but there is a risk that the compiler backend will go crazy and generate bad code beyond a certain degree of nesting. It would therefore seem more prudent to internally use a flattened tuple like `([T; STREAMS], [U; STREAMS], [V; STREAMS], [W; STREAMS])`, which has the additional benefit of being easier to manipulate. But then, when the time comes to invoke the user's reduction callback, we must somehow go back from this flattened layout to the nested layout that the user expects (here `(T, (U, (V, W)))`), and this can be painful (see the sketch after this list). So it's not clear if flattening is worth it.
- Even nightly Rust does not have variadic generics, so no matter which strategy we go for, we will never be able to handle arbitrary tuples. This is probably fine for this crate, because unrolling loops where each iteration operates on a very big tuple cannot result in good codegen anyway, as it will increase register pressure that is already too high. But it means we need to somehow detect and impose a threshold above which the tuple-specialized codegen path is not taken.
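As an illustration of the flattened-vs-nested mismatch mentioned in the tuple-nesting point above, here is a sketch of what rebuilding the nested item that the user's callback expects from a flattened buffer would involve, using concrete, made-up element types in place of `T`, `U`, `V` and `W`:

```rust
const STREAMS: usize = 4;

// Hypothetical flattened internal buffer for Iterator::Item = (f32, (u32, (i16, u8)))
type FlatBuffer = ([f32; STREAMS], [u32; STREAMS], [i16; STREAMS], [u8; STREAMS]);

/// Rebuild the i-th nested item that the user's reduction callback expects
/// from the flattened buffer layout
fn nth_item(buf: &FlatBuffer, i: usize) -> (f32, (u32, (i16, u8))) {
    (buf.0[i], (buf.1[i], (buf.2[i], buf.3[i])))
}
```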
So far, the strategy has been to defer this work until the Rust language gains extra features like specialization and variadic generics that will make this optimization much easier. But if time permits, it would be good to explore how far we can go with or without nightly features, as it's a shame that `IteratorILP` breaks so easily on many common iteration patterns for the time being, and it will likely be a long, long time before Rust gets such "big" features.