At the time of writing, `IteratorILP` does not lead to good code generation on iterators of tuples and references. This is highly unfortunate because tuples are omnipresent in iterator-based code, due to `Iterator::zip()` and friends, and references are also omnipresent, as they appear as soon as one writes `collection.iter()` without thinking.
The root cause of this problem is that `IteratorILP` works by alternately reading `STREAMS` inputs from the input iterator and processing these `STREAMS` inputs. Between these two steps, the inputs are, at the time of writing, buffered into an internal array of type `[Iterator::Item; STREAMS]`.
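To make this buffering pattern concrete, here is a minimal sketch of the read-then-process structure described above, specialized to a plain `f32` sum for simplicity. It is not the actual `IteratorILP` implementation, just an illustration of where the `[Iterator::Item; STREAMS]` buffer comes in:

```rust
/// Illustration only: sum an iterator of f32 using STREAMS accumulators,
/// alternating between reading a batch of inputs and processing it.
fn sum_ilp<const STREAMS: usize>(mut iter: impl Iterator<Item = f32>) -> f32 {
    let mut accumulators = [0.0f32; STREAMS];
    loop {
        // Step 1: read up to STREAMS inputs into an [Iterator::Item; STREAMS] buffer
        let mut buffer = [0.0f32; STREAMS];
        let mut filled = 0;
        while filled < STREAMS {
            match iter.next() {
                Some(item) => {
                    buffer[filled] = item;
                    filled += 1;
                }
                None => break,
            }
        }
        // Step 2: process the buffered inputs, one per accumulator
        for (acc, item) in accumulators.iter_mut().zip(&buffer[..filled]) {
            *acc += *item;
        }
        // The iterator is exhausted once a batch comes up short
        if filled < STREAMS {
            break;
        }
    }
    accumulators.into_iter().sum()
}
```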
As long as `Iterator::Item` is a basic numeric value type, such as `i8`, `f32` or even `Simd<T, N>`, this translates into manipulating something like a `[f32; STREAMS]`, which the compiler backend is very good at handling, and all is well.
But if `Iterator::Item` is instead a tuple type like `(f32, u32)`, the compiler ends up manipulating `[(f32, u32); STREAMS]`, which naively translates into a memory layout that interleaves the two elements of each tuple. The compiler backend is not smart enough to automatically transform this interleaved data into non-interleaved `([f32; STREAMS], [u32; STREAMS])` data upon observing that the buffer never escapes to the outside world, and thus important optimizations that rely on a contiguous memory layout, such as autovectorization, are lost.
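For illustration, here is what the two layouts look like for a hypothetical `STREAMS = 4` and `Iterator::Item = (f32, u32)` (the variable names are made up for the example):

```rust
const STREAMS: usize = 4;

fn layouts() {
    // What the internal buffer amounts to today: an array of tuples, whose
    // memory layout interleaves the f32 and u32 of each item
    let buffer_of_tuples: [(f32, u32); STREAMS] = [(0.0, 0); STREAMS];

    // The layout that autovectorization would prefer: a tuple of arrays, where
    // each field of the item lives in its own contiguous array
    let tuple_of_buffers: ([f32; STREAMS], [u32; STREAMS]) =
        (buffer_of_tuples.map(|(x, _)| x), buffer_of_tuples.map(|(_, n)| n));
    assert_eq!(tuple_of_buffers, ([0.0; STREAMS], [0; STREAMS]));
}
```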
Similarly, when `Iterator::Item` is a reference type like `&f32`, the compiler ends up building an `[&f32; STREAMS]` array of input pointers, which is inefficient to access and obviously also breaks autovectorization. It would be wiser to build a `[f32; STREAMS]` array of input copies instead, but the compiler backend is not smart enough to perform this transformation during its optimization passes.
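The copy-based workaround suggested above could look roughly like this, as a sketch under the assumption that the pointee type is `Copy` (the helper name is made up for the example):

```rust
const STREAMS: usize = 4;

/// Hypothetical helper: fill a buffer of values, not references, from an
/// iterator of &f32 (illustration only)
fn next_batch<'a>(iter: &mut impl Iterator<Item = &'a f32>) -> Option<[f32; STREAMS]> {
    let mut buffer = [0.0f32; STREAMS];
    for slot in &mut buffer {
        // Dereference eagerly, so the buffer holds a contiguous [f32; STREAMS]
        // instead of an [&f32; STREAMS] full of pointers
        *slot = *iter.next()?;
    }
    Some(buffer)
}
```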
To address these two important performance deficiencies, it would be great if `IteratorILP` could detect such known-problematic patterns in the input data stream and adapt its input buffer and accumulator layout accordingly. For example, `IteratorILP` could detect iterators of references to `Copy` types and switch to internally buffering copies of inputs instead of the references themselves. And it could detect iterators of tuples and switch to tuples of input buffers instead of buffers of input tuples.
But this is not as easy as it sounds:
- At the time of writing, stable Rust does not have specialization. It is therefore not possible to elegantly provide a default `IteratorILP` implementation that works for the largest possible number of iterator types, but gets overridden for specific iterator types that need codegen tweaks. Our options are to...
  - Restrict this feature to nightly compilers and gate it behind some nightly-specific flag like `nightly`. This will result in a massive performance difference between nightly and non-nightly builds, which is not great, and it will also result in a higher maintenance burden, as nightly features regularly break without warning (that's their whole point).
  - Do a major release with a massive API break that restricts `IteratorILP` implementations to specific types, instead of providing a blanket impl for all supported iterator types. By knowing which iterator item types we're dealing with, we can optimize more. But losing support for all the types we didn't explicitly think about/add support for would be a great shame.
  - Have several "backends" for each operation that can work with all possible iterator item types, but are most optimized for a particular type. Select the right implementation for a particular item type with code along these lines...
    ```rust
    if const { is_tuple::<Self::Item>() } {
        tuple_impl(/* ... */)
    } else if const { is_ref::<Self::Item>() } {
        ref_impl(/* ... */)
    } else {
        default_impl(/* ... */)
    }
    ```
    ...and let the const-propagation optimization do its magic and eliminate the unused code paths. This has the major benefit of working on stable Rust without restricting operation to specific types, but it will result in code bloat and make specialized impls harder to write, as they must compile and work even for types where they are inadequate and will never be used.
- Tuples can nest, and will commonly do so in idiomatic code (think zips of zips). This will result in fugly types like `(T, (U, (V, W)))`. In principle, we could blindly codegen the associated buffer into a `([T; STREAMS], ([U; STREAMS], ([V; STREAMS], [W; STREAMS])))`, using as much recursion as necessary to traverse the hierarchy when reading and writing the buffer, but there is a risk that the compiler backend will go crazy and generate bad code beyond a certain degree of nesting. It would therefore seem more prudent to internally use a flattened tuple like `([T; STREAMS], [U; STREAMS], [V; STREAMS], [W; STREAMS])`, which has the additional benefit of being easier to manipulate. But then, when the time comes to invoke the user's reduction callback, we must somehow go back from this flattened layout to the nested layout that the user expects (here `(T, (U, (V, W)))`), and this can be painful (see the sketch after this list). So it's not clear if flattening is worth it.
- Even nightly Rust does not have variadic generics, so no matter which strategy we go for, we will never be able to handle arbitrary tuples. This is probably fine for this crate, because unrolling loops where each iteration operates on a very big tuple cannot result in good codegen anyway, as it will increase register pressure that is already too high. But it means we need to somehow detect and impose a threshold above which the tuple-specialized codegen path is not taken.
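As an illustration of the flattened-vs-nested mismatch mentioned in the tuple-nesting point above, here is a sketch of what rebuilding the nested item that the user's callback expects from a flattened buffer would involve, using concrete, made-up element types in place of `T`, `U`, `V` and `W`:

```rust
const STREAMS: usize = 4;

// Hypothetical flattened internal buffer for Iterator::Item = (f32, (u32, (i16, u8)))
type FlatBuffer = ([f32; STREAMS], [u32; STREAMS], [i16; STREAMS], [u8; STREAMS]);

/// Rebuild the i-th nested item that the user's reduction callback expects
/// from the flattened buffer layout
fn nth_item(buf: &FlatBuffer, i: usize) -> (f32, (u32, (i16, u8))) {
    (buf.0[i], (buf.1[i], (buf.2[i], buf.3[i])))
}
```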
So far, the strategy has been to defer this work until the Rust language gains extra features like specialization and variadic generics that will make this optimization much easier. But if time permits, it would be good to explore how far we can go with or without nightly features, as it's a shame that `IteratorILP` breaks so easily on many common iteration patterns for the time being, and it will likely be a long, long time before Rust gets such "big" features.