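See for example here. To lower memory requirements, fstlib allocates larger blocks of memory that several threads write to concurrently, each into its own sub-block. In such cases, cache line pollution must be avoided: if two sub-blocks share a cache line, a write by one thread invalidates that line in the other thread's core (false sharing), which slows both threads down.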
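One solution is to make the size of each sub-block a multiple of the cache line size (64 bytes on most modern Intel processors). Provided the large block itself starts on a cache line boundary, no two threads then write to the same cache line.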
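The padding rule can be sketched as follows. This is a minimal illustration and not fstlib's actual code: the names `kCacheLineSize`, `PaddedSize` and `SubBlockOffset` are hypothetical, the 64-byte line size is taken from the text above, and the base address of the shared block is assumed to be aligned to a cache line.

```cpp
#include <cstddef>

// Assumed cache line size in bytes (64 on most modern Intel processors).
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical helper: round a sub-block size up to the next multiple
// of the cache line size.
constexpr std::size_t PaddedSize(std::size_t bytes) {
  return (bytes + kCacheLineSize - 1) / kCacheLineSize * kCacheLineSize;
}

// Hypothetical helper: byte offset of a thread's sub-block inside the
// shared block. Every offset is a multiple of the cache line size, so
// with a cache-line-aligned base address no two sub-blocks share a line.
constexpr std::size_t SubBlockOffset(std::size_t thread_index,
                                     std::size_t bytes_per_thread) {
  return thread_index * PaddedSize(bytes_per_thread);
}

// Total bytes to allocate for nr_of_threads sub-blocks.
constexpr std::size_t TotalSize(std::size_t nr_of_threads,
                                std::size_t bytes_per_thread) {
  return nr_of_threads * PaddedSize(bytes_per_thread);
}
```

For example, with 100 bytes needed per thread, each sub-block is padded to 128 bytes, so thread 3 writes at offset 384 and four threads need 512 bytes in total. The cost is at most 63 wasted bytes per sub-block.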