Indirect access vectorization

AVX2 and AVX-512 have vpgatherdd, which takes a vector of signed 32-bit indices and multiplies each by a scale factor. (Source: Stack Overflow)

Investigating the AVX2 memory gather instruction

For very small arrays the vector (gather) version is slightly faster, but not significantly. Beyond an array size of about 512K, the scalar version becomes faster, and it stays ahead at most larger sizes.

Load address calculation when using AVX2 gather instructions

Gather instructions use byte addressing and do not have any alignment requirements.

load_addr = (char *)base + index[i] * scale;       // byte addressing
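
The address rule above can be modeled in plain C. This is only a sketch of the per-lane semantics, not how the hardware is implemented; gather_ps_scalar is a hypothetical helper name, and the eight-lane width is assumed to match the 256-bit single-precision gather. The scale must be 1, 2, 4 or 8, and each index is treated as a signed 32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: scalar model of an 8-lane gather of floats
   using signed 32-bit indices. */
static void gather_ps_scalar(float dst[8], const float *base,
                             const int32_t index[8], int scale)
{
    /* The instruction only encodes these four scale factors. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    for (int i = 0; i < 8; i++) {
        /* Byte addressing: the signed index is widened, multiplied
           by scale, and added to the base as a byte offset. */
        const char *load_addr =
            (const char *)base + (ptrdiff_t)index[i] * scale;

        /* memcpy mirrors the lack of alignment requirements. */
        memcpy(&dst[i], load_addr, sizeof(float));
    }
}
```

With scale = 4 (sizeof(float)) the indices behave like ordinary array subscripts; with scale = 1 they are raw byte offsets.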

_mm256_i32gather_ps

extern __m256 _mm256_i32gather_ps(float const * base, __m256i vindex, const int scale);
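
A minimal sketch of an array lookup, out[i] = table[idx[i]], built on the unmasked _mm256_i32gather_ps intrinsic. The helper names lookup and lookup_avx2 are hypothetical, and the code assumes GCC or Clang (it relies on the target("avx2") function attribute and __builtin_cpu_supports so it builds without -mavx2 and falls back to a scalar loop on CPUs without AVX2). The scale is 4 because the indices count floats, not bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: gathers eight floats per iteration.
   target("avx2") enables AVX2 for this function only (GCC/Clang). */
__attribute__((target("avx2")))
static void lookup_avx2(float *out, const float *table,
                        const int32_t *idx, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Unaligned load of eight signed 32-bit indices. */
        __m256i vindex = _mm256_loadu_si256((const __m256i *)(idx + i));
        /* scale = 4: each index is multiplied by sizeof(float). */
        __m256 v = _mm256_i32gather_ps(table, vindex, 4);
        _mm256_storeu_ps(out + i, v);
    }
    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        out[i] = table[idx[i]];
}
#endif

/* Hypothetical entry point: picks the gather path at run time. */
static void lookup(float *out, const float *table,
                   const int32_t *idx, size_t n)
{
    assert(out != NULL && table != NULL && idx != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        lookup_avx2(out, table, idx, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)
        out[i] = table[idx[i]];
}
```

Every index must stay within the bounds of table; the gather performs no bounds checking, just like the scalar loop.
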
Written on June 28, 2021, Last update on November 9, 2022
avx array lookup