Indirect Access Vectorization

AVX2 and AVX-512 have vpgatherdd, which takes a vector of signed 32-bit scaled indices. (Stack Overflow)

Fast 2D array lookup of int16_t LUT using AVX2 or AVX512
- AVX2 has gathered loads for 32-bit and 64-bit ints (vpgatherXX) as well as for floats and doubles.
AVX2 vectorized 256 bit lookup table (32 unsigned chars)
How are the gather instructions in AVX2 implemented?
- From the table it's clear that in all cases gather loads are faster than scalar loads.
Packing and de-interleaving two __m256 registers
Gathering half-float values using AVX

Investigating the AVX2 memory gather instruction

For very small array sizes the vector version is slightly faster, but not significantly. Beyond an array size of about 512K the scalar version becomes faster, and it stays faster at most larger sizes.

Load address calculation when using AVX2 gather instructions

Gather instructions use byte addressing and do not have any alignment requirements.

load_addr = (char *)base + index[i] * scale;       // byte addressing
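The address formula can be checked with a plain scalar model of an 8-lane gather. This is only a sketch for reasoning about byte addressing, not how the hardware works internally; `gather_model` is a hypothetical name, and it assumes each lane reads one 4-byte float.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of an 8-lane gather of 32-bit floats (hypothetical helper).
   Each lane reads 4 bytes from (char *)base + index[i] * scale. */
static void gather_model(const void *base, const int32_t index[8],
                         int scale, float out[8])
{
    for (int i = 0; i < 8; i++) {
        /* Byte addressing: the signed index is widened, then scaled. */
        const char *load_addr = (const char *)base + (int64_t)index[i] * scale;
        /* memcpy mirrors the lack of any alignment requirement. */
        memcpy(&out[i], load_addr, sizeof(float));
    }
}
```

With scale = 4 the indices count floats, so index 3 reads table[3]. With scale = 1 the same element needs byte offset 12. Because the indices are signed, a negative index reads below base.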

_mm256_i32gather_ps

extern __m256 _mm256_i32gather_ps(float const * base, __m256i vindex, const int scale);
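A minimal usage sketch of the intrinsic for a float table lookup, assuming GCC or Clang on x86-64. The helper names `lookup8_avx2` and `lookup8` are hypothetical. The `target("avx2")` attribute and `__builtin_cpu_supports` are compiler-specific conveniences, so the file builds without -mavx2 and falls back to scalar loads on CPUs without AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = table[idx[i]] for 8 lanes using the gather.
   Must only be called on a CPU that supports AVX2. */
static __attribute__((target("avx2")))
void lookup8_avx2(const float *table, const int32_t idx[8], float out[8])
{
    /* Unaligned load of the eight 32-bit indices. */
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* scale must be a compile-time constant (1, 2, 4 or 8);
       4 == sizeof(float), so the indices count elements. */
    __m256 v = _mm256_i32gather_ps(table, vindex, 4);
    _mm256_storeu_ps(out, v);
}

/* Dispatch wrapper: gather when AVX2 is available, scalar loop otherwise. */
static void lookup8(const float *table, const int32_t idx[8], float out[8])
{
    if (__builtin_cpu_supports("avx2")) {
        lookup8_avx2(table, idx, out);
    } else {
        for (int i = 0; i < 8; i++)
            out[i] = table[idx[i]];
    }
}
```

Calling lookup8(table, idx, out) with idx = {3, 0, 15, ...} fills out with table[3], table[0], table[15], and so on. Whether this beats eight scalar loads depends on the CPU and on the array size, as the notes above suggest.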

Written on June 28, 2021, Last update on November 9, 2022

avx array lookup