Indirect access Vectorization

AVX2 and AVX-512 have vpgatherdd, which uses a vector of signed 32-bit scaled indices. - SO

# Investigating the AVX2 memory gather instruction

For very small arrays the vector (gather) version is slightly faster, but not significantly. Beyond an array size of about 512K, the scalar version becomes faster, and it stays faster for most larger sizes.

# Load address calculation when using AVX2 gather instructions

Gather instructions use byte addressing and do not have any alignment requirements.

load_addr = (char *)base + index[i] * scale;       // byte addressing
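
The address formula above can be modeled in plain scalar C, which makes the effect of the scale factor easier to see. This is only a sketch: the helper name gather_model_i32 is made up for illustration, it assumes 32-bit elements and 32-bit indices, and it uses memcpy to mimic the absence of alignment requirements. The hardware accepts scale values of 1, 2, 4 and 8.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a gather of 32-bit elements with
   signed 32-bit indices. Not an intrinsic, just the address math. */
static void gather_model_i32(const void *base, const int32_t *index,
                             int scale, int n, int32_t *out)
{
    /* Gather instructions only encode these four scale factors. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    for (int i = 0; i < n; i++) {
        /* Byte addressing: the index is multiplied by scale, not by
           the element size. */
        const char *load_addr =
            (const char *)base + (int64_t)index[i] * scale;

        /* memcpy stands in for a load with no alignment requirement. */
        memcpy(&out[i], load_addr, sizeof out[i]);
    }
}
```

With scale = 4 the indices behave like ordinary array subscripts for a 32-bit array; with scale = 1 they are raw byte offsets, so index 20 reaches the same element as subscript 5.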

# _mm256_i32gather_ps

extern __m256 _mm256_i32gather_ps(float const * base, __m256i vindex, const int scale);
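
A minimal sketch of a table lookup using the unmasked intrinsic named in the heading, _mm256_i32gather_ps, which gathers eight floats at once. The function names gather8_avx2, gather8_scalar and gather8 are hypothetical. The sketch assumes GCC or Clang: the target attribute lets the file compile without -mavx2, and __builtin_cpu_supports picks the scalar loop when AVX2 is not available. The scale argument must be a compile-time constant; 4 is used here because the elements are 4-byte floats.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* out[i] = table[idx[i]] for i = 0..7, using one AVX2 gather. */
__attribute__((target("avx2")))
static void gather8_avx2(const float *table, const int32_t *idx, float *out)
{
    /* Unaligned load of the eight 32-bit indices. */
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);

    /* scale = 4: each index is multiplied by sizeof(float) bytes. */
    __m256 v = _mm256_i32gather_ps(table, vindex, 4);

    _mm256_storeu_ps(out, v);
}
#endif

/* Reference scalar version of the same lookup. */
static void gather8_scalar(const float *table, const int32_t *idx, float *out)
{
    for (int i = 0; i < 8; i++)
        out[i] = table[idx[i]];
}

/* Use the gather when the CPU supports AVX2, else the scalar loop. */
static void gather8(const float *table, const int32_t *idx, float *out)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        gather8_avx2(table, idx, out);
        return;
    }
#endif
    gather8_scalar(table, idx, out);
}
```

Both paths should produce identical results, so the scalar version doubles as a correctness check and as the baseline for timing comparisons like the one quoted earlier.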
Written on June 28, 2021, Last update on November 9, 2022
avx array lookup