Indirect access Vectorization
AVX2 / AVX512 have vpgatherdd, which uses a vector of signed 32-bit scaled indices. - SO
- Fast 2D array lookup of int16_t LUT using AVX2 or AVX512
- AVX2 has gathered loads for 32 bit and 64 bit ints (vpgatherXX) as well as floats and doubles.
- How are the gather instructions in AVX2 implemented? - From the table it's clear that in all cases gather loads are faster than scalar loads
- Gathering half-float values using AVX
# Investigating the AVX2 memory gather instruction
For very small arrays the vector version is slightly faster, but not significantly. Beyond an array size of about 512K, the scalar version becomes faster, and that advantage holds most of the time.
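To make the comparison concrete, here is a minimal sketch of the two loops such a measurement would pit against each other: a scalar indexed sum and an AVX2 version built on `_mm256_i32gather_epi32`. The function names `sum_indirect_scalar` and `sum_indirect_gather` are hypothetical, the `target("avx2")` attribute is a GCC/Clang extension (an assumption about the toolchain), and the vector loop assumes each lane's partial sum fits in 32 bits. It is not the benchmark from the quoted article.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar baseline: one indexed load per element. */
int64_t sum_indirect_scalar(const int32_t *table, const int32_t *idx, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += table[idx[i]];
    return sum;
}

/* AVX2 variant: eight indexed loads per gather (vpgatherdd).
 * Assumption: every 32-bit lane accumulator stays within int32_t range. */
__attribute__((target("avx2")))
int64_t sum_indirect_gather(const int32_t *table, const int32_t *idx, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Unaligned load of eight 32-bit element indices. */
        __m256i vindex = _mm256_loadu_si256((const __m256i *)(idx + i));
        /* scale = 4 turns element indices into byte offsets for 32-bit data. */
        __m256i v = _mm256_i32gather_epi32((const int *)table, vindex, 4);
        acc = _mm256_add_epi32(acc, v);
    }

    /* Horizontal reduction of the eight lane sums. */
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int64_t sum = 0;
    for (int k = 0; k < 8; ++k)
        sum += lanes[k];

    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; ++i)
        sum += table[idx[i]];
    return sum;
}
```

Timing both functions over a range of table sizes (for example with `clock()` around repeated calls) is one way to check on your own hardware where the crossover falls. The result depends on the CPU generation and on how the indices hit the cache.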
# Load address calculation when using AVX2 gather instructions
Gather instructions use byte addressing and do not have any alignment requirements.
load_addr = (char *)base + index[i] * scale; // byte addressing
# _mm256_i32gather_ps
extern __m256 _mm256_i32gather_ps(float const * base, __m256i vindex, const int scale);
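As a rough illustration of how the unmasked intrinsic `_mm256_i32gather_ps` relates to the byte-addressing rule, the sketch below gathers eight floats with AVX2 and also provides a scalar reference that applies `load_addr = (char *)base + index[i] * scale` lane by lane. The helper names `gather8_scalar`, `gather8_avx2` and `gather8` are hypothetical. The `target("avx2")` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, so this assumes one of those compilers on x86.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference: applies the load address rule once per lane. */
void gather8_scalar(float out[8], const float *base,
                    const int32_t index[8], int scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int i = 0; i < 8; ++i) {
        /* Byte addressing: index * scale is a byte offset from base. */
        const char *load_addr = (const char *)base + (int64_t)index[i] * scale;
        /* memcpy keeps the load valid for unaligned addresses. */
        memcpy(&out[i], load_addr, sizeof(float));
    }
}

/* AVX2 version with scale fixed at 4 (sizeof(float)), so the indices are
 * plain element indices. The scale must be a compile-time constant. */
__attribute__((target("avx2")))
void gather8_avx2(float out[8], const float *base, const int32_t index[8])
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)index);
    __m256 v = _mm256_i32gather_ps(base, vindex, 4);
    _mm256_storeu_ps(out, v);
}

/* Uses the AVX2 path only when the CPU reports support for it. */
void gather8(float out[8], const float *base, const int32_t index[8])
{
    if (__builtin_cpu_supports("avx2"))
        gather8_avx2(out, base, index);
    else
        gather8_scalar(out, base, index, 4);
}
```

With `scale = 4` the index vector holds element indices into a `float` array. With `scale = 1` the same indices would be interpreted as raw byte offsets, which is why `gather8_scalar(out, base, index, 1)` with indices `0, 4, 8, ...` reads consecutive floats.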
Written on June 28, 2021, Last update on November 9, 2022
avx
array
lookup