Indirect access Vectorization
AVX2 / AVX512 have vpgatherdd which does use a vector of signed 32-bit scaled indices. - SO
- Fast 2D array lookup of int16_t LUT using AVX2 or AVX512
- AVX2 has gathered loads for 32 bit and 64 bit ints (vpgatherXX) as well as floats and doubles.
- How are the gather instructions in AVX2 implemented? - From the table it's clear that in all cases gather loads are faster than scalar loads
- Gathering half-float values using AVX
Investigating the AVX2 memory gather instruction
For very small array sizes the vector version is a bit faster, but not significantly. Beyond a size of about 512K, the scalar version becomes faster, and it stays faster most of the time.
Load address calculation when using AVX2 gather instructions
Gather instructions use byte addressing and do not have any alignment requirements.
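As a rough illustration of what byte addressing means here, the sketch below models a single 32-bit gather lane in plain C. Each lane reads from base + index * scale, where the index is a signed 32-bit value and the scale is 1, 2, 4 or 8. The name `gather_lane_model` is made up for this note; it is a scalar model of the address calculation, not the instruction itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one 32-bit gather lane.
 * The effective address is base + (sign-extended index) * scale, in bytes. */
int32_t gather_lane_model(const void *base, int32_t index, int scale)
{
    /* The hardware only encodes these four scale factors. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    /* Widen the index before multiplying so negative indices work. */
    ptrdiff_t byte_offset = (ptrdiff_t)index * scale;
    const unsigned char *addr = (const unsigned char *)base + byte_offset;

    /* memcpy mirrors the "no alignment requirement" property:
     * the 4 bytes may start at any byte address. */
    int32_t value;
    memcpy(&value, addr, sizeof value);
    return value;
}
```

With an `int32_t` table, a scale of 4 turns the index into an ordinary element index, while a scale of 1 makes the index a raw byte offset. A scale of 2 over an `int16_t` table still loads 4 bytes per lane, so the wanted 16-bit value has to be masked out afterwards.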
_mm256_i32gather_ps
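A minimal sketch of how `_mm256_i32gather_ps` might be used for an 8-wide table lookup, assuming GCC or Clang on x86. The wrapper names (`gather8`, `gather8_avx2`, `gather8_scalar`) are invented for this note, and the runtime check with a scalar fallback is only there so the example also runs on machines without AVX2.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86 1
#else
#define HAVE_X86 0
#endif

/* Scalar reference: out[i] = table[idx[i]] for 8 lanes. */
void gather8_scalar(const float *table, const int32_t idx[8], float out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = table[idx[i]];
}

#if HAVE_X86
/* AVX2 version: the target attribute enables AVX2 for this function only,
 * so the file does not need to be compiled with -mavx2. */
__attribute__((target("avx2")))
void gather8_avx2(const float *table, const int32_t idx[8], float out[8])
{
    /* Load 8 signed 32-bit indices (unaligned load). */
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 4 = sizeof(float): indices count elements, not bytes. */
    __m256 v = _mm256_i32gather_ps(table, vidx, 4);
    _mm256_storeu_ps(out, v);
}
#endif

/* Dispatch: use the gather instruction when the CPU supports AVX2. */
void gather8(const float *table, const int32_t idx[8], float out[8])
{
    assert(table && idx && out);
#if HAVE_X86
    if (__builtin_cpu_supports("avx2")) {
        gather8_avx2(table, idx, out);
        return;
    }
#endif
    gather8_scalar(table, idx, out);
}
```

The scale argument must be a compile-time constant. Whether the AVX2 path actually beats the scalar loop depends on the CPU and on the array size, as the benchmarks linked above suggest.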
Written on June 28, 2021, Last update on November 9, 2022