Single Instruction Multiple Data Vectorization (SIMD/AVX)
SSE and SSE2 are available in every x86-family CPU with 64-bit support. Here is a list of tricks to get around some of the more common, eh, "idiosyncrasies" of SSE and its descendants:
- SSE: mind the gap!
Intel Instruction Set
x86/x64 SIMD Instruction List (SSE to AVX512)
What is the difference between AVX, AVX2 and AVX-512?
AVX(1) supports only floating-point operations at 256 bits; AVX2 adds 256-bit integer operations.
- Which is the reason for avx floating point bitwise logical operations?
- There are no scatter or gather instructions in the original AVX instruction set.
AVX2 is a 256-bit vector instruction set. You have 256-bit registers, which can be interpreted several ways (8 floats, 4 doubles, 32 bytes, etc.).
- AVX2 adds gather, but not scatter, instructions.
- AVX2 16-bit integer operations
- vector shift
AVX-512 comes in many different flavors. AVX-512 introduces masking, so you can blend more cheaply as part of another operation.
- AVX512PF additionally provides prefetch variants of the gather and scatter instructions.
AVX10
- AVX10/128 is a silly idea / HN
GCC compiler intrinsics
Always use #include <immintrin.h>
x86 intrinsics follow the naming convention _mm<width>_<opname>_<suffix>, e.g. _mm256_add_ps.
Do not put vector types inside a union: this impacts performance.
Beware that by default:
__m256
is treated as 8 x float by code and debugger,
__m256i
is treated as 4 x 64-bit integers.
Checking AVX availability with GCC
The intrinsics _mm256_castps_si256 / _mm256_castsi256_ps exist only to make the compiler happy: "This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency."
Autovectorization
- Auto-vectorization with gcc 4.7
- Writing Autovectorizable Code
- Practical vectorization (CERN)
- Auto-Vectorization in LLVM
Codingame
Vectorizing indirect access through AVX instructions
Example
- Crunching Numbers with AVX and AVX2
- Fastest way to get IPv4 address from string
- How to implement atoi using SIMD?
- Transpose an 8x8 float using AVX/AVX2