Single Instruction Multiple Data Vectorization (SIMD/AVX)

"SSE and SSE2 are available in every single x86-family CPU with 64-bit support… here's a list of tricks to get you around some of the more common, eh, 'idiosyncrasies' of SSE and its descendants." — from *SSE: mind the gap!*

# Intel Instruction Set

# x86/x64 SIMD Instruction List (SSE to AVX512)

# What is the difference between AVX, AVX2 and AVX-512?

AVX(1) supports only floating-point operations on its 256-bit registers; AVX2 adds 256-bit integer operations. AVX-512 widens the registers to 512 bits and introduces opmask registers for per-lane predication.


# GCC compiler intrinsics

Always use `#include <immintrin.h>`; it pulls in the intrinsics for the whole SSE/AVX family.

x86 intrinsics follow the naming convention `_mm[width]_[opname]_[suffix]`.

Do not wrap SIMD types in a `union` to access individual lanes; this impacts performance.

Beware that by default:
- `__m256` is treated as 8 floats by code and debugger;
- `__m256i` is treated as 4 × 64-bit integers.

# Checking AVX availability with GCC

```cpp
#include <iostream>

#define CHECK(target) \
    if (__builtin_cpu_supports(target)) { std::cerr << target << " supported\n"; }

int main() {
    CHECK("avx");
    CHECK("avx2");
    CHECK("avx512vl");
}
```

# Casting

The intrinsics `_mm256_castps_si256`/`_mm256_castsi256_ps` exist only to make the compiler happy: "This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency."

# Autovectorization
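As a starting point, a loop shaped like the one below is the kind GCC typically auto-vectorizes on its own; the flag names in the comment are standard GCC options, the function itself is a made-up example:

```cpp
// GCC auto-vectorizes this at -O3 (or -O2 -ftree-vectorize); add -mavx2 to
// use 256-bit registers and -fopt-info-vec to see the vectorizer's report.
// __restrict promises no aliasing, a common blocker for vectorization.
void scale_add(float* __restrict out, const float* __restrict a,
               const float* __restrict b, float k, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```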

# Codingame

# Vectorizing indirect access through avx instructions

# Example

# Limitation

Written on December 27, 2017, Last update on February 16, 2026
avx 16bits c++ shader intel