Mask & Bitwise operation
There is no single instruction in AVX2 or earlier [to turn a bitmap back into a vector mask]. (AVX512 can use masks in bitmap form directly, and has an instruction to expand masks to vectors.) - Peter Cordes (SO)
Load / Get
A SIMD comparison sets each lane to all 1s for true, which happens to be a NaN when read as a float, and to all 0s for false, which happens to be 0.0. Typically you use the result as a bitwise mask, so the float value isn't really meaningful.
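The bit patterns behind that remark can be checked without any SIMD at all. The following is a minimal scalar sketch; `bits_to_float`, `kTrueLane` and `kFalseLane` are names made up for this illustration, not part of any library.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float (memcpy keeps this well defined in C++).
inline float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Bit pattern of one "true" lane and one "false" lane of a comparison result.
constexpr std::uint32_t kTrueLane  = 0xFFFFFFFFu;  // all 1s
constexpr std::uint32_t kFalseLane = 0x00000000u;  // all 0s
```

Read as floats, `kTrueLane` is a NaN (exponent all 1s, non-zero significand) and `kFalseLane` is +0.0, which is why only the bits of a mask should be relied on.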
Load integer value
AVX - bit-cast the integer to a float (through a union, or memcpy in C++) and load it as a float value.
_mm256_set1_epi32 - Initializes every 32-bit element of a 256-bit vector with one scalar integer value. No corresponding Intel® AVX instruction.
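Both ways of getting an integer bit pattern into every lane can be sketched as follows. This is an illustration under assumptions: the function names and the `SKETCH_TARGET_AVX` macro are hypothetical, the macro only lets GCC/Clang compile the functions without `-mavx`, and an AVX-capable CPU is still needed at run time.

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>
#include <cstring>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_AVX __attribute__((target("avx")))
#else
#define SKETCH_TARGET_AVX
#endif

// Bit-cast to float, then broadcast that float from memory to all 8 lanes.
SKETCH_TARGET_AVX
inline __m256 broadcast_bits_ps(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);  // bit cast without union type punning
    return _mm256_broadcast_ss(&f);
}

// Same lanes through the integer set intrinsic plus a cast (no instruction for the cast).
SKETCH_TARGET_AVX
inline __m256 broadcast_bits_epi32(std::uint32_t bits) {
    return _mm256_castsi256_ps(_mm256_set1_epi32(static_cast<int>(bits)));
}

// Inspection helper: copy the 8 lanes out as raw 32-bit patterns.
SKETCH_TARGET_AVX
inline std::array<std::uint32_t, 8> lanes_of(std::uint32_t bits, bool via_float) {
    const __m256 v = via_float ? broadcast_bits_ps(bits) : broadcast_bits_epi32(bits);
    float tmp[8];
    _mm256_storeu_ps(tmp, v);
    std::array<std::uint32_t, 8> out;
    std::memcpy(out.data(), tmp, sizeof tmp);
    return out;
}
```

Which variant the compiler turns into better code depends on whether the value starts out in memory or in a register, so it is worth checking the generated assembly.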
Mask
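_mm256_movemask_ps (__m256 -> int)
// v8f is assumed to be a wrapper type holding a __m256 member named v
int as_int(const v8f& f) {
    return _mm256_movemask_ps(f.v);  // sign bit of element i -> bit i of the result
}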
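As a self-contained illustration of the vector-to-bitmap direction, the sketch below compares eight floats against zero and collects the result with `_mm256_movemask_ps`. The function name and the `SKETCH_TARGET_AVX` macro are hypothetical; the macro is only there so GCC/Clang accept the intrinsics without `-mavx`.

```cpp
#include <immintrin.h>
#include <array>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_AVX __attribute__((target("avx")))
#else
#define SKETCH_TARGET_AVX
#endif

// Returns a bitmap in which bit i is set when a[i] > 0.0f.
SKETCH_TARGET_AVX
inline int positive_lanes_bitmap(const std::array<float, 8>& a) {
    const __m256 v  = _mm256_loadu_ps(a.data());
    const __m256 gt = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_GT_OQ);  // all 1s or all 0s per lane
    return _mm256_movemask_ps(gt);  // sign bit of lane i -> bit i
}
```

With alternating positive and negative inputs starting with a positive one, bits 0, 2, 4 and 6 are set, so the result is 0x55.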
reversing _mm256_movemask_ps
AVX Solution
// AVX2 can be significantly more efficient, doing this with integer SIMD
// Especially for the case where the bitmap is in an integer register, not memory
// It's fine if `bitmap` contains high garbage; make sure your C compiler broadcasts from a dword in memory if possible instead of integer load with zero extension.
// e.g. __m256 _mm256_broadcast_ss(float *a); or memcpy to unsigned.
// Store/reload is not a bad strategy vs. movd + 2 shuffles so maybe just do it even if the value might be in a register; it will force some compilers to store/broadcast-load. But it might not be type-punning safe even though it's an intrinsic.
// Low bit -> element 0, etc.
__m256 inverse_movemask_ps_avx1(unsigned bitmap)
{
    // if you know DAZ is off: don't OR, just AND/CMPEQ with subnormal bit patterns
    // FTZ is irrelevant, we only use bitwise booleans and CMPPS
    const __m256 exponent = _mm256_set1_ps(1.0f);   // set1_epi32(0x3f800000)
    const __m256 bit_select = _mm256_castsi256_ps(
        _mm256_set_epi32(  // exponent + low significand bits
            0x3f800000 + (1<<7), 0x3f800000 + (1<<6),
            0x3f800000 + (1<<5), 0x3f800000 + (1<<4),
            0x3f800000 + (1<<3), 0x3f800000 + (1<<2),
            0x3f800000 + (1<<1), 0x3f800000 + (1<<0)
        ));

    // bitmap |= 0x3f800000;  // more efficient to do this scalar, but only if the data was in a register to start with
    __m256 bcast = _mm256_castsi256_ps(_mm256_set1_epi32(bitmap));
    __m256 ored = _mm256_or_ps(bcast, exponent);
    __m256 isolated = _mm256_and_ps(ored, bit_select);
    return _mm256_cmp_ps(isolated, bit_select, _CMP_EQ_OQ);
}
AVX2 Solution ?
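int mask = _mm256_movemask_epi8(s1);  // s1 is a __m256i; one bit per byte, 32 bits in total
// vs. the reverse direction, bitmap -> vector (declaration only)
__m256i get_mask2(const uint32_t mask);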
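For the eight-lane case that `_mm256_movemask_ps` produces, an integer-SIMD inverse could look like the sketch below: broadcast the bitmap, isolate one bit per lane, and compare. This is not the body of `get_mask2`, which the notes leave undefined; the function names and the `SKETCH_TARGET_AVX2` macro are hypothetical, and the macro only lets GCC/Clang compile without `-mavx2`.

```cpp
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_AVX2 __attribute__((target("avx2")))
#else
#define SKETCH_TARGET_AVX2
#endif

// Bit i of bitmap -> all 1s (set) or all 0s (clear) in 32-bit lane i.
// Bits above bit 7 are ignored, so high garbage in bitmap is harmless.
SKETCH_TARGET_AVX2
inline __m256i inverse_movemask_epi32_avx2(unsigned bitmap) {
    const __m256i bit_select = _mm256_setr_epi32(1 << 0, 1 << 1, 1 << 2, 1 << 3,
                                                 1 << 4, 1 << 5, 1 << 6, 1 << 7);
    const __m256i bcast    = _mm256_set1_epi32(static_cast<int>(bitmap));
    const __m256i isolated = _mm256_and_si256(bcast, bit_select);  // keep lane i's own bit
    return _mm256_cmpeq_epi32(isolated, bit_select);               // bit set -> all 1s
}

// Round trip: bitmap -> vector mask -> bitmap, for checking the sketch.
SKETCH_TARGET_AVX2
inline unsigned round_trip(unsigned bitmap) {
    const __m256 mask = _mm256_castsi256_ps(inverse_movemask_epi32_avx2(bitmap));
    return static_cast<unsigned>(_mm256_movemask_ps(mask));
}
```

Because the compare is an integer compare, no exponent trick is needed to avoid subnormal inputs, which is what makes the AVX2 form shorter than the AVX1 one.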
Bitwise operation
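A typical use of a comparison mask is a per-lane select built from AND, ANDNOT and OR. The sketch below uses it to take the lane-wise maximum of two arrays; `select_ps`, `max_lanes` and the `SKETCH_TARGET_AVX` macro are hypothetical names for this illustration, and `_mm256_blendv_ps` or `_mm256_max_ps` would normally be the more direct choice.

```cpp
#include <immintrin.h>
#include <array>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_AVX __attribute__((target("avx")))
#else
#define SKETCH_TARGET_AVX
#endif

// Per lane: mask ? a : b. The mask lanes must be all 1s or all 0s.
SKETCH_TARGET_AVX
inline __m256 select_ps(__m256 mask, __m256 a, __m256 b) {
    // _mm256_andnot_ps(mask, b) computes (~mask) & b
    return _mm256_or_ps(_mm256_and_ps(mask, a), _mm256_andnot_ps(mask, b));
}

// Lane-wise maximum, written with an explicit compare and select.
SKETCH_TARGET_AVX
inline std::array<float, 8> max_lanes(const std::array<float, 8>& x,
                                      const std::array<float, 8>& y) {
    const __m256 a    = _mm256_loadu_ps(x.data());
    const __m256 b    = _mm256_loadu_ps(y.data());
    const __m256 mask = _mm256_cmp_ps(a, b, _CMP_GT_OQ);  // all 1s where a > b
    std::array<float, 8> out;
    _mm256_storeu_ps(out.data(), select_ps(mask, a, b));
    return out;
}
```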
see also
- Fastest way to unpack 32 bits to a 32 byte SIMD vector
- Why AVX2 and SSE2 bitwise OR operators are not faster than a simple | operator? (hint: the compiler is already generating vectorized code)
- Haswell AVX2 for Simple Integer Bitwise Operations
Written on November 5, 2022, Last update on December 10, 2022
c++
avx
mask