Popcount / Hamming weight (# bit set)
Population count is a procedure of counting number of ones in a bit string. - The NSA Instruction / HN
see also
- Population Count
- sideways addition
- Generating Binary Permutations in Popcount Order
- Bit Hacks
- Revisiting POPCOUNT Operations in CPUs/GPUs
std::popcount (C++20)
Using the built-in functions of your compilers.
For GCC:
SSSE3: fast popcount
SSSE3 has powerful instruction PSHUFB. This instruction can be used to perform a parallel 16-way lookup; LUT has 16 entries and is stored in an XMM register, indexes are 4 lower bits of each byte stored in another XMM register.
Depending on architecture i5 (Westmere) vector can be slower or much faster Core i7 (Haswell) Core i7 (Skylake) for a sufficiently long bitstring (> 128 bytes).
Hamming Weight
GCC and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds.
The best method for counting bits in a 32-bit integer v is the following:
on Intel® Core™ i7: 16 cycles per number, constant time, data independent
Advantage: no branches, no memory lookups, constant runtime - independent from population
Drawback: dependency chain, not much parallel speedup
Notes
The whole algorithm modifies the input in order to generate the output, that means it works in-place.
First, the code counts the bits of two adjacent bits:
a = v - ((v >> 1) & 0x5555...)
0b and 0b → 00b
0b and 1b → 01b
1b and 0b → 01b
1b and 1b → 10b
Whenever the higher bit of each 2-bit group is set, subtracting 01b gives the desired outcome. Looks like branching … but as it turns out, the subtraction can be done always: just subtract the higher bit ! If it is 0, the result remains unchanged, if it is 1, then we get the right numbers, too.
Now the 2-bit count is done. As you can see, there are just three possible decimal results: 0, 1 or 2.
Then, two adjacent 2-bit groups are joined to 4-bit groups:
the 2-bit groups are masked and shifted to match and then simply added. No overflow is possible.
b = (a & 0x3333..) + ((a >> 2) & 0x3333..)
00b and 00b → 0000b
00b and 01b → 0001b
00b and 10b → 0010b
01b and 00b → 0001b
01b and 01b → 0010b
01b and 10b → 0011b
10b and 00b → 0010b
10b and 01b → 0011b
10b and 10b → 0100b
Again The same procedure is done for all 4-bit groups yielding the bit counts for each of the four bytes in their lower four bits.
That means, each byte contains its bit count, however, the upper four bits may contain junk and are masked out.
c = (b + (b >> 4)) & 0x0f0f...
Last, sum all bytes.
Multiplying by 0x01010101 has an interesting property if we name the four bytes A, B, C, D:
A, B, C, D → A+B+C+D, B+C+D, C+D, D
Obviously the highest byte is what we are looking for. The right shift returns just it.
see also