Generalized 'parallel' popcnt

For a recent project I was looking to use Bloom filters to verify a efficient serialized structures (receiver can verify if they have the correct arbitrary deserialization).

Now popcnt is a quite common function; it's available on many architectures as a single instruction. It counts the amount of hot-bits in some register/variable (also known as the Hamming Weight). It is a well researched topic. However in certain cases it's not possible to use the fast-path due to the target platform/language. With the introduction of tc39-bigint, it's possible to use arbitrary precision integers as conveniently as using regular integers, unfortunately no popcnt functionality is provided.

Update(2023): there is a proposal to add it but it hasn't seen much progress https://github.com/tc39/proposal-bigint-math#vision

There are plenty of ways to calculate popcnt, however not all methods are created equal. In this article I will explain how to do 'parallel' popcnt in arbitrary precision integers.

The 'parallel' implementation has the benefits of being O(log2(k)), rather than O(n). In theory it's also constant-time but in practice this is not the case for bigint.

You can check the relevant source code on github

Research

At first I used the basic popcnt algorithm:

popcnt(v: bigint) {
    let c = 0n
    for (; v; c++) {
        // clear the least significant bit set
        v &= v - 1n;
    }
    return c
}

And this worked fine, but because I was already writing my own implementation, I thought I might as well optimize it.

First looking if anyone had found a good algorithm for arbitrary size integers, I found that most implementations unrolled their data into 64-bits.

However then I found this:

Fast way of counting non-zero bits in positive integer
I need a fast way to count the number of bits in an integer in python.My current solution is bin(n).count(“1”)but I am wondering if there is any faster way of doing this? PS: (i am representin...

An answer which highlighted the naive implementation of the parallel popcnt.

Here is a simplified implementation:

 const masks = [
     0x5555555555555555,
     0x3333333333333333,
     0x0F0F0F0F0F0F0F0F,
     0x00FF00FF00FF00FF,
     0x0000FFFF0000FFFF,
     0x00000000FFFFFFFF
 ]
 popcnt(n: u64){
  for (let i = 0; i < masks.length; i++){
  	n = (n & masks[i]) + (n ^ ~masks[i] >> (1 << i))
  }
  return n
 } 


Suprisingly everywhere I looked (1, 2, 3) it only was made for 64-bit integers using the same constants. Even though the principle should work for any power-of-2 bits.

I would go into how this works internally & construct it from there, but after reading many articles on the topic I didn't gain much of an expressible intuition. So instead I reverse-engineered the pattern.

Algorithm

Constraints

We start by asserting our constraints, the parallel algorithm works for any power of 2.

// Convert to power of 2 (e.g.) 118 -> 128
let approx_bits = bits
bits = 1n;
while(approx_bits < bits) bits <<= 1n;

Creating masks

The algorithm relies on the different 'masks'. Each mask is used with its complement, to count bits. These masks can be reused for any size smaller or equal than the constant (also can be made smaller/faster by truncating it, although we won't be using that).

uint64_t m1  = 0b0101010101010101..
uint64_t m2  = 0b0011001100110011..
uint64_t m4  = 0b0000111100001111..
uint64_t m8  = 0b0000000011111111...
The pattern seen in masks at the binary level

The masks follow a relatively simple pattern, however this is hard to generate efficiently in binary form.

Extending these constants to bigger integers requires adding more masks, and increase the size of the existing masks; you need ceil(log2(n)) masks per n-bit integer.

Generating masks in reverse

We can generate the constants by using the patterns observed; which is

$$m_{n} = (m_{n+1} \ll \frac{b_{n+1}}{2}) \oplus m_{n+1}$$

Where b is the minimum repeating pattern size of m (e.g. b = 2 -> m = 0b01010101), starting with m as 'all-hot'.

However since each mask is dependent on the one after it we need to generate the masks in reverse. This means we can't create an implementation where the masks are dynamically created as popcnt runs, but what can you do.

With that our final implementation becomes:

let mask = (1n << bits) - 1n
// prevent overflow because of the arbitrary precision int
const fmask = mask 
let masks: bigint[] = []
while (bits > 1) {
    bits >>= 1n
    mask = ((mask << bits) ^ mask) & fmask
    masks.push(mask)
}
masks = masks.reverse()

Generating masks in execution order

After creating the reversed algorithm I observed another pattern.
For any n-bit number the following pattern can be generated:

$$ m^{n} = (1 \ll n) - 1$$

You can also see this at the beginning of the reversed algorithm. The problem with this is that it generates a number only n-bit long, so if we need to mask for an higher bit number it wouldn't cover the entire input. However we can cover for this by just repeating the pattern as many times as we have masks.

    let masks: bigint[] = []
    for (let n = 1n; n < bits;){
        masks = masks.map(x => x | (x << n))
        masks.push((1n << n) - 1n)
        n <<= 1n;
    }
    return masks

Sadly this algorithm still requires us keep mask-generation and popcnt, but it has the benefit of being able to be expand/contract more easily. Just running another iteration of the loop and now you can handle double the amount of bits; or when contracting AND all masks with (1<<bits)-1 and now you have faster masks (but less bits).

We don't really need it as bloom-filters usually have a constant bit-size, but it's nice to have the option.

Calculate popcnt

For the final implementation I went with the original algorithm as it's a bit faster, and it ends up quite simple:

popcnt(v: bigint) { 
    let bits = 1n
    for (const mask of masks) {
        v = (v & mask) + ((v & ~mask) >> bits)
        bits <<= 1n
    }
    return v
}

Benchmarks

Here is the benchmarks for the original non-parallel popcnt vs our new generalized popcnt.

In our benchmark we run each algorithm 100000 times. fastpopcnt does include mask generation, which is only run once. We use a 8192-bit number with 1024-bits being hot, this ratio is realistic for a partially filled bloom filter.

popcnt:            1024, samples: 100000, 16488ms, 6.06501698204755p/ms
fastpopcnt(8192b): 1024, samples: 100000, 947ms,   105.59662090813094p/ms

As we can see our new algorithm is almost 21x faster than the original.

Looking at the Big-O for each of the algorithms this makes sense, popcnt has O(n), while fastpopcnt has O(log2(n)). The difference is that popcnt scales by the hot-bits, while fastpopcnt scales by the total bits.

Going to the extreme with this principle we devise a worst-case and best-case scenario: all bits vs single bit set in a large integer. Here are the results:

popcnt:            8192, samples: 100000, 585785ms, 0.17071109p/ms
fastpopcnt(8192b): 8192, samples: 100000, 2281ms,   43.8404208p/ms
Best-case: 256.8x faster

popcnt:            1,    samples: 100000, 23ms,     4347.82608p/ms
fastpopcnt(8192b): 1,    samples: 100000, 668ms,    149.700598p/ms
Worst-case: 29x slower

While in a worst case it is 29x slower, this is a decent trade-off for our application as the amount of hot-bits is often quite high.


Interested?

If you're interested or want more information, feel free to contact us! We are more than happy to help.