Work-efficient Prefix Sums

The prefix sums p of an array a are defined as:

p[0] = a[0]
p[1] = a[0] + a[1]
p[2] = a[0] + a[1] + a[2]
...
p[n-1] = a[0] + a[1] + ... + a[n-1]

But if you actually want to compute the prefix sums, you wouldn’t use that formula. Instead, you’d compress all the repetitive sums and compute p with only n-1 total additions:

p[0] = a[0]
p[1] = p[0] + a[1]
p[2] = p[1] + a[2]
...
p[n-1] = p[n-2] + a[n-1]

If you want to compute prefix sums in parallel instead, it’s easy to come up with a scheme that can compute all the p’s in O(log n) time, but it’s harder to come up with a scheme that’s work-efficient, meaning it does O(n) total additions – the same amount of work as the sequential sum above.

The Brent-Kung adder uses work-efficient parallel prefix sums to propagate carries. The adder doesn’t actually compute sums, but instead accumulates a special operator o across an entire array. Since o is associative, the same ideas for prefix sums apply, so I’ll just talk about sums in the rest of this post.

In the original Brent-Kung paper¹ they show a variant of this network to illustrate the work-efficient prefix sum computation on 16 elements:

A work-efficient prefix sum network for n=16

The computation flows from top to bottom, with inputs fed into the wires at the top. Whenever there’s a small circle on wire i leading to a large +-circle on wire j, you add the current contents of wire i to wire j and store the result in wire j. At the bottom of the diagram, each wire i holds p[i], the prefix sum of all input elements up to and including wire i.

Brent and Kung cite a paper by Ladner and Fischer² for the network construction. The Wikipedia page for prefix sums also has a diagram like the one above and cites Ladner and Fischer as well as two other papers in Russian from around the same time. The Ladner-Fischer paper provides a recursive construction of these networks in the form of network diagrams, but no pseudocode to generate them and no explicit proof that they’re correct. I’m sure it’s easy to prove by induction, but the whole construction was pretty mysterious to me at first. The Wikipedia entry has a high-level description of the networks, but did not leave me feeling like I was ready to write code to generate them:

3. Express each term of the final sequence y0, y1, y2, … as the sum of up to two terms of these intermediate sequences: y0 = x0, y1 = z0, y2 = z0 + x2, y3 = w1, etc. After the first value, each successive number yi is either copied from a position half as far through the w sequence, or is the previous value added to one value in the x sequence.

I wanted to find code to generate these networks because I’m using the Brent-Kung adder in cncf, the CNF compiler I’m working on. The Brent-Kung adder is simple and achieves logarithmic delay because of the work-efficient parallel prefix sum scheme, which should make it nice for encoding SAT problems that involve arithmetic. In circuits, delay means time, but in SAT encodings, delay means inference complexity to a solver. All else being equal, doing unit propagation or resolution over a long chain of clauses should be more expensive for a solver than simpler encodings, although it doesn’t always work that way. But in general, achieving simple, low-depth encodings is a good overall goal for generating CNF encodings.

The Ladner-Fischer Networks

Ladner and Fischer provide a mutually recursive definition of their prefix sum network. I translated it to Python as:

# Generates the Ladner-Fischer network P_k on wires xs with parameter k.
# To generate a prefix sum network for n elements, call
# ladner_fischer_network(list(range(n))).
def ladner_fischer_network(xs, k=None):
    n = len(xs)
    if n == 1: return
    if k is None: k = n
    if k == 0:
        half = math.ceil(n/2)
        yield from ladner_fischer_network(xs[half:], k=1)
        yield from ladner_fischer_network(xs[:half], k=0)
        yield from ((xs[half-1], xs[elem]) for elem in xs[:half])
    else:
        yield from zip(xs[::2], xs[1::2])
        new_xs = xs[1::2]
        if len(xs) % 2 == 1:
            new_xs.append(xs[-1])
        yield from ladner_fischer_network(new_xs, k=k-1)
        yield from zip(xs[1:-1:2], xs[2:-1:2])

The function above just generates the wire indices for the network to compare, from left-to-right and top-to-bottom. So:

>>> list(ladner_fischer_network(list(range(7))))
[(0, 1), (2, 3), (4, 5), (1, 3), (5, 6), (3, 6), (3, 5), (1, 2), (3, 4)]

corresponds to the network diagram:

A work-efficient prefix sum network for n=7

A Menagerie of Networks

Now that we can generate these, you can sit back and stare at the patterns in a few larger ones for a minute:

A work-efficient prefix sum network for n=33

A work-efficient prefix sum network for n=90

A work-efficient prefix sum network for n=150

We know how to generate the prefix sum networks now, but you probably can’t yet draw one on a blank sheet of paper from first principles or explain why they’re correct.

Simpler?

I came up with a different algorithm for generating these networks that I think is a little clearer, and makes it easier to see that the construction is correct. Like the previous Python code for the recursive generation of Ladner-Fischer networks, my algorithm takes a list of all wires you want to generate the network on (so you’d pass [0,1,2,3,4] if you wanted to generate a network on 5 elements). The entire algorithm is just:

Merge the elements of the list in pairs, keeping the latter element in each pair in a new merged array. Each merge of wires i and j generates a sum operation on i and j.
Repeat this merging process until the resulting merged array has only one element. Go back through all the arrays you’ve created (the original and any merged arrays) and mark the first element of each “finished”.
Go back through the merged arrays in the reverse order they were created. For any unfinished index you find in this reverse scan, generate a sum operation on that wire and the wire to the left of it in the merged array, storing the result in the unfinished index. Mark the unfinished index as finished whenever this happens.

For example, to generate the 13-element prefix sum network, we’d start with [0,1,2,3,4,5,6,7,8,9,10,11,12]. The merging steps generate the following sequences of arrays:

[0,1,2,3,4,5,6,7,8,9,10,11,12]
[1,3,5,7,9,11,12]
[3,7,11,12]
[7,12]
[12]

Notice that since we start with an odd number of elements, 12 has no buddy to merge with until the third. That’s fine, we just bring it along to the next step without any merge when it can’t pair up with a buddy. Now, if we’ve dilligently marked the first element in each array as “finished” (0,1,3,7, and 12), we end up in this state at the end (with finished elements starred):

[0*,1*,2,3*,4,5,6,7*,8,9,10,11,12*]
[1*,3*,5,7*,9,11,12*]
[3*,7*,11,12*]
[7*,12*]
[12*]

When we iterate through in reverse order, we combine 7 with 11 and mark 11 finished once we get to the third array, then combine 3 and 5 into 5 and 7 and 9 into 9 once we get back to the second array, and so on. All these combine operations taken together create the correct work-efficient prefix sum schedule:

A work-efficient prefix sum network for n=13

The for this iterative generation in Python is:

def generate_prefix_sum(n):
    reduced = [list(range(n))]
    while len(reduced[-1]) > 1:
        prev, current = reduced[-1], []
        for x,y in zip(prev[::2], prev[1::2]):
            yield (x,y)
            current.append(y)
        if len(prev) % 2 == 1:
            current.append(prev[-1])
        reduced.append(current)

    finished = set(r[0] for r in reduced)
    for result in reversed(reduced):
        for i, item in enumerate(result):
            if item not in finished:
                yield (result[i-1], item)
                finished.add(item)

and generate_prefix_sum generates exactly the same networks as the ladner_fischer_network function in the previous section. This iterative algorithm is something I can reproduce on a sheet of paper without any reference to recursive definitions and it’s also something I can convince myself is correct – you just need to reason about the contents of “finished” wires and the order in which they’re finished for a little bit.

In any case, now we have two ways to generate these Ladner-Fischer prefix sum networks.

Afterword

After all of this work, did the Ladner-Fischer network help cnfc generate better encodings? Let’s try proving a large number is prime. I’ve got an example in cnfc that produces formulas that are satisfiable exactly when the given input number is composite.

First, just generating prefix sums using a naive sequential accumulation:

~/cnfc$ poetry run python3 examples/prime/prime.py 100123456789 /tmp/cnf /tmp/extractor.py
~/cnfc$ grep "p " /tmp/cnf  # Number of variables and clauses in the encoding
p cnf 15211 48400
~/cnfc$ time kissat /tmp/cnf > /tmp/kissat-out  # Solve the formula

real    1m20.329s
user    1m19.655s
sys     0m0.597s

Now, again, with the Ladner-Fischer network dropped in instead:

~/cnfc$ poetry run python3 examples/prime/prime.py 100123456789 /tmp/cnf /tmp/extractor.py
~/cnfc$ grep "p " /tmp/cnf   # Number of variables and clauses in the encoding
p cnf 20362 63853
~/cnfc$ time kissat /tmp/cnf > /tmp/kissat-out  # Solve the formula

real    1m33.597s
user    1m32.979s
sys     0m0.612s

So we use slightly more variables and clauses when we drop in the new prefix sum network, which is expected. The encoding of the primality test is basically all adders (multiplication is implemented with the grade-school repeated addition algorithm) and the Ladner-Fischer network uses about twice as many operations as the linear accumulation of prefix sums. These encodings don’t have any standard preprocessing applied, but the numbers don’t change much if I do a little unit-propagation, subsumption, etc.

But we don’t get any kind of great speedup with the more complicated network, and in fact it made the solver work a little harder in all of the examples I tried, even with composite numbers instead of primes. So for now, I’m not replacing the simpler linear-delay prefix sum network with the lower-depth Ladner-Fischer prefix-sum network. But I’ll keep the code around in case these numbers change if/when I encode multiplication with something better than repeated addition.

Brent, Richard P. and H. T. Kung. “A Regular Layout for Parallel Adders.” IEEE Transactions on Computers C-31 (1982): 260-264. ↩
Richard E. Ladner and Michael J. Fischer. 1980. Parallel Prefix Computation. J. ACM 27, 4 (Oct. 1980), 831–838. https://doi.org/10.1145/322217.322232 ↩