Workefficient Prefix Sums
The prefix sums p of an array a are defined as:
p[0] = a[0]
p[1] = a[0] + a[1]
p[2] = a[0] + a[1] + a[2]
...
p[n1] = a[0] + a[1] + ... + a[n1]
But if you actually want to compute the prefix sums, you wouldn’t use that formula. Instead, you’d compress all the repetitive sums and compute p with only n1 total additions:
p[0] = a[0]
p[1] = p[0] + a[1]
p[2] = p[1] + a[2]
...
p[n1] = p[n2] + a[n1]
If you want to compute prefix sums in parallel instead, it’s easy to come up with a scheme that can compute all the p’s in O(log n) time, but it’s harder to come up with a scheme that’s workefficient, meaning it does O(n) total additions – the same amount of work as the sequential sum above.
The BrentKung adder uses workefficient parallel prefix sums to propagate carries. The adder doesn’t actually compute sums, but instead accumulates a special operator o across an entire array. Since o is associative, the same ideas for prefix sums apply, so I’ll just talk about sums in the rest of this post.
In the original BrentKung paper^{1} they show a variant of this network to illustrate the workefficient prefix sum computation on 16 elements:
The computation flows from top to bottom, with inputs fed into the wires at the top. Whenever there’s a small circle on wire i leading to a large +circle on wire j, you add the current contents of wire i to wire j and store the result in wire j. At the bottom of the diagram, each wire i holds p[i], the prefix sum of all input elements up to and including wire i.
Brent and Kung cite a paper by Ladner and Fischer^{2} for the network construction. The Wikipedia page for prefix sums also has a diagram like the one above and cites Ladner and Fischer as well as two other papers in Russian from around the same time. The LadnerFischer paper provides a recursive construction of these networks in the form of network diagrams, but no pseudocode to generate them and no explicit proof that they’re correct. I’m sure it’s easy to prove by induction, but the whole construction was pretty mysterious to me at first. The Wikipedia entry has a highlevel description of the networks, but did not leave me feeling like I was ready to write code to generate them:
3. Express each term of the final sequence y0, y1, y2, … as the sum of up to two terms of these intermediate sequences: y0 = x0, y1 = z0, y2 = z0 + x2, y3 = w1, etc. After the first value, each successive number yi is either copied from a position half as far through the w sequence, or is the previous value added to one value in the x sequence.
I wanted to find code to generate these networks because I’m using the BrentKung adder in cncf, the CNF compiler I’m working on. The BrentKung adder is simple and achieves logarithmic delay because of the workefficient parallel prefix sum scheme, which should make it nice for encoding SAT problems that involve arithmetic. In circuits, delay means time, but in SAT encodings, delay means inference complexity to a solver. All else being equal, doing unit propagation or resolution over a long chain of clauses should be more expensive for a solver than simpler encodings, although it doesn’t always work that way. But in general, achieving simple, lowdepth encodings is a good overall goal for generating CNF encodings.
The LadnerFischer Networks
Ladner and Fischer provide a mutually recursive definition of their prefix sum network. I translated it to Python as:
# Generates the LadnerFischer network P_k on wires xs with parameter k.
# To generate a prefix sum network for n elements, call
# ladner_fischer_network(list(range(n))).
def ladner_fischer_network(xs, k=None):
n = len(xs)
if n == 1: return
if k is None: k = n
if k == 0:
half = math.ceil(n/2)
yield from ladner_fischer_network(xs[half:], k=1)
yield from ladner_fischer_network(xs[:half], k=0)
yield from ((xs[half1], xs[elem]) for elem in xs[:half])
else:
yield from zip(xs[::2], xs[1::2])
new_xs = xs[1::2]
if len(xs) % 2 == 1:
new_xs.append(xs[1])
yield from ladner_fischer_network(new_xs, k=k1)
yield from zip(xs[1:1:2], xs[2:1:2])
The function above just generates the wire indices for the network to compare, from lefttoright and toptobottom. So:
>>> list(ladner_fischer_network(list(range(7))))
[(0, 1), (2, 3), (4, 5), (1, 3), (5, 6), (3, 6), (3, 5), (1, 2), (3, 4)]
corresponds to the network diagram:
A Menagerie of Networks
Now that we can generate these, you can sit back and stare at the patterns in a few larger ones for a minute:
We know how to generate the prefix sum networks now, but you probably can’t yet draw one on a blank sheet of paper from first principles or explain why they’re correct.
Simpler?
I came up with a different algorithm for generating these networks that I think is a little clearer, and makes it easier to see that the construction is correct. Like the previous Python code for the recursive generation of LadnerFischer networks, my algorithm takes a list of all wires you want to generate the network on (so you’d pass [0,1,2,3,4] if you wanted to generate a network on 5 elements). The entire algorithm is just:
 Merge the elements of the list in pairs, keeping the latter element in each pair in a new merged array. Each merge of wires i and j generates a sum operation on i and j.
 Repeat this merging process until the resulting merged array has only one element. Go back through all the arrays you’ve created (the original and any merged arrays) and mark the first element of each “finished”.
 Go back through the merged arrays in the reverse order they were created. For any unfinished index you find in this reverse scan, generate a sum operation on that wire and the wire to the left of it in the merged array, storing the result in the unfinished index. Mark the unfinished index as finished whenever this happens.
For example, to generate the 13element prefix sum network, we’d start with [0,1,2,3,4,5,6,7,8,9,10,11,12]. The merging steps generate the following sequences of arrays:
 [0,1,2,3,4,5,6,7,8,9,10,11,12]
 [1,3,5,7,9,11,12]
 [3,7,11,12]
 [7,12]
 [12]
Notice that since we start with an odd number of elements, 12 has no buddy to merge with until the third. That’s fine, we just bring it along to the next step without any merge when it can’t pair up with a buddy. Now, if we’ve dilligently marked the first element in each array as “finished” (0,1,3,7, and 12), we end up in this state at the end (with finished elements starred):
 [0*,1*,2,3*,4,5,6,7*,8,9,10,11,12*]
 [1*,3*,5,7*,9,11,12*]
 [3*,7*,11,12*]
 [7*,12*]
 [12*]
When we iterate through in reverse order, we combine 7 with 11 and mark 11 finished once we get to the third array, then combine 3 and 5 into 5 and 7 and 9 into 9 once we get back to the second array, and so on. All these combine operations taken together create the correct workefficient prefix sum schedule:
The for this iterative generation in Python is:
def generate_prefix_sum(n):
reduced = [list(range(n))]
while len(reduced[1]) > 1:
prev, current = reduced[1], []
for x,y in zip(prev[::2], prev[1::2]):
yield (x,y)
current.append(y)
if len(prev) % 2 == 1:
current.append(prev[1])
reduced.append(current)
finished = set(r[0] for r in reduced)
for result in reversed(reduced):
for i, item in enumerate(result):
if item not in finished:
yield (result[i1], item)
finished.add(item)
and generate_prefix_sum
generates exactly the same networks as the ladner_fischer_network
function in the previous section.
This iterative algorithm is something I can reproduce on a sheet of paper without any reference to recursive
definitions and it’s also something I can convince myself is correct – you just need to reason about
the contents of “finished” wires and the order in which they’re finished for a little bit.
In any case, now we have two ways to generate these LadnerFischer prefix sum networks.
Afterword
After all of this work, did the LadnerFischer network help cnfc generate better encodings? Let’s try proving a large number is prime. I’ve got an example in cnfc that produces formulas that are satisfiable exactly when the given input number is composite.
First, just generating prefix sums using a naive sequential accumulation:
~/cnfc$ poetry run python3 examples/prime/prime.py 100123456789 /tmp/cnf /tmp/extractor.py
~/cnfc$ grep "p " /tmp/cnf # Number of variables and clauses in the encoding
p cnf 15211 48400
~/cnfc$ time kissat /tmp/cnf > /tmp/kissatout # Solve the formula
real 1m20.329s
user 1m19.655s
sys 0m0.597s
Now, again, with the LadnerFischer network dropped in instead:
~/cnfc$ poetry run python3 examples/prime/prime.py 100123456789 /tmp/cnf /tmp/extractor.py
~/cnfc$ grep "p " /tmp/cnf # Number of variables and clauses in the encoding
p cnf 20362 63853
~/cnfc$ time kissat /tmp/cnf > /tmp/kissatout # Solve the formula
real 1m33.597s
user 1m32.979s
sys 0m0.612s
So we use slightly more variables and clauses when we drop in the new prefix sum network, which is expected. The encoding of the primality test is basically all adders (multiplication is implemented with the gradeschool repeated addition algorithm) and the LadnerFischer network uses about twice as many operations as the linear accumulation of prefix sums. These encodings don’t have any standard preprocessing applied, but the numbers don’t change much if I do a little unitpropagation, subsumption, etc.
But we don’t get any kind of great speedup with the more complicated network, and in fact it made the solver work a little harder in all of the examples I tried, even with composite numbers instead of primes. So for now, I’m not replacing the simpler lineardelay prefix sum network with the lowerdepth LadnerFischer prefixsum network. But I’ll keep the code around in case these numbers change if/when I encode multiplication with something better than repeated addition.

Brent, Richard P. and H. T. Kung. “A Regular Layout for Parallel Adders.” IEEE Transactions on Computers C31 (1982): 260264. ↩

Richard E. Ladner and Michael J. Fischer. 1980. Parallel Prefix Computation. J. ACM 27, 4 (Oct. 1980), 831–838. https://doi.org/10.1145/322217.322232 ↩