Aaron Windsor

Work-efficient Prefix Sums

Sun, 05 Nov 2023 17:58:00 +0000

The prefix sums p of an array a are defined as:

p[0] = a[0]
p[1] = a[0] + a[1]
p[2] = a[0] + a[1] + a[2]
...
p[n-1] = a[0] + a[1] + ... + a[n-1]

But if you actually want to compute the prefix sums, you wouldn’t use that formula. Instead, you’d compress all the repetitive sums and compute p with only n-1 total additions:

p[0] = a[0]
p[1] = p[0] + a[1]
p[2] = p[1] + a[2]
...
p[n-1] = p[n-2] + a[n-1]

If you want to compute prefix sums in parallel instead, it’s easy to come up with a scheme that can compute all the p’s in O(log n) time, but it’s harder to come up with a scheme that’s work-efficient, meaning it does O(n) total additions – the same amount of work as the sequential sum above.
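For comparison, here is a minimal sketch of one such easy O(log n)-time scheme that is not work-efficient. It is a Hillis-Steele-style scan, chosen here purely as an illustration; the function name and the addition counter are made up for this example, and the parallel rounds are simulated sequentially.

```python
def log_depth_prefix_sums(a):
    """Return (prefix sums of a, number of additions, number of rounds)."""
    p = list(a)
    n = len(p)
    additions = rounds = 0
    step = 1
    while step < n:
        # Every addition in a round reads only the previous round's values,
        # so all of them are independent and could run in parallel.
        prev = list(p)
        for i in range(step, n):
            p[i] = prev[i - step] + prev[i]
            additions += 1
        step *= 2
        rounds += 1
    return p, additions, rounds


p, additions, rounds = log_depth_prefix_sums([1] * 8)
print(p, additions, rounds)
```

On 8 inputs this takes 3 rounds but 17 additions, compared with 7 additions for the sequential version.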

The Brent-Kung adder uses work-efficient parallel prefix sums to propagate carries. The adder doesn’t actually compute sums, but instead accumulates a special operator o across an entire array. Since o is associative, the same ideas for prefix sums apply, so I’ll just talk about sums in the rest of this post.

In the original Brent-Kung paper¹ they show a variant of this network to illustrate the work-efficient prefix sum computation on 16 elements:

The computation flows from top to bottom, with inputs fed into the wires at the top. Whenever there’s a small circle on wire i leading to a large +-circle on wire j, you add the current contents of wire i to wire j and store the result in wire j. At the bottom of the diagram, each wire i holds p[i], the prefix sum of all input elements up to and including wire i.
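To make that reading of the diagram concrete, here is a small sketch that evaluates a network given as a list of (i, j) pairs, each meaning "add wire i into wire j". The function name and the 4-wire example network are hand-written for illustration and aren't taken from the paper.

```python
from itertools import accumulate


def apply_network(ops, values):
    """Run each (i, j) operation in order: wire j becomes wire i + wire j."""
    wires = list(values)
    for i, j in ops:
        wires[j] = wires[i] + wires[j]
    return wires


# A hand-written prefix sum network on 4 wires.
ops = [(0, 1), (2, 3), (1, 3), (1, 2)]
inputs = [3, 1, 4, 1]
print(apply_network(ops, inputs))
print(list(accumulate(inputs)))  # reference prefix sums for comparison
```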

Brent and Kung cite a paper by Ladner and Fischer² for the network construction. The Wikipedia page for prefix sums also has a diagram like the one above and cites Ladner and Fischer as well as two other papers in Russian from around the same time. The Ladner-Fischer paper provides a recursive construction of these networks in the form of network diagrams, but no pseudocode to generate them and no explicit proof that they’re correct. I’m sure it’s easy to prove by induction, but the whole construction was pretty mysterious to me at first. The Wikipedia entry has a high-level description of the networks, but did not leave me feeling like I was ready to write code to generate them:

3. Express each term of the final sequence y0, y1, y2, … as the sum of up to two terms of these intermediate sequences: y0 = x0, y1 = z0, y2 = z0 + x2, y3 = w1, etc. After the first value, each successive number yi is either copied from a position half as far through the w sequence, or is the previous value added to one value in the x sequence.

I wanted to find code to generate these networks because I’m using the Brent-Kung adder in cnfc, the CNF compiler I’m working on. The Brent-Kung adder is simple and achieves logarithmic delay because of the work-efficient parallel prefix sum scheme, which should make it nice for encoding SAT problems that involve arithmetic. In circuits, delay means time, but in SAT encodings, delay means inference complexity to a solver. All else being equal, doing unit propagation or resolution over a long chain of clauses should be more expensive for a solver than simpler encodings, although it doesn’t always work that way. But in general, achieving simple, low-depth encodings is a good overall goal for generating CNF encodings.

The Ladner-Fischer Networks

Ladner and Fischer provide a mutually recursive definition of their prefix sum network. I translated it to Python as:

import math

# Generates the Ladner-Fischer network P_k on wires xs with parameter k.
# Yields (i, j) pairs meaning "add wire i into wire j".
# To generate a prefix sum network for n elements, call
# ladner_fischer_network(list(range(n))).
def ladner_fischer_network(xs, k=None):
    n = len(xs)
    if n == 1: return
    if k is None: k = n
    if k == 0:
        # P_0: solve both halves, then add the last wire of the first
        # half into every wire of the second half.
        half = math.ceil(n/2)
        yield from ladner_fischer_network(xs[:half], k=1)
        yield from ladner_fischer_network(xs[half:], k=0)
        yield from ((xs[half-1], elem) for elem in xs[half:])
    else:
        # P_k, k >= 1: sum neighboring pairs, recurse on the pair sums
        # with k-1, then fill in the wires that were skipped.
        yield from zip(xs[::2], xs[1::2])
        new_xs = xs[1::2]
        if len(xs) % 2 == 1:
            new_xs.append(xs[-1])
        yield from ladner_fischer_network(new_xs, k=k-1)
        yield from zip(xs[1:-1:2], xs[2:-1:2])

The function above just generates the pairs of wires (i, j) that the network adds (wire i is added into wire j), from left to right and top to bottom. So:

>>> list(ladner_fischer_network(list(range(7))))
[(0, 1), (2, 3), (4, 5), (1, 3), (5, 6), (3, 6), (3, 5), (1, 2), (3, 4)]

corresponds to the network diagram:

A Menagerie of Networks

Now that we can generate these, you can sit back and stare at the patterns in a few larger ones for a minute:

We know how to generate the prefix sum networks now, but you probably can’t yet draw one on a blank sheet of paper from first principles or explain why they’re correct.

Simpler?

I came up with a different algorithm for generating these networks that I think is a little clearer, and makes it easier to see that the construction is correct. Like the previous Python code for the recursive generation of Ladner-Fischer networks, my algorithm takes a list of all wires you want to generate the network on (so you’d pass [0,1,2,3,4] if you wanted to generate a network on 5 elements). The entire algorithm is just:

Merge the elements of the list in pairs, keeping the latter element in each pair in a new merged array. Each merge of wires i and j generates a sum operation on i and j.
Repeat this merging process until the resulting merged array has only one element. Go back through all the arrays you’ve created (the original and any merged arrays) and mark the first element of each “finished”.
Go back through the merged arrays in the reverse order they were created. For any unfinished index you find in this reverse scan, generate a sum operation on that wire and the wire to the left of it in the merged array, storing the result in the unfinished index. Mark the unfinished index as finished whenever this happens.

For example, to generate the 13-element prefix sum network, we’d start with [0,1,2,3,4,5,6,7,8,9,10,11,12]. The merging steps generate the following sequences of arrays:

[0,1,2,3,4,5,6,7,8,9,10,11,12]
[1,3,5,7,9,11,12]
[3,7,11,12]
[7,12]
[12]

Notice that since we start with an odd number of elements, 12 has no buddy to merge with until the third merge. That’s fine, we just bring it along to the next step without any merge when it can’t pair up with a buddy. Now, if we’ve diligently marked the first element in each array as “finished” (0, 1, 3, 7, and 12), we end up in this state at the end (with finished elements starred):

[0*,1*,2,3*,4,5,6,7*,8,9,10,11,12*]
[1*,3*,5,7*,9,11,12*]
[3*,7*,11,12*]
[7*,12*]
[12*]

When we iterate through in reverse order, we combine 7 with 11 and mark 11 finished once we get to the third array. Once we get back to the second array, we combine 3 and 5 into 5, and 7 and 9 into 9, and so on. All these combine operations taken together create the correct work-efficient prefix sum schedule:

The code for this iterative generation in Python is:

def generate_prefix_sum(n):
    reduced = [list(range(n))]
    while len(reduced[-1]) > 1:
        prev, current = reduced[-1], []
        for x,y in zip(prev[::2], prev[1::2]):
            yield (x,y)
            current.append(y)
        if len(prev) % 2 == 1:
            current.append(prev[-1])
        reduced.append(current)

    finished = set(r[0] for r in reduced)
    for result in reversed(reduced):
        for i, item in enumerate(result):
            if item not in finished:
                yield (result[i-1], item)
                finished.add(item)

and generate_prefix_sum generates exactly the same networks as the ladner_fischer_network function in the previous section. This iterative algorithm is something I can reproduce on a sheet of paper without any reference to recursive definitions and it’s also something I can convince myself is correct – you just need to reason about the contents of “finished” wires and the order in which they’re finished for a little bit.
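One way to mechanize that reasoning is to track, for every wire, the run of inputs it currently holds and to insist that each addition glues two adjacent runs together. The sketch below does that and also reports the size and depth of a network; the function name is made up for this example, and it is run on the 7-wire network listed earlier.

```python
def check_and_measure(ops, n):
    """Check that ops computes prefix sums on n wires; return (size, depth)."""
    # span[w] = (lo, hi) means wire w currently holds a[lo] + ... + a[hi].
    span = [(w, w) for w in range(n)]
    level = [0] * n  # depth at which each wire's current value is ready
    for i, j in ops:
        (lo_i, hi_i), (lo_j, hi_j) = span[i], span[j]
        # Each addition must join two adjacent runs of inputs.
        assert hi_i + 1 == lo_j, f"op {(i, j)} does not join adjacent runs"
        span[j] = (lo_i, hi_j)
        level[j] = max(level[i], level[j]) + 1
    assert all(span[w] == (0, w) for w in range(n)), "some wire is not a prefix"
    return len(ops), max(level, default=0)


net7 = [(0, 1), (2, 3), (4, 5), (1, 3), (5, 6), (3, 6), (3, 5), (1, 2), (3, 4)]
chain7 = [(w, w + 1) for w in range(6)]  # sequential accumulation
print(check_and_measure(net7, 7))
print(check_and_measure(chain7, 7))
```

For 7 wires the network uses 9 additions at depth 3, while the sequential chain uses 6 additions at depth 6.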

In any case, now we have two ways to generate these Ladner-Fischer prefix sum networks.

Afterword

After all of this work, did the Ladner-Fischer network help cnfc generate better encodings? Let’s try proving a large number is prime. I’ve got an example in cnfc that produces formulas that are satisfiable exactly when the given input number is composite.

First, just generating prefix sums using a naive sequential accumulation:

~/cnfc$ poetry run python3 examples/prime/prime.py 100123456789 /tmp/cnf /tmp/extractor.py
~/cnfc$ grep "p " /tmp/cnf  # Number of variables and clauses in the encoding
p cnf 15211 48400
~/cnfc$ time kissat /tmp/cnf > /tmp/kissat-out  # Solve the formula

real    1m20.329s
user    1m19.655s
sys     0m0.597s

Now, again, with the Ladner-Fischer network dropped in instead:

~/cnfc$ poetry run python3 examples/prime/prime.py 100123456789 /tmp/cnf /tmp/extractor.py
~/cnfc$ grep "p " /tmp/cnf   # Number of variables and clauses in the encoding
p cnf 20362 63853
~/cnfc$ time kissat /tmp/cnf > /tmp/kissat-out  # Solve the formula

real    1m33.597s
user    1m32.979s
sys     0m0.612s

So we use about a third more variables and clauses when we drop in the new prefix sum network, which is expected. The encoding of the primality test is basically all adders (multiplication is implemented with the grade-school repeated addition algorithm) and the Ladner-Fischer network uses about twice as many operations as the linear accumulation of prefix sums. These encodings don’t have any standard preprocessing applied, but the numbers don’t change much if I do a little unit-propagation, subsumption, etc.

But we don’t get any kind of great speedup with the more complicated network, and in fact it made the solver work a little harder in all of the examples I tried, even with composite numbers instead of primes. So for now, I’m not replacing the simpler linear-delay prefix sum network with the lower-depth Ladner-Fischer prefix-sum network. But I’ll keep the code around in case these numbers change if/when I encode multiplication with something better than repeated addition.

¹ Richard P. Brent and H. T. Kung. 1982. A Regular Layout for Parallel Adders. IEEE Transactions on Computers C-31 (1982), 260–264.
² Richard E. Ladner and Michael J. Fischer. 1980. Parallel Prefix Computation. J. ACM 27, 4 (Oct. 1980), 831–838. https://doi.org/10.1145/322217.322232

Intersecting Set-Pair Systems

Sun, 10 Apr 2022 13:54:00 +0000

An intersecting set-pair system is a set of pairs { (A_i, B_i) | 1 ≤ i ≤ m } with some additional restrictions on how the As and Bs intersect. The classic setup requires A_i ∩ B_i = ∅ for all i but A_i ∩ B_j ≠ ∅ for all i ≠ j.

Here’s an example:

{ ({1,2}, {3,4}),
  ({1,4}, {2,5}),
  ({2,3}, {1,4}) }

How large can an intersecting set-pair system be? With just the restrictions above, Bollobás proved that if |A_i| ≤ a and |B_i| ≤ b for all i, then m ≤ C(a + b, b)¹. And that bound is tight because all (a,b) partitions of a set of size a+b form an intersecting set-pair system.

Recently, Füredi, Gyárfás, and Király [2019] explored intersecting set-pair systems with the additional restriction that | A_i ∩ B_j | = 1 for all i ≠ j. Holzman [2021] followed their paper with a proof that m ≤ 29⁄30 C(a + b, b) for such set-pair systems, and Kostochka, McCourt, and Nahvi [2021] have recently proven that m ≤ 5⁄6 C(a + b, b).

I tried to see if 5⁄6 was tight for small a and b using a SAT solver. It doesn’t look like it is, and some of these constructions look interesting, so here they are:

a=2

When a = 2, Füredi, Gyárfás, and Király give a closed-form solution for m when b ≥ 4. So the results in this section are really just an exercise in verifying their results.

For a = b = 2, I was able to verify that m = 5 by generating a system with 5 pairs and proving that a system of size 6 doesn’t exist. The system of size 5 below is actually unique under permutations of the ground set:

{ ({1, 2}, {3, 4}),
  ({1, 4}, {2, 5}),
  ({2, 3}, {1, 5}),
  ({3, 5}, {2, 4}),
  ({4, 5}, {1, 3}) }
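The defining conditions are easy to check mechanically. Here is a minimal sketch of such a checker (the function and variable names are made up for this example), applied to the first example in this post and to the size-5 system above:

```python
def is_set_pair_system(pairs, exact_one=False):
    """Check that A_i and B_i are disjoint and that A_i meets B_j for i != j.

    With exact_one=True, also require |A_i & B_j| == 1 for all i != j.
    """
    for i, (a_i, _) in enumerate(pairs):
        for j, (_, b_j) in enumerate(pairs):
            common = len(set(a_i) & set(b_j))
            if i == j and common != 0:
                return False
            if i != j and (common == 0 or (exact_one and common != 1)):
                return False
    return True


classic = [({1, 2}, {3, 4}), ({1, 4}, {2, 5}), ({2, 3}, {1, 4})]
size_five = [({1, 2}, {3, 4}), ({1, 4}, {2, 5}), ({2, 3}, {1, 5}),
             ({3, 5}, {2, 4}), ({4, 5}, {1, 3})]

print(is_set_pair_system(classic))
print(is_set_pair_system(classic, exact_one=True))
print(is_set_pair_system(size_five, exact_one=True))
```

The first example satisfies the classic conditions but not the stricter one, because {1, 4} and {1, 4} share two elements; the size-5 system satisfies both.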

For a = 2, b = 3, m = 7. I was again able to generate a system of size 7 and prove no system of size 8 exists:

{ ({1, 2}, {3, 4, 5}),
  ({1, 3}, {2, 6, 7}),
  ({2, 4}, {1, 6, 7}),
  ({3, 6}, {1, 4, 5}),
  ({4, 7}, {2, 3, 5}),
  ({5, 6}, {2, 3, 7}),
  ({5, 7}, {1, 4, 6}) }

This time, there were 9 distinct systems under permutations. I stopped counting distinct solutions after this because it gets a little tedious with a SAT solver (you find a solution, add a clause that blocks that solution, repeat).

At a = 2, b = 4, I could generate a system of size 9 but couldn’t prove a system of size 10 didn’t exist (the SAT solver ran for too long and I killed it):

{ ({1, 2},  {3, 4, 5, 6}),
  ({2, 5},  {1, 3, 4, 6}),
  ({2, 6},  {1, 3, 4, 5}),
  ({3, 10}, {2, 4, 11, 12}),
  ({3, 11}, {2, 4, 10, 12}),
  ({3, 12}, {2, 4, 10, 11}),
  ({4, 7},  {2, 3, 8, 9}),
  ({4, 8},  {2, 3, 7, 9}),
  ({4, 9},  {2, 3, 7, 8}) }

Same story for a = 2, b = 5, 6, and 7: I could generate a system of maximum size pretty easily but didn’t bother trying to prove a larger system doesn’t exist.

a = 2, b = 5 → m = 12:

{ ({1, 2},  {3, 4, 5, 6, 7}),
  ({2, 4},  {1, 3, 5, 6, 7}),
  ({2, 5},  {1, 3, 4, 6, 7}),
  ({3, 11}, {2, 6, 7, 15, 16}),
  ({3, 15}, {2, 6, 7, 11, 16}),
  ({3, 16}, {2, 6, 7, 11, 15}),
  ({6, 10}, {2, 3, 7, 12, 14}),
  ({6, 12}, {2, 3, 7, 10, 14}),
  ({6, 14}, {2, 3, 7, 10, 12}),
  ({7, 8},  {2, 3, 6, 9, 13}),
  ({7, 9},  {2, 3, 6, 8, 13}),
  ({7, 13}, {2, 3, 6, 8, 9}) }

a = 2, b = 6 → m = 16:

{ ({1, 2},  {3, 4, 5, 6, 7, 8}),
  ({2, 5},  {1, 3, 4, 6, 7, 8}),
  ({2, 6},  {1, 3, 4, 5, 7, 8}),
  ({2, 8},  {1, 3, 4, 5, 6, 7}),
  ({3, 9},  {2, 4, 7, 11, 12, 19}),
  ({3, 11}, {2, 4, 7, 9, 12, 19}),
  ({3, 12}, {2, 4, 7, 9, 11, 19}),
  ({3, 19}, {2, 4, 7, 9, 11, 12}),
  ({4, 15}, {2, 3, 7, 16, 18, 20}),
  ({4, 16}, {2, 3, 7, 15, 18, 20}),
  ({4, 18}, {2, 3, 7, 15, 16, 20}),
  ({4, 20}, {2, 3, 7, 15, 16, 18}),
  ({7, 10}, {2, 3, 4, 13, 14, 17}),
  ({7, 13}, {2, 3, 4, 10, 14, 17}),
  ({7, 14}, {2, 3, 4, 10, 13, 17}),
  ({7, 17}, {2, 3, 4, 10, 13, 14}) }

a = 2, b = 7 → m = 20:

{ ({1, 2},  {3, 4, 5, 6, 7, 8, 9}),
  ({1, 3},  {2, 4, 5, 6, 7, 8, 9}),
  ({1, 4},  {2, 3, 5, 6, 7, 8, 9}),
  ({1, 6},  {2, 3, 4, 5, 7, 8, 9}),
  ({5, 12}, {1, 7, 8, 9, 13, 22, 24}),
  ({5, 13}, {1, 7, 8, 9, 12, 22, 24}),
  ({5, 22}, {1, 7, 8, 9, 12, 13, 24}),
  ({5, 24}, {1, 7, 8, 9, 12, 13, 22}),
  ({7, 10}, {1, 5, 8, 9, 11, 15, 19}),
  ({7, 11}, {1, 5, 8, 9, 10, 15, 19}),
  ({7, 15}, {1, 5, 8, 9, 10, 11, 19}),
  ({7, 19}, {1, 5, 8, 9, 10, 11, 15}),
  ({8, 14}, {1, 5, 7, 9, 17, 21, 25}),
  ({8, 17}, {1, 5, 7, 9, 14, 21, 25}),
  ({8, 21}, {1, 5, 7, 9, 14, 17, 25}),
  ({8, 25}, {1, 5, 7, 9, 14, 17, 21}),
  ({9, 16}, {1, 5, 7, 8, 18, 20, 23}),
  ({9, 18}, {1, 5, 7, 8, 16, 20, 23}),
  ({9, 20}, {1, 5, 7, 8, 16, 18, 23}),
  ({9, 23}, {1, 5, 7, 8, 16, 18, 20}) }

I stopped here because I wanted to move on to a=3 where a closed form for m isn’t known.

a=3

For a=3, we only know the upper bound m ≤ 5⁄6 C(a + b, b) when a,b ≥ 2 from Kostochka, McCourt, and Nahvi’s work.

Here’s what I could find:

For b=1, m=4:

{ ({1, 2, 3}, {4}),
  ({1, 2, 4}, {3}),
  ({1, 3, 4}, {2}),
  ({2, 3, 4}, {1}) }

We already know that m=7 when b=2 by symmetry from the results in the previous section.

When b=3, m ≥ 10:

{ ({1, 2, 3},   {4, 5, 6}),
  ({1, 3, 5},   {2, 9, 10}),
  ({1, 5, 9},   {3, 7, 8}),
  ({2, 3, 6},   {1, 7, 8}),
  ({2, 6, 7},   {3, 9, 10}),
  ({4, 7, 10},  {2, 5, 8}),
  ({4, 8, 9},   {1, 6, 10}),
  ({4, 8, 10},  {3, 7, 9}),
  ({5, 7, 10},  {1, 4, 6}),
  ({6, 8, 9},   {2, 4, 5}) }

I couldn’t prove that there’s no system with a = b = 3 and m = 11. One thing that I could prove (in about 2 hours with a SAT solver) is that no system with a = b = 3 and m = 11 and a ground set of cardinality 10 or 11 exists. So that’s good evidence that the system above is the largest possible, since any larger system would have to use at least a 12-element ground set.

When b=4, m ≥ 15:

{ ({1, 2, 3},   {4, 5, 6, 7}),
  ({1, 2, 5},   {3, 6, 7, 12}),
  ({2, 3, 4},   {1, 6, 7, 12}),
  ({2, 4, 12},  {3, 5, 6, 7}),
  ({2, 5, 12},  {1, 4, 6, 7}),
  ({6, 11, 14}, {2, 7, 13, 16}),
  ({6, 11, 16}, {2, 7, 14, 18}),
  ({6, 13, 14}, {2, 7, 11, 18}),
  ({6, 13, 18}, {2, 7, 14, 16}),
  ({6, 16, 18}, {2, 7, 11, 13}),
  ({7, 8, 10},  {2, 6, 15, 17}),
  ({7, 8, 17},  {2, 6, 9, 10}),
  ({7, 9, 15},  {2, 6, 10, 17}),
  ({7, 9, 17},  {2, 6, 8, 15}),
  ({7, 10, 15}, {2, 6, 8, 9}) }

I couldn’t prove any more partial results after this, even by restricting the size of the ground set.

Finally, when b = 5, m ≥ 20:

{ ({1, 2, 3},   {4, 5, 6, 7, 8}),
  ({1, 2, 8},   {3, 5, 6, 7, 17}),
  ({2, 3, 4},   {1, 5, 6, 7, 17}),
  ({2, 4, 17},  {3, 5, 6, 7, 8}),
  ({2, 8, 17},  {1, 4, 5, 6, 7}),
  ({5, 9, 11},  {2, 6, 7, 13, 18}),
  ({5, 9, 13},  {2, 6, 7, 11, 14}),
  ({5, 11, 18}, {2, 6, 7, 9, 14}),
  ({5, 13, 14}, {2, 6, 7, 9, 18}),
  ({5, 14, 18}, {2, 6, 7, 11, 13}),
  ({6, 12, 16}, {2, 5, 7, 15, 23}),
  ({6, 12, 23}, {2, 5, 7, 16, 22}),
  ({6, 15, 16}, {2, 5, 7, 12, 22}),
  ({6, 15, 22}, {2, 5, 7, 16, 23}),
  ({6, 22, 23}, {2, 5, 7, 12, 15}),
  ({7, 10, 21}, {2, 5, 6, 19, 24}),
  ({7, 10, 24}, {2, 5, 6, 20, 21}),
  ({7, 19, 20}, {2, 5, 6, 21, 24}),
  ({7, 19, 21}, {2, 5, 6, 10, 20}),
  ({7, 20, 24}, {2, 5, 6, 10, 19}) }

To wrap up, here’s a table summarizing the lower bounds for m when a=3 implied by the constructions above, compared to the current best upper bounds from the formula m ≤ 5⁄6 C(a + b, b):

a	b	m ≥ ?	m ≤ ?
3	1	4	4
3	2	7	8.33…
3	3	10	16.666…
3	4	15	29.1666…
3	5	20	46.666…

The closed form for the upper bound only applies for a, b ≥ 2 so I just put “4” in the first row for the upper bound since that’s clearly exact.
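The upper-bound column can be reproduced with a few lines of Python. This is just a sketch that evaluates the formula quoted above; the function name is made up. Since m is an integer, the floor of each value is the effective bound.

```python
import math
from fractions import Fraction


def five_sixths_bound(a, b):
    """5/6 * C(a + b, b), the upper bound quoted above for a, b >= 2."""
    return Fraction(5, 6) * math.comb(a + b, b)


for b in range(2, 6):
    bound = five_sixths_bound(3, b)
    print(f"a=3 b={b}: m <= {float(bound):.3f}, so m <= {math.floor(bound)}")
```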

I think my lower bounds are pretty tight because there was a noticeable threshold between the results I could find (which the SAT solver could find in seconds or minutes) and the proofs I couldn’t find (which made the SAT solver just hang for hours or days).

I’ve put the script I used to generate SAT instances and the decoder I used to take SAT solutions and generate canonical set systems online. You just need Python 3 to run both scripts. I used mainly kissat and cadical to find these results.

¹ C(n,r) is the number of ways to choose r items from an n-element set: C(n,r) = n!/(r!(n-r)!).

The Presidential Rectangle

Sun, 07 Nov 2021 12:37:00 +0000

In 1976, Robert Cass Keller proposed a puzzle in Word Ways that was inspired by Benjamin Franklin’s quote “We must all hang together, or assuredly we shall all hang separately”. The puzzle asks you to fit the first 33 distinct surnames of American Presidents into a rectangle of minimum area using the following rules:

1. Names can only be placed left-to-right or top-to-bottom
2. Letters in adjacent cells must belong to a common name
3. Every name must be connected by some path of intersecting names to every other name (“We must all hang together…”)

Keller mentions a similar contest in a British magazine involving the 50 U.S. states. The winner apparently fit them all into a 23-by-29 rectangle.

In the next issue, Sam Harlan presented the first solution to Keller’s puzzle, a 17-by-31 rectangle with total area 527:

1976 was an election year, which meant the set of presidential surnames was about to increase by one. Harlan mentioned that Jimmy Carter (the eventual winner) could easily be packed into the existing solution using the first “e” in Pierce.

In 1993, shortly after Bill Clinton’s election, Darryl Francis produced a new rectangle that was smaller than Harlan’s solution by two columns (17-by-29, for a total area of 493) and included the entire current set of 37 presidential surnames including Carter, Reagan, Bush, and Clinton:

In the next issue of Word Ways, Lee Sallows presented a 19-by-22 grid containing the same set of 37 surnames. The resulting area (418) was a massive reduction from previous attempts, but his grid contained an isolated Ford-Polk-Jackson component. Sallows noted that some “purists” might therefore disqualify his rectangle.

Such a purist showed up in the very next issue, when Leonard Gordon presented a completely connected 19-by-22 grid as well as the following new record-setting 17-by-24 grid with area 408:

Gordon published an update a year later, presenting a tighter 18-by-22 solution with an area of 396:

Gordon’s rectangle above from 1994 is the last communication I can find on the Presidential Rectangle for a while, even though, thanks to the election of George W. Bush in 2000, the set of presidential surnames didn’t change again until 2008.

The Presidential Rectangle shows up as an exercise in Don Knuth’s Art of Computer Programming Vol. 4, Fascicle 5, although the exercise asks for a square rather than a rectangle, and the exercise asks for the first 38 surnames up to and including Obama.

The 20-by-20 solution (area: 400) is credited to Gary McDonald from 2017:

McDonald’s solution contains some elements from Gordon’s earlier patterns, including an updated 4-way intersection including Coolidge and Nixon. But I think the dense Roosevelt-Adams-Harding-Obama-Cleveland-Jefferson block is really the centerpiece of the packing.

Knuth mentions that a 19-by-19 solution is surely impossible, and after many attempts over the past few weeks, I am also doubtful that a 19-by-19 packing exists. I was, however, able to pack all of the first 38 surnames up to Obama in a 19-by-21 grid, which beats McDonald’s 20-by-20 solution by exactly 1 when scoring by total area.

Here’s my 19-by-21:

I was also able to construct a similar 21-by-21 packing of all 40 presidential surnames as of 2021, including Trump and Biden:

Word Ways has suspended publication as of a year ago, otherwise I’d continue the tradition by posting these constructions there. In the rest of this post, I’ll explain how I found these new rectangles and share some tools that you can use to construct similar word rectangles.

SAT Encoding

You can frame the Presidential Rectangle problem as an exact cover problem and use solvers based on the Dancing Links technique to identify solutions, but the connectedness property is difficult to specify in that setting. Knuth describes a way to modify a solver to only discover connected solutions instead of filtering out disconnected solutions as a post-processing step. But SAT solvers are usually much more efficient at finding a solution with multiple constraints like this, and I like playing around with complicated SAT encodings, so I wanted to give that approach a go.

My encoding creates a boolean variable C_(ch,r,c) for every possible character ch and row r, column c pair in the rectangle. These variables represent “the cell at (r, c) in the rectangle contains character ch”. I also add clauses that restrict at most one C_(ch,r,c) to be true for any (r, c) pair.
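As a rough illustration of the at-most-one restriction, here is a sketch that emits the clauses using the simple pairwise encoding. Both the pairwise encoding and the variable numbering are assumptions made for this example; they are not necessarily what the actual tool does.

```python
from itertools import combinations


def cell_var(ch_index, r, c, rows, cols):
    """Hypothetical numbering: a positive DIMACS-style id for C_(ch,r,c)."""
    return 1 + (ch_index * rows + r) * cols + c


def at_most_one_char_clauses(num_chars, rows, cols):
    """Pairwise encoding: in each cell, no two characters are both true."""
    clauses = []
    for r in range(rows):
        for c in range(cols):
            cell = [cell_var(ch, r, c, rows, cols) for ch in range(num_chars)]
            # (not u) or (not v) for every pair of characters in this cell.
            clauses.extend([-u, -v] for u, v in combinations(cell, 2))
    return clauses


clauses = at_most_one_char_clauses(num_chars=3, rows=2, cols=2)
print(len(clauses), clauses[:3])
```

The pairwise encoding needs a quadratic number of clauses per cell in the alphabet size, which is manageable for a 26-letter alphabet.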

Now I can go through all surnames that I want to pack in the rectangle and create variables PH_(x,r,c) representing “surname x is placed horizontally starting at (r,c)” and PV_(x,r,c) representing “x is placed vertically starting at (r,c)”. I create clauses connecting the PH_(x,r,c) and PV_(x,r,c) to the C_(ch,r,c) in the natural way, spelling out the exact placements of characters in surname x into cells in the rectangle if PH_(x,r,c) or PV_(x,r,c) is true. Finally, I add clauses that ensure that exactly one PH_(x,r,c) or PV_(x,r,c) is true for any surname x.

At this point, any satisfying assignment of the variables must pack all of the surnames into the rectangle and ensure that each cell in the rectangle is assigned at most one distinct character. But recall the rules we started with:

1. Names can only be placed left-to-right or top-to-bottom
2. Letters in adjacent cells must belong to a common name
3. Every name must be connected by some path of intersecting names to every other name

Without additional constraints, we could still violate rule 2 above by placing surnames too close together in the rectangle.

To encode the spacing constraints of rule 2, I introduced three more families of variables:

S_(r,c): “spacer” variables.
H_(r,c): “horizontal cell” marker variables.
V_(r,c): “vertical cell” marker variables.

When a PH_(x,r,c) or PV_(x,r,c) is set, we set spacer variables in the two cells before and after the placement of the surname. There may be no cell that comes before or after a surname if it’s placed on the border of the rectangle, and that’s fine, we just omit spacers when that happens. We also set H_(r,c) on all cells covered by the characters in a surname when PH_(x,r,c) is set and V_(r,c) on all cells covered by the characters in a surname when PV_(x,r,c) is set.

With that setup, we add some more clauses to guarantee that:

A spacer and a horizontal cell never co-occur
A spacer and a vertical cell never co-occur
Two horizontal cells can’t be vertically adjacent unless both are also vertical cells
Two vertical cells can’t be horizontally adjacent unless both are also horizontal cells.

At this point, there are already a lot of variables involved, so here’s an example that might help make sense of the encoding so far: If POLK is placed horizontally in the cells (1,1), (1,2), (1,3), (1,4), then PH_(POLK,1,1) is set and no other PH_(POLK,r,c) or PV_(POLK,r,c) can be set. This also means that C_(P,1,1), C_(O,1,2), C_(L,1,3), and C_(K,1,4) get set and no other C_(ch,1,1), C_(ch,1,2), C_(ch,1,3), or C_(ch,1,4) can be set. Also, S_(1,0) and S_(1,5) get set, which keeps any other surnames from using those cells. And H_(1,1), H_(1,2), H_(1,3), and H_(1,4) get set, which keeps horizontal surnames from occurring adjacent to POLK in row 0 or row 2 unless the cells are also vertical cells (to see why we make this exception, look at, for example, the Reagan-Carter-Nixon-Coolidge intersection in Gary McDonald’s solution above).
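The bookkeeping in that example can be sketched as a small helper that lists which cells and spacers a horizontal placement implies. The function name and the return shape are made up for illustration; coordinates are 0-indexed (row, column) pairs, as in the example.

```python
def horizontal_placement(name, r, c, rows, cols):
    """Cells and spacers implied by PH_(name,r,c), or None if it doesn't fit."""
    if not (0 <= r < rows) or c < 0 or c + len(name) > cols:
        return None
    # Each covered cell gets one character (and an H marker in the encoding).
    cells = {(r, c + k): ch for k, ch in enumerate(name)}
    # Spacers go just before and just after the name, if those cells exist.
    spacers = [(r, cc) for cc in (c - 1, c + len(name)) if 0 <= cc < cols]
    return cells, spacers


cells, spacers = horizontal_placement("POLK", 1, 1, rows=6, cols=6)
print(cells)
print(spacers)
```

For POLK at (1,1) in a 6-by-6 grid this gives P, O, L, K in cells (1,1) through (1,4) and spacers at (1,0) and (1,5), matching the example above.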

So we’ve encoded rules 1 and 2, and we’re left with rule 3: “Every name must be connected by some path of intersecting names to every other name”. This is a little tedious but we can do it. Define variables R_(x,y,i) for every pair of surnames x and y and each i from 0 to N-2, where N is the total number of surnames. R_(x,y,i) is true exactly when x is reachable from y through a path consisting of exactly i intermediate surnames. You can define it inductively as follows:

R_(x,y,0) is true if any of the placements of x and y that intersect are selected, or false if there are no such placements.
R_(x,y,i) for i > 0 is true if there’s some z such that R_(x,z,0) and R_(z,y,i-1) are both true.

Finally, we generate a set of variables R_(x,y) for each pair of surnames x, y and define each as the disjunction of R_(x,y,i) for i from 0 to N-2, so that R_(x,y) is true exactly when there’s a path of any length from x to y. Then we add unit clauses for each R_(x,y) to the encoding to ensure everything is connected to everything else.
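To see the inductive definition in action, here is a sketch that evaluates R_(x,y,i) and R_(x,y) for one fixed, hypothetical set of intersections rather than generating clauses. The function name and the choice of which names intersect are made up for this example.

```python
def reachability(names, intersects):
    """Evaluate R_(x,y) for a fixed placement.

    intersects is a set of frozenset({x, y}) pairs of names that cross.
    """
    n = len(names)
    pairs = [(x, y) for x in names for y in names if x != y]
    # R[i][(x, y)]: x reaches y through i intermediate names.
    R = [{(x, y): frozenset((x, y)) in intersects for x, y in pairs}]
    for i in range(1, n - 1):
        R.append({(x, y): any(R[0][(x, z)] and R[i - 1][(z, y)]
                              for z in names if z not in (x, y))
                  for x, y in pairs})
    # R_(x,y) is the disjunction of R_(x,y,i) over all i.
    return {pair: any(level[pair] for level in R) for pair in pairs}


names = ["POLK", "FORD", "TAFT", "GRANT"]
crossing = {frozenset(("POLK", "FORD")), frozenset(("FORD", "TAFT"))}
print(all(reachability(names, crossing).values()))   # GRANT is isolated
crossing.add(frozenset(("TAFT", "GRANT")))
print(all(reachability(names, crossing).values()))   # everything connected
```

The unit clauses on R_(x,y) in the encoding correspond to requiring that the final `all(...)` check is true.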

Now we can just run a SAT solver on the formula. If it returns a satisfying assignment, we can extract the assignments of the PH_(x,r,c) and PV_(x,r,c) for each surname x to reconstruct the solution.

Assisted solutions

Unfortunately, the Presidential Rectangle is a bit too difficult even for modern SAT solvers. You can, for example, find a packing of a dozen or so surnames in, say, a 12-by-12 square or prove that no such packing exists for an 11-by-11 square in a few minutes. But packing even the first 38 presidential surnames into a 20-by-20 rectangle or proving that no 19-by-19 packing exists seems beyond the reach of today’s technology: I ran multiple state-of-the-art solvers for a week each on problems like these with no results.

Since I care more about finding packings than proving they don’t exist, I started to look for ways to make the problem easier by targeting packings with additional constraints instead of solving the general problem.

Along these lines, I played around with:

Forces: Some of the advances in the Presidential Rectangle came from re-using partial packings that were successful in earlier attempts (Lee Sallows re-used Darryl Francis’s Van Buren-Harding-Grant-Buchanan loop and Leonard Gordon’s solutions share many common constructs, for example). It makes sense that you might want to force absolute positions for some of the surnames and let the solver figure out the rest.
Relative Forces: Similar to forces, these allow you to specify relative positions instead of absolute positions in the form of constraints that say things like “I’d like Harding and Harrison to intersect in their first ‘H’”.
Partial Packings: Allow the solver to pack a subset of a particular size into a sub-rectangle. Partial packings are easy for the solver to figure out, and they could be used incrementally with forces to figure out a full packing.
Minimizing Blank Cells: A constraint that guarantees at most N blank cells in a packing helps create denser partial packings. Without adding something like this, if you ask a solver for a partial packing of any 10 surnames into a 10-by-10 grid you’re likely to just get the 10 shortest names.

I had some promising partial successes with each of these techniques. But the only thing that really worked for me (and what gave me the 19-by-21 and 21-by-21 rectangles above) was to hand-pack some of the trickier surnames and use those as absolute forces, letting the solver figure out the rest. Both of my solutions above share the same lower half which I hand-crafted and then fed into the solver with forces. That lower half helped the solver find both solutions in just a few hours each.

Tools

I’ve put the tools I used to generate the rectangles above up at github.com/aaw/presidential-rectangle. They support all of the advanced features for assisted solutions described in the previous section. Have fun creating your own rectangles!

Partridge Puzzle

Sun, 05 Sep 2021 14:13:44 +0000

Robert T. Wainwright’s Partridge Puzzle for n=8 is: pack one 1x1 unit square, two 2x2 unit squares, … , eight 8x8 unit squares into a larger 36 x 36 square. Since the sum of the first n cubes is equal to the square of the sum of the first n integers, it shouldn’t be impossible for any n, but n=8 is the smallest value where a packing actually exists:

(The little unlabeled square above is 1x1, but it’s too tiny for a label.)
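The area condition behind the puzzle can be checked directly. This is a small sketch with a made-up function name: k copies of a k-by-k square contribute k³ to the total area, and the side of the big square is 1 + 2 + … + n.

```python
def partridge_side(n):
    """Side of the big square whose area equals that of all the small squares."""
    area = sum(k * k * k for k in range(1, n + 1))  # k squares of area k^2
    side = n * (n + 1) // 2                         # 1 + 2 + ... + n
    assert side * side == area  # sum of cubes equals the square of the sum
    return side


print(partridge_side(8))
```

For n=8 the side is 36, so the total area is 1296.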

You can generate all solutions for n=8 at a terminal by running

examples/partridge/solve-partridge.sh 8

from github.com/aaw/cover and waiting a few minutes for the first one to appear.

Comma-free codes

Wed, 20 Jan 2021 13:28:00 +0000

A set of strings is comma-free if, for any two strings x and y in the set, no substring of the concatenation xy that overlaps both x and y is also in the set.

Once again, my interest in comma-free codes is coming from Don Knuth’s Volume 4, Fascicle 5 of The Art of Computer Programming. Knuth covers a souped-up backtracking algorithm to find comma-free codes and includes an exercise that derives W. L. Eastman’s algorithm for generating codes with odd word length¹. He goes over Eastman’s algorithm in his 2015 Christmas lecture as well, which must have been given around the time he was writing Fascicle 5.

In this post, I’ll describe how to use a SAT solver to discover comma-free codes.

Background

I’ll call a comma-free set a “code” and the strings in the set “codewords” in most of the rest of this post.

A few warm-up examples to make the comma-free property concrete: if a comma-free code containing only words of length three contains 123 and 456, then it can’t contain 234 or 345 and it can’t contain 561 or 612 (since those are substrings of 123456 and 456123, respectively). It also can’t contain 231, 312, 564, or 645 (since those are substrings of 123123 and 456456, respectively).

The term “comma-free” comes from the fact that you can create messages just by concatenating codewords together (without commas or other such delimiters) and those messages will be uniquely decodable into codewords even if you start decoding somewhere in the middle of the message.

For example, a comma-free code with words of length 3 over the alphabet {0,1,2} is:

{001, 002, 101, 102, 120, 121, 220, 221}

Given some partial message made up of these codewords:

...011202212200012...

The interpretation of that message as codewords is unique. It has to be

...01,120,221,220,001,2...

since neither of the other ways of interpreting it makes sense. Neither

...0,112,022,122,000,12...

nor

...011,202,212,200,012...

contains any valid codewords.
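
The property is easy to check mechanically. Here’s a minimal Python sketch (the name is_comma_free is mine, not something from the scripts later in this post, and it assumes every codeword has the same length) that tests each concatenation xy for a codeword straddling the boundary:

```python
from itertools import product

def is_comma_free(code):
    """Return True if no codeword overlaps the boundary of any concatenation xy."""
    code = set(code)
    for x, y in product(code, repeat=2):
        xy = x + y
        n = len(x)
        # Offsets 1..n-1 are exactly the length-n windows that overlap both x and y.
        for k in range(1, n):
            if xy[k:k + n] in code:
                return False
    return True

example = {"001", "002", "101", "102", "120", "121", "220", "221"}
print(is_comma_free(example))                # the code from this section
print(is_comma_free({"123", "456", "234"}))  # 234 straddles 123|456
```

The first call reports that the eight-word code above is comma-free, and the second reproduces the warm-up conflict between 123, 456, and 234.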

Upper bounds

A Lyndon word is a string that’s a unique lexicographic minimum among the multiset of all of its rotations.

0001 is a Lyndon word because it’s the minimum among the rotations 0001, 0010, 0100, 1000.

0101 is the minimum among all its rotations, but it’s not a Lyndon word because the minimum is not unique among all four rotations: 0101, 1010, 0101, 1010. Any periodic string will have the same problem.
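
As a sanity check on the definition, here’s a small brute-force sketch in Python. The helper names are mine, and comparing a word against all of its rotations is the slow, obvious approach rather than anything clever:

```python
def rotations(w):
    """All rotations of w, starting with w itself."""
    return [w[i:] + w[:i] for i in range(len(w))]

def is_lyndon(w):
    """True if w is strictly smaller than every one of its other rotations."""
    return all(w < r for r in rotations(w)[1:])

print(rotations("0001"))  # 0001, 0010, 0100, 1000
print(is_lyndon("0001"))  # unique minimum among its rotations
print(is_lyndon("0101"))  # periodic, so the minimum appears twice
```

The strict comparison is what rules out periodic strings: 0101 ties with its own rotation by two positions.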

If the same codeword is repeated in a message, all of its other rotations overlap both occurrences of the codeword (for example, 00010001 contains all rotations of 0001). So a codeword and one of its rotations cannot both appear in a comma-free code. Furthermore, you can’t have periodic codewords in a comma-free code (0101 creates ambiguity when concatenated with itself as 01010101). These two facts mean that the number of Lyndon words of length n over an alphabet of size m is an upper bound on the maximum size of a comma-free code with the same parameters.

There’s a closed-form formula for the number of Lyndon words of length n over an alphabet of size m, but it involves the Möbius function. I’ll just use LW(n,m) to represent it here – it doesn’t matter what the formula is, I’m only interested in how close we can get to the optimal size LW(n,m) for various values of n and m.
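
For the curious, the formula is LW(n,m) = (1/n) · Σ μ(d) · m^(n/d), summed over the divisors d of n, where μ is the Möbius function. Here’s a short Python sketch of it (function names are mine) that you can use to reproduce the LW values quoted in this post:

```python
def mobius(k):
    """Möbius function: 0 if k has a squared prime factor, else (-1)^(number of prime factors)."""
    result = 1
    p = 2
    while p * p <= k:
        if k % p == 0:
            k //= p
            if k % p == 0:   # p divides k twice
                return 0
            result = -result
        p += 1
    if k > 1:                # one remaining prime factor
        result = -result
    return result

def lyndon_count(n, m):
    """Number of Lyndon words of length n over an alphabet of size m."""
    total = sum(mobius(d) * m ** (n // d) for d in range(1, n + 1) if n % d == 0)
    return total // n

for n, m in [(2, 4), (4, 5), (6, 3), (12, 2)]:
    print(n, m, lyndon_count(n, m))
```

Trial division is plenty fast for the small n that matter here.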

To tie all of this together, one way of viewing the process of creating a comma-free code that’s as large as possible is just selecting at most one rotation of each Lyndon word while avoiding conflicts. Let CF(n,m) be the maximum size of a comma-free code with words of length n over an alphabet of size m. If LW(n,m) = CF(n,m), then there’s some way of choosing one rotation of each Lyndon word to create a maximum-size comma-free code.

For all odd n, LW(n,m) = CF(n,m) and W.L. Eastman came up with an algorithm¹ that will create such a code. For even n, not much is known even for very small values of n and m. The case n=2 was solved by Golomb, Gordon, and Welch²: CF(2,m) = floor(m²/3), which is strictly less than LW(2,m) for m > 3.

For example, here’s a comma-free code for n=2, m=4:

{10, 12, 30, 31, 32}

LW(2,4) = 6 and the code above contains rotations of all Lyndon words except for 02. And that’s as good as you can do: by Golomb, Gordon, and Welch’s formula above, CF(2,4) = 5, so there’s no comma-free code that includes all Lyndon words for n=2, m=4.

Here’s a table of LW(n,m) - CF(n,m), where CF(n,m) is actually known:

	m=2	m=3	m=4	m=5	m=6	m=7	m=8	m=9	m=10
n=2	0	0	1	2	3	5	7	9	12	…	LW(n,m) - floor(m²/3)
n=3	0	0	0	0	0	0	0	0	0
n=4	0	0	3	11
n=5	0	0	0	0	0	0	0	0	0
n=6	0	3
n=7	0	0	0	0	0	0	0	0	0
n=8	0
n=9	0	0	0	0	0	0	0	0	0
n=10	0
n=11	0	0	0	0	0	0	0	0	0
n=12	1
n=13	0	0	0	0	0	0	0	0	0
n=14

Blanks in the table above are currently unknown. In particular, every even-n row after n=12 is unknown (odd n is always 0 by Eastman’s result). For even n, CF(n,m) gets really difficult to determine really quickly and LW(n,m) is only achievable for small values of n and m.

The three highlighted entries in the table, (n,m) = (4,5), (6,3), and (12,2), are state-of-the-art and apparently unknown at the time of the first printing of Volume 4, Fascicle 5. All of the entries in rows with even n above are within reach of at most a day’s computation on a decent computer using a SAT solver. In the rest of this post, I’ll describe how I calculated the values in the table above for n > 2.

Using a Solver to Find Comma-free Codes

The general recipe I used for finding comma-free codes with a SAT solver was:

Write a Python program that takes arguments n, m, and d on the command line and generates a DIMACS file with a formula that’s satisfiable exactly when there’s a comma-free code with codewords of length n over an alphabet of size m with d Lyndon words not chosen. So if you run with d=0 and get a satisfiable formula, LW(n,m) = CF(n,m).

Some of the variables in the formula are indicators for particular codewords being chosen in the code, so I generated comments in the DIMACS file (example) that would help me translate variables to codewords later.
For any pair of n, m I was interested in, I generated DIMACS files starting with d = 0 and ran Armin Biere’s kissat on the resulting files, increasing d and starting over until I found the smallest d where the resulting formula was satisfiable.
Using the satisfying assignment found by kissat, I could optionally extract a code by matching the assignment up with comments in the DIMACS input file.

With this method, I was able to fill in the table in the previous section. If you want to replicate this, you’ll need Python3 installed and a SAT solver like kissat. Download two scripts:

commafree.py: the input file generator
extract-code.py: a script that verifies and extracts comma-free codes

Using these tools, you should be able to extract your own codes:

$ commafree.py 4 2 0 > /tmp/commafree-4-2-0.cnf  # n=4, m=2, d=0
$ kissat /tmp/commafree-4-2-0.cnf > /tmp/kissat-4-2-0.out
$ [ $? -eq 10 ] && extract-code.py /tmp/commafree-4-2-0.cnf /tmp/kissat-4-2-0.out
{1000, 1001, 1011}

(The final command prints nothing if the formula was unsatisfiable, in which case you need to increment d and try again.)
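
If you’d rather not increment d by hand, the loop is easy to script. This is only a sketch under a few assumptions: commafree.py and kissat are both on your PATH and invoked exactly as in the commands above, kissat exits with 10 for satisfiable and 20 for unsatisfiable, and the function names are made up for illustration. It finds the smallest d but doesn’t extract the code:

```python
import os
import subprocess
import tempfile

def kissat_says_sat(n, m, d):
    """Generate the CNF for (n, m, d) and return True if kissat reports SAT."""
    cnf = subprocess.run(["commafree.py", str(n), str(m), str(d)],
                         capture_output=True, text=True, check=True).stdout
    with tempfile.NamedTemporaryFile("w", suffix=".cnf", delete=False) as f:
        f.write(cnf)
    try:
        code = subprocess.run(["kissat", f.name], capture_output=True).returncode
    finally:
        os.unlink(f.name)
    if code not in (10, 20):  # 10 = SAT, 20 = UNSAT
        raise RuntimeError(f"unexpected kissat exit code {code}")
    return code == 10

def smallest_d(n, m, is_sat=kissat_says_sat, max_d=None):
    """Return the smallest d for which is_sat(n, m, d) holds, or None if max_d is exceeded."""
    d = 0
    while max_d is None or d <= max_d:
        if is_sat(n, m, d):
            return d
        d += 1
    return None
```

Calling smallest_d(4, 2) would run the real tools. Passing a different is_sat callable lets you try the loop without a solver installed.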

Some of the blank entries in the table above should be solvable today with a bit more effort, particularly at the frontier, like (n,m) = (4,6), (6,4), and (8,3). Maybe these are already solvable by SAT solver tuning, symmetry-breaking or some other preprocessing of the CNF files.

The SAT Encoding

The heart of the process above is a Python program to generate DIMACS files, but it’s nothing too complicated. It does the following:

Generate candidate codewords by iterating over all Lyndon words and their rotations. This can be done with Knuth’s Algorithm 7.2.1.1 F from The Art of Computer Programming, Volume 4A. Associate each of these words with a variable that’s true exactly when the codeword has been chosen for the comma-free code.
Generate clauses for each Lyndon word that express “at most one codeword from this Lyndon word and its rotations is chosen”.
Iterate over all pairs x, y of codeword variables. For each pair, discover all codewords z that are disallowed in xy or yx by the comma-free property and generate clauses for such triples x, y, z that express “x, y, and z can’t all occur in the comma-free code”.
Generate variables and clauses for each Lyndon word such that the variables indicate “some rotation of this Lyndon word was chosen”.
Generate variables and clauses that sort the variables from the previous step and assert that the smallest d values are 0 and the (d+1)-st value is 1. I cheated here and didn’t do a full sort: I just applied a full row of comparators d+1 times – essentially d+1 rounds of bubblesort to sort the smallest d values only.

Again, you can check out the generator source for full details.
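
To make the first three steps concrete, here’s a stripped-down sketch. It is not commafree.py: it enumerates Lyndon words by brute force instead of using Algorithm F, it only handles alphabets of up to ten digits, it skips the counting constraints from the last two steps, and every name in it is made up for illustration. Clauses are lists of signed integers, as in DIMACS:

```python
from itertools import combinations, product

def rotations(w):
    return [w[i:] + w[:i] for i in range(len(w))]

def lyndon_words(n, m):
    """Brute-force list of Lyndon words of length n over the digits 0..m-1 (m <= 10)."""
    words = ("".join(t) for t in product("0123456789"[:m], repeat=n))
    return [w for w in words if all(w < r for r in rotations(w)[1:])]

def encode(n, m):
    """Return (var, clauses); var maps each candidate codeword to a variable number."""
    classes = [rotations(w) for w in lyndon_words(n, m)]
    var = {}
    for cls in classes:              # step 1: one variable per candidate codeword
        for w in cls:
            var[w] = len(var) + 1
    clauses = []
    for cls in classes:              # step 2: at most one rotation per Lyndon word
        for a, b in combinations(cls, 2):
            clauses.append([-var[a], -var[b]])
    for x, y in product(var, repeat=2):   # step 3: forbid z straddling x|y
        xy = x + y
        for k in range(1, n):
            z = xy[k:k + n]
            if z in var:
                clauses.append(sorted({-var[x], -var[y], -var[z]}))
    return var, clauses

def satisfied(chosen, var, clauses):
    """Check a set of chosen codewords against the clauses."""
    true_vars = {var[w] for w in chosen}
    return all(any((lit > 0) == (abs(lit) in true_vars) for lit in clause)
               for clause in clauses)

var, clauses = encode(4, 2)
print(len(var), len(clauses))
print(satisfied({"1000", "1001", "1011"}, var, clauses))
```

The last line checks the code printed by the example run above against these clauses. Iterating over ordered pairs in step 3 covers both xy and yx.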

New Codes

So, finally, here’s a dump of the new codes I was able to discover. All of these are now officially the maximum size comma-free codes for these values of n and m:

n=4, m=5, size=139 (LW(4,5) = 150):

{0012, 0013, 0014, 0100, 0102, 0103, 0104, 0110, 0111, 0112, 0113, 0114, 0204,
 0212, 0213, 0214, 0304, 0312, 0313, 0314, 1012, 1013, 1014, 2000, 2002, 2003,
 2004, 2012, 2013, 2014, 2022, 2023, 2024, 2032, 2033, 2034, 2100, 2102, 2103,
 2104, 2110, 2111, 2112, 2113, 2114, 2204, 2212, 2213, 2214, 2224, 2234, 2304,
 2312, 2313, 2314, 2324, 2334, 3000, 3002, 3003, 3004, 3012, 3013, 3014, 3022,
 3023, 3024, 3032, 3033, 3034, 3100, 3102, 3103, 3104, 3110, 3111, 3112, 3113,
 3114, 3204, 3212, 3213, 3214, 3224, 3234, 3304, 3312, 3313, 3314, 3324, 3334,
 4000, 4002, 4003, 4004, 4012, 4013, 4014, 4022, 4023, 4024, 4032, 4033, 4034,
 4042, 4043, 4044, 4100, 4102, 4103, 4104, 4110, 4111, 4112, 4113, 4114, 4122,
 4123, 4124, 4132, 4133, 4134, 4142, 4143, 4144, 4204, 4212, 4213, 4214, 4224,
 4234, 4244, 4304, 4312, 4313, 4314, 4324, 4334, 4344}

n=6, m=3, size=113 (LW(6,3) = 116):

{001000, 001100, 001101, 001102, 001200, 001201, 001202, 002000, 002100,
 002101, 002102, 002200, 002201, 002202, 010102, 011100, 011101, 011102,
 011110, 011111, 011200, 011201, 011202, 011210, 011211, 012100, 012101,
 012102, 012110, 012111, 012112, 012200, 012201, 012202, 012210, 012211,
 012220, 020100, 020102, 020110, 020120, 020200, 020210, 020220, 021100,
 021101, 021102, 021110, 021111, 021120, 021121, 021200, 021201, 021202,
 021210, 021211, 022100, 022101, 022102, 022110, 022111, 022112, 022120,
 022121, 022200, 022201, 022202, 022211, 022212, 101000, 101100, 101200,
 101201, 101202, 102000, 102100, 102200, 102201, 102202, 111200, 111201,
 111202, 111210, 111211, 112200, 112201, 112202, 112210, 112211, 112220,
 121200, 121201, 121202, 121210, 121211, 122120, 122121, 122211, 122212,
 212200, 212201, 212202, 212210, 212211, 212220, 222100, 222101, 222102,
 222200, 222201, 222202, 222211, 222212}

n=12, m=2, size=334 (LW(12,2) = 335):

{000001000011, 000010000000, 000010000001, 000010000011, 000010000101,
 000100000101, 000100001001, 000101000011, 000101001011, 000110000001,
 000110000011, 000110000101, 000110000111, 000110001001, 000110001011,
 000110001101, 000110001111, 000110010001, 000110010011, 000110010101,
 000110010111, 000110011001, 000110011011, 000110011101, 000110011111,
 000111000011, 000111001011, 001000001001, 001000010001, 001001000011,
 001001001011, 001001010011, 001010000001, 001010000011, 001010000101,
 001010001001, 001010001011, 001010010001, 001010010011, 001010010101,
 001100000001, 001100000101, 001100001001, 001100001101, 001100010001,
 001100010101, 001100011101, 001101000011, 001101001011, 001101010011,
 001110000001, 001110000011, 001110000101, 001110001001, 001110001011,
 001110001101, 001110001111, 001110010001, 001110010011, 001110010101,
 001110011001, 001110011011, 001110011101, 001110011111, 001111000000,
 001111000011, 001111001011, 001111010011, 010000010001, 010001000011,
 010001010011, 010010000001, 010010000011, 010010000101, 010010010001,
 010010010011, 010010010101, 010010100011, 010100000001, 010100000101,
 010100001001, 010100010001, 010100010101, 010101000011, 010101001011,
 010101010011, 010110000001, 010110000011, 010110000101, 010110000111,
 010110001001, 010110001011, 010110001101, 010110001111, 010110010001,
 010110010011, 010110010101, 010110010111, 010110011001, 010110011011,
 010110011101, 010110011111, 010110100011, 010110100111, 010110101011,
 010111000011, 010111001011, 010111010011, 010111011011, 010111100011,
 010111101011, 010111111011, 011000000000, 011000000001, 011000001001,
 011000010001, 011001000011, 011001001011, 011001010011, 011001011011,
 011001101011, 011001111011, 011010000001, 011010000011, 011010000101,
 011010001001, 011010001011, 011010010001, 011010010011, 011010010101,
 011010011001, 011010011011, 011010100011, 011010101011, 011100000000,
 011100000001, 011100000101, 011100001001, 011100010001, 011100010101,
 011100011101, 011101000011, 011101001011, 011101010011, 011101011011,
 011101111011, 011110000011, 011110000101, 011110001001, 011110001011,
 011110010001, 011110010011, 011110010101, 011110011001, 011110011011,
 011110011101, 011110011111, 011110100011, 011110101011, 011111000011,
 011111001011, 011111010011, 011111011011, 011111100011, 011111101011,
 011111111011, 100001000011, 100010000000, 100010000001, 100010000011,
 100010000101, 100010100011, 100100000000, 100100000001, 100100000101,
 100100001001, 100101000011, 100101001011, 100110000001, 100110000011,
 100110000101, 100110000111, 100110001001, 100110001011, 100110001101,
 100110001111, 100110100011, 100110100111, 100110101011, 100111000011,
 100111001011, 100111100011, 100111101011, 101000000000, 101000000001,
 101000001001, 101000010001, 101001000011, 101001001011, 101001010011,
 101010000001, 101010000011, 101010000101, 101010001001, 101010001011,
 101010010001, 101010010011, 101010010101, 101010100011, 101010101011,
 101100000001, 101100000101, 101100001001, 101100001101, 101100010001,
 101100010101, 101100011101, 101101000011, 101101001011, 101101010011,
 101101011011, 101110000001, 101110000011, 101110000101, 101110001001,
 101110001011, 101110001101, 101110001111, 101110010001, 101110010011,
 101110010101, 101110011001, 101110011011, 101110011101, 101110011111,
 101110100011, 101110101011, 101111000000, 101111000011, 101111001011,
 101111010011, 101111100011, 101111101011, 101111111011, 110000010001,
 110001000011, 110001010011, 110010000001, 110010000011, 110010000101,
 110010010001, 110010010011, 110010010101, 110010100011, 110100000001,
 110100000101, 110100001001, 110100010001, 110100010101, 110101000011,
 110101001011, 110101010011, 110110000001, 110110000011, 110110000101,
 110110000111, 110110001001, 110110001011, 110110001101, 110110001111,
 110110010001, 110110010011, 110110010101, 110110010111, 110110011001,
 110110011011, 110110011101, 110110011111, 110110100011, 110110100111,
 110110101011, 110111000011, 110111001011, 110111010011, 110111010111,
 110111011011, 110111100011, 110111101011, 110111111011, 111000001001,
 111000010001, 111001000011, 111001001011, 111001010011, 111001101011,
 111001111011, 111010000001, 111010000011, 111010000101, 111010001001,
 111010001011, 111010010001, 111010010011, 111010010101, 111010100011,
 111010101011, 111100000101, 111100001001, 111100010001, 111100010101,
 111101000011, 111101001011, 111101010011, 111110000000, 111110000001,
 111110000011, 111110000101, 111110001001, 111110001011, 111110010001,
 111110010011, 111110010101, 111110011001, 111110011011, 111110011101,
 111110011111, 111110100011, 111110101011, 111111000011, 111111001011,
 111111010011, 111111100011, 111111101011, 111111111011}

Eastman, W. L., On the Construction of Comma-Free Codes, IEEE Trans. IT-11 (1965), pp 263-267.
Golomb, S. W., B. Gordon, and L. R. Welch, Comma-free codes, Can. J. Math, vol. 10, 1958, pp 202-209.

Twenty Questions with z3

Sun, 01 Nov 2020 21:05:32 +0000

Don Knuth’s Volume 4, Fascicle 5 of The Art of Computer Programming has some great combinatorial puzzles in the exercises, including a variant of a puzzle called “Twenty Questions” invented by Donald Woods.

Donald Woods has written up some history of the problem that probably serves as a better introduction than I could give. Knuth introduces the puzzle as an exercise in backtracking (and has written a fast backtracking solver for a variant of the problem here), but you can also solve Twenty Questions using a SAT solver, and that’s what I’ll describe in this post.

I’ll use z3 (with its Python frontend) to find the solution to the puzzle. z3 is technically a little more than just a SAT solver, but the encoding of the problem in this post could easily be mapped down to a “pure” boolean formula and fed to a SAT solver if you were patient and careful enough.

The Questions

The questions start off easy, but the first few answers clearly depend on answers to other questions:

1. The first question whose answer is A is:

   (A) 1  (B) 2  (C) 3  (D) 4  (E) 5

2. The next question with the same answer as this one is:

   (A) 4  (B) 6  (C) 8  (D) 10  (E) 12

But wait, it gets harder! Instead of one question referencing another’s answer, some questions reference the distribution of all answers to the entire set of questions:

7. The answer that appears most often (possibly tied) is:

   (A) A  (B) B  (C) C  (D) D  (E) E

8. Ignoring those that occur equally often, the answer that appears least often is:

   (A) A  (B) B  (C) C  (D) D  (E) E

...

18. The number of prime-numbered questions whose answers are vowels is:

   (A) prime  (B) square  (C) odd  (D) even  (E) zero

Finally, question 20 doesn’t just require you to know answers to other questions in the quiz, you have to know the optimal score over any set of answers to the quiz:

20. The maximum score that can be achieved on this test is:

   (A) 18  (B) 19  (C) 20  (D) indeterminate
   (E) achievable only by getting this question wrong

You might want to leave question 20 out and solve 1-19 separately, then use the results to figure out 20. But remember, some of the first 19 questions refer to the distribution of answers on the quiz, which includes the answer to question 20!

Using z3 to find solutions

My general idea in encoding this problem as a boolean formula is:

Start by defining 100 variables: x1a, x1b, x1c, x1d, x1e, x2a, x2b, …, x20e. x1a means “Question 1 is marked A”, and so on.
Make sure that each question has exactly one answer by adding some constraints to x1a-x1e, x2a-x2e, etc.
Define 20 more variables: x1, x2, x3, …, x20. x1 means “The answer to question 1 is correct”. Each of these is defined by an expression involving the original 100 variables that represent answers to the questions.

Then we just let z3 run and try to satisfy as many of x1, x2, …, x20 as it can and interpret the results.
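
The “exactly one answer” constraint is the easiest piece to map down to a pure boolean formula. As a sketch of what that looks like in CNF (this is not how my z3 program states it, and numbering x1a through x1e as variables 1 to 5 is just an assumption for the example), you need one “at least one” clause plus a pairwise “at most one” clause for every pair:

```python
from itertools import combinations, product

def exactly_one(variables):
    """CNF clauses (lists of signed ints) forcing exactly one variable to be true."""
    clauses = [list(variables)]                                   # at least one
    clauses += [[-a, -b] for a, b in combinations(variables, 2)]  # at most one
    return clauses

def holds(assignment, clauses):
    """assignment maps variable number -> bool."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

# Hypothetical numbering: variables 1..5 stand for x1a..x1e.
x1_vars = [1, 2, 3, 4, 5]
clauses = exactly_one(x1_vars)

# Brute-force all 32 assignments to confirm only the 5 one-hot ones survive.
models = [bits for bits in product([False, True], repeat=5)
          if holds(dict(zip(x1_vars, bits)), clauses)]
print(len(clauses), len(models))
```

Five variables only need 11 clauses this way, so repeating it for all 20 questions is cheap.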

The Encoding

My program starts off by declaring the boolean variables, something like:

from z3 import *

x1 = Bool('x1')
...
x20 = Bool('x20')

x1a = Bool('x1a')
...
x20e = Bool('x20e')

I also want to write some helper functions that iterate over variables and answers, so I’ll define:

answers = [None]
answers.append(dict(zip(['A','B','C','D','E'],
                        [x1a, x1b, x1c, x1d, x1e])))
...
answers.append(dict(zip(['A','B','C','D','E'],
                        [x20a, x20b, x20c, x20d, x20e])))

so that answers[7]['D'] gives you x7d, for example. Also, there’s

correct = [None] + \
          [x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20]

so that correct[13] is x13.

I need to treat booleans as integers sometimes, so I do that with:

def btoi(x): return If(x,1,0)

Finally, I want to reduce some sequences of expressions using And, Or, and +, but functional programming is hard to read in Python, so I’ll use a few more helper functions:

from functools import reduce  # reduce is not a builtin in Python 3

def SumOf(xs): return reduce(lambda x,y: x+y, xs)
def AndOf(xs): return reduce(lambda x,y: And(x,y), xs)
def OrOf(xs): return reduce(lambda x,y: Or(x,y), xs)

Now we can start defining x1, x2, etc. in terms of the answers. In the snippets below, s is a z3 Solver instance created with s = Solver().

# 1. The first question whose answer is A is:
#     (A) 1  (B) 2  (C) 3  (D) 4  (E) 5

s.add(x1 == Or(x1a,
               And(x1b,x2a),
               And(x1c,x3a,Not(x2a)),
               And(x1d,x4a,Not(x2a),Not(x3a)),
               And(x1e,x5a,Not(x2a),Not(x3a),Not(x4a))))

z3 and Python make it easy to express some of the more complicated constraints:

# 3. The only two consecutive questions with identical answers are questions:
#    (A) 15 and 16  (B) 16 and 17  (C) 17 and 18  (D) 18 and 19  (E) 19 and 20

def same_answer(i,j):
    return Or(And(answers[i]['A'],answers[j]['A']),
              And(answers[i]['B'],answers[j]['B']),
              And(answers[i]['C'],answers[j]['C']),
              And(answers[i]['D'],answers[j]['D']),
              And(answers[i]['E'],answers[j]['E']))

s.add(x3 == Or(And(x3a, same_answer(15, 16),
                   AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 15)),
               And(x3b, same_answer(16, 17),
                   AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 16)),
               And(x3c, same_answer(17, 18),
                   AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 17)),
               And(x3d, same_answer(18, 19),
                   AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 18)),
               And(x3e, same_answer(19, 20),
                   AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 19))))

A few of these were a little tricky. Question 8 wants the answer that occurs least often among all answers that don’t occur the same number of times as another answer. So if A and B are both chosen 3 times and C is chosen 4 times, C would be the correct answer.

# 8. Ignoring those that occur equally often, the answer that appears least
#    often is:
#    (A) A  (B) B  (C) C  (D) D  (E) E

def other_answers(ans): return set(['A','B','C','D','E']) - set([ans])

def answer_tally(z): return SumOf(btoi(answers[i][z]) for i in range(1,21))

def least_of_distinct(ans):
    others = other_answers(ans)
    clauses = [answer_tally(ans) != answer_tally(x) for x in others]
    for x in others:
        remains = others - set([x])
        clauses.append(Or(answer_tally(ans) < answer_tally(x),
                          OrOf(answer_tally(x) == answer_tally(y) for y in remains)))
    return AndOf(clauses)

s.add(x8 == Or(And(x8a, least_of_distinct('A')),
               And(x8b, least_of_distinct('B')),
               And(x8c, least_of_distinct('C')),
               And(x8d, least_of_distinct('D')),
               And(x8e, least_of_distinct('E'))))
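
If the z3 version is hard to read, it may help to see the same rule on concrete numbers. This plain-Python sketch (the function name and the sample tallies are mine) takes a finished tally and returns the letter question 8 is asking for, or None if every count is tied with another:

```python
from collections import Counter

def least_of_distinct_answer(tally):
    """tally maps 'A'..'E' to counts. Return the letter with the smallest count
    among the counts that occur exactly once, or None if there is no such count."""
    freq = Counter(tally.values())
    unique = {letter: n for letter, n in tally.items() if freq[n] == 1}
    if not unique:
        return None
    return min(unique, key=unique.get)

# A and B tie at 3, D and E tie at 5, so C is the only candidate.
print(least_of_distinct_answer({"A": 3, "B": 3, "C": 4, "D": 5, "E": 5}))
```

The z3 constraint says the same thing declaratively: the chosen letter’s tally is unique, and every other letter either has a larger tally or is tied with something.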

All of the other questions up through 19 are pretty straightforward. But what should we do about question 20?

20. The maximum score that can be achieved on this test is:

   (A) 18  (B) 19  (C) 20  (D) indeterminate
   (E) achievable only by getting this question wrong

We can’t correctly grade it just based on one solution to all of x1, x2, … x19. But if we can find at least one valid solution with the first 19 correct, then C is correct. And if we can’t find a solution with the first 19 correct, then maybe we can find one with 18 or 17 correct and fall back on B or A. So we’ll punt for now and define it optimistically as:

s.add(x20 == x20c)

We’ll hope to work down from there, re-defining it as needed.

Finding the Solution(s)

Once we’ve defined x1 through x20, we can use z3 to check whether our formula is satisfiable and, if so, get a model that maps variables to a satisfying assignment:

if s.check() == sat:
    print_solution(s.model())

My print_solution just prints out all of the answers using uppercase letters if the question is correct and lowercase otherwise. Now we can add one more constraint to maximize the score:

s.add(SumOf(btoi(correct[i]) for i in range(1,21)) >= 20)

When we do that, we see there’s no satisfying assignment at 20, so we reset our expectations and re-define x20 as:

s.add(x20 == x20b)

We update our maximization constraint to:

s.add(SumOf(btoi(correct[i]) for i in range(1,21)) >= 19)

and then find out that there is a solution with 19 correct:

A E D C A B C D C A C E D B C A D A A c

What would it mean if this were the only solution with 19 correct? Then 20(E), which says that the maximum score is achievable only by getting 20 wrong, would be correct! And once we knew that 20(E) was correct, maybe there would be a solution with all 20 correct. But if that optimal solution with 20(E) correct stands, then 20(E) (“the maximum score is achievable only by getting 20 wrong”) is necessarily incorrect! And if we can get stuck in that kind of circular reasoning going back and forth between 20(E) being true and false, doesn’t that mean that 20(D) (“the answer to 20 is indeterminate”) is then true?

Before we go too far down that path, we need to keep looking for solutions with 19 questions correct. Clearly, we need to iterate over all satisfying assignments at a given score to make any sense of a solution to 20. To do that, we can modify our check/print logic:

while s.check() == sat:
    m = s.model()
    print_solution(m)
    block_solution(s,m)

Where block_solution adds a new clause to the solver that prohibits the most recent solution:

def block_solution(s, m):
    ans = [v for vv in [answers[i].values() for i in range(1,21)] for v in vv]
    s.add(Not(And(AndOf(v == m[v] for v in ans),
                  AndOf(v == m[v] for v in correct[1:]))))

Now when we run the solver and iterate over all solutions with score at least 19, we get:

A E D C A B C D C A C E D B C A D A A c
D C E A B A D C D A E D A E D B D B E e
D C E A B E B C E A B E A E D B D A b B

Whew, so that settles it – 20(E) is incorrect since there is one way to achieve the maximum possible score (19) without getting question 20 incorrect. And 20(D) is clearly incorrect since 20(E) is incorrect and there is an absolute achievable best score for the quiz.

So B is the only correct answer for question 20 and the above three answers are the only three ways to achieve the maximum score of 19. We’re done!

Final Thoughts

Question 20 is supposed to be the kicker, but question 9 is also a little weird:

9. The sum of all question numbers whose answers are correct and the same as
   this one is in the range:
   (A) 59 to 62, inclusive
   (B) 52 to 55, inclusive
   (C) 44 to 49, inclusive
   (D) 61 to 67, inclusive
   (E) 44 to 53, inclusive

It’s the only one of the questions other than 20 that depends on its own correctness. You can see this in the way I encoded it:

def answer_sum(ans):
    return SumOf(If(And(correct[i],answers[i][ans]), i, 0) for i in range(1,21))

s.add(x9 == Or(And(x9a, answer_sum('A') >= 59, answer_sum('A') <= 62),
               And(x9b, answer_sum('B') >= 52, answer_sum('B') <= 55),
               And(x9c, answer_sum('C') >= 44, answer_sum('C') <= 49),
               And(x9d, answer_sum('D') >= 61, answer_sum('D') <= 67),
               And(x9e, answer_sum('E') >= 44, answer_sum('E') <= 53)))

When I say it “depends on its own correctness”, I just mean that x9 appears on both the left and the right side of the == in the definition of x9 (recall that correct[9] is just x9).

Think about any specific case where x9 is correct: for example, if question 9 is marked “A” and the sum of all question numbers with A as the answer that are correct (including question 9) is, say, 61. 9(A) is clearly correct here, but a grader would be just as correct if they marked 9 wrong: in that case, the sum of all question numbers that are correct and have A as the answer is now 61 - 9 = 52, which is no longer in the range [59, 62].

Subtracting 9 moves the sum below every range in question 9, with one exception: a sum of exactly 53 under answer E drops to 44, which is still in [44, 53]. Outside of that corner case, any time question 9 is marked correct, it can also be marked incorrect – it’s all up to the grader! This sort of choice in grading never really comes into play since we’re always looking to maximize the score, but it does make you think about clever alternatives for question 20 that could introduce some indeterminate conclusions.
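
A quick way to check when flipping question 9 from correct to incorrect is actually allowed: removing question 9 subtracts 9 from the sum, so the flip works whenever the smaller sum falls outside the chosen range. This sketch (names are mine) lists, for each choice, the sums where the smaller sum is still inside the range:

```python
# The five ranges from question 9, inclusive on both ends.
ranges = {"A": (59, 62), "B": (52, 55), "C": (44, 49), "D": (61, 67), "E": (44, 53)}

def still_in_range_without_9(lo, hi):
    """Sums s in [lo, hi] for which s - 9 also lands in [lo, hi]."""
    return [s for s in range(lo, hi + 1) if lo <= s - 9 <= hi]

exceptions = {ans: still_in_range_without_9(lo, hi) for ans, (lo, hi) in ranges.items()}
print(exceptions)
```

The only sum it reports is 53 under answer E, where 53 - 9 = 44 stays inside [44, 53]. For every other sum the grader really does have a free choice.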

In fact, if you look back at my definition for block_solution, you’ll see that it blocks both the answers (x1a, x2b, etc.) and whether or not they’re correct (x1, x2, …) when blocking a solution. You have to do this to get correct results, otherwise the solver could come across a solution where question 9 could be graded correct to get a score of 19 but choose to grade it incorrect instead, giving it a score of 18. If you only block the choices of answers, that solution gets blocked and the solver is never able to discover the “alternative grading” that would give it 19 points.

More Final Thoughts

My full Python solution is here if you want to run it or play around with the constraints.

If you think this is interesting, you should check out Knuth’s treatment of a variant of this problem in Exercises 7.2.2.71-72 of Volume 4, Fascicle 5 of The Art of Computer Programming. The latter exercise plays around with constraints to see what happens when the answer to question 20 isn’t as clear-cut as it was in Donald Woods’s original problem.

Simulating Levenshtein Automata

Fri, 25 Aug 2017 21:58:17 +0000

A Levenshtein Automaton is a finite state machine that recognizes all strings within a given edit distance from a fixed string. Here’s a Levenshtein Automaton that accepts all strings within edit distance 2 from “banana”:

The epsilon transitions represent the empty string. The * transitions are shorthand for “any character” to save space in the diagram, but the actual automaton has one transition for every possible input character anywhere you see a *. The automaton accepts a string s exactly when there’s a directed path from the start state on the lower left to any of the accept states on the right such that the concatenation of all of the labels on the path in order equals s.

The automaton above is an NFA but it looks like three copies of a DFA accepting the string “banana” stacked on top of each other. Transitions between the three DFAs represent edit operations under the Levenshtein metric: a transition going up represents the insertion of a character, a diagonal epsilon transition represents the deletion of a character, and a diagonal * transition represents a substitution.

Levenshtein Automata can be used as one part of a data structure that generates spelling corrections. The other component is a Trie that contains all correctly spelled words. Any word that’s accepted by both the Trie and the Levenshtein Automaton is a word that’s correctly spelled and up to edit distance d from the query. Given a query, you’d generate a Levenshtein Automaton for that query with the desired edit distance and then traverse both the automaton and the Trie in parallel, yielding a word whenever you reach an accept state in the Levenshtein Automaton and at a leaf node in the Trie at the same time.

Generating a non-deterministic Levenshtein Automaton is straightforward. The node and edge structure of the automaton above for the query “banana” isn’t a function of the word “banana” at all – only the transition labels would be different if you wanted to create a similar automaton accepting anything within edit distance 2 of any other six-letter word. Increasing or decreasing the edit distance just involves adding or removing one or more rows of identical states. Increasing or decreasing the length of the fixed word just involves adding or removing one or more columns.

Unfortunately, even though generating a non-deterministic Levenshtein Automaton is easy, simulating it efficiently isn’t. In general, simulating all execution paths of an NFA with n states on an input of length m can take time O(nm) just to keep up with the bookkeeping: a state in the simulation is a subset of states in the NFA and you have to update that set of up to n states m times during the simulation.
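
To make the naive approach concrete, here’s a Python sketch of that subset simulation for the Levenshtein NFA described above. It’s the straightforward version, not the optimized one, and the state naming is my own: a state (i, e) means “i characters of the query matched so far, using e edits”:

```python
def closure(states, n, d):
    """Follow epsilon (deletion) transitions: (i, e) -> (i + 1, e + 1)."""
    out = set(states)
    for i, e in states:
        while i < n and e < d:
            i, e = i + 1, e + 1
            out.add((i, e))
    return out

def accepts(query, s, d):
    """True if s is within edit distance d of query, by simulating the NFA."""
    n = len(query)
    current = closure({(0, 0)}, n, d)
    for c in s:
        nxt = set()
        for i, e in current:
            if i < n and query[i] == c:
                nxt.add((i + 1, e))          # match: move right in the same row
            if e < d:
                nxt.add((i, e + 1))          # insertion: move straight up
                if i < n:
                    nxt.add((i + 1, e + 1))  # substitution: diagonal * transition
        current = closure(nxt, n, d)
        if not current:
            return False                     # every execution path has died
    return any(i == n for i, e in current)

print(accepts("banana", "bahama", 2))
print(accepts("banana", "bahama", 1))
```

Each input character touches every live state, and there can be up to (len(query) + 1) · (d + 1) of them, which is exactly the bookkeeping cost described above.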

Schulz and Mihov introduced Levenshtein Automata and showed how to generate and simulate them in linear time in the length of an input string. The implementation they describe involves a preprocessing step that creates a DFA defined by a transition table whose size grows very quickly with the edit distance d. The implementation starts with a two-dimensional table from which many of the entries can be removed because they represent dead states. For d=1, a 5 x 8 table is reduced to just 9 entries; for d=2, a 30 x 32 table is reduced to one with just 80 entries. For d=3 and 4, the tables start with 196 x 128 = 25,088 and 1352 x 512 = 692,224 entries, respectively, before removing dead states.

The Lucene project implemented Schulz and Mihov’s scheme, but only for d = 1 and d = 2. Their implementation uses a Python script from another project, Moman, to generate Java code with the transition tables offline.

It’s hard to beat a table-driven DFA for matching regular expressions, but on the other hand, it’s not clear that the simulation of the Levenshtein Automaton is the bottleneck in a spelling corrector. The size of the Levenshtein Automaton is dwarfed by the size of the Trie containing the correctly spelled words, and since query processing involves simulating both the Trie and the Levenshtein Automaton in parallel, the main bottleneck in the simulation will likely be the I/O expense of loading nodes for the Trie. Because of their high and irregular branching factor, Tries are all but impossible to lay out in memory with any kind of locality of reference for an arbitrary query.

So maybe there’s a simpler way to simulate Levenshtein Automata that’s theoretically slower but will give us about the same real-world performance. Jules Jacobs recently wrote a post describing a pretty substantial simplification that simulates an automaton using states based on rows in the two-dimensional matrix of the Wagner-Fischer algorithm to compute edit distance. The simulation takes O(nd) time for an input of length n, which is essentially as good as an implementation based on Schulz and Mihov’s scheme since d is often small and fixed.

You can also derive an O(nd) time simulation by just directly simulating the NFA that I described at the beginning of this post. You just need to make a few optimizations based on the highly regular structure of this family of NFAs, but all of the optimizations are relatively straightforward. I’ll describe those optimizations in this post. I’ve also implemented everything I describe here in a Go package called levtrie that provides a map-like interface to a Trie equipped with a Levenshtein Automaton.

Alternatives to Levenshtein Automata

First, some more background. You can skip this section and the next if you already understand why Levenshtein Automata are a good choice for indexing a set of words by edit distance.

Edit distance is a metric and there are many data structures that index keys by distance under various metrics. Maybe the most appropriate metric tree for edit distance is the BK-Tree. The BK-Tree, like other metric trees, has two big disadvantages: first, the layout of the tree is highly dependent on the distribution and the insertion order, so it’s hard to quote good bounds on how balanced the tree is in general. Second, during lookups, you have to compute the distance function at each node along the search path, which can be expensive for a metric like edit distance that takes quadratic time to compute (or O(nd) if you optimize for computing distance at most d).
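For reference, here’s a minimal sketch of the quadratic-time edit distance computation mentioned above, using the standard Wagner-Fischer dynamic program and keeping only one row at a time. The function name edit_distance is my own choice for illustration and isn’t taken from any package discussed in this post.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b via the Wagner-Fischer recurrence."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]
```

Every lookup in a metric tree pays for one of these computations at each node it visits, which is the cost the rest of this post tries to avoid.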

Locality-sensitive hashing is another option, but, like metric trees, even after hashing to a bucket you’re left with a set of candidates on which you need to exhaustively calculate edit distance. It’s also very difficult with most metrics to get anywhere close to perfect recall with locality-sensitive hashing, and perfect recall is essential to spelling correction since there are often just a few good correction candidates.

Still another alternative is to index the n-grams of all correctly spelled words and put them in an inverted index. At query time, you’d break up the query string into n-grams and search the inverted index for them all, running an actual edit distance computation as a final pass on all of the candidates that come back. This doesn’t always work well in practice, since, for example, if you’re trying to retrieve “bahama” from the query “banana” (edit distance 2), none of the 3-grams match (bahama breaks up into “bah”, “aha”, “ham”, “ama” and banana breaks up into “ban”, “ana”, “nan”, and “ana”). Even moving down to 2-grams doesn’t help much; only the leading 2-gram “ba” matches so you’d have to retrieve all strings that start with “ba” and exhaustively test edit distance on all of them to find “bahama”.
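The n-gram overlap in that example is easy to check with a few lines of Python. This is just an illustrative sketch; ngrams and shared_ngrams are hypothetical helper names.

```python
def ngrams(s, n):
    """All overlapping substrings of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def shared_ngrams(a, b, n):
    """The n-grams that a and b have in common."""
    return set(ngrams(a, n)) & set(ngrams(b, n))
```

With these helpers, shared_ngrams("bahama", "banana", 3) is empty and shared_ngrams("bahama", "banana", 2) contains only "ba".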

In contrast to all of the methods described above, using a Trie with a Levenshtein Automaton doesn’t ever require exhaustively calculating edit distance during lookups: the cost of computing edit distance is incremental and shared among many candidates that share paths in the Trie. A Trie can also efficiently return all strings that have the query string as a prefix, which is a popular feature with most on-the-fly spelling correctors: instead of waiting for someone to type the entire word “banona” to return the suggestion “banana”, you can start returning suggestions as soon as they’ve typed “ban” or even “ba” by exploring paths from those prefixes in the Trie.

Finding edits in a Trie without automata

Levenshtein Automata are used to generate an optimized Trie traversal but you can actually build a quick-and-dirty spelling corrector using just a Trie. I’ll walk through that construction in this section since it motivates why you’d want to augment a Trie with a Levenshtein Automaton in the first place.

Suppose that your query string is “banana”. To figure out if that exact string is in the Trie, you’d just use the sequence of characters in the string to find a path through the Trie: starting from the root, transition on the “b” edge to a node one level down, then transition on an “a” edge, then an “n” edge, and so on, until you’re at the end of the string. If the word is in the Trie, then at the end of the traversal you will have reached a node that represents the last character of that word. Otherwise, you will have stopped at some point along the way because there wasn’t an edge available to make the transition you needed to make, in which case you know the word you’re looking up isn’t in the Trie.

Instead, if you wanted to find both exact matches to “banana” and words in the Trie that were a few edits away, you could extend the search process to branch out a little and try paths that correspond to edits. If you want to find words that are at most, say, 2 edits away from “banana”, you could start your traversal at the root but perform the following four searches while keeping track of an edit budget that’s initially 2:

Simulate no edit: Move from the root to the second level of the Trie on the edge labeled “b”. Keep your edit budget at 2 and set the remaining string to match to “anana”.
Simulate an insertion before the first character: For every edge out of the root of the Trie, move to the second level of the Trie along that edge. Decrement your edit budget to 1 and keep the remaining string to match set to “banana”.
Simulate a deletion of the first character: Don’t move from the root of the Trie at all, simply decrement your edit budget to 1 and update the string you want to match to “anana”.
Simulate a substitution for the first character: For every edge out of the root of the Trie except the edge labeled “b”, move to the second level of the Trie along that edge, decrement your edit budget to 1, and update the remaining string you want to match to “anana”.

Now just keep recursively applying these cases at each new node you explore and stop the traversal once you reach an edit budget that’s negative. If you ever get to a node that ends a word with a non-negative edit budget at least as big as the remaining string length, return that word as a match: the leftover budget pays for deleting the rest of the query.
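The four cases above can be sketched in Python roughly as follows. This is a toy version under my own assumptions: the Trie is a nested dict, the END marker key and the names build_trie, search and fuzzy_lookup are made up for illustration, and a match is reported at any node that ends a word (not only at leaves) whenever the remaining budget covers the rest of the query.

```python
END = "$end"  # hypothetical marker key; real edges are single characters


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def search(node, remaining, budget, found):
    if budget < 0:
        return
    # The leftover budget can pay for deleting the rest of the query.
    if END in node and budget >= len(remaining):
        found.add(node[END])
    for ch, child in node.items():
        if ch == END:
            continue
        if remaining and ch == remaining[0]:
            search(child, remaining[1:], budget, found)      # no edit
        elif remaining:
            search(child, remaining[1:], budget - 1, found)  # substitution
        search(child, remaining, budget - 1, found)          # insertion
    if remaining:
        search(node, remaining[1:], budget - 1, found)       # deletion


def fuzzy_lookup(trie, query, budget):
    found = set()
    search(trie, query, budget, found)
    return found
```

For example, looking up “banana” with a budget of 2 in a Trie built from “banana”, “bahama”, “bandana”, “band” and “apple” returns the first three words. The sketch does no deduplication at all, so it revisits the same Trie nodes many times.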

This traversal will discover all strings in the Trie that are within a fixed edit distance of a query but the traversal does a lot of repeated work. You can generate the correctly-spelled word “banana” from the misspelled word “banaba” using either a substitution, a deletion followed by an insertion, or an insertion followed by a deletion. This means that in the traversal we defined above, we’ll visit the node in the Trie defined by “banan” at least three times from three different search paths. The search also has degenerate paths that just burn the error budget but do no useful work, for example a deletion followed by an insertion of the same character. These paths just put you back in the same position the traversal started from with a smaller edit budget.

Again, because of their large branching factor, Tries don’t have good locality of reference. Each time you follow a pointer to another node, you’re likely jumping to memory that at the very least causes a cache miss, so popping the same state several times to explore can be expensive. You might think you could optimize this a little by marking Trie nodes as “visited” when you first see them and avoiding exploring visited nodes more than once. But you can also see the same node through different search paths with different edit budgets, so you’d have to store more than just a visited flag – you’d at least need to store the largest edit budget that you’d visited the node with. If you ever saw the node again on a search path with an edit budget no larger than that, you could prune that portion of the search, but that still means that you could end up exploring a node up to d times on a search for words within edit distance d.

A Levenshtein Automaton maintains all of the search state so that you never have to traverse a path in the Trie more than once. If you use a deterministic Levenshtein Automaton like Schulz and Mihov’s original scheme, it’s really as efficient a way to encode the search state as you can get: each transition from state to state in the automaton is just an O(1) table lookup.

There’s some history of people rediscovering ways to maintain the search state through more or less efficient means: Steve Hanov described a method of keeping track of the search state using the Wagner-Fischer matrix that allows you to update states in time O(m) when searching for a string of length m. The method that Jules Jacobs describes is similar but optimized even further to get an O(d) update time for edit distance d, regardless of the length of the input string. The method I’ll describe in the next two sections also has an O(d) time bound on state transitions but it isn’t directly derived from the Wagner-Fischer edit distance algorithm.

An O(nd²)-time simulation

Now back to the original Levenshtein NFA construction at the beginning of this post. Instead of creating a DFA from the NFA, we just want to simulate the NFA directly. To simulate one, we need to maintain a set of active states as we read input characters, accepting exactly when we’ve read all of the input and there’s at least one accepting state in our current set of active states.

We initialize the set of active states to contain just the NFA’s start state plus anything reachable by an epsilon transition. On each input character, we create an initially empty new set of active states and iterate through all current active states, trying to take any valid transition from each state on the current input character and adding the resulting state and anything else reachable by epsilon transitions to the new set of active states if we succeed. When we’re done with iterating through all current active states, the new set of active states becomes current and we proceed to the next input character. If we’re done reading input characters and there’s an accept state in our set of active states, we accept, otherwise we reject.
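Here’s a minimal Python sketch of that set-based simulation. It assumes the usual layout of a Levenshtein NFA: a state is a pair (i, e) meaning “i characters of the word consumed with e edits”, a match moves right, an insertion moves up, a substitution moves diagonally, and a deletion is an epsilon transition along the diagonal. The function names are hypothetical.

```python
def epsilon_closure(states, m, d):
    """Add everything reachable by epsilon (deletion) moves (i, e) -> (i+1, e+1)."""
    result = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        nxt = (i + 1, e + 1)
        if i < m and e < d and nxt not in result:
            result.add(nxt)
            stack.append(nxt)
    return result


def levenshtein_nfa_accepts(word, d, text):
    """True exactly when text is within edit distance d of word."""
    m = len(word)
    active = epsilon_closure({(0, 0)}, m, d)
    for ch in text:
        new = set()
        for i, e in active:
            if i < m and word[i] == ch:
                new.add((i + 1, e))          # horizontal: characters match
            if e < d:
                new.add((i, e + 1))          # up: ch was inserted
                if i < m:
                    new.add((i + 1, e + 1))  # diagonal: ch substituted for word[i]
        active = epsilon_closure(new, m, d)
    return any(i == m for i, _ in active)
```

Calling levenshtein_nfa_accepts("banana", 2, "bahama") returns True, while the same call with d = 1 returns False.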

Let’s walk through a simulation of the Levenshtein NFA that accepts all words within edit distance 2 of “banana”. We’ll feed it the input string “bahama”. The set of active states is highlighted in blue at each step below. First, the initial set of active states contains all states on the diagonal that starts at the NFA’s initial state:

After consuming “b”, active states are again highlighted in blue:

After consuming “ba”:

After consuming “bah”:

After consuming “baha”:

After consuming “baham”:

And finally, after consuming “bahama”, a string that’s edit distance 2 from “banana”, we end up in an accepting state:

We want to bound the time complexity of simulating an NFA from this family of Levenshtein NFAs. One way to do this is to bound the maximum number of possible active states, since simulating the NFA on an input of length n involves updating the entire set of active states n times. Since a Levenshtein NFA created for a word of length m and edit distance d has (m + 1) * (d + 1) states, this means that the worst-case time complexity for simulating that NFA on an input of length n could be as bad as O(nmd). We can get a better bound by being a little more careful in our simulation.

The first thing to notice about the family of Levenshtein NFAs is that the diagonals contain paths of epsilon transitions all the way up. This means that any time a state is active, all other states further up on the same diagonal are active. Instead of keeping track of the set of active states, then, we can just keep track of the lowest active state on every diagonal. We’ll call this minimum active index the “floor” of the diagonal.

We’ll start numbering floors from the bottom: any diagonal with a state active in the bottom row of NFA states has floor 0. Since there are d + 1 rows in the NFA, the maximum floor a diagonal can have is d.

To make sure that every state in the NFA is contained in some diagonal, we just need to create a few fake diagonals that extend out to the left a little bit to add to the set of diagonals that are anchored by the states in the bottom level of the NFA:

Indexed like this, a Levenshtein NFA has m + d + 1 diagonals. This means that any set of active states in the NFA can be represented by at most m + d + 1 diagonal floors.

Actually, we never end up needing to consider all m + d + 1 diagonals at once. There’s only ever one diagonal with floor 0 at any point in time, since consuming an input character while one diagonal has a floor at position 0 transfers the state to position 1 in the previous diagonal, position 1 in the current diagonal, and possibly to position 0 in the next diagonal:

This fact generalizes to higher positions in the set of diagonals: there’s always a sliding window of at most 2d + 1 diagonals that are active at level d or lower. You can prove this by induction on d where the general induction step looks at a window of 2d - 1 diagonals at level d - 1 and shows that they can expand to a window of at most 2d + 1 at level d.

A particular example of the general case is illustrated below, with a starting state illustrated by all light blue and dark blue nodes. These light blue and dark blue nodes cover 5 diagonals at level 2 or lower. The green nodes illustrate all new nodes that can be active after a transition from this state. The green and dark blue nodes together illustrate possible active nodes after the transition, covering 7 diagonals at level 3 or lower:

All of this means that instead of tracking all m + d + 1 diagonals, we only ever need to track a set of 2d + 1 diagonal positions plus an offset into the NFA. The sliding window of diagonals that we track moves through the NFA and we increment the offset by one each time we consume an input character.

To update a single diagonal when we read an input character, we need to take the minimum of:

The previous floor of the diagonal, plus one (shown in red below).
The previous floor of the next diagonal, plus one (shown in green below).
The smallest index in the previous diagonal that has a transition on the input character, if any (shown in dark blue below).

The figure below shows all three of the contributions to a single diagonal’s update:

Since we’re storing the state as a collection of 2d + 1 diagonal floors plus an offset, the first two items in the list above (red and green updates in the figure above) can be computed in constant time. The third item can be computed by iterating over all d + 1 horizontal transitions from the previous diagonal to see if any match the current input character.

Here’s pseudocode for our current algorithm for a fixed value of d with a few details omitted:

// Returns a structure representing an initial state for a Levenshtein NFA. The
// State structure just contains:
//  * d, the edit distance.
//  * An array of 2*d + 1 integers representing floors.
//  * An integer offset into the underlying string being matched.
State InitialState(d) { ... }

// Returns the floor of the ith diagonal in state s. Returns d + 1 if i is
// out of bounds.
int Floor(State s, int i) { ... }

// Returns the smallest index in the ith diagonal of state s that has a
// transition on character ch. If none exists, returns d + 1.
int SmallestIndexWithTransition(State s, string w, int i, char ch) { ... }

// Set the ith diagonal floor in state s to x.
void SetDiagonal(State* s, int i, int x) { ... }

// Returns true exactly when the given state is an accepting state.
bool IsAccepting(State s) { ... }

// Simulate the Levenshtein NFA for string w and distance d on the string u.
bool SimulateNFA(string w, int d, string u) {
  State s = InitialState(d)
  for each ch in u {
    State t = s
    t.offset = s.offset + 1
    for i from 0 to 2*d {                               // All 2*d + 1 diagonals
      int x = Floor(s, i + 1) + 1                       // Diagonal
      int y = Floor(s, i + 2) + 1                       // Up
      int z = SmallestIndexWithTransition(s, w, i, ch)  // Horizontal
      SetDiagonal(&t, i, Min(x, y, z))
    }
    s = t
  }
  return IsAccepting(s)
}

For each of the n characters in the input string, we iterate over the 2d + 1 diagonals in the state and compute the three components of the diagonal update. Two of the three components (x and y in the pseudocode above) are floor computations which are really just array accesses, so they’re both constant time operations. The third component (z in the pseudocode above, the result of SmallestIndexWithTransition) is the most expensive, taking time O(d) to compute, so the inner loop takes time O(d²) and the entire simulation takes time O(nd²).
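To make the pseudocode concrete, here’s a runnable Python sketch of the same O(nd²) simulation. It isn’t the levtrie implementation: the indexing scheme is my own assumption. Diagonal k holds the states (k + e, e), and after consuming j characters window slot t stands for diagonal j - 2d + t, so slot t of the new window is the same diagonal as slot t + 1 of the old one. A floor of d + 1 means “no active state on this diagonal”.

```python
def simulate_floors(w, d, u):
    """True exactly when u is within edit distance d of w."""
    m = len(w)
    inactive = d + 1
    width = 2 * d + 1

    def floor(fs, t):
        return fs[t] if 0 <= t < width else inactive

    floors = [inactive] * width
    floors[2 * d] = 0  # before any input, only diagonal 0 is active

    for j, ch in enumerate(u):
        new = [inactive] * width
        for t in range(width):
            k = (j + 1) - 2 * d + t     # diagonal held by slot t after this step
            x = floor(floors, t + 1) + 1  # same diagonal, one level up
            y = floor(floors, t + 2) + 1  # arrives from the next diagonal
            z = inactive                  # horizontal, from the previous diagonal
            for e in range(floor(floors, t), d + 1):
                pos = (k - 1) + e
                if 0 <= pos < m and w[pos] == ch:
                    z = e
                    break
            best = min(x, y, z)
            # Keep the floor only if state (k + best, best) exists in the NFA.
            if best <= d and k + best <= m:
                new[t] = best
        floors = new

    n = len(u)
    for t in range(width):
        k = n - 2 * d + t
        top = m - k  # level of the accepting state on diagonal k, if any
        if 0 <= top <= d and floors[t] <= top:
            return True
    return False
```

The innermost loop over e plays the role of SmallestIndexWithTransition and is the only part that isn’t constant time per diagonal.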

An O(nd)-time simulation

To get from O(d²) to O(d) time per state transition and O(nd) for the entire simulation, there’s one final trick: making the SmallestIndexWithTransition function in the pseudocode above run in constant time.

SmallestIndexWithTransition needs to be called on each of the at most 2d + 1 active diagonals to find the smallest index above the floor of the diagonal with a horizontal transition on the current input character. Luckily, there’s a lot of overlap between the horizontal transitions of two consecutive diagonals: if you read the horizontal transitions of any diagonal from bottom to top, they form a substring of length d + 1 of the string the automaton was generated with. Any two consecutive diagonals have an overlap of d characters in these substrings:

The horizontal transitions from a set of 2d + 1 consecutive diagonals span a substring of length at most 3d + 2. This overlap between the horizontal transitions of consecutive diagonals means that instead of iterating up each diagonal doing O(d) work to figure out the first applicable horizontal transition, if any, for each of the 2d + 1 diagonals, we can instead precompute all the transitions on the substring of length 3d + 2 once and use that precomputed result to calculate SmallestIndexWithTransition on each diagonal.

This precomputation just involves calculating, for each index in the 3d + 2 character window, the next index in the window that matches the current input character. For example, if the window was the string “cookbook”:

    [c][o][o][k][b][o][o][k]
     0  1  2  3  4  5  6  7

our precomputed jump array for the character ‘k’ would look like:

    [3][3][3][3][7][7][7][7]
     0  1  2  3  4  5  6  7

Each index in the jump array tells us the next index of the character ‘k’ in the window. If we look in the jump array at the index of a particular diagonal plus its floor, the jump array tells us the next applicable horizontal transition on that diagonal.

The jump array needs to be initialized once per transition but initialization only takes time O(d) and all subsequent calls to SmallestIndexWithTransition can then be implemented with just a constant-time access into the jump array. Now everything inside the for loop that iterates over 2d + 1 diagonals runs in constant time and computing a transition of the NFA only takes time O(d).
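As a small sketch of the jump array idea (not the code from levtrie), the array can be filled in with one right-to-left pass over the window. The names build_jump and smallest_index_with_transition are hypothetical, and I’m assuming that len(window) is used as the “no match” sentinel and that start is the window index of a diagonal’s bottom-row transition.

```python
def build_jump(window, ch):
    """jump[i] is the smallest index j >= i with window[j] == ch, else len(window)."""
    none = len(window)
    jump = [none] * len(window)
    nxt = none
    for i in range(len(window) - 1, -1, -1):
        if window[i] == ch:
            nxt = i
        jump[i] = nxt
    return jump


def smallest_index_with_transition(jump, start, floor, d):
    """Lowest level >= floor on a diagonal with a transition on ch, or d + 1 if none."""
    pos = start + floor
    if floor > d or pos >= len(jump):
        return d + 1
    level = jump[pos] - start
    return level if level <= d else d + 1
```

For the window “cookbook” and the character ‘k’, build_jump returns [3, 3, 3, 3, 7, 7, 7, 7], matching the array above. With d = 2, the diagonal whose transitions read “ook” (start = 1) and whose floor is 0 gets the answer 2 from a single array access.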

Final Thoughts

I’ve implemented all of this in the Go package levtrie and the code in that package is a better place to look if you’re interested in more details after reading this post.

I took some shortcuts there to make the code a little more readable at the expense of some unnecessary memory bloat. In particular, the Trie implementation is very simple: each node keeps a map of runes to children and no intermediate nodes are suppressed. The key for each node, implicitly defined by the path through the Trie, is duplicated in the key-value struct stored at each node.

If you’re looking to implement something similar and speed it up, the first thing to consider is replacing the basic Trie with a Radix or Crit-bit tree. A Crit-bit tree will dramatically reduce the memory usage and the number of node accesses needed for Trie traversals, but it will also complicate the logic of the parallel search through the Trie and Levenshtein Automaton. Instead of iterating over all runes emanating from a node in the Trie and matching those up with transitions in the Levenshtein Automaton, you’ll have to do something more careful. You could add some logic to traverse the Crit-bit tree from a node, simulating an iteration over all single-character transitions. Multi-byte characters make this a little more tricky than an ASCII alphabet. You might have an easier time converting the Levenshtein Automaton itself to have an alphabet of bytes instead of multibyte characters, which would then match up better with a Radix tree with byte transitions.

Regular Expression Search via Graph Cuts

Fri, 10 Jun 2016 20:51:47 +0000

Google used to offer a search engine called Code Search which let you use regular expressions to search code. I never thought Code Search was doing anything much more sophisticated than something along the lines of a fast, distributed grep until Russ Cox explained how it worked in a blog post and released codesearch, a smaller-scale implementation. Both the post and the code are fascinating – it turns out that Code Search was doing something much more sophisticated than a distributed grep.

The big idea in codesearch is an engine that translates a large class of regular expressions into queries that can be run against a standard inverted index. In this post, I’m going to describe a different implementation of such an engine using some of the same ideas in codesearch along with a well-known algorithm for finding the minimum-weight node cut in a graph. I think the result is a little bit simpler conceptually than the query-building technique that codesearch uses, particularly if you understand textbook regular expressions but don’t know a lot about the internals of any particular regular expression library.

The implementation I’ll describe is available as a Go package at https://github.com/aaw/regrams. It’s a few hundred lines of code and the benchmarks show it running only about 4-5 times slower than codesearch’s regular expression-to-query translation. Both engines run on the order of microseconds for individual queries, so a few times slower is still very usable for a search engine.

Trigram queries and regular expression search

First, a little background about codesearch’s approach to regular expression search.

Before it can process queries, codesearch creates an inverted index from all the files you want to search. The index contains all overlapping substrings of length 3 (“trigrams”) that occur in any of the files. If you index a file called greetings.txt which contains just the string Hello, world!, the resulting trigram index would contain entries for Hel, ell, llo, lo,, o,_, ,_w, _wo, wor, orl, rld, and ld! (writing _ for a space). When you look up any of those trigrams in the index, you’ll get a list of files: greetings.txt and any other files that contain the trigram you’ve looked up. At query time, codesearch extracts trigrams from the regular expression and uses those trigrams to look up some candidate documents in the index.
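A toy version of such an index fits in a few lines of Python. This is only a sketch of the idea, not how codesearch stores its index; trigrams and build_index are hypothetical names and the “files” are just an in-memory dict.

```python
from collections import defaultdict


def trigrams(text):
    """Sorted list of the distinct length-3 substrings of text."""
    return sorted({text[i:i + 3] for i in range(len(text) - 2)})


def build_index(files):
    """Map each trigram to the set of file names whose contents contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index
```

Indexing {"greetings.txt": "Hello, world!"} this way produces 11 entries, each pointing back at greetings.txt.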

The trigram queries used by codesearch are just trigrams combined with ANDs and ORs. For a simple regular expression like Hello, codesearch might generate the trigram query Hel AND ell AND llo. Looking that query up in the index would return all of the indexed files that contain all three of the trigrams Hel, ell, and llo. A more complicated regular expression like abc(d|e) might generate the trigram query abc AND (bcd OR bce).

Sometimes there’s no good trigram query for a regular expression. This can happen if the regular expression accepts strings of length less than 3. The regular expression [0-9]+, for example, accepts strings of digits of length 1 and 2 so we can’t use a trigram query to search for files that match that expression. There also may not be a good trigram query if the regular expression accepts too many different strings. [a-z]{3} is a short regular expression but it needs an enormous trigram query with 26³ = 17,576 trigrams OR-ed together to capture its meaning. If you leave off any of those 17,576 trigrams from the query, you risk false negatives: files that match the regular expression but aren’t returned from your trigram query. You’d be better off grepping files in most cases than running such a large query.

Finally, even when codesearch can come up with a good trigram query, it can get false positives in its result set from the trigram index. The regular expression Hello’s trigram query Hel AND ell AND llo matches not just files containing Hello but also files that contain things like Help smell this fellow. Since false positives like this are possible, a regular expression search based on trigram queries will need to post-process trigram query results by running the regular expression over each file that comes back. The goal, then, is just to generate a small enough set of candidate documents so that the query generation, lookup in the inverted index, and post-processing with the original regular expression runs much more quickly than exhaustive grepping over files.
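The two-phase search described above can be sketched like this in Python. The file names and helper functions are made up for the example, and a real engine would consult the inverted index instead of recomputing trigrams for every file.

```python
import re


def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}


def candidates(files, required):
    """Files containing every trigram in `required` (an AND query)."""
    return sorted(name for name, content in files.items()
                  if required <= trigrams(content))


def search(files, pattern, required):
    """Run the trigram query, then post-process candidates with the real regex."""
    regex = re.compile(pattern)
    return [name for name in candidates(files, required)
            if regex.search(files[name])]


FILES = {
    "greetings.txt": "Hello, world!",
    "nonsense.txt": "Help smell this fellow",
    "farewell.txt": "Goodbye",
}
```

Here candidates(FILES, {"Hel", "ell", "llo"}) returns both greetings.txt and nonsense.txt, and search(FILES, "Hello", {"Hel", "ell", "llo"}) filters the false positive back out.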

codesearch generates these trigram queries by parsing the regular expression using Go’s regexp/syntax, then analyzing the parsed regular expression and converting it into structures that describe what trigrams each portion of the regular expression can match. These structures are then combined based on a table of rules derived from the meaning of the regular expression operations involved. The structures maintained during this process keep track of several attributes of the pieces of the regular expression they’re analyzing: whether that portion matches the empty string as well as sets of trigrams that can match exactly as well as prefixes and suffixes. The whole process is quick and, along with some boolean simplification of the resulting trigram queries, generates succinct queries without false negatives.

Generating trigram queries with graph cuts

Instead of generating trigram queries by analyzing the structure of the regular expression, regrams transforms the regular expression into an NFA and analyzes that.

regrams uses an NFA with some extra annotations on the states. Getting to an NFA from a regular expression in Go is pretty simple with the regexp/syntax package: running Parse on the string, then Simplify on the resulting expression yields a normal form that’s easy to work with. regrams massages the simplified expression into an even simpler form that’s closer to a textbook regular expression, containing only concatenation (.), alternation (|), and Kleene star (*) operations on literals and empty strings. This simplified expression might match more than the original expression but it never matches less, so it’s fine for generating a trigram query. Finally, regrams converts the simplified regular expression into an NFA using Thompson’s algorithm.

Next, any state that has a literal transition gets annotated with a set of trigrams that are reachable starting from that state. There are a few exceptions: if we start collecting the set of trigrams but realize it’s going to be too large, we’ll bail out and the state will get an empty set of trigrams. Also, if we start collecting the set of trigrams but realize that you can reach the accept state of the NFA in 2 or fewer steps, we’ll bail out, since that means that we can’t represent the query from that node with trigrams. So we end up with annotations on some of the nodes in the NFA that describe trigram OR queries, but only on the nodes that are easy to write trigram OR queries for.

Here’s an example NFA for the regular expression ab(c|d*)ef with trigram set annotations in blue below each node:

Notice that nodes with only epsilon transitions in the NFA above don’t get trigram annotations and nodes that can reach the accept state in fewer than 3 literal transitions like the node with the “e” transition and the node with the “f” transition don’t have trigram set annotations. Every other node is annotated with the set of trigrams that can be generated from that node by following an unlimited number of epsilon transitions and exactly three literal transitions.
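Here’s a rough Python sketch of how such annotations could be computed. The NFA below is a hand-built automaton for ab(c|d*)ef with my own state numbering, so it may not match the figure or what regrams builds internally, and trigrams_from is a hypothetical helper. An edge label of None stands for an epsilon transition.

```python
# state -> list of (label, target); None labels are epsilon transitions.
NFA = {
    0: [("a", 1)],
    1: [("b", 2)],
    2: [(None, 3), (None, 4)],  # split between c and d*
    3: [("c", 7)],
    4: [(None, 5), (None, 7)],  # take another d, or leave the loop
    5: [("d", 4)],
    7: [("e", 8)],
    8: [("f", 9)],
    9: [],                      # accept state
}
ACCEPT = 9


def trigrams_from(nfa, state, accept, n=3):
    """Strings spelled by exactly n literal transitions starting at state.

    Returns None if the accept state is reachable in fewer than n literal
    transitions, since no trigram set can describe that state.
    """
    results = set()
    seen = set()
    stack = [(state, "")]
    while stack:
        s, prefix = stack.pop()
        if (s, prefix) in seen:
            continue
        seen.add((s, prefix))
        if len(prefix) == n:
            results.add(prefix)
            continue
        if s == accept:
            return None
        for label, target in nfa[s]:
            stack.append((target, prefix + (label or "")))
    return results
```

On this NFA, the state with the “a” transition gets {abc, abd, abe}, the state with the “b” transition gets {bce, bdd, bde, bef}, and the states with the “e” and “f” transitions get None. The sketch leaves out the bail-out for sets that grow too large.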

Creating a trigram query from the NFA above is now just an exercise in applying the right graph algorithm. A good trigram OR query – one that captures a set of trigrams that must appear in any string matching the regular expression but is as small as possible – corresponds to a minimal set of trigram-annotated nodes in the NFA that, when removed, disconnect the initial state from the final state. In graph theory, a set of nodes that, when removed, disconnects two nodes s and t in the graph is called an s-t node cut. A node cut in our NFA separating the start and accept nodes corresponds to a complete set of trigrams, at least one of which must be present in any string that matches the regular expression: there’s no way for a string to be accepted by the NFA without going through one of the nodes in the cut.

In the NFA above, there are only two minimal node cuts that consist only of trigram-annotated nodes: the cut consisting of only the node with the “a” transition and the cut consisting of only the node with the “b” transition.

Once we’ve extracted a node cut consisting only of trigram-annotated nodes, we can just OR all of the trigrams in the cut together and, by the argument above, we’ve got a valid trigram query for the original regular expression: at least one of those trigrams must be present in any string that matches the regular expression. Continuing with the example above, depending on which cut we choose, we either get the trigram query abc OR abd OR abe or the trigram query bce OR bdd OR bde OR bef.

At this point, our query is a single big OR defined by the cut and we’ve likely used relatively few of the trigram sets we’ve annotated the graph with. But because we’ve just isolated a cut, that cut splits the NFA (and also the corresponding regular expression) into two parts, and if we now clean up both of those two parts of the NFA a little, we can run the same cut analysis on each of those two parts recursively to extract more trigram queries. We just keep isolating cuts recursively until we can’t find a cut with trigram-annotated nodes, at which point there’s nothing good left to generate a query with. All of the OR queries we generate like this can be AND-ed together to create one final query.

In the example NFA we’re working with, that means that if we choose the cut with just the node with a “b” transition, it generates the trigram query bce OR bdd OR bde OR bef and splits the NFA into two parts: the part with the “a” transition and everything after the “b” transition node:

We can now recursively consider both sub-NFAs. We see a single cut in the first NFA that generates the trigram query abc OR abd OR abe and no cut in the second NFA that consists of just trigram annotated nodes: even if you remove both the node with a “c” transition and the node with a “d” transition, there’s still a path from the initial state in the sub-NFA to the accept state through the lower-level epsilon-transition around the node with a “d” transition. Since there are no cuts left in any of the subgraphs, our final trigram query for the entire regular expression ab(c|d*)ef is all of the subqueries AND-ed together, which is (abc OR abd OR abe) AND (bce OR bdd OR bde OR bef).
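As a quick sanity check of that final query, here’s a small Python sketch that evaluates an AND of ORs against a string’s trigrams. The representation of a query as a list of sets and the name matches_query are my own choices for illustration.

```python
import re

# (abc OR abd OR abe) AND (bce OR bdd OR bde OR bef)
QUERY = [
    {"abc", "abd", "abe"},
    {"bce", "bdd", "bde", "bef"},
]
PATTERN = re.compile(r"ab(c|d*)ef")


def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}


def matches_query(text, query):
    """True when every OR clause shares at least one trigram with text."""
    grams = trigrams(text)
    return all(grams & clause for clause in query)
```

Strings that match the regular expression, like abcef, abef, abdef and abdddef, all satisfy the query. A string like abcXbce also satisfies the query without matching the regular expression, which is the kind of false positive that post-processing has to remove.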

The fact that we couldn’t extract a cut from the second NFA is by design: that NFA corresponds to the regular expression (c|d*)ef, which matches the string ef that we can’t express with a trigram query.

Finding minimal node cuts

So how do you find a minimal cut consisting only of trigram-annotated nodes? If we label each node in a trigram-annotated NFA with the size of its trigram set or with “infinity” if it doesn’t have a trigram set, then we can frame the problem as a search for a minimum-weight node cut in the graph that separates the initial and final state. Because of the way we’ve set up the node weights, finding the minimum-weight node cut will tell us if we actually have a usable query: the cut can be used to generate a query exactly when its weight is non-infinite.

Finding a minimum-weight cut in a graph that separates two distinguished nodes is a tractable optimization problem and is especially easy in simple graphs like the NFAs that we get from regular expressions. Minimum-weight cuts are closely related to maximum-weight “flows”, and if you ever took an algorithms course you may have run into the max-flow min-cut theorem that formalizes this duality. If you imagine a graph as a set of pipes of differing widths, flowing in the direction of the arrows between nodes, the max flow represents the greatest amount of fluid that can flow through the pipes at any point in time. The minimum-weight cut is a bottleneck that constrains the maximum flow you can push through.

The simplest way to find a minimum-weight cut, then, is to find a maximum-weight flow, which really boils down to initializing all nodes with capacities as described above (each capacity will either be infinite or the size of the trigram set) and then repeatedly finding a path that goes from the start node to the accept node through nodes of non-zero capacity. Every time we find such a path, we reduce the capacity of all nodes on the path by the minimum capacity on the path. Eventually, there are no paths left except those that go through zero-capacity nodes and those zero-capacity nodes can be used to recover the minimal cut.

There are more sophisticated ways of finding a maximum-weight flow, but the complexity of this algorithm is never more than the number of edges in the NFA times the maximum allowable trigram set size. Since we bound both the maximum size of the NFA allowed and the size of the maximum trigram set in the regrams implementation, the complexity never gets out of hand here, and in practice it’s very quick and even out-performs some theoretically faster optimizations like the Edmonds-Karp search order in my tests.

Examples

So what do the results look like? Here are a few examples that you can play with on your own if you have a Go development environment. Just build the regrams command-line wrapper and follow along:

go get github.com/aaw/regrams
go build -o regrams github.com/aaw/regrams/cmd/regrams

The trigram queries are written with implicit AND and | for OR, so abc AND (bcd OR bce) comes out looking like abc (bcd|bce):

$ ./regrams 'Hello, world!'
( wo) (, w) (Hel) (ell) (ld!) (llo) (lo,) (o, ) (orl) (rld) (wor)
$ ./regrams 'a(bc)+d'
(abc) (bcb|bcd)
$ ./regrams 'ab(c|d*)ef'
(abc|abd|abe) (bce|bdd|bde|bef)
$ ./regrams '(?i)abc'
(ABC|ABc|AbC|Abc|aBC|aBc|abC|abc)
$ ./regrams 'abc[a-zA-Z]de(f|g)h*i{3}'
(abc) (def|deg) (efh|efi|egh|egi) (fhh|fhi|fii|ghh|ghi|gii) (iii)
$ ./regrams '[0-9]+'  # No trigrams match single or double digit strings
Couldn't generate a query
$ ./regrams '[a-z]{3}'  # Too many trigrams
Couldn't generate a query

regrams is also available as a Go package: just import github.com/aaw/regrams and then call regrams.MakeQuery on a string to parse a regular expression into a trigram query. The trigram query returned is a slice-of-slices of strings, which represents a big AND of ORs: [["abc"], ["bcd", "bce"]] represents abc (bcd|bce).
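
To make the AND-of-ORs shape concrete, here’s a minimal sketch of how a query in that form might be checked against a document. It doesn’t call regrams at all: the query is written out as a literal, and trigramsOf and mayMatch are hypothetical helpers for illustration. It also assumes trigrams are 3-byte substrings, which is a simplification.

```go
package main

import "fmt"

// trigramsOf returns the set of all 3-byte substrings of s.
func trigramsOf(s string) map[string]bool {
	set := make(map[string]bool)
	for i := 0; i+3 <= len(s); i++ {
		set[s[i:i+3]] = true
	}
	return set
}

// mayMatch reports whether a document with the given trigram set satisfies
// the query: every inner slice is an OR, and all inner slices are AND-ed.
func mayMatch(query [][]string, trigrams map[string]bool) bool {
	for _, anyOf := range query {
		found := false
		for _, t := range anyOf {
			if trigrams[t] {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	query := [][]string{{"abc"}, {"bcd", "bce"}} // abc (bcd|bce)

	fmt.Println(mayMatch(query, trigramsOf("xxabcexx"))) // true: has abc and bce
	fmt.Println(mayMatch(query, trigramsOf("xxabcfxx"))) // false: has abc but neither bcd nor bce
}
```

A document that passes this check is only a candidate; the full regular expression still has to be run against it to confirm a match.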

Optimizations

There’s at least one major optimization that’s possible but isn’t yet implemented in regrams: we could extract node-disjoint paths instead of just node cuts and get more specific queries.

For example, if you ask regrams to generate a query for (abcde|vwxyz), it’ll generate (abc|vwx) (bcd|wxy) (cde|xyz), which isn’t as specific as the query (abc bcd cde)|(vwx wxy xyz). (abc|vwx), (bcd|wxy), and (cde|xyz) represent three distinct cuts in the NFA. If we didn’t stop at node cuts, and instead augmented the nodes in the cut with disjoint paths, we could avoid this situation and always return the best trigram query for a regular expression. To do this, we’d need some algorithmic version of something like Menger’s theorem that extracts node-disjoint paths from our NFAs that go through only nodes with non-infinite capacity.
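
Here’s a small sketch of why the second query is more specific. It evaluates both forms directly against a document string with strings.Contains rather than against a real trigram index, and the helper names andOfOrs and orOfAnds are hypothetical. The document below contains abc, wxy, and cde but neither abcde nor vwxyz, so the cut-based query accepts it as a candidate while the path-based query rejects it.

```go
package main

import (
	"fmt"
	"strings"
)

// andOfOrs evaluates a query like (abc|vwx) (bcd|wxy) (cde|xyz):
// every group must contribute at least one trigram found in doc.
func andOfOrs(doc string, groups [][]string) bool {
	for _, group := range groups {
		any := false
		for _, t := range group {
			if strings.Contains(doc, t) {
				any = true
				break
			}
		}
		if !any {
			return false
		}
	}
	return true
}

// orOfAnds evaluates a query like (abc bcd cde)|(vwx wxy xyz):
// at least one group must have all of its trigrams found in doc.
func orOfAnds(doc string, groups [][]string) bool {
	for _, group := range groups {
		all := true
		for _, t := range group {
			if !strings.Contains(doc, t) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	cuts := [][]string{{"abc", "vwx"}, {"bcd", "wxy"}, {"cde", "xyz"}}
	paths := [][]string{{"abc", "bcd", "cde"}, {"vwx", "wxy", "xyz"}}

	doc := "abc wxy cde" // matches neither abcde nor vwxyz
	fmt.Println(andOfOrs(doc, cuts))  // true: a false-positive candidate
	fmt.Println(orOfAnds(doc, paths)) // false: correctly filtered out
}
```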

Related Work

codesearch isn’t the only attempt at generating trigram queries from regular expressions. In 2002, Cho and Rajagopalan published “A Fast Regular Expression Indexing Engine”, which describes a search engine called FREE that shares a lot of the ideas found in codesearch, including using an n-gram index to index the underlying documents and generating queries against that index using some rules based on the structure of those regular expressions. The rules are much simpler than those used by codesearch and FREE doesn’t have an actual implementation that I know of.

Postgres also now supports regular expression search via an implementation by Alexander Korotkov in the pg_trgm module. Korotkov’s implementation apparently generates a special NFA with trigram transitions after converting the original regular expression to an NFA, then extracts a query from that trigram NFA. Korotkov’s implementation seems similar to regrams in that all of the analysis is done on an NFA, but I don’t understand enough about it to say anything more. I’d love to read a description of the technique somewhere. Based on some of the documentation, it seems to generate slightly better queries than regrams: for example, it generates ((abe AND bef) OR (cde AND def)) AND efg from (ab|cd)efg, whereas regrams would generate (abe OR cde) AND (bef OR def) AND efg from the same regular expression.
