Aaron Windsor
http://blog.aaw.io/
Sun, 24 Jan 2021 14:30:51 +0000

<h1 id="comma-free-codes">Comma-free codes</h1>
<p>A set of strings is <a href="https://en.wikipedia.org/wiki/Comma-free_code"><em>comma-free</em></a> if, for any two strings <em>x</em> and <em>y</em> in the
set, no substring of the concatenation <em>xy</em> that overlaps both <em>x</em> and <em>y</em> is also in the set.</p>
<p>Once again, my interest in comma-free codes is coming from Don Knuth’s
<a href="https://www.pearson.com/store/p/the-art-of-computer-programming-volume-4-fascicle-5-mathematical-preliminaries-redux-introduction-to-backtracking-dancing-links/P100000291857/9780134671796">Volume 4, Fascicle 5 of The Art of Computer Programming</a>. Knuth covers a
souped-up backtracking algorithm to find comma-free codes and includes
an exercise that derives W. L. Eastman’s
algorithm for generating codes with odd word length<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">1</a></sup>. He also goes over Eastman’s algorithm
in his <a href="https://www.youtube.com/watch?v=48iJx8FVuis">2015 Christmas lecture</a>, which must have been given around the time
he was writing Fascicle 5.</p>
<p>In this post, I’ll describe how to use a SAT solver to discover comma-free codes.</p>
<h1 id="background">Background</h1>
<p>I’ll call a comma-free set a “code” and the strings in the set “codewords” in most of the rest of this post.</p>
<p>A few warm-up examples to make the comma-free property concrete: if a comma-free code containing only words of
length three contains 123 and 456, then it can’t contain 234 or 345 and it can’t contain 561
or 612 (since those are substrings of 123456 and 456123, respectively). It also can’t contain
231, 312, 564, or 654 (since those are substrings of 123123 and 456456, respectively).</p>
<p>The term “comma-free” comes from the fact that you can create messages just by
concatenating codewords together (without commas or other such delimiters) and those
messages will be uniquely decodable into codewords even if you start decoding
somewhere in the middle of the message.</p>
<p>For example, a comma-free code with words of length 3 over the alphabet {0,1,2} is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{001, 002, 101, 102, 120, 121, 220, 221}
</code></pre></div></div>
<p>Given some partial message made up of these codewords:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...011202212200012...
</code></pre></div></div>
<p>The interpretation of that message as codewords is unique. It has to be</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...01,120,221,220,001,2...
</code></pre></div></div>
<p>since neither of the other ways of interpreting it makes sense. Neither</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...0,112,022,122,000,12...
</code></pre></div></div>
<p>nor</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...011,202,212,200,012...
</code></pre></div></div>
<p>contain any valid code words.</p>
<h1 id="upper-bounds">Upper bounds</h1>
<p>A <a href="https://en.wikipedia.org/wiki/Lyndon_word">Lyndon word</a> is a string that’s a unique lexicographic minimum
among the multiset of all of its rotations.</p>
<p>0001 is a Lyndon word because it’s the minimum among the rotations 0001, 0010, 0100, 1000.</p>
<p>0101 is the minimum among all its rotations, but it’s not a Lyndon word because the minimum
is not unique among all four rotations: 0101, 1010, 0101, 1010. Any periodic string will have
the same problem.</p>
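<p>The two examples above can be reproduced with a one-line check (my own sketch, not from the post):</p>

```python
def is_lyndon(w):
    """True iff w is strictly smaller than every nontrivial rotation,
    i.e. it's the unique lexicographic minimum among its rotations."""
    return all(w < w[i:] + w[:i] for i in range(1, len(w)))

print(is_lyndon('0001'))  # True: minimum among 0001, 0010, 0100, 1000
print(is_lyndon('0101'))  # False: rotating by 2 gives 0101 again, a tie
```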
<p>If the same codeword is repeated in a message, all of its other rotations overlap
both occurrences of the codeword (for example, 00010001 contains all rotations of 0001).
So a codeword and one of its rotations cannot both appear in a comma-free code. Furthermore, you
can’t have periodic codewords in a comma-free code (0101 creates ambiguity when concatenated
with itself as 01010101). These two facts mean that the number of Lyndon words of length n over
an alphabet of size m is an upper bound on the maximum size of a comma-free code with
the same parameters.</p>
<p>There’s a closed-form formula for the number of Lyndon words of length n over an alphabet
of size m, but it <a href="https://encyclopediaofmath.org/wiki/Lyndon_word">involves the Möbius function</a>. I’ll just use <em>LW(n,m)</em> to represent it here –
it doesn’t matter what the formula is, I’m only interested in how close we can get to the optimal
size <em>LW(n,m)</em> for various values of <em>n</em> and <em>m</em>.</p>
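<p>For the curious, the closed form is <em>LW(n,m)</em> = (1/<em>n</em>) Σ<sub><em>d</em>|<em>n</em></sub> μ(<em>d</em>) <em>m</em><sup><em>n/d</em></sup>. Here’s a small Python sketch (helper names are mine) that reproduces the values of <em>LW(n,m)</em> quoted later in this post:</p>

```python
def mobius(n):
    """Möbius function: (-1)^k if n is a product of k distinct primes,
    0 if n has a squared prime factor."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # squared prime factor
            result = -result
        p += 1
    if n > 1:        # one prime factor left over
        result = -result
    return result

def lyndon_words(n, m):
    """Number of Lyndon words of length n over an alphabet of size m."""
    return sum(mobius(d) * m ** (n // d)
               for d in range(1, n + 1) if n % d == 0) // n

print(lyndon_words(2, 4))   # 6
print(lyndon_words(12, 2))  # 335
```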
<p>To tie all of this together, one way of viewing the process of creating a comma-free code
that’s as large as possible is just selecting at most one rotation of each Lyndon word while
avoiding conflicts. Let <em>CF(n,m)</em> be the maximum size of a comma-free code with words
of length <em>n</em> over an alphabet of size <em>m</em>. If <em>LW(n,m)</em> = <em>CF(n,m)</em>, then there’s
some way of choosing one rotation of each Lyndon word to create a maximum-size comma-free code.</p>
<p>For all odd <em>n</em>, <em>LW(n,m)</em> = <em>CF(n,m)</em> and W.L. Eastman came up with an algorithm<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote">1</a></sup>
that will create such a code. For even <em>n</em>, not much is known
even for very small values of <em>n</em> and <em>m</em>. The case <em>n</em>=2 was solved by
Golomb, Gordon, and Welch<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">2</a></sup>: <em>CF(2,m)</em> = <em>floor(m<sup>2</sup>/3)</em>,
which is strictly less than <em>LW(2,m)</em> for <em>m</em> > 3.</p>
<p>For example, here’s a comma-free code for n=2, m=4:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{10, 12, 30, 31, 32}
</code></pre></div></div>
<p><em>LW(2,4)</em> = 6 and the code above contains rotations of
all Lyndon words except for 02. And that’s as good as you can do: by Golomb, Gordon, and Welch’s
formula above, <em>CF(2,4)</em> = 5, so there’s no comma-free code that uses a rotation of every Lyndon word for n=2, m=4.</p>
<p>Here’s a table of <em>LW(n,m)</em> - <em>CF(n,m)</em>, where <em>CF(n,m)</em> is actually known:</p>
<table>
<thead>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center">m=2</th>
<th style="text-align: center">m=3</th>
<th style="text-align: center">m=4</th>
<th style="text-align: center">m=5</th>
<th style="text-align: center">m=6</th>
<th style="text-align: center">m=7</th>
<th style="text-align: center">m=8</th>
<th style="text-align: center">m=9</th>
<th style="text-align: center">m=10</th>
<th> </th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">n=2</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">2</td>
<td style="text-align: center">3</td>
<td style="text-align: center">5</td>
<td style="text-align: center">7</td>
<td style="text-align: center">9</td>
<td style="text-align: center">12</td>
<td>…</td>
<td><em>LW(n,m) - floor(m<sup>2</sup>/3)</em></td>
</tr>
<tr>
<td style="text-align: center">n=3</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=4</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">3</td>
<td style="text-align: center"><span style="background-color: #FFFF00">11</span></td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=5</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=6</td>
<td style="text-align: center">0</td>
<td style="text-align: center"><span style="background-color: #FFFF00">3</span></td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=7</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=8</td>
<td style="text-align: center">0</td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=9</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=10</td>
<td style="text-align: center">0</td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=11</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=12</td>
<td style="text-align: center"><span style="background-color: #FFFF00">1</span></td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=13</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td style="text-align: center">n=14</td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td style="text-align: center"> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
<p>Blanks in the table above are currently unknown. In particular, all rows after
n=13 are unknown. For even <em>n</em>, <em>CF(n,m)</em> gets really difficult to determine
really quickly and <em>LW(n,m)</em> is only achievable for small values of <em>n</em> and <em>m</em>.</p>
<p>The three highlighted entries in the table are state-of-the-art and apparently
unknown at the time of the first printing of Volume 4, Fascicle 5. All of the
entries in rows with even <em>n</em> above
are within reach of at most a day’s computation on a decent computer using a
SAT solver. In the rest of this post, I’ll describe how I calculated the values
in the table above for <em>n</em> > 2.</p>
<h1 id="using-a-solver-to-find-comma-free-codes">Using a Solver to Find Comma-free Codes</h1>
<p>The general recipe I used for finding comma-free codes with a SAT solver was:</p>
<ol>
<li>
<p>Write a Python program that takes arguments <em>n</em>, <em>m</em>, and <em>d</em> on the
command line and generates a <a href="http://www.satcompetition.org/2009/format-benchmarks2009.html">DIMACS file</a> with a formula that’s
satisfiable exactly when there’s a comma-free code with codewords of length
<em>n</em> over an alphabet of size <em>m</em> and <em>d</em> Lyndon words not chosen. So if
you run with <em>d</em>=0 and get a satisfiable formula, <em>LW(n,m)</em> = <em>CF(n,m)</em>.</p>
<p>Some of the variables in the formula are indicators for particular codewords
being chosen in the code, so I generated comments in the DIMACS file (<a href="https://raw.githubusercontent.com/aaw/sat/master/test/commafree-4-4-0.cnf">example</a>) that
would help me translate variables to codewords later.</p>
</li>
<li>
<p>For any pair of <em>n</em>, <em>m</em> I was interested in, I generated DIMACS files starting with
<em>d</em> = 0 and ran Armin Biere’s <a href="https://github.com/arminbiere/kissat">kissat</a> on the resulting files, increasing
<em>d</em> and starting over until I found the smallest <em>d</em> where the resulting
formula was satisfiable.</p>
</li>
<li>
<p>Using the satisfying assigment found by kissat, I could optionally extract a code by
matching the assignment up with comments in the DIMACS input file.</p>
</li>
</ol>
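<p>Step 2 amounts to a small search loop over <em>d</em>. Here’s a rough Python sketch of such a driver (the function names are mine, not the author’s actual tooling; it assumes <code class="language-plaintext highlighter-rouge">commafree.py</code> and kissat are on the PATH, and relies on the standard DIMACS solver convention that exit code 10 means satisfiable and 20 means unsatisfiable):</p>

```python
import subprocess

def solver_says_sat(returncode):
    """Interpret a SAT solver's exit status using the DIMACS convention
    that kissat follows: 10 means satisfiable, 20 means unsatisfiable."""
    if returncode == 10:
        return True
    if returncode == 20:
        return False
    raise RuntimeError('unexpected solver exit code: %d' % returncode)

def smallest_d(n, m, max_d=50):
    """Generate formulas for d = 0, 1, 2, ... until one is satisfiable.
    Assumes commafree.py and kissat are both invocable as shown."""
    for d in range(max_d + 1):
        cnf = '/tmp/commafree-%d-%d-%d.cnf' % (n, m, d)
        with open(cnf, 'w') as f:
            subprocess.run(['commafree.py', str(n), str(m), str(d)],
                           stdout=f, check=True)
        result = subprocess.run(['kissat', cnf], stdout=subprocess.DEVNULL)
        if solver_says_sat(result.returncode):
            return d  # the gap LW(n,m) - CF(n,m)
    return None
```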
<p>With this method, I was able to fill in the table in the previous section. If
you want to replicate this, you’ll need Python3 installed and a SAT solver like kissat.
Download two scripts:</p>
<ul>
<li><a href="https://github.com/aaw/sat/blob/master/gen/commafree.py">commafree.py</a>: the input file generator</li>
<li><a href="https://gist.github.com/aaw/430fa06b6f0c8db73b81c9b2f4afb079">extract-code.py</a>: a script that verifies and extracts comma-free codes</li>
</ul>
<p>Using these tools, you should be able to extract your own codes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ commafree.py 4 2 0 > /tmp/commafree-4-2-0.cnf # n=4, m=2, d=0
$ kissat /tmp/commafree-4-2-0.cnf > /tmp/kissat-4-2-0.out
$ [ $? -eq 10 ] && extract-code.py /tmp/commafree-4-2-0.cnf /tmp/kissat-4-2-0.out
{1000, 1001, 1011}
</code></pre></div></div>
<p>(The final command prints nothing if the formula was unsatisfiable, in which case
you need to increment <em>d</em> and try again.)</p>
<p>Some of the blank entries in the table above should be solvable today with a bit
more effort, particularly at the
frontier, like (<em>n</em>,<em>m</em>) = (4,6), (6,4), and (8,3). Maybe these
are already solvable with SAT solver tuning, symmetry breaking, or some other
preprocessing of the CNF files.</p>
<h1 id="the-sat-encoding">The SAT Encoding</h1>
<p>The heart of the process above is a Python program to generate DIMACS files,
but it’s nothing too complicated. It does the following:</p>
<ol>
<li>
<p>Generate candidate codewords by iterating over all Lyndon words and their
rotations. This can be done with Knuth’s Algorithm 7.2.1.1 F from <a href="https://www.pearson.com/store/p/art-of-computer-programming-volume-4a-the-combinatorial-algorithms-part-1/P100001186353/9780201038040">The Art of
Computer Programming, Volume 4A</a>. Associate
each of these words with a variable that’s true exactly when the codeword
has been chosen for the comma-free code.</p>
</li>
<li>
<p>Generate clauses for each Lyndon word that express “at most one codeword
from this Lyndon word and its rotations is chosen”.</p>
</li>
<li>
<p>Iterate over all pairs <em>x</em>, <em>y</em> of Lyndon word variables. For each pair,
discover all codewords <em>z</em> that are disallowed in <em>xy</em> or <em>yx</em> by the
comma-free property and generate clauses for such triples <em>x</em>, <em>y</em>, <em>z</em> that express “x, y, and z can’t
all occur in the comma-free code”.</p>
</li>
<li>
<p>Generate variables and clauses for each Lyndon word such that the variables
indicate “some rotation of this Lyndon word was chosen”.</p>
</li>
<li>
<p>Generate variables and clauses that sort the variables from the previous
step and assert that the smallest <em>d</em> values are 0 and the <em>d+1</em>-st value
is 1. I cheated here and didn’t do a full sort: I just applied a full row
of <a href="https://en.wikipedia.org/wiki/Sorting_network">comparators</a> <em>d+1</em> times – essentially <em>d+1</em> rounds of bubblesort
that sort only the smallest <em>d</em> values.</p>
</li>
</ol>
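<p>As a concrete illustration of step 2, here’s a small Python sketch (helper names are mine) of the pairwise “at most one” encoding applied to a single Lyndon word’s rotation class, emitting binary clauses as tuples of DIMACS literals:</p>

```python
from itertools import combinations

def rotations(w):
    """All rotations of w, starting with w itself."""
    return [w[i:] + w[:i] for i in range(len(w))]

def at_most_one_clauses(var):
    """Pairwise encoding of 'at most one of these variables is true':
    one binary clause (-x OR -y) per pair. var maps each codeword in a
    rotation class to its DIMACS variable number."""
    return [(-x, -y) for x, y in combinations(sorted(var.values()), 2)]

# Toy example: the rotation class of the Lyndon word 0001, with
# made-up DIMACS variable numbers 1..4.
var = {w: i + 1 for i, w in enumerate(rotations('0001'))}
clauses = at_most_one_clauses(var)
print(len(clauses))  # 6 clauses for a class of 4 rotations
```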
<p>Again, you can check out the <a href="https://github.com/aaw/sat/blob/master/gen/commafree.py">generator source</a> for full details.</p>
<h1 id="new-codes">New Codes</h1>
<p>So, finally, here’s a dump of the new codes I was able to discover. All of these
are now officially the maximum size comma-free codes for these values of <em>n</em> and <em>m</em>:</p>
<p>n=4, m=5, size=139 (<em>LW(4,5)</em> = 150):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{0012, 0013, 0014, 0100, 0102, 0103, 0104, 0110, 0111, 0112, 0113, 0114, 0204,
0212, 0213, 0214, 0304, 0312, 0313, 0314, 1012, 1013, 1014, 2000, 2002, 2003,
2004, 2012, 2013, 2014, 2022, 2023, 2024, 2032, 2033, 2034, 2100, 2102, 2103,
2104, 2110, 2111, 2112, 2113, 2114, 2204, 2212, 2213, 2214, 2224, 2234, 2304,
2312, 2313, 2314, 2324, 2334, 3000, 3002, 3003, 3004, 3012, 3013, 3014, 3022,
3023, 3024, 3032, 3033, 3034, 3100, 3102, 3103, 3104, 3110, 3111, 3112, 3113,
3114, 3204, 3212, 3213, 3214, 3224, 3234, 3304, 3312, 3313, 3314, 3324, 3334,
4000, 4002, 4003, 4004, 4012, 4013, 4014, 4022, 4023, 4024, 4032, 4033, 4034,
4042, 4043, 4044, 4100, 4102, 4103, 4104, 4110, 4111, 4112, 4113, 4114, 4122,
4123, 4124, 4132, 4133, 4134, 4142, 4143, 4144, 4204, 4212, 4213, 4214, 4224,
4234, 4244, 4304, 4312, 4313, 4314, 4324, 4334, 4344}
</code></pre></div></div>
<p>n=6, m=3, size=113 (<em>LW(6,3)</em> = 116):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{001000, 001100, 001101, 001102, 001200, 001201, 001202, 002000, 002100,
002101, 002102, 002200, 002201, 002202, 010102, 011100, 011101, 011102,
011110, 011111, 011200, 011201, 011202, 011210, 011211, 012100, 012101,
012102, 012110, 012111, 012112, 012200, 012201, 012202, 012210, 012211,
012220, 020100, 020102, 020110, 020120, 020200, 020210, 020220, 021100,
021101, 021102, 021110, 021111, 021120, 021121, 021200, 021201, 021202,
021210, 021211, 022100, 022101, 022102, 022110, 022111, 022112, 022120,
022121, 022200, 022201, 022202, 022211, 022212, 101000, 101100, 101200,
101201, 101202, 102000, 102100, 102200, 102201, 102202, 111200, 111201,
111202, 111210, 111211, 112200, 112201, 112202, 112210, 112211, 112220,
121200, 121201, 121202, 121210, 121211, 122120, 122121, 122211, 122212,
212200, 212201, 212202, 212210, 212211, 212220, 222100, 222101, 222102,
222200, 222201, 222202, 222211, 222212}
</code></pre></div></div>
<p>n=12, m=2, size=334 (<em>LW(12,2)</em> = 335):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{000001000011, 000010000000, 000010000001, 000010000011, 000010000101,
000100000101, 000100001001, 000101000011, 000101001011, 000110000001,
000110000011, 000110000101, 000110000111, 000110001001, 000110001011,
000110001101, 000110001111, 000110010001, 000110010011, 000110010101,
000110010111, 000110011001, 000110011011, 000110011101, 000110011111,
000111000011, 000111001011, 001000001001, 001000010001, 001001000011,
001001001011, 001001010011, 001010000001, 001010000011, 001010000101,
001010001001, 001010001011, 001010010001, 001010010011, 001010010101,
001100000001, 001100000101, 001100001001, 001100001101, 001100010001,
001100010101, 001100011101, 001101000011, 001101001011, 001101010011,
001110000001, 001110000011, 001110000101, 001110001001, 001110001011,
001110001101, 001110001111, 001110010001, 001110010011, 001110010101,
001110011001, 001110011011, 001110011101, 001110011111, 001111000000,
001111000011, 001111001011, 001111010011, 010000010001, 010001000011,
010001010011, 010010000001, 010010000011, 010010000101, 010010010001,
010010010011, 010010010101, 010010100011, 010100000001, 010100000101,
010100001001, 010100010001, 010100010101, 010101000011, 010101001011,
010101010011, 010110000001, 010110000011, 010110000101, 010110000111,
010110001001, 010110001011, 010110001101, 010110001111, 010110010001,
010110010011, 010110010101, 010110010111, 010110011001, 010110011011,
010110011101, 010110011111, 010110100011, 010110100111, 010110101011,
010111000011, 010111001011, 010111010011, 010111011011, 010111100011,
010111101011, 010111111011, 011000000000, 011000000001, 011000001001,
011000010001, 011001000011, 011001001011, 011001010011, 011001011011,
011001101011, 011001111011, 011010000001, 011010000011, 011010000101,
011010001001, 011010001011, 011010010001, 011010010011, 011010010101,
011010011001, 011010011011, 011010100011, 011010101011, 011100000000,
011100000001, 011100000101, 011100001001, 011100010001, 011100010101,
011100011101, 011101000011, 011101001011, 011101010011, 011101011011,
011101111011, 011110000011, 011110000101, 011110001001, 011110001011,
011110010001, 011110010011, 011110010101, 011110011001, 011110011011,
011110011101, 011110011111, 011110100011, 011110101011, 011111000011,
011111001011, 011111010011, 011111011011, 011111100011, 011111101011,
011111111011, 100001000011, 100010000000, 100010000001, 100010000011,
100010000101, 100010100011, 100100000000, 100100000001, 100100000101,
100100001001, 100101000011, 100101001011, 100110000001, 100110000011,
100110000101, 100110000111, 100110001001, 100110001011, 100110001101,
100110001111, 100110100011, 100110100111, 100110101011, 100111000011,
100111001011, 100111100011, 100111101011, 101000000000, 101000000001,
101000001001, 101000010001, 101001000011, 101001001011, 101001010011,
101010000001, 101010000011, 101010000101, 101010001001, 101010001011,
101010010001, 101010010011, 101010010101, 101010100011, 101010101011,
101100000001, 101100000101, 101100001001, 101100001101, 101100010001,
101100010101, 101100011101, 101101000011, 101101001011, 101101010011,
101101011011, 101110000001, 101110000011, 101110000101, 101110001001,
101110001011, 101110001101, 101110001111, 101110010001, 101110010011,
101110010101, 101110011001, 101110011011, 101110011101, 101110011111,
101110100011, 101110101011, 101111000000, 101111000011, 101111001011,
101111010011, 101111100011, 101111101011, 101111111011, 110000010001,
110001000011, 110001010011, 110010000001, 110010000011, 110010000101,
110010010001, 110010010011, 110010010101, 110010100011, 110100000001,
110100000101, 110100001001, 110100010001, 110100010101, 110101000011,
110101001011, 110101010011, 110110000001, 110110000011, 110110000101,
110110000111, 110110001001, 110110001011, 110110001101, 110110001111,
110110010001, 110110010011, 110110010101, 110110010111, 110110011001,
110110011011, 110110011101, 110110011111, 110110100011, 110110100111,
110110101011, 110111000011, 110111001011, 110111010011, 110111010111,
110111011011, 110111100011, 110111101011, 110111111011, 111000001001,
111000010001, 111001000011, 111001001011, 111001010011, 111001101011,
111001111011, 111010000001, 111010000011, 111010000101, 111010001001,
111010001011, 111010010001, 111010010011, 111010010101, 111010100011,
111010101011, 111100000101, 111100001001, 111100010001, 111100010101,
111101000011, 111101001011, 111101010011, 111110000000, 111110000001,
111110000011, 111110000101, 111110001001, 111110001011, 111110010001,
111110010011, 111110010101, 111110011001, 111110011011, 111110011101,
111110011111, 111110100011, 111110101011, 111111000011, 111111001011,
111111010011, 111111100011, 111111101011, 111111111011}
</code></pre></div></div>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:2" role="doc-endnote">
<p>Eastman, W. L., On the Construction of Comma-Free Codes, IEEE Trans. IT-11 (1965), pp 263-267. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p><a href="https://www.cambridge.org/core/journals/canadian-journal-of-mathematics/article/commafree-codes/B78C35132E2BCD4A14459BE6A142FF30">Golomb, S. W., B. Gordon, and L. R. Welch, Comma-free codes, Can. J. Math, vol. 10, 1958, pp 202-209.</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
</li>
</ol>
</div>
Wed, 20 Jan 2021 13:28:00 +0000
http://blog.aaw.io/2021/01/20/comma-free-codes.html

<h1 id="twenty-questions-with-z3">Twenty Questions with z3</h1>
<p>Don Knuth’s <a href="https://www.pearson.com/store/p/the-art-of-computer-programming-volume-4-fascicle-5-mathematical-preliminaries-redux-introduction-to-backtracking-dancing-links/P100000291857/9780134671796">Volume 4, Fascicle 5 of The Art of Computer Programming</a> has some great combinatorial
puzzles in the exercises, including a variant of a puzzle called <a href="http://www.icynic.com/~don/20q4.html">“Twenty Questions”</a> invented by Donald Woods.</p>
<p>Donald Woods has written up <a href="http://www.icynic.com/~don/20qintro.html">some history of the problem</a> that probably serves as
a better introduction than I could give. Knuth introduces the puzzle as an exercise in backtracking (and has written
a fast backtracking solver for a variant of the problem <a href="https://www-cs-faculty.stanford.edu/~knuth/programs/back-20q.w">here</a>), but you can also solve Twenty Questions using a
SAT solver, and that’s what I’ll describe in this post.</p>
<p>I’ll use <a href="https://github.com/Z3Prover/z3">z3</a> (with its Python frontend) to find the solution to the puzzle. z3 is technically a little more than
just a SAT solver, but the encoding of the problem in this post could easily be mapped down to a “pure” boolean formula
and fed to a SAT solver if you were patient and careful enough.</p>
<h1 id="the-questions">The Questions</h1>
<p>The questions start off easy, but the first few answers clearly depend on answers to <em>other</em> questions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. The first question whose answer is A is:
(A) 1 (B) 2 (C) 3 (D) 4 (E) 5
2. The next question with the same answer as this one is:
(A) 4 (B) 6 (C) 8 (D) 10 (E) 12
</code></pre></div></div>
<p>But wait, it gets harder! Instead of one question referencing another’s answer, some questions reference the
distribution of all answers to the entire set of questions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>7. The answer that appears most often (possibly tied) is:
(A) A (B) B (C) C (D) D (E) E
8. Ignoring those that occur equally often, the answer that appears least often is:
(A) A (B) B (C) C (D) D (E) E
...
18. The number of prime-numbered questions whose answers are vowels is:
(A) prime (B) square (C) odd (D) even (E) zero
</code></pre></div></div>
<p>Finally, question 20 doesn’t just require you to know answers to other questions in the quiz,
you have to know the optimal score over any set of answers to the quiz:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20. The maximum score that can be achieved on this test is:
(A) 18 (B) 19 (C) 20 (D) indeterminate
(E) achievable only by getting this question wrong
</code></pre></div></div>
<p>You might want to leave question 20 out and solve 1-19 separately, then use the results
to figure out 20. But remember, some of the first 19 questions refer to the distribution of
answers on the quiz, which includes the answer to question 20!</p>
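<p>To get a feel for this kind of self-reference before bringing in z3, here’s a toy two-question quiz of my own invention (far smaller than Woods’s puzzle), brute-forced over all answer assignments:</p>

```python
from itertools import product

# Toy self-referential quiz (mine, just to show the flavor):
#   Q1. The answer to question 2 is:  (A) A  (B) B
#   Q2. Questions 1 and 2 have:       (A) the same answer  (B) different answers
def correct_answers(a1, a2):
    q1 = (a1 == a2)                   # Q1 is right iff its guess a1 equals a2
    q2 = ((a2 == 'A') == (a1 == a2))  # Q2 picks "same" iff they really match
    return q1 + q2                    # booleans sum as 0/1

best = max(product('AB', repeat=2), key=lambda p: correct_answers(*p))
print(best, correct_answers(*best))  # ('A', 'A') 2 -- both answers correct
```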
<h1 id="using-z3-to-find-solutions">Using z3 to find solutions</h1>
<p>My general idea in encoding this problem as a boolean formula is:</p>
<ul>
<li>Start by defining 100 variables: x1a, x1b, x1c, x1d, x1e, x2a, x2b, …, x20e. x1a means “Question 1 is marked A”, and so on.</li>
<li>Make sure that each question has exactly one answer by adding some constraints to x1a-x1e, x2a-x2e, etc.</li>
<li>Define 20 more variables: x1, x2, x3, …, x20. x1 means “The answer to question 1 is correct”. Each of these is defined by an
expression involving the original 100 variables that represent answers to the questions.</li>
</ul>
<p>Then we just let z3 run and try to satisfy as many of x1, x2, …, x20 as it can and interpret the
results.</p>
<h1 id="the-encoding">The Encoding</h1>
<p>My program starts off by declaring the boolean variables, something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from z3 import *
x1 = Bool('x1')
...
x20 = Bool('x20')
x1a = Bool('x1a')
...
x20e = Bool('x20e')
</code></pre></div></div>
<p>I also want to write some helper functions that iterate over variables and answers, so I’ll define:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>answers = [None]
answers.append(dict(zip(['A','B','C','D','E'],
[x1a, x1b, x1c, x1d, x1e])))
...
answers.append(dict(zip(['A','B','C','D','E'],
[x20a, x20b, x20c, x20d, x20e])))
</code></pre></div></div>
<p>so that <code class="language-plaintext highlighter-rouge">answers[7]['D']</code> gives you <code class="language-plaintext highlighter-rouge">x7d</code>, for example. Also, there’s</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>correct = [None] + \
[x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20]
</code></pre></div></div>
<p>so that <code class="language-plaintext highlighter-rouge">correct[13]</code> is <code class="language-plaintext highlighter-rouge">x13</code>.</p>
<p>I need to treat booleans as integers sometimes, so I do that with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def btoi(x): return If(x,1,0)
</code></pre></div></div>
<p>Finally, I want to reduce some sequences of expressions using <code class="language-plaintext highlighter-rouge">And</code>, <code class="language-plaintext highlighter-rouge">Or</code>, and <code class="language-plaintext highlighter-rouge">+</code>, but functional programming
is hard to read in Python, so I’ll use a few more helper functions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from functools import reduce  # reduce isn't a builtin in Python 3

def SumOf(xs): return reduce(lambda x,y: x+y, xs)
def AndOf(xs): return reduce(lambda x,y: And(x,y), xs)
def OrOf(xs): return reduce(lambda x,y: Or(x,y), xs)
</code></pre></div></div>
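<p>Outside of z3, these reducers behave just like the ordinary built-ins. A quick plain-Python sanity check of <code class="language-plaintext highlighter-rouge">SumOf</code> (no z3 required):</p>

```python
from functools import reduce

def SumOf(xs): return reduce(lambda x, y: x + y, xs)

# On plain integers, SumOf behaves exactly like sum();
# on z3 expressions, the same reduction builds up a symbolic sum instead.
total = SumOf([1, 2, 3, 4])
```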
<p>Now we can start defining x1, x2, etc. in terms of the answers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 1. The first question whose answer is A is:
# (A) 1 (B) 2 (C) 3 (D) 4 (E) 5
s.add(x1 == Or(x1a,
And(x1b,x2a),
And(x1c,x3a,Not(x2a)),
And(x1d,x4a,Not(x2a),Not(x3a)),
And(x1e,x5a,Not(x2a),Not(x3a),Not(x4a))))
</code></pre></div></div>
<p>z3 and Python make it easy to express some of the more complicated constraints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 3. The only two consecutive questions with identical answers are questions:
# (A) 15 and 16 (B) 16 and 17 (C) 17 and 18 (D) 18 and 19 (E) 19 and 20
def same_answer(i,j):
return Or(And(answers[i]['A'],answers[j]['A']),
And(answers[i]['B'],answers[j]['B']),
And(answers[i]['C'],answers[j]['C']),
And(answers[i]['D'],answers[j]['D']),
And(answers[i]['E'],answers[j]['E']))
s.add(x3 == Or(And(x3a, same_answer(15, 16),
AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 15)),
And(x3b, same_answer(16, 17),
AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 16)),
And(x3c, same_answer(17, 18),
AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 17)),
And(x3d, same_answer(18, 19),
AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 18)),
And(x3e, same_answer(19, 20),
AndOf(Not(same_answer(i,i+1)) for i in range(1,20) if i != 19))))
</code></pre></div></div>
<p>A few of these were a little tricky. Question 8 wants the answer that occurs least often
among all answers that don’t occur the same number of times as another answer. So if A and B are both chosen 3 times,
C is chosen 4 times, and D and E are each chosen 5 times, then A and B (and likewise D and E) are ignored because
they tie, and C would be the correct answer.</p>
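<p>The grading rule is easy to sanity-check in plain Python, separately from the z3 encoding. Here <code class="language-plaintext highlighter-rouge">least_distinct</code> is a hypothetical helper of my own, with made-up tallies to round out the example:</p>

```python
def least_distinct(tally):
    # Drop any answer whose count is shared with another answer,
    # then pick the least frequent answer that remains.
    counts = list(tally.values())
    distinct = {a: c for a, c in tally.items() if counts.count(c) == 1}
    return min(distinct, key=distinct.get)

# A and B tie at 3 and D and E tie at 5, so all four are ignored;
# C occurs least often among what's left.
winner = least_distinct({'A': 3, 'B': 3, 'C': 4, 'D': 5, 'E': 5})
```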
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 8. Ignoring those that occur equally often, the answer that appears least
# often is:
# (A) A (B) B (C) C (D) D (E) E
def other_answers(ans): return set(['A','B','C','D','E']) - set([ans])
def answer_tally(z): return SumOf(btoi(answers[i][z]) for i in range(1,21))
def least_of_distinct(ans):
others = other_answers(ans)
clauses = [answer_tally(ans) != answer_tally(x) for x in others]
for x in others:
remains = others - set([x])
clauses.append(Or(answer_tally(ans) < answer_tally(x),
OrOf(answer_tally(x) == answer_tally(y) for y in remains)))
return AndOf(clauses)
s.add(x8 == Or(And(x8a, least_of_distinct('A')),
And(x8b, least_of_distinct('B')),
And(x8c, least_of_distinct('C')),
And(x8d, least_of_distinct('D')),
And(x8e, least_of_distinct('E'))))
</code></pre></div></div>
<p>All of the other questions up through 19 are pretty straightforward. But what should we do about question 20?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20. The maximum score that can be achieved on this test is:
(A) 18 (B) 19 (C) 20 (D) indeterminate
(E) achievable only by getting this question wrong
</code></pre></div></div>
<p>We can’t correctly grade it just based on one solution to all of x1, x2, … x19. But if
we can find at least one valid solution with the first 19 correct, then C is correct. And if we can’t find
a solution with the first 19 correct, then maybe we can find one with 18 or 17 correct and fall back on B or A.
So we’ll punt for now and define it optimistically as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s.add(x20 == x20c)
</code></pre></div></div>
<p>We’ll hope to work down from there, re-defining it as needed.</p>
<h1 id="finding-the-solutions">Finding the Solution(s)</h1>
<p>Once we’ve defined x1 through x20, we can use z3 to check whether our formula is satisfiable and, if so,
get a model that maps variables to a satisfying assignment:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if s.check() == sat:
print_solution(s.model())
</code></pre></div></div>
<p>My <code class="language-plaintext highlighter-rouge">print_solution</code> just prints out all of the answers using uppercase letters if the question
is correct and lowercase otherwise. Now we can add one more constraint to maximize the score:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s.add(SumOf(btoi(correct[i]) for i in range(1,21)) >= 20)
</code></pre></div></div>
<p>When we do that, we see there’s no satisfying assignment at 20, so we reset our expectations and re-define
x20 as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s.add(x20 == x20b)
</code></pre></div></div>
<p>We update our maximization constraint to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s.add(SumOf(btoi(correct[i]) for i in range(1,21)) >= 19)
</code></pre></div></div>
<p>and then find out that there is a solution with 19 correct:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A E D C A B C D C A C E D B C A D A A c
</code></pre></div></div>
<p>What would it mean if this were the only solution with 19 correct? Then 20(E), which says that
the maximum score is achievable only by getting 20 wrong, would be correct! And once we
knew that 20(E) was correct, maybe there would be a solution with all 20 correct. But any such
solution would achieve the maximum score while getting question 20 right, which
makes 20(E) (“the maximum score is achievable only by getting 20 wrong”) necessarily incorrect!
And if we can get stuck in that kind of circular reasoning going back and forth between 20(E) being
true and false, doesn’t that mean that 20(D) (“the answer to 20 is indeterminate”) is then true?</p>
<p>Before we go too far down that path, we need to keep looking for solutions with 19 questions correct.
Clearly, we need to iterate over all satisfying assignments at a given score to make any
sense of a solution to 20. To do that, we can modify our check/print logic:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while s.check() == sat:
m = s.model()
print_solution(m)
block_solution(s,m)
</code></pre></div></div>
<p>Where <code class="language-plaintext highlighter-rouge">block_solution</code> adds a new clause to the solver that prohibits the most recent solution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def block_solution(s, m):
ans = [v for vv in [answers[i].values() for i in range(1,21)] for v in vv]
s.add(Not(And(AndOf(v == m[v] for v in ans),
AndOf(v == m[v] for v in correct[1:]))))
</code></pre></div></div>
<p>Now when we run the solver and iterate over all solutions with score at least 19, we get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A E D C A B C D C A C E D B C A D A A c
D C E A B A D C D A E D A E D B D B E e
D C E A B E B C E A B E A E D B D A b B
</code></pre></div></div>
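<p>A quick check confirms that each of these assignments really does score 19, counting uppercase letters as correctly answered questions:</p>

```python
solutions = [
    "A E D C A B C D C A C E D B C A D A A c",
    "D C E A B A D C D A E D A E D B D B E e",
    "D C E A B E B C E A B E A E D B D A b B",
]
# Each line has 20 answers; uppercase means the question was graded correct.
scores = [sum(1 for letter in s.split() if letter.isupper()) for s in solutions]
```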
<p>Whew, so that settles it – 20(E) is incorrect, since the third solution above achieves the maximum
possible score (19) without getting question 20 wrong. And 20(D) is clearly incorrect since
20(E) is incorrect and there is an absolute achievable best score for the quiz.</p>
<p>So B is the only correct answer for question 20 and the above three answers are the only three ways to
achieve the maximum score of 19. We’re done!</p>
<h1 id="final-thoughts">Final Thoughts</h1>
<p>Question 20 is supposed to
be the kicker, but question 9 is also a little weird:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>9. The sum of all question numbers whose answers are correct and the same as
this one is in the range:
(A) 59 to 62, inclusive
(B) 52 to 55, inclusive
(C) 44 to 49, inclusive
(D) 61 to 67, inclusive
(E) 44 to 53, inclusive
</code></pre></div></div>
<p>It’s the only one of the questions other than 20 that depends on its own correctness.
You can see this in the way I encoded it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def answer_sum(ans):
return SumOf(If(And(correct[i],answers[i][ans]), i, 0) for i in range(1,21))
s.add(x9 == Or(And(x9a, answer_sum('A') >= 59, answer_sum('A') <= 62),
And(x9b, answer_sum('B') >= 52, answer_sum('B') <= 55),
And(x9c, answer_sum('C') >= 44, answer_sum('C') <= 49),
And(x9d, answer_sum('D') >= 61, answer_sum('D') <= 67),
And(x9e, answer_sum('E') >= 44, answer_sum('E') <= 53)))
</code></pre></div></div>
<p>When I say it “depends on its own correctness”, I just mean that x9 appears on both the
left and the right side of the == in the definition of x9 (recall that <code class="language-plaintext highlighter-rouge">correct[9]</code> is just x9).</p>
<p>Think about any specific case where x9 is correct: for example, if question 9 is marked “A”
and the sum of all question numbers with A as the answer that are correct (including question 9)
is, say, 61. 9(A) is clearly correct here, but a grader would be just as correct if they marked 9
wrong: in that case, the sum of all question numbers that are correct and have A as the answer is now
61 - 9 = 52, which is no longer in the range [59, 62].</p>
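<p>Spelled out numerically (the sum of 61 here is just an illustrative value):</p>

```python
lo, hi = 59, 62   # question 9, answer (A): the sum must land in [59, 62]
s = 61            # sum of correct A-answered question numbers, counting question 9

consistent_if_correct = lo <= s <= hi              # grading 9 correct works...
consistent_if_incorrect = not (lo <= s - 9 <= hi)  # ...and so does grading it
                                                   # wrong: the sum drops to 52
```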
<p>Since none of the ranges in question 9 span more than 9 numbers, <em>any</em> time question 9 is marked
correct, it can also be marked incorrect – it’s all up to the grader! This sort of choice in grading
never really comes into play since we’re always looking to maximize the score, but it does make
you think about clever alternatives for question 20 that could introduce some indeterminate
conclusions.</p>
<p>In fact, if you look back at my definition of <code class="language-plaintext highlighter-rouge">block_solution</code>, you’ll see that it blocks both
the answers (x1a, x2b, etc.) and whether or not they’re correct (x1, x2, …) when blocking
a solution. You have to do this to get correct results: otherwise, the solver could come across
a solution where question 9 could be graded correct for a score of 19 but choose to grade it
incorrect instead, giving it a score of 18. If you only blocked the choices of answers, that solution
would be blocked entirely and the solver would never discover the “alternative grading” that gives it 19 points.</p>
<h1 id="more-final-thoughts">More Final Thoughts</h1>
<p>My full Python solution is <a href="https://github.com/aaw/twenty-questions/blob/main/20q.py">here</a> if you want to run it or play around with the constraints.</p>
<p>If you think this is interesting, you should check out Knuth’s treatment of a variant
of this problem in Exercises 7.2.2.71-72 in <a href="https://www.pearson.com/store/p/the-art-of-computer-programming-volume-4-fascicle-5-mathematical-preliminaries-redux-introduction-to-backtracking-dancing-links/P100000291857/9780134671796">Volume 4, Fascicle 5 of The Art of Computer Programming</a>.
The latter exercise plays around with constraints to see what happens when the answer to question
20 isn’t as clear-cut as it was in Donald Wood’s original problem.</p>
<p>The Twenty Questions puzzle is Copyright 2000, 2001, 2015 by Donald R. Woods.</p>
Sun, 01 Nov 2020 21:05:32 +0000
http://blog.aaw.io/2020/11/01/twenty-questions.html

Simulating Levenshtein Automata

<p>A <a href="https://en.wikipedia.org/wiki/Levenshtein_automaton">Levenshtein Automaton</a> is a finite state machine that recognizes all
strings within a given <a href="https://en.wikipedia.org/wiki/Edit_distance">edit distance</a> from a fixed string.
Here’s a Levenshtein Automaton that accepts all
strings within edit distance 2 from “banana”:</p>
<p><img src="/assets/levtrie/nfa-banana.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2" /></p>
<p>The epsilon transitions represent the empty string. The * transitions
are shorthand for “any character” to save space in the diagram, but the
actual automaton has one transition for every possible input character anywhere
you see a *. The automaton accepts a string <em>s</em> exactly when
there’s a directed path from the start state on the lower left to any of the
accept states on the right such that the concatenation of all of the labels on
the path in order equals <em>s</em>.</p>
<p>The automaton above is an <a href="https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton">NFA</a> but it looks like three
copies of a <a href="https://en.wikipedia.org/wiki/Deterministic_finite_automaton">DFA</a> accepting the string “banana” stacked on top of each other.
Transitions between the three DFAs represent edit operations under the
<a href="https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance">Damerau-Levenshtein metric</a>: a transition going up represents
the insertion of a character, a diagonal epsilon transition represents the
deletion of a character, and a diagonal * transition represents
a substitution.</p>
<p>Levenshtein Automata can be used as one part of a data structure that generates
spelling corrections. The
other component is a <a href="https://en.wikipedia.org/wiki/Trie">Trie</a> that contains all correctly spelled words.
Any word that’s accepted by both the Trie and the Levenshtein Automaton is a
word that’s correctly spelled and up to edit distance <em>d</em> from the query.
Given a query, you’d generate a Levenshtein Automaton for that query with the
desired edit distance and then traverse both the automaton and the Trie in
parallel, yielding a word whenever you reach an accept state in the Levenshtein
Automaton and at a leaf node in the Trie at the same time.</p>
<p>Generating a non-deterministic Levenshtein Automaton is straightforward.
The node and edge structure of the automaton above for the query “banana” isn’t a function
of the word “banana” at all – only the transition labels would be different if you
wanted to create a similar automaton accepting anything within edit distance 2 of any
other six-letter word. Increasing or decreasing the edit distance just
involves adding or removing one or more rows of identical
states. Increasing or decreasing the length of the fixed word just involves
adding or removing one or more columns.</p>
<p>Unfortunately, even though generating a non-deterministic Levenshtein Automaton is easy,
simulating it efficiently isn’t. In general,
simulating all execution paths of an NFA with <em>n</em> states on an input of length
<em>m</em> can take time <em>O(nm)</em> just to keep up with the bookkeeping:
a state in the simulation is a subset of states in the NFA and you have to
update that set of up to <em>n</em> states <em>m</em> times during the simulation.</p>
<p><a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652">Schulz and Mihov introduced Levenshtein Automata</a> and showed
how to generate and simulate them in linear time in the length of an input string.
The implementation they describe involves a preprocessing step that creates a DFA
defined by a transition table whose size grows very quickly with the edit distance <em>d</em>.
The implementation starts with a two-dimensional table from which many of the entries can
be removed because they represent dead states. For <em>d</em>=1, a 5 <em>x</em> 8 table
is reduced to just 9 entries; for <em>d</em>=2, a 30 <em>x</em> 32 table is reduced to
one with just 80 entries. For <em>d</em>=3 and 4, the tables start with
196 <em>x</em> 128 = 25,088 and 1352 <em>x</em> 512 = 692,224 entries, respectively, before
dead states are removed.</p>
<p>The Lucene project <a href="http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html">implemented</a> Schulz and Mihov’s scheme, but only for
<a href="https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java"><em>d</em> = 1</a> and <a href="https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java"><em>d</em> = 2</a>. Their implementation uses a Python script
from another project, <a href="https://sites.google.com/site/rrettesite/moman">Moman</a>, to generate Java code with the transition
tables offline.</p>
<p>It’s hard to beat a table-driven DFA for matching regular expressions,
but on the other hand, it’s not clear that the simulation of the Levenshtein
Automaton is the bottleneck in a spelling corrector. The size of the Levenshtein
Automaton is dwarfed by the size of the Trie containing the correctly spelled
words, and since query processing involves simulating both the Trie and the
Levenshtein Automaton in parallel, the main bottleneck in the simulation will
likely be the I/O expense of loading nodes for the Trie. Because of their high
and irregular branching factor, Tries are all but impossible to lay out in
memory with any kind of locality of reference for an arbitrary query.</p>
<p>So maybe there’s a simpler way to simulate Levenshtein Automata that’s
theoretically slower but will give us about the same real-world performance.
Jules Jacobs recently wrote <a href="http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html">a post describing a pretty
substantial simplification</a> that simulates an automaton using states
based on rows in the two-dimensional matrix of the <a href="https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm">Wagner-Fischer algorithm</a> to
compute edit distance. The simulation takes <em>O(nd)</em> time for an input of
length <em>n</em>, which is essentially as good as an implementation based on
Schulz and Mihov’s scheme since <em>d</em> is often small and fixed.</p>
<p>You can also derive an <em>O(nd)</em> time simulation by just directly simulating
the NFA that I described at the beginning of this post. You just need to make
a few optimizations based on the highly regular structure of this family of
NFAs, but all of the optimizations are relatively straightforward. I’ll describe
those optimizations in this post. I’ve also implemented everything I describe here
in a Go package called <a href="http://github.com/aaw/levtrie">levtrie</a> that provides a map-like interface to a Trie
equipped with a Levenshtein Automaton.</p>
<h1 id="alternatives-to-levenshtein-automata">Alternatives to Levenshtein Automata</h1>
<p>First, some more background. You can skip this section and the next if you
already understand why Levenshtein Automata are a good choice for indexing
a set of words by edit distance.</p>
<p>Edit distance is a <a href="https://en.wikipedia.org/wiki/Metric_(mathematics)">metric</a> and there are <a href="https://en.wikipedia.org/wiki/Metric_tree">many data
structures</a> that index keys by distance under various metrics.
Maybe the most appropriate metric tree for edit distance is the <a href="https://en.wikipedia.org/wiki/BK-tree">BK-Tree</a>.
The BK-Tree, like other metric trees, has two big disadvantages: first, the
layout of the tree is highly dependent on the distribution and the insertion
order, so it’s hard to quote good bounds on how balanced the tree is in general.
Second, during lookups, you have to compute the distance function at each node along
the search path,
which can be expensive for a metric like edit distance that takes quadratic
time to compute (or <em>O(nd)</em> if you optimize for computing distance at most <em>d</em>).</p>
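<p>That <em>O(nd)</em> optimization comes from only filling in cells within <em>d</em> of the main diagonal of the dynamic programming matrix. A rough sketch of the idea (my own illustration, not code from any implementation discussed here):</p>

```python
def banded_edit_distance(a, b, d):
    """Edit distance between a and b, capped at d + 1: returns d + 1 as soon
    as the distance is known to exceed d. Only cells within d of the main
    diagonal of the Wagner-Fischer matrix are ever filled in."""
    if abs(len(a) - len(b)) > d:
        return d + 1
    big = d + 1
    # Row 0 of the matrix, restricted to the band.
    prev = {j: j for j in range(min(len(b), d) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - d), min(len(b), i + d) + 1):
            if j == 0:
                cur[j] = i
                continue
            cur[j] = min(prev.get(j - 1, big) + (a[i - 1] != b[j - 1]),  # match/substitution
                         cur.get(j - 1, big) + 1,                        # insertion
                         prev.get(j, big) + 1)                           # deletion
        if min(cur.values()) > d:
            return d + 1  # every cell in the band already exceeds d
        prev = cur
    return min(prev.get(len(b), big), big)
```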
<p><a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">Locality-sensitive hashing</a>
is another option, but, like metric trees, even after hashing to a bucket you’re
left with a set of candidates on which you need to exhaustively calculate edit
distance. It’s also very difficult with most metrics to get anywhere close to
perfect recall with locality-sensitive hashing, and perfect recall is essential
to spelling correction since there are often just a few good correction
candidates.</p>
<p>Still another alternative is to index the <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> of all correctly spelled
words and put them in an inverted index. At query time, you’d break up the query string into
n-grams and search the inverted index for them all, running an actual edit distance
computation as a final pass on all of the candidates that come back. This doesn’t
always work well in practice, since, for example, if you’re trying to retrieve
“bahama” from the query “banana” (edit distance 2), none of the 3-grams match (bahama
breaks up into “bah”, “aha”, “ham”, “ama” and banana breaks up into
“ban”, “ana”, “nan”, and “ana”). Even moving down to 2-grams doesn’t
help much; only the leading 2-gram “ba” matches, so you’d have to retrieve all
strings that start with “ba”
and exhaustively test edit distance on all of them to find “bahama”.</p>
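<p>You can check the “banana”/“bahama” example directly with a throwaway <code class="language-plaintext highlighter-rouge">ngrams</code> helper (my own, just for illustration):</p>

```python
def ngrams(s, n):
    # All contiguous substrings of length n, in order.
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# No 3-grams in common, even though the words are only 2 edits apart:
common3 = set(ngrams("banana", 3)) & set(ngrams("bahama", 3))
# With 2-grams, only the leading "ba" matches:
common2 = set(ngrams("banana", 2)) & set(ngrams("bahama", 2))
```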
<p>In contrast to all of the methods described above, using a
Trie with a Levenshtein Automaton doesn’t ever require exhaustively calculating
edit distance during lookups: the cost of computing edit distance is incremental
and shared among many candidates that share paths in the Trie. A Trie can also
efficiently return all strings that have the query string as a prefix, which
is a popular feature with most on-the-fly spelling correctors: instead of
waiting for someone to type the entire word “banona” to return the suggestion
“banana”, you can start returning suggestions as soon as they’ve typed “ban” or
even “ba” by exploring paths from those prefixes in the Trie.</p>
<h1 id="finding-edits-in-a-trie-without-automata">Finding edits in a Trie without automata</h1>
<p>Levenshtein Automata are used to generate an optimized Trie traversal but
you can actually build a quick-and-dirty spelling corrector using just a Trie. I’ll
walk through that construction in this section since it motivates why you’d want
to augment a Trie with a Levenshtein Automaton in the first place.</p>
<p>Suppose that your query string is “banana”. To figure out if that exact string is
in the Trie, you’d just use the sequence of characters in the string to find a
path through the Trie:
starting from the root, transition on the “b” edge to a node one level down, then transition
on an “a” edge, then an “n” edge, and so on, until you’re at the end of the string.
If the word is in the Trie, then
at the end of the traversal you will have reached
a node that represents the last character of that word. Otherwise, you will have
stopped at some point along the way because there wasn’t an edge available to make the
transition you needed to make, in which case you know the word you’re looking up isn’t
in the Trie.</p>
<p>Instead, if you wanted to find both exact matches to “banana” and words in the Trie
that were a few edits away,
you could extend the search process to branch out a little and try paths that
correspond to edits.
If you want to find words that are at most, say, 2 edits away from “banana”, you could
start your traversal at the root but perform the following four searches
while keeping track of an edit budget that’s initially 2:</p>
<ul>
<li><em>Simulate no edit</em>: Move from the root to the second
level of the Trie on the edge labeled “b”. Keep your edit budget at 2 and set the remaining string to match
to “anana”.</li>
<li><em>Simulate an insertion before the first character</em>: For every edge out of the root of the Trie, move
to the second level of the Trie along that edge. Decrement your edit budget to 1 and keep the remaining string
to match set to “banana”.</li>
<li><em>Simulate a deletion of the first character</em>: Don’t move from the root of the Trie at all, simply decrement your
edit budget to 1 and update the string you want to match to “anana”.</li>
<li><em>Simulate a substitution for the first character</em>: For every edge out of the root of the Trie except
the edge labeled “b”, move to the second level of the Trie along that edge, decrement your edit budget to 1,
and update the remaining string you want to match to “anana”.</li>
</ul>
<p>Now just keep recursively applying these cases at each new node you explore and
stop the traversal once you reach an edit budget that’s negative. If you ever get to
a leaf node with a non-negative edit budget at least as big as the remaining string
length, return that string as a match.</p>
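<p>Here’s a minimal sketch of that traversal in Python, with the Trie as nested dicts and an arbitrary <code class="language-plaintext highlighter-rouge">'$'</code> key (my own convention, just for this sketch) marking the end of a word:</p>

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
        node['$'] = {}  # marks that a word ends at this node
    return root

def fuzzy_search(node, rest, budget, prefix, out):
    if budget < 0:
        return
    # A word ends here and deleting the rest of the query fits in the budget.
    if '$' in node and budget >= len(rest):
        out.add(prefix)
    for c, child in node.items():
        if c == '$':
            continue
        if rest and c == rest[0]:
            fuzzy_search(child, rest[1:], budget, prefix + c, out)      # no edit
        fuzzy_search(child, rest, budget - 1, prefix + c, out)          # insertion
        if rest:
            fuzzy_search(child, rest[1:], budget - 1, prefix + c, out)  # substitution
    if rest:
        fuzzy_search(node, rest[1:], budget - 1, prefix, out)           # deletion
```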
<p>This traversal will discover all strings in the Trie that are within a fixed edit distance of
a query but the traversal does a lot of repeated work. You can generate the correctly-spelled
word “banana” from the misspelled word “banaba” using either a substitution,
a deletion followed by an insertion,
or an insertion followed by a deletion. This means that in the traversal we defined above, we’ll
visit the node in the Trie defined by “banan” at least three times from three different search paths.
The search also has degenerate paths that just
burn the error budget but do no useful work, for example a deletion followed by an insertion
of the same character. These paths just put you back in the same position the traversal started
from with a smaller edit budget.</p>
<p>Again, because of their large branching factor, Tries don’t have good locality of reference. Each
time you follow a pointer to another node, you’re likely jumping to memory that at the very
least causes a cache miss, so revisiting the same node several times during the search can be expensive.
You might think you could optimize this a little by marking Trie nodes as “visited” when you first
see them and avoiding exploring visited nodes more than once. But you can also see the same node
through different search paths with different edit budgets, so you’d have to store more than
just a visited flag – you’d at least need to store the largest edit budget that you’d visited
the node with. If you ever saw the node again on a search path with a smaller edit budget, you could
prune that portion of the search, but that still means that you could end up exploring a node up
to <em>d</em> times on a search for words within edit distance <em>d</em>.</p>
<p>A Levenshtein Automaton maintains all of the search state so that you never have to traverse
a path in the Trie more than once. If you use a deterministic Levenshtein Automaton like
<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652">Schulz and Mihov’s original scheme</a>, it’s really as efficient a way to encode the search
state as you can get: each transition from state to state in the automaton is just a <em>O(1)</em>
table lookup.</p>
<p>There’s some history of people rediscovering ways to maintain the search state
through more or less efficient means: <a href="http://stevehanov.ca/blog/index.php?id=114">Steve Hanov described</a> a method of
keeping track of the search state using the Wagner-Fischer matrix that allows you to update
states in time <em>O(m)</em> when searching for a string of length <em>m</em>. The method that Jules Jacobs
describes is similar but optimized even further to get a <em>O(d)</em> update time
for edit distance <em>d</em>, regardless of the length of the input string. The method I’ll describe
in the next two sections also has an <em>O(d)</em> time bound on state transitions but it isn’t
directly derived from the Wagner-Fischer edit distance algorithm.</p>
<h1 id="an-ond2-time-simulation">An <em>O(nd<sup>2</sup>)</em>-time simulation</h1>
<p>Now back to the original Levenshtein NFA construction at the beginning of this post.
Instead of creating a DFA from the NFA, we just want to simulate the NFA directly.
To simulate one, we need to maintain a set of active states as we read input characters, accepting
exactly when we’ve read all of the input and there’s at least one accepting state in our current set
of active states.</p>
<p>We initialize the set of active states to contain just the NFA’s start state plus anything
reachable by an epsilon transition.
On each input character, we create an initially empty new set of active states and iterate
through all current active states, trying to take any valid
transition from each state on the current input character and adding the resulting state
and anything else reachable by epsilon transitions
to the new set of active states if we succeed. When we’re done with iterating through
all current active states, the new set of active states becomes current and we proceed
to the next input character. If we’re done reading input characters and there’s an accept
state in our set of active states, we accept, otherwise we reject.</p>
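<p>This generic simulation is short enough to write down directly. Below, a state is a pair (number of characters of <em>w</em> matched, errors used), and the epsilon moves are the diagonal deletion transitions. This is a sketch of the simulation just described, not the optimized version developed later in the post:</p>

```python
def lev_nfa_accepts(w, d, s):
    """Simulate the Levenshtein NFA for word w with edit distance d on input s."""
    def eps_closure(states):
        # Epsilon (deletion) transitions: skip a character of w for one error.
        states, frontier = set(states), list(states)
        while frontier:
            i, e = frontier.pop()
            if i < len(w) and e < d and (i + 1, e + 1) not in states:
                states.add((i + 1, e + 1))
                frontier.append((i + 1, e + 1))
        return states

    active = eps_closure({(0, 0)})
    for c in s:
        nxt = set()
        for i, e in active:
            if i < len(w) and w[i] == c:
                nxt.add((i + 1, e))          # exact match
            if e < d:
                nxt.add((i, e + 1))          # insertion of c
                if i < len(w):
                    nxt.add((i + 1, e + 1))  # substitution of c for w[i]
        active = eps_closure(nxt)
    # Accept if any active state has consumed all of w.
    return any(i == len(w) for i, e in active)
```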
<p>Let’s walk through a simulation of the Levenshtein NFA that accepts all words within edit
distance 2 of “banana”. We’ll feed it the input string “bahama”. The set of active states is highlighted
in blue at each step below. First, the initial set of active states
contains all states on the diagonal rising from the start state:</p>
<p><img src="/assets/levtrie/nfa-banana-initial.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 in its initial state" /></p>
<p>After consuming “b”, active states are again highlighted in blue:</p>
<p><img src="/assets/levtrie/nfa-banana-b.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;b&quot;" /></p>
<p>After consuming “ba”:</p>
<p><img src="/assets/levtrie/nfa-banana-ba.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;ba&quot;" /></p>
<p>After consuming “bah”:</p>
<p><img src="/assets/levtrie/nfa-banana-bah.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;bah&quot;" /></p>
<p>After consuming “baha”:</p>
<p><img src="/assets/levtrie/nfa-banana-baha.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;baha&quot;" /></p>
<p>After consuming “baham”:</p>
<p><img src="/assets/levtrie/nfa-banana-baham.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;baham&quot;" /></p>
<p>And finally, after consuming “bahama”, a string that’s edit distance 2 from “banana”, we end up in an
accepting state:</p>
<p><img src="/assets/levtrie/nfa-banana-bahama.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;bahama&quot;" /></p>
<p>We want to bound the time complexity of simulating an NFA from this family of Levenshtein NFAs.
One way to do this is to bound
the maximum number of possible active states, since simulating the NFA on an input of length
<em>n</em> involves updating the entire set of active states <em>n</em> times. Since a Levenshtein NFA created for
a word of length <em>m</em> and edit distance <em>d</em> has <em>(m + 1) * (d + 1)</em>
states, this means that the worst-case time complexity for simulating that NFA on an input of
length <em>n</em> could be as bad as
<em>O(nmd)</em>. We can get a better bound by being a little more careful in
our simulation.</p>
<p>The first thing to notice about the family of Levenshtein NFAs is that the diagonals contain paths
of epsilon transitions all the way up. This means that
any time a state is active, all other states further up on the same diagonal are active. Instead of
keeping track of the set of active states, then, we can just keep track of the lowest active state on
every diagonal. We’ll call this minimum active index the “floor” of the diagonal.</p>
<p>We’ll start numbering floors from the bottom: any diagonal with a state active in the bottom
row of NFA states has floor 0. Since there are <em>d + 1</em> rows in the NFA, the maximum floor a
diagonal can have is <em>d</em>.</p>
<p>To make sure that every state in the NFA is contained in some diagonal, we just need to create a
few fake diagonals that extend out to the left a little bit, in addition to the diagonals
anchored by the states in the bottom row of the NFA:</p>
<p><img src="/assets/levtrie/nfa-banana-fake-diags.png" alt="Identifying diagonals in a Levenshtein Automaton" /></p>
<p>Indexed like this, a Levenshtein NFA has <em>m + d + 1</em> diagonals. This means that the full
set of active states in the NFA can be represented by at most <em>m + d + 1</em> diagonal floors.</p>
<p>Actually, we never end up needing to consider all <em>m + d + 1</em> diagonals at once.
There’s only ever one diagonal with floor 0 at any
point in time, since consuming an input character while one diagonal has a floor at position 0
transfers the state to position 1 in the previous diagonal, position 1 in the current diagonal,
and possibly to position 0 in the next diagonal:</p>
<p><img src="/assets/levtrie/level-0-transition.png" alt="A single level-0 transition in a Levenshtein Automaton" /></p>
<p>This fact generalizes to higher positions in the set of diagonals: there’s always a sliding window of
at most <em>2d + 1</em> diagonals that are active at level <em>d</em> or lower. You can prove this by
induction on <em>d</em> where the general induction step looks at a window of <em>2d - 1</em>
diagonals at level <em>d - 1</em> and shows that they can expand to a window of at most <em>2d + 1</em>
at level <em>d</em>.</p>
<p>A particular example of the general case is illustrated below, with a starting state illustrated by
all light blue and dark blue nodes. These light blue and dark blue nodes cover 5 diagonals at level 2
or lower. The green nodes illustrate all new nodes that can be active after a transition from
this state. The green and dark blue nodes together illustrate possible active nodes after the
transition, covering 7 diagonals at level 3 or lower:</p>
<p><img src="/assets/levtrie/multi-level-transition.png" alt="A general transition in a Levenshtein Automaton" /></p>
<p>All of this means that instead of tracking all <em>m + d + 1</em> diagonals, we only ever need to track
a set of <em>2d + 1</em> diagonal positions plus an offset into the NFA. The sliding window of diagonals that
we track moves through the NFA and we increment the offset by one each time we consume an
input character.</p>
<p>To update a single diagonal when we read an input character, we need to take the minimum of:</p>
<ul>
<li>The previous floor of the diagonal, plus one (shown in red below).</li>
<li>The previous floor of the next diagonal, plus one (shown in green below).</li>
<li>The smallest index in the previous diagonal that has a transition on the input character, if any (shown in dark blue below).</li>
</ul>
<p>The figure below shows all three of the contributions to a single diagonal’s update:</p>
<p><img src="/assets/levtrie/contributions-to-diagonal-update.png" alt="Contributions to a single diagonal update" /></p>
<p>Since we’re storing the state as a collection of <em>2d + 1</em> diagonal floors plus an offset,
the first two items in the list above (red and green updates in the figure above) can be computed in constant time.
The third item can be computed by iterating over all <em>d + 1</em> horizontal transitions
from the previous diagonal to see if any match the current input character.</p>
<p>Here’s pseudocode for our current algorithm for a fixed value of <em>d</em>
with a few details omitted:</p>
<pre>
// Returns a structure representing an initial state for a Levenshtein NFA. The
// State structure just contains:
// * d, the edit distance.
// * An array of 2*d + 1 integers representing floors.
// * An integer offset into the underlying string being matched.
State InitialState(d) { ... }
// Returns the floor of the ith diagonal in state s. Returns d + 1 if i is
// out of bounds.
int Floor(State s, int i) { ... }
// Returns the smallest index in the ith diagonal of state s that has a
// transition on character ch. If none exists, returns d + 1.
int SmallestIndexWithTransition(State s, string w, int i, char ch) { ... }
// Set the ith diagonal floor in state s to x.
void SetDiagonal(State* s, int i, int x) { ... }
// Returns true exactly when the given state is an accepting state.
bool IsAccepting(State s) { ... }
// Simulate the Levenshtein NFA for string w and distance d on the string u.
bool SimulateNFA(string w, int d, string u) {
State s = InitialState(d)
for each ch in u {
State t = s
t.offset = s.offset + 1
for i from 0 to 2*d + 1 {
int x = Floor(s, i + 1) + 1 // Diagonal
int y = Floor(s, i + 2) + 1 // Up
int z = SmallestIndexWithTransition(s, w, i, ch) // Horizontal
SetDiagonal(&t, i, Min(x, y, z))
}
s = t
}
return IsAccepting(s)
}
</pre>
<p>For each of the <em>n</em> characters in the input string, we iterate over the <em>2d + 1</em>
diagonals in the state and compute the three components of the diagonal update.
Two of the three components (<em>x</em> and <em>y</em> in the pseudocode above) are floor
computations which are really just array accesses, so they’re both constant
time operations. The third component (<em>z</em> in the pseudocode above, the result
of <code class="language-plaintext highlighter-rouge">SmallestIndexWithTransition</code>) is the most expensive, taking time <em>O(d)</em> to
compute, so the inner loop takes time <em>O(d<sup>2</sup>)</em> and the entire
simulation takes time <em>O(nd<sup>2</sup>)</em>.</p>
<h1 id="an-ond-time-simulation">An <em>O(nd)</em>-time simulation</h1>
<p>To get from <em>O(d<sup>2</sup>)</em> to <em>O(d)</em> time per state transition and <em>O(nd)</em>
for the entire simulation, there’s one final trick: making the
<code class="language-plaintext highlighter-rouge">SmallestIndexWithTransition</code> function in the pseudocode above run in constant
time.</p>
<p><code class="language-plaintext highlighter-rouge">SmallestIndexWithTransition</code> needs to be called on each of the at most <em>2d + 1</em>
active diagonals to find the smallest index above the floor of the diagonal with
a horizontal transition on the current input character. Luckily, there’s a lot of overlap
between the horizontal transitions of two consecutive diagonals: if you read the
horizontal transitions of any diagonal from bottom to top, they form a substring
of length <em>d + 1</em> of the string the automaton was generated with. Any two consecutive
diagonals have an overlap of <em>d</em> characters in these substrings:</p>
<p><img src="/assets/levtrie/jump-array.png" alt="Overlap between horizontal transitions between consecutive diagonals" /></p>
<p>The horizontal transitions from a set of <em>2d + 1</em> consecutive diagonals span a
substring of length at most <em>3d + 2</em>. This overlap between the horizontal transitions of
consecutive diagonals means
that instead of iterating up each diagonal doing <em>O(d)</em> work to figure out the
first applicable horizontal transition, if any, for each of the <em>2d + 1</em> diagonals,
we can instead precompute all the transitions on the substring of length <em>3d + 2</em> once
and use that precomputed result to calculate <code class="language-plaintext highlighter-rouge">SmallestIndexWithTransition</code> on each
diagonal.</p>
<p>This precomputation just involves calculating, for each index in the <em>3d + 2</em>
character window, the next index in the window that matches the current input
character. For example, if the window was the string “cookbook”:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [c][o][o][k][b][o][o][k]
0 1 2 3 4 5 6 7
</code></pre></div></div>
<p>our precomputed jump array for the character ‘k’ would look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [3][3][3][3][7][7][7][7]
0 1 2 3 4 5 6 7
</code></pre></div></div>
<p>Each index in the jump array tells us the next index of the character ‘k’ in the
window. If we look in the jump array at the index of a particular diagonal plus its
floor, the jump array tells us the next applicable horizontal transition on that diagonal.</p>
<p>The jump array needs to be initialized once per transition but initialization only takes time
<em>O(d)</em> and all subsequent calls to <code class="language-plaintext highlighter-rouge">SmallestIndexWithTransition</code>
can then be implemented with just a constant-time access into the jump array. Now everything
inside the for loop that iterates over <em>2d + 1</em> diagonals runs in constant time and computing
a transition of the NFA only takes time <em>O(d)</em>.</p>
<h1 id="final-thoughts">Final Thoughts</h1>
<p>I’ve implemented all of this in the Go package <a href="http://github.com/aaw/levtrie">levtrie</a> and the code in that
package is a better place to look if you’re interested in more details after reading
this post.</p>
<p>I took some shortcuts
there to make the code a little more readable at the expense of some unnecessary
memory bloat. In particular, the Trie implementation is very simple: each node keeps
a map of runes to children and no intermediate nodes are suppressed. The key for each
node, implicitly defined by the path through the Trie, is duplicated in the key-value
struct stored at each node.</p>
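<p>In the spirit of that description, here’s a deliberately simple trie sketch (my own code, not levtrie’s actual definitions): each node keeps a map of runes to children, no intermediate nodes are suppressed, and the key is duplicated in the entry stored at each terminal node:</p>

```go
package main

import "fmt"

// kv is the key-value struct stored at terminal nodes; the key is
// duplicated here even though it's implicit in the path to the node.
type kv struct{ key, value string }

// trieNode keeps a map of runes to children and an optional entry.
type trieNode struct {
	children map[rune]*trieNode
	entry    *kv
}

func newNode() *trieNode {
	return &trieNode{children: make(map[rune]*trieNode)}
}

// set walks the key one rune at a time, creating nodes as needed, and
// stores the entry at the final node.
func (t *trieNode) set(key, value string) {
	n := t
	for _, r := range key {
		child, ok := n.children[r]
		if !ok {
			child = newNode()
			n.children[r] = child
		}
		n = child
	}
	n.entry = &kv{key, value}
}

// get walks the key and returns the stored value, if any.
func (t *trieNode) get(key string) (string, bool) {
	n := t
	for _, r := range key {
		child, ok := n.children[r]
		if !ok {
			return "", false
		}
		n = child
	}
	if n.entry == nil {
		return "", false
	}
	return n.entry.value, true
}

func main() {
	root := newNode()
	root.set("banana", "fruit")
	v, ok := root.get("banana")
	fmt.Println(v, ok) // prints fruit true
}
```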
<p>If you’re looking to implement something similar and speed it up,
the first thing to consider is replacing the basic Trie with a <a href="https://en.wikipedia.org/wiki/Radix_tree">Radix</a> or <a href="http://cr.yp.to/critbit.html">Crit-bit tree</a>.
A Crit-bit tree will dramatically reduce the memory usage and the number of
node accesses needed for Trie traversals, but it will also complicate the logic of
the parallel search through the Trie and Levenshtein Automaton. Instead of iterating
over all runes emanating from a node in the Trie and matching those up with transitions
in the Levenshtein Automaton, you’ll have to do something more careful. You could
add some logic to traverse the Crit-bit tree from a node, simulating an iteration over
all single-character transitions. Multi-byte characters make this a little more tricky
than an ASCII alphabet. You might have an easier time converting the Levenshtein
Automaton itself to have an alphabet of bytes instead of multibyte characters, which
would then match up better with a Radix tree with byte transitions.</p>
Fri, 25 Aug 2017 21:58:17 +0000
http://blog.aaw.io/2017/08/25/levenshtein-automata.html
Regular Expression Search via Graph Cuts

<p>Google used to offer a search engine called
<a href="https://en.wikipedia.org/wiki/Google_Code_Search">Code Search</a> which
let you use regular expressions to search code.
I never thought Code Search was doing anything much more sophisticated than
something along the lines of a fast, distributed grep until Russ Cox explained
how it worked in a <a href="https://swtch.com/~rsc/regexp/regexp4.html">blog post</a>
and released <a href="https://github.com/google/codesearch">codesearch</a>, a smaller-scale
implementation.
Both the post and the code are fascinating – it turns out that Code Search
was doing something much more sophisticated than a distributed grep.</p>
<p>The big idea in codesearch is an engine that translates a large class of regular
expressions into queries that can be run against a standard inverted index.
In this post, I’m going to describe a different implementation of such an engine
using some of the same ideas in
codesearch along with a well-known algorithm for finding the minimum-weight
node cut in a graph. I think the result is a little bit simpler conceptually than
the query-building technique that codesearch uses, particularly if you
understand textbook regular expressions but don’t know a lot about the
internals of any particular regular expression library.</p>
<p>The implementation I’ll describe is available as a Go package at
<a href="https://github.com/aaw/regrams">https://github.com/aaw/regrams</a>. It’s a few hundred lines of code and the benchmarks
show it running only about 4-5 times slower than codesearch’s regular
expression-to-query translation. Both engines run on the order of microseconds for
individual queries, so a few times slower is still very usable for a search engine.</p>
<h1 id="trigram-queries-and-regular-expression-search">Trigram queries and regular expression search</h1>
<p>First, a little background about codesearch’s approach to regular expression search.</p>
<p>Before it can process queries, codesearch creates an inverted index from all the
files you want to search. The index contains all overlapping substrings of length 3
(“trigrams”) that occur in any of the files.
If you index a file called <code class="language-plaintext highlighter-rouge">greetings.txt</code> which contains just the
string <code class="language-plaintext highlighter-rouge">Hello, world!</code>,
the resulting trigram index would contain entries for
<code class="language-plaintext highlighter-rouge">Hel</code>, <code class="language-plaintext highlighter-rouge">ell</code>, <code class="language-plaintext highlighter-rouge">llo</code>, <code class="language-plaintext highlighter-rouge">lo,</code>, <code class="language-plaintext highlighter-rouge">o,_</code>, <code class="language-plaintext highlighter-rouge">,_w</code>, <code class="language-plaintext highlighter-rouge">_wo</code>, <code class="language-plaintext highlighter-rouge">wor</code>, <code class="language-plaintext highlighter-rouge">orl</code>, <code class="language-plaintext highlighter-rouge">rld</code>, and <code class="language-plaintext highlighter-rouge">ld!</code>.
When you look up any of those trigrams in the index,
you’ll get a list of files: <code class="language-plaintext highlighter-rouge">greetings.txt</code> and any other files that contain
the trigram you’ve looked up. At query time, codesearch extracts trigrams from the regular expression
and uses those trigrams to look up some candidate documents in the index.</p>
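<p>Trigram extraction itself is just a sliding window over the file contents. This helper is my own sketch, not codesearch’s indexing code:</p>

```go
package main

import "fmt"

// trigrams returns the distinct overlapping length-3 substrings of s,
// in order of first occurrence.
func trigrams(s string) []string {
	seen := make(map[string]bool)
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		t := s[i : i+3]
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// The "Hello, world!" example from the text: 11 distinct trigrams.
	fmt.Println(trigrams("Hello, world!"))
}
```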
<p>The trigram queries used by codesearch are just trigrams combined with ANDs and ORs.
For a simple regular expression like <code class="language-plaintext highlighter-rouge">Hello</code>, codesearch might generate the trigram
query <code class="language-plaintext highlighter-rouge">Hel AND ell AND llo</code>. Looking that query up in the index would return all of
the indexed files that contain all three of the trigrams <code class="language-plaintext highlighter-rouge">Hel</code>, <code class="language-plaintext highlighter-rouge">ell</code>, and <code class="language-plaintext highlighter-rouge">llo</code>.
A more complicated regular expression like <code class="language-plaintext highlighter-rouge">abc(d|e)</code> might generate the trigram
query <code class="language-plaintext highlighter-rouge">abc AND (bcd OR bce)</code>.</p>
<p>Sometimes there’s no good trigram query for a regular expression. This can happen
if the regular expression accepts strings of length less than 3. The regular
expression <code class="language-plaintext highlighter-rouge">[0-9]+</code>, for example, accepts strings of digits of length 1 and 2 so
we can’t use a trigram query to search for files that match that expression.
There also may not be a good trigram query if the regular expression accepts too many
different strings. <code class="language-plaintext highlighter-rouge">[a-z]{3}</code> is a short regular expression but it needs an enormous
trigram query with 17,576 trigrams OR-ed together to capture its meaning. If you leave
off any of those 17,576 trigrams from the query, you risk false negatives: files that
match the regular expression but aren’t returned by your trigram query. You’d be better off
grepping files in most cases than running such a large query.</p>
<p>Finally, even when codesearch can come up with a good trigram query, it can get false
positives in its result set from the trigram index. The regular expression <code class="language-plaintext highlighter-rouge">Hello</code>’s trigram query
<code class="language-plaintext highlighter-rouge">Hel AND ell AND llo</code> matches not just files containing <code class="language-plaintext highlighter-rouge">Hello</code> but also files that
contain things like <code class="language-plaintext highlighter-rouge">Help smell this fellow</code>. Since false positives like this are
possible, a regular expression search based on trigram queries will need to
post-process trigram query results by running the regular expression over each file
that comes back. The goal, then, is just to generate a small enough set of candidate
documents so that the query generation, lookup in the inverted index, and post-processing
with the original regular expression runs much more quickly than exhaustive grepping over
files.</p>
<p>codesearch generates these trigram queries by parsing the regular expression using Go’s
<a href="https://golang.org/pkg/regexp/syntax/">regexp/syntax</a>, then analyzing the
parsed regular expression and converting it into structures
that describe what trigrams each portion of the regular expression can match.
These structures are then combined
based on a table of rules derived from the meaning of the regular expression operations
involved. The structures maintained during this process
keep track of several attributes of the piece of the regular expression
they’re analyzing: whether that portion matches the empty string, along with sets
of trigrams that can match it exactly and sets that can match its prefixes
and suffixes. The whole process is quick and, along with some boolean simplification of
the resulting trigram queries, generates succinct queries without false negatives.</p>
<h1 id="generating-trigram-queries-with-graph-cuts">Generating trigram queries with graph cuts</h1>
<p>Instead of generating trigram queries by analyzing the structure of the regular
expression, <a href="https://github.com/aaw/regrams">regrams</a> transforms the regular expression
into an <a href="https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton">NFA</a> and
analyzes that.</p>
<p>regrams uses an NFA with some extra annotations on the states. Getting
to an NFA from a regular expression in Go is pretty simple with the <a href="https://golang.org/pkg/regexp/syntax/">regexp/syntax</a>
package: running <a href="https://golang.org/pkg/regexp/syntax/#Parse"><code class="language-plaintext highlighter-rouge">Parse</code></a> on the
string, then <a href="https://golang.org/pkg/regexp/syntax/#Regexp.Simplify"><code class="language-plaintext highlighter-rouge">Simplify</code></a> on the
resulting expression yields a normal form that’s easy to work with. regrams massages the
simplified expression into an even simpler form that’s closer to a textbook regular
expression, containing only concatenation (.), alternation (|), and Kleene star (*)
operations on literals and empty strings. This simplified expression
might match more than the original expression but it never matches less, so it’s fine
for generating a trigram query. Finally, regrams converts the simplified regular
expression into an NFA using <a href="https://en.wikipedia.org/wiki/Thompson%27s_construction">Thompson’s algorithm</a>.</p>
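<p>Getting that simplified parse really is just a couple of calls into regexp/syntax. The <code class="language-plaintext highlighter-rouge">simplify</code> wrapper below is my own; regrams does additional massaging after this point that’s omitted here:</p>

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// simplify parses expr with Perl syntax and returns the simplified
// normal form that further analysis would start from.
func simplify(expr string) string {
	re, err := syntax.Parse(expr, syntax.Perl)
	if err != nil {
		panic(err)
	}
	return re.Simplify().String()
}

func main() {
	// The running example from this post; it's already simple, so
	// Simplify leaves it essentially unchanged.
	fmt.Println(simplify(`ab(c|d*)ef`))
}
```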
<p>Next, any state that has a literal transition gets
annotated with a set of trigrams that are reachable starting from that state. There
are a few exceptions: if we start collecting the set of trigrams but realize it’s
going to be too large, we’ll bail out and the state will get an empty set of
trigrams. Also, if we start collecting the set of trigrams but realize that you
can reach the accept state of the NFA in 2 or fewer steps, we’ll bail out, since
that means that we can’t represent the query from that node with trigrams. So we end up
with annotations on some of the nodes in the NFA that describe trigram OR queries,
but only on the nodes that are easy to write trigram OR queries for.</p>
<p>Here’s an example NFA for the regular expression <code class="language-plaintext highlighter-rouge">ab(c|d*)ef</code> with trigram set
annotations in blue below each node:</p>
<p><img src="/assets/nfa-trigram-sets.png" alt="An NFA with annotated trigram sets" /></p>
<p>Notice that nodes with only epsilon transitions in the NFA above don’t get trigram
annotations and nodes that can reach the accept state in fewer than 3 literal
transitions like the node with the “e” transition and the node with the “f” transition
don’t have trigram set annotations. Every other node is annotated with the set of
trigrams that can be generated from that node by following an unlimited number of
epsilon transitions and exactly three literal transitions.</p>
<p>Creating a trigram query from the NFA above is now just an exercise in applying
the right graph algorithm. A good trigram OR query – one that captures a set of
trigrams that must appear in any string matching the regular expression but is as
small as possible – corresponds to a minimal set of trigram-annotated nodes in the
NFA that, when removed, disconnect the initial state from the final state. In
graph theory, a set of nodes that, when removed, disconnects two nodes s and t in the
graph is called an s-t node cut.
A node cut in our NFA separating the start and accept nodes corresponds to a
complete set of trigrams that must be present in any string that
matches the regular expression: there’s no way for a string to be accepted by the
NFA but to go through one of the nodes in the cut.</p>
<p>In the NFA above, there are only two minimal node
cuts that consist only of trigram-annotated nodes: the cut consisting of only the
node with the “a” transition and the cut consisting of only the node with the “b”
transition.</p>
<p>Once we’ve extracted a node cut consisting only of trigram-annotated nodes, we can just OR all of
the trigrams in the cut together and, by the argument above, we’ve got a valid trigram
query for the original regular expression: at least one of those trigrams must be
present in any string that matches the regular expression. Continuing with the
example above, depending on which cut we choose, we either
get the trigram query <code class="language-plaintext highlighter-rouge">abc OR abd OR abe</code> or the trigram query <code class="language-plaintext highlighter-rouge">bce OR bdd OR bde OR bef</code>.</p>
<p>At this point, our query is a single big OR defined by the cut and we’ve likely used relatively few
of the trigram sets we’ve annotated the graph with. But because we’ve just isolated
a cut, that cut splits the NFA (and also the corresponding regular expression) into two parts, and if we now clean up
both of those two parts of the NFA a little, we can run the same cut analysis on each of those two parts
recursively to extract more trigram queries. We just keep isolating cuts recursively
until we can’t find a cut with trigram-annotated nodes, at which point there’s nothing good left
to generate a query with. All of the OR queries we generate like this can be
AND-ed together to create one final query.</p>
<p>In the example NFA we’re working with, that means that if we choose the cut
with just the node with a “b” transition, it generates the trigram query <code class="language-plaintext highlighter-rouge">bce OR bdd OR bde OR bef</code>
and splits the NFA into two parts: the part with the “a” transition and everything
after the “b” transition node:</p>
<p><img src="/assets/nfa-trigram-sets-split.png" alt="The same NFA from before, now with a cut at the node with a “b” transition removed" /></p>
<p>We can now recursively consider both sub-NFAs. We see a single cut in the
first NFA that generates the trigram query <code class="language-plaintext highlighter-rouge">abc OR abd OR abe</code> and no cut in the second
NFA that consists of just trigram annotated nodes: even if you remove both the
node with a “c” transition and the node with a “d” transition, there’s still
a path from the initial state in the sub-NFA to the accept state through the
lower-level epsilon-transition around the node with a “d” transition. Since there
are no cuts left in any of the subgraphs, our final trigram query for the entire
regular expression <code class="language-plaintext highlighter-rouge">ab(c|d*)ef</code> is all of the subqueries AND-ed together, which is
<code class="language-plaintext highlighter-rouge">(abc OR abd OR abe) AND (bce OR bdd OR bde OR bef)</code>.</p>
<p>The fact that we couldn’t extract a cut from the second NFA is by design: that
NFA corresponds to the regular expression <code class="language-plaintext highlighter-rouge">(c|d*)ef</code>, which matches the string
<code class="language-plaintext highlighter-rouge">ef</code> that we can’t express with a trigram query.</p>
<h1 id="finding-minimal-node-cuts">Finding minimal node cuts</h1>
<p>So how do you find a minimal cut consisting only of trigram-annotated nodes?
If we label each node in a trigram-annotated NFA with the size of its trigram set or
with “infinity” if it doesn’t have a trigram set, then we can frame the problem
as a search for a minimum-weight node cut in the graph that separates the initial
and final state. Because of the way we’ve set up the node weights, finding the
minimum-weight node cut will tell us if we actually have a usable query: the cut
can be used to generate a query exactly when its weight is non-infinite.</p>
<p>Finding a minimum-weight cut in a graph that separates two distinguished nodes
is a tractable optimization problem and is
especially easy in simple graphs like the NFAs that we get from regular expressions.
Minimum-weight cuts are closely related to maximum weight “flows”, and if you ever
took an algorithms course you may have run into
the <a href="https://en.wikipedia.org/wiki/Max-flow_min-cut_theorem">max-flow min-cut theorem</a>
that formalizes this duality. If you imagine a graph as a set of pipes of differing
widths, flowing in the direction of the arrows between nodes, the max flow
represents the most amount of fluid that can flow through the pipes at any point
in time. The minimum-weight cut is a bottleneck that constrains the maximum flow you
can push through.</p>
<p>The simplest way to find a minimum-weight cut, then, is to find a maximum-weight
flow, which really boils down to initializing all nodes with capacities as described
above (each capacity will either be infinite or the size of the trigram set) and then
repeatedly finding a path that goes from the start node to the accept node through
nodes of non-zero capacity. Every time we find such a path, we reduce the
capacity of all nodes on the path by the minimum capacity on the path.
Eventually, there are no paths left except those that go through zero-capacity nodes
and those zero-capacity nodes can be used to recover the minimal cut.</p>
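<p>One standard way to implement this is the node-splitting reduction: each node <em>v</em> becomes an internal edge <em>v<sub>in</sub> → v<sub>out</sub></em> carrying the node’s capacity, original edges get infinite capacity, and an ordinary augmenting-path max-flow runs on the result. Here’s a self-contained Go sketch (my own code, not the regrams implementation, and it uses BFS search order for simplicity):</p>

```go
package main

import "fmt"

const inf = 1 << 30

// minNodeCut returns the weight of a minimum s-t node cut via the
// node-splitting reduction plus BFS-based augmenting paths.
func minNodeCut(n int, edges [][2]int, weight []int, s, t int) int {
	// 2*n vertices: v_in = 2v, v_out = 2v+1.
	type arc struct{ to, cap, rev int }
	g := make([][]arc, 2*n)
	add := func(u, v, c int) {
		g[u] = append(g[u], arc{v, c, len(g[v])})
		g[v] = append(g[v], arc{u, 0, len(g[u]) - 1})
	}
	for v := 0; v < n; v++ {
		add(2*v, 2*v+1, weight[v]) // internal edge carries the node weight
	}
	for _, e := range edges {
		add(2*e[0]+1, 2*e[1], inf) // original edges are uncuttable
	}
	src, dst := 2*s+1, 2*t // s and t themselves can't be in the cut
	flow := 0
	for {
		// BFS for a path of positive residual capacity from src to dst.
		prev := make([][2]int, 2*n) // previous vertex and arc index
		for i := range prev {
			prev[i] = [2]int{-1, -1}
		}
		prev[src] = [2]int{src, 0}
		queue := []int{src}
		for len(queue) > 0 && prev[dst][0] == -1 {
			u := queue[0]
			queue = queue[1:]
			for i, a := range g[u] {
				if a.cap > 0 && prev[a.to][0] == -1 {
					prev[a.to] = [2]int{u, i}
					queue = append(queue, a.to)
				}
			}
		}
		if prev[dst][0] == -1 {
			return flow // no path left: max flow equals min cut weight
		}
		// Find the bottleneck capacity, then push flow along the path.
		push := inf
		for v := dst; v != src; {
			u, i := prev[v][0], prev[v][1]
			if g[u][i].cap < push {
				push = g[u][i].cap
			}
			v = u
		}
		for v := dst; v != src; {
			u, i := prev[v][0], prev[v][1]
			g[u][i].cap -= push
			g[g[u][i].to][g[u][i].rev].cap += push
			v = u
		}
		flow += push
	}
}

func main() {
	// A tiny diamond 0 -> {1, 2} -> 3: the only finite cut is {1, 2},
	// with weight 2 + 3 = 5.
	edges := [][2]int{{0, 1}, {0, 2}, {1, 3}, {2, 3}}
	fmt.Println(minNodeCut(4, edges, []int{inf, 2, 3, inf}, 0, 3)) // prints 5
}
```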
<p>There are more sophisticated ways of finding a maximum-weight flow,
but the complexity of this algorithm is never more than the number of edges in the
NFA times the maximum allowable trigram set size. Since we bound both the maximum
size of the NFA allowed and the size of the maximum trigram set in the regrams
implementation, the complexity never gets out of hand here, and in practice it’s
very quick and even out-performs some theoretically faster optimizations like the
<a href="https://en.wikipedia.org/wiki/Edmonds%E2%80%93Karp_algorithm">Edmonds-Karp</a> search
order in my tests.</p>
<h1 id="examples">Examples</h1>
<p>So what do the results look like? Here are a few examples which you can play with
on your own if you have a Go development environment. Just build the regrams
command-line wrapper and follow along:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go get github.com/aaw/regrams
go build -o regrams github.com/aaw/regrams/cmd/regrams
</code></pre></div></div>
<p>The trigram queries are written with implicit AND
and <code class="language-plaintext highlighter-rouge">|</code> for OR, so <code class="language-plaintext highlighter-rouge">abc AND (bcd OR bce)</code> comes out looking like <code class="language-plaintext highlighter-rouge">abc (bcd|bce)</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./regrams 'Hello, world!'
( wo) (, w) (Hel) (ell) (ld!) (llo) (lo,) (o, ) (orl) (rld) (wor)
$ ./regrams 'a(bc)+d'
(abc) (bcb|bcd)
$ ./regrams 'ab(c|d*)ef'
(abc|abd|abe) (bce|bdd|bde|bef)
$ ./regrams '(?i)abc'
(ABC|ABc|AbC|Abc|aBC|aBc|abC|abc)
$ ./regrams 'abc[a-zA-Z]de(f|g)h*i{3}'
(abc) (def|deg) (efh|efi|egh|egi) (fhh|fhi|fii|ghh|ghi|gii) (iii)
$ ./regrams '[0-9]+' # No trigrams match single or double digit strings
Couldn't generate a query
$ ./regrams '[a-z]{3}' # Too many trigrams
Couldn't generate a query
</code></pre></div></div>
<p>regrams is also available as a Go package: just import <code class="language-plaintext highlighter-rouge">github.com/aaw/regrams</code>
and then call <code class="language-plaintext highlighter-rouge">regrams.MakeQuery</code> on a string to parse a regular expression into a trigram
query. The trigram query returned is a slice-of-slices of strings, which represents a big
AND of ORs: <code class="language-plaintext highlighter-rouge">[["abc"], ["bcd", "bce"]]</code> represents <code class="language-plaintext highlighter-rouge">abc (bcd|bce)</code>.</p>
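<p>A tiny helper (mine, not part of the regrams API) makes the mapping from the slice-of-slices representation to the printed form explicit:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// render prints a query in the same shape regrams uses: a slice of
// OR-groups that are implicitly AND-ed together.
func render(query [][]string) string {
	var groups []string
	for _, ors := range query {
		groups = append(groups, "("+strings.Join(ors, "|")+")")
	}
	return strings.Join(groups, " ")
}

func main() {
	fmt.Println(render([][]string{{"abc"}, {"bcd", "bce"}})) // prints (abc) (bcd|bce)
}
```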
<h1 id="optimizations">Optimizations</h1>
<p>There’s at least one major optimization that’s possible but isn’t yet implemented
in regrams: we could extract node-disjoint paths instead of just node cuts and
get more specific queries.</p>
<p>For example, if you ask regrams to generate a query for
<code class="language-plaintext highlighter-rouge">(abcde|vwxyz)</code>, it’ll generate <code class="language-plaintext highlighter-rouge">(abc|vwx) (bcd|wxy) (cde|xyz)</code>, which isn’t as specific
as the query <code class="language-plaintext highlighter-rouge">(abc bcd cde)|(vwx wxy xyz)</code>. <code class="language-plaintext highlighter-rouge">(abc|vwx)</code>, <code class="language-plaintext highlighter-rouge">(bcd|wxy)</code>, and <code class="language-plaintext highlighter-rouge">(cde|xyz)</code>
represent three distinct cuts in the NFA. If we didn’t stop at node cuts, and instead
augmented the nodes in the cut with disjoint paths, we could avoid this situation
and always return the best trigram query for a regular expression. To do this, we’d
need some algorithmic version of something like <a href="https://en.wikipedia.org/wiki/Menger%27s_theorem">Menger’s theorem</a>
that extracts node-disjoint paths from our NFAs that go through only nodes with
non-infinite capacity.</p>
<h1 id="related-work">Related Work</h1>
<p>codesearch isn’t the only attempt at generating trigram queries from regular
expressions. In 2002, Cho and Rajagopalan published
“<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.6659">A Fast Regular Expression Indexing Engine</a>”,
which describes a search engine called FREE that shares a lot of the ideas
found in codesearch, including using an n-gram index to index the underlying
documents and generating queries against that index using some rules
based on the structure of those regular expressions. The rules are much
simpler than those used by codesearch and FREE doesn’t have an actual
implementation that I know of.</p>
<p>Postgres also now
<a href="https://www.pgcon.org/2012/schedule/events/383.en.html">supports regular expression search</a>
via <a href="https://github.com/postgres/postgres/blob/master/contrib/pg_trgm/trgm_regexp.c">an implementation by Alexander Korotkov</a>
in the <a href="https://www.postgresql.org/docs/9.3/static/pgtrgm.html">pg_trgm module</a>.
Korotkov’s implementation apparently generates a special NFA with trigram transitions after
converting the original regular expression to an NFA, then extracts a query from
that trigram NFA. Korotkov’s implementation seems similar to regrams in that all of
the analysis is done on an NFA, but I don’t understand enough about it to say
anything more. I’d love to read a description of the technique somewhere. It seems
to generate slightly better queries than regrams based on some of the documentation, for
example generating <code class="language-plaintext highlighter-rouge">((abe AND bef) OR (cde AND def)) AND efg</code> from <code class="language-plaintext highlighter-rouge">(ab|cd)efg</code>, whereas regrams
would generate <code class="language-plaintext highlighter-rouge">(abe OR cde) AND (bef OR def) AND efg</code> from the same regular expression.</p>
Fri, 10 Jun 2016 20:51:47 +0000
http://blog.aaw.io/2016/06/10/regrams-intro.html