Aaron Windsor
notes.aaw.io
http://blog.aaw.io/
Fri, 25 Aug 2017 22:00:17 +0000
Simulating Levenshtein Automata
<p>A <a href="https://en.wikipedia.org/wiki/Levenshtein_automaton">Levenshtein Automaton</a> is a finite state machine that recognizes all
strings within a given <a href="https://en.wikipedia.org/wiki/Edit_distance">edit distance</a> from a fixed string.
Here’s a Levenshtein Automaton that accepts all
strings within edit distance 2 from “banana”:</p>
<p><img src="/assets/levtrie/nfa-banana.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2" /></p>
<p>The epsilon transitions represent the empty string. The * transitions
are shorthand for “any character” to save space in the diagram, but the
actual automaton has one transition for every possible input character anywhere
you see a *. The automaton accepts a string <em>s</em> exactly when
there’s a directed path from the start state on the lower left to any of the
accept states on the right such that the concatenation of all of the labels on
the path in order equals <em>s</em>.</p>
<p>The automaton above is an <a href="https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton">NFA</a> but it looks like three
copies of a <a href="https://en.wikipedia.org/wiki/Deterministic_finite_automaton">DFA</a> accepting the string “banana” stacked on top of each other.
Transitions between the three DFAs represent edit operations under the
<a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein metric</a>: a transition going up represents
the insertion of a character, a diagonal epsilon transition represents the
deletion of a character, and a diagonal * transition represents
a substitution.</p>
<p>Levenshtein Automata can be used as one part of a data structure that generates
spelling corrections. The
other component is a <a href="https://en.wikipedia.org/wiki/Trie">Trie</a> that contains all correctly spelled words.
Any word that’s accepted by both the Trie and the Levenshtein Automaton is a
word that’s correctly spelled and up to edit distance <em>d</em> from the query.
Given a query, you’d generate a Levenshtein Automaton for that query with the
desired edit distance and then traverse both the automaton and the Trie in
parallel, yielding a word whenever you reach an accept state in the Levenshtein
Automaton and a node in the Trie that ends a word at the same time.</p>
<p>Generating a non-deterministic Levenshtein Automaton is straightforward.
The node and edge structure of the automaton above for the query “banana” isn’t a function
of the word “banana” at all – only the transition labels would be different if you
wanted to create a similar automaton accepting anything within edit distance 2 of any
other six-letter word. Increasing or decreasing the edit distance just
involves adding or removing one or more rows of identical
states. Increasing or decreasing the length of the fixed word just involves
adding or removing one or more columns.</p>
<p>Unfortunately, even though generating a non-deterministic Levenshtein Automaton is easy,
simulating it efficiently isn’t. In general,
simulating all execution paths of an NFA with <em>n</em> states on an input of length
<em>m</em> can take time <em>O(nm)</em> just to keep up with the bookkeeping:
a state in the simulation is a subset of states in the NFA and you have to
update that set of up to <em>n</em> states <em>m</em> times during the simulation.</p>
<p><a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652">Schulz and Mihov introduced Levenshtein Automata</a> and showed
how to generate and simulate them in linear time in the length of an input string.
The implementation they describe involves a preprocessing step that creates a DFA
defined by a transition table whose size grows very quickly with the edit distance <em>d</em>.
The implementation starts with a two-dimensional table from which many of the entries can
be removed because they represent dead states. For <em>d</em>=1, a 5 <em>x</em> 8 table
is reduced to just 9 entries, for <em>d</em>=2, a 30 <em>x</em> 32 table is reduced to
one with just 80 entries. For <em>d</em>=3 and 4, the tables start with
196 <em>x</em> 128 = 25,088 and 1352 x 512 = 692,224 entries, respectively, before
removing dead states.</p>
<p>The Lucene project <a href="http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html">implemented</a> Schulz and Mihov’s scheme, but only for
<a href="https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java"><em>d</em> = 1</a> and <a href="https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java"><em>d</em> = 2</a>. Their implementation uses a Python script
from another project, <a href="https://sites.google.com/site/rrettesite/moman">Moman</a>, to generate Java code with the transition
tables offline.</p>
<p>It’s hard to beat a table-driven DFA for matching regular expressions,
but on the other hand, it’s not clear that the simulation of the Levenshtein
Automaton is the bottleneck in a spelling corrector. The size of the Levenshtein
Automaton is dwarfed by the size of the Trie containing the correctly spelled
words, and since query processing involves simulating both the Trie and the
Levenshtein Automaton in parallel, the main bottleneck in the simulation will
likely be the I/O expense of loading nodes for the Trie. Because of their high
and irregular branching factor, Tries are all but impossible to lay out in
memory with any kind of locality of reference for an arbitrary query.</p>
<p>So maybe there’s a simpler way to simulate Levenshtein Automata that’s
theoretically slower but will give us about the same real-world performance.
Jules Jacobs recently wrote <a href="http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html">a post describing a pretty
substantial simplification</a> that simulates an automaton using states
based on rows in the two-dimensional matrix of the <a href="https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm">Wagner-Fischer algorithm</a> to
compute edit distance. The simulation takes <em>O(nd)</em> time for an input of
length <em>n</em>, which is essentially as good as an implementation based on
Schulz and Mihov’s scheme since <em>d</em> is often small and fixed.</p>
<p>You can also derive an <em>O(nd)</em> time simulation by just directly simulating
the NFA that I described at the beginning of this post. You just need to make
a few optimizations based on the highly regular structure of this family of
NFAs, but all of the optimizations are relatively straightforward. I’ll describe
those optimizations in this post. I’ve also implemented everything I describe here
in a Go package called <a href="http://github.com/aaw/levtrie">levtrie</a> that provides a map-like interface to a Trie
equipped with a Levenshtein Automaton.</p>
<h1 id="alternatives-to-levenshtein-automata">Alternatives to Levenshtein Automata</h1>
<p>First, some more background. You can skip this section and the next if you
already understand why Levenshtein Automata are a good choice for indexing
a set of words by edit distance.</p>
<p>Edit distance is a <a href="https://en.wikipedia.org/wiki/Metric_(mathematics)">metric</a> and there are <a href="https://en.wikipedia.org/wiki/Metric_tree">many data
structures</a> that index keys by distance under various metrics.
Maybe the most appropriate metric tree for edit distance is the <a href="https://en.wikipedia.org/wiki/BK-tree">BK-Tree</a>.
The BK-Tree, like other metric trees, has two big disadvantages: first, the
layout of the tree is highly dependent on the distribution and the insertion
order, so it’s hard to quote good bounds on how balanced the tree is in general.
Second, during lookups, you have to compute the distance function at each node along
the search path,
which can be expensive for a metric like edit distance that takes quadratic
time to compute (or <em>O(nd)</em> if you optimize for computing distance at most <em>d</em>).</p>
<p><a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">Locality-sensitive hashing</a>
is another option, but, like metric trees, even after hashing to a bucket you’re
left with a set of candidates on which you need to exhaustively calculate edit
distance. It’s also very difficult with most metrics to get anywhere close to
perfect recall with locality-sensitive hashing, and perfect recall is essential
to spelling correction since there are often just a few good correction
candidates.</p>
<p>Still another alternative is to index the <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> of all correctly spelled
words and put them in an inverted index. At query time, you’d break up the query string into
n-grams and search the inverted index for them all, running an actual edit distance
computation as a final pass on all of the candidates that come back. This doesn’t
always work well in practice, since, for example, if you’re trying to retrieve
“bahama” from the query “banana” (edit distance 2), none of the 3-grams match (bahama
breaks up into “bah”, “aha”, “ham”, “ama” and banana breaks up into
“ban”, “ana”, “nan”, and “ana”). Even moving down to 2-grams doesn’t
help much; only the leading 2-gram “ba” matches so you’d have to retrieve all
strings that start with “ba”
and exhaustively test edit distance on all of them to find “bahama”.</p>
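<p>The n-gram mismatch is easy to check by hand, but here is a minimal Go sketch that does it mechanically. The helper names <code class="highlighter-rouge">ngrams</code> and <code class="highlighter-rouge">shared</code> are made up for this illustration and aren’t part of any package mentioned in this post.</p>

```go
package main

import "fmt"

// ngrams returns every overlapping substring of length n in s, in order.
// It works on runes so a multi-byte character counts as one character.
func ngrams(s string, n int) []string {
	r := []rune(s)
	var out []string
	for i := 0; i+n <= len(r); i++ {
		out = append(out, string(r[i:i+n]))
	}
	return out
}

// shared returns the distinct n-grams that appear in both a and b.
func shared(a, b string, n int) []string {
	seen := map[string]bool{}
	for _, g := range ngrams(a, n) {
		seen[g] = true
	}
	var out []string
	for _, g := range ngrams(b, n) {
		if seen[g] {
			out = append(out, g)
			seen[g] = false // report each shared n-gram only once
		}
	}
	return out
}

func main() {
	fmt.Println(shared("banana", "bahama", 3)) // prints []
	fmt.Println(shared("banana", "bahama", 2)) // prints [ba]
}
```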
<p>In contrast to all of the methods described above, using a
Trie with a Levenshtein Automaton doesn’t ever require exhaustively calculating
edit distance during lookups: the cost of computing edit distance is incremental
and shared among many candidates that share paths in the Trie. A Trie can also
efficiently return all strings that have the query string as a prefix, which
is a popular feature with most on-the-fly spelling correctors: instead of
waiting for someone to type the entire word “banona” to return the suggestion
“banana”, you can start returning suggestions as soon as they’ve typed “ban” or
even “ba” by exploring paths from those prefixes in the Trie.</p>
<h1 id="finding-edits-in-a-trie-without-automata">Finding edits in a Trie without automata</h1>
<p>Levenshtein Automata are used to generate an optimized Trie traversal but
you can actually build a quick-and-dirty spelling corrector using just a Trie. I’ll
walk through that construction in this section since it motivates why you’d want
to augment a Trie with a Levenshtein Automaton in the first place.</p>
<p>Suppose that your query string is “banana”. To figure out if that exact string is
in the Trie, you’d just use the sequence of characters in the string to find a
path through the Trie:
starting from the root, transition on the “b” edge to a node one level down, then transition
on an “a” edge, then an “n” edge, and so on, until you’re at the end of the string.
If the word is in the Trie, then
at the end of the traversal you will have reached
a node that represents the last character of that word. Otherwise, you will have
stopped at some point along the way because there wasn’t an edge available to make the
transition you needed to make, in which case you know the word you’re looking up isn’t
in the Trie.</p>
<p>Instead, if you wanted to find both exact matches to “banana” and words in the Trie
that were a few edits away,
you could extend the search process to branch out a little and try paths that
correspond to edits.
If you want to find words that are at most, say, 2 edits away from “banana”, you could
start your traversal at the root but perform the following four searches
while keeping track of an edit budget that’s initially 2:</p>
<ul>
<li><em>Simulate no edit</em>: Move from the root to the second
level of the Trie on the edge labeled “b”. Keep your edit budget at 2 and set the remaining string to match
to “anana”.</li>
<li><em>Simulate an insertion before the first character</em>: For every edge out of the root of the Trie, move
to the second level of the Trie along that edge. Decrement your edit budget to 1 and keep the remaining string
to match set to “banana”.</li>
<li><em>Simulate a deletion of the first character</em>: Don’t move from the root of the Trie at all, simply decrement your
edit budget to 1 and update the string you want to match to “anana”.</li>
<li><em>Simulate a substitution for the first character</em>: For every edge out of the root of the Trie except
the edge labeled “b”, move to the second level of the Trie along that edge, decrement your edit budget to 1,
and update the remaining string you want to match to “anana”.</li>
</ul>
<p>Now just keep recursively applying these cases at each new node you explore and
stop the traversal once you reach an edit budget that’s negative. If you ever get to
a node that ends a word with an edit budget at least as big as the remaining string
length, return that word as a match.</p>
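<p>The four cases above can be sketched in Go roughly as follows. This is an illustration only, not the code in levtrie: the <code class="highlighter-rouge">node</code> type, its <code class="highlighter-rouge">word</code> flag and the <code class="highlighter-rouge">search</code> function are hypothetical, and matches are collected in a map to hide the duplicates that the repeated visits produce.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical minimal Trie node: a map of children plus a flag
// marking the end of a stored word.
type node struct {
	children map[rune]*node
	word     bool
}

func newNode() *node { return &node{children: map[rune]*node{}} }

func (n *node) insert(s string) {
	for _, ch := range s {
		c, ok := n.children[ch]
		if !ok {
			c = newNode()
			n.children[ch] = c
		}
		n = c
	}
	n.word = true
}

// search applies the four cases at node n. prefix is the path from the root
// to n, rest is the part of the query still to match, and budget is the
// number of edits left.
func search(n *node, prefix, rest []rune, budget int, found map[string]bool) {
	if budget < 0 {
		return
	}
	// Deleting every remaining query character must fit in the budget.
	if n.word && len(rest) <= budget {
		found[string(prefix)] = true
	}
	for ch, child := range n.children {
		next := append(append([]rune{}, prefix...), ch)
		if len(rest) > 0 && rest[0] == ch {
			search(child, next, rest[1:], budget, found) // no edit
		} else if len(rest) > 0 {
			search(child, next, rest[1:], budget-1, found) // substitution
		}
		search(child, next, rest, budget-1, found) // insertion
	}
	if len(rest) > 0 {
		search(n, prefix, rest[1:], budget-1, found) // deletion
	}
}

func main() {
	root := newNode()
	for _, w := range []string{"banana", "bahama", "bandana", "cabana", "apple"} {
		root.insert(w)
	}
	found := map[string]bool{}
	search(root, nil, []rune("banana"), 2, found)
	var words []string
	for w := range found {
		words = append(words, w)
	}
	sort.Strings(words)
	fmt.Println(words) // prints [bahama banana bandana cabana]
}
```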
<p>This traversal will discover all strings in the Trie that are within a fixed edit distance of
a query but the traversal does a lot of repeated work. You can generate the correctly-spelled
word “banana” from the misspelled word “banaba” using either a substitution,
a deletion followed by an insertion,
or an insertion followed by a deletion. This means that in the traversal we defined above, we’ll
visit the node in the Trie defined by “banan” at least three times from three different search paths.
The search also has degenerate paths that just
burn the error budget but do no useful work, for example a deletion followed by an insertion
of the same character. These paths just put you back in the same position the traversal started
from with a smaller edit budget.</p>
<p>Again, because of their large branching factor, Tries don’t have good locality of reference. Each
time you follow a pointer to another node, you’re likely jumping to memory that at the very
least causes a cache miss, so popping the same state several times to explore can be expensive.
You might think you could optimize this a little by marking Trie nodes as “visited” when you first
see them and avoiding exploring visited nodes more than once. But you can also see the same node
through different search paths with different edit budgets, so you’d have to store more than
just a visited flag – you’d at least need to store the largest edit budget that you’d visited
the node with. If you reached the node again on a search path with an edit budget no larger than that, you could
prune that portion of the search, but that still means that you could end up exploring a node up
to <em>d</em> times on a search for words within edit distance <em>d</em>.</p>
<p>A Levenshtein Automaton maintains all of the search state so that you never have to traverse
a path in the Trie more than once. If you use a deterministic Levenshtein Automaton like
<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652">Schulz and Mihov’s original scheme</a>, it’s really as efficient a way to encode the search
state as you can get: each transition from state to state in the automaton is just an <em>O(1)</em>
table lookup.</p>
<p>There’s some history of people rediscovering ways to maintain the search state
through more or less efficient means: <a href="http://stevehanov.ca/blog/index.php?id=114">Steve Hanov described</a> a method of
keeping track of the search state using the Wagner-Fischer matrix that allows you to update
states in time <em>O(m)</em> when searching for a string of length <em>m</em>. The method that Jules Jacobs
describes is similar but optimized even further to get an <em>O(d)</em> update time
for edit distance <em>d</em>, regardless of the length of the input string. The method I’ll describe
in the next two sections also has an <em>O(d)</em> time bound on state transitions but it isn’t
directly derived from the Wagner-Fischer edit distance algorithm.</p>
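<p>To make the <em>O(m)</em> update concrete, here is a rough Go sketch of the row-at-a-time idea: the search state is one row of the Wagner-Fischer matrix, and reading a character produces the next row. It’s a sketch of the general technique rather than Steve Hanov’s actual code, and the function names are made up for this example.</p>

```go
package main

import "fmt"

// initialRow is the first row of the Wagner-Fischer matrix for the query w:
// the cost of turning the empty string into each prefix of w.
func initialRow(w []rune) []int {
	row := make([]int, len(w)+1)
	for j := range row {
		row[j] = j
	}
	return row
}

// nextRow computes the matrix row that follows prev after reading ch.
// It touches every column, so one step costs O(m) for a query of length m.
func nextRow(prev []int, w []rune, ch rune) []int {
	row := make([]int, len(prev))
	row[0] = prev[0] + 1
	for j := 1; j < len(row); j++ {
		cost := 1
		if w[j-1] == ch {
			cost = 0
		}
		row[j] = min3(prev[j-1]+cost, prev[j]+1, row[j-1]+1)
	}
	return row
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

func main() {
	w := []rune("banana")
	row := initialRow(w)
	for _, ch := range "bahama" {
		row = nextRow(row, w, ch)
	}
	fmt.Println(row[len(w)]) // prints 2
}
```

<p>In a Trie search, each node on the current path would carry its own row, and a branch can be abandoned as soon as every entry in its row exceeds <em>d</em>.</p>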
<h1 id="an-ond2-time-simulation">An <em>O(nd<sup>2</sup>)</em>-time simulation</h1>
<p>Now back to the original Levenshtein NFA construction at the beginning of this post.
Instead of creating a DFA from the NFA, we just want to simulate the NFA directly.
To simulate one, we need to maintain a set of active states as we read input characters, accepting
exactly when we’ve read all of the input and there’s at least one accepting state in our current set
of active states.</p>
<p>We initialize the set of active states to contain just the NFA’s start state plus anything
reachable by an epsilon transition.
On each input character, we create an initially empty new set of active states and iterate
through all current active states, trying to take any valid
transition from each state on the current input character and adding the resulting state
and anything else reachable by epsilon transitions
to the new set of active states if we succeed. When we’re done with iterating through
all current active states, the new set of active states becomes current and we proceed
to the next input character. If we’re done reading input characters and there’s an accept
state in our set of active states, we accept, otherwise we reject.</p>
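<p>Here’s a minimal Go sketch of that straightforward simulation, without any of the optimizations described later. It assumes NFA states are represented as pairs <em>(i, e)</em>, meaning <em>i</em> characters of the fixed word consumed with <em>e</em> edits used; that representation and the names <code class="highlighter-rouge">state</code>, <code class="highlighter-rouge">closure</code> and <code class="highlighter-rouge">accepts</code> are choices made for this example.</p>

```go
package main

import "fmt"

// state is one NFA state: i characters of the fixed word consumed, e edits used.
type state struct{ i, e int }

// closure adds s, and everything reachable from s by epsilon (deletion)
// transitions, to set. States outside the NFA are silently ignored.
func closure(set map[state]bool, s state, m, d int) {
	for s.i <= m && s.e <= d {
		set[s] = true
		s = state{s.i + 1, s.e + 1}
	}
}

// accepts simulates the Levenshtein NFA for word w and distance d on input u.
func accepts(w string, d int, u string) bool {
	word := []rune(w)
	m := len(word)
	active := map[state]bool{}
	closure(active, state{0, 0}, m, d)
	for _, ch := range u {
		next := map[state]bool{}
		for s := range active {
			if s.i < m && word[s.i] == ch {
				closure(next, state{s.i + 1, s.e}, m, d) // horizontal: match
			}
			closure(next, state{s.i, s.e + 1}, m, d)     // up: insertion
			closure(next, state{s.i + 1, s.e + 1}, m, d) // diagonal: substitution
		}
		active = next
	}
	// Accepting states are the ones in the last column of the NFA.
	for s := range active {
		if s.i == m {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(accepts("banana", 2, "bahama"))  // prints true
	fmt.Println(accepts("banana", 1, "bahama"))  // prints false
	fmt.Println(accepts("banana", 2, "bermuda")) // prints false
}
```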
<p>Let’s walk through a simulation of the Levenshtein NFA that accepts all words within edit
distance 2 of “banana”. We’ll feed it the input string “bahama”. The set of active states is highlighted
in blue at each step below. First, the initial set of active states
contains the start state and every other state on its diagonal:</p>
<p><img src="/assets/levtrie/nfa-banana-initial.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 in its initial state" /></p>
<p>After consuming “b”, active states are again highlighted in blue:</p>
<p><img src="/assets/levtrie/nfa-banana-b.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;b&quot;" /></p>
<p>After consuming “ba”:</p>
<p><img src="/assets/levtrie/nfa-banana-ba.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;ba&quot;" /></p>
<p>After consuming “bah”:</p>
<p><img src="/assets/levtrie/nfa-banana-bah.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;bah&quot;" /></p>
<p>After consuming “baha”:</p>
<p><img src="/assets/levtrie/nfa-banana-baha.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;baha&quot;" /></p>
<p>After consuming “baham”:</p>
<p><img src="/assets/levtrie/nfa-banana-baham.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;baham&quot;" /></p>
<p>And finally, after consuming “bahama”, a string that’s edit distance 2 from “banana”, we end up in an
accepting state:</p>
<p><img src="/assets/levtrie/nfa-banana-bahama.png" alt="A Levenshtein Automaton for &quot;banana&quot; with d=2 after consuming &quot;bahama&quot;" /></p>
<p>We want to bound the time complexity of simulating an NFA from this family of Levenshtein NFAs.
One way to do this is to bound
the maximum number of possible active states, since simulating the NFA on an input of length
<em>n</em> involves updating the entire set of active states <em>n</em> times. Since a Levenshtein NFA created for
a word of length <em>m</em> and edit distance <em>d</em> has <em>(m + 1) * (d + 1)</em>
states, this means that the worst-case time complexity for simulating that NFA on an input of
length <em>n</em> could be as bad as
<em>O(nmd)</em>. We can get a better bound by being a little more careful in
our simulation.</p>
<p>The first thing to notice about the family of Levenshtein NFAs is that the diagonals contain paths
of epsilon transitions all the way up. This means that
any time a state is active, all other states further up on the same diagonal are active. Instead of
keeping track of the set of active states, then, we can just keep track of the lowest active state on
every diagonal. We’ll call this minimum active index the “floor” of the diagonal.</p>
<p>We’ll start numbering floors from the bottom: any diagonal with a state active in the bottom
row of NFA states has floor 0. Since there are <em>d + 1</em> rows in the NFA, the maximum floor a
diagonal can have is <em>d</em>.</p>
<p>To make sure that every state in the NFA is contained in some diagonal, we just need to create a
few fake diagonals that extend out to the left a little bit to add to the set of diagonals that
are anchored by the states in the bottom level of the NFA:</p>
<p><img src="/assets/levtrie/nfa-banana-fake-diags.png" alt="Identifying diagonals in a Levenshtein Automaton" /></p>
<p>Indexed like this, a Levenshtein NFA has <em>m + d + 1</em> diagonals. This means that any set of active
states in the NFA can be represented by at most <em>m + d + 1</em> diagonal floors.</p>
<p>Actually, we never end up needing to consider all <em>m + d + 1</em> diagonals at once.
There’s only ever one diagonal with floor 0 at any
point in time, since consuming an input character while one diagonal has a floor at position 0
transfers the state to position 1 in the previous diagonal, position 1 in the current diagonal,
and possibly to position 0 in the next diagonal:</p>
<p><img src="/assets/levtrie/level-0-transition.png" alt="A single level-0 transition in a Levenshtein Automaton" /></p>
<p>This fact generalizes to higher positions in the set of diagonals: there’s always a sliding window of
at most <em>2d + 1</em> diagonals that are active at level <em>d</em> or lower. You can prove this by
induction on <em>d</em> where the general induction step looks at a window of <em>2d - 1</em>
diagonals at level <em>d - 1</em> and shows that they can expand to a window of at most <em>2d + 1</em>
at level <em>d</em>.</p>
<p>A particular example of the general case is illustrated below, with a starting state illustrated by
all light blue and dark blue nodes. These light blue and dark blue nodes cover 5 diagonals at level 2
or lower. The green nodes illustrate all new nodes that can be active after a transition from
this state. The green and dark blue nodes together illustrate possible active nodes after the
transition, covering 7 diagonals at level 3 or lower:</p>
<p><img src="/assets/levtrie/multi-level-transition.png" alt="A general transition in a Levenshtein Automaton" /></p>
<p>All of this means that instead of tracking all <em>m + d + 1</em> diagonals, we only ever need to track
a set of <em>2d + 1</em> diagonal positions plus an offset into the NFA. The sliding window of diagonals that
we track moves through the NFA and we increment the offset by one each time we consume an
input character.</p>
<p>To update a single diagonal when we read an input character, we need to take the minimum of:</p>
<ul>
<li>The previous floor of the diagonal, plus one (shown in red below).</li>
<li>The previous floor of the next diagonal, plus one (shown in green below).</li>
<li>The smallest index in the previous diagonal that has a transition on the input character, if any (shown in dark blue below).</li>
</ul>
<p>The figure below shows all three of the contributions to a single diagonal’s update:</p>
<p><img src="/assets/levtrie/contributions-to-diagonal-update.png" alt="Contributions to a single diagonal update" /></p>
<p>Since we’re storing the state as a collection of <em>2d + 1</em> diagonal floors plus an offset,
the first two items in the list above (red and green updates in the figure above) can be computed in constant time.
The third item can be computed by iterating over all <em>d + 1</em> horizontal transitions
from the previous diagonal to see if any match the current input character.</p>
<p>Here’s pseudocode for our current algorithm for a fixed value of <em>d</em>
with a few details omitted:</p>
<pre>
// Returns a structure representing an initial state for a Levenshtein NFA. The
// State structure just contains:
//   * d, the edit distance.
//   * An array of 2*d + 1 integers representing floors.
//   * An integer offset into the underlying string being matched.
State InitialState(d) { ... }

// Returns the floor of the ith diagonal in state s. Returns d + 1 if i is
// out of bounds.
int Floor(State s, int i) { ... }

// Returns the smallest index at or above the floor of the ith diagonal of
// state s that has a transition on character ch. If none exists, returns d + 1.
int SmallestIndexWithTransition(State s, string w, int i, char ch) { ... }

// Set the ith diagonal floor in state s to x.
void SetDiagonal(State* s, int i, int x) { ... }

// Returns true exactly when the given state is an accepting state.
bool IsAccepting(State s) { ... }

// Simulate the Levenshtein NFA for string w and distance d on the string u.
bool SimulateNFA(string w, int d, string u) {
  State s = InitialState(d)
  for each ch in u {
    State t = s
    t.offset = s.offset + 1
    for i from 0 to 2*d {  // inclusive: one iteration per diagonal
      int x = Floor(s, i + 1) + 1                       // Diagonal
      int y = Floor(s, i + 2) + 1                       // Up
      int z = SmallestIndexWithTransition(s, w, i, ch)  // Horizontal
      SetDiagonal(&amp;t, i, Min(x, y, z))
    }
    s = t
  }
  return IsAccepting(s)
}
</pre>
<p>For each of the <em>n</em> characters in the input string, we iterate over the <em>2d + 1</em>
diagonals in the state and compute the three components of the diagonal update.
Two of the three components (<em>x</em> and <em>y</em> in the pseudocode above) are floor
computations which are really just array accesses, so they’re both constant
time operations. The third component (<em>z</em> in the pseudocode above, the result
of <code class="highlighter-rouge">SmallestIndexWithTransition</code>) is the most expensive, taking time <em>O(d)</em> to
compute, so the inner loop takes time <em>O(d<sup>2</sup>)</em> and the entire
simulation takes time <em>O(nd<sup>2</sup>)</em>.</p>
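<p>For readers who want something they can run, here’s one way the pseudocode could be fleshed out in Go. It’s a sketch under a few assumptions of my own rather than the levtrie implementation: window slot <em>j</em> is taken to hold the diagonal numbered <em>offset - 2d + j</em>, where a state that has consumed <em>i</em> characters of <em>w</em> with <em>e</em> edits lies on diagonal <em>i - e</em>, and floors that fall outside the NFA are reset to <em>d + 1</em>.</p>

```go
package main

import "fmt"

// simulate reports whether u is within edit distance d of w by tracking one
// floor per diagonal in a sliding window of 2d+1 diagonals.
func simulate(w string, d int, u string) bool {
	word := []rune(w)
	m := len(word)
	dead := d + 1 // floor value meaning "no active state on this diagonal"
	width := 2*d + 1

	floors := make([]int, width)
	for j := range floors {
		floors[j] = dead
	}
	floors[width-1] = 0 // only the start state's diagonal is active initially
	offset := 0         // number of input characters consumed so far

	floor := func(j int) int {
		if j < 0 || j >= width {
			return dead
		}
		return floors[j]
	}

	for _, ch := range u {
		next := make([]int, width)
		for j := 0; j < width; j++ {
			k := offset + 1 - 2*d + j // diagonal held by new slot j
			best := floor(j+1) + 1    // diagonal move: substitution
			if y := floor(j+2) + 1; y < best {
				best = y // upward move: insertion
			}
			// Horizontal move: scan the previous diagonal (old slot j) upward
			// from its floor for a transition on ch. This is the O(d) step.
			for e := floor(j); e <= d; e++ {
				i := k - 1 + e
				if i < 0 {
					continue
				}
				if i >= m {
					break
				}
				if word[i] == ch {
					if e < best {
						best = e
					}
					break
				}
			}
			// Discard floors past the top row (e > d) or last column (i > m).
			if best > d || k+best > m {
				best = dead
			}
			next[j] = best
		}
		floors = next
		offset++
	}

	// Accept if some diagonal has an active state in the last column (i == m).
	for j, f := range floors {
		k := offset - 2*d + j
		if e := m - k; f <= d && f <= e && e <= d {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(simulate("banana", 2, "bahama")) // prints true
	fmt.Println(simulate("banana", 1, "bahama")) // prints false
}
```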
<h1 id="an-ond-time-simulation">An <em>O(nd)</em>-time simulation</h1>
<p>To get from <em>O(d<sup>2</sup>)</em> to <em>O(d)</em> time per state transition and <em>O(nd)</em>
for the entire simulation, there’s one final trick: making the
<code class="highlighter-rouge">SmallestIndexWithTransition</code> function in the pseudocode above run in constant
time.</p>
<p><code class="highlighter-rouge">SmallestIndexWithTransition</code> needs to be called on each of the at most <em>2d + 1</em>
active diagonals to find the smallest index at or above the floor of the diagonal with
a horizontal transition on the current input character. Luckily, there’s a lot of overlap
between the horizontal transitions of two consecutive diagonals: if you read the
horizontal transitions of any diagonal from bottom to top, they form a substring
of length <em>d + 1</em> of the string the automaton was generated with. Any two consecutive
diagonals have an overlap of <em>d</em> characters in these substrings:</p>
<p><img src="/assets/levtrie/jump-array.png" alt="Overlap between horizontal transitions between consecutive diagonals" /></p>
<p>The horizontal transitions from a set of <em>2d + 1</em> consecutive diagonals span a
substring of length at most <em>3d + 2</em>. This overlap between the horizontal transitions of
consecutive diagonals means
that instead of iterating up each diagonal doing <em>O(d)</em> work to figure out the
first applicable horizontal transition, if any, for each of the <em>2d + 1</em> diagonals,
we can instead precompute all the transitions on the substring of length <em>3d + 2</em> once
and use that precomputed result to calculate <code class="highlighter-rouge">SmallestIndexWithTransition</code> on each
diagonal.</p>
<p>This precomputation just involves calculating, for each index in the <em>3d + 2</em>
character window, the next index in the window that matches the current input
character. For example, if the window was the string “cookbook”:</p>
<div class="highlighter-rouge"><pre class="highlight"><code> [c][o][o][k][b][o][o][k]
0 1 2 3 4 5 6 7
</code></pre>
</div>
<p>our precomputed jump array for the character ‘k’ would look like:</p>
<div class="highlighter-rouge"><pre class="highlight"><code> [3][3][3][3][7][7][7][7]
0 1 2 3 4 5 6 7
</code></pre>
</div>
<p>Each index in the jump array tells us the next index of the character ‘k’ in the
window. If we look in the jump array at the index of a particular diagonal plus its
floor, the jump array tells us the next applicable horizontal transition on that diagonal.</p>
<p>The jump array needs to be initialized once per transition but initialization only takes time
<em>O(d)</em> and all subsequent calls to <code class="highlighter-rouge">SmallestIndexWithTransition</code>
can then be implemented with just a constant-time access into the jump array. Now everything
inside the for loop that iterates over <em>2d + 1</em> diagonals runs in constant time and computing
a transition of the NFA only takes time <em>O(d)</em>.</p>
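<p>A jump array like the one above can be filled in with a single right-to-left pass over the window. Here’s a small Go sketch; the function name and the choice of <code class="highlighter-rouge">len(window)</code> as the “no further occurrence” sentinel are assumptions made for this example.</p>

```go
package main

import "fmt"

// jumpArray returns, for each index i in window, the smallest index j >= i
// with window[j] == ch, or len(window) if ch does not occur at or after i.
func jumpArray(window []rune, ch rune) []int {
	jump := make([]int, len(window))
	next := len(window) // sentinel: no occurrence seen yet
	for i := len(window) - 1; i >= 0; i-- {
		if window[i] == ch {
			next = i
		}
		jump[i] = next
	}
	return jump
}

func main() {
	fmt.Println(jumpArray([]rune("cookbook"), 'k')) // prints [3 3 3 3 7 7 7 7]
	fmt.Println(jumpArray([]rune("cookbook"), 'b')) // prints [4 4 4 4 4 8 8 8]
}
```

<p>Since the window has <em>O(d)</em> characters, the pass costs <em>O(d)</em> per input character, and each later lookup is a single array access.</p>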
<h1 id="final-thoughts">Final Thoughts</h1>
<p>I’ve implemented all of this in the Go package <a href="http://github.com/aaw/levtrie">levtrie</a> and the code in that
package is a better place to look if you’re interested in more details after reading
this post.</p>
<p>I took some shortcuts
there to make the code a little more readable at the expense of some unnecessary
memory bloat. In particular, the Trie implementation is very simple: each node keeps
a map of runes to children and no intermediate nodes are suppressed. The key for each
node, implicitly defined by the path through the Trie, is duplicated in the key-value
struct stored at each node.</p>
<p>If you’re looking to implement something similar and speed it up,
the first thing to consider is replacing the basic Trie with a <a href="https://en.wikipedia.org/wiki/Radix_tree">Radix</a> or <a href="http://cr.yp.to/critbit.html">Crit-bit tree</a>.
A Crit-bit tree will dramatically reduce the memory usage and the number of
node accesses needed for Trie traversals, but it will also complicate the logic of
the parallel search through the Trie and Levenshtein Automaton. Instead of iterating
over all runes emanating from a node in the Trie and matching those up with transitions
in the Levenshtein Automaton, you’ll have to do something more careful. You could
add some logic to traverse the Crit-bit tree from a node, simulating an iteration over
all single-character transitions. Multi-byte characters make this a little more tricky
than an ASCII alphabet. You might have an easier time converting the Levenshtein
Automaton itself to have an alphabet of bytes instead of multibyte characters, which
would then match up better with a Radix tree with byte transitions.</p>
Fri, 25 Aug 2017 21:58:17 +0000
http://blog.aaw.io/2017/08/25/levenshtein-automata.html
Regular Expression Search via Graph Cuts
<p>Google used to offer a search engine called
<a href="https://en.wikipedia.org/wiki/Google_Code_Search">Code Search</a> which
let you use regular expressions to search code.
I never thought Code Search was doing anything much more sophisticated than
something along the lines of a fast, distributed grep until Russ Cox explained
how it worked in a <a href="https://swtch.com/~rsc/regexp/regexp4.html">blog post</a>
and released <a href="https://github.com/google/codesearch">codesearch</a>, a smaller-scale
implementation.
Both the post and the code are fascinating – it turns out that Code Search
was doing something much more sophisticated than a distributed grep.</p>
<p>The big idea in codesearch is an engine that translates a large class of regular
expressions into queries that can be run against a standard inverted index.
In this post, I’m going to describe a different implementation of such an engine
using some of the same ideas in
codesearch along with a well-known algorithm for finding the minimum-weight
node cut in a graph. I think the result is a little bit simpler conceptually than
the query-building technique that codesearch uses, particularly if you
understand textbook regular expressions but don’t know a lot about the
internals of any particular regular expression library.</p>
<p>The implementation I’ll describe is available as a Go package at
<a href="https://github.com/aaw/regrams">https://github.com/aaw/regrams</a>. It’s a few hundred lines of code and the benchmarks
show it running only about 4-5 times slower than codesearch’s regular
expression-to-query translation. Both engines run on the order of microseconds for
individual queries, so a few times slower is still very usable for a search engine.</p>
<h1 id="trigram-queries-and-regular-expression-search">Trigram queries and regular expression search</h1>
<p>First, a little background about codesearch’s approach to regular expression search.</p>
<p>Before it can process queries, codesearch creates an inverted index from all the
files you want to search. The index contains all overlapping substrings of length 3
(“trigrams”) that occur in any of the files.
If you index a file called <code class="highlighter-rouge">greetings.txt</code> which contains just the
string <code class="highlighter-rouge">Hello, world!</code>,
the resulting trigram index would contain entries for
<code class="highlighter-rouge">Hel</code>, <code class="highlighter-rouge">ell</code>, <code class="highlighter-rouge">llo</code>, <code class="highlighter-rouge">lo,</code>, <code class="highlighter-rouge">o,_</code>, <code class="highlighter-rouge">,_w</code>, <code class="highlighter-rouge">_wo</code>, <code class="highlighter-rouge">wor</code>, <code class="highlighter-rouge">orl</code>, <code class="highlighter-rouge">rld</code>, and <code class="highlighter-rouge">ld!</code>.
When you look up any of those trigrams in the index,
you’ll get a list of files: <code class="highlighter-rouge">greetings.txt</code> and any other files that contain
the trigram you’ve looked up. At query time, codesearch extracts trigrams from the regular expression
and uses those trigrams to look up some candidate documents in the index.</p>
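<p>A minimal sketch of such an index, assuming byte-based trigrams and a plain in-memory map
(the function names and the second file, <code class="highlighter-rouge">farewell.txt</code>, are hypothetical
and only here for illustration; this is not codesearch’s actual index format):</p>

```go
package main

import (
	"fmt"
	"sort"
)

// trigrams returns every overlapping 3-byte substring of s, in order.
func trigrams(s string) []string {
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

// buildIndex maps each trigram to the sorted names of the files that contain it.
// files maps a file name to that file's contents.
func buildIndex(files map[string]string) map[string][]string {
	index := make(map[string][]string)
	for name, contents := range files {
		seen := make(map[string]bool) // record each trigram once per file
		for _, t := range trigrams(contents) {
			if !seen[t] {
				seen[t] = true
				index[t] = append(index[t], name)
			}
		}
	}
	for _, names := range index {
		sort.Strings(names)
	}
	return index
}

func main() {
	index := buildIndex(map[string]string{
		"greetings.txt": "Hello, world!",
		"farewell.txt":  "Goodbye, world!", // hypothetical second file
	})
	fmt.Println(trigrams("Hello, world!"))
	fmt.Println(index["Hel"])
	fmt.Println(index["wor"])
}
```

<p>Looking up <code class="highlighter-rouge">Hel</code> in this toy index returns only <code class="highlighter-rouge">greetings.txt</code>,
while <code class="highlighter-rouge">wor</code> returns both files.</p>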
<p>The trigram queries used by codesearch are just trigrams combined with ANDs and ORs.
For a simple regular expression like <code class="highlighter-rouge">Hello</code>, codesearch might generate the trigram
query <code class="highlighter-rouge">Hel AND ell AND llo</code>. Looking that query up in the index would return all of
the indexed files that contain all three of the trigrams <code class="highlighter-rouge">Hel</code>, <code class="highlighter-rouge">ell</code>, and <code class="highlighter-rouge">llo</code>.
A more complicated regular expression like <code class="highlighter-rouge">abc(d|e)</code> might generate the trigram
query <code class="highlighter-rouge">abc AND (bcd OR bce)</code>.</p>
<p>Sometimes there’s no good trigram query for a regular expression. This can happen
if the regular expression accepts strings of length less than 3. The regular
expression <code class="highlighter-rouge">[0-9]+</code>, for example, accepts strings of digits of length 1 and 2 so
we can’t use a trigram query to search for files that match that expression.
There also may not be a good trigram query if the regular expression accepts too many
different strings. <code class="highlighter-rouge">[a-z]{3}</code> is a short regular expression but it needs an enormous
trigram query with 17,576 trigrams OR-ed together to capture its meaning. If you leave
off any of those 17,576 trigrams from the query, you risk false negatives: files that
match the regular expression but aren’t returned by your trigram query. You’d be better off
grepping files in most cases than running such a large query.</p>
<p>Finally, even when codesearch can come up with a good trigram query, it can get false
positives in its result set from the trigram index. The regular expression <code class="highlighter-rouge">Hello</code>’s trigram query
<code class="highlighter-rouge">Hel AND ell AND llo</code> matches not just files containing <code class="highlighter-rouge">Hello</code> but also files that
contain things like <code class="highlighter-rouge">Help smell this fellow</code>. Since false positives like this are
possible, a regular expression search based on trigram queries will need to
post-process trigram query results by running the regular expression over each file
that comes back. The goal, then, is just to generate a small enough set of candidate
documents so that the query generation, lookup in the inverted index, and post-processing
with the original regular expression runs much more quickly than exhaustive grepping over
files.</p>
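<p>The filter-then-verify flow can be sketched like this. It assumes a query is an AND of OR groups
stored as a slice of slices, and it checks a single in-memory string rather than consulting a real index;
<code class="highlighter-rouge">candidate</code> and <code class="highlighter-rouge">search</code> are hypothetical names:</p>

```go
package main

import (
	"fmt"
	"regexp"
)

// trigramSet collects the overlapping 3-byte substrings of s.
func trigramSet(s string) map[string]bool {
	set := make(map[string]bool)
	for i := 0; i+3 <= len(s); i++ {
		set[s[i:i+3]] = true
	}
	return set
}

// candidate reports whether doc passes an AND-of-ORs trigram query: every
// inner slice must contribute at least one trigram that occurs in doc.
func candidate(query [][]string, doc string) bool {
	have := trigramSet(doc)
	for _, anyOf := range query {
		found := false
		for _, t := range anyOf {
			if have[t] {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// search applies the cheap trigram filter first and only then confirms the
// candidate by running the regular expression itself.
func search(pattern string, query [][]string, doc string) (bool, error) {
	if !candidate(query, doc) {
		return false, nil
	}
	return regexp.MatchString(pattern, doc)
}

func main() {
	query := [][]string{{"Hel"}, {"ell"}, {"llo"}}
	for _, doc := range []string{"Hello, world!", "Help smell this fellow", "Goodbye"} {
		ok, err := search("Hello", query, doc)
		fmt.Printf("%q candidate=%v match=%v err=%v\n", doc, candidate(query, doc), ok, err)
	}
}
```

<p>The second document passes the trigram filter but fails the regular expression, which is exactly the
false positive that the post-processing step exists to remove.</p>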
<p>codesearch generates these trigram queries by parsing the regular expression using Go’s
<a href="https://golang.org/pkg/regexp/syntax/">regexp/syntax</a>, then analyzing the
parsed regular expression and converting it into structures
that describe what trigrams each portion of the regular expression can match.
These structures are then combined
based on a table of rules derived from the meaning of the regular expression operations
involved. The structures maintained during this process
keep track of several attributes of the piece of the regular expression
they describe: whether that portion matches the empty string, as well as sets
of trigrams that can match exactly, as prefixes,
and suffixes. The whole process is quick and, along with some boolean simplification of
the resulting trigram queries, generates succinct queries without false negatives.</p>
<h1 id="generating-trigram-queries-with-graph-cuts">Generating trigram queries with graph cuts</h1>
<p>Instead of generating trigram queries by analyzing the structure of the regular
expression, <a href="https://github.com/aaw/regrams">regrams</a> transforms the regular expression
into an <a href="https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton">NFA</a> and
analyzes that.</p>
<p>regrams uses an NFA with some extra annotations on the states. Getting
to an NFA from a regular expression in Go is pretty simple with the <a href="https://golang.org/pkg/regexp/syntax/">regexp/syntax</a>
package: running <a href="https://golang.org/pkg/regexp/syntax/#Parse"><code class="highlighter-rouge">Parse</code></a> on the
string, then <a href="https://golang.org/pkg/regexp/syntax/#Regexp.Simplify"><code class="highlighter-rouge">Simplify</code></a> on the
resulting expression yields a normal form that’s easy to work with. regrams massages the
simplified expression into an even simpler form that’s closer to a textbook regular
expression, containing only concatenation (.), alternation (|), and Kleene star (*)
operations on literals and empty strings. This simplified expression
might match more than the original expression but it never matches less, so it’s fine
for generating a trigram query. Finally, regrams converts the simplified regular
expression into an NFA using <a href="https://en.wikipedia.org/wiki/Thompson%27s_construction">Thompson’s algorithm</a>.</p>
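<p>Here is a small sketch of the first two steps using only the standard library. The helper names
(<code class="highlighter-rouge">parseSimplified</code>, <code class="highlighter-rouge">hasOp</code>,
<code class="highlighter-rouge">hasCountedRepeat</code>) are hypothetical, and the further massaging that regrams
does after <code class="highlighter-rouge">Simplify</code> is not shown:</p>

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// parseSimplified parses expr with Perl-style flags and simplifies the result.
func parseSimplified(expr string) (*syntax.Regexp, error) {
	re, err := syntax.Parse(expr, syntax.Perl)
	if err != nil {
		return nil, err
	}
	return re.Simplify(), nil
}

// hasOp reports whether op appears anywhere in the parse tree rooted at re.
func hasOp(re *syntax.Regexp, op syntax.Op) bool {
	if re.Op == op {
		return true
	}
	for _, sub := range re.Sub {
		if hasOp(sub, op) {
			return true
		}
	}
	return false
}

// hasCountedRepeat reports whether a counted repetition such as x{2,3}
// is still present in expr after simplification.
func hasCountedRepeat(expr string) (bool, error) {
	re, err := parseSimplified(expr)
	if err != nil {
		return false, err
	}
	return hasOp(re, syntax.OpRepeat), nil
}

func main() {
	for _, expr := range []string{"ab(c|d*)ef", "a{2,3}b", "(?i)abc"} {
		re, err := parseSimplified(expr)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%-12s -> %s (counted repeat left: %v)\n", expr, re, hasOp(re, syntax.OpRepeat))
	}
}
```

<p>Per the package documentation, <code class="highlighter-rouge">Simplify</code> rewrites counted repetitions away,
which is one reason the simplified tree is easier to turn into an NFA.</p>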
<p>Next, any state that has a literal transition gets
annotated with a set of trigrams that are reachable starting from that state. There
are a few exceptions: if we start collecting the set of trigrams but realize it’s
going to be too large, we’ll bail out and the state will get an empty set of
trigrams. Also, if we start collecting the set of trigrams but realize that you
can reach the accept state of the NFA in 2 or fewer steps, we’ll bail out, since
that means that we can’t represent the query from that node with trigrams. So we end up
with annotations on some of the nodes in the NFA that describe trigram OR queries,
but only on the nodes that are easy to write trigram OR queries for.</p>
<p>Here’s an example NFA for the regular expression <code class="highlighter-rouge">ab(c|d*)ef</code> with trigram set
annotations in blue below each node:</p>
<p><img src="/assets/nfa-trigram-sets.png" alt="An NFA with annotated trigram sets" /></p>
<p>Notice that nodes with only epsilon transitions in the NFA above don’t get trigram
annotations and nodes that can reach the accept state in fewer than 3 literal
transitions like the node with the “e” transition and the node with the “f” transition
don’t have trigram set annotations. Every other node is annotated with the set of
trigrams that can be generated from that node by following an unlimited number of
epsilon transitions and exactly three literal transitions.</p>
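<p>The annotation step can be sketched as follows. The NFA is built by hand for
<code class="highlighter-rouge">ab(c|d*)ef</code> and its node numbering is an assumption of this example; it may not match the
states that Thompson’s construction or regrams would produce. The bail-out for overly large trigram sets is omitted,
and all type and function names are hypothetical:</p>

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// state is one NFA node: lit is the byte it consumes (0 for an epsilon-only
// node) and next lists its successors. The accept state has no successors.
type state struct {
	lit  byte
	next []int
}

// exampleNFA is a hand-built NFA for ab(c|d*)ef.
func exampleNFA() []state {
	return []state{
		{'a', []int{1}}, {'b', []int{2}}, // nodes 0 and 1
		{0, []int{3, 4}}, // node 2: branch into c or d*
		{'c', []int{6}},  // node 3
		{0, []int{5, 6}}, // node 4: take another d or leave the loop
		{'d', []int{4}},  // node 5
		{'e', []int{7}}, {'f', []int{8}}, // nodes 6 and 7
		{0, nil}, // node 8: accept
	}
}

// closure appends to out the literal nodes reachable from n via epsilon
// transitions alone and reports whether the accept state is reachable that way.
func closure(nfa []state, n int, seen map[int]bool, out *[]int) bool {
	if seen[n] {
		return false
	}
	seen[n] = true
	if len(nfa[n].next) == 0 {
		return true
	}
	if nfa[n].lit != 0 {
		*out = append(*out, n)
		return false
	}
	accept := false
	for _, m := range nfa[n].next {
		accept = closure(nfa, m, seen, out) || accept
	}
	return accept
}

// extend consumes the literal at node n and keeps walking until prefix is
// three bytes long. It returns false if the accept state is reached first.
func extend(nfa []state, n int, prefix string, set map[string]bool) bool {
	prefix += string(nfa[n].lit)
	if len(prefix) == 3 {
		set[prefix] = true
		return true
	}
	var lits []int
	seen := make(map[int]bool)
	for _, m := range nfa[n].next {
		if closure(nfa, m, seen, &lits) {
			return false
		}
	}
	for _, m := range lits {
		if !extend(nfa, m, prefix, set) {
			return false
		}
	}
	return true
}

// trigramsFrom returns node n's annotation as sorted trigrams joined by "|",
// or "" when the node gets no annotation.
func trigramsFrom(nfa []state, n int) string {
	set := make(map[string]bool)
	if nfa[n].lit == 0 || !extend(nfa, n, "", set) {
		return ""
	}
	var out []string
	for t := range set {
		out = append(out, t)
	}
	sort.Strings(out)
	return strings.Join(out, "|")
}

func main() {
	nfa := exampleNFA()
	for n := range nfa {
		fmt.Printf("node %d: %q\n", n, trigramsFrom(nfa, n))
	}
}
```

<p>In this sketch the “a” and “b” nodes end up with the sets <code class="highlighter-rouge">abc|abd|abe</code> and
<code class="highlighter-rouge">bce|bdd|bde|bef</code>, while the “e” and “f” nodes get nothing because the accept state is
too close.</p>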
<p>Creating a trigram query from the NFA above is now just an exercise in applying
the right graph algorithm. A good trigram OR query – one that captures a set of
trigrams, at least one of which must appear in any string matching the regular expression, but is as
small as possible – corresponds to a minimal set of trigram-annotated nodes in the
NFA that, when removed, disconnect the initial state from the final state. In
graph theory, a set of nodes that, when removed, disconnects two nodes s and t in the
graph is called an s-t node cut.
A node cut in our NFA separating the start and accept nodes corresponds to a
set of trigrams, at least one of which must be present in any string that
matches the regular expression: there’s no way for a string to be accepted by the
NFA without going through one of the nodes in the cut.</p>
<p>In the NFA above, there are only two minimal node
cuts that consist only of trigram-annotated nodes: the cut consisting of only the
node with the “a” transition and the cut consisting of only the node with the “b”
transition.</p>
<p>Once we’ve extracted a node cut consisting only of trigram-annotated nodes, we can just OR all of
the trigrams in the cut together and, by the argument above, we’ve got a valid trigram
query for the original regular expression: at least one of those trigrams must be
present in any string that matches the regular expression. Continuing with the
example above, depending on which cut we choose, we either
get the trigram query <code class="highlighter-rouge">abc OR abd OR abe</code> or the trigram query <code class="highlighter-rouge">bce OR bdd OR bde OR bef</code>.</p>
<p>At this point, our query is a single big OR defined by the cut and we’ve likely used relatively few
of the trigram sets we’ve annotated the graph with. But because we’ve just isolated
a cut, that cut splits the NFA (and also the corresponding regular expression) into two parts, and if we now clean up
both of those two parts of the NFA a little, we can run the same cut analysis on each of those two parts
recursively to extract more trigram queries. We just keep isolating cuts recursively
until we can’t find a cut with trigram-annotated nodes, at which point there’s nothing good left
to generate a query with. All of the OR queries we generate like this can be
AND-ed together to create one final query.</p>
<p>In the example NFA we’re working with, that means that if we choose the cut
with just the node with a “b” transition, it generates the trigram query <code class="highlighter-rouge">bce OR bdd OR bde OR bef</code>
and splits the NFA into two parts: the part with the “a” transition and everything
after the “b” transition node:</p>
<p><img src="/assets/nfa-trigram-sets-split.png" alt="The same NFA from before, now with a cut at the node with a "b" transition removed" /></p>
<p>We can now recursively consider both sub-NFAs. We see a single cut in the
first NFA that generates the trigram query <code class="highlighter-rouge">abc OR abd OR abe</code> and no cut in the second
NFA that consists of just trigram annotated nodes: even if you remove both the
node with a “c” transition and the node with a “d” transition, there’s still
a path from the initial state in the sub-NFA to the accept state through the
lower-level epsilon-transition around the node with a “d” transition. Since there
are no cuts left in any of the subgraphs, our final trigram query for the entire
regular expression <code class="highlighter-rouge">ab(c|d*)ef</code> is all of the subqueries AND-ed together, which is
<code class="highlighter-rouge">(abc OR abd OR abe) AND (bce OR bdd OR bde OR bef)</code>.</p>
<p>The fact that we couldn’t extract a cut from the second NFA is by design: that
NFA corresponds to the regular expression <code class="highlighter-rouge">(c|d*)ef</code>, which matches the string
<code class="highlighter-rouge">ef</code> that we can’t express with a trigram query.</p>
<h1 id="finding-minimal-node-cuts">Finding minimal node cuts</h1>
<p>So how do you find a minimal cut consisting only of trigram-annotated nodes?
If we label each node in a trigram-annotated NFA with the size of its trigram set or
with “infinity” if it doesn’t have a trigram set, then we can frame the problem
as a search for a minimum-weight node cut in the graph that separates the initial
and final state. Because of the way we’ve set up the node weights, finding the
minimum-weight node cut will tell us if we actually have a usable query: the cut
can be used to generate a query exactly when its weight is non-infinite.</p>
<p>Finding a minimum-weight cut in a graph that separates two distinguished nodes
is a tractable optimization problem and is
especially easy in simple graphs like the NFAs that we get from regular expressions.
Minimum-weight cuts are closely related to maximum weight “flows”, and if you ever
took an algorithms course you may have run into
the <a href="https://en.wikipedia.org/wiki/Max-flow_min-cut_theorem">max-flow min-cut theorem</a>
that formalizes this duality. If you imagine a graph as a set of pipes of differing
widths, flowing in the direction of the arrows between nodes, the max flow
represents the greatest amount of fluid that can flow through the pipes at any point
in time. The minimum-weight cut is a bottleneck that constrains the maximum flow you
can push through.</p>
<p>The simplest way to find a minimum-weight cut, then, is to find a maximum-weight
flow, which really boils down to initializing all nodes with capacities as described
above (each capacity will either be infinite or the size of the trigram set) and then
repeatedly finding a path that goes from the start node to the accept node through
nodes of non-zero capacity. Every time we find such a path, we reduce the
capacity of all nodes on the path by the minimum capacity on the path.
Eventually, there are no paths left except those that go through zero-capacity nodes
and those zero-capacity nodes can be used to recover the minimal cut.</p>
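<p>A rough sketch of this procedure is below. It follows the simplified description above literally, without
the residual (reverse) edges of a textbook max-flow algorithm, so on more tangled graphs it could return a valid
cut that isn’t minimum. The graph encoding, the node numbering, and the weights 3, 4, 1 and 3 for the “a”, “b”, “c” and “d”
nodes of <code class="highlighter-rouge">ab(c|d*)ef</code> are assumptions of this example, and the function names are hypothetical:</p>

```go
package main

import "fmt"

// inf stands in for the "infinite" weight of nodes without a trigram set.
const inf = int(^uint(0) >> 1)

// exampleGraph returns successor lists and node weights for an NFA of
// ab(c|d*)ef: nodes 0, 1, 3 and 5 consume a, b, c and d; node 8 accepts.
func exampleGraph() ([][]int, []int) {
	next := [][]int{{1}, {2}, {3, 4}, {6}, {5, 6}, {4}, {7}, {8}, nil}
	weight := []int{3, 4, inf, 1, inf, 3, inf, inf, inf}
	return next, weight
}

// findPath returns a path from n to accept that only uses nodes with
// non-zero remaining capacity, or nil if there is none.
func findPath(next [][]int, capacity []int, n, accept int, seen []bool) []int {
	if seen[n] || capacity[n] == 0 {
		return nil
	}
	seen[n] = true
	if n == accept {
		return []int{n}
	}
	for _, m := range next[n] {
		if p := findPath(next, capacity, m, accept, seen); p != nil {
			return append([]int{n}, p...)
		}
	}
	return nil
}

// nodeCut returns saturated nodes that separate start from accept, or nil
// when some path consists only of infinite-weight nodes.
func nodeCut(next [][]int, weight []int, start, accept int) []int {
	capacity := append([]int(nil), weight...)
	for {
		path := findPath(next, capacity, start, accept, make([]bool, len(next)))
		if path == nil {
			break
		}
		bottleneck := inf
		for _, n := range path {
			if capacity[n] < bottleneck {
				bottleneck = capacity[n]
			}
		}
		if bottleneck == inf {
			return nil // no finite-weight cut exists
		}
		for _, n := range path {
			if capacity[n] != inf {
				capacity[n] -= bottleneck
			}
		}
	}
	// The cut is the set of zero-capacity nodes first reached from start.
	var cut []int
	seen := make([]bool, len(next))
	var walk func(n int)
	walk = func(n int) {
		if seen[n] {
			return
		}
		seen[n] = true
		if capacity[n] == 0 {
			cut = append(cut, n)
			return
		}
		for _, m := range next[n] {
			walk(m)
		}
	}
	walk(start)
	return cut
}

func main() {
	next, weight := exampleGraph()
	fmt.Println(nodeCut(next, weight, 0, 8)) // whole NFA
	fmt.Println(nodeCut(next, weight, 2, 8)) // sub-NFA for (c|d*)ef
}
```

<p>On this example the sketch selects the node with the “a” transition for the whole NFA and finds no usable cut
for the sub-NFA that starts after the “b” node.</p>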
<p>There are more sophisticated ways of finding a maximum-weight flow,
but the complexity of this algorithm is never more than the number of edges in the
NFA times the maximum allowable trigram set size. Since we bound both the maximum
size of the NFA allowed and the size of the maximum trigram set in the regrams
implementation, the complexity never gets out of hand here, and in practice it’s
very quick and even out-performs some theoretically faster optimizations like the
<a href="https://en.wikipedia.org/wiki/Edmonds%E2%80%93Karp_algorithm">Edmonds-Karp</a> search
order in my tests.</p>
<h1 id="examples">Examples</h1>
<p>So what do the results look like? Here are a few examples which you can play with
on your own if you have a Go development environment. Just build the regrams
command-line wrapper and follow along:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>go get github.com/aaw/regrams
go build -o regrams github.com/aaw/regrams/cmd/regrams
</code></pre>
</div>
<p>The trigram queries are written with implicit AND
and <code class="highlighter-rouge">|</code> for OR, so <code class="highlighter-rouge">abc AND (bcd OR bce)</code> comes out looking like <code class="highlighter-rouge">(abc) (bcd|bce)</code>:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ ./regrams 'Hello, world!'
( wo) (, w) (Hel) (ell) (ld!) (llo) (lo,) (o, ) (orl) (rld) (wor)
$ ./regrams 'a(bc)+d'
(abc) (bcb|bcd)
$ ./regrams 'ab(c|d*)ef'
(abc|abd|abe) (bce|bdd|bde|bef)
$ ./regrams '(?i)abc'
(ABC|ABc|AbC|Abc|aBC|aBc|abC|abc)
$ ./regrams 'abc[a-zA-Z]de(f|g)h*i{3}'
(abc) (def|deg) (efh|efi|egh|egi) (fhh|fhi|fii|ghh|ghi|gii) (iii)
$ ./regrams '[0-9]+' # No trigrams match single or double digit strings
Couldn't generate a query
$ ./regrams '[a-z]{3}' # Too many trigrams
Couldn't generate a query
</code></pre>
</div>
<p>regrams is also available as a Go package: just import <code class="highlighter-rouge">github.com/aaw/regrams</code>
and then call <code class="highlighter-rouge">regrams.MakeQuery</code> on a string to parse a regular expression into a trigram
query. The trigram query returned is a slice-of-slices of strings, which represents a big
AND of ORs: <code class="highlighter-rouge">[["abc"], ["bcd", "bce"]]</code> represents <code class="highlighter-rouge">(abc) (bcd|bce)</code>.</p>
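<p>To make the slice-of-slices shape concrete, here is a small standalone sketch that renders such a value in the
notation the command-line wrapper prints. <code class="highlighter-rouge">formatQuery</code> is a hypothetical helper written for
this example, not part of the regrams package, and the query is typed as a plain <code class="highlighter-rouge">[][]string</code>
by assumption:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// formatQuery renders an AND-of-ORs trigram query: trigrams inside a group
// are joined with "|" (OR) and groups are joined with spaces (implicit AND).
func formatQuery(query [][]string) string {
	groups := make([]string, 0, len(query))
	for _, anyOf := range query {
		groups = append(groups, "("+strings.Join(anyOf, "|")+")")
	}
	return strings.Join(groups, " ")
}

func main() {
	query := [][]string{{"abc"}, {"bcd", "bce"}}
	fmt.Println(formatQuery(query))
}
```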
<h1 id="optimizations">Optimizations</h1>
<p>There’s at least one major optimization that’s possible but isn’t yet implemented
in regrams: we could extract node-disjoint paths instead of just node cuts and
get more specific queries.</p>
<p>For example, if you ask regrams to generate a query for
<code class="highlighter-rouge">(abcde|vwxyz)</code>, it’ll generate <code class="highlighter-rouge">(abc|vwx) (bcd|wxy) (cde|xyz)</code>, which isn’t as specific
as the query <code class="highlighter-rouge">(abc bcd cde)|(vwx wxy xyz)</code>. <code class="highlighter-rouge">(abc|vwx)</code>, <code class="highlighter-rouge">(bcd|wxy)</code>, and <code class="highlighter-rouge">(cde|xyz)</code>
represent three distinct cuts in the NFA. If we didn’t stop at node cuts, and instead
augmented the nodes in the cut with disjoint paths, we could avoid this situation
and always return the best trigram query for a regular expression. To do this, we’d
need some algorithmic version of something like <a href="https://en.wikipedia.org/wiki/Menger%27s_theorem">Menger’s theorem</a>
that extracts node-disjoint paths from our NFAs that go through only nodes with
non-infinite capacity.</p>
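<p>The difference in specificity is easy to see with a toy check. This sketch tests trigram presence directly against
a string instead of an index, and the document <code class="highlighter-rouge">abc wxy xyz</code> is a made-up example; the
function names are hypothetical:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// andOfOrs reports whether every group has at least one trigram present in doc.
func andOfOrs(groups [][]string, doc string) bool {
	for _, anyOf := range groups {
		found := false
		for _, t := range anyOf {
			if strings.Contains(doc, t) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// orOfAnds reports whether some group has all of its trigrams present in doc.
func orOfAnds(groups [][]string, doc string) bool {
	for _, allOf := range groups {
		all := true
		for _, t := range allOf {
			if !strings.Contains(doc, t) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	cuts := [][]string{{"abc", "vwx"}, {"bcd", "wxy"}, {"cde", "xyz"}}   // (abc|vwx) (bcd|wxy) (cde|xyz)
	paths := [][]string{{"abc", "bcd", "cde"}, {"vwx", "wxy", "xyz"}}    // (abc bcd cde)|(vwx wxy xyz)
	for _, doc := range []string{"abcde", "abc wxy xyz"} {
		fmt.Printf("%q cuts=%v paths=%v\n", doc, andOfOrs(cuts, doc), orOfAnds(paths, doc))
	}
}
```

<p>The made-up document satisfies the cut-based query even though it matches neither alternative of
<code class="highlighter-rouge">(abcde|vwxyz)</code>; the path-based query rejects it.</p>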
<h1 id="related-work">Related Work</h1>
<p>codesearch isn’t the only attempt at generating trigram queries from regular
expressions. In 2002, Cho and Rajagopalan published
“<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.6659">A Fast Regular Expression Indexing Engine</a>”,
which describes a search engine called FREE that shares a lot of the ideas
found in codesearch, including using an n-gram index to index the underlying
documents and generating queries against that index using some rules
based on the structure of those regular expressions. The rules are much
simpler than those used by codesearch and FREE doesn’t have an actual
implementation that I know of.</p>
<p>Postgres also now
<a href="https://www.pgcon.org/2012/schedule/events/383.en.html">supports regular expression search</a>
via <a href="https://github.com/postgres/postgres/blob/master/contrib/pg_trgm/trgm_regexp.c">an implementation by Alexander Korotkov</a>
in the <a href="https://www.postgresql.org/docs/9.3/static/pgtrgm.html">pg_trgm module</a>.
Korotkov’s implementation apparently generates a special NFA with trigram transitions after
converting the original regular expression to an NFA, then extracts a query from
that trigram NFA. Korotkov’s implementation seems similar to regrams in that all of
the analysis is done on an NFA, but I don’t understand enough about it to say
anything more. I’d love to read a description of the technique somewhere. It seems
to generate slightly better queries than regrams based on some of the documentation, for
example generating <code class="highlighter-rouge">((abe AND bef) OR (cde AND def)) AND efg</code> from <code class="highlighter-rouge">(ab|cd)efg</code>, whereas regrams
would generate <code class="highlighter-rouge">(abe OR cde) AND (bef OR def) AND efg</code> from the same regular expression.</p>
Fri, 10 Jun 2016 20:51:47 +0000
http://blog.aaw.io/2016/06/10/regrams-intro.html