Regular Expression Search via Graph Cuts
Google used to offer a search engine called Code Search which let you use regular expressions to search code. I never thought Code Search was doing anything much more sophisticated than a fast, distributed grep until Russ Cox explained how it worked in a blog post and released codesearch, a smaller-scale implementation. Both the post and the code are fascinating – it turns out that Code Search was doing something much more sophisticated than a distributed grep.
The big idea in codesearch is an engine that translates a large class of regular expressions into queries that can be run against a standard inverted index. In this post, I’m going to describe a different implementation of such an engine using some of the same ideas as codesearch along with a well-known algorithm for finding a minimum-weight node cut in a graph. I think the result is a little simpler conceptually than the query-building technique that codesearch uses, particularly if you understand textbook regular expressions but don’t know much about the internals of any particular regular expression library.
The implementation I’ll describe is available as a Go package at https://github.com/aaw/regrams. It’s a few hundred lines of code and the benchmarks show it running only about 4-5 times slower than codesearch’s regular expression-to-query translation. Both engines run on the order of microseconds for individual queries, so a few times slower is still very usable for a search engine.
Trigram queries and regular expression search
First, a little background about codesearch’s approach to regular expression search.
Before it can process queries, codesearch creates an inverted index from all the files you want to search. The index contains all overlapping substrings of length 3 (“trigrams”) that occur in any of the files. If you index a file called greetings.txt which contains just the string Hello, world!, the resulting trigram index would contain entries for Hel, ell, llo, lo,, o,_, ,_w, _wo, wor, orl, rld, and ld! (where _ stands in for a space). When you look up any of those trigrams in the index, you’ll get a list of files: greetings.txt and any other files that contain the trigram you’ve looked up. At query time, codesearch extracts trigrams from the regular expression and uses those trigrams to look up some candidate documents in the index.
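As a concrete illustration, here’s a minimal sketch of such a trigram index in Go – a plain in-memory map, not codesearch’s actual on-disk format:

```go
package main

import "fmt"

// indexFile adds every distinct overlapping trigram in contents to a
// simple in-memory inverted index mapping trigrams to file names.
func indexFile(index map[string][]string, name, contents string) {
	seen := make(map[string]bool)
	for i := 0; i+3 <= len(contents); i++ {
		t := contents[i : i+3]
		if !seen[t] {
			seen[t] = true
			index[t] = append(index[t], name)
		}
	}
}

func main() {
	index := make(map[string][]string)
	indexFile(index, "greetings.txt", "Hello, world!")
	fmt.Println(index["llo"]) // [greetings.txt]
}
```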
The trigram queries used by codesearch are just trigrams combined with ANDs and ORs. For a simple regular expression like Hello, codesearch might generate the trigram query Hel AND ell AND llo. Looking that query up in the index would return all of the indexed files that contain all three of the trigrams Hel, ell, and llo. A more complicated regular expression like abc(d|e) might generate the trigram query abc AND (bcd OR bce).
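To make those query semantics concrete, here’s a sketch of evaluating an AND-of-ORs query like abc AND (bcd OR bce) against the toy index above. The slice-of-slices representation is hypothetical here (each inner slice is an OR-clause, and the clauses are ANDed together):

```go
// evalQuery returns the files matching an AND of OR-clauses: a file
// qualifies if, for every clause, it contains at least one of that
// clause's trigrams.
func evalQuery(index map[string][]string, query [][]string) []string {
	var result map[string]bool
	for _, clause := range query {
		union := make(map[string]bool)
		for _, trigram := range clause {
			for _, file := range index[trigram] {
				union[file] = true
			}
		}
		if result == nil {
			result = union // the first clause seeds the result set
			continue
		}
		for file := range result {
			if !union[file] { // intersect with this clause's union
				delete(result, file)
			}
		}
	}
	files := make([]string, 0, len(result))
	for file := range result {
		files = append(files, file)
	}
	return files
}
```

With the index from the earlier sketch, evalQuery(index, [][]string{{"abc"}, {"bcd", "bce"}}) evaluates abc AND (bcd OR bce).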
Sometimes there’s no good trigram query for a regular expression. This can happen if the regular expression accepts strings of length less than 3. The regular expression [0-9]+, for example, accepts strings of digits of length 1 and 2, so we can’t use a trigram query to search for files that match that expression.
There also may not be a good trigram query if the regular expression accepts too many different strings. [a-z]{3} is a short regular expression, but it needs an enormous trigram query with 17,576 trigrams OR-ed together to capture its meaning. If you leave any of those 17,576 trigrams out of the query, you risk false negatives: files that match the regular expression but aren’t returned by your trigram query. In most cases, you’d be better off grepping files than running such a large query.
Finally, even when codesearch can come up with a good trigram query, it can get false positives in its result set from the trigram index. The regular expression Hello’s trigram query Hel AND ell AND llo matches not just files containing Hello but also files that contain things like Help smell this fellow. Since false positives like this are possible, a regular expression search based on trigram queries needs to post-process trigram query results by running the regular expression over each file that comes back. The goal, then, is just to generate a small enough set of candidate documents that the query generation, lookup in the inverted index, and post-processing with the original regular expression run much more quickly than exhaustive grepping over files.
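That post-processing step can be as simple as compiling the original expression and re-checking each candidate file. Here’s a sketch (the confirmMatches helper is hypothetical, not part of codesearch or regrams):

```go
import (
	"os"
	"regexp"
)

// confirmMatches weeds out false positives by running the original
// regular expression over every candidate file the index returned.
func confirmMatches(pattern string, candidates []string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var matches []string
	for _, name := range candidates {
		data, err := os.ReadFile(name)
		if err != nil {
			continue // skip unreadable files in this sketch
		}
		if re.Match(data) {
			matches = append(matches, name)
		}
	}
	return matches, nil
}
```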
codesearch generates these trigram queries by parsing the regular expression using Go’s regexp/syntax package, then analyzing the parsed regular expression and converting it into structures that describe what trigrams each portion of the regular expression can match. These structures are then combined based on a table of rules derived from the meaning of the regular expression operations involved. The structures maintained during this process keep track of several attributes of the piece of the regular expression they’re analyzing: whether that portion matches the empty string, the set of trigrams that can match it exactly, and the sets of trigrams matching its prefixes and suffixes. The whole process is quick and, along with some boolean simplification of the resulting trigram queries, generates succinct queries without false negatives.
Generating trigram queries with graph cuts
Instead of generating trigram queries by analyzing the structure of the regular expression, regrams transforms the regular expression into an NFA and analyzes that.
regrams uses an NFA with some extra annotations on the states. Getting to an NFA from a regular expression in Go is pretty simple with the regexp/syntax package: running Parse on the string, then Simplify on the resulting expression yields a normal form that’s easy to work with. regrams massages the simplified expression into an even simpler form that’s closer to a textbook regular expression, containing only concatenation (.), alternation (|), and Kleene star (*) operations on literals and empty strings. This simplified expression might match more than the original expression but it never matches less, so it’s fine for generating a trigram query. Finally, regrams converts the simplified regular expression into an NFA using Thompson’s algorithm.
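The first two steps look something like this (Parse and Simplify are real regexp/syntax calls; only the choice of flags and the example expression are mine):

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

func main() {
	// Parse the expression into a syntax tree, then Simplify it into
	// a normal form (for example, x{2,} becomes xx(x)*).
	re, err := syntax.Parse(`ab(c|d*)ef`, syntax.Perl)
	if err != nil {
		panic(err)
	}
	re = re.Simplify()
	fmt.Println(re) // prints the simplified expression
}
```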
Next, any state that has a literal transition gets annotated with the set of trigrams that are reachable starting from that state. There are a couple of bail-out conditions: if we start collecting the set of trigrams but realize it’s going to be too large, we give up and the state gets an empty set of trigrams. And if we start collecting the set of trigrams but realize that you can reach the accept state of the NFA in 2 or fewer steps, we also bail out, since that means we can’t represent the query from that node with trigrams. So we end up with annotations on some of the nodes in the NFA that describe trigram OR queries, but only on the nodes that are easy to write trigram OR queries for.
Here’s an example NFA for the regular expression ab(c|d*)ef with trigram set annotations in blue below each node:
Notice that nodes with only epsilon transitions in the NFA above don’t get trigram annotations, and neither do nodes that can reach the accept state in fewer than 3 literal transitions, like the node with the “e” transition and the node with the “f” transition. Every other node is annotated with the set of trigrams that can be generated from that node by following an unlimited number of epsilon transitions and exactly three literal transitions.
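Here’s a minimal sketch of that annotation step, with a hypothetical NFA representation (the real regrams types differ). It follows epsilon transitions freely and exactly three literal transitions, bailing out if the set grows too large or the accept state is reachable too soon:

```go
const maxTrigrams = 64 // bail-out threshold; the real limit is a tuning choice

// nfaState is a hypothetical NFA node: an optional literal transition
// plus any number of epsilon transitions.
type nfaState struct {
	literal    []rune      // runes accepted by the literal transition, if any
	literalOut *nfaState   // target of the literal transition
	epsilons   []*nfaState // targets of epsilon transitions
	accept     bool
}

// collect adds every trigram reachable from s (given the literal prefix
// built so far) to out. It returns false if the accept state is
// reachable in fewer than three literal steps or out grows too large.
func collect(s *nfaState, prefix string, onStack map[*nfaState]bool, out map[string]bool) bool {
	if len(prefix) == 3 {
		out[prefix] = true
		return len(out) <= maxTrigrams
	}
	if s.accept {
		return false // accept reachable in < 3 literal steps: no trigram query
	}
	if onStack[s] {
		return true // epsilon cycle: nothing new from here
	}
	onStack[s] = true
	defer delete(onStack, s)
	for _, e := range s.epsilons {
		if !collect(e, prefix, onStack, out) {
			return false
		}
	}
	if s.literalOut != nil {
		for _, r := range s.literal {
			// a literal step starts a fresh epsilon traversal
			if !collect(s.literalOut, prefix+string(r), map[*nfaState]bool{}, out) {
				return false
			}
		}
	}
	return true
}
```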
Creating a trigram query from the NFA above is now just an exercise in applying the right graph algorithm. A good trigram OR query – one that captures a set of trigrams that must appear in any string matching the regular expression but is as small as possible – corresponds to a minimal set of trigram-annotated nodes in the NFA that, when removed, disconnects the initial state from the final state. In graph theory, a set of nodes that, when removed, disconnects two nodes s and t in a graph is called an s-t node cut. A node cut in our NFA separating the start and accept nodes corresponds to a complete set of trigrams that must be present in any string that matches the regular expression: there’s no way for a string to be accepted by the NFA without going through one of the nodes in the cut.
In the NFA above, there are only two minimal node cuts that consist only of trigram-annotated nodes: the cut consisting of only the node with the “a” transition and the cut consisting of only the node with the “b” transition.
Once we’ve extracted a node cut consisting only of trigram-annotated nodes, we can just OR all of the trigrams in the cut together and, by the argument above, we’ve got a valid trigram query for the original regular expression: at least one of those trigrams must be present in any string that matches the regular expression. Continuing with the example above, depending on which cut we choose, we either get the trigram query abc OR abd OR abe or the trigram query bce OR bdd OR bde OR bef.
At this point, our query is a single big OR defined by the cut, and we’ve likely used relatively few of the trigram sets we’ve annotated the graph with. But because we’ve just isolated a cut, that cut splits the NFA (and also the corresponding regular expression) into two parts, and if we now clean up both of those parts of the NFA a little, we can run the same cut analysis on each of those two parts recursively to extract more trigram queries. We just keep isolating cuts recursively until we can’t find a cut with trigram-annotated nodes, at which point there’s nothing good left to generate a query with. All of the OR queries we generate like this can be AND-ed together to create one final query.
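In outline, the recursion looks something like this sketch (minCut and split are hypothetical helpers standing in for the cut-finding and NFA-splitting steps; they’re not the actual regrams functions):

```go
// buildQuery returns an AND of OR-clauses for the NFA g: each clause
// comes from one cut, and the recursion on the two halves of the NFA
// contributes further clauses.
func buildQuery(g *nfa) [][]string {
	cut := minCut(g) // nil if no cut of trigram-annotated nodes exists
	if cut == nil {
		return nil
	}
	var clause []string
	for _, n := range cut {
		clause = append(clause, n.trigrams...) // OR the cut's trigrams together
	}
	before, after := split(g, cut) // the sub-NFAs on either side of the cut
	var query [][]string
	query = append(query, buildQuery(before)...)
	query = append(query, clause)
	query = append(query, buildQuery(after)...)
	return query
}
```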
In the example NFA we’re working with, that means that if we choose the cut with just the node with a “b” transition, it generates the trigram query bce OR bdd OR bde OR bef and splits the NFA into two parts: the part with the “a” transition and everything after the “b” transition node:

We can now recursively consider both sub-NFAs. We see a single cut in the first NFA that generates the trigram query abc OR abd OR abe, and no cut in the second NFA that consists of just trigram-annotated nodes: even if you remove both the node with a “c” transition and the node with a “d” transition, there’s still a path from the initial state in the sub-NFA to the accept state through the lower epsilon transition around the node with a “d” transition. Since there are no cuts left in any of the subgraphs, our final trigram query for the entire regular expression ab(c|d*)ef is all of the subqueries AND-ed together, which is (abc OR abd OR abe) AND (bce OR bdd OR bde OR bef).
The fact that we couldn’t extract a cut from the second NFA is by design: that NFA corresponds to the regular expression (c|d*)ef, which matches the string ef, and we can’t express that with a trigram query.
Finding minimal node cuts
So how do you find a minimal cut consisting only of trigram-annotated nodes? If we label each node in a trigram-annotated NFA with the size of its trigram set or with “infinity” if it doesn’t have a trigram set, then we can frame the problem as a search for a minimum-weight node cut in the graph that separates the initial and final state. Because of the way we’ve set up the node weights, finding the minimum-weight node cut will tell us if we actually have a usable query: the cut can be used to generate a query exactly when its weight is non-infinite.
Finding a minimum-weight cut in a graph that separates two distinguished nodes is a tractable optimization problem and is especially easy in simple graphs like the NFAs that we get from regular expressions. Minimum-weight cuts are closely related to maximum-weight “flows”, and if you ever took an algorithms course you may have run into the max-flow min-cut theorem that formalizes this duality. If you imagine a graph as a set of pipes of differing widths, flowing in the direction of the arrows between nodes, the maximum flow is the largest amount of fluid that can move through the pipes at any point in time. The minimum-weight cut is a bottleneck that constrains the maximum flow you can push through.
The simplest way to find a minimum-weight cut, then, is to find a maximum-weight flow, which really boils down to initializing all nodes with capacities as described above (each capacity is either infinite or the size of the trigram set) and then repeatedly finding a path from the start node to the accept node through nodes of non-zero capacity. Every time we find such a path, we reduce the capacity of all nodes on the path by the minimum capacity on the path. Eventually, there are no paths left except those that go through zero-capacity nodes, and those zero-capacity nodes can be used to recover the minimal cut.
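Here’s a sketch of that loop, again with a hypothetical node type. It implements exactly the simplified scheme described above; a full max-flow implementation would also maintain residual edges so flow can be rerouted:

```go
const inf = 1 << 30 // stands in for "infinity" on unannotated nodes

type flowNode struct {
	capacity int         // size of the trigram set, or inf
	out      []*flowNode // outgoing NFA transitions
}

// findPath returns a start-to-accept path through nodes of non-zero
// capacity, or nil if no such path exists.
func findPath(start, accept *flowNode) []*flowNode {
	seen := map[*flowNode]bool{}
	var dfs func(n *flowNode, path []*flowNode) []*flowNode
	dfs = func(n *flowNode, path []*flowNode) []*flowNode {
		if n.capacity == 0 || seen[n] {
			return nil
		}
		seen[n] = true
		path = append(path, n)
		if n == accept {
			return path
		}
		for _, m := range n.out {
			if p := dfs(m, path); p != nil {
				return p
			}
		}
		return nil
	}
	return dfs(start, nil)
}

// saturate drains augmenting paths until start and accept are
// disconnected through positive-capacity nodes; the zero-capacity
// nodes then witness a minimal cut.
func saturate(start, accept *flowNode) {
	for {
		path := findPath(start, accept)
		if path == nil {
			return
		}
		min := inf
		for _, n := range path {
			if n.capacity < min {
				min = n.capacity
			}
		}
		for _, n := range path {
			n.capacity -= min
		}
	}
}
```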
There are more sophisticated ways of finding a maximum-weight flow, but the complexity of this algorithm is never more than the number of edges in the NFA times the maximum allowable trigram set size. Since the regrams implementation bounds both the maximum size of the NFA and the maximum size of a trigram set, the complexity never gets out of hand, and in practice the simple approach is very quick: in my tests, it even outperforms theoretically faster refinements like the Edmonds-Karp search order.
Examples
So what do the results look like? Here are a few examples, which you can play with on your own if you have a Go development environment. Just build the regrams command-line wrapper and follow along:
go get github.com/aaw/regrams
go build -o regrams github.com/aaw/regrams/cmd/regrams
The trigram queries are written with implicit AND and | for OR, so abc AND (bcd OR bce) comes out looking like abc (bcd|bce):
$ ./regrams 'Hello, world!'
( wo) (, w) (Hel) (ell) (ld!) (llo) (lo,) (o, ) (orl) (rld) (wor)
$ ./regrams 'a(bc)+d'
(abc) (bcb|bcd)
$ ./regrams 'ab(c|d*)ef'
(abc|abd|abe) (bce|bdd|bde|bef)
$ ./regrams '(?i)abc'
(ABC|ABc|AbC|Abc|aBC|aBc|abC|abc)
$ ./regrams 'abc[a-zA-Z]de(f|g)h*i{3}'
(abc) (def|deg) (efh|efi|egh|egi) (fhh|fhi|fii|ghh|ghi|gii) (iii)
$ ./regrams '[0-9]+' # No trigrams match single or double digit strings
Couldn't generate a query
$ ./regrams '[a-z]{3}' # Too many trigrams
Couldn't generate a query
regrams is also available as a Go package: just import github.com/aaw/regrams and call regrams.MakeQuery on a string to parse a regular expression into a trigram query. The trigram query returned is a slice-of-slices of strings, which represents a big AND of ORs: [["abc"], ["bcd", "bce"]] represents abc (bcd|bce).
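A minimal usage sketch (I’m assuming MakeQuery returns an error alongside the query; check the package documentation for the exact signature):

```go
package main

import (
	"fmt"

	"github.com/aaw/regrams"
)

func main() {
	// MakeQuery parses a regular expression and returns its trigram
	// query as a slice of OR-clauses that are ANDed together.
	query, err := regrams.MakeQuery(`a(bc)+d`)
	if err != nil {
		panic(err)
	}
	fmt.Println(query) // [[abc] [bcb bcd]]
}
```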
Optimizations
There’s at least one major optimization that’s possible but isn’t yet implemented in regrams: we could extract node-disjoint paths instead of just node cuts and get more specific queries.
For example, if you ask regrams to generate a query for (abcde|vwxyz), it’ll generate (abc|vwx) (bcd|wxy) (cde|xyz), which isn’t as specific as the query (abc bcd cde)|(vwx wxy xyz). (abc|vwx), (bcd|wxy), and (cde|xyz) represent three distinct cuts in the NFA. If we didn’t stop at node cuts, and instead augmented the nodes in the cut with disjoint paths, we could avoid this situation and always return the best trigram query for a regular expression. To do this, we’d need an algorithmic version of Menger’s theorem that extracts node-disjoint paths from our NFAs that go through only nodes with non-infinite capacity.
Related Work
codesearch isn’t the only attempt at generating trigram queries from regular expressions. In 2002, Cho and Rajagopalan published “A Fast Regular Expression Indexing Engine”, which describes a search engine called FREE that shares a lot of ideas with codesearch, including using an n-gram index over the underlying documents and generating queries against that index using rules based on the structure of the regular expression. FREE’s rules are much simpler than those used by codesearch, and I don’t know of an actual implementation of FREE.
Postgres also now supports regular expression search via an implementation by Alexander Korotkov in the pg_trgm module. Korotkov’s implementation apparently generates a special NFA with trigram transitions after converting the original regular expression to an NFA, then extracts a query from that trigram NFA. It seems similar to regrams in that all of the analysis is done on an NFA, but I don’t understand enough about it to say anything more; I’d love to read a description of the technique somewhere. Based on some of the documentation, it seems to generate slightly better queries than regrams, for example generating ((abe AND bef) OR (cde AND def)) AND efg from (ab|cd)efg, whereas regrams would generate (abe OR cde) AND (bef OR def) AND efg from the same regular expression.