Comma-free codes
A set of strings is comma-free if, for any two strings x and y in the set, no substring of the concatenation xy that overlaps both x and y is also in the set.
Once again, my interest in comma-free codes is coming from Don Knuth’s Volume 4, Fascicle 5 of The Art of Computer Programming. Knuth covers a souped-up backtracking algorithm to find comma-free codes and includes an exercise that derives W. L. Eastman’s algorithm for generating codes with odd word length1. He also goes over Eastman’s algorithm in his 2015 Christmas lecture as well, which must have been given around the time he was writing Fascicle 5.
In this post, I’ll describe how to use a SAT solver to discover comma-free codes.
Background
I’ll call a comma-free set a “code” and the strings in the set “codewords” in most of the rest of this post.
A few warm-up examples to make the comma-free property concrete: if a comma-free code containing only words of length three contains 123 and 456, then it can’t contain 234 or 345 and it can’t contain 561 or 612 (since those are substrings of 123456 and 456123, respectively). It also can’t contain 231, 312, 564, or 654 (since those are substrings of 123123 and 456456, respectively).
The term “comma-free” comes from the fact that you can create messages just by concatenating codewords together (without commas or other such delimiters) and those messages will be uniquely decodable into codewords even if you start decoding somewhere in the middle of the message.
For example, a comma-free code with words of length 3 over the alphabet {0,1,2} is:
{001, 002, 101, 102, 120, 121, 220, 221}
Given some partial message made up of these codewords:
...011202212200012...
The interpretation of that message as codewords is unique. It has to be
...01,120,221,220,001,2...
since neither of the other ways of interpreting it makes sense. Neither
...0,112,022,122,000,12...
nor
...011,202,212,200,012...
contain any valid code words.
Upper bounds
A Lyndon word is a string that’s a unique lexicographic minimum among the multiset of all of its rotations.
0001 is a Lyndon word because it’s the minimum among the rotations 0001, 0010, 0100, 1000.
0101 is the minimum among all its rotations, but it’s not a Lyndon word because the minimum is not unique among all four rotations: 0101, 1010, 0101, 1010. Any periodic string will have the same problem.
If the same codeword is repeated in a message, all of its other rotations overlap both occurrences of the codeword (for example, 00010001 contains all rotations of 0001). So a codeword and one of its rotation cannot both appear in a comma-free code. Furthermore, you can’t have periodic codewords in a comma-free code (0101 creates ambiguity when concatenated with itself as 01010101). These two facts mean that the number of Lyndon words of length n over an alphabet of size m is an upper bound on the maximum size of a comma-free code with the same parameters.
There’s a closed-form formula for the number of Lyndon words of length n over an alphabet of size m, but it involves the Möbius function. I’ll just use LW(n,m) to represent it here – it doesn’t matter what the formula is, I’m only interested in how close we can get to the optimal size LW(n,m) for various values of n and m.
To tie all of this together, one way of viewing the process of creating a comma-free code that’s as large as possible is just selecting at most one rotation of each Lyndon word while avoiding conflicts. Let CF(n,m) be the maximum size of a comma-free code with words of length n over an alphabet of size m. If LW(n,m) = CF(n,m), then there’s some way of choosing one rotation of each Lyndon word to create a maximum-size comma-free code.
For all odd n, LW(n,m) = CF(n,m) and W.L. Eastman came up with an algorithm1 that will create such a code. For even n, not much is known even for very small values of n and m. The case n=2 was solved by Golomb, Gordon, and Welch2: CF(2,m) = floor(m2/3), which is strictly less than LW(n,m) for m > 3.
For example, here’s a comma-free code for n=2, m=4:
{10, 12, 30, 31, 32}
LW(2,4) = 6 and the code above contains rotations of all Lyndon words except for 02. And that’s as good as you can do: by Golumb, Gordon, and Welch’s formula above, CF(2,4) = 5, so there’s no comma-free code that includes all Lyndon words for n=2, m=4.
Here’s a table of LW(n,m) - CF(n,m), where CF(n,m) is actually known:
m=2 | m=3 | m=4 | m=5 | m=6 | m=7 | m=8 | m=9 | m=10 | |||
---|---|---|---|---|---|---|---|---|---|---|---|
n=2 | 0 | 0 | 1 | 2 | 3 | 5 | 7 | 9 | 12 | … | LW(n,m) - floor(m2/3) |
n=3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
n=4 | 0 | 0 | 3 | 11 | |||||||
n=5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
n=6 | 0 | 3 | |||||||||
n=7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
n=8 | 0 | ||||||||||
n=9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
n=10 | 0 | ||||||||||
n=11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
n=12 | 1 | ||||||||||
n=13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
n=14 | |||||||||||
Blanks in the table above are currently unknown. In particular, all rows after n=13 are unknown. For even n, CF(n,m) gets really difficult to determine really quickly and LW(n,m) is only achievable for small values of n and m.
The three highlighted entries in the table are state-of-the-art and apparently unknown at the time of the first printing of Volume 4, Fascicle 5. All of the entries in rows with even n above are within reach of at most a day’s computation on a decent computer using a SAT solver. In the rest of this post, I’ll describe how I calculated the values in the table above for n > 2.
Using a Solver to Find Comma-free Codes
The general recipe I used for finding comma-free codes with a SAT solver was:
-
Write a Python program that takes arguments n, m, and d on the command line and generates a DIMACS file with a formula that’s satisfiable exactly when there’s a comma-free code on codewords of size n over alphabet of size m with d Lyndon words not chosen. So if you run with d=0 and get a satisfiable formula, LW(n,m) = CF(n,m).
Some of the variables in the formula are indicators for particular codewords being chosen in the code, so I generated comments in the DIMACS file (example) that would help me translate variables to codewords later.
-
For any pair of n, m I was interested in, I generated DIMACS files starting with d = 0 and ran Armin Biere’s kissat on the resulting files, increasing d and starting over until I found the smallest d where the resulting formula was satisfiable.
-
Using the satisfying assigment found by kissat, I could optionally extract a code by matching the assignment up with comments in the DIMACS input file.
With this method, I was able to fill in the table in the previous section. If you want to replicate this, you’ll need Python3 installed and a SAT solver like kissat. Download two scripts:
- commafree.py: the input file generator
- extract-code.py: a script that verifies and extracts comma-free codes
Using these tools, you should be able to extract your own codes:
$ commafree.py 4 2 0 > /tmp/commafree-4-2-0.cnf # n=4, m=2, d=0
$ kissat /tmp/commafree-4-2-0.cnf > /tmp/kissat-4-2-0.out
$ [ $? -eq 10 ] && extract-code.py /tmp/commafree-4-2-0.cnf /tmp/kissat-4-2-0.out
{1000, 1001, 1011}
(The final command prints nothing if the formula was unsatisfiable, in which case you need to increment d and try again.)
Some of the blank entries in the table above should be solvable today with a bit more effort, particularly at the frontier, like (n,m) = (4,6), (6,4), and (8,3). Maybe these are already solvable by SAT solver tuning, symmetry-breaking or some other preprocessing of the CNF files.
The SAT Encoding
The heart of the process above is a Python program to generate DIMACS files, but it’s nothing too complicated. It does the following:
-
Generate candidate codewords by iterating over all Lyndon words and their rotations. This can be done with Knuth’s Algorithm 7.2.1.1 F from The Art of Computer Programming, Volume 4A. Associate each of these words with a variable that’s true exactly when the codeword has been chosen for the comma-free code.
-
Generate clauses for each Lyndon word that express “at most one codeword from this Lyndon word and its rotations is chosen”.
-
Iterate over all pairs x, y of Lyndon word variables. For each pair, discover all codewords z that are disallowed in xy or yx by the comma-free property and generate clauses for such triples x, y, z that express “x, y, and z can’t all occur in the comma-free code”.
-
Generate variables and clauses for each Lyndon word such that the variables indicate “some rotation of this Lyndon word was chosen”.
-
Generate variables and clauses that sort the variables from the previous step and assert that the smallest d values are 0 and the d+1-st value is 1. I cheated here and didn’t do a full sort, I just applied a full row of comparators d+1 times – essentially d+1 rounds of bubblesort to sort the smallest d values only.
Again, you can check out the generator source for full details.
New Codes
So, finally, here’s a dump of the new codes I was able to discover. All of these are now officially the maximum size comma-free codes for these values of n and m:
n=4, m=5, size=139 (LW(4,5) = 150):
{0012, 0013, 0014, 0100, 0102, 0103, 0104, 0110, 0111, 0112, 0113, 0114, 0204,
0212, 0213, 0214, 0304, 0312, 0313, 0314, 1012, 1013, 1014, 2000, 2002, 2003,
2004, 2012, 2013, 2014, 2022, 2023, 2024, 2032, 2033, 2034, 2100, 2102, 2103,
2104, 2110, 2111, 2112, 2113, 2114, 2204, 2212, 2213, 2214, 2224, 2234, 2304,
2312, 2313, 2314, 2324, 2334, 3000, 3002, 3003, 3004, 3012, 3013, 3014, 3022,
3023, 3024, 3032, 3033, 3034, 3100, 3102, 3103, 3104, 3110, 3111, 3112, 3113,
3114, 3204, 3212, 3213, 3214, 3224, 3234, 3304, 3312, 3313, 3314, 3324, 3334,
4000, 4002, 4003, 4004, 4012, 4013, 4014, 4022, 4023, 4024, 4032, 4033, 4034,
4042, 4043, 4044, 4100, 4102, 4103, 4104, 4110, 4111, 4112, 4113, 4114, 4122,
4123, 4124, 4132, 4133, 4134, 4142, 4143, 4144, 4204, 4212, 4213, 4214, 4224,
4234, 4244, 4304, 4312, 4313, 4314, 4324, 4334, 4344}
n=6, m=3, size=113 (LW(6,3) = 116):
{001000, 001100, 001101, 001102, 001200, 001201, 001202, 002000, 002100,
002101, 002102, 002200, 002201, 002202, 010102, 011100, 011101, 011102,
011110, 011111, 011200, 011201, 011202, 011210, 011211, 012100, 012101,
012102, 012110, 012111, 012112, 012200, 012201, 012202, 012210, 012211,
012220, 020100, 020102, 020110, 020120, 020200, 020210, 020220, 021100,
021101, 021102, 021110, 021111, 021120, 021121, 021200, 021201, 021202,
021210, 021211, 022100, 022101, 022102, 022110, 022111, 022112, 022120,
022121, 022200, 022201, 022202, 022211, 022212, 101000, 101100, 101200,
101201, 101202, 102000, 102100, 102200, 102201, 102202, 111200, 111201,
111202, 111210, 111211, 112200, 112201, 112202, 112210, 112211, 112220,
121200, 121201, 121202, 121210, 121211, 122120, 122121, 122211, 122212,
212200, 212201, 212202, 212210, 212211, 212220, 222100, 222101, 222102,
222200, 222201, 222202, 222211, 222212}
n=12, m=2, size=334 (LW(12,2) = 335):
{000001000011, 000010000000, 000010000001, 000010000011, 000010000101,
000100000101, 000100001001, 000101000011, 000101001011, 000110000001,
000110000011, 000110000101, 000110000111, 000110001001, 000110001011,
000110001101, 000110001111, 000110010001, 000110010011, 000110010101,
000110010111, 000110011001, 000110011011, 000110011101, 000110011111,
000111000011, 000111001011, 001000001001, 001000010001, 001001000011,
001001001011, 001001010011, 001010000001, 001010000011, 001010000101,
001010001001, 001010001011, 001010010001, 001010010011, 001010010101,
001100000001, 001100000101, 001100001001, 001100001101, 001100010001,
001100010101, 001100011101, 001101000011, 001101001011, 001101010011,
001110000001, 001110000011, 001110000101, 001110001001, 001110001011,
001110001101, 001110001111, 001110010001, 001110010011, 001110010101,
001110011001, 001110011011, 001110011101, 001110011111, 001111000000,
001111000011, 001111001011, 001111010011, 010000010001, 010001000011,
010001010011, 010010000001, 010010000011, 010010000101, 010010010001,
010010010011, 010010010101, 010010100011, 010100000001, 010100000101,
010100001001, 010100010001, 010100010101, 010101000011, 010101001011,
010101010011, 010110000001, 010110000011, 010110000101, 010110000111,
010110001001, 010110001011, 010110001101, 010110001111, 010110010001,
010110010011, 010110010101, 010110010111, 010110011001, 010110011011,
010110011101, 010110011111, 010110100011, 010110100111, 010110101011,
010111000011, 010111001011, 010111010011, 010111011011, 010111100011,
010111101011, 010111111011, 011000000000, 011000000001, 011000001001,
011000010001, 011001000011, 011001001011, 011001010011, 011001011011,
011001101011, 011001111011, 011010000001, 011010000011, 011010000101,
011010001001, 011010001011, 011010010001, 011010010011, 011010010101,
011010011001, 011010011011, 011010100011, 011010101011, 011100000000,
011100000001, 011100000101, 011100001001, 011100010001, 011100010101,
011100011101, 011101000011, 011101001011, 011101010011, 011101011011,
011101111011, 011110000011, 011110000101, 011110001001, 011110001011,
011110010001, 011110010011, 011110010101, 011110011001, 011110011011,
011110011101, 011110011111, 011110100011, 011110101011, 011111000011,
011111001011, 011111010011, 011111011011, 011111100011, 011111101011,
011111111011, 100001000011, 100010000000, 100010000001, 100010000011,
100010000101, 100010100011, 100100000000, 100100000001, 100100000101,
100100001001, 100101000011, 100101001011, 100110000001, 100110000011,
100110000101, 100110000111, 100110001001, 100110001011, 100110001101,
100110001111, 100110100011, 100110100111, 100110101011, 100111000011,
100111001011, 100111100011, 100111101011, 101000000000, 101000000001,
101000001001, 101000010001, 101001000011, 101001001011, 101001010011,
101010000001, 101010000011, 101010000101, 101010001001, 101010001011,
101010010001, 101010010011, 101010010101, 101010100011, 101010101011,
101100000001, 101100000101, 101100001001, 101100001101, 101100010001,
101100010101, 101100011101, 101101000011, 101101001011, 101101010011,
101101011011, 101110000001, 101110000011, 101110000101, 101110001001,
101110001011, 101110001101, 101110001111, 101110010001, 101110010011,
101110010101, 101110011001, 101110011011, 101110011101, 101110011111,
101110100011, 101110101011, 101111000000, 101111000011, 101111001011,
101111010011, 101111100011, 101111101011, 101111111011, 110000010001,
110001000011, 110001010011, 110010000001, 110010000011, 110010000101,
110010010001, 110010010011, 110010010101, 110010100011, 110100000001,
110100000101, 110100001001, 110100010001, 110100010101, 110101000011,
110101001011, 110101010011, 110110000001, 110110000011, 110110000101,
110110000111, 110110001001, 110110001011, 110110001101, 110110001111,
110110010001, 110110010011, 110110010101, 110110010111, 110110011001,
110110011011, 110110011101, 110110011111, 110110100011, 110110100111,
110110101011, 110111000011, 110111001011, 110111010011, 110111010111,
110111011011, 110111100011, 110111101011, 110111111011, 111000001001,
111000010001, 111001000011, 111001001011, 111001010011, 111001101011,
111001111011, 111010000001, 111010000011, 111010000101, 111010001001,
111010001011, 111010010001, 111010010011, 111010010101, 111010100011,
111010101011, 111100000101, 111100001001, 111100010001, 111100010101,
111101000011, 111101001011, 111101010011, 111110000000, 111110000001,
111110000011, 111110000101, 111110001001, 111110001011, 111110010001,
111110010011, 111110010101, 111110011001, 111110011011, 111110011101,
111110011111, 111110100011, 111110101011, 111111000011, 111111001011,
111111010011, 111111100011, 111111101011, 111111111011}
-
Eastman, W. L., On the Construction of Comma-Free Codes, IEEE Trans. IT-11 (1965), pp 263-267. ↩ ↩2
-
Golumb, S. W., B. Gordon, and L. R. Welch, Comma-free codes, Can. J. Math, vol. 10, 1958, pp 202-209. ↩