A clever trick on combining automata

John Robson, who goes by Mike, has worked on various problems including what is still the best result on separating words—the topic we discussed the other day. Ken first knew him for his proof than ${N \times N}$ checkers is ${\mathsf{EXPTIME}}$-complete and similar hardness results for chess and Go.

Today I want to talk about his theorem that any two words can be separated by an automaton with relativley few states.

In his famous paper from 1989, he proved an upper bound on the Separating Word Problem. This is the question: Given two strings ${S}$ and ${T}$, how many states does a deterministic automaton need to be able to accept ${S}$ and reject ${T}$? His theorem is:

Theorem 1 (Robson’s Theorem) Suppose that ${S}$ and ${T}$ are distinct strings of length ${n}$. Then there is an automaton with at most ${O(n^{0.4}\log^{0.6} n)}$ states that accepts ${S}$ and rejects ${T}$.

The story of his result is involved. For starters, it is still the best upper bound after almost three decades. Impressive. Another issue is that a web search does not quickly, at least for for me, find a PDF of the original paper. I tried to find it and could not. More recent papers on the separating word problem reference his 1989 paper, but they do not explain how he proves it.

Recall the problem of separating words is: Given two distinct words of length ${n}$, is there a deterministic finite automaton that accepts one and rejects the other? And the machine has as few states as possible. Thus his theorem shows that roughly the number of states grows at most like the square root of ${n}$.

I did finally track the paper down. The trouble for me is the paper is encrypted. Well not exactly, but the version I did find is a poor copy of the original. Here is an example to show what I mean:

 [ An Example ]

So the task of decoding the proof is a challenge. A challenge, but a rewarding one.

## A Cool Trick

Robson’s proof uses two insights. The first is he uses some basic string-ology. That is he uses some basic facts about strings. For example he uses that a non-periodic string cannot overlap itself too much.

He also uses a clever trick on how to simulate two deterministic machines for the price of one. This in general is not possible, and is related to deep questions about automata that we have discussed before here. Robson shows that it can be done in a special but important case.

Let me explain. Suppose that ${\alpha}$ is a string. We can easily design an automaton that accepts ${x}$ if and only if ${x}$ is the string ${\alpha}$. The machine will have order the length of ${\alpha}$ states. So far quite simple.

Now suppose that we have a string ${S}$ of length ${n}$ and wish to find a particular occurrence of the pattern ${\alpha}$ in ${S}$. We assume that there are ${\#(S,\alpha)}$ occurrences of ${\alpha}$ in ${S}$. The task is to construct an automaton that accepts at the end of the ${k^{th}}$ copy of ${\alpha}$. Robson shows that this can be done by a automaton that has order

$\displaystyle \#(S,\alpha) + |\alpha|$

Here ${|\alpha|}$ is the length of the string ${\alpha}$.

This is a simple, clever, and quite useful observation. Clever indeed. The obvious automaton that can do this would seem to require a cartesian product of two machines. This would imply that it would require

$\displaystyle \#(S,\alpha) \times |\alpha|$

number of states: Note the times operator ${\times}$ rather than addition. Thus Robson’s trick is a huge improvement.

Here is how he does this.

## His Trick

Robson’s uses a clever trick in his proof of the main lemma. Let’s work through an example with the string ${100}$. The goal is to see if there is a copy of this string starting at a position that is a multiple of ${3}$.

The machine starts in state ${(0,Y)}$ and tries to find the correct string ${100}$ as input. If it does, then it reaches the accepting state ${(3,Y)}$. If while doing this it gets a wrong input, then it switches to states that have stopped looking for the input ${100}$. After seeing three inputs the machine reaches ${(3,N)}$ and then moves back to the start state.

 [ The automaton ]

## The Lemmas

We will now outline the proof in some detail.

### Hashing

The first lemma is a simple fact about hashing.

Lemma 2 Suppose ${1 \le r \le m}$ and

$\displaystyle 1 \le k_{1} < \cdots < k_{m} \le n.$

Then all but ${{O}(m \log n)}$ primes satisfy

$\displaystyle k_{i} \equiv k_{r} \bmod p \text{ if and only if } i =r.$

Proof: Consider the quantity ${|k_{r} - k_{i}|}$ for ${i}$ not equal to ${r}$. Call a prime bad if it divides this quantity. This quantity can be divisible by at most ${\log n}$ primes. So there are at most ${{O}(m\log n)}$ bad primes in total. $\Box$

### Strings

We need some definitions about strings. Let ${| \alpha |}$ be the length of the string ${\alpha}$. Also let ${\#(S,\alpha)}$ be the number of occurrences of ${\alpha}$ in ${S}$.

A string ${\alpha}$ has the period ${p}$ provided

$\displaystyle \alpha_{i} = \alpha_{i+p},$

for all ${i}$ so that ${i+p}$ is defined. A string ${\alpha}$ is periodic provided it has a period ${p>0}$ that is less than half its length. Note, the shorter the period the more the string is really “periodic”: for example, the string

$\displaystyle 10101010101010$

is more “periodic” than

$\displaystyle 10000001000000.$

Lemma 3 For any string ${u}$ either ${u0}$ or ${u1}$ is not periodic.

Proof: Suppose that ${\beta=u\sigma}$ is periodic with period ${p}$ where ${\sigma}$ is a single character. Let the length of ${\beta}$ equal ${\ell}$. So by definition, ${1 \le p \le \ell/2}$. Then

$\displaystyle \beta_{i} = \beta_{i+p},$

for ${1 \le i \le \ell-p}$. So it follows that

$\displaystyle \beta_{\ell-p} = \beta_{\ell} = \sigma.$

This shows that ${u1}$ and ${u0}$ cannot both be periodic, since

$\displaystyle \ell/2 \le \ell-p \le \ell - 1.$

$\Box$

Lemma 4 Suppose that ${\alpha}$ is not a periodic string. Then the number of copies of ${\alpha}$ in a string ${S}$ is upper bounded by ${{O}(M)}$ where

$\displaystyle M = \frac{|S|}{|\alpha|}.$

Proof: The claim follows once we prove that no two copies of ${\alpha}$ in ${S}$ can overlap more than ${\ell/2}$ where ${\ell}$ is the length of ${\alpha}$. This will immediately imply the lemma.

If ${\alpha}$ has two copies in ${S}$ that overlap then clearly

$\displaystyle \alpha_{i} = \alpha_{i+d},$

for some ${d>0}$ and all ${i}$ in the range ${1,\dots,\ell-d}$. This says that ${\alpha}$ has the period ${\ell-d}$. Since ${\alpha}$ is not periodic it follows that ${d > \ell/2}$. This implies that the overlap of the two copies of ${\alpha}$ are at most length ${\ell/2}$. Thus we have shown that they cannot overlap too much. $\Box$

### Main Lemma

Say an automaton finds the ${k^{th}}$ occurrence of ${\alpha}$ in ${S}$ provided it enters a special state after scanning the last bit of this occurrence.

Lemma 5 Let ${S}$ be a string of length ${n}$ and let ${\alpha}$ be a non-periodic string.Then, there is an automaton with at most ${\widetilde{O}(M)}$ states that can find the ${k^{th}}$ occurrence of ${\alpha}$ in ${S}$ where

$\displaystyle M = \#(S,\alpha) + |\alpha|.$

Here ${\widetilde{O}(M)}$ allows factors that are fixed powers of ${\log n}$. This lemma is the main insight of Robson and will be proved later.

## The Main Theorem

The following is a slightly weaker version of Robson’s theorem. I am still confused a bit about his stronger theorem, to be honest.

Theorem 6 (Robson’s Theorem) Suppose that ${S}$ and ${T}$ are distinct strings of length ${n}$. Then there is an automaton with at most ${\widetilde{O}(\sqrt {n})}$ states that accepts ${S}$ and rejects ${T}$.

Proof: Since ${S}$ and ${T}$ are distinct we can assume that ${S}$ starts with the prefix ${u1}$ and ${T}$ starts with the prefix ${u0}$ for some string ${u}$. If the length of ${u}$ is less than order ${\widetilde{O}(\sqrt {n})}$ the theorem is trivial. Just construct an automaton that accepts ${u1}$ and rejects ${u0}$.

So we can assume that ${u = w\alpha}$ for some strings ${w}$ and ${\alpha}$ where the latter is order ${\widetilde{O}(\sqrt {n})}$ in length. By lemma we can assume that ${\alpha1}$ is not periodic. So by lemma we get that

$\displaystyle \#(S,\alpha1) = \widetilde{O}(\sqrt{n}).$

Then by lemma we are done. $\Box$

## Proof of Main Lemma

Proof: Let ${S}$ have length ${n}$ and let ${\alpha}$ be a non-periodic string in ${S}$ of length ${\ell}$. Also let ${\#(S,\alpha) = m}$. By the overlap lemma it follows that ${m}$ is bounded by ${\widetilde{O}(|S|/|\alpha|)}$.

Let ${\alpha}$ occur at locations

$\displaystyle 1 \le k_{1} < \cdots < k_{m} \le n.$

Suppose that we are to construct a machine that finds the ${r^{th}}$ copy of ${\alpha}$. By the hashing lemma there is a prime ${p=\widetilde{O}(m)}$ so that

$\displaystyle k_{i} \equiv k_{r} \bmod p$

if and only if ${i=r}$. Note we can also assume that ${p > \ell}$.

Let’s argue the special case where ${k_{r}}$ is ${0}$ modulo ${p}$. If it is congruent to another value the same argument can be used. This follows by having the machine initially skip a fixed amount of the input and then do the same as in the congruent to ${0}$ case.

The automaton has states ${(i,Y)}$ and ${(i,N)}$ for ${i=0,\dots,p}$. The machine starts in state ${(0,Y)}$ and tries to get to the accepting state ${(\ell,Y)}$. The transitions include:

$\displaystyle (0,Y) \underset{\alpha_{1}}{\rightarrow} (1,Y) \underset{\alpha_{2}}{\rightarrow} (2,Y) \underset{\alpha_{3}}{\rightarrow} \cdots \underset{\alpha_{\ell}}{\rightarrow} (\ell,Y).$

This means that the machine keeps checking the input to see if it is scanning a copy of ${\alpha}$. If it gets all the way to the accepting state ${(\ell,Y)}$, then it stops.

Further transitions are:

$\displaystyle (1,N) \rightarrow (2,N) \rightarrow \cdots \rightarrow (p,N),$

and

$\displaystyle (0,Y) \underset{\neg \alpha_{1}}{\rightarrow} (1,N), (1,Y) \underset{\neg \alpha_{2}}{\rightarrow} (2,N), \dots, (\ell-1,Y) \underset{\neg \alpha_{\ell}}{\rightarrow} (\ell,N).$

The second group means that if a wrong input happens, then ${(i,Y)}$ moves to ${(i+1,N)}$. Finally, the state ${(p,N)}$ resets and starts the search again by going to the start state ${(0,Y)}$ with an epsilon move.

Clearly this has the required number of states and it operates correctly. $\Box$

## Open Problems

The open problem is: Can the SWP be solved with a better bound? The lower bound is still order ${\log n}$. So the gap is exponential.

1. September 17, 2019 12:31 pm

1. After separating u0 and u1, there appears to be one more step.
Let the states reached after reading u0,u1 be $s \neq t$, and consider the first time that M reaches t after reading u0; let x be the string remaining in S at this point, and let y be the string remaining in T after u1. Then x and y have different lengths and they are now separated by composing with a DFA of size $O(\log n)$.

2. Here is a tight example for the prefix separation lemma.Let $\gamma(x)$ denote the size of the smallest DFA which separates $x$ from its prefixes, that is, on reading $x$, reaches a state not visited before.
Then the two lemmas together state that $min(\gamma(x0),\gamma(x1) =O(\sqrt{n})$.
This is tight for $x=(y1)^{n-1}(01)^{n-1}$, where $y=(01)^{n-1}$.

Regards,
Aravind.

September 17, 2019 12:57 pm

Dear N.R. Aravind:

Thanks so much. Will ponder this, but sounds good. Again thanks. I also added “latex” to try and get it typeset. Hope that is okay with you.

Let me know if not.

Best

Dick

September 17, 2019 1:00 pm

A really nice trick!

I think there might be a little bug in the proof. It can happen that alpha1 appears in T, in a different position than in S, but such that it collides with the original position modulo p. In that case, the automaton would accept both S and T. There is an easy fix: it suffices to add all appearances of alpha1 in T to the list of k_i’s we don’t want to collide with.

Btw, I don’t really get the last step in the proof of Lemma 3. Could someone please explain?

September 17, 2019 2:53 pm

Thanks. The fix certainly works. The last step of the not overlap too much lemma is just this: the argument shows that there are two values for the same symbol. This is the key.

Best

dick

September 17, 2019 4:03 pm

To be honest, that didn’t help me, but I looked at the original paper (ScienceDirect has a reasonably good looking copy), and now I see how to prove the lemma. Still, I think the last chain of inequalities in the blog version of the proof has some typo. I cannot figure out what was the intended meaning, but it is not true that l-p \leq l/2, it’s quite the opposite.

3. September 24, 2019 3:22 am

Robson also had a paper in 1999 about separating word with CFG’s (with Currie, Petersen and Shallit)

4. October 8, 2019 2:36 am

We got an O(sqrt(n)*log(n)) upper bound using a different technique that may or may not be known. I’m very excited about it and hope to share more details something this week. Even though the bound is worse, I think the approach is neat. 🙂

5. October 10, 2019 12:21 am

Hi there! I’m very excited that we found a potentially different approach that leads to a $O(\sqrt{n} \cdot log(n))$ upper bound. It’s possible that it could be related to one of Robson’s works. Or, maybe there is someway to combine the approaches to get a better upper bound. Any thoughts are greatly appreciated.

Here is the approach:

————–

PART 1: Prime numbers

Lemma 1: Let positive integers k_1, k_2, and n such that k_1 <= n and k_2 <= n be given. There exists a constant c_1 (independent of n) such that if k_1 != k_2, then there exists a prime number $p \leq c_1 \cdot log(n)$ such that k_1 != k_2 mod p.

This is stated in: https://cs.uwaterloo.ca/~shallit/Talks/hawaii2.pdf

As far as I can tell, this follows by estimates on the Chebyshev function combined with the Chinese remainder theorem.

————–

PART 2: Periodic sums of binary strings

Let a binary string x of length n be given. Let integers r and d such that $0 <= r < d <= n$ be given.

Definition: The (r, d)-periodic sum of x is the sum of all bits of the form $x_{i \cdot d + r}$ where i is an integer.

A binary string can be characterized by its periodic sums. In particular, observe that for all binary strings x and y of the same length, we have that x = y if and only if x and y have the same periodic sums.

Further, we prove that it's sufficient to look at periodic sums with small values of d to check if x = y.

Lemma 2: Let a natural number n be given. There exists a constant c_2 (independent of n) such that for all binary strings x and y of length n, we have that x = y if and only if:

For all (r, d) such that $0 <= r < d <= c_2 \cdot \sqrt{n}$, the (r, d)-periodic sum of x equals the (r, d)-periodic sum of y.

To see this, first turn each periodic sum of x into an equation. There are n unknowns (one unknown for each bit of x) and the sum is known. We can associate with each equation a binary vector where the coordinates from the vector are the coefficients of the unknowns from the equation.

The hope is that if we have enough equations, then the system has a unique solution which is just the binary string x. This happens when the associated vectors span $\mathbb{R}^n$.

Thanks to an awesome answer on stack exchange, we known that there exists some c_2 such that for (r, d) where $0 <= r < d <= c_2 \cdot \sqrt{n}$, the associated vectors are sufficient to span $\mathbb{R}^n$. ( See here: https://mathoverflow.net/questions/343355/do-the-following-binary-vectors-span-mathbbrn ).

As a result, x is characterized by the periodic sums where $0 <= r < d <= c_2 \cdot \sqrt{n}$.

————–

PART 3: The upper bound for separating words

Theorem: Let binary strings x and y of length n be given. If x != y, then there exists a DFA with $O(\sqrt{n} \cdot log(n))$ states that accepts x and rejects y.

By Lemma 2, there is some (r, d) such that the (r, d)-periodic sum of x != the (r, d)-periodic sum of y and $0 <= r < d <= c_2 \cdot \sqrt{n}$.

Further, since the periodic sums are bounded by n, by Lemma 1, there is some prime p such that the (r, d)-periodic sum of x != the (r, d)-periodic sum of y mod p and $p \leq c_1 \cdot log(n)$.

Therefore, we can build a DFA with $d \cdot p \leq c_1 \cdot c_2 \cdot \sqrt{n} \cdot log(n)$ states that computes the (r, d)-periodic sum of the input string mod p to differentiate x and y.

————–

I hope to write this up more cleanly in latex and am open to any ideas, suggestions, or corrections. Thank you!

• October 11, 2019 12:46 am

Hi, Mike—odd that these were held up in the mod queue. Let me know whether this one supersedes the previous one. Dick and I may be able to talk about it over the weekend.

• October 11, 2019 1:52 am

Thank you very much! Sounds great.

Yes, if you can delete my previous post, that would be great. It had a few typos in it. 🙂

• April 27, 2020 4:41 am

I just found out that this method is already known! Please see this paper: “Separating Words by Occurrences of Subwords” by Vyalyi and Gimadeev.

6. October 10, 2019 12:22 am

Sorry, this version had some typos. I posted an update with the corrections.

7. April 27, 2020 11:57 am

If possible, I would suggest deleting this duplicated post. Thank you!

July 26, 2020 4:47 pm

Upper bound improved to n^{1/3}: https://arxiv.org/abs/2007.12097

• July 26, 2020 6:46 pm

This is great! I am excited to take a look. Thanks for sharing. 🙂

July 28, 2020 7:10 pm

A tiny mistake in Lemma 3: since p is at most l/2, we have that l – p is at *least* l/2, not at *most* l/2. Fortunately it’s easy to fix: since p is at least 1, we have that l – p is at most l – 1, so we are still indexing into u.

• July 28, 2020 10:37 pm

Thank you very much. Most likely it was a change of indexing between drafts. I have also make a curly ell from that point on: lemmas 3 and 4 and the main proof. I checked once with my eyes but can’t say I’m sure I caught them all. One idea for something like MathDeck would be to enable a search for mathematical uses of the plain ${l}$ as opposed to occurrences within words, or even abbreviations such as “l.” for liters.

A regex-aware search for (in Vim syntax) \ will miss most of the occurrences that you don’t want. It’ll still give you false positives, like on “l.”, but it should be easier to deal with manually.
My Vim regex got eaten, probably because it uses less-than-greater-than signs. In Perl syntax, it’s \bl\b.