A visual proof with no abstract-algebra overhead

 Composite crop of src1, src2

Dominique Perrin and Jean-Éric Pin are French mathematicians who have done significant work in automata theory. Their 1986 paper “First-Order Logic and Star-Free Sets” gave a new proof that first-order logic plus the ${<}$ relation characterizes star-free regular sets.

Today we present their proof in a new visual way, using “stacked words” rather than their “marked words.” We also sidestep algebra by appealing to the familiar theorem that every regular language has a unique minimal deterministic finite automaton (DFA).

The first post in this series defined the classes ${\mathsf{SF}}$ for star-free regular and ${\mathsf{FO}[<]}$ for first-order logic where variables ${u,v,w,\dots}$ range over the indices ${0,\dots,n-1}$ of a string ${x}$ of length ${n}$. For any character ${c}$ in the alphabet ${\Sigma}$ there is the predicate ${X_c(u)}$ saying that the character in position ${u}$ of ${x}$ equals ${c}$. The first part proved that for any language ${A \in \mathsf{SF}}$ there is a sentence ${\psi_A}$ using only predicates ${X_c}$ and the relation ${u < v}$ besides the usual quantifiers and Boolean operations of first-order logic. such that ${\psi_A}$ defined ${A}$. That is, it proved ${\mathsf{SF}\subseteq\mathsf{FO}[<]}$.

A key trick was to focus not on ${\psi_A}$ but on a formula ${\phi_A(i,j)}$ expressing that the part of ${x}$ from position ${i}$ (inclusive) to ${j}$ (exclusive) belongs to ${A}$. To prove the converse direction, ${\mathsf{FO}[<]\subseteq\mathsf{SF}}$, we focus not on middles of strings but on prefixes and suffixes. The previous post proved two lemmas showing that if ${L}$ is star-free then for any string ${y}$, so are its prefix and suffix languages:

$\displaystyle \begin{array}{rcl} L\backslash y &=& \{z: yz \in L\}\\ L/y &=& \{x: xy \in L\}. \end{array}$

We will piece together star-free expressions for a multitude of ${L/y}$ and ${L \backslash y}$ sets that come from analyzing minimum DFAs ${M}$ at each induction step, until we have built a big star-free expression ${\alpha}$ for ${A}$. This sounds complex—and is complex—but all the steps are elementary. At the end we’ll try to see how big ${\alpha}$ gets. Again the rest of this post up to end notes is written by Daniel Winton.

## Stacked Words and FO Formulas

We implement the “marked words” idea of Perrin and Pin by using extra dimensions rather than marks. Consider any formula ${\phi(u_1, \dots, u_j)}$ in FO[${<}$] where ${u_1,\dots,u_j}$ are the free variables. Our "stacked words" have ${j}$ extra rows underneath the top level holding the string ${x}$. Each row ${i}$ has the same length ${n}$ as ${x}$ and contains a single ${1}$ and ${(n-1)}$ ${0}$s. The column ${k}$ in which the ${1}$ lies gives the value of the variable ${u_i}$. Thus the alphabet of our stacked words is ${\Sigma^{(j)} = \Sigma\times\{0,1\}^j}$. Here is an example:

Each column is really a single character ${C \in \Sigma^{(j)}}$, but we picture its entries as characters in ${\Sigma}$ or ${\{0,1\}}$. For any row ${i}$, define ${Y_{i,0}(\cdot)}$ to be the disjunction of ${X_C(\cdot)}$ over all ${C}$ that have a ${0}$ in entry ${i}$, and define ${Y_{i,1}}$ similarly. Then the following sentence expresses the “format condition” that row ${i}$ has exactly one ${1}$:

$\displaystyle (\exists k)[Y_{i,1}(k) \land (\forall \ell) [\ell \neq k \rightarrow Y_{i,0}(k)]].$

This readily transforms into a star-free expression for the language of stacked words with exactly one ${1}$ in row ${i}$:

$\displaystyle \alpha_i = C_{i,0}^* C_{i,1} C_{i,0}^*,$

Here ${C_{i,0}}$ is the union of all stacked characters that have a ${0}$ in entry ${i}$, and ${C_{i,1}}$ is similarly defined. This is where the examples in the previous post about finite unions inside stars come in handy. The upshot is that

$\displaystyle \alpha_0 = \bigcap_{i=1}^j \alpha_i$

is a ${\mathsf{SF}}$ expression that enforces the “format condition” over all rows. We will use this at the crux of the proof; whether ${\alpha_0}$ is really needed is a question at the end. If we take the format for granted, then the main advantage of our stacked words will be visualizing how to decompose automata according to when the ${1}$ in a critical row is read.

## From FO To SF

The proof manipulates formulas ${\phi(u_1,\dots,u_j)}$ having the ${j}$ free variables shown. The corresponding language ${L_\phi}$ is the set of stacked words that make ${\phi(k_1,\dots,k_j)}$ true, where for each ${i \leq j}$, ${k_i}$ is the position of the lone ${1}$ in row ${i}$, which gives the value of ${u_i}$.

Theorem 1 For all formulas ${\phi}$ over ${\mathsf{FO}[<]}$, ${L_\phi\in\mathsf{SF}}$.

Without loss of generality we may suppose ${\phi}$ is in prenex form, meaning that it has all quantifiers out front. If there are ${r}$ variables of which ${j}$ are free, this means

$\displaystyle \phi = (Q_{j+1} u_{j+1})\cdots (Q_r u_r)\mu,$

where each ${Q_i}$ is ${\forall}$ or ${\exists}$ and ${\mu}$ has no quantifiers. If we picture the variables ${u_1,\dots,u_j}$ as once having had quantifiers ${Q_1,\dots,Q_j}$, then the proof—after dealing with ${\mu}$—restores them one at a time, beginning with ${Q_j}$. That is the induction. We prove the base case ${\mu}$ here and the induction case in the coming sections.

Proof: Quantifier free formulas in ${\mathsf{FO}[<]}$ are Boolean combinations of the atomic predicates ${X_c(u_i)}$ and ${u_h < u_i}$. Here ${X_c}$ is only for the characters ${c \in \Sigma}$, not the stacked characters ${C \in \Sigma^{(j)}}$. But inside ${\mu}$ (when we have ${j = r}$) we have to deal with the whole stack to represent ${X_c(u_i)}$. Putting ${\Sigma = \{c_1,\dots,c_{\ell}\}}$, the expression can be viewed schematically as:

As illustrated above, using the initial lemma in the part-1 post, this is equivalent to a star-free regular expression. The other atomic predicate, ${u_h < u_i}$, is true when we have a stacked word ${Y}$ of the form:

It is not important to have ${h < i}$; these are just the labels of the variables. Their values ${k_1,k_2}$ must obey ${k_1 < k_2}$, meaning that the ${1}$ in row ${h}$ comes in an earlier column than the ${1}$ in row ${i}$. The characters of ${x}$ in the top row are immaterial. We can capture this condition over all strings ${Y}$ over ${\Sigma^{(j)}}$ by the expression:

Again by the lemma in Part 1, this yields a star-free expression. To complete the basis, we note that if ${L_1}$ and ${L_2}$ have star-free expressions ${\alpha_1}$ and ${\alpha_2}$ corresponding to ${\phi_1, \phi_2\in \mathsf{FO}[<]}$, then ${\lnot\phi_1}$ is represented by ${\sim{\alpha_1}}$ and ${\phi_1\lor\phi_2}$ by ${\alpha_1\cup\alpha_2}$, both of which are clearly star-free.

## Strategy For the Induction Case

Assume that all formulas in ${\mathsf{FO}[<]}$ with ${\ell = r - j}$ quantifiers have star-free translations. We will show that any formula in ${\mathsf{FO}[<]}$ with ${\ell+1}$ quantifiers also has a star-free translation. Let ${\psi(u_1,... ,u_{j-1})=\exists u_j \phi(u_1,..., u_j)}$ where ${\phi}$ is a formula in ${\mathsf{FO}[<]}$ that has ${m}$ total variables, ${j}$ free variables and ${m-j=\ell}$ quantifiers. We are allowed to assume the first quantifier in ${\psi}$ is existential as ${\exists \equiv \lnot (\forall \lnot)}$ and we observed above that the star-free expressible formulas in ${\mathsf{FO}[<]}$ are closed under negation.

The language of ${\psi}$ is given by:

${L_\psi=\{z\in(\Sigma\times \{0,1\}^{j-1})^*:}$ we can add a ${{(j+1)}^{th}}$ row to each of the ${n}$ characters of ${z}$ using ${n}$ minus one ${0'}$s and one ${1}$ such that the resulting string ${z'}$ belongs to ${L_\phi\}.}$

We must show that ${L_\psi}$ is star-free. As ${\phi}$ has ${\ell}$ quantifiers, then by assumption ${L_\phi}$ is star-free regular.

Note that ${L_\phi}$ has one more row than ${L_\psi}$ because it has one more free variable. Without loss of generality we may suppose it is the bottom row.

### The Tricky Point

Since we have an existential quantifier we might think we can simply delete the last row of ${L_\phi}$. But here is a counterexample to show why we must be careful not to “lose information”:

Let ${\alpha=}$

and ${\alpha'=}$

Notice that ${\alpha'}$ equals ${\alpha}$ with its final row removed. We have ${L_\alpha=}$

and ${L_{\alpha'}=}$

Then ${L_\alpha}$ is the language of strings with at least three zeroes in a row and ${L_{\alpha'}}$ is the language of strings with at least two zeroes in a row.

We cannot add a second row consisting of zeroes and one 1 to each word ${w}$ in ${L_\alpha'}$ to force ${w}$ to be in ${L_\alpha}$. For a trivial example, consider the word ${w=}$

The fault can be interpreted as ${\alpha}$ not being recoverable uniquely from ${\alpha'}$ owing to a loss of information. We could also pin the fault in this case on the central ${\cap}$, noting that ${(\exists u)(f(u) \wedge g(u))}$ is generally not equivalent to ${(\exists u)f(u) \wedge (\exists u)g(u)}$.

Either way, this means we must be careful with how we express (the final row of) ${L_\phi}$. The idea is to give ${L_\phi}$ in segments such that the final row of each segment is all-zero or a single ${1}$, so that the act of removing the final row could be inverted. This is the crux of the proof. To handle it we employ the unique DFA for the regular expression already obtained by induction for ${L_\phi}$.

## Main Part of the Proof

By the Myhill-Nerode Theorem, there must exist a unique minimal automaton ${M_\phi}$ over the alphabet of our stacked words defining the language ${L_\phi}$ of ${\phi}$.

The state set ${Q}$ of ${M_\phi}$ can be partitioned into the set ${Q_1}$ of states before any ${1}$ is read in the final row, the set ${Q_2}$ of states reached after reading a ${1}$ in the final row, plus a dead state ${q_D}$ belonging to neither ${Q_1}$ nor ${Q_2}$. The start state belongs to ${Q_1}$ and all accepting states belong to ${Q_2}$. Since any stacked word and therefore accepting computation has exactly one ${1}$ in any row, all arcs within ${Q_1}$ and ${Q_2}$ must have a only ${0}$s in their final row, while the letters connecting states in ${Q_1}$ and ${Q_2}$ must have a ${1}$ in their final row. The minimality of ${M_\phi}$ enforces these properties. Here is a sketch, omitting ${q_D}$:

Let ${E}$ be the set of edges that cross from ${Q_1}$ to ${Q_2}$. With ${m = |E|}$ we can label their origins by states ${p_1,\dots,p_m}$ (not necessarily all distinct), and similarly label their characters by ${c_1,\dots,c_m}$ and the destination states by ${r_1,\dots,r_m}$. Then ${C = \{c_1,\dots,c_m\}}$ collects all the characters used by ${M}$ that have a ${1}$ in row ${j}$ (except those into or at the dead state ${q_D}$). We can identify ${E}$ with the set of instructions ${\{(p_e, c_e, r_e): 1 \leq e \leq m\}}$.

Finally let ${L_{p,q}}$ denote the set of strings that take ${M}$ from state ${p}$ to state ${q}$, and let ${L_{r,F} = \cup_{f \in F} L_{r,f}}$. Then we can define ${L_\phi}$ by the expression

$\displaystyle L_\phi =\bigcup_{(p_e, c_e, r_e)\in E} L_{s,p_e}\cdot c_e \cdot L_{r_e,F}. \ \ \ \ \ (1)$

What remains is to find ${\mathsf{SF}}$ expressions for ${L_{s,p_e}}$ and ${L_{r_e,F}}$ for each ${e}$. The latter is easy: Pick any string ${y_e}$ in ${L_{s,p_e}}$. Then

$\displaystyle L_{r_e,F} = \{z: y_e\cdot c_e \cdot z \in L(M)\} = L_{\phi} \backslash y_e c_e.$

The closure of ${\mathsf{SF}}$ under left quotients supplies expressions ${\beta_e}$ for ${L_{r_e,F}}$.

The case of ${L_{s,p_e}}$, however, is harder. What we would like to do is choose some (any) string ${w_e}$ that goes from ${r_e}$ to an accepting state and claim that ${L_{s,p_e} = L(M)/c_e w_e}$. The problem is that ${w_e}$ might be accepted from some state ${r_d \neq r_e}$ that has an incoming arc on the same character ${c_d = c_e}$ from some other state ${p_d}$. Then ${L_{s,p_d}}$, which is disjoint from ${L_{s,p_e}}$, is included also in ${L(M)/c_e w_e}$. There may, however, be strings ${y_d \in L_{s,p_d}}$ and ${w'_e \in L_{r_e,F}}$ such that ${y = y_d c_e w'_e}$ is not accepted by ${M}$. Then the string ${y}$ would be wrongly included if we substituted ${L(M)/c_e z_e}$ for ${L_{s,p_e}}$ (or for ${L_{s,p_e} \cup L_{s,p_d}}$) in (1).

There is also a second issue when crossing edges on the same character ${c_e}$ come in from distinct states ${p_d,p_e}$ to the same state ${r_e}$. To fix all this, we need to use sets of strings that distinguish ${p_e}$ from all the other states ${p_d}$. This requires the most particular use of the minimality of ${M}$: If ${p_d \neq p_e}$, then there is either (or both):

• a string ${z_{e,d}}$ in ${L_{p_e,F} - L_{p_d,F}}$, or

• a string ${z_{d,e}}$ in ${L_{p_d,F} - L_{p_e,F}}$.

The strings in question need not begin with a character in ${C}$—they need not cross right away. Let ${Z_e}$ stand for a choice of strings of the former kind, ${Z'_e}$ the latter kind, covering all states ${p_d \neq p_e}$. Then

$\displaystyle L_{s,p_e} = \left(\bigcap_{z \in Z_e} L_{\phi}/z\right) \cap \left(\bigcap_{z' \in Z'_e} (\alpha_0 \cap \tilde{L}_{\phi})/z'\right). \ \ \ \ \ (2)$

We could not use just the complement ${\tilde{L}_{\phi}}$ of ${L_{\phi}}$ because that would allow strings that violate the “one ${1}$” condition in the rows. Note that ${\alpha_0 \cap \tilde{L}_{\phi} = L_{\neg\phi}}$. Whether one can do the induction for ${\phi}$ and ${\neg\phi}$ in tandem without invoking ${\alpha_0}$ is a riddle we pose at the end.

Either way, the closure of ${\mathsf{SF}}$ under right quotients yields a star-free expression ${\alpha_e}$ for ${L_{s,p_e}}$. Then we just have to plug our expressions ${\alpha_e}$ and ${\beta_e}$ into (1) to get a star-free expression ${\beta_\phi}$ for ${L_\phi}$. Now we are ready for the crude final step:

Form ${\beta'}$ by knocking out the last entry in every character occurring in ${\beta_\phi}$. Then ${\beta'}$ represents ${L_{\psi}}$.

To prove this final statement, first suppose ${w'}$ is a stacked word in ${L_\psi}$. Then ${w'}$ satisfies ${(\exists u)\phi}$, so there is a value ${k}$ so that ${w'}$ satisfies ${\phi}$ with ${u}$ set equal to the value ${k}$. This means that the stacked word ${w}$ formed from ${w}$ and a last row with ${1}$ in position ${k}$ belongs to ${L_\phi}$ and hence matches ${\beta_\phi}$. Since ${w'}$ is obtained by knocking out the last row of ${w}$, ${w'}$ matches ${\beta'}$.

The converse is the part where the tricky point we noted at the outset could trip us up. Suppose ${w'}$ matches ${\beta'}$. From the form of (1) as a union of terms ${\alpha_e c_e \beta_e}$, we can trace how ${w'}$ is matched to identify a place in ${w'}$ that matches a “middle character” ${c'}$ obtained from a ${c_e}$ in a crossing edge. Putting a ${1}$ in a new row underneath this place and ${0}$s elsewhere hence makes a word ${w}$ that matches the unique regular expression ${\beta''}$ obtained by appending a ${1}$ component to every “middle character” ${c'}$ and a ${0}$ to all other characters in ${\beta'}$. Then ${\beta''}$ is equivalent to ${\beta}$, so ${w}$ satisfies ${\phi}$. Since the last row that was added to ${w}$ amounts to supplying a value ${k}$ for the variable ${u_j}$, it follows that ${w'}$ satisfies ${\phi}$ with ${u_j = k}$, hence satisfies ${\psi}$. This yields ${L(\beta') = L_{\psi}}$ and so the entire induction goes through. $\Box$

## Open Problems

First, as an end note, the end of the proof wound up different from what we envisioned until polishing this post. It originally applied the distinct-states argument to states ${r_d,r_e}$ on the far side of the crossing, but we had to backtrack on having previously thought that the “second issue” did not matter. Two other questions we pose for our readers:

• Do the definitions of ${Z_e}$ and ${Z'_e}$ need to extend over all states ${p \in Q_1}$, not just the other states ${p_d}$ on the border?

• Is ${\alpha_0}$ needed in (2), or in the proof overall?

A larger question concerns how the size of the translated expressions increases as we add more quantifiers. We discuss the blowup in terms of quantifier depth ${m}$ and length ${n=|x|}$. Let ${\psi_0}$ be a first-order formula with no quantifiers and let ${\psi_m}$ denote ${\psi_0}$ with ${m}$ quantifiers applied to it. Also let ${\alpha_m}$ be a star-free translation of ${\psi_m}$ and the length of ${\alpha_0}$, denoted by len${(\alpha_0)}$, equal ${n}$.

To obtain ${\alpha_{1}}$ from ${\alpha_0}$ we represent ${\alpha_0}$ by a union over crossing edges between ${Q_1}$ and ${Q_2}$ and the final states of ${Q}$ of the concatenation of prefix and suffix languages and a crossing character between them. We have that ${n}$ gives a decent approximation for—worst-case—length of the prefix languages, suffix languages and the product of crossing edges and final states. Using these approximations we have ${\mathit{len}(\alpha_1)\approx(2n+1)(n)\approx n^2}$. Iterating gives ${\mathit{len}(\alpha_2)\approx(2\mathit{len}(\alpha_1)+1)\mathit{len}(\alpha_1)\approx n^4}$, ${\mathit{len}(\alpha_3)\approx n^8}$, and in general, ${\mathit{len}(\alpha_m)\approx ({n^{2^m}})}$. This says that our blowup could be doubly exponential. Is there a tighter estimate?

Our final question is: how well do our “stacked words” help to visualize the proof? Are they helpful for framing proofs of equivalences between language classes and logics that are higher up?