Part 1 of a two-part series

Daniel Winton is a graduate student in mathematics at Buffalo. He did an independent study with me last semester on descriptive complexity.

Today we begin a two-part series written by Daniel on a foundational result in this area involving first-order logic and star-free languages.

Dick and I have discussed how GLL might act during the terrible coronavirus developments. We could collect quantitative and theoretical insights to help understand what is happening and possibly contribute to background for attempts to deal with it. Our March 12 post was this kind, and was on the wavelength of an actual advance in group testing announced in Israel two days ago, as noted in a comment. We could try to be even more active than that. We could focus on entertainment and diversion. Our annual St. Patrick’s Day post mentioned the ongoing Candidates Tournament in chess, with links to follow it live 7am–noon EDT. Game analysis may be found here and elsewhere.

Here we’re doing what we most often do: present a hopefully sprightly angle on a technical subject. We offer this in solidarity with the many professors and teachers who are beginning to teach online to many students. The subject of regular languages is the start of many theory courses and sometimes taught in high schools. Accordingly, Dick and I decided to prefix a section introducing regular languages as a teaching topic and motivating the use of logic, before entering the main body written by Daniel.

## Regular Languages and Logic

Regular languages are building blocks used throughout computer science. They can be defined in many ways. Two major types of descriptions are:

1. Regular expressions. For example, the regular expression

$\displaystyle b^{*}(a b^{*} a b^{*})^{*}$

describes the set of strings over the alphabet ${\{a,b\}}$ that have an even number of ${a}$‘s.

2. Finite automata, for example:

The regular languages defined by (1) and (2) are the same. All regular expressions have corresponding finite automata. This equivalence makes a powerful statement about the concept of regular languages. The more and more diverse definitions we have, the better we understand a concept. This leads us to consider other possible definitions.

A natural kind of definition involves logic. Studying complexity classes through logic and model theory has proven fruitful, creating descriptive complexity as an area. Good news: there are logic definitions equivalent to the regular languages. Bad news: they require going up to second-order logic. We would like to stay with first-order logic. So we ask:

What kind of languages can we define using only first-order logic (FOL) and simple predicates like “the ${i}$th bit of ${x}$ is ${a}$” and “place ${i}$ comes before place ${j}$“?

The answer is the star-free languages, which form a subclass ${\mathsf{SF}}$ of the regular languages. They were made famous in the book Counter-Free Automata by Robert McNaughton and Seymour Papert, where the equivalence to FOL was proved. A portentous fact is that these automata cannot solve simple tasks involving modular counting. Nor can perceptrons—the title subject of a book at the same time by Papert with Marvin Minsky, which we discussed in relation to both AI and circuit complexity. This post will introduce ${\mathsf{SF}}$ and FOL, prove the easier direction of the characterization, and give two lemmas for next time. The second post will present a new way to visualize the other direction. The rest of this post is by Daniel Winton.

## No Stars Upon Thars

A regular language over an alphabet ${\Sigma}$ is one with an expression that can be obtained by applying the union, concatenation, and Kleene star operations a finite number of times on the empty set and singleton subsets of ${\Sigma}$. Star-free languages are defined similarly but give up the use of the Kleene star ${{}^*}$ operation, while adding complementation (${\sim}$) as a basic operation. The star-free languages are a subset of the regular languages, because regular languages are closed under complementation.

Complementation often helps find star-free expressions that we ordinarily write using stars. For instance, if the alphabet ${\Sigma}$ is ${\{a,b\}}$, then ${(\sim \emptyset)}$ gives ${(a + b)^*}$. The following lemma gives a family of regular expressions that use Kleene star but are really star-free.

Lemma 1 The language given by taking the Kleene star operation on a union of singleton elements in an alphabet ${\Sigma}$ is star-free.

Proof. For ${\Sigma=\{a_1, ..., a_n, b_1, ..., b_m\}}$ we have that ${(b_1+ ... + b_m)^*}$ can be given by

$\displaystyle \sim((\sim\emptyset) (a_1+ ... + a_n) (\sim\emptyset)). \quad$

$\Box$

For example, the language ${b^*ab^*ab^*}$ is star-free because ${b^*}$ is by this lemma. This idea extends also to forbidden substrings—e.g., the set of strings with no ${aa}$ is ${\;\sim((\sim\emptyset)aa(\sim\emptyset))}$.

The language ${(ab + ba + bb)^*}$ is not the same, however, and it is not star-free. Intuitively this is because it involves modular counting: an ${aa}$ in an odd position is OK but not even. The parity language ${b^*(ab^*ab^*)^*}$ from the introduction is another example. So ${\mathsf{SF}}$ is a proper subset of the regular languages. What kind of subset? This is where having a third description via logic is really useful.

## The First Order of Business

In addition to the familiar Boolean operations ${\wedge,\vee,\neg,\rightarrow,\leftrightarrow}$ and truth values, first-order logic provides variables that range over elements of a structure and quantifiers on those variables. Since we will be concerned with Boolean strings ${x}$, the variables will range over places in the string ${0,\dots,n-1}$, where ${n = |x|}$. A logic also specifies a set of predicates that relate variables and interact with the structure. For strings we have:

• ${X_c(i)}$, meaning that the symbol indexed ${i}$ in ${x}$ is the character ${c}$. The logic gives one such predicate for each ${c \in \Sigma}$.

• ${i < j}$, with the obvious meaning for natural numbers ${i}$ and ${j}$.

We can take the ${X_c}$ for granted since we are talking about strings, but we need to say that predicates like ${i + j = k}$ and ${i\cdot j = k}$ are excluded, so we call this ${\mathsf{FO}[<]}$. We could define equality by ${i = j \equiv \lnot(i < j \lor j < i)}$ but we regard equality as inherent.

We can use ${<}$ to define the successor relation ${S(i,j)}$, which denotes that position ${j}$ comes immediately after position ${i}$:

$\displaystyle S(i,j) \equiv i

Note the use of quantifiers. We can use quantifiers to say things about the string ${x}$ too. For instance, the language ${L}$ of strings having no ${aa}$ substrings is defined by the logical sentence

$\displaystyle \sigma_L = (\forall i,j):S(i,j) \rightarrow \neg (X_a(i) \land X_a(j)).$

It is implicit here that ${i}$ and ${j}$ are always in-bounds. If ${n}$ were a legal constant with ${X_c(n)}$ always false then a string like ${bbba}$ (which belongs to ${L}$) would falsify ${\sigma_L}$ with ${i = n-1}$ and ${j = n}$.

How big is ${\mathsf{FO}[<]}$? We'll see it is no more and no less than ${\mathsf{SF}}$.

## From SF to FO

To prove that any language ${L}$ in ${\mathsf{SF}}$ has a definition in ${\mathsf{FO}}$, we will not only give a sentence ${\sigma_L}$ but also a formula ${\phi_L(i,j)}$. We will define this formula to indicate that for a given string ${x}$, the portion of the string between indices ${i}$ and ${j-1}$ is in ${L}$. Then for the correct choices of ${i}$ and ${j}$, ${\phi_L(i,j)}$ gives ${L}$. We define ${\phi}$ to test middle portions of strings, because it handles lengths better for the induction in the concatenation case.

Theorem 2

Every language in ${\mathsf{SF}}$ is definable in ${\mathsf{FO}[<]}$.

The proof is a nice example where “building up” to prove something more general—involving two extra variables—makes induction go smoothly.

Proof: Let ${L}$ be a star-free regular language and ${L'=\{\langle x, i ,j \rangle:}$ the portion of the string ${x}$ between indices ${i}$ (inclusive) and ${j}$ (exclusive) is contained in ${L\}}$. Let ${\phi_L(i,j)\in \sf{FO}[<]}$ be the formula denoting that for a given string ${x, x[i,j)\in L}$, that is, ${\phi_L(i,j)}$ is the representation of ${L'}$ in ${\mathsf{FO}[<]}$. We will show that such a ${\phi_L(i,j)}$ exists via induction on ${L}$. This is sufficient, as for a given symbol ${n}$ that always represents the length of a string ${x}$, ${\phi_L(0,n)}$ is the formula in ${\mathsf{FO}[<]}$ representing the language ${L}$.

First, we must show that ${\phi_L}$ is in ${\mathsf{FO}[<]}$ for ${L}$ one of the basis languages, ${\emptyset, \epsilon}$, and ${c}$, for some ${c}$ in ${\Sigma}$. We have that:

$\displaystyle \begin{array}{rcl} \phi_\emptyset(i,j) &=& \texttt{false};\\ \phi_\epsilon(i,j) &=& i=j;\\ \phi_c(i,j) &=& (X_c(i)\land S(i,j)). \end{array}$

Now, we must show that given star-free languages ${L_1}$ and ${L_2}$ with FO${[<]}$ translations ${\psi_1(i,j)}$ and ${\psi_2(i,j)}$ respectively, we have ${\phi_{L_1\cup L_2}(i,j)}$, ${\phi_{L_1\cdot L_2}(i,j)}$, and ${\phi_{\sim L_1}(i,j)}$ are in ${\sf{FO}[<]}$. Then:

$\displaystyle \begin{array}{rcl} \phi_{L_1\cup L_2}(i,j)&=&\psi_1(i,j) \lor \psi_2(i,j);\\ \\ \phi_{L_1\cdot L_2}(i,j)&=&\exists k(i\leq k \land k \leq j) (\psi_1(i,k) \land \psi_2(k,j));\\ \\ \phi_{\sim L_1}(i,j)&=&\lnot \psi_{1}(i,j). \end{array}$

Since star-free languages can be obtained by applying the union, concatenation, and complementation operations a finite number of times on ${\emptyset}$ and singleton subsets of ${\Sigma}$, this completes the proof of ${\mathsf{SF} \subseteq \mathsf{FO}[<]}$. $\Box$

## Two Lemmas For Next Time

Prefatory to showing (in the next post) that ${\mathsf{FO}[<]}$ is contained in ${\mathsf{SF}}$, we prove properties about substrings on the ends of star-free languages, rather than in the middle as with the trick in the proof of Theorem 2.

Let ${L}$ be a language over an alphabet ${\Sigma}$ and ${w}$ be a word in ${\Sigma^n}$ for some ${n\in \mathbb{N}}$. Define ${L/w}$, the right quotient of ${L}$ by ${w}$, by ${L/w =\{v: vw \in L\}}$ and ${L\backslash w}$, the left quotient of ${L}$ by ${w}$, by ${L\backslash w=\{v:wv \in L\}}$. First we handle right quotients:

Lemma 3 If ${L}$ is star-free, then ${L/w}$ is star-free.

Proof: For any word ${w}$ over ${\Sigma}$ we define a function ${f_w(\alpha)}$ by ${f_w(\alpha)=\alpha'}$ where ${L(\alpha')=L(\alpha)/w}$. If ${w = \epsilon}$, then ${f_w(\alpha)=\alpha}$, and so the statement of the lemma trivially holds. So let ${w = vc}$ for some string ${v}$ and character ${c}$. Note that ${L/(vc) = (L/c)/v}$. Thus ${f_{vc}(\alpha) = f_v(f_c(\alpha))}$ for all ${\alpha}$. Hence it suffices to define ${f_c(\alpha)}$ for any single character ${c}$ by recursion on ${\alpha}$. We have:

$\displaystyle \begin{array}{rcl} f_c(\emptyset)&=&\emptyset;\\ f_c(\epsilon)&=&\emptyset;\\ f_c(c)&=&\epsilon;\\ f_c(a)&=&\emptyset, \end{array}$

and recursively:

$\displaystyle \begin{array}{rcl} f_c(\alpha \cup\beta)&=&f_c(\alpha) \cup f_c(\beta);\\\\ f_c(\alpha\cdot\beta)&=&\begin{cases} \alpha\cdot f_c(\beta) & \epsilon\not\in L(\beta), \\ \alpha\cdot f_c(\beta) \cup f_c(\alpha) & otherwise\text{;} \end{cases}\\\\ f_c(\sim\alpha)&=&\sim f_c(\alpha). \end{array}$

In general, if ${w = c_1 c_2 \cdots c_{n-1} c_n}$, then

$\displaystyle L/w = f_{c_1}(f_{c_2}(\dots f_{c_{n-1}}(f_{c_n}(\alpha))\dots)).$

$\Box$

Lemma 4 If ${L}$ is star-free, then ${L\backslash w}$ is star-free.

The proof of Lemma 4 is similar to the proof of Lemma 3. The main differences lie in the concatenation subcase for the ${w=c}$ case and the order of quotienting when using this operation repeatedly.

Proof: For any word ${w}$ over ${\Sigma}$ we define a function ${f_w(\alpha)}$ by ${f_w(\alpha)=\alpha'}$ where ${L(\alpha')=L(\alpha)\backslash w}$. If ${w = \epsilon}$, then ${f_w(\alpha)=\alpha}$, and so the statement of the lemma trivially holds. So let ${w = vc}$ for some string ${v}$ and character ${c}$. Note that ${L\backslash(vc) = (L\backslash v)\backslash c}$. Thus ${f_{vc}(\alpha) = f_c(f_v(\alpha))}$ for all ${\alpha}$. Hence it suffices to define ${f_c(\alpha)}$ for any single character ${c}$ by recursion on ${\alpha}$. We have:

$\displaystyle \begin{array}{rcl} f_c(\emptyset)&=&\emptyset;\\ f_c(\epsilon)&=&\emptyset;\\ f_c(c)&=&\epsilon;\\ f_c(a)&=&\emptyset.\\ \end{array}$

and recursively:

$\displaystyle \begin{array}{rcl} f_c(\alpha\cup\beta)&=&f_c(\alpha)\cup f_c(\beta);\\\\ f_c(\alpha\cdot\beta)&=&\begin{cases} f_c(\alpha)\cdot\beta & \epsilon\not\in L(\alpha), \\ f_c(\alpha)\cdot\beta \cup f_c\beta) & otherwise\text{;} \end{cases}\\\\ f_c(\sim \alpha)&=&\sim f_c(\alpha). \end{array}$

In general, if ${w = c_1 c_2 \cdots c_{n-1} c_n}$, then

$\displaystyle L\backslash w = f_{c_n}(f_{c_{n-1}}(\dots f_{c_{2}}(f_{c_1}(\alpha))\dots)).$

$\Box$

## Open Problems

There are richer systems such as ${\mathsf{FO}(+)}$ and ${\mathsf{FO}(+,\cdot)}$. Note that ${\mathsf{FO}(+)}$ allows defining the ${<}$ relation via ${i < j \equiv (\exists k) \lnot(k= 0) \land i + k = j}$. We can define non-regular languages such as ${{a^n b^n}}$ in ${\mathsf{FO}(+)}$. The class ${\mathsf{FO}(+,\times)}$ famously equals uniform ${\mathsf{AC^0}}$, see chapter 5 of Neil Immerman's book Descriptive Complexity. Thus we hope our new style for proving ${\mathsf{FO}[<] \subseteq \mathsf{SF}}$ (to come in the next part) will build a nice foundation for visualizing these higher results.

March 21, 2020 5:27 pm

SF = FO[<] is known as Theorem of Schützenberger. What is not very well-known is that
Schützenberger gave another characterization of SF in terms of regular expressions:
A language is star-free iff it expressible by regular expressions where the use of the star-
operation is only allowed over (prefix?)codes of bounded synchronization delay.

• March 21, 2020 9:30 pm

Thanks. More detail is available in this 2007 survey, and we followed its ascriptions.

March 22, 2020 2:45 pm

SInce FO[<] is in my mind an alias of SF, I made this mistake. Sch. related automata and syntactic monoids to expressions, while the relations to logic are off course due to M.+P. Sorry for that.

NB: The reference to Schützenberger's other characterization of SF was given to me by one of the authors of the survey.

March 22, 2020 4:07 pm

Maybe I misunderstood something. I think the proposal for defining equality in FOL[<] contains a typo? Doesn't seem to be well-formed. If I'm mistaken, please clarify. thanks!

March 22, 2020 4:10 pm

Given only ‘<', you'd want to say neither i < j nor j < i.

• March 22, 2020 4:21 pm

Thanks! Something went strange in the translation of “lor” and “land” and only there. I think what happened is that the text “neither i j” made the middle part get parsed as an HTML angle-bracket item. Hard to spot… Added: in fact, it happened again or “neither i < j nor i > j, and probably to you in your first comment… So “neither I < j nor j < i" it has to be…