Cropped from src1, src2

Arturs Backurs and Piotr Indyk are student and advisor. The latter, Piotr, is one of the leading algorithms and complexity theorists in the world—what an honor it must be for Arturs to work with him as his advisor.

Today Ken and I want to talk about their paper on edit distance and an aspect that we find puzzling.

The paper in question is, “Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false).” It is in this coming STOC 2015.

What we find puzzling is that the beautiful connection it makes between two old problems operates between two different levels of scaling, “${n}$” and “${2^n.}$” This messes up our intuition, at least mine.

I, Dick, have thought long and hard, over many years, about both the edit distance problem and about algorithms for satisfiability. I always felt that both algorithms should have much better than the “obvious” algorithms. However, I was much more positive about the ability for us to make a breakthrough on computing the edit distance, then to do the same for satisfiability.

The way of linking the two problems is to me quite puzzling. Quoting my favorite band, the Talking Heads:

… Now don’t you wanna get right with me?
(puzzling evidence)
I hope you get ev’rything you need
(puzzling evidence)

Puzzling evidence
Puzzling evidence
Puzzling evidence …

## The Problems

The edit distance between two strings is defined as the minimum number of insertions, deletions, or substitutions of symbols needed to transform one string into another. Thus CAT requires three substitutions to become ATE, but it can also be done by one insertion and one deletion: pop the C to make AT and then append E to make ATE. Thus the edit distance between these two words is 2. The problem of computing the edit distance occurs in so many fields of science that it is hard to figure out who invented what first. The case of strings of length ${N}$ is easily seen to be computable in time quadratic, ${O(N^{2})}$, by a dynamic programming algorithm that builds up edit distances between initial substrings.

Chak-Kuen Wong and Ashok Chandra proved this is optimal in the restricted model where one can only compare characters to each other. There are algorithms that beat quadratic by logarithmic factors—they essentially treat blocks of characters as one. But it remain open after much research whether there is an algorithm that runs in time of order ${n^{1.999998}}$, for example.

The ${\mathsf{SAT}}$ problem is the usual question of testing Boolean clauses to see if they can all be satisfied at the same time by the same assignment to the Boolean variable. ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ restricts to formulas in conjunctive normal form, and ${k\text{-}\mathsf{SAT}}$ restricts to clauses with at most ${k}$ literals per clause.

## The Connection

Backurs and Indyk prove that if there exists ${\delta > 0}$ such that edit distance can be decided in time ${O(N^{2 - \delta})}$, then there exists ${\epsilon > 0}$ such that ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ for formulas with ${n}$ variables and ${m}$ clauses can be solved in time ${m^{O(1)}2^{(1-\epsilon)n}}$. They build on a connection to SETH showed ten years ago by Ryan Williams in part of a wider-ranging paper.

The basic idea is how ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ behaves with regard to partitions of the ${n}$ variables of a formula ${\phi = C_1 \wedge\cdots \wedge C_m}$ into two size-${n/2}$ subsets, call them ${X_1}$ and ${X_2}$. Let ${A}$ be the set of assignments to ${X_1}$ and ${B}$ to ${X_2}$. For every assignment ${a \in A}$, let ${S_a}$ be the set of clauses it satisfies, and ${T_a}$ the remaining clauses which it does not satisfy. Similarly define ${S_b}$ and ${T_b}$ for ${b \in B}$. Then

$\displaystyle \phi \in \mathsf{SAT} \iff (\exists a,b)[|S_a \cup S_b| = m] \iff (\exists a,b)[T_a \cap T_b = \emptyset].$

Now let us identify ${a}$ with ${T_a}$ regarded as an ${m}$-bit vector and similarly ${b}$ with ${T_b}$, also re-labeling ${A}$ to be the set of ${N = 2^{n/2}}$-many ${T_a}$‘s, ${B}$ for ${T_b}$‘s. Then as Williams observed, ${\phi}$ is satisfiable if and only if we have a yes-instance of the following problem:

Orthogonal Vectors Problem (${\mathsf{OVP}}$): Given two sets ${A,B}$ of ${N}$-many length-${m}$ binary vectors, are there ${a \in A}$ and ${b \in B}$ such that ${\sum_{i=1}^m a_i b_i = 0}$?

It is obvious to solve ${\mathsf{OVP}}$ in time ${O(N^2 m)}$ by trying all pairs. The nub is what happens if we achieve anything slightly better in the exponent than quadratic, say time ${N^{2-\epsilon} m^{O(1)}}$. Then with ${\epsilon' = \epsilon/2}$ we get time

$\displaystyle \left(2^{n/2}\right)^{2-\epsilon} m^{O(1)} = 2^{n - \epsilon' n} m^{O(1)}$

for ${\mathsf{CNF}\text{-}\mathsf{SAT}}$, which contradicts SETH.

What’s puzzling is that the evidence against doing better than quadratic comes when ${N}$ is already exponential, ${N = 2^{n/2}}$. Moreover, the ${\mathsf{OVP}}$ instances ${(A,B)}$ involved are ridiculously large, exponential sized, and we don’t even care that they have a succinct genesis in terms of ${\phi}$. (Note that we have swapped the letters ${n}$ and ${N}$ from their paper—we find it helpful to keep “${N}$” the larger one.)

Backurs and Indyk itemize several problems to which this connection was extended since 2010, but we agree that Edit Distance (${\mathsf{ED}}$) is the most striking addition to this list. Their new result is a kind of “SETH-reduction” from ${\mathsf{OVP}}$ to ${\mathsf{ED}.}$ Can we capture its essence without referencing ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ each time?

## The Reductions

The results here and before all use an unusual type of reduction. Ken and I think it would be useful to formalize this reduction, and try to understand its properties. It is not totally correct to call it simply a quasi-linear time reduction because multiple parameters are involved—we can call them ${N}$ and ${m}$ quite generally.

In the above case with ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ and ${\mathsf{OVP}}$, if the clause size is fixed then we have ${m = n^{O(1)}}$ so ${m = \mathsf{polylog}(N)}$. It hence suffices to have a reduction from ${\mathsf{OVP}}$ to ${\mathsf{ED}}$ that is computable in quasi-linear time, here meaning time ${N\mathsf{polylog}(N)}$. Indeed, we can allow time ${N\cdot g(N)}$ for any function ${g(N) = N^{o(1)}}$.

When talking just about the problems ${\mathsf{OVP}}$ and ${\mathsf{ED}}$, however—without reference to ${\mathsf{CNF}\text{-}\mathsf{SAT}}$${m}$ and ${N}$ are separate parameters with no relation specified. It suffices to say that the reduction is polynomial in ${m}$ and quasi-linear in ${N}$. This is essentially what Backurs and Indyk do. Their “${m}$” is called “${d}$“; then they define ${\ell_0 = 1000d}$, ${\ell_1 = \ell_0^2}$, and ${\ell_2 = \ell_0^3}$; then they multiply ${\ell_0 d}$, and so on. The details in their paper are considerable, involving an initial reduction from ${\mathsf{OVP}}$ to a problem they call ${\mathsf{PATTERN}}$, and this is one reason we’d like to streamline the reduction concept.

If we assume ${m = \Omega(\log N)}$, then “quasi-linear in ${N}$ and polynomial in ${m}$” is the same as “linear in ${N}$ and polynomial in ${m}$.” Perhaps the latter phrase is the simplest and best way to define the reduction? However, we do not need to specify “polynomial in ${m}$” either. It is enough to have a suitable sub-exponential time in ${m}$. For instances with ${m = O(n)}$ we would need ${2^{o(m)}}$ not ${2^{m - \epsilon m}}$, while for ${m = O(n^3)}$ we would need ${2^{o(m^{1/3})}}$.

Parameterized complexity defines reductions with two parameters, but the simplest ones are not exactly suitable. Finally, we wonder whether it would help to stipulate any of the specific structure that comes from ${\mathsf{SAT}}$ including that the instances ${(A,B)}$ are succinct. Note that we once covered succinctness and a hypothesis roughly related to SETH (see this comment for a circuit form). This paper highlighted by Backurs and Indyk works from ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ but says it could have used ${\mathsf{OVP}}$, while still not formalizing the reduction concept. Likewise their other major references; some work from ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ and others not. The latest of them, by Karl Bringmann and Marvin Künnemann who show “SETH-hardness” for ${\mathsf{ED}}$ on binary strings, defines a framework for the gadgets used in all these reductions.

## Power and Puzzlement

The remarks just above about the reduction time in “${m}$” make us recall the three most common levels of exponential hardness. The power index of a language ${L}$ was coined by Richard Stearns and Harry Hunt in a 1990 paper. It is the infimum of ${\epsilon > 0}$ such that ${L}$ belongs to time ${2^{O(n^\epsilon)}}$. Their “Satisfiability Hypothesis” (SH) about the power index of satisfiability is much weaker than SETH, though not as weak as conjecturing merely a lower bound of ${2^{n^{\Omega(1)}}}$.

• SH: ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ has power index 1.

• ETH: All algorithms for ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ require time ${2^{\Omega(n)}}$.

• SETH: All algorithms for ${\mathsf{CNF}\text{-}\mathsf{SAT}}$ require time ${2^{n - o(n)}}$.

The latter two come in slightly different versions bringing ${m}$ and/or a fixed ${k}$ in ${k}$${\mathsf{SAT}}$ into the picture, and of course all these versions might be false. SETH is distinguished as the closest to the upper bound. There are randomized algorithms for ${k}$${\mathsf{SAT}}$ that run in time ${2^{n - (\epsilon/k)n}}$ for some ${\epsilon > 0}$ and all ${k}$, and ${k}$ can be replaced by ${\log(m/n)}$ in general.

Stearns and Hunt emphasized the effect of reductions on SH and the power index in general. The same can be done for ETH. But we remain equally puzzled about the issue of the size of the problems used in the reductions. We start with a SAT problem that uses ${b}$ bits in its description. This is viewed then eventually as an edit distance problem that uses exponential in ${b}$ bits. The ${N}$ is the edit problem is extremely large. Of course this is just fine, since we only claim to get an exponential time algorithm.

The point is that our intuition about the edit distance problem is all on problems that have modest size ${n}$. I, Dick, actually had a student build a hardware machine that did modest size such problems many times faster than any computer. So all my intuition was—is—about small size edit problems. When the size of ${n}$ becomes astronomical my intuition may fall apart. Could this explain the reason that the result here seems puzzling?

## Open Problems

So what is the complexity of computing the edit distance? Can we really not do better than the obvious algorithms? This seems hard, no puzzling, to us; but it may indeed be the case.

The 3SUM problem, which we recently covered, is attacked by some of these papers but has not of yet been brought within the scope of the reductions from ${\mathsf{OVP}}$. The ${\tilde{O}(n^{3/2})}$ decision tree upper bound has not yet yielded an algorithm that actually runs in time ${n^{2-\epsilon}.}$ Yet perhaps the above kinds of reductions also generally preserve the decision-tree complexity? This would make the decision-tree result a real obstacle to showing “SETH-hardness” in that manner.

1. June 1, 2015 8:44 pm

On thing that the simplicity of many of these new reductions points out is that we have generally been somewhat blinded in thinking about SAT algorithms: We think nothing of using polynomial space when attacking polynomial-time problems like edit-distance or 3SUM but don’t tend to try to exploit exponential space to try to find faster exponential-time problems.

June 2, 2015 3:23 am

Hi Dick, a few points:

(1) We (Piotr, Arturs, Amir and I) are organizing a tutorial on reductions from popular conjectures (i.e. hardness for easy problems) at this coming FCRC. See here: http://theory.stanford.edu/~virgi/stoctutorial.html

(2) Arturs, Amir and I have an earlier independent result on hardness of Longest Common Subsequence etc. for constant size alphabets. See here: http://arxiv.org/abs/1501.07053

(3) I think that 3SUM and OVP are very different problems, so that there may be no reduction from one to the other. We (Amir, Huacheng Yu and I in this coming STOC’15) have however found a natural problem that is harder than 3SUM, OVP and APSP. It’s called Matching Triangles: you are given a graph on n nodes, where each node has a color, you are given an integer k and you want to know if there is a triple of colors (a,b,c) such that there are at least k triangles whose vertices are colored (a,b,c). It turns out that if Matching Triangles can be solved in truly subcubic (e.g. n^{2.9999}) time, for any k=\omega(1), then APSP has a truly subcubic time algorithm, 3SUM has a truly subquadratic time algorithm and OVP has a truly subquadratic time algorithm, so I’d believe hardness under Matching Triangles more than merely from SETH/OVP.

(4) There are some other very surprising (to me) consequences of believing SETH. For instance, it turns out that if you want to maintain the number of strongly connected components of a directed graph while supporting edge insertions and deletions, then (under SETH) essentially the best you can do is recompute the SCCs from scratch after each update! This even holds if you only want to know whether the number is >3…

–virgi

June 8, 2015 12:31 am

Wonder if this [conditional] “lower” bound based on SETH re-affirms the belief in
NP!=P, or it will have great implications even when SETH is proven false.

Long back in 1991, I attempted a proof of the conjecture, “NC !=P unless #P=FP”,
based on some information theoretic arguments.
While that “proof” was far from acceptance, the NC algorithm result of Lev, Pippenger and Valiant, for regular bipartite graphs, “A fast parallel algorithm for routing in permutation networks. IEEE Trans. on Computers (1981),
had motivated me into working towards the proof of #P=FP.
http://arxiv.org/abs/0812.1385v15

4. June 8, 2015 10:02 am

Can I get a reminder on SETH? ETH = exponential time hypothesis = (P≠NP). SETH = strong exponential time hypothesis, but what does the ‘strong’ part entail again?

• June 8, 2015 12:21 pm

The “strong” is the constant on n in the exponent being 1, not just “some constant times n” as indicated by the Ω in ETH. P != NP is weaker than the power-index hypothesis and does not necessarily imply any kind of exponential lower bound. My own belief, incidentally, is in size ${2^{\Theta(n/\log n)}}$ being tight for circuits; I’m less venturesome about machines.

June 11, 2015 7:28 am

Once you had a post on how big Armageddon it would be if P was NP. Well, on the other side, it also would again make the problem interesting if subquadratic algorithm for edit distance exists, am I right? 🙂

June 11, 2015 2:52 pm

The edit distance between two strings L and S, where |L| >= |S|, is at most |L|. For any number n and string W, a DFA can be built which accepts an input IFF its edit distance to W is < n. Such a DFA can be built and run in O(|W|) time.[1]

What's the complexity of a 'binary search' for the edit distance between L and S using this method? The DFA construction complexity grows very fast with respect to n, but we only need log(|L|) DFAs. And the search cut points can be skewed to amortize the growth with respect to n (i.e. first n << |L|/2).

• July 8, 2015 9:41 pm

💡 ❗ very interesting! but not following, “the DFA construction complexity grows very fast with respect to n,” can you clarify? you just said it takes O(|W|) time regardless of n…? this also reminds me of another recent paper by Wehar relating the L=?P question to complexity of intersection (nonemptiness) of DFAs.

July 9, 2015 1:52 am

For any *fixed* n, the complexity is O(|W|). See “Remark 5.2.3” of the cited paper for a note on what happens when n is not fixed.

7. June 13, 2015 9:31 am

this seems like a very big deal, even exciting to me & somewhat suddenly leads to the possibility of attacking P vs NP via attempting to prove polynomial lower bounds on a very practical & applied problem, namely edit distance. this would be surprising to me if that route succeeded because my own intuition for many years is that a proof would likely or have to come from circuit theory/ extremal set theory somewhat in the vein of proofs by Razborov and Rossman (2009).

note VWs comment above & these results remind me a lot of her/ RWs work (and this is building on the OVP problem RW isolated years ago) which mostly connect O(n^3) algorithms to larger implications where this is O(n^2). however is it true that polynomial lower bounds have not been proven on any problem whatsoever? which is somewhat shocking.

fyi this result is already covered in two other write ups on the web, 40-year-old algorithm proven the best possible / physorg news and Longstanding problem put to rest mit news.

did some searching around and google intelligence picked up this paper as similar/ related with constructions/ implications wrt P vs NP & Logspace. “how hard is it to compute edit distance” by Pighizzini

if anyone would like to discuss/ further muse/ push in these directions plz join me on stackexchange cstheory chat room for ongoing attn.