Approaches to the exact edit distance problem

Richard Bellman created the method of dynamic programming in 1953. He was honored with many awards for this and other seminal work, during his lifetime. We probably all know dynamic programming, use it often to design algorithms, and perhaps take it for granted. But, it is one of the great discoveries that changed the theory and practice of computation. One of the surprises–at least to me–is given its wide range of application, it often yields the fastest known algorithm. See our friends at Wikipedia for a nice list of examples of dynamic programming in action.

I never had the honor of meeting Bellman, but he wrote a terrific short autobiographical piece that contained many interesting stories. (I cannot locate it anymore: if you can please let me know.) One story that I recall was how, when he was at The Rand Corporation, he often made extra money playing poker with his doctor friends in the evenings. He usually won, which convinced them that he was using “deep mathematics” against them–perhaps probability theory. Bellman said the truth was much simpler: during the day the doctors were stressed making life and death decisions, at night when they played poker they “let go”. During the day Bellman could just look out at the Pacific Ocean from his office and think about problems. At night he played tight simple poker, and he cleaned up.

One other story from this piece made a long-lasting impression on me. When he was famous he would often be invited to places to give talks, like many scientists. However, he always insisted that he must be picked up at the airport by a driver of a white limo. I understand the driver part, I understand the limo part, but insisting on the color was a very nice touch.

The Edit Distance Problem

The edit distance problem (EDP) is given two strings

$\displaystyle a=a_{1},a_{2},\dots,a_{n} \text{ and } b=b_{1},b_{2},\dots,b_{m}$

determine their edit distance. That is what are the fewest operations that are needed to transform string ${a}$ into ${b}$. The operations allowed are: insert a character, delete a character, and substitute one character for another. In general their costs can be arbitrary, but for us we will consider only the basic case where insert/delete have unit cost and substitute has cost two. This makes sense, since substitute is the same as one delete and one insert.

This distance is sometimes called the Levenshtein distance, named after Vladimir Levenshtein, who defined it in 1965. The notation of edit distance arises in a multlude of places and contexts, I believe that the notion has been repeatedly discovered. It also has many generalizations where more complex operations are allowed.

People really care about EDP. There are very good heuristic algorithms that are used, for example, in the biological community to solve EDP. One is called BLAST. It is implemented in many languages and runs on everything, there are even special purpose hardware devices that run BLAST. Clearly, there is need for solutions to EDP. Unlike theory algorithms, BLAST has no provable error bounds; unlike theory algorithms, BLAST seems, in practice, to be good enough for the scientists who use it. However, I have no doubt that they would prefer to get the optimal answer–however, the cost of getting the optimal answer cannot be too great.

Upper and Lower Bounds

The dynamic programming algorithm for EDP has been discovered and rediscovered many times. I believe that Robert Wagner and Mike Fischer did it first, but I will not try to be the perfect historian. Their algorithm runs in time ${O(n^{2})}$ and uses the same space: it is convenient to assume that ${n=m}$ for the rest of this post. With a little bit of work, it is easy to improve the space bound to ${O(n).}$

There are papers that prove a lower bound of ${\Omega(n^{2})}$ for EDP. These papers assume a restricted model of computation: only input symbols can be compared for equality. Like any restricted lower bound they give you an idea of what not to do: if you wish to beat the the lower bound, then you must do more that compare symbols for equality. Use the value of the symbols, use bit operations, use randomness, do something other than just compare symbols.

Actually, as I recall, there was a bit of an inversion here: there was a better upper bound paper that gave an algorithm for EDP that took time ${O(n^{2}/\log n)}$, before the lower bound papers were published. The method used to prove the sub-quadratic algorithm for EDP is based on a general idea called The Four Russian Method. I never liked the name, I do like the method. As Dan Gusfield points out, the four were not even all Russian: perhaps a consequence of the cold war was that information did not always flow easily between east and west.

The Four Russian Method was first, I believe, used to compute boolean matrix products faster than cubic time. The method has been used since then to solve many other problems. If you do not know the method, take a look at the following link.

Still here. Okay, here is an overview of the method: it is based on trading space for time. A typical Four Russian Method algorithm operates in two phases. In the first phase, it examines the input and cleverly pre-computes values to a large number of subproblems, and stores these values in a look-up table. In the second phase, it uses the pre-computed values from the table to make macro steps during the rest of the computation. The result is usually a reduction in time by a logarithmic factor, while the space becomes as large as the time. The method is not generally practical; however, it is a useful method to know.

Bit Complexity

Actually there is a logarithmic factor that is hidden in the quadratic dynamic programming algorithm for EDP, since the algorithm must use ${\log n}$ size numbers. Thus, the bit complexity is, ${O(n^{2}\log n)}$. There is a clever way to remove the factor of ${\log n}$ that is due to Dan Lopresti:

Theorem 1 There is an algorithm for EDP that uses ${O(n^{2})}$ bit operations and ${O(n)}$ space.

Proof: The key idea is to use only a constant number of bits rather than ${\log n}$ bits and still run the standard dynamic program. $\Box$

I will explain this in more detail in a future post. Truth in blogging: Lopresti was once a graduate student of mine, and I worked with him on this result.

Approximation and EDP

Legend has it that when faced with unravelling the famous Gordian Knot, Alexander the Great simply cut the knot with his sword. We do this all the time in theory: when faced by a “knotty” problem we often change the problem. The edit distance problem is any not different: years ago the problem of exact distance was changed to approximate edit distance. The goal is to get a fast algorithm that finds the distance to within a small relative error. For example, in the STOC 2009 conference there will be a paper “Approximating Edit Distance in Near-Linear Time” by Alexandr Andoni and Krzysztof Onak. They show that there is an algorithm (AO) that in time ${n^{1+o(1)}}$ gets a relative error of

$\displaystyle 2^{O(\sqrt{\log n}\log\log n)}.$

This is a huge improvement over the previous work. I will not go into any details today, see their paper for the full proofs.

This is very pretty work. I have started to look at their paper and believe that it has many interesting ideas, they may even help solve other problems. My issue is that we still are no closer to solving the edit distance problem. The AO algorithm still makes too large relative error to be practical–what exactly is the factor for reasonable size ${n}$? See my other posts on the use of “ O-notation.”

On the other hand, I am intrigued with their breakthrough. I think that it may be possible to use their new algorithm as a basis of an exact algorithm. The key idea I have in mine is to try to use their algorithm combined with some kind of “amplification trick.”

Amplification

Papers on approximations to the EDP use “relative” error, so let’s take a look at additive error. Suppose that ${A(x,y)}$ is an algorithm for the EDP that works for any fixed size alphabet. Say that ${A(x,y)}$ has additive error ${E(n)}$ provided, for all strings ${x,y}$ of length ${n}$ over the alphabet,

$\displaystyle A(x,y) = d_{\text{edit}}(x,y) + O(E(n)).$

There is a simple amplification trick:

Theorem 1 Suppose there is an algorithm ${A(x,y)}$ for the EDP that runs in time ${T(n)}$ and has additive error ${E(n)}$. Then, there is an algorithm ${A^{*}(x,y)}$ for the EDP that runs in time ${T(4n)}$ and has an additive error of ${E(4n)/2}$.

I assume that this is well known, but in case not, here is a sketch of the proof. Suppose that ${x}$ and ${y}$ are strings of length ${n}$ over the alphabet ${\{0,1\}}$. Consider the following two strings:

$\displaystyle x^{*} = x\#^{m}x \text { and } y^{*} = y\#^{m}y.$

The claim is that provided ${m = 2n}$,

$\displaystyle d_{\text{edit}}(x^{*},y^{*}) = 2d_{\text{edit}}(x,y).$

It should be clear that

$\displaystyle d_{\text{edit}}(x^{*},y^{*}) \le 2d_{\text{edit}}(x,y)$

since one way to change ${x^{*}}$ to ${y^{*}}$ is to edit the strings ${x}$ and ${y}$ and leave the ${\#}$‘s alone. The cost of this is twice the cost of editing ${x}$ to ${y}$.

The opposite inequality follows since any matching of the first ${x}$ with the second ${y}$ (or the other way around: second ${x}$ with first ${y}$) must delete all the ${\#}$‘s and this will cost more than ${2d_{\text{edit}}(x,y).}$ This claim follows since ${x,y}$ are over an alphabet that does not include ${\#}$. Thus,

$\displaystyle A(x^{*},y^{*}) = d_{\text{edit}}(x^{*},y^{*}) + E(4n)$

and therefore,

$\displaystyle A(x^{*},y^{*}) = 2d_{\text{edit}}(x,y) + E(4n).$

Divide by ${2}$ and we see that ${A(x^{*},y^{*})/2}$ is within additive error ${E(4n)/2}$ of the edit distance of ${x}$ and ${y}$.

I conjecture that there is a general construction that can replace ${2}$ by an arbitrary ${k=k(n)}$. This would yield an algorithm that runs in time ${T(cnk)}$ and has additive error of ${E(cnk)/k}$ for some absolute constant ${c>0}$. Note, we can repeat the above construction; however, the obvious application would require a ${\log k}$ size alphabet.

If such an amplification were possible, then we could do the following, for example: if there is an algorithm that runs in near linear time, i.e. time ${n^{o(1)}}$, for EDP and has additive error of ${n^{o(1)}}$, then there is a near linear time exact algorithm for EDP. This follows by just setting ${k=n^{o(1)}}$ and doing the calculation.

I have started thinking about the AO algorithm and how to combine it with different amplification methods. Perhaps this could lead, finally, to a fast exact algorithm for EDP. Perhaps. In any event such amplification methods seem potentially useful.

Open Problems

Thus, the main open problem is clear: find an EDP algorithm that runs in ${O(n)}$ time. Or one that runs in ${O(n \log n)}$ time, or near linear time. Or one that runs in ${\dots}$ I see no reason, even after decades of failure, to believe that there could not be an exact algorithm that runs fast and solves the EDP. This is–in my opinion–an important and exciting open problem. Let’s finally solve the problem.

There are other, perhaps, more approachable open problems. In real applications the inputs are not worst case, nor are they random. For instance, the BLAST program assumes implicitly a kind of randomness to the inputs. Are there reasonable models of the kinds of inputs that are actually encountered? And for such a model can we get provable fast algorithms for EDP?

1. March 22, 2009 8:08 pm

There is a variant of the DP that runs in O(ED^2). People in practice often claim that this is enough for them, since they never care about very large distances (if two strings are very far away, you don’t really care how far they are).

The real issue in practice, as far as I understand, is nearest neighbor under the edit distance or variants (biologists have some weighted versions).

March 22, 2009 8:36 pm

Agree about not caring if far apart…

Still would be nice to beat $O(ED^2)$–don’t you agree…rjlipton

2. March 22, 2009 10:47 pm

It is clearly a very nice theoretical question. But I would take this further: what can the TCS community say about dynamic programming? So far we have not said much.

Can we build tools, perhaps of the “NP-hardness” flavor, that would allow us to give definite answers about the complexity of problems that are solved via dynamic programs? We should be able to say either “here’s an O(n^1.7) algorithm for edit distance,” or “under the X-conjecture, edit distance requires Omega(n^2 / polylog n) time to compute.”

March 23, 2009 4:02 am

There is the general framework of Woeginger for DPs for NP-hard problems. It’s a bit weaker than what you’re asking since it only specifies when a PTAS can be achieved, but it’s interesting nonetheless.

http://www.win.tue.nl/~gwoegi/papers/fptas.pdf

March 23, 2009 9:59 am

I would be really surprised if someone were to come up with an algorithm which fills the dynamic prog. table in less than O(ED^2) time. (ED^2 time is actually achieved by filling the DP table cleverly)

The best algorithm previous to AO result was a O(n^1/3)-approximation which still used the classical DP. And I believe (maybe rather naively) that this cannot be improved unless one breaks apart from the DP approach.

So my argument is that, what we are really lacking is a good estimator for EDP which breaks apart from the DP paradigm. This could be a better recursion (than in Ostrovsky-Rabani), a “physical model” which models repulsion/compulsion between similar/dissimilar substrings, or what not.

Of course, there is also this problem of computing the estimator in less than quadratic time. Even a simple looking recursion of OR requires extremely deep machinery to compute in near-linear time.

March 23, 2009 10:13 am

Btw, it should probably be mentioned that by O(ED^2) in the above posts, O(ED^2 + n) is meant. The algorithm is never sub-linear.

6. March 24, 2009 12:55 am

I can’t believe I’m only just now finding your blog. Your posts are very interesting, and well written — I love the focus on people as well. Reading your previous posts will keep me occupied for quite some time :)

Thanks!

March 25, 2009 12:26 am

Dear Dr. Lipton:

I read Bellman’s biography (Eye of the hurricane) in 1986 and found it to be quite engaging. I have not seen the book in any of the many libraries I have hence visited and it is a shame that very few people seem to have read this book. (Many of my mathematician colleagues have not even heard of Bellman, let alone read this book!) Now the book seems to be out of print and I could only find one used book seller willing to part with it for a paltry \$ 315. Good luck finding a buyer!

Some of the interesting things I recall from this book are –

1) His account of the working environment at Rand corporation where he spent some years. Ford-Fulkerson were there at that time. His collaboration with Ford (Bellman-Ford algorithm) must have happened there.

2) His account of Princeton math department as a grad student and his mentor Solomon Lefchetz. (When they went out for dinner, Bellman’s job was to cut the steak for Lefschetz who had lost both arms in an accident and hence started a second career from an engineer to a mathematician!)

3) His take on the Erdos-Selberg controversy. He thinks Selberg should have been given more credit as the originator of the main idea of the proof.

Thanks for your excellent postings with historical flavor. They will certainly enrich the readers.

8. March 27, 2009 8:16 pm

As ravi points out above, Eye of the Hurricane is worth reading and likely includes any such material as a subset. It’s unfortunate that we don’t have more great biographies of computer scientists; the physicists and mathematicians of this century have left a rich legacy (Bethe’s autobiography, Hahn’s, Gamow’s, Kay & Bird’s biography of Oppenheimer, great pop biographies of Ramanujan and Erdos, Hardy’s A Mathematician’s Apology, biographies of Fermi by Serge and his wife both, the glut of Einstein biographies, threatening to collapse into their own gravitational pull…)

Bellman’s Dynamic Programming (reprinted by Dover) mentions “a fairly complete bibliography plus some complementing remarks”: The Theory of Dynamic Programming Bull. Amer. Math. Soc. 60/1954 503-516.

March 30, 2009 9:13 am

Just a technical comment:
As far as I understood, the major restriction used in the paper where the lower bound is shown is that the strings are over an alphabet with at least n+m symbols, where n, respectively m, is the length of the first string, respectively of the second string.
This is exactly why the algorithm using the Russians’ trick works better, although it is based also on equality tests between symbols: it is essential to have a finite (small) alphabet in order to pre-compute the distance between “all” the small strings.

June 30, 2010 5:49 pm

Hello.

What if, in a general theory of everything kind of way, some singular human conscious had used simple Yes(I should feel guilty) or No(I do not feel guilty) answers to answer every moral question that he could remember that had ever occurred in his life. In this way he could have become completely pure. He would have truly asked himself all the questions that could be asked of him. He would have emotionally evolved back to when he was a child.

What if he then assigned an ‘It doesn’t matter as life is only made to make mistakes’ label on every decision that he had analysed.

This would not make him God or the devil, but just very still and very exhausted. Anybody can do this but just for the purpose of this experiment lets say I have done this. Which I have.

There are no fears in me and if I died today I could deal with that because who can know me better than myself? Neither God or the Devil. I am consciously judging myself on ‘their’ level.

To make this work, despite my many faults, take ME as the ONLY universal constant. In this sense I have killed God and the Devil external to myself.The only place that they could exist is if I chose to believe they are internal.

This is obviously more a matter for a shrink more than a mathematician, but that would only be the case depending on what you believed the base operating system of the universe to be. Math / Physics / morals or some other concept.

As long I agreed to do NOTHING, to never move or think a thought, humanity would have something to peg all understanding on. Each person could send a moral choice and I would simply send back only one philosophy. ‘ LIFE IS ONLY FOR MAKING MISTAKES’.

People, for the purpose of this experiment could disbelief their belief system knowing they could go back to it at any time. It would give them an opportunity to unburden themselves to someone pure. A new Pandora’s box. Once everyone had finished I would simply format my drive and always leave adequate space for each individual to send any more thoughts that they thought were impure. People would then eventually end up with clear minds and have to be judged only on their physical efforts. Either good or bad. It would get rid of a lot of maybes which would speed lives along..

If we then assume that there are a finite(or at some point in the future, given a lot of hard work, there will be a finite amount) amount of topics that can be conceived of then we can realise that there will come to a time when we, as a people, will NOT have to work again.

Once we reach that point we will only have the option of doing the things we love or doing the things we hate as society will be completely catered for in every possible scenario. People will find their true path in life which will make them infinitely more happy, forever.

In this system there is no point in accounting for time in any context.

If we consider this to be The Grand Unified Theory then we can set the parameters as we please.

This will be our common goals for humanity. As before everyone would have their say. This could be a computer database that was completely updated in any given moment when a unique individual thought of a new idea / direction that may or may not have been thought before.

All that would be required is that every person on the planet have a mobile phone or access to the web and a self organising weighting algorithm biased on an initial set of parameters that someone has set to push the operation in a random direction.

As I’m speaking first I say we concentrate on GRAINE.

Genetics – Robotics – Artificial Intelligence – Nanotechnology and Zero Point Energy.

I have chosen these as I think the subjects will cross breed information(word of mouth first) at the present day optimum rate to get us to our finishing point, complete and finished mastery of all possible information.

Surely mastery of information in this way will lead us to the bottom of a fractal??? What if one end of the universes fractal was me??? What could we start to build with that concept???

As parameters, we can assume that we live only in 3 dimensions. We can place this box around The Milky Way galaxy as this gives us plenty of scope for all kinds of discoveries.

In this new system we can make time obsolete as it only making us contemplate our things that cannot be solved because up to now, no one has been willing to stand around for long enough. It has been holding us back.

All watches should be banned so that we find a natural rhythm with each other, those in our immediate area and then further afield.

An old clock watch in this system is should only be used when you think of an event that you need to go to. It is a compass, a modern day direction of thought.

A digital clock can be adapted to let you know where you are in relation to this new milky way boxed system.(x,y,z).

With these two types of clocks used in combination we can safely start taking small steps around the box by using the digital to see exactly where you are and then using the old watch to direct you as your thoughts come to you.

We can even build as many assign atomic clocks as are needed to individually track stars. Each and every star in the Milky Way galaxy.

I supposed a good idea would be to assume that I was inside the centre of the super-massive black hole at the centre of the galaxy. That would stop us from being so Earth centric.

We could even go as far as to say that we are each an individual star and that we project all our energies onto the super-massive black hole at the centre of the galaxy.

You can assume that I have stabilized the black hole into a non rotating perfect cube. 6 outputs /f aces in which we all see ‘the universe and earth, friends’ etc. This acts like a block hole mirror finish. Once you look it is very hard to look away from.

The 6 face cube should make the maths easier to run as you could construct the inner of the black hole with solid beams of condensed light(1 unit long) arranged into right angled triangles with set squares in for strength.

Some people would naturally say that if the universe is essentially unknowable as the more things we ‘discover’ the more things there are to combine with and therefore talk about. This is extremely fractal in both directions. There can be no true answers because there is no grounding point. Nothing for the people who are interested, to start building there own personal concepts, theories and designs on.

Is it possible that with just one man becoming conscious of a highly improbable possibility that all of universes fractals might collapse into one wave function that would answer all of humanities questions in a day? Could this be possible?

Answers to why we are here? What the point of life really is et al?

Is it possible that the insignificant possibility could become an escalating one that would at some point reach 100%???

Could it be at that point that the math would figure itself out so naturally that we would barely need to think about it. We would instantly understand Quantum theory and all.

Can anybody run the numbers on that probability?