Virginia Vassilevska Williams is a theoretical computer scientist who has worked on a number of problems, including a very neat question about cheating in the setup of tournaments, and a series of novel papers, exemplified by this one, on graph and matrix problems whose cubic runtime exponents have long been begging to be improved.

Today Ken and I want to discuss her latest breakthrough in improving our understanding of the matrix multiplication problem.

Of course Volker Strassen first showed in his famous 1969 paper that the obvious cubic algorithm was suboptimal. Ever since then progress has been measured by one parameter ${\omega}$: if your algorithm runs in time ${O(n^{\omega})}$, then you are known by this one number. Strassen got ${\omega = 2.808\dots}$, and the race was on. A long series of improvements followed, which for a while seemed to be stuck above ${2.5}$. Then Don Coppersmith and Shmuel Winograd (CW) got ${\omega < 2.496}$ and everything changed. After a further contribution by Strassen himself, CW finally obtained ${\omega < 2.3755}$ in 1987, with full details here. This remained the best bound known for decades.

This has all changed now. Virginia has proved that she can beat the “barrier” of CW and get a new lower value for ${\omega}$. Currently her paper gives ${\omega < 2.3727}$, an improvement of “only” ${0.0028}$, but there is promise of more. This is also another case of proofs coming in twos: a bound of ${\omega < 2.3737}$ from PhD thesis work by Andrew Stothers was circulated to some in June 2010, but not very widely. All this is extremely exciting, and it is one of the best results proved in years in all of theory. While these algorithms are unlikely to be usable in practice, they help shed light on one of the basic questions of complexity theory: how fast can we multiply matrices? What could be more fundamental than that?

The Basic Idea

Matrix multiplication is bi-linear: the formula for the ${i,j}$ entry of ${XY}$ is ${\sum_k x_{i,k} y_{k,j}}$. The first step in simplifying the problem is to make it more complicated: let us introduce indicator variables ${z_{i,j}}$ and compute instead the tri-linear form

$\displaystyle T(x,y,z) = \sum_{i,j,k} x_{i,k} y_{k,j} z_{i,j}$

This is a special case of a general tri-linear form

$\displaystyle F = \sum_{i,j,k = 1}^{N} C_{i,j,k} x_i y_j z_k$

where ${N = n^2}$ and we have re-mapped the indices. It looks like we have made order-of ${N^3 = n^6}$ work for ourselves. The key, however, is to try to fit a representation of the form:

$\displaystyle \begin{array}{rcl} F &=& \sum_{\ell = 1}^r (\sum_i a_{\ell,i} x_i)(\sum_j b_{\ell,j} y_j)(\sum_k c_{\ell,k} z_k)\\ &=& \sum_{\ell = 1}^r P_\ell (\sum_k c_{\ell,k} z_k), \end{array}$

where ${P_\ell = (\sum_i a_{\ell,i} x_i)(\sum_j b_{\ell,j} y_j)}$. The point is, suppose we can compute these products ${P_\ell}$ in total time ${T}$. Then we can read off each desired entry of the product as the coefficient of the corresponding ${z_k}$:

$\displaystyle \mathrm{coeff}(z_k) = \sum_{\ell = 1}^r c_{\ell,k} P_\ell,$

in ${Nr}$ steps. Thus what we have are separate handles on the time ${T}$ for the products and the time for the ${z_k}$. The way to manage and balance these times involves recursion.
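To make this concrete, here is a small illustration of our own (not from her paper): Strassen's classical rank-${7}$ decomposition of ${2 \times 2}$ matrix multiplication, written exactly in the ${a_{\ell,i}}$, ${b_{\ell,j}}$, ${c_{\ell,k}}$ form above, with ${r = 7}$ products ${P_\ell}$.

```python
# Strassen's rank-7 decomposition of 2x2 matrix multiplication, in the
# trilinear form of the post: each product is
#   P_l = (sum_i a[l][i]*x[i]) * (sum_j b[l][j]*y[j]),
# and the product entry attached to z_k is sum_l c[l][k] * P_l.
# Index order: x = (x11, x12, x21, x22), and likewise for y and z.

A = [(1, 0, 0, 1), (0, 0, 1, 1), (1, 0, 0, 0), (0, 0, 0, 1),
     (1, 1, 0, 0), (-1, 0, 1, 0), (0, 1, 0, -1)]
B = [(1, 0, 0, 1), (1, 0, 0, 0), (0, 1, 0, -1), (-1, 0, 1, 0),
     (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1)]
C = [(1, 0, 0, 1), (0, 0, 1, -1), (0, 1, 0, 1), (1, 0, 1, 0),
     (-1, 1, 0, 0), (0, 0, 0, 1), (1, 0, 0, 0)]

def multiply_2x2(x, y):
    """x, y: flattened 2x2 matrices (x11, x12, x21, x22); returns flattened x*y."""
    # r = 7 products P_l, each a single multiplication of two linear forms
    P = [sum(a * xi for a, xi in zip(A[l], x)) *
         sum(b * yj for b, yj in zip(B[l], y)) for l in range(7)]
    # recombination: entry k is sum_l c[l][k] * P_l, costing N*r = 4*7 steps
    return [sum(C[l][k] * P[l] for l in range(7)) for k in range(4)]

print(multiply_2x2((1, 2, 3, 4), (5, 6, 7, 8)))  # [19, 22, 43, 50]
```

Here ${N = 4}$ and ${r = 7}$, so the recombination step indeed costs ${Nr = 28}$ additions and scalings, while only ${7}$ true multiplications are performed.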

The Basis Idea

The recursion idea is nice to picture for matrices, though its implementation for the way we have unrolled matrices into vectors is not so nice. Picture ${X}$ and ${Y}$ as each being ${4 \times 4}$ matrices. We can regard ${X}$ instead as a ${2 \times 2}$ matrix of four ${2 \times 2}$ matrices ${X_1,X_2,X_3,X_4}$, and do the same for ${Y}$. Then the product ${XY}$ can be written via ${2 \times 2}$ products ${X_i Y_j}$, and we can picture ourselves recursing on these products.
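In code, the recursion looks like the following sketch. This is the standard Strassen recursion, included only to illustrate the picture (it is not the CW or Vassilevska Williams construction): it performs ${7}$ recursive sub-products instead of ${8}$, giving the recurrence ${T(n) = 7\,T(n/2) + O(n^2)}$ and hence ${\omega = \log_2 7 \approx 2.808}$.

```python
def add(X, Y): return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]
def sub(X, Y): return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def strassen(X, Y):
    """Multiply n x n matrices (n a power of 2) via 7 recursive subproducts."""
    n = len(X)
    if n == 1:
        return [[X[0][0] * Y[0][0]]]
    h = n // 2
    # carve each matrix into a 2x2 matrix of h x h blocks
    blk = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    X11, X12, X21, X22 = blk(X, 0, 0), blk(X, 0, h), blk(X, h, 0), blk(X, h, h)
    Y11, Y12, Y21, Y22 = blk(Y, 0, 0), blk(Y, 0, h), blk(Y, h, 0), blk(Y, h, h)
    # the 7 block products (same P_l pattern as the 2x2 base case)
    P1 = strassen(add(X11, X22), add(Y11, Y22))
    P2 = strassen(add(X21, X22), Y11)
    P3 = strassen(X11, sub(Y12, Y22))
    P4 = strassen(X22, sub(Y21, Y11))
    P5 = strassen(add(X11, X12), Y22)
    P6 = strassen(sub(X21, X11), add(Y11, Y12))
    P7 = strassen(sub(X12, X22), add(Y21, Y22))
    # recombine into the four blocks of the product
    Z11 = add(sub(add(P1, P4), P5), P7)
    Z12 = add(P3, P5)
    Z21 = add(P2, P4)
    Z22 = add(sub(add(P1, P3), P2), P6)
    top = [r1 + r2 for r1, r2 in zip(Z11, Z12)]
    bot = [r1 + r2 for r1, r2 in zip(Z21, Z22)]
    return top + bot

print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```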

The reason why the vector case does not look so nice is that the tri-linear form ${F}$ is so general—indeed we cannot expect to fit a general tri-linear form into a small number of products ${P_\ell}$. What CW did, building on work by Arnold Schönhage, is relax the tri-linear form by introducing more than ${N}$-many ${z_k}$ variables, supplying appropriate coefficients to set up the recursion, and most of all framing a strategy for setting variables to zero so that three goals are met: the recursion is furthered, the values of “${r}$” at each level stay relatively small, and the matrix product can be extracted from the variables left over. This involved a hashing scheme which used subsets of integers that are free of arithmetical progressions.

The final step by CW was to choose a starting algorithm ${{\cal A}}$ for the basis case of the recursion. They devised one and got ${\omega < 2.39}$. Then they noticed that if they bumped up the base case, by manually expanding their algorithm to an ${{\cal A}^2}$ handling the next-higher case, they got a better analysis and their famous result ${\omega < 2.376}$. By their way of thinking, bumping the basis up once more to ${{\cal A}^3}$ was the way to do better, but they left analyzing it as an open problem. Others attempted the analysis and found that it gave worse, not better, results. So ${2.376}$, actually ${2.375477}$, stood.

The insight for breaking through was to make a bigger jump in the basis. Vassilevska Williams was actually anticipated in this, without her knowledge, by Andrew Stothers, in his 2010 PhD thesis at the University of Edinburgh. Stothers used ${{\cal A}^4}$ and showed this method capable of achieving ${\omega < 2.3737}$, though there has been some doubt about whether all the details were worked out. Vassilevska Williams, however, used ${{\cal A}^8}$ and brought some powerful computing software to bear on a more extensive framework for the plan. It is not clear whether there is anything necessary about jumping ${{\cal A}}$ by a power of ${2}$; in any event her program and framework work for any exponent.

The Proof

We cannot yet really give a good summary of the proof—further details are in her paper. One quick observation about her work is in order. She used the CW method, but extended it into a general schema that can be used to find good matrix product algorithms, perhaps even better than the one in the paper. The algorithms themselves can be generated and examined, but as usual the task of analyzing them is very hard. Her brilliant insight is that this task can be carried out automatically by a computer program. This allows her to do the analysis where others previously failed.

For example, her paper gives an overview of her main program in pseudo-code.

The details are not as important as the fact that this program allows one to work on much larger schemas than anyone could previously.
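As a rough, invented sketch of what such a search looks like (none of the specific steps below are taken from her paper; this only conveys the shape of the approach), the program is roughly:

```
for successive tensor powers A^k of the basic construction:
    introduce variables describing a feasible analysis of A^k
      (which pieces of the power are retained, and with what weights)
    write down the constraints these variables must satisfy
    solve the resulting nonlinear optimization problem numerically
    translate the optimum into an upper bound on omega
report the best bound found
```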

What Does the Bound Mean?

Note that she has improved the bound of Stothers by ${0.001}$. For what threshold value of ${n}$ does an additive improvement of ${a}$ in the exponent halve the running time? The answer is ${n = 2^{1/a}}$, which in this case is ${n = 2^{1000}}$. This value is far above the Bekenstein Bound for the number of particles that could be fit into a volume the size of the observable universe without collapsing it into a black hole. In this sense the algorithm itself is even beyond galactic.
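The arithmetic behind that threshold is worth a quick check; here is a two-line illustration of our own:

```python
import math

def halving_threshold(a):
    """Smallest n at which shaving `a` off the exponent halves the running time:
    n^omega / n^(omega - a) = n^a = 2, so n = 2^(1/a)."""
    return 2.0 ** (1.0 / a)

# For the 0.001 gap between Stothers' and Vassilevska Williams' exponents,
# the new bound wins a factor of 2 only once n is around 2^1000.
print(math.log2(halving_threshold(0.001)))  # 1000.0
```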

The meaning instead comes from this question: Is there a fundamental reason why ${\omega}$ could settle at a value strictly greater than ${2}$? Note that ${\omega = 2}$ is not taken to mean the existence of a quadratic-time algorithm, but rather that for all ${\epsilon > 0}$ there are algorithms that achieve time ${O(n^{2+\epsilon})}$. There was some reason to think ${2.5}$ could be a natural barrier, but it was breached. Perhaps ${\sqrt{5} = 2.236\dots}$, since this is connected to the golden ratio? Her paper notes a recent draft by Noga Alon, Amir Shpilka, and Christopher Umans that speaks somewhat against the optimism shared by many that ${\omega = 2}$.

Open Problems

Can the current bounds be improved by more computer computations? Are we about to see the solution to this classic question? Or will it be a struggle over increments of ${0.001}$?

In any event, congratulations to Virginia—and Andrew—for their brilliant work.

Update: Markus Bläser, who externally reviewed Stothers’ thesis, has contributed an important comment on Scott Aaronson’s blog here. It evaluates the significance of the work in context, and also removes the doubt going back to 2010 that was expressed here.

November 29, 2011 1:37 pm

Thanks for the very understandable summary. One thing I found interesting about this algorithm is that it uses nonlinear optimization for obtaining the bounds. It is not often one sees algorithms being directly used to prove a theorem.

November 29, 2011 1:46 pm

November 29, 2011 3:18 pm

I am not quite a theoretical CS person, but I fail to understand the significance or the implications of any improvement over even the O(n^3) bound when no progress has been made on the NP=?P question.

November 29, 2011 7:55 pm

In a previous post, Gödel said:
“I thought k*n means you understand the solution, k*n^2 means you can solve it but only partly understand it.”

Thus under the standards of Gödel himself, matrix multiplication is still not well understood. However if the new algorithm can be made practical, it will certainly give rise to new larger-scale implementations of matrix multiplication.

• November 29, 2011 8:29 pm

The data size is N = n^2, so in terms of N we’re talking N^{1.1863…}. Bien entendu! :-)

Dick and I mused that the most “salient” constant c that could form a natural barrier for \omega, with 2 strictly < c < 2.3727…, may be 2 + 1/e = 2.367879441… This as opposed to c = sqrt(5) = 2.236… as I wrote and Dave Bacon independently offered on Scott's blog. As I said there, any takers or breakers?

November 29, 2011 8:40 pm

Merci Ken pour l’explication. :)

So now we do have a practical bound for matrix multiplication. I really hope the new algorithm will soon be made feasible.

• November 29, 2011 9:09 pm

I should add actually that there is a good point between us and “Gödel” here. The “cubic” algorithms mentioned in this post’s first sentence are really N^{1.5} by the same token, and N^2 means quartic. For graph and matrix problems one can indeed argue pretty sweepingly that n^4-time algorithms are probably improvable, and that n^5 and higher ones probably fall short of understanding.

• November 30, 2011 9:29 am

Re salient constants, I know of proofs that naturally give rise to bounds of the form $n^\alpha$ for some irrational $\alpha$ (usually the root of a quadratic). Typically they work by feeding in an existing bound to get a slightly better one, then iterating that process and showing that the limiting exponent satisfies some polynomial equation. I can’t think of any that naturally give rise to a transcendental bound — does anyone have a good example? Maybe something that depends on a continued-fraction expansion?

• November 30, 2011 10:27 am

Exponent bounds for matrix multiplication generally arise from an iterative process that leads to a transcendental equation (since exponentiation naturally comes into the equation). In simple cases, like Strassen’s algorithm, you get a nice enough equation (namely, 2^x = 7) that you can solve it explicitly and get a transcendental number. Coppersmith and Winograd and the new improvements are a lot more complicated. You can characterize the resulting exponents as solutions of big systems of somewhat complicated simultaneous equations. The solutions are surely transcendental numbers, but I don’t know whether this can be proved.

• December 1, 2011 4:29 am

Oops, I realize now (after reading your response) that I was of course aware of proofs that lead to exponents like $\log 2/\log 3$ — pretty well any proof that starts with a small example of something and tensors it up. So I suppose I’d better make my question more specific and ask whether there is a general type of argument that could conceivably lead to the bound $2+1/e$. Are there natural proof techniques that lead to exponents resembling that one?

December 1, 2011 9:06 am

gowers,

I know of no ideas, even for lower bounds, that could come close to that. The best I know is $2n^2$, which seems pretty weak but still needs a proof. Perhaps the best idea could be something like: if the matrix product exponent were 2, then this means matrix product is really doable in linear time (in the data size), which then shows $\dots$ Perhaps this approach could be made to work.

December 1, 2011 1:59 pm

Serge, do these new algorithms really make our understanding of matrix multiplication any better?
With the increased [process] complexity needed to lower the exponent by only a small fraction, we are going in the opposite direction.

December 1, 2011 5:41 pm

Of course it does. Gödel was right to emphasize that faster algorithms come from a better understanding of the problem to solve.

Maybe it’s no progress towards solving PvsNP, but since it took 25 years to speed up this algorithm by such a small fraction I find it hard to believe that we’ll know someday whether SAT or integer factorization admit polynomial algorithms…

• December 1, 2011 6:41 pm

My question isn’t about the size of the exponent, but rather whether an exponent made in a simple way out of e could arise naturally from a proof. I can’t conceive of an argument that would lead to such a bound (upper or lower), but that may be just because there’s a technique I’m not familiar with or I’m forgetting something. To answer my question in the affirmative it would be sufficient to exhibit any sensible argument to a sensible combinatorial problem that yielded a bound of the form $n^{f(e)}$ where $f(x)$ was a simple function such as (for instance) a non-constant polynomial in $x$ and $x^{-1}$ with integer coefficients.

3. November 29, 2011 4:54 pm

To better appreciate Strassen-type theorems, I would very much enjoy reading natural statements of these conjectures in contexts other than purely algebraic. Here is an attempt from a geometrically natural point-of-view:

Let ${R}$ be a smooth Riemannian (or Kählerian) manifold that is equipped with a complete set of real (or holomorphic) coordinate functions and a metric (or complex) structure ${g}$. Regard ${g}$ as an oracle that computes at zero computational cost the inner product ${\langle X,Y\rangle_{g}|_p}$ at a given point ${p}$ for arbitrary vector fields ${X}$ and ${Y}$. Then given ${s}$ and ${t}$ as smooth real (or holomorphic) functions, and assigning zero computational cost to the evaluation of functions ${ds(Z)|_p}$ and ${dt(Z)|_p}$ for an arbitrary vector field ${Z}$, the arithmetic cost of computing ${\langle ds,dt\rangle_{g^{-1}}|_p}$ is ${\mathcal{O}[(\dim R)^\omega]}$, with ${\omega \le 2.3727}$.

Is there a more geometrically natural statement of Strassen-type theorems than this? For what other branches of mathematics can Strassen-type theorems be stated naturally? Comments and pointers to the literature — both fundamental and applied — would be very welcome.

• November 30, 2011 3:21 pm

As a followup, here’s another (tentative) geometric description of Strassen-type theorems. From a matrix algebra point of view, it is very remarkable that the matrix-matrix product ${C=AB}$, for ${A}$ and ${B}$ square matrices, requires — asymptotically and assuming ${\omega =2}$ — no more computational effort than the matrix-vector product ${y=Ax}$ for ${x}$ a general vector. Equivalently in the language of differential geometry, the computational effort required to sharp (${\sharp}$) or flat (${\flat}$) a full-rank bilinear form is no greater than the effort to ${\sharp}$ a vector or ${\flat}$ a covector (viewing vectors and covectors as rank-1 bilinear forms).

These reflections lead us to a (dim on my part) geometric appreciation that somehow, to “learn” that a bilinear form is rank-1 requires computational effort, or equivalently in algebraic terms, that matrix-matrix multiplication is surprisingly no more costly than matrix-vector multiplication.

It would be great if someone smarter than me could explain these mysteries better and more naturally! :)

November 29, 2011 8:56 pm

Please, could people kindly refrain from fussing over this so-called breakthrough and get on with whatever they are supposed to be doing?

In particular, Scott Aaronson’s blog reads as if someone had just proved the Riemann hypothesis.

Such self promotion enterprise is simply embarrassing, both for the people involved and for the TCS community at large.

November 29, 2011 9:26 pm

Anonymous,

Hi. It is not that we will use the algorithm; it is galactic. Strassen’s original one is borderline. But the idea that progress has been made on a problem that stood for almost 25 years seems pretty cool to me.

dick

• November 29, 2011 9:48 pm

“the idea that progress is made on a problem that stood for almost 25 years seems pretty cool to me.”

You are right, but I guess you’d agree there is a difference between “pretty cool” and “one of the best results proved in years in all of theory”. Seriously, is that much hype justified, given that we currently have no clue whether this algorithm will lead to any significant further improvements or will be a dead end, a mere curiosity, 20 years from now? Regardless of the importance (or lack of it) of the result, it would be much more reasonable to show some restraint, especially as this blog is surely one of the “flagships” of the TCS community when it comes to wider outreach.

December 1, 2011 9:32 am

Well said, Michal. I wonder if there is a correlation between this way of thinking about breakthrough results in TCS and the relevance of TCS within computer science, as well as how hard it is for students in the field to get a job.

December 1, 2011 10:30 am

The significance of the result has already been explained by Markus Bläser: this is a marginal improvement based on existing techniques, period.

PS: I would be enlightened if you would share your view on the recent update to the blog post by Aaronson, in which he advised people to “go to hell”.

5. December 1, 2011 4:36 am

That last talk about golden ratio sounds like numerology.

December 1, 2011 9:48 am

Don’t let yourself be influenced by names! As David Hilbert used to say: “It must be possible to replace in all geometric statements the words point, line, plane by table, chair, beer mug.”

7. December 1, 2011 10:00 am

It seems to me that the ongoing discussion of the new Stothers-Vassilevska-Williams algorithms and also Lance Fortnow’s recent essay The Death of Complexity Classes? echo perennial themes that Charles Townes articulated in a 1984 IEEE article titled “Ideas and Stumbling Blocks in Quantum Electronics”;

The following excerpts from Townes’ 1984 essay are refocused to address contemporary issues in complexity theory:

Ideas and Stumbling Blocks in Quantum Electronics Computational Complexity

{Quantum electronics} [Theoretical computer science], including in particular {the maser} [the study of algorithmic efficiency] and {the laser} [complexity classes], represents a marriage between {quantum physics} [mathematics] and {electrical} [software] engineering which was probably longer delayed than it might have been because the two were not sufficiently acquainted.

It is sometimes said that there is no single component idea involved in the {construction of masers or lasers} [surpassing of the Coppersmith-Winograd bound] which had not been known for at least 20 years before the advent of these {devices} [algorithms].

The beauty of {the device} [this class of algorithms] may have been more attractive to most {scientists} [computer scientists] than its potential applications. A favorite quip which many will remember was “{the laser is} [Strassen-style algorithms and complexity classes are] a solution looking for a problem.”

In looking back over why the field of {quantum electronics} [applied complexity theory] took as long as it did in getting started and why even then the buildup was initially not more rapid, I necessarily mention some of the stumbling blocks, misconceptions, and fumbles. The development of any science by humans has its similar mistakes and illogicalities. Recalling these can keep us humble and make us aware there may be other exciting events not yet visible around the corner.

It is wonderful to reflect that even today, more than 50 years after the invention of masers and lasers, we are still finding beautiful new mathematics, physics, engineering, and enterprises associated with these devices. On the other hand, it is sobering to reflect that Townes’ 1984 essay was written ten years after peak enrollment in North American physics programs; and yet few or no physicists of Townes’ generation anticipated this two-generation (and still-persisting) educational stagnation. In failing to recognize the onset of this stagnation, no effective steps were taken to forestall it.

Anxiety that computer science — and even the global STEM enterprise — may be stagnating largely accounts (it seems to me) for both the hyperbole and the rancor that have attended the discussion of these topics. Townes’ essay reminds us that neither hyperbole nor rancor is warranted; rather, it’s prudent (for students especially) to take Townes’ advice that we “stay humble and be aware there may be other exciting events around the corner.”

December 2, 2011 9:59 am

Wouldn’t it be helpful to have an online catalogue of matrices, like oeis.org, with known lower and upper runtime bounds for matrix-vector and matrix-matrix product algorithms for concrete matrices?

9. December 2, 2011 1:45 pm

Let me add my cranky comment.

In some places people say that a Fourier transform of something may lead to a major improvement. A Fourier transform usually means translation invariance. Here that could mean that the partial multiplications over the blobs, for fixed ${n}$, ${M_{k,m} = L(a_k, a_{k+1}, \dots, a_{k+n})\, L(b_m, \dots, b_{m+n})}$ for all ${k,m}$, lead to the desired sums (${L}$ a linear operator). Then one can use the Fourier transform.

Good luck.

December 2, 2011 6:14 pm

In the course of history we become able to design faster and faster algorithms, so that problems become easier and easier to solve. Therefore it’s not legitimate to credit each problem with an inherent difficulty, for it’s only with time that they become easier. This is why the PvsNP problem is ill-posed, being based on the wrong idea that history can be condensed onto a single point.

December 3, 2011 3:17 pm

This is not very relevant to this particular post… but I am curious about some recent possible developments at the LHC. There will supposedly be a public announcement on Dec 13….
I am sorry to churn the rumor mill… Apparently the Higgs boson mass is most likely to be around 125 GeV.

http://blog.vixra.org/2011/12/02/higgs-rumour-anaylsis-points-to-125-gev/

Curiously, this is very close to the one predicted in this paper….

http://www.citebase.org/abstract?id=oai%3AarXiv.org%3A0912.5189

Apparently this involves the use of the four color theorem…. Is this completely bogus, or is there more to this?

December 3, 2011 7:35 pm

OK, the author of the second paper seems to have made several claims, including a simple proof of the “4CT”, which is quite bogus. Case closed.