A New Way To Solve Linear Equations
Impossible but true: a new approach to linear systems
Prasad Raghavendra is an expert in many aspects of complexity theory, especially the foundations of approximation theory. Until recently he was a colleague of ours at Georgia Tech, but he has now moved on to Berkeley. He will be greatly missed at Tech.
Today I want to talk about a brilliant new result that Prasad has on linear equations.
I was recently at the advisory meeting for the Berkeley Simons Theory Institute. During the meeting we had two presentations on new areas, besides talks on planned special projects. One was given by Prasad on theory, but almost in passing he mentioned that he had a way to solve linear systems. I liked the whole talk, but his almost casual comment surprised me. It seemed to me to be an entirely new approach to solving linear systems. My initial thought was that it must be something well known, but I did not know it.
As soon as he finished his talk we had a break, and I ran up to thank him for a wonderful talk. I quickly asked: was his result on linear systems new, or was it something I had missed? He answered, to several of us who were now waiting to hear his answer, that it was something he had just proved. I was relieved: I am not completely out of it.
I asked if we could write about this result and he agreed. Even better, he wrote up a draft paper with just the algorithm and its analysis, which are part of larger results and projects that were the body of his talk.
Solving Linear Systems
The solving of linear systems of equations is ancient, dating back to 600 BCE. It is of extreme importance, and still an active area of research. For arbitrary systems Gaussian elimination is still quite powerful. A historical note: apparently Carl Gauss only used the method named after him on six-variable problems. In those days there was no notion of "I can solve the general case in time cubic in n." There was no "n": of course there was the letter "n," but not the notion of solving a problem of size determined by n.
Of course now we have faster methods than this for general systems, methods that descend from the famous breakthrough of Volker Strassen. For special systems there are almost-linear-time methods, but these all require that the system have special structure, and work over the reals.
The Idea, and a Problem
Prasad's work is for solving linear systems over finite fields, specifically over F_p with p prime. For example, consider these equations over the field of two elements:

x_1 + x_2 + x_3 = 1
x_1 + x_2 = 0
x_2 + x_3 = 0

Such systems are of great importance in theory, and it shocked me that he found an entirely new approach to solving them. Note that Gaussian elimination happens to work well in this example: the second and third equations yield x_1 = x_2 = x_3, and this gives x_1 = x_2 = x_3 = 1. However, in general the method requires sifting through each equation multiple times, and it determines all the solutions. Can we do better by considering each equation only once, and by needing only some solution?
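For concreteness, elimination on such a system can be sketched in a few lines of Python. The three equations encoded below are my reading of the example above (x_1 + x_2 + x_3 = 1, x_1 + x_2 = 0, x_2 + x_3 = 0 over GF(2)), so treat them as illustrative:

```python
# Gaussian elimination over GF(2); each equation is ([coefficients], rhs).

def solve_gf2(eqs):
    """Return one solution of the GF(2) system, or None if inconsistent."""
    rows = [(list(a), b) for a, b in eqs]
    n = len(rows[0][0])
    pivots, r = [], 0
    for c in range(n):
        # find a row at or below r with a 1 in column c
        p = next((i for i in range(r, len(rows)) if rows[i][0][c]), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        pa, pb = rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][0][c]:
                a, b = rows[i]
                rows[i] = ([x ^ y for x, y in zip(a, pa)], b ^ pb)
        pivots.append((c, r))
        r += 1
    if any(not any(a) and b for a, b in rows):
        return None          # a row reads 0 = 1: inconsistent
    x = [0] * n              # free variables default to 0
    for c, i in pivots:
        x[c] = rows[i][1]
    return x

print(solve_gf2([([1, 1, 1], 1), ([1, 1, 0], 0), ([0, 1, 1], 0)]))  # [1, 1, 1]
```

Note that the elimination revisits rows repeatedly and recovers the whole solution space, which is exactly the cost the question above asks whether we can avoid.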
So what does he do?
He starts with a random set of vectors A_0 in F_p^n. Then he checks which ones satisfy the first equation. If the vectors are random, then about half of them will satisfy the equation. He throws away all the vectors that do not satisfy the equation. Call the remaining set B_1. This is not a typo; we have a reason for using B_1 and not A_1, as you will see soon.

The obvious idea is to iterate with B_1 on the second equation: about half the vectors in B_1 will satisfy it; let B_2 be those that do. Vectors in B_2 still satisfy the first equation, so the winnowing-down process that continues by taking the vectors in B_2 that satisfy the third equation, and so on, finds solutions to all considered equations. The problem is that with high probability it winnows down to zero before all m equations are satisfied, unless you start with A_0 having on the order of 2^m vectors, which is too many.
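The winnowing failure is easy to see numerically. The following toy simulation (not from the paper; the dimensions and pool size are arbitrary choices of mine) filters a random pool of GF(2) vectors through random equations and watches it shrink:

```python
import random

random.seed(1)
n, m = 30, 25                       # variables and equations (arbitrary)
eqs = [([random.randint(0, 1) for _ in range(n)], random.randint(0, 1))
       for _ in range(m)]           # a random GF(2) system

# start with 1000 random candidate vectors and only ever discard
pool = [[random.randint(0, 1) for _ in range(n)] for _ in range(1000)]
for i, (a, b) in enumerate(eqs):
    pool = [v for v in pool
            if sum(x * y for x, y in zip(a, v)) % 2 == b]
    print(f"after equation {i + 1}: {len(pool)} vectors remain")
    if not pool:
        break
```

Each equation cuts the pool roughly in half, so surviving all m equations would need an initial pool on the order of 2^m vectors.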
How to solve this problem? The surprising answer is that one does not need to start with a large initial set A_0 to maintain a high probability of catching a solution to all the equations. It suffices to maintain a suitably large set A_i of solutions to the first i equations. But how does he do this?
The Amplifier, and a Trick
The key is to use an amplifier of the kind we discussed here. After he takes the subset B_i of A_{i-1} that satisfies the i-th equation, he rebuilds a set A_i of the same size as A_{i-1} by taking combinations of vectors in B_i that are "random enough." Then he proceeds on to the next equation. In this way he gets a series of sets of vectors A_1, A_2, ..., A_m, so that A_i satisfies the first i equations, the sets are large, and they are almost random. The last set A_m is nonempty with high probability, and so it contains a solution to the linear system.
To make this work he needs one trick, and this is why the result is limited to finite fields. To see it, first note that if all the equations are set equal to zero, then any linear combination of solutions will be a solution. If the constants are non-zero, however, this fails. For instance in the above equations, the all-1 vector satisfies the first two, as does the vector with x_1 and x_2 flipped to 0. The sum of these vectors, which has x_3 = 0 and the other variables equal to 1, fails to satisfy the first equation.
The trick is that if you sum three solutions to the first two equations, then you always get a solution to both. Generally over F_q with characteristic p, the trick is to form random sums of p+1 vectors. The point is that if these vectors all satisfy some equation with right-hand side c, then the sum of the p+1 vectors will give (p+1)c on the right-hand side. But in characteristic p, (p+1)c = c. Hence the original equation remains satisfied.
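The p = 2 case is easy to check in a few lines; here is a tiny illustration with the hypothetical equation x_1 + x_2 + x_3 = 1 over GF(2):

```python
# Sums of p+1 = 3 solutions of a non-homogeneous GF(2) equation are
# again solutions, while sums of 2 solutions are not.

def dot2(a, v):
    return sum(x * y for x, y in zip(a, v)) % 2

a = [1, 1, 1]                                   # the equation a.x = 1
u, v, w = [1, 1, 1], [0, 0, 1], [0, 1, 0]       # three of its solutions
assert dot2(a, u) == dot2(a, v) == dot2(a, w) == 1

pair = [(x + y) % 2 for x, y in zip(u, v)]
triple = [(x + y + z) % 2 for x, y, z in zip(u, v, w)]
print(dot2(a, pair))    # 0: a sum of two solutions breaks the equation
print(dot2(a, triple))  # 1: a sum of three solutions satisfies it again
```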
The algorithm to form A_i from B_i, while still satisfying the i-th equation and all equations before it, is just to form random sums of p+1 vectors in B_i until the set has been amplified to the needed size.
So that is how he does it. Pretty cool?
He calls the amplification step a “recombination” step. Essentially he picks random vectors and adds them together. The evolutionary analogy is that this process performs enough “mixing” to preserve the needed amount of entropy.
The Algorithm
Here is the actual algorithm from his paper draft.
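In outline it is the prune-and-recombine loop described above. The Python sketch below is my own rendering, not the paper's pseudocode; in particular the pool size, the seeding, and the failure handling are assumptions rather than the paper's tuned parameters:

```python
import random

def prune_recombine_solve(eqs, n, p, pool_size=2000, seed=0):
    """Sketch of the prune/recombine solver over F_p (p prime).

    eqs: list of (coefficients, rhs) pairs with entries in {0, ..., p-1}.
    Returns some solution of the whole system, or None if the pool
    ever empties (unlikely for solvable systems when pool_size is
    large enough).
    """
    rng = random.Random(seed)
    pool = [[rng.randrange(p) for _ in range(n)] for _ in range(pool_size)]
    for a, b in eqs:
        # Prune: keep only the vectors satisfying the current equation.
        survivors = [v for v in pool
                     if sum(x * y for x, y in zip(a, v)) % p == b]
        if not survivors:
            return None
        # Recombine: a sum of p+1 survivors satisfies this equation and
        # every earlier one, since (p+1)c = c in characteristic p.
        pool = []
        for _ in range(pool_size):
            picks = [rng.choice(survivors) for _ in range(p + 1)]
            pool.append([sum(col) % p for col in zip(*picks)])
    return pool[0]
```

For instance, on the small GF(2) system `[([1,1,1],1), ([1,1,0],0), ([0,1,1],0)]` with n=3 and p=2 it recovers the unique solution (1, 1, 1).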
As always, see his paper for the details and the full proof that the method works. This is the hard part. I take that back: both the creation of the algorithm and its proof of correctness are equally tricky. Indeed the fact that there is a new algorithm at all is perhaps the most surprising thing.
Open Problems
This approach of guess, delete, recombine reminds me of what are sometimes called genetic algorithms. Are there applications of Prasad's algorithm? Or of his general method to other problems?
The recombination step also reminds me of the Ajtai, Kumar and Sivakumar sieving algorithm for the shortest vector problem in a lattice: the idea is to generate a set of random lattice points, compute new points as differences between pairs of points in the set, keep the shortest ones, and reiterate.
I was thinking the same thing — and if I am not misled, the AKS algorithm was quite directly inspired by Blum-Kalai-Wasserman’s algorithm for learning noisy parities, which seems even more similar to — yet still somehow different from — Prasad’s algorithm.
(Indeed, the BKW–>AKS lineage has a Kumar-Sivakumar paper in the middle, which applied BKW to solve the random equations output by Ajtai’s reduction from lattice problems.)
That is really slick!
In the algorithm description, though, is N=cn correct? In the statement of Thm 0.1 below, N is defined to be something else.
I parse the occurrence in the formula of “ln q n” as (ln q)n rather than ln(q n). Since q is constant, it has the form Cn. The question which your humble bloggers didn’t have time to delve into before posting is how things change when q = p^k for p prime. I suspect (indeed expect) it still works with perhaps a different constant C(q) or C(p,k) on the n, but we took the Wikipedia ethic of not trying to include “original research” :-).
Power of prime? – http://arxiv.org/abs/quant-ph/0401155 and references therein may be useful
Yeah, that’s a really confusing part of the writeup. I think the ambiguity is caused by a missing parenthesis: N = (145 q^2 ln q)n.
Yeah, that makes sense. Otherwise the randomly-chosen vectors wouldn’t have the correct solution in their span. In that case, the run time seems to be poly(q) * n * m, where m is the number of nonzero entries in the matrix. So I guess this is something we knew how to do already?
The connections of this method with genetic and evolutionary computation are really interesting.
The first set of randomly selected vectors is like the first generation of an evolutionary algorithm (EA). Then, at each generation, some vectors are selected using the i-th constraint of the linear system as the selection scheme. These selected vectors are then recombined among themselves to form new vectors for the next generation.
A difference with EAs is that there is crossover/recombination but no mutation (although maybe mutation could also be useful in the proposed method). However, the main difference is that in EAs it is usually impossible to know (apart from some special/artificial cases) how many generations are needed to reach a good-enough solution, while in the proposed method the number of generations is exactly the number of equations in the linear system.
Now a strange idea… it should be interesting to construct a mapping in order to try to predict the performance/complexity of an evolutionary algorithm.
Given the stochastic behaviour A of an evolutionary algorithm (some formal description of its working scheme), and given some description P of the problem to solve (for example some properties of its fitness landscape), it should be interesting to study a mapping of the pair (A, P) into a linear system over a finite field so that:
1) the number of equations of this system can be adopted as an estimate of the running time of the evolutionary algorithm,
2) the original problem can be solved using Prasad’s method.
Perhaps, it’s too fanciful 🙂
It seems to be very important that the algorithm is online in the set of constraints. And maybe somebody can even do something with the fact that the algorithm does not use division.
Brilliant! This looks like a good candidate to be a “why didn’t I think of that?” algorithm. You know, those things that look so obvious in retrospect but you probably wouldn’t have thought of in a million years.
It seems like one can derandomize the recombination step using an epsilon-biased set.
In the paper he only actually states it for prime fields, not general finite fields. Is there some reason it doesn’t work for more general finite fields? (If it does work for more general finite fields, is there any way it could be adapted to more general fields of nonzero characteristic? Probably not, but…)
Also apologies but I am getting myself all mixed up figuring out the running time. I know this should be simple but would anyone care to state it, since he doesn’t in the paper? Am I correct in at least inferring that for fixed q it’s polynomial but for variable q it is not?
Very nice! The idea of using a small random subset of a subspace as a proxy for that subspace is clever.
I don’t see what stops this working in characteristic 0, though. At each stage, to inflate the set of solutions, one can sum a random solution with differences of two other random solutions, in place of summing together p+1 random solutions. Then, when a new equation comes in, one can take two random solutions v, v’ and find a convex combination t v + (1-t) v’ that also solves the new equation. I didn’t do the analysis, but it seems that generically one should still be able to maintain enough “generic diversity” in the solution set at each stage to keep spanning a big portion of the solution space, after suitably selecting some amplification parameters. But perhaps the devil is in the details…
Also, the same idea seems to work (in principle, at least) to find the intersection of preimages of various homomorphisms from a large, not necessarily abelian group to various small groups (instead of from F_p^n to F_p), though I’m not sure how often this situation would actually arise in a genuinely nonabelian setting (for instance, it couldn’t happen at all in a simple group)…
Actually, thinking about it a bit more I don’t see exactly why randomness is necessary. Couldn’t one simply, at each stage of the process, store an affine basis for the solution space (a particular solution and a basis for the homogeneous solutions), and update that basis each time a new constraint is added (by shifting all the stored data by a suitable multiple of one of the basis elements, which is then discarded)? This seems to have comparable time and memory requirements to the random algorithm, if I’m not mistaken.
As I understand (imperfectly no doubt) the above remarks, the process outlined bears a considerable similarity to (what are called) Krylov subspace methods. Perhaps some person more knowledgeable than me will comment upon this! 🙂
Thanks for the comments.
I think a similar algorithm might work for the characteristic zero case, but I do not yet see the proof details.
The algorithm was discovered in an attempt to find a more general algorithm for constraint satisfaction problems. Therefore, the algorithm tries to use as little algebraic structure as possible, apart from the fact that there is an operation that preserves the solution space. The description of the algorithm itself just uses the operation preserving the solution space, but the analysis uses the algebraic structure heavily.
Hi Terry: could you elaborate a little further on how you perform this update method on the affine basis? I don’t see how you can do it faster than Gaussian elimination.
Ah, I got confused by the line “N = Cn” in the algorithm description, which suggested that N was linear in n (leading to an O(n^2 m) algorithm), which is essentially the same speed as Gaussian elimination (or the basis updating algorithm I mentioned above). Looking more closely at the paper, it seems instead that N is actually selected to be logarithmic in n (for fixed q), which is the source of the speedup, but unfortunately one which is constrained to the low q setting.
Gah, no, I take that back… on even closer reading it seems that N is ultimately taken to be (145 q^2 ln q) n (it took me a while to figure out where the parentheses are going). In that case, I stand by my previous comment: one can work deterministically, starting with a basis of n vectors rather than N random vectors. At the j-th stage, one has an affine basis consisting of one particular solution and a basis of n-j homogeneous solutions. When one receives the j-th equation, one can do Gaussian elimination by adding a suitable multiple of one of the homogeneous solutions to the particular solution and also to the other homogeneous solutions in order to get an affine basis for the new system. This has a run time of O(n^2 m), compared to the run time O(N n m) of the randomised algorithm (in the bounded q case).
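In Python-flavored code, the update I have in mind would look something like this (an untested sketch, over F_p for concreteness; the variable names are mine):

```python
def affine_basis_solve(eqs, n, p):
    """Maintain a particular solution plus a basis of the homogeneous
    solution space, spending one basis vector per new constraint.
    Returns (particular, basis) describing all solutions of the system
    over F_p (p prime), or None if the system is inconsistent."""
    inv = [0] + [pow(i, p - 2, p) for i in range(1, p)]  # inverses mod p
    part = [0] * n
    basis = [[1 if j == i else 0 for j in range(n)] for i in range(n)]
    for a, b in eqs:
        dot = lambda v: sum(x * y for x, y in zip(a, v)) % p
        r = (b - dot(part)) % p
        # find a basis vector not annihilated by the new equation
        k = next((i for i, h in enumerate(basis) if dot(h) != 0), None)
        if k is None:
            if r != 0:
                return None      # new equation contradicts the old ones
            continue             # new equation is already implied
        h = basis.pop(k)
        c = inv[dot(h)]
        # shift the particular solution onto the new constraint ...
        part = [(x + r * c * y) % p for x, y in zip(part, h)]
        # ... and project the remaining homogeneous basis into its kernel
        basis = [[(x - dot(g) * c * y) % p for x, y in zip(g, h)]
                 for g in basis]
    return part, basis
```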
Terence,
Thanks for the comments. I liked the simple nature of the prune/recombine. But I see what you are saying. Also, does it work for the reals? I have to think about that.
It reminds me of Dalmau’s very beautiful algorithm for solving CSPs that have a Maltsev operation: add in the constraints one by one, always implicitly maintaining the solution space in terms of applying the Maltsev operation to some reference solutions. I wouldn’t be surprised if Prasad was thinking about similar things when coming up with this algorithm 🙂
This also reminds me of a notable problem from the area which I like very much. All of these algorithms process the constraints one at a time and seem very sequential. Is it possible that there is, say, an RNC algorithm for solving CSPs with Maltsev constraints? Or is there such a CSP which is P-complete? The conjecture that it might be possible to solve them all in parallel is inspired by another very different algorithm for solving linear systems: Csanky’s.
What I cannot see is what the probability is that these new linear combinations will satisfy the next equation (in the case of characteristic 0…).
Whoops, I overlooked this. Somehow I thought that x + y – z would be a solution if all of x, y, and z are solutions, but Terry was talking about t v + (1-t)v’…
I agree with Terry. Why not take several random current solutions v_1…v_t (for some reasonably large t), and t random coefficients a_1…a_t subject to a_1+…+a_t=1, and look at v = a_1*v_1+…+a_t*v_t? Clearly, v is still a solution, and also has some entropy, so that hopefully there are good odds it will satisfy the next constraint too. Of course, I have not done the analysis yet, so this is just a syntactic suggestion. Another such suggestion is that perhaps the leftover hash lemma might be useful here.
-Yevgeniy
This sounds a lot like differential evolution.
Terence’s suggestion for the characteristic-0 case sounds like differential evolution mutation (sum a difference of two solutions to a third one) with a geometric crossover operator (the convex combination)
… as already suggested by Uzair
“Very nice”
That’s what I was going to say.
Hmm, could you do a quick implementation in Matlab and see if it works? thx
What are the odds of the algorithm falsely reporting “System infeasible”?
Never mind. I see the answer’s in the paper.
Can this work be called a breakthrough?
It looks like one over finite fields but Gaussian elimination seems to remain useful in characteristic 0.
By the way, I wonder if this method won’t get slower as the characteristic increases. It looks efficient especially in low positive characteristics.
It would be worth exploring Terry’s idea of making convex combinations t v + (1-t)v’. It seems it should work in arbitrary characteristic.
Eventually when this is all sorted out by the mathematicians and TCS folks, it might be a nice way to impose different types of constraints on families of solutions, thereby providing a path from the current sparsest solutions to underdetermined systems (aka compressed sensing) to the currently empirical model-based approaches (aka structured sparsity etc…)
Can this algorithm have any practical applications?
My initial thought had been to sample sufficient vectors to span the entire solution space (similar to the coupon collector’s problem) times q (so you still have sufficient samples after pruning without accidentally losing any dimensions of the new solution space), and then in the recombination step select random convex combinations. This would reduce C to be linear in q. But maybe my thinking is flawed…
Typo: In Thm 0.1, Completeness and Soundness labels need to be swapped.
Benjamin Jonen and I have implemented some example code in Python for this solver; you can read the source files at: https://code.google.com/p/gc3pie/source/browse/#svn%2Ftrunk%2Fgc3pie%2Fexamples%2Foptimizer%2Flinear-systems-solver
There’s also an implementation of the algorithm as modified by Fliege (arXiv 1209.3995v1), which works over the reals, and some variants with which we are experimenting.
What is the status of this algorithm? The link to Prasad’s draft paper is dead, and I wasn’t able to find a corresponding paper on his webpage (but I may have overlooked it).
It appears that this has been generalized to the reals as well by Jorg Fliege (https://arxiv.org/pdf/1209.3995.pdf).