Impossible but true: a new approach to linear systems

Prasad Raghavendra is an expert in many aspects of complexity theory, especially the foundations of approximation theory. He recently was a colleague at Georgia Tech, but now has moved on to Berkeley. He will be greatly missed at Tech.

Today I want to talk about a brilliant new result that Prasad has on linear equations.

I was recently at the advisory meeting for the Berkeley Simons Theory Institute. During the meeting we had two presentations on new areas, besides talks on planned special projects. One was given by Prasad on theory, but almost in passing he mentioned that he had a way to solve linear systems. I liked the whole talk, but his almost causal comment surprised me. It seemed to me to be an entire new approach to solving linear systems. My initial thought was it must be something well known, but I did not know it.

As soon as he finished his talk we had a break, and I ran up to thank him for a wonderful talk. I quickly asked was his result on linear systems new? Or was it something I had missed? He answered to several of us, who now were waiting to hear his answer, that it was something that he had just proved. I was relieved: I am not completely out of it.

I asked if we could write about this result and he agreed. Even better, he wrote up a draft paper with just the algorithm and its analysis, which are part of larger results and projects that were the body of his talk.

## Solving Linear Systems

The solving of linear systems of equations is ancient, dating back to 600 BCE. It is of extreme importance, and still an active area of research. For arbitrary systems Gaussian Elimination is still quite powerful. A historical note, apparently Carl Gauss only used the method named after him on six-variable problems. In those days there was no notion of, I can solve the general case in time cubic in ${n}$. There was no “n”: of course there was the letter “n,” but not the notion of solving a problem of size determined by ${n}$.

Of course now we have faster methods than this for general systems, methods that descend from the famous breakthrough of Volker Strassen. For special systems there are almost-linear-time methods, but these all require that the system have special structure, and work over the reals.

## The Idea, and a Problem

Prasad’s work is for solving linear systems over finite fields, specifically ${\mathbb{F}_p}$ with ${p}$ prime. For example, consider these equations over the field of two elements:

$\displaystyle \begin{array}{rcl} u + w + z &=& 1 \\ v + w + x + y &=& 0\\ u + v + w + x + y + z &=& 1 \\ \end{array}$

Such systems are of great importance in theory, and it shocked me that he found an entirely new approach to solving them. Note that Gaussian elimination happens to work well in this example: the second and third equations yield ${u + z = 1}$, and this gives ${w = 0}$. However, in general this requires sifting through each equation multiple times, and defines all the solutions. Can we do better by considering each equation only once, and by needing only some solution?

So what does he do?

He starts with a random set of vectors ${V_{0}}$. Then he checks which ones satisfy the first equation. If the vectors are random, then about half of them will satisfy the equation. He throws away all the vectors that do not satisfy the equation. Call the remaining set ${S_1}$. This is not a typo—we have a reason for using ${S}$ not ${V}$—you will see soon.

The obvious idea is to iterate with ${S_1}$ on the second equation: About half the vectors in ${S_1}$ will satisfy it; let ${S_2}$ be those that do. Vectors in ${S_2}$ still satisfy the first equation, so the winnowing-down process that continues by taking the vectors in ${S_2}$ that satisfy the third equation finds solutions to all considered equations. The problem is that with high probability it winnows down to zero before all ${m}$ equations are satisfied—unless you start with ${V_0}$ having order-of ${2^m}$ vectors, which is too many.

How to solve this problem? The surprising answer is that one doesn’t need to start with a large initial ${V_0}$ to maintain a high probability of catching a solution to all the equations. It suffices to maintain a suitable large-enough set ${V_{k-1}}$ of solutions to the first ${k-1}$ equations. But how does he do this?

## The Amplifier, and a Trick

The key is to use an amplifier of the kind we discussed here. After he takes the subset ${S}_{1}$ of ${V_0}$, he rebuilds a set ${V_1}$ of the same size as ${V_0}$ by taking combinations of vectors in ${S_1}$ that are “random enough.” Then he proceeds on to the next equation. In this way he gets a series of sets of vectors,

$\displaystyle { V}_{0}, { V}_{1}, \dots, { V}_{m},$

so that ${V}_{k}$ satisfies the first ${k}$ equations, the sets are large, and they are almost random. The last set is nonempty with high probability, and so has a solution to the linear system.

To make this work he needs one trick, and this is why the result is limited to finite fields. To see it, first note that if all the equations are set equal to zero, then any linear combination of solutions will be a solution. If the constants are non-zero, however, this fails. For instance in the above equations, the all-1 vector satisfies the first two, as does the vector with ${u}$ and ${z}$ flipped to ${0}$. The sum of these vectors, which has ${u = z = 1}$ and the other variables ${0}$, fails to satisfy the first equation.

The trick is that if you sum three solutions to the first two equations, then you always get a solution to both. Generally over ${\mathbb{F}_p}$ with characteristic ${p}$, the trick is to form random sums of ${p+1}$ vectors. The point is these vectors all satisfy some equation

$\displaystyle a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = b$

then the sum of ${p+1}$ vectors will give ${b(p+1)}$ on the right-hand side. But in characteristic ${p}$, ${b(p+1) = b}$. Hence the original equation remains satisfied.

The algorithm to form ${V_k}$ from ${S_k}$ while still satisfying the ${k}$th equation—and all equations before it—is just to form random sums of ${p+1}$ vectors in ${S_k}$ until ${V_k}$ has been amplified to the needed size.

So that is how he does it. Pretty cool?

He calls the amplification step a “recombination” step. Essentially he picks random vectors and adds them together. The evolutionary analogy is that this process performs enough “mixing” to preserve the needed amount of entropy.

## The Algorithm

Here is the actual algorithm from his paper draft.

As always see his paper for the details and full proof that the method works. This is the hard part. I take that back, both the creation of the algorithm and its proof of correctness are equally tricky. Indeed the fact that there is a new algorithm is perhaps most surprising of all.

## Open Problems

This approach of guess, delete, recombine reminds me of what are sometimes called genetic algorithms. Are there applications of Prasad algorithm? Or of his general method to other problems?

43 Comments leave one →
August 9, 2012 9:28 am

Reblogged this on Room 196, Hilbert's Hotel.

August 9, 2012 10:02 am

The recombination step also reminds me of the Ajtai, Kumar and Sivakumar sieving algorithm for the shortest vector problem in a lattice: the idea is to generate a set of random lattice points, compute new points as differences between pairs of points in the set, keep the shortest ones, and reiterate.

• August 9, 2012 10:09 pm

I was thinking the same thing — and if I am not misled, the AKS algorithm was quite directly inspired by Blum-Kalai-Wasserman’s algorithm for learning noisy parities, which seems even more similar to — yet still somehow different from — Prasad’s algorithm.

(Indeed, the BKW–>AKS lineage has a Kumar-Sivakumar paper in the middle, which applied BKW to solve the random equations output by Ajtai’s reduction from lattice problems.)

3. August 9, 2012 10:07 am

That is really slick!
In the algorithm description, though, is N=cn correct? In the statement of Thm 0.1 below, N is defined to be something else.

• August 9, 2012 10:28 am

I parse the occurrence in the formula of “ln q n” as (ln q)n rather than ln(q n). Since q is constant, it has the form Cn. The question which your humble bloggers didn’t have time to delve before posting is how things change when q = p^k for p prime. I suspect (indeed expect) it still works with perhaps a different constant C(q) or C(p,k) on the n, but we took the Wikipedia ethic of not trying to include “original research”.

• Dániel Varga permalink
August 9, 2012 10:29 am

Yeah, that’s a really confusing part of the writeup. I think the ambiguity is caused by a missing parenthesis: N = (145 q^2 ln q)n.

• August 9, 2012 11:21 am

Yeah, that makes sense. Otherwise the randomly-chosen vectors wouldn’t have the correct solution in their span. In that case, the run time seems to be poly(q) * n * m, where m is the number of nonzero entries in the matrix. So I guess this is something we knew how to do already?

4. August 9, 2012 10:09 am

The connections of this method with genetic and evolutionary computation are really interesting.

The first set of randomly selected vectors is like the first generation of an evolutionary algorithm (EA). Then, at each generation, some vectors are selected using as selection scheme the i-th constraint of the linear system. These selected vectors are then recombined among them to form new vectors for the next generation.

A difference with EAs is that there is crossover/recombination but no mutation (although maybe that mutation can be useful also in the proposed method). However, the main difference is that in EAs usually it’s impossible to know (apart some special/artificial cases) how many generations are needed to reach a good-enough solution, while in the proposed method the number of generations is exactly the number of equations in the linear system.

Now a strange idea… it should be interesting to perform a map in order to try to predict the performance/complexity of an evolutionary algorithm.

Given the stocastich behaviour A of an evolutionary algorithm (some formal description of its working scheme), and given some description P of the problem to solve (for example some properties of its fitness landscape), it should be interesting to study a mapping of the couple (A,P) into a linear system over a finite field so that:
1) the number of equations of this system can be adopted as an estimate of the running time of the evolutionary algorithm,
2) solve the original problem using the Prasad’s method.

Perhaps, it’s too fanciful

5. Dániel Varga permalink
August 9, 2012 10:30 am

It seems to be very important that the algorithm is online in the set of constraints. And maybe somebody can even start something with the fact that the algorithm does not use division.

August 9, 2012 10:37 am

Brilliant! This looks like a good candidate to be a “why didn’t I think of that?” Algorithm. You know, those things that look so obvious in retrospect but you probably wouldn’t have thought of in a million years.

7. August 9, 2012 10:52 am

It seems like one can derandomize the recombination step using an epsilon-biased set.

8. August 9, 2012 11:07 am

In the paper he only actually states it for prime fields, not general finite fields. Is there some reason it doesn’t work for more general finite fields? (If it does work for more general finite fields, is there any way it could be adapted to more general fields of nonzero characteristic? Probably not, but…)

Also apologies but I am getting myself all mixed up figuring out the running time. I know this should be simple but would anyone care to state it, since he doesn’t in the paper? Am I correct in at least inferring that for fixed q it’s polynomial but for variable q it is not?

9. August 9, 2012 11:21 am

Very nice! The idea of using a small random subset of a subspace as a proxy for that subspace is clever.

I don’t see what stops this working in characteristic 0, though. At each stage, to inflate the set of solutions, one can sum a random solution with differences of two other random solutions, in place of summing together p+1 random solutions. Then, when a new equation comes in, one can take two random solutions v, v’ and find a convex combination t v + (1-t) v that also solves the new equation. I didn’t do the analysis, but it seems that generically one should still be able to maintain enough “generic diversity” in the solution set at each stage to keep spanning the a big portion of solution space, after suitably selecting some amplification parameters. But perhaps the devil is in the details…

Also, the same idea seems to work (in principle, at least) to find the intersection of preimages of various homomorphisms from a large, not necessarily abelian group to various small groups (instead of from F_p^n to F_p), though I’m not sure how often this situation would actually arise in a genuinely nonabelian setting (for instance, it couldn’t happen at all in a simple group)…

• August 9, 2012 11:27 am

Actually, thinking about it a bit more I don’t see exactly why randomness is necessary. Couldn’t one simply, at each stage of the process, store an affine basis for the solution space (a particular solution and a basis for the homogeneous solutions), and update that basis each time a new constraint is added (by shifting all the stored data by a suitable multiple of one of the basis elements, which is then discarded)? This seems to have comparable time and memory requirements to the random algorithm, if I’m not mistaken.

• John Sidles permalink
August 9, 2012 12:09 pm

As I understand (imperfectly no doubt) the above remarks, the process outlined bears a considerable similarity to (what are called) Krylov subspace methods. Perhaps some person more knowledgeable than me will comment upon this!

August 9, 2012 2:21 pm

Thanks for the comments.

I think a similar algorithm might work for the characterestic zero case, but I do not yet see the proof details.

The algorithm was discovered in an attempt to find a more general algorithm for constraint satisfaction problems. Therefore, the algorithm tries to use as little algebraic structure as possible, apart from the fact that there is an operation that preserves the solution space. The description of the algorithm itself just uses the operation preserving the solution space, but the analysis uses the algebraic structure heavily.

August 9, 2012 6:44 pm

Hi Terry: could you elaborate a little further on how you perform this update method on the affine basis? I don’t see how you can do it faster than Gaussian elimination.

• August 9, 2012 8:52 pm

Ah, I got confused by the line “N = Cn” in the algorithm description, which suggested that N was linear in n (leading to an O(n^2 m) algorithm), which is essentially the same speed as Gaussian elimination (or the basis updating algorithm I mentioned above). Looking more closely at the paper, it seems instead that N is actually selected to be logarithmic in n (for fixed q), which is the source of the speedup, but unfortunately one which is constrained to the low q setting.

• August 9, 2012 8:59 pm

Gah, no, I take that back… on even closer reading it seems that N is ultimately taken to be $\lceil (145 q^2 \log q) n \rceil$ (it took me a while to figure out where the parentheses are going). In that case, I stand by my previous comment: one can work deterministically and starting with a basis of n vectors, rather than N random vectors. At the j^th stage, one has an affine basis consisting of one particular solution and a basis of n-j homogeneous solutions. When one receives the j^th equation, one can do Gaussian elimination by adding a suitable multiple of one of the homogeneous solutions to the particular solution and also to the other homogeneous solutions in order to get an affine basis to the new system. This has a run time of O(n^2 m), compared to the run time O(N n m) of the randomised algorithm (in the bounded q case).

August 9, 2012 12:11 pm

Terrence,

Thanks for comments. I liked the simple nature of the prune/recombine. But I see what you are saying. Also does it work for reals—have to think about that

• Ryan O'Donnell permalink
August 9, 2012 7:36 pm

It reminds me of Dalmau’s very beautiful algorithm for solving CSPs that have a Maltsev operation: add in the constraints one by one, always implicitly maintaining the solution space in terms of applying the Maltsev operation to some reference solutions. I wouldn’t be surprised if Prasad was thinking about similar things when coming up with this algorithm

This also reminds me of a notable problem from the area which I like very much. All of these algorithms process the constraints one at a time and seem very sequential. Is it possible that there is, say, an RNC algorithm for solving CSPs with Maltsev constraints? Or is there such a CSP which is P-complete? The conjecture that it might be possible to solve them all in parallel is inspired by another very different algorithm for solving linear systems: Csanky’s.

August 9, 2012 1:38 pm

What I cannot see is that what is the probability that these new linear combinations will satisfy the next equation (in case of characteristic 0…).

August 10, 2012 7:55 am

Whoops, I overlooked this. Somehow I was thought that x + y – z will be a solution if all x, y, and z are solutions, but Terry was talking about t v + (1-t)v’…

• August 9, 2012 2:58 pm

I agree with Terry. Why not to take several random current solutions v_1…v_t (for some reasonably large t), and t random coefficients a_1…a_t subject to a_1+…a_t=1 and look at v = a_1*v_1+…+a_t*v_t? Clearly, v is still a solution, and also has some entropy, so that hopefully there are good odds it will satisfy the next constraint too. Of course, I have not done the analysis yet, so this is just a syntactic suggestion. Another such suggestion is that perhaps leftover hash lemma might be useful here.
-Yevgeniy

August 9, 2012 4:45 pm

This sounds a lot like differential evolution.

• August 9, 2012 6:20 pm

Terence’s suggestion for the characteristic 0’s case sounds like differential evolution mutation (sum a difference between two solutions to a third one) with a geometric crossover operator (the convex combination)

• August 9, 2012 6:21 pm

… as already suggested by Uzair

August 10, 2012 6:15 am

“Very nice”

That’s what I was going to say.

• Jonathan Bachini permalink
August 10, 2012 8:13 am

Hmm, could you do a quick implementation in Matlab and see if it works ? thx

August 9, 2012 6:38 pm

What are the odds of the algorithm falsely reporting “System infeasible”?

August 9, 2012 6:41 pm

Never mind. I see the answer’s in the paper.

August 10, 2012 6:45 am

Can this work be called a breakthrough?

August 10, 2012 8:27 am

It looks like one over finite fields but Gaussian elimination seems to remain useful in characteristic 0.

August 10, 2012 7:34 pm

By the way I wonder if this method won’t get slower as the characteristic increases. It looks efficient especially in low positive characteristics.

August 11, 2012 2:40 am

It would worth to explore Terry’s idea on making convex combinations t v + (1-t)v’. It seems it should work for arbitrary characteristic.

12. August 10, 2012 7:28 am

Eventually when this is all sorted out by the mathematicians and TCS folks, it might be a nice way to impose different type of constraints on families of solutions thereby providing a path between the current sparsest solutions to underdetermined systems (aka compressed sensing) to currently empirical model based approaches (aka structured sparsity etc…)

August 12, 2012 8:59 am

Can this algorithm have any practical applications?

14. Mal Malego permalink
August 13, 2012 6:17 am

My initial thought had been to sample sufficient vectors to span the entire solution space (similar to the coupon collector’s problem) times q (so you still have sufficient samples after pruning without accidentally loosing any dimensions of the new solution space), and then in the recombination step select random convex combinations. This would reduce C to be linear in q. But maybe my thinking is flawed…