A 2010 Algorithm for the Knapsack Problem
An improved exponential time algorithm for the knapsack problem
Nick Howgrave-Graham and Antoine Joux are experts in the area of computational number theory and cryptography.
Today I plan to talk about their new algorithm for the knapsack problem. It is an exponential time algorithm with a much smaller exponent than any previous algorithm. The knapsack problem is one of my favorite problems, and their improvement is clever and surprising.
Nick has done some beautiful work in computational number theory. One of my favorites is his fast algorithm for factoring numbers of the form $p^r q$, where as usual $p$ and $q$ are primes. He has shown, with Dan Boneh and Glenn Durfee, that if $r$ is large enough, on the order of $\log p$, then such numbers can be factored in polynomial time.
Antoine has done some wonderful work on secure hash functions: for example, his team found the first collision for SHA-0. In cryptography hash functions require a variety of special properties to be useful. Distinct messages with the same hash value must exist by pigeon-hole, but they should be hard to find. If collisions were easy to find, then it might be possible to “fool” someone into accepting the wrong message as the intended message. Antoine's team exhibited two concrete messages that collide under SHA-0.
The exact messages are not important; the great result is that they can be found.
Their paper will appear in the next EUROCRYPT. I like one of the traditions of this conference, which is an evening “rump” session. The session consists of short presentations that range from: “here is my paper you did not accept,” to “here is my great new paper you will accept,” to jokes. The rump session presentations are helped by a generous supply of beer and by general good company.
One year, while I was eating dinner with a large group right before the rump session, we decided we had to present a talk during the evening session—but on what? After some discussion we agreed on an idea I suggested; I did not present the idea but one of my colleagues did. The idea was a new protocol for program committees; the protocol would allow committees to have provable deniability. Our protocol was simple:
After the committee was done deciding which papers to accept and which to reject, they would do the following. They would randomly select one accepted paper to reject and one rejected paper to accept.
The protocol would give each program committee member the ability to handle the two most dreaded questions they get:
- “Why did you reject my paper?”
- “Why did you accept their paper?”
Each of these questions could now be answered with a smile. As far as I know the program committee does not currently use our protocol, but I am not sure.
Let’s turn to the knapsack problem.
Recall the knapsack problem is the question of whether or not, given positive integers $a_1, \dots, a_n$, there is a subset $S$ of the indices $1$ to $n$ so that
$$\sum_{i \in S} a_i = b,$$
where $b$ is another given positive integer. The obvious algorithm would try all subsets $S$. Since there are $2^n - 1$ non-empty subsets, the algorithm takes a long time. An early algorithm, due to Ellis Horowitz and Sartaj Sahni in 1974, made a key observation. The above equation can be re-written as follows:
$$\sum_{i \in S_1} a_i = b - \sum_{i \in S_2} a_i,$$
where $S_1 \subseteq \{1, \dots, n/2\}$ and $S_2 \subseteq \{n/2+1, \dots, n\}$. (Yes, we assume that $n$ is even, but if $n$ is odd the method can easily be fixed.) Then, the algorithm computes two sets. The first consists of all the values of
$$\sum_{i \in S_1} a_i$$
where $S_1 \subseteq \{1, \dots, n/2\}$; and the second consists of all values of
$$b - \sum_{i \in S_2} a_i$$
where $S_2 \subseteq \{n/2+1, \dots, n\}$. Then simply check to see if these two sets have a value in common. The point is they have a value in common if and only if the knapsack problem has a solution. The punch-line is the cost for this is dominated by the number of values in each set: $2^{n/2}$. Note, checking if they have a common value can be done either by hashing or by sorting. But, in either case, this takes time just slightly larger than the total size of the sets.
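To make this concrete, here is a minimal sketch of the meet-in-the-middle idea in Python; the function name and the tiny example are my own, and a real implementation would be more careful about memory.

```python
def knapsack_meet_in_middle(a, b):
    """Decide whether some subset of a sums to b, Horowitz-Sahni style:
    enumerate the 2^(n/2) subset sums of each half and intersect."""
    n = len(a)

    def subset_sums(items):
        # Grow the set of reachable sums one item at a time.
        sums = {0}
        for x in items:
            sums |= {s + x for s in sums}
        return sums

    left_sums = subset_sums(a[:n // 2])
    right_sums = subset_sums(a[n // 2:])
    # b is a subset sum iff b - t is a left-half sum for some right-half sum t.
    return any(b - t in left_sums for t in right_sums)

# 3 + 7 + 13 = 23, so this prints True.
print(knapsack_meet_in_middle([3, 5, 7, 11, 13], 23))
```

Both the time and the memory here are about $2^{n/2}$, which is exactly the imbalance the next algorithm addresses.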
The New Algorithm
Nick and Antoine have proved the following:
Theorem: There is a randomized algorithm for the knapsack problem that runs in time $\tilde{O}(2^{0.337n})$ and uses space $\tilde{O}(2^{0.256n})$.
Note, $\tilde{O}$ is just the usual O-notation except the missing constant is allowed to be a missing polynomial. There is always a chance this could be abused: what if the missing polynomial is something huge like $n^{100}$? But in their case it is not. See their paper for the actual bounds. Another way to see they have not “cheated” is that they have a running implementation with reasonable performance.
I think this is a terrific result, since the bound they beat is almost thirty years old. Moreover, one wonders whether there could be further improvements to the exponent in the running time. Of course, I believe P=NP is possible, so even more surprising bounds could exist. In any event, a great result.
The Schroeppel and Shamir Algorithm
How do Nick and Antoine prove their theorem? I will try to sketch the main ideas for you, but please look at their paper for the details.
Their algorithm is based on an earlier algorithm of Richard Schroeppel and Adi Shamir. This algorithm uses the same method I just outlined, with a key twist. The previous algorithm created two sets $\mathcal{A}$ and $\mathcal{B}$. The first consists of all the values of
$$\sum_{i \in S_1} a_i$$
where $S_1 \subseteq \{1, \dots, n/2\}$; and the second consists of all values of
$$b - \sum_{i \in S_2} a_i$$
where $S_2 \subseteq \{n/2+1, \dots, n\}$. What Schroeppel and Shamir focused on was not reducing the running time, but the storage of the algorithm. Their twist is a method that still takes time $2^{n/2}$, but now only uses memory $2^{n/4}$. This was a huge improvement, since even for modest size $n$ the memory requirement of $2^{n/2}$ made the old algorithm impractical.
The key to reducing the memory needed is not to store all of $\mathcal{A}$ and $\mathcal{B}$. Instead, create a subroutine that generates the elements of these sets in sorted order; this reduces the storage required. It is achieved by viewing $\mathcal{A}$ and $\mathcal{B}$ as coming from smaller sets. For example, $\mathcal{A}$ is viewed as the set of all sums of an element of $\mathcal{A}_1$ and an element of $\mathcal{A}_2$, where $\mathcal{A}_1$ is the set of all subset sums of $a_1, \dots, a_{n/4}$ and $\mathcal{A}_2$ is the set of all subset sums of $a_{n/4+1}, \dots, a_{n/2}$. In a like manner $\mathcal{B}$ is viewed as the set of all sums from two sets $\mathcal{B}_1$ and $\mathcal{B}_2$.
This method, as I stated already, reduces the storage required to $2^{n/4}$, but leaves the running time the same. The algorithm must still look at $2^{n/2}$ values to determine whether or not the knapsack problem has a solution.
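Here is a minimal sketch, in Python, of the sorted-streaming engine behind this idea; the names are mine, and this shows only the inner subroutine, not the full Schroeppel–Shamir algorithm.

```python
import heapq

def sorted_pair_sums(A1, A2):
    """Yield every sum x + y with x in A1, y in A2 in nondecreasing order.

    Only A1, A2, and a heap with one entry per element of A1 are stored,
    so the |A1| * |A2| sums are streamed without materializing them all.
    With A1 and A2 holding the 2^(n/4) quarter-set subset sums, this
    streams the 2^(n/2) sums while using only about 2^(n/4) memory.
    """
    A1, A2 = sorted(A1), sorted(A2)
    # One heap entry per row: (current sum, index into A1, index into A2).
    heap = [(x + A2[0], i, 0) for i, x in enumerate(A1)]
    heapq.heapify(heap)
    while heap:
        s, i, j = heapq.heappop(heap)
        yield s
        if j + 1 < len(A2):
            heapq.heappush(heap, (A1[i] + A2[j + 1], i, j + 1))

# Stream the pairwise sums of two small quarter-sets in sorted order.
print(list(sorted_pair_sums([0, 3, 5, 8], [0, 7, 11, 18])))
```

In the full algorithm one streams the two sides in opposite orders, one ascending and one descending, and advances them like a merge step to detect a common value.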
The new clever insight is to add another twist that reduces the time to roughly $2^{0.337n}$ while very slightly increasing the storage requirements. Again consider finding elements among
$$a_1, a_2, \dots, a_n$$
that sum to $b$: of course they must handle all cases, but the case where the answer uses $n/2$ elements is the case that takes the longest. Define $S_{n/4}$ as the set of all sums of $n/4$ of the elements; note that $|S_{n/4}| \le \binom{n}{n/4} \approx 2^{0.811n}$. There are elements $\sigma_1$ and $\sigma_2$ that lie in $S_{n/4}$ so that
$$\sigma_1 + \sigma_2 = b.$$
There is a general trick for searching a large space of potential solutions provided there are many solutions. Pick a hash function and a random hash value, and only search those potential solutions with the given hash value. Here there are many solutions, since the $n/2$ elements of the answer can be split into two halves of $n/4$ elements each in $\binom{n/2}{n/4} \approx 2^{n/2}$ ways. In their algorithm they pick a number $M$ of size about the number of solutions:
$$M \approx \binom{n/2}{n/4} \approx 2^{n/2}.$$
Then, they pick a random $R$, and try to find solutions to the equations:
$$\sigma_1 \equiv R \pmod{M} \qquad \text{and} \qquad \sigma_2 \equiv b - R \pmod{M}.$$
The time to search for such a solution is bounded roughly by
$$\binom{n}{n/4}/M \approx 2^{0.311n}.$$
The details involve a clever use of the Schroeppel–Shamir algorithm to make their hash-based search work correctly.
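The following toy sketch in Python illustrates just the filtering idea; it brute-forces the $n/4$-element sums, whereas the real algorithm generates the residue-restricted sums with a fast Schroeppel–Shamir-style subroutine. The function name, the modulus, and the example instance are all my own inventions.

```python
import random
from itertools import combinations
from collections import defaultdict

def knapsack_half_split(a, b, M):
    """Toy version of the filtering idea for the hardest case: a solution
    using exactly n/2 items, split as two n/4-element sums sigma1 + sigma2 = b.

    Picks a random residue R and only considers n/4-element sums congruent
    to R (resp. b - R) mod M.  Here those sums are found by brute force."""
    n = len(a)
    R = random.randrange(M)
    # All n/4-element sums sigma1 with sigma1 = R (mod M).
    side1 = defaultdict(list)
    for combo in combinations(range(n), n // 4):
        s = sum(a[i] for i in combo)
        if s % M == R:
            side1[s].append(frozenset(combo))
    # All n/4-element sums sigma2 with sigma2 = b - R (mod M); match them up.
    for combo in combinations(range(n), n // 4):
        s = sum(a[i] for i in combo)
        if s % M == (b - R) % M:
            for other in side1.get(b - s, []):
                if other.isdisjoint(combo):  # the two quarters must not share items
                    return sorted(other | set(combo))
    return None  # this residue failed; the caller retries with a fresh R

# Toy run: items 1 + 8 + 16 + 9 = 34 form a 4-out-of-8 solution.
items, target = [1, 2, 4, 8, 16, 5, 9, 14], 34
solution = None
while solution is None:  # a single residue can fail, so retry
    solution = knapsack_half_split(items, target, M=7)
print(solution)  # indices of a solution, e.g. [0, 3, 4, 6]
```

The point of the random residue is exactly the hash trick above: since the true solution splits into two quarters in many ways, a random residue class still contains a split of it, yet the search space shrinks by a factor of about $M$.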
I hope this captures some of the flavor of what they do. See their well-written paper for the full details.
The obvious question is how low can the exponent be? Can knapsack be done in time $2^{n/4}$?
Another question is whether their ideas can be used to solve other related open problems.