The Junta Problem
Learning in the presence of extraneous information: the Junta problem
Avrim Blum is one of the top experts in computational learning theory, and has also made important contributions to many other areas of theory, including approximation algorithms, computational game and economic theory, and complexity theory, to name a few. However, you may not know that he has done extremely influential work on an important AI problem: more about this in a moment.
Today I will talk about the Junta Problem. The problem was known for years before it was given this flashy name, but to me it will always be “Avrim’s Problem.” I am not sure if he created it: I asked him, and he says that Lisa Hellerstein was also there at the beginning. In any event it is a terrible problem.
The problem is terrible because it is so natural, so simple to state, and so hard to solve. When I heard about it for the first time I could not believe that it was open. But open it is, very open. We know something about the problem, yet a full solution remains elusive.
As promised, before I start discussing the Junta problem, I want to say something about his AI work. Years ago, he and Merrick Furst set out to try and understand what the key problems in AI were. For the rest of this I will call them the “team.”
At the time they set out to discover if theory could help AI, they were both at Carnegie-Mellon, which is one of the top places in the world for AI research. Then and still now. They told me that they started to have weekly meetings with the AI folks, trying to discover if theory could help solve any AI problems. As you might imagine, at first this was like a U.N. meeting without the benefit of translators: it might as well have been that the AI researchers spoke German and the team only understood French. The language and world view of AI and theory are very different: the notions, the methods, the tools, the names, and more.
However, after many meetings and lots of hard work, the team finally got it. They figured out what the “planning problem” was, and found a viable approach. The planning problem is a central problem in AI: it covers everything from moving robot arms, to driving autonomous vehicles, to using digital agents. Formally, think of a set of objects and a set of operations. The problem is to get the objects from some initial state to some final state. Not surprisingly, the planning problem is not merely NP-hard: it is PSPACE-complete. Good luck.
However, their colleagues had problems that they needed to solve, so the team set out to make their approach work on “practical problems.” They succeeded wildly: they created a novel approach to planning that is now the core of planning methods in AI. Their work is one of the most cited papers in AI, and has helped make planning possible for problems many orders of magnitude larger than before.
Their new idea was to introduce what became known as a planning graph. This is an ordinary directed graph that captures the planning problem’s essential structure, reducing the planning problem to a kind of “s-t” connectivity problem. Of course this could not be exactly right, since the problem is so hard, but the graph yields new insights that help prune the exponential searches tremendously. Go team.
The Junta Problem
The Junta problem is to learn an unknown boolean function $f(x_1, \dots, x_n)$ that depends on only $k$ of the $n$ variables, from uniform random samples of the form $(x, f(x))$. Thus, for some indices $i_1 < i_2 < \cdots < i_k$ and some boolean function $g$,

$$f(x_1, \dots, x_n) = g(x_{i_1}, \dots, x_{i_k}).$$
The variables $x_{i_1}, \dots, x_{i_k}$ are called the Junta, and $g$ the hidden function. Note that only uniform random sampling is allowed; no queries are made. This is why the Junta problem is hard, and also why it is interesting.
The problem can be reduced to learning just the Junta, since if the Junta is known, then recovering the function takes time polynomial in $n$ and $2^k$. All such statements can be made more precise, since they are really statements “with high probability.” We will not be so careful; please be sure to check the papers for the exact statements.
The reason this is an important problem is that it represents, perhaps, the simplest case of learning in the presence of a vast amount of extraneous information. An example of this could be trying to discover biological pathways: very few features may actually predict which proteins are in the pathway, yet those few are buried in a vast amount of irrelevant data.
The Junta problem is an attempt to make a clean theory problem that, at least in spirit, captures this type of situation. The fact that the values of $x$ are selected from the uniform distribution makes the problem an idealization of more practical problems. But as I stated earlier it is still very difficult.
The obvious bound is order $n^k$. This is the brute force algorithm that tries all $\binom{n}{k}$ possible subsets of size $k$. Once you guess a subset, checking that it is correct and finding the actual function is easy, with high probability. The primary challenge is to do better than this bound.
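To fix ideas, here is a minimal sketch of the brute force algorithm. The instance below (a majority function on a hypothetical 3-junta out of 8 variables) and the sample size are toy choices of mine, not from any paper.

```python
import itertools
import random

# Toy instance (my own choice): n = 8 variables, hidden 3-junta {1, 4, 6},
# with g = majority of those three variables.
n, k = 8, 3
f = lambda x: int(x[1] + x[4] + x[6] >= 2)

# Draw uniform random samples (x, f(x)).
random.seed(0)
samples = [(x, f(x)) for x in (tuple(random.randint(0, 1) for _ in range(n))
                               for _ in range(200))]

def learn_brute_force(samples, n, k):
    """Try every size-k subset; return the first one that is consistent with
    all samples. With high probability only a correct Junta is consistent."""
    for S in itertools.combinations(range(n), k):
        table = {}   # partial truth table of the candidate hidden function g
        ok = True
        for x, y in samples:
            key = tuple(x[i] for i in S)
            if table.setdefault(key, y) != y:   # S cannot explain this sample
                ok = False
                break
        if ok:
            return S, table
    return None, None

S, table = learn_brute_force(samples, n, k)
```

Checking one subset takes time linear in the number of samples, and there are $\binom{n}{k}$ subsets, which is where the order-$n^k$ bound comes from.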
A Key Simple Case
There is an important special case of the Junta problem that can be solved in polynomial time. Suppose that the hidden function is a linear boolean function. In this case there is an algorithm that solves the problem in polynomial time, with high probability. The key observation is that for each input $x$ and output value $b = f(x)$, we can form the equation

$$a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = b \pmod 2.$$

In this case, $f$ is just the XOR function of a $k$-sized subset of the $x_i$’s. The subset is determined by the values of the $a_i$’s. Note that, viewed as an equation in the unknowns $a_1, \dots, a_n$, this is a linear equation. After sufficiently many random samples the collection of all such equations will have full rank. Essentially, we will get a system like this:

$$\begin{aligned} a_1 x^{(1)}_1 + \cdots + a_n x^{(1)}_n &= b_1 \pmod 2 \\ &\ \vdots \\ a_1 x^{(N)}_1 + \cdots + a_n x^{(N)}_n &= b_N \pmod 2. \end{aligned}$$
Solve the resulting equations for the $a_i$’s, which will yield the location of the Junta: $a_i = 1$ if and only if $x_i$ is in the Junta. Then, it is easy to find the actual function, as we pointed out earlier.
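Here is a minimal sketch of this linear case, assuming a toy parity instance of my own choosing: build one equation per sample, then run Gaussian elimination mod 2.

```python
import random

# Toy parity instance (my own choice): n = 10, f = x2 XOR x5 XOR x9.
n = 10
f = lambda x: x[2] ^ x[5] ^ x[9]

random.seed(1)
rows = []   # each row: coefficients of x_1..x_n plus right-hand side b = f(x)
for _ in range(4 * n):   # a few times n samples gives full rank w.h.p.
    x = [random.randint(0, 1) for _ in range(n)]
    rows.append(x + [f(x)])

def solve_gf2(rows, n):
    """Gaussian elimination mod 2: solve sum_i a_i x_i = b for the a_i."""
    rows = [r[:] for r in rows]
    pivots = []
    r = 0
    for c in range(n):
        for i in range(r, len(rows)):
            if rows[i][c]:
                rows[r], rows[i] = rows[i], rows[r]
                break
        else:
            continue   # no pivot in this column (cannot happen at full rank)
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                rows[i] = [u ^ v for u, v in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    a = [0] * n
    for r_idx, c in enumerate(pivots):
        a[c] = rows[r_idx][n]
    return a

a = solve_gf2(rows, n)
recovered = {i for i in range(n) if a[i] == 1}
```

Order $n$ samples suffice for full rank with high probability, and elimination is polynomial time, so the whole algorithm is polynomial.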
Mossel, O’Donnell, and Servedio
Elchanan Mossel, Ryan O’Donnell, and Rocco Servedio (MOS) made a breakthrough on the Junta problem. The obvious bound, as we already said, is $n^k$; they were the first to break through this. There were some earlier results that improved on this bound only in lower-order terms, but theirs was a true breakthrough.
MOS proved a wonderful result. They also named the problem the Junta problem in their paper. It’s rare to see a paper that both gets a great result, and re-names a problem in such a memorable way. Their result is the following.
Theorem: The Junta problem can be solved in time $n^{ck}$, where $c = \frac{\omega}{\omega + 1} \approx 0.704$.

In the above result, $\omega$ is the Strassen constant, i.e. the exponent of fast matrix multiplication.
Their result is based on a general idea from problem solving. When faced with a problem that you cannot solve with one strategy, find two strategies. Suppose one is trying to prove that all $x$ have some property. You then must prove three things:

- If $x$ falls into the first case, use strategy I;
- If $x$ falls into the second case, use strategy II;
- Finally, every $x$ falls into one of these two cases.
This is an old idea that has been used many times before. One of my personal favorite examples is from number theory. John Littlewood once proved a great theorem in number theory by dividing the world into two cases: (i) the Riemann Hypothesis is true; (ii) the Riemann Hypothesis is false. In the first case primes are “well-behaved,” so he could solve his problem. In the second case they are not “well-behaved,” but he had such a strong hypothesis that he could still solve the problem.
The main idea of MOS’s proof is this. Consider the hidden boolean function $f$. They have two strategies for finding the Junta.
In the first case, suppose that the degree of $f$ as a polynomial over GF(2) is “small,” say at most $d$. They generalize the linear case: collect many samples, and then solve the resulting linear system in roughly $n^d$ unknowns, one for each monomial of degree at most $d$. Of course this can be done in time $N^{\omega}$, where $N$ is the number of equations.
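Concretely, the system in this first strategy can be sketched as follows; the instance below ($n = 6$, $d = 2$, and the hidden function) is a toy choice of mine, not MOS’s actual parameters.

```python
import itertools
import random

# If f has degree <= d over GF(2), write f as a XOR of monomials of degree
# <= d with unknown coefficients; each sample gives one linear equation mod 2.
n, d = 6, 2
monomials = [S for r in range(d + 1)
             for S in itertools.combinations(range(n), r)]
# One unknown coefficient per monomial: order n^d unknowns in all.

def feature_row(x):
    """Evaluate every monomial of degree <= d at x: one row of the system."""
    return [1 if all(x[i] for i in S) else 0 for S in monomials]

# Toy hidden degree-2 function on the 3-junta {0, 3, 5}: f = x0*x3 XOR x5.
f = lambda x: (x[0] & x[3]) ^ x[5]
random.seed(2)
x = [random.randint(0, 1) for _ in range(n)]
row, rhs = feature_row(x), f(x)   # the equation: row . a = rhs (mod 2)
```

Collecting order $n^d$ such rows and running Gaussian elimination mod 2 recovers the coefficients, and the monomials with non-zero coefficients reveal the Junta.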
In the second case, suppose that the boolean function has a non-zero Fourier coefficient of “small” order. Then, they search over all Fourier coefficients of small order, and the non-zero ones that they find reveal the Junta.
The notions of small are carefully tuned to make all this work; see their paper for details. The key insight is that one of these two cases must hold, no matter what the hidden boolean function is. They show that it is impossible for a boolean function to have both high degree and no non-zero Fourier coefficient of small order. This is essentially an inequality relating the degree and the order of the smallest non-zero Fourier coefficient.
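The Fourier-search strategy can be illustrated as follows; the majority instance, the sample size, and the 0.25 threshold below are toy choices of mine.

```python
import itertools
import random

# Estimate the Fourier coefficient hat_f(S) = E[(-1)^(f(x) XOR parity of x
# on S)] from uniform samples, for every set S of small order; a clearly
# non-zero estimate exposes variables of the Junta.
n = 8
f = lambda x: int(x[1] + x[4] + x[6] >= 2)   # majority on the 3-junta {1,4,6}

random.seed(3)
samples = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(5000)]

def est_coeff(S):
    """Empirical average of (-1)^(f(x) XOR parity of x on S)."""
    total = sum((-1) ** (f(x) ^ (sum(x[i] for i in S) % 2)) for x in samples)
    return total / len(samples)

# Search all sets of order 1: for this majority function, the singletons
# {1}, {4}, {6} have true coefficient 1/2; every other singleton has
# coefficient 0, so a threshold of 0.25 separates them cleanly.
big = [S for S in itertools.combinations(range(n), 1)
       if abs(est_coeff(S)) > 0.25]
```

With enough samples the empirical estimates concentrate around the true coefficients, so the threshold is safe with high probability; MOS of course search higher orders as well.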
Lisa Hellerstein told me, recently at STOC, that after reading the MOS paper, years ago, she and her colleagues tried to do some experiments with machine learning algorithms on Junta problems. She said their methods failed badly on certain types of functions. In an attempt to understand why, she also tried to get references on the class of functions that seemed to be the worst case.
A chance conversation with Eric Bach one day led Lisa to find out that the same inequality proved in the MOS paper was already known. It is called Siegenthaler’s inequality, and was proved in the part of the security community that works on crypto-boxes, back in 1984. Further, there is a rich collection of ideas and results known about this inequality, and about which functions are the worst cases for it. See the book by Thomas Cusick and Pantelimon Stănică for a modern treatment of this inequality and related ideas.
I find this connection interesting for two reasons. First, it is a tribute to MOS that they were able to prove the lemma from scratch. Second, it also says something about research: Siegenthaler’s inequality was proved in the security community for the purpose of understanding crypto-boxes. Yet many of us (all?) in the theory community were unaware of this neat result. We owe thanks to Lisa for discovering this connection. Sometimes research requires a bit of luck, or is it persistence?
One special case that I think is interesting is when the hidden boolean function is symmetric. In this case Mihail Kolountzakis, Evangelos Markakis, and Aranyak Mehta proved a beautiful result on the Junta problem. They show that,
Theorem: In the symmetric case the Junta problem can be solved in time $n^{O(k/\log k)}$, which is $n^{o(k)}$.
I will not have time to explain the details of the proof, but it is based on Fourier methods only; there is no linear-solve strategy needed. They show,
Lemma: Every symmetric boolean function of $k$ variables, other than the constant functions and the parity function and its negation, has a non-zero Fourier coefficient of order at least $1$ and at most $O(k/\log k)$.
They conjecture that this bound is far from optimal. It is open whether there is always a non-zero Fourier coefficient of order at most polylogarithmic in $k$ for symmetric boolean functions.
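For tiny $k$ one can check the spirit of the lemma exhaustively; the sketch below (with $k = 5$, my own choice) computes, for every non-constant, non-parity symmetric function, the smallest order of a non-zero Fourier coefficient by brute force.

```python
import itertools

# A symmetric function is determined by its value profile on Hamming weights
# 0..k; compute its exact Fourier spectrum over all 2^k inputs.
k = 5
inputs = list(itertools.product([0, 1], repeat=k))

def min_order(profile):
    """Smallest order 1 <= |S| <= k with hat_f(S) != 0, where
    f(x) = profile[weight(x)]; None only for the constant functions."""
    for order in range(1, k + 1):
        for S in itertools.combinations(range(k), order):
            # 2^k * hat_f(S) = sum_x (-1)^(f(x) XOR parity of x on S)
            if sum((-1) ** (profile[sum(x)] ^ (sum(x[i] for i in S) & 1))
                   for x in inputs) != 0:
                return order
    return None

parity_like = [tuple(w % 2 for w in range(k + 1)),
               tuple(1 - w % 2 for w in range(k + 1))]
worst = max(min_order(p)
            for p in itertools.product([0, 1], repeat=k + 1)
            if len(set(p)) > 1 and p not in parity_like)
```

For $k = 5$ every such function already has a non-zero coefficient of order below $k$: majority, for instance, has one at order $1$, while parity (which the lemma excludes) has its only non-zero coefficient at order $k$.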
By the way, I helped start this line of research: with Markakis, Mehta, and Nisheeth Vishnoi, we eventually got bounds of $ck$ on the order of the Fourier coefficients, for a constant $c < 1$. Then, with important insights from Kolountzakis, an expert on Fourier methods, Kolountzakis, Markakis, and Mehta could prove $O(k/\log k)$. Finally, we all put several papers together to make the final contribution. The complete story of who did what when is a bit complex, and probably is best left unsaid. Research can be messy.
An Idea for Improvement of MOS
I have an idea that I want to share with you on how compressive sensing technology could play a role in improving the result of MOS. Recall that MOS have to solve a linear system of equations as one of their two strategies:

$$A a = b.$$
The matrix $A$ comes from the values of the random inputs $x$ that are selected, and the vector $b$ holds the corresponding values of the function $f$. Note, the key point is that the unknown vector $a$ is sparse. If $A$ is $N$ by $M$, where $M$ is at most order $n^d$, then the vector $a$ has at most $2^k$ non-zero components. Here $d$ is the degree of the boolean function. I will not explain what compressive sensing is today: see this for lots of information and tutorials. The key to compressive sensing technology is that it has methods for solving linear systems

$$A a = b$$

over the reals that have sparse solutions. The technology works when the matrix $A$ is “random.” Even better, the matrix need not be full rank for the technology to work.
The matrix $A$ that arises in the MOS approach to the Junta problem is not a truly random matrix. However, it does have a large amount of randomness. Is there enough randomness in the matrix to make compressive sensing work? The point can be made clearer if we consider the case where $d = 2$. Then, for each sample $x$, we get an equation of the form:

$$\sum_{i} a_i x_i + \sum_{i < j} a_{ij} x_i x_j = b \pmod 2,$$

where $x$ is the random input and $b$ is the output. There are order $n^2$ terms, but only $n$ random bits. As additional rows are added, with additional samples, the matrix becomes $N$ by order $n^2$. Is this matrix able to make the compressive sensing method work, since each row has only order $n$ random bits?
MOS solve their linear system by using, of course, fast matrix methods. They solve this part in time $N^{\omega}$. My thought is: if we could use the fact that $a$ is sparse compared to the size of the matrix, could we do much better? Note, if we could get closer to $N^2$ time for the linear solve, then we would improve the exponent of the MOS algorithm.
The problem with this approach is that I do not see exactly how to use the compressive sensing technology here. That basic technology is for real numbers and we seem to need to do all our calculations over a finite field. But, perhaps there is some hope. More in a future post on this approach.
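In the meantime, here is the flavor of the sparse-recovery step over the reals. The sketch below uses orthogonal matching pursuit, a standard greedy method that is only a stand-in for the full L1-minimization machinery; the matrix sizes, sparsity, and seed are my own toy choices.

```python
import numpy as np

# Recover an s-sparse vector from an underdetermined random linear system.
# This is not MOS's GF(2) solve; it only illustrates that sparsity lets us
# get away with far fewer equations than unknowns.
rng = np.random.default_rng(0)
M, N, s = 200, 100, 5               # M unknowns, N < M equations, s nonzeros
A = rng.standard_normal((N, M)) / np.sqrt(N)
a_true = np.zeros(M)
support = rng.choice(M, size=s, replace=False)
a_true[support] = 2.0 + rng.random(s)   # magnitudes bounded away from zero
b = A @ a_true

def omp(A, b, s):
    """Greedy sparse solve: repeatedly pick the column most correlated with
    the residual, then least-squares refit on the chosen columns."""
    residual, chosen = b.copy(), []
    for _ in range(s):
        chosen.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, chosen], b, rcond=None)
        residual = b - A[:, chosen] @ coef
    a = np.zeros(A.shape[1])
    a[chosen] = coef
    return a

a_hat = omp(A, b, s)
```

With these Gaussian toy parameters the greedy method recovers the planted sparse solution with high probability, even though the system has only half as many equations as unknowns.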
Solve the Junta problem, please. Or at least improve the current exponent of MOS. A more tractable problem may be to try solving the Junta problem over other distributions. Or try to solve the problem for other special classes of boolean functions. If you go this route, be aware that I have not given an exhaustive summary of all that is known about the Junta problem: for example, monotone functions are also easy. Or try something else.
Another open problem is to try using compressive sensing. This will require an extension, I believe, to the existing technology. It may not work, but it is worth the effort, since the payoff is large.
Finally, having worked hard on the symmetric case, I would like to see this case solved. The proof of our theorem that symmetric boolean functions have small non-zero Fourier coefficients is not easy, yet we believe that it is far from optimal. Perhaps there is a completely new approach that will solve the Junta problem for symmetric functions?