Sarah Cannon is a current PhD student in our Algorithms, Combinatorics, and Optimization program working with Dana Randall. Sarah has a newly-updated paper with Dana and Joshua Daymude and Andrea Richa entitled, “A Markov Chain Algorithm for Compression in Self-Organizing Particle Systems.” An earlier version was presented at PODC 2016.
Today Ken and I would like to discuss the paper, and relate it to some recent results on soft robots.
For starters let’s call the paper (CDRR)—after the authors’ last names. Being lazy, let me also start by quoting part of their abstract:
We consider programmable matter as a collection of simple computational elements (or particles) with limited (constant-size) memory that self-organize to solve system-wide problems of movement, configuration, and coordination. Here, we focus on the compression problem, in which the particle system gathers as tightly together as possible, as in a sphere or its equivalent in the presence of some underlying geometry. More specifically, we seek fully distributed, local, and asynchronous algorithms that lead the system to converge to a configuration with small perimeter. We present a Markov chain based algorithm that solves the compression problem under the geometric amoebot model, for particle systems that begin in a connected configuration with no holes.
What does this mean? They imagine simple devices that lie on a 2-dimensional lattice. Each device operates with the same rules: it can decide what to do next based only on its local environment; however, the devices have access to randomness. The goal is that over time the devices should tend, with high probability, to form a tightly grouped system. This is what they call the compression problem. The surprise is that even with only local interactions, such devices can form a configuration that is close to as tight a configuration as possible. Roughly, $n$ devices will collapse into a region of perimeter order at most $\sqrt{n}$.
As I started to write this post, I discovered that there are some neat results on soft robots. As usual, theorists like CDRR think about $n$ objects. They study the behavior of many simple devices and prove theorems that hold for all large enough $n$. Their main one, as stated above, is that there are devices that can solve the compression problem and get within a region of perimeter order $\sqrt{n}$.
On the other hand, practical researchers often start by studying the case of $n = 1$. I think both ends of the spectrum are important, and they complement each other. Since I am not a device physicist, I will just point out the highlights of the recent work on soft robots.
The above photo is of a “light-powered micro-robot capable of mimicking the slow, steady crawl of an inchworm or small caterpillar.” See this for more details.
This is research done by a team in Poland led by Piotr Wasylczyk, who writes:
Designing soft robots calls for a completely new paradigm in their mechanics, power supply and control. We are only beginning to learn from nature and shift our design approaches towards these that emerged in natural evolution.
Their soft robot uses no motors or pneumatic actuators to make it move. Instead it relies on a clever liquid crystal elastomer technology: when exposed to light the device moves like a caterpillar. Further, the light can be adjusted to make the device move in different ways.
I have no idea if this work can be extended to larger $n$, nor whether it could be used to implement even small numbers of the devices that CDRR require. I thought you might enjoy hearing about such creepy devices. Let’s turn to the mathematical work on the compression problem.
CDRR assume that they have a fixed structure, which is an infinite undirected graph. Their devices or particles sit on the vertices of this structure. This, of course, forces the particles to be a discrete system. As you might guess, the usual structures are fixed lattices. These lattices have periodic structure that makes the systems that result at least possible to understand. Even with this regular structure the global behavior of their particles can be quite subtle.
The models of this type have various names; the one they use is called the amoebot model as proposed in this 2014 paper. I like to think of them as soft robots creeping along the lattice.
Okay the above is a cartoon. Here are some more-illustrative figures from the 2014 paper:
Their “particles” reside in a triangular lattice and either sit at one vertex or occupy two adjacent vertices. Figuratively, the worm is either pulled together or is stretched out. They can creep to an adjacent vertex by stretching out and later contracting. They cannot hop as in Chinese checkers.
We don’t know if the game of Go can be played sensibly on a Chinese checkers board, but anyway these worms cannot play Go. A necessity in Go is forming interior holes called eyes. Although the movement rules crafted by CDRR are entirely local and randomized, the configuration of worms can never make an eye.
Let $u$ be a node occupied by a contracted worm and $v$ an adjacent empty node. Then let $w$ and $w'$ be the two nodes adjacent to both $u$ and $v$. Define the neighborhood $N$ to include $w$ and $w'$, the other three nodes adjacent to $u$, and the other three nodes adjacent to $v$. Here is an alternate version of the rules in CDRR for it to be legal for the worm to expand into $v$ and then contract into $v$:
There are further rules saying initially that no node in $N$ may be part of an expanded worm, and covering possible adjacent expanded worms after the worm has expanded. However, their effect is to enable treating the nodes concerned as unoccupied, so that Markov chains built on these rules need only use states with all worms contracted. The rules are enforceable by giving each worm a fixed finite local memory that its neighbors can share.
The “not both” in the first rule subsumes their rule that the worm cannot have five occupied neighbors, which would cause an eye upon its moving to $v$. The second and third rules preserve connectedness of the whole and ensure that the move does not connect islands at $v$. The third rule also ensures that a path of open nodes through $v$ now can go through $u$ instead. The rules’ symmetry makes a subsequent move back from $v$ to $u$ also legal, so that the chain is reversible.
Their chain always executes the move if the number of triangles the worm was part of on node $u$ is no greater than the number of triangles it joins on $v$; otherwise the move happens with probability $\lambda^{t(v) - t(u)}$, where $\lambda > 1$ is a fixed constant and $t(\cdot)$ counts the triangles. Given the rules, it strikes us that using the difference in neighbor counts as the exponent is equivalent. The whole system is concurrent and asynchronous, but it is described by first choosing one worm uniformly at random for a possible move at each step, and then choosing an unoccupied neighbor at random.
Here is a configuration from their paper in which the only legal moves involve the third rule:
The edges denote adjacent contracted worms, not expanded worms. Suppose the chosen node $u$ is the one midway up the second column at left. It forms a triangle with its two neighbors to the left, so $t(u) = 1$. The node above is vacant, but moving there would close off a big region below it. The move is prevented by the second rule because $N$ would comprise four nodes that are not all connected within $N$. Similarly the node below is forbidden. Either node to the right of $u$ is permitted by rule 3, however. Since $t$ does not decrease, the move will certainly happen if either is randomly chosen.
This suggests there could be examples where the only legal moves go back and forth, creating cycles as can happen in John Conway’s game of Life. However, CDRR show that every connected hole-free n-worm configuration is reachable from any other. This makes the Markov chain ergodic. Their main theorem is:
Theorem 1 For all $\alpha > 1$ and sufficiently large $\lambda$, and sufficiently large $n$, when the chain is run from any connected, hole-free $n$-worm configuration, with all but exponentially vanishing probability it reaches and stays among configurations with total perimeter at most $\alpha$ times the minimum possible perimeter for $n$ nodes.
The second main theorem shows a threshold in $\lambda$ for the opposite behavior: for small enough $\lambda$ the perimeter stays at size $\Omega(n)$.
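To make the dynamics palpable, here is a small Python sketch of a compression-style chain. We emphasize it is our simplification, not CDRR’s algorithm: we use occupied-neighbor counts as the exponent (per the remark above) and substitute a global connectivity check for their local legality rules, so holes are not excluded.

```python
import random

# The six neighbor directions of the triangular lattice, in axial coordinates.
DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def neighbors(p):
    return [(p[0] + dx, p[1] + dy) for dx, dy in DIRS]

def connected(particles):
    # Depth-first search: is the configuration a single component?
    start = next(iter(particles))
    seen, stack = {start}, [start]
    while stack:
        for q in neighbors(stack.pop()):
            if q in particles and q not in seen:
                seen.add(q)
                stack.append(q)
    return len(seen) == len(particles)

def perimeter(particles):
    # Count occupied-to-empty lattice edges.
    return sum(1 for p in particles for q in neighbors(p) if q not in particles)

def step(particles, lam, rng):
    # Pick a particle and a random neighboring site; move if legal,
    # with a Metropolis-style bias toward gaining occupied neighbors.
    u = rng.choice(sorted(particles))
    v = rng.choice(neighbors(u))
    if v in particles:
        return
    t_u = sum(1 for q in neighbors(u) if q in particles)
    t_v = sum(1 for q in neighbors(v) if q in particles and q != u)
    if t_v < t_u and rng.random() >= lam ** (t_v - t_u):
        return
    moved = (particles - {u}) | {v}
    if connected(moved):  # global stand-in for CDRR's local rules
        particles.clear()
        particles.update(moved)

rng = random.Random(42)
particles = {(i, 0) for i in range(30)}  # start as a line: perimeter 122
start_perim = perimeter(particles)
for _ in range(20000):
    step(particles, lam=4.0, rng=rng)
print(start_perim, perimeter(particles))
```

Run from a line of 30 particles with $\lambda = 4$, the perimeter should drop well below its starting value of 122.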
The researchers CDRR are experts at the analysis of Markov chains. So they view their particles as such a system. Then they need to prove that the resulting Markov system behaves the way they claim: that as time increases they tend to form a tight unit that solves the compression problem.
Luckily there are many analytical tools at their disposal. But regarding the ergodicity alone, they say:
We emphasize the details of this proof are far from trivial, and occupy the next ten pages.
Their particles are pretty simple, but to prove that the system operates as claimed requires quite careful analysis. Take a look at their paper for the details.
I (Dick) will make one last detailed comment. They want their system to operate completely locally. This means that there can be no global clock: each particle operates asynchronously. This requires some clever ideas to make it work: they want each particle to activate in a random manner. They use the trick that random sequences of actions can be approximated using Poisson clocks with mean $1$. The key is:
After each action, a particle then computes another random time drawn from the same distribution and executes again after that amount of time has elapsed. The exponential distribution is unique in that, if particle $i$ has just activated, it is equally likely that any particle will be the next particle to activate, including particle $i$. Moreover, the particles update without requiring knowledge of any of the other particles’ clocks. Similar Poisson clocks are commonly used to describe physical systems that perform updates in parallel in continuous time.
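The quoted uniformity is easy to check numerically. In this little sketch of ours, each of three particles draws an independent mean-1 exponential delay, and we tally who fires next:

```python
import random

rng = random.Random(0)

def next_to_fire(num_particles, just_fired):
    # Each particle's next activation is an independent Exp(1) delay;
    # the one with the smallest delay activates next. The just_fired
    # argument is irrelevant by memorylessness -- that is the point.
    times = [rng.expovariate(1.0) for _ in range(num_particles)]
    return min(range(num_particles), key=lambda i: times[i])

counts = [0, 0, 0]
for _ in range(30000):
    counts[next_to_fire(3, just_fired=0)] += 1
print(counts)  # roughly [10000, 10000, 10000]
```

Even though particle 0 “just activated” every time, each particle is next about a third of the time.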
After looking at their paper in some depth, we find the result that local independent particles can actually work together to solve a global problem remains intriguing. Yes, there are many such results, but they usually assume global clocks and other coordination. The fact that compression is achievable by a weaker model is very neat.
Toward teaching computability and complexity simultaneously
Large Numbers in Computing source |
Wilhelm Ackermann was a mathematician best known for work in constructive aspects of logic. The Ackermann function is named after him. It is used both in complexity theory and in data structure theory. That is a pretty neat combination.
I would like today to talk about a proof of the undecidability of the famous Halting Problem.
This term at Georgia Tech I am teaching CS4510, which is the introduction to complexity theory. We usually study general Turing machines and then use the famous Cantor Diagonal method to show that the Halting Problem is not computable. My students over the years have always had trouble with this proof. We have discussed this method multiple times: see here and here and here and in motion pictures here.
This leads always to the question, what really is a proof? The formal answer is that it is a derivation of a theorem statement in a sound and appropriate system of logic. But as reflected in our last two posts, such a proof might not help human understanding. The original meaning of “proof” in Latin was the same as “probe”—to test and explore. I mean “proof of the Halting Problem” in this sense. We think the best proofs are those that show a relationship between concepts that one might not have thought to juxtapose.
The question is how best to convince students that there is no way to compute a halting function. We can define Turing machines in a particular way—or define other kinds of machines. Then we get the particular definition: $H(e,x) = 1$ if machine $M_e$ halts on input $x$, and $H(e,x) = 0$ otherwise.
How can we prove that $H$ is not computable? We want to convey not only that this particular $H$ is uncomputable, but also that no function like it is computable.
Trying the diagonal method means first defining the set $D = \{\, e : M_e \text{ does not accept } e \,\}$.
We need to have already defined what “accept” means. OK, we show that there is no machine whose set of accepted strings equals $D$. Then what? We can say that the complementary language $\bar{D}$ is not decidable, but we still need another step to conclude that $H$ is uncomputable. And when you trace back the reason, you have to fall back on the diagonal contradiction—which feels disconnected and ad hoc to the particular way $D$ and $H$ are defined.
Ken in his classes goes the $D$ route first, but Sipser’s and several other common textbooks try to hit $H$ directly. The targeted reason is one that anyone can grab:
It is impossible for a function $f$ on input $x$ to give the value $f(x) + 1$—or any greater value.
Implementations of this, however, resort to double loops to define the diagonalizing function. Or, like Sipser’s, they embed the diagonal-set idea in the proof anyway, which strikes us as making it harder than doing separate steps as above. We want the cleanest way.
Here is the plan. As usual we need to say that $M_e(x)$ represents a computation. If the computation halts then it returns a result. We allow the machine to return an integer, not just accept or reject. If the machine does not halt then we can let this value be undefined; our point will be that by “short-circuit” reasoning the question of an undefined value won’t even enter.
Now let $H$ be defined as the halting function above.
Theorem 1 The function $H$ is not computable.
Proof: Define the function $F$ as follows: $$F(x) = 1 + \sum_{e \leq x} H(e,x)\cdot M_e(x).$$
Suppose that $H$ is computable. Then so is $F$. This is easy to see: just do the summation, and when computing each term compute the part $H(e,x)$ first. If it is $0$ then it adds nothing to the summation, so it “short-circuits” and we move on to the next $e$. If it is $1$ then we compute $M_e(x)$ and add that to the summation. Let $S$ stand for the summation before the last term; then $F(x) = 1 + S + H(x,x)\cdot M_x(x)$.
Now if the theorem is false, then there must be some $e$ such that the machine $M_e$ computes $F$; note that then $M_e$ halts on every input, so $H(e,e) = 1$. But then, taking $x = e$: $$F(e) = 1 + S + M_e(e) = 1 + S + F(e) \geq 1 + F(e).$$
This is impossible and so the theorem follows.
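Here is a toy Python illustration of the short-circuit evaluation—our own, with a made-up list of three total “machines” and a stand-in oracle H that is correct for them:

```python
# Toy "machines": total functions from int to int, indexed by e.
machines = [
    lambda x: 0,
    lambda x: x + 5,
    lambda x: x * x,
]

def H(e, x):
    # Stand-in halting oracle; all of our toy machines halt everywhere.
    return 1

def F(x):
    # F(x) = 1 + sum over e <= x of H(e,x) * M_e(x), with short-circuiting:
    # machines[e](x) is only evaluated when H(e,x) == 1.
    total = 1
    for e in range(min(x, len(machines) - 1) + 1):
        if H(e, x) == 1:  # short-circuit guard
            total += machines[e](x)
    return total

# F differs from machine e on input e, so no machine in the list computes F.
for e in range(len(machines)):
    assert F(e) != machines[e](e)
```

Because every term is nonnegative and the $e$-th term includes $M_e(e)$ itself, $F(e) \geq M_e(e) + 1$ always.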
What Ken and I are really after is relating this to hierarchies in complexity classes. When the $M_e$ are machines of a complexity class $\mathcal{C}$, then the functions $H$ and $F$ are computable. It follows that $F$ is not computed by any $M_e$ and so does not belong to $\mathcal{C}$. What we want is to find similar functions $F$ that are natural.
Ackermann’s famous function $A$ does this when $\mathcal{C}$ is the class of primitive recursive functions. There are various ways to define the machines $M_e$, for instance by programs with counted loops only. The $F$ that tumbles out is not primitive recursive—indeed it out-grows all primitive recursive functions. Showing that $A$ does likewise takes a little more formal work.
In complexity theory we have various time and space hierarchy theorems, say $\mathsf{DTIME}(t) \subsetneq \mathsf{DTIME}(T)$ where $T$ is $t \log t$. For any time-constructible $T$, we can separate $\mathsf{DTIME}(T)$ from $\mathsf{DTIME}(t)$ by a “slowed” diagonalization. The $F$ obtained this way, however, needs knowledge of $T$ and its constructibility to define it. By further “padding” and “translation” steps, one can laboriously make it work for tighter $T$, and a similar theorem for deterministic space needs no log factors at all. This is all technical in texts and lectures.
Suppose we’re happy with $T = t^2$, that is, with a non-“tight” hierarchy. Can we simply find a natural $F$ that works for it? Or suppose $\mathcal{C}$ is a combined time and space class, say machines that run in time $t$ and space $s$ simultaneously. Can we possibly get a natural $F$ that is different from what we get by considering time or space separately?
We’d like the “non-tight” proofs to be simple enough to combine with the above proof for halting. This leads into another change we’d like to see. Most textbooks define computability several chapters ahead of complexity, so the latter feels like a completely different topic. Why should this be so? It is easy to define the length and space usage of a computation in the same breath. Even when finite automata are included in the syllabus, why not present them as special cases of Turing machines and say they run in linear time, indeed time $n$?
Is the above Halting Problem proof clearer than the usual ones? Or is it harder to follow?
What suggestions would you make for updating and tightening theory courses? Note some discussion in the comments to two other recent posts.
[some word fixes]
Marijn Heule, Oliver Kullmann, and Victor Marek are experts in practical SAT solvers. They recently used this ability to solve a longstanding problem popularized by Ron Graham.
Today Ken and I want to discuss their work and ask a question about its ramifications.
The paper by them—we will call it and them HKM—is titled, “Solving and Verifying the Boolean Pythagorean Triples Problem via Cube-and-Conquer.” The triples problem is a “Ramsey”-like question raised years ago by Graham. Cube-and-Conquer is a method for solving large and complex SAT problems. Sandwiched in between is a clever new tuning of resolution SAT methods called “DRAT,” which we discuss in some detail.
Ron was interested in a problem that generalizes Schur’s Theorem, due to Issai Schur. Suppose we color the numbers $1, 2, \dots, n$ red and green. Can we always find three distinct numbers $a, b, c$ of the same color so that $a + b = c$?
Schur’s theorem says that provided $n$ is large enough this is true. Note that another way of putting this is that with $S = \{a, b\}$, all elements of the set of nonempty subset sums from $S$ are the same color. Several mathematicians independently proved the extension that there are arbitrarily large sets $S$ with this property—indeed for any number of colors.
All the sums of course are linear. What happens if we go to higher powers $k$?
If we simply look at $k$-th powers of sums from $S$ then we tie into the same theorem via the coloring $c'(x) = c(x^k)$ for all $x$. Taking sums of $k$-th powers such as $a^k + b^k$ is different. We can map that case into the simple-sums problem with $a^k$ and $b^k$ in place of $a$ and $b$, but it is not clear how to argue similarly with mapped colorings. Sets of the form $\{a^k, b^k, a^k + b^k\}$ are special.
We can make them even more special by requiring $a^k + b^k$ to be a perfect $k$-th power too. OK, for $k \geq 3$ we are kidding, but the case $k = 2$ is the famous one of Pythagorean triples. Suppose we color the numbers $1, 2, \dots, n$ red and green. Can we always find three distinct numbers $a, b, c$ of the same color so that $a^2 + b^2 = c^2$?
This is the question Ron asked. In the spirit of Paul Erdős, he offered $100 for a solution.
The answer from HKM is that this extension is true. Perhaps that is not too surprising, since many problems can be generalized from linear to non-linear cases. But perhaps the most interesting part is that HKM found the proof by using SAT solvers.
The exact theorem HKM prove is:
Theorem 1 The set $\{1, \dots, 7824\}$ can be partitioned into two parts, such that no part contains a Pythagorean triple, while this is impossible for $\{1, \dots, 7825\}$.
Note, this shows that Schur’s theorem does extend from $a + b = c$ to $a^2 + b^2 = c^2$.
What is special about $7825$? According to this list of triples, which is linked under “Integer Lists” from Douglas Butler’s TSM Resources site, there are seven Pythagorean triples involving $7825$:
There are five distinct entries for $7824$, various others before it, and quite a few for numbers after it. Nothing, however, shouts why $7825$ is a barrier. It seems better to think of it as a tipping point.
There are $2^{7825}$ colorings for the numbers up to $7825$. This immediately stops any simple brute-force approach. What must be done is to break the immense number of cases down to a more manageable number. HKM did this by clever use of known SAT methods with the addition of heuristics that are tailored to this question.
The previous best positive result had been a 2-coloring of $\{1, \dots, 7664\}$ with no monochromatic triple, so HKM had a good idea of how large an $N$ to try. The SAT encoding is simple: use variables $x_n$ for $1 \leq n \leq N$, and for every $a < b < c \leq N$ such that $a^2 + b^2 = c^2$, include the clauses $$(x_a \vee x_b \vee x_c) \wedge (\bar{x}_a \vee \bar{x}_b \vee \bar{x}_c).$$
If we give $x_n$ being true the meaning that the number $n$ is colored green, then this says that for every Pythagorean triple, at least one member must be green and at least one must be red. 3SAT remains NP-complete for clauses of all-equal sign, as follows by tweaking the proof at the end of this post.
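The encoding just described takes only a few lines to generate in DIMACS style; this sketch is ours, not HKM’s code:

```python
def pythagorean_cnf(N):
    # Variables x_1..x_N; for each triple a < b < c <= N with
    # a^2 + b^2 = c^2, emit the clauses (x_a or x_b or x_c)
    # and (-x_a or -x_b or -x_c), DIMACS-style.
    squares = {n * n: n for n in range(1, N + 1)}
    clauses = []
    for a in range(1, N + 1):
        for b in range(a + 1, N + 1):
            c = squares.get(a * a + b * b)
            if c is not None:
                clauses.append([a, b, c])
                clauses.append([-a, -b, -c])
    return clauses

clauses = pythagorean_cnf(25)
print(len(clauses) // 2)  # 8 triples with entries up to 25
```

For $N = 25$ the triples are $(3,4,5)$, $(6,8,10)$, $(5,12,13)$, $(9,12,15)$, $(8,15,17)$, $(12,16,20)$, $(15,20,25)$, and $(7,24,25)$.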
Now with $N = 7825$, what HKM needed to do was to prove the formula unsatisfiable. Proving satisfiability is easy when you know or guess a satisfying assignment—in this case, a coloring. The following graphic from the Nature article on their work shows a coloring for $N = 7824$ in which the white squares are “don’t-cares”—they can be either color:
The top row goes 24 squares; the cell after them is the sticking point. How to prove there is no consistent way to color it? Given the formula $F$, it may be hard to recognize that it entails a contradiction. The general idea, roughly speaking, is to add more clauses to make a formula $F'$ that is unsatisfiable exactly when $F$ is, and in which the contradiction is easy to verify.
Besides good guesses, HKM were armed with the latest knowledge on well-performing heuristics. A 2012 paper by Matti Järvisalo with Heule and Armin Biere includes an overview of resolution-related properties involving a big array of acronyms such as AT and HBC and RHT. The AT stands for “asymmetric tautology,” and the ‘R’ prefix applied to a formula property $P$ enlarges it by adding cases where a certain kind of resolution yields a formula with property $P$. Combining these two yields the following definition—we paraphrase the newer paper’s version informally:
Definition 2 Given a formula $F$ and a clause $C$ not in $F$, say $C$ has RAT via a literal $\ell$ in $C$ if for all clauses $D$ of $F$ containing $\bar{\ell}$ the following happens: when you make the other literals in $C$ and $D$ false, remove $\ell$, and simplify, you get an immediate contradiction.
We should say more about “simplify”: Suppose $\ell_1, \dots, \ell_k$ are those other literals. Making them false is the same as conjoining to the formula $$\bar{\ell}_1 \wedge \cdots \wedge \bar{\ell}_k,$$
which has their negations as unit clauses. We simplify by removing, for each unit clause $\bar{\ell}_i$, all clauses with $\bar{\ell}_i$ (those were satisfied) and deleting $\ell_i$ from other clauses. After doing this, there may be other unit clauses, whereupon we repeat. If we get both $x$ and $\bar{x}$ for some variable $x$, that’s the immediate contradiction we seek. What’s important is that this unit resolution process, while “logically” inferior to full resolution, stops in polynomial time.
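The unit-propagation loop just described can be coded directly; here is a generic sketch of ours, where clauses are lists of nonzero integers and a negative integer is a negated variable:

```python
def unit_propagate(clauses):
    # Returns True if unit propagation derives an immediate contradiction.
    clauses = [list(c) for c in clauses]
    assigned = set()
    while True:
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            return False
        for u in units:
            if -u in assigned:
                return True  # both x and not-x assigned: contradiction
            assigned.add(u)
        # Satisfied clauses vanish; falsified literals are deleted.
        remaining = []
        for c in clauses:
            if any(l in assigned for l in c):
                continue
            c = [l for l in c if -l not in assigned]
            if not c:
                return True  # empty clause: contradiction
            remaining.append(c)
        clauses = remaining
```

For instance, `[[1], [-1, 2], [-2]]` propagates $x_1$, then $x_2$, then hits the clause $\bar{x}_2$, giving the immediate contradiction.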
Now suppose $F$ is satisfiable and $C$ has RAT via $\ell$. If there is a satisfying assignment of $F$ that sets $\ell$ or one of the other literals in $C$ true, then $F \wedge C$ is also satisfied. So suppose the assignment sets them all false. Now there must exist a clause $D$ in $F$ containing $\bar{\ell}$ such that the assignment sets the other literals in that clause false—if none, then we could have set $\ell$ true after all. Then the formula above is satisfied by the assignment, but this is a contradiction of the (immediate) contradiction.
Note also that if $F \wedge C$ is satisfiable then of course so is $F$. This isn’t important to the unsatisfiability proof but is good to know: RAT clauses can be added freely. The trick is to find them. Definition 2 was crafted to make it polynomial-time recognizable that $C$ is RAT when you have it, but you still have to find it. A particularly adept choice of $C$ may allow simplifications that delete other clauses, yielding a technique called DRAT for “Deletion Resolution Asymmetric Tautology” proofs.
This is where the other ingenious heuristics—tailored for the triples problem but following a general paradigm called “cube-and-conquer”—come in. We’ll refer those details to the paper and its references, but this breakthrough should make one excited to read more about the state of the art.
The problem took “only” two days of computing on a supercomputer—the Texas Advanced Computing Center. The computation generated 200 terabytes of raw text output. It is not clear to us whether even more intermediate text was generated on-the-fly as unsuccessful moves were backtracked-out, or how much. HKM say in their abstract:
…Due to the general interest in this mathematical problem, our result requires a formal proof. Exploiting recent progress in unsatisfiability proofs of SAT solvers, we produced and verified a proof in the DRAT format, which is almost 200 terabytes in size. From this we extracted and made available a compressed certificate of 68 gigabytes, that allows anyone to reconstruct the DRAT proof for checking.
As with all computer proofs, we would still like a human-readable proof. It is not that we do not trust the validity of the current proof, but rather that we would like to “understand,” if possible, why Ron’s problem is answered. Can we possibly extract from the certificate a dart of reasoning that yields a shorter explanation? It might be a numerical potential function whose values in this case are guessable and verifiable, such that exceeding some threshold analytically implies unsatisfiability.
We also wonder why the formulas treated here should be any more difficult than ones you can get for factoring $n$-bit numbers. As we noted above, the all-signs-equal condition on the literals comes without loss of generality. So the degree of ease that allowed solving on a university center in two days must come from how the Pythagorean pattern gave a leg up to “cube-and-conquer.” For factoring there might be other legs—and the values of $n$ from current security standards might yield even smaller formulas.
Last, as noted in the papers we’ve linked, the DRAT condition has universality properties with regard to resolution in general, yet builds on steps that are in polynomial time. It was an incremental liberalization of previously used steps, and this makes us wonder whether it can be enhanced further while still yielding proofs that take up $2^{\epsilon n}$ length and time. Perhaps we can get the $\epsilon$ to be $o(1)$? That would refute some forms of the “Exponential Time Hypothesis,” which we last discussed here.
The most immediate questions raised by this wonderful work are: what about other equations, and what about allowing more colors? Does having three colors zoom the problem beyond any hope of attack by today’s computers, or will the practical breakthroughs continue a virtuous cycle with advances in theory that bring more cases into the realm of feasibility? Is there an asymptotic analysis that might guide our ability to forecast this?
[fixed typo in SAT encoding and struck “not” between “does” and “extend”]
Wikimedia Commons source |
Nicholas Saunderson was the fourth Lucasian Professor at Cambridge, two after Isaac Newton. He promoted Newton’s Principia Mathematica in the Cambridge curriculum but channeled his original work into lecture notes and treatises rather than published papers. After his death, most of his work was collected into one book, The Elements of Algebra in Ten Books, whose title recalls Euclid’s Elements. It includes what is often credited as the first “extended” version of Euclid’s algorithm.
Today we raise the idea of using algorithms such as this as the basis for proofs.
Saunderson was blind from age one. He built a machine for doing what he called “Palpable Arithmetic” by touch. As described in the same book, it was an enhanced abacus—not a machine for automated calculation of the kind a later Lucasian professor, Charles Babbage, attempted to build.
We take the “palpable” idea metaphorically. Not only beginning students but we ourselves still find proofs by contradiction or “infinite descent” hard to pick up at first reading. We wonder how far mathematics can be developed so that the hard nubs of proofs are sheathed in assertions about the availability and correctness of algorithms. The algorithm’s proof may still involve contradiction, but there’s a difference: You can interact with an algorithm. It is hard to interact with a contradiction.
It was known long before Euclid that the square root of 2 is irrational. In the terms Saunderson used, the diagonal of a square is “incommensurable” with its side.
Alexander Bogomolny’s great educational website Cut the Knot has an entire section on proofs. Its coverage of the irrationality of $\sqrt{2}$ itemizes twenty-eight proofs. All seem to rely on some type of infinite descent: if there is a rational solution then there is a smaller one, and so on. Or they involve a contradiction of a supposition whose introduction seems perfunctory rather than concrete. We gave a proof by induction in a post some years ago, where we also noted a MathOverflow thread and a discussion by Tim Gowers about this example.
We suspect that one reason the proof of this simple fact is considered hard for a newcomer is just that it uses these kinds of descent and suppositions. Certainly the fact itself was considered veiled in antiquity. According to legend the followers of Pythagoras treated it as an official secret and murdered Hippasus of Metapontum for the crime of divulging it. To state it truly without fear today, we still want a clear view of why the square root of 2 is irrational.
Our suggestion below is to avoid the descent completely. Of course it is used somewhere, but it is encapsulated in another result. The result is that for any co-prime integers $a$ and $b$ there are integers $x$ and $y$ such that $$xa + yb = 1.$$
The $x$ and $y$ are given by the extended Euclidean algorithm. Incidentally, this was noted earlier by the French mathematician Claude Bachet de Méziriac—see this review—while Saunderson ascribed the general method to his late colleague Roger Cotes two pages before his chapter “Of Incommensurables” (in Book V) where he laid out full details.
Here is the closest classical proof we could find to our aims. We quote the source verbatim (including “it’s” not “its”) and will reveal it at the end.
Proposition 15. If there be any whole number, as $c$, whose square root cannot be expressed by any other whole number; I say then that neither can it be expressed by any fraction whatever.
For if possible, let the square root of $c$ be expressed by a fraction which when reduced to it’s least integral terms is $\frac{a}{b}$, that is, let $\sqrt{c} = \frac{a}{b}$, then we shall have $c = \frac{aa}{bb}$; but the fraction $\frac{aa}{bb}$ is in it’s least terms, by the third corollary to the twelfth proposition, because the fraction $\frac{a}{b}$ was so; and the fraction $\frac{c}{1}$ is in it’s least terms, because 1 cannot be further reduced; therefore we have two equal fractions $\frac{aa}{bb}$ and $\frac{c}{1}$ both in their least terms; therefore by the tenth proposition, these two fractions must not only be equal in their values, but in their terms also, that is, $aa$ must be equal to $c$, and $bb$ to 1: but $aa$ cannot be equal to $c$, because $a$ is a whole number by the supposition, and $c$ is supposed to admit of no whole number for its root; therefore the square root of $c$ cannot possibly be expressed by any fraction whatever. Q.E.D.
The cited propositions are that two fractions in lowest whole-number terms must be identical and that if $a$ and $b$ are co-prime then so are $aa$ and $bb$. The proof of the latter starts with the for-contradiction words “if this be denied,” so the absence of such language above gets only part credit. This all does not come trippingly off the tongue; rather it sticks trippingly in the throat. Let’s try again.
In fact, we don’t need the concepts of “lowest terms” or co-primality or the full statement of the identity named for Étienne Bézout. It suffices to assert that for any whole numbers $a$ and $b$, there are integers $x$ and $y$ such that the number $$d = xa + yb$$
divides both $a$ and $b$. This is what the extended Euclidean algorithm gives you.
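For concreteness, here is a standard recursive rendering of the extended Euclidean algorithm in Python:

```python
def egcd(a, b):
    # Returns (d, x, y) with d = x*a + y*b and d = gcd(a, b),
    # so d divides both a and b.
    if b == 0:
        return a, 1, 0
    d, x, y = egcd(b, a % b)
    # d = x*b + y*(a - (a//b)*b) = y*a + (x - (a//b)*y)*b
    return d, y, x - (a // b) * y

d, x, y = egcd(240, 46)
print(d, x, y)  # 2 -9 47, and indeed -9*240 + 47*46 == 2
```

The returned $d$ divides both inputs, which is the only fact the proof below needs.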
Then for the proof, suppose that $\sqrt{2} = \frac{a}{b}$ for integers $a$ and $b$. We take these $a$ and $b$, and let $x$, $y$ be the resulting integers, with $d = xa + yb$. Now let’s do some simple algebra: from $\sqrt{2} = \frac{a}{b}$ we get $b\sqrt{2} = a$ and $a\sqrt{2} = 2b$.
It follows that $$d\sqrt{2} = xa\sqrt{2} + yb\sqrt{2} = 2bx + ay.$$
Now divide both sides of this by $d$. We get $$\sqrt{2} = \frac{2bx + ay}{d}.$$
The conclusion is that $d$ divides $a$, hence $d$ divides $ay$. Thus, since $d$ also divides $b$ and hence $2bx$, $d$ divides $2bx + ay$. So $\sqrt{2} = \frac{2bx + ay}{d}$ is an integer—the same end as the classical proof.
This is a contradiction. But it is a palpable contradiction. For instance, of course we can see that $\sqrt{2}$ isn’t an integer: it lies strictly between $1$ and $2$. Thus we claim that the effect of this proof is more concrete.
Is this a new proof? We doubt it. But the proof is nice in that it avoids any recursion or induction. The essential point—the divisibility of $d$ into $a$ and $b$—is coded into the Euclidean algorithm.
Is ours at least smoother than the classical proof we quoted? The latter is from Saunderson’s book, on pages 304–305 which come soon after his presentation of the algorithm on pages 295–298.
What other proofs can benefit from similar treatment by “reduction to algorithms”?
[fixed missing n in last line of proof, some word tweaks]
Peter Landweber, Emanuel Lazar, and Neel Patel are mathematicians. I have never worked with Peter Landweber, but have written papers with Larry and Laura Landweber. Perhaps I can add Peter one day.
Today I want to report on a recent result on the fiber structure of continuous maps.
The paper by Landweber, Lazar, and Patel (LLP) is titled, “On The Fiber Diameter Of Continuous Maps.” Pardon me, but I assume that some of you may not be familiar with the fiber of a map. Fiber here has nothing to do with the fiber content of food or diets. Fibers are a basic property of a map.
Their title does not give away any suggestion that their result is relevant to those studying data sets. Indeed even their full abstract only says at the end:
Applications to data analysis are considered.
I just became aware of their result from reading a recent Math Monthly issue. The paper has a number of interesting results—all with some connection to data analytics. I must add that I had not seen it earlier because of a recent move, and the subsequent lack of getting US mail. Moves are disruptive—Bob Floyd used to tell me that “two moves equal a fire”—and I’ve just moved twice. Oh well.
The fiber of a map $f$ at a point $y$ is the set of points $x$ such that $f(x) = y$. The diameter of a fiber is just what you would expect: the maximum distance between points in the fiber. LLP prove this—they say they have a “surprisingly short proof” and give earlier sources for it at the end of their paper:
Theorem: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ be a continuous function where $n > m$. Then for any $d > 0$, there exists $y$ whose fiber has diameter greater than $d$.
The following figure from their paper conveys the essence of the proof in the case :
One might expect a difficult dimension-based argument. However, they leverage whatever difficult reasoning went into the following theorem of Karol Borsuk and Stanislaw Ulam. We have mentioned both of them multiple times on this blog but never this theorem:
Theorem: Let $f$ be any continuous function from the $n$-sphere $S^n$ to $\mathbb{R}^n$. Then there are antipodal points that give the same value, i.e., some $x$ on the sphere such that $f(x) = f(-x)$.
The proof then simply observes that $m$-spheres of radius $r$ live inside $\mathbb{R}^n$ for any $n > m$, and for arbitrarily large $r$. The antipodal points $x, -x$ given by Borsuk-Ulam belong to the same fiber of $f$ but are $2r$ apart.
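To make the argument concrete, here is a small numerical sketch in the plane (the map $f$ and the radius are our own choices). Restricting a continuous $f$ to a circle of radius $r$ and setting $g(t) = f(p(t)) - f(-p(t))$, we always have $g(\pi) = -g(0)$, so the intermediate value theorem guarantees a zero, which bisection can locate; the two antipodal points are then $2r$ apart yet lie in the same fiber.

```python
import math

def antipodal_same_value(f, r):
    """Find t where f agrees on antipodal points of the radius-r circle.

    With p(t) on the circle, g(t) = f(p) - f(-p) satisfies g(pi) = -g(0),
    so the intermediate value theorem gives a zero; we bisect to find it.
    """
    def g(t):
        x, y = r * math.cos(t), r * math.sin(t)
        return f(x, y) - f(-x, -y)
    lo, hi = 0.0, math.pi
    if g(lo) == 0.0:
        return lo
    for _ in range(200):                 # bisection: keep the sign change
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

f = lambda x, y: x + 0.3 * y * y         # any continuous map R^2 -> R
r = 1000.0
t = antipodal_same_value(f, r)
p = (r * math.cos(t), r * math.sin(t))   # p and -p: same fiber, 2r apart
```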
Why should we care about this theorem? That’s a good question.
One of the main ideas in analytics is to reduce the dimension of a set of data. If we let the data lie in a Euclidean space, say , then we may wish to map the data down to a space of lower dimension. This yields lots of obvious advantages—the crux is that we can do many computational things on lower-dimensional data that would be too expensive on the original -dimensional space.
The LLP result shows that no matter what the mapping is, as long as it is continuous, there must be points that are far apart in the original space and yet get mapped to exactly the same point in the lower space. This is somewhat annoying: clearly it means there will always be points that the map does not classify correctly.
One of the issues raised by this work of LLP is that areas like big data can be worked on from many angles. We do not always see results from another area as related to our own work. I believe that many people in analytics are probably surprised by this result, and I would guess they did not know about it previously. This phenomenon seems to be getting worse as more researchers work in overlapping areas but come at the problems with different viewpoints.
Can we do a better job at linking different areas of research? Finally, with respect, this seems like a result that could have been proved decades ago. Perhaps one of the great consequences of new areas like big data is to raise questions that were not thought about previously.
[fixed typo R^m, corrected picture of Landweber, added note on sources for main theorem]
Noam Chomsky is famous for many many things. He has had a lot to say over his long career, and he wrote over 100 books on topics from linguistics to war and politics.
Today I focus on work that he pioneered sixty years ago.
Yes, sixty years ago. The work is usually called the Chomsky hierarchy (CH) and is a hierarchy of classes of formal grammars. It was described by Noam Chomsky in 1956, driven by his interest in linguistics, not war and politics. Some add Marcel-Paul Schützenberger’s name to the hierarchy. He played a crucial role in the early development of the theory of formal languages—see his joint paper with Chomsky from 1962.
We probably all know about this hierarchy. Recall grammars define languages:
One neat thing about this hierarchy is that it has long been known to be strict: each class is more powerful than the previous one. Each proof that the next class is more powerful is really a beautiful result. Do you know, offhand, how to prove each one?
I have a simple question:
Should we still teach the CH today?
Before discussing this, let me explain a bit about grammars.
In the 1950’s people started to define various programming languages. It quickly became clear that if they wanted to be precise they needed some formal method to define their languages. The formalism of context-free grammars of Noam Chomsky was well suited for at least defining the syntax of their languages—semantics were left to “English,” but at least the syntax would be well defined.
Another milestone in the late 1950s was the publication, by a committee of American and European computer scientists, of “a new language for algorithms”: the ALGOL 60 Report (the “ALGOrithmic Language”). This report consolidated many ideas circulating at the time and featured several key language innovations. Perhaps the most useful was a mathematically exact notation, Backus-Naur Form (BNF), used to describe the grammar. It is no more expressive than context-free grammars, but it is more user-friendly, and variants of it are still used today.
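To illustrate what a BNF-style definition buys, here is a hypothetical mini-grammar (our own toy, far simpler than ALGOL 60) together with a minimal recursive-descent evaluator that follows it rule for rule:

```python
# A BNF-style fragment in the spirit of the ALGOL 60 report (hypothetical):
#   <expr> ::= <term> | <expr> "+" <term>
#   <term> ::= <digit> | "(" <expr> ")"

def evaluate(s):
    pos = 0
    def expr():
        nonlocal pos
        v = term()
        while pos < len(s) and s[pos] == "+":
            pos += 1                  # consume "+"
            v += term()
        return v
    def term():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                  # consume "("
            v = expr()
            pos += 1                  # consume ")"
            return v
        v = int(s[pos])               # a single digit
        pos += 1
        return v
    return expr()
```

Each nonterminal of the grammar becomes one function, which is exactly why a precise syntax definition makes parsers almost mechanical to write.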
I must add a story about the power of defining the syntax of a language precisely. Jeff Ullman moved from Princeton to Stanford in 1979. I must thank him, since his senior position was the one that I received in 1980. Jeff was a prolific writer of textbooks already then and used an old method from Bell Labs, TROFF, to write his books. On arrival at Stanford he told me that he wanted to try out the then-new system that Don Knuth had just created in 1978—of course that was the TeX system. Jeff tried the system out and liked it. But then he asked for the formal syntax description, since he wanted to be sure what the TeX language was. He asked, and the answer from Knuth was:
There is no formal description. None.
Jeff was shocked. After all Knuth had done seminal work on context-free grammars and was well versed in formal grammars—for example Knuth invented the LR parser (Left to Right, Rightmost derivation). TeX was at the time only defined by what Knuth’s program accepted as legal.
Let’s return to my question: Should we still teach the CH today?
It is beautiful work. I especially think the connection between context-free languages and pushdown automata is wonderful, non-obvious, and quite useful. Context-free languages and pushdown automata led to Steve Cook’s beautiful work on two-way deterministic pushdown automata (2DPDA). He showed they could be simulated in linear time on a random-access machine.
This insight was utilized by Knuth to find a linear-time solution for the left-to-right pattern-matching problem, which can easily be expressed as a 2DPDA:
This was the first time in Knuth’s experience that automata theory had taught him how to solve a real programming problem better than he could solve it before.
The work was finally written up and published together with Vaughan Pratt and James Morris several years later.
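That linear-time pattern matcher is now known as the Knuth-Morris-Pratt algorithm. A compact version runs as follows (a textbook rendering, not the authors' code):

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: all occurrences of pattern in text, O(n + m) time."""
    # Failure function: fail[i] = length of longest proper border of pattern[:i+1].
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text left to right, never moving backwards in it.
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

hits = kmp_search("abababca", "abab")   # occurrences at 0 and 2
```

The left-to-right, no-backup scan is what makes the 2DPDA formulation natural.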
And of course context-sensitive languages led to the LBA problem. This really was the question of whether nondeterministic space is closed under complement. See our discussion here.
Should I teach the old CH material? Or leave it out and teach more modern results? What do you think? Do results have a “teach-by-date?”
[Fixed paper link]
Non-technical fact-check source
Dan Brown is the bestselling author of the novel The Da Vinci Code. His most recent bestseller, published in 2013, is Inferno. Like two of his earlier blockbusters it has been made into a movie. It stars Tom Hanks and Felicity Jones and is slated for release on October 28.
Today I want to talk about a curious aspect of the book Inferno, since it raises an interesting mathematical question.
Brown’s books are famous for their themes: cryptography, keys, symbols, codes, and conspiracy theories. The first four of these have a distinctive flavor of our field. Although we avoid the last in our work, it is easy to think of possible conspiracies that involve computational theory. How about these: certain groups already can factor large numbers, certain groups have real quantum computers, certain groups have trapdoors in cryptocurrencies, or …
The book has been out for a while, but I only tried to read it the other day. It was tough to finish, so I jumped to the end where the “secret” was exposed. Brown’s works have sold countless copies and yet have been attacked as being poorly written. He must be doing something very right. His prose may not be magical—whose is?—but his plots and the use of his themes usually make for a terrific “cannot put down” book.
Well I put it down. But I must be the exception. If you haven’t read the book and wish to do so without “spoilers” then you can put down this column.
Inferno is about the release of a powerful virus that changes the world. Before I go into the mathematical issues this virus raises, I must point out that Brown’s work has often been criticized for making scientific errors and overstepping the bounds of “plausible suspension of disbelief.” I think it is a great honor—really—that so many posts and discussions are around mistakes that he has made. Clearly there is huge interest in his books.
Examples of such criticism of Inferno have addressed the DNA science involved, the kind of virus used, the hows of genetic engineering and virus detection, and the population projections, some of which we get into below. There is also an entire book about Brown’s novel, Secrets of Inferno.
However, none of these seems to address a simple point that we hadn’t found anywhere, until Ken noticed it raised here on the often-helpful Fourmilab site maintained by the popular science writer John Walker. It appears when you click “Show Spoilers” on that page, so again you may stop reading if you don’t wish to know.
How does the virus work? The goal of the virus is to stop population explosion.
The book hints that it is airborne, so we may assume that everyone in the world is infected by it—all women in particular. Brown says that 1/3 are made infertile. There are two ways to think about this statement. It depends on the exact definition of the mechanism causing infertility.
The first way is that when you get infected by the virus a coin is flipped and with probability 1/3 you are unable to have children. That is, when the virus attacks your original DNA there is a 1/3 chance the altered genes render you infertile. In the 2/3-case that the virus embeds in a way that does not cause infertility, that gets passed on to children and there is no further effect. In the 1/3-case that the alteration causes infertility, that property too gets passed on. Except, that is, for the issue in this famous quote:
Having Children Is Hereditary: If Your Parents Didn’t Have Any, Then You Probably Won’t Either.
Thus the effect “dies out” almost immediately; it would necessarily be just one-shot on the current generation.
The second way is that the virus allows the initial receiver to be fertile but has its effect when (female) children are born. In one third of cases the woman becomes infertile, and otherwise is able to have children when she grows up.
In this case the effect seems to work as claimed in the book. Children all get the virus and it keeps flipping coins forever. Walker still isn’t sure—we won’t reveal here the words he hides but you can find them. In any event, the point remains that this would become a much more complex virus. And Brown does not explain this point in his book—at least I am unsure if he even sees the necessary distinctions.
The other discussions focus on issues like how society would react to this reduction in fertility. Except for part of one we noted above, however, none seems to address the novel’s mathematical presumptions.
The purpose of the virus is to reduce the growth rate in the world’s population. By how much is not clear in the book. The over-arching issue is that it is hard to find conditions under which the projection of the effect is stable.
For example, suppose we can divide time into discrete units of generations so that the world population of women after $t$ generations follows the exponential growth curve $P(t) = P_0 r^t$. Ignoring the natural rate of infertility and male-female imbalance and other factors for simplicity, this envisions women having $r$ female children on average. The intent seems to be to replace this with women having $\frac{2}{3}r$ female children each, for $\frac{2}{3}r P_0$ women in the next generation. This means multiplying $r$ by $\frac{2}{3}$, so $P(t) = P_0 (\frac{2}{3}r)^t$ becomes the new curve. The problem is that this tends to zero unless $r \geq \frac{3}{2}$, whereas the estimates of $r$ that you can get from tables such as this are uniformly lower at least since 2000.
The point is that the blunt “1/3” factor of the virus is thinking only in such simplistic terms about “exponential growth”—yet in the same terms there is no region of stability. Either growth remains exponential or humanity crashes. Maybe the latter possibility is implicit in the dark allusions to Dante Alighieri’s Inferno that permeate the plot.
In reality, as our source points out, it would not take much for humanity to compensate. If a generation is 30 years and we are missing 33% of women, then what’s needed is for just over 3% of the remaining women to change their minds about not having a child in any given year. We don’t want to trivialize the effect of infertility, but there is much more to adaptability than the book’s tenet presumes.
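A back-of-envelope simulation of the simple branching model under discussion (our own toy, with all of the book's biology reduced to a single multiplier) shows how knife-edged the premise is:

```python
def population(r, generations, p0=1.0, infertile_fraction=1/3):
    """Women after each generation: r daughters per woman on average,
    with a fixed fraction removed from the reproducing pool by the virus."""
    p = p0
    for _ in range(generations):
        p *= r * (1 - infertile_fraction)   # effective factor (2/3) * r
    return p

# Growth survives only if (2/3) * r > 1, i.e. r > 1.5 daughters per woman;
# fertility tables put r well below that, so the curve crashes.
growing = population(1.6, 50)
crashing = population(1.4, 50)
```

There is no stable region in between: the trajectory either stays exponential or collapses toward zero, which is the point made above.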
Have you read the book? What do you think about the math?
Some CS reflections for our 700th post
MacArthur Fellowship source
Lin-Manuel Miranda is both the composer and lyricist of the phenomenal Broadway musical Hamilton. A segment of Act I covers the friendship between Alexander Hamilton and Gilbert du Motier, the Marquis de Lafayette. This presages the French co-operation in the 1781 Battle of Yorktown, after which the British forces played the ballad “The World Turned Upside Down” as they surrendered. The musical’s track by the same name has different words and melodies.
Today we discuss some aspects of computing that seem turned upside down from when we first learned and taught them.
Yesterday was halfway between our Fourth of July and France’s Bastille Day, and was also the last day of Miranda performing the lead on-stage with the original Hamilton company. They are making recordings of yesterday’s two performances, to be aired at least in part later this year. A month ago, Miranda wrote an op-ed in the New York Times against the illegal (in New York) but prevalent use of “bots” to snap up tickets the moment they become available for later marked-up resale.
This is also the 700th post on this blog. It took until 1920 for a Broadway show of any kind to reach 700 performances. The Playbill list of “Long Runs on Broadway” includes any show with 800 or more performances. That mark is within our reach, and our ticket prices will remain eminently reasonable.
This list is just what strikes us now—far from exhaustive—and we invite our readers to add opinions about examples in comments.
Forty-five years ago, Dick Karp showed how the difficulty of SAT represented by NP-completeness spreads to other natural problems. As the number of complete problems from many areas of science and operations research soared into the thousands by the 1979 publication of the book Computers and Intractability, people regarded NP-completeness as tantamount to intractability.
Today the flow is in the other direction—as expressed for instance in this talk by Moshe Vardi. Dick Karp himself has been among many in the vanguard—I remember his talk on practical solvability of Hitting-Set problems at the 2008 LiptonFest and here is a relevant paper. We now reduce problems to SAT in order to solve them. SAT-solvers that work in many cases are big business. In some whole areas the SAT-encodings of major problems are well-behaved, as we remarked about rank-aggregation and voting theory in the third section of this post from last October. The solvers can even tackle huge problems. Marijn Heule, Oliver Kullmann, and Victor Marek proved that every 2-coloring of the interval $[1, 7825]$ has a monochromatic Pythagorean triple, in a proof of over 200 terabytes in uncompressed length.
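As a toy illustration of the kind of search such proofs certify, here is a tiny backtracking 2-coloring of $\{1,\dots,30\}$ avoiding monochromatic Pythagorean triples (the real proof concerned a vastly larger interval and used industrial SAT solvers; this sketch is ours):

```python
def pythagorean_triples(n):
    """All (a, b, c) with a <= b <= c <= n and a^2 + b^2 = c^2."""
    return [(a, b, c) for a in range(1, n + 1)
                      for b in range(a, n + 1)
                      for c in range(b, n + 1) if a * a + b * b == c * c]

def two_color(n):
    """Backtracking search for a triple-free 2-coloring of {1, ..., n}."""
    triples = pythagorean_triples(n)
    color = {}
    def ok(v):
        # No triple through v may be monochromatic (uncolored entries pass).
        return all(not (color.get(a) == color.get(b) == color.get(c))
                   for (a, b, c) in triples if v in (a, b, c))
    def solve(v):
        if v > n:
            return True
        for c in (0, 1):
            color[v] = c
            if ok(v) and solve(v + 1):
                return True
        del color[v]
        return False
    return dict(color) if solve(1) else None

coloring = two_color(30)   # succeeds easily at this tiny scale
```

At this scale backtracking is instant; the interest of the Heule-Kullmann-Marek result is precisely that near 7825 the search space becomes so constrained that only a SAT solver plus a 200-terabyte proof could settle it.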
Quadratic time is notionally on the low end of polynomial time, and “polynomial time” has long been used as a synonym for “easy.” But as the amount of data we can and need to handle has mushroomed, the difference in scaling between quasi-linear and quadratic is more and more felt. This difference has even been argued for cryptographic security. A particular definition of quasi-linear time is time $n(\log n)^{O(1)}$, as named by Claus Schnorr for his theorem on quasi-linear completeness of SAT; see also this.
In genomics the quadratic time of algorithms for full edit-distance measures is felt enough to warrant approximative methods, as we covered in our memorial a year ago for Alberto Apostolico. This also puts meaning behind theoretical evidence that quadratic time for edit distance cannot be improved.
These two items seem to contradict each other, but point up a difference in scale between data and logical control. Often a thousand data points are nothing. A formula with a thousand clauses can say a lot.
My first doctoral student had been working on neural networks before I became his advisor in 1991, and I remember the feeling of their being under a cloud. The so-called AI Winter traced in part to lower bounds shown against certain shallow neural nets in the 1969 book Perceptrons by Marvin Minsky and Seymour Papert. We discussed complexity aspects of this in our memorial of Minsky last January.
Since then what has emerged is that composing a bunch of these nets, as in a convolutional neural network (CNN), is both feasible and algorithmically effective. The recent breakthrough on playing Go is just a headline among many emerging applications of CNNs and larger systems. We are not saying neural nets and deep learning are the be-all or anything more than a “cartoon” of the brain, but rather noting them among many reasons that AI and machine learning are resurgent.
The same AI-winter article on Wikipedia mentions the collapse of Lisp-dedicated systems in 1987, and more widely, many companies devoted to data-parallel architectures “left nothing but their logos on coffee mugs” as a colleague once put it. Subsequently I perceived signs of stagnation in functional languages in the late 1990s and early 00s. This lent a ghostly air to John Backus’s famous 1978 Turing Award lecture, “Can Programming Be Liberated From the von Neumann Style?”
Unlike the revenant in last year’s award-winning movie of that name, this one has come back with a different body. Not a large-scale dedicated machine system, but rather the pan-spectral pervasion we call the Cloud. A great lecture we heard by Mike Franklin on AMPLab activities highlighted the role of programs written in the functional language Scala running on the Apache Spark framework.
A common thread in all these items is the combined efficacy and scalability of algorithmic primitives whose abstract forms characterize quasi-linear time: sorting, parallel prefix sum (as one of several forms of map-reduce), convolution, streaming count-sketching, and the like.
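For instance, parallel prefix sum can be computed in logarithmically many rounds of mutually independent updates. Here is a sequential sketch of the Hillis-Steele scan (our example; it is one of several standard formulations):

```python
def hillis_steele_scan(xs, op=lambda a, b: a + b):
    """Inclusive prefix sum in O(log n) rounds (Hillis-Steele scan).

    Written sequentially, but within each round every update reads only
    the previous round's array, so the updates can all run in parallel.
    """
    out = list(xs)
    d = 1
    while d < len(out):
        prev = out[:]                       # snapshot of the last round
        for i in range(d, len(out)):        # independent parallel updates
            out[i] = op(prev[i - d], prev[i])
        d *= 2
    return out
```

With an associative `op` this gives the general scan primitive underlying map-reduce-style pipelines.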
We considered mentioning some subjects that have seen changes such as digital privacy and block ciphers, but maybe these are not so “upside-down.” Doubtless we are missing many more. What developments in computing have carried shock on the order of the discovery that neutrinos have mass in particle physics? We invite your suggestions and opinions.
Here also is a web folder of photos from Dick’s wedding and honeymoon.
[added link to Vardi on SAT solving]
Richard Lipton is, among so many other things, a newlywed. He and Kathryn Farley were married on June 4th in Atlanta. The wedding was attended by family and friends including many faculty from Georgia Tech, some from around the country, and even one of Dick’s former students coming from Greece. Their engagement was noted here last St. Patrick’s Day, and Kathryn was previously mentioned in a relevantly-titled post on cryptography.
Today we congratulate him and Kathryn, and as part of our tribute, revisit a paper of his on factoring from 1994.
They have just come back from their honeymoon in Paris. Paris is many wonderful things: a touchstone of history, a center of culture, a city for lovers. It is also the setting for most of Dan Brown’s novel The Da Vinci Code and numerous other conspiracy-minded thrillers. Their honeymoon was postponed by an event that could be a plot device in these novels: the Seine was flooded enough to close the Louvre and Musée D’Orsay and other landmarks until stored treasures could be brought to safe higher ground.
It is fun to read or imagine stories of cabals seeking to collapse world systems and achieve domination. Sometimes these stories turn on scientific technical advances, even purely mathematical points as in Brown’s new novel, Inferno. It needs a pinch to realize that we as theorists often verge on some of these points. Computational complexity theory as we know it is asymptotic and topical, so it is a stretch to think that papers such as the present one impact the daily work of those guarding the security of international commerce or investigating possible threats. But from its bird’s-eye view there is always the potential to catch a new glint of light reflected from the combinatorial depths that could not be perceived until the sun and stars align right. In this quest we take a spade to dig up old ideas anew.
Pei’s Pyramid of the Louvre Court = Phi delves out your prime factor… (source)
The paper is written in standard mathematical style: first a theorem statement with hypotheses, next a series of lemmas, and the final algorithm and its analysis coming at the very end. We will reverse the presentation by beginning with the algorithm and treating the final result as a mystery to be decoded.
Here is the code of the algorithm. It all fits on one sheet and is self-contained; no abstruse mathematics text or Rosetta Stone is needed to decipher it. The legend says that the input $n$ is a product of two prime numbers, $f$ is a polynomial in just one variable, and $\gcd$ refers to the greatest-common-divisor algorithm expounded by Euclid around 300 B.C. Then come the runes, which could not be simpler:
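In modern dress, the loop reads roughly as follows. This is our hedged Python sketch of the paper's algorithm; the toy modulus, polynomial, and trial bound are our own choices for illustration:

```python
import math
import random

def factor_by_polynomial(n, f, trials=10_000):
    """Sketch of the loop: draw a random x mod n, evaluate the polynomial,
    and hope gcd(f(x) mod n, n) exposes a prime factor of n."""
    for _ in range(trials):
        x = random.randrange(1, n)
        g = math.gcd(f(x) % n, n)
        if 1 < g < n:
            return g                     # nontrivial factor found
    return None

# Toy instance (ours): n = 91 = 7 * 13 and f(x) = x(x-1)...(x-6), whose
# roots cover every residue mod 7, so the gcd succeeds whenever x is not
# also a root mod 13 -- illustrating how many roots boost the success rate.
n = 91
f = lambda x: math.prod(x - i for i in range(7))
factor = factor_by_polynomial(n, f)
```

The whole analysis of the paper is about how the choice of $f$ governs the expected number of trips through this loop.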
Exiting enables carrying out the two prime factors of $n$—but a final message warns of a curse of vast unknowable consequences.
How many iterations must one expect to make through this maze before exit? How and when can the choice of the polynomial speed up the exploration? That is the mystery.
Our goal is to expose the innards of how the paper works, so that its edifice resembles another famous modern Paris landmark:
This is the Georges Pompidou Centre, whose anagram “go count degree prime ops” well covers the elements of the paper. Part of the work for this post—in particular the possibility of improving to —is by my newest student at Buffalo, Chaowen Guan.
Let $n = pq$ with $p$ and $q$ prime. To get the expected running time, it suffices to have good lower and upper bounds
and analogous bounds for . Then the probability of success on any trial is at least
This lower bounds the probability of , whereupon gives us the factor .
We could add a term for the other way to have success, which is . However, our strategy will be to make and hence close to by considering cases where but is still large enough to matter. Then we can ignore this second possibility and focus on . At the end we will consider relaxing just so that is bounded away from .
Note that we cannot consider the events and to be independent, even though and are prime, because and may introduce bias. We could incidentally insert an initial text for without affecting the time or improving the success probability by much. Then conditioned on its failure, the events and become independent via the Chinese Remainder Theorem. This fact is irrelevant to the algorithm but helps motivate the analysis in part.
This first analysis thus focuses the question to become:
How does computing change the sampling?
We mention in passing that Peter Shor’s algorithm basically shows that composing certain (non-polynomial) functions into the quantum Fourier transform greatly improves the success of the sampling. This requires, however, a special kind of machine that, according to some of its principal conceivers, harnesses the power of multiple universes. There are books and even a movie about such machines, but none have been built yet and this is not a Dan Brown novel so we’ll stay classically rooted.
Two great facts about polynomials of degree are:
The second requires the coefficients of to be integers. Neither requires all the roots to be integers, but we will begin by assuming this is the case. Take to be the set of integer roots of . Then define
where as usual . The key point is that
To prove this, suppose the random belongs to but not to . Then for some , so , but since . There is still the possibility that is a nonzero multiple of , which would give and deny success, but this entails and so this is accounted by subtracting off .
Our lower bound will be based on . There is one more important element of the analysis. We do not have to bound the running time for all , that is, all pairs of primes. The security of factoring being hard is needed for almost all . Hence to challenge this, it suffices to show that is large in average case over . Thus we are estimating the distributional complexity of a randomized algorithm. These are two separate components of the analysis. We will show:
For many primes belonging to a large set of primes of length substantially below , where is the length of , is “large.”
We will quantify “large” at the end, and it will follow that since is substantially greater, is “tiny” in the needed sense. Now we are ready to estimate the key cardinality .
In the best case, can be larger than by a factor of . This happens if for every root , the values do not hit any other members of . When this happens, itself can be as large as . Then
By a similar token, for any , if , then —or in general,
The factor of is the lever by which to gain a higher likelihood of quick success. When will it be at our disposal? It depends on whether is “good” in the sense that and also on itself being large enough.
For each root and define the “strand” where There are always distinct values in any strand. If then every strand has most as non-roots. There is still the possibility that —that is, such that —which would prevent a successful exit. This is where really comes in, attending to the upper bound .
The Paris church of St. Sulpice and its crypt (source)
Hence what can make a prime “bad” is having a low number of strands. When and the strands and coincide—and this happens for any other such that divides .
Here is where we hit the last important requirement on . Suppose where is the product of every prime other than . Then and coincide for every prime . It doesn’t matter that is astronomically bigger than or ; the strands still coincide within and within .
Hence what we need to do is bound the roots by some value that is greater than any we are likely to encounter. The is not too great: if we limit to of some same given length as that of , then so . We need not impose the requirement but must replace above by where . We can’t get in trouble from such that divides and divides since then divides already. This allows the key observation:
For any distinct pair , there are at most primes such that divides .
Thus given we have “slots” for primes . Every bad prime must occupy a certain number of these slots. Counting these involves the last main ingredient in Dick’s paper. We again try to view it a different way.
Given , and replacing the original with , ultimately we want to call a prime bad if , where . We will approach this by calling “bad” if there are strands.
For intuition, let’s suppose , . If we take as the paper does, then we can make bad by inserting it into three slots: say , , and . We could instead insert a copy of into , , and , which lumps into one strand and leaves free to make two others. In the latter case, however, we also know by transitivity that divides , , and as well. Thus we have effectively used up not slots on . Now suppose instead, so “bad” means getting down to strands. Then we are forced to create at least one -clique and this means using more than slots. Combinatorially the problem we are facing is:
Cover nodes by cliques while minimizing the total number of edges.
This problem has an easy answer: make all cliques as small as possible. Supposing is an integer, this means making -many -cliques, which (ignoring the difference between and ) totals edges. When is constant this is , but shows possible ways to improve when is not constant. We conclude:
Lemma 1 The number of bad primes is at most .
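A quick numeric check of the principle that small cliques minimize total edges (the values are our own):

```python
def total_edges(clique_sizes):
    """Edges used when covering nodes by cliques of the given sizes."""
    return sum(s * (s - 1) // 2 for s in clique_sizes)

# Covering v = 12 nodes by cliques of size at least k = 3:
small = total_edges([3, 3, 3, 3])   # four triangles: 4 * C(3,2) = 12 edges
big = total_edges([12])             # one big clique:  C(12,2)   = 66 edges
```

In general, partitioning $v$ nodes into cliques of size $k$ uses about $v(k-1)/2$ edges, linear in $v$ for constant $k$, which is what the counting of "slots" for bad primes exploits.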
We will constrain by bounding the degree of . By this will also bound relative to so that the number of possible strands is small with respect to , which will lead to the desired bound on . Now we are able to conclude the analysis well enough to state a result.
Define to mean problems solvable on a fraction of inputs by randomized algorithms with expected time . Superscripting means having an oracle to compute from for free. If is such that the time to compute is and this is done once per trial, then a given algorithm can be re-classified into without the oracle notation.
Theorem 2 Suppose is a sequence of polynomials in of degrees having integer roots in the interval , for all . Then for any fixed , the problem of factoring -bit integers with belongs to
provided and .
Proof: We first note that the probability of a random making is negligible. By the Chinese Remainder Theorem, as remarked above, gives independent draws and and whether depends only on . This induces a polynomial over the field of degree (at most) . So the probability of getting a root mod is at most
which is exponentially vanishing. Thus we may ignore in the rest of the analysis. The chance of a randomly sampled in a strand of length coinciding with another member of is likewise bounded by and hence ignorable.
The reason why the probability of giving a root in the field is not vanishing is that is close to . By , we satisfy the constraint
The condition ensures that this is the actual asymptotic order of . Since we are limiting attention to primes of the same length as , the “” above can be to base . Hence has the right order to give that for some and constant fraction of primes of length , the success probability of one trial over satisfies
Hence the expected number of trials is . The extra in the theorem statement is the time for each iteration, i.e., for arithmetic modulo and the Euclidean algorithm.
It follows that if is also computable modulo in time, and presuming so that , then factoring products of primes whose lengths differ by just a hair is in randomized average-case polynomial time. Of course this depends on the availability of a suitable polynomial . But could be any polynomial—it needs no relation to factoring other than having plenty of distinct roots relative to its degree as itemized above. Hence there might be a lot of scope for such “dangerous” polynomials to exist.
Is there a supercomputer under the Palais Royale? (source)
Dick’s paper does give an example where a with specified properties cannot exist, but there is still a lot of play in the bounds above. This emboldens us also to ask exactly how big the “hair” needs to be. We do not actually need to send toward zero. If a constant fraction of the values get bounced by the event, then the expected time just goes up by the same constant factor.
We have tried to present Dick’s paper in an “open” manner that encourages variations of its underlying enigma. We have also optically improved the result by using rather than as in the paper. However, this may be implicit anyway since the paper’s proofs might not require “” to be constant, so that by taking one can make for any desired factor . Is all of this correct?
If so, then possibly one can come even closer to for the length of . Then the question shifts to the possibilities of finding suitable polynomials . The paper “Few Product Gates But Many Zeroes,” by Bernd Borchert, Pierre McKenzie, and Klaus Reinhardt, goes into such issues. This paper investigates “gems”—that is, integer polynomials of degree having distinct integer roots and minimum possible circuit complexity for their degree—finding some for as high as 55 but notably leaving open. Moreover, the role of a limitation on the magnitude of a constant fraction of a gem’s roots remains at issue, along with roots exceeding having many relatively small prime factors.
Finally, we address the general case with rational coefficients (and ). If in lowest terms, then means (divided by ), so the algorithm is the same. Suppose is a rational root in lowest terms and does not divide , nor the denominator of any coefficient. Then we can take such that for some and define . This gives
which we write as . Then . Because is a root, it follows that is a sum of terms in which each numerator is a multiple of and each denominator is not. So in lowest terms where possibly . Thus either yields or falls into one of two cases we already know how to count: is another root of or we have found a root mod . Since behaves the same as for all , we can define integer “strands” as before. There remains the possibility that strands induced by two roots and coincide. Take the inverse for and resulting integer , then the strands coincide if . This happens iff . Multiplying both sides by gives
so it follows that divides the numerator of in lowest terms. Thus we again have “slots” for each distinct pair of rational roots and each possible prime divisor of the numerator of their difference. Essentially the same counting argument shows that a “bad” must fill of such slots. The other ways a prime can be bad include dividing the denominator of a root or the denominator of a coefficient—although neither way is mentioned in the paper, it seems the choices for and in the above theorem leave just enough headroom. Then we just need to be a bound on all numerators and denominators involved in the bad cases, arguing as before. Last, it seems as above that only a subset of the roots with constant (or at least non-negligible) is needed to obey this bound. Assuming this sketch of Dick’s full argument is airtight and works for our improved result, we leave its possible further ramifications over the integer case as a further open problem.
Update 7/10:
I’ve made a web folder of photos from Dick’s wedding and honeymoon.
[linked wedding announcement, clarified nature of Shor’s phi, added more about gems, linked photos, fixed n/2 – eps to n/2 – eps*n in theorem statement]
Anna Gilbert and Atri Rudra are top theorists who are well known for their work in unraveling secrets of computation. They are experts on anything to do with coding theory—see this for a book draft by Atri with Venkatesan Guruswami and Madhu Sudan called Essential Coding Theory. They also do great theory research involving not only linear algebra but also much non-linear algebra of continuous functions and approximative numerical methods.
Today we want to focus on a recent piece of research they have done that is different from their usual work: It contains no proofs, no conjectures, nor even any mathematical symbols.
Their new working paper is titled, “Teaching Theory in the time of Data Science/Big Data.” As you might guess it is about the role of theory in the education of computer scientists today. The paper contains much information that they have collected on what is being taught at some of the top departments in computer science, and how the current immense interest in Big Data is affecting classic theory courses.
A short overview of what they find is:
The above is leading to pressure to delete and/or modify theory courses. From Atri’s CS viewpoint and Anna’s as Mathematics faculty active in the theory community, both wish to see CS majors obtain degrees that leave them well versed in CS in general and theory in particular. Undergraduates in programs with a CS component should likewise be well served in formal and mathematical areas. Is this possible given the finite constraints on the curricula? It is not clear, but their paper shows what is happening right now with theory courses (plus linear algebra and probability/statistics), what is being planned for the near future, and some options that may be useful to consider.
§
For the purpose of this post, we made some edits to their text which follows, with their permission. Some changes were stylistic and some more content-oriented. Their PDF version linked as above may evolve over time—especially upon success of their appeal for reader input at the end. So to obtain a complete and current picture please visit their paper too.
Now Anna and Atri speak:
The genesis of this article is a conversation between the two authors that started six weeks ago. One of us (Anna) was giving a talk at an NSF workshop on Theoretical Foundations of Data Science (TFoDS) and the other (Atri) was thinking about changes to the Computer Science (henceforth CS) curriculum that his department at the University at Buffalo is considering. Anna’s talk at NSF, which included data on theory courses at top ranked schools, generated a great deal of interest in knowing even more about the state of theory courses. This was followed by more data collection on our part.
This post is meant as a starting point of discussion on how we teach theory courses, especially in the light of the increased importance of data science. It is not a position paper—it does not argue that the current trends are inherently good or bad, nor does it prescribe any silver bullet. We do suggest some possible courses of action around which discussion can begin.
CS enrollments as well as the numbers of CS majors have increased exponentially in the last few years. In 2014, Ed Lazowska, Eric Roberts, and Jim Kurose exhibited the trend in the former, not only in the number of majors. Their graphs in Figure 1 show the trend in introductory CS course enrollments at six institutions in the years 2006–2014.
Figure 1. Enrollment trends in introductory CS sequences at six institutions (Stanford, MIT, University of Pennsylvania, Harvard, University of Michigan, and University of Washington) from 2006–2014.
Lazowska’s presentation has more detailed statistics and a discussion of the potential implications of these increases. These trends remain valid in 2016, for example as shown by the following chart for the University at Buffalo. In addition to total number of CSE majors, it shows the enrollment in CSE 115 (the introduction to CSE course), CSE 191 (Discrete Math), CSE 250 (Data Structures), CSE 331 (Algorithms) and CSE 396 (Theory of Computation), all of which are required of all CS majors:
Figure 2. Enrollment trends, University at Buffalo CSE 8/08–5/16, with total majors.
As enrollments out-pace hiring, class sizes have exploded. Lazowska points out that over 10% of Princeton’s majors are CS majors, while it is highly unlikely that 10% of Princeton’s faculty will ever be CS faculty. At the same time, many institutions are re-evaluating and changing their theoretical computer science (henceforth TCS) course requirements and content.
The twin pressures of staffing and content are shifting priorities in both the material covered and how it is covered—e.g., reducing emphasis on proofs and essay-type problems which are harder to grade. We are not judging these shifts or tying them directly to enrollments, but are for now observing that they are happening and impact a large (and increasing) number of students.
The changes in course content, in emphasis on particular TCS components, and in overall CS requirements (including mathematics and statistics) are occurring exactly when there is a big move towards “computational thinking” in many fields and a national emphasis on STEM education more broadly. Not only are the fundamental backgrounds of incoming CS majors thereby changing, but the CS audience is expanding to students in other fields that are benefiting from solid computational foundations. With the increasing role of data and concomitant needs for machine learning and statistics, it is important to obtain a deep understanding of the mathematical foundations of data science. Traditional TCS has been founded on discrete mathematics, but “continuous” math—especially as related to statistics, probability, and linear algebra—is increasingly important in ways also reflected by cutting-edge TCS research.
We considered the top 20 CS schools according to the US News ranking of graduate programs, numbering 24 including ties. It may be inappropriate to use the graduate program rankings to consider the undergraduate program requirements, and it should be noted that the rankings cover all of the graduate program, not just TCS, but this is a reasonable starting point. We sent colleagues a short survey and collected data (available as a spreadsheet) on these 24 schools. Since several schools include Engineering, either within one department as at Buffalo or as a separate department as at Michigan, we will use `CSE’ as the collective term.
We counted the total number of theory courses that all CS majors have to take within the CSE department and then calculated the fraction over the total number of required courses. We categorized the theory courses under these bins:
The bounds are not sharp—a Data Structures course always covers algorithms associated to the data structures and may overlap with an Algorithms course especially when graphs are covered—and Algorithms often includes some complexity theory, especially NP-completeness. In our spreadsheet these columns are followed by the number of theory electives—besides these required courses—that all CS majors have to take. We would like to clarify four things:
We begin with statistics on the total number of semesters of theory courses that are currently required of all CS majors, standardly equating 3 quarters or trimesters to 2 semesters. The basic statistics are in Table 1.
The median number of semester-long courses was three. All but one school requires a discrete math course, all but two require a Data Structures course, and all but nine require an Algorithms course. Eight schools require a Theory of Computation course separate from Algorithms. All these schools have a significant programming component in their Data Structures course. Only one, Cornell, currently adds programming assignments in the required algorithms course. We would like to remind the reader that we are only considering TCS courses required of all CS majors—for instance, CS 124/125 at Harvard has programming assignments but is not required of all CS majors.
We limited attention to cases where courses in Probability/Statistics and/or Linear Algebra are required of all CS majors but taught outside of CSE. We focus on these two courses since they are most relevant to data science.
Probability/Statistics. Of those surveyed, nineteen schools required a Probability/Statistics course, while five did not. Five had developed a specific required course within the CSE department (Stanford, Berkeley, UIUC, Univ. of Washington, and MIT), three had choices among courses both inside and outside the CSE department, and eleven required a course outside CSE. Of the five institutions that did not require a Probability/Statistics course, two (Univ. of Wisconsin and Harvard) listed such a course among electives in Mathematics. Princeton, Yale, and Brown do not list such a course.
Linear Algebra. Sixteen surveyed schools require a Linear Algebra course, out of 24 total. Of the 16, only Brown and Columbia provide a linear algebra course within CSE that satisfies the requirement, though both allow for non-CSE linear algebra courses.
After reflecting on the data in relation to our initial observations about increasing CS enrollments and emphasis on computational thinking across disciplines, we dug deeper and asked people further questions about changes they have seen or are discussing at their institutions. Of eight departments responding (as of 6/10/16):
Four universities changed their Mathematics requirements in the last 10 years. These changes are primarily to require fewer semesters of Calculus II or III (e.g., some no longer require Ordinary Differential Equations) and, instead, require Linear Algebra and/or Probability/Statistics (whether inside the CSE department or not). Two institutions plan to make changes in the future, likely to require Linear Algebra.
We suggest that now is the time to re-think some of the theory curriculum, to work with our colleagues in Mathematics and Statistics, and to develop mathematical foundations classes that are appropriate both for CS majors and STEM majors more broadly. Especially for CS majors, this exposure should come no later than junior year. Here are some starting points for this discussion.
Our goal is to educate the different students at our respective institutions as best we can, by working with our colleagues at our home institutes and by having a dialogue with our theory colleagues across the country.
After sending emails initially to friends in our social networks to gather data and/or supplement the above preliminary analysis, we noted that we had asked only three women total. We then mused on how we could have increased that number by thinking a bit harder about which women were in our social network and whether the institutions we collected figures for had women theorists. We found that, upon reflection, we could have asked eight more women in our social networks, for a total of 11 women theorists, each at a different school, among the top 24 institutions. There are certainly more than 11 institutions with women theorists but either the women faculty are in areas we are not familiar with or they are women in our areas whom we do not yet know personally (e.g., new, junior faculty). In other words, a ten-minute reflection yielded an almost four-fold increase in representatives from an under-represented group.
We recognize that our sample covers only 24 top institutions. This was done mostly to reduce work on our part since the first data was collected by reading the relevant curricula webpages. Needless to say, a better picture of TCS and math requirements for CS degrees in schools in the US can be gained with more data. We are hoping that readers of this blog at many more institutions can make valuable contributions to our data collection and discussion. Those of you interested can contribute your institution’s information to this survey by filling in a Google form. We will periodically update the master spreadsheet with information that we get from this Google form.
We join Anna and Atri in their appeal which ends their paper: the destiny of theory courses can be considered as one large “open problem.” They conclude by thanking those who have already contributed data, as well as others at Michigan, Buffalo, Georgia Tech (besides us), and MIT for inputs to their article.
We have a few remarks of our own: The main ulterior purpose of theory courses is to sharpen analytical modes of thinking and linear deductive argument, among skills often lumped into the general term “mathematical maturity.” The Internet and advances in technology have brought greater and quicker rewards for non-linear, associative, and more-visual modes. These might seem to compete with or even replace “theory,” but the point behind Anna and Atri’s post is that while diffused among more courses in various areas, the need for analytical and linear-deductive experience grows overall.
What emerges is a greater call for mathematical maturity before capstone courses in these areas, as opposed to the view that a required theory course can be taken in the senior year. Shifting TCS material into an early discrete mathematics course may accomplish this. As we have discussed in Buffalo, this could accompany an across-the-board upgrade in rigor of our entry curriculum, but that may discourage some types of students. That in turn might slow increased enrollments—amid several feedback loops whose consequences are an open problem.
[clarified in Buffalo figure that “Total” means majors.]