John Robson has worked on various problems including what is still the best result on separating words—the topic we discussed the other day. Ken first knew him for his proof that checkers is EXPTIME-complete and similar hardness results for chess and Go.
Today I want to talk about his theorem that any two words can be separated by an automaton with relatively few states.
In his famous paper from 1989, he proved an upper bound on the Separating Word Problem. This is the question: Given two distinct strings x and y, how many states does a deterministic automaton need to be able to accept x and reject y? His theorem is:
Theorem 1 (Robson’s Theorem) Suppose that x and y are distinct strings of length n. Then there is an automaton with at most O(n^{2/5} (log n)^{3/5}) states that accepts x and rejects y.
The story of his result is involved. For starters, it is still the best upper bound after almost three decades. Impressive. Another issue is that a web search does not quickly, at least for me, find a PDF of the original paper. I tried to find it and could not. More recent papers on the separating word problem reference his 1989 paper, but they do not explain how he proves it.
Recall the problem of separating words is: Given two distinct words of length n, is there a deterministic finite automaton that accepts one and rejects the other, with as few states as possible? Thus his theorem shows that roughly the number of states grows at most like the square root of n.
I did finally track the paper down. The trouble for me is the paper is encrypted. Well not exactly, but the version I did find is a poor copy of the original. Here is an example to show what I mean:
[ An Example ]
So the task of decoding the proof is a challenge. A challenge, but a rewarding one.
Robson’s proof uses two insights. The first is that he uses some basic string-ology. That is, he uses some basic facts about strings. For example, he uses the fact that a non-periodic string cannot overlap itself too much.
He also uses a clever trick on how to simulate two deterministic machines for the price of one. This in general is not possible, and is related to deep questions about automata that we have discussed before here. Robson shows that it can be done in a special but important case.
Let me explain. Suppose that w is a string. We can easily design an automaton that accepts an input x if and only if x is the string w. The machine will have order the length of w states. So far quite simple.
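To make that concrete, here is a minimal sketch (my own illustration, not from Robson's paper) of such an exact-match automaton: len(w) + 2 states, counting the accepting state and a dead state.

```python
def exact_match_dfa(w, alphabet="01"):
    """Build a DFA with len(w) + 2 states (0..len(w), plus a dead state)
    that accepts exactly the string w and nothing else."""
    n = len(w)
    DEAD = n + 1
    delta = {}
    for i in range(n):
        for a in alphabet:
            delta[(i, a)] = i + 1 if a == w[i] else DEAD
    for a in alphabet:
        delta[(n, a)] = DEAD      # any extra input kills acceptance
        delta[(DEAD, a)] = DEAD
    return delta, n               # transition table and the accepting state

def run(dfa, s):
    delta, accept = dfa
    state = 0
    for a in s:
        state = delta[(state, a)]
    return state == accept

dfa = exact_match_dfa("0110")
assert run(dfa, "0110")
assert not run(dfa, "0111") and not run(dfa, "011")
```

Note the dead state: once a single character disagrees with w, acceptance is impossible, which is exactly why the state count is linear in |w|.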
Now suppose that we have a string of length and wish to find a particular occurrence of the pattern in . We assume that there are occurrences of in . The task is to construct an automaton that accepts at the end of the copy of . Robson shows that this can be done by an automaton that has order
Here is the length of the string .
This is a simple, clever, and quite useful observation. Clever indeed. The obvious automaton that can do this would seem to require a Cartesian product of two machines. This would imply that it would require
number of states: Note the times operator rather than addition. Thus Robson’s trick is a huge improvement.
Here is how he does this.
Robson uses a clever trick in his proof of the main lemma. Let’s work through an example. The goal is to see if there is a copy of a given string starting at a position that is a multiple of 3.
The machine starts in state and tries to find the correct string as input. If it does, then it reaches the accepting state . If while doing this it gets a wrong input, then it switches to states that have stopped looking for the input . After seeing three inputs the machine reaches and then moves back to the start state.
[ The automaton ]
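The trick can be simulated in a few lines of Python (my own reconstruction from the description above, not Robson's exact machine). The automaton needs only order |w| states: "match" states tracking progress through w, plus "skip" states that count out the rest of the current block after a mismatch, rather than a product of two machines.

```python
def find_mod_occurrence(w, s):
    """Scan s with O(len(w)) states, looking for a copy of w that starts
    at a position which is a multiple of len(w).  State ("match", i) means
    i characters of w matched in the current block; ("skip", i) means the
    current block is spoiled and i of its characters have been consumed."""
    L = len(w)
    state = ("match", 0)
    for a in s:
        kind, i = state
        if kind == "match":
            if a == w[i]:
                i += 1
                if i == L:
                    return True                  # accepting state reached
                state = ("match", i)
            else:
                nxt = (i + 1) % L                # stop matching this block
                state = ("match", 0) if nxt == 0 else ("skip", nxt)
        else:
            j = (i + 1) % L                      # count down to block boundary
            state = ("skip", j) if j else ("match", 0)
    return False

assert find_mod_occurrence("aba", "xyzaba")      # copy at position 3
assert not find_mod_occurrence("aba", "xabaxx")  # only a misaligned copy
```

The misaligned copy in the second test is invisible to the machine, which is the intended behavior: it only looks at block-aligned positions.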
We will now outline the proof in some detail.
The first lemma is a simple fact about hashing.
Lemma 2 Suppose that i ≠ j and i, j ≤ n.
Then all but at most log₂ n primes p satisfy i ≢ j (mod p).
Proof: Consider the quantity i − j for i not equal to j. Call a prime bad if it divides this quantity. Since |i − j| < n, this quantity can be divisible by at most log₂ n primes. So there are at most log₂ n bad primes in total.
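A quick sanity check of this counting argument (my own illustration): since a number below 2^k has fewer than k prime factors, only a logarithmic number of primes can fail to separate i from j.

```python
import math

def primes_up_to(m):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (m + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(m ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [p for p, ok in enumerate(sieve) if ok]

def bad_primes(i, j, limit):
    """Primes p <= limit that fail to separate i and j, i.e. p | (i - j)."""
    d = abs(i - j)
    return [p for p in primes_up_to(limit) if d % p == 0]

i, j, n = 1234, 5678, 10_000
bad = bad_primes(i, j, 100)
assert len(bad) <= math.log2(n)     # at most log2(n) bad primes
# hence some small prime distinguishes i from j modulo p
good = [p for p in primes_up_to(100) if p not in bad]
assert any(i % p != j % p for p in good)
```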
We need some definitions about strings. Let be the length of the string . Also let be the number of occurrences of in .
A string s has the period p provided s[k] = s[k + p]
for all k so that both sides are defined. A string is periodic provided it has a period that is less than half its length. Note, the shorter the period the more the string is really “periodic”: for example, the string 01010101 (period 2)
is more “periodic” than 01100110 (period 4).
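These definitions are easy to check mechanically; here is a small Python sketch (my own, for illustration):

```python
def smallest_period(s):
    """Smallest p >= 1 with s[k] == s[k + p] whenever both are defined."""
    for p in range(1, len(s) + 1):
        if all(s[k] == s[k + p] for k in range(len(s) - p)):
            return p
    return len(s)

def is_periodic(s):
    # periodic = has a period shorter than half the length
    return smallest_period(s) < len(s) / 2

assert smallest_period("abababab") == 2
assert is_periodic("abababab")
assert not is_periodic("aabbab")
```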
Lemma 3 For any string either or is not periodic.
Proof: Suppose that is periodic with period where is a single character. Let the length of equal . So by definition, . Then
for . So it follows that
This shows that and cannot both be periodic, since
Lemma 4 Suppose that is not a periodic string. Then the number of copies of in a string is upper bounded by where
Proof: The claim follows once we prove that no two copies of in can overlap more than where is the length of . This will immediately imply the lemma.
If has two copies in that overlap then clearly
for some and all in the range . This says that has the period . Since is not periodic it follows that . This implies that the overlap of the two copies of is at most length . Thus we have shown that they cannot overlap too much.
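The consequence of the overlap bound can be seen concretely (an illustrative example of mine, not from the paper): occurrences of a non-periodic pattern must be spread out, so a string of length n holds at most about 2n/|w| copies.

```python
def occurrences(w, s):
    """Starting positions of all (possibly overlapping) copies of w in s."""
    return [k for k in range(len(s) - len(w) + 1) if s[k:k + len(w)] == w]

# For a non-periodic w, two copies cannot overlap in more than |w|/2
# positions, so consecutive occurrences are more than |w|/2 apart.
w = "aab"                      # non-periodic: smallest period 3 > len(w)/2
s = "aabxaabaabyy"
occ = occurrences(w, s)
assert occ == [0, 4, 7]
assert all(b - a > len(w) / 2 for a, b in zip(occ, occ[1:]))
assert len(occ) <= 2 * len(s) / len(w)
```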
Say an automaton finds the occurrence of in provided it enters a special state after scanning the last bit of this occurrence.
Lemma 5 Let be a string of length and let be a non-periodic string. Then, there is an automaton with at most states that can find the occurrence of in where
Here allows factors that are fixed powers of . This lemma is the main insight of Robson and will be proved later.
The following is a slightly weaker version of Robson’s theorem. I am still confused a bit about his stronger theorem, to be honest.
Theorem 6 (Robson’s Theorem) Suppose that x and y are distinct strings of length n. Then there is an automaton with at most roughly √n states that accepts x and rejects y.
Proof: Since x and y are distinct we can assume that x starts with the prefix w1 and y starts with the prefix w0 for some string w. If the length of w is small the theorem is trivial: just construct an automaton that accepts the prefix w1 and rejects the prefix w0.
So we can assume that for some strings and where the latter is order in length. By Lemma 3 we can assume that it is not periodic. So by Lemma 4 we get that
Then by Lemma 5 we are done.
Proof: Let have length and let be a non-periodic string in of length . Also let . By the overlap lemma it follows that is bounded by .
Let occur at locations
Suppose that we are to construct a machine that finds the copy of . By the hashing lemma there is a prime so that
if and only if . Note we can also assume that .
Let’s argue the special case where the location is congruent to 0 modulo p. If it is congruent to another value, the same argument can be used: this follows by having the machine initially skip a fixed amount of the input and then do the same as in the congruent-to-0 case.
The automaton has states and for . The machine starts in state and tries to get to the accepting state . The transitions include:
This means that the machine keeps checking the input to see if it is scanning a copy of . If it gets all the way to the accepting state , then it stops.
Further transitions are:
and
The second group means that if a wrong input happens, then moves to . Finally, the state resets and starts the search again by going to the start state with an epsilon move.
Clearly this has the required number of states and it operates correctly.
The open problem is: Can the SWP be solved with a better bound? The lower bound is still order log n. So the gap is exponential.
[ From his home page ]
Jeffrey Shallit is a famous researcher into many things, including number theory and being a skeptic. He has a colorful website with an extensive quotation page—one of my favorites by Howard Aiken is right at the top:
Don’t worry about people stealing an idea. If it’s original, you will have to ram it down their throats.
Today I thought I would discuss a wonderful problem that Jeffrey has worked on.
Jeffrey’s paper is joint with Erik Demaine, Sarah Eisenstat, and David Wilson. See also his talk. They say in their introduction:
Imagine a computing device with very limited powers. What is the simplest computational problem you could ask it to solve? It is not the addition of two numbers, nor sorting, nor string matching—it is telling two inputs apart: distinguishing them in some way.
More formally:
Let x and y be two distinct long strings over the usual binary alphabet. What is the size of the smallest deterministic automaton that can accept one of the strings and reject the other?
That is, how hard is it for a simple type of machine to tell x apart from y? There is no super cool name for the question—it is called the Separating Words Problem (SWP).
Pavel Goralčik and Vaclav Koubek introduced the problem in 1986—see their paper here. Suppose that x and y are distinct binary words of length n. Define SEP(x, y) to be the number of states of the smallest automaton that accepts x and rejects y, or vice-versa. They proved the result that got people interested:
Theorem 1 For all distinct binary words x and y of length n, SEP(x, y) = o(n).
That is, the size of the automaton is asymptotically sub-linear. Of course there is trivially a way to tell the words apart with order n states. The surprise is that one can do better, always.
In 1989 John Robson obtained the best known result:
Theorem 2 For all distinct binary words x and y of length n, SEP(x, y) = O(n^{2/5} (log n)^{3/5}).
This bound is pretty strange. We rarely see bounds like it. This suggests to me that it is either special or it is not optimal. Not clear which is the case. By the way, it is also known that there are x and y so that SEP(x, y) = Ω(log n).
Thus there is an exponential gap between the known lower and upper bounds. Welcome to complexity theory.
What heightens interest in this gap is that whenever the words have different lengths, there is always a logarithmic-size automaton that separates them. The reason is our old friend, the Chinese Remainder Theorem. Simply, if |x| ≠ |y| there is always a short prime p that does not divide |x| − |y|, which means that the DFA that goes in a cycle of length p will end in a different state on any x of length |x| from the state on any y of length |y|. Moreover, there are pairs of words of different lengths that require order log n states to separate, and padding them with 0s gives equal-length pairs of all lengths with SEP(x, y) = Ω(log n).
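Here is a sketch of that Chinese-Remainder argument in Python (illustrative; the precise constant hidden in "short prime" is glossed over). A DFA that just counts the input length modulo p uses p states and separates the two words.

```python
def separating_prime(n1, n2):
    """Smallest prime p with n1 % p != n2 % p.  Since |n1 - n2| has at
    most log2(|n1 - n2|) prime factors, p is O(log n) in general."""
    p = 2
    while True:
        is_prime = all(p % q for q in range(2, p))   # trial division
        if is_prime and n1 % p != n2 % p:
            return p
        p += 1

# Words of lengths 1000 and 1016: the difference 16 = 2^4, so the very
# next prime, 3, already separates the lengths modulo p.
p = separating_prime(1000, 1016)
assert p == 3
assert 1000 % p != 1016 % p
```

The p-state "cycle" DFA ends in state (length mod p), which differs on the two words, so the accepting set can be chosen to accept one and reject the other.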
Some other facts about SWP can be found in the paper:
Point (b) underscores why it has been hard to find “bad pairs” that defeat all small DFAs. All this promotes belief that logarithmic is the true upper bound as well. Jeffrey stopped short of calling this a conjecture in his talk, but he did offer a 100-pound prize (the talk was in Britain) for improving Robson’s bound.
There are many partial results in cases where and are restricted in some way. See the papers for details. I thought I would just repeat a couple of interesting open cases.
How hard is it to tell words from their reversal? That is, if x is a word can we prove a better bound on SEP(x, x^R)?
Recall x^R is the reversal of the word x. Of course we assume that x is not the same as its reversal—that is, we assume that x is not a palindrome.
How hard is it to tell words apart from their cyclic shifts?
How hard is it to tell words from other transformed copies? You get the idea: try other operations on words.
The SWP is a neat question in my opinion. I wonder if there would be some interesting consequence if we could always tell two words apart with few states. The good measure of a conjecture is: how many consequences are there that follow from it? I wonder if there could be some interesting applications. What do you think?
Self-play and Ramsey numbers
[ Talking about worst case ]
Avrim Blum is the CAO for TTIC. That is, he is the Chief Academic Officer at the Toyota Technological Institute at Chicago. Avrim has made and continues to make key contributions to many areas of theory—including machine learning, approximation algorithms, on-line algorithms, algorithmic game theory, the theory of database privacy, and non-worst-case analysis of algorithms.
Today I want to discuss a suggestion of Avrim for research on self-play.
Self-play is the key to many recent AI results on playing games. These results include essentially solving the games Chess, Go, Shogi, forms of poker, and many others. They were solved by algorithms that start with no knowledge of the game, save the rules. The algorithm then learns the secrets of playing the game by self-play: by playing games against itself. For example, the AI chess programs did not know that “a rook is worth more than a pawn.” But they discovered that by playing the game over and over. Impressive.
For example, David Sweet on his hacker site says referring to self-play:
This is mysterious to me. If it only played against itself, where did new information come from? How did it know if it was doing well? If I play a game of chess against myself, should I say I did well if I beat myself? But when I beat myself, I also lose to myself. And how could I ever know if I’d do well against someone else?
I was at TTIC last month and over lunch we discussed self-play possibilities for theory problems. I suggested that the planted clique problem might be a potential example. Recall the planted clique problem is the task of distinguishing two types of graphs:
If the planted clique is large, it is easy to tell these apart—just count the number of edges. If it is small enough, it is open whether one can tell them apart. The largest clique in a random graph typically has size near 2 log₂ n. This implies that there is a quasi-polynomial time average-case algorithm: just try all subsets of size around 2 log₂ n.
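The easy large-clique case can be demonstrated in a few lines (my own illustration; parameters chosen so the signal is unmistakable). A planted k-clique adds about k(k−1)/4 edges on top of the mean n(n−1)/4, which dwarfs the standard deviation √(n(n−1)/8) of the edge count.

```python
import random

def sample_graph(n, planted=0):
    """Erdos-Renyi G(n, 1/2); optionally force a clique on vertices 0..planted-1."""
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            e = random.random() < 0.5 or (i < planted and j < planted)
            adj[i][j] = adj[j][i] = e
    return adj

def edge_count(adj):
    n = len(adj)
    return sum(adj[i][j] for i in range(n) for j in range(i + 1, n))

random.seed(1)
n, k = 200, 60
plain = edge_count(sample_graph(n))
loaded = edge_count(sample_graph(n, planted=k))
# Threshold halfway up the expected shift of k*(k-1)/4 extra edges:
threshold = n * (n - 1) / 4 + k * (k - 1) / 8
assert plain < threshold < loaded
```

For k near 2 log₂ n, of course, this edge-count test is hopeless, which is exactly where the problem becomes interesting.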
My intuition was that a program might be able to exploit self-play to solve planted clique problems. The point is that it is easy, by definition, to generate “yes” and “no” examples for this problem. Note, this is not known for SAT problems—generating hard instances there is not clear. This was my point. Could the AI methods somehow divine the planted clique version of “a rook is worth more than a pawn”? Could they use self-play to solve planted clique problems?
I wondered to the lunch group about all this. It left the group unexcited.
Then Avrim asked a better question. He wondered if self-play methods could be used to solve a long-standing problem concerning Ramsey numbers. Recall the Ramsey number R(k, k) is defined as the smallest n such that every red-green coloring of the edges of the complete graph K_n has either a red or a green complete subgraph of size at least k.
The exact value of R(5, 5) is unknown, although it is known to lie between 43 and 48. See a post by Gil Kalai on his blog for some discussions. Joel Spencer quotes Paul Erdős:
Erdős asks us to imagine an alien force, vastly more powerful than us, landing on Earth and demanding the value of R(5, 5) or they will destroy our planet. In that case, he claims, we should marshal all our computers and all our mathematicians and attempt to find the value. But suppose, instead, that they ask for R(6, 6). In that case, he believes, we should attempt to destroy the aliens.
Aliens are not attacking currently, but Avrim’s idea is that perhaps we could organize a self-play attack on the problem. The idea would be to try to build a “game” version of this question. The algorithm would try to create a strategy that finds a red/green coloring of a complete graph so that no 5-clique is all red or all green.
We need to arrange the computation of the Ramsey number as the result of some type of game. The paradox to me is that playing a game suggests low complexity, while the Ramsey calculation is clearly of much higher complexity. So how do we make the Ramsey calculation into a game? Ken and I wonder if there is a natural game so that playing the game well yields insight into the value of a Ramsey number.
Is there a game with simple rules so that playing it well yields bounds on general Ramsey numbers?
There have been several attempts to use non-standard methods to compute Ramsey numbers. See the following:
See also this survey on computational methods in general.
The asymmetry between the upper and lower bounds shapes approaches. The lower bound of 43 was proved by finding a two-coloring of the edges of K_{42} without a green or red 5-clique. Once a single coloring is guessed its property is easy to verify. The improvement of the upper bound from 49 to 48 two years ago needed checking two trillion cases in order to fence away all possible colorings. This has led to a common belief that 43 is the answer—if it were higher then a coloring of K_{43} would have been found by now.
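The verification side of the lower bound really is easy; here is an illustrative brute-force checker (mine, and feasible only for tiny cases, nothing remotely like K_42). It confirms the classic k = 3 facts: K_5 has a coloring with no monochromatic triangle, while every coloring of K_6 has one, i.e. R(3, 3) = 6.

```python
from itertools import combinations

def has_mono_clique(n, color, k):
    """color(i, j) -> 0/1; brute-force check every k-subset of vertices
    of K_n for a monochromatic k-clique."""
    for sub in combinations(range(n), k):
        first = color(sub[0], sub[1])
        if all(color(i, j) == first for i, j in combinations(sub, 2)):
            return True
    return False

# Pentagon/pentagram coloring of K_5: cycle edges one color, diagonals
# the other; neither 5-cycle contains a triangle.
pent = lambda i, j: 1 if (i - j) % 5 in (1, 4) else 0
assert not has_mono_clique(5, pent, 3)

# Any all-one-color coloring of K_6 trivially has a monochromatic triangle,
# and by Ramsey's theorem so does every other coloring of K_6.
assert has_mono_clique(6, lambda i, j: 0, 3)
```

A self-play learner for Avrim's game would be scored by exactly this kind of check, with the coloring as its move.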
Can we use self-play to turn this belief into something more concrete? The training would begin by running self-play on the known smaller cases. This should create a neural net that is highly skilled at finding colorings that are free of small monochrome cliques. The question is how to leverage its presumed failure once we hit size 43.
Perhaps someone should take Avrim’s suggestion and try it out. A natural idea would be to see if this approach could compute the known smaller Ramsey numbers—getting their exact upper bounds.
Not a theorist but …
[ Internet Hall of Fame ]
Danny Cohen was a pioneer who advanced many areas of computer science. He made contributions to computer graphics, networking, flight simulation, and much more.
Today I want to remember him and his work.
Danny is hard to label. Leonard Kleinrock said that Cohen “had the uncanny ability to employ his deep mathematical intellect and insight to real world challenges, with enormous impact.” Indeed. This is quoted in last week’s New York Times obit. The Los Angeles Times obit noted his 1981 paper addressed to warring “Big Endian” and “Little Endian” camps.
Obituaries for Danny do not, in my opinion, explain why Danny was so remarkable. For example, he impacted theory even though he never worked in it. Let me explain in a moment. But first, a personal recollection.
I worked with Danny. I was on a government committee that he chaired, years ago. He was an excellent leader, a tough chair, and ran a tight ship. He had been a fighter pilot, perhaps that is why he was such a great leader. On that committee—I will not identify it—he once fired a new committee member as he left the first meeting. Really as he walked out the door, while the rest of us were still meeting. He immediately said that he would not be staying on the committee. Tough. Yes. But Danny knew how to run a meeting.
Danny had a clever wit. One of his favorite jokes was based on a form he had printed out:
The_____________________________technology is a promising technology and always will be.
He then would fill in the blank with whatever you thought was a great technology and hand it to you. Today perhaps we might fill in “quantum computing”—or is that wrong to say?
Danny loved irony. His son revealed that Danny had joined the Flat Earth Society. Danny was initially rejected—no scientists allowed. After he got in—this time he did not list his profession—he framed the certificate of membership, the rejection letter, and the Flat Earth Society’s map of the world, complete with an emblazoned “Australia Not Down Under.”
A key contribution that seems to be missed in obits I’ve seen is that he helped create the VLSI revolution. MOSIS is the Metal Oxide Semiconductor Implementation Service. Danny created it, styling it after the earlier work of the pioneer Lynn Conway. Ten points if you know what VLSI stands for.
MOSIS allowed researchers to actually fabricate their designs: to convert digital plans into chips that worked. You designed the chips yourself, encoded in a strange digital file format. Then you sent the files to MOSIS. They then placed your design along with other designs onto one wafer. A wafer held many designs. The wafer files were finally sent to a commercial foundry. There they were converted from digital into silicon. When MOSIS got them back they broke the wafers up and sent you the chips that you designed.
This was magical. The beauty was we could make working chips without having to have our own foundry. In that time frame many companies in the computer business would boast that they had their own captive foundry. Few could afford the huge cost of having their own fabrication foundry. Danny liked to say:
A captive foundry was one that captured you.
His point was that if you had your own foundry, you probably would be forced to use it, even if it was not the right type for your needs. Captive indeed.
Since 1981, when it started, countless projects have been fabricated. MOSIS was a brilliant idea that made it possible for many to build working VLSI chips. While I was at Princeton University, my students and I sent several designs to MOSIS more than once. For example, Dan Lopresti sent in a design that we discussed before here.
The ability to make chips was a terrific motivator. I believe that the excitement of being able to make your own devices helped move the field forward. Without MOSIS, without Danny, the field would have advanced much slower.
Perhaps the lesson of MOSIS is not lost on us. Today there are “MOSIS” like systems that allow even theorists to execute physically real quantum computations. This is indirectly thanks to Danny. We owe him much.
With tighter links between notions of rank?
Composite of src1, src2 (our congrats)
James Oxley and Geoff Whittle are mathematicians at LSU and Victoria University (Wellington, New Zealand), respectively. They have written many joint papers on matroid theory, but none on quantum computing.
Today we observe that they really have indirectly written a paper on quantum computing.
Their paper, “A Characterization of Tutte Invariants of 2-Polymatroids,” has implications for quantum complexity theory. It introduced a polynomial S_G(u, v), where G is a graph and u, v are numbers. Like many polynomials defined on graphs this one is in general hard to compute—it is #P-hard, so if it were polynomial-time computable, then surprising stuff would happen.
The same paper observes that is easy to compute when
However, a 2006 paper by Steve Noble proves that all other rational cases are #P-hard. Our work finds easy cases in another family of evaluations, where one value is a rational multiple of the other. This result goes through quantum complexity theory.
It is interesting to note that quantum computations welcome complex numbers such as i. In mathematics, insights into the behavior of real functions are often found when we extend them to complex functions. A beautiful example to compare is that the behavior of 1/(1 + x²)
is best understood by noting that 1 + z² can be zero, at z = ±i.
In order to best understand the above results it is helpful to look at matroids and generalizations. Matroids are a bit scary, since they seem quite abstract. But that is wrong. The matroid concept is a natural generalization of the notion of rank in an ordinary vector space.
Let’s take a look. A matroid is defined by a set E and a rank function r from finite subsets of E to the nonnegative integers that obeys the following rules:
The notion of rank in an ordinary vector space obeys these axioms. The rules are:
The definition of a polymatroid simply wipes out rule 2. OK, a 2-polymatroid replaces it by the rule that all singleton sets have rank at most 2. An important kind of 2-polymatroid springs from the following idea:
The rank of a subset S of the edges in a graph G is the number of vertices collectively touched by edges in S.
In a simple undirected graph, every edge has rank 2. In graphs with self-loops, however, the loops have rank 1. We can also allow the universe to include members of rank 0. Those are visualized as loops without a vertex and called circles. We could also visualize edges of rank 1 that stick out from a vertex into empty space, but those are formally the same as loops at that vertex. This defines a graphic(al) 2-polymatroid.
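This rank function is simple enough to state in code (my own illustrative encoding: an ordinary edge is a pair (u, v), a self-loop is (u, u), and a rank-0 circle is represented by None):

```python
def rank(edges):
    """2-polymatroid rank of an edge set: the number of vertices
    collectively touched.  A self-loop (u, u) touches one vertex; a
    'circle' (encoded here as None) touches none."""
    touched = set()
    for e in edges:
        if e is not None:
            touched.update(e)
    return len(touched)

assert rank([(1, 2), (2, 3)]) == 3   # two edges sharing a vertex
assert rank([(4, 4)]) == 1           # self-loop: rank 1
assert rank([None]) == 0             # circle: rank 0
assert rank([(1, 2)]) == 2           # ordinary edge: rank 2
```

One can check the submodularity and monotonicity axioms directly against this definition, which is the content of the claim that graphs give 2-polymatroids.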
The theory of polymatroids ignores the notion of “vertex” apart from the definition of the rank function . Hence all graphs composed of isolated vertices define the empty 2-polymatroid. To preserve the distinction, we define an m-graph to be a graph with circles allowed. The empty m-graph is denoted by and called a “wisp.” Every m-graph specializes to a unique graphical polymatroid but we can include isolated nodes when thinking of it as a graph (plus circles).
We revisit the definition of “exploding” an edge in a graph from the previous post in this series:
If we ignore vertices, then the rank function of the resulting m-graph is such that for all subsets of its edge set ,
One can verify that the “explosion” description obeys (1). Glancing at our picture helps—we have varied the one in the previous post by showing how one exploded node becomes a wisp while the loop at the other leaves a circle:
The equation (1), however, is more fundamentally natural. The simple extension
defines the matroid contraction by any subset of . When , it supersedes the graph-theory notion of contraction in matroid contexts. Since we are talking about both, we will write for the matroid version. The notation conveys that the edge gets deleted emphatically.
In a previous post we derived a recurrence for the amplitude in terms of explosion. The presence of circles and use of the function now allows us to write it as:
We can omit the “” from the previous version because the circles keep track of this. The rule is valid even when is a self-loop, using . The following is thus a complete basis:
To get more mileage out of this, we want to emulate William Tutte’s idea and define a polynomial whose evaluation at gives , and whose other evaluations may give more information. Without further ado:
Definition 1 For an m-graph , define its amplitude polynomial inductively by:
Recall if is a loop, else . When consists of a single node with a loop, the recursion gives . When it is a single edge between two nodes, we get
If we have a single node with two loops, something portentous happens. Exploding one of the loops leaves a circle. Removing still leaves one loop. So we get:
This agrees with the value on one isolated node. Similarly, if has just a double edge between two nodes and , then one edge becomes a circle upon exploding the other edge, so recursing on the other edge makes
Thus the double edge gives the same result as having two isolated nodes. The upshot is that equation (2) naturally treats edges modulo two. We can recurse on a non-edge as if it were a double-edge, and the circle left by the explosion (i.e., by the matroid contraction) becomes a multiplier on the entire remainder of that branch of the recursion. The circle thus becomes a placeholder for calculating phase flips and cancellations, which is why we believe the matroid notions are useful in analyzing quantum processes.
It may not be obvious, however, that is well-defined when has more than one edge. That is, does it come out the same for any order of choosing edges for the recursion? We will prove this by connecting to the polynomial mentioned above.
The original non-recursive definition of the rank-generating function of a graphical 2-polymatroid is:
Note the symmetry in and between cases 2 and 4 in particular, which could be brought out more by discussing how (graphical poly-)matroids foster higher notions of duality than graphs do. Our theorem breaks this symmetry, but perhaps there is a deeper underlying law that would restore it:
Theorem 2 For any m-graph with nodes, of which are isolated, and all , taking ,
Proof: Call the right-hand side of (4). Note first that if consists only of isolated nodes then so the right-hand side becomes
Now we verify the other base cases:
To verify the recursion, we note facts observed by Oxley and Whittle as consequences of (3):
Every connected graph other than those we put in our basis has an edge that falls into one of these three cases. So we can prove (4) by induction on them. The details convey why we need to use the particular value but are otherwise straightforward and tedious, so we’ve put them in a longer PDF version. Since the equation (3) for involves no recursion, this also proves that the recursive definition of is confluent.
Note how the sum in the case (c) echoes the sum defining the Tutte polynomial, but with “explosion” in place of ordinary graph contraction. A more salient difference is that whereas the Tutte polynomial is the same on all -vertex trees, differs on them.
As remarked in Noble’s paper, Oxley and Whittle noted the significance of a host of specializations of the polynomial. We suppose has nodes with isolated, so , and we put . The first two presume .
Also interesting is that if and we delete each of the edges independently with probability , then the probability that the deletions did not cause any isolated vertices is
Of course, this is classical probability. What interests us here is the import of for quantum amplitude and probability. We have already observed in previous posts that gives the amplitude for an outcome in a special kind of quantum circuit. This means that gives the same information, where .
We are curious about the significance of the polynomial for other values besides these. The complex value takes us outside the real-valued domain of Noble’s paper. His proof that most real cases are #P-hard to compute does not extend, because a denominator in his proof vanishes. Thus the evaluation may be polynomial-time computable for various other real values.
The specialization (times other easily-computed factors) does not seem to intersect with any of the above cases. Hence the field is wide open for finding new interpretations of it. Whatever we find, however, is bound to relate to the analysis of quantum circuits. It might help fulfill the aim in our post drawing analogy to Gustav Kirchhoff’s laws for electrical circuits.
How useful is the matroid-based framework for analyzing graph entities associated to quantum circuits? Can we say more about the significance of for other particular values of ?
We had thought to imitate how the Tutte polynomial has base monomials for graphs composed of “bridge” edges and loop edges at otherwise-isolated nodes. One can base a polynomial on monomials
for composed of isolated nodes, disjoint single-node loops, circles, and “wisps.” The recursion is the same:
The point is that in the explosion case, the inclusion of the factors for wisps and circles (namely, if the explosion leaves two wisps, if it leaves a wisp and a circle, or for two circles) keeps homogeneous of degree for an -vertex graph. The conditions for this recursion to be confluent, however, essentially leave something equivalent to with as a homogenizing variable. In particular, this idea appears not to yield a two-variable polynomial that gives more leverage.
Kathryn Farley is my dear wife. She and I are currently on a cruise through the Mediterranean. Our trip started in Barcelona and is stopping daily at various cities as we journey to Rome. “Tough duty,” but we are trying to enjoy it.
Today I wish to talk about our visit to Monte Carlo.
Our ship, the Encore, just docked there Sunday. The day was warm and clear, and we spent some time exploring the city. We did manage to avoid losing any money at the famous casino. Our secret was simple: do not play, do not gamble, do not lose.
Over lunch I started to explain to Kathryn why Monte Carlo is an important city for complexity theorists. I felt a bit like we were at a theory shrine.
Indeed. I realized that it is not so simple to explain why randomness helps. Kathryn has a Ph.D. in theatre. She is smart, is a member of Mensa, but is not a complexity theorist. How do I explain that randomness is powerful? Indeed.
I started to explain, but my examples were lame. I think she got the main idea, but I also think that I did not do a great job. Russell Impagliazzo has a nice explanation on the role of randomness—I wish Russell had been there to help explain randomness to Kathryn.
After lunch I started to think more about the role of randomness. I looked at our friends over at Wikipedia and discovered they had a pretty good page. Some reasons are:
Games
Randomness was first investigated in the context of gambling. Dice, playing cards, roulette wheels, all have been studied by those interested in gambling. Clearly, betting on the roll of dice, deal of cards, or spin of the wheel, only makes sense when these actions are unpredictable. Random.
Political
Randomness is often used to create “fairness”. For example, in the US and UK, juror selection is done by a lottery.
Science
Monte Carlo methods in physics and computer science require random numbers.
Cryptography
Random keys for encryption algorithms should be unpredictable. Random. Otherwise, they can be guessed by others. The password “password” is usually not allowed.
Arts
Kathryn is interested in the arts: in plays and in painting and other fine arts. Some theories of art claim that all art is random. One thinks of artists like Jackson Pollock with his famous drip paintings. He was a major player in the abstract expressionist movement.
Ken has been paying intensive devotions at the same shrine. As he wrote in the previous post, he has been conducting millions of randomized tests of his new chess model.
Why random? What he needs to do is show that his model will not tend to “cry wolf” by giving a too-high z-score to a set of games in a tournament by an honest player. He wants to show that his model is equally tempered no matter the rating of the player. So he runs trials at different rating levels ranging from Elo 1000 for novice players to Elo 2800, which is championship level. To show that the z-scores given by his model conform to a normal bell curve, he needs to do 10,000s or 100,000s of tests at each level.
The problem is there just don’t exist enough games. Most large tournaments give only the games played on their “top boards” which use special auto-recording equipment, and the losers on those boards in one round may play on lower boards in the next round. Thus out of about 60,000 player-tournament pairs Ken can track each year, most are only partial samples. So what Ken does is generate “synthetic players” by randomly taking subsets of (say) 9 games—from his data set of 1,000 or so games for each level—and randomly choosing white or black for each game. This is a common resampling technique, and it uses Monte Carlo.
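A minimal sketch of this resampling step, with illustrative names and sizes rather than Ken's actual code:

```python
import random

def synthetic_player(games, k=9, rng=None):
    """Build one 'synthetic player': a random k-game subset of a pool,
    with a random color assigned to each game."""
    rng = rng or random.Random()
    subset = rng.sample(games, k)  # k distinct games, without replacement
    return [(game, rng.choice(["white", "black"])) for game in subset]

# A pool standing in for ~1,000 games at one rating level.
pool = [f"game{i}" for i in range(1000)]
player = synthetic_player(pool, rng=random.Random(7))
```

Each call yields a fresh synthetic player, so repeating it tens of thousands of times gives the Monte Carlo ensemble described above.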
Ken uses pseudo-random generators (PRGs). He starts a C++ library PRG on a seed based on the current time. The fact that the choices are deterministic once the seed is given might allow him to reproduce an entire run exactly (after a model tweak) by preserving the seed it used. This is a paradox: we might want our “random” bits to be deterministic. Monte Carlo with predestined loaded dice.
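The reproducibility point fits in a few lines. This sketch uses Python's generator rather than the C++ library Ken actually uses:

```python
import random
import time

def run_trial(seed, n=5):
    """One randomized trial driven by an explicit, recorded seed."""
    rng = random.Random(seed)  # dedicated PRG, deterministic given the seed
    return [rng.random() for _ in range(n)]

seed = int(time.time())  # seed taken from the clock, then saved
first = run_trial(seed)
replay = run_trial(seed)  # the same saved seed reproduces the whole run
assert first == replay
```

Saving the seed alongside the results is what lets an entire "random" experiment be replayed after a model tweak.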
From time to time on this blog we have mused about what a world without randomness or with reduced entropy would be like. We were struck a few weeks ago when the noted physics blogger Sabine Hossenfelder wrote about “superdeterminism.” That post provoked a few hundred comments in her blog, as did her post last week on the quantum measurement problem—including long exchanges with Peter Shor. Ken and I don’t know which side to take, but I can say that the side of a ship is a great place to think about possible real effects of these differences.
What is your take on randomness? Do you employ it? How “true” do you need it to be?
Using predictivity both to sharpen and cross-check models
[ Cropped from article source ]
Patrice Miller and Jeff Seder look under the hide of horses. Their company EQB does predictive modeling for horse racing based on biometric data. They are famous for having advised the owner of American Pharoah not to sell because the horse had a powerful heart. In 2015, American Pharoah became the first Triple Crown winner since Affirmed in 1978; his powerful heart drew comparisons to that of Secretariat.
Today I am happy to announce an extended version of my predictive model for chess and discuss how it gains accuracy by looking under the hood of chess positions.
I had thought to credit Charles Babbage and Ada Lovelace for being the first to envision computational predictive modeling, but the evidence connected to her design of betting schemes for horse racing is scant and secondhand accounts differ. It is known that Babbage compiled voluminous data on the medical fitness and diet of animals, including heart function by taking their pulse. We have discussed their computing work here and here.
I will use horse racing as a device for explaining the main new ingredient of my model. It sharpens the prediction of moves—and the results of cheating tests—by using deeper information to “beat the bookie” as Lovelace tried to do. I have described the basic form of my model—and previous efforts to extend it—in several previous posts on this blog. Last month, besides being involved in several media stories involving a grandmaster caught in the act of cheating in France, I was invited to discuss this work by Ben Johnson for his “Perpetual Chess” podcast.
My chess model does the same thing to a chess position—given information about the skill set of the player deciding on a move—that a bookie does to a horse race. It sets odds on each legal move to “win” by being played in the game. The probabilities need to be accurate for the same reason bookmakers need their “initial betting lines” to be close to how bets will ultimately balance, so they can preserve their margin. A horse with the highest probability—perhaps in a tie—is the bookie’s favorite. The favorite might be “odds-on,” meaning its probability exceeds one-half, or might be a “narrow favorite” among several horses with near-equal chances.
Suppose you don’t care how much money you might win but just want to maximize your chance of being right—of winning something. Unless you have reason to doubt the bookie, you should bet on the favorite. That is what my basic chess model does. Whichever move is given the highest value by the computer at the end of its search is declared the favorite, regardless of the player’s Elo rating or other skill factors.
That the best move should always be most likely even for the weakest players runs counter to sense. Aren’t weaker players weaker because they prefer weaker moves? When the right move is obvious, say a forced recapture or a checkmate in one, of course we expect any player to find it. But when the best move is subtle, what then?
My basic model still makes it the favorite. This doesn’t mean its probability is greater than half. My model might give the best move a high probability for world champion level players but only, say, 25% for beginning players. Thus it will still say the beginner is 75% likely to play an inferior move. What my base model shies away from is saying any other particular move—any other horse—is more likely to win than the best one. As the rating gets lower it bunches up the probabilities, so that while the favorite’s probability is lower, no other move’s probability passes it.
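A toy illustration of this bunching, and emphatically not the actual model: a softmax over move values with a hypothetical skill-dependent "temperature" flattens the distribution while leaving the best move on top.

```python
import math

def move_probs(values, temperature):
    """Softmax over engine values of the legal moves. A higher temperature
    stands in for a weaker player: probabilities bunch together, yet the
    best-valued move remains the favorite."""
    weights = [math.exp(v / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

values = [0.5, 0.3, 0.0]          # best move listed first
strong = move_probs(values, 0.1)  # sharp distribution
weak = move_probs(values, 2.0)    # bunched distribution

assert strong[0] > weak[0]        # the favorite's probability drops...
assert weak[0] == max(weak)       # ...but it is still the favorite
```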
This is borne out in practice. The lowest Elo rating used by the World Chess Federation (FIDE) is 1000. Let’s take ratings between that and 1200 (which used to be the lowest rating) as denoting the novice class. Consider only those positions that have many reasonable choices—say at least ten moves valued within 0.25 (figuratively, a quarter of a pawn) of optimal. My main training set has 6,082 such positions in games between evenly-matched players of this level. Here are the frequencies of their playing the best through the tenth-best move in such many-choice positions:
Rank | Pct.
1 | 17.76%
2 | 13.22%
3 | 9.95%
4 | 7.66%
5 | 6.25%
6 | 5.18%
7 | 4.41%
8 | 4.55%
9 | 3.50%
10 | 3.03%
11+ | 24.49%
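As a quick sanity check, the listed frequencies account for the whole distribution:

```python
# Frequencies from the table above: ranks 1..10 plus the "11+" bucket.
pcts = [17.76, 13.22, 9.95, 7.66, 6.25, 5.18, 4.41, 4.55, 3.50, 3.03, 24.49]

total = sum(pcts)
assert abs(total - 100.00) < 0.01  # the table covers every move played

# Always betting the favorite (rank 1) is right 17.76% of the time;
# that is the baseline any improved model has to beat.
baseline = pcts[0]
assert baseline == max(pcts[:10])  # among single moves, rank 1 leads
```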
Both my basic model and the new one, when fitted over the entire training set for this class but then restricted to the many-choices subset, give projections close to these actual values. The basic model, always betting the favorite, is right on under 18% of its projections. Can we do better? That is, can we “beat the bookie” at chess?
It is almost four years since the idea for improving predictions was described on this blog. In place of “weaker players prefer weaker moves,” it advanced a hypothesis that we can state as follows:
Weaker players are more likely to be diverted by shiny objects.
Most particularly, they will fall for moves that look attractive early on, but which are revealed (by the computer) to be inferior after deeper consideration. The computer programs output values for each depth of search, and when these moves’ values are graphed against the depth, they start high but “swing down” at higher depths. Weaker players are more likely to be satisfied by the early flash and not think deeper. The old post has a great example of such a move from the pivotal game in the 2008 world championship match, where Viswanathan Anand set a deep trap that caught the previous world champion, Vladimir Kramnik.
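A crude version of the "swing down" signal can be sketched as follows; the threshold criterion here is hypothetical, and the model's actual feature is more elaborate:

```python
def swings_down(values_by_depth, threshold=0.5):
    """Flag a 'shiny object': a move whose evaluation at some early depth
    exceeds its full-depth value by more than `threshold` (in pawns)."""
    final = values_by_depth[-1]
    return max(values_by_depth[:-1]) - final > threshold

trap = [0.8, 0.9, 0.4, -0.6]   # attractive early, refuted at depth
solid = [0.1, 0.2, 0.3, 0.4]   # value emerges with deeper search

assert swings_down(trap)
assert not swings_down(solid)
```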
The flip side is moves that look poor at low depths but whose high value emerges at high depths. My basic model, which uses only the final values of moves, gives too high a projection on these cases, and too low a likelihood of falling into traps. I have figured that these two kinds of ‘misses’ offset over a few dozen positions. Moreover, in both kinds of misses, the player is given express benefit of the doubt by the projections. It is edgier, after all, to project that a player is more likely to fall into a trap than to find the safest and best move.
The effect of lower-depth values is still too powerful to ignore in identifiable cases. Including them, however, makes the whole model edgier, as I have described here before. Simply put, the lower-depth values are subject to more noise, from which we are trying to extract greater information. It has been like trying to catch lightning—or fusion—in a bottle.
My new model implements the “swing” feature without adding any more free parameters for fitting. It has new parameters but those are set “by hand” after an initial fitting of the free parameters under the “sliding-scale” regime I described last September, which is followed by a second round of re-fitting. It required heavy clamps on the weighting of lower-depth values and more-intense conditioning of inputs overall. It required a solution to the “firewall at zero” phenomenon that was the exact opposite of what I’d envisioned.
After all this, here is what it delivers in the above case—and quite generally:
It improves the prediction success rate—for the weakest players in the most difficult kind of positions to forecast—from 17.76% to 20.04%.
For the elite class—2600 to 2800—in the same kind of many-choice positions, the new model does even better. Much more data on elite players is available, so I have 49,793 such positions faced by them:
Whereas elite players found the best move in 30.85% of these difficult positions, my new model finds their move in 34.64% of them.
Over all positions, the average prediction gain ranges from about 1 percentage point for the lowest players to over 2% for masters. These gains may not sound like much, but for cheating tests they give prime value. The reasons are twofold:
Initial applications in recent cases seem to prove this out more often than not. Of course, the larger purpose is to have a better model of human chess play overall.
In recent years, several new dimensions of quality and safety with predictive models have emerged. They supplement the two classic ones:
I vetted my model’s natural safety by processing tens of millions of generated z-scores under resampling after my final design tweaks earlier this month. This was over a link between departmental machines and UB’s Center for Computational Research (CCR) where my data is generated. The previous discussion has all been about greater power. The first new factor updates the idea of calibration:
I have a suite of cross-checking measures besides those tests that are expressly fitted to be unbiased estimators. They include checking how my model performs on various different types of positions, such as those with many choices as above, or the opposite: those having one standout move. For example, the model’s best-move projection in the many-choice positions by elite players, using the general settings for 2700 rating, is 31.07%. That’s within 0.28%. Another check is figuratively how much “probability money” my model wagered to get its 34.64% hit rate. The sum of it projected on its own most-likely moves, in the 68.4% of the many-choice positions where it agreed with the computer’s favorite plus 31.6% where it did not, was 35.00%. If I fit all 282,060 positions by these players, rather than use “2700,” and then re-select the subset, the model comes within 0.01% on the first-move projection and 0.11% on its own betting forecasts. I will say more about the cross-checks, use of prediction–scoring metrics, and conformance to normal distribution at a later time. The relevant point is to ask:
How well does your model perform on pertinent tests besides those it was expressly trained for?
Beyond fairness, good wide-range calibration alleviates dangers of “mission creep.”
The second newer factor is:
My impression over the long haul of this work is that the new model’s more-powerful heart inevitably brings greater “springiness.” By dint of its being more sensitive to moves whose high value emerges only after deep search, it is possible to create shorter sequences of such moves that make it jump to conclusions. The z-score vetting turned up a few games that were agreed drawn after some “book” moves—openings known by heart to many professional players—whose entirety the model would flag, except for the standard procedure of identifying and removing book moves from cheating tests. These outliers came from over a hundred thousand games between evenly-matched players, so they still conformed to the natural rate, but one can be concerned about the “unnatural rate.” On the flip side, I believe my more-intensive conditioning of values has made the model more robust against being gamed by cheaters playing a few inferior moves to cover their tracks.
In general, and especially with nonlinear models, the caveat is that amplifying statistical power brings greater susceptibility to adversarial conditions. Trying to “beat the bookie” requires more model introspection. My model retains its explainability and ability to provide audits for its determinations.
What lessons from similar situations with other predictive models can be identified and brought to bear on this one?
Update 8/24: Here is a timely example of the tradeoff between amplifying the prediction accuracy and the overall stability of the model.
[ Composite from src1, src2 ]
Brendan Lucier and Csaba Szepesvári were consecutive speakers at this week’s workshop at the Toyota Technological Institute in Chicago on “Automated Algorithm Design.”
Today I will discuss their talks, which I greatly enjoyed.
The workshop’s general theme was the use of machine-learning techniques to improve the tuning of algorithms. Indeed, the goal is to have the algorithm tune itself by selecting strategies from a collection of possible ones. Besides better performance with less human work, this promises better logical reliability than with hand-tuning programs and debugging.
A common component of this work used in both talks is an interesting oracle model. Since oracles are familiar to many of us theorists this will give us an avenue into the talks and the two recent papers they were based on.
The oracle has a nonnegative matrix M of shape m × n. Think of the entry M[i][j] as the runtime of the i-th program on the j-th input. You know m and n, but must ask the oracle for information about the entries of M. You may ask the oracle a question (i, j, t):
Is the entry M[i][j] less than or equal to t?
Here t is a time bound. Further, they assume that the oracle charges you about min(t, M[i][j]) for the question.
Thus an algorithm is charged in total the sum of min(t, M[i][j]),
where the sum is over all questions (i, j, t) you asked. The rationale is that the oracle can run a program on a task and stop it after at most t steps. Thus charging the minimum is realistic. Since their research is motivated by practical problems, this is a useful model for studying questions about program performance.
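The oracle and its cost accounting can be sketched in a few lines. The exact charge formula is reconstructed here from the stop-after-t rationale, so take min(t, M[i][j]) as an assumption rather than the papers' precise accounting:

```python
class RuntimeOracle:
    """Sketch of the model: M[i][j] is the runtime of program i on input j.
    A query (i, j, t) asks whether M[i][j] <= t. Consistent with the
    stop-after-t rationale, we charge min(t, M[i][j]) per query; the
    papers' exact charge may include additional overhead."""

    def __init__(self, M):
        self._M = M
        self.charge = 0

    def query(self, i, j, t):
        self.charge += min(t, self._M[i][j])  # run for at most t steps
        return self._M[i][j] <= t

oracle = RuntimeOracle([[3, 100], [50, 2]])
assert oracle.query(0, 0, 10)      # M[0][0] = 3 <= 10, pay 3
assert not oracle.query(0, 1, 10)  # M[0][1] = 100 > 10, pay only 10
assert oracle.charge == 3 + 10
```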
In passing, I believe the reason this oracle model is new is that theorists do not often think about running arbitrary programs. Well, those in recursion theory did, but those designing classic algorithms do not.
This model is used to study a selection problem. Assume M is as above. The task is to find the row i with the smallest row sum S_i = Σ_j M[i][j].
That is, to find the program that takes the least total time on the inputs—hence the least average time.
The trouble is that in the worst case this can require examining all the entries. So they introduce approximations to make the problem doable, but still useful in practice:
Now we can state the problem formally. It is parameterized by an approximation factor α ≥ 1:
Runtime Selection Problem (RSP): Given a matrix M with all entries bounded by T, find a row i so that
S_i ≤ α · S_k
for all rows k, while minimizing the oracle charges for all questions “is M[i][j] ≤ t?” that are asked.
The talks gave a variety of randomized algorithms that solve such problems for various parameter assumptions.
See their papers for details on the results. The algorithms they have are interesting not only from a theory viewpoint but also in practice. Indeed, they not only prove theorems but also give experimental timings.
In order to establish some intuition, let’s look at the following simple case. Assume that all entries are either 1 or some large value T. This is the “fast-or-slow” running time stipulation. As theorists we often are drawn to binary values, so this might be a good case to look at initially.
The first observation is that we will always ask questions of the form “is M[i][j] ≤ 1?”
This gives us the value of the entry for almost unit cost: whatever the answer, we are only charged about 1. Thus, we have reduced the original problem to a classic oracle problem. There is no complicated oracle cost measure: the cost is just the number of entries we read. Well, okay, the cost is slightly higher, but we can make it as close as we wish.
The selection problem then comes down to this: given cheap probes of individual entries, find (approximately) the row with the fewest slow entries.
I believe that the analysis of this problem should be generally known. In any event it seems clear that randomly sampling each row is a good start.
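Random sampling of rows can be sketched as follows. This is an illustrative estimator under the fast-or-slow assumption, not the algorithm from the papers:

```python
import random

def pick_fast_row(M, samples_per_row=200, rng=None):
    """Estimate each row's sum by uniform sampling of its entries and
    return the row with the smallest estimate. With 1-or-T entries this
    just estimates each program's fraction of slow inputs."""
    rng = rng or random.Random(0)
    n = len(M[0])
    best_row, best_est = None, float("inf")
    for i, row in enumerate(M):
        total = sum(row[rng.randrange(n)] for _ in range(samples_per_row))
        est = total * n / samples_per_row  # rescale sample mean to a row sum
        if est < best_est:
            best_row, best_est = i, est
    return best_row

T = 100
M = [[1] * 50, [T] * 50]  # program 0 is always fast, program 1 always slow
assert pick_fast_row(M) == 0
```

With rows whose sums are well separated, far fewer than n probes per row suffice, which is the intuition behind the sublinear bounds in the talks.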
Are there other natural problems where the runtime cost oracle is useful?
[ Randi 2014 documentary source ]
James Randi is a magician who has challenged paranormal claims of all kinds.
Today Ken and I want to make a suggestion to those who claim they have proved P=NP.
No, the claim to have a proof that P=NP is not a paranormal claim. But such claims are related to Randi—or the Amazing Randi as he is called. We talked about him before here.
Randi once helped run a contest to see who could find gold with their dowsing rod. He explained why he doubted one contestant:
If they really could find gold, why were they dressed so poorly, and why were they so interested in winning the prize?
I have the same question about those who claim that they have a proof that P=NP. Usually the proof is constructive and I agree with Randi:
If they really could solve P=NP, why
You get the idea.
Ken adds the obvious remark that if a foreign power or intelligence agency discovered P=NP, or factoring in P, they would still keep the lean-and-hungry look. But they are not the kind we are addressing here.
Let’s look at claims that P=NP is resolved. Yes, such a result is unlikely—many would say impossible. But we do get claims like this:
The following problem A is known to be an NP-complete problem; the following problem B is known to be a polynomial time problem. I can reduce A to B in polynomial time.
Usually the reduction is the reason their proof fails. Their claims about A and B are usually correct, since they are in the literature.
The reduction is often complicated, often poorly defined, often defined by example. Giving a precise definition for the reduction is critical. This is the reason we suggest the following:
Write the reduction down in code.
Even better, write it as a program in a real language such as Python.
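Here is the kind of artifact we mean, for a textbook reduction (Independent Set to Clique via the complement graph), together with the brute-force check on a tiny instance that we urge claimants to include:

```python
from itertools import combinations

def independent_set_to_clique(n, edges, k):
    """Classic reduction: G has an independent set of size k
    iff the complement graph of G has a clique of size k."""
    present = {frozenset(e) for e in edges}
    comp = [(u, v) for u, v in combinations(range(n), 2)
            if frozenset((u, v)) not in present]
    return n, comp, k

def has_clique(n, edges, k):
    """Brute force; fine for tiny examples only."""
    present = {frozenset(e) for e in edges}
    return any(all(frozenset(p) in present for p in combinations(S, 2))
               for S in combinations(range(n), k))

# Path 0-1-2: {0, 2} is an independent set of size 2.
n2, comp, k2 = independent_set_to_clique(3, [(0, 1), (1, 2)], 2)
assert has_clique(n2, comp, k2)
```

Writing even this much forces the reduction to be fully defined, and the tiny test catches the sort of slip that prose definitions hide.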
There are two advantages in doing this.
The latter point is the key point. Even trying your method on tiny examples is useful. Even better, if you can say the following, we might read the proof:
I have tried my code on the following public set of difficult SAT problems. The code solved all in less than three minutes each.
This claim would greatly improve the likelihood that people might take your claims seriously. That your code worked correctly, forgetting the running time, would improve confidence. Greatly.
Ken worries that some NP-complete problems are more equal than others. That is, some problems, even though they are NP-complete, may require reductions that blow up when encoding SAT.
We wrote about this before regarding the “Power Index” idea of Richard Stearns and Harry Hunt III. In their paper they gave evidence that the reductions from SAT to many familiar NP-complete problems must expand the size of instances quadratically, insofar as those problems have power index . This was based on their “SAT Hypothesis” which anticipated current forms of the Exponential Time Hypothesis, which we have discussed.
Ken ponders a related issue. Even problems with power index run into the success of practical solvers. This means:
Anyone citing algorithmic success as evidence toward a claim of P=NP must compete with the real-world success of algorithms that do not represent claims of P=NP.
We have several times discussed the practical success of SAT-solvers on myriad real-world instances.
This situation has become real in the argument over achieving quantum supremacy. One who claims that quantum is superior to classic must worry that that classical algorithms can improve without making P=NP. A headline example from last year was when Ewin Tang—as a high-school senior—found a classical way to remove a plausible quantum advantage in a matrix-completion problem that underlies recommender systems. There are many “industrial strength” examples in this argument—see this May 2019 story for a start.
Ken’s insightful comments aside, the key point is still:
Coding up your claimed algorithm for that NP-complete problem will still enhance belief.
This will happen even if the algorithm only succeeds on tiny examples. Indeed, if you cannot do this then I suggest that you will have an impossible time getting anyone to listen.
How useful is this advice for the vast majority of us who are not claiming P=NP or the opposite?
An old unpublished result, some new published results
[ Playbill ]
Alexander Hamilton was a framer of the U.S. Constitution. He wrote the bulk of the Federalist Papers (FP) defending the Constitution. Today he is best known for the playbill—the musical on his life—and the bill, the US ten dollar bill.
Today I thought we would discuss the U.S. electoral college (EC).
We are in the midst of the run-up to next year’s Presidential election. An on-going discussion is the issue of the EC. Should it be modified? Should it be replaced? Is it a good idea?
So let’s recall how the EC works. Then we will look at it from a theory viewpoint.
The electoral college is how every four years we elect the President of the United States. It is not a direct popular vote. The Constitution created it as a compromise between a direct popular vote and a vote by the members of Congress. Back then, the framers of the Constitution, including Hamilton, did not trust the electorate. Hence, the rationale for the EC.
Today the EC consists of 538 electors. Voters in each state pick electors, who then vote in the EC for the President. Thus, by high math, 270 electors are required to win. A state gets one electoral vote for each member in the House of Representatives plus two. The latter rule ensures that no state gets too few votes. It is sometimes called the “two-plus rule”.
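The "high math" behind 538 and 270, for the record (the three electors for the District of Columbia come from the 23rd Amendment):

```python
house_seats = 435  # members of the House of Representatives
senators = 100     # two per state: the "plus two" of the two-plus rule
dc_electors = 3    # granted to the District of Columbia

total_electors = house_seats + senators + dc_electors
assert total_electors == 538

majority = total_electors // 2 + 1  # strict majority of electors
assert majority == 270
```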
The arguments for the EC are distilled in FP No. 68. Although the collaboration/authorship status of numerous FP remains unclear, Hamilton’s claim in his last testament to sole authorship of FP 68 is not seriously disputed. Quoting Wikipedia:
Entitled “The Mode of Electing the President”, No. 68 describes a perspective on the process of selecting the Chief Executive of the United States. In writing this essay, the author sought to convince the people of New York of the merits of the proposed Constitution. Number 68 is the second in a series of 11 essays discussing the powers and limitations of the Executive branch and the only one to describe the method of selecting the president.
Opponents today argue against the EC. They point out that it allows one to win without getting the most votes. This has happened in two of the last five elections, in 2000 and 2016. The EC rewards uneven allocations of campaigning to the few “swing-states”. It also gives voters in less populated states more voting power. A vote from Wyoming has over three times the influence on the EC tally as a vote from California. The battle over FP 68 has even been internationalized.
Years ago. Decades ago. Eons ago. When I was in college, I almost flunked a required one-credit course in my senior year. The course was on issues of the election that year of the President. No it did not involve Hamilton.
The grade of the course was based on a term paper. Mine, which got a , was based on an argument for the EC. Thankfully, the grade was just enough to get me a pass in the course, and allow me to graduate. I did not take the course seriously—my attendance was spotty, at best.
My idea was that there was an argument for the EC based on a connection with the ability to manage elections. My central thesis was:
The ability to accurately predict the outcome of a Presidential election is inherently undesirable.
Let’s agree that we will call this the Prediction Assumption (PA). Predicting the outcome of elections may not be a good idea. If predictions could be accurate, then one could argue that this would allow candidates to manipulate the election. I think you could make the case that this could be a problem. Candidates would be able to manage their opinions to optimize their chances of winning the election.
In any event I then proved a result that showed that given PA, one could argue that the EC was better than a popular election. Note, the usual math arguments against the EC are based on the power of individual voters. See here and here for some of their insights.
The central point of my paper was informally this:
Theorem: Prediction of an election using EC is more difficult than one using the popular vote.
A simple example should help. Imagine an election with three states: Northeast, West, and South. Let them each have one electoral vote. Clearly two are needed to win. Suppose the states are arranged so that Northeast is safely for one candidate, West is safely for the other, and South could go either way.
Then prediction requires the polling to be able to tell the outcome of the South vote. The point is:
The smaller the number of voters in the ensemble being predicted, the more uncertain the prediction.
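This claim is easy to test by simulation. The sketch below estimates how often a 51%-leaning electorate actually goes that way, for a small versus a large number of voters (the sizes and lean are illustrative):

```python
import random

def lean_wins(n_voters, p=0.51, trials=500, rng=None):
    """Monte Carlo estimate of the chance that the side favored by a
    fraction p of independent voters actually carries the electorate."""
    rng = rng or random.Random(1)
    wins = 0
    for _ in range(trials):
        votes = sum(rng.random() < p for _ in range(n_voters))
        wins += 2 * votes > n_voters
    return wins / trials

small = lean_wins(101)    # outcome is close to a coin flip
large = lean_wins(10001)  # outcome is nearly certain
assert small < large
```

The smaller electorate's outcome is genuinely harder to call, which is exactly the leverage the EC's state-by-state structure gives against prediction.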
Ken argues that simply having a multiplicity of component elections—one in each state plus DC—also increases the uncertainty. This may happen technically just because the result is a kind of average over unequal-sized averages.
Modern results in Boolean function theory actually have studied the noise sensitivity of the EC. They have studied how errors in voting can flip an election. Look at Gil Kalai’s 2010 paper, “Noise Sensitivity And Chaos In Social Choice Theory.” He shows that majority is more stable in the presence of noise than the EC. Look at Ryan O’Donnell’s paper, “Some Topics in Analysis of Boolean Functions.” He shows a related point: in a simple model, the EC amplifies the chance that errors in voting flip the election.
Neither paper strikes me as studying whether predictions are easier with the simple majority rule than with the EC.
I believe that their new results can be used to prove the same type of theorems on prediction.
Did I deserve a better grade than a ? Or should I have flunked? Should I have published something?
For comparison, the college term paper which eventually became the 27th Amendment to the Constitution received a better grade: a C. Oh well.