Littlewood’s Law and Big Data
Neil L. is a leprechaun. He has visited Dick on St. Patrick’s Day or the evening before many times. Up until this night I had never seen him.
Today, Neil’s message is more important than ever.
With over a foot of snow in Buffalo this week and the wind still howling, I was not expecting anything green. Long after Debbie had gone to bed, I was enmeshed in the “big-data blues” that have haunted me since summer and before. I was so fixated it took me more than a few seconds to realize that wisps of green smoke floating between me and the computer screen were something I should investigate.
There on our kitchen-study divider sat Neil. He looked like the pictures Dick had posted of him, but frazzled. He cleaned his pipe into a big Notre Dame coffee mug I got as a gift. I’d had it out since Princeton went up against Notre Dame in “March Madness”—my Tigers missed a chance for a big upset in the closing five seconds. As if reading my mind, he remarked how the tournament always produces upsets in the first round:
“If there be no unusual results, ‘twould be most unusual.”
The Neil whom Dick described would have said this with wry mirth, but he sounded weary as if he had a dozen mouths to feed. I fired up the kettle and brought out the matching mug to offer tea or coffee, but he pointed to his hip flask and said “it’s better against the cold.”
Leprechaun Birds and Bees
That prompted me to ask, “Why didn’t you visit Dick? He and Kathryn have been enjoying sun at this week’s Bellairs workshop on Barbados.” I had been there two years ago when Neil had taken great pains to track Dick down. Neil puffed and replied, “Same reason I didn’t try finding you there back then—too far afield for a big family man.” The word “family” struck me as our dog Zoey, who had stayed sleeping in her computer-side bed at my feet, woke up to give Neil a barkless greeting. Of course, even leprechauns have relations…
Nodding to pictures of our children on the wall, I asked Neil how many he had. He took a long puff and replied:
“Several thousand. It’s too hard to keep count nowadays.”
Now Zoey barked, and this covered my gasp. Knowing that Neil was several centuries old, I did some mental arithmetic, but concluded he would still need a sizable harem. Reading my mind again, Neil cut in:
“Not as ye mortals do. What d’ye think we’re made of?”
I reached out to touch him, but Neil leaned away and vanished. A moment later he popped back and folded his arms, waiting for me to reply. I realized, ah, he is made of spirit matter. What can that be? Only one thing in this world it could be: information.
“Tá tú ceart” he whistled. “Right. And some o’ yer sages wit ye mortals have some o’ the same stuff. Max Tegmark, for one, wrote:
“… consciousness is the way information feels when being processed.”
And Roger Penrose has just founded a new institute on similar premises—up front he says chess holds a key to human consciousness so you of all people should know whereof I speak.”
Indeed, I had to nod. He continued, “And information has been growing faster than Moore’s Law. Hard to keep up…” The last words came with a puff of manly pride.
“Information is leprechauns??,” I blurted out. The propagation of “fake news” and outright falsehoods in recent months has been hard enough to take, but this boiled me over. I wanted to challenge Neil, and I recalled the protocol followed by William Hamilton’s wife: glove and shamrock at his feet. Well, I don’t wear gloves even in zero-degree weather, and good luck to my finding a shamrock under two feet of snow. So I asked in a level voice, “Can you give me some examples?”
Neil puffed and replied, “Not that information be us, but it bears us. And more and more ye can get to know us by reading your information carefully. But alas, more and more ye are confusing us with aliens.”
“Aliens?” This was all too much, and the dog wanted out. But Neil was happy to flit alongside me as I opened the door to the yard for her. He explained in simple tones:
“Ye have been reading the sky for many decades listening for alien intelligence. Up to last year ye had maybe one possible instance in 1977—apart from Nikola Tesla, who knew us well. But now reports are coming fast and furious. Not just fleeting sequences but recurrent ‘fast radio bursts’ observed in papers and discussed even this week by scientists from Harvard. Why so many now?”
I was quick to answer: “Because we are reading so much more data now.” Neil clapped his hands—I expected something to materialize by magic but he was just affirming my reply. I hedged, “But surely we understand the natural variation?” Neil retorted:
Indeed, the so-called diphoton anomaly had seemed on its way to confirmation because two separate experiments at the LHC were seeing it. An earlier LHC anomaly about so-called “penguin decays” has persisted since 2013 with seemingly no conclusion.
As I let the dog back in and toweled snow off her, I reflected: what was wrong with those 500 physics papers? A particle beyond the Standard Model would be the pot of gold at the end of a rainbow not only for many researchers but human knowledge on the whole. Then I remembered whom I was speaking with. Once free of the towel, Zoey scooted away, and I regrouped. I turned to Neil and said, “There is huge work on anomaly detection and data cleansing to identify and remove spurious data. Surely we are scaling that up as needed…”
Neil took a long drag on his pipe and arched up:
“I be not talking o’ bad data points but whole data sets, me lad.”
Littlewood’s Law of Leprechauns
I sank into an armchair and an electrical voltage drop dimmed the lights as Neil took over, perched again on the divider. “Ye know John Littlewood’s law of a miracle per month, indeed you wrote a post on it. If ye do a million things or observe a million things, one o’ them is bound to be a million-to-one shot.”
I nodded, already aware of his point.
“No different ’tis with data sets. One in a million be one-in-a-million bad. A thousand in a million be—begorra—one-in-a-thousand bad. Or too good. If ye ha’e 50,000 companies and agencies and research groups doing upwards of 20 data sets each, that’s wha’ ye have. Moreover—”
Neil leaned forward enough to fall off the counter but of course he didn’t fall.
“All the cleansing, all the cross-validation ye do, all the confirmation ye believe, is merely brought inside this reckoning. All that also changes the community standards, and by those standards ye’re still one-in-a-million, one-in-a-thousand off. Now ye may say, 999-in-a-thousand are good, a fair run o’ the mill. But think of the impacts. Runs o’ the mill have run o’ the mill effects, but the stark ones, hoo–ee.”
He whistled. “The impacts of the ones we choose to reside in scale a thousand-to-one stronger, a million-to-one… An’ that is how we keep up a constant level of influence in affairs o’ the world. All o’ the world—yer hard science as well as social data.”
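Neil’s reckoning can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes each data set is independently “one-in-a-million bad” with probability 10^-6 (and likewise for one-in-a-thousand), a simplification the story does not state. The function names are made up for the example.

```python
import math

def expected_outliers(n_datasets, rarity):
    """Expected number of data sets that are 'rarity'-level flukes."""
    return n_datasets * rarity

def prob_at_least_one(n_datasets, rarity):
    """P(at least one fluke) = 1 - (1 - rarity)^n, assuming independence."""
    # expm1/log1p keep the computation accurate for tiny probabilities.
    return -math.expm1(n_datasets * math.log1p(-rarity))

# Neil's figures: 50,000 groups doing upwards of 20 data sets each.
groups = 50_000
datasets_per_group = 20
n_datasets = groups * datasets_per_group

for label, rarity in [("one-in-a-million", 1e-6), ("one-in-a-thousand", 1e-3)]:
    print(label,
          "expected:", expected_outliers(n_datasets, rarity),
          "P(at least one):", round(prob_at_least_one(n_datasets, rarity), 3))
```

Under those assumptions a million data sets yield, on average, one that is one-in-a-million off and about a thousand that are one-in-a-thousand off, with roughly a 63% chance that at least one one-in-a-million data set is out there.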
I thought of something important: “If you lot choose to commandeer one data set, does that give you free rein to infect another of the same kind?”
“Nae—ye know from Dick’s accounts, we must do our work within the bounds of probability. So if ye get a whiff of us or even espy us, ye can take double the data without fear of us. But—then ye be subject to the most subtle kind of sampling bias, which is the bias of deciding when to stop sampling.”
After the terrible anomaly I showed in December from four data points of chess players rated near 2200, 2250, 2300, and 2350 on the Elo scale, I had spent much of January filling in 2225, 2275, 2325, and 2375. Which improved the picture quite a lot. Of course I ran all the quarter-century marks from Elo 1025 to Elo 2775, over three million more moves in all. But instead of feeling pride, after Neil’s last point I looked down at the floor.
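The stopping bias Neil mentions can be simulated. The sketch below is not the chess analysis; it is a toy model, assuming a fair coin and a two-sided z-test at the usual 1.96 threshold. It compares an analyst who looks after every flip and stops as soon as the result looks significant with one who fixes the sample size in advance. All names and parameters (`run_trial`, `false_alarm_rates`, the seed, the trial counts) are hypothetical choices for illustration.

```python
import math
import random

def run_trial(rng, max_flips, min_flips=10, z_crit=1.96):
    """Flip a fair coin max_flips times and return (peeked, fixed).

    peeked: True if |z| exceeded z_crit at any look from min_flips onward,
            i.e. an analyst who stops at the first 'significant' look is fooled.
    fixed:  True if |z| exceeds z_crit at the pre-planned final sample size.
    """
    heads = 0
    peeked = False
    z = 0.0
    for n in range(1, max_flips + 1):
        heads += 1 if rng.random() < 0.5 else 0
        # z-score of the head count under the null hypothesis of a fair coin
        z = (heads - n / 2) / (math.sqrt(n) / 2)
        if n >= min_flips and abs(z) > z_crit:
            peeked = True
    return peeked, abs(z) > z_crit

def false_alarm_rates(trials=1000, max_flips=500, seed=2017):
    """Estimate false-alarm rates for the peeking and fixed-size analysts."""
    rng = random.Random(seed)
    peek_hits = 0
    fixed_hits = 0
    for _ in range(trials):
        peeked, fixed = run_trial(rng, max_flips)
        peek_hits += peeked
        fixed_hits += fixed
    return peek_hits / trials, fixed_hits / trials

peek_rate, fixed_rate = false_alarm_rates()
print(f"stop when it looks significant: {peek_rate:.2f}")
print(f"fixed sample size:              {fixed_rate:.2f}")
```

Since the coin is fair, every “significant” result here is a false alarm. The fixed-size analyst should be fooled about 5% of the time, while the peeking analyst is fooled far more often; the exact rates depend on the seed and on how many flips are allowed.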
His final words were gentle:
“Cheer up lad, it not only could be worse, it would be worse. Another o’ your sages, Nassim Taleb, has pointed out what he calls the ‘tragedy of big data’: spurious correlations and falsity grow faster than information. See that article’s graphic, which looks quadratic or at any rate convex. Then be ye thankful, for we Leprechauns are hard at work keeping the troubles down to linear. But this needs many more of us, lad, so I must be parting anon.”
And with a pop he was gone.
Is Neil right? What examples might you know of big data sets suspected of being anomalous not for any known systematic reason but just the “luck of the draw”?
Happy St. Patrick’s Day anyway.
The breaks keep on coming…
Holly Dragoo, Yacin Nadji, Joel Odom, Chris Roberts, and Stone Tillotson are experts in computer security. They recently were featured in the GIT newsletter Cybersecurity Commentary.
Today, Ken and I consider how their comments raise a basic issue about cybersecurity. Simply put:
Is it possible?
With a little more from Smullyan
Maurice Ashley is an American chess grandmaster. He played for the US Championship in 2003. He coached two youth teams from Harlem to national championships and played himself in one scene of the movie Brooklyn Castle. He created a TEDYouth video titled, “Working Backward to Solve Problems.”
Today we discuss retrograde analysis in chess and other problems, including one of my own.
Serious work amid the puzzles and jokes.
When Raymond Smullyan was born, Emanuel Lasker was still the world chess champion. Indeed, of the 16 universally recognized champions, only the first, Wilhelm Steinitz, lived outside Smullyan’s lifetime. Smullyan passed away a week ago Monday at age 97.
Today, Dick and I wish to add some thoughts to the many comments and tributes about Smullyan.
A discussion on the famous problem
William Agnew is the chairperson of the Georgia Tech Theoretical Computer Science Club. He is, of course, an undergraduate at Tech with a multitude of interests—all related to computer science.
Today I want to report on a panel that we had the other night on the famous P vs. NP question.
What to do about claims of hard theorems?
Shinichi Mochizuki has claimed a proof of the famous ABC conjecture since 2012. It is still unclear whether or not the claimed proof is correct. We covered it then and have mentioned it a few times since, but have not delved in to check it. Anyway, it is probably way above our ability to understand in some finite time.
Today I want to talk about how to check proofs like that of the ABC conjecture.
The issue is simple:
Someone writes up a paper that “proves” that X is true, where X is some hard open problem. How do we check that X is proved?
The proof in question is almost always long and complex. So the checking is not a simple matter. In some cases the proof might even use nonstandard methods and make it even harder to understand. That is exactly the case with Mochizuki’s proof—see here for some comments.
Let’s further assume that the claimed proof resolves X, where X is the P vs. NP problem. What should we do? There are some possible answers:
- Ignore: I have many colleagues who will not even open the paper to glance at it. Ken and I get a fair number of these, but I do at least open the file and take a quick look. I will send a message to the author—it usually is a single author—about some issue if I see one right away.
- Show Me The Beef: I firmly believe that a proof of an open problem should have at least one simple-to-state new trick or insight that we all missed. I would suggest that the author must be able to articulate this new idea: if they cannot, then we can safely refuse to read it. I have worked some on the famous Jacobian Problem. At one time an author claimed they had a proof and it was just “a careful induction.” No. I never looked at it because of the lack of “beef,” and in a few weeks the proof fell apart.
- Money: Several people have suggested, perhaps not seriously, that anyone claiming a proof must be ready to post a “bond” of some money. If someone finds an error, they get the bond money. If no one does or, even better, if the proof is correct, then the money can be donated to one of our conferences.
- Hire: I have seen this idea just recently. The author posts a request for someone to work on their paper as a type of consultant. They are paid a fair hourly rate and help find the error.
- Timeout: An author who posts a false proof gets a timeout. They are not allowed to post another paper or submit a paper on X for some fixed time period. Some of the top journals like the Journal of the ACM already have a long timeout in place. The rationale behind this is that very often when an error is found in such a paper the author quickly “fixes” the issue and re-claims the result. In Stanislaw Ulam’s wonderful book Adventures of a Mathematician, he talks about false proofs. In the passage below, “he” refers to an amateur who often joined Ulam at his habitual coffeehouse:
Every once in a while he would get up and join our table to gossip or kibitz. Then he would add, “The bigger my proof, the smaller the hole. The longer and larger the proof, the smaller the hole.”
- Knock Heads Together: In December 2015, Oxford University hosted a workshop to examine Mochizuki’s claimed proof, including contact by Skype with Mochizuki himself. A report by Brian Conrad on the MathBabe blog makes for engaging reading; we could quote extensively from its concluding section 6. A shorter news report cited feelings of greater understanding and promise but lack of definite progress on verifying the proof, noting:
…[N]o one wants to be the guy that spent years working to understand a proof, only to find that it was not really a proof after all.
- Share The Credit: Building on the last point, perhaps proper credit can be given to someone who does spend a great deal of time working on trying to understand a long proof. If they find an unfixable error, then maybe they can publish that as a paper, especially if the error is nontrivial and not just a simple one. If they show that the proof is indeed correct, could they be rewarded with some type of co-authorship? Maybe a new type of authorship:
P Does Not Equal NP: A Proof Via Non-Linear Fourier Methods
Alice Azure with Bob Blue
Here the “with” signals that Alice is the main author and Bob was simply a helper. Recall a maxim sometimes credited to President Harry Truman: “It is amazing what you can accomplish if you do not care who gets the credit.”
What do you think about ways to check proofs? Any better ideas?
Impetus to study a new reducibility relation
Michael Wehar has just earned his PhD degree in near-record time in my department. He has posted the final version of his dissertation, titled On the Complexity of Intersection Non-Emptiness Problems, which he defended last month. The dissertation expands on his paper at ICALP 2014, a joint paper at ICALP 2015 with Joseph Swernofsky, and a joint paper at FoSSaCS 2016.