Leprechauns are Multiplying
Littlewood’s Law and Big Data
“Leprechaun-proofing” data source
Neil L. is a leprechaun. He has visited Dick on St. Patrick’s Day or the evening before many times. Up until this night I had never seen him.
Today, Neil’s message is more important than ever.
With over a foot of snow in Buffalo this week and the wind still howling, I was not expecting anything green. Long after Debbie had gone to bed, I was enmeshed in the “big-data blues” that have haunted me since summer and before. I was so fixated that it took me more than a few seconds to realize that the wisps of green smoke floating between me and the computer screen were something I should investigate.
There on our kitchen-study divider sat Neil. He looked like the pictures Dick had posted of him, but frazzled. He cleaned his pipe into a big Notre Dame coffee mug I got as a gift. I’d had it out since Princeton went up against Notre Dame in “March Madness”—my Tigers missed a chance for a big upset in the closing five seconds. As if reading my mind, he remarked how the tournament always produces upsets in the first round:
“If there be no unusual results, ’twould be most unusual.”
The Neil whom Dick described would have said this with wry mirth, but he sounded weary as if he had a dozen mouths to feed. I fired up the kettle and brought out the matching mug to offer tea or coffee, but he pointed to his hip flask and said “it’s better against the cold.”
Leprechaun Birds and Bees
That prompted me to ask, “Why didn’t you visit Dick? He and Kathryn have been enjoying sun at this week’s Bellairs workshop on Barbados.” I had been there two years ago when Neil had taken great pains to track Dick down. Neil puffed and replied, “Same reason I didn’t try finding you there back then—too far afield for a big family man.” The word “family” struck me as our dog Zoey, who had stayed sleeping in her computer-side bed at my feet, woke up to give Neil a barkless greeting. Of course, even leprechauns have relations…
Nodding to pictures of our children on the wall, I asked Neil how many he had. He took a long puff and replied:
“Several thousand. It’s too hard to keep count nowadays.”
Now Zoey barked, and this covered my gasp. Knowing that Neil was several centuries old, I did some mental arithmetic, but concluded he would still need a sizable harem. Reading my mind again, Neil cut in:
“Not as ye mortals do. What d’ye think we’re made of?”
I reached out to touch him, but Neil leaned away and vanished. A moment later he popped back and folded his arms, waiting for me to reply. I realized, ah, he is made of spirit matter. What can that be? Only one thing in this world it could be: information.
“Tá tú ceart” he whistled. “Right. And some o’ yer sages wit ye mortals have some o’ the same stuff. Max Tegmark, for one, wrote:
“… consciousness is the way information feels when being processed.”
And Roger Penrose has just founded a new institute on similar premises—up front he says chess holds a key to human consciousness so you of all people should know whereof I speak.”
Indeed, I had to nod. He continued, “And information has been growing faster than Moore’s Law. Hard to keep up…” The last words came with a puff of manly pride.
“Information is leprechauns??” I blurted out. The propagation of “fake news” and outright falsehoods in recent months has been hard enough to take, but this boiled me over. I wanted to challenge Neil, and I recalled the protocol followed by William Hamilton’s wife: glove and shamrock at his feet. Well, I don’t wear gloves even in zero-degree weather, and good luck finding a shamrock under two feet of snow. So I asked in a level voice, “Can you give me some examples?”
Neil puffed and replied, “Not that information be us, but it bears us. And more and more ye can get to know us by reading your information carefully. But alas, more and more ye are confusing us with aliens.”
“Aliens?” This was all too much, and the dog wanted out. But Neil was happy to flit alongside me as I opened the door to the yard for her. He explained in simple tones:
“Ye have been reading the sky for many decades listening for alien intelligence. Up to last year ye had maybe one possible instance in 1977—apart from Nikola Tesla, who knew us well. But now reports are coming fast and furious. Not just fleeting sequences but recurrent ‘fast radio bursts’ observed in papers and discussed even this week by scientists from Harvard. Why so many now?”
I was quick to answer: “Because we are reading so much more data now.” Neil clapped his hands—I expected something to materialize by magic but he was just affirming my reply. I hedged, “But surely we understand the natural variation?” Neil retorted:
Indeed, the so-called diphoton anomaly had seemed on its way to confirmation because two separate experiments at the LHC were seeing it. An earlier LHC anomaly about so-called “penguin decays” has persisted since 2013 with seemingly no conclusion.
As I let the dog back in and toweled snow off her, I reflected: what was wrong with those 500 physics papers? A particle beyond the Standard Model would be the pot of gold at the end of a rainbow not only for many researchers but human knowledge on the whole. Then I remembered whom I was speaking with. Once free of the towel, Zoey scooted away, and I regrouped. I turned to Neil and said, “There is huge work on anomaly detection and data cleansing to identify and remove spurious data. Surely we are scaling that up as needed…”
Neil took a long drag on his pipe and arched up:
“I be not talking o’ bad data points but whole data sets, me lad.”
Littlewood’s Law of Leprechauns
I sank into an armchair and an electrical voltage drop dimmed the lights as Neil took over, perched again on the divider. “Ye know John Littlewood’s law of a miracle per month, indeed you wrote a post on it. If ye do a million things or observe a million things, one o’ them is bound to be a million-to-one shot.”
I nodded, already aware of his point.
“No different ’tis with data sets. One in a million be one-in-a-million bad. A thousand in a million be—begorra—one-in-a-thousand bad. Or too good. If ye ha’e 50,000 companies and agencies and research groups doing upwards of 20 data sets each, that’s wha’ ye have. Moreover—”
Neil leaned forward enough to fall off the counter but of course he didn’t fall.
“All the cleansing, all the cross-validation ye do, all the confirmation ye believe, is merely brought inside this reckoning. All that also changes the community standards, and by those standards ye’re still one-in-a-million, one-in-a-thousand off. Now ye may say, 999-in-a-thousand are good, a fair run o’ the mill. But think of the impacts. Runs o’ the mill have run o’ the mill effects, but the stark ones, hoo-ee.”
He whistled. “The impacts of the ones we choose to reside in scale a thousand-to-one stronger, a million-to-one… An’ that is how we keep up a constant level of influence in affairs o’ the world. All o’ the world—yer hard science as well as social data.”
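As an aside, Neil's reckoning is easy to check. The sketch below is mine, not his: it assumes the data sets are independent and that each has the same small chance of being a fluke, which real data sets surely violate. The function names are made up for illustration.

```python
# Back-of-envelope check of Neil's reckoning.
# Assumption (mine, not Neil's): data sets are independent, and each one
# has the same chance `tail_prob` of being a fluke of that rarity.


def expected_flukes(n_datasets: int, tail_prob: float) -> float:
    """Expected number of data sets that are flukes of rarity tail_prob."""
    return n_datasets * tail_prob


def prob_at_least_one(n_datasets: int, tail_prob: float) -> float:
    """Chance that at least one such fluke turns up among n_datasets."""
    return 1.0 - (1.0 - tail_prob) ** n_datasets


n_groups = 50_000      # companies, agencies, and research groups
sets_per_group = 20    # "upwards of 20 data sets each"
n_datasets = n_groups * sets_per_group

print("data sets:", n_datasets)
print("expected one-in-a-million flukes:", expected_flukes(n_datasets, 1e-6))
print("expected one-in-a-thousand flukes:", expected_flukes(n_datasets, 1e-3))
print("chance of at least one one-in-a-million fluke:",
      prob_at_least_one(n_datasets, 1e-6))
```

Under these assumptions there are a million data sets, so one of them is expected to be one-in-a-million bad and about a thousand to be one-in-a-thousand bad, just as Neil says. The chance that at least one one-in-a-million fluke exists is roughly 63 percent.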
I thought of something important: “If you lot choose to commandeer one data set, does that give you free rein to infect another of the same kind?”
“Nae—ye know from Dick’s accounts, we must do our work within the bounds of probability. So if ye get a whiff of us or even espy us, ye can take double the data without fear of us. But—then ye be subject to the most subtle kind of sampling bias, which is the bias of deciding when to stop sampling.”
After the terrible anomaly I showed in December from four data points of chess players rated near 2200, 2250, 2300, and 2350 on the Elo scale, I had spent much of January filling in 2225, 2275, 2325, and 2375, which improved the picture quite a lot. Of course I ran all the quarter-century marks from Elo 1025 to Elo 2775, over three million more moves in all. But instead of feeling pride, after Neil’s last point I looked down at the floor.
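Neil's warning about the bias of deciding when to stop sampling can be seen in a small simulation. This is a sketch under assumptions of my own choosing, not a model of my chess data: samples are drawn from a standard normal distribution whose true mean is zero, and an "effect" is declared whenever the z-statistic exceeds 1.96 in absolute value. One experimenter tests once at the end; the other looks after every batch and would stop at the first look that seems significant.

```python
import math
import random


def one_run(rng, max_n, peek_every, z_crit=1.96):
    """Draw max_n samples from N(0, 1), so there is no real effect.

    Returns (fixed_reject, peek_reject):
      fixed_reject: |z| > z_crit when tested once, at n = max_n
      peek_reject:  |z| > z_crit at some look taken every peek_every samples,
                    i.e. an experimenter who stops at the first "significant"
                    look would have declared an effect
    """
    total = 0.0
    peek_reject = False
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if n % peek_every == 0:
            z = total / math.sqrt(n)  # z-statistic for mean 0, variance 1
            if abs(z) > z_crit:
                peek_reject = True
    fixed_reject = abs(total / math.sqrt(max_n)) > z_crit
    return fixed_reject, peek_reject


def simulate(n_trials, max_n, peek_every, seed=0):
    """Return (false-positive rate with one test, rate with repeated looks)."""
    rng = random.Random(seed)
    fixed = peeked = 0
    for _ in range(n_trials):
        f, p = one_run(rng, max_n, peek_every)
        fixed += f
        peeked += p
    return fixed / n_trials, peeked / n_trials


fixed_rate, peek_rate = simulate(n_trials=2000, max_n=200, peek_every=10, seed=1)
print("false-positive rate, single test at the end:", fixed_rate)
print("false-positive rate, stopping at first significant look:", peek_rate)
```

The single test should come out near the nominal 5 percent, while the peeking experimenter finds a spurious effect far more often, even though every individual look uses the same threshold.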
His final words were gentle:
“Cheer up lad, it not only could be worse, it would be worse. Another o’ your sages, Nassim Taleb, has pointed out what he calls the ‘tragedy of big data’: spurious correlations and falsity grow faster than information. See that article’s graphic, which looks quadratic or at any rate convex. Then be ye thankful, for we Leprechauns are hard at work keeping the troubles down to linear. But this needs many more of us, lad, so I must be parting anon.”
And with a pop he was gone.
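One simple way to see why spurious correlations can outgrow information: with p variables there are p(p-1)/2 pairs to correlate, which is quadratic in p. The sketch below is my own illustration and not Taleb's computation. It fills p columns with independent Gaussian noise and counts the pairs whose sample correlation exceeds 0.36 in absolute value, which is roughly the conventional 5 percent level for 30 samples. The function names and the threshold are my assumptions.

```python
import math
import random


def n_pairs(p):
    """Number of distinct pairs among p variables."""
    return p * (p - 1) // 2


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(p, n_samples, threshold, rng):
    """Count pairs of pure-noise columns with |r| above the threshold."""
    cols = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(p)]
    return sum(
        1
        for i in range(p)
        for j in range(i + 1, p)
        if abs(pearson(cols[i], cols[j])) > threshold
    )


def spurious_counts(ps, n_samples=30, threshold=0.36, seed=0):
    """Map each p in ps to its count of spurious correlations."""
    rng = random.Random(seed)
    return {p: count_spurious(p, n_samples, threshold, rng) for p in ps}


for p, count in spurious_counts([10, 20, 40]).items():
    print(p, "variables:", n_pairs(p), "pairs,", count, "spurious correlations")
```

Doubling the number of variables roughly quadruples the number of pairs, so the count of chance "findings" climbs much faster than the amount of data collected, even though none of the columns has anything to do with any other.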
Is Neil right? What examples might you know of big data sets suspected of being anomalous not for any known systematic reason but just the “luck of the draw”?
Happy St. Patrick’s Day anyway.