Cybercrime And Bad Statistics
Some surveys are more unreliable than others
Dinei Florencio and Cormac Herley are both at Microsoft Research. Florencio is with the Multimedia, Interaction and Communication group, and Herley, a Principal Researcher, is with the Machine Learning Department. Both have made important contributions to many areas of computing, especially security.
Today I want to talk about a brilliant piece of theirs on cybercrime that just appeared in this Sunday’s New York Times—it is in the Sunday Review section on page five.
The essence of their opinion piece is given away by its title: “The Cybercrime Wave That Wasn’t.” They make a strong case, in my opinion, that the claims of huge losses to consumers—over 100 billion dollars a year—being made about cybercrime are just not correct. Not even close. They point out that
[S]uch widely circulated cybercrime estimates are generated using absurdly bad statistical methods, making them wholly unreliable.
This is a pretty strong statement.
I was immediately attracted to their piece, partly because of the topic, cybercrime; and mostly by the word “absurdly.” In my experience, where absurd methods are being used by responsible people, interesting and tricky science is there to find. In this spirit let’s take a look at the issue that they raise.
The problem is that the cybercrime estimates are based on surveys of individuals, surveys that ask essentially “what did cybercrime cost you this year?” There may be other ways to get this information, but companies are unlikely to answer questions about cybercrime. Hence even the Federal Trade Commission (FTC) must rely on surveys.
Now surveys are reasonably well understood, relatively reliable if done correctly, and a powerful method of assessing how people feel about an issue. There have been some terrific failures over the years, but in general they are a useful tool.
Perhaps the most famous failure was the Truman-Dewey prediction. Logistical matters in 1948 induced the Chicago Tribune newspaper to put its first editions out earlier than before, and on Election Night, November 2, the deadline came before most voting returns started coming in. Hence they relied on their Washington correspondent, who relied on polls all showing a solid lead for challenger Thomas E. Dewey over President Harry S. Truman. The paper went to press with the banner headline "DEWEY DEFEATS TRUMAN," and a victorious Truman later posed for the iconic photo holding up the erroneous front page.
Closer to our time, three famous prediction mistakes impacted our lives. Surveys prior to January 1, 2000, indicated that the Y2K Problem would cause far more economic damage than it did. It is still not clear how much credit proactive measures should receive for the relative lack of problems, but it seems clear, as some even noted beforehand, that the problem had been severely overestimated.
Exit poll results caused the state of Florida to be prematurely called as a win for Albert Gore in the 2000 US presidential election, and led to mistaken expectations of a win for John Kerry in the 2004 election, before returns started coming in. The latter was a much wider-scale phenomenon, and there remains controversy over the assertion that Republican voters for President George W. Bush showed a lower rate of participation, thus skewing the surveys toward Kerry. There is speculation that support for survey answers perceived as unpopular is generally understated, and about other ways people might lie on voting surveys. But even so, any single change makes just a tiny difference in the total.
Suppose instead of an election poll we have a job-approval poll. Imagine that the entire population has $N$ people, each with values

$v_i \in \{+2, +1, 0, -1, -2\},$

meaning: $+2$ means they strongly approve of the job the President is doing, $+1$ means approve but not strongly, $0$ neutral, $-1$ disapprove, and $-2$ strongly disapprove. The goal of the survey is to ask people for their values, and then make a guess at the value of the sum of all preferences, i.e. at

$V = v_1 + v_2 + \cdots + v_N.$

Then $V/N$ is the average degree of support, or disapproval if negative.
If the people are sampled randomly, then this is a terrific method for estimating the sum of their values. Even if some of them "lie" and do not tell the truth when asked for their values, sampling will still work well. The key is that every value lies between $-2$ and $+2$, so no single lie can shift a response by more than $4$; hence we need only some reasonable bound on the fraction of people who will lie. Besides lies there are many other ways to mess up, such as misunderstanding the survey question, but we will lump these all together.
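A minimal simulation sketch of this point, with made-up parameters (the population size, sample size, and 5% lying rate are assumptions for illustration, not numbers from the article):

```python
import random

random.seed(1)

N = 100_000          # hypothetical population size
SAMPLE = 1_000       # survey sample size
LIE_FRACTION = 0.05  # assume at most 5% of respondents lie

# True opinions, each bounded in {-2, -1, 0, +1, +2}.
population = [random.choice([-2, -1, 0, 1, 2]) for _ in range(N)]
true_avg = sum(population) / N

# Survey: sample randomly; each liar reports the most extreme value.
sample = random.sample(population, SAMPLE)
reported = [2 if random.random() < LIE_FRACTION else v for v in sample]
estimate = sum(reported) / SAMPLE

# Each lie moves one response by at most 4, so the estimate drifts
# by at most 4 * LIE_FRACTION no matter what the liars say.
print(abs(estimate - true_avg))  # small: lies have bounded impact
```

The bound does not depend on *which* people lie or *what* they report, only on how many of them there are, which is exactly why bounded-value surveys are forgiving.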
Florencio and Herley note that all this changes in numeric surveys when the values are allowed to range from zero to infinity. See their technical paper here for details.
The problem is that even if most of the values in the sample are zero, a few very large values will have a huge effect on the estimate for the sum. And these large values can be lies. In their NYT piece they point out that one FTC survey on identity theft would have added 37 billion dollars to the total based on just the answers from two respondents. Just two—as in one, then two.
Ken’s Erroneous Error Problem
Ken notes that a related issue plagues his chess research because of widespread inaccuracies in records of moves in chess games, even in top-level chess tournaments. For example, a game from the 2010 Women's Grand Prix tournament in Ulaanbaatar reached this position after Black's move 33…Rd8:
Undoubtedly White traded Rooks by 34. Rxd8+ Qxd8 and then play continued 35. Qc3 Qd6. All sources Ken knows, however, give the moves 34. Qc3 Qd6 putting Black's Queen in-take, and the gamescore miraculously stays legal for the remaining 22 moves, with both Queens often left in-take to the phantom Rooks. This originally racked up huge amounts of player error in the computerized move analysis that underlies Ken's model. And none of it is true. Update: after this post appeared and the head of ChessGames.com contacted me in May, this gamescore was fixed.
The causes of errors are all too human, even when games are played on special sensory boards and transmitted in real time. The Rooks may have been traded too softly for the captures to register. There is a quirk that sometimes causes spurious King moves at ends of games. When human scoresheets are typed into databases, the letters ‘c’ and ‘e’, and ‘b’ and ‘d’, are often confused. By Ken’s estimates, upwards of 1% of games have mis-recorded moves. This would make 2,000 bad games in the 200,000 games he has analyzed to date, and over 60,000 in the most extensive game collections.
This is too many for one person to sift and correct by hand, and in over a quarter of cases Ken has examined, he does not even see a fix to suggest. Happily his data-processing scripts flag the most egregious cases, and he hand-cleaned the set of 6,000 games analyzed in depth to train his model, but he suspects enough uncaught “erroneous error” to skew the other totals by 5–10%. This makes a significant difference in the correspondence between his aggregate-error statistics and chess Elo Ratings that is established in his followup paper.
Not all gamescore errors lead to the appearance of blunders, and the small effect of those can be ignored. The problem is distinguishing spurious high error values from cases where players really do blunder on consecutive moves. Should we place a cap on the amount of error that can come from any one game? Or can we weight the data points by a confidence value that shrinks as the error increases?
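Both remedies in those questions can be sketched concretely. The function names, the cap value, and the shrinkage rule below are all hypothetical choices for illustration, not the rules Ken's model actually uses:

```python
def capped_total(errors, cap=3.0):
    # Option 1: cap the error contribution of any single game.
    return sum(min(e, cap) for e in errors)

def weighted_total(errors, scale=2.0):
    # Option 2: shrink confidence as the per-game error grows.
    # Weight is near 1 for small errors and decays toward 0
    # for suspiciously large ones.
    return sum(e * (scale / (scale + e)) for e in errors)

# Per-game error values, one suspiciously large (a possible
# mis-recorded gamescore rather than a real blunder streak).
games = [0.1, 0.3, 0.2, 9.5, 0.2]

print(sum(games))            # 10.3 -- raw total, dominated by one game
print(capped_total(games))   # 3.8  -- the outlier counts as at most 3.0
print(weighted_total(games))
```

The cap is blunt but simple; the confidence weight degrades gracefully, which matters when players genuinely do blunder on consecutive moves and a hard cap would throw away real signal.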
A Sampling Challenge
Note that Ken’s training set of 6,000 games is essentially a kind of sample of the whole population. In general we cannot go and fix errors in our samples, however. It may help that we have an estimate like 1% on the rate of false data points, but when the data is numeric and unbounded, we may not have a good estimate on the total numeric magnitude of the error.
This seems like an important problem that might benefit from tools we have in the theory of computation. Imagine the above setup of $N$ people, but add the assumption that some small fraction of them are allowed to lie arbitrarily. How can one use sampling methods to ask only a small fraction of the people their values, but avoid having a few liars affect the final estimate unduly?
This leads to the question: Can we sample in a way that stands a good chance of avoiding the false data points altogether? Ken and I have some initial ideas how this might be possible, but have no definite results to report. We do note that having statistical methods that work in the presence of lies seems to be a fundamental problem.
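One standard robust-statistics tool worth mentioning here, though it is not a resolution of the open question, is the trimmed mean: discard the extremes of the sample before averaging. A minimal sketch, assuming the liars are a small fraction who all inflate their answers:

```python
def trimmed_mean(values, trim_fraction=0.05):
    """Discard the top and bottom trim_fraction of the sorted sample,
    then average what remains. If fewer than trim_fraction of the
    respondents lie, and all lies are inflated, trimming removes them."""
    vals = sorted(values)
    k = int(len(vals) * trim_fraction)
    kept = vals[k: len(vals) - k] if k > 0 else vals
    return sum(kept) / len(kept)

honest = [0] * 95        # most people lost nothing
liars = [10**6] * 5      # 5% report million-dollar losses
sample = honest + liars

print(sum(sample) / len(sample))   # 50000.0: plain mean ruined by liars
print(trimmed_mean(sample, 0.05))  # 0.0: trimming removes them entirely
```

The catch, of course, is that trimming also discards any honest respondents with genuinely large losses, so for heavy-tailed quantities like crime losses it can badly underestimate the true total. That tension is exactly what makes the question above interesting.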
Just to repeat: can we handle liars in the estimation of sums? In the estimation of other values? Are there reasonable models that would be able to help here? Beyond cybercrime estimation we believe that such questions must be common in many places. Can we help?
Another problem dear to Ken is to fix up the chess record. The modern way of recording full play-by-play, pitch-by-pitch, manager move-by-move information of Major League Baseball games apparently began only in 1984. RetroSheet.org was founded to collect pre-1984 data and has about 100 volunteers. Can we mount a similar effort to assure accuracy of historical chess game scores?