Some surveys are more unreliable than others

Dinei Florencio and Cormac Herley are both at Microsoft Research. Florencio is with the Multimedia, Interaction and Communication group, and Herley, a Principal Researcher, is with the Machine Learning Department. Both have made important contributions to many areas of computing, especially security.

Today I want to talk about a brilliant piece of theirs on cybercrime that just appeared in this Sunday’s New York Times—it is in the Sunday Review section on page five.

The essence of their opinion piece is given away by its title: “The Cybercrime Wave That Wasn’t.” They make a strong case, in my opinion, that the claims of huge losses to consumers—over 100 billion dollars a year—being made about cybercrime are just not correct. Not even close. They point out that

[S]uch widely circulated cybercrime estimates are generated using absurdly bad statistical methods, making them wholly unreliable.

This is a pretty strong statement.

I was immediately attracted to their piece, partly because of the topic, cybercrime; and mostly by the word “absurdly.” In my experience, where absurd methods are being used by responsible people, interesting and tricky science is there to find. In this spirit let’s take a look at the issue that they raise.

## Surveys

The problem is that the cybercrime estimates are based on surveys of individuals, surveys that ask essentially “what did cybercrime cost you this year?” There may be other ways to get this information, but companies are unlikely to answer questions about cybercrime. Hence even the Federal Trade Commission (FTC) must rely on surveys.

Now surveys are reasonably well understood, relatively reliable if done correctly, and a powerful method of assessing how people feel about an issue. There have been some terrific failures over the years, but in general they are a useful tool.

Perhaps the most famous failure was the Truman-Dewey prediction. Logistical matters in 1948 induced the Chicago Tribune newspaper to put its first editions out earlier than before, and on Election Night, November 2, the deadline came before most voting returns started coming in. Hence they relied on their Washington correspondent, who relied on polls all showing a solid lead for challenger Thomas E. Dewey over President Harry S. Truman. Truman’s words while posing for the iconic photo were: “This is one for the books.”

Closer to our time, three famous prediction mistakes impacted our lives. Surveys prior to January 1, 2000, indicated that the Y2K Problem would cause far more economic damage than it did. It is still not clear how much credit proactive measures should receive for the relative lack of problems, but it seems clear, as some even noted beforehand, that the problem had been severely overestimated.

Exit poll results caused the state of Florida to be prematurely called as a win for Albert Gore in the 2000 US presidential election, and led to mistaken expectations of a win for John Kerry in the 2004 election, before returns started coming in. The latter was a much wider-scale phenomenon, and there remains controversy over the assertion that Republican voters for President George W. Bush showed a lower rate of participation in the exit polls, thus skewing the surveys toward Kerry. There is also speculation that answers perceived as unpopular are under-reported in surveys, and about other ways people might lie on voting surveys. But even so, any single changed answer makes just a tiny difference in the total.

## Numeric Surveys

Suppose instead of an election poll we have a job-approval poll. Imagine that the entire population has ${N}$ people, whose opinions are given by the values

$\displaystyle p_{1}, \dots, p_{N}$

where ${+2}$ means the person strongly approves of the job the President is doing, ${+1}$ means approve but not strongly, ${0}$ neutral, ${-1}$ disapprove, and ${-2}$ strongly disapprove. The goal of the survey is to ask ${n \ll N}$ people for their values, and then make a guess at the value of the sum of all preferences, i.e. at

$\displaystyle P = \sum_{i} p_{i}.$

Then ${P/N}$ is the average degree of support, or disapproval if negative.

If the people are sampled randomly, then this is a terrific method for determining the sum of their values. If they “lie” and do not tell the truth when asked for their values, then sampling will still work well, because every answer lies between ${-2}$ and ${+2}$ and so a single liar can shift the sum by at most 4. The key is that we need only have some reasonable bound on the fraction of people that will lie. Besides lies there are many other ways to mess up, but we will put these all together: people can, for example, misunderstand the survey question.
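A minimal simulation can make this concrete. The population, the 5% lie rate, and the sample size below are all made-up assumptions for illustration, and `estimate_sum` is a hypothetical helper that simply scales the sample total by ${N/n}$.

```python
import random

def estimate_sum(reported, n, rng):
    """Estimate the population total from n answers drawn uniformly at random."""
    sample = rng.sample(reported, n)
    return len(reported) * sum(sample) / n

rng = random.Random(0)
N = 100_000
# Hypothetical population of approval scores in {-2, ..., +2}.
truth = [rng.choice([-2, -1, 0, 1, 2]) for _ in range(N)]
# Assumption: about 5% of people "lie" by reporting the opposite sign.
reported = [-p if rng.random() < 0.05 else p for p in truth]

true_avg = sum(truth) / N
est_avg = estimate_sum(reported, 2_000, rng) / N
print(f"true average {true_avg:+.3f}, estimated {est_avg:+.3f}")
```

Because each answer is confined to the range from ${-2}$ to ${+2}$, a fraction ${f}$ of liars can move the average by at most ${4f}$, no matter what they say.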

Florencio and Herley note that all this changes in numeric surveys when the values ${p_{i}}$ are allowed to range from zero to infinity. See their technical paper here for details.

The problem is that even if most of the values ${p_{i}}$ in the sample are zero, a few very large values will have a huge effect on the estimate for the sum. And these large values can be lies. In their NYT piece they point out that one FTC survey on identity theft would have added 37 billion dollars to the total based on just the answers from two respondents. Just two—as in one, then two.
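To see how lopsided this can get, here is a small sketch with invented numbers (they are not the FTC's data): a sample of 1,000 answers is scaled up to a hypothetical population of 200 million, so each respondent stands in for 200,000 people.

```python
def extrapolate(sample, population_size):
    """Scale a sample total up to the whole population."""
    return population_size * sum(sample) / len(sample)

N = 200_000_000  # hypothetical population size
# Made-up answers in dollars: most report nothing, two report very large losses.
sample = [0.0] * 950 + [200.0] * 48 + [50_000.0, 100_000.0]

with_top_two = extrapolate(sample, N)
without_top_two = extrapolate(sample[:-2] + [0.0, 0.0], N)
print(f"estimate with the two large answers:     ${with_top_two / 1e9:.2f} billion")
print(f"estimate with those answers set to zero: ${without_top_two / 1e9:.2f} billion")
```

In this toy sample the two large answers account for almost all of the extrapolated total, and nothing in the arithmetic can tell whether they are true.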

## Ken’s Erroneous Error Problem

Ken notes that a related issue plagues his chess research because of widespread inaccuracies in records of moves in chess games, even in top-level chess tournaments. For example, a game from the 2010 Women’s Grand Prix tournament in Ulaanbaatar reached this position after Black’s move 33…Rd8:

Undoubtedly White traded Rooks by 34. Rxd8+ Qxd8 and then play continued 35. Qc3 Qd6. All sources Ken knows, however, give the moves 34. Qc3 Qd6 putting Black’s Queen in-take, and the gamescore miraculously stays legal for the remaining 22 moves, with both Queens often left in-take to the phantom Rooks. This originally racked up huge amounts of player error in the computerized move analysis that underlies Ken’s model. And none of it is true. Update: Subsequent to this post and the head of ChessGames.com contacting me in May, this gamescore was fixed.

The causes of errors are all too human, even when games are played on special sensory boards and transmitted in real time. The Rooks may have been traded too softly for the captures to register. There is a quirk that sometimes causes spurious King moves at ends of games. When human scoresheets are typed into databases, the letters ‘c’ and ‘e’, and ‘b’ and ‘d’, are often confused. By Ken’s estimates, upwards of 1% of games have mis-recorded moves. This would make 2,000 bad games in the 200,000 games he has analyzed to date, and over 60,000 in the most extensive game collections.

This is too many for one person to sift and correct by hand, and in over a quarter of cases Ken has examined, he does not even see a fix to suggest. Happily his data-processing scripts flag the most egregious cases, and he hand-cleaned the set of 6,000 games analyzed in depth to train his model, but he suspects enough uncaught “erroneous error” to skew the other totals by 5–10%. This makes a significant difference in the correspondence between his aggregate-error statistics and chess Elo Ratings that is established in his followup paper.

Not all gamescore errors lead to the appearance of blunders, and the small effect of those can be ignored. The problem is distinguishing spurious high error values from cases where players really do blunder on consecutive moves. Should we place a cap on the amount of error that can come from any one game? Or can we weight the data points by a confidence value that shrinks as the error increases?
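Both ideas are easy to state in code. The sketch below is only an illustration under assumed numbers: the per-game error values are invented, and the weight ${1/(1 + e/\text{scale})}$ is one arbitrary choice of a confidence that shrinks as the error grows, not a formula from Ken's model.

```python
def capped_total(errors_per_game, cap):
    """Sum per-game error, letting no single game contribute more than `cap`."""
    return sum(min(e, cap) for e in errors_per_game)

def weighted_mean(errors_per_game, scale):
    """Average per-game error with weights that shrink as the error grows.

    The weight 1 / (1 + e / scale) is an illustrative assumption.
    """
    weights = [1.0 / (1.0 + e / scale) for e in errors_per_game]
    return sum(w * e for w, e in zip(weights, errors_per_game)) / sum(weights)

# Made-up per-game error totals; the last one mimics a corrupted gamescore.
games = [1.2, 0.8, 2.5, 1.9, 0.4, 60.0]
print("raw total:     ", sum(games))
print("capped total:  ", capped_total(games, cap=10.0))
print("plain mean:    ", sum(games) / len(games))
print("weighted mean: ", weighted_mean(games, scale=5.0))
```

Either device limits the damage from a bad gamescore, but both also discount genuine runs of blunders, which is exactly the trade-off raised above.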

## A Sampling Challenge

Note that Ken’s training set of 6,000 games is essentially a kind of sample of the whole population. In general we cannot go and fix errors in our samples, however. It may help that we have an estimate like 1% on the rate of false data points, but when the data is numeric and unbounded, we may not have a good estimate on the total numeric magnitude of the error.

This seems like an important problem that might benefit from tools we have in theory of computation. Imagine the above setup of ${N}$ people, but add the assumption that some are allowed to lie. How can one use sampling methods to ask only a small fraction of the people their values, but avoid having a few liars affect the final estimate unduly?

This leads to the question: Can we sample in a way that stands a good chance of avoiding the false data points altogether? Ken and I have some initial ideas how this might be possible, but have no definite results to report. We do note that having statistical methods that work in the presence of lies seems to be a fundamental problem.

## Open Problems

Just a repeat: can we handle liars in the estimation of sums? In the estimation of other values? Are there reasonable models that would be able to help here? Beyond cybercrime estimation we believe that such questions must be common in many places. Can we help?

Another problem dear to Ken is to fix up the chess record. The modern way of recording full play-by-play, pitch-by-pitch, manager move-by-move information of Major League Baseball games apparently began only in 1984. RetroSheet.org was founded to collect pre-1984 data and has about 100 volunteers. Can we mount a similar effort to assure accuracy of historical chess game scores?

[fixed “based” to “bad” in first section’s quotation]

1. April 18, 2012 10:59 am

Perhaps the data should be preconditioned before attempting to estimate sums. Intuitively, we will only need to worry about apparent outliers (out-liars?), which should be easy to detect—e.g., the two outliers in the FTC survey you mentioned. If these have a significant impact on the sum, perhaps one should follow up with further investigation, or just perform different analyses under the assumption that some or all of these outliers are lies.

Another alternative is to use a different measure of central tendency altogether. In real estate pricing, the median home price is commonly used to compare different markets. This is because the distribution of home prices has a heavy tail, and so the average home price is a less telling statistic. For this reason, depending on the survey data, it might not be appropriate to estimate sums.
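The difference between the two statistics shows up even in a tiny invented data set (the dollar amounts below are made up for illustration):

```python
import statistics

# Made-up survey answers in dollars; one respondent reports a huge loss.
answers = [0, 0, 0, 0, 0, 0, 50, 120, 300, 1_000_000]

mean_loss = statistics.mean(answers)
median_loss = statistics.median(answers)
print("mean:  ", mean_loss)
print("median:", median_loss)
```

The median ignores the one huge answer entirely, which is its strength as a robust summary and also the reason it cannot be scaled up to a population total the way a mean can.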

April 18, 2012 12:05 pm

The quote should read “absurdly bad statistical methods”, and not “absurdly based statistical methods”.

• April 18, 2012 1:31 pm

Thanks. Maybe a case of our natural civility filter kicking in; “based” is better technically anyway.

April 18, 2012 2:52 pm

Just FYI: For me, the “37 billion” is showing up as “2X billion” (I can tell it is meant to be 37 by mousing over for the hovertext). A similar bug appeared for me in the recent April 11 post on the Lonely Runner Conjecture, which was also noted in the comments section there by iitdelhireport and Stijn.

• April 18, 2012 10:48 pm

Thanks! That is very strange, but I found a Jan 2012 item on the WordPress forums about the bug. I fixed it in the other place too.

April 18, 2012 5:21 pm

Might it be possible to run something like RANSAC on the data to find a consensus among inlier datapoints?

5. April 18, 2012 10:57 pm

The question of liars seems potentially connected to a problem which I was told a few years ago was essentially open: Suppose A has picked a positive integer s from 1 to n, and B wants to figure out what s is. If B is allowed to ask A questions of the form “is s greater than/less than _” and A always answers truthfully, then B can win optimally with no more than 1 + log_2 n questions. What are the asymptotics if A is allowed to lie some number of times (known to B)? Last I heard, good bounds aren’t known even for A having no more than a single lie in the general case, and two lies isn’t understood well at all. Unfortunately, I don’t have a reference for the problem (in fact, if someone does have a reference I’d like to see it myself!). This seems like a thematically similar question.

• April 19, 2012 6:58 am

I’ve never heard this problem before, but I agree it’s thematically similar. I really like the non-adaptive version, in which you design all of your questions before asking any of them. This version can be posed as a matrix- or code-design problem: Design a $Q \times N$ (short, fat) matrix such that the first so many entries of each row are $0$’s and the other entries are $1$’s; say, the first $1$ in row $q$ is at $n_q$. Then each row represents the question, “Is your number greater than or equal to $n_q$?” Furthermore, if the number of interest is $n$, then the $n$th column represents the truthful response to the list of questions. The goal is to design the matrix in such a way that the columns have minimum Hamming distance $2e+1$, where $e$ is the number of lies you expect, while at the same time minimizing $Q$, the total number of questions asked.

If we had the luxury of asking questions of the more general form, “Does your number lie in this set?”, then we could use any error-correcting code in a similar way. Here, the Hamming code produces $Q=O(\log N)$ questions that are robust to a single lie.
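As a small illustration of the set-membership version, the sketch below uses the Hamming(7,4) code to identify a number from 0 to 15 with seven fixed questions, tolerating one lie. The function names and the choice of $N = 16$ are assumptions made for the example.

```python
def encode(s):
    """Hamming(7,4) codeword for a number s in 0..15 (4 data bits, 3 parity bits)."""
    d = [(s >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return (p1, p2, d[0], p3, d[1], d[2], d[3])

CODEBOOK = {s: encode(s) for s in range(16)}
# Question q asks: "Is your number in QUESTION_SETS[q]?"
QUESTION_SETS = [{s for s, c in CODEBOOK.items() if c[q]} for q in range(7)]

def answer(s, lie_at=None):
    """Answers from a responder holding s who lies on question `lie_at`, if given."""
    return [int((s in QUESTION_SETS[q]) ^ (q == lie_at)) for q in range(7)]

def decode(answers):
    """Return the number whose truthful answers disagree with `answers` least often."""
    def disagreements(s):
        return sum(a != b for a, b in zip(answers, CODEBOOK[s]))
    return min(CODEBOOK, key=disagreements)

print(decode(answer(11)), decode(answer(11, lie_at=4)))
```

Any two codewords differ in at least three positions, so a single false answer still leaves the true number strictly closest.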

April 19, 2012 8:29 am

The paper “On Playing Twenty Questions with a Liar” by Dhagat, Gacs and Winkler addresses this question.

May 1, 2012 12:28 pm

Well, there are some simple upper bounds. For example if there could be at most $k$ lies, each query can be posed $k+1$ times, making sure that the right answer is given at least once. For any query for which there are contradicting answers, one more query is made to verify which answer is the correct one. This gives $(k+1)\log n + k + 1$. Are there examples where one can do better?

I do recall a similar problem about optimal bounds for comparison sorting with lies.
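Here is one way the repetition idea can be made precise, offered as a sketch rather than as the commenter's exact scheme: repeat each comparison until one answer has more votes than there are lies remaining, then charge the dissenting votes against the lie budget. The responder `make_liar` is a hypothetical adversary that spends all its lies at the start.

```python
def ask_reliably(oracle, threshold, lies_left):
    """Repeat "is s >= threshold?" until one answer has lies_left + 1 votes.

    That answer must be true, since at most lies_left answers can be lies.
    Returns (true_answer, lies_spent, questions_asked).
    """
    votes = {True: 0, False: 0}
    while max(votes.values()) <= lies_left:
        votes[oracle(threshold)] += 1
    truth = votes[True] > votes[False]
    return truth, votes[not truth], votes[True] + votes[False]

def find_number(oracle, n, k):
    """Find s in 1..n by binary search when the oracle may lie at most k times."""
    lo, hi, lies_left, asked = 1, n, k, 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        is_at_least_mid, spent, used = ask_reliably(oracle, mid, lies_left)
        lies_left -= spent
        asked += used
        if is_at_least_mid:
            lo = mid
        else:
            hi = mid - 1
    return lo, asked

def make_liar(s, k):
    """Hypothetical responder holding s that lies on its first k answers."""
    state = {"lies": k}
    def oracle(threshold):
        truth = s >= threshold
        if state["lies"] > 0:
            state["lies"] -= 1
            return not truth
        return truth
    return oracle

print(find_number(make_liar(42, 3), 100, 3))
```

Each comparison costs at most $k+1$ questions plus the lies it exposes, so this sketch never asks more than $(k+1)\lceil \log_2 n \rceil + k$ questions.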

6. April 19, 2012 6:14 am

Does anyone know if there is some crowdsourcing going on to find scoresheet typos in those collections? Should be an interesting diversion to some…

7. April 19, 2012 2:24 pm

The problem actually lies in the extrapolation method. Asking people how much money they lost is not a simple binomial or Gaussian distribution problem. Since we know that money distribution follows a Pareto distribution within the population, we should expect random samples to reflect a similar distribution.

People can look up Pareto distributions on Wikipedia, but I do want to mention the interesting fact that there is a shape variable associated with the distribution. When that variable goes to infinity, the Pareto distribution becomes the Dirac delta function.

In any case, the data collected might still have some error associated with liars, but it should be recognized that large numbers are not outliers but are in fact important members of the distribution.
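For readers who want the formulas behind this remark, here is the standard Pareto distribution with scale $x_m > 0$ and shape $\alpha > 0$ (the symbols are the usual textbook ones, not taken from the comment):

$\displaystyle \Pr[X > x] = \left(\frac{x_m}{x}\right)^{\alpha} \quad (x \ge x_m), \qquad \mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1} \quad (\alpha > 1), \qquad \mathrm{Var}[X] = \frac{\alpha x_m^{2}}{(\alpha-1)^{2}(\alpha-2)} \quad (\alpha > 2).$

As $\alpha \to \infty$ the mean tends to $x_m$ and the variance to $0$, which is the sense in which the distribution collapses to a point mass at $x_m$. For $\alpha \le 1$ the mean is infinite, and for $\alpha \le 2$ the variance is infinite, so a sample average converges slowly or not at all.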

• April 19, 2012 2:32 pm

I would also add that despite some of my past discontent with Taleb’s books “Fooled by Randomness” and “The Black Swan”, I believe his general point about a significant part of the world following power law statistics should be appreciated.