Is Jeopardy! in Mathematicians?
Human lessons from a computer trying to think like a human
David Ferrucci is Department Group Manager in Semantic Analysis and Integration at the IBM Thomas J. Watson Research Center. He heads the DeepQA Project, which produced the automated Question-Answering system named Watson. It famously defeated Jeopardy! champions Ken Jennings and Brad Rutter on a special edition of the American TV game show last February, as we covered back then in our post, “Are Mathematicians in Jeopardy?”
Today we wish to ask what IBM’s solution to a real-world problem may imply for humans approaching research problems.
Dr. Ferrucci gave a plenary talk titled “Building Watson: An Overview of DeepQA for the Jeopardy! Challenge” (paper, video from TiEcon, 5/15/11) at the AAAI 2011 conference (joint with IAAI 2011), which I (Ken) attended for the first time. I presented my paper with Guy Haworth of the University of Reading (UK) on “Intrinsic Chess Ratings”; it was also mentioned here and here. I was glad to meet several people in related areas of research.
DeepQA is an offspring of the famous Deep Blue, the IBM chess program and multi-processing computer that vanquished world champion Garry Kasparov in 1997. Ferrucci hopefully labeled his project “The Next Deep Blue,” since “IBM got a lot of value out of Deep Blue,” he pointed out, beyond beating Kasparov. For a manager at IBM this value is no doubt critical, and at the end of his talk he projected concrete applications of DeepQA. However, most of his talk was about the journey to the Jeopardy! TV contest, and that is what interests us now.
Ferrucci began with a question that a member of an online chess club such as ICC might pose:
“Do you want to play chess, or just chat?”
In chess, he said, all messages signifying moves have well-defined meaning from the rules of the game. In human language, however, words have no such intrinsic meaning—they gain meaning from human cognition aligned with context and human actions and intents. Chat is hard, and so is wit.
The initial 2007 version of Watson was fairly rule-based, as if playing chess, and tests showed it had only a fraction of the skill needed to compete with good Jeopardy! players. My word `fairly’ here is fairly difficult, perhaps unfairly difficult, for a rule-based system to pick up the shade of meaning. Just sporting a balanced, unbiased, impartial approach might not be enough for a Jeopardy! category like one whose “hook” requires recognizing synonyms for `fair,’ such as the first five real words of this sentence. Thus the IBM team’s initial idea of THINK needed a dose of Think Different in order to think like us.
In computer science theory and mathematics we believe all of our problem-solving objectives are defined by rules. But as we have tried to say in earlier discussions, sometimes true progress requires breaking the rules, or perhaps better put, looking for patterns and tricks in new contexts.
Watson may not be able to sing or dance, but it can probably give a good talk. Let’s see what steps enabled it to get onstage.
The IBM team did not attempt to build large databases of questions and answers, Jeopardy!-style or not. A sample of just 20,000 past questions revealed 2,500 distinct types, the most frequent type only about 3% of the whole. The team inferred that the whole would be a distribution with a long tail, and with no clear framework for the domain, so that they could only hit a fraction of it.
Hence they used rules mainly to enumerate senses of words and basic relations: Is a `liquid’ a `fluid’? Is a `fluid’ a `liquid’? They used WordNet for much of this, but Ferrucci quipped that while WordNet by itself would be good for understanding questions on a physics test, it would not do for anything like Jeopardy! The problem had to be seen as much greater, just as research is different from taking an examination. Here are some of the steps that he outlined—all boldface is quasi-verbatim from his slides:
Identify and solve sub-questions. Use divide-and-conquer. As an example he gave:
When “60 Minutes” premiered, this man was US President.
The sub-questions are what is “60 Minutes,” when did it premiere, and who was President then?
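As a toy illustration of this divide-and-conquer step (not Watson's actual method), the three sub-questions can be chained as lookups. The tables `PREMIERES` and `PRESIDENTIAL_TERMS` and the function names are hypothetical stand-ins for real knowledge sources; the dates themselves are historical facts.

```python
from datetime import date

# Toy knowledge sources: hypothetical stand-ins, not Watson's data.
PREMIERES = {"60 Minutes": date(1968, 9, 24)}
PRESIDENTIAL_TERMS = [
    ("Lyndon B. Johnson", date(1963, 11, 22), date(1969, 1, 20)),
    ("Richard Nixon", date(1969, 1, 20), date(1974, 8, 9)),
]

def premiere_date(show):
    """Sub-questions 1 and 2: identify the show and find when it premiered."""
    return PREMIERES[show]

def president_on(day):
    """Sub-question 3: who held office on the given day?"""
    for name, start, end in PRESIDENTIAL_TERMS:
        if start <= day < end:
            return name
    return None  # the toy table does not cover this date

def answer(show):
    """Chain the sub-answers: the output of one lookup feeds the next."""
    return president_on(premiere_date(show))

print(answer("60 Minutes"))
```

The point of the sketch is only the chaining: each sub-question is easy once the previous one has been answered.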
In Watson’s case, sub-questions often arise from different senses of words, and have to be worked on in parallel. Hence the next step:
Try different decompositions of the problem. Use recursion as appropriate. He gave an example from a category titled “Edible Rhyme Time”:
A long tiresome speech delivered by a frothy pie topping.
The category expects two rhyming words or phrases as the answer, but does the first or second stand for the speech? Try both. Eventually “meringue harangue” is found and trumps whatever other trials return.
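Here is a minimal sketch of "try both," under loud assumptions: the candidate word lists are written by hand, and `crude_rhyme` is a deliberately naive suffix test rather than real phonetics. Nothing here is claimed to be how Watson did it.

```python
from itertools import product

# Hypothetical candidate lists; a real system would generate these from text.
CANDIDATES = {
    "speech": ["tirade", "lecture", "harangue"],
    "topping": ["custard", "meringue"],
}

def crude_rhyme(a, b, n=4):
    """Very rough rhyme test: distinct words that share their last n letters."""
    return a != b and a[-n:] == b[-n:]

def rhyming_pairs(first_role, second_role):
    """One decomposition: the first answer word fills first_role, the second fills second_role."""
    pairs = product(CANDIDATES[first_role], CANDIDATES[second_role])
    return [(x, y) for x, y in pairs if crude_rhyme(x, y)]

def all_hypotheses():
    """Try both decompositions and keep every rhyming pair, tagged by its decomposition."""
    hypotheses = []
    for roles in (("speech", "topping"), ("topping", "speech")):
        for pair in rhyming_pairs(*roles):
            hypotheses.append((roles, " ".join(pair)))
    return hypotheses

for roles, phrase in all_hypotheses():
    print(roles, phrase)
```

Both word orders survive the rhyme filter, so the sketch stops at generating hypotheses; choosing “meringue harangue” over the reverse is left to a downstream scorer, which is omitted here.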
In research one would hope words and concepts are not so slippery. Even so, our thoughts on “changing the game” qualify as trying different ways of breaking down a problem, not just the idea of breaking down itself.
Find a missing link or common bond. Shirts, TV remotes, telephones, and, I can add to his list, impressionable people: what do they have in common? Buttons. Perceiving this answer was acknowledged as a “bias toward humans”; I’ll add that a computer could come up with `pressed’ without realizing that ironing shirts is a different sense. But a list of possible ties is on the right track, even if the final answer is something else.
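One naive way to produce such a list of possible ties is to intersect the associations of each item. The association sets below are invented by hand for illustration (they are not from the talk), and the sketch deliberately ignores word senses.

```python
# Hypothetical association sets, written by hand for illustration only.
ASSOCIATIONS = {
    "shirts": {"buttons", "collars", "sleeves", "pressed"},
    "TV remotes": {"buttons", "batteries", "pressed"},
    "telephones": {"buttons", "ringing", "pressed"},
}

def common_bonds(items):
    """Return, in sorted order, every association shared by all the items."""
    sets = [ASSOCIATIONS[item] for item in items]
    return sorted(set.intersection(*sets))

print(common_bonds(["shirts", "TV remotes", "telephones"]))
```

With these toy sets both “buttons” and “pressed” come back: a sense-blind intersection cannot tell that pressing a shirt and pressing a key are different things, so something else has to rank the candidates.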
In research we can ask, what does this problem have in common with other problems whose solutions we know from the literature? There are also areas of mathematics designed to draw out common features, such as the representation theory of groups.
Combine deep and shallow approaches. In Watson’s case, relying on basic text search never delivered high self-estimated confidence in answers, and plateaued at about 30% accuracy, too low when 50% is needed just to break even on Jeopardy! A structured knowledge-base approach could deliver high confidence if the questions could be precisely mapped to existing and reliable senses, but this was rarely the case straight off.
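The break-even figure can be checked with a one-line expectation. This is a simplification of the game, assuming every attempted clue of value $v$ pays $+v$ when answered correctly and $-v$ otherwise, with $p$ the accuracy on attempted clues:

```latex
\[
  \mathbb{E}[\text{gain}] \;=\; p\,v - (1-p)\,v \;=\; (2p-1)\,v,
  \qquad
  \mathbb{E}[\text{gain}] = 0 \iff p = \tfrac{1}{2}.
\]
```

Under that assumption, an accuracy of $p = 0.3$ loses $0.4\,v$ per attempted clue on average.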
Specifying large hand-crafted models didn’t cut it, Ferrucci said: they were “too slow, too narrow, too brittle, and too biased.” Combining the two approaches, learning to analyze information from scanning “as-is” knowledge sources, worked because computers can do something that single researchers cannot.
Be `Embarrassingly Parallel.’ According to Watson’s Wikipedia page, it clustered POWER7 processor cores, each with four threads, that could crunch 500 gigabytes per second. Its 4 terabytes of storage was all in the box, making 200 million pages at 20K/page, with the whole set of Wikipedia pages, including presumably its own. It did not use the Internet. Its machine-learning component used about 200 features from answer scorers plus 400 derived features and applied about 100 different techniques to formulate, merge, and rank hypotheses and put confidence estimates on them, a process Ferrucci called “evidence diffusion.”
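The storage figures are consistent with each other, reading “20K” as 20,000 bytes and a terabyte as $10^{12}$ bytes (both are assumptions; the source does not say whether the units are decimal or binary):

```latex
\[
  \underbrace{2\times 10^{8}}_{\text{pages}}
  \;\times\;
  \underbrace{2\times 10^{4}}_{\text{bytes per page}}
  \;=\; 4\times 10^{12}\ \text{bytes}
  \;=\; 4\ \text{TB}.
\]
```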
As we said, a single researcher cannot be expected to do this. This belongs to the brave new realm of crowd-sourcing research. But we can stay flexible and avoid being wedded to preconceptions on a problem.
Tonto on Toronto
The rest of the talk described the system’s growing pains, as tuning made nonsensical answers rarer. Of course this did not prevent a famous bad answer on the first “Final Jeopardy” question:
US CITIES: Its largest airport is named for a World War II hero; its second largest, for a World War II battle.
Watson generated only 14% self-confidence for its answer “What is Toronto” and hence appended five question marks to it. The correct answer Chicago (O’Hare and Midway) was second at 11%, with Omaha at 10%.
Ferrucci reviewed the explanations for the error, including many cases where the category title specifies a set but the answer is not a member of that set. Amazingly, even if this is made explicit by re-wording the question as “This US city’s largest airport is named…,” Toronto still comes a close second at 28% to Chicago at 32% confidence. The second clause follows a semicolon but lacks a verb, making it hard to parse; there are several US cities named Toronto; and Toronto associates to the American League. Toronto’s Lester Pearson even served in both World Wars.
We all have our “Toronto” moments. One nice aspect of Watson’s name is that besides IBM founder Thomas J. Watson, it calls to mind Sherlock Holmes’ genial and underestimated sidekick, whose store of medical and practical knowledge and interaction skills often help solve the mysteries. However, the name of the Lone Ranger’s sidekick, Tonto, means `silly’ in Spanish. Well, one way to avoid being silly in research is to have a partner to begin with, or at least someone in whom to confide and with whom to go over results closely. Nick Katz played this role for Andrew Wiles in the spring of 1993, though we note that many of us have no such help. Perhaps that is the role that a Watson could play in the future: not to do mathematics, but to help us avoid silly mistakes.
A Question for Watson
What if the first “Daily Double” in, say, a “Famous Computers” category had this question, and was chosen by Watson in a rematch against Jennings and Rutter next February 13, a Monday?
This automated Jeopardy! player was unable to answer the first “Daily Double” on 2/13/12.
What could Watson answer? Note that neither human opponent would have any trouble replying to make a fully correct question-answer pair.
What other lessons can we take home from Watson’s success? Will we soon have a “Watson” to help us attack problems, something more active than our current use of published literature?
Does Watson think? Does it matter whether it does?