Human lessons from a computer trying to think like a human

David Ferrucci is Department Group Manager in Semantic Analysis and Integration at the IBM Thomas J. Watson Research Center. He heads the DeepQA Project, which produced the automated Question-Answering system named Watson. It famously defeated Jeopardy! champions Ken Jennings and Brad Rutter on a special edition of the American TV game show last February, as we covered back then in our post, “Are Mathematicians in Jeopardy?”

Today we wish to ask what IBM’s solution to a real-world problem may imply for humans approaching research problems.

Dr. Ferrucci gave a plenary talk titled “Building Watson: An Overview of DeepQA for the Jeopardy! Challenge” (paper, video from TiEcon, 5/15/11) at the AAAI 2011 conference (joint with IAAI 2011), which I (Ken) attended for the first time. I presented my paper with Guy Haworth of the University of Reading (UK) on “Intrinsic Chess Ratings”; it was also mentioned here and here. I was glad to meet several people in related areas of research.

DeepQA is an offspring of the famous Deep Blue, the IBM chess program and multi-processing computer that vanquished world champion Garry Kasparov in 1997. Ferrucci hopefully labeled his project, “The Next Deep Blue,” since “IBM got a lot of value out of Deep Blue,” he pointed out, beyond beating Kasparov. As a manager at IBM no doubt this value is critical, and at the end of his talk he projected concrete applications of DeepQA. However, most of his talk was about the journey to the Jeopardy! TV contest, and that interests us now.

## Watson’s Problem

Ferrucci began with a question that a member of an online chess club such as ICC might pose:

“Do you want to play chess, or just chat?”

In chess, he said, all messages signifying moves have well-defined meaning from the rules of the game. In human language, however, words have no such intrinsic meaning—they gain meaning from human cognition aligned with context and human actions and intents. Chat is hard, and so is wit.

The initial 2007 version of Watson was fairly rule-based, as if playing chess, and tests showed it had only a fraction of the skill needed to compete with good Jeopardy! players. My word ‘fairly’ here is fairly difficult, perhaps unfairly difficult, for a rule-based system to pick up in its shade of meaning. Just sporting a balanced, unbiased, impartial approach might not be enough for a Jeopardy! category whose “hook” requires recognizing synonyms for ‘fair,’ such as the first five real words of this sentence. Thus the IBM team’s initial idea of THINK needed a dose of Think Different in order to think like us.

In computer science theory and mathematics we believe all of our problem-solving objectives are defined by rules. But as we have tried to say in earlier discussions, sometimes true progress needs one to break the rules, or perhaps better put, look for patterns and tricks in new contexts.

Watson may not be able to sing or dance, but it can probably give a good talk. Let’s see what steps enabled it to get onstage.

## Watson’s Steps

The IBM team did not attempt to build large databases of questions and answers, Jeopardy!-style or not. A sample of just 20,000 past questions revealed 2,500 distinct types, the most frequent type covering only about 3% of the whole. The team inferred that the full distribution of types had a long tail, with no clear framework for the domain, so that type-by-type rules could only hit a fraction of it.
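The coverage problem can be made concrete with a toy calculation. The sketch below assumes a synthetic Zipf-style (1/rank) frequency model standing in for the real distribution over the 2,500 types; the function name and numbers are illustrative only, not from the talk.

```python
# Toy illustration of the long-tail problem: even after special-casing
# the most frequent question types, much of the probability mass
# remains in the tail.  The Zipf (1/rank) frequencies are a synthetic
# stand-in for the real distribution over ~2,500 types.

def coverage_of_top_k(num_types, k):
    """Fraction of questions covered by handling only the k most
    frequent types, under a 1/rank frequency model."""
    weights = [1.0 / rank for rank in range(1, num_types + 1)]
    return sum(weights[:k]) / sum(weights)

print(round(coverage_of_top_k(2500, 25), 2))    # top 1% of types -> ~0.45
print(round(coverage_of_top_k(2500, 250), 2))   # top 10% of types -> ~0.73
```

Even very generous type coverage leaves a large fraction of questions untouched, which is the force of Ferrucci's point.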

Hence they used rules mainly to enumerate senses of words and basic relations: Is a ‘liquid’ a ‘fluid’? Is a ‘fluid’ a ‘liquid’? They used WordNet for much of this, but Ferrucci quipped that while WordNet by itself would be good for understanding questions on a physics test, it would not do for anything like Jeopardy! The problem had to be seen as much greater, just as research is different from taking an examination. Here are some of the steps that he outlined—all boldface is quasi-verbatim from his slides:

${\bullet}$ Identify and solve sub-questions. Use divide-and-conquer. As an example he gave:

When “60 Minutes” premiered, this man was US President.

The sub-questions are what is “60 Minutes,” when did it premiere, and who was President then?
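A minimal sketch of chaining the sub-answers, with tiny hypothetical fact tables standing in for Watson's knowledge sources (the names and structure here are invented for illustration; the real pipeline scored many candidate senses rather than doing single lookups):

```python
# Illustrative divide-and-conquer: answer the sub-questions in order,
# then chain the results.  The fact tables are hypothetical stand-ins
# for Watson's knowledge sources.

PREMIERE_YEAR = {"60 Minutes": 1968}          # what is it, and when did it premiere?
PRESIDENT_IN = {1968: "Lyndon B. Johnson"}    # who was President that year?

def answer_premiere_president(show):
    """Chain the sub-answers: premiere year, then President in that year."""
    year = PREMIERE_YEAR.get(show)
    if year is None:
        return None
    return PRESIDENT_IN.get(year)

print(answer_premiere_president("60 Minutes"))  # Lyndon B. Johnson
```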

In Watson’s case, sub-questions often arise from different senses of words, and have to be worked on in parallel. Hence the next step:

${\bullet}$ Try different decompositions of the problem. Use recursion as appropriate. He gave an example for a category titled “Edible Rhyme Time,”

A long tiresome speech delivered by a frothy pie topping.

The category expects two rhyming words or phrases as the answer, but does the first or second stand for the speech? Try both. Eventually “meringue harangue” is found and trumps whatever other trials return.
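One can sketch the "try both decompositions" idea in a few lines. The candidate word lists below are hypothetical, and rhyming is approximated by a shared suffix, which is much cruder than any real phonetic matcher:

```python
# Illustrative sketch of trying both decompositions of an
# "Edible Rhyme Time" clue: does the speech word or the topping word
# come first?  Generate both orderings and keep the rhyming pairs.

SPEECH_WORDS = ["harangue", "tirade", "lecture"]          # hypothetical candidates
TOPPING_WORDS = ["meringue", "whipped cream", "frosting"]

def rhymes(a, b, suffix_len=4):
    """Crude rhyme test: identical trailing letters."""
    return a != b and a[-suffix_len:] == b[-suffix_len:]

def edible_rhyme_answers():
    """Try both orderings and return every rhyming pairing."""
    pairs = []
    for speech in SPEECH_WORDS:
        for topping in TOPPING_WORDS:
            if rhymes(speech, topping):
                pairs.append(f"{topping} {speech}")   # topping modifies speech
                pairs.append(f"{speech} {topping}")   # the other decomposition
    return pairs

print(edible_rhyme_answers())
```

Both orderings survive the rhyme filter here; downstream evidence scoring is what would let “meringue harangue” trump the alternative.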

In research one would hope words and concepts are not so slippery. Even so, our thoughts on “changing the game” qualify as trying different ways of breaking down a problem, not just the idea of breaking down itself.

${\bullet}$ Find a missing link or common bond. Shirts, TV remotes, telephones, and I can add to his list, impressionable people—what do they have in common? Buttons. Perceiving this answer was acknowledged as a “bias toward humans”—I’ll add that a computer could come up with ‘pressed’ without realizing that ironing shirts is a different sense. But a list of possible ties is on the right track, even if the final answer is something else.
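A common-bond search can be sketched as intersecting sets of associated terms. The association lists below are hypothetical, and note how a crude word overlap also surfaces ‘pressed’ without noticing that ironing a shirt is a different sense:

```python
# Illustrative "common bond" step: intersect sets of terms associated
# with each item.  The association lists are invented for illustration.

ASSOCIATIONS = {
    "shirt": {"buttons", "collar", "pressed", "sleeve"},
    "TV remote": {"buttons", "batteries", "pressed"},
    "telephone": {"buttons", "dial", "pressed"},
}

def common_bonds(items):
    """Return terms associated with every item in the list."""
    sets = [ASSOCIATIONS[item] for item in items]
    return sorted(set.intersection(*sets))

print(common_bonds(["shirt", "TV remote", "telephone"]))
# ['buttons', 'pressed'] -- 'pressed' survives because word overlap
# cannot tell sense of pressing a button from pressing a shirt
```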

In research we can ask, what does this problem have in common with other problems whose solutions we know from the literature? There are also areas of mathematics designed to draw out common features, such as the representation theory of groups.

${\bullet}$ Combine deep and shallow approaches. In Watson’s case, relying on basic text search never delivered high self-estimated confidence in answers, and plateaued at about 30% accuracy, too low when 50% is needed just to break even on Jeopardy! A structured knowledge-base approach could deliver high confidence if the questions could be precisely mapped to existing and reliable senses, but this was rarely the case straight off.

Specifying large hand-crafted models didn’t cut it, Ferrucci said: they were “too slow, too narrow, too brittle, and too biased.” Combining the two approaches, learning to analyze information from scanning “as-is” knowledge sources, worked because computers can do something that single researchers cannot.
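One way to picture the combination, sketched below with invented scores and weights (this is not Watson's actual formula, just a minimal illustration of the idea): the structured score contributes only when the question maps cleanly onto the knowledge base, while the shallow text-search score is always available as a fallback.

```python
# Illustrative blend of a shallow (text-search) scorer with a deep
# (structured knowledge-base) scorer.  Weights and scores are invented.

def combined_confidence(shallow_score, deep_score, mapping_succeeded,
                        w_shallow=0.4, w_deep=0.6):
    """Weighted blend; fall back to the shallow scorer alone when the
    precise mapping into the knowledge base failed."""
    if not mapping_succeeded:
        return shallow_score            # shallow-only plateau
    return w_shallow * shallow_score + w_deep * deep_score

print(combined_confidence(0.3, 0.9, mapping_succeeded=True))   # ~0.66
print(combined_confidence(0.3, 0.9, mapping_succeeded=False))  # 0.3
```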

${\bullet}$ ‘Be Embarrassingly Parallel.’ According to Watson’s Wikipedia page, it clustered ${90 \times 32 = 2,880}$ POWER7 processor cores, each with four threads, that could crunch 500 gigabytes per second. Its 4 terabytes of storage, the equivalent of 200 million pages at 20K per page, was all in the box and held the whole set of Wikipedia pages, presumably including its own. It did not use the Internet. Its machine-learning component used about 200 features from answer scorers plus 400 derived features and applied about 100 different techniques to formulate, merge, and rank hypotheses and put confidence estimates on them, a process Ferrucci called “evidence diffusion.”
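The merge-and-rank step can be sketched with a logistic combination of per-candidate feature scores, a common technique for this kind of confidence estimation, though not necessarily the one Watson used; the feature values and weights below are invented for illustration:

```python
# Illustrative merge of many scorer features into a ranked candidate
# list with confidence estimates.  Features and weights are invented;
# Watson learned its weights over roughly 600 features.
import math

def confidence(features, weights, bias=0.0):
    """Logistic merge of one candidate's feature scores."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(candidates, weights):
    """Score each candidate's evidence vector and sort by confidence."""
    scored = [(confidence(f, weights), name) for name, f in candidates]
    return sorted(scored, reverse=True)

weights = [2.0, 1.5, 1.0]                     # hypothetical learned weights
candidates = [
    ("Chicago", [0.4, 0.3, 0.2]),             # hypothetical evidence scores
    ("Toronto", [0.5, 0.2, 0.1]),
    ("Omaha",   [0.3, 0.2, 0.2]),
]
for conf, name in rank_candidates(candidates, weights):
    print(f"{name}: {conf:.2f}")
```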

As we said, a single researcher cannot be expected to do this. This belongs to the brave new realm of crowd-sourcing research. But we can stay flexible and avoid being wedded to preconceptions on a problem.

## Tonto on Toronto

The rest of the talk described the system’s growing pains, as tuning made nonsensical answers rarer. Of course this did not prevent a famous bad answer on the first “Final Jeopardy” question:

US CITIES: Its largest airport is named for a World War II hero; its second largest, for a World War II battle.

Watson generated only 14% self-confidence for its answer “What is Toronto” and hence appended five question marks to it. The correct answer Chicago (O’Hare and Midway) was second at 11%, with Omaha at 10%.

Ferrucci reviewed the explanations for the error, including many cases where the category title specifies a set but the answer is not a member of that set. Amazingly, even when this is made explicit by re-wording the question as “This US city’s largest airport is named…,” Toronto still comes a close second at 28% confidence to Chicago at 32%. The second clause follows a semicolon but lacks a verb, making it hard to parse; there are several US cities named Toronto; and Toronto is associated with the American League. Toronto’s Lester Pearson even served in both World Wars.

We all have our “Toronto” moments. One nice aspect of Watson’s name is that besides IBM founder Thomas J. Watson, it calls to mind Sherlock Holmes’ genial and underestimated sidekick, whose store of medical and practical knowledge and interaction skills often helps solve the mysteries. However, the name of the Lone Ranger’s sidekick, Tonto, means ‘silly’ in Spanish. Well, one way to avoid being silly in research is to have a partner to begin with, or at least someone in whom to confide and with whom to go over results closely. John Horton Conway played this role for Andrew Wiles in the spring of 1993, though many of us have no such help. Perhaps that is the role a Watson could play in the future—not to do mathematics, but to help us avoid silly mistakes.

## A Question for Watson

What if the first “Daily Double” in, say, a “Famous Computers” category had this question, and was chosen by Watson in a rematch against Jennings and Rutter next February 13, a Monday?

This automated Jeopardy! player was unable to answer the first “Daily Double” on 2/13/12.

What could Watson answer? Note that neither human opponent would have any trouble replying to make a fully correct question-answer pair.

## Open Problems

What other lessons can we take home from Watson’s success? Will we soon have a “Watson” to help us attack problems, something more active than our current use of published literature?

Does Watson think? Does it matter whether it does?

1. August 15, 2011 2:43 am

Whether we will soon have a Watson to answer mathematical questions for us I don’t know, but one remark that’s worth making is that we don’t need it to be too realistically human (just as for many purposes we don’t need Google to be all that intelligent). And it would be well within the reach of current technology to devise a database of mathematical knowledge based on a search of key concepts rather than key words. (I don’t have the space to give that bald statement the justification it deserves, but I believe it.)

Another way of looking at things is that we already have an amazing mathematical Watson in the form of Mathoverflow. Since it will be hard to beat the accumulated wisdom of hundreds of active research mathematicians, my hunch is that Mathoverflow will delay the appearance of a more automatic question answerer by several years: there’s just too much it has to be able to do in order to compete.

2. August 15, 2011 6:18 am

As a followup to gowers’ (intensely thought-provoking) post, we can conceive of a Watson-style mathematical savant whose sole expertise consisted of (1) crafting Google queries that searched Mathoverflow, followed by (2) rough-and-ready heuristics for stitching together answers by excerpting the search results.

We have all graded tests and homework assignments which displayed this too-simple search-and-quote level of mathematical cognition … which is sufficient to pass many tests of aptitude and certification … yet insufficient to solve novel problems, create new mathematics, write all but the dullest textbooks, or cope with the case-by-case challenges of teaching.

Thus it seems (to me) that Watson’s present instantiation broadly lacks a notion of narrative and narrowly lacks a notion of mathematical naturality (the latter being a specialized notion of the former).

Who knows? Perhaps IBM is already working on story-telling successors to Watson, whose capabilities encompass good taste in mathematics (or science, or systems engineering) as a subset of story-telling capability.

After all, although Mr. Sherlock Holmes possessed the superior deductive capability, it was his friend Dr. John Watson who possessed the superior story-telling capability … the latter capability being considerably the rarer and more sophisticated of the two. Ars est celare artem (“it is art to conceal art”, Ovid) or better cogitatio est celare cogitationis or best narratio est celare narrationis. 🙂

3. August 16, 2011 2:44 am

Regarding the intrinsic chess ratings work – excellent! I’ve been waiting for someone to pick up where GB2006 left off… I’ve been shocked it didn’t have a bigger impact so far. Question: How did you control Rybka3? Some sort of UCI harness?

4. August 16, 2011 5:01 am

Ken, I too enjoyed your chess ratings work … it seems to me that your finding (that there has been no rating inflation in chess) is startling, controversial, and well-supported. Terrific! 🙂

One wonders, is this true in mathematics too? Do we nowadays have many mathematicians comparable in skill and creativity to the mathematical giants of the past?

For example, Alfréd Rényi was one of many colleagues to say of John von Neumann:

“Other mathematicians prove what they can, von Neumann what he wants.”

In consequence of greater knowledge, larger literature, better training tools, and (most of all) more mathematicians, do we nowadays have many mathematicians who can “prove what they want” in the style of von Neumann? Just as nowadays we have many chess players of the caliber of Fischer and Karpov?

If so, what theorems are the modern master-mathematicians proving? In service of what goals and enterprises? If no, why not?

August 17, 2011 12:45 pm

May I take this opportunity to remind you that today, especially in the United States, we do have undergraduate research programs, and sufficiently many Einstein wannabes emerge from them.

July 7, 2012 6:37 pm

I sat in on an Undergraduate Lecture yesterday.
It was the second course in Linear Algebra – Spaces/subspaces…
The Lecturer was clear but he never went beyond a systematic algebraic approach
in his explanations and the lecture mainly consisted of Question/note/proof over and
over again.

To me, at least, it would have been clearer if he had drawn some succinct diagrams/pictures
before and during and after the lecture to show where he hoped to go, where he was,
and where he ended up.

Perhaps we spend too much time drilling “accepted proofs” into the student and not
enough time showing how earlier mathematicians wrangled with the problems.
Perhaps there should be more time in the classroom for that wrestling with problems, rather than just going through one proof/result after another, assigning problems that most undergraduates do poorly on in their somewhat futile attempts to solve, and then giving quizzes and exams whose results must be “curved” to make sure most of the students pass the course.

As for Chess and Chess rankings – certainly having a computer to play against and to consult will make you a better high-level player. What would Tal/Bronstein/Fischer have done with such tools?
I think we should move to a ten by ten board with two extra pawns and new pieces that could be super bishops or rooks – I think they are known as Chancellors in the 10 by 10 games that I have seen played. More variety, more of a challenge, less of a chance that everything is memorized until the 20th move.

August 16, 2011 8:16 am

Thanks to a pointer from John Sidles, I was surprised to learn that John von Neumann, among his other talents, was a first class economist.

Hopefully, some of the new generation of mathematicians will team up with the economists and the computer modelers to develop a game plan for getting the economy back on track.
