Skip to content

Big Problems With Big Iron

August 26, 2011


Problems to be solved and problems with the current approach to solving them

John Melchi is the head of the administration directorate of the National Center for Supercomputing Applications (NCSA), and is right in the middle of a current controversy.

Today I want to point out a role of theory in supercomputing, and how we could potentially help in new ways.

It is all about the Blue Waters Project. This was the name of a petascale supercomputer to be deployed at the NCSA, which is an arm of the University of Illinois at Urbana-Champaign. Back on August 8, 2007 the National Science Board approved a resolution that authorized the National Science Foundation to fund

“the acquisition and deployment of the world’s most powerful leadership-class supercomputer.”

The NSF awarded $208 million over the four and a half years of the Blue Waters project.

Just a few days ago IBM pulled out of the project, returning a large sum of money. According to the Associated Press:

CHAMPAIGN, Ill.—IBM Corp. is leaving a project at the University of Illinois to build the world’s fastest supercomputer.

The university’s National Center for Supercomputing Applications and IBM said in a brief statement Monday that IBM terminated its contract. IBM said the computer had grown more expensive and technically demanding than anticipated.

The NCSA said it still plans to pursue a petascale computer in what it called “timely manner” and is working with the National Science Foundation. Petascale refers to the speed of the computer and means it could perform a thousand trillion mathematical operations a second.

IBM was chosen in 2007 to build Blue Waters. The computer was initially due to go online this year.

The Controversy

As you might imagine, many in the supercomputer area—or as they say the high performance computing (HPC) area—are not happy with the current situation. Maybe no one is happy, including Melchi at NCSA. In 2007 there was a competition and NSF chose among several very strong proposals, one of which involved Georgia Tech. I had nothing directly to do with the proposal, but needed to add that for full disclosure. Of course the teams that lost the competition are upset, since they believe that they could have succeeded if they had been given the chance—and the money.

The fact that the Blue Water project has collapsed with the removal of IBM is a serious issue. My friends who work in HPC are quite interested in what happened, what went wrong, and on and on. They have many questions, and no answers.

HPC

I hope there is a serious and open discussion about what did happen with Blue Waters, but that is not what I think we need to talk about. Rather I think there is an even more foundational issue involved. The rationale behind projects like Blue Waters is:

  1. Science and engineering face problems that cannot be solved on todays computers systems, because of their immense requirements for time or space or both.
  2. The only way to solve them in the future is to build faster and faster supercomputers—bigger and bigger iron.
  3. Thus, NSF and other agencies must spent 100’s of millions of dollars, perhaps soon billions of dollars, to create such large systems.

I agree with the first statement completely: more generally there are societal computational problems that are beyond our abilities. These include problems of all kinds, from pure science, to very applied problems. We all would greatly benefit if they were solvable.

What I disagree with is the assumption that the above is the only approach to solving these problems, which are often called grand challenges. This is more especially true today than years ago, because of the inability to continue to make faster and faster uni-processors. There was a time when each generation of computers was faster than the previous generation—this was mostly a consequence of advances in device physics, essentially advances that followed Moore’s Law.

Today because of power limits and other technical issues that I am not an expert on, processors and therefore supercomputers have had to become extremely parallel machines. Extremely. This many-core approach is exciting, but does not give the automatic improvement that one used to get with just increased clock rates. Last year Noam Nisan wrote on related issues where he quoted from the Report to the President and Congress: Designing a Digital Future: Federally Funded R&D in Networking and IT. The central point was that there are two ways to speed up and solve a problem: faster iron and better algorithms. Speedup is always of the product of two terms:

\displaystyle  \text{ speed up due to faster iron } \times \text{ speedup due to better algorithms}.

In the cited report Martin Grötschel notes:

{\dots} that a benchmark production planning model solved using linear programming would have taken 82 years to solve in 1988, using the computers and the linear programming algorithms of the day. Fifteen years later—in 2003—this same model could be solved in roughly 1 minute, an improvement by a factor of roughly 43 million. Of this, a factor of roughly 1,000 was due to increased processor speed, whereas a factor of roughly 43,000 was due to improvements in algorithms!

Note, the 1,000 fold increase in processor speed was mostly due to increased clock rates, not the use of massive parallelism.

The Approaches

Granted that society has problems—computational problems—that require solution, there are several approaches:

{\bullet } Bigger Iron This is the perceived approach that is taken by HPC researchers. Projects like Bule Waters attempt to build larger and larger systems that can execute more and more instructions per second. Petascale was the Blue Water goal, exascale is next, then zettacsale, and then yottascale. Of course the bigger iron is not really just faster; the new iron is more and more parallel, and therefore, more and more difficult to program.

{\bullet } Bigger Iron and Better Algorithms This is the actual approach that seems to happen. Both the iron gets faster and the algorithms get better. The claim by Grötschel is that more of the improvement of the speed of solution has often come from the algorithmic breakthroughs, but somehow this fact is not as exciting as thinking about big iron. Big projects that build large iron today commission whole buildings, with huge needed sources of power and cooling, and are much more visible than any algorithm. Oh well.

{\bullet } Break The Rules This is the approach that I would like to elaborate on in the next section. It is one that we are equipped to do, but not one that NSF funds to even a fraction of the level that large iron gets. Perhaps with the Blue Water situation we should seriously consider spending huge sums on such projects. They will be high risk with huge payoff, but perhaps the time has come where they risk reward has shifted to make this a viable addition to the HPC investment portfolio.

Break The Rules

My suggestion is more radical, and is based on the fact that I am an “ optimist.” Recall that the definition of an optimist is:

An optimist is a guy that has never had much experience.

Don Marquis 1927

Since I do not work in HPC I can be an optimist and suggest that theory can play a huge role, indeed an even bigger role than the development of improved algorithms. What I think is possible is that we can approach the problems that need to be solved with new and completely fresh ideas. That is not to attempt to improve the performance of the existing algorithms, but to go straight to the original core problems and approach them in new ways.

I do not think this has ever been done before on a large scale, and would have high risk. But building larger iron, the current focus of NSF, is clearly not without risk.

I have previously discussed this at length here. Take a look for more details on what I mean. One concrete example is the recent work on RNA structure via game playing—see this for more details.

Open Problems

Can theory play a bigger role in solving society’s problems, not merely by speeding up existing algorithms, but fundamentally by re-working approaches to the original core problems?

About these ads
29 Comments leave one →
  1. August 26, 2011 2:32 pm

    An elephant in this room: quantum computing

    My Favorite Feynman Quote

  2. August 26, 2011 2:39 pm

    A well-written survey in a similar computational theme (that cites Martin Grötschel’s work among others) is Robert Bixby’s “Solving Real-World Linear Programs: A Decade and More of Progress” (Operations Research, 2002). If anyone cares to post more references, these would be very welcome (to me & IMHO to many folks).

    Particularly interesting (from a systems engineering point-of-view) are references that support or refute the broad thesis, that assessed by a variety of reasonable metrics, the capabilities of computational hardware, algorithms, sensors, storage devices, and communication channels, all have improved at an exponential pace, with doubling times varying in the range 1-4 years, that has been sustained since 1945 or so.

    It would be cool too to have estimates of doubling-times for global STEM employment. Is that doubling time … perhaps seven years? Fifteen years? Obviously this is a key career parameter for young people. Other key questions include, is this STEM doubling-pace sustainable? If one capability falters, must all falter? Will IBM’s Blue Water be missed, or can/will other capabilities take-up the slack? What enterprises might be enabled by sustained doubling?

    It’s another wonderfully thought-provoking post by Gödel’s Lost Letter. For which, thanks and appreciation are extended.

  3. anon permalink
    August 26, 2011 5:58 pm

    Fantastic article, and I’m a whole-hearted believer in breaking every rule in the book. But what would theorists use all that money for if they had it? You can only find so many talented post-docs and grad students, and throwing money at untalented ones won’t make them talented. And there’s only so much paper and pencil lead you can fit in your office.

  4. GASARCH permalink
    August 26, 2011 9:37 pm

    A Numerical Analyst once told me that he would rather use year-2000 algorithms
    on year-1950 computers then year-1950 algorithms on a year-2000 computer.

    Theory DOES need to get involved with HPC. We need to work on a model of
    computation that is realistic AND we can prove things about. Getting the model
    right may be the hard part.

  5. August 27, 2011 7:46 am

    One issue (perhaps small) with the “break the rules” approach is that algorithms are somethings that can be easilly used by others. I mean a situation when, say, the penthagon builds a supercomputer to test nuclear weapon designs. It could be that as soon as they tell the world what algorithms they would like to have, it tells much about the actual designs.

    Furthermore, if someone comes up with the algorithm, anyone else can use it, including those countries who woudn’t be able to build a supercomputer strong enough to run older algorithms.

  6. August 27, 2011 5:06 pm

    I recently wondered about a similar question after reading about yet another complete shambles in the UK concerning a large-scale computer system costing billions that had failed. The aim was that the health data of everyone in the country would be on a single system so that if you went into any surgery anywhere they would have all the information they needed. I think one of the problems is to merge previously incompatible systems.

    Variants of this story occur with almost metronomic regularity. It makes me wonder whether the basic problem can be distilled and abstracted until it becomes amenable to mathematical study. If so, then someone has the potential to save their country billions of pounds/dollars and perhaps make a bit for themselves. Does anyone have a clear idea why it is so hard to introduce big new IT systems?

    • Serge Ganachaud permalink
      August 27, 2011 7:21 pm

      There are at least three kinds of problems :

      1) Having the professional of some trade tell you what it consists of precisely. They know it intuitively but they don’t know how to teach it to a computer scientist.

      2) Users of the new system will have to type into it everyday some information that they weren’t used to processing when they used the old system. This requires training the existing employees and sometimes hiring some more. In fact the whole working organization has to be completely rethought.

      3) It’s also a tough job to write the programs that will convert the existing data from the old systems and store it into the new system.

      As you might guess, all this demands spending a bit more money than is commonly believed…

      • max permalink
        August 28, 2011 8:27 pm

        The problem is maybe not a mathematical one, but it is a management one. Managing a large scale project requires extraordinary leadership ability, Think of Oppenheimer, probably not the greatest scientist of his generation, but he did well with the A-bomb project.

        Bureaucratic and technical constraints can make it difficult to achieve the outcome in the time and budget required. You have to get committee approvals and buy-in on many things. Every contractor will charge a markup for making adjustments. The people who do the work may not be calling the shots, and the people who get paid the big bucks may be unaware of the technical difficulties. The combination of a good leadership ability with technical knowledge is so valuable because it is rare.

      • August 29, 2011 7:25 am

        Regarding Robert Oppenheimer, in leading the Manhattan Project he consistently focused upon one overriding goal (perhaps not a purely noble one): finish and use the atomic bomb before the war ended. Hence Oppenheimer’s letter to his commanding general Leslie Groves:

        It is true that there are a few people here whose interests are exclusively “scientific” in the sense that they will abandon any problem that appears to be soluble. I believe that these men are now in appropriate positions in the organization. For the most part the men actually responsible for the prosecution of the work have proven records of carrying developments through the scientific and into the engineering stage. For the most part these men regard their work here not as a scientific adventure, but as a responsible mission which will have failed if it is let drop at the laboratory phase. I therefore do not expect to have to take heroic measures to insure something which I know to be the common desire of the overwhelming majority of our personnel.

        The laboratory is operating under a directive to produce weapons; this directive has been and will be rigorously adhered to.

        AFAICT, the Blue Water collaboration never articulated a comparably time-urgent and technically specific goal, nor submitted itself to a comparable level of enterprise-level discipline.

        And yet, the challenges of our 21st century are (if anything) even more serious and urgent than the challenges Oppenheimer and Groves faced. An overheated resource-poor planet planet with 10^10 people living on it, is in a pretty sobering situation. So perhaps our methods of managing enterprises like Blue Water do need rethinking.

        Michael Nielsen is one person who thinks seriously about these issues, and he has a book coming out later this year (Reinventing Discovery). Without my having the least idea what’s in Michael’s book, IMHO he might be a good guest-poster for Gödel’s Lost Letter.

    • August 27, 2011 9:23 pm

      Variations of this story occur “with metronomic regularity” (in Gowers’ admirable phrase) in every branch of the STEM enterprise. Off the top of my head, here are some technological Great Goals that have faltered substantially (relative to early expectations): nuclear-powered aircraft, safe fission power, practical fusion power, economical carbon-neutral energy by any means, effective vaccines for malaria/ tuberculosis/ HIV, the War on Cancer, rule-based AI, AI by any means, solar-system human space travel … PvsNP ( ouch ) … quantum computers ( ouch! ) … atomic-resolution 3D biomicroscopy ( ouch! ) … truly regenerative medicine ( ouch! ) …

      An over-arching rationale for pursuing petaflop-scale computation is the hope it provides (IMHO) of speeding the pace and retiring the risks of progress toward all of these Great Goals.

      Uhhh … except maybe nuclear-powered aircraft? … `cuz some technologies just aren’t in the cards:

      In the last meeting in which I participated [as Presidential Science Advisor James Killian] in which the nuclear-powered airplane was discussed with President Eisenhower, he commented that the next thing he knew, someone would propose to take the liner Queen Elizabeth and put wings a mile wide on it and install enough power to make it fly. Dr. York begged him not to let the idea get around or someone would want to try!

      Hmmm … so perhaps Blue Waters faltered because it wasn’t concretely STEM enterprise-centric? And yet, which STEM enterprises are most promising to focus upon concretely? After all, not all great enterprises succeed … and yet few great enterprises ever are given up entirely.

      So, should should the successor to Blue Waters be concretely enterprise(s)-centric? If so, which enterprises?

    • Arnab permalink
      August 28, 2011 9:43 am

      Replacing a sequential looking algorithm by one that can take full advantage of inherent parallelism allowed by the underlying problem isn’t always a simple task. Even after that, how that top-level algorithm is handled at the machine level is yet another story – a part usually not written by theoreticians, but by hardware-architecture-compiler type of people – and the combination doesn’t always resonate. Plus, in spite of the fact that there has been so much thrust on parallel programming over many many years, the common programming languages aren’t in particular very supportive of a parallel thinking process.

      I believe what needs to happen is ‘another variant’ of “theory does need to get involved”, viz., people writing compilers or designing machine hardware, as well as the experts of the application domains, should expand themselves to understand ‘somewhat more’ of the theoretical aspects of complexity.

  7. Giorgio Camerani permalink
    August 28, 2011 5:14 am

    Great article. I second the necessity of breaking the rules sometimes, and I love the core idea behind the sentence “That is not to attempt to improve the performance of the existing algorithms, but to go straight to the original core problems and approach them in new ways”. In certain cases the only way to make progress is by changing mind and performing a paradigm shift.

  8. August 28, 2011 7:29 am

    The key to progress of supercomputing is parallel programming and parallel algorithms. But non trivial parallel algorithms are much more difficult to develop than sequential ones. Not for all algorithms an effective parallel modification is possible. Perhaps instruction-level parallelism is the best way for future progress. Today we have good hardware but we have not good software tools and programming languages for parallel programming…

  9. anonymous permalink
    August 28, 2011 12:09 pm

    An advantage of big iron is that if you build a computer that is 1000x faster, then you get an immediate speedup of 1000 for any problem, using existing software. (OK, this ignores a few issues like code portability and the fact that speedup is never completely uniform for all algorithms, but you get the point.) On the other hand, theoretical approaches, for the most part, will just attack one specialized problem at a time, and then require the production of new code.

  10. August 28, 2011 7:03 pm

    “1000x faster, then you get an immediate speedup of 1000 for any problem, using existing software” — agreed! :-) But unfortunately this is not possible :-(

  11. August 29, 2011 3:12 am

    There is an ancient question. Suppose there are two cities. One with a good men, and another with a bad men. Which one should be destroyed? No, we keep inhabitant alive, just destroying the buildings, and infrastructure.

    The city of good should be destroyed! Because, if we destroy the city of good all good men will spread around the world, spreading the goodness, and keeping evil forces in one place.

    About 10-15% percent of people claiming the solution of P vs NP on GJ Woeginger page are concentrated in this blog :). Dick, I’m loving you. You have nobles cafe, with one-way glass to the complexitiests zoo, with “monkeys” jumping for one-million banana, grown on high tree of concepts/notations. I do not have to be politically correct here, I’m one of these monkeys. The rest of the community should adore it, seriously. Dick is saving your scarce resources. Prof. Woeginger just need to add on his page, that all useful discussions related to the topic are happening here. Dick is smart enough to calm down young, ready to fight soul.

    • Carthage permalink
      September 2, 2011 4:29 pm

      mkatkov, thank you so much for your post; I could not contain my laughter. The idea that this blog is a honeypot for the circle-squarers and angle-trisectors of the algorithmic complexity world is a bewitching idea. I fear that I too am one of those monkeys.

  12. Anonymous permalink
    August 29, 2011 10:19 am

    About breaking the rules: if it results in programs that are demonstrably better-performing than existing ones, then that’s one thing. However, I’m not sure how receptive people will be to “fresh ideas” without them (i.e. an idea of an idea for a program).

    People are generally not receptive to groundbreaking ideas; they mostly prefer “new” ideas that “plug in” to existing frameworks. See for example the following Cornell News article:

    http://www.news.cornell.edu/stories/Aug11/ILRCreativityBias.html

  13. August 29, 2011 11:17 am

    Perhaps power was a major consideration too in IBM’s drop-out decision. According to Wikipedia’s page on IBM’s Blue Gene series, the air-cooled Q-model Blue Gene models, due to ship next year, achieves 1.8 MFlop/Watt. By comparison, the water-cooled Blue Water specs called for 15 MWatts of delivered power … sufficient for 27 petaflops of air-cooled Blue Gene/Q computation.

    Unless I’ve slipped a decimal place, the yearly power bill for Blue Water would have been $10-20 million … sufficient to make power costs for water-cooled Blue Water technology economically non-viable relative to air-cooled Blue Gene/Q technology.

    So it might be a pure business decision on the part of IBM … that the future of petaflop computation is air-cooled silicon, not water-cooled silicon.

    Perhaps some person more knowledgeable than me can comment?

    By the way, these power costs show concretely that improvements in algorithmic efficiency are directly associated to decreased enterprise expenses.

    Which costs more … a STEM professional … or the electric power to keep his/her computer warm?

    Perhaps soon this question will not seem funny. :)

    • August 29, 2011 12:20 pm

      Further investigation finds the Green 500 List of energy-efficient supercomputers:

      “Not just one, but two IBM Blue Gene/Q prototypes top this edition of the Green500. However, the Blue Gene/Q prototype at #1 on this list is different from the one that topped the last edition of the list in that it delivers significantly better performance with the same number of cores and only a marginal increase in power consumption. The end result is a MFLOPS/W rating of 2,097, the first supercomputer to surpass 2,000 MFLOPS/W.”

      Research at companies like IBM is always partly about “lustre” … and by topping the Green 500 List, Blue Gene/Q delivers that lustre.

      If you were a shareholder-protecting CEO, which list would *you* rather top: world’s fastest computer, or world’s greenest computer? Hmmmm … or maybe (eventually) both lists … starting with greenest? That dual-dominance strategy would make plenty of business sense *and* scientific sense.

      • August 30, 2011 8:02 am

        Now Microsoft has replied with an ingenious idea that provokes both hilarity and sober reflection: The Data Furnace: Heating Up with Cloud Computing.

        Let’s see, allocating a GJ of yearly heating energy per-person, and assuming Blue Gene/Q computational efficiency of 2 GFLOP/W, that’s enough energy to supply every individual citizen with 23 days of petaflop super-computing.

        That’s a substantial amount of computing. What might ordinary citizens do with it?

      • September 8, 2011 5:22 pm

        As one final comment, Google just released their mean power consumption: 260 MW. Which at 2 GFLOP/W (Google’s achieved computational efficiency no doubt is less, but perhaps not an order of magnitude less) works out to five hundred petaflops of continuous computational horsepower … and an annualized power bill of order three hundred million dollars.

        So indeed, perhaps IBM’s bailout was over differences related to the relative importance of FLOP/Watt versus FLOP/second.

        Since bits are enormously cheaper to transport than Watts, perhaps this is good news for remote areas of the world that (potentially) have relative power surpluses, for example Afghanistan (hydroelectric) and Saharan Africa (solar).

  14. Jim Blair permalink
    August 30, 2011 8:19 am

    Speaking of breaking the rules, it is amusing to play Metaphysical Detective CluesSo and try to figure out who broke what rule, when?

    Consider the following rendition of the first three definitions of Euclid “Elements”:

    1. A point is that which has no part.
    2. A line is breadthless length.
    3. The extremities of a line are points.

    Did Euclid contradict himself? Which part of no part has length?

    A point which is part of a line is not the same as a discrete point and an endpoint is somehow different from a point on a line which is not an endpoint.

    The point being: Points are not as well behaved as roses. “A rose is a rose is a rose.”

  15. Jim Blair permalink
    August 30, 2011 9:17 am

    Postscrpt:

    Open Problem:

    “Can theory play a bigger role in solving society’s problems, not merely by speeding up existing algorithms, but fundamentally by re-working approaches to the original core problems?”

    Classical Metaphysics:

    “Rethink everything. Look for hidden (or missed) order.”

  16. A.G.McDowell permalink
    August 31, 2011 2:02 am

    As suggested by your article on galactic algorithms, it may be the case that better algorithms are only better after a given cross-over point, whether in time or in memory. It is therefore possible that better algorithms may require better computers to make them pay off, and even that better computer facilities may prompt the invention of better algorithms.

  17. August 31, 2011 11:03 am

    Some rules definitely need to be broken. Many conventional approaches seem to fundamentally assume that bigger iron will become available. For instance, linear programming is a great methodology for expressing combinatorial optimization problems, if you believe a faster CPU and more memory are just around the corner.

    However, LP is P-complete. It therefore seems unlikely that it is possible to ever massively parallelize LP as a methodology (unless one believes P=NC, or some practical version thereof). Therefore it is probably worth looking at other kinds of building blocks: expressing optimization problems using other formalisms may be able to avoid the fundamental bottleneck.

    Horn-SAT, alas, is also P-complete, so turning everything into even nicely structured SAT (or more generally, SMT) does not necessarily improve things. But at least SMT solvers allow the solution space to be searched in parallel, while sharing some learned clauses, so often yield quick solutions.

    So perhaps instead of the extensive work on how to solve decision problems by means of classical techniques like LP, perhaps funding agencies should be encouraging the relatively small community trying to go the other way?

    • August 31, 2011 1:15 pm

      András, I have long admired both your questions and your answers on TCS Stack Exchange and on MathOverflow. So in keeping with Dick’s question, if you were gifted with a week’s computing time on a 10 Petaflops machine (that is, 6×10^21 floating-point and/or logical operations), and gifted also with a hard-working (but unimaginative) TCS house-elf to do the RTFM-level coding and debugging, then (1) what questions would you seek to answer, and (2) by what strategy(s) would you seek to answer them? (needless to say, everyone else is invited to answer too).

Trackbacks

  1. Big Problems With Big Iron « Gödel's Lost Letter and P=NP | Solve Math & Science Problems - Solveable.com
  2. On making a Theory of Computation course more lively « Keet blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,889 other followers