An announcement of a DIMACS conference on parallel computation

Gordon Bell is of course not a theorist, but is one of the great computer designers of all time. He helped design machines that were used by millions—especially the early PDP family and the VAX family while at Digital Equipment Corporation (DEC). His 1971 book “Computer Structures: Readings and Examples,” with Allen Newell was an early classic book on computer architecture. When I was a student it was “the” book we all studied.

Today I would like to help announce an event that will happen next year—an event that is related to Gordon’s work. I usually do not announce events, but there is always room for exceptions, so here goes.

Gordon likes to get straight to the point, he is quite strong in his opinions, and he is usually right. He is famous for listening to talks that predict some technology X will soon change the world, then interrupting the speaker with:

“I am willing to bet ${\dots}$ dollars that X will not be in use in five years.”

He has won almost all of these bets, lost one or two, and had a few people fail to pay up after they lost. Of course, behind the personal bet there was often millions of dollars of research, venture capital, and work effort essentially betting which direction the industry would take. This post is about such an issue: will raw speed give way to multi-core parallel processing, and will that ramp up the usage of fault-tolerance technology?

I recall attending a meeting called Nextgens Technologies in December 2006. It is a yearly meeting held by the TTI/Vanguard group on various new technologies, as you might guess from its title. One of the invited speakers was an expert on processor design from UCLA, Eli Yablonovitch. He was giving a very technical talk—talks at this meeting range from very technical to very general—on why extremely fast processors were impossible to make. Essentially he was explaining the physics behind the collapse of Moore’s Law and the rise of many-core systems. The short answer is power: if chips are clocked much faster than today’s rates, they will burn up. No amount of cooling could stop them from melting.

Gordon did not challenge Yablonovitch with a bet, nor did Gordon challenge his arguments. But he did say,

I do not want parallel computers. I want faster uni-processors.

Yablonovitch listened, nodded to Gordon, and answered you may want uni-processors that are very fast, but many-core is what you are going to get.

For just a small example of what Gordon meant, suppose you are developing a chess program, where speed is utmost. Which would you rather have, two cores or a twice-faster uniprocessor? The world champion Rybka chess program advertises honestly that it gets only 71% of the benefit of doubling the number of cores. The other 29% is the overhead for chess search being hard to parallelize, for communication between cores, and for system management. The predictions are that this will only get much worse as the number of cores increase from two to thousands.

Let’s turn to the announcement that is all about the rise of many-core.

Many-Core: The Future

Phil Gibbons, Howard Karloff, and Sergei Vassilvitskii are organizing a workshop entitled Parallelism: A 2020 Vision, to be held March 14-16 at DIMACS. Howard says the goal of the workshop is to bring together both users and researchers to discuss:

• how parallel computing in its various forms is used today;
• what new uses and programming abstractions will arise by 2020;
• what parallel computers will look like in 2020; and
• how to model parallelism theoretically.

I think the workshop looks interesting, and have no doubt that it will be fun to attend. If you are interested get in touch with DIMACS or Howard: as usual all are welcome.

I would like to make one prediction about the future of computers. The rise of many-core and the construction of systems with huge numbers of chips will lead to the following: Fault tolerant computation will become more important both in practice and in theory. This is a safe prediction and a dangerous one. It is safe because 2020 is a long way from now. I could predict we will all be driving space cars like on the Jetsons and be safe, since who will remember what I predict today?

It is dangerous because fault tolerant computing has been around for a long time, has been predicted as “just around the corner” many times, and yet has largely been unused in most systems.

In the early days of computers, when they used vacuum tubes which were very unreliable, John von Neumann wrote in 1952 a famous paper on fault tolerance methods. It was titled: “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components.” This work is justly famous, and fault tolerant computing methods are in vogue again. However, one of the reasons that von Neumann’s work did not have much practical impact is that quickly the technology changed: tubes became transistors, which became micro-chips. Each of these changes yielded huge increases in reliability. Perhaps that will happen again, or perhaps it will not and fault tolerant methods will be needed in the future.

Many-Core: The Future?

I wonder if many-core is the final answer? I do not doubt that we will see more and more processors on a single chip. But I do wonder if predictions of the type made by Yablonovitch are absolute. I have talked before on how hard it is to prove that something is impossible. This is true in mathematics and in complexity theory, but is even harder in physical systems. What about some breakthrough that allows chips to run cooler? Or allows them to use totally different methods to implement gates? I wonder.

Open Problems

Perhaps one of the topics in the planned workshop should be on how to make faster uni-processors? I hope that you attend the workshop if you are interested in this area. Maybe we can do a post on the workshop in the spring.

1. November 22, 2010 12:28 am

Open Problems:
Quantum Computers!
Error correction is important in quantum computing too.
And quantum computers achieve a type of parallelism.

P.S.- I see QCs as good competition to supercomputers built with zillions of multicores. (For anyone interested, I wrote a blog post comparing QCs and exaflop classical computers)

• November 22, 2010 1:34 pm

Your blog post is neat. I thought of suggesting to add mention of quantum computers—with the point that fault-tolerance is simply necessary to deal with decoherence—but since they’re not on the “by 2020” horizon I thought the post was snappier without them.

November 22, 2010 2:31 am

“What about some breakthrough that allows chips to run cooler? Or allows them to use totally different methods to implement gates?”

The thing is, any such breakthrough that speeds up single-core processors, will speed up many-cores by the same amount!

November 22, 2010 4:11 am

well, there is definitely a physical upper limit on clock speed. so each instruction needs to do more if you want uni-processors to go a lot faster.

we’ve already had numerous parallelizations that have taken place inside the chip — instruction scheduling is one of these. i think on on old ultrasparc you could schedule a load, a store, a float operation and an int operation all at once. and then you just had to wait for the first type to finish to think about what you wanted to do next. even though it was called pipelining on other chips, it’s really just a kind of parallelization. it’s hard to do well, by hand or otherwise, but it can make things go way faster.

the thing is, people writing code in high-level languages weren’t often so worried about this that they tried to write easily pipelineable code, right? so this massive advantage could only be taken advantage of by those sick few writing assembly, and anyone for whom their compiler could take advantage for them.

i think that the same effectively goes for multi-cores. if you don’t think in terms of parallellizable data structures or algorithms, it’s going to be hard to accidentally let your compiler get massive advantage from this kind of parallelization.

i do think that error-correcting codes should be used inside hardware and for hardware. there’s some intersection point between the cost gain from making less reliable hardware and the size increase from not having to worry about that because of the error-correcting code that you’re using. they’re fast enough and good enough that they will end up ubiquitously embedded into any process wherever and whenever they save companies money. mostly there’s no reason for the end-user to have to worry about this, so it can act as a transparent kind of speedup.

s.

November 25, 2010 4:47 pm

Your compiler cares about pipelining, and the cores are REALLY good at finding work to do, it’s relatively easy to find little bits of parallelism in most programs, you almost never have to think about it, it’s memory latency you have to think about.

Unless you’re writing a VERY tight loop the compiler will do reasonable job of scheduling the instructions, and the CPU will find lots to pipeline and re-order and play all it’s tricks with, unless it misses the cache, then it will stall.

Automatically parallelizing code (either to use vector instructions or across cores) is much more difficult to do without help from the programmer.

November 22, 2010 5:52 am

The task is to design a physical structure with two contradicting design goals. It has to be sparse to avoid overheating, and it has to be dense to allow fast computation on some specific, not easily parallelizable tasks. Obviously, multi-core is not the best answer to this question. It is an implementation where very-very dense substructures are separated by needlessly much empty space. The best solution to this optimization problem will have a much more homogeneous density distribution.

November 22, 2010 6:07 am

I think the future will be “multi-core” for 2 reasons

1) becouse the nature is parallel so is more easy to follow the law of nature to build computers

2) becouse the “multi-core” programming is an advance in the computer science . A “multi-core” , not a “multi-thread” becouse a “multi-thread” is something like I want N threads to do this job independently from the number of cores in the hardware . A “multi-core” is something like I want to do this job independently on the number of cores in the hardware . This approach let to write programs where the programmer must define the “objects” with their relations without define the steps to execute . This is the direction of the general evolution of programming ( as example object oriented , … ).
In a more pragmatic view one good principle in programming is “never set constraints if you don’t need them” in the serial programming we set always constraints also when we don’t need . For example in classic for cycle we set implicitly a order in the execution but probably we can run operatons in parallel and this depend characteristics of the objects in the cycle.

November 22, 2010 7:05 am

I am very happy to see this article. I would really love to see more research in the area of parallelization caveats. For instance, what is the answer to this simple problem:

Given C universal TM with common input and communication tapes, how much can you speed up binary search? What changes if communication tapes have a significantly smaller alphabet than input tapes (modeling slow communication)?

Binary search is a canonical example for a heavily branching algorithm, that is, its behavior at a given step depends dramatically on previous steps. That makes it hard to parallelize, especially when communication comes with cost.

Fixed-width circuits are an example of communication-constrained computational models. It imposes limits on the wires crossing a certain class of cuts through the circuit, can we find a similar model of “distributable” circuits that run on N nodes with limited communication?

November 22, 2010 9:44 pm

Amdahl’s law will calculate how fast a program could be sped up by adding additional processors.

I’m not sure why you would think that a binary search is not well-suited to parallelization – when in fact a binary search is “embarrassingly” easy to parallelize.

For example: assume you are using 16 distinct processor to search N number of items for a particular match. First, divide the set of items into N/16-sized segments and assign each segment to each processor. Each processor then performs a binary search on its assigned segment and reports either success or failure when it completes its search. And in fact there is no communication overhead at all during this search – because each processor has nothing to communicate to any other processor involved in the search.

November 22, 2010 10:44 pm

Amdahl’s law will only help to calculate the maximum parallel speed-up if you know how much of your program is not parallelizable. But, proving that something isn’t parallelizable is as hard as proving any other lower bounds – i.e., in most cases, we can’t do that.

November 23, 2010 3:04 am

Thank you both for your comments. Indeed there were some unclear terms in my post both

1) What is meant by binary search

This is essentially a Meta-Algorithm and typically expected to perform logarithmically many applications of an abstract comparison or lookup operation (or both). Its input size is not well-defined, it may consist of only boundaries or a complete sorted enumeration of the search space. So for clarification, let us measure the number of comparison operations in terms of search space size N.

2) What is meant by speed-up

A typical approach to this problem is building a tree of nodes and letting each node decide which other node should carry on computation. This increases /throughput/, while in fact slowing down each individual computation, i.e. increasing /latency/ (just like any pipeline). Greg’s solution speeds up individual computations, reducing latency, which is what I actually meant.

Of course Greg’s simple approach would work, but it can only speed up (in above terms) computation to O(log(N/C)), which is poor in my opinion. You can be much faster O(((log N)/(log C))) if you, for instance, do a C+1-section instead of a 2-section at each step (each node performing one of the C necessary comparisons), and publish all results, requiring only C communication bits if you have a broadcast channel. Still, a factor of 1/log C is still a poor performance gain for employing C-times as many resources, it is exponentially worse than the ideal factor of 1/C.

We DO know that binary search needs to perform at least log N comparisons if N is the size of the search space (each comparison can yield at most one bit of information gain). So we know that we cannot get a latency below (log N)/C. But how close to this can we actually get? How?

Maybe someone even knows a better example for “strongly sequential” algorithms. I wanted something simple but maybe I accidentally chose something trivial.

• November 23, 2010 3:16 am

This strategy will only yield a speedup by a constant number of steps (not even a constant factor) in the best case and might therefore not be worth the effort.

Implementing it without communication might be possible in theory, but not in many programming languages since the runtime environment will have to assume that the individual tasks read — and write — overlapping parts. If you “physically” split the array, you have explicit overhead.

Also, if we are talking n threads on one memory bank, the threads might serialize at the RAM “gate” since (for large data) data lookup might take much, much longer than comparing two elements. Since you choose n disjoint partitions there is no principle of locality being applied, resulting in the maximum number of memory requests in the worst case.

November 23, 2010 1:44 pm

Carsten, I think one way to formalize whether something is parallelizable is to ask what depth circuit do you need. E.g. the class NC of uniform polylog depth circuits is a good candidate for the class of problems that are parallelizable.

Now, for binary search, the issue is that the sequential algorithm is already logarithmic. One could ask if constant depth is possible. You can check in parallel whether any element of the input array is equal to the element you’re looking for. However, you will need unbounded fan-in to combine the results to a single output. If you don’t have it, the naive combining of the results will be the bottleneck, and will lead back to logarithmic depth (if I’m not mistaken). Again, I cannot prove this is optimal, and I’m not sure if it’s known…

November 25, 2010 11:42 am

Thanks Milos! I fully agree with you, sublogarithmic time even for an unbounded number of processors cannot be expected, provided we require a single output and constant fan-in.

Unfortunately looking for “maximal benefit” assuming an unbounded number of processors is usually the only thing considered. This is even worse if this benefit is a constant factor and disappears in O-notation (like in this case). That’s why I think it’s worthwhile including the number of processors into the complexity. If you can achieve a constant factor of 1/C speedup for C processors, you parallelize extemely well, 1/log C (like in my example) would be quite poor.

7. November 22, 2010 8:01 am

I agree that it is in general hard to have one computation run on many cores and benefit from it. I have participated in a project with some heavy computations. Our program used almost 100% of a dual core, about 95% percent of a quad core and only about 60% of a 32 core machine. Of course, absolute runtime was much lower on the biggest machine despite it having lesser individual clock speeds.

I think multi-core will continue to be the standard for some non-technical reasons.

First, running one calculation heavy job is not what most people do. We have a multitude of processes running on our PCs and servers and multiple cores enable them to run truly in parallel and each being more responsive than on a single core machine. That is, even without concurrent programming we get the benefit.

Second, if you have many (slow) cores you can shut those down you do not need at the moment. Therefore, energy consumption and heat development scales even better with load than on single core machines.

I think technology will enable us to programm without regarding the number of cores our programs run on in some years. Some programming languages already hide that from us quite conveniently, typically those that are more abstract anyhow.

November 22, 2010 9:26 am

As has been said, some kind of parallelism is already heavily used in unicores (bit level, word level, pipelining, etc..). Afaik the industrial change to multicores was led by Intel in about year 2004-5. It is known that engineering design costs surpasses any other costs of chips. At this time the engineering cost of designing densest unicores superseded the cost of designing quicker multicores. So any innovation, physical (for instance a better cooling system) or algorithmic (for instance a best algorithm for dealing with collective processor comunications) might either overturn or boost this industrial trend.

Industry is opportunistic in its decisions, nature isn´t in the long term. And nature has bet for parallelism long time ago (inner organization of cells, organisms, brain, companies….) but maybe Mr Bell can win nature bets…

9. November 22, 2010 3:29 pm

Industry is slow for a good reason. They seek for a stable designs, suitable for large set of applications. I personally think that we are far from exploring parallelism. In particular, the fastest way of computing is through the data piping, like in graphic cards. On the other hand, due to branching, given there is only one processing stream, processors have to be reset at each branch. On the other hand, if there is 2D or even 3D grid (where non-intersecting paths are always possible) the branching can be designed as a path through the grid, allocating the nodes for all reasonable paths. The dead and computed paths can be freed, and allocated for new computations. This design become more similar to the brain architecture, and it requires a lot of “micro-cores”, far more then there are exist today. That also needs hardware paths and data flow scheduling, again far from the current technology. Think of this as a snake game.

November 22, 2010 10:34 pm

Modern GPUs are very capable of branching, though it may lead to a slow-down, if threads within a “warp” (i.e. a block of 32 threads) do not follow the same execution path. In that case the threads will automatically wait for each other – much of the non-trivial scheduling you mention is already in the hardware.

November 22, 2010 9:16 pm

Today’s programming languages were designed for single processors – so multithreaded programs have to use “locks” and other synchronization techniques that – not only make programming extremely complicated but also cause programs to scale extremely poorly as the number of threads increases.

The future of writing programs to run on multiple cores therefore lies with with lock-free synchronization and, in particular, transactional memory.

• November 27, 2010 3:05 am

Many people make bold claims that programming parallel code is “too hard”, but I think that’s complete nonsense. Nobody’s given me a shred of evidence supporting such a claim, and I have mountains of evidence to the contrary.

I’ve never (so far as I can remember) had a concurrency bug that took more than a day to fix, and correctly implementing half-decent parallelism in the physics simulation programs I’ve developed usually takes 5 minutes. Better parallelism usually takes an hour or two. I very rarely use any locks directly (almost always just barriers in the cases where they’re needed), and I’ve NEVER used a lock-free synchronization algorithm, since they add a large amount of unnecessary complexity.

I do, grudgingly, turn off multithreading for some debugging, since the support for multithreadded debugging in VisualStudio is lousy, but that’s not the same issue.

11. November 23, 2010 12:17 am

Interesting post! One comment: “Moore’s Law” typically refers to the growth of the number of transistors per chip (should be doubling every two years). This trend is still going strong, at least for now. It’s only the processor speed that is stagnant (as you point out).

November 24, 2010 1:28 am

I think the future of parallelism will be neither uniprocessor or multi core in the sense of a replicated uniprocessor, but rather, giant application-reconfigurable FPGA’s in general purpose computers. Unfortunately even the supposedly bleeding edge programming languages are concentrating on conventional processors. That has to change.

13. November 24, 2010 6:15 pm

I think the future of parallelism depends on programming languages. C++ is very popular language, but we have well-known problems for the parallelism in C. For example, TBB is nice thing but it is not sufficient unfortunately. IMHO, old Pascal (OO) would be better to do our programs more reliable.

November 25, 2010 11:57 am

Speaking about bets: it’s a safe bet the main target for near-future programming improvements is abstraction. As complexity of Software seems to increase (though not at the same speed) with computing power, to keep things manageable we have to chop it into an increasing number of swallowable chunks (often managed by separate teams, sometimes even written in separate programming languages). In order to keep this number manageable again, we need to increase the abstraction level of our code so little code does much. Now the logical next step is to make it easier to write and to understand highly abstract code. That’s already grown to some extent, but IMHO has not reached fully maturity in currently popular programming languages.

• November 25, 2010 12:47 pm

This is open problem: “the logical next step is to make it easier to write and to understand highly abstract code. That’s already grown to some extent, but IMHO has not reached fully maturity in currently popular programming languages”. Some time ago, I saw a paper where spreadsheet technology had been defined as next level of programming languages. But unfortunately till now we have no ideas how to increase abstract level of programming languages.

November 26, 2010 12:46 am

>>The other 29% is the overhead for chess search being hard to parallelize

Huh??

Exploring a game tree is the most parallelizable problems out there.

Take a chess board, pick a random move, evaluate.

It is easy to spin these off on their own processes. You get a score for each random move, and pick the best.

Rinse.

Repeat.

Don’t get me wrong, there are surely bottlenecks on multi-core systems that you don’t see on single-core systems. But don’t ever tell me that chess search is hard to parallelize.

December 1, 2010 8:13 pm

You could write a chess program that way but it wouldn’t be very good. Think of transposition tables, killer heuristic, endgame tablebases, etc. You need very fast (as in shared memory fast) communication between processors to make any of that stuff happen usefully.

16. November 26, 2010 3:28 am

for what it’s worth, the brain could be considered a massively parallel fault-tolerant computational device. it can still solve problems that the largest supercomputers of the world cannot, and it stays relatively cool (98 degrees), requires the space of a couple hard drives, and about 700 kcal’s a day of energy. further, neurodegenerative disease can often destroy 10% of the cells before people are symptomatic. so, while multi-core might not be the future for everything, nature has certainly discovered a way to make a relatively general purpose (more general than any computing system today, in some sense) multi-core box. i imagine we will want to mimic some of her ingenuity in the coming years….

November 26, 2010 10:48 am

Current workstations have a handful (generally 100) simpler cores, courtesy of the GPU. It is tempting to draw an analogy with cache/storage, and, indeed, one can already see technologies emerging which make use of a hierarchy of cores; smaller numbers of high frequency, heavily-pipelined cores packaged together with larger numbers of simpler, lower frequency cores.

It would be uncontroversial if technology continued to move in this direction.

The interesting problem then, is how we write software that operates in a multitasking environment with multiple heterogeneous cores, each with varying latency, throughput & memory access characteristics, and how can the OS provide effective resource allocation & scheduling services to such software.

Perhaps our processes need to come with some meta-data to help OS scheduling: “I perform best on a single, traditional CPU with good integer performance and low-latency read/write memory access to a small cache”, vs “I perform best on a larger number of smaller CPUs with good of floating point throughput, and read-only access to a large cache”.

In any case, whether for single-core or multi-core processing, it is memory access patterns (particularly writes) that remain critical to the performance of the majority of applications, rather than the raw throughput of the processor or processors themselves.

18. November 26, 2010 11:33 am

I don’t understand the motive for the prediction about fault tolerant systems.

Computer hardware is still as reliable as after the move from vacuum tubes to transistors. Where is the increase in unreliability that requires more fault tolerance? Parallel computation is not faulty. It works the way it does because it was designed that way.

Hardware failures are external to the system, they are beyond control. This is fundamentally different to mistakes in the design. The concept of fault tolerance doesn’t really make sense for design errors. What would it be doing? — when the software goes wrong, it would ‘recover’ in some way. But what does ‘recover’ mean? — stop doing what was going wrong, and do something right. And that is what doesn’t make sense. If you knew what was right, why didn’t you do that in the first place?

(And if you don’t know what is right, you don’t need fault tolerance but adaptation or evolution, which is not really design.)

Parallel computation, as far as it needs help, doesn’t need fault tolerance, but to be easier to program for in the first place.

• March 19, 2011 7:42 am

Well, if there failure rate per component stays constant, then the failure rate of the system will go up with the number of components.

Also, design flaws in Parallel systems can be non-deteministic. For example, race conditions:
Process 1: Lock A; Lock B; Unlock A,B
Process 2: Lock B; Lock A; Unlock A,B

This will work perfectly until Process 1 and 2 run at the exact same time in which case Process 1 will have A locked while Process 2 has B locked. Both will wait for ever for the other Process to unlock the lock it is holding. Since on modern processors timing is essentially random, simply restarting the system is one way to recover from this sort of fault.

November 29, 2010 2:41 pm

Maybe this is intereseting in this context:

http://www.chessbase.com/newsdetail.asp?newsid=6839

20. November 29, 2010 8:41 pm

Professor Lipton,
Multi-core may suffer from overhead for performance as you point out – so in addition to predicting fault-tolerant computing, would you say that the net result will be for a migration towards cloud computing where presumably the cloud providers may better utilize the trend towards multiple cores?

• March 19, 2011 7:53 am

Since your question has remained unanswered, I’ll give it a go. The smarts required to parallelize software are on the end of the programmers and perhaps compiler writers, not on the end of the hosting company. Thus the difficulty in utilizing multi-core processors give no reason to go to the cloud. If anything, it makes it harder to scale past the number of cores in your desktop, reducing the performance benefit from taking advantage of the virtually unlimited number of cores the cloud can already provide on demand.