
Supercomputer Advancement Slows?

Soulskill posted more than 3 years ago | from the moore-flops-moore-problems dept.

kgeiger writes "In the Feb. 2011 issue of IEEE Spectrum online, Peter Kogge, an IEEE Fellow and professor of computer science and engineering at the University of Notre Dame, outlines why we won't see exaflops computers soon. To start with, consuming 67 MW (an optimistic estimate) is going to make a lot of heat. He concludes, 'So don't expect to see a supercomputer capable of a quintillion operations per second appear anytime soon. But don't give up hope, either. [...] As long as the problem at hand can be split up into separate parts that can be solved independently, a colossal amount of computing power could be assembled similar to how cloud computing works now. Such a strategy could allow a virtual exaflops supercomputer to emerge. It wouldn't be what DARPA asked for in 2007, but for some tasks, it could serve just fine.'"
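For scale, 67 MW spread over 10^18 operations per second is a hard energy budget per operation. A quick back-of-the-envelope check of that budget (plain arithmetic on the numbers quoted above, nothing more):

    #include <stdio.h>

    int main(void)
    {
        double power_w = 67e6;   /* the article's optimistic 67 MW estimate */
        double flops   = 1e18;   /* one exaflops = 10^18 operations per second */

        /* Energy available per floating-point operation. */
        double j_per_flop  = power_w / flops;
        double pj_per_flop = j_per_flop * 1e12;

        printf("Energy budget: %.0f pJ per flop\n", pj_per_flop);  /* ~67 pJ/flop */
        return 0;
    }

The article's point is that on conventional architectures, moving the data around already costs far more than that per operation.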

86 comments

Whoa (0)

Anonymous Coward | more than 3 years ago | (#35033742)

Then I could play Duke Nukem Forever properly!

just make a cluster... (1)

hey (83763) | more than 3 years ago | (#35033744)

...of all the existing supercomputers.

Re:just make a cluster... (0)

Anonymous Coward | more than 3 years ago | (#35033818)

2) ...
3) profit!

Re:just make a cluster... (0)

Anonymous Coward | more than 3 years ago | (#35033822)

That's called Grid Computing, you know.

Re:just make a cluster... (0)

Anonymous Coward | more than 3 years ago | (#35033966)

Argh! Don't say the G-word! Before you know it, the dark demons of Globus will curse you to never get any research done ever again and you will end up working in McDonalds.

Re:just make a cluster... (0)

Anonymous Coward | more than 3 years ago | (#35039080)

The problem with Grid computing is that you can't even walk around your own micro-circuits without permission from the MCP! Who does he calculate he is!?

The real reason.. (0)

Haedrian (1676506) | more than 3 years ago | (#35033798)

Is that Crysis 2 isn't out yet. When it is, people will all be going out to buy their own supercomputers to run the game.

Re:The real reason.. (0)

Anonymous Coward | more than 3 years ago | (#35035532)

Welcome to 50 years ago, joke.

Less of a matter of can't, but won't (2)

mlts (1038732) | more than 3 years ago | (#35033850)

In the past, there were a lot of applications that required a true supercomputer to be built to solve them, be it basic weather modeling, rendering for ray-tracing, etc.

Now, most applications can be handled by COTS hardware. Because of this, there isn't much of a push to keep building faster and faster computers.

So, other than the guys who need top-of-the-line CPU cycles for very detailed models, such as the modeling used to simulate nuclear testing, there isn't really as big a push for supercomputing as there was in the past.

Re:Less of a matter of can't, but won't (4, Interesting)

vbraga (228124) | more than 3 years ago | (#35033972)

I don't know if this is true.

Weather modeling is still done on supercomputers.

Engineering applications need high-performance computing on a regular basis: geophysics (offshore oil, 4D seismic, ...), materials science (MD, ...), and others. There are also academic problems.

I've seen a lot of new HPC centers being built or getting new equipment in the last few years (Rio de Janeiro, Brazil). From small CUDA clusters to heavy duty Cray systems (not in Rio, but nearby).

Re:Less of a matter of can't, but won't (1)

kelemvor4 (1980226) | more than 3 years ago | (#35034820)

The #1 supercomputer in the world is one of those CUDA clusters in China. http://www.top500.org/system/10587 [top500.org] . At the moment, nVIDIA is where it's at for HPC as I understand it.

Re:Less of a matter of can't, but won't (1)

conspirator57 (1123519) | more than 3 years ago | (#35035646)

Benchmarks aren't real work. And sadly the tail is wagging the dog to a great extent, as people design computers to be good at benchmarks rather than making them as good as possible at a real workload and designing the benchmark to resemble the workload. It's a contest of Napoleon complexes.

I'd judge an architecture not by its slot on the benchmark lists, but by the number and complexity of real workloads it is actually used for.

Re:Less of a matter of can't, but won't (1)

allenw (33234) | more than 3 years ago | (#35038448)

... and to make matters worse, top500 is based primarily on LINPACK. So top500 is really a measure of how fast something can do floating point with a distributed shared memory model and not much else. Most of the systems listed in the top500 would fail miserably at heavy IO loads, which is what most of the increasingly common Big Data problems need. It concerns me that manufacturers are building systems for one type of heavy-duty computing based on top500 while ignoring the others.

More and more applications all the time (2)

mangu (126918) | more than 3 years ago | (#35034324)

In the past, there were a lot of applications that required a true supercomputer to be built to solve them, be it basic weather modeling, rendering for ray-tracing, etc.

Now, most applications can be handled by COTS hardware

It's true, many applications that needed supercomputers in the past can be done by COTS hardware today. But this does not mean there are no applications for bigger computers. As each generation of computers assumes the tasks done by the former supercomputers, new applications appear for the next supercomputer.

Take weather modeling, for instance. Today we still can't predict rain accurately. That's not because the modeling itself is not accurate, but because the spatial resolution needed to predict rainfall is beyond our computers. Engineers still use wind tunnels, and they still have tanks to test ship models; there are many situations where the most powerful computers today cannot perform calculations at the same level of precision one gets from scale models.

And then there are entirely new applications that are way beyond the capacity of our current computers. Drug design is one example: a computer capable of accurately calculating the shape a protein molecule will have, given its sequence of amino acids, is still a dream.

Re:More and more applications all the time (1)

bberens (965711) | more than 3 years ago | (#35036368)

I think as competition grows in the cloud computing market we'll see a lot more modeling being done on the cloud. There's a lot to be said for having your own supercomputer, for sure, but if I can get it done at a fraction of the cost by renting off-peak hours on Amazon's cloud... I'm convinced the future is there; it'll just take us another decade to migrate off the entirely customized and proprietary environments we see today.

Re:More and more applications all the time (1)

dkf (304284) | more than 3 years ago | (#35039872)

I think as competition grows in the cloud computing market we'll see a lot more modeling being done on the cloud. There's a lot to be said for having your own supercomputer, for sure, but if I can get it done at a fraction of the cost by renting off-peak hours on Amazon's cloud... I'm convinced the future is there; it'll just take us another decade to migrate off the entirely customized and proprietary environments we see today.

Depends on the problem. Some things work well with highly distributed architectures like a cloud (e.g., exploring a "space" of parameters where there's not vast amounts to do at each "point") but others are far better off with traditional supercomputing (computational fluid dynamics is the classic example, of which weather modeling is just a particular type). Of course, some of the most interesting problems are mixes, such as pipelines of processing where some stages are embarrassingly distributable and others require grunty concentrated power. If you're getting into the design of this class of "meta-application", then you're going to have to just be aware that the details really matter (and the costs associated with the various tradeoffs also vary over time, just to make it more "fun").

Re:More and more applications all the time (1)

sjames (1099) | more than 3 years ago | (#35040708)

The cloud can handle a small subset well, the embarrassingly parallel workloads. For other simulations, the cloud is exactly the opposite of what's needed.

It doesn't matter how fast all the CPUs are if they're all busy waiting on network latency. 3 to 10 microseconds is a reasonable latency in these applications.
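To put that latency in perspective (a sketch assuming a 3 GHz core clock; the 3 to 10 microsecond range is the one mentioned above):

    #include <stdio.h>

    int main(void)
    {
        double clock_hz = 3e9;                  /* assumed 3 GHz core clock */
        double latencies_us[] = { 3.0, 10.0 };  /* the latency range quoted above */

        for (int i = 0; i < 2; i++) {
            double cycles = latencies_us[i] * 1e-6 * clock_hz;
            printf("%.0f us of latency = %.0f idle core cycles\n",
                   latencies_us[i], cycles);    /* 9,000 and 30,000 cycles */
        }
        return 0;
    }

Even a good interconnect hop costs thousands of cycles, which is why tightly coupled codes fall apart on anything slower.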

Re:Less of a matter of can't, but won't (0)

Anonymous Coward | more than 3 years ago | (#35034490)

That's correct, if you ignore energy and housing costs, as well as availability and security. So, it's not that correct.

Re:Less of a matter of can't, but won't (2)

dr2chase (653338) | more than 3 years ago | (#35034790)

The problem, not immediately obvious, is that if you shrink the grid size in a finite-elements simulation (which describes very very many of them), you must also shrink the time step, because you are modeling changes in the physical world, and it takes less time for change to propagate across a smaller element. And at each time step, everyone must chat with their "neighbors" about the new state of the world. The chatting is what supercomputers do well, compared to a city full of gaming rigs with GPUs.

The constraints of faster chatting also drive up power density.

Another issue with COTS hardware (or any other) is that if you use enough of it, failure of some part becomes a certainty. This means you need redundancy and/or checkpointing, and changes your tolerance for hardware failure (COTS may not be tuned to that design point). Full redundancy uses twice the resources, but means you (almost) never wait; otherwise, checkpoint and recover eats into your wall time.
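To make the grid-refinement point concrete, here is a rough scaling sketch (assuming an explicit method in three dimensions, where the stable time step shrinks in proportion to the cell size):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int dims = 3;   /* assumed 3-D simulation */

        /* Refining the mesh by a factor R multiplies the cell count by R^3
         * and forces R times as many time steps, so total work grows as R^4,
         * and every step still requires the neighbor exchange described above. */
        for (int refine = 1; refine <= 8; refine *= 2) {
            double cells = pow(refine, dims);
            double steps = refine;
            printf("refine %dx: %4.0fx cells, %dx steps, %5.0fx total work\n",
                   refine, cells, refine, cells * steps);
        }
        return 0;
    }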

Re:Less of a matter of can't, but won't (0)

Anonymous Coward | more than 3 years ago | (#35035120)

The SKA project (Square Kilometre Array) specifically has a task called "Climbing Mount Exaflop"; Next-Gen HPC is certainly a topic in radio astronomy.

Re:Less of a matter of can't, but won't (1)

Paracelcus (151056) | more than 3 years ago | (#35035952)

"They" (we all know who "they" are) want a panexaflop (one trillion exaflop) machine to break todays encryption technology (for the children/security/public safety), of course after "they" spend umpteen billions (lotsa billions) some crypto nerd working on his mom's PC will take crypto to a whole new level, and off we go again!

Re:Less of a matter of can't, but won't (2)

Darinbob (1142669) | more than 3 years ago | (#35036928)

I think the problem here is in calling these "applications". Most supercomputers are used to run "experiments". Scientists are always going to want to push to the limits of what they can compute. They're unlikely to think that just because a modern desktop is as fast as a supercomputer from a couple of decades ago, they're fine running the same numbers they ran a couple of decades ago too.

Breathtakingly dumb. (0)

Brannon (221550) | more than 3 years ago | (#35033870)

In the computer industry, the most impressive displays of stupidity generally result from linear extrapolation.

There will be 'exascale' computers, they just won't look like scaled-up versions of today's computers, where the bulk of the die area, power, and complexity is spent on making something which is programmable by a horde of low-skill programmers and performs equally poorly for thousands of different applications. The whole notion of general-purpose for the supercomputer industry is kinda silly, considering that there are only a handful of applications which justify the expense of building such a computer.

Just design a specialized machine for each of those applications and one can get a factor-of-a-hundred improvement.

Re:Breathtakingly dumb. (2)

c0d3g33k (102699) | more than 3 years ago | (#35034256)

So wait. Your answer to "very expensive general purpose machine" is "design many slightly less expensive single purpose machines"? Your "factor of hundred" performance improvement will likely be overshadowed by the "factor of thousand" increase in economic cost.

Provide believable numbers or your argument is bullshit. You may be right, but your style of discourse requires concrete evidence to be at all convincing.

It's been done (0)

Anonymous Coward | more than 3 years ago | (#35037288)

http://www.genomeweb.com/blog/anton-sets-protein-folding-record

Take a small team and build a high-performance, extremely power-efficient machine targeting a single application--accelerate said application by 100x. For the $1 Billion that the government is likely to waste on the exascale nonsense we could probably build five different special-purpose machines each targeting a vital area of high performance computing (fluid dynamics, oil exploration, weather prediction, etc.) and literally own those industries in the same way that the Anton supercomputer now owns molecular dynamics.

The machines could share a lot of technology if it's done right.

Re:Breathtakingly dumb. (1)

foobsr (693224) | more than 3 years ago | (#35034800)

In the computer industry, the most impressive displays of stupidity generally result from linear extrapolation.

This is not confined to the computer industry (and it's not news, either).
See "The Logic of Failure", 1996 (with roots back to TANALAND, appr. 1980?)

CC.

slowing... (1)

Schmyz (1265182) | more than 3 years ago | (#35033950)

...but doesn't history show us that most things stall before the next large wave of advances?

Re:slowing... (1)

c0d3g33k (102699) | more than 3 years ago | (#35034336)

No, not really. "Shit happened" is about all that history really shows us. With the correct set of selected examples, it could probably also show us that things stall and stagnate so something else can provide the next large wave of advances.

Re:slowing... (1)

Richard Dick Head (803293) | more than 3 years ago | (#35041134)

There's no passion in the flop race anymore... it's just business. You can get something outrageous like that, but who's gonna care about it over more pressing issues like storage space? It's just not fun anymore; the computer age is dead. We're now in the handheld age, get used to it folks. :D

Somewhat unrelated to the article I suppose, and I never thought I'd say this, but buying a new computer is... boring.

There, I said it.

I remember a time when if you waited three years and got a computer, the difference was downright shocking.

Last week I blew my wad on new, high-end everything and I really can't tell the difference. Crysis on full tilt looks... oh wait, this is a game that I played through YEARS AGO. Okay. Now what.

So, I found a PC game I haven't played through: Star Trek: Judgment Rites (1993-ish?) on my brand new four-core. With DOSBox. Yeah. Four cores. Three cheers for piracy, nobody makes PC games anymore :/ It's like owning a gigantic SUV with a small tank when the gas station is 50 miles away.

Moore's law meets Amdahl's law (1)

LWATCDR (28044) | more than 3 years ago | (#35033954)

Current supercomputers are limited by consumer technology. Adding cores is already running out of steam on the desktop. On servers it works well because we are using them mainly for virtualization. Eight and sixteen core CPUs will border on useless on the desktop unless some significant change takes place in software to use them.

Re:Moore's law meets Amdahl's law (0)

Anonymous Coward | more than 3 years ago | (#35035136)

Your comment is nonsense.

Supercomputer applications can soak up tens of thousands of cores, often with near-linear scalability.

Commodity hardware isn't holding anything back; it's substantially lowering the cost and power requirements and allowing the relatively inexpensive construction of fairly effective petaflop machines.

The article is just pointing out that if you do the math, we can't really power a machine built on existing architectures that would provide the performance target.

Re:Moore's law meets Amdahl's law (1)

wagnerrp (1305589) | more than 3 years ago | (#35036594)

Nonsense. Ridiculously parallel applications can soak up tens of thousands of cores, with near linear scalability. For those, you may as well use grid computers running on the idle time of tens of thousands of desktops. Supercomputer applications are run on supercomputers specifically because they have difficulty scaling. They require a large amount of frequent communication between the compute nodes that a detached grid cannot provide, and even commodity 'beowulf' clusters struggle with. You build a supercomputer for that interconnect bandwidth and latency.

Re:Moore's law meets Amdahl's law (1)

LWATCDR (28044) | more than 3 years ago | (#35036832)

Actually, if you read the link, the problem is that the interconnects and the limits on the parallelism of the problems are the limitations.

"The good news is that over the next decade, engineers should be able to get the energy requirements of a flop down to about 5 to 10 pJ. The bad news is that even if we do that, it won't really help. The reason is that the energy to perform an arithmetic operation is trivial in comparison with the energy needed to shuffle the data around, from one chip to another, from one board to another, and even from rack to rack. A typical floating-point operation takes two 64-bit numbers as input and produces a 64-bit result. That's almost 200 bits in all that need to be moved into and out of some sort of memory, likely multiple times, for each operation. Taking all that overhead into account, the best we could reasonably hope for in an exaflops-class machine by 2015 if we used conventional architecture was somewhere between 1000 and 10 000 pJ per flop."

And " Realistic applications running on today's supercomputers typically use only 5 to 10 percent of the machine's peak processing power at any given moment. Most of the other processor cores are just treading water, perhaps waiting for data they need to perform their next calculation. It has proved impossible for programmers to keep a larger fraction of the processors working on calculations that are directly relevant to the application. And as the number of processor cores skyrockets, the fraction you can keep busy at any given moment can be expected to plummet. So if we use lots of processors with relatively slow clock rates to build a supercomputer that can perform 1000 times the flops of the current generation, we'll probably end up with just 10 to 100 times today's computational oomph. That is, we might meet DARPA's targets on paper, but the reality would be disappointing indeed."

I did read it. Did you?

What you are thinking of are grid computing problems, not supercomputer problems.

Rent Out My Machine (1)

KermodeBear (738243) | more than 3 years ago | (#35034152)

I would be very willing to run something akin to Folding@Home where I get paid for my idle computing power. Why build a supercomputing cluster when, for some applications, the idle CPU power of ten million consumer machines is perfectly adequate? Yes, there needs to be some way to verify the work, otherwise you could have cheating or people trolling the system, but it can't be too hard a problem to solve.

Re:Rent Out My Machine (2)

SuricouRaven (1897204) | more than 3 years ago | (#35034266)

Two problems:
1. The value of the work your CPU can do is probably less than the cost of the extra power it'll consume. Maybe the GPU could do it, but then:
2. You are not a supercomputer. Computing power is cheap - unless you're running a cluster of GPUs, it could take a very long time for you to earn even enough to be worth the cost of the payment transaction.

What you are talking about is selling CPU time. It's only had one real application since the days of the mainframe, and that's in cloud computing, as it offers the ability to buy instantly if the customer has a sudden need for more (e.g., Slashdot just linked to their site). It just isn't economically viable right now, because anyone who needs enough processing power to consider buying it can probably just go and buy their own cluster.
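The transaction-cost point can be made concrete with a toy calculation. Every figure below is an assumption for illustration only, not a real price:

    #include <stdio.h>

    int main(void)
    {
        /* Assumed numbers, for illustration only. */
        double payout_per_core_hour      = 0.01;   /* what a buyer might pay, $       */
        double electricity_per_core_hour = 0.003;  /* ~25 W per core at $0.12/kWh     */
        double transaction_fee           = 0.30;   /* fixed payment-processing fee, $ */

        double net_per_core_hour  = payout_per_core_hour - electricity_per_core_hour;
        double hours_to_cover_fee = transaction_fee / net_per_core_hour;

        printf("Net earnings: $%.3f per core-hour\n", net_per_core_hour);
        printf("Core-hours needed just to cover one payout: %.0f\n",
               hours_to_cover_fee);   /* ~43 with these assumptions */
        return 0;
    }

Whether the idea pays off at all depends entirely on numbers like these, which is the point being made.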

Re:Rent Out My Machine (0)

Anonymous Coward | more than 3 years ago | (#35038622)

it could take a very long time for you to earn even enough to be worth the cost of the payment transaction.

An interesting world of possibilities could open up if transaction costs were proportional to the amount, with no minimum cost.

Re:Rent Out My Machine (3, Insightful)

ceoyoyo (59147) | more than 3 years ago | (#35034290)

Because nobody uses a real supercomputer for that kind of work. It's much cheaper to buy some processing from Amazon or use a loosely coupled cluster, or write an @Home style app.

Supercomputers are used for tasks where fast communication between processors is important, and distributed systems don't work for these tasks.

So the answer to your question is that tasks that are appropriate for distributed computing are already done that way (and when lots of people are willing to volunteer, why would they pay you?).

Re:Rent Out My Machine (1)

Anonymous Coward | more than 3 years ago | (#35034746)

That kind of thing (grid computing) is only good for 'embarrassingly parallel' problems. You cannot solve large coupled partial differential equation problems because of the required communication. And most problems in nature are large coupled PDEs.

How about those limited edition Gallium chips??? (0)

Anonymous Coward | more than 3 years ago | (#35034248)

A little bird informs the world that the US has a supercomputer already running on them, somewhere between 100 GHz and 1 THz per processor. Looks like the exaflop has come and gone.

Re:How about those limited edition Gallium chips?? (3, Informative)

mangu (126918) | more than 3 years ago | (#35034630)

A little bird informs the world that the US has a supercomputer already running on them, somewhere between 100 GHz and 1 THz per processor

Unlikely. If you do the calculations, you'll find that the current 3 GHz limit is about as fast as you can get data from other chips on a circuit board. 3 GHz is a 0.33 nanosecond period, the time it takes light to travel ten centimeters in a vacuum. A faster CPU will stay idle most of the time, waiting for the data it requested from other chips to arrive at the speed of light.
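The parent's numbers check out; a tiny sketch of the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        double c        = 3.0e8;   /* speed of light in a vacuum, m/s */
        double clock_hz = 3.0e9;   /* 3 GHz */

        double period_s    = 1.0 / clock_hz;          /* one clock period       */
        double distance_cm = c * period_s * 100.0;    /* light travel per cycle */

        printf("Clock period: %.2f ns\n", period_s * 1e9);         /* 0.33 ns */
        printf("Light travels %.0f cm per cycle\n", distance_cm);  /* ~10 cm  */
        return 0;
    }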

How Long (0)

fuzznutz (789413) | more than 3 years ago | (#35034380)

How long until "the cloud" becomes Skynet?

LA TE N C Y I S F O R E V (3, Insightful)

tarpitcod (822436) | more than 3 years ago | (#35034432)

These modern machines, which consist of zillions of cores attached over very low bandwidth, high latency links, are really not supercomputers for a huge class of applications, unless your application exhibits extreme memory locality and needs hardly any interconnect bandwidth / can tolerate long latencies.

The current crop of machines is driven mostly by marketing folks and not by people who really want to improve the core physics like Cray used to.

BANDWIDTH COSTS MONEY, LATENCY IS FOREVER

Take any of these zillion-dollar piles of CPUs and just try doing this:
for (x = 0; x < bounds; ++x)
{
        humungousMemoryStructure[x] = humungousMemoryStructure1[x] * humungousMemoryStructure2[randomAddress]
                                    + humungousMemoryStructure3[anotherMostlyRandomAddress];
}

It'll suck eggs. You'd be better off with a single liquid-nitrogen-cooled GaAs/ECL processor surrounded by the fastest memory you can get your hands on, all packed into the smallest space you can, and cooled with LN or LHe.

Half the problem is that everyone measures performance for publicity with LINPACK MFLOPS. It's a horrible metric.

If you really want to build a great new supercomputer, get a (smallish) bunch of smart people together like Cray did, and focus on improving the core issues. Instead of spending all your efforts on hiding latency, tackle it head on. Figure out how to build a fast processor and cool it. Figure out how to surround it with memory.

Yes,

Customers will still use commodity MPP machines for the stuff that parallelizes.
Customers will still hire mathematicians, and have them look at ways to map things that seem inherently non-local into spaces that are local.
Customers who have money, and whom the mathematicians couldn't help, will need your company and your GaAs/ECL or LHe-cooled fastest SCALAR / short-vector box in the world.

Re:LA TE N C Y I S F O R E V (1)

PaladinAlpha (645879) | more than 3 years ago | (#35035094)

Well, yeah, if you deliberately design a program to not take advantage of the architecture it's running on, then it won't take advantage of the architecture it's running on. (This, btw, is one of the great things about Linux, but that's not really what we're talking about.)

One mistake you're making is in assuming only one kind of general computing improvement can be occurring at a time (and there is some good, quality irony in that *grin*). Cray (and others) can continue to experiment on the edge of the thermal/lightspeed barrier, while others can simultaneously increase computing resources by running clusters.

There's nothing inherently detrimental about clusters, and almost all tasks can be parallelized to some degree. There are, of course, always latency and bandwidth issues in distributed memory systems, but InfiniBand (which is what serious clusters use) has latency in the 1 microsecond range, which is only about five times memory latency itself, and a bandwidth that you'll never saturate with current data distribution models (we're working on that).

It's true that typically performance to processor growth is sublinear, but it's also nonzero. For even moderately parallel tasks, a $300k cluster can outperform a $3M supermachine, and the failure of that to extend to the very boundary is more an effect of our inability to grok the parallel coding model (parallel code is hard); as we understand more and more about concurrent programming you will see the oft-cursed 'penalty' of cluster computing dwindle more and more towards nothing.

Re:LA TE N C Y I S F O R E V (1)

Anonymous Coward | more than 3 years ago | (#35037302)

I hear you, but sure, you can harness a million chickens over slow links and reinvent the transputer or the Illiac IV, but you're then constraining yourself to problems where someone can actually get a handle on the locality. If they can't, you're *screwed* if you want to actually really improve your ability to answer hard problems in a fixed amount of time.

You can even just take your problem, brute-force parallelize it, say 'wow, let's run it for time steps 1..1000', farm that out to your MPP or your clusters, and then wait a good long amount of time till you get your answers... but it doesn't help you one IOTA if after seeing the first five steps you realize that you really should have asked a different question.

I'm not against MPP or multithreading (I liked the Tera MTA), Transputers, dataflow, CUDA, millions of school children with abacuses (abaci?), or any of that stuff, but there are real science questions that can't be answered by them. There really are people out there who do stuff where they really do need to be able to say

a[i] = b[j]*c[k]+d[l];

...and where *nobody* (with lots of smart nobodys) has managed to figure out a decent way to partition the problem.

You do realize that if you go off-node on your cluster even over infiniband the 1uS is about equal to a late 1960's core memory access time right?

Re:LA TE N C Y I S F O R E V (1)

PaladinAlpha (645879) | more than 3 years ago | (#35039020)

You do realize that if you go off-node on your cluster even over infiniband the 1uS is about equal to a late 1960's core memory access time right?

Sure, but having 1960s mag-core access times to entirely different systems is pretty good, I'd say. And it will only improve.

It's a false dichotomy. There are some problems that clusters are bad at. That is true. The balancing factor that you are missing is that there are problems that single-proc machines are bad at, also. For every highly sequential problem we know, we also know of a very highly parallel one. There are questions that cannot be efficiently answered in a many-node environment, but there are also questions that can be answered far faster. What you're saying is basically that a desktop computer running modern games doesn't need a GPU, it just needs a faster CPU. It's failing to optimize for the problem; driving the screw with a hammer, if you will.

There is no such thing as effective "brute force parallelize". Effective concurrent computation requires thought. Sequential programming is MUCH easier to design and optimize for; that's the draw of single-proc systems. There is nothing inherently non-partitionable about "a[i] = b[j] * c[k] + d[l]"; indeed, I would submit that there are very easy threading optimizations that can be done for any combinations of i,j,k,l; a single thread of execution is GUARANTEED to be slower than even the most trivially optimized multithreaded case. Any multithreaded case benefits from additional execution units up to the number of threads as long as the computation bandwidth (NOT the interprocess bandwidth!) increases enough to account for the increase in computation latency (which is linked to, but not identical to, interprocess latency).

Sequential is dying, because it's suboptimal. Trying to burn the gas faster and faster doesn't get around the fact that the engine just isn't that efficient.

Re:LA TE N C Y I S F O R E V (1)

sjames (1099) | more than 3 years ago | (#35041122)

a single thread of execution is GUARANTEED to be slower than even the most trivially optimized multithreaded case.

That is true if and only if the cost of multithreading doesn't include greatly increased latency or contention. Those are the real killers. Even in SMP there are cases where you get eaten alive with cache ping ponging. The degree to which the cache, memory latency, and lock contention matter is directly controlled by the locality of the data.

For an example, let's look at this very simple loop:

FOR i = 1 TO 100
    A[i] = B[i] + c[i] + A[i-1]

You might be tempted to pre-compute B[i]+c[i] in one thread and add in a[i-1] in another, but you then have 2 problems. First, if you aren't doing a barrier sync in the loop the second thread might pass the first and the result is junk, but if you are, you're burning more time in the sync than you saved. Next, the time spent in the second thread loading the intermediate value cold from either RAM or L1 cache into a register will exceed the time it would take to perform the addition.

Given some time, I can easily come up with far more perverse cases that come up in the real world.
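Written out as plain C, the loop and its loop-carried dependence look like this (a compilable sketch; the array names are the ones above):

    #include <stdio.h>

    #define N 101

    int main(void)
    {
        double A[N] = {0}, B[N], c[N];
        for (int i = 0; i < N; i++) { B[i] = i; c[i] = 2.0 * i; }

        /* A[i] needs A[i-1] from the previous iteration, so iterations cannot
         * simply be handed to different threads.  A naive
         * "#pragma omp parallel for" on this loop would read stale or
         * not-yet-written A[i-1] values and produce garbage, which is exactly
         * the hazard described above. */
        for (int i = 1; i < N; i++)
            A[i] = B[i] + c[i] + A[i - 1];

        printf("A[100] = %g\n", A[100]);   /* 15150 with this fill pattern */
        return 0;
    }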

Re:LA TE N C Y I S F O R E V (1)

PaladinAlpha (645879) | more than 3 years ago | (#35041532)

You might be tempted to pre-compute B[i]+c[i] in one thread and add in a[i-1] in another, but you then have 2 problems. First, if you aren't doing a barrier sync in the loop the second thread might pass the first and the result is junk, but if you are, you're burning more time in the sync than you saved. Next, the time spent in the second thread loading the intermediate value cold from either RAM or L1 cache into a register will exceed the time it would take to perform the addition.

Given some time, I can easily come up with far more perverse cases that come up in the real world.

...of course there's going to be some kind of synchronization. The suggestion otherwise implies a lack of experience in the field; failure to plan sync before anything else is an undergrad mistake.

I fail to see how the sync burns more time than you save by threading the computation. It seems to me that doing operation a and operation b in sequence will almost always be slower than doing them simultaneously with one joining the other at the end (or, better and a little trickier, a max-reference count for the dependent thread).

Re: cache, you are making strong assumptions that you're going to get cache hits for equidistant indices across three arrays. In the kind of real-world computations that need this kind of hardware, you're not going to look at three flat 100-element arrays. You're going to be looking at 10k x 10k x 10k arrays, and the number of cache hits you get is not going to decrease by parallelizing the work -- indeed, since you're getting more cache for every node, the cache hits are going to increase with any degree of optimization. This is true in all cases.

There is nothing 'perverse' about the case you have presented. It's a trivial optimization problem that doesn't require more than a basic knowledge of cache and some hint of the target architecture. The more perverse cases are the ones that get worked on at the highest level, and I have seen no proof of any computational problem that is sequential in the fastest case.

Re:LA TE N C Y I S F O R E V (1)

sjames (1099) | more than 3 years ago | (#35041590)

...of course there's going to be some kind of synchronization. The suggestion otherwise implies a lack of experience in the field; failure to plan sync before anything else is an undergrad mistake.

As is not realizing that synchronization costs. How fortunate that I committed none of those errors! Synchronization requires atomic operations. On theoretical (read cannot be built) machines in CS, that may be a free operation. On real hardware, it costs extra cycles.

As for cache assumptions, I am assuming that linear access to linear memory will result in cache hits. That's hardly a stretch, given the way memory and cache are laid out these days.

If you are suggesting handing off those subtotals and doing barrier sync through main memory (not sharing a data cache), you either have no idea what modern register vs cache vs main memory timings are like, or you're using a fairly exotic and expensive system where RAM is one big cache. But if the latter were true, you would have pointed out that cache hits are irrelevant, so...

Here's a thought, give it a try! See how it goes.

Re:LA TE N C Y I S F O R E V (1)

PaladinAlpha (645879) | more than 3 years ago | (#35042714)

I never suggested that synchronization is free. However, a CAS or other (x86-supported!) atomic instruction would suffice, so you are talking about one extra cycle and a cache read (in the worst case) for the benefit of working (at least) twice as fast; you will benefit from extra cores almost linearly until you've got the entire thing in cache.

The cache stuff is pretty straightforward. More CPUs = more cache = more cache hits. Making the assumption that a[], b[], and c[] are contiguous in memory only increases this effect -- in your scenario, there is only one cache, and you'll have at most * 3 of the values local; whereas for every cpu you add in distribution the value increases linearly for quite some time.

This is ignoring the trivially shallow dependence of the originally proposed computation (there's a simple loop invariant) and making the assumption that a difficult computation is being done.
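For concreteness, here is a minimal sketch of the kind of CAS-based handoff being discussed, using C11 atomics. Whether it actually beats a single thread is exactly the point in dispute; this only shows the mechanism:

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int ready = 0;   /* 0 = slot empty, 1 = producer has published */
    static double slot;            /* the intermediate value being handed off    */

    void producer_publish(double value)
    {
        slot = value;
        int expected = 0;
        /* CAS flips the flag 0 -> 1; release ordering makes the write to
         * `slot` visible before the consumer can observe ready == 1. */
        while (!atomic_compare_exchange_weak_explicit(&ready, &expected, 1,
                                                      memory_order_release,
                                                      memory_order_relaxed))
            expected = 0;          /* consumer hasn't drained the slot yet */
    }

    double consumer_take(void)
    {
        int expected = 1;
        /* Spin until the flag is 1, then CAS it back to 0; acquire pairs with
         * the producer's release.  Every failed attempt is another trip through
         * the cache-coherence protocol, which is the cost sjames points at. */
        while (!atomic_compare_exchange_weak_explicit(&ready, &expected, 0,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
            expected = 1;
        return slot;
    }

    int main(void)
    {
        producer_publish(42.0);
        printf("%g\n", consumer_take());
        return 0;
    }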

Re:LA TE N C Y I S F O R E V (1)

sjames (1099) | more than 3 years ago | (#35043562)

This is ignoring the trivially shallow dependence of the originally proposed computation (there's a simple loop invariant) and making the assumption that a difficult computation is being done.

I put the dependence there because it reflects the real world. For example, any iterative simulation. I could prove a lot of things if I get to start with ignoring reality. You asserted that there existed no case where a single thread performs as well as multiple threads, a most extraordinary claim. It's particularly extraordinary given that it actually claims that all problems are infinitely scalable with only trivial optimization.

CAS is indeed an atomic operation that could be used (I would have used a semaphore, though), but to claim 1 cycle is far from the whole story. The instruction may take 1 cycle to execute (7 on an x86, actually), but the necessary cache invalidation is what kills you. A cache miss can cost 1000 cycles. If you go off node, it will cost many thousands even with a specialized backplane. An add costs 5. MOV from cache to register costs 1. So, the question comes down to this: would you rather take a guaranteed 1000 cycle hit for each iteration or would you prefer to spend 7 cycles doing the add yourself?

So, I say again, give it a try! Actually try to get a multi-threaded implementation of the loop I specified to outrun a single threaded version. You claimed it is trivial, so it shouldn't take you long.

Re:LA TE N C Y I S F O R E V (1)

tarpitcod (822436) | more than 3 years ago | (#35044992)

Exactly!

It's all easy if you ignore:

Cache-misses.
Pipeline stalls
Dynamic clock throttling on cores
Interconnect delays
Timing skews

It's the same problem the async CPU people go through, except everyone is wearing rose-colored spectacles and acting like they're still playing with nice synchronous clocking.

The semantics become horrible once you start stringing together bazillions of commodity CPUs. Guaranteeing the dependencies are satisfied becomes non-trivial, like you say, even for a single multi-core x86 processor. You end up either with heavy-duty synchronization that is reliable and slow, or you risk garbage and all kinds of synchronization hell. 1uS InfiniBand messages are several thousand clock cycles...

The only caveat I would add is that if people really got honest and just built big piles of transport-triggered architecture stuff, then many of the timing problems would be solved - but they don't...

What bugs me is that there are plenty of scientists who hate the MPP boxes because their codes that do real stuff simply don't parallelize well, yet every time someone says fastest on TOP 5000 and manages to string even more commodity CPUs together with even worse latency, the press says 'new breakthrough'.

Parallel computers are great, and parallel algorithms research is fine and should continue to be worked on, but someone needs to be figuring out how to build a damn fast GaAs box (or something else with better electron mobility than silicon) that has scalar performance that kicks serious ass. Hell, even if someone was doing what the ETA people tried to do, and used commodity CMOS and seriously cooled it to get a speedup, that would be a start.

Re:LA TE N C Y I S F O R E V (1)

sjames (1099) | more than 3 years ago | (#35046056)

Agreed. I'm really glad MPP machines are out there; there is a wide class of jobs that they do handle decently well for a tiny fraction of the cost. In fact, I've been specifying those for years (mostly a matter of figuring out where the budget is best spent given the expected workload and estimating the scalability), but as you say, it is also important to keep in mind that there is a significant class of problems they can't even touch. Meanwhile, the x86 line seems to have hit the wall at a bit over a 3 GHz clock, and the various RISC processors have been focusing on low power embedded applications. The DDR signaling already bears more resemblance to QAM than a logic line, so the only way to get any more out of it is faster switching/higher frequency.

Re:LA TE N C Y I S F O R E V (1)

tarpitcod (822436) | more than 3 years ago | (#35047086)

Some of the new ARM cores are getting interesting. I do wonder how much market share ARM will win from x86. You're right about the DDR specs smelling like QAM. They are doing a great job at getting more bandwidth, but the latency sucks worse than ever. When it gets too much we will finally see processors distributed in memory, and Cray 3/SSS here we come...

I keep thinking more and more often that Amdahl's 'wafer scale' processor needs to be revisited. If you could build a, say, 3 centimeter square LN2-cooled CPU that had a few hundred MB of on-chip SRAM, that would be an improvement. You wouldn't need signal pins, just power pins. Replace the memory pins with interconnect pins (you might as well), and then cram a bunch of them into a cube, Cray 3 style.

I'm sure there would be a market for even a single fairly typical current x86 chip like that, let alone a bunch of them. It should still be an order of magnitude lower latency than external stuff. It wouldn't solve the problem, but after you've built the 3 cm square chip, figure out how to do it with a 4 cm square chip that's LN2 cooled.

If you didn't want to do x86 you could tailor the instruction set to add a bunch of useful synchronization stuff, and most critically you'd want the processor to be deterministic with regard to timing, so the compilers could know that if you accessed the memory at the edges of the chip it was slower. If you got rid of all the cache you could even try explicitly handling dependencies with predicate registers. I'd have to think more about whether predicate registers versus cache makes sense, but it seems like it would make things way easier for the compiler.

You could say it would be a design intended to remove all the latencies of trying to drive all those memory and address lines fast, kind of like a transputer...

Re:LA TE N C Y I S F O R E V (1)

sjames (1099) | more than 3 years ago | (#35054296)

The key part there is getting the memory up to the CPU speed. On-die SRAM is a good way to do that. It's way too expensive for a general purpose machine, but this is a specialized application. A few hundred MB would go a long way, particularly if either a DMA engine or separate general purpose CPU was handling transfers to a larger but higher latency memory concurrently. By making the local memory large enough and manually manageable with concurrent DMA, it could actually hide the latency of DDR SDRAM.

For applications where parallelization was possible, additional CPUs could be interconnected through fast links as you suggest.

On the lower end, just adding a fast common scratch space to multi-core CPUs so they could actually do locking in just a few cycles or even do very low latency message passing would be a big help. On the OS side, the ability to dedicate a core exclusively to a task could be useful.

Re:LA TE N C Y I S F O R E V (1)

tarpitcod (822436) | more than 3 years ago | (#35055506)

I thought about this some more and came to the same conclusion re external memory. I was trying to weigh the relative merit of very fast, very small (say 4K instructions) channel processors that can stream memory into the larger SRAM banks. The idea would be DMA on steroids. If you're going to build a DMA controller and have the transistor budget, then replacing a DMA unit with a simple in-order fast core might be a win, especially if it was fast enough that you could do bit vector stuff / record packing and unpacking down in the channels. The caveat being you need to keep the timing consistent, so if that became difficult I would go back to just straight DMA.

The downside of adding lots of channels with the necessary drivers for memory is you'll end up needing more pins and then driving those pins, so power distribution becomes a factor. I think there's real benefit in surfacing the timing details of the channel/memory processors to compilers, having compilers able to schedule things taking into account the performance of the larger memory.

I really like your idea of a common fast scratch space with deterministic timing for locking. I think that's a great idea and would hugely improve things on commodity CPU's.

Looking at the above you end up with a picture of basically a Crayesque machine. It occurs to me that if you added FP to the channel processors then you can start thinking about them like vector pipelines. I'm not sure that makes sense; it seems like doing DMA first, then maybe a simple in-order integer processor, and then maybe FP would be a growth path for channels.

I wish I knew more about the limitations of putting SRAM on a die. Doing some searching, it looks like Tukwila's L3 cache is 24 MB (see http://forums.anandtech.com/showthread.php?t=2093673). I think that's at 1.6 GHz, so 25 ns. If you could cool it at least as well as the ETA guys did and get say 4x, you're down to ~6 ns for that 24 MB.

One real trade-off question to me would be at that point, how long would it take to do RDMA / use a channel processor to fetch from a different CPU versus talk to a bunch of external DRAM? At which point your deterministic timing goes out the window so maybe DRAM is the way to go...

Re:LA TE N C Y I S F O R E V (1)

sjames (1099) | more than 3 years ago | (#35060236)

The channel controllers are a good idea. One benefit to that is there need be no real distinction between accessing another CPU's memory and an external RAM, other than the speed/latency. So long as all off-chip access is left to the channel controllers, with the CPU only accessing its on-die memory, variable timing off chip wouldn't be such a big problem. Only the channel controller would need to know.

The SDRAM memory controller itself and all the pins necessary to talk to SDRAM modules can be external to the CPU if necessary. It would be just one more thing a channel controller might talk to. If the inter-chip communication is hypertransport like with smarter endpoints, the pin count can be kept under control. The current Opteron 8 and 12 core chips support 4 memory channels and (I think) 4 Hypertransport channels. Drop the on board memory channels and it starts looking quite reasonable.

If channel controllers also handle disk access, it looks like a mainframe. If they map the disk into the address space like very slow RAM, it looks like an AS/400. I suppose the supercomputer/mainframe/mini distinction would be a matter of how many of what are included in the system, and perhaps which OS is loaded.

I like the new supercomputing graphic (1)

GameboyRMH (1153867) | more than 3 years ago | (#35034500)

That little Cray thing looks really nice. Nice work, whoever did it. Reminds me of '90s side-scrolling games for some reason.

It's Von Neuman's fault (2)

ka9dgx (72702) | more than 3 years ago | (#35034602)

I read what I thought were the relevant sections of the big PDF file that went along with the article. They know that the actual RAM cell power use would only be 200 kW for an exabyte, but the killer comes when you address it in rows, columns, etc... then it goes to 800 kW, and then when you start moving it off chip, etc... it gets to the point where it just can't scale without running a generating station just to supply power.

What if instead of trying to address everything that way, they break up the computing and move it to the data... so that RAM is tied directly to the logic that would use it... it would waste some logic gates, but the power savings would be more than worth it.

Instead of having 8-kbit rows... just a 16x4 bit look-up table would be the basic unit of computation. Globally read/writable at setup time, but otherwise only accessed via single-bit connections to neighboring cells. Each cell would be capable of computing 4 single-bit operations simultaneously on the 4 bits of input, and passing them to their neighbors.

This bit processor grid (bitgrid) is Turing complete, and should be scalable to the exaflop scale, unless I've really missed something. I'm guessing somewhere around 20 megawatts for first-generation silicon, then more like 1 megawatt after a few generations.
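As a rough sketch of how one such cell might look in code (this is my reading of the description above, not the poster's actual design): 4 input bits index a 16-entry table, and each entry holds the 4 output bits passed on to the neighbors.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint8_t lut[16];   /* 16 x 4-bit look-up table; only the low nibble is used */
    } bitgrid_cell;

    /* Evaluate one cell: inputs and outputs are packed into the low 4 bits. */
    static uint8_t cell_eval(const bitgrid_cell *c, uint8_t inputs)
    {
        return c->lut[inputs & 0x0F] & 0x0F;
    }

    int main(void)
    {
        /* Example configuration: output bit 0 = AND of input bits 0 and 1,
         * output bit 1 = XOR of input bits 2 and 3, other outputs zero. */
        bitgrid_cell c;
        for (int in = 0; in < 16; in++) {
            int b0 = (in >> 0) & 1, b1 = (in >> 1) & 1;
            int b2 = (in >> 2) & 1, b3 = (in >> 3) & 1;
            c.lut[in] = (uint8_t)((b0 & b1) | ((b2 ^ b3) << 1));
        }
        printf("inputs 0b0111 -> outputs 0x%X\n", cell_eval(&c, 0x7));  /* prints 0x3 */
        return 0;
    }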

Re:It's Von Neuman's fault (1)

mangu (126918) | more than 3 years ago | (#35034742)

Non-von-Neumann supercomputers have been built; look at this hypercube topology [wikipedia.org].

The problem is the software, we have such a big collection of traditional libraries that it becomes hard to justify starting over in an alternative way.

Re:It's Von Neuman's fault (3, Interesting)

Animats (122034) | more than 3 years ago | (#35035024)

What if instead of trying to address everything that way, they break up the computing and move it to the data... so that RAM is tied directly to the logic that would use it.

It's been tried. See Thinking Machines Corporation [wikipedia.org] . Not many problems will decompose that way, and all the ones that will can be decomposed onto clusters.

The history of supercomputers is full of weird architectures intended to get around the "von Neumann bottleneck". Hypercubes, SIMD machines, dataflow machines, associative memory machines, perfect shuffle machines, partially-shared-memory machines, non-coherent cache machines - all were tried, and all went to the graveyard of bad supercomputing ideas.

The two extremes in large-scale computing are clusters of machines interconnected by networks, like server farms and cloud computing, and shared-memory multiprocessors with hardware cache consistency, like almost all current desktops and servers. Everything else, with the notable exception of GPUs, has been a failure. Even the Cell, the most widely deployed non-standard architecture ever, was only used in the PS3, and was more trouble than it was worth.

Re:It's Von Neuman's fault (1)

Mechanik (104328) | more than 3 years ago | (#35037680)

Even the Cell, the most widely deployed non-standard architecture ever, was only used in the PS3, and was more trouble than it was worth.

I think you are forgetting about the Roadrunner supercomputer [wikipedia.org] which has 12,960 PowerXCell processors. It was #1 on the supercomputer Top 500 in 2008. It's still at #7 as of November 2010.

Re:It's Von Neuman's fault (0)

Anonymous Coward | more than 3 years ago | (#35037748)

Everyone has tried to hide latency, but the hard truth is there is only so much parallelism readily extractable from some problems, and then it's all about the latency.

Re:It's Von Neuman's fault (0)

Anonymous Coward | more than 3 years ago | (#35039756)

http://www.greenarraychips.com/

Chuck Moore, the inventor of FORTH, created these neat chips. 64 words of RAM and 64 words of ROM per chip.

Re:It's Von Neuman's fault (1)

ka9dgx (72702) | more than 3 years ago | (#35040740)

All of the examples you all gave to this point are still conventional CPUs with differences in I/O routing.

I'm proposing something with no program counter, no stack, etc... just pure logic computation.

And no, it's not an FPGA because those all have lots of routing as well.

Sounds familiar... (1)

Bill, Shooter of Bul (629286) | more than 3 years ago | (#35034868)

I liked this computer before, when it was called a Beowulf cluster.

Re:Sounds familiar... (1)

conspirator57 (1123519) | more than 3 years ago | (#35035708)

indeed. and when it was cheap(er) than supercomputers. and when supercomputer vendors looked down their noses at it.

67 Megawatts? (2)

Waffle Iron (339739) | more than 3 years ago | (#35035782)

That doesn't seem like a show stopper. In the 1950s, the US Air Force built over 50 vacuum tube SAGE computers for air defense. Each one used up to 3 MW of power and probably wasn't much faster than an 80286. They didn't unplug the last one until the 1980s.

If they get their electricity wholesale at 5 cents/kWh, 67 MW would cost about $30,000,000 per year. That's steep, but probably less than the cost to build and staff the installation.
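The parent's estimate checks out; the arithmetic, for reference:

    #include <stdio.h>

    int main(void)
    {
        double power_mw   = 67.0;   /* the 67 MW figure from the article    */
        double price_kwh  = 0.05;   /* wholesale rate assumed by the parent */
        double hours_year = 24.0 * 365.0;

        double annual_cost = power_mw * 1000.0 * hours_year * price_kwh;
        printf("Annual electricity bill: $%.0f\n", annual_cost);  /* ~$29 million */
        return 0;
    }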

Re:67 Megawatts? (1)

Peeteriz (821290) | more than 3 years ago | (#35041974)

As the article says, it's not the power requirements but the heat that worries them.

67 MW of heat spread out in 50 buildings is ok; 67 MW of heat in a shared-memory device that needs to be physically small and compact for latency reasons may make it impossible.

Super Computers, heat and the future (0)

Anonymous Coward | more than 3 years ago | (#35035796)

In spite of some of the benefits of cloud computing or cluster computing, supercomputers are still needed, and we need to go way beyond where people are currently thinking. The integration of Seebeck-effect materials into the electron paths of processors, as well as integrated layers of Seebeck-effect materials inside processors, can reduce the internal core temperature of processors. In addition, advanced manufacturing can be used to integrate fluid-based meso and micro cooling channels that, when combined with nucleated phase change, can super-cool super-computers as well as super-conductors. Lot of supers there, but the future is full of super ideas and this professor is just out of the loop.

No worries... (1)

Twinbee (767046) | more than 3 years ago | (#35035802)

And while giant high end supercomputers may progress more slowly, we're slowly seeing a revolution in personal supercomputing, where everyone can have a share of the pie. Witness CUDA, OpenCL, and projects like GPU.NET [tidepowerd.com] (.NET for the GPU, and apparently easy to use, though expensive for now).

Along with advancements such as multitasking in the next generation of GPUs (yes, they can't actually multitask yet, but when they do it'll be killer for a few reasons), and shared memory with the CPU (by combining both the GPU and CPU into one die, which goes into the CPU socket), I think the future will be very interesting.

He forgets about software developments (1)

parallel_prankster (1455313) | more than 3 years ago | (#35035862)

His statements are both true and false. It's true that exaflops is a big challenge; however, research on supercomputers has not stopped. There are other areas being looked at too. For example, algorithms: whenever a new supercomputer is developed, parallel programmers try to modify or come up with new algorithms that take advantage of the architectures/network speeds to make things faster. Heck, some applications have started looking at avoiding huge computations and instead going for incorrect/approximate ones and converging on the solution. There are many apps that are OK with approximate results. Also, new denser and less power-hungry memory technologies have come up, and some of those links have already been posted here so I am not going to post them. Besides, there are a bunch of better cooling techniques proposed by IBM, and the use of optical interconnects, which can help organize things in a more distributed way and provide easier heat dissipation. All in all, someone is going to come up with exaflop computing soon :)

Re:He forgets about software developments (0)

Anonymous Coward | more than 3 years ago | (#35036688)

I read TFA and the author did not forget about software development. The article hypothesized that software/algorithms could be the biggest problem of all, since in his words most of the processors would be treading water most of the time.

I think it's pretty obvious that the main challenge moving into the future is better development tools for software and firmware. Maybe model-based development tools will help a bit with this. Maybe that's a pipe-dream and we're all fooling ourselves. The only way to find out is to keep trying to make stuff...

Not because they are easy but becase they are hard (0)

Anonymous Coward | more than 3 years ago | (#35036284)

I definitely agree with DARPA's conclusion that we cannot build an exascale computer. Thankfully, I think the Chinese might sell us some time on theirs, as long as we promise not to use it for nuclear-weapons simulations. With the money we save by outsourcing that job, we could probably end the estate tax once and for all.

This is COMPLETELY WRONG! (0)

Anonymous Coward | more than 3 years ago | (#35036412)

Ray Kurzweil says so!

Yes, a forecast with CURRENT technology (2)

Luke_2010 (1515829) | more than 3 years ago | (#35037048)

I've read the article (the WHOLE article) and the exaflop issue is generally posed in terms of power requirements with reference to current silicon technology and its most closely related future advancements. The caveat is that not even IBM thinks exaflop computing can be achieved with current technology; that's why they are deeply involved with photonic CMOS, of which they have already made the first working prototype. Research into exaflop computing at IBM is largely based on that. You can't meet the necessary power requirements without moving (at least in part) from electronics to photonics. This will decrease power requirements (and cooling requirements) by a large factor.

Yes but what about your desktop? (1)

deadline (14171) | more than 3 years ago | (#35037534)

Check out the Limulus Project [basement-s...puting.com]

Well said. (1)

Sylverius (1785470) | more than 3 years ago | (#35037892)

Well said, sir.

Jim

A virtual cloud based super computer? (3, Funny)

Yaos (804128) | more than 3 years ago | (#35038404)

Why has nobody tried this before? They could easily plow through the data from SETI, fold proteins, or even have a platform for creating and distributing turnkey cloud-computing solutions! It's too bad the cloud wasn't invented until a year or two ago; this stuff probably could have started back in 1999 if the cloud had existed then.

Re:A virtual cloud based super computer? (1)

WorBlux (1751716) | more than 3 years ago | (#35039980)

Because the word "cloud" doesn't refer to any new innovation; it's marketing, just a new term for an old idea. Cloud means either a distributed or a non-trivial client-server computation over the public internet, and that's been around forever. SETI already makes use of what could be described as cloud computing. The reason it works now rather than then is the ubiquity of broadband, machines that sit mostly idle, and a growing number of programmers who know how to split very large problems into pieces small enough to use that idle time and get a result in hours or days rather than weeks or years.
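
The recipe hasn't changed since SETI@home either: chop the job into independent work units, hand them out, and add up the partial results as they come back. A bare-bones sketch of that split (with the "ship it over the network to a volunteer" part reduced to a comment):

<ecode>
#include <stdio.h>

#define CHUNKS 64                 /* one work unit per (imaginary) volunteer machine */
#define STEPS_PER_CHUNK 1000000L

/* One work unit: integrate 4/(1+x^2) over its own slice of [0,1].
 * In a real grid/cloud setup this would run on a remote client and the
 * partial result would come back over the network hours later. */
static double work_unit(int chunk)
{
    double h = 1.0 / (double)(CHUNKS * STEPS_PER_CHUNK);
    double sum = 0.0;
    long start = (long)chunk * STEPS_PER_CHUNK;
    for (long i = start; i < start + STEPS_PER_CHUNK; i++) {
        double x = (i + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * h;
}

int main(void)
{
    double pi = 0.0;
    for (int chunk = 0; chunk < CHUNKS; chunk++)
        pi += work_unit(chunk);   /* the "server" just sums the pieces, in any order */
    printf("pi ~= %.10f\n", pi);
    return 0;
}
</ecode>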

Re:A virtual cloud based super computer? (1)

km_2_go (1404213) | more than 3 years ago | (#35040056)

Mod parent up. For those over whose heads this has flown: distributed supercomputing is not a new idea. It has been implemented for quite a while, most famously with SETI@home. "Cloud" is merely a new word for an old concept.

FPGA array (0)

Anonymous Coward | more than 3 years ago | (#35038414)

What about a supercomputer made out of FPGAs?

Just put a bunch of FPGAs on a board, and write the real hardcore tasks in a hardware description language like VHDL...

Re:FPGA array (1)

WorBlux (1751716) | more than 3 years ago | (#35039852)

Latency. Programmable gates simply aren't as fast as dedicated ones. You could make new hardware for every application, but that's expensive.

Use computing nodes as electric heaters (1)

Paul Fernhout (109597) | more than 3 years ago | (#35040692)

I disclosed this sort-of-cogeneration idea before on the open manufacturing list so that no one could patent it, but for years I've been thinking that the electric heaters in my home should be supercomputer nodes (or doing other industrial process work), controlled by thermostats (or controlled by some algorithm related to expectations of heat needs).

When we want heat, the processors kick on and do some computing, and we get the waste heat to warm our home. When the house is warm enough, they shut down. They would use the network to talk to the rest of the nodes in neighbors' homes, or in homes across the globe, to form a supercomputing cloud. Basically, any place in the country that has an electric heater (or something similar) could have a processing node instead (this includes water heating too, and even things like kilns). (Hydroponic agriculture would be another example: instead of computing, grow plants in winter with grow lights controlled by thermostats, heating algorithms, or timers.)
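
A rough sketch of the control loop I have in mind (the temperature reading and the work-unit call are just stand-ins; a real node would talk to a sensor and to a SETI@home-style work server):

<ecode>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define SETPOINT_C   20.0
#define HYSTERESIS_C  0.5

/* Stand-ins: pretend sensor and pretend workload. */
static double read_room_temp_c(void) { return 19.0; /* pretend it's chilly */ }
static void run_work_unit(void) { /* crunch numbers; dissipates roughly rated watts */ }

int main(void)
{
    bool heating = false;
    for (int minute = 0; minute < 3; minute++) {   /* for (;;) in real life */
        double t = read_room_temp_c();
        if (t < SETPOINT_C - HYSTERESIS_C) heating = true;
        if (t > SETPOINT_C + HYSTERESIS_C) heating = false;

        if (heating) {
            printf("%.1f C: below setpoint, crunching a work unit\n", t);
            run_work_unit();          /* the useful computation IS the heater */
        } else {
            printf("%.1f C: warm enough, processors parked\n", t);
            sleep(60);                /* no heat wanted, so no computing */
        }
    }
    return 0;
}
</ecode>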

For reference, for those who don't know much physics: essentially all use of electricity produces waste heat eventually, so a computer that draws 100 watts heats the room as much as a 100 watt heater does. The same goes for a 100 watt incandescent light bulb, which also doubles as a 100 watt heater. For those who live in cold climates (heating much of the time) in homes that are not very well insulated, paying more for energy-efficient appliances may not pay off, because the electric heaters just have to pick up the slack left by the more efficient lights.

I don't know the industrial figures, but for residential electric heating use in 2001:
    http://www.eia.doe.gov/emeu/reps/enduse/er01_us.html [doe.gov]
"Electric space heating accounted for an additional 116 billion kWh (10 percent of the total)... Electric water heating accounted for over 100 billion kWh (9 percent) in 2001."

So that is about 200 billion kWh per year, or about 23 gigawatts continuously. It is "free" power to use for computing in a sense. (I know, it would need to be networked -- maybe with integrated wireless of some sort?) So, that would be enough power for about 46 of the 500 MW computers they mention in the article. The cost savings would be (at US$0.10 per kWh) 20 billion dollars a year in energy costs. Looking around, commercial buildings use about the same amount of electric heating. Electric use has increased over the past decade, as well. So potentially 100 or so of these exaflops machines could be powered by residential and commercial heating needs alone.
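
Spelling that arithmetic out explicitly (using the round 200 billion kWh/year figure and the 500 MW machine size mentioned above):

<ecode>
#include <stdio.h>

int main(void)
{
    double kwh_per_year   = 200e9;      /* residential electric heat, circa 2001 */
    double hours_per_year = 8760.0;
    double avg_watts      = kwh_per_year * 1000.0 / hours_per_year;

    double machine_watts  = 500e6;      /* one 500 MW exaflops-class machine */
    double price_per_kwh  = 0.10;       /* US$ */

    printf("continuous power: %.1f GW\n", avg_watts / 1e9);           /* ~22.8 GW */
    printf("500 MW machines:  %.0f\n",   avg_watts / machine_watts);  /* ~46 */
    printf("energy cost/year: $%.0f billion\n",
           kwh_per_year * price_per_kwh / 1e9);                       /* ~$20 billion */
    return 0;
}
</ecode>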

I don't know what the figure would be for industrial process heat. If we shift away from fossil fuels and towards more energy from PV, wind, nuclear, and cold fusion, there might be terawatts of power available to use for computing in this way, with the waste heat (on demand) then driving industrial processes like making plastic or refining ore. Waste heat could also drive heat engines for mechanical action. So industrial processes might be able to power (for "free") thousands of these supercomputers.

Large datacenters could also be located in places that wanted the heat, like near big buildings. Power plants sometimes have industrial plants near them that want their waste heat already, so this would be a similar thing. The datacenter waste heat could also be concentrated by heat-pumps and used for industrial processes (like melting silicon to make solar cells or IC chips).

I guess with cold fusion in the air (with the Italy demo claim) I should disclose the idea of integrating cold fusion power production (such as without limitation nickel/hydrogen fusion) directly into, or adjacent to, computing nodes that somehow directly use the energy, either electricity generated someway or even running directly off any generated radiation. These too could also be thermostat controlled (or controlled by some algorithm related to expectations of heat needs).

One problem long term with this is that the state of the art now in green home design is a house without a furnace (just well insulated, and with an air-to-air heat exchanger).
http://www.google.com/search?q=no+furnace [google.com]
In general, such homes are heated by the waste heat of the occupants (100 watts per person, 40 watts per poodle) and the waste heat of appliances like water heaters, cooking stoves, and personal electronics.

Now, we may well see better supercomputers. The human brain does a lot of calculations (we don't know for sure how much but beyond exaflops from one perspective), and it only takes about 35 watts or so. So, clearly, there is room to make exaflops computers with new designs that don't require 500 million watts to run. Perhaps things like 3D computer chips, or perhaps optical computing, or quantum computing, or bacterial computing, or maybe something else entirely -- even tapping into a fundamental computational nature of the universe (Edward Fredkin says the universe is a computer)?

For a paper I wrote around 1982 for a physics class at Princeton taught by Gerry O'Neill, I suggested that optical computing would be essential to creating faster computers, to allow high-speed interconnections and to deal with other limitations of computing with electrons. I'm surprised this article did not mention optical computing. Gerry did not seem to like it either; I think the paper got a B or something, as he (or the TA?) did not see the connection to physics. :-) I told another professor (Harvey Lam) about that, and he said essentially that it is the kind of thing to keep around for an "I told you so" about the foibles of scientists in relation to technology.

Sadly, Gerry passed away years ago. :-( A great man, missed by many. I wonder if he would have lived to see space habitats built if he had gotten enough supplemental vitamin D while working so hard to ensure humanity had a future? And eating more vegetables, fruits, and beans like Dr. Joel Fuhrman recommends to prevent cancer? We'll never know. At least others can use that health knowledge which I found through using the Google supercomputer to live longer and better. Some more details here (if slashdot would only expand comment links right again):
http://slashdot.org/comments.pl?sid=1692444&cid=32644166 [slashdot.org]
