
IEEE Says Multicore is Bad News For Supercomputers

timothy posted more than 5 years ago | from the unexpected-downsides dept.

Supercomputing 251

Richard Kelleher writes "It seems the current design of multi-core processors is not well suited to supercomputers. According to IEEE: 'Engineers at Sandia National Laboratories, in New Mexico, have simulated future high-performance computers containing the 8-core, 16-core, and 32-core microprocessors that chip makers say are the future of the industry. The results are distressing. Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.'"


251 comments


Time for vector processing again (5, Insightful)

suso (153703) | more than 5 years ago | (#26001307)

Sounds like it's time for supercomputers to go their own way again. I'd love to see some new technologies.

Re:Time for vector processing again (5, Interesting)

jellomizer (103300) | more than 5 years ago | (#26001487)

I've always felt there was something odd about the recent trend of supercomputers using common hardware components. They have really lost their way in supercomputing by just making a beefed-up PC running a version of a common OS that could handle it, or by clustering a bunch of PCs together. Multi-core technology is good for desktop systems, since those run a lot of relatively small apps, rarely taking advantage of more than 1 or 2 cores per app; in other words, it allows multitasking without a penalty. We don't use supercomputers that way. We use them to run one app that takes huge resources, would take hours or years on your PC, and spits out results in seconds or days. Back in the early-to-mid '90s we had different processors for desktops and supercomputers. Yes, it was more expensive for the supercomputers, but if you are going to pay millions of dollars for a supercomputer, what's the difference if you need to pay an additional $80,000 for more custom processors?

Re:Time for vector processing again (1)

suso (153703) | more than 5 years ago | (#26001523)

Yes, I agree, and I have the same odd feeling. The first time I read an article about (I think) Los Alamos ordering a supercomputer with 8,192 Pentium Pro processors in it, I was like, WTF?

I miss the days when supercomputers looked like alien technology or something out of Raiders of the Lost Ark.

Re:Time for vector processing again (1, Funny)

Anonymous Coward | more than 5 years ago | (#26001749)

I miss the days when supercomputers looked like alien technology or something out of Raiders of the Lost Ark.

How about supercomputers that look like alien technology from Kingdom of the Crystal Skull?

George Lucas and Steven Spielberg raped supercomputing!

Re:Time for vector processing again (4, Insightful)

postbigbang (761081) | more than 5 years ago | (#26002471)

Looks are deceptive.

The problem with multicores relates to the fact that the cores are processors, but the relationship to other cores and to memory isn't fully 'crossbar'. Sun did a multi-CPU architecture that's truly crossbar (meaning there are no dirty-cache problems or semaphore latencies) among the processors, but the machine was more of a technical achievement than a decent workhorse for day-to-day use.

Still, cores are cores. More cores aren't necessarily better until you fix what they describe. And it doesn't matter what they look like at all. Like any other system, it's what's under the hood that counts. Esoteric-looking shells are there for marketing purposes and cost justification.
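A minimal sketch of the 'dirty cache' penalty being described, assuming a Linux/POSIX toolchain (the iteration count and the 64-byte line-size guess are illustrative, not claims about any particular chip): two threads bump counters that sit in the same cache line, then the same counters padded onto separate lines, and each case is timed.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define ITERS 100000000UL

/* Two counters in one cache line: each increment by one core invalidates
   the line in the other core's cache (false sharing / "dirty cache"). */
static struct { volatile uint64_t a, b; } same_line;

/* Same counters padded apart, assuming a 64-byte cache line. */
static struct { volatile uint64_t a; char pad[56]; volatile uint64_t b; } padded;

static void *bump(void *p) {
    volatile uint64_t *c = p;
    for (uint64_t i = 0; i < ITERS; i++) (*c)++;
    return NULL;
}

static double run_pair(volatile uint64_t *x, volatile uint64_t *y) {
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, bump, (void *)x);
    pthread_create(&tb, NULL, bump, (void *)y);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("shared cache line: %.2f s\n", run_pair(&same_line.a, &same_line.b));
    printf("padded counters:   %.2f s\n", run_pair(&padded.a, &padded.b));
    return 0;
}

Compile with something like gcc -O2 -pthread falseshare.c (build flags are an assumption about the toolchain); on a typical multi-core box the padded case finishes several times faster.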

Re:Time for vector processing again (5, Insightful)

virtual_mps (62997) | more than 5 years ago | (#26001679)

It's very simple. Intel & AMD spend about $6bn/year on R&D. The total supercomputing market is on the order of $35bn (out of a global IT market on the order of $1000bn) and a big chunk of that is spent on storage, people, software, etc., rather than processors. That market simply isn't large enough to support an R&D effort which will consistently outperform commodity hardware at a price people are willing to pay. Even if a company spent a huge amount of money developing a breakthrough architecture which dramatically outperformed existing hardware, the odds are that the commodity processors would catch up before that innovator recouped its development costs. Certainly they'd catch up before everyone rewrote their software to take advantage of the new architecture. The days when Seymour Cray could design a product which was cutting edge & saleable for a decade are long gone.

Re:Time for vector processing again (1)

Timothy Brownawell (627747) | more than 5 years ago | (#26001831)

If they can make superconducting FETs [newscientist.com] that can be manufactured on ICs, I could see there being a very big difference that will last until they can reach liquid nitrogen temperatures (at which point it goes mainstream and cryogenics turns into a boom industry for a while).

Re:Time for vector processing again (2, Interesting)

David Gerard (12369) | more than 5 years ago | (#26002041)

I eagerly await the Slashdot story about an Apple laptop with liquid nitrogen cooling. Probably Alienware will do it first.

Re:Time for vector processing again (5, Interesting)

hey! (33014) | more than 5 years ago | (#26002739)

It may be true that "that market simply isn't large enough to support an R&D effort which will consistently outperform commodity hardware at a price people are willing to pay," but that's not quite tantamount to saying "there is no possible rational justification for a larger supercomputer budget." There are considerable inflection points and external factors to consider.

The market doesn't allocate funds the way a central planner does. A central planner says, "there isn't room in this budget to add to supercomputer R&D." The way the market works is that commodity hardware vendors beat each other down until everybody is earning roughly similar normal profits. Then somebody comes along with a set of ideas that could double the rate at which supercomputer power is increasing. If that person is credible, he is a standout investment, not just despite the fact that there is so much money being poured into commodity hardware, but because of it.

There may also be reasons for public investment in R&D. Naturally the public has no reason to invest in commodity hardware research, but it may have reason to look at exotic computing research. Suppose that you expected to have a certain maximum practical supercomputer capability in twenty years' time. Suppose you figure that once you have that capability you could predict a hurricane's track with several times the precision you can today. It'd be quite reasonable to put a fair amount of public research funds into supercomputing in order to have that capability in five to ten years' time instead.

Re:Time for vector processing again (3, Informative)

Retric (704075) | more than 5 years ago | (#26001757)

Modern CPU's have 8+ Mega Bytes of L2/L3 cache on chip so RAM is only a problem when your working set it larger than that. The problem super computing folks are having is they want to solve problems that don't really fit in L3 cache which creates significant problems but they still need a large cache. However, because of speed of light issues off chip ram is always going to be high latency so you need to use some type of on chip cache or stream lot's off data to the chip.

There are really only 2 options for modern systems when it comes to memory you can have lot's of cores and a tiny cache like GPU's or lot's of cache and fewer cores like CPU's. (ignoring type of core issues and on chip interconnects etc.) So there is little advantage to paying 10x per chip to go custom vs using more cheaper chips when they can build supper computers out of CPU's, GPU's, or something between them like the Cell processor.
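A rough way to see the cache-size cliff described above, assuming Linux/glibc and gcc (buffer sizes and iteration counts are arbitrary choices for illustration): pointer-chase through progressively larger working sets and watch the time per access jump once the set no longer fits in the last-level cache.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile size_t sink;   /* keeps the chase loop from being optimized away */

/* Walk a random single cycle over n slots and return nanoseconds per access.
   Once n * sizeof(size_t) exceeds the last-level cache, this jumps from
   cache latency toward DRAM latency. */
static double chase(size_t n, size_t steps) {
    size_t *next = malloc(n * sizeof *next);
    if (!next) { perror("malloc"); exit(1); }
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {          /* Sattolo's algorithm: one big cycle */
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t s = 0; s < steps; s++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
}

int main(void) {
    for (size_t mib = 1; mib <= 256; mib *= 2) {
        size_t n = mib * 1024 * 1024 / sizeof(size_t);
        printf("%3zu MiB working set: %5.1f ns/access\n", mib, chase(n, 20u * 1000 * 1000));
    }
    return 0;
}

Build with something like gcc -O2 cachecliff.c (flag choice is an assumption); the jump in ns/access marks roughly where the working set stops fitting in cache.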

Re:Time for vector processing again (0, Troll)

Anonymous Coward | more than 5 years ago | (#26001889)

There are really only 2 options for modern systems when it comes to memory you can have lot's of cores and a tiny cache...

Given that for some bizarre reason you think a simple plural like 'lots' needs an aprostrophe, why doesn't your sentence read
"There are really only 2 option's for modern system's when it comes to memory you can have lot's of core's and a tiny cache...

At least it would be consistently wrong then.

Re:Time for vector processing again (2, Insightful)

LingNoi (1066278) | more than 5 years ago | (#26002255)

This is slashdot, our professions are computer related not literature based. You're on the wrong website.

Re:Time for vector processing again (2, Funny)

Anonymous Coward | more than 5 years ago | (#26002389)

Because I'm sure inserting a random apostrophe into your code would make it run just fine...

Re:Time for vector processing again (1)

X0563511 (793323) | more than 5 years ago | (#26002665)

It's a shame we don't compile written word. ...

Programming is not literature, it's machine instructions.

Re:Time for vector processing again (5, Insightful)

knails (915340) | more than 5 years ago | (#26002453)

No, proper spelling and grammar are important for everyone, not just English majors. With computers so important, if computer professionals cannot use the language correctly, then who will? We cannot let ignorant people degrade the quality of language and thereby remove beauty and subtle distinctions between similar words just because they're too lazy to conform to standards. If a linguist misused/ignored computing standards, would you not correct them, even though it's not their chosen field of study?

Re:Time for vector processing again (2, Informative)

Pharmboy (216950) | more than 5 years ago | (#26002669)

Yes, that is what I want, a super computer designed by an English major...

Please get over yourself. This is slashdot, not something important like a resume or will.

Re:Time for vector processing again (2, Insightful)

Dishevel (1105119) | more than 5 years ago | (#26002795)

I do not study literature. I do not like those that do. Come on though. Knowing the difference between adding an "s" in a plural or possessive situation is truly basic. If you want to sound like a complete idiot then don't mangle true English. Just speak Ebonics.

Re:Time for vector processing again (5, Funny)

knewter (62953) | more than 5 years ago | (#26002749)

Hey dipshit. When you mock someone's grammar, you'd sure as fuck better not mis-spell 'apostrophe'

Idiot.

I'll paste it a few times so you can look at your grotesque failure more:

aprostrophe
aprostrophe
aprostrophe
aprostrophe

See how stupid that looks?

Re:Time for vector processing again (5, Interesting)

necro81 (917438) | more than 5 years ago | (#26002523)

A problem related to the speed of memory access is its energy cost. According to an IEEE Spectrum Radio [ieee.org] piece interviewing Peter Kogge, current supercomputers can spend many times more energy shuffling bits around than operating on them. Today's computers can do a double-precision (64-bit) floating-point operation using about 100 picojoules. However, it takes upwards of 30 pJ per bit to get the 128 bits of operand data loaded into the CPU's floating-point unit and then to move the 64-bit result elsewhere.

Actual math operations consume 5-10% of a supercomputer's total power, while moving data from A to B approaches 50%. Most optimization and innovation in the past few decades has gone into compute algorithms in the CPU core, and very little has gone into memory.
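As a rough back-of-the-envelope check on those figures (taking 30 pJ/bit as a representative value rather than a measurement): 128 bits of operands in plus 64 bits of result out is 192 bits x 30 pJ/bit, or roughly 5,800 pJ of data movement, against about 100 pJ for the floating-point operation itself. Per operation, moving the data can cost on the order of fifty times the arithmetic, which is why the system-level share of power spent on data movement is so large.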

Re:Time for vector processing again (5, Insightful)

AlpineR (32307) | more than 5 years ago | (#26001759)

My supercomputing tasks are computation-limited. Multicores are great because the cores share memory, which saves me the overhead of porting my simulations to distributed-memory multiprocessor setups. I think a better summary of the study is:

Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.

Re:Time for vector processing again (1)

Vellmont (569020) | more than 5 years ago | (#26002509)


Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.

I thought the same thing. Years ago with the massively-parallel architectures you could have said that massively-parallel architectures don't help inherently serial tasks.

The other thing I wonder is how server and desktop tasks will drive the multi-core architecture. It may be the case that many common server and desktop tasks have massive I/O needs (gaming?). The current memory architectures aren't set in stone, but I also doubt they'll be driven by what supercomputer customers need.

Re:Time for vector processing again (2, Insightful)

ipoverscsi (523760) | more than 5 years ago | (#26002699)

Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.

Computation is communication. It's communication between the CPU and memory.

The problem with multicore is that, as you add more cores, the increased bus contention causes the cores to stall so they cannot compute. This is why many real supercomputers have memory local to each CPU. Cache memory can help, but just adding more cache per core yields diminishing returns. SMP will only get you so far in the supercomputer world. You have to go NUMA for performance, which means custom code and algorithms.
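For reference, the 'custom code' can be as simple as keeping a thread and its buffer on the same node. A minimal sketch assuming Linux with libnuma available (node number and buffer size are arbitrary; link with -lnuma):

#include <numa.h>      /* libnuma: numa_available, numa_alloc_onnode, ... */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int node = 0;                         /* illustrative: first NUMA node */
    size_t bytes = 64UL * 1024 * 1024;    /* 64 MiB working buffer */

    numa_run_on_node(node);               /* keep this thread on that node's cores */
    double *buf = numa_alloc_onnode(bytes, node);  /* memory physically on that node */
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    /* ... do the node-local part of the computation on buf ... */
    for (size_t i = 0; i < bytes / sizeof(double); i++) buf[i] = 0.0;

    numa_free(buf, bytes);
    return 0;
}

The same placement can be done without code changes by launching a whole process under numactl, e.g. numactl --cpunodebind=0 --membind=0 ./app.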

Re:Time for vector processing again (3, Insightful)

timeOday (582209) | more than 5 years ago | (#26001955)

IMHO this study is not an indictment of the use of today's multi-core processors for supercomputers or anything else. They're simply pointing out that in the future (as core counts continue to grow exponentially) some memory bandwidth advances will be needed. The implication that today's multi-core processors are best suited for games is silly; where they're really well utilized is in servers, and they work very well there. The move towards commodity processors in supercomputing wasn't some kind of accident; it occurred because that's what currently gets the best results. I'd expect a renaissance in true supercomputing just as soon as it's justified, but I wouldn't hold my breath.

Re:Time for vector processing again (3, Informative)

yttrstein (891553) | more than 5 years ago | (#26002141)

We still have different processors for desktops and supercomputers.

http://www.cray.com/products/XMT.aspx

Rest assured, there are still people who know how to build them. They're just not quite as popular as they used to be, now that a middle manager who has no idea what the hell they're talking about can go to an upper manager with a spec sheet that's got 8 thousand processors on it and say "Look! This one's got a whole ton more processors than that dumb Cray thing!"

Re:Time for vector processing again (4, Informative)

TapeCutter (624760) | more than 5 years ago | (#26002145)

"Multi-Core technology is good for desktop systems as it is meant to run a lot of relatively small apps Rarely taking advantage of more then 1 or 2 cores. per app.In other-words it allows Multi-Tasking without a penalty. We don't use super computers that way. We use them to to perform 1 app that takes huge resources that would take hours or years on your PC and spit out results in seconds or days."

Sorry but that's not entirely correct, most super computers work on highly parallel problems [earthsimulator.org.uk] using numerical analysis [wikipedia.org] techniques. By definition the problem is broken up into millions of smaller problems [bluebrain.epfl.ch] that make ideal "small apps", a common consequence is that the bandwidth of the communications between the 'small apps' becomes the limiting factor.

"Back in the early-mid 90's we had different processors for Desktop and Super Computers."

The earth simulator was refered to in some parts as 'computenick', it's speed jump over it's nearest rival and longevity at the top marked the renaissance [lbl.gov] of "vector processing" after it had been largely ignored during the 90's.

In the end a supercomputer is a purpose built machine, if cores fit the purpose then they will be used.
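A minimal sketch of that decomposition pattern in MPI (grid size, iteration count, and the stencil itself are placeholders; the point is that each rank does a big local update and only its slab boundaries cross the network, which is exactly where interconnect bandwidth becomes the limit):

#include <mpi.h>
#include <stdlib.h>

#define LOCAL_N 1000000   /* cells owned by each rank (placeholder size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* local slab plus one ghost cell on each side */
    double *u = calloc(LOCAL_N + 2, sizeof *u);
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 100; step++) {
        /* halo exchange: this is where inter-node bandwidth becomes the limit */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* purely local update on the owned cells (placeholder stencil) */
        for (int i = 1; i <= LOCAL_N; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    free(u);
    MPI_Finalize();
    return 0;
}

Built with mpicc and run with mpirun -n <ranks>; launcher details depend on the MPI installation.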

Re:Time for vector processing again (1)

Johnny_Longtorso (90816) | more than 5 years ago | (#26002511)

You seem to have missed the point of COTS entirely. And you're way off the mark on the price differential - unless you're talking about an 8 CPU "supercomputer".

The whole reason HPTC bled down to COTS product was the outrageous costs of more proprietary hardware AND the fact that COTS product performance and reliability were on a massive upswing.

I work with LOTS of customers using HPTC - and very, very few of them are still running a single application. It's the nature of growth & development - there are older apps and there are newer apps.

There's a 235 node 10G InfiniBand connected HP-based supercomputer humming right behind me as I type this....

Re:Time for vector processing again (1)

IceCreamGuy (904648) | more than 5 years ago | (#26002065)

Maybe they should all just be simulated at Sandia!

Re:Time for vector processing again (2, Funny)

Methuselah2 (1173677) | more than 5 years ago | (#26002659)

That does it...I'm not buying a supercomputer this Christmas!

Re:Time for vector processing again (1)

DiegoBravo (324012) | more than 5 years ago | (#26002679)

>> I'd love to see some new technologies.

Yeah, it would be nice to see quantum computing ( http://en.wikipedia.org/wiki/Quantum_Computer [wikipedia.org] ) finally add a couple of arbitrary integers. Despite the many publications on the subject, it smells like the superstring theory of computing. I hope that's not the case.

Well doh (4, Insightful)

Kjella (173770) | more than 5 years ago | (#26001367)

If you run a simulation like that while keeping the memory interface constant, then of course you'll see diminishing returns. That's why we're not still running plain old FSBs: AMD has HyperTransport, Intel has QPI, the AMD Horus system expands it up to 32 sockets / 128 cores, and I'm sure something similar can and will be built as a supercomputer backplane. The headline is more than a little sensationalist...

Re:Well doh (4, Insightful)

cheater512 (783349) | more than 5 years ago | (#26001449)

There are limits, however, to what you can do.
It's not like multi-processor systems where each CPU gets its own RAM.

Re:Well doh (1, Insightful)

Anonymous Coward | more than 5 years ago | (#26001557)

There are limits, however, to what you can do.
It's not like multi-processor systems where each CPU gets its own RAM.

I like where your brain is. What if they reworked the board layouts so that each proc had its own bank, or two, of RAM? 4 CPUs with 2 GB each. I sense amazing possibilities here. Dual north/south bridges. I mean, if you write a smart enough CPU set, you could do amazing things for the future.

Re:Well doh (2, Informative)

jebrew (1101907) | more than 5 years ago | (#26001771)

That's how a lot of boards are already done. The issue is with a single processor that has multiple cores.

There's no real way to split the banks for each core, so the net effect is that you have 4-32 cores sharing the same lanes for memory.

Re:Well doh (2, Interesting)

Targon (17348) | more than 5 years ago | (#26002037)

A multi-channel memory controller is my response to this. Remember how going to a dual-channel memory controller increased the available bandwidth to memory? Support for even 32 banks of memory could be implemented if the CPU design and connections are there.

You are thinking along the lines of current computers, not of the applications. People keep quoting the old statement that 640KB should be enough memory for anyone, but then repeat the same mistake they quote. Quantity of memory not only goes up, but the way to talk to that memory also evolves over time.

We used to see the CPU to chipset to memory as the way personal computers would work. Since then, AMD moved to an integrated memory controller on their CPUs, and Intel is finally following the example that AMD set. A dual-channel memory controller used to be the exception, not the rule, but now the idea is very common. In time, a 32 channel memory controller will be the standard even in an average home computer. How those channels are used to talk to memory of course remains to be seen, but you get the idea.

Unganged channels = already non shared lanes today (5, Insightful)

DrYak (748999) | more than 5 years ago | (#26002355)

The issue is with a single processor that has multiple cores.
There's no real way to split the banks for each core, so the net effect is that you have 4-32 cores sharing the same lanes for memory.

No, sorry. That's how Phenom processors *already* work.

Each physical CPU package has two 64-bit memory controllers, each controlling a separate bank of 64-bit DDR2 memory chips (one of the two banks on a dual-channel motherboard).

Phenom has two modes of operation:
- Ganged: both memory controllers work in parallel, as if they were one huge 128-bit memory connection. That's how dual channel has worked since it was invented.
That's good for systems running a few very bandwidth-hungry applications (for example: benchmarks).

- Unganged: each memory controller works on its own, so you have two completely separate 64-bit memory channels accessible at the same time. By laying applications out in memory correctly, thanks to a NUMA-aware OS (anything better than Windows Vista), two separate applications can each access their own memory at the exact same moment, although at only half the bandwidth *per process* (but still the same total bandwidth across all processes running at the same time on a multi-core chip).
This is perfect for systems running lots of tasks in parallel, and it's the default mode on most BIOSes I've seen.

This gives a tremendous boost to heavily multi-tasked workloads (a busy database server, for example), and it's what TFA's authors are looking for.

At some point in the future, Intel will probably follow the same trend with its QPI processors.

Also, the future trend is to multiply the memory channels on the CPU: Intel has already planned triple-channel DDR3 for its high-end server Xeons (the first crop of QPI chips), and AMD has announced 4 memory channels for its future 6- and 12-core chips targeting the G34 socket.

So the net effect of unganged dual channel is that today you already have 4 cores with 2 sets of memory lanes to choose among, and within a year you'll have 6 to 12 cores sharing 4 sets of memory lanes.

By the time you reach 32 cores on a CPU, almost every socket will probably have its own dedicated memory channel (likely with the help of some technology that communicates serially over fewer lines, like FB-DIMM), or even weirder memory interfaces (who knows? maybe DDR-6 will be able to give several simultaneous accesses to the same memory module).

So, once again, it shows that running simulations without taking into account that other technologies will improve alongside the number of cores* yields unrealistic results.

Shame on TFA's authors, because the trend toward increased bandwidth has already started. A little more background research would have avoided this kind of stupidity.
But on the other hand, they would have missed the opportunity to publish an alarmist article with an eye-catching title.

--

*: Although, yes, the number of cores you can slap inside the same package seems to be the "new megahertz" in the manufacturers' race, with some, like Intel, trying to increase this number faster without putting as much effort into the rest.

Re:Well doh (1)

confused one (671304) | more than 5 years ago | (#26002879)

You just described AMD's current memory architecture in multi-processor systems.

Re:Well doh (0, Redundant)

stonemetal (934261) | more than 5 years ago | (#26001575)

It's not like multi-processor systems where each CPU gets its own RAM.

While I agree there are limits, AMD's NUMA architecture means that every CPU does in fact get its own RAM. It is each core on that one chip that doesn't get its own RAM.

Re:Well doh (1)

TheRaven64 (641858) | more than 5 years ago | (#26001845)

Actually, that is a closer approximation of the real problem. Cache coherency is a performance killer, so most supercomputer software is written around the CSP model and fits nicely with NUMA architectures. The Opteron and friends run a NUMA architecture but expose it to the programmer as a UMA model.

Re:Well doh (2, Interesting)

gabebear (251933) | more than 5 years ago | (#26001957)

If you read the article, what they are talking about is the growing disparity between CPU speed and access to memory. The stacked memory they are talking about is shoving the physical CPU dies and RAM chips closer together (in the same package) [betanews.com] so that you can have a LOT more interconnects (less wire to screw things up). Build up a huge stack of these and you have a supercomputer cluster WITH fast access to all the memory in the stack. For informatics applications you need random access to any bit of memory in the entire array, and this might do it for them. The biggest problem is heat dissipation...

The other options I see are creating some kind of super-giant shared, buffered RAM pool [wikipedia.org] that has high latency but great throughput and then sticking as many cores as they can on a single motherboard (1000+), or for a wizard to find some caching algorithm that will let them stay on commodity hardware (a.k.a. use those extra cores to figure out what you are going to need and optimize for it).

I'd put my money on them finding a wizard.

Re:Well doh (1, Interesting)

Anonymous Coward | more than 5 years ago | (#26001645)

Why is it important that each CPU/core has its own RAM? How is that more efficient than a *huge* chunk of RAM that could be accessed by any CPU/core?

I understand there are risks -- concurrent access and the like -- but completely separate RAM seems like an extreme solution if this is the only problem it is trying to address.

Re:Well doh (0)

Anonymous Coward | more than 5 years ago | (#26001675)

Cache coherency and such are a huuuuuuuge problem with having a big chunk of RAM. That's why NUMA is such an elegant solution.

Re:Well doh (2, Informative)

cheater512 (783349) | more than 5 years ago | (#26001741)

The summary mentions that the path to the memory controller gets clogged.
There is only so much bandwidth to go around.

Re:Well doh (1)

DRobson (835318) | more than 5 years ago | (#26001939)

With any pool of memory, there is some limit to latency and throughput. Thus, the more processing elements you throw at the problem the more they compete for this resource.

Now, if you can separate this memory into discrete pools associated with each processing element, you have less contention (locally) hence the possibility of lower latency and higher throughput (locally).

If you can design your algorithms such that multiple writers (and hopefully readers) to the same location happen infrequently, then you have a net win, despite the higher cost of addressing foreign memory.
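A tiny sketch of that 'infrequent shared writers' idea using POSIX threads (thread count and array size are arbitrary): each thread reduces its own slice into a private accumulator, and the only shared writes are one per thread at the end.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N (1 << 22)

static double data[N];
static double partial[NTHREADS];    /* one result slot per thread */

static void *worker(void *arg) {
    long t = (long)arg;
    size_t lo = (size_t)t * N / NTHREADS;
    size_t hi = (size_t)(t + 1) * N / NTHREADS;
    double sum = 0.0;               /* private accumulator: no contention in the hot loop */
    for (size_t i = lo; i < hi; i++)
        sum += data[i];
    partial[t] = sum;               /* the only write to shared state: once per thread */
    return NULL;
}

int main(void) {
    for (size_t i = 0; i < N; i++) data[i] = 1.0;

    pthread_t th[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(th[t], NULL);
        total += partial[t];        /* combine happens once, after the parallel phase */
    }
    printf("total = %.0f\n", total);
    return 0;
}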

Re:Well doh (4, Interesting)

Targon (17348) | more than 5 years ago | (#26001949)

You have failed to notice that AMD is already on top of this and can add more memory channels to their processors as needed for the application. This may increase the number of pins the processor has, but that is to be expected.

You may not have noticed, but there is a difference between AMD Opteron and Phenom processors beyond just the price. The base CPU design may be the same, but AMD and Intel can make special versions of their chips for the supercomputer market and have them work well.

In the worst case, with support from AMD or Intel, a new CPU with extra pins (and an increased die size) could add as many channels of memory support as required for the application. This is another area where spinning off the fab business might come in handy.

And yes, this might be a bit expensive, but have you seen the price of a supercomputer?

Re:Well doh (1)

Lumpy (12016) | more than 5 years ago | (#26001989)

No, but by using dual or quad cores with a crapload of RAM each, you do get a benefit.

A 128-processor quad-core supercomputer will be faster than a 128-processor single-core supercomputer.

you get a benefit.

Re:Well doh (2, Informative)

bwcbwc (601780) | more than 5 years ago | (#26002105)

Actually that is part of the problem. Most of the architectures have core-specific L1 cache, and unless a particular thread has its affinity mapped to a particular core, a thread can jump from a core where its data is in the L1 cache to a core where its data is not present, and is forced to undergo a cache refresh from memory.

Also, regardless of whether a system is multi-processing within a chip (multi-core) or on a board (multi-CPU), the number of communication channels required to avoid communication bottlenecks goes up as O(n^2) in the number of cores.

So yes, we are probably seeing the beginning of the end of performance gains from general-purpose CPU interconnects and will have to go back to vector processing, unless we are somehow able to jump the heat-dissipation barrier and start raising GHz again.
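For the affinity point above, pinning a thread so its data stays warm in one core's cache is only a few lines on Linux; a sketch using the GNU pthread_setaffinity_np extension (the core number is an arbitrary example):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    /* ... hot loop whose working set should stay resident in one core's L1/L2 ... */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                         /* pin the thread to core 2 (arbitrary) */
    if (pthread_setaffinity_np(t, sizeof set, &set) != 0)
        fprintf(stderr, "could not set affinity\n");

    pthread_join(t, NULL);
    return 0;
}

Compile with gcc -O2 -pthread (an assumption about the toolchain); the scheduler is then no longer free to bounce the thread between cores and cold caches.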

Re:Well doh (1)

GooberToo (74388) | more than 5 years ago | (#26002235)

This really is a problem that doesn't exist. The issue at hand is that if you have all cores cranking away, you run out of bandwidth. Simple solution: don't run all cores, and continue to scale horizontally as they currently do. So if 8-core CPUs have the bandwidth you need, only buy 8-core CPUs. If your CPUs run out of bandwidth at 16 cores (or whatever), then only buy up to 16-core CPUs, passing on the 32-core parts.

Wow, that's a hard problem to solve. Next.

Bad for current designs maybe (0)

gatkinso (15975) | more than 5 years ago | (#26001427)

Figure it out.

Gotta love job security.

but.. (0)

Anonymous Coward | more than 5 years ago | (#26001481)

Isn't this already a problem in today's computers? The CPU isn't the bottleneck, the HDD is.

Re:but.. (1)

jellomizer (103300) | more than 5 years ago | (#26001509)

Well, we are talking about CPU to RAM, not the hard drive, but it's a similar situation: the RAM is orders of magnitude slower than the CPU. When the CPU talks to the RAM, the request goes out over the bus and the data comes back through the bus to the CPU. If you could have a bus for each core, it would be like adding more lanes to a highway: it allows more traffic, so even though the CPU may still be waiting for the RAM, it will be faster, because you are not waiting for your bits just because another core requested some other bits.

Re:but.. (2, Interesting)

peragrin (659227) | more than 5 years ago | (#26001633)

So you're saying that next-generation processors need a gig of cache, plus 4 gigs of RAM.

I think what is really needed is new OS designs: something that is no longer tied quite as closely to the hardware, so that new hardware ideas can be tried.

Re:but.. (0)

jebrew (1101907) | more than 5 years ago | (#26001801)

I'd like to see larger caches getting put onto the processors. Yeah, it makes it really expensive, but if you do incremental speed/cache size updates it'll be reasonable.

A gig of L1 and 4 gig of L2 would be blazing fast! Throw an SSD on that bad boy and you'd almost never have to wait for an app to load again.

Re:but.. (1)

jellomizer (103300) | more than 5 years ago | (#26002177)

Like on the old zero-wait-state 386 computers.

Re:but.. (1)

nicolas.kassis (875270) | more than 5 years ago | (#26001857)

Except that will obviously be slower due to the overhead of abstracting from the hardware. And that is already what most OSes out there do. Linux on RISC exists, and so does Linux on *, but there is a drawback to all of that.

Re:but.. (1)

lloydchristmas759 (1105487) | more than 5 years ago | (#26001737)

And do you need a supercomputer to run a spellchecker?

Re:but.. (4, Funny)

David Gerard (12369) | more than 5 years ago | (#26002061)

Only for Office 2007.

Re:but.. (0)

Anonymous Coward | more than 5 years ago | (#26001513)

Not in massive supercomputers with more than 4 TB of RAM it's not.

Re:but.. (1)

KeithJM (1024071) | more than 5 years ago | (#26001743)

Isn't this already a problem in today's computers? The CPU isn't the bottleneck, the HDD is.

Generally this isn't true if you're talking about a supercomputer, because of the tasks they'll be performing. You don't build supercomputers to be file servers (or even database servers, which can still use a lot of CPU).

Yeah! (5, Funny)

crhylove (205956) | more than 5 years ago | (#26001549)

Once we get to 32- or 64-core CPUs that cost less than $100 (say, five years from now), I'd HATE to have a Beowulf cluster of those!

Re:Yeah! (0)

TapeCutter (624760) | more than 5 years ago | (#26001605)

Mod the parent +1Eienstien, the term 'supercomputer' is relative.

Re:Yeah! (0)

Anonymous Coward | more than 5 years ago | (#26001765)

Who is Einstien?

Apparently there are 308,000 retards like you out there right now [google.com]

Re:Yeah! (1)

TehBlahhh (947819) | more than 5 years ago | (#26001885)

Erm. A Beowulf cluster won't perform worse than any individual machine in it. The limitation is the CPU-to-memory path, which becomes saturated when you have N cores. And in a Beowulf, you'd have many, many of those paths, meaning you would still get a speedup - BUT each individual machine is limited as per TFA.

Re:Yeah! (0)

Anonymous Coward | more than 5 years ago | (#26002449)

...woosh??

So what does it mean for PCs? (3, Insightful)

theaveng (1243528) | more than 5 years ago | (#26001559)

>>>"After about 8 cores, there's no improvement," says James Peery, director of computation, computers, information, and mathematics at Sandia. "At 16 cores, it looks like 2 cores."
>>>

That's interesting, but how does it affect us, the users of "personal computers"? Can we extrapolate that buying a CPU with more than 8 cores is a waste of dollars, because it will actually run slower?

Re:So what does it mean for PCs? (1)

jebrew (1101907) | more than 5 years ago | (#26001891)

No, you'll have a slew of desktop apps that will get split out amongst several cores. For the applications you're likely to run, the more cores the better (well, I'm sure there's an upper limit, but it's most likely much higher than 32).

Re:So what does it mean for PCs? (1)

bigsexyjoe (581721) | more than 5 years ago | (#26002077)

What they are saying doesn't apply. The problem is specifically that a supercomputer usually runs one big demanding program. Right now, my task manager says that I am running 76 processes to look at the internet. So I could easily benefit from extra cores as each process being run could go on a separate core.

Re:So what does it mean for PCs? (1)

Johann Lau (1040920) | more than 5 years ago | (#26002541)

Most of these processes hardly utilize the CPU though, and if you have 8 cores, a process can use at most 1/8 of your total CPU power, right? That might be nice when browsing and e-mailing and such, but when converting a big image or archiving lots of files it means most of the CPU will sit idle. So I tend to think 2 cores is kinda perfect: when a program does some heavy crunching, it can eat up 50% at most, and the other core can be used to run all those small trivial processes you mentioned smoothly.

Re:So what does it mean for PCs? (1)

dreamchaser (49529) | more than 5 years ago | (#26002931)

Image processing was a bad example for you to use, as it lends itself well to multi-threaded operations.

Re:So what does it mean for PCs? (2, Funny)

David Gerard (12369) | more than 5 years ago | (#26002079)

It will only affect you if you're running ForecastFoxNG, where you can set the weather and the CPU will calculate where the butterfly should flap to get the effect you want (M-x butterfly).

It's so obvious... (4, Interesting)

Alwin Henseler (640539) | more than 5 years ago | (#26001585)

That to remove the 'memory wall', main memory and CPU will have to be integrated.

I mean, look at general-purpose computing systems past and present: there is a somewhat constant relation between CPU speed and memory size. Ever seen a 1 MHz system with a GB of RAM? Ever seen a GHz CPU coupled with a single KB of RAM? Why not? Because, with very few exceptions, heavier compute loads also require more memory space.

Just like the line between GPU and CPU is slowly blurring, it's just obvious that the parts with the most intensive communication should be the parts closest together. Instead of doubling the number of cores from 8 to 16, why not use those extra transistors to stack main memory directly on top of the CPU core(s)? Main memory would then be split up into little sections, with each section on top of a particular CPU core. I read somewhere that semiconductor processes suitable for CPUs aren't that good for memory chips (and vice versa); I don't know if that's true, but if so, let the engineers figure that out.

Of course things are different with supercomputers. If you have 1000 'processing units', where each PU would consist of, say, 32 cores and some GBs of RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed, while main memory that is part of other PUs would be 'remote', and slow. Hey wait, that sounds like a compute cluster of some kind... (so scientists already know how to deal with it).

Perhaps the trick would be to make access to memory found on one of the other PUs transparent, so that programming-wise there's no visible distinction between 'local' and 'remote' memory. With some intelligent routing to migrate blocks of data closer to the core(s) that access them? Maybe that could be done in hardware, maybe it's better done at the software level. Either way: the technology isn't the problem, it's an architectural / software problem.

Re:It's so obvious... (0)

theaveng (1243528) | more than 5 years ago | (#26001909)

>>>Ever seen a 1 MHz system with a GB of RAM?

Yes. Both a Commodore 64 and Commodore 128, although the 1 gigabyte RAM is typically used as a fast drive rather than as CPU-addressable DRAM.

Re:It's so obvious... (1)

theaveng (1243528) | more than 5 years ago | (#26001947)

P.S. Your idea of putting memory on the CPU is certainly workable. The very first CPU to integrate memory was the 80486 (8 kilobyte cache), so the idea has been proven sound since at least 1990.

Re:It's so obvious... (3, Informative)

AlXtreme (223728) | more than 5 years ago | (#26001959)

You mean something like a CPU cache? I assume you know that every core already has a cache (L1) on multi-core [wikipedia.org] systems, and shares a larger cache (L2) between all cores.

The problem is that on/near-core memory is damn expensive, and your average supercomputing task requires significant amounts of memory. When the bottleneck for high performance computing becomes memory bandwidth instead of interconnect/network bandwidth you have something a lot harder to optimize, so I can understand where the complaint in IEEE comes from.

Perhaps this will lead to CPUs with large L1 caches specifically for supercomputing tasks, who knows...

Re:It's so obvious... (1)

DRobson (835318) | more than 5 years ago | (#26002091)

Perhaps this will lead to CPUs with large L1 caches specifically for supercomputing tasks, who knows...

Even discounting price concerns, L1 caches can only grow so much. As the capacity increases, so does the time to look up the data, until you find yourself with access times equivalent to the next level down the cache hierarchy, which negates the point of having an L1. L1 needs to be /quite/ fast for it to be worthwhile.

Re:It's so obvious... (0)

Anonymous Coward | more than 5 years ago | (#26002557)

Did you just invent cache memory?

Re:It's so obvious... (2, Insightful)

Funk_dat69 (215898) | more than 5 years ago | (#26002693)

Of course things are different with supercomputers. If you have 1000 'processing units', where each PU would consist of, say, 32 cores and some GBs of RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed, while main memory that is part of other PUs would be 'remote', and slow. Hey wait, that sounds like a compute cluster of some kind... (so scientists already know how to deal with it).

It also sounds like you are describing the Cell processor setup. Each SPU has local memory on-die, but cannot operate on (remote) main memory. Each SPU also has a DMA engine that will grab data from main memory and bring it into its local store. The good thing is that you can overlap the DMA transfer with the computation, so the SPUs are constantly burning through work.

This does help against the memory wall. And is a big reason why Roadrunner is so damn fast.
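The overlap described above is classic double buffering. A schematic sketch of the pattern (dma_get_async() and dma_wait() are hypothetical stand-ins for whatever the platform's asynchronous copy primitives are, e.g. the Cell's MFC calls; the stubs here just emulate them synchronously so the sketch compiles and runs):

#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 4096

/* Hypothetical async-copy primitives standing in for the real DMA API.
   The stubs just emulate the pattern with memcpy; real code would issue a
   tagged asynchronous transfer and block in dma_wait() until it completes. */
static void dma_get_async(void *local, const void *remote, size_t bytes, int tag) {
    (void)tag; memcpy(local, remote, bytes);
}
static void dma_wait(int tag) { (void)tag; }

static double compute_on(const float *chunk, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += chunk[i];
    return s;
}

static double process_stream(const float *remote_src, size_t nchunks) {
    static float buf[2][CHUNK];                  /* two local-store buffers */
    double total = 0.0;

    dma_get_async(buf[0], remote_src, CHUNK * sizeof(float), 0);   /* prefetch chunk 0 */
    for (size_t i = 0; i < nchunks; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < nchunks)                     /* start fetching chunk i+1 ... */
            dma_get_async(buf[nxt], remote_src + (i + 1) * CHUNK,
                          CHUNK * sizeof(float), nxt);
        dma_wait(cur);                           /* ... while waiting only for chunk i */
        total += compute_on(buf[cur], CHUNK);    /* compute overlaps the next transfer */
    }
    return total;
}

int main(void) {
    enum { NCHUNKS = 8 };
    static float src[NCHUNKS * CHUNK];
    for (size_t i = 0; i < NCHUNKS * CHUNK; i++) src[i] = 1.0f;
    printf("sum = %.0f\n", process_stream(src, NCHUNKS));
    return 0;
}

On real hardware, dma_wait(cur) returns almost immediately because that transfer was started an iteration earlier, so the SPU-style core rarely sits idle waiting on memory.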

Optical Computing? (0)

Anonymous Coward | more than 5 years ago | (#26001619)

Would optical get around such a barrier?

Physical space seems to be one of the major hurdles in CPU design today, due to leakage with the ever shrinking processes.

And I think it is about damn time that new silicon laser receiver thing (I forget the details) was put into implementation and testing.

Re:Optical Computing? (1)

arktemplar (1060050) | more than 5 years ago | (#26001733)

Well, in a way it could be. I read the Spectrum article some time back, and since I work in the field I can give some insight.

RAM latencies are a huge hit for applications that are based on random access. DRAMs etc. don't actually do random access the way you'd want: they open one region of memory over a large time period and provide faster access to some successive elements. New processor architectures based on smart caches and intelligent memories could be a lot more useful, but basically a rethinking of processor architecture is involved. In the end, electrical and computer engineering is still exactly that, engineering: there will always be tradeoffs.

Re:Optical Computing? (1)

nicolas.kassis (875270) | more than 5 years ago | (#26001905)

Would optical get around such a barrier?

Physical space seems to be one of the major hurdles in CPU design today, due to leakage with the ever shrinking processes.

And I think it is about damn time that new silicon laser receiver thing (I forget the details) was put into implementation and testing.

IBM is already working on it. Stay tuned.

Memory (4, Insightful)

Detritus (11846) | more than 5 years ago | (#26001639)

I once heard someone define a supercomputer as a $10 million memory system with a CPU thrown in for free. One of the interesting CPU benchmarks is to see how much data it can move when the cache is blown out.

Multiple CPUs? (4, Insightful)

Dan East (318230) | more than 5 years ago | (#26001649)

This doesn't quite make sense to me. You wouldn't replace a 64-CPU supercomputer with a single 64-core CPU; you would instead use 64 multi-core CPUs. As production switches to multicore, the cost of producing multiple cores will be about the same as the single-core CPUs of old. So eventually you'll get 4 cores for the price of 2, then 8 cores for the price of 4, then 16 for the price of 8, etc. So the extra cores in the CPUs of a supercomputer are like a bonus, and if software can be written to utilize those extra cores in some way that benefits performance, then that's a good thing.

The problem allegedly being.. (4, Informative)

Junta (36770) | more than 5 years ago | (#26001777)

For a given node count, we've seen increases in performance. The claimed problem is that, for the workloads that concern these researchers, nobody is mentioning significant enhancements to the fundamental memory architecture that would keep pace with the scale at which multi-core systems are growing. So you buy a 16-core-per-chip system to upgrade your quad-core-based system and hypothetically gain little despite the expense. Power efficiency drops, and getting more performance requires more nodes. Additionally, who is to say that clock speeds won't drop if programming models in the mass market change such that distributed workloads are common and single-core performance isn't all that important.

All that said, talk beyond 6-core/8-core is mostly grandstanding at this time. Though memory architecture for the mass market is not considered intrinsically exciting, I would wager there will be advancements that no one talks up. For example, Nehalem leapfrogs AMD memory bandwidth by a large margin (by about a factor of 2). It means that if Shanghai parts are considered satisfactory today, memory-wise, to support four cores, then Nehalem, by that particular metric, supports 8 equally satisfactorily. The whole picture is a tad more complicated (i.e. latency, numbers I don't know off hand), but that one metric is a highly important one in the supercomputer field.

For all the worry over memory bandwidth, though, it hasn't stopped supercomputer purchasers from buying into Core 2 all this time. Despite improvements in its chipset, Intel's Core 2 still doesn't reach AMD's memory performance. Despite that, people spending money to get into the Top500 still chose to put their money on Core 2 in general. Sure, the Cray and IBM supercomputers in the top 2 used AMD, but from the time of its release, Core 2 has decimated AMD's supercomputer market share despite an inferior memory architecture.

Re:The problem allegedly being.. (2, Interesting)

amori (1424659) | more than 5 years ago | (#26002917)

Earlier this year, I had access to a large supercomputer cluster. Often I would run code on the supercomputer (with 1000+, 100, 10, or 2 CPUs), and then I would try running it on my own dual-core machine, benchmarking the 2 CPUs for comparison purposes. More than anything, just the manner in which memory was being shared or distributed would influence the end results, tremendously. You really have to rethink how you choose to parallelize your vectors when dealing with supercomputers vs. multicore machines. As a researcher, I've found that I don't necessarily have the time to rewrite my code for both scenarios. I think this too might factor in heavily ...

Re:Multiple CPUs? (0)

yttrstein (891553) | more than 5 years ago | (#26002195)

I would actually get a machine that was filled with Cray Threadstorm Processors (128 threads per processor core) and tell Intel (2 threads per core on a good day, and then kinda not really) to suck it.

Duh... (0)

Anonymous Coward | more than 5 years ago | (#26001711)

From the article: "'The key to solving this bottleneck is tighter, and maybe smarter, integration of memory and processors,' says Peery. For its part, Sandia is exploring the impact of stacking memory chips atop processors to improve memory bandwidth." ... Breaking news: cache is good. Hasn't getting data to the processor quickly without wasting cycles always been a critical bottleneck in servers?

"this year the U.S. Department of Energy formed the Institute for Advanced Architectures and Algorithms. Located at Sandia and at Oak Ridge National Laboratory, in Tennessee, the institute's work will be to figure out what high-performance computer architectures will be needed five to 10 years from now and help steer the industry in that direction."

While I am heavily in favor of moving technology forward as quickly as possible, and understand that much of the U.S.'s power relies on being ahead technologically, I question whether having groups theoretically guess what architectures will be best in the future is the best use of tax dollars in our debt-heavy government, especially since, by the nature of competition, Intel and AMD are already doing this. Isn't it about time the g-man got forced to declare bankruptcy and shed unnecessary assets and debt holes?

-Brian

Multicore news and information (-1, Troll)

Anonymous Coward | more than 5 years ago | (#26001729)

Check out http://www.multicoreinfo.com/ [multicoreinfo.com] for the latest news and resources related to multicore processors.

No.... (1)

fitten (521191) | more than 5 years ago | (#26001731)

Maybe they should do something like we did back when the Paragon (yes, that far back) had multiple CPUs on a node and the memory bandwidth wasn't enough to support them all simultaneously... Don't use some of the CPUs on the card (leave them idle) so that all the bandwidth is available to the one, or few, cores that need it. Alternatively, figure out a way (algorithms) to make sure that no more than one core is memory-intensive at a time... take turns being bandwidth-intensive. Or just realize, as it's always been, that some solutions/algorithms simply aren't optimal on commodity hardware.

Have they heard of CUDA? (0)

Anonymous Coward | more than 5 years ago | (#26001789)

I mean, really, this sounds like a poorly designed experiment. NVidia GPUs can have hundreds of cores and they just get faster. Memory management is different on GPUs than on standard CPUs/chipsets for that very reason. Hopefully our tax dollars didn't pay these geniuses for this crap.

Do you know anything about supercomputing tasks? (1)

SaDan (81097) | more than 5 years ago | (#26002421)

CUDA has zero benefit for supercomputing projects that cannot be broken into tiny bits and spread across multiple cores.

It's not just about memory, or clock speed.

Different Chip Architecture (1)

rabun_bike (905430) | more than 5 years ago | (#26001803)

You might see a supercomputer designed around other RISC processors such as the ARM. A supercomputer using the ARM takes more chips, perhaps, but the power savings are substantial compared to x86. Furthermore, companies like Nvidia, with their Tesla platform, are pushing into the supercomputing space with specialized chips that are purposefully designed to deal with large linear problem solving. Interestingly, the Tesla chip is a multicore chip as well. http://www.nvidia.com/object/product_tesla_s1070_us.html [nvidia.com]

So don't use conventional processors (0)

Anonymous Coward | more than 5 years ago | (#26001833)

Any idea why the current top of the supercomputer pack uses Cell processors? Besides having mad vector processing skillz with their SPUs the memory bandwidth is fairly large.

Re:So don't use conventional processors (1)

David Gerard (12369) | more than 5 years ago | (#26002123)

So they can play GTA IV in their time off.

Sounds like posturing for grant money (0)

Anonymous Coward | more than 5 years ago | (#26001855)

Seriously. While the algorithms/code they execute may run out of memory bandwidth when spread across 16 cores (doubtful), the bottleneck is more likely to come from the external interfaces long before the CPU runs out of bandwidth to access that memory. Current motherboards from Tyan and the like hold up to 128GB of RAM, which, when consumed by the CPU at its THEORETICAL 64GB/s in triple-channel mode, means it will simply run out of data in 2 seconds. Even if you are backfilling the entire time (reducing overall memory throughput), you will run out of data a short time later (dictated by the network connection's backfill rate). The problem is one of being able to keep main memory full. Stacking memory on the chip will still have its limitations and really doesn't make sense when you consider the commodity pricing of these chips, the complexity of stacking, and the limitations of external data access. Buy more computers, build a smarter cluster, and write smarter software. 16 cores looking like 2... really... seriously... wtf ever. Sandia (and I used to live down the street, so I have met many people there) makes its living by raising government money. Nothing like coming up with a scare tactic to open up the coffers of government spending.

They haven't got 2 cores figured out yet.... (0, Troll)

3seas (184403) | more than 5 years ago | (#26002103)

... what makes anyone think more cpus can be done?

Hoenetly, using a core 2 duo at work and after 3 weeks of swapping out every part of the system including the cpu.... sent hack to hp where they finally figured out it was the cpu.... oh wait it was swapped out too... so.... it was both cpus.... Whats the odds of that?

And now on another system, also a core 2 duo, I get application hangs and taskmanager saying no cpu is being used. Hmmm, must be some sort of fancy dancy hibernation power saving mode thingy that puts the user in a wait to get heat for not getting the job done..... piss on you....

So who is god? those who create the hardware or the software creators? because even at work, I'm assumed to be at fault as a user first.,....

And I'm fucking god damn sick of this shit from the computer industry arrogance. Don't they teach how to get things right before moving to the next step in the god training of CS?

My biggest pet peeve is this button with the symbols "0" & "1" on it ... and everyone knows it means "on" and "off" as its used on all sorts of consumer and industrial electronic equipment.

But on a computer, these fucking GODS damn its gotta mean something more and hidden so the morons can think they know something more than their consumer slaves... so they can masterbait their arrogance and stroke their egos.

Hey I got this really neat idea and I have all sorts of sheep skins from a universatays that says I smarts more then you.... so lets do this and sell it and if it doesn't work then I still get a pay check so who the fucjk cares.... Lets abstract stuff so much that we are only experts at a small part and nobody is an expert of the whole... and dat way we can blame the other guy...

Maybe the computer industry needs a bailout too?

Here... have a holy bucket... start bailing... you have plenty experience at that.

Bailing on teh end users....

End users are stupid..... we can blame them.....

FAILZORS?! (-1, Offtopic)

Anonymous Coward | more than 5 years ago | (#26002181)

If you do Not

Parallel processing (1)

amclay (1356377) | more than 5 years ago | (#26002353)

Couldn't they just break up their programs into threads? Obviously, this wouldn't work for real-time applications, but modeling and other asynchronous programs could definitely be split and coprocessed.

as expected (2, Funny)

tyler.willard (944724) | more than 5 years ago | (#26002439)

"A supercomputer is a device for turning compute-bound problems into I/O-bound problems."

-Ken Batcher

Simple, if it doesn't work, don't use it. (1)

JoeMerchant (803320) | more than 5 years ago | (#26002459)

What's distressing here? That they have to keep building supercomputers the same way they always have? I worked with an ex-IBMer from their supercomputing algorithms department, and he and I BSed about future chip performance a lot in the late-2006 to early-2007 timeframe. We were both convinced that the current approaches to CPU design were going to top out in usefulness at 8 to maybe 16 cores due to memory bandwidth.

I guess the guys at Sandia had to do a little more than BS about it before they published, but c'mon guys, this has been obvious for a while. And, if it's obvious to all of us out here, don't you think that Intel knew about it during their 2002 roadmap meetings?

Ok. soooo.... (1)

Taibhsear (1286214) | more than 5 years ago | (#26002717)

Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.

So increase the bandwidth on the memory to something more suited to supercomputers, then. Design and make a supercomputer for supercomputer purposes. You are scientists using supercomputers, not kids begging mom for a new laptop at Christmas. Make it happen.
