
Why 'Gaming' Chips Are Moving Into the Server Room

timothy posted about 4 years ago | from the expense-report-manipulation-++ dept.


Esther Schindler writes "After several years of trying, graphics processing units (GPUs) are beginning to win over the major server vendors. Dell and IBM are the first tier-one server vendors to adopt GPUs as server processors for high-performance computing (HPC). Here's a high level view of the hardware change and what it might mean to your data center. (Hint: faster servers.) The article also addresses what it takes to write software for GPUs: 'Adopting GPU computing is not a drop-in task. You can't just add a few boards and let the processors do the rest, as when you add more CPUs. Some programming work has to be done, and it's not something that can be accomplished with a few libraries and lines of code.'"
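To give a feel for the kind of programming work the article means, here is a minimal, hedged sketch (not from the article): a serial CPU loop and the CUDA equivalent, where each GPU thread handles one array element. The function and variable names are illustrative.

    // CPU version: one thread walks the whole array.
    void scale_cpu(float *data, float factor, int n) {
        for (int i = 0; i < n; ++i)
            data[i] *= factor;
    }

    // CUDA version: thousands of threads, each handling one element.
    __global__ void scale_gpu(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)                                      // guard against overrun
            data[i] *= factor;
    }

    // Launch enough 256-thread blocks to cover n elements (device memory,
    // data transfers and error checking are still up to you):
    // scale_gpu<<<(n + 255) / 256, 256>>>(dev_data, 2.0f, n);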


137 comments

A whole new level of parallelism (4, Insightful)

TwiztidK (1723954) | about 4 years ago | (#32918288)

I've heard that many programmers have issues coding for 2- and 4-core processors. I'd like to see how they'll adapt to running hundreds of threads in parallel.

Re:A whole new level of parallelism (3, Insightful)

morcego (260031) | about 4 years ago | (#32918424)

This is just like programming for a computer cluster ... after a fashion.

Anyone used to do both should have no problem with this.

I'm anything but a high end programmer (I mostly only code for myself), and I have written plenty of code that runs with 7-10 threads. Believe me, when you change the way you think about how an algorithm works, it doesn't matter if you are using 3 or 10000 processors.

Re:A whole new level of parallelism (4, Insightful)

Nadaka (224565) | about 4 years ago | (#32918682)

No it isn't. That you think so just shows how much you still have left to learn.

I am not a high-end programmer either. But I have two degrees in the subject and have been working professionally in the field for years, including on optimization and parallelization.

Many algorithms just won't have much improvement with multi-threading.

Many will even perform more poorly due to data contention and the overhead of context switches and creating threads.

Many algorithms just can not be converted to a format that will work within the restrictions of GPGPU computing at all.

The stream architecture of modern GPUs works radically differently from a conventional CPU.

It is not as simple as scaling conventional multi-threading up to thousands of threads.

Certain things that you are used to doing on a normal processor have an insane cost in GPU hardware.

For instance, the if statement. Until recently OpenCL and CUDA didn't allow branching. Now they do, but they incur such a huge penalty in cycles that it just isn't worth it.

Re:A whole new level of parallelism (1, Informative)

Anonymous Coward | about 4 years ago | (#32918878)

Uh
OpenCL and CUDA supported branching from day one (with a performance hit). Before they existed, there was some (very little) usage of GPUs for general purpose computing and they used GLSL/HLSL/Cg, which supported branching poorly or not at all.

The tools that were recently added to CUDA (for the latest GPUs) are recursion and function pointers.

Re:A whole new level of parallelism (2, Informative)

sarkeizen (106737) | about 4 years ago | (#32921358)

Personally (and I love that someone below mentioned Amdahl's law [wikipedia.org]), I think the problem isn't, as you said, about specific language constructs, but that there isn't any general solution to parallelism. That is, to use Brooks's [amazon.com] illustration, the problems we try to solve with computers aren't like harvesting wheat - they aren't efficiently divisible to an arbitrary degree. We do know of a few problems like this, which we call "embarrassingly parallel" [wikipedia.org], but they are few and far between. So GPUs are great MD5 crackers and protein folders, and I personally *love* writing CUDA code, but I don't suffer from the delusion that this is somehow a revolution in software, or that the usual day-to-day tasks are going to be affected. So the idea that GPUs are moving into the server room seems optimistic, because the majority of the stuff in there is pretty mundane.

That said, I do wonder whether there aren't some architectural limitations on GPUs (e.g. memory protection), and whether, if we really wanted to use these for general-purpose computing and added those features, we would lose performance. In other words, are we just making some kind of cores-to-features tradeoff?

Re:A whole new level of parallelism (1)

Twinbee (767046) | about 4 years ago | (#32919770)

Are if-branches only slow because of what someone said below:

"If you run into a branch in the code, then you lose your parallelism, as the divergent threads are frozen until they come back together."

Because if that's the case, that's fine by me. The worst-case length that a thread can run can be bounded, and it's even low in some cases I know of.

Re:A whole new level of parallelism (0)

Anonymous Coward | about 4 years ago | (#32919836)

Yes, that is correct. I don't know the details of ATI's hardware very well, but on nVidia, branch divergence is only an issue if threads within a warp (32 threads with sequential IDs) diverge, and even then it is handled by evaluating the different branches successively, which slows things down.

Re:A whole new level of parallelism (0)

Anonymous Coward | about 4 years ago | (#32920078)

There's a re-join of divergent threads. The penalty is that you serialize on one path for a while (e.g. a 15x drop in issue rate).
But that behaviour is also not going to produce coalesced memory transactions... so some algorithms are really easy, and some require some work to get them to scale.
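To illustrate the coalescing point, a minimal hedged sketch (array names are illustrative): when the 32 threads of a warp read consecutive addresses, the hardware merges their loads into a few wide transactions; a strided pattern breaks that.

    // Coalesced: thread i touches element i, so a warp reads 32
    // consecutive floats and the loads merge into one wide transaction.
    __global__ void copy_coalesced(float *dst, const float *src, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Strided: neighbouring threads read addresses far apart, so the
    // hardware has to issue many separate transactions per warp.
    __global__ void copy_strided(float *dst, const float *src, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[(i * stride) % n];
    }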

Re:A whole new level of parallelism (1)

morcego (260031) | about 4 years ago | (#32920602)

Nadaka, you are just proving my statement there.

What you are describing are people using the wrong kind of logic and algorithms to do parallelization.
The only new statement you make is:

Many algorithms just can not be converted to a format that will work within the restrictions of GPGPU computing at all.

I will take your word for it, since I really don't know GPGPUs at all. Most of my experience with parallelism is with clusters (up to 30 nodes). In that scenario, 99% of the time I've heard someone say something like that, it was because they were using algorithms badly suited to parallel processing - algorithms that were not ideal even with 2-3 nodes.

But as I said, I have no experience with GPGPU, so my experience with clusters might not be relevant.

There are also easy problems (1)

dbIII (701233) | about 4 years ago | (#32922078)

Many algorithms just won't have much improvement with multi-threading.
Yes, but there are also many that will. I work with geophysicists, and a lot of what they do really involves applying the same filter to 25 million or so audio traces. Such tasks can be split arbitrarily over clusters at any point in those millions of traces. One thread per trace is certainly possible, because that's effectively how it works now anyway, as independent operations run in series. Once you get to outputting the results, though, some theoretical 25-million-CPU machine is going to hit bottlenecks elsewhere and not give much benefit over something a lot smaller - that's where the hard problems come in.
Also, working with images and video brings up a lot of other parallel problems that even those of us who only dabble in parallel processing can get decent results with.
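To make the one-thread-per-trace idea concrete, here is a minimal hedged sketch of such a kernel; the trace layout, filter and names are illustrative, not taken from any real seismic code.

    // Each thread filters one trace of `len` samples. `traces` holds the
    // traces back to back; `taps` is a short FIR filter shared by all threads.
    __global__ void filter_traces(float *out, const float *traces,
                                  const float *taps, int ntaps,
                                  int ntraces, int len) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's trace
        if (t >= ntraces) return;

        const float *trace = traces + (size_t)t * len;
        float *result = out + (size_t)t * len;
        for (int s = 0; s < len; ++s) {                  // convolve one trace
            float acc = 0.0f;
            for (int k = 0; k < ntaps && k <= s; ++k)
                acc += taps[k] * trace[s - k];
            result[s] = acc;
        }
    }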

Re:A whole new level of parallelism (5, Insightful)

Dynetrekk (1607735) | about 4 years ago | (#32918728)

Believe me, when you change the way you think about how an algorithm works, it doesn't matter if you are using 3 or 10000 processors.

Have you ever read up on Amdahl's law? [wikipedia.org]

Re:A whole new level of parallelism (2, Interesting)

Lord of Hyphens (975895) | about 4 years ago | (#32920578)

Have you ever read up on Amdahl's law? [wikipedia.org]

I'll see your Amdahl's Law, and raise you Gustafson's Law [wikipedia.org] .

Re:A whole new level of parallelism (3, Funny)

pushing-robot (1037830) | about 4 years ago | (#32918858)

Microsoft must be doing a bang-up job then, because when I'm in Windows it doesn't matter if I'm using 3 or 10000 processors.

Re:A whole new level of parallelism (2, Interesting)

Anonymous Coward | about 4 years ago | (#32920912)

You might find this [youtube.com] Google Tech Talk interesting..

Re:A whole new level of parallelism (2, Insightful)

Austerity Empowers (669817) | about 4 years ago | (#32918436)

CUDA or OpenCL is how they do it.

Re:A whole new level of parallelism (3, Insightful)

Sax Maniac (88550) | about 4 years ago | (#32918462)

This isn't hundreds of threads that can run arbitrary code paths like on a CPU. You have to totally redesign your code, or already have parallel code implemented so that you run a number of threads that all do the same thing at the same time, just on different data.

The threads all run in lockstep, as in, all the threads better be at the same PC at the same time. If you run into a branch in the code, then you lose your parallelism, as the divergent threads are frozen until they come back together.

I'm not a big thread programmer, but I do work on threading tools. Most of the problems with threads seem to come from threads taking totally different code paths, and the unpredictable scheduling interactions that arise between them. GPU coding is a lot more tightly controlled.
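A minimal hedged sketch of that divergence (the kernel is illustrative): when threads in the same warp take different sides of a branch, the hardware runs both paths one after the other, masking off the threads that didn't take each path.

    __global__ void divergent(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Even and odd threads sit in the same warp, so the warp executes
        // both branches serially; each branch's non-participants just wait.
        if (i % 2 == 0)
            out[i] = sinf(in[i]);   // stands in for one expensive path
        else
            out[i] = cosf(in[i]);   // stands in for the other path
    }

    // If the branch followed warp boundaries instead (say, on i / warpSize),
    // every warp would take a single path and nothing would serialize.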

Re:A whole new level of parallelism (1)

Monkeedude1212 (1560403) | about 4 years ago | (#32918554)

I've heard that many programmers have issues coding for 2 and 4 core processors.

Or even multiple processors, for that matter.

That in and of itself is almost an entirely new area of programming - if you were an ace 15 years ago, your C++ skills might still be sharper than most new graduates', but most post-secondary schools are now teaching students how to thread properly for parallel programming. If you don't know how to code for 2- or 4-core processors, you really should jump on board. Almost every computer and laptop I can think of being sold brand new today has more than one core or processor.

I'd like to see how they'll addapt to running "run hundreds of threads" in parallel.

It requires a slightly more abstract design pattern, designed to be flexible. It's kind of like moving from older structured programming to object-oriented - you just have to approach it differently. I haven't had to deal with any of it myself, but I imagine it boils down to knowing which calculations in your program can be done simultaneously, and then setting up a way to dump them off onto the next available core. That way, instead of stopping a core to wait and sync with another, you are syncing the thread conceptually as it simply waits for the data the next step needs, not for the processors.

Re:A whole new level of parallelism (3, Interesting)

jgagnon (1663075) | about 4 years ago | (#32918880)

The problem with "programming for multiple cores/CPUs/threads" is that it is done in very different ways between languages, operating systems, and APIs. There is no such thing as a "standard for multi-thread programming". All the variants share some concepts in common but their implementations are mostly very different from each other. No amount of schooling can fully prepare you for this diversity.

Re:A whole new level of parallelism (1)

Miseph (979059) | about 4 years ago | (#32919564)

Isn't that basically true of everything else in coding too? You wouldn't code something in C++ for Linux the same way that you would code it in Java for Windows, even though a lot of it might be similar.

Is parallelization supposed to be different?

Re:A whole new level of parallelism (1)

Jeremy Erwin (2054) | about 4 years ago | (#32921420)

Java for Windows? I think you might be missing the point.

Re:A whole new level of parallelism (1)

nxtw (866177) | about 4 years ago | (#32920896)

The problem with "programming for multiple cores/CPUs/threads" is that it is done in very different ways between languages, operating systems, and APIs.

Really?

Most modern operating systems implement POSIX threads, or are close enough that POSIX threads can be implemented on top of a native threading mechanism. The concept of independently scheduled threads with a shared memory space can only be implemented in so many ways, and when someone understands these concepts well, everything looks rather similar.

It seems that claiming things are radically different due to superficial differences is fairly common today in computer science.

No amount of schooling can fully prepare you for this diversity.

Of course not, if you're the kind of person who can't grasp the concepts.

Re:A whole new level of parallelism (1)

BitZtream (692029) | about 4 years ago | (#32922342)

Never heard of POSIX, eh?

Re:A whole new level of parallelism (2, Insightful)

Fulcrum of Evil (560260) | about 4 years ago | (#32920646)

most post secondaries are now teaching students how to properly thread for parallel programming.

No they aren't. Even grad courses are no substitute for doing it. Never mind that parallel processing is a different animal than SIMD-like models that most GPUs use.

I haven't had to deal with any of it myself, but I imagine it'll boil down to knowing what calculations in your program can be done simultaneously, and then setting up a way to dump it off onto the next available core.

No, it's not like that. You set up a warp of threads running the same code on different data and structure it for minimal branching. That's the thumbnail sketch - Nvidia has some good tutorials on the subject, and you can use your current GPU.
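As one small, hedged illustration of "structure it for minimal branching" (my sketch, not Nvidia's tutorial material): a data-dependent if can often be replaced with arithmetic that every thread executes identically.

    // Branchy version: threads within a warp may diverge on the comparisons.
    __global__ void clamp_branchy(float *x, float lo, float hi, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (x[i] < lo)      x[i] = lo;
            else if (x[i] > hi) x[i] = hi;
        }
    }

    // Branch-free version: every thread runs the exact same instructions.
    __global__ void clamp_uniform(float *x, float lo, float hi, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = fminf(fmaxf(x[i], lo), hi);
    }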

Re:A whole new level of parallelism (4, Informative)

Chris Burke (6130) | about 4 years ago | (#32919006)

Programmers of server applications are already used to multithreading, and they've been able to make good use of systems with large numbers of processors even before the advent of virtualization.

But don't pay too much attention to the word "Server". Yes the machines that they're talking about are in the segment of the market referred to as "servers", as distinct from "desktops" or "mobile". But the target of GPU-based computing isn't "Servers" in the sense of the tasks you normally think of -- web servers, database servers, etc.

The real target is mentioned in the article, and it's HPC, aka scientific computing. Normal server apps are integer code, and depend more on high memory bandwidth and I/O, which GPGPU doesn't really address. HPC wants that stuff too, but they also want floating point performance. As much floating point math performance as you can possibly give them. And GPUs are way beyond what CPUs can provide in that regard. Plus a lot of HPC applications are easier to parallelize than even the traditional server codes, though not all fall in the "embarrassingly parallel" category.

There will be a few growing pains, but once APIs get straightened out and programmers get used to it (which shouldn't take too long for the ones writing HPC code), this is going to be a huge win for scientific computing.

Re:A whole new level of parallelism (1, Insightful)

Anonymous Coward | about 4 years ago | (#32919168)

Well, GPGPU actually addresses memory bandwidth in a way. Mostly due to design limitations, each GPU comes with its own memory, and thus its own memory bus and bandwidth.
Of course you can get that for CPUs as well (with new Intels or any non-ancient AMD) by going to multiple sockets, but that is more effort and costlier (6 PCIe slots - unusual but obtainable - give you 12 GPUs, each with their own bus; try getting a 12-socket motherboard...).

Re:A whole new level of parallelism (1)

emt377 (610337) | about 4 years ago | (#32919866)

There will be a few growing pains, but once APIs get straightened out and programmers get used to it (which shouldn't take too long for the ones writing HPC code), this is going to be a huge win for scientific computing.

All you say makes sense, but I for one don't understand the market for this. Today, if you need a compute server that's good for stream (e.g. SIMD) workloads you get a dozen 1U/2U rackmounts and fill them up with as many GPU boards as they'll take. You put a work scheduler on them that accepts tasks from a dispatch server, and hook them up to a NAS box (or just rsync data sets and results from a storage subsystem). Then you put a transaction server in front it all that exposes a job manager.

Throwing a couple of GPU boards in a transaction server will add some computational punch, but not enough to make it a real compute server. It'll be too expensive to rack and stack. It'll cost more than a plain old transaction server. The market clearly is whoever needs a little bit of computational power on their backend - but does it really exist?

Re:A whole new level of parallelism (1)

Chris Burke (6130) | about 4 years ago | (#32920194)

All you say makes sense, but I for one don't understand the market for this. Today, if you need a compute server that's good for stream (e.g. SIMD) workloads you get a dozen 1U/2U rackmounts and fill them up with as many GPU boards as they'll take.

Well, I was mostly just trying to justify why the transition to the situation "today" is taking place, and why the multi-threading itself isn't that big a deal. "Yesterday" the biggest compute servers were still made from traditional CPUs. Only recently has the potential for GPUs as general-purpose (if your "general" purpose is FP math) computational devices really captured significant mindshare, and APIs and methodologies are still being ironed out.

It seemed like the article was talking about having GPUs in the "server" room in general, not about a specific situation where there's only a few GPUs stuck in an otherwise normal rackmount server. That would be fairly pointless. Though on the other hand there is some research that suggests that having more than a token amount of regular CPUs close to the GPUs is useful.

Re:A whole new level of parallelism (1)

DigiShaman (671371) | about 4 years ago | (#32921710)

The HPC platform comes in two flavors: server and desktop. My understanding is that the HPC server is mainly used for quick post-processing of data, while real-time interaction with the data is usually done on the desktop.

Re:A whole new level of parallelism (3, Interesting)

psilambda (1857088) | about 4 years ago | (#32922064)

The article and everybody else are ignoring one large, valid use of GPUs in the data center: whether you call it business intelligence or OLAP, it needs to be in the data center and it needs some serious number crunching. There is not as much difference between this and scientific number crunching as most people might think. I have both crunched numbers for financials at a major multinational and had the privilege of being the first to process the first full genome (a complete genetic sequence - terabytes of data) for a single individual, and the genomic analysis was actually much more integer-based than the financials. Based on my experience with both, I created the Kappa library for doing CUDA or OpenMP analysis in a datacenter, whether for business or scientific work.

Re:A whole new level of parallelism (3, Informative)

Hodapp (1175021) | about 4 years ago | (#32919700)

I am one such programmer. Yet I also coded for an Nvidia Tesla C1060 board and found it much more straightforward to handle several thousand threads at once.

Not all types of threads are created equal. I usually explain CUDA to people as the "Zerg Rush" model of computing - instead of a couple, well-behaved, intelligent threads that try to be polite to each other and clean up their own messes, you throw a horde of a thousand little vicious, stupid threads at the problem all at once, and rely on some overlord to keep them in line.

Most of the guides explained it as, "Flops are free, bandwidth is expensive." This board had a 384 or 512-bit wide memory bus with a very high latency, and the reason you throw that many threads at it is to let the hardware cover up the latency - it can merge a huge number of memory reads/writes into one operation, and as soon as a thread is waiting on memory I/O it can swap another thread into that same SP and let it compute. If memory serves me, the board was divided into blocks of 8 scalar processors (each block had some scratchpad memory that could be accessed almost as fast as a register) and you wrote groups of 16 threads which ran in lock-step on that processor (no recursion was allowed, and if one branched, the others would just wait around until it reached the same point) in two rounds.
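A hedged sketch of that scratchpad in CUDA terms (__shared__ memory): each block stages data in fast on-chip memory, synchronizes, and reduces it there. The 256-thread block size and the names are illustrative.

    #define TILE 256   // must match the threads-per-block used at launch

    __global__ void sum_tiles(const float *in, float *block_sums, int n) {
        __shared__ float tile[TILE];                 // fast per-block scratchpad

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage one element per thread
        __syncthreads();                             // wait until the tile is loaded

        // Tree reduction within the block, entirely in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = tile[0];        // one partial sum per block
    }
    // Launched as: sum_tiles<<<numBlocks, TILE>>>(dev_in, dev_sums, n);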

Sure, that's a bit complex to optimize for, but it beats the hell out of conventional threading while trying to optimize for x86 SIMD. And if you manage to write it so it runs well on CUDA, it generally will scale effortlessly to whatever card you throw it at.

It's looking like OpenCL won't be much different, but I have yet to try it. I'm kind of eager, since apparently AMD/ATI's current cards, for the money, have a bit more raw power than Nvidia's.

Good luck with that (3, Insightful)

tedgyz (515156) | about 4 years ago | (#32918370)

This is a long-standing issue. If programs don't just "magically" run faster, then count out 90% or more of the programs that might otherwise benefit from this.

Re:Good luck with that (0)

Anonymous Coward | about 4 years ago | (#32918552)

That's okay - 99% or more of programs don't need to run faster. It's the remaining 1% that is actually doing something important that we want to run faster.

Re:Good luck with that (1)

crafty.munchkin (1220528) | about 4 years ago | (#32921050)

Can anyone provide any info on how this is going to work with regard to virtual environments? After all, there has been a rather large push toward virtualizing everything in the datacenter, and about the only physical server we have left in our server room is the Fax/SMS server, the ISDN card and GSM module for which could not be virtualised...

Yes, of course (2, Funny)

Anonymous Coward | about 4 years ago | (#32918378)

The sysadmins need new machines with powerful GPUs, you know, for business purposes.

Oh and, they sell ERP software on Steam now, too, so we'll have to install that as well.

Re:Yes, of course (5, Funny)

Yvan256 (722131) | about 4 years ago | (#32918494)

Portal 2? It's something for our Web server. It adds more portals to access the internet.

Re:Yes, of course (0)

Anonymous Coward | about 4 years ago | (#32918864)

Anyone can requisition cheap hardware; only a sysadmin can spend 100x for the same thing with more blinkin' lights. Now, with 1000 GPU cores per blade: get at it, programmers, performance issues are in your court now!

CUDA (3, Informative)

Lord Ender (156273) | about 4 years ago | (#32918402)

I was interested in CUDA until I learned that even the simplest of "hello world" apps is still quite complex and quite low-level.

NVidia needs to make the APIs and tools for CUDA programming simpler and more accessible, with solid support for higher-level languages. Once that happens, we could see adoption skyrocket.

Re:CUDA (4, Interesting)

Rockoon (1252108) | about 4 years ago | (#32918630)

Indeed. With CUDA, DirectCompute, and OpenCL, nearly 100% of your code is boilerplate interfacing to the API.

There needs to be a language where this stuff is a first-class citizen and not just something provided by an API.
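To illustrate the boilerplate-to-work ratio, a hedged sketch of a complete CUDA round trip (names like host_data are illustrative); the kernel body is the only part that is actually the computation.

    __global__ void scale(float *d, float f, int n) {         // the real "work"
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= f;
    }

    void scale_on_gpu(float *host_data, int n) {               // everything else
        float *dev_data;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&dev_data, bytes);                          // allocate on the card
        cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice); // copy input over
        scale<<<(n + 255) / 256, 256>>>(dev_data, 2.0f, n);             // launch
        cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost); // copy results back
        cudaFree(dev_data);                                             // release device memory
    }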

Re:CUDA (1)

jpate (1356395) | about 4 years ago | (#32919108)

Actors are a really good framework (with a few [akkasource.org] different [scala-lang.org] implementations [javaworld.com] ) for easy parallelization. Scala has an implementation of Actors as part of the standard library, so they really are first-class citizens.

Re:CUDA (1)

0100010001010011 (652467) | about 4 years ago | (#32919284)

You mean like C/Objective-C and Grand Central Dispatch [wikipedia.org] ?

It's open source and has been ported to work with FreeBSD and Apache.

Doesn't care if it's a CPU, GPU, 10xGPUs etc.

Re:CUDA (1)

Rockoon (1252108) | about 4 years ago | (#32920374)

No, that's not the same thing. Even if GCD worked with GPUs (which I see no evidence of), it still wouldn't be the same thing. While GPUs often have many "threads", each thread itself is a very wide SIMD architecture. For GCD in its current form to be useful, the work() function would still have to have the SIMD stuff baked in.

Re:CUDA (2, Informative)

BitZtream (692029) | about 4 years ago | (#32922392)

GCD combined with OpenCL makes it usable on a GPU, but that would be stupid. GPUs aren't really 'threaded' in any context that someone who hasn't worked with them would think of.

All the threads run simultaneously, and side by side. They all start at the same time and they all end at the same time in a batch (not entirely true, but it is if you want to actually get any boost out of it).

GCD is multithreading on a general processing unit, like your Intel CoreWhateverThisWeek processor. Code paths are run and scheduled on different cores as needed and don't really run side by side, but they can run at the same time, which is practical and useful in A LOT of cases.

OpenCL is multithreading on a graphics chip. It lets you do the same calculation over and over again or on a very large data set, side by side. You can calculate 128 encryption keys in one pass, but you can't calculate one encryption key, the average of your monthly bills, and draw a circle because the graphics chip doesn't do random processing side by side, it runs a whole bunch of the same instructions side by side and goes to hell in a handbasket the INSTANT you break its ability to run all the 'threads' side by side, executing the same instruction in each at the same time.

I really don't think you understand either standard GP multithreading or what GPUs are practically capable of doing.

Re:CUDA (2, Informative)

psilambda (1857088) | about 4 years ago | (#32922102)

Indeed. With Cuda, DirectCompute, and OpenCL, nearly 100% of your code is boilerplate interfacing to the API. There needs to be a language where this stuff is a first-class citizen and not just something provided by an API.

If you use CUDA, OpenCL or DirectCompute, it is - try the Kappa library; it has its own scheduling language that makes this much easier. The next version, which is about to come out, goes much further still.

Re:CUDA (1)

Austerity Empowers (669817) | about 4 years ago | (#32918660)

Probably can't happen; the parallel computing model is very different from the model you use in applications today. It's still evolving, but I doubt you will ever be in a position where you can write code as you do now and have it use and benefit from GPU hardware out of the gate.

Re:CUDA (1)

jgtg32a (1173373) | about 4 years ago | (#32919140)

ever?

Re:CUDA (1)

Dekker3D (989692) | about 4 years ago | (#32919444)

Yes. Just like we still doubt that anybody ever should need more than 640K.

Re:CUDA (1)

Bigjeff5 (1143585) | about 4 years ago | (#32921150)

Fun fact:

That quote is an urban legend, and there has never been any evidence that it was actually uttered by Gates.

You'd think confirmation would be easy, since it was supposedly said at a 1981 computer trade show.

It's like the famous quote "Let them eat cake", which is attributed to Marie Antoinette even though scholars have never been able to find any evidence that she actually uttered it.

The idea that 640k would be enough forever is idiotic, especially since the industry was so constricted by the 64k limit of 8-bit processors. Microsoft was actually influential in getting the limit to 640k from the 512k originally proposed for the 8088, because they wanted to get as much memory as possible.

Re:CUDA (2, Interesting)

cgenman (325138) | about 4 years ago | (#32918778)

While I don't disagree that NVIDIA needs to make this simpler, is that really a sizeable market for them? Presuming every college will want a cluster of 100 GPUs, they've still got about 10,000 students per college buying these things to game with.

I wonder what the size of the server room market for something that can't handle IF statements really would be.

Re:CUDA (1)

Lord Ender (156273) | about 4 years ago | (#32919306)

Well, since you can crack a password a hundred (or more) times faster with CUDA than with a CPU, they could at least sell a million units to the NSA and the FBI... and the analogous departments of every other country...

Re:CUDA (1)

Dekker3D (989692) | about 4 years ago | (#32919468)

Plenty of data processing could be parallelized into GPU-style code, I'll bet. As long as you've got enough data that needs enough processing, you can probably get a speedup from that. Just how much is another question.

Re:CUDA (0)

Anonymous Coward | about 4 years ago | (#32918810)

I was interested in CUDA until I learned that even the simplest of "hello world" apps is still quite complex and quite low-level.

But it looks awesome!

Re:CUDA (1)

bberens (965711) | about 4 years ago | (#32918838)

import java.util.concurrent.*; //???

Re:CUDA (1)

tedgyz (515156) | about 4 years ago | (#32919128)

I was interested in CUDA until I learned that even the simplest of "hello world" apps is still quite complex and quite low-level.

NVidia needs to make the APIs and tools for CUDA programming simpler and more accessible, with solid support for higher-level languages. Once that happens, we could see adoption skyrocket.

The simple fact is, parallel programming is very hard. More to the point, most programs don't need this type of parallelism.

Re:CUDA (0)

Anonymous Coward | about 4 years ago | (#32919612)

Re:CUDA (1)

Lord Ender (156273) | about 4 years ago | (#32919974)

The PyCUDA "hello world" involvies inline C code!

Re:CUDA (1)

russotto (537200) | about 4 years ago | (#32919906)

I found just the opposite; not enough low-level access. For instance, no access to the carry bit from integer operations!

Re:CUDA (0)

Anonymous Coward | about 4 years ago | (#32920324)

What a load of old tosh. The current CUDA SDKs and samples couldn't be easier to install and get working with. For god's sake the most recent version even allows you to call printf directly on the bloody card.

The fact is a GPU is a very differently structured piece of hardware. If you want to use that to execute certain classes of algorithm orders of magnitude faster than on a scalar processor, then great, join in. If you want to unthinkingly write high level code and expect it to go super fast, move along, there's nothing for you here.

Nvidia has chosen C as their lead language for nvcc because C is the most common HPC language by a country mile. If you want to get heavily in to HPC and you don't want to learn C, either learn to write cross compilers or start getting used to disappointment.

Re:CUDA (0)

Anonymous Coward | about 4 years ago | (#32921298)

The problem is, we do not need more software built with the programming equivalent of Duplo Lego. Sure, Lego Technics is more complicated to build, but you get better results.

Notice in TFA (1)

blai (1380673) | about 4 years ago | (#32918438)

"OpenCL is managed by a standards group, which is a great way to get nothing done"

I don't see the correlation.

Re:Notice in TFA (2, Interesting)

binarylarry (1338699) | about 4 years ago | (#32918958)

Not only that, but they posit that Microsoft's solution solves both the issue of Nvidia's proprietary-ness and the OpenCL board's "lack of action."

Fuck this article, I wish I could unclick on it.

Re:Notice in TFA (0)

Anonymous Coward | about 4 years ago | (#32922470)

DirectX has been (and still is) ahead of OpenGL in terms of implementing new features for many years now. The gap has been closing with the Khronos group taking over OpenGL, but it's still there and it's still a valid worry. It shouldn't be so difficult to understand why people are skeptical about the Khronos group's ability to compete in the general purpose GPU computing market -- so far their only marked success has been OpenGL ES, an API with virtually no competitors. Further, OpenCL is sort of alone and without a lot of support compared to the competing platforms. CUDA has strong nVidia backing (duh) and DirectCompute has Microsoft strong-arming ATI and nVidia into supporting it if they want to be DirectX 11 compatible. OpenCL has them voluntarily implementing it, if they feel like it. And frankly, as someone who works in GPU computing, OpenCL implementations from both vendors are lacking.

OpenCL (2, Informative)

gbrandt (113294) | about 4 years ago | (#32918440)

Sounds like a perfect job for OpenCL. When a program is rewritten for OpenCL, you can just drop in CPUs or GPUs and they get used.

Re:OpenCL (3, Informative)

Anonymous Coward | about 4 years ago | (#32918918)

Unfortunately, no. OpenCL does not map equally to different compute devices, and does not enforce uniformity of parallelism approaches. Code written in OpenCL for CPUs is not going to be fast on GPUs. Hell, OpenCL code written for ATI GPUs is not going to work well on nVidia GPUs.

Re:OpenCL (1)

quanticle (843097) | about 4 years ago | (#32919524)

Well, true, but that overlooks the fact that porting a program to OpenCL is not exactly a trivial task.

Of course not! (2, Informative)

Yvan256 (722131) | about 4 years ago | (#32918478)

It's not something that can be accomplished with a few libraries and lines of code.

It doesn't take a few libraries and lines of code... It takes a SHITLOAD of libraries and lines of code! - Lone Starr

Not really news... (1)

Third Position (1725934) | about 4 years ago | (#32918486)

I remember reading that IBM was planning to put Cell in mainframes [hpcwire.com] and other high-end servers several years ago, supposedly to accrue the same benefits. I don't really know whether or not that was ever followed through on; I haven't kept track of the story.

Re:Not really news... (2, Interesting)

Dynetrekk (1607735) | about 4 years ago | (#32918680)

I'm no expert, but from what I understand, it wouldn't be at all surprising. IBM has been regularly using its Power processors for supercomputers, and the architecture is (largely) the same. The Cell has some extra graphics-friendly floating-point units, but it's not entirely different from the CPUs IBM has been pushing for computation in the past. I'm not even sure the extra stuff in the Cell is interesting in the supercomputing arena.

Re:Not really news... (1, Interesting)

Anonymous Coward | about 4 years ago | (#32919758)

The Cell is a PowerPC processor, which is intimately related to the Power architecture. Basically, PowerPC was an architecture designed by IBM, Apple, and Motorola for use in high-performance computing. It was based in part on what is now an older version of IBM's POWER architecture. In short, POWER was the "core" architecture, and additional instruction sets could be added at fabrication time - kind of like Intel with their SSE extensions.

This same pattern continued for a long time. IBM's POWER architecture basically took the PowerPC instruction set, implemented it in new, faster ways. Any interesting extensions might/could be folded into the newer PowerPC architecture revision. The next generation of PowerPC branded chips would inherit the "core" of the last POWER chip's implementation. Later POWER was renamed to Power, to align it with PowerPC branding.

The neat thing is that the "core" instruction set is pretty powerful. You can run the same Linux binary on a G3 iMac, on a Cell, on a Gamecube or Wii (in principle), or on a supercomputing POWER7 or whatever IBM is up to now, as long as it doesn't need extensions. And you can do a lot of computation without extensions. The "base" is broad, unlike x86's strict hierarchy of modes. In some respects, this doesn't sound so neat, since the computing world has mostly settled on x86 for general-purpose computation, and any new x86 chips will probably include a big suite of extensions to the architecture too. Intel, AMD, and IBM eventually converged on this same RISC-y CISC idea, though IBM/Apple/Motorola managed to expose less of the implementation through the architecture at first.

Re:Not really news... (1)

PrecambrianRabbit (1834412) | about 4 years ago | (#32919760)

Yep, IBM produced the PowerXCell for that purpose, and used it to build Roadrunner, which was the world's first petaflop supercomputer. I'm not sure whether Cell is still being pushed forward these days, though.

That's somewhat different than the trend towards GPGPU that the article talks about, although it's related. Both approaches use semi-specialized parallel hardware for compute-intensive tasks.

Libraries (2, Insightful)

Dynetrekk (1607735) | about 4 years ago | (#32918584)

I'm really interested in using GPGPU for my physics calculations. But you know - I don't want to learn Nvidia's low-level, proprietary (whateveritis) in order to do an addition or multiplication, which may or may not outperform the CPU version. What would be _really_ great is stuff like porting the standard "low-level numerics" libraries to the GPU: BLAS, LAPACK, FFTs, special functions, and whatnot - the building blocks for most numerical programs. LAPACK+BLAS you already get in multicore versions, and there's no extra work on my part to use all cores on my PC. Please, computer geeks (i.e. more computer geek than myself), let me have the same on the GPU. When that happens, we can all buy Nvidia HotShit gaming cards and get research done. Until then, GPGPU is for the superdupergeeks.

Re:Libraries (3, Informative)

brian_tanner (1022773) | about 4 years ago | (#32918796)

It's not free, unfortunately. I briefly looked into using it but got distracted by something shiny (maybe trying to finish my thesis...)

CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.
http://www.culatools.com/ [culatools.com]

Re:Libraries (2, Informative)

Anonymous Coward | about 4 years ago | (#32919106)

It's not as complete as CULA, but for free there is also MAGMA [utk.edu] . Also, nVidia implements a CUDA-accelerated BLAS (CUBLAS) which is free.

As far as OpenCL goes, I don't think there has been much in terms of a good BLAS made yet. The compilers are still sketchy (especially for ATI GPUs), and the performance is lacking on nVidia GPUs compared to CUDA.
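For what it's worth, here is a hedged sketch of what calling CUBLAS looks like (this uses the handle-based cublas_v2 API of newer toolkits; the original API is handle-less but similar). The wrapper name is illustrative, and the vectors are assumed to be in device memory already.

    #include <cublas_v2.h>

    // y = alpha * x + y, computed on the GPU by the library.
    void gpu_saxpy(int n, float alpha, const float *dev_x, float *dev_y) {
        cublasHandle_t handle;
        cublasCreate(&handle);                               // set up the library context
        cublasSaxpy(handle, n, &alpha, dev_x, 1, dev_y, 1);  // the actual BLAS call
        cublasDestroy(handle);                               // tear the context down
    }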

Re:Libraries (1)

ihuntrocks (870257) | about 4 years ago | (#32922884)

I know I posted this a little bit above, but this sounds like something you might be looking for: any card with the PowerXCell setup. http://www.fixstars.com/en/products/gigaaccel180/features.html [fixstars.com] If you check under the specs section, you'll see that BLAS, LAPACK, FFT, and several other numeric libraries are supported. Also, GCC can target Cell. All around, not a bad setup for physics modeling.

IIS 3D (2, Interesting)

curado (1677466) | about 4 years ago | (#32918806)

So.. webpages will soon be available in 3D with anti-aliasing and realistic shading?

Re:IIS 3D (1)

Enderandrew (866215) | about 4 years ago | (#32919788)

Yes, actually. IE9 uses DirectDraw and your graphics card to render fonts smoother and faster. Firefox has a similar project in the works.

Wouldn't a DSP do better? (2, Interesting)

91degrees (207121) | about 4 years ago | (#32918836)

So why a GPU rather than a dedicated DSP? They seem to do pretty much the same thing, except a GPU is optimised for graphics. A DSP offers 32- or even 64-bit integers, has had 64-bit floats for a while now, allows more flexible memory write positions, and can use the previous results of adjacent values in its calculations.

Re:Wouldn't a DSP do better? (2, Informative)

pwnies (1034518) | about 4 years ago | (#32919024)

Price. GPUs are being mass-produced. Why create a separate market that only has the DSP in it (even if the technology is already present and utilized by GPUs) for the relatively small number of servers that will be using them?

Modern GPUs, for all their hype, are just DSPs (3, Interesting)

pslam (97660) | about 4 years ago | (#32919304)

I could almost EOM that. They're massively parallel, deeply pipelined DSPs. This is why people have trouble with their programming model.

The only difference here is the arrays we're dealing with are 2D and the number of threads is huge (100s-1000s). But each pipe is just a DSP.

OpenCL and the like are basically revealing these chips for what they really are, and the more general purpose they try to make them, the more they resemble a conventional, if massively parallel, array of DSPs.

There's a lot of comments on this subject along the lines of "Why couldn't they make it easier to program?" Well, it always boils down to fundamental complexities in design, and those boil down to the laws of physics. The only way you can get things running this parallel and this fast is to mess with the programming model. People need to learn to deal with it, because all programming is going to end up heading this way.

Re:Modern GPUs, for all their hype, are just DSPs (1)

91degrees (207121) | about 4 years ago | (#32920174)

Well, yes, there's not a lot in it and more and more has been handed over to more general purpose hardware. Still, there can't be a lot of use for depth buffer handling and caching (since for a lot of applications memory will be accessed perfectly linearly), or rasterisation, or texture filtering, and I'd have thought there would be some use for the slight extra flexibility from a DSP. Granted, all I can think of is that you have addressable write operations but I'm sure there's more.

Is the specialised 3d hardware really such a tiny part of a chip these days that it doesn't significantly affect the price?

Re:Modern GPUs, for all their hype, are just DSPs (2, Interesting)

pclminion (145572) | about 4 years ago | (#32920338)

There's a lot of comments on this subject along the lines of "Why couldn't they make it easier to program?"

Why should they? Just because not every programmer on the planet can do it doesn't mean there's nobody who can do it. There are plenty of people who can. Find one of these people and hire them. Problem solved.

Most programmers can't even write single-threaded assembly code any more. If you need some assembly code written, you hire somebody who knows how to do it. I don't see how this is any different.

As far as whether all programming will head this direction eventually, I don't think so. Most computational tasks are data-bound, and throughput is enhanced by improving the data backends, which are usually handled by third parties. We already don't know how the hell our own systems work. For the people who really need this kind of thing, you need to go out and learn it or find somebody who knows it. Expecting that the whole world can do it is crazy thinking.

Crysis 2... (2, Funny)

drc003 (738548) | about 4 years ago | (#32918928)

...coming soon to a server farm near you!

Re:Crysis 2... (2, Interesting)

JorgeM (1518137) | about 4 years ago | (#32919042)

I'd love this, actually. My geek fantasy is to be able to run my gaming rig in a VM on a server with a high end GPU which is located in the basement. On my desk in the living room would be a silent, tiny thin client. Additionally, I would have a laptop thin client that I could take out onto the patio.

On a larger scale, think Steam but with the game running on a server in a datacenter somewhere which would eliminate the need for hardware on the user end.

Re:Crysis 2... (1)

drc003 (738548) | about 4 years ago | (#32919110)

I like the way you think. In fact now I'm all excited at th.......ahhhhhhhhhooohhhhhhhhhhhh. Oops.

Re:Crysis 2... (1)

SleazyRidr (1563649) | about 4 years ago | (#32919496)

+1 overinformative.

Re:Crysis 2... (1)

Dalambertian (963810) | about 4 years ago | (#32919396)

Sacrificing all my mod points to say this, but a friend of mine did this with his PS3 so he could play remotely using a PSP. Also, check out OnLive for a pretty slick implementation of gaming in the cloud.

You must be salivating about OnLive, then (1)

rsborg (111459) | about 4 years ago | (#32920740)

From wikipedia [wikimedia.org] :

OnLive is a gaming-on-demand platform, announced in 2009[3] and launched in the United States in June 2010. The service is a gaming equivalent of cloud computing: the game is synchronized, rendered, and stored on a remote server and delivered via the Internet.

Sounds very interesting to me, as I'm pretty sick of upgrade treadmills. OnLive would probably also wipe out hacked-client based cheating (though bots and such might still be doable). It would also allow bleeding-edge games to be enjoyed by those without the best hardware, increasing adoption rates for those types of games.

RemoteFX (2, Interesting)

JorgeM (1518137) | about 4 years ago | (#32918930)

No mention of Microsoft's RemoteFX coming in Windows 2008 R2 SP1? RemoteFX uses the server GPU for compression and to provide 3D capabilities to the desktop VMs.

Any company large enough for a datacenter is looking at VDI, and RemoteFX is going to be supported by all of the VDI providers except VMware. VDI, not the relatively niche case of massive calculations, will put GPUs in the datacenter.

How much number-crunching is your server doing? (1)

Animats (122034) | about 4 years ago | (#32919100)

If your data center is running stochastic tests, trying scenarios on derivative securities, it's a big win. If it's serving pages with PHP, zero win.

There are many useful ways to use a GPU. Machine learning. Computer vision. Finite element analysis. Audio processing. But those aren't things most people are doing. If your problem can be expressed well in MATLAB, a GPU can probably accelerate it. MATLAB connections to GPUs are becoming popular. They're badly needed; MATLAB is widely used in engineering and scientific work, but it's not as fast as it should be.

Re:How much number-crunching is your server doing? (0)

Anonymous Coward | about 4 years ago | (#32919350)

Depends on how you're serving up those PHP pages. It might be a big win if you're doing lots of SSL connections.

Encryption requires calculations that can benefit from faster math processors...

Re:How much number-crunching is your server doing? (1)

smallfries (601545) | about 4 years ago | (#32919446)

But it is not the same kind of maths. Most GPUs support very fast use of single-precision floats. The asymmetric crypto that you use to establish your SSL connection uses very large integers, and the AES that encrypts the stream operates in a finite field. Neither can be executed efficiently on a GPU.

Re:How much number-crunching is your server doing? (1)

ceoyoyo (59147) | about 4 years ago | (#32919968)

Many people will be doing those things going forward: all forms of machine learning. The obvious example is natural language processing for your web page.

Parallel Pr0n (1)

tedgyz (515156) | about 4 years ago | (#32919150)

There's always an application for that.

Why call the GPU a gaming chip? (1)

wrightrocket (1664871) | about 4 years ago | (#32919510)

It is a Graphics Processing Unit, not a Gaming Processing Unit. Sure, they are great for gaming, but also very useful for other types of 3D and 2D rendering of graphics.

Re:Why call the GPU a gaming chip? (1)

Urkki (668283) | about 4 years ago | (#32920658)

It is a Graphics Processing Unit, not a Gaming Processing Unit. Sure, they are great for gaming, but also very useful for other types of 3D and 2D rendering of graphics.

But the top bang-for-the-buck chips are designed for games. Their architecture (number of pipelines, etc.) is designed to maximize performance in typical game use, at the framerates games need. In other words, they're gaming chips, just like e.g. the PS3 is a game console, no matter whether it can be used to build a cluster for number crunching.

Huh... (1)

geemon (513231) | about 4 years ago | (#32919536)

Saw the title of this article and wondered "how will Las Vegas casinos make the move to have all of my gaming chips put onto a server."

Major benefits seem overlooked (1)

nickdwaters (1452675) | about 4 years ago | (#32920368)

While it is true that parallelization does not necessarily make a single program operate faster or more efficiently, it is true that multi-CPU systems allow more programs to run concurrently. In a major corporate context, there are thousands of jobs running at any given time. The more effective CPUs (and memory) you have, the easier it is to keep costs down.

Car Analogy (0)

Anonymous Coward | about 4 years ago | (#32922334)

So a car analogy would be: CPUs are normal cars, and GPUs are drag racers - high speed and no brakes?

GPU apps are pretty specific... (2, Insightful)

bored (40072) | about 4 years ago | (#32922734)

I've done a little CUDA programming, and I've yet to find significant speedups doing it. Every single time, some limitation in the architecture keeps it from running well. My last little project ran about 30x faster on the GPU than on the CPU; the only problem was that the overhead of getting the data to the GPU, plus the computation, plus the overhead of getting it back, was roughly equal to the time it took to just dedicate a CPU.

I was really excited about AES on the GPU too, until it turned out to be about 5% faster than my CPU.

Now if the GPU were designed more as a proper coprocessor (a la the early x87, or early Weitek) and integrated better into the memory hierarchy (with the funky texture RAM and such off to the side), some of my problems might go away.
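The transfer-overhead point is easy to measure with CUDA events; here is a hedged sketch (the kernel, buffers and sizes are illustrative and assumed to be set up earlier) that times the copies separately from the computation.

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0, 0);
    cudaMemcpy(dev_in, host_in, bytes, cudaMemcpyHostToDevice);    // upload
    cudaEventRecord(t1, 0);
    my_kernel<<<blocks, threads>>>(dev_out, dev_in, n);            // compute (hypothetical kernel)
    cudaEventRecord(t2, 0);
    cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);  // download
    cudaEventRecord(t3, 0);
    cudaEventSynchronize(t3);

    float up_ms, run_ms, down_ms;
    cudaEventElapsedTime(&up_ms,   t0, t1);
    cudaEventElapsedTime(&run_ms,  t1, t2);
    cudaEventElapsedTime(&down_ms, t2, t3);
    // If up_ms + down_ms rivals run_ms, a 30x kernel speedup can evaporate,
    // exactly as described above.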
