
Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs

timothy posted more than 3 years ago | from the quick-like-double-time dept.


An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."


187 comments

Excel Charts (3, Insightful)

Anonymous Coward | more than 3 years ago | (#33412026)

I find it very disappointing that a group of programmer/computer science types who even supply BibTeX to make it easier to reference their work would resort to screen-capturing an Excel chart to display their data.

Re:Excel Charts (-1, Troll)

Anonymous Coward | more than 3 years ago | (#33412040)

I'm sure they find it equally disappointing that you suck at math.

Re:Excel Charts (-1, Troll)

Anonymous Coward | more than 3 years ago | (#33412064)

Why is your shit all 'tarded and shit?

Re:Excel Charts (0, Funny)

Anonymous Coward | more than 3 years ago | (#33412222)

Researchers at the University of Virginia have recently open sourced

I stopped this shit about right there. You think I'm going to trust my sorting to some open sores buggy shit? I think I'll just keep using Microsoft for my algorithms thank you very much.

Re:Excel Charts (0)

ScrewMaster (602015) | more than 3 years ago | (#33412254)

Researchers at the University of Virginia have recently open sourced

I stopped this shit about right there. You think I'm going to trust my sorting to some open sores buggy shit? I think I'll just keep using Microsoft for my algorithms thank you very much.

I'm assuming that was an attempt at humor. Otherwise, you're going to get modded all to hell very soon.

Re:Excel Charts (1)

udippel (562132) | more than 3 years ago | (#33412360)

Never mind what the poster intended; I for one found it quite funny.
(I simply have no mod points right now, so a comment will have to do.)

Re:Excel Charts (3, Funny)

bramp (830799) | more than 3 years ago | (#33412128)

I've also found this annoying when reading papers. Perhaps I've just spent too much time learning how to use gnuplot to make my graphs look nice.

Re:Excel Charts (0)

Anonymous Coward | more than 3 years ago | (#33412730)

I'm more bothered by the ragged edges and too-wide text columns. Seriously. The paper looks unprofessional. With LaTeX, this is automatic.

Re:Excel Charts (1, Insightful)

Anonymous Coward | more than 3 years ago | (#33412134)

Who cares what tool they use to display data, if they're getting their point across in an effective manner? Good lord...

Re:Excel Charts (3, Insightful)

Anpheus (908711) | more than 3 years ago | (#33412208)

Maybe Excel was just the right tool for the job? It's quick and easy to use, and easy to reformat the graphs in.

I know the Linux tools tend to have a longer loop between tweaking, rendering and displaying, so a fast WYSIWYG tool works just fine.

Re:Excel Charts (0)

Anonymous Coward | more than 3 years ago | (#33412604)

Seems the issue is that, instead of providing the data, it's a screenshot of the data (making the results harder to verify?).

Re:Excel Charts (1)

dave420 (699308) | more than 3 years ago | (#33413058)

Maybe the screenshot isn't the finished paper, just a sneak peek at the results for those interested?

Re:Excel Charts (0)

Anonymous Coward | more than 3 years ago | (#33412716)

FUD

Re:Excel Charts (0)

Anonymous Coward | more than 3 years ago | (#33412900)

Are you f*cking kidding me? gnuplot is all that was needed here.

Re:Excel Charts (4, Insightful)

pspahn (1175617) | more than 3 years ago | (#33412798)

Actually, I find it even more disappointing that there still isn't a standard, decent way to display datasets on a web page. Why isn't there a nice embeddable one with column sorting and robust methods for retrieving the data? There are solutions, sure, but I have yet to find one that isn't unnecessarily complex, or just plain ugly and difficult to use. But I guess it's just a matter of time, right?

it's a shame that it's only on integer keys... (0, Offtopic)

aussieslovethecock (1840034) | more than 3 years ago | (#33412036)

because the "giga-sort" barrier, as they so eloquently put it, has yet to be conquered in other important fields. Like my dick for instance, which is gigantic but can't be handled by any normal algorithms due to it's ridiculous size.

The video card in question.. (5, Informative)

black3d (1648913) | more than 3 years ago | (#33412038)

Specifically, a GTX480 (just over 1 B keys/sec), followed by a Tesla 2050 at around 75% of the GTX480's speed (745 M keys/sec).

Re:The video card in question.. (3, Funny)

MarkRose (820682) | more than 3 years ago | (#33412294)

I never would have suspected the GTX480 would have been good at this sorta thing.

Re:The video card in question.. (1)

Sycraft-fu (314770) | more than 3 years ago | (#33412384)

That's strange, I wonder why the 480 is faster? The 2070 should be the Tesla equivalent of the GTX 480. nVidia doesn't make different processors, they just change the configuration for their cards. Teslas have no display output and generally more RAM.

Re:The video card in question.. (1)

black3d (1648913) | more than 3 years ago | (#33412458)

It's quite possible a 2070 would perform just as well; they just tested against a 2050. I thought the 2070 would actually perform better than the 480, due to screeds (still no idea if it's a word, but love it!) more RAM, but I presume the 2050 is simply what they had available. Maybe a dedicated 2070 wasn't in this year's budget? :)

Re:The video card in question.. (1)

Sycraft-fu (314770) | more than 3 years ago | (#33412590)

Oh, I didn't notice that. I presume the 2050 is the GTX470 (or maybe 460) equivalent, which would explain things.

Re:The video card in question.. (2, Informative)

FilipeMaia (550301) | more than 3 years ago | (#33412818)

The reason the GTX480 is faster is that it has 15 SMs compared to the Tesla 2050's 14. The GTX 480 also runs at a higher clock speed (700 MHz compared to 575 MHz). Put together, that's 575/700 * 14/15 = 76.7%, which comes pretty close to the observed 75%.

Slashdot dont need no GPU.... (0, Troll)

Anonymous Coward | more than 3 years ago | (#33412042)

... to sort my first post to the top.

So why is this GPU sorting ability important to me?

No Surprise... (5, Funny)

Anonymous Coward | more than 3 years ago | (#33412082)

GPUs have always been better at sorting your money from your wallet.

Not a barrier (5, Insightful)

Captain Segfault (686912) | more than 3 years ago | (#33412084)

This isn't a "barrier" like the "sound barrier". There are no special difficulties that start around 1G/sec! It's just a threshold.

Don't get me wrong -- I'm not saying this isn't impressive, but no "barrier" was broken here!

Re:Not a barrier (4, Insightful)

XanC (644172) | more than 3 years ago | (#33412096)

It's not a threshold! It's just a milestone.

Re:Not a barrier (3, Funny)

CoolGopher (142933) | more than 3 years ago | (#33412132)

It's just a milestone.

Hang on, since when do you measure sorting performance using a distance indicator? And an imperial one at that!

No, this is not a serious comment.

Re:Not a barrier (1)

Tumbleweed (3706) | more than 3 years ago | (#33412448)

Hang on, since when do you measure sorting performance using a distance indicator? And an imperial one at that!

The GTX480 can make the Kessel Run in 18 Parsecs, too.

And why should I have to add "Kessel" to my spelling dictionary? Geez.

Re:Not a barrier (0)

Anonymous Coward | more than 3 years ago | (#33412164)

Technically the sound barrier isn't a barrier either!

Re:Not a barrier (2, Insightful)

caerwyn (38056) | more than 3 years ago | (#33412200)

Actually, if you look at shockwave dynamics during the moment an object crosses from subsonic to supersonic velocity, it can very easily be considered much more of a barrier than 1gkeys/sec can.

Re:Not a barrier (4, Funny)

PatPending (953482) | more than 3 years ago | (#33412256)

Actually, if you look at shockwave dynamics during the moment an object crosses from subsonic to supersonic velocity, it can very easily be considered much more of a barrier than 1gkeys/sec can.

Actually in this case, your analogy should use ludicrous speed.

Re:Not a barrier (0)

Anonymous Coward | more than 3 years ago | (#33412272)

my brains.... are going into my feeeeeet

Re:Not a barrier (1)

ScrewMaster (602015) | more than 3 years ago | (#33412280)

Actually, if you look at shockwave dynamics during the moment an object crosses from subsonic to supersonic velocity, it can very easily be considered much more of a barrier than 1gkeys/sec can.

Yes, it was [wikipedia.org]

Re:Not a barrier (1)

jamesh (87723) | more than 3 years ago | (#33412554)

Actually, if you look at shockwave dynamics during the moment an object crosses from subsonic to supersonic velocity, it can very easily be considered much more of a barrier than 1gkeys/sec can.

I take it you didn't feel the disturbance in the force when the 1gkeys/sec barrier was broken then? It felt like a million GPU marking executives all took a deep breath and shouted "w00t!"

Re:Not a barrier (1)

jamesh (87723) | more than 3 years ago | (#33412560)

GPU marking executives

and that should have been marketing executives, obviously. I blame the migraine I am currently experiencing (presumably a result of said force disturbance :)

Re:Not a barrier (0)

Anonymous Coward | more than 3 years ago | (#33412988)

I don't know... What are the shockwave dynamics like at 1 billion keys/second?

Re:Not a barrier (1)

sznupi (719324) | more than 3 years ago | (#33413092)

Though the effects are in full swing quite a bit before "the moment an object crosses from subsonic to supersonic velocity"; the "barrier" is marked by entry into transonic flight, around Mach 0.2 below the speed of sound.

Um... (5, Insightful)

Anonymous Coward | more than 3 years ago | (#33412088)

Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.

Re:Um... (1, Troll)

martin-boundary (547041) | more than 3 years ago | (#33412192)

The problem with big-oh notation is that the constant isn't explicit, so for any given n (pick as large as desired), it is possible that O(nlogn) As always, in theory there's no difference between theory and practice, but in practice there is...

Ugh. (3, Insightful)

martin-boundary (547041) | more than 3 years ago | (#33412226)

Stupid HTML ate my <...

The problem with big-oh notation is that the constant isn't explicit, so for any given n (pick as large as desired), it is possible that O(nlogn) < O(n) for some choice of constants. That's why ops per second is still a useful metric when comparing implementations on standardized hardware.

As always, in theory there's no difference between theory and practice, but in practice there is...

Re:Ugh. (1)

jmerlin (1010641) | more than 3 years ago | (#33412608)

nlog(n) < n when log(n) < 1, log(n) < 1 when n < e (or 10, depending on whether you know math or you just do math), the next biggest integer is 3 (or 10). When you're talking about 3 (or 10) things, you usually don't care about the efficiency of the algorithm. When testing on standardized hardware, theoretical and actual results are close except in two cases: 1. When n is very small. 2. When optimization (be it in software or hardware) plays a large role. Both of these are largely negligible by using a very large test case and compiling in debug mode, so in reality, this metric is quite honestly useless.

Re:Ugh. (1)

martin-boundary (547041) | more than 3 years ago | (#33412762)

That's not the range where the constants matter, though. Take log(10^12) ≈ 28; then for n = 10^9, an algo that's n*log(n) is equivalent to (or even better than) an algo that's 28n over the whole range [1,n]. Now, a factor of 30 in practical implementations is not that rare. Penalties due to caching effects or disk I/O can amount to much more, and amortized over the full runtime they could give you that. So I don't think it's that pathological to expect some pairs of algos to compare one way in theory and the opposite way in implementation.

Re:Ugh. (2, Informative)

metacell (523607) | more than 3 years ago | (#33412784)

Dude, an algorithm which is O(n) is not automatically faster than one which is O(n*log(n)) just because n < n*log(n).

When an algorithm is O(n*log(n)), it means the actual time requirement is p*n*log(n)+q, where p and q are constants specific to the implementation.

The O(n*log(n)) algorithm is faster than the O(n) one when

p1*n*log(n)+q1 < p2*n+q2

... and for any given n, it is possible to choose p1, p2, q1 and q2 so that either algorithm comes out faster.

This means, for example, that an algorithm which is O(n) isn't automatically faster than an algorithm which is O(n*log(n)) on lists with three elements or more. The O(n) algorithm may take a hundred times longer to sort a list of two elements than the O(n*log(n)) one (due to each step being more complex), and in that case the lists will need to grow quite a bit before the O(n) algorithm becomes faster.
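To see where that crossover actually lands, here is a minimal sketch (plain C++, with made-up constants p1, q1, p2, q2 standing in for measured per-implementation costs) that scans for the first n where the O(n) algorithm wins:

    #include <cmath>
    #include <cstdio>

    int main() {
        // Hypothetical costs: the O(n log n) algorithm has cheap steps,
        // the O(n) algorithm has expensive ones (e.g. heavy per-key setup).
        const double p1 = 1.0,  q1 = 10.0;   // t1(n) = p1*n*log2(n) + q1
        const double p2 = 20.0, q2 = 100.0;  // t2(n) = p2*n + q2

        for (long long n = 2; n <= (1LL << 40); n *= 2) {
            double t_nlogn = p1 * n * std::log2((double)n) + q1;
            double t_n     = p2 * n + q2;
            if (t_n < t_nlogn) {  // first power of two where O(n) wins
                std::printf("O(n) overtakes O(n log n) at n = %lld\n", n);
                return 0;
            }
        }
        std::printf("O(n log n) wins over the whole range\n");
        return 0;
    }

With these constants the crossover lands around n = 2^21; shrink p2 and it moves earlier, which is the whole point: at any finite n, the constants, not the asymptotics, decide who wins.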

Re:Ugh. (1)

gmthor (1150907) | more than 3 years ago | (#33412808)

I take it that you haven't quite got big-O notation right. Since the constants are different, it makes no sense to calculate when n*log(n) < n.

Ask yourself instead: when is C*n*log(n) < K*n, for unknown C and K (these constants depend on the implementation)? You will see that the only way to determine this is to measure the result of your implementation.

Re:Ugh. (0)

Anonymous Coward | more than 3 years ago | (#33412902)

No, it is also possible to sit down and count the cycles each assembler instruction will take.
I have plenty of source files where the comments are cycle counts from a specific point.

Re:Um... (5, Interesting)

PrecambrianRabbit (1834412) | more than 3 years ago | (#33412196)

Given that the particular hardware setup is detailed here (a GTX 480 achieves the 1 billion keys/sec figure), and the algorithm used (radix sort) has known asymptotic behavior (O(nk) for n keys of length k), 10^9 keys/sec is quite meaningful, particularly since it's a significant implementation challenge (possibly even an algorithmic challenge) to port this algorithm to a GPU.

Furthermore, I think sorting speed is appropriately measured in keys/sec. Big-O does not in fact describe the speed, but rather the upper bound of the growth of an algorithm's asymptotic running time, which needs to be paired with the implementation, architecture, and data set to determine a speed. It turns out the constant factors can actually be quite important in practice.
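For reference, the O(nk) the parent cites is easiest to see in the plain CPU formulation of LSD radix sort: k digit-wide passes, each doing O(n) work. A minimal byte-wise sketch (ordinary serial C++, not the authors' GPU code):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // LSD radix sort on 32-bit keys: four 8-bit passes, each pass O(n),
    // so O(n*k) total with k = 4 digits.
    void radix_sort(std::vector<uint32_t>& keys) {
        std::vector<uint32_t> tmp(keys.size());
        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[257] = {0};
            for (uint32_t k : keys)                // histogram this digit
                ++count[((k >> shift) & 0xFF) + 1];
            for (int i = 0; i < 256; ++i)          // prefix sum -> bucket starts
                count[i + 1] += count[i];
            for (uint32_t k : keys)                // stable scatter by digit
                tmp[count[(k >> shift) & 0xFF]++] = k;
            keys.swap(tmp);
        }
    }

Roughly speaking, a GPU version has to replace the serial histogram and prefix sum with parallel reductions and scans, which is where much of the implementation challenge lives.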

Also (5, Interesting)

Sycraft-fu (314770) | more than 3 years ago | (#33412624)

CS people get way too caught up in Big O, forgetting that, as you note, it describes the asymptotic upper bound, not actual practice, AND is only relevant with respect to one factor, n. A great example is ray tracing. CS people love ray tracing because most create a ray tracer as part of their studies. They love to go on about how great it is for rendering because "It is O(log n)!" They love to hype it over rasterization as done by current GPUs. However, there are two really important things they forget:

1) What is n? In this case, polygons. It scales with the log of the number of polygons you have. This is why ray-tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However, it turns out polygons aren't the only thing that matters for graphics. Pixels also matter, and ray tracing is O(n) with respect to pixels, and it gets worse for anti-aliasing. For FSAA you cast multiple rays per pixel, so 4x FSAA has 4x the computation requirements. It turns out rasterization scales much better with regard to resolution and AA. In fact, these days 4x FSAA is often O(1) in actual implementation, in that turning it on doesn't hit the frame rate. That is also why ray-tracing demos are low resolution, because THAT'S the hard part.

2) Real hardware limits. In the case of ray tracing, it is memory access. While those polygons scale in a perfectly logarithmic fashion in theory, real system RAM isn't so forgiving. As you add more and more, you run into RAM bandwidth limits and things slow down. All the memory access required becomes a limit that wasn't on paper. You can cry all you like that it wasn't a problem in theory; on actual hardware you run into other limits.

People need to better understand that Big O is a theoretical tool for comparing the scaling of algorithms. That is useful, but you then have to consider the reality of the situation. CS people also need to understand that for users there is only one benchmark that matters: the clock on the wall. Whatever is faster is better. It doesn't matter what the theory says; if the hardware can do X faster than Y, then X is better according to users.

Re:Also (1)

chammy (1096007) | more than 3 years ago | (#33412964)

This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However turns out polygons aren't the only thing that matters for graphics.

The spheres would most likely be represented as an equation, not a soup of polygons. It's much more efficient (not to mention a whole lot easier) to raytrace. It's also infinitely precise, which is actually why a lot of people are more interested in raytracing than approximating things with polygons.

For instance, it's a heck of a lot easier to render isosurfaces [demon.co.uk] in a raytracer than turning them into a big polygon soup and rasterizing that.

Re:Also (1, Informative)

Anonymous Coward | more than 3 years ago | (#33412986)

Doesn't matter what the theory says, if the hardware can do X faster than Y, then X is better according to users.

Normally big-O notation is applied on a purely theoretical level where all operations are assumed to have the same base cost in terms of time to execute. This does not make the notation invalid for real-world applications and implementations, however. But when applying it there, you have to adjust your formulas by weighting operations by their actual execution time. In practice this is usually a waste of effort, as it's generally faster to just write an implementation and time it on various pieces of hardware.

In the article we're talking about, they are comparing a single implementation of a single algorithm on several pieces of hardware. So first of all, the summary shouldn't be shouting about breaking any kind of record; they weren't trying to hit any particular benchmark. It's a relative test of the hardware, not of the algorithm or implementation.

The reason using big-O and making comparisons is useful is that if all we use is this type of test, the answer to any speed problem is simply "Get faster hardware, or buy a piece of hardware which runs this code faster." In all likelihood there are other methods which, when implemented on the same hardware, may yield much faster results. Heck, it's possible someone else's implementation of the same algorithm may yield faster results as well.

In regards to your comments about raytracing vs. polygon rendering, all I'm going to say is that you don't have a very good concept of what raytracing really is if you think a sphere created from a million triangles will raytrace faster than one modeled as a mathematical sphere. It won't; those demonstrations are a pure head-to-head comparison operating on a scene which has actually been optimized for a non-raycasting technique.

Re:Also (1)

Thiez (1281866) | more than 3 years ago | (#33413024)

I don't think it's a fair comparison, really. The only reason modern games look so pretty is because they have the GPU that happens to be designed to help them out. If companies had been designing GPUs to do raytracing for over a decade, it seems rather likely to me that we could do raytracing at modern resolutions.

Wouldn't it be more fair to compare raytracing to rasterization when you're only allowed to use the CPU?

Re:Also (1)

sznupi (719324) | more than 3 years ago | (#33413078)

Not quite, you can just buy Larab...oh, wait, it seems there were some problems with its goals...

Re:Um... (0)

Anonymous Coward | more than 3 years ago | (#33412214)

Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.

Not quite. Big-O is a measure of asymptotic complexity, which is only part of how fast an algorithm is. The other part is the constant factors. An algorithm which runs in n time is twice as fast as an algorithm which runs in 2n time, even though they have the same big-O complexity. An algorithm with a worse big-O complexity but lower constant factors can be faster in practice.

There are many algorithms which have great big-O complexities but which are rarely or never used due to big constant factors.

Re:Um... (4, Interesting)

SpazmodeusG (1334705) | more than 3 years ago | (#33412258)

I wish they'd start putting a "P" into these big-O notations, where P is the number of processors. Some algorithms scale well with more cores, some don't; putting the P in shows which.

e.g. O(n/P) describes an algorithm that scales perfectly as cores are added; O(n/log(P)), not so much.

Re:Um... (-1, Troll)

Anonymous Coward | more than 3 years ago | (#33412796)

....Big-O originates from pure mathematics and is an abstract measure divorced from specific computing implementations. its not even tangentially related to hardware. learn to science, you bitchface fuckwit.

Re:Um... (2, Interesting)

frosty_tsm (933163) | more than 3 years ago | (#33412816)

While an interesting idea, many algorithms behave unpredictably on multiple processors (depending on how much communication is required). Some will even be slower!

The effect that extra CPUs have is too dependent on the hardware implementation to formalize like this.

Re:Um... (1, Insightful)

Anonymous Coward | more than 3 years ago | (#33412952)

The effect that extra CPUs have is too dependent on the hardware implementation to formalize like this.

It's no more exotic than having several levels of cache between the CPU and RAM (or even treating RAM as a cache between the CPU and the network or spinning disk). Even O(N) algorithms curve when N overflows L1, L2, L3, and RAM.

Re:Um... (3, Informative)

Anonymous Coward | more than 3 years ago | (#33412832)

Typically, I hear researchers describe the parallelism of an algorithm separately from its computational complexity (big-O notation), using the idea of "speedup."

The perfect scaling in your first example is linear speedup, and the second example has logarithmic speedup (that is, the speedup is log(p)).

Here is the relevant Wikipedia article [wikipedia.org].
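In symbols, with T(p) the running time on p processors (the standard textbook definition, matching that article's usage):

    S(p) = \frac{T(1)}{T(p)}, \qquad
    T(p) = O(n/p) \;\Rightarrow\; S(p) = \Theta(p)\ \text{(linear)}, \qquad
    T(p) = O(n/\log p) \;\Rightarrow\; S(p) = \Theta(\log p).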

Re:Um... (0)

Anonymous Coward | more than 3 years ago | (#33412596)

You think it is not useful to know how fast a certain algorithm runs on certain hardware?

New level of gaming. (2, Funny)

Murdoch5 (1563847) | more than 3 years ago | (#33412180)

Hold on, so I can play JezzBall, Chip's Challenge and all my favourite old-school games with a greater level of speed?

PRON! Marches on! (3, Funny)

jewishbaconzombies (1861376) | more than 3 years ago | (#33412184)

Thank god, because my online Porn collection isn't going to sort itself.

Re:PRON! Marches on! (1, Funny)

Anonymous Coward | more than 3 years ago | (#33412634)

A porn collection is something that needs to be sorted manually.

Re:PRON! Marches on! (1, Funny)

Anonymous Coward | more than 3 years ago | (#33413038)

"I sorted that whole 1 giga porn collection single-handedly!".

Big deal. Radix sort works well IF ... (3, Interesting)

crovira (10242) | more than 3 years ago | (#33412204)

you've got lots and lots of RAM, with room for the keys and lots of space to waste on unfilled pointers.

Pass 1: read the records and, at the slot addressed by the key's radix, store a record URI.
Pass 2: sweep RAM and read the record URIs in the order you encounter them, copying the records onto a sequential write device.

I was doing this years ago.

If you are careful with the number and sizes of your read buffers, the re-read done for pass 2 doesn't have to be all that disk-intensive.

You can even generate the keys using whatever hash function you find that is truly efficient, and store collisions separately (leave a bit high and go into a linked list maintained by the hash generator for those few keys that hash redundantly).
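A minimal sketch of that two-pass scheme (hypothetical Record/URI types, keys assumed unique and used directly as table indices; this is a reading of the parent's description, not code from TFA):

    #include <cstdint>
    #include <vector>

    struct Record { uint32_t key; /* payload ... */ };
    using URI = const Record*;   // stand-in for whatever locates a record

    std::vector<URI> two_pass_sort(const std::vector<Record>& recs,
                                   size_t key_space) {
        // Pass 1: read the records; at the slot addressed by the key,
        // store a record URI. Most slots stay as "wasted" null pointers.
        std::vector<URI> table(key_space, nullptr);
        for (const Record& r : recs)
            table[r.key] = &r;

        // Pass 2: sweep the table in order, emitting records as found
        // (in real life, copying them to a sequential write device).
        std::vector<URI> out;
        out.reserve(recs.size());
        for (URI u : table)
            if (u) out.push_back(u);
        return out;
    }

Collisions from a hash-generated key would need the side list the parent describes; this sketch assumes one record per slot.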

Re:Big deal. Radix sort works well IF ... (4, Insightful)

Trepidity (597) | more than 3 years ago | (#33412618)

Well, yeah, they're not claiming they invented radix sort. They're claiming that their GPU implementation of radix sort runs about 4x as fast as the CPU implementation you describe.

Re:Big deal. Radix sort works well IF ... (3, Interesting)

kramulous (977841) | more than 3 years ago | (#33412946)

So, does that mean if they went out and got the fastest Xeon processor (they used the fastest gpu - excluding the C2070), wrote parallel and used the Intel Compiler (writing to it) the speedup over the cpu is less than zero?

After having just looked at their code, also remove the cpu stl shit (actually any template since they don't vectorise). If you are writing speedy code for the gpu, to perform an adequate test for the cpu you also have to write appropriately.

hahahahaha

This gets better and better...
They only timed the compute time. Cudamalloc was not part of the timing or cudamemcpy.

Sorry, I only count 'time to solution'. That is all i'm interested in. I thought is was strange that a less than O(n**2) was faster on the gpu.

It is like writing benchmarks that ignore disk IO, ignore reading from system memory, L3 cache, etc. Only time stuff that is in the registers.

Re:Big deal. Radix sort works well IF ... (0)

Anonymous Coward | more than 3 years ago | (#33412684)

Big deal, I doubt you were getting anywhere near the performance they were quoting. Was yours coded for a GPU as well?
Or was this just your attempt at showing how cool you are?

Re:Big deal. Radix sort works well IF ... (0)

Anonymous Coward | more than 3 years ago | (#33412766)

I was doing this years ago.

You, Sir, are an idiot and have not understood the article.

Who the hell modded this guy up?

Re:Big deal. Radix sort works well IF ... (2, Insightful)

evilWurst (96042) | more than 3 years ago | (#33412948)

It's generally not size of RAM that breaks radix sort; it's the size of cache. Modern processors are highly reliant on cache, which means they're highly reliant on things in memory being in small tight chunks that fit in cache - because cache misses are expensive enough that if you thrash cache badly enough, you may end up running slower than if you hadn't had any cache at all.

Good comparison sorts may start fragmented, but by their very nature each pass of the algorithm makes them less so. Radix sort is the other way around; it follows pointers (so more precious scarce cache in use already) that point in more and more fragmented patterns with every pass. That's why even though radix sort's average speed is theoretically faster than quicksort, quicksort still wins on real life hardware. And that's probably why radix sort wins on GPUs - the data fits in the card's dedicated memory, which is already optimized to be accessed in a much more parallel way than main memory.

Not impressed unless... (1)

PmanAce (1679902) | more than 3 years ago | (#33412262)

I will not be impressed until it plays a video of sorting examples using humans at 120 FPS while doing its other sorting!

Re:Not impressed unless... (1, Funny)

Anonymous Coward | more than 3 years ago | (#33412278)

Please update your sig to reflect your new status... It should be "Tired of my customary (Score:0)"

Re:Not impressed unless... (1)

chill (34294) | more than 3 years ago | (#33412658)

A Benny Hill skit is the first thing that pops to mind.

Theme music, for those too young to remember: http://www.youtube.com/watch?v=MK6TXMsvgQg

x86 (1)

wargod667 (1743872) | more than 3 years ago | (#33412312)

At the risk of sounding like a complete arrogant jerk: does this actually support the theory that we've hit a wall with the x86 architecture for general computing, or is x86 still the 'best' at that?

I'm a bit confused when they, again and again, come up with GPU architectures doing stuff miles faster than our old trusty x86. I mean, would it be meaningful to start researching different architectural approaches to computing, as gaming consoles have done for decades (the latest being the PS3's Cell)?

Just a thought that came to mind.

Re:x86 (4, Informative)

emmons (94632) | more than 3 years ago | (#33412374)

GPUs are highly parallel processors, but most of our computing algorithms were developed for fast single-core processors. As we figure out how to implement new solutions to old problems that take advantage of these highly parallel processors, you'll continue to see stories like this one. But there's a limit to how good they can be at certain types of problems. Read up on Amdahl's law.

Basically, traditional x86 processors are good at lots of stuff. Modern GPUs are great at a few things.
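Amdahl's law, for reference: if a fraction f of the work is parallelizable and that part is sped up by a factor s (say, by offloading it to the GPU), the overall speedup is

    S = \frac{1}{(1-f) + f/s}, \qquad \lim_{s \to \infty} S = \frac{1}{1-f},

so even an infinitely fast GPU can't push a 90%-parallelizable workload past a 10x overall speedup.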

Re:x86 (1)

Mr Z (6791) | more than 3 years ago | (#33412388)

OK, so the page lists a 240M keys/sec sorting implementation for a quad-core Core i7, versus their 1G keys/sec for the GPU. How many processors were on that GPU? It's a 4x speedup end-to-end, sure, but it sure didn't scale linearly in the number of processors.

Computing tasks will always be a mix of truly serial vs. mostly serial but parallelizable with herculean effort vs. embarrassingly parallel (and all points in between). GPUs handle the embarrassingly parallel stuff with ease, and fall off as tasks get more serial. x86s, on the other hand, attack truly serial tasks with clock rate, and mostly serial tasks with speculation and a bunch of extra hardware that hunts for every speck of available parallelism. Which type of processor does best depends on the problem you're trying to solve.

No (5, Informative)

Sycraft-fu (314770) | more than 3 years ago | (#33412408)

GPUs are special kinds of processors, often called stream processors. They are very efficient at some kinds of operations, and not efficient at others. Some things, they run literally a thousand times faster than the CPU. Graphics rasterization would be one of these (no surprise, that's their prime job). However other things they run much slower. For something to run fast on a GPU it has to meet the following requirement, the more it matches them, the faster it is:

1) It needs to be parallel to a more or less infinite extent. GPUs are highly parallel devices. The GTX 480 in question has 480 shaders, meaning for max performance it needs to be working on 480 things in parallel. Things that are only somewhat parallel don't work well.

2) It needs to not have a whole lot of branching. Modern GPUs can branch, but they incur a larger penalty than CPUs do. So branching in the code needs to be minimal. It needs to mostly be working down a known path.

3) When a branch happens, things need to branch the same way. The shaders work in groups with regards to data and instructions. So if you have half a group branching one way, half the other, that'll slow things down as it'll have to be split out and done separately. So branches need to be uniform for the most part.

4) The problem set needs to fit into the RAM of the GPU. This varies: 1GB is normal for high-end GPUs, and 4-6GB is possible for special processing versions of those. The memory on board is exceedingly fast, over a hundred gigabytes per second in most cases. However, the penalty for hitting main system RAM is heavy; the PCIe bus is but a fraction of that. So you need to be able to load data into video RAM and work on it there, only occasionally going back and forth with the main system.

5) For very best performance, your problem needs to be single precision floating point (32-bit). That is what GPUs like the best. Very modern ones can do double precision as well, but at half the speed. I don't know how their integer performance fares over all, they can do it, but again not the same speed as single precision FP.

Now this is very useful. There are a lot of problems that fall into that domain. As I said, graphics is one of the biggest, hence why GPUs exist. However, there are many problems that don't. When you get ones that are way outside of that, like, say, a relational database, they fall over flat. A normal CPU creams them performance-wise.

That's why we have the separate components. CPUs can't do what GPUs do as well, but they are good at everything. GPUs do particular things well, but other things not so much.

In fact this is taken to the extreme in some electronics with ASICs. They do one and only one thing, but are fast as hell. Gigabit switches are an example: you find tiny, low-power chips switching amazing amounts of traffic. Try the same thing on a computer with gigabit NICs and it'll fall over. Why? Because those ASICs do nothing but switch packets. They are designed just for that, with no extra hardware. Efficient, but inflexible.
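To make points 1-3 concrete, this is the shape of code GPUs are happiest with: one thread per element, identical arithmetic, no divergent branching. (A generic CUDA illustration, nothing to do with the sorting paper.)

    #include <cuda_runtime.h>

    // Every thread does the same work on its own element: massively
    // parallel, branch-free apart from a uniform bounds check.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    void launch_saxpy(int n, float a, const float* d_x, float* d_y) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // cover all n elements
        saxpy<<<blocks, threads>>>(n, a, d_x, d_y);
    }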

Re:No (3, Insightful)

black3d (1648913) | more than 3 years ago | (#33412492)

Unfortunately I can't mod, having already posted in this thread, but please allow me to /bow. This is the best explanation of the differences I've ever read anywhere. Even though I knew the differences, I couldn't have expressed them so finely. Bravo.

Re:No (1)

Sycraft-fu (314770) | more than 3 years ago | (#33412642)

Thanks :). I've had to give that speech to professors a number of times, so I've had some practice. Some faculty don't understand the whole GPGPU thing well and think GPUs are a panacea for speeding things up. We (computer support) need to explain the limits, so they can evaluate whether their research is the kind that would benefit.

Re:No (0)

Anonymous Coward | more than 3 years ago | (#33412546)

Thank you, my day on slashdot wasn't wasted, I learned a lot from your post.

Re:No (2, Informative)

FlawedLogic (1062848) | more than 3 years ago | (#33412992)

Fermi can actually do a double-precision op per clock cycle. It was designed with DP supercomputing in mind, which is why it's so bloody expensive. To get the price down on consumer cards they removed that ability, since graphics generally doesn't need it: consumer cards need four ticks to do the equivalent DP op.

Re:x86 (2, Interesting)

jhachen (1889398) | more than 3 years ago | (#33412466)

The answer to hitting a wall with traditional CPUs is complicated. The number of transistors on new CPUs is actually keeping up with Moore's law. The size of the transistors and their power consumption are also steadily decreasing. However, clock speeds have been stagnant, and performance per clock cycle hasn't been increasing as fast as it has in the past.

When it comes to raw performance numbers, GPUs destroy CPUs. The problem is taking advantage of the power GPUs offer. For starters, the algorithm has to be parallel in nature. And not just part of it: the majority of the algorithm has to be parallel, or the overhead will erase any performance gain. The application also has to run long enough to justify offloading it to the GPU or, again, the overhead will get you.

Even if you have a parallel algorithm, implementing it isn't trivial. To use CUDA or OpenCL you need not only a good understanding of general parallel programming but also a good understanding of the architecture of the GPU hardware. These languages are not user-friendly. They really put the burden on the programmer. On the other hand, this does mean they can be very powerful in the right hands.

Now let's say your application meets these criteria and you implemented it in CUDA and got a 10x speedup. No one with an ATI card can run it. Sure, you could implement it in OpenCL instead to be cross-platform, but OpenCL still seems to be in its infancy and not as mature as CUDA.

I'm not trying to say GPGPU computing has no future, just that it has a long way to go. Parallel programming has actually had quite a revival lately, and I'm truly interested to see where it goes. A parallel compiler that relieves the programmer of all the headaches associated with parallel programming would be groundbreaking and have awesome implications; some people claim it isn't possible. If this topic interests you, I'd recommend looking into reconfigurable computing. There's some really interesting stuff going on in that area, and it supports a much wider range of algorithms than GPGPU currently does.

most real life sorting involves indirection (2, Insightful)

Anonymous Coward | more than 3 years ago | (#33412340)

The typical sorting problem I've encountered in my career (various types of scientific, telecommunications and business software, though not games) involves an array of pointers to fixed-length records that have a sort key (let's say an integer) at a predefined offset. Not an array of integers, nor an array of directly embedded small fixed-length records, which I'm guessing is what was used in TFA. The former situation requires random as well as streaming access to memory, which would likely favor processing by the CPU in the motherboard of a typical $1000-$2000 PC.

Re:most real life sorting involves indirection (2)

MadKeithV (102058) | more than 3 years ago | (#33413060)

Solution: preprocess the data into a map of keys to pointers, and sort that.
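A sketch of that preprocessing (hypothetical record layout with the key at a fixed offset, per the grandparent; the point is to sort small key/pointer pairs instead of chasing full records):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Record { char header[16]; int32_t key; char payload[44]; };

    struct KeyPtr {              // small, cache-friendly sort element
        int32_t key;
        const Record* rec;
    };

    std::vector<KeyPtr> sort_by_key(const std::vector<Record>& recs) {
        std::vector<KeyPtr> idx;
        idx.reserve(recs.size());
        for (const Record& r : recs)   // one streaming pass over the records
            idx.push_back({r.key, &r});
        std::sort(idx.begin(), idx.end(),
                  [](const KeyPtr& a, const KeyPtr& b) { return a.key < b.key; });
        return idx;                    // follow .rec to reach the full records
    }

The final pass that walks idx and touches each record is still random access, which is the grandparent's point about this favoring the CPU.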

Efficient? (2, Interesting)

scotty.m (1881826) | more than 3 years ago | (#33412578)

Not sure how they are defining efficiency in this experiment. I'd like to see how many clock cycles it took to sort each problem, and maybe how much memory the radix sort is using too.

Re:Efficient? (1)

Trepidity (597) | more than 3 years ago | (#33412628)

Wallclock time: they're claiming that this is, in absolute numbers, the fastest sort in keys-per-second yet demonstrated on commodity hardware.


Benchmark details desired (0)

Anonymous Coward | more than 3 years ago | (#33412646)

Were the benchmarks performed on lists that already resided on the GPU cards or do the benchmarks include "I/O" to and from CPU memory?

How does the implementation perform when the entire list does not fit in memory on the GPU card?

Re:Benchmark details desired (1)

kramulous (977841) | more than 3 years ago | (#33412966)

Go to the source: they only timed the compute time. Neither cudaMalloc nor cudaMemcpy was part of the timing.

Sorry, I only count 'time to solution'. That is all I'm interested in. I did think it was strange that a sub-O(n**2) algorithm was faster on the GPU.

It is like writing benchmarks that ignore disk I/O, reading from system memory, L3 cache, etc., and only time stuff that is in the registers.
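For what it's worth, the difference looks roughly like this in CUDA: where you record the first event decides whether you report compute-only time or time to solution. (A generic sketch, not the authors' harness.)

    #include <cuda_runtime.h>
    #include <cstdio>

    // Times allocation + transfers + kernel ("time to solution").
    // Recording 'start' just before the kernel launch instead gives the
    // compute-only number criticized above.
    void time_end_to_end(int* h_keys, int n) {
        int* d_keys = nullptr;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMalloc(&d_keys, n * sizeof(int));
        cudaMemcpy(d_keys, h_keys, n * sizeof(int), cudaMemcpyHostToDevice);

        // ... launch the sort kernel(s) on d_keys here ...

        cudaMemcpy(h_keys, d_keys, n * sizeof(int), cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("end-to-end: %.3f ms (%.1f Mkeys/sec)\n",
                    ms, n / ms / 1000.0);
        cudaFree(d_keys);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }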

Great (1, Funny)

robinvanleeuwen (1009809) | more than 3 years ago | (#33412820)

Now we can finally sort the digits of Pi...

Re:Great (0)

Anonymous Coward | more than 3 years ago | (#33413010)

I can sort the unique digits for you

0,1,2,3,4,5,6,7,8,9
phew! Took a while, but I did it without a GPU

Non-unique?
0000000000000000000000000000000000000000....
