
Ask Slashdot: Parallel Cluster In a Box?

timothy posted more than 2 years ago | from the must-fit-in-a-briefcase-too dept.

Supercomputing

QuantumMist writes "I'm helping someone accelerate an embarrassingly parallel application. What's the best way to spend $10K to $15K to get the maximum number of simultaneous threads of execution? The focus is on threads of execution because memory requirements are fairly low, e.g. ~512MB in memory at any given time (maybe 2 to 3X that at the very high end). I've looked at the latest Tesla card, as well as the four-Teslas-in-a-box solutions, and am having trouble justifying the markup for what's essentially 'double precision FP being enabled, some heat improvements, and ECC, which actually decreases available memory (I recognize ECC's advantages, though).' Spending close to $11K for the four Teslas in a 1U setup seems to be the only solution at this time. I was thinking that GTX cards could do the job for a fraction of the cost, so should I just stuff four or more of them in a box? Note, they don't have to pay the power/cooling bill. Amazon is too expensive for this level of performance, so we can't go to the cloud via EC2. Are there any parallel architectures out there at this price point, even for $5K more? Any good manycore offerings that I've missed, e.g. somebody who can stuff a ton of ARM or other CPUs/GPUs in a server (cluster in a box)? It would be great if this could be easily addressed via PCI or another standard interface. Should I just stuff four GTX cards in a server and replace them as they die from heat? Any creative solutions out there? Thanks for any thoughts!"


AMD (2, Informative)

Anonymous Coward | more than 2 years ago | (#38250862)

Why not use AMD and OpenCL?

Re:AMD (1)

speckman (2511208) | more than 2 years ago | (#38251070)

Yeah. AMD cards do more for less, unless you really need the double precision. You could get up to 10 boxes with 4 AMD cards apiece for that money. Although, are you saying that you need half a gig of RAM for a single thread? If so, GPUs are not the way to go.

Re:AMD (4, Insightful)

tempest69 (572798) | more than 2 years ago | (#38251172)

Because it's new, and finding someone who's done it before to get some pointers is really hard.
CUDA has been around for a while, so figuring it out isn't such a rough learning curve.

Overall, I'm a little suspicious of someone looking to use a GPU just to get more threads on a problem. Going the GPU route is a really committed step, and the programming gets a new level of complicated. Using multiple cards has some odd issues in CUDA, e.g. if you exceed the card index it defaults to card 0 rather than crashing. There are more places to screw up with a GPU: transferring memory; getting blocks, threads, and warps organized (done properly this hides all sorts of latency in the calculations, done poorly it's worse than a CPU); and avoiding memory contention (the memory scheme isn't bad, but it needs to be understood).

So in most cases I'd start with this chart http://www.cpubenchmark.net/cpu_value_available.html [cpubenchmark.net] and tell them to cut their teeth on a GPU with a smaller (cheaper) test case.
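Incidentally, the bad-index-silently-lands-on-card-0 behaviour is easy to guard against in host code. A minimal sketch (CUDA runtime API; the helper name is mine, not anything standard) that refuses to run rather than quietly falling back to device 0:

    // Guard against "requested device doesn't exist, run on card 0 anyway".
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    void select_device_or_die(int requested)
    {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess || requested < 0 || requested >= count) {
            fprintf(stderr, "Requested device %d, but %d device(s) available (%s)\n",
                    requested, count, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
        err = cudaSetDevice(requested);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaSetDevice(%d) failed: %s\n",
                    requested, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

It won't fix the other pitfalls, but it turns a silent wrong-device run into a loud failure.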

Re:AMD (1)

GameboyRMH (1153867) | more than 2 years ago | (#38251738)

Because it's new, and finding someone who's done it before to get some pointers is really hard.
CUDA has been around for a while, so figuring it out isn't such a rough learning curve.

On the downside, you're stuck with NVidia GPUs forever (or until they decide to drop CUDA, although I'll admit that's unlikely).

Beowulf cluster! (5, Funny)

TWX (665546) | more than 2 years ago | (#38251188)

Why not a beowulf clust---

I'm sorry, I just can't. I searched the ~35 posts, browsing at -1, and no reference to a Beowulf cluster anywhere, let alone Natalie Portman or Grits.

Slashdot! You're slipping! I lament the days when even our trolls were amusing and somewhat topical to the discussion at hand! We've fallen so far!

Beowulf clusters (3, Informative)

G3ckoG33k (647276) | more than 2 years ago | (#38251248)

Yes, I haven't seen any references here or anywhere else either lately.

From http://en.wikipedia.org/wiki/Beowulf_cluster [wikipedia.org]: "The name Beowulf originally referred to a specific computer built in 1994 by Thomas Sterling and Donald Becker at NASA. [...] There is no particular piece of software that defines a cluster as a Beowulf. Beowulf clusters normally run a Unix-like operating system, such as BSD, Linux, or Solaris, normally built from free and open source software. Commonly used parallel processing libraries include Message Passing Interface (MPI) and Parallel Virtual Machine (PVM). Both of these permit the programmer to divide a task among a group of networked computers, and collect the results of processing. Examples of MPI software include OpenMPI or MPICH. There are additional MPI implementations available. Beowulf systems are now deployed worldwide, chiefly in support of scientific computing."

Apparently, Beowulf clusters may still be around; it's just that they don't go by that name any longer. I wonder what the latest buzzword for essentially the same thing would be?
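For anyone who hasn't seen the MPI model the excerpt describes, it is tiny to demo. A minimal sketch (illustrative only; the per-item "work" here is made up), where each rank grinds through its own slice and rank 0 collects the total:

    // Each rank processes its own slice independently; rank 0 gathers the result.
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Pretend the job is 1,000,000 independent work items, split evenly. */
        const long total = 1000000;
        long begin = rank * (total / size);
        long end   = (rank == size - 1) ? total : begin + total / size;

        double local = 0.0;
        for (long i = begin; i < end; ++i)
            local += (double)i * 0.5;   /* stand-in for the real per-item work */

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("result = %f\n", global);

        MPI_Finalize();
        return 0;
    }

Launch it across the cluster with something like mpirun -np 64 ./job, and that's the whole "divide the task, collect the results" story for an embarrassingly parallel sum.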

Re:Beowulf clusters (0)

Anonymous Coward | more than 2 years ago | (#38251326)

Somewhere between MapReduce (most literal) and The Cloud (most useless).

Re:Beowulf clusters (1)

westyvw (653833) | more than 2 years ago | (#38251386)

Do they just not call it anything these days? Is it just expected to be some variant, or is it so mainstream now?

Re:AMD (2, Insightful)

Anonymous Coward | more than 2 years ago | (#38251990)

Why not use AMD and OpenCL?

Sure: use two AMD 6990s with 3072 stream units each, for a total of 6144 ALUs per box (with DP FP) under OpenCL 1.1.
Cost: about $2500 per box ($700 per card plus roughly $1000 for a CPU system with a 1000W PSU).


Helmer is a cheap way to get there (0, Funny)

Anonymous Coward | more than 2 years ago | (#38250896)

Sorry for just the link, but you could build a one-off of something like this: http://helmer.sfe.se/


Grey boxes (1)

Anonymous Coward | more than 2 years ago | (#38250908)

Spec out a quad-core AMD grey box with 4 gigs of ram (I saw 4 gigs of DDR3 RAM for $20 the other day). That shouldn't run you more than $400 a pop.

For $10K, you'll get (10,000 / 400) * 4 = 100 threads of execution.

Re:Grey boxes (0)

MichaelKristopeit406 (2018812) | more than 2 years ago | (#38251010)

100 system threads is not the same as 100 application threads... and the power and network requirements for 25 (10,000/400) machines are a lot different from those of a single 1U machine.

you're an idiot.

Re:Grey boxes (1)

tomhudson (43916) | more than 2 years ago | (#38251212)

Look who's the idiot - the article says they aren't paying for the power.

Re:Grey boxes (1)

MichaelKristopeit406 (2018812) | more than 2 years ago | (#38251648)

Who is paying, in money and time, for 25 times (10,000/400) the maintenance expenditure?

you're always the idiot, tom.

cower in my shadow some more behind your chosen masculinity based pseudonym, lady feeb.

you're completely pathetic.

Nothing special (2, Informative)

Anonymous Coward | more than 2 years ago | (#38250910)

Just put a bunch of GTX cards in a nice, big server case with enough fans. You are hardly going to find a cheaper alternative.
When choosing cards, look for tests like this one:
http://www.behardware.com/articles/840-13/roundup-a-review-of-the-super-geforce-gtx-580s-from-asus-evga-gainward-gigabyte-msi-and-zotac.html
The IR thermal photos are great for picking a well-cooled card.
Also use software to control the card fans and keep them running at 100% speed.
Noisy? Yes. But who cares, unless you plan on putting it in your bedroom.
You can easily keep these cards at ~70°C under full load.

Re:Nothing special (4, Informative)

TWX (665546) | more than 2 years ago | (#38251156)

It would have been nice if he'd given us more information about the form factor he needs to put this into. Since the client isn't paying the electric or cooling bill, I have to assume it's colocated, so there might be real rack-unit restrictions that prevent this from working well. It also would have been nice to know the storage demands, as there are tradeoffs in front-accessible drive arrays for cooling and airflow purposes. Most of the cases with tons of hot-swap drives in front lack good front ventilation. If he only needs a few drives, that opens him up to a simple 3U or 4U chassis with a mostly open-grille front to make airflow a lot less restrictive.

Re:Nothing special (1)

Ruie (30480) | more than 2 years ago | (#38252058)

Just put a bunch of GTX cards in a nice, big server case with enough fans. You are hardly going to find a cheaper alternative.

That's actually pretty hard to do as you need a motherboard with lots of multiple-lane PCIe connections.

Re:Nothing special (1)

SuricouRaven (1897204) | more than 2 years ago | (#38252122)

I recall it is possible to fit a 16x card in a 1x slot (Obviously at 1x performance), but this requires the card be hacked. Literally. With a hacksaw. All the power and essential control lanes are at the front, and if 15 of the 16 data lanes are not connected then the card will simply not use them.

SuperMicro MicroCloud w/ 8 NVidia GPUs? (2, Interesting)

Anonymous Coward | more than 2 years ago | (#38250916)

If the off-the-shelf GTX cards work, you'd have 8 Xeons + 8 NVidia GPUs in 3U, all entirely parallel (i.e. 8 separate machines) to avoid the main CPUs being any kind of bottleneck. Stock each node with 2GB of RAM on the cheap and some cheaper SATA drives, and you'd likely end up under $10k for the whole thing and have an 8-node cluster you can use for other tasks later.

I've noticed that "embarrassingly parallel" tasks, if you take the low-hanging fruit too far, end up running into some other unforeseen bottleneck. Hence my suggesting something faux-blade-ish instead.

Re:SuperMicro MicroCloud w/ 8 NVidia GPUs? (1)

Lord_Naikon (1837226) | more than 2 years ago | (#38251056)

Good advice, because I've found in practice that unless the problem set can be divided into CPU-cache-sized blocks (i.e. 512k or less), memory bandwidth is going to be the major bottleneck. More machines = more memory bandwidth.

Re:SuperMicro MicroCloud w/ 8 NVidia GPUs? (1)

fuzzyfuzzyfungus (1223518) | more than 2 years ago | (#38251392)

The one potentially tricky thing with that particular machine might be the graphics cards: PCIe x8, low profile, is not going to help your search for a high-end GTX that will fit...

Unless he is heavily space-constrained, he should probably take your advice on specs, but in 1U or 2U cases where getting a double-wide, full-profile PCIe x16 card installed is easier.

PS3 (2, Interesting)

History's Coming To (1059484) | more than 2 years ago | (#38250918)

PlayStation 3s have proved a cost-efficient way of setting up large-scale parallel processing systems. Of course, you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games you don't need to buy.

Re:PS3 (2, Informative)

Anonymous Coward | more than 2 years ago | (#38251068)

I wouldn't give Sony a dollar of my business if they had the cure for cancer and I was a week away from death.

Yes you would (0)

Anonymous Coward | more than 2 years ago | (#38251414)

if it could save you

Re:PS3 (1)

bmsleight (710084) | more than 2 years ago | (#38251496)

Don't you realise that Sony would make a loss on this?

If you buy lots of subsidised PS3s and then DON'T buy the games, they're worse off.

Re:PS3 (0)

Anonymous Coward | more than 2 years ago | (#38251116)

OS problems here. Newer PS3s won't allow a direct install of Linux.

Re:PS3 (5, Informative)

Anonymous Coward | more than 2 years ago | (#38251174)

PlayStation 3s have proved a cost-efficient way of setting up large-scale parallel processing systems. Of course, you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games you don't need to buy.

Back-of-the-envelope comparison of PS3 and GTX:

A cluster of three PS3s: 920 GFLOPS. Price: about $800.

A PC with 3 GTX 460 cards: 2200 GFLOPS. Price: about $800.

Each of those GTX cards also has significantly more memory than the PS3, and is cheaper to develop for.

Re:PS3 (1)

History's Coming To (1059484) | more than 2 years ago | (#38251262)

Each of the GTX cards needs a motherboard, power supply, etc.; you get those thrown in with the consoles. Yup, there may be significant OS issues, and I've not done the sums on FLOPS per dollar, so it may be a dumb idea... just throwing it into the mix.

Re:PS3 (1)

Anonymous Coward | more than 2 years ago | (#38251346)

Each of the GTX cards needs a motherboard, power supply, etc.; you get those thrown in with the consoles.

No, not "each". 3 cards can fit in one PC. The ballpark price included the cost of the other parts.

Re:PS3 (1)

CronoCloud (590650) | more than 2 years ago | (#38251280)

A cluster of three PS3s: 920 GFLOPS. Price: about $800.

Less than that, because they would have to buy older used PS3s (CECHA/CECHB/CECHE models), and those would need to have pre-3.21 firmware. Difficult, but probably cheaper.

A PC with 3 GTX 460 cards: 2200 GFLOPS. Price: about $800.

Wouldn't the 460s alone cost about $500, let alone a motherboard that can handle 3 of them, plus a good power supply and cooling? I think that PC total is estimated a bit low.

Perhaps the poster could do both, because some calculations might work better on the PS3s and some on the 460s. Even in Folding@home, there are still calculations the PS3s do better than the GPU clients, because they're more versatile, as they say, taking the middle path between the CPU and GPU clients.

Re:PS3 (1)

Surt (22457) | more than 2 years ago | (#38252076)

Cards:
http://www.newegg.com/Product/Product.aspx?Item=N82E16814162058&nm_mc=OTC-Froogle&cm_mmc=OTC-Froogle-_-Video+Cards-_-Galaxy-_-14162058 [newegg.com]

x3 = $360.

Motherboard:
http://www.newegg.com/Product/Product.aspx?Item=N82E16813128495 [newegg.com]
$114

Power supply:
http://www.newegg.com/Product/Product.aspx?Item=N82E16817152044 [newegg.com]
$144

The CPU can be less than $50 if he really doesn't need it to do much of anything.
So far I'm at $668. Probably have to buy a box to put it in for $50.
So now I'm at $718. What shall I buy with my $82?

Re:PS3 -- sure, if you like your CPUs from 2006. (1)

Arakageeta (671142) | more than 2 years ago | (#38251590)

I think the time of PS3 clusters has passed. The Cell processor was released back in 2006! IBM released a few upgraded versions, mostly improving double-precision performance, but those systems are really cost-prohibitive.

Assuming you can deal with PCIe latency, GPUs are the way to go.

Re:PS3 -- sure, if you like your CPUs from 2006. (1)

SuricouRaven (1897204) | more than 2 years ago | (#38252140)

The Cell, at the time of release, was mind-blowingly fast. Fastest chip around. But it didn't advance very far, and more conventional processors have now overtaken it.

Re:PS3 -- sure, if you like your CPUs from 2006. (1)

Rockoon (1252108) | more than 2 years ago | (#38252384)

They just claimed that it was mind-blowingly fast.

In theory there is no difference between theory and practice, but in practice there is.

here is what i did (0)

Anonymous Coward | more than 2 years ago | (#38250922)

https://sites.google.com/site/jimerickso/home/new-build

can you write GPU code? (5, Insightful)

zeldor (180716) | more than 2 years ago | (#38250934)

Do you or they know how to program a GPU?
If it's really embarrassingly parallel, EC2 spot instances and the GNU 'parallel' program will work quite nicely.
But if coding changes are required, then the hardware is the least of your expenses.

Re:can you write GPU code? (1)

woodhouse (625329) | more than 2 years ago | (#38251920)

Exactly. Unless the user has some experience in CUDA/Compute shaders/OpenCL, just shoving cards in there doesn't really solve the problem.

Re:can you write GPU code? (1)

human spam filter (994463) | more than 2 years ago | (#38252216)

Also, whether you will get a significant speedup by using the GPU really depends on the algorithm. Some algorithms may not even be possible to implement for the GPU (due to limitations of CUDA, OpenCL etc.).

I suggest (0)

denshao2 (1515775) | more than 2 years ago | (#38250938)

Amazon EC2.

Re:I suggest (0)

Anonymous Coward | more than 2 years ago | (#38251028)

Guess someone didn't read the summary. EC2 was explicitly stated as being too expensive, and he's looking for something self-hosted.

Re:I suggest (0)

Anonymous Coward | more than 2 years ago | (#38251692)

Congrats, retard, you didn't read the question. EC2 is too expensive for hefty computational tasks. Cheaper to buy hardware that they can use as much as they want for free.

Die of heat? (2)

TheSHAD0W (258774) | more than 2 years ago | (#38250944)

> Should I just stuff four GTX cards in a server and replace them as they die from heat?

It'd be more cost-efficient to improve the air flow or add liquid cooling. Yay mineral oil baths.

AMD graphics cards (0)

Anonymous Coward | more than 2 years ago | (#38250970)

Radeon HD 5800-5900 series Supports FP64
Radeon HD 6900 series supports FP64

3k - 64cores + 54+GB of ram. (4, Interesting)

Anonymous Coward | more than 2 years ago | (#38250974)

You can easily build a 64-core 1U system with Opterons using a quad-socket setup, or 128 cores using the quad-socket-with-extension setup; that will only run you about $5k. These are general-purpose cores at 2GHz+; you don't have to change the program to run on them, and you don't need to contort things the way you would when programming and dealing with GPUs... Or you can wait for Knights Corner, or get the TILE64s.

Re:3k - 64cores + 54+GB of ram. (0)

Anonymous Coward | more than 2 years ago | (#38251064)

Where the heck are you shopping?

Re:3k - 64cores + 54+GB of ram. (2, Informative)

Anonymous Coward | more than 2 years ago | (#38251158)

Newegg. The 4-socket board and the extension board are below $1k together, and the low-to-average-speed 16-core Opterons are about $300-400, so 350*8 + 700 (board + extension) = $3.5k. The other $1.5k covers power, 1333MHz RAM, and the 1U chassis.

You can of course spend a lot more if you want the fastest Opterons, but the returns diminish quickly; the 2.2GHz parts are fast, cheap 16-core CPUs.

Re:3k - 64cores + 54+GB of ram. (4, Informative)

dch24 (904899) | more than 2 years ago | (#38251282)

Just took a look. They have 4 choices for a 16-core Opteron listed:
  • AMD Opteron 6262 HE Interlagos 1.6GHz Socket G34 85W 16-Core Server Processor OS6262VATGGGU - OEM $539.99
  • AMD Opteron 6272 Interlagos 2.1GHz Socket G34 115W 16-Core Server Processor OS6272WKTGGGUWOF $539.99
  • AMD Opteron 6274 Interlagos 2.2GHz Socket G34 115W 16-Core Server Processor OS6274WKTGGGUWOF $659.99 out of stock
  • AMD Opteron 6274 Interlagos 2.2GHz Socket G34 115W 16-Core Server Processor OS6274WKTGGGU - OEM $659.99 out of stock

I'm going to keep looking, but I don't see any in the $300-400 range.

Re:3k - 64cores + 54+GB of ram. (0)

Anonymous Coward | more than 2 years ago | (#38251340)

All prices from Newegg; I'm sure you could get a better price somewhere else if you look...

Tyan quad-G34 board = $810
4 x 2GB DDR3 RAM = $45
Hard drive = $80
Subtotal = $935
Add processors:
4 x 8-core 2.6GHz (32 cores) = $1120 ... total = $2055
4 x 12-core 2.4GHz (48 cores) = $1560 ... total = $2495
4 x 16-core 2.1GHz (64 cores) = $2160 ... total = $3095

So for $4k you could add more memory, add an SSD and a nice case, and have 64 general-purpose cores in a 1U box. Buy three and you have a good general-purpose mini-supercomputer with 192 cores.

Need more information (3, Informative)

pem (1013437) | more than 2 years ago | (#38250984)

If, for example, it's embarrassingly parallel DSP operations, you might try some dedicated DSP engines, or even some Xilinx FPGAs.

Re:Need more information (0)

Anonymous Coward | more than 2 years ago | (#38251798)

If it's a good fit to a GPU's capability, GPUs provide more FLOPS/$ than DSPs or FPGAs, although both DSPs and FPGAs are much more flexible and are therefore applicable to a wider range of problems without performance-wrecking contortions.

The *other* reason to consider DSPs or FPGAs is energy efficiency: both give substantially better FLOPS/W than the GPUs. However the OP suggested that they're not paying for power, so this is unlikely to be relevant here.

Re:Need more information (1)

gmarsh (839707) | more than 2 years ago | (#38252008)

A GPU will spank a dedicated DSP chip at just about everything, even the highest end TI's and TigerSHARCs. Both DSPs and GPUs are designed to haul data out of memory and do vector multiplication on it, but the GPU has a heck of a lot more of both memory bandwidth and processing grunt.

A big FPGA card, or FPGA array system like a Copacobana, might be quicker assuming I/O limitations aren't a problem for the algorithm to be run. But FPGA hardware for HPC isn't really a commodity so it's awfully expensive - you're looking at $5K+ for a big Virtex on a PCIe card. Plus buying FPGA tools and IP blocks, and getting the VHDL/Verilog written, will eat up a budget really quick.

But if this is being done in an academic environment and there's no looming deadline for this project, the FPGA method might be something you can get a grant for and throw a computer engineering grad student or co-op student at.

I built a similar system recently (1)

Anonymous Coward | more than 2 years ago | (#38250992)

I built a cluster the other day with 8 i7-2600K processors.

CPU - Intel i7-2600K = $300
Motherboard - P8H67-M PRO/CSM = $110
RAM - 4 x 4GB Corsair = $100
2U case + 400W PSU = $90

My total cost was under $5k for 8 nodes, and it runs very, very fast, although my application likes CPUs more than GPUs. It does take a total of 16U of rack space, but that is not much of an extra cost.

Requirements (1)

Anonymous Coward | more than 2 years ago | (#38251020)

You really haven't given any details about your requirements.

This is a parallel problem, but will it run well on a GPU? If it's an inherently divergent task, then probably not (correct me if this isn't the case for other cards; I only have CUDA experience). If you want good answers, you'll need to describe your problem in more detail than just "embarrassingly parallel."

HP Moonshot (0)

Anonymous Coward | more than 2 years ago | (#38251042)

The recently-announced HP Moonshot architecture seems to meet most of your operational requirements. http://www.hp.com/hpinfo/newsroom/press/2011/111101xa.html
I haven't seen any pricing, though.

Re:HP Moonshot (0)

Anonymous Coward | more than 2 years ago | (#38251066)

Moonshot is fucking awesome, but they're not available for purchase yet, and they'll probably end up outside the poster's price point by a large margin.

U of I (4, Informative)

TheGreatOrangePeel (618581) | more than 2 years ago | (#38251046)

Try getting in touch with the folks doing parallel processing research or the people with NCSA at U of I. I imagine one or both would have a few tips for you assuming they're open to doing that kind of collaboration.
  • http://parallel.illinois.edu/
  • http://www.ncsa.illinois.edu/

Blade servers (0)

Anonymous Coward | more than 2 years ago | (#38251062)

Buying a blade server on eBay would also be a great option. For around $5-6k you could get a nice blade chassis with 10 nodes.

10 dual-CPU nodes = 80 cores; with hyperthreading you can run 160 threads.

AMD (0)

Anonymous Coward | more than 2 years ago | (#38251076)

AMD cards are worth a look. Especially for embarrassingly parallel stuff, they often deliver higher performance (see e.g. Bitcoin).

Do you need it in a box? (2)

Tom Goodale (150359) | more than 2 years ago | (#38251082)

If it's really embarrassingly parallel, just run it on whatever CPUs you have hanging about or can scrounge cheaply. As long as the application is written portably they don't even need to be the same architecture or operating system, although that would help with deployment. The only reason to try to scrunch everything in one box would be if you have space limitations.

many AMD CPUs unless the GPU port is done already (2, Interesting)

Anonymous Coward | more than 2 years ago | (#38251094)

You can get 48 real AMD Magny-Cours CPU cores with full DP floating-point support and ~64GB of ECC memory in a box for under 10K (EUR!) from e.g. Tyan and Supermicro.
I run my embarrassingly parallel stuff on that, and it works great. Depending on your application, 64 Bulldozer cores, which come in the same package for only slightly more money, may or may not perform better. I have not seen many real-world applications in which one GPU is actually faster than 12 to 16 server-class CPU cores.
Of course this depends a lot on whether you have done the GPU porting already or are just planning to, which you unfortunately don't state in your post.

Raspberry Pi (1)

Anonymous Coward | more than 2 years ago | (#38251102)

Just make a cluster of these little guys.
HDMI output, USB input.
Encode data into HDMI frames.
Have a decent board for decoding and to perform instructions from the HDMI data, then send more data back through USB.
You could probably even use the audio ports for even more throughput.
And if you get the ethernet version, that too.

I'm not even joking.
Well, partially. Might not be worthy of this case.
Plus, not out or even final.

But it sounds like an interesting idea anyway, so might as well throw it out there since it is pretty related.

Re:Raspberry Pi (0)

Anonymous Coward | more than 2 years ago | (#38252038)

The HDMI thing seems a weird way to accomplish output... I don't know of any cheap massively-multiport HDMI capture board, anyway. Is there a reason you suppose this would be cheaper than using USB for both input and output, and adding however many percent more Pis it takes to make up the difference?

Definitely GPU. (4, Interesting)

pla (258480) | more than 2 years ago | (#38251106)

Others have pointed it out, but if you can run this on a GPU, you don't need to look any further than that.

Specifically, check out some of the BitCoin mining rigs [bitcointalk.org] people have built, like 4x Radeon 6990s in a single box. For comparison, a single 6990 easily beats a top-of-the-line modern CPU by a factor of 50 (as in, not 50%, but 5000%). You can build such a box for well under $5k.

MPI+AMD vs GPU (1)

Anonymous Coward | more than 2 years ago | (#38251130)

If you're not a GPU programmer, the alternative is a 48-core AMD server (the 64-core systems are notoriously slow and have half the floating-point units) with MPI. This is the solution many academics are taking.

Also, if you're lucky you might be able to get your hands on Intel's 100-core Atom processor; they're not for sale AFAIK, but I believe you can apply to get one for free.

commodity HPC depends on your code (5, Informative)

Haven (34895) | more than 2 years ago | (#38251140)

In HPC we call it "pleasantly parallel," nothing is embarrassing about it! =]

If your code:
-scales to OpenCL/CUDA easily
-does not require high concurrent memory transfers
-is fault tolerant (i.e. a failed card doesn't hose a whole day/week of runs)
-can use single-precision FLOPS

then you can use commodity hardware like the GTX-series cards. I'd go with the GTX 560 Ti (GF114 GPU).

Make nodes with:
quad-core processors (AMD or Intel)
whatever RAM is needed (8GB minimum)
2 x GTX 560 Ti (448-core version) run in SLI (or the dual 560 Ti from EVGA)

Basically a scaled down Cray XK6 node. http://www.cray.com/Assets/PDF/products/xk/CrayXK6Brochure.pdf [cray.com]

It all depends on your code.

We built a ~9.1 TFLOPS system for $10k last year. (4, Interesting)

Arakageeta (671142) | more than 2 years ago | (#38251380)

What does SLI give you in CUDA? The newer GeForce cards support direct GPU-to-GPU memory copies, assuming they are on the same PCIe bus (NUMA systems might have multiple PCIe buses).

My research group built this 12-core/8-GPU system last year for about $10k: http://tinyurl.com/7ecqjfj [tinyurl.com]

The system has a theoretical peak ~9.1 TFLOPS, single precision (simultaneously maxing out all CPUs and GPUs). I wish the GPUs had more individual memory (~1.25GB each), but we would have quickly broken our budget had we gone for Tesla-grade cards.
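For reference, the direct GPU-to-GPU copies look roughly like this with the CUDA runtime's peer-access calls (a sketch only; whether peer access is actually granted depends on the boards, the driver, and the PCIe topology):

    // Copy a buffer straight from GPU 0 to GPU 1 when the hardware allows it.
    #include <cuda_runtime.h>

    void copy_between_gpus(void *dst_on_gpu1, const void *src_on_gpu0, size_t bytes)
    {
        int can01 = 0;
        cudaDeviceCanAccessPeer(&can01, /*device=*/1, /*peerDevice=*/0);

        if (can01) {
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(/*peerDevice=*/0, /*flags=*/0);
        }
        // cudaMemcpyPeer works either way; with peer access enabled the copy
        // goes directly over PCIe instead of being staged through host memory.
        cudaMemcpyPeer(dst_on_gpu1, /*dstDevice=*/1, src_on_gpu0, /*srcDevice=*/0, bytes);
    }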

How does it parallelize? (4, Informative)

darkjedi521 (744526) | more than 2 years ago | (#38251170)

How does the app parallelize? Is each process/thread dependent on every other process/thread, or is it 1000 processes flying in close formation that all need to complete at the same time but don't interact with each other? How embarrassingly parallel is embarrassingly parallel? Is that 512MB requirement per process, or the sum of all processes?

GPUs might not be the right solution for this. GPUs are excellent at parallelizing some operations but not others. Have you done any benchmarks? Throwing lots of CPU at the problem may be the right solution, depending on the algorithms used and how well they can be adapted for a GPU, if they can be adapted at all.

For the $10K-$15K USD range, I'd look at Supermicro's offerings. You have options ranging from dual socket 16 core AMD systems with 2 Teslas to quad socket AMD systems to quad socket Intel solutions to dual socket Intel systems with 4 Tesla cards.

Do some testing of your code in various configurations before blindly throwing hardware at the problem. I support researchers who run molecular dynamics simulations. I've put together some GPU systems, and after testing it was discovered that, for the calculations they are doing, the portions of their code that could be offloaded only accounted for at most 10% of the execution time, with the remainder being operations the software packages could only do on a CPU. (By Amdahl's law, even an infinitely fast GPU would cap the overall speedup at roughly 1.1x in that case.)

Passive cooled GPU (1)

GoRK (10018) | more than 2 years ago | (#38251180)

Don't use high-end GTX cards; twice as many lower-end, passively cooled GPU cards will provide more than equivalent performance at far lower cost and failure rate. If your application really benefits more from additional threads than from single-thread execution speed, this is the way to go. Most GPGPU clusters that aren't built using Tesla cards use this approach.

need a better characterization of the workload (1)

Surt (22457) | more than 2 years ago | (#38251182)

Big FP bandwidth on a Tesla doesn't do much for you if you only need integer execution. Maybe you'd be better off with a 4-CPU Xeon box, or a Bulldozer, or a 64-core ARM setup. Really, you want to find a way to benchmark your particular software on a variety of potential CPU targets, and then do a price comparison.


what kind of embarassingly parallel? (0)

Anonymous Coward | more than 2 years ago | (#38251258)

Normally the phrase means "lots of serial jobs", each of which has an input configuration and a result, and nothing in between (in particular, no inter-job sharing). GPGPU is suitable for a somewhat different sort of workload, basically single-instruction-multiple-threads. In short: are the threads working in lockstep?

Why, mini-cluster, of course! (2)

Noryungi (70322) | more than 2 years ago | (#38251328)

http://www.mini-itx.com/projects/cluster/?p [mini-itx.com]

The example at the URL above is quite old, but it's a good starting point. Just use a dozen cheap mini-ITX boards with -- let's say -- an Intel Core i5 each, hook the whole thing up to a gigabit switch, and voilà! Probably the cheapest way to go, and also much easier to program than using CUDA and nVidia.

I'll let the experts debate the best CPU for that job, but AMD should also have some nice products on offer.

200hr (0)

Anonymous Coward | more than 2 years ago | (#38251370)

Hi Guys,

I get paid $200/hr by the government to come up with an architecture for parallel processing. Rather than take the time to read through droll literature, I need to go traveling to my second house in the Caymans. I wondered if I could just ask Slashdot and save myself the trouble.

Ttyl
Tax Evader

Don't buy GTX's (4, Informative)

MetricT (128876) | more than 2 years ago | (#38251536)

We have several racks full of them, purchased because "they're cheaper than Teslas."

Except the Teslas have, as pointed out, ECC memory and better thermal management, and the GTXs have several useful features (like the GPU load level in nvidia-smi) disabled.

The former causes the compute nodes to crash regularly; what you save on cards, you'll lose in salary for someone to nursemaid them. The latter makes it harder to integrate into a scheduler environment (we're using Torque).

Yes, this is primarily market segmentation, and there probably isn't $10 worth of real difference between the two. I hope the marketing droid who thought that scheme up burns. It's a total aggravation, but paying for Teslas is worthwhile.

Re:Don't buy GTX's (2)

Khyber (864651) | more than 2 years ago | (#38251660)

There are plenty of hacks to enable the GPU load-level readout; probably several already out there as-is. The ECC memory is a different beast, though.

Embarrassingly parallel problems... (1)

MickLinux (579158) | more than 2 years ago | (#38251680)

... do not require embarrassingly parallel solutions.

They require math and algorithm design to make the solution *non-embarrassing*.

To give you an example: a typical FFT can, with easy math, cut its number of calculations by a factor of four. With a little care, you can halve the number of calculations again.

Start with the math. Then look at the solution.

Last of all, consider cloudware. It's out there. Let's see... on my Android, I have "SourceLair". Yeah, that's one.

Once you have the cloudware solution in hand, *then* you can start thinking about spending money on a kind of parallel solution (such as what Google uses).

Re:Embarrassingly parallel problems... (1)

ceoyoyo (59147) | more than 2 years ago | (#38251954)

Ah, generalizations. Of course, you have no idea what he's working on.

Re:Embarrassingly parallel problems... (2)

MickLinux (579158) | more than 2 years ago | (#38252350)

Yes I do. He's extending the calculations begun by Lewis Carroll in the imaginary space (through the looking glass), to see the effects as the ultimate limit increases.

What's

  1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1

As I said, embarrassingly parallel. Get 7 computers working on it in parallel, with 1 for backup:

What's 1+1+1+1+1+1? (After some calculation: 6.)
So the whole thing is 42.

The ultimate answer is

  1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1 + 1+1+1+1+1+1 = 42.

I should note that this mathematical calculation was also attempted by Douglas Adams, using genetic algorithms.

memory bus bottlenecks: 1 machine? (1)

bbulkow (954499) | more than 2 years ago | (#38251746)

Even though the application is parallel, your bottleneck can easily be the memory bus. Adding Tesla cores won't solve memory bus issues. For a number of apps, stacking up Intel i5 quad cores increases memory bandwidth on the cheap. Ten $500 machines, or five $1000 machines each with a cheap NVidia GPU, may very well outperform anything that can be put in a single "box", because there is 10x or 5x more aggregate memory bandwidth. That means you need to write not just parallel code but multi-machine parallel code, in which case you should get in bed with a computation fabric like Hadoop or one of a million others (raw MPI is another example, if you're the hacker type).

Re:memory bus bottlenecks: 1 machine? (1)

bbulkow (954499) | more than 2 years ago | (#38251772)

Replying to self: our Citrusleaf database does amazing parallel operations on Sandy Bridge i5 (2400) machines. Single-socket machines have the best interrupt processing and lowest memory latency. Going to Xeon architectures is, price/performance-wise, a HUGE step down. There was a great post somewhere about $/speed in CPUs, and of course the true consumer-grade stuff (i5 and Phenom II) was 10x better than "datacenter"-grade machines. This is especially true for Supermicro; as much as I like them, you can save 4x the money by going Asus and using a physically larger box, if you're not going into a data center. Another cost savings is running the project at home: you'll get more bandwidth for $50/month than you'll ever get from a data center.


How about (1)

koan (80826) | more than 2 years ago | (#38251930)

Multiple GTX-card servers in a cluster? So 4 GTX GPUs to a box, and several boxes to the cluster.

Try looking at the cheap end... (2)

JoeMerchant (803320) | more than 2 years ago | (#38251952)

I've played this parallel cost analysis game several times, and if you don't need high bandwidth communication between the threads, I usually come up with the Google solution: a big farm of cheap machines. AMD chips start looking good compared to Intel because you're not after a single thread finishing as fast as possible, you're after as many FLOPS per $ as you can get. We even did the analysis for an extreme Apple fanboi: MacPros vs MacMinis back in 2007, and a stack of 25 minis came out way more powerful than the 3 or 4 Pros you could get for the same money.

pick up a Sun T5140 on ebay (1)

Anonymous Coward | more than 2 years ago | (#38252044)

Two 8-core processors with 8 threads per core == 128 simultaneous threads.

You could get a new Sun T3-1 for a little more. It would be roughly the same performance (it only has one physical processor, but it's 16 cores * 8 threads per core, so still 128 total).

100 cores per chip on Tilera (0)

Anonymous Coward | more than 2 years ago | (#38252080)

Tilera has 100-core chips. If you don't need floating point (you never said what you were doing), they're a great choice.

No need for the high-end, little need for doubles (0)

Anonymous Coward | more than 2 years ago | (#38252118)

Most (but not all) double-precision work can be handled with single-precision pairs. Basically you keep numbers in the form a+b, where a is "big" and b is "little", and handle them accordingly. The slowdown is less than you'd think, and this often gives better performance than the kind of hardware DP that GPUs offer. There are a bunch of libraries and papers out there if you Google for them.
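To make the a+b trick concrete, here's a minimal double-single sketch (helper names are mine; compiles under nvcc, or under any C++ compiler if you drop the CUDA qualifiers, and note that aggressive fast-math flags will break the identities):

    // Double-single ("float-float"): a value x is stored as hi + lo with
    // |lo| much smaller than |hi|, giving roughly 48 mantissa bits from two floats.
    struct ds { float hi, lo; };

    // Error-free float addition (Knuth's two-sum): result.hi + result.lo == a + b exactly.
    __host__ __device__ inline ds two_sum(float a, float b)
    {
        float s = a + b;
        float v = s - a;
        float e = (a - (s - v)) + (b - v);
        return ds{s, e};
    }

    // Add two double-single numbers and renormalize back into (hi, lo) form.
    __host__ __device__ inline ds ds_add(ds x, ds y)
    {
        ds s = two_sum(x.hi, y.hi);     // combine the "big" parts exactly
        float lo = s.lo + x.lo + y.lo;  // fold in the rounding error and "little" parts
        return two_sum(s.hi, lo);
    }

Multiplication and division follow the same pattern with Dekker-style splitting; the libraries and papers mentioned above have the full set of operations.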

ECC is nice, but it can be avoided in a number of applications by simply doing regular checkpointing and restarting failed computations. Again, this only works for certain applications, but it can save a whole heap of $ for the ones where it does.

If your target workload fits in both of these groups, you can assemble a cluster using high-end AMD gamer cards that will thrash any Tesla-based solution on performance/$ by a huge amount.

Speaking anonymously because of my employer, but I have built such systems for these kind of applications.

Distributed.net (0)

Anonymous Coward | more than 2 years ago | (#38252298)

If you imitate what distributed.net accomplished (and what folding@home and others are currently accomplishing), just make a creative website detailing what you're doing and why people should give you their unused GPU/CPU cycles.

Otherwise, a 4x GTX 580 box (as mentioned already) will destroy whatever you throw at it.

I think the fact that you mentioned Tesla and GTX in this article covers the unsaid statement of "we're using CUDA".

Or browse top500.org for a rental shopping list.

Do your homework before going GPU (3, Informative)

PatDev (1344467) | more than 2 years ago | (#38252304)

As someone who has done some GPU programming (specifically CUDA), be aware that there is more to the GPU parallelism model than just "lots of threads". Many embarrassingly parallel problems translate very poorly to CUDA. The primary things to consider are:

1. GPUs are *data parallel*. This means you need an algorithm in which every thread executes the same instruction at the same time (just on different data). As a cheap way to evaluate it: if you can't speed up your program by vectorizing it, the GPU won't help. You can have divergent "threads" on GPUs, but as soon as you do, you've lost all the benefit of using a GPU and have essentially turned it into an expensive but slow computer.

2. Moving data onto or off of the GPU is *slow*. So if you can leave all the data on the GPUs, and none of the GPUs need to communicate with each other, this will work well. If the threads need to sync up globally and frequently, you're going to be in trouble.

That said, if you have the right kind of data parallel problem, GPUs will blow everything else out of the water at the same price point.
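To put a picture on "every thread executes the same instruction on different data", here's about the simplest possible CUDA sketch (a generic SAXPY-style kernel, not anyone's actual workload):

    #include <cuda_runtime.h>

    // Every thread does the identical operation on its own element: ideal GPU work.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: one thread per element, grouped into blocks of 256.
    void run_saxpy(int n, float a, const float *d_x, float *d_y)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, a, d_x, d_y);
        cudaDeviceSynchronize();
    }

If the kernel body were an if/else maze where neighbouring threads take different branches, you'd be back in the "expensive but slow computer" territory described above.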

Arm cluster (0)

Anonymous Coward | more than 2 years ago | (#38252338)

www.gumstix.com/store/product_info.php?products_id=247

Overo omap3s rock
