
Cluster Interconnect Review

ScuttleMonkey posted more than 8 years ago | from the sparring-with-scandinavian-warriors dept.


deadline writes to tell us that Cluster Monkeys has an interesting review of cluster interconnects. From the article: "An often asked question from both 'cluster newbies' and experienced cluster users is, 'what kind of interconnects are available?' The question is important for two reasons. First, the price of interconnects can range from as little as $32 per node to as much as $3,500 per node, yet the choice of an interconnect can have a huge impact on the performance and scalability of the codes. And second, many users are not aware of all the possibilities. People new to clusters may not know of the interconnection options and, sometimes, experienced people choose an interconnect and become fixated on it, ignoring all of the alternatives. The interconnect is an important choice, and ultimately the choice depends upon your code, requirements, and budget."


64 comments


firstpost (-1, Offtopic)

Anonymous Coward | more than 8 years ago | (#15181839)

nigger fuck lol what

No comments and it's slashdotted? (0, Redundant)

Christopher Cashell (2517) | more than 8 years ago | (#15181847)

Wow. No comments (except for the one idiot who's already been modded to -1), and the site is already slashdotted?

I'm not sure whether I should be annoyed, or amused.

Interconnect Failure? (1)

slashbob22 (918040) | more than 8 years ago | (#15181848)

Apparently, one of the cluster interconnects on this site failed.

Re:No comments and it's slashdotted? (0)

Anonymous Coward | more than 8 years ago | (#15181864)

Well, whatever interconnect you use, Windows is still Windows.

Re:No comments and it's slashdotted? (2, Insightful)

ScrewMaster (602015) | more than 8 years ago | (#15181885)

Mainly I think it's because, unlike most other stories, people are actually Reading The Fucking Article before posting. Interesting concept, eh?

Interesting? (2, Funny)

foreverdisillusioned (763799) | more than 8 years ago | (#15182128)

I don't know if I'd call it "interesting." More like the third seal of the apocalypse has just been broken.

Re:Interesting? (1)

ScrewMaster (602015) | more than 8 years ago | (#15182159)

So how many seals do we have left before all Hell breaks loose? And can we replace the broken seals (which are obviously in short supply) with a similar part? Say, an otter?

Re:No comments and it's slashdotted? (1)

Christopher Cashell (2517) | more than 8 years ago | (#15185678)

Redundant?

This is the very first comment on this story, and it's moderated redundant? What kind of a moron moderates like that?

I mean, I could understand if it were moderated as offtopic, or something, but redundant?

Dumbass.

News? (2, Funny)

Ramble (940291) | more than 8 years ago | (#15181867)

Interesting article, but I'm not sure how many Slashdotters can fit a cluster powerful enough to saturate a GigE interconnect in their mother's basement.

Re:News? (2, Funny)

Anonymous Coward | more than 8 years ago | (#15181921)

My mom has a bigger basement than yours! :D

Re:News? (0)

Anonymous Coward | more than 8 years ago | (#15182130)

It's not about the bandwidth in clusters, it's about the latency.

Re:News? (1)

cazbar (582875) | more than 8 years ago | (#15182168)

Most people don't get killed by terrorists, but the news media sure likes to focus on it anyway.

News media usually focuses on the exception, not the norm. Besides, I find clusters interesting.

Re:News? (1)

cloudmaster (10662) | more than 8 years ago | (#15185131)

At least one Slashdotter works on several clusters daily - his job involves benchmarking applications and tuning the hardware such that customers running specific applications can have an officially recommended hardware configuration that will help them to optimize their expensive hardware. He might be interested in reading about new cluster interconnects... /glad Slashdot is almost geared to my personal interests for once ;)

/.ed (5, Funny)

xming (133344) | more than 8 years ago | (#15181876)

I bet they use the $32 interconnect for their server

Mirror (4, Informative)

tempest69 (572798) | more than 8 years ago | (#15181894)

http://www.mirrordot.net/stories/57bdef42b0ad596ff35350041a22b442/index.html [mirrordot.net]

because some people practice what they preach

Storm

Re:Mirror (1)

convolvatron (176505) | more than 8 years ago | (#15181928)

except that only the first page is mirrored, and the next pointers go to the dead site.

if you care, use ib. the linux support is still a little funky, but in terms of application performance for the dollar, it's hard to beat. tcp is gorgeous for sharing buffer space in the wide area, but it's a lot of work for a tightly coupled machine.

Re:Mirror (2)

Erbo (384) | more than 8 years ago | (#15181946)

Yeah. The fact that only the first page of the article was mirrored means I didn't get to see their notes on Myrinet or InfiniBand...which are the high-speed interconnects that the company I work for, Aspen Systems [aspsys.com], generally uses when building clusters. (This is in addition to a standard GigE network used as the "control" network, freeing up the high-speed network for application data. The part I'm responsible for, the management software, uses the control network exclusively.) I'm interested to see what they say about them.

GigE FTW (1)

mg2 (823681) | more than 8 years ago | (#15181920)

Where I work, I deal with 30-40GBps average read/write total throughput on our distributed filesystem using GigE and Cisco 6509s.

I have trouble imagining an application that could eat up more than that. It's bananas.

Easy... (0)

Anonymous Coward | more than 8 years ago | (#15181963)

Imagine I have a simulation I'm running that takes a trace stream as input. The traces are between 5-50GB each. I want to run a parameter sweep of my simulation, so I submit 50 jobs to the cluster. If I have just gigabit Ethernet, I will completely saturate it whenever the jobs try to access the trace stream. Solution? Use BitTorrent to copy the files to the local nodes and then run locally. This only requires 100GB of data from the file server, and only requires it once. So filling up a gigabit switch is easy in my experience.
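The same stage-once idea can be sketched without BitTorrent. Here is a minimal C/MPI sketch (the file paths and chunk size are assumptions, not from the poster's setup; the poster's actual mechanism was BitTorrent): one rank reads the trace from the file server and broadcasts it to node-local disk everywhere.

    /* Sketch: stage a shared trace file to node-local disk once, instead of
     * letting every job pull it over the shared GigE link. Paths are
     * hypothetical; error handling trimmed for brevity. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK (8 * 1024 * 1024)   /* 8 MB staging buffer */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        FILE *src = NULL;
        FILE *dst = fopen("/tmp/trace.bin", "wb");   /* node-local copy */
        if (rank == 0)                 /* only rank 0 touches the file server */
            src = fopen("/nfs/traces/trace.bin", "rb");

        char *buf = malloc(CHUNK);
        long n = 0;
        do {
            if (rank == 0)
                n = (long)fread(buf, 1, CHUNK, src);
            MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);    /* chunk size */
            if (n > 0) {
                MPI_Bcast(buf, (int)n, MPI_BYTE, 0, MPI_COMM_WORLD);
                fwrite(buf, 1, (size_t)n, dst);
            }
        } while (n > 0);

        fclose(dst);
        /* ...each job now reads /tmp/trace.bin locally... */
        MPI_Finalize();
        return 0;
    }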

Re:GigE FTW (1)

Metabolife (961249) | more than 8 years ago | (#15182069)

Have you looked at the needs of Willy Wonka lately? He had to develop his own interconnect made out of midgets and chocolate. Now THAT is bananas.

Re:GigE FTW (1)

0racle (667029) | more than 8 years ago | (#15182104)

Yes it might include bananas, or possibly some other fruit or veggie.

Re:GigE FTW (0)

Anonymous Coward | more than 8 years ago | (#15182155)

There are applications that require high rates of data acquisition and then processing of that data as it's retrieved. This is extremely high bandwidth and needs a solution like InfiniBand or something similar.

Re:GigE FTW (0)

Anonymous Coward | more than 8 years ago | (#15182257)

Gigabit Ethernet is an awful HPC interconnect with respect to latency (mainly due to TCP/IP). Some of the applications we run go 50% faster when using Myrinet (in an 8-node run) because the latency is so much lower. Granted, you don't have to run TCP/IP over gigabit Ethernet, but there is very little ISV support for doing so.

Re:GigE FTW (2, Interesting)

multimediavt (965608) | more than 8 years ago | (#15182443)

You're talking bandwidth in a read/write to a filesystem. You are not taking into consideration applications that are latency bound, or both latency and bandwidth bound, when passing information from node to node, let alone writing to a filesystem. We run a number of scientific codes on our IB-based cluster. Some of these codes are slinging around up to 20GB of data passing messages between nodes, and these are memory copies, not filesystem reads/writes. It has to be fast (the lower the latency the better for these particular codes), and it has to have a data path capable of moving large amounts of data (above 512 MB/s) in each direction (TX and RX) at the same time.

It ain't bananas, it's NUTS! But, it does happen. Of course the one application could be running on up to 900 processors (450 nodes) at a time and will generate data files to our storage system in the neighborhood of 250 GB when it's all done. YIKES!

I call BS (1)

petermgreen (876956) | more than 8 years ago | (#15182472)

Well, you obviously aren't pushing 30 gigabytes (your capital B did mean bytes, right?) per second down gig-e links unless you are running several hundred in parallel.

Re:I call BS (1)

Junta (36770) | more than 8 years ago | (#15182665)

Just to make clear what is required...
We can cut it to 15 GB/s for purposes of discussion (he implied concurrent read/write, and most people only discuss unidirectional bandwidth).
That works out to roughly 150-170 Gb/s on the wire to deliver a cross-sectional bandwidth of 30 GB/s.

So, say, a 256-node cluster running something like GPFS or Lustre, even on gigabit Ethernet, might play in the realm of 30 GB/s concurrent read/write throughput. This assumes the nodes contribute their own storage to a pool rather than sharing a SAN, which would create a bottleneck there.

256-node clusters, while not common, could well be under the administration of a handful of Slashdot readers.
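To make the division concrete, a back-of-envelope check (my own arithmetic, derived only from the 30 GB/s and 256-node figures quoted above):

    /* Can 256 gigabit-attached nodes plausibly sum to 30 GB/s of aggregate
     * I/O? Quick arithmetic check of the claim discussed above. */
    #include <stdio.h>

    int main(void)
    {
        double total_gbytes = 30.0;    /* claimed aggregate, GB/s */
        int nodes = 256;
        double per_node_mb = total_gbytes * 1000.0 / nodes;   /* MB/s per node */
        double per_node_gbit = per_node_mb * 8.0 / 1000.0;    /* Gb/s per node */

        printf("per node: %.0f MB/s = %.2f Gb/s\n", per_node_mb, per_node_gbit);
        /* ~117 MB/s = ~0.94 Gb/s: near GigE line rate. But if the 30 GB/s is
         * concurrent read+write (15 GB/s each way), each direction needs only
         * ~0.47 Gb/s per node, comfortable on a full-duplex gigabit link. */
        return 0;
    }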

Re:I call BS (1)

mg2 (823681) | more than 8 years ago | (#15183379)

Distributed read/write: pieces of files exist in multiple places and are needed by multiple processes in multiple locations. Overall, I see an average of 30 gigabytes per second of throughput across the cluster, not from node to node. The node-to-node speed could, of course, never exceed that of the transmission medium.

Re:GigE FTW (0)

Anonymous Coward | more than 8 years ago | (#15183118)

At work we're using PCI Express across the board.
It's not unusual to see the NIC sustain 100MB/s and burst at 120MB/s.
We can get both GigE NICs to run at about 150MB/s sustained or so.
The 6TB of storage can handle 300MB/s sustained, but with access patterns being multiple sequential streams (instead of just a single one) the number is somewhat less, probably about that 150MB/s level.

Off-the-shelf components.

PCI-X can't hack it. No way, no how. Our last-gen machines had bad problems with I/O bottlenecks and bus stalls. Those issues led to periodic server misbehaviors every few months.

And don't think for a second that any Intel products can keep this going; they can't. HyperTransport is a wondrous thing.

Nitpicking... (2, Informative)

isj (453011) | more than 8 years ago | (#15181935)

"every time a packet gets sent to the NIC, the kernel must be interrupted to at least look at the packet header to determine of the data is destined for that NIC"

Uhm.. no. That is only the case if the NIC has been set into promiscuous mode, or if it has joined too many multicast groups.

"...because the packets are not TCP, they are not routable across Ethernet"

Uhm. If they were IP they would still be routable. I suspect he meant "not IP".

I also get irritated by the spelling out of "Sixty-four-bit PCI"

But the article still has a lot of good reviews and a load of links to other sites with interesting info.
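For reference, the promiscuous-mode case isj mentions is the exception, not the rule: normally the NIC filters frames by destination MAC in hardware. A minimal Linux sketch (my illustration, not from the article; the interface name is an assumption, and it needs root/CAP_NET_RAW) of how a capture tool switches a NIC into that mode:

    /* Sketch: put a NIC into promiscuous mode on Linux. After this call the
     * NIC stops filtering by MAC address and the host must inspect every
     * frame on the segment -- the costly case isj describes. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <net/if.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        struct packet_mreq mr;
        memset(&mr, 0, sizeof mr);
        mr.mr_ifindex = if_nametoindex("eth0");  /* interface name assumed */
        mr.mr_type = PACKET_MR_PROMISC;

        if (setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP,
                       &mr, sizeof mr) < 0)
            perror("setsockopt");
        return 0;
    }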

Confusing TCP and IP *is* annoying (4, Informative)

billstewart (78916) | more than 8 years ago | (#15182011)

I found the article's comments implying "Not TCP => Not Routable" quite annoying also, but I don't think he just meant "Not IP". Obviously if the application uses TCP or UDP it's going to have IP underneath and therefore be routable (unless you're doing some leftover-1980s hackery like implementing TCP over ISO CLNP or whatever.) And you could build an application that took a different flow-control approach than TCP that might be more efficient but still use IP and therefore still be routable (though usually people who want to do that sort of thing keep UDP and roll their own apps at Layer 7.)

But he's probably talking about some kind of application that's intended for local-area application only and wants to avoid the overhead of TCP, UDP, and IP addressing, header-bit-twiddling, flow control, slow-start, kernel implementations optimized for wide-area general-purpose Internet networks, etc., and rolls its own protocols that assume a simpler problem definition, much different response times, and probably just pastes some simple packet counters over Layer 2 Ethernet, probably with jumbo frames.

If you've implemented your ugly hackery properly, you still _could_ bridge it over wide areas using standard routers even though it doesn't have an IP layer. That doesn't mean it would work well - TCP's flow control mechanisms were designed (particularly during Van Jacobson's re-engineering) to deal with real-world problems including buffer allocation at destination machines and intermediate routers and congestion behaviour in wide-area networks with lots of contending systems, which a LAN-cluster protocol might not handle because it "knows" problems like that won't happen. Timing mechanisms are especially dodgy - they might have enough buffering to allow for LAN latencies in the small microseconds, but not enough to support wide-area network latencies that include speed-of-light delay (~1ms per 100 miles one-way in fiber) or insertion delay (spooling the packet onto a transmission line takes time, e.g. a 1500-byte packet onto a 1.5 Mbps T1 line takes about 8ms, and jumbo frames obviously take longer.)
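A toy sketch of the "paste a simple protocol straight over Layer 2" design billstewart describes, using Linux AF_PACKET and the local-experimental EtherType (everything here is illustrative, not any real cluster protocol; needs root):

    /* Sketch: send one raw Ethernet frame with a custom EtherType and a
     * bare sequence counter -- no IP, no TCP. 0x88B5 is an EtherType
     * reserved for local experiments. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <net/if.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>

    #define MY_ETHERTYPE 0x88B5

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(MY_ETHERTYPE));
        if (fd < 0) { perror("socket"); return 1; }

        unsigned char frame[ETH_ZLEN];        /* minimum-size frame */
        memset(frame, 0, sizeof frame);
        struct ethhdr *eh = (struct ethhdr *)frame;
        memset(eh->h_dest, 0xff, ETH_ALEN);   /* broadcast, for the demo;
                                                 source MAC left zero */
        eh->h_proto = htons(MY_ETHERTYPE);
        uint32_t seq = htonl(1);              /* the "simple packet counter" */
        memcpy(frame + sizeof *eh, &seq, sizeof seq);

        struct sockaddr_ll addr;
        memset(&addr, 0, sizeof addr);
        addr.sll_family = AF_PACKET;
        addr.sll_ifindex = if_nametoindex("eth0");   /* interface assumed */
        addr.sll_halen = ETH_ALEN;
        memset(addr.sll_addr, 0xff, ETH_ALEN);

        if (sendto(fd, frame, sizeof frame, 0,
                   (struct sockaddr *)&addr, sizeof addr) < 0)
            perror("sendto");
        return 0;
    }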

Re:Confusing TCP and IP *is* annoying (1)

mindstormpt (728974) | more than 8 years ago | (#15182060)

Also noticed it, but I thought he meant not being routable over NAT.

Re:Confusing TCP and IP *is* annoying (1)

jgs (245596) | more than 8 years ago | (#15184619)

unless you're doing some leftover-1980s hackery like implementing TCP over ISO CLNP

Actually TUBA [ietf.org] was early to mid-90's.

(At least we ended up with IPv6 instead which is way, way, better because... um... never mind.)

The Codes? (-1, Offtopic)

arodland (127775) | more than 8 years ago | (#15181972)

What the hell is that about? Is this a bad spy movie? "Give us the codes or we'll make it very bad for you, Mr. Glossy Photo..."

Re:The Codes? (0)

Detritus (11846) | more than 8 years ago | (#15182001)

It's "real programmer" jargon for programs that do a lot of number crunching, like physics or weather simulations.

Re:The Codes? (2, Informative)

gedhrel (241953) | more than 8 years ago | (#15182002)

"codes" used like that is a term for parallel software that's particularly prevalent amongst the number-crunching crowd.

Re:The Codes? (1)

arodland (127775) | more than 8 years ago | (#15182049)

Heh, alright. As an entirely different type of "real programmer", it just sounds ignorant to me. "Codes" is usually a word used by people who don't understand that code is code is code.

Re:The Codes? (0)

Anonymous Coward | more than 8 years ago | (#15182357)

It's not necessarily a sign of ignorance. Some of the best programmers I've known said "codes" instead of "code." These guys were from India, where everybody that speaks English says "codes." It's a cultural difference.

Every other week (yes I'm exaggerating), we'd have a conversation like this:

Coworker(C): ... codes ...
Me(M): Dude, you just said "codes" again.
C: Sorry.
C: ... code ...
[ less than a minute passes ]
C: ... codes ...
M: *looks at him funny*
C: Sorry. I can't help it. I learned it as codes. Where I'm from everybody says codes.

Re:The Codes? (0)

Anonymous Coward | more than 8 years ago | (#15183406)

In supercomputing, a "code" is countable and roughly correlates to application, or program, or algorithm. It gets interesting when doing analysis and benchmarks and coupled problems where you might say you joined or linked two codes into the same program or distributed application. For example, Joe's atmospheric code might be coupled with Sue's oceanographic code to provide a better climate simulation.

One reason it is a countable noun is that these things are almost proper nouns. We refer to somebody's code as a well known artifact in the community, about which papers are published, etc. They are also thought of as quite monolithic, because message-passing parallel programming is so fraught with peril when you are trying to produce scientifically meaningful, numerical results. You don't just toss portions of code around like a substance. We don't go into museums and point at all the paint on the wall that happens to be surrounded by frames. We talk about paintings. Same thing.

I hesitate to tell you what a kernel is in this area. It has nothing to do with operating systems...

Re:The Codes? (1)

fitten (521191) | more than 8 years ago | (#15184763)

Code = programming in general
Code = one application or one logical part of an application
Codes = multiple applications or multiple logical parts of an application

Code is code is code when you are talking about programming in general. All of us are programmers and we write code.

"Code" can refer to a specific entity. Sometimes you'll hear it as "codebase" or "source" or "sourcebase". An example of a specific set of code is the Linux kernel. Another example of a specific set of code is Firefox.

Codes refers to multiple sets of specific entities. If I install all the source for my Linux distribution, I have many sets of code on my machine. I have the codes for the kernel, Firefox, and whatever else.

In the parallel community, your "application" may consist of multiple, separate yet dependent, executables that must all be run together to make up a single job. So, you have multiple sets of code that make up the application... thus... multiple codes.

Example: John, check out the atmospheric code (referring to the source of a single executable) to make sure it's using the correct algorithms. David, check out the ocean code (again, the source of a single executable) to make sure it's using the correct algorithm. Ok, now let's run these codes (talking about both executables) together to form an ocean/atmospheric model.

Re:The Codes? (1)

arodland (127775) | more than 8 years ago | (#15185530)

Right, but the more common usage doesn't reflect this because it uses "code" as a mass noun. You don't have a singular "code" or plural "codes"; you have some code, a piece of code, the Apache code, or all of the code that was ever written in C. That's my point. It's not a difference in scale, it's a difference in usage entirely.

Shurely Shome Mistake? (0)

Anonymous Coward | more than 8 years ago | (#15182024)

s/codes/nodes/

WTF?? (0)

SpinJaunt (847897) | more than 8 years ago | (#15182042)

It goes without saying that this article must have been written by imbeciles as opposed to "real" monkeys; it is just atrocious.

5 mins of my life well spent. What a bargain.

Good Intro Article (1)

Frumious Wombat (845680) | more than 8 years ago | (#15182093)

though I would argue that there was too much time spent discussing GigE, and not enough on the performance and scaling issues seen with the more exotic cards.

Not a technical issue, but a little note about the Infiniband cards reading, "Unlike the alternatives, just try to get information on pricing one of these without leaving all of your contact information for a salesman to use now and in perpetuity." I've been through this recently, and have considered (given the similar performance) purchasing Myrinet because they post their prices out in the open, so that you can make some informed decisions before calling their salescritter.

More technically, some analysis of the stability and maturity of the software stack would be nice. We owned Dolphinics SCI cards once (2001-ish), and while they were blazingly fast when they worked, on our Opterons the MPI system would mysteriously shut down. They were also very closed and proprietary about their software at that point, so we went round and round over the early 2.4 drivers. Myri, while more expensive, was also more stable.

Finally, to simply geek out for a moment, I saw numbers for the Quadrics cards once. PNNL built their Itanium-2 cluster with multiple Quadrics cards per machine to get the bandwidth high enough for their chemistry apps. Light on details, but found at http://www.emsl.pnl.gov/using-emsl/tour/lab.php?facility=msc&lab=vr1119/ [pnl.gov] How to tie together 980 dual-proc Itanium-2 systems.

Re:Good Intro Article (2, Informative)

chemguru (104422) | more than 8 years ago | (#15182122)

We've built a Sun Cluster with SCI for 10g RAC. It's not just about the bandwidth of the interconnect, but the latency. With 10g RAC, you can use SCI to allow shared memory segments between each node. Damn good stuff. Too bad SCI is rarely used.

Re:Good Intro Article (1)

Frumious Wombat (845680) | more than 8 years ago | (#15182154)

It's always looked good on paper, but my one experience in the field has made me radically gun-shy. What are they like to deal with these days? I'd like to be able to run some codes such as CPMD and Siesta distributed parallel.

and lurking there in the background, there's always NWChem....

Re:Good Intro Article (1)

multimediavt (965608) | more than 8 years ago | (#15182541)

Let me leave my impressions of the article out and just give a little tidbit about why IB (and other high-end interconnect) companies don't publish their prices. The reason is, they really don't have fixed prices. They will almost always do custom pricing based on the size of your cluster and whether or not you will be a repeat customer (based on the nature of your business, etc.). Don't be afraid to call these guys; they're really not hounds that will bug the crap out of you like a software sales guy. Most of them are extremely technical (unlike a lot of IT sales folks), coming from an engineering background. They are very reasonable people in that if they call you back a couple times and you don't answer, they won't call you again unless you call them...ever! The reason being, the ones that work for the smaller/earlier-phase companies don't have the time to chase you down. They've got to get the big sales and will really not bother you if you're only gonna build a 64 (or fewer) node cluster and don't get back to them. The ones at the more established companies will know you just don't want to be bothered after they do a bit of due diligence when it comes to follow-up. So, bottom line, don't be afraid to call these guys, because you might just get a great deal out of it!

P.S. No, I'm not a sales person for one of these companies, nor do I own any stock. I've just dealt with about all of them over the last three years building clusters.

Re:Good Intro Article (0)

Anonymous Coward | more than 8 years ago | (#15185240)

I don't have to call and negotiate a discount on most Ethernet hardware, because they've already squeezed the margins out and everybody gets the low price. Why can't this happen with Infiniband? Most of the vendors are not adding any value; they're just shipping Mellanox reference cards. And now that OpenIB exists, I don't need "added-value" vendor software stacks.

Further benchmarks.. (2, Informative)

keithoc (916498) | more than 8 years ago | (#15183962)

..can be found in the results [utk.edu] of the 2005 HPC Challenge [hpcchallenge.org] - real-world results, no marketing blurb.

IB Pricing (Re:Good Intro Article) (1)

Jon_E (148226) | more than 8 years ago | (#15184617)

Try here [sun.com] if you want an idea .. my complaint was that the entire site was very Linux-centric .. there are some pretty good ideas going into Solaris [opensolaris.org] and, with their big push in amd64 and i386, it can be a more affordable and stable platform to work with .. in fact, if you want, you can reference the Infiniband source here [opensolaris.org] or here [opensolaris.org] for example ..

For parallel computation cheap can be okay (1)

andy314159pi (787550) | more than 8 years ago | (#15182179)

We found that, for disk-intensive parallel computing, gigabit Ethernet can be almost as fast as very, very expensive networking equipment. Of course, our throughput requirements are small compared to many other types of applications. So try it cheap before you invest in the expensive networking equipment. You can always use the cheap stuff for login-type work and distributed shells if you opt for the expensive equipment later.

So once you have one, what do you do with it? (1)

Myself (57572) | more than 8 years ago | (#15182397)

What sort of fun projects could a home experimenter with a pile of hardware dive into? It sounds like all these machines are used for a fairly narrow set of scientific applications. Anything a non-academic would find interesting?

Not really (2, Informative)

Junta (36770) | more than 8 years ago | (#15182637)

One, the price of all this stuff is exorbitant, and most home applications would barely benefit from going from 100 Mbit to gigabit. Realistically, getting 1.5 microsecond latency and the ability to transfer GigaBYTES per second has no home use right now. Really exorbitant high-definition streams top out at about 20 Mbit/second for 1920x1080 MPEG-2, and of course no game demands that much throughput. Hard drives for home use can only theoretically dump out 300 MB/s or so anyway (SATA II), and realistically, except for cache operations, you almost never achieve it.

Going to gigabit Ethernet brings diskless systems close to theoretically working as fast as UDMA 66 drives, which allows fun home projects to work more smoothly. Latency for network operations is already similar to drive seek times, so going to insanely low latency won't help too much either.

Systems that benefit from this stuff have large (many-drive) storage architectures to pull throughput from and large numbers of systems with enough computational data to make the interconnect fabrics worthwhile. Before you ever approach a system that large, your power/cooling bill would be insane.

If you're into this for its intrinsic interest, you can learn most of the principles involved with good old Ethernet and fill the gaps with Google research. It is undeniable that you learn more hands-on, but if you ever really need to use it with a company and you have your bases covered, chances are you'd exceed most other candidates, who aren't even aware of the technology.

Re:So once you have one, what do you do with it? (0)

Anonymous Coward | more than 8 years ago | (#15183616)

The first use that came to my mind is MPEG encoding. You could either encode a bunch of MPEG-4 videos simultaneously, ...or, if you have a multithreaded encoder, you could encode one MPEG-4 video really, really fast.

Anything which requires lots of CPU cycles should benefit. There are examples of such apps in the consumer space if you look around (some of them are even freeware/shareware). ...The real trick is finding the (preferably multi-threaded) apps which will work on your clustered OS.

There are a handful of Linux distributions which support openMosix clustering and come with a variety of apps which work on the cluster. Check out http://openmosix.sourceforge.net/instant_openmosix_clusters.html [sourceforge.net] for some examples.

Not able to RTFA, but my perspective... (4, Informative)

Junta (36770) | more than 8 years ago | (#15182517)

Lately, the big contenders are:
-Ethernet
-Infiniband
-Myrinet

I haven't heard much about SCI or Quadrics lately; just these three have been tossed around a lot. Points on each:
-Ethernet is cheap, and frequently adequate. Low throughput and high latency, but it's ok. 10GbE is starting to proliferate to eliminate the throughput shortcomings, and RDMA is starting to help latency for particular applications. Note that though clusters put together using Ethernet overwhelmingly use the IP stack to communicate over it, that is not exclusively true. There are MPI implementations available that sit right on top of the Ethernet header layer, bypassing the OS IP stack (which can be very slow) and reducing per-message overhead. Increasing the MTU also helps throughput efficiency. But for now only 1-gigabit Ethernet is remotely affordable at any scale (primarily due to current 10GbE switch densities/prices; the adapters are no more expensive than Myrinet/Infiniband).

-Myrinet. With their PCI-E cards they achieve about 2 GBytes/sec bidirectional throughput, very nearly demonstrating full saturation of their 10Gbit fabric. They are also among the lowest latency, sitting right about 2.5 microseconds node-to-node as a PingPong minimum. Currently the highest single-link throughput technology realistically available to a customer (Infiniband SDR doesn't quite achieve it, falling about 200 MByte/s short, but DDR will overtake it as it becomes realistically available). Very focused on HPC, and until recently also the only popular high-speed cluster interconnect that was very mature, easy to set up and maintain, and efficient. Now they are starting to embrace more interoperability with 10GbE, probably in response to the rise of Infiniband.

-Infiniband. Until very recently immature (huge memory consumption for large MPI jobs, and a software stack that is highly complex and not easily maintainable), and the prominent vendor of chips (Mellanox) didn't achieve good latency: with Mellanox chips you are lucky to get into the 4 microsecond range or so. With Pathscale's alternative implementation (particularly on HTX), the lowest-latency interconnect becomes possible (I have done runs with 1.5 microsecond end-to-end latency, even with a switch involved). The maximum throughput is on the order of 1.7-1.8 GByte/s, and more importantly it is one of the faster technologies at ramping up to that. No technology achieves its peak throughput until about 4 MB message sizes, and Pathscale IB is remarkably good down to 16k-32k message sizes. Additionally, IB has a broader focus and some interesting efforts: it aims to be not only a good HPC interconnect but also a good SAN architecture that in many ways significantly outshines Fibre Channel. The OpenIB efforts are interesting as well. The huge downside is that, for whatever reason, no Infiniband provider has been able to demonstrate good IP performance over their technology. This is a particular issue because almost all methods of sharing storage from hosts are IP-based. SRP is ok for the little bit of Fibre-Channel-like flexibility that strategy gives, but NFS, SMB, and image access like NBD and iSCSI all perform very poorly on Infiniband compared to Myrinet. iSER promises to alleviate that, but for the moment you are restricted to performance on the order of 2.4 gigabit/s for IP transactions, where Myrinet has been able to deliver 6-7 gigabit/s for the same measurements. You could overcome this by sharing storage enclosures and using something like Lustre, GFS, or GPFS to communicate more directly with the storage over SRP, but generally speaking some applications demand flexibility not achievable without IP performance.

And at the end of the day, I come home and run my home network on 100MBit ethernet, sigh. It is enough to run a diskless MythFrontend for HD content at least.
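For the curious, PingPong latency figures like the 2.5 us and 1.5 us numbers quoted above come from microbenchmarks along these lines. A simplified C/MPI sketch of the usual IMB/NetPIPE-style pattern (my own simplification; run with exactly two ranks, one per node):

    /* Minimal ping-pong latency microbenchmark: bounce a 1-byte message
     * back and forth many times, report half the average round trip. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 10000;
        char byte = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency = round trip / 2 */
            printf("latency: %.2f us\n",
                   (t1 - t0) / iters / 2.0 * 1e6);
        MPI_Finalize();
        return 0;
    }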

Just want to add... (4, Informative)

Junta (36770) | more than 8 years ago | (#15182582)

For those not aware of how Ethernet is limited latency-wise regardless of what is done, I will explain a tad.

Ethernet is well architected for large deployments (enterprise-wide) with the packet routing (not IP routing) done on the switches. Meaning a computer sending a packet asks its switch to get it to 0A:0B:0C:01:02:03, having no idea where it will go. The switch only knows its immediate neighbors, and will check/populate its forwarding table to figure out the next entity to hand off to. This means switches have to be really powerful, because they are responsible for a lot of heavy lifting for all the relatively dumb nodes. This is not TCP, it is not IP; it is the raw reality of Ethernet networking. Aside from spanning tree (which is maintained for no other reason than keeping a network from getting screwed up by incorrect connections, not for performance), no single entity in the network has a map of how things look beyond its immediate neighbors.

IB, Myrinet, etc. are source-routed. Every node has a full network map of every switch and system in the fabric. The task of computing communication pathways is distributed rather than concentrated (which fits well with the whole point of clusters). node1 doesn't blindly say to the switch, 'send this to node636'; it says to the switch, 'send this out port 5; at the next switch, out port 2; at the next switch, port 9, and then it should be where it needs to be'.

There are more complicated issues there, but the lion's share of the inherent strength of non-Ethernet interconnects is this.
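A toy model of that source-routing idea in C (purely illustrative; not the actual Myrinet or Infiniband wire format):

    /* Toy illustration of source routing: the sender, which knows the whole
     * fabric map, precomputes the output port for every hop; each switch
     * just pops the next port number -- O(1), no forwarding-table lookup. */
    #include <stdio.h>
    #include <stdint.h>

    struct sr_header {
        uint8_t hops;       /* ports remaining in the route */
        uint8_t port[8];    /* output port to take at each switch, in order */
    };

    /* What each switch does with an arriving packet. */
    static int switch_forward(struct sr_header *h)
    {
        if (h->hops == 0)
            return -1;                    /* arrived: deliver to local node */
        uint8_t out = h->port[0];
        for (int i = 1; i < h->hops; i++) /* pop the consumed hop */
            h->port[i - 1] = h->port[i];
        h->hops--;
        return out;                       /* forward out this port */
    }

    int main(void)
    {
        /* node1 -> "port 5, then port 2, then port 9", as in the
         * example above */
        struct sr_header h = { 3, {5, 2, 9} };
        int out;
        while ((out = switch_forward(&h)) >= 0)
            printf("switch sends packet out port %d\n", out);
        printf("packet delivered\n");
        return 0;
    }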

Re:Not able to RTFA, but my perspective... (2, Insightful)

drmerope (771119) | more than 8 years ago | (#15182837)

There is a growing consensus that Infiniband is effectively a dead end. Last year it would have been a tough call between Infiniband, Ethernet, and other more proprietary interconnects. The market seems to be favoring the backward compatibility of Ethernet, and now that low-latency Ethernet (~200ns) appears to be at hand, there does not appear to be any reason to prefer the less general technologies (Myrinet, Infiniband, etc.). My friends at Myrinet hint that they are looking at running the Myrinet protocol layer on top of Ethernet and essentially EOLing the Myrinet physical layer.

Myrinet was fabulously advanced ten years ago when it spun out of Caltech, but alas, the rest of the world has caught-up.

Re:Not able to RTFA, but my perspective... (0)

Anonymous Coward | more than 8 years ago | (#15183141)

FYI: "Myricom and Fujitsu Demo Wire-Speed 10 Gigabit Performance"

http://www.hpcwire.com/hpc/633629.html [hpcwire.com]

450 nsec latency

If a packet hits a pocket (1)

ShortBeard (740119) | more than 8 years ago | (#15182681)

If a packet hits a pocket on a socket on a port,
And the bus is interrupted as a very last resort,
And the address of the memory makes your floppy disk abort,
Then the socket packet pocket has an error to report!

If your cursor finds a menu item followed by a dash,
And the double-clicking icon puts your window in the trash,
And your data is corrupted 'cause the index doesn't hash,
Then your situation's hopeless, and your system's gonna crash!

If the label on the cable on the table at your house,
Says the network is connected to the button on your mouse,
But your packets want to tunnel on another protocol,
That's repeatedly rejected by the printer down the hall,
And your screen is all distorted by the side effects of gauss
So your icons in the window are as wavy as a souse,
Then you may as well reboot and go out with a bang,
'Cause as sure as I'm a poet, the sucker's gonna hang!

When the copy of your floppy's getting sloppy on the disk,
And the microcode instructions cause unnecessary risc,
Then you have to flash your memory and you'll want to RAM your ROM.
Quickly turn off the computer and be sure to tell your mom!

Re:If a packet hits a pocket (1)

chawly (750383) | more than 8 years ago | (#15183923)

Brilliant post - really enjoyed it. Thanks.

Re:If a packet hits a pocket (1)

Slashcrap (869349) | more than 8 years ago | (#15191337)

Brilliant post - really enjoyed it. Thanks.

Would have been a lot funnier had it been attributed to the original author in my opinion.

Re:If a packet hits a pocket (1)

chawly (750383) | more than 8 years ago | (#15194790)

More or less took it at face value. Your post raises two questions.

  • Who was the original author ?
  • What was the context that incited this person to produce this pearl ?

How about sharing the answers - I'm curious

bizn4Tch (-1, Redundant)

Anonymous Coward | more than 8 years ago | (#15182942)

intentions and coomitt3rbase and

If it doesn't say OpenVMS it's not a cluster (2, Informative)

tengu1sd (797240) | more than 8 years ago | (#15183624)

See page 3 of the Clustering Software Product Description [hp.com]

Cluster systems are configured by connecting multiple systems with a communications medium, referred to as an interconnect. OpenVMS Cluster systems communicate with each other using the most appropriate interconnect available. In the event of interconnect failure, OpenVMS Cluster software automatically uses an alternate interconnect whenever possible. OpenVMS Cluster software supports any combination of the following interconnects:

CI (computer interconnect) (Alpha and VAX)

DSSI (Digital Storage Systems Interconnect) (Alpha and VAX)

SCSI (Small Computer Storage Interconnect) (storage only, Alpha and limited support for I64)

FDDI (Fiber Distributed Data Interface) (Alpha and VAX)

Ethernet (10/100, Gigabit) (I64, Alpha and VAX)

Asynchronous transfer mode (ATM) (emulated LAN configurations only, Alpha only)

Memory Channel (Version 7.1 and higher only, Alpha only)

Fibre Channel (storage only, Version 7.2-1 and higher only, I64 and Alpha only)

Re:If it doesn't say OpenVMS it's not a cluster (1)

Macka (9388) | more than 8 years ago | (#15186975)


Yes, Digital invented clustering with OpenVMS. But since then the "clustering" brand has fragmented into HA (highly available) clusters and compute clusters like Beowulf (and variations on the theme). OpenVMS is an HA cluster, and still rules the HA roost because it does such an amazing job of combining shared storage with SSI (single system image) functionality. On the Unix front, only Tru64 has come close to OpenVMS, but sadly that's in decline and will very soon vanish.

There also seems to be zero interest out there in SSI. HP had the chance to re-model HP-UX using Tru64's SSI, but gave up, citing customer disinterest. Perhaps that's because the benefits of SSI have been undersold, I don't know. But you're unlikely to see it again in a commercial Unix setting any time soon, once Tru64 finally dies.

The really sad thing is that in the commercial space, traditional HA clusters have become almost irrelevant. Most commercial clusters sit on top of Oracle.

Going back in time, when I was working for Digital/Compaq in a Unix support role, I learned that some twit in Compaq had sold (or licensed) the Tru64 DLM source code to Oracle. I was stunned. It was like selling the crown jewels, and it turns out I was right. Now Oracle 9i/10g RAC has ASM as an option. You don't need a cluster filesystem anymore, as Oracle RAC has its own DLM and can manage just fine thank-you-very-much with shared raw storage feeding multiple database instances on multiple nodes. With each "cluster" node managing its own instance, you don't even need cluster services to start/stop Oracle across the cluster.

Why do you need Unix clusters in a commercial setting any more, or OpenVMS for that matter, when Oracle has things so neatly tied up? And there is nothing in the open source database arena that can compete with RAC on functionality. Competing on price, of course, is a different matter.
