
Choosing Interconnects for Grid Databases?

Cliff posted more than 8 years ago | from the neurons-of-the-distributed-mind dept.

Databases | 31 comments

cybrthng asks: "With all of the choices from Gig-E, 10 Gig-E, SCI/Infiniband and other connections for grid database applications, which one is actually worth the money and which ones are overkill or underperforming? In a Real Application Cluster (RAC), latency can be an issue with cached memory and latches going over the interconnect. I don't want to recommend an architecture that doesn't achieve the desired results, but on the flip side, I don't necessarily want overkill. Sun has recommended SCI, Oracle has said Gig-E, and other vendors have said 10 Gig-E. It seems sales commissions drive much of what people recommend, so I'm interested in any real-world experience you may have. Obviously, Gig-E is more affordable from a hardware perspective, but does this come at a cost in application availability and performance for the end users? What have been your successes or failures with grid interconnects?"


Linux for corporate servers? (-1, Troll)

Anonymous Coward | more than 8 years ago | (#13774217)

Hi! I'm considering expanding and open-sourcing my back end. Will Linux help?

If you have the money..... (-1, Flamebait)

Creepy Crawler (680178) | more than 8 years ago | (#13774375)

Go hire a REAL NETWORK ENGINEER and work it out.

Preferably somebody with real experience...

Not "IANAL" groups like found on slashdot.

Yeah, you pay through the nose, but you will get the best.

 

Re:If you have the money..... (1, Offtopic)

hey! (33014) | more than 8 years ago | (#13774520)

Oh, what the hell. 90% of the responses will be people who have no fricken' idea what he's talking about, and 9.9% will have some idea but no actual experience.

So, being an optimist, I choose to see the glass as 0.1% full. This means if we get a thousand responses, it's worth a shot.

Re:If you have the money..... (0, Offtopic)

Godeke (32895) | more than 8 years ago | (#13774662)

Well, when I read this article there were two people with experience (one with a fair commentary even), you, and four other comments. I guess he beat the odds.

Re:If you have the money..... (1)

OzPeter (195038) | more than 8 years ago | (#13776164)

Damn, I wish I had mod points to mark this funny. But then again, I think the lead-up to the punch line was a bit long, so I guess the people who modded you off-topic never saw it coming (or going).

And yes... I refuse to explain humour.

mod parent up (1)

TTK Ciar (698795) | more than 8 years ago | (#13775190)

CC could have worded it more nicely, but his underlying message is spot-on. If you don't have the necessary expertise in-house, hiring a professional is faster and less expensive than growing the expertise yourself, and you'll probably end up with a better-running system.

-- TTK

Re:mod parent up (1)

arkanes (521690) | more than 8 years ago | (#13775594)

*Every single* Ask Slashdot either has a slew of people telling the poster that he's stupid and should have just used Google, or a slew of posters telling him that he's stupid and unprofessional and should throw money at the problem. Sometimes both in the same thread, which is always good for a laugh.

Since you're one of the second type of responder, do you care to say what you think an acceptable topic for Ask Slashdot would be? Obviously, asking about highly technical enterprise topics is verboten, since if you have to ask you're clearly unqualified and should just hire the most expensive consultant listed in your directory. And clearly simple questions you can answer by searching Google are a waste of everyone's valuable time. So why don't you provide some guidelines for what an acceptable topic is? Maybe this very question would be a good Ask Slashdot!

Re:mod parent up (1)

TTK Ciar (698795) | more than 8 years ago | (#13778233)

Since you're one of the second type of responder

Oh, am I now? Funny, before I made the post to which you replied, I made this other post first [slashdot.org], because I thought I didn't emphasize this alternative enough there.

You're off-base, though, in claiming that hiring a professional network engineer would cost more money than developing the necessary expertise in-house. The time and resources it takes to learn how to do it yourself will cost your business more in the long run. If it's a labor of love, though, that's a different matter -- picking up new skills because you want to know how to do things yourself is highly commendable (especially if you can do it on someone else's dime).

-- TTK

Re:mod parent up (1)

aminorex (141494) | more than 8 years ago | (#13775728)

Spoken like a true professional consultant.

Here's my take:

The OP never said anything useful about price and performance constraints, scaling requirements, software constraints, etc.

1) connect them all to a fast switch using commodity hardware, OTS software.
      - cheap, cheap, cheap, works now
      - probably good enough

need faster?

2) connect them all to each other using commodity hardware, OTS software.
      - cheap, will work once the wires are all connected
      - almost certainly good enough

still not satisfied?

3) connect them all to each other using commodity hardware, and roll your
      own user-space memory-mapped IO system with a sockets API (assuming that's
      what your apps require), probably using Adam Dunkels' lwIP.
      - expensive, behind schedule
      - forsooth, it rocks thine world.

more money than time?

4) connect them to a proprietary fabric, running on boutique hardware and
      software.
      - you just won a brand new investor lawsuit
      - OK, so it's fast

Re:mod parent up (1)

cybrthng (22291) | more than 8 years ago | (#13776389)

Grid computing assumes scalability to a certain factor. The concept is consolidating your datacenter into a grid environment of managed/hosted databases that are seamless to the end users and managed centrally.

Your options assume that you're developing the grid as a proprietary system, which we are not.

We are looking at using Sun 490s, Solaris 10, Sun Cluster 3.1 and Oracle 10g RDBMS, all of which support everything from GigE to SCI/Infiniband, and all of which shareholders would highly approve of, since it consolidates dozens of disparate systems into a common infrastructure to manage, saving buttloads of money.

My question is more about real-world experience vs. labs and sales figures.

Gigabit Ethernet (1, Informative)

Anonymous Coward | more than 8 years ago | (#13774473)

Switched gigabit ethernet is going to offer the best performance. Gigabit ethernet is also cheap. Infiniband is too expensive and underperforms. Fibre channel is way too expensive and is no faster than Gig-E. 10 Gbps ethernet is only good on dedicated switches because a PC cannot drive it. Most PCs can't even drive 1 Gbps ethernet.

Gig-E (5, Informative)

tedhiltonhead (654502) | more than 8 years ago | (#13774496)

We use Gig-E for our 2-node Oracle 9i RAC cluster. We have each NIC plugged into a different switch in our 3-switch fabric (which we'd have anyway). This way, if a switch or one of the node's interfaces dies, the other node's link doesn't drop. On Linux, the ethX interface sometimes disappears when the network link goes down, which can confuse Oracle. To my knowledge, this is the Oracle-preferred method for RAC interconnect.
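To illustrate that failure mode, here is a minimal watchdog sketch in Python (the interface name "eth1" and the polling interval are assumptions, not anything Oracle ships) that logs when the private-interconnect interface loses link or disappears from the system:

    # interconnect_watchdog.py -- log when the RAC private interconnect drops.
    # "eth1" and the 5-second poll are placeholders; adjust for your nodes.
    import syslog
    import time

    INTERCONNECT_IF = "eth1"

    def link_state(ifname):
        try:
            with open("/sys/class/net/%s/operstate" % ifname) as f:
                return f.read().strip()      # "up", "down", "unknown", ...
        except FileNotFoundError:
            return "missing"                 # the interface has vanished entirely

    if __name__ == "__main__":
        while True:
            state = link_state(INTERCONNECT_IF)
            if state != "up":
                syslog.syslog(syslog.LOG_WARNING,
                              "interconnect %s is %s" % (INTERCONNECT_IF, state))
            time.sleep(5)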

Re:Gig-E (1)

cybrthng (22291) | more than 8 years ago | (#13776408)

On Linux, Oracle preaches Gig-E because it fits the affordable architecture; however, after calling both Oracle and Sun, it turns out they both run their core ERP systems on SCI-based systems :)

So they don't necessarily practice what they preach -- though I guess they assume their customers run at a smaller scale (which kind of defeats the purpose of scoping a grid environment). RAC over Gig-E as a "grid", conceptually?

GigE (5, Informative)

Yonder Way (603108) | more than 8 years ago | (#13774577)

Until a couple of months ago I was the Sr Linux Cluster admin for the research division of a major pharma company. Our cluster did just fine with GigE interconnectivity, without bottlenecking.

Make sure you tune your cluster subnet: adjust window sizes, use jumbo frames, etc. Just the jump from a 1500 MTU to jumbo frames made a HUUUUGE difference in performance, so spending a couple of days just tuning the network will make all the difference in the world.
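As a starting point for that tuning, here is a small Python sketch (the interface name "eth1" is a placeholder) that reports the settings worth checking first on a Linux node: the interconnect MTU and the kernel TCP buffer tunables.

    # check_net_tuning.py -- print the MTU and TCP buffer tunables on a Linux node.
    # "eth1" is an assumed name for the cluster interconnect interface.
    def read_file(path):
        with open(path) as f:
            return f.read().strip()

    def mtu(ifname):
        # ~9000 means jumbo frames are enabled; 1500 is the Ethernet default.
        return int(read_file("/sys/class/net/%s/mtu" % ifname))

    def sysctl(name):
        # e.g. "net.ipv4.tcp_rmem" -> /proc/sys/net/ipv4/tcp_rmem
        return read_file("/proc/sys/" + name.replace(".", "/"))

    if __name__ == "__main__":
        print("eth1 MTU:", mtu("eth1"))
        for key in ("net.core.rmem_max", "net.core.wmem_max",
                    "net.ipv4.tcp_rmem", "net.ipv4.tcp_wmem"):
            print(key, "=", sysctl(key))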

Re:GigE (1)

cybrthng (22291) | more than 8 years ago | (#13776299)

We do this already for our generic RAC systems running mostly for high availability and some of the clustering functionality.

Our new platform will be the enterprise ERP suite and CRM, with hundreds of concurrent users and live transactions in everything from order entry and the product configurator to processing invoices and taking support requests.

Re:GigE (1)

chris.dag (22141) | more than 8 years ago | (#13785521)

{ excerpting from my own reply made in a different section of this article ... }

There are many people posting here who are completely confusing what the word "cluster" means for this particular question.

This article is about APPLICATION CLUSTERING (in this case a very specific relational database) and you are answering the question with information that is generalized to a COMPUTE FARM or a Linux cluster built and optimized for high performance computing.

Broadly speaking the word "cluster" means different things to different people.

There are people who build speedy Linux Clusters where the focus is on High Performance or High Throughput Computing. Then there are people who build ultra-reliable services (High Availability Clustering).

The difference between "high performance", "high throughput" and "high availability" clusters can be extreme.

For this particular Slashdot article the only people posting solutions or suggestions should be people who have actually used the RAC product in a production environment. Everyone else just needs to sit back and watch...

This is really application-specific (5, Informative)

TTK Ciar (698795) | more than 8 years ago | (#13774639)

In my own experience, fully switched Gig-E was sufficient for operating a high performance distributed database. The bottlenecks were at the level of tuning the filesystem and hard drive parameters, and memory pool sizes. But that was also a few years ago, when the machines were a lot less powerful than they are now (though hard drives have not improved their performance by all that much).

Today, high-end machines have no trouble maxing out a single Gig-E interface, but unless you go with PCI-Express or similarly appropriate IO bus, they might not be able to take advantage of more. That caveat aside, if Gig-E proved insufficient for my application today, I would add one or two more Gig-E interfaces to each node. There is software (for Linux at least; not sure about other OS's) which allows for efficient load-balancing between multiple network interfaces. 10Gig-E is not really appropriate, imo, for node interconnect, because it needs to transmit very large packets to perform well. A good message-passing interface will cram multiple messages into each packet to maximize performance (for some definition of performance -- throughput vs latency), but as packet size increases you'll run into latency and scheduling issues. 10Gig-E is more appropriate for connecting Gig-E switches within a cluster.
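If you want to put numbers on the latency side of that trade-off before buying anything, a simple TCP ping-pong between two nodes is enough. The sketch below is illustrative only (port, message size and round count are arbitrary); run the server on one node and the client on another across the interconnect under test:

    # tcp_pingpong.py -- crude round-trip latency probe over the interconnect.
    # Usage: "python tcp_pingpong.py server" on one node,
    #        "python tcp_pingpong.py client <server-host>" on another.
    import socket
    import statistics
    import sys
    import time

    PORT = 5001
    MSG_LEN = 64            # small payload so latency, not bandwidth, dominates

    def recv_exact(sock, n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed connection")
            buf += chunk
        return buf

    def server():
        with socket.create_server(("", PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
                try:
                    while True:
                        conn.sendall(recv_exact(conn, MSG_LEN))   # echo back
                except ConnectionError:
                    pass

    def client(host, rounds=5000):
        with socket.create_connection((host, PORT)) as c:
            c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            payload = b"x" * MSG_LEN
            samples = []
            for _ in range(rounds):
                t0 = time.perf_counter()
                c.sendall(payload)
                recv_exact(c, MSG_LEN)
                samples.append(time.perf_counter() - t0)
            print("median round trip: %.1f us" % (statistics.median(samples) * 1e6))

    if __name__ == "__main__":
        client(sys.argv[2]) if sys.argv[1] == "client" else server()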

The clincher, though, is that this all depends on the details of your application. One user has already suggested you hire a professional network engineer to analyze your problem and come up with an appropriate solution. Without knowing more, it's quite possible that single Gig-E is best for you, or 10Gig-E, or Infiniband.

If you're going to be frugal, or if you want to develop expertise in-house, then an alternative is to build a small network (say, eight machines) with single channel Gig-E, set up your software, and stress-test the hell out of it while looking for bottlenecks. After some parameter-tweaking it should be pretty obvious to you where your bottlenecks lie, and you can decide where to go from there. After experimentally settling on an interconnect, and having gotten some insights into the problem, you can build your "real" network of a hundred or however many machines. As you scale up, new problems will reveal themselves, so incorporating nodes a hundred at a time with stress-testing in between is probably a good idea.

-- TTK

Re:This is really application-specific (1)

cybrthng (22291) | more than 8 years ago | (#13776345)

We have professionals on site, and I use that term loosely. Professionals who are actually proficient with grid computing are few and far between.

It comes down to a latency issue, figuring out how that latency impacts real-world use, and the answer comes down to how much money you want to throw at the problem.

My question is whether interconnect latency matters so much that the gains from a high-speed interconnect aren't simply lost in all of the other overhead in the system (such as local disk reads, processing and so on).

The obvious efficiency of SCI over GigE is no IP overhead, fewer CPU cycles and a latency one tenth that of the fastest GigE card. Now, is that latency saving worth spending $6k a node for the interconnect and $20k for a SCI switch in the real world, or is Gig-E sufficient?
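A back-of-the-envelope comparison makes the shape of that question clearer. Every number in the sketch below is an assumed, illustrative figure (not a measurement), but it shows how interconnect latency stacks up against wire time and a physical disk read for a single block transfer:

    # Rough comparison of per-block transfer cost; all figures are assumptions.
    BLOCK_BYTES  = 8 * 1024      # typical Oracle block size
    GIGE_RTT_US  = 100.0         # assumed round-trip latency on switched GigE
    SCI_RTT_US   = 10.0          # roughly one tenth, per the claim above
    GIGE_GBPS    = 1.0
    SCI_GBPS     = 2.5           # assumed SCI link rate
    DISK_READ_US = 8000.0        # assumed single random disk read

    def block_transfer_us(rtt_us, gbps):
        wire_us = BLOCK_BYTES * 8 / (gbps * 1e9) * 1e6    # serialization time
        return rtt_us + wire_us

    print("GigE block transfer: ~%.0f us" % block_transfer_us(GIGE_RTT_US, GIGE_GBPS))
    print("SCI block transfer:  ~%.0f us" % block_transfer_us(SCI_RTT_US, SCI_GBPS))
    print("one physical disk read: ~%.0f us" % DISK_READ_US)

On assumptions like these, the interconnect matters most when blocks are already in another node's cache; once a physical read is involved, the disk dominates either way.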

Without pre-building the system I don't know, so I asked Slashdot to see if anyone has real-world experience that isn't confined to a sales lab or some totally tweaked-out system.

I've learned since posting that saturation of the interconnect isn't as much of a problem as the latency. Hopefully that helps in answering my concerns :)

Re:This is really application-specific (1)

blackcoot (124938) | more than 8 years ago | (#13776745)

Unless you're including the cost of installation and configuration in the SCI figures, I believe your numbers are pretty far off. I was looking at Dolphin SCI cards which could do roughly 8-10 Gb and remote DMA. The prices I was discussing were on the order of $1-1.5k per device (dual-ported cards, depending on volume) and about $5k for an 8-port switch (cables will add a pretty significant amount to these totals); check their price list at http://www.dolphinics.com/pdf/pricing/Price%20List%2020050701.pdf [dolphinics.com]. The Dolphin guys were pretty pleasant to deal with. I would have loved to use their solution, but power requirements made that essentially impossible (high-bandwidth, low-latency, low-power communications solutions don't really exist, which is a pity).

Re:This is really application-specific (1)

cybrthng (22291) | more than 8 years ago | (#13780690)

Dolphin is the vendor; however, Sun Cluster 3.1 certification to support RAC on our "grid" requires the Sun-branded cards. We are looking to get around that, but you're right: you get screwed by the big vendors :)

Re:This is really application-specific (1)

Mr. Sketch (111112) | more than 8 years ago | (#13776842)

The obvious efficiency of SCI over GigE is no IP overhead, fewer CPU cycles and a latency one tenth that of the fastest GigE card

I'm assuming you're using MPI for message passing. Why can't you go with MPI directly over GigE without going over IP? There are MPI libraries that communicate directly with SCI, Myrinet, Infiniband, etc., so why not GigE? I did a quick Google search for such a library but didn't find any, so maybe it's part of standard MPI implementations already.

If you're not using MPI, there must be a GigE SAN library that doesn't use IP. It seems like that would reduce your CPU cycles used and latency if you removed IP from the picture since GigE doesn't require IP.
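For what it's worth, a minimal MPI ping-pong can at least measure what message-passing latency looks like on a given interconnect. The sketch below assumes the mpi4py bindings and an installed MPI implementation; note that the common open-source MPIs on GigE still ride on TCP/IP, so this measures MPI-over-IP latency rather than a raw-Ethernet transport:

    # mpi_pingpong.py -- run with something like: mpirun -np 2 python mpi_pingpong.py
    # Assumes mpi4py and an MPI library are installed on both nodes.
    import time
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    rounds = 5000
    msg = bytearray(64)          # small message to expose latency

    comm.Barrier()
    if rank == 0:
        t0 = time.perf_counter()
        for _ in range(rounds):
            comm.Send(msg, dest=1, tag=0)
            comm.Recv(msg, source=1, tag=0)
        elapsed = time.perf_counter() - t0
        print("average round trip: %.1f us" % (elapsed / rounds * 1e6))
    elif rank == 1:
        for _ in range(rounds):
            comm.Recv(msg, source=0, tag=0)
            comm.Send(msg, dest=0, tag=0)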

Re:This is really application-specific (1)

cybrthng (22291) | more than 8 years ago | (#13780727)

We are running on Sun Cluster 3.1, so open-source MPI libraries won't cut it as far as remaining certified for support. Sun doesn't support MPI over GigE because of the latency and timing, which would cost more than IP over the same interface.

On a linux system, that would be an interesting benchmark for proof of concept, but not something a vendor would support us for.

This is for a large corporation, so supported platforms are critical. I don't believe Red Hat or SUSE have certified MPI over GigE in any form either.

Sigh once again: (1)

Shut the fuck up! (572058) | more than 8 years ago | (#13774756)

Ask Slashdot: "I'm charged with setting up a complicated system of which I have no fucking clue. Please help me. Oh yeah, one more thing - lives will depend on your advice so please make it good."

Multiple Networks (4, Informative)

neomage86 (690331) | more than 8 years ago | (#13774926)

I have worked with some bioinformatics clusters, and each machine was usually on two separate networks.

One was a high-latency, high-bandwidth switched network (I recommend GigE since it has good price/performance) and one was a low-latency, low-bandwidth network just for passing messages between CPUs. The application should be able to pass off throughput-intensive work (file transfers and the like) to the high-latency network and keep the low-latency network clear for inter-CPU communication.

The low-latency network depends on your precise application. I've seen everything up to a hypercube topology with GigE (for example, with 16 machines in the grid you need 4 GigE connections per computer for the hypercube; it always seemed to me that the routing in software would be high latency, but people smarter than me tell me it's low latency, so it's worth looking into). Personally, I just use a 100 Mbit line with a hub (I tried a switch, but it actually introduced more latency at less than 10% saturation, since few collisions were taking place anyway) for the low-latency connect. The 100 Mbit line is never close to saturated for my application, but it really depends on what you need.

The big thing is to make sure your software is smart enough to understand what the two networks are for, and not try to pass a 5 GB file over your low-latency network. Oh, and I definitely agree: if you are dealing with more than $10k-20k it's definitely worth it to find a consultant in that field to at least help with the design, if not the implementation.
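As an illustration of keeping the two networks separate at the application level, here is a small sketch: the bulk-transfer socket is bound to an address on the high-bandwidth subnet and the message socket to the low-latency one (all addresses and ports here are made up):

    # two_networks.py -- pin traffic to a specific network by binding the
    # source address before connecting. Addresses/ports are placeholders.
    import socket

    BULK_LOCAL = ("10.0.1.5", 0)     # this node's NIC on the high-bandwidth net
    CTRL_LOCAL = ("10.0.2.5", 0)     # this node's NIC on the low-latency net

    def connect_from(local_addr, remote_addr):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(local_addr)            # routing then keeps traffic on that subnet
        s.connect(remote_addr)
        return s

    if __name__ == "__main__":
        bulk = connect_from(BULK_LOCAL, ("10.0.1.9", 9000))    # big file shipping
        ctrl = connect_from(CTRL_LOCAL, ("10.0.2.9", 9001))    # inter-CPU messages
        ctrl.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # send immediately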

Re:Multiple Networks (1)

chris.dag (22141) | more than 8 years ago | (#13785474)

There are many people posting here who are completely confusing what the word "cluster" means for this particular question.

This article is about APPLICATION CLUSTERING (in this case a very specific relational database) and you are answering the question with information that is generalized to a COMPUTE FARM or a Linux cluster built and optimized for high performance computing.

The two areas are completely distinct and have different "best practices" when it comes to network topology, configuration and interconnect selection.

Just because GigE works fine as an interconnect on a compute farm does not mean it is going to work fine on a cluster built to satisfy a completely different requirement.

Broadly speaking the word "cluster" means different things to different people.

There are people who build speedy Linux Clusters where the focus is on High Performance or High Throughput Computing. Then there are people who build ultra-reliable services (High Availability Clustering).

The difference between "high performance", "high throughput" and "high availability" clusters can be extreme.

For this particular Slashdot article the only people posting solutions or suggestions should be people who have actually used the RAC product in a production environment. Everyone else just needs to sit back and watch...

SCSI Question (1)

LordMyren (15499) | more than 8 years ago | (#13775686)

I'm aware that it's feasible to have multiple SCSI cards in a SCSI chain; is there any way to do this with hardware RAID cards? It would be a great boon to reliability and cost were this feasible for me. Systems fail, but RAID-5 and RAID-6 arrays do so far less often.

Myren

Re:SCSI Question (1)

LordMyren (15499) | more than 8 years ago | (#13775767)

I ask because it's either this or using drbd [drbd.org] to replicate the entire file system over multiple GigE lines while having to use twice the number of hard drives. I'd much prefer to avoid these interconnections altogether and simply have SCSI itself be the common communication bus, at least such that either controller can access the RAID array should the other fail. I'm not /totally/ OT. I was looking at four GigE or 10 Gb solutions, which would have pounded CPU usage to death... I'd much rather just make sure I can always access the drives.

Myren

Yes you can... (1)

malakai (136531) | more than 8 years ago | (#13776750)

Most major SCSI card/storage array vendors have SCSI cluster support, the easy big names being Compaq/HP and Dell.

Most of the time they are used in hot/cold clusters, which is easiest to manage. Hot/hot is possible, but you need to make sure your applications know how to handle it properly.

It works by each of your servers having a SCSI/RAID controller card. They connect to a shared backplane in some sort of storage array (like a PowerVault). Make sure your backplane isn't set to 'split' mode, in which case each server would only see half the drives and negate the 'cluster' capability.

When configured properly, each card can see the same set of devices in the array. Some cards can save the RAID configuration to the storage device. This is nice because after configuring the array, you can bring up your second box and the second RAID card can read the settings the first one configured. They both need to presume the same RAID configuration. The virtual devices can differ at the OS level, but the logical RAID configuration obviously must match.

If you go that route, save yourself an enormous hassle and make sure each server has its own two local drives mirrored for the boot drive. Don't go through the hassle of booting off the array in a shared-array configuration. It's technically possible, but managing those extra partitions on the shared array is a pain.

Re:Yes you can... (1)

LordMyren (15499) | more than 8 years ago | (#13777578)

Excellent, thanks so very much; the hot/cold approach is exactly what I need to save myself some preposterous high-availability pains.

Will it be possible to find this sort of support in "consumer" RAID gear? I'm looking at the Intel SRCU42X and LSI MegaRAID 320-2e (both $700) and don't know how it would work. The main thing I don't understand is what form the utilities for these kinds of cards will take... how arrays are set up, how hotswap is performed, and all that. I doubt the firmware interfaces are open enough to allow for GPL kernel modules. BIOS isn't an option since they need to be configured and reconfigured on the fly.

Good advice on the local boot drives. Thanks again!
Myren

imo/e (1)

bonezed (187343) | more than 8 years ago | (#13778693)

GigE with jumbo frames is the way to go.

And don't skimp on switches; get something from the Extreme Networks Summit range if possible.

We use Gigabit Ethernet (1)

Sun Tzu (41522) | more than 8 years ago | (#13815720)

We have a 2-node RAC in place with plans to add two additional nodes over the next few weeks. We're using GigE and it's working perfectly. Unfortunately, only two gigabit Ethernet ports were spec'd originally, so we're also adding a third so we can have a redundant interconnect, heh.
--
Barebones computer reviews [baremetalbits.com]