
Best Server Storage Setup?

Cliff posted more than 7 years ago | from the saving-cents-on-the-GB-adds-up dept.

new-black-hand asks: "We are in the process of setting up a very large storage array and we are working toward having the most cost-effective setup. Until now, we have tried a variety of different architectures, using 1U servers or 6U servers packed with drives. Our main aims are to get the best price per GB of storage that we can, while having a reliable and scalable setup at the same time. The storage array will eventually become very large (in the PB range) so saving just a few dollars on each server means a lot. What do people out there find is the most effective hardware setup? Which drives and of what size? Which motherboards, etc? I am familiar with the Petabox solution which is what the Internet Archive uses — they have made good use of Open Source software. So what are some of the architectures out there that, together with Open Source, can give us a storage array that is much better than the $3 per GB plus that the commercial vendors ask for?"

76 comments

Low-Tech (3, Funny)

smvp6459 (896580) | more than 7 years ago | (#15529637)

I love shoeboxes for storage. Cheap and modular.

Re:Low-Tech (2, Funny)

remembertomorrow (959064) | more than 7 years ago | (#15529663)

This is especially feasible if you're married (or live with one or more women).

Re:Low-Tech (2, Funny)

Monkeys!!! (831558) | more than 7 years ago | (#15529720)

Just wait until she goes looking for her dress shoes and finds your pr0n stash instead.



Re:Ultra-High-Tech (1)

thegrassyknowl (762218) | more than 7 years ago | (#15530139)

I prefer monkey brains, Matrix-style. I don't know the actual data storage density of a monkey brain, but it's higher than many existing IDE disks... if you wire it right. As a bonus, the array is self-powering and self-temperature-regulating (provided minimal ventilation is supplied).

Coraid (5, Informative)

namtro (22488) | more than 7 years ago | (#15529710)

If $/GB is a dominant factor, I would suggest Coraid's [coraid.com] products. They have a pretty nifty technology which is dead simple and extensively leverages OSS. From my personal experience as a customer, I think they are a bunch of good folks as well. They also seem to constantly be wringing more and more performance out of their systems. Anyway, it's something I'd explore if I were you.

Re:Coraid (1)

dtfinch (661405) | more than 7 years ago | (#15529874)

Doing some price checking, I came up with a 45TB Coraid setup with sixty 750GB drives for slightly over $45k. Not bad.

How about some requirements? (4, Insightful)

Anonymous Crowhead (577505) | more than 7 years ago | (#15529713)

If price is your only requirement, go with tape. Otherwise, let's hear about redundancy, speed, reliability, availability, etc.

Lots of copies. (3, Informative)

Anonymous Coward | more than 7 years ago | (#15529806)

Redundancy, availability, and reliability are best served, most cost-effectively, by the 'lots of copies' method of storage.

You simply get the cheapest PCs you can find. The Petabox uses mini-ITX systems due to their low power and thermal requirements. Stuff them with the biggest ATA drives you can get for a good price.

Then fill up racks and racks of machines like that. As long as the hardware is fairly uniform, a few spares take care of maintenance.

Then you copy your information onto it and use a database (also redundant) to keep track of your files, with scripts and programs that keep files in sync and spread out across multiple machines so there is no single point of failure. I don't know of any OSS software out there that does this, but I would bet that a place like Archive.org will share what they use.
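For illustration only, here is a minimal sketch of the bookkeeping such scripts need (plain Python with sqlite3; the table layout, the replica count, and the idea of running it on each node are assumptions, not anyone's actual code):

import hashlib
import sqlite3

REPLICA_TARGET = 3   # how many copies we want of every file

db = sqlite3.connect("catalog.db")
db.execute("""CREATE TABLE IF NOT EXISTS copies
              (path TEXT, md5 TEXT, node TEXT, UNIQUE(path, node))""")

def md5sum(path):
    # checksum a file in 1 MB chunks so huge files don't eat RAM
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def register_copy(path, node):
    # run on 'node' itself: record that it holds a verified copy of 'path'
    db.execute("INSERT OR REPLACE INTO copies VALUES (?, ?, ?)",
               (path, md5sum(path), node))
    db.commit()

def under_replicated():
    # files with fewer than REPLICA_TARGET known copies need another sync
    return db.execute("""SELECT path, COUNT(*) FROM copies
                         GROUP BY path HAVING COUNT(*) < ?""",
                      (REPLICA_TARGET,)).fetchall()

A real system would also re-verify checksums periodically and keep the catalog itself replicated, as described above.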

Think of it like having a huge library of backup tapes, with an automated robot that fetches, copies, and returns tapes as you need them, except that instead of automated hardware and tapes you're using software and hard drives.

This sort of system will scale much higher than any currently available SAN solution.

NOW... if you want reliability, availability, and redundancy with SPEED, then that is going to fucking cost you huge amounts of money.

This is purely for STORAGE, not for work scratch space.

Personally, where I work we have 3 terabytes worth of disk space for two copies of our major production database plus scratch work space. The information is completely refreshed every 2-3 months; no data is kept older than that. In the back we have about 22 thousand tapes for longer-term storage and backups of weekly jobs...

Personally I'd rather have a room full of mini-ITX machines to go along with the SAN than a bunch of fucking tapes.

Re:Lots of copies. (1)

Nutria (679911) | more than 7 years ago | (#15530672)

Personally, where I work we have 3 terabytes worth of disk space for two copies of our major production database plus scratch work space. The information is completely refreshed every 2-3 months; no data is kept older than that. In the back we have about 22 thousand tapes for longer-term storage and backups of weekly jobs...

What kind of tape drives do you use?

After all, if there's a physical "incident" in the data center you need off-site storage of the data.

(A careless plumber dumped 10,000 gallons of water into our DC last July, and the point of entry was right above one of our SAN units.)

Splash. (1)

jhantin (252660) | more than 7 years ago | (#15536069)

A careless plumber dumped 10,000 gallons of water into our DC last July, and the point of entry was right above one of our SAN units.
Aw, man! Talk about getting your I/O buffers flushed. *rimshot*

Personally, I've been eyeing a Norco DS1220 [norcotek.com] 3U 4-channel/12-disk external SATA enclosure. Stuff it full of perpendicular drives [seagate.com] and you're looking at 9 TB for US$6000.

Xserve RAID? (3, Informative)

CatOne (655161) | more than 7 years ago | (#15529756)

It's just fibre attached storage, so you can use whatever servers you want as the head units. List is 7 TB for $13K... if you're going to scale up a lot you can certainly ask Apple for a discount.

Not *the* cheapest out there, but fast, reliable, and works well with Linux or whatever server heads you want.

Re:Xserve RAID? (1)

Fishbulb (32296) | more than 7 years ago | (#15536380)

I can second that. I've used the Xserve RAID on Suns, x86 Linux, Windows 2k Server, and actual Xserves too!
I was very impressed that the RAID manager software is a run-anywhere Java app: pull it out of the package and it runs on anything with a JRE.

Why was the parent to this modded 'Funny'?

It all depends (2, Informative)

ar32h (45035) | more than 7 years ago | (#15529768)

It all depends on what you are trying to do.

For some workloads, many servers with four drives each may work. This is the Petabox [capricorn-tech.com]/Google model. This works if you have a parallelisable problem and can push most of your computation out to the storage servers.
Remember, you don't have a 300TB drive, you have 300 servers, each with 1TB of local storage.

For other workloads, you need a big disk array and SAN, probably from Hitachi [hds.com], Sun [sun.com], or HP [hp.com]. This is the traditional model. Use this if you need a really big central storage pool or really high throughput.
Many SAN arrays can scale into the PB range without too much trouble.

Nowadays, a PB is enough to arch your eyebrows, but otherwise not that amazing. It seems that commercial storage leads home storage by a factor of 1000: when home users had 1GB drives, 1TB was amazing; now that some home users have 1TB, many companies have 1PB.

Re:It all depends (1)

HockeyPuck (141947) | more than 7 years ago | (#15530009)

For other workloads, you need a big disk array and SAN, probably from Hitachi [hds.com], Sun [sun.com], or HP [hp.com]. This is the traditional model. Use this if you need a really big central storage pool or really high throughput.


You do realize that HP and Sun OEM Hitachi (HDS) storage arrays?

*sigh* n00b

Re:It all depends (1)

StanVassilevTroll (956384) | more than 7 years ago | (#15530155)

He may realize it. If he did, he would be wrong, like you are: Sun's arrays are not HDS, although the HP XP arrays are.

Re:It all depends (1)

ar32h (45035) | more than 7 years ago | (#15533451)

Actually, the Sun StorageTek 9990 is an HDS array.
I find HDS arrays being sold under the STK name to be slightly funny.

Re:It all depends (1)

ar32h (45035) | more than 7 years ago | (#15533496)

Yes, I know that Sun and HP OEM HDS arrays.
I think that many people are more familiar with HP or Sun than they are with HDS, which is why I mentioned them.

In all fairness, I should have mentioned EMC, but I don't really like them. Nasty disk offset performance hacks and such.

Re:It all depends (0)

Anonymous Coward | more than 7 years ago | (#15530244)

For some workloads, many servers with four drives each may work. This is the Petabox/Google model. This works if you have a parallelisable problem and can push most of your computation out to the storage servers.
Remember, you don't have a 300TB drive, you have 300 servers, each with 1TB of local storage.


Or if you don't really want to pay up for a Petabox solution, why not try AFS/OpenAFS under Linux?
It's a relatively common solution used by universities and the like to solve their storage woes.

Take a page from Google's book... (1)

JimXugle (921609) | more than 7 years ago | (#15529776)

With PBs of storage, redundancy will be needed. Maybe you could make your own version of Google's distributed, redundant FS and run it on a bunch of cheap machines, like Google does?

Strike that subject, it should read "Take a chapter from Google's book..."

Good luck, plz write to /. again... I wanna see how this turns out.

-jX

P.S. I very rarely say "I wanna see how this turns out"... so write again, damnit!

Re:Take a page from Google's book... (0)

Anonymous Coward | more than 7 years ago | (#15533084)

Even cheaper -- and also "from Google's Book"
GmailFS [jones.name] and FUSE [sourceforge.net]

-----------
Get 10^91 or so free GMail accounts for googolbytes of free storage!

Easy - Think SAN - Apple XServe RAID + DNFStorage (5, Insightful)

ejoe_mac (560743) | more than 7 years ago | (#15529816)

So having thought about this a lot, here's what I would do:

1) Run an FC SAN as the backend. This allows you to connect anything you want without wondering what future technology will allow for - ATAoE, iSCSI, ???

2) Love thy Apple. Xserve RAIDs are 3U, 7TB (raw), and $13,000 - get a bunch - each controller sees 7 disks; set them up as a RAID 0 and uplink the thing to an FC switch.

3) Use DNFStorage.com's SANGear 4002 / 6002 devices to RAID 5 across the Xserve RAID 0 LUNs. Your data can then tolerate half of an Xserve RAID going offline; RAID 6 allows for an entire unit to become DOA. Make sure to have an online spare or two. (Some rough numbers for this layout are sketched after the list.)

4) Repeat - but remember, just because you can create it doesn't mean you can reasonably back it up.
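As a back-of-the-envelope check on that layout (a sketch only: the $13,000 and 7TB figures are the ones quoted in this thread, the SANGear units and online spares are left out of the cost, and everything else is illustrative):

# Rough capacity/cost for RAID 5 striped across Xserve RAID 0 LUNs.
XSERVE_PRICE = 13_000     # USD list price quoted above
XSERVE_RAW_TB = 7.0       # raw capacity per unit
LUNS_PER_XSERVE = 2       # one RAID 0 LUN per controller (7 disks each)

def usable_tb(num_xserves):
    luns = num_xserves * LUNS_PER_XSERVE
    lun_tb = XSERVE_RAW_TB / LUNS_PER_XSERVE
    return (luns - 1) * lun_tb   # RAID 5 across LUNs loses one LUN to parity

for n in (4, 8, 16):
    tb = usable_tb(n)
    cost = n * XSERVE_PRICE
    print(f"{n} Xserve RAIDs: {tb:.1f} TB usable, ${cost / (tb * 1000):.2f}/GB")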

Now the stupid question - what are you trying to do that would require this much space when you don't have the budget for a "tested, supported, enterprise" solution? Building things is fun, but at some point you need to step back and ask, "Am I willing to risk my company on my solution?" EMC, HDS, IBM, HP, and other big vendors are willing to step up and make sure your solution works, runs, and will not fail (see that video of the SAN array getting shot?).

Re:Easy - Think SAN - Apple XServe RAID + DNFStora (4, Insightful)

scsirob (246572) | more than 7 years ago | (#15530205)

You're providing some really interesting information, but with all due respect, your approach is backward. Why would you first design the solution, and then ask what the guy has in mind for it?? It's like delivering a shiny new Porsche to someone, just to find he wants to move tonnes of dirt around.

First establish what he wants to use the storage for. Video streams? Archive? Massive web farm? Database systems?

Always start with establishing purpose. Designing the storage to match will be a lot easier.

Re:Easy - Think SAN - Apple XServe RAID + DNFStora (1)

Tim C (15259) | more than 7 years ago | (#15531618)

While you are of course 100% correct (how can you possibly design a solution to a set of requirements you don't have?), that's the fault of the submitter. Those missing requirements really should have been supplied already...

Re:Easy - Think SAN + Solaris + ZFS (1)

mritunjai (518932) | more than 7 years ago | (#15530432)

If you're already running a SAN over FC, then seriously look at Sun Solaris with ZFS. It would save you tons of headaches over RAID5 (write holes, bad data on disk, etc.).

Go through their docs on ZFS at http://www.opensolaris.org/ [opensolaris.org] if your data is important to you.

Re:Easy - Think SAN - Apple XServe RAID + DNFStora (3)

Alex (342) | more than 7 years ago | (#15530573)


2) Love thy Apple. Xserve RAIDs are 3U, 7TB (raw), and $13,000 - get a bunch - each controller sees 7 disks; set them up as a RAID 0 and uplink the thing to an FC switch.

3) Use DNFStorage.com's SANGear 4002 / 6002 devices to RAID 5 across the Xserve RAID 0 LUNs. Your data can then tolerate half of an Xserve RAID going offline; RAID 6 allows for an entire unit to become DOA. Make sure to have an online spare or two.


Let's think about this - what happens when you loose a single disk in one of your Xserves? Your LUN fails and you then have to rebuild 3.5TB of data from parity.

How long will that take? This is a horrible solution.

You won't be able to grow your LUNs either.

Alex

Re:Easy - Think SAN - Apple XServe RAID + DNFStora (1)

Anarke_Incarnate (733529) | more than 7 years ago | (#15532550)

Ok, the word is LOSE, not loose. Secondly, if you LOSE 1 drive in a RAID 5, you recover from parity based on the available processing time on the controllers and the algorithms in place on said controllers. You must not deal with heavy storage if you think that 3.5TB is too much to recover from.

Re:Easy - Think SAN - Apple XServe RAID + DNFStora (1)

WarlockD (623872) | more than 7 years ago | (#15557350)

You misunderstand the poster. He said he wanted each drive "box" as a RAID 0, but all the "boxes" in one big SAN RAID 0 array. If you lose a single drive in one of those boxes, you have to replace the entire box (or just the drive, but the whole thing still gets rebuilt). While I understand why you would do that - FC is VERY fast and should rebuild quickly - it's still kind of silly to do it that way.

Re:Easy - Think SAN - Apple XServe RAID + DNFStora (1)

Anarke_Incarnate (733529) | more than 7 years ago | (#15558112)

Without sparing and/or some fancy software, you cannot really recover from a RAID 0 failure.

Re:Easy - Think SAN - Apple XServe RAID + DNFStora (2, Informative)

georgewilliamherbert (211790) | more than 7 years ago | (#15533471)

Do not use Xserve RAID. It's the worst possible pseudo-enterprise SAN product. This is not to rag on Apple in general - the company is full of smart people, many of whom are friends. This is just a lame product in an otherwise excellent product line. There are plenty of SATA-based SAN storage devices out there which are cheap. I'm partial to Nexsan [nexsan.com], having worked with them, and if you need slightly higher quality, there are the Sun StorageTek and EMC/Dell boxes, etc. Software RAID (Veritas or open source) striping on top of large HW RAID (RAID 5 or RAID 10) SAN storage array stacks works just fine.

Re:Easy - Think SAN - Apple XServe RAID + DNFStora (1)

jasonwea (598696) | more than 7 years ago | (#15537817)

I can't help but wonder, why do you recommend against XServe RAID?

Re:Easy - Think SAN - Apple XServe RAID + DNFStora (1)

flimflam (21332) | more than 7 years ago | (#15538799)

I'm confused -- what don't you like? Xserve RAID or Xsan? Certainly as a RAID, the Xserve RAID is at least superior to a software RAID. Apple's SAN software is, um, well, immature, but there's no law that says you have to use it -- there are plenty of better SAN solutions out there.

SCSI or iSCSI (1)

mkosmo (768069) | more than 7 years ago | (#15529911)

Just eBay around for SCSI drive arrays and buy a 1U to control the array, or some iSCSI units and use the iSCSI initiator on a 1U. It lowers costs and makes the unit flexible: you can move it around if the controlling server dies, etc. I prefer iSCSI since it's mobile without physically moving! iSCSI will probably be more widely used than Fibre Channel soon. Google around to find some good reviews. Both flavors are cheap on eBay!

CORAID (4, Interesting)

KagatoLNX (141673) | more than 7 years ago | (#15529982)

The best option here is Coraid.

15-drive array = $4000
750GB Seagate Drive = $420

Full Array (14-drive RAID5, one hot spare) = $10,300 for 9.75 Terabytes

That's $1.06 per gigabyte RAID5 with hotspare. It doesn't get any better than this. Even with labor to assemble and set it up, and shipping, it's hard to get above $1.50 a gigabyte.
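For anyone checking the arithmetic, it works out like this (a sketch using only the prices quoted above, not a vendor quote):

# Cost per usable GB for the quoted Coraid shelf.
shelf = 4000       # 15-bay array, USD
drive = 420        # 750 GB Seagate, USD
drive_gb = 750

total = shelf + 15 * drive      # $10,300
usable_gb = 13 * drive_gb       # 14-drive RAID 5 = 13 data drives; the 15th is the hot spare
print(f"${total} buys {usable_gb / 1000:.2f} TB usable, i.e. ${total / usable_gb:.2f}/GB")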

I suggest CLVM and Xen on the servers. Xen makes it really easy to turn up a new box. The space is available everywhere. CLVM is flexible enough to allow you to migrate stuff across arrays (or span arrays) very easily. I actually boot off of a flash chip and pivot_root my Linux systems onto a filesystem running off of these.

These numbers are roughly my cost. E-mail me if you'd like to buy one and we can talk about it. :)

Re:CORAID (1)

Osty (16825) | more than 7 years ago | (#15530063)

That's $1.06 per gigabyte RAID5 with hotspare. It doesn't get any better than this. Even with labor to assemble and set it up, and shipping, it's hard to get above $1.50 a gigabyte.

RAID5? BAARF. If you were to use RAID10 instead, you'd still be around $2.1x per GB which is below the original poster's $3/GB max.

Re:CORAID (0)

Anonymous Coward | more than 7 years ago | (#15530167)

Ah, crap. I previewed and still missed it. That should be BAARF [baarf.com].

Re:CORAID (1)

drsmithy (35869) | more than 7 years ago | (#15531851)

RAID5? BAARF. If you were to use RAID10 instead, you'd still be around $2.1x per GB which is below the original poster's $3/GB max.

While BAARFers have a point, if your performance profile doesn't involve lots of random writes, then RAID5 (or, preferably, RAID6, if you have a large (~>6 disks) array) is a perfectly valid solution *and* provides [much] better $/GB.
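The capacity side of that tradeoff is easy to put numbers on. A quick sketch using the shelf and drive prices quoted upthread (the exact $/GB depends on how you count the hot spare, so treat these as illustrative):

# Usable capacity of a 15-bay, 750 GB-drive shelf (~$10,300) under different
# RAID levels, keeping one bay as a hot spare in every case.
TOTAL_COST = 10_300
DRIVE_GB = 750
DATA_BAYS = 14

layouts = {
    "RAID 5":  DATA_BAYS - 1,    # one drive's worth of parity
    "RAID 6":  DATA_BAYS - 2,    # two drives' worth of parity
    "RAID 10": DATA_BAYS // 2,   # everything mirrored
}

for name, data_drives in layouts.items():
    gb = data_drives * DRIVE_GB
    print(f"{name}: {gb / 1000:.2f} TB usable, ${TOTAL_COST / gb:.2f}/GB")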

Re:CORAID (1)

i_am_not_a_script_03 (982677) | more than 7 years ago | (#15540019)

Hmm....

14 drives in one RAID 5 set?
The larger the RAID set, the greater the risk of losing 2 drives in quick succession before your hot spare rebuilds completely.
Of course, losing 2 drives in a RAID 5 means your data is fsck'd.
It's not uncommon for drives from the same factory lot to fail in this manner. I usually keep my RAID sets far smaller to mitigate this risk, even though it does cost me the extra drive for the other array.
 

Reliability... (3, Insightful)

HockeyPuck (141947) | more than 7 years ago | (#15529985)

Ok... so if you are planning on driving down the cost of the setup, you will have to sacrifice other features.

How important is performance? Reliability? Scalability? If you are building PBs of storage, how do you plan to back it up? What are the uptime requirements? Hitless codeload? Do you need multipathing? Snapshots? They all do RAID, but do you need SATA or FC disk? How much throughput (MB/s) and IO/sec do you require? What management tools do you need? Callhome? SNMP/pager/email notification? Can you save $$$ by putting in a high-end array but using iSCSI in the hosts to save $800/host?

In the end it comes down to this: how valuable is your data, and what is the impact to your business if it is down?

It sounds sexy on /. to 'roll your own', but I don't see any financial institutions 'rolling their own'. Educate yourself and put out RFPs to various storage vendors.

Isilon (1)

Scott (1049) | more than 7 years ago | (#15530042)

They haven't hit a PB yet, which I'll say up front, and their cost is in line with other commercial storage vendors. But I will say that they are absolutely awesome to work with as far as performance and reliability goes. We installed our cluster about eight months ago. Initially we had a problem, which was tracked back to malformed NFS packets being sent to the network by the storage system the Isilon cluster was replacing. Once that was fixed, the system has been dead quiet. It's one piece of gear that I never worry about.

The most effective solution... (2, Interesting)

Hymer (856453) | more than 7 years ago | (#15530099)

...is an HP EVA 8000 SAN... or any other SAN with a virtualization layer. IBM TSM (aka ADSM) for backup.
It is not cheap.

Re:The most effective solution... (1)

sconeu (64226) | more than 7 years ago | (#15532966)

Troika (now part of QLogic) has the Accelera FC SAN virtualization appliance. It does in-fabric virtualization at wire speed.

What's it for ? (5, Informative)

drsmithy (35869) | more than 7 years ago | (#15530177)

You first need to assess what the storage is for - in particular, its *requirements* for performance and reliability/availability/uptime.

If you require high levels of performance (comparable to local direct-attached disk) or reliability (it must be online "all the time") then stop right now and go talk to commercial vendors. You will not save enough money doing it yourself to make up for the stress, the staffing overhead, and the losses the first time the whole house of cards falls down.

However, if your performance or reliability requirements are not so high (ie: it's basically being used to archive data and you can tolerate it going down occasionally and unexpectedly) then doing it yourself may be a worthwhile task. I get the impression this is the kind of solution you're after, so you'll be looking at standard 7200rpm SATA drives.

Firstly, decide on a decent motherboard and disk controller combo. CPU speed is basically irrelevant; however, you should pack each node with a good 2GB+ of RAM. Make sure your motherboards have at least two 64-bit/100MHz PCI-X buses. I recommend (and use) Intel's single-CPU P4 "server" motherboards and 3ware disk controllers. I believe the Areca controllers are also quite good. You will have trouble on the AMD64 side finding decent "low end" motherboards to use (ie: single-CPU boards with lots of I/O bandwidth). Do not skimp on the motherboards and controllers, as they are the single most important building blocks of your arrays.

Secondly, pick some disks. Price out the various available drives and compare their $/GB rates. There will be a sweet spot where you get the best ratio, probably around the 400G or 500G size these days (it's been several months since I last put one of these together).
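Finding that sweet spot is a one-liner once you have current quotes. The prices below are made-up placeholders, purely to show the shape of the comparison:

# Find the drive size with the best $/GB. Prices are placeholders; plug in
# whatever your supplier quotes this week.
prices_usd = {300: 130, 400: 150, 500: 200, 750: 420}   # GB -> price

for gb, usd in sorted(prices_usd.items()):
    print(f"{gb} GB: ${usd / gb:.3f}/GB")

best = min(prices_usd, key=lambda gb: prices_usd[gb] / gb)
print(f"sweet spot: {best} GB drives")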

Thirdly, find a suitable case. I personally don't like to go over 16 drives per chassis, but there are certainly rackmount cases out there with 24 (and probably more) hotswap SATA trays.

Now, put it all together and run some benchmarks. In particular, benchmark hardware RAID vs Linux software RAID and see which is faster for you (it will probably be software RAID, assuming your motherboard is any good). Bear in mind that some hardware RAID controllers do not support RAID6, but only RAID5. Prefer a RAID6 array to a RAID5 + hotspare array.
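If the software RAID benchmark wins, the Linux side of that test is only a couple of commands. A sketch, driven from Python (the /dev/sdb through /dev/sdq device names are assumptions, and mdadm may prompt for confirmation if the disks carry old metadata):

# Create a 16-drive software RAID 6 array with mdadm, then run a crude
# sequential-write benchmark to compare against the hardware RAID LUN.
import subprocess

drives = [f"/dev/sd{c}" for c in "bcdefghijklmnopq"]   # 16 drives, assumed names

subprocess.run(["mdadm", "--create", "/dev/md0", "--level=6",
                f"--raid-devices={len(drives)}", *drives], check=True)

# Destroys anything on /dev/md0, which is fine for a freshly created array.
subprocess.run(["dd", "if=/dev/zero", "of=/dev/md0",
                "bs=1M", "count=10000", "oflag=direct"], check=True)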

You now have the first component of your Honkin' Big Array. Install a suitable Linux distribution onto it (either use a dedicated OS hard disk, get some sort of solid-state device, or roll a suitable CD-ROM-based distro for your needs). Set up iSCSI Enterprise Target.

Finally, you need a front-end device to make it all usable. Get yourself a 1U machine with gobs of network and/or bus bandwidth. I recommend one of Sun's x4100 servers (4xGigE onboard + 2 PCI-X). Throw some version of Linux on it with an iSCSI initiator. Connect to your back-end array node and set it up as an LVM PV, in an LVM VG. Allocate space from this VG to different purposes as you require.

When you start to run out of disk, build another array node, connect to it from the front-end machine and then expand the LVM VG. As you expand, investigate bonding multiple NICs together and additional dual- or quad-port NICs to supplement the onboard ones. I also recommend keeping at least one spare disk in the cupboard at all times for each of your storage nodes, and also a spare motherboard+CPU+RAM combo, to rebuild most of a machine quickly if required. Ideally you'd keep a spare disk controller on hand as well, but these tend to be expensive, and if you're using software RAID, any controller with enough ports will be a suitable replacement for any failures.
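The grow-as-you-go step on the front-end is mostly LVM bookkeeping. A sketch of what that looks like, again driven from Python (the /dev/sdc device name, the vg_storage/archive names, and the ext3 filesystem are all assumptions):

# Absorb a newly attached iSCSI LUN into the existing volume group and
# grow a logical volume (and its filesystem) onto the new space.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

NEW_LUN = "/dev/sdc"               # assumed name of the new node's iSCSI disk
VG = "vg_storage"                  # assumed volume group created on day one
LV = "/dev/vg_storage/archive"     # assumed logical volume

run("pvcreate", NEW_LUN)                  # label the LUN as an LVM physical volume
run("vgextend", VG, NEW_LUN)              # add it to the volume group
run("lvextend", "-l", "+100%FREE", LV)    # grow the LV over the new space
run("resize2fs", LV)                      # grow the filesystem to match (ext3 here)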

We do this where I work and have taken our "array" from a single 1.6T node (12x200G drives) to 10T split amongst 3 nodes. We are planning to add another ~6T node before the end of the year. *If* this is the kind of solution that would meet your needs, I can offer a lot of help, advice and experience to you.

However, our "array" has neither high performance nor especially high availability requirements (it can be offline for a day or two at a time without negatively impacting our workflow). I emphasise again, if you need something to be fast and highly available, don't stuff around trying to build it yourself, it's not worth it. Even if you are considering building it yourself, I would also strongly advise investigating off-the-shelf 'iSCSI disk shelves' like the Promise vTrak M500i, before you embark down the DIY path. They're a bit more expensive that DIY, but a lot more convenient. If you're not already well-versed and comfortable in both speccing and building your own machines, get the off-the-shelf components.

Re:What's it for ? (1)

swillden (191260) | more than 7 years ago | (#15530259)

When you start to run out of disk, build another array node, connect to it from the front-end machine and then expand the LVM VG.

How do you connect the storage in the back-end machines to the front-end? NFS? CIFS? Network Block Devices?

Re:What's it for ? (2, Insightful)

drsmithy (35869) | more than 7 years ago | (#15530291)

How do you connect the storage in the back-end machines to the front-end? NFS? CIFS? Network Block Devices?

iSCSI. Using the linux-iscsi-utils in CentOS 4.3.

You could also use NBD or AoE (that's ATA over Ethernet, not Age of Empires ;) ), but I have found iSCSI to be the fastest, most reliable, most flexible and most supported solution.

Look for jumbo frame support (1)

charnov (183495) | more than 7 years ago | (#15532924)

If you go with Gigabit Ethernet as opposed to Fibre Channel or InfiniBand, be sure to verify that everything supports jumbo frames and has good TCP offload. I built a NAS in the earlier days of gigabit and the CPUs were pushing 30% just on framing overhead.

Just another one of those things to look for. Personally, I recommend an IBM Shark or EMC, but you would be talking 7 figures.

Cost cost cost (2, Interesting)

TopSpin (753) | more than 7 years ago | (#15530367)

Before I throw my two cents in, a disclaimer: you focus pretty much exclusively on cost - no performance, no management, no reliability. With that in mind:

Dell MD1000 SAS/SATA JBOD + 15x 500GB SATA disks: ~$10500

Three of those, daisy chained to some head end server through a single SAS RAID controller. Guessing ~$4000 for the box.

That's 22.5TB of raw storage for ~$35k, or $1.57/GB, less if you work a deal.

You'll need about 45 of these for a petabyte (raw); $1.6M
Fabric (~5x 24 port layer-2 gigabit switches): ~$12k
600u of racks (5) + power: ~$25k
etc...
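Spelled out, the arithmetic behind those figures (a sketch using only the numbers quoted above; switches, racks, and power are left out):

# Rough petabyte math for the MD1000 approach.
import math

unit_cost = 3 * 10_500 + 4_000    # three shelves + one head-end server, $35,500
unit_raw_gb = 3 * 15 * 500        # 45 drives x 500 GB = 22,500 GB raw

# The parent rounds the unit to ~$35k, hence the quoted $1.57/GB.
print(f"${unit_cost / unit_raw_gb:.2f}/GB raw per unit")

units_for_pb = math.ceil(1_000_000 / unit_raw_gb)    # about 45 units
print(f"{units_for_pb} units, roughly ${units_for_pb * unit_cost / 1e6:.2f}M for ~1 PB raw")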

Tons of single points of failure, limited performance, and zero management. Don't expect good IOPS from this: too few controllers, major throughput constraints on the 1Gb/s uplinks, etc. It's good for light-to-moderate streaming, archival-type use. Use Linux with lots of md and LVM for at least some measure of manageability.

Frankly, $3/GB for something with management tools, better reliability, and/or better performance looks pretty good, but if you're all about cost...

Re:Cost cost cost (1)

jeffy210 (214759) | more than 7 years ago | (#15531728)

Well, what do you know, they did update the firmware... When the MD1000 first came out it was only able to support SAS drives - it had a big disclaimer, too. I just went back to their product page to make sure, and what do you know... I'd prefer SAS, but SATA is a bit cheaper to start.

BEST storage setup (1)

psergiu (67614) | more than 7 years ago | (#15530458)

EMC DMX3 [emc.com] with an ED-48000B [emc.com] FC Director. It's an actual setup - works flawlessly - and you can start low and scale up to 1PB/array and 64 2Gb FC connections/array without any downtime.

Petabox moving forward (4, Informative)

TTK Ciar (698795) | more than 7 years ago | (#15530627)

It's been almost exactly two years since we put together the first petabox rack [ciar.org], and both the technology and our understanding of the problem have progressed since then. We've been working towards agreement on what the next generation of petabox hardware should look like, but in the meantime there are a few differing opinions making the rounds. All of them come from very competent professionals who have been immersed in the current system, so IMO all of them are worthy of serious consideration. Even though we'll only go with one of the options, a different option might be better suited to your specific needs.

One that is a no-brainer is: SATA. The smaller cables used by SATA are a big win. Their smaller size means (1) better air flow for cooling the disks, and (2) fewer cable-related hardware failures (fewer wires to break, and more flexibility). Very often when Nagios flags a disk as having gone bad, it's not the disk, but rather the cable that needs reseating or replacing.

Choosing the right CPU is important. The VIA C3's we started with are great for low power consumption and heat dissipation, but we underestimated the amount of processing power we needed in our "dumb storage bricks". The two most likely successors are the Geode and the (as-yet unreleased, but RSN) new low-power version of AMD's dual-core Athlon 3800+. But depending on your specific needs you might want to just stick with the C3's (which, incidentally, cannot keep gig-e filled, so if you wanted full gig-e bandwidth on each host, you'll want something beefier than the C3).

It has been pointed out that we could get better CPU/U, disks/U, $$/U, and power/U by going with either a 16-disks-in-3U or 24-disks-in-4U solution (both of which are available off-the-shelf), compared to 4-disks-in-1U (our current hardware). This would also make for fewer hosts to maintain, and to monitor with the ever-crabby Nagios, and require us to run fewer interconnects. Right now it looks like we'll probably stick with 4-in-1U, though, for various reasons which are pretty much peculiar to our situation.

I heartily recommend Capricorn's petabox hardware for anyone's low-cost storage cluster project, if for no other reason than because a lot of time, effort, brain-cycles, and experimentation was devoted to designing the case for optimizing air flow over the disks (and figuring out which parts of the disks are most important to keep cool). Keeping the disks happy will save you a lot of grief and money. When you're keeping enough disks spinning, having a certain number of them blow up is a statistical certainty. Cooler disks blow up less frequently. Most cases do a really bad job of assuring good air flow across the disks -- the emphasis is commonly on keeping the CPU and memory cool. But in a storage cluster it's almost never the CPU or memory that fails, it's the drives.

Even though the 750GB Seagates appear to provide less bang-for-buck than smaller solutions (400GB, 300GB), the higher data storage density pays off in a big way. Cramming more data into a single box means amortizing the power/heat cost of the non-disk components better, and also allows you better utilization of your floorspace (which is going to become very important, if you really are looking to scale this into the multi-petabyte range).

When dealing with sufficiently large datasets, silent corruption of data during a transfer becomes inevitable, even using transport protocols which attempt to detect and correct such corruptions (since the corruption could have occurred "further upstream" in the hardware than the protocol is capable of seeing). We have found it necessary to keep a record of the MD5 checksums of the contents of all our data, and add a "doublecheck" step to transfers: perform the transfer (usually via rsync), make sure the data is no longer being cached in memory (usually by performing additional transfers), and then recompute the MD5 checksums on the destination host and compare it to the known-good MD5. (It's important that the data is not cached on the destination machine because the corruption might have occurred during the copy from memory to disk, and it's the copy on disk that you really want to guarantee is correct. So it's necessary to copy it to the destination, then read the copy back off the disk and into memory. Then if it was corrupted in the disk -> memory transfer, that's okay, false hits that way only cost you a second transfer. False detection of no corruption costs you data!)
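A stripped-down version of that doublecheck step might look like the following (a sketch of the idea, not Archive code; the paths, host name, and checksum are made up, and in practice you also have to make sure the destination copy is no longer in the page cache before re-reading it, as described above):

# Transfer an item with rsync, then re-read the on-disk copy and compare
# its MD5 against the known-good checksum from the catalog.
import hashlib
import subprocess

def md5_file(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def transfer_and_verify(src, dest, known_md5):
    # 1. pull the file over "native rsync" (rsync daemon, no ssh/compression)
    subprocess.run(["rsync", src, dest], check=True)
    # 2. ...do enough other transfers that 'dest' has been evicted from RAM...
    # 3. re-read the copy from disk and compare against the known-good MD5
    if md5_file(dest) != known_md5:
        raise IOError(f"silent corruption transferring {src} -> {dest}")

# Example call; the host, item, and checksum are placeholders.
# transfer_and_verify("ia300130.us.archive.org::items_0/some-item/some-file",
#                     "/1/items/some-item/some-file",
#                     "9e107d9d372bb6826bd81d3542a419d6")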

It's hard to make recommendations about your system (especially software) without knowing more details about what you're doing. Please feel free to contact the Archive to discuss. Or we can chat about it informally if you like -- ttk (at) archive (dot) org. If nothing else I can get you in contact with the right people. I can't guarantee that folks will have a lot of time to spare for it, but we're a friendly bunch and like to help. (And it's not like we're overly protective of our technology!)

Good luck with your project! :-)

-- TTK

Re:Petabox moving forward (1)

drsmithy (35869) | more than 7 years ago | (#15531883)

But depending on your specific needs you might want to just stick with the C3's (which, incidentally, cannot keep gig-e filled, so if you wanted full gig-e bandwidth on each host, you'll want something beefier than the C3).

I'd be looking at the motherboards before the CPUs. What's the bus topology of the motherboards you use? What SATA & network controllers? What *motherboard chipset*?

Re:Petabox moving forward (1)

TTK Ciar (698795) | more than 7 years ago | (#15534549)

Our current solution is based on the VIA EPIA-M mini-ITX motherboard, which uses the VIA Apollo CLE266 chipset. It and four hard drives fit nicely into a moderate-depth 1U case, and it does PXE boot (which we (and Capricorn) use to mass-image new racks with their operating systems and filesystem layouts). Its topology is quite simple, but frankly we didn't need anything more at the time (plain-jane PCI, with embedded 100bT and ATA/133), and its power and heat characteristics are very good.

If you really, really don't need anything more than a simple dumb storage brick, and don't need full gig-E bandwidth, then this hardware does just fine. What we found is that, after deciding on this hardware, subsequent decisions about the handling of metadata on the storage nodes and some other things increased our need for the storage nodes' processing capability. Also, increases in storage density (1TB per node in early 2004, vs 3TB per node today) render 100bT insufficient for some of the operations we need to occasionally perform.

As for what might replace the VIA EPIA-M at The Archive, I cannot yet say. There's a lot of testing of prospective hardware needing to get done. I wouldn't want to suggest a chipset which later turned out to have some horrible flaw that only showed up during stress-testing.

-- TTK

Re:Petabox moving forward (1)

durval (233404) | more than 7 years ago | (#15542116)

First of all, thanks for the awesomely informative comment. Just a point I'm curious about:

But depending on your specific needs you might want to just stick with the C3's (which, incidentally, cannot keep gig-e filled, so if you wanted full gig-e bandwidth on each host, you'll want something beefier than the C3).

Over what protocol did you measure that (HTTP, Samba, NFS, etc.)? Can you be more precise about just how much of gigE (1Gbps, I presume) you managed to fill with a C3?
I'm interested because currently I have a K6-III-500MHz (OK, long story) that can't fill even half of a fastE (yes, under 59Mbps) over Samba... based on whetstone/etc. published benchmarks for the C3, I was expecting an 800MHz part to barely fill a fastE, and I'm curious about the rates you are getting.

Re:Petabox moving forward (1)

TTK Ciar (698795) | more than 7 years ago | (#15545172)

The gig-e NIC in question was an Intel EtherPro1000, which is our standard card at The Archive (it's reliable, easy to fill to near-theoretical capacity without resorting to jumbo frames, and well-supported under linux). We measured its performance over "native rsync" (which is to say, without ssh encryption), without compression, which is how data is most commonly transferred between hosts at The Archive. (qv: "rsync ia300130.us.archive.org::items_0" to see what a data node looks like through rsync .. the directories with "d---------" permissions are "darkened" items, taken offline but not deleted.)

I don't have the exact numbers in front of me, but IIRC the C3 was able to fill it to about 35 megabytes/second.

-- TTK

Embedded x86-compatible CPUs: Geode vs. C3 (1)

durval (233404) | more than 7 years ago | (#15542889)

The two most likely successors are the Geode [...]

Do you mean the Geode GX or the NX? If the former, are you aware of Sudhian's comparison of it with a VIA C3 (http://www.sudhian.com/index.php?/articles/show/657 [sudhian.com])?

If not, you will certainly find it interesting.

Re:Embedded x86-compatible CPUs: Geode vs. C3 (1)

TTK Ciar (698795) | more than 7 years ago | (#15545397)

The Geode under consideration is the Geode NX 1500, which draws 7 watts at load. AFAIK we are only benchmarking it on how well it processes book scans (which is the majority of our processing load for now and the foreseeable future), and of course how easily it can fill gig-e.

Thanks for forwarding the comparison article :-) it made for an interesting read (the references to Cyrix gave me quite the nostalgic kick!).

-- TTK

Terastation (1)

BigNumber (457893) | more than 7 years ago | (#15531182)

How about the Buffalo TeraStation? Obviously this wouldn't work if you're looking for hundreds of terabytes. You can't beat the price at less than 70 cents per gig, though.

Re:Terastation (1)

raphae (754310) | more than 7 years ago | (#15536233)

I tried both the Buffalo TeraStation and Anthology Solutions' Yellow Machine. The arrays are mountable only via NFS or Samba; there is no direct access to the device via USB, which is what I had hoped for. I often like to use rsync to mirror uncompressed images of filesystems to a backup drive. Unfortunately there's no way to do it with either of these products without losing all permissions data.

I've been trying to find a barebones system with the same form factor as these products that I could put my own OS on. That would be really nice.

Try NetApp (0)

Anonymous Coward | more than 7 years ago | (#15531648)

Call up NetApp and tell them you will be buying a PB of storage over the next X months and commit to it. The price will drop a lot; tell them you are willing to state that you have that much storage from them and that they can use your company as a reference, and it will drop even more.

Great tools, excellent performance.

EMC Symmetrix or Clarion, just do it... (1)

haplo21112 (184264) | more than 7 years ago | (#15532479)

...you'll thank yourself later.

In the long run it's cheaper; it's supported (24/7/365) by a huge multinational corp that generally knows you have a problem and solves it remotely even before you do...

Anything else is a waste of time, money, and effort.

Re:EMC Symmetrix or Clarion, just do it... (1)

skulcap (184906) | more than 7 years ago | (#15534153)

"..you'll thank yourself later." ... or kick yourself later. Don't get me wrong, EMC has certainly made some pretty deep inroads into the SAN marketplace, but they
are also working off of technology/ideas/designs that were developed years ago. And while I'm all for tried and true and tested "stuff", I think there are newer, cheaper, better solutions out there. Just because EMC is easy to say, and Dell sells it, doesn't make it right. *grin*

We've had some problems with the EMCs that we have at my workplace - lacking performance, oversold features, limited expandability (which should be expected from the CLARiiON line). Some of this could be alleviated by going the Symmetrix route, but that's a whole lot of coin for what it is, and if you don't need mainframe connectivity, it's just silly. So for one of our latest purchases we tried 3PAR - http://www.3pardata.com/ [3pardata.com]. They provided more for less, are great to work with, have excellent support, and push some serious throughput. There are some very big institutions letting 3PAR into their datacenters... and then asking for more.

If you really need petabytes of data, and actually want it available, manageable, and accessible, look at a dedicated disk vendor (something like 3PAR); don't build it yourself or buy it from Dell.

Re:EMC Symmetrix or Clarion, just do it... (1)

LukeCrawford (918758) | more than 7 years ago | (#15574575)

EMC support is not very good. One place I worked, we had a CLARiiON NAS. The NAS worked okay - I mean, for what we paid, it should have been better, but it pretty much worked. At one point, we had some weird error messages on the CLARiiON. We called EMC support and got nowhere. The front-line guys, like most front-line guys, know nothing. We finally had to go back to our sales guy and threaten to leave. Our sales guy got 5 different techs onto our site to troubleshoot the problem. None of them had any clue what it was. About 3 months into this farce, one of our junior network guys figured out what the problem was: we were running 7 different layer 1 broadcast domains into the same unmanaged switch. The resulting network brokenness was the problem. We removed the bum switch, and everything worked fine.

My point is that the EMC hardware is decent. It's overpriced, but it works well. EMC support, though, leaves a lot to be desired.

Fast, Reliable, Cheap (1, Informative)

Anonymous Coward | more than 7 years ago | (#15535349)

Pick any two.

My personal suggestion is to find a bunch of older FC arrays (Compaq RA4100?) on eBay and load them up. But here's where you get into trouble:

Fast, reliable, cheap: you can have any two. We stripe for speed. We mirror for reliability. We parity-stripe for cheap.

Here's what you give up with each choice.

Striping alone is fast and cheap, but you have no fault tolerance.
Mirroring doubles your cost, but it is reliable and reasonably fast.
Striping with parity is cheap, and reasonably reliable, but you pay a huge performance penalty in write operations.

When reliability and speed are the utmost concerns, use SAME (Stripe And Mirror Everything).

As for your hardware, find whatever is cheap on eBay and run with it until you make enough money for the Tested, Supported Commercial Solutions.

iSCSI from LeftHand or EqualLogic (1)

Thundersnatch (671481) | more than 7 years ago | (#15535554)

We recently implemented an iSCSI SAN from LeftHand Networks. Basically, it's a system of dual-CPU, dual-GigE x86 servers with gigabytes of cache and lots of disk. Their "network-based iSCSI RAID" software (Linux-based) allows the individual units to pool their storage, cache, and network bandwidth into a single cluster, so every unit you add to the SAN adds capacity and performance. Traditional SAN features like snapshots, replication, virtualization, etc. are supported.

Going with iSCSI over FC will save you a factor of 2x or more in cost/GB without requiring performance sacrifices. Remember, with FC you not only have to buy the array, but also very expensive switches and HBAs. With iSCSI, all that hardware is cheap - you don't even really need HBAs in most cases - so you can buy a lot of it, and use link aggregation or MPIO for increased performance and redundancy.

EqualLogic has similar iSCSI offerings, but with proprietary building blocks. Both vendors seem "big enough" to be stable partners, and the support from LeftHand has so far been very good.

Don't even think about building something yourself if you care about reliability or management functionality; we tried that, and the iSCSI target software available for Linux and Windows isn't there yet, and just can't scale like this stuff can. We ended up buying a commercial offering. We looked at EMC, HP, etc., but their iSCSI stuff is immature compared with the aforementioned "iSCSI specialist" vendors.

1Gb fibre channel. (1)

LukeCrawford (918758) | more than 7 years ago | (#15574525)

For my Xen hosting service [prgmr.com] I use a SilkWorm 2400 along with various Fibre Channel arrays and disks bought off eBay. I've got two 14-bay Dell 224F JBOD arrays, and one IBM EXP 500 for the half-height drives.

Now, you really only need the Brocade if you have multiple computers, but it really does make things much easier than using the traditional SCSI reserve and release commands with a shared bus.

Of course, the kids today seem to like IDE. Me, I don't use it for anything other than near-line backups. For that sort of thing, I use one of the 14-bay SuperMicro cases. They are pretty nice.
