
Building a Massive Single Volume Storage Solution?

Cliff posted more than 8 years ago | from the 15-zeros-is-a-lot-of-bytes dept.

Data Storage | 557 comments

An anonymous reader asks: "I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies that I've been scoping out are iSCSI, AoE and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined toward a grid cluster with a gigabit interface, where each node will have about 1-2TB of disk space and each node is based on a 'low' power consumption architecture. The next issue to tackle is finding a file system that could span across all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority; however, it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS) and I've researched Lustre, xFS and PVFS. There are some interesting commercial products such as the File Director from NeoPath Networks and a few others; however, the cost is astronomical. I would like to know if any Slashdot readers have any experience building out such a solution. Any help/idea(s) would be greatly appreciated!"


gmail (4, Funny)

Adult film producer (866485) | more than 8 years ago | (#13874188)

Register a few thousand Gmail accounts and write an interface that makes writing data to Gmail inboxes invisible to the app.

Er... be careful (2, Informative)

LeonGeeste (917243) | more than 8 years ago | (#13874422)

That violates their terms of use pretty severely. I don't know what they would do (Google's not the "suing-for-the-hell-of-it" type), but that wouldn't last very long when they found out. And they would find out. +5 Interesting? Well, curiosity killed the cat.

Re:gmail (2, Interesting)

Anonymous Coward | more than 8 years ago | (#13874424)

Gmail? Why bother when you can just use a few hundred million Tinydisks [msblabs.org] instead?

I wonder if tinyurl can handle 25TB...

Re:gmail (4, Funny)

Stuart Gibson (544632) | more than 8 years ago | (#13874433)

That would have been my second answer.

The first, and presumably the reason this was posted to /. is simple...

Imagine a Beowulf cluster...

Stuart

Re:gmail (-1, Flamebait)

Anonymous Coward | more than 8 years ago | (#13874470)

You should ask Zonk. He seems to think his nex door neighbor's 14 year old kid's snatch is the perfict "High Capacity Storage" location.

Been there done that (2, Interesting)

CommanderC (925728) | more than 8 years ago | (#13874511)

I wrote a web application and a client in C# that uses Gmail accounts as a sort of file system. It uses a set of email accounts as "index" accounts that use Gmail's search functionality to find what you are looking for, then pulls the attachment on the index to grab the parts of the file that were spread across multiple Gmail accounts in 500K chunks. It works really well. I did it for fun, to see if I could. It uses SMTP to post the file chunks to a given set of accounts, and users can donate accounts to the hive at will, increasing the overall storage size. All hosted, maintained, and indexed by Gmail or any other free mail service as one big file system.
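A minimal sketch of just the chunking step (assuming the 500K chunk size from the comment above; the SMTP transport, the index-account scheme, and anything Gmail-specific are left out, and the file naming is purely illustrative):

    # Split a file into fixed-size chunks and reassemble them -- the core of any
    # "spread a file across many mailboxes" scheme. Chunk size follows the
    # comment above; everything else is illustrative.
    import os

    CHUNK_SIZE = 500 * 1024  # 500K chunks

    def split_file(path, out_dir):
        """Write path out as numbered chunk files; return the chunk names."""
        os.makedirs(out_dir, exist_ok=True)
        names = []
        with open(path, "rb") as src:
            for i, chunk in enumerate(iter(lambda: src.read(CHUNK_SIZE), b"")):
                name = os.path.join(out_dir, f"{os.path.basename(path)}.{i:06d}")
                with open(name, "wb") as dst:
                    dst.write(chunk)
                names.append(name)
        return names

    def join_file(chunk_names, out_path):
        """Reassemble chunks (sorted by their zero-padded suffix) into one file."""
        with open(out_path, "wb") as dst:
            for name in sorted(chunk_names):
                with open(name, "rb") as src:
                    dst.write(src.read())

Each chunk would then be mailed to one of the donated accounts, with the index accounts recording which account holds which numbered piece; as the reply above notes, doing this against Gmail would violate its terms of use.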

GFS? (4, Informative)

fifirebel (137361) | more than 8 years ago | (#13874189)

Have you checked out GFS [redhat.com] from RedHat (formerly Sistina)?

Oracle, also (1)

PCM2 (4486) | more than 8 years ago | (#13874236)

The Oracle Cluster filesystem [oracle.com] is also available under the GPL. Dunno if that fits the bill; the description here is sort of vague. It sounds like a seriously ambitious project to approach for someone who doesn't even know what can be done, let alone what's within his budget.

Re:Oracle, also (1)

PCM2 (4486) | more than 8 years ago | (#13874287)

Er, sorry, version 2 [oracle.com] is what I meant.

Re:Oracle, also (1)

N1ck0 (803359) | more than 8 years ago | (#13874358)

From what I've heard, OCFS2 can be a bit... finicky, like most Oracle systems, and it hasn't really taken off like they hoped.

Re:GFS? (3, Informative)

N1ck0 (803359) | more than 8 years ago | (#13874296)

GFS over an FC SAN with some EMC CLARiiON CX700s as the hosts is the solution I'm going to be looking at deploying next year, although there are still some thoughts about using iSCSI instead of FC. It all really depends on what your usage patterns and performance requirements are. I don't believe GFS supports ATAoE systems, but since there is Linux support I doubt it would be too far of a stretch.

Apple Xserve? (2, Informative)

mozumder (178398) | more than 8 years ago | (#13874198)

Can't you hook up 4x 7TB Xserve RAIDs to a PowerMac and use that?

Re:Apple Xserve? (3, Informative)

Jeff DeMaagd (2015) | more than 8 years ago | (#13874248)

Apple Xserve may be the cheapest of that kind of storage, but it probably doesn't fit the original idea of commodity hardware.

Scaling to petabytes means spanning storage across multiple systems.

Re:Apple Xserve? (3, Informative)

stang7423 (601640) | more than 8 years ago | (#13874541)

Apple has a solution for this. Xsan [apple.com] is a distributed filesystem based on ADIC's StorNext filesystem. Apple states on that page that it will scale into the range of petabytes.

Re:Apple Xserve? (5, Interesting)

medazinol (540033) | more than 8 years ago | (#13874269)

My first thought as well. However, he is asking for a single volume solution. So XSAN from Apple would have to be implemented. Good thing that it's compatible with ADIC's solution for cross-platform support.
Probably would be the least expensive option overall and the simplest to implement. Don't take my word for it, go look for yourself.

Veritas Filesystem (0)

Anonymous Coward | more than 8 years ago | (#13874200)

Go check out veritas.com (now Symantec) for a commercially available filesystem...

Hah (-1, Offtopic)

Anonymous Coward | more than 8 years ago | (#13874201)

Hah!

Andrew File System (4, Informative)

mroch (715318) | more than 8 years ago | (#13874211)

Check out AFS [openafs.org] .

Re:Andrew File System (2, Informative)

Simon Lyngshede (623138) | more than 8 years ago | (#13874232)

Agreed. AFS is exceptionally nice. However, I think it still has a maximum file size of 2GB.

Re:Andrew File System (1)

ashpool7 (18172) | more than 8 years ago | (#13874256)

There's also Coda, but I'm not sure if it's as stable.

Re:Andrew File System (1)

JAZ (13084) | more than 8 years ago | (#13874356)

I was about to recommend this, but when I googled AFS and found a FAQ, it said:

Subject: 1.02 Who supplies AFS?

    Transarc Corporation
    The Gulf Tower
    707 Grant Street
    Pittsburgh, PA 15219
    United States of America

    phone: +1 (412) 338-4400
    fax:   +1 (412) 338-4404
    email: information@transarc.com, afs-sales@transarc.com
    WWW:   http://www.transarc.com/ [transarc.com]

BUT....
I clicked that transarc.com link and found the only porn site that my company proxies don't block. EEK!

Re:Andrew File System (2, Informative)

Anonymous Coward | more than 8 years ago | (#13874465)

UFS (0)

Anonymous Coward | more than 8 years ago | (#13874213)

UFS (universal file system) seems to be what you want here. Any other thoughts?

PetaBox (4, Informative)

Anonymous Coward | more than 8 years ago | (#13874221)

How about the PetaBox [capricorn-tech.com], used by the Internet Archive [archive.org]?

Re:PetaBox (-1, Redundant)

Usquebaugh (230216) | more than 8 years ago | (#13874378)

This was going to be my answer.

Re:PetaBox (5, Funny)

sycodon (149926) | more than 8 years ago | (#13874454)

Just don't call it PetaFile.

Re:PetaBox (3, Informative)

MikeFM (12491) | more than 8 years ago | (#13874516)

I priced one of those and decided I'd have to work my way up to that kind of toy. Instead I started with Buffalo's TeraStations [buffalotech.co.uk], which are affordable and have built-in RAID support. You can mount them in Linux and use LVM to span a single filesystem across several of them, or just mount them normally, depending on your needs. $1-$2 per GB for external RAID storage isn't bad at all.

MogileFS from livejournal (2, Informative)

mikeee (137160) | more than 8 years ago | (#13874224)

Livejournal developed their own distributed filesystem:

http://www.danga.com/mogilefs/ [danga.com]

It's scalable and has nice reliability features, but is all userspace and doesn't have all the features/operations of a true POSIX filesystem, so it may not suit your needs.

Go the Easy Route (3, Funny)

Evil W1zard (832703) | more than 8 years ago | (#13874228)

I know of a certain recently discovered zombie network that collectively had quite a few PBs of storage... Of course I wouldn't recommend going down that road, as it leads to, you know... jail.

Petabox (0, Redundant)

treerex (743007) | more than 8 years ago | (#13874229)

Check out the Internet Archive [archive.org] 's Petabox [archive.org] . They have a 100 TB rack running in Europe right now.

Petabox.... (1)

HotNeedleOfInquiry (598897) | more than 8 years ago | (#13874239)

Does not appear to be a single volume..

Re:Petabox.... (1)

treerex (743007) | more than 8 years ago | (#13874290)

Does not appear to be a single volume..

That depends entirely on what software you run on top of the hardware, doesn't it?

Call EMC. I am sure their CLARiiON line will handle it (0)

Anonymous Coward | more than 8 years ago | (#13874233)

Call EMC. I am sure their CLARiiON line will handle it.

I am unsure how you plan to do this with open source software. It seems to me you will want management software to go along with it. That is the real value, methinks.

Re:Call EMC. I am sure their CLARiiON line will handle it (0)

Anonymous Coward | more than 8 years ago | (#13874457)

Anonymous marketingdroid?

Go Virtual (1)

furry_wookie (8361) | more than 8 years ago | (#13874247)

Check this out to do what you want.

This is one of the coolest companies out there, and their product is better than anything EMC has for storage.

http://www.falconstor.com/ [falconstor.com]

Re:Go Virtual (2, Interesting)

krbvroc1 (725200) | more than 8 years ago | (#13874381)

He asked for low-cost commodity hardware. The fact that no price is mentioned and you need to contact a sales droid for a quote is an instant red flag. I hate vendors who do not put price lists, even 'retail' prices, on their product pages. I realize they may have different price levels based on quantity, but there is value in seeing that a product is in the '$1000-$1500' range versus the '$120000-$150000' range. Having to contact sales droids who will put your name/phone number on a sales list and harass you just to find out the price range turns me off a lot of these outfits. I do a lot of product research and selection using the Internet. I favor outfits who allow me to get all the info online without contacting a sales rep. Many times, if I cannot get the info on the web and I cannot get a price on the first phone call without providing sales lead information, I skip them.

Why? (2, Insightful)

Anonymous Coward | more than 8 years ago | (#13874251)

What are you doing on a limited budget trying to build a 1PB solution? And why are you on a budget?

Just because you are starting at 25TB doesn't mean you aren't building a 1PB solution.

You also need to figure out what kind of bandwidth you need. It's very seldom that people have 1PB of data that is accessed by one person occasionally. If some sort of USB or 1394 connection will work, you are much better off than if you require InfiniBand.

Like many "Ask Slashdot" questions, this is the last place you should be looking for help...

Re:Why? (1, Insightful)

temojen (678985) | more than 8 years ago | (#13874498)

Unless you are the mint, every budget is limited.

Network Block Device (1)

drightler (233032) | more than 8 years ago | (#13874270)

LVM/software RAID over Linux NBD... OK, it might suck, but I think it would work.
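For what it's worth, a minimal sketch of the importing side of that idea (assuming Linux with nbd-client, LVM2 and xfsprogs installed; the host names, port and device paths are hypothetical, and the old-style nbd-client invocation is assumed):

    # Attach block devices exported by several storage nodes over NBD, then
    # span one logical volume across all of them with LVM. Run as root.
    import subprocess

    NODES = [("node01", 2000), ("node02", 2000), ("node03", 2000)]  # hypothetical

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    nbd_devs = []
    for i, (host, port) in enumerate(NODES):
        dev = f"/dev/nbd{i}"
        run(["nbd-client", host, str(port), dev])  # old syntax: host port device
        nbd_devs.append(dev)

    # Pool the imported devices into one volume group and carve a single
    # logical volume out of all the free space.
    run(["pvcreate"] + nbd_devs)
    run(["vgcreate", "bigvg"] + nbd_devs)
    run(["lvcreate", "-l", "100%FREE", "-n", "bigvol", "bigvg"])
    run(["mkfs.xfs", "/dev/bigvg/bigvol"])

Note that nothing here protects against a node failure: lose one NBD export and the whole volume is gone, which is exactly the redundancy problem raised elsewhere in this thread.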

Re:Network Block Device (0)

Anonymous Coward | more than 8 years ago | (#13874326)

FreeBSD has some nice tools to address this problem, but I'm not very familiar with them. I would use GEOM to export devices to the network, use software LVM/RAID as well, and centralize it all in a server. That's a suggestion...

what gall (0, Troll)

Anonymous Coward | more than 8 years ago | (#13874283)

Considering that there are billion-dollar companies whose only job is to provide secure and redundant storage of the type that you describe, what makes you believe that someone on Slashdot would give you a solution for free?

The kind of thing you are talking about is non-trivial. If people have ideas concerning these matters you should pay them for them.

What a lot of gall!
Also, if you are being paid to do this by someone, then they obviously hired the wrong person to do the work.

Google Releases OSS? (1)

TheoMurpse (729043) | more than 8 years ago | (#13874285)

My research has not yielded any viable open source alternative (unless Google releases GoogleFS)

Since when has Google released any open source software?

Re:Google Releases OSS? (1)

Evangelion (2145) | more than 8 years ago | (#13874351)


Since google's massive infrastructure is built on Linux, chances are any kernel-space filesystem they release is going to have to be GPL compatible.

Re:Google Releases OSS? (2, Informative)

ggvaidya (747058) | more than 8 years ago | (#13874367)

A while ago [google.com]

Re:Google Releases OSS? (1)

LLuthor (909583) | more than 8 years ago | (#13874368)

See http://code.google.com/ [google.com]

Re:Google Releases OSS? (0, Redundant)

Bananatree3 (872975) | more than 8 years ago | (#13874456)

I have a feeling you haven't seen http://code.google.com [google.com] yet. This site just so happens to release code, written by Google employees, available for free. 100% open-source free.


Re:Google Releases OSS? (0)

Anonymous Coward | more than 8 years ago | (#13874474)

Yeah, we definitely need more opensource-free software.

previous story: Distributed storage (0)

Anonymous Coward | more than 8 years ago | (#13874292)

You might want to do a Google search for "Linux Distributed Storage" or otherwise look at this old post on Slashdot, which covered your question.

Otherwise, there are various solutions already available for free under Linux, but none will offer you a system that is easily implemented cheaply with fully redundant data storage.

http://ask.slashdot.org/article.pl?sid=05/05/04/1522247&tid=198&tid=230&tid=4&tid=106 [slashdot.org]

Good luck,
-eks

Scale (3, Interesting)

LLuthor (909583) | more than 8 years ago | (#13874298)

If you know the scale of the problem, you should consult with a company like EMC to provide the support for this thing - you WILL need it.

Clustering the disks with iSCSI or ATAoE is trivial - you can do that very easily, but the filesystem to run on top of it is where you will have problems.

PVFS - has no redundancy - lose one node and you lose them all.
GFS - does not scale well to those sizes or a large number of nodes - lots of hassle with the DLM.
GoogleFS - essentially write-once - not suited to small files (it's built around huge, ~50GB files) - little or no locking.
xFS - way too easy to lose your data.

It seems that you only have one option:
Lustre - VERY expensive - lots of hassle with metadata servers and lock servers.

Go with a company to take care of all this hassle - you do not have the resources of Google to deal with this kind of thing yourself.

Re:Scale (1)

LLuthor (909583) | more than 8 years ago | (#13874328)

I forgot to mention OCFS2 - it does not scale well to large numbers of nodes, but it does handle PB volumes better than Lustre 1.2 (I have never used 1.4).

Re:Scale (2, Funny)

c_woolley (905087) | more than 8 years ago | (#13874477)

I concur. Everything I have researched matches what you have stated. It is not likely this will be a very easy task to perform on a budget (depending on what he is calling a "budget"). I would guess that GoogleFS is the only viable solution other than Lustre, depending on what he is attempting to use this storage for. If large file storage is what he desires, this may be the answer once it is released to the public.

Wow (5, Funny)

DingerX (847589) | more than 8 years ago | (#13874302)

I never thought I'd see the day when sites were boasting a petabyte of porn.
That's over 3 million hours of .avis -- if you sat down and watched them end-to-end, you'd have 348 years of "backdoor sliders", "dribblers to short", "pop flies", and "long balls". We live in an enlightened age.

Re:Wow (0)

Anonymous Coward | more than 8 years ago | (#13874401)

We live in an enlightened age.

Maybe you do. I'm pleased to be unenlightened about the meaning of "backdoor sliders", "dribblers to short", "pop flies", and "long balls".

Re:Wow (0)

Anonymous Coward | more than 8 years ago | (#13874409)

Don't forget some of my personal favorites:
  • Anal Invaders 6
  • MIWLF: Moms I Wouldn't Like to Fuck
  • Face Shots with Wheelchair Bound Midgets
  • Midget Defication: Shittin' on the Little Guy

Re:Wow (1)

Surt (22457) | more than 8 years ago | (#13874445)

You're not thinking far enough ahead. The porn industry is always on the leading edge of technology, so of course they're going to be storing high definition porn on those petabytes, so that brings you down to a few paltry years worth of porn. And of course you have to factor in fast forwarding through whatever parts don't interest you.

Data redundancy REQUIRED (5, Informative)

cheesedog (603990) | more than 8 years ago | (#13874314)

One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing:

Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means that the average disk is expected to have a failure about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!

Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.

You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding allows you to plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
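The arithmetic behind that claim, using the parent's example numbers (a rough estimate only; it assumes independent, evenly spread failures, which the comment itself points out is not realistic):

    # Back-of-the-envelope failure-interval estimate.
    mtbf_hours = 500_000   # per-disk MTBF from the example above
    disks = 1_000

    hours_between_failures = mtbf_hours / disks
    print(f"~1 failure every {hours_between_failures:.0f} hours "
          f"(~every {hours_between_failures / (24 * 7):.1f} weeks)")
    # -> ~1 failure every 500 hours (~every 3.0 weeks)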

Re:Data redundancy REQUIRED (4, Insightful)

OrangeSpyderMan (589635) | more than 8 years ago | (#13874407)

Agreed. We have around 50 TB of data in one of our datacenters and it's great, but the number of disks that fail when you have to restart the systems (SAN fabric firmware install) is just scary. Even on the system disks of the Wintel servers (around 400), which are DAS, around 10% fail on datacenter powerdowns. That's where you pray that statistics are kind and you have no more failures on any one box than you have hot spares + tolerance :-) Last time, one server didn't make it back up because of this... though strictly speaking it was actually the PSUs that let go, it would appear.

Oooo... (1)

temojen (678985) | more than 8 years ago | (#13874317)

I was going to suggest Reiser4 on LVM over a bunch of 4-disk RAID-5 arrays, but it seems that his definition of massive is more massive than mine.

NFS on Reiser4 on RAID-5 on AoE (multipath) on LVM on RAID-5?

What kind of availability do you need? Does all data need to be up all the time (like a bank/telco), or most of the data need to be up all the time (like google), or all the data need to be up most of the time (like a movie studio)?

I just have to ask... (5, Informative)

jcdick1 (254644) | more than 8 years ago | (#13874319)

...what your management was thinking. I mean, I can't imagine a storage requirement that large being built in a distributed model that would beat an EMC or Hitachi or IBM or whomever SAN solution on price per GB. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. It's cheaper per GB than distributed local storage ever could be.

Re:I just have to ask... (3, Funny)

temojen (678985) | more than 8 years ago | (#13874415)

With a project this large, they may be able to do it in-house and still take advantage of economies of scale. They can buy HDDs, motherboards, rackmount cases, etc. by the pallet or container load and temporarily up-hire some of their part-timers to do the assembly.

With a network-bootable BIOS, the nodes could just be plugged in, install an image off a server, and then customize themselves based on their MAC.

Yup, time to pick up the phone. (5, Insightful)

Kadin2048 (468275) | more than 8 years ago | (#13874473)

Exactly. This seems like somebody is trying to figure out a way to do something in-house which really ought to be left to either an outside contractor, or at least set up as a turnkey solution by a consultant. Given that he knows little enough about it that he's asking for help on Slashdot, I think this is yet another problem best solved using the telephone and a fat checkbook, and enough negotiating skills to convince management to pony up the cash up front instead of piddling it out over time on an in-house solution that's going to be a hole into which money and time are poured.

I know people get tired of hearing "call IBM" as a solution to these questions, but in general, if you have some massive IT infrastructure development task and are so lost on it that you're asking the /. crowd for help, calling in professionals to take over for you probably isn't a bad idea.

It's not even a question of whether you could do it in-house or not; given enough resources you probably could. It comes down to why you want to do something like this yourselves instead of finding people who do it all the time, week after week, for a living, telling them what you want, getting a price quote, and getting it done. Sure seems like a better way to go to me.

Re:I just have to ask... (1)

lysander (31017) | more than 8 years ago | (#13874497)

I entirely agree with the parent post. If there were an easy, cheap way to do this with the required redundancy and speed you need, the big SAN companies would not be around.

If there is more data than disks you can shove in a computer, data that your company considers important: buy a SAN. If you have speed requirements, you'll need caching: buy a SAN. If you haven't worked with anything this big before, are you willing to risk your company's data while you learn the ropes?

If you're still intent on doing this, at least look at how the SAN companies pull it off.

Check out Isilon (1)

elan (171883) | more than 8 years ago | (#13874331)

We liked what we saw when we were looking for a similar thing. It's not cheap, but it's much cheaper than comparable stuff, and it runs well. We had an eval cluster and they worked like a champ.

Re:Check out Isilon (0)

Anonymous Coward | more than 8 years ago | (#13874496)

I would second this. Their clustered file system is very robust and fault tolerant, scales well, and is much more cost effective than some first-tier solution providers' systems.

If you think that you can do this with off-the-shelf commodity hardware and maintain your sanity, more power to you! This is a non-trivial task. Just try building a 24x7-available 100 TB system out of Fry's disks without having a full-time support person managing it...

15 zeros are no bytes at all (1)

caluml (551744) | more than 8 years ago | (#13874346)

15-zeros-is-a-lot-of-bytes

15 zeros is no bytes at all... :)

Re:15 zeros are no bytes at all (0)

Anonymous Coward | more than 8 years ago | (#13874420)

Actually it's 15 bytes... don't diss the big 0!!*

* Unless of course we are in pure binary in which case it could be 15 bits... take that pedantic reply boy!

Re:15 zeros are no bytes at all (1)

rk (6314) | more than 8 years ago | (#13874426)

Oh, sure it is! It's almost two!

Have you looked at OneFS (0)

Anonymous Coward | more than 8 years ago | (#13874359)

You really ought to look at Isilon Systems' OneFS solutions for this.

This problem is a very real one in film production, and we are moving in this direction for future productions after numerous facilities got back to me with rave reviews of the speed, scalability and reliability of these units.

The nice thing is that they do scale: as the number of inodes grows, so does the performance of the cluster, and as you add storage you add bandwidth to your core filesystem. They are a great option for just this type of application.

The main issue you may run into at that size is having enough CPU horsepower to handle all the potential requests. This is where conventional network-connected filesystem appliance solutions fall flat on their face. These seem not to have that issue at all. The only drawback is the obscene price Cisco charges for the InfiniBand switches they use as the backplane for the clusters. If you look at this as a potential solution, you may want to pressure them to find another InfiniBand switch provider instead of paying the extortionate pricing Cisco has invented for their 48-port units.

Stress the importance .... (3, Insightful)

gstoddart (321705) | more than 8 years ago | (#13874360)

I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB ... Based on my past experience and research, the commercial offerings for such a solution becomes cost prohibitive, and the budget for the solution is fairly small.

Unfortunately, I should think needing a solution which can scale up to a Petabyte (!) of disk-space and a "fairly small" budget are at odds with one another.

Maybe you need to make a stronger case to someone that if such a mammoth storage system is required, it needs to be a higher priority item with better funding?

Heck, the loss of such large volumes of data would be devastating (I assume it's not your pr0n collection) to any organization. Building it on the cheap and having no backup (*)/redundancy systems would be just waiting to lose the whole thing.

(*) I truly have no idea how one backs up a petabyte

Re:Stress the importance .... (1)

dustinbarbour (721795) | more than 8 years ago | (#13874480)

With another petabyte of storage.. DUH!

Used by gov't labs and universities.... (0)

Anonymous Coward | more than 8 years ago | (#13874370)

You may want to check out Panasas. From what I can discern, their gear is used by some high-end entities, and it has the advantage of being high performance. You probably won't find anything this good for reading/writing data from your app server to your data storage network.

BTW one of the founders wrote the original paper on RAID drives...

www.panasas.com

IBRIX (1, Insightful)

Anonymous Coward | more than 8 years ago | (#13874374)

Check out the IBRIX Clustered Filesystem. http://www.ibrix.com/ [ibrix.com]

For the most part (4, Insightful)

retinaburn (218226) | more than 8 years ago | (#13874376)

the reason you can't find a cheap way to do this is because it just isn't cheap.

I would look at some lessons learned from Google. If you decide to go with some sort of homebrew solution based on a bunch of standard consumer disks, you will run into other problems besides money. The more disks you have running, the more failures you will encounter. So any system you set up has to be able to have drives fail all day and not require human intervention to stay up and running (unless you can get humans for cheap too).

Homebrew HSM (1)

benow (671946) | more than 8 years ago | (#13874384)

A buddy and I were talking of something similar. We were looking for expandable, large-scale storage with good performance and cost, and high-level facilitation of data management tasks (metadata management, media shuffling, accommodating technology advancements, etc). We decided on single nodes, each controlling data storage to fit the data use case. Each node presents a span over RAM, JBOD, optical and tape in increasing size: a RAM fs for that which is always in use, a couple of TBs of JBOD/IDE RAID holding the more frequently used data, everything in an optical jukebox, and safety backups to a tape jukebox. Ideally the entire tower could be modular and auto-discovering, so when more long-term storage is required a jukebox module could be added and the system could autodiscover capacities, alignment, etc.

Data presentation would be done via a cascading HSM aware of each of the components and with optimized use cases (i.e. burning to blank optical when sufficient new data arrives on the RAID, maintaining indexing for each of the modules, etc). On top of the cascading storage would be a metadata vfs/presentation layer, to allow data navigation at a high level (cd /video/nature/banff/) via NFS or an HTTP web app over gig eth. A delegation/peering layer could allow for grouping such towers to grow in size. Though quite theoretical, and the software being somewhat tricky, it could scale to the tens-of-TB-per-node size easily. We'd originally thought of an open hardware design with modules being added by the community and open HSM software supporting it. Neither of us has yet had the time to do much more than basic dev, but it's planned.

AtomChip Corp (1)

OctoberSky (888619) | more than 8 years ago | (#13874385)

Why not just wait for those AtomChip laptops?

I mean, yeah, the portability is not what the customer wanted, but the 6.8GHz CPU, coupled with the 1TB of RAM, should easily make up for the limited 2TB of HDD space.

Courtesy Link [slashdot.org] .

raid-nfs-raid (1)

eatjello (767686) | more than 8 years ago | (#13874389)

How about this: set up a bunch of mini-ITX or similar low-power machines with 9 250GB PATA drives (and a single smaller drive for the OS install), then use software RAID under Linux to configure them as a single RAID-5 array (roughly 2TB) and set up an NFS server on the machine to share the array. Then set up a couple of controllers (for redundancy) that mount all the NFS shares and turn them into a linear or striped array (or RAID-1 for added data security). The controllers would then be able to present a share using NFS, SMB, or whatever you need, with a capacity that scales seamlessly... all you have to do is add more 2TB nodes to it. Obviously the details are flexible, like making the nodes RAID-1 instead of RAID-5 (and dropping them to 8 250GB drives for a round 1TB per node), but this should give you exactly what you're looking for. Your cost per node would simply be the mini-ITX mobo, memory, 4-channel IDE controller card, and hard drives, and cost per controller would be practically any computer with high-speed network interfaces (I'd recommend something with 2x Gb LAN at least, maybe Gb out and 10Gb to the HDD farm).
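To put rough numbers on that layout (a sketch using only the figures in the comment above; real usable space would be lower once you subtract filesystem overhead and hot spares):

    # How many ~2TB RAID-5 nodes the proposed layout needs at each scale.
    drives_per_node = 9     # 9 x 250GB PATA drives per node, per the comment
    drive_gb = 250

    node_usable_gb = (drives_per_node - 1) * drive_gb  # RAID-5 gives up one drive to parity
    print(f"usable per node: ~{node_usable_gb / 1000:.1f} TB")       # ~2.0 TB

    for target_tb in (25, 1000):                        # 25TB start, 1PB target
        nodes = -(-(target_tb * 1000) // node_usable_gb)  # ceiling division
        print(f"{target_tb} TB needs ~{nodes} nodes "
              f"({nodes * drives_per_node} data drives, before mirroring across nodes)")

That comes out to roughly 13 such nodes for the initial 25TB and about 500 nodes (plus controllers, switches and spares) at the full petabyte, so per-node cost and failure handling dominate long before you get there.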

Do It Right (5, Insightful)

moehoward (668736) | more than 8 years ago | (#13874402)


Look. Everyone wants a Lamborghini for the price of a Chevy. Cute. Yawn. Half of the Ask Slashdot questions are from people who didn't find what they wanted at Walmart. Despite the amazing Slashdot advice, Ask Slashdot answers have somehow failed to put EMC, IBM, HP, etc. out of business. There is no free lunch.

Just call EMC, get a rep out, and give the paperwork to your boss. Do it today instead of 5 months from now and you will have a much better holiday season.

Note to moderators and other finger pointers: I did not say to BUY from EMC, I just said to show his boss how and why to do things the right way. It does not hurt to get quotes from the big vendors, mainly because the quote also comes with good, solid info that you can share with the PHBs. Despite what you think about "evil" tech sales persons and sales engineers, you actually can learn from them.

Re:Do It Right (1)

Tankko (911999) | more than 8 years ago | (#13874453)

This is damn good advice! I always try to get bids from people even when I have very little interest in using them (they have changed my mind a few times). It has always provided me with invaluable information.

IBRIX (3, Informative)

Wells2k (107114) | more than 8 years ago | (#13874414)

You may want to take a look at IBRIX [ibrix.com] systems. They do a pretty robust parallel file system that has redundancy and failover.

Just a Thought... (0)

Anonymous Coward | more than 8 years ago | (#13874428)

And this is a little insane, but you MIGHT look at a combination of Oracle offerings. Specifically, Collaboration Suite for the end-user presentation (which gives you a web interface, and FTP/FTPS, and WebDAV, and they have a little desktop app that'll let you mount the WebDAV volume like a traditional SMB share, drive letter and all) and ASM for the disk management. ASM is an Oracle database (that can be a bunch of RAC instances for redundancy) that can take a bunch of disk, doesn't even need to be the same type or anything, just as long as you can present it to the ASM server somehow like NFS, and creates a sort of virtual data pool that could then be used for another, regular Oracle database like the one used as a datastore for the Collaboration Suite instance above.

Yes, I realize this is probably needlessly complicated, and since we don't have very specific information about what the disk is actually FOR it's also likely to be inappropriate for some other reason, but it could work, and Oracle Collaboration Suite (which is the only part you'd actually have to license, I believe, the rest just sort of comes with it) is licensed on a per-user basis. I'm not sure what the minimum number of users is, but for only the files-based part we're talking about here, I think the list price is something like $15/user.

UnionFS (1)

MaskedSlacker (911878) | more than 8 years ago | (#13874429)

UnionFS ought to do the trick.

Are you working for Facebook? (1)

fitchmicah (920679) | more than 8 years ago | (#13874430)

I hope this is for facebook! Maybe to expand the new photo galleries?

www.pillardata.com (1)

bvoth (771322) | more than 8 years ago | (#13874436)

Check it out! I have no experience with this company, but it looks very cool.

you dont need to go outside (0, Offtopic)

deervark (904586) | more than 8 years ago | (#13874447)

seriously, you are an agorophobe and you think that the outside seems nice but you just have so much trouble actually getting out there. try to get some xanax or klonopin and slowly work your way through it. i am a male, i am 49, and i am gay. here is my phone number (454) 867-5309 if you need some support or phone therapy or agorophobe phone sex. i like chutney spread all over my face and taint. if you let me come over sometime (Im scared too) then maybe i can take your hand and slowly lead you into the back yard and then maybe we can hug and kiss and nap like little princesses. i love you, seriously. we are all agorophobic here on /.

Nice Engrish (1)

suckass (169442) | more than 8 years ago | (#13874450)

After reading that post and seeing the level of your English, you should probably let someone else handle a project this complex.

storagetek... (1)

blackcoot (124938) | more than 8 years ago | (#13874451)

...can probably solve this problem for you. Whether or not they can do so on the sort of budget you're willing to spend is a totally different story, however...

Don't forget Coda.. (1)

Sir Pallas (696783) | more than 8 years ago | (#13874455)

Coda [cmu.edu] works even when nodes disconnect, for instance with network outages or mobile computing. Plus, there is a Windows client, if that's the way your shop swings.

Striped (1)

zephris (925151) | more than 8 years ago | (#13874462)

Well, I don't know a lot about this kind of thing, but if XP doesn't have an upper drive size limit, couldn't you just throw it all in a big server case, throw in some SCSI drives, stripe them (I *think* that's what it's called), and have it appear as one big volume?

what about JFS and ATAoE? (1)

imsmith (239784) | more than 8 years ago | (#13874471)

I don't know what the limits of JFS are, but it sounds like a nice set up.

This article in Linux Journal ( http://www.linuxjournal.com/article/8149 [linuxjournal.com] ) talks about doing just that. The hardware costs ring up and don't scale as you get into your capacity ranges unless you can get a deal buying bulk HDDs - something like $10K per 7.5 terabytes

Depends on the content (1)

behrman (51554) | more than 8 years ago | (#13874495)

If you're talking about building some sort of archival-type repository (like, keeping years worth of satellite imagery, for example), then you should probably look at the Centera from EMC. They scale into the petabyte range.

Providing you can find some sort of filesystem to support it (good luck), you could stash multiple arrays behind your host, or you could put in a TagmaStore from HDS with several arrays behind it. I'm not entirely sure how large the Tagma will scale, but the number 32 petabytes sticks in my head from a whitepaper somewhere.

I'd also question the perceived need to create one big filesystem to hold your whole petabyte of data. I'm a storage geek for a living, and I've found that usually, after you start drilling into the application requirements, you find out that the app folks are either trying to use a data warehouse solution that's too small for the environment, or they're simply not aware of other alternatives available in their chosen app. No offense, but it sounds like you've had snowshoes strapped to your feet and been directed to take a stroll through a minefield.

Watch out for the file system too (0)

Anonymous Coward | more than 8 years ago | (#13874510)

You've done your research so you may already know this, but it's worth mentioning that ext3 and ReiserFS will not cut it for your system. Most file systems (not to be confused with storage sub-systems) have a maximum volume size.

http://en.wikipedia.org/wiki/Comparison_of_file_systems [wikipedia.org]

Also - consider the problem of expanding any array.

As for how to do it, if I were to go super-cheap (node-side assembly sketched below):
  - Use software RAID-5 on each node (2 sub-systems, 8 disks each, hot-swap 200GB SATA; RAID-0 them together and export the result as NBD)
  - Use NBD to concatenate each node into a single block device
    + you DO NOT want to rebuild any parity info across nodes with this poor-man's setup
  - Use as many GigE links as you can to interconnect the nodes (limit the use of switches)

Note:
  - This setup will write data at several hundred Mbit/s, but not at GBit speed; it will read at close to GBit. I have this setup at home (1 node) and I'm impressed with software RAID and SATA.
  - Contact the maintainers of any userland/kernel stuff you'll be needing and ask if they support the sizes you're looking for. I ran into trouble with dm-crypt (unsigned 32-bit integer overflow) relating to file system size and mode of operation. All fixable.
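A minimal sketch of the node-side assembly described above (assumptions: Linux with mdadm and an old-style nbd-server; the device names and export port are hypothetical, and a real deployment would add hot spares, monitoring and boot-time persistence):

    # Build two 8-disk RAID-5 sub-arrays, stripe them together, and export the
    # result over NBD so a head node can concatenate all nodes into one volume.
    # Run as root; with 200GB disks this yields roughly 2 x 7 x 200GB = 2.8TB usable.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["mdadm", "--create", "/dev/md0", "--level=5", "--raid-devices=8"]
        + [f"/dev/sd{c}" for c in "bcdefghi"])
    run(["mdadm", "--create", "/dev/md1", "--level=5", "--raid-devices=8"]
        + [f"/dev/sd{c}" for c in "jklmnopq"])

    # RAID-0 the two sub-arrays into one big block device.
    run(["mdadm", "--create", "/dev/md2", "--level=0", "--raid-devices=2",
         "/dev/md0", "/dev/md1"])

    # Old-style export: nbd-server <port> <device>; newer releases use a config file.
    run(["nbd-server", "2000", "/dev/md2"])

The head node would then import each node's export with nbd-client and concatenate them, much as in the NBD/LVM sketch earlier in the thread.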

Just wait 5 years ... (3, Interesting)

tomhudson (43916) | more than 8 years ago | (#13874513)

Hard disk space is doubling every 6 months - wait 5 years and you'll be able to buy a 25TB disk for $125.00.

A single raid50 of them will then give you your petabyte of storage, for around $6,000.

iSCSI (1)

wasabii (693236) | more than 8 years ago | (#13874525)

I have been searching for a solution for this as well. My current thought is that iSCSI is most appropriate. I plan to set up a number of small Linux boxes, each with as much storage space as a single system can accommodate, and MD them so that each system is itself redundant. Each system will export an iSCSI target of the MD device. A single large node will then mount all the iSCSI devices, add another layer of RAID (so that a single node failure doesn't result in downtime), and export the file system as NFS to clients. I plan to just start with XFS for the on-disk structures, with an out-of-band journal.