
Data Deduplication Comparative Review

samzenpus posted more than 3 years ago | from the a-little-order-please dept.

Data Storage 195

snydeq writes "InfoWorld's Keith Schultz provides an in-depth comparative review of four data deduplication appliances to vet how well the technology stacks up against the rising glut of information in today's datacenters. 'Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year,' Schultz writes. 'If admins can increase storage usage 20, 40, or 60 percent by removing duplicate data, that allows current storage investments to go that much further.' Under review are dedupe boxes from FalconStor, NetApp, and SpectraLogic."


Second post (2, Funny)

Anonymous Coward | more than 3 years ago | (#33594064)

Same as the first.

Re:Second post (-1, Redundant)

Anonymous Coward | more than 3 years ago | (#33594144)

Same as the first.

(Looks like Slashdot already employs such sophistication: "This exact comment has already been posted. Try to be more original...")

Re:Second post (0)

Anonymous Coward | more than 3 years ago | (#33595986)

I'm Henry the eighth I am
Henry the eighth I am, I am...

Wrong layer (4, Insightful)

Hatta (162192) | more than 3 years ago | (#33594088)

Filesystems should be doing this.

Re:Wrong layer (1)

suutar (1860506) | more than 3 years ago | (#33594168)

Actually, this feature is a recent addition to ZFS, and it's the main reason I'm interested in putting ZFS on my file server. I just have to get around to picking up another drive to serve as the backup first.

Re:Wrong layer (2, Informative)

KiloByte (825081) | more than 3 years ago | (#33594310)

It's not fully automatic, I assume? Since that would cause a major slowdown.

For manual dedupe, btrfs can do that as well, and part of the vserver patchset (not related to its main functionality) includes a hack that works for most Unix filesystems.

Re:Wrong layer (4, Informative)

phantomcircuit (938963) | more than 3 years ago | (#33594400)

It is fully automatic, and it's not that much of a slowdown. The reduced I/O might actually provide a performance boost.

Re:Wrong layer (5, Informative)

suutar (1860506) | more than 3 years ago | (#33594404)

Actually, it is automatic. ZFS already assumes you have a multithreaded OS running on more cpu than you probably need (e.g. Solaris), so it's already doing checksums (up to and including SHA256) for each data block in the filesystem. Comparing checksums (and optionally entire datablocks) to determine what blocks are duplicates isn't that much extra work at that point, although for deduplication you probably want to use a beefier checksum than you might choose otherwise, so there is some increase in work. http://blogs.sun.com/bonwick/entry/zfs_dedup [sun.com] has some more information on it. Getting it onto my linux box, now.. there's the rub. userspace ZFS exists, but I've only seen one pointer to a patch for it that includes dedup, and I haven't heard any stability reports on it yet.
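To make the "optionally entire datablocks" part concrete: matching on the checksum alone trusts the hash never to collide, while a verifying deduplicator does a byte-for-byte compare on every checksum hit before sharing the block (ZFS exposes this as its verify setting). A rough sketch of that decision, with invented names, not ZFS's actual code:

    import hashlib

    def dedup_lookup(block, table, verify=True):
        """Return (digest, duplicate_found); `table` maps digest -> the block stored under it."""
        digest = hashlib.sha256(block).digest()
        stored = table.get(digest)
        if stored is None:
            table[digest] = block       # first time this pattern is seen: store it
            return digest, False
        if verify and stored != block:
            # A genuine SHA-256 collision is astronomically unlikely, but a verifying
            # deduplicator refuses to share the block rather than corrupt data.
            raise ValueError("checksum collision: contents differ, do not deduplicate")
        return digest, True             # duplicate: the caller records only the reference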

Re:Wrong layer (2, Interesting)

hoggoth (414195) | more than 3 years ago | (#33594980)

> Getting it onto my linux box, now.. there's the rub

So don't put it on Linux. Set up a Solaris or Nexenta box. I just did it. I installed a Nexenta server with 1TB of mirrored, checksummed storage in 15 minutes. I wrote it up here http://petertheobald.blogspot.com/ [blogspot.com] - it was extremely easy. Now all of my computers back up to the Nexenta server. All of my media is on it. I have daily snapshots of everything at almost no cost in disk storage.

Re:Wrong layer (2, Insightful)

h4rr4r (612664) | more than 3 years ago | (#33595174)

Open Solaris is dead, and there are kernel bugs in the latest version, so good luck with that. I looked at doing it at one time and due to fears about Opensolaris I stayed away. I consider myself lucky.

Re:Wrong layer (0, Troll)

BitZtream (692029) | more than 3 years ago | (#33595792)

Open Solaris is dead, and there are kernel bugs in the latest version, so good luck with that.

OpenSolaris is dead, thank god; it was a shining example of just how well OSS doesn't work for everything. It was a rather stupid idea in the first place, but it did manage to put a kink in a few other companies' plans, so good for them. But yes, it's dead.

As for kernel bugs ... welp, if you think the kernel of your OS doesn't have bugs, you're an idiot. Second, you have the source, so fix OpenSolaris yourself ... that's the response we see out of so many OSS zealots here, so I figured I'd throw in my two cents.

Google luck on finding solutions to your problems that are based on logic and rational thinking, I doubt you can pull it off judging by your statements so far.

Re:Wrong layer (1)

Bigjeff5 (1143585) | more than 3 years ago | (#33596140)

Google luck on finding solutions to your problems that are based on logic and rational thinking, I doubt you can pull it off judging by your statements so far.

I dunno, I found it pretty easy. I got some interesting results too:

Critical Thinking - HowTo.Lifehack [google.com]

Virgo free weekly horoscope [google.com]

Actually that's pretty funny.

Maybe you're right, maybe it is hard to google luck on finding solutions to your problems that are based on logic and rational thinking.

Re:Wrong layer (1)

suutar (1860506) | more than 3 years ago | (#33595458)

Sweet, thanks for the pointer. I was also concerned about the death of OpenSolaris but it sounds like Nexenta may be just what I want.

Re:Wrong layer (2, Insightful)

drsmithy (35869) | more than 3 years ago | (#33596124)

Sweet, thanks for the pointer. I was also concerned about the death of OpenSolaris but it sounds like Nexenta may be just what I want.

Nexenta is built off Open Solaris and is, therefore, also dead - though it may take longer for the thrashing to stop.

Re:Wrong layer (2, Informative)

hoytak (1148181) | more than 3 years ago | (#33595142)

The latest stable version of zfs-fuse, 0.6.9, includes pool version 23 which has dedup support. Haven't tried it out yet, though.

http://zfs-fuse.net/releases/0.6.9 [zfs-fuse.net]

Re:Wrong layer (1)

h4rr4r (612664) | more than 3 years ago | (#33595322)

I thought everything that used FUSE was slow as hell; is this not true?

Re:Wrong layer (2, Interesting)

bersl2 (689221) | more than 3 years ago | (#33594184)

No, deduplication has quite a bit of policy attached to it. Sometimes you want multiple independent copies of a file (well, maybe not in a data center, but why should the filesystem know that?). The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

Re:Wrong layer (2, Interesting)

PCM2 (4486) | more than 3 years ago | (#33594236)

The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion.

Re:Wrong layer (2, Interesting)

dougmc (70836) | more than 3 years ago | (#33594360)

But if that's the kind of deduplication you're talking about, does it really make sense to try to do it at the block level, as these boxes seem to be doing? Seems like you'd want to analyze files or databases in a more intelligent fashion

This isn't a new thing -- it's a tried and true backup strategy, and it's quite effective at making your backup tapes go further. It increases the complexity of the backup setup, but it's mostly transparent to the user beyond the saved space.

As for doing it at the file level rather than the block level, yes, that makes sense, but the block level does too. Think of a massive database file where only a few rows in a table changed, or a massive log file that only had some stuff appended to the end.
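The appended-log case is worth seeing in numbers: with fixed-size blocks, every block of yesterday's copy shows up unchanged in today's copy, so only the new tail costs anything. A toy illustration (block hashing only, not any particular product):

    import hashlib

    BLOCK = 4096

    def digests(data):
        return [hashlib.sha256(data[i:i + BLOCK]).digest() for i in range(0, len(data), BLOCK)]

    yesterday = b"x" * (100 * BLOCK)                   # a 100-block log file
    today = yesterday + b"one more log entry\n" * 20   # the same file with a little appended

    seen = set(digests(yesterday))
    new = [d for d in digests(today) if d not in seen]
    print(len(new), "new block(s) out of", len(digests(today)))   # 1 new block out of 101

Note that this only works so cleanly because the append lands past the old blocks; an insertion in the middle of a file shifts every later block, which is the problem the variable-length and stream-oriented schemes discussed further down are meant to solve.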

Re:Wrong layer (1, Interesting)

Anonymous Coward | more than 3 years ago | (#33594624)

This technology is not just deduplicated backups; this is deduplicated STORAGE. Big difference. Combine it with a SAN that has thin provisioning and automatic on-the-fly tiering between cache, SSD, FC, and ATA disks and you can have a decent, cost-effective setup. Oddly, the cost per GB is about the same, but you buy less, get fast and slow disks, and it also has a lot of integrated DR features. I'll know how it all works in a few months; we are about two months away from a rolling upgrade of several Clariion CX3-80's to CX4's. It looked really good in the lab ;)

Although we will have to increase our MPLS bandwidth, we will also be getting rid of tapes. I know people claim tapes are cheap, but even with a great backup software setup, as automated as possible, you still have people on the ground loading and unloading, you pay for the Iron Mountain or Recall trucks, and you are paying dearly for the tape hardware. We have older StorageTek SL500's of various sizes. Those bitches can cost like 250K with a support contract, and you are pushing data over your network, or best case over your fiber network, every night. Need to do a recovery? Call Iron Mountain and wait a few hours for the tapes to arrive. blah..

I guess every situation is different, but for us, getting rid of tape, retiring our older CX3-80's, and migrating to a CX4 with more features was a sound decision over keeping our existing setup. The ROI is less than 2 years and we can use the additional features immediately.

Kind of unrelated, but I'd like to get rid of FC and move to 10Gb iSCSI or FCoE; I guess I'm happy with the intermediate steps for now.

Re:Wrong layer (1)

hawguy (1600213) | more than 3 years ago | (#33594744)

No, deduplication has quite a bit of policy attached to it. Sometimes you want multiple independent copies of a file (well, maybe not in a data center, but why should the filesystem know that?). The filesystem should store the data it's told to; leave the deduplication to higher layers of a system.

Why do you want multiple independent copies of a file? If you're doing it because your disk storage system is so flakey that you aren't sure you can read the file, deduplication policy is not what you need -- you need a more reliable storage system and backups.

With most disks there's a fine line between throwing random unrecoverable read errors and failing completely, so there's little value in having multiple copies of the same file on the same physical disk. (And most storage systems will have automatically replaced the drive with a hot spare once it started throwing too many soft read errors.)

Which filesystem should be doing this??? (2, Insightful)

DanDD (1857066) | more than 3 years ago | (#33594300)

Filesystems should be doing this.

The one on your desktop machine, or the primary NAS storage that you access shared data from, or the backup server that ends up getting it all anyway? You see, this is a shared database problem. If your local filesystem does this, then it has to 'share' knowledge of all the unique blocklets with every other server/filesystem that wishes to share in this compressed file space. De-duplication is a means of compression that works across many filesystems - or at least it can be, if it is properly implemented.

Re:Which filesystem should be doing this??? (1)

DeadDecoy (877617) | more than 3 years ago | (#33594498)

Plus if you want speed and safety, having redundant, mirrored copies of data can be useful.

Re:Which filesystem should be doing this??? (1)

Vancorps (746090) | more than 3 years ago | (#33594574)

Rarely is it useful on the same local storage. Keeping live copies offsite or in separate hardware is a good strategy but on the same hardware is just wasteful.

Re:Which filesystem should be doing this??? (1)

Eristone (146133) | more than 3 years ago | (#33595940)

Ah, so you want to go to other hardware to restore a file that you have a snapshot of on your local hardware? And that fileset happens to be oh say a few hundred gigabytes. Out of curiosity, do you manage production fileservers with end users that are able to do stupid things?

Re:Which filesystem should be doing this??? (1)

icebike (68054) | more than 3 years ago | (#33594510)

Well in the end, does not the filesystem running on the device end up controlling the actual reads and writes regardless of whether the file is shared across the network or across the world?

My take is that there is not much to justify the claim that this should be in the filesystem vs. the hardware. If you don't want to de-duplicate some data (for whatever reason), then you don't put it on that type of storage.

But it seems to me that the hardware is a perfectly reasonable layer to do this at. It eliminates several potential points of failure (FS version changes, FS bugs, memory failures, bus failures, end-user fiddling, etc.).

It's OS-insensitive, and when you replace the server OS or hardware, the search for drivers is eliminated. Obsolescence is defined by when the NAS fails to meet your needs, not by when the developer moves on to something new or the company declines to release new drivers for the next version of your OS.

As long as I can get out exactly the same data I put in, why would I want to do this at the FS layer? Why would I care, as long as it was reliable?

I'm aware there are traps, such as having to make minor unique changes in thousands of files, forcing the system to un-de-duplicate many megabytes of data, potentially overflowing the available storage. But that's equally possible in an FS-based solution as in a hardware-based one.

Re:Wrong layer (1)

JWSmythe (446288) | more than 3 years ago | (#33594306)

    Wouldn't a compressed filesystem already do this? They don't just get the compression from nowhere; they eliminate duplicate blocks and empty space.

    Pick your platform. I know in both Linux and Windows, there have been compressed filesystems for quite some time.

    It doesn't really negate the need for good housekeeping routines, nor good programming. Do you really want 100 copies of record X, or would one suffice? Sadly, people tend to think that they have unlimited space, until the time comes when they've run out of space. "Oh shit, what do we do now!" is way too common an occurrence.

Re:Wrong layer (1)

icebike (68054) | more than 3 years ago | (#33594566)

True, compression does a lot of this.

But De-duplication does that and goes one step further.

Multiple copies of the same block of data (either entire files or portions of files) that match even if stored in separate directories can be replaced by a pointer to a single copy of that file or block.

How many times would, say, the boilerplate at the bottom of a lawyer/doctor/accountant's file systems appear verbatim in every single document filed in every single directory?

A proper system might allow you to have just one of these.

Re:Wrong layer (1)

JWSmythe (446288) | more than 3 years ago | (#33595460)

How many times would, say, the boilerplate at the bottom of a lawyer/doctor/accountant's file systems appear verbatim in every single document filed in every single directory?

    I won't argue about that. I'm still shocked to see the bad housekeeping practices on various servers I've worked on. No, really, you don't need site_old, site_back, site_backup, site_backup_1988, and site_backup_y2k. Has anyone even considered getting rid of those? Nope. They're kept "just in case". What "just in case"? Just in case you want to roll back to a 20-year-old copy of your data?

Re:Wrong layer (1)

dgatwood (11270) | more than 3 years ago | (#33594602)

Yes and no. Compression generally does involve reduction of duplication of information in one form or another, but does so at a finer level of granularity. With a compressed filesystem, you'll generally see compression of the data within a block, maybe across multiple blocks to some degree, but for the most part, you'd expect the lookup tables to be most efficient at compressing when they are employed on a per-file basis. The more data that shares a single compression table, the closer your input gets to being essentially random, and the lower your overall compression rate typically is.

Deduplication, as I understand it (and I've read very little about this, so I could be misunderstanding), takes this a step further, taking advantage of the fact that multiple copies and/or multiple generations of a given file often exist in storage, and that when compressing two files results in very similar or identical compression tables, you can easily throw away one copy and express the other copy relative to the first.

Although this is conceptually related to the way many compression schemes work (Huffman coding and LZW in particular), the mechanism for doing so must inherently be a lot smarter. Arbitrarily combining random or contiguous chunks of such large data sets would result in expansion, not compression. Thus, the deduplication algorithms use various techniques to determine how similar two files are before deciding to try to express one in terms of the other.

Re:Wrong layer (1)

icebike (68054) | more than 3 years ago | (#33594746)

Thus, the deduplication algorithms use various techniques to determine how similar two files are before deciding to try to express one in terms of the other.

But I understood de-duplication to be not concerned with files at all. Simply blocks of data on the device.

As such, it might de-duplicate the boilerplate out of a couple hundred thousand Word documents scattered across many different directories.

Is that not the case? Are they not yet that sophisticated?

Re:Wrong layer (3, Interesting)

dgatwood (11270) | more than 3 years ago | (#33595078)

I think it depends on which scheme you're talking about.

Basic de-duplication techniques might focus only on blocks being identical. That would work for eliminating actual duplicated files, but would be nearly useless for eliminating portions of files unless those files happen to be block-structured themselves (e.g. two disk images that contain mostly the same files at mostly the same offsets).

De-duplicating the boilerplate content in two Word documents, however, requires not only discovering that the content is the same, but also dealing with the fact that the content in question likely spans multiple blocks, and more to the point, dealing with the fact that the content will almost always span those blocks differently in different files. Thus, I would expect the better de-duplication schemes to treat files as glorified streams, and to de-duplicate stream fragments rather than operating at the block level. Block level de-duplication is at best a good start.
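One common way to get those stream fragments is content-defined chunking: slide a small window over the byte stream and cut a chunk wherever a hash of the window matches a chosen bit pattern, so boundaries move with the content rather than sitting at fixed offsets. A much-simplified sketch with invented names (a real implementation would use a proper rolling hash such as a Rabin fingerprint and enforce minimum/maximum chunk sizes):

    import hashlib, random

    WINDOW = 16            # bytes of context fed to the boundary test
    MASK = (1 << 12) - 1   # boundary when the low 12 bits are zero (~4 KB average chunks)

    def chunks(data):
        out, start = [], 0
        for i in range(WINDOW, len(data)):
            h = int.from_bytes(hashlib.sha256(data[i - WINDOW:i]).digest()[:4], "big")
            if h & MASK == 0:          # the content decides where the cut falls
                out.append(data[start:i])
                start = i
        out.append(data[start:])
        return out

    random.seed(0)
    base = bytes(random.randrange(256) for _ in range(60_000))
    edited = base[:1000] + b"INSERTED TEXT" + base[1000:]   # an insertion shifts all later bytes

    old = {hashlib.sha256(c).digest() for c in chunks(base)}
    changed = [c for c in chunks(edited) if hashlib.sha256(c).digest() not in old]
    # Chunks downstream of the edit re-synchronize because boundaries follow content,
    # so only the chunk(s) around the insertion need to be stored again.
    print(len(changed), "changed chunk(s),", sum(len(c) for c in changed), "bytes to re-store")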

What de-duplication should ideally not be concerned with (and I think this is what you are asking about) are the actual names of the files or where they came from. That information is a good starting point for rapidly de-duplicating the low hanging fruit (identical files, multiple versions of a single file, etc.), but that doesn't mean that the de-duplication software should necessarily limit itself to files with the same name or whatever.

Does that answer the question?

Re:Wrong layer (1)

drsmithy (35869) | more than 3 years ago | (#33596210)

But I understood de-duplication to be not concerned with files at all. Simply blocks of data on the device.

It depends.

Simplistic dedupe schemes only operate at the file level. More advanced schemes operate at the block/cluster level.

Re:Wrong layer (2, Interesting)

drsmithy (35869) | more than 3 years ago | (#33596192)

Wouldn't a compressed filesystem already do this? They don't just get the compression from nowhere; they eliminate duplicate blocks and empty space.

No, because compression is limited to a single dataset. Deduplication can act across multiple datasets (assuming they're all on the same underlying block device).

Consider an example with 4 identical 10MB files in 4 different locations on a drive, each of which can be compressed at 50%.

"Logical" space used is 40MB.
Using compression, they will fit into 20MB.
Using dedupe, they will fit somewhere in between 5MB and 10MB.
Using dedupe and compression, they will fit into ~5MB (probably a bit less).
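Spelling out that arithmetic under the same assumptions (the dedupe-only figure drops below 10MB only if there is also duplication inside the file itself):

    file_mb, copies, ratio = 10, 4, 0.5   # the example's numbers: 4 identical 10MB files, 50% compressible

    logical = file_mb * copies                   # 40 MB as the applications see it
    compressed_only = logical * ratio            # 20 MB: each copy compressed separately
    dedup_only = file_mb                         # 10 MB: one physical copy referenced four times
    dedup_and_compression = dedup_only * ratio   # ~5 MB: the single remaining copy is compressed too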

It doesn't really negate the need for good housekeeping routines, nor good programming. Do you really want 100 copies of record X, or would one suffice?

Far better to let the computer do the heavy lifting than to try to impose partial order on an inherently chaotic situation.

Not to mention that the three textbook scenarios where dedupe really shines are backups, email and virtual machines, none of which can really be helped by "better housekeeping".

Re:Wrong layer (1)

phantomcircuit (938963) | more than 3 years ago | (#33594398)

Data deduplication is mostly being used for virtual servers. So no, this is being done at the right level: the block level.

Re:Wrong layer (0)

Anonymous Coward | more than 3 years ago | (#33595188)

Filesystems should be doing [deduplication].

Venti sort of does this. The way it works is you convert whatever fs you're storing into the vac fs format and then send it to the venti server. It's not the fs itself doing the deduping, but your data gets deduped at the vac fs level, not at the disk's block layer.

ref:http://en.wikipedia.org/wiki/Venti [wikipedia.org]

Re:Wrong layer (3, Insightful)

drsmithy (35869) | more than 3 years ago | (#33596174)

Filesystems should be doing this.

No, block devices should be doing this. Then you get the benefits regardless of which filesystem you want to layer on top.

Um.. (0, Troll)

The MAZZTer (911996) | more than 3 years ago | (#33594102)

AFAIK this is pretty much how every compression algorithm works. No need to give it a fancy name.

Re:Um.. (0, Troll)

almightyons (1842868) | more than 3 years ago | (#33594206)

Not only is this just another name for compression, it's another name for the usual 'diff' that every good backup system should be doing: storing only the changes from time to time, not the whole thing. And all that should be packed into any good backup appliance.

Re:Um.. (1)

HTH NE1 (675604) | more than 3 years ago | (#33594274)

Diffs are fine until you lose the root file upon which they are based. Then you lose everything you've never changed. You need to do periodic full backups.

Re:Um.. (1)

georgewilliamherbert (211790) | more than 3 years ago | (#33594428)

No, it's not.

Differential backups take a single filesystem and see what changed, either at the file level (whole changed/updated/new files) or the block level (changed blocks within files).

Block level deduplication is noticing that the storage appliance on which you back up 100 desktops and 10 servers has 50 copies of the same version of each data block in each Microsoft OS file from XP, 25 from Win 7, and 35 from Fedora, and only storing 1 copy of each of those blocks rather than 100 separate ones. It's returning those blocks to the usable storage pool and remapping without having to "compress" anything, not having to rewrite the backup data images, etc. It's just saying "This is block 3 of the binary for Internet Explorer 8, and I already have a copy of that", for each and every common block out there.

You still have to upload the blocks, and the system still needs to scan them to notice the duplication, but it's a lot more than "oh, compression".

Re:Um.. (0)

Anonymous Coward | more than 3 years ago | (#33594260)

No, it's actually reasonably different.

Compression exploits patterns within a particular data set to represent it using fewer bits. (Like saying "one hundred zeroes" instead of "000000...").

Deduplication basically looks through your files and replaces extra copies of identical files with hard links to the original, only it's more complicated than that and it generally acts at the block level. No compression of data, as we generally use the term, actually takes place.
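The hard-link flavour of that idea fits in a few lines: hash whole files, and when two files hash the same, replace the duplicate with a link to the first copy. A cautious sketch assuming a POSIX filesystem and a single mount point; this is roughly what utilities in the spirit of hardlink or fdupes do, and it is nothing like the block-level machinery in the appliances under review:

    import hashlib, os, sys

    def file_digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def link_duplicates(root):
        keepers = {}   # digest -> path of the copy we keep
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                keeper = keepers.setdefault(file_digest(path), path)
                if keeper != path and not os.path.samefile(keeper, path):
                    os.unlink(path)            # drop the duplicate...
                    os.link(keeper, path)      # ...and point its name at the kept copy

    if __name__ == "__main__":
        link_duplicates(sys.argv[1])

The obvious caveat from this thread applies: afterwards every name points at a single inode, so an in-place edit or corruption of that one copy affects them all.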

Re:Um.. (1)

icebike (68054) | more than 3 years ago | (#33594636)

That said, there is nothing to say compression of data might not also happen; I don't believe compression and de-duplication are mutually exclusive.

This is actually a good argument for de-duplication to run on the device. It can surf thru files more or less at leisure looking for duplicate blocks all over the file system, without tying up the server's bus/controller.

That could be done independent of File System compression, which generally, as you pointed out, works best on large blocks of repetitive bytes within a single file.

Re:Um.. (1)

Znork (31774) | more than 3 years ago | (#33594352)

No need to give it a fancy name.

It's much easier for sales if you give it a fancy name, and preferably one that doesn't trigger comparisons with other solutions.

Of course, as deduplication is mainly a solution for enterprises that have been tricked into buying obscenely expensive storage, and who lack any coherent data storage policy and tiering strategy, the fancy name might be superfluous; they're spread wide and lubed up already.

Re:Um.. (2, Informative)

cetialphav (246516) | more than 3 years ago | (#33594464)

AFAIK this is pretty much how every compression algorithm works. No need to give it a fancy name.

The reason it has a different name is to distinguish this from a compressed file system. The blocks of data are not compressed in these systems. Imagine that you have a file system that stores lots of vmware images. In this system, there are lots of files that store the same information because the underlying data is OS system files and applications. Even if you compress each image, you will still have lots of blocks that have duplicate values.

Deduplication says that the file system recognizes and eliminates duplicate blocks across the entire file system. If a given block has redundant data within it, that redundancy is not removed because the blocks themselves are not actually compressed. This is the difference between a compressed file system and a deduplicated file system. In fact, there is no reason that you could not combine both of these methods into a single system.
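Combining them is mechanically simple: dedup first so each unique block is kept once, then compress that single kept copy. A hedged sketch of just that write path, with invented names:

    import hashlib, zlib

    def store_block(block, table):
        """Store `block` at most once (compressed); return the digest used as its placeholder."""
        digest = hashlib.sha256(block).digest()
        if digest not in table:                    # dedup: only previously unseen patterns are kept
            table[digest] = zlib.compress(block)   # compression: squeeze the one copy that is kept
        return digest

    def load_block(digest, table):
        return zlib.decompress(table[digest])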

Re:Um.. (2, Funny)

igny (716218) | more than 3 years ago | (#33595344)

Yeah! To fight dupes I compute CRC checksum for each file and store it (and only it) on my back up drive. That method removes dupes almost automatically and there is a side effect of a huge compression ratio too. I have been downloading the high def videos from Internet for quite a while now and with my compression method I have used less than 10 percent of 1GB flash drive! I strongly recommend this method to everyone!

Don't forget to weigh in the cost (2, Informative)

leathered (780018) | more than 3 years ago | (#33594114)

The shiny new NetApp appliance that my PHB decided to blow the last of our budget on saves around 30% by using de-dupe; however we could have had 3 times conventional storage for the same cost.

NetApp is neat and all but horribly overpriced.

Re:Don't forget to weigh in the cost (1)

ischorr (657205) | more than 3 years ago | (#33594214)

I assume they didn't spend the money only for dedupe? That box has a whole lot of features.

Re:Don't forget to weigh in the cost (0)

Anonymous Coward | more than 3 years ago | (#33594226)

however we could have had 3 times conventional storage for the same cost.

Three times the rack space
Three times the backup volume
Three times the heat and power

NetApp is neat and all but horribly overpriced.

NetApp is expensive. NetApp is also worth every cent. Your PHB is a smart guy. Pay attention while you have the opportunity to learn from him. It won't last forever.

Re:Don't forget to weigh in the cost (0)

Anonymous Coward | more than 3 years ago | (#33594416)

Thank goodness there's a NetApp sales rep here in the forum. So, should I buy the FAS2000 or the FAS3100 line of your products?

Re:Don't forget to weigh in the cost (1)

hawguy (1600213) | more than 3 years ago | (#33594872)

I don't think it takes a NetApp sales rep to recognize the value of a reliable storage system. I'm sure he would say the same of EMC - it's expensive but worth every penny when you've got hundreds (or thousands) of people relying on your storage.

If you're in a 10 person office, you can get by with less, but when you've got a large corporate environment, you'll recognize the advantage of paying for Netapp or EMC.

Re:Don't forget to weigh in the cost (3, Insightful)

h4rr4r (612664) | more than 3 years ago | (#33595132)

More disk is still so much cheaper it really cannot be justified on that front. More disks also mean more IOPS, so reducing sinning platters can be a bad thing.

There are some reasons to go for it, but even with thousands of clients it may or may not be suitable for what you are doing.

Re:Don't forget to weigh in the cost (2, Funny)

zooblethorpe (686757) | more than 3 years ago | (#33595784)

...so reducing sinning platters can be a bad thing.

Satan, is that you?

Cheers,

Re:Don't forget to weigh in the cost (2, Informative)

hardburn (141468) | more than 3 years ago | (#33594262)

Was it near the end of the fiscal year? Good department managers know that if they use up their full budget, then it's harder to argue for a budget cut next year. Managers will sometimes blow any excess funds at the end of the year on things like this for that very reason.

Re:Don't forget to weigh in the cost (0)

Anonymous Coward | more than 3 years ago | (#33595006)

Just like the Government...

*zing*

Re:Don't forget to weigh in the cost (0)

Anonymous Coward | more than 3 years ago | (#33594276)

Yeah no kidding. I mean *nothing* is more important than absolute storage, right? Who cares about RAS? Who cares about support? Don't these fortune 500 companies know that they can get a TB of disk at Fry's for $100? Heck, if their IT managers would just troll slickdeals, they can probably find 1.5TB drives for $90. Just give the rebate forms to the finance dept. to deal with. Or, they could just use refurb drives for corporate data. Surely Sarbanes-Oxley wouldn't have a problem with that, right?

Re:Don't forget to weigh in the cost (0)

Anonymous Coward | more than 3 years ago | (#33594454)

$66 shipped for 1.5TB at eWiz last week... apparently you do not browse slickdeals enough :)

If you get it just for dedupe maybe (1)

Sycraft-fu (314770) | more than 3 years ago | (#33594356)

However, they have a ton of features, including extremely high performance and reliability. For example, they monitor your unit, and if a drive fails, they'll send you one next-day air. Sometimes the first you know of the failure is a disk showing up at your office.

Don't get me wrong, they aren't the only way to go, we have a much cheaper storage solution for less critical data, but the people who think dropping a bunch of disks in a Linux server gives you the same thing as a NetApp for less cost are fooling themselves.

It is exceedingly high end stuff, which is why it costs so much.

Re:If you get it just for dedupe maybe (1)

h4rr4r (612664) | more than 3 years ago | (#33595040)

You can just have nagios monitoring for errors and even order a drive off Amazon if you really wanted to. NetApps have a lot of neat features; mailing you drives is not really one of them.

Ya it is (3, Insightful)

Sycraft-fu (314770) | more than 3 years ago | (#33595246)

Something you start to appreciate when you are called on to build a really high-availability, high-reliability system is having features like this. For one thing, it reduces the time it takes to get a replacement. Unless a drive fails late at night, you get one the next day. You don't have to rely on someone to notice the alert, place the order, etc. It just happens. Also, like most high-end support companies, their shipping cutoff is fairly late, so even late in the day it is next-day service. What arrives is the drive you need, in its caddy, ready to go.

Then there's just the fact of having someone else help monitor things. It's easy to say "Oh ya I'll watch everything important and deal with it right away," but harder to do it. I've known more than a few people who are not nearly as good at monitoring their critical system as they ought to be. A backup is not a bad thing.

You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok. You can't say "Ya a disk died and before we got a new one in, another died, so sorry, stuff is gone."

Not saying that your situation needs it, but there are those that do. They offer other features along those lines like redundant units, so if one fails the other continues no problem.

Basically they are for when data (and performance) is very important and you are willing to spend money for that. You put aside the tech-tough guy attitude of "I can manage it all myself," and accept that the data is that important.

Re:Ya it is (2, Insightful)

h4rr4r (612664) | more than 3 years ago | (#33595466)

I mean, have the nagios server order the drive without any human intervention.

Also, if it was really critical, you would keep several disks ready to go on site, you know, for when you can't wait for next day. And like NetApp, you too can have many hot spares in the volume.

If you have problems with people not noticing or reacting to alerts you need to fire them.

Re:Ya it is (1)

h4rr4r (612664) | more than 3 years ago | (#33595494)

You have to remember that the kind of stuff you are talking about for things like NetApps is when no downtime is ok, when no data loss is ok.

Then what you want is redundancy, because downtime and loss of data are guarantees in life. The real service NetApp provides is letting companies hire MCSEs and be ok with the job they do. They spend money to outsource this part of their IT, which is fine. Just do not pretend that they are doing anything else.

Re:Ya it is (1)

Dex1331 (1810146) | more than 3 years ago | (#33596034)

Basically they are for when data (and performance) is very important and you are willing to spend money for that. You put aside the tech-tough guy attitude of "I can manage it all myself," and accept that the data is that important.

APPLAUSE>> Yes, too many people working for small to medium-sized businesses don't understand the needs of "high availability" enterprise data centers with 60,000-plus machines, 20,000 of those being servers. You also can't discount legal requirements for storage and redundancy of legally sensitive data. Believe me, I work in THAT place and my life revolves around "backups" during a good portion of my shifts.

Re:Don't forget to weigh in the cost (1)

alen (225700) | more than 3 years ago | (#33594414)

No shit

We have dedupe and plain old tape. 20 LTO-4 tapes cost $700. That's 20-60 terabytes depending on compression.

We also pay $20,000 a year in support for a dedupe software app. Plus the disk, servers, and power to keep it running, and you have to buy at least 2, since if your OS crashes then your data is gone.

Cheap disk my ass

The tape backup was a little pricey at first, but the tapes hold so much and are so fast that we hardly buy any more tape. We used to blow $25,000 a year or more on DLT tape.

Re:Don't forget to weigh in the cost (0)

Anonymous Coward | more than 3 years ago | (#33595034)

MOD PARENT UP.

This is *exactly* the reason that we don't spend $600 to boost our RAM capacity by 10%. Guess what? RAM is cheap - buy more RAM.

Storage is a commodity... period. Focus on content management systems, not on trying to find technical solutions to solve poor planning issues.

Re:Don't forget to weigh in the cost (1)

mlts (1038732) | more than 3 years ago | (#33595556)

The Netapp box does a lot more than deduping:

1: The newer models come with 512GB-1TB of SSD, and automatically place data either on the SSD, the FC drives, or the SATA platters depending on how much it is used. If the chunk of data is used all the time, it sits on the SSD. This helps a lot with the bottleneck of a lot of machines needing to access the same data block with deduplication. This is different from other disk solutions, as the NetApp chooses the "tier" of disk for you. However, a lot of servers don't put out the throughput requiring someone to select between T1 and T2 disks, so for this, the NetApp is fine. Carve your LUNs out, carry on.

2: NetApp's WAFL system has been around saving butts for a long time. People don't realize this until you walk in and see that a junior admin blew away /net and is looking at you with the deer-in-the-headlights glance. A quick move from a snapshot directory, and nobody is the wiser.

3: You can put two NetApp SAN clusters in two geographically disparate locations and have them send changes via the WAN. This way, DR can be automated and made quite fast.

4: SANs are a lot more than just a bunch of disks shoved in a rack. They tend to be very intelligent about where data is placed, and on the back end they at least use RAID 6, where more than two drives have to fail at the same time for data to be lost. Almost all have multiple controllers, so if one path via the network fabric gets stomped on, machines are still able to access their LUN via the second one.

This isn't to say the NetApp is for everyone. If someone just needs a bunch of disk and no other features, a BackBlaze pod or a tower full of eSATA JBOD drives may be good enough. However, if one has a number of machines and is doing large amounts of random I/O, having an enterprise grade SAN goes without saying.

Four? (0)

Anonymous Coward | more than 3 years ago | (#33594158)

I only see three. Was data deduplication at work in the article?

Not enough products (2, Interesting)

ischorr (657205) | more than 3 years ago | (#33594178)

Odd that, in reviewing this class of products, they didn't review the most common deduping NAS/SAN appliance: the EMC NS-series (particularly the NS20).

Re:Not enough products (1)

drdrgivemethenews (1525877) | more than 3 years ago | (#33594362)

I found it odd too, though they seem to be reviewing boxes that do dedup on live data, as opposed to backup streams. Appliances like the NS-series claim dedup percentages of 95%+, but they accomplish this seeming miracle when slowly changing datasets are backed up over and over (even differential backup systems usually do a full backup fairly regularly).

Re:Not enough products (1)

ischorr (657205) | more than 3 years ago | (#33594692)

I can't say that I've ever heard dedup percentage of 95% related to the NS series, which is very similar to the products in this article (NAS/SAN server that does dedupe on live data that lives on the array). Maybe you're confusing with products like Data Domain or Avamar or something?

Re:Not enough products (1)

georgewilliamherbert (211790) | more than 3 years ago | (#33594380)

Thirded. Data Domain (now part of EMC) really started the commercial use of this...

Nor do they give proper mention to Quantum DXi (1)

DanDD (1857066) | more than 3 years ago | (#33594408)

Quantum was one of the first to bring variable-block data deduplication products to market, so in a sense their omission is rather odd.

However, the article seems centered on primary storage, and not the marriage of backup/replication/physical tape, which is Quantum's focus.

Personally, I'd be _terrified_ of using dedup for primary storage. What this does is exactly the opposite of RAID - it squeezes every last bit of redundancy out of your data, and makes everything dependent upon the integrity of your blockpool database. Lose a single blocklet and you stand to lose _all_ of your data.

Compressing common data across many filesystems for things like backups makes a lot more sense, and seems more cost effective.

Re:Nor do they give proper mention to Quantum DXi (1)

ischorr (657205) | more than 3 years ago | (#33594878)

"Personally, I'd be _terrified_ of using dedup for primary storage. What this does is exactly the opposite of RAID - it squeezes every last bit of redundancy out of your data, and makes everything dependent upon the integrity of your blockpool database. Loose a single blocklet and you stand to lose _all_ of your data. "

Dedupe reduces multiple copies of the same data *on the same storage*

I think you're implying that having - probably purely at random - multiple copies of some files on the same FS is somehow a proper backup/redundancy strategy. It sounds like you're saying that WITHOUT dedupe, if a file got corrupted you'd at least be able to go restore it from some other copy of the file. I can't imagine how this is true - you can't rely on chance copies of multiple files to be able to recover from a file corruption. That's crazy. With or without dedupe you better have BACKUPS of the data on some other storage.

Maybe you mean that if something gets corrupted in some of the deduped data that you'll lose ALL dedupe data (so maybe half of your filesystem or something). Most dedupe technologies don't work that way - if corruption occurs it will impact the actual data or possibly file that was affected (and obviously each copy of that data throughout the FS). But not more than that.

Re:Nor do they give proper mention to Quantum DXi (2, Interesting)

immortalpob (847008) | more than 3 years ago | (#33595490)

You are missing his point. On a non-deduplicated system, if one block goes bad you lose one file; on a deduplicated system you can lose any number of files due to one bad block. It gets worse when you consider the panacea of non-backup deduplication: if all of your servers are VMs and reside on the same deduplicated storage, one bad block can take them ALL DOWN. Now, admittedly, any dedupe solution will sit on some type of RAID, but there is still the possibility of something terrible, and this is made worse by the likelihood of a URE during a RAID 5 rebuild.

Re:Nor do they give proper mention to Quantum DXi (1)

jgreco (1542031) | more than 3 years ago | (#33595762)

That's why you have the system store more than one copy, and you have it validate their integrity when reading them. Think of it as sensible RAID. I suggest a quick Google for "zfs data integrity", etc.
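On ZFS specifically, those are ordinary dataset and pool operations (quoted from memory, so check zfs(8)/zpool(8) on your platform; "tank/backups" is just a placeholder name):

    zfs set copies=2 tank/backups        # keep two physical copies of every block in this dataset
    zfs set checksum=sha256 tank/backups # checksum every block; reads are verified against it
    zpool scrub tank                     # walk the whole pool, verifying and repairing as it goes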

Re:Nor do they give proper mention to Quantum DXi (1)

ischorr (657205) | more than 3 years ago | (#33595894)

"You are missing his point. On a non-deduplicated system if one block goes bad you lose one file, on a deduplicated system you can lose any number of files due to one bad block."

This is true, but he was saying "This is the opposite of RAID...it squeezes every bit of redundancy out of your data". Like having random duplicate copies of files scattered around a filesystem was a redundancy mechanism that is somehow on-par with RAID, and so enabling dedupe means that you have eliminated a serious data redundancy mechanism. It's true that it might be a higher-impact loss when you lose a single file (and require more restores or mean more users will be impacted), but it's not a situation where you've suddenly killed your data backup plan and lost all your data. You're just not going to be using random duplicate copies of files on your FS in this way.

Re:Not enough products (1)

alen (225700) | more than 3 years ago | (#33594452)

If it's EMC, then you need to be a global Fortune 10 company to afford it.

I used to joke that they are like crack dealers. The initial hardware is not that much, but they get you on the disk upgrades, licenses to go above some storage size, backend bandwidth, etc.

Re:Not enough products (1)

ischorr (657205) | more than 3 years ago | (#33594716)

The NS20 goes head-to-head with that NetApp box, so I'm not sure if that's true in this case (need to be fortune 10 to afford it). And from what I read a couple of days ago, it's the most commonly sold NAS product in this class...which is why I thought it was weird not to include it in the review. I'm curious what they would have said about it.

Foredown your data (1)

HTH NE1 (675604) | more than 3 years ago | (#33594234)

I can't wait until the Dilbert strip hits where the PHB does this across all their backups and deduplicates them all away, thinking he's just saved a ton of money on backup media.

Redundancy can be a very good thing!

De-Dupe on Linux? (1)

MarcQuadra (129430) | more than 3 years ago | (#33594304)

Are there any open-source filesystems that offer deduplication?

It seems that the FS du-jour changes faster than any of the promised 'optional' features ever materialize.

Instead of working full-bore on The Next Great FS, it would be really nice to have compression, encryption, deduplication, shadow copies, and idle optimization running in EXT4.

Maybe I'm just jaded, but I've been a Linux user for 12 years now. Sometimes it feels like the names of the technologies are changing, but nothing ever gets 'finished'. Maybe the NTFS/BSD model (good core design, long intervals with only minor changes) would be wise in Linux filesystem development.

Re:De-Dupe on Linux? - yes (1, Interesting)

Anonymous Coward | more than 3 years ago | (#33594460)

http://www.opendedup.org/ [opendedup.org]

Re:De-Dupe on Linux? (1)

Microlith (54737) | more than 3 years ago | (#33594598)

Maybe the NTFS/BSD model (good core design, long intervals with only minor changes) would be wise in Linux filesystem development.

You mean like the extremely long-lived EXT* series of filesystems?

Eventually the things you want to add lead you to rethink the core design instead of hacking on things on the outside more and more. But that process takes a long time and requires a lot of work to accomplish. Which is why BTRFS is a break from EXT* and IIRC it supports most (if not all) of the features you mentioned.

Re:De-Dupe on Linux? (1)

cetialphav (246516) | more than 3 years ago | (#33594656)

Instead of working full-bore on The Next Great FS, it would be really nice to have compression, encryption, deduplication, shadow copies, and idle optimization running in EXT4.

To do all these things, you have to change how data is stored on the disk and what information is present. When you do this, you necessarily create a new file system. These aren't simple features that you can just tack onto an existing file system.

I suspect that one of these days we will be running the ext10 file system that has most of these features and evolved from ext3 in a methodical way, but it will in no way actually resemble ext3. There will always be other systems being developed to try out new ideas, but getting things both reliable and fast is hard enough that new systems rarely cross the experimental hurdle; when one does, its innovations will migrate into ext\d.

Re:De-Dupe on Linux? (2, Informative)

suutar (1860506) | more than 3 years ago | (#33595312)

There are a few. I've read there's a patchset for ZFS on FUSE that can do deduplication; there's also opendedup [slashdot.org] and lessfs [lessfs.com] . The problem is that none of these has been around long enough to be considered bulletproof yet, and for a filesystem whose job is to play fast and loose with file contents in the name of space savings, that's kinda worrisome.

Use ZFS. It offers dedupe, compression, etc. (3, Informative)

jgreco (1542031) | more than 3 years ago | (#33594382)

ZFS offers dedupe, and is even available in prepackaged NAS distributions such as Nexenta and OpenNAS. You too can have these great features, for much less than NetApp and friends.
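If you want to try it, dedup is just a dataset property once the pool is new enough (commands quoted from memory; exact behaviour varies by release, "tank/vmimages" is a placeholder, and the dedup table wants plenty of RAM):

    zpool upgrade tank                         # dedup needs a sufficiently recent pool version
    zfs set dedup=on tank/vmimages             # checksum-based dedup on one dataset
    zfs set dedup=sha256,verify tank/vmimages  # paranoid mode: byte-compare blocks on checksum hits
    zfs set compression=on tank/vmimages       # dedup and compression can be combined
    zdb -S tank                                # simulate dedup on existing data and report the ratio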

Re:Use ZFS. It offers dedupe, compression, etc. (2, Informative)

lisany (700361) | more than 3 years ago | (#33594668)

Except NexentaStor (3.0.3) inherits from its OpenSolaris upstream (which has gone away, by the way) a kernel bug that hung our Nexenta test box. Not a real good first impression.

Re:Use ZFS. It offers dedupe, compression, etc. (1)

jgreco (1542031) | more than 3 years ago | (#33595734)

I found a ton of stuff I didn't really care for with Nexenta. They've put some good effort into it, and it'd be a fine way to go if you wanted commercial support, but overall it doesn't really seem to fit our needs here. ZFS itself is a resource pig, but on the other hand, resources have become relatively cheap. It's not unthinkable to jam gigs of RAM in a storage server ... today. Five years ago, though, that would have been much more likely to be a deal-breaker.

This is new? (2, Interesting)

Angst Badger (8636) | more than 3 years ago | (#33594384)

Didn't Plan 9's filesystem combine journaling and block-level de-duplication years ago?

Re:This is new? (0)

Anonymous Coward | more than 3 years ago | (#33594616)

This was done at the block layer; cf.
http://plan9.bell-labs.com/sys/doc/venti/venti.html and
http://plan9.bell-labs.com/magic/man2html/8/venti

Re:This is new? (0)

Anonymous Coward | more than 3 years ago | (#33595268)

Plan 9's fossil filesystem has copy-on-write snapshots, not data or metadata journaling like ext3. I recall reading a paper on journaling on Plan 9, but as far as I know that code isn't in use. I'm not sure if it was even journaling for fossil. Besides, what Plan 9 user needs journaling anyway? When you have a good online backup system, you don't have to be so careful with your data.

Re:This is new? (1)

BitZtream (692029) | more than 3 years ago | (#33595824)

Plan 9 could have the cure for cancer too but still no one gives a shit about it.

Dedup is a good 30 years old at least, if you want to point out that it isn't new.

Only slashdotters and Linux children get excited at silly things like this.

Deleting superfluous data (0)

Anonymous Coward | more than 3 years ago | (#33594390)

This is what I need. I can't swing a dead cat around my head without hitting a bunch of USB drives with fuck knows what's on them. But I can't bring myself to toss them out, or, even less likely, go through them to see if there is anything on them worth saving. Where is all that AI technology that everyone promised me in the 80's? I need an intelligent agent that tells me:

"Listen, dude, this data has gone way off, and has to go. Just look at the expiration date. Chuck this drive tomorrow!"

Data Deduplication . . (0)

Anonymous Coward | more than 3 years ago | (#33594496)

aside from the mentioned 'to reduce duplicate data to increase available storage space' are there any other benefits to de-duplicating your storage? As I understand THAT point....instead of having 20 or 50 copies of the same email that has been forwarded to everybody in the organization 2 or 3 times, KEEP only 1 copy on the storage space, remove the duplicates and place "placeholders" in place of the duplicates which link back to that 1 copy on the same storage space, thereby reducing needed storage space and increasing available space, HOWEVER, if the sig on my email is different than the sig on anyone else's, how are those forwarded emails "duplicates"? and so then what good is it anyway? The forwards usually contain the quotations from each quote, or whatever they call the >>>>>>> marks, so again, how are those duplicates? And so what about the "near duplicates" ? -They just don't get considered because they aren't exact, right? WHAT IS THE POINT?!? note: I started reading the 8 pages of the linked story from the OP listing, but. . . .

Re:Data Deduplication . . (1)

initdeep (1073290) | more than 3 years ago | (#33594884)

They don't have to do this at the file level; they do it at the block level. So in your example, since the only change would be the signature at the bottom of each email, the email blocks themselves would be deduped and the signatures would be retained.

Think of backing up a whole bunch of similar desktops in an enterprise situation where the majority of the OS files are going to be the same or have only slight variations.

Even if the files have slight variations, only the actual bits that are different would be stored; the rest would be deduped and only one copy kept.

Personally, I know a fairly large company using Avamar for this; however, they do it on their backups only. And IIRC, they still keep different sets of backups and just dedupe within each backup itself, which saves them quite a bit of space per backup.

Re:Data Deduplication . . (1)

hawguy (1600213) | more than 3 years ago | (#33594976)

I think what you're talking about is single instance storage in your mail server. But as you mentioned, it only works well on identical emails and attachments.

No dedupe system that I'm aware of does what you'd need to do to dedupe forwarded emails. It's technically possible by recognizing similar messages and doing diffs on them to find identical sections. But it's computationally difficult and there's not much payback -- better to go after the low-hanging fruit and dedupe all of the identical GIFs and MP3s that people have downloaded off the internet.

When we deduped our corporate fileserver, we got around 40% of our space back.

Re:Data Deduplication . . (1)

h4rr4r (612664) | more than 3 years ago | (#33595512)

Did it cost less than buying 40% more disks? Heck, did it cost less than building another fileserver with 100% more disk and then syncing between them?

Data Domain (0)

Anonymous Coward | more than 3 years ago | (#33594552)

Data Domain pretty much started this market and is still the best product and market leader. Oddly, they did not review them here, probably because these boxes just don't stack up. All of the boxes reviewed are post-processing dedup appliances, which in my mind suck.

Re:Data Domain (1)

debus (751449) | more than 3 years ago | (#33595834)

I agree. NetApp tried to buy Data Domain after they already had ASIS (the name of their dedup product) integrated into their filers. That should tell everyone what the people at NetApp think of their own solution.

DataDomain (0)

Anonymous Coward | more than 3 years ago | (#33594578)

I administer a DataDomain DD660. It's amazing. I have 140TB of backups sitting on 10TB of space. Why not include the market leader in this review? Too expensive?

I already do this (3, Funny)

MyLongNickName (822545) | more than 3 years ago | (#33595544)

After an analysis of a 1TB drive, I noticed that roughly 95% were 0's with only 5% being 1's.

I was then able to compress this dramatically. I just record that there are 950M 0's and 50M 1's. The space taken up drops to around 37 bits. Throw in a few checksum bits, and I am still under eight bytes.

I am not sure what is so hard about this disaster recovery planning. Heck, I figure I am up for a promotion after I implement this.
