
ZFS Gets Built-In Deduplication

ScuttleMonkey posted more than 4 years ago | from the sounds-like-a-resource-hog-waiting-to-happen dept.

Sun Microsystems

elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
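To make the block-level idea concrete, here is a minimal in-memory sketch in Python of hash-keyed deduplication (purely illustrative; the class, the 8 KB block size and the dictionary-based store are this example's own inventions, not ZFS code, which keeps its dedup table on disk and reference-counts blocks):

    import hashlib

    BLOCK_SIZE = 8192  # illustrative; real ZFS records are typically larger

    class DedupStore:
        """Toy store: each unique block is kept once, keyed by its SHA-256 digest."""
        def __init__(self):
            self.blocks = {}   # digest -> block bytes (stored once)
            self.files = {}    # filename -> ordered list of digests

        def write(self, name, data):
            digests = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                digest = hashlib.sha256(block).digest()
                self.blocks.setdefault(digest, block)  # store only if never seen before
                digests.append(digest)
            self.files[name] = digests

        def read(self, name):
            return b"".join(self.blocks[d] for d in self.files[name])

    store = DedupStore()
    store.write("vm-a.img", b"\x00" * 4 * BLOCK_SIZE)
    store.write("vm-b.img", b"\x00" * 4 * BLOCK_SIZE)   # identical content
    print(len(store.blocks))   # 1 -- eight logical blocks, one physical copy

Writing two files with identical content stores the underlying block once; ZFS does the equivalent per block, keyed by the SHA256 checksum it already computes.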


Does that mean... (4, Funny)

Anonymous Coward | more than 4 years ago | (#29956668)

Duplicate slashdot articles will be links back to the original one?

Re:Does that mean... (1)

Shikaku (1129753) | more than 4 years ago | (#29956822)

Er, isn't block deduplication really, really bad from a hard drive block failure point of view? You'd have to compress or otherwise change the data to have a separate copy now, or it'd just be marked redundant; if the block that all those redundant references point to goes bad, all of those files are now bad.

Re:Does that mean... (3, Insightful)

ezzzD55J (697465) | more than 4 years ago | (#29956882)

The single block is still stored redundantly, of course. Just not redundantly more than once.

Re:Does that mean... (0)

Anonymous Coward | more than 4 years ago | (#29957006)

Doesn't matter in ZFS's case - if there's a single unrecoverable bad block anywhere in the filesystem, it becomes unusable. (To be fair, it's really good at recovering from bad blocks.)

Re:Does that mean... (2, Insightful)

hedwards (940851) | more than 4 years ago | (#29957436)

That requires a citation.

ZFS isn't that much different from traditional file systems. I'm not quite sure how that reconciles with the fact that it reports unrecoverable bits of information to you when it couldn't self-heal. If it really became unusable that easily, there'd be no point. Additionally, there isn't really much likelihood of that happening, considering that ZFS isn't really supposed to be used outside of a ZMIRROR or RAIDZ environment. Sure, you can do it, but most of the goodness comes from multiple disks.

Re:Does that mean... (2, Insightful)

Methlin (604355) | more than 4 years ago | (#29957036)

Er, isn't block deduplication really, really bad from a hard drive block failure point of view? You'd have to compress or otherwise change the data to have a separate copy now, or it'd just be marked redundant; if the block that all those redundant references point to goes bad, all of those files are now bad.

If you were concerned about block level failure or even just drive level failure, you wouldn't be running your ZFS pool without redundancy (mirror or raidz(2)).

This is good news... (1, Offtopic)

The Ancients (626689) | more than 4 years ago | (#29956676)

...and would normally make me happy; except I'm a Mac user. Still good news, but could've been better for a certain sub-set of the population, darn it.

File systems are one area where computer technology is lagging, comparatively speaking, so good to see innovation such as this.

Re:This is good news... (0)

Anonymous Coward | more than 4 years ago | (#29956730)

In the same boat. Had a Solaris x86 box around here at one time, years ago.

They must be reading my mind. I can't tell you how many times in the datacenter I wished this existed.

Re:This is good news... (4, Insightful)

bcmm (768152) | more than 4 years ago | (#29956770)

...and would normally make me happy; except I'm a Mac user. Still good news, but could've been better for a certain sub-set of the population, darn it.

Use open source, get cutting edge things.

Re:This is good news... (3, Funny)

jeffb (2.718) (1189693) | more than 4 years ago | (#29957162)

Use open source, get cutting edge things.

The last time I tried to build an Intel box for Linux work, I lost my grip on the cheap generic case, and sustained a cut that sent me to the emergency room. One of the things I like about my Mac is the lack of cutting edges.

Re:This is good news... (4, Funny)

Anonymous Coward | more than 4 years ago | (#29957248)

Shoulda gone with a blade server, then you wouldn't have had to worry about the emergency room.

Re:This is good news... (1)

MrCrassic (994046) | more than 4 years ago | (#29957298)

This is called doin it wrong! :)

Re:This is good news... (2, Informative)

Tynin (634655) | more than 4 years ago | (#29957714)

Not sure when you tried building it, but I build cheap computers for friends / family, at least 2 or 3 computers a year. Almost a decade ago... maybe really only 8 years ago, all cheapo generic cases stopped having razor sharp edges. I used to get cuts all the time, but cheap cases, at least in the realm of having sharp edges, haven't been an issue in a long time. (I purchase all my cheapo cases from newegg these days)

Re:This is good news... (1)

The Ancients (626689) | more than 4 years ago | (#29957240)

Use open source, get cutting edge things.

Cutting edge is nice for the functionality; unfortunately it more often than not comes with unintended functionality. I like standing back a bit - not too much mind you, but enough to avoid the bleeding edge.

Re:This is good news... (0)

Anonymous Coward | more than 4 years ago | (#29957338)

You surely meant: 'the bloody edge'.

damn that edge.

Open Source Cures Cancer (1, Insightful)

sjbe (173966) | more than 4 years ago | (#29957442)

Use open source, get cutting edge things.

Like cutting-edge CAD packages, games, financial management and office suites? Good thing we had you to tell us that open source will solve our every problem just by virtue of being open source. I'm sure every print shop is going to dump Photoshop for GIMP, every finance firm will dump Excel for OpenOffice Calc and every engineering firm will dump AutoCAD for... what exactly?

Maybe, just maybe open source isn't the answer for everything after all...

Re:This is good news... (1)

MBCook (132727) | more than 4 years ago | (#29956906)

It's neat. I can see it being rather useful for our systems at work to de-duplicate our VMs (and perhaps our DB files, since we have replicated slaves). Network storage (where multiple users may have their own copies of static documents that they've never edited) could benefit, perhaps email storage as well.

Personally though, I don't think there is too much on my hard drive that would benefit from this. I would love for OS X to get the built-in checksumming that ZFS has, so it can detect silent corruption that may have happened during a bad boot/power loss, etc., when I try to read the file later.

It's pretty obvious that HFS+ will have to be replaced soon, and Apple is reportedly working on it (since they ditched ZFS). I'd really like the checksumming; at this point (having so much cheap storage and extra CPU cycles) it should be a gimme.

This is the year of Solaris on the desktop (1)

jotaeleemeese (303437) | more than 4 years ago | (#29957132)

Where did I hear that one?

Re:This is good news... (4, Informative)

Trepidity (597) | more than 4 years ago | (#29957720)

If you're running a normal desktop or laptop, this isn't likely to be of great use in any case. There's non-negligible overhead in the deduplication process, and drive space at consumer-level sizes is dirt cheap, so it's only really worth doing if you have a lot of block-level duplicate data. That might be the case if, e.g., you have 30 VMs on the same machine, each with a separate install of the same OS, but it is unlikely to be the case on a normal Mac laptop.

First posts! (0)

Anonymous Coward | more than 4 years ago | (#29956686)

I wrote two first posts, but I guess /. is on ZFS now.

Re:First posts! (1)

BitZtream (692029) | more than 4 years ago | (#29956710)

Why did you write your first 'first post' to say that you wrote 'two' first posts? You must have, or they wouldn't be duplicate blocks, and wouldn't have been deduplicated.

Re:First posts! (0)

Anonymous Coward | more than 4 years ago | (#29956800)

Because I then proceeded to communicate to myself back in time, telling myself I must write it just like that and that the reason why would become obvious. I then replied to my future self that, given the title of the story, the reason why was already obvious, and my future self said "Oh yeah, now that you mention it, I remember thinking that".

ehem (0)

oldhack (1037484) | more than 4 years ago | (#29956712)

Before we get all excited and look all silly, can somebody confirm with Netcraft first?

Hash Collisions (2, Interesting)

UltimApe (991552) | more than 4 years ago | (#29956720)

Surely with high amounts of data (that ZFS is supposed to be able to handle), a hash collision may occur? I'm sure a block is > 256 bits. Do they just expect this never to happen?

Although I suppose they could just be using it as a way to narrow down candidates for deduplication... doing a final bit for bit check before deciding the data is the same.

Re:Hash Collisions (1)

Score Whore (32328) | more than 4 years ago | (#29956760)

Yeah. If you are concerned by the fact that a block might be 128 KB and the hashed value is only 256 bits, then an option like:

zfs set dedup=verify tank

Might be helpful.

Re:Hash Collisions (2, Interesting)

dotgain (630123) | more than 4 years ago | (#29957546)

Before the instruction you posted, I found this explanation in TFA:

An enormous amount of the world's commerce operates on this assumption, including your daily credit card transactions. However, if this makes you uneasy, that's OK: ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not. To enable this variant of dedup, just specify 'verify' instead of 'on':

I fail to see how someone can sit down and rationally decide whether their data will be more susceptible to hash collisions or not. While I would be very surprised if any two blocks on my computer hashed to the same value in spite of being different, it seems to me that someone's going to get hit by this sooner rather than later. And what a nasty way to find hash collisions! Who would have thought my Aunt's chocolate cake recipe had the same SHA1 as hello.jpg from goatse.cx!

On one hand, 2^256 is a damn big keyspace. I've heard people say a collision is about as likely as winning every lottery in the world simultaneously, and then doing it again next week. But give enough computers with enough blocks enough time, and find a SHA1 collision you will. Depending on what kind of data it happens to, you might not even notice it.

Re:Hash Collisions (1)

sgbett (739519) | more than 4 years ago | (#29957670)

Hey! If no-one will notice then it won't be a problem ;)

Re:Hash Collisions (1)

SLi (132609) | more than 4 years ago | (#29957796)

No. We're talking about such amounts of data that there's no conceivable way, now or in the near (1000-year) future, that such a collision would be found by accident, and even after that only on some supercomputer that is larger than Earth and is powered by its own sun. It's not going to happen by accident. The probabilities are just so much against it, given any conceivable amount of data, and there are elementary limits from physics that cannot be surpassed. Moore's law will stop working sooner or later, and then humanity will not be much closer to finding an SHA-256 collision by accident.

The only realistic way you're going to have a hash collision is malice (or perhaps fate or divine intervention, if you believe in such). That's not anywhere near realistic now, but if a significant weakness were found in SHA-256, it could become a possibility one day (and judging from history, I'd say it's probable it will be broken sooner or later). An attacker who can store a file on your filesystem could then replace your precious data with crafted data that has the same hash.

Some other smaller attack vectors come to mind though, depending on how it's implemented. If the deduplication shows on filesystem usage, an attacker could use it to check if you have a certain block of data on the filesystem (in a file inaccessible to him). For example.

Re:Hash Collisions (1)

Shikaku (1129753) | more than 4 years ago | (#29956786)

If blocks that are supposedly from different files have the same block data, does it really matter if it's marked redundant?

Not only that, do you really think a SHA256 hash collision can occur? And even if it does: for the sake of CPU time, a hash table is used as a quick check, rather than comparing every piece of data to be written against all the data already on disk. If two blocks somehow have the same hash, the data SHOULD be checked byte by byte to see if it really is the same, THEN marked redundant.

Re:Hash Collisions (2, Funny)

icebike (68054) | more than 4 years ago | (#29956910)

If blocks that are supposedly from different files have the same block data, does it really matter if it's marked redundant?

I think the hash collision people are worrying about is when two blocks/files/byte-ranges hash to the same value but in fact differ.

When that happens, your PowerPoint presentation contains your boss's bedroom-cam shots.

Re:Hash Collisions (2, Informative)

Rising Ape (1620461) | more than 4 years ago | (#29956856)

The probability of a hash collision for a 256 bit hash (or even a 128 bit one) is negligible.

How negligible? Well, the probability of a collision is never more than N^2 / 2^h, where N is the number of blocks stored and h is the number of bits in the hash. So, if we have 2^64 blocks stored (a mere billion terabytes or so for 128-byte blocks), the probability of a collision is less than 2^(-128), or about 10^(-38). Hardly worth worrying about.

And that's an upper limit, not the actual value.
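For anyone who wants to check the parent's arithmetic, the bound is easy to evaluate exactly (a quick back-of-the-envelope script, nothing to do with ZFS itself):

    from fractions import Fraction

    N = 2 ** 64                      # blocks stored, as in the example above
    h = 256                          # bits in the hash (SHA256)
    bound = Fraction(N * N, 2 ** h)  # collision probability is never more than N^2 / 2^h
    print(float(bound))              # ~2.9e-39, i.e. comfortably below 10^-38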

Re:Hash Collisions (2, Funny)

pclminion (145572) | more than 4 years ago | (#29956866)

Suppose you can tolerate a chance of collision of 10^-18 per-block. Given a 256-bit hash, it would take 4.8e29 blocks to achieve this collision probability. Supposing a block size of 512 bytes, that's 223517417907714843750 terabytes.

Now, supposing you have a 223517417907714843750 terabyte drive, and you can NOT tolerate a collision probability of 10^-18, then you can just do a bit-for-bit check of the colliding blocks before deciding if they are identical or not.
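The parent's figures can be reproduced with the usual birthday approximation P ~ N^2 / 2^(h+1), solved for N (a sketch only; the 10^-18 is read as a whole-filesystem budget, per the correction in the follow-up comment below):

    import math

    p = 1e-18                                 # tolerated collision probability (whole filesystem)
    h = 256                                   # hash bits
    blocks = math.sqrt(2 * p) * 2 ** (h / 2)  # invert p ~ N^2 / 2^(h+1)
    terabytes = blocks * 512 / 2 ** 40        # 512-byte blocks, as assumed above
    print(f"{blocks:.1e} blocks, about {terabytes:.3e} TB")   # ~4.8e+29 blocks, ~2.235e+20 TB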

Re:Hash Collisions (2, Interesting)

pclminion (145572) | more than 4 years ago | (#29956894)

Oops. I didn't mean 10^-18 per-block, I meant 10^-18 for the entire filesystem. (Obviously it doesn't make sense the other way)

Re:Hash Collisions (4, Informative)

shutdown -p now (807394) | more than 4 years ago | (#29956960)

Before I left Acronis, I was the lead developer and designer for deduplication in Acronis Backup & Recovery 10 [acronis.com] . We also used SHA256 there, and naturally the possibility of a hash collision was investigated. After we did the math, it turned out that you're about 10^6 times more likely to lose data because of hardware failure (even considering RAID) than you are to lose it because of a hash collision.

Re:Hash Collisions (1)

buchner.johannes (1139593) | more than 4 years ago | (#29957354)

I have an idea for an attack vector.

Say File A is one block big. File A is publicly available on the server, not writable by users. Eve produces a SHA256 hash collision of file A and stores this file B in ~. Someone wants to retrieve file A but gets file B (e.g. like evilize exe [mscs.dal.ca] for MD5).
Alternatively, if the oldest file is always kept, Eve has to know the next version of the file.

Given big blocks, and time until cryptanalysis of SHA256 reaches the state it's at with MD5, why not?

Re:Hash Collisions (1)

hedwards (940851) | more than 4 years ago | (#29957506)

If I'm not mistaken, that would be a waste of time. Ultimately, in most cases you're looking to get a file executed, in which case you don't really need a collision, you just need some other exploit. If you do need to get a particular file retrieved, there are better ways of doing that as well.

Re:Hash Collisions (1)

TheRaven64 (641858) | more than 4 years ago | (#29957628)

Yes, it's a valid attack once you can generate hash collisions for SHA256, in the same way that 'sit between two parties and decrypt their communication' is a valid attack on RSA once you can factorise the product of two primes quickly. Currently, the best known attack on SHA256 is not feasible (and won't be for a very long time if computers only follow Moore's law).

Re:Hash Collisions (1)

shutdown -p now (807394) | more than 4 years ago | (#29957706)

Say File A is one block big. File A is publicly available on the server, not writable by users. Eve produces a SHA256 hash collision of file A

The whole point of a cryptographic hash function [wikipedia.org] is that you're not supposed to be able to produce input matching a given hash value other than by brute force - that is, 2^N evaluations, where N is the digest size in bits. That's the ideal; in practice the number of evaluations can be reduced, and this is also the case for SHA256 [iacr.org], but for this particular scenario (finding a message corresponding to a known hash, rather than just any two messages that collide on a random hash), it is still way beyond the number that is practical for a successful real-world attack.

Re:Hash Collisions (1)

SLi (132609) | more than 4 years ago | (#29957846)

But then you could just use your magic SHA-256 breaking skillz to divert bank transactions and many outright vital things in commerce and communications, so it seems to me that replacing the contents of a file on some file system would be petty crime compared to that.

Re:Hash Collisions (1)

Just Some Guy (3352) | more than 4 years ago | (#29957334)

Surely with high amounts of data (that ZFS is supposed to be able to handle), a hash collision may occur?

The birthday paradox says you'd have to look at about 2^(n/2) candidates, on average, before any two of them collide for a given n-bit hash. In this case, that means you'd have to look at about 2^128 blocks before expecting to see a collision at all.

On my home server, the default block size is 128KB. With a terabyte drive, that gives about 8.4 million blocks.

GmPy says the likelihood of an event with probability 1/(2^128) not happening 8.4 million times (well, 1024^4/(128*1024) times) in a row is 0.99999999999999999999999999999997534809671184338108088348233. In other words, that's how likely you are to fill a 1TB drive with 128KB blocks without a single hash collision.

I can live with that.
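The same figure can be reproduced with nothing but the standard library's decimal module; this simply reruns the commenter's arithmetic (8,388,608 blocks against a 1-in-2^128 per-block chance) and is not a statement about ZFS internals:

    from decimal import Decimal, getcontext

    getcontext().prec = 60                 # enough precision to see the interesting digits
    blocks = (1024 ** 4) // (128 * 1024)   # a 1TB drive in 128KB blocks = 8,388,608
    p = Decimal(2) ** -128                 # per-block collision chance used above
    print((1 - p) ** blocks)               # 0.999999999999999999999999999999975348...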

Re:Hash Collisions (0)

Anonymous Coward | more than 4 years ago | (#29957452)

Apparently you suck at math as much as in life.

Re:Hash Collisions (1)

Junta (36770) | more than 4 years ago | (#29957876)

They have the 'verify' mode to do what you prescribe, though I'm presuming it comes with a hefty performance penalty.

I have no idea if they do this up front, inducing latency on all write operations, or as it goes.

What I would like to see is a strategy where it does the hash calculation, writes the block to a new part of the disk assuming it is unique, records the block's location as unverified in a hash table, and schedules a dedupe scan if one is not already pending. Then a very low-priority I/O task could scan that structure for block locations that have yet to be verified, compare all the blocks that match a given hash for sameness, and update the structures to retroactively collapse them into a single copy (effectively unlinking a block deemed duplicate after the fact). That gives the absolute hard guarantee of sameness without a write-time performance penalty.

I'm very far from a filesystem designer, and I recognize the likelihood of a collision given sufficiently large block size is low, but I'd really be wary of something that relies on not having bad luck to accidentally lose data on a write due to an unlikely hash collision.
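A rough sketch of the bookkeeping the parent is proposing (hypothetical Python, illustrating the write-now-verify-later idea only; it is not how ZFS's inline dedup or its 'verify' option is implemented):

    import hashlib
    from collections import defaultdict

    class LazyDedup:
        """Write every block immediately; a low-priority pass later collapses
        blocks whose hashes match and whose bytes verify as identical."""
        def __init__(self):
            self.store = {}                    # block id -> bytes
            self.by_hash = defaultdict(list)   # digest -> block ids awaiting verification
            self.next_id = 0

        def write(self, data):
            # Fast path: hash and store, no comparisons, so no extra write latency.
            digest = hashlib.sha256(data).digest()
            bid = self.next_id
            self.next_id += 1
            self.store[bid] = data
            self.by_hash[digest].append(bid)
            return bid

        def dedupe_pass(self):
            # Background scan: byte-compare blocks sharing a digest, then unify.
            for digest, ids in self.by_hash.items():
                keeper, survivors = ids[0], [ids[0]]
                for bid in ids[1:]:
                    if self.store[bid] == self.store[keeper]:
                        self.store[bid] = self.store[keeper]   # share one copy (a real FS would remap and refcount)
                    else:
                        survivors.append(bid)                  # genuine collision: keep both blocks
                self.by_hash[digest] = survivors

The trade-off, as the parent notes, is that duplicate blocks temporarily occupy space between passes, in exchange for keeping the byte-for-byte guarantee off the write path.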

Any other file systems with that feature? (2)

Dwedit (232252) | more than 4 years ago | (#29956740)

Are there any other filesystems with that feature? If not, I'm very strongly considering writing my own.

Re:Any other file systems with that feature? (1)

mrmeval (662166) | more than 4 years ago | (#29956782)

While you're at it write one in assembler as a replacement for the Apple II and 1541 so us retrogeeks can store MORE on a floppy. ;)

I know of all the compression schemes but this block level stuff is fascinating.

Re:Any other file systems with that feature? (1)

Korin43 (881732) | more than 4 years ago | (#29957608)

Wouldn't compression do this? I've never written a program involving compression, but it seems like the first thing you'd look for is two places that have the same data, and then you could just store them as references to the original data.

Re:Any other file systems with that feature? (5, Informative)

iMaple (769378) | more than 4 years ago | (#29956802)

Windows Storage Server 2003 (yes, yes, I know it's from Microsoft) shipped with this feature (which is called Single Instance Storage)
http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a [technet.com]

Re:Any other file systems with that feature? (4, Informative)

buchner.johannes (1139593) | more than 4 years ago | (#29957456)

From that link: it is file-based and a service indexes it (whereas in ZFS it is block-based and on-the-fly). And they first introduced it in Windows Server 2000. Amazing. I'm sure it is an ugly hack, since Windows has no soft/hard links IIRC.

Re:Any other file systems with that feature? (0)

Anonymous Coward | more than 4 years ago | (#29957694)

Windows Storage Server 2003 (yes, yes, I know it's from Microsoft) shipped with this feature (which is called Single Instance Storage)
http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a [technet.com]

Not quite. From the above link it works at the file level:

The files don’t need to be on the same folder, have the same name or have the same date, but they do need to be in the same volume, have exactly the same size and the contents of both need to be exactly the same.

ZFS' dedupe (and similar technologies like NetApp's A-SIS) works at the block level. From one of the leads of ZFS:

Data can be deduplicated at the level of files, blocks, or bytes.

File-level assigns a hash signature to an entire file. File-level dedup has the lowest overhead when the natural granularity of data duplication is whole files, but it also has significant limitations: any change to any block in the file requires recomputing the checksum of the whole file, which means that if even one block changes, any space savings is lost because the two versions of the file are no longer identical. This is fine when the expected workload is something like JPEG or MPEG files, but is completely ineffective when managing things like virtual machine images, which are mostly identical but differ in a few blocks.

Block-level dedup has somewhat higher overhead than file-level dedup when whole files are duplicated, but unlike file-level dedup, it handles block-level data such as virtual machine images extremely well. Most of a VM image is duplicated data -- namely, a copy of the guest operating system -- but some blocks are unique to each VM. With block-level dedup, only the blocks that are unique to each VM consume additional storage space. All other blocks are shared. [...]

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system.

http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup
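A toy comparison of the two granularities described above (illustrative Python only; the 4 KB block size, the helper names and the zero-filled "image" are made up for the example):

    import hashlib

    BLOCK = 4096

    def file_level_bytes(files):
        # File-level dedup: keep one copy per distinct whole-file hash.
        unique = {hashlib.sha256(f).digest(): f for f in files}
        return sum(len(f) for f in unique.values())

    def block_level_bytes(files):
        # Block-level dedup: keep one copy per distinct block hash.
        unique = {hashlib.sha256(f[i:i + BLOCK]).digest(): f[i:i + BLOCK]
                  for f in files for i in range(0, len(f), BLOCK)}
        return sum(len(b) for b in unique.values())

    base = bytes(64 * BLOCK)                                  # stand-in for a guest OS image
    vm_a = base
    vm_b = base[:BLOCK] + b"\x01" * BLOCK + base[2 * BLOCK:]  # identical except one block

    print(file_level_bytes([vm_a, vm_b]))    # 524288 -- one changed block, all savings lost
    print(block_level_bytes([vm_a, vm_b]))   # 8192   -- only the changed block costs extra

Which is exactly the virtual-machine-image argument quoted above.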

Re:Any other file systems with that feature? (1)

jack2000 (1178961) | more than 4 years ago | (#29956810)

Meet NTFS, it has this thing named SiS, for Single Instance Storage. There's a service known as the SiS groveller: it scans your files and links them if they are duplicates, and it does that for parts of your files as well.

Re:Any other file systems with that feature? (2, Informative)

hapalibashi (1104507) | more than 4 years ago | (#29956982)

Yes, Venti. I believe it originated in Plan9 from Bell Labs.

Re:Any other file systems with that feature? (0)

Anonymous Coward | more than 4 years ago | (#29956984)

Plan 9 pioneered block-level deduplication in its backup filesystem.

Re:Any other file systems with that feature? (2, Interesting)

ZerdZerd (1250080) | more than 4 years ago | (#29957078)

I hope btrfs will get it. Or else you will have to add it :)

Re:Any other file systems with that feature? (2, Interesting)

TheSpoom (715771) | more than 4 years ago | (#29957388)

What I'm wondering about all of this is what happens when you edit one of the files? Does it "reduplicate" them? And if so, isn't that inefficient in terms of the time needed to update a large file (in that it would need to recopy the file over to another section of the disk in order to maintain the fact that there are two now-different copies)?

Re:Any other file systems with that feature? (3, Informative)

hedwards (940851) | more than 4 years ago | (#29957554)

ZFS is a copy-on-write filesystem; it already creates a temporary second copy so that the file system is always consistent, if not quite up to date. I'd venture to guess that the new version of the file, not being identical to the old file, would just be treated like a copy to a new name.

Re:Any other file systems with that feature? (1)

PRMan (959735) | more than 4 years ago | (#29957658)

And worse: what happens when you go through a set of files A and change a single IP address in each of them, defeating the deduplication, while filesets B & C still point to the same set? Now you have just increased your disk space usage by 200% while not increasing the "size" of the files at all.

This will be extremely counter-intuitive when you run out of disk space by globally changing "192.168.1.1" to "192.168.1.2" in a huge set of files.

Re:Any other file systems with that feature? (1)

TheRaven64 (641858) | more than 4 years ago | (#29957660)

ZFS is copy on write, so every time you write a block it generates a new copy and then decrements the reference count of the old copy. The 'reduplication' doesn't require any additional support; it will work automatically. Of course, you also want to check if the new block can be deduplicated...

More reason to be a ZFS fanboy (3, Insightful)

BitZtream (692029) | more than 4 years ago | (#29956752)

I'm wondering how long it's going to take for them to do something with ZFS that actually makes me slow down my overwhelming ZFS fanboyism.

I just love these guys.

My virtual machine NFS server is going to have to get this as soon as FBSD imports it, and I'll no longer have to worry about having backup software (like BackupPC, good stuff btw) that does this.

I don't use high-end SANs, but it seems to me that they are rapidly losing any particular advantage over a Solaris or FBSD file server.

Re:More reason to be a ZFS fanboy (3, Informative)

HockeyPuck (141947) | more than 4 years ago | (#29957016)

The advantages of SANs are easy to realize, and they need not necessarily be a FibreChannel vs NAS (NFS/CIFS) argument, as a SAN could be iSCSI, FCoE, FCIP, FICON, etc.

-Storage Consolidation compared with internal disk.
-Fewer components in your servers that can break.
-Server admins don't have to focus on Storage except at the VolMgr/Filesystem level
-Higher Utilization (a WebServer might not need 500GB of internal disk).
-Offloading storage-based functions (RAID in the array vs RAID on your server's CPU; I'd rather the CPU perform application work than calculate parity, replace failed disks, etc.). The benefit increases when you want to replicate to a DR site.

This is not a ZFS vs SANs argument. I think ZFS running on SAN based storage is a great idea as ZFS replaces/combines two applications that are already on the host (volmgr & filesystem).

Re:More reason to be a ZFS fanboy (1)

phoenix_rizzen (256998) | more than 4 years ago | (#29957386)

Or, use ZFS to create a SAN for your other servers. Just create a ZVol, and share it out via iSCSI. On Solaris, it's as simple as setting shareiscsi for the dataset. On FreeBSD, you have to install an iSCSI target (there are a handful available in the ports tree) and configure it to share out the ZVol.

Re:More reason to be a ZFS fanboy (5, Informative)

Anonymous Coward | more than 4 years ago | (#29957072)

How about this: you can't remove a top-level vdev without destroying your storage pool. That means that if you accidentally use the "zpool add" command instead of "zpool attach" to add a new disk to a mirror, you are in a world of hurt.

How about this: after years of ZFS being around, you still can't add or remove disks from a RAID-Z.

How about this: If you have a mirror between two devices of different sizes, and you remove the smaller one, you won't be able to add it back. The vdev will autoexpand to fill the larger disk, even if no data is actually written, and the disk that was just a moment ago part of the mirror is now "too small".

How about this: the whole system was designed with the implicit assumption that your storage needs would only ever grow, with the result that in nearly all cases it's impossible to ever scale a ZFS pool down.

Re:More reason to be a ZFS fanboy (4, Informative)

Methlin (604355) | more than 4 years ago | (#29957380)

Mod parent up. These are all legit deficiencies in ZFS that really need to be fixed at some point. Currently the only solution to these is to build a new storage pool, either on the same system or a different system, and export/import; a big PITA and potentially expensive. Off the top of my head I can't think of anyone that lets you do #2 except enterprise storage solutions and Drobo.

Re:More reason to be a ZFS fanboy (0)

Anonymous Coward | more than 4 years ago | (#29957400)

You make some good points about ZFS annoyances.

I've seen some recent activity around the first limitation you mention (i.e. you can't remove a top-level vdev), so hopefully we'll see a fix soon.

You may have missed that there's now a ZFS property you can set to control whether pools automatically expand into free space. Note that previously autoexpansion could only happen if you gave ZFS entire disks without partitions.

Re:More reason to be a ZFS fanboy (1)

Just Some Guy (3352) | more than 4 years ago | (#29957458)

What do you know - you and I actually agree on something. Yeah, FreeBSD + ZFS is a complete win for pretty much everything involving file transfer. I honestly can't think of a single thing I don't like about it. The instant FreeBSD imports this, I'm swapping in a quad-core CPU to give it as much crunching power as it wants to do its thing.

well ... (1)

wsanders (114993) | more than 4 years ago | (#29957690)

There are enough tales of woe in the discussion groups about ZFS file systems that have melted down on people that I would not start shorting the midrange storage companies' stock just yet. I myself have an 18TB ZFS filesystem on an X4540, and it was brought to a standstill a few weeks ago by one dead SATA disk. Didn't lose any data, and it might be buggy hardware and drivers, but still, Sun support had no explanation. That should not happen!

I'm still a ZFS fanboy though - for about $1 per GB, how can you lose? The host is a backup / virtual tape library server, so it's not super high availability, and it's hella fast. No problem stuffing data into it at 2 x 1000baseT wire speed.

SAN, ZFS with dedupe is not a backup system (1)

caseih (160668) | more than 4 years ago | (#29957780)

Don't mistake in-filesystem deduplication and snapshots for a backup system. It's most certainly not backup and if you treat it as such you will eventually be very sorry. A SAN with ZFS, snapshots, and deduplication features is at best an archive, which is distinct in form and purpose from a backup. Still very useful, though. Ideally you have both archive and backup systems. To get a feel for the difference, consider that an archive is for when a user says, "I overwrote a file last week sometime. Can you recover the version before I made this change or saved over this file?" Whereas a backup is for recovering an entire system from when there's a catastrophic failure (like a SAN dying). Very distinct things. Both are useful.

I get strange looks when I tell people that a Time Capsule is not a backup. Nor is a single Time Machine external disk. Now 2, 3 or even 4 external disks could constitute a backup (and as a bonus with Time Machine an archive also).

Dupe dedupe de dupe dupe! (0)

dangitman (862676) | more than 4 years ago | (#29956764)

Dee dupe de dupe!

Drey dupe de drupes!

Dey dook dour dobbs!

Dey took Lou Dobbs!

Dey drook our jobs!

They took our jobs!

Signed,

Slashdot editors

Next home server will be OpenSolaris (or fBSD) (2, Insightful)

0100010001010011 (652467) | more than 4 years ago | (#29956784)

ZFS, from what I can tell, kicks ass. I've played around with it in virtual machines, taking drives off line, recreating them, adding drives, etc.

When I search NewEgg I also search OpenSolaris' compatibility list.

The two areas where Linux is playing catch-up are filesystems (like this) and sound (OSS, Pulse, ALSA, oh my!). And before you go pointing out the btrfs project: this has been in servers for years. It's proven in an enterprise environment. Your file system is still in beta with a huge "Don't use this for important stuff" warning.

Re:Next home server will be OpenSolaris (or fBSD) (2, Funny)

buchner.johannes (1139593) | more than 4 years ago | (#29957526)

Oh yeah? Well tux is cuter so I'm not switching.

Re:Next home server will be OpenSolaris (or fBSD) (2, Interesting)

buchner.johannes (1139593) | more than 4 years ago | (#29957536)

I'm sure btrfs -- once fully implemented and tested -- will also have problems reaching the performance of reiser4.

Re:Next home server will be OpenSolaris (or fBSD) (0)

Anonymous Coward | more than 4 years ago | (#29957878)

You can download Sun's prebuilt storage appliance VM here [sun.com] .

It gives you a free GUI storage appliance wrapper around OpenSolaris and ZFS, so you can start using the features without being an expert in either (just like NetApp with BSD and WAFL). You can replace the virtual disks with real ones if you want to store serious data.

Another Lawsuit? (1)

yukonbob (410399) | more than 4 years ago | (#29956788)

Considering [netapp.com] what's going on between NetApp and Sun currently, I wonder what they'll think [netapp.com] of this?

-yb

Wake me when they build it into the hard disk (4, Interesting)

icebike (68054) | more than 4 years ago | (#29956796)

Imagine the amount of stuff you could (unreliably) store on a hard disk if massive de-duplication was built into the drive electronics. It could even do this quietly in the background.

I say unreliably, because years ago we had a Novell server that used an automated compression scheme. Eventually, the drive got full anyway, and we had to migrate to a larger disk.

But since the copy operation de-compressed files on the fly, we couldn't copy because any attempt to reference several large compressed files instantly consumed all remaining space on the drive. What ensued was a nightmare of copying and deleting files, beginning with the smallest and working our way up to the largest. It took over a day of manual effort before we freed up enough space to mass-move the remaining files.

De-duplication is pretty much the same thing: compression by recording and eliminating duplicates. But any minor automated update of some files runs the risk of changing them such that what was a duplicate must now be stored separately.

This could trigger a similar situation where there was suddenly not enough room to store the same amount of data that was already on the device. (For some values of "suddenly" and "already").

For archival stuff or OS components (executables, and source code etc) which virtually never change this would be great.

But there is a hell to pay somewhere down the road.

Re:Wake me when they build it into the hard disk (1)

Shikaku (1129753) | more than 4 years ago | (#29956896)

That's actually very easy to explain, and ZFS could have a very similar situation:

Say you have on your hard drive these two files, each of which in reality is 1GB worth of data (the space separates the two files):

ABCDABCD ABCDABCD

Every letter has equal weight, so those two files are stored .5GB without compression. Let's change it a little bit:

AeBCDABfCD ABCgDABChD

efgh are 1 byte.

You now have 2GB worth of space taken :) that's a gotcha if I ever saw one.

Re:Wake me when they build it into the hard disk (1)

Shikaku (1129753) | more than 4 years ago | (#29956928)

Oh, I guess I should mention the blocks in my case are stupidly large, and the point is data insertion/shifting can cause sudden increases in size with block level deduplication.
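The parent's ABCD example can be run directly; this tiny experiment (illustrative Python with an absurdly small block size, not measured ZFS behaviour) shows how a single inserted byte shifts every block boundary and defeats fixed-block dedup:

    import hashlib

    BLOCK = 4   # deliberately tiny, in the spirit of the "stupidly large" letters above

    def unique_blocks(data):
        return {hashlib.sha256(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)}

    original = b"ABCDABCDABCDABCD"    # every 4-byte block is "ABCD"
    modified = b"e" + original        # one byte inserted at the front

    print(len(unique_blocks(original)))                        # 1
    print(len(unique_blocks(modified)))                        # 3 ("eABC", "DABC", "D")
    print(unique_blocks(original) & unique_blocks(modified))   # set() -- nothing dedupes any more

Appending to a file, by contrast, leaves the earlier blocks aligned and still deduplicable, which is the point made further down the thread about adding data to the end of an identical file.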

Re:Wake me when they build it into the hard disk (-1, Troll)

Anonymous Coward | more than 4 years ago | (#29957570)

you fail ..

Re:Wake me when they build it into the hard disk (1)

dgatwood (11270) | more than 4 years ago | (#29957012)

That's just classic bad design. There's no reason for the decompressed files to exist on disk at all just to decompress them. The software should have decompressed to RAM on the fly instead of storing the decompressed files as temp files on the hard drive. It's all probably because they made a poor attempt at shoehorning compression into a VFS layer that was too block-centric. Classic bad design all around.

Re:Wake me when they build it into the hard disk (3, Interesting)

icebike (68054) | more than 4 years ago | (#29957208)

Bad design on Novell's part, but the problem persists in the de-duplicated world, where de-duplicating to memory only is not a solution.

Imagine a hundred very large files containing largely the same content. Now imagine CHANGING just a few characters in each file via some automated process. Now 100 files which were actually stored as ONE file balloon to 100 large files.

On a drive that was already full, changing just a few characters (not adding any total content) could cause a disk full error.

You really can't fake what you don't have. You either have enough disk to store all of your data or you run the risk of hindsight telling you it was a really bad design.

Re:Wake me when they build it into the hard disk (3, Informative)

ArsonSmith (13997) | more than 4 years ago | (#29957392)

No, you'd still have it stored at the size of one file + 100 blocks. You'd need a substantially large number of random changes throughout all 100 files to balloon up from 1x file size to 100x file size.

Re:Wake me when they build it into the hard disk (1)

dgatwood (11270) | more than 4 years ago | (#29957394)

True, but that's going to fail when you change the very first file, and one would hope that the process would go no further.

Re:Wake me when they build it into the hard disk (1)

geniusj (140174) | more than 4 years ago | (#29957398)

ZFS dedupe is block level. This would be a problem, however, in file-level dedupe schemes.

Re:Wake me when they build it into the hard disk (0)

Anonymous Coward | more than 4 years ago | (#29957074)

True, though commodity grade hard drives are so inexpensive these days that the cost of providing a generously larger amount of them than what you plan to store is usually not a big deal.
The only really expensive drives today are high end enterprise type SAS / SAN / SCSI units or FLASH based ones. If you're storing media like digitized video, the benefits of dedup are usually insignificant since you're unlikely to accidentally / routinely have duplicated data at anything less than the file level, and at the file level you'd probably have easy options not to duplicate that content by design if so desired.

The REAL "wake me up when they integrate it into the drive itself" list for me is:
* drive integrated mirroring at a head/platter level with the functions of different platters being independent enough so you could still stand a good chance of reading functional ones even after one head/platter is damaged.

* drive integrated gigabit / 10GbE Ethernet interfaces for commodity drives and an iSCSI protocol over IPv6.

* drive integrated ECC and spatial data striping at user selectable and much higher than default levels so that even a single drive could give you much better data reliability / redundancy across platters.

* drives with integrated encryption being the norm

* drives with built in ZFS / NAS and the ability to link to each other over e.g. PCIE / infiniband so you could set up small clusters of RAIDZ'd drives just with a few cables and inexpensive drives.

Re:Wake me when they build it into the hard disk (1)

Znork (31774) | more than 4 years ago | (#29957336)

But there is a hell to pay somewhere down the road.

I'd certainly expect that. I don't quite get what people are so desperate to de-duplicate anyway. A stripped VM OS image is less than a gigabyte; you can fit 150 of them on a drive that costs less than $100. You'd have to have vast ranges of perfectly synchronized virtual machines before you'd have made back even the cost of the time spent listening to the sales pitch.

I can't really see many situations where the extra complexity and cost would end up actually saving money. The few I can see are where somebody's been tricked into buying such excruciatingly expensive SAN storage that they can barely afford to store anything on it any more, or where their storage is a complete mess and they can't use more intelligent means of not storing the same thing many times (snapshots, shared file systems, overlay devices, etc.). In those cases it seems there would be more to gain by solving the actual problem than tacking another patch onto the stack. Storage, for most purposes, is dirt cheap today.

Re:Wake me when they build it into the hard disk (1)

icebike (68054) | more than 4 years ago | (#29957512)

>I can't really see many situations where the extra complexity and cost would end up actually saving money.

I could see it for write-only media.
With the proper byte-range selection, you could probably find enough duplicate blocks in just about anything to greatly expand capacity.

Re:Wake me when they build it into the hard disk (1)

PRMan (959735) | more than 4 years ago | (#29957752)

It would be great for ISPs, where each of their user instances has files in common. Also for a backup drive for user PCs, where each user has the OS and probably a lot of documents in common.

Re:Wake me when they build it into the hard disk (1)

perzquixle (213538) | more than 4 years ago | (#29957448)

This doesn't apply to ZFS due to the way it uses drives. All drives are added to a storage pool, and drives are used as needed based on speed and reliability requirements. So to upgrade, you'd just add a new drive to the pool, mark the old drive for removal, wait as it moves the blocks to any other drive(s) in the pool, then remove the old drive.

Re:Wake me when they build it into the hard disk (1)

icebike (68054) | more than 4 years ago | (#29957612)

You could STILL be stuck with a transaction in mid-flight when you exhaust your storage because what was one block replicated hundreds of times now becomes hundreds of blocks exhausting all storage.

The ease with which you can add storage only makes it somewhat more palatable. It doesn't hand-wave the problem away.

Sooner or later you have to upgrade storage on almost every platform. The problem with a platform that uses compression or de-duplication to store more than can really fit on its drives is that you can SUDDENLY run out of storage due to seemingly innocuous tasks. No steadily falling free-disk space to warn you ahead of time.

Re:Wake me when they build it into the hard disk (1)

c6gunner (950153) | more than 4 years ago | (#29957518)

This could trigger a similar situation where there was suddenly not enough room to store the same amount of data that was already on the device. (For some values of "suddenly" and "already").

Yes, but what's the likelihood of that occurring? We're talking about block-level deduplication here. If you have two identical files and you add a bit to the end of one, you're not creating a duplicate file - you're just adding a few blocks while still referencing the original de-dupped file. Now, if you were doing file-level deduplication it might be an issue, but this way ... I can't see it ever being a problem unless your array is already at 99.9% capacity (and that's just a bad idea in general).

Re:Wake me when they build it into the hard disk (1, Informative)

Anonymous Coward | more than 4 years ago | (#29957810)

I say unreliably, because years ago we had a Novell server that used an automated compression scheme. Eventually, the drive got full anyway, and we had to migrate to a larger disk.

But since the copy operation de-compressed files on the fly, we couldn't copy because any attempt to reference several large compressed files instantly consumed all remaining space on the drive. What ensued was a nightmare of copying and deleting files, beginning with the smallest and working our way up to the largest. It took over a day of manual effort before we freed up enough space to mass-move the remaining files.

This is because you didn't use NetWare's tools to copy the files - the command line NCOPY, for example, with /Ror and /RU (available when file compression was introduced with NetWare 4) would have copied the files in their compressed format, avoiding this (Link: http://support.novell.com/techcenter/articles/ana19940603.html [novell.com] ). Using the Novell Client for Windows, I'd imagine that its Explorer shell integration would give you GUI tools, too, though I no longer have a NetWare server to verify this, and always preferred the command line anyway :).

No offense, but the scenario you describe is the result of ignorance, not poor design.

Building it in makes no sense (1)

saleenS281 (859657) | more than 4 years ago | (#29957856)

First, why would you want it built into a hard drive? Your deduplication ratio would then be limited to what you can store on one drive. The drive would have no way to reference blocks on other drives in the same system. Doing it in software allows you to reference (in this case) all data within the entire zpool. That could be petabytes of storage (theoretically it could be far more, but that's probably the realistic limit today due to hardware/performance constraints).

As for your "hell to pay later", that's not true for two reasons. First, there is no "modify in place". All data is allocated from new blocks; that's how a copy-on-write filesystem works. If it's "updated", you'd be allocating new blocks. If you're concerned with filling a pool up completely, you can put quotas in place to prevent it.

Second, if you "run out of space", you just add new drives to the raid group and continue on your merry way. You can grow a zpool on the fly.

Nice, but can it ... (0)

Anonymous Coward | more than 4 years ago | (#29957026)

... strategically populate the available space with duplicates of commonly read blocks, for increased fault tolerance and performance?

Re:Nice, but can it ... (1)

Per Wigren (5315) | more than 4 years ago | (#29957418)

... strategically populate the available space with duplicates of commonly read blocks, for increased fault tolerance and performance?

yes, it can [sun.com] .

Re:Nice, but can it ... (0)

Anonymous Coward | more than 4 years ago | (#29957802)

The GP's observation was that the so-called "free" space can be harnessed in such a manner that a storage pool needn't have any "unused" or empty blocks at all. /Caching/ is not the answer here.

What's the point? (1)

Mask (87752) | more than 4 years ago | (#29957268)

The amount of resources it reportedly takes makes this not so practical.

What would one want deduplication for? The cost of disk storage has two big elements - speed (latency & throughput) and backup.

It does not seem that this technology would help much in the speed department; it might actually hurt. Managing copy-on-write has several potential costs. It may help backup if the backup program knows the fine details of deduplication, but that means that old backup software will have to be replaced.

It reminds me of the compressed file system I used to have on my old SLS Linux PC, which had a small disk (1992, if memory serves me right). It was dog slow to run X11 on it. I have not seen a compressed file system since; there was no need. Disk storage grows much faster than my need for data.

Re:What's the point? (1)

myowntrueself (607117) | more than 4 years ago | (#29957790)

It reminds me of the compressed file system I used to have on my old SLS Linux PC, which had a small disk (1992, if memory serves me right).

Soft Landings from DOS bailouts!!! Yaaay!

I had a Windows 3.x PC on which I was coding some simple turbo pascal stuff to do pretty graphics.

This Windows PC didn't have a lot of disk so I was using Stacker (or some such disk compression thing).

One time one of my programs crashed. Just a simple graphics thing, but it crashed the PC, had to hit the reset button.

Erm... sadly the disk compression did not survive this.

A friend at university was *just* getting into this thing he called "Linux" and it ran on PCs, so I thought I'd give it a go. It was the SLS distro and he had a pile of 5.25" floppies. My PC had a 3.5" floppy. So I sat in the computer center for an afternoon copying disks...

And that was how I got into Linux; a soft landing from a DOS bailout :D

I never had to run disk compression under Linux though, never realised SLS supported that. Cool.

Re:What's the point? (1)

TheRaven64 (641858) | more than 4 years ago | (#29957866)

The canonical use case for dedup is backup servers. Imagine you have one Solaris file server serving 40 workstations. Each of these does a full backup of its 10GB Windows (or Linux, or whatever) install. You then have 400GB of data, but only about 12GB of unique data. Dedup lets you store only that 12GB, and you can store it with n redundant copies so it's easier to recover in case of partial hardware failure. Each workstation then does incremental backups, copying files with any changes to the server. The server dedups these and only stores the changed blocks.

The clients can be using NFS, CIFS, or iSCSI for the backup, and the server has a complete disk image (and periodic snapshots) of the clients' disks, but uses a tiny fraction of the space that this may require.

Oh, and with regard to this:

It may help backup if the backup program knows the fine details of deduplication

The entire point of dedup in the FS layer is that the backup software can be completely unaware of it. As long as it produces a copy of the data on the server, the server will handle turning it from a full backup or a per-file incremental backup into a per-block incremental backup.

I Heard ISPs Were Doing This (1)

sexconker (1179573) | more than 4 years ago | (#29957584)

I Heard ISPs Were Doing This With Broadband.
Simply duplicate your advertised pipe across 100 subscribers.

If they want to access it at the same time, just shift stuff around.

If they want to access it at the same time, and you don't have room to shift stuff around, just impose caps and bill them progressively out the ass.

I worked with De-duplication (0)

Anonymous Coward | more than 4 years ago | (#29957772)

You are losing reliability. Some hashes will collide on some computer somewhere.
The idea is that if you assume that blocks on the HD are random, then the odds of hitting a hash collision are tiny.
But data is not random - humans and programs make it non-random!

Here is an example:
What are the odds that 256 people going across the street will all be men?
That would be 2^-256 - that will never happen.
But guess what? Imagine that you see a parade and 300 marines are marching by...
It just happened.
Do you want to bet your server data on that?
