Open Source Deduplication For Linux With Opendedup

timothy posted more than 4 years ago | from the its-missing-apostrophes dept.

Data Storage

tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."

In case you don't know much about it (5, Informative)

stoolpigeon (454276) | more than 4 years ago | (#31644772)

Data deduplication [wikipedia.org]
( I don't )

Re:In case you don't know much about it (4, Informative)

MyLongNickName (822545) | more than 4 years ago | (#31644882)

Data deduplication is huge in virtualized environments. Four virtual servers with identical OS's running on one host server? Deduplicate the data and save a lot of space.

This is even bigger in the virtualized desktop environment, where you could literally have hundreds of PCs virtualized on the same physical box.

Re:In case you don't know much about it (1)

Hurricane78 (562437) | more than 4 years ago | (#31645048)

Unless you “deduplicate” the CPU work, that’s not going to happen. ^^

Re:In case you don't know much about it (2, Informative)

rubycodez (864176) | more than 4 years ago | (#31645110)

hundreds of virtualized desktops per physical server do happen; my employer sells such solutions from several vendors.

Re:In case you don't know much about it (3, Informative)

MyLongNickName (822545) | more than 4 years ago | (#31645170)

If you have a couple hundred people running business apps, it ain't all that difficult. Generally you will get spikes of CPU utilization that last a few seconds, mashed between many minutes, or even hours, of very low CPU utilization. A powerful server can handle dozens or even hundreds of virtual desktops in this type of environment.

Re:In case you don't know much about it (1)

drsmithy (35869) | more than 4 years ago | (#31645686)

Unless you "deduplicate" the CPU work, that's not going to happen. ^^

Sure it does. CPU power is generally the _last_ thing you run out of in virtualised environments, and that's been true for years.

On a modern, Core i7-based server, you should be able to get 10+ "virtual desktops" per core on average, without too much trouble. IOPS and RAM are typically your two biggest limitations.

Re:In case you don't know much about it (1)

fyoder (857358) | more than 4 years ago | (#31645136)

I don't know much about the subject, so forgive me if this is a dumb question, but in that scenario, if the data for a file becomes corrupted on the hard drive, say a critical system file, doesn't that mean that all vm's using it are pooched?

Re:In case you don't know much about it (4, Informative)

zappepcs (820751) | more than 4 years ago | (#31645250)

In a word, No. There are many types of 'virtualization' and more than one approach to de-duplication. In a system as engineered as one with de-duplication, you should have replication as part of the data integrity processes. If the file is corrupted in all the main copies (everywhere it exists, including backups) then the scenario you describe would be correct. This is true for any individual file that exists on computer systems today. De-duplication strives to reduce the number of copies needed across some defined data 'space' whether that is user space, or server space, or storage space etc.

This is a problem in many aspects of computing. Imagine you have a business with 50 users. Each must use a web application which has many graphics. The browser cache of each user has copies of each of those graphics. When the caches are backed up, the backup is much larger than it needs to be. You can do several things to reduce backup times, storage space, and user quality of service:

1 - disable caching for that site in the browser and cache them on a single server locally located
2 - disable backing up the browser caches, or back up only one
3 - enable deduplication in the backup and storage processes
4 - implement all or several of the above

The problems are not single-ended, and the answers or solutions will also not be single-ended or single-faceted. That is, no one solution is the answer to all possible problems. This one has some aspects to it that are appealing to certain groups of people. Your average home user might not be able to take advantage of this yet. Small businesses, though, might need to start looking at this type of solution. Think how many people got the same group email message with a 12MB attachment. How many times do all those copies get archived? In just that example you see the waste that duplicated data represents. Solutions such as this offer an affordable way to positively affect bottom lines in fighting those types of problems.

Re:In case you don't know much about it (1)

jamesh (87723) | more than 4 years ago | (#31645504)

I don't know much about the subject, so forgive me if this is a dumb question, but in that scenario, if the data for a file becomes corrupted on the hard drive, say a critical system file, doesn't that mean that all vm's using it are pooched?

Yes, but not because of deduplication. If you had one sector go bad then yes, you could affect many more VMs if you were using data deduplication than if you weren't, but in my experience, data corruption is seldom just a '1 sector' thing, and once you detect it you should restore anything that uses that disk from a backup that was probably taken before the corruption started (which is tricky... how do you know when that was?)

Bitrot is one of the nastiest failure modes around.

Re:In case you don't know much about it (0)

Anonymous Coward | more than 4 years ago | (#31645604)

Dedup, as well as checksumming of all disk data to detect and prevent (by automatic correction) silent data corruption, are good things, precisely because they avoid the bitrot / corruption issues mentioned above. Unfortunately ZFS still seems way ahead of Linux filesystems like ext4, btrfs, et al., since ZFS does these things already in modern OpenSolaris versions. FreeBSD's ZFS will likely get dedup within a few months as they integrate newer versions of the ZFS code.

It is nice that Linux is getting new features like this, but it seems like too little, too late. IMHO checksumming, dedup, encryption, et al. should all have been in a production-grade and commonly used free Linux filesystem by now. Having one or two without the rest is still disappointing; given the slow pace at which filesystems are deployed and evolve, it'll be perhaps years before Linux's filesystems catch up to where ZFS is today.

Re:In case you don't know much about it (1)

drsmithy (35869) | more than 4 years ago | (#31645670)

I don't know much about the subject, so forgive me if this is a dumb question, but in that scenario, if the data for a file becomes corrupted on the hard drive, say a critical system file, doesn't that mean that all vm's using it are pooched?

Yes, but a) this is something inherent to anything using shared resources, and b) there's not a lot of scope for such corruption to happen in a decent system (RAID, block-level checksums, etc).

Re:In case you don't know much about it (1)

fatp (1171151) | more than 4 years ago | (#31645456)

It is also huge for Java developers, as every Java application normally installs at least one JDK and JRE.

Re:In case you don't know much about it (2, Funny)

fatp (1171151) | more than 4 years ago | (#31645466)

Oh in fact it requires jdk 7...

Re:In case you don't know much about it (0, Offtopic)

fm6 (162816) | more than 4 years ago | (#31645470)

Yeah, that makes sense. But I had to do some googling to figure that out. If Slashdot lived up to its pretense to be a news site, the editors would take a few minutes to summarize the concept, or at least point at the appropriate Wikipedia article. It's beyond lame that they can't be bothered.

One wonders if they even bother to read the stories they post — and what they do with the remaining 7 3/4 hours in the work day after they've picked out the stories.

Re:In case you don't know much about it (2, Funny)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645646)

Hey, slow down cowboy. Explain that concept to me again. I don't know if it's applicable here, but if we find a way to implement it, it might just prove revolutionary.

I work in the quality assurance department of Geeknet Inc, Slashdot's parent company. We are constantly looking for ways to improve all the sites on our network.

I don't know if this method you propose, that, if I understand correctly, would involve parsing the content of the html document linked, and having an editor analyze the output of such html document after being rendered (let's call it, reading the story), is at all possible. But if we implement it the right way, it might prove useful.

We'll get our research team to work over this reading-the-story concept. It's something absolutely novel to us, so it might take a while. We'll let you know when we reach a conclusion, so that we might license this reading-the-story technology from you.

Kind Regards,
Lazy Rodriguez
GeekNet INC.

Patent 5,813,008 (0, Offtopic)

snikulin (889460) | more than 4 years ago | (#31645160)

September 22, 1998.
Single instance storage of information [uspto.gov]

Re:Patent 5,813,008 (1)

Lorens (597774) | more than 4 years ago | (#31645194)

Good try, but after skimming it, does not seem to apply. Seems to be for deduplicating e-mail attachments.

Re:Patent 5,813,008 (1)

snikulin (889460) | more than 4 years ago | (#31645218)

Re:Patent 5,813,008 (0)

Anonymous Coward | more than 4 years ago | (#31645260)

The claims apply. It runs out in 2018, though.

Re:Patent 5,813,008 (1)

Lorens (597774) | more than 4 years ago | (#31645376)

Which claims apply? I can see no claim that does not reference "information items [...] transferred between a plurality of servers connected on a distributed network". In fact, e-mail attachment dedup is seen as prior art (Background, fourth paragraph). File dedup is simpler than that.
 

Re:Patent 5,813,008 (2, Interesting)

pem (1013437) | more than 4 years ago | (#31645386)

A good lawyer could probably argue that this doesn't apply.

Claim 1(a) requires "dividing an information item into a common portion and a unique portion".

It may be that the patent covers the case where the unique portion is empty, but then again maybe not, especially if the computer never takes the step to find out! In other words, if you treat every item as a common item (even if there is only one copy), there is a good chance the patent might not apply.

(There is also a good chance that the patent is written the way it is specifically because it doesn't apply to that case -- it may be that there is prior art in one of the referenced patents.)

apologies to LL Cool J (0)

Anonymous Coward | more than 4 years ago | (#31645552)

stoolpigeon asks: Are you taking my deduplication investigation seriously or are you disrespecting my deduplication investigation?

-- minor misquote of LL Cool J speaking to Robin Wright in the movie Toys

This is for hard disks (2, Interesting)

ZERO1ZERO (948669) | more than 4 years ago | (#31644786)

Does software like ESX and others (Xen, etc.) perform this in memory already for running VMs? I.e., if you have 2 Windows VMs, will it only store one copy of the libs etc. in the host's memory?

Also, is there an easy way to get multiple machines running 'as one' to pool resources for running a VM setup? Does openMosix do that?

Re:This is for hard disks (1)

Wolfraider (1065360) | more than 4 years ago | (#31644862)

I remember reading on VMware's site that they do this if you have the VMware Tools installed. As for the pool, the only real way I know of is to run multiple machines with a load balancer.

Re:This is for hard disks (1)

TooMuchToDo (882796) | more than 4 years ago | (#31644902)

Both VMware and KVM can do this. Not sure about Xen. Google "memory deduplication $VM_TECH"

Re:This is for hard disks (2, Funny)

fatp (1171151) | more than 4 years ago | (#31645442)

I really googled "memory deduplication $VM_TECH"... It returned this post as the only result

what an idiot I am T.T

A hypothetical question. (1)

drolli (522659) | more than 4 years ago | (#31644794)

I appreciate any deduplication solution for Linux, for sure, but isn't any deduplication creating a lot of shared resources which could possibly be exploited for attacks (e.g. on the privacy of other users)?

Re:A hypothetical question. (1)

Wolfraider (1065360) | more than 4 years ago | (#31644884)

I don't believe so. The main way to dedup files that I know of is at the block level. The file system can keep a record of what blocks are recorded on the HDD. Now, if they do a checksum or equivalent on the blocks, they can determine which blocks are identical. Then it's as simple as recording 2 pointers to the same block. If it turns out that one of the records needs to change, then simply remove the pointer and save the changed record to a new block.
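A toy sketch of that bookkeeping, in Python (made-up names; the block size and hash choice are arbitrary, and this is not how SDFS actually implements it):

import hashlib

BLOCK_SIZE = 4096  # arbitrary; pick whatever granularity you dedup at

class ToyDedupStore:
    """Toy illustration: identical blocks are stored once, files keep pointers."""

    def __init__(self):
        self.blocks = {}   # checksum -> the single stored copy of that block
        self.files = {}    # filename -> list of checksums (the "pointers")

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()  # checksum used to spot identical blocks
            self.blocks.setdefault(fp, block)       # store the block only if it's new
            pointers.append(fp)
        self.files[name] = pointers                 # a changed block just gets a new pointer

    def read(self, name):
        return b"".join(self.blocks[fp] for fp in self.files[name])

store = ToyDedupStore()
store.write("vm1.img", b"A" * 8192)
store.write("vm2.img", b"A" * 8192)   # identical content: no new blocks get stored
print(len(store.blocks), "unique block(s) on 'disk'")   # -> 1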

Re:A hypothetical question. (1)

symbolset (646467) | more than 4 years ago | (#31644960)

Opendedup is file-based deduplication, much like Microsoft's Single Instance Storage. If I recall correctly there was a security problem with that some time ago, but I don't know if it was fixed.

Re:A hypothetical question. (1)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645704)

It's had a vulnerability because microsoft made it. Vulnerabilities are their signature.
And, as I explained before, it was a microsoft product (which means it wasn't fixed).

Re:A hypothetical question. (2, Interesting)

tlhIngan (30335) | more than 4 years ago | (#31645068)

I appreciate any deduplication solution for Linux, for sure, but isn't any deduplication creating a lot of shared resources which could possibly be exploited for attacks (e.g. on the privacy of other users)?

Most likely in the implementation itself, not the de-duplication process.

Let's say user A and B have some file in common. Without de-duplication, the file exists on both home directories. With de-duplication, one copy of the file exists for both users. Now, if there is an exploit such that you could find out if this has happened, then user A or B will know that the other has a copy of the same file. That knowledge could be useful.

Ditto on critical system files - if you could generate a file and have it match a protected system file, this might be useful to exploit the system. E.g., /etc/shadow (which isn't normally world-readable). If you can find a way to tell that deduplication happened, you have effectively confirmed the contents of a file you couldn't read before.

Note that you can't *change* the file (because that would just split the files up again), but being able to read the file (when you couldn't before) or knowing that another copy exists elsewhere can be very useful knowledge. But the de-duplication mechanism must inadvertently reveal when this happens.

Re:A hypothetical question. (1)

drolli (522659) | more than 4 years ago | (#31645302)

Yes, that was the thing I had in mind. I imagined that you can make timing measurements. So, for example, two isolated VMs running on the same physical dedup FS can exchange information (unless the underlying OS intentionally delays the return from the call). I actually think you could run programs creating a lot of specially crafted file contents in two VMs, thus circumventing networking restrictions.
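Something like this crude probe is what I'm picturing (purely hypothetical: it assumes inline dedup on the write path, a made-up mount point, and that the latency difference isn't masked, which a careful implementation would do):

import os, time

def timed_write(path, data):
    # Write the data and fsync it, returning the elapsed time in seconds.
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start

# A guessed block the other VM might already store, vs. fresh random data it can't have.
guess = b"contents you suspect the other VM already stores".ljust(4096, b"\x00")
noise = os.urandom(4096)

t_guess = timed_write("/mnt/dedupfs/probe_guess", guess)   # mount point is made up
t_noise = timed_write("/mnt/dedupfs/probe_noise", noise)

# If writes of already-stored content are consistently cheaper (or dearer) than writes
# of fresh content over many trials, the FS is leaking whether the guessed block exists.
print(f"guess: {t_guess:.6f}s  noise: {t_noise:.6f}s")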

Re:A hypothetical question. (1)

drsmithy (35869) | more than 4 years ago | (#31645694)

Note that you can't *change* the file (because that would just split the files up again), but being able to read the file (when you couldn't before) or knowing that another copy exists elsewhere can be very useful knowledge.

If you can "generate a file" that can be deduplicated, then by definition you already know about the date in that file.

Re:A hypothetical question. (1)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645696)

Leaving aside vulnerabilities in any particular implementation, the only possible attack vector I see would be a brute-force approach. Basically, a user in one VM creates random n-byte files with all possible combinations of content of that size (of course, this would only be feasible for very small files, but /etc/shadow is usually small enough, and so is everything in $HOME/.ssh/). Eventually, the user would create a file that would match a copy on another VM. Of course, this would be useless without a way to check whether another file was matched and deduplication took place. If the deduplication solution has any virtual guest software (like VMware Tools), and that tool shares this kind of information with other systems, it might be possible, but that's a big might.

Any reasonably implemented deduplication solution should be 100% transparent to the guest, and very secure.

And, to all the people talking about "shared resources", deduplication doesn't create "shared resources". Deduplication is not similar to symbolic links (ln -s). If you want to compare it to links, you have to compare it to hard links, and that would be hard links that automatically dereferenced and created a new copy of the file with all the blocks as soon as the user wanted to write to that file. Remember, as soon as the file changes on any given guest, the information is not the same anymore, and so that file is not de-duplicated anymore. A user can change his copy of the file, not other people's files.

Hasn't this been posted before? (5, Funny)

Required Snark (1702878) | more than 4 years ago | (#31644812)

Just wondering...

Re:Hasn't this been posted before? (1, Funny)

Anonymous Coward | more than 4 years ago | (#31644890)

If so, how about we just reference that post?

Re:Hasn't this been posted before? (1)

Hurricane78 (562437) | more than 4 years ago | (#31645064)

Well, at least this comment has been posted before.

Dude, you’re only piling it up. Like with trolling: If you react to it, you only make it worse.

And because I’m not better, I’m now gonna end it, by stating that: yes, yes, I’m also not making it better. ^^
Oh wait... now I am! :)

deduplication (0, Offtopic)

hduff (570443) | more than 4 years ago | (#31644898)

What kind of lame recursive acronym is "deduplication"?

I'm flummoxed in any attempt to decipher it.

Re:deduplication (1)

deniable (76198) | more than 4 years ago | (#31644994)

It's neither an acronym nor an abbreviation. Duplication is making copies. De-duplication is getting rid of the copies.

Re:deduplication (2, Funny)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645732)

So, Blade Runner was about de-duplication?

I was worried for a second there... (-1, Troll)

Anonymous Coward | more than 4 years ago | (#31644900)

I thought it said "Open Source Decapitation".

Re:I was worried for a second there... (0)

Anonymous Coward | more than 4 years ago | (#31645356)

that's one instance where the troll's term "open sores" could be used for humorous effect, instead of just grating people...

Re:I was worried for a second there... (-1, Offtopic)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645734)

Excellent! (-1, Flamebait)

Anonymous Coward | more than 4 years ago | (#31644904)

Once again the laziness of software types requires the ingenuity of massive hardware to compensate. Keep it up programmers! Soon you can have a retarded clam programming stuff that runs on a 6500THz processor and you'll STILL blame the computer for being slow!

Re:Excellent! (1, Redundant)

jtownatpunk.net (245670) | more than 4 years ago | (#31644930)

Yeah, I gave up on bitching about code inefficiency back in the early 90s. Do they even teach assembly any more?

Offtopic? (3, Informative)

SanityInAnarchy (655584) | more than 4 years ago | (#31645024)

If you'd mentioned the fact that this appears to be written in Java, you might have a point. But despite this, and the fact that it's in userland, they seem to be getting pretty decent performance out of it.

And keep in mind, all of this is to support reducing the amount of storage required on a hard disk, and it's a fairly large programming effort to do so. Seems like this entire project is just the opposite of what you claim -- it's software types doing extra work so they can spend less on storage.

Re:Excellent! (1)

az1324 (458137) | more than 4 years ago | (#31645264)

They are very poor programmers. Almost nothing works in retarded clam shell (rcsh).

Let's get down to brass tacks. (0, Redundant)

jtownatpunk.net (245670) | more than 4 years ago | (#31644918)

Does this mean I'll finally be able to store my entire porn collection on a single volume?

Re:Let's get down to brass tacks. (1)

nystire (871449) | more than 4 years ago | (#31644932)

AND it will make sure that all those 60,000 duplicate files no longer take up most of your hard drive space!

Re:Let's get down to brass tacks. (1)

SanityInAnarchy (655584) | more than 4 years ago | (#31645002)

Well, just how repetitive is your porn collection?

Re:Let's get down to brass tacks. (3, Funny)

Hooya (518216) | more than 4 years ago | (#31645204)

very repetitive. back and forth. back and forth. oh wait... that's not what you meant. never mind.

redundant if saving large amounts of data to SAN (0)

Anonymous Coward | more than 4 years ago | (#31644940)

If you are storing that amount of data, wouldn't you use a SAN, and don't most of those already have data de-duplication technology? I suppose this project will be pillaged by all of the backup appliance manufacturers and those who build consumer-grade NAS devices.

Re:redundant if saving large amounts of data to SA (1)

dbIII (701233) | more than 4 years ago | (#31645094)

Consider that things may be spread over more than one SAN or that it is a situation where an old style file server makes better sense anyway.

Re:redundant if saving large amounts of data to SA (1)

afidel (530433) | more than 4 years ago | (#31645664)

Not every SAN has dedupe; for instance, my HP EVA doesn't. Also, many of the low-end NetApp boxes have processors too anemic to do dedupe. Most of the low-end iSCSI boxes also lack dedupe.

Yea, I RTFA, but... (2, Interesting)

mrsteveman1 (1010381) | more than 4 years ago | (#31644966)

...from what I can tell, this is NOT a way to deduplicate existing filesystems or even a layer on top of existing data, but a new filesystem operating perhaps like eCryptfs, storing backend data on an existing filesystem in some FS-specific format.

So, having said that, does anyone know if there is a good way to resolve EXISTING duplicate files on Linux using hard links? For every identical pair found, a+b, b is deleted and instead hardlinked to a? I know there are plenty of duplicate file finders (fdupes, some windows programs, etc), but they're all focused on deleting things rather than simply recovering space using hardlinks.

Re:Yea, I RTFA, but... (1)

Aluvus (691449) | more than 4 years ago | (#31645054)

FSlint's "merge" option will do what you want.

Re:Yea, I RTFA, but... (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645208)

Hmm, yea i've used FSLint but I didn't pay close enough attention to the options it seems :)

Thanks

Re:Yea, I RTFA, but... (3, Informative)

dlgeek (1065796) | more than 4 years ago | (#31645058)

You could easily write a script to do that using find, sha1sum or md5sum, sort and link. It would probably only take about 5-10 minutes to write but you most likely don't want to do that. When you modify one item in a hard linked pair, the other one is edited as well, whereas a copy doesn't do this. Unless you are sure your data is immutable, this will lead to problems down the road.

Deduplication systems pay attention to this and maintain independent indexes to do copy-on-write and the like to preserve the independence of each reference.
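For the curious, the throwaway version would look roughly like this sketch in Python (an illustration only; it keys on file size plus SHA-1 and hard-links byte-identical files, with all the caveats above about shared edits and differing metadata):

import hashlib, os, sys

def sha1_of(path, bufsize=1 << 20):
    # Hash the file in chunks so large files don't blow up memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def hardlink_dupes(root):
    seen = {}                                   # (size, sha1) -> first path with that content
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            key = (os.path.getsize(path), sha1_of(path))
            if key in seen and not os.path.samefile(seen[key], path):
                os.unlink(path)                 # replace the copy...
                os.link(seen[key], path)        # ...with a hard link to the first one
            else:
                seen.setdefault(key, path)

if __name__ == "__main__":
    hardlink_dupes(sys.argv[1])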

Re:Yea, I RTFA, but... (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645220)

If I couldn't find a good tool from responses here I would have written one for sure.

Re:Yea, I RTFA, but... (2, Interesting)

Lorens (597774) | more than 4 years ago | (#31645082)

I wrote fileuniq (http://sourceforge.net/projects/fileuniq/) exactly for this reason. You can symlink or hardlink, decide how identical a file must be (timestamp, uid...), or delete.

It's far from optimized, but I accept patches :-)

Re:Yea, I RTFA, but... (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645200)

Sweet! Thanks a lot :)

Re:Yea, I RTFA, but... (1)

symbolset (646467) | more than 4 years ago | (#31645098)

There are security problems with this. The duplicate files might have different metadata - for example, access privileges.

For real (block level) deduplication, try lessfs or zfs.

Re:Yea, I RTFA, but... (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645206)

That can be managed for simple use cases, but yea i see your point.

Or get inline deduplication (1)

anilg (961244) | more than 4 years ago | (#31644972)

with NexentaStor CE [nexentastor.org] , which is based on OpenSolaris b134. It's free.. and has an excellent Storage WebUI. /plug

For a detailed explanation of OpenSolaris dedup see this blog entry [sun.com] .

~Anil

Re:Or get inline deduplication (1)

anilg (961244) | more than 4 years ago | (#31645008)

Grr.. meant inline/kernel dedup.

Re:Or get inline deduplication (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645010)

Plus you get the "real" ZFS, zones, and tightly integrated, bootable system rollbacks using zfs clones :)

Re:Or get inline deduplication (1)

itsme1234 (199680) | more than 4 years ago | (#31645894)

Plus you get the "real" ZFS, zones, and tightly integrated, bootable system rollbacks using zfs clones :)

Plus you get the "real" opensolaris experience:

- poor (like really really poor) hardware compatibility. Starting with basic stuff, many on-board Ethernet controllers with flaky or no support, very hard to choose a motherboard that's available and without too many compromises and fully supported. A guy asked if Android pairing is available (to use phone as modem for OpenSolaris), made me spill my coffee...
- doubtful future
- no security patches (yes, you read that right)
- major features like ZFS encryption slipping schedule for years (in the works since 2008; the last promise was that it would be in the 2010.2 release, which itself slipped to 2010.3, and that one seems to be delayed as well, since it was supposed to ship on the 26th; in any case it's quite sure that encryption won't make it anyway)

Thanks, but no thanks.

It wants Java :-( (0)

Anonymous Coward | more than 4 years ago | (#31644992)

I wonder how well it performs, or if this is just functionality for demonstration purposes ?

How useful is this in realistic scenarios? (1)

marvin2k (685952) | more than 4 years ago | (#31645012)

Given that usually most of the disk space is swallowed by the data of an application, and that data rarely is identical to the data on another system (why would you have two systems then?), I wonder how much this approach really buys you in "normal" scenarios, especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference, but can someone enlighten me how this is useful on, say, a physical system with 10 CentOS VMs running different apps, or similar apps with different data? You might save a few blocks because of the shared OS files, but if you did a proper minimal OS install, the gain hardly seems to be worth the effort.

Re:How useful is this in realistic scenarios? (1)

dlgeek (1065796) | more than 4 years ago | (#31645100)

It sounds to me like you have a very narrow view of what constitutes "realistic scenarios".
  • A high-availability mail system that has multiple servers handling client mail storage. VMs are used for rapid failover in the case of hardware failure. Sounds pretty realistic to me. Deduplication is extremely helpful when there are many copies of the same attachment as many users forward it around.
  • A large set of VMs which are used for testing the software you develop with a variety of possible end-user configurations. Sounds pretty realistic to me. Deduplication is extremely helpful to save space storing the base OS libraries and such.
  • You have a server (or set of servers) which is/are responsible for backing up a large number of other computers. Sounds pretty realistic to me. Deduplication is extremely helpful when these computers have files that are identical. (Hell, deduplication can make it much easier to do incremental backups of a single computer).

These all sound very realistic to me...

Re:How useful is this in realistic scenarios? (4, Informative)

mysidia (191772) | more than 4 years ago | (#31645490)

First of all... one of the most commonly duplicated blocks is the NUL block, that is, a block of data where all bits are 0, corresponding to unused space, or space that was used and then zeroed.

If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have at least 20GB that could be freed up by dedup.

Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.

Even on a single system, many system binaries and libraries, will contain duplicate blocks.

Of course multiple binaries statically linked against the same libraries will have dups.

But also, there is a common structure to certain files in the OS, similarities between files so great, that they will contain duplicate blocks.

Then if the system actually contains user data, there is probably duplication within the data.

For example, mail stores... will commonly have many duplicates.

One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.

If users store files on the system, they will commonly make multiple copies of their own files..

Ex... mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc

Can MS Word files be large enough to matter? Yes.. if you get enough of them.

Besides, they have a common structure that is the same for almost all MS Word files. Even documents whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past -- it's supposed to be a very small amount of space per file, but in reality a small amount of waste multiplied by thousands of files adds up.

Just because data seems to be all different doesn't mean dedup won't help with storage usage.

Re:How useful is this in realistic scenarios? (1)

Lorens (597774) | more than 4 years ago | (#31645128)

A major use case is NAS for users. Think of all those multi-megabyte files, stored individually by thousands of users.

However, normally deduplication is block level, under the filesystem, invisible to the user. This is implemented by NetApp SANs, for instance. After having RTFA, OpenDedup seems to be file-level, running between the user and an underlying file system. I'm not sure it's a good idea.

Re:How useful is this in realistic scenarios? (1)

snikulin (889460) | more than 4 years ago | (#31645152)

Well, a really good and useful "home" scenario is a system backup of multiple computers with the same OS.
OS itself plus common software takes at least 20-30 GB per installation these days.

My WHS (which does support de-dup in form of Single-instance storage [wikipedia.org] ) server keeps full backup (3-months worth) of my seven Windows home computers on about 60 GB.

Unfortunately SIS does not work for WHS shared folders, so my two Linux machines' (my version control & gallery servers) rsync backups over SMB are not de-duplicated by WHS.

I could probably save only /etc, /var and /srv of each server, but so far I backup everything.

Re:How useful is this in realistic scenarios? (3, Informative)

QuantumRiff (120817) | more than 4 years ago | (#31645282)

If you cut up a large file into lots of chunks of whatever size, let's say 64KB each, you can then look at the chunks. If you have two chunks that are the same, you remove the second one and just place a pointer to the first one. Data deduplication is much more complicated than that in real life, but basically, the more data you have, or the smaller the chunks you look at, the more likely you are to have duplication, or collisions. (How many Word documents have the same few words in a row? Remove every repeat of the phrase "and then the" and replace it with a pointer, if you will.)
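A quick way to see how much one file would shrink under that scheme (just a sketch with fixed-size chunks; real products use variable-size chunking and persistent indexes):

import hashlib, sys

def chunk_stats(path, chunk_size=64 * 1024):
    # Count total vs. unique fixed-size chunks in a single file.
    seen, total = set(), 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += 1
            seen.add(hashlib.sha256(chunk).digest())
    return total, len(seen)

if __name__ == "__main__":
    total, unique = chunk_stats(sys.argv[1])
    saved = 100 * (1 - unique / total) if total else 0
    print(f"{total} chunks, {unique} unique -> roughly {saved:.1f}% duplicate")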

This is also similar to WAN acceleration, which at a high enough level, is just deduplicating traffic that the network would have to transmit.

It is amazing how much space you can free up when you're not just looking at the file level. This has become very big in recent years because storage has exploded, and processors are finally fast enough to do this in real time.

Re:How useful is this in realistic scenarios? (1)

jdoverholt (1229898) | more than 4 years ago | (#31645350)

If you look at the sales materials from any of the big vendors (EMC, I'm looking at you), even a single system image shows reduction in size through block-level deduplication--even more through variable-sized blocks. I can't recall the exact numbers, I'm at the end of a terribly long week, but I think it was somewhere around 10-30% reduction in the day-0 backup size. Subsequent days typically see a >95% reduction.

All sales literature, mind you. My personal experience with it will begin in a few months, when we get our new Celerra installed :-)

P.S. Remember that a project such as this is good because it offers high-dollar features to low-dollar players who enjoy tinkering in their basements. Such was the goal of Linux in the first place. It's how, on a three-figure budget, a dedicated nerd can set up a several-terabyte file server with software RAID-6 protection and (soon) data deduplication--stuff you'd pay EMC 100-1000 times as much for.

Re:How useful is this in realistic scenarios? (1)

drsmithy (35869) | more than 4 years ago | (#31645742)

All sales literature, mind you. My personal experience with it will begin in a few months, when we get our new Celerra installed :-)

As far as I know, Celerras only do file-level dedupe.

Re:How useful is this in realistic scenarios? (1)

Spad (470073) | more than 4 years ago | (#31645596)

All good dedupe systems are block-level, not file-level, so you don't just save where whole files are identical but on *any* identical data that's on the disks.

If you're running VMs with the same OS you'll probably find that close to 70% of the data can be de-duplicated - and that's before you consider things like farms of clustered servers where you have literally identical configs, or fileservers with lots of idiots saving 40 "backup" copies of the same 2GB Access database just in case they need it.

Our deduped backup array is currently storing ~70TB of backups on 10TB of raw space and it's only about 40% full - to me, that's useful.

Re:How useful is this in realistic scenarios? (3, Informative)

drsmithy (35869) | more than 4 years ago | (#31645736)

I wonder how much this approach really buys you in "normal" scenarios, especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference, but can someone enlighten me how this is useful on, say, a physical system with 10 CentOS VMs running different apps, or similar apps with different data? You might save a few blocks because of the shared OS files, but if you did a proper minimal OS install, the gain hardly seems to be worth the effort.

Assume 200 VMs at, say, 2GB per OS install. Allowing for some uniqueness, you'll probably end up using something in the ballpark of 20-30GB of "real" space to store 400GB of "virtual" data. That's a *massive* saving, not only disk space, but also in IOPS, since any well-engineered system will carry that deduplication through to the cache layer as well.

Deduplication is *huge* in virtual environments. The other big place it provides benefits, of course, is D2D backups.

This just gave me a good idea! (3, Interesting)

thePowerOfGrayskull (905905) | more than 4 years ago | (#31645014)

Actually, just the title did it. I've historically had a bad habit of backing things up by taking tar/gzs of directory structures, giving them an obscure name, and putting them onto network storage. Or sometimes just copying directory structures without zipping first. Needless to say, this makes for a huge mess.

It just occurred to me that it would not be difficult to write a quick script to extract everything into its own tree, run sha1sum on all files, and identify duplicate files automatically, probably in just one or two lines.

So in other words -- thanks Slashdot! The otherwise unintelligible summary did me a world of good -- mostly because there was no context as to what the hell it was talking about, so I had to supply my own definition...

Re:This just gave me a good idea! (3, Informative)

Hooya (518216) | more than 4 years ago | (#31645230)

try this::

mv backup.0 backup.1
rsync -a --delete --link-dest=../backup.1 source_directory/ backup.0/

see this [mikerubel.org]

Re:This just gave me a good idea! (1)

devent (1627873) | more than 4 years ago | (#31645382)

Why not just rdiff-backup? rdiff-backup.nongnu.org

Re:This just gave me a good idea! (1)

evilviper (135110) | more than 4 years ago | (#31646006)

Why not just rdiff-backup?

Why not just rsync? ...
Now that that's out of the way...

For one thing, rdiff will give you a mess... rsync will give you multiple full filesystem trees...

Re:This just gave me a good idea! (1, Informative)

Anonymous Coward | more than 4 years ago | (#31645244)

Re:This just gave me a good idea! (0)

Anonymous Coward | more than 4 years ago | (#31645316)

This script identifies duplicate files in one swift line, using only find, xargs, md5sum and awk:

http://www.gedankenverbrechen.org/~tk/finddupes.sh

Re:This just gave me a good idea! (1)

CAIMLAS (41445) | more than 4 years ago | (#31645392)

Two things to look into:

rsync snapshots [mikerubel.org]
rsnapshot, for a better rsync snapshot [rsnapshot.org]

Re:This just gave me a good idea! (1)

thePowerOfGrayskull (905905) | more than 4 years ago | (#31645698)

Most recently, I've been moving most of my documents and source-code level stuff to a LAN-based SVN repository; then periodically dumping that, encrypting, and tossing it onto dropbox. The versioning is good, but it's not so practical for downloaded files and various other content types.

I'll take a look at this - thanks for the post.

Re:This just gave me a good idea! (0)

Anonymous Coward | more than 4 years ago | (#31645502)

look into fdupes for finding/dealing w/ dupes.

Re:This just gave me a good idea! (1)

thePowerOfGrayskull (905905) | more than 4 years ago | (#31645682)

Even better -- thanks!

Look at StoreBackup (1)

bradley13 (1118935) | more than 4 years ago | (#31645712)

We're a bit off topic here, seeing as this has nothing to do with file systems, but being off-topic is on-topic for /.

Anyhow: StoreBackup is a great backup system that automatically detects duplicates.

This workgeat,nfc... (0)

Anonymous Coward | more than 4 years ago | (#31645046)

I usedthichnlgywrp!

User Land? Come on! (1)

Gazzonyx (982402) | more than 4 years ago | (#31645092)

[...] Opendedup runs in user space, making it platform independent, easier to scale and cluster, [...]

... and slow, prone to locking issues, etc. There's a reason no one runs ZFS over FUSE, why would we do it with this?

Confusing summary (1)

Brian Gordon (987471) | more than 4 years ago | (#31645108)

The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes

Can anyone offer wisdom on what the volume size is supposed to signify, being different from the maximum size that SDFS is scalable to?

Re:Confusing summary (1)

Spad (470073) | more than 4 years ago | (#31645618)

It's the raw capacity of the filesystem compared to the maximum amount of deduplicated data it can handle. So you can have 8PB of raw disk space, on which you can store up to 8EB of deduplicated data (depending on the dedupe ratios you get - I think 1000x is a little optimistic; 30x-40x is more common).

"support for deduplication at 4K block sizes" (0)

Anonymous Coward | more than 4 years ago | (#31645146)

Why would anyone keep their blocks so cold?

It finally happened (1)

thesymbolicfrog (907527) | more than 4 years ago | (#31645272)

I stopped being able to read English. WTF does any of that mean? Is it written in moonspeak?

Off-site replication (1)

Dishwasha (125561) | more than 4 years ago | (#31645338)

One of the biggest targets for data de-duplication is efficient off-site replication, which you see in the EMC Avamar product line. This is advantageous when your WAN links aren't fast enough for synchronous replication and a scheduled asynchronous replication would take too long. I'd like to see the SDFS storage engine be intelligent enough to snapshot the data, then, when the next "backup/replication" occurs, gather up the hashes of all the blocks that have changed since the snapshot was created, communicate those hashes to the off-site system, and transfer just the blocks that don't currently have a matching hash on the target system. The target system receives a complete hash table update of the snapshot block difference from the source, and then both systems merge their snapshots and take a new snapshot to get ready for the next replication cycle.
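In toy Python, the round trip I'm describing looks roughly like this (hypothetical helper names; not anything Avamar or SDFS actually exposes):

import hashlib

def block_hash(block):
    return hashlib.sha256(block).digest()

def replicate(changed_blocks, target_has):
    # changed_blocks: dict of offset -> block bytes changed since the last snapshot.
    # target_has:     callable(hash) -> bool, answered by the off-site system.
    # The remote gets the full offset -> hash table, but only the payloads it lacks.
    hash_table_update = {}
    blocks_to_send = {}
    for offset, block in changed_blocks.items():
        h = block_hash(block)
        hash_table_update[offset] = h
        if not target_has(h):
            blocks_to_send[h] = block
    return hash_table_update, blocks_to_send

# Toy run: the target already holds one of the two changed blocks.
remote_store = {block_hash(b"A" * 4096)}
changed = {0: b"A" * 4096, 4096: b"B" * 4096}
table, payload = replicate(changed, lambda h: h in remote_store)
print(len(table), "pointer(s) updated,", len(payload), "block(s) shipped")   # 2 and 1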

See also: LessFS (1)

kb1 (1764484) | more than 4 years ago | (#31645408)

The LessFS project also deserves mention: http://www.lessfs.com/ [lessfs.com] . Just think of the effect of combining a deduplication system with an iSCSI shared virtual tape library like http://sites.google.com/site/linuxvtl2/ [google.com]

New use for an old algorithm? (1)

Squatting_Dog (96576) | more than 4 years ago | (#31645554)

Isn't this just an application of 'tokenizing' as it is used in compression of data streams? Build an index of unique (read: non-repetitive) data segments and store the (smaller) index and resulting data?

This has been around for some time....hard to believe that this use has just come to light.
