Open Source Deduplication For Linux With Opendedup

timothy posted more than 4 years ago | from the its-missing-apostrophes dept.

Data Storage

tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."

In case you don't know much about it (5, Informative)

stoolpigeon (454276) | more than 4 years ago | (#31644772)

Data deduplication [wikipedia.org]
( I don't )

Re:In case you don't know much about it (4, Informative)

MyLongNickName (822545) | more than 4 years ago | (#31644882)

Data deduplication is huge in virtualized environments. Four virtual servers with identical OS's running on one host server? Deduplicate the data and save a lot of space.

This is even bigger in the virtualized desktop environment, where you could literally have hundreds of PCs virtualized on the same physical box.

Re:In case you don't know much about it (1)

Hurricane78 (562437) | more than 4 years ago | (#31645048)

Unless you “deduplicate” the CPU work, that’s not going to happen. ^^

Re:In case you don't know much about it (2, Informative)

rubycodez (864176) | more than 4 years ago | (#31645110)

hundreds of virtualized desktops per physical server do happen; my employer sells such solutions from several vendors.

Re:In case you don't know much about it (3, Informative)

MyLongNickName (822545) | more than 4 years ago | (#31645170)

If you have a couple hundred people running business apps, it ain't all that difficult. Generally you will get spikes of CPU utilization that last a few seconds, mashed between many minutes, or even hours, of very low CPU utilization. A powerful server can handle dozens or even hundreds of virtual desktops in this type of environment.

Re:In case you don't know much about it (1)

drsmithy (35869) | more than 4 years ago | (#31645686)

Unless you "deduplicate" the CPU work, that's not going to happen. ^^

Sure it does. CPU power is generally the _last_ thing you run out of in virtualised environments, and that's been true for years.

On a modern, Core i7-based server, you should be able to get 10+ "virtual desktops" per core on average, without too much trouble. IOPS and RAM are typically your two biggest limitations.

Re:In case you don't know much about it (1)

fyoder (857358) | more than 4 years ago | (#31645136)

I don't know much about the subject, so forgive me if this is a dumb question, but in that scenario, if the data for a file becomes corrupted on the hard drive, say a critical system file, doesn't that mean that all vm's using it are pooched?

Re:In case you don't know much about it (4, Informative)

zappepcs (820751) | more than 4 years ago | (#31645250)

In a word, No. There are many types of 'virtualization' and more than one approach to de-duplication. In a system as engineered as one with de-duplication, you should have replication as part of the data integrity processes. If the file is corrupted in all the main copies (everywhere it exists, including backups) then the scenario you describe would be correct. This is true for any individual file that exists on computer systems today. De-duplication strives to reduce the number of copies needed across some defined data 'space' whether that is user space, or server space, or storage space etc.

This is a problem in many aspects of computing. Imagine you have a business with 50 users. Each must use a web application which has many graphics. The browser cache of each user has copies of each of those graphics. When the caches are backed up, the backup is much larger than it needs to be. You can do several things to reduce backup times, storage space, and user quality of service:

1 - disable caching for that site in the browser and cache them on a single server locally located
2 - disable backing up the browser caches, or back up only one
3 - enable deduplication in the backup and storage processes
4 - implement all or several of the above

The problems are not single-ended, and the answers or solutions will also not be single-ended or single-faceted. That is, no one solution is the answer to all possible problems. This one has some aspects to it that are appealing to certain groups of people. Your average home user might not be able to take advantage of this yet. Small businesses, though, might need to start looking at this type of solution. Think how many people got the same group email message with a 12MB attachment. How many times do all those copies get archived? In just that example you see the waste that duplicated data represents. Solutions such as this offer an affordable way to positively affect bottom lines in fighting those types of problems.

Re:In case you don't know much about it (1)

jamesh (87723) | more than 4 years ago | (#31645504)

I don't know much about the subject, so forgive me if this is a dumb question, but in that scenario, if the data for a file becomes corrupted on the hard drive, say a critical system file, doesn't that mean that all vm's using it are pooched?

Yes, but not because of deduplication. If you had one sector go bad then yes, you could affect many more VMs if you were using data deduplication than if you weren't, but in my experience, data corruption is seldom just a '1 sector' thing, and once you detect it you should restore anything that uses that disk from a backup that was probably taken before the corruption started (which is tricky... how do you know when that was?)

Bitrot is one of the nastiest failure modes around.

Re:In case you don't know much about it (0)

Anonymous Coward | more than 4 years ago | (#31645604)

Dedup, as well as checksumming of all disk data to detect and prevent (by automatic correction) silent data corruption, are good things, precisely because they avoid the bitrot / corruption issues mentioned above. Unfortunately ZFS still seems way ahead of Linux filesystems like ext4, btrfs, et al., since ZFS does these things already in modern OpenSolaris versions. FreeBSD's ZFS will likely get dedup within a few months as they integrate newer versions of the ZFS code.

It is nice that Linux is getting new features like this, but it seems like too little, too late. IMHO checksumming, dedup, encryption, et al. should all have been in a production-grade and commonly used free Linux filesystem by now. Having one or two without the rest is still disappointing; given the slow pace at which filesystems are deployed and evolve, it'll be perhaps years before Linux's filesystems catch up to where ZFS is today.

Re:In case you don't know much about it (1)

drsmithy (35869) | more than 4 years ago | (#31645670)

I don't know much about the subject, so forgive me if this is a dumb question, but in that scenario, if the data for a file becomes corrupted on the hard drive, say a critical system file, doesn't that mean that all vm's using it are pooched?

Yes, but a) this is something inherent to anything using shared resources, and b) there's not a lot of scope for such corruption to happen in a decent system (RAID, block-level checksums, etc).

Re:In case you don't know much about it (1)

fatp (1171151) | more than 4 years ago | (#31645456)

It is also huge for Java developers, as every Java application normally installs at least one JDK and JRE.

Re:In case you don't know much about it (2, Funny)

fatp (1171151) | more than 4 years ago | (#31645466)

Oh in fact it requires jdk 7...

Re:In case you don't know much about it (0, Offtopic)

fm6 (162816) | more than 4 years ago | (#31645470)

Yeah, that makes sense. But I had to do some googling to figure that out. If Slashdot lived up to its pretense to be a news site, the editors would take a few minutes to summarize the concept, or at least point at the appropriate Wikipedia article. It's beyond lame that they can't be bothered.

One wonders if they even bother to read the stories they post — and what they do with the remaining 7 3/4 hours in the work day after they've picked out the stories.

Re:In case you don't know much about it (2, Funny)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645646)

Hey, slow down cowboy. Explain that concept to me again. I don't know if it's applicable here, but if we find a way to implement it, it might just prove revolutionary.

I work in the quality assurance department of Geeknet Inc, Slashdot's parent company. We are constantly looking for ways to improve all the sites on our network.

I don't know if this method you propose, that, if I understand correctly, would involve parsing the content of the html document linked, and having an editor analyze the output of such html document after being rendered (let's call it, reading the story), is at all possible. But if we implement it the right way, it might prove useful.

We'll get our research team to work over this reading-the-story concept. It's something absolutely novel to us, so it might take a while. We'll let you know when we reach a conclusion, so that we might license this reading-the-story technology from you.

Kind Regards,
Lazy Rodriguez
GeekNet INC.

Patent 5,813,008 (0, Offtopic)

snikulin (889460) | more than 4 years ago | (#31645160)

September 22, 1998.
Single instance storage of information [uspto.gov]

Re:Patent 5,813,008 (1)

Lorens (597774) | more than 4 years ago | (#31645194)

Good try, but after skimming it, does not seem to apply. Seems to be for deduplicating e-mail attachments.

Re:Patent 5,813,008 (1)

snikulin (889460) | more than 4 years ago | (#31645218)

Re:Patent 5,813,008 (0)

Anonymous Coward | more than 4 years ago | (#31645260)

The claims apply. It runs out in 2018, though.

Re:Patent 5,813,008 (1)

Lorens (597774) | more than 4 years ago | (#31645376)

Which claims apply? I can see no claim that does not reference "information items [...] transferred between a plurality of servers connected on a distributed network". In fact, e-mail attachment dedup is seen as prior art (Background, fourth paragraph). File dedup is simpler than that.
 

Re:Patent 5,813,008 (2, Interesting)

pem (1013437) | more than 4 years ago | (#31645386)

A good lawyer could probably argue that this doesn't apply.

Claim 1(a) requires "dividing an information item into a common portion and a unique portion".

It may be that the patent covers the case where the unique portion is empty, but then again maybe not, especially if the computer never takes the step to find out! In other words, if you treat every item as a common item (even if there is only one copy), there is a good chance the patent might not apply.

(There is also a good chance that the patent is written the way it is specifically because it doesn't apply to that case -- it may be that there is prior art in one of the referenced patents.)

apologies to LL Cool J (0)

Anonymous Coward | more than 4 years ago | (#31645552)

stoolpigeon asks: Are you taking my deduplication investigation seriously or are you disrespecting my deduplication investigation?

-- minor misquote of LL Cool J speaking to Robin Wright in the movie Toys

This is for hard disks (2, Interesting)

ZERO1ZERO (948669) | more than 4 years ago | (#31644786)

Does software like ESX and others (Xen, etc.) perform this in memory already for running VMs? I.e., if you have 2 Windows VMs, will it only store one copy of the libs etc. in the host's memory?

Also, is there an easy way to get multiple machines running 'as one' to pool resources for running a VM setup? Does openMosix do that?

Re:This is for hard disks (1)

Wolfraider (1065360) | more than 4 years ago | (#31644862)

I remember reading on VMware's site that they do this if you have the VMware Tools installed. As for the pool, the only real way I know of is to run multiple machines with a load balancer.

Re:This is for hard disks (1)

TooMuchToDo (882796) | more than 4 years ago | (#31644902)

Both VMware and KVM can do this. Not sure about Xen. Google "memory deduplication $VM_TECH"

Re:This is for hard disks (2, Funny)

fatp (1171151) | more than 4 years ago | (#31645442)

I really googled "memory deduplication $VM_TECH"... It returned this post as the only result

what an idiot I am T.T

A hypothetical question. (1)

drolli (522659) | more than 4 years ago | (#31644794)

I appreciate any deduplication solution for Linux, for sure, but isn't any deduplication creating a lot of shared resources which could possibly be exploited for attacks (e.g. on the privacy of other users)?

Re:A hypothetical question. (1)

Wolfraider (1065360) | more than 4 years ago | (#31644884)

I don't believe so. The main way to dedup files that I know of is at the block level. The file system can keep a record of what blocks are recorded on the HDD. Now, if they do a checksum or equivalent on the blocks, they can determine which blocks are identical. Then it's as simple as recording 2 pointers to the same block. If it turns out that one of the records needs to change, then simply remove the pointer and save the changed record to a new block.
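A toy sketch of that bookkeeping, in Python (made-up names; the block size and hash choice are arbitrary, and this is not how SDFS actually implements it):

import hashlib

BLOCK_SIZE = 4096  # arbitrary; pick whatever granularity you dedup at

class ToyDedupStore:
    """Toy illustration: identical blocks are stored once, files keep pointers."""

    def __init__(self):
        self.blocks = {}   # checksum -> the single stored copy of that block
        self.files = {}    # filename -> list of checksums (the "pointers")

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()  # checksum used to spot identical blocks
            self.blocks.setdefault(fp, block)       # store the block only if it's new
            pointers.append(fp)
        self.files[name] = pointers                 # a changed block just gets a new pointer

    def read(self, name):
        return b"".join(self.blocks[fp] for fp in self.files[name])

store = ToyDedupStore()
store.write("vm1.img", b"A" * 8192)
store.write("vm2.img", b"A" * 8192)   # identical content: no new blocks get stored
print(len(store.blocks), "unique block(s) on 'disk'")   # -> 1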

Re:A hypothetical question. (1)

symbolset (646467) | more than 4 years ago | (#31644960)

Opendedup is file-based deduplication, much like Microsoft's Single Instance Storage. If I recall correctly there was a security problem with that some time ago, but I don't know if it was fixed.

Re:A hypothetical question. (1)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645704)

It's had a vulnerability because microsoft made it. Vulnerabilities are their signature.
And, as I explained before, it was a microsoft product (which means it wasn't fixed).

Re:A hypothetical question. (2, Interesting)

tlhIngan (30335) | more than 4 years ago | (#31645068)

I appreciate any deduplication solution for Linux, for sure, but isn't any deduplication creating a lot of shared resources which could possibly be exploited for attacks (e.g. on the privacy of other users)?

Most likely in the implementation itself, not the de-duplication process.

Let's say user A and B have some file in common. Without de-duplication, the file exists on both home directories. With de-duplication, one copy of the file exists for both users. Now, if there is an exploit such that you could find out if this has happened, then user A or B will know that the other has a copy of the same file. That knowledge could be useful.

Ditto on critical system files - if you could generate a file and have it match a protected system file, this might be useful to exploit the system. E.g., /etc/shadow (which isn't normally world-readable). If you can find a way to tell that deduplication happened, you have effectively confirmed the contents of a file you couldn't read before.

Note that you can't *change* the file (because that would just split the files up again), but being able to read the file (when you couldn't before) or knowing that another copy exists elsewhere can be very useful knowledge. But the de-duplication mechanism must inadvertently reveal when this happens.

Re:A hypothetical question. (1)

drolli (522659) | more than 4 years ago | (#31645302)

Yes, that was the thing I had in mind. I imagined that you can make timing measurements. So, for example, two isolated VMs running on the same physical dedup FS can exchange information (unless the underlying OS intentionally delays the return from the call). I actually think you could run programs creating a lot of specially crafted file contents in two VMs, thus circumventing networking restrictions.
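Something like this crude probe is what I'm picturing (purely hypothetical: it assumes inline dedup on the write path, a made-up mount point, and that the latency difference isn't masked, which a careful implementation would do):

import os, time

def timed_write(path, data):
    # Write the data and fsync it, returning the elapsed time in seconds.
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start

# A guessed block the other VM might already store, vs. fresh random data it can't have.
guess = b"contents you suspect the other VM already stores".ljust(4096, b"\x00")
noise = os.urandom(4096)

t_guess = timed_write("/mnt/dedupfs/probe_guess", guess)   # mount point is made up
t_noise = timed_write("/mnt/dedupfs/probe_noise", noise)

# If writes of already-stored content are consistently cheaper (or dearer) than writes
# of fresh content over many trials, the FS is leaking whether the guessed block exists.
print(f"guess: {t_guess:.6f}s  noise: {t_noise:.6f}s")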

Re:A hypothetical question. (1)

drsmithy (35869) | more than 4 years ago | (#31645694)

Note that you can't *change* the file (because that would just split the files up again), but being able to read the file (when you couldn't before) or knowing that another copy exists elsewhere can be very useful knowledge.

If you can "generate a file" that can be deduplicated, then by definition you already know about the date in that file.

Re:A hypothetical question. (1)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645696)

Leaving aside vulnerabilities in any particular implementation, the only possible attack vector I see would be a brute-force approach. Basically, a user in one VM creates random n-byte files with all possible combinations of content of that size (of course, this would only be feasible for very small files, but /etc/shadow is usually small enough, and so is everything in $HOME/.ssh/). Eventually, the user would create a file that would match a copy on another VM. Of course, this would be useless without a way to check whether another file was matched and deduplication took place. If the deduplication solution has any virtual guest software (like VMware Tools), and that tool shares this kind of information with other systems, it might be possible, but that's a big might.

Any reasonably implemented deduplication solution should be 100% transparent to the guest, and very secure.

And, to all the people talking about "shared resources", deduplication doesn't create "shared resources". Deduplication is not similar to symbolic links (ln -s). If you want to compare it to links, you have to compare it to hard links, and that would be hard links that automatically dereferenced and created a new copy of the file with all the blocks as soon as the user wanted to write to that file. Remember, as soon as the file changes on any given guest, the information is not the same anymore, and so that file is not de-duplicated anymore. A user can change his copy of the file, not other people's files.

Hasn't this been posted before? (5, Funny)

Required Snark (1702878) | more than 4 years ago | (#31644812)

Just wondering...

Re:Hasn't this been posted before? (1, Funny)

Anonymous Coward | more than 4 years ago | (#31644890)

If so, how about we just reference that post?

Re:Hasn't this been posted before? (1)

Hurricane78 (562437) | more than 4 years ago | (#31645064)

Well, at least this comment has been posted before.

Dude, you’re only piling it up. Like with trolling: If you react to it, you only make it worse.

And because I’m not better, I’m now gonna end it, by stating that: yes, yes, I’m also not making it better. ^^
Oh wait... now I am! :)

deduplication (0, Offtopic)

hduff (570443) | more than 4 years ago | (#31644898)

What kind of lame recursive acronym is "deduplication"?

I'm flummoxed in any attempt to decipher it.

Re:deduplication (1)

deniable (76198) | more than 4 years ago | (#31644994)

It's neither an acronym nor an abbreviation. Duplication is making copies. De-duplication is getting rid of the copies.

Re:deduplication (2, Funny)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645732)

So, Blade Runner was about de-duplication?

I was worried for a second there... (-1, Troll)

Anonymous Coward | more than 4 years ago | (#31644900)

I thought it said "Open Source Decapitation".

Re:I was worried for a second there... (0)

Anonymous Coward | more than 4 years ago | (#31645356)

that's one instance where the troll's term "open sores" could be used for humorous effect, instead of just grating people...

Re:I was worried for a second there... (-1, Offtopic)

GNUALMAFUERTE (697061) | more than 4 years ago | (#31645734)

Excellent! (-1, Flamebait)

Anonymous Coward | more than 4 years ago | (#31644904)

Once again the laziness of software types requires the ingenuity of massive hardware to compensate. Keep it up programmers! Soon you can have a retarded clam programming stuff that runs on a 6500THz processor and you'll STILL blame the computer for being slow!

Re:Excellent! (1, Redundant)

jtownatpunk.net (245670) | more than 4 years ago | (#31644930)

Yeah, I gave up on bitching about code inefficiency back in the early 90s. Do they even teach assembly any more?

Offtopic? (3, Informative)

SanityInAnarchy (655584) | more than 4 years ago | (#31645024)

If you'd mentioned the fact that this appears to be written in Java, you might have a point. But despite this, and the fact that it's in userland, they seem to be getting pretty decent performance out of it.

And keep in mind, all of this is to support reducing the amount of storage required on a hard disk, and it's a fairly large programming effort to do so. Seems like this entire project is just the opposite of what you claim -- it's software types doing extra work so they can spend less on storage.

Re:Excellent! (1)

az1324 (458137) | more than 4 years ago | (#31645264)

They are very poor programmers. Almost nothing works in retarded clam shell (rcsh).

Let's get down to brass tacks. (0, Redundant)

jtownatpunk.net (245670) | more than 4 years ago | (#31644918)

Does this mean I'll finally be able to store my entire porn collection on a single volume?

Re:Let's get down to brass tacks. (1)

nystire (871449) | more than 4 years ago | (#31644932)

AND it will make sure that all those 60,000 duplicate files no longer take up most of your hard drive space!

Re:Let's get down to brass tacks. (1)

SanityInAnarchy (655584) | more than 4 years ago | (#31645002)

Well, just how repetitive is your porn collection?

Re:Let's get down to brass tacks. (3, Funny)

Hooya (518216) | more than 4 years ago | (#31645204)

very repetitive. back and forth. back and forth. oh wait... that's not what you meant. never mind.

redundant if saving large amounts of data to SAN (0)

Anonymous Coward | more than 4 years ago | (#31644940)

If you are storing that amount of data, wouldn't you use a SAN, and don't most of those already have data de-duplication technology? I suppose this project will be pillaged by all of the backup appliance manufacturers and those who build consumer-grade NAS devices.

Re:redundant if saving large amounts of data to SA (1)

dbIII (701233) | more than 4 years ago | (#31645094)

Consider that things may be spread over more than one SAN or that it is a situation where an old style file server makes better sense anyway.

Re:redundant if saving large amounts of data to SA (1)

afidel (530433) | more than 4 years ago | (#31645664)

Not every SAN has dedupe; for instance, my HP EVA doesn't. Also, many of the low-end NetApp boxes have processors too anemic to do dedupe. Most of the low-end iSCSI boxes also lack dedupe.

Yea, I RTFA, but... (2, Interesting)

mrsteveman1 (1010381) | more than 4 years ago | (#31644966)

...from what I can tell, this is NOT a way to deduplicate existing filesystems or even a layer on top of existing data, but a new filesystem operating perhaps like eCryptfs, storing backend data on an existing filesystem in some FS-specific format.

So, having said that, does anyone know if there is a good way to resolve EXISTING duplicate files on Linux using hard links? For every identical pair found, a+b, b is deleted and instead hardlinked to a? I know there are plenty of duplicate file finders (fdupes, some windows programs, etc), but they're all focused on deleting things rather than simply recovering space using hardlinks.

Re:Yea, I RTFA, but... (1)

Aluvus (691449) | more than 4 years ago | (#31645054)

FSlint's "merge" option will do what you want.

Re:Yea, I RTFA, but... (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645208)

Hmm, yea i've used FSLint but I didn't pay close enough attention to the options it seems :)

Thanks

Re:Yea, I RTFA, but... (3, Informative)

dlgeek (1065796) | more than 4 years ago | (#31645058)

You could easily write a script to do that using find, sha1sum or md5sum, sort and link. It would probably only take about 5-10 minutes to write but you most likely don't want to do that. When you modify one item in a hard linked pair, the other one is edited as well, whereas a copy doesn't do this. Unless you are sure your data is immutable, this will lead to problems down the road.

Deduplication systems pay attention to this and maintain independent indexes to do copy-on-write and the like to preserve the independence of each reference.
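For the curious, the throwaway version would look roughly like this sketch in Python (an illustration only; it keys on file size plus SHA-1 and hard-links byte-identical files, with all the caveats above about shared edits and differing metadata):

import hashlib, os, sys

def sha1_of(path, bufsize=1 << 20):
    # Hash the file in chunks so large files don't blow up memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def hardlink_dupes(root):
    seen = {}                                   # (size, sha1) -> first path with that content
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            key = (os.path.getsize(path), sha1_of(path))
            if key in seen and not os.path.samefile(seen[key], path):
                os.unlink(path)                 # replace the copy...
                os.link(seen[key], path)        # ...with a hard link to the first one
            else:
                seen.setdefault(key, path)

if __name__ == "__main__":
    hardlink_dupes(sys.argv[1])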

Re:Yea, I RTFA, but... (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645220)

If I couldn't find a good tool from responses here I would have written one for sure.

Re:Yea, I RTFA, but... (2, Interesting)

Lorens (597774) | more than 4 years ago | (#31645082)

I wrote fileuniq (http://sourceforge.net/projects/fileuniq/) exactly for this reason. You can symlink or hardlink, decide how identical a file must be (timestamp, uid...), or delete.

It's far from optimized, but I accept patches :-)

Re:Yea, I RTFA, but... (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645200)

Sweet! Thanks a lot :)

Re:Yea, I RTFA, but... (1)

symbolset (646467) | more than 4 years ago | (#31645098)

There are security problems with this. The duplicate files might have different metadata - for example, access privileges.

For real (block level) deduplication, try lessfs or zfs.

Re:Yea, I RTFA, but... (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645206)

That can be managed for simple use cases, but yea i see your point.

Or get inline deduplication (1)

anilg (961244) | more than 4 years ago | (#31644972)

with NexentaStor CE [nexentastor.org] , which is based on OpenSolaris b134. It's free.. and has an excellent Storage WebUI. /plug

For a detailed explanation of OpenSolaris dedup see this blog entry [sun.com] .

~Anil

Re:Or get inline deduplication (1)

anilg (961244) | more than 4 years ago | (#31645008)

Grr.. meant inline/kernel dedup.

Re:Or get inline deduplication (1)

mrsteveman1 (1010381) | more than 4 years ago | (#31645010)

Plus you get the "real" ZFS, zones, and tightly integrated, bootable system rollbacks using zfs clones :)

Re:Or get inline deduplication (1)

itsme1234 (199680) | more than 4 years ago | (#31645894)

Plus you get the "real" ZFS, zones, and tightly integrated, bootable system rollbacks using zfs clones :)

Plus you get the "real" opensolaris experience:

- poor (like really really poor) hardware compatibility. Starting with basic stuff, many on-board Ethernet controllers with flaky or no support, very hard to choose a motherboard that's available and without too many compromises and fully supported. A guy asked if Android pairing is available (to use phone as modem for OpenSolaris), made me spill my coffee...
- doubtful future
- no security patches (yes, you read that right)
- major features like ZFS encryption slipping schedule for years (in the works since 2008; the last promise was that it would be in the 2010.2 release, which itself slipped to 2010.3, and that one seems to be delayed as well, since it was supposed to ship on the 26th; in any case it's quite sure that encryption won't make it anyway)

Thanks, but no thanks.

It wants Java :-( (0)

Anonymous Coward | more than 4 years ago | (#31644992)

I wonder how well it performs, or if this is just functionality for demonstration purposes ?

How useful is this in realistic scenarios? (1)

marvin2k (685952) | more than 4 years ago | (#31645012)

Given that usually most of the disk space is swallowed by the data of an application, and that data rarely is identical to the data on another system (why would you have two systems then?), I wonder how much this approach really buys you in "normal" scenarios, especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference, but can someone enlighten me how this is useful on, say, a physical system with 10 CentOS VMs running different apps, or similar apps with different data? You might save a few blocks because of the shared OS files, but if you did a proper minimal OS install, the gain hardly seems to be worth the effort.

Re:How useful is this in realistic scenarios? (1)

dlgeek (1065796) | more than 4 years ago | (#31645100)

It sounds to me like you have a very narrow view of what constitutes "realistic scenarios".
  • A high-availability mail system that has multiple servers handling client mail storage. VMs are used for rapid failover in the case of hardware failure. Sounds pretty realistic to me. Deduplication is extremely helpful when there are many copies of the same attachment as many users forward it around.
  • A large set of VMs which are used for testing the software you develop with a variety of possible end-user configurations. Sounds pretty realistic to me. Deduplication is extremely helpful to save space storing the base OS libraries and such.
  • You have a server (or set of servers) which is/are responsible for backing up a large number of other computers. Sounds pretty realistic to me. Deduplication is extremely helpful when these computers have files that are identical. (Hell, deduplication can make it much easier to do incremental backups of a single computer).

These all sound very realistic to me...

Re:How useful is this in realistic scenarios? (4, Informative)

mysidia (191772) | more than 4 years ago | (#31645490)

First of all... one of the most commonly duplicated blocks is the NUL block, that is, a block of data where all bits are 0, corresponding to unused space, or space that was used and then zeroed.

If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have at least 20GB that could be freed up by dedup.

Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.

Even on a single system, many system binaries and libraries, will contain duplicate blocks.

Of course multiple binaries statically linked against the same libraries will have dups.

But also, there is a common structure to certain files in the OS, similarities between files so great, that they will contain duplicate blocks.

Then if the system actually contains user data, there is probably duplication within the data.

For example, mail stores... will commonly have many duplicates.

One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.

If users store files on the system, they will commonly make multiple copies of their own files..

Ex... mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc

Can MS Word files be large enough to matter? Yes.. if you get enough of them.

Besides, they have a common structure that is the same for almost all MS Word files. Even documents whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past -- it's supposed to be a very small amount of space per file, but in reality a small amount of waste multiplied by thousands of files adds up.

Just because data seems to be all different doesn't mean dedup won't help with storage usage.

Re:How useful is this in realistic scenarios? (1)

Lorens (597774) | more than 4 years ago | (#31645128)

A major use case is NAS for users. Think of all those multi-megabyte files, stored individually by thousands of users.

However, normally deduplication is block level, under the filesystem, invisible to the user. This is implemented by NetApp SANs, for instance. After having RTFA, OpenDedup seems to be file-level, running between the user and an underlying file system. I'm not sure it's a good idea.

Re:How useful is this in realistic scenarios? (1)

snikulin (889460) | more than 4 years ago | (#31645152)

Well, a really good and useful "home" scenario is a system backup of multiple computers with the same OS.
OS itself plus common software takes at least 20-30 GB per installation these days.

My WHS (which does support de-dup in form of Single-instance storage [wikipedia.org] ) server keeps full backup (3-months worth) of my seven Windows home computers on about 60 GB.

Unfortunately SIS does not work for WHS shared folders, so my two Linux machines' (my version control & gallery servers) rsync backups over SMB are not de-duplicated by WHS.

I could probably save only /etc, /var and /srv of each server, but so far I backup everything.

Re:How useful is this in realistic scenarios? (3, Informative)

QuantumRiff (120817) | more than 4 years ago | (#31645282)

If you cut up a large file into lots of chunks of whatever size, let's say 64KB each, you can then look at the chunks. If you have two chunks that are the same, you remove the second one and just place a pointer to the first one. Data deduplication is much more complicated than that in real life, but basically, the more data you have, or the smaller the chunks you look at, the more likely you are to have duplication, or collisions. (How many Word documents have the same few words in a row? Remove every repeat of the phrase "and then the" and replace it with a pointer, if you will.)
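A quick way to see how much one file would shrink under that scheme (just a sketch with fixed-size chunks; real products use variable-size chunking and persistent indexes):

import hashlib, sys

def chunk_stats(path, chunk_size=64 * 1024):
    # Count total vs. unique fixed-size chunks in a single file.
    seen, total = set(), 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += 1
            seen.add(hashlib.sha256(chunk).digest())
    return total, len(seen)

if __name__ == "__main__":
    total, unique = chunk_stats(sys.argv[1])
    saved = 100 * (1 - unique / total) if total else 0
    print(f"{total} chunks, {unique} unique -> roughly {saved:.1f}% duplicate")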

This is also similar to WAN acceleration, which at a high enough level, is just deduplicating traffic that the network would have to transmit.

It is amazing how much space you can free up when you're not just looking at the file level. This has become very big in recent years because storage has exploded, and processors are finally fast enough to do this in real time.

Re:How useful is this in realistic scenarios? (1)

jdoverholt (1229898) | more than 4 years ago | (#31645350)

If you look at the sales materials from any of the big vendors (EMC, I'm looking at you), even a single system image shows reduction in size through block-level deduplication--even more through variable-sized blocks. I can't recall the exact numbers, I'm at the end of a terribly long week, but I think it was somewhere around 10-30% reduction in the day-0 backup size. Subsequent days typically see a >95% reduction.

All sales literature, mind you. My personal experience with it will begin in a few months, when we get our new Celerra installed :-)

P.S. Remember that a project such as this is good because it offers high-dollar features to low-dollar players who enjoy tinkering in their basements. Such was the goal of Linux in the first place. It's how, on a three-figure budget, a dedicated nerd can set up a several-terabyte file server with software RAID-6 protection and (soon) data deduplication--stuff you'd pay EMC 100-1000 times as much for.

Re:How useful is this in realistic scenarios? (1)

drsmithy (35869) | more than 4 years ago | (#31645742)

All sales literature, mind you. My personal experience with it will begin in a few months, when we get our new Celerra installed :-)

As far as I know, Celerras only do file-level dedupe.

Re:How useful is this in realistic scenarios? (1)

Spad (470073) | more than 4 years ago | (#31645596)

All good dedupe systems are block-level, not file-level, so you don't just save where whole files are identical but on *any* identical data that's on the disks.

If you're running VMs with the same OS you'll probably find that close to 70% of the data can be de-duplicated - and that's before you consider things like farms of clustered servers where you have literally identical configs, or fileservers with lots of idiots saving 40 "backup" copies of the same 2GB Access database just in case they need it.

Our deduped backup array is currently storing ~70TB of backups on 10TB of raw space and it's only about 40% full - to me, that's useful.

Re:How useful is this in realistic scenarios? (3, Informative)

drsmithy (35869) | more than 4 years ago | (#31645736)

I wonder how much this approach really buys you in "normal" scenarios, especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference, but can someone enlighten me how this is useful on, say, a physical system with 10 CentOS VMs running different apps, or similar apps with different data? You might save a few blocks because of the shared OS files, but if you did a proper minimal OS install, the gain hardly seems to be worth the effort.

Assume 200 VMs at, say, 2GB per OS install. Allowing for some uniqueness, you'll probably end up using something in the ballpark of 20-30GB of "real" space to store 400GB of "virtual" data. That's a *massive* saving, not only disk space, but also in IOPS, since any well-engineered system will carry that deduplication through to the cache layer as well.

Deduplication is *huge* in virtual environments. The other big place it provides benefits, of course, is D2D backups.

This just gave me a good idea! (3, Interesting)

thePowerOfGrayskull (905905) | more than 4 years ago | (#31645014)

Actually, just the title did it. I've historically had a bad habit of backing things up by taking tar/gzs of directory structures, giving them an obscure name, and putting them onto network storage. Or sometimes just copying directory structures without zipping first. Needless to say, this makes for a huge mess.

It just occurred to me that it would not be difficult to write a quick script to extract everything into its own tree, run sha1sum on all files, and identify duplicate files automatically, probably in just one or two lines.

So in other words -- thanks Slashdot! The otherwise unintelligible summary did me a world of good -- mostly because there was no context as to what the hell it was talking about, so I had to supply my own definition...

Re:This just gave me a good idea! (3, Informative)

Hooya (518216) | more than 4 years ago | (#31645230)

try this::

mv backup.0 backup.1
rsync -a --delete --link-dest=../backup.1 source_directory/ backup.0/

see this [mikerubel.org]

Re:This just gave me a good idea! (1)

devent (1627873) | more than 4 years ago | (#31645382)

Why not just rdiff-backup? rdiff-backup.nongnu.org

Re:This just gave me a good idea! (1)

evilviper (135110) | more than 4 years ago | (#31646006)

Why not just rdiff-backup?

Why not just rsync? ...
Now that that's out of the way...

For one thing, rdiff will give you a mess... rsync will give you multiple full filesystem trees...

Re:This just gave me a good idea! (1, Informative)

Anonymous Coward | more than 4 years ago | (#31645244)

Re:This just gave me a good idea! (0)

Anonymous Coward | more than 4 years ago | (#31645316)

This script identifies duplicate files in one swift line, using only find, xargs, md5sum and awk:

http://www.gedankenverbrechen.org/~tk/finddupes.sh

Re:This just gave me a good idea! (1)

CAIMLAS (41445) | more than 4 years ago | (#31645392)

Two things to look into:

rsync snapshots [mikerubel.org]
rsnapshot, for a better rsync snapshot [rsnapshot.org]

Re:This just gave me a good idea! (1)

thePowerOfGrayskull (905905) | more than 4 years ago | (#31645698)

Most recently, I've been moving most of my documents and source-code level stuff to a LAN-based SVN repository; then periodically dumping that, encrypting, and tossing it onto dropbox. The versioning is good, but it's not so practical for downloaded files and various other content types.

I'll take a look at this - thanks for the post.

Re:This just gave me a good idea! (0)

Anonymous Coward | more than 4 years ago | (#31645502)

look into fdupes for finding/dealing w/ dupes.

Re:This just gave me a good idea! (1)

thePowerOfGrayskull (905905) | more than 4 years ago | (#31645682)

Even better -- thanks!

Look at StoreBackup (1)

bradley13 (1118935) | more than 4 years ago | (#31645712)

We're a bit off topic here, seeing as this has nothing to do with file systems, but being off-topic is on-topic for /.

Anyhow: StoreBackup is a great backup system that automatically detects duplicates.

This workgeat,nfc... (0)

Anonymous Coward | more than 4 years ago | (#31645046)

I usedthichnlgywrp!

User Land? Come on! (1)

Gazzonyx (982402) | more than 4 years ago | (#31645092)

[...] Opendedup runs in user space, making it platform independent, easier to scale and cluster, [...]

... and slow, prone to locking issues, etc. There's a reason no one runs ZFS over FUSE, why would we do it with this?

Confusing summary (1)

Brian Gordon (987471) | more than 4 years ago | (#31645108)

The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes

Can anyone offer wisdom on what the volume size is supposed to signify, being different from the maximum size that SDFS is scalable to?

Re:Confusing summary (1)

Spad (470073) | more than 4 years ago | (#31645618)

It's the raw capacity of the filesystem compared to the maximum amount of deduplicated data it can handle. So you can have 8PB of raw disk space, on which you can store up to 8EB of deduplicated data (depending on the dedupe ratios you get - I think 1000x is a little optimistic; 30x-40x is more common).

"support for deduplication at 4K block sizes" (0)

Anonymous Coward | more than 4 years ago | (#31645146)

Why would anyone keep their blocks so cold?

It finally happened (1)

thesymbolicfrog (907527) | more than 4 years ago | (#31645272)

I stopped being able to read English. WTF does any of that mean? Is it written in moonspeak?

Off-site replication (1)

Dishwasha (125561) | more than 4 years ago | (#31645338)

One of the biggest targets for data de-duplication is efficient off-site replication, which you see in the EMC Avamar product line. This is advantageous when your WAN links aren't fast enough for synchronous replication and a scheduled asynchronous replication would take too long. I'd like to see the SDFS storage engine be intelligent enough to snapshot the data, then, when the next "backup/replication" occurs, gather up the hashes of all the blocks that have changed since the snapshot was created, communicate those hashes to the off-site system, and transfer just the blocks that don't currently have a matching hash on the target system. The target system receives a complete hash table update of the snapshot block difference from the source, and then both systems merge their snapshots and take a new snapshot to get ready for the next replication cycle.
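In toy Python, the round trip I'm describing looks roughly like this (hypothetical helper names; not anything Avamar or SDFS actually exposes):

import hashlib

def block_hash(block):
    return hashlib.sha256(block).digest()

def replicate(changed_blocks, target_has):
    # changed_blocks: dict of offset -> block bytes changed since the last snapshot.
    # target_has:     callable(hash) -> bool, answered by the off-site system.
    # The remote gets the full offset -> hash table, but only the payloads it lacks.
    hash_table_update = {}
    blocks_to_send = {}
    for offset, block in changed_blocks.items():
        h = block_hash(block)
        hash_table_update[offset] = h
        if not target_has(h):
            blocks_to_send[h] = block
    return hash_table_update, blocks_to_send

# Toy run: the target already holds one of the two changed blocks.
remote_store = {block_hash(b"A" * 4096)}
changed = {0: b"A" * 4096, 4096: b"B" * 4096}
table, payload = replicate(changed, lambda h: h in remote_store)
print(len(table), "pointer(s) updated,", len(payload), "block(s) shipped")   # 2 and 1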

See also: LessFS (1)

kb1 (1764484) | more than 4 years ago | (#31645408)

The LessFS project also deserves mention: http://www.lessfs.com/ [lessfs.com] . Just think of the effect of combining a deduplication system with an iSCSI shared virtual tape library like http://sites.google.com/site/linuxvtl2/ [google.com]

New use for an old algorithm? (1)

Squatting_Dog (96576) | more than 4 years ago | (#31645554)

Isn't this just an application of 'tokenizing' as it is used in compression of data streams? Build an index of unique (read: non-repetitive) data segments and store the (smaller) index and resulting data?

This has been around for some time....hard to believe that this use has just come to light.
