
Ask Slashdot: Free/Open Deduplication Software?

timothy posted more than 2 years ago | from the the-dept-dept-the-from-dept-from-from dept.

Data Storage

First time accepted submitter ltjohhed writes "We've been using deduplication products for backup purposes at my company for a couple of years now (DataDomain, NetApp etc). Although they've fully satisfied the customer needs in terms of functionality, they don't come cheap, whatever the brand. So we went looking for some free dedup software. OpenSolaris, using ZFS dedup, was the first that came to mind, but OpenSolaris' future doesn't look all that bright. Another possibility might be utilizing LessFS, if it's fully ready. What is Slashdotters' favourite dedup flavour? Is there any free dedup software out there that is ready for customer deployment?" Possibly helpful is this article about SDFS, which seems to be along the right lines; though the changelog appears stagnant, there's some active discussion.
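For readers heading down the ZFS road (on whatever platform ends up shipping it), here is a minimal sketch with hypothetical pool and dataset names; dedup is a per-dataset property, and zdb can estimate the achievable ratio before you commit RAM to a dedup table:

zdb -S tank                     # simulate dedup against existing data and print the estimated ratio
zfs set dedup=on tank/backups   # then enable dedup for new writes on a single dataset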


I've wanted deduplication for a long time! (4, Interesting)

GPLJonas (2545760) | more than 2 years ago | (#38588736)

And now, even the next version of Windows Server will contain integrated data deduplication technology! So Linux devs had better get working on similar features. I still cannot figure out why NTFS can support compressing files and folders but Linux cannot.

That deduplication for NTFS is really interesting [fosketts.net], actually. It's not licensed technology but straight from Microsoft Research and it has some clever aspects to it.

Some technical details about the deduplication process:

Microsoft Research spent 2 years experimenting with algorithms to find the “cheapest” in terms of overhead. They select a chunk size for each data set. This is typically between 32 KB and 128 KB, but smaller chunks can be created as well. Microsoft claims that most real-world use cases are about 80 KB. The system processes all the data looking for “fingerprints” of split points and selects the “best” on the fly for each file.

After data is de-duplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This is actually part of the System Volume store in the root of the volume, so dedupe is volume-level. The entire setup is self-describing, so a deduplicated NTFS volume can be read by another server without any external data.

There is some redundancy in the system as well. Any chunk that is referenced more than x times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired. The same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.

Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, and this will greatly accelerate overall performance. Windows 8 also includes a new “express” library that makes compression “20 times faster”. Already-compressed file types are not re-compressed: zip files, Office 2007+ files, etc. will be skipped by the compressor and just deduped.

New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. Therefore, I/O impact is between “none and 2x” depending on type. Opening a file is less than 3% greater I/O and can be faster if it’s cached. Copying a large file can make some difference (e.g. 10 GB VHD) since it adds additional disk seeks, but multiple concurrent copies that share data can actually improve performance.

The most interesting thing is that Microsoft Research says it hardly affects performance at all. So when are we going to see Linux equivalents? Because Linux is falling behind on the new technologies.
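As a rough way to picture the post-process, chunk-store design described above, here is a toy shell sketch; it uses fixed 64 KB chunks and a hypothetical store path, unlike Microsoft's variable-size, content-defined chunking:

store=/var/chunkstore; mkdir -p "$store"
split -b 65536 bigfile chunk.                  # cut the file into fixed 64 KB chunks
for c in chunk.*; do
  h=$(sha256sum "$c" | cut -d' ' -f1)
  [ -e "$store/$h" ] || cp "$c" "$store/$h"    # keep only one copy of each unique chunk
  echo "$h" >> bigfile.manifest                # the manifest records chunk order for reassembly
done
rm chunk.*
# reassemble later with: while read h; do cat "$store/$h"; done < bigfile.manifest > bigfile.restored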

Re:I've wanted deduplication for a long time! (-1, Troll)

Anonymous Coward | more than 2 years ago | (#38588784)

And there's the MS shill first post I was hoping would be here...

Re:I've wanted deduplication for a long time! (4, Insightful)

nanoflower (1077145) | more than 2 years ago | (#38588796)

The most likely answer is when someone is willing to pay for it. What you have described above isn't a trivial effort, and it's unlikely someone is going to do the work for free, so it will have to wait until someone is willing to pay for the development. Even then, it's likely that the developer may keep it closed source in order to recoup the investment.

Re:I've wanted deduplication for a long time! (-1, Flamebait)

GPLJonas (2545760) | more than 2 years ago | (#38588822)

So you're saying that open source isn't a viable solution in every case?

Re:I've wanted deduplication for a long time! (-1)

Anonymous Coward | more than 2 years ago | (#38588948)

> So you're saying that open source isn't a viable solution in every case?

no... it is the other way round... people like you who quote marketing gibberish and think they hold a gun in their hand are not needed around here... please continue using windows, we already have enough maggots like you shitting all over the linux desktop... thank you

Re:I've wanted deduplication for a long time! (-1)

Anonymous Coward | more than 2 years ago | (#38589266)

too bad you don't have a dedup though, retard

Re:I've wanted deduplication for a long time! (2)

the_B0fh (208483) | more than 2 years ago | (#38589320)

Even Richard Stallman agreed that there are cases where GPL is not necessary. So stop being an ass.

Acronis (2)

syousef (465911) | more than 2 years ago | (#38588834)

Acronis Backup & Recovery 11 Advanced Server has deduplication (licensed addon) and runs on Linux. At roughly $2000 it ain't cheap. I've never used it so can't comment on how well it does.

Re:Acronis (2, Informative)

Anonymous Coward | more than 2 years ago | (#38588982)

Acronis is a bloody NIGHTMARE to deal with. We have a mixed shop here, and after seeing what Acronis does on Windows I vetoed the idea of having it on our mission critical linux servers.

I have never seen such a useless backup product before I started working with Acronis. Most backup systems let you set it up once and they WORK. Acronis is always getting itself wedged (dur, a metadata file I miswrote yesterday is corrupt, I will just hang), and when wedged it hangs ALL backup jobs, not just the one that is stuck. And the only "fix" is to redo all the jobs from scratch. No other backup system needs as much handholding as Acronis.

Acronis claims to have an excellent recovery environment. I haven't used it, but I am sure it is fantastic when you finally dig up a month-old backup to restore from because Acronis had stopped working.

Re:Acronis (1)

Anonymous Coward | more than 2 years ago | (#38589138)

Test your backups because there is a good chance they will not work. I say this as an experienced user of Acronis B&R 11 Advanced. Total garbage.

Re:Acronis (1)

sco_robinso (749990) | more than 2 years ago | (#38589322)

Never used a product that required as much handholding? I see you've never used Backup Exec.

I've been having to hand-hold Backup Exec for the better part of a year now. We more or less finally have it working, but I wouldn't speak too highly of it. We use BE's dedup features, and while it seems to work reasonably well, one day we went to make a standard administrative change to the folder (sharing it within BE to another media server), and *poof*, all of the data got corrupted. A week on the phone with Symantec and they couldn't figure out why.

It's also RAM hungry. Symantec requires 1.5GB of RAM for every 1TB of dedup storage. RAM is cheap though, so not a big deal for me.

Never used Acronis in production, though. Demo'd it; it seemed OK, but also came across as somewhat amateur (a home-user product stretched beyond imagination to work in an enterprise environment).

I eventually have gone to Veeam. It only does virtual and only backs up to disk (no tape support), but man o man is it ever fast and easy (and reliable).

Re:Acronis (3, Informative)

arth1 (260657) | more than 2 years ago | (#38589482)

In Linux, I would avoid any backup system that doesn't support hard links, long paths, file attributes, file access control lists and SELinux contexts.
Some of the "offerings" out there are so Windows-centric that they can't even handle "Makefile" and "makefile" being in the same directory.

In Windows, I would require that it backs up and restores both the long and the short name of files, if short name support is enabled for the file system (default in Windows). Why? If "Microsoft Office" has a short name of MICROS~3 and it gets restored with MICROS~2, the apps won't work, because of registry entries using the short name.
I'd also look for one that can back up NTFS streams. Some apps store their registration keys in NTFS streams.

In all cases, Acronis does not measure up to what I require of a backup program, not least because the restore environment doesn't even work unless you have hardware compatible with its drivers. You may be able to back up, and even boot the restore environment, but not do an actual restore from it.

ArcServe is better - for Linux it still lacks support for file attributes and the hardlink handling is rather peculiar during restore, but at least it handles SELinux contexts and dedup of the backup.

An option for dedup on Linux file systems would be nice - the easiest implementation would be COW hardlinks. But as with Microsoft's new NTFS dedup, you'd need something that scans the file system for duplicates. And it had better have a do-not-dedup attribute too, because of how expensive COW can be for large files, or to avoid file fragmentation for specific files.
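For what it's worth, the "COW hardlink" idea already has a user-visible form on Linux COW filesystems such as btrfs; a one-line illustration with hypothetical file names:

cp --reflink=always big.img big-copy.img   # the clone shares all data blocks until either copy is modified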

Re:I've wanted deduplication for a long time! (1)

PedXing (14787) | more than 2 years ago | (#38588840)

Indeed, I'd love to see Linux implement deduplication similar to Microsoft's new NTFS tech. But of course, it's a major development effort and LOTS of patents are involved.

Re:I've wanted deduplication for a long time! (4, Interesting)

Anonymous Coward | more than 2 years ago | (#38589224)

I have often wondered why someone doesn't use the rsync algorithm as a basis for this kind of chunking and deduplication. I imagine a FUSE-based filesystem that breaks the application-level files into checksummed pieces and stores both the file fragments and file descriptions into an underlying filesystem. Then it could reconstruct the application-level files on demand, using the description to draw out the right fragments.

From an academic point of view, it already solves the same problem and just needs some repackaging. It breaks arbitrary data into phrases to be identified by checksum and located in another existing corpus of data. It just needs a metadata model to record the structure of the file as composed of these canonical phrases, rather than performing the actual file reconstruction immediately as rsync does now.

From my cynical point of view, I realize someone may have patented the repackaging, in the same way that Apple seems to think they can re-patent every idea "on a smartphone".
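For anyone who wants to experiment with that idea, librsync ships a small command-line tool, rdiff, that exposes the signature/delta half of the rsync algorithm; a sketch with hypothetical file names:

rdiff signature old-backup.img old.sig              # rolling-checksum fingerprints of the data already stored
rdiff delta old.sig new.img new.delta               # emit only the chunks not already present in old-backup.img
rdiff patch old-backup.img new.delta restored.img   # rebuild the new file from old data plus the delta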

Re:I've wanted deduplication for a long time! (5, Insightful)

lucm (889690) | more than 2 years ago | (#38588916)

And now, even the next version of Windows Server will contain integrated data deduplication technology! [...] The most interesting thing is that Microsoft Research says it hardly affects performance at all.

Well, ask anyone who lost documents on DoubleSpace volumes or got corrupted media files on Windows Home Server: even if Microsoft Research says so, it's not something I would put on my production servers any time soon.

Re:I've wanted deduplication for a long time! (5, Funny)

Ironchew (1069966) | more than 2 years ago | (#38588924)

When will Adblock Plus block these Microsoft ads on Slashdot?

Re:I've wanted deduplication for a long time! (1, Troll)

howardd21 (1001567) | more than 2 years ago | (#38589494)

When will Adblock Plus block these Microsoft ads on Slashdot?

The day before they block the Linux ones. I am waiting for the idiot blocker, but fear it may sometimes affect my own posts.

Re:I've wanted deduplication for a long time! (0, Insightful)

Anonymous Coward | more than 2 years ago | (#38588964)

Microsoft has a looooong history of promising features for NTFS and then not delivering on them.

Also, perhaps the reason that Linux does not support file-system compression on the fly is that it's a horrid idea, and should never actually be used?

Re:I've wanted deduplication for a long time! (-1)

Anonymous Coward | more than 2 years ago | (#38589070)

lol... you registered just to post a Microsoft advert?

my god... people like you really exist... xD

Re:I've wanted deduplication for a long time! (2, Insightful)

19thNervousBreakdown (768619) | more than 2 years ago | (#38589072)

This has got to be some sort of smear campaign against Microsoft, because I cannot believe that they would think that bludgeoning people with astro-turf is going to get them sales. The first two articles with MS shilling I saw (today!) I wrote off as just people sharing interesting stuff that happened to come from MS, but thanks to your over-heavy hand, the pattern is clear as a bell now.

So tell me, MS marketing people, are you seriously this incompetent, or did a new astro-turf campaign incentive get out of hand? I'm honestly curious how this happened.

Re:I've wanted deduplication for a long time! (2, Insightful)

Richard_at_work (517087) | more than 2 years ago | (#38589190)

Heh, the comment is completely on topic, interesting and factual and yet we still have fucktards insisting that any mention of MS must have been paid for.

Get a life.

Re:I've wanted deduplication for a long time! (1)

Anonymous Coward | more than 2 years ago | (#38589308)

It's the perfect grammar, good use of whitespace, buzzwords and the directness of the comment. /. people talk a certain way and this is not it. Usually they assume knowledge and read like they haven't been proofread. Plus Windows isn't FOSS, making the comment irrelevant despite being well written.

Re:I've wanted deduplication for a long time! (0)

Richard_at_work (517087) | more than 2 years ago | (#38589394)

Complete and utter bollocks. I'm referring to your comment, just to make sure there is no doubt.

Re:I've wanted deduplication for a long time! (1)

Anonymous Coward | more than 2 years ago | (#38589408)

Heh, the comment is completely on topic, interesting and factual and yet we still have fucktards insisting that any mention of MS must have been paid for.

A multi-paragraph comment favorably disposed towards Microsoft and critical of its competition was posted at exactly the same time as the story. Check the times above:

  • Story: Posted by timothy on Wednesday January 04, @03:54PM
  • Comment: GPLJonas (2545760) on Wednesday January 04, @03:54PM (#38588736)

Compare with yesterday's Ask Slashdot multi-paragraph comment, posted at exactly the same time as the story, promoting ASP:

It's not that Slashdotters are "fucktards," it's that Slashdotters are math/science/CS folks and are trained to recognize anomalies and to grow suspicious of patterns of anomalies.

Re:I've wanted deduplication for a long time! (1)

Anonymous Coward | more than 2 years ago | (#38589426)

That's probably because this kind of thing ACTUALLY HAPPENS ALL THE TIME.

The GP's comment was posted the exact minute the post was put up... naw, couldn't possibly be prepared ahead of time, nawwww

Re:I've wanted deduplication for a long time! (1)

Anonymous Coward | more than 2 years ago | (#38589438)

Please add something to the discussion or get the fuck out. I'd rather have astroturfers making relevant, on-topic posts than people like you ranting and raving at shadows.

Re:I've wanted deduplication for a long time! (0)

Anonymous Coward | more than 2 years ago | (#38589116)

We're well aware of how much linux sucks, but ignoring that is how we stay hip.

Re:I've wanted deduplication for a long time! (1, Flamebait)

timeOday (582209) | more than 2 years ago | (#38589148)

Oh boy, are they rolling dedup into WinFS [wikipedia.org]? Wow, just look at the timeline [wikipedia.org] in that article if you want to get discouraged about Microsoft delivering a new filesystem.

Re:I've wanted deduplication for a long time! (2, Informative)

Anonymous Coward | more than 2 years ago | (#38589164)

Interesting link, but it doesn't look like Microsoft has actually shipped this yet; it is only slated to be released with Windows 8 Server, and it will come with some caveats.

FTFA:
"It is a server-only feature, like so many of Microsoft’s storage developments. But perhaps we might see it deployed in low-end or home servers in the future.
It is not supported on boot or system volumes.
Although it should work just fine on removable drives, deduplication requires NTFS so you can forget about FAT or exFAT. And of course the connected system must be running a server edition of Windows 8.
Although deduplication does not work with clustered shared volumes, it is supported in Hyper-V configurations that do not use CSV.
Finally, deduplication does not function on encrypted files, files with extended attributes, tiny (less than 64 kB) files, or re-parse points."

Re:I've wanted deduplication for a long time! (3, Interesting)

SuricouRaven (1897204) | more than 2 years ago | (#38589248)

NTFS's file compression actually rather sucks. Space saving is minimal under all but ideal conditions. It's a common problem with filesystem-level compression: the need to read any part of a file without seeking very far or reconstructing the entire stream means the compression ratio is seriously compromised.

Re:I've wanted deduplication for a long time! (1)

0123456 (636235) | more than 2 years ago | (#38589550)

I compressed some of my Steam games recently to free up disk space on my laptop. If I remember correctly, most of them shrunk by 5-10% while some got around 30%.

So not great, but it freed up a few gigabytes that allowed me to install a few more smaller games.

Re:I've wanted deduplication for a long time! (2)

dltaylor (7510) | more than 2 years ago | (#38589288)

One instance by default is brain-dead: lose that one copy and they're ALL gone, if it happens between dedup and backup. Any data with more than one reference should always have at least two copies, on different media if possible, in addition to your regular backups, of course.

deduplication is just compression (1)

Colin Smith (2679) | more than 2 years ago | (#38589560)

Both deduplication and conventional compression are a questionable idea on a file server which has many clients.

The most interesting thing is that Microsoft Research says it hardly affects performance at all.

Yeah right. I'll wait for real usage numbers rather than "the vendor selling this stuff to me says it's fucking awesome".

Dragonfly BSD's HAMMER... (5, Interesting)

Anonymous Coward | more than 2 years ago | (#38588772)

...includes dedupe.

There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is much unlike ZFS which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.
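For reference, HAMMER's dedup is an offline pass driven by the hammer utility; roughly like this, if memory serves (mount point hypothetical):

hammer dedup-simulate /home   # estimate how much space a dedup pass would reclaim
hammer dedup /home            # actually run the pass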

Re:Dragonfly BSD's HAMMER... (0)

Anonymous Coward | more than 2 years ago | (#38588898)

Second this

Dragonfly BSD's HAMMER... (0)

callmebill (1917294) | more than 2 years ago | (#38589254)

...includes dedupe. There was a blog entry a while ago where on a 256MB RAM machine someone was able to dedupe 600GB down to 400GB and the performance was fine. This is much unlike ZFS which wants the entire dedupe tree in memory and requires gigs and gigs of RAM.

FreeBSD (0)

Anonymous Coward | more than 2 years ago | (#38588816)

Combines a bright future and ZFS

Re:FreeBSD (1)

kthreadd (1558445) | more than 2 years ago | (#38589014)

The question is how bright that future will be with Oracle effectively going its own way with ZFS.

Re:FreeBSD (5, Informative)

TheRaven64 (641858) | more than 2 years ago | (#38589060)

As I said in another post, ZFS development on FreeBSD is now funded by iXSystems. Given that most of their income is from selling large storage solutions built on top of FreeBSD and ZFS (often with a side order of FusionIO and other very expensive hardware things), they have a strong incentive to keep it stable and full of the features that their customers want.

Re:FreeBSD (1)

Guspaz (556486) | more than 2 years ago | (#38589548)

It means, however, that ZFS is now forked. ZFS volumes from future Solaris releases may be incompatible with future versions of FreeBSD (or IllumOS or whatnot). And the new approach that they're taking to define which new features are supported is flexible, but opens the door to situations where you can't mount a filesystem because your OS is missing some individual feature that the devs chose not to implement (ZFS currently goes by backwards compatible versions of the filesystem, the new forked version works based on feature flags).

To be honest, I don't see the latter being much of a problem, but the former (lack of compatibility with "official" ZFS) is annoying. Perhaps the forked version should change its name to something other than ZFS to avoid confusion.

Re:FreeBSD (0)

Anonymous Coward | more than 2 years ago | (#38589026)

One day, some ZFS version on FreeBSD will be able to make coffee. (given enough RAM)

Nexenta (0)

Anonymous Coward | more than 2 years ago | (#38588826)

...ever heard of it?

Re:Nexenta (2)

MatthewEarley (2451600) | more than 2 years ago | (#38589196)

Yes, and I'm currently using the CE version. It is native ZFS version 15 because it is running on OpenSolaris. I chose NexentaStor CE for this reason. Excellent performance, great GUI, available CLI, and access to the underlying file system with a few commands. Run the 64-bit version, as it chokes in 32-bit. The HCL is small, so you will have to spend time reading the OpenSolaris HCL. It took me a few tries to get the hardware right: MB, CPU, memory, JBOD controllers. I am glad I went through the trouble; I learned a lot and, most importantly, got away from HW RAID, which is inferior in performance, has no dedupe, and does not detect bitrot.

BSD (0)

Anonymous Coward | more than 2 years ago | (#38588836)

FreeBSD (ZFS) or DragonFly BSD's HAMMER FS perhaps?

ZFS != OpenSolaris (0)

Anonymous Coward | more than 2 years ago | (#38588848)

I do believe FreeBSD also supports ZFS.

OpenSolaris but not FreeBSD? (3, Informative)

TheRaven64 (641858) | more than 2 years ago | (#38588868)

ZFS in FreeBSD 9 has deduplication support. I've been running the betas / release candidates on my NAS for a little while (everything important is backed up, so I thought I'd give it a test). ZFS development in FreeBSD is funded by iXSystems [ixsystems.com], who sell expensive NAS and SAN systems so they have an incentive to keep it improving.

I have a ZFS filesystem using compression and deduplication for my backups from my Mac laptop. I copy entire files to it, but it only stores the differences.
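On a setup like that the savings are easy to inspect; a quick sketch with hypothetical pool and dataset names:

zfs create -o dedup=on -o compression=on tank/laptop-backups
zfs get compressratio,used tank/laptop-backups   # per-dataset compression savings
zpool list tank                                  # the DEDUP column shows the pool-wide dedup ratio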

Re:OpenSolaris but not FreeBSD? (1)

Anonymous Coward | more than 2 years ago | (#38589274)

Seeing as nobody else has inquired: what are your system stats, CPU- and especially memory/hard-disk-wise?

I've been curious about that, since the minimum requirements for OpenSolaris/Indiana are 768 megs of RAM, and the boot CDs won't load on less.

Someone further up mentioned DragonFlyBSD's HAMMER filesystem supporting hundreds of gigabytes of dedup processing on a 256 meg system, so hearing some real-world usage statistics from someone running FBSD/ZFS would be appreciated!

Re:OpenSolaris but not FreeBSD? (4, Informative)

Anonymous Coward | more than 2 years ago | (#38589558)

People considering either dedup or compression on FreeBSD should be made blatantly aware of one of the issues which exists solely on FreeBSD. When using these features, you will find your system "stalling" intermittently during ZFS I/O (e.g. your SSH session stops accepting characters, etc.). Meaning, interactivity is *greatly* impacted when using dedup or compression. This problem affects RELENG_7 (which you shouldn't be using for ZFS anyway, too many bugs), RELENG_8, the new 9.x releases, and HEAD (10.x). Changing the compression algorithm to lzjb has a big improvement, but it's still easily noticeable.

My point is that I cannot imagine using either of these features on a system where users are actually on the machine trying to do interactive tasks, or on a machine used as a desktop. It's simply not plausible.

Here's confirmation and reading material for those who think my statements are bogus. The problem:

http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012718.html [freebsd.org]
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html [freebsd.org]

And how OpenIndiana/Illumos solved it:

http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html [freebsd.org]

ZFS (0)

Anonymous Coward | more than 2 years ago | (#38588886)

Why don't you look at Nexenta? That's all the goodness of OpenIndiana, ZFS and dedupe with the backing of a strong commercial organization.

Roll your own? (1)

goombah99 (560566) | more than 2 years ago | (#38588894)

You could run a nightly script to find duplicates and then deduplicate them.

An example of this would be: find all new files since the last run, checksum them, and compare those checksums to the checksums of all previously examined files. Once you find the likely duplicates you can decide how careful you want to be about verifying identity of both data and metadata. For example, do you want to preserve the attribute dates if the data is identical? For some programs the creation dates matter. Likewise, file permissions might be different.

Once you find a duplicate, just erase one of them and create a hard link to the other if it's on the same filesystem or, if you dare, a softlink to the file on another filesystem.

This is not hard, and it very closely approximates what NetApp and Apple Time Machine do.
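A minimal sketch of such a nightly script, assuming well-behaved file names and a single filesystem (a real version would also handle the metadata questions above):

#!/bin/sh
find /data -type f -exec sha1sum {} + | sort > /tmp/sums
prev_hash=""; prev_file=""
while read -r hash file; do
  if [ "$hash" = "$prev_hash" ] && cmp -s "$prev_file" "$file"; then
    ln -f "$prev_file" "$file"        # byte-identical: replace the duplicate with a hard link
  else
    prev_hash=$hash; prev_file=$file
  fi
done < /tmp/sums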

Alternatively, if what you are really trying to do is not eliminate duplicates in an active file system but merely keep snapshots for backup, then the problem is much simpler.

On BSD unix:
mv snapshot oldsnapshot
mkdir snapshot
(cd oldsnapshot && find . -print | cpio -pdl ../snapshot)   # hard-link every old file into the new snapshot
rsync -aE source/ snapshot/                                 # changed files get fresh copies, breaking the link

where oldsnapshot is the previous backup of your data,
snapshot is the new backup,
and source is the thing you want to back up.

Voila, you have an endless set of snapshots in which no unchanged file is ever duplicated. However, because hard links share an inode, metadata like file ownership and unix flags is not preserved independently in the old snapshots.

Re:Roll your own? (1)

timeOday (582209) | more than 2 years ago | (#38589056)

An example of this would be: find all new files since the last run, checksum them, and compare those checksums to the checksums of all previously examined files

Block-level (rather than file-level) de-duplication would be drastically better for virtual machine images, which are most of the largest files I back up. Email spool files (or pst files) are another example of large files that tend to change only in portions of them.

Re:Roll your own? (1)

mhotchin (791085) | more than 2 years ago | (#38589100)

Think about things like Virtual Machine images - they could have a *lot* in common, but not everything. Dedupe gets useful once you can work at a sub-file level.

FreeBSD has ZFS (3, Informative)

grub (11606) | more than 2 years ago | (#38588896)


FreeBSD, and FreeNAS which is based on FreeBSD, both come with ZFS. Neither is going away anytime soon.
I use both at home and am happy as a clam.

Re:FreeBSD has ZFS (0)

Anonymous Coward | more than 2 years ago | (#38588936)

I was happy fucking your wife's clam while you were dicking around with your nerdy shit.

Re:FreeBSD has ZFS (2, Funny)

grub (11606) | more than 2 years ago | (#38589022)


Best get your eyes checked, I think that was our cat's anus you were in.
My wife and I love to sit around and de-dupe all night.


No dedup in FreeNAS (5, Informative)

Svenne (117693) | more than 2 years ago | (#38589102)

However, FreeNAS supports ZFS v15, which doesn't have support for deduplication.

Lessfs is slow on Atom (3, Informative)

Dwedit (232252) | more than 2 years ago | (#38588920)

I've used LessFS. On my "server" powered by an Intel Atom, it is very, very slow. It writes at about 5MB/sec, even when everything is inside a RAM disk.
You can't use a block size of 4KB, otherwise write speeds are around 256KB/sec; you need to use at least 64KB.

Re:Lessfs is slow on Atom (2)

TheRaven64 (641858) | more than 2 years ago | (#38589104)

I have a NAS with a 1.6GHz AMD Fusion thingy, which should be in the same speed ballpark as an Atom. It happily got 40MB/s writing to the deduplicated filesystem with FreeBSD ZFS (with a kernel with all of the debug knobs turned on) over GigE. With a release kernel, I'd expect it to be a bit faster, but since I mainly use it over WiFi the bottleneck is generally somewhere else...

Re:Lessfs is slow on Atom (3, Informative)

BitZtream (692029) | more than 2 years ago | (#38589502)

No you didn't.

You got 40MB writing to memory cache possibly, not the ZFS store.

I have a quad-disk, 8-core, 8 GIG machine that ONLY does ZFS. Sustaining 40MB/s doesn't happen without special tuning - turning off write cache flushing and a whole bunch of other stuff - unless I stay in memory buffers. Once that 8 gig fills, or the 30-second default timeout for ZFS to flush expires, the machine comes to a standstill while the disks are flushed, and at that point the throughput rate drops well below 40MB/s, since it is finally actually putting that data on disk.

Without compression and dedup, and possibly with low-end checksumming, you may be able to write that fast. With compression or checksumming, there's absolutely no way your processor is moving the data fast enough.

This is a well known and well documented set of issues. If you haven't experienced it, it's only because you really aren't using your NAS under any sort of real workload.
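For reference, the 30-second flush interval mentioned above is the ZFS transaction-group timeout; on FreeBSD it can be inspected, and on versions where it is a runtime sysctl (otherwise a /boot/loader.conf tunable) lowered, to trade peak throughput for smaller stalls:

sysctl vfs.zfs.txg.timeout     # the flush interval, in seconds (the "30 second default" above)
sysctl vfs.zfs.txg.timeout=5   # flush smaller transaction groups more often to shorten the stalls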

Bourne Shell! (-1)

Anonymous Coward | more than 2 years ago | (#38588922)

#!/bin/bash
L=$(find . -type f)
for O in $L
do
  for T in $L
  do
    if cmp -s "$O" "$T"
    then
      rm -f "$O"
    fi
  done
done

What?

Re:Bourne Shell! (0)

Anonymous Coward | more than 2 years ago | (#38588990)

Isn't that going to eventually delete every file?

Re:Bourne Shell! (0)

Anonymous Coward | more than 2 years ago | (#38589366)

Yes, but the OP never said anything about keeping files that weren't duplicates.

O.K., fine [pastebin.com]. That one actually works, believe it or not...

Just store everything in /dev/null (1)

kthreadd (1558445) | more than 2 years ago | (#38588942)

Look, no duplicates!

Re:Just store everything in /dev/null (4, Funny)

shish (588640) | more than 2 years ago | (#38589222)

$ cat todo.txt > /dev/null
$ md5sum /dev/null
d41d8cd98f00b204e9800998ecf8427e /dev/null
$ cat aaah.png > /dev/null
$ md5sum /dev/null
d41d8cd98f00b204e9800998ecf8427e /dev/null
-

Duplicates!

BackupPC (5, Interesting)

Anonymous Coward | more than 2 years ago | (#38588954)

Check out BackupPC. Been using it for about 5 years at our company, admittedly a mostly Linux shop, with great results. Deduplication on a per-file basis, block-based transfers via the rsync protocol, and a good web-based UI (at least in terms of function). Thanks to deduplication we are getting about a 10:1 storage compression backing up servers and workstations: a total of 1.28 TB of backups in 130.88 GB of used space.

Backup or live FS? (3, Interesting)

Anonymous Coward | more than 2 years ago | (#38588972)

Your post doesn't make it clear if you're looking for a free backup product to replace DataDomain, NetApp, etc. or if you're now wanting to dedup on live filesystems.

If you're looking for a free backup product that supports deduplication, look at BackupPC. Powerful and complex, but free. I've used it for years with good results.

Backuppc (0)

Anonymous Coward | more than 2 years ago | (#38588976)

http://backuppc.sourceforge.net/info.html

dragonflybsd (2, Informative)

Anonymous Coward | more than 2 years ago | (#38589006)

So you want dragonfly BSD with a hammer filesystem.
An excellent and stable BSD and an excellent filesystem to go with it. And a very helpful community.

yuo 74il it! (-1)

Anonymous Coward | more than 2 years ago | (#38589016)

'doing somethuing' your replies rather contact to sse if on slashdot.org

Some block based backup may have issues... (1)

Anonymous Coward | more than 2 years ago | (#38589046)

You should be aware that ZFS and most block-based deduplication products don't handle streams from NetBackup very well. This is due to block-size variability, which defeats block-by-block deduplication algorithms. Information on this issue is readily available if you search for it.

For ZFS options, you can try the following:
  - Sun/Oracle sells a ZFS storage array codenamed Amber Road.
  - Check out Nexenta. They provide a supported version of OpenSolaris, complete with high-availability options if you so desire.
  - You can also check out OpenIndiana, a fork of OpenSolaris.

Why not just (1)

StripedCow (776465) | more than 2 years ago | (#38589096)

Why not just do the following every once in a while:

1. go through all your files,
2. for each file, compute a checksum (e.g. using the unix tools md5sum or sha1sum),
3. for pairs of files giving similar checksum, compare them (optionally) and if equal remove one of them and make it a hard-link to the other.

It would surprise me if there was no free open-source script doing exactly this.
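The detection half is close to a one-liner with GNU coreutils (path hypothetical), and tools such as fdupes and rdfind package up the whole find-verify-link loop:

find /data -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate   # group files whose MD5 (first 32 chars) matches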

Re:Why not just (1)

shish (588640) | more than 2 years ago | (#38589268)

for pairs of files giving similar checksum

MD5 / SHA1 / most good hashing algorithms will give completely different hashes for a single bit of data difference

remove one of them and make it a hard-link to the other

What then happens when you write to one of the files?

Re:Why not just (0)

Anonymous Coward | more than 2 years ago | (#38589302)

Whatever you do, do not do this on directories under version control. You'll be sorry :)

Re:Why not just (1)

MagicM (85041) | more than 2 years ago | (#38589304)

1. go through all your files,
2. for each file, compute a checksum (e.g. using the unix tools md5sum or sha1sum),
3. for pairs of files giving similar checksum, compare them (optionally) and if equal remove one of them and make it a hard-link to the other.

If comparing them is optional, why not just delete both copies and store the checksum instead? I bet you'll save lots of space that way!

Re:Why not just (0)

Anonymous Coward | more than 2 years ago | (#38589346)

This is file-based deduplication - the best (imho) implementations are block-level. Equivalently, you would need to compute the sha1 of each block of each file, then delete the blocks you already have elsewhere... which, done naively, would render the file useless.

Block-level deduplication is in-line: instead of "delete it if we have it elsewhere", it stores a pointer to the identical original block.

The best example I have heard concerning this: two Excel documents, 250 rows each. In the 100th row, third column, a B is changed to a D. In an ideal deduplication scenario both files are saved to disk and fully usable, yet the second one consumes only about one block of additional space.

Dedup is just a marketing word.... (3, Informative)

Anonymous Coward | more than 2 years ago | (#38589106)

It needs an incredible amount of memory to operate effectively.
From my university notes:
5TB of data at an average block size of 64K = 78,125,000 blocks.
For each block the dedup table needs about 320 bytes, so
78,125,000 x 320 bytes = 25 GB of dedup table.

Use compression instead (e.g. ZFS compression).
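Compression, unlike dedup, needs no in-memory table; a minimal sketch with a hypothetical dataset name:

zfs set compression=on tank/data    # "on" selects the default algorithm (lzjb on these releases); gzip-N is also available
zfs get compressratio tank/data     # the ratio actually achieved so far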

Re:Dedup is just a marketing word.... (0)

Anonymous Coward | more than 2 years ago | (#38589170)

No it doesn't need an incredible amount of memory. Look at HAMMER's implementation.

Re:Dedup is just a marketing word.... (1)

kesuki (321456) | more than 2 years ago | (#38589356)

If you've got 5 TB of files, 25 GB doesn't seem so huge. Besides, you assume that it needs to keep track of every block, when you only need to hash every file. But yes, that is a massive burden on resources, and an implementation would require loads of RAM or a large swapfile, all just to save disk space at the cost of burdening the processor and RAM.

Re:Dedup is just a marketing word.... (5, Informative)

m.dillon (147925) | more than 2 years ago | (#38589414)

All dedup operations have a trade-off between disk I/O and memory use. The less memory you use the more disk I/O you have to do, and vice versa.

Think of it like this: You have to scan every block on the disk at least once (or at least scan all the meta-data once if the CRC/SHA/whatever is already recorded in meta-data). You generate (say) a 32-bit CRC for each block. You then [re]read the blocks whose CRCs match to determine whether the CRC found a matching block or simply had a collision.

The memory requirement for an all-in-one pass like this is that you have to record each block's CRC plus other information... essentially unbounded from the point of view of filesystem design and so not desirable.

To reduce memory use you can reduce the scan space... on your first pass of the disk only record CRCs in the 0x0-0x7FFFFFFF range, and ignore 0x80000000-0xFFFFFFFF. In other words, now you are using HALF the memory but you have to do TWO passes on the disk drive to find all possible matches.

The method DragonFly's HAMMER uses is to allocate a fixed-size memory buffer and start recording all CRCs as it scans the meta-data. When the memory buffer becomes full, DragonFly dynamically deletes the highest-recorded CRC (and no longer records CRCs >= that value) to make room. Once the pass is over, another pass is started beginning with the remaining range. As many passes are taken as required to exhaust the CRC space.

Because HAMMER stores a data CRC in meta-data the de-dup passes are mostly limited to just meta-data I/O, plus data reads only for those CRCs which collide, so it is fairly optimal.

This can be done with any size of CRC, but what you cannot do is avoid the verification pass... no matter how big your CRC or SHA-256 or whatever is, you still have to physically verify that the duplicate blocks are, in fact, exact duplicates before you de-dup their block references. A larger CRC is preferable to reduce collisions, but diminishing returns build up fairly quickly relative to the actual amount of data that can be de-duplicated. 64 bits is a reasonable trade-off, but even 32 bits works relatively well.

In any case, most deduplication algorithms are going to do something similar unless they were really stupidly written to require unbounded memory use.

-Matt
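A file-level way to picture the range-splitting trick, purely as an illustration (HAMMER does this at the block-CRC level, not with file hashes; path hypothetical):

# two passes, each covering half of the hash space, so each pass's in-memory table is half the size
for range in '^[0-7]' '^[89a-f]'; do
  find /data -type f -exec md5sum {} + | awk -v r="$range" '$1 ~ r' | sort | uniq -w32 -d
done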

Apologies to LL Cool J (0)

Anonymous Coward | more than 2 years ago | (#38589128)

Patrick Zevo: Are you taking my [de]duplication investigation seriously or are you disrespecting my [de]duplication investigation?

(LL Cool J to Robin Wright in the 1992 movie Toys)

What is deduplication? (5, Informative)

jdavidb (449077) | more than 2 years ago | (#38589134)

I had to Google to find out. Here's what I found: http://en.wikipedia.org/wiki/Data_deduplication [wikipedia.org]

Maybe everybody else is familiar with this term except for me, but I find it a bit off-putting for the submitter and the editors to not offer a small bit of explanation.

Re:What is deduplication? (2, Interesting)

BitZtream (692029) | more than 2 years ago | (#38589402)

Seriously, at this point on Slashdot it's been talked about enough that unless you bought your UID from someone, you should be fully aware of what it is from here alone.

Re:What is deduplication? (0)

Anonymous Coward | more than 2 years ago | (#38589506)

it's because slashdot has crowdsourced dupe whining, the internet's version of deduplication

I rolled my own... (-1)

Anonymous Coward | more than 2 years ago | (#38589140)

SDFS and co. are looking promising, but they didn't even meet my home stability requirements (which I don't feel are terribly high). You could hit OOM and plenty of other problems way too easily. Perhaps at some point in the future...?

What I did instead was something simpler: I started creating my own take on the deduplication provided by an existing Linux utility called "fslint" (file-level checksum-and-deduplicate).

I reinvented the thing in Java/Scala, adapted a bit for my own purposes. I will presumably publish it soon on GitHub, even if it likely won't get a nifty GUI...

OpenDedup Changelog not Stagnant (1)

MatthiasF (1853064) | more than 2 years ago | (#38589152)

The changelog shows an update two months ago; how's that stagnant?

Because the original date on the post is from April 2010?

They modified a previous post to put all of the changes in one place, dufus.

There is Duplicity (0)

Anonymous Coward | more than 2 years ago | (#38589192)

Duplicity [nongnu.org]. It's written in Python with various back-end storage types (disk, FTP, SFTP, Amazon S3, etc.). It's especially useful for backups.

Personal Project (1)

TheRealMindChild (743925) | more than 2 years ago | (#38589214)

I have recently written some personal software that finds duplicate files/directories. I had no idea there was such a demand for something like this (and price for such software).

There are a few hurdles with deduplication that a piece of software will likely need your input for:

- Do things like directory/file names matter? If so, does case matter? What about comparing ASCII names to Unicode names? Do you compare the DOS 8.3 names as well?
- What do you want to do when you find duplicates? Delete the duplicates and put links in their place? Which copy do you want the links to point to? Or just remove the duplicates... in which case, which one do you keep?
- Does metadata need to be compared too? What about the data itself? Different file systems have different gotchas. NTFS can have separate streams, which do not necessarily follow a file. If your file system supports snapshots, what are you really trying to compare?

OpenSolaris is Dead; Illumos is alive and well (0)

Anonymous Coward | more than 2 years ago | (#38589256)

Oracle has more or less killed OpenSolaris.

ALL Solaris, ZFS, DTrace and Zones engineering talent *and* Solaris community energy is now focused on Illumos.

More from highly entertaining ex-Sun engineer Bryan Cantrill in this talk:
http://www.youtube.com/watch?v=-zRN7XLCRhc

(funny and informative for those of you looking for some Oracle-bashing and Solaris history)

ZFS on linux (0)

Anonymous Coward | more than 2 years ago | (#38589270)

Look into the ZFS On Linux project (NOT fuse). It has native performance, and it supports dedup. I've been running it on a small office server for 4 months now without issue, though I have not used the dedup feature due to its RAM requirements. See http://zfsonlinux.org/

Data Deduplication is probably the wrong solution (0)

Anonymous Coward | more than 2 years ago | (#38589350)

The original post doesn't quite describe the application well enough to make any absolute judgements, but I doubt Data Deduplication is going to help much.

I've had experience using ZFS with its data deduplication. If you're using ZFS across your operation, it might make sense to have an alternate ZFS server be a "backup" of your primary store, but if ZFS is not your thing, you're headed down the wrong path. Deduplication is somewhat computationally expensive to implement and locks your data into the target format. Changing your application to fit ZFS as the target is an expense you'll have to factor in. (An expense likely as big as the disk space you believe you will be saving.)

OTOH, disks keep getting bigger and cheaper. You may not be able to count on OpenSolaris or even FreeBSD, but you can certainly count on Western Digital/Seagate to continue to offer solutions at a reasonable cost per byte.

ZFS dedup is not cheap. (0)

Anonymous Coward | more than 2 years ago | (#38589370)

ZFS dedup has a serious impact on write performance if you enable it.

Just buy more disk.

FreeBSD + ZFS (0)

BitZtream (692029) | more than 2 years ago | (#38589376)

FreeBSD is certainly alive and well, with an actively developed, real ZFS implementation.

Keep in mind, however, that ZFS with deduplication is not going to be a 'high performance' file system. You aren't going to use it for anything where latency is of any importance at all. It makes a pretty shitty VMware storage pool, for instance, even with SSDs for log devices.

It's silly to consider anything else. You never actually wanted to run the current bastard child previously known as Solaris; it's actually proof that open-sourcing a project can make it suck more than beforehand. Linux is out, since its silly license is so paranoid about someone using their precious code that they can't use anyone else's without arguing about legality rather than actually doing something useful. All you're left with is FreeBSD.

DDAR (0)

Anonymous Coward | more than 2 years ago | (#38589444)

http://www.synctus.com/ddar/

How about SREP ? (0)

Anonymous Coward | more than 2 years ago | (#38589474)

SuperREP: huge-dictionary LZ77 preprocessor.

It de-dupes and preprocesses data, helping zip/gz/bz compress tighter and faster than they would otherwise.

http://freearc.org/research/SREP.aspx

Ubuntu 11.10 (0)

Anonymous Coward | more than 2 years ago | (#38589544)

I'm utilizing Ubuntu 11.10 Server with ZFS right now and have compression and dedup turned on.
Works great... had to download ZFS as it didn't come preloaded.
