
Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?

samzenpus posted about 2 years ago | from the copies-of-the-copies dept.

Data Storage | 440 comments

First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types, from digital photos & scans to HD video files (also web site backups mixed in, which are the cause of such a high number of files). In more recent times I've organized files in a reasonable folder system and have an active, automated backup system. The problem is that I know I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that was later consolidated onto a single larger drive), chewing up space. I tried running a free de-dup program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, an i7 at 2.8 GHz with 16 GB of RAM, but currently have 4.9 TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old, sloppy filing (folder) system. I do need to keep the data; nuking it is not a viable option."


CRC (5, Informative)

Spazmania (174582) | about 2 years ago | (#41205117)

Do a CRC32 of each file. Write one line per file to an output file in this order: CRC, directory, filename. Sort that file by CRC. Read it linearly, doing a full compare on any files with the same CRC (these will be adjacent in the file).
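In Python terms, that boils down to something like this (a rough sketch only, assuming Python 3.8+; zlib.crc32 and filecmp come from the standard library, and grouping in a dict replaces the sort-the-text-file step):

import os, sys, zlib, filecmp
from collections import defaultdict

def crc32_of(path, bufsize=1 << 20):
    # Stream the file through zlib.crc32 so huge files never need to fit in RAM.
    crc = 0
    with open(path, 'rb') as f:
        while chunk := f.read(bufsize):
            crc = zlib.crc32(chunk, crc)
    return crc

by_crc = defaultdict(list)
for root, _, names in os.walk(sys.argv[1]):
    for name in names:
        path = os.path.join(root, name)
        try:
            by_crc[crc32_of(path)].append(path)
        except OSError:
            pass  # unreadable file, skip it

# Files sharing a CRC are only candidates; confirm with a full byte-for-byte compare.
for crc, paths in by_crc.items():
    for other in paths[1:]:
        if filecmp.cmp(paths[0], other, shallow=False):
            print(f"dupe: {other} == {paths[0]}")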

Re:CRC (5, Informative)

Anonymous Coward | about 2 years ago | (#41205157)

s/CRC32/sha1 or md5, you won't be CPU bound anyway.

Re:CRC (5, Informative)

Kral_Blbec (1201285) | about 2 years ago | (#41205173)

Or filter by file size first, then do a hash. No need to compute a hash to compare a 1 MB file and a 1 KB file.

Re:CRC (-1)

Anonymous Coward | about 2 years ago | (#41205193)

What was that sound? It was the sound of me shooting a fart directly out of a certain asshole; my anus! "But why would you expel flatulence out of your own ass and not someone else's?" you ask. It's an extremely rare occurrence that only happens when I read something that is blatantly false. When the flatulence shot out of my asshole, I was reading Spazmania's comment. Draw your own conclusions...

Re:CRC (5, Informative)

caluml (551744) | about 2 years ago | (#41205331)

Exactly. What I do is this:

1. Compare filesizes.
2. When there are multiple files with the same size, start diffing them. I don't read the whole file to compute a checksum - that's inefficient with large files. I simply read the two files byte by byte, and compare - that way, I can quit checking as soon as I hit the first different byte.

Source is at https://github.com/caluml/finddups [github.com] - it needs some tidying up, but it works pretty well.

git clone, and then mvn clean install.
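For anyone who doesn't want to pull the Java project, the core of step 2 looks roughly like this in Python (my own sketch, not caluml's code):

def same_bytes(a, b, bufsize=1 << 20):
    # Compare two files block by block and stop at the first difference, so
    # mismatched multi-GB files only cost a read of their common prefix.
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        while True:
            block_a, block_b = fa.read(bufsize), fb.read(bufsize)
            if block_a != block_b:
                return False
            if not block_a:   # both files ended at the same point: identical
                return True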

Re:CRC (5, Insightful)

bzipitidoo (647217) | about 2 years ago | (#41205601)

Part 2 of your method will quickly bog down if you run into many files that are the same size. Takes (n choose 2) comparisons, for a problem that can be done in n time. If you have 100 files all of one size, you'll have to do 4950 comparisons. Much faster to compute and sort 100 checksums.

Also, you don't have to read the whole file to make use of checksums, CRCs, hashes and the like. Just check a few pieces likely to be different if the files are different, such as the first and last 2000 bytes. Then for those files with matching parts, check the full files.

Re:CRC (2, Informative)

belg4mit (152620) | about 2 years ago | (#41205615)

Unique Filer http://www.uniquefiler.com/ [uniquefiler.com] implements these short-circuits for you.

It's meant for images but will handle any filetype, and even runs under WINE.

Re:CRC (3, Insightful)

Joce640k (829181) | about 2 years ago | (#41205401)

s/CRC32/sha1 or md5, you won't be CPU bound anyway.

Whatever you use it's going to be SLOW on 5TB of data. You can probably eliminate 90% of the work just by:
a) Looking at file sizes, then
b) Looking at the first few bytes of files with the same size.

After THAT you can start with the checksums.

Re:CRC (2)

WoLpH (699064) | about 2 years ago | (#41205467)

Indeed, I once created a dedup script which basically did that.

1. compare the file sizes
2. compare the first 1MB of the file
3. compare the last 1MB of the file
4. compare the middle 1MB in the file

It's not a 100% foolproof solution but it was more than enough for my use case at that time and much faster than getting checksums.
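A sketch of that kind of sampled fingerprint in Python (hypothetical, using MD5 over the three samples; as the comment says, it screens candidates rather than proving equality):

import os, hashlib

def quick_fingerprint(path, chunk=1 << 20):
    # Hash only the first, middle and last 1 MB plus the size: a cheap screen.
    size = os.path.getsize(path)
    h = hashlib.md5()
    with open(path, 'rb') as f:
        h.update(f.read(chunk))                      # first 1 MB
        f.seek(max(size // 2 - chunk // 2, 0))
        h.update(f.read(chunk))                      # middle 1 MB
        f.seek(max(size - chunk, 0))
        h.update(f.read(chunk))                      # last 1 MB
    return size, h.hexdigest()

Files that agree on this (size, digest) pair are then worth a full byte-for-byte comparison if you need certainty.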

Re:CRC (0)

Anonymous Coward | about 2 years ago | (#41205167)

Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).

Would you be so kind to write a program/script which can do that ?

Re:CRC (2)

Spazmania (174582) | about 2 years ago | (#41205249)

I have a script which does this for openstreetmap tiles. Once it identifies the dupes it archives all the tiles into a single file, pointing the dupes at a single copy in the archive. Then I use a Linux fuse filesystem to read the file and present the results to Apache. Saves a truly massive amount of disk space for an openstreetmap server, since the files are mostly smaller than a single disk block, so the space lost to the inode and the unused part of the last block is never insignificant.

Re:CRC (1)

wisty (1335733) | about 2 years ago | (#41205169)

This is similar to what git and ZFS do (but with a better hash, some kind of sha I think).

Re:CRC (2)

Pieroxy (222434) | about 2 years ago | (#41205179)

Exactly.

1. Install MySQL,
2. create a table (CRC, directory, filename, filesize)
3. fill it in
4. play with inner joins.

I'd even go down the path of forgetting about the CRC. Before deleting something, do a manual check anyways. CRC has the advantage of making things very straightforward but is a bit more complex to generate.
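The same idea with Python's built-in sqlite3 standing in for MySQL (my substitution, so there's no server to install; the table layout follows the list above):

import sqlite3

con = sqlite3.connect('files.db')
con.execute("""CREATE TABLE IF NOT EXISTS files
               (crc TEXT, directory TEXT, filename TEXT, filesize INTEGER)""")
# ... step 3: fill the table from a directory scan, then step 4 is one self-join:
rows = con.execute("""
    SELECT a.directory, a.filename, b.directory, b.filename
      FROM files a
      JOIN files b ON a.filesize = b.filesize
                  AND a.crc = b.crc
                  AND a.rowid < b.rowid
""").fetchall()
for a_dir, a_name, b_dir, b_name in rows:
    print(f"{b_dir}/{b_name} looks like a dupe of {a_dir}/{a_name}")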

Re:CRC (0)

Anonymous Coward | about 2 years ago | (#41205215)

Manual check first? I got the impression that this guy had LOTS of data and presumably also LOTS of dupes. I would hate doing manual checks on tens or hundreds of thousands of files.

Re:CRC (1)

Pieroxy (222434) | about 2 years ago | (#41205285)

You can check a few files in a directory and then easily deduce the whole directory is a dupe. You don't have to do it file by file.

Plus, when the system finds a dupe, you need to tell it which copy it should delete, or else you risk having stuff all around and not knowing where it is. Some file you knew was in directory A/B/C/D is suddenly not there anymore and you have no clue where its "dupe" is located. Unless the dupe finder creates symlinks in place of the deleted file...

Re:CRC (0)

Anonymous Coward | about 2 years ago | (#41205611)

You can check a few files in a directory and then easily deduce the whole directory is a dupe. You don't have to do it file by file.

How can you be sure a single file wasn't added to one duplicate?

Re:CRC (1)

the eric conspiracy (20178) | about 2 years ago | (#41205245)

Use SHA-1 instead of CRC.

Re:CRC (2, Interesting)

vlm (69642) | about 2 years ago | (#41205305)

4. play with inner joins.

Much like there's 50 ways to do anything in Perl, there's quite a few ways to do this in SQL.

select filename_and_backup_tape_number_and_stuff_like_that, count(*) as number_of_copies
from pile_of_junk_table
group by md5hash
having number_of_copies > 1

There's another strategy where you mush two tables up against each other... one is basically the DISTINCT of the other.

Triggers are widely complained about, but you can implement a trigger system (or pseudo-trigger, where you make a wrapper function in your app) where basically a table of "files" is stored with a column called "count of identical md5hash" and then your sql looks like select * from pile where identicalcount>1

There's ways to play with views.

Do you need to run it interactively or batch it or just run it basically once or ... If you're allowed to barf on data input you can even enforce the md5 hash as a UNIQUE INDEX or UNIQUE KEY in the table definition.

You'll learn a lot about how to think about high performance computing. Are you trying to minimize latency or minimize storage or minimize index size or maximize reliability/uptime or minimize processor time or minimize NAS bandwidth or minimize (initial OR maintenance) programming time or ....

The funniest thing is, if you've never tried restoring data from backups (hey, it happens), and/or never had a tape failure (hey, it happens), you'll THINK you want to eliminate dupes, but trust me, those dupes will save your bacon someday, and tape is cheap compared to the cost of a programmer and the cost of lost data.... 5 TB is not much technically but is obviously worth a lot from a business standpoint...

Also from personal experience you're going to find people gaming the system, where DOOM3.EXE and NOTEPAD.EXE happen to have the same md5hash and length and NOTEPAD.EXE was found on a not-totally-but-pretty-much noob's desk. Use some judgement and don't come down too hard on the newest of new learners.

Re:CRC (1)

cheesybagel (670288) | about 2 years ago | (#41205233)

md5sum `find /` | sort -k1,1

Or something like that. You probably need xargs. My script-fu is weak.

Re:CRC (1)

Rich0 (548339) | about 2 years ago | (#41205313)

Things get unnecessarily messy when you have to do them all in one line. However, if I were doing this as a one-time operation, I'd start with something like what you suggest, dumping the results into file1.

Then I'd cat the whole thing through awk '{ print $1 }' | uniq -d > file2 to get a list of all the hashes that are not unique (that way you can focus on the duplicates and not have to scan that huge file).

Then I'd grep the original file with grep -f file2 file1 > file3 to get the full output of the original search for each of the duplicated files.

Chances are that you're going to want to semi-manually deal with the duplicates, but if you pare the file down to a list of stuff where you just want to keep the first instance of each duplicate then I'm sure it wouldn't be hard to remove the first one and then pass the rest into an rm command. You'd want to be careful, since there could be numerous duplicates that the system expects to be there.

All the steps above would be very fast, aside from the original md5sum. And yes, I'd probably use xargs so that you don't have 40 bazillion arguments to md5sum. Or you can just use the find option that executes a command against each one. find / | xargs -n 1 md5sum is probably what you're looking for.

Re:CRC (0)

Anonymous Coward | about 2 years ago | (#41205607)

find / | xargs -n 1 md5sum is probably what you're looking for.

This will fail unless you're geeky enough to have absolutely no spaces in any filenames.

Re:CRC (0)

Anonymous Coward | about 2 years ago | (#41205581)

md5sum `find /` | sort -k1,1

This will fail unless very few files are found.

Re:CRC (2)

SwashbucklingCowboy (727629) | about 2 years ago | (#41205259)

DO NOT do a CRC, do a hash. Too many chances of collision with a CRC.

But that still won't fix his real problem - he's got lots of data to process and only one system to process it with.

Re:CRC (1)

Joce640k (829181) | about 2 years ago | (#41205383)

Did you read the bit about "doing a full compare on any file with the same CRC"?

The CRC is just for bringing likely files together. It will work fine.

Re:CRC (3, Interesting)

igb (28052) | about 2 years ago | (#41205507)

The problem isn't CRC vs secure hash, the problem is the number of bits available. He's not concerned about an attacker sneaking collisions into his filestore, and he always has the option of either a byte-by-byte comparison or choosing some number of random blocks to confirm the files are in fact the same. But 32 bits isn't enough, simply because with 4.2 million files hashed into only 2^32 possible values the birthday paradox makes collisions between different files all but guaranteed. Using two different 32-bit CRC algorithms, for example, wouldn't be "secure" but would be reasonably safe. But as he's going to be disk bound, calculating an SHA-512 would be reasonable, as he can probably do that faster than he can read the data.

I confess, if I had a modern i5 or i7 processor and appropriate software I'd be tempted to in fact calculate some sort of AES-based HMAC, as I would have hardware assist to do that.

Re:CRC (5, Insightful)

igb (28052) | about 2 years ago | (#41205291)

That involves reading every byte. It would be faster to read the bytecount of each file, which doesn't involve reading the files themselves as that metadata is available, and then exclude from further examination all the files which have unique sizes. You could then read the first block of each large file, and discard all the files that have unique first blocks. After that, CRC32 (or MD5 or SHA1 --- you're going to be disk-bound anyway) and look for duplicates that way.
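That staged screen might look like this in Python (a rough sketch; the 64 KB "first block" size is my own arbitrary choice):

import os, sys
from collections import defaultdict

# Stage 1: group by size using metadata only -- no file contents are read here.
by_size = defaultdict(list)
for root, _, names in os.walk(sys.argv[1]):
    for name in names:
        path = os.path.join(root, name)
        by_size[os.path.getsize(path)].append(path)

# Stage 2: within each same-size group, bucket by the first block of the file.
candidates = []
for size, paths in by_size.items():
    if len(paths) < 2:
        continue
    by_head = defaultdict(list)
    for path in paths:
        with open(path, 'rb') as f:
            by_head[f.read(65536)].append(path)
    candidates += [group for group in by_head.values() if len(group) > 1]

# Stage 3: only the surviving groups get hashed (or byte-compared) in full.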

Re:CRC (1)

baffled (1034554) | about 2 years ago | (#41205321)

Sounds ideal. Wouldn't take long to code, nor execute.

Re:CRC (2)

kanweg (771128) | about 2 years ago | (#41205403)

You're not baffled.

Bert

Re:CRC (2)

TheGratefulNet (143330) | about 2 years ago | (#41205591)

divide and conquer.

your idea of using file size as the first discriminant is good. it's fast and throws out a lot of things that don't need to be checked.

another accelerant is to check whether the count of files in a folder is the same. and if a few are the same, maybe the rest are. use 'info' like that to make it run faster.

I have this problem and am going to write some code to do this, too.

but I might have some files that are 'close' to the others and so I need smarter code. example: some music files might be the same in content but only vary in tags. or their titles are different. or maybe even their run length is slightly diff but they are still mostly the same file. I'd want to dedupe those, too.

you would have a manual list to verify (the computer thinks these are the same; please verify, mr human).

some files may have errors in them! maybe I made copies of mp3 files and there was a static hit on one disk. finding by dupe filename and even size is not good enough. you found 2 contenders, but which is the CLEAN file? which has no dropouts or buzzsaws? same for photos, too, if you retouch photos you may not know which is the original or the fixed/keeper.

special knowledge helps here. if its audio, if its video, if its text, spreadsheets, o/s runnable files, etc conf files, all can use diff 'tricks' to help accelerate.

this is why this solution is NOT easy unless you just go brute force by disk block. and this is not do-able on anything large unless you have hardware support.

Re:CRC (0)

Anonymous Coward | about 2 years ago | (#41205325)

The easiest way is to:
1. Bin by extension
2. Then bin by size
3. Then bin by MD5 / CRC32
You put those files in / look them up as hashes, typically with a directory structure. The top-level directory would be the bucket, of course.

Finally, to retain the structure, stick to symlinks/links. Oh, if it's windows, write the list of duplicate files into a text file or something.

Re:CRC (0)

Anonymous Coward | about 2 years ago | (#41205499)

The one issue (with some probability) where it won't work is if file extensions are intentionally changed.

Re:CRC (4, Informative)

Zocalo (252965) | about 2 years ago | (#41205327)

No. No. No. Blindly CRCing every file is probably what took so long on the first pass and is a terribly inefficient way of de-duplicating files.

There is absolutely no point in generating CRCs of files unless they match on some other, simpler to compare characteristic like file size. The trick is to break the problem apart into smaller chunks. Start with the very large files; the exact size break to use will depend on the data set, but as the poster mentioned video files, say everything over 1 GB to start. Chances are you can fully de-dupe your very large files manually, based on nothing more than a visual inspection of names and file sizes, in little more time than it takes to find them all in the first place. You can then exclude those files from further checks, and more importantly, from CRC generation.

After that, try and break the problem down into smaller chunks. Whether you are sorting on size, name or CRC, it's quicker to do so when you only have a few hundred thousand files rather than several million. Maybe do another size-constrained search; 512MB-1GB, say. Or, if you have them, look for duplicated backup files in the form of ZIP files, or whatever archive format(s) you are using, based on their extension - that also saves you having to expand and examine the contents of multiple archive files. Similarly, do a de-dupe of just the video files by extension, as these should again lend themselves to rapid manual sorting without having to generate CRCs for many GB of data. Another grouping to consider might be to at least try and get all of the website data, or as much of it as you can, into one place and de-dupe that, and consider whether you really need multiple archival copies of a site, or whether just the latest/final revision will do.

By the time you've done all that, including moving the stuff that you know is unique out of the way and into a better filing structure as you go, the remainder should be much more manageable for a single final pass. Scan the lot, identify duplicates based on something simple like the file size, and ideally get your de-dupe tool to CRC only those groups of identically sized files that you can't easily tell apart, like bunches of identically sized word processor or image files with cryptic file names.

Re:CRC (0)

Anonymous Coward | about 2 years ago | (#41205337)

sha1sum would be a better choice than crc32, just to avoid unnecessary hash collisions.
The most "difficult" part of it all I suppose is proper filetree traversal.
All in all, I'd say coding time is less than 15 minutes if one is familiar with the Win API.

Re:CRC (0)

Anonymous Coward | about 2 years ago | (#41205381)

That only holds if you haven't already received a duplicate, so the probability of a CRC32 collision for any given pair will be something like 1/2^32 ....

Re:CRC (1)

TubeSteak (669689) | about 2 years ago | (#41205363)

It's possible the free de-dup program was trying to do that.
Best case scenarios would put your hash time at roughly 14 to 55 hours (100 MB/s down to 25 MB/s) for 4.9 TB

But millions of small files are the absolute worst case scenario.
God help you if there's any fragmentation.

Re:CRC (5, Informative)

Anonymous Coward | about 2 years ago | (#41205393)

If you get a linux image running (say in a livecd or VM) that can access the file system then fdupes is built to do this already. Various output format/recursion options.

From the man page:
DESCRIPTION
              Searches the given path for duplicate files. Such files are found by
              comparing file sizes and MD5 signatures, followed by a byte-by-byte
              comparison.

Re:CRC (1)

Art Challenor (2621733) | about 2 years ago | (#41205395)

The problem is that you need more intelligence.
If you've dup'd a folder, what in your scheme ensures that one complete folder will be removed? You could end up with both folders having half the files in each - an organizational nightmare.

Re:CRC (1)

DarkOx (621550) | about 2 years ago | (#41205563)

Right, and these are backups, so it's useful to have not just every unique file but their layout. If they were all in a folder together at one time, it's useful to preserve that fact.

It sounds like the poster is somewhat organized; he was making backups in the first place. What he failed to do was manage versioning and generations. My inclination would be to copy the entire thing onto some other file system that does block-level dedupe. Keep all the files, map them onto the same media underneath where they are similar. Likely he will save more space this way as well. All the other suggestions about using SHA-1 or some form of CRC are going to result in keeping full copies of files that are 99% the same. The transaction history file from a personal finance app is a perfect example.

It might have some headers in the first block that get updated and then more data appended to the end, while all the stuff in the middle never changes. That might get backed up every week or every day. It's mostly not unique; he would save lots of space deduping at the block layer rather than the file layer.

Why not piss into your own mouth? (-1)

Anonymous Coward | about 2 years ago | (#41205123)

I don't know if pissing into your own mouth will help with the file problem you have or not, but it's worth a try.

This sounds easy. (1)

MusicOS (2717681) | about 2 years ago | (#41205127)

You need to sort by file size. Then compare matches. You might hash files and sort by hashes.

Don't do it all at once (0)

Anonymous Coward | about 2 years ago | (#41205137)

Dedup in smaller pieces. It may take the same amount of time, but you will see progress.
Dedup the contents of one folder (or a small set of folders), then the next, etc.
Once you are finished, dedup the entire disk.

Is this porn? (0)

Anonymous Coward | about 2 years ago | (#41205143)

Just wondering....

Re:Is this porn? (0)

Anonymous Coward | about 2 years ago | (#41205183)

With porn the job is harder because you have to worry about the same piece of porn being saved at different resolutions.

Like so: (1)

Arffeh (2651647) | about 2 years ago | (#41205155)

Very, VERY carefully.

ZFS (2)

smash (1351) | about 2 years ago | (#41205163)

as per subject.

Re:ZFS (4, Informative)

smash (1351) | about 2 years ago | (#41205189)

To clarify - no, this will not remove duplicate references to the data. The filesystem will remain intact. However, it will perform block-level dedupe of the data, which will recover your space. Duplicate references aren't necessarily a bad thing anyway: if you have any sort of content index (memory, code, etc.) that refers to data in a particular location, it will continue to work. However, the space will be recovered.

Re:ZFS (1)

Gordonjcp (186804) | about 2 years ago | (#41205519)

Could you then use something clever in ZFS to identify files that reference shared data?

Re:ZFS (0)

Anonymous Coward | about 2 years ago | (#41205479)

Deduplication can be done with HAMMER on DragonflyBSD, too.

Simplify the list (1)

jrq (119773) | about 2 years ago | (#41205177)

Scan all simple file details (name, size, date, path) into a simple database. Sort on size, remove unique sized files. Decide on your criteria for identifying duplicates, whether it's by name or CRC, and then proceed to identify the dupes. Keep logs and stats.

My suggestion (0)

DoofusOfDeath (636671) | about 2 years ago | (#41205185)

First step, get a big cup of good coffee. Or a fifth of Jack Daniels, if you just want to finish in 10 minutes.

Hardlinks? (1)

airjrdn (681898) | about 2 years ago | (#41205197)

If you can get them on a single filesystem (drive/partition), check out Duplicate and Same Files Searcher ( http://malich.ru/duplicate_searcher.aspx [malich.ru] ) which will replace duplicates with hardlinks. I link to that and a few others (some specific to locating similar images) on my freeware site; http://missingbytes.net/ [missingbytes.net] Good luck.

Re:Hardlinks? (0)

Anonymous Coward | about 2 years ago | (#41205231)

Duplicate Cleaner will also replace with hard links.

The biggest problem you will have (1)

Anonymous Coward | about 2 years ago | (#41205203)

is not finding the same file, but when you have duplicate files associated with different applications. For example Program A and Program B both install a fonts directory with thousands of fonts most of which are identical.

Or if you install multiple copies of slightly different versions of the same OS ...

There are tools for this (5, Informative)

Anonymous Coward | about 2 years ago | (#41205209)

If you don't mind booting Linux (a live version will do), fdupes [wikipedia.org] has been fast enough for my needs and has various options to help you when multiple collisions occur. For finding similar images with non-identical checksums, findimagedupes [jhnc.org] will work, although it's obviously much slower than a straight 1-to-1 checksum comparison.

YMMV

Break it up into chunks (1)

Gordonjcp (186804) | about 2 years ago | (#41205211)

Use something like find to generate a rough "map" of where duplications are and then pull out duplicates from that. You can then work your way back up, merging as you go.

I've found that deja-dup works pretty well for this, but since it takes an md5sum of each file it can be slow on extremely large directory trees.

If you're comfortable with Linux try FSlint (0)

Anonymous Coward | about 2 years ago | (#41205213)

There's a great tool for Linux with a GUI called FSlint. If your data is portable, you could use that method. I've used it on several hundred GB of data, though never on anything as large as what you have. Either a separate box with Linux or a LiveCD should work fine. There might be similar tools for Windows, but I haven't seen them.

Simple dedupe algorithm (5, Funny)

Anonymous Coward | about 2 years ago | (#41205217)

Delete all files but one. The remaining file is guaranteed unique!

Don't waste your time. (4, Insightful)

Fuzzums (250400) | about 2 years ago | (#41205219)

If you really want to, sort, order and index it all, but my suggestion would be different.

If you didn't need the files in the last 5 years, you'll probably never need them at all.
Maybe one or two. Make one volume called OldSh1t, index it, and forget about it again.

Really. Unless you have a very good reason to un-dupe everything, don't.

I have my share of old files and dupes. I know what you're talking about :)
Well, the sun is shining. If you need me, I'm outside.

Re:Don't waste your time. (3, Interesting)

equex (747231) | about 2 years ago | (#41205571)

I probably have 5-10 gigs of everything I ever did on a computer. All this is wrapped in a perpetual folder structure of older backups within old backups within... I've tried sorting it and deduping it with various tools, but there's no point. You find this snippet named clever_code_2002.c at 10 KB and then the same file somewhere else at 11 KB, and how do you know which one to keep? Are you going to inspect every file? Are you going to auto-dedupe it based on size? On date? It won't work out in the end, I'm afraid.

The closest I have gotten to some structure in the madness is to put all single files of the same type in the same folder, and keep a folder with stuff that needs to be in folders. Put a folder named 'unsorted' anywhere you want when you are not sure right away what to do with a file(s). Copy all your stuff into the folders. Decide if you want to rename dupes to file_that_exists(1).jpg or leave them in their original folders and sort it out later in the file copy/move dialogs that pop up when it detects similar folders/files. I like to just rename them, and then whenever I browse a particular 'ancient' folder, I quickly sort through some files every time. Over time, it becomes tidier and tidier.

One tool that everyone should use is Locate32. It indexes your preferred locations and stores the index in a database when you want it to (it's not a service). You can then search very much like the old Windows search function again, only much, much better.

Prioritize by file size (5, Insightful)

jwales (97533) | about 2 years ago | (#41205221)

Since the objective is to recover disk space, the smallest couple of million files are unlikely to do very much for you at all. It's the big files that are the issue in most situations.

Compile a list of all your files, sorted by size. The ones that are the same size and the same name are probably the same file. If you're paranoid about duplicate file names and sizes (entirely plausible in some situations), then crc32 or byte-wise comparison can be done for reasonable or absolute certainty. Presumably at that point, to maintain integrity of any links to these files, you'll want to replace the files with hard links (not soft links!) so that you can later manually delete any of the "copies" without hurting all the other "copies". (There won't be separate copies, just hard links to one copy.)

If you give up after a week, or even a day, at least you will have made progress on the most important stuff.
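Replacing a confirmed dupe with a hard link could look like this in Python (a sketch; os.link works on NTFS under Windows, but only when both paths are on the same volume):

import os

def replace_with_hardlink(dupe, original):
    # Keep the dupe under a temporary name until the hard link is safely in place.
    tmp = dupe + '.dedupe-tmp'
    os.rename(dupe, tmp)
    try:
        os.link(original, dupe)   # 'dupe' now points at the same data as 'original'
    except OSError:
        os.rename(tmp, dupe)      # roll back if the link could not be created
        raise
    os.remove(tmp)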

Re:Prioritize by file size (1)

TubeSteak (669689) | about 2 years ago | (#41205447)

Remember the good old days when a 10 byte text file would take up a 2KB block on your hard drive?
Well now hard drives use a 4KB block size.

Web site backups = millions of small files = the worst case scenario for space

Linux livecd? (3)

thePowerOfGrayskull (905905) | about 2 years ago | (#41205243)

perhaps you could boot with a livecd and mount your windows drives under a single directory? Then:

find /your/mount/point -type f -exec sha256sum > sums.out
uniq -u -w 64 sums.out

Re:Linux livecd? (1)

thePowerOfGrayskull (905905) | about 2 years ago | (#41205265)

Damn, just remembered that won't include the filename :) I'll reply with a fix once I get back to my pc unless someone else beats me to it.

Re:Linux livecd? (3, Insightful)

dargaud (518470) | about 2 years ago | (#41205547)

Read the other comments: that's highly inefficient. Compare the file sizes, then diff the files until the first differing byte. No need to checksum two TB files if the first bytes are different!

Re:Linux livecd? (1)

thePowerOfGrayskull (905905) | about 2 years ago | (#41205577)

Fixed below:

find /exports -type f | xargs -d "\n" sha256sum | sort > sums.out
uniq -d -w 64 sums.out

You could also do another pipe to run it in one line, but this way you have a list of files and checksums if you want them for anything else in the future.

don't run the app on a usb EXT disk (2)

Joe_Dragon (2206452) | about 2 years ago | (#41205247)

Put the disk on the built-in SATA bus, or use eSATA or even FireWire.

Don't worry about it (1, Insightful)

MassiveForces (991813) | about 2 years ago | (#41205257)

If nuking it isn't an option, it's valuable to you. There are programs that can delete duplicates, but if you want some tolerance to changes in file-name and age, they can get hard to trust. But with the price of drives these days, is it worth your time de-duping them?

First, copy everything to a NAS with new drives in it in RAID5. Store the old drives someplace safe (they may stop working if left off for too long, but it's better to have them if something does go wrong with the NAS, right?).

Then, copy everything current to your new backup drives on your computer, and automate the backup so that it only keeps two or three versions of files so you don't end up with this problem again. Keep track of things you want to archive and archive them separately.

An ounce of prevention is better than a pound of cure. We all get into backup and duplicate problems eventually. I have found keeping my core work in dropbox and making a backup of it occasionally provides enough measure of data backup for me, but the information I generate in the lab doesn't take up so much space.

Review all deletions! (0)

Anonymous Coward | about 2 years ago | (#41205261)

Just remember to not delete anything automatically. There may very well be files that are meant to exist in duplicates.

It's going to take a long time (1)

fa2k (881632) | about 2 years ago | (#41205267)

Assuming fully sequential access, reading 5 TB of data at 100 MB/s takes 14 hours. With a mean file size of 1 M, you probably have a lot of tiny files and a few big files. The access will be far from sequential, so the access time will be many times greater. Don't expect it to be quick.

I would probably cook some script together with Cygwin, md5sum and find, but if you have duplicated *directories*, you may have to get smarter. With a simple script (I may post one later if nobody else has a better idea), the end result would be a list of files with identical hashes, and you'd have to decide what to do about them. [I would actually use a filesystem with built-in deduplication, like ZFS, and failing that I would write a script to hard-link identical files. But it's kind of limited what you can do on Windows]

Re:It's going to take a long time (1)

fa2k (881632) | about 2 years ago | (#41205375)

As MusicOS said above, sort by file size first, then you don't have to hash every file, just the ones that have equal size. Still going to be slow, though.

Checksum (1)

Anonymous Coward | about 2 years ago | (#41205283)

cd directory_with_files
md5sum * | sort

I wouldn't recommend using crc32 if you have a substantial amount of files or else you risk a collision (i.e. two different files that produce the exact same crc32).

General advice... (0)

Anonymous Coward | about 2 years ago | (#41205289)

Your write-up seems to imply you've only attempted dealing with it as a whole. If you're dealing with that much data all at once, the only software that will help you are database tools. Have you tried breaking it into smaller tasks like a year's worth of data or 50 gigs worth, etc?

My other suggestion would be to try a different hammer. Since Linux programs span desktop, server and database use, there's a possibility you might be able to find dupe-sorting programs that won't choke when such huge amounts of data are thrown at them.


Good free command line tool (1)

Anonymous Coward | about 2 years ago | (#41205293)

I recently had this problem and solved it with finddupe (http://www.sentex.net/~mwandel/finddupe/). It's a free command line tool. It can create hardlinks, you can tell it which is a master directory to keep and which directories to delete, and it can create a batch file to actually do the deletion if you don't trust it or just want to see what it will do. Highly recommend. In any case, 5 TB is going to take forever, but with finddupe you can be sure your time is not wasted, unlike one of the free tools that analyzed my drive for 12 hours and then told me it would only fix ten duplicates.

Re:Good free command line tool (3, Interesting)

Acy James Stapp (1005) | about 2 years ago | (#41205339)

I recently had this problem and solved it with finddupe (http://www.sentex.net/~mwandel/finddupe/). It's a free command line tool. It can create hardlinks, you can tell it which is a master directory to keep and which directories to delete, and it can create a batch file to actually do the deletion if you don't trust it or just want to see what it will do. Highly recommend. In any case, 5 TB is going to take forever, but with finddupe you can be sure your time is not wasted, unlike one of the free tools that analyzed my drive for 12 hours and then told me it would only fix ten duplicates.

I tried this vs. Clone Spy, Fast Duplicate File Finder, Easy Duplicate File Finder, and the GPL Duplicate Files Finder (crashy). (Side note: Get some creativity guys). There's no UI but I don't care. It doesn't keep any state between runs so run it a few times on subdirectories to make sure you know what it's doing first then let it rip.

How about this? (1)

Hans Adler (2446464) | about 2 years ago | (#41205295)

As it is mostly about space, ignore the smaller files. For large files, the file size is already a pretty close approximation to a unique hash. First of all, create a database with size/path information and some extra fields where you will later add better hash sums and maybe note how far you got in processing.

Process files by decreasing size. If there are only two files of a particular size, compare them directly.

If there are more than two files of a particular size, get a better hash for each. (Choose a fast hashing algorithm that looks only at the first KB or so of the files.) After that, make the obvious comparisons to detect precise copies.

I have some further ideas in case this is still not fast enough, but I am worried that I may have already pissed off enough people by reinventing key parts of their precious patented algorithms without mentioning them.

FDUPE (0)

Anonymous Coward | about 2 years ago | (#41205297)

http://neaptide.org/projects/fdupe/

Desired outcome (1)

markus_baertschi (259069) | about 2 years ago | (#41205303)

You don't say what your desired outcome is.

If this were my data, I would proceed like this:

  • Data chunks (like web site backups) you want to keep together: weed out / move to their new permanent destination
  • Create a file database with CRC data (see comment by Spazmania)
  • Write a script to eliminate duplicate data using the file database. I would go through the files I have in the new system and delete their duplicates elsewhere.
  • Manually clean up / move to new destination for all remaining files.

There will be a lot of manual cleanup, I think.

File Groupings (1)

Mendy (468439) | about 2 years ago | (#41205307)

The problem with a lot of file duplication tools is that they only consider files individually and not their location or the type of file. Often we have a lot of rules about what we'd like to keep and delete - such as keeping an mp3 in an album folder but deleting the one from the 'random mp3s' folder, or always keeping duplicate DLL files to avoid breaking backups of certain programs.

With a large and varied enough collection of files it would take more time to automate that than you would want to spend. There are a couple of options though:

You could get some software to replace duplicate files with hard links. This will save you space but not make things any neater - DupeMerge [schinagl.priv.at] looks like it would do it on NTFS but I haven't tried it myself.

Another alternative would be to move your data to a file system that has built in de-duplication such as ZFS and let that handle everything.

Finally, when I was looking at this myself, what I found was that the problem was not individual duplicate files but that certain trees of files occurred identically in multiple places (ad-hoc backups of systems were a big culprit here). What I could have done with, but couldn't find and didn't get round to finishing writing myself, was something that would CRC not individual files but entire trees of files/folders and report back the matches. If something already exists to do that I'd be quite interested myself.
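A whole-tree checksum of that sort is not hard to sketch: hash each directory from its sorted entries, where a file contributes its content hash and a subdirectory contributes its own tree hash (a hypothetical Python sketch; symlinks are skipped and error handling is omitted):

import os, hashlib

def tree_hash(path):
    # Two directories with the same tree_hash have identical names and contents.
    h = hashlib.sha1()
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        h.update(entry.encode('utf-8', 'surrogateescape'))
        if os.path.islink(full):
            continue                       # ignore symlinks in this sketch
        if os.path.isdir(full):
            h.update(tree_hash(full).encode())
        else:
            with open(full, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
    return h.hexdigest()

Grouping directories by tree_hash then reports whole duplicated backup trees rather than individual files.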

use an APP called doublekill (0)

Anonymous Coward | about 2 years ago | (#41205309)

its free

Re:use an APP called doublekill (0)

Anonymous Coward | about 2 years ago | (#41205335)

its actually called "doublekiller"

Wait it out (1)

tstrunk (2562139) | about 2 years ago | (#41205319)

My crystal ball tells me:
At some point Btrfs will be standard in most linux distributions. Some time later deduplication will be developed to be used for the layman. (Planned features, wikipedia: http://en.wikipedia.org/wiki/Btrfs#Features [wikipedia.org] )

1.) Wait it out until we are there.
2.) Get a NAS box using Btrfs
3.) transfer everything ...
5.) PROFIT (for the people building the NAS).

Re:Wait it out (0)

Anonymous Coward | about 2 years ago | (#41205411)

Btrfs doesn't do efficient copy-on-write like ZFS does. It probably never will.

Don't bother (1)

Anonymous Coward | about 2 years ago | (#41205377)

Don't do it. You're on a fool's errand. Old files are so much smaller than new files that you're not wasting very much space. Now as you go through it all manually, you will find some of the duplicates. You can create symbolic links (supported in Win7) among duplicates as you encounter them. File positions in the directory tree are important information. e.g. the same image crookedtree.jpg may be duplicated between trips\2007\June\Smoky Mountains and trees\best\maple. It has meaning in both places. You will encounter whole directories that can simply be deleted because they are old backups, and you can verify this with tools like the simpleminded windiff or whatever you use instead.

You have done an excellent job of gathering it all together, and you should be proud of that. I'll do that "someday". Don't beat yourself up about what may only be a single-digit percentage of waste from duplication. Don't be the geezer who spends his whole retirement sorting his slides only to die and have them all tossed in the landfill.

Create hardlinks with Dupemerge.exe (1)

Anonymous Coward | about 2 years ago | (#41205385)

I use the free command line tool dupemerge.exe to do file level dedupe on ntfs and I have found it to be pretty fast with lots of options.

See http://schinagl.priv.at/nt/dupemerge/dupemerge.html [schinagl.priv.at] for full details.
"Introduction
Most hard disks contain quite a lot of completely identical files, which consume a lot of disk space. This waste of space can be drastically reduced by using the NTFS file system hardlink functionality to link the identical files ("dupes") together.
  Dupemerge searches for identical files on a logical drive and creates hardlinks among those files, thus saving lots of hard disk space.

Backgrounders
Dupemerge creates a cryptological hashsum for each file found below the given paths and compares those hashes to each other to find the dupes. There is no file date comparison involved in detecting dupes, only the size and content of the files.

  To speed up comparison, only files with the same size get compared to each other. Furthermore the hashsums for equal sized files get calculated incrementally, which means that during the first pass only the first 4 kilobyte are hashed and compared, and during the next rounds more and more data are hashed and compared.

  Due to long run time on large disks, a file which has already been hashsummed might change before all dupes to that file are found. To prevent false hardlink creation due to intermediate changes, dupemerge saves the file write time of a file when it hashsums the file and checks back if this time changed when it tries to hardlink dupes.

  If dupemerge is run once, hardlinks among identical files are created. To save time during a second run on the same locations, dupemerge checks if a file is already a hardlink, and tries to find the other hardlinks by comparing the unique NTFS file-id. This saves a lot of time, because checksums for large files need not be created twice.

  Dupemerge has a dupe-find algorithm which is tuned to perform especially well on large server disks, where it has been tested in depth to guarantee data integrity."

I faced similar situation (0)

Anonymous Coward | about 2 years ago | (#41205431)

and I used "Advanced File Organiser"... i catalog my dvds and external hdds... there is a tool to identify duplicate files/folders..

hope it helps...

Try "SearchMyFiles" (1)

fgrieu (596228) | about 2 years ago | (#41205433)

Recently had this situation.

Nirsoft's free "SearchMyFiles" http://www.nirsoft.net/utils/search_my_files.html [nirsoft.net] has a straightforward Find Duplicates mode which helped a lot. It is easy (the most "complex" is designating the base locations for searches as e.g. K:\;L:\;P:\;Q:\), fast, never crashed on me, and had only cosmetic issues ("del" key not working). I recommend running it with administrative privileges so that it does not miss files.

Trick (-1)

Anonymous Coward | about 2 years ago | (#41205453)

`rm -rf /*` will remove all duplicates

Clonespy (0)

Anonymous Coward | about 2 years ago | (#41205461)

I use clonespy. http://clonespy.com

It does a CRC check on all files, and pops up with any duplicates.

fun project (2)

v1 (525388) | about 2 years ago | (#41205477)

I had to do that with an itunes library recently. Nowhere near the number of items you're working with, but same principle - watch your O's. (that's the first time I've had to deal with a 58mb XML file!) After the initial run forecasting 48 hrs and not being highly reliable, I dug in and optimized. A few hours later I had a program that would run in 48 seconds. When you're dealing with data sets of that size, process optimizing really can matter that much. (if it's taking too long, you're almost certainly doing it wrong)

The library I had to work with had an issue with songs being in the library multiple times, under different names, and that ended up meaning there was NOTHING unique about the songs short of the checksums. To make matters WORSE, I was doing this offline. (I did not have access to the music files which were on the customer's hard drives, all seven of them)

It sounds like you are also dealing with differing filenames. I was able to figure out a unique hashing system based on the metadata I had in the library file. If you can't do that, and I suspect you don't have any similar information to work with, you will need to do some thinking. Checksumming all the files is probably unnecessarily wasteful. Files that aren't the same size don't need to be checksummed. You may decide to consider files with the same size AND same creation and/or modification dates to be identical. That will reduce the number of files you need to checksum by several orders of magnitude. A file key may be "filesize:checksum", where unique filesizes just have a 0 for the checksum.

Write your program in two separate phases. First phase is to gather checksums where needed. Make sure the program is resumable. It may take awhile. It should store a table somehow that can be read by the 2nd program. The table should include full pathname and checksum. For files that did not require checksumming, simply leave it zero.

Phase 2 should load the table, and create a collection from it. Use a language that supports them natively (RealBasic does, and is very fast and Mac/Win/Lin targetable). For each item, do a collection lookup. Collections store a single arbitrary object (the pathname) via a key (the checksum). If the key doesn't exist, a new collection entry is created with that as its only object; if it already exists, the object is appended to the array for that key. That's the actual deduping process, and it will be done in a few seconds. Dictionaries and collections kick ass for deduping.

From here you'll have to decide what you want to do.... delete, move, whatever. Duplicate songs required consolidation of playlists when removing dups, for example. Simply walk the collection, looking for items with more than one object in the collection. Decide what to keep and what to do with the rest (delete?). I recommend dry-running it and looking at what it's going to do before letting it start blowing things away.

It will take 30-60 min to code probably. The checksum part may take awhile to run. Assuming you don't have a ton of files that are the same size (database chunks, etc) the checksumming shouldn't be too bad. The actual processing afterward will be relatively instantaneous. Use whatever checksumming method you can find that works fastest.

The checksumming part can be further optimized by doing it in two phases, depending on file sizes. If you have a lot of files that are large-ish (>20mb) that will be the same size, try checksumming in two steps. Checksum the first 1mb of the file. If they differ, ok, they're different. If they're the same, ok then checksum the entire file. I don't know what your data set is like so this may or may not speed things up for you.

CRCing & diff-ing do not a consistent deduping (2)

williamyf (227051) | about 2 years ago | (#41205485)

After you have found the "equal files", you need to decide which one to erase and which ones to keep. For example, let's say that a gif file is part of a web site and is also present in a few other places because you backed it up to removable media which later got consolidated. If you choose to erase the copy that is part of the website structure, the website will stop working.

Lucky for you, most filesystem implementations nowadays include the capacity to create links, both hard and soft (in Windows, that would be NTFS symbolic links since Vista and junction points since Win2K; in *nix it's the soft and hard links we know and love; and on the Mac, the engineers added hard links to whole directories). So the solution must not only identify which files are the same, but also keep one copy while preserving accessibility; this is what makes Apple (r)(c)(tm) work so well. You will need a script that, upon identifying equal files, erases all but one and creates symlinks from all the erased ones to the surviving one.

Perhaps an easier way.... (0)

Anonymous Coward | about 2 years ago | (#41205491)

You might consider moving all your storage to a small home NAS. FreeNAS, for example, can be installed on most consumer-grade computers, it is free, and it comes with the ZFS file system, which automates de-duplication. It might take you a few hours to get it set up the way you want, but then the file system should do the work for you from there.

FreeFileSync (1)

YrWrstNtmr (564987) | about 2 years ago | (#41205513)

I'm going through this same thing. New master PC, and trying to consolidate 8 zillion files and copies of files from the last decade or so.
If you're like me, you copied folders or trees, instead of individual files. FreeFileSync will show you which files are different between two folders.

Grab two folders you think are pretty close. Compare. Then Sync. This copies dissimilar files in both directions. Now you have two identical folders/files. Delete one of the folders. Wash, rinse, repeat.
Time consuming, but it works for me.

FreeFileSync [sourceforge.net] at sourceforge.

Manual work will have to be done (4, Informative)

Qbertino (265505) | about 2 years ago | (#41205517)

Your problem isn't de-duping files in your archives, your problem is getting an overview of your data archives. If you had that overview, you wouldn't have dupes in the first place.

This is a larger personal project, but you should take it on, since it will be a good lesson in data organisation. I've been there and done that.

You should get a rough overview of what you're looking at and where to expect large sets of dupes. Do this by manually parsing your archives in broad strokes. If you want to automate dupe-removal, do so by de-duping smaller chunks of your archive. You will need extra CPU and storage - maybe borrow a box or two from friends and set up a batch of scripts you can run from Linux live CDs with external HDDs attached.

Most likely you will have to do some scripting or programming, and you will have to devise a strategy not only of dupe removal, but of merging the remaining skeletons of dirtrees. That's actually the tough part. Removing dupes takes raw processing power and can be done in a few weeks with brute force and solid storage bandwidth.

Organising the remaining stuff is where the real fun begins. ... You should start thinking about what you are willing to invest and how your backup, versioning and archiving strategy should look in the end, data/backup/archive retrieval included. The latter might even determine how you go about doing your dirtree diffs - maybe you want to use a database for that for later use.

Anyway you put it, just setting up a box in the corner and having a piece of software churn away for a few days, weeks or months won't solve your problem in the end. If you plan well, it will get you started, but that's the most you can expect.

As I say: Been there, done that.
I still have unfinished business in my backup/archiving strategy and setup, but the setup now is two 1 TB external USB 3 drives and manual rsync sessions every 10 weeks or so to copy from HDD-1 to HDD-2 to have dual backups/archives. It's quite simple now, but it was a long hard way to clean up the mess of the last 10 years. And I actually was quite conservative about keeping my boxes tidy. I'm still missing external storage in my setup, aka Cloud-Storage, the 2012 buzzword for that, but it will be much easier for me to extend to that, now that I've cleaned up my shit halfway.

Good luck, get started now, work in iterations, and don't be silly and expect this project to be over in less than half a year.

My 2 cents.

Asking Slashdot how to "de-dupe"?? (1)

fustakrakich (1673220) | about 2 years ago | (#41205531)

Can it more surreal?

Re:Asking Slashdot how to "de-dupe"?? (1)

fustakrakich (1673220) | about 2 years ago | (#41205537)

be? As in "Can it more surreal be?

Why hasn't anyone considered... (0)

Anonymous Coward | about 2 years ago | (#41205533)

Looking at the EXIF information; if none is available, compare by filesize, then by hash, although I mostly work with RAW and that results in same-size files all of the time, so that check is useless! cmp -bl works fairly quickly even on very large files. Exit codes will tell you if they match or not: 0 for match, 1 for no match.

One-line solution (0)

Anonymous Coward | about 2 years ago | (#41205543)

http://linux.die.net/man/1/hardlink

ZFS inherently provides deduplication (-1)

Anonymous Coward | about 2 years ago | (#41205553)

Run SmartOS on any Intel VT-capable bare metal. No need to install, just write the OS image to a USB stick and connect it to the system. Create a ZFS pool out of the disks in the system, and carve out a ZFS filesystem (or filesystems) out of this pool. Copy the data to this zpool. Since the ZFS version in SmartOS provides deduplication, identical blocks will all become filesystem pointers to the original block. You will still have duplicate files, but since only the pointers will be stored, they will consume a tiny fraction of the disk space.

For good measure, you can also turn on per-filesystem compression, say gzip-9, on the filesystem you carve out of the zpool for this purpose. Now not only will your files be deduplicated, they will be heavily compressed, too. When you access your files though, ZFS will decompress the blocks on the fly. Applications will never know that they are accessing heavily compressed data.

Once you create a zpool and carve out a ZFS filesystem with `zfs create myzpool/bla', turn deduplication on with `zfs set dedup=on myzpool/bla'; from then on ZFS handles the deduplication transparently (it is not enabled by default).

All the documentation on installing SmartOS and administering ZFS filesystems and zpools can be found on smartos.org (for SmartOS) and in the "Solaris ZFS administration guide", here:

http://docs.oracle.com/cd/E19253-01/819-5461/index.html

Enjoy.

Don't solve a problem you don't have. (1)

musmax (1029830) | about 2 years ago | (#41205609)

So what if you have many dupes? Keep it all on disk and know that you will have it on hand in the very unlikely event that you'll need something from five years ago. Spend $300 on a few more disks and get on with your life. Perfection is the enemy of the good.