MD5 of Song Data: Finding Duplicate MP3's

MyHair (589485) writes | more than 10 years ago

Update: I started this project but still can't strip the ID3 tags temporarily. I'll put the details and code snippets in a post.

Update 2: I have more or less given up on stripping the ID3 tags for now, since it will take too much work. However, ID3v1 tags are always at the end of the file. ID3v2 tags can be, and usually are, at the beginning of the file, but they are far less common in my collection, so I used the following command to get the MD5 sum of the first 512 KiB of each file. It will miss identical songs whose ID3v2 tags at the beginning of the file have been added or altered, but I was able to delete another 1.2 gigs of duplicate MP3s with this test:

head -c 512k <mp3 file> | md5sum -b

md5sum in this command returns a filename of "-", but I used some shell trickery to figure out which sum went with which file.
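
A minimal sketch of one way to keep each partial sum paired with its filename, assuming GNU head and md5sum are available. The function name partial_md5sums is made up for illustration, and this is not necessarily the shell trickery referred to above. It prints lines in the same "sum *path" layout that md5sum -b uses, so the output can be fed to the same cut and sort pipelines as a full-file list.

```shell
#!/bin/sh
# partial_md5sums DIR (hypothetical helper): for every .mp3 under DIR, print
# "<md5 of the first 512 KiB> *<path>", mimicking the layout of md5sum -b.
partial_md5sums() {
    find "$1" -type f -name '*.mp3' | while IFS= read -r f; do
        # 524288 bytes = 512 KiB; files shorter than that are hashed whole.
        sum=$(head -c 524288 "$f" | md5sum -b | cut -d ' ' -f 1)
        printf '%s *%s\n' "$sum" "$f"
    done
}
```

Calling partial_md5sums /cygdrive/c/mp3 > mp3partial would build a list file. Paths containing newlines would confuse the read loop, which this sketch does not try to handle.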

***************

Foreword: Really my only question here is "how do I calculate the MD5 sum of MP3 song data only--that is, not including ID3v1 or ID3v2 tags?" The rest is just chatting about what I'm doing.

I'm a network admin who thinks he knows enough about programming to pull this off. (I had several programming courses in college and occasionally play around with one language or another, but I've never had a developer job or a serious project.)

I have disorganized, reorganized and relabeled MP3 files. I want to find duplicates. Here's my idea to accomplish this:

I'm going to copy information from each MP3 into a database, either directly or through an intermediate delimited text file. The major info will be the path and file name, the MD5 sum of the whole file, and the MD5 sum of the song data only (minus the metadata/ID3 tags). I'll probably also try to get the MD5 sum of the first X seconds or X Kbytes of song data to detect duplicate beginnings (in case of a truncated duplicate). Secondary info will include all the other stuff like file size, song length and ID3 tag data; this info will just help me locate duplicates that aren't byte-exact and decide which of the duplicates to delete.

I could probably do this with a shell script and a couple of utility programs, but it may be simpler to grab a couple of modules from CPAN and use Perl. I don't anticipate having trouble with the metadata or MD5 sums of the entire file, but I don't yet see an easy way to calculate the MD5 sum of only the song portion of the MP3. I'm browsing through CPAN, and there are tons of modules which read and edit ID3 tags but nothing quite like what I have in mind. I did find one that strips the ID3 tags, but it seems to alter the file directly, and I only want to have the tags stripped just long enough to get the MD5 sum.

Sure, I could read the data structure of the MP3 file and parse it myself, but I suspect that somebody has already made a Perl module or command line utility that can nondestructively present me with the tag-stripped MP3, so I can MD5 it and leave the original file unchanged. Anybody know of such a module or utility?

My app won't be pretty or user friendly. It will probably take a path as input and output either a delimited file or update a mysql or pgsql database directly. Then I will query the database to find what I need. I'm not only finding duplicates; I'll also be looking for files that are unique per storage device, because I have most of my MP3s on my main PC and my MP3 player (w/20G HDD), but there are some MP3s on one device but not on the other and vice versa, not to mention some on my linux box and some on my work PC. I can import from all sources into different tables and play with queries and eventually consolidate and catalog my mp3 collection.

I know there are various apps that try to do some of these things, but I like working at the command line and having my hands on the raw data; every time I try to deal with an MP3 manager application I hate it.

EDIT: Yeah, I could just delete all my MP3's and re-rip them, but what fun would that be? Mucking about with Perl and SQL sounds so much more fun.

My Project So Far (1)

MyHair (589485) | more than 10 years ago | (#7860603)

Disclaimer: The code snippets in this post were quickly thrown together. They aren't very thorough and have no error checking. Use them at your own risk. I don't consider them copyrighted as they are very short and quite obvious to anyone who knows how to use the commands, so there's nothing copyrightable in the code in my opinion. Also, for the record, I own the CD whose contents are listed below, and those mp3s are my rips for my personal listening pleasure.

*****

I decided not to use a database. I wanted to for the geek factor, but I don't really want to keep a database in sync with the file data; after all, everything but the MD5 sums is already stored in the filesystem and ID3 tags, so why try to maintain a database, too? There's locatedb and updatedb for unix/Cygwin, or the MS Indexing Service, if I want fast searching. And if I ever figure out how to take the MD5 sum of just the song data, then I can store that in the ID3 tag.

The tools md5sum, cut, comm, xargs, grep, uniq, sort and find have been all I need. These are unix tools, and I am doing this on a Win2k box, but I have Cygwin installed so these tools are available. There are likely DOS (DJGPP) and Win32 ports of these tools, too. Most of them are part of the GNU Textutils package.

The following command calculates the MD5 sums of all my MP3 files on drive C: and saves them in the file mp3drivec.
find /cygdrive/c/mp3/ -type f -exec md5sum -b {} \; >> mp3drivec
The output (in the file) looks like this:
19e9bbeb1f26dec3c04527001ec5d9db */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/01-Satan is My Motor.mp3
64f7573905924eecf0b156eadf8cff20 */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/02-Mexico.mp3
7c0dc04f6078fd81ed5241c994a2a6b6 */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/03-Never There.mp3
615424badd6fd00afb681c62997fa4f5 */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/04-Guitar.mp3
e6db4636e772361d88597c1dfd03fcbd */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/05-You Turn the Screws.mp3
e55045c5ac79d2fe2cb069807542985e */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/06-Walk On By.mp3
d5f556ef9a0829b6ab4b2ef8455213a3 */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/07-Sheep Go To Heaven.mp3
d940c98b4b7e8e946d772005038233b8 */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/08-When You Sleep.mp3
bb13cd27cb66bf25cbd3a5307b13f918 */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/09-Hem of Your Garment.mp3
b415be05adb8d8d88a723ba4db2be4a3 */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/10-Alpha Beta Parking Lot.mp3
ed34f2b55bbb3072b4a8e3b52ff69405 */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/11-Let Me Go.mp3
d304207d68d08e0e3209c32b6d5033c7 */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/12-Cool Blue Reason.mp3
2614e16898dca06c63ffaadbdbc2441c */cygdrive/c/mp3/my-cd-rips/Cake/Prolonging the Magic/13-Where Would I Be_.mp3
I can use the cut command to grab either the MD5 sum or filename:
cut -d ' ' -f 1 <file>
for the MD5 sum or the following for the pathname:
cut -d '*' -f 2 <file>
I saved this command as a shell script. It will list lines with duplicate MD5 sums within a list file:
cut -d ' ' -f 1 "$1" | sort | uniq -d | xargs -I {} grep {} "$1"
I can run that on the mp3drivec file I created earlier to find identical files under C:\mp3.
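
For a long list, the same result can be had without starting one grep per duplicated sum. What follows is only a sketch under that assumption, not the script described above: the function name list_dupes is hypothetical, and it uses awk, which is not among the tools listed earlier. It reads the list twice, counting each sum on the first pass and printing the lines whose sum was seen more than once on the second.

```shell
#!/bin/sh
# list_dupes LISTFILE (hypothetical helper): print every line of an md5sum
# list whose sum (the first space-separated field) occurs more than once.
list_dupes() {
    # NR == FNR is true only while awk reads the first copy of the file, so
    # the first pass counts sums and the second pass prints repeated ones.
    # The final sort groups lines that share a sum next to each other.
    awk 'NR == FNR { seen[$1]++; next } seen[$1] > 1' "$1" "$1" | sort
}
```

Running list_dupes mp3drivec should list the same lines as the cut/uniq/xargs pipeline, grouped by sum.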

The following is a shell script I made to find MD5 sums that are unique to the left file (it is called with two files as arguments):
LEFT=tmp-left
RIGHT=tmp-right
rm -f "$LEFT" "$RIGHT"
cut -d ' ' -f 1 "$1" | sort | uniq > "$LEFT"
cut -d ' ' -f 1 "$2" | sort | uniq > "$RIGHT"
comm -2 -3 "$LEFT" "$RIGHT" | xargs -I {} grep {} "$1"
This script is similar, except that it lists lines from the left file that have matching MD5 sums in the right file; in other words, files in the left list that duplicate ones in the right list.
LEFT=tmp-left
RIGHT=tmp-right
rm -f "$LEFT" "$RIGHT"
cut -d ' ' -f 1 "$1" | sort | uniq > "$LEFT"
cut -d ' ' -f 1 "$2" | sort | uniq > "$RIGHT"
comm -1 -2 "$LEFT" "$RIGHT" | xargs -I {} grep {} "$1"
This shell script lists lines from the right file that have matching MD5 sums in the left file:
cut -d ' ' -f 1 "$1" | sort | uniq | xargs -I {} grep {} "$2"
Using all of the above I was able to delete a couple of gigs or more of duplicate MP3 files. (I either deleted them by hand or, when there were a lot of files to delete, edited a list of duplicate files with vi to turn it into a shell script which deleted each file.)

If you're confused, it's my fault. This is more of a toolbox post than a howto, but hopefully somebody finds it interesting or helpful.

Now that I've pretty much gotten rid of the completely identical files, I know I have at least a few duplicate songs that have had ID3 tags altered, so the MD5 sums of the files are different even though the song data is identical. I haven't figured out yet how to nondestructively calculate the MD5 of the song data only. I've found tons of utilities, but so far they either don't work or they alter the file directly with no option to output to another file or stdout.

So I finally decided to install a Perl module and do it myself, but alas, the damn Perl module even modifies the file directly. Here's an example using MPEG::MP3Info (WARNING: This will strip ID3 tags out of the file passed to it!):
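#!/usr/bin/perl
# WARNING: this edits the file in place, removing its ID3 tags.
use strict;
use warnings;

use MPEG::MP3Info;

my $mp3 = shift;               # path of the MP3 to strip
remove_mp3tag($mp3, 'ALL');    # 'ALL' removes both ID3v1 and ID3v2 tags
exit 0;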
I got tired and didn't investigate copying the file into a buffer, modifying it and then writing to stdout. I'm not even sure if it's possible, but I assume it is. I'm just not experienced enough to know how to do it.

What I really want is a stripper that works like an I/O filter, like this:
id3stripper 'Cool Song.mp3' | md5sum -b
Or perhaps it will even just calculate the MD5 sum inside itself and output the sum and pathname like the md5sum command does. The example above would give a filename of "-" in the md5sum output.

It would also be nice to have a similar utility to calculate the MD5 sum of the first several seconds or several KB of the song to match a truncated rip or a rip with a skip or pop later in the track to a full track.


Hmmm, I'm getting an idea. Since I don't care about reading the ID3 (and most existing utilities do) and ID3 tags are in predictable locations (ID3v1 is the last 128 bytes; I think ID3v2 is always at the beginning with a size field somewhere) I may be able to use a pointer and a length argument to calculate the data MD5 (there are MD5 modules for Perl) without any special buffer copies or temp files. Maybe I'll work on that this weekend.
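
The offset-and-length idea can be sketched with the same toolbox, without any Perl. This is a rough sketch rather than a finished tool, and the function name audio_md5 is made up for illustration. It relies on two layout facts: an ID3v1 tag is the last 128 bytes of the file and starts with "TAG", and an ID3v2 tag at the start of the file has a 10-byte header ("ID3", two version bytes, one flags byte, then four size bytes that each carry 7 bits), where the size does not count the header itself. It assumes head -c and tail -c +N are available, as they are in the GNU tools. It does not look for other kinds of tags (APE, Lyrics3, or an ID3v2 tag appended at the end of the file), so those would still change the sum. The optional second argument limits the hash to the first N bytes of song data, for the truncated-rip comparison.

```shell
#!/bin/sh
# audio_md5 FILE [MAX_BYTES] (hypothetical helper): print the MD5 of the bytes
# between a leading ID3v2 tag and a trailing ID3v1 tag. FILE is only read.
# The body is in ( ) so it runs in a subshell and leaks no variables.
audio_md5() (
    file=$1
    limit=${2:-0}
    size=$(( $(wc -c < "$file") ))
    start=0
    end=$size

    if [ "$size" -ge 10 ]; then
        # First 10 bytes as decimal numbers in $1..$10.
        set -- $(head -c 10 "$file" | od -An -v -tu1)
        if [ "$1" -eq 73 ] && [ "$2" -eq 68 ] && [ "$3" -eq 51 ]; then  # "ID3"
            # Four size bytes, 7 bits each, most significant byte first.
            start=$(( 10 + $7 * 2097152 + $8 * 16384 + $9 * 128 + ${10} ))
            # Flag bit 0x10 (ID3v2.4) means a 10-byte footer follows the tag.
            if [ $(( $6 & 16 )) -ne 0 ]; then start=$(( start + 10 )); fi
        fi
    fi
    if [ "$start" -gt "$end" ]; then
        echo "implausible ID3v2 size in $file" >&2
        return 1
    fi

    if [ $(( end - start )) -ge 128 ]; then
        set -- $(tail -c 128 "$file" | head -c 3 | od -An -v -tu1)
        if [ "$1" -eq 84 ] && [ "$2" -eq 65 ] && [ "$3" -eq 71 ]; then  # "TAG"
            end=$(( end - 128 ))
        fi
    fi

    len=$(( end - start ))
    if [ "$limit" -gt 0 ] && [ "$limit" -lt "$len" ]; then len=$limit; fi

    # tail -c +N starts output at byte N, counting from 1.
    tail -c +"$(( start + 1 ))" "$file" | head -c "$len" | md5sum | cut -d ' ' -f 1
)
```

For example, audio_md5 'Cool Song.mp3' would print only the sum, and audio_md5 'Cool Song.mp3' 524288 would hash just the first 512 KiB of song data. Nothing is copied to a temp file; tail and head simply skip the tag bytes on the way to md5sum.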

I realize I could copy each MP3, strip the tags from the copy and MD5 that, but that's very sloppy and slow, and I don't want to do it that way.
