
Faster P2P By Matching Similar Files?

CmdrTaco posted more than 7 years ago | from the something-doesn't-jive-here dept.

Software 222

Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files. "SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""




Nickelback? (5, Funny)

onemorehour (162028) | more than 7 years ago | (#18690017)

Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.

Well, sure, if you're only looking at Nickelback [] songs.

TorrentSoup (1)

snsr (917423) | more than 7 years ago | (#18690121)

It must be too early for me.
Why would I want to let someone transfer part of a file on my system that resembles part of a completely different file that they're looking for? Maybe everyone should just transfer all of their files all the time.

Re:TorrentSoup (2, Insightful)

eric76 (679787) | more than 7 years ago | (#18690223)

The only thing I use the file sharing networks for is to download new images of FreeBSD and Linux using BitTorrent.

The last thing I want is a "similar" file.

What would be a "similar" file to a FreeBSD ISO? It would either be a corrupted file or one with an introduced exploit.

Re:TorrentSoup (5, Informative)

joe_cot (1011355) | more than 7 years ago | (#18690417)

It would still work the same way as it does now: an md5 of each specific block, and an md5 of the whole thing. If the md5 for the block doesn't match, it's not going to download, and if it's someone using a collision to inject a block with the same md5, 1) it's not going to pass the md5 on the whole thing, and 2) you're already vulnerable to that today. The reason this will work is that there'll be lots of people sharing incomplete or corrupted versions of your FreeBSD iso; you'll get the blocks that are good, and skip the blocks that aren't, making "similar" files very useful. Not too difficult to understand, and no need for tin foil hats.
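[Ed.: the per-block verification described above amounts to something like this sketch. It is hypothetical Python, not code from any actual client; MD5 and the block size are chosen purely for illustration.]

```python
import hashlib

BLOCK_SIZE = 16 * 1024  # typical chunk size in swarming protocols

def verify_blocks(data: bytes, expected_block_hashes: list[str]) -> list[int]:
    """Return the indices of blocks whose MD5 matches the expected hash.

    Blocks that fail verification are simply skipped, so a 'similar'
    (partially corrupted or retagged) copy can still contribute the
    blocks that are good.
    """
    good = []
    for i, expected in enumerate(expected_block_hashes):
        block = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if hashlib.md5(block).hexdigest() == expected:
            good.append(i)
    return good
```

A corrupted block fails its own hash without affecting the others, which is why mixing sources is safe as long as the whole-file hash is checked at the end.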

Re:TorrentSoup (3, Interesting)

ShieldW0lf (601553) | more than 7 years ago | (#18690979)

So if someone is sharing an older ISO, and it happens to have large portions that exactly match the one you're downloading, with other portions that are not identical, you don't want to download the identical chunks off that person?

It would be interesting if the implementing software could also look for possible matches within your existing file structure and reduce the data downloaded automatically, kind of like using diff and just downloading the patch.

Re:TorrentSoup (3, Insightful)

drix (4602) | more than 7 years ago | (#18690749)

Because it gets you published and, thus, increases your chance for tenure, that from which all blessings flow.

Re:Nickelback? (4, Informative)

beckerist (985855) | more than 7 years ago | (#18690193)

The idea is that the song "Girl Talk - Once Again" might be represented as "girltalk-onceagain" "girl talk - once again" "GirlTalk - Once Again" "01-once again.mp3" "OnceAgain.mp3"

Being that the only difference is just the text (ID3, ID3v2) tags, the rest of the song is exactly the same, so why can't you use that as a download source too? I personally organize all of my music, and because of this P2P programs believe that it's an entirely new file, when really it was just renamed and the header information was changed (generally to be grammatically correct.)

Re:Nickelback? (5, Interesting)

thepotoo (829391) | more than 7 years ago | (#18690599)

If you use bittorrent, the DHT protocol (supported by Azureus, BitComet, and uTorrent, among others) does the exact thing you're describing. It checks MD5 hashes for files (the whole file, not the pieces, I think), and connects you to peers which have the same file.
DHT even supports partially corrupted files, your client just discards the corrupt data.

My question is, why would I want to use SET over DHT? Does SET not need a centralized server, or does it have any other advantage at all?
TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.

Re:Nickelback? (1)

beckerist (985855) | more than 7 years ago | (#18690711)

I read this as being less effective for BitTorrent (as your assumption is correct, DHT basically does this on its own already) and more effective for Gnutella. While the Gnutella network is maturing to a level past the Bearshare/Limewire days, there are still inherent problems with download speeds simply because a mechanism like DHT (or, as the article calls it, SET) doesn't currently exist there. Are there problems with DHT? Absolutely, I get node errors all the time in BitTyrant. I think this is just the next natural extension for Bittorrent (or distributed FTP [] , etc)

Re:Nickelback? (1)

beckerist (985855) | more than 7 years ago | (#18690777)

errrrr...sorry. I need to proofread better. Instead of saying "I think this is just the next natural extension for Bittorrent..." I meant to say "next natural extension for Gnutella."


Re:Nickelback? (3, Informative)

Robotech_Master (14247) | more than 7 years ago | (#18691075)

If I understand the article right, SET looks at individual files within a particular download. DHT just looks at the whole download.

For instance, if I'm uploading my "Songs I Like to Dance To" mp3 mix, and someone else is uploading an "All-Time Greatest Dance Hits" CD rip, and there are a couple of songs both uploads have in common, SET would enable someone downloading my MP3 mix to treat the CD rip as a partial seed (and vice versa), and pull down the songs held in common from either one.

Whereas DHT would simply enable people to pull down my mix from other people uploading the mix, or the CD rip from other people uploading the CD rip, even if the tracker was down. (If I understand what DHT does correctly. Which it is possible I don't.)

Re:Nickelback? (4, Interesting)

Andy Dodd (701) | more than 7 years ago | (#18691089)

If I recall correctly, DHT takes the file name into account when calculating the hash. Thus identical files with different names are treated differently.

Some P2P protocols allow looking up a file by a hash which does not take filename into account, but this will not handle the case where the files differ in only one small section. The best example is the following:
Person downloads an MP3.
Person finds that the MP3 is not properly tagged (for example, has a comment field saying who ripped it/released the rip, and has no track number.)
Person changes the MP3's ID3 tag
Now, nearly all existing P2P protocols will treat the new file as a completely different file, when in reality the most important contents (the audio itself) have not changed, only the file's metadata.
Other users will go for the "full-file" match with the largest number of sources, thus causing the mistagged MP3 to propagate more than a "fixed" one.

So a P2P system that ignores the ID3 tag when hashing would have significant advantages, in which the user could download the file from many sources and then choose which source to get their metadata from.
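[Ed.: a metadata-skipping hash like the one proposed above could be sketched as follows. This is illustrative only; real MP3s can also carry APE tags and other metadata formats that this ignores.]

```python
import hashlib

def audio_hash(mp3: bytes) -> str:
    """Hash only the audio frames of an MP3, skipping ID3 metadata,
    so retagged copies of the same rip hash identically."""
    start, end = 0, len(mp3)
    if mp3[:3] == b"ID3" and len(mp3) >= 10:
        # ID3v2 tag size is a 4-byte syncsafe integer (7 bits per byte)
        size = (mp3[6] << 21) | (mp3[7] << 14) | (mp3[8] << 7) | mp3[9]
        start = 10 + size
    if end - start >= 128 and mp3[end - 128:end - 125] == b"TAG":
        end -= 128  # trailing 128-byte ID3v1 tag
    return hashlib.sha1(mp3[start:end]).hexdigest()
```

Two copies of the same rip with different tags then produce the same hash, so a P2P network keying on `audio_hash` would pool their sources instead of splitting them.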

Re:Nickelback? (1, Informative)

Anonymous Coward | more than 7 years ago | (#18690647)

Shareaza already discards ID3 tags when hashing files. (It's an option if it's off by default.)

Re:Nickelback? (1)

LoofWaffle (976969) | more than 7 years ago | (#18690769)

Is the ID tag really the only difference? You would also need to make sure that encoding bit rates were the same, that the original data set (let's stick with the MP3 example) was digitized the same way, etc. It may be more efficient, but given the polluted nature of P2P, accuracy is a bit questionable.

Re:Nickelback? (3, Interesting)

hey! (33014) | more than 7 years ago | (#18690987)

I don't think this is just about inconsistent metadata.

I think what he's talking about may be more like the document fingerprinting algorithms used to pare search engine results, or to detect plagiarism in student papers.

In some cases you will be downloading components of a file from two sources, neither of which have the others' component. The example in TFA was downloading the video portion of a movie from a foreign language site and the audio from a site with the language you speak but less bandwidth.

I suppose another example would be that if you were downloading an anthology of stories, you could take a particular story from a server that hosted a different anthology including that story. Or maybe you are downloading the new distro; you could take some of your files from sites offering the distro version you are looking for, some from sites only offering the files you need to upgrade to that version, and some from entirely different distros or much older versions if they happened to be the same.

I guess it could be thought of as a kind of "fuzzy akamai".

It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models.

Re:Nickelback? (0)

Anonymous Coward | more than 7 years ago | (#18690983)

Or if you're looking for commercial music, period; remember how the RIAA bombed Kazaa?

similiar?! (0)

Anonymous Coward | more than 7 years ago | (#18690049)

similiar files ftw!

Similar Files? (1, Insightful)

Anonymous Coward | more than 7 years ago | (#18690051)


Re:Similar Files? (2, Insightful)

Aladrin (926209) | more than 7 years ago | (#18690271)

No seriously, the coward is right. WTF?

Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same. A few. The large majority of duplicate songs are NOT the same audio data. It's been re-ripped, transcoded, or some other horrid thing done to it and is not the same data anymore.

Now, even assuming that there ARE tons of very-alike files out there, you'd have to write an intelligent comparer for each one so that it knew how to deal with the file and what information could be mixed without ruining the file.

At the end of the project, you've spent years on a project that'll never quite work right to save a bit of bandwidth for people that should have just gone and bought the song from iTunes in the first place if they wanted it that damned bad. And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike. (And probably has tons of disadvantages compared to the already-existing applications.)

Re:Similar Files? (2, Interesting)

joto (134244) | more than 7 years ago | (#18690805)

Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same.

It's more than a few. Most people use the default settings in their audio ripper/compression program, and it's all from the same CD. Even more people never use an audio ripper and/or compressor at all, and simply download the file from the Internet. Not that many people bother to change ID3 tags either, but every single person who does creates a different file.

And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike.

My impression is that people are going out of their way to find specialized P2P programs that offer any advantage. And that very many people are willing to spend $100 in work-time finding a CD worth $10 for "free". The market has spoken, people are not acting the way you want them to.

Re:Similar Files? (0)

Anonymous Coward | more than 7 years ago | (#18690931)

Having downloaded many songs from eMule and Kazaa back in the day, there are many MP3s that are bit-identical in the audio data but different in the metadata. Just search for a song on eMule, download the dozen or so most available MP3s that have the same file size, and do a foobar2000 bit-compare on the audio; you'll find most of them are bit-identical, just with different tags. Of course this is just as anecdotal as the evidence you have cited, but my experience seems to be somewhat different to yours...

As for your other point, many P2P apps such as eMule, LimeWire, and Shareaza have been metadata-aware for many different filetypes for a long time now. I doubt it would be a huge task to have these programs support two hashes for a file: the normal hash for the entire file with metadata, and another hash just for the data. Of course, the program in the article goes one step further by subdividing the non-metadata data to look for more redundancy between files. Makes sense to me.

Re:Similar Files? (2, Insightful)

Anonymous Coward | more than 7 years ago | (#18691019)

Break down the file into small (16Kb) chunks. Hash those chunks and let the client compare those chunks to the chunks you need. Most BT clients already do this, but still only draw the file from peers using the same file listed by the tracker. With this technology it can use any file that has chunks with the exact same hash as the file being downloaded by the user. I would imagine not a great many changes would be needed to implement it. There's no need for an 'intelligent comparer' as it's pretty much already built into almost every BT app out there. There won't be 'years on the project' either, since most of the infrastructure already exists. They can just build on what is already out there.

There could be a fairly large performance increase if I've understood the paper correctly. I have a 10Mb downstream cable connection at home. If I connect to a torrent that has many more seeders than leechers I can easily top out my d/l speed at around 1.1 MB/s. Reverse that scenario and the d/l is extremely slow due to the lack of seeders able to send out chunks of the file. Now, imagine there are multiple copies of this same file on multiple trackers being shared by many, many more seeders than this one torrent has. This new implementation will find those chunks, as well as the chunks you originally connected to. Next time, RTFA.
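[Ed.: the chunk-hash index this comment describes might look something like the following minimal sketch. The function names and the 16 KB chunk size are illustrative, not taken from any real client.]

```python
import hashlib

CHUNK = 16 * 1024  # the 16 KB chunk size mentioned above

def chunk_hashes(data: bytes) -> list[str]:
    """Split a file into fixed-size chunks and hash each one."""
    return [hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def build_index(files: dict[str, bytes]) -> dict[str, tuple[str, int]]:
    """Map chunk-hash -> (filename, byte offset) across every shared file,
    so a request for a chunk can be served from any file that contains it,
    not just the one named by the tracker."""
    index: dict[str, tuple[str, int]] = {}
    for name, data in files.items():
        for i, h in enumerate(chunk_hashes(data)):
            index.setdefault(h, (name, i * CHUNK))
    return index
```

Note this only matches chunks that happen to be aligned the same way in both files; identical content at a shifted offset needs variable-length chunking to be found.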

Re:Similar Files? (1)

woolio (927141) | more than 7 years ago | (#18691121)

At the end of the project, you've spent years on a project that'll never quite work right to save a bit of bandwidth for people that should have just gone and bought the song from iTunes in the first place if they wanted it that damned bad. And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike. (And probably has tons of disadvantages compared to the already-existing applications.)

After spending several years of my life in grad school, I'm beginning to think that a lot of university research is like that.

Re:Similar Files? (1)

lottameez (816335) | more than 7 years ago | (#18690529)

Both of these files use the same 26-character alphabet! Share and substitute at will!

Thats the dumbest thing I've ever heard (0, Flamebait)

MetalliQaZ (539913) | more than 7 years ago | (#18690089)

One wonders if these "researchers" have ever actually used p2p...

Re:Thats the dumbest thing I've ever heard (1)

dknj (441802) | more than 7 years ago | (#18690231)

Or it may even be native to the service itself (I think Gnutella or eDonkey... one of the two). Anyway, the point is if you search for a certain file and the hashes for specific blocks are identical, Shareaza will attempt to download that copy as well. So if you download, say, a 4MB Tupac song and suddenly see a 2.5MB Britney Spears song in the list, don't cancel the download. It could be that the 2.5MB Britney Spears song is actually the same Tupac song, renamed and incomplete.

However, there is no reason why the current generation of bittorrent clients don't already support this via DHT...

It gets worse. (2)

khasim (1285) | more than 7 years ago | (#18690331)

Taking advantage of those similarities could speed downloads considerably. If a U.S. computer user wanted to download a German-language version of a popular movie, for instance, existing systems would probably download most of the movie from sources in Germany. But if the user could download from similar files, the user could retrieve most of the video from English versions readily available from U.S. sources, and download only the audio portion of the movie from the German sources.


Now it would be GREAT if someone did manage to do that. Split the video from the audio (and from the sub-titles). And maybe create a meta-package.

And maybe if those researchers focus on that, this will be a better idea.

But that would ONLY work for material that could be split like that. If it's a song, what are you going to split? An ISO image? Same question.

Re:It gets worse. (0)

Anonymous Coward | more than 7 years ago | (#18690405)

I think the idea here is to identify sections of different files that are identical and use those sections as part of a typical swarm download.

Note the "think". This could just be idiots thinking they are the first to discover swarming. I didn't RTFA.

Re:It gets worse. (1)

WoZzeR (1081201) | more than 7 years ago | (#18690579)

I think the original article used 'similar files' only as a rough example. How I imagine it would work would be similar blocks could be downloaded from many sources. So in the example of a movie with German language, as long as the video blocks were the same, they would be interchangeable between the 2 files. This actually would not be hard to do, if there was a standard for block sizes, each one should have a different md5 hash, so what a client would do is say "do you have md5 hash xxxxxx block, please send it to me". This way it would not matter where in the file the other block was, but the original (torrent file in this case) file would have the info on where the md5 hash would go.

Re:It gets worse. (1)

brunascle (994197) | more than 7 years ago | (#18690743)

Hmmm, that's actually interesting. This could potentially completely change how P2P works: instead of requesting files, just request blocks by md5 hash. When you get a match, compare the hashes using another algorithm (to make sure it isn't a coincidence).

would that make it easier to defend yourself against the MAFIAA? since all they know about is 1 block that matches a copyrighted file (or at least, the hash matches)?

Think about how that would be accomplished. (1)

khasim (1285) | more than 7 years ago | (#18690943)

Pretend that you're part of a swarm.

Your computer would then go through ALL YOUR FILES and advertise the md5 checksums to everyone.

Normally, you just advertise the blocks for the file that you're downloading.

So, I'm downloading a Debian iso ... and you're downloading a movie. Why am I (and a million others like me) going to be connecting to your box, asking your processor whether you have a file with checksum ghskldkjasa198d.a8.3ep ?

Normally, I would not even be talking to your box.

Suddenly, your bandwidth is gone from a million requests that have nothing to do with your download.

Re:Think about how that would be accomplished. (1)

maxume (22995) | more than 7 years ago | (#18691139)

I would set it up so that my computer only went through the files I wanted to be sharing blocks of, which is slightly less hyperbolic. I imagine that networks will end up taking advantage of this, but instead of direct block searches, the search will ask 'do you have anything like...' and if your computer responds, the search will say 'are any of them exactly...', so even though it isn't super-awesome-perfect, there appears to be some room to improve over things as they are.

To me the real benefit would be that the software would be better at telling which songs are the same song, and searches would return fewer, more accurate results (simply because they would be grouped better); bandwidth usually ends up coming from one or two 'good' sharers anyway.

Re:Thats the dumbest thing I've ever heard (1)

TodMinuit (1026042) | more than 7 years ago | (#18690457)

Not really. Let's use BitTorrent as an example. BitTorrent links chunks to a torrent. If it were to just track chunks, you could get a speed-up for certain torrents.

Let's say I make a torrent, example-v1.0.torrent, containing file X (with the checksum "foobar") and file Z (with the checksum "deadbeef"). I seed it, people download it, yippie.

Now let's say later on, file X changes and now has the checksum "barfoo", so I create example-v1.1.torrent. Under the current BitTorrent system, both file X and file Z would have to be seeded from scratch. Whereas if you were to merely track chunks, anyone currently distributing file Z, which hasn't changed, would be used in the seeding of example-v1.1.torrent.

For things like operating system ISOs, you could get a head start when seeding new versions. For compressed data, like MP3s or videos, you're screwed.

grea tide a (5, Funny)

underwhelm (53409) | more than 7 years ago | (#18690103)

I'm hoping this CATCHES ON and wet ransfer a11 sorts of information like this. It'11 be 1ike getting every thing in the form of a ransom n0te.


The music kids listen to (3, Funny)

Vollernurd (232458) | more than 7 years ago | (#18690111)

So it's not me then? All new tunes DO sound the same?

Re:The music kids listen to (0)

penp (1072374) | more than 7 years ago | (#18690269)

It's not that they sound the same, it's that the files are similar. If a file is encoded in the same format as another, I'm guessing the data contributed by the encoder itself is what they are referring to: similar (binary) data that has nothing to do with the actual audio/video output.

Re:The music kids listen to (0)

ronanbear (924575) | more than 7 years ago | (#18690661)

meh, if that's the case then you could just download one song, once, and then download the additional (non identical) data to turn it into a whole music collection.

Re:The music kids listen to (1)

iminplaya (723125) | more than 7 years ago | (#18691037)

It's not that they sound the same, it's that the files are similar.

If you ever have the chance, go over and test drive the mail program that comes on every new macintosh, and send off a test mail. Make sure the volume is turned up.

They already work like this (1)

Noishe (829350) | more than 7 years ago | (#18690115)

Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.
I've never used a P2P program that won't download from two copies of a song just because they are labeled differently or have different headers.

Re:They already work like this (0, Flamebait)

multipartmixed (163409) | more than 7 years ago | (#18691079)

Still haven't learned about ID3 tags, huh?

You poor bastard.

Even Media Player messes with them now.

That'll work great with (1)

Rik Sweeney (471717) | more than 7 years ago | (#18690117)


No wait, hear me out. Most porn is going to be largely white or black skin colour (particularly with Friesian Cows if you're into that sort of thing), so the P2P can just find a chunk with a similar amount of that colour and download that!

Re:That'll work great with (1)

FlatLine84 (1084689) | more than 7 years ago | (#18690265)

I'm sorry, if I'm downloading porn, I would rather have it as one file, instead of risking "mixing" genres... Although, for some people, that could be interesting I suppose.

This could work for some files, but not for others (0, Redundant)

Icarus1919 (802533) | more than 7 years ago | (#18690123)

What if you're downloading a backup copy of a movie that is subtitled, as opposed to a version that is not? Considering how much space the video data takes up, these files would be awfully similar. Sometimes it IS the little things that make all the difference in files. Version of software with bugfix as opposed to without? Only difference between the two files is the name (1.0 vs. 1.1) and the fixed lines of code. That sort of thing. Seems to me this may not be as useful as advertised.

Re:This could work for some files, but not for oth (2, Interesting)

Xzzy (111297) | more than 7 years ago | (#18690495)

It depends on how they calculate "similar". If they run checksums on the chunks and submit a query to other machines on the network that have pieces with identical chunks, then it would be valid to download. I'm pretty sure a few P2P services in the pre-bittorrent days did something like this, files with identical hashes would be grouped together.

But the article makes it sound like their custom software breaks up media files into their component streams, which clients can download separately as they desire. Pull the english audio stream from Computer A, the video stream from B, etc. Once downloaded the client automagically reassembles it into a single file.

Writing a client that can disassemble and reassemble all known media types sounds absolutely horrifying though. ;)

Re:This could work for some files, but not for oth (3, Interesting)

Incy (635621) | more than 7 years ago | (#18690535)

Anything compressed/encrypted won't work so well, unless it is just a mislabeled piece of music. If you google around for the Low Bandwidth File System (LBFS) you'll see what technique the article is really talking about. (Disclaimer: I didn't read the article either.) Variable-length chunking will handle cases where new data is inserted halfway into the file, but with compression that extra data will end up changing the whole damned file.
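[Ed.: the LBFS-style variable-length chunking mentioned above can be sketched in a few lines. A toy boundary function stands in for LBFS's Rabin fingerprints, and the window/mask values are arbitrary demo parameters.]

```python
import hashlib

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list[bytes]:
    """Content-defined chunking: place a boundary wherever a hash of the
    last `window` bytes hits a magic bit pattern. Because boundaries depend
    only on nearby bytes, inserting data early in a file shifts only the
    chunks it touches -- later chunks still line up and hash identically."""
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        # Toy stand-in for a Rabin fingerprint: one byte of an MD5
        # over the current window (slow, but deterministic and simple).
        h = hashlib.md5(data[i + 1 - window:i + 1]).digest()[0]
        if (h & mask) == mask and i + 1 - start >= window:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

With fixed-size chunks, an insertion shifts every later chunk and nothing matches; with content-defined boundaries the chunk stream resynchronizes shortly after the insertion point.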

Right.... (1, Troll)

simm1701 (835424) | more than 7 years ago | (#18690129)

Sure this is going to work... really

I'll just splice that bit from that torrent, that bit from that one... it should work, I mean they are all the same TV episode and they are all mpeg4 - the file name says so...

Hmmm, how about the bitrates, the codecs, and, if it was recorded from TV, whether recording started at the same time?

That guy seriously has to be joking - the byte offsets are unlikely to ever specify a suitable join - and even if they rewrote the protocol so it split by seconds rather than fixed file widths you'd still have changing codecs and bitrates to deal with. Personally I'll stick to torrents with decent known trackers

Re:Right.... (2, Informative)

Icarus1919 (802533) | more than 7 years ago | (#18690209)

Ok, perhaps you're not certain how files work. But things compressed with different codecs and bitrates look VERY different when you actually look at the coding in the file as opposed to the same file named differently or with minor changes.

Re:Right.... (1)

Volante3192 (953645) | more than 7 years ago | (#18690215)

Beat me to it. This sounds like it'd just wreak havoc on checksums.

Plus, I'd also like to add that P2P doesn't download from a single source as the summary claims... it pulls down chunks of the same file from MANY sources.

Maybe these guys are confusing P2P with FTP?...

Re:Right.... (1)

maxume (22995) | more than 7 years ago | (#18691195)

They are improving the ability of the software to identify files as being the same, rather than reporting files with superficial differences as different (where "superficial" means different tags or whatever, while the media stream is identical).

Re:Right.... (5, Informative)

angio (33504) | more than 7 years ago | (#18690311)

Take a peek at the paper [] - it actually does work, and we demonstrated it. The intuition: people make small changes to files like changing the artist or title in the MP3 header, and then BitTorrent and other systems treat this as a "different" file, when in fact it's 99.9% similar.
(Yes, I'm one of the authors.)
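[Ed.: a toy illustration of the similarity-lookup idea as described in the summary. The sampling scheme, parameters, and names below are made up for illustration, not taken from the SET paper.]

```python
import hashlib

CHUNK = 16 * 1024

def handprint(data: bytes, k: int = 4) -> list[str]:
    """Publish only the k lexicographically smallest chunk hashes of a
    file. Two files that share most of their chunks are very likely to
    share at least one of these samples, so a peer can discover 'similar'
    sources by looking up a handful of hashes instead of all of them."""
    hashes = {hashlib.sha1(data[i:i + CHUNK]).hexdigest()
              for i in range(0, len(data), CHUNK)}
    return sorted(hashes)[:k]
```

The point of sampling is scale: a lookup costs k hash queries regardless of file size, and once a similar source is found, normal per-chunk verification decides which of its chunks are actually usable.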

Re:Right.... (0, Troll)

Reason58 (775044) | more than 7 years ago | (#18690587)

The intuition: people make small changes to files like changing the artist or title in the MP3 header, and then BitTorrent and other systems treat this as a "different" file, when in fact it's 99.9% similar.
BitTorrent, as most of you know, doesn't work this way. Files are selected from a server called a "tracker", and only users with that exact file size and hash will be linked with you. The only way you could implement a system like this is to create an entirely new protocol, server software and client software. Given the widespread adoption of BitTorrent I think the performance gains would have to be very substantial for people to migrate.

Re:Right.... (1)

joshv (13017) | more than 7 years ago | (#18691009)

It would indeed be odd for someone to publish a paper about a novel way of speeding up downloads if in fact Bitorrent, or any other file sharing protocol already worked that way. Thanks for pointing out that they don't.

Re:Right.... (1)

Maximum Prophet (716608) | more than 7 years ago | (#18690675)

Isn't this rsync meets bitTorrent?

It sounds like what you are saying is that someone wants to download X, but there are few sources of X. There are many sources of Y, which is really X, renamed. Your tool would download the proper header info from the X source and the majority of the data from the Y sources.

Re:Right.... (1)

Incy (635621) | more than 7 years ago | (#18690735)

Interesting... The four types of "similar" files were files that were mislabeled, in a different language, contained errors, or were packaged differently. I wouldn't have expected those four cases to add up to anything significant. However, I guess it only takes one mostly-similar file to roughly double the speed of your download, so it doesn't take much to have a big impact.

Re:Right.... (0)

Anonymous Coward | more than 7 years ago | (#18690851)

If that's all you're doing, then GNUnet [] does that already. For MP3 files with different ID3 tags, the first 99% of the file is exactly the same. But if you have something more complex, like the same movie with different subtitles, then you have to find similar pieces which are scattered around at different offsets.

Re:Right.... (1, Interesting)

Hatta (162192) | more than 7 years ago | (#18690865)

If I download a song with bittorrent then change the tag, it doesn't treat it as a different file. It treats it as a corrupted file. It does indeed recognize that it is 99% similar, and it can use that file to seed the similar parts. How is this novel again?

Re:Right.... (1)

Incy (635621) | more than 7 years ago | (#18691021)

Although this comment from the article begs for further clarification: "We intentionally sampled a subset of files that were likely to be similar to each other and that we expected people to be interested in downloading."

Re:Right.... (0)

Anonymous Coward | more than 7 years ago | (#18690481)

RTFA. It only downloads identical chunks (by some hashing technique they call "handprinting"), you'll end up with the original file just the same as before. Consider the situation of the same film but with different subtitles and / or audio; if the source / encoder / settings were otherwise the same then a lot of the video chunks will match.

Likewise the "99% similar" mp3 situation mentioned is the case where you have essentially the same mp3 with different ID3 tags, though as pointed out by another comment a lot of p2p programs already ignore the headers for swarming purposes. The basic concept is sound; the question is whether it can scale well enough to be practical: every incoming chunk request hash must presumably be searched against the hash of every chunk shared if the hash of the file isn't found, and since the chunk size is only 16K that's likely to be a LOT of chunks.
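The lookup being worried about here might look like the following hypothetical global chunk index (a sketch of the general idea, not SET's actual "handprinting" scheme). The scaling concern is that `index` needs one entry per 16 KB chunk shared anywhere on the network:

```python
import hashlib
from collections import defaultdict

CHUNK = 16 * 1024  # 16 KB, as in the article

# Hypothetical global index: chunk hash -> set of peers holding that chunk.
index = defaultdict(set)

def publish(peer, data):
    """Register every chunk of a shared file under its own hash."""
    for i in range(0, len(data), CHUNK):
        index[hashlib.sha1(data[i:i + CHUNK]).hexdigest()].add(peer)

def sources_for(chunk_hash):
    """All peers that can serve this exact chunk, whichever file it came from."""
    return index.get(chunk_hash, set())
```

A single 1 GB file alone contributes about 65,000 entries, which is why the scaling question is the real issue.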

Re:Right.... (1)

nine-times (778537) | more than 7 years ago | (#18690719)

I'll admit that I'm definitely not the most educated person in these matters, but I just don't quite "get it". If you are going to download only the differences between two files, doesn't that require that some computer has access to both files and can compare the differences? If one end or the other doesn't have both files, wouldn't you need to transfer the file first to make the comparison? (meaning you'd still need to download the whole thing?)

Anyway, I've thought about this before, even though I don't have the technical background to think about it properly. What's the legal implication for copyrights in this instance? Let's imagine I have 50 public domain movies in MPEG format that I'm sharing through BitTorrent, and that, miraculously, by taking different small chunks of data from these various movies and putting those chunks into a different order, you could create a full copy of a copyrighted Hollywood movie released a few months ago. Now someone else introduces a BitTorrent tracker that can pull those chunks from those 50 movies and put them into the correct order. In this unusual situation, is anyone violating the copyright of the Hollywood movie?

It seems like a technicality, but you'd have to figure that, for any size-limited chunk of data, there are a finite number of possibilities. Therefore, if you had a hard drive storing a different 4KB file for every possible combination of data that could exist in 4KB, you would, in a sense, have every piece of information it is possible to create stored on that hard drive. Of course, in order to recreate a particular piece of information, you'd need additional information: which chunks need to be combined, in which order, and at what point to cut off the last 4KB chunk.

So it seems like a weird gray area to me. In one sense, you could consider this a form of encoding. However, if several people are sharing several pieces of unrelated information which can be pieced together into a copyrighted work, it doesn't seem to me that anyone is necessarily guilty of copyright infringement. On the server end, the copyrighted work simply doesn't exist. At no point is the copyrighted film actually being copied. But after a given series of data chunks are copied, they can be put into a sequence which results in the copyrighted work.

I know, I know, I probably sound very silly to those who know better.

Re:Right.... (1)

zero1101 (444838) | more than 7 years ago | (#18691111)

You can simplify your example and see why it doesn't make sense by looking at it this way: if I have a database on my hard drive that stores every possible representation of 1 bit, all I need to do to recreate a remote file is download a list that gives me the number of bits, as well as the state of each bit in the remote file in sequence. I can then rebuild the file by looking up the bits in the local state database in sequence and writing the result out to a file.
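The point can be made concrete: a "database of every possible bit" buys you nothing, because the download list itself carries all the information. A toy sketch:

```python
def encode(data: bytes) -> list:
    """The 'download list': which database entry (0 or 1) to use for each bit."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]

def decode(bits: list) -> bytes:
    """'Rebuild' the file by looking up each bit state in sequence."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

data = b"any file at all"
bits = encode(data)
assert decode(bits) == data
# The "list" has 8 entries per byte: it IS the file, just written less compactly.
assert len(bits) == 8 * len(data)
```

The same argument applies to the 4KB-chunk version above, just with a bigger alphabet.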

Re:Right.... (0)

Anonymous Coward | more than 7 years ago | (#18691115)


Take 128-bit MD5s of your 4K chunks. If your downstream bandwidth isn't saturated, look for other chunks with identical checksums and download them. When you get "the whole file", use a checksum on it to verify that the chunks work. If this fails, begin replacing the "similar" chunks with ones from your original source until the final file passes.

You're basically using your extra bandwidth to gamble on identical-checksum chunks. If some of these work, you get a faster download.
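A sketch of that repair loop (a hypothetical helper, not anything from TFA): start with the gambled chunks, then swap in chunks from the trusted source one at a time until the whole-file checksum passes.

```python
import hashlib

def assemble(chunks, whole_digest):
    """chunks: list of (trusted_bytes, gambled_bytes) pairs, one per position.
    whole_digest: SHA-256 hex digest of the complete desired file."""
    current = [gambled for (_trusted, gambled) in chunks]
    for i, (trusted, _gambled) in enumerate(chunks):
        if hashlib.sha256(b"".join(current)).hexdigest() == whole_digest:
            return b"".join(current)  # the gamble (at least partly) paid off
        current[i] = trusted          # fall back to the original source
    final = b"".join(current)
    assert hashlib.sha256(final).hexdigest() == whole_digest
    return final
```

In the worst case this degenerates to fetching every chunk from the trusted source, which is the "no worse than before" property you'd want from such a gamble.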

Re:Right.... (1)

mini me (132455) | more than 7 years ago | (#18691165)

Trivial example:

I want to download a file that contains "Hello World"
Another user has a file that contains "Hello John"

I can download "World" from user A and "Hello " from user B to make up the file I'm looking for, even though user B doesn't actually have the file on his computer.
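In code, that example looks like this (a toy sketch with a 6-byte "chunk" boundary):

```python
# User A has the file I want; user B has a similar one.
user_a = b"Hello World"
user_b = b"Hello John"

# Chunk-level matching notices the first "chunk" is identical in both files,
# so it can be fetched from either peer.
from_b = user_b[:6]   # b"Hello " -- downloaded from user B
from_a = user_a[6:]   # b"World"  -- downloaded from user A

assert from_b + from_a == b"Hello World"
```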

Re:Right.... (1)

Dare nMc (468959) | more than 7 years ago | (#18691193)

if it was from TV whether it was started at the same time??

It's more like creating a diff program for binary files, not just text.
That's the part they're sharing. I.e., if you find identical segments in similar files, you can grab those segments from either torrent. They even have an example: a trailer.
A better example would be a movie with three versions: an extended version, a theatrical version, and a TV version. If the theatrical and TV versions were cut from the extended version (after encoding, etc.), you could download the extended version after downloading the theatrical version in 1/8th the time, since you really just need a difference file.

Re:Right.... (0)

Anonymous Coward | more than 7 years ago | (#18691223)


i don't think there is any suggestion to just trust the filename as you seem to be implying. you subdivide each file (making sure to ignore metadata) into chunks and then hash each chunk and compare the hashes.

as far as torrents, haven't you ever noticed that many torrents will have mostly the same files in them but be "different" because there are different nfo/txt files in them? haven't you ever thought it would just make a lot of sense if you could combine "different" torrents together which have files which are 99% the same? the concept in the article is the same, except looking for redundancy at the chunk level, rather than the file level as in my example.

Sounds not so useful. (0)

realcoolguy425 (587426) | more than 7 years ago | (#18690145)

Oh, but if only we could! I'm sorry, it sounds like someone is just imagining a 'what-if' scenario. This is not going to happen unless a new filetype is made specifically for it, and then you'd need a very specialized peer-to-peer program, and who knows how much extra work it would cost in terms of CPU/networking to determine whether a file is similar or not. To me this just sounds stupid. Maybe it's just from waking up. They should work on optimising the 'pipes' or the 'tubes' to carry more data instead. It sounds like all the person thought of was: hey, BitTorrent AND a stupid way to increase the number of people I can download from!

Snakeoil (0, Troll)

Reason58 (775044) | more than 7 years ago | (#18690151)

Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.
It's not even 9AM and I have already filled my bullshit quota for the day. The concept itself is dubious, but this statement in particular is ludicrous.

Re:Snakeoil (3, Funny)

Anonymous Coward | more than 7 years ago | (#18690241)

this statement in particular is ludicrous.
You don't listen to pop music, do you?

Re:Snakeoil (1)

SaturnTim (445813) | more than 7 years ago | (#18690399)

The article is a bit vague, but I think this is really trying to match up identical files that just differ in meta-information. So this won't be downloading parts of completely different files; it's just not relying on file names to find matches.

Summary: (4, Informative)

PhrostyMcByte (589271) | more than 7 years ago | (#18690153)

Instead of sharing files, divide them into 16KB chunks and share those, to help work around files that get renamed or trivially altered (e.g. a website tagging its URL onto all the files you upload).

Re:Summary: (0, Redundant)

physicsboy500 (645835) | more than 7 years ago | (#18690381)

or better yet, why not divide it into 1 bit chunks, that way you'll have TRILLIONS of sources to download it from, that way all your computer has to do is determine the information in the chunk and download it from the first source it sees... this could even come from your own computers!!

Man, these torrent files are huge. When did they start making torrent files that are 4GB??

99% Similar? (0, Redundant)

swillden (191260) | more than 7 years ago | (#18690161)

I knew all the music on the radio sounded the same, but I didn't think it was *that* lacking in originality.

Re:99% Similar? (1)

happyemoticon (543015) | more than 7 years ago | (#18690413)

Well, they're all composed of 1's and 0's, so they must be basically identical. Except for those songs that contain a 2, or a -1.

Tiny chunks, Large files (2, Insightful)

LiquidCoooled (634315) | more than 7 years ago | (#18690177)

LOTS of overhead just to find the chunks.
The article talks about 16KB chunks, which for a DVD image would require more chunks than the torrent protocol currently allows.

The client would spend more time communicating its chunk lists around than actually getting data.

(If I remember rightly, torrents can have a max of 65535 chunks and some servers prevent huge .torrent files which contain the chunk breakpoints anyway)

Re:Tiny chunks, Large files (1)

burris (122191) | more than 7 years ago | (#18690631)

Sorry, the limit on the number of pieces in a torrent is 2^32. The message length prefix (i.e. the max length of the have-bitfield) as well as the piece indices are 4-byte unsigned ints.

meh (1)

brunascle (994197) | more than 7 years ago | (#18690185)

not particularly new. many P2Ps have been grouping identical files together. i know one of the early ones did it (was it Napster, Audiogalaxy?), but i think only if the files were 100% identical other than the filename.

there's definitely potential for problems here. what if those files really aren't supposed to be the same? a swapped byte here and there could have huge effects on the end result.

Re:meh (0)

Anonymous Coward | more than 7 years ago | (#18691013)

Hello and welcome to 1999. Napster did that almost 10 years ago.

Well put... that was my first thought also when reading the summary : old news.

But the problem was actually that the file could be significantly different and still be treated as the same, so you would have some songs where you'd get, say, half from one source (at a certain bitrate/song length/etc.) and half from another source, and then when you listen to it there's a big jump (forward or backward) in the song, a change of pitch, volume, etc. It was pretty nasty. I still have quite a few songs from back then with these problems.

Overhead (1)

TubeSteak (669689) | more than 7 years ago | (#18690201)

Wouldn't this generate a lot more overhead traffic?
I'm sure smarter people than I have already thought this out,
but that seems to be the most immediate downside.

"SET divides a one-gigabyte file into 64,000 16-kilobyte chunks"

In other words: instead of seeking out the one master-hash for the file,
your P2P is looking up the thousands of chunk hashes.
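Back-of-the-envelope on that overhead (assuming a 20-byte hash per chunk, which the article doesn't specify):

```python
file_size = 1 << 30        # 1 GiB
chunk_size = 16 * 1024     # 16 KiB, per the article
chunks = file_size // chunk_size
hash_bytes = 20            # e.g. one SHA-1 digest per chunk (an assumption)

print(chunks)                         # prints 65536 chunks
print(chunks * hash_bytes // 1024)    # prints 1280 (KiB of hashes, one file)
```

So the per-file lookup state is over a megabyte of hashes, versus one master hash today.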

Re:Overhead (0)

Anonymous Coward | more than 7 years ago | (#18690981)

In other words: instead of seeking out the one master-hash for the file,
your P2P is looking up the thousands of chunk hashes.
In other words, they've reinvented GNUnet? []

limewire (0)

Anonymous Coward | more than 7 years ago | (#18690207)

i don't think this makes sense for bittorrent p2p, but i think what they mean is
that when you search limewire you can see that there are many differently tagged versions with roughly the same filesize
when you try to download one, it isn't associated with the other files that are tagged differently even though it is the same song/bitrate everything.
so this program would recognize that i guess...

how about bringing the orig. napster back, or make a client IDENTICAL to it.

Container-aware P2P (1)

Tx (96709) | more than 7 years ago | (#18690219)

I have no idea what the overheads might be for their "handprinting" algorithms, or how effective they are. But I've wondered in the past whether something vaguely similar could be achieved by, for example, hashing each stream and the headers separately in audio and video files, or each file within an archive. The same could apply to any container format. That would certainly deal with e.g. the same mp3 with different ID3 tags, and the overheads might be lower. Could get messy though, I guess.
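For the MP3 case that idea is easy to sketch, since an ID3v1 tag is a fixed 128-byte block at the end of the file starting with "TAG" (ID3v2, which sits at the front and is variable-length, would take more work; this is just an illustrative sketch, not a full container parser):

```python
import hashlib

def split_id3v1(data: bytes):
    """Split off a trailing ID3v1 tag: a 128-byte block starting with 'TAG'."""
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        return data[:-128], data[-128:]
    return data, b""

def payload_hash(data: bytes) -> str:
    """Hash only the audio payload, ignoring the metadata tag."""
    payload, _tag = split_id3v1(data)
    return hashlib.sha1(payload).hexdigest()

# Two copies of the "same" MP3 with different ID3v1 tags hash identically.
audio = b"\xff\xfb" + b"\x00" * 1000  # stand-in for MPEG audio frames
tag_a = b"TAG" + b"Title A".ljust(125, b"\x00")
tag_b = b"TAG" + b"Title B".ljust(125, b"\x00")
assert payload_hash(audio + tag_a) == payload_hash(audio + tag_b)
```

With per-stream hashes like this, retagged copies would match at the file level, without any chunk-level search at all.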

Interesting idea (1)

Dan Stephans II (693520) | more than 7 years ago | (#18690251)

Thought of this a while back on a similar subject: taking BitTorrent as an example, you track a file (or set of files) and the torrent has presized chunks that are hashed. A simple extension may be to change that relationship from one-to-many to many-to-many: share a set of chunks, and a particular torrent uses some list of those chunks (identified by their hashes). Obviously this would cause overhead issues, but it would address the issue in TFA (which, being new here, I did not read).

Should work just fine... (2, Informative)

Anonymous Coward | more than 7 years ago | (#18690257)

I think everyone posting above saying "it won't work" should RTFA.

This works by breaking files down into clusters and hashing the clusters (like Bittorrent already does). Then it searches for other shares that have clusters with the same hash value, and requests them.

Assuming that the hashing scheme being used is "good" in that there are no collisions, two clusters with the same hash will contain the *exact same* information.

Should work just fine.

Its all just ones and zeros (1)

MosesJones (55544) | more than 7 years ago | (#18690275)

Of COURSE all files are "basically" the same; after all, it's just a set of 1s and 0s, and given that you already have lots of 1s and 0s on your machine, this means you already have the file even before you download it. It reminds me of Eric Morecambe and André Previn:
Previn: You were playing all the wrong notes.
Morecambe: No, I was playing all the right notes, just not necessarily in the right order.


Sounds like "Single Instance Storage" (0)

Anonymous Coward | more than 7 years ago | (#18690293)

Sounds like he's trying to apply the concepts behind Single Instance Storage [] to P2P applications, which makes a certain amount of sense, but is sure to be a fair bit more difficult to track the hashes for all the file chunks of all the files available on the network (as opposed to comparing file hashes during a backup or across a LAN).

It might be a bit easier if P2P apps were more aware of the types of files they were transferring, and could make intelligent decisions about how to split up the files into data chunks.

Shows the real trouble with P2P (1)

Animats (122034) | more than 7 years ago | (#18690309)

This is just an illustration of the fact that P2P is an incredibly inefficient way of transferring files around. Most of the material is not only pirated, but a big fraction of the pirated material is the same stuff. P2P "peers" aren't necessarily nearby, either in a physical or bandwidth sense. So huge amounts of bandwidth are being spent shipping the same stuff around.

If it weren't for the piracy issue, the daily output of the RIAA, which is a few gigabytes, could be distributed efficiently by putting MP3s in a Usenet group. With Usenet's distribution mechanism, which is a flooding P2P system, nothing travels over a path more than once.

Re:Shows the real trouble with P2P (1)

abscissa (136568) | more than 7 years ago | (#18690419)

I used to disagree with what you said... but after seeing how P2P affects networks first hand, I am now inclined to agree.

P2P is very inefficient, but the problem is that the means that are maximally efficient (e.g. proxies, Usenet, etc.) are inaccessible to the masses.

Re:Shows the real trouble with P2P (1)

Animats (122034) | more than 7 years ago | (#18690693)

Yes, if we had "" we could probably cut the world's P2P load in half. At least.

It might be a good move for the RIAA members to do that. They pay radio stations to play the stuff. Why not cut out the middleman and ship direct to consumers?

We may be headed for an era where top-40 music is free, but ad-supported.

Jennifer Aniston != Monkey (1)

Redbaran (918344) | more than 7 years ago | (#18690385)

So does this mean that when I go to download a picture of anyone, say Jennifer Aniston, I might get a picture of a monkey, just because they are 99% (or whatever) the same genetically?

I think I speak for all males when I say: Not cool man... not cool!

in other news (1)

brunascle (994197) | more than 7 years ago | (#18690395)

a new P2P app call BET promises faster downloads, utilizing the tried-and-true method of giving a file a set time limit to download then filling in random bytes after that.

Kazaa hiss/pop (0)

Anonymous Coward | more than 7 years ago | (#18690435)

I think Kazaa did something like this, which is partly to blame for the horrible malformed mp3 files that flowed through that network.

But think of the RIAA... (1)

Nom du Keyboard (633989) | more than 7 years ago | (#18690437)

The RIAA is going to absolutely hate any research in this area that can improve P2P performance in any manner. And especially by a university, no less. Those hot beds of piracy don't deserve public money at all, when they spend it like this!!

Pop Music (0, Redundant)

Archangel Michael (180766) | more than 7 years ago | (#18690441)

"but are otherwise 99 percent similar.""

Sounds like all of pop music and most movies these days, since the latest, greatest movies are remakes and I, II, III, IV, V versions of the same theme.

Blurring the Line Between Data (2, Interesting)

dduardo (592868) | more than 7 years ago | (#18690557)

So let's say person A wants to download copyrighted material. They would connect to a tracker, which would then tell person A where to get chunk 1 from. This chunk could come from copyrighted data or public domain data. What the tracker stores isn't the actual copyrighted data, but the offsets, lengths, and IP addresses of where to find particular chunks.

In essence, if you were downloading a music CD, the data chunks could be coming from someone who has an Ubuntu ISO image, or some other piece of copyrighted material.

With this system it essentially becomes impossible to tell who is uploading what.

Chunks not the entire song (1)

ET_Fleshy (829048) | more than 7 years ago | (#18690603)

They are talking about the chunks being the same, not the entire song. If you had a massive lookup table for md5's or something you could easily get a listing of every(body| torrent) that has the chunk you are looking for. Good idea but sounds quite difficult to maintain.

Note: didn't RTFA ;)

wondered how long it would take.. (1)

ehrichweiss (706417) | more than 7 years ago | (#18690877)

I thought of this a few years ago. The idea seemed an obvious one. There'll always be blocks of repeating data (e.g. FF00FF00), and EXE, ZIP, RAR, MP3, etc. headers with many similarities. If the algorithm can vary the size of the chunks it uses, then you can derive your data from lots of different sources, including getting data from image files that are to be applied to an .exe. The key would be to recreate as much as possible from the similar data and the checksum/CRC using some of today's error-correcting technology. It wouldn't be flawless, of course, but well worth the effort.

Permuted Local Chunk Versions (1)

Doc Ruby (173196) | more than 7 years ago | (#18690911)

Storage is cheap, bandwidth more expensive. Why not chunk up each file into many different permutations of its compressed data, with the variants recorded in the local index by fingerprint? Those fingerprints of unique chunks and the list of chunks to files can be maintained in the distributed index of many sites to each fingerprinted chunk. That would make more chances for a given content site to have a chunk that's identical to the one looked for, even if the chunk originated in a different file.

At some point, this protocol gets so far away from merely specifying a file's index in a local list to its specified remote storage server (eg. a tinyurl) for its monolithic compressed content (the best bandwidth for the least cacheable flexibility) that it's transferring more data among the distributed servers than that basic protocol. There's got to be some kind of "topological calculus" [] of the network connectedness and edge capacity vs node capacity that specifies the optimal distribution allocation of data to indices. Anyone?

similiar? (0, Troll)

mephistophyles (974697) | more than 7 years ago | (#18691003)

Sheesh, it's not even just in the summary, it's in TFA's title. I also hate to point out the obvious, but I don't need to be a researcher at CMU to realize that if all those split files were put together, it would be easier and faster to download the file... talk about pointing out the obvious.

Updates (1)

diakka (2281) | more than 7 years ago | (#18691227)

This could be a great tool for distributing updates. Say you already downloaded one DVD ISO image for your favorite Linux distro; it could save a lot of time over downloading a whole new DVD ISO. Even for smaller files or individual packages it could be really handy. I know there are already tools for generating RPM deltas and such, but if it could be transparent, it could really save a lot of hassle as well as bandwidth.