Faster P2P By Matching Similiar Files?

Want to read Slashdot from your mobile device? Point it at m.slashdot.org and keep reading!

Faster P2P By Matching Similiar Files? 222

Posted by CmdrTaco on Wednesday April 11, 2007 @11:45AM from the something-doesn't-jive-here dept.

Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files. "SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""

This discussion has been archived. No new comments can be posted.

Faster P2P By Matching Similiar Files?

Load All Comments

Search 222 Comments Log In/Create an Account

Comments Filter:

Nickelback? (Score:5, Funny)

by onemorehour ( 162028 ) * writes: on Wednesday April 11, 2007 @11:46AM (#18690017)

Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.

Well, sure, if you're only looking at Nickelback [thewebshite.net] songs.

Share
twitter facebook
- Comment removed (Score:5, Informative)
  
  by account_deleted ( 4530225 ) writes: on Wednesday April 11, 2007 @11:54AM (#18690193)
  
  Comment removed based on user account deletion
  
  Parent Share
  twitter facebook
  - Re:Nickelback? (Score:5, Interesting)
    
    by thepotoo ( 829391 ) writes: <thepotoospam@yah[ ]com ['oo.' in gap]> on Wednesday April 11, 2007 @12:17PM (#18690599)
    
    If you use bittorrent, the DHT protocol (supported by Azureus, BitComet, and uTorrent, among others) does the exact thing you're describing. It checks MD5 hashes for files (the whole file, not the pieces, I think), and connects you to peers which have the same file.
    DHT even supports partially corrupted files, your client just discards the corrupt data.
    My question is, why would I want to use SET over DHT? Does SET not need a ceneralized server, or does it have any other advantage at all?
    TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
      - Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
    - Re:Nickelback? (Score:4, Informative)
      
      by Robotech_Master ( 14247 ) * writes: on Wednesday April 11, 2007 @12:46PM (#18691075) Homepage Journal
      
      If I understand the article right, SET looks at individual files within a particular download. DHT just looks at the whole download.
      
      For instance, if I'm uploading my "Songs I Like to Dance To" mp3 mix, and someone else is uploading an "All-Time Greatest Dance Hits" CD rip, and there are a couple of songs both uploads have in common, SET would enable someone downloading my MP3 mix to treat the CD rip as a partial seed (and vice versa), and pull down the songs held in common from either one.
      
      Whereas DHT would simply enable people to pull down my mix from other people uploading the mix, or the CD rip from other people uploading the CD rip, even if the tracker was down. (If I understand what DHT does correctly. Which it is possible I don't.)
      
      Parent Share
      twitter facebook
    - Re:Nickelback? (Score:5, Interesting)
      
      by Andy Dodd ( 701 ) writes: <atd7NO@SPAMcornell.edu> on Wednesday April 11, 2007 @12:46PM (#18691089) Homepage
      
      If I recall correctly, DHT takes the file name into account when calculating the hash. Thus identical files with different names are treated differently.
      
      Some P2P protocols allow looking up a file by a hash which does not take filename into account, but this will not handle the case where the files differ in only one small section. The best example is the following:
      Person downloads an MP3.
      Person finds that the MP3 is not properly tagged (for example, has a comment field saying who ripped it/released the rip, and has no track number.)
      Person changes the MP3's ID3 tag
      Now, nearly all existing P2P protocols will treat the new file as a completely different file, when in reality the most important contents (the audio itself) have not changed, only the file's metadata.
      Other users will go for the "full-file" match with the largest number of sources, thus causing the mistagged MP3 to propagate more than a "fixed" one.
      
      So a P2P system that ignores the ID3 tag when hashing would have significant advantages, in which the user could download the file from many sources and then choose which source to get their metadata from.
      
      Parent Share
      twitter facebook
      - Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
        
        Re: (Score:2)
        
        by dreamlax ( 981973 ) writes:
        
        I guess the idea would be to truncate the file on either end (depending on where the meta data is stored) to have the raw MP3 data. Hashing just that would mean that people such as myself who not only rename but re-tag all my MP3s (because I hate seeing things like "santana - baila mi hermana " . . . Capitalise!) would still be able to share our MP3s the same way we got it from someone else, and that next person can tag it however they please.
      - Re: (Score:3, Informative)
        
        by thepotoo ( 829391 ) writes:
        
        Actually, DHT doesn't care about file names. You can test this yourself if you have a LAN. Grab a torrent, start downloading it on one computer (save as a different filename), get a little of the file downloaded, and start the same torrent on a different computer. Use your firewall to block the second computer's access to the tracker. DHT will kick in, your computers will log in, get each other's IPs, and computer 2 will get an insanely fast download speed until it catches up with computer 1.
        Tested on u
      - Re: (Score:3, Informative)
        
        by evilviper ( 135110 ) writes:
        
        Some P2P protocols allow looking up a file by a hash which does not take filename into account,
        By "Some", you mean "Every Single Frickin' One Of Them", right?
        nearly all existing P2P protocols will treat the new file as a completely different file,
        No. Only the most brain-dead P2P protocols will. "tree" hashes are in use by several P2P protocols. Some are just old or primitive, and have a large number of old servants around that don't understand newer hashes.
    - Re: (Score:2, Interesting)
      
      by Anonymous Coward writes:
      
      No, it's a different kind of identifying which blocks are interesting in a swarm download.
      
      Basically - and this isn't the first time this idea has been tabled, I have an unpublished paper and a reference implementation - chuck BT's idea in the bin, as it uses lists of SHA1 hashes and that's not suitable for this. Shareaza's better placed to do this technically, but you could of course adapt torrent.
      
      What you actually want to use is a TTH - Tiger Tree Hash (THEX standard). That's a Merkle hash tree based on TI
    - Re: (Score:2)
      
      by paeanblack ( 191171 ) writes:
      
      TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.
      
      As TFA current describes things, I'm really struggling to find a use for this feature that does not involve copyright infringement. The rightsholders for "legitimate" p2p traffic already have a strong incentive to act as a central authority for low-bandwidth meta-data.
      
      I think th
    - Re: (Score:2)
      
      by Frenchy_2001 ( 659163 ) writes:
      
      SET seems to be an incremental update of DHT or MD5 hashes.
      Most recent p2p use hashes to recognize files instead of just filename, BUT this makes that music files that are different in their header will come out as different.
      This method allows just to distinguish between data and meta-data for the file and allo the meta data to be different.
      Anyway, just my understanding of it.
      Otherwise, i'm sure they could just do md5 hashes of all your 16k/32k/64k parts of all your shared files and download parts with the
  - Re:Nickelback? (Score:4, Interesting)
    
    by hey! ( 33014 ) writes: on Wednesday April 11, 2007 @12:41PM (#18690987) Homepage Journal
    
    I don't think this is just about inconsistent metadata.
    
    I think what he's talking about may be more like the document fingerprinting algorithms used to pare search engine results, or to detect plagiarism in student papers.
    
    In some cases you will be downloading components of a file from two sources, neither of which have the others' component. The example in TFA was downloading the video portion of a movie from a foreign language site and the audio from a site with the language you speak but less bandwidth.
    
    I suppose another example would be that if you were downloading an anthology of stories, you could take a particular story from a server that hosted a different anthology including that story. Or maybe you are downloading the new distro; you could take some of your files from sites offering the distro version you are looking for, some from sites only offering the files you need to upgrade to that version, and some from entirely different distros or much older versions if they happened to be the same.
    
    I guess it could be thought of as a kind of "fuzzy akamai".
    
    It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models.
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by marcosdumay ( 620877 ) writes:
      
      "It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models."
      
      Who needs commecial suport for P2P?
- Re: (Score:2)
  
  by rainman_bc ( 735332 ) writes:
  
  Well, sure, if you're only looking at Nickelback songs.
  
  Or "theory of a deadman" or "default"
  
  I think they should merge and call themselves Theory of a Nickelfault
  
  =D
- - Re: (Score:3, Insightful)
    
    by eric76 ( 679787 ) writes:
    
    The only thing I use the file sharing networks for is to download new images of FreeBSD and Linux using BitTorrent.
    
    The last thing I want is a "similar" file.
    
    What would be a "similar" file to a FreeBSD ISO? It would either be a corrupted file or one with an introduced exploit.
    - Re:TorrentSoup (Score:5, Informative)
      
      by joe_cot ( 1011355 ) writes: on Wednesday April 11, 2007 @12:06PM (#18690417) Homepage
      
      It would still work the same way as it does now: an md5 of each specific block, and an md5 of the whole thing. If the md5 for the block doesn't match, it's not going to download, and if it's someone using collision to inject a block with the same md5, 1) it's not going to pass the md5 on the whole thing, 2) you're already vulnerable to it. The reason this will work is that they'll be lots of people sharing incomplete or corrupted versions of your FreeBSD iso; you'll get the blocks that are good, and skip the blocks that aren't, making "similar" files very useful. Not too difficult to understand, and no need for tin foil hats.
      
      Parent Share
      twitter facebook
      - Re: (Score:2)
        
        by chgros ( 690878 ) writes:
        
        if it's someone using collision to inject a block with the same md5
        Thankfully, that's not practical at this time.
        You can fairly easily generate 2 chunks with the same md5.
        You cannot easily generate a chunk with the same md5 as a given, pre-existing chunk.
    - Re:TorrentSoup (Score:4, Interesting)
      
      by ShieldW0lf ( 601553 ) writes: on Wednesday April 11, 2007 @12:41PM (#18690979) Journal
      
      So if someone is sharing an older ISO, and it happens to have large portions that exactly match the one you're downloading, with other portions that are not identical, you don't want to download the identical chunks off that person?
      
      It would be interesting if the implementing software could also look for possible matches within your existing file structure and reduce the data downloaded automatically, kind of like using diff and just downloading the patch.
      
      Parent Share
      twitter facebook
      - Re:TorrentSoup (Score:4, Informative)
        
        by CastrTroy ( 595695 ) writes: on Wednesday April 11, 2007 @12:58PM (#18691255)
        
        I'm not sure if this would work if you changed the byte offset though. Sure both ISOs may contain a lot of the same data, but I think it's very unlikely that the data would be at the same byte-offset in the file. I don't think that you'd be able to accomplish this for different byte offsets, because for a 100 MB File, assuming 5 MB chunks, You're looking at about 2,000,000,000 chunks to calculate (20 chunks, calculated at each byte offset).
        
        Parent Share
        twitter facebook
        
        Re: (Score:2)
        
        by CastrTroy ( 595695 ) writes:
        
        From my understanding, RSync's job is made easier because it knows which files it's comparing. It looks at file A on location X, and compares it to file A on location Y. It looks for places where the file is the same, and doesn't transfer the data. It becomes a lot harder when you have a set of files A(1)-A(X) and you have to determine if any parts of A(n) exist in File B on the other system.
  - Re:TorrentSoup (Score:4, Insightful)
    
    by drix ( 4602 ) writes: on Wednesday April 11, 2007 @12:27PM (#18690749) Homepage
    
    Because it gets you published and, thus, increases your chance for tenure, that from which all blessings flow.
    
    Parent Share
    twitter facebook
grea tide a (Score:5, Funny)

by underwhelm ( 53409 ) writes: <underwhelm@NOsPAM.gmail.com> on Wednesday April 11, 2007 @11:51AM (#18690103) Homepage Journal

I'm hoping this CATCHES ON and wet ransfer a11 sorts of information like this. It'11 be 1ike getting every thing in the form of a ransom n0te.

Share
twitter facebook
The music kids listen to (Score:3, Funny)

by Vollernurd ( 232458 ) writes: on Wednesday April 11, 2007 @11:51AM (#18690111)

So it's not me then? All new tunes DO sound the same?

Share
twitter facebook
That'll work great with (Score:2)

by Rik Sweeney ( 471717 ) writes:

porn.

No wait, hear me out. Most porn is going to be largely white or black skin colour (particularly with Friesian Cows if you're into that sort of thing), so the P2P can just find a chunk with a similar amount of that colour and download that!
Summary: (Score:5, Informative)

by PhrostyMcByte ( 589271 ) writes: <phrosty@gmail.com> on Wednesday April 11, 2007 @11:53AM (#18690153) Homepage

instead of sharing files, divide them into 16KB chunks and share those, to help work around files that get renamed or trivially altered (eg a website tagging their url to all the files you upload).

Share
twitter facebook
- Re: (Score:2)
  
  by TheoMurpse ( 729043 ) writes:
  
  If the file length was affected, the 16KB chunks would absolutely differ starting at the point where the length was affected.
- Re: (Score:2)
  
  by EllisDees ( 268037 ) writes:
  
  Even better, just ignore the metadata and search on a hash of the actual content. I'm not sure where the ID3 tags are placed (and I'm too lazy to look it up right now) in an mp3, but if you strip them off and ignore the file name, you should have the raw mp3 data left over.
Tiny chunks, Large files (Score:2, Insightful)

by LiquidCoooled ( 634315 ) writes:

LOTS of overhead just to find the chunks.
The article talks about 16kb chunks, which for a dvd image would take more than the torrent protocol currently allows.

The client would spend more time communicating its chunk lists around than actually getting data.

(If I remember rightly, torrents can have a max of 65535 chunks and some servers prevent huge .torrent files which contain the chunk breakpoints anyway)
- Re: (Score:2)
  
  by burris ( 122191 ) writes:
  
  Sorry, the limit on the number of pieces in a torrent is 2^32. The message length prefix (i.e. the max length of the have-bitfield) as well as the piece indicies are 4-byte unsigned ints.
meh (Score:2)

by brunascle ( 994197 ) writes:

not particularly new. many P2Ps have been grouping identical files together. i know one of the early ones did it (was it Napster, Audiogalaxy?), but i think only if the files were 100% identical other than the filename.

there's definitely potential for problem here. what if those files really arent supposed to be the same? a swapped byte here and there could have huge effects on the end result.
Overhead (Score:2)

by TubeSteak ( 669689 ) writes:

Wouldn't this generate a lot more overhead traffic?
I'm sure smarter people than I have already thought this out,
but that seems to be the most immediate downside.

"SET divides a one-gigabyte file into 64,000 16-kilobyte chunks"

In other words: instead of seeking out the one master-hash for the file,
your P2P is looking up the thousands of chunk hashes.
Container-aware P2P (Score:2)

by Tx ( 96709 ) writes:

I have no idea what the overheads might be for their "handprinting" algorithms, or how effective they are. But I've wondered in the past whether something vaguely similar could be achieved by for example hashing each stream and headers separately in audio and video files, or each file within an archive. The same could apply to any container format. That would certainly deal with e.g. the same mp3 with different ID3 tags, and the overheads might be lower. Could get messy though, I guess.
Should work just fine... (Score:2, Informative)

by Anonymous Coward writes:

I think everyone posting above saying "it won't work" should RTFA.

This works by breaking files down into clusters and hashing the clusters (like Bittorrent already does). Then it searches for other shares that have clusters with the same hash value, and requests them.

Assuming that the hashing scheme being used it "good" in that there are no collisions, two clusters with the same hash will contain the *exact same* information.

Should work just fine.
Its all just ones and zeros (Score:2)

by MosesJones ( 55544 ) writes:

Of COURSE all files are "basically" the same, after all its just a set of 1s and 0s, and given that you already have lots of 1s and 0s on your machine this means that you already have the file even before you download it. It reminds me of Eric Morcambe and Andre Previn Previn: You were playing all the wrong notes Morcambe: No I was playing all the right notes just not necessarily in the right order
Shows the real trouble with P2P (Score:2)

by Animats ( 122034 ) writes:

This is just an illustration of the fact that P2P is an incredibly inefficient way of transferring files around. Most of the material is not only pirated, but a big fraction of the pirated material is the same stuff. P2P "peers" aren't necessarily nearby, either in a physical or bandwidth sense. So huge amounts of bandwidth are being spent shipping the same stuff around.
If it weren't for the piracy issue, the daily output of the RIAA, which is a few gigabytes, could be distributed efficiently by putti
- Re: (Score:2)
  
  by abscissa ( 136568 ) writes:
  
  I used to disagree with what you said... but after seeing how P2P affects networks first hand, I am now inclined to agree.
  
  P2P is very inefficient.. but the problem is that means that are maximally efficient (e.g. proxies, usenet, etc.) are inaccessible to the masses.
  - Re: (Score:2)
    
    by Animats ( 122034 ) writes:
    
    Yes, if we had "alt.binaries.music.riaa.top40" we could probably cut the world's P2P load in half. At least.
    It might be a good move for the RIAA members to do that. They pay radio stations to play the stuff. Why not cut out the middleman and ship direct to consumers?
    We may be headed for an era where top-40 music is free, but ad-supported.
in other news (Score:2)

by brunascle ( 994197 ) writes:

a new P2P app call BET promises faster downloads, utilizing the tried-and-true method of giving a file a set time limit to download then filling in random bytes after that.
But think of the RIAA... (Score:2)

by Nom du Keyboard ( 633989 ) writes:

The RIAA is going to absolutely hate any research in this area that can improve P2P performance in any manner. And especially by a university, no less. Those hot beds of piracy don't deserve public money at all, when they spend it like this!!
Blurring the Line Between Data (Score:3, Interesting)

by dduardo ( 592868 ) writes: on Wednesday April 11, 2007 @12:14PM (#18690557)

So let say person A wants to download copyrighted material. They would connect to a tracker who would when tell person A where to get chunk 1 from. This chunk could come from copyrighted data or public domain data. What the tracker stores isn't the actual copyrighted data, but offsets, lengths, and ip addresses of where to find particular chunks.

In essence, if you were downloading a music CD, the data chunks could be coming from someone who has an Ubuntu ISO image, or some type of copyrighted material.

With this system it essential becomes impossible to tell who is uploading what.

Share
twitter facebook
- Re: (Score:3, Interesting)
  
  by PaintyThePirate ( 682047 ) writes:
  
  The chances of two 16kB chunks from a Ubuntu ISO and a music CD being identical are extremely unlikely, or for that matter, the chances of any two 16kbit chunks from any two non-similar (meaning, in this case, files that are not of/from the same source [the same movie, album, program, whatever]) files being the same are incredibly slim.
  
  That being said, about 90% of the commenters missed the point completly, though it is somewhat understandable given how vague and nontechnical the linked article and summa
- Re: (Score:2)
  
  by aardvarkjoe ( 156801 ) writes:
  
  As the other reply said, the chances of two chunks, even ones that are relatively small (like 16kb) being identical are very, very small, unless the source data was similar in the first place. So you're not going to be able to download your top 40 CD from all-free sources using that method.
  
  You could do something like reducing the chunk size to 10 bytes or something else really small, and probably be able to find chunks from non-copyrighted sources. (Disregard for the moment the fact that the overhead wou
wondered how long it would take.. (Score:2)

by ehrichweiss ( 706417 ) writes:

I thought of this a few years ago. The idea seemed an obvious one. There'll always be blocks of repeating data(e.g. FF00FF00), and exe, zip, rar, mp3, etc. headers with many similarities. If one can make the algo vary the size of the chunks it uses then you can derive your data from lots of different sources including getting data from image files that are to be applied to an .exe. The key would be to be able to recreate as much as possible from the similar data and the checksum/CRC using some of today's er
Permuted Local Chunk Versions (Score:2)

by Doc Ruby ( 173196 ) writes:

Storage is cheap, bandwidth more expensive. Why not chunk up each file into many different permutations of its compressed data, with the variants recorded in the local index by fingerprint? Those fingerprints of unique chunks and the list of chunks to files can be maintained in the distributed index of many sites to each fingerprinted chunk. That would make more chances for a given content site to have a chunk that's identical to the one looked for, even if the chunk originated in a different file.

At some p
Updates (Score:2)

by diakka ( 2281 ) writes:

This could be a great tool for distributing updates. Say, if you already downloaded one DVD iso image for your favorite Linux distro, it could save a lot of time over downloading a whole new DVD iso. Even for smaller files or individual packages it could be really handy. I know there are already tools for generating rpm deltas and such, but if it could be transparant, it could really save a lot of hassle as well as bandwidth.
The main problem with this is... (Score:2)

by guruevi ( 827432 ) writes:

... that hashes collide, all the time. It probably won't collide over a large data chunk, but if you split the data chunks into $number chunks and send around MD5's (or other hashes) for that, you'll multiply the possible collisions by $number.

The only solution therefore is to create a one-to-one hash for each chunk, but then you could just as well transfer the data, because the hash size = chunk size.

Therefore, this approach won't work. Because, say you are transferring an OGG file (of your favorite indie
Could really work! (Score:4, Funny)

by Junior J. Junior III ( 192702 ) writes: on Wednesday April 11, 2007 @01:10PM (#18691415) Homepage

At their fundamental level, all files are essentially similar. They're encoded as 1's and 0's. So, wherever a file happens to call for a 1, you should be able to just pull that 1 from ANYWHERE. Even some random file on your local hard drive. And likewise for zeroes. All you need is a smart download algorithm to re-assemble the 1s and 0s in the correct order, and you're set.

Share
twitter facebook
Already been done a long time ago (Score:3, Informative)

by J0nne ( 924579 ) writes: on Wednesday April 11, 2007 @01:13PM (#18691459)

Shareaza has been doing this for years. When hashing MP3 files, it disregards what(s in the id3 tags, and just computes a hash for the audio information. This means that files with different id3 tags will still be added to the swarm, whicj is great.
Unfortunately, there are some issues with it:
-Only Shareaza supports it, other clients didn't want to play along.
-Shareaza has/d a bug where it would fail to reconstruct the id3 tag after downloading, giving you files with empty tags
-Only mp3 is supported, so no ogg, aac or wma

So this paper isn't as revolutionary (if that's what they mean).

This will only work with identical files that have metadata that is frequently changed by end-users, because there's no way you're going to be able to get a good file if you try to mix a cam with a dvdrip, or an ogg with an mp3, or an xvid file with a divx file. It just doesn't work that way.

Share
twitter facebook
Juuuust one problem... (Score:2)

by DigitAl56K ( 805623 ) writes:

ID3v1 tags won't pose a problem for this, they occur at the end of the file, i.e. the last chunk.

But ID3v2 files occur at the start of the file and have variable size. AVI files might have similar video streams but different language audio tracks, or be interleaved slightly differently, and so forth.

So although similar information might exist in these files, the chances of that information laying exactly on the same chunk boundaries, and thus the chunks having matching MD5s, is pretty low I bet. Even a 1ms
An interesting licensing issue (Score:4, Interesting)

by DigitAl56K ( 805623 ) writes: on Wednesday April 11, 2007 @01:39PM (#18691913)

If a client recreates a file from "similar" pieces, is it a derivative work?

Share
twitter facebook
I tried it (Score:3, Funny)

by Intron ( 870560 ) writes: on Wednesday April 11, 2007 @02:36PM (#18692799)

I tried downloading "My Sweet Lord" by George Harrison, but I got "He's So Fine" by the Chiffons instead.

Share
twitter facebook
Don't see how it could work on a large scale. (Score:3, Interesting)

by Bluesman ( 104513 ) writes: on Wednesday April 11, 2007 @03:21PM (#18693521) Homepage

This seems to be an intractable problem.

How do you know a file is similar? By hashing? There's no guarantee that a particular chunk of a file with an md5 hash (for example) contains the same bytes as that of another file.

There are 2^256 possible chunks of 256 bits of data. There are 2^16 possible hashes with (using a 16 bit binary key) for that same data. That means that for every hash match, the data has a 1 in 16 chance of actually matching.

You can extend the key length to reduce this ratio, but you'll end up with a key length equal to your data size before you're sure the data is not a collision.

The problem gets worse if the chunks of data aren't equal in size.

This can only work if you have a centralized database of every possible file combination on your network. It's workable for a small amount of files, but will grow exponentially in a real environment. Not to mention, the centralized database would have to handle a significant amount of traffic, reducing the speed gains possible.

Count me as skeptical.

Share
twitter facebook
perhaps 100% similar (Score:3, Interesting)

by Frisky070802 ( 591229 ) * writes: on Wednesday April 11, 2007 @03:46PM (#18693821) Journal

I recall hearing a story on NPR Music a few weeks ago about someone who plugged a CD into itunes and had it come up with the right piece but the wrong performer. Then it happened again, with the same ostensible pianist but yet another "wrong" performer. Detailed analysis showed the pianist had apparently published CDs claiming to perform pieces but actually substituting the work of others! Itunes must have used a signature over the content to index the piece by the earlier CD.

Similarity detection rules.

Share
twitter facebook
- Re: (Score:3, Informative)
  
  by Icarus1919 ( 802533 ) writes:
  
  Ok, perhaps you're not certain how files work. But things compressed with different codecs and bitrates look VERY different when you actually look at the coding in the file as opposed to the same file named differently or with minor changes.
  - Re: (Score:2)
    
    by hotdiggitydawg ( 881316 ) writes:
    
    OK... so instead let's go with the same codec, same bitrate, but ripped from two different instances of the same media. You can't seriously tell me that the same analog signal will be sampled exactly the same way every single time.
    
    Have you tried ripping the song from the same CD twice in a row and doing a binary diff? Have you then tried it on a different PC with the same codec/bitrate/software? How about on a different OS with the same codec/bitrate?
  - mod this up -- angio...this is the problem (Score:3, Insightful)
    
    by tacokill ( 531275 ) writes:
    
    What the parent is saying can be summarized with a simple example:
    
    A 200MB, 30min video that was compressed at 1000kbps DiVX is not the "same file with minor changes" as a 200MB, 30min video that was compressed at 900kbps DiVX. They ARE different files and should be treated as such. You also can't deduce anything from their filenames, play length, or any other characteristic so how would you determine which ones can go together and which ones can't? I did not see codecs or compression mentioned at al
    - - Re: (Score:3, Interesting)
        
        by angio ( 33504 ) writes:
        
        Correct (and the parent is correct as well). We didn't exhaustively characterize the reasons that the files differed, but most of the cases appeared to be modifications _after_ encoding. We have on the TODO list to test encode the same CD multiple times with the same settings to see if it produces a similar file, but we haven't run the test yet. There were tons of cases of metadata-altered MP3s, many cases of video files with the same video content but different audio or subtitle information, and some ca
- Re:Right.... (Score:5, Informative)
  
  by angio ( 33504 ) writes: on Wednesday April 11, 2007 @12:01PM (#18690311) Homepage
  
  Take a peek at the paper [cmu.edu] - it actually does work, and we demonstrated it. The intuition: people make small changes to files like changing the artist or title in the MP3 header, and then BitTorrent and other systems treat this as a "different" file, when in fact it's 99.9% similar.
  (Yes, I'm one of the authors.)
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by Maximum Prophet ( 716608 ) writes:
    
    Isn't this rsync meets bitTorrent?
    
    It sounds like what you are saying is that someone wants to download X, but there are few sources of X. There are many sources of Y, which is really X, renamed. Your tool would download the proper header info from the X source and the majority of the data from the Y sources.
    - Re:Right.... (Score:4, Informative)
      
      by angio ( 33504 ) writes: on Wednesday April 11, 2007 @01:41PM (#18691943) Homepage
      
      Similar in spirit - except rsync looks at files on your local hard drive by the same name, so there's only one possible candidate to draw from. SET looks at all of the files that everyone else is currently downloading, so we had to develop a much more efficient technique for locating useful files.
      
      Parent Share
      twitter facebook
  - Re: (Score:2, Interesting)
    
    by Hatta ( 162192 ) writes:
    
    If I download a song with bittorrent then change the tag, it doesn't treat it as a different file. It treats it as a corrupted file. It does indeed recognize that it is 99% similar, and it can use that file to seed the similar parts. How is this novel again?
    - Re: (Score:2)
      
      by TheoMurpse ( 729043 ) writes:
      
      It does indeed recognize that it is 99% similar
      I think that may depend on if the affected metadata affects the length of the entire file. I'm not sure of BT's hashing function, but I think (let X be a constant that is configurable by the guy making the torrent file) when creating a torrent, it creates a hash of each string of X bytes. Thus, if you make the file a different size (by, say, changing the genre tag from "Jpop" to "J-pop" -- assuming there is no NULL padding in the ID3 tag -- or by removing/addin
  - Problem with variable insertions? (Score:3, Interesting)
    
    by digitalderbs ( 718388 ) writes:
    
    I haven't read the detailed paper, but it seems to me that there could be problems in finding similarity when there are random insertions. This is analogous to protein sequence matching (peripherally my field of research) when there are random insertions in the primary sequence. So if you have two identical 16k files, eight 2k chunks will be identical to each other. However, if you insert 512 bytes at the start of one file, eight 2k chunks will be different.
    - Re:Problem with variable insertions? (Score:5, Informative)
      
      by angio ( 33504 ) writes: on Wednesday April 11, 2007 @01:38PM (#18691895) Homepage
      
      We define chunk boundaries using Rabin fingerprinting. It's a cute trick - not one of our own invention - that is relatively insensitive to insertions and deletions. It was used in some of the other work in this area, such as the Low Bandwidth File System (LBFS). There's a family of work in this area called "shingling" that can also apply to sequence similarity.
      
      Parent Share
      twitter facebook
  - Re: (Score:2)
    
    by g2devi ( 898503 ) writes:
    
    99.9% similar may be significantly different.
    
    Compare a video clip that has Nixon saying "I am NOT a crook" with one that says "I am a crook".
    Compare a copy of the US constitution which removes the "due process of law" provision with one that doesn't.
    
    Both are 99.9% similar and both may be popular on Bittorent as spoofs, but neither is what you'd want.
  - - Re: (Score:2)
      
      by joshv ( 13017 ) writes:
      
      It would indeed be odd for someone to publish a paper about a novel way of speeding up downloads if in fact Bitorrent, or any other file sharing protocol already worked that way. Thanks for pointing out that they don't.
      - Re: (Score:2)
        
        by Reason58 ( 775044 ) writes:
        
        It would indeed be odd for someone to publish a paper about a novel way of speeding up downloads if in fact Bitorrent, or any other file sharing protocol already worked that way. Thanks for pointing out that they don't.
        Please excuse me for pointing out the de facto standards for P2P do not currently implement something like this, nor do they allow for this modification within the current structure. I thought this was a place for discussion of the article.
    - Re: (Score:3, Interesting)
      
      by Dare nMc ( 468959 ) writes:
      
      The only way you could implement a system like this is to create an entirely new protocol, server software and client software.
      I disagree, this could be done with No Change to any protocol, client, or server: the only thing that needs created is a torrent creator located on a machine that had a full version of every "simular" torrent to be shared. the torrents would all be linked read only to the same DVD image (for exampl), only part of that DVD image would be labeled as "downloaded" and you would put the
    - Re: (Score:2)
      
      by evilviper ( 135110 ) writes:
      
      BitTorrent, as most of you know, doesn't work this way.
      I, however, know that you're wrong, and that bittorrent DOES in-fact work precisely that way.
      Files are selected from a server called a "tracker", and only users with that exact file size and hash will be linked with you.
      Then please explain how partial files are shared with bittorrent... They won't have the exact same size and hash until they're completely downloaded, yet everyone is sharing their partially downloaded file... One of the most important
  - - It will still be WORSE. (Score:3, Informative)
      
      by khasim ( 1285 ) writes:
      
      Let's go with a fairly vanilla scenario:
      
      I grab an mp3 from person A. I then clean up the tag and rename it to suit me.
      
      You want to download that same song with a different name and different tag.
      
      You connect to person B sharing it. If you're using BitTorrent, you can also connect to any of 99 other people trying to download it from person B.
      
      Using the new model, you could also connect to person A and myself and download the blocks that are the same.
      
      So instead of only...
      99 people in the swarm and 1 seeder
      you'd
  - - Re: (Score:3, Informative)
      
      by Loligo ( 12021 ) writes:
      
      It's already been addressed: files encoded with different codecs, bitrates, compression ratios, what-have-you look completely different, have vastly different checksums, and even if named exactly the same and with the exact same file size, would never be confused for each other by any algorithm that's comparing what's actually IN the files.
      
      -l
- Re: (Score:2)
  
  by nine-times ( 778537 ) writes:
  
  I'll admit that I'm definitely not the most educated person in these matters, but I just don't quite "get it". If you are going to download only the differences between two files, doesn't that require that some computer has access to both files and can compare the differences? If one end or the other doesn't have both files, wouldn't you need to transfer the file first to make the comparison? (meaning you'd still need to download the whole thing?)
  Anyway, I've thought about this before, even though I don'
  - Re: (Score:2)
    
    by zero1101 ( 444838 ) writes:
    
    You can simplify your example and see why it doesn't make sense by looking at it this way: if I have a database on my hard drive that stores every possible representation of 1 bit, all I need to do to recreate a remote file is download a list that gives me the number of bits, as well as the state of each bit in the remote file in sequence. I can then rebuild the file by looking up the bits in the local state database in sequence and writing the result out to a file.
    - Re: (Score:2)
      
      by nine-times ( 778537 ) writes:
      
      Yes, that's why I said "you could look at it as just encoding". The extreme case of my example, obviously, is the single bit. Therefore, the most likely candidate for copyright infringement would be the person providing the bittorrent tracker that told you which chunks to download and which order to put them in. However, it doesn't seem like the answer is so clear. If that were the case, then what about conventional bittorrent participants? Are they guilty of copyright infringement, or is it only the p
  - - Re: (Score:2)
      
      by nine-times ( 778537 ) writes:
      
      Well, yes, I know that you can obviously do a checksum, but that won't tell you which parts of the file have changed. Unless, that is, you run checksums on the individual chunks. However, checksums do not uniquely identify a file. That is, it's been shown that you can manufacture a file to match a given checksum and yet have it be different from the file that the checksum was originally created from.
      Part of the reason checksums work so well is that it's extremely unlikely that two given files will have
- Re: (Score:2)
  
  by Dare nMc ( 468959 ) writes:
  
  if it was from TV whether it was started at the same time??
  More like creating a diff program for binary files, not just text.
  thats the part their sharing. IE if you find identical segments in similar files, you can grab the identical segments from either torrent. They even have the example, a trailer.
  A better example would be a movie with 3 version, IE a extended version, a theatricle version, and a TV version. If you cut the theater version and TV version from the extended version (after encoding, etc)
- Re: (Score:2)
  
  by dknj ( 441802 ) writes:
  
  Or it may even be native to the service itself (i think gnutella or edonkey.. one of the two). Anyway, the point is if you search for a certain file and the hash for specific blocks are identical, Shareaza will attempt to download that copy as well. So if you download, say, a 4mb tupac song and suddenly see a 2.5mb britney spears song in the list.. don't cancel the download. It could be that the 2.5mb britney spears song is actually the same tupac song that was renamed and incomplete.
  
  However, there is no
- It gets worse. (Score:3)
  
  by khasim ( 1285 ) writes:
  
  Taking advantage of those similarities could speed downloads considerably. If a U.S. computer user wanted to download a German-language version of a popular movie, for instance, existing systems would probably download most of the movie from sources in Germany. But if the user could download from similar files, the user could retrieve most of the video from English versions readily available from U.S. sources, and download only the audio portion of the movie from the German sources.
  To paraphrase Morbo: "DOW
  - - Re: (Score:2)
      
      by brunascle ( 994197 ) writes:
      
      hmmm, that's actually interesting. this could potentially completely change how P2P works. instead of requesting files, just request blocks by md5 hash. when you get a match, compare the hashes using another algorithm (to make sure it isnt a coincidence).
      
      would that make it easier to defend yourself against the MAFIAA? since all they know about is 1 block that matches a copyrighted file (or at least, the hash matches)?
    - Think about how that would be accomplished. (Score:2)
      
      by khasim ( 1285 ) writes:
      
      Pretend that you're part of a swarm.
      
      Your computer would then go through ALL YOUR FILES and advertise the md5 checksums to everyone.
      
      Normally, you just advertise the blocks for the file that you're downloading.
      
      So, I'm downloading a Debian iso ... and you're downloading a movie. Why am I (and a million others like me) going to be connecting to your box, asking your processor whether you have a file with checksum ghskldkjasa198d.a8.3ep ?
      
      Normally, I would not even be talking to your box.
      
      Suddenly, your bandwidth
      - Re: (Score:2, Interesting)
        
        by Anonymous Coward writes:
        
        It doesn't have to operate on every file on the filesystem. It's perfectly capable of identifying common blocks among files you happen to be swarming (downloading/hosting) or seeding (hosting), and that's more than useful enough.
        
        Or you can maintain a set of managed/hosted files, exactly like if they were on an rsync server.
        
        As for network efficiency, this isn't Gnutella, we've learned a lot in years of research since then, such as how to make searches that scale well. Very, very well. Well enough to be able
      - That would be better, but still too big. (Score:2)
        
        by khasim ( 1285 ) writes:
        
        For single songs (mp3's or even flac) the time spent hunting down other bits doesn't seem like it would be any better than just downloading that song from one person.
        
        And things like md5 are useful because there is such a low probability of collisions (two different files having the same md5 checksum).
        
        And by that same token, the likelihood that two different songs would have blocks of the exact same bits in a block is practically zero.
        
        Their system WOULD work for movies IF they had previously incorporated the
        
        Re: (Score:2)
        
        by ichigo 2.0 ( 900288 ) writes:
        
        Bram Cohen actually was going to implement a similar scheme in bittorrent at one point, but instead of asking everyone it asks the tracker for peers that have blocks with the correct hash. This would decrease the redundancy in trackers as many torrents have the same files, and it's silly to have separate swarms for them. I think this could probably have been extended to work in such a way that a torrent client queries multiple trackers (not just the original tracker of the torrent in question) for the piece
- Re: (Score:2)
  
  by TodMinuit ( 1026042 ) writes:
  
  Not really. Lets use BitTorrent as an example. BitTorrent links chunks to a torrent. If it were to just track chunks, you could get a speed up for certain torrents.
  
  Lets say I make a torrent, example-v1.0.torrent, containing file X (with the checksum "foobar"), and file Z (with the checksum "deadbeef"). I seed it, people download it, yippie.
  
  Now lets say later on, file X changes, and now has the checksum "barfoo". So I create example-v1.1.torrent. Under the current BitTorrent system, both file X and file Z wo
- Re: (Score:3, Funny)
  
  by Anonymous Coward writes:
  
  this statement in particular is ludicrous.
  You don't listen to pop music, do you?
- Re: (Score:2)
  
  by SaturnTim ( 445813 ) writes:
  
  The article is a bit vague, but I think this is really trying to match up identical files that just differ in meta-information. So this won't be
  downloading parts of completely different files, it's just not relying on file names to find matches.
- Re:Snakeoil (Score:4, Funny)
  
  by discord5 ( 798235 ) writes: on Wednesday April 11, 2007 @01:28PM (#18691721)
  
  It's not even 9AM and I have already filled my bullshit quota for the day. The concept itself is dubious, but this statement in particular is ludicrous.
  
  May I suggest you don't open your e-mail and refrain from answering the phone for today? I usually fill up my bullshit quota with those two media alone. Slashdot is just the icing on the cake. ;)
  
  Parent Share
  twitter facebook
- Re: (Score:2)
  
  by EllisDees ( 268037 ) writes:
  
  What they are saying is that if you search for 'paradise city' on limewire, you might get 20 different entries in the search results all at the same bitrate and approximate size. If you can figure out which of these are 99% the same, with only the metadata changed, you can download the similar parts from many more sources than if you need an exact match.
- Re: (Score:2, Insightful)
  
  by Aladrin ( 926209 ) writes:
  
  No seriously, the coward is right. WTF?
  
  Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same. A few. The large majority of duplicate songs are NOT the same audio data. It's been re-ripped, transcoded, or some other horrid thing done to it and is not the same data anymore.
  
  Now, even assuming that there ARE tons of very-alike files out there, you'd have to write an intelligent comparer for each one so that it knew how to deal with the file and what informati
  - Re: (Score:3, Interesting)
    
    by joto ( 134244 ) writes:
    
    Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same.
    It's more then a few. Most people use the default settings in their audio ripper/compression program, and it's all from the same CD. Even more people never uses an audio ripper and/or compressor, and simply downloads the file from the Internet. Not that many people bother to change ID3-tags either, but every single person that do, leads to a different file.
    And if they don't want it that bad, they aren
  - Re: (Score:2, Insightful)
    
    by Anonymous Coward writes:
    
    Break down the file into small (16Kb) chunks. Hash those chunks and let the client compare those chunks to the chunks you need. Most BT clients already do this, but still only draw the file from peers using the same file listed by the tracker. With this technology it can use any file that has chunks with the exact same hash as the file being downloaded by the user. I would imagine not a great many changes would be needed to implement it. There's no need for an 'intelligent comparer' as it's pretty much
  - Re: (Score:2)
    
    by woolio ( 927141 ) writes:
    
    At the end of the project, you've spent years on a project that'll never quite work right to save a bit of bandwidth for people that should have just gone and bought the song from iTunes in the first place if they wanted it that damned bad. And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike. (And probably has tons of disadvantages compared to the already-existing applications.)
    
    After spending several ye
- Re: (Score:2)
  
  by lottameez ( 816335 ) writes:
  
  Both of these files use the same 26-character alphabet! Share and substitute at will!
- Re: (Score:2)
  
  by happyemoticon ( 543015 ) writes:
  
  Well, they're all composed of 1's and 0's, so they must be basically identical. Except for those songs that contain a 2, or a -1.
- Re:This could work for some files, but not for oth (Score:3, Interesting)
  
  by Xzzy ( 111297 ) writes:
  
  It depends on how they calculate "similar". If they run checksums on the chunks and submit a query to other machines on the network that have pieces with identical chunks, then it would be valid to download. I'm pretty sure a few P2P services in the pre-bittorrent days did something like this, files with identical hashes would be grouped together.
  
  But the article makes it sound like their custom software breaks up media files into their component streams, which clients can download separately as they desire.
- Re:This could work for some files, but not for oth (Score:3, Interesting)
  
  by Incy ( 635621 ) writes:
  
  Anything compressed/encrypted won't work so well. Unless it is just a mislabeled peice of music. If you google around for Low Bandwidth File System (LBFS) you'll see what technique the article is really talking about.(disclaimer -- I didn't read the article either) Variable Length chunking will handle cases where new data is inserted halfway into the file, however with compression that extra data will end up changing the whole damned file.
- Re: (Score:2, Funny)
  
  by maxume ( 22995 ) writes:
  
  Think of it as Jennifer Aniston vs a perfect Jennifer Aniston clone with a paper bag with a monkey face drawn on it over her head. Take off the bag(i.e., the differing headers), and you are good to go.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Nickelback? (Score:5, Funny)

Comment removed (Score:5, Informative)

Re:Nickelback? (Score:5, Interesting)

Re: (Score:2)

Re: (Score:2)

Re:Nickelback? (Score:4, Informative)

Re:Nickelback? (Score:5, Interesting)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Informative)

Re: (Score:3, Informative)

Re: (Score:2, Interesting)

Re: (Score:2)

Re: (Score:2)

Re:Nickelback? (Score:4, Interesting)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Insightful)

Re:TorrentSoup (Score:5, Informative)

Re: (Score:2)

Re:TorrentSoup (Score:4, Interesting)

Re:TorrentSoup (Score:4, Informative)

Re: (Score:2)

Re:TorrentSoup (Score:4, Insightful)

grea tide a (Score:5, Funny)

The music kids listen to (Score:3, Funny)

That'll work great with (Score:2)

Summary: (Score:5, Informative)

Re: (Score:2)

Re: (Score:2)

Tiny chunks, Large files (Score:2, Insightful)

Re: (Score:2)

meh (Score:2)

Overhead (Score:2)

Container-aware P2P (Score:2)

Should work just fine... (Score:2, Informative)

Its all just ones and zeros (Score:2)

Shows the real trouble with P2P (Score:2)

Re: (Score:2)

Re: (Score:2)

in other news (Score:2)

But think of the RIAA... (Score:2)

Blurring the Line Between Data (Score:3, Interesting)

Re: (Score:3, Interesting)

Re: (Score:2)

wondered how long it would take.. (Score:2)

Permuted Local Chunk Versions (Score:2)

Updates (Score:2)

The main problem with this is... (Score:2)

Could really work! (Score:4, Funny)

Already been done a long time ago (Score:3, Informative)

Juuuust one problem... (Score:2)

An interesting licensing issue (Score:4, Interesting)

I tried it (Score:3, Funny)

Don't see how it could work on a large scale. (Score:3, Interesting)

perhaps 100% similar (Score:3, Interesting)

Re: (Score:3, Informative)

Re: (Score:2)

mod this up -- angio...this is the problem (Score:3, Insightful)

Re: (Score:3, Interesting)

Re:Right.... (Score:5, Informative)

Re: (Score:2)

Re:Right.... (Score:4, Informative)

Re: (Score:2, Interesting)

Re: (Score:2)

Problem with variable insertions? (Score:3, Interesting)

Re:Problem with variable insertions? (Score:5, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Interesting)

Re: (Score:2)

It will still be WORSE. (Score:3, Informative)

Re: (Score:3, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)