
New Binary Diffing Algorithm Announced By Google

timothy posted more than 4 years ago | from the smartness-and-light dept.

Upgrades 192

bheer writes "Today Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates. Courgette achieves smaller diffs (about 9x in one example) than standard binary-diffing algorithms like bsdiff by disassembling the code and sending the assembler diffs over the wire. This, the Chromium devs say, will allow them to send smaller, more frequent updates, making users more secure. Since this will be released as open source, it should make distributing updates a lot easier for the open-source community."
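
The claim is easiest to see with a toy model. The sketch below is not Courgette's code: it invents a 4-byte instruction format, uses Python's difflib as a stand-in for bsdiff, and skips the small label-to-address table a real tool would also ship so the client can rebuild the exact binary. It only shows why diffing a label-normalized disassembly beats diffing raw bytes when one instruction is inserted.

import difflib, random

# Toy instruction set invented for this sketch: every instruction is 4 bytes,
# an opcode byte plus a 3-byte big-endian operand.  Opcode 0x01 is a call whose
# operand is the byte offset of its target instruction; 0x02 is plain data.
CALL, DATA = 0x01, 0x02

def assemble(program):
    """program is a list of ("call", target_index) / ("data", value) tuples."""
    out = bytearray()
    for op, operand in program:
        if op == "call":
            out += bytes([CALL]) + (operand * 4).to_bytes(3, "big")  # index -> byte offset
        else:
            out += bytes([DATA]) + operand.to_bytes(3, "big")
    return bytes(out)

def disassemble(binary):
    """The 'adjust' idea: replace concrete call offsets with symbolic labels
    assigned in address order.  (A real tool also keeps the small label-to-
    address table so the client can rebuild the exact binary; omitted here.)"""
    instrs = [(binary[i], int.from_bytes(binary[i + 1:i + 4], "big"))
              for i in range(0, len(binary), 4)]
    label = {t: f"L{k}" for k, t in
             enumerate(sorted({opnd for opc, opnd in instrs if opc == CALL}))}
    return [f"call {label[opnd]}" if opc == CALL else f"data {opnd}"
            for opc, opnd in instrs]

def diff_cost(old, new):
    """Rough size of a generic diff: total length of the non-matching regions."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return sum((i2 - i1) + (j2 - j1)
               for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal")

rng = random.Random(0)
targets = [rng.randrange(800, 1000) for _ in range(1000)]
old_prog = [("data", 7)] + [("call", t) for t in targets]
# Same program with one instruction inserted near the front: every call target
# slides down one slot, so every encoded call offset changes.
new_prog = old_prog[:1] + [("data", 0)] + [("call", t + 1) for t in targets]

old_bin, new_bin = assemble(old_prog), assemble(new_prog)
print("diff cost, raw bytes:         ", diff_cost(old_bin, new_bin))
print("diff cost, disassembled form: ",
      diff_cost(disassemble(old_bin), disassemble(new_bin)))

On this toy input the raw-byte diff cost dwarfs the disassembled diff, which is a single inserted line; that gap is the whole trick.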

192 comments

Google (0, Funny)

Anonymous Coward | more than 4 years ago | (#28719313)

can suck my diff!

Rob Malda is no longer an anal virgin (-1, Troll)

Anonymous Coward | more than 4 years ago | (#28719513)

Last night I took Rob Malda's anal virginity. It was amusing listening to him squeal like a pig as my throbbing member eviscerated his asshole. It bled a little bit after we were finished but he said he had never cum so hard as he did that time. If any other Slashdot anal virgins want their anal virginity taken away in the most pleasurable way possible, please respond to this post.

Re:Google (-1, Offtopic)

Anonymous Coward | more than 4 years ago | (#28719655)

I just downloaded Google Chrome [google.com] 3.0.192.0 for Mac [apple.com] and it crashed before I could even open a page. There is no excuse for this; my Mac Pro [apple.com] is perfect in every way with eight 2.93 GHz cores [intel.com], 32 GB RAM, and a fresh install of Mac OS X [apple.com] Leopard v10.5.7. Ergo any crashing Google Chrome does is Google Chrome's own fault!

Why is it that Apple [apple.com] and Mozilla [mozilla.com] can do this but Google [google.com] can't? I ran Internet Explorer 8 [microsoft.com] for months before its final release, Firefox 3.5 [trollaxor.com] since its 3.1 days, and found Safari 4 Developer Preview [apple.com] more stable than Safari 3. In fact, even WebKit [webkit.org] is more stable than Chrome.

What really baffles me, however, isn't the instability [computerworld.com] I've come to expect from Google, but that Google has the audacity [bullsballs.com] to ask for personal user info to improve its browser. Is the search engine maker datamonger really so desperate for my private information that it's stooped to the level of Trojan horses [wikisource.org] to get it?

They should ask me that when it doesn't crash on launch.

Everything Google does is just another way to sieve personal data away for targeting ads. This kind of Big Brother [google-watch.org] crap is more repulsive than the fat [trollaxor.com] programmers [shelleytherepublican.com] that make it possible. Google, with its deep pockets and doctoral scholars [nytimes.com], thinks that by holding user data hostage it can maneuver around Apple and Microsoft [micrsoft.com]. While this may be true, I'm not willing to be a part of it.

In using Google's search [google.com], Gmail [google.com], Chrome [apple.com] or whatever else the faceless robot [wikipedia.org] of a company invents, the user is surrendering their personal information to a giant hivemind [google.com]. No longer are their personal preferences some choice they make; they're a string of data processed by a Google algorithm: Google dehumanizes [wikipedia.org] its users!

So while Google is arrogant enough to paint spyware shiny so it can parse our browsing habits, the least they could do is make sure it doesn't crash. If Apple, Microsoft, and Mozilla can get their preview releases right, why can't Google? And now they're making their own operating [scobleizer.com] systems [pcworld.com]?

Get real, Google! I'll use your crashing codebloat when my Mac is cold and dead and I'm looking for handouts. Until then, quit mining [goatse.info] my personal data!

dictionary (2, Informative)

neonprimetime (528653) | more than 4 years ago | (#28719335)

courgette (kr-zht)

n. Chiefly British

A zucchini.

Re:dictionary (1)

koxkoxkox (879667) | more than 4 years ago | (#28719525)

it is from French. A cute word in French, but I can't imagine how it would be pronounced in English.

Re:dictionary (1)

4D6963 (933028) | more than 4 years ago | (#28719539)

it is from French. A cute word in French

Indeed, I couldn't help but lol when I first saw the name.

Re:dictionary (2, Funny)

MrMr (219533) | more than 4 years ago | (#28719811)

Stoopid Brits,
Because the Americans pronounce the Italian word Zucchini flawlessly.

Re:dictionary (1)

Ed Avis (5917) | more than 4 years ago | (#28719843)

A cute word in French, but I can't imagine how it would be pronounced in English.

Think 'Corvette' and change the V to a soft J or zh sound.

Re:dictionary (0)

Anonymous Coward | more than 4 years ago | (#28720045)

Easy: check out its entry on dict.cc [www.dict.cc] (an English<->German online dictionary), and click on the speaker icon in the leftmost column. They have both an automatically-generated computer version and a sound recording by a British user.

Re:dictionary (0)

Anonymous Coward | more than 4 years ago | (#28719777)

Who wouldn't want to get backdoored by a zucchini?

Re:dictionary (1, Funny)

Anonymous Coward | more than 4 years ago | (#28719821)

A faggot?

uses a primitive automatic disassembler (4, Insightful)

flowsnake (1051494) | more than 4 years ago | (#28719351)

An interesting approach - I wonder if this would also work as well on compiled bytecode like .NET or Java uses?

Re:uses a primitive automatic disassembler (3, Insightful)

hattig (47930) | more than 4 years ago | (#28719463)

Why not just ship the affected .class files for Java? Disassemble the Jar that is being updated, replace/add/remove the class(es) as per the update instructions and rebuild the Jar.

Chrome's problem is that a massive, bulky, chrome.dll file needs to be sent out with each update, and that it isn't easy to split it up into its constituent parts because it is lower level.

It's not a new idea, but quite impressive that they've actually gone and worked it all out so that it is reliable. And nicer than patching via jumps (noop the old code, add a jump to new code, append new code to end).
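
The jar case really is close to trivial; a minimal sketch with Python's zipfile (the file names and the replacement map below are made up for illustration):

import zipfile

def patch_jar(old_jar, new_jar, replaced, removed=()):
    """Rebuild a jar, swapping in updated .class members.  `replaced` maps
    member names to their new bytes; `removed` lists members to drop."""
    with zipfile.ZipFile(old_jar) as src, \
         zipfile.ZipFile(new_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename in removed or info.filename in replaced:
                continue                       # dropped, or rewritten below
            dst.writestr(info, src.read(info.filename))
        for name, data in replaced.items():    # updated and brand-new classes
            dst.writestr(name, data)

# e.g. patch_jar("app-1.0.jar", "app-1.1.jar",
#                {"com/example/Foo.class": open("Foo.class", "rb").read()})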

Re:uses a primitive automatic disassembler (0)

Anonymous Coward | more than 4 years ago | (#28719867)

It's an interesting approach to a problem that should never have arisen. Why does everything have to be thrown into a 10MB dll? People are browsing on computers, not wristwatches, and there's no reason to abandon modular design to tangle things together for trivial performance gains. Google loves throwing its massive revenues at building and maintaining bad code that's obsessively optimized. Look at the unholy amounts of research and tricky implementation they put into Apps and Gmail and Wave.. inexplicably designing all their own widgets and developing the mind-bogglingly expensive Google App Engine into a Python- and Java- linkable library and stretching and extending the capabilities of Javascript and the DOM far beyond their intended purpose to get drag-and-drop, fancy transitions, live editing... all of this is standard fare in a desktop app but Google insists on coding it (inevitably ugly) to run in the browser. I should have expected the same thing from Chrome.

Re:uses a primitive automatic disassembler (4, Insightful)

FunkyELF (609131) | more than 4 years ago | (#28720561)

It's an interesting approach to a problem that should never have arisen. Why does everything have to be thrown into a 10MB dll? People are browsing on computers, not wristwatches, and there's no reason to abandon modular design to tangle things together for trivial performance gains.

Just because the end result is a 10MB dll doesn't mean that the code and design isn't modular. That's like saying Linux isn't modular because the live CDs / DVDs come in a single gigantic 4.7GB .iso file.

Also less overhead for Google (4, Insightful)

Enderandrew (866215) | more than 4 years ago | (#28719411)

Google has to pay the cost of maintaining servers and handling bandwidth for all the OS updates they push out. The more efficient they are in this process, the more money they save.

The good news is that the same benefits could be applied to Red Hat, Ubuntu, openSUSE, etc. Lower costs help the profitability of companies trying to make a profit on Linux.

The end users also see benefits in that their packages download quicker. I'd be honestly pretty disappointed in any major distro that doesn't start implementing a binary diff solution around this.

Re:Also less overhead for Google (2, Insightful)

maxume (22995) | more than 4 years ago | (#28719773)

Yeah, but it is probably more like unplugging wall warts than insulating your house.

That is, it will be a hassle, be visible, and show a tiny benefit, all the while distracting from more worthwhile activity.

Re:Also less overhead for Google (1)

pembo13 (770295) | more than 4 years ago | (#28719865)

> I'd be honestly pretty disappointed in any major distro that doesn't start implementing a binary diff solution around this.

Fedora already has Presto using DeltaRPMs [fedoraproject.org]

Re:Also less overhead for Google (2, Informative)

Rayban (13436) | more than 4 years ago | (#28719929)

DeltaRPM uses bsdiff - an impressive but generic binary diff algorithm.

Courgette is designed to replace this binary diff with one that understands compiled code well enough to optimize the binary diffs by a significant amount.

Re:Also less overhead for Google (1)

mzs (595629) | more than 4 years ago | (#28720531)

Courgette uses bsdiff as well, after it breaks the binary down into parts that bsdiff handles better.

Re:Also less overhead for Google (2, Interesting)

Ed Avis (5917) | more than 4 years ago | (#28719911)

Fedora is already using binary diffs to speed up downloading updates - see yum-presto [fedorahosted.org]. With a better binary diff algorithm, the RPM updates can hopefully be made even smaller. As the Google developers note, making smaller update packages isn't just a 'nice to have' - it really makes a difference in getting vulnerabilities patched faster and cuts the bandwidth bill for the vendor and its mirror sites. Remembering my experiences downloading updates over a 56k modem, I am also strongly in favour of anything that makes updating faster for the user.

Rsync (1, Interesting)

Gerald (9696) | more than 4 years ago | (#28719431)

Would this algorithm be useful to the rsync team?

Re:Rsync (1)

six (1673) | more than 4 years ago | (#28719569)

Simple answer: no

statistically, I think rsync has very few binary files to deal with, at least the way I'm using it.

also, their technique may make the diff data smaller, but it also makes the diffing/patching process a LOT slower, something many rsync users don't want because on a LAN you don't care much about bandwidth usage.

Re:Rsync (3, Interesting)

Anonymous Coward | more than 4 years ago | (#28719801)

Umm, many of us use rsync like mad on binaries such as ISO images or repository trees full of RPMs which are full of compressed data.

The rsync algorithm (see the cool thesis) is actually quite adept at processing "non-text" byte sequences. Its main innovation is having a heuristic to break the stream into smaller phrases, identifying canonical start/end boundaries by running the same heuristic on both source and destination files (without ever possessing both files in the same place). It transfers a description of the source file as a sequence of phrases identified by their hash. It transfers entire phrases when the recipient doesn't already have one with the same hash. It doesn't compute "diffs" in any normal sense, e.g. comparing before/after copies and creating a patch.

An application of rsync could be immensely improved in any situation where it operates on structured data inside a serialization "container" by cracking open the container and operating on the internal structured data stream instead of the container representation. Then, the recipient would reencode the container after generating a full copy of the internal structured data. However, if the container encoding is not normalized and deterministic, you would not get back a byte-for-byte identical container on the other side.

This google strategy is to open up the "assembled container" and do diffs on the source representation. It sounds more like an attempt to recover easy newline boundaries for doing diffs without a clever phrase boundary heuristic like used in rsync. I wonder whether their assembler is deterministic. Operating on source code in general could be great, but compilation with optimizers and linkers is not necessarily deterministic, so a different sort of integrity protection system would be needed to validate the results of the process.
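
A toy version of the phrase-splitting idea described above, using a hash of a small trailing window to pick content-defined boundaries and then transferring phrases by hash. It is an illustration of the concept, not rsync's actual algorithm, and it recomputes the window hash at every position where a real implementation would slide it in O(1).

import hashlib, zlib

WINDOW, MASK = 16, 0x3F            # boundary when the window hash ends in six
                                   # zero bits, i.e. roughly every 64 bytes

def split_phrases(data):
    """Cut `data` where a hash of the trailing 16-byte window hits a fixed
    pattern, so boundaries depend on content rather than absolute offsets."""
    phrases, start = [], 0
    for i in range(WINDOW, len(data)):
        if zlib.adler32(data[i - WINDOW:i]) & MASK == 0:
            phrases.append(data[start:i])
            start = i
    phrases.append(data[start:])
    return phrases

def phrases_to_send(new_data, hashes_receiver_has):
    """Sender side: ship only the phrases whose hashes the receiver lacks."""
    return [p for p in split_phrases(new_data)
            if hashlib.sha1(p).hexdigest() not in hashes_receiver_has]

Because boundaries depend on content rather than absolute offsets, an insertion early in the file only disturbs the phrases around it.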

Re:Rsync (1)

relguj9 (1313593) | more than 4 years ago | (#28720033)

I wonder whether their assembler is deterministic. Operating on source code in general could be great, but compilation with optimizers and linkers is not necessarily deterministic, so a different sort of integrity protection system would be needed to validate the results of the process.

Yah, that was my question kind of too (I'm assuming it is). Generally, I'd think that you want everyone with the same version to have the exact same deliverables.

Re:Rsync (2, Interesting)

mzs (595629) | more than 4 years ago | (#28720297)

That was a very good high-level explanation of how rsync works, thanks. It shows though that this approach is uninteresting to rsync. The hashes and heuristics are there so that you do not have to compare the two ends over a potentially slow link; that is the beauty of rsync. In Google's case they have all the versions locally and can do the comparisons orders of magnitude faster. So even if there were some way to get binary 1-1 (which I bet they do in fact), it would be useless for rsync, as you would have to read the whole file to do the comparison, so why not just send the whole file instead of the changes to begin with?

The other point is that when google says assembler they do not mean something like MASM or gas, but rather something that can take two binary blobs (one that you have and one that they send) and run them to generate a third binary blob that matches what they intended the result to be after encapsulating it again. I think you get that but I wanted to make that clear to other readers.

Re:Rsync (3, Interesting)

bcrowell (177657) | more than 4 years ago | (#28719991)

Simple answer: no

statistically, I think rsync has very few binary files to deal with, at least the way I'm using it.

also, their technique may make the diff data smaller, but it also makes the diffing/patching process a LOT slower, something many rsync users don't want because on a LAN you don't care much about bandwidth usage.

Well, your use of rsync is not necessarily typical. I use unison for file synchronization, and unison uses rsync; I have quite a few binary files that I sync. You're right that it would be a tradeoff of cpu versus bandwidth, but actually that's the whole point of rsync.

On the other hand, I can think of at least some other reasons that this algorithm would be less appropriate for rsync than for google's intended use. (1) Google knows that their file is an executable, and knows that it's an x86 executable. Rsync would have to detect this using the kind of heuristic algorithms used by the unix "file" utility. (2) It's kind of ugly to imagine building all this x86-specific stuff into a generic program like rsync, which people may be using on ARM machines, or which people may use in the future on architectures that haven't been invented yet. (3) Google is doing one-to-many file distribution (pushing out a single patch to millions of users), so that means that the tradeoff of cpu versus bandwidth is an excellent one for them. With rsync, the typical use is one-to-one (syncing machine A with machine B), so the tradeoff isn't as awesomely spectacular.

BTW, a completely different application of this that Google may be interested in is for Google Native Client, which runs x86 code sandboxed in a browser. Suppose google has 10^8 users using a particular web app that runs x86 code in their browser via Native Client. As long as the program doesn't change, most of the users are probably just getting it from cache, so it loads fast and doesn't cost google any bandwidth. Then on a certain day, google updates the software. Potentially this causes a bottleneck where 10^8 users are all trying to download the same code (could be especially bad on a corporate network with lots of people simultaneously downloading the same stuff on a Monday morning). If Courgette is incorporated in the browser, then potentially it could be smart enough to realize that it already has version x cached, and what it's trying to download is version x+1, and work out the patch without having to download the whole thing again.

Re:Rsync (0)

Anonymous Coward | more than 4 years ago | (#28720509)

I think rsync sends whole files when it detects you're sending across a local network anyway, or via a command-line switch.

Re:Rsync (1)

drinkypoo (153816) | more than 4 years ago | (#28720547)

1. In your example (backups) there's no reason there couldn't be a metadatabase with the types of your files (once determined by magic, as in file(1)).

2. There's no reason why you couldn't do this with other architectures as well. Any binaries for which you had the compiler/decompiler set would be supported. Taking this idea a step further, there's no particular reason that it couldn't be done for other types of file as well. For example, uncompressed images could be losslessly compressed and transmitted, then recomposed into the original file.

3. If you were backing up a lot of binaries over a slow link, the difference could be fairly sizable.

I'm sure there are other problems :)

Re:Rsync (3, Interesting)

NoCowardsHere (1278858) | more than 4 years ago | (#28720141)

Not really. Where Google's algorithm really shines is in exactly the field they designed it for: efficiently trying to update a large number of identical binary files of known content (particularly those representing code) on many remote computers, by sending a compact list of the differences.

Rsync actually has to solve a different problem: figuring out where differences exist between two files separated over a slow link when you DON'T precisely know the content of the remote file, but know it's likely similar to a local one. Its rolling-checksum algorithm is very good at doing this pretty efficiently for many types of files.

Re:Rsync (1)

ChrisMounce (1096567) | more than 4 years ago | (#28720469)

Maybe. But I can think of a few issues:

1. This is geared toward a specific type of file (x86 executable), not generic data files.
2. Adding an educated-guess-where-all-the-pointers-are system might just mess with the rsync protocol.
3. Google has the advantage of knowing, with a quick version number check, exactly what changes need to be made: most data flows from server to client. The rsync destination would have to send back two sets of rolling checksums: first for the disassembly, then for the guess made using the patched disassembly. Don't know how big of an effect this would be on efficiency, but it would be at least slightly slower than what Google can achieve.

I really like the idea, though, and I'm a big fan of rsync. It would be interesting if there was a general purpose system for guessing changes in files, given that X changed.

The cool thing is... (3, Interesting)

salimma (115327) | more than 4 years ago | (#28719451)

The cool thing is, one can easily extend this to other executable formats, as long as the assembler is readily available client-side: Windows users could relate to those pesky, resource-hogging Java updates, and .NET and Mono applications could similarly benefit.

This is, interestingly, the second binary diffing innovation that affects me in the past few months. Fedora just turned on delta updates with Fedora 11, a feature borrowed from the openSUSE folks.

Re:The cool thing is... (5, Informative)

mzs (595629) | more than 4 years ago | (#28719739)

You didn't RTFA before posting, did you? When they say assembler they mean something of their own that takes a binary blob and one you have already and reassembles the original binary. It just so happens that the disassembler knows a lot about Windows executables, and the archive format that Google uses breaks it up into some portions and assigns labels to addresses. Then it runs bsdiff on this smaller subset.

The code outlined in the blog post is really in these files:

win32_x86_generator.h
win32_x86_patcher.h

Notice these names? This is the disappointing aspect of all of this: it is one more reason that Chrome is x86 and primarily Windows. You would need one for Mach-O and ELF to do this on other platforms, and then if you were on another processor, say ARM or PPC, you would need something that understood that as well. Then there is the issue of 64-bit. In any case, the assembler is not something like gas or MASM, which is what you imagined.
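
On the portability point: the cheap part of a cross-platform version would be picking a disassembler by container magic; everything after that branch (relocation-aware disassembly and reassembly per format) is the hard part described above. A trivial sketch of just the dispatch, nothing Courgette-specific:

def detect_container(blob):
    """Pick an executable container by its magic bytes."""
    if blob[:2] == b"MZ":
        return "pe"        # Windows PE/COFF, the case win32_x86_* handles
    if blob[:4] == b"\x7fELF":
        return "elf"       # Linux and most other Unixes
    if blob[:4] in (b"\xfe\xed\xfa\xce", b"\xfe\xed\xfa\xcf",
                    b"\xce\xfa\xed\xfe", b"\xcf\xfa\xed\xfe"):
        return "mach-o"    # Mac OS X, 32- or 64-bit, either byte order
    return "unknown"       # fall back to a plain bsdiff of the raw bytes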

Re:The cool thing is... (1)

salimma (115327) | more than 4 years ago | (#28719817)

Ah, that is indeed correct. A bit disappointing, but upon reflection, going binary->assembly code->diff would likely result in a larger patch anyway, even if the diff is then compressed efficiently.

Re:The cool thing is... (2, Informative)

Rayban (13436) | more than 4 years ago | (#28720111)

The source for the disassembler is pretty simple.

http://src.chromium.org/viewvc/chrome/trunk/src/courgette/disassembler.cc [chromium.org]

Porting that to parse x86 out of ELF or another executable container wouldn't be too difficult. Porting it to parse x64 or PPC would be tougher.

Re:The cool thing is... (2, Informative)

mzs (595629) | more than 4 years ago | (#28720419)

Actually it would be pretty hard to go from x86 PE to x86 ELF, as relocations are done completely differently in those formats. With PE they are often names or indexes into DLLs and handles (think dlopen), while ELF has all sorts of relocation entries where the instructions themselves get modified by the runtime linker to the correct address or offset (the premunged instruction either has an index in it to a name or to another relocation table with more info).

Re:The cool thing is... (1)

salimma (115327) | more than 4 years ago | (#28719859)

Ah, deltarpm turns out to use normal compression; no binary diffing is involved:

$ rpm -q deltarpm --requires
libbz2.so.1()(64bit)
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.3.4)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libc.so.6(GLIBC_2.7)(64bit)
librpm.so.0()(64bit)
librpmio.so.0()(64bit)
rpmlib(CompressedFileNames) = 3.0.4-1
rpmlib(FileDigests) = 4.6.0-1
rpmlib(PayloadFilesHavePrefix) = 4.0-1
rtld(GNU_HASH)

wait a minute (3, Insightful)

six (1673) | more than 4 years ago | (#28719457)

announced a new compression technique called Courgette geared towards distributing really small updates

I just RTFA; this has nothing to do with a compression technique.

What they developed is a technique to make small diffs from *executable binary files*, and it doesn't look like it's portable to anything other than x86, because the patch engine has to embed an architecture-specific assembler + disassembler.

Re:wait a minute (2, Informative)

hattig (47930) | more than 4 years ago | (#28719541)

Another problem is that you would need to maintain every little diff from previous versions, and apply them one after another. Every so often you would have a cumulative diff maintained, so you'd do: 3.0.3 -> 3.0.4 -> 3.0.5 -> 3.2 (cumulative 3.0.5 to 3.2) -> 3.2.1 -> 3.2.2 -> 3.5 (cumulative 3.2.2 to 3.5) -> 3.5.1

Re:wait a minute (5, Informative)

flowsnake (1051494) | more than 4 years ago | (#28719723)

Not really a problem. Every n releases you push a complete patch - a bit like key frames in MPEG. People who keep their stuff reasonably up-to-date benefit from the smaller patches, those who don't just have to go back to the 'key frame' equivalent. And on the client - the latest version on the host is effectively the sum of all the diffs up to that point. OK so there is not enough information there to revert to an arbitrary earlier version, but usually we don't revert to older versions of executables. If we absolutely have to revert, maybe to undo a bad update, we can always just download a complete version of the required version.
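
A sketch of that 'key frame' selection, reusing the version numbers from the earlier post; the patch table itself is hypothetical, and a real updater would also weigh patch sizes:

# Hypothetical release history and patch table; the version numbers are the
# ones from the earlier post, the table itself is made up.
RELEASES = ["3.0.3", "3.0.4", "3.0.5", "3.2", "3.2.1", "3.2.2", "3.5", "3.5.1"]
PATCHES = {("3.0.3", "3.0.4"), ("3.0.4", "3.0.5"), ("3.0.5", "3.2"),
           ("3.0.3", "3.2"),                        # cumulative "key frame"
           ("3.2", "3.2.1"), ("3.2.1", "3.2.2"), ("3.2.2", "3.5"),
           ("3.2", "3.5"),                          # cumulative "key frame"
           ("3.5", "3.5.1")}

def plan(installed, latest):
    """Greedy plan: from each version take the patch that jumps furthest
    forward; fall back to a full download if we ever get stuck."""
    order = {v: i for i, v in enumerate(RELEASES)}
    steps, current = [], installed
    while current != latest:
        candidates = [dst for src, dst in PATCHES
                      if src == current and order[dst] <= order[latest]]
        if not candidates:
            return [("full-download", latest)]
        nxt = max(candidates, key=order.get)
        steps.append((current, nxt))
        current = nxt
    return steps

print(plan("3.0.3", "3.5.1"))   # [('3.0.3', '3.2'), ('3.2', '3.5'), ('3.5', '3.5.1')]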

Re:wait a minute (1)

hattig (47930) | more than 4 years ago | (#28719837)

Yeah, it's very manageable, and the build tools would generate each update binary diff and put them on the update servers anyway, and the idea is to keep people up to date often, so most people would be on the current or previous minor version anyway.

Reversion would probably be best handled by having a new update. As long as the update code wasn't the aspect that got broken.

Re:wait a minute (1)

minor_deity (1176695) | more than 4 years ago | (#28720129)

It would be a good idea to back up the file being updated before patching it, so that the old version could easily be restored in order to revert.

I had this idea (1)

Fantom42 (174630) | more than 4 years ago | (#28719469)

I remember having this idea a while back and thinking that it would surely be out there already implemented. When it wasn't, I didn't follow through and try to invent it myself. I was focused on another project at the time. Never really came back to that original idea. Awesome that someone did, and that they are keeping it open so everyone can use it! Thanks, Google!

Evil bastages (-1, Troll)

Anonymous Coward | more than 4 years ago | (#28719487)

I wonder what the tin foil hat crowd will have to say about this? Somebodies gonna hafta rant about how this is evil because it allows for more efficient tracking of everything we do. Or how they hate the google updater always running using up their precious bandwidths/slots on the task manager list. Somebody will troll and it'll get marked insightful or something stupid. And that last sentence guaranteed I won't get nothing but a -1 troll either, but oh well.

Microsoft version? (1, Funny)

Anonymous Coward | more than 4 years ago | (#28719501)

Any takers that Microsoft will release their own version of this, but compile the assembly before sending it over the wire?

Maybe they can call it "Compiled Assembly from Disassembly" (CAD).

Bad explanation (4, Insightful)

DoofusOfDeath (636671) | more than 4 years ago | (#28719557)

Courgette achieves smaller diffs (about 9x in one example)

That's potentially very misleading. I can compress any document, down to a single but, if my compression algorithm is sufficiently tailored to that document. For example:

if (compressed_data[0] == 0):
      return get_Magna_Carta_text()
else:
      return unzip(compressed_data[1:])

What we need to know is the overall distribution of compression ratios, or at least the average compression ratio, over a large population of documents.

Re:Bad explanation (1)

six (1673) | more than 4 years ago | (#28719675)

Their technique is specifically geared towards executable binary files; it doesn't make sense to use it on, and it won't work at all with, any other type of file.

So yes, the 9x thing is quite unfair; that's like comparing bzip2 vs. lame for compressing music.

Re:Bad explanation (1)

DoofusOfDeath (636671) | more than 4 years ago | (#28719787)

Their technique is specifically geared towards executable binary files; it doesn't make sense to use it on, and it won't work at all with, any other type of file.

So yes, the 9x thing is quite unfair; that's like comparing bzip2 vs. lame for compressing music.

Even if you consider the set of binary strings that represent object code to be a subset of the set of all binary strings, I think my point still holds.

The only way my point would be totally invalid, I think, is if it were always the case that the assembly-language diff can be expressed in fewer bits than can the corresponding object-code diff. If that is always true, then I agree that my point is invalid.

Re:Bad explanation (2, Funny)

DoofusOfDeath (636671) | more than 4 years ago | (#28719683)

I can compress any document, down to a single but,

Oh crap. There goes any chance of this being a technical discussion.

Re:Bad explanation (0)

Anonymous Coward | more than 4 years ago | (#28719891)

You could reduce that to zero bytes. Return the result of get_Magna_Carta_text() every time, and require that the sourcetext be the Magna Carta. The output is undefined for all other inputs :)

Not a 'binary diffing' program. . . (0)

Anonymous Coward | more than 4 years ago | (#28719559)

It seems to me that the moniker 'binary diff' doesn't really apply here, does it? My understanding of a binary diff algorithm/program has always been one that can take any arbitrary binary file, not knowing anything about what *type* of file it was (e.g. the file could be an image file, video file, executable, word document, zip file, CAD file, or whatever) and create a diff for that binary 'data'.

It sounds like this is very specific to x86 executables?

Still, whatever you call it, it's good to see progress being made. I just wonder why you can't create small, efficient diffs for any kind of binary file?

Today? (3, Funny)

Blakey Rat (99501) | more than 4 years ago | (#28719575)

Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates today.

Better hurry! It won't work tomorrow!

Re:Today? (1)

Night Goat (18437) | more than 4 years ago | (#28719659)

I was thinking the same thing! Proofreading is a lost art these days.

Re:Today? (0)

Anonymous Coward | more than 4 years ago | (#28720525)

Would you like some cheese with that whine?

Like many brilliant ideas... (4, Informative)

istartedi (132515) | more than 4 years ago | (#28719587)

...it makes you smack yourself on the head and go "why hasn't everybody been doing this for years?".

The idea is simple, and reminds me of something I learned in school regarding signals. Some operations are easy to perform in the frequency domain, so you do the Fourier transform, perform the operation, and then transform back.

This is really just the same idea applied to the problem of patches. They're small in source, but big in binary. It seems so obvious that you could apply a transform, patch, reverse process... but only when pointed out and demonstrated.

It's almost like my favorite invention: the phonograph.

The instructions for making an Edison phonograph could have been understood and executed by any craftsman going back thousands of years. Yet, it wasn't done until the late 19th century.

Are the inventors that brilliant, or are we just that stupid.

Re:Like many brilliant ideas... (1)

ashtophoenix (929197) | more than 4 years ago | (#28719779)

Are the inventors that brilliant, or are we just that stupid.

Isn't this simply "another level of indirection"...? :)

I think to answer your question, it's not brilliance but the ability to step away from the norm and take a fresh look at established things that brings about such innovation. (IMHO)

Re:Like many brilliant ideas... (0)

Anonymous Coward | more than 4 years ago | (#28719923)

I think to answer your question, it's not brilliance but the ability to step away from the norm and take a fresh look at established things that brings about such innovation. (IMHO)

IMHO that's how you define brilliance :)

Re:Like many brilliant ideas... (0)

Anonymous Coward | more than 4 years ago | (#28719965)

It's definitely not its brilliance... Maybe it's its frame of reference?

Re:Like many brilliant ideas... (0)

Anonymous Coward | more than 4 years ago | (#28719961)

"The instructions for making an Edison phonograph could have been understood and executed by any craftsman going back thousands of years. Yet, it wasn't done until the late 19th century.

Are the inventors that brilliant, or are we just that stupid."

Yet Leonardo Da Vinci had the idea for a helicopter and it wasn't for several centuries that man took to the skies, let alone in an actual helicopter. Sometimes the ideas are there but not the desire or the means to move the process along, and sometimes the idea comes at exactly the right time. It's certainly possible that earlier craftsmen could have been able to reproduce a phonograph if shown how, but I have to wonder if society would have been ready for such a device.

Re:Like many brilliant ideas... (1)

MBCook (132727) | more than 4 years ago | (#28720339)

It's only a helicopter because it has a spinning thing whose axle is perpendicular to the ground. And it wouldn't actually work. Even if it could achieve lift by turning the screw, without a tail rotor (or counter-rotating screw) it would simply spin and go nowhere. Then there is the problem of the tremendous power it takes to drive a helicopter rotor.

istartedi is right that any decent craftsman could have made a phonograph. Even with the correct information to design a good wing/blade, a helicopter wouldn't have been possible 150+ years ago. A very high quality phonograph could have been made by any watchmaker.

Re:Like many brilliant ideas... (3, Insightful)

merreborn (853723) | more than 4 years ago | (#28720027)

Like many brilliant ideas... it makes you smack yourself on the head and go "why hasn't everybody been doing this for years?".

Probably because the old solution was:

A) Simple
B) Good enough for most purposes.

Sure, you can shave 80% off your patch filesize... but unless you're as big as google, patch bandwidth probably isn't a major priority -- you've likely got much more important things to dedicate engineering resources to.

You know how they say "Necessity is the mother of invention"? Well, when an invention isn't really necessary for most folks, it tends to show up a little later than it might otherwise have.

Re:Like many brilliant ideas... (2, Insightful)

Just Some Guy (3352) | more than 4 years ago | (#28720527)

Actually, it made me smack my head and say "I assumed we were already doing this".

Sure, you can shave 80% off your patch filesize... but unless you're as big as google, patch bandwidth probably isn't a major priority

So, you've never installed a service pack or another OS update? I'd be more than happy to run "sudo aptitude patch-upgrade" to download the 3KB difference between OpenOffice 3.1.foo and 3.1.foo+1.

Re:Like many brilliant ideas... (0)

Anonymous Coward | more than 4 years ago | (#28720093)

The phonograph could have been executed thousands of years ago using what materials? I don't believe a practical phonograph could be made without modern materials.

Solving the wrong problem (0, Flamebait)

Animats (122034) | more than 4 years ago | (#28719629)

A better binary diffing algorithm is useful for source control. But for security? If the code is so awful that the bandwidth required for security updates is a problem, the product is defective by design.

It sounds like Google tried "agile programming" on trusted code, and now has to deal with the consequences of debugging a pile of crap.

Re:Solving the wrong problem (3, Insightful)

Joe Random (777564) | more than 4 years ago | (#28719815)

If the code is so awful that the bandwidth required for security updates is a problem, the product is defective by design.

No one is saying that the bandwidth is a problem. They're just saying that the bandwidth is unnecessary. FSM-forbid that anyone try to optimize something.

Plus, as the article points out, smaller updates mean more people can receive the update per unit-bandwidth, which means faster distribution of security updates when something critical is fixed.
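
Rough numbers, using only figures that appear in this thread (the ~10MB chrome.dll mentioned elsewhere, the ~80kB patch TFA cites, and bsdiff at roughly 9x Courgette); these are illustrative, not measurements:

SIZES_KB = {"full download":   10 * 1000,   # the ~10MB chrome.dll mentioned elsewhere in this thread
            "bsdiff patch":    9 * 80,      # roughly 9x the Courgette patch, per the summary
            "courgette patch": 80}          # the ~80kB figure TFA quotes

GB_KB = 1_000_000                           # 1 GB of server egress, decimal units
for name, kb in SIZES_KB.items():
    print(f"{name:16s} ~{kb:6d} KB -> {GB_KB // kb:6,d} clients updated per GB served")
# prints roughly 100, 1,388 and 12,500 clients per GB respectively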

Re:Solving the wrong problem (0)

Anonymous Coward | more than 4 years ago | (#28720021)

Bandwidth isn't the problem, user time is. Imagine seamless updates where you don't even have to click OK, watch progress bars, restart the app. Just set up the app to autoupdate, and because the updates are small, it can do that without bothering you and without any perceivable lag. Imagine not having to micromanage updates anymore because they're totally transparent.

This technique can be part of that solution. It requires more serverside and clientside CPU but much less downloading time, and the net is increasingly the bottleneck in everything. So it solves the right problem in the right place.

Re:Solving the wrong problem (1)

BuR4N (512430) | more than 4 years ago | (#28719975)

It has of course nothing to do with what you're implying; it's all about saving money (yeah, stuff like bandwidth etc. does cost money).

Re:Solving the wrong problem (1, Insightful)

Anonymous Coward | more than 4 years ago | (#28720309)

Out of curiosity, could you please point us to some of your code so we can do a comparison?

Thanks.

Re:Solving the wrong problem (1)

Ed Avis (5917) | more than 4 years ago | (#28720555)

Right because we all know that sensible coders get it right first time, and have every feature you'll ever want already implemented in version 1.0, and so never need to push out a new version of their program, and if for some unthinkable reason they do want to make a new release, the users positively enjoy watching the download creep through as slowly as possible...

BTW, have you seen the size of the package updates for your Linux distribution recently?

Nothing terribly new (2, Interesting)

davidwr (791652) | more than 4 years ago | (#28719641)

There was a old, pre-X MacOS binary-diff program that diffed each element of the resource fork [wikipedia.org] independently.

Back in those days, non-code elements usually were stored in unique "resources," one icon might be in ICON 1, another might be in ICON 2, etc.

Code was split among one or more resources.

If a code change only affected one section of code, only one CODE resource would be affected.

Since each resource was limited to 32KB, a diff that only affected one resource would never be larger than twice that size.

If only a single byte changed, the diff was only the overhead to say what byte changed, the old value, and the new value, much like "diff" on text files only on a byte rather than line basis.

So, conceptually, this isn't all that new.
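
A toy version of that record format, for an in-place change to a single resource: one (offset, old value, new value) record per changed byte, with the old value kept as a sanity check when patching.

def byte_diff(old, new):
    """Toy byte-level diff: one (offset, old, new) record per byte that
    differs between two equal-length resources."""
    assert len(old) == len(new), "toy diff handles in-place changes only"
    return [(i, o, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]

def byte_patch(old, records):
    out = bytearray(old)
    for offset, old_byte, new_byte in records:
        assert out[offset] == old_byte     # sanity check: right base version?
        out[offset] = new_byte
    return bytes(out)

# byte_patch(b"spam", byte_diff(b"spam", b"spud")) == b"spud"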

Re:Nothing terribly new (1)

mzs (595629) | more than 4 years ago | (#28720041)

One thing was very different back then: disk space was at a premium, so there were a ton of optimizations to code fragments in the object file format to that end. For example, most data was zero, so only the portions that were not were bunched up in the object file. The utility you refer to was able to take advantage of this. It was otherwise a simple ed-like tool that could insert, remove, and replace. I wish I could remember the name.

Courgette? (0)

Anonymous Coward | more than 4 years ago | (#28719645)

Sounds like a old horny midget woman.

Cougarette? (0)

Anonymous Coward | more than 4 years ago | (#28719681)

Am I the only one who misread "Courgette"? :)

binary diff (2, Informative)

visible.frylock (965768) | more than 4 years ago | (#28719699)

If you're not familiar with the process of binary diff (I wasn't) there's a paper linked from the article that explains some about bsdiff:

http://www.daemonology.net/papers/bsdiff.pdf [daemonology.net]

Wayback from 2007/07/09:
http://web.archive.org/web/20070709234208/http://www.daemonology.net/papers/bsdiff.pdf [archive.org]

Re:binary diff (1)

mzs (595629) | more than 4 years ago | (#28719945)

Yes, and it is a generic approach that does not need to know anything about PE, x86, or archives. In fact it is used by this Google implementation. And there was something much like what Google has here implemented back in 1999, cited in the paper you posted the link to:

B.S. Baker, U. Manber, and R. Muth, Compressing Differences of Executable Code, ACM SIGPLAN Workshop on Compiler Support for System Software, 1999.

Can a layman get an explanation in English? (2, Insightful)

AP31R0N (723649) | more than 4 years ago | (#28719737)

Please?

Re:Can a layman get an explanation in English? (2, Informative)

ledow (319597) | more than 4 years ago | (#28719903)

Rather than send the difference in the executable, they send the difference in a sort of source code. Saves space if a small source change (move this line past this function) causes a big executable difference.

Re:Can a layman get an explanation in English? (5, Informative)

six (1673) | more than 4 years ago | (#28719915)

Binary executable files contain a lot of addresses (variables, jump locations, ...) that are generated by the assembler at compile time.

Now consider you just add one 1-byte instruction somewhere in the middle of your program (let's say "nop"). When you compile it again, all the code that references addresses beyond the insert point will have changed, because those addresses have been incremented. So these 4 bytes added to your source code could mean addresses that get incremented in the compiled file in thousands of places.

What they do basically is take the binary file, disassemble it back to pseudo source code (not real asm I guess), and diff that against old version. The patch engine on the client end does the same disassembling, applies the patch, and reassembles the patched source code to an executable file.

This means the diffs get much smaller (4 bytes vs. thousands in my extreme example), but it also makes the diff/patch process much more complex, slower, and not portable.

Re:Can a layman get an explanation in English? (3, Informative)

chris_eineke (634570) | more than 4 years ago | (#28719953)

A compiler takes source code and turns it into assembler code. That's lines of human-readable machine instruction mnemonics (for example, "Copy from here to here." "Is that bigger than zero?"). The assembler takes those lines and turns them into machine instructions, a sequence of binary numbers.

Finding the difference between two huge gobs of binary numbers is difficult. Instead, they turn the binary numbers back into lines of mnemonics and use an algorithm that finds the difference between two huge listings of mnemonics.

That method is easier because the listing of a program that has been changed slightly can be very similar to the listing of an unmodified program. That has to do with how compilers work.

Capiche? ;)

Re:Can a layman get an explanation in English? (1)

DiegoBravo (324012) | more than 4 years ago | (#28720289)

I'm not sure about the benefits.

1) The bulk of many (most?) software packages is resources, not executables.

2) A lot of an executable is linker/DLL overhead, especially for the smaller ones.
3) The optimizers (I think) remix the assembly instructions, so small changes in the program logic result in a lot of changes in assembly and, ergo, in machine code. The best solution in terms of bandwidth remains sending diffs of the high-level source code.

Re:Can a layman get an explanation in English? (1, Funny)

Anonymous Coward | more than 4 years ago | (#28720387)

Everything except "capiche", yeah.

Re:Can a layman get an explanation in English? (0)

Anonymous Coward | more than 4 years ago | (#28720099)

I'm waiting until they announce a project "geared towards distributing really small updates yesterday." Now THAT will be fast.

(Yes I'm looking at you, editors.)

Re:Can a layman get an explanation in English? (1)

unfunk (804468) | more than 4 years ago | (#28720201)

I'd like to see a car analogy here, but it needs to involve the car's transmission diff

Re:Can a layman get an explanation in English? (1)

eric-x (1348097) | more than 4 years ago | (#28720417)

It makes patches smaller by doing a few extra voodoo pre- and post-processing steps. Many changes in the binary can be traced back to only one change in the original source code; the voodoo knows how to do this and stores just that one master change instead of all the resulting changes. Car analogy: instead of storing the activity of each piston in a car engine, it just stores whether the engine is running or not.

Microsoft patch API (0)

Anonymous Coward | more than 4 years ago | (#28719829)

A bit related, Microsoft have provided a binary patch API for years that can take advantage of the internal structure of .EXE files to achieve better compression. It also does multi-way patching, so you can specify n source files and one target file... the patch will contain enough information to patch any of those n source files up to the target file.

A solution in search of a problem? (1)

alexborges (313924) | more than 4 years ago | (#28719875)

I mean, what's the need, really?

Can't we all be merry and unixize our minds and think: hey, all components of the software are files, let's just separate the heavy parts that are less prone to any security issue (the GUI) from all the libs and executables, and let's just send those through the wire as a package....

We do that in Linux and it works(TM): we package everything, have dependencies, and then only update what needs to be updated, without weird binary patching.

Now, that's not to say that it's not useful to have a new bindiff algorithm: let's plug it into git and svn; a specialized switch (--Gelf-exec-compression) in rsync also comes to mind, and that should make them more powerful and all...

That's just a disassembler. How about bittorrent? (1)

Khopesh (112447) | more than 4 years ago | (#28719897)

This diffs the disassembled version of the original against the update on the server, then does the opposite on the client. I couldn't help but think of this as similar to Gentoo's model ... download a compressed diff of the source and then recompile. Both have the same problem: too much client-side CPU usage (though Gentoo's is an extreme of this). Isn't Google Chrome OS primarily targeting netbooks? Can such things handle that level of extra client-side computation without leaving users frustrated?

I'd rather improve the distribution model. Since packages are all signed, SSL and friends aren't needed for the transfer, nor does it need to come from a trusted authority. Bittorrent comes to mind. I'm quite disappointed that the apt-torrent project [sianka.free.fr] never went anywhere [ubuntu.com]. It's clearly the solution.

Re:That's just a disassembler. How about bittorrent? (1)

BobMcD (601576) | more than 4 years ago | (#28720061)

Torrents are problematic in that they are contrary to the way many, if not all, ISP's are set up. Nodes are intended to be consumers, not data providers. Shifting the downloaded data to the UPSTREAM is against the intent of the architects and produces problems.

Redesign the network, sure. Until then, torrents for updates are only taking your problem and shifting it to someone with fewer resources (without permission.)

Re:That's just a disassembler. How about bittorrent? (1)

tvjunky (838064) | more than 4 years ago | (#28720173)

This diffs the disassembled version of the original against the update on the server, then does the opposite on the client. I couldn't help but think of this as similar to Gentoo's model ... download a compressed diff of the source and then recompile. Both have the same problem: too much client-side CPU usage (though Gentoo's is an extreme of this). Isn't Google Chrome OS primarily targeting netbooks? Can such things handle that level of extra client-side computation without leaving users frustrated?

I don't think this is really a problem in this case. In the time it takes even a slow computer, by today's standards, to download a kilobyte over a WAN link, it has easily performed millions of CPU operations. The same would be true for any kind of compression, really. Since bandwidth through your pipe is orders of magnitude slower than anything that happens within your machine, this added level of complexity is clearly more beneficial than a direct approach. That's why it makes sense to compress files on the server (or on the client uploading them to the server), transfer them, and decompress them on the client, even if the client is quite slow.

I'd rather improve the distribution model. Since packages are all signed, SSL and friends aren't needed for the transfer, nor does it need to come from a trusted authority. Bittorrent comes to mind. I'm quite disappointed that the apt-torrent project never went anywhere. It's clearly the solution.

With patches between minor versions at about 80kB (as stated in TFA), I don't think that distribution using bittorrent would really be the way to go here. Add to this the fact that Google has quite a lot of bandwidth at their disposal, and I don't see this happening anytime soon.
I agree however that it may be a good idea to transfer large amounts of Linux packages that way. But with a lot of smaller packages, the protocol overhead of bittorrent might become a limiting factor regarding its usefulness.

BCJ filter (2, Interesting)

hpa (7948) | more than 4 years ago | (#28720125)

The concept of a Branch-Call-Jump (BCJ) filter is well-known in the data compression community, and is a standard part of quite a few deployed compression products. Used as a front end to a conventional compression algorithm -- or, in this case, a binary diffing algorithm -- it does indeed give significant improvements. The application to binary diff is particularly interesting, since it means you can deal with branches and other references *over* the compressed region, so this is really rather clever.
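
For anyone who hasn't met one, here is a toy filter in that spirit, on an invented 5-byte call format rather than real x86: rewriting each relative target as an absolute address makes every reference to the same target byte-identical, which is what helps the downstream compressor or differ, and the rewrite is exactly invertible on the client.

CALL = 0xE8        # toy opcode followed by a 4-byte little-endian relative offset

def bcj_encode(code):
    """Rewrite relative call targets as absolute addresses.  Bytes that merely
    look like the opcode get rewritten too; that is harmless, since the decoder
    applies the exact inverse at the same positions."""
    out, i = bytearray(code), 0
    while i + 5 <= len(out):
        if out[i] == CALL:
            rel = int.from_bytes(out[i + 1:i + 5], "little", signed=True)
            absolute = (i + 5 + rel) % (1 << 32)
            out[i + 1:i + 5] = absolute.to_bytes(4, "little")
            i += 5
        else:
            i += 1
    return bytes(out)

def bcj_decode(code):
    """Exact inverse: absolute targets back to relative offsets."""
    out, i = bytearray(code), 0
    while i + 5 <= len(out):
        if out[i] == CALL:
            absolute = int.from_bytes(out[i + 1:i + 5], "little")
            rel = (absolute - (i + 5)) % (1 << 32)
            out[i + 1:i + 5] = rel.to_bytes(4, "little")
            i += 5
        else:
            i += 1
    return bytes(out)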

LLVM (1)

reg (5428) | more than 4 years ago | (#28720337)

If you're going to do this in the long run, it would make more sense to distribute LLVM byte code and compile that to native code on the host... That way you have small diffs to the byte code, and also you get targeted optimization (and maybe even custom profile guided optimization).

Not terribly novel (0)

Anonymous Coward | more than 4 years ago | (#28720523)

This is a classic trade-off: you can compress better by sacrificing platform-independence. It has been done before, e.g., the ExeDiff tool for the DEC Alpha. The only news here is that it's being done by Google. Interested parties should refer to: G. Motta, J. Gustafson, and S. Chen, "Differential Compression of Executable Code," Proceedings of the Data Compression Conference (DCC 07), Mar. 2007, pp. 103-112, for an overview of the available tools and how they perform.

Regards,
Jacob Gorm Hansen, author of the EDelta linear-time executable differ.

wrong title, of course (1)

TheGratefulNet (143330) | more than 4 years ago | (#28720579)

it's not 'binary diffing'.

does it work with gif files? jpg? png? dng? mov? mpg?

it's meant for PROGRAM binary files, of a specific type of processor at that.

that's fine.

but please say so, and don't imply it's general-purpose binary diffing. it's really specific and not helpful for image (pic, sound) formats.
