
First Hutter Prize Awarded

kdawson posted more than 7 years ago | from the compress-this dept.


stefanb writes, "The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia, has been awarded for the first time. Alexander Ratushnyak managed to improve the compression factor to 5.86 and will receive a 3,416-Euro award. Being able to compress knowledge well is believed to be related to acting intelligently." The Usenet announcement notes that at Ratushnyak's request, part of the prize will go to Przemyslaw Skibinski of the University of Wroclaw Institute of Computer Science, for his early contributions to the PAQ compression algorithm.


191 comments

Just goes to show... (0)

Anonymous Coward | more than 7 years ago | (#16637610)

It's not what you know, or who you know, but how much you can compress what you know.

Although I could tell a fair few stories about crushing students into phone booths.

Re:Just goes to show... (0)

Anonymous Coward | more than 7 years ago | (#16637676)

Although I could tell a fair few stories about crushing students into phone booths.

You could probably tell a lot of stories about packing fudge, too.

Re:Just goes to show... (1)

NosTROLLdamus (979044) | more than 7 years ago | (#16637736)

Imagine a Beowulf Cluster of that butt-sex!

what the hell? (1)

hjf (703092) | more than 7 years ago | (#16637640)

last time I checked, you couldn't make a "lossy" compression of text.

oh... nevermind. I forgot about teens with SMS.

Re:what the hell? (3, Funny)

Barny (103770) | more than 7 years ago | (#16637728)

Hehe, and remember the study a few months back that found that the order of letters within a word is unimportant, so long as the first and last are correct? Potentially big words could be listed merely by how many of each letter they have, then randomly mix them up inside, order doesn't matter :)

Re:what the hell? (1, Informative)

Anonymous Coward | more than 7 years ago | (#16637828)

Hehe, and remember the study a few months back that found that the order of letters within a word is unimportant, so long as the first and last are correct?

Remember the study a few years back that proved this to be false, pointing out that slpmiy rnisreveg the ioiretnr lrettes mdae it sltnacifingiy mroe dluciffit to raed the wdros?

Now let us never speak of this again.

Re:what the hell? (1)

SillySnake (727102) | more than 7 years ago | (#16638609)

It took me 10 minutes to figure out "ioiretnr" :-/

Re:what the hell? (1)

McFadden (809368) | more than 7 years ago | (#16639125)

It took me 10 minutes to figure out "ioiretnr" :-/
Presumably this was after you had already figured out "reversing" and still drew a blank.

Re:what the hell? (0)

Anonymous Coward | more than 7 years ago | (#16639255)

Who cares?

If you're looking for a way to put lossy compression to use on text, it's still an interesting effect for helping choose which information gets lost.

Re:what the hell? (1)

Carewolf (581105) | more than 7 years ago | (#16639369)

Not a bad idea, you could always rearrange them to the right order using a dictionary.

Seems like a strange contest (4, Interesting)

Salvance (1014001) | more than 7 years ago | (#16637656)

It turns out that Alexander Ratushnyak was the only person to even meet the challenge's guidelines, and one of only 3 people who submitted entries. Seems like a strange thing to award up to 50,000 Euro for ... his compression was only 20% smaller than what I just quickly did with gzip. I'm rather surprised that more people didn't try though ... it's not like people are throwing money at data compression researchers.

Re:Seems like a strange contest (2, Informative)

Barny (103770) | more than 7 years ago | (#16637700)

20% of a few petabytes is.... a lot :)

And yes, offering cash in this way is a great incentive for programmers.

Also, if it's a CPU-friendly (read: reasonable CPU usage, or even able to be offloaded to a processing card) method, it could potentially add 20% onto your bandwidth :)

Re:Seems like a strange contest (0)

Anonymous Coward | more than 7 years ago | (#16637764)

"And yes, offering cash in this way is a great incentive for programers."

Except, with this contest as an example, it clearly isn't.

Re:Seems like a strange contest (1)

Barny (103770) | more than 7 years ago | (#16637906)

Uh, someone did it didn't they? :)

Re:Seems like a strange contest (0)

Anonymous Coward | more than 7 years ago | (#16638108)

True enough, though stress on the "one." :-P

Re:Seems like a strange contest (5, Insightful)

Salvance (1014001) | more than 7 years ago | (#16637822)

20% is a lot, when the compression/decompression is fast. gzip/WinRAR/WinZip all compress this file to the 22-24MB mark in a minute or two on my desktop (which is comparable to their test machine). The winner's algorithm compressed the data to 16.9MB, but spent 5 hours and 900MB of RAM doing so. The contest would be far more interesting if it added a reasonable time/RAM restriction (e.g. 10 minutes and 500MB of RAM).
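
For reference, the gzip-class baseline being compared against here is easy to reproduce with the codecs that ship with Python. The sketch below is only that yardstick, not anything related to the winning entry; it assumes the 100MB enwik8 file has already been downloaded into the working directory, and exact sizes and timings will vary by machine.

# Measure how small Python's built-in general-purpose codecs get enwik8,
# and how long they take. Sizes and times are machine-dependent.
import bz2, gzip, lzma, time

with open("enwik8", "rb") as f:
    data = f.read()

for name, compress in [
    ("gzip -9",  lambda d: gzip.compress(d, compresslevel=9)),
    ("bzip2 -9", lambda d: bz2.compress(d, compresslevel=9)),
    ("xz/LZMA",  lambda d: lzma.compress(d, preset=9)),
]:
    start = time.time()
    out = compress(data)
    print(f"{name:>8}: {len(out):>10,} bytes in {time.time() - start:6.1f} s")

Even the strongest of these (LZMA, roughly what 7-Zip uses) lands in the mid-20MB range on this file, per the benchmark table further down the thread, and that gap is what's being weighed against 5 hours and 900MB of RAM.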

Re:Seems like a strange contest (0)

Anonymous Coward | more than 7 years ago | (#16638076)

I don't think efficiency of computational resources is the point of this contest. As I understand it, the idea is that maximal, ideal compression could only be gotten by actually *understanding* the stuff being compressed. So an algorithm better in terms of end size than any before presumably understands the Wikipedia text better than the previous algorithms did. Rinse & repeat, and hopefully you'll have an algorithm that will be closer to AI.

Re:Seems like a strange contest (1)

Skuto (171945) | more than 7 years ago | (#16639133)

Well, there *is* a restriction on the amount of time/CPU. Just read the website.

Whatever is "reasonable" is just personal preference.

Re:Seems like a strange contest (1)

BoberFett (127537) | more than 7 years ago | (#16639273)

If the purpose of compressing information such as Wikipedia is for distribution for something such as the OLPC, compression time is largely irrelevant. Compression is done once, on one system, and the decompression is usually fairly fast.

Re:Seems like a strange contest (3, Interesting)

Raul654 (453029) | more than 7 years ago | (#16637722)

There aren't many Wikipedia-related things I don't know about (I'm a Wikipedia administrator, arbitrator, and bureaucrat, featured article director, and I'm on the Wikimedia Foundation's Press Committee), and this is the first time I've ever heard of this contest. I think it's fair to say it's been relatively low-profile.

Re:Seems like a strange contest (0, Redundant)

QuantumG (50515) | more than 7 years ago | (#16638134)

It was on Slashdot, how low profile is that?

Re:Seems like a strange contest (1, Insightful)

Anonymous Coward | more than 7 years ago | (#16638317)

There aren't many Wikipedia-related things I don't know about (I'm a Wikipedia administrator, arbitrator, and bureaucrat, featured article director, and I'm on the Wikimedia Foundation's Press Committee), and this is the first time I've ever heard of this contest.

Cut the rank pulling. Almost half a million people have a lower Slashdot ID than you, thousands of them with much more important functions than you, but they don't see a need to brag about their positions every time they're given half a chance.

I think it's fair to say it's been relatively low-profile.

Unlike you, you mean?

Re:Seems like a strange contest (1)

Ninjaesque One (902204) | more than 7 years ago | (#16639005)

That was the best post to post as AC, wasn't it?

Anyways, the guy was talking about Wikipedia. Compare 'im to Wikipedian admins:

http://en.wikipedia.org/wiki/Wikipedia:List_of_administrators [wikipedia.org]
http://en.wikipedia.org/wiki/User:Crzrussian [wikipedia.org]

Just inserted a random admin there. Look at the long list of barnstars, like they were friggin' medals. Look at all that idiosyncratic stuff; userboxes innumerable, a random cabal-thing that I've certainly never heard of, and so on. Now, click the "administrators" category, then go to any other user-page in the list. You will see the same damn thing.

Re:Seems like a strange contest (1)

RealGrouchy (943109) | more than 7 years ago | (#16638761)

There aren't many Wikipedia related things I don't know about ... and this is the first time I've ever heard of this contest.

That's because the contest has nothing to do with Wikipedia; it just uses Wikipedia's content as a standard block of compressible data.

Just like how the numbers in a Sudoku puzzle have nothing to do with math; they're simply a set of nine characters/symbols.

- RG>

Re:Seems like a strange contest (1)

Raul654 (453029) | more than 7 years ago | (#16638901)

Why use Wikipedia then? There must be dozens (if not hundreds or thousands) of free sources of regular text out there.

Re:Seems like a strange contest (1)

QuantumFTL (197300) | more than 7 years ago | (#16637750)

It turns out that Alexander Ratushnyak was the only person to even meet the challenge's guidelines

Wow, that makes me really wish I had entered. There are some great multipass lossless text compression systems that would work well for Wikipedia...

Re:Seems like a strange contest (0)

Anonymous Coward | more than 7 years ago | (#16637794)

rm -r comes to mind.

Re:Seems like a strange contest (1)

muszek (882567) | more than 7 years ago | (#16637848)

but rm is lossy. You'd end up as one of those who didn't meet the guidelines.

Re:Seems like a strange contest (0)

Anonymous Coward | more than 7 years ago | (#16638022)

rm'ing wikipedia would be like subtracting a negative number. So its ‘loss’ would really be a net gain.

Re:Seems like a strange contest (1)

Breakfast Pants (323698) | more than 7 years ago | (#16638549)

And conversely, modding up your post would be like adding a negative number, so it would be a net loss.

Re:Seems like a strange contest (1)

gerbalblaste (882682) | more than 7 years ago | (#16637874)

the contest is still open and will likely be for a long time.

Re:Seems like a strange contest (1)

dch24 (904899) | more than 7 years ago | (#16638793)

Does anyone know if WinRK [wikipedia.org] has been submitted? It's probably pretty slow, and may go over the 10 hour time limit.

Re:Seems like a strange contest (1)

Kris_J (10111) | more than 7 years ago | (#16638911)

WinRK uses PAQ technology. Since the winner in this story used a tweaked PAQ engine, I doubt that WinRK would perform better. Better than the WinZip, et al, examples people have been posting, but not better than the winning entry.

Re:Seems like a strange contest (1)

Shados (741919) | more than 7 years ago | (#16637772)

Keep in mind that the rule was that the file had to be a self-extracting executable. If you just did default gzipping, you'd have to include the overhead of a self-extracting executable in the total.

Re:Seems like a strange contest (1)

gameforge (965493) | more than 7 years ago | (#16637912)

The overhead for both compression and decompression on my box anyway looks to be about 50k (the size of my /bin/gzip).

No clue whether it requires more code to compress than to decompress, but I would guess your overhead would be less than 30k. On an order of maybe 20MB of compressed data, that's not much.

GZIP is a bad example because it also has to handle arbitrary binary data; it's more general-purpose than this job needs. If you only have to worry about encountering 128 possible values, compression can get really interesting, and there are a lot more algorithms out there, though I'm too lazy to find some examples.

In this case, since the data looks to be fixed (this particular 100MB of text, not just any 100MB of text), it seems to me that you could really optimize the bejebus out of it - perhaps there's a large abundance of a particular word or combination of words, etc. If I were going to put some time into this, I'd run some heuristic-type analysis to find out what's unique about this specific piece of data that I could take advantage of.
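
The sort of first-pass heuristics described above take only a few lines; the sketch below is an illustration (not the contest approach) that just counts which byte values and which tokens dominate the file, assuming enwik8 is sitting in the working directory.

# Quick look at what this specific file is made of: which byte values
# appear at all, and which tokens are most common. enwik8 is XML-wrapped
# wiki markup, so tag names and markup tokens will rank highly too.
from collections import Counter

with open("enwik8", "rb") as f:
    data = f.read()

byte_counts = Counter(data)
print("distinct byte values used:", len(byte_counts))
print("10 most common bytes:", byte_counts.most_common(10))

token_counts = Counter(data.split())
print("10 most common tokens:", token_counts.most_common(10))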

Unfortunately I am not a compression guru, so I can't really say any of this with certainty... just some thoughts.

Re:Seems like a strange contest (1)

Firehed (942385) | more than 7 years ago | (#16638309)

The only problem is that if your heuristics are designed for this specific chunk of data, it'll end up being largely useless for a full database crunching. Mind you, they could still be useful, but you don't want them to be sample-specific when you're talking about orders of magnitude more data that will need to get compressed.

I really don't know jack about how compression is done, but I can toss out an idea. Scan article text for all of the used characters... expect to find your A-Z a-z 0-9 and some punctuation - with luck, you'll still end up a decent way under 128 characters. Then scan through the articles for the (128-#) most frequently used words, and assign each of those words a 7-bit value. Or, depending on the size of the article, use all 256 characters and give yourself a solid 150 words to do the same thing, only with an 8-bit value. Then just do some sort of find-and-replace, turning all of the "the"s in an article to ç, "and"s to ü, and "then"s to ß. Or whatever (ideally, you'd find full phrases rather than just individual words; better yet, they'd capture anything in the 128-255 charset and eliminate the need for 8-bit encoding). Do some funky algorithm so that if you have an unusually large number of longer words (which would be reasonably likely in the more scientific articles), it'll check if replacing the dozen occurrences of "subwoofer" is more space-saving than the hundred uses of "of".

Like I said, I'm no expert, but I don't think the above idea would be too hard to implement (if you want to try, go ahead; I'd appreciate 10% and a mention if you win a prize for it). So long as you have some sort of dynamic heuristics, if that'd be the correct term, I think it could be really efficient. Though, in honesty, I'd expect that's how it already works, at least to a limited extent.
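
For what it's worth, a toy version of that substitution scheme fits in a few lines. The sketch below is illustration only, and nothing like the winning entry: it maps the most frequent words onto byte values that never occur in the text and then lets a general-purpose compressor finish the job. It assumes enwik8 is in the working directory; in the contest, whatever codebook or decoder you needed would count toward the total size.

# Map the most frequent words onto byte values unused by the text, then
# gzip the result. Word handling is crude; real dictionary preprocessors
# are far more careful about case, phrases, and markup.
from collections import Counter
import gzip, re

def build_codebook(text: bytes, max_words: int) -> dict:
    unused = [bytes([b]) for b in range(256) if bytes([b]) not in text]
    words = Counter(re.findall(rb"[A-Za-z]{3,}", text))
    top = [w for w, _ in words.most_common(min(max_words, len(unused)))]
    return dict(zip(top, unused))

def substitute(text: bytes, codebook: dict) -> bytes:
    # Each code byte never occurs in the original, so the step is reversible.
    for word, code in codebook.items():
        text = re.sub(rb"\b" + re.escape(word) + rb"\b", code, text)
    return text

with open("enwik8", "rb") as f:
    data = f.read()
book = build_codebook(data, 64)
print("plain gzip:   ", len(gzip.compress(data)))
print("subst + gzip: ", len(gzip.compress(substitute(data, book))))

As the reply below notes, Huffman coding is the principled version of the same instinct: instead of a flat one-byte code for a favored few words, every symbol gets a code whose length tracks its frequency.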

Re:Seems like a strange contest (1)

tabrisnet (722816) | more than 7 years ago | (#16638677)

Look up Huffman encoding. It's been done.

Re:Seems like a strange contest (1)

Kris_J (10111) | more than 7 years ago | (#16638925)

My Javascript Packer does that. [krisjohn.net] To actually compete with the recent winner you need to start exploring far more complicated stuff.

Re:Seems like a strange contest (1)

complete loony (663508) | more than 7 years ago | (#16637954)

Finding a way to compress just a little bit more becomes an increasingly harder problem. You begin to hit entropy limits, and have to use methods that are only relevant to the type of data being compressed. For example, a good lossless audio compressor won't handle text data very well, and vice versa.

Most compressors in this space use a prediction algorithm to generate a signal, and then only compress the differences between the generated signal and the actual data. To get the best possible compression, you have to build a system that has a perfect knowledge of the information being compressed and can then derive the source data again. Hence good compression is thought to be a good test of AI-type systems.

20% more compression may not seem like much on the surface. But this is a very difficult problem.
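
To make that predict-then-encode-the-residual shape concrete, here is a deliberately tiny sketch: an adaptive order-1 byte model whose residuals are fed to gzip. It assumes enwik8 is present locally and is only meant to show the pipeline; whether this toy actually beats plain gzip on text is beside the point, since real contenders (PAQ and friends) mix far richer models and drive an arithmetic coder with predicted probabilities rather than a single hard guess.

# Predict each byte as "the byte that has most often followed the previous
# byte so far" and emit the difference. Correct guesses become zeros, so
# the residual stream is heavily skewed toward zero. The model updates only
# on bytes already seen, so a decoder running the same model can invert it.
import gzip
from collections import defaultdict

def order1_residuals(data: bytes) -> bytes:
    counts = defaultdict(lambda: [0] * 256)   # counts[prev][next]
    best = defaultdict(int)                   # current best guess per context
    out = bytearray()
    prev = 0
    for b in data:
        out.append((b - best[prev]) % 256)    # 0 whenever the guess was right
        row = counts[prev]
        row[b] += 1
        if row[b] > row[best[prev]]:
            best[prev] = b
        prev = b
    return bytes(out)

with open("enwik8", "rb") as f:
    data = f.read(10_000_000)                 # first 10MB keeps the demo quick

print("gzip(raw):      ", len(gzip.compress(data)))
print("gzip(residuals):", len(gzip.compress(order1_residuals(data))))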

Re:Seems like a strange contest (2, Informative)

CryoPenguin (242131) | more than 7 years ago | (#16638142)

Since the criteria for entry say that any new submission must beat the current record, it's no surprise that only 3 people are listed. You're not seeing any of the people who didn't win.

Lossless from Lossy? (2, Funny)

Anonymous Coward | more than 7 years ago | (#16637664)

"The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia"

Wikipedia? Knowledge? Isn't that already a lossy compression mechanism?

Re:Lossless from Lossy? (1)

ClamIAm (926466) | more than 7 years ago | (#16638837)

I find that most Wikipedia articles suffer from the reverse: poor writing style [accd.edu] and trivia [wikipedia.org] make them much larger than needed. Concise, cruftless articles would probably be a better way of reducing the disk/archive needs of Wikipedia.

f%t! (-1, Offtopic)

Anonymous Coward | more than 7 years ago | (#16637674)

f%t!

mods... (0)

Anonymous Coward | more than 7 years ago | (#16638222)

..that isn't offtopic at all, it is a rather cute and harmless example of a compressed first post that is actually ON topic. And, no I am not the submitter. Now normally I never chastise here, nor pay much attention to the frosty pissers, but c'mon, cut the dude some slack, it's a +1 funny at least.

Hmm... (4, Funny)

Cinder6 (894572) | more than 7 years ago | (#16637732)

Am I the only one who finds it slightly ironic that (as of this writing), there is no entry for the Hutter Prize on Wikipedia?

Re:Hmm... (1)

iamwoodyjones (562550) | more than 7 years ago | (#16637870)

Give me a few minutes...;-)

Re:Hmm... (1)

megaditto (982598) | more than 7 years ago | (#16638198)

Stand by for a picture of a giant phallus.

Seriously though, I respect the guy for giving up half his prize to the other algorithm developer.

Re:Hmm... (1)

Evil Pete (73279) | more than 7 years ago | (#16637880)

But there is for the PAQ algorithm (see link in summary) with mention of the awarding of the Hutter prize.

Re:Hmm... (2, Informative)

CastrTroy (595695) | more than 7 years ago | (#16637898)

While this is not article on the Hutter Prize itself, you will be relieved to know that it is mentioned in the article on Marcus Hutter [wikipedia.org]

Re:Hmm... (1)

Reikk (534266) | more than 7 years ago | (#16638056)

It WAS there, but someone accidently used lossy compression on it.

For comparison .... (3, Informative)

Ignorant Aardvark (632408) | more than 7 years ago | (#16637738)

For comparison purposes, WinRAR on its best setting only gets this down to 24MB. Doubtless 7zip could get even lower, but I don't think either could crack the 17MB mark. And certainly neither of those would be self-extracting, which this contest requires.

Re:For comparison .... (1)

Ksempac (934247) | more than 7 years ago | (#16637946)

If you read a bit further than the main page you find this :

Relaxations
* In lieu of a self-extracting archive, a decompressor program decomp8.exe plus a compressed file archive8.bhm may be published, where decomp8.exe produces data8 from archive8.bhm. In this case, the total size is S := length(decomp8.exe)+length(archive8.bhm).


7zip is less than 0.9MB, and I guess you could strip it down to less than that by removing a lot of features you don't need here (GUI, support for other formats). So that's not a big overhead, and you're free of the "self-extracting" rule.

Re:For comparison .... (1)

RuBLed (995686) | more than 7 years ago | (#16638070)

and surely you could do the compression in less than 5 hours...

Wikipedia-specific compression algorithms (1)

billstewart (78916) | more than 7 years ago | (#16639159)

  • 1. Itemize the articles in the Wikipedia extract.
  • 2. Edit them down to stubs.
  • 3. Compress File
  • 4. ...
  • 5. Profit!
  • 6. ...
  • 7. Other Wikipedia authors replace and augment the stubs.

I want (0)

Anonymous Coward | more than 7 years ago | (#16637786)

a Lossless Compression of the Internet

makes one wonder (1)

v1 (525388) | more than 7 years ago | (#16637816)

How slow this code is. Usually when you are trying to squeeze out another 1% of space in a file, your algorithm triples in size and quadruples in runtime to get that improvement. I wonder how practical this new algorithm really is. Probably not so much an issue nowadays, but I also wonder how big the compression program is. This used to be a really big deal many years ago, when the compression programs could easily be larger than the data they were compressing.

Re:makes one wonder (1)

CastrTroy (595695) | more than 7 years ago | (#16637942)

According to somebody who posted above (he probably read it in the article, but I'm not about to look there), the program took 5 hours and 900MB of RAM. I find it interesting that he only got down to 17MB while consuming all those resources. It makes it seem like it isn't even worth it. I'm sure you could find some pretty small program that would compute pi to some very large number of digits and find the Wikipedia excerpt in the results, but it would take a very long time to run.

Re:makes one wonder (2, Informative)

AxelBoldt (1490) | more than 7 years ago | (#16638266)

I'm sure you could find some pretty small program that would compute pi to some very large number of digits, and find the wikipedia excerpt in the results, but It would take a very long time to run.
In all likelihood, that trick won't work: your program would need to know the start index of the Wikipedia text in the digit sequence of pi, and that index would be so astronomically huge that writing it down would be about as long as writing down Wikipedia itself. As a little example of the effect, think about finding the first occurrence of '0000' in the digit sequence of pi, and ask yourself how far out that is likely to happen.
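
The arithmetic behind that argument is short enough to write out. The sketch below assumes (as the next comment points out, without proof) that pi's digits behave like independent uniform random digits; the actual position of '0000', 13,390, is given further down the thread.

# Expected position of a specific k-digit string in a random decimal
# stream is on the order of 10**k, so the index you'd have to store is
# roughly as long as the text you were hoping not to store.
import math

print(10 ** 4)      # '0000': expect it within ~10,000 digits (it actually shows up at 13,390)

wikipedia_bits = 100_000_000 * 8
k = math.ceil(wikipedia_bits * math.log10(2))   # decimal digits needed to spell out 100MB
print(k)            # ~240 million digits; the index into pi would itself take about this long to write down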

Re:makes one wonder (4, Funny)

Henry V .009 (518000) | more than 7 years ago | (#16638307)

Assuming a random distribution of digits in pi (and why does everyone assume this? -- there is certainly no proof), the odds are that you'll see one sequence of 4 zeros every 10,000 (decimal) digits. On the other hand, you'd expect to see 4 zero bits once every 16 binary digits. Wikipedia, at 100MB, would be expected to be found once every 2^800,000,000 binary digits of pi.

Then again, maybe we're lucky, and this universe is God's way of storing the universal encyclopedia in the digits of pi. Wikipedia might be up near the front somewhere.

Re:makes one wonder (1)

arrrrg (902404) | more than 7 years ago | (#16638733)

Tangentially related and mildly amusing trivia:

Feynman Point [wikipedia.org]. From the Wikipedia article: "The Feynman point comprises the 762nd through 767th decimal places of pi ... The name refers to a remark made by the physicist Richard Feynman, expressing a wish to memorise the digits of pi as far as that point so that when reciting them, he would be able to end with '... nine, nine, nine, nine, nine, nine, and so on.'"

So, I dunno about "0000," but for some interesting sequences we may get lucky :).

Re:makes one wonder (1)

NoGuffCheck (746638) | more than 7 years ago | (#16638861)

think about finding the first occurrence of '0000' in the digit sequence of pi, and ask yourself how far out that is likely to happen

ahh, but pi is exactly 3!

Re:makes one wonder (1)

Spikeles (972972) | more than 7 years ago | (#16639147)

think about finding the first occurrence of '0000' in the digit sequence of pi, and ask yourself how far out that is likely to happen.
That's easy... The string 0000 occurs at position 13,390 counting from the first digit after the decimal point. The 3. is not counted. http://www.angio.net/pi/piquery [angio.net]

Re:makes one wonder (1)

mh101 (620659) | more than 7 years ago | (#16637950)

FTFA:
Restrictions: Must run in less than 10 hours on a 2GHz P4 with 1GB RAM and 10GB free HD.
If someone's program takes close to 10 hours to compress 100MB it better get considerably more than just an additional 1% out of it!

Re:makes one wonder (1)

arth1 (260657) | more than 7 years ago | (#16639073)

If someone's program takes close to 10 hours to compress 100MB it better get considerably more than just an additional 1% out of it!

Indeed. Who in their right mind would spend 10 hours compressing instead of simply moving the entire dataset in a fraction of the time?

Oh, and for a good compression scheme:

uuencode filename filename | mail external@stora.ge

Then all the decompresser needs to do is fetch the mail and uudecode it.

Regards,
--
*Art

Re:makes one wonder (2, Insightful)

Alaria Phrozen (975601) | more than 7 years ago | (#16637978)

If you'd RTFA you'd find the running times ranged from 30 minutes to 5 hours. They have a whole table and everything.



The whole point of the challenge was to create a self-executing compression program that made a perfect copy of their 100MB file. Final file sizes were in the 16MB range. Geeze, seriously RTFA.

Re:makes one wonder (1)

infaustus (936456) | more than 7 years ago | (#16638006)

If you'd rtfo, you'd know that it took 5 hours. But point taken.

Lookup table? (0, Interesting)

Anonymous Coward | more than 7 years ago | (#16637974)

I wonder if you could just store all the common english words into a lookup table indexed by a number (32 bit enough?). So every word then gets replaced by a number which would then be compressed further. Only catch is the client needs a copy of the lookup table which might be feasible.
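
A toy version of that idea, with the contest-relevant catch built in: if the dictionary is built from the text itself (rather than shipped as a fixed English word list), it still has to be reconstructible by the decompressor, whose size counts toward the total. The sketch below is illustration only; non-word characters are kept verbatim so the transform stays lossless.

# Replace each word with an index into a table built on the fly.
# A real codec would emit the indices as variable-length integers and
# would have to ship (or rebuild) the table for decoding.
import re

def encode_words(text: str):
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]+", text)   # words and the gaps between them
    table, encoded = {}, []
    for tok in tokens:
        if tok.isalpha():
            encoded.append(table.setdefault(tok, len(table)))
        else:
            encoded.append(tok)                           # keep punctuation/whitespace verbatim
    return encoded, table

encoded, table = encode_words("the cat sat on the mat, the end")
print(len(table), "distinct words ->", encoded)
# Decoding just walks the stream and maps indices back through the table.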

Re:Lookup table? (1)

LordEd (840443) | more than 7 years ago | (#16639245)

It would not be feasible in this competition, because either the archive must be self-extracting or the decompression app's size (including any lookup table it needs) is counted in the final result.

A full 32-bit lookup table would immediately cause an overhead of over 20 GB (assuming an average of 5 letters per word, which is probably too small for the number of words 32 bits can index).

compress knowledge = intelligence (2, Insightful)

Profound (50789) | more than 7 years ago | (#16637998)

Being able to compress knowledge well is believed to be related to acting intelligently. - IHNPTTT (I Have Not Passed The Turing Test), but my brain is good at remembering the gist of knowledge and really bad at losslessly recalling it.

Re:compress knowledge = intelligence (1)

illuminatedwax (537131) | more than 7 years ago | (#16638603)

intelligence != acting like a human

Re:compress knowledge = intelligence (1)

Kris_J (10111) | more than 7 years ago | (#16638939)

intelligence != acting like a human
Truer words were never typed.

Re:compress knowledge = intelligence (1)

qbwiz (87077) | more than 7 years ago | (#16638949)

Intelligence != being able to store Wikipedia in a small space. In fact, according to Wikipedia, programs are already able to predict what comes next in English text [wikipedia.org] approximately as well as humans, but they aren't as intelligent as humans.

If we want to make something less intelligent more intelligent, it seems likely that at some point, when its intelligence is in some sense the same as the average human's, its behavior would be more similar to a human's than it is now. Of course, that's hardly a certainty.

Error in Wikipedia (1)

Baldrson (78598) | more than 7 years ago | (#16639229)

A more precise Wikipedia link is to the erroneous section on Entropy as information content [wikipedia.org] which reads:
Experiments with human predictors show an information rate of between 1.1 and 1.6 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression ratio of 1.5 bits per character.
The range for Shannon's experiments is actually between 0.6 and 1.3 bits per character.

This error even caught Dr. Dobb's compression expert Mark Nelson [marknelson.us], so I guess it isn't too surprising it caught the Wikipedia folks.

The beauty of this sort of error in Wikipedia is it demonstrates just how even erroneous human knowledge can result in greater human knowledge if it is compressed:

Imagine, if you will, a perfectly compressed Wikipedia (despite the fact that an optimum compression is not generally computable). This perfectly compressed Wikipedia would generally "hang together" with a great many pieces cross-checking coherently with other pieces and hence be represented with common (compressed) codes. But then, some pieces would stand out as not cross-checking. They would not "hang together" with other pieces and would require specific codes to represent them. To a Sherlock Holmes kind of mind -- the mind of a scientist -- the mind of a ruthless epistemologist -- these codes would represent "suspects" of the crime of lies or error. The idea that compressors like PPM are just as able to compress text as humans (given enough time) would not hang together -- it would not be predicted from the more general body of human knowledge and would therefore become suspect due to its relative incompressibility.

Compress knowledge != intelligence (0)

Moraelin (679338) | more than 7 years ago | (#16639343)

I can think of lots of stuff which would realistically count towards intelligence, but "compress knowledge" is the kind of thing that just sounds unbelievably stupid. In fact, it sounds like the kind of idiotic publicity stunt coming from a marketing guy who never heard of AI or programming.

The things you'd realistically need, and probably use every day unconsciously, include:

- good indexing: Just having the information somewhere in your head, well compressed is pointless if you can't quickly find it.

E.g., when you see a faucet, you need to immediately access such information as "turning it this way makes water come out, turning it the other way stops the water from coming out." You need to start from "faucet" and basically search for the information you need relating to "faucet". That's an easy example, but even trivial daily activities like driving your car often involve navigating a whole graph of associations, and if, say, a kid jumped in front of your car, you better not need hours to uncompress the data and find a solution.

- good "tokenization" so to speak. Think every word or concept being not a sequence of letters, but a pointer to that very concept. Think Wikipedia's hyperlinks. It may resemble compression on a superficial level, but that's like saying that a bird resembles the space shuttle on a superficial level. The goal here isn't just to minimize size by agnostically matching any chunk of text that came before, but to create a graph of interdependencies that you can navigate to find a solution.

E.g., again, if you started with "faucet" and went through "turn it clockwise", you need to immediately be able to reference "turn" and "clockwise" without sitting and sifting through a zip archive. And for that you need the real tokens, not any old chunk of matching text for compression reasons. A match for "turn" is good. A match for "rn i" between "turn it clockwise" and a previous occurrence of "a barn in the backyard", as a compression algorithm would match, is utterly useless.

- good filtering. Yes, the brain is sorta lossy at all stages, because it mercilessly filters out everything that doesn't look relevant or in the focus of attention. E.g., when looking at the street, you might just see "it's a car", because the details about it don't look relevant at the moment. E.g., when focusing on the car for more details, you might completely miss the guy in a pink gorilla suit doing cartwheels in the background, because it looks irrelevant to your current focus of attention.

This is more important than it sounds, and is the _real_ compression for the most part when processing that data. The brain isn't just compressing the whole scene to save bandwidth, it's really mercilessly clipping 90% of it out, tokenizing the rest, and again clipping out most of the details about those tokens as long as they're not explicitly required. It's not just compressing the image of a car coming at you, it's transforming it into the "tokens" (so to speak) "[car] [incoming]". Anything else from that scene (e.g., the guy in a pink gorilla suit across the road) isn't compressed, it's filtered out, and so are the details about those tokens. (E.g., car model, colour, registration number, etc.) You can choose to pay attention to some of them (e.g., look at the registration number or the driver's face), and that will change the filtering criteria to give you those details at the expense of something else.

Even what you mention ("my brain is good at remembering the gist of knowledge and really bad at losslessly recalling it") is just an example of such filtering. A lot of the details were just filtered out before you even memorized that (e.g., you might have gotten the idea that a giraffe has a long neck, but not its colour, because it didn't seem important at the moment), and a lot just weren't stored or indexed or given a high priority. If it didn't look useful to remember what colour a giraffe is, it didn't get stored. Again, it's filtering, not dumb stream compression.

And so on.

And missing all that in favour of a "let's see who compresses the best" stunt seems just stupid and clueless.
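
For whatever it adds, the tokenization point above is easy to illustrate: an index from word tokens to positions is something you can navigate, whereas the byte-level fragments an LZ-style compressor matches exist only to save space. A minimal sketch, purely for illustration:

# Word-token index: meaningful units map to the places where they occur.
from collections import defaultdict
import re

def token_index(text: str):
    index = defaultdict(list)
    for m in re.finditer(r"[A-Za-z]+", text.lower()):
        index[m.group()].append(m.start())
    return index

idx = token_index("Turn the faucet clockwise. A barn in the backyard. Turn it back.")
print(idx["turn"])      # positions of the token "turn"
print("rn i" in idx)    # the compressor-style fragment isn't a token at all -> False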

Intelligence revisited ... (1)

foobsr (693224) | more than 7 years ago | (#16638024)

From the contest specs:
thus reducing the slippery concept of intelligence to hard file size numbers

I doubt the instance which wrote that was cynical.

CC.

Re:Intelligence revisited ... (1)

Skuto (171945) | more than 7 years ago | (#16639161)

They were certainly not, and they are right.

The contest produces a hard, verifiable result with a hard restriction on resources that you can use to attain it.

If you look at the state of AI research, then you will understand that introducing some cold hard numbers won't hurt.

I think compressing porn would be more useful (0)

Anonymous Coward | more than 7 years ago | (#16638120)

Then, we can use the rest of the leftover space to store the sum of all human knowledge in an uncompressed form.

- Dave

Done before.... (1)

gmuslera (3436) | more than 7 years ago | (#16638188)

... by Douglas Adams. Wonder what compression ratio putting the meaning of life, the universe and everything into just the number 42 works out to.

Re:Done before.... (1)

netsharc (195805) | more than 7 years ago | (#16638264)

There was a small story in the HHG series about aliens who encoded the whole of human knowledge into a number, took a meter ruler, and made a precise mark between 0 and 1. But then they chipped the ruler while taking it home.

Or was it some other book?

Nice idea, but I guess you can only get to a certain precision before reaching the Nanometer limit..

I suppose the Windows equivalent of (1)

WhatDoIKnow (962719) | more than 7 years ago | (#16638190)

wget "http://cs.fit.edu/~mmahoney/compression/enwik8.zi p" | gunzip

doesn't count?

:wq

Anyone notice the judges are white supremacists? (0)

Anonymous Coward | more than 7 years ago | (#16638301)

At least one of them, Jim Bowery. He posts online under the nickname Baldrson and his personal homepage is filled with self published articles about how the jews are responsible for the aids and such.

what matters most... (1)

ElitistWhiner (79961) | more than 7 years ago | (#16638359)

is what you get when you run the new compression backwards!

This reminds me of the cryptographer paid to crypto-compress datastreams into musical notation. The work laid the foundation for real-time packet sniffing of the Internet.

Cryptographer? Got atta boys

hah! compression good, speed, ram.. nope (0)

Anonymous Coward | more than 7 years ago | (#16638387)

it took 5 hours, with 900 megs of ram..

i dont even have half that processing power in my own brain as well as my computer!

A simple solution in 3 steps... (0)

plaxion (98397) | more than 7 years ago | (#16638465)

Step 1: Read in the 100MB file as binary
Step 2: Scrub/remove all the 1's (they're bigger than 0 anyway so they're obviously taking up the most space)
Step 3: Compress the remaining 0's to just one 0 because as we all know from math class 00000000000(etc) == 0

There, you've just converted 100MB of data into a single "0" which is just one byte long! I've got the compressor part down to a perl one-liner, but I'm still working on how to decompress it efficiently because strangely enough my decompresser is ~100MB in size, so it's still kinda alpha at the moment.

not impressed (0)

Anonymous Coward | more than 7 years ago | (#16638515)

some guys made an entire single-player 3D game with Doom 3-like graphics that fits into only like 128k
they have plenty of long 3D-rendered movies that fit in 64k too.
I'm sure they could have won this thing if they had tried.

Compression related to acting intelligently? (1)

christpants (860689) | more than 7 years ago | (#16638613)

How does a computer algorithm for compression relate to the idea that human compression of knowledge corresponds to intelligent action? I think the summary is making a mistake in terminology. An intelligent person is able to compress information and access it quickly, but that doesn't really have much to do with a more efficient compression algorithm, unless we're talking some seriously advanced AI. I just really don't understand the correlation or what the summary is trying to imply... anybody else get it?

Re:Compression related to acting intelligently? (0)

Anonymous Coward | more than 7 years ago | (#16639039)

George Bush submitted an entry, and his algorithm was able to get the excerpt down to 101MB...

Re:Compression related to acting intelligently? (1)

Skuto (171945) | more than 7 years ago | (#16639143)

Read the website. (Oh wait, this is /., nobody does that).

A system that understands the text it is compressing will compress better than one that doesn't.

The winner improved upon previous compression programs by adding semantic modelling. Improving the modeling further would improve the results. You're heading towards language understanding.

Re:Compression related to acting intelligently? (4, Interesting)

adrianmonk (890071) | more than 7 years ago | (#16639191)

How does a computer algorithm for compression relate to the idea that human compression of knowledge corresponds to intelligent action? I think the summary is making a mistake in terminology. An intelligent person is able to compress information and access it quickly, but that doesn't really have much to do with a more efficient compression algorithm, unless we're talking some seriously advanced AI. I just really don't understand the correlation or what the summary is trying to imply...

Nope, I had heard about the contest before seeing it today on Slashdot. The summary is fine. That is in fact the thesis that the contest is designed to investigate.

As for why this is the case, I spent some time studying compression two or three years ago, and one of the things that you quickly realize is that there is no such thing as a truly general-purpose compression algorithm. Compression algorithms work on the principle that there is some underlying pattern or structure to your data. Once the structure is understood, the data can be transformed into a smaller format. Think of compressed files as a set of parameters to a magic function which generates a larger file. The real art is in finding the function.

To give a concrete example, one of the simplest forms of compression is Huffman coding. You analyze your stream of symbols, and you realize that some occur more often than others. You then assign shorter bit strings to more frequently-occurring symbols and longer ones to less frequent symbols. This gives you a net gain. You were able to do this because you had the insight that in human language, some letters (like "e") occur more frequently than others (like "q").
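
A compact sketch of exactly that construction, for anyone who hasn't seen it: build the Huffman tree bottom-up from character frequencies by repeatedly merging the two least frequent groups. This is the textbook algorithm, not anything specific to the contest entries.

# Classic Huffman coding over character frequencies.
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    # Heap entries: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("this is an example of huffman coding")
for sym, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(sym), code)
# Frequent symbols (like the space) get short codes; rare ones get long codes.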

There are, of course, other patterns that can be exploited. You can take the frequency distribution trick above up to another level by noting that the frequency of a symbol is not really independent of the symbol before it. For example, the most likely symbol after a "q" is "u". Sure, you could have other things. But "u" is the most likely. You can exploit that to get better compression.

But of course, you might be compressing something other than English text. There are different techniques for whatever kind of data you're trying to compress. A row of pixels in a faxed (1-bit) image tends to be very close to the same as the previous row. Each pixel is a bit. If you represent a row of pixels not as on or off directly but instead as its bit value XORed with the pixel above it, you end up with 0 bits for every pixel that hasn't changed and 1 bits for every pixel that has. Presto, you have just managed to skew the probability distribution radically towards 0 bits, and you can use further tricks to gain compression from that.
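
The fax trick reads even more simply in code than in prose. A sketch on synthetic 1-bit rows (real fax coding standards are more elaborate, but the skew toward zeros is the point):

# XOR each row with the previous row: after the first row, only pixels
# that changed survive as 1s; reapplying XOR row by row restores the image.
def xor_rows(rows):
    out, prev = [], [0] * len(rows[0])
    for row in rows:
        out.append([a ^ b for a, b in zip(row, prev)])
        prev = row
    return out

rows = [
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],   # identical to the row above
    [0, 0, 1, 1, 1, 1, 0, 0],   # one pixel changed
]
for row in xor_rows(rows):
    print(row)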

The point of all this is that to come up with these tricks, you have to understand something about the data. One of the better definitions I've ever heard for data compression was simply "applied modeling".

Given that, you have to ask a question: have we reached the point today where compression algorithms are as good as they're going to get at compressing English text based on its surface-level characteristics such as character frequencies and repeated strings? Have we exhausted the low-level modeling that is possible? There has been a lot of work on this, so it's possible that we may have. And if so, then any further gains in the compression ratio could very well be the result of some sort of higher-level modeling. Maybe even modeling the knowledge rather than just the language. And that is what this contest is about, as I understand it.

It's not for sure that a winning contest entry necessarily requires the submitter to have developed some kind of machine intelligence. But it's an intriguing enough idea that it might be worthwhile to run the contest just to see if something interesting does happen.

By the way, for some interesting reading on this subject, look up Kolmogorov Complexity. The wikipedia seems to have a pretty decent article [wikipedia.org] on it (although I haven't read the whole thing).

Done on 9/25/06? (2, Interesting)

SillySnake (727102) | more than 7 years ago | (#16638629)

According to http://prize.hutter1.net/ [hutter1.net] this happened on Sept 25 of 2006.

The site also gives some of the requirements..
Create a compressed version (self-extracting archive) of the 100MB file enwik8 of less than 17MB. More precisely:

        * Create a Linux or Windows executable of size S < L := 17'073'018 = previous record.
        * If run, it produces (without input from other sources) a 10^8 byte file that is identical to enwik8.
        * If we can verify your claim, you are eligible for a prize of 50'000×(1-S/L). Minimum claim is 500.
        * Restrictions: Must run in <10 hours on a 2GHz P4 with 1GB RAM and 10GB free HD.
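
The requirements quoted above already list Ratushnyak's 17'073'018 bytes as the record to beat; his own payout was computed against the earlier baseline. That baseline isn't given anywhere in this thread, so the figure below treats it as paq8f's 18,324,887 bytes on enwik8, which is an assumption here, though it makes the arithmetic land on the published 3,416-Euro award.

# Sanity check of the payout formula quoted above. L is assumed to be the
# original paq8f baseline (an assumption, not stated in the thread);
# S is the new record listed on the site.
Z = 50_000                  # prize fund, in euros
L = 18_324_887              # assumed previous record (baseline), in bytes
S = 17_073_018              # winning submission size, in bytes

print(round(Z * (1 - S / L)))        # ~3416, matching the 3,416-Euro award in the summary
print(round(100_000_000 / S, 2))     # ~5.86, the compression factor quoted in the summary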

Re:Done on 9/25/06? (1)

Skuto (171945) | more than 7 years ago | (#16639123)

>According to http://prize.hutter1.net/ [hutter1.net] this happened on Sept 25 of 2006.

There is a waiting period for public comment/verification, etc...

I can do better than that. (0)

Anonymous Coward | more than 7 years ago | (#16638825)

Here's some eleven character lossless compression of the full Wikipedia: Utter crap.

That's nothing (1)

snoggeramus (945056) | more than 7 years ago | (#16638853)

Compressing information is easy. Now .... let's see them compress an AOL chat room!

Interesting related webpage (1)

Skuto (171945) | more than 7 years ago | (#16639169)

http://www.cs.fit.edu/~mmahoney/compression/text.html [fit.edu]

Info about contenders and results of common compression programs on the testset. (All the "just use gzip/rar/winrk/..." fools can stop jabbering now...)

Know what to compress, not how. (1)

fahrbot-bot (874524) | more than 7 years ago | (#16639237)

Being able to compress knowledge well is believed to be related to acting intelligently.

Bah! Knowing what to compress is more intelligenter...

This is the AI problem (1)

LesPaul75 (571752) | more than 7 years ago | (#16639261)

Being able to compress knowledge well is believed to be related to acting intelligently.
There's actually a little more to it than that. The creators of the Hutter prize believe that intelligence is the ability to compress knowledge. That's why they're offering this prize -- to solve the artificial intelligence problem. I'm not saying I buy into it, but that's their claim. That's why they call it "The most crucial technology prize of all." Here [geocities.com] is the page describing the origin of the prize.

some standard compression tools tested (1)

elmartinos (228710) | more than 7 years ago | (#16639357)

I have tested some of the standard compression tools, here are the compressed sizes:

100000000 | original size
  36445241 | gzip --best
  29008758 | bzip2
  28860178 | rzip -9
  24864529 | 7zr -mx=9
  24753293 | rar a -m5 -mdG

7z does not do so well here, I think because it is not tuned as much for text compression; it is much better at compressing binaries. For text compression, PPMD and its variants are quite good, so I guess you would see good results with WinRK, Compressia, PAQ and the like.