
Compress Wikipedia and Win AI Prize

CmdrTaco posted more than 8 years ago | from the what-does-this-mean dept.

Baldrson writes "If you think you can compress a 100M sample of Wikipedia better than paq8f, then you might want to try winning some of a (at present) 50,000 Euro purse. Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge, the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence. The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is to find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids. Matt Mahoney provides a writeup of the rationale for the prize, including a description of the equivalence of compression and general intelligence."
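
Roughly, the prize leans on Solomonoff induction, which AIXI extends to decision making: the prior weight of an observation sequence is dominated by its shortest program, so the best compressor of the data is also (up to constants) its best predictor. A sketch of the standard formulation, with U a universal prefix machine and \ell(p) the length of program p:

% Solomonoff's universal prior: sum over all programs p whose output
% on the universal machine U begins with the observed string x
M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}

The shortest such program dominates the sum, which is the formal version of "Ockham's Razor on steroids".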

324 comments

WikiPedia on iPod! (2, Interesting)

network23 (802733) | more than 8 years ago | (#15899786)


I'd love to be able to have the whole WikiPedia available on my iPod (or cell phone), but without destroying it [sourceforge.net].

info.edu.org [edu.org] - Speedy information and news from the Top 10 educational organisations.

Re:WikiPedia on iPod! (3, Insightful)

Fred Porry (993637) | more than 8 years ago | (#15899831)

Then it would be an encyclopedia, not a wiki; that's another reason I say forget about it. Would be nice though. ;)

Re:WikiPedia on iPod! (3, Insightful)

CastrTroy (595695) | more than 8 years ago | (#15899896)

Well, since it's currently only 1 Gig, you could probably put it on a flash card and read it from a handheld. It wouldn't be an iPod, but it probably wouldn't require destroying a perfectly good piece of equipment either. You could probably even get weekly updates (hopefully as a diff file) to keep your copy in sync with the rest of the internet. Now that I think about it, this would be a really good application. There are lots of times when I'd like to look up something on Wikipedia but am not connected to the internet.

Re:WikiPedia on iPod! (1)

Nutria (679911) | more than 8 years ago | (#15899988)

Well, since it's currently only 1 Gig?

You didn't RTFA, did you?

Re:WikiPedia on iPod! (4, Funny)

Asztal_ (914605) | more than 8 years ago | (#15900044)

Umm... which of the 5 thousand links is the article?

But captain (5, Funny)

Anonymous Coward | more than 8 years ago | (#15899787)

Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge, the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence.

But captain, if we reverse the tachyon inverter drives then we will have insufficient dilithium crystals to traverse the neutrino warp.

Re:But captain (5, Funny)

Anonymous Coward | more than 8 years ago | (#15899836)

You left out the part involving the deflector shield. Remember, the first rule of Star Trek technobabble is to always involve the deflector in some way.

Re:But captain (1)

bcat24 (914105) | more than 8 years ago | (#15899990)

Come on, just reverse the damn polarity already.

Re:But captain (1, Offtopic)

WilliamSChips (793741) | more than 8 years ago | (#15900079)

I thought the first rule of Star Trek was that the redshirts always die.

Win "Al" prize? (1)

MondoMor (262881) | more than 8 years ago | (#15899788)

What if I don't like Al?

Painful to read (3, Insightful)

CuriHP (741480) | more than 8 years ago | (#15899793)

For the love of god, proofread!

Re:Painful to read (2, Insightful)

Threni (635302) | more than 8 years ago | (#15899889)

> For the love of god, proofread!

Yeah, I just read the write-up twice and have no idea if this is an AI contest, something to do with compression, or what. In fact, all I can remember now is the word "incentivize" which is the sort of thing I expect some bullshit salesman at work to say.

Re:Painful to read (1, Interesting)

Anonymous Coward | more than 8 years ago | (#15900006)

Mahoney is a professor at my college. I've taken his classes. He talks just how he writes.

Painful to read - only if you expect simple stuff (0)

Anonymous Coward | more than 8 years ago | (#15899933)

I proofread TFA, and there are no errors that I can see. OK, the grammar is a bit more complex than Dr. Seuss, but /.ers have to learn to deal with English sooner or later.
Besides, it IS an article on compression...

Re:Painful to read (0)

Anonymous Coward | more than 8 years ago | (#15899976)

Agreed - what a dreadful summary. It has senseless grammar, invented words, and nine links. Good job, Taco - you've phoned another one in.

lossy compression (5, Insightful)

RenoRelife (884299) | more than 8 years ago | (#15899798)

Using the same data lossily compressed, with an algorithm that was able to permute data in a way similar to the human mind, seems like it would come closer to real intelligence than lossless compression would.

Re:lossy compression (3, Insightful)

Anonymous Coward | more than 8 years ago | (#15899959)

Funny? That's the most intelligent and insightful remark I've seen here in months, albeit rather naively stated. The human brain is a fuzzy clustering algorithm; that's what neural networks do: they reduce the space of a large data set by eliminating redundancy and mapping only its salient features onto a smaller data set, which in biological systems is the weights for triggering sodium/potassium spikes at a given junction. If such a thing existed, a neural compression algorithm would be capable of immense data reduction. The downside is that, like us humans, it may be unreliable or non-deterministic in retrieving data because of this fuzziness. It would also be able to make "sideways" associations and draw inferences from the data set, which in essence would be a weak form of artificial intelligence. Now give him his +5 Insightful, you silly people.
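
For the curious, here is a minimal sketch of that "squeeze a big data set through a small salient representation" idea, written as a tiny linear autoencoder in Python; the data, layer sizes and training schedule are arbitrary toy choices, not anything from neuroscience or from PAQ.

import numpy as np

# Toy "lossy neural compression": squeeze 64-dimensional inputs through an
# 8-dimensional bottleneck and reconstruct them from that summary.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
X = X @ rng.normal(size=(64, 64)) * 0.1   # give the data redundant structure

W_enc = rng.normal(scale=0.1, size=(64, 8))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(8, 64))   # decoder weights
lr = 0.01

for step in range(5000):
    code = X @ W_enc                  # lossy 8-number summary of each row
    recon = code @ W_dec              # attempt to rebuild the original 64
    err = recon - X
    # gradient descent on mean squared reconstruction error
    W_dec -= lr * code.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

print("reconstruction MSE:", float(np.mean((X @ W_enc @ W_dec - X) ** 2)))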

Re:lossy compression (3, Insightful)

Vo0k (760020) | more than 8 years ago | (#15899974)

That's one piece of it, but not necessarily: the "lossy" nature of human-mind compression can be overcome by "additional checks".

Lossy relational/dictionary-based compression is the base. You hardly ever remember text by the order of its letters or the sound of a voice reading it. You remember the meaning of a sentence, plus optionally some rough pattern (like voice rhythm) for reproducing the exact text by rewording that meaning. So you understand the meaning of the sentence, store it as relations to known meanings (pointers to other entries!), then when recalling it you put it back into words, and for exact citation you try to match the possible wordings against the remembered pattern.

So imagine this: the compressor analyzes a sentence lexically, spots regularities and irregularities, and transforms the sentence into a relational set of tokens that are small and easy to store and unambiguously describe the meaning, but don't contain the exact wording. Then an extra checksum of the sentence text is added.

The decompressor tries to build a sentence that reflects the given idea according to the rules of grammar, picking different synonyms, word orderings and such, and then brute-forces or otherwise matches against the checksum to find which of the generated sentences matches exactly.

Look, the best compressor in the world:
sha1sum /boot/vmlinuz
647fb0def3809a37f001d26abe58e7f900718c46 /boot/vmlinuz

Linux kernel compressed to set: { string: "647fb0def3809a37f001d26abe58e7f900718c46", info: "it's a Linux kernel for i386" }

You just need to re-create a file that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely that any other file recognizable as some kind of Linux kernel also matches. Of course there are countless blocks of data that still match the hash, but very few will follow the ruleset of "ELF kernel executable" structure, which can be checked numerically. So theoretically you could use the hash to rebuild THE kernel just by brute-force creating random files and checking whether they match both the hash and the "general properties of a kernel".

The problem obviously lies in the unrealistic "brute force" part. The set of possible rebuilds of the data must be heavily limited. You can do this with lossy compression that allows only limited "informed guess" results, ones that make sense in the context of a Linux kernel: coding style, compiler optimizations, macros that turn into repeatable binary code. And have the original analysed with the same methods before compression, storing all inconsistencies with the model separately.

So the compressed file would consist of:
- a set of logical tokens describing the meaning of a given piece of data (in relation to the rest of the "knowledge base")
- a set of exceptions (where the logic fails)
- a checksum or other pattern allowing an exact match to be verified.

Most lossy compression schemes simply discard the lost data for good. If you use one that instead allows the lost data to be rebuilt according to a certain limited ruleset ("informed guess" + verification), you'd get lossless compression of comparable efficiency.
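
A toy sketch of that "meaning tokens plus checksum, regenerate by search" scheme, in Python; the synonym table stands in for a real language model and is purely illustrative, and the search is exactly the brute-force step flagged as unrealistic above.

import hashlib
from itertools import product

# Canonical "meaning tokens": each word maps to one representative synonym.
SYNONYMS = {
    "fast": ["fast", "quick", "rapid"],
    "car": ["car", "automobile", "vehicle"],
    "big": ["big", "large", "huge"],
}

def normalise(word):
    for canon, variants in SYNONYMS.items():
        if word in variants:
            return canon
    return word

def compress(sentence):
    # Lossy part: keep only the meaning tokens.  The checksum pins the exact text.
    tokens = [normalise(w) for w in sentence.split()]
    digest = hashlib.sha1(sentence.encode()).hexdigest()
    return tokens, digest

def decompress(tokens, digest):
    # Enumerate every wording consistent with the tokens and keep the one
    # whose checksum matches the original.
    choices = [SYNONYMS.get(t, [t]) for t in tokens]
    for words in product(*choices):
        candidate = " ".join(words)
        if hashlib.sha1(candidate.encode()).hexdigest() == digest:
            return candidate
    return None

original = "the quick automobile is huge"
print(decompress(*compress(original)) == original)   # True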

Re:lossy compression (0)

Anonymous Coward | more than 8 years ago | (#15900085)

> You just need to re-create afile that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely any other file that can be recognized as some kind of Linux kernel and matches.

no. No. NO. There are only 2^128 possible hashes, and the number of possible kernel executables is vastly larger than that. The general idea of your post is interesting, but this part is bullshit.

Re:lossy compression (1)

Lord Kano (13027) | more than 8 years ago | (#15900021)

Using the same data lossily compressed, with an algorithm that was able to permute data in a way similar to the human mind, seems like it would come closer to real intelligence than lossless compression would.

I have an idea: if you can take the algorithm from this program [slashdot.org] and put it into a library, we'd be a huge step closer to an AI that mimics traditional intelligence.

LK

As long as it is Wiki that we are talking about... (3, Funny)

gatkinso (15975) | more than 8 years ago | (#15899799)



There. All of wiki, in 31 bytes.

for those who rtfa (0)

Anonymous Coward | more than 8 years ago | (#15899803)

how about telling us:

a) how big the compressed size was

b) how many bytes was wikipedia before it was compressed

Re:for those who rtfa (2, Informative)

kfg (145172) | more than 8 years ago | (#15899855)

a) how big the compressed size was

18MB

b) how many bytes was wikipedia before it was compressed

A sample of 100MB

Your goal:
.

KFG

wow! (0, Redundant)

g253 (855070) | more than 8 years ago | (#15899804)

It is 1 a.m. here, I just came home, slightly drunk, slightly high, and I just tried to understand this article. It has indeed blown my mind, so to speak.

Re:wow! (1)

XnavxeMiyyep (782119) | more than 8 years ago | (#15899954)

And as we advance in AI, we come closer to making computers just like you!

ALERT:
I do not understand your command due to error 1446486428499 (Drunken stupor)

Can it be "lossy" compression? (1)

RicheB (994570) | more than 8 years ago | (#15899808)

If lossy compression is OK, I can easily squish all of Wikipedia down to 1 bit.

Re:Can it be "lossy" compression? (0)

Anonymous Coward | more than 8 years ago | (#15899832)

No you can't, computers manipulate and store data by the byte.

Re:Can it be "lossy" compression? (4, Funny)

richdun (672214) | more than 8 years ago | (#15899858)

Hmmm... well, in that case, someone go edit the Wikipedia entry on "computers" and allow them to store data at the bit level. Also, I heard somewhere that computers in Africa have tripled in the past six months!

Re:Can it be "lossy" compression? (1)

bcat24 (914105) | more than 8 years ago | (#15900004)

Now that's an elephant of a tale.

Re:Can it be "lossy" compression? (1)

StikyPad (445176) | more than 8 years ago | (#15900069)

That's ELEPHANTS. Computers in ELEPHANTS have tripled in the past 6 months. Get it right.

Re:Can it be "lossy" compression? (0)

Anonymous Coward | more than 8 years ago | (#15900128)

When I was reading Wikipedia this morning, I learned that Digg is Slashdot's Portugal.

Re:Can it be "lossy" compression? (1)

kfg (145172) | more than 8 years ago | (#15899877)

No you can't, computers manipulate and store data by the byte.

That may well be true of your computer, but I've got one right in front of me that manipulates data by the bit.

It's even got several of them.

KFG

Re:Can it be "lossy" compression? (0)

Anonymous Coward | more than 8 years ago | (#15899874)

If lossy compression is OK, I can easily squish all of Wikipedia down to 1 bit.
No, it's not. The deliverable for the competition is a self-extracting archive (SFX) that reproduces the 100 MB test file.

Re:Can it be "lossy" compression? (5, Funny)

Bill Kilgore (914825) | more than 8 years ago | (#15899879)

I have a program that compresses 100M of Wikipedia to one bit with no loss at all. The program is somewhat special-purpose, and at 100,024,076 bytes, a little chunkier than I'd like.

Re:Can it be "lossy" compression? (1)

KiloByte (825081) | more than 8 years ago | (#15899913)

Compared to 104857600, you at least got some compression.

Re:Can it be "lossy" compression? (0)

Anonymous Coward | more than 8 years ago | (#15899958)

Why so? The test file is exactly 10^8 bytes.

Re:Can it be "lossy" compression? (2, Informative)

KiloByte (825081) | more than 8 years ago | (#15900119)

Why so? The test file is exactly 10^8 bytes.
I downloaded the corpus, and indeed, you're right -- it's 10^8 bytes. The article is incorrect, it says 100M where it means 95.3M.

This inconsistency doesn't have any effect on the challenge, though -- that 50kEUR[1] is offered for compressing the given data corpus, not for compressing a string of 100MB.

[1] 1kEUR=1000EUR. 1M EUR=1000000EUR. 1KB=1024B. 1MB=1048576B.
And by the way, what about fixing Slash to finally allow Unicode -- either natively or at least as HTML entities?
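
For anyone checking the units, the 100M/95.3M mismatch is just decimal versus binary megabytes:

# 10^8 bytes in binary megabytes (1 MiB = 2^20 bytes) and decimal megabytes
print(10**8 / 2**20)    # 95.367..., the "95.3M" figure
print(10**8 / 10**6)    # 100.0, the "100M" in the article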

Wrong contest (3, Informative)

Baldrson (78598) | more than 8 years ago | (#15899977)

That's another contest that is useless for the reason you cite.

The contest for the Hutter Prize requires the compressed corpus to be a self-extracting archive -- or failing that to add the size of the compressor to the compressed corpus.

Who'da thunk... (5, Funny)

blueadept1 (844312) | more than 8 years ago | (#15899809)

Man, WinRAR is taking its bloody time. But oh god, when it's done, I'll be rich!

Re:Who'da thunk... (1)

mobby_6kl (668092) | more than 8 years ago | (#15899856)

Damn you and your WinRAR! When is the deadline? WinRK says it needs 3 days 14 hours, you might be finished before then, but I'll surely take the prize... when it's done®

Easy! (1, Funny)

RyuuzakiTetsuya (195424) | more than 8 years ago | (#15899811)

arj

Lossy Compression? (3, Funny)

Millenniumman (924859) | more than 8 years ago | (#15899826)

Convert it to AOL! tis wikpedia, teh fri enpedia . teh bst in da wrld.

Re:Lossy Compression? (2, Insightful)

blueadept1 (844312) | more than 8 years ago | (#15899875)

That is actually an interesting idea. What if you added a layer of compression that converted every possible common acronym, made contractions, etc...

Re:Lossy Compression? (2, Interesting)

larry bagina (561269) | more than 8 years ago | (#15899901)

1) It wouldn't be lossless, and 2) most compression techniques already use a dictionary of commonly used words.

Comparison (2, Informative)

ronkronk (992828) | more than 8 years ago | (#15899828)

There are some amazing compression programs out there; the trouble is they tend to take a while and consume lots of memory. PAQ [fit.edu] gives some impressive results, but the latest benchmark figures [maximumcompression.com] are regularly improving. Let's not forget that compression is not much good unless it is integrated into a usable tool; 7-zip [7-zip.org] seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers [freebsd.org], of which there are many, with a number of older formats still not supported by open source tools.

Re:Comparison (1)

joshier (957448) | more than 8 years ago | (#15899849)

If many of the same words are repeated, why not abbreviate them and get the receiving computer to rebuild the structure where they belong?

for example, "the" "it" "has" will be repeated millions of times, why not just write a script to put X amount of 'the' is put in at x x x position in this paragraph.
Simple.

Re:Comparison (1)

DarkProphet (114727) | more than 8 years ago | (#15899886)

erm... isn't that already a part of how data compression algorithms like ZIP work right now?

Re:Comparison (1)

Breakfast Pants (323698) | more than 8 years ago | (#15899902)

Whoa, you are an inventive genius! Oh wait, that's kinda how nearly all compression works.

It's a big world out there (4, Interesting)

Harmonious Botch (921977) | more than 8 years ago | (#15899835)

"The basic theory...is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program." In a finite discrete environment ( like Shurdlu: put the red cylinder on top of the blue box ) that may be possible. But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.

TFA is a neat idea theoreretically, but it's progeny will never be able to leave the lab.

--
I figured out how to get a second 120-byte sig! Mod me up and I'll tell you how you can have one too.

Re:It's a big world out there (1)

Baldrson (78598) | more than 8 years ago | (#15899864)

But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.

This is precisely the assumption of Hutter's theory.

Chapter 2 of his book "Simplicity & Uncertainty" deals with this in more detail but the link provided does do an adequate job of stating:

The universal algorithmic agent AIXI. AIXI is a universal theory of sequential decision making akin to Solomonoff's celebrated universal theory of induction. Solomonoff derived an optimal way of predicting future data, given previous observations, provided the data is sampled from a computable probability distribution. AIXI extends this approach to an optimal decision making agent embedded in an unknown environment.

Re:It's a big world out there (1)

Baldrson (78598) | more than 8 years ago | (#15899911)

This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.

There is no allowance for lossy compression. The requirement of lossless compression is there for precisely the reason you state.

Re:It's a big world out there (1)

nacturation (646836) | more than 8 years ago | (#15899948)

... but it's progeny will never be able to leave the lab.

"it is progeny"? Damn, I thought we'd fixed that bug. Back to the lab with you!
 

Re:It's a big world out there (1)

Kjella (173770) | more than 8 years ago | (#15899963)

But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.

No, that's just deductive science. I (or we, as a society) haven't tested that every cup, glass and plate in my kitchen (or the world) is affected by gravity, but I'm preeeeeeetty sure they are.

The problem - and the really hard AI problem - is that there is no single "program"; there are in fact several billion independent "programs" running. These "programs" operate in many different ways, and would act differently in the same situation. Without understanding the individual "programs", at best it'll find some "common sense", which would be as useful as the average family having 2.4 children. It'll get completely thrown off track by individuality.

Re:It's a big world out there (4, Funny)

gardyloo (512791) | more than 8 years ago | (#15899973)

TFA is a neat idea theoreretically, but it's progeny will never be able to leave the lab.

      Your use of "TFA" is a good compressional technique, but you could change "it's" to "its" and actually GAIN in meaning while losing a character! You're well on your way...

Re:It's a big world out there (4, Informative)

DrJimbo (594231) | more than 8 years ago | (#15900036)

Harmonious Botch said:
This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.
I encourage you to read E. T. Jaynes' book: Probability Theory: The Logic of Science [amazon.com] . It used to be available on the Web in pdf form before a published version became available.

In it, Jaynes shows that an optimal decision maker shares this same tendency of reinforcing existing belief systems. He even gives examples where new information reinforces the beliefs of optimal observers who have reached opposite conclusions (due to differing initial sets of data). Each observer believes the new data further supports their own view.

Since even an optimal decision maker has this undesirable trait, I don't think the existence of this trait is a good criterion for rejecting decision-making models.

Re:It's a big world out there (0)

Anonymous Coward | more than 8 years ago | (#15900056)

Cyclic reinforcement is redundant.

--
I figured out how to get a sig as an AC. Mod parent down and I'll tell you how!

Re:It's a big world out there (2, Insightful)

Ignis Flatus (689403) | more than 8 years ago | (#15900071)

I think the original premise is wrong. Real-world intelligence is not lossless. The algorithms only have to be right most of the time to be effective. And our intelligence is incredibly redundant. If you want robust AI, you're going to have to accept redundancy and imperfection. The same goes for data transmission: sure, you compress, but then you also add error-correcting codes with a level of redundancy based on the known reliability of the network.
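
A minimal illustration of that redundancy trade-off, using a toy 3x repetition code in Python; real links use Hamming, Reed-Solomon or LDPC codes, but the principle of spending redundancy to buy reliability is the same.

import random

def encode(bits, copies=3):
    # Repeat every bit: pure redundancy, the opposite of compression.
    return [b for b in bits for _ in range(copies)]

def decode(coded, copies=3):
    # Majority vote over each group of received copies.
    return [1 if sum(coded[i:i + copies]) > copies // 2 else 0
            for i in range(0, len(coded), copies)]

def noisy_channel(bits, flip_prob, rng):
    # Flip each bit independently with probability flip_prob.
    return [b ^ (rng.random() < flip_prob) for b in bits]

rng = random.Random(0)
msg = [rng.randint(0, 1) for _ in range(10000)]
received = decode(noisy_channel(encode(msg), flip_prob=0.05, rng=rng))
print(sum(a != b for a, b in zip(msg, received)), "residual errors out of", len(msg))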

lzip! (1)

w00d (91529) | more than 8 years ago | (#15899841)

Just use lzip [stearns.org] ! 100% compression on any data, even if it's already been compressed by another utility! It works fantastically, but you may run into trouble if you try uncompressing the data.

Easy compression rule (0)

Anonymous Coward | more than 8 years ago | (#15899844)

One key to the compression of data is to reduce the set of tokens used so that all sentences can be constructed from the smallest dictionary. That means not inventing new words and saying "incentivize" where "motivate" would do.

Seriously, if the Wiki were written using good English and correct spelling, that would be a major step forward for data reduction.

Re:Easy compression rule (1)

joshier (957448) | more than 8 years ago | (#15899892)

I mentioned this above in a reply too.

What I don't understand is why we can't zip up open source programs to an extreme degree.

For example, the zip archive tool scans the whole code, it finds repeats in the code (1001010), abbreviates them and then indexes them in a separate file within the archived file, then when the other computer begins to extract, it takes on the work of plonking back the repeated code, making the archive tiny.

Yes, it takes a lot more CPU work, but the download is so much smaller.

Re:Easy compression rule (1)

Kadin2048 (468275) | more than 8 years ago | (#15899926)

the zip archive tool scans the whole code, it finds repeats in the code (1001010), abbreviates them and then indexes them in a separate file within the archived file, then when the other computer begins to extract, it takes on the work of plonking back the repeated code, making the archive tiny.

Huh? What do you think the compression software is doing right now? It searches through the file for blocks of similar information, and then replaces those blocks with pointers to an index, where it stores the block itself. The more repetition you have in the data, the more compression you can get out of it. Good algorithms use (I believe) variable block sizes, and reset the indices as they work their way through the file based on what is probably optimal.

Putting an extra layer of compression on top of that, by replacing frequently used blocks of code, or replacing English words with abbreviations and re-inflating them at the opposite end (as someone else in this thread suggested), would only make the data being fed into the "real" compression utility a little harder to compress. It's unlikely that you could do a better job than any reasonable compression program does already with the same data.
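
For reference, here is a bare-bones LZ77-style sketch in Python of the pointer-and-window scheme described above; the token format and window size are illustrative choices, not how zip/DEFLATE actually lays things out on disk.

def lz77_compress(data, window=255):
    """Greedy LZ77: emit (offset, match_length, next_char) triples."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        # look back through the window for the longest match
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data) and length < 255
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_char = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, next_char))
        i += best_len + 1
    return out

def lz77_decompress(tokens):
    text = []
    for off, length, ch in tokens:
        for _ in range(length):
            text.append(text[-off])   # copy from the already-decoded output
        if ch:
            text.append(ch)
    return "".join(text)

s = "the cat sat on the mat, the cat sat on the hat"
tokens = lz77_compress(s)
assert lz77_decompress(tokens) == s
print(len(tokens), "tokens for", len(s), "characters")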

Re:Easy compression rule (1)

LiquidCoooled (634315) | more than 8 years ago | (#15900094)

Since the decimals of pi are essentially an infinite lookup table (there are phone-number lookup programs which find information in pi), couldn't you use the calculation of pi to produce your data tree? [jclement.ca]

The entire 100MB file and everything else should be identifiable by a single location along pi.
mmmmmmmm pi.

Re:Easy compression rule (1)

rm999 (775449) | more than 8 years ago | (#15899936)

Indexing into that extra table takes bits. For example, if you have 1024 different possible abbreviations, you need on average 10 bits (at the least) to index that table for each abbreviation. You could use fewer bits for the more common ones (like Huffman coding), but you are still spending space on the indexing.
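
A quick back-of-the-envelope version of that indexing cost, with the same illustrative numbers:

import math

# 1024 dictionary entries cost about log2(1024) = 10 bits per reference.
print(math.log2(1024))                                   # 10.0
# Replacing "the" (24 bits of raw ASCII) with a 10-bit index saves space;
# replacing a single character (8 bits) with one would not.
print(3 * 8, "bits raw vs", int(math.log2(1024)), "bits as an index")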

Re:Easy compression rule (1, Insightful)

Anonymous Coward | more than 8 years ago | (#15900005)

If you're unfamiliar with the current state of the art in a field that's under intensive study, it's a near guarantee that your new ideas either don't work or are already a very basic part of the current technology. There's really no shortcut to doing the background reading and understanding how things are currently done before setting out to improve them. If you want a starting point in this area of study, here's one [wikipedia.org] ; once you've gotten through that you can start on the remaining 22 years' worth of research.

Re:Easy compression rule (1)

ampathee (682788) | more than 8 years ago | (#15900104)

wikiped doubleplusungood ref compression rewrite fullwise newspeak upsub antefiling

Er, I'm not so sure about this. (3, Interesting)

aiken_d (127097) | more than 8 years ago | (#15899857)

Given that the hypothesis is valid (which is arguable), it seems to me that compressing wikipedia is a fairly useless way of supporting it. It seems like an abstraction error: Wikipedia is *not* a set of rules that predict the observations in it. It's a list of observations, sure, but there's no ruleset involved. Now, someone/thing who can read and parse language can get educated based on the knowledge in wikipedia, but then the intelligence is providing the ruleset, just training itself with the raw data in wiki.

It really seems like one of those mistaking-the-map-for-the-territory errors.

-b

Re:Er, I'm not so sure about this. (1, Offtopic)

Baldrson (78598) | more than 8 years ago | (#15900029)

Wikipedia is a representation of accumulated human knowledge (experience) presented primarily in natural language. The smallest self-extracting archive will necessarily have rules that imply that knowledge and in such a way that the rules of natural language are represented as well.

The distinction between compressed experience and rules is an illusion. Rules _are_ compressed experience in the same sense that "x+y=z" is a compressed representation of the table of all ordered pairs (x,y) of numbers and their sum (z). If one has 2**32 distinct rows in such a table then one can physically represent it in a very small program. Even if one has only a small sample of the rows, one may still quite profitably compress those rows with the program.
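
The rule-versus-table point is easy to make concrete; here is a toy comparison in Python, scaled down from the 2**32 rows mentioned above so it actually runs:

# A lookup table of sums for all 8-bit x and y has 256 * 256 = 65,536 rows;
# the "rule" that regenerates every row is a one-line function.
def add(x, y):
    return x + y

table = {(x, y): x + y for x in range(256) for y in range(256)}
assert all(add(x, y) == z for (x, y), z in table.items())
print(len(table), "stored rows vs. one line of rule")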

Too vague to consider whats best... (0)

Anonymous Coward | more than 8 years ago | (#15899890)

Sure, the common formats such as RAR and bzip2 produce highly compressed files, but they would also be impractical to use in an interactive sense just because the compression and decompression operations are slooooow.

On the other hand, zlib is a commonly used format since it compresses quickly, albeit not as well as other formats. Even quicker is LZO, but again it doesn't compress as well.

I can convert the data to 1 bit. (0, Redundant)

jez9999 (618189) | more than 8 years ago | (#15899891)

Of course, you'll need a ~100MB program to extract it...

Re:I can convert the data to 1 bit. (1)

joshier (957448) | more than 8 years ago | (#15899898)

If you're being serious, then that's brilliant. Where can I download your app?

WHOOSH! (0)

Anonymous Coward | more than 8 years ago | (#15899924)

there it goes!

Re:I can convert the data to 1 bit. (1)

gkhan1 (886823) | more than 8 years ago | (#15899931)

If you read the rules of the contest, they state that a submission has to be a single executable that produces the 100 MB file. It's the size of that decompressor that counts. So no, you couldn't do that.

Solution. (5, Funny)

Funkcikle (630170) | more than 8 years ago | (#15899916)

Removing all the incorrect and inaccurate data from the Wikipedia sample should "compress" it down to at least 20MB.

Then just apply your personal favourite compression utility.

I like lharc, which according to Wikipedia was invented in 1904 as a result of bombarding President Lincoln, who plays Commander Tucker in Star Trek: Enterprise, with neutrinos.

Re:Solution. (0)

Anonymous Coward | more than 8 years ago | (#15899945)

lharc [wikipedia.org] according to the Wikipedia.

got it (1)

syrinx (106469) | more than 8 years ago | (#15899919)

I wrote a script right here [random.org]. It will programmatically generate all of Wikipedia. Eventually.


That's easy (1)

CastrTroy (595695) | more than 8 years ago | (#15899928)

That's easy: all you have to do is run your program on a computer that uses 32-bit bytes. That way you can fit more bits in your bytes and automatically beat the record by a factor of 4.

Well.. doesn't the dictionary make it smaller??? (1)

popo (107611) | more than 8 years ago | (#15899932)

Doesn't the dictionary in PAQ8A/B/C/D result in smaller file sizes if you're talking about a file of 100M or more?

Finally a Positive Slashdot article (1)

theuser22 (992097) | more than 8 years ago | (#15899946)

This is the type of contest we need to see more of.

Incentivize? (5, Funny)

noidentity (188756) | more than 8 years ago | (#15899952)

the intent of which is to incentivize the advancement of AI

Sorry, anything which uses the word "incentivize" does not involve intelligence, natural or artificial.

I'll try: (5, Funny)

dcapel (913969) | more than 8 years ago | (#15899986)

echo "!#/bin/sh\nwget en.wikipedia.org/enwiki/" > archive

Mine wins as it is roughly 40 bytes total. To get your results, you simply need to run the self-extracting archive, and wait. Be warned, it will take a while, but that is the cost of such a great compression scheme!

Re:I'll try: (1)

WaltFrench (165051) | more than 8 years ago | (#15900066)

> There are two kinds of people: 1) those who start arrays with one and 1) those who start them with zero.

New computer math: learn to count to 3. Pascal programmers -- people, almost all of 'em -- easily write things like, VAR a: ARRAY[-128..127] OF REAL; Much harder to screw up array bounds when both lower and upper bounds are explicit, of any countable type and run-time checked.

Re:I'll try: (3, Funny)

MarkRose (820682) | more than 8 years ago | (#15900118)

echo "!#/bin/cat /dev/tty0" > archive

Here's one that's even shorter, but you have to type in the decryption key exactly right.

Good idea (1)

wetfeetl33t (935949) | more than 8 years ago | (#15899999)

If someone could write a program that simulates 1000 virtual monkeys, all typing on typewriters, eventually they would end up with wikipedia. Or maybe they would end up with the equivalent of wikipedia very quickly.

Reductionary Transformation Theory (1)

transami (202700) | more than 8 years ago | (#15900001)

Not to be conceited, but I thought of that over a decade ago. I labeled it Transformation Theory. The theory essentially says that, given an input and a desired output, an explicit mapping can be drawn between the two, and algorithms (and hence AI) derive from applying reductions to that mapping (e.g. compression). I later dubbed it Reductionary Transformation Theory, so as not to confuse it with another meme by the same name.

Re:Reductionary Transformation Theory (0)

Anonymous Coward | more than 8 years ago | (#15900042)

Not to be conceited, but I thought of that over a decade ago. I labeled it Transformation Theory. The theory essentially says that, given an input and a desired output, an explicit mapping can be drawn between the two, and algorithms (and hence AI) derive from applying reductions to that mapping (e.g. compression). I later dubbed it Reductionary Transformation Theory, so as not to confuse it with another meme by the same name.

Since no one has ever heard of you or your theory (and never will), it seems like you've got fair license to name it whatever the hell you want.

Sweet! (0)

Anonymous Coward | more than 8 years ago | (#15900015)

Save Wikipedia millions, win a free t-shirt. (S&H Extra)

Sign me up!

Is lossless really best (2, Interesting)

Anonymous Coward | more than 8 years ago | (#15900020)

I would argue that lossless compression really is not the best measure of intelligence. Humans are inherently lossy in nature. Everything we see, hear, feel, smell, and taste is pared down to its essentials when we understand it. It is this process of discarding irrelevant details and making generalizations that truly is intelligence. If our minds had lossless compression we could regurgitate textbooks, but never be able to apply the knowledge contained within. If we really understand, we can reproduce what we've read, but not verbatim. A better measure of intelligence would be lossy text compression that still retains the knowledge contained within the corpus.

my algorithm (0)

Anonymous Coward | more than 8 years ago | (#15900026)

I can get that sample down to 1 byte. The compression algorithm is so complex, however, that the compiled decompressor binary is about 100M.

Hutter's theory? (1)

Prof.Phreak (584152) | more than 8 years ago | (#15900046)

Sounds quite similar to Solomonoff's universal prediction.

wikipedia (1)

gEvil (beta) (945888) | more than 8 years ago | (#15900060)

Since it's wikipedia, am I allowed to edit the entries before I compress them? ;-)

Good compression != intelligence (1, Insightful)

Anonymous Coward | more than 8 years ago | (#15900062)

The premise of this contest shows a remarkable ignorance of compression theory and technology. True, in theory the more knowledge one has about the world, the better one should be able to compress, but only in the most abstract sense. In practice the vast majority of the bits in any "image" of the world, like a picture or a piece of text, are almost wholly unrelated to high-level features of the world, and thus compression algorithms rightly focus on low-level features of the "image" that are almost wholly separate from the features AI researchers care about.

For example, suppose one wishes to compress two consecutive pictures of a chess tournament. Does knowing the rules of chess help much? Sure, one might use knowledge of chess to deduce a few bits (literally) of information about the second image (what move was made), but when trying to compress two 10-megabit images, who cares? Much better to focus on low-level features such as smoothness, lighting variations, motion flow, et cetera.

Similarly for text and speech. How much does understanding the topic of conversation help? Not much, compared with the knowledge of the most recent several words, which is why all good compression (and prediction) algorithms essentially ignore "understanding" and focus on carefully calculating the mixing of probabilities derived from low-frequency observations.

The winner of a Wiki-compression contest is going to be a variation of an ordinary text compression algorithm, and will use techniques that do not in any obvious way translate to "smart" applications, in the same way that the highly successful N-gram models of speech recognition and machine translation do not have any properties one would normally associate with intelligence.

One can build an "intelligent" compressor, but it's the LAST thing any compression researcher is going to do, because they know high-level intelligence doesn't have many bits of information to provide.

I win (-1, Redundant)

Henry V .009 (518000) | more than 8 years ago | (#15900072)

My program compresses the sample down to nothing, and recreates it from that same nothing. The binary is somewhat large however, approximately 100M in size.

Out of the box... (1)

MerrickStar (981213) | more than 8 years ago | (#15900083)

Print it out... Then it doesn't use any data space, only physical space

Lingusitic Pattern Matching (0)

Anonymous Coward | more than 8 years ago | (#15900087)

I wonder how much linguistic patterns (word inflections, synonyms, etc.) would help in writing a better compression algorithm. There is a free framework one could use [monrai.com] to write sophisticated language patterns that target those found in Wikipedia articles; after all, most Wikipedia articles follow certain conventions (e.g. most start with an is-a clause: "A roguelike is a computer game that borrows some of the elements of another computer game"). The grammar would definitely take time to build, but the results of such a method might be pretty interesting.

Simple... (-1, Redundant)

StikyPad (445176) | more than 8 years ago | (#15900095)

This was solved years ago. You can reduce any amount of data to a binary 1. Simply write a program which replaces all occurrences of 1 with the data in question.

C++ (2, Funny)

The Bungi (221687) | more than 8 years ago | (#15900103)

Interestingly enough, the source code [fit.edu] for the compressor is C++. One would expect the thing to be written in pure C.

A (good) sign of the times, I guess.

I win! (0)

Anonymous Coward | more than 8 years ago | (#15900111)

I can compress a significant amount of Wikipedia's information down to two bytes: 0x42 0x53 (BS)

I win! (0, Redundant)

multimediavt (965608) | more than 8 years ago | (#15900121)

Here's my compression of 100M of Wikipedia: "Always consult more than one source of information" Make check payable to ...

My entry (1)

mbstone (457308) | more than 8 years ago | (#15900129)

echo I'mright!NoI'mright!You'reanasshole!So'syourmother !! >distilled-wiki.txt