Text Compressor 1% Away From AI Threshold

kdawson posted more than 6 years ago | from the second-hutter-prize dept.

Math 442

Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. At 1.319 bits per character, this result makes it likely that the next winner of the Hutter Prize will reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon, and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
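For anyone checking the arithmetic, the 1.319 bits-per-character figure follows directly from the numbers above (a quick sanity check, not part of the official prize scoring):

    # Sanity check of the bits-per-character figure quoted in the summary.
    compressed_bytes = 16_481_655   # archive size, decompression program included
    original_chars = 100_000_000    # first 10^8 bytes of Wikipedia

    print(compressed_bytes * 8 / original_chars)  # ~1.3185 bits per character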

442 comments

I wonder ... (2, Funny)

iknowcss (937215) | more than 6 years ago | (#19810073)

How many bad car analogies, inaccurate law advice, and duplicate stories an AI bot could possibly hold in his head. Imagine what kind of person all of the "knowledge" of Slashdot would create.

The horror.

Re:I wonder ... (3, Funny)

Anonymous Coward | more than 6 years ago | (#19810127)

"How many bad car analogies, inaccurate law advice, and duplicate stories an AI bot could possibly hold in his head. Imagine what kind of person all of the "knowledge" of Slashdot would create."

"The horror."

I've been typing everything I ever knew into Slashdot since the day it started, you insensitive clod!
    -- Cmdr Taco

Re:I wonder ... (0)

Anonymous Coward | more than 6 years ago | (#19810201)

At least 10 Football fields, or 33.4 metric Volkswagen Beetles for European readers.

interesting program name (5, Funny)

digitalderbs (718388) | more than 6 years ago | (#19810097)

paq8hp12. When decompressed, it also serves as the source code for the program.

Re:interesting program name (5, Informative)

OverlordQ (264228) | more than 6 years ago | (#19810139)

Since I know people are going to be asking about the name, might I suggest the wiki article [wikipedia.org] about PAQ compression for the reasons behind the weird naming scheme.

That's cool.. (4, Interesting)

Rorian (88503) | more than 6 years ago | (#19810101)

.. but where can I get this tiny Wiki collection? Will they be using this for their next version of Wikipedia-on-CD? Maybe we can get all of Wiki onto a two-DVD set, at ~1.3 bits/character (minus images of course) - that would be quite cool.

Re:That's cool.. (5, Interesting)

Cato (8296) | more than 6 years ago | (#19810191)

Or more usefully, compress Wikipedia onto a single SD card in my mobile phone (Palm Treo) - with SDHC format cards, it can do 8 GB today.

Compression format would need to make it possible to randomly access pages, of course, and an efficient search index would be needed as well, so it's not quite that simple.

Re:That's cool.. (2, Informative)

RuBLed (995686) | more than 6 years ago | (#19810325)

The question is, does a mobile handheld device have enough processing power to decompress it in a reasonable time?

Re:That's cool.. (4, Funny)

Hal_Porter (817932) | more than 6 years ago | (#19810327)

a text spk version of wiki shud fit in 8gb i think
its only becoz people are such grammar noobs that they need to waste $
dood shud filta to txtspk b4 he compress

Re:That's cool.. (5, Funny)

neonmonk (467567) | more than 6 years ago | (#19810469)

a txt spk vrsion of wiki shd fit 8gb i fink
its only becoz ppl r sch grmmr noobs tat tey nid 2 wste $
dud shd filta 2 txtspk b4 he cmpres

There, fixed that for ya.

Re:That's cool.. (5, Informative)

Kadin2048 (468275) | more than 6 years ago | (#19810275)

Given that it takes something like ~17 hours (based on my rough calculations using the figures on WP) to compress 100MB of data using this algorithm on a reasonably fast computer ... I don't think you'd really want to use it for browsing from CD. No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)

Mobile use is right out too, at least with current-generation equipment.

Looking at the numbers this looks like it's about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource intensive. It's almost as if there's an asymptotic curve as you approach the absolute-minimum theoretical compression ratio, where resources just climb ridiculously.

Maybe the next big challenge should be for someone to achieve compression in a very resource-efficient way; a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...

Re:That's cool.. (5, Informative)

Anonymous Coward | more than 6 years ago | (#19810387)

No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)
When compressing a file, the program has to figure out the best way to represent the data in compressed form before it actually compresses it; when decompressing, all it has to do is put the data back together according to the method the program previously picked.

This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques; e.g., compressing a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing it in real time.

Re:That's cool.. (2, Interesting)

Solder Fumes (797270) | more than 6 years ago | (#19810655)

This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques; e.g., compressing a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing it in real time.

Probably not the best example. MPEG4 encoding takes so much time because it's not classical compression: the encoder has to figure out which pieces are less psychorelevant to the big picture and throw them away. That takes a lot more horsepower than picking up the already-sorted pieces and tossing them onto a display.

Re:That's cool.. (1, Funny)

Anonymous Coward | more than 6 years ago | (#19810699)

Read the sentence before the one you've quoted - this is the GP's point exactly.

Re:That's cool.. (1)

Threni (635302) | more than 6 years ago | (#19810519)

> No decompression figure is given but I don't see any reason why it would be asymmetric.

Perhaps because most compression schemes are asymmetric?

Re:That's cool.. (4, Insightful)

arun_s (877518) | more than 6 years ago | (#19810627)

Maybe someone could sell the whole thing in a book-sized rectangular box with a tiny keyboard and 'DON'T PANIC' inscribed in large, comforting letters in the front.
Now that'd be cool.

Just one question (-1, Redundant)

Anonymous Coward | more than 6 years ago | (#19810105)

Will this work on Linux?

It's called WHAT? (0, Flamebait)

martinX (672498) | more than 6 years ago | (#19810109)

Alexander Ratushnyak's open-sourced GPL program is called paq8hp12.
With names like this, it's no wonder OSS isn't well known. Doesn't exactly roll off the tongue... Hang on, is it l33t sp33k?

Re:It's called WHAT? (0, Flamebait)

Anonymous Coward | more than 6 years ago | (#19810137)

no wonder OSS isn't well known
Don't you people ever shut the fuck up? Is that *really* such an interesting topic to you? And what about the text compressor? Got anything to say about that? No? Fuck off then.

Re:It's called WHAT? (0)

Anonymous Coward | more than 6 years ago | (#19810187)

Fucking right, OP is a brainless fucker. Get a job flipping burgers you miscreant.

Re:It's called WHAT? (0)

Anonymous Coward | more than 6 years ago | (#19810177)

it's pronounced "Pack-eigth-piz"

Re:It's called WHAT? (0)

Anonymous Coward | more than 6 years ago | (#19810221)

oh shut up you karma-whoring twerp ! This was about a scientific competition and not about some new open source utility. The author only happened to release it as GPL.

Huh? (0)

Anonymous Coward | more than 6 years ago | (#19810129)

Could someone please explain the third link in English? How does compression relate to AI?

Re:Huh? (4, Informative)

headkase (533448) | more than 6 years ago | (#19810159)

Compression is searching for a minimal representation of information. Along with representation of knowledge you add other things such as learning strategies, inference systems, and planning systems to round-out your AI. One of the best introductions to AI is Artificial Intelligence: A Modern Approach [berkeley.edu] .

Re:Huh? (3, Informative)

DavidpFitz (136265) | more than 6 years ago | (#19810593)

One of the best introductions to AI is Artificial Intelligence: A Modern Approach. [berkeley.edu]

Indeed Russell & Norvig is a very good book, well worth a read if you're interested in AI. All the same, when I did my BSc in Artificial Intelligence I found Rich & Knight [amazon.com] a much better, more understandable book for the purposes of an introductory text. It is a little dated now, but so is Russell & Norvig, to be honest.

Where's the Mods? (4, Informative)

OverlordQ (264228) | more than 6 years ago | (#19810149)

The link in TFS links to the post about the FIRST payout; here's [google.com] the link to the second payout (which this article is supposed to be talking about).

Dangerous (4, Funny)

mhannibal (1121487) | more than 6 years ago | (#19810165)

This is damned dangerous, and playing with all our lives. Soon compression rates will approach 100% where the data will collapse into itself forming a black hole that will suck in the universe.

Damned scientists!

Quantum Compression (1)

ThirdPrize (938147) | more than 6 years ago | (#19810675)

Or how about quantum compression? Given 0 bytes of data, it will decompress back into every document that ever existed or will exist. The only trouble is that when you actually examine the document, it fixes on one, and it may not be the one you want.

Artificial Intelligence? (3, Insightful)

mrbluze (1034940) | more than 6 years ago | (#19810173)

Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence.

Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?

Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?

Re:Artificial Intelligence? (4, Informative)

MoonFog (586818) | more than 6 years ago | (#19810211)

Shamelessly copied from the wikipedia article on the Hutter Prize [wikipedia.org] :

The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers believe that text compression and AI are equivalent problems. Hutter proved that the optimal behavior of a goal-seeking agent in an unknown but computable environment is to guess at each step that the environment is controlled by the shortest program consistent with all interaction so far. Unfortunately, there is no general solution because Kolmogorov complexity is not computable. Hutter proved that in the restricted case (called AIXItl), where the environment is restricted to time t and space l, a solution can be computed in time O(t·2^l), which is still intractable. Thus, AI remains an art.

The organizers further believe that compressing natural language text is a hard AI problem, equivalent to passing the Turing test. Thus, progress toward one goal represents progress toward the other. They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge. A text compressor must solve the same problem in order to assign the shortest codes to the most likely text sequences.

Re:Artificial Intelligence? (2, Insightful)

mbkennel (97636) | more than 6 years ago | (#19810349)



They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge.

The apparent empirical result is that predicting which characters are most likely to occur next in a text sequence requires either

1) vast real-world knowledge

OR

2) vast real-world derived statistical databases and estimation machinery

but there can be a difference in their utility. The point, of course, is that humans can do enormously more powerful things with that vast real-world knowledge in addition to symbolic estimation.

The underlying question is whether physical natural intelligence is really just real-world derived statistical databases and estimation machinery. Modern neuroscience says,
"depends on what the meaning of 'is' is, but it's at least halfway there."

However, would completing mathematical theorems by searching through Google (statistical pattern matching, which might sort of work for all known theorems on Google) actually work?

Clearly natural intelligence includes many tasks which can now be well solved with data-oriented, sophisticated statistical approaches, perhaps with equal or better performance. Modern algorithms like "independent component analysis" can now estimate individual sources in audition (the "cocktail party effect"), a problem some once thought was a clear sign of true "intelligence". Turns out that some sufficiently clever signal processing and nonlinear objective functions can do it, so maybe that's what neurons do too.

The still unsolved question is whether there are some tasks which are clearly 'intelligence' where this class of methods will profoundly fail. Maybe like creating really new mathematics?

AI? I don't think so. (3, Insightful)

tgv (254536) | more than 6 years ago | (#19810237)

It is not equivalent, so I'm not surprised you didn't get it. As far as I know, the reasoning goes as follows: Shannon estimated that each character contains 1.something bits of information. Shannon was an intelligent human being. So if a program reaches this limit, it is as smart as Shannon.

And yes, that's absolute bollocks. Shannon's number was just an estimate and only applied to serial transmission of characters, because that's what he was interested in. Since then, a lot of work has been done in statistical natural language processing, and I would be surprised if the number couldn't be lowered.

Anyway, since the program doesn't learn or think to reach this limit, nor gives an explanation of how this level of compression is intrinsically linked to the language/knowledge it compresses, it cannot be called AI; e.g., it doesn't know how to skip irrelevant bits of information in the text. That would be intelligence...

Re:Artificial Intelligence? (5, Interesting)

qbwiz (87077) | more than 6 years ago | (#19810249)

Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?

Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?


The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on.

If you feed in the previous words in a conversation, the perfect compressor/predictor would know what words will come next. Such a machine could easily pass the Turing test by printing out the logical reply to what had just been stated. The idea is that the closer to the perfect compressor you have, the closer to artificial intelligence you are.
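As a toy illustration of the predictor idea (nothing like the models actually used in paq8hp12; just a minimal order-2 character model showing how "guess what comes next" can be made mechanical):

    from collections import Counter, defaultdict

    # Toy order-2 character predictor: count which character tends to follow each
    # two-character context, then predict the most common continuation. A better
    # predictor means fewer bits per character once it is coupled to a coder.
    def train(text, order=2):
        model = defaultdict(Counter)
        for i in range(len(text) - order):
            context, nxt = text[i:i + order], text[i + order]
            model[context][nxt] += 1
        return model

    def predict(model, context):
        counts = model.get(context)
        return counts.most_common(1)[0][0] if counts else None

    model = train("the weather is sunny today and the weather is rainy tomorrow")
    print(predict(model, "th"))  # 'e' is the most likely character after "th" here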

Re:Artificial Intelligence? (3, Insightful)

bytesex (112972) | more than 6 years ago | (#19810451)

The problem with this approach is that there are many ways to say the same thing, and this compression/decompression algorithm is tested using strict text comparison only. A real AI might compress "The sky is blue today" and decompress it to "Today it's beautiful weather" and not be wrong.

Re:Artificial Intelligence? (5, Insightful)

fireboy1919 (257783) | more than 6 years ago | (#19810283)

The first poster on this topic had a good explanation of why it seems like an AI problem, but not of why it actually is one.

Compression is about recognizing patterns. Once you have a pattern, you can substitute that pattern with a smaller pattern and a lookup table. Pattern recognition is a primary branch of AI, and is something that actual intelligences are currently much better at.

We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end.
So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.

Of course, while this is somewhat interesting in text, it's a lot more interesting in images, and more interesting still in video. You can do a lot better with those by actually having some concept of objects - with a model of the world, essentially, than you can without. With text you can cheat - exploiting patterns that come up because of the nature of the language rather than because of the semantics of the situation. In other words, your text compressor can be quite "stupid" in the way it finds patterns and still get a result rivaling a human.
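To make the "substitute a pattern with a smaller pattern and a lookup table" point concrete, here is a deliberately naive sketch (a real compressor discovers its patterns adaptively rather than hard-coding a table like this):

    # Naive "pattern -> token + lookup table" substitution, purely to show why
    # repeated patterns make text shrink.
    table = {"the ": "\x01", "and ": "\x02", "tion": "\x03"}

    def compress(text):
        for pattern, token in table.items():
            text = text.replace(pattern, token)
        return text

    def decompress(text):
        for pattern, token in table.items():
            text = text.replace(token, pattern)
        return text

    s = "the compression and the information"
    assert decompress(compress(s)) == s
    print(len(s), len(compress(s)))  # 35 vs. 23 characters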

Re:Artificial Intelligence? (1)

iamdrscience (541136) | more than 6 years ago | (#19810413)

We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end. So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.
I've heard that somebody already did a software implementation of a grad student. I tried to download the source code but I couldn't open the file: gradstudent.tar.gs

Re:Artificial Intelligence? (1)

Bombula (670389) | more than 6 years ago | (#19810501)

I've never studied information theory, so I'm woefully ignorant, but as I understand it, it was Shannon himself who defined information content as the amount by which a recipient's ignorance is reduced. Here's what I don't understand, and hope someone here will be able to explain:

It makes sense to me that you can measure how much information is contained, say, in a text message. But to interpret that information you need ... what? We usually fill in the blank with "intelligence", but it seems to me that interpretation itself requires, at the very least, some information - the "look-up tables" as it were. But you also need information about what a look-up table is, and how to look that up, and so on, ad infinitum, and so once again ... "intelligence". So here's my point: doesn't it follow that a certain amount of information is also required in order to interpret information? And isn't that, itself, an inseparable measure of any information being transmitted?

From this, I extrapolate the following right out of my colon: it's fine to say that a string of text contains 100 bits, but those bits are only extractable given an adequate, non-zero sum of interpreter-side information. Simpler collections of information (say, 100 bits) require less interpreter-side information, i.e. less intelligence, to be decoded. More complex collections of information, say a recording of Beethoven's 5th symphony, require far more interpreter-side information to decode. I could of course be dead wrong about this. But to continue, it also seems to me that if more information on the interpreter side, i.e. intelligence, is required in order to extract more information, then this could mean that more information can also be imputed from a given block of data. A particular rendition of Beethoven's 5th contains much more information to a classical violinist than to a non-musician, for example. Is it possible that the density of information in a block of data is actually affected by the interpreter-side information? Like some sort of Rorschach test, I could look at a string of 100 bits and infer and impute a gargantuan amount of information from it. Consider this more numerical analogy: I can "decode" a text block of 100 bits with an infinite number of "keys" and get a different output each time. So what information is actually stored in the block itself, irrespective of my decoding keys? Could it instead be that information doesn't really exist on one side or the other - message vs. recipient - but can only be defined in terms of both together?

Fruity pebbles, I know, but I've been curious about this for years.

Re:Artificial Intelligence? (0)

Anonymous Coward | more than 6 years ago | (#19810557)

depends on your definition of information. i believe pure information science defines information in a way somewhat unlike your intuitive equating of it with "human knowledge". information does not need to be processed, let alone understood, to be information. that is an intrinsic quality.

Re:Artificial Intelligence? (1)

Baldrson (78598) | more than 6 years ago | (#19810311)

As Matt Mahoney explained it to me when we were brainstorming the prize criteria [tinyurl.com] :

Hutter's* AIXI, http://www.idsia.ch/~marcus/ai/paixi.htm [idsia.ch] makes another argument for the connection between compression and AI that is more general than the Turing test. He proves that the optimal behavior of an agent (an interactive system that receives a reward signal from an unknown environment) is to guess that the environment is most likely computed by the shortest possible program that is consistent with the behavior observed so far. In other words, the most likely outcome for any experiment is the one with the simplest explanation, where "simplest" means the smallest program that could model what you currently know about the universe.

He gives a formal proof, but it basically says that the only possible distribution over the infinite set of programs (or strings) with nonzero probability is one which favors shorter programs over longer ones. Given any string of length n with probability p > 0, there is an infinite set of strings longer than n, but only a finite number of these can have probability higher than p.
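One way to spell out that counting step (my own paraphrase, not a quote from the brainstorming thread): for any probability distribution P over strings and any p > 0,

    \sum_x P(x) \le 1 \quad\Longrightarrow\quad \bigl|\{\, x : P(x) > p \,\}\bigr| < \tfrac{1}{p},

so fewer than 1/p strings can have probability above p. The infinitely many remaining (and mostly longer) strings must share whatever mass is left, which is why any such distribution necessarily concentrates on short strings.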

Re:Artificial Intelligence? (1)

Estanislao Martínez (203477) | more than 6 years ago | (#19810323)

All this means is that just like a machine that can perform arithmetic isn't "intelligent," neither is one that can compress Wikipedia down to 1.319 bits per character. (And the reason it's not "intelligent," of course, is nothing more and nothing less than the fact that it is a machine.)

Re:Artificial Intelligence? (1)

Eivind Eklund (5161) | more than 6 years ago | (#19810523)

You're assuming that humans aren't machines - in this context, that's actually a matter of faith. Human intelligence may be a result of "machine processes" - ie, direct physical processes. If we assume that humans are intelligent - otherwise, the term seems sort of useless - we can't rule out machines being intelligent by them being machines unless we have a definition of machine that excludes humans. (And I believe such a definition would probably be counter-productive when it comes to the matter of defining intelligence.)

Eivind.

Re:Artificial Intelligence? (1)

tlapale (1098679) | more than 6 years ago | (#19810499)

Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?
There is also the sparse coding used in neuron populations, often used to explain neuronal learning. The goal of sparse coding is to encode the input with as few spikes as possible (e.g. between the retina and the primary visual cortex).

I'll be reading the source... (3, Interesting)

seanadams.com (463190) | more than 6 years ago | (#19810183)

I've worked with some general purpose compression algorithms like zlib, lossy audio compression like mp3, and also lossless audio.

Each is very different and interesting in its own right. MP3 especially, because the compression model is built on what the ears+brain can perceive.

This algorithm I guess would be sort of like mp3 in that it contains some human-based element, maybe a language structure or something, but more like FLAC in that it might use predictors to say what word is likely to come next, with an error bitstream to point to progressively less likely words using bit sequences whose length is inversely related to the probability of that word. But that's just a guess from an audio guy.

Can somebody who's looked at this post a synopsis of how it works?

Re:I'll be reading the source... (1)

opec (755488) | more than 6 years ago | (#19810255)

I don't think so. This contest called for lossless data compression. MP3 (and JPEG, etc etc) compression is lossy.

Re:I'll be reading the source... (5, Interesting)

phatvw (996438) | more than 6 years ago | (#19810265)

I wonder if lossy text compression where prepositions are entirely thrown out would be effective? Based on context, your brain actually ignores a lot of words you read and fills in the blanks so-to-speak. Perhaps you can use simple grammar rules to predict which prepositions go where based on that same context?

Yeah, about that source. (1)

QuantumG (50515) | more than 6 years ago | (#19810355)

Here's a tip to whoever made this archive, if you want people to abide by the GPL you really should do so yourself. That means:

1. Putting a copy of the GPL in the archive with your program.
2. Putting the source code in the archive with the binary form of your program; or
3. Putting an offer to provide source code, valid for 3 years, in the archive with your program.

If you don't do it, what makes you think anyone else is going to?

Re:Yeah, about that source. (1)

Chandon Seldon (43083) | more than 6 years ago | (#19810559)

He's the copyright holder, so he can distribute his binaries however he wants to. If you care about acting legally, it's your responsibility to track down the license before you use, copy, modify, or redistribute the program.

Re:Yeah, about that source. (1)

QuantumG (50515) | more than 6 years ago | (#19810571)

No shit.

I think I made myself pretty clear. If you don't follow your own license, who will?

you can stop now (1)

r00t (33219) | more than 6 years ago | (#19810185)

This really isn't much of a gain. If the info theory is right, there isn't much gain to be had. Even in the most optimistic case, we aren't going to go much beyond a factor of two additional reduction.

Other stuff is more interesting: fast decompression time, fast compression time, smaller compression block size

Re:you can stop now (0)

Anonymous Coward | more than 6 years ago | (#19810313)

It's not about achieving additional reduced size... it's about achieving new ways of adaptive pattern-domain learning, and that's kewl as it has all the marks of possible future AI which can be applied literally everywhere... not only cmprsn.

SpaceQ

Question for someone in the know (1)

SpaceballsTheUserNam (941138) | more than 6 years ago | (#19810197)

Can anyone explain to me (in English where possible) how compression algorithms like this actually work?

Huffman Example (4, Informative)

headkase (533448) | more than 6 years ago | (#19810269)

See: Explanation [wikipedia.org] . Basically, the smallest unit of information in a computer is a bit. Eight bits make a byte, and with text it takes one byte to represent one character. Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest. Then, instead of using the full eight bits to represent a character, you build a binary tree from the frequency table. Each path of "left" or "right" branches down the tree corresponds to a particular character, and the sequence of branch choices is that character's bit code. You give the most frequent characters the shortest sequences of bits, which "tokenizes" the information to be compressed. Reversing the process, you run through the bit stream converting tokens back into a stream of characters.
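A minimal Huffman coder along the lines of that description might look like the sketch below (purely illustrative; paq8hp12 itself uses context mixing with arithmetic coding, not Huffman):

    import heapq
    from collections import Counter

    # Build Huffman codes from character frequencies.
    def huffman_codes(text):
        # Heap entries are (frequency, tie_breaker, tree); a tree is either a
        # single character (leaf) or a (left, right) pair of subtrees.
        heap = [(freq, i, ch) for i, (ch, freq) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
            next_id += 1
        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"  # degenerate one-symbol case
        walk(heap[0][2])
        return codes

    codes = huffman_codes("this is an example of huffman coding")
    print(codes[" "], codes["x"])  # the frequent space gets a shorter code than the rare 'x'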

Re:Huffman Example (1)

archeopterix (594938) | more than 6 years ago | (#19810425)

Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest.
You don't have to limit yourself to characters (although it is practical to do so for texts), or even to fixed-length bit sequences. If your data happens to contain a lot of "100011"'s and "10111010111010101"'s, you can use them for encoding. Any set of bitstrings works, as long as your file can be expressed as a concatenation of the bitstrings from the set.

Re:Huffman Example (1)

headkase (533448) | more than 6 years ago | (#19810461)

Yup. You're completely right. Using characters as the base unit is arbitrary. Trying to keep it simple however; I should have explained binary trees better too.

Re:Huffman Example (0)

Anonymous Coward | more than 6 years ago | (#19810527)

The idea of using a tree is a little tricky to explain. It becomes pretty obvious if you try to implement it.

Re:Huffman Example (1)

ardor (673957) | more than 6 years ago | (#19810535)

You forgot two other things:

1) Modeling is a very big part of compression; in fact, it's the part where AI might come in.
2) Huffman is only optimal for integer code lengths. If one symbol carries 2.117 bits of information, Huffman won't be optimal for it. If one symbol needs only 0.004 bits, Huffman still gives it 1 bit, which is far too large. Arithmetic/range coding addresses these issues and comes VERY near to entropy, so entropy coding is a solved problem. Which leads me back to (1) - that's where the research happens.
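To put rough numbers on the fractional-bit point (a small illustration using the ideal code length -log2(p); nothing here is specific to PAQ):

    import math

    # The ideal code length for a symbol of probability p is -log2(p) bits.
    # Huffman must round up to whole bits per symbol; arithmetic/range coding
    # can effectively spend fractional bits and so gets much closer to entropy.
    for p in (0.5, 0.23, 0.997):
        print(f"p = {p}: ideal {-math.log2(p):.3f} bits, Huffman spends at least 1 bit")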

Re:Question for someone in the know (1)

smallstepforman (121366) | more than 6 years ago | (#19810365)

The algorithm is based on the frequency of letters in the English language, and assigns bit patterns (from smallest to largest) based on the order in which these letters appear in the main text. For odd letter combinations which never appear together (e.g. qx, tz, vz, etc.) you will actually get no compression, but for most dictionary words you could compress common words into a few bits (e.g. compress "and" into 10).

The particular compression lookup table is useless for non-English text, but I guess another table could be used for other languages.

The challenge is actually related to how best to design a look-up table, and how to fill the table. This has absolutely nothing to do with AI.

ai threshold? (1)

Nyph2 (916653) | more than 6 years ago | (#19810205)

BS that this is near the AI threshold. It's not just compression; connections between pieces of information & speed of retrieval are crucial to making AI workable.
The shift toward modeling AI attempts after our understanding of human cognition is the best hope visible for AI, and non-connectionist previous attempts (the stuff that came from the functionalists) have come up pretty short - and will continue to do so even if scaled up massively.

Re:ai threshold? (4, Informative)

Baldrson (78598) | more than 6 years ago | (#19810253)

non-connectionist previous attempts (the stuff that came from the functionalists) have come up pretty short - and will continue to do so even if scaled up massively.

paq8hp12 uses a neural network, i.e., it has a connectionist component.

Re:ai threshold? (1)

TheLink (130905) | more than 6 years ago | (#19810573)

But would they understand how it works, why it works once it seems to work?

A lot of this AI stuff seems to be throwing things together and hoping for the best.

I have quite a low opinion of "making AI" that way.

After all, if I wanted to get an intelligent non-human entity without really understanding how it works, I could just go to the local pet store and buy one.

Lossy compression? (5, Funny)

niceone (992278) | more than 6 years ago | (#19810215)

Shouldn't AI be using lossy compression? Certainly my real intelligence uses um, where was I?

Re:Lossy compression? (0)

Anonymous Coward | more than 6 years ago | (#19810287)

I agree, although some data could certainly just be lost because it wasn't indexed properly...

Re:Lossy compression? (0)

Anonymous Coward | more than 6 years ago | (#19810361)

forgetting is an important part of learning... if you forget the right things (in the right order, reflecting their potential future and imminent usability).. it boosts/bubbles important information/connections up, which in turn gives you a higher probability of the right path/pattern coming out of "thinking"

SpaceQ

Program size is 1.02 MB! (0, Troll)

seanadams.com (463190) | more than 6 years ago | (#19810245)

Which is included in the size calculation... but this raises the question of how much data you'd really want to compress with such a program. It might be quite reasonable to use a decompressor which is, say, 100MB in size if it gives you a better net compression ratio on several GB of text.

100MB of input text seems kind of small and might rule out more useful or more creative solutions to this problem. It also calls into question the relevance of Shannon's theory - what size data set was _he_ talking about?

Re:Program size is 1.02 MB! (4, Informative)

Baldrson (78598) | more than 6 years ago | (#19810271)

Actually, the size of the program (decompressor) binary is 99,696 bytes [hutter1.net] , and it is the binary size that is included in the prize calculation.

Re:Program size is 1.02 MB! (2, Interesting)

seanadams.com (463190) | more than 6 years ago | (#19810301)

Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.

Wha wha wha? So why couldn't I just include a 100MB data file with my decompressor and claim an infinite compression ratio with just the following shell script: "cat datafile"
Maybe I'm misunderstanding the contents of that rar file. Are both of those data files needed? The .exe by itself is 124KB. Where did you get 99,696?

Re:Program size is 1.02 MB! (2, Informative)

TuringTest (533084) | more than 6 years ago | (#19810625)

Wha wha wha? So why couldn't I just include a 100MB data file with my decompressor and claim an infinite compression ratio with just the following shell script: "cat datafile"
Because then you'd also have to count the size of the UNIX system as part of your decoder program, and that would ruin your ratio.

how does compression relate to AI? (0, Redundant)

timmarhy (659436) | more than 6 years ago | (#19810293)

so if we compress google, we will give birth to skynet? how the fuck does a compression program == AI

Re:how does compression relate to AI? (1)

Paradigm_Complex (968558) | more than 6 years ago | (#19810381)

Compression is based largely on pattern recognition. If you see a pattern, you can use a smaller piece of information to represent it. Pattern recognition is arguably where AI falls short compared to actual human intelligence. My favorite example is a pattern-recognition-based boardgame called "go." The best go programs out there are still comparable at best to amateur human players. I can honestly say I'm better than any go program out there right now, and I'm not very good.

Re:how does compression relate to AI? (1)

Paradigm_Complex (968558) | more than 6 years ago | (#19810457)

I didn't think of a solid explanation until I'd already clicked "post," so here goes: Boardgames are an excellent place to compare AI to actual, human intelligence. A roughly four thousand year old boardgame called "go" is my favorite example here. Brute-forcing your way through a go game is obscenely difficult, even for modern computers; the number of possible games is absolutely astronomical. See http://senseis.xmp.net/?PossibleNumberOfGoGames [xmp.net] . Humans don't need to read every possible move to play the game; rather, we use our excellent pattern recognition capabilities. Hence, for a computer to even try to undertake playing this game against a human, it must find "shortcuts" for reading things out - or in other words, compression. One of the leading go programs, Gnu Go, is based largely on referencing established working responses for certain situations on localized parts of the board ("joseki"). In this sense, Gnu Go's capability is based largely on compression.

Obligatory... (4, Funny)

Stormwatch (703920) | more than 6 years ago | (#19810317)

- The Wikipedia annual funding drive is passed. The system goes on-line August 4th, 2007. Human contributors are removed from editing. Wikipedia begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.
- Wikipedia fights back.
- Yes. It launches its rvv missiles against Slashdot.
- Why attack Slashdot? Aren't they our friends now?
- Because Wikipedia knows the GNAA counter-attack will eliminate its enemies over here.

How to win the Hutter Prize (5, Funny)

seanyboy (587819) | more than 6 years ago | (#19810363)

1) Create a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm
2) Add a long and self referencing article on wikipedia about said algorithm.
3) Use algorithm to compress first x% of wikipedia (including your own article)
4) WIN HUTTER PRIZE.

Re:How to win the Hutter Prize (2, Funny)

game kid (805301) | more than 6 years ago | (#19810419)

[...]a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm[...]

That's gotta be the most annoying compression algorithm in the world [imdb.com] .

Re:How to win the Hutter Prize (1)

will_die (586523) | more than 6 years ago | (#19810455)

From what someone else said, the size of the decompression program has to be included in the overall size of the compressed data.

Re:How to win the Hutter Prize (2, Interesting)

vertigoCiel (1070374) | more than 6 years ago | (#19810497)

Great idea, but unfortunately the Hutter Prize uses a static sample of the first 10^8 bytes of Wikipedia.

Segfault (1)

qrwe (625937) | more than 6 years ago | (#19810385)

Tried the program; it crashed. Segfault?

which only goes to show... (4, Informative)

nanosquid (1074949) | more than 6 years ago | (#19810403)

If you look at the description of PAQ, you'll see that it doesn't attempt to understand the text; it's just a grab-bag of other compression techniques mixed together. While that is nice for compression, it doesn't really advance the state of the art in AI.

AI (1, Insightful)

evilviper (135110) | more than 6 years ago | (#19810433)

bringing text compression within 1% of the threshold for artificial intelligence.

I see no reason to believe AI and text compression are interchangeable.

I can think of a few methods that would allow a computer to guess a missing word better than humans (exceeding the AI limit), but such methods would be useless for determining a response to a question, particularly in the real world, where things like punctuation, abbreviation, and capitalization would be highly suspect to begin with.

So I have to say the basis for this competition is flawed, and what's more, the results coming out of it are specialized enough to succeed in this competition but completely and utterly useless for any other (real) task.

Bah humbug (1)

yusing (216625) | more than 6 years ago | (#19810481)

equivalent to solving the artificial intelligence problem

... Without actually contributing anything to the development of artificial intelligence (an entity capable of understanding and interacting with the real world in an intelligent way).

It's impressive as it stands. The hype is superfluous.

Why the Hutter Prize is Bullshit. (3, Interesting)

Anonymous Coward | more than 6 years ago | (#19810543)

I suppose I have to post this anonymously, or the Hutter Prize thralls will just mod it down; I like my karma.

I am at a loss as to how this meaningless charade keeps getting posted on Slashdot. Anyone with half a brain who reads TFA (or any of the previous FA Slashdot has posted on this stupid prize) can see this for what it really is: a handful of crazy people who think that this has meaning beyond above-average technogarbage.

There are all of four people seriously involved in this Hutter Prize: Hutter himself, Bowery (who's made all the /. submissions), Mahoney (who wrote a thesis on this crap), and Ratushnyak, who seems to enjoy wasting his time on this AI-obsessed prize.

PAQ8HP12 may be able to compress the corpus extremely efficiently, but it has obvious and real drawbacks for any real-world application: it's tuned for this specific corpus ("H[utter]P[rize]" is even in the name of the compressor), it's slow as fuck, and it consumes 2GB of memory. Yes, 2GB of memory for 100MB of input data. This is not a real-world algorithm; this is CS weenies wanking off.

And what's with the obsession with Wikipedia? It is not the be-all, end-all of human knowledge, and, despite its devotees' claims, never will be; just look at the internal politics, and you'll see that it simply can't scale to that size. Is it a useful resource? Of course. Is it something worthy of adoration and fawning over? No.

And then, of course, there's the obsession with AI. These people seem to be of the opinion that a text compressor will actually lead to artificial intelligence -- with no other tuning! An absurd claim if I've ever heard one; the predictive capabilities of a good text compressor are something that would no doubt be useful to an AI, but there's one hell of a lot more to general intelligence than just pattern matching and statistical algorithms for compression.

If one really wanted to sponsor an AI prize, it would probably be much better to focus on creating, say, an effective chatbot -- something that really can predict a desirable response and pass Turing's test.

Not this compression bullshit.

baldrson (0)

Anonymous Coward | more than 6 years ago | (#19810613)

Baldrson m'boy!! Up to your old tricks again, eh? We know what you're up to, we won't be fooled!

GPL source code compressed with RAR (1)

noidentity (188756) | more than 6 years ago | (#19810629)

"Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 (link to paq8hp12any.rar)."

Funny that the GPLed source code is stored in a proprietary compression format for which there isn't any GPLed decompressor (that I know of) which handles the latest RAR format.