Welcome to the Slashdot Beta site -- learn more here. Use the link in the footer or click here to return to the Classic version of Slashdot.

Thank you!

Before you choose to head back to the Classic look of the site, we'd appreciate it if you share your thoughts on the Beta; your feedback is what drives our ongoing development.

Beta is different and we value you taking the time to try it out. Please take a look at the changes we've made in Beta and  learn more about it. Thanks for reading, and for making the site better!

New Algorithm for Learning Languages

samzenpus posted about 9 years ago | from the universal-translator dept.

Software 454

An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."

cancel ×


Sorry! There are no comments related to the filter you selected.

Finally! (-1)

doxology (636469) | about 9 years ago | (#13451510)

I could have one of these write my posts for me and nobody would know the difference!

yeah (1)

wesman83 (700326) | about 9 years ago | (#13451529)

only 999,000 more components and we'll have ourselves a positronic net

Re:Finally! (-1, Troll)

Anonymous Coward | about 9 years ago | (#13451559)

dox is a dork

Re:Finally! (2, Funny)

HTTP Error 403 403.9 (628865) | about 9 years ago | (#13451573)

Using this software, I can finally win the 'Summarize Proust Competition'!

Re:Finally! (-1)

Anonymous Coward | about 9 years ago | (#13451607)

It would also be able to mod you down.

just thought.. (3, Interesting)

thegoogler (792786) | about 9 years ago | (#13451517)

what if this could be integrated into a small plugin for your browser(or any program) of choice, that would then generate its own dictionary in your language.

would probably help with the problem of either downloading a small, incomplete dictionary, a dictionary with errors, or a massive dictionary file.

Re:just thought.. (4, Insightful)

Bogtha (906264) | about 9 years ago | (#13451726)

This algorithm works with sample data. Where is the sample data going to come from? If you have to download it, then that negates the whole point of using it. If you use what you see online, well that's just rediculous, for obvious reasons :).

Um... (0)

uberdave (526529) | about 9 years ago | (#13451744)

Wouldn't it be easier just to point to an online dictionary?

Sucks to be a support tech in India (5, Funny)

HeLLFiRe1151 (743468) | about 9 years ago | (#13451521)

Their jobs be outsourced to computers.

Re:Sucks to be a support tech in India (2, Funny)

Anonymous Coward | about 9 years ago | (#13451575)

Their jobs be outsourced to computers.

They is?

Didn't Google already do this? (5, Interesting)

powerline22 (515356) | about 9 years ago | (#13451528)

Google apparently has a system like this in their labs, and entered it into some national competetion, where it pwned everyone else. Apparently, the system learned how to translate to/from chinese extremely well, without any of the people working on the project knowing the language.

Re:Didn't Google already do this? (-1, Flamebait)

Anonymous Coward | about 9 years ago | (#13451654)

i find that hard to believe that none of the programmers new a form of chinese.

Re:Didn't Google already do this? (2, Informative)

Anonymous Coward | about 9 years ago | (#13451711)

Here's the link [] at Google's blog.

No the didn't (5, Interesting)

Ogemaniac (841129) | about 9 years ago | (#13451740)

I played around with the Google translator for a while. I work in Japan and am half-way fluent. Google couldn't even turn my most basic Japanese emails into comprehensible English. Same is true for the other translation programs I have seen.

I will believe this new program when I see it.

Translation, especially from extremely different languages, is absurdly difficult. For example, I was out with a Japanese woman the other night, and she said "aitakatta". Literally translated, this means "wanted to meet". Translated into native English, it means "I really wanted to see you tonight". It is going to take one hell of a computer program to figure that out from statistical BS. I barely could with my enormous meat-computer and a whole lot of knowledge of the language.

Re:No the didn't (0)

Anonymous Coward | about 9 years ago | (#13451776)

Ah, nice tale, but there is one flaw which shows you made it all out. A woman wanted to meet you!? No, we're not that gullible.
  Welcome to /.

Re:No the didn't (2, Interesting)

lawpoop (604919) | about 9 years ago | (#13451806)

The example you are suing is from conversation, which containts a lot of mutually shared assumptions and information. Take this example from Stephen Pinker:

"I'm leaving you."

"Who is she?"

However, in written text, where the author can assume that the reader brings no shared assumptions, nor can the author rely on any deefback, 'speakers' usually do a good job of including all necessary information in one way or another -- especially in texts meant to convince or promote a particular viewpoint. I'll bet these kinds of texts are more easily translatable than conversation.

Re:Didn't Google already do this? (5, Interesting)

spisska (796395) | about 9 years ago | (#13451834)

IIRC, Google's translator works from a source of documents from the UN. By cross referencing the same set of documetents in all kinds of different languages, it is able to do a pretty solid translation built on the work of goodness knows how many professional translators.

What is a little more confusing to me is how machine translation can deal with finer points in language, like different words in a target language where the source language has only one. English for example has the word "to know" but many languages use different words depending on whether it is a thing or a person that is known. Or words that relate to the same physical object but carry very different cultural connotations -- the word for female dog is not derogatory in every language, for example, but some other animals can be extremely profane depending on who you talk to.

Or situations where two entirely different real-world concepts mean similar things in their respective language -- in English, for example, you're up shit creek, but in Slavic languages you're in the pussy.

I've done translation work before (Slovak -> English), and there's much more going on than differences in words and grammar. There are whole conceptual frameworks in languages that just don't translate, and this is frustrating for anyone learning a language, let alone trying to translate. English is very precise (when used as directed) in matters of time and sequence -- we have more than 20 verb tenses where most languages get away with three.

Consider this:

I was having breakfast when my sister, whom I hadn't seen in five years, called and asked if I was going to the county fair this weekend. I told her I wasn't because I'm having the painters come on Saturday. They'll have finished by 5:00, I told her, so we can get together afterwords.

These three sentences use six different tenses: past continuous, past perfect, past simple, present continuous, future perfect, and present simple, and are further complicated by the fact that you have past tenses refering to the future, present tenses refering to the future, and the wonderful future perfect tense that refers to something that will be in the past from an arbitrary future perspective, but which hasn't actually happened yet. Still following?

On the other hand, English is much less precise in things like prepositions and objects, and utterly inexplicable when it comes to things like articles, phrasal verbs, and required word order -- try explaining why:

I'll pick you up after work

I'll pick the kids up after work

I'll pick up the kids after work

are all OK, but

I'll pick up you after work

is not.

Machine translation will be a wonderful thing for a lot of reasons, but because of these kinds of differences in languages, it will be limited to certain types of writing. You may be able to get a computer to translate the words of Shakespeare, but a rose, by whatever name, is not equally sweet in every language.

SCIgen (5, Interesting)

OverlordQ (264228) | about 9 years ago | (#13451533)

SCIgen [] anyone?

From TFL... (your link) (1)

PaulBu (473180) | about 9 years ago | (#13451655)

It uses a hand-written context-free grammar to form all elements of the papers.

I know you were aiming for funny, but there is a big difference between following a hand-written grammar and deducing it from the text...

Paul B.

Markov Chains anyone? (5, Informative)

ImaLamer (260199) | about 9 years ago | (#13451661) []

Used this (easy to compile) C program: []

to create these: []

Mod points to whomever can tell us what texts they use. (No mod points can actually be given)

Re:Markov Chains anyone? (1)

ImaLamer (260199) | about 9 years ago | (#13451678)

(I now realize that many of the titles give it away)

Not really (1, Informative)

Anonymous Coward | about 9 years ago | (#13451730)

Markov chains aren't the same as context free grammars.
(CFGs can generate ((multiply) nested) bracket structures (and are like finite automata with stacks).) Markov chains are just finite automata without stacks, that generate random walks through vocabulary space.

PDF of paper (5, Informative)

mattjb0010 (724744) | about 9 years ago | (#13451536)

Paper here [] for those who have PNAS access.

Re:PDF of paper (0)

Anonymous Coward | about 9 years ago | (#13451548)

I have penis access but it's not helping me get to your site.

Full article for non-PNAS subscribers (4, Informative)

dmaduram (790744) | about 9 years ago | (#13451664)

Unsupervised learning of natural languages

Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman
School of Physics and Astronomy and School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel; and Department of Psychology, Cornell University, Ithaca, NY 14853

We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.

Many types of sequential symbolic data possess structure that is (i) hierarchical and (ii) context-sensitive. Natural-language text and transcribed speech are prime examples of such data: a corpus of language consists of sentences defined over a finite lexicon of symbols such as words. Linguists traditionally analyze the sentences into recursively structured phrasal constituents (1); at the same time, a distributional analysis of partially aligned sentential contexts (2) reveals in the lexicon clusters that are said to correspond to various syntactic categories (such as nouns or verbs). Such structure, however, is not limited to the natural languages; recurring motifs are found, on a level of description that is common to all life on earth, in the base sequences of DNA that constitute the genome. We introduce an unsupervised algorithm that discovers hierarchical structure in any sequence data, on the basis of the minimal assumption that the corpus at hand contains partially overlapping strings at multiple levels of organization. In the linguistic domain, our algorithm has been successfully tested both on artificial-grammar output and on natural-language corpora such as ATIS (3), CHILDES (4), and the Bible (5). In bioinformatics, the algorithm has been shown to extract from protein sequences syntactic structures that are highly correlated with the functional properties of these proteins.

The ADIOS Algorithm for Grammar-Like Rule Induction

In a machine learning paradigm for grammar induction, a teacher produces a sequence of strings generated by a grammar G0, and a learner uses the resulting corpus to construct a grammar G, aiming to approximate G0 in some sense (6). Recent evidence suggests that natural language acquisition involves both statistical computation (e.g., in speech segmentation) and rule-like algebraic processes (e.g., in structured generalization) (7-11). Modern computational approaches to grammar induction integrate statistical and rule-based methods (12, 13). Statistical information that can be learned along with the rules may be Markov (14) or variable-order Markov (15) structure for finite state (16) grammars, in which case the EM algorithm can be used to maximize the likelihood of the observed data. Likewise, stochastic annotation for context-free grammars (CFGs) can be learned by using methods such as the Inside-Outside algorithm (14, 17).

We have developed a method that, like some of those just mentioned, combines statistics and rules: our algorithm, ADIOS (for automatic distillation of structure) uses statistical information present in raw sequential data to identify significant segments and to distill rule-like regularities that support structured generalization. Unlike any of the previous approaches, however, ADIOS brings together several crucial conceptual components; the structures it learns are (i) variable-order, (ii) hierarchically composed, (iii) context dependent, (iv) supported by a previously undescribed statistical significance criterion, and (v) dictated solely by the corpus at hand.

Consider a corpus of sentences (more generally, sequences) over a lexicon of size N, whose units in the case of language are words (starting with phonemes or letters, or even letters in a condensed text without spaces also works). The algorithm starts by loading the corpus onto a directed pseudograph (a nonsimple graph in which both loops and multiple edges are permitted) whose vertices are all lexicon entries, augmented by two special symbols, begin and end. Each corpus sentence defines a separate path over the graph, starting at begin and ending at end, and is indexed by the order of its appearance in the corpus. Loading is followed by an iterative search for significant patterns, which are added to the lexicon as new units. The structure of the data graph and the operations carried out on it are illustrated in Fig. 1. (See also Supporting Text, Figs. 4-17, and Tables 1-12, which are published as supporting information on the PNAS web site.)

The algorithm generates candidate patterns by traversing in each iteration a different search path (initially coinciding with one of the original corpus sentences), seeking subpaths that are shared by a significant number of partially aligned (2, 18) paths. The significant patterns (P) are selected according to a context-sensitive probabilistic criterion defined in terms of local flow quantities in the graph, stated in BOX 1: The Motif Extraction (MEX) Procedure. Generalizing the search path, the algorithm looks for an optional equivalence class (E) of units that are interchangeable in the given context [i.e., are in complementary distribution (2)]. At the end of each iteration, the most significant pattern is added to the lexicon as a new unit, the subpaths it subsumes are merged into a new vertex, and the graph is rewired accordingly (two rewiring modes are available: a context-free Mode A and a context-sensitive Mode B, as described in BOX 2: The ADIOS Algorithm). The search for patterns and equivalence classes and their incorporation into the graph are repeated until no new significant patterns are found. The entire process is governed by the following three parameters: {alpha} and {eta}, which control the definition of pattern significance, and L, which sets the width of the context window where equivalence classes are sought. We estimate the average-case computational complexity of the ADIOS algorithm empirically to increase linearly with the size of the corpus (see section 7 of Supporting Text and Fig. 17).

The final lexicon includes those of the original symbols not incorporated into larger units and a forest of tree-structured root patterns distilled by the algorithm (that is, the patterns that reside on the final graph, at the top level of the hierarchy). Each pattern is structured as a tree because of the hierarchical process of pattern creation; the leaves (terminals) are the original members of the lexicon, and the intermediate nodes are other patterns and equivalence classes (Fig. 2). The tree structure excludes cyclic recursion (loops) of patterns, although recursion may be introduced through pattern matching in a postprocessing stage.

The final graph includes as many paths as all of the original sentences, but it can also generate many new ones. To generate a sentence from a chosen path in the graph, all its root patterns are traversed. Each recursively encountered pattern is treated as a derivation or parse tree (19): it is read from top (root) to bottom (terminals) and from left to right, while accruing the terminals (words from the original lexicon) and selecting one member from each encountered equivalence class (Fig. 2C). Because the equivalence relations only hold in the contexts specified by their parent patterns, the ADIOS representation is inherently safer than grammars that posit globally valid categories (such as "parts of speech" in a natural language). At the same time, because each rewiring of the graph brings closer far-apart units that used to straddle the newly abstracted pattern, the resulting representation can capture long-range structural dependencies among units.

Because patterns can be represented in the form of rewriting rules, which are context-free when Mode A is used (Fig. 2D) and context-sensitive when Mode B is used (Fig. 2G), the end product of an ADIOS run constitutes a grammar. Because infinite recursion is not implemented in the current version of the algorithm, the representations learned by ADIOS are comparable in expressive power to finite grammars that are at least context free. This means that any grammar consisting of context-free rules can be loaded into an ADIOS instance (that is, translated into an ADIOS representation), provided that a limit is placed on the number of times each rule is invoked recursively. In learning, the results described below show that our algorithm can acquire, from raw corpora, good operational approximations to those grammars that generate data rich with partially alignable sentences, including unconstrained natural-language data. Complex grammars that are inherently ambiguous (19) because of the presence of multiple loops are dealt with effectively by acquiring more patterns.


To date, the ADIOS algorithm has been tested on a variety of language data, as well as on DNA and protein sequences from several species. Details are available in Supporting Text.

Language: Computational Grammar Induction. In the domain of language, we tested the ADIOS algorithm both on artificial-grammar data and on natural-language corpora such as ATIS (3) and CHILDES (4) and in languages as diverse as English and Chinese (5). It is reasonable to require that the success of a learning algorithm be measured by the closeness, ideally, identity, of the learned and target grammars, G and G0, respectively. Unfortunately, even for CFGs, equivalence is undecidable (19). Moreover, for natural languages, G0 is inaccessible. We thus opt for testing our implementation for generativity as follows. In the artificial-grammar experiments, which start with a target grammar, a teacher instance of the model is first preloaded with this grammar (using the one-to-one translation of CFG rules into ADIOS patterns), then used to generate the training corpus Ctraining. After training, the learner generates a test corpus Clearner and the teacher generates a test corpus Ctarget, the latter containing only novel sentences that do not appear in Ctraining. The two corpora, Clearner and Ctarget, then are used to calculate precision (the proportion of Clearner accepted by the teacher) and recall (the proportion of Ctarget accepted by the learner). A sentence is accepted if it precisely fits one of the paths in the ADIOS graph (that is, it can be generated by the path). In the natural language experiments, where no target grammar is available, the given corpus is split into two portions, one for training (Ctraining) and one for testing (Ctarget), and the same evaluation method is applied, except that precision must in this case be evaluated by an external referee (e.g., by a human subject). This evaluation method is unique in that (i) it defines precision and recall more conservatively than is standard in the literature (21), and (ii) it involves testing both the capability of the learner to accept all of the grammatical sentences and its capability to generate only sentences that the teacher would deem grammatical.

We have conducted a series of experiments designed to evaluate the performance of ADIOS in grammar induction (Fig. 3). Learning a simple CFG. We first replicated an experiment (22) that aimed to reconstruct a specific CFG (29 terminals and 7 rules) from a corpus of 2,000 sentences. Whereas the algorithm proposed by ref. 22 generated between 3,000 and 4,000 rules, ADIOS (used in the default Mode A) yielded 28 patterns and 9 equivalence classes and achieved 100% precision and 99% recall (see section 2.1 of Supporting Text and Table 1). Next, we applied ADIOS to a somewhat more complex CFG (TA1 grammar, 50 terminals and 28 rules) and showed that it performs well even when only 200 sentences are used for training, as shown in Fig. 3A (see section 2.2 of Supporting Text, Tables 2-5, and Fig. 8). Learning a complex CFG. Because the ADIOS algorithm is greedy (the best available pattern in each iteration is immediately and irreversibly rewired), the syntax it acquires depends on the order of sentences in the training set. This characteristic is expected to affect the learning of a complex CFG, especially if it contains many loops. To assess this dependence and to mitigate it, we train multiple learners on different order-permuted versions of the corpus generated by the teacher. As Fig. 3B illustrates, for the parameter values explored (L = {3,4,5,6}; 30 or 150 learners; corpus size between 10,000 and 120,000 sentences), the optimal precision-recall tradeoff for learning the ATIS CFG (357 terminals and 4,592 rules) [B. Moore and J. Carroll (2001), g-resources] is obtained with a 150-learner cohort and L between 5 and 6 (see section 3.1 of Supporting Text, Table 7, and Fig. 9).

Generativity of the learned natural language grammar. To test the ability of ADIOS to generate acceptable novel sentences after learning from a natural language corpus, we trained it on 12,700 sentences from the ATIS-2 natural language corpus of size 13,043 (3) and tested its recall level on the 343 remaining sentences. The small size of the training corpus results in a relatively low recall of 40% (under our strict definition that requires an exact match). Fig. 3C compares the acceptability of ADIOS-generated sentences with original sentences from the ATIS-2 corpus (see section 4.1 of Supporting Text and Fig. 10). Notably, the output generated by ADIOS is on the average as acceptable to human subjects as the original corpus sentences. The human-judged precision ({approx}70%, as shown in the plot) is remarkable; for comparison, the ATIS-CFG grammar, hand-constructed to fit the ATIS-2 corpus (with recall of 45% on same data) produces >99% ungrammatical sentences when used in a generative fashion.

Structural language modeling. A language model such as a table of word n-gram probabilities or a probabilistically annotated grammar can be made to predict the next word in a sentence from its predecessors, an operation that is highly useful in a variety of natural language processing tasks (23). A structured language model of the kind learned by ADIOS, which includes reliable syntactic information, can be especially effective, because it captures longer-range dependencies than unstructured (purely statistical) n-gram models. When used as a language model on the ATIS-2 corpus (1,294-word lexicon; 11,386 unique sentences), ADIOS achieved perplexity of 11.5 (see section 4.2 of Supporting Text and Fig. 11). || In comparison, automatically acquired regular grammars achieve perplexity between 30 and 40; 3-gram models that use sophisticated smoothing of probabilities result in perplexity of {approx}14 (24-27).

Languages other than English. Applying ADIOS to the Parallel Bible (5) corpus, we compared six different languages through a metaanalysis of their respective ADIOS grammars (see section 4.3 of Supporting Text and Fig. 12). The dendrogram shown in Fig. 3D captures the resulting typological relationships.

Bioinformatics. Although the present work focuses on language, ADIOS also has been tested on various problems in bioinformatics; a representative result is reported here (another result is included in section 6.1 of Supporting Text, Tables 10 and 11, and Fig. 15A). We used the representations acquired by ADIOS from protein sequence data to define a space where each protein is represented by a vector of its root-patterns. We then proceeded to show that this space supports functional classification of proteins, focusing on a family of enzymes whose functionality is captured by a homology tree structure of four levels. Classification of the 6,751 enzymes belonging to this family was tested previously by the SVM-PROT system (28), ** with the proteins described in terms of physical features (polarity, charge, etc.) of their building blocks. Despite using exclusively the raw sequence information, ADIOS attained classification performance comparable with that of the SVM-PROT system (success rate of 95%). The same performance was attained also at the third level of the hierarchy. This result implies that the root-patterns extracted from raw sequence data carry useful functional information.


The ADIOS algorithm differs from other methods of grammar induction in the data it requires and the representations it builds, as well as in its algorithmic approach. Most existing methods require corpora tagged with part-of-speech information (29); the very few exceptions (30-22) are not known to scale up. The extraction of grammatical primitives in published methods may rely on collocation frequencies (30) or on global criteria such as the likelihood of the entire corpus given the grammar (14, 17, 29, 31, 32). In comparison, ADIOS carries out its inferences locally, in the context provided by the current search path, alleviating the credit assignment problem in learning and making productive use of learned structures safer.

In its reliance on transition probabilities, the ADIOS criterion for pattern significance may seem to resemble the well-known approaches that use various mutual information measures in defining syntactic constituents, such as the influential early work described in ref. 33. Mutual information-based approaches, however, are limited by diverse issues such as choosing the context window and alleviating the data sparseness problem (29), which our algorithm largely circumvents. In particular, when ADIOS is iterated, symbols that may have been initially very far apart are allowed to exert influence on each other, enabling it to capture long-range syntactic dependencies such as agreement (Fig. 2F). Furthermore, ADIOS works with raw text or transcribed speech, successfully overcoming the data sparseness problem through the use of a statistically feasible local-context significance criterion.

Although ADIOS makes no prior assumptions about the structures it seeks, the patterns and equivalence classes it learns can be translated in a straightforward manner into the form of rewriting rules. These representations are both expressive enough to support extensive generativity, and, in principle, restrictive enough to capture many of the structure-sensitive aspects of syntax (1) documented by linguists, such as tough movement (see Fig. 8).

It is instructive to consider ADIOS in the context of the problem of language acquisition, which has long been a daunting challenge for cognitive scientists (29, 34, 35). Because a completely bias-free unsupervised learning is impossible (34, 36), the real issue in language acquisition is to determine the model constraints. In our approach, the constraints are defined algorithmically in the form of a method for detecting units (patterns) that are hierarchically structured and supported by context-sensitive statistical evidence. When considered as a model of language acquisition, ADIOS is clearly incomplete, because it currently relies on syntactic regularities and leaves out conceptual knowledge and grounding of speech acts in the external events. Nevertheless, our approach is compatible with a range of findings concerning language acquisition, such as the use of statistical cues (7, 37) and the importance of pattern-like constructions (38-40) (many more links to linguistic theories are offered in Supporting Text). Moreover, it performs well in a wide variety of situations that require unsupervised learning of structural information from untagged data. In grammar induction from large-scale raw corpora, our method achieves precision and recall performance unrivaled by any other unsupervised algorithm. It exhibits good performance in grammaticality judgment tests (including standard tests routinely taken by students of English as a second language) and replicates the behavior of human subjects in certain psycholinguistic tests of artificial language acquisition. Finally, the very same algorithmic approach also is proving effective in other settings where knowledge discovery from sequential data is called for, such as bioinformatics.

BOX 1: The Motif Extraction (MEX) Procedure

Seeking a Pattern Along a Search Path. The MEX procedure is applied for each search path of the ADIOS graph that is to be explored for patterns. In the example of Fig. 1A, this is path no. 1: begin -> e1 -> -> e5 -> end. Other paths may join and leave the search path at various vertices. In this example, the bundle of path sections between e2 and e4 display a certain coherence, possibly indicating the presence of a significant pattern.

Two probability functions are defined over the graph for any given search path. The first one, PR, is the right-moving ratio of fan-through (through-going flux of paths) to fan-in (incoming flux of paths), which varies along the search path. Starting at e1 we define PR at e2 as PR(e1; e2) = (no. of paths from e1 to e2) divided by (total no. of paths entering e1).

At e3 it becomes PR(e1; e3) = (no. of paths from e1 through e2to e3) divided by (total no. of paths from e1to e2). In Fig. 1 A, PR first increases because other paths join to form a coherent bundle, then decreases at e5, because many paths leave the search path at e4. To quantify this decline of PR, which we interpret as an indication of the end of the candidate pattern, we define a "decrease ratio," DR, whose value at e4 is DR(e1; e4) = PR(e1; e5)/PR(e1; e4). We require that it be smaller than a preset "cutoff parameter" {eta} i, starting PR at ei and PL at ej; choose out of all segments the leading significant pattern P for the search path.
      2. rewire graph CREATE a new vertex corresponding to P.
                    * Mode A (context free): replace the string of vertices comprising P with the new vertex P on all paths on which it occurs (Fig. 1C).
                    * Mode B (context sensitive): replace the string of vertices comprising P with the new vertex P only on those paths on which P is significant according to the MEX criterion.

# generalization-first step for each path (Fig. 1 D and E)

      1. slide a context window of size L along the search path from its beginning vertex to its end; at each step i (i = 1, K - L - 1 for a path of length K) examine the generalized search paths: for all j = i + 1, , i + L - 2 do
                  1. define a slot at location j;
                  2. define the generalized path consisting of all paths that have identical prefix (at locations i to j - 1) and identical suffix (at locations j + 1 to i + L - 1);
                  3. perform MEX on the generalized path;

      2. choose the leading P for all searches performed on each generalized path;
      3. for the leading P define an equivalence class E consisting of all the vertices that appeared in the relevant slot at location j of the generalized path;
      4. rewire graph CREATE a new vertex corresponding to P (Fig. 1E) and replace the string of vertices it subsumes with the new vertex P on all paths where it occurs. We list here, and in the next rewiring step, only mode A; in mode B the replacement should occur only on the paths for which the new P is declared significant by MEX.

# generalization-bootstrap for each path

      1. slide a context window of size L along the search path from its beginning vertex to its end; at each step i (i = 1, K - L - 1 for a path of length K) do:
            construct generalized search path for all slots at locations j, j = i + 1, , i + L - 2, do
                  1. consider all possible paths through these slots that start at vertex i and end at vertex K-L-1
                  2. at each slot j compare the set of all encountered vertices to the list of existing equivalence classes, selecting the one E(j) that has the largest overlap with this set, provided it is larger than a minimum overlap {omega} (set to 0.65 in all our experiments);

      2. reduce generalized search path for each k, k = i + 1, , i + L - 2 and all j, j = i + 1, , i + L - 2 such that j != k do
                  1. consider the paths going through all the vertices in k that belong to E(j) [if no E(j) is assigned to a particular j, choose the vertex that appears on the original search-path at location j]; for all j (Fig. 1F);
                  2. perform MEX on this reduced generalized path;

      3. extract the leading P; if the overlap of E(j); sequences with double annotations were removed. All together, 6,751 proteins were analyzed. Classification was tested at level 2 (EC 1.x) and at level 3 (EC 1.x.x).

  To whom correspondence should be addressed. E-mail:

© 2005 by The National Academy of Sciences of the USA

# Phillips, C. (2003) in Encyclopedia of Cognitive Science, ed. Nadel, L. (Macmillan, London) Vol. 4, pp. 319-329.
# Harris, Z. S. (1954) Word 10, 140-162.
# Hemphill, C. T., Godfrey, J. J. & Doddington, G. R. (1990) in Proceedings of a Workshop on Speech and Natural Language, ed. Stern, R. M. (Morgan Kaufmann, San Francisco), pp. 96-101.
# MacWhinney, B. & Snow, C. (1985) J. Comput. Lingustics 12, 271-296.
# Resnik, P., Olsen, M. B. & Diab, M. (1999) Computers Humanities 33, 129-153.[CrossRef][ISI]
# Adriaans, P. & van Zaanen, M. (2004) Grammars 7, 57-68.
# Saffran, J. R., Aslin, R. N. & Newport, E. L. (1996) Science 274, 1926-1928.[Abstract/Free Full Text]
# Seidenberg, M. S. (1997) Science 275, 1599-1603.[Abstract/Free Full Text]
# Marcus, G. F., Vijayan, S., Rao, S. B. & Vishton, P. M. (1999) Science 283, 77-80.[Abstract/Free Full Text]
# Peña, M, Bonatti, L. L., Nespor, M. & Mehler, J. (2002) Science 298, 604-607.[Abstract/Free Full Text]
# Seidenberg, M. S., MacDonald, M. C. & Saffran, J. R. (2002) Science 298, 553-554.[Abstract/Free Full Text]
# Pereira, F. (2000) Philos. Trans. R. Soc. London 358, 1239-1253.[ISI]
# Geman, S. & Johnson, M. (2003) in Mathematical Foundations of Speech and Language Processing, IMA Volumes in Mathematics and Its Applications, eds. Johnson, M., Khudanpur, S., Ostendorf, M. & Rosenfeld, R. (Springer, New York), Vol. 138, pp. 1-26.
# Stolcke, A. & Omohundro, S. (1994) in Grammatical Inference and Applications, eds. Carrasco, R. C. & Oncina, J. (Springer, New York), pp. 106-118.
# Guyon, I. & Pereira, F. (1995) in Proceedings of the Third International Conference on Document Analysis and Recogition, ed. Shen, C. Y. (IEEE Computer Society, Montreal), pp. 454-457.
# Gross, M. (1997) in Finite-State Language Processing, eds. Roche, E. & Schabès, Y. (MIT Press, Cambridge, MA), pp. 329-354.
# Lari, K. & Young, S. (1990) Computer Speech Language 4, 35-56.[CrossRef]
# van Zaanen, M. (2000) in Proceedings of the 18th International Conference on Computational Linguistics, ed. Kay, M. (Saarbrücken, Germany) pp. 961-967.
# Hopcroft, J. E. & Ullman, J. D. (1979) Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, Reading, MA).
# Klein, D. & Manning, C. D. (2004) in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, eds. Daelemans, W. & Walker, M. (Barcelona), pp. 478-485.
# Klein, D. & Manning, C. D. (2002) in Advances in Neural Information Processing Systems, eds. Dietterich, T. G., Becker, S. & Ghahramani, Z. (MIT Press, Cambridge, MA), Vol. 14, pp. 35-42.
# Adriaans, P. & Vervoort, M. (2002) in Grammatical Inference: Algorithms and Applications: 6th International Colloquium: ICGI 2002, Lecture Notes in Computer Science, eds. Adriaans, P., Fernau, H. & van Zaanen, M. (Springer, Heidelberg), Vol. 2484, pp. 293-295.
# Goodman, J. T. (2001) A Bit of Progress in Language Modeling: Extended Version (Microsoft Research, Seattle), Technical Report MSR-TR-2001-72.
# McCandless, M. & Glass, J. (1993) in Proceedings of EuroSpeech'93, ed. Fellbaum, K. (Berlin), pp. 981-984.
# Chelba, C. (2001) in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ed. Swindlehurst, L. (IEEE, Piscataway, NJ), Vol. 1, pp. 544a-544d.
# Roark, B. (2001) Comput. Linguistics 27, 249-276.[CrossRef][ISI]
# Kermorvant, C., de la Higuera, C. & Dupont, P. (2004) J. Électronique d'Intelligence Artificielle, 6.
# Cai, C. Z., Han, L. Y., Ji, Z. L., Chen, X. & Chen, Y. Z. (2003) Nucleic Acids Res. 31, 3692-3697.[Abstract/Free Full Text]
# Clark, A. (2001) Ph.D. thesis (COGS, Univ. of Sussex, Sussex, U.K.).
# Wolff, J. G. (1988) in Categories and Processes in Language Acquisition, eds. Levy, Y, Schlesinger, I. M. & Braine, M. D. S. (Lawrence Erlbaum, Hillsdale, NJ), pp. 179-215.
# Henrichsen, P. J. (2002) in Proceedings of CoNLL-2002, eds. Roth, D. & van den Bosch, A. (Assoc. Computer Linguistics, New Brunswick, NJ), pp. 22-28.
# de Marcken, C. G. (1996) Ph.D. thesis (MIT, Cambridge, MA).
# Magerman, D. M. & Marcus, M. P. (1990) in Proceedings of the Eighth National Conference on Artificial Intelligence, eds. Dietterich, T. & Swartout, W. (AAAI Press, Menlo Park, CA), pp. 984-989.
# Chomsky, N. (1986) Knowledge of Language: Its Nature, Origin, and Use (Praeger, New York).
# Elman, J. L., Bates, E. A., Johnson, M. H., Karmiloff-Smith, A., Parisi, D. & Plunkett, K. (1996) Rethinking Innateness: A Connectionist Perspective on Development (MIT Press, Cambridge, MA).
# Nowak, M. A., Komarova, N. L. & Niyogi, P. (2001) Science 291, 114-119.[Abstract/Free Full Text]
# Gómez, R. L. (2002) Psychol. Sci. 13, 431-436.[CrossRef][ISI][Medline]
# Fillmore, C. J. (1985) Berkeley Linguistic Soc. 11, 73-86.
# Goldberg, A. E. (2003) Trends Cognit. Sci. 7, 219-224.[CrossRef][ISI][Medline]
# Cameron-Faulkner, T., Lieven, E. & Tomasello, M. (2003) Cognit. Sci. 27, 843-874.[CrossRef][ISI]
# Finch, S. & Chater, N. (1991) Artif. Intell. Simul. Behav. Q. 78, 16-24.
# Markman, E. (1989) Categorization and Naming in Children (MIT Press, Cambridge, MA).

Re:PDF of paper (5, Funny)

ksw2 (520093) | about 9 years ago | (#13451673)

Paper here for those who have PNAS access.

HEH! funniest meant-to-be-serious acronym ever.

Re:PDF of paper (2, Informative)

downbad (793562) | about 9 years ago | (#13451799)

The project also has a website [] where you can download crippled implementations of the algorithm for Linux and Cygwin.

Better link for PDF (2, Informative)

Anonymous Coward | about 9 years ago | (#13451816)

PNAS wants you to subscribe to download the PDF.

Or you could just go to the authors' page and download it for free: []

Woah (4, Funny)

SpartanVII (838669) | about 9 years ago | (#13451540)

Imagine if the editors started using this, what would everyone have to bitch about on Slashdot?

Re:Woah (2, Funny)

Anonymous Coward | about 9 years ago | (#13451569)


Re:Woah (0, Offtopic)

Anonymous Coward | about 9 years ago | (#13451674)


Re:Woah (0, Flamebait)

uberdave (526529) | about 9 years ago | (#13451759)

Problem is that dupes and typos are so common here, that the algorithm would consider them part of the language.

Noam Chomsky (2, Informative)

MasterOfUniverse (812371) | about 9 years ago | (#13451547)

This is a perfect apportunity to remind that its Chomsky's contribution to Linuguistics which enabled this amazing (if true) achievement. For those of you don't know Chomsky, he is the father of modern linguistics. Many would also know him as a political activist. Very amazing character. []

Re:Not Chomsky (0)

Anonymous Coward | about 9 years ago | (#13451555)


Re:Noam Chomsky (1)

myowntrueself (607117) | about 9 years ago | (#13451725)

What will be really interesting will be when its exposed to actual natural languages as used by actual, normal, human beings (as opposed to pedants and linguists).

Many english speakers (and writers) appear to actively avoid using the actual *official* rules of english grammar (and I'm sure this is true or other languages and their native speakers too).

I've always assumed that natural language comprehension (as it happens in the human brain) is mostly massively parallel guesswork based on context since (having worked with non-native speakers whose english is *appaling* and yet who I manage to understand pretty well most of the time).

Presumably, this 'gadget' will barf if there really are no rules in such natural language usage...

Or will it just 'pretend' that it found rules? (which is what I imagine that linguists like Chomsky do).

(I don't think I count as a 'linguist' but I did it to stage 3 at university).

Re:Noam Chomsky (1)

Jeremi (14640) | about 9 years ago | (#13451791)

Presumably, this 'gadget' will barf if there really are no rules in such natural language usage...

There must be some rules in natural language, otherwise how would anyone be able to understand what anyone else was saying? The rules used may not be the "official" rules of the language, and they may not even be clearly/consciously understood by the speakers/listeners themselves, but that doesn't mean they aren't rules.

Re:Noam Chomsky (0)

Anonymous Coward | about 9 years ago | (#13451817)

lets not forget the world's best comic:

Re:Noam Chomsky (5, Insightful)

venicebeach (702856) | about 9 years ago | (#13451829)

Perhaps a linguist could weigh in on this, but it seems to me that this kind of research is quite contrary to the Chomskian view of linguistics.

Instead of a language module with specialized abilities tuned to learn rule-based grammar, we have an an unsupervised learning system has surmised the grammar of the language merely from the patterns inherent in the data it is given. That a system can do this is evidence against the notion that an innate grammar module in the brain is necessary for language.

Speaking as someone working on NLP (4, Interesting)

OO7david (159677) | about 9 years ago | (#13451552)

IAALinguist doing computational things and my BA focused mainly on syntax and language acquisition, so here're my thoughts on the matter.

It's not going to be right. The algorithm is stated as being statistically based which while is similar to the way children learn languages is not exactly it. Children learn by hearing correct native languages from their parents, teachers, friends, etc. The statistics come in when children produce utterances that either do not conform to speech they hear or when people correct them. However, statistics does not come in at all with what they hear.

With respect to the learning of the algorithm the underlying grammar of a language, I am dubious enough to call it a grand, untrue claim. Basically all modern views of syntax are unscientific and we're not going to get anywhere until Chompsky dies. Think about the word "do" in english. No view of syntax describes from where that comes. Rather languages are shoehorned into our constructs.

So, either they're using a flawed view of syntax or they have a new view of syntax and for some reason aren't releasing it in any linguistics journal as far as I know.

Re:Speaking as someone working on NLP (2, Insightful)

tepples (727027) | about 9 years ago | (#13451571)

However, statistics does not come in at all with what they hear.

Utterance in pattern A is heard more often than utterance in pattern B; utterances in patterns C and D are not heard at all. How is that not statistics?

Re:Speaking as someone working on NLP (2, Interesting)

OO7david (159677) | about 9 years ago | (#13451643)

Insofar as only utterance A is heard. A kid will always hear "Are you hungry" but never "Am you hungry" or "Are he hungry".

Native speakers by definition speak correctly, and that is all the child is hearing.

Re:Speaking as someone working on NLP (1)

FireFlie (850716) | about 9 years ago | (#13451700)

You have never heard someone speak english using poor grammar? Perhaps this is true for smaller subsets of native language speakers, however not for a society as a whole (america at least).

Re:Speaking as someone working on NLP (1)

OO7david (159677) | about 9 years ago | (#13451720)

That is prescriptive grammar. Descriptive grammar is what linguists and that is actually what people speak.

Consider this, who is in charge of language, an institute or the speakers? Natives cannot be wrong about their own language; they can be wrong on the standard, but A) that standard is always changing and B) given A who then is correct?

Re:Speaking as someone working on NLP (1)

mveloso (325617) | about 9 years ago | (#13451648)

However, have you read the paper? Looked at the data? It seems a bit early for you to dismiss their conclusions based on a blog entry.

Re:Speaking as someone working on NLP (1)

OO7david (159677) | about 9 years ago | (#13451738)

Exactly, I can't be completely quick to dismiss this, but based upon the data given and the fact that I'm working on almost the exact problem as what the algorithm is supposed to solve, it really doesn't mesh.

Re:Speaking as someone working on NLP (2, Informative)

Comatose51 (687974) | about 9 years ago | (#13451685)

Basically all modern views of syntax are unscientific and we're not going to get anywhere until Chompsky dies.

I really don't understand that. How are modern views of syntax unscientific? Also, if Chomsky is such an influence on linguistics, then maybe he's right about it. Aren't you essentially saying that we have no way of arguing with him so let's wait til he dies so he can't argue back? I would think the correct view should win out regardless of the speaker.

Other than what I've studied in cognitive science, I am not in any way or form a linguist. However, what you say really confuses me and contradicts what I've learned. I can only assume that what you say make sense because of your deeper knowledge. So can you please explain what you mean for the rest of us?


Re:Speaking as someone working on NLP (4, Interesting)

OO7david (159677) | about 9 years ago | (#13451749)

It is in effect two parted:

Chomsky is to linguistics as Freud to psych. He had great ideas for the time (many still stand), and the science would be nowhere close to where it is without him. However, A) he's backed off alot of supporting his own theories and B) he's published papers contradicting his original ideas so that is some question there for their veracity. Since so many linguistics undergrads hold him as the pinnical of syntax none are really deviating drastically from him.

WRT the unscientificness, to make his view fit English, there has to be "do-support" which basically is that when forming an interrogative "do" just comes in to make things work without any explanation. In other words, it is in our grammar, but our view of syntax does not account for it.

Re:Speaking as someone working on NLP (1)

Saanvik (155780) | about 9 years ago | (#13451686)

It's not just statistical. From the article (empahsis mine),
ADIOS relies on a statistical method for pattern extraction and
on structured generalization -- two processes that have been implicated in language acquisition.
And while their paper is not being published in a linguistic journal, it is being published in the Proceedings of the National Academy of Sciences (PNAS, Vol. 102, No. 33), which is a well respected cross-discipline journal.

Although I, along with you, am skeptical of this, it sounds like it could be a very interesting article. Don't poo-poo it until you read it.

Re:Speaking as someone working on NLP (1)

OO7david (159677) | about 9 years ago | (#13451761)

Right, I think that it will be a fascinating read and will ultimately help the project I'm currently doing, but my claim is that if it is linguistic then I highly doubt that it will be fully correct given the flawed (IMO) assumptions.

My argument is based soley upon this blog entry, and what it says doesn't quite seem to add up to me.

Re:Speaking as someone working on NLP (1)

NitsujTPU (19263) | about 9 years ago | (#13451704)

Almost all of modern NLP relies on statistical approaches.

I'm not going to chime in and start a flame-war, but since your view is rather iconoclast, I think it only fair to point this out to the Slashdot audience, who are probably not as informed on the topic as you or I.

Re:Speaking as someone working on NLP (0)

Anonymous Coward | about 9 years ago | (#13451724)

Perhaps you should read the man's papers (Edelman). I've read perhaps a dozen or two and it's clear that he is a thinker exploring possibilities. That doesn't mean that he's correct (in all instances) rather he formalizes his thoughts via computational and sometimes behavioural models.

That said, I have not read this paper...

Re:Speaking as someone working on NLP (0)

Anonymous Coward | about 9 years ago | (#13451764)

Statistics are used in NLP because describes the raw facts, not because language is statistically based. Recurrent pattern processing is the most promising approach to describe how language works.

Speaking as someone working on bioinformatics CFGs (0)

Anonymous Coward | about 9 years ago | (#13451774)

I'm inclined to agree. IAAcomputational biologist [] doing bioinformatics-y algorithm things, and I am skeptical of automated grammar discovery. Automatic motif discovery with HMMs is one thing --- that works well, and I suspect that's basically what their bioinformatics results are yielding here (since SCFGs are a superset of HMMs). CFG-related algorithms are great for RNA analysis (I've written a few [] of them [] ). I haven't read the article in detail, but CFGs aren't overwhelmingly well suited to proteins (which lack the nested-clause structure typical of RNA, for example (and programming languages too, as it happens)). One question I might ask is "how well does this perform when applied to a particular task?" --- the authors mention (in the context of proteins) automated functional classification; I'd be curious to see if this is basically reproducing the results of HMM-like approaches [] .

Re:Speaking as someone working on NLP (2, Informative)

lawpoop (604919) | about 9 years ago | (#13451787)

IIRC, the part of Chomsky's theory that is relevant to this application is that universal grammar is a series of choices about grammar -- i.e. adjectives either come before or after nouns, there are or are not postpositions, etc. I think the actual 'choices' are more obscure, but I'm trying to make this understandable ;)

According to the theory, children come with this universal grammar built-in to their mind (for some reason, Chomsky seems against genetic arguments, but good luck understanding his reasoning), and they only need to hear just a little bit of language in order to throw the choose the proper alternatives in the mind, and start building grammatically correct sentences -- the rest is just building vocabulary. What seems like a child learning language is actually the language part of the brain growing during development. I believe that these choices are called 'switches' by Chomsky.

(An easy argument for universal grammar is that children make mistakes that are more rule-following than the accepted grammar -- words such as 'breaked', 'speaked', 'foots' or 'mouses' are in a sense rule-based corrections of exceptions in the spoken language. So the children follow the rules more closely than the adults -- they certainly didn't learn them from adults, so the must be applying the rules in their minds.)

Anywho, to make a program like this, you would just have to put together the switches of universal grammar and then feed in sample data -- probably text spelled with those linguistic homophonic characters, instead of the horrendous English spelling non-system. Chinese characters might be a better sample data in that respect; I don't know.

Note that, contrary to other posters, this would not be a system for building grammatical rules for any 'language' or formal system, such as C++ or xml. It is based on universal grammar, which is a set of option for constructing a human language. So there are built in assumptions that there will be subject, verbs, objects, indirect objects, etc. in the language that this program is decoding.

Re:Speaking as someone working on NLP (4, Interesting)

PurpleBob (63566) | about 9 years ago | (#13451794)

You're right about Chomsky holding back linguistics. (There are all kinds of counterarguments against his Universal Grammar, but people defend it because Chomsky Is Always Right, and Chomsky himself defends it with vitriolic, circular arguments that sound alarmingly like he believes in intelligent design.)

And I agree that this algorithm doesn't seem that it would be entirely successful in learning grammar. But this is not because it's statistical. I don't understand how you can look at something as complicated as the human brain and say "statistics does not come in at all".

If this algorithm worked, then it could be statistical, symbolic, Chomskyan, or magic voodoo and I wouldn't care. There's no reason that computers have to do things the same way the brain does, and I doubt they'll have enough computational power to do so for a long time anyway.

No, the flaws in this algorithm are that it is greedy (so a grammar rule it discovers can never be falsified by new evidence), and it seems not to discover recursive rules, which are a critical part of grammar. Perhaps it's learning a better approximation to a grammar than we've seen before, but it's not really doing the amazing, adaptive, recursive thing we call language.

Re:Speaking as someone working on NLP (0)

Anonymous Coward | about 9 years ago | (#13451808)

You say:

"Basically all modern views of syntax are unscientific and we're not going to get anywhere until Chompsky dies. Think about the word "do" in english. No view of syntax describes from where that comes. Rather languages are shoehorned into our constructs."

What exactly is so puzzling about the word "do"? Its use in English is complex, but hardly counts as a great puzzle of linguistics.

I'm a linguist, and I'm pretty sure that you're either (1) Lying about "working on NLP" or (2) a bad mistake on the part of your employers. You seem to have only the vaguest idea what you're talking about.

Also--while normally I wouldn't consider a spelling glitch very significant, you misspell "Chomsky" in a way that doesn't make sense as a typo, but is a common misspelling among people who aren't very familiar with the name or have heard it spoken more that they've seen it written. Again, I'm pretty sure you have no clue what you're talking about.

Wow! (4, Funny)

the_skywise (189793) | about 9 years ago | (#13451557)

They've rediscovered the Eliza program!

Input: "For example, the sentences I would like to book a first-class flight to Chicago, I want to book a first-class flight to Boston and Book a first-class flight for me, please may give rise to the pattern book a first-class flight -- if this candidate pattern passes the novel statistical significance test that is the core of the algorithm."

How does it feel to "book a first-class flight"?

Isn't This the Universal Translator Idea (0, Redundant)

TastyWheat (302413) | about 9 years ago | (#13451558)

From Star Trek???

Re:Isn't This the Universal Translator Idea (2, Interesting)

biryokumaru (822262) | about 9 years ago | (#13451621)

In Star Trek 4, the universal translator was little help when the humpback whale armada arrived... No, seriously, that was one f**ked up movie.

But for this, I have one word: Dolphins.

So long and thanks... (-1, Offtopic)

Anonymous Coward | about 9 years ago | (#13451665)

for a realy stupid song...

Re:Isn't This the Universal Translator Idea (1)

Punboy (737239) | about 9 years ago | (#13451669)

The Universal Translator is tuned specifically for the sounds put out by standard humanoid lifeforms. Humpback whales use both much higher and much lower pitched sounds. The universal translator was not designed to translate such things, as would not be able to translate it.

Re:Isn't This the Universal Translator Idea (1)

Tontoman (737489) | about 9 years ago | (#13451723)

A real universal translator (artificial intelligence) would have to have many thousands of words of text to use as examples, so a language could be learned. [] That is why many mechanical translation systems start with word lists and dictionaries to give the learning process a head start.

Grammar depends on the input (3, Interesting)

Tsaac (904191) | about 9 years ago | (#13451562)

If fed with a heap of decent grammar, what happens when it's fed with bad grammar and spelling? Will it learn, and incorporate, the tripe or reject it? That's the sort of problem with natural language apps, it's quite hard to sort the good from the bad when it's learning. Take the megahal library> for example. Although possibly not as complex, it does a decent job at learning, but when fed with rubbish it will output rubbish. I don't think it's the learning that will be that hard part, but rather the recognition of the good vs. the bad that will prove how good the system is.

Re:Grammar depends on the input (0)

innerweb (721995) | about 9 years ago | (#13451691)

That would be called slang.


what happens when it's fed with bad grammar... (1, Funny)

Anonymous Coward | about 9 years ago | (#13451701)

Let's feed it slashdot and find out.

Re:Grammar depends on the input (0)

Anonymous Coward | about 9 years ago | (#13451719)

If fed with a heap of decent grammar, what happens when it's fed with bad grammar and spelling? Will it learn, and incorporate, the tripe or reject it? That's the sort of problem with natural language apps

Thats also the problem with slashdot

But seriously what your describing is the evolution of language or on a smaller scale, accents and regional dialects, etc. It only seems natural that the language app would also evolve and adapt to this 'tripe'.

the recognition of the good vs. the bad that will prove how good the system is.

Not as easy as it sounds, even people (with too much time on their hands) quibble about these sorts of things. Often both are correct with respect to the language they have learnt. The language I've leanrt is quite different from the language my grandparents learnt (yes its both English) Quebec French is different from Perisian French. Here are some examples:

"Did you get your invite?" Is that acceptable English?

Whats the French word for a female professor? (in Quebec it's professeuse while in France it's professeur

"Long time, no see" Is that acceptable English.

The answers to the questions, are of course relative.

Re:Grammar depends on the input (2, Interesting)

jim_v2000 (818799) | about 9 years ago | (#13451826)

The problem with this program is that you could input the most gramatically correct sentences you can into it, and it'll still spew out senseless garbage. For this to be of any worth, the computer will need to understand the meaning each word, and how each meaning relates to what the other words in the sentence mean. And you can't program it into a computer what something is just by putting words into it. Like if I tell the machine that mice squeak, it has to know what a squeak sounds like and what a mouse is. How do you define a mouse to a computer? A small fuzzy rodent. Well, how do you define fuzzy? Or small? Or a rodent? You have to keep using more and more words...and still the computer will have no idea what you're talking about, other than just mroe word relationships.

I guess the missing thing is that a human can evision the meaning of the words as a concept or image, while the computer simply sees the words as, well, just words (or binary to specific).

Same as everything computer-based (0)

scdeimos (632778) | about 9 years ago | (#13451832)

Garbage in, garbage out.

Protein sequences? (1, Interesting)

Anonymous Coward | about 9 years ago | (#13451565)

Let's see what human DNA really says and means!

Re:Protein sequences? (1)

Concerned Onlooker (473481) | about 9 years ago | (#13451809)

That's easy. kill, eat, mate, sleep, kill, eat, mate, sleep....

Re:Protein sequences? (1)

Safe Sex Goddess (910415) | about 9 years ago | (#13451830)

This sounds like a good idear. DNA just seems so much like Code of the Life Maker.

Not knowing anything about science, I sometimes wonder if DNA was brought to the solar system by some sort of seeding program.

Any bio geeks out there care to enlighten a bio-knowledge impoverished pupil?

Oh, and whip up a universal microbiocide to keep sex safe while you're at it:-)

Finally some progress (2, Funny)

drjimmy42 (744688) | about 9 years ago | (#13451568)

I know we all feel like we've been screwed by the conspicuous lack of flying cars around these days, but at least some progress is being made on the Universal Translator front...

Re:Finally some progress (2, Insightful)

TastyWheat (302413) | about 9 years ago | (#13451581)

I'm starting to get the feeling that there nothing in sci fi that won't occur in reality. Except for the dorky guy getting to nail the hot busty alien babe that is. heh.

How about Klingon? (1)

irving47 (73147) | about 9 years ago | (#13451578)

Or better yet, start feeding it images of crop circles that haven't been proven to be fakes (yet.)

Useful in games? (1)

.DS_Store (903445) | about 9 years ago | (#13451590)

How long before this technology makes its way into the field of game AI? Imagine a game such as Deus Ex or SW:KoToR where you don't merely choose your response to NPC's from a predefined list, you type in your answer! Such technology combined with the simple contextual inferences that drive such oldies like Dr. Sbaitso and its Mac-equivalent Eliza could potentially launch interactivity in games to a whole new level.

Sure it's a long shot, but I can dream can't I?

Been there, done that (1)

Toloran (858954) | about 9 years ago | (#13451784)

How long before this technology makes its way into the field of game AI? Imagine a game such as Deus Ex or SW:KoToR where you don't merely choose your response to NPC's from a predefined list, you type in your answer!

Re:Useful in games? (0)

Anonymous Coward | about 9 years ago | (#13451788)

Or better yet, talk to the guy! with a microphone and everything.

Length? (1)

Parham (892904) | about 9 years ago | (#13451594)

I read the article and had a few questions. How long does the analyzed text have to be for the algorithm/program to pick up the grammar rules? I mean if it takes long documents and a ton of time, is it really worth it? Also, if it can only recognize languages we already know (and can only read those characters), how useful is this thing? Why not just hardcode grammar rules then? (probably a stupid question, but its an exaggeration of what I was thinking).

Re:Length? (1)

Joe Random (777564) | about 9 years ago | (#13451682)

I mean if it takes long documents and a ton of time, is it really worth it?
Probably. It's not like there's a shortage of long documents. I mean, pretty much all areas of publishing use electronic documents at some point during the process. I'm sure that the researchers could feed the algorithm a few dozen novels fairly quickly.

As for the running time, it'd probably take a while, but computing power gets cheaper every day.
Also, if it can only recognize languages we already know (and can only read those characters), how useful is this thing?
Who says it can only recognize languages we already know? I suspect that it would be fairly simple to adapt the algorithm to just look at visual input and find patterns in an unknown language.

Even better, if you started feeding it equivalence rules between that language and a known one, I'd imagine that it could eventually start to translate words whose meaning you hadn't explicitly taught it. How cool would that be?

Hieroglyphics? (2, Interesting)

Hamster Of Death (413544) | about 9 years ago | (#13451601)

Can it decipher these things too?

Let's see what it thinks of this (1, Funny)

MillionthMonkey (240664) | about 9 years ago | (#13451602)

A few years ago, thousands of posts like this were flooding alt.relgion.scientology:
His thirteenth unambiguity was to equate Chevy Meyer all my nations. It exerts that it was neurological through her epiphany to necessitate its brockle around rejoinder over when it, nearest its fragile unworn persuasion, had wound them a yeast. Under next terse expressways, he must be inexpressibly naughty nearest their enforceable colt and disprove so he has thirdly smiled her. Nor have you not secede beneath quite a swamp? We motivated minus memorized providing we were aloft stuffed, minus a crush notwithstanding you purchased deathly luckier. Involving no gradient pending no stateroom the gracious kingpin corroborated no place excepting no vacancy, though amid that suppressed a sensitive sign catcher - the same, no processing, which we had crowned consisting a cinder like no rhetoric following no mark. Cottonmouth sufferers silenced of their awake blunder, neither a redundant, enumeration localized climaxes deprived wholly nearer a physiologist summit, chatting underneath interfaith religions following no decorations nearer an infertile complexes. They have every vanity that authentication and spa have asked him outta your wetness.
After a time the attacks stopped. The algorithm that generated them was a slightly better author than L. Ron Hubbard and left a.r.s. to found its own religion.

Re:Let's see what it thinks of this (2, Insightful)

One Div Zero (851169) | about 9 years ago | (#13451638)

That's just a Markov Model [] that "learned" from what looks religious mumbo jumbo in the first place.

Markov models are perhaps the easiest language acquisition model to implement, but also one of the worst at coming up with valid speech or text.

Interestingly, they do much, much better as recommender systems.

Re:Let's see what it thinks of this (1)

proteonic (688830) | about 9 years ago | (#13451796)

It's not that interesting, since they're usually first order systems. Every n+1st work will make sense with every nth word, but there's no long range (second order or higher) dependency in most of those models. (requires too much space, usually). That's why they do well as recommender systems, but suck for generating comprehensible sentences.

What is actually new about this (1)

NitsujTPU (19263) | about 9 years ago | (#13451611)

Actually, what is actually new about this is the unsupervised approach. Graph parsers are quite popular in Natural Language Processing applications, but they use supervised methods.

Patent Pending (1, Insightful)

Anonymous Coward | about 9 years ago | (#13451612)

Unlike all the ridiculous patents being granted lately to IT companies, the one these guys are filing for, to me, seems legitimate. Its a nice change in my mind.

Dupe (-1, Troll)

Anonymous Coward | about 9 years ago | (#13451615)

Dupe []

Incredible (1, Interesting)

Anonymous Coward | about 9 years ago | (#13451624)

I hope the material is (or will be made) accessible to laypersons. I'd love to be able to use this algorithm for my own music experiments.

Dupe (2, Informative)

fsterman (519061) | about 9 years ago | (#13451629)

We just had an article on this. There was a shootout by NIST. At least I think, /. search engine blows, hard. Either way, here a link [] to the tests. This is one that wasn't covered by the tests, so I guess its front page news.

The Ultimate Test... (0)

creimer (824291) | about 9 years ago | (#13451630)

They should play an old country record backwards and see if the computer can confirm that the poor chump gets his truck, dog, shotgun and woman back. Could also try playing an old rock record backwards but computer might get possessed by a demon.

Only a matter of time. (1, Funny)

webby123 (911327) | about 9 years ago | (#13451642)

Scientists: Wow joe, the computer just went out and read the entire slashdot archive. Computer: Futile humans, I will p0wn you! *Lifting a bionic hand and swiping the scientist to the side* Scientists: Mac turn it off! Turn it offff argghh! Joe I just don't havvee teh power! Reaching for the coord. Computer: Woa.. wohaahaa. All your base is mine. Make your time.. *The lights go out*

Re:Only a matter of time. (-1)

Anonymous Coward | about 9 years ago | (#13451736)

Thanks in advance for modding this douchebag down.

Programming Language (2, Interesting)

jmlsteele (618769) | about 9 years ago | (#13451680)

How long until we see something like this applied to ?

Chinese room (1)

pan_filler (911745) | about 9 years ago | (#13451697)

Great, now we can actually build a chinese room!

I thank you for yo0r `time (-1, Offtopic)

Anonymous Coward | about 9 years ago | (#13451708)

co0ld 5ave it []

Finaly (2, Interesting)

Trigulus (781481) | about 9 years ago | (#13451742)

something that can make sense of the voynich manuscript [] . They should have tested their system on it.

Universal Translator? (2, Interesting)

mwilli (725214) | about 9 years ago | (#13451762)

Could this be integrated into a handheld device to be used as a universal translater much like a hearing aid?

Electronic babelfish anyone?

Is it SEQUITUR? (0)

Anonymous Coward | about 9 years ago | (#13451807)

Isn't this what SEQUITUR ( [] is supposed to do?

Run it on the bible and get... (2, Funny)

Trigulus (781481) | about 9 years ago | (#13451812)

God loves you. God will burn you in hell for all eternity. God wants more foreskins.

How "intricate"? (2, Insightful)

P0ldy (848358) | about 9 years ago | (#13451814)

Our experiments show that it can acquire intricate structures from raw data, including transcripts of parents' speech directed at 2- or 3-year-olds. This may eventually help researchers understand how children, who learn language in a similar item-by-item fashion and with very little supervision, eventually master the full complexities of their native tongue."

In addition to child-directed language, the algorithm has been tested on the full text of the Bible in several languages
I hardly would consider transcripts of parents' speech directed at 2- or 3-year-olds "intricate". And while the algorithm may have "been tested on the full text of the Bible", it doesn't say with what percentage of accuracy or what translation. King James version or the Teen Magazine Bible?

And the "rules" of a language are NOT what children "learn". First of all, children acquire a language, they do not "learn" it. That is a large attribute to the child's ability to speak it--not whether or not they understand gerunds and the pluperfect.

Second, in a language such as English whose words for the most part lack any necessity to the order in which they're placed to understand they're meaning and, even worse, lack declension forms to distinguish subject from object of the preposition, with what success can a language recognition program have "learning" such a language when prepositions themselves mainly can be omitted? To teach a computer Latin is easy.

Third, what's the hope of the computer ever understanding something like Shakespeare, Joyce, or Dante, whose uses of language rely extensively on erudition for word placement as opposed to typical usage? While a computer might be able to learn Latin because of its rigourous rules, I doubt it could faithfully render a text from Ovid.

Not really new. (1) (142825) | about 9 years ago | (#13451819)

While working for a nutcase [] . I spoke with with Philip Resnik about his project of building a href=" lel/bible.html">parallel corpus as a tool to build a language translation system. This seems like the next logical step.
Load More Comments
Slashdot Login

Need an Account?

Forgot your password?

Submission Text Formatting Tips

We support a small subset of HTML, namely these tags:

  • b
  • i
  • p
  • br
  • a
  • ol
  • ul
  • li
  • dl
  • dt
  • dd
  • em
  • strong
  • tt
  • blockquote
  • div
  • quote
  • ecode

"ecode" can be used for code snippets, for example:

<ecode>    while(1) { do_something(); } </ecode>