
New Online Dictionaries Automate Away the Linguistic Middleman

timothy posted more than 2 years ago | from the boncha-porftis-hworkin dept.

The Internet 60

An article in The New York Times highlights two growing collections of words online that effectively bypass the traditional dictionary publishing system of slow aggregation and curation. Wordnik is a private venture that has already raised more than $12 million in capital, while the Corpus of Contemporary American English is a project started by Brigham Young professor Mark Davies. These sources differ from both conventional dictionary publishers and crowd-sourced efforts like the excellent Wiktionary for their emphasis on avoiding human intervention rather than fostering it. Says founder Erin McKean in the linked article, 'Language changes every day, and the lexicographer should get out of the way. ... You can type in anything, and we'll show you what data we have.'


60 comments

Isn't that called Googling? (3, Insightful)

hawks5999 (588198) | more than 2 years ago | (#38557316)

"You can type in anything and we'll show you the data we have" sounds a lot like Google search.

Re:Isn't that called Googling? (1)

Sqr(twg) (2126054) | more than 2 years ago | (#38557618)

The difference should be in the prioritizing of results. The first few pages from Google might give only hits based on the most common meaning of a word, while Wordnik, according to TFA, should group citations by meaning.

In practice, this didn't seem to work for the words I tried.

Re:Isn't that called Googling? (4, Insightful)

Samantha Wright (1324923) | more than 2 years ago | (#38558474)

Here's the results for 'magic' [wordnik.com].

Gee, it sure looks like they're returning random search engine results next to -- oh look, a list of opinions as proffered by so-called "linguistic middlemen."

I like how the top example for how 'magic' is used in English isn't even purely English, but a bullet point about features in the Zend framework. I'll make a habit of saying "__magic()" in everyday speech more often!

I think the worst outcome of this is that PHP now somehow has influence on the evolution of a natural language. I do not believe I am alone in feeling terrified by this prospect.

Re:Isn't that called Googling? (1)

Jezral (449476) | more than 2 years ago | (#38561698)

Then maybe what you want is DeepDict. E.g., magic is used like http://gramtrans.com/deepdict/lookup.php?word=magic&class=N&lang=eng&top=200 [gramtrans.com] - it is not free, though all words starting with 's' are currently open to viewing for anyone.

It yields info such as: black magic, Orlando magic, ceremonial magic ... magic kingdom, magic roundabout, magic flute ... practice magic, radiate magic ... magic of animation ... etc

(disclaimer: I work on the DeepDict project)

Re:Isn't that called Googling? (0)

Anonymous Coward | more than 2 years ago | (#38562892)

There were two instances of 'magic' in that sentence, what makes you think the dictionary referred to the second one and not the first one?

The __magic() might very well be invoking black magic as the sentence indicates, how are you to know?

CCAE isn't that nontraditional (2)

Trepidity (597) | more than 2 years ago | (#38557352)

CCAE is an annotated corpus more than a dictionary. It counts words, word co-occurrences, etc. It's also manually annotated with parts of speech and other such things, not fully automated. Its scope is bigger and more recent than what was possible before computers, but the general idea is ancient: 18th-century classicists would manually compile frequency and word co-occurrence tables for ancient languages to try to get an understanding of their structure.
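As a rough illustration of the kind of counting described above, here is a minimal Python sketch. It is not how the corpus itself is built: the `tokenize`, `word_frequencies`, and `cooccurrences` helpers, the two-token window, and the sample sentences are all made up for this example, and it does no part-of-speech annotation at all.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(sentences):
    """Count how often each word occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def cooccurrences(sentences, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    pairs = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, left in enumerate(tokens):
            # Look only at the next `window` tokens to the right.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    pairs[tuple(sorted((left, right)))] += 1
    return pairs


# Hypothetical sample text, standing in for a real corpus.
sample = [
    "He practiced black magic.",
    "The magic flute played on.",
    "Black magic is not real magic.",
]

freq = word_frequencies(sample)
pairs = cooccurrences(sample)
print(freq.most_common(3))
print(pairs.most_common(3))
```

A real corpus tool would add sentence splitting, lemmatization, and statistical association measures on top of raw counts; this only shows the basic frequency and co-occurrence tables.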

Re:CCAE isn't that nontraditional (1)

hedwards (940851) | more than 2 years ago | (#38557560)

Having access to a good corpus is really helpful, but once you start hitting the 2k word count, additional entries aren't really that helpful to anybody other than hardcore linguists. At that point it's generally more helpful to have information about what words frequently travel together and where they're likely to appear in a sentence.

Re:CCAE isn't that nontraditional (1)

gsnedders (928327) | more than 2 years ago | (#38557798)

It depends what you're doing. I've spent a while dealing with the Scottish Corpus of Texts and Speech, and there the size is around four million words. If you're doing anything based upon dialects, size does make a very real difference, because you're interested in the density of usage by area. Personally, even in a non-linguistic context, I find it useful to know whether someone in x is likely to know (by virtue of using) a word y.

Re:CCAE isn't that nontraditional (1)

RancidPeanutOil (607744) | more than 2 years ago | (#38564360)

If you're a trained computational linguist then obviously yes, you can generate this information (collocations and concordances) from corpora. The COCA actually has some pretty neat features along those lines, but it's not user-intuitive.

Good idea? (1)

colinrichardday (768814) | more than 2 years ago | (#38557358)

At the risk of being elitist, I wonder if I should adjust my use of language to that of the average American.

Re:Good idea? (1)

rolfwind (528248) | more than 2 years ago | (#38557526)

It's inevitable, language always adjusts to popular usage eventually, even with guards in place that act as filters.

Though I still cringe when people say they "could care less."

Not that all rules set in place by self-anointed authorities make sense. I never understood why end-of-sentence punctuation should appear inside quotations, especially if it might not match what was quoted, like making a question out of a sentence.

Re:Good idea? (4, Funny)

vlm (69642) | more than 2 years ago | (#38557550)

Though I still cringe when people say they "could care less."

That begs the question if inappropriate use of "begs the question" is like, worse, like, than like using the word like, like in as the first like word after every like lung inhalation. I think that is a full 360 degree reversal from your suggestion.

Re:Good idea? (2, Informative)

Anonymous Coward | more than 2 years ago | (#38557576)

You're sentence could of been improved if you had leveraged a preposition to end it with.

Re:Good idea? (1)

Trepidity (597) | more than 2 years ago | (#38557648)

To be quite honest, it's not an, uh, a very uncommon pattern of speech, if I may say so, to interject one's spoken English with, discourse... discourse particles, and, well, other minor disfluencies, which do--- which do vary by social class, but more in, uh, word choice than in what you might call actual frequency.

Re:Good idea? (2)

AlienIntelligence (1184493) | more than 2 years ago | (#38558752)

Though I still cringe when people say they "could care less."

That begs the question if inappropriate use of "begs the question" is like, worse, like, than like using the word like, like in as the first like word after every like lung inhalation. I think that is a full 360 degree reversal from your suggestion.

I live in the corner of a quad of homes that creates an interesting
amplifying effect of sounds, within the area. So that a house that
is completely on the other side, hundreds of feet away, you can
clearly hear people talk. [Yeah, it DOES suck].

So, the other day, I heard this teen-thing speaking to her folks
and about the 20th like, I was gonna "say loudly" since that's
all one has to do...

"Like will you shut the fuck up"

But tis the season and all that crap.

-AI

Re:Good idea? (1)

colinrichardday (768814) | more than 2 years ago | (#38557652)

I never understood why end-of-sentence punctuation should appear inside quotations, especially if it might not match what was quoted, like making a question out of a sentence.

So I'm not the only one? Yeah!!! Although I believe that you may have a question mark outside of quotes if the sentence (and not the quoted material) is a question.

Re:Good idea? (5, Insightful)

Samantha Wright (1324923) | more than 2 years ago | (#38558558)

Oh, that's purely typographical. When moving blocks of metal type around, a full-stop/period or comma is more delicate than a quotation mark, since it's only x-height and not capital letter height. Typographers got in the habit of putting them on the inside to keep them safe. That's also why certain ligatures of f and the long s were preserved from scribal writing: those letters were designed to hook over others, and if the next letter was tall then it would create a structural instability (an x-height hole.) If modern punctuation had evolved before the invention of moveable type, we would probably put the quotation mark directly above the other punctuation mark, and use logical punctuation for ? and !. However, it didn't, so it was all put inside to stay consistent.

To be honest, I find it visually more pleasant. After looking at code that passes strings around as arguments in C-style imperative languages all day, it's nice to see something without a big gap on the baseline (this "is," an "example", for you.) Since the quotation mark is already floating up and away from the letters, it's less jarring to see it separated from the word than a comma or period. (This is more or less the modern aesthetic justification for keeping it the traditional way. However, modern typographers don't always agree with traditionalists: watch what happens when you point out that the "single" space used to separate sentences prior to the invention of the typewriter was actually larger than a standard double space.)

Re:Good idea? (1)

lsatenstein (949458) | more than 2 years ago | (#38564768)

What about punctuation for other languages such as and or the Spanish inverted question mark at the beginning and ? at the end of a question

Re:Good idea? (1)

Samantha Wright (1324923) | more than 2 years ago | (#38565156)

I didn't know this one off the top of my head, but Wikipedia says [wikipedia.org] they were introduced in Spanish in 1754 because there's no way to recognize that a sentence is a question just from looking at the words; it's purely a tonal difference—and for really long sentences it can get disorienting if you have to go back and re-read it because you just found out that it was a question when you got to the end. I imagine the exclamation point was just made to be consistent.

What were the other symbols you tried to type? Guillemets [wikipedia.org]?

Re:Good idea? (0)

Anonymous Coward | more than 2 years ago | (#38561412)

"I never understood why end-of-sentence punctuation should appear inside quotations, especially if it might not match what was quoted, like making a question out of a sentence."

For what it's worth, it's now accepted that you DON'T need to do it that way. "Logical quoting" is the alternative. It works exactly like most coders would expect:

http://home.swipnet.se/sunnanvind/logical.html

http://catb.org/jargon/html/writing-style.html

Wikitionary? (-1)

Anonymous Coward | more than 2 years ago | (#38557396)

Really? The excellent Wikitionary? It is filled with just as much garbage as Wikipedia. Both of them are terrible as first sources. Try The Free Dictionary [thefreedictionary.com] instead.

Re:Wikitionary? (4, Informative)

Trepidity (597) | more than 2 years ago | (#38557464)

"The Free Dictionary" appears to be just a spammy repackaging of Wikipedia content. Lots of their articles even have a footer saying they're licensed under the GFDL from Wikipedia.

Citation needed (0)

Anonymous Coward | more than 2 years ago | (#38557440)

"Slow" sources that take the time to verify things are what's needed to become reliable sources, which Wikipedia cites. Unfortunately "new" ideas can be a victim of deletionism.

Great Idea (1)

Anonymous Coward | more than 2 years ago | (#38557456)

Let's eliminate the making-sense and explaining that human beings can do. The absurdity of most spell check and voice recognition "did you mean" suggestions doesn't give me much hope that it's all just a matter of having enough data. Yes, Google can seem almost prescient, but only if thousands of other people are looking for the same things as I am. When I could really use a hint, Google never comes up with something useful. On the contrary, then I have to coax it not to replace my carefully selected search term with something "more popular" that hasn't got anything to do with what I'm looking for.

What are these guys? (1)

vlm (69642) | more than 2 years ago | (#38557540)

What are these guys? All we get is what they're not:

traditional dictionary publishing system

slow aggregation

curation

crowd-sourced effort

human intervention

I'm guessing they are also not street taco vendors, Catholic priests or Christmas tree salesmen. Great, that really narrows it down. So, what are they? I mean in terms of workflow, or data diagrams, or even user experience. And who are their users, anyway? Unless they provide a really good reason, the rest of the world will continue to use Wikipedia/Wikimedia products, Google (let's face it, mostly Google), and the Urban Dictionary (dare I invoke Encyclopedia Dramatica?)

Re:What are these guys? (0)

Anonymous Coward | more than 2 years ago | (#38557882)

An excellent way to find the information you're looking for would be to visit the links in the story. Since you seem to have misplaced them, here's another copy for your convenience:

Furthermore, the summary actually contains the text "These sources differ from both conventional dictionary publishers and crowd-sourced efforts like the excellent Wiktionary for their emphasis on avoiding human intervention rather than fostering it."

How did you manage to make your post and miss all of the above?

So then ... (2)

PPH (736903) | more than 2 years ago | (#38557556)

... if it's used, it is automatically entered into this 'dictionary'. On one hand, I shudder to think of the direction that various languages might take. On the other hand, there could be hope for words like malamanteau [xkcd.com]. That seems perfectly cromulent to me.

Lexicographers out of the way (3, Informative)

Compaqt (1758360) | more than 2 years ago | (#38557566)

Obviously, I'd suppose you still needed a few lexicographers to come up with the system.

And to maintain it, right?

The problem seems to be when you've put 95% of lexicographers out of a job, who's going to train the next bunch, and will it be cost-effective at a university level to have a graduate program in such for 1 or 2 individuals?

Re:Lexicographers out of the way (4, Funny)

VortexCortex (1117377) | more than 2 years ago | (#38558062)

Obviously, I'd suppose you still needed a few lexicographers to come up with the system.

And to maintain it, right?

The problem seems to be when you've put 95% of lexicographers out of a job, who's going to train the next bunch, and will it be cost-effective at a university level to have a graduate program in such for 1 or 2 individuals?

Syntax error on line(s): 1 thru 1
Ambiguous contraction in "I'd".

Syntax error on line(s): 1 thru 1
Mixed tense in "still needed".
Note: Root word "need" satisfies the expression.

Syntax error on line(s): 3 thru 3
Incomplete sentence.

Syntax error on line(s): 5 thru 5
Expected colon after "be" in "to be when".

Syntax error on line(s): 5 thru 5
Expected capitalization of "when" in "to be when".

Syntax error on line(s): 5 thru 5
Extraneous comma.
Note: This message is generated only once for multiple errors.

Point taken: Screw the Lexicographers!

Re:Lexicographers out of the way (1)

smellotron (1039250) | more than 2 years ago | (#38560710)

Syntax error on line(s): <snip>

I would like to subscribe to your newsletter. Do you provide an Outlook plugin?

Re:Lexicographers out of the way (1)

Livius (318358) | more than 2 years ago | (#38560750)

Syntax error: unknown token "thru"

Re:Lexicographers out of the way (0)

Anonymous Coward | more than 2 years ago | (#38561448)

http://en.wiktionary.org/wiki/thru

words will not escape us anymore? (1)

Mister Liberty (769145) | more than 2 years ago | (#38557568)

So if I type in "anything" I won't get just an interpreted response
but really -- what... everything?

bjd

Wordnik is a dictionary aggregator (3, Funny)

NaCh0 (6124) | more than 2 years ago | (#38557870)

I wonder what kind of sales pitch it takes to get $12 million for a free web dictionary.

'Just imagine if we could provide 100 definitions from other people for the word "butt", how much is that worth to you?'

Re:Wordnik is a dictionary aggregator (1)

eulernet (1132389) | more than 2 years ago | (#38558998)

Totally agree, and it seems that their data is not cross-checked at all:

http://www.wordnik.com/words/internet [wordnik.com]

        antonyms
        Words with the opposite meaning:
                World Wide Web

WTF ?

Internet != WWW (1)

tepples (727027) | more than 2 years ago | (#38560362)

It might have something to do with tech sites that take pains to point out that the Internet is not just the World Wide Web.

Re:Internet != WWW (0)

Anonymous Coward | more than 2 years ago | (#38562618)

"!=" != "antonym"

lameness filter is lame

Telivision (4, Insightful)

aembleton (324527) | more than 2 years ago | (#38557932)

It doesn't detect that telivision is an incorrect spelling because there are so many authoritative examples of that spelling: http://www.wordnik.com/words/telivision [wordnik.com]

Google seems to do a good job of detecting spelling errors and automatically updating its dictionary, and of course it also shows you websites where that word is used. I don't really see what Wordnik provides.

Re:Telivision (0)

Anonymous Coward | more than 2 years ago | (#38557968)

Hey man, language changes every day. Not only are we free to redefine words the wya we pelase, we nac splel thme ayn way ew ikle. Gte wthi ti anm!!

Re:Telivision (1, Troll)

AK Marc (707885) | more than 2 years ago | (#38558504)

What's funny is that 4 of the top 5 examples are by conservatives attacking liberals (and one transcription error on a CNN interview). What's that say about where our language is going and who is taking it there?

Re:Telivision (1)

vikingpower (768921) | more than 2 years ago | (#38562134)

Language use, and interpretations thereof, is not politically bias-free. Even more so opinions, scientific or not, on language use.

Re:Telivision (1)

AK Marc (707885) | more than 2 years ago | (#38566354)

If we find one group, say Wal-Mart shoppers, who use words that don't exist like misunderestimate and nuk-u-lar more than others, does that mean anything? And if so, what?

Re:Telivision (1)

VortexCortex (1117377) | more than 2 years ago | (#38558888)

I second this notion. I frequently use the define: $searchTerm query with Google.

For example: telivision [google.com],
or: Wordnik [google.com]

Compare the latter to the same search on Wordnik: Wordnik [wordnik.com]

Bonus: Those Google links are wrapped in TLS, so no one sees the query terms or results in transit. https://www.wordnik.com/ [wordnik.com] takes you to their developer site...

$12m in venture capital to invent Urban Dictionary (1)

SpiralSpirit (874918) | more than 2 years ago | (#38558042)

we've eliminated the middle man by letting users submit whatever they want, and pocketing all the money!

What a horrible summary (1)

oneiros27 (46144) | more than 2 years ago | (#38559086)

These sources differ from both conventional dictionary publishers and crowd-sourced efforts like the excellent Wiktionary for their emphasis on avoiding human intervention rather than fostering it.

You make it sound like they're completely removing the human elements. And yes, a corpus by nature does that, as they're only really involved in setting the bounds of the collection and letting the authors speak for themselves. Wordnik, on the other hand, allows *anyone* to contribute, but they're not allowed to give definitions. (Definitions are only gathered from official dictionaries and the like.) What you do with Wordnik is give examples of the word in context -- because it's actually really hard to define some words.

The thing is -- there's no editors trying to come up with 'is this a word or not' ... if you put it in, it's a word. It doesn't matter that only you and your 4 friends use it, or that it's important enough -- if you want to add it, you can. Yes, they also automate adding stuff from other sources, and so did wikipedia early on (CIA factbook for countries, US census for places in the US, etc.)

Yes, you can use wordnik as a sort of meta-dictionary, but you can also add words to it, look to see the values in scrabble, tag words (words you hate, jargon in your field, etc.). It *is* fostering human intervention -- how many of you out there can add a word to a print dictionary? And unlike those print dictionaries, we don't have to wait 3-4 years before someone decides that something is 'officially' a word.

Re:What a horrible summary (1)

martin-boundary (547041) | more than 2 years ago | (#38559962)

That's a stupid idea. To use an analogy that Slashdot understands: a traditional dictionary is like a standards document. It's useful to promote interoperability between speakers both during a single transaction (conversation between two parties), and also in log files (written documents to be read again later).

Collecting random words on the web into a dictionary is like getting rid of standards altogether, or saying that every piece of software out there, no matter what it does, is standards compliant. We saw what that leads to in the early browser wars.

We need language gatekeepers. It's OK if language evolves somewhat over a period of 100 years, but if it changes so much that we can't make sense of what people wrote even 10 years ago, then we're in big trouble. In particular, dictionaries *shouldn't* be published more often than once every 25 years or so: it actually helps continuity if we force ourselves to use the same language that was current in the previous generation.

continuity (1)

mcswell (1102107) | more than 2 years ago | (#38564370)

Current generation nonsense, it's high time we return to Latin. Ita et vos per linguam nisi manifestum sermonem dederitis, quo modo scietur id quod dicitur? eritis enim in aëra loquentes.
But I'd accept Old English.

Re:What a horrible summary (1)

oneiros27 (46144) | more than 2 years ago | (#38568840)

Do you really mean to tell me that you only use words as they're defined in the dictionary? And if so, which dictionary? Because as we all know, there's lots of different standards out there. And then there's versioning of the standards, and those implementations that aren't quite compliant (in language, those would be regional dialects). Language is not as cut and dried as you think it might be.

But your suggestion is actually done in other countries -- the French have a government group that officially approves new words to be added to their language, with the result that they have far fewer words than we do.

And I admit, there are problems with allowing anyone to change the language -- we have judges who are willing to use modern definitions of terms to decide what 200+ year old legal documents mean ... because after all, they should've planned for language to change when they wrote the Constitution and the Bill of Rights.

Historical accretion (1)

vikingpower (768921) | more than 2 years ago | (#38562126)

All of the interviewed persons as well as the author of the NY Times article leave a major issue unmentioned, and that is historical word use. As a very enthusiastic user of the Oxford English Dictionary (yes, it has the place of honour in my living room), each time I look up a word in the venerable OED I am amazed at the thick and variegated strata of historical meaning, and the gradual shifting in it, even for words we think of as "simple".

To wit, neither the Wordnik nor the CCAE person mentioned these important aspects of a dictionary's use. For good reason: such corpora as Wordnik and the CCAE are "mere" aggregations of internet use. And internet use, not incidentally, is not necessarily identical to everyday use.
