Extracting Meaning From Millions of Pages

kdawson posted more than 4 years ago | from the data-mining-gone-large dept.

Google 138

freakshowsam writes "Technology Review has an article on a software engine, developed by researchers at the University of Washington, that pulls together facts by combing through more than 500 million Web pages. TextRunner extracts information from billions of lines of text by analyzing basic relationships between words. 'The significance of TextRunner is that it is scalable because it is unsupervised,' says Peter Norvig, director of research at Google, which donated the database of Web pages that TextRunner analyzes. The prototype still has a fairly simple interface and is not meant for public search so much as to demonstrate the automated extraction of information from 500 million Web pages, says Oren Etzioni, a University of Washington computer scientist leading the project." Try the query "Who has Microsoft acquired?"
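As a rough illustration of what "analyzing basic relationships between words" could produce, here is a minimal sketch that turns simple sentences into (subject, relation, object) tuples. The regular expression, the fixed verb list, and the sample sentences are all assumptions made for this example. TextRunner's actual extractor is unsupervised and certainly more involved than a hand-written pattern.

```python
import re

# Hypothetical pattern: "<Subject> <verb> <Object>" for a small fixed verb list.
# This is an illustrative stand-in, not TextRunner's actual extractor.
TRIPLE_RE = re.compile(
    r"^(?P<arg1>[A-Z][\w ]*?)\s+(?P<rel>acquired|kills|built)\s+(?P<arg2>[\w ]+?)\.?$"
)

def extract_triple(sentence):
    """Return (arg1, relation, arg2) for a simple sentence, or None if no match."""
    match = TRIPLE_RE.match(sentence.strip())
    if match is None:
        return None
    return (match.group("arg1"), match.group("rel"), match.group("arg2"))

# Made-up input sentences for illustration.
sentences = [
    "Microsoft acquired Hotmail.",
    "Chlorine kills bacteria.",
    "This sentence has no known relation.",
]
triples = [t for t in map(extract_triple, sentences) if t is not None]
print(triples)
```

A query such as "Who has Microsoft acquired?" would then amount to looking up tuples whose subject is "Microsoft" and whose relation is "acquired".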

138 comments

Try the query.... (3, Funny)

Finallyjoined!!! (1158431) | more than 4 years ago | (#28306709)

"Who has dumped Vista?"

Re:Try the query.... (0)

Anonymous Coward | more than 4 years ago | (#28306817)

If I had a fanny I'd be a woman

Re:Try the query.... (0)

Anonymous Coward | more than 4 years ago | (#28306855)

No, try "Who has VA Linux acquired?"

Re:Try the query.... (1)

drinkypoo (153816) | more than 4 years ago | (#28307399)

Oh man, it's the new sucks-rules-o-meter for sure. Who hates vista: 55 results. Who loves vista: 11 results. Obviously, vista blows hairy goats. It becomes even more clear when you look at the actual results: somehow "Bookmark Islamic Screensaver download-All people (12) love screensaver-Windows Vista Downloads" counts as a hit.. ahh there, a reload with js and the spam disappears leaving 9 :D

Re:Try the query.... (3, Funny)

maxume (22995) | more than 4 years ago | (#28309267)

I tried to read your comment, but I did not attempt to understand it.

Re:Try the query.... (1)

Chninkel (1396241) | more than 4 years ago | (#28308377)

Why doesn't the query

Who has not been acquired by Microsoft

return Yahoo?
Actually, it doesn't return any results ...

Re:Try the query.... (1)

Nutria (679911) | more than 4 years ago | (#28308409)

Better yet: "Why does Windows suck?"

Retrieved 0 results for Why does Windows suck?.

Being in Washington, MSFT has obviously paid them off to filter out unpleasant results.

Not entirely helpful (5, Interesting)

CRCulver (715279) | more than 4 years ago | (#28306721)

I suppose the major problem with this is that it cannot tell the difference between truth and lies or urban legends, it just repeats what other people have said, even if they are conspiracy theorists. The query "Who killed JFK?" suggests the CIA did it.

Re:Not entirely helpful (1)

Random2 (1412773) | more than 4 years ago | (#28306795)

Yeah, it's something you'd have to cross-reference, but the main use I see for it is the initial search for information. You ask a question, it gives some answers, then you type them into yahoo or something to look them up/verify what it said. This could be a huge help for things that one may not know a lot about.

Wikipedia tried and failed (1)

Anonymous Coward | more than 4 years ago | (#28307251)

That is how Wikipedia was meant to be. A group of statements about subjects, all of which can be referenced to some original source. So that people can look up something quickly and then look at the sources for more definite information....

Seeing how many people cite Wikipedia directly, use it as the main source for their research and the number of newspapers that have been reported to directly quote inaccurate facts from Wikipedia... I don't think it is working properly. It requires a lot of optimism to believe "People will use that as an initial source and then verify the information"

Re:Wikipedia tried and failed (3, Insightful)

Colonel Korn (1258968) | more than 4 years ago | (#28307379)

That is how Wikipedia was meant to be. A group of statements about subjects, all of which can be referenced to some original source. So that people can look up something quickly and then look at the sources for more definite information....

Seeing how many people cite Wikipedia directly, use it as the main source for their research and the number of newspapers that have been reported to directly quote inaccurate facts from Wikipedia... I don't think it is working properly. It requires a lot of optimism to believe "People will use that as an initial source and then verify the information"

That's not wikipedia's failure. Those same people would just be referencing nothing or a web site with zero public review and commenting without it.

Re:Not entirely helpful (2, Insightful)

John Hasler (414242) | more than 4 years ago | (#28306859)

The major problem is that it assumes the presence of meaning in Web pages in the first place.

Re:Not entirely helpful (2, Interesting)

morgan_greywolf (835522) | more than 4 years ago | (#28306963)

Actually, just like any other search, it just shows ALL of the likely results and you are still responsible for determining for yourself which of the statements is true. It says "CIA killed JFK" but the first result it returns is "Lee Harvey Oswald killed JFK". It also seems to pare down the results somewhat, because I know I've seen conspiracies also suggesting that the KGB killed JFK, or that the Mafia killed JFK. I'm guessing that more people think the CIA killed JFK than the KGB or the Mafia.
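The guess that results are ordered by how many pages make each claim can be sketched as a simple tally. This is only an assumption about the ranking, and the counts below are invented for illustration; they are not TextRunner's real numbers.

```python
from collections import Counter

def rank_answers(triples, relation, arg2):
    """Count how often each subject is asserted for (relation, arg2), most frequent first."""
    counts = Counter(a1 for a1, rel, a2 in triples if rel == relation and a2 == arg2)
    return counts.most_common()

# Made-up extractions for illustration only.
triples = (
    [("Lee Harvey Oswald", "killed", "JFK")] * 3
    + [("CIA", "killed", "JFK")] * 2
    + [("Mafia", "killed", "JFK")]
)
print(rank_answers(triples, "killed", "JFK"))
```

Under this scheme a claim ranks higher because it is repeated more often, not because it is better supported, which is the point the parent comments are making.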

Re:Not entirely helpful (4, Funny)

owlnation (858981) | more than 4 years ago | (#28306991)

I suppose the major problem with this is that it cannot tell the difference between truth and lies or urban legends, it just repeats what other people have said, even if they are conspiracy theorists. The query "Who killed JFK?" suggests the CIA did it.

So much like Wikipedia then?

Re:Not entirely helpful (1)

L4t3r4lu5 (1216702) | more than 4 years ago | (#28307019)

... and yet "Who was responsible for the World Trade Centre attacks?" returns no results...

[/tinfoilhat]

Re:Not entirely helpful (1)

ericrost (1049312) | more than 4 years ago | (#28307225)

That would be because "centre" is spelled center. The correct spelling yields plenty of results.

Re:Not entirely helpful (1)

L4t3r4lu5 (1216702) | more than 4 years ago | (#28307257)

Damn my correct spelling of English words! I suppose as a proper noun, I can forgive this slip up.

Why WTC name is spelled in American (2, Informative)

tepples (727027) | more than 4 years ago | (#28309511)

Damn my correct spelling of English words!

Because the World Trade Center was located on American soil, its name is spelled in American dialect.

Re:Not entirely helpful (0)

Anonymous Coward | more than 4 years ago | (#28307057)

I suppose the major problem with this is that it cannot tell the difference between truth and lies or urban legends, it just repeats what other people have said, even if they are conspiracy theorists. The query "Who killed JFK?" suggests the CIA did it.

For more fun, try "Who blew up the WTC?" , "Who developed the AIDS virus?" , or "Who controls world power?"

But to call this a problem with TextRunner is a bit unfair. It's still an interesting tool for looking at the content of the web. Yeah, the web is mostly populated by kooks, but that's life, isn't it?

Not entirely helpful predicting the future? (1)

thijsh (910751) | more than 4 years ago | (#28307179)

Why deal with uncertainties about who-killed-who in the past, when you can have a lot more fun with what could be in the future.
"Who killed obama?" ... seems an inside job by Hillary is most probable just below a vicious murder by Ted Nugent. Scary!

Re:Not entirely helpful (0)

Anonymous Coward | more than 4 years ago | (#28307239)

CIA hasn't killed JFK?

Re:Not entirely helpful (2, Insightful)

jerep (794296) | more than 4 years ago | (#28307297)

it just repeats what other people have said

I don't see anything new here, most people have done this since the beginning of time.

Re:Not entirely helpful (2, Funny)

thedonger (1317951) | more than 4 years ago | (#28307345)

it just repeats what other people have said

I don't see anything new here, most people have done this since the beginning of time.

Yeah, Textrunner just repeats what other people have said, like most people since the beginning of time.

Re:Not entirely helpful (0, Redundant)

db10 (740174) | more than 4 years ago | (#28308007)

Yeah, Textrunner just repeats what other people have said, like most people since the beginning of time.

Re:Not entirely helpful (2, Interesting)

somersault (912633) | more than 4 years ago | (#28307375)

I suppose the major problem with this is that it cannot tell the difference between truth and lies or urban legends

Most humans can't either, how do you expect a search engine to?

There will be a lot of false positives and negatives that will be hard to identify as such unless it directly works with something like snopes.com , which kind of defeats the purpose because it means someone has had to research every question anyway.

If a project like this simply scoured the whole 'net, you wouldn't really be able to verify anything beyond people's opinions or beliefs, which may or may not be 'true'.

I think something like this would work really well for factual results if it was only allowed to draw conclusions from verified sources, say something like Wikipedia articles that have been verified by experts in the appropriate field (I've not been following all this type of thing recently but perhaps that is what Wolfram Alpha does already). It could perhaps be useful to have it search the general internet for supplementary results for some questions though, especially those of a philosophical nature where it may be impossible to establish definite answers ("is there a god" and the like).

Re:Exactly (2, Funny)

bxbaser (252102) | more than 4 years ago | (#28307999)

"The query "Who killed JFK?" suggests the CIA did it"

Hmmm....And now it's not responding because it's "slashdotted"

Re:Not entirely helpful (2, Informative)

msbmsb (871828) | more than 4 years ago | (#28308611)

Semantic processing systems like this (it's not something new) aren't usually able to determine correctness. The truth of a statement is assumed and the best these NLP [wikipedia.org] engines can do at the moment is identify conflicts and maybe use some reputation metrics to assign a veracity rating to a particular statement, or notify the user that there are differing conclusions. These systems are just really, like the summary states, "information extraction [wikipedia.org]" systems. Just as a regular search engine will return you the results from the data set, that's what these types of semantic extraction engines usually do, except the data is processed in a semantically-organized way so that you can query with semantics/natural language constraints instead of just keywords and boolean constraints.

There are some that incorporate some intention or opinion polarity detection, but even those are not capable of sorting "truth" versus "conspiracy".

Additionally, semantic extraction output, like named entities [wikipedia.org] and semantic relations [wikipedia.org], are useful for many other applications.
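A minimal sketch of the "identify conflicts" step described above: group extracted statements by relation and object, and flag any group where sources name more than one subject. It assumes each relation has a single correct subject, which does not hold in general (many things kill bacteria), so a real system would need to know which relations are exclusive. The sample tuples are illustrative.

```python
from collections import defaultdict

def find_conflicts(triples):
    """Map each (relation, object) to its sorted subjects when sources disagree."""
    subjects = defaultdict(set)
    for arg1, rel, arg2 in triples:
        subjects[(rel, arg2)].add(arg1)
    # Keep only the groups where more than one distinct subject was asserted.
    return {key: sorted(names) for key, names in subjects.items() if len(names) > 1}

# Illustrative extractions; not real TextRunner output.
triples = [
    ("Egyptians", "built", "the pyramids"),
    ("aliens", "built", "the pyramids"),
    ("chlorine", "kills", "bacteria"),
]
print(find_conflicts(triples))
```

This only notifies the user that conclusions differ. Deciding which one is true would still need reputation metrics or human review, as the comment says.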

So someone donated a copy of my copyrighted pages? (0)

Anonymous Coward | more than 4 years ago | (#28306743)

What the heck.

Re:So someone donated a copy of my copyrighted pag (1)

morgan_greywolf (835522) | more than 4 years ago | (#28306973)

The same copyrighted pages that you allowed Google to crawl since you obviously didn't protect it with a robots.txt?
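For readers unfamiliar with the mechanism, here is a small sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The rules, the bot name `ExampleBot`, and the URLs are hypothetical, and the file is parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally (no network access).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A path under /private/ is disallowed for every crawler; other paths are allowed.
print(parser.can_fetch("ExampleBot", "http://example.com/private/page.html"))
print(parser.can_fetch("ExampleBot", "http://example.com/index.html"))
```

Note that robots.txt is advisory: it tells a crawler what the site owner wants, but nothing forces a crawler to obey it, and it says nothing about redistributing pages that were crawled.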

Re:So someone donated a copy of my copyrighted pag (0)

Anonymous Coward | more than 4 years ago | (#28307073)

What? You've found a search engine that honors robots.txt?

Re:So someone donated a copy of my copyrighted pag (2, Interesting)

Anonymous Coward | more than 4 years ago | (#28307127)

Allowing a search engine to visit a site and allowing somebody to pass your web page content around are two completely different things.

Nascent AI? (4, Funny)

Drakkenmensch (1255800) | more than 4 years ago | (#28306841)

I've always viewed intelligence as the ability to take unrelated facts and create new and original ideas from their synthesis. This project may very well lead to new ideas to create the first true AI.

I'll start stockpiling food and armor piercing rounds for the moment Skynet goes live.

I think you are dumb (-1, Troll)

Anonymous Coward | more than 4 years ago | (#28307033)

You are a dumb fat bitch face who is stuped and dumb.

Re:Nascent AI? (1)

thedonger (1317951) | more than 4 years ago | (#28307391)

I've always viewed intelligence as the ability to take unrelated facts and create new and original ideas from their synthesis.

Intelligence, like insanity, is finding links between seemingly unrelated facts. It can also be keen observation and recognition of interactions between things where others see chaos. Either way, truly unrelated things are just that: unrelated.

Re:Nascent AI? (1)

thedonger (1317951) | more than 4 years ago | (#28307589)

I should add that the distinction between intelligence and insanity blurs as the relationship between the facts becomes weaker. Well, at least to the observer of the intelligent/insane person.

500 million web pages can't be wrong (4, Funny)

Dunbal (464142) | more than 4 years ago | (#28306885)

Yet strangely, I get a result of:

TextRunner took 9 seconds.
Retrieved 0 results for what is the airspeed velocity of an unladen swallow?.

Meh, call me when this stuff can answer the really USEFUL questions in life.

Re:500 million web pages can't be wrong (0)

Anonymous Coward | more than 4 years ago | (#28306941)

Well, what did you expect... Did you mean an African or a European swallow?

Re:500 million web pages can't be wrong (3, Funny)

JDHannan (786636) | more than 4 years ago | (#28306983)

And even worse:

Retrieved 0 results for what is the answer to life, the universe and everything?.

Re:500 million web pages can't be wrong (4, Funny)

sukotto (122876) | more than 4 years ago | (#28307133)

Obviously it's not indexing http://www.style.org/unladenswallow/ [style.org]

estimate that the average cruising airspeed velocity of an unladen European Swallow is roughly 11 meters per second, or 24 miles an hour.

meters per second or miles per hour? what? (1, Interesting)

Anonymous Coward | more than 4 years ago | (#28307429)

I would go with...

  • ...meters per second and kilometers per hour
  • ...feet per second and miles per hour
  • ...feet per second and meters per second
  • ...miles per hour and kilometers per hour

But meters per second and miles per hour? WHY?!

Re:meters per second or miles per hour? what? (1)

Evanisincontrol (830057) | more than 4 years ago | (#28308243)

They wanted a metric measure and a standard measure. Meters per second is a reasonable metric measure for something slow(er than a car), and miles per hour... is basically the only speed measure that Americans understand. (No flaming, I'm American).
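The two figures quoted in this subthread (roughly 11 meters per second and 24 miles an hour) can be checked against each other with a short conversion. The function names are made up for this sketch; the conversion factor is the standard definition of the international mile.

```python
METERS_PER_MILE = 1609.344  # exact, by definition of the international mile
SECONDS_PER_HOUR = 3600

def mps_to_mph(speed_mps):
    """Convert meters per second to miles per hour."""
    return speed_mps * SECONDS_PER_HOUR / METERS_PER_MILE

def mps_to_kmh(speed_mps):
    """Convert meters per second to kilometers per hour."""
    return speed_mps * SECONDS_PER_HOUR / 1000

print(round(mps_to_mph(11), 1))  # 24.6
print(round(mps_to_kmh(11), 1))  # 39.6
```

So the quoted pair is consistent: 11 m/s is about 24.6 mph, or 39.6 km/h for those who prefer the first option in the list above.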

Re:500 million web pages can't be wrong (1)

maxwell demon (590494) | more than 4 years ago | (#28308277)

Just found out: If you just type "airspeed velocity", you'll get as first two results:

airspeed velocity of an unladen swallow is roughly 11 meters (10), 24 miles (9), 10 meters (2)
average cruising airspeed velocity of an unladen European Swallow is 24 mph (2)

It seems to have trouble understanding units, but otherwise the information is found.

Textrunner confirms it (0)

Anonymous Coward | more than 4 years ago | (#28306921)

"Retrieved 0 results for Is Linux ready for the desktop?."

Zero results (2, Interesting)

John Hasler (414242) | more than 4 years ago | (#28306945)

I tried half a dozen queries of the sort I often use Google for (example: "What is the velocity of sound in hydraulic fluid?"). No answers.

Concise (2, Interesting)

moogsynth (1264404) | more than 4 years ago | (#28306951)

Try "Who paid SCO?" Concise, to the point. Nice.

Re:Concise (0)

Anonymous Coward | more than 4 years ago | (#28307047)

I'm stealing your karma points for when the site gets slashdotted. The answer of course is

Linux customers paid SCO $ 10 million (2)
plans to pay SCO fees (2)
Microsoft paid SCO approximately $ 16 million (2)

Towards a web with only one page: Google (1, Insightful)

Anonymous Coward | more than 4 years ago | (#28306961)

Are we moving towards a web in which Google centralises everything on their own pages? These new engines present content without the need to visit pages it originates from. Is Google basically mooching off other people's websites with hardly anything - if anything at all - in return?

It could be dangerous if the only visitor a web site can expect is the Google bot.

what causes cancer? (5, Funny)

umundane (1490741) | more than 4 years ago | (#28307231)

I learned that

> smoking (387) causes cancer.

I was also surprised to learn that

> girls and women (11) cause most cases of cervical cancer

This is a great resource if you need to cite a reference for a Wikipedia article.

TextRunner confirms it: (4, Funny)

guruevi (827432) | more than 4 years ago | (#28307247)

Who is at Area 51
aliens (3), Carter (2), Colonel Sanders (2), Hi Group (2) is at Area 51

Who bombed WTC
Al Qaeda (5), Bush (5), Clinton (2), 4 more... bombed the WTC

Who built the pyramids (example on site):
Egyptians (298), aliens (73), Pharaohs (40), 77 more... built the pyramids

What contains antioxidants (example on site):
Coffee (17), Recent scientific research (15), food (6), 5 more... contain significant amounts of antioxidants

-- man, I gotta get me some more recent scientific research.

Re:TextRunner confirms it: (1)

houghi (78078) | more than 4 years ago | (#28308691)

Retrieved 0 results for what is the answer to life, the universe and everything?.

Slashdot is not ... (2)

Xyberu (1440765) | more than 4 years ago | (#28307261)

Slashdot isn't
        a professional news site
        a normal news site
        a social news site
        a News Site
        a valid source
        a reputable source
        the right source
        a healthy online community
        a goddamn online community
        a Terrorist Organization

Re:Slashdot is not ... (1)

unfasten (1335957) | more than 4 years ago | (#28308619)

Ah, but look at what Slashdot is:

Slashdot is the single most important english site (8), another extremely sophisticated example (4), another online community (3), 15 more...

I'd like to see it extracting Millions of Meanings (1)

Klistvud (1574615) | more than 4 years ago | (#28307349)

...from me being completely silent, mouth shut and all, like my wife does! And she never had a single reboot in 43 years! Then again ... maybe that's precisely the problem?

Bah useless (1)

Veretax (872660) | more than 4 years ago | (#28307381)

I tried asking the real name of Doctor Who, and the site basically crapped out LOL, totally useless.

User invalid, deleting user (1)

uncanny (954868) | more than 4 years ago | (#28307465)

I typed in "how does a computer become self aware?" it just said something about it being busy because it's currently controlling california!

Re:User invalid, deleting user (1)

PPH (736903) | more than 4 years ago | (#28309243)

Come on now, be specific. What it actually said was, "I'll be back".

source code ? implementation details? (0)

Anonymous Coward | more than 4 years ago | (#28307601)

is this closed source ? Any idea what language this is implemented in ?

Re:source code ? implementation details? (1)

Zappy (7013) | more than 4 years ago | (#28308193)

#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGSEGV (0xb) at pc=0xb77acafa, pid=21855, tid=1833073568
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_14-b03 mixed mode)
# Problematic frame:
# V  [libjvm.so+0x23dafa]
#
# An error report file with more information is saved as hs_err_pid21855.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp [sun.com]
#
Abort

Human cities torture? (1)

YourExperiment (1081089) | more than 4 years ago | (#28307607)

Apparently Mount Marcy, Mount Elbrus, Mount Kilimanjaro and Mount Etna are all the highest mountain. Then again, I was also informed that "high mountains are the hum of human cities torture", so I think I'll just steer clear of mountains altogether.

Screw that (0)

Anonymous Coward | more than 4 years ago | (#28307633)

Why is my TV suddenly not working anymore?

something i've never heard of (1)

n30na (1525807) | more than 4 years ago | (#28307929)

Turns out to be way cooler than Wolfram Alpha. Now just think if it has the whole web. Wait, scratch that, I bet wikipedia's already in there. Also, skynet.

Correction.... (4, Insightful)

wowbagger (69688) | more than 4 years ago | (#28307983)

"...that pulls together facts by combing through more than 500 million Web pages."

Correction:

"...that pulls together assertions by combing through more than 500 million Web pages."

Whether those assertions are correct or even reasonable is a completely different issue.

It might be interesting to then take those assertions and have some means to validate or invalidate them, but currently that's going to require meat, not metal.

Now, if you could come up with some form of AI^Walgorithm to do that automatically, then you would have something.

What is the meaning of life (1)

soundhack (179543) | more than 4 years ago | (#28308061)

love (53), song (19), Life (16), 81 more... is the meaning of life

1) of the 81 more, 42 doesn't show up anywhere
2) the stupid javascript hiding makes copy and paste a pain

Found the cause of global warming.... (0)

Anonymous Coward | more than 4 years ago | (#28308731)

Retrieved 8 results for What causes global warming?

Human(10) vommitting.BUTT PLUGS (2).

Sorry everyone... I'll take it out.

Wow, impressive, but prior art... (1)

Pedrito (94783) | more than 4 years ago | (#28308789)

TextRunner gets rid of that manual labor. A user can enter, for example, "kills bacteria," and the engine will come up with pages that offer the insights that "chlorine kills bacteria" or "ultraviolet light kills bacteria" or "heat kills bacteria"--results called "triples"--and provide ways to preview the text and then visit the Web page that it comes from.

Wow, incredible. Because doing a search of "kills bacteria" with the quotes on Google won't get you those kind of results. Oh wait, yeah it will. In fact, it too will return "chlorine kills bacteria" and "ultraviolet light kills bacteria" and "heat kills bacteria". And Google also provides a way to preview the text and then visit the web page that it comes from.

Yeah, I know, I know, they just put a bad example in the article, but it's a ridiculously bad example.

I'm impressed (1)

thethibs (882667) | more than 4 years ago | (#28309041)

This has to be played with to be appreciated. On request, it delivered a set of interesting papers about US-EPA misrepresentation of science. And, it returned a null result for "Has any climate model been validated?"

This is going to be fun

Carmen San Diego? (1)

tech_fixer (1541657) | more than 4 years ago | (#28309105)

I asked "Where in the world is Carmen San Diego?". The page threw up a Java error.

I guess nobody really knows.

What makes grass grow? (0)

Anonymous Coward | more than 4 years ago | (#28309271)

What makes grass grow?

Answer - (1 thing)

blood...

Who performs warrantless wiretapping? (0)

Anonymous Coward | more than 4 years ago | (#28309301)

TextRunner took 2 seconds.
Retrieved 0 results for who performs warrantless wiretapping.

Who is John Galt? (0)

Anonymous Coward | more than 4 years ago | (#28309305)

0 results.
