
IBM vs. Content Chaos

Hemos posted more than 10 years ago | from the help-me-find-directions-to-p4r1s-h1l70n dept.

The Internet 216

ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies. " It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.


216 comments


frst crap (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#7953192)

frst crap!
office sucks

Content Chaos (-1, Offtopic)

m3j00 (606453) | more than 10 years ago | (#7953195)

What is it all about? Is it cool? Is it whack?

Re:Content Chaos (-1)

Anonymous Coward | more than 10 years ago | (#7953264)

WTF does "whack" mean anyway?

Re:Content Chaos (0)

Anonymous Coward | more than 10 years ago | (#7953365)

Meaning 1:

crazy, ridiculous

Your mom flushed your stash? that's whack!

Meaning 2:

stupid, dumb; gay

Dude, that's whack.

Source: Urban Dictionary(.com)

Re:Content Chaos (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#7953696)

Urban, eh?

In other words, black streetgang loserspeak.

How does "gay" get lumped in with "stupid" and "dumb" anyway? Most gay people that I know are usually of above average intelligence...as opposed to black streetgang types who consider being smart and wanting to get ahead in life as "acting too white" therefore beneath them somehow, and to be avoided at all costs.

Re:Content Chaos (-1, Troll)

Anonymous Coward | more than 10 years ago | (#7953727)

President Bush doesn't like gay people. I think he's pretty smart. You stooped liberal!

FIRST AUTOMATED POST (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#7953206)

This is a test of the emergency Slashdot Posting System. If this were a real post, you would have already been trolled. 4217

It can be feed? (0, Offtopic)

BCole (620344) | more than 10 years ago | (#7953214)

How about "it can be fed"

I think a better question... (5, Funny)

bc90021 (43730) | more than 10 years ago | (#7953216)

...doesn't concern whether "Pink" is a colour or a singer, but whether "Paris Hilton" is a hotel in France or an oft downloaded video... ;)

Re:I think a better question... (0, Flamebait)

jpr1nd (678149) | more than 10 years ago | (#7953284)

wow, that is such a clever and original [slashdot.org] joke. how do you come up with stuff so funny?

What is PINK? (2, Funny)

BigBlockMopar (191202) | more than 10 years ago | (#7953537)


(Is "pink" the singer or the color?)

I didn't get the joke.

These are, after all, engineers. Pink is neither a color nor a singer (talented or otherwise).

To an engineer, PINK can only be an acronym.

Re:I think a better question... (2, Funny)

Dave2 Wickham (600202) | more than 10 years ago | (#7953288)

"from the help-me-find-directions-to-p4r1s-h1l70n dept."

Re:I think a better question... (1)

bc90021 (43730) | more than 10 years ago | (#7953432)

The really funny part is that I didn't even see that until you pointed it out...

Re:I think a better question... (1)

robslimo (587196) | more than 10 years ago | (#7953387)

If the reference was to the band Pink Floyd, that was the name of the group, originally "The Pink Floyd Sound" and did not reference anyone in the group.

However, I guess it _was_ from a person's name since the band was named for American blues artists Pink Anderson and Floyd Council.

Re:I think a better question... (2, Funny)

ePhil_One (634771) | more than 10 years ago | (#7953486)

Oh, by the way, which one's Pink?

pr0nfountain (1, Funny)

3lb4rt0 (736495) | more than 10 years ago | (#7953228)

The spinoff that will be used by joe sixpack net user.

All we need... (3, Interesting)

TJ_Phazerhacki (520002) | more than 10 years ago | (#7953230)

There is already altogether too much "stuff out there" for anyone to put any major effort into categorizing it. We should soon reach the point of info overload, and then what? What is the point of cataloging overflow data? Do we really need something like this? Or should we just ship a bunch of programmers wasting their time over to something else, like better spam filters and OSes without gaping security holes?

Re:All we need... (0, Flamebait)

Frymaster (171343) | more than 10 years ago | (#7953318)

the first commercial use will be to track public opinion for companies.

here's one to start with:

microsoft (msft) of redmond washington: you suck!

now, go log that.

Re:All we need... (1, Insightful)

geoffspear (692508) | more than 10 years ago | (#7953327)

Oh yes, because there's such an enormous shortage of programmers right now. IBM should lay off all of these programmers so Microsoft will have a pool of available programmers who know nothing about OS security to work on security.

And once all the game producers, who make a product we definitely don't "need" get rid of all of their programmers, there will be plenty of free people to work on anti-spam technology. Whee!

Re:All we need... (5, Insightful)

millahtime (710421) | more than 10 years ago | (#7953444)

There are many organizations that need better ways to analyze their info. There are databases that are terabytes in size and have to do detailed searches. With SQL databases that can take a long time and any faster way can save a lot of time and money. There is a big need for this technology across many industries.

Re:All we need... (5, Insightful)

xyzzy (10685) | more than 10 years ago | (#7953619)

That's really funny that you mention "spam filters", since that is exactly the content categorization task that you are talking about.

Automatic categorization of overflowing data is exactly what you need to do when you have too much to think about -- it allows you to triage your attention span, which is the most limited resource you have.

Re:All we need... (2, Interesting)

redragon (161901) | more than 10 years ago | (#7953801)

I think the inverse is the case.

The more chaotic (overloaded in your terms) that data tends to be, the greater the information contained in that data (think compression). So what they're going after is not "categorizing" the internet, they're going after making some sense out of all of that data. Information overload begins to necessitate an intermediary to help filter out the data that you're interested in.

The interesting thing becomes what sort of biases are built into a system like this? That is what I'm curious about. Right now when we search on Google (which of course has its own biases), we decide which links end up mattering (if we have the will to root through it). If a computer system is doing this, it will inevitably alter the way in which we come to understand the data we're looking through.

I think you're saying (or am I (mis)reading you?) that, "it doesn't matter," isn't the right direction of thinking here. Sure spam and security are issues too, spam actually being a related problem, but it seems unfair to delegate this to the "bad idea" stack already.

Guess it's time to update the ole' homepage with.. (-1, Redundant)

eddy (18759) | more than 10 years ago | (#7953235)

..a "SCO Sucks" comment.

Or would that be considered redundant maybe?

Send link to Google (4, Insightful)

Urkki (668283) | more than 10 years ago | (#7953236)

They could certainly use these kinds of techniques to improve their results...

Then again, in a way they already use something like this, except they're only really concerned about links, not actual contents of pages...

Re:Send link to Google (1)

AndroidCat (229562) | more than 10 years ago | (#7953378)

Or send it to Slashdot. :^)

structure... (5, Funny)

Rhubarb Crumble (581156) | more than 10 years ago | (#7953239)

a huge system to turn all the unstructured info on the web into structured data

In order to do this, they will use a scheme by which each document is referred to by a string including the transfer protocol, the host name, and a file path.

oh, wait...
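
For anyone who missed the joke, the "scheme" described above is just a URL. A minimal Python sketch, assuming nothing beyond the standard library's `urllib.parse`; the function name `describe_url` and the example address are made up for illustration.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the three parts named in the comment above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,  # transfer protocol, e.g. "http"
        "host": parts.hostname,    # host name, lowercased by urlsplit
        "path": parts.path,        # file path on that host
    }


# Hypothetical address, used only for illustration.
example = describe_url("http://www.example.com/docs/report.html")
```

That gives structured addressing of documents, which is of course not the same as structure inside them.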

Too easy, think complicated (1)

korpiq (8532) | more than 10 years ago | (#7953371)


Some information at different paths might require cross-referencing. Thus, the scheme you propose should be extended so that there would be a way for text documents to contain links to each other.

However, if you just take a big enough storage system and download all the documents from teh intterweb, you can have a flat directory containing all the documents. Woohoo, progress!

First Ninnle Post! (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#7953243)

Be the first on your block to run the new Ninnle Servers from IBM!

First customer (3, Funny)

Anonymous Coward | more than 10 years ago | (#7953245)

IEEE reports that the first commercial use will be to track public opinion for companies.

Word has it the first test case will be SCO. WebFountain: "Outlook not so good"

Re:First customer (0)

Anonymous Coward | more than 10 years ago | (#7953329)

The e-mail client? You needed IBM to tell you that?

Blogzine [blogzine.net]

Obligatory SCO poke. (1)

i_r_sensitive (697893) | more than 10 years ago | (#7953837)

Damnit, too busy reading stupid poll posts, damnit dmanit dmanit.

You've won this round, Lonestar...

Hmm... (-1)

Anonymous Coward | more than 10 years ago | (#7953247)

Just me or has Hemos gone nuts? Posting like crazy today.

Blogzine [blogzine.net]

galss (-1, Redundant)

Anonymous Coward | more than 10 years ago | (#7953255)

"It looks like it's feeding ground is primarily the public Internet, but it can be feed private information as well."
(Use the Preview Button! Check those URLs!)

-AC

SITE ALREADY SLASHDOTTED, HERES A MIRROR! (2, Funny)

ThisIsAnExampleAccou (718430) | more than 10 years ago | (#7953261)

Link to a Mirror [google.com]

Stupid Mod - Should Be +1 Funny! (-1)

Anonymous Coward | more than 10 years ago | (#7953306)

Bad Mod! Bad!

What he posted was +1 Funny, not -1 Offtopic.

Re:SITE ALREADY SLASHDOTTED, HERES A MIRROR! (0)

jpr1nd (678149) | more than 10 years ago | (#7953312)

um, thats not a mirror.

and apparently "To ride your bicycle safely and efficiently, it is important to have equipment operating smoothly and properly."

Re:SITE ALREADY SLASHDOTTED, HERES A MIRROR! (-1)

Anonymous Coward | more than 10 years ago | (#7953340)

Perhaps you need either (1) an eye exam or (2) a sense of humor. His post said "Here is a mirror". When you click on it, it comes up with a picture of a mirror. That is funny. Somehow, he got it to work so that it looks like it is a link to google. Bonus points for tricking me.

Get this setup (3, Interesting)

millahtime (710421) | more than 10 years ago | (#7953274)

I wonder how long until IBM sells this setup. If it works well, logistics organizations would love to get their hands on it.

Re:Get this setup (1)

millahtime (710421) | more than 10 years ago | (#7953294)

I mean by this that most logistics organizations will have proprietary info that they won't let IBM house.

Re:Get this setup (4, Informative)

orac2 (88688) | more than 10 years ago | (#7953409)

Although the article didn't have room to go into this point (and I should know, I'm the author), IBM can completely compartmentalize competitors' data, even if hosted in house (IBM already does this in other parts of its business). If companies are still wary, they can host the data themselves and let WebFountain troll it on a need to know basis.

Expensive (4, Interesting)

starvingcodeartist (739199) | more than 10 years ago | (#7953289)

In the article it says they plan on charging between $150,000 and $300,000 a year to use this super-search engine. They think corporate execs will pay for it. Seems really steep to me. BUT, for corporate execs, it's probably not too expensive. They'll just outsource another 10-15 programming jobs to India to pay for it.

Re:Expensive (4, Interesting)

orac2 (88688) | more than 10 years ago | (#7953349)

The point is that it's not intended for use as a search engine, but a platform for doing computation intensive data mining and analysis. A search engine can tell you how many mentions of IBM appear on the web, but not how people feel about IBM.
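
To make the distinction concrete, here is a toy Python sketch of counting mentions versus scoring tone. It is not how WebFountain works (the article does not give that level of detail); the word lists, the function name `mentions_and_tone`, and the sample posts are all invented for illustration, and a real system would need far more than a word lookup.

```python
import re

# Hypothetical, tiny opinion lexicons; real systems use far richer models.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "bad", "suck", "sucks"}


def mentions_and_tone(posts, name):
    """Return (number of posts mentioning name, crude tone score)."""
    mentions = 0
    score = 0
    for post in posts:
        words = re.findall(r"[a-z']+", post.lower())
        if name.lower() not in words:
            continue  # a plain search engine stops at counting these
        mentions += 1
        score += sum(w in POSITIVE for w in words)
        score -= sum(w in NEGATIVE for w in words)
    return mentions, score


# Made-up sample posts.
posts = [
    "IBM makes great hardware",
    "I hate the IBM website",
    "Pink is a color",
    "ibm support is bad, really bad",
]
result = mentions_and_tone(posts, "IBM")
```

The mention count is what a keyword search already gives you; the second number is the part that takes the extra analysis.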

Re:Expensive (1)

starvingcodeartist (739199) | more than 10 years ago | (#7953379)

That's what they say, but the article gave me the impression that it basically just organizes data into usable categories. The benefit being that you can get "exactly" the data you are looking for, instead of wasting your time wading through scores of unrelated pages.

Re:Expensive (1)

orac2 (88688) | more than 10 years ago | (#7953808)

The point is that the "you" in "you can get exactly the data you're looking for" is not a person, but a data mining program.

Disclaimer: I'm the author of the article!

Re:Expensive (1)

millahtime (710421) | more than 10 years ago | (#7953513)

For the kinds of data mining that would be done, this will easily pay for itself, if it works as advertised. There is a huge speed problem in doing data mining currently. If this can solve that then there are a lot of companies that will jump on it.

Almaden (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#7953310)

That's the guy on Monday Night Football, right?

Bush vs. U.S. Gulf War Veterans: +1, Patriotic (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#7953319)

Pink's pinky (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#7953320)

Pink's kinky pinky makes me see... pink... wait!.. red...

corporate meddling (3, Insightful)

commo1 (709770) | more than 10 years ago | (#7953322)

One of my main concerns with search databases is the inherent ability of corporations to increase their visibility on the web by manipulating data to their benefit to bring their corporate page up first on the list. I wonder if there is a way for the database to have a scoring system based on the validity of the data: is the information there, or are there just highly developed metatags doing the work? If you do a search for a specific part number for an HP product, what are the chances of getting a) the HP home page where a further search would be necessary to find any relevant info or b) the big chains like Staples and Circuit City, who just want to sell you cartridges and have the time and resources to steer you in the right direction? How would the system be regulated (kinda like Slashdot mods :P)? Who watches the watchers, and can information validity be electronically implemented? What kind of AI would be necessary?

Re:corporate meddling (1)

orac2 (88688) | more than 10 years ago | (#7953546)

WebFountain isn't intended as a general purpose search engine, but to provide a platform for data mining and analysis.

Information... (1, Funny)

enrico_suave (179651) | more than 10 years ago | (#7953324)

Information wants to be... Fuchsia!

*shrug*

e.

Re:Information... (1)

__past__ (542467) | more than 10 years ago | (#7953841)

Really? But I heard mauve has the most RAM!

What about Existing Data? (4, Interesting)

ParadoxicalPostulate (729766) | more than 10 years ago | (#7953330)

Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

You would need an enormous workforce to do that.

And if they don't plan on doing that, what about all the existing information? Is it going to be excluded from the database? Seems like much of a waste to me!

Damn but I would love to have access to one of these, even if the amount of information available will be minuscule (relatively speaking) for the next few years.

Re:What about Existing Data? (5, Funny)

Ronald Dumsfeld (723277) | more than 10 years ago | (#7953528)

Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

No, they're writing software to put in the XML tags.

What will be more interesting to see is if it's possible to pollute the database by putting in your own XML. Instead of Google-Bombing we'll have people pissing in the WebFountain.

Re:What about Existing Data? (1)

ParadoxicalPostulate (729766) | more than 10 years ago | (#7953575)

Ah, you are correct. I mistakenly took the "annotators" for humans (damn personification...) But, when I think about it in oversimplified terms, it sounds pretty funny: So they are writing software to categorize software so that it can be recognized by other software?

Re:What about Existing Data? (1)

azzy (86427) | more than 10 years ago | (#7953529)

If they are prepared to pay me enough, I'll do it!

Re:What about Existing Data? (2, Informative)

AndroidCat (229562) | more than 10 years ago | (#7953541)

According to the article, Web Fountain is supposed to sift through information which isn't XML tagged.

Entirely unsuited (3, Insightful)

happyfrogcow (708359) | more than 10 years ago | (#7953337)

From the article, "But many online information sources are entirely unsuited to the XML model--for example, personal Web pages, e-mails, postings to newsgroups, and conversations in chat rooms."

entirely unsuited? chrissake. email, unsuited. newsgroups, unsuited. chat rooms, unsuited. If personal home pages are unsuited, then so are corporate home pages, as there is nothing inherently different about the two. All this from an IEEE article... I would have thought them to be more accurate and less misleading. I could put <popularmusic>Pink</popularmusic> in my HTML as easily as Amazon could in theirs.

HTML is based on the XML model. HTML is used to create personal web pages. How on earth then, could personal web pages be "entirely unsuited to the XML model"?

Re:Entirely unsuited (0)

Anonymous Coward | more than 10 years ago | (#7953358)

Um... No, XML is based on the HTML model.

Re:Entirely unsuited (0)

Anonymous Coward | more than 10 years ago | (#7953413)

No, you are both wrong. HTML is an SGML application. XML is a simplification of SGML. XHTML is an XML application.

Oh, and "pink" is the colour - P!nk is the singer :)

Re:Entirely unsuited (1)

happyfrogcow (708359) | more than 10 years ago | (#7953414)

Um... No, XML is based on the HTML model.

no, XML is based on the SGML model. HTML too, with exceptions to some SGML features. more info: http://www.w3.org/TR/html401/intro/sgmltut.html [w3.org]

Re:Entirely unsuited (0)

Anonymous Coward | more than 10 years ago | (#7953428)

HTML is not based on "the XML model". HTML is a cut-down version of SGML, as is XML. There is a variant of HTML - XHTML - designed to parse as an XML document, but XHTML doesn't necessarily include any semantic information.

It is the semantic content that is important. Personal web pages will never be marked up with meaningful semantics, because doing that is a lot of work for little benefit to the writer. Corporate webpages are different: corporations employ IT people who should be capable of understanding the need for semantic markup, plus there could be benefits to a corporation of paying for the time and skills necessary to convert at least some of their web presence to a semantically-marked-up form.
I could put <popularmusic>Pink</popularmusic> in my HTML as easily as Amazon could in theirs.
Now explain to every single person with a Geocities page how to and why they need to do that.

HTML is based on the XML model. (1)

wiredog (43288) | more than 10 years ago | (#7953463)

Ummm. No. HTML predates XML.

Re:Entirely unsuited (4, Insightful)

orac2 (88688) | more than 10 years ago | (#7953525)

Disclaimer: I'm the author of the article.

Most people don't and won't tag as they go. (Except for those of us used to writing HTML-enabled comments on /. of course). Also, in order to be able to write <popularmusic>Pink</popularmusic>, and have it make sense, you'd have to be following a DTD.

As anyone who's been involved in DTD formulation can attest, even for internal documentation, it can be a royal pain in the butt. I don't think the vast majority of on-line rapid content generators (all those bloggers, emailers, chatters) will ever use XML to routinely tag their content manually. The article isn't talking about machine generated or commercial content, like Amazon's, but the day to day stuff that gets put up in the time it takes to write it and click submit, and which is of most interest to market researchers.
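
The point about needing an agreed vocabulary can be sketched in a few lines of Python. This is only an illustration: the standard library's `xml.etree.ElementTree` does not validate against a DTD, so a plain set of allowed tag names stands in for one here, and `KNOWN_TAGS`, `unknown_tags`, and the sample snippet are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared vocabulary; a stand-in for what a real DTD would declare.
KNOWN_TAGS = {"post", "popularmusic", "color"}


def unknown_tags(xml_text):
    """Return the tag names in xml_text that the vocabulary does not define."""
    root = ET.fromstring(xml_text)
    return sorted({el.tag for el in root.iter() if el.tag not in KNOWN_TAGS})


# One author follows the vocabulary, another invents a spelling of their own.
snippet = (
    "<post>I like <popularmusic>Pink</popularmusic> "
    "and <colour>pink</colour></post>"
)
result = unknown_tags(snippet)
```

Both tags are well-formed XML, but only one means anything to a program that expects the shared vocabulary, which is why hand tagging without an agreed DTD does not get you far.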

Re:Entirely unsuited (1)

xyzzy (10685) | more than 10 years ago | (#7953639)

More to the point, HTML tags for RENDERING, not semantics. To a first order, ALL HTML pages look alike.

if you read the article (0)

Anonymous Coward | more than 10 years ago | (#7953535)

if you read the article you would have seen that that statement is about the fact that people are not going to spend time xml tagging their irc chat and every blog entry and email.

Is "pink" the singer or the color? (-1, Offtopic)

AndroidCat (229562) | more than 10 years ago | (#7953346)

Answering the old question: "Which one's Pink?"

I'll believe this works well when they have something that can be checked for accuracy. At best, it'll be like voice recognition: another neat technology that needs a few more nines.

IBM needs this... (1)

G. Waters (172392) | more than 10 years ago | (#7953391)

IBM should try their own website. Passport-Advantage [lotus.com] is about the most hideous labyrinth I've ever spelunked (sp). IBM is not alone, but through sheer scale the site just screams "bueromaze".

Re:IBM needs this... (1)

null etc. (524767) | more than 10 years ago | (#7953511)

You think that site is bad, try mining through Symantec's site. Their online store, combined with their poor product differentiation between product models and product lines (i.e. Norton products vs. Symantec products), makes it impossible for anyone to be a Symantec guru. My friend is a product manager there, and I always give him flak about it.

Impact on Google IPO (2, Interesting)

G4from128k (686170) | more than 10 years ago | (#7953412)

This is the type of technology that could either ensure or derail Google's future (I'm not saying that it will, only that it could). Semantic analysis and clustering of web pages could improve search. I hope Google gets to use/create this type of tech.

Echelon? (2, Interesting)

SexyKellyOsbourne (606860) | more than 10 years ago | (#7953438)

This project sounds quite interesting -- it could really help out projects like Echelon [aclu.org] to help win the war on terrorism, if it's capable of understanding other languages of course, and could possibly build a whole database of information that's intercepted from other places. All that chatter, with the codewords they use, could possibly be understood by a football field full of Linux rackmounts, and might foil something.

Of course, such power could also be horribly misused if it came into the wrong hands. What if they wanted to enumerate every member or affiliate of the "terrorist" Green Party in the case of a "national emergency?" Feed WebFountain some data from the internet, and from ECHELON, and they would have a quick blacklist.

Or corporations, for that matter, as that's who it's designed for, could quickly blacklist people from employment who were considered "dangerous" such as whistleblowers, heavily involved union members, spies, watchdogs, and so forth.

Re:Echelon? (1)

pantycrickets (694774) | more than 10 years ago | (#7953533)

All that chatter, with the codewords they use, could possibly be understood by a football field full of Linux rackmounts, and might foil something.

You don't need much technology to predict future terrorist attacks. I just used google.. and look what I found! [telegraph.co.uk]

Re:Echelon? (3, Insightful)

orac2 (88688) | more than 10 years ago | (#7953672)

Disclaimer: I'm the author of the article.

I know, from talking to the WebFountain team that they're very sensitive to privacy concerns. WebFountain obeys robots.txt and doesn't archive material which has vanished from the publicly visible web (if only for reasons of storage capacity!).

The point is that all the information that feeds into IBM is already publicly available. If someone wanted to go after Green Party members and if the Green Party posted its membership roll on a webserver, I think they'd be able to get it, WebFountain or no.

Of course, I suppose WebFountain could be used to construct a membership list by scanning people's home pages to find out if they say that they're a member, but again this is publicly declared information.

Bottom line, as always: if you don't want it generally accessible to all, don't put it on a public web server.
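
For readers wondering what "obeys robots.txt" amounts to in practice, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents, the crawler name `ExampleCrawler`, and the URLs are assumptions made up for the example; nothing here reflects WebFountain's actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, agent, url):
    """Return True if robots_txt permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Made-up crawler name and addresses.
public_ok = allowed(ROBOTS_TXT, "ExampleCrawler",
                    "http://www.example.com/index.html")
private_ok = allowed(ROBOTS_TXT, "ExampleCrawler",
                     "http://www.example.com/private/members.html")
```

A well-behaved crawler makes this check before every fetch. Note that robots.txt is a request, not access control, so the bottom line above still holds.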

One Net to Rule Them All (5, Insightful)

null etc. (524767) | more than 10 years ago | (#7953441)

It would be nice if, in parallel to the Internet, another network was developed to hold only semantically organized knowledge. That network would be free of marketing and commercial business, and would ostensibly be the largest repository of organized knowledge on the planet. Think Internet2, based entirely in XML.

Similar to HTML's current weakness in separating presentation from content, the web today has a weakness in separating content sites from sales sites. Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic than to find actual content regarding that topic. This inability to separate queries for knowledge from queries for product sales literature is especially frustrating for scientists and programmers. I think Google is taking a step towards this with Froogle, meaning that if Froogle becomes popular enough, it's possible that Google will strip marketing pages from their search results.

Worse even, is when someone registers a thousand domains (plumbing-supplies-store.com, plumb-superstore-supplies.com, all-plumbing-supplies.com, etc) and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage. You would think that Google could detect this "marketing domain spam" and reduce the relevancy of such search results.

Anyways, I can't complain, because I can find nearly anything on the web I need, compared to 10 years ago.
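
Detecting the identical-page case described above is not hard in principle. A minimal Python sketch, assuming the page text has already been fetched: normalize each page, hash it, and group domains that share a hash. This is not a description of what Google does; the function names are hypothetical, two of the domains are borrowed from the comment, and the page text is invented. It only catches exact copies after normalization, so a spammer who varies a few words would slip through.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a copy."""
    return " ".join(text.lower().split())


def group_duplicates(pages):
    """Map of domain -> page text in, list of duplicate-domain groups out."""
    groups = defaultdict(list)
    for domain, text in pages.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(domain)
    return sorted(sorted(d) for d in groups.values() if len(d) > 1)


# Invented page contents for illustration.
pages = {
    "plumbing-supplies-store.com": "Buy my plumbing supplies!",
    "all-plumbing-supplies.com": "Buy  my PLUMBING supplies!",
    "example.org": "A history of indoor plumbing.",
}
dupes = group_duplicates(pages)
```

A search engine could then show one result per group, or lower the ranking of every domain in it.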

Re:One Net to Rule Them All (1)

ThomasXSteel (545884) | more than 10 years ago | (#7953702)

It would be nice if, in parallel to the Internet, another network was developed to hold only semantically organized knowledge.

Don't see why you need another whole network to do this. See wikipedia [wikipedia.com] . It may not have the uber xml web service driven aspect oriented paradigm shifting 300 grand per annum buzzword love, but it works fine for me. Besides, we can just graft that shite on later if it turns out to be useful.

In other news... (1)

jetkust (596906) | more than 10 years ago | (#7953472)

Researchers in Alabama are working on a system which converts all music on the internet into a single Menudo mp3 file. EIEIO reports the first public use will be to create a single mp3 file that results in trillions of dollars in royalties to the RIAA when traded illegally.

Pink is... (0, Offtopic)

LJPeixoto (130298) | more than 10 years ago | (#7953484)

Brain's coworker :-)

Watch cartoons all day and see your mind melt down :-)

i.e. nameprotect (3, Interesting)

joeldg (518249) | more than 10 years ago | (#7953507)

nameprotect does something similar, except they are looking for people violating copyrights.
in addition I think they might be one of the most banned bots online.

anyway, their users are all corporate entities who pay a lot of money to be able to auto-cease and desist copyright infringers..

These same companies will pay IBM to tell them that since their cease and desist spree everyone hates them.

almaden webspider (1)

MrSpiff (515611) | more than 10 years ago | (#7953520)

is the Almaden webspider (http://www.almaden.ibm.com/cs/crawler/) that's been scavenging in the dark a part of this?

URL of the project page (2, Informative)

DerOle (520081) | more than 10 years ago | (#7953521)

WebFountain [ibm.com]

Like NorthernLight? (4, Informative)

dpbsmith (263124) | more than 10 years ago | (#7953527)

This sounds very similar to NorthernLight.

NorthernLight was (it still exists, but apparently is not available to the nonpaying public at all) a search engine that displayed its results automatically sorted into as many as fifteen or twenty categories, automatically generated on the basis of the search. (For some reason, they called these categories "custom search folders.")

Since it's no longer available to the public I can't give a concrete example. I can't test it to see whether a search on "Pink" creates a couple of folders labelled "Singer" and "Color," for example. But that's exactly the sort of thing it does/did.

I actually would have used NorthernLight as one of my routine search engines--it worked quite well--had it not been for another major annoyance: in the publicly available version, it always searched both publicly available Web pages and a number of fee-based private databases, so whatever you searched for, the majority of the results were in the fee-based databases and I would have had to pay money to see what they were. In other words, it was heavy-handed promotion of their paid services and had only limited utility to those who did not wish to buy them.

Re:Like NorthernLight? (2, Informative)

Wiktor Kochanowski (5740) | more than 10 years ago | (#7953690)

Vivisimo [vivisimo.com] is doing sorting searches.

Try it out, works quite often for me - beats Google for many queries, not in actual number of pages found, but in the time it takes me to find out whatever I'm looking for.

Gaming Webfountain (3, Interesting)

G4from128k (686170) | more than 10 years ago | (#7953534)

I wonder how long it will take sleazy e-commerce sites and p0rn sites to game WebFountain and turn it into SpamFountain?

I suspect that this tool (and any like it) must make a core assumption -- that each webpage is about one semantic thing and that the creators are trying to communicate that one thought. In contrast, people who try to boost their page rank have no compunction about misleading people (or algorithms). Clever tagging and misleading verbiage should be able to fool IBM's analyzer into clustering a site where it does not belong (but where the site owner wants it). The result is pages that look like they are about one thing (some popular search term) while being about something else (selling their junk or porn).

Next will come high-priced consultants that tell you how to make your site place highly on WebFountain (like the ones that currently game Google).

Maybe we should just outsource (1)

CompWerks (684874) | more than 10 years ago | (#7953543)

this project to india.

IBM's Pink (2, Funny)

th77 (515478) | more than 10 years ago | (#7953563)

IBM should know that Pink was the predecessor to Taligent [wikipedia.org] which was the predecessor to absolutely nothing.

Intel-based? (1)

trACE666 (731643) | more than 10 years ago | (#7953570)

Why does IBM use PC hardware?
Wouldn't it make more (marketing) sense to use one of their own platforms, I guess the z-Series should be the most suited for that amount of data...

A good idea for search engines follow? (0)

dollar70 (598384) | more than 10 years ago | (#7953591)

Don't get me wrong, I like getting a little web-traffic (I said a little so no /.ing please!), but when I look through my logs and see searches where Google is referring people to my site inappropriately, I almost want to scream at the mindlessness they use to categorize my web pages. On the one hand, I'm flattered, but on the other hand it's disturbingly out of context. I even put the <meta NAME="robots" CONTENT="noindex,noarchive"> line in the headers that were giving me headaches, but people still end up at my site looking for that damned lemonparty.jpg just because I mentioned it in my blog once.
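
For reference, here is a minimal sketch of how a well-behaved crawler might read that robots meta tag before deciding whether to index a page. It uses only Python's standard `html.parser`; the class and function names are made up for illustration, and nothing here reflects how Google or WebFountain actually process the tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_index(html):
    """Hypothetical policy: index unless the page says noindex or none."""
    return not ({"noindex", "none"} & robots_directives(html))
```

A crawler that honors the tag would skip any page for which `may_index` returns `False`; the tag is only a request, so nothing forces a crawler to comply.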

Re:A good idea for search engines follow? (1)

rcastro0 (241450) | more than 10 years ago | (#7953847)

Shame on me for being curious.
lemon party [urbandictionary.com]
a group of 3 or more old men in a circle sucking each other off
Bill and Carl joined their grand fathers at the lemon party
I don't want to think about why this term ever arose and was able to drive traffic through Google.

ObSCO ref (1)

gosand (234100) | more than 10 years ago | (#7953605)

IEEE reports that the first commercial use will be to track public opinion for companies.


Can't wait to see what the entry for SCO looks like...

so-called tags (1)

jamesl (106902) | more than 10 years ago | (#7953621)

"Things such as price or product identification numbers are identified by bracketing them with so-called tags, as in Deluxe Toaster , $19.95 ."

They're "tags", not "so-called tags".

Tags! Like those little things they hang on stuff at the store to tell you how much it costs. Tags.

Of course, he may have been referring to their use in a "software program".

How long before people start gaming the system? (4, Interesting)

dpbsmith (263124) | more than 10 years ago | (#7953631)

As Google has discovered, it's only possible for simple heuristics and algorithms to "understand" the human content on the Web for as long as it doesn't matter.

As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

And the stakes are much higher for gaming WebFountain than for gaming Google.

For example, I'd imagine there would be big money for anyone who could convince companies that they know how to make it appear that a particular movie/song/toy/computer was "hot," so that the WebFountain-using Walmarts and Best Buys of the world would stock more of it.

WebFountain will work well only until it is actually introduced.

Re:How long before people start gaming the system? (2, Informative)

orac2 (88688) | more than 10 years ago | (#7953742)

Disclaimer: I'm the author of the article

As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

This could be tricky -- WebFountain uses a kitchen sink approach, with a varying palette of content discriminators and disambiguators. The developers are also savvy enough to downweight link farm type approaches. Of course, one could, say, conduct a campaign among bloggers to mention a term and make it appear well-known to WebFountain, but the inevitable consequence is that it would then actually be well-known!

"Is this web site selling something"? (3, Insightful)

Animats (122034) | more than 10 years ago | (#7953634)

Search engine spiders need to understand more about sites. Things like this:
  • The site is selling something.
  • The page is composed of multiple unrelated articles or ads, each one of which should be viewed as a separate entity for search purposes.
  • The page is part of a blog.
  • Content on this site duplicates that found on other sites.
  • The site is owned by an organization with a known Dun and Bradstreet number. (If a site is selling something, and its Whois info doesn't match the DNB corporation database, it should be downgraded in search position. This would encourage honest Whois info.)
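
One item on that list, spotting content duplicated across sites, can be roughed out with word shingles and Jaccard similarity. This is only a sketch of a common textbook technique, not anything a particular search engine is known to use; the shingle size of 3 and the 0.8 threshold are arbitrary assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty pages are trivially identical
    return len(sa & sb) / len(sa | sb)


def looks_duplicated(a, b, threshold=0.8):
    """Flag two pages as near-duplicates (threshold is an assumption)."""
    return jaccard(a, b) >= threshold
```

A real spider would hash the shingles and compare compact sketches rather than full sets, since comparing every pair of pages directly does not scale.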

SCO (4, Funny)

Zork the Almighty (599344) | more than 10 years ago | (#7953638)

IEEE reports that the first commercial use will be to track public opinion for companies.

Searching "SCO"
Found "Slashdot"
ERROR arithmetic underflow.

CrapFountain (4, Funny)

s4m7 (519684) | more than 10 years ago | (#7953661)

Here's how it works:

Executive Bob, who's paid IBM $150,000 for his enterprise license of WebFountain, enters into his WebFountain search box: "Pink the musician, not the color"

IBM's powerful software parses this command into "pink music -color" and passes it to google, retrieves the results, removes Google's paid ads and replaces them with IBM's paid ads. The content is then served to Executive Bob, who shouts: "EUREKA" since within the top ten search results he finds "NUDE PICTURES OF RAPPER PINK!"

IBM then lands a lucrative support contract with Executive Bob to remove all the viruses and spyware from his desktop PC. Rinse and Repeat.

first post (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#7953670)

read it and weep fuckers! ahahahahaha!

Half a football field? (4, Interesting)

AndroidCat (229562) | more than 10 years ago | (#7953677)

(Imperial or metric football fields?)
IBM's breakthrough is called WebFountain--half a football field's worth of rack-mounted processors, routers, and disk drives running a huge menagerie of programs.
Later:
It uses a cluster of thirty 2.4-GHz Intel Xeon dual-processor computers running Linux to crawl as much of the general Web as it can find at least once a week.

To ensure that WebFountain's finger is constantly on the pulse of the Internet, an additional suite of similar computers is dedicated to crawling important but volatile Web sites, such as those hosting blogs, at least once a day. Other machines maintain access to popular non-Web-based sources, such as Usenet (a newsgroup service that predates the Web) and the Internet Relay Chat system, known as IRC. The data is then passed into WebFountain's main cluster of computers, currently composed of 32 server racks connected via gigabit Ethernet. Each rack holds eight Xeon dual-processor computers and is equipped with about 4-5 terabytes of disk storage.

That's a lot of stuff, but half a football field? Possibly they're including cubicles for the staff or did they just inherit some old Big Iron space that was that large?
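
Totting up the main cluster from the quoted figures gives a sense of scale. This counts only the 32 racks described above and reads "4-5 terabytes" as per-rack storage, which is how the quoted sentence is phrased; the crawler machines are left out because the article does not say how many there are beyond the first thirty.

```python
# Figures taken from the quoted article text.
RACKS = 32
MACHINES_PER_RACK = 8
CPUS_PER_MACHINE = 2      # dual-processor Xeons
TB_PER_RACK = (4, 5)      # "about 4-5 terabytes" per rack (low, high)

machines = RACKS * MACHINES_PER_RACK
cpus = machines * CPUS_PER_MACHINE
storage_tb = tuple(RACKS * tb for tb in TB_PER_RACK)

print(f"{machines} machines, {cpus} processors")
print(f"{storage_tb[0]}-{storage_tb[1]} TB of disk")
```

That works out to 256 machines, 512 processors and 128-160 TB of disk in 32 racks.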

Prior art :o) (3, Funny)

Mr_Silver (213637) | more than 10 years ago | (#7953752)

IEEE reports that the first commercial use will be to track public opinion for companies

You can do that already with Google:

A search for "Microsoft is evil" gets you 600,000 pages.

A search for "Microsoft is good" gets you 3,590,000 pages.

Therefore Microsoft is more good than evil.

Err ... that wasn't quite the answer I was expecting.

(cue sounds of joke falling apart...)

BSD or GPL equivalent? (0, Troll)

cowbrain_jimbo_ox (739942) | more than 10 years ago | (#7953782)

Sounds good. There ought to be something similar under BSD or GPL.

Political dissidents would definitely benefit from this kind of super search system, and so do normal users like kids doing searches for their homework.

We need our own "commie" version.

I wish I was fluent in computer languages or else I'd be the first one to start this up under BSD licence.

Any suggestions as to what language I need to learn to develop this kind of search engine?
It's gotta have a capability like freenet [sourceforge.net] to distribute load on the network and the system while keeping users anonymous, since private users won't have the resources to come up with 1000s of servers. I'm thinking along the lines of XML.

Potential money saver: Differential buzz (2, Insightful)

benja (623818) | more than 10 years ago | (#7953843)

The head of a research and development department could feed WebFountain all the e-mails, reports, PowerPoint presentations, and so on that her employees produced in the last six months. From this, WebFountain could give her a list of technologies that the department was paying attention to. She could then compare this list to the technologies in her sector that were creating a buzz online. Discrepancies between the two lists would be worth asking her managers about, allowing her to know whether or not the department was ahead of the market or falling dangerously behind.

This is a potentially very useful money-saver. Currently companies employ hordes of middle-management people who do little else than detect discrepancies between the technologies that their department is focusing on and those that are currently all the buzz. Now we can create an automatic boss that sends out e-mails like, "What's this IP-over-XML thing and why don't we use it and how soon can you have all our critical systems migrated to it?"
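
The comparison step itself is little more than a set difference once the two term lists exist. A toy sketch, assuming the hard part (extracting technology terms from e-mails and from the Web) has already been done by something like WebFountain; the function name and sample terms are invented for illustration.

```python
def differential_buzz(internal_terms, external_terms):
    """Compare what a department talks about with what the market talks about."""
    internal = {term.lower() for term in internal_terms}
    external = {term.lower() for term in external_terms}
    return {
        # Buzzing outside but absent inside: possibly falling behind.
        "missing_internally": sorted(external - internal),
        # Discussed inside but not outside: ahead of the market, or off track.
        "not_buzzing_externally": sorted(internal - external),
    }


report = differential_buzz(
    internal_terms=["XML", "Token Ring", "Java"],
    external_terms=["xml", "Web services", "java", "RSS"],
)
for term in report["missing_internally"]:
    print(f"What's this {term} thing and why don't we use it?")
```

Which list counts as "ahead" and which as "behind" is still a judgment call the automatic boss cannot make.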
