Nutch: An Open Source Search Engine

Hemos posted about 11 years ago | from the but-will-it-matter dept.

The Internet 291

Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch. In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising? Nutch is clearly in their initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.

291 comments

FROST! (-1, Troll)

Anonymous Coward | about 11 years ago | (#6689361)

Oh yeah, dip my nuts in it!

I am teh FP Mastar (-1, Troll)

Anonymous Coward | about 11 years ago | (#6689447)

I was in the other room jerking it, then I thought "Hey, it would be kind of funny to say 'dip my nuts in it'(as my nuts were coated in lotion) on Slashdot"
So I finish off, then turn on the interweb, and sure enough, I was first.

My life is complete.

SIR (0)

Anonymous Coward | about 11 years ago | (#6689576)

While your first post (reply) was quite amusing, you lost a few points:
1) You should have used "nutchs" instead of "nuts"
2) You are a "FP Mastur", not a "FP Mastar"
3) ???
4) PROFIT!!1

Hook it up to slashdot! (1, Insightful)

FortKnox (169099) | about 11 years ago | (#6689365)

The slashdot search page could definitely use this kinda technology!

Re:Hook it up to slashdot! (1)

Gherald (682277) | about 11 years ago | (#6689388)

The slashdot database is pretty huge.. I wonder if the servers could handle this kind of indexing?

Re:Hook it up to slashdot! (3, Informative)

Anonymous Coward | about 11 years ago | (#6689414)

Just use google. Search for "SEARCH-STRING site:slashdot.org"
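The site: trick above is easy to script. A minimal sketch, assuming Google's public https://www.google.com/search endpoint and its q parameter; the function name site_search_url is made up for illustration.

```python
from urllib.parse import urlencode


def site_search_url(terms, site="slashdot.org"):
    """Build a Google query URL restricted to one site via the site: operator."""
    query = f"{terms} site:{site}"
    # urlencode percent-escapes the colon and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("nutch open source"))
```

Paste the printed URL into a browser and you get the same results as typing the query by hand.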

Biutch (0)

Anonymous Coward | about 11 years ago | (#6689366)

open source pimp engine

NUTCH SOUNDS A LITTLE BIT LIKE SNATCH! (-1, Troll)

Anonymous Coward | about 11 years ago | (#6689367)

GO LINUX!

Nutch to this (-1, Troll)

Anonymous Coward | about 11 years ago | (#6689369)

Patents. (5, Interesting)

Christopher Thomas (11717) | about 11 years ago | (#6689370)

I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.

Re:Patents. (2, Interesting)

socrates32 (650558) | about 11 years ago | (#6689494)

"most of the good search and indexing schemes have already been patented" Not at all... just the easy ones.
If this is to be cheap to run, it will probably have to be distributed, and thus a very different architecture than most of what we've seen up to now.

Re:Patents. (0)

Anonymous Coward | about 11 years ago | (#6689594)

Dude. Read about google's network. Seriously.

Re:Patents. (1)

alwayslurking (555708) | about 11 years ago | (#6689680)

I think the parent probably meant distributed in a Folding-at-home or SETI sense. Google use a massive cluster, but it's all on-site and owned by them, AFAIK.

Lucene (index and search engine) (1, Informative)

Anonymous Coward | about 11 years ago | (#6689621)

Check out Lucene [apache.org] , the indexing and search engine used by Nutch. From what I've heard, Nutch is mainly the spider/crawler used to gather documents.
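For anyone wondering what "indexing and search" means at its simplest, here is a toy inverted index. It is only a sketch of the general idea and not Lucene's API or file format; build_index and search are hypothetical names, and real engines add tokenizing, scoring and on-disk storage.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical mini-corpus, standing in for pages a crawler gathered.
docs = {
    1: "open source search engine",
    2: "search engine advertising",
    3: "open source crawler",
}
index = build_index(docs)
print(sorted(search(index, "open source")))    # [1, 3]
print(sorted(search(index, "search engine")))  # [1, 2]
```

The crawler's job is just to fill in docs; everything after that is lookups against the index.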

Re:Patents. (4, Insightful)

Feztaa (633745) | about 11 years ago | (#6689678)

I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.

Hmmm, I just realized something... with patents, you end up stepping on people's toes. Without patents, you get to stand on their shoulders. Which do you think is the better vantage point?

The purpose of a search engine (1)

Stalemate (105992) | about 11 years ago | (#6689373)

After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?


I'm pretty sure a search engine is supposed to be for whatever purpose the people making it want it to be.

Re:The purpose of a search engine (2, Funny)

yamcha666 (519244) | about 11 years ago | (#6689406)

I'm pretty sure a search engine is supposed to be for whatever purpose the people making it want it to be.

And I'm sure many Slashdotters would love a search engine dedicated to find pr0n and anti-Microsoft propaganda. Right?

Re:The purpose of a search engine (4, Funny)

AVryhof (142320) | about 11 years ago | (#6689558)

Here you go.

Porn [sublimedirectory.com]

Anti-Microsoft Propaganda. [slashdot.org]

Exploits (0, Troll)

Greenisus (262784) | about 11 years ago | (#6689375)

My biggest concern is that the developers will simply be in a scramble patching up exploits, instead of actually making their technology better.

Google? (5, Informative)

devphaeton (695736) | about 11 years ago | (#6689378)

Last i heard google still doesn't accept bribes for page ranking.

Unobtrusive adverts in the right-hand column notwithstanding.

Re:Google? (-1, Troll)

Anonymous Coward | about 11 years ago | (#6689461)

"Last i heard google still doesn't accept bribes for page ranking"

thats pure BS.. thats all I got to say on that.

Re:Google? (1)

billstr78 (535271) | about 11 years ago | (#6689479)

Yeah, and the last I heard: this was the only search engine anyone used.

Re:Google? (3, Insightful)

delcielo (217760) | about 11 years ago | (#6689500)

I have to agree. And I don't see my allegiance to Google as a sell-out. I see it as a reward for good work.

Re:Google? (1)

capedgirardeau (531367) | about 11 years ago | (#6689532)

Maybe they don't take money, I can't really say for sure, but they do adjust the rankings for some pages.

They have been called on it before, as I recall, and refused to reveal what their criteria were for when they would manually adjust a page's rank.

Read their support pages; nowhere do they say they do not manually adjust the page ranks.

But they are still the best thing in town.

Re:Google? (2, Informative)

fireboy1919 (257783) | about 11 years ago | (#6689589)

Yeah, they've been known to do that when people make server farms to attempt to influence the rankings of Google. It is in their best interest to ensure that the pages that people actually want to see come up first, not the advertisers' pages.

That's why people use Google. If they stacked the deck supporting places people don't care about (advertisers' pages, for instance), then we'd all jump ship and use another search engine.

They're like the Swiss and Consumer Reports. Part of the reason they make money is neutrality, and they won't make as much if they're not.

Anyone ever heard of grub? (2, Informative)

nadadogg (652178) | about 11 years ago | (#6689649)

Grub is another open-source search engine. I have the client running right now; it's nice and distributed. I think this kind of idea is great.

Re:Google? (1)

The Clockwork Troll (655321) | about 11 years ago | (#6689671)

Last i heard google still doesn't accept bribes for page ranking.
They aren't stupid, give them some credit.

If they accepted bribes for anything, it would be for concrete information about their ranking algorithm.

pump up teh NOISE! (-1, Offtopic)

Anonymous Coward | about 11 years ago | (#6689379)

Torll on!

Lost Cause (0)

Anonymous Coward | about 11 years ago | (#6689382)

Didn't Google start out that way and then realize that it is very expensive to maintain a search engine? Google also clearly differentiates between its ads and its results.

I guess the more the merrier but I wouldn't bank on this thing becoming more than a curiosity.

WHY? (-1, Offtopic)

Anonymous Coward | about 11 years ago | (#6689383)

What is the point of this post?

Slimey adverts? (3, Insightful)

Acidic_Diarrhea (641390) | about 11 years ago | (#6689384)

Yes, having advertising affecting search results is not good for the end user but (and I'm just bringing this up as a discussion topic), in what other ways can a search engine make money? It's clear that running a search engine has costs associated with it. To offset these costs, it seems like advertising is the only way to go. Now I can see that some search engines handle this in a more "slimey" way than others (I am happy with Google) but this project seems to want to avoid advertising at all costs. Where does the money come from then?

Also of note is that companies can still influence search engines in slimey ways - Google can be manipulated to make a page rank higher, although Google keeps an eye on this activity and works around it.

Re:Slimey adverts? (0)

darkstar949 (697933) | about 11 years ago | (#6689439)

I agree, a good search engine is going to face problems in paying for bandwidth, hosting, etc., and unless you have paid advertising you would be hard pressed to keep it running for long. The only possibilities left would be to have a company pay to host it (which may have problems), or to have user donations keep it running, similar to PBS.

Re:Slimey adverts? (0)

Anonymous Coward | about 11 years ago | (#6689504)

Why not give it a distributed architecture, so people can donate some of their servers' time to contributing?
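One way to picture the distributed idea above: each donated server takes responsibility for a slice of the web, chosen by hashing the hostname. This is only a sketch under that assumption. Nutch's actual design isn't described in this thread, and assign_node and the volunteer names are hypothetical.

```python
import hashlib
from urllib.parse import urlparse


def assign_node(url, nodes):
    """Pick the volunteer node responsible for a URL by hashing its hostname."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer bucket selector.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical volunteer machines.
nodes = ["volunteer-a", "volunteer-b", "volunteer-c"]

# Every page on the same host lands on the same node, so one machine
# can crawl that host politely without coordinating with the others.
first = assign_node("http://slashdot.org/search.pl", nodes)
second = assign_node("http://slashdot.org/faq/", nodes)
print(first == second)  # True
```

A plain modulo like this reshuffles most hosts whenever a volunteer joins or leaves, so a real system would want something steadier, such as consistent hashing.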

Re:Slimey adverts? (2, Funny)

M.C. Hampster (541262) | about 11 years ago | (#6689508)


To offset these costs, it seems like advertising is the only way to go. Now I can see that some search engines handle this in a more "slimey" way than others (I am happy with Google) but this project seems to want to avoid advertising at all costs. Where does the money come from then?

You speak blasphemy! How dare you speak of such practical issues as money when talking about free software!

Re:Slimey adverts? (5, Insightful)

Anonymous Coward | about 11 years ago | (#6689573)

This project is the SOFTWARE to run a search engine. Not a corporation that needs to generate income to justify the resources required to run the search engine.

Anyone could take this source code and with enough money, challenge Google.com as the top search engine.

I see this project as a competitor to shrink-wrapped search engines, e.g. the Google appliance [google.ca] or maybe even Folio-based products. Typically corporations have many documents that need to be indexed and searchable to their needs.

The homepage doesn't list what content it can index. I hope it can at least index PDFs and popular Office documents. Maybe even media files? And what about XML indexed fields? Or external metadata?

Some serious shortcomings of Nutch (-1, Offtopic)

Anonymous Coward | about 11 years ago | (#6689386)

omg my nutch thingie doesnt work!! omfg it sux!
gay!! lol me goto google omg lol

Seems like /. (1, Insightful)

darkstar949 (697933) | about 11 years ago | (#6689391)

This seems to me like the /. moderation system, with the pages being ranked based upon how the user feels about the site.
However, I could see some disadvantages to the system depending upon how it is set up, because one person could keep dinging a site to get its score to drop down.
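The repeated-dinging worry above goes away if each user gets exactly one vote per site. A minimal sketch of that rule, assuming a simple +1/-1 vote; nothing here comes from Nutch or from Slashdot's moderation code, and SiteVotes is a made-up name.

```python
from collections import defaultdict


class SiteVotes:
    """Track one +1/-1 vote per user per site; re-voting overwrites."""

    def __init__(self):
        self._votes = defaultdict(dict)  # site -> {user: +1 or -1}

    def vote(self, user, site, value):
        if value not in (1, -1):
            raise ValueError("value must be +1 or -1")
        # A repeat vote replaces the earlier one instead of stacking.
        self._votes[site][user] = value

    def score(self, site):
        return sum(self._votes.get(site, {}).values())


votes = SiteVotes()
votes.vote("alice", "example.org", 1)
votes.vote("bob", "example.org", 1)
for _ in range(100):
    votes.vote("mallory", "example.org", -1)  # dinging 100 times

print(votes.score("example.org"))  # 1
```

This does nothing against one person with a hundred accounts, which is the harder half of the problem.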

Biased listings (4, Insightful)

Champaign (307086) | about 11 years ago | (#6689392)

I think many commercial search engines have learned that biasing themselves to sites that have paid them is a good way to erode consumer confidence, and damage their readership/userbase. Just as newspapers have to at least provide the image of objectivity, the same demands are on search engines.

I'm quite comfortable with how Google does this (present commercial links clearly marked to the side), and am not convinced a non-commercial (open source) alternative is needed.

It's not "Businuess" (1)

The Bungi (221687) | about 11 years ago | (#6689394)

Businuess 2.0 article on it.

It's "Business". Hope that helps.

Actually, it's "Bidness", white boy! (1)

figlet (83424) | about 11 years ago | (#6689604)

Actually, it's "Bidness".
Yor momma! :-)

just don't get it (3, Insightful)

Astrorunner (316100) | about 11 years ago | (#6689401)

I think that you absolutely have to have a closed source algorithm for ranking pages, because otherwise you'll get people who will simply tune their pages to be high on the list. I can see how making the majority of the search engine open source would be beneficial, but the algorithm itself? It's like saying "Here's the keys to my car" and thinking that, because everyone has access to the keys, no one's going to drive away with it. Sure, everyone has the opportunity to make your search engine better, but never underestimate the tenacity of a web-wanna-be-millionaire.

Re:just don't get it (4, Insightful)

cduffy (652) | about 11 years ago | (#6689475)

Think about cryptosystems: The whole point about the really good ones is that you can know the algorithm, but still not break it. Granted, pulling that off for a search engine is prone to be much, much harder -- but I *do* believe it's well within the realm of possibility. Ambitious in the extreme? Certainly... but there's something to be said for high-risk-high-reward projects.

Re:just don't get it (0)

Anonymous Coward | about 11 years ago | (#6689679)

This sounds an awful lot like the closed-source security-by-obscurity-i-am-an-industry-official type of speak..

The same arguments apply.

I see two problems (1, Interesting)

Anonymous Coward | about 11 years ago | (#6689403)

Two problems:

  1. Bandwidth. Having to search through so much data is going to take so much bandwidth, how could you pay for it?
  2. Patents. Google has lots of patents in this area, I imagine other search engines do as well. This is one area where I think software patents are deserved, since Google's algorithms are actually innovative. I don't think they will be willing to let you use their patents in a GPLed app.

Hardware and Bandwidth (1)

metalhed77 (250273) | about 11 years ago | (#6689529)

According to http://www.nutch.org/docs/credits.html, the Internet Archive is hosting Nutch, and Overture has given them hardware. Sounds pretty sweet. Probably not the 20,000-strong Linux cluster Google has going, though.

If it's like every other SourceForge project... (2, Insightful)

realmolo (574068) | about 11 years ago | (#6689405)

Here's what I expect to see on the webpage in a few months: "Currently Nutch is in the alpha stage - it doesn't index any web pages, doesn't return any results, and has no user interface. Programmers needed!" Google has WON the search engine war, probably forever. Find some other mountain to climb, guys.

Search engine game is NOT over (4, Insightful)

AtariAmarok (451306) | about 11 years ago | (#6689449)

"Google has WON the search engine war, probably forever. Find some other mountain to climb, guys."

At one time, Oldsmobile won the auto company wars. Where are they now?

IBM ruled the PC roost. Hmmmm....

Command-line OS's were king. But now???

Altavista and infoseek and Lycos were search engine kings at one time. Whither this trio?

The point is, it is not over.

Neither is THIS GAME, MOTHERFUCKER...! (-1, Offtopic)

Anonymous Coward | about 11 years ago | (#6689512)

Seriously, this sounds like the next "craze" to sweep the gay comunity...right up there with stuffing life animals up their filthy rectums.

"Nutching", adv., to "nutch.": A disgusting sex act whereupon a male regurgitates days old feces - which he has previously ingested from his partners anus - and liberally smerars it on his partner's face.

So there you have it.

What??? (1)

jawtheshark (198669) | about 11 years ago | (#6689593)

Command-line OS's were king. But now???

What??? And nobody sent me the memo.... (Posting from Lynx from a *BSD shell)

Re:If it's like every other SourceForge project... (0)

Anonymous Coward | about 11 years ago | (#6689501)

So since MS has won the browser and OS war, we should abandon all open source projects relating to those as well, right? Mod parent down -1 dumbass.

Accuracy is relevance (2, Informative)

AtariAmarok (451306) | about 11 years ago | (#6689407)

To me, accuracy is the most important "Relevance".

The problem with Google is that there are errors in it: you ask for something and sometimes you get something else.

A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.

Since it seems like Google will never fix this problem, I'm looking forward to something with all of Google's great features, plus accuracy.

Re:Accuracy is relevance (3, Informative)

binaryDigit (557647) | about 11 years ago | (#6689499)

A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.

This is a bit of a misrepresentation. Google will toss the words 'to' 'be' and 'or'. So you effectively end up searching on 'not'. It does this to eliminate words that show up too frequently and make the searches faster (and because of the overloading of the word 'or'). If you really want that text, then either quote the whole thing, or place a '+' in front of those words, which will give you exactly what you're looking for. So there is no problem with its accuracy when you understand the proper way to ask it for something.
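The behaviour described above can be mimicked in a few lines. This is a sketch of the idea only: the stop-word list is made up, the handling of 'or' as an operator is left out, and none of it is Google's actual query parser.

```python
import re

# Illustrative stop-word list, not Google's.
STOP_WORDS = {"to", "be", "or", "the", "a", "of"}


def parse_query(query):
    """Split a query into (phrases, terms).

    Quoted phrases are kept whole. Loose stop words are dropped
    unless they carry a leading '+'.
    """
    phrases = re.findall(r'"([^"]+)"', query)
    rest = re.sub(r'"[^"]+"', " ", query)
    terms = []
    for token in rest.lower().split():
        if token.startswith("+") and len(token) > 1:
            terms.append(token[1:])  # forced term: keep even if a stop word
        elif token not in STOP_WORDS:
            terms.append(token)
    return phrases, terms


print(parse_query('to be or not to be'))    # ([], ['not'])
print(parse_query('"to be or not to be"'))  # (['to be or not to be'], [])
print(parse_query('+to +be or not'))        # ([], ['to', 'be', 'not'])
```

The first call shows why the unquoted search effectively becomes a search for 'not'.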

Re:Accuracy is relevance (1)

antibryce (124264) | about 11 years ago | (#6689552)

Thank you for pointing that out. It seems most people when pointing out problems with google are really just highlighting their lack of understanding of how it works.

Imagine if I complained that Linux needed lots more work because when I'm at the command line I get an error from typing "move my email inbox to the floppy disk."

That's the problem (1)

AtariAmarok (451306) | about 11 years ago | (#6689564)

"Google will toss the words 'to' 'be' and 'or'."

That is the problem. The reason I put such words in phrases is because I want an exact match.

" It does this to eliminate words that show up to frequently and make the searches faster"

I would hope that Google solves this by getting faster servers, instead of producing bad results. Besides, if I did not want the results to include all the words in the phrase, I would not have included them in the phrase in the first place.

" If you really want that text, then either quote the whole thing, or place a '+' in front of those words"

I did quote the whole thing, and got 70% accuracy. By putting plusses in front of the words, I still got 70% accuracy.

"So there is no problem with it's acurracy when you understand the proper way to ask it for something."

Quotes around the phrase do not work. Plus in front of all the words fails too. What is the secret of "the proper way"? More importantly, why won't it do the most intuitive thing: try to match the phrase as it is typed?

Re:That's the problem (0)

Anonymous Coward | about 11 years ago | (#6689659)

Quotes around the phrase do not work. Plus in front of all the words fails too. What is the secret of "the proper way"? more importantly, why won't it do the most intuitive thing: try to match the phrase as it is typed?


I don't know what you're doing wrong, but when I type the phrase in quotes I get page after page of good hits. Same with putting +'s in front of each word (although the results were different because it was no longer looking for that particular string of words together.)

Re:Accuracy is relevance (1)

WTFmonkey (652603) | about 11 years ago | (#6689591)

Well, if you slap some double-quotes around it (which I'm assuming is what was intended), you get accurate, but maybe not what you were [probably] looking for.

The first link is about Barium Enemas, I shit you not.
The second is about BeOS, and the third is some randomass link at funbrain.com.
In the fourth we finally get some Shakespeare.
Point is, these are all links that "capitalized" on the "to be or not to be" cliche and so are accurate results. Although, probably not what you were looking for. Next time try "Hamlet," "Shakespeare," or like that.

If all you know is the "to be or not to be" part, and can't remember who said it, or where they said it, hitting it on the fourth link is pretty damn good for a search that blind.

Re:Accuracy is relevance (1)

bersl2 (689221) | about 11 years ago | (#6689608)

Why is a non-zero failure rate such an abominable thing? At some times, maybe finding something you weren't expecting is a positive. Perhaps a search engine with a "fooling around" mode using a more heuristic search method (which still excludes keyword floods)?

Re:Accuracy is relevance (1)

AtariAmarok (451306) | about 11 years ago | (#6689642)

" Why is a non-zero failure rate such an abominable thing? At some times, maybe finding something you weren't expecting is a positive."

If you reach into the freezer without really looking, thinking that you are grabbing a freezer-pop, and get an 8 month old leg of lamb instead, are you going to shrug and eat the lamb anyway?

" Why is a non-zero failure rate such an abominable thing? "

Come to think of it, I have to ask. Which development team has Steve Ballmer assigned you to?

Who? (1)

Chess_the_cat (653159) | about 11 years ago | (#6689411)

After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?

The only search engine I ever use is Google and it seems to find relevant data just fine. And the ads are small, discreet, and actually useful. What's the problem?

schweet (0)

Anonymous Coward | about 11 years ago | (#6689415)

i'm scho exschited to usche thisch on my blog schite. it'sch scho exscheschible, that even a noobie can hammer out a schuper-schweet nutch-hack in a couple of hoursch.

i proposche a new schite to catalog thesche hacksch called nutchhack.com.

Seems pretty pointless (4, Insightful)

cryptochrome (303529) | about 11 years ago | (#6689419)

Free and open code is good and all... but the one real cost of a search engine is RUNNING it. It requires a far from trivial amount of bandwidth and hardware, and somebody has to pay for all of it. Unless someone comes up with a novel P2P solution (and many are trying) it just won't happen.

What they should be doing is pressuring the existing search engine companies for some integrity.

Re:Seems pretty pointless (2, Interesting)

jawtheshark (198669) | about 11 years ago | (#6689638)

Yes, that would hold true if you want to index the WWW. But what about indexing an intranet? Now businesses are paying Google for indexing servers (not that I think it is bad), but an Opensource searchengine could save costs for medium sized businesses. Just toss in another Quad Xeon with a few Gigs of RAM and it will do fine for a normal intranet.

Forget It. (1)

Boss, Pointy Haired (537010) | about 11 years ago | (#6689423)

In the commercial Internet, the mechanism by which you find commercial sites must be paid for by the sites which you find, otherwise basic economics breaks down and it will not work (abuse etc.).

Thousands of companies provide $product - free search engines simply direct all users to one supplier of $product. That's not right.

Searching for a supplier of $product is not like searching for information - it is not something that can be done outside of payment by the supplier of $product.

Nutch? (1)

burgburgburg (574866) | about 11 years ago | (#6689431)

Acronym, non-obvious pun, obscure reference?

The FAQ doesn't explain the name.

Re:Nutch? (3, Funny)

qwerty823 (126234) | about 11 years ago | (#6689462)

who knows... but as soon as they get it working, they can use it to search for a better name!

The answer is "Nutch"... (1)

Gudlyf (544445) | about 11 years ago | (#6689610)

*Opens sealed envelope*

The question is: "What did Sean Connery say when he saw the reviews for 'League of Extraordinary Gentlemen'?"

that business2.0 article.. (1)

joeldg (518249) | about 11 years ago | (#6689433)

it reads more like some strange marketing propaganda than anything.

That project has no releases, has nothing in cvs and very scant details on what it even "is" ..

There are many many projects out there with so much more info available, why is this one that has not released anything getting so much attention?

Re:that business2.0 article.. (0)

Anonymous Coward | about 11 years ago | (#6689675)

What strange marketing propaganda?

CVS has a plethora of information. 1,216 commits, 285 adds

And the board of directors has Tim O'Reilly. That's worth attention.

not a good idea.... (3, Interesting)

edrugtrader (442064) | about 11 years ago | (#6689435)

google is already ideal... the weight of search results is not sold, just text ads.

people are already 'googlebombing' to try and get better rankings by signing up tons of domains and cross linking them all with the keyword that they want to be #1...

if the algorithm that determined how #1 is determined was public, then the best possible strategy to cheat the system could be devised... instead of paying for weight to the search engines you would be paying to web developers to make the search engine think you were #1. and as a web developer i feel that.... oh... wait, proceed.
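To see why cross-linking a pile of domains pays off, here is a simplified PageRank run before and after a link farm points at one page. It is a sketch of the published power-iteration idea, not Google's production ranking, and the page names and farm size are invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# A small honest web: nobody links to "target".
honest = {"a": ["b"], "b": ["a"], "target": ["a"]}

# The same web plus five throwaway domains that all link to "target".
farm = dict(honest)
for i in range(5):
    farm[f"farm{i}"] = ["target"]

before = pagerank(honest)["target"]
after = pagerank(farm)["target"]
print(after > before)  # True
```

The farm pages carry almost no rank of their own, yet funnelling all of it into one page is enough to lift that page, which is the whole trick.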

looking forward to it (1, Interesting)

Anonymous Coward | about 11 years ago | (#6689441)

take a look at the developers and contributors. these guys are all top notch. doug cutting, one of the developers there is the developer for lucene, one of the best libraries out there for developing application search engines in any language. not to mention overture, internet archive, and mitch kapor.. looks like an all-star team. can't wait to play the software.

Bandwidth Costs (1)

NDPTAL85 (260093) | about 11 years ago | (#6689442)

Who's going to pay for them if it's a non-profit open source project? Bandwidth doesn't grow on trees you know.

And slimy adverts? Google has slimy adverts? I thought they only had relevant adverts? Oh well I guess we need another dot.com that will go bust in 6 months or so.

Can this work? (4, Insightful)

jmkaza (173878) | about 11 years ago | (#6689444)

I think the idea is good in principle, but could it actually succeed? Google gets hit with millions of requests each day. They've got hardware that can support thousands of slashdottings a day and a fat pipe to feed all of that info out. That takes a lot of money. Financing an open source project is difficult enough, but financing an open source service such as that would seem next to impossible. Ideas?

The other major problem would be that, with the ranking criteria being available for all to see, it would be relatively simple to manipulate page rankings.

Not making nutch sense (0)

Anonymous Coward | about 11 years ago | (#6689453)

At this point Nutch is coded entirely in Java, however persistent data is written in language-independent formats so that, if needed, modules may be re-written in other languages (e.g., C++) as the project progresses.

One of those coffee out the nose moments for me.

Re:Not making nutch sense (2, Funny)

AtariAmarok (451306) | about 11 years ago | (#6689466)

Don't worry. It is just a stepping stone to full project maturity reached when it is fully coded in Borland Turbo Pascal.

A Tough Challenge (5, Interesting)

Cloudmark (309003) | about 11 years ago | (#6689455)

One of the biggest issues with running a search-engine, open-source or otherwise, is that you can't eliminate bias in the results. No matter what scheme you put in place to handle rankings, someone will find a way to take advantage of it. It's a fact of any major system - there's always a way to twist it. Part of the challenge that Google and similar sites face is that they have to work constantly to protect themselves from systems designed to take advantage of their algorithm. While a completely unbiased search service would be nice, I think it would require the impossible. It would require that no one out here took advantage of it to further their own interests, be they political, commercial, or otherwise. That's fairly unlikely.

With most of the major engines today including Google, they make an effort to prevent horribly unbalanced results (recent controversy over blogs outweighing professional sites in the rankings due to linking and other factors). Some even admit (again, Google does) to manually messing with the rankings a little. If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to. That's in their own public docs. It's also discussed in Wired.

I honestly don't know if open-source could do a better job. The algorithm might be better (likely, given the manpower), but would it really be that much fairer?

Here's "A Tough Challenge" for you, hombre... (-1, Troll)

Anonymous Coward | about 11 years ago | (#6689490)

Seriously, this sounds like the next "craze" to sweep the gay comunity...right up there with stuffing life animals up their filthy rectums.

"Nutching", adv., to "nutch.": A disgusting sex act whereupon a male regurgitates days old feces - which he has previously ingested from his partners anus - and liberally smerars it on his partner's face.

So there you have it.

Re:A Tough Challenge (1)

rmohr02 (208447) | about 11 years ago | (#6689613)

In fact, since the algorithm would be completely open, it would probably be easier to subvert. I'm sure Google has enough trouble working against people who guess at their algorithms, so you could imagine the trouble when people know the algorithm. Then again, many of the people who attempt to subvert search engines are probably fans of open source, and, as you said, there might be more manpower to work against them. Merely comparing open and closed search engines, it's a hard sell either way, but in this case, Google wins because everyone knows Google.

Dupe! (0)

Anonymous Coward | about 11 years ago | (#6689459)

It must be indexing this page [slashdot.org] again.

Business 2.0 is paid access only (1)

prostoalex (308614) | about 11 years ago | (#6689460)

To read the second page of this article use subscriber code 079751240X.

Go to "Magazine subscribers: Enter here", then "Sign in using the account number on your subscription label" and enter the account number above.

Courtesy of TechDirt.com [techdirt.com]

Nutch will never get out of alpha stage (2, Insightful)

xannik (534808) | about 11 years ago | (#6689468)

I fail to see the point of such an endeavor. Without advertising Nutch cannot possibly hope to become a serious contender with search engines such as Google or Overture. Advertising provides the money that enables search engines to have lots of bandwidth to send those results quickly back to users, lots of computing power to quickly process each search, even the ability to hire people to research into new areas for better search results. Even if the search engine is selling its resources to other portals, like Google does with Yahoo, advertising would still be involved in the process. Yahoo would still need to be advertising on their site to bring in revenue to pay for the service. I think Google's method is perfectly fine with small text-based ads that are discreet. Why do we need to fix this?

Are they thinking too big? (3, Insightful)

xanderwilson (662093) | about 11 years ago | (#6689472)

I think they're setting themselves up for something that will get too big and too expensive before it can get finished, and they'll have to figure out a way to (gasp) get some funding beyond donations.

I don't see a solution in one great open-source, independent search engine. Many individual specialized search engines, each mastering its own niche, stand a chance to compete, especially if run by people who focus on their areas of expertise: alternative news search engines, music search engines, literary search engines, etc., each run by people who know what to filter in and out.

If Nutch.org could create the technology that would allow each of these search engines to exist autonomously, it could also be the hub/portal/start-page/blahblahblah that links all these engines and databases together.

Alex.

Nutch? Sounds like a gay sex act to me! (-1, Troll)

Anonymous Coward | about 11 years ago | (#6689474)

Seriously, this sounds like the next "craze" to sweep the gay comunity...right up there with stuffing life animals up their filthy rectums.

"Nutching", adv., to "nutch.": A disgusting sex act hereupon a male regurgitates days old feces - which he has previously ingested from his partners anus - and liberally smerars it on his partner's face.

So there you have it.

Feasible? (1)

ae0nflx (679000) | about 11 years ago | (#6689480)

No offense to the open source community, but I'm not sure how feasible an unbiased search engine is. The open source community does not like any bias towards commercial interests and has no problem pointing it out, but by the same token, its members do enjoy plugging their own programs. That is completely understandable and normal for any community, but it does not make the result unbiased or 'publicly biased'. The merit of the site is very subjective, in my opinion. I am in favor of a project such as this, but I just want to see it for what it is.

Most open source internet based projects (*cough* Slashdot *cough*) have tended to be rather biased towards themselves. It would be very difficult to remove all subjectivity from a project of this nature. How can the ratings be controlled? If it is done entirely by the 'public bias' what's to stop bots from altering the 'public bias'? Just a few questions that still need to be answered.

A suggestion that Google adopted (1)

afflatus_com (121694) | about 11 years ago | (#6689484)

I wrote to Google some time back with an algorithm suggestion that was adopted by them. An open source search engine is certainly welcome to it. It is a minor improvement, but every bit helps.

For citations of most websites, some of the citing people will link to http://www.someplace.com, and some will link to http://someplace.com.

Therefore, compare the pages returned at each address, and if they are the same page, sum their inbound citations to calculate the total rank.
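The suggestion above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Google's or Nutch's actual method: the `pages` mapping, the content fingerprint, and the function names are all hypothetical, and a real crawler would need a more careful notion of "same page".

```python
from collections import defaultdict
from urllib.parse import urlsplit


def canonical_host(url):
    """Return the host with a leading 'www.' removed (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def merge_citations(pages):
    """Sum inbound-link counts for URLs that differ only by a 'www.' prefix.

    pages maps url -> (content_fingerprint, inbound_link_count).
    Two URLs are merged only if host (minus 'www.'), path, and
    fingerprint all agree, i.e. they really return the same page.
    """
    groups = defaultdict(list)
    for url, (fingerprint, count) in pages.items():
        key = (canonical_host(url), urlsplit(url).path or "/", fingerprint)
        groups[key].append((url, count))

    merged = {}
    for members in groups.values():
        total = sum(count for _, count in members)
        for url, _ in members:
            merged[url] = total  # every alias gets the combined count
    return merged


pages = {
    "http://www.someplace.com": ("abc", 30),
    "http://someplace.com": ("abc", 12),
    "http://other.org": ("zzz", 5),
}
merged = merge_citations(pages)
```

Here both someplace.com addresses end up with the combined count, while other.org keeps its own. If the two addresses returned different content, their fingerprints would differ and nothing would be merged.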

Distributed Open Search Network (2, Interesting)

Massacrifice (249974) | about 11 years ago | (#6689487)

It'd be nice if they could make it distributed. Kinda like P2P search engines, but for the web. That way, the main searching server farm wouldn't be tied to any company in particular. That would give Google a run for their money, and would keep Microsoft at bay for another while.

Being an open search network, some peer servers could specialize in searching what they're hosting, making it possible to index otherwise dynamically generated content. These specialized hosts would act as "search plugins" for some otherwise hard-to-define content.

An authentication method (a la Freenet) would be needed, though. Some form of authority to prevent rogue peers from injecting too much crap in the results.

Overall, a good idea. If they make it, I'll run it.
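One way to picture the "too much crap from rogue peers" problem is a merge step that limits how much any single peer can contribute. The sketch below is an assumption-laden illustration, not a Freenet-style authentication scheme: the peer names, the `per_peer_cap` parameter, and the voting rule are all made up for the example.

```python
from collections import Counter


def merge_peer_results(peer_results, per_peer_cap=3):
    """Merge result lists from several peers.

    peer_results maps peer_id -> list of URLs, best first.
    Each peer may vote for at most per_peer_cap distinct URLs,
    so one peer cannot flood the merged list on its own.
    URLs are ordered by number of peers that returned them,
    with ties broken alphabetically.
    """
    votes = Counter()
    for urls in peer_results.values():
        distinct = list(dict.fromkeys(urls))  # drop repeats, keep order
        for url in distinct[:per_peer_cap]:
            votes[url] += 1
    return sorted(votes, key=lambda u: (-votes[u], u))


peer_results = {
    "p1": ["a", "b"],
    "p2": ["a", "c"],
    "rogue": ["spam1", "spam2", "spam3", "spam4", "spam1"],
}
ranked = merge_peer_results(peer_results, per_peer_cap=2)
```

A cap like this only limits the damage a single peer can do. It does nothing against many colluding peers, which is where some form of authority or authentication would still be needed.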

HTDig (0)

Anonymous Coward | about 11 years ago | (#6689503)

HTDig is written in C, configurable, and flexible.

Nutch.. written in Java. No thank you, I'd rather not have my machine slow to a crawl.

Fact is, why can't these "developers" working on Nutch work on HTDig to add the features they want?

HTDig is really nice.. Search Engine, help index tool... Really nice... You can even configure the ranking system to fit your needs.

may not fly (1)

wo1verin3 (473094) | about 11 years ago | (#6689526)

>> In the age of weighted rankings on search
>> engines for profits, there's an obvious need
>> for an unbiased search engine.

If you tell everyone what your page rankings are based on... that doesn't make it hard for companies to modify their pages to fit what the search engine is looking for to increase rankings or hits.

There are some companies that do this for Google, as complicated as it may be.

Mnogosearch : a viable Free Software alternative (0)

Anonymous Coward | about 11 years ago | (#6689543)

http://www.mnogosearch.org/ [mnogosearch.org]

Mnogosearch is a viable web search engine software. It supports caching (a la Google), database clusters, and easy integration of external parsers... Maybe that project should enhance and help this excellent Free Software.

adulau

The answer is "Nutch" (3, Funny)

Gudlyf (544445) | about 11 years ago | (#6689548)

*Blows open envelope*

The answer: "What did Sean Connery say when he saw the reviews for 'League of Extraordinary Gentlemen'?"

Funding....an issue.... or not? (1)

ReyTFox (676839) | about 11 years ago | (#6689550)

One thing that'll help Nutch financially is that they can use their technology for more than a single page running on their own servers (and taking on the huge loads that implies). They can use open-source business models instead, offering licenses with tech support, custom versions, etc.

I was always kind of worried that we might end up with an internet controlled by Google, anyway. But we'll have to wait and see if it actually works or not, anyway. I sure hope so.

Well (1)

CausticWindow (632215) | about 11 years ago | (#6689566)

It's not the technology that prevents thousands of Google clones from popping up. It's the simple fact that to initially succeed, you need either a lot of cash or heavy backers.

It's not like Google's PageRank is so unique that it's impossible to do better any other way. It's just that 1) you have to do better or equal, 2) people have to know about you.

Point 2 equals a lot of cash.

I'm no genius but... (0)

MRsackler (571464) | about 11 years ago | (#6689567)

the way I understand it is that in order to operate a search engine that sorts through millions upon millions of listings thousands of times every minute, someone is going to need a whole lot of bandwidth. Not to mention the CPU resources that such a task would require. CPU cycles and bandwidth cost money, and no matter how altruistic the person's intentions, they've got to earn that money. That is where advertising comes in. If I'm not mistaken, Google is pretty cool about not having slimy advertising. However, I'm not sure if they pocket any of the money received from those advertisements, or if they simply use it to cover the costs of operating the search engine.

That's great but.... (0)

Anonymous Coward | about 11 years ago | (#6689574)

This is great until it starts working and it is really good and someone offers a lot of money for it and it is sold.

Unbiased Searching is Absurd or Useless (1)

smack.addict (116174) | about 11 years ago | (#6689579)

An unbiased search engine is completely useless. In short, an unbiased search engine would either list results randomly or according to useless biases such as alphabetical listings.

Any useful search engine will have an algorithm for ranking page relevance. Because search engine placement is so important to business, there will always be people out there who attempt to optimize (and in some cases, abuse) their pages to boost search engine ranking.

The most useful search engine is the one whose biases match your own biases.

Unbiased is good enough for me (1)

AtariAmarok (451306) | about 11 years ago | (#6689612)

" An unbiased search engine is completely useless."

Unbiased is fine for me. When I search, I am just looking for matches. That is all. I don't care so much about ranking decisions as long as the search produces accurate results. (that is, words or phrases found in the resulting documents).
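For what it's worth, the "just looking for matches" kind of search described above can be sketched in a few lines. This is a toy under obvious assumptions: the document names and texts are invented, words are split on whitespace only, and results come back in alphabetical order, which is exactly the kind of arbitrary ordering the parent post calls a useless bias.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def match_all(index, query):
    """Return documents containing every query word, with no relevance ranking."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)  # alphabetical, not ranked


docs = {
    "b.html": "open source search engine",
    "a.html": "search engine advertising",
    "c.html": "open source java",
}
index = build_index(docs)
results = match_all(index, "search engine")
```

With three documents, an alphabetical list is fine. With a few billion, the order in which matches are shown is most of the problem, which is the point the parent post is making.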

Umm... (0)

Anonymous Coward | about 11 years ago | (#6689590)

Could someone set up a mirror? I think they got slashdotted.

Hardware? (1)

shredwheat (199954) | about 11 years ago | (#6689592)

But what about the hardware and bandwidth? I read about the kind of horsepower running behind the offices at Google and find it hard to believe a competitive offering can be made.

Perhaps what is needed is a peer-to-peer style distributed search engine for the web?

What about the hardware? (1)

foo fighter (151863) | about 11 years ago | (#6689620)

How are they going to afford the massive hardware and bandwidth costs associated with running a tier 1 search engine?

This kind of thing costs money, though (1)

venom600 (527627) | about 11 years ago | (#6689647)

The project may start out as an un-biased ranking system. But, if it gets very popular, the cost of running and maintaining a search engine that gets much traffic at all will require some sort of funding. (case in point: Google)

Maybe if the thing was intended for use only by educational institutions, then some education grants could be used to support the infrastructure required to run a popular search engine? Or maybe it could be a subscription-based service? I dunno...couple of thoughts on how to pay for it anyway.

Bottom line: somebody's gotta pay for it and (usually) the easiest way to pay for it is through advertising.....which will (unfortunately) probably lead to money-biased rankings.

Let's check out the credits page... (3, Interesting)

baggachipz (686602) | about 11 years ago | (#6689654)

Ooh, what's this?

Overture Research has donated hardware and helped to fund development.

So, even an "open source," "unbiased" search engine is funded by a commercial search organization.