NCSA Compares Google and Yahoo Index Numbers

ScuttleMonkey posted more than 9 years ago | from the searching-for-truth dept.

Google 395

chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to "over 20 billion items", compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own, independent, comparison of the two engines. "

Yahoo pants down, egg on face, no WMD either. (3, Interesting)

ackthpt (218170) | more than 9 years ago | (#13322982)

So the summary is in all but 3% of the time, Yahoo finds less pages than Google and that 18 bi1110nz Mayer claimed are a number he pulled right out of his own arse.

Honestly, when I first heard the news over the weekend I thought "rubbish, they must be ignoring requests for spiders to go no further or something." I guess NCSA can either 1) Expect no gifts from Yahoo OR 2) Report significantly different results after a sizable gift to NCSA.

75% less truth than other leading brand

Yahoo returns dupes... (3, Insightful)

Marnhinn (310256) | more than 9 years ago | (#13323039)

Yahoo returns a lot of dupes.

They may have more unique information simply further down the result list, but since the search engines cut off the results at just under 1,000, the researchers have no way of testing that.

All they can really show is that Google returns more unique results per 1,000 (which usually means that more items are indexed, but could also be a side effect of Google's PageRank)...
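
As a rough illustration of the "unique results per 1,000" idea in the comment above, here is a minimal Python sketch that counts distinct URLs among the first 1,000 results. The normalization rules and the example URLs are assumptions made for illustration; neither engine's real duplicate detection is known from this thread.

```python
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    """Crude normalization (an assumption): ignore scheme, 'www.', fragment and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"

def unique_count(urls, cap=1000):
    """Number of distinct normalized URLs among the first `cap` results."""
    return len({normalize(u) for u in urls[:cap]})

# Hypothetical result list: the first two entries are the same page.
results = [
    "http://www.example.com/page",
    "http://example.com/page/",
    "http://example.com/other",
]
print(unique_count(results))  # 2
```

A count like this only says something about the visible 1,000 results, which is exactly the limitation the comment points out.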

Re:Yahoo returns dupes... (5, Funny)

Anonymous Coward | more than 9 years ago | (#13323089)

Yahoo returns a lot of dupes.

If that's the case, then why is Google the darling of slashdot? ;)

Re:Yahoo pants down, egg on face, no WMD either. (3, Interesting)

Iriel (810009) | more than 9 years ago | (#13323109)

I think it is possible that Yahoo! has more items indexed than Google. It may not be true after all, but one has to give thought to the fact that Yahoo can search subscription-based content. That has got to boost their numbers considerably beyond the range of queries that typically return fewer than one thousand results. It's also possible that Yahoo! has simply been fudging the numbers to get some press now that they're actually starting to get noticed again. I can't make a certain conjecture in either direction, but don't totally discredit Yahoo! without looking into everything.

Re:Yahoo pants down, egg on face, no WMD either. (-1, Troll)

daniil (775990) | more than 9 years ago | (#13323123)

Look...There's something I gotta tell you. Prepare yourself...OK, here goes:

It was I that ate your children.

I'm so sorry you had to find out about it like this, but I couldn't stand living like this any longer.

Again, I'm sorry.

HOW THE FUCK CAN YOU TOLERATE THIS SHIT? (-1, Troll)

Anonymous Coward | more than 9 years ago | (#13323163)

Why can't I just go on some website and download a setup program for the app I want? Why the fuck do I have to tolerate shit like this:

The following packages have unmet dependencies:
the-program-you-want: Depends: some-fucking-package (= 0.14.1-5) but it is not going to be installed.

INSTALL THE FUCKING PACKAGE IF THAT'S NEEDED FOR THE PROGRAM I WANT!
WHY THE FUCK IS IT "NOT GOING TO BE INSTALLED" YOU SHITTY PACKAGE MANAGER?

Fuck you Linux and fuck all your advocates.

Sorry about hijacking the thread, I'm sure the original subject was very fascinating.

Accurate results? (5, Interesting)

bigwavejas (678602) | more than 9 years ago | (#13322984)

Google sometimes returns some pretty interesting/ entertaining results.

Try searching for the word, "failure" in Google and check the results.

This brings into question *accurate* results. In this case it appears that's left to interpretation.

What would you want them to return? (1)

brunes69 (86786) | more than 9 years ago | (#13323020)

All Google does is index the web. In this case, it seems like there are more web pages/more highly linked pages about GW being a failure than anyone else.

Is this that hard to believe? What would you rather it return for such a query? A dictionary definition? If you want a dictionary definition, use the define: operator.

Trust me - GW will not be on the top of the failure list forever. In another few years we will have a new most-hated person. This is the nature of a real web index, because it is the nature of the web, and of society itself - it is fickle.

Re:What would you want them to return? (1)

bigwavejas (678602) | more than 9 years ago | (#13323056)

I have no opinion on it actually. I just found it interesting that Google displayed GW as result 1 and Yahoo! as result 4. Obviously there are two different search methodologies in use here.

Re:What would you want them to return? (4, Insightful)

Intron (870560) | more than 9 years ago | (#13323171)

The top of the page return for Yahoo is

"Failure on eBay Find failure items at low prices. "

which illustrates the most important difference between Yahoo and Google.

And when I search for "Linux" via Google... (-1, Flamebait)

brunes69 (86786) | more than 9 years ago | (#13323229)

The top of the page return is

Windows vs Linux
www.microsoft.ca/getthefacts
Read In-Depth 3rd Party Performance Analysis on Linux & Windows!

So what was your point again? Oh yeah - you had none.

Re:What would you want them to return? (1)

Dan Up Baby (878587) | more than 9 years ago | (#13323074)

If this was a natural result I would agree with you, but like "French Military Victories" it was orchestrated; it's not the real web, and it doesn't illustrate how the web actually uses the word.

Re:Accurate results? (1, Insightful)

Anonymous Coward | more than 9 years ago | (#13323036)

Actually, this was the result of a bloggers' linking campaign to do just that.

In response, you can see Michael Moore in the #2 position.

Re:Accurate results? (2, Funny)

DroopyStonx (683090) | more than 9 years ago | (#13323105)

Um, GW Bush is the first result.

Seems fairly accurate to me...

Re:Accurate results? (0)

Anonymous Coward | more than 9 years ago | (#13323137)

The interesting thing though is that the word "failure" doesn't appear anywhere on the page.

Though I am not disagreeing with you...

Re:Accurate results? (5, Insightful)

jrallison (857135) | more than 9 years ago | (#13323160)

It is odd, however, that the #1 result for failure is a webpage without the word "failure" in it.

Re:Accurate results? (1, Interesting)

ArsonSmith (13997) | more than 9 years ago | (#13323220)

Hmm, bumbling idiot possibly, but since when has becoming the President of the US, then being elected again, been the mark of a failure????

Re:Accurate results? (1)

Monty845 (739787) | more than 9 years ago | (#13323113)

The next logical step would be to take a list of the results that a query generates and examine the ones unique to each search engine. The difficult part would be creating a methodology for determining relevance that wasn't subject to the opinions of the analyst.

The other possibility is that Yahoo's indexing system preferentially indexes popular pages... not sure if that is a reasonable possibility.

Re:Accurate results? (1)

ArsonSmith (13997) | more than 9 years ago | (#13323166)

That's funny, I get POOP, DICK, VAGINAS, POOP, DICK, VAGINAS. Wonder why that would be a failure? You're right though, it is kinda funny. Say that like 5 times out loud at work.

Re:Accurate results? (1, Informative)

Anonymous Coward | more than 9 years ago | (#13323174)

It is called a "Google Bomb"

http://en.wikipedia.org/wiki/Google_bomb [wikipedia.org]

Re:Accurate results? (0, Troll)

AnUnnamedSource (174313) | more than 9 years ago | (#13323182)

Looks pretty accurate to me. Type in "failure"--get a picture of George W. Bush. How much more accurate do you want?

Re:Accurate results? (4, Insightful)

MindStalker (22827) | more than 9 years ago | (#13323192)

Well, Google also indexes based on referring links and not just the content of the page itself. So if many websites refer to GW as a failure, GW's page itself will turn up as a high hit. Yahoo does this as well, but doesn't necessarily give it the same weight. This could strongly affect the number of results returned. If Google returns X pages for a search on term "y", many of these pages may not actually mention "y", which gives a larger page count for "y". With Yahoo's method, it will mainly return pages that mention "y" themselves, and possibly add some pages that links describe as being about "y". This can vastly alter the count.
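
The effect described above can be sketched with a toy inverted index in Python. Everything here is hypothetical (the page names, the link data and the `build_index` helper); it is not how either engine actually works, only an illustration of why counting anchor text can make a page match a term it never contains.

```python
from collections import defaultdict

# Hypothetical toy corpus: page -> body text.
pages = {
    "bio.example/president": "official biography of the president",
    "blog.example/a": "my thoughts on politics",
    "blog.example/b": "a story about failure and recovery",
}
# Hypothetical links: (source page, anchor text, target page).
links = [
    ("blog.example/a", "miserable failure", "bio.example/president"),
    ("blog.example/b", "failure", "bio.example/president"),
]

def build_index(pages, links, use_anchor_text):
    """Map each word to the set of pages that match it."""
    index = defaultdict(set)
    for url, body in pages.items():
        for word in body.split():
            index[word].add(url)
    if use_anchor_text:
        # Credit the words of each link's anchor text to the link target.
        for _source, anchor, target in links:
            for word in anchor.split():
                index[word].add(target)
    return index

body_only = build_index(pages, links, use_anchor_text=False)
with_anchors = build_index(pages, links, use_anchor_text=True)

print(sorted(body_only["failure"]))     # ['blog.example/b']
print(sorted(with_anchors["failure"]))  # ['bio.example/president', 'blog.example/b']
```

The same corpus yields a different result count for "failure" depending only on the matching policy, which is the commenter's point about counts not being directly comparable.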

NCSA is comparing the archives... (1, Funny)

Anonymous Coward | more than 9 years ago | (#13322988)

by surfing to each page in each archive with the most recent NCSA Mosaic.

It will take a while.

Re:NCSA is comparing the archives... (1)

Winckle (870180) | more than 9 years ago | (#13323225)

Yeah, especially when they reach http://www.dukenukemforever.com/ [dukenukemforever.com]

Conclusion (3, Informative)

mboverload (657893) | more than 9 years ago | (#13322989)

"Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results. It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Googles index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned less results than Google. "
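
For anyone who wants to sanity-check the quoted figures, a few lines of Python confirm that the three case counts add up to the 10,012 test cases and roughly match the stated percentages. Only the numbers in the quote are used; nothing else about the study is assumed.

```python
total = 10_012
yahoo_more, google_more, ties = 307, 9_676, 29

# The three outcomes should account for every test case.
assert yahoo_more + google_more + ties == total

for label, n in [("Yahoo! more", yahoo_more), ("Google more", google_more), ("tie", ties)]:
    print(f"{label}: {n} cases = {100 * n / total:.1f}%")
# Yahoo! more: 307 cases = 3.1%
# Google more: 9676 cases = 96.6%
# tie: 29 cases = 0.3%
```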

Re:Conclusion (4, Insightful)

nutshell42 (557890) | more than 9 years ago | (#13323132)

And Nutshell42's New Amazing Search Engine gives you even more results. Even though my index size is only 1.something million. I simply return every single wikipedia article in every language as result no matter what you search.

Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous. Only a thorough study comparing results and how useful they were (which is hard to do, expensive and time consuming) has any meaning that goes beyond producing lots of funny numbers and percentages.

96.34% of all percentages are completely useless.

btw. I use google, not yahoo

Re:Conclusion (3, Insightful)

rossifer (581396) | more than 9 years ago | (#13323189)

Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous.

No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.

What you're talking about is measuring the effectiveness of page ranking, which is a completely different measure of how good a search engine is. Note: Google wins on that measure too.

Regards,
Ross

Interesting. (1)

Poromenos1 (830658) | more than 9 years ago | (#13322990)

I was wondering how accurate were the results that the companies themselves reported. Or are they accurate, but they just spidered sites that don't matter to anyone?

They might have a larger index file (4, Insightful)

BlackCobra43 (596714) | more than 9 years ago | (#13323006)

but they can't sift through it nearly as well as Google, so what does it matter? Even if you have a bigger dictionary, if you can't speak English at all it won't do you much good.

Re:They might have a larger index file (0)

Anonymous Coward | more than 9 years ago | (#13323076)

stfu n00b

Why not use both? (2)

LuciferBlack (905438) | more than 9 years ago | (#13323007)

Just use both...Then you'll be certain to have a nice unbiased search result. ;)

The difference is (-1, Offtopic)

AnonDotOrg (902320) | more than 9 years ago | (#13323016)

That Yahoo! tends to find a way of indexing identical pages on a site. While Google seems to just index the unique and significant ones [overheardintheuk.com] .

Re:The difference is (0)

Anonymous Coward | more than 9 years ago | (#13323124)

How many more times are you going to whore your site in your comments? Out of 9 comments you've made on /. 5 of them have included a link to your site in the body of the comment. If it's relevant, fine, if not then stick it in your sig or profile.

Louis Waweru - youngbonzi@earthlink.net (0)

Anonymous Coward | more than 9 years ago | (#13323245)



a picture [nyud.net] tells a thousand words

Mcdonalds obviously isn't hiring

Flawed conclusion? (5, Insightful)

Prong_Thunder (572889) | more than 9 years ago | (#13323025)

Sorry, but if Google consistently returns more results, it could just as easily mean that the filtering isn't as good.

I still prefer Google though.

Re:Flawed conclusion? (5, Insightful)

Ossifer (703813) | more than 9 years ago | (#13323103)

Exactly! I find the conclusions of the research to be quite specious. Yahoo may simply have tighter controls of what is considered a match, which, by the way, is no simple algorithm.

In any case, I am usually not so interested in the numbers of matches, but in the quality of the list returned--hopefully one website will have exactly what I need...

Re:Flawed conclusion? (1, Insightful)

Lewisham (239493) | more than 9 years ago | (#13323158)

Agreed, whoever conducted this "research" is pretty idiotic. The pages returned != pages available.

This isn't worthy of the NCSA, or indeed any university, to be shown in any public format with any conclusions at *all*. You'd be laughed out of the conference hall if you presented this.

Interesting but... (2, Insightful)

kf6auf (719514) | more than 9 years ago | (#13323119)

While it is true that more results could mean worse filtering, that is a separate test entirely.

I tend to think that ordering is more important than filtering down to a small number of results, since having lots of results returned doesn't hurt if the search engine can order well so that what you want is most likely to be in the top 10-25. This is especially true when there will be at most a couple of results where I'd rather have the search engine try at the ordering and have me do most of the filtering because no search engine is as good as a person at really figuring out what people want, yet.

Not really (1)

Mr. Underbridge (666784) | more than 9 years ago | (#13323142)

In fact, all results that match a query are returned; it's the ranking that matters. Google is also more rigorous about excluding apparent duplicate results, and doesn't count those in the stats.

results? (1)

dotpavan (829804) | more than 9 years ago | (#13323028)

quoting:"Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results. It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Googles index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned less results than Google. "

in short: truth is that size does not matter.. the hype behind bigger the better is *false*, just like its for penises :)

Re:results? (1)

DirtyHerring (635192) | more than 9 years ago | (#13323257)

in short: truth is that size does not matter.. the hype behind bigger the better is *false*, just like its for penises :)

Yet another guy with a small penis.

What a surprise (-1, Flamebait)

Dunbal (464142) | more than 9 years ago | (#13323032)

Surprise surprise yet another google story on slashdot. Well I guess it beats Piquepaille...

OK, mod me flamebait now.

More please! (5, Interesting)

2008 (900939) | more than 9 years ago | (#13323114)

This is a great article! I wish there were more like it on slashdot. It's scientific instead of an opinion piece, it has references, it's repeatable. It's also short and very readable, unlike a lot of science papers.

OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".

MODERATRORS, look here!!! (1)

Junior J. Junior III (192702) | more than 9 years ago | (#13323249)

Mod parent up.

Re:What a surprise (0, Troll)

Eric604 (798298) | more than 9 years ago | (#13323135)

OK, mod me flamebait now.

I'll take some of the heat off you. Let's burn some karma. Here we go. MODERATORS ARE STUPID FUCKERS.

The results (4, Interesting)

Swamii (594522) | more than 9 years ago | (#13323034)

For those that don't want to read the flippin' article:

Based on this random sample, we found that on average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly less.


In other words, they believe Google indexes more items based on their own tests of searching.
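
The 37.4% figure here and the "166.9% more results" figure quoted elsewhere in the thread look like two views of the same ratio. A minimal Python check, assuming both numbers describe one aggregate Google-to-Yahoo! ratio (the thread does not say how the study averaged them):

```python
google_extra_pct = 166.9  # "166.9% more results" from the study's conclusion

# "X% more" means Google returns (1 + X/100) times what Yahoo! returns.
ratio_google_to_yahoo = 1 + google_extra_pct / 100
yahoo_share_pct = 100 / ratio_google_to_yahoo

print(f"Yahoo! returns about {yahoo_share_pct:.1f}% of Google's count")
# Yahoo! returns about 37.5% of Google's count
```

That lands within a tenth of a point of the 37.4% quoted, and the small gap is presumably down to rounding in the published figures.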

Re:The results (2, Insightful)

mi (197448) | more than 9 years ago | (#13323130)

Based on this random sample, we found that on average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly less.
Informative. But do they also explain why this (Google's results) is a good thing? In my experience, Google's results beyond the second page are never useful, so they may as well not be there at all.

I don't see how NCSA's findings can prove or disprove Yahoo's earlier claims.

English Language (3, Insightful)

morcheeba (260908) | more than 9 years ago | (#13323038)

They only used words from the English Ispell word list. Besides the English-language bias, this is probably limited in other ways. News websites use a limited vocabulary, but a lot of proper names -- so if one engine indexed these better, it wouldn't necessarily get a better rating. News sites are also very dynamic and have a large number of webpages, so they would be influential in the count.

Re:English Language (0)

Anonymous Coward | more than 9 years ago | (#13323258)

... If one of them indexed a site with proper nouns better it would skew the results. Wow, you really had to stretch for that nugget.

Hrmm (2, Interesting)

T3kno (51315) | more than 9 years ago | (#13323041)

Why wget instead of LWP?

Re:Hrmm (1)

glwtta (532858) | more than 9 years ago | (#13323246)

Why not? Often wget is faster to set up, since it already has a whole lot of functionality rolled in that you'd have to do by hand with LWP.

Queries with 1,000 results (3, Interesting)

Whafro (193881) | more than 9 years ago | (#13323053)

TFA notes that queries with greater than 1,000 results were dropped from the survey, because Google and Yahoo both truncate their results to 1,000.

That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.

Those could be the larger sites, where Yahoo is perhaps digging deeper, requesting data from forms, ignoring robots.txt, etc. It could be where they're getting those big claimed numbers of indexed documents.

This is what matters... (1)

Rolan (20257) | more than 9 years ago | (#13323059)

This boils down to the real numbers that matter. It doesn't really matter if your index is "bigger" or not, it is about the results that are returned. The other thing that matters (and can't really be measured in a scientific manner) is relevance. It's easy to return results for a set of words, it is hard to return relevant results for a set of words. My personal experience is that Google returns more relevant and better ordered results than Yahoo!.

Exactly (1)

mopslik (688435) | more than 9 years ago | (#13323262)

If $SEARCH_ENGINE returns 1,000,000 results, and assuming I can sift through each result at an astonishing rate of 1 per second, it will take me 1,000,000/(60*60) = 278 hours, or 11 1/2 days to wade through the junk.

The number of results is largely irrelevant. Give me quality filtering instead. Fortunately, Google does that for the most part.
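
The back-of-the-envelope figure above is easy to reproduce. A short Python version, using the comment's own assumptions of 1,000,000 results at one second each:

```python
results = 1_000_000
seconds_per_result = 1

hours = results * seconds_per_result / (60 * 60)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.1f} days")  # 278 hours, about 11.6 days
```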

The ultimate test (1)

kevin_conaway (585204) | more than 9 years ago | (#13323075)

To me, the test is googling myself and seeing what comes back. Google seems to favor mailing lists high in its results so all the stupid things I've said over the years are right up there on front. Of course, I think Google is more accurate because things actually attributed to me show up higher in the results, but is that actually correct? I don't know.

Re:The ultimate test (1)

Vegeta99 (219501) | more than 9 years ago | (#13323216)

Ha! Yeah. According to Google, anyway, Plug N' Play is satan, and I really despised MP3 players (in favor of MD players).

Good article (0, Flamebait)

eth00 (612841) | more than 9 years ago | (#13323081)

The researchers in this article took as close to a scientific method as one can get for something like this. This just tells us exactly what has been known for a while: Yahoo just plain sucks at giving good results.

Re:Good article (1)

darius779 (734496) | more than 9 years ago | (#13323101)

The article gives no information as to the quality of the results, just the number of the results given..

Re:Good article (1)

amliebsch (724858) | more than 9 years ago | (#13323217)

ERROR: LOGIC FAILURE

Returning fewer pages does not necessarily mean poorer search results - after all, a good search will present the maximum number of relevant pages, but no others. Google only wins if all of the extra results it shows are actually relevant. By the methods of this test and your analysis, I could write a search engine that returns its entire index as the result set for every search, and it would be the best websearch ever! Billions of results on every search!

I would like to see an objective qualitative assessment.
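
The "return the entire index" thought experiment can be made concrete with the standard precision and recall measures. This is a minimal Python sketch over a made-up 100-document index; the document IDs and the relevant set are invented purely for illustration.

```python
def precision_recall(returned, relevant):
    """Precision: share of returned docs that are relevant. Recall: share of relevant docs returned."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical index of 100 documents, 4 of which answer the query.
index = [f"doc{i}" for i in range(100)]
relevant = {"doc1", "doc2", "doc3", "doc4"}

return_everything = index                          # the "billions of results" engine
selective = ["doc1", "doc2", "doc3", "doc50"]      # a pickier engine

print(precision_recall(return_everything, relevant))  # (0.04, 1.0)
print(precision_recall(selective, relevant))          # (0.75, 0.75)
```

The engine that returns everything wins on raw count and recall but has almost no precision, which is why a result count alone says little about search quality.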

Quality of Results (0)

Anonymous Coward | more than 9 years ago | (#13323082)

The big flaw in this test, IMO, is that it assumes quantity of results is as good as quality of results. I couldn't care less if a search results in 10,000 hits or 100,000 hits. All I really care about is did it return the 1 or 2 hits that actually have the information I'm looking for and are they high up in the results?

"Number of documents indexed" is a worthless pissing match as far I'm concerned.

Quality not quantity (1)

ngunton (460215) | more than 9 years ago | (#13323091)

Surely it's the quality of the results that counts, rather than the quantity? Who needs 1,000,000 matches anyway, when most people don't go past the first page or two of the results? The article doesn't talk at all about how relevant the matches were. I'm not saying that it invalidates their study, but I would say that any search engine that returns millions of hits for any query is simply showing off. Give me a search engine that shows me fewer matches, but the best hits, any day. Lately, Google has increasingly been giving me a bunch of useless links when I search for stuff. For example, looking for reviews on various bits of hardware just gives you a bunch of websites that are selling the products, and *seem* to have reviews, but then you go to the page and it says something like "no reviews have been posted". Lots of ghost towns out there on the web these days. Anyway, the point holds: Give me relevant results and allow me to screen out the marketing junk and link farms. Beyond that I don't really care how many pages they have in the index.

Concede (0, Redundant)

DrugCheese (266151) | more than 9 years ago | (#13323095)

In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results.

I don't understand what would make someone want to compete against Google anymore. Sure, if you've got technology in place like Yahoo, keep it going, but still ...

Google is synonymous with searching the internet.

Google is a verb

Re:Concede (0)

Anonymous Coward | more than 9 years ago | (#13323197)

Actually no. There are many problems with Google:

* Their image search is out-of-date.
* Google sometimes can't find content you know is on the web and you know is indexed.
* Google's special operators sometimes don't work (like link:). This may be related to the previous point.
* Google will sometimes crash. Unusual, but I think that everyone has seen one of their "core dumps" with the encoded data and a request to file a bug report.

Re:Concede (0)

Anonymous Coward | more than 9 years ago | (#13323203)

Competition is ALWAYS good. It is what will keep Google "honest" in the long run. Now that Google is a public company, EVENTUALLY money will corrupt the company and they won't be the glorious tech-savvy company they once were. While competition won't keep any company "honest", it at least provides an incentive to a company to keep their customers happy.

Conflict of interest? (1, Interesting)

Anonymous Coward | more than 9 years ago | (#13323098)

It seems to me that when Slashdot publishes an article that is favourable to Google, that was submitted by a Google staff member, one might question whether someone involved has a conflict of interest. It's not astroturfing, because his employment at Google was clearly mentioned. It might be an ad (or more correctly, a press release) masquerading as news. I wonder if the article would have been published had it been submitted anonymously...

But the real question is... (1)

convex_mirror (905839) | more than 9 years ago | (#13323100)

Why is the NCSA cowering from comparing Google and Yahoo to infoseek? The wool has been pulled over your eyes people!

My own independent analysis (0)

Anonymous Coward | more than 9 years ago | (#13323106)

No one gives a fuck whether it is 8.16 billion or 20 billion. No matter what, it is 99.9999% useless shit. Is the largest catalog of useless shit really something to aspire to?

Perl Code (4, Funny)

hayro (854797) | more than 9 years ago | (#13323107)

I don't know about the study but that is the most readable perl code I have seen in a long time.

interesting but inconclusive (1)

it0 (567968) | more than 9 years ago | (#13323108)

It's a nice test, but I fail to see how they can extrapolate this to be true for all searches.

Don't forget that a lot of queries also get hand-tuned at Google/Yahoo to give the proper result set.

Also keep in mind that size doesn't matter, but relevancy does!

And they both cheat at that as well; they just give back the highly ranked pages for those words. Works OK for a lot of people, but hardly relevant.

Study has poor assumption (2, Insightful)

Anonymous Coward | more than 9 years ago | (#13323115)

The study noted that although Yahoo says they have ~twice as many pages indexed as Google, when they queried each engine with two arbitrary words from the dictionary, they got fewer responses from Yahoo.
  From this they concluded Yahoo's claim of twice as many pages is suspicious.

What's suspicious is that these people consider themselves scientific. What if, for example, Yahoo just returns meaningful results, whereas Google returns anything with those words in it? For example, what if you search for "faience" and "urbanity" -- maybe Google has more results, but maybe they are less pertinent. In other words, maybe Yahoo not only has more pages indexed, but also has an algorithm that returns only the most relevant stuff.

Not saying that's the case necessarily, but not mentioning that assumption makes for a worthless study/conclusion. (Also, if Google says they return x results, often when you go to the last page of their results listing you'll notice their total went down, and it's more like x - 10%.)

    -Josh

Methodology (5, Insightful)

enjo13 (444114) | more than 9 years ago | (#13323125)

The very methodology used in this case seems rather incorrect to me.

The assumption (as stated in the paper): Since Yahoo claims to have indexed twice as much as Google, searches should return twice as many entries.

That assumption is flat out incorrect. There are actually multiple problems.

First, the scope of the search (based on index terms) is really up to the search engine itself. Since each search engine does not return the entire database as search results, it is very much up to the individual search algorithm to determine the depth of entries considered to 'match' a set of terms. That's what is really being reflected in these results.. it is not the overall size of the index, but simply how aggressive the search algorithm is in matching terms to entries.

Even if the algorithms were identical (same algorithm being run across both indexes), the nature of search does not scale in that way. If Yahoo has, for instance, become more aggressive in indexing message board and forum content, then only searches that play to those subjects should return more results than Google. Since searches by definition narrow down a data set, a methodology needs to be developed that tests the BREADTH of the results rather than simply testing the depth.

Yahoo still crap and continue to be crap (-1, Troll)

Fujisawa Sensei (207127) | more than 9 years ago | (#13323128)

Yahoo is still overhyped crap. It was crap in the last century and it will continue to be crap in this century. Want to know when Yahoo is going to stop being crap? When people stop advertising with them and they sell out to Google. As long as the Chief Fscking Officer is making buttloads of money from advertisers they're going to keep putting out crap. Much the same as the companies in the RIAA.

Quality or Quantity? (1)

imstanny (722685) | more than 9 years ago | (#13323131)

I'd be much more interested to see a test of the quality of results. Considering that most of the results that I end up activating are on the first page, quantity of results is less relevant to me in determining a good search engine.

International Listings (4, Insightful)

Dominatus (796241) | more than 9 years ago | (#13323138)

The study only checked English words. Is it possible that the increase came from Yahoo expanding into more international website markets?

Just a thought

can we trust the methodology (1)

GabrielF (636907) | more than 9 years ago | (#13323140)

Basically NCSA's method assumes that if a search engine indexes twice the number of pages, then it will return twice the number of results for a given search. However, for this to be the case, the 10 billion+ extra pages that Yahoo indexes would have to be roughly equivalent to the pages that Google indexes. If Yahoo is indexing 20 billion pages, but ten billion of those are in Mandarin, then searching for random combinations of English words (which NCSA is doing) won't tell us which search engine indexes more pages. In order to trust NCSA's methodology we would have to know exactly WHAT the billions of pages that Yahoo knows about but Google does not are. Surely the web didn't double in size overnight; Yahoo must be searching somewhere Google doesn't search if their claims mean anything (which they may not).
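
The objection above can be illustrated with a small deterministic simulation in Python. Everything in it is hypothetical: two toy engines, made-up vocabularies standing in for English and non-English words, and single-word queries instead of the study's word pairs. It only shows that an index twice as large can still return fewer hits when every test query is English.

```python
# Hypothetical vocabularies standing in for English and non-English words.
english_vocab = [f"en{i}" for i in range(50)]
other_vocab = [f"zh{i}" for i in range(50)]

def make_pages(vocab, n, prefix):
    """Build n toy pages, each containing two consecutive vocabulary words."""
    return {
        f"{prefix}{i}": {vocab[i % len(vocab)], vocab[(i + 1) % len(vocab)]}
        for i in range(n)
    }

# Engine A: 1,000 English pages. Engine B: 500 English + 1,500 other-language pages.
engine_a = make_pages(english_vocab, 1000, "a-en-")
engine_b = {
    **make_pages(english_vocab, 500, "b-en-"),
    **make_pages(other_vocab, 1500, "b-zh-"),
}

def count_results(index, word):
    """Number of pages in the index that contain the query word."""
    return sum(1 for words in index.values() if word in words)

queries = english_vocab[:10]  # English-only test queries, as in the study
total_a = sum(count_results(engine_a, q) for q in queries)
total_b = sum(count_results(engine_b, q) for q in queries)

print(len(engine_a), len(engine_b))  # 1000 2000
print(total_a, total_b)              # 400 200
```

Engine B's index is twice the size of engine A's, yet it returns half as many results for these queries, so result counts from English-only queries cannot settle the index-size claim on their own.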

Finally! (1)

Cheirdal (776541) | more than 9 years ago | (#13323141)

It's good to see that slashdot is FINALLY posting an article about Google.

This is what passes for CS research nowadays? (5, Insightful)

adrizk (137574) | more than 9 years ago | (#13323151)

Seriously. 'We wrote a script and here are the results'? This would take an average Perl programmer what -- 30 minutes of work? Has academic research in computing really sunk to this level?

Maybe it's not even worth pointing out how badly flawed (and lazy) the underlying assumption of 'twice the results = twice the index size' probably is, as I'm sure we're going to see a few dozen posts to that effect (unless PageRank really means nothing), but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.

Re:This is what passes for CS research nowadays? (1)

DogDude (805747) | more than 9 years ago | (#13323264)

Has academic research in computing really sunk to this level?

Considering that most people call a "fact" something that they found on Wikipedia or via Google, I'd have to say that the answer to your questions is "yes". The Net is a vast source of incorrect, incomplete, and otherwise bad data. There may be a lot of information out there, but the vast majority is wrong. This "cheapening" of information has and probably will lead to more of this crap "research".

no change for me (1)

rotagivan (885893) | more than 9 years ago | (#13323153)

Do you really even have to RTA? My search engine is still the same as before and works fine, no need to change now. Aww, is Yahoo jealous?

Interesting study... (1)

dracken (453199) | more than 9 years ago | (#13323156)

...though flawed in many respects. The raw number of pages returned may not indicate the size of indices. Google is famous because it returns *relevant* pages but not necessarily *more* pages. A search engine that returns its entire index with each search isn't all that useful.

Secondly, results for all keywords may not increase with the size of the index. The pages which were indexed might correspond to popular searches (that return more than 1000 results, which were not considered if you RTFA) - so considering only those words that return less than 1000 results is flawed.

Though some competition is good, the "DO YOU WANT MY 20 BILLION BIG INDEX ???!!" claim by Yahoo reminds me of certain Yahoo chat rooms :p

yahoo failed it (0)

Anonymous Coward | more than 9 years ago | (#13323161)

yahoo forgot to index all /. dupes.

methodology (1)

abde (136025) | more than 9 years ago | (#13323173)

The assumptions seem to be that search results are randomly distributed. But by the very nature of search - a targeted and subjective request for information - that is clearly the wrong model. I don't see why the assumption that a 2x bigger index should return 2x more results would hold for any query < 1000.

A better test would be to see how much overlap there was between queries. Do the top 50 returns on queries (of any size, not just limited to those with N < 1000 returns) match? To within what percentage?
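The overlap test suggested above could be sketched roughly like this. The function name `top_k_overlap` and the URL lists are made up for illustration; fetching real result lists from either engine is deliberately left out, and the choice of denominator (the longer of the two truncated lists) is just one reasonable option.

```python
def top_k_overlap(results_a, results_b, k=50):
    """Percentage of the top-k results that two engines have in common.

    `results_a` and `results_b` are ranked lists of URLs (best first).
    The percentage is taken relative to the larger of the two top-k sets.
    """
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    if not top_a and not top_b:
        return 0.0
    shared = top_a & top_b
    return 100.0 * len(shared) / max(len(top_a), len(top_b))

# Invented example data, not real search results.
engine_a = ["u1", "u2", "u3", "u4"]
engine_b = ["u3", "u4", "u5", "u6"]
print(top_k_overlap(engine_a, engine_b, k=4))
```

This ignores rank order inside the top k; a rank-aware measure would be stricter, but plain set overlap is enough to answer the "to within what percentage" question.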

Google parses plurals differently. (3, Interesting)

WoTG (610710) | more than 9 years ago | (#13323176)

Google started treating plurals as the same search about a year ago. Yahoo doesn't. So, if you google for "inkjet printers" and "inkjet printer" you will get the same result set; however, on Yahoo, you will get different results.

The net result is that for the same index size, Google will return more results. (And, IMHO, more meaningful ones.)
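A toy illustration of the plural point above, assuming a deliberately crude rule (strip one trailing "s") that is certainly not what Google does internally. The documents and the `naive_singular` helper are hypothetical. The sketch only shows why folding plurals yields more matches from the same set of documents.

```python
def naive_singular(word):
    """Crude plural folding: drop one trailing 's' from longer words (toy rule)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def search(docs, query, fold_plurals):
    """Return docs containing every query term, optionally folding plurals."""
    norm = naive_singular if fold_plurals else (lambda w: w)
    terms = {norm(w) for w in query.lower().split()}
    matches = []
    for doc in docs:
        words = {norm(w) for w in doc.lower().split()}
        if terms <= words:
            matches.append(doc)
    return matches

# Invented three-document "index".
DOCS = [
    "cheap inkjet printer for sale",
    "inkjet printers compared",
    "laser printers only",
]

folded_plural = search(DOCS, "inkjet printers", fold_plurals=True)
folded_singular = search(DOCS, "inkjet printer", fold_plurals=True)
exact_plural = search(DOCS, "inkjet printers", fold_plurals=False)
exact_singular = search(DOCS, "inkjet printer", fold_plurals=False)
print(len(folded_plural), len(folded_singular), len(exact_plural), len(exact_singular))
```

With folding, both spellings return the same two documents; without it, each spelling returns a different single document, even though the index is identical.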

Who cares about... (2, Insightful)

Ignignokt (803398) | more than 9 years ago | (#13323178)

the number of results anyways? Who makes it to page 5000 when doing a search?

Questionable methodology (1)

lpangelrob (714473) | more than 9 years ago | (#13323180)

I agree that it's hard to determine how many items exist for a subject XYZ, but I'm not sure this is the way to go about it.

They presumed that for random phrases that return fewer than 1,000 matches, one can determine, from the ratio of matches that Google returns to matches that Yahoo returns, which engine has indexed more documents. This also presumes that the Internet is an infinite source of information about XYZ, and that there is always an indeterminate number of sources that remain unindexed on both engines. I don't think this is the case at all.

Say I write a page about Jabberwocky. I get together with people that write more pages about Jabberwocky, and between us we have, on three domains, information about Jabberwocky that exists nowhere else, except maybe Wikipedia under the Jabberwocky entry. If both sites index Wikipedia and those three domains (that link to each other), that's 100% coverage... barring horrible algorithms, you can't get less than this, or you get nothing at all.

Also, when you're looking around for such unique information, I have to imagine that it's not representative of other sources in more general searches.

More results == better search engine? (3, Insightful)

RunzWithScissors (567704) | more than 9 years ago | (#13323185)

So in the conclusion, the author writes that since Google displayed more results, based on their random test data, it was the superior search engine? That seems so wrong somehow...

Wouldn't a better search engine return fewer, but more appropriate, results? I mean, how many of us have found the information we were actually looking for on page ten or twelve of a search? And, isn't less more, but better? %insert Linux geek laughs here%

One would think that volume of results would not a better search engine make, although it may indicate a larger engine index size; an explicit statement to that effect seems to be missing from the NCSA report.

-Runz

Quality > Quantity (2, Insightful)

hagrin (896731) | more than 9 years ago | (#13323195)

This is just another example in the age-old argument of which is better. IMO, the quality of the search results is what matters more than the sheer quantity of information. One relevant find is more valuable than 100 inaccurate results. A test of accuracy might be more valuable and one that would be difficult to engineer. For instance, if I type in a word that has a direct correlating .com domain, that should be the first result (assuming no other words in the title - i.e. "hagrin" brings me my home page as the first result). I am sure a test of accuracy could be further derived from such logic.

The other side of the argument probably relates back to something my fiancee once told me - "Size doesn't matter, but it's the great equalizer when it comes to two guys not knowing what they are doing". Yahoo!, especially since the researchers couldn't perform queries on topics returning more than 1,000 results, may be indexing and crawling deeper into sites or it has a "double dipping" problem.

Either way, I don't see Yahoo! falsely reporting their numbers - I would tend to think that this "study" is highly flawed due to its exclusion of larger result topics, etc.

Problems with the research (1)

iceco2 (703132) | more than 9 years ago | (#13323196)

The research has several problems:
a. It measured the number of results for a certain query. Even if we assumed identical algorithms for checking whether a page matches a query, the two search engines are likely to use different relevancy thresholds.
b. The search pretty much limited itself to the English language.
c. As they admit themselves, they measured only obscure queries. Actually, most of my queries are not obscure at all, and it takes me more than 2 words (which fit together) to chop down the search results group.
d. Finally, the entire research has very little to do with the really interesting question: which search engine is more likely to give me the results I need on the first page?

  Me.

Wait, wait, wait (1)

antifoidulus (807088) | more than 9 years ago | (#13323198)

What's this? A concise and well written summary with a link directly to the well written article? No twisting/breaking of the truth in order to incite /. groupthink comments? No pointless plugs for unrelated topics? No ADS?!?!

Jesus, if the editors keep that up they might actually have a worthwhile site going... never fear, I'm sure the next dupe and/or an article comparing spooning to unmanned space travel will surface before the day's end.

WMD flamebait? (1)

mi (197448) | more than 9 years ago | (#13323202)

Or off-topic? Or troll?

The NCSA's test neither confirms nor disproves Yahoo's earlier claims. Their lesser average results may just indicate higher quality threshold -- Google's results beyond the second page are never useful either.

I'd say, it is kind'a early to claim "pants down, egg on face"...

But was the study (1)

Approaching.sanity (889047) | more than 9 years ago | (#13323208)

funded by Microsoft?

Not only does Google do More, it does Better (2, Informative)

Ralph Spoilsport (673134) | more than 9 years ago | (#13323210)

In regards to a similar article last week, I posted my own personal results [slashdot.org] on what I found when I did a search on Kyzyl, the capital of Tuva.

Google not only gave MORE results, it gave BETTER results. The only bad results were some hairsplitting (if largely well meant) from fellow /.ers... (I mentioned Tuva as a suburb of Mongolia, and while it IS a part of the Russian Federation, it is Much More Mongolian than Russian. And if the rising tide of neo-Nazi scum in Russia gets its way, Tuva could easily be cut adrift into the Mongolian/Chinese orbit...but I digress...)

The essential point is: Which Does the Job Better For Me? Google. Therefore, I use Google. Assuming the Copernican position that I am not atypical, I would therefore extrapolate that this is very true for most other people as well. Which means that Yahoo has a LONG way to go and A LOT more work to do.

RS

I don't see how it can be accurate (1)

jerryodom (904532) | more than 9 years ago | (#13323211)

There is a big difference between the size of an index and the number or quality of search results returned. Yahoo may simply not return as many results, or may restrict the number of results returned for speed considerations. Just because their particular test favored Google's system doesn't make it accurate. I'm sure we could sit here and think of hundreds of different reasons or considerations not taken into account.

With each having billions upon billions of documents available and indexing more every day, who really cares?

disregarded results (1)

Metex (302736) | more than 9 years ago | (#13323219)

Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google. Any search result found to have more than 1,000 returned results on either search engine was disregarded from our sample.

My question is which search engine required them to disregard samples the most. Did Google hit the limit the most, or was it Yahoo?

By the way, I love Google but I do think Yahoo indexes more pages. It indexes personal pages more so than Google does. So when I am searching for items which I know other people would point to, I hit up Google. But if I am searching for something that no one has a reason to link to (home page of your gf), I hit up Yahoo.

Not Convincing (1)

FreshFunk510 (526493) | more than 9 years ago | (#13323232)

Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google.


In order to create a large number of queries that returned less than 1,000 results, we took the commonly available English Ispell Wordlist.. and wrote a PERL script to randomly select two words at a time from that list.


Is it just me or does this study not sound convincing enough? There are too many holes in the way the study was conducted, IMHO. First of all, they restricted themselves to queries that return less than 1000 results? They've already limited the sort of queries they're executing by choosing those that return significantly fewer results than other "popular" queries.

Secondly, they chose random words to create a query. This doesn't give me the confidence that this belongs to the same space of queries that people execute on the average. It would've been great if they sampled their queries from those that people actually execute instead of just crawling the English dictionary.

Nevertheless, bigger is not always better. The reason why Google became such a phenomenon was because of the quality of their search results. Duh.
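For readers who have not read the article, the query generation quoted above could look something like the following. This is a Python sketch rather than the NCSA authors' script (which was written in Perl and is not shown in this thread); the tiny word list, the function names, and the hit counts are all invented, and no search engine is actually queried.

```python
import random

def random_two_word_queries(words, n, seed=None):
    """Pick two distinct words at random from `words`, `n` times over."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, 2)) for _ in range(n)]

def keep_small_queries(counts_by_query, limit=1000):
    """Keep only queries where BOTH engines reported fewer than `limit` results.

    `counts_by_query` maps a query string to a (count_a, count_b) tuple.
    """
    return {
        query: counts
        for query, counts in counts_by_query.items()
        if counts[0] < limit and counts[1] < limit
    }

# Hypothetical stand-in for the Ispell word list.
WORDS = ["abacus", "quince", "lattice", "ferrule", "gantry", "ocelot"]
queries = random_two_word_queries(WORDS, 5, seed=42)

# Made-up hit counts, purely to exercise the filter.
fake_counts = {
    "abacus quince": (120, 95),
    "lattice gantry": (4300, 800),
    "ferrule ocelot": (15, 1200),
}
usable = keep_small_queries(fake_counts)
print(queries, usable)
```

The filter step is where the objection above bites: any query that is popular on either engine is thrown away, so the surviving sample is obscure by construction.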

Offtopic but Adsense needs work (0)

Anonymous Coward | more than 9 years ago | (#13323251)

I was on a page reading about Windows Longhorn and Google showed me ads about Cattle in Texas I could buy... with all the 1337 hax0rz and ub3r geeks they have at Google, Inc, can they not fix the "context"?

Automated querying is Illegal (1)

pooya (878915) | more than 9 years ago | (#13323254)

Isn't it illegal that their crawler is running automated queries? From what I see in Google's Terms of Service:
You may not send automated queries of any sort to Google's system without express permission in advance from Google.
Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.
I'm wondering how they prevented the machines running robots from getting banned after querying that much.

Nice and objective (1)

Gumber (17306) | more than 9 years ago | (#13323255)

Nice to take an anti-yahoo submission from a Google employee. I guess I should be happy they at least disclosed the conflict. It's more than you can say for someone like Bob "rove-puppet" Novak.

Clever idea - I think I will patent it (0)

Anonymous Coward | more than 9 years ago | (#13323260)

before someone else does.

Results of my own study... (4, Funny)

Locke2005 (849178) | more than 9 years ago | (#13323261)

Google only reports "about 4,820,000" entries for Britney Spears, while Yahoo reports "about 67,100,000" entries! This makes Yahoo more than 12 times better than Google! Yeah, my methodology is completely fucked up... but then, so is the NCSA's!