Understanding Search Engines?

Cliff posted more than 8 years ago | from the a-primer-on-the-technology-behind-them dept.

The Internet 49

An anonymous reader asks: "I guess by now we can be fairly certain that search engines are here to stay, and hence I'm trying to understand how the technology works. I'm not so much looking for a particular 'best' technology or implementation, but rather an overview of the different approaches and their trade-offs. Something that would teach me: which approach works in a distributed vs a centralized infrastructure; how different algorithms will perform on complete search words vs arbitrary sub-strings; or how mass storage (hard disk vs. solid state) affects implementation choices. For most mature technologies there is a host of 'overview' books and papers for my questions -- but I couldn't find anything on search engines. Where should I look? Are there any good books or papers?"


Start off by reading up about Page Rank (4, Informative)

xmas2003 (739875) | more than 8 years ago | (#14643548)

Google has a summary here [google.com] ... but start with the original Brin and Page paper - The Anatomy of a Large-Scale Hypertextual Web Search Engine. [stanford.edu]

Same basic concepts apply today ... although they probably didn't anticipate the rise of Black Hat SEO [watching-paint-dry.com] which attempts to "beat" the algorithms.
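
For readers who want to see the idea in the Brin and Page paper in motion, here is a minimal sketch of PageRank by repeated iteration. It is not Google's production code: the toy link graph, the function name and the handling of pages without outgoing links are assumptions made for illustration, and it uses the normalized variant in which all ranks sum to 1. The damping factor of 0.85 is the value the paper suggests.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a rank per page. `links` maps a page to the pages it links to."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by everyone, "d" by no one.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph "c" ends up on top because every other page links to it, and "d" ends up last because nothing links to it. That is the "links as votes" intuition in its simplest form.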

Your post, sir! (-1, Troll)

Anonymous Coward | more than 8 years ago | (#14643639)

Is so gay. It makes bob dole look undysfunctional. You should be kicked in the balls...you wouldn't scream.
 
Search Engine Alta la Vista! BABY

ALERT! Google Blog Censorship? (-1, Offtopic)

Philip K Dickhead (906971) | more than 8 years ago | (#14644166)

Alert!

"Progressive" and "Radical" Blogs on Blogger seem to be completely unavailable for over 10 hours.

I am trying to get responses here, and come to a consensus. Are these being deleted?

The following ".blogspot.com" sites are unavailable as of Saturday, Feb. 4, 4PM EST.

http://gorillaintheroom.blogspot.com/ [blogspot.com]
http://xymphora.blogspot.com/ [blogspot.com]
http://spacetimecurves.blogspot.com/ [blogspot.com]
http://rigorousintuition.blogspot.com/ [blogspot.com]
http://jewssansfrontieres.blogspot.com/ [blogspot.com]
http://rudepundit.blogspot.com/ [blogspot.com]

And many more.

A casual use of the search function on the Blogger/Blogspot front page returns the sites in question.

Conservative political sites also seem to still be "killed".

Plenty of sites promoting cheese in salads, or announcing the purchase of new home electronics, seem to be available.

Are we experiencing a crackdown? Is this a "service outage" anomaly?

Foul Deeds... (1)

EMeta (860558) | more than 8 years ago | (#14643841)

That Google summary is useful, but is actually just a simplified version of their true ways [google.com].

Re:Start off by reading up about Page Rank (0)

Anonymous Coward | more than 8 years ago | (#14644817)

Black hat bats vs. the white hat shotgun dude, what a joke.
As someone who, along with a dozen others, used to shoot at bats that flew out of a rural local church belfry at least once a week one summer, I can tell you the black hat bats are laughing. I've only seen one bat come tumbling out of the sky. It's damned near impossible to kill a bat with a shotgun or even a dozen shotguns.

Re:Start off by reading up about Page Rank (0)

Anonymous Coward | more than 8 years ago | (#14646054)

Matt Cutts gives another explanation [google.com] of how Google crawls, indexes and ranks webpages.

Learn math (4, Insightful)

2.7182 (819680) | more than 8 years ago | (#14643572)

SIAM Review had a survey article on different methods recently. You need to know linear algebra, combinatorics and probability.

Re:Learn math (1)

SgtChaireBourne (457691) | more than 8 years ago | (#14649334)

IANAM (I am not a mathematician) but I recall the terms 'set theory' and 'boundary theory' being used by mathematics researchers when they were talking about search engines and ranking / grouping the results.

Google founder's links to publications (5, Informative)

zoeblade (600058) | more than 8 years ago | (#14643590)

One of the founders [stanford.edu] of Google still has links to various publications (in PostScript format) about search engines, if that helps.

Class (2, Insightful)

addaon (41825) | more than 8 years ago | (#14643591)

Take a class on information retrieval from your local university.

Easy Start (3, Informative)

MikeFM (12491) | more than 8 years ago | (#14643609)

Try the web. I have a short intro to search engines [kavlon.org] on my website. Many others exist. The basics aren't hard and are very effective.

Re:Easy Start (1)

RedWizzard (192002) | more than 8 years ago | (#14643649)

I have a short intro to search engines on my website.
That's an introduction to search engine optimization, not an introduction to search engine technology. The submitter seems to be interested in the technology and algorithms behind finding pages that match a particular term, not finding out how to maximize the position of their website in indexes.

Re:Easy Start (1)

MikeFM (12491) | more than 8 years ago | (#14644210)

Good point although I think you can infer the one from the other. I actually am fairly good at SEO and a lot of that is due simply to understanding how the Internet and search engines work. I've always had a thing for studying ways of indexing and searching data as well as things like AI so it's not as mysterious a field for me as it is for a lot of people.

Re:Easy Start (1)

RedWizzard (192002) | more than 8 years ago | (#14648417)

Good point although I think you can infer the one from the other. I actually am fairly good at SEO and a lot of that is due simply to understanding how the Internet and search engines work. I've always had a thing for studying ways of indexing and searching data as well as things like AI so it's not as mysterious a field for me as it is for a lot of people.
SEO can give you an insight into how particular search engines order the results they return. But it doesn't give you any idea how those search engines produce the list in the first place, and that is what I think the submitter was interested in. Stuff like how to structure your database, how to distribute it across a cluster, how to index it, how to query it for keywords or substrings or whatever. There doesn't seem to be much of that sort of technical detail in any of the SEO stuff I've seen (I admit I'm no sort of expert in SEO, not even a client, but I have read a bit).

Re:Easy Start (1)

MikeFM (12491) | more than 8 years ago | (#14649531)

I was thinking more along the lines of methods of indexing data, search algorithms, etc. As for what kind of db to use, I guess that really depends on what kind of search you're trying to do. A normal db is good enough for many problems but not good at many others. It's sort of like asking what makes a good program.

Managing Gigabytes (5, Informative)

cariaso1 (674515) | more than 8 years ago | (#14643658)

Managing Gigabytes (author site [mu.oz.au], Amazon [amazon.com]) is a spectacular book on most of the underlying technologies. Although I've only read the first edition, I don't recall it talking about spidering/webcrawling. Instead it starts with building a simple index, and builds through all the refinements (e.g. stemming) until you've built a serious workhorse for mining text documents. It's definitely at the core of what a search engine does.

Re:Managing Gigabytes (0)

Anonymous Coward | more than 8 years ago | (#14653559)

Agreed. I studied this book when at Texas A&M studying Information Storage & Retrieval Systems. The book covers a *lot*, but it's all still relevant, and gives a great look at how search engines work.

For a very simplified introduction, check out this article by Matt Cutts [google.com] for the librarian newsletter (side note: consider just how much a librarian and a search engine are alike ...)

One interesting (and perhaps unexpected) place (3, Interesting)

Snarfangel (203258) | more than 8 years ago | (#14643662)

Look up voting methods, with keywords like Kemeny, Condorcet, and Borda. A lot of search engine algorithms are like vote aggregation methods, where each site "votes" for other sites it has links to. There is quite a bit of stuff on spam page filtering and the like as well.
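
To make the voting analogy concrete, here is a small sketch of a Borda count used to merge several ranked result lists into one. The three "voter" rankings are made up for illustration, and the function name is hypothetical; this shows the aggregation method only, not how any particular search engine combines signals.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge ranked lists (best first) into one list using a Borda count."""
    candidates = {item for ranking in rankings for item in ranking}
    n = len(candidates)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            # Top place earns n-1 points, the next n-2, and so on.
            # Items a voter did not rank at all earn nothing from that voter.
            scores[item] += n - 1 - position
    # Highest score first; ties are broken alphabetically to stay deterministic.
    return sorted(candidates, key=lambda item: (-scores[item], item))


# Three hypothetical voters (for example, three ranking signals).
votes = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["b", "c", "a"],
]
print(borda_merge(votes))  # ['b', 'a', 'c']
```

Here "b" wins with 5 points against 3 for "a" and 1 for "c". Condorcet and Kemeny methods work from pairwise comparisons instead of points, and can disagree with Borda on the same ballots.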

Related Question: (3, Interesting)

Absolut187 (816431) | more than 8 years ago | (#14643688)

As a website administrator, is there anything I need to do, other than give every page a relevant title?

Re:Related Question: (2, Informative)

Denyer (717613) | more than 8 years ago | (#14643731)

Write in the language of the users you expect to use your site, and look at the server logs to see what terms people were using when they found your site.

If you sell poultry medication, for example, there's no point in only labelling products as being for cockerels if your visitors are more likely to use rooster as a keyword. You might also want to refrain from putting "cock pills" in your meta tags...

Other than that, write semantically valid code (header tags, etc) and don't put large blocks of navigation links first in a document -- you want search engines to concentrate more on the unique content of a page.

All common sense stuff, really.

Re:Related Question: (1)

Gord (23773) | more than 8 years ago | (#14643869)

Although meta tags are largely thought to be a waste of time these days (from a search engine p.o.v.), you might still consider using "description". This is used by Google, and probably others, to replace the snippet of your page with an actual description.

It depends on what you want to achieve, though. GameSpy always uses "GameSpy is the most complete source for [insert game here] trailers, screenshots, cheats, walkthroughs, release dates, previews, reviews, ..." for every game it covers even when it's clearly not 'the most complete source', which just becomes annoying.

If you choose relevant descriptions that are representative of your page they can be useful, although sometimes actually seeing the search word in the context in which it appears in the page is better. See what your search results look like at the moment, and if you don't like the summary you see in Google, think about using the "description" tag.

Re:Related Question: (2, Insightful)

Flwyd (607088) | more than 8 years ago | (#14646884)

Look at your page with Lynx.

Web crawlers can't see the text in your images and weird HTML constructions can make it hard to parse the text back out. If your page content can be clearly expressed in plain text there's a good chance a search engine will know what you're talking about.

As an added bonus, if a web crawler can read your pages so can blind users.
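
If you don't have Lynx handy, a rough approximation of the "text only" view can be sketched with Python's standard html.parser module. This is only an assumption about what a simple crawler might keep (text nodes and image alt text, with script and style dropped); real crawlers and Lynx itself do considerably more, and the sample page below is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text a plain-text reader would see."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # An image contributes text only through its alt attribute.
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside script or style.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><title>Rooster pills</title>"
    "<style>p{color:red}</style></head><body>"
    "<h1>Poultry medication</h1>"
    '<img src="logo.png"><img src="r.png" alt="A rooster">'
    "<script>var x=1;</script><p>For roosters.</p></body></html>"
)
print(visible_text(page))  # Rooster pills Poultry medication A rooster For roosters.
```

Note that the logo image without alt text contributes nothing at all, which is the point being made above about text locked inside images.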

Re:Related Question: (0)

Anonymous Coward | more than 8 years ago | (#14653900)

Read through Brett's 26 Steps to 1k per day [webmasterworld.com] . Great set of advice. And, the WebmasterWorld forum itself is quite a resource. Lots of experts in a lot of areas.

Don't Reinvent The Wheel (-1, Offtopic)

ObsessiveMathsFreak (773371) | more than 8 years ago | (#14643714)

Not to sound reproachful, but don't reinvent the wheel.

If you need a search tool, look around for a solution that someone else has already wasted years of their life on rather than have yourself do the same. Why recode, when you can download?

Re:Don't Reinvent The Wheel (1)

jlarocco (851450) | more than 8 years ago | (#14643825)

Not to sound reproachful, but don't reinvent the wheel.

If you need a search tool, look around for a solution that someone else has already wasted years of their life on rather than have yourself do the same. Why recode, when you can download?

If the question was "I need to create a search engine for work, where should I start?" your answer would have been good advice. But the poster made no reference at all to creating his/her own search engine. Specifically: "...hence I'm trying to understand how the technology works." Even if the purpose is to create a search engine, your answer is still useless. Reinventing the wheel is a great way to learn.

Or perhaps you can explain how buying a search "solution" would teach him how a search engine works?

Re:Don't Reinvent The Wheel (1)

TTK Ciar (698795) | more than 8 years ago | (#14644171)

If you need a search tool, look around for a solution that someone else has already wasted years of their life on rather than have yourself do the same. Why recode, when you can download?

I have never in my professional life run into a nontrivial production business application which was perfect. They all have bugs. They all need more work. So it doesn't matter whether he downloads an open source system, or inherits his employer's legacy system -- he will need to learn the principles behind the technology in order to move that system forward.

-- TTK

How to solve any technical problem made easy ..... (1)

Gorshkov (932507) | more than 8 years ago | (#14676445)

it doesn't matter whether he downloads an open source system, or inherits his employer's legacy system -- he will need to learn the principles behind the technology in order to move that system forward.

There are 3 basic approaches to solving any technical problem. Let's say, for the sake of argument, you want to cook a roast and have never done it.

Brute Force & Ignorance
Get 100 ovens. Stick a roast in each. Set them to all different combinations of time & temperature. When they're all done, sample them all, and go with the one that's given the best result.

The Scientific Method
Get a stove, and stick a roast in it. Using temperature probes placed at different depths in the meat, and possibly using some sort of thermal imaging, monitor the cooking process while varying the temperature over time. Make careful notes and plot your results afterwards, and compute the optimal configuration.

The Engineering Method
Ask your mother


'nuff said

Understanding Slashdot. (1, Offtopic)

LiquidCoooled (634315) | more than 8 years ago | (#14643733)

From the link I just clicked on I saw:

Ask Slashdot: Understanding Search Engines? 8 of 7 comments

It took a second glance to notice the subtle error in the wording.
Bug in slash?

Re:Understanding Slashdot. (1)

Galactic Dominator (944134) | more than 8 years ago | (#14646030)

That's just /. giving 110% as always...

Re:Understanding Slashdot. (1)

VGPowerlord (621254) | more than 8 years ago | (#14647173)

This is why you should use ACID compliant database tables, rather than MyISAM^W non-ACID compliant database tables.

page rank explained (0)

Anonymous Coward | more than 8 years ago | (#14643778)

Search Engines are Mature? (1)

Crutcher (24607) | more than 8 years ago | (#14643826)

Well hell, guess everybody can go home. Nothing more to search for here, it is all figured out.

Look at Open Source projects (5, Informative)

X (1235) | more than 8 years ago | (#14643842)

So, aside from reading books on Information Retrieval and Data Mining, the other easily available references are open source search engines. In particular, look at the Nutch project [apache.org], which is actually a pretty high quality search engine implementation. Even better: start contributing to the project.

Re:Look at Open Source projects (1)

wan-fu (746576) | more than 8 years ago | (#14643931)

I'd recommend taking a look at Lucene [apache.org] as well.

Re:Look at Open Source projects (1)

X (1235) | more than 8 years ago | (#14645037)

Nutch is written by the Lucene guys. Lucene really isn't designed to be a full fledged search engine. It's more like a site indexer.

Mozdex (1)

mparaz (31980) | more than 8 years ago | (#14648345)

To see Nutch in action... Mozdex [mozdex.com]

some info... (3, Informative)

Pavel Stratil (950257) | more than 8 years ago | (#14644157)

If you really want to go behind the theory, you will want to start here [psu.edu] . But be prepared to have some really good skills in math, statistics, computer sciences and system administration to understand the articles as they are not intended for general public.

A brief intro of how classical search engines work goes as follows:
Grabbing: A crawler visits pages which it considers important, downloads them and parses them.
Analysing: The document receives an identification string and is stored in an inverted index (sometimes called a reversed index), which is simply a database table with columns such as "word", "document", "position", "importance". The "word" column is indexed and used for searching.
Searching: Say that you search for the phrase "ask slashdot". The search engine looks up the rows for the terms "ask" and "slashdot", looks into the "document" cell and selects only those documents in which both terms occur. Then it looks into the "position" cell, which carries all the positions of the searched word in each document, and discards all the documents in which "ask" and "slashdot" do not appear at successive positions. The resulting documents are then sorted according to the importance cells of the searched terms.

This is basically how all search engines work. The major difference is usually only in the math used to compute the importance. There are also some major optimisations done to speed up the responses. To discuss this would take too long. So if you have any questions feel free to ask. Currently I am part of a team developing a large scale search engine, so you have a chance to get some hot info here :)
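
The analysing and searching steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real engine: the class and method names are invented, tokenizing is just lowercasing and splitting on whitespace, the index lives in memory rather than in a database table, and "importance" is replaced by a simple count of phrase occurrences.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase and split on whitespace; no stemming or punctuation handling.
    return text.lower().split()


class PositionalIndex:
    def __init__(self):
        # word -> {document id -> [positions of the word in that document]}
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        for position, word in enumerate(tokenize(text)):
            self.postings[word].setdefault(doc_id, []).append(position)

    def phrase_search(self, phrase):
        words = tokenize(phrase)
        if not words or any(word not in self.postings for word in words):
            return []
        # Step 1: keep only documents that contain every term.
        docs = set(self.postings[words[0]])
        for word in words[1:]:
            docs &= set(self.postings[word])
        # Step 2: keep only documents where the terms sit at successive positions.
        hits = []
        for doc_id in docs:
            later = [set(self.postings[word][doc_id]) for word in words[1:]]
            count = sum(
                all(start + offset + 1 in later[offset] for offset in range(len(later)))
                for start in self.postings[words[0]][doc_id]
            )
            if count:
                hits.append((doc_id, count))
        # Step 3: sort by the stand-in importance (occurrence count), then by id.
        return [doc_id for doc_id, _ in sorted(hits, key=lambda hit: (-hit[1], hit[0]))]


index = PositionalIndex()
index.add("d1", "Ask Slashdot about search engines")
index.add("d2", "Slashdot readers ask good questions")
index.add("d3", "Ask Slashdot and ask Slashdot again")
print(index.phrase_search("ask slashdot"))  # ['d3', 'd1']
```

Document d2 contains both words but not next to each other, so the position check discards it. Swapping the occurrence count for a better importance score is where engines differ most.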

Search engine watch (1)

MassOutput (867589) | more than 8 years ago | (#14644255)

There is a website [searchenginewatch.com] dedicated to search engines

I'm not so sure... (1)

wbren (682133) | more than 8 years ago | (#14644337)

I guess by now we can be fairly certain that search engines are here to stay...
I'm not so sure I agree with you. What do people really search for nowadays (OK, other than "sex"/"porn")? I know where to go for my news, weather, sports, tech news, political discussion, coding tips, dining reviews, etc. The last time I *really* searched for something was a couple months back. I search within websites quite often, but I do not use large search engines like Google, Yahoo!, MSN, etc. more than a few times a year. Maybe I'm in the minority, but do search engines really have a long future ahead of them?

Re:I'm not so sure... (1)

Wisgary (799898) | more than 8 years ago | (#14644378)

My favorite google query always starts out with the word wikipedia

wikipedia [INSERT_RANDOM_THING_HERE]

other than that, it's called "being a college student". Then there's movie info, music info, game/movie/hardware reviews comparisons and discussions, etc.

As the internet grows even more monolithic, I don't see how it's possible not to think search engines have a great future.

Re:I'm not so sure... (1)

steppin_razor_LA (236684) | more than 8 years ago | (#14666118)

I use Google throughout the day, every day, for my job. Then again, I'm doing web development for a living...

Re:I'm not so sure... (1)

wbren (682133) | more than 8 years ago | (#14666323)

Well yeah, in that case you would use it quite often :-)

I was talking more about the average Joe looking for sports scores or news. I think most people know where to go for their information, and search engines are just a last resort when they absolutely can't find what they need on the sites they know. I think Rupert Murdoch said something to that effect. Although I think Murdoch is wrong about nearly everything else, I have to agree with him on that small point.

HEY! (0)

Anonymous Coward | more than 8 years ago | (#14644779)

Why don't you "search" for one? LOL..... ok, bad joke

Mining the Web (1)

dubl-u (51156) | more than 8 years ago | (#14647962)

I found Mining the Web [iitb.ac.in] useful. It's written by academics, so you'll have to put in a little brain work translating it into implementable patterns, but it gave me a good jump start when I took on a new client that does a lot of crawling and searching.

v7ndotcom elursrebmem Search Engine Competition (1)

mparaz (31980) | more than 8 years ago | (#14648372)

You might want to study "v7ndotcom elursrebmem" - the latest Search Engine competition. Just type it in your favorite search engine. If you enter it in Google, you'd even get strange ads.

Some are joining to get prizes from the competition, like v7ndotcom elursrebmem: Blogging for Charity [yugatech.com]

Web Search for a Planet: The Google Cluster Arch.. (1)

adpowers (153922) | more than 8 years ago | (#14648693)

You might want to check out the paper Web Search for a Planet: The Google Cluster Architecture [google.com] . It is 4-5 generations old, but provides some interesting information about Google's previous cluster setup.

there is less info than you might think (1)

dumbfounder (770681) | more than 8 years ago | (#14651370)

There are a lot of papers on the underlying theories, but there is very little out there that will actually tell you how and when to implement them. PageRank really has nothing to do with building a search engine; it is just one measurement that goes into determining relevancy. And it isn't applied the way most people think. I would say the best book on search engines is "Mining the Web" by Soumen Chakrabarti, but it doesn't really talk about implementations. But that's why information retrieval experts get paid the big bucks.

Search Technology Resources (1)

GBRD (952468) | more than 8 years ago | (#14656048)

For print resources I would suggest:

Understanding Search Engines [amazon.com] by Michael Berry and Murray Browne

as well as

Modern Information Retrieval [berkeley.edu] by Ricardo Baeza-Yates and Berthier Ribeiro-Neto

For online resources I would of course direct you to the work of our Search Focus R&D Group [greenbuilt-research.com]
