
Is the Internet Becoming Unsearchable?

Cliff posted more than 14 years ago | from the even-terrabyte-dBs-may-not-be-enough dept.

The Internet 313

wergild asks: "With more and more sites going to a database-driven design, and most search engines not indexing anything that contains a query string, we're missing a lot of content. I've also heard that some search engines won't index certain extensions like php3 or phtml. Is anything being done about this? How can you use dynamic, database-driven content and still get it indexed by the major search engines?" Is keyword searching obsolete? Do you think it's time to index sites by the type of content they carry rather than the content itself? Will larger indexing databases (or a series of smaller, decentralized ones) help?


This is why generic domains are worth something. (0)

Anonymous Coward | more than 14 years ago | (#1465990)

But I'm not saying this necessarily justifies some of those million-dollar sales.

Directories (1)

Anonymous Coward | more than 14 years ago | (#1465991)

The one constant throughout the history of internet technology is that directory entries (Open Directory, Yahoo) with structured, categorized results are, and always will be, superior to free-text search for anything that isn't completely obscure.

Catchup (2)

Phule77 (70674) | more than 14 years ago | (#1465993)

I think we've actually hit another period, technologically, where we're advancing too fast for active standards on "how things should be done" to make things like searching pages/web databases/etc. an accessible, easy thing. It's probably going to take a while... it seems like every month they come out with a new way of doing things, a new "language that's going to change the world!", a new proprietary language/program for corps to use. Until that dwindles, for whatever reason, the web is going to continue to be behind in terms of searchability.

First comment? (0)

Anonymous Coward | more than 14 years ago | (#1465994)

Anyway, this is very true. I have a site that is entirely database driven, with the pages built from MySQL and PHP.

Searching searches? (1)

Baloo Ursidae (29355) | more than 14 years ago | (#1465995)

One idea would be to have a centralized authority use individual machines to scan for sites, à la distributed.net, to expand existing databases. Would this be possible?

Extend the Robots.txt protocol... (3)

Anonymous Coward | more than 14 years ago | (#1465996)

...to "force" search engines to search certain pages. Currently, you cna only tell searchbot to "piss off". There is no way to tell a searchbot "hey!!!! come look at this...."


Distributed Databases? (2)

Anonymous Coward | more than 14 years ago | (#1465997)

What if we just have a standard search interface that can be built in to any DB driven website....say it returns XML or WDDX or something. So now when the search engines hit a DB driven site, it goes ahead and creates an index through this interface. I guess like a DNS zone transfer.....hmmmm...

The effort put in by sites makes up for it. (1)

ItsIllak (95786) | more than 14 years ago | (#1465998)

I think the effort many of us put in to make sure that the relevant site info is indexed by the engines makes up for it. Many of my sites include special pages that only the search engines get, to increase their "relevance" in the engines' databases. What does pose a problem is people totally abusing the indexing methods to get their site promoted in searches where it shouldn't appear. I don't see that anything can be done about that (and efforts by some engines, including ignoring meta tags etc., are quite annoying).

There are ways... (1)

RFINN (18178) | more than 14 years ago | (#1465999)

My company created lots of dynamic sites with dynamic content -- without the use of different extensions or URLs that contain query strings. Apache is awesome (in case you haven't heard)! Almost all of our HTML files actually contain embedded TCL code, so the servers are configured to parse every *.html file -- allowing us to use the *.html extension for files that have dynamic content. We also use things like mod_rewrite to send data to a single file that tells the file what data to use and how to behave. We could have an entire range of sites served out by a single file... even making it look like they have their own directories, when in reality they don't exist.
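A rough httpd.conf sketch of that kind of setup; the handler name and the site layout are assumptions for illustration. Every .html file is run through a server-side parser, and a rewrite rule maps clean per-site "directories" onto one dispatcher file.

# Run every .html file through the embedded-code parser (handler name is made up)
AddHandler embedded-tcl-parser .html

RewriteEngine On
# /sites/acme/products.html -> one dispatcher file told which site and page to render
RewriteRule ^/sites/([^/]+)/(.+)\.html$ /dispatch.html?site=$1&page=$2 [PT,QSA]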

Not at all searchable (2)

lapsan (88119) | more than 14 years ago | (#1466000)

We've been running across problems related to this in my office (a web design/hosting/advert firm) and, while I'd like to see non-database-driven searching of the Internet continue, I have to say that perhaps most people would rather have the database. The many web design clients who expect that once they have a web site they won't have to advertise in print ever again are driving the whole thing toward the database method... creating the problem they so love to bitch about.

Perhaps doing away with keywords entirely, getting search engines to look at the content instead of the "false content" of meta tags... now that would be nice.

First post! (0)

Anonymous Coward | more than 14 years ago | (#1466001)

web suckz!!!!!!

Search engines are useless. (1)

jagz (106749) | more than 14 years ago | (#1466002)

Once one site is found on a certain topic, it will often be linked to many others; you can find a lot of information this way. I only use search engines in extreme cases.

Customers :) (2)

DanaL (66515) | more than 14 years ago | (#1466003)

We had a client once who wanted keywords inserted dynamically into the metatags on his webpage based on query results because he read once that search engines index pages based on the tags. Nothing we could say would convince him what was wrong with that picture.

Is it even possible to index dynamic pages? They don't really exist until the page is generated. Perhaps the best thing to do for sites that want to be indexed is to make sure they have a plain, vanilla index.html page that contains relevant keywords?

Dana

diligent searching. (1)

mcrandello (90837) | more than 14 years ago | (#1466004)

First hit google. Then metacrawler. Then try it as a phrase, then add "-" terms to filter useless results. After that ask jeeves, then the imdb, ubl, mp3.com, amazo^H^H^H^H^Hbarnes&noble online, the manufacturer's sites, then give up and ask someone for it on alt.binaries.whatever.

Short answer, yes. Long answer->I'll find it if given an afternoon or two.


mcrandello@my-deja.com
rschaar{at}pegasus.cc.ucf.edu if it's important.

Rethink the way we index? (1)

Anonymous Coward | more than 14 years ago | (#1466005)

As it is, search engines just index raw HTML, with no regard for the actual content of the pages. Perhaps as XML and related stuff begins to proliferate, the indexers of the future will begin to use the extra markup to deduce things about the data that are relevant to searchers. Certainly it needs to be rethought, because as it is, it's crappier than even searching for text in Emacs. Think of the internet as the world's largest text file, and you're trying to find things using a simple search in a myopic text editor that can only see 1/100 of the whole document anyway.

Sort of... (0)

Anonymous Coward | more than 14 years ago | (#1466006)

It isn't too bad if you're looking for obscure things; for example, if you get a weird error message from a Linux utility, or song lyrics.

But try to search for a device driver and you get those "ad bait" sites like driver-forum.com.

This is a serious problem... there is a big opportunity for a search engine that will be more selective about keywords and will reject sites of dubious value like driver-forum.com.

Mark

My sites get indexed (0)

Anonymous Coward | more than 14 years ago | (#1466007)

All of my sites are dynamically generated, using PHP3, MySQL, etc., and they do get indexed in AltaVista and alltheweb. I'm not sure about other search engines, but those two find my sites just fine.

Not a problem (1)

Tom7 (102298) | more than 14 years ago | (#1466008)

#1: it's easy to make apache run cgi scripts with any extension you want, so php3 and shtml being ignored shouldn't hurt any site that really wants to be indexed

#2: technologies like XML may give a standard interface to databases, so that search engines can index databases directly.

IMO, a much bigger threat to the "searchability" of the internet is the rapidly growing amount of information -- and with it, the amount of misinformation.

Parallel static pages for search engines (1)

Ravenfeather (21614) | more than 14 years ago | (#1466009)

How can you use dynamic, database driven content and still get it indexed into the major serach [sic] engines?"

One obvious possibility is to generate -- using the database -- a set of static pages as "targets" for the search engines. This could be done weekly or monthly, for example. Each target page would contain a prominent link to the dynamic, database-driven front end of the website, so that searchers could find the site and then quickly get to the main front end. Not particularly elegant, but it seems like a reasonable work-around for the time being. The real solution, in the long run, will involve more sophisticated searching and indexing paradigms.
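A rough sketch of that periodic dump, in Perl; the table, the column names, and the dynamic URL the target pages link back to are assumptions for illustration:

#!/usr/bin/perl -w
# Dump each database article to a static "target" page that search engines
# can crawl; each page links back to the live, database-driven front end.
use strict;
use DBI;

my $dbh = DBI->connect('dbi:mysql:mysite', 'user', 'pass', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT id, title, body FROM articles');
$sth->execute;

while (my ($id, $title, $body) = $sth->fetchrow_array) {
    open my $fh, '>', "static/article-$id.html" or die "static/article-$id.html: $!";
    print $fh <<"HTML";
<html><head><title>$title</title></head><body>
<p><a href="/cgi-bin/view.cgi?id=$id">Read the live version of this article</a></p>
<h1>$title</h1>
$body
</body></html>
HTML
    close $fh;
}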

What do people think about this approach?

dammit I forgot gopher... (0)

Anonymous Coward | more than 14 years ago | (#1466010)

and webluis. Make that 2-3 days...

Parsing .html with PHP3 not impossible (2)

Balazs (18529) | more than 14 years ago | (#1466011)

You can tweak Apache to parse documents ending in .html with PHP3. You could use .html for generated content and .htm for static pages.
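For example, the relevant httpd.conf line: the usual PHP3 setup maps only .php3, and appending .html makes Apache run .html files through PHP as well (at the cost of parsing every page):

AddType application/x-httpd-php3 .php3 .html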

Database driven web pages are 'spam' (3)

GaspodeTheWonderDog (40464) | more than 14 years ago | (#1466012)

Yeah, give me a minute to back that statement up. :)

Honestly though. With something that is inherently dynamic like the internet, it is already near impossible to catalogue and make it searchable. Just to illustrate this take any given news site. Today they might have articles about Clinton, tomorrow it might be news about a big fire. Search engines can't just direct you to those sites based on queries because who knows what data they have.

Even if a search engine was able to validate the content on every site before it gave you the url it could still change by the time you actually got to see it.

So quite literally there isn't even a clue of a way to catalogue a database-generated web site. Now granted, I know there are plenty of sites like Slashdot where eventually the 'content' settles down and becomes static. Still, how are you going to get some stupid program to verify and validate that for *every* dynamically generated web page? I don't think you can.

The web was created to be open and dynamic and it will stay that way. I've heard people say that maybe there should be *more* interoperability between things like search engines and spiders. This in my mind would do more damage.

Besides is it so bad that spiders don't get these pages? It probably isn't even reasonable because it would add that much more complexity to the search engine to catalogue what it finds. How do you rank content?

Anyway... just my 2 cents or so...

Internal Site Searches More Difficult as well (2)

peterdaly (123554) | more than 14 years ago | (#1466013)

Not only is multi-site searching becoming more difficult, but single-site searches as well.

Now most content is stored in a SQL database. While it is fairly easy to search an SQL database, returning the information in usable form is not. This is especially true once you have many types of tables containing many different types of information.

Currently, the search engine on the site I work on has its own built-in forms for information from each type of table, but this method takes a lot of maintenance.

Another possible way is to point to the page (php3, asp, .pl, .cgi, etc.) which generated the information. But this only works if arguments are not required.

It is about time someone developed some technology to do "smart searches" of sql data and return useful information without having to write a template for each and every type of data that might be queried.

I might be off my rocker a little bit on this, but I cannot believe I am the only one experiencing these problems.

-Pete

Use dynamically generated static pages (1)

seyed (33396) | more than 14 years ago | (#1466014)

Best thing to do is to create static versions of dynamic content that you want to index (like articles etc.) and use scripting to divert non-robots to a dynamic version.

You can also make those static pages keyword and meta tag heavy without affecting the user experience.
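A minimal sketch of the diversion as a Perl CGI; the user-agent patterns, file names, and URLs are assumptions for illustration:

#!/usr/bin/perl -w
# Robots get the pre-generated, keyword-rich static copy;
# everyone else is redirected to the live, database-driven page.
use strict;

my $ua = lc($ENV{HTTP_USER_AGENT} || '');
my $is_robot = $ua =~ /crawler|spider|robot|scooter|slurp/;

if ($is_robot) {
    print "Content-type: text/html\r\n\r\n";
    open my $fh, '<', 'static/article-42.html' or die $!;
    print while <$fh>;
} else {
    print "Location: /cgi-bin/view.cgi?id=42\r\n\r\n";
}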

New method? (1)

ChrisGB (114774) | more than 14 years ago | (#1466015)

I was beginning to think this as well -- Yahoo, Infoseek, HotBot and the like just don't seem to find the good stuff anymore. If there's content held in a database and a page is generated on demand by an active server page or CGI script, for example, then the page doesn't come into existence until the user requests the information.

Perhaps it's time for search engines to search by topic and direct to a site related to the enquiry. The individual sites could then have their own search utilities to trawl through their databases? Not sure if this is feasible or not though.

In terms of good search engines though - Google [google.com] and AllTheWeb.com [alltheweb.com] seem to find good content whenever I use them. The problem I guess is that you don't know what you're missing until you find it by some other means, and neither do the search engines.

Uses Keywords Luke! (1)

ninoles (24951) | more than 14 years ago | (#1466016)

META tags and PICS-like protocols should be used. However, WYSIWYG editors don't help much, since META tags aren't a visible part of the HTML document.

I also find that self-registered index sites (like WebRings) can be useful. Maybe a search engine for WebRings (e.g. look up 'Elbereth' on a Tolkien WebRing) would be useful (I have to check whether one doesn't already exist).

Personally, I use specialized index sites (like NewHoo, Linux Life or Freshmeat) when I'm looking for something. Those sites will just have more value in the future, IMHO.

Searching ineffectiveness (1)

bbqBrain (107591) | more than 14 years ago | (#1466017)

One of my biggest gripes with the current scheme is that so many people abuse it. I once searched for "XOR gate" and pulled up porn sites. If sites could be indexed based on content type, perhaps this wouldn't be such a problem. Currently, these jokers dump the entire dictionary in a meta tag and waste everyone's time by throwing off keyword searches.

The sheer volume of websites out there makes effective searches difficult. I imagine a search engine could be tuned for better results, but will people be willing to wait while it crunches through data longer than a shoddy counterpart?

Re:diligent searching. (1)

The Fonze (28895) | more than 14 years ago | (#1466018)

OK, how about this: one site sends out a bot. This bot hangs out on the remote machine, collecting info, or filling up its bag with data, then returns to report its findings. This way, the remote site can have rules on what is public info, info for a particular client, or not searchable at all (private). I've been thinking about this lately as a kind of shopping bot. There are lots of potentials, and lots of security issues... anyone for an Apache mod?

Search Engine to Search Engine protocol (1)

Anonymous Coward | more than 14 years ago | (#1466019)

Since most dynamic sites provide their own internal search engines, it seems that a standard Search Engine to Search Engine protocol could help ease this problem.

I seem to be able to find what I need... (2)

Gogl (125883) | more than 14 years ago | (#1466020)

It seems to me that if anything, the internet is MORE searchable than it used to be. I remember some statistic about how a couple of years ago the few search engines that were around only got some small percentage of the web covered anyway. These days it seems the search engines do a better job, and there are a zillion more search engines and also tools that let you search multiple search engines at once. That and the fact that there is just plain a lot more stuff on the net. Back a few years ago, if you searched for Cervantes, the author of Don Quixote, you might find a page or two on some college webpage somewhere, if you were lucky. These days there are enough pages out there that you're bound to find at least one of them that's halfway decent. Anyway, to summarize, keyword searching still seems to work for me. I think that the only way it will get considerably better is when true artificial intelligence is possible. That way, when you ask the computer to find something, it is actually smart and goes out and finds it like a real person. However, it seems to me that true artificial intelligence is a way off....

Indexing dynamic-content sites. (1)

Crafack (16264) | more than 14 years ago | (#1466021)

Today, there are two methods used when a site is added to a search engine database. The first relies on information submitted by the site, the second relies on information (e.g. keyword fields) found by crawlers. As more sites switch to dynamic content, the sites offer no easy way for a crawler to find information about their content. This could be solved by developing some method for storage and retrieval of that data. For an example, look at how the "robots.txt" [w3.org] mechanism works. /Joakim Crafack

There are just too many things ... (0)

Anonymous Coward | more than 14 years ago | (#1466022)

.. just too many things that you can save in databases and publish on the internet in dynamic pages... The internet IS unsearchable because of the amount of data, not because some websites cannot be searched by search engines (including these would increase both signal and noise and not help). What is IMHO needed more is some way to differentiate real information from fluff and spam... but that is still far away... (hoping for some advanced AI)

We're not there.... yet (3)

jw3 (99683) | more than 14 years ago | (#1466023)

For my own purposes there is no trouble finding information on the Net -- Google, AltaVista and a few specialized databases are good enough for me, whether I'm looking for a pidgin-English dictionary or a protocol for AMV reverse transcriptase. At worst you find a link to an index page with "interesting links" or something. Basic IQ and knowledge about how the search engines work is enough.

Still, I see a potential threat in information becoming unmanageable, and, most of all, in ways of finding information being abused (like using unrelated keywords just to get some visitors). Stanislaw Lem, the Polish SF writer, described this situation in many of his books -- starting in the '60s, when no one was even starting to think about such problems. Sooner or later we'll have a large branch of computer science dealing only with searching for information on the Internet; searching services are already available, but they are either incomplete or not evaluated. The latter is the key: and Google is the first service I'm aware of which tries to automate evaluation (by counting links pointing to a specific page).

There was a lot of talk about "Internet agents" a couple of years ago (I remember an article in Scientific American...) -- could some good soul explain to me what the situation is now?

Regards,

January

It's been unsearchable (2)

Mark Edwards (48) | more than 14 years ago | (#1466024)

I used to make a decent living as an Information Broker - basically, a trained database searcher for hire. Along came the net, and suddenly everyone with a modem could search for themselves. So I wrapped my shingle up, and stored it away.

These days, there is so much junk and bad indexing, that I may as well put the shingle back out. Almost any search will find mostly commercial sites, unrelated to the search, or completely useless garbage.

You almost have to be in a bizarre frame of mind to create a good search term these days.

Mark Edwards [mailto]
Proof of Sanity Forged Upon Request

Re:Extend the Robots.txt protocol... (2)

technos (73414) | more than 14 years ago | (#1466025)

This is probably the simplest solution. Just use an 'insteadof' tag in robots.txt to redirect the 'bot to a meaningful page. As is, sites are using hidden pages gobbed full of meta tags, relevant text, etc. Point the bot there! You could perhaps just use a space-delimited section of the database, and include some standard for how the searchbot indexes the new 'insteadof'.
'Karma, Karma, Karma Chamelion' -- Boy George
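A hypothetical robots.txt using that directive; note that 'Insteadof' is not part of the actual robots exclusion standard, it is only the extension proposed above, and the paths are made up:

User-agent: *
Disallow: /cgi-bin/search.cgi
Insteadof: /cgi-bin/search.cgi /search-index.html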

Black holes (3)

slashdot-me (40891) | more than 14 years ago | (#1466026)

I've done some work on a spider and these are the types of pages I spider:
html htm asp php shtml php3
I guess I'll add phtml :)

Other extensions and urls with query strings are ignored. This is mainly for self defense. There are many, many infinite loops and blackholes on the web and they're hard to avoid. For instance, my spider once got stuck on a server that would return the contents of /index.html for any non-existent path. Also, all links on the homepage were relative (not a bad thing) and one was invalid. The call sequence is below.
GET /index.html
found foo/broken.html
GET /foo/broken.html
webserver couldn't find path, so returns /index.html
GET /foo/foo/broken.html
etc.

What was the programmer thinking?

This is just one example of the blackholes that lurk on the web. It was completely unexpected and pretty difficult to detect. What if someone wanted to write a search engine trap? I don't believe there is a simple solution to this problem.
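A minimal sketch, in Perl, of two defenses against this kind of black hole: a cap on URL path depth, and a digest of page bodies already seen so that a server returning the same page under ever-longer URLs (as in the /foo/foo/broken.html loop above) is caught quickly. The module choices and the depth limit are just illustrative.

#!/usr/bin/perl -w
use strict;
use LWP::Simple qw(get);
use Digest::MD5 qw(md5_hex);
use URI;

my $MAX_DEPTH = 8;                  # arbitrary cutoff; tune to taste
my (%seen_url, %seen_body);

# Decide whether a URL is worth fetching at all
sub should_fetch {
    my $url = shift;
    return 0 if $seen_url{$url}++;                        # already visited
    my @segments = grep { length } URI->new($url)->path_segments;
    return 0 if @segments > $MAX_DEPTH;                   # suspiciously deep path
    return 1;
}

# Fetch a page, but drop it if we've seen an identical body under another URL
sub fetch {
    my $url = shift;
    return undef unless should_fetch($url);
    my $body = get($url) or return undef;
    return undef if $seen_body{ md5_hex($body) }++;
    return $body;
}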

Ryan

Re:Catchup (1)

NotQuiteSonic (23451) | more than 14 years ago | (#1466027)

It's true. What we need is a new standard. The thing that bugs me most about search engines is getting multiple hits on a particular site. You could distribute the load of the search to the site itself (have a standard search db that could talk to the search engines) and simply index the general idea of the site through meta tags. Then, using a search engine (which becomes more like a distributed database browser), you could browse the individual and up-to-date search dbs of each site.

Domain Names are the Kludge to the Problem (1)

Ron Bennett (14590) | more than 14 years ago | (#1466028)

Many companies, especially startups, are turning to using catchy domain names as the way to promote their site and products. Even many non-profits and research groups now register domain names that reflect what they do since many people just type in domain names into their browser - and ironically having a domain name actually helps in being indexed by some search engines; one may debate if this is good or bad, but it's a reality.

Until there's a standard, the search engines will continue to miss more and more of the sites out there. XML may be the answer to indexing and exchanging data. However, on the bright side, the difficulty of finding data makes censorship much more difficult for the censors - and that's a good thing.

Re:Search engines are useless. (0)

Anonymous Coward | more than 14 years ago | (#1466029)

Yeah, but how did you find that first site?

While they are nice we should already have more (2)

slashdot-terminal (83882) | more than 14 years ago | (#1466030)

Just stop thinking that terabytes are the limit. Get more hardware and more computers. Create petabyte databases. In fact, have millions of petabyte locations worldwide and create a series of multi-petabyte databases that one can use.
Categories are nice, but some (most) sites are personal sites, and these sites change quite often in subject matter.
While the categories are nice, we should have a community-planned and -maintained categorical system along with a plain text search. Have identifier tags that go along with every web site, and then have a standalone and a web-based version of this program which will allow anyone to create a hierarchical listing of anything according to certain tastes and parameters.

Two-level structure (2)

Kaa (21510) | more than 14 years ago | (#1466031)

I think we are already looking at a two-tiered structure: there are sites (that could be found through standard search engines) and then there are databases/archives inside those sites.

It is getting more and more so that to find an answer to a somewhat obscure question, I need first to find major sites on the topic, and then do a search through their databases or mailing list archives. I believe this reflects a real-life structuring of the Web and will have to be taken into account by next-generation search engines.

Kaa

Centralized Searching is the Wrong Approach (1)

douglass (29120) | more than 14 years ago | (#1466032)

I don't think the issue has to do with the ability of centralized search engines to index dynamic pages. I think there is a more fundamental flaw in that idiom.
The lists of problems that exist for centralized search engines goes on and on: dynamic pages (of course), missing/broken/changed links, getting to new pages, and so on.
What I think could be done is to define a standardized search protocol (perhaps through some kind of search://domain/search+terms method). The global search engines would then search by determining the most likely sites to have information for you and querying those sites directly. This would fix the problem of broken/missing/changed links being reported, new pages would automatically be available (assuming sites updated their search engines quickly), and if the local search engines were integrated with dynamic page generators (which should be possible) then those pages could be searched too.
I realize that a lot of work would need to be put into this for it to work. A protocol would need to be developed, as well as servers for the protocol. Search engines would have to learn to efficiently decide which sites to query to complete their searches, etc.
Perhaps a combination of both approaches could yield something even better. All I know is that what is out there right now, well, fails miserably.

Static gateways? (3)

sufi (39527) | more than 14 years ago | (#1466033)

One way round the search engines missing query URLs is to write out static pages for the purpose of submitting to the search engines; there are many clever ways of having truly dynamic sites without the need for long URLs, you just have to put some effort into it.

Search engines not picking up on php3 is a bit worrying though, all my sites are written purely in php3, although I never seem to have any problems with getting listed.

Gateway pages are a good way of making sure you get listed with the keywords you want, although they aren't very dynamic and, unless you get really clever, don't tend to reflect the contents of a regularly updated site... however, it seems to me that you can only really hope for *a* listing these days, not an index of all of your site.

Even Google has a 3-month disclaimer on its submit page; that's a mighty long time if you are looking for support on a brand new motherboard.

LASE seems to be the way to go... subject-specific full-text indexes which spider regularly and can index specialised data, keeping it up to date.

However you would still need a search engine to find a LASE that will get you what you want, but at least it's a bit more structured!

There are many ways round the search engine problems, and keeping on top of it is a full time job, Submit-it doesn't come close, that hasn't changed in the past 3 years, Search engines however have!

IMO a combination of all of the above will get you where you want. Keywords and Meta Tags still count, and you have to be persistent.

fault bad browsers and no index of quality (1)

_Doug_ (32427) | more than 14 years ago | (#1466034)

In my opinion, part of the fault lies with the browsers, which handle caching of dynamic content poorly, regardless of whether it is on a remote webserver or a local drive. I, for example, am forced to add a useless query string to the end of local file URLs so that all browsers will work. Browsers are notorious for ignoring no-cache pragmas and expiration dates.

The most common way though people find out about worthy dynamic content sites I think is word of mouth. We could use more forums and link referrals to share websites we have found useful. This has the very distinct advantage over search engines of providing a better filter of QUALITY of information. After reading someone's recommendation of slashdot or an article elsewhere, I won't have to hurdle 19 irrelevant hits to get there.

Re:Catchup (1)

Chocky2 (99588) | more than 14 years ago | (#1466035)

I suspect we'll see the emphasis shifting towards more specialised, manually or semi-manually compiled indices of websites, complemented by robots searching these manually created indices for new/relevant/expired items of information/people/links.

And alas, as far as the crap that accounts for 75% of the web goes, the cost of accommodating the vast quantities of crap is less than the cost of removing it or improving ways of avoiding it; and until that is no longer the case we're going to have to put up with ever increasing quantities of sewage.

The Open Directory Project (2)

side_ways (121915) | more than 14 years ago | (#1466036)

The Open Directory Project [dmoz.org], managed by dmoz.org, is an open source effort to create an organized index of the internet through volunteer work. Currently there are 20,000+ volunteers working on the project. This is a way cool idea that we should all support.

XML? (2)

Matt2000 (29624) | more than 14 years ago | (#1466037)

I read a while back that meta data for sites would eventually move to an XML based standard which would accurately describe the content of the site?

Whatever happened to that? I don't mind all that much being taken to the front page of a site if I know that site has the information somewhere in there, I just hate having to hit seven sites to find that one.

Hotnutz.com [hotnutz.com]

Challenges for searching the web..... (2)

chirayu (3931) | more than 14 years ago | (#1466038)

I have been thinking about the working of a search engine lately and this post just comes at the right time.

Some of the challenges which will be faced in searching the web in the future will be:

1. Displaying matching URLs as well as links which match the type of content. This is important. If I search for "throat infection" on a search engine, then apart from the pages which mention "throat infection", the engine should give me a link to drkoop.com, webmd.com (AFAIK, these sites do not allow search engines to copy their content) and so on.
Search engines will have to maintain huge databases linking words to categories. And with the proliferation of the internet, the number of sites carrying content and disallowing search engines is going to increase. Search engines need an intelligent way to get around this.

2. Search engines will need to "help" users with their searches. For example, if I just search for "throat", the search engine should have a helper section where it can ask me more... whether I am searching for "throat infection" or "study of the throat" and so on.

3. Search assisted by humans. This is also one of the concepts picking up these days. Basically you submit a question, some person searches the web for you, and you get your answer in a few hours/days. Check out www.xpertsite.com.

4. Tools for better maintenance of bookmarks. I, for one, usually bookmark all relevant stuff and then spend a full weekend arranging it so that I can find the relevant material quickly. The current bookmarking scheme is very primitive, causing a lot of users to "reinvent the wheel" (searching for URLs which are already bookmarked).

Phew!

I'll jot down more thoughts later. Gotta work now.

CP

XML (2)

pos (59949) | more than 14 years ago | (#1466039)

I was just about to ASK SLASHDOT about XML [w3.org]. XML will solve the search problem (or at least help make it better). Working drafts of XML have been drawn up by the W3 Consortium [w3.org], and XLINK [brown.edu], XSL [w3.org], etc. are coming. There are almost no XML applications available yet, though! Most of what is available is in Java. This is a field where Linux could be leading the pack, but is instead an example where I think we are lagging behind. (I hope someone can point me to a group that is bringing XML deep into the Linux OS.)

I want to know if Linux is on top of this. Microsoft has an XML notepad available, and I hear that it's going to be all over Win2000 (in the registry, even). XML will be the foundation of the new internet, and we don't want Microsoft to have a technology edge there, do we? Perl has XML modules, as I am sure other languages do too (Python). Let's get some apps written!

What about GNOME and KDE? This could help make their projects easier, especially KDE, with all of the object similarities between CORBA, XML, and object RDBs. All config files could theoretically be stored in XML. We need to push this one, people!

-pos


The truth is more important than the facts.

Re:Not a problem (1)

Xuli (98764) | more than 14 years ago | (#1466040)

XML, as wonderful and dynamic and potentially-world-saving as it is, is far from standardized. In addition, the only serious developments I've seen of any semblance of an XML search engine are being developed and marketed to be used by businesses in their intranets because, well, that's where the $$$ is... So, in short, my opinion is that XML is the only possible savior of exponentially expanding, undocumented, unsearchable content; it just needs more attention from the user/consumer community.

You mean like Sherlock? (2)

skew (123682) | more than 14 years ago | (#1466041)

The problem with dynamic content is that you pretty much have to query the target web servers at the time the user enters the search request.

One solution that attempts to address this is Apple's Sherlock [apple.com] . It uses XML to pass queries to web sites and return results. There are certainly some limitations: you have to choose which web sites you want to search (although this isn't always a bad thing), these web sites have to support Sherlock queries, and it only works on the MacOS. Currently lots of big name and Apple-specific sites support it.

The dev info at Apple is pretty clear though. It wouldn't be difficult for others to create clones for Sherlock that either work over a web interface or on other OSes too. (dunno if Apple could...or would... make any claim against this)

Scott

Is searching DBs really necessary? (1)

LordSaxman (118168) | more than 14 years ago | (#1466042)

No technology is going to read your mind -- you're limited by language, and that can be interpreted and misused in multiple ways. This includes searching (e.g. keywords in porn sites) applications. Word misuse will never stop (ask Plato or Burke), so we're just going to have to deal with it.

Eventually, the *end user* has to do the information filtering, so you might as well take what you can get FAST so you can move on if you don't see what you need. Indexing every database or dynamic page on the web would slow engines to a crawl. Do you honestly want AltaVista bringing up books from Amazon, companies from the Thomas Register, and patents from the USPTO? There's no need for this. If you want specialized information, go to a specialized source.

Spider traps ... (4)

charlie (1328) | more than 14 years ago | (#1466043)

Many years ago (1994? 1993?) I wrote a web spider. (Crap back end, though, so I dropped it. The bones are on my website.)

Some time later, it occurred to me to try to monitor the efficiency of web indexing tools using a spider trap.

The methodology is like this:

  • Write a perl module (or equivalent) that generates realistic-looking text using Markov chaining based off a database. Text generated should be deterministic when seeded with a URL.
  • Write a CGI program that uses PATH_INFO to encode additional metainformation. Have it eat the output from the text generator and insert URLs that point back to itself, with additional pathname components appended.
  • If the spider follows a link it will be presented with another page generated by the CGI script, containing text generated by it in response to a hit that differs in a repeatable manner from the text in the original page.
  • Child pages should contain links that point inside the web site; you could do this by making the CGI program the root of your "document tree". Better yet, run multiple virtual servers and include URLs bouncing between the domains -- all of which are mapped onto the same script.
  • Stick this thing up on the web and wait for the crawlers to come. They will see a tree of realistic-looking HTML with internal links, digest, and index it.
  • You can now analyse your logs and monitor the robot's behaviour (e.g. by changing the type, frequency, and destination of links your text includes). You can also search the search engines for references back into your document tree and work up some metrics to measure just how accurately it's been indexed (e.g. by re-generating the text of a page and feeding it to the search engine and seeing what comes back -- which words are indexed and which are ignored).
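A minimal Perl CGI skeleton of such a trap. The "text generator" here is just a seeded word picker rather than a real Markov chainer off a database, but the key properties from the list above are preserved: output is deterministic for a given URL, and every page links further back into the same script. The script name and word list are made up.

#!/usr/bin/perl -w
use strict;
use Digest::MD5 qw(md5);

my $path   = $ENV{PATH_INFO}   || '/';
my $script = $ENV{SCRIPT_NAME} || '/trap.cgi';

# Deterministic seed derived from the requested URL
srand(unpack('N', md5($path)));

my @words = qw(search engine spider index dynamic database content page robot query);

print "Content-type: text/html\r\n\r\n";
print "<html><head><title>Archive $path</title></head><body>\n";

for my $n (1 .. 5) {
    # A paragraph of pseudo-text...
    print '<p>', join(' ', map { $words[rand @words] } 1 .. 60), "</p>\n";
    # ...and a link one level deeper into the trap
    printf qq{<p><a href="%s%s/node%d">more</a></p>\n},
        $script, $path eq '/' ? '' : $path, $n;
}
print "</body></html>\n";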

Anyone done this? I'm particularly interested in knowing how spiders handle large websites -- have been ever since I was doing a contract job on Hampshire County Council's Hantsweb site a few years ago and caught AltaVista's spider scanning through a 250,000 document web that at the time had only a 64K connection to the outside world. (Do the math! :)

It's not obsolete... (1)

Seth Scali (18018) | more than 14 years ago | (#1466044)

I think that, for the most part, the databases are doing their job rather well.

Where do you find the most dynamic content? News sites. Slashdot, Freshmeat, Linuxtoday, Yahoo! News, etc. These are the sites that need dynamic content.

Ironically, these are the exact sites that search engines are pretty much not interested in indexing, anyway. Even assuming that a database can update all its sites once per day, that means that the information is a day old-- centuries, in Slashdot time! People don't go to AltaVista to search for the story over at ABCNews.com. They go to AltaVista to find information about international child custody laws (to name a random hot issue of late).

Most of your general information stuff is pretty much static. This is what the search engines look for anyway-- this is the stuff that doesn't change often, so it's good stuff to record. Why would anybody bother to make a page about Cup 'O Noodles that's generated through a Perl script? It's too tough, and can be a huge pain in the ass to change it.

Why index the pages that are constantly changing, when the stuff you're looking for (by definition) doesn't change much? Sure, there's overlap (small sites that generate the exact same content every time). But it's such a small segment that hardly anybody would miss it (yes, it may be important, but not important enough to totally revamp the indexing procedure).

Indexing dynamic content (was Re:Customers :) (1)

Simon Brooke (45012) | more than 14 years ago | (#1466045)

Is it even possible to index dynamic pages? They don't really exist until the page is generated.

Yes, for a very large category of dynamic pages, it is. For example, in an online shop, the actual number of a particular product in stock may vary from minute to minute, and the price of that product in the user's preferred currency may change from week to week, but the product itself doesn't change much over months or years. It makes perfect sense to index the product page, because although some of the contained data may be transient, a great deal more is not.

Or take another example: the weather forecast for a particular area. The forecast itself may change regularly, but the page always contains a current forecast and that fact is worth indexing. The best technology available for this sort of thing is probably RDF [w3.org] and the Dublin Core [purl.org] metadata specification. Of course, the search engines still have to be persuaded to take heed of this...
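For instance, a minimal sketch of Dublin Core metadata embedded in an HTML head for such a forecast page; the values are purely illustrative:

<head>
  <title>Five-day forecast</title>
  <meta name="DC.title"       content="Five-day weather forecast">
  <meta name="DC.subject"     content="weather; forecast">
  <meta name="DC.description" content="Current five-day forecast, updated hourly">
  <meta name="DC.date"        content="1999-12-01">
</head>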

Two ways things might go (2)

jd (1658) | more than 14 years ago | (#1466046)

There is nothing to stop the web server from sending different pages, depending on whether it's a regular user or a web crawler.

Therefore, it would be entirely feasible to have a system in which regular users saw regular pages and web crawlers saw a "static" index page, all at the same URL.

This would allow web crawlers to index according to genuinely useful keywords, rather than by how the crawler's writer decided to determine them.

An alternative approach would be to distribute the keyword database. Since all the web servers have the pages in databases of one sort or another, it should be possible to do a "live" distributed query across all of them, to see what URLs are turned up.

This would be a lot more computer-intensive, and would seriously bog down a lot of networks & web servers, but you'd never run into the "dead link" syndrome, either, where a search engine turns up references to pages which have long since ceased to be.

Re:Directories (1)

mcrandello (90837) | more than 14 years ago | (#1466047)

That's the problem. Obscure is all I ever deal with. You are right about hierarchies, however. miningco [miningco.com] is a pretty nice one as well. Oh, and what became of NewHoo [dmoz.org], also... Just thought I'd point them out because I had almost completely forgotten about dmoz, and I keep stumbling across miningco pages from other searches, and they seem worth mentioning...


mcrandello@my-deja.com
rschaar{at}pegasus.cc.ucf.edu if it's important.

The searchers and the searchees are ever-changing (3)

Zigg (64962) | more than 14 years ago | (#1466048)

I think the real problem with searching really isn't that the Internet is growing too large. The central problem with it being too hard to find information is due to the unfortunately ever-changing nature of HTML. (Yes, I know there are much better solutions out there -- I work with some of them on a daily basis. However, we seem to presently be stuck with HTML and its variants.)

It's a self-feeding monster, whose typical cycle goes as follows: SearchEngineInc (a division of ConHugeCo) creates a new technology that really impresses people with its ability to find what they want more quickly. (Right now SearchEngineInc is probably Google [google.com] , at least in my view.)

Once the new technology takes root, content authors (well, maybe not the authors so much as their PHBs) note that SearchEngineInc doesn't bring their business (which sells soybean derivatives) to the top of the search list (when people type ``food'' into the search engine). Said PHBs make the techies work around this ``problem'', and all of a sudden SearchEngineInc's technology isn't so great anymore because the HTML landscape it maps has changed.

A similar situation occurs when PHBs think their site doesn't ``look'' quite as good as others. (Insert my usual rant about content vs. presentation here.) Whether via a hideous HTML-abusing web authoring program, or via all sorts of hacks that God never intended to appear in anything resembling SGML, the HTML landscape is changed there as well, and SearchEngineInc's product becomes less effective.

What's the solution to this? I'm not quite sure. Obviously there are better technologies out there that are at least immune to PHBs' sense of ``aesthetics'' but I would wager few of them are immune from hackery. I'd say that search engine authors are doomed for all time to stay just one step ahead of the web wranglers. At least it assures them that their market segment won't go away any time soon. :-)

Unsearchable? Possibly... (1)

Shadowcat (56159) | more than 14 years ago | (#1466049)

I have to say, yes. I believe that with the way the internet is growing, it's difficult to keep up with new pages and new technology. I know there have been several times I have done searches only to turn up nothing when I KNOW it's there or to turn up too much which pertains to nothing I'm looking for. Most of the more mainstream search engines have become obsolete, I'm afraid. Many of them use methods that just simply aren't practical like searching for certain words in the text of a page. When you search for things like that your searches will not be accurate and often you'll get information you don't really want or need.

So, I believe the internet is outgrowing the current search engine technology.
-- Shadowcat

Computer indexing too primitive (for now...) (1)

Outlet of Me (90657) | more than 14 years ago | (#1466050)

I think it is just the plain and simple truth that the searching algorithms all of the search engines use currently are not suitable for the task. I will perform searches on what I think are pretty obscure terms and return >10,000 hits on some of these search engines. Of course, none of them mean anything to me.

I'm not saying that this problem won't be figured out at some point. It's going to take a little more technology than we have right now, but no doubt it's on its way even as we speak. (Any AI experts out there? :)

Until then, indexing by hand seems to be the only 100% solution. Humans are fallible, but much less than the machines are at this present stage. Plus, directories geared towards specific topics would help narrow down your search before you even start searching.

Hide the query string (1)

grinder (825) | more than 14 years ago | (#1466051)

There is no excuse for having a purely database-driven website that does not appear to be straight HTML pages. If you have ?s everywhere then you're just lazy.

Firstly, even though you might pull everything out of a database, a large percentage of all such content is not really all that dynamic, which means you're probably better off precompiling the page down into static HTML and recompiling it only when its content changes.

Secondly, if you have a script with a messy query string you can turn it into something that doesn't look like a script at all, e.g., /cgi-bin/script.cgi?foo=bar&this=that could be presented as /snap/foo/bar/this/that.

With Apache, you would just define and pass it off to a handler that would pick up the parameters in the PATH_INFO environment variable. If people tried URL surgery, you could just return a 404 if the args made no sense.
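A minimal Perl CGI sketch of such a handler for URLs like /snap/foo/bar/this/that, assuming /snap has been mapped onto the script (e.g. via ScriptAlias or Action); everything after the script name arrives in PATH_INFO as alternating key/value path components:

#!/usr/bin/perl -w
use strict;

# PATH_INFO holds "/foo/bar/this/that"
my @parts = grep { length } split m{/}, ($ENV{PATH_INFO} || '');

# An odd number of components means the "query" makes no sense: return a 404
if (@parts % 2) {
    print "Status: 404 Not Found\r\nContent-type: text/html\r\n\r\nNot found\n";
    exit;
}
my %args = @parts;   # foo => bar, this => that

print "Content-type: text/html\r\n\r\n";
print "<html><body><ul>\n";
print "<li>$_ = $args{$_}</li>\n" for sort keys %args;
print "</ul></body></html>\n";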

Search engines are your best (and probably only) hope of getting people in to visit your site. It's up to you to make sure your URIs are search-engine friendly. If they can't be bothered to index what looks like a CGI script, well that is your problem. There are more than enough pages elsewhere for them to crawl over and index without bothering with yours.

Re:Not at all searchable (1)

ChristTrekker (91442) | more than 14 years ago | (#1466052)

So then you put invisible content in the page instead. Same result.

There will always be a way to "fool" indexing robots if you're creative enough.

False positive hits. (3)

Lord Kano (13027) | more than 14 years ago | (#1466053)

It disturbs me that so many pron sites have hidden in their HTML code (and sometimes not even hidden) huge lists of adult film stars just to get hits from search engines.

If you do a search for Cortknee or Lotta Top you'll get a bazillion hits and 90%+ of them are "Click here to see young virgins having sex for the first time on their 18th birthday!"

As we all know, but nobody likes to admit, pron is the fuel that makes the net go 'round.

Many other sites have taken hints from the pron people. I'm sure that it was a deal of some sort, but every time I do a search on metacrawler, whatever I search for, I get a link to search a certain bookstore for books on the same topic.

Commercialism and shady practices are what are making the net so hard to search.

LK

Re:The effort put in by sites makes up for it. (0)

Anonymous Coward | more than 14 years ago | (#1466054)

The engine I'm working on doesn't read meta tags. But neither do humans. It also runs a grammar analyzer on the page to determine if the content is real. Bye bye keyword lists.

Re:Searching searches? (1)

MrIgnorant (70659) | more than 14 years ago | (#1466055)

The distributed idea is interesting, but I think the problem lies not in the actual power and bandwidth of the search, but more in what exactly we are searching for. Much like the article says, I've noticed myself that many of the search engines today don't find exactly what I'm looking for, and I'm still stuck sifting through their results.

I think the search engine community needs a paradigm shift in its approach to searches now, with the curve dynamic information has thrown at it. I don't know how well standards would work in this situation. It's up to the search engines to come up with a new way of sorting the huge amounts of data they collect in an orderly fashion so they can serve us searchers with exactly what we ask for. OK, so "exactly" is probably stretching it a bit, but I'll settle for pretty damn close.

Re:Black holes (1)

phutureboy (70690) | more than 14 years ago | (#1466056)

Shouldn't you index all files with MIME type text/html or text/plain, regardless of file extension?

Just curious.

mod_perl should be helpful to you (0)

Anonymous Coward | more than 14 years ago | (#1466057)

With mod_perl, you could create a system that analyses the URL requested and makes a database query. You could hide a database behind something like www.webzine.org/articles/section/109944.html: no actual file called 109944.html would exist on your server, but a request for that URL would tell your server to query the record for 109944 from the database.
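A minimal mod_perl 1.x sketch of that idea; the package name, database layout, and URL pattern are assumptions for illustration:

# httpd.conf:
# <Location /articles>
#     SetHandler perl-script
#     PerlHandler My::ArticleHandler
# </Location>

package My::ArticleHandler;
use strict;
use Apache::Constants qw(OK NOT_FOUND);
use DBI;

sub handler {
    my $r = shift;
    # /articles/section/109944.html -> article id 109944
    my ($id) = $r->uri =~ m{/articles/\w+/(\d+)\.html$};
    return NOT_FOUND unless $id;

    my $dbh = DBI->connect('dbi:mysql:webzine', 'user', 'pass');
    my ($title, $body) = $dbh->selectrow_array(
        'SELECT title, body FROM articles WHERE id = ?', undef, $id);
    return NOT_FOUND unless defined $title;

    $r->content_type('text/html');
    $r->send_http_header;
    $r->print("<html><head><title>$title</title></head><body>$body</body></html>\n");
    return OK;
}

1;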

Searching... (3)

Pollux (102520) | more than 14 years ago | (#1466058)

Okay, I just got done with my research paper for college last week, and although I can pull a paper out of some orifice of my body, researching is always a pain.

Our library has a wonderful online database where you can type in keywords and search for them, but the keywords only look as far as the title, author, or abstract of the book. If you want to look up some narrow topic, you can't expect that there are books written exactly on that topic, but there are always bound to be a few books out there that have a few pages dedicated to that subject (even if it isn't listed in the abstract). So, what do you do? You have to get your hands dirty.

My topic: Holy Wisdom (I won't bore you with details, but just stick with the subject). Looking in the online database, I find that there are zero books on the subject. Darn. Let's do some lookin...

After I read in a few Religion Dictionaries, I find that Holy Wisdom is also called "Sophia." I go back to the catalog, type in "Sophia," and I get one book. I skim this one book, and find that Sophia has sometimes been associated with the Holy Trinity. So, I go back to the catalog, enter "Holy Trinity," and BOOM, I get back 400 results (anyone seeing a similarity here...). Let's limit them...we'll search within the results for "History of," and I get back about 11 results. I read the abstracts, find a few books of interest, and start skimmin...

...Well, whadda know, there's a page in one book that talks about Sophia, and half a chapter in another book that talks about Sophia as well. There's a few more sources for the paper!

Now, for those of you who just don't understand what I'm trying to say here, just read from here on, cause here's my point: Computers aren't smart enough yet to "guess" at what we want, and personally, I don't think they ever will. Internet keyword searches are just like asking someone to help you who has no idea what your topic is...they can only search for what you ask them to search for.

Internet keyword searches are a hassle, and many times the first few returns won't be anything CLOSE to what you want (search for "Computer Science," you get back porn; search for "Linux," you get back porn; search for "White House,"...). But if you learn how to dig, like the people who lived fifty years ago WITHOUT Boolean searches, you'll find what you're looking for. Sometimes it's just like searching for a topic... you might not find anything directly, but you can't sum up an entire book in just a paragraph either!

Try some links, look around, and it'll be there!

PHP / Dynamic Pages Are Indexed (2)

waldoj (8229) | more than 14 years ago | (#1466059)

Many of my sites are database-driven sites that run on PHP and MySQL. No problem with indexing, and no problem with the file extensions.

If you can get beyond the backend concept of a dynamic page, most pages really appear to be quite static from an indexing perspective. An HTTP-based indexing system (as opposed to a filesystem-level one) can't tell that pages are dynamic, and doesn't care.

I've never had a problem with search engines failing to index pages just because they had convoluted URLs. If some engines do that, it's a bloody shame.

Re:XML (2)

Digital G (16017) | more than 14 years ago | (#1466060)

take a look at http://xml.apache.org


yep. (1)

justin_saunders (99661) | more than 14 years ago | (#1466061)

Dynamic pages don't exist until you click on them in your browser either. Search engines *will* follow links to dynamically generated pages.

The point is there has to be a link there in the first place. They will not be able to index a dynamic page if it is only accessible through a "form" post.

The way you can get around this is to have a hidden (to users) page on your site with hardcoded (or database generated) links into the dynamic content that you'd like visible from search engines.

For example, if you have a whole heap of news articles on your site, with one per page, you can make a dynamic page called "newslinks" which, when requested by a crawler, queries the database and writes out links to every news article on the site.
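A sketch of such a "newslinks" page as a Perl CGI; the database, table, and article URL scheme are assumptions for illustration:

#!/usr/bin/perl -w
# One crawlable page that enumerates links to every article in the database.
use strict;
use DBI;

my $dbh = DBI->connect('dbi:mysql:news', 'user', 'pass', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT id, title FROM articles ORDER BY id');
$sth->execute;

print "Content-type: text/html\r\n\r\n";
print "<html><head><title>All articles</title></head><body><ul>\n";
while (my ($id, $title) = $sth->fetchrow_array) {
    print qq{<li><a href="/news.cgi?id=$id">$title</a></li>\n};
}
print "</ul></body></html>\n";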

cheers, j.

Meta-Engines (1)

Paradox !-) (51314) | more than 14 years ago | (#1466062)

IMHO we're already seeing the advent of meta search engines that do their own search and then do a simultaneous search using other engines. (Yahoo does this, I think, as does lycos/hotbot) That's a great kludge for these engines to extend their reach, but not a real solution.

I think we'll see more topic-specific search engines (I use trade rag sites exclusively for really good info on tech news, for example) linked together through the big search engines. The main engine (Google, or whatever) will check the search term to see if that term has been pre-linked by the engine managers to generate a search on a more topic-specific engine (for example a search on "market size" may cause the engine to do a lookup on the northpoint search engine) or engines, and then combine the results of its own search with that of the topic-specific engine for relevant results.

It's the whole idea of vertical portals taken to the next level. The vertical portals provide topic-specific searching capabilities over the 'Net to the behemoth engines and portals for a fee, or something.

Remember, the user will not get smarter, but will rather look for the faster and easier solution.

IMHO.

Semantics Antics (3)

TrueJim (107565) | more than 14 years ago | (#1466063)

I'm going to say a naughty word: artificial intelligence. I'm hoping we soon (< 5 years) get good enough at this "indexing" stuff to create semantic models of Web content rather than purely syntactic models. (Google is a small step in the right direction.) If so, then perhaps dynamic pages can be indexed according to their location (role?) in an "ontology" rather than via the frequency of essentially meaningless character strings. That may sound farfetched, but it seems to me that the Web finally provides a real _financial_ incentive, with near-term payoff, for that kind of research. Hitherto, the quest has been purely academic. And where there's the lure of a real payoff, stuff often happens quickly (usually -- batteries and flat-screen technologies being notable exceptions).

The problem is centralization (1)

TulioSerpio (125657) | more than 14 years ago | (#1466064)

Every Web/FTP server should have a standard, live query engine. Every week or so, indexing sites would query them and update their databases, but only down to the site level; if end users want a particular page, they query the site itself in a second-phase search. [Good Spanish, bad English]

This looks like a job for.....XML! (1)

Rombuu (22914) | more than 14 years ago | (#1466065)

Look, up in the sky... it's a bird, it's a plane, it's web sites dumping their information from hard-to-index databases into easy-to-read XML!

Wasn't this sort of thing what XML and RDF were originally designed for?

less page, more site (1)

matman (71405) | more than 14 years ago | (#1466066)

I think this is going to force search engines to focus on sites rather than pages, since a site can be described by keywords even if its individual pages are database-driven. I usually like searching by site anyway - provided that the site has a nice search engine :)

Don't change the web--change the way we search (1)

Zomart9th (125454) | more than 14 years ago | (#1466067)

The web is growing and changing at a pace that a band-aid fix like static indices just can't keep up with. Database-driven web sites are simply more manageable, scale better, and more easily allow the separation of content creation from site design than static ones consisting of n-thousand HTML documents.

Technologies like XML and WDDX provide access to databases through standard protocols and are not difficult to implement. A few simple, scalable solutions include:

  • Allaire's [allaire.com] ColdFusion (for small websites)
  • Allaire's Spectra for *huge* sites
  • Apache in combo with some DB fun for those of you sage enough to use *nix

DB-based web content has the potential to make the web more searchable than ever before through hierarchy and content classification, but only if we do not try to rein it in. Instead, we should adapt the way we search to the emerging scalable, powerful web architecture that is the future of the web.

Re:Domain Names are the Kludge to the Problem (0)

Anonymous Coward | more than 14 years ago | (#1466068)

Things like www.bobsfantasyhouseofbarbeque.com... I work with a hosting provider, and some of the domain names these people come up with are unbelievable. The only thing funnier is when they try to abbreviate: www.qqqfudosgirinc.com. I mean, it does help with getting indexed, but then you get to the page and it looks like it was made in Publisher, because it *was*... OK, I'll stop bitching now.

Data Driven Sites: an RFC standard is needed (1)

CodeShark (17400) | more than 14 years ago | (#1466069)

I don't think that the 'Net is becoming unsearchable; I think there is no standardization in how to write and/or search a data-driven site. When I code a database-driven site, for example, I include code which automatically writes the new meta tag information for each content page, and then I submit the new pages to as many engines as I can. But I don't know of any way to automatically get the big sites to delete my old pages and replace them with the new ones -- the timing of submissions appearing in the search engine directories has been highly unpredictable, to say the least.

What I try to prevent is the problem I am going to mention next: with many data-driven sites, the content pages "expire" (i.e., they are aged out of the database -- thus disappearing from the site) without any notification to the search engines that the page is gone.

As an example, I use a product which performs queries against 10-12 search engines at the same time. For any given search, 10% or more of the pages will be invalid. What little research I have done into the invalid sites often shows that the page has been dead for more than a year -- even when 8 or more of the search engines advertise that they have (at least in theory) spidered the page within the last 60 days.

What we have here is a problem in search of a standards based solution (an official RFC) designed to bring order out of the chaos.

My own thought (which I acknowledge comes from someone who has been doing data-driven sites for less than a year) is that there ought to be a standard way of telling an external spider to use a "site local" index file, similar to how the robots.txt file excludes some or all of a site from spidering (assuming the spider's coders obey the standards -- not all do).

It then becomes the responsibility of the data-driven site's coders to add the code which updates the robot's index file "automagically" based on the content changes to the site.

It also seems to me like browsers could access this file to see if a bookmark is still active, and with the proper format, maybe even update the local bookmark file. Something like this:


  1. http://mygreatsite.com/old.html := http://mygreat.com/new.html
I'm interested in what more experienced coders have to say about this idea, BTW.
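
To make the idea a little more concrete, here is roughly what the "automagic" update might look like. No such standard exists today, so the site-index.txt file name, its "old := new" line format, and the pages/redirects tables are all invented for the sake of the sketch:

  <?php
  // Hypothetical: regenerate a site-local index file whenever content changes.
  // The file name, the "old := new" format and the table names are made up;
  // no spider currently honours such a file.
  $db = mysql_connect('localhost', 'user', 'password');
  mysql_select_db('mysite', $db);

  $fp = fopen('/home/web/site-index.txt', 'w');

  // List every live page so a spider could re-check them without crawling.
  $live = mysql_query('SELECT url FROM pages WHERE expired = 0');
  while ($row = mysql_fetch_assoc($live)) {
      fwrite($fp, $row['url'] . "\n");
  }

  // Map expired pages to their replacements, in the "old := new" style above.
  $moved = mysql_query('SELECT old_url, new_url FROM redirects');
  while ($row = mysql_fetch_assoc($moved)) {
      fwrite($fp, $row['old_url'] . ' := ' . $row['new_url'] . "\n");
  }

  fclose($fp);
  ?>

A browser could fetch the same file to check whether a bookmark has moved, which is the other half of the idea.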

Re:Domain Names are the Kludge to the Problem (1)

Logolept (13043) | more than 14 years ago | (#1466070)

The problem is much greater. We have too much information, but not enough data.

The information exists in some form or another (php, asp, xml, html), but making it findable is extremely difficult as this Ask Slashdot defines.

I'd think a better solution would be to organize information from the get-go. Unfortunately, for something like this to work, there would need to be universal standards. XML might be able to bring this about -- but compliance with a single DTD would be necessary.

Take press releases, for example. Currently, it would be fairly easy to aggregate those, as they are relatively standard to begin with. Add in product descriptions, datasheets, etc., and it becomes much muddier.

Wouldn't it be great if information was categorized by those who know best (the author) and then aggregated later?

One use for the whole e-Speak shebang? (1)

rise (101383) | more than 14 years ago | (#1466071)

A web site is basically a network service. It seems like there should be a place for a distributed protocol that actually allows an intelligent* search. If you defined a doc/HOWTO type you could search for sites providing those services with criteria that select the particular issue you're looking for. Try that with a search engine and irrelevant juxtapositions will fill your results with noise.


*Intelligent in the sense that the search method used shares a vocabulary with the providers.

About dynamically generated pages... (1)

diediebinks (62408) | more than 14 years ago | (#1466072)

Almost all search engines will reject dynamically generated pages if they have extended characters in the URL (except for Lycos and Inktomi). This is primarily because they are worried about getting into what they call "robot traps", where there may be no end to the number of links that a script or program generates. If the URL contains a "?", "%" or other similar characters, they will probably not index your site. A workaround is to build "pointer pages" using regular static HTML with links to the target page. If you attempt to use the refresh tag within the pointer pages, be aware that Infoseek will try to index your targeted page, not the page that you submit. There are ways around this problem...


(From The Unfair Advantage Book on Winning The Search Engine Wars)
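
For illustration, pregenerating such pointer pages from a database only takes a small script. A rough sketch, where the articles table, its columns, the output directory and the article.php target are all assumptions:

  <?php
  // Rough sketch: write one static pointer page per article so spiders see a
  // plain .html URL with no "?" in it. Names and paths here are assumptions.
  $db = mysql_connect('localhost', 'user', 'password');
  mysql_select_db('mysite', $db);

  $result = mysql_query('SELECT id, title, summary FROM articles');
  while ($row = mysql_fetch_assoc($result)) {
      $html = sprintf("<html><head><title>%s</title></head><body>\n" .
                      "<p>%s</p>\n" .
                      "<a href=\"/article.php?id=%d\">Read the full article</a>\n" .
                      "</body></html>\n",
                      htmlspecialchars($row['title']),
                      htmlspecialchars($row['summary']),
                      $row['id']);

      // e.g. /home/web/pointers/article-17.html
      $fp = fopen('/home/web/pointers/article-' . $row['id'] . '.html', 'w');
      fwrite($fp, $html);
      fclose($fp);
  }
  ?>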

Google (1)

Kvort (73138) | more than 14 years ago | (#1466073)

Google rocks. I can do a search and find all the articles I have ever posted on Slashdot (archived, of course). The slowness with which content reaches the search engines is a difficulty, but compared to historical ways of gaining information, what we have is incredible.

We should have some sort of a standard way of indexing these pages, and if they make it compatible with all the new technologies coming out, I will be very impressed. The best search engines will use the standard indexing in addition to current technologies, I would suppose, but it would still make life much easier to have this.

It would also help if there were a central place to notify that you have posted or changed content - something like the way domain names are registered in central places. It's in the user's best interest to notify the central location that content has been added or changed, and then the central point propagates that information to anyone who wants it, for a small fee, of course. :)

Why do I post these things on public forums, anyway?

>>>>>>>>> Kvort, Lord High Peanut of Krondor

Distributed open database standard for the web (0)

Anonymous Coward | more than 14 years ago | (#1466074)

Basically we need a distributed open database standard for the web. Searching a database is much faster than doing a blind text search, and should definitely take up less bandwidth and resources. If each ISP and independent node on the internet hooked up their databases and (imported) HTML pages, we'd be able to search anything, anywhere.

Of course, implementing it will be tough. The current approach of web searching is based on laziness: actively participating in the creation of a web index is not necessary. The only reason for ISPs to participate is that they're afraid spiders eat up too much bandwidth.

In the meantime, we'll just have to live with what we have. As Larry Wall is fond of saying, "Laziness is a virtue". I hope that enough of us are lazy enough to use plain ol' text, HTML, SGML and XML.




Problems with search engines (1)

BoneFlower (107640) | more than 14 years ago | (#1466075)

Search engines have serious problems. One is that boolean strings and other forms of highly specific searching never seem to work: I search for anything and get maybe 20 out of 3,000,000 sites that have what I want, and many of those are on the fifth or sixth page. Search engines need a better way of ranking sites - it's really annoying when the "100% relevant" site has nothing remotely related to your search and the 25% site is exactly what you are looking for. They also have to do more to prevent people from spamming them. Content-based searching, rather than keyword searching, should be implemented; it can help, but keyword searching, if improved, is still good when you're after specific information. Search engines could also focus on specific areas - a SlashSearch.com, say, would be a tech search engine - and the search-everything engines could add a category option to their advanced searches. Database-driven sites should use meta tags describing the type of content they carry. While no solution can be perfect in a rapidly changing environment like the Web, these ideas can be implemented and would help.

Spidering dynamic content (1)

dvt (93883) | more than 14 years ago | (#1466076)

Spidering dynamic content is not itself a problem. Because HTTP does not provide a "list all files in folder" method, you have to use the same basic approach regardless of the source of the content: start at a root page, extract the HREFs, index those pages, get their HREFs, and so on.

If an HREF contains a query string, sending that query string will return the content in the same way that sending an ordinary www.sample.com/page.html link will return the content.

Another message mentioned the problem of loops. A table of visited URLs does not always work, because relative links can keep getting appended endlessly on sites that return index.html for broken links. Two alternatives are:

(1) limit the spidering depth so that you only go, say, 4 links deep into the site, or

(2) make a hash value on content returned, and use the hash value to see if you are getting the same content with a different URL. Stop spidering any time the hash value is the same as a previous hash value.
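
A toy version of both ideas, as a sketch only -- it follows absolute http:// links only, skips robots.txt, politeness delays and error handling, and the start URL is just an example:

  <?php
  // Depth-limited spider that hashes page content to catch "same page,
  // different URL" loops. Purely illustrative.
  $seen_hashes = array();

  function crawl($url, $depth)
  {
      global $seen_hashes;
      if ($depth > 4) {                       // (1) limit spidering depth
          return;
      }

      $body = @file_get_contents($url);
      if ($body === false) {
          return;
      }

      $hash = md5($body);                     // (2) hash the returned content
      if (isset($seen_hashes[$hash])) {
          return;                             // same content, new URL: stop
      }
      $seen_hashes[$hash] = $url;

      echo "indexed $url\n";                  // a real spider would feed its indexer here

      // Extract the HREFs and follow the absolute ones one level deeper.
      preg_match_all('/href="(http:\/\/[^"]+)"/i', $body, $matches);
      foreach ($matches[1] as $link) {
          crawl($link, $depth + 1);
      }
  }

  crawl('http://www.sample.com/', 0);
  ?>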

bots (0)

Anonymous Coward | more than 14 years ago | (#1466077)

We need personal bots that search the internet for us all the time, so I can tell my bot to find web sites containing Heller references and store the URLs on my 20 gig (or whatever) HD.

I have a lot of interests, but I don't think a URL entry for every web site that holds my interest would fill up 20 gigs of data.

Also it should have the option to only search entry points into domains, such as http://slashdot.org but not http://slashdot.org/whatever/more/test.html

On a similar note, I think the web has gotten to a point where sites need a tag that declares an overall content rating - porn sites would carry one kind of tag, a personal site another, and so on - so I can click the "no porn" option on my search engine and not get 500 returns, 450 involving animals...

Now, I use porn as an example, but I don't think it should be removed from the net; I just think I should only be subjected to it if I want to be.

We have the technology to make it better... (1)

underbider (63054) | more than 14 years ago | (#1466078)

So, yes, in terms of technology, it is easier to classify web pages into categories and then index them within each category. Check out http://www.cora.jprc.com/ - it is a search engine for Computer Science research papers, in a format just like Yahoo, but everything is done automatically!

So the technology is here. It is just a matter of time before this kind of thing is necessary.

DB-Driven Pages with Full, Normal URLs (1)

dschuetz (10924) | more than 14 years ago | (#1466079)

This is actually pretty easy to do. We have three sites (all internal) that run with a MySQL DB storing all web content and use PHP/Apache as the browser/DB interface. By using an Apache "AliasMatch" directive, we rewrite everything to point to a primary PHP script:

AliasMatch ^/(.*) /home/web/index.html/$1

This front-end does a lot of (admittedly crude) parsing of the rest of the URI line to determine which "document" in the DB to look up and which subordinate page, or whether it's supposed to be generating an image instead, or whatever. The main script also looks up styles for each document, builds navigation bars, etc.

Works pretty well. Not nearly as flexible as if I'd actually thought it through before writing it, but it fits our needs admirably.
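
In case it helps to picture it, a stripped-down front end along those lines might look something like the sketch below; the "documents" table and its columns are invented here, and the real script does far more (styles, nav bars, images):

  <?php
  // Everything after the script name arrives in PATH_INFO thanks to the
  // AliasMatch above; the first two path segments pick the document and page.
  $path  = isset($_SERVER['PATH_INFO']) ? $_SERVER['PATH_INFO'] : '/';
  $parts = explode('/', trim($path, '/'));

  $doc  = ($parts[0] != '') ? $parts[0] : 'home';
  $page = isset($parts[1]) ? $parts[1] : 'index';

  $db = mysql_connect('localhost', 'user', 'password');
  mysql_select_db('intranet', $db);

  $result = mysql_query(sprintf(
      "SELECT body FROM documents WHERE doc = '%s' AND page = '%s'",
      mysql_escape_string($doc),
      mysql_escape_string($page)));

  if ($row = mysql_fetch_assoc($result)) {
      echo $row['body'];                       // wrap in styles, nav bars, etc.
  } else {
      header('HTTP/1.0 404 Not Found');
      echo 'No such document.';
  }
  ?>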

Why'd we go with such a complicated approach? Because we have bunches of InfoSec engineers, who really don't want to worry about HTML, writing web pages and reports. We've got a GUI front end with a nice WYSIWYG HTML editor that hits the documents in the DB directly, and all changes happen on the "live" HTTP server immediately. It's completely scannable because we use a web-get sort of tool to create a static "snapshot" of the final report before we send it to customers.

At any rate, I think it's cool... :-)


david.

why the web is broken. (1)

whocares (93522) | more than 14 years ago | (#1466080)

A friend of mine asked me once to explain my opinion of why the web is broken. After some thought, I came to some conclusions that are relevant here. I'll see if I can restate them effectively. All IMHO, of course.

A couple of assumptions:

1) The web is a non-hierarchical, non-linear system. The entire nature of it is actually closely related to how most people think, through a series of links. Ever found yourself explaining to someone how you got from one seemingly unrelated topic to another? The web is the same thing.

2) Mapping linear, hierarchical systems is what humans are good at. Indices, tables, flowcharts, etc. are all designed to present a certain kind of data in a randomly accessible way. When information is non-linear, we try to force it into this kind of structure, for better or for worse. This is what search engines currently try to do - provide a keyword index to every document on the web.

We cannot treat the web like something it is not. It is not a book or a collection of books. It is not even linear. It's a lot closer to the repository of information that is the human mind than most things that humans create.

This presents an information-finding nightmare. Much as it's sometimes difficult to find the piece of information you know you have stored in your head, it's becoming increasingly more difficult, even with the power of algorithmic parsing and pruning, to extract single pieces of information from the system. Search engines are, as the original post stated, becoming obsolete.

So what is the solution? In my opinion, the most intuitive 'index' type interface to the web has always been Yahoo, which for any given topic will provide a number of starting points. Not every document is indexed, not everything is represented - but if you drill down through links, you are more than likely to find what you're looking for. It takes the natural process of searching the web, which if it were a few hundred nodes could easily be done by hand, and gives it a logical starting point, much as someone can remind you of something you were searching to remember, and suddenly it all becomes clear. Indexing the entire web is as useless as trying to do an entire braindump of your mind. Indexing a set of starting points for using the web the way it was intended - as a series of links - is the only way that will probably ultimately work.

Alternatives to [?|&|phtml|etc.] database calls? (1)

SwellJoe (100612) | more than 14 years ago | (#1466081)

This adversely affects web caching technologies as well. Any dynamic content is uncacheable and unsearchable, because there is no way to know whether the content is specific to a query or simply reflects the page designer's love of the concept of "dynamic content".

It's clear that many aspects of a web page that could be pregenerated every time the information is updated are not being done that way. Slashdot is a prime example: presumably, thousands of people visit Slashdot anonymously every day, and even though they see the same content, the page is regenerated with unsearchable/uncacheable content. Shouldn't it be a simple matter to have a script choose between a dynamic page for logged-in users and a constantly up-to-date pregenerated page for anonymous users? That would save CPU cycles on the servers, allow indexing by search engines, and speed up accesses for users behind a caching proxy. Sounds like nothing but good things to me.
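
A minimal version of that script could be as simple as the sketch below; the cookie name, the snapshot path, and generate_front_page() are all stand-ins for whatever the site actually uses:

  <?php
  // Serve a pregenerated snapshot to anonymous visitors, a fully dynamic page
  // to logged-in users. The snapshot is refreshed at most once a minute.
  $snapshot = '/home/web/static/index.html';

  function generate_front_page()
  {
      // ...build the real page from the database here; this is a placeholder.
      return "<html><body>Front page built at " . date('Y-m-d H:i:s') . "</body></html>\n";
  }

  if (isset($_COOKIE['user_session'])) {
      echo generate_front_page();              // logged in: personalised page
  } else {
      if (!file_exists($snapshot) || time() - filemtime($snapshot) > 60) {
          $fp = fopen($snapshot, 'w');
          fwrite($fp, generate_front_page());
          fclose($fp);
      }
      readfile($snapshot);                     // anonymous: cacheable, indexable
  }
  ?>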

Obviously this won't solve all of the problems, but many websites front pages are the same for every user. Wouldn't it make sense to pregenerate it as static content? This could be taken much further by news sites that provide the same story content to every user, but use a database frontend for simplicity anyway. This doesn't preclude use of a backend database for information storage and organization, but it does impose quite a lot of complexity in the implementation of a system to index all of the pages as they become available and make them into static, numbered pages.

I tend to fall into the category of folks who believe that site designers should be a little more aware of the outside world and make their content accessible via every possible means. I don't think it makes sense to prevent search engines from finding one's content. If you've put it up, you want people to find it. Why turn down that extra banner display simply because someone doesn't check your headlines and instead searches Google or AltaVista?

I'm sure there are other issues involved and I'm glad this was brought up...I've been trying to figure out solutions to these problems myself while implementing a company web page with our web designer. Being a cache server company, we've got to make sure our own pages are completely cacheable whenever humanly possible. Not to mention that when someone does a search on any engine we want our URL to come up if we've got something to say on the subject of the search. It just makes sense to be as openly accessible as possible.

So, is this a problem that should be addressed mainly by the search engines, or should web designers be thinking ahead to such concerns when they are building a site with dynamic content?

Joe, Swell Technology

Re:Customers :) (1)

justin_saunders (99661) | more than 14 years ago | (#1466082)

Sorry, posted this initially as a new topic :p

Dynamic pages don't exist until you click on them in your browser either. Search engines *will* follow links to dynamically generated pages.

The point is there has to be a link there in the first place. Crawlers will not be able to index a dynamic page if it is only accessible through a "form" post.

The way you can get around this is to have a hidden (to users) page on your site with hardcoded (or database generated) links into the dynamic content that you'd like visible from search engines.

For example, if you have a whole heap of news articles on your site, with one per page, you can make a dynamic page called "newslinks" which, when requested by a crawler, queries the database and writes links to every news article in the site.

cheers, j.

Re:XML (3)

Simon Brooke (45012) | more than 14 years ago | (#1466083)

"This is a field where Linux could be leading the pack, but is instead an example where I think we are lagging behind. (I hope someone can point me to a group that is bringing XML deep into the linux os)"

Not so, fortunately. A certain very large telco (which I'm not yet allowed to name) is now running its Intranet directory on an XML/XSL application which I've written. The application was developed on Linux and is currently running on Linux, although the customer intends to move it to Solaris.

My XML intro course is online [weft.co.uk]; it's a little out of date at the moment but will be updated over the next few months.

XML and particularly RDF [w3.org] do have a lot to offer for search engines - see my other note further up this thread.

Catalogues are a Good Thing(tm) (1)

saska (13691) | more than 14 years ago | (#1466084)

Long live Yahoo (and the like).

Markus
--

Specialized Engines - Not More Engines! (2)

xtal (49134) | more than 14 years ago | (#1466085)

The answer to all this isn't going to come from making existing engines better, nor is it going to come from bigger, badder, faster database engines powered by your friendly clustering technologies!

The answer is simple: More specialized search engines. You're looking for technical stuff? Then you should be able to search a technical database. Like, if I'm looking for source code to model fluid flows - that's pretty specific already. There's no reason that I should have to wade through all the references to "bodily fluids" that I'll get on altavista for instance!

Search engine people, take note of this. Classify your URLs into categories - like Yahoo - but come up with some way to do it automatically. Or even better yet, let the users do it, a la NewHoo [newhoo.com] .

End of internet predicted. Film at 11. We've heard it before, and we'll hear it again. Just need someone with a little VC money to throw it towards an idea that supports more specialization in search engine tech.

Kudos..

From a web page owner (2)

gargle (97883) | more than 14 years ago | (#1466086)

I use a free site-statistics service to keep track of hits to my web site, where I keep some software that I've written. Looking at the referrer statistics, the vast majority of hits come from explicit, categorized links to my site (e.g., bookmark pages and, surprisingly, Lycos, which has a categorized database), and rarely ever from general search engines like AltaVista. The questioner may be right - from the perspective of a web site owner, general search engines aren't very effective at bringing visitors to my site.

You can.... (0)

Anonymous Coward | more than 14 years ago | (#1466087)

http://www.navigateone.com - OK, so it's only financial information, but it does update itself, work out queries on its own, etc., etc. So it's not impossible. P.S. It's nothing to do with me.

People just don't know how to search. (1)

segmond (34052) | more than 14 years ago | (#1466088)

People just don't know how to search. Since I have been using the internet, I have stopped making daily trips to the library. Searching is an art: the web is pretty searchable, but it takes quite some effort - knowing the right search engines to use for what, and knowing the right keywords and combinations.

XML (2)

debrain (29228) | more than 14 years ago | (#1466089)

IIRC, XML was designed to help alleviate this sort of thing. Unfortunately, XML has not been exploited enough to have any significant ramifications for the way the internet is sorted.