
Googlebot and Document.Write

kdawson posted more than 7 years ago | from the ajax-the-foaming-indexer dept.


With JavaScript/AJAX being used to place dynamic content in pages, I was wondering how Google indexed web page content that was placed in a page using the JavaScript "document.write" method. I created a page with six unique words in it. Two were in the plain HTML; two were in a script within the page document; and two were in a script that was externally sourced from a different server. The page appeared in the Google index late last night and I just wrote up the results.


Nonsense words? (5, Funny)

Whiney Mac Fanboy (963289) | more than 7 years ago | (#18312800)

An alert came in in the late evening of March 10th for "zonkdogfology", one of the words in the first pair

zonkdogfology is a real word:

zonk-dog-fol-o-gy zohnk-dog--ful-uh-jee
noun, plural -gies.

1. the name given to articles from zonk where the summary makes no sense whatsoever.
Serious question now - is the author of the article worried that the ensuing slashdot discussion will mention all his other nonsense words? I've no doubt slashdotters will find & mention the other words here, polluting google's index....

Re:Nonsense words? (1, Insightful)

Anonymous Coward | more than 7 years ago | (#18312897)

zonkdogfology ibbytopknot pignoklot zimpogrit fimptopo biggytink

Seriously, he shouldn't have posted these words until he was done with the test.

Re:Nonsense words? (4, Funny)

Anonymous Coward | more than 7 years ago | (#18312902)

zonkdogfology is a real word:

It's a perfectly cromulent word, and its use embiggens all of us.

Re:Nonsense words? (-1, Offtopic)

qardinal (1074575) | more than 7 years ago | (#18313753)

i'm leaving this site to go look at some huge boobs [skrappy.org]

The Results: (5, Informative)

XanC (644172) | more than 7 years ago | (#18312812)

Save a click: No, Google does not "see" text inserted by Javascript.

Re:The Results: (4, Informative)

temojen (678985) | more than 7 years ago | (#18312904)

And rightly so. You should be hiding & un-hiding or inserting elements using the DOM, never using document.write (which F's up your DOM tree).
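As a rough sketch of the hide/un-hide approach the parent describes (the id, class name, and stylesheet rule here are assumptions for illustration, not anything from the comment):

    // Assumes the stylesheet defines .hidden { display: none; }
    // and the markup contains something like <div id="details" class="hidden">...</div>
    function toggleDetails() {
        var el = document.getElementById('details');
        el.className = (el.className === 'hidden') ? '' : 'hidden';
    }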

How does document.write mess up your DOM tree? (1)

catbutt (469582) | more than 7 years ago | (#18312937)

I don't believe you.

Re:How does document.write mess up your DOM tree? (4, Informative)

XanC (644172) | more than 7 years ago | (#18313033)

If you're using document.write, you're writing directly into the document stream, which only works in text/html, not an XHTML MIME type, because there's no way to guarantee the document will continue to be valid.

In this day and age, document.write should never be used, in favor of the more verbose but more future-proof document.createElement and document.createTextNode notation.
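A quick side-by-side sketch of the two styles, reusing the article's nonsense words purely as sample text:

    // document.write splices markup into the parse stream while the page is loading:
    document.write('<p>zonkdogfology ibbytopknot</p>');

    // The DOM equivalent builds nodes explicitly, keeps the tree valid,
    // and still works after the page has finished loading:
    var p = document.createElement('p');
    p.appendChild(document.createTextNode('pignoklot zimpogrit'));
    document.body.appendChild(p);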

Re:How does document.write mess up your DOM tree? (4, Insightful)

jesser (77961) | more than 7 years ago | (#18313193)

Perhaps more importantly, document.write can't be used to modify a page that has already loaded, limiting its usefulness for AJAX-style features.

True (1)

XanC (644172) | more than 7 years ago | (#18312941)

I doubt Google will notice DOM-created elements, either. But the author should re-test with that. And I would suggest that he post the result only if it turns out Google can see that, because we all assume it can't.

google.com/?q=slashdotting+in+google+dollars (5, Insightful)

kale77in (703316) | more than 7 years ago | (#18313159)

I think the actual experiment here is:

  • Create a 6-odd-paragraph page saying what everybody already knows.
  • Slashdot it, by suggesting something newsworthy is there.
  • Pack the page with Google ads.
  • Profit.

I look forward to the follow-up piece which details the financial results.

Re:google.com/?q=slashdotting+in+google+dollars (4, Insightful)

Scarblac (122480) | more than 7 years ago | (#18313467)

Exactly, this is the typical sort of fluff that Digg seems to love. As far as I know, Slashdot had avoided this particular type of adword blog post crap until now.

Re:google.com/?q=slashdotting+in+google+dollars (1)

tijmentiming (813664) | more than 7 years ago | (#18313701)

Mod parent up, there are indeed too many Digg-like posts here.

Re:google.com/?q=slashdotting+in+google+dollars (1)

caluml (551744) | more than 7 years ago | (#18313733)

But with the Firehose, Slashdot will now start using the "wisdom" of crowds to produce the same pap that Digg does.
Shall we all migrate to Technocrat? It has decent stories.

Re:google.com/?q=slashdotting+in+google+dollars (0)

Anonymous Coward | more than 7 years ago | (#18314373)

You must be new to Slashdot then.

Google Pigeon technology (3, Funny)

sdugoten2 (449392) | more than 7 years ago | (#18312832)

The Google Pigeon [google.com] is smart enough to read through Document.write. Duh!

Pitiful (1)

Mathinker (909784) | more than 7 years ago | (#18313861)

The moderator meant to mod this +1 Funny (I would!) but forgot to actually try to understand the post.

Or perhaps this is a mod of a new experimental viral moderation system, but the viruses haven't evolved enough yet?

If they weren't, then they're trying (4, Interesting)

AnonymousCactus (810364) | more than 7 years ago | (#18312866)

Google needs to consider script if they want high-quality results. Besides the obvious fact that they'll miss content supplied by dynamic page elements, they could also sacrifice page quality. Page-rank and the like will get them very far, but an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser. It's interesting to know the extent to which they correct for this.

Of course, there are much more subtle ways of changing content once it's been put out there. One might imagine a script that waits 10 seconds and then removes all relevant content and displays Viagra instead. Who knew web search would be restricted by the halting problem? I wonder how far Google goes...
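For what it's worth, the delayed bait-and-switch described above is only a few lines; a bare-bones sketch (the spam payload is invented, and the 10-second delay is just the comment's own example):

    // Serve innocuous, indexable content, then rewrite the page after the fact.
    window.onload = function () {
        setTimeout(function () {
            document.body.innerHTML = '<h1>Buy Viagra</h1>';   // invented payload
        }, 10000);   // 10 seconds, per the example above
    };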

Re:If they weren't, then they're trying (1)

TubeSteak (669689) | more than 7 years ago | (#18313149)

Page-rank and the like will get them very far, but an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser.

Of course, there are much more subtle ways of changing content once it's been put out there. One might imagine a script that waits 10 seconds and then removes all relevant content and displays Viagra instead.
Google tends to nuke those sites from orbit once it discovers they're gaming the system.

And by "nuke from orbit" i mean "delists them with no warning"

Re:If they weren't, then they're trying (1)

wumpus188 (657540) | more than 7 years ago | (#18313151)

Yes, but supporting javascript won't fix the problem.

...that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser.
They'll just go the other way around - show the static Viagra content in the browser and rewrite it for the Google bot.

Re:If they weren't, then they're trying (1)

Arancaytar (966377) | more than 7 years ago | (#18314309)

The point being?

If both Googlebot and users see the rewritten content, the original advertisement is never displayed.

Except for the few users that have Javascript disabled, but that's not the stupid-user target group that clicks on spam anyway...

Woot! (0)

Anonymous Coward | more than 7 years ago | (#18313307)

...an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser.

That's a fantastic idea!

Re:If they weren't, then they're trying (4, Insightful)

gregmac (629064) | more than 7 years ago | (#18313339)

You also have to remember, though, that content generated dynamically is often going to be of no use to a search engine; it will often be user-specific - there's obviously some reason it's being generated that way.

And if pages are designed using AJAX and dynamic rendering just for the sake of using AJAX and dynamic rendering.. well, they deserve what they get :)

Re:If they weren't, then they're trying (1)

hauntingthunder (985246) | more than 7 years ago | (#18313765)

yeh right

We have to do what Google wants - someone joked that the ideal site for Google is a university site circa 1997.

Don't forget the Google bot is a dumb user agent with no JavaScript, so JavaScript navigation, AJAX, and Flash are the kiss of death for search engines.

Re:If they weren't, then they're trying (0)

Anonymous Coward | more than 7 years ago | (#18314099)

You mean you have to create standards-compliant sites that don't rely heavily on whizz-bang scripts and totally 100% non-standard Flash? Boo hoo, my heart bleeds for you, you poor web "developer" you, etc. etc. ad nauseam.

Re:If they weren't, then they're trying (1)

jrumney (197329) | more than 7 years ago | (#18314197)

Google should index the static content, but run/analyse the Javascript and throw out any pages where the user-visible content changes drastically. To be 100% effective though, they'd have to fake the IE or Firefox User-Agent, and use IP addresses from an ISP's dynamically assigned range for their crawling, which some people might see as evil.

Re:If they weren't, then they're trying (1)

Arancaytar (966377) | more than 7 years ago | (#18314385)

I'd rather suggest they don't look at script content at all.

Part of it is practicality, as already implied: with delays, self-writing code, and horrible "quirks" that are not browser-independent, it's nearly impossible to predict what a script is going to do in the user's browser. On top of that, it would gobble insane resources on the spidering server, made worse by scripts that cause crashes.

Another part is philosophy and good practice - AJAX is for interactive applications, static HTML/XHTML for content. Applications shouldn't be indexed anyway, since the pages are user-specific and extremely dynamic. If you search the web, you're really looking for documents with content - and there's no reason why those shouldn't be entirely static.

Catering to the trend that anything, even simple text content, is only made accessible through barrier-heavy, browser-dependent AJAX applications is a step in the wrong direction. Google might as well execute flash movies, begin using OCR to read text in pictures or voice recognition to index mp3 files by song lyrics.

How did this make the front page? (2, Insightful)

Anonymous Coward | more than 7 years ago | (#18312876)

It should be pretty obvious that no search engine should interpret javascript, let alone remotely sourced javascript. I was actually hoping this guy would show me wrong and demonstrate otherwise, but to my disappointment this was just another mostly pointless blog post.

Re:How did this make the front page? (1)

EvanED (569694) | more than 7 years ago | (#18312907)

It should be pretty obvious that no search engine should interpret javascript...

Why's that?

Properly constructed there should be no security issue, and it would give more accurate results.

Re:How did this make the front page? (1)

NoTheory (580275) | more than 7 years ago | (#18312925)

Why should it? I mean, isn't the point of JavaScript content that responds dynamically to the intentions of an agent? The Googlebot, although an extremely complicated AI agent, isn't intentional. It doesn't know what it's doing on a site, so I figure it probably shouldn't just be allowed out to wreak havoc. Also, wouldn't that give one an opportunity to fork-bomb the Googlebot as well?

Re:How did this make the front page? (1)

EvanED (569694) | more than 7 years ago | (#18312965)

Why should it?

Because JavaScript can create content. Since 99% of people run with it enabled, they will see this content, so it makes sense to index it.

I mean, isn't the point of javascript content that responds dynamically to the intentions of an agent?

I probably wouldn't have the indexer run most event handlers, but it seems that scripts run from onload or other places that execute when a page is loaded should be indexed.

Also, wouldn't that allow one an opportunity to fork-bomb the googlebot then as well?

JavaScript doesn't have fork AFAIK. Besides, the broader question of resource consumption is trivially solvable by setting limits on the process doing the work.

Re:How did this make the front page? (1)

zobier (585066) | more than 7 years ago | (#18313079)

Also, wouldn't that allow one an opportunity to fork-bomb the googlebot then as well?
JavaScript doesn't have fork AFAIK.
The setTimeout function can do a similar thing.

Re:How did this make the front page? (2, Informative)

Bitsy Boffin (110334) | more than 7 years ago | (#18313745)

From memory, setTimeout forms a time-delayed but synchronous entry into the execution stream; you will not get two threads in the same JavaScript code pile running simultaneously, and the timeout will not fire until the execution stream is idle.
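A small sketch that demonstrates the point - a pending timeout cannot interrupt code that is still running (console.log assumes Firebug or a modern browser console):

    setTimeout(function () {
        console.log('timer fired');   // scheduled for ~0 ms, but held until the script below finishes
    }, 0);

    var start = new Date().getTime();
    while (new Date().getTime() - start < 2000) {
        // busy-wait for two seconds; the timer above cannot fire during this loop
    }
    console.log('loop done');   // always logs before 'timer fired'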

Re:How did this make the front page? (1)

Hal_Porter (817932) | more than 7 years ago | (#18313503)

Actually, the latest nightly builds of the Last Measure can burn even the electronic eyeballs of the google bot, without using Javascript.

It seems to find it - indeed, you can find out what Last Measure is by Googling it - but I can see from the logs that it only checks once. Just like a human would: "Hmm, Last Measure, what's that? Aiiiieeee!"

Very interesting.

Re:How did this make the front page? (2, Informative)

VGPowerlord (621254) | more than 7 years ago | (#18313505)

Because JavaScript can create content. Since 99% of people run with it enabled, they will see this content, so it makes sense to index it.

Did you know that 99% of all statistics are made up?

I can source some JavaScript statistics: W3Schools reports [w3schools.com] that, as of January 2007, 94% of their audience has JavaScript turned on, a significantly lower figure than you are reporting. Not only that, but it is actually the highest percentage since they started recording them biannually in late 2002.

It's a moot point, though: as the W3Schools stats page states, "You cannot - as a web developer - rely only on statistics. Statistics can often be misleading." Meaning you should always code things so that they work with HTML/CSS, then use JavaScript to make them look/act nicer.

Re:How did this make the front page? (1)

xoyoyo (949672) | more than 7 years ago | (#18314203)

"You cannot - as a web developer - rely only on statistics. "

No, indeed, because doing things based on empirical evidence is foolish behaviour. On the other hand you should take a political position (that data, presentation and behaviour should be kept separate) and behave as though that was in some way more true than a statistical value.

I'm not saying that the separation of data, presentation and behaviour is wrong, just that you have to realise that it's a human-engineered best practise, not a law of the universe. So saying that you cannot rely on statistics is wrong. Of course you can. Saying you *shouldn't* rely on statistics is entirely correct.

Re:How did this make the front page? (1)

xoyoyo (949672) | more than 7 years ago | (#18314221)

(Slashdot swallowed my sarcasm tag there - the first paragraph should be read in a mildly mocking voice - the second one is the meat of the matter)

Re:How did this make the front page? (1)

Jake73 (306340) | more than 7 years ago | (#18313089)

Yeah, I was kinda shocked, really. I always wondered how people with bad blogs were able to break into the mainstream and gather regular readers. I guess they just try like hell to get picked up on Slashdot/Digg/etc with some worthless blog post.

Re:How did this make the front page? (1)

kv9 (697238) | more than 7 years ago | (#18314175)

Yeah, I was kinda shocked, really. I always wondered how people with bad blogs were able to break into the mainstream and gather regular readers. I guess they just try like hell to get picked up on Slashdot/Digg/etc with some worthless blog post.

well that too, but in general it's even easier. just aim low and hope for the best. it's not very hard to appeal to the mainstream. shit, it's the largest audience out there.

Re:How did this make the front page? (1)

osu-neko (2604) | more than 7 years ago | (#18313535)

It should be pretty obvious that no search engine should interpret javascript, let alone remotely sourced javascript.

Granted. It's just that some people like to actually have empirical evidence for something before they conclude it's true, rather than say "that's how it should work" and then pretend they know something that they really don't, based on the way they think the universe should be rather than on any actual evidence of the way it actually is.

Google request external JavaScript file? (4, Insightful)

JAB Creations (999510) | more than 7 years ago | (#18312878)

Check your access log to see if Google actually requested the external JavaScript file. If it didn't there would be no reason to assume Google is interested in non-(X)HTML based content.

Re:Google request external JavaScript file? (2, Informative)

The Amazing Fish Boy (863897) | more than 7 years ago | (#18314253)

I have actually seen some reports [google.com] of a "new" Googlebot requesting the CSS and Javascript. The rumour I heard was that it was using the Gecko rendering engine or something along those lines. This was some time ago. I'm not sure what ever became of this.

Doesn't work; Good (kind of) (5, Insightful)

The Amazing Fish Boy (863897) | more than 7 years ago | (#18312919)

FTFA:

Why was I interested? Well, with all the "Web 2.0" technologies that rely on JavaScript (in the form of AJAX) to populate a page with content, it's important to know how it's treated to determine whether the content is searchable.
Good. I am glad it doesn't work. Google's crawler should never support Javascript.

The model for websites is supposed to work something like this:
  • (X)HTML holds the content
  • CSS styles that content
  • Javascript enhances that content (e.g. provides auto-fill for a textbox)

In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.

So why would Google's crawler look at the Javascript? Javascript is supposed to enhance content, not add it.

Now, that's not saying many people don't (incorrectly) use Javascript to add content to their pages. But maybe when they find out search engines aren't indexing them, they'll change their practices.

The only problem I can see is with scam sites, where they might put content in the HTML, then remove/add to it with Javascript so the crawler sees something different than the end-user does. I think they already do this with CSS, either by hiding sections or by making the text the same color as the background. Does anyone know how Google deals with CSS that does this?

Re:Doesn't work; Good (kind of) (1)

milo317 (1016878) | more than 7 years ago | (#18312927)

Thought so, as G won't follow java links, as it's stated in their webmaster codex.

Re:Doesn't work; Good (kind of) (1)

catbutt (469582) | more than 7 years ago | (#18312959)

Who's talking about Java?

Re:Doesn't work; Good (kind of) (0)

Anonymous Coward | more than 7 years ago | (#18313019)

Well, we're talking more generally about dynamically created content, and as Java is used for such, I think the point is valid.

Re:Doesn't work; Good (kind of) (3, Informative)

VGPowerlord (621254) | more than 7 years ago | (#18313695)

In actuality, it says "Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." – Webmaster Guidelines [google.com] , Technical Guidelines section, bullet point 1.

Re:Doesn't work; Good (kind of) (2, Informative)

doormat (63648) | more than 7 years ago | (#18312991)

I thought I remembered reading a while ago about some search engine using intelligence to ignore hidden text (text with the same or a similar color as the background). Of course, the easy workaround for that is to use an image for your background, and that may fool the bot, but who knows, they could code to accommodate that too.

Regardless, I'm pretty sure you'd get banned from the search engines for using such tactics.

Re:Doesn't work; Good (kind of) (1)

zobier (585066) | more than 7 years ago | (#18313109)

I thought I remembered reading a while ago about some search engine using intelligence to ignore hidden text (text with the same or a similar color as the background). Of course, the easy workaround for that is to use an image for your background, and that may fool the bot, but who knows, they could code to accommodate that too.
You could use OCR to detect that (and to index images used for text content).

Re:Doesn't work; Good (kind of) (1)

Tablizer (95088) | more than 7 years ago | (#18313013)

You are basically saying that "dynamic content should go to hell". Dynamic content is the result of automation. Do you propose we stick with outmoded "flat" technologies like flat files? Databases and cross-server-content-grabbing be damned? I find this disturbing.

Re:Doesn't work; Good (kind of) (2, Insightful)

Rakishi (759894) | more than 7 years ago | (#18313035)

Huh? He's talking about browser-generated content; most dynamic content is server-side generated (like Slashdot, though I think Slashdot may use flat files as a cache for speed reasons). No one said that a nice XML file can't be generated by the server when the page is called.

Re:Doesn't work; Good (kind of) (1)

Stooshie (993666) | more than 7 years ago | (#18313933)

You are basically saying that "dynamic content should go to hell"
Huh? He's talking about browser-generated content; most dynamic content is server-side generated

He is talking about AJAX sites that use JavaScript to dynamically load content from the server based on user actions (such as Google suggest [google.com] )

Interesting that Google themselves use AJAX but don't index it.

Re:Doesn't work; Good (kind of) (1)

Rakishi (759894) | more than 7 years ago | (#18314093)

Yes, I'm well aware of that, but he's more specifically talking about certain uses of AJAX. The poster I replied to said "dynamic content," which is a whole lot more than either what the original poster meant or what AJAX encompasses. Databases and non-flat things are not AJAX and not in any way what the original poster meant. I was simply pointing out that the person I replied to can't read.

AJAX should degrade gracefully: if you don't have JavaScript, things should still work, which means that search spiders should have no trouble on AJAX websites.

Re:Doesn't work; Good (kind of) (1)

fbartho (840012) | more than 7 years ago | (#18313083)

So, what do you have to say about websites whose entire user interfaces are built from content that gets filled in by JavaScript asynchronously from a single HTML page? The only ones I have made or seen that are like that require logins to protect the data they provide via an active interface; live examples include Gmail and others, but I don't think that means providing the searchable data only via JavaScript is necessarily inappropriate.

I would make normal links, then use JS on top (3, Insightful)

The Amazing Fish Boy (863897) | more than 7 years ago | (#18313417)

So, what do you have to say about websites whose entire user interfaces are built from content that gets filled in by JavaScript asynchronously from a single HTML page?
If I understand you, you mean something like this: the site has two parts, a menu and content. When you click a menu item, rather than being taken to a new URL, it executes JavaScript which fetches only the new content from the web server, then replaces the content section. So the URL doesn't change.

It's a nice improvement. Less bandwidth used, and a quicker interface.

Unfortunately, it's not often done right. The way I would do it is to first make the menu work like it normally would. Make each menu item a link to a new page. Then you apply Javascript to the menu item. Something like this:

// menuLink is the DOM element for each menu link.
// (i.e. get it from document.getElementById(), etc.)
menuLink.onclick = function() { getNewContent(); return false; }
(FYI, this is how I do pop-up windows, too.)

Putting it behind a login screen doesn't solve all the problems. You're right that it won't be searchable anyway, but people with older browsers or screen readers won't be able to access it.

I think Gmail actually offers two versions: one for older browsers that uses no (or little?) JavaScript, and the other which almost everyone else (including me) uses and loves. But I'm not sure how easy it would be to maintain two versions of the same code like that. I also don't think it's nice for the end user to have to choose "I want the simple version," though it may encourage them to update to a newer browser, I guess.

(Of course this is all "ideally speaking", I realize there are deadlines to meet and I violate some of my own guidelines sometimes. I still think they're good practices, though.)
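Fleshing that pattern out a little - the class name, the content div, and the URLs are all assumptions for the sake of the sketch, not anything from the post:

    // Markup assumed: <a href="/news.html" class="menu-link">News</a> ... <div id="content">...</div>
    function getNewContent(url) {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url, true);
        xhr.onreadystatechange = function () {
            if (xhr.readyState === 4 && xhr.status === 200) {
                document.getElementById('content').innerHTML = xhr.responseText;
            }
        };
        xhr.send(null);
    }

    var links = document.getElementsByClassName('menu-link');
    for (var i = 0; i < links.length; i++) {
        links[i].onclick = function () {
            getNewContent(this.href);   // enhanced path: fetch and swap in place
            return false;               // cancel the normal navigation
        };
    }

    // Without JavaScript the links behave as ordinary links, so crawlers and
    // older browsers still reach every page.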

Re:I would make normal links, then use JS on top (1)

fbartho (840012) | more than 7 years ago | (#18313531)

You do pretty much understand me. One example site involves data with a certain set of fields stored in the database. Each set of fields is a package of information, and these packages number around 400. There are 20 primary users for the site, who are college students. They each update several of these packages, ideally on a daily basis, and leaders of subgroups track which packages have been updated every 3 days or so. The packages are used for the management of a very large student project: scheduling, directions, subtasks, contact points, costs, and progress. Thankfully for me, I can standardize on requiring Firefox, JavaScript, and cookies (for PHP sessions), because all of the computer labs on campus are equipped with those no matter what OS comes installed, and it's trivial for the students to get them on their personal computers (if they don't already have them). My primary role on the project, however, is not web developer; the site was just a solution to a problem that I had a certain amount of time to allot to. I definitely don't have the time to maintain a flat HTML version of the site, and it would be much less functional for the average user's workflow.

Now, given the user count, I know that site is small peanuts in the real world, but what am I supposed to do? For other projects: at what point am I required to duplicate my efforts so that systems without JavaScript can functionally access my site? What if I want to provide a view to the world that is only mediated through the JavaScript interface I have developed? Am I required to make my site accessible without JavaScript?

Re:I would make normal links, then use JS on top (1)

The Amazing Fish Boy (863897) | more than 7 years ago | (#18313769)

I agree with you that with a small enough user base (or one that is adequately controlled), you can cut some corners, especially if time is a constraint. Generally, I would say the cut-off is whenever your users could reasonably demand that their browser work. That is, if the site is going to be publicly accessible, I would not make JavaScript a requirement. I'm not sure what the actual "limit" I would put on the number of users would be; I think that would vary from project to project.

It's a matter of project requirements, really. I didn't mean to come off as though what I was suggesting were absolute rules. They are best practices, and especially on the web I think they should be followed.

But I don't think it's always a duplication of efforts. You're right, sometimes it is a duplication of efforts, like if you can only present the interface you want in Javascript (e.g. if you used drag and drop or something). But a lot of the time Javascript is only used to enhance a form, so it would only be adding functionality, not replicating it.

Also, have you considered what would happen if someone who is blind sues the school (is it for a school?) because the site is inaccessible?

Re:Doesn't work; Good (kind of) (1)

maxwell demon (590494) | more than 7 years ago | (#18313941)

And how do you bookmark a certain view of that page (which, to you as page user, is a separate page after all)?

Re:Doesn't work; Good (kind of) (2, Insightful)

cgenman (325138) | more than 7 years ago | (#18313091)

In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.

Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer. To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible format that a human being can't use it.

Web pages are dynamic these days. Saying that the only acceptable model is statically defined strict XHTML mixed with an additional layer of tableless CSS is foolish zealotry. With so much happening dynamically based upon end-user-created pages, along with the somewhat annoying usage of Flash, PowerPoint, or PDF for important information, you really can't create a comprehensive index without being a little flexible.

Saying that Google shouldn't take scripting into account when scanning pages is like saying they shouldn't index the PDFs that are online. Sure, it may not conform to what you believe are "good web coding standards," but the reality is that they're out there.

Re:Doesn't work; Good (kind of) (2, Insightful)

WNight (23683) | more than 7 years ago | (#18313253)

I don't know about you, but I write my web pages so that when the style goes away, the page still views in a basic 1996 kind of style. Put the content first and your index bars and ads last, then use CSS to position them first visually. This way, if a blind user or someone without style sheets sees the site, it at least reads in order.

Re:Doesn't work; Good (kind of) (1)

caluml (551744) | more than 7 years ago | (#18313741)

View, Page Style, No Style in Firefox will show you what your page looks like to browsers/spiders.

Re:Doesn't work; Good (kind of) (2, Insightful)

The Amazing Fish Boy (863897) | more than 7 years ago | (#18313311)

Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer.
How's this? Disable CSS on Slashdot. First you get the top menu, then some options to skip to the menu, the content, etc. Then you get the menu, then the content. It's very easy to use it that way.

To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible format that a human being can't use it.
Well, for one thing, we are talking about a search engine here, which isn't a human being. So, there's one client that can "use" the information better in XHTML format. Then there's the visually impaired (who use screen readers as their clients), and those using a non-graphical client. Additionally, I would imagine it would be easier to screen scrape XHTML to get just the part you want (since a lot of content would be assigned an ID and/or a class.)

Web pages are dynamic these days. Saying that the only acceptable model is statically defined strict XHTML mixed with an additional layer of tableless CSS is foolish zealotry.
Your first sentence is true, the second isn't. Web pages are dynamic, yes. I outlined how dynamic pages should be designed. That is, they should be made to work as static (X)HTML, then dynamically updated with Javascript. I don't see how your second sentence follows from the first at all. Web pages are dynamic... so we shouldn't follow standards? We shouldn't accommodate search engine crawlers, the blind, those using older browsers, or those who have Javascript support disabled?

Notice I keep putting the X in (X)HTML in brackets. That's because I'm not convinced strict XHTML is the only viable method (though I'm not convinced it's not -- I'm on the fence).

Re:Doesn't work; Good (kind of) (2, Insightful)

Animats (122034) | more than 7 years ago | (#18313541)

The model for websites is supposed to work something like this:

If only. Turn off JavaScript and try these sites:

Luckily blind people don't drive! (1, Insightful)

Anonymous Coward | more than 7 years ago | (#18314379)

  1. JavaScript redirects are a trait of the incompetent; I bet Ford paid some cowboy a whole lot of money for a site that doesn't work.
  2. On the Jeep site I can get to a few 'pages' that are actually just images with an image map and empty alt attributes for the HTML links. The HTML URLs are clean but not informative, and the others don't work (unsupported URL scheme in Lynx).
  3. The Credit Suisse site is reachable via a mislabeled link, "If you are a PALM, PSION, WINDOWS CE or NOKIA user click here". They even offer a sitemap for navigation. Tell-tale signs indicate this site was valid, accessible XHTML before some monkey was set loose on it.

Those selling professional web services should be liable under the ADA and similar laws; that's how we fix the web.

Accessibility? (2, Informative)

BladeMelbourne (518866) | more than 7 years ago | (#18312963)

The bottom line is that your web sites should degrade nicely enough when JavaScript is not enabled. They might not flow as nicely, and the user may have to submit more forms, but the core functionality should still work and the core content should still be available.

DDA / Section 508 / WCAG - the no-JavaScript clause makes for a lot of extra work, but it is one that can't be avoided on the (commercial) web applications I architect. (Friggin' sharks with laser beams for eyes filing lawsuits and all.)

Document.write() is not the way to go (1)

Max Romantschuk (132276) | more than 7 years ago | (#18313003)

Document.write() is executed as the page loads. Most AJAX-style implementations rely on either the innerHTML property or on creating nodes through the DOM. Testing those would tell us much more than testing document.write().
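For comparison, the innerHTML style the comment mentions runs at any point after load against an existing element (the id and sample text here are illustrative); the createElement/createTextNode style is shown in an earlier sketch:

    // Fill an existing element with markup supplied as a string,
    // e.g. the responseText of an XMLHttpRequest.
    document.getElementById('results').innerHTML = '<p>fimptopo biggytink</p>';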

Re:Document.write() is not the way to go (1)

palinurus (111359) | more than 7 years ago | (#18313069)

well -- document.write() gets executed when it's called, not just during page load. you can call document.write() as a side-effect of an AJAX request and it will work (though i think you're right -- DOM manipulation is the idiom for dynamic web programming these days).

but really i don't think google should index either. there's a difference between a document and an application. the guy mentions the annoying buzzword of the year, 'Web 2.0', which (i think) is really about how web browsers now give you applications in addition to documents. it's useful to have an index of documents; not so useful to have an index of every state reachable by an application (see e.g. MS Word help).

it would be interesting if at some point crawlers could distinguish between the two.

Re:Document.write() is not the way to go (0)

Anonymous Coward | more than 7 years ago | (#18314413)

They already do by not indexing AJAX-generated content...

From TFA: (1)

bennomatic (691188) | more than 7 years ago | (#18313015)

So, some friends and I have been bantering back and forth about how Google treats content that has been inserted into a page using Javascript. So I decided to do an experiment. This page has six nonsense words. Two are hardcoded into the page via straight HTML. Two are inserted via Javascript, but the script is part of the page HTML. The last two are inserted via Javascript, but the script is on a remote server. The purpose of the test is to see three things...

  • The time lapse between when the words appear in a Google alert and when they're searchable on the main Google site.
  • Which words return search results.
  • If the words from the remotely sourced script return search results, do they point to this page, the .js file on the remote server, or both?

Here are a couple of nonsense words that turn up no hits in Google. They are hardcoded into the HTML: zonkdogfology and ibbytopknot. I'll repeat them for emphasis... zonkdogfology and ibbytopknot.

Here are two words inserted into the page via a javascript hardcoded into the page... test words are pignoklot and zimpogrit - these have been inserted via javascript. Repetition: pignoklot and zimpogrit - these have been inserted via javascript.

And now a couple of nonsense words inserted with a remotely-sourced javascript... test words are fimptopo and biggytink - these have been inserted via javascript. Repetition: fimptopo and biggytink - these have been inserted via javascript.

And that constitutes the test. I should know within a few weeks how well it worked.
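Reconstructed from that description (not copied from the actual test page), the two script-based pieces would look roughly like this:

    // Inline in the page HTML - inserts the second pair of words:
    document.write('pignoklot and zimpogrit - these have been inserted via javascript');

    // Contents of the remotely sourced .js file - inserts the third pair,
    // pulled in with something like <script src="http://other-server.example/words.js">:
    document.write('fimptopo and biggytink - these have been inserted via javascript');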

This leads to a decidability problem (1)

NittanyTuring (936113) | more than 7 years ago | (#18313117)

It's not easy for Google to determine how to treat text inside of Document.write(). In some cases, that line of script will never be executed. In other cases, it may be executed multiple times. What is Google to do with something like this:

if (1 == 0) {
    document.write("pignoklot zimpogrit");
}

Obviously, "pignoklot zimpogrit" will never be emitted, but Google's crawler might not get that. In general, you run into decidability issues, like the halting problem. The best approach would be for Google's crawler to fully emulate the session of a user and execute all scripts like a browser would... but that may be difficult or impossible to automate on a large scale. It is true that Google can do a better job than it does now. It can at least search for common cases where write() definitely gets called.

Re:This leads to a decidability problem (1)

poopdeville (841677) | more than 7 years ago | (#18313549)

Decidability is a non-issue in this context. Your example falls flat, because JavaScript is an interpreted language. All Google would have to do is run an interpreter and data mine the results.

Re:This leads to a decidability problem (1)

maxwell demon (590494) | more than 7 years ago | (#18313833)

Except that
  • content may depend on user actions in a non-trivial way (e.g. if the page contains things like onclick or onmouseover, the dependence on the sequence of events occurring may be quite complex),
  • content may be requested by callbacks to the server (after all, that's what AJAX is all about), in which case I'm not sure it's a good idea for the search engine to execute it,
  • running a certain script might be expensive in time and/or memory,
  • by processing JavaScript, the search engine might open up itself to exploits.

Re:This leads to a decidability problem (1)

poopdeville (841677) | more than 7 years ago | (#18314291)

All true. However, I'd say that your first and second points are essentially a non-issue. Google already scans the "dark web" if it's accessible to the public, even if only accessible in non-trivial ways. Which is to say, they're already requesting massive numbers of documents and generating enormous data structures to mine. Dealing with JavaScript would be more of the same in this respect.

Your other points are better. Malicious JavaScript could easily tie the GoogleBot up, if Google's hypothetical JavaScript interpreter didn't have built-in runtime limits. Time limits are one option. Disallowing certain JavaScript constructs is another possibility.

(tagging beta) (1)

Jack Schitt (649756) | more than 7 years ago | (#18313371)

I predict that from now on, zonkdogfology will be a common tag for all articles that relate to google search...

Re:(tagging beta) (1)

dotgain (630123) | more than 7 years ago | (#18313675)

I predict in five years it'll be in the Oxford English Dictionary.

Re:(tagging beta) (1)

maxwell demon (590494) | more than 7 years ago | (#18313847)

I predict in five days it will be in Wikipedia.

Document.Write() not interpreted by Google (1)

generikz (413613) | more than 7 years ago | (#18313465)

... and also not interpreted by spam spiders sneaking around for email addresses!

I didn't want to replace my contact information with an extra form submission plus a visual challenge, but I still wanted to leave a direct email link prominently placed on the page for quick contact/feedback.

Since I modified the mailto: with some tricky, sliced-up JavaScript document.write(), I haven't had a single piece of spam come to the semi-hidden address, which still looks -- to regular human visitors -- like the classic "Contact us" email link.

I certainly hope this won't change in the future!

Rgds,
Julien

Re:Document.Write() not interpreted by Google (-1, Flamebait)

Anonymous Coward | more than 7 years ago | (#18313807)

Since I modified the mailto: with some tricky, sliced-up JavaScript document.write(), I haven't had a single piece of spam come to the semi-hidden address, which still looks -- to regular human visitors -- like the classic "Contact us" email link.
Let me get this straight. You don't get ANY spam at the address lddb.admin@gmail.com [mailto] . Man, that's wild! I figured that the address lddb.admin@gmail.com [mailto] would have tons of spam!

mod parent up. morning chuckle. (0)

Anonymous Coward | more than 7 years ago | (#18314205)

Cruel, dude.

funny as hell, but cruel!

Re:Document.Write() not interpreted by Google (1)

DrXym (126579) | more than 7 years ago | (#18313829)

I do this on my pages as well. I also mangle certain things like affiliate links, AdSense IDs, etc., for no particular reason except that I don't like the idea of a search engine inadvertently indexing them.

Re:Document.Write() not interpreted by Google (1)

n00kie (986574) | more than 7 years ago | (#18314145)

//<!--
      document.write('<a href="mai');
      document.write('lto');
      document.write(':lddb.admin');
      document.write('@');
      document.write('gmail.c');
      document.write('om');
      document.write('?Subject=[LDDb]">');
      document.write('Webmaster</a> (Julien WILK)');
//-->

That's cute. Someone spam him please.

Re:Document.Write() not interpreted by Google (1)

DrSkwid (118965) | more than 7 years ago | (#18314149)

If you want people to contact you, you should provide the contact details properly and suck up the spam yourself.

Choose a non-default email address, i.e. not webmaster but web-master, and deal with the consequences.

In my eyes, a customer/client/new friend not being able to contact you is far more expensive than dealing with some *more* spam.

Google doesn't, but it's possible (2, Informative)

Animats (122034) | more than 7 years ago | (#18313481)

I'd thought Google would be doing that by now. I've been implementing something that has to read arbitrary web pages (see SiteTruth [sitetruth.com] ) and extract data, and I've been considering how to deal with JavaScript effectively.

Conceptually, it's not that hard. You need a skeleton of a browser: one that can load pages, run JavaScript like a browser, and build the document tree, but doesn't actually draw anything. You load the page, run the initial onload JavaScript, then look at the document tree as it exists at that point. Firefox could probably be coerced into doing this job.
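With today's tooling that "skeleton of a browser" is essentially a headless browser; a minimal sketch of the idea using Puppeteer (which did not exist when this comment was written, so this is an after-the-fact illustration rather than what SiteTruth did):

    // Load a page, let its load-time scripts run, then read the resulting document tree.
    const puppeteer = require('puppeteer');

    async function renderedText(url) {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle0' });   // wait for scripts and XHRs to settle
        const text = await page.evaluate(() => document.body.innerText);
        await browser.close();
        return text;
    }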

It's also possible to analyze Flash files. Text which appears in Flash output usually exists as clear text in the Flash file. Again, the most correct approach is to build a pseudo-renderer, one that goes through the motions of processing the file and executing the ActionScript, but just passes the text off for further processing rather than rendering it.

Ghostscript [ghostscript.com] had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language. It has variables, subroutines, and an execution engine. You have to run PostScript programs to find out what text comes out.

OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.

Sooner or later, everybody who does serious site-scraping is going to have to bite the bullet and implement the heavy machinery to do this. Try some other search engines. Somebody must have done this by now.

Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.

Re:Google doesn't, but it's possible (0)

Anonymous Coward | more than 7 years ago | (#18313603)

OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
Shouldn't they have an ALT= attribute in their header image(s) anyway?

Re:Google doesn't, but it's possible (1)

maxwell demon (590494) | more than 7 years ago | (#18313875)

Sure. But since when do people do everything they should do?

Re:Google doesn't, but it's possible (1)

dargaud (518470) | more than 7 years ago | (#18313607)

OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
Yes, and it should work like that too. If the background is so cluttered as to make the OCR difficult, then chances are the human will have trouble reading it too. I suggested that during a job interview with a *cough* serious search engine: use a secondary crawler identifying itself as normal IE/Firefox, load the page using the usual IE/Firefox rendering engine, OCR the text with some color tolerance (this way all the white-on-white, display:none and size:1 text goes away - make sure violet on red goes away too!), and compare with the normal crawler. If the result is too different, flag it as a potential spamming site.

Re:Google doesn't, but it's possible (1)

VGPowerlord (621254) | more than 7 years ago | (#18313773)

Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.

Does Google run macros in Word documents? No? Then why are you even comparing this? I can parse a PDF document or a Word document without having to have a script interpreter running.

I imagine that the Googlebot crawler is a rather simplistic program that only knows how to:
1. Read robots.txt
2. Read meta tags (robot tags in particular)
3. Find text and web addresses in web pages
4. Send the text back to a larger analytical program
5. Add the web addresses it finds to its own queue

What you're proposing would require an actual DOM tree to be built up by Googlebot, as well as a JavaScript interpreter to be run. Also, if you use <input type="button"> or <button> controls anywhere, it would still fall flat on its face, as Googlebot doesn't activate these elements. So you'd either need Googlebot to press every button it encounters (a VERY bad idea) or have some sort of AI to figure out what it should do.

If you're willing to write such an AI, go ahead. I think I'll stand behind Google's method.
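A toy version of the five-step loop imagined above - no robots.txt handling, no politeness delays, no error handling, and the built-in fetch assumes a modern Node.js runtime:

    // Fetch a page, strip tags to get the visible text, queue the links it finds.
    async function crawl(startUrl, maxPages) {
        const queue = [startUrl];
        const seen = new Set();
        while (queue.length > 0 && seen.size < maxPages) {
            const url = queue.shift();
            if (seen.has(url)) continue;
            seen.add(url);
            const html = await (await fetch(url)).text();
            const text = html
                .replace(/<script[\s\S]*?<\/script>/gi, ' ')   // drop script blocks, as Googlebot effectively does
                .replace(/<[^>]+>/g, ' ');                     // crude tag stripping
            console.log(url, '->', text.length, 'characters of visible text');   // stand-in for "send it back"
            for (const m of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
                queue.push(m[1]);   // add discovered addresses to the queue
            }
        }
    }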

Re:Google doesn't, but it's possible (1)

imroy (755) | more than 7 years ago | (#18314171)

Ghostscript had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language.

Ghostscript had to deal with what problem? Yes, PostScript is a programming language with built-in graphics primitives. What does that have to do with search engines? It doesn't have to recognise certain outlines as being text (i.e text drawn without using the PostScript primitive for drawing text), it just draws it. Ghostscript is just another implementation of a language otherwise.

OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images.

Is OCR really necessary? Odds are the business name is also in the domain name and at least the front page as text, if not included in the title and/or copyright footer of every page. Except for damn all-flash web sites, the business name is unlikely to be hidden away from a search engine.

Re:Google doesn't, but it's possible (1)

aaronwormus (716976) | more than 7 years ago | (#18314359)

If google were to index javascript, they would probably create their own interpreter which only interpreted content that was meaningful to a search engine.

If googlebot had to interpret every fade-in menu and every roll-over effect it would take substantially more resources for google to crawl the web. Googlebot would also be vulnerable to malicious scripts - or scripts built to waste its time.

Working example (0)

PietjeJantje (917584) | more than 7 years ago | (#18313577)

Still, with a different approach my AJAX-generated site:
http://dutchpipe.org/ [dutchpipe.org]
is indexed perfectly:
http://66.102.9.104/search?q=cache:kvnpKdmDxwUJ:dutchpipe.org/+dutchpipe&hl=en&ct=clnk&cd=1 [66.102.9.104]

Re:Working example (0)

Anonymous Coward | more than 7 years ago | (#18314059)

Dude, I'm not clicking anything named "Dutch Pipe"... Dutch people are fucking scary as hell


Most contentless story for ages (1)

Sam H (3979) | more than 7 years ago | (#18313715)

You can tell there's nothing interesting in the link from the fact that not even a summary of the results is given in the story. It looks like the average pay-to-get-diggs story, except you don't have to pay anything to be on Slashdot. Well done, and enjoy your Google Ads revenues!

Re:Most contentless story for ages (1)

jackv (1068006) | more than 7 years ago | (#18313877)

I agree, the results are extremely obvious, quite apart from the fact that they've been documented plenty of times before. You only have to spend 20 minutes reading an SEO handbook to glean this basic info.

Re:Most contentless story for ages (0)

Anonymous Coward | more than 7 years ago | (#18314005)

Besides, he published the results MUCH too quickly!
It can take several weeks for Google to get into a stable state indexing a site.

If you want to see (3, Funny)

BrynM (217883) | more than 7 years ago | (#18313837)

If you want to see through a search engine's eyes, open the page in Lynx [browser.org] . The funniest part about showing that method to another developer is when they think Lynx is broken because the page is empty. "It didn't load. How do I refresh the page? This browser sucks." Heh. Endless fun.

(method does not account for image crawlers)

Re:If you want to see (0)

Anonymous Coward | more than 7 years ago | (#18314511)

The funniest part about showing that method to another developer is when they think Lynx is broken because the page is empty. "It didn't load. How do I refresh the page? This browser sucks." Heh. Endless fun.

Been there and done that, it's not funny. No matter how many times you explain it, the excuses keep coming.


  • The browser sucks...
  • Hardly anybody uses that browser...
  • Dynamic, all-singing, all-dancing documents are the future...?

For some reason the obvious conclusion (that they don't understand web technology) always escapes them.

Problem for web apps (1)

Wienaren (714122) | more than 7 years ago | (#18313935)

Web apps these days consist almost entirely of dynamic content that is invisible to Googlebot. If you're trying to make your page visible on the web, this is a real problem. But think twice before adding invisible divs or the like in order to achieve better search results: Google may well ban you (since they don't check whether or not the keywords you put in your invisible divs actually relate to the page's purpose or contents).

Google holds back the web! (1, Insightful)

mumblestheclown (569987) | more than 7 years ago | (#18314045)

This is a pretty straightforward example of how Google holds back the web. It's not Google's fault, per se, but it definitely is true. We routinely resort to older, inefficient technologies for our websites simply to please Google. It works well for us from an advertising standpoint, but it is often incredibly stupid technologically.

Wrong (0)

Anonymous Coward | more than 7 years ago | (#18314529)

This is an example of content that shouldn't be indexed by a search engine. I suppose you'd also like Google to auto compile, link and offer for download other forms of program source code? Don't blame Google because a small percentage of web developers are incompetent.
