
Google Now Searches JavaScript

timothy posted about 2 years ago | from the watch-for-the-scriptview-vans dept.


mikejuk writes "Google has been improving the way that its Googlebot searches dynamic web pages for some time — but it seems to be attracting some added interest just at the moment. In the past Google has encouraged developers to avoid using JavaScript to deliver content, or links to content, because of the difficulty of indexing dynamic content. Over time, however, the Googlebot has incorporated ways of searching content that is provided via JavaScript. Now it seems it has got so good at the task that Google is asking us to allow the Googlebot to scan the JavaScript used by our sites. Working with JavaScript means that the Googlebot has to actually download and run the scripts, and this is more complicated than you might think. This has led to speculation about whether it might be possible to include JavaScript on a site that could use the Google cloud to compute something. For example, imagine that you set up a JavaScript program to compute the first n digits of Pi, or a Bitcoin miner, and had the result formed into a custom URL — which the Googlebot would then try to access as part of its crawl. By looking at, say, the query part of the URL in your logs, you might be able to get back a useful result."
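As a rough illustration of the idea in the summary (hypothetical only: the function name, the chunk size and the example.com URL are made up, and nothing here reflects Googlebot's real script limits), the client side might look like this:

// Hypothetical sketch: do a small chunk of work in the page's script and
// expose the result as a crawlable link, so a crawler that executes the
// script and later follows the link reports the answer back to the
// site's access log via the query string.
function computeChunk(start, count) {
  // placeholder for real work (pi digits, hashing, etc.)
  var out = "";
  for (var i = 0; i < count; i++) {
    out += ((start + i) * 7) % 10;
  }
  return out;
}

var result = computeChunk(0, 16);
var link = document.createElement("a");
link.href = "http://example.com/result?chunk=0&value=" + encodeURIComponent(result);
link.textContent = "next";
document.body.appendChild(link);

The answer only "comes back" if a crawler that ran the script later requests the generated URL, which is exactly the long shot the summary is speculating about.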


Really? (5, Insightful)

Anonymous Coward | about 2 years ago | (#40119115)

Googlebot will have a very quick timeout on scripts, and probably won't be more powerful than a standard home computer. How would that be useful for calculating digits of pi or bitcoin mining? It would take far longer than doing it the conventional way.

Incremental and/or parallel computing? (5, Interesting)

SlovakWakko (1025878) | about 2 years ago | (#40119141)

You can always cut the whole process into smaller steps, each providing a URL that will initiate the next step. Or you can provide several URLs and have the Google cloud compute a problem for you in parallel...

Re:Incremental and/or parallel computing? (5, Funny)

Anonymous Coward | about 2 years ago | (#40119159)

I already do this using a system of CNAMEs in a .xxx domain.

Re:Incremental and/or parallel computing? (0)

Anonymous Coward | about 2 years ago | (#40119173)

I realise that the kind of idiots who like Bitcoins will be the same fools who drool over Google, and that these same monkeys won't see any problem with providing an algorithm which generates a secret to a third party for execution, but exactly why do they think any significant sort of compute time will be dedicated to their shitty little site?

Re:Incremental and/or parallel computing? (1)

Anonymous Coward | about 2 years ago | (#40119253)

The same reason why 72 hours of video is uploaded to YouTube every minute.

Re:Incremental and/or parallel computing? (1)

Dwonis (52652) | more than 2 years ago | (#40125463)

I realise that the kind of idiots who like Bitcoins will be the same fools who drool over Google, and that these same monkeys won't see any problem with providing an algorithm which generates a secret to a third party for execution,

Bitcoin mining doesn't involve any secret information.

I'm not sure why you're slagging "idiots who like Bitcoins" so much, either. Sure, Bitcoin has attracted some cranks, anarchists, people who don't trust government-issued money, and speculators who will say all manner of things in attempts to influence the price of Bitcoins (both up and down), but have you actually looked at the crypto and the system of incentives built into the Bitcoin system? It's brilliant, and it's basically the micropayment system that everyone wanted back in the 1990s, but couldn't have because it didn't exist.

Re:Incremental and/or parallel computing? (0)

Anonymous Coward | about 2 years ago | (#40119343)

Not worth it, IMO. You have to serve it a small fraction of the problem and then get back the answer and use it. Using your own resources would be faster and cheaper.

Re:Incremental and/or parallel computing? (1)

rtfa-troll (1340807) | about 2 years ago | (#40119543)

What if the URL triggers, for example, a Slashdot posting? Then you use another external JavaScript interpreter to gather all the results. Sort of map-reduce. Incredibly inefficient, but you don't have to pay, so who cares? Even better if some XSS or similar attack on a web site can be used to parcel out the work.

It seems to me, though, that there's no reason to limit this to Googlebot; any JavaScript interpreter will do. I'd be surprised if nobody in the blackhat community has this up and running for password cracking or similar.

Re:Incremental and/or parallel computing? (1, Interesting)

Zero__Kelvin (151819) | about 2 years ago | (#40119369)

Even if this is possible, you would certainly be violating Google's guidelines and have your site blacklisted from Googlebot pretty quickly. Furthermore, you could be charged with theft of services.

Re:Incremental and/or parallel computing? (0, Flamebait)

Anonymous Coward | about 2 years ago | (#40119405)

I don't think so. Google downloaded my script and ran it on their servers. How is that my fault if it uses their resources?

--
Sundar Pichai is the utter asshole whose incompetence has resulted in the shutdown of Google's Atlanta office.

Re:Incremental and/or parallel computing? (1)

Zero__Kelvin (151819) | about 2 years ago | (#40119475)

Intent.

Re:Incremental and/or parallel computing? (0)

Anonymous Coward | about 2 years ago | (#40119859)

That only applies if you target it specifically at Google. Which is harder than making a generic calculate-all-the-pi-digits javascript that simply outputs to the browser, with a different colour background for each digit. Then you do exactly the above, and you get the result on the server, and there is no malicious intent; only the intent to compute pi digits.

Re:Incremental and/or parallel computing? (1)

dreamchaser (49529) | about 2 years ago | (#40120189)

Intent.

Prove it. Seriously. You wouldn't be able to.

Re:Incremental and/or parallel computing? (5, Interesting)

ThatsMyNick (2004126) | about 2 years ago | (#40119417)

Anyone wanting to do this would be doing it on a dedicated website. They won't care about the domain or IP address being blacklisted from Google. And good luck with the theft of services charge; they never asked Google to index them. They did not even agree to any terms of service from Google. As I said, good luck.

Re:Incremental and/or parallel computing? (0)

Zero__Kelvin (151819) | about 2 years ago | (#40119485)

Thanks, but I won't need luck. If I don't set up a robots.txt file telling them not to index it, then I have opted in to be indexed. If my code is clearly designed to exploit Google's bot, then I had intent. The two combined equals guilt.

Re:Incremental and/or parallel computing? (4, Informative)

truedfx (802492) | about 2 years ago | (#40119521)

No, that's not what opting in means. Opting in means you're asking Google to visit your site. Opting out means you're asking Google not to visit your site. When you're not asking for anything, merely hoping, you're neither opting in nor opting out.

Stop trying to teach what you don't understand (1)

Zero__Kelvin (151819) | about 2 years ago | (#40123565)

"Opting in means you're asking Google to visit your site. "

Right. That is exactly what I said. The standard for the internet is well defined. You should read about it [wikipedia.org]. If you make a web page available to the internet without a password, captcha, firewall, etc., you are making it available to all. You have already purposely accepted the condition ahead of time. This is opting in [thefreedictionary.com]. The robots.txt file allows you to opt out instead. If you opt in by placing it on the internet, available to web crawlers, and don't opt out with a robots.txt entry, you opt in to having that data accessible to all, including, but by no means limited to, Googlebot.

Re:Stop trying to teach what you don't understand (1)

truedfx (802492) | more than 2 years ago | (#40126975)

No, that isn't what you said. Allowing Google to access your site and asking Google to access your site are two different things. By neither opting in nor opting out, you're allowing Google to access your site, because the default is to allow it and you haven't told Google otherwise, but that's not opting in. Hint: what does opt mean? Who has chosen that the default is to allow anyone to visit your site? If it isn't you, then you didn't opt.

Re:Stop trying to teach what you don't understand (1)

Zero__Kelvin (151819) | more than 2 years ago | (#40127031)

You are confusing e-mail with the internet, which it turns out isn't a series of tubes by the way. In an e-mail scenario opting in involves a specific request to receive information. In a web publishing scenario you are opting in to having your published data and information read by all.

"Who has chosen that the default is to allow anyone to visit your site? If it isn't you, then you didn't opt."

Why do you keep reiterating my point for me and then saying I didn't make my point? If you don't create a mechanism to keep Google out (e.g. robots.txt) then - by your own admission - you have opted to allow Googlebot to read what you publish to the world.

Re:Stop trying to teach what you don't understand (0)

Anonymous Coward | more than 2 years ago | (#40127837)

if it's not a series of tubes then what is it?

Re:Stop trying to teach what you don't understand (1)

bingoUV (1066850) | more than 2 years ago | (#40152879)

If you don't create a mechanism to keep Google out (e.g. robots.txt) then - by your own admission - you have opted to allow Googlebot to read what you publish to the world.

Allowing Google to do something does not mean asking Google to do it. Allowing does not involve "service".

Re:Incremental and/or parallel computing? (1)

Dark$ide (732508) | more than 2 years ago | (#40130677)

What happens when Google chooses to ignore my carefully crafted robots.txt?

If they then download my javascript experiment and run it at their cost, that's their problem.

When I can trust crawlers to not ignore my robots.txt I'll stop using fail2ban on my apache logs.

Re:Incremental and/or parallel computing? (0)

postbigbang (761081) | about 2 years ago | (#40119795)

There is no reason to believe, as the research is scant at best, that Google even respects a robots.txt file. They are a vacuum hose attached to an analytic engine, easily metaphorized to Stephen King's Langoliers.

Here's your sign (2)

Zero__Kelvin (151819) | about 2 years ago | (#40123499)

"There is no reason to believe, as the research is scant at best, that Google even respects a robots.txt file [google.com] .

From the preceeding link: "Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler. Visit http://code.google.com/web/controlcrawlindex/docs/faq.html [google.com] to learn how to instruct robots when they visit your site. You can test your robots.txt file to make sure you're using it correctly with the robots.txt analysis tool available in Google Webmaster Tools.

Re:Incremental and/or parallel computing? (1)

amRadioHed (463061) | more than 2 years ago | (#40125643)

Research is scant? It's ridiculously easy for anyone with a webserver to verify if Google respects robots.txt.

Re:Incremental and/or parallel computing? (0)

Anonymous Coward | about 2 years ago | (#40120131)

You're a bit of a moron, aren't you? I think you need to read up on the concept of 'opt in' vs 'opt out'.

Herpa derp!

Re:Incremental and/or parallel computing? (0)

Anonymous Coward | about 2 years ago | (#40120773)

It shouldn't be that way. Having no robots.txt should be treated the same as...

User-agent: *
Disallow: /

With a robots.txt, the bot behavior should then be specified explicitly, for example:

User-agent: googlebot
Allow: /

Re:Incremental and/or parallel computing? (1)

Dwonis (52652) | more than 2 years ago | (#40125473)

Guilt under what section of what law, specifically?

Re:Incremental and/or parallel computing? (1)

Zero__Kelvin (151819) | more than 2 years ago | (#40127041)

You must be from another country. In the US, they have a smorgasbord of them from which they can choose now. But in this case I was thinking theft of service as I already stated quite explicitly.

Re:Incremental and/or parallel computing? (1)

Zero__Kelvin (151819) | about 2 years ago | (#40123619)

"Anyone wanting to do this would be doing it on a dedicate website. They wont care about the domain or IP address being blacklisted from Google."

So you are saying that someone would go through all the trouble of registering the domain, creating the code, and getting (or waiting for) Google to index it, then wouldn't care that Google would cease to execute the actual code before the desired results are obtained? Re-read what I wrote. I merely said it would be blacklisted quickly. I didn't say that it would affect some tangential portion of the site. The purpose of the attempt itself would be defeated quickly. That is the point.

Re:Incremental and/or parallel computing? (1)

ThatsMyNick (2004126) | about 2 years ago | (#40123875)

So far blacklisting has worked pretty well for Google. Google has used it well to punish black hat SEO techniques.

In this case though, if I don't care about my page rank, I would simply create tons of long domain names for pennies (+ ICANN fees). I would use a few at a time and wouldn't care if Google blacklisted a few at a time (I would be storing partial results, just like one of the parents mentioned, and the takeover should be seamless). It doesn't take a lot to recoup your domain name fees if your task is purely computational.

Re:Incremental and/or parallel computing? (1)

Zero__Kelvin (151819) | about 2 years ago | (#40123949)

It doesn't take a lot to recoup your domain name fees if your task is purely computational.

Dedicated hardware is cheap, and designing software costs a lot of money and time. What you are proposing would be ridiculously convoluted and costly, even disregarding the legal ramifications. We software engineers often talk about using the right tool for the right job. Your outlandish proposal ignores numerous sound engineering principles, not the least of which is adhering to this simple maxim.

Re:Incremental and/or parallel computing? (1)

ThatsMyNick (2004126) | more than 2 years ago | (#40125899)

Maybe not. But if someone wanted to do it just for the heck of it, it can be done. It may not scale very well; otherwise I don't see any issues at all with it.

Re:Incremental and/or parallel computing? (1)

Zero__Kelvin (151819) | more than 2 years ago | (#40127047)

As I was trying to explain, there is a huge difference between the problems you can see with it and the actual long list of problems that any moderately competent software engineer could quickly point out.

Re:Incremental and/or parallel computing? (2)

ThatsMyNick (2004126) | more than 2 years ago | (#40127135)

I think you missed the "just for the heck of it". I understand my approach is not the practical one, and any sane person would just use their own resources to do what little can be done and implement it on their own hardware. But that doesn't mean it cannot be done in a no-loss way. Say I want to calculate the last 100 digits of Graham's number; it can be split into multiple calculations, and a sub-result calculation can take less than a second (which is what I assume Google will limit the runtime to). The bandwidth requirements are less

And I really don't understand why you believe this cannot be done. If your argument is that this can be done in other, easier ways, I agree. If not, I don't really understand your argument.

Re:Incremental and/or parallel computing? (1)

Zero__Kelvin (151819) | more than 2 years ago | (#40127241)

Let's start with the simplest problem. You plan on having Googlebot load and run your client side code. Great. Now how do you plan to get Googlebot to feed you the result?

Re:Incremental and/or parallel computing? (3, Insightful)

ThatsMyNick (2004126) | more than 2 years ago | (#40127275)

Your JS would generate HTML on the client side. Just generate a link that your server can understand. Googlebot, doing what it does, will try to load this URL. When it does, the server stores the result and generates a new problem for Googlebot to solve. This is the basis for the article and the entire comment thread.
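A bare-bones sketch of that server-side loop in Node.js (the /result endpoint and the chunk/value parameter names are invented for illustration, and it assumes the crawler really does execute the script and follow the generated link):

// Hypothetical: record an answer reported via the query string, then
// serve a page whose script poses the next chunk of work.
var http = require("http");
var url = require("url");

var results = {};   // chunk id -> reported value (in memory only)
var nextChunk = 0;

http.createServer(function (req, res) {
  var parsed = url.parse(req.url, true);
  if (parsed.pathname === "/result") {
    // a crawler (or anyone) followed a generated link; store the answer
    results[parsed.query.chunk] = parsed.query.value;
    nextChunk++;
  }
  // every response carries a script that computes the next chunk and
  // links back to /result with its answer
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end("<html><body><script>" +
    "var c = " + nextChunk + ";" +
    "var v = '42'; /* placeholder for real client-side work */" +
    "var a = document.createElement('a');" +
    "a.href = '/result?chunk=' + c + '&value=' + v;" +
    "a.textContent = 'next';" +
    "document.body.appendChild(a);" +
    "</script></body></html>");
}).listen(8080);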

Re:Incremental and/or parallel computing? (1)

Zero__Kelvin (151819) | more than 2 years ago | (#40127305)

"Your JS would generate HTML on the client side."

Like I said, you are making assumptions about Googlebot. You seem to think that they have no idea how to sanitize an input and will just execute whatever you send them byte for byte. That's not going to happen.

Re:Incremental and/or parallel computing? (2)

ThatsMyNick (2004126) | more than 2 years ago | (#40127345)

Er, they are looking for JS that generates HTML (so this is not an assumption). The purpose of Googlebot is to index. If they run the JS and don't even index the results, it makes no sense.

Would you mind specifically mentioning what assumption I am making? And there is no way to sanitize JS (JS is a Turing-complete language; there is no way, at least as far as present-day research goes, to sanitize it in any reasonable way).

Re:Incremental and/or parallel computing? (1)

ThatsMyNick (2004126) | more than 2 years ago | (#40127357)

Soory about the typos, I guess I need to get some sleep.

Re:Incremental and/or parallel computing? (0)

Clogoddess (2591147) | more than 2 years ago | (#40125923)

Blah blah blah Google employee alert yawn.

Re:Incremental and/or parallel computing? (1)

ThatsMyNick (2004126) | more than 2 years ago | (#40125961)

I feel honored to have been considered a Google employee. Well, not really. Is there something wrong with my point, that it sounds fanboyish or employee-ish?

Re:Incremental and/or parallel computing? (-1)

Anonymous Coward | more than 2 years ago | (#40127541)

Not really, if you worked for Google you would come off as an arrogant asshole.

--
Sundar Pichai is the utter asshole whose incompetence has resulted in the shutdown of Google's Atlanta office.

Re:Really? (1)

multicoregeneral (2618207) | about 2 years ago | (#40121005)

Depends on how often they hit your site. Google has been known to check sites pretty regularly.

Re:Really? (1)

Sloppy (14984) | about 2 years ago | (#40121307)

Wait a minute, are you suggesting that having spiders run my javascript x86 emulator which runs jruby scripts which mines bitcoins, isn't practical?

Simply another example (1)

Anonymous Coward | about 2 years ago | (#40119169)

why having other parties fetch your arbitrary code and execute it is such a wonderful idea.

Re:Simply another example (5, Funny)

Zero__Kelvin (151819) | about 2 years ago | (#40119377)

Well, I think the bigger problem is that you are writing arbitrary code.

A much more likely application (5, Interesting)

maxwell demon (590494) | about 2 years ago | (#40119259)

Send Google JavaScript which generates different results for Google than for normal visitors, in order to rank up the site.

Re:A much more likely application (0)

Anonymous Coward | about 2 years ago | (#40119279)

That's an interesting idea and much more insidious than mine, which was to simply send nothing to Google and fuck 'em.

Re:A much more likely application (0)

Anonymous Coward | about 2 years ago | (#40119413)

Or you could do it the easy way - robots.txt.

Re:A much more likely application (1)

Anonymous Coward | about 2 years ago | (#40121335)

That's an interesting idea and much more insidious than mine, which was to simply send nothing to Google and fuck 'em.

Not allow your site to be indexed by Google? Yeah, that'd really fuck Google up good, wouldn't it?

Re:A much more likely application (4, Funny)

aaronb1138 (2035478) | about 2 years ago | (#40119287)

What is this method you have written, "sudo_mod_me_up?"

Re:A much more likely application (1)

The Mighty Buzzard (878441) | about 2 years ago | (#40119533)

Wait, GoogleBot gets mod points now? This explains soooo much.

Re:A much more likely application (1)

Anonymous Coward | about 2 years ago | (#40120529)

You don't need JavaScript for that. A lot of servers serve different HTML to Google than to us. It's especially noticeable when searching for a rare term; Google will show you results that appear to contain the term, but without relevant context (only mystifying unrelated terms) and when you open it the page turns out to have some completely different subject.

Re:A much more likely application (1)

Anonymous Coward | about 2 years ago | (#40120677)

I noticed this in a PHP attack script earlier this year. It installs a script pointing to a Russian malware domain, but only inserts it in the page if the user agent is not GoogleBot or a few other spiders. It also checks for some Google IP ranges. Surely Google must be combating this by doing some stealth spidering, otherwise SEO and malware providers will game them if they stick to their classic robot rules.

Re:A much more likely application (2)

slapyslapslap (995769) | about 2 years ago | (#40120689)

This is already being done, but in reverse. Google doesn't like it much either. Get caught, and you are de-listed.

Re:A much more likely application (1)

maxwell demon (590494) | more than 2 years ago | (#40127471)

The point is, with Google executing JavaScript you could make it less obvious, by having the JavaScript depend on some difference between Google's and the browser's JavaScript execution (maybe the timing of certain rendering operations).

Also, it might be used through XSS, to have competitors delisted.

Re:A much more likely application (1)

multicoregeneral (2618207) | about 2 years ago | (#40120987)

Or one that generates useful looking links to other sites you own (on different servers and subnets, of course).

Re:A much more likely application (1)

squidinkcalligraphy (558677) | more than 2 years ago | (#40125603)

I would be surprised if the googlebot didn't try everything to appear to the server like a normal user's browser. Even better would be to crawl a site while in disguise, then again while not disguised. Differences would affect the site's ranking negatively.

Re:A much more likely application (1)

maxwell demon (590494) | more than 2 years ago | (#40127493)

Serving different content based on IP or self-identification is possible even without JavaScript. However, if the detection makes use of peculiar behavior of the JavaScript implementation (and the JavaScript implementation will have to have some differences, or else it won't find content which is initially hidden but unhidden by a user interaction), just fetching from a different IP or with a different browser/spider identification doesn't work.

And BTW, the spider will certainly expose itself by the very fact that it accesses robots.txt.

Not convinced (0)

Anonymous Coward | about 2 years ago | (#40119283)

As a programmer, this still sounds like an extremely dirty hack to me. For the time being, I'll stick to creating gracefully degrading sites, thank you very much.

Re:Not convinced (2)

mwvdlee (775178) | about 2 years ago | (#40119435)

By "gracefully degrading" do you mean "if (useragent == 'googlebot') { random-spamwords(); paywalled-content(); links-to-every-parsable-uri(); }"?

Re:Not convinced (0)

Anonymous Coward | about 2 years ago | (#40120103)

They're probably doing it to deal with retards like the Meteor [meteor.com] developers.

Go ahead and look at the source of that page, and note how the page is completely invisible to search engines.

Really, who wouldn't want their hip new Web 2.0 site to be completely unfindable through search engines? The accessibility degradation is just a bonus.

Re:Not convinced (1)

TheLink (130905) | about 2 years ago | (#40120745)

Would it be possible to filter out sites like this? I personally don't want to find sites like these in my Google search results.

Re:Not convinced (1)

dave420 (699308) | more than 2 years ago | (#40133869)

Google is not there to represent your own idea of what the internet is. Sites like that will become more and more common, whether you like it or not.

Re:Not convinced (0)

Anonymous Coward | more than 2 years ago | (#40140155)

When I visit that site I get a background image only; consequently, unless it's something I really need, I surf away.

If they want my business, they'd better make sure the basic content is available without JS (and that goes double for links).

I noticed this already some time ago. (1)

jimbauwens (1648531) | about 2 years ago | (#40119313)

When I was looking at the page previews (in Google) of my JavaScript network scanner, I noticed it listed some IPs, indicating that it was running the script. Just google "http://bwns.be/jim/scanning_printing/detect_range.html" and look at the preview. (Also, most of those IPs probably exist, as my script indicates it is sure about them.)

Re:I noticed this already some time ago. (1)

C18H27NO3 (1282172) | about 2 years ago | (#40119511)

You typoed your url. You have detect_range.html which is actually detect-range.html

Re:I noticed this already some time ago. (1)

jimbauwens (1648531) | about 2 years ago | (#40119683)

Oh :P Thanks :)

Re:I noticed this already some time ago. (3, Funny)

RoccamOccam (953524) | about 2 years ago | (#40120233)

Also, the dry cleaning that you dropped off on Thursday is ready for pick-up and your driver's license expires in three months.

Sincerely,
The Slashdot Citizens Brigade

Re:I noticed this already some time ago. (2)

marcosdumay (620877) | about 2 years ago | (#40121903)

Now that you mention it: the preview Google shows of one of my sites has all the CSS applied, including some that is applied by JavaScript after the page loads.

Probably, but limited in use. (0)

Anonymous Coward | about 2 years ago | (#40119325)

I remember once I was going to try to use Google Cache to see if I could store backups on it.

I still haven't actually bothered doing it, to be honest.
And caching in general doesn't seem to show up as often as it used to on websites.
I feel they are only caching larger or more active websites, or social ones.

It would most likely require a bit of trial and error.

so much for (5, Insightful)

Anonymous Coward | about 2 years ago | (#40119477)

using javascript to hide or obfuscate email addresses to help protect them from spammers, scammers and bots.

thanks fer nuttin, google.

Re:so much for (0)

Anonymous Coward | about 2 years ago | (#40119667)

Exactly. Or you could modify the script to detect the googlebot and then generate garbage, bogus addresses, or maybe the e-mail address to Google's complaint department (must be the size of a small planet by now).

Re:so much for (2)

VortexCortex (1117377) | about 2 years ago | (#40120193)

robots.txt

Re:so much for (2)

MattskEE (925706) | about 2 years ago | (#40121927)

Do you think spammers scraping the web for email addresses respect robots.txt?

Re:so much for (0)

Anonymous Coward | about 2 years ago | (#40122229)

No, but they probably haven't implemented JavaScript execution either, so if you prevent them from using the results of Google's crawl (by telling Google not to crawl the page in the first place) they're back to square one.

Re:so much for (0)

Anonymous Coward | about 2 years ago | (#40122421)

This whole thread is under the premise that spammers and scammers use the results of Google's crawl to harvest email addresses. How does that work? Which query are they using?

Re:so much for (0)

Anonymous Coward | more than 2 years ago | (#40126113)

... how long have you been using google? search for:

*@hotmail.com

even with a unique strike rate of 0.0001%, at apparently 15 billion results you'll harvest 15,000 email addresses.
That's just looking up the one domain. Do it for every domain ever and see how many you can uncover,
noting that something like this will probably be more accurate for targeting specific sites:

site:domaintoscrape.com *@domaintoscrape.com

Re:so much for (2)

John Bokma (834313) | about 2 years ago | (#40122039)

Uhm, years ago one could already do that using SpiderMonkey and some Perl. It's what I used to report nasty redirects in Blogspot/Blogger to Google (thousands and thousands). It took me some time, but Google did see the light and the problem was resolved.

Why do people keep thinking that spammers are retards? If it can be abused, it will be. And spammers/cybercriminals are among the first to do so.

Re:so much for (1)

goaxcap (2648385) | more than 2 years ago | (#40125535)

Use images or Flash to display the email address.

Re:so much for (0)

Anonymous Coward | more than 2 years ago | (#40130945)

If they can OCR CAPTCHA they can OCR your images. You'll only succeed in making your page harder for the legit users.

Evaluate JavaScript on the client (1)

Anonymous Coward | about 2 years ago | (#40119515)

Now that Google controls the client, the search engine and the analytics, it should not be too difficult for them to see how traffic is flowing between sites. Pages need not even be physically linked for Google to see a connection. E.g. reading an article on the BBC may cause people to search for a company. With people signing into Chrome, Google must have some very rich logs.

Google has been doing this for quite some time (2, Interesting)

Anonymous Coward | about 2 years ago | (#40119853)

Although maybe not quite in the same context. Google used to display javascript-munged email addresses in their search results until some of the larger sites involved, such as Rootsweb, complained.

GET vs POST (1)

Anonymous Coward | about 2 years ago | (#40119919)

I really hope website developers and web application developers know the difference between GET and POST requests.

Else, this could turn ugly.

Re:GET vs POST (1)

physburn (1095481) | about 2 years ago | (#40121449)

I've often implemented 'write new article' or 'add item' actions as GET links, and also as JavaScript actions. Which would mean Google is going to be spamming forums and databases. What's the robots.txt command to prevent Google from running the JavaScript on a page?

Re:GET vs POST (2)

xOneca (1271886) | more than 2 years ago | (#40127553)

Maybe put the JavaScript functions in a separate file and use robots.txt to deny bots access to it.
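Something along these lines, assuming the scripts live under a /js/ directory (the path is just an example):

User-agent: *
Disallow: /js/

That only keeps well-behaved crawlers from fetching the script files; the page itself still gets indexed.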

Silly (0)

Anonymous Coward | about 2 years ago | (#40120315)

That's a silly idea anyway.

What I expect from Google is to basically download the page and process it, as if it were Chrome, and then diff it against the unprocessed page to figure out which sections of content have changed.

Secondly, to avoid stupidity, keep a whitelist and blacklist of scripts that should, and should not, be processed. For example, whitelist scripts on Twitter, Facebook, and Disqus to read comments, but blacklist login pages, local storage, and advertisements. This would let Google figure out which content is part of the page and which content is dynamically added to the page that's of contextual value. Google can do its own OAuth and log in to sites that allow authentication from G+ as "The GoogleBot", which would also allow users to ban it from accessing any data they don't want it to see.

Google adding potential security holes in its bot? (1)

Kergan (780543) | about 2 years ago | (#40120387)

I can already picture hackers drooling at the idea of turning Google's cloud into the ultimate zombie network.

Chrome (2)

The MAZZTer (911996) | about 2 years ago | (#40120435)

If you check out some of the thumbnails, it looks like Googlebot is using a customized version of Chrome now. You can see it blocking plugins.

I for one welcome the Javascript spamming. (1)

multicoregeneral (2618207) | about 2 years ago | (#40120963)

It's inevitable. Someone will figure out a way to abuse the system that google hasn't thought to make contingencies for yet. I'm on the fence as to whether this is a good idea. I just hope they know what they're doing.

Re:I for one welcome the Javascript spamming. (1)

dave420 (699308) | more than 2 years ago | (#40133881)

Yeah, it's true - Google clearly knows nothing about searching the internet. ;)

Re:I for one welcome the Javascript spamming. (1)

multicoregeneral (2618207) | more than 2 years ago | (#40135077)

Dave, every time they make a change like this, they get hammered. They made some big changes the release before "panda" and the site was useless for almost a year.

Chrome from users used as web spider (0)

Anonymous Coward | about 2 years ago | (#40121163)

I thought that every (yet unknown) url that is visited by a user from inside Google Chrome is reported back to Google. I guess that could also be used for crawling javascript by using the client's computer for that.

They don't need to run the scripts (2)

Hentes (2461350) | about 2 years ago | (#40122065)

You don't need to actually run the scripts, most of the time it's enough to just scrape the strings and links out of them.
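A crude sketch of that kind of scraping (a guess at the approach, not how Googlebot actually does it): pull URL-looking string literals straight out of the script source without running it.

// Hypothetical: extract anything that looks like a URL or path from raw
// script text. Misses URLs that are built up by string concatenation.
function scrapeUrls(scriptSource) {
  var pattern = /["'](https?:\/\/[^"']+|\/[^"'\s]+)["']/g;
  var found = [];
  var match;
  while ((match = pattern.exec(scriptSource)) !== null) {
    found.push(match[1]);
  }
  return found;
}

scrapeUrls("var a = 'http://example.com/page'; var b = '/ajax/data.json';");
// -> ["http://example.com/page", "/ajax/data.json"]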

WTF? (1)

Johann Lau (1040920) | about 2 years ago | (#40122691)

Oh yeah, fuck accessibility. Fuck the web in general. "It's better for everybody". That's literally all you need to know. "Just go ahead and remove that from your robots.txt".

I'm not saying there may not be good reasons (e.g. having the CSS and Javascript actually makes it possible to detect invisible text and whatnot, without that search engines may not even have a chance), but I really would appreciate some good reasoning, not being talked to like a fucking 5 year old.

Or hey, how about adding that "of course, not having a unique URL for relevant content is a noob fucking mistake, and generally a cancer everybody is looking forward to eradicate, and irrelevant gimmick content is hardly interesting for search, so if you just went ahead and made a site that doesn't suck butt, that would be fine, too." --- *something* to indicate he isn't fucking clueless.

How Are They Doing This? (0)

Anonymous Coward | more than 2 years ago | (#40124775)

I wonder if they're using a standard browser to load pages now or if they have incorporated V8 and WebKit (or something similar) into Googlebot?

Imagine if you could get this on your local machine as a web crawler app, but with filtering capabilities. Traditional web crawlers only work with static content for the very reason that they're not advanced enough to load the entire page, including running JavaScript; plus there is the overhead of that additional processing, which can be a real killer for your crawling time.

I really hope more details are released on how they're doing this (but not on how they're ranking anything since Google is protective of that).

Re:How Are They Doing This? (1)

dave420 (699308) | more than 2 years ago | (#40133885)

There are command-line WebKit-based parsers out there, which allow you to process any URL or file as a browser would, and take either a screenshot of the page or access the DOM or whatever. They're not new.
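PhantomJS is one example of such a tool; a minimal script along these lines (the URL is a placeholder) loads a page in headless WebKit, runs its JavaScript, and prints the rendered DOM:

// Run with: phantomjs dump.js
var page = require("webpage").create();
page.open("http://example.com/", function (status) {
  if (status === "success") {
    console.log(page.content);   // HTML after scripts have run
  }
  phantom.exit();
});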

Spammers! (3, Informative)

xenobyte (446878) | more than 2 years ago | (#40126669)

They've been testing this for a while. We've already had the first complaints against someone spamming an email address that only exists in exactly one place: online, as the result of some (trivial) JavaScript. It turned out that if you Googled the page, the result snapshot included the JavaScript-generated email address... In other words, it's already there, and this will effectively kill JavaScript as a way of hiding functioning mailto links. Okay, it would be fairly simple to add a condition based on the User-Agent, as Googlebot is easily identified, but it will make things a bit more complicated for the average user.
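The user-agent condition being suggested might look roughly like this (the address is a placeholder, and a scraper that spoofs a normal browser string sails right past it):

// Hypothetical: only assemble the mailto link for visitors whose user
// agent doesn't look like Googlebot. Trivially defeated by UA spoofing.
if (navigator.userAgent.indexOf("Googlebot") === -1) {
  var user = "info", host = "example.com";   // placeholder address
  var a = document.createElement("a");
  a.href = "mailto:" + user + "@" + host;
  a.textContent = user + "@" + host;
  document.body.appendChild(a);
}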
