Google Crawls The Deep Web

Zonk posted more than 6 years ago | from the delved-too-deeply dept.

Google 197

mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."

197 comments

Just think! (5, Funny)

scubamage (727538) | more than 6 years ago | (#23095774)

Soon, they'll start injecting SQL too to help map databases! Google is so useful indeed! :)

Re:Just think! (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#23095796)

It's four thirty a.m. and the house is asleep.

I. . . am not asleep.

I am crouched in the bathtub in a frog-like stance, small puddles of urine and liquid shit at my feet. I'm leaning forward, gripping the side of the tub and biting my knee, overwhelmed by a mixture of pain and pleasure as I piston a dildo in and out of my ass.

You see, I really love anal masturbation.

Ever try it? No? You should.

Doesn't matter who you are. God gave all of us, male and female, an abundance of nerve endings in our rectum - and one life to live. So why don't you go ahead and test out the equipment? Have some fun. No point in having a gun sitting on your shelf your entire life and never killing anyone, right?
But I realize there's a fairly persistent misconception among guys that I'm gonna have to dispel before we go any further:

Stimulating your own ass is not "gay."

That notion doesn't make a whole lot of sense. I mean, how could anything you do to your own body be gay? Nobody ever freaks out in the middle of jerking off like "Holy fuck, I've got a fistful of cock! I've gotta cut this gay shit out!" Well, what's the philosophical difference between playing with your dick and playing with your ass?

There is none.

Look fellas, here's the scoop:

If you have a girl wearing a foot long strap-on, smacking your face and screaming "WHO'S MY BITCH?!?" while she pounds your asshole until it bleeds, that would be a *heterosexual* act. Girl on guy. Simple.

Now if it's a guy that's fucking you, that would be homosexual. And if you're doing it to yourself, well, that's plain old masturbation.

But listen - if you're still sitting there being stubborn, all macho and uptight going "My ass. . . is EXIT ONLY!!!" then lemme just ask you a question.

You know that feeling you get when you take a really big shit?

You know what I'm talking about. You're sitting on the couch, eating Cheez-Its and watching Larry King, when all of the sudden you feel that familiar burning. . . so you get up and bound off to the bathroom all bow legged, clenching your sphincter real tight, and then you furiously rip off your boxer briefs and plop down on the seat just in time to let a huuuuuuge thick turd come sliding out of your ass?

Ahhhhhhhhh!!!!

That feeling.

That tingling, chills up your spine, this-is-absolutely-the-pinnacle-of-human-existence feeling.

Well guess what. That's the feeling of a massive rod moving through your rectum, tickling those wonderfully abundant nerve endings. You love it. It's okay. We all do. It doesn't make you a fag. Or at the very least, we're ALL fags. So indulge yourself.

(Yes, I understand that said feeling is partially due to the sensory experience of toxins leaving the body, which is unique to defecation - but the operative word here is "partially." You like the log movement, too. Don't try to argue.)

So anyway, now that you've decided to be bold, and not a homophobic pussy, and poke around the cornhole a little bit - good for you. But there's something you should remember. Anal masturbation is just like playing the accordion, or shooting a jumper, or really anything else that's worth doing. That is, it requires practice.

You see, back when I was a kid I would get curious and stick a finger or a toothbrush up there, but I wasn't fucking around with anywhere near the kind of pleasure I'm achieving now. It was uncomfortable even. So I worked on it.

And conversely, I know I'm still far from expertise in this particular discipline. I don't claim to be an ass master. There's a whole world of lengths, girths, textures, and vibrations that my eager browneye has yet to inhale.

But since I have honed my skills to a pretty decent level, I'll share with you my current technique. Without further ado:

Rob Malda's Anal Masturbation Technique

What You Need:

1. Lubricant of your choice
2. Fake cock (eight inches, approx.)
3. Ridged anal wand (seven inches, approx.)
Procedure:

1. Apply a generous amount of lube to your index finger, and swirl the lubricated finger lightly around your butthole. Add another drop or two of lube, and then simultaneously push your finger into your butthole while pushing back with your anus muscles.

2. Slide your finger into your ass up to the knuckle and feel around for turds. Unless you're an anorexic, you probably will come across one.

3. Circle your finger around your anal walls pressing outward, as if you were an umpire signaling a home run. You should be near the toilet, because this is intended to stimulate a bowel movement. Once you've shit, and your rectum is empty, then you're ready for some heavy duty fun.

4. Lube up a second finger and slip them both into your poopchute. Let your asshole get comfortable with the new mass, and then begin to pump a little. Repeat with a third finger if you so desire.

5. Slather lube all over the ridged anal wand. Squat over your tool and press the tip to your now greasy anus. Just as you've done with your fingers, ease the dildo into your cornhole as you push back onto it with your ass muscles. Go slowly, stopping at each ridge and letting your ass adjust to the increase in width, until you have it in as far as it will go.

6. Now it's time to start pounding. I'm not gonna get more specific than that. Do it your own way. Experiment with different positions and rhythms until you find what you like.

7. Once your ass has been thoroughly fucked by the anal wand, it's time to move up to the larger dildo. Again, you're going to repeat the process that you've done twice already, with your fingers and the wand. Entering slowly, pushing back on it, letting yourself adjust, and then starting to pump.

8. At this point your asshole is really loose, gaping even, and it's time to move on to my favorite part. Crouch down, or get into whatever position you feel comfortable with, and hold the fake cock in one hand and the wand in the other. Work the fake cock in and out, building the pace until you are doing a high intensity rectal plundering. Slide it in really deep, pause, then pull it out all the way - quickly jamming in the anal wand to fill its place. The rapid transition from smooth to ridged textures will send waves out of pleasure rippling through your entire body. Then give yourself a nice hard fuck with the anal wand, and repeat as many times as you'd like.

*In carrying out these steps - even if you take the dump at the beginning - you still might at some point fuck the shit out of yourself. This is why I recommend doing it in a bathtub, or on some other surface that is easy to clean. Now at first you might be squeamish about the poo, but I think that as you get hardcore into the pleasure of all this, you'll just naturally get desensitized. Kind of like a heroin addict quickly gets over his fear of needles.

In fact, I've found that the right kind of poo can easily be incorporated into the festivities. Sometimes while I'm pounding away I will feel a sudden rush of heat travel through my ass, and I'll know that I'm coating the dildo with a somewhat viscous liquid shit. At this point in the ass ramming, my pain tolerance is rather high, so I'll simply jam the shitty dildo back up my ass, and let the sudden decrease in lubrication create an effect similar to the aforementioned smooth-to-ridged transition. As a matter of fact, this is probably the most intense sensation that I've come across in my entire anal masturbatory experience.*

So that's how it's done. Quite the activity, I must say. Maybe next time you're feeling bored and restless, you can give it a shot. Unless you're a fucking prude, in which case I'd recommend suicide. Or do a goddamn crossword puzzle, I don't really care.

Re:Just think! (1)

WGFCrafty (1062506) | more than 6 years ago | (#23095854)

Why?

Re:Just think! (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#23096158)

probably because it feels good.

Re:Just think! (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#23097252)

I agree. Nothing like having your prostate stimulated. I love it when my girlfriend works her finger up my asshole. I wish she would toss my salad :( When I was 12 or so, my dog accidently took a few licks of my poopsy pop. Damn that felt good.

Re:Just think! (1)

Nullav (1053766) | more than 6 years ago | (#23097262)

+1, Informative!

Re:Just think! (0)

Anonymous Coward | more than 6 years ago | (#23097356)

I see you have been reading Neal Stephenson...

Re:Just think! (1)

rawdirt (464725) | more than 6 years ago | (#23095864)

already there?

I sent this drupal log trace to abuse@google.com

Message Duplicate entry cb489713c1455ab98723be737bfe8ca7 for key 1 query: INSERT INTO sessions (sid, uid, cache, hostname, session, timestamp) VALUES (cb489713c1455ab98723be737bfe8ca7, 0, 0, 66.249.85.133, , 1207635444) in /var/www/drupal-5.5/includes/database.mysql.inc on line 172.
Severity error
Hostname 66.249.85.133

no response from them yet, maybe a borked machine?

Re:Just think! (3, Informative)

Lillesvin (797939) | more than 6 years ago | (#23096324)

... maybe a borked machine?

Yeah, maybe your machine... That SQL-error looks more like bad session handling on the server hosting your Drupal installation than Google trying to do an SQL-injection... Actually, it looks nothing like an SQL-injection at all. MySQL is merely being asked to insert a duplicate value in a column specified as unique (`sid`), which it refuses because it's not unique. Don't expect an answer, since it's most likely not an error on Google's end.
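A minimal sketch of the failure described above, assuming SQLite as a stand-in for MySQL and a cut-down `sessions` table (the real Drupal schema has more columns). The helper name `insert_session` is made up for illustration. It shows only that a second INSERT with the same `sid` is rejected by the uniqueness constraint, with no injection involved:

```python
import sqlite3

# In-memory database standing in for the Drupal MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (sid TEXT PRIMARY KEY, uid INTEGER, hostname TEXT)"
)

def insert_session(sid, hostname):
    """Insert a session row; return False if the sid already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sessions (sid, uid, hostname) VALUES (?, 0, ?)",
                (sid, hostname),
            )
        return True
    except sqlite3.IntegrityError:
        # The same class of error as MySQL's "Duplicate entry ... for key 1".
        return False

first = insert_session("cb489713c1455ab98723be737bfe8ca7", "66.249.85.133")
second = insert_session("cb489713c1455ab98723be737bfe8ca7", "66.249.85.133")
row_count = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
```

The second call fails purely because the key is already present, which is consistent with a session-handling race on the server rather than anything the crawler sent.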

A little more on topic though, what exactly is Google looking for there? I mean, what content (of any interest to anyone) is hiding behind forms? Many sites that require registration (like NY Times (IIRC) and others) already check whether the UserAgent string is that of a Google crawler and, if so, let it index the content, so that people can search e.g. NY Times articles on Google but only read them if they register (or change their UserAgent string or use BugMeNot).

And how does Google make sure they don't end up accidentally editing a crapload of wikis by filling out random forms on random sites and hitting submit?

Re:Just think! (3, Funny)

AKAImBatman (238306) | more than 6 years ago | (#23095906)

Hmm... that reminds me of this DailyWTF [thedailywtf.com] . Who knew that Mr. Test User was such a big customer? :-P

Re:Just think! (1, Informative)

Anonymous Coward | more than 6 years ago | (#23096052)

I had a search not for "allinurl:select from where" but for "allinurl: delete from" ... throws up a bunch of phpBBAdmin pages with "Do you really want to do this" and "Yes" and "No" buttons .... which one will Google click :)

Bright Planet's DQM (3, Interesting)

eldavojohn (898314) | more than 6 years ago | (#23095788)

Several years ago, I tried a demo of Bright Planet's Deep Query Manager [brightplanet.com] that would essentially do these searches through a client on your machine in batch-like jobs. Oh, the bandwidth and resources you'll hog!

Their stats on how much of the web they hit that Google missed were always impressive (true or not) but perhaps their days are numbered with this new venture by Google.

Quite an interesting concept if you think about it. I always presupposed that companies would hate it but never got 'blocked' from doing it to sites.

Here, suck up my bandwidth without generating ad revenue! Sounds like a lose situation for the data provider in my mind ...

Re:Bright Planet's DQM (0)

Anonymous Coward | more than 6 years ago | (#23095922)

It doesn't bring in money directly, but getting those pages listed in Google will bring more people to your site, and from that comes more ad revenue.

Re:Bright Planet's DQM (2, Interesting)

menace3society (768451) | more than 6 years ago | (#23096436)

You could build a really interesting "Deep Web" crawler by ignoring robots.txt. In fact, an index just of robots.txt files would be pretty cool in its own right. Call it "Sweet Sixteen" (10**100 in binary) or something.

Re:Bright Planet's DQM (1)

cheater512 (783349) | more than 6 years ago | (#23097400)

The more content they have off your site, the more visitors they send.

The visitors *do* generate ad revenue. :)

Oops... (5, Funny)

JohnnyDanger (680986) | more than 6 years ago | (#23095790)

They just bought everything on Amazon.

Re:Oops... (4, Informative)

Bogtha (906264) | more than 6 years ago | (#23096200)

This won't post forms of that sort. In the blog post, they say that they are only doing this for GET forms, which are safe to automate as per the HTTP specification.

This is for things like product catalogue searches where you pick criteria from drop-down boxes. Not so common for run-of-the-mill e-commerce sites, but I've seen a lot on B2B sites.
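As a rough sketch of what crawling such a catalogue search could look like, assuming the crawler has already extracted each drop-down's option values from the HTML. The function name `candidate_urls`, the example URL and the field names are all made up for illustration; nothing here is taken from Google's actual implementation:

```python
from itertools import product
from urllib.parse import urlencode

def candidate_urls(action, fields):
    """Build one GET URL per combination of drop-down values.

    `fields` maps a form field name to the list of option values
    found in the page's HTML.
    """
    names = sorted(fields)  # fixed order so the output is deterministic
    urls = []
    for combo in product(*(fields[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(action + "?" + query)
    return urls

urls = candidate_urls(
    "http://example.com/catalogue",
    {"category": ["bolts", "nuts"], "size": ["m4", "m6"]},
)
```

The number of URLs grows multiplicatively with each extra drop-down, so a real crawler would presumably have to sample combinations rather than enumerate them all.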

Re:Oops... (2, Funny)

Firehed (942385) | more than 6 years ago | (#23096286)

HTTP spec be damned - has IE taught you nothing?

Re:Oops... (0)

Anonymous Coward | more than 6 years ago | (#23096320)

HTML != HTTP

Re:Oops... (1)

Jarjarthejedi (996957) | more than 6 years ago | (#23096958)

IE's horridness transcends the mere concept of acronyms.

Re:Oops... (1)

cheater512 (783349) | more than 6 years ago | (#23097418)

No, IE ignores chunks of HTTP as well as HTML.

Re:Oops... (5, Insightful)

orkysoft (93727) | more than 6 years ago | (#23097390)

Unfortunately, there are tons of sites whose developers did not understand the part about GET being for looking up stuff, and POST being for making changes on the server.

Re:Oops... (1)

UnderCoverPenguin (1001627) | more than 6 years ago | (#23097476)

This won't post forms of that sort. In the blog post, they say that they are only doing this for GET forms, which are safe to automate as per the HTTP specification.

I have seen plenty of forms that use get for commanding actions, including making purchases. (For example, I used to work for a web company; one of the page designers only ever used get - and then would wonder why my code replied with an error when his get requests exceeded the size limit.)

Re:Oops... (0)

Anonymous Coward | more than 6 years ago | (#23096392)

They just bought everything on Amazon
...using your credit card! Oops indeed!

Will it solve captchas? (4, Interesting)

lastninja (237588) | more than 6 years ago | (#23095806)

only half kidding

Re:Will it solve captchas? (1)

fishybell (516991) | more than 6 years ago | (#23095858)

Just what we need, some 'bot adding its insightful comments based on other words in the same document...then again, on most sites, would you be able to tell the difference between Google posting something and some 1337 kiddiez?!?!!1eleven?

Re:Will it solve captchas? (5, Funny)

skraps (650379) | more than 6 years ago | (#23096004)

Just what we need, some 'bot adding its insightful comments based on other words in the same document.
Are such questions on your mind often?

..then again, on most sites, would you be able to tell the difference between Google posting something and some 1337 kiddiez?!?!!1eleven?
What does that suggest to you?

Re:Will it solve captchas? (4, Funny)

urcreepyneighbor (1171755) | more than 6 years ago | (#23096328)

You whore! You told me you loved me, Eliza! You said you'd call!

Re:Will it solve captchas? (1)

Kemanorel (127835) | more than 6 years ago | (#23096346)

What does that suggest to you?
A new Turing test is needed?

It suggests... (1)

Web Goddess (133348) | more than 6 years ago | (#23096998)

...that Google's Deep Crawl is already emulating the kidd33z.

Forums? (5, Funny)

fishybell (516991) | more than 6 years ago | (#23095814)

Well, I certainly hope that they put in some decent smarts to prevent it from making posts onto forums, blogs, /., etc.


On the plus side, this should enable Google to get by the "Must be 18 to view" buttons ;)

Re:Forums? (2, Informative)

brunascle (994197) | more than 6 years ago | (#23095898)

as TFA states, it's only GET requests, not POSTs. so it would mostly be search queries.

Re:Forums? (1)

fishybell (516991) | more than 6 years ago | (#23095958)

...and porn. You can't forget the porn.

Re:Forums? (1)

MenTaLguY (5483) | more than 6 years ago | (#23096142)

Unfortunately a lot of developers misuse GET requests for actions which modify state. (I suppose this'll teach them...)

Re:Forums? (1)

Bogtha (906264) | more than 6 years ago | (#23096342)

The usual excuse for that is that they want a link — for aesthetic purposes, to put in an email, etc. If you're using a form anyway, those reasons disappear. I'm sure there are a few developers who screw this up, but it won't be anywhere near as common as the problems GWA uncovered.

HELLO I AM GOOGLEBOT (5, Funny)

Anonymous Coward | more than 6 years ago | (#23095828)

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

Re:HELLO I AM GOOGLEBOT (5, Funny)

Anonymous Coward | more than 6 years ago | (#23095848)

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

Re:HELLO I AM GOOGLEBOT (4, Funny)

Anonymous Coward | more than 6 years ago | (#23095896)

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

Re:HELLO I AM GOOGLEBOT (2, Interesting)

Anonymous Coward | more than 6 years ago | (#23095972)

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

Re:HELLO I AM GOOGLEBOT (0)

Anonymous Coward | more than 6 years ago | (#23095994)

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

Re:HELLO I AM GOOGLEBOT (1, Funny)

Anonymous Coward | more than 6 years ago | (#23096014)

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

Re:HELLO I AM GOOGLEBOT (1, Insightful)

Anonymous Coward | more than 6 years ago | (#23096120)

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

Re:HELLO I AM GOOGLEBOT (0)

Anonymous Coward | more than 6 years ago | (#23096920)

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

Forums, and "web 2.0" sites. (1)

PyroMosh (287149) | more than 6 years ago | (#23095830)

This brings up a concern from the description.

So Googlebot will come across a web page.
It follows a link.
The link leads to a page with a form.
Googlebot fills out the form based on content already on the site.
Googlebot clicks submit.
Googlebot goes to the next page, and continues to follow links.

The problem comes when that form was a post form like the one I am typing on right now for a forum, or some other type of form to create user generated content. This makes it seem like google will see the text box and input random content from the site, then post it.

What keeps googlebot from becoming a nonsensical spambot? Yes, you can use nofollow, but there is such a huge quantity of web forms that don't have that now because they've never needed it. Retrofitting all of them web wide is not the most realistic of goals.

Re:Forums, and "web 2.0" sites. (1)

Idiomatick (976696) | more than 6 years ago | (#23095872)

Google indexes more than any other search engine by expanding the web themselves. It was moving too slow for them.

Really though I don't think this will be a problem. People at Google are pretty smart and I'm sure they've thought of this. Even if you believe Google is evil, there's no evil corporate benefit to spamming garbled text to the entire internet.

Re:Forums, and "web 2.0" sites. (1)

mmkkbb (816035) | more than 6 years ago | (#23095962)

They will use Markov chains which may end up sounding more intelligent than many forum denizens. Fark, Free Republic, LGF, etc. won't even notice.

Re:Forums, and "web 2.0" sites. (1)

Nos. (179609) | more than 6 years ago | (#23096012)

Not only that, but suppose I search for something that is hidden behind a form. Assuming I click the link on the search results, I'm going to (most likely) be taken to an error page saying I have to fill out the form.

Re:Forums, and "web 2.0" sites. (1)

Simon (S2) (600188) | more than 6 years ago | (#23096020)

This makes it seem like google will see the text box and input random content from the site, then post it.
No. Googlebot will only do gets, not posts.

Re:Forums, and "web 2.0" sites. (1)

lordSaurontheGreat (898628) | more than 6 years ago | (#23096056)

...This makes it seem like google will see the text box and input random content from the site, then post it. ...


No, the Google Bot sees raw HTML and CSS code, plus maybe some basic JavaScript. All of /.'s new js additions will throw off the bots immediately.


In addition, you're forgetting that GET and POST are completely different things. They look identical to you, but the HTML is different, and it's not difficult to differentiate between them. One is <form method="GET"> and the other is <form method="POST">
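To illustrate how easy that distinction is to make in code, here is a minimal sketch using Python's standard `html.parser`. The class and function names are hypothetical, and the sample markup is invented. One detail worth knowing: a form with no `method` attribute defaults to GET in HTML.

```python
from html.parser import HTMLParser

class FormMethodParser(HTMLParser):
    """Collect the submission method of every <form> in a page."""

    def __init__(self):
        super().__init__()
        self.methods = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or empty method attribute means GET in HTML.
            method = dict(attrs).get("method") or "get"
            self.methods.append(method.lower())

def form_methods(html):
    parser = FormMethodParser()
    parser.feed(html)
    return parser.methods

sample = (
    '<form method="GET" action="/search"></form>'
    '<form method="POST" action="/comment"></form>'
    '<form action="/lookup"></form>'
)
methods = form_methods(sample)
get_only = [m for m in methods if m == "get"]
```

A crawler that only wants GET forms would simply skip anything whose method comes back as `post`.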

Re:Forums, and "web 2.0" sites. (1)

menace3society (768451) | more than 6 years ago | (#23096388)

I am tempted to copy and paste that and post it as my reply, but I think that would be insufferably clever. So, too, is referring to the fact that I could be insufferably clever, but choose not to be. Etc...

Re:Forums, and "web 2.0" sites. (1)

liquidpele (663430) | more than 6 years ago | (#23096796)

What keeps googlebot from becoming a nonsensical spambot? Yes, you can use nofollow, but there is such a huge quantity of web forms that don't have that now because they've never needed it. Retrofitting all of them web wide is not the most realistic of goals.
The captcha or other anti-bot mechanism. Any forum that can't stop a "good" bot is going to have spam all over it anyway from the "bad" ones...

And now that Captcha has been cracked... (1)

UnCivil Liberty (786163) | more than 6 years ago | (#23095836)

...Google will rule the world

good and bad (3, Insightful)

ILuvRamen (1026668) | more than 6 years ago | (#23095838)

Well first of all, it's about time they learn how to read advanced sites! If your site is dependent on input from the user to display content, you're basically invisible to google. Now all they need is something to read text in flash files and they've got something going. But on the other hand, this is almost auto-fuzzing which could be considered hacking and I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.

Re:good and bad (5, Insightful)

QuoteMstr (55051) | more than 6 years ago | (#23096022)

And should we not make any progress because we might step on a few toes while doing it? If Google can get into your uber-secret-private-database, so can a random user, or a random Russian cracker. Fix your damn site if you're worried about this particular attack.

Re:good and bad (4, Insightful)

Bogtha (906264) | more than 6 years ago | (#23096264)

Now all they need is something to read text in flash files and they've got something going.

They've indexed Flash for about four years now.

I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.

No doubt. There are a lot of clueless developers out there who insist on ignoring security and specifications time and time again. I have no sympathy for people bitten by this, you'd think they'd have learnt from GWA that GET is not off-limits to automated software.

Google, consider this... (0)

Kiralan (765796) | more than 6 years ago | (#23095868)

Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create? I speak mainly of feedback forms, e-mail signups, and the like. Also, what about the excess click-throughs that some websites may be paying an outside entity for? Finally, what of the time spent by IIS in examining the logs for yet another anomaly? Maybe these are unlikely possibilities, or maybe not, but it will come back to affect your image. Just a thought exercise: Consider the fun to be had in leading Google through dynamically generated pages, when a Google Deep Web crawler comes to visit >:-)

Re:Google, consider this... (1)

chris_mahan (256577) | more than 6 years ago | (#23095964)

The best would be for the app to be hosted on Google App Engine, then it would take the app down, and the culprit would be Google. So when Google comes and bills you for bandwidth, CPU and storage usage, you bill them right back, citing Bot activity.

hehehe

Re:Google, consider this... (3, Insightful)

poot_rootbeer (188613) | more than 6 years ago | (#23096134)

Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create? I speak mainly of feedback forms, e-mail signups, and the like.

If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.
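One way to picture the rule is a toy request dispatcher (not tied to any real web framework) in which the state-changing action refuses anything but POST. The paths, the `handle` function and the `subscribers` set are all invented for this sketch:

```python
# Toy in-memory "server state" for the illustration.
subscribers = set()

def handle(method, path, params):
    """Return (status, body) for a request; only POST may change state."""
    if path == "/subscribe":
        if method != "POST":
            # A crawler following GET links can never reach the mutation.
            return 405, "Method Not Allowed"
        subscribers.add(params["email"])
        return 200, "subscribed"
    if path == "/search" and method == "GET":
        # Read-only lookup: safe for bots to repeat as often as they like.
        return 200, "results for " + params.get("q", "")
    return 404, "Not Found"

crawler_hit = handle("GET", "/subscribe", {"email": "bot@example.com"})
count_after_crawler = len(subscribers)
user_hit = handle("POST", "/subscribe", {"email": "reader@example.com"})
```

With this split, a bot that only issues GET requests can hammer `/search` all day without ever adding a subscriber.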

ROBOTS.TXT & CONTENT="NOINDEX", "NOFOLLOW" (1)

Chyeld (713439) | more than 6 years ago | (#23096166)

http://www.robotstxt.org/ [robotstxt.org]

Dang, that was hard. Damn you, GOOGLE! Damn you to HELL! You blew it up! You finally blew up the web!

Or not.

This could cause problems (0, Redundant)

tehcmn (1192821) | more than 6 years ago | (#23095882)

They'll have to be careful how they go about this. If they start filling in forms with bogus data on blogs, forums etc., there are going to be a lot of pissed off website owners out there. Just imagine the number of admins who'll have to update their robots.txt for this. Just my 2c.

Re:This could cause problems (1)

profplump (309017) | more than 6 years ago | (#23096170)

They are only submitting forms with a GET method. According to the HTTP specs, GET requests should always be idempotent. If you've got forms that use the GET method and aren't idempotent you should *already* be taking extra precautions to avoid accidental use by bots and other automated tools.

What about register forms? (0, Flamebait)

Anonymous Coward | more than 6 years ago | (#23095916)

Does that mean I'll have to introduce methods that waste people's time in order to prevent google from registering on my site multiple times?

Re:What about register forms? (2, Informative)

stephanruby (542433) | more than 6 years ago | (#23097142)

Does that mean I'll have to introduce methods that waste people's time in order to prevent google from registering on my site multiple times?
Yes, if you require all your human visitors to read your robots.txt [robotstxt.org] , and then require them to check a checkbox to mean that they clearly read and understood the entire body of your robots.txt. Then yes, you'll have to introduce some sort of almost impossible-to-read translucent captcha written in classical Chinese.

I'm in your Intarwebs (2, Funny)

Mathus (941922) | more than 6 years ago | (#23095950)

Cracking your forms. Sorry, could not help myself.

robots.txt (4, Funny)

B3ryllium (571199) | more than 6 years ago | (#23095986)

Okay, so how long until the spec for robots.txt is updated to have a "DontBeStupid" directive?

Note to self... (3, Funny)

fahrbot-bot (874524) | more than 6 years ago | (#23095990)

our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML...

...post invoice forms ordering expensive items to be shipped to Google. Be sure to log incoming IP addresses for verification.

Heisenberg for web (1)

gmuslera (3436) | more than 6 years ago | (#23096074)

While you can already have links that perform actions and change information, submitting forms is a good recipe for massive changes, from comment spam to anything else; the sky is the limit.

Now you can't see what is on the web, by crawling, without changing it.

They'll make it your fault. (0)

Anonymous Coward | more than 6 years ago | (#23096078)

Here's how Google will respond when you complain to them about junk data in your forms: "We're sorry to hear about the problems with the way GoogleBot indexes your web site. Please note that GoogleBot strictly follows the robots exclusion standard and found no indication that your forms were not suitable for being accessed by automated processes. To avoid unwanted accesses, please update your robots.txt to correctly indicate which forms you don't want to be accessed by GoogleBot. Our webmastertools-service can help you make these updates."
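For anyone who wants to check what such a robots.txt would actually permit, Python's standard `urllib.robotparser` can evaluate rules offline. The rules and URLs below are invented for illustration; this only shows how an exclusion for a form directory would be read by a well-behaved bot, not how Googlebot itself is implemented:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps Googlebot out of the form handlers.
robots_txt = """\
User-agent: Googlebot
Disallow: /forms/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A form submission URL under /forms/ is off-limits to Googlebot...
forms_ok = parser.can_fetch(
    "Googlebot", "http://example.com/forms/search?q=test"
)
# ...while ordinary content stays crawlable.
articles_ok = parser.can_fetch("Googlebot", "http://example.com/articles/1")
```

Of course this only helps against crawlers that choose to honour the file.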

Google() {Google();} (1)

unforkable (956731) | more than 6 years ago | (#23096114)

So it will search recursively through .... Google. Or probably benefit from altavista/yahoo/... results. (just joking).

Re:Google() {Google();} (1)

TheRaven64 (641858) | more than 6 years ago | (#23096482)

I'm surprised it isn't doing this already, from the number of 'search results' pages an average Google search turns up.

The Internet is for Porn (5, Funny)

kiehlster (844523) | more than 6 years ago | (#23096118)

If you haven't already noticed, AdSense has features now to tell Google how to log into your website so it can catalog your user-only pages. You know what that means. Porn sites are going to start using this so that Googlebot can confirm that its age is over 18. We'll be showered with a gigantic wave of pornographic information. We will soon have to press juvenile charges against a corporate entity because it lied about its age on web forms to gain access to pornography and forum discussions.

directions like 'nofollow' are still respected (5, Informative)

frovingslosh (582462) | more than 6 years ago | (#23096160)

Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search.

Maybe they shouldn't be, at least not in all cases. Several years back I had done many Google searches for some information that was very important to me, but never could find anything. Then a few months later (too late to be of use), pretty much by a fortunate combination of factors but with no help from Google, I came across the exact information, on a .GOV website in a publicly filed IPO document. As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record. When these nofollow directives are overused by mindless and unaccountable bureaucrats, perhaps someone needs to make the decision that these records should be public and that isn't best served by hiding them deep down a long list of links where they are hard to locate. In cases like this I would applaud any search engine that ignores the "suggestion" not to index public pages just because of an inappropriate tag in the HTML. In fact, if I knew of any search engine that was indexing in spite of this tag, I would switch to them as my first choice search engine in an instant. For starters, I would suggest that any .GOV and any State TLD website should have this tag ignored unless there were darn good reason to do otherwise.

Re:directions like 'nofollow' are still respected (2, Insightful)

QuantumHobbit (976542) | more than 6 years ago | (#23096432)

But they don't want you to find out that the moon landing was faked and that Jimmie Hoffa shot Kennedy while driving a car that runs on water. I agree with you. If you don't want people to know something don't put it on the web. If you want people to know put it on the web and let google send the people to you. It's all bureaucracy inaction.

Re:directions like 'nofollow' are still respected (3, Interesting)

Christophotron (812632) | more than 6 years ago | (#23096520)

As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record.

I'd mod you up if I had some points. I'm sure there are ethical implications or something when it comes to respecting the website owner's wishes not to index, but it's all public information anyway. If it's on the web and I can look at it, then Google should be able to look at it and index it.

I had no idea that government sites don't allow themselves to be indexed. That is BULLSHIT. People often NEED information from .gov sites and ALL of it should be made easy to find. Refusing to allow indexing such information is akin to hiding or obfuscating it: you don't actually want anyone to read it or anything, but you can say it's available on the web so your ass is covered. IMO there should be a law stating that all of .gov MUST be indexed by search engines.

Is there a law saying that search engines MUST follow these robots.txt, nofollow, etc? If it's not breaking the law, then Google should have some serious competition. A new search engine that indexes ALL VIEWABLE SITES regardless of the owner's wishes would be fucking great.

sites can still be excluded (1, Flamebait)

nurb432 (527695) | more than 6 years ago | (#23096184)

Wimps. Index it all, who cares if the site doesn't want it. If it's public-facing, it deserves to be indexed.

Re:sites can still be excluded (1)

Christophotron (812632) | more than 6 years ago | (#23096626)

+1 hell yes. Sure, indexing may be abused at the expense of clueless web developers, but they'll clean up their act very rapidly in the wake of all the security breaches.

Non-indexing may be abused as well. As someone said in an earlier comment, .gov sites like to disallow indexing. What possible purpose could this serve other than to make people's lives miserable when dealing with the government?

IANA(web developer), but I never understood the point of robots.txt crap. Why put the site up if you don't want people to find it?

Re:sites can still be excluded (1)

Omnius (1242594) | more than 6 years ago | (#23096906)

IANA(web developer), but I never understood the point of robots.txt crap. Why put the site up if you don't want people to find it?
Then as a web developer I'll explain it to you. I pay for the storage of my web site as well as every single byte that goes in or out of my web site (bandwidth). So, every query that is done against my web site by a query engine (or a user) costs me money. Generally I am willing to spend that money to get my content indexed in the various search engines, but it should be MY CHOICE since it is MY MONEY. The way I limit that today is using robots.txt and other techniques. Now, if the search engine wants to pay me to index my content, that's another thing entirely.

BTW, I agree that government sites should not use exclusion rules for public data, but the right thing to do is to complain to the oversight committee for that government web site, not blame the search engines.

Re:sites can still be excluded (1)

nurb432 (527695) | more than 6 years ago | (#23096940)

Then don't put your site public-facing.

Re:sites can still be excluded (0)

Anonymous Coward | more than 6 years ago | (#23097386)

The robots exclusion standard (aka robots.txt) was designed to prevent technical problems which occurred when automated processes tried to crawl very large URL spaces, like those of database backed websites with dynamically generated URLs. This original intention of the robots exclusion standard is still used today to trap spiders which ignore the standard: Some websites have infinite junk page trees linked invisibly and excluded in robots.txt. So instead of just blocking the IP address of an ignorant spider, which doesn't help much against distributed spiders running on botnets, these websites "poison the harvest", so to speak.
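For anyone who hasn't seen what honoring robots.txt looks like from the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user agent, and the example.com URLs are made up for illustration; this is not how any particular search engine implements it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers out of a dynamic search URL space
# and out of a junk "trap" tree like the one described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /trap/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """A polite crawler asks this before every request."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "ExampleBot", "http://example.com/catalog/item1"))
print(may_fetch(parser, "ExampleBot", "http://example.com/search?q=Ctnblnd"))
```

A spider that skips this check is exactly the kind that wanders into the excluded junk tree.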

stores (1)

sveard (1076275) | more than 6 years ago | (#23096208)

How about online stores? Google is going to get some merchandise...

Fuzzing the world (2, Insightful)

corsec67 (627446) | more than 6 years ago | (#23096222)

Sweet, now Google will be Fuzzing [wikipedia.org] the entire web.

How will this work for forms that perform translations, validations and similar kinds of operations on other websites? Try to pull the entire internet through each such site it finds?

And not every web development environment prevents GET requests from changing data. In Ruby on Rails, adding "?_method=post" to the end of a URL fakes a POST, even though it is actually a GET; I disabled that at the company I work for. Not everyone is going to do that.

Re:Fuzzing the world (2, Insightful)

Bogtha (906264) | more than 6 years ago | (#23096462)

In Ruby on Rails, adding "?_method=post" to the end of a URL fakes a POST, even though it is actually a GET; I disabled that at the company I work for. Not everyone is going to do that.

More precisely: Not everyone has been doing that. I'm sure when Google comes along and exposes all their bugs they will quickly take the hint.

I don't really see the problem. The developers who know what they are doing, like you, won't be adversely affected, while the incompetent developers have to scurry around fixing their bugs every time something like this happens.

Evil Bot (1)

Arancaytar (966377) | more than 6 years ago | (#23096336)

For text boxes, our computers automatically choose words from the site that has the form


And a few relevant URLs from helpful sponsors?

Now you just need to hire a few sweatshop workers to get past those pesky captchas...

mod 04 (-1, Troll)

Anonymous Coward | more than 6 years ago | (#23096352)

To the 4olitically most. Look at the SALES AND SO ON, lizard - In other dying' crowd - Provide sodas,

Username == Username (1)

QuantumHobbit (976542) | more than 6 years ago | (#23096378)

This explains the sudden increase in users registered as Username==Username, with Password ==Password across the interwebs. To reply please send an email to name@domain.com

Anecdote from Google (5, Funny)

arrrrg (902404) | more than 6 years ago | (#23096424)

When I interned at Google, someone told me a funny anecdote about a guy who emailed their tech support insisting that the Google crawler had deleted his web site. At first, I think he was told that "Just because we download a copy of your site, doesn't mean your local copy is gone." (à la the obligatory bash [bash.org].) But the guy insisted, and finally they double-checked and his site was in fact gone. Turns out that it was a home-brewed wiki-style site, and each page had a "delete" button. The only problem was, the "delete" button sent its query via GET, not POST, and so the Google spider happily followed those links one-by-one and deleted the poor guy's entire site. The Google guys were feeling charitable and so they sent him a backup of his site, but told him he wouldn't be so lucky the next time, and he should change any forms that make changes to POSTs -- GETs are only for queries.

So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data on random forms on the web. Like the above guy, people may not design their site with such a spider in mind, and despite their lack of foresight this could kill a lot of goodwill if done improperly.
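The fix the Google guys recommended can be sketched in a few lines of Python. This is a toy dispatcher, not the guy's actual wiki: the `pages` dict, the `/view/` and `/delete/` paths, and the `handle` function are all hypothetical, and a real site would also want authentication on the delete.

```python
# Hypothetical in-memory wiki: page name -> page text.
pages = {"Home": "welcome", "About": "about us"}

# Methods that must never change server state.
SAFE_METHODS = {"GET", "HEAD"}

def handle(method: str, path: str) -> int:
    """Return an HTTP status code for /view/<page> and /delete/<page>."""
    action, _, name = path.strip("/").partition("/")
    if name not in pages:
        return 404
    if action == "view":
        return 200 if method in SAFE_METHODS else 405
    if action == "delete":
        # A spider following links only issues GETs, so it bounces off here.
        if method != "POST":
            return 405
        del pages[name]
        return 200
    return 404

print(handle("GET", "/delete/Home"))   # crawler follows the link: refused
print(handle("POST", "/delete/Home"))  # deliberate form submission: deleted
```

With the delete behind POST, a crawler that follows every link it finds gets a 405 instead of wiping the site.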

Re:Anecdote from Google (1)

Arimus (198136) | more than 6 years ago | (#23096622)

end and submitting random data on random forms

Sod worrying about zapping sites, what will happen when they crawl the nuclear launch site and enter random data into the authorisation field, and in a rare feat of sod's law end up getting the code just right....

(oh and what's the betting they'll put redmond in as a target string?)

Re:Anecdote from Google (0, Troll)

Colonel Korn (1258968) | more than 6 years ago | (#23096670)

The world needs web hosts that block all Google IPs.

Re:Anecdote from Google (0)

Anonymous Coward | more than 6 years ago | (#23097064)

They already exist, but you probably can't find them... since they're not in Google.

Re:Anecdote from Google (1)

IdeaMan (216340) | more than 6 years ago | (#23097372)

Google-fu young padawan lacking are.

What's the use? (1)

cppgenius (1009857) | more than 6 years ago | (#23096442)

The stuff behind forms is normally of no use to someone doing a search on Google, so how does this fit in with their "Do-No-Evil" motto? What's the use of indexing a confirmation page to a support ticket system? Is someone going to do a search for: "A support ticket has been created. Your reference number is .... bla... bla... bla..."? Anyway, how do you expect someone to visit a dynamic confirmation page without filling out a form? Is Google going to hack our CAPTCHA scripts and anti-spam measures just to get past our forms? "Nevertheless, directions like 'nofollow' and 'noindex' are still respected". It's like the stupid CAN-SPAM law: spammers are allowed to spam us until we tell them to stop. Google automatically allows itself the privilege of fiddling with our e-mail forms until we tell them to stop.

Correct me if I am wrong........ (1)

Anachragnome (1008495) | more than 6 years ago | (#23096468)

.....The first thing that popped into my head was someone out there figuring out how to use this to access password protected sites/accounts.

"Hey! Look at this! I googled "World of Warcraft Forums" and just got 10 million hits, all logged in as a user!"

Saw this a few months ago.. (1)

Kenny Austin (319525) | more than 6 years ago | (#23096546)

I saw this a few months ago while grepping through our apache log. Googlebot was submitting search requests for some weird stuff to our online catalog (for example: "Ctnblnd"). After some research I found that Googlebot was the only client which had ever searched for most of these terms and that they were abbreviations that our accounting department uses. I was guessing that they were doing something like this in the lab for words that they "didn't know" but ultimately put our search url into robots.txt because I didn't like our search results showing up in theirs.

Re:Saw this a few months ago.. (1)

corsec67 (627446) | more than 6 years ago | (#23096660)

... I didn't like our search results showing up in theirs.

And I hate it when a search result goes to... another page of search results. "You searched for 'perpetual motion engine'. Here are links to pages of us doing that search on other sites as well." Not very useful.

It isn't easy to programmatically tell the difference, but it seems like this would make that happen much more often.

I bet I know what's next (1)

93 Escort Wagon (326346) | more than 6 years ago | (#23096710)

In a few months, there'll be a new blog post - Google will attempt to access and index all sites' password-protected pages by matching usernames found elsewhere on the site (e.g. from email addresses) with intelligent guesses at passwords, based on information it's gleaned regarding those individuals. Failing that, it'll run through entries found in various cracker dictionaries.

In other news, (1, Insightful)

mbstone (457308) | more than 6 years ago | (#23096884)

Google has announced that Google Phones (beta) will soon unveil the results of its having wardialed all 6,800,000,000 U.S. telephone numbers. Visitors to the Google Phones site will be able to search individual phone numbers to determine (without personally dialing the number) whether the number belongs to a landline telephone, cell phone, fax, or modem.

On phone numbers where a VMS is detected, Google plans to dial "#0#" and other codes in order to determine how to reach a human.

"Since we are a big, rich entity, the laws don't apply to us. We can do black-hat hacking exploits that would cause law enforcement to raid your home if you did the same thing," said a Google spokesman.

Re:In other news, (0, Troll)

Rogan's Heroes (1274232) | more than 6 years ago | (#23096956)

Well, if you're a stupid enough developer that someone can hack your site by purely using GET requests, then you probably deserved it.

Uh, you missed one critical point (1)

cppgenius (1009857) | more than 6 years ago | (#23096934)

As per Google Webmaster Central:

"Similarly, we only retrieve GET forms and avoid forms that require any kind of user information. For example, we omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc. We are also mindful of the impact we can have on web sites and limit ourselves to a very small number of fetches for a given site."

Stuff like login forms, contact forms or forms for user generated content should be using the POST method not GET, so there shouldn't be any concern for web developers who know how to design their sites. If you are using GET in the wrong places, then it is your own fault.
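To make the quoted rule concrete, here is a rough sketch of that kind of filter using Python's standard `html.parser`. It is only my reading of the quoted paragraph, not Google's code: the `PERSONAL_TERMS` list, the `FormScanner` class, and the sample HTML are all assumptions.

```python
from html.parser import HTMLParser

# Assumed word list; the quoted post only says "terms commonly associated
# with personal information such as logins, userids, contacts, etc."
PERSONAL_TERMS = ("login", "userid", "user", "contact", "email", "password")

class FormScanner(HTMLParser):
    """Collect each form's method and its (type, name) input pairs."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # HTML forms default to GET when no method is given.
            method = (attrs.get("method") or "get").lower()
            self._current = {"method": method, "inputs": []}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            input_type = (attrs.get("type") or "text").lower()
            name = (attrs.get("name") or "").lower()
            self._current["inputs"].append((input_type, name))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None

def is_crawlable(form: dict) -> bool:
    """GET only, no password inputs, no personal-looking field names."""
    if form["method"] != "get":
        return False
    for input_type, name in form["inputs"]:
        if input_type == "password" or any(t in name for t in PERSONAL_TERMS):
            return False
    return True

def crawlable_forms(html: str) -> list:
    scanner = FormScanner()
    scanner.feed(html)
    return [is_crawlable(form) for form in scanner.forms]

SAMPLE = """
<form action="/search" method="get"><input type="text" name="q"></form>
<form action="/login" method="post"><input name="user"><input type="password" name="pw"></form>
<form action="/subscribe"><input type="text" name="email"></form>
"""
print(crawlable_forms(SAMPLE))
```

Only the plain search form passes; the login form fails on both method and password field, and the subscribe form fails on its field name even though it defaults to GET.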

The moral of this story: read the actual post/article in detail before overreacting to something posted out of context by a slashdotter. (And yes, I'm also guilty of this.)

Opt IN (1)

Omnius (1242594) | more than 6 years ago | (#23097224)

It seems to me that a much better plan would be to extend robots.txt to include a way for web sites to OPT IN to having their form-fronted "deep" data indexed. This would make sure that only sites that are ready for this kind of intrusion (and have data worth indexing behind their forms) get indexed. Why go for an OPT OUT methodology when the vast majority of the forms on the web front for stuff that wouldn't benefit from indexing?

Also, note that Google is not being altruistic when they say they will only process GET forms. From a programming POV, it is no harder for them to submit POST forms than it is to submit GET forms. The problem is that they index their resulting data by storing URLs (which a GET request provides and a POST does not), so they would have no way to redirect a person clicking on the result list from Google to the POST form results (that's just not supported by the browsers). So we are talking about a technical limitation, not an altruistic self-limitation.
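To illustrate why a GET submission is linkable, here is a small Python sketch with the standard `urllib.parse`. The catalog URL and field names are invented, and the `get_form_url` helper assumes the form's action has no query string of its own.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def get_form_url(action: str, fields: dict) -> str:
    """A GET form submission is just the action URL plus the encoded fields."""
    # Assumption: `action` does not already contain a "?" query string.
    return f"{action}?{urlencode(fields)}"

# Hypothetical catalog search form filled in by a crawler.
url = get_form_url("http://example.com/catalog/search",
                   {"q": "cotton blend", "page": "2"})
print(url)

# Anyone holding the URL can recover the same submission later.
print(parse_qs(urlsplit(url).query))
```

The whole submission lives in the URL, so a result page can link straight to it. A POST puts the same fields in the request body, leaving nothing for a plain link to carry.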
