Google Buys reCAPTCHA For Better Book Scanning

CmdrTaco posted more than 5 years ago | from the when-spammers-give-you-lemons dept.

Google 138

TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to ReCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."

Imagine! (1)

rallymatte (707679) | more than 5 years ago | (#29453055)

How slow is searching the internet going to be if you have to fill out a stupid obscured word each time?!

Re:Imagine! (1, Insightful)

Anonymous Coward | more than 5 years ago | (#29453103)

As slow as searching most forums

Re:Imagine! (2, Insightful)

natehoy (1608657) | more than 5 years ago | (#29454787)

Google's probably not going to add this to their default search engine. They've already got a good audience using this where it's appropriate - to keep spambots from joining or posting to forums or in other contexts where you want to determine if your web client is human or bot.

Google SEARCH exists and is popular because it's fast and convenient. I can't see them adding a 2-word CAPTCHA to do a simple search only because that would drive search traffic (which is already very profitable) to their competition.

Google is very, very clever at designing mutually beneficial arrangements. They craft all of their products so the user is receiving some significant benefit in return for the information or work they provide to Google. reCAPTCHA only provides a benefit when users see a forum is pretty clean from spam and crap because CAPTCHA is there, so they'll go to the effort of joining those forums. Forum master and user both see a tangible benefit - reduced spam - and will happily compensate google with 5 seconds' work.

Well... (4, Interesting)

vikhyat (1593841) | more than 5 years ago | (#29453083)

This should improve Google's indecipherable CAPTCHA.

Re:Well... (0)

Anonymous Coward | more than 5 years ago | (#29454149)

As long as it doesn't replace them with one i had from ReCAPTCHA that day, ooowee that was a toughy.
I just can't seem to place my finger on it... pretty sure i was drunk.

WTF Summary (0, Troll)

afxgrin (208686) | more than 5 years ago | (#29453087)

How does solving a captcha help the database? That doesn't make ANY sense at all - a captcha needs to be solved beforehand to make sure that the user authenticates the correct word. You don't just type into the captcha input box any random word, and it lets you through!

Heh I can just see these spamming guys trying to modify an OCR system for captcha breaking, and suddenly realizing they can just input any word.

Re:WTF Summary (5, Informative)

duguk (589689) | more than 5 years ago | (#29453125)

You're asked to enter TWO words; one known; one not.

From: recaptcha.net [recaptcha.net] :
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
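
The pairing described in that quote can be sketched roughly as follows. This is a minimal illustration, not reCAPTCHA's actual code: the class name, the `required_votes` threshold of 3, and the case-insensitive comparison are all assumptions made for the example.

```python
from collections import Counter, defaultdict


class TwoWordChallenge:
    """Hypothetical sketch of a control-word / unknown-word CAPTCHA."""

    def __init__(self, required_votes=3):
        # required_votes is an assumed threshold, not a documented value.
        self.required_votes = required_votes
        self.known = {}                    # image id -> verified text
        self.votes = defaultdict(Counter)  # image id -> tally of guesses

    def add_known(self, image_id, text):
        """Register a word whose answer has already been verified."""
        self.known[image_id] = text.lower()

    def submit(self, control_id, control_answer, unknown_id, unknown_answer):
        """Return True if the user passes; record the guess only on a pass."""
        if control_answer.strip().lower() != self.known[control_id]:
            return False  # failed the checkable word, so the guess is discarded
        guess = unknown_answer.strip().lower()
        self.votes[unknown_id][guess] += 1
        # Promote the unknown word once enough passing users agree on it.
        if self.votes[unknown_id][guess] >= self.required_votes:
            self.known[unknown_id] = guess
        return True
```

The user is never told which of the two images is the control, so a guess for the unknown word only counts when it arrives alongside a correct control answer.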

Re:WTF Summary (4, Insightful)

iamhassi (659463) | more than 5 years ago | (#29453263)

"Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. "

That explains why half the time I can't even read the word. I swear every time I reach a captcha I have to refresh it 5x before I finally land on two words I can read.

I must say this system is ingenious. Distributed OCR: let millions of internet users figure out what the words are. Maybe next election when there's hanging chads [wikipedia.org] they can use that as a captcha.

Mod up (1)

Anne Honime (828246) | more than 5 years ago | (#29453391)

I totally agree, this is pure genius. Distributed Human-engined OCR is certainly the best solution to traditional OCR problems, and at the same time it leaves many doors to unforeseen traps ajar.

Re:Mod up (5, Interesting)

mrcaseyj (902945) | more than 5 years ago | (#29453667)

I agree that the idea is ingenious. But on the only one I ran into, the word was completely indecipherable. I don't mean that it was really hard, I mean that it was a word so thoroughly mangled that it was clearly impossible to read by anyone, especially without context. The lack of context is one of the big weaknesses of the system. When a word is unclear, it's the words around it that give critical clues to what it is.

Re:Mod up (2, Insightful)

Chabil Ha' (875116) | more than 5 years ago | (#29454219)

Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong.

Re:Mod up (2, Funny)

Anonymous Coward | more than 5 years ago | (#29454463)

Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong.

...and increasing the rate of people saying "F- it, the captcha should not be longer than my comment." - hence the limit of two words to allow for "me too!" comments.

Re:Mod up (2, Funny)

Kozz (7764) | more than 5 years ago | (#29454469)

Which gives rise to the question...

Don't you mean, "Which begs the question..."?!

(ducks)

Re:Mod up (1)

Tim C (15259) | more than 5 years ago | (#29454753)

Because having to read and enter a single, hard to read word is enough hassle for most people; two is stretching it. An entire sentence would be too much.

Re:Mod up (1)

hipifreq (1323407) | more than 5 years ago | (#29454683)

But if the people behind reCaptcha are really doing this well, then they remember the words that you refreshed. If a word gets refreshed enough then a real human can go to the real book and figure out the meaning of the word.

Re:WTF Summary (1)

hansamurai (907719) | more than 5 years ago | (#29453467)

I find reCaptcha highly readable. This isn't like other captcha techniques where there are really thin letters and random objects strewn about; it's just blurry, zoomed-in typewritten words that are hard for a computer to distinguish.

Re:WTF Summary (2, Interesting)

slyborg (524607) | more than 5 years ago | (#29453687)

I still don't get it. How do you know that the person correctly identified the second word? I don't see how a priori decoding the first word means that the second was correct. I would expect that the individual bad data rate from this technique would be substantial.

I do enjoy the fact that Google, a ridiculously profitable company by virtue of its near-monopoly on Internet search advertising, is using the public who pays it via these ad impressions to do its work for free, and using the technique invented and used by spammers to crowd-source solve CAPTCHAs to get into Gmail and the like!

Re:WTF Summary (1)

mckinleyn (1288586) | more than 5 years ago | (#29453757)

Because it presents the same words to many, many people. Yes, 10 people can all be wrong, but how likely is it that more than half of 100 people are all wrong in exactly the same way?
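
That intuition can be put into numbers with a binomial tail. The sketch below assumes, purely for illustration, that each user independently types one particular wrong answer with probability `p`; the function name and the value 0.1 are assumptions, and a coordinated group typing the same word on purpose would break the independence assumption.

```python
from math import comb


def prob_majority_same_wrong(n, p):
    """P(more than half of n independent users give one specific wrong answer).

    p is the assumed chance that a single user types that wrong answer.
    """
    threshold = n // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


if __name__ == "__main__":
    for n in (10, 100):
        print(n, prob_majority_same_wrong(n, 0.1))
```

Under these assumptions the probability for 10 users is already small, and for 100 users it is many orders of magnitude smaller still.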

Re:WTF Summary (1)

Ziwcam (766621) | more than 5 years ago | (#29453779)

It's not necessarily the second word that's unknown.

Re:WTF Summary (2, Insightful)

Chyeld (713439) | more than 5 years ago | (#29453815)

You don't assume.

For the purposes of captcha, typing one word correct suffices. As long as you get the right word (the known 'good' word) correct.

For the purposes of distributed OCR, the "how do you know if the unknown word was ID'ed correctly" issue is simply solved by having the word ID'ed several times. Given you don't know which word is the 'test' word and which is the one actually needing IDing, there shouldn't be a problem with people guessing "Penis!" or "Boobies!" all the time.

So as long as a majority of the people ID the word the same way, you can have a high level of confidence that it's being ID'ed correctly.

Re:WTF Summary (0, Offtopic)

Rich0 (548339) | more than 5 years ago | (#29453959)

Maybe next election when there's hanging chads they can use that as a captcha.

It would certainly be a lot more fair than the current process - which is a bunch of cronies each interpreting the results to their preferred candidate's advantage, and then a judge settles it.

Of course, the better solution is to not have such ambiguity in the first place.

If you wanted to implement a system for interpreting analog votes here is what I'd do:

1. All ambiguous votes are digitized. Of course, the definition of "ambiguous" is itself ambiguous - if somebody solidly fills in one circle and leaves one dot in another, is that ambiguous? What constitutes a stray mark vs a double-vote? I guess you could err on the side of caution, or maybe put all votes through the digitizer.

2. The digitizer chops up each vote into individual boxes and then presents them to a user in random order. For example, if the Gore box is on the left on the ballot, it could be on the left or on the right in the presented ballot.

3. The human interprets the vote. They have no cues to actually determine who the vote is for - just whether a given box was selected.

4. Each vote is given to sufficient numbers of people that a high-confidence vote can be selected. If you get 3 people who agree then maybe that's enough. If you get any disagreements maybe you keep asking for opinions until one response has a significant margin. Maybe votes are tossed entirely at some threshold.

The key is that those looking at ballots should not be able to tell which boxes correspond to which candidates. That will eliminate the bias from the system.
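
Steps 2 to 4 above could be sketched like this. It is only an illustration of the idea: the function name, the `min_lead` margin of 3, and the `max_reviews` cutoff of 9 are hypothetical, and each reviewer is modeled as a function that sees nothing but the shuffled box images and returns the position it judges to be marked.

```python
import random
from collections import Counter


def tally_ballot(ballot_boxes, reviewers, rng, min_lead=3, max_reviews=9):
    """Ask reviewers until one candidate leads by min_lead, else toss the ballot.

    ballot_boxes: dict mapping candidate -> image of that candidate's box
    reviewers: iterable of callables taking a list of images, returning an index
    """
    votes = Counter()
    for count, reviewer in enumerate(reviewers, start=1):
        items = list(ballot_boxes.items())
        rng.shuffle(items)  # position no longer reveals the candidate
        images = [image for _, image in items]
        chosen = reviewer(images)          # reviewer sees images only
        votes[items[chosen][0]] += 1       # map the position back to a candidate
        ranked = votes.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= min_lead:
            return ranked[0][0]
        if count >= max_reviews:
            break
    return None  # still ambiguous at the cutoff: the vote is tossed
```

With no disagreement, three reviewers settle the ballot; any disagreement means more opinions are collected until one reading leads by the margin or the cutoff is reached.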

Again, in my opinion computers should generate human-readable ballots - so that the computer validates the ballot BEFORE the voter submits it. No issue with stray marks if there are no pencils in the room.

Re:WTF Summary (0, Troll)

melikamp (631205) | more than 5 years ago | (#29455741)

I must say this system is ingenious.

I respectfully disagree. I hate CAPTCHA because it discriminates against AI. Instead, Web-based systems should be designed to accommodate AI participants. I hate reCAPTCHA even more because it is even more annoying and I have no idea who I am working for. I always intentionally smash the keyboard with my palm for the second word. I think that tricking people into working for you is by far the least decent way of distributing this process. It would be better to have an "OCR box" which has nothing to do with CAPTCHA and is known to be a part of a copyleft or public domain project, like Wikimedia. It should display, as others have suggested, single sentences or sentence fragments, so that the reader can use the context, and it should be completely unrelated to CAPTCHA, which is just a discriminatory practice, and, as such, unethical.

Re:WTF Summary (4, Informative)

Sockatume (732728) | more than 5 years ago | (#29453357)

The best part is, it automatically selects for words which are invulnerable to OCR-based attacks. And if the user's presented with an illegible scanned CAPTCHA, they aren't penalised for getting it wrong.

Re:WTF Summary (1)

Kaetemi (928767) | more than 5 years ago | (#29453661)

Well, yeah, but the OCR attacker also just needs to get the OCR readable word right...

Re:WTF Summary (1)

Sockatume (732728) | more than 5 years ago | (#29453849)

That'd involve designing a pattern-recognition system which can reliably decide which of two OCR words is less readable, mind you.

Re:WTF Summary (1)

b4dc0d3r (1268512) | more than 5 years ago | (#29454125)

One KNOWN, one not. The known word is not necessarily going to be OCR readable... you can seed the database with 100 or so images which are known, but maybe not OCR readable. Of course it works better if the known words are NOT OCR readable.

The point is OCR can have typos as well, so just because OCR returns a result doesn't mean it should be trusted. The known word of the two is likely independently analyzed, probably by a human.

Once enough people put the same answer for an unknown word, it becomes trustworthy. That is not easy to hack by making repeated requests with your OCR tool (which does not get GOOD results, but does get CONSISTENT results, therefore the same answer each time) and putting incorrect answers in the database - one of the millions of human users will likely get one of the words being attacked, and respond differently. So you will have several different answers and no clear winner, leaving it an unknown word.
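
To illustrate the "no clear winner" outcome, here is a small sketch of an acceptance rule. The thresholds (`min_votes=10`, `min_share=0.75`) and the function name are assumptions chosen for the example; nothing in the thread says what reCAPTCHA really uses.

```python
from collections import Counter


def accepted_answer(answers, min_votes=10, min_share=0.75):
    """Return the winning answer, or None if there is no clear winner yet."""
    if len(answers) < min_votes:
        return None  # not enough responses to trust any answer
    tally = Counter(a.strip().lower() for a in answers)
    top, count = tally.most_common(1)[0]
    return top if count / len(answers) >= min_share else None


if __name__ == "__main__":
    humans = ["morning"] * 6
    bot = ["m0rn1ng"] * 6  # an OCR bot repeating the same wrong reading
    print(accepted_answer(humans + bot))
```

A bot that consistently submits the same misreading ends up splitting the tally with human answers rather than winning it, so the word stays unverified under this rule.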

Re:WTF Summary (1)

Granis (92074) | more than 5 years ago | (#29453551)

That's really interesting. I've always wondered why I have passed these CAPTCHAs even when I had to make wild guesses on some of the words because they were so hard to read.

However, how long will it be before a lot of users realize that it is irrelevant what you enter for the unknown word? Even if you don't know for sure which of the words is the unknown one, knowing the above I think the risk is high that you just type nonsense if you can't read one of the words.

If enough people do this the system will be quite ineffective. reCAPTCHA will probably not accept the wrong solution very often, but it will take a lot of time to get enough users with the same solution to accept it. But with a massive amount of users, even a small amount of the total might be enough to keep it running?

Re:WTF Summary (1)

bami (1376931) | more than 5 years ago | (#29453833)

That is what happened with the Anonymous attack on the Time poll, with the 'penis' attack.

They looked at both words, saw which one was the least readable, filled in the good one, and filled in 'penis' for the other, in the hopes of poisoning the database so that they would only have to enter the first word correctly.

Would be kind of amusing to see a couple of books showing up on Google Books with the word 'penis' randomly inserted in pages where reCaptcha was used.

Re:WTF Summary (1, Interesting)

Anonymous Coward | more than 5 years ago | (#29453141)

As a control, the system sends out one word that it knows the answer to. You don't know which of the two is the unknown word beforehand. Also, I think that the same unknown word is kept in rotation for a couple of iterations just to double-check that it was entered correctly.

At least, that's how I'd implement it.

Re:WTF Summary (1)

vivaoporto (1064484) | more than 5 years ago | (#29453143)

It is the wisdow of the crowds. There are two words, one is a normal mangled (and known beforehand) captcha, the other is one that the best OCR google got its hands on couldn't solve.

People still have to solve the first one correctly, and if enough people give the same answer to the second one, it is considered correct.

Re:WTF Summary (2, Funny)

digitig (1056110) | more than 5 years ago | (#29453753)

wisdow

OCR error?

Re:WTF Summary (1)

complete loony (663508) | more than 5 years ago | (#29453801)

The first one is not a normal mangled word. It's another word that could not be OCR'd but has already been identified by the crowd.

Re:WTF Summary (0)

Anonymous Coward | more than 5 years ago | (#29453147)

And then, imagine what happens if they actually use the results for their online book contents...

Re:WTF Summary (5, Funny)

Anonymous Coward | more than 5 years ago | (#29453369)

"Hey everyone, let's all sit refreshing the google gmail account creation page, and always type "boobs" for the second captcha value..."

Re:WTF Summary (3, Interesting)

Anonymous Coward | more than 5 years ago | (#29455137)

Interesting you should say that.

Unfortunately, it won't work - 4chan already ruined it for everyone.

http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/

Er... no. Read the reCAPTCHA info (1)

djkitsch (576853) | more than 5 years ago | (#29453151)

The interface uses two words: one which is verified and one which isn't. Assuming the first one is typed in correctly, they present the second to a bunch of people until they get a consensus (three the same, I think) and then it goes in the "verified" pile. Thus, even if the second word's not verified yet, a spammer will still get caught out by the other one.

Re:Er... no. Read the reCAPTCHA info (1)

Tony Hoyle (11698) | more than 5 years ago | (#29454571)

So if enough people type ' penis' as the result, eventually 3 people will identify the captcha as 'penis' and it gets in the list of known words.

Re:WTF Summary (0)

Anonymous Coward | more than 5 years ago | (#29453153)

It can be done... Ever had a captcha come back stating there was a problem and you need to try again (with a different piece of gobbledygook)? What if the problem was faked and it was an unknown piece of text being used? Record the guesses, and once you have a sufficient number of agreeing values, you might assume that the word has been solved and record it as such. Might be a bit sneaky, underhanded and very inefficient, but it might work.

Re:WTF Summary (1)

neoform (551705) | more than 5 years ago | (#29453169)

I'm fairly certain the scanners read the text, get a good idea of what it says, then ask several people to tell them what it says; as more people type the text in, they become more certain of what it says. I've used reCaptcha a number of times and find it to work pretty well. Though I have wondered the same thing you're wondering.

Re:WTF Summary (1)

corsec67 (627446) | more than 5 years ago | (#29453171)

ReCaptcha does that:
One of the words is generated or known, and the other is the new word they are trying to scan. You have to give both to access the protected system, since you don't know which is the known word and which is the new word.

http://en.wikipedia.org/wiki/ReCAPTCHA [wikipedia.org]

Re:WTF Summary (1)

narfman0 (979017) | more than 5 years ago | (#29453183)

They could have a list of possibilities, generated by computer or human. Then they send out the same word several times and aggregate the answers. Comparing elements in the aggregates, they see how many people chose a particular word. When the probability that the word is wrong reaches near zero, they introduce it to the database. Don't know how they did it, that's just one of the ways they could have. Not a cure-all, but it helps with the scans I suppose.

Re:WTF Summary (1)

iammani (1392285) | more than 5 years ago | (#29453187)

WTF Post?

This is not just any captcha, but recaptcha. This captcha system will challenge you to recognize two words, one of which it understands and one it cannot understand. It assumes that, if sufficient people map the unrecognized word to the same set of letters (and also get the known word right), the image indeed maps to these letters.

This is, indeed, a neat idea for OCR.

Re:WTF Summary (1)

Useful Wheat (1488675) | more than 5 years ago | (#29453293)

The system works by having you validate 2 words. One of the words is a word that has already been verified to be correct, a known quantity. The other word is the unknown word. If you get the first one correct, it assumes you got the other one correct too. Error correction is done by having multiple people evaluate the same unknown word. If 3 people agree that the unknown word is "Bacon", the word is then taken to be bacon.

Random people trying to mess up the system will not succeed. However, if you convinced everyone to simply enter "Bacon" we could have some amazing google book searches.

Re:WTF Summary (0)

Anonymous Coward | more than 5 years ago | (#29455071)

One thing they should do, is that once they have a substantial amount of words done for a book, set up a neural network and train it to the font that's known on the words, and allow it to translate the rest of the book into text and index it. It would save a lot of time, as people don't need to confirm each and every word of the book, only a fraction.

Why just words? (3, Insightful)

Thanshin (1188877) | more than 5 years ago | (#29453095)

I suppose most people write fast enough to allow sentence captchas already.

Re:Why just words? (4, Insightful)

Canazza (1428553) | more than 5 years ago | (#29453199)

no they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, can't even touch type.
I know a lot of people who still have to look at the keys when they type, and while it's generally faster than that bint, it's still painfully slow.
Not to mention Children, when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.

Re:Why just words? (0)

Anonymous Coward | more than 5 years ago | (#29453363)

a lot of people who still have to look at the keys when they type, and while it's generally faster than that bint, it's still painfully slow.

Maybe you don't know many programmers? Like many other developers, I have to look at my hands (not constantly, but at least a glance every 3rd word) to type. But when my job was solely programming I was well into the 50 wpm range (i.e. faster than I speak). I simply didn't have to look at anything else in the middle of development, but was typing constantly, so I learned speed. I doubt my watching has much effect other than a mental requirement... But I can't enter data/transpose from paper for crap (fingers seem to lose confidence without constant feedback).

Re:Why just words? (1)

crazyjimmy (927974) | more than 5 years ago | (#29453513)

Not to mention Children, when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.

Don't hate on the children. Most keyboards are way too big for the li'l ones anyways. We should be getting them netbooks... and maybe cellphone keyboards. They could probably type great on those, with their tiny little fingers.

Lord knows, I can't do it. :)

--Jimmy

Re:Why just words? (1)

British (51765) | more than 5 years ago | (#29453681)

I admit, I'm great with a standard QWERTY keyboard, but when it comes to remote controls for cable boxes/vcrs, etc, I slow down to a crawl. Perhaps it's just what you are used to. I almost never look at my keyboard(maybe for typing in tough passwords), but for my VCR remote control(infrequently used), it's a bit more difficult.

Familiar Creature (1)

TheMeuge (645043) | more than 5 years ago | (#29453861)

no they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, can't even touch type.

I don't know about London, but in the U.S., the 1-2 finger typing is usually accomplished by a community college dropout, whose fingernail extensions are about 2 inches long, and who types either by carefully and slowly pressing one key at a time with the nail extension, or with the second knuckle of her middle finger. She will also scream: "Can I help you" with enough contempt to burn your eyebrows off. When you get to the counter, she will look you over with as much spite as humanly possible, then get her Sidekick out and text someone for a couple of minutes. And god help you if you are still with her (inevitably) when 12pm or 1pm comes about. She will get up and leave for lunch (or unroll her food), whether you're waiting or not. Actually, she'd prefer you to wait there.

She is a ubiquitous inhabitant of government offices of all sorts, as well as front desks in companies that don't respect themselves. She will need the supervisor/manager to resolve any issue that goes beyond typing your name (incorrectly), but she will march on city hall with the rest of her co-workers if they don't get another 5% raise in the middle of the recession.

Re:Why just words? (1)

BetterSense (1398915) | more than 5 years ago | (#29455073)

I can touch-type Dvorak at 80+wpm. I'm reduced to hunt-and-peck mode with Qwerty, however. Which proves the superiority of Dvorak of course.

Re:Why just words? (0)

Anonymous Coward | more than 5 years ago | (#29454903)

Why not whole articles? And not cosmopolitan ones, I mean technical articles! People may even learn something out of captchas. ;) Well worth the effort!

Stupid (0)

Anonymous Coward | more than 5 years ago | (#29453115)

Why didn't they just spend the money on improving their character recognition AI? Ultimately, they will end up having an AI that defeats the purpose of this company anyways...

Great (0)

Anonymous Coward | more than 5 years ago | (#29453123)

Here's to the prospect, for those of us who don't permit random web sites to run code on our computers, of yet more javascript-dependent captchas to manually hack through.

In related (and more important) news mozilla at last have a working 64-bit JIT for tracemonkey.

Is that a finger cot? (1)

AmigaHeretic (991368) | more than 5 years ago | (#29453131)

Check out this Google book.... about the 7th page down.

http://www.google.com/books?id=Y0OOlnDFUM8C&printsec=frontcover&dq=Le+Morte+d'Arthur&as_brr=1#v=onepage&q=&f=false [google.com]

I thought these were scanned in by robots? If so it looks like it has well kept fingernails.

Re:Is that a finger cot? (1)

KDR_11k (778916) | more than 5 years ago | (#29453219)

Presumably the robot wasn't the only one ever to handle that book.

Re:Is that a finger cot? (1, Funny)

Anonymous Coward | more than 5 years ago | (#29453267)

Presumably the robot wasn't the only one ever to handle that book.

Maybe not. But I know that when I'm done handling a book I usually don't leave my hands there with it.

Re:Is that a finger cot? (1)

quercus.aeternam (1174283) | more than 5 years ago | (#29453673)

Humans - the new replacement for robots.

Why drop half a million dollars on a machine when you can pay someone 25k a year to do the same job!

But really, they probably do have robots that do some of the work - but to my (very limited) knowledge, even the best are somewhat destructive.

Re:Is that a finger cot? (1)

Jared555 (874152) | more than 5 years ago | (#29453715)

They probably also have some that were manually scanned, or there are probably cases where pages stick together and require human intervention. If the robot scans a book and then later it is discovered a page didn't get scanned they probably are going to manually scan it.

Good idea, but how? (1, Interesting)

Nesa2 (1142511) | more than 5 years ago | (#29453145)

ReCAPTCHA is a free service that usually integrates into forums, blogs, and other such anonymous comment-posting services to help eliminate bot spamming. I think they will not use it on Google search pages, but exploit ReCAPTCHA users of all of those sites that do use it already. Sounds to me like a really good idea...

I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...

Re:Good idea, but how? (0)

Anonymous Coward | more than 5 years ago | (#29454071)

I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...

1. The CAPTCHA test to submit the form is still based on a known word.
2. The unknown word is shown to multiple users, so even if some percentage get it wrong, eventually the system will have a majority opinion on the correct value.

Re:Good idea, but how? (1)

city (1189205) | more than 5 years ago | (#29454769)

There is a really good talk by the reCAPTCHA founder, Von Ahn, describing their method for validating a word and how they are using it to digitize old NYT articles. I think it's this one: http://www.youtube.com/v/Aszl5avDtekhl=en&%23038;fs=1&%23038;rel=0 [youtube.com]

Re:Good idea, but how? (0)

Anonymous Coward | more than 5 years ago | (#29455003)

I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...

That was my first thought. In order for a captcha to be effective, must you not already know what it contains?

Re:Good idea, but how? (0)

Anonymous Coward | more than 5 years ago | (#29455193)

They display two words. A known word to confirm the captcha and an unknown word to identify.

If you get the captcha word right then the unknown word entry is put forward as a possible solution.

This is then crowd sourced, so once enough people suggest the same solution to an unknown word, we have a winner.

Re:Good idea, but how? (0)

Anonymous Coward | more than 5 years ago | (#29455239)

i saw a talk on recaptcha...
they pair the scanned word with a regular captcha, and issue the same scanned word to multiple people.
if a person gets the regular captcha correct, then there is some confidence that they also got the scanned word correct.
if multiple do this, then confidence is further increased.

I'm real giddy about this (1, Interesting)

Kokuyo (549451) | more than 5 years ago | (#29453181)

Just wait until some soccer mom needs to protect her genius of a brat from all the bad things there are. Latest crusade? A 'bad' word in a CAPTCHA. Just you wait, it will happen.

angry hipster alert (-1, Troll)

Anonymous Coward | more than 5 years ago | (#29453657)

soccer moms... brats... crusades... ?

what the fuck are you babbling about?

I mean, we know what you're babbling about, but what the fuck is wrong with you? why are you so angry? did you spill your latte on your new pre-worn jeans again and come here to vent your frustrations on normal people?

I hope they have a couple of tests! (4, Funny)

NoYob (1630681) | more than 5 years ago | (#29453241)

As I get older, I find that I'm having a harder time reading from computer monitors and especially captchas. I confuse words all the time. For acample: erection with election. Not so bad, but if Google doesn't pass that unknown to multiple folks, it could get embarrassing. Text from a Bill Clinton bio:

After Bill Clinton's first erection as President, he proceeded .....

Re:I hope they have a couple of tests! (0)

Anonymous Coward | more than 5 years ago | (#29453335)

"After Bill Clinton's first erection as President, he proceeded ....."

I don't see any typos or errors in that sentence.

Re:I hope they have a couple of tests! (1)

HipToday (883113) | more than 5 years ago | (#29453491)

Or acample with example.

Re:I hope they have a couple of tests! (1)

ElSupreme (1217088) | more than 5 years ago | (#29453649)

I find that reCAPTCHA is MUCH easier to decipher than standard ones. I mean, I have tens of years of experience deciphering text on the curve of a book, with cheap printing, versus the ones made hard to read on purpose.

But a few of the reCAPTCHAs are just misprinted and would require someone to read the sentence to figure out what should go there.

Re:I hope they have a couple of tests! (0)

Anonymous Coward | more than 5 years ago | (#29453813)

> I confuse words all the time. For acample: erection with election.

In that case I thoroughly recommend visiting www.sensibleelection.com

(Captcha: Nothing particularly relevant)

Re:I hope they have a couple of tests! (1)

natehoy (1608657) | more than 5 years ago | (#29453825)

Most CAPTCHA solutions have at least two ways you can solve them. Some offer an audio version of the words that is only slightly garbled (enough to defeat voice recognition) that you can listen to in addition to or instead of the CAPTCHA word, and some allow you to solve some simple word problem instead of CAPTCHA if your hearing AND eyesight are both bad.

As for the Clinton example: funny, but in reality people are going to be looking at one word at a time. The mistake in the Clinton bio example would be made frequently (humorously or maliciously) because of the context. But if the word "election" was put on a CAPTCHA by itself, most people would interpret it correctly. A few might get funny and try "erection" just to see if it's the "non significant" word, but I doubt that would be EVERYONE. If you checked the word against a dozen people, you'd have to have at least (at a guess) 10 of them with the exact same sense of humor to get the word automatically accepted as "erection" and not "election".

I don't know Google's algorithm for re-checking words, but the article clearly says they'll be doing some rechecking for reliability by having a number of different randomly-chosen people interpret the same word. I imagine that words where the answers are all identical might get 4-5 checks, while words that prove less consistent will get checked at least a dozen times or so, and those that continue being unreliable would probably get an authoritative check.

If, say, 4 people chose "erection" and the remaining 8 chose "election", the word would probably be flagged as "unreliable" by the automated CAPTCHA system and reviewed by a Google employee in proper context for final verification. Then the word would be corrected. Exactly which of the two words is chosen would probably depend on the political affiliation of the Google employee. :)
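
The re-checking policy guessed at above could be sketched like this. To be clear, the numbers (4 checks when everyone agrees, a dozen otherwise, 10 matching answers to auto-accept) are just the guesses from this comment, and the function name `decide` and its return values are made up for the example; none of it is Google's actual algorithm.

```python
from collections import Counter


def decide(answers, min_checks=4, max_checks=12, agree_needed=10):
    """Decide what to do with one scanned word given the answers so far.

    Returns ("accept", word), ("need_more", None) or ("human_review", None).
    All thresholds are hypothetical defaults taken from the guesses above.
    """
    normalized = [a.strip().lower() for a in answers]
    if len(normalized) < min_checks:
        return ("need_more", None)

    counts = Counter(normalized)
    top_word, top_count = counts.most_common(1)[0]

    # Everyone agrees after a handful of checks: accept early.
    if len(counts) == 1:
        return ("accept", top_word)

    # Answers disagree: keep asking until a dozen or so people have seen it.
    if len(normalized) < max_checks:
        return ("need_more", None)

    # Enough checks collected: accept only on a strong majority,
    # otherwise flag the word for a person to review in context.
    if top_count >= agree_needed:
        return ("accept", top_word)
    return ("human_review", None)
```

With these defaults, the 8-versus-4 "election"/"erection" split ends up in human review, while a 10-versus-2 split is accepted automatically.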

Re:I hope they have a couple of tests! (1)

Hurricane78 (562437) | more than 5 years ago | (#29454107)

Protip: Ctrl-+

Seriously. Or change the freakin' resolution of your display.

There, was it that hard? ^^

maybe they should use CAPTCHAs... (0)

Anonymous Coward | more than 5 years ago | (#29453325)

to allow people to send emails to "higher class of service" mailboxes. Hey, I should patent that idea before Nathan the ex-Microsoftie gets to it.

Re:maybe they should use CAPTCHAs... (3, Interesting)

Rik Sweeney (471717) | more than 5 years ago | (#29453521)

Funny you should say that

http://mailhide.recaptcha.net/ [recaptcha.net]

Won't this eventually defeat the purpose? (3, Interesting)

natehoy (1608657) | more than 5 years ago | (#29453347)

Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?

Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...

Re:Won't this eventually defeat the purpose? (1)

CSMatt (1175471) | more than 5 years ago | (#29453507)

CAPTCHAs can be defeated right now by using mechanical turk or social engineering to get humans to solve the CAPTCHAs for the spammers.

Re:Won't this eventually defeat the purpose? (1)

funfail (970288) | more than 5 years ago | (#29453575)

CAPTCHAs can also be defeated with a system like reCAPTCHA.

Re:Won't this eventually defeat the purpose? (5, Insightful)

slim (1652) | more than 5 years ago | (#29453519)

What you get in the captcha is the scanned word, plus some warping and obfuscation. Therefore if OCR advances to the point where it has no trouble with the original scan, it would still have trouble with the captcha.

Spammers already have a neat way around captchas -- they proxy them to people on porn and warez sites. If you ever fill in a captcha on such a site, you're probably helping a spambot out.

Re:Won't this eventually defeat the purpose? (2, Insightful)

Hurricane78 (562437) | more than 5 years ago | (#29454215)

No it's not warped and obfuscated. ReCaptcha gives you the word as-is.

GP is using faulty logic (circular reasoning I think).

If ReCaptcha improves OCR algorithms, then not only will spammers have access to them, but so will the effort behind ReCaptcha.
So the now-scannable words would be scanned and never turn up there. ReCaptcha would just present you with those words that still could not be scanned by any OCR.

Re:Won't this eventually defeat the purpose? (2, Informative)

koick (770435) | more than 5 years ago | (#29455831)

In this interview on Wired, Luis von Ahn explains that they do indeed warp it: http://www.youtube.com/watch?v=3PuZ55kyf7E [youtube.com]

Re:Won't this eventually defeat the purpose? (1, Interesting)

Anonymous Coward | more than 5 years ago | (#29453591)

If spammers figure out how to defeat reCAPTCHA, Google will probably hire them to automatically digitise books; that probably pays a lot better than spamming. You can think of it as trying to set all the ingenuity of the world's spammers working at the same problem...

Re:Won't this eventually defeat the purpose? (1)

maxume (22995) | more than 5 years ago | (#29453871)

All you have to do is add a level of indirection. Take the reCAPTCHA images and present them to users of your rereCAPTCHA system, and then use the results to solve the reCAPTCHA tests.

I suppose keeping up with the turnover of the reCAPTCHA might be an issue, but if the problem were valuable enough to solve...

Re:Won't this eventually defeat the purpose? (1)

delete2kill (1449861) | more than 5 years ago | (#29453599)

CAPTCHA solving is a lucrative industry: $5-8 per 1000 out in the Far East, sometimes solicited over Craigslist ... kinda like gold farming. With enough CAPTCHAs you could create a program to defeat an algorithm ... but of course reCAPTCHA is different.

Re:Won't this eventually defeat the purpose? (1)

jacktherobot (1538645) | more than 5 years ago | (#29453629)

In addition to showing a scanned word, the captcha image is contorted and corrupted. This makes captchas much, much harder to solve than standard OCR problems. Improving and perfecting OCR is unlikely to have as much of an adverse impact on captchas as spammers hiring poor folks to solve them.

Re:Won't this eventually defeat the purpose? (0)

Anonymous Coward | more than 5 years ago | (#29454699)

The idea is that there is no way to lose. Either you have an effective mechanism to fight spam, or you have a better method for scanning books. In both cases it is a win for society.

reCAPTCHA is awesome (5, Funny)

Thaelon (250687) | more than 5 years ago | (#29453597)

I have to say, reCAPTCHA is one of the most elegant solutions I've ever seen to a problem.

It's not even killing two birds with one stone, it's killing two birds with one of the birds.

Re:reCAPTCHA is awesome (1)

pHus10n (1443071) | more than 5 years ago | (#29453635)

Your analogy made me lol IRL.

Re:reCAPTCHA is awesome (1)

Sockatume (732728) | more than 5 years ago | (#29454397)

I've already posted so I can't mod you up, but that might be the greatest analogy I've ever heard. I'm already thinking up applications for it.

Psst, scanning books is just one goal (1)

melted (227442) | more than 5 years ago | (#29453605)

The other is to track how users browse the web, for ad targeting. All they need to do is put a cookie in your browser and read it next time you see a captcha or load a Google analytics script.

Re:Psst, scanning books is just one goal (0)

Anonymous Coward | more than 5 years ago | (#29453773)

To be honest, this was my first thought as well. I use reCaptcha on my sites, but I reject google analytics because I don't want to help google gather data on my users. It is really frustrating because reCaptcha is a great tool that I was happy to take advantage of. I might have to re-evaluate that decision now.

Tinfoil hat (0)

Anonymous Coward | more than 5 years ago | (#29453607)

What is stopping them from including their analytics code (or else something that scrapes behaviour of a user over different websites) behind the scenes?

A corporate motto?

Evil? (1)

AP31R0N (723649) | more than 5 years ago | (#29453853)

Have you paranoiacs figured out how Google is going to use this to spy on you or otherwise do evil?

The Machine (0)

Anonymous Coward | more than 5 years ago | (#29453885)

I once caught my dad doing something similar via one of those "make money on the internet" sites. I told him that he was most likely assisting a programmer to design "character-by-shape"-recognition software....that he was in essence making the machine smarter.

-Oz

Waiiiiit.... (1)

WWWWolf (2428) | more than 5 years ago | (#29454725)

I thought I had some hazy recollection that reCAPTCHA was being used for some open projects, like helping to OCR out-of-copyright works...

...so now it is being used to fuel Google's massive, still-very-much-copyrighted, proprietary book scanning effort?

So how's this going to benefit people? I'm, of course, assuming the details are spotty at the moment and I'm terribly interested to hear more details from Google's official "do no evil" department on how they intend to contribute to the world.

Beloved != 8cloved (1)

EdgeyEdgey (1172665) | more than 5 years ago | (#29454999)

I just got a correct response from a clearly incorrect answer.
The image was of Beloved but being difficult I answered 8cloved and got accepted.
It did the job of proving that I wasn't a bot, but if there are enough difficult people (like me) out there then we could really screw Google over.