Google Adds OCR To PDF and Images

CmdrTaco posted more than 3 years ago | from the typing-is-for-suckers dept.

Google 76

Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"

76 comments

F1r5t p0st? (4, Funny)

Chrisq (894406) | more than 3 years ago | (#32651916)

F1r5t p0st? (OCR's by Google)

Captcha correction? (3, Interesting)

0100010001010011 (652467) | more than 3 years ago | (#32651968)

Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCaptcha does, except with something a bit newer.

Re:Captcha correction? (-1, Troll)

Anonymous Coward | more than 3 years ago | (#32652052)

No, because you top-posted on a Frosty Piss just to get higher on the page layout.

I save my mod points exclusively for people like you, and you just missed being "off-topic" by about 4 hours because that's when they expire.

People like you are why we can't have nice things.

Re:Captcha correction? (0)

Anonymous Coward | more than 3 years ago | (#32652146)

People like you are why the moderation system on here has gone to the crapper.

Re:Captcha correction? (0, Offtopic)

Ardeaem (625311) | more than 3 years ago | (#32652148)

I save my mod points exclusively for people like you, and you just missed being "off-topic" by about 4 hours because that's when they expire.

Yeah, well, I save my mod points for people who post responses to offtopic posts, too. People like you suck. I would NEVER do such a thing.

Re:Captcha correction? (1)

0100010001010011 (652467) | more than 3 years ago | (#32652430)

Well, a captcha service would have corrected "F1r5t p0st". Seems relevant to me.

Re:Captcha correction? (1)

somersault (912633) | more than 3 years ago | (#32652798)

I fail to see how it works as a captcha if the "correct" interpretation is unknown..

Re:Captcha correction? (1, Interesting)

Anonymous Coward | more than 3 years ago | (#32655408)

Check into how the current reCaptcha works. The user is presented with two words. One is known to be correct. The other is suspect. User is unaware of which is known and which is suspect. User types both words, and backend system verifies the known word was typed correctly. Logs suspect word value typed by user. Returns the suspect word image to a few more users, and if they all respond with same text along with correct known word, the system can assume the suspect image contains the text returned. http://stackoverflow.com/questions/1435696/how-does-recaptcha-work [stackoverflow.com]

Re:Captcha correction? (2, Informative)

pcgc1xn (922943) | more than 3 years ago | (#32655504)

I am pretty sure that with reCAPTCHA only one of the two words you type in is unknown.
So say I have some text that looks like 'first (known) p0sh bi4ches'.
Captcha user one will get "first p0sh".
If they correctly identify "first", then I will accept their reading of "p0sh", say "post".
User two gets "p0sh bi4ches".
If they correctly identify "p0sh" as "post", then I will accept their reading of "bi4ches".

Obviously the guys at reCAPTCHA have done a better job than my simplified & poor explanation. You need "some" knowledge of what the text actually is, but only some.
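For anyone who wants to poke at the idea, here is a rough Python sketch of the known-word / suspect-word scheme described in this thread. It is only a toy under stated assumptions: the class name, the `votes_needed` threshold, and the image ids are all made up for illustration, and the real reCAPTCHA backend is certainly more involved.

```python
from collections import Counter


class TwoWordCaptcha:
    """Toy model: one control word with a known answer, one suspect word."""

    def __init__(self, votes_needed=3):
        self.votes_needed = votes_needed  # hypothetical agreement threshold
        self.votes = {}      # suspect image id -> Counter of typed readings
        self.resolved = {}   # suspect image id -> accepted text

    def submit(self, known_answer, typed_known, suspect_id, typed_suspect):
        """Return True if the user passes; record their vote on the suspect word."""
        if typed_known.strip().lower() != known_answer.lower():
            return False  # failed the control word, so the suspect reading is ignored
        reading = typed_suspect.strip().lower()
        tally = self.votes.setdefault(suspect_id, Counter())
        tally[reading] += 1
        if tally[reading] >= self.votes_needed:
            self.resolved[suspect_id] = reading
        return True


captcha = TwoWordCaptcha(votes_needed=2)
captcha.submit("first", "first", "img-1", "post")
captcha.submit("first", "frist", "img-1", "posh")  # control word wrong: rejected
captcha.submit("first", "First", "img-1", "Post")
print(captcha.resolved)  # {'img-1': 'post'}
```

The user never learns which word was the control, so a bot cannot skip the suspect word without also risking the known one.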

D2 is not defined (-1, Flamebait)

Anonymous Coward | more than 3 years ago | (#32652346)

D2 is not defined

Niggers at work on slashcode, apparently.

Re:F1r5t p0st? (0)

Anonymous Coward | more than 3 years ago | (#32654404)

The mod who modded this "off topic" should be felrq in his own f47.

THIS IS NOT A PROBLEM !! (0)

Anonymous Coward | more than 3 years ago | (#32651920)

This is a good thing for all concerned !!

lolwut? (2, Insightful)

Pojut (1027544) | more than 3 years ago | (#32651924)

I can understand OCR software not working if you are scanning a document, due to dirt over the text or what have you...but OCR failing on a PDF with typed text? WTF?

Re:lolwut? (3, Interesting)

erikdalen (99500) | more than 3 years ago | (#32652114)

Didn't fail at all on a PDF with typed text for me. Did you actually try it?

I bet they don't actually use OCR on a PDF with typed text as they can just extract it from the PDF, they probably use that on images inside PDFs though.

Re:lolwut? (1)

clone53421 (1310749) | more than 3 years ago | (#32652886)

If you uploaded a PDF with typed text, it probably didn't even do OCR on it. It'd be pointless. You have to convert the pages to images for that to be necessary and I'm guessing you didn't.

Open in Acrobat Reader and use the snapshot tool to capture an entire page. Paste into Word as an image, then re-export to PDF. Upload that and then see how the OCR fares. Of course you'll also get an excellent quality in the snapshot since it's a pure digital copy and it won't have the blemishes that you'd get by printing a physical page and scanning that, so the results might be better.

Changing ridiculously stupid subject line (1)

mjwx (966435) | more than 3 years ago | (#32654234)

I bet they don't actually use OCR on a PDF with typed text as they can just extract it from the PDF, they probably use that on images inside PDFs though.

Have you tried it on a PDF that was an image of text, such as a scanned or photographed text document? That's the real test.

Re:lolwut? (1)

AHuxley (892839) | more than 3 years ago | (#32652130)

Someone at google made a mistake with the dpi setting? Between Tesseract and reCAPTCHA something should work.

Re:lolwut? (1)

mlk (18543) | more than 3 years ago | (#32652180)

It is likely that the PDF tried above was scanned pages.

Re:lolwut? (3, Informative)

mlk (18543) | more than 3 years ago | (#32652274)

I've just tried with the extract [37signals.com].

The text extraction seems to have worked well. Unsurprisingly the formatting has been lost and it has got confused with the REwork type bits. PDFs are not designed with extraction to an editable format in mind, so getting any of the formatting is impressive in my book.

Re:lolwut? (1)

TheRaven64 (641858) | more than 3 years ago | (#32663528)

PDFs are not designed with extraction to an editable format in mind

Spoken like someone who has never read the PDF spec. PDFs are, in fact, specifically designed to allow editing. Everything in a PDF is stored as an object inside the document, indexed via an object table. Text runs are single objects containing a stream of commands sent to a PostScript-like VM to control their positioning. You can relatively easily map these to rich text in some other format, and you can trivially replace any object in a PDF by adding a new version and appending a new object table with a new version. PDFs store their object table at the end specifically for this reason - it allows new versions of objects to be added without having to rewrite the entire file; you can just write a new object table that refers to the old one but overrides some objects by providing a higher version number for them.

What you meant to say was 'creating correct layout information from an image of text is hard'.
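A toy Python model of the append-only object table described above might make the idea concrete. This is a sketch, not the PDF file format: real files use byte offsets and chained cross-reference sections rather than Python lists, and `ToyObjectStore` and its method names are invented for illustration.

```python
class ToyObjectStore:
    """Append-only store: each revision adds a table that overrides older entries."""

    def __init__(self):
        self.body = []    # appended object payloads, never rewritten
        self.tables = []  # one {object_number: index into body} per revision

    def write_revision(self, objects):
        """Append new or replacement objects plus a new table pointing at them."""
        table = {}
        for number, payload in objects.items():
            self.body.append(payload)
            table[number] = len(self.body) - 1
        self.tables.append(table)

    def lookup(self, number):
        """Search the newest table first, then fall back to older ones."""
        for table in reversed(self.tables):
            if number in table:
                return self.body[table[number]]
        raise KeyError(number)


store = ToyObjectStore()
store.write_revision({1: "title: Rework", 2: "text: REWORK excerpt"})
store.write_revision({2: "text: corrected excerpt"})  # override object 2 only
print(store.lookup(1))  # title: Rework
print(store.lookup(2))  # text: corrected excerpt
print(len(store.body))  # 3
```

Note that the old version of object 2 is still sitting in the body; nothing was rewritten, only superseded.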

Re:lolwut? (1)

mlk (18543) | more than 3 years ago | (#32675538)

Nope - I'll rephrase it to "most tools do not output PDFs that are easy to extract to a common editable format (such as Word)" if you like.

The spec may allow for easy editing, but converting a PDF (a PDF containing text with formatting, not an image stored in a PDF) is hard. Chunks of unrelated text get bunched together into a single object, while other chunks of text that are related get thrown into some unrelated chunk, so extracting it all logically becomes a royal pain. Sure this is the "fault" of the creation tool, but given that all the creation tools I've played with generate scariness I'd question that, and suggest that the specification is not as well designed for editing as a document. It may well be great (even for editing) as a DTP format, I don't know.

lots of tough problems in OCR (2, Interesting)

tmbdev (1320455) | more than 3 years ago | (#32660852)

OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.

Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.

And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error and looks quite bad; people are much more sensitive to OCR errors than pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you only get very little CPU time per character.
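To put numbers on "1% error", a common way to score OCR output is the character error rate: edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The sketch below is a plain Levenshtein implementation in Python; the function names are mine and not taken from any OCR toolkit.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum single-character edits between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


truth = "first post"
ocr = "f1r5t p0st"
print(edit_distance(truth, ocr))         # 3
print(character_error_rate(truth, ocr))  # 0.3
```

At a 1% rate, a page of 2,000 characters still carries about 20 wrong characters, which is why it looks bad to a human reader.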

We've been developing an OCR system (ocropus.org) for a while now (see http://bit.ly/9Xputj [bit.ly] for status info). It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.

Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.

Great! (0)

Bottles (1672000) | more than 3 years ago | (#32651940)

th15 i5 $o1zg to nnVke d0(unnenct 5cam1ng a rea| t1me sAver fr0m novv on!

Re:Great! (1)

morgan_greywolf (835522) | more than 3 years ago | (#32652162)

Sadly, I had no issues reading this: "This is going to make document scanning a real time saver from now on!"

Obviously, I've spent way too much time correcting bad OCR.

Where did all the ReCAPTCHA go? (2, Interesting)

AHuxley (892839) | more than 3 years ago | (#32652004)

With all the words deciphered, no bump in the OCR backend?

Re:Where did all the ReCAPTCHA go? (1)

Loconut1389 (455297) | more than 3 years ago | (#32652508)

ReCAPTCHA was to fix bad scans in specific works- I didn't think it was ever designed to further OCR, but I see how it could possibly be useful.

Re:Where did all the ReCAPTCHA go? (1)

AHuxley (892839) | more than 3 years ago | (#32653682)

http://en.wikipedia.org/wiki/ReCAPTCHA [wikipedia.org] seems to be in use for some form of OCR?
"The reCAPTCHA software itself is not open source" could be the issue?

Re:Where did all the ReCAPTCHA go? (1)

slaingod (1076625) | more than 3 years ago | (#32654322)

I think the point Loconut was making is that ReCaptcha does not 'further' machine OCR (i.e. it doesn't improve the recognition algorithms used by the OCR software), instead using humans to 'OCR' words that otherwise aren't legible.

Re:Where did all the ReCAPTCHA go? (1)

AHuxley (892839) | more than 3 years ago | (#32654394)

Pity they did not improve the recognition algorithms with all the data flowing in.
Cost vs a tiny % in better recognition vs a free network of humans.
Thanks for the info, I was thinking that a quality private OCR system was getting the ReCAPTCHA inputs and it was learning.

High expectations (0)

Anonymous Coward | more than 3 years ago | (#32652100)

I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'

Oh come on man, this is *FREE* and you're complaining about a 10% margin of error?

Then again, most OCR tech is cheap enough as is, if not completely free.

If anything, this just makes it feel weirder as to exactly how much you're letting google control your documents. Not only are you letting them have your typewritten docs, but now you're scanning in docs for them to archive as they want? Because it's not like you're the owner of said scanned in docs once you put them in there.

If Google's lucky, people will start bombarding them with more and more documents and then a year or two down the line "Google's friendly archives! Millions of scanned documents fully OCR'd and searchable from various sources! Wanna complain about how you didn't want your docs online, searchable, and visible to the entire world? TOUGH COOKIES. CHECK THE TOS."

Google Captcha processor here I come!!!! (2, Interesting)

OzPeter (195038) | more than 3 years ago | (#32652104)

How long before you see an automated system to upload and process Captcha images on google?

Re:Google Captcha processor here I come!!!! (1)

mrops (927562) | more than 3 years ago | (#32652938)

A little offtopic.

I have always wondered about how much processing Google does, more than any other corporation in recent times: this OCR, searches, building heuristics for searches, etc. Combined, these are no small tasks. Is there a number on what kind of processing power Google has? Does Google's computing grid qualify as a supercomputing grid? What is its standing compared to all those other supercomputers?

Re:Google Captcha processor here I come!!!! (0)

Anonymous Coward | more than 3 years ago | (#32653318)

~6 years ago, I heard they had 70k computers (dual processor, dual hard-drive "cheap" boxes). They probably have at least 2-8x that today, with increased processing speed per unit. A conservative guesstimate is ~300k dual-core "modern" (2-4 GHz?) boxes?

This is still *way* less than some of the botnets out there... though google is probably much better managed/coordinated.

Re:Google Captcha processor here I come!!!! (2, Interesting)

BrightSpark (1578977) | more than 3 years ago | (#32653640)

One of the easier ways to restrict how your words and ideas are searched and indexed on the net is to hide them in plain sight. A jpg image of text is very difficult for a search engine to use, yet you and I can read and understand the data quite easily. This ability to scan online has been around, but not mainstream to my knowledge. I'm guessing Google has been checking jpgs for text as a trial for some time. Once this is gone maybe ASCII art text will work for a while. Hiding/protecting data by steganography is detectable by scan now, e.g. http://www.outguess.org/detection.php [outguess.org] so the battle continues. Of course one can work offline and send letters to each other and be protected by law :-) I wonder if one day sending stuff by mail will seem shady?

Re:Google Captcha processor here I come!!!! (1)

gravis777 (123605) | more than 3 years ago | (#32654032)

I can't read captchas 60% of the time, and am not always in an area where I can listen to the audio hint. An OCR would be nice. On Boing Boing, I usually mistype the captchas about 3-4 times before finally stumbling on one I can actually read.

Re:Google Captcha processor here I come!!!! (1)

Bigjeff5 (1143585) | more than 3 years ago | (#32656490)

I had to use a captcha for work once, and the captcha itself was incorrect. I have no idea what key combination would have worked, but what the captcha said certainly didn't. It had an audio option, so I tried it, but the audio was so garbled I couldn't pick out a single word, let alone the three necessary to complete the captcha.

I like captcha as a basic form of protection from bots, but when it keeps me from accessing a website it is beyond worthless.

captcha cracking (1)

aiwarrior (1030802) | more than 3 years ago | (#32652116)

For now it sucks, but we know that if Google wants to, it can beat the best on the market.
Just wondering if this gets so good as to make mass captcha cracking cheap.

Is there a "this translation is bad" option? (4, Informative)

AdmiralXyz (1378985) | more than 3 years ago | (#32652216)

I know with several services, like Google Voice, they had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too, the quality of Google Voice's voicemail-to-text transcriptions started off horrible, and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.

Re:Is there a "this translation is bad" option? (1)

rolfwind (528248) | more than 3 years ago | (#32652462)

Google Translate's results between the Western languages I've encountered are pretty good, but they need a lot of work on the Asian languages, IMO.

Re:Is there a "this translation is bad" option? (1)

steveg (55825) | more than 3 years ago | (#32667698)

Awesome might be pushing it a bit, but I'll agree it's gotten better. It's never quite right, but I can usually get the gist of the message even before I have a chance to listen directly.

The future of palms (0)

Anonymous Coward | more than 3 years ago | (#32652238)

Anyway, it's a beginning. In a few years' time, we'll go to meetings with a notebook and upload our notes into Google Docs later.

OCR efficiency (1, Interesting)

Anonymous Coward | more than 3 years ago | (#32652276)

> the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

Well, what is the state of the art of OCR today? I wouldn't call this a bad result either... And OTOH, if people were correctly trained in spelling, we would have made do without spell checkers and have invested in OCR technology instead, right? ;)

OCR??? (0, Offtopic)

BlackEdder (1220942) | more than 3 years ago | (#32652286)

Typing "Optical Character Recognition (OCR)" was too much effort?

OCD??? (-1, Troll)

clone53421 (1310749) | more than 3 years ago | (#32652318)

OCD much? Everybody knew what OCR means. If it was some obscure acronym then yeah, spell it out first, but OCR is pretty widely known. The few people who didn't already know what it meant could probably figure it out from context, and failing that there's always Google [lmgtfy.com].

Re:OCD??? (1)

selven (1556643) | more than 3 years ago | (#32652838)

OCR (Oxford, Cambridge and RSA Examinations) is an examination board that sets examinations and awards qualifications (including GCSEs and A-levels). It is one of England, Wales and Northern Ireland's five main examination boards.

Organization of Communist Revolutionaries (Marxist-Leninist) was an Iranian Maoist organization. It was formed in opposition to the Shah regime in Iran and was active in the Iranian student movement in exile.

To perform OCR (optical character recognition); Oxford, Cambridge & RSA (examinations (board)); Optical Character Recognition; Office for Civil Rights (US); Office of the Chief Rabbi

People who can't figure things out from context would have a much harder time than you think.

Re:OCD??? (1)

SpeZek (970136) | more than 3 years ago | (#32653188)

Um...no. The summary is clearly talking about text. It even says "a new option that tells Google to convert the text from PDF and image files to Google Docs documents". There's no way that, from context, you could think OCR meant "Organization of Communist Revolutionaries".

Don't argue for argument's sake.

Re:OCD??? (0, Offtopic)

clone53421 (1310749) | more than 3 years ago | (#32653348)

Don't argue for argument's sake.

But those are the best kind... it doesn’t even really matter who’s wrong.

Never let a day go by when you can’t say to yourself as you’re falling asleep, “Well, I was wrong on the internet today, but damn, I had fun.” That’s what I say...

Re:OCD??? (1)

clone53421 (1310749) | more than 3 years ago | (#32653498)

Snarkiness aside, it explicitly says that it’s converting the text from images into a searchable document... if someone can’t tell from that context that OCR means converting the text from images into a document, they probably have about the IQ of a cinder-block and wouldn’t “get it” from the Google results either.

Hell, if we are really lowering ourselves to that lowest denominator of intelligence, the person would probably still be confused if we called it Optical Character Recognition. What, like Google has to actually print out the stuff you upload so it can Optically Recognize it...? Oh noes, we’re killing trees. Everybody, don’t use it!

Doesn't have to be perfect (4, Insightful)

clone53421 (1310749) | more than 3 years ago | (#32652298)

They really should hide the text underneath the actual scanned image, though, so that what you're actually looking at is the real page, but searchable. That takes care of the issue with layout, and since you aren't actually trying to read the garbled text, although 10% is still a rather high error rate it won't matter as much because you'll only notice it if you're trying to copy-and-paste or you might search for something and miss a few of the hits because it was incorrectly OCR'd. Not a huge deal.
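Here is roughly what I mean, as a Python sketch: keep each OCR'd word together with its position on the page image, search the hidden words, and highlight the matching regions on the real scan. Everything here is assumed for illustration (the `OcrWord` record, the pixel boxes, the `min_ratio` cutoff); the fuzzy comparison uses the standard library's `difflib`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class OcrWord:
    text: str
    box: tuple  # (left, top, right, bottom) in page-image pixels


def find_hits(words, query, min_ratio=1.0):
    """Return boxes of words matching the query; a lower min_ratio tolerates OCR errors."""
    query = query.lower()
    hits = []
    for word in words:
        ratio = SequenceMatcher(None, query, word.text.lower()).ratio()
        if ratio >= min_ratio:
            hits.append(word.box)
    return hits


page = [
    OcrWord("document", (10, 10, 90, 24)),
    OcrWord("d0cument", (10, 40, 90, 54)),  # misrecognized copy of the same word
    OcrWord("scanning", (100, 10, 170, 24)),
]
print(find_hits(page, "document"))       # exact search: misses the garbled word
print(find_hits(page, "document", 0.8))  # fuzzy search: finds both occurrences
```

The reader only ever sees the scanned image, so the garbled "d0cument" costs at most a missed search hit, and a fuzzy match can even recover that.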

Re:Doesn't have to be perfect (1)

gravis777 (123605) | more than 3 years ago | (#32654116)

The funny thing is that their OCR seems to be pretty good for Google Books. Yes, it's photographed pictures, but you can search the text, which means some type of OCR must be going on. So, unless they are using a completely different technology, then this should really only have issues with hand-written text.

Re:Doesn't have to be perfect (1)

clone53421 (1310749) | more than 3 years ago | (#32654348)

Copy-and-paste some text from it and see how good the OCR was. You’ll be able to see the mistakes that were previously hidden.

I’m guessing it’s exactly the same engine, but done exactly as I said it should be, correctly.

Google should concentrate elsewhere (1)

bogaboga (793279) | more than 3 years ago | (#32652476)

First: My suggestion is that Google should put its efforts in making Google Docs at least as usable as Zoho Office first.

How can a small company like Zoho beat Google on usability?

Second: GMail still sucks [at search experience], big time in my opinion. Here's why: I had this happen to me recently...

I knew an email existed but could not remember much about it at all! Yes sometimes, you need a memory trigger for lack of a better word.

My search term was "details" and Gmail returned 311 messages. I also knew that the attachment this message had was in 'tiff' format. This search had GMail return all those messages that met my search criterion but it would have been more useful if Google Mail had gone ahead to "auto-magically" categorize emails with attachments, and further by attachment type and so many other useful categorizations.

This way, a message with a 'tiff' format attachment (which I could not remember) would have been displayed...already sorted for me to see, maybe with some kind of highlight. If I had a huge in-box like our folks in sales, results would not be that useful. By the way, I searched for the string "when" in an in-box that had 142,211 emails and I received 11,317 emails back! No categorization at all! Needless to say, these results were not useful.

The current approach is still wanting, inadequate and can be made better. Yahoo does this, so Google can surely do better.
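The kind of categorization I'm asking for is not hard to sketch. The Python below groups search results by attachment type; the message dictionaries and their `subject` / `attachments` fields are a made-up shape for illustration and have nothing to do with how GMail actually stores mail.

```python
from collections import defaultdict
import os


def group_by_attachment_type(messages):
    """Bucket message subjects by the file extension of each attachment."""
    groups = defaultdict(list)
    for message in messages:
        names = message.get("attachments", [])
        if not names:
            groups["(no attachment)"].append(message["subject"])
        for name in names:
            extension = os.path.splitext(name)[1].lower().lstrip(".") or "(unknown)"
            groups[extension].append(message["subject"])
    return dict(groups)


results = [
    {"subject": "Order details", "attachments": ["scan.TIFF"]},
    {"subject": "Trip details", "attachments": []},
    {"subject": "Contract details", "attachments": ["contract.pdf", "page1.tiff"]},
]
for kind, subjects in group_by_attachment_type(results).items():
    print(kind, subjects)
```

With 311 hits for "details", a "tiff" bucket would have cut the list down to the handful worth opening.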

Re:Google should concentrate elsewhere (1)

MozeeToby (1163751) | more than 3 years ago | (#32652718)

By the way, I searched for the string "when" in an in-box that had 142,211 emails and I received 11,317 emails back!

You can hardly expect Google to make up for your lack of search skills or memory. I'm not saying you don't have other valid points, but searching for such basic terms as 'when' and 'details' instead of something that is unique to the message is bound to return tons of results. It's not Google's fault that you have 11,317 emails with the word 'when' in it.

Re:Google should concentrate elsewhere (1)

bogaboga (793279) | more than 3 years ago | (#32653094)

You still do not get it, I am afraid! And that's the very reason that companies like Apple and Microsoft at one point in the past made life incredibly easy for computer users. This is why they excelled, of course making users "dumb" in the process.

You can hardly expect Google to make up for your lack of search skills or memory.

This is the very mistake you make...How come Google now categorizes results of search terms at google.com? Tell me why. I just searched for "House Skills" and had categories of videos, discussions, books, news, blogs, updates returned. So according to you, categorizations work for searches at google.com but not GMail? What kind of reasoning is this?

I'm not saying you don't have other valid points, but searching for such basic terms as 'when' and 'details' instead of something that is unique to the message is bound to return tons of results.

OK..thanks...! People do crazy things with their computers and an algorithm is supposed to take care of such business.

It's not Google's fault that you have 11,317 emails with the word 'when' in it.

Who said it's Google's fault? OK...but when a successful company prides itself in being able to 'organize' the world's information, and is pretty good at it, we as users expect the best. Why not?

Re:Google should concentrate elsewhere (1)

KeNickety (1416855) | more than 3 years ago | (#32653476)

Possibly because computers and networks can't be expected to infer meaning? Remember, in your search field you've entered no context, no kinds of specifying statements, so you're expecting the computer to be able to read your mind?

Re:Google should concentrate elsewhere (1)

bogaboga (793279) | more than 3 years ago | (#32653634)

...so you're expecting the computer to be able to read your mind?

No sir! I expect the computer to categorize, and I know it is possible because I have seen it elsewhere...even in applications by the same vendor.

Re:Google should concentrate elsewhere (0)

Anonymous Coward | more than 3 years ago | (#32652730)

I have a lot of trouble finding mail with ".exe" format attachments. People are obliged to use so many fancy tricks to work around stupid filters.

Re:Google should concentrate elsewhere (1)

HamburglerJones (1539661) | more than 3 years ago | (#32652782)

Search:
details has:attachment .tiff

I see your point about it being nice if they'd automatically label this stuff, but you can search for attachments. This might turn up something that has a different kind of attachment and merely mentions ".tiff" in the email, but what you're looking for should turn up.

I have found Gmail search to be vastly superior to Yahoo! and Outlook since I've switched. They have some great tips [google.com] on how to search.

Re:Google should concentrate elsewhere (1)

quickOnTheUptake (1450889) | more than 3 years ago | (#32653106)

The thing I've had trouble with in gmail search is that it lacks any sort of lemmatisation. This would be fine if it would match sub-strings within words, but it seems to only match full words that are morphologically identical.
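To show the difference, here is a small Python sketch comparing whole-word matching with matching on a crude stem. The suffix list is a hypothetical stand-in for real lemmatisation, which needs a proper stemmer or dictionary, and none of this reflects how Gmail search is implemented.

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, far from a real stemmer


def crude_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(query, text):
    """Whole-word match only, the behavior complained about above."""
    return query.lower() in tokens(text)


def stemmed_match(query, text):
    """Match when the query and some word share the same crude stem."""
    target = crude_stem(query)
    return any(crude_stem(token) == target for token in tokens(text))


message = "Scanned attachments are included below"
print(exact_match("attachment", message))    # False
print(stemmed_match("attachment", message))  # True
print(stemmed_match("scan", message))        # False: "scanned" stems to "scann"
```

The last line shows why suffix stripping alone is not enough: the doubled consonant in "scanned" defeats it, which is the sort of case real lemmatisation handles.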

Re:Google should concentrate elsewhere (1)

Tropaios (244000) | more than 3 years ago | (#32654324)

The problem is that you weren't using Google's search engine properly. You failed to give it all the relevant information you DID remember. Next time include "tiff" in your search as well as clicking the box "Has attachment" in search options.

In fact, go do that now, then come back and tell me how many results you get.

Anyone know what they're using for the OCR? (1)

rsilvergun (571051) | more than 3 years ago | (#32652748)

It'd be cool if it was GPL'd :).

Re:Anyone know what they're using for the OCR? (0)

Anonymous Coward | more than 3 years ago | (#32652884)

Maybe it's tesseract-ocr [google.com]

Re:Anyone know what they're using for the OCR? (3, Informative)

quickOnTheUptake (1450889) | more than 3 years ago | (#32653210)

I don't know for sure what's running behind this, but Google's OCRopus [google.com] is Apache-licensed, as is the actual OCR engine behind it, tesseract [google.com].

Re:Anyone know what they're using for the OCR? (1)

tmbdev (1320455) | more than 3 years ago | (#32660732)

FWIW, I believe a lot of OCRopus hasn't been incorporated at Google yet because OCRopus itself is still under heavy development.

OCR Reality (2, Informative)

Shadow Wrought (586631) | more than 3 years ago | (#32654498)

About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, 10% error rate is still very common with scanned docs and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!

Maybe you should actually know something about the particular field before you judge?

Re:OCR Reality (1)

binary paladin (684759) | more than 3 years ago | (#32655860)

Yeah, I was thinking the same thing. This sounds like someone who hasn't actually done OCR prior to these fancy Google docs.

OCR has always been somewhat inaccurate. It's just the nature of the beast.

Google search had OCR before Google Docs (2, Interesting)

mike.mondy (524326) | more than 3 years ago | (#32655906)

Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.

Free OCR server. (1)

rynolangner (1847532) | more than 3 years ago | (#32781702)

This Google OCR thing just gives you the text. What about the formatting? Why not use something like WatchOCR from http://www.watchocr.com./ [www.watchocr.com]? It creates text-searchable PDFs from image-only PDFs and it's all free and open source. You just drop them into a watched folder and the server spits them out as text searchable. It runs as a LiveCD so you don't even have to install anything to try it.