Where Did Affordable OCR Go?

Cliff posted more than 9 years ago | from the from-print-to-data dept.

Software 79

Goeland86 asks: "Has OCR (Optical Character Recognition) died down? Where have all the magical programs that translate your handwriting to office-compatible files gone? Most of the Windows programs nowadays are expensive (ReadIris Pro 9 is about $400), and not many OSS projects for OCR have released a recent update (Kognition was last updated on July 17th, 2003, according to Freshmeat). Has everyone already scanned/translated all of their paper files? Has OCR outlived its use, or is it just a fancy technology that hit a dead end in terms of the market? Have Slashdot readers used it? If so, are you still using it? If not, why?"

79 comments

OMGWTFBBQ!!!!1one (-1, Offtopic)

Anonymous Coward | more than 9 years ago | (#9951103)

That shit is my birthright, motherfuckers!

Re:OMGWTFBBQ!!!!1one (0)

Anonymous Coward | more than 9 years ago | (#9953857)

Well, it's better than most first posts.

ocr and pdf (3, Interesting)

i621148 (728860) | more than 9 years ago | (#9951108)

i think that pdf's and the availability of the free adobe viewer have pretty much obsoleted ocr.
ocr has to be babysat also. it is not 100% reliable like scanning to pdf is...

Re:ocr and pdf (0)

jfdawes (254678) | more than 9 years ago | (#9951624)

i think that education and the availability of the free online dictionaries have pretty much obsoleted punctuation and grammar. grammar and punctuation is difficult to automate also. it is bit 100% oranges like apples is...

-1 "Stream of Conciousness Frist Ps0t attempt"

What? Scanning to PDF != OCR. (2, Informative)

Ayanami Rei (621112) | more than 9 years ago | (#9953244)

When you scan to a PDF, you essentially create a high-resolution single channel JPEG image which is the sole contents of the PDF file.

It does not create text that you can search or highlight and copy with your mouse later on. It's just a picture.

Now, there is some nice scanning software out there that if you do select "text" mode when you scan to PDF, it does an OCR pass and sticks that in the PDF. But the cost of this software is usually hidden in the purchase of a high-end scanner or printer/fax combo that it would normally accompany.

Re:What? Scanning to PDF != OCR. (1)

tweedlebait (560901) | more than 9 years ago | (#9956206)

Image + hidden text pdf is now in most lower end ocr products. also in acrobat 4&5 (iirc). Sometimes you have to dig to find the feature though.

going from image ->ocr'd text -> text only pdf or text pdf with image snippets usually looks quite awful.

I+HT PDFs though will let you search for and highlight the text behind the image. It isn't perfectly accurate on placement but it's usually adequate.

You're quite right that most of the time 'scan to pdf' is just a bitmap. It's the usual default for most ocr or scan apps. Fujitsu has a small cheap desktop scanner that ONLY scans to pdf (eek!); i can't recall if it allows ocr or not. Much high end scanner software will allow pdf but with no ocr (kodak, fujitsu and kofax based products come to mind). With the high speed scanners they usually don't want to include anything that makes their scanner look slow. This thinking is sort of reasonable: if you have a 160 page per min. scanner and have to wait 15 secs per side of the page to ocr it, the whole purpose of high speed is defeated. Not that that sort of problem can't be solved.
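
As a rough illustration of the mismatch described above, the arithmetic can be sketched in a few lines of Python. The 160 pages per minute and 15 seconds per side come from the comment; the function name and the assumption that OCR work splits perfectly across parallel workers are hypothetical.

```python
import math

def ocr_workers_needed(pages_per_minute, ocr_seconds_per_side, sides_per_page=1):
    """Parallel OCR workers needed to keep pace with the scanner.

    Assumes (hypothetically) that each worker handles one side at a time
    and that the work splits perfectly across workers.
    """
    sides_per_minute = pages_per_minute * sides_per_page
    busy_seconds_per_minute = sides_per_minute * ocr_seconds_per_side
    return math.ceil(busy_seconds_per_minute / 60)

print(ocr_workers_needed(160, 15))                    # single-sided: 40
print(ocr_workers_needed(160, 15, sides_per_page=2))  # duplex: 80
```

So under these assumptions the scanner outruns a single OCR process by a factor of 40, which is presumably why vendors prefer to leave OCR out of the scanning step.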

Pay 50 bucks and get a scanner. (0)

Anonymous Coward | more than 9 years ago | (#9953359)

The OCR software comes free with it.

Re:ocr and pdf (1)

stuuf (587464) | more than 9 years ago | (#9956452)

no, the point of OCR is to make a file smaller than the image, not larger as always happens with a pdf

I dunno (2, Insightful)

Rie Beam (632299) | more than 9 years ago | (#9951113)

It seems to have just fallen into a middle-market that doesn't exist anymore. I mean, anymore, either documents are handled completely digitally, or just scanned and translated into PDFs or the like. There just doesn't seem to be a need, at least a large enough one to merit attention.

Re:I dunno (1)

Drakon (414580) | more than 9 years ago | (#9951222)

re: your sig

wow.
(yes it happens here)

Re:I dunno (0)

Anonymous Coward | more than 9 years ago | (#9951327)

re: your sig

It's probably the "http" part...notice that searching google for "http", the "I'm feeling lucky" link is microsoft.com

Re:I dunno (2, Informative)

Tolchz (19162) | more than 9 years ago | (#9951330)

It does a "Feeling Lucky" search since http by itself is not a valid url. Go search on Google for http and look at the first result.

Re:I dunno (1)

digime (681824) | more than 9 years ago | (#9951252)

Regarding your sig - http;//slashdot.org in Firefox takes me to http://www.microsoft.com. Interesting.

Re:I dunno (1)

Przepla (637674) | more than 9 years ago | (#9951389)

It's because "I'm feeling lucky" in Google for http;// goes to microsoft.com. Just type http;// , press "I'm feeling lucky" and see for yourself. Firefox by default feeds "I'm feeling lucky" when something that is not a valid address is typed. This behaviour is controlled by the keyword.URL preference in about:config
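
The fallback described above can be sketched roughly as follows. This is an illustration of the idea, not Firefox's actual code: the function name is made up, the validity check is deliberately crude, and the search URL (using Google's `btnI` "I'm Feeling Lucky" parameter) is an assumption standing in for whatever keyword.URL is set to.

```python
from urllib.parse import urlencode, urlsplit

# Assumed stand-in for the value of the keyword.URL preference.
KEYWORD_URL = "https://www.google.com/search?"

def resolve_location(text):
    """Return the typed text if it looks like a web URL, else a search URL."""
    parts = urlsplit(text)
    if parts.scheme in ("http", "https") and parts.netloc:
        return text
    # Not a usable URL (e.g. "http;//slashdot.org"): hand it to the search engine.
    return KEYWORD_URL + urlencode({"btnI": "I'm Feeling Lucky", "q": text})

print(resolve_location("http://slashdot.org"))
print(resolve_location("http;//slashdot.org"))
```

With a semicolon in place of the colon there is no scheme to parse, so the whole string is treated as a search term and the browser lands wherever the search engine's top hit points.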

Re:I dunno (1)

Spudley (171066) | more than 9 years ago | (#9953266)

Well, that's easily solved. We just need do one of those google hacks - get enough links pointing to a site with a given keyword, and you can put any site on the top of that list. (cough... "French military defeats"... cough... "miserable failure"... etc).

[hehehe... imagining the /. response to that suggestion - picking a site to google hack 'http'... and everyone says: "oooh! oooh! let it be mine! pleeeeeease?" ;-) ]

Re:I dunno (1)

rudy_wayne (414635) | more than 9 years ago | (#9952425)

In Mozilla it takes me to www.http.com which appears to be a site set up for the sole purpose of displaying advertising to people who mis-type a URL.

Re:I dunno (1)

frAme57 (145879) | more than 9 years ago | (#9953115)

that's weird; that's really weird

'zilla went to some random search page, IE went to /. but firefox went to microsoft.com

if that's what you mean, it wasn't just you

Re:I dunno (1)

Halfbaked Plan (769830) | more than 9 years ago | (#9954504)

I wouldn't know, as Firefox is a dead-end consumer-only browser. Regular Mozilla has a composer, so is a two-way communications tool. I installed Firefox once, but it wasn't impressive. Why would I want a view-only tool for the Web?

Re:I dunno (1)

pyite (140350) | more than 9 years ago | (#9955534)

Why would I want a view-only tool for the Web?

I don't know, maybe for the same reason I don't use a speaker as a microphone. Sure, it can be done, but why would I want to do it when there are more specialized tools?

I too recently noticed... (1)

haplo21112 (184264) | more than 9 years ago | (#9951132)

...that the OCR market had died down...

I for one was encouraged by the previous progress, but also frustrated by the still existent shortcomings of the software. Clear printed/typewritten documents still had a high rate of error (these were especially frustrating; was it unreasonable to expect letter-quality reads to be nearly error free?)...handwriting that was fairly clear was getting there, but not quickly.

Perhaps the Voicedictation market has something to do with it, maybe the recognition rate and quality is higher than OCR and now people are just reading the documents in with voice Rec apps?

Re:I too recently noticed... (1)

Karma Farmer (595141) | more than 9 years ago | (#9951251)

maybe the recognition rate and quality is higher than OCR and now people are just reading the documents in with voice Rec apps?

Assume you have documents you want to convert to digital text (and not just scan).

If you have money, then you either hire a temp to type them in for $100 a day, or contract them out to some poor schmuck in India to type them in for $5 a day.

If you don't have money, then you're probably not what capitalists like to call a "customer."

Re:I too recently noticed... (2, Interesting)

tchuladdiass (174342) | more than 9 years ago | (#9951423)

To add to this, no OCR package is 100% accurate. Most will be 95 - 99%, which still means you have to have someone proofread / correct each page, which is just as expensive as having the text entered manually.
Side note: I remember a number of years ago, trying out OCR, and it turned out that I could type the page in slightly faster than it could be scanned and recognized.
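
To put numbers on why 95 - 99% still means proofreading every page, here is a small worked example. It assumes accuracy is measured per character and uses a hypothetical round figure of 2,000 characters per page; neither assumption comes from the comment.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of wrong characters on one page."""
    return round(chars_per_page * (1 - accuracy))

# 2000 characters per page is an assumed round figure for a typed page.
for accuracy in (0.95, 0.99, 0.999):
    errors = expected_errors(2000, accuracy)
    print(f"{accuracy:.1%} accurate: ~{errors} errors per page")
```

Under those assumptions that works out to about 100, 20 and 2 wrong characters per page respectively, so even the advertised 99.9% leaves something to fix on nearly every page.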

Re:I too recently noticed... (1)

I_Love_Pocky! (751171) | more than 9 years ago | (#9951918)

Human data entry is not 100% accurate either. Especially with the kind of people filling low end data entry jobs.

Re:I too recently noticed... (1)

Twirlip of the Mists (615030) | more than 9 years ago | (#9952660)

OmniPage Pro X is sufficiently close to 100% accurate as to make no difference. Starting with INCREDIBLY bad documents--photocopies of photocopies of photocopies of declassified memos--OmniPage Pro X just churns through them. It's almost eerie. It's too good. It's like it knows.

Re:I too recently noticed... (0)

Anonymous Coward | more than 9 years ago | (#9971608)

Yeah, I found it pretty good in handling a 40 year old manual. What it didn't do too well at were the 'scientific' characters like a +/- symbol and the percent sign, and all the electronics symbols like 'mu'. But it did mis-read them consistently making it easy to correct afterwards. I never did bother with the training option of Omnipage.

Re:I too recently noticed... (1)

walt-sjc (145127) | more than 9 years ago | (#9951833)

contract them out to some poor schmuck in India to type them in for $5 a day.

I worked at a company that did this years ago. We started with OCR, but due to the error rate on even perfectly printed material, we dumped it and sent the material to India. It was pretty inexpensive - much less than our time correcting the OCR mistakes. We had triple entry which was still very cheap.

I could take something and print it in courier 16 point at 600dpi, scan it, and the OCR would still screw up about 1% of the time. Normal pages were somewhere around 5% errors, which is HUGE. It takes so long to correct that you could just type the thing in. Note that many of these OCR packages would claim 99.9% accuracy but they never got close. I tried everything from $99 programs to high-end $10K programs and they were all about the same, really.

Personally... (2, Interesting)

WildFire42 (262051) | more than 9 years ago | (#9951188)

Personally, I believed that the amount of return for any further research put into OCR technology wasn't really worth it at this point. OCR is actually pretty darn reliable for printed characters, even if it sucks wind for handwriting. Mostly, people are interested in OCR'ing printed characters, and handwriting recognition is just one of these nifty, shiny technologies that wouldn't be used that often.

At this point, OCR is a commodity. It's not really worth the hundreds of thousands or millions of dollars for research to get an extra 2% accuracy, so the technology is stagnant and the prices for standard, printed character OCR are dirt cheap.

With that being said, I see voice dictation as the next big thing. Voice recognition is where OCR was 10 years ago, still new, not many players in the market, and a lot of room for technological improvement. The accuracy isn't that great, even with extensive "training", and more and more, because of the need for archiving, data warehousing, captioning for accessibility (Section 508, W3C WAI and the like), captioning without training is going to become a shining goal within the next 10 years.

Re:Personally... (1)

jfdawes (254678) | more than 9 years ago | (#9951710)

I'm sure that anything that can successfully recognize handwriting would also be able to recognize a significant portion of the new variety of "Only a human could recognize this" tests being used to validate new logins for email providers and the like.

Re:Personally... (1)

walt-sjc (145127) | more than 9 years ago | (#9951902)

OCR is actually pretty darn reliable for printed characters

That has not been my experience. I have found the accuracy to be horrible - even on high-end systems. What we ended up doing for a document management system was to use the OCR text for searching, while the raw image is what gets retrieved. 100% accuracy isn't very important then.
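
A minimal sketch of that "search the OCR text, retrieve the image" arrangement might look like the following. The file names and sample text are invented, and the fuzzy matching via the standard library's `difflib` is one assumed way to tolerate OCR misreads, not a description of any particular product.

```python
import re
from collections import defaultdict
from difflib import get_close_matches

def build_index(pages):
    """Map each word found in the OCR text to the page images containing it."""
    index = defaultdict(set)
    for image_name, ocr_text in pages.items():
        for word in re.findall(r"[a-z0-9]+", ocr_text.lower()):
            index[word].add(image_name)
    return index

def search(index, term, cutoff=0.8):
    """Return images whose OCR text holds the term or a near-miss spelling."""
    matches = get_close_matches(term.lower(), list(index), n=5, cutoff=cutoff)
    hits = set()
    for word in matches:
        hits |= index[word]
    return sorted(hits)

# Hypothetical scans; the second one has typical OCR misreads.
pages = {
    "scan_001.tif": "Invoice for bicycle repair",
    "scan_002.tif": "Annual rep0rt on blcycle sales",
}
index = build_index(pages)
print(search(index, "bicycle"))
```

The user only ever sees the original page image, so a misread like "blcycle" costs nothing as long as the search is forgiving enough to find it.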

Re:Personally... (1)

br0ck (237309) | more than 9 years ago | (#9952249)

Well, I've done quite a few pages at Project Gutenberg's Distributed Proofreaders [pgdp.net] where you donate your proofreading time to clean up text scanned in from books and the first draft from the computer is usually pretty dang close. I find that it is usually just a matter of cleaning up formatting for things like footnotes and scientific notations.

Where did useful OCR go? (2, Interesting)

Karma Farmer (595141) | more than 9 years ago | (#9951198)

I want OCR that works, and I want a flying car.

I'm assuming people got sick of paying $39.95 for OCR software that didn't do jack squat, and was about as reliable as handing your documents to a spastic monkey. I'm also assuming software makers got sick of making $3 or $4 (or less) on each package, only to get a million tech support calls along the lines of "It doesn't work. I want my money back."

For $400, I'm guessing the software vendors can afford a small amount of support, and can expect the users to be willing to understand the limits of the software.

Blatant lack of research. (4, Informative)

GoRK (10018) | more than 9 years ago | (#9951231)

I usually don't post replies like this, but this question is ridiculously underresearched. OCR is a hard problem. Sure, an OSS alternative would be nice, but until a solution matures, when you really need OCR you need it because it's generally unreasonable, either from a time standpoint or a budget standpoint, to use any alternative. That is why people pay for software sometimes.

TextBridge, PaperPort, and a host of other entry level programs are available for windows under a $100 price point. Generally if you buy a decent scanner (ie not a $50 piece of crap), you'll get some software capable of doing OCR bundled for free.

Higher-end OCR packages with better accuracy, more features, etc. often cost quite a bit more. OmniPage Pro is a decent package for only slightly more than $100. ReadIris is a really good program, and is reportedly very quick in comparison to some of the others. I imagine this is the reason that it costs $400.

There are document management packages out there that have very good OCR integrated that cost a hell of a lot more than $400. Trust me, though, if you're looking at the time or cost of converting a few thousand pages of data into editable text documents, a program that costs even $400 should be a steal.

Why bother?? (2, Insightful)

Syncdata (596941) | more than 9 years ago | (#9951583)

OCR was a good idea when hard drive capacity was less fantastic than it is today. The idea of taking a page of handwritten text, and scanning it magically into a supersmall text file was attractive. But OCR wasn't terribly accurate, and to make it so would require quite a bit of R&D. All of a sudden software houses need to hire handwriting analysts.

In the meantime, hard drive capacity grew, and all of a sudden, the difference between a 4k text file and a 35k jpg became negligible. The only real benefit OCR offered was the ability to spellcheck and search for words within the document. Given that OCR was prone to creating nonsensical non-words, as well as changing a word like "which" into "mitch", even these benefits became less frequently used.

I'm convinced the most common phrase associated with OCR programs is "eh, just save it as a .jpg".

Re:Why bother?? (1)

belg4mit (152620) | more than 9 years ago | (#9952036)

I bother because the readings for my courses are available as PDFs created from page scans. Now, when I need to go back and find something I can't grep through embedded TIFFs and JPEGs. So I tried AdLib Express, and it works pretty damn well if not expensive as hell. Plus, it embeds the OCR results in the PDF so you can search within the documents as well.

Re:Why bother?? (2, Interesting)

photon317 (208409) | more than 9 years ago | (#9952548)


The problem is that jpegs can't be grepped like text. People don't just want to scan a stack of images, they want the data to have meaning. In some cases they even want to parse typed hospital forms into an xml format for example.
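
As a sketch of what "giving the data meaning" could look like once the text has been recognized, the snippet below turns "Label: value" lines from a form into XML. The field names, the line format, and the function name are all hypothetical; real forms would need layout-aware extraction rather than a single regular expression.

```python
import re
import xml.etree.ElementTree as ET

def form_text_to_xml(ocr_text, root_tag="form"):
    """Convert 'Label: value' lines of OCR output into a flat XML record."""
    root = ET.Element(root_tag)
    for line in ocr_text.splitlines():
        match = re.match(r"\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$", line)
        if not match:
            continue  # skip lines that do not look like a labelled field
        label, value = match.groups()
        tag = re.sub(r"\s+", "_", label.strip().lower())
        if not tag:
            continue
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")

# Invented sample of recognized text from a typed form.
sample = "Patient Name: Jane Doe\nDOB: 1970-01-01\n(illegible stamp)"
print(form_text_to_xml(sample))
```

Once the fields are in a structured format they can be queried or loaded into a database, which is the part a stack of JPEGs cannot offer.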

Re:Why bother?? (0)

Anonymous Coward | more than 9 years ago | (#9952657)

And those people need to buy a software package for a couple hundred dollars and change.

Re:Why bother?? (1)

larien (5608) | more than 9 years ago | (#9952900)

Exactly; the old phrase "you can't grep dead trees" carries forward into this; perhaps it's time to start saying "you can't grep jpegs"?

Also, how about other uses, like readers for the blind or visually impaired?

Re:Blatant lack of research. (1)

evil_one666 (664331) | more than 9 years ago | (#9957541)


Higher-end OCR packages with better accuracy, more features, etc. often cost quite a bit more. OmniPage Pro is a decent package for only slightly more than $100. ReadIris is a really good program, and is reportedly very quick in comparison to some of the others. I imagine this is the reason that it costs $400.

You are, unwittingly perhaps, succumbing to one of the most persuasive, yet oldest sales tactics in the book. Just because one costs $400 and another costs $100, there is absolutely no reason to assume that the former is better than the latter.

What a good question (3, Informative)

the Man in Black (102634) | more than 9 years ago | (#9951264)

My company just paid ~$1,800USD for OCR software (ABBYY FormReader [abbyyusa.com] ). We're scanning in stacks of healthcare forms, reading the data, and spitting them out into DBF format. Why? I don't know, I just do my job. It was my responsibility to review and demo other pieces of software, and ABBYY's was definitely the most robust. Open Source had, as the poster stated, few contenders and even fewer that had been worked on since the 90s.

I don't know what happened to OCR, but there's certainly still a need for it.

Re:What a good question (3, Funny)

i621148 (728860) | more than 9 years ago | (#9951889)

health care documents are being OCRed? no wonder i got a gender reassignment surgery instead of a wart removed >;)

Re:What a good question (0)

Anonymous Coward | more than 9 years ago | (#9962319)

We're scanning in stacks of healthcare forms, reading the data, and spitting them out into DBF format. Why? I don't know, I just do my job.

Don't forget the cover sheets for those TPS reports, mmmmkay?

Free OCR? (3, Interesting)

Asprin (545477) | more than 9 years ago | (#9951316)


So far as I can tell, NON-free OCR isn't doing so hot either -- you pretty much have to proof-read and correct everything you scan anyway, which just makes it impractical for most purposes. If I had to scan a bunch of records, I'd probably outsource it to a pay service that specializes in that sort of thing, which means it would have to be worth the cost of getting it done.

What I want to know is what's Google going to do about this? They have a catalog search in their Google Labs playpen that indexes products and their descriptions to make them searchable. ...and by searchable, I mean you can search for "bicycle" and it will highlight all of the instances of that word in some 200+ PRINTED catalogs, not similar HTML/XML/PDF electronic documents. So clearly, they know some things about OCR we don't (and probably 2D map indexing, too), but durned if they aren't letting on about it.

In the next few years, I expect to see a fully automated Google OCR product that can not only scan your paper docs, but index them and help you search them too, all while maintaining the electronic copies in their original scanned (think photograph) state, not some bastardized, mistranslated and screwed up PDF or DOC format.

**THAT'S** what's going to kill Microsoft, and probably why they're so keen to risk overreaching on their IPO.

Re:Free OCR? (1)

Jmstuckman (561420) | more than 9 years ago | (#9953845)

The product you describe has already been created by several companies. For example:

http://www.onbase.com/products/onbasemodules/inp ut modules/ocr.html

Also, why should Google market this product? It's not like they're the only ones who can search OCR documents (if you've used Amazon.com's book searching feature, you'll see the same thing.) Also, it's not like they're going to use PageRank to help them search, because these aren't web pages.

Re:Free OCR? (1)

Judg3 (88435) | more than 9 years ago | (#9955631)

Yup, one of the companies I used to work for (Stock Exchange) used OnBase for its OCRing. I admined it, so I got to get pretty in-depth with it (And not to mention training from Hyland is AWESOME! Not the training itself really, the nightly bar crawls and trip to the Rock and Roll Hall of Fame at the end!).
It does what the above posters said plus it has this really slick way of placing the media - you can spread it out over a half dozen different SANs, some DVD-changers, a SQL database, etc and OnBase will know where you put the data and grab it for you - it was truly a wonderful OCR program. (Though at the time horribly expensive, requiring hardware dongles to allow different access to the system)

Re:Free OCR? (0)

Anonymous Coward | more than 9 years ago | (#9954191)

Hate to say it, but Office XP has a built-in OCR component (based on ScanSoft OmniWeb Capture SDK) that already scans and converts images for indexing purposes.

Elsevier publishing did this ages ago (1)

davidwr (791652) | more than 9 years ago | (#9960403)

"In the next few years, I expect to see a fully automated Google OCR product that can not only scan your paper docs, but index them and help you search them too, all while maintaining the electronic copies in their original scanned (think photograph) state, not some bastardized, mistranslated and screwed up PDF or DOC format."

The scientific publishing house Elsevier [elsevier.com] did this in the mid-90's.

They took the past few years of several of their journals, scanned them in, did a less-than-perfect OCR on them, and created an index from the OCR results.
When you "searched," what you got back was a list of sentences. These sentences were riddled with typos/mis-reads. When you clicked on one of them, you got the actual photographic image with a box around where the text was. There were a few mistakes and you learned to ignore them, but it served its purpose.

Things are a lot better now. Newer (post-mid-90s) publications are available digitally, and the older ones (going back only so many years, of course) have much better OCR-based indexes and usually a PDF which contains the images of the journal.

If you want some obscure journal from 1965 though, be prepared for a trek to the library and hope you can get it on inter-library loan. If you are LUCKY the abstracts might be available online somewhere.

It's included in Microsoft Office! (0)

Anonymous Coward | more than 9 years ago | (#9951368)

I don't know since when, but at least from Microsoft Office XP on, an OCR is included. It does a good job and it integrates nicely with Windows and Office.

As for open source OCRs, well... Open source developers started working on them and lost interest when they realized they did not have any books to put in. You know, not everyone can afford books. If all you can afford for your PC is Linux, then there are other priorities than books for your OCR, such as food, etc.

Re:It's included in Microsoft Office! (1)

cpsc2005 (629087) | more than 9 years ago | (#9955180)

Looks like I got beaten, but I already had typed up a reply in Notepad, and mine is better than that "Coward's" anyway as it tells you where to get it from.

I was rooting around in MS Office 2003 and I noticed that it has a category in its installation setup under "Office Tools" in a little + thing entitled "Microsoft Office Document Imaging." Under this there are three options, "Scanning, OCR and Indexing Filters" "help" and "Microsoft Office Document Image writer." As I had to remove Office 2003 after installing SP2, (All of the programs crash when I try to pull down the font choosing form, so I figured uninstalling Office, reboot and install might fix things since that's what you must do with microsoft) I cannot vouch for how well it works... If you have Office 2003, you might wish to configure it so that this is installed and give it a whirl... Once you have it installed, it shows up in your Start Menu under "Programs->Microsoft Office 2003->Microsoft Office Tools->Microsoft Office Document Imaging." The help file recommends that when scanning and wanting OCR, to use Black and White only, no colour. I wish you luck.

(As an aside, I finished my reinstall and installed SP1 for Office 2003, and it now hangs for about a minute or so when I attempt to expand the font list... I have a feeling I have too many fonts installed, probably from Photoshop and what not. I need a wish of luck myself...)

MOD PARENTS UP (0)

Anonymous Coward | more than 9 years ago | (#9956575)

It's as close to free as you'll find

Offshoring is probably cheaper (1)

LordNimon (85072) | more than 9 years ago | (#9951638)

I bet someone could make a killing setting up an off-shore operation, say in India, where actual humans read your document and type in the text for you. It'd be cheaper and more accurate than high-end OCR software.

Re:Offshoring is probably cheaper (1)

tweedlebait (560901) | more than 9 years ago | (#9952247)

These services do exist and some places I've worked with have used them. The usual problem is a poor understanding of the 'gotchas' in English. Document structure and names get mangled the most.

Paperless office killed it? (2, Insightful)

MobyDisk (75490) | more than 9 years ago | (#9952007)

Ok, everyone laugh at me. I say the paperless office killed OCR. :-) Yeah, that thing that would supposedly never happen? That is the butt of office jokes? Well, I think it did and nobody noticed.

How much paper do you see around you that wasn't already computer generated? Paper still exists as a convenient thing to hang up, or to take to a meeting, but it is always printed. There's no point in complex OCR packages when people can just get the soft copy.

There is very little left to scan. Large organizations that are moving from paper to electronic systems already keyed the data in manually and don't need the technology anymore. The internet killed the need for faxes, which were unreadable anyway. What's left to OCR?

With that said, my bank doesn't offer online statements, so I scan them every month. But I don't bother to OCR them. My credit card company just started, so that will leave me with one sheet of paper every month.

There's a lot of manual handwritten data generated (1)

davidwr (791652) | more than 9 years ago | (#9960543)

Not much that's typed, but a lot of printed filled-in forms still lying around.

Think "teacher's comments" in school records, "officer's comments" on traffic tickets, doctor's notes, and in some countries, paper checks.

Yes, a lot of that is moving towards digital-data-entry, and a lot of the rest is being moved to scan-store-and-shred.

But in the meantime, there's a market for OCR and after-the-fact handwriting recognition.

As an example, the folks at GrokLaw [groklaw.net] are putting SCO-related court case files online. Usually someone can get a PDF fairly easily, either from the court's web site or having someone local to that particular court go to the courthouse, get a paper copy, and scan it. It sometimes takes a day or two before the text version is up. If the PDF is from a "clean scan" then OCR can do a really good job of it. If the PDF is "dirty" due to a bad scan, add-ons like rubber stamps or handwritten notes, or just a bad photocopy at the courthouse, it takes a lot of hand-editing. I know, I've tried.

Google it. (-1, Flamebait)

Anonymous Coward | more than 9 years ago | (#9952348)

It's very telling that most of the "Ask Slashdot" articles can be answered by one website:

F@#King Google It [fuckinggoogleit.com]

Distributed Proofreaders (3, Informative)

Smallpond (221300) | more than 9 years ago | (#9952543)

The DP site [distribute...eaders.net] does OCR and proofreads the results for Project Gutenberg. Anyone can join and spend a few minutes once in a while proofreading books. If you are kind of ADD like me, it lets you read about 3 pages of a book once in a while without having to actually sit down and do cover-to-cover.

TextBridge - $80 (1)

portscan (140282) | more than 9 years ago | (#9952572)

I remember using TextBridge in 1998 on a Mac with an Apple scanner. It was quite excellent at the time. I can't say whether it has improved, but I cannot imagine that it has gotten any worse. On typed documents, I got about 98% accuracy--sometimes better.

It is $80 now and there appears to only be a Windows version, but you appear to be running Windows, so no problem there. Enjoy.

paperless office , ocr rant, etc. long (5, Informative)

tweedlebait (560901) | more than 9 years ago | (#9953354)

(I'm in the document imaging / conversion industry)

The term paperless office is considered a joke, and the funny part of it is this: as soon as someone looks up a document in their doc management system they just print it. Even if just to glance at! Copier/printer companies are thrilled!

There are megatons of paper and microfilm out there left to ocr and process. It's considered a pretty fast growing industry, although stunted recently after the bomb and more by the economy.

Having ocr'd images is very handy. Here's an open secret though-- Image + hidden text pdf.
--Searchable, you have the original doc just as it looked, and the ocr errors don't make such an impact. It's easy to throw into a search engine and the prints look great, and the files are small (for b+w use tiff group iv, and jpeg for color; jbig is not quite mature yet and only a few apps from cvision do a great job at it)

Anyway, since people just hit print as soon as they find their doc in a system those file cabinets we tried so hard to empty and organize refill magically.

Also, scanning and setting up an edms (electronic doc management system) is considered a luxury. Businesses move slowly with luxury items and usually get to reap the benefits of more mature software and systems (but this is NOT always true!).

Many other businesses that are slow to adopt tech are just discovering scanning, ocr and doc management. Litigation is a great example. xerox was doing quite a few tv ads recently touting that stuff.

The state of ocr itself is strange. There has been a sort of plague in that industry of 'weird innovation' for years and many buyouts or companies changing the focus of their ocr product to another industry (like web or xml). Even the small office versions ($500 range) are not geared for any sort of reasonable volume or speed without crashing and burning, and are usually designed to be babysat. Using these apps leaves the user with a really bad experience. For those not familiar the process goes something like this for a 200 page b+w document:
==
Scan (or import but import is usually crippled)

gaze at loads of memory hogging eye candy (this is what your upgrade bought you usually)

wait

correct skew (wait for crappy tools)
(possibly reboot from crash)

recognize page - slower with each new version, even when hardware is so much faster every year. some recognition is improved in some packages. Some of the latest i've tested take over 15 sec and sometimes over 45 sec per page!
(crash!?)

Correct errors / tune learning engine. (sometimes i swear this effort of teaching goes straight to NULL)

repeat 199 times

Now that you're locked at your desk and finished scanning, it's time to export! (like i didn't know what formats i wanted before i sat down.)

So it chews and chews and maybe crashes causing you to repeat all the above steps. Also note that most of these apps keep all the pages pretty much uncompressed in memory, then create a copy of them in memory for your desired output format. (crash)

2 days of work gone.

====
Most users walk away with the feeling of 'Yikes! all I wanted was a word doc of this. I'll just do something else'

For the home and also small biz market here are some of the 'weird innovations'--
===========
typereader 5 -- pretty good app! doesn't do image+hidden text pdf though. Pity. has a batch file import and is reasonably priced in the $100 range. nice and fast with good results

Typereader 6 and up -- file import feature moved to industrial version, lots of eye candy, less stable, minor improvement in recog, a bunch of other silly limits, and slow

Omnipage -- same thing, only it's never been great for over 50 pages. horrid workflow and crashes like crazy. very unpredictable!
Omnipage version 3 was better in many ways than omnipage 14. (lightning fast on today's equipment too :)

abbyy finereader -- very slow but great recognition, more stable but lame workflow. expensive too.

Oh yeah adobe acrobat!
Started with capture (if you used the older versions you have a prescription for prozac) with a per click license. As of acrobat full version 4 (iirc) some of the ocr was included and was pretty good and somewhat stable but with a lousy workflow. acro 6 comes along and to do any of the ocr you have to **upload** **your** images to them and have them do it. No thanks! or you can plunk down a few thousand and get their industrial product (haven't tried it).
=================
Many of these have industrial versions and sdk's that are built into more industrial software; those are much more stable (but not perfect) and can handle much higher volumes.

For OSS i haven't run into much that was usable or competitive with the more industrial versions of ocr software. Admittedly it's been a while since i've looked.

On the industrial side ($$$$$)

Here are some tidbits that are neat:
--also http://www.aiim.org/ [aiim.org] is a place to find most of the companies that do this stuff.

http://www.parascript.com/ [parascript.com]

since handwriting was mentioned, these guys have some of the most amazing handwriting rec i've seen. It read my handwriting when i was hungover and purposely trying to fool it. nice folks too.
my handwriting==seismograph

http://www.primerecognition.com/ [primerecognition.com]

These guys use a voting system taking several engines from abbyy, omnipage, recore, etc and having each process the page then vote on the results. very very accurate. they do have a service to handle your paper if you don't have the megabucks for their software.
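
To make the voting idea concrete, here is a minimal sketch in Python. It is not how Prime Recognition's product actually works (that isn't described here); it assumes every engine returns a reading of the same length so the characters line up, and the three sample readings are made up for illustration.

```python
from collections import Counter


def vote(readings):
    """Character-by-character majority vote over equal-length OCR readings."""
    if len({len(r) for r in readings}) != 1:
        raise ValueError("this sketch assumes pre-aligned, equal-length readings")
    result = []
    # zip(*readings) yields one tuple per character position,
    # holding what each engine saw at that position.
    for chars in zip(*readings):
        winner, _count = Counter(chars).most_common(1)[0]
        result.append(winner)
    return "".join(result)


# Hypothetical outputs from three engines for the same line of text.
readings = [
    "It is IMPOSSIBLE",
    "1t is IMP0SSIBLE",
    "It is IMPOSS1BLE",
]
print(vote(readings))
```

Real engine outputs rarely line up this neatly, so a production system would need an alignment step before voting.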

It's a shame this industry hasn't produced anything great and inexpensive, and has generally removed useful features, replacing them with more eye candy just to sell another version.
Hopefully there will be more OSS projects to fill the voids the commercial market has ignored.

I want people to easily be able to convert their paper without paying a huge price.

Re:paperless office , ocr rant, etc. long (2, Funny)

Spike_/\_ (155774) | more than 9 years ago | (#9953579)

tweedlebait - thanks for the great survey, and for ++ S/N(/.) I'm saving it - no, I think I'll print it out...

Re:paperless office , ocr rant, etc. long (1)

tweedlebait (560901) | more than 9 years ago | (#9956133)

Thanks. My pleasure.

i'm probably just thick headed tonight but whacha mean by
"and for ++ S/N(/.)" ?

Thanks!

every time you don't print trillions of electrons are forced into slavery.

Transym OCR (1)

jhoger (519683) | more than 9 years ago | (#9953405)

I've had good results with transym OCR. I had to run it under VMware. I tried all the F/OSS options but they produced unusable results. I think it cost around $40.

I am heading up a project to convert an out of print computer book to LaTeX (with the author's permission) and one of the volunteers suggested this package. One other nice thing about it is that the registered version comes with API documentation and VB6 source code to the front end, so you can change it however you want as long as you don't need to modify the engine.

Freshmeat (1)

noselasd (594905) | more than 9 years ago | (#9953423)

Yielded these programs, last updated in 2004, that to some degree deal with OCR:
http://www.pattern-lab.de/gui.html
http://www.claraocr.org/
http://www.gnu.org/software/ocrad/ocrad.html
http://www.kde.org/apps/kooka/
http://lem.eui.upm.es/ocre.html

Service Bureaus (1)

IntlHarvester (11985) | more than 9 years ago | (#9954365)

A Service Bureau (copy shop or whatever) will do OCR in bulk for about 10 cents a page, and that includes the scanning labor (which is sometimes done offshore).

So, $400 buys you a lot of OCR -- especially when you consider you have to pay labor costs, document management costs, etc. on top. So, I wouldn't deploy OCR software unless it's a once-in-a-while thing or something that's central to your business process.
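
A back-of-the-envelope sketch of that comparison, using only the two figures quoted above ($400 for a package, about 10 cents a page at a bureau). The function name is made up, and in-house labor is deliberately left out, which only pushes the break-even point further out.

```python
def break_even_pages(software_cost_cents, bureau_cents_per_page):
    """Pages a service bureau would process for the price of the software."""
    # Working in whole cents avoids floating point rounding surprises.
    return software_cost_cents // bureau_cents_per_page


pages = break_even_pages(400 * 100, 10)
print(f"$400 of software buys about {pages} pages at a service bureau")
```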

Comes with scanners (0)

Anonymous Coward | more than 9 years ago | (#9954435)

Almost any scanner you buy will come with an OCR package, and if you buy a computer with a scanner bundled, it may come with the OCR software pre-installed.

Have you tried poking around on the sites of scanner manufacturers (epson, for instance)? Maybe they'll at least have pointers to inexpensive packages....

OCR is a waste of time (1)

xtal (49134) | more than 9 years ago | (#9956232)

It doesn't work that well, and is a PITA for forms. What is worthwhile is imaging the file - just scan the document you want, and "file" it in a directory. When you want a document, look it up the way you would normally then print it. Presto.

a few ins and outs of OCR (3, Informative)

tweedlebait (560901) | more than 9 years ago | (#9956399)

It's slightly off topic but seemed appropriate.

Here's some quick tips/nuggets of crispy wisdom.

The art of ocr is like working with autistics. give them what they expect. the more surprises, the more episodes.

Don't believe the hype.

Scan black & white to TIFF GROUP IV. OCR systems are optimized for this. Color is new and pretty wacky still. BMP even freaks out in black and white on some packages.

Make sure your background is white and clean, not speckled. despeckling tools can be overused and kill ocr results.

3 hole punches regularly show up as o O 0 D
staples: ~ .. // c d

Deskew all images to a line of text, not the page

Scan at 200-300 dpi but not higher than 600 or most apps will choke and produce bad results.

Make a custom dictionary if you can. if you're doing automotive related stuff, look up auto terms and make a dictionary out of it.

To process tiny text (concordances etc) scan at 800dpi and then fool the ocr by scaling the image to 300. sounds nuts right? ok try it the logical way first and then come back and try this technique.

Shaded text is a new thing in documents, as are inverted text blocks (thanks word... you make my job hell). You must remove the shading with something like scanfix by tms sequoia, a good tool for small doc cleanup before ocr. It requires practice and trial and error. interface needs some work though.

Dot matrix prints should be scanned and some blur added to join the dots (unless you are using something expressly made for dmp's). as always your mileage may vary

Turn off auto rotate (mangle) features. They are not very smart and often have monkeyvision. just review your images beforehand and rotate accordingly.

If you're scanning something poster size, or engineering drawing size (not recommended for most ocr) cut it into smaller images. ideally regions of interest not larger than 8.5x11

Remember 99% accurate means 1 character per hundred will be screwed up.
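
To put that 99% figure in perspective, a quick sketch. The 2,000 characters per page is an assumed round number, not a measurement, and the helper name is made up.

```python
def expected_errors(char_count, accuracy):
    """Expected number of misread characters at a given per-character accuracy."""
    return char_count * (1.0 - accuracy)


CHARS_PER_PAGE = 2000  # assumption: a fairly dense typed page

per_page = expected_errors(CHARS_PER_PAGE, 0.99)
per_document = expected_errors(CHARS_PER_PAGE * 200, 0.99)
print(f"about {per_page:.0f} errors per page, {per_document:.0f} in a 200 page document")
```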

Table of contents pages are an interesting test for ocr especially if they use periods to lead to page numbers. How many identical characters can occur before the ocr system misreads. often quite telling.

OCRing a spreadsheet and using the data without verifying every character? may the monkeygods help you.

Above applies to processing screenshots, 17th century print, tabloid print, multi-column, shaded background handwriting w/o special software, modern magazines, etc.

OCR does not like non seriffed fonts much.

The post office spends millions on theirs and they have a nice address DB list to verify against.

Over 90% of banks scan your checks and microfilm them. (and there is some really cool signature verification software out there for forgery detection) MICR font at the bottom helps them immensely.

Breaking down your document into areas can be useful. changing fonts and sizes sometimes throws it off. an example would be computer lit with code snippets interspersed.

Do yourself a favor if it applies and use image+hidden text pdf. raw ocr is almost always yucky and all those claims of preserving document layout and format are just that--claims.

If you do use i+HT pdf, or for a larger job for that matter, do it in small chunks so your app doesn't crash. for pdf, join the small documents together in acrobat later or use other tools to do so.
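
The chunking tip can be sketched in a few lines of Python. The file names and the batch size of 25 are assumptions for illustration; the actual OCR call is left as a placeholder since it depends on your package.

```python
def chunks(pages, size):
    """Yield consecutive batches of at most `size` pages."""
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


# Hypothetical file names for a 200 page scan job.
pages = [f"page_{n:03d}.tif" for n in range(1, 201)]

for batch in chunks(pages, 25):
    # Placeholder: hand this batch to your OCR app, save the result,
    # and only then move on, so a crash costs one batch and not the whole job.
    print(f"{batch[0]} .. {batch[-1]} ({len(batch)} pages)")
```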

For fun and science, take an old apple newton 100 and trace over some of the text on your page and compare its results to your ocr package.

Anyway i hope that helps someone avoid a few landmines and there are many more tips out there. these are from my experience and off the cuff.

Gone, but possibly coming back soon (1)

jpop32 (596022) | more than 9 years ago | (#9957028)

My guess would be that OCR lost its appeal when pretty much every text on paper originated in a computer. OCRing nowadays is a need to only a niche of users (form scanning, archives and stuff like that), and those are always expected to pay the premium.

Anyways, the possible comeback of OCR may occur in the near future, with the inevitable ubiquity of camera phones and processor power behind them. I sure could use a phone that could scan a URL from a newspaper and take me there. Or call a phone number printed in the ad. Or mail/SMS a piece of an article. Or even translate a foreign text.

BTW, I still have fond memories of reading OCRed Amiga Hardware Reference Manual (THE book for Amiga hacking) with every other 'm' recognised as a 'rn'. :-)

I do this for a living. (1)

gurps_npc (621217) | more than 9 years ago | (#9959061)

And I can tell you that good OCR is WORTH the money.

And people WOULD pay more for better accuracy. My company pays huge amounts for OCR work, usually getting in boxes of CD's each week.

Lawyers consume OCR capacity like it was Wine the night before Prohibition starts up.

Yes we use the low end junk they put out now, but we would love to pay much more money for stuff that was even 10% better. Right now, even OCR of Typed documents SUCKS!!! Yes it is 99+% accurate, but one letter off in one hundred means we have to pay people to read all 3 million documents we subpoenaed to find any dirty words letting us sue the company for sexual harassment instead of just telling the computer to find them.
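
A rough sketch of why 99% per-character accuracy hurts keyword search so much. It assumes character errors are independent and equally likely everywhere, which real OCR errors are not, so treat the numbers as a ballpark.

```python
def word_intact_probability(word_length, char_accuracy):
    """Chance that every character of a word was recognized correctly."""
    return char_accuracy ** word_length


for length in (5, 10, 15):
    p = word_intact_probability(length, 0.99)
    print(f"{length}-letter word: {p:.1%} found by exact search, {1 - p:.1%} missed")
```

Under these assumptions roughly one in ten occurrences of a 10-letter word slips past an exact-match search, which is why people end up reading the documents anyway.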

The problem is OCR is an incredibly difficult task, even for typed stuff. Different fonts, poor photocopying, different colored inks/papers, all contribute to making it a very difficult job. And the use of special characters such as $, which looks an awful lot like S, makes it harder.

We often get crap that looks like this: "1+ is 1MP0$$18!E" instead of "It is IMPOSSIBLE"

Re:I do this for a living. (1)

taradfong (311185) | more than 9 years ago | (#9995208)

We often get crap that looks like this: "1+ is 1MP0$$18!E" instead of "It is IMPOSSIBLE"

If it happens often, I can't believe the OCR software doesn't have some way of flagging unlikely usages of '$' with suggested autocorrection.

Re:I do this for a living. (1)

gurps_npc (621217) | more than 9 years ago | (#10005035)

The problem is that there are just too many different possibilities, and too many errors. $o could be So or could be $0 and then there is the real problem, the minor imperfections from copying/on the page that get picked up and turned into weird punctuation. If you OCR a blank page you often as not get . ' , : ; and all sorts of weirder junk strewn randomly over it.
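
For what it's worth, here is a minimal sketch of the kind of flagging suggested above: expand each suspicious character into its look-alikes and keep only the candidates found in a word list. The confusion table and word list are tiny made-up examples, not taken from any real OCR package.

```python
from itertools import product

# Hypothetical look-alike table: each OCR'd character maps to what it might really be.
CONFUSABLE = {
    "1": "1Il",
    "0": "0O",
    "$": "$S",
    "8": "8B",
    "!": "!lI",
    "+": "+t",
}


def candidates(token):
    """All strings obtained by swapping confusable characters for look-alikes."""
    options = [CONFUSABLE.get(ch, ch) for ch in token]
    return {"".join(combo) for combo in product(*options)}


def suggest(token, dictionary):
    """Dictionary words (lowercase) that the token could plausibly be."""
    return sorted({c.lower() for c in candidates(token) if c.lower() in dictionary})


WORDS = {"it", "is", "impossible", "so"}  # stand-in for a real word list

for token in ("1+", "1MP0$$18!E", "$o"):
    print(token, "->", suggest(token, WORDS))
```

The candidate set grows multiplicatively with every ambiguous character, and a token like "$o" still needs context to tell "So" from "$0"; this sketch simply doesn't consider the numeric reading.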