Open Source OCR That Makes Searchable PDFs

timothy posted more than 3 years ago | from the word-of-advice dept.

Open Source 133

An anonymous reader writes "In my job all of our multifunction copiers scan to PDF, but many of our users want and expect those PDFs to be text searchable. I looked around for software that would create text-searchable PDFs, but most of it is very expensive and I couldn't find any that was open source (free). I did find some open source packages like CuneiForm and ExactImage that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder, WatchOCR picks up those files and OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else who has this same situation."
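For readers wondering what a "watched folder" service amounts to, here is a minimal sketch of the pattern. It is not WatchOCR's actual code (which this thread never shows); the function names, the folder layout, and the copy-only default for the `ocr` step are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def find_new_scans(inbox: Path, seen: set) -> list:
    """Return PDFs in `inbox` that have not been handed out before."""
    new_files = sorted(p for p in inbox.glob("*.pdf") if p.name not in seen)
    seen.update(p.name for p in new_files)
    return new_files


def process_inbox(inbox: Path, outbox: Path, seen: set, ocr=shutil.copy2) -> list:
    """Run `ocr(src, dst)` on every new scan.

    The default `ocr` only copies the file; a real deployment would pass a
    function that calls an OCR engine (hypothetical, not shown here).
    """
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for src in find_new_scans(inbox, seen):
        dst = outbox / src.name
        ocr(src, dst)
        done.append(dst)
    return done
```

A service would call `process_inbox` in a loop with a short sleep between passes. A production version would also need to avoid picking up a file the scanner is still writing, for example by waiting until its size stops changing.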

133 comments

Thanks! (5, Insightful)

Fast Thick Pants (1081517) | more than 3 years ago | (#32994454)

Wow, it's a "Tell Slashdot" segment! I've been looking for something similar myself, so thanks, I'll give this a spin!

Re:Thanks! (2, Informative)

godrik (1287354) | more than 3 years ago | (#32994580)

Same here. Thank you too!

(I know this post is very redundant and useless. But thanks are always welcome, aren't they ?)

Re:Thanks! (1)

sumdumass (711423) | more than 3 years ago | (#32994658)

Heh.. 6 months of looking for something and about to settle on a very expensive proprietary deployment.

I'll have to see if it works as easily as he says, but it's right there for my needs too.

Re:Thanks! (0)

Anonymous Coward | more than 3 years ago | (#32995176)

You've misspelled Reddit or digg.

Re:Thanks! (3, Interesting)

MikeBabcock (65886) | more than 3 years ago | (#32995328)

I only wish I could find a source download on their site. Even a "what we're doing" guide. Downloading the ISO and reverse-engineering what they're doing with cuneiform and exactimage doesn't seem nearly as productive, especially when I'd rather implement this on an existing server than boot a special piece of hardware with it.

Re:Thanks! (2, Insightful)

tsstahl (812393) | more than 3 years ago | (#32995612)

Virtual machine?

Re:Thanks! (1)

houstonbofh (602064) | more than 3 years ago | (#32995684)

Virtual machine?

Only a solution to "How do I get this running" and not "What is this thing doing?" The lack of source is a bit offputting to me. I will look at it, but I may wait to roll it out.

Re:Thanks! (1)

interval1066 (668936) | more than 3 years ago | (#32996318)

"Only a solution to "How do I get this running" and not "What is this thing doing?" The lack of source is a bit offputting to me. I will look at it, but I may wait to roll it out.

I would tend to agree, only because I'm extremely paranoid when it comes to security; I'd do some site analysis and make sure unexpected connections to foreign hosts aren't going out over the wire. If they were, I'd want to do some code analysis to see what exactly is going on. Source access is also extremely important if I wanted to add some customization. But in a pinch it sounds like a really worthwhile solution.

Re:Thanks! (1)

StuartHankins (1020819) | more than 3 years ago | (#32996214)

Set up a VM; not only can you monitor / limit its communication but it's a cinch to back up. In my environment this is also the easiest way to test something. I use ntop for monitoring and it works OK; it would probably be a good fit in this case.

Re:Thanks! (1)

Peach Rings (1782482) | more than 3 years ago | (#32998964)

It's all GPL so there has to be source somewhere. Their site says

The source code of the standard packages on the CD are available from their respective original providers (for example on the FTP servers at Debian). Special components such as the WatchOCR program and scripts are available on the CD.

so it's probably on the disk.

Re:Thanks! (-1, Offtopic)

Anonymous Coward | more than 3 years ago | (#32995622)

My company makes OCR Shop(r). It's a commercial Linux OCR product with a command-line interface. It has been widely used for years, and it uses the best technology in OCR from Nuance Corp. It's at http://www.vividata.com

Trojan system? (0)

Anonymous Coward | more than 3 years ago | (#32996840)

So let me get this straight - there's a new Service on a Disc, that reportedly delivers something cool, that is brand new and working better than anyone has gotten it to work, it's free, and all that's required is that you have to spin it up on a server on your network - no muss no fuss. Oh yeah, and nobody knows what exactly it does.

Time to update your resume...

Let me fix that for you: (1)

skids (119237) | more than 3 years ago | (#32998426)

>> Oh yeah, and nobody knows what exactly it does.

Oh yeah, and nobody knows what exactly it does with access to all your sensitive documents.

Re:Let me fix that for you: (0)

Anonymous Coward | more than 3 years ago | (#32998742)

It's open source... You're free to review the code to see exactly what it does with that scanned data. I would trust open source projects more than I would trust that expensive proprietary one. The world will review the code, and any fishy code will make news.

Not an alternative (0)

Anonymous Coward | more than 3 years ago | (#32994504)

Nothing beats the proprietary software like ABBYY Finereader.

Wait a sec (5, Funny)

inKubus (199753) | more than 3 years ago | (#32994542)

There's something wrong with this Slashvertisement--it's for a free product!

Re:Wait a sec (4, Insightful)

ushering05401 (1086795) | more than 3 years ago | (#32994870)

Seriously, I'm conflicted. I'm not any sort of web search guru, but it looks like that site just got put up. Is submitter an early adopter (v0.2) or a social engineer?

Thanks for the info... (2)

TiggertheMad (556308) | more than 3 years ago | (#32994544)

Wow, very cool. I have been looking around for something similar myself.

While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?

Re:Thanks for the info... (0)

Anonymous Coward | more than 3 years ago | (#32996294)

While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?

Yeah, it's called a printer.

(Kidding! ;))

Re:Thanks for the info... (0)

Anonymous Coward | more than 3 years ago | (#32997470)

After you do OCR, you can parse your text, write a script to mark it up in LaTeX, and run it through pdflatex.
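The "mark it up in LaTeX" step can be sketched roughly as below. This assumes the OCR output is plain text with blank lines between paragraphs; the function names and the bare `article` preamble are illustrative choices, not anything the poster specified.

```python
# Characters that have special meaning in LaTeX and must be escaped
# before OCR'd text can be dropped into a document body.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def latex_escape(text: str) -> str:
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def to_latex_document(text: str) -> str:
    """Wrap blank-line-separated paragraphs in a minimal article document."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "\n\n".join(latex_escape(p) for p in paragraphs)
    return "\\documentclass{article}\n\\begin{document}\n" + body + "\n\\end{document}\n"
```

Escaping per character avoids double-escaping the backslashes that the replacements themselves introduce. The resulting string can be written to a `.tex` file and passed to pdflatex; tables and figures would need far more work than this.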

Re:Thanks for the info... (2, Informative)

It's the tripnaut! (687402) | more than 3 years ago | (#32997722)

While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?

I've tried quite a few free and proprietary OCR packages and the best available right now, imho, is ABBYY FineReader [abbyy.com]. Besides fonts, it also easily recognizes tables, diagrams and illustrations. But most of all, it can read and render 189 languages (including Chinese and Cyrillic) accurately. A free trial version is available.

Re:Thanks for the info... (1)

datakid23 (1706976) | more than 3 years ago | (#32999294)

It's true. I teach Translation Studies and one of the main pieces of software that's needed is OCR. I use OOo, Poedit, Lokalize, Jubler and OmegaT in my class, I teach Creative Commons licensing, and I promote sites like http://pootle.locamotion.org/ [locamotion.org] and http://www.transifex.net/ [transifex.net]. I *really* *really* wish I could give my students a best-of-the-bunch free OCR link. But the reality is, ABBYY FineReader is the best that's available. And since it's relatively cheap (compared to some of the translation software like Trados) I don't think it's too onerous. But hot diggity, I wish there was a better FLOSS OCR program.

Re:Thanks for the info... (1)

ArundelCastle (1581543) | more than 3 years ago | (#32999888)

While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?

I think they are called interns. Photoshoop's Content-Aware Fill isn't very good with charts or handwriting.

...wait, you actually kept the original document after PDFing? Troglodyte.

Run on a VM (3, Insightful)

ChuckDriver (1276092) | more than 3 years ago | (#32994568)

Ultimately, it would be nice to figure out what script or daemon is running in this and put it on an existing server. In the meantime, I could see just creating a VM for this thing to get started.

Re:Run on a VM (1)

xtracto (837672) | more than 3 years ago | (#32995046)

Haha, I thought exactly the same. It would be really great if someone could create a VirtualBox, VMware or QEMU "virtual appliance" for this!

 

Cool program (1)

Raineer (1002750) | more than 3 years ago | (#32994602)

I agree with above posters, it's amazing to see a useful Slashvertisement. This one, however, has some quality behind it. I had not seen this program and OCR is one area where it's been difficult to find quality OSS solutions. Thanks for the post.

Re:Cool program (1)

inode_buddha (576844) | more than 3 years ago | (#32995398)

You would be amazed at how much the people over at Groklaw could use something like this, since most US court documents are recorded as scanned PDFs and TIFF files. I'm saving this link.

Re:Cool program (0)

Anonymous Coward | more than 3 years ago | (#32998692)

Hopefully PJ will read your post.

Or maybe she's the original poster. It's the kind of sweet, generous thing she does on a regular basis.

Wow! (0)

Anonymous Coward | more than 3 years ago | (#32994620)

The most useful /. post in at least a year.

ocr (1)

Suicidal Teapot (820232) | more than 3 years ago | (#32994626)

Nice, thanks for sharing. Currently we use Acrobat to OCR scanned documents; it seems to work well but doesn't keep up with our high-speed scanners. Having it automated sounds great. How does the speed/accuracy of WatchOCR compare to commercial products?

Re:ocr (1)

IICV (652597) | more than 3 years ago | (#32994770)

Who gives a shit? My cheapass "free" workflow for OCR-ing PDF documents on Windows was basically what's described here [imagemagick.org]. With this, all I need is to run a virtual server on my computer! That's significantly better.

Re:ocr (1)

wiredlogic (135348) | more than 3 years ago | (#32998682)

MODI just leaves you with the text pulled out of context. ExactImage's hocr2pdf can merge the OCR'd text back into the original scanned pages to produce a PDF with searchable text and all the original formatting and images.
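As a rough sketch of the merge step described above: CuneiForm writes an hOCR layout file, and hocr2pdf reads that hOCR on standard input and overlays it on the scanned image. The command-line flags below are my recollection of each tool's usage text and should be checked against the local man pages; the function names and the `.hocr` temp-file naming are made up for illustration.

```python
import subprocess


def cuneiform_cmd(image: str, hocr: str, lang: str = "eng") -> list:
    """Command line asking CuneiForm for hOCR output (flags assumed from its usage text)."""
    return ["cuneiform", "-l", lang, "-f", "hocr", "-o", hocr, image]


def hocr2pdf_cmd(image: str, pdf: str) -> list:
    """Command line for ExactImage's hocr2pdf; the hOCR itself is fed on stdin."""
    return ["hocr2pdf", "-i", image, "-o", pdf]


def make_searchable_pdf(image: str, pdf: str, lang: str = "eng") -> None:
    """OCR one page image and write a PDF with the text layered over the scan."""
    hocr = pdf + ".hocr"
    subprocess.run(cuneiform_cmd(image, hocr, lang), check=True)
    with open(hocr, "rb") as layout:
        subprocess.run(hocr2pdf_cmd(image, pdf), stdin=layout, check=True)
```

Calling `make_searchable_pdf("page.tif", "page.pdf")` needs both tools installed. Building the argument lists in separate functions keeps them easy to inspect or log before anything runs.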

Re:ocr (3, Funny)

0100010001010011 (652467) | more than 3 years ago | (#32994780)

Now it just needs to incorporate a Recaptcha Lite to improve accuracy.

Maybe something on the web interface when it doesn't recognize a word you can correct it.

[Given the success of the Cow Clicker on Facebook, maybe turn it into a facebook game. Tell people they're only allowed to correct words every 6 hours. If they want to correct more words, they'll have to pay for it. Add friends and correct more words to level up!]

Anyone got error rates? (3, Insightful)

savanik (1090193) | more than 3 years ago | (#32994680)

I was looking for something like this last year - it looks like this just got released last month, so I don't feel too bad about not finding it.

It looks really interesting, but how accurate is it? I've got some old books that are falling apart I'd like to scan in and textify, but I'd like to know how much time I'm going to have to budget ahead of time fixing problems and proofing.

Re:Anyone got error rates? (0)

Anonymous Coward | more than 3 years ago | (#32995040)

I am also interested in error rates. The last FOSS OCR software I tried was tesseract-based (the google ocr engine) and it was surprisingly bad. I am going to give this a go and see how it compares.

Re:Anyone got error rates? (0)

Anonymous Coward | more than 3 years ago | (#32998308)

That's funny. I found tesseract to be very good when I specified which language I wanted to use (German). My complaint is that it took a whole weekend to get it to output hOCR files from which I could produce searchable PDFs. In addition, it does not do multi-language documents. If anybody knows a good way to OCR a German/Arabic text, please let me know.

commercial? (1)

Paralizer (792155) | more than 3 years ago | (#32994700)

Is there something similar available commercially that anyone can recommend? We may end up needing to scan large numbers of PDFs to a shared drive somewhere and need the whole thing to be searchable for keywords, but a requirement for that would be a commercial product that has 24x7 support.

Re:commercial? (2, Informative)

Anonymous Coward | more than 3 years ago | (#32994752)

I did a similar search recently; your two major choices are ABBYY FineReader (they have Enterprise/Server level editions) or OmniReader (again at the Server/Enterprise level). They're priced pretty closely and have pretty well matched features, plus high accuracy. We're in the process of moving from a solution originally based on Adobe Acrobat's built-in OCR, which is okay but not great. Initial testing with ABBYY showed a demonstrably lower error rate on documents from scanned-in legal files.

Re:commercial? (3, Informative)

ganjadude (952775) | more than 3 years ago | (#32994786)

There is! I happen to work for a company (shameless plug) called DocuWare. It's document management software that does all of that. We are not 24/7; corporate-level support (I am the support) is 8 AM-8 PM Eastern, M-F. However, we sell through a dealer network that provides support on a contract basis (many Toshiba business solutions dealers are resellers for us, and I know they are 24x7). www.docuware.com

Re:commercial? (0, Offtopic)

h4rm0ny (722443) | more than 3 years ago | (#32995884)


You work for the company and you couldn't be bothered making it a proper link? Don't you know that making people have to copy and paste a URL will actually halve the number of referrals? No joke!

Re:commercial? (3, Funny)

FelixNZ (1426093) | more than 3 years ago | (#32997112)

Sole support staff's user name is 'ganjadude'; I am a little wary :)

Re:commercial? (0)

Anonymous Coward | more than 3 years ago | (#32998972)

Well, on the plus side, customers probably need not fear getting their heads bitten off, should they need tech support. :)

Re:commercial? (0)

Anonymous Coward | more than 3 years ago | (#32994996)

Omtool

Re:commercial? (1)

h4rr4r (612664) | more than 3 years ago | (#32996790)

Why?
You like giving away money?

I suggest you install this on your own machine, find a quote for a "commercial 24x7" support solution, then tell your boss your company does the same thing for 1/2 the price.

Re:commercial? (0)

Anonymous Coward | more than 3 years ago | (#32998304)

Hi - take a look at EzeScan [ezescan.com.au]. It will drive your scanners, extract data from the documents, do the OCR, convert to PDF and drop the results onto a shared drive (or into an EDRMS, Sharepoint etc).

(Disclaimer - I work for these guys)

Re:commercial? (1)

JuliaNZ (17473) | more than 3 years ago | (#32998328)

Heh, that'll teach me for reading Slashdot on a new laptop without logging in. The EzeScan comment was mine.

Ask your vendors (0)

Dynedain (141758) | more than 3 years ago | (#32994806)

Your copier providers probably already include this in the package you have. It just hasn't been enabled.

Our direct-to-pdf document scanners include copies of Acrobat Pro (both Windows and OSX), automatically do OCR, and were less than $400 each.

Middle ground? (1)

DoofusOfDeath (636671) | more than 3 years ago | (#32994836)

Funny, I was just looking for something to do this the other day.

But isn't there some middle ground between (a) making users do complicated setup work, vs. (b) making an entire OS out of it?

How about just making a tarball or Ubuntu/Debian/RPM package that installs and sensibly configures those two tools?

VirtualBox as the middle ground (1)

daboochmeister (914039) | more than 3 years ago | (#32995910)

I understand what you're saying, but installing the distro in a VM isn't much extra resource/work over a tarball. Plug in your preferred virtualization solution, of course, they all support exporting directories.

Re:VirtualBox as the middle ground (0)

Anonymous Coward | more than 3 years ago | (#32997180)

"but installing the distro in a VM isn't much extra resource/work over a tarball."

what??????

are you smoking?

gs + tesseract (1)

arnott (789715) | more than 3 years ago | (#32994930)

Another open source option: the PDFs can be converted to TIFFs using ghostscript [wisc.edu] and OCRed using tesseract [google.com]. The current version of tesseract does not have document layout analysis and page segmentation support.
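A hedged sketch of that two-step pipeline, expressed as the command lines one might hand to `subprocess.run`. The `tiffg4` device (bilevel, CCITT Group 4) and the 300 dpi default are my assumptions for typical office scans; the output file pattern and function names are illustrative.

```python
def pdf_to_tiff_cmd(pdf: str, out_pattern: str = "page-%03d.tif", dpi: int = 300) -> list:
    """Ghostscript command that renders each PDF page to its own TIFF file."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffg4",              # bilevel output; an assumption, not the poster's choice
        f"-r{dpi}",
        f"-sOutputFile={out_pattern}",  # %03d is expanded by Ghostscript per page
        pdf,
    ]


def tesseract_cmd(tiff: str, out_base: str, lang: str = "eng") -> list:
    """Tesseract command; the recognized text is written to `out_base`.txt."""
    return ["tesseract", tiff, out_base, "-l", lang]
```

Each list can be run with `subprocess.run(cmd, check=True)`. This yields plain text per page only; producing a searchable PDF still needs a separate step that merges the text back over the page image.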

Error rate. (1)

stimpleton (732392) | more than 3 years ago | (#32994978)

I settled on an expensive proprietary solution some months ago at work (I am the IT guy, Dishwasher, and Business...something) to do our org's scanning and OCR. Admittedly it's end to end, including the scanner as well. But it was $15K and does a good job.

I did searches online (a dozen hours) and they all funneled back to "FOSS less good, proprietary for best results".

I am afraid to look at this one, because I made the final decision under pressure from the General Manager.

I dunno what Google actually uses, but their in-house solution (on Google Code) would *not* produce good results. No. 1 in the FOSS tests, but like 6th (by miles) in proprietary comparisons.

Re:Error rate. (1)

Monkey-Man2000 (603495) | more than 3 years ago | (#32995220)

Well, since this apparently was just put online recently you probably made the right decision. But since there seems to be so much interest in this thread, hopefully it may become polished rather quickly for future needs.

Re:Error rate. (1)

StuartHankins (1020819) | more than 3 years ago | (#32996408)

In my role it's always better to be aware and try it out than pretend it doesn't exist. You can't thoroughly research every solution in advance every time. We call it due diligence even after the fact because you might find a better way of doing something and it's always good to have options. If nothing else it may also give you some negotiating room with the proprietary vendor at renewal time.

Okular (0)

Anonymous Coward | more than 3 years ago | (#32995076)

Okular is a free Linux program that can export PDFs to text as well.

Re:Okular (1)

timothyb89 (1259272) | more than 3 years ago | (#32995238)

Most normal PDF readers (incl. Okular) only work when the actual text is included in the PDF to begin with. When the source isn't computer-generated but scanned in, there's only image data to work with (no text). Actual OCR is pretty much the only choice in this case...
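One way to tell the two kinds of PDF apart before bothering with OCR is to try plain text extraction and see whether anything comes back. The sketch below assumes poppler's `pdftotext` is installed (writing to `-` sends the text to standard output); the 20-character threshold and the function names are arbitrary choices for illustration.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return whatever text pdftotext can pull from the PDF (empty for pure scans)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Treat the PDF as searchable if enough non-whitespace characters came out."""
    return len("".join(extracted.split())) >= min_chars
```

Usage would be `has_text_layer(extract_text("scan.pdf"))`. A scanned document typically yields only page-break characters, which the whitespace stripping discards.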

Stupid (3, Insightful)

Archangel Michael (180766) | more than 3 years ago | (#32995098)

Most, if not ALL of the documents being scanned into PDF format, are generated on computers already, so why go through the whole OCR process, and not get the actual document from the original source in a PDF version that is already text searchable?

THIS is exactly the problem with document management and processing today! Doing things the hard way because we can't be bothered changing processes that will save tons of money, be more effective, and accurate.

I know people who type a document in WORD and then print it to the copier/scanner/fax device, go pick up the document, put it on the document scanner, scan it to email (PDF) and send it that way.

SERIOUSLY???

Re:Stupid (1)

Big Boss (7354) | more than 3 years ago | (#32995214)

Just about anyone can read a PDF. If you send a MS Word doc, you have to wonder what version of Word the other person has. And these days, Macs are popular enough that they might not have Word at all! PDF works, and works for everyone. It would be far simpler to print to PDF, but not everyone has a print driver that can do that. ODF is supposed to fix that, but it probably won't.

Re:Stupid*2? (0)

Anonymous Coward | more than 3 years ago | (#32995730)

Just about anyone can read a PDF. If you send a MS Word doc, you have to wonder what version of Word the other person has. And these days, Macs are popular enough that they might not have Word at all! PDF works, and works for everyone. It would be far simpler to print to PDF, but not everyone has a print driver that can do that. ODF is supposed to fix that, but it probably won't.

What are you suggesting? Word 2007 and OpenOffice (for some versions back now) export to PDF, so why not just do that for sending electronically, and skip the print/scan steps?

RO

Re:Stupid (1)

Archangel Michael (180766) | more than 3 years ago | (#32995874)

Print to PDF, ever heard of that?

OpenOffice Export to PDF, ever heard of that?

Acrobat Professional, ever heard of that?

How about copy/paste into email?

There are plenty of alternatives to take your WORD (or whatever) doc and get it into a searchable PDF without scanning the damn thing into a TIFF and then OCRing it back to text later.

Re:Stupid (1)

Bert64 (520050) | more than 3 years ago | (#32996700)

The "print to pdf" function often creates very poor pdf files, a proper pdf export function in the program is a lot better...
Get a relatively complex document and compare the output from the native pdf export of openoffice and printing to a pdf file.

Re:Stupid (1)

salesgeek (263995) | more than 3 years ago | (#32997244)

What is true for OpenOffice may not, and probably is not true for all applications. CorelDraw, yes, I'm looking at you. That purple is supposed to be blue.

Re:Stupid ... maybe (1, Insightful)

Anonymous Coward | more than 3 years ago | (#32995946)

Not everyone wanting to do this does in fact have access to the electronic source. I know I would like to try it for some of my old crumbling books, as someone else mentioned above, which are no longer in print (or otherwise only available in DRM-encumbered ebook formats that I cannot read on Linux or Windows Mobile).

RO

Re:Stupid (1)

Phoenix Rising (28955) | more than 3 years ago | (#32996052)

You obviously live in a utopia somewhere. Most of the documents I've seen scanned in to document management may have had their origins on a computer, but they've had signatures, comments, and other stuff penned in by hand, and you can't always get the originals sent to you.

The poster is addressing a real need, as evidenced by the number of comments proclaiming the usefulness of the post.

Re:Stupid (1)

Archangel Michael (180766) | more than 3 years ago | (#32996792)

All of the "exceptions" you listed (signatures, comments penned by hand) are NOT OCRed, making it text searchable as needed by the ORIGINAL concept.

Changing processes would solve the need to OCR documents that already exist as searchable text elsewhere. EVEN if you have need to document signatures and other hand written notes.

It is a real need (searchable text), I never said that it wasn't. I'm just quibbling over the process to attain the goal.

better alternatives to pdftohtml (0)

Anonymous Coward | more than 3 years ago | (#32995196)

While we're on the topic, does anyone know of any PDF to text or HTML (or XML or whatever) converters that will do a good job of preserving the original structure of the information?
Specifically, I have occasion to deal with PDF documents that contain tables of information -- I don't have any need for OCR (at least, not so far), as the PDFs that I deal with are not scanned documents.

Most tools will extract the text correctly, and will create documents that will render quite close to the original document in a web browser, but the markup can become extremely difficult to parse.
Generally, each block of text (table element, say) will be placed inside something like an independently-positioned DIV.

Things get especially screwy when table elements can contain line breaks (which make some rows span multiple lines) and some elements which are empty.

So, the text will all be there, but for some PDFs, it becomes a difficult task to parse out the meaning. I tried out a number of free tools and some paid demos, and have settled on PDFTOHTML.
Does anyone have a better tool that will, at the very least, draw individual lines between table rows? I think what PDFTOHTML does is to create a background image of all the lines on the page. I'd prefer a free/open source solution, but would be perfectly happy with anything that does the job well.

thanks.

Re:better alternatives to pdftohtml (2, Informative)

petermgreen (876956) | more than 3 years ago | (#32996182)

Afaict the original structure was already gone when the PDF was made; you can only try to reverse engineer it from the drawing objects.

You might want to try converting to postscript using ghostscript and then converting to svg using pstoedit. You still won't have the original structure but at least you should have the table shape as a vector drawing rather than a bitmap.

Thanks! (0)

Anonymous Coward | more than 3 years ago | (#32995768)

Have also been looking for something like this but was using tesseract to create .txt files that they then searched!

Unsuccessful download? (1)

Kiralan (765796) | more than 3 years ago | (#32995842)

I have tried twice to download it, and it 'finishes' at about 150 MB both times, while the file size on their web page shows over 600 MB. As a double-check (suspecting a file size reporting error on their page), I verified the MD5 sum, and it fails as well. Has anyone successfully downloaded it?
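For anyone repeating that check, here is a small sketch of verifying a large download against a published MD5 sum without loading the whole ISO into memory. The function names and the 1 MiB chunk size are my own choices; the expected checksum has to come from the project's download page.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a 600 MB ISO does not need to fit in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path: str, expected_md5: str) -> bool:
    """Compare against the published checksum, ignoring case and stray whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A truncated download such as the 150 MB one described above will produce a different digest, so `download_is_intact` returns False.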

Re:Unsuccessful download? (1)

MathiasRav (1210872) | more than 3 years ago | (#32996094)

Did you try wget? See what error it reports, or try with wget --continue (shorthand -c).

Re:Unsuccessful download? (1)

Kiralan (765796) | more than 3 years ago | (#32996120)

I had the 'short' download error using the link on the home page, which leads to an FTP-like directory page. I am trying the link in the forums, and it appears to be working. Thanks for the advice, though!

Tesseract OCR (2, Informative)

TheSync (5291) | more than 3 years ago | (#32996454)

I found tesseract [google.com] to work very well to do OCR tasks. Doesn't generate PDF though.

Re:Tesseract OCR (0)

Anonymous Coward | more than 3 years ago | (#32997074)

If you send its output to http://xplus3.net/2009/04/02/convert-hocr-to-pdf/ [xplus3.net] it does :)

convert.py

#!/usr/bin/python
import sys
from HocrConverter import HocrConverter
hocr = HocrConverter(sys.argv[1])
hocr.to_pdf(sys.argv[2], sys.argv[3])

ocrpdf.sh

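#!/bin/bash
# Run OCR on a multi-page PDF file and create a new PDF with the
# extracted text in a hidden layer. Requires tesseract, hocr2pdf, gs;
# also calls pdfimages, convert, ocroscript, and convert.py (above),
# which must be on the PATH.

export OCROSCRIPTS=/usr/share/ocropus/scripts

# Stop at the first failing command.
set -e

input="$1"
output="$2"

tmpdir="$(mktemp -d)"
# Extract the page images as page-NNN.ppm or page-NNN.pbm.
pdfimages "$input" "$tmpdir/page"

# Convert whichever image type pdfimages produced to PNG.
if [ -f "$tmpdir/page-000.ppm" ];
then
        for page in "$tmpdir"/page-*.ppm
        do
                base="${page%.ppm}"
                convert "$page" "$base.png"
        done
else
        for page in "$tmpdir"/page-*.pbm
        do
                base="${page%.pbm}"
                convert "$page" "$base.png"
        done
fi

# OCR each page to hOCR, then build a one-page PDF with a text layer.
for page in "$tmpdir"/page-*.png
do
        base="${page%.png}"
        ocroscript recognize "$page" > "$base.html"
        convert.py "$base.html" "$page" "$base.pdf"
done

# Merge the per-page PDFs into the final document and clean up.
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/*.pdf
rm -rf "$tmpdir"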

This script is written for the PDF files output by our scanner & fax machine, so your PDF files may not result in PPM or PBM files. Adjust as necessary.

Our multifunction copier already does that... (0)

Anonymous Coward | more than 3 years ago | (#32996798)

It is an expensive model, but I expect that within a few years this will be a standard feature even in the cheaper models.

Look around when you replace your existing copiers.

Anyone have a mirror or torrent? (1)

rsborg (111459) | more than 3 years ago | (#32997062)

I can't seem to be able to download this file, it keeps giving up after a couple of hundred megs... probably slashdotted.

Re:Anyone have a mirror or torrent? (0)

Anonymous Coward | more than 3 years ago | (#32998106)

Try using a download manager that supports resuming (wget -c), or download it from our mirror:

http://mirror.adlibre.net/tmp/watchocr-V0.2-2010-06-28-en.iso

Re:Anyone have a mirror or torrent? (0)

Anonymous Coward | more than 3 years ago | (#32998798)

Try using a download manager that resumes (wget -c).

Also put a copy up on our mirror, http://mirror.adlibre.net/tmp/

Microsoft Files (1)

imscarr (246204) | more than 3 years ago | (#32997934)

Can you go one step further with this and get it to read the text (only) out of Microsoft formatted files? Maybe it could even read words out of Word files, PowerPoint, etc.

Re:Microsoft Files (0)

Anonymous Coward | more than 3 years ago | (#33000632)

Can you go one step further with this and get it read the text (only) out of Microsoft formatted files? Maybe it could even read words out of Word files, Powerpoint, etc.

There is no technical overlap between running OCR on PDF files and extracting text from a Word document. The latter can already be done using antiword.

Mirrors? Torrent? (0)

Anonymous Coward | more than 3 years ago | (#33000156)

The download is quite slow. I guess it would be nice to have alternative download options to try.

exactimage + cuneiform (1)

seyyah (986027) | more than 3 years ago | (#33000160)

I wrote a bash script a few months back which, in a little over 130 lines (it has a few command line options), can convert any old PDF to a text-searchable PDF. I really wonder whether a distro is a bit overkill for this. But it is such an important tool to have that I commend the authors for making it available... I just wish they'd put up the actual script that they used so I could compare it to my own!

Re:exactimage + cuneiform (0)

Anonymous Coward | more than 3 years ago | (#33000252)

Care to share the source?
