Simple Document Imaging for Unix?
andylievertz asks: "I have developed a logical system of directories for storing my digital documents (i.e. *.doc, *.mp3, *.gif, etc.), and can usually find any obscure document with relative speed. These 'must-keep' hardcopies include everything from bills and shipping invoices to brochures and Chinese-food menus. I've tried applying my electronic filing techniques to an actual, real-world filing cabinet, complete with folders and labels, but such a system: requires a great deal of effort to maintain relative to the electronic system, especially considering the frequent influx of new hardcopy material; and doesn't address the greater issue of reducing the sheer paper bulk, organized or not. What solutions have you, the Slashdot Reader, employed to solve this situation for yourself? Are there viable Unix-based Document Imaging packages, similar in function to the Microsoft Document Imaging utility packaged with Office? Do you use a Unix-based Document Imaging solution personally or professionally? If so, what package, and why does it work for you?"
"So, step one is to find ways to reduce the influx of hardcopy (i.e. electronic billing, etc.), but for me, the second step is to find and utilize a [Unix-based!] system that will allow me to scan and file hardcopies electronically so they may be indexed, searched, re-organized, shared, and retrieved as easily as their electronic counterparts. Naturally, any such system would need tolerances for multi-paged documents, and would need to store its output in a non-proprietary file format."
How often do you really need to look at old bills? (Score:5, Interesting)
If you want to track money, having the paper is not nearly as useful as entering the data into a financial program. Try GnuCash [gnucash.org] or something of that ilk.
Delivery menus are a different story. I keep them under a magnet on the fridge. If you get a nice rare earth magnet [ebay.com] that can hold a half inch stack of menus, that problem is easy to solve (get at least the half inch cubes).
Any solution that requires every document to be scanned is not going to work for you if you can't even file the documents. What are the chances you are going to get around to that stack of stuff to be scanned?
Invest in a magnet, a big box, and a good paper shredder.
Re:How often do you really need to look at old bil (Score:1)
Re:How often do you really need to look at old bil (Score:1)
Funny, not hilarious, we were burgled (sp?) recently - the insurance company wanted original documents for all the items taken, original receipts ... I had most of them, going back about 5 years, and even knew where they were ... this is probably 'helped' by the fact I don't have much disposable income.
Thing is the insurance folk probably wouldn't
No solution for you but.. (Score:4, Interesting)
Seems to me that the solution would involve a scanner, a database, and a mechanical system for retrieving the documents.
1) Scan the document.
2) Slide document into doc protector with ID tag (UPC codes might work, but really it could just be sequential).
3) Create DB entry for ID, BLOB of scanned image (or perhaps a foreign key to keep the images out of the query, but realistically most DBs optimize this for you), and most importantly, metadata about the document.
The more I think about it, the more I realize a number system of 1,2,3,4...would work fine. The automated retrieval, which would be nice, is not really vital. The match between the doc ID and the scanned version is enough, so long as the document always goes back into the same folder.
Insertion O(1)
Search O(log(n))
Deletion O(log(n))
Note that garbage collection (compaction) is not really an option, which means that reclaiming discarded IDs (reusing folders) would crank insertion back up to O(log(n)).
The question is whether the scanning process would be worth the time.
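The steps above can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module; the table and column names (documents, folder_id, scan, meta) are made up here and are not from any existing package. AUTOINCREMENT hands out the 1,2,3,4... folder numbers and never reuses a discarded one, which matches the "no compaction" note.

```python
import sqlite3

def open_archive(path=":memory:"):
    """Open (or create) the archive; the schema is a hypothetical example."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               folder_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- number on the paper folder
               scan      BLOB NOT NULL,                      -- the scanned image bytes
               meta      TEXT NOT NULL                       -- free-text description
           )"""
    )
    return db

def file_document(db, scan_bytes, meta):
    """Store one scan and return the folder number to write on its protector."""
    cur = db.execute(
        "INSERT INTO documents (scan, meta) VALUES (?, ?)", (scan_bytes, meta)
    )
    db.commit()
    return cur.lastrowid

def find_folders(db, term):
    """Return folder numbers whose description contains the term.

    LIKE with a leading wildcard scans the whole table; only lookups by
    folder_id use the primary-key index.
    """
    rows = db.execute(
        "SELECT folder_id FROM documents WHERE meta LIKE ? ORDER BY folder_id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

Usage would be something like file_document(db, image_bytes, "water bill 2003-12"), then find_folders(db, "water") to learn which physical folder to pull.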
Re:No solution for you but.. (Score:2)
Why bother? (Score:2, Informative)
I don't really need a "system" for that... just make your "root" folders explicit enough, then file everything where it should go.
I even have a "temp" dir for every category.
I don't really see the need for such a tool, IF you can spare a few seconds to browse&dump...
Re:Why bother? (Score:1)
The flash on a camera needs to be exactly right to produce a good readable image. Ok, so a scanner is slightly slower, but you will get a good quality document copy every time, which is suitable for use.
After all, if the image isn't captured to a good quality, there is little point capturing it at all.
Use a digital camera for input (Score:4, Insightful)
A good digital camera may seem like overkill for scanning in bills, but then the camera also doubles as a camera too.
Re:Use a digital camera for input (Score:1)
In the case of Linux, the USB Mass Storage [google.com] drivers work pretty well for many types of hardware.
Re:Use a digital camera for input (Score:1)
Re:Use a digital camera for input (Score:2)
Apple Unix? (Score:2)
Ok, so it's not available on any other unix platform, but it employs a nice design for storing images that takes advantage of simple UNIX symbolic links. All images are stored in a hierarchy based on the import date. Then, Albums are created, which contain symbolic links to the real image files.
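The same layout is easy to imitate on any Unix. Here is a minimal sketch in Python; the directory names "Originals" and "Albums" and both function names are assumptions made for illustration, not the real application's on-disk format.

```python
import shutil
from datetime import date
from pathlib import Path

def import_image(library, src, when):
    """Move a scan into Originals/YYYY/MM/DD under the library root."""
    dest_dir = Path(library) / "Originals" / f"{when:%Y/%m/%d}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(src).name
    shutil.move(str(src), str(dest))
    return dest

def add_to_album(library, album, image):
    """Create Albums/<album>/<name> as a symbolic link to the stored file."""
    album_dir = Path(library) / "Albums" / album
    album_dir.mkdir(parents=True, exist_ok=True)
    link = album_dir / Path(image).name
    link.symlink_to(Path(image).resolve())
    return link
```

Because an album only holds links, the same scan can appear in several albums (say "Utilities" and "Taxes 2003") without being stored twice.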
paymybills.com (Score:2)
basically, you tell the peeps billing you to send your bills to pmb.com. pmb.com scans in the bills, and you could download the scan in pdf and do whatever with it. then, you could pay your bills
when i moved, i lost interest because i could pay all of my bills online (i couldn't when i was using pmb.com), but having it a
Re:paymybills.com (Score:2)
From paymybills.com: This is no longer an active web site.
Looks like they didn't.
Re:paymybills.com (Score:2)
From paymybills.com: This is no longer an active web site.
Looks like they didn't.
They were bought by PayTrust [paytrust.com]. They make money by charging $13 a month per user.
Just to make it clear how spectacularly cool this service is: it's not just online bill-paying. All your bills go to them (either electronically or through the regular postal service) and show up on the web. You can even have them pay the bills automatically. It's dang convenient.
Re:paymybills.com (Score:2)
Re:paymybills.com (Score:2)
I looked at their site and it looks like you can only receive e-bills through it. PayTrust lets you receive paper bills too; they scan the bill in and show it to you online.
It's lame that all billers can't send e-bills, but at least PayTrust can make all bills seem like e-bills to me.
Re:paymybills.com (Score:2)
Re:paymybills.com (Score:2)
Most companies can handle it pretty well. I just tell them that my billing address is such and such, but that my home address is such and such. I've come across two companies so far that can't deal with it (one tiny company and one huge company), but the rest have had no problem. I guess it's not too uncommon for a person's billing address to be different from their home address or service address.
Re:paymybills.com (Score:2)
how about... (Score:1)
Get a label gun.
Put each document in a white business envelope as it comes in, with a numerical label on each envelope. Put the envelopes in one of three boxes: never throw away, throw away in five years, throw away in a year. Maybe have an additional box for documents that will only fit in big manila envelopes.
Write a quick Perl script web interface that records one of several customizable options from a pull-down menu (i.e. grocery receipt, gas bill, heroin expenses) along with the d
PaperPort (Score:1)
I settled on PaperPort for Windows. It allows a folder hierarchy to organize your scanned documents that can be altered to your liking. In addition, you can use the application's basic OCR capability to search the contents of all your documents. The previous version, PaperPort 8, used a proprietary file format. But the new version, PaperPort 9, has changed the default fi
I use... (Score:5, Informative)
It's insanely good. I use it to scan in all my important documents. It has useful multipage modes for... well, multipage documents.
Try it. It's actually been considerably revamped since I installed it, so I will have to try a more recent version.
Oh, it comes in a nice Debian package via apt-get.
Re:I use... (Score:1)
If I had moderator points (and could moderate this thread) I would give them all to you. Moderators please mod up.
Andy
Re:I use... (Score:2)
Why bother with a "why bother" post?
I have a three tiered system (Score:1)
I was thinking about installing a Java front end to use WebDAV to connect to the DB to allow me to access the documents through a webpage, but I'm not sure if I'll go through with it or not; I want to keep it fairly simple...
I'm kidding of course, I have a trash can in my office that my girlfriend
Re:I have a three tiered system (Score:2, Funny)
Which important documents? The ones from Playboy?
Re:Simple? (Score:1)
Kooka (Score:1, Interesting)
Re: (Score:2)
some thoughts... (Score:2)
Re:some thoughts... (Score:1)
Perhaps your database could hold (a) image title (b)the image (c) the OCR text from the image, for searching purposes and (d) several "category" fields. F
Re:some thoughts... (Score:2)
Perhaps your database could hold (a) image title (b)the image (c) the OCR text from the image, for searching purposes and (d) several "category" fields.
Goo
Windows XP has the right idea (Re:some thoughts... (Score:1)
(1) integrated
(2) flexible enough to let you set and query by keyword AND key/value pairs
(3) transportable (so that copying a file to another disk transfers its indexing info as well).
So far the best interface I've seen is Windows 2000/XP's index service. It's an extension to the Find utility that lets you query files very
Document Imaging (Score:1)
FileNet... (Score:2)
Webcam for scanning documents (Score:2)
HP Digital Sender and htDig (Score:5, Informative)
and htDIG to solve all my document storage problems.
The Digital Sender is a wonderful toy. Stick a stack of paper in the bin. Enter an email address. Press the big green button. And a PDF shows up in my mailbox in a few minutes. It even does double-sided. Very simple device and it does most of what I need.
It doesn't do OCR. The Digital Sender outputs a bit-mapped PDF that looks very good. I usually use the full version of Adobe Acrobat to do optical character recognition and store the results in the background. That way I still see the good scan on the screen and when I print. But I can copy and search the text as I would normally.
I use htDig (http://www.htdig.org/) to index my archive. I store content in file folders that make sense (2002 taxes, pitch perception papers, etc). But I still find htdig useful. It indexes both HTML (my lab notebook) and PDF files. All is good.
PDF is a well-documented file format. I wish there was a good free OCR package, but sometimes you have to pay for good performance. htDig and PDF work great on Windows and Linux.
In three years I have accumulated just over 1 GB of content. That represents all my lab notes (in HTML format) and all the papers I've read (in PDF). It's wonderful having my entire paper life with me on my laptop. (I also back it up to three different machines.)
DocMGR looks like what you want... (Score:1)
It's at http://docmgr.sourceforge.net/ [sourceforge.net]
sane + ghostscript (Score:3, Interesting)
I have a very simple script that runs scanimage, then processes the output through convert to make it a rasterized postscript output, then processes that output through ps2pdf (part of ghostscript).
My scanner (Epson 1640U) has a document feeder so the command line options for scanimage reflect that. A simple loop in the script handles all the pages.
The net result is a script called "scan2pdf" to which I just give the output PDF file name (something helpful, like the name of the document and the date). I've processed over a decade of financial records, easily 1000s of pages, in a day with this simple setup.
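The original script isn't shown, so what follows is only a guess at the same scanimage, convert, ps2pdf pipeline, written in Python. The function names, the page%04d.pnm pattern, the temporary file name, and the "ADF" source value are all assumptions; --source and --resolution are backend-specific scanimage options, so check what `scanimage --help` lists for your scanner.

```python
import subprocess
from pathlib import Path

def scan_command(pattern="page%04d.pnm", source="ADF", resolution=300):
    """scanimage batch mode: one numbered file per sheet in the feeder."""
    return ["scanimage", f"--batch={pattern}",
            "--source", source, "--resolution", str(resolution)]

def convert_command(pages, ps_file):
    """ImageMagick convert: all page images into one rasterized PostScript file."""
    return ["convert", *map(str, pages), str(ps_file)]

def ps2pdf_command(ps_file, pdf_file):
    """Ghostscript's ps2pdf wrapper: PostScript in, PDF out."""
    return ["ps2pdf", str(ps_file), str(pdf_file)]

def scan2pdf(pdf_file, workdir="."):
    """Run the three steps in order and clean up the intermediate files."""
    work = Path(workdir).resolve()
    pdf = Path(pdf_file).resolve()
    # Exit status is not checked here because backends may differ in how
    # they report an empty feeder; we check for scanned pages instead.
    subprocess.run(scan_command(), cwd=work)
    pages = sorted(work.glob("page*.pnm"))
    if not pages:
        raise RuntimeError("scanner produced no pages")
    ps_file = work / "scan2pdf-tmp.ps"
    subprocess.run(convert_command(pages, ps_file), check=True)
    subprocess.run(ps2pdf_command(ps_file, pdf), check=True)
    for leftover in [*pages, ps_file]:
        leftover.unlink()
    return pdf
```

Calling scan2pdf("water-bill-2003-12.pdf") with a stack in the feeder would then leave just the PDF behind.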
Re:sane + ghostscript (Score:1)
I love that your solution is command-line based. Could your script be modified to handle multiple page documents, fed one at a time (some kind of a pause-while-you-change-pages)?
Would you be willing to share your script here? Thank you....
Andy
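For a flatbed with no feeder, a pause-while-you-change-pages loop might look like the sketch below. It is not the poster's script; the function names and the page file naming are made up, and it assumes scanimage writes a single PNM image to standard output when given --format=pnm. scanimage's own --batch-prompt option is another route worth trying.

```python
import subprocess

def scan_to_file(name):
    """Scan one page from the default device into the named PNM file."""
    with open(name, "wb") as out:
        subprocess.run(["scanimage", "--format=pnm"], stdout=out, check=True)

def scan_pages_by_hand(scan_one=scan_to_file, ask=input):
    """Prompt before every page; typing q at the prompt ends the document."""
    pages = []
    while True:
        number = len(pages) + 1
        answer = ask(f"Load page {number} and press Enter (q to finish): ")
        if answer.strip().lower() == "q":
            break
        name = f"page{number:04d}.pnm"
        scan_one(name)
        pages.append(name)
    return pages
```

The returned list of page files can then go through convert and ps2pdf exactly as in the feeder case.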
Yippee! (Score:2)
I really need this type of system. By far the single largest amount of clutter in my home has always been bills, other USPS mail that I need to keep (like mail from my 403B advisor), and receipts for a wide assortment of purchases. I've been looking for this type of system for years. What I really want to be able to do is sit down in the evening with my bills in hand, pull up a softw
One Big Folder (Score:1)
- use highly descriptive and standardized filenames (e.g. "12-03-water-bill.tif")
- ls -l | grep water | grep 12-03
VOILA!!! Always works for me!
You can't get any simpler.
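If you'd rather not type the names by hand, the convention and the double grep can be sketched in a few lines of Python. The helper names make_name and find are invented for this example, and the date stamp is passed through as-is because the thread doesn't say what "12-03" encodes.

```python
def make_name(stamp, *words, ext="tif"):
    """Build a standardized name: stamp first, then lowercase dashed words."""
    slug = "-".join(w.strip().lower().replace(" ", "-") for w in words)
    return f"{stamp}-{slug}.{ext}"

def find(names, *terms):
    """Keep names containing every term, like chaining grep after grep."""
    return sorted(n for n in names if all(t in n for t in terms))
```

For example, find(os.listdir("."), "water", "12-03") (after importing os) plays the role of the pipeline above.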
Gallery? (Score:2)