Slashdot: News for Nerds

Digitizing Your Dead Trees?

Cliff posted more than 12 years ago | from the when-you-can't-carry-them-with-you dept.

Hardware 367

smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
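The "good enough to use as a searchable index" idea can be sketched as a small inverted index over per-page OCR text. This is only an illustration under assumptions that are not in the post: that the OCR output is available as plain text per page, and that lookups are by whole words. The names `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical input: OCR text keyed by (book title, page number).
# Real OCR output would contain recognition errors; the index only
# needs most words to be right to be useful.
WORD_RE = re.compile(r"[a-z0-9]+")


def build_index(pages):
    """Map each lowercase word to the set of (book, page) keys containing it."""
    index = defaultdict(set)
    for key, text in pages.items():
        for word in WORD_RE.findall(text.lower()):
            index[word].add(key)
    return index


def search(index, query):
    """Return the sorted (book, page) keys that contain every word in the query."""
    words = WORD_RE.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


if __name__ == "__main__":
    sample = {
        ("Book A", 12): "The fork system call creates a new process.",
        ("Book A", 13): "Use exec to replace the process image.",
        ("Book B", 7): "A process may fork many children.",
    }
    idx = build_index(sample)
    print(search(idx, "fork process"))
```

A search hit only has to point at the right page image; the reader then looks at the scan itself, so occasional OCR mistakes cost a missed hit rather than a wrong answer.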

367 comments

First Post (-1, Troll)

Anonymous Coward | more than 12 years ago | (#3492952)

fp Bitches!
Chalk one up for the ACs!
Why don't you log this in!
HA

fp (-1, Offtopic)

Anonymous Coward | more than 12 years ago | (#3492953)

i believe you can use a program known as first post :)

Re:fp (-1, Flamebait)

stoolpigeon (454276) | more than 12 years ago | (#3492960)

He may be able to do so-- obviously you can't.

Dumb Ass

.

Bow! (-1)

Klerck (213193) | more than 12 years ago | (#3492954)

Bow down before me! [msnbc.com]

YHBT! [petitiononline.com]

Klerck, you spastic monkey fucker! (-1, Troll)

SumDeusExMachina (318037) | more than 12 years ago | (#3493074)

Would your link be referring to the Two Towers thing, or did you father Elizabeth Hurley's son?

How about... (-1)

Anonymous Coward | more than 12 years ago | (#3492955)

A scanner.

With some scanning software.

Do a search on Google [google.com] , you should find many matches for each.

Have fun!

Re:How about... (0)

Anonymous Coward | more than 12 years ago | (#3492978)

he doesn't want google matches retard, he wants help from people who may have already spent time experimenting with different scanner software. The gift of experience. The whole point of open source, and the "ask slashdot" section.

Re:How about... (0)

Anonymous Coward | more than 12 years ago | (#3493024)

You mean the "do my homework" section?

Re:How about... (0)

Anonymous Coward | more than 12 years ago | (#3493109)

No, he means the "can anyone please change my damn dirty diapers?" section, of course.

What? You haven't noticed this section yet?!?!

Blimey.

look online before you scan (5, Informative)

cheesyfru (99893) | more than 12 years ago | (#3492957)

You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)

Re:look online before you scan (2, Informative)

MisterBlister (539957) | more than 12 years ago | (#3492972)

Most of the stuff you find online is training stuff, like Learn Photoshop or Learn HTML in 21 days or whatever.

There's a dearth of available electronic copies of programming-type texts, except for those where the author/publisher creates their own version (like all of Bruce Eckel's books).

Re:look online before you scan (2, Insightful)

cheesyfru (99893) | more than 12 years ago | (#3493008)

I've got 30+ O'Reilly books, Design Patterns, Stroustrup's C++, etc. They're out there if you look long enough. LimeWire has also been a big help.

Re:look online before you scan (0)

Anonymous Coward | more than 12 years ago | (#3493047)

You can also go to irc.nullus.net and join #bw, there's a _massive_ quantity of books available for download.

Posting as AC so they don't bust my ass. :)

Re:look online before you scan (1)

MisterBlister (539957) | more than 12 years ago | (#3493048)

I'll take a look on gnutella, thanks for the tip.

already scanned (2, Informative)

Anonymous Coward | more than 12 years ago | (#3493040)

Yup. There is quite a lot already scanned. The best places to look are usenet (at alt.binaries.e-book, alt.binaries.e-book.technical, alt.binaries.e-books) and IRC at #bookwarez and #bookz on undernet, dalnet, and irc.nullus.net (and most likely other irc nets as well.)

You could try making a request in abeb, but the biggest selection in one place is irc. So as long as you are not scared by the interface, that is where I would look first.

Non-DIY Options.... (-1)

cybrpnk (94636) | more than 12 years ago | (#3493085)

Digitizing books can be done automatically by machine if you've got the cash - the machine even turns the page [4digitalbooks.com] . It can also be done by Indians [outsource2india.com] . If you want to Buy American, you could probably hire Sara and her Foveon Camera [tmc.edu] (check the bottom of the page)....

EEEWWW HE SAID THE 'K' word (0)

chewedtoothpick (564184) | more than 12 years ago | (#3493142)

oh my god... people actually use that word any more? We are talking a word that in the computer world is worse than all the old and slang curse words multiplied by their cumulative power... Are we going to let him get away with it?

An easier solution. (4, Funny)

SystemFork (578511) | more than 12 years ago | (#3492959)

Lots of college students at $5/hour.

Re:An easier solution. (0)

Anonymous Coward | more than 12 years ago | (#3493076)

I'll volunteer for one

scummy_student@frathouse.com

Re:An easier solution. (1)

psycht (233176) | more than 12 years ago | (#3493117)

Insightful? I believe the author's intent was humor, because most students would really do it for beer.

Linux is the frist unprofessional system! (-1, Flamebait)

Anonymous Coward | more than 12 years ago | (#3492962)

Seen on comp.risks, please tell the guy what an idiot he is. Or if you happen to live nearby, you may try to give him a third meaning of killall, hehe...:

Date: Mon, 6 May 2002 14:52:30 -0500
From: dmaziuk@yola.bmrb.wisc.edu (Dimitri Maziuk)
Subject: GNU in Not Unix (Re: Markettos, RISKS-22.05)

Well, that particular risk is well known to professional Unix systems
administrators -- in fact, I was rather surprised to see that Linux
"killall" made the RISKS now: it's been [in]famous among Unix sysadmins for
quite a while now.

I see two issues here: one is that of false advertising, and another one --
of professionalism (not that they are entirely unrelated).

Stallman's rants about "LiGNUx" have a perfectly good technical reason
behind them: "Linux" (as in "OS based on Linux kernel and free software")
has lots of GNU software in it, and "GNU is Not Unix". Hence, Linux is
Not Unix, regardless of what Linux advocates may be telling us, it is
"GNU". (And, BTW, Unix is Not GNU.)

That was about false advertising, now let's look at professionalism.

Linux killall is a perfect illustration of what happens when a product is
designed by a dilettante.

Back in 1975 professionals designed an OS called Unix. Being professionals,
they realised the need for certain design principles. Such as splitting a
task into a number of smaller subtasks and designing a separate tool to
handle each subtask (that does one thing, and does it well)[0].

For example, shutting down a computer involves flushing (synchronizing) file
buffers to disk ("sync"), killing all running processes ("killall"), and
powering off the machine ("poweroff", at least on Solaris). All perfectly
neat and logical.

Along comes a layman who is unaware of the above principle, nor of
the significant "prior art"[1]. Result? -- read Theo's message.

(Various observations to show that isn't such a big problem (in
no particular order):

* professionals already know that similarly-named utilities often
behave differently on different operating systems,
* GNU folks never intended to uphold the aforementioned design
principle in the first place (see EMACS), so no surprises there,
after all, you'll only run "killall" on a Unix once.)

We have a bigger problem with another Unix principle: source code
portability.

As software becomes more complex, it requires more sophisticated build
tools. More and more open source software is being developed using GNU
compilers and build tools, and it is becoming dependent on them. The result?
-- While portability at the level of each compilation unit is still
maintained, the whole thing is not portable anymore. It fails to build on
non-GNU systems[2].

GNU project in particular did a great service to software community by
promoting and popularizing free software. It also did a great disservice by
turning the whole thing into a political issue, and pretty much ignoring the
need for competence and expertise on the part of software developers.
Instead of sound software engineering, we now have "Free Speech"
flag-waving[3].

With more companies (individuals, governments) jumping on Linux bandwagon,
the situation becomes eerily reminiscent of the recent dot-com boom; back
then we had The Internet and e-words, now we have Open Source and
Linux. Back then a few cautionary voices drowned in marketing hype, now
they're likely to be branded Paid Advocates of Evil Entertainment Industry
and Oppressors of Free Speech[tm] -- so they shut up and go learn Plan9, or
something.

(BTW, if it sounds like I'm singling GNU out, I'm not. Microsoft
et al., did at least as much as GNU to get us where we are now.
The whole thing would be very different if there was e.g. a
liability clause in every software license.)

But the $15 question remains: would you board an airplane designed by, say,
2nd year biology student as a night-time hobby? So what makes you think
their software design skills are any better?

Hmm. This came out sounding like a rant. Well, it probably is.

Dima

[0] Various aspects of the problems related to complex software systems are
very familiar to RISKS readers. They come up in, what? -- every other RISKS
issue? 25+ years ago Unix authors were well aware of them, too.

[1] Irix and Solaris "killall", for example, behave like HP-UX one -- not
surprising, considering the "grand scheme of things" outlined above.

[2] Anyone who ever tried building open source software on Solaris using
native build tools knows that 9 times out of 10 GNU "libtool" fails to link
shared libraries. The remaining 1 time GNU ./configure script fails to
determine compiler flags to make position-independent code (needed for said
libraries). And since GNU compiler and build tools are unable to produce
64-bit code on Solaris, the libraries, and all software that uses them must
be built as 32-bit binaries. Now, why did I pay for that 64-bit hardware,
again?

[3] And instead of one Shakespeare, we have a zillion monkeys with C
compilers. As history of Usenet shows, we shouldn't expect them to come up
with even "Hello World" anytime soon, not to mention "Hamlet".

Go To Kinko's!!!! (4, Informative)

thedbp (443047) | more than 12 years ago | (#3492963)

Kinko's offers high-volume scan-to-PDF solutions ... at low volume, it is usually 10 - 25 cents per page plus the cost of the media to copy it to, but in large volume, sometimes the cost can go down to 1 cent per page.

Call Kinko's. Ask for the Territory Representative. They'll help you out!!!

Re:Go To Kinko's!!!! (0)

blindbat (189141) | more than 12 years ago | (#3492997)

Will they let you bulk copy copyrighted books?

Re:Go To Kinko's!!!! (1, Informative)

Anonymous Coward | more than 12 years ago | (#3493114)

They won't. I'm working at a K's right now and company policy won't let us copy anything that's copyrighted without proper permission and to hand place that many pages on a scanner bed would be horrendously time consuming.

Re:Go To Kinko's!!!! (4, Interesting)

Microsift (223381) | more than 12 years ago | (#3493003)

I seriously doubt Kinko's would do this. They are ultra-paranoid about violating copyright. I imagine if you could do it at Kinko's, you'd have to do all the work yourself in the Self-Service area. I doubt they have machines like that in self-service.

While you're scanning my books... (2)

DarkHelmet (120004) | more than 12 years ago | (#3493094)

Oh yeah, I have these 100 dollar bills I'd like you to scan and put in a PDF file... I'm not going to reprint them, honest!

I just wanna be able to look at the dollar bills on my computer instead of having to carry them with me. Is that so bad?

monkeys (4, Funny)

blugecko (152079) | more than 12 years ago | (#3492966)

hire an infinite amount of monkeys on typewriters and... oh wait, that is for shakespeare

Re:monkeys ** -- MOD PARENT UP!! -- ** (0)

Anonymous Coward | more than 12 years ago | (#3493056)

that is all.

typewriters use paper (1)

IIRCAFAIKIANAL (572786) | more than 12 years ago | (#3493113)

So along with his current books, he would have an infinite number of pages that contain some of the works of shakespeare?

What you need is an infinite number of monkeys with an infinite number of computers

...Silly old bear

YES, BUT... (0)

Anonymous Coward | more than 12 years ago | (#3493122)

you'd also need an infinite amount of bananas...

what a mess, all the flies... oh, the humanity...

um, wait...

nevermind.

Safari is your friend (5, Informative)

Dredd13 (14750) | more than 12 years ago | (#3492971)

If you're like me, a good chunk of your collection is ORA books... in which case, you should check out O'Reilly's Safari [oreilly.com] , which is their online book offering. It also includes non-ORA books as well, actually.

Quite useful and handy.

D

Re:Safari is your friend (2)

Skidge (316075) | more than 12 years ago | (#3493007)

But unfortunately, owning the O'Reilly books doesn't entitle you to access them online. You'd have to pay a subscription to access them.

That being said, the $9.99/month (or so) would probably be worth it, considering all the work tearing apart and OCRing all the books would take, just to get somewhat inaccurate digital versions.

Re:Safari is your friend (0)

Anonymous Coward | more than 12 years ago | (#3493039)

That's nice, but why would he want to pay a monthly fee to rent books he already owns?

Re:Safari is your friend (4, Insightful)

Dredd13 (14750) | more than 12 years ago | (#3493104)

That's nice, but why would he want to pay a monthly fee to rent books he already owns?

Because there's something very nice to having access to your 30-odd book collection from home, office, conference, at a job-site, etc. etc., without dragging along 40 pounds of books with you everywhere you go.

It's a convenience you pay for. Considering how many ORA books many people pay for (and keep current as new editions come out), the annualized cost of simply subscribing and NOT buying the dead-tree version at all is very appealing to some folks, especially if their lifestyle has them wanting ready access to the material "from lots of different places".

Re:Safari is your friend (2, Insightful)

SystemFork (578511) | more than 12 years ago | (#3493081)

Perhaps the original poster should subscribe to the O'Reilly books they've purchased (for a month) and then save each chapter locally. Even at Safari's upper subscription levels of $100/mo you get access to 200 books. There's no way you could get a quality scanner with a feeder and OCR software for less than $100. Re-inventing the wheel is instructive, but silly.

Re:Safari is your friend (5, Informative)

Wanker (17907) | more than 12 years ago | (#3493091)

I'll second this-- the O'Reilly Safari site is wonderful for anyone with a hoard of tech books.

I bet about half of your books are already online.

Also, for your compression you should NOT use JPEG. JPEG is optimized for smooth tones and will badly blur hard edges like text. On the other hand, JPEG performs relatively poorly at compressing large areas of the same color (i.e. white backgrounds.) [Note for the nit-pickers, both of these JPEG issues will be reduced/eliminated in JPEG2000.]

I scan documents to either compressed TIFF (tend to be large), PNG, or (*shudder* [unisys.com] ) GIF.

From the Project Gutenberg "Making Etexts from Paper Originals" paper [promo.net]: (You can bet these guys know how to scan...)

A general rule is to store scanned images to JPEG and store computer-generated pictures (like diagrams etc.) to GIF. The exception is if you scan in grayscale, then use GIF. Never scan pictures as lineart. If acceptable from a file size perspective use the highest possible quality setting for JPEG.

I suggest never using JPEG. The quality loss for printed words is just terrible relative to the compression you get. Also, just substitute PNG for GIF and the above works.
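As a rough illustration of why lossless formats suit text pages: PNG's compression is DEFLATE-based, and a mostly white page is extremely repetitive. The sketch below builds a synthetic 1-bit page and compresses it with Python's `zlib` module, which implements DEFLATE. Everything about the page is an assumption made for illustration (letter size at 300 DPI, a crude repeating pattern standing in for lines of print); it is not a real scan, and real scans carry noise that compresses less well.

```python
import zlib

DPI = 300
WIDTH_PX = int(8.5 * DPI)    # 2550 pixels across (assumed letter-size page)
HEIGHT_PX = 11 * DPI         # 3300 pixels down
ROW_BYTES = (WIDTH_PX + 7) // 8  # 1 bit per pixel, packed 8 pixels per byte


def synthetic_page():
    """Build a fake bilevel page: white rows with periodic dark 'text' rows."""
    white_row = bytes([0xFF]) * ROW_BYTES
    # Crude stand-in for a line of print: a repeating dark/light byte pattern,
    # padded with white so the row has exactly ROW_BYTES bytes.
    pattern = bytes([0x00, 0xFF, 0x0F, 0xFF])
    text_row = pattern * (ROW_BYTES // 4) + bytes([0xFF]) * (ROW_BYTES % 4)
    rows = []
    for y in range(HEIGHT_PX):
        # Every 50 pixel rows, draw a 20-row band of "text".
        rows.append(text_row if y % 50 < 20 else white_row)
    return b"".join(rows)


if __name__ == "__main__":
    raw = synthetic_page()
    packed = zlib.compress(raw, 9)
    # Lossless: decompressing gives back exactly the same bytes.
    assert zlib.decompress(packed) == raw
    print(f"raw bytes:        {len(raw)}")
    print(f"compressed bytes: {len(packed)}")
```

The point is only the direction of the effect: large uniform areas cost almost nothing in a lossless format, and the sharp edges of the letters come back bit for bit.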

Its been a while.... (1)

PepsiProgrammer (545828) | more than 12 years ago | (#3492973)

I haven't used OCR in about 2 years, but the last time I tried it out, it sucked horribly. It's acceptable for small documents that aren't that hard to proofread/correct. But for huge documents, like books, etc... Don't expect a huge amount of accuracy.

OCR has improved (-1, Offtopic)

Anonymous Coward | more than 12 years ago | (#3493022)

Two years ago the Xbox and Gamecube weren't out.
Two years ago they didn't have OS-less computers
Two years ago Linux installation was really hard.
Two years ago they weren't close to a v1.0 Mozilla.

The computer Industry changes pretty quickly. Everything does double every 18 months, you know.

Re:OCR has improved (0)

ovit (246181) | more than 12 years ago | (#3493095)

I don't believe he was recommending plain text as the digital format...

What I read was to use JPG, i.e. images of each page. OCR is just to provide indexing...

Re:Its been a while.... (1)

ryanwright (450832) | more than 12 years ago | (#3493103)

I'm with you. OCR, at least the inexpensive (under $1000) software, is worthless. I found it to be faster to retype the whole stupid document by hand than it was to correct the OCR errors.

As Krow always says... (5, Funny)

bdesham (533897) | more than 12 years ago | (#3492974)

You can't grep a dead tree.

Re:As Krow always says... (0, Troll)

technoid_ (136914) | more than 12 years ago | (#3493087)

But it's easier to read while on the toilet.

Great (2, Insightful)

Quill_28 (553921) | more than 12 years ago | (#3492975)

Now the bookseller's will join with the entertainment industry. Next we will be seeing books that can't be scanned easily.

Remember those passkeys for computer games in the 80's that were black on maroon paper? Or some dial thingy.

Re:Great (0)

Anonymous Coward | more than 12 years ago | (#3493071)

There's no apostrophe in "booksellers," you stupid twat.

Re:Great (3, Funny)

yintercept (517362) | more than 12 years ago | (#3493092)

Cool idea. You could sell special 3D glasses with an encrypted pattern that you would have to purchase to read a book. With the print on demand technologies, booksellers might create a system where people have to get a special printing of the book that fits only their encrypted readers. That way you can guarantee that only one person reads the book. You could also create a pretty good database of what people read. This would give you a good idea of who the subversive elements in society are.

100 pounds? (5, Funny)

NineNine (235196) | more than 12 years ago | (#3492982)

That's it? Jesus, what are you, a 12 year old girl? That's 2 armloads. Sounds like you need the exercise, fatass.

Re:100 pounds? (0)

Anonymous Coward | more than 12 years ago | (#3493049)

use a cart. lol

Re:100 pounds? (5, Funny)

zulux (112259) | more than 12 years ago | (#3493086)

That's it? Jesus, what are you, a 12 year old girl?

Girl? On Slashdot?

Woah!

Re:100 pounds? (0)

Anonymous Coward | more than 12 years ago | (#3493139)

yeah, really... i mean, face it, women have better sense than to come to a shithole like this, full of morons who have no understanding of civility...

You are the Bizarro me (1)

Microsift (223381) | more than 12 years ago | (#3492985)

I'm getting tired of buying books only to find out that a LOT of the chapters are on CD in pdf Form.
What's even more annoying is when the PDF doesn't let you print!

Unprintable PDFs (0)

Anonymous Coward | more than 12 years ago | (#3493009)

That's what Ghostscript is for. :-)

You're mad, surely? (2, Insightful)

fractalus (322043) | more than 12 years ago | (#3492987)

Most of my technical books contain vast quantities of useful information in charts, diagrams, and illustrations... which are far more of a challenge to OCR than mere printed text.

I suspect that even were this sort of thing really possible, it's a major time investment. I have several dozen technical books I'd like to scan, each with four hundred or so pages... and I'm not sure I want to spend a week's vacation time doing it.

And even were it done... there is just something comforting about having a nice printed book that I can set on the desk next to the computer and consult, without having to read it on the screen. Print still looks way better than monitors.

Re:You're mad, surely? (2)

jgerman (106518) | more than 12 years ago | (#3493098)

It's a convenience issue. I'd love to have all my books on CDs so I can either 1) leave them at work and use the dead trees at home, or 2) carry them back and forth each day. There have been plenty of times that I needed a resource that I know I have at home ("I think something out of the Dragon book would help here"), but had no way to access it.

Do you really need them? (4, Insightful)

alt.sex.fetish.jesus (542450) | more than 12 years ago | (#3492991)

I suppose this will be marked off-topic, since the poster is asking about digitization hardware. But whenever I see coworkers with tons of books on their desk shelf, I wonder to myself why they really need them. Do they actually have time to read them? Or are they more for show?

Personally, I have about 3 books I consider _essential_, and I've read them cover to cover (mostly while in the crapper ;-) ). The rest of the time, I get what I need off the web or USENET.

As far as I'm concerned, the most important quality in an engineer is not what you know but what search engine you use to look stuff up.

Re:Do you really need them? (2, Interesting)

ComputerSlicer23 (516509) | more than 12 years ago | (#3493083)

All depends. I have probably 8 C++ books that have lots of different useful information in them. Really, I probably only need 3 of them: the ISO standard (yes I own a copy), Stroustrup's C++ Language and Josuttis's book (big black book, can't remember the title).

I own probably 500 computer books that completely cover a 6ft by 6ft section on my wall. No I haven't read all of them, but I have read 80% of them cover to cover, and I know the table of contents on the rest of the books. It's generally very useful to keep lots of reference material "grey matter indexed". That is, I know which book to find it in and roughly where it is in the book. I have found on-line documentation to be of very low quality personally, and I like to peruse the books when I don't have a computer handy.

The other consideration is it is nice to know the documentation isn't going to change, or move, or do anything weird. Of course it isn't going to get updated either so, cuts both ways.

I agree (1)

IIRCAFAIKIANAL (572786) | more than 12 years ago | (#3493145)

Donate them to your local library if they are still relevant.

I just donated a bunch of books myself.

Another strategy may be to only scan the stuff you need out of the books.

I just wish I could get rid of all of the leftover records/reports/legacy app documentation in my office.

If worst comes to worst (2, Funny)

MoneyT (548795) | more than 12 years ago | (#3492992)

You could always look into those funky OCR pens that you see in some electronics catalogs. Basically it's a pen with an optical sensor that you scan over lines of text to digitize them; they can then be transferred to a computer or to a Palm Pilot (or like product).

Re:If worst comes to worst (2)

AugstWest (79042) | more than 12 years ago | (#3493097)

This would actually be quite useful -- basically scan the texts as you need them.

When you need to look something up, you scan it as you go, almost like highlighting the text with a bright yellow marker.

Let's face it -- with most of these 500-page behemoth books you often only use small chunks of them, especially when you're talking about using them as reference tools well after your first or second read. This way you wouldn't be wasting time, energy, electricity and disk space on all of the voluminous words you don't really need.

I think this could be the best advice I've seen so far.

Re:If worst comes to worst (1)

WeirdKid (260577) | more than 12 years ago | (#3493127)

That would take a fucking eon! You could re-type the books yourself with less time and frustration.

Re:If worst comes to worst (2)

cheesyfru (99893) | more than 12 years ago | (#3493144)

I'd complain that the code you scanned would come out formatted poorly, but your wrist would be so carpal-tunneled by the time you're done scanning 100 lbs of books line-by-line that you wouldn't be able to type code anyway!

Talk to the project Gutenberg guy (3, Informative)

Anonymous Coward | more than 12 years ago | (#3492995)

Check out Project Gutenberg. I remember that they have a very nice how-to for scanning in texts.

Question is: Free or Not Free (4, Informative)

The Ape With No Name (213531) | more than 12 years ago | (#3492996)

You could scan it all into PDF/PS. I am not sure about making it all into a document with free tools after that, but here is a go at a solution.

Adobe Acrobat (read $$$$) does all of this and works well. But if you are a *nix person you could pipe some ghostview tools together and put it all into LaTeX, then re-export it as a digital book to PDF. Scanning: look no further than an HP scanner. It doesn't even have to be HQ unless you need the diagrams to be photo quality. After that burn it all to CD or, better, DVD.

OCR??? (-1)

Anonymous Coward | more than 12 years ago | (#3492999)

In Linux? HAHAHAHAHAHA!

Essential? (4, Interesting)

daeley (126313) | more than 12 years ago | (#3493005)

If they're that 'essential' how can you justify cutting them up? (100 pounds of tech books is, what, three or four books? ;)

Maybe you could donate the bulk of them to a school or something, follow the other suggestions about downloading fair-use versions where possible, digitize the few remaining ones, and start using ebooks or Safari [oreilly.com] (or similar) exclusively from now on.

I work in this field (5, Informative)

JeanBaptiste (537955) | more than 12 years ago | (#3493006)

My company is a document imaging systems reseller. The drawback to doing this is that it is expensive. We work with many different libraries and we sell them book scanners. They do lots of neat things, including things like not breaking the binding of the book during scanning, binding curve compensation, masking/centering, and so on. Most of these customers then take the TIFF images and upload them into a document imaging system, although you could easily make PDFs also.

<plug>
Let me recommend the PS7000 from Minolta (www.minolta.com); that is the book scanner we sell the most of.

If you are at all interested in document imaging, check out www.otg.com

and if you're in Minnesota, Wisconsin, or the Dakotas, check out my company's web site at www.mid-america.com
</plug>

Re:ps7000 (1)

catfoo (576397) | more than 12 years ago | (#3493132)

i used a ps7000 at a library i worked at. it's pretty cool, works real well for rare books. but it's waaaayyyy tooo expensive. i think it was something like $15,000 (last year).

Hardware Suggestion (-1, Redundant)

Anonymous Coward | more than 12 years ago | (#3493010)

What hardware and software should I use?

100 lbs? What is that, two or three boxes? Try using the weightlifting equipment down at the gym, ya fembot!

check sane (4, Informative)

walt-sjc (145127) | more than 12 years ago | (#3493013)

Check the hardware list for SANE and then pick one of the fastest scanners you can afford. The DB on SANE's web site is your best bet. You will find that to get good scanning speed you will need SCSI, as USB is just too slow.

JPEG also sucks for this. JPEG is best for full color images like photographs. Better off using TIFF or PNG. Most OCR software will require TIFF. Don't know of any OCR software for Linux, although you might get some Windows app to work under WINE. TextBridge from Xerox isn't bad for the money.

We do this all the time at the office...... (4, Informative)

diorio (244324) | more than 12 years ago | (#3493014)

.....we use a Xerox DC265ST. This digital photocopier scans pages at 65 per minute and posts them to an in-house FTP server. It can scan at 300 or 600 DPI and you can apply OCR after the scans are done. The DC265 is a workhorse and there are about a million of them out there. The scan-back feature costs extra on the device, so not everyone spent the money on it....but about 1000 Kinko's have these in house and a Kinko's with a good DTP department might actually even know how to use the feature. (Good Luck!)
.

when will it end? (1)

bcnarc (449909) | more than 12 years ago | (#3493015)

This has to be one of the dumbest questions I've seen in a long time. If you're ambitious enough to attempt to scan '100lbs of dead trees' you'd think you'd manage to do some research on your own.

ooh.. searchable index... (2)

josquint (193951) | more than 12 years ago | (#3493018)

I don't know HOW many times I've looked at a tech manual (or other paper book for that matter) trying to find something I read a while ago and thought "I wish I could just do a text search to find the 3 or so words I remember seeing..." Sure, the index and table of contents get you part of the way there, but if the author mentions something off-hand in an 'unrelated' section of the book...

Try one of these... (3, Interesting)

matthew.thompson (44814) | more than 12 years ago | (#3493023)

Canon DR-5020 [canon.com]

Canon's 90ppm high speed scanner - only problem with high speed scanning is that they need loose leaves. Any decent books you have and want to copy will need a Stanley knife taking to the spine.

Please remember to make decent backups on a long-lasting medium with a high chance of recoverability. Failing that, place the loose leaf versions with a document recovery firm and take their insurance for the full purchase value of the originals.

reed all about IT (-1, Offtopic)

Anonymous Coward | more than 12 years ago | (#3493026)

now that the "news" "changes" .asp we read IT, will priNT be allowed anymore.

see also: FraUDuleNT fairytail "economy" collapses due to onset of hobbyist whiner's coolapps giveaway.

see also: stinking rotting carcass of stock markup "bull" is reincarnated for just one more fleecing of what's left of J. Public's ability to take on debt.

searchable text versus scanned images (2, Redundant)

pomakis (323200) | more than 12 years ago | (#3493028)

The first question you'll want to ask yourself is whether you want the result in searchable text form or scanned image form. Searchable text is achievable with OCR (optical character recognition) software, but has at least two issues:

  • OCR software isn't perfect, and so errors will occur that you'll either have to live with or correct manually. Good OCR software does some validating against a dictionary, but this doesn't help when the source is highly mathematical, etc.
  • You'll lose figures, diagrams and pictures.

Scanned images solve these problems, but have two problems of their own:

  • They're not searchable.
  • They're bulky (perhaps 100x).

Perhaps a hybrid solution exists, but I suspect such a solution will require a lot of manual intervention and tweaking, something you'll want to avoid if your goal is to digitize several books.
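To put a rough number on the "bulky" point above, here is a back-of-the-envelope sketch. The page size, word count, and bytes-per-word figures are illustrative assumptions, not measurements, and the image sizes are uncompressed; real compression narrows the gap considerably.

```python
# Rough size comparison for one scanned page versus its plain text.
# Every constant below is an illustrative assumption.

PAGE_WIDTH_IN = 8.5      # assumed page width in inches
PAGE_HEIGHT_IN = 11.0    # assumed page height in inches
DPI = 300                # scan resolution mentioned in the original question
WORDS_PER_PAGE = 500     # assumed density for a technical book
BYTES_PER_WORD = 6       # assumed average, including the trailing space


def raw_image_bytes(bits_per_pixel: int) -> int:
    """Uncompressed size of one page image at the given bit depth."""
    pixels = round(PAGE_WIDTH_IN * DPI) * round(PAGE_HEIGHT_IN * DPI)
    return pixels * bits_per_pixel // 8


def text_bytes() -> int:
    """Approximate size of the same page stored as plain text."""
    return WORDS_PER_PAGE * BYTES_PER_WORD


if __name__ == "__main__":
    for label, depth in (("1-bit black and white", 1), ("8-bit grayscale", 8)):
        size = raw_image_bytes(depth)
        print(f"{label}: {size} bytes, about {size // text_bytes()}x the text")
```

Under these assumptions the uncompressed image is hundreds to thousands of times larger than the text, so a compressed figure on the order of 100x is plausible.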

Re:searchable text versus scanned images (2)

synx (29979) | more than 12 years ago | (#3493063)

I seem to recall a product that Adobe has which makes hybrid PDF files using OCR. Text where possible, graphics elsewhere. You get the benefits of both. Of course, the software is expensive.

Re:searchable text versus scanned images (1)

turbosaab (526476) | more than 12 years ago | (#3493065)

AFAIK, you can create Adobe PDF files where the image is visible and the OCRed text is "underneath" for search capability.

Re:searchable text versus scanned images (1)

br0ck (237309) | more than 12 years ago | (#3493136)

Google seems to have found a way [google.com] to search for words within images of catalog pages. Look for the cool little yellow boxes.

I like my dead trees (2, Insightful)

SirWhoopass (108232) | more than 12 years ago | (#3493030)

Electronic manuals are great, particularly because of the ability to search them. I certainly use plenty of them.

Personally, however, I still like printed manuals. Using an online manual means either reducing some windows or switching desktops. With a paper manual I can keep the screen exactly as it is. Higher resolution screens, or the use of multiple screens, are making online manuals much more useful (anyone remember what a pain in the ass it was to try and figure out something with only an online manual on a 640x480 screen?). Occasionally I still manage to fill two 1600x1200 screens with a bunch of stuff I want to keep visible while still reading the manual.

Electronic format is nice for storage, but... (2, Informative)

delphin42 (556929) | more than 12 years ago | (#3493033)

if you are anything like the computer guys I know (myself included), you'd end up printing out portions of the text whenever you wanted to read them anyway!!!

I have the same goals - and problems (1)

nurb432 (527695) | more than 12 years ago | (#3493037)

Looking at over 2000 books (and magazines) in boxes in my garage, I'm faced with the same issues.

Like what sort of scanner, software, etc. to use for such a massive collection.

And how to rationally complete the project... am I looking at having to cut(!) the books for a sheet feeder, or squish them on a flatbed? Never had much luck with 'page scanners'.

Am I looking at *having* to buy something like Acrobat to make the scanned pages useful?

Tech books shouldn't be dead tree only. (1, Interesting)

Anonymous Coward | more than 12 years ago | (#3493053)

Think about it.

People love books in dead tree format for the most part. You don't really want to curl up with a cup of coffee and a nice monitor. No, you want some good old dead tree.

But when you're coding, you don't want to curl up with a cup of coffee. You want to sit in a chair and hammer out code while quaffing coffee as if it were, well, coffee.

Most of the time when I look through books for reference, it's annoying. I'd rather be able to just grep for info.

Thankfully, at least O'Reilly's catching on to this. :)

Sara and her Foveon Camera (-1)

cybrpnk (94636) | more than 12 years ago | (#3493055)

Digitizing books can be done automatically by machine if you've got the cash - the machine even turns the page [4digitalbooks.com] . It can also be done by Indians [outsource2india.com] . If you want to Buy American, you could probably hire Sara and her Foveon Camera [tmc.edu] (check the bottom of the page)....

JPEG? (1)

Catskul (323619) | more than 12 years ago | (#3493058)

Don't use JPEG; it's not good for text. JPEGs are good for photographs because photographs have predictable gradients. Use PNG/GIF for images with sharp/nongradual edges; you will get better compression/quality that way.

I like the / character. : )
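The claim that sharp-edged page images compress well without loss can be checked roughly with the standard library. This is only a sketch: it uses zlib (the same DEFLATE family PNG relies on) on a synthetic black-and-white "page" of ruled lines, not a real PNG encoder or a real scan, and the page dimensions are made up. Real scans contain noise and glyph shapes, so they will compress far less than this.

```python
import zlib

WIDTH, HEIGHT = 600, 800   # assumed small page, one byte per pixel
WHITE, BLACK = 255, 0


def synthetic_page() -> bytes:
    """Build a fake text page: white background with black ruled 'text lines'."""
    rows = []
    for y in range(HEIGHT):
        if y % 20 < 4:
            # A 4-pixel-high dark band with white margins, repeated every 20 rows.
            row = bytes([WHITE] * 50 + [BLACK] * (WIDTH - 100) + [WHITE] * 50)
        else:
            row = bytes([WHITE] * WIDTH)
        rows.append(row)
    return b"".join(rows)


page = synthetic_page()
packed = zlib.compress(page, 9)

# Lossless: decompressing gives back exactly the original bytes.
assert zlib.decompress(packed) == page
print(len(page), "bytes raw ->", len(packed), "bytes compressed")
```

JPEG cannot make the same round-trip guarantee, which is the part that matters if the images are going to be fed to OCR later.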

Don't use JPEG. (1)

Bistronaut (267467) | more than 12 years ago | (#3493064)

Use PNG! It's lossless and gets compression ratios that are just as good (unless you are using ultra-lossy compression with your JPEGs - in which case they will be a pain to read anyway). Why do people even use JPEG and GIF anymore? JPEG is only good if you need ultra-high compression and don't care about quality, and GIF only has the animation thing on PNGs.

Sorry about the rant, but there are so many cool computer technologies that people just overlook. It makes me sad.

Copyright Infringement? (1)

stickytar (96286) | more than 12 years ago | (#3493068)

It seems that all the "essential" books I have rarely get touched except at those special key moments when they are needed. I can't imagine spending more than 15 minutes trying to adapt these "old" knowledge bases into electronic form unless the collection was SO HUGE and needed to be accessed by a lot of people (i.e., the company library?). So then, what? Where is the fair use policy on this? Do I buy O'Reilly's "Java and XML" book and then copy and cut it up to my heart's content? What do the publishers think about all this?

Fire trucks!! Start your engines!!

I want both (2, Informative)

peterdaly (123554) | more than 12 years ago | (#3493070)

O'Reilly has many of their Java books available on CD-ROM, although I only own the dead tree versions of the ones I have in that series.

On a regular basis, I haul 2188 pages' worth (I just added them up) of QUE's Using Java2 Standard Edition and Enterprise Edition between home and the office. (Speaking of which, go to the link in my .sig and buy some of my favorite books!) That's a lot of weight for two books, and I usually haul around a couple of smaller ones as well: O'Reilly's Perl book and their EJB 3rd edition.

Not only are all of these books heavy, but I have also yet to find an easy way to cart them around; they don't all fit right in any of my bags.

I want all of these books on CD-ROM, but not just CD-ROM. Half the books I have INCLUDED a CD-ROM; it just doesn't contain the text of the book. With O'Reilly, I'd buy the CD-ROM version, but I want the dead tree version too. I want to use the dead tree version, but when I am working from home, I want to haul home the CDs. I don't think I should have to pay any more for it either; I bought the IP (in the property sense), and I am already paying the price for the wood slices, which includes a silver disk.

PUBLISHERS, GIVE ME THE BOOK ON THE CD TOO! I spend $100/month or so on tech books.

-Pete

Let me get this straight... (5, Insightful)

deacon (40533) | more than 12 years ago | (#3493073)

You are going to cut up thousands of dollars worth of your "essential" books?

And put them into an inferior visual format you cannot read without the computer being working and on?

And you are going to spend about 100 hours to do this.. and the original books are going to be ruined.

All this just so you don't have to make 3 trips to move your books?

Mmmkayyy.. (backs away slowly)

Have you ever heard of a dolly?

contact your local school for the blind (2, Interesting)

veggiespam (5283) | more than 12 years ago | (#3493075)

Schools for the blind have been doing this for years, especially with technical books. Many of my V.I. friends would remove the binding and feed the pages through a high-speed sheet feeder to a scanner. Then, the books are proofed by sighted people for OCR perfection. Contact your local school and ask if they already have some of your works in pdf/jpeg/tiff/WordPerfect (yes, lots of WordPerfect). They may be willing to give you some legal copies of your books in exchange for you converting some of the books you have that they don't into blind-readable format (which means you'd have to proof your own book for accuracy, but you're doing that anyway). Basically, you're donating your time for a good cause and benefiting yourself.

In a Word: (1)

bpfinn (557273) | more than 12 years ago | (#3493079)

Exoskeleton [slashdot.org]

are you sure you want to do this? (4, Insightful)

binaryDigit (557647) | more than 12 years ago | (#3493089)

I think you may be underestimating the sheer enormity of your task. You have to get sheets to all feed right (a little skew and you're skrewed) and in order (feeder issues: what happens when one page mis-scans/feeds, can you go back and insert it into its proper location?), and handle front-to-back issues (though I would assume that decent scanning software would take care of this for you). Also, your plan to use JPEG might be problematic. OCR is finicky enough as it is; back when we were scanning documents we always used 300dpi TIFF (using Group 3 or Group 4 lossless compression) to get the maximum accuracy rates from the OCR package we were using. And speaking of accuracy, keep in mind that OCR software that has a 97% accuracy rate will flub 3 out of every 100 words; in a book that might contain tens/hundreds of thousands of words, that is a whole lot of errors. Now it's been a few years (6-8) since I've done this kind of stuff, so who knows, maybe things are much better now?
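The accuracy arithmetic above is easy to sketch. The function names here are made up for illustration, the 97% figure is treated as a per-word rate as in the paragraph above, and errors are assumed to be independent of each other, which real OCR errors usually are not.

```python
def expected_word_errors(total_words: int, word_accuracy: float) -> int:
    """Expected number of misrecognized words at a given per-word accuracy."""
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    return round(total_words * (1.0 - word_accuracy))


def phrase_intact_probability(phrase_length: int, word_accuracy: float) -> float:
    """Chance that every word of a phrase was recognized correctly,
    assuming each word fails independently (a simplifying assumption)."""
    return word_accuracy ** phrase_length


if __name__ == "__main__":
    for words in (100, 100_000, 300_000):
        print(words, "words ->", expected_word_errors(words, 0.97), "expected errors")
    print("3-word phrase intact:", round(phrase_intact_probability(3, 0.97), 3))
```

So a 100,000-word book at 97% would carry about 3,000 bad words, yet a search for a three-word phrase would still hit roughly 91% of the time under these assumptions. That is poor for reading but arguably tolerable for an index.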

I've been wanting to do something similar for years, but with technical magazines, not books. But the sheer amount of manual labor involved has turned me off considerably (not to mention the thought of destroying the original source).

Keep in mind that this is such a common need that if it were pretty straightforward, much of it would be done already (perhaps someone out there has had the time/hardware/software to do some of this already?). Not to mention that with the web, much of the information contained in those books is now available online, which makes you wonder if it's really worth the time and effort, esp. considering that a great many technical books are obsolete two weeks before they hit the shelves.

Call the cops! (0)

Anonymous Coward | more than 12 years ago | (#3493096)

Isn't what you are planning on doing technically a copyright violation?

PDF and OCR (2, Interesting)

4/3PI*R^3 (102276) | more than 12 years ago | (#3493105)

If you really want to go through all this effort use both PDF and OCR.
OCR sucks royally for large documents, documents with images or diagrams, handwritten comments, etc. However scanning the pages to an image and then creating a PDF of the images does not care about any of that.
So, scan all of your books as images that your OCR software can process. Use the OCR output to create an index of pages. If a specific word on a specific page doesn't OCR well who cares. With typed and professionally printed books your OCR software should be about 90% accurate. Take the images and create PDF files.
Now you have your nice clean images but you still have a searchable index. BTW, when you get this done post your procedures, problems, and solutions to a web site somewhere so that you can share your experiences with the rest of the world.
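The "OCR output as an index of pages" idea can be sketched in a few lines. Everything here is hypothetical: the file names, the sample text, and the 0.8 similarity cutoff are made up for illustration, and the OCR text is assumed to already exist as one string per page image. The fuzzy lookup uses difflib from the standard library so that a near-miss OCR spelling still leads to the right page.

```python
import re
from collections import defaultdict
from difflib import get_close_matches


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of page image names containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for image_name, ocr_text in pages.items():
        for word in re.findall(r"[a-z0-9]+", ocr_text.lower()):
            index[word].add(image_name)
    return index


def lookup(index: dict[str, set[str]], query: str, cutoff: float = 0.8) -> list[str]:
    """Return page images for the query word, tolerating near-miss OCR spellings."""
    matches = get_close_matches(query.lower(), list(index), n=5, cutoff=cutoff)
    found: set[str] = set()
    for word in matches:
        found |= index[word]
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical OCR output; page 2 has "m" misread as "rn".
    ocr_pages = {
        "book1_p001.png": "The semaphore guards the shared buffer.",
        "book1_p002.png": "A sernaphore may block the caller.",
    }
    idx = build_index(ocr_pages)
    print(lookup(idx, "semaphore"))
```

The lookup only returns image names; the clean page image is what you actually read, so the OCR text never has to be proofread.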

Start with google. (2)

bluGill (862) | more than 12 years ago | (#3493107)

Start with google. There is a lot of technical information online, and google will find it. Not as good as those dead trees, but if you can find it and it is accurate, google is often easier than searching indexes. Best of all, dead trees are limited to the ones you own, while google is limited to whatever someone found useful to put online.

Note the last line of the above: google is limited to what someone else finds useful to put online. So if you can't find it on google, take some time to put it online for the rest of us. If/when you find yourself going back to the same few sites often, link to them from your homepage so google knows you find them useful. In other words, google is interactive, make it work for you and it will work for everyone. The internet is not a one way street.

Finally, some things are just plain easier to look up in dead tree format. I would strongly recommend you keep your books intact. Put the information you need on the web (what you can do legally), and keep the books for the rest. If you find you are not using a book anymore because all the information is on the web (including what you put there), then throw it out. My monitor is only 19 inches, not nearly enough to hold all the information I have scattered about my desk.

screw trees... (-1, Flamebait)

Anonymous Coward | more than 12 years ago | (#3493108)

fuck arbor day too.

When are you jerkoffs going to realize that trees are evil and serve no purpose except to inseminate hobbits.

Blackmask.com (2)

KelsoLundeen (454249) | more than 12 years ago | (#3493118)

Blackmask.com [blackmask.com]

Tons and tons of e-texts. In multiple formats: text, pdf, lit, HTML.

Excellent resource!

FAQ: Making Etexts from Paper Originals (2, Informative)

ancarett (221103) | more than 12 years ago | (#3493119)

Anders Borg [torget.se] wrote this FAQ [promo.net] from Project Gutenberg [promo.net] . Lots of field-tested advice there, such as a suggestion to scan at 300dpi or better.

Goatses evil twin (0)

Anonymous Coward | more than 12 years ago | (#3493120)

Heres someone else [stilemedia.com] doing the same thing As goatse.cx! [goatse.cx]

Somewhat on topic... Historical Papers (3, Interesting)

Embedded Geek (532893) | more than 12 years ago | (#3493135)

My father passed on Sunday and we were going through all the family papers. We have lots of original documents from my family during the Civil War and earlier. My sister and I were thinking of donating them to a museum, so there would be no risk of their loss should my house get damaged (there's way too many documents to fit in my fire safe).

Before doing this, though, we were thinking of scanning/copying all the documents to keep copies for ourselves. In doing so, though, we could use some advice:

What special steps must we take in scanning 150+ year old documents, some very yellowed and fragile?

What is the best format in which to store them (assuming we want them easily readable in 20+ years for our kids)?

What is the best media upon which to store the data (again, hoping for readability in 20+ years)? (I'm thinking online storage to allow easy conversion to the media of the moment, but I still want something to stash in the safe deposit box)

Does anyone have experience with digital preservation/restoration of archival documents? Should I just try cleaning it up in Photoshop or should I find a pro to help out? Maybe I can make it a term of the donation to the museum/library, for that matter.

Thanks in advance for your advice.

Abuse of hardware resources! (0)

Anonymous Coward | more than 12 years ago | (#3493137)

Listen, chump! Scanning books and storing them online is a waste of hard drive space! Space which would be better utilized for illegal MPAA and RIAA copyrighted material, porn, games, warez and other materials.

paper is superior (0)

Anonymous Coward | more than 12 years ago | (#3493138)

1. You can keep paper docs next to your computer while you work without having to juggle applications on your PC.
2. It's much easier to read paper docs while you take a dump or lay in bed. I know that you can use a laptop while shitting or laying in bed, but a book is easier. Save the toilet/bed laptop sessions for important IRC chats.