Human-Powered Internet Archive Book Project

Zonk posted more than 8 years ago | from the hope-she-likes-the-way-books-smell dept.

Books 113

Carl Bialik from the WSJ writes "A group led by the Internet Archive is planning a massive, ambitious effort to scan millions of old books and make them available for Web searching early next year. Behind that effort are about a dozen scanners, employees making about $10 an hour to manually scan volumes -- some more than a century old -- one page at a time, on special contraptions. The Wall Street Journal Online visits a University of Toronto library to watch one of the scanners in action: 25-year-old Liz Ridolfo."

Fp/Google (-1)

MikeWasHere05 (900478) | more than 8 years ago | (#14014224)

fp? So, are they going to run into the same problems that Google did/does?

Re:Fp/Google (1)

TWooster (696270) | more than 8 years ago | (#14014261)

No. You failed to RTFA. They are only scanning books pre-1923 -- out of copyright -- and those that they are specifically allowed to scan by the publishers. This has the backing of a lot of big corporations (Microsoft, HP, etc), and I don't think they'd like to be caught on the wrong side of copyright law, considering their position on the whole issue.

Re:Fp/Google (1)

roseblood (631824) | more than 8 years ago | (#14014262)

The books are old enough to be in the public domain now. No problem.

Re:Fp/Google (1)

D'Sphitz (699604) | more than 8 years ago | (#14014510)

apparently you were more interested in getting first post than reading tfa.

Re:Fp/Google (1, Funny)

Anonymous Coward | more than 8 years ago | (#14014693)

apparently you were more interested in getting first post than reading tfa.

You must be new here ;)

Re:Fp/Google (0)

Anonymous Coward | more than 8 years ago | (#14015221)

You must enjoy using ridiculously worn-out catchphrases.

Diffrent? (1)

Sinryc (834433) | more than 8 years ago | (#14014226)

How is this different from something like Google's project? And how will the copyright holders feel? Also, this could be pretty useful...

Re:Diffrent? (3, Insightful)

way2trivial (601132) | more than 8 years ago | (#14014247)

Stories over 75 years old don't have the same copyright protections..

anyone can do 'a christmas carol' because its copyright has expired..

using someone's PRECISE arrangement of the text is not permissible, however -- that has its own copyright...
so if I buy a current day copy from amazon, I can't scan it in... but if I buy a copy whose last edition/print was more than 75 years ago, it is fair game.

Re:Diffrent? (1, Informative)

Anonymous Coward | more than 8 years ago | (#14014280)

> so if I buy a current day copy from amazon, I can't scan it in

bullshit

Re:Diffrent? (1)

way2trivial (601132) | more than 8 years ago | (#14014291)

okay, I can't scan it in, and use it for commercial purposes... as in, to resell..

I have to make my own assemblage of the story.. and sell that,

Re:Diffrent? (2, Insightful)

Kadin2048 (468275) | more than 8 years ago | (#14016559)

Commercial or non-commercial use doesn't enter into it.

If the work in question is under copyright, you can't copy and redistribute it; if it's not, then you can. The only exceptions would be the fair use provisions, and I don't think that they would cover you reproducing an entire book, even if it was for non-commercial use: if you're a university professor you can't copy an entire textbook and give copies out to your students. That's a non-commercial use, but it's still illegal. There might be some exceptions for purely personal use -- some type of "format shifting" perhaps, like OCRing it and running it through text-to-speech and putting the result on your iPod -- that you could make a good case for if you already owned the book in print form, but non-commercial use normally isn't an excuse for infringement. Despite public opinion to the contrary, there is no exception to a copyright holder's exclusive rights for "non-commercial" uses.

Also, if you made a derivative work from something that was out of copyright, and then went and tried to sell it, only the portions that you contributed anew would be protected, the existing stuff doesn't change. That's not to say that you couldn't sell it (if it's out of copyright you don't even have to change anything to sell it, you can go and print out anything you want from Project Gutenberg and try to sell it, if you think you'll get any buyers), but you wouldn't have any recourse against someone taking your changed version, editing out all of your changes, and selling it themselves.

There is a fairly good introduction to these concepts here [findlaw.com] . Or read it straight [findlaw.com] from the U.S. Code.

best commet ever! (4, Funny)

CannibalSmith (684531) | more than 8 years ago | (#14014444)

> bullshit

I too want to be modded Insightful!

Exactly. (1)

Kadin2048 (468275) | more than 8 years ago | (#14016472)

If you bought a copy of any classic book that is out of copyright, and it's a literal republication of the original (not a 'modern interpretation' or new translation or anything else) then you could, I believe, scan, OCR, and distribute the resulting text. The literary work -- be it Shakespeare's, Clemens', Dickens', etc. -- is no longer protected by copyright.

You could not, however, scan the book and distribute the images of the pages. Because although the original author's text is not under copyright protection, the book itself (layout, design, etc.) could be. Also, any changes they might have made to the text (new grammar, or diction) could be, which is why you'd have to be careful. I think it would only qualify as a new protected work if the changes represented "an original work of authorship" according to 17 U.S.C. 101 [findlaw.com] , but depending on the publisher they might try to sue you into bankruptcy anyway.

As long as you didn't copy the layout or any of the additional materials (critical essays, introductions) that publishers put into re-prints of classic literature, I see no reason why it would be illegal to type in and share the $2.99 Penguin Classics edition of Tom Sawyer that you can get at any Borders.

Re:Diffrent? (1)

temojen (678985) | more than 8 years ago | (#14014255)

And how will the Copyright holders feel?

There probably aren't any. Copyrights do expire.

Re:Diffrent? (0)

Anonymous Coward | more than 8 years ago | (#14014445)

Copyrights do expire.
As do the copyright holders. :-)

Re:Diffrent? (1, Interesting)

Anonymous Coward | more than 8 years ago | (#14014264)

The Internet Archive is a non-profit. As for sufficiently old books, they're out of copyright anyway, and neither Google nor the Archive will have problems. Meanwhile, as an author, I would be fairly happy for a non-profit such as this to scan my publications, providing search and excerpts. I don't think I'd even be too up in arms if they used opt-out - they are, after all, just extending the role of the library. On the other hand, when Google does it, with the aim of making its shareholders richer (whether that's by providing for-pay services, advertising, or merely control that provides potential for revenue) I most certainly will refuse. (captcha: archival. hehe)

Re:Diffrent? (4, Informative)

arrrrg (902404) | more than 8 years ago | (#14014299)

From the Wikipedia article on the Open Content Alliance [wikipedia.org] :

The Open Content Alliance is a consortium of non-profit and for-profit groups which is dedicated to building a free archive of digital text and multimedia. It was conceived in 2005 by Yahoo and the Internet Archive. It was conceived in response to Google Print's closed nature, and aims to keep public domain works in the public domain on-line. These results will then be used in the search results of participating search engines. You can see a sample of the open content at openlibrary.org

A large difference between the OCA's approach and that of Google Print is that the OCA intends to ask a copyright holder before digitising a work that is still under copyright, while Google Print will digitise any book unless explicitly told not to do so by November 1, 2005.


So, Google Print will almost certainly be better when searching for copyrighted material. For public domain works, we'll have to wait and see.

IMHO, it seems like a little cooperation here would make a lot of sense for both parties - they could save money trading digital copies 1-for-1 while remaining in (healthy) competition.

Re:Diffrent? (1)

Kadin2048 (468275) | more than 8 years ago | (#14016616)

IMHO, it seems like a little cooperation here would make a lot of sense for both parties - they could save money trading digital copies 1-for-1 while remaining in (healthy) competition.

This is very true. However you see this sort of thing in a lot of emerging industries -- two competitors will duplicate each other's work until eventually one defeats the other in the marketplace and buys up their work at fire-sale prices. As long as either one thinks that they can "win," there's little incentive to help.

Too bad though, because the other thing they could do is divide the 'contentspace' and let Google deal with the stuff that's under copyright (which has a lot of thorny legal issues but is potentially a gold mine in terms of rewards) while letting the OCA do the stuff that's free, and link their databases through one Search interface. Two backends, one frontend. They could even pull Project Gutenberg's collection and the Internet Archive in there too.

But I think that as long as Yahoo is championing one effort and Google the other, there won't be that much integration between them.

Re:Diffrent? (4, Informative)

Dave114 (168228) | more than 8 years ago | (#14014317)

It's different. Take a look at the Open Content Alliance's FAQ [opencontentalliance.org] . Below are a few excerpts from it:

What can people do with materials contained in the OCA archive?

The OCA will encourage the greatest possible degree of access to and reuse of collections in the archive, while respecting the rights of content owners and contributors. Generally, textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF. Contributors to the OCA will determine the appropriate level of access to their content.

How will the OCA deal with copyrighted content?

The OCA is committed to respecting the copyrights of content owners. All content providers who contribute to the OCA must agree with the founding principles of the OCA, contained in the OCA Call for Participation, which describes how their materials and associated metadata will be accessed and used. Further, all contributors of collections can specify use restrictions on material that they contribute.

Will copyrighted content be digitized or placed in the OCA archive without explicit permission from rights-holders?

No. OCA contributors must secure the permission of all concerned copyright holders prior to submitting materials to the OCA for digitization or inclusion in the archive.

Re:Diffrent? (1)

cwalk (899502) | more than 8 years ago | (#14014330)

If you actually read the article...
But the Open Content Alliance has sidestepped legal troubles by focusing on books published before 1923 -- and therefore out of copyright in the U.S. -- as well as some newer books publishers have allowed it to scan.

Re:Diffrent? (2, Interesting)

Chubby_C (874060) | more than 8 years ago | (#14015490)

With all these companies now deciding they want to scan books (Google, Amazon), why not partner up on this project? It would greatly reduce the overall costs, since otherwise each company ends up scanning the same books as the others.

At least partner up for the process of scanning even if they have different plans as to what to do with the scans

http://www.digg.com (0, Troll)

master_meio (834537) | more than 8 years ago | (#14014234)

Digg speaks, slashdot listens. It's like a crystal ball, showing you what slashdot's front page will look like in 45 minutes.

R.I.P. Slashdot.org

Re:http://www.digg.com (-1, Troll)

Anonymous Coward | more than 8 years ago | (#14014988)

It should be slashdot.com, the hypocrites.

Re:http://www.digg.com (1, Troll)

heinousjay (683506) | more than 8 years ago | (#14015165)

Yep.

Too bad the comments on Digg make this place look like a scholar's retreat.

Contributing to Gutenberg (2, Interesting)

watermodem (714738) | more than 8 years ago | (#14014239)

Will the scans be added to the Project Gutenberg collection?

Sorta. (4, Informative)

Grendel Drago (41496) | more than 8 years ago | (#14014260)

Project Gutenberg frequently makes use of the page scans for source material. What PG does is to run the images through OCR, proofread and post-process it. It's more useful than a stack of page images, but considerably more work.

If you look at the current books on Distributed Proofreaders [pgdp.net] , you'll see that some of them credit the Million Books Project for the page scans.

Re:Sorta. (2, Insightful)

spxero (782496) | more than 8 years ago | (#14014401)

But at the same time, wouldn't it be better if this outfit did the scanning for PG and PG edited and finalized? What is the point to race against another organization to provide the same works without making profit?*

*Until advertisement factors in. Advertisement ALWAYS factors in...

Re:Sorta. (1)

aussie_a (778472) | more than 8 years ago | (#14014415)

I'm pretty sure PG has no advertising, so there is no profit factor. And I'm fairly certain that someone could come along for PG and use the Internet Archive's images to convert into text.

Internet Archive doesn't have to specifically give them the images though.

OCA and PG scratching each others' backs (2, Insightful)

TTK Ciar (698795) | more than 8 years ago | (#14016615)

The focuses of OCA and PG are really quite different: PG is most interested in preserving the essential information of a book (ie, its text), while OCA's interest is in preserving the form of the book (ie, its fonts, page format, coloration, even down to the yellowing of the pages). That having been said, there's a lot each can do for the other (and has!).

The Archive has archived most of PG's material, because even though the Books department of The Archive is focussed mostly on preserving books, The Archive as a whole is interested in preserving just about any information it can, and the PG data is definitely of interest.

When The Archive's Scribe software processes the book images into its various formats (jpg, djvu, pdf, flippy, et al), it OCRs the book's text. This text then becomes part of generating some of the other formats. It will be really trivial for PG to obtain this text for any book it wants to incorporate into its dataset.

qv: intlepisode00jamearch [archive.org] . The interesting files here are intlepisode00jamearch.txt which is just the OCR'd text, and intlepisode00jamearch_djvu.xml which is the OCR'd text with layout information (which has been useful to me in developing software which auto-corrects some OCR errors -- where the text is on the page often offers valuable hints for choosing the right heuristic for guessing the right text).

A quick side note on the differences between Google's and OCA's efforts that I haven't seen talked about much -- Google's main advantages in their bookscanning efforts are their wealth and fame, while The Archive's main advantages are experience, familiarity, and scanning technology.

Traditional book-scanning technologies are expensive and slow (which makes doing a lot of books, fast, that much more expensive, because you have to hire more people to do more books in parallel), but Google has enough money to throw at the problem that this is less of an issue. Google's fame means they can bring powerful partners onboard with a smile and a handshake, including some of the most prestigious libraries in the nation.

The Archive has been involved in scanning books and making them available online for several years now (qv The Million Books Project [archive.org] ). This experience has shaped the processes used in the acquisition and scanning of books, as well as the technology used in their storage, indexing, and presentation. Furthermore, libraries around the world have grown familiar with The Archive over the years. That, and The Archive's good track record, make it a powerful rallying point for partnerships and alliances, and have given it more experience in facilitating such relationships. Finally, partially due to the limits of existing book-scanning solutions, and partially due to The Archive's limited budget, it has facilitated the development of two independent low-cost, reliable, high-quality book-scanning systems: The Scribe (developed in-house at The Archive) and the Kirtas Robot [archive.org] (developed at Kirtas [kirtas-tech.com] , a Canadian company).

Many of the books scanned for the Million Book Project using traditional scanning methods are really lousy, sometimes to the point of being unreadable. These new scanning systems dramatically improve the quality of the end product, while equally dramatically reducing the cost-per-page. This means that more scanning systems can be purchased for more libraries (avoiding the per-library capital outlay problem), and more books can be scanned more quickly within a given budget.

Obviously, Google and OCA can benefit from co-operation, as each has a lot to offer the other. I'd be surprised if Google didn't join the OCA, eventually, if for no other reason than to gain access to the books of the >100 OCA member libraries, and to more freely discuss technical and legal issues of common interest. But also, Google and OCA can (and in some ways already have) benefitted each other without co-operating. Google can learn a lot about how to store, index, and present books just by looking at The Archive's system (which is totally open for anyone to look at), and can obviously make their wealth stretch by purchasing and deploying the Scribe and/or Kirtas robot. Since The Archive gives away all its content to anyone for free, Google can trivially obtain all of the books The Archive has scanned. Meanwhile, OCA can benefit from the limelight Google has cast on bookscanning in general. Because more people are now familiar with the concept and with the issues involved, the OCA can more easily approach and accept new member organizations. And of course if Google does purchase a million Kirtas robots, the economy of scale will reduce the per-unit price of those robots, enabling OCA to purchase more of them for their own book-scanning efforts.

Just my $0.02USD .. disclaimer: though I work for The Internet Archive, I do not speak for them, and the opinions stated here are not necessarily shared by my employer.

-- TTK

Re:Contributing to Gutenberg (4, Informative)

jonathan_ingram (30440) | more than 8 years ago | (#14014440)

The scans won't be added to Project Gutenberg, but it's very likely that the scans will be used by Project Gutenberg's Distributed Proofreading [pgdp.net] project, which I'm involved in. We're already 'harvesting' images from quite a few sites, as well as all the images our volunteers scan. Now that there are several large and relatively well funded scanning operations getting off the ground, I imagine that over time an ever increasing proportion of the works that go through DP will be based on harvested images.

I maintain several lists that show the DP harvesting status of several image collections, including The Internet Archive's Canadian Libraries collection [ntlworld.com] , Google Print [ntlworld.com] , and Early Canadiana Online [ntlworld.com] . As you can see, we will not be running short of material to work on for a very long time, even without any of these recently announced initiatives. That said, it's always great to see more material be made freely available, rather than locked up behind expensive subscription services like Jstor and EEBO.

It's lighter! (4, Interesting)

HolyCrapSCOsux (700114) | more than 8 years ago | (#14014244)

Last time I moved, it took many VERY HEAVY boxes to move all my books. Maybe I'll scan them all..

Although anything useful has to be illegal... :(

Re:It's lighter! (3, Insightful)

Hosiah (849792) | more than 8 years ago | (#14014318)

Ahem: years ago, I made up the "moving time" rule that books *must* be packed in the smallest available boxes. Anything of dimensions around 2x1x1 feet. After straining on the book boxes previously, it occurred to me that it's human nature to (a) pack books first, reasoning that you're not going to be doing much reading in the next couple days anyway... and (b) upon first beginning to pack, grab the biggest box to start with.

Re:It's lighter! (0)

Anonymous Coward | more than 8 years ago | (#14015014)

So, this is what Slashdot has become? "Packing Advice for Morons. Stuff to Make Boxes Lighter".

Your post got modded Informative, when it was Offtopic to begin with: What does packing books for moving have to do with the article? Next to nothing.

Slashdot has become a sewer, full of 6-digit idiots that can't think, can't write, and couldn't stay on subject if someone put a gun to their head.

Re:It's lighter! (0)

Anonymous Coward | more than 8 years ago | (#14015105)

"Slashdot has become a sewer, full of 6-digit idiots that can't think, can't write, and couldn't stay on subject if someone put a gun to their head."

And anonymous cowards who complain endlessly about how bad slashdot has become. I mean, could anyone be more whiny? Which reminds me, there was a squirrel up in a tree screeching (which is a lot like whining) on and on and on while I was trying to write on slashdot. So, I took my AK-47 and shot his screeching ass full of lead. Scared away all the other squirrels, too. You ever notice that, when you blast a squirrel's head into utter oblivion, its tail isn't all that bushy anymore?

Squirrel n. - 1. Any of various arboreal rodents of the genus Sciurus and related genera of the family Sciuridae, having a long flexible bushy tail and including the fox squirrel, gray squirrel, and red squirrel. Also called tree squirrel.

Re:It's lighter! (0)

Anonymous Coward | more than 8 years ago | (#14015367)

So, I took my AK-47 and shot his screeching ass full of lead.

See, and it's that kind of insensitivity towards rodents that, I think, is behind Disney Corporation's increasing hostility in the entertainment market, the rise in popularity of Tom and Jerry cartoons, the fallout of the Smurf figurines market, the high cost of cheese, and the alarming increase in Furby sex as our nation's most pervasive form of perversion.

Nevertheless, I for one, welcome our new AK-47-wielding squirrel-shooting overlords.

Re:It's lighter! (0)

Anonymous Coward | more than 8 years ago | (#14014633)

Good move is to put books and clothes (half and half) in same boxes. Bad move is to put all books in a freezer.

Why not join the Gutenberg Project (0, Redundant)

autOmato (446950) | more than 8 years ago | (#14014246)

Since they are nonprofit, why don't they join forces with a quite similar project: Project Gutenberg [gutenberg.org]

Re:Why not join the Gutenberg Project (3, Informative)

flimnap (751001) | more than 8 years ago | (#14014283)

Project Gutenberg and the Open Content Alliance are working on two slightly different things:

The OCA is making available the images of scanned pages. That's fine for reading an entire book, but you can't search it, nor copy a section of text into a document of your own.

Project Gutenberg makes available plain text, usually illustrated HTML, and occasionally other versions, of public domain books, which can be used by anyone for no cost.

If you'd like to help prepare public domain ebooks, visit Distributed Proofreaders [pgdp.net] and proofread a page a day (or more!).

Re:Why not join the Gutenberg Project (1)

autOmato (446950) | more than 8 years ago | (#14014324)

The OCA is making available the images of scanned pages.

So, as the summary states:
make them available for Web searching
does not mean that there will be a complete text index available (that is full text search,) but instead you can only search for specific works?

If you'd like to help prepare public domain ebooks, visit Distributed Proofreaders and proofread a page a day (or more!).
I do that every once in a while on their German counterpart: GaGa

Re:Why not join the Gutenberg Project (3, Insightful)

flimnap (751001) | more than 8 years ago | (#14014354)

So, as the summary states:
make them available for Web searching
does not mean that there will be a complete text index available (that is full text search,) but instead you can only search for specific works?

That probably means that the search index will be uncorrected OCR, which leads to some inaccurate searches. The problem with using raw OCR is scannos, words that may be recognised as a different word that "looks" the same, for example modem and modern [google.com] , or an i might be recognised as a slash [google.com] .
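To make the scanno problem concrete, here is a rough Python sketch of how one might flag words that could be a misread of another real word. It is only an illustration: the `CONFUSIONS` list, the `scanno_candidates` helper, and the tiny vocabulary are all made up for this example and are not anything the OCA or PG is known to use.

```python
# Pairs of character sequences that look alike on a scanned page.
# This is a small hand-picked list for illustration, not a real OCR error model.
CONFUSIONS = [("rn", "m"), ("m", "rn"), ("i", "/"), ("/", "i"), ("l", "1"), ("1", "l")]


def scanno_candidates(word, vocabulary):
    """Return vocabulary words reachable from `word` by one look-alike swap."""
    found = set()
    for src, dst in CONFUSIONS:
        start = word.find(src)
        while start != -1:
            # Replace only this one occurrence of the confusable sequence.
            variant = word[:start] + dst + word[start + len(src):]
            if variant != word and variant in vocabulary:
                found.add(variant)
            start = word.find(src, start + 1)
    return sorted(found)


# Hypothetical vocabulary; a real checker would load a full word list.
vocab = {"modem", "modern", "this", "corn", "com"}

print(scanno_candidates("modem", vocab))   # ['modern']
print(scanno_candidates("modern", vocab))  # ['modem']
print(scanno_candidates("th/s", vocab))    # ['this']
```

A word that comes back with candidates isn't necessarily wrong; it is just worth a second look by a human proofreader, since both spellings are valid words and a plain spell checker would pass either one.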

I do that every once in a while on their German counterpart: GaGa

Your time might be better spent at the real Distributed Proofreaders [pgdp.net] , or DP-Europe [rastko.net] , since Projekt Gutenberg-DE is not an official branch of PG, and actually copyrights its output (unlike the real PG).

Re:Why not join the Gutenberg Project (1)

autOmato (446950) | more than 8 years ago | (#14015241)

Your time might be better spent at the real Distributed Proofreaders, or DP-Europe, since Projekt Gutenberg-DE is not an official branch of PG, and actually copyrights its output

Thanks. I hadn't paid attention to that detail. Damn it! You just can't trust people anymore. I feel raped. I'll go home now...

Re:Why not join the Gutenberg Project (2, Interesting)

TWooster (696270) | more than 8 years ago | (#14014296)

That's a good question, but I can't help but wonder if this is the miracle of capitalism at work. Right now we're in the eeeearly stages of this sort of thing, and the copyright laws, the mechanics, et al are still rather unexplored. Besides, I have to think -- the scanned images themselves are probably copyrighted by those who scanned, but chances are the plaintext isn't (considering they're copying it already, and not reinterpreting it). So the more people who want to scan whatever, the better, even if they overlap. Consider it error checking.

The real test and business opportunity comes in the distribution phase. The first person to have a huge library of old books, and contracts with publishing houses for new books (with "purchases" by the end users, and DRM encumbered, of course) is the person who will win the market and define the (capitalistic) best way to scan and distribute.

And come the semantic web, things get really interesting. Already we have tons of sites that do cross-referencing between academic papers -- at least, the citations, as well as categorization by topic. When we can start doing this for books based not only on genre, but topic or specific references to persons, or general concepts ("Book X mentions technology Y on page Z. Click here for link!")... well, things will become far more informative. I suspect that in this field, the information -- the texts -- may become free, but the computerized (and human-assisted) analyzation, linking, value-added stuff will be the new commodity. He who has the best algorithm wins.

I guess information has always wanted to be free, but the analysis of said information lies firmly in the realm of economics.

Re: Why not join the Gutenberg Project (0)

Anonymous Coward | more than 8 years ago | (#14014631)

No, the scans do not gain copyright because they are not creative works. See Bridgeman Art Library vs. Corel Corporation [wikipedia.org] .

Re:Why not join the Gutenberg Project (1, Informative)

Anonymous Coward | more than 8 years ago | (#14014361)

Actually, Project Gutenberg can be reached from Archive.org's Main Text Page [archive.org] along with some other cool sub-collections. I particularly like the Canadian Libraries. Once you install the DjVu extension you can view the scanned books. All old and out of copyright. Some have very nice illustrations that are now public domain, so they can be copied and used for other projects.

i spent 3 hours shading the glasses. (0)

Anonymous Coward | more than 8 years ago | (#14014249)

That Wall Street Journal article reads less like a report than it does a nuance-to-nuance account of the author's infatuation with Liz Ridolfo.

And what the hell is with that sketch? DAVID KESMODEL, MORE LIKE DAVIDEON -DYANMITE-!!!

Not google (1)

Mecdemort (930017) | more than 8 years ago | (#14014263)

From TFA, they are only scanning works that are out of copyright and in public domain, so this is not the same as what google is doing.

Oh, so wrong... (0)

Anonymous Coward | more than 8 years ago | (#14014455)

Google is doing a great amount of that in addition to the headline-grabbing bullshit that everyone harps about so much. [From TFReality]

Can only be a good thing (3, Interesting)

LordofEntropy (250334) | more than 8 years ago | (#14014270)

Getting written works off of paper and stored electronically should be a priority--bits are much easier to store, preserve, and copy for future use.

In Stanislaw Lem's science fiction book "Memoirs Found in a Bathtub", all the paper in the world gets eaten by a virus and chaos ensues. Interesting read if you've missed it, has made me paranoid about how much the world still depends on paper.

Re:Can only be a good thing (1)

Jason1729 (561790) | more than 8 years ago | (#14014352)

Getting written works off of paper and stored electronically should be a priority--bits are much easier to store, preserve, and copy for future use.

Preservation?

Do you really think your magnetic/optical/flash/etc storage will last as long as printed paper...even assuming you can find a CD reader in 50 years? Maybe you mean to recopy the data every few years, but if something gets lost for a few decades, it's lost for ever.

Re:Can only be a good thing (2, Informative)

aussie_a (778472) | more than 8 years ago | (#14014434)

Maybe you mean to recopy the data every few years,

That is called periodic storage, and for anything you wish to preserve, it is necessary. Your argument is a bit weak, considering that any information in book or electronic format needs to be recopied periodically. Books need it less often than electronic copies, however electronic copies are cheaper and easier to store, which offsets the costs.

The OP wasn't saying to burn the paper books after they're stored, merely to put them in electronic format ASAP because some of them might not be around for too long (funny how those books that are in danger of becoming extinct haven't been backed up in paper format, even though paper lasts for so long).
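For what it's worth, recopying is only useful if you check that the new copy actually matches the old one. Below is a minimal Python sketch of a verified recopy using a SHA-256 checksum. The function names `sha256_of` and `recopy_verified` are invented for this example, and the checksum step is my own assumption about how you'd do it, not something from the article.

```python
import hashlib
import shutil


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large scans don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recopy_verified(source, destination):
    """Copy source to destination and confirm the copy is bit-identical."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"copy of {source} does not match the original")
    # Returning the digest lets the caller record it for the next recopy cycle.
    return after
```

Keeping the returned digest alongside the file means the next migration can detect silent corruption in the old copy before it gets propagated to the new medium.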

Re:Can only be a good thing (0)

Anonymous Coward | more than 8 years ago | (#14014471)

a lot of people will copy it, make prints though. the copies might keep it alive

Hey there... (3, Funny)

stev3 (640425) | more than 8 years ago | (#14014272)

Why hello, Ms. Liz Ridolfo. I'm happy to see you are into computers (at least I'll tell myself that) and you like to put your pictures online.

Please email me at superdesperateteengeek@needtogetlaid.net

Re:Hey there... (2, Insightful)

prezkennedy.org (786501) | more than 8 years ago | (#14014495)

This had to be about as funny as the US PATRIOT Act.

No, I think that actually has a leg up on this comment.

Use machines instead of humans (1)

hadj (926126) | more than 8 years ago | (#14014273)

I have seen a video of how Google is adding books to its database: a huge machine which can automatically turn over a page and continue to scan very rapidly. I guess that these 100+ year-old books would need subtlety, but scanning all those pages manually is a pain in the ass.

Good Bad Ugly (4, Insightful)

mpapet (761907) | more than 8 years ago | (#14014285)

The good:
Old books prior to copyright laws are being scanned.

The bad:
Pay is roughly $10/hr. Now, I happen to be concerned that someone being paid so little should be handling rare books. Not to mention the college graduate getting paid so little.

The ugly:
The digital camera contraption costs $30,000!! There's a few scanner manufacturers left in the world and none of them have exploited this niche. Shame on them.

Re:Good Bad Ugly (2, Funny)

susano_otter (123650) | more than 8 years ago | (#14014300)

Now, I happen to be concerned that someone being paid so little should be handling rare books. Not to mention the college graduate getting paid so little.

May we assume that you will therefore be donating additional funds, up to the level of your concern or the amount you can afford (whichever is less)?

Re:Good Bad Ugly (1)

Dorm41Baggins (858984) | more than 8 years ago | (#14014375)

Pay is roughly $10/hr. Now, I happen to be concerned that someone being paid so little should be handling rare books.

I would tend to think this is a good thing. It means that the people doing it aren't necessarily in it for the money. Being paid by the hour also gives them an incentive to take their time about it. ;)

As long as the people hired are screened for at least a medium-high level of respect for old books, I don't see a problem here.

Re:Good Bad Ugly (2, Informative)

rm999 (775449) | more than 8 years ago | (#14014462)

lets look at the average PHD student:
20000 dollars, 40-50 weeks a year, 40-50 hours a week

yep, that's 10 dollars an hour...

Does that mean all the PHD students should be kicked out of their labs and shouldn't be able to handle expensive books?
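The arithmetic behind that figure can be checked with a quick sketch. It assumes only the numbers quoted above (a $20,000 stipend, 40-50 weeks a year, 40-50 hours a week); the function and variable names are made up for illustration.

```python
# Hourly wage implied by a yearly stipend, using the ranges quoted above.
STIPEND = 20_000   # dollars per year (figure from the comment)
WEEKS = (40, 50)   # weeks worked per year
HOURS = (40, 50)   # hours worked per week


def hourly_rate(stipend, weeks, hours_per_week):
    """Dollars per hour for a given yearly stipend and workload."""
    return stipend / (weeks * hours_per_week)


low = hourly_rate(STIPEND, max(WEEKS), max(HOURS))   # heaviest workload
high = hourly_rate(STIPEND, min(WEEKS), min(HOURS))  # lightest workload
print(f"${low:.2f} to ${high:.2f} per hour")
```

So the "10 dollars an hour" figure sits roughly in the middle of the range those numbers imply.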

Re:Good Bad Ugly (0)

Anonymous Coward | more than 8 years ago | (#14016462)

lets look at the average PHD student:
20000 dollars, 40-50 weeks a year, 40-50 hours a week

Very funny.

Re:Good Bad Ugly (3, Informative)

Dave114 (168228) | more than 8 years ago | (#14014605)

There's a few scanner manufacturers left in the world and none of them have exploited this niche.

Actually, you can buy a robotic book scanner [kirtas-tech.com] (there's a demo video of it). No doubt it costs an arm and a leg although it may be worth it if you're scanning a large enough volume of books.

Re:Good Bad Ugly (1)

Kadin2048 (468275) | more than 8 years ago | (#14016707)

Actually, as this is being done in association with university libraries, I think they shouldn't have any problems getting reliable help at $10USD an hour, because that's significantly more than a lot of other on-campus jobs pay. I know from personal experience that many of the students that get paid to videotape campus events and have access to thousands of dollars of semi-pro videography gear are only paid $8-10 an hour. Same for stage electricians, scene shop carpenters and painters, and audio technicians. You'd be pretty amazed at the level of responsibility that people are given, despite being paid less than a typical mouth-breathing fast food restaurant zombie.

If they're going to hire college students, I think the problem is more about finding people who are interested in the work and not in it just to sit and listen to their iPod while making some beer bucks. That means that the pay should be consistent with the skill level relative to other campus jobs, but not much more than that. (Or at least don't advertise it; use the salary as a method of retention, not recruitment.)

Outsouce this to India (0)

Anonymous Coward | more than 8 years ago | (#14014286)

10$ per hour is too much here. You can take 2$ as the commission per labour hour. Since the people are very hard working they'll work 20 hours a day, making it a huge 8*20 = 160$ per day. In one month one can make 160*30 = 4800$. In India this is huge money. Even a top-class manager doesn't get this much salary per month.

no. (1)

queef_latina (847562) | more than 8 years ago | (#14014412)

I REFUSE to use anything that's been touched by an indian(fucking *GROSS*). It doesn't matter if the thing in question is a digital scanned copy. Even though it's just ones and zeroes, it's still disgusting.

Scanning OLD books... (0)

Anonymous Coward | more than 8 years ago | (#14014287)

...is not a NOVEL idea.

*Ducks*

Scanner: I want. (2, Interesting)

sakusha (441986) | more than 8 years ago | (#14014298)

Wow, that book scanner rig is just what I've been dreaming of for years. I've been thinking about mounting a couple of glass plates at a 90 degree angle, and then I could put the open book on the apex of the glass, then photograph it with a couple of cameras underneath. This rig is just exactly what I was thinking of, but upside down and even cleverer, with a footpedal to lift the glass up and down onto the book. A very nice piece of design work.
The obvious advantage of this rig is that you don't have to open the spine 180 degrees and smash the books flat onto a single glass plane; you don't have to open the book up more than 90 degrees, so it's gentle on the spine of fragile old books. And the glass wedge is always self-centering against the spine of the book. The only way this scheme could work better is if there was a way to turn the pages automatically. But these are old and presumably valuable works, so it's safer to let paid low-wage drones do the work than risk mechanical damage.

Book Scanners (2, Interesting)

jab (9153) | more than 8 years ago | (#14014425)

Here's a list of book scanning equipment [harvard.edu] . I've seen the one from Kirtas in action, it's fun to watch.

Hmm... (0, Redundant)

ToasterofDOOM (878240) | more than 8 years ago | (#14014310)

... hey guys, I think I heard that some people [google.com] were doing this already. Maybe even another group [opencontentalliance.org] too. I think I heard it on slashdot.

Re: Hmm... (1)

Celsius 233 (913263) | more than 8 years ago | (#14014344)

Competition is good.

No, wait...

More people doing a good thing is good.

Re:Hmm... (1)

Matt Perry (793115) | more than 8 years ago | (#14014454)

That second link is who this article is talking about.

Full text for copyright lapsed works? (1)

mattr (78516) | more than 8 years ago | (#14014312)

Will it automatically provide full text or scanned image files for works that have gone out of copyright? And do the restrictions against scanning, storage or reproduction also lapse when copyright lapses? This would be massive. Lots of publishers just reissue old work with new copyrights attached to them.

Personally I've read lots of old science fiction from copyright lapsed works, there is some in Gutenberg, and like it quite a bit, though I'd like to find more of them.

For example I'm looking for Perry Rhodan (anything past #128) in English, which is out of print and maybe in old book stores or garage sales though I'm not in the U.S. now.

Web searching is fine but the most important part is to be able to get the works digitized. Then make freely available what is not in copyright, and make it easy to purchase what is. I bet you'll see publishers rushing towards that when they start seeing dollars rolling in.
 

Re:Full text for copyright lapsed works? (1)

aussie_a (778472) | more than 8 years ago | (#14014460)


Will it automatically provide full text or scanned image files for works that have gone out of copyright?


If by "automatic" you mean after it's been scanned by someone, the images processed, placed onto the server and put into the system, then yes, it will automatically provide the scanned image files.

And do the restrictions against scanning, storage or reproduction also lapse when copyright lapses?

Yes, because it becomes a public domain work. You can do anything (from publishing it unchanged, creating your own version of the tale, writing a sequel, etc) you want to it.

This would be massive.

Not as massive as you seem to think. Project Gutenberg (google it) has been doing this for quite some time now.

Lots of publishers just reissue old work with new copyrights attached to them.

I'm fairly certain that doing so doesn't mean that you have to respect that copyright. I know it's done, but I don't think that "Copyright 2005. All rights reserved" is legal, because they have no rights TO reserve. I believe that for the publisher to claim a new copyright on a public domain work, they have to make sufficient changes. I believe this can be done by having someone edit it (enough that it is classed as having enough original content to be a new work), which I believe is done on a fairly common basis.


For example I'm looking for Perry Rhodan (anything past #128) in English, which is out of print and maybe in old book stores or garage sales though I'm not in the U.S. now.


Then this project won't help you. From what I can see his work is still copyrighted.

Web searching is fine but the most important part is to be able to get the works digitized. Then make freely available what is not in copyright, and make it easy to purchase what is.

Aaah, Internet Archive can't scan in books (even out of print books) and then sell them, if the book is still in copyright. There is no article (that I can see) to read, but I doubt they'll be scanning in copyrighted books (which I believe they can do technically) and then sitting on the scan until the copyright passes and then release it.

I bet you'll see publishers rushing towards that when they start seeing dollars rolling in.

Electronic books aren't anything new (which is what these are, in a way). They've been around for some time now, and the major publishers aren't jumping on it quite as much as you seem to think. It's still a niche, although it is steadily growing. But there hasn't been any major move to the format yet.

Kirtas automatic book scanner (1)

AlexOsadzinski (221254) | more than 8 years ago | (#14014345)

It seems a pity to use such a manual method. This... http://www.kirtas-tech.com/ [kirtas-tech.com] is designed to scan books, especially old and fragile books, automatically. It handles the pages even more gently than a trained person. It's not cheap, but it does around 1,000 pages per hour, and the operator just loads books in and takes them out when they're done. I looked at the company a couple of years ago (I'm a VC) and get regular updates from them. A LOT of libraries are using them now.

Manual seems safer to me.... (2, Interesting)

fantomas (94850) | more than 8 years ago | (#14014861)

"It seems a pity to use such a manual method"


Interesting - I don't understand your line of thinking - interested to hear more. Is the argument that automated page turning is *cheaper* so it's a pity that the project spends a lot on labour charges (manual scanning)? Or is the argument that the automated page turning is easier on the fragile old books? I'd appreciate if you could offer more details about the technology - the company's demo video shows a vacuum device lifting pages, but both examples are with modern books. Honest question: surely the advantage here is a low labour cost method of scanning huge numbers of pages (like the telephone directory example they show). But if you have fragile books, surely the advantage of a human is that they can see that individual pages might be particularly fragile, maybe even needing support or repair to scan, while the pre-set vacuum device will plough on regardless, it won't be able to make a decision on the quality of the pages. Does it have any sensing devices built in? My experience of older books (e.g. nineteenth century) is that in some cases the paper can be very brittle.

Re:Manual seems safer to me.... (1)

AlexOsadzinski (221254) | more than 8 years ago | (#14015767)

You make good points. My argument is that the Kirtas scanner is cheaper, because it's at least twice as fast as a human operator, and because one operator can run several machines at once. When I spent time with the company, I was impressed at just how gently it handled pages: it fans a page out using gentle puffs of air, and lifts pages using a large-area (dry) sponge. But, you're right: if a book is really fragile, or on the point of disintegration, a manual approach is better.

Printer Friendly Link (1)

TubeSteak (669689) | more than 8 years ago | (#14014348)

My ad-blocking software wouldn't let me at the page, which confused me. So I disabled it.

http://online.wsj.com/public/article_print/SB113111987803688478-VNpw62xi_JA4avE8cxOZf0pf_nM_20061109.html [wsj.com]

And of course, direct linkage to the picture of the girl.
Because that's the only reason 90% of you would click on the link anyways

The Girl Has Nice Shoes [wsj.com]



As an aside, cigarettes + old books = bad
"This book almost killed me," Ms. Ridolfo said to her boss, Gabe Juszel, who was preoccupied with a stack of books and didn't reply. Then she walked outside for a cigarette break, pausing along the way to rub her neck."

Not to mention that girls + cigarette smell = not terribly attractive
(but that's just my opinion)

Re:Printer Friendly Link (1)

joelsanda (619660) | more than 8 years ago | (#14014365)

Not to mention that girls + cigarette smell = not terribly attractive

Unless the 'girl' is Lauren Bacall [ntropie.de] .

Re:Printer Friendly Link (0)

Anonymous Coward | more than 8 years ago | (#14016532)

Mmmm... cute mousy bookworm smoker girls = HOT. There are actually people here who disagree? Smokers, like geeks, make better lovers. I have collected substantial empirical data on this matter.

Libraries? (1)

mofomojo (810520) | more than 8 years ago | (#14014405)

What's the legality difference between this and say regular libraries? Don't regular libraries loan material freely? What changes when it becomes electronic, it just means that the people will be able to keep them for longer or as long as they want, no? IMHO, I like the idea of doing this. It'll make doing books for school much easier knowing that there's a backup copy of it floating around somewhere on the interweb.

Re:Libraries? (1)

aussie_a (778472) | more than 8 years ago | (#14014464)

For one, copyright laws tend to have a special clause inside them for public libraries. This group hasn't been classified as a public library ;) Another point is that libraries (as a rule) don't photocopy books and then give them away to people indefinitely. Instead they legally buy copies of a book, which they lend out for a finite period of time.

Libraries (2, Interesting)

andrewburt (856855) | more than 8 years ago | (#14016154)

Borrowing from a library or reading in a bookstore are hugely different, for these reasons:

(1) The library paid for the copy you're borrowing. (Or somebody paid for it, in case the book was donated to the library.) Thus the author was paid for that copy. If you read a whole copyrighted book via a Content Display Site (CDS - Google Print, Amazon Search Inside, etc.) and never buy the book, the author wasn't paid. Copyright law is about creating new copies; you're not creating a new copy when you read in a store or from a library.

(2) Browsing in a bookstore is pretty inconvenient. You can't take the copy with you to look at any time you want. (Unless you buy it! That's sort of the point.) Bookstores know that few people really read entire books in the store -- else they'd go out of business. However, reading a book from a CDS doesn't have that limitation: You can take it with you, on your laptop, etc. This is particularly critical in light of digital paper, when the digital copy is the paper copy.

(3) Library and bookstore reading isn't anywhere near free: You have to move your physical body to the bookstore to read. For one thing, you can't likely do that at 3am. (And certainly not in your pajamas.) You can't do it from your bed, couch, or desk, without getting up. You have to spend time to move your body down there, which might be 10min-30min each way; 20-60min round trip, plus say 10min to find the book, a place to sit, etc; call it 30-70min. If you value your time at say, $10/hr, that's $5-12. Then there's the cost of transportation. If the library/bookstore is three miles away, 6mi. round trip, and gas costs $2.50/gal., and you get 20mi/gal., that's another $.75. The IRS figures driving a car costs $.405/mile in repairs, wearing it out, etc., so that's another $2.40. So you're at something like $8-15 to go read a "free" book.

Really -- if it were that free, people would do a lot more of it.

Yet reading a free copy from a CDS doesn't have those limitations. It is much closer to $0, actually and truly free. THAT's the problem.

(4) You can't pass on a "free" copy you read in the store or from the library. You have to leave the book at the bookstore (or buy it); you have to return the book to the library. Reading a book in digital form that was stolen from a CDS, you could pass that copy on to others by email, via a web page, P2P software, etc.

So, bottom line, bookstore/library reading isn't really free. CDS copies are essentially free, and that's the problem. They're too convenient to read free.
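For anyone who wants to check the trip-cost estimate in point (3), here is a minimal sketch of the same arithmetic. It uses only the figures given there ($10/hr for time, a 6-mile round trip, $2.50/gal. gas at 20 mi/gal., and $.405/mile for wear); the function name and parameter names are just illustrative.

```python
def trip_cost(minutes, wage=10.0, miles=6.0, gas_price=2.50,
              mpg=20.0, wear_per_mile=0.405):
    """Estimated dollar cost of one round trip to read a "free" book."""
    time_cost = minutes / 60 * wage    # value of the time spent
    gas_cost = miles / mpg * gas_price # fuel burned on the round trip
    wear_cost = miles * wear_per_mile  # per-mile figure from the comment
    return time_cost + gas_cost + wear_cost


# Shortest and longest trips from point (3): 30 and 70 minutes.
for minutes in (30, 70):
    print(f"{minutes} min: ${trip_cost(minutes):.2f}")
```

That works out to about $8.18 for the short trip and $14.85 for the long one, which matches the "$8-15" range above.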

This is one of the reasons we formed the COCOA Association ( http://www.copyrightaccess.com/ [copyrightaccess.com] ), to make more copyrighted work available. (Note, COCOA does not inhibit indexing and searching and returning text snippet search results -- just what page images can be displayed.) If you support this, please sign our petition at http://www.petitiononline.com/cocoa/petition.html [petitiononline.com] -- thanks!

Dr. Andrew Burt,
Chair, The COCOA Association

RTFA? (0, Offtopic)

kennygraham (894697) | more than 8 years ago | (#14014416)

I'd RTFA if the black text didn't overlap a black image. IE-only web designers should be shot.

Re:RTFA? (1)

fimbulvetr (598306) | more than 8 years ago | (#14014475)

WTF?
Why is this off topic? Seems perfectly reasonable to me. I actually had the same thought.

Re:RTFA? (3, Interesting)

commbat (50622) | more than 8 years ago | (#14015371)

I'd RTFA if the black text didn't overlap a black image. IE-only web designers should be shot.

This is when the 'remove this object' firefox extension [mozilla.org] comes in handy. Just remove the image and the text is readable. 'Undo last remove' to get the image back.

I don't think you should have been modded down.

How can I help? (1, Interesting)

Anonymous Coward | more than 8 years ago | (#14014442)

How can I help? I'm willing to give a couple of hours a week, I don't have a scanner, but I'm willing to type...if this is truly "open", I will be more than willing to contribute my time.

Re:How can I help? (2, Informative)

jnik (1733) | more than 8 years ago | (#14015809)

How can I help? I'm willing to give a couple of hours a week, I don't have a scanner, but I'm willing to type...if this is truly "open", I will be more than willing to contribute my time.

As a few others have mentioned, jump in to Distributed Proofreaders [pgdp.net] . We take the raw images (either scanned specifically for DP or taken from scanning projects like this) and produce checked, corrected text, which then goes to Project Gutenberg [gutenberg.org] . A few hours a week can help a lot.

thought process (0)

Anonymous Coward | more than 8 years ago | (#14014478)

slashdotters thought process..."reading slashdot -> Liz? -> that is a female name, if I'm not mistaken. *clicks the link*"

$10/hr for scanning books? (1)

mumblestheclown (569987) | more than 8 years ago | (#14014489)

$10/hr is crazy for scanning books.

Send the books to India or Eastern Europe to be scanned for a fraction of the price. I mean really. This seems to be a serious operation - why not maximize the use of available resources? Spending $10/hr on scanning is just dumb.

Re:$10/hr for scanning books? (1)

Impeesa (763920) | more than 8 years ago | (#14014569)

You'd send hundred-year-old books overseas to be handled by extremely cheap (and poor) labourers? Are you operating under the assumption that once they're scanned, you won't need the originals any more?

Re:$10/hr for scanning books? (1)

mumblestheclown (569987) | more than 8 years ago | (#14014593)

No, I'm operating under the assumption that it's far less expensive to do the transfer and pay for the necessary oversight, given the number of books involved. Those hourly costs very quickly get very expensive.

I am not saying send them to a random person with a scanner. However, this can be done competently.

Re:$10/hr for scanning books? (1)

narcc (412956) | more than 8 years ago | (#14014627)

$10/hr isn't very much (~20k a year). Why risk sending old (possibly valuable) books overseas to be scanned by unskilled cheap foreign labor when you can have it done under your supervision locally while employing local people. Not spending $10/hr on scanning is just dumb.

Re:$10/hr for scanning books? (1)

trollable (928694) | more than 8 years ago | (#14015138)

Why risk sending old books overseas to be scanned by unskilled cheap foreign labor

You're right, there is plenty of skilled US citizens that work for $10/hr.

Over a century old... (3, Funny)

Cow Jones (615566) | more than 8 years ago | (#14014578)


employees making about $10 an hour to manually scan volumes -- some more than a century old

I think that if they hired younger people to scan the books, it might go a little faster.
Imagine a 100 year old at this job...

"...(mumble mumble) in my day we used priests to copy books (mumble mumble) oh dear, I tore another page, darn Parkinson (mumble mumble)"

When I read "human-powered"... (1)

GroeFaZ (850443) | more than 8 years ago | (#14014635)

I can't help but think of midgets in a running wheel. Is that an improvement over a "hamster-powered" book project?

Darn expensive, machines win (1)

MadMirko (231667) | more than 8 years ago | (#14014689)

10$ per hour for the humans, tens of thousands for the scanners. Damn you machine-overlords!

On the other hand, the whole project is funded by Microsoft and Yahoo, which creates the usual good (open content!) / evil (paid for with the devil's money!) dilemma. ...

That's enough coffee for me, I suppose...

Scanning with precision is difficult (2, Informative)

Xoc-S (645831) | more than 8 years ago | (#14014995)

The (Jack) Vance Integral Edition [vanceintegral.com] was a volunteer effort to produce a limited edition 42 volume set of the complete works of Jack Vance, restored to as close to the author's original manuscripts as possible.

(The project is complete, and an amazing success.)

The team scanned and edited many of Jack's early works for which there was no good clean manuscript. They developed software tools that would compare scans from different editions to automatically find errors. It turns out that even the best human editor still missed "scanos" (typos produced by the scanning process) that the automated tools found.

Even so, in the final books there were a handful of errors that slipped through, despite extremely careful editing by hundreds of volunteers.
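To give a rough idea of how comparing scans of two editions can surface scanos, here is a minimal sketch. It is not the project's actual tooling (which isn't described here); it assumes plain-text OCR output and uses Python's standard `difflib` to line up the words and report where the two scans disagree. The function name and the sample sentences are made up for illustration.

```python
import difflib


def find_disagreements(scan_a: str, scan_b: str):
    """Return (words_from_a, words_from_b) pairs where two scans differ."""
    a, b = scan_a.split(), scan_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace, insert, or delete
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


# Hypothetical OCR output from two editions of the same sentence.
edition_1 = "The modern world is full of clever machines"
edition_2 = "The modem world is full of clever machines"
print(find_disagreements(edition_1, edition_2))
```

Each reported pair still needs a human to decide which reading is right (or whether the editions genuinely differ), but it narrows the search to the places where the scans disagree.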

"some more than a century old..." (0, Redundant)

whitroth (9367) | more than 8 years ago | (#14015472)

I do hope they're not duplicating efforts... and I wonder whether they even know about Project Gutenberg. http://www.promo.net/pg/

          mark

$11 - $12 per hour (1)

brewsterkahle (635187) | more than 8 years ago | (#14016092)

At the University of Toronto the Internet Archive pays $12 Canadian / hour to the scanners, and $11-$12 American in the US. The exchange rates keep changing so judging Canadian pay by US translations is a bit confusing. With experience the Archive will adapt as well, but the Archive is interested in maintaining a reasonable wage while keeping the overall cost cheaper than most commercial offerings. The reason for that is to encourage the open nature that the Archive supports.

What would be the equivalent local rate for scanners in Europe?

-brewster
Digital Librarian

Re:$11 - $12 per hour (2, Funny)

Kadin2048 (468275) | more than 8 years ago | (#14016781)

What would be the equivalent local rate for scanners in Europe?

Probably about $35 an hour, they'd only work seven hours, three days a week, and they'd be on strike half the year anyway. And you can't fire any of them. ;-)

Sh*t jobs (0)

Anonymous Coward | more than 8 years ago | (#14016102)

The book included several copies of handwritten letters by authors, that folded out from the pages and were difficult to photograph. "This book almost killed me," Ms. Ridolfo said to her boss, Gabe Juszel, who was preoccupied with a stack of books and didn't reply. Then she walked outside for a cigarette break, pausing along the way to rub her neck.

Wow... at CA $12.00/hour I can imagine her life must be quite difficult. I grit my teeth at how I could work an entire week just to see it go "poof" at a 15-minute doctor's consultation.

Urgh... Sounds cruel... (1)

ShyGuy91284 (701108) | more than 8 years ago | (#14016354)

I remember being put in charge (still not sure how it happened, but it did) of my HS senior class's slideshow thing for the end of the year banquet, and everyone brought in 2 or 3 pictures to be scanned for it..... It wasn't fun........ But then again, for $10/hr, it couldn't have been any worse than most other crappy jobs....