Why Project Gutenberg Isn't There Yet

Posted by timothy on Tuesday January 28, 2003 @09:17PM from the because-there-is-a-long-way-from-here dept.

option8 writes "This Wired article ('Any Text. Anytime. Anywhere. (Any Volunteers?)') goes into good detail on why Project Gutenberg and similar efforts are far from creating a complete, free electronic library. A quote: 'The mechanics of a universal library are simple. The tricky part: harnessing the free labor.' Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (faster to read a book aloud than to transcribe it correctly)."

This discussion has been archived. No new comments can be posted.

Speech recognition? (Score:4, Interesting)

by timeOday ( 582209 ) writes: on Tuesday January 28, 2003 @09:22PM (#5178501)

That's crazy. OCR will always be faster than speech, even if speech recognition ever works, which it currently does not.

- The parent is "interesting"? (Score:5, Informative)
  
  by thac0 ( 644918 ) writes: on Tuesday January 28, 2003 @09:48PM (#5178684)
  
  The article didn't say that OCR was faster than speech; it said that speech was faster than transcribing it.
  
  Come on, mods, read more carefully.
  
  - Re:The parent is "interesting"? (Score:4, Insightful)
    
    by timeOday ( 582209 ) writes: on Tuesday January 28, 2003 @10:45PM (#5178966)
    
    So what? Rowing across the ocean is faster than swimming. Most of us still fly.
    Sure, for the best scanning speed you have to cut the binding off and use a sheet feeder. But even scanning 2 pages at a time will be far faster than reading the whole thing out loud.
    So what is your point?
    
  - Re:The parent is "interesting"? (Score:5, Informative)
    
    by nautical9 ( 469723 ) writes: on Tuesday January 28, 2003 @10:58PM (#5179029) Homepage
    
    Depending on the typist, I can't see reading a book out loud as being any faster than transcribing it - especially considering that the speech recognition software is unlikely to do the proper punctuation, paragraph breaks, people & place names, and general capitalization, so proofing the results would take a considerable amount of time.
    But as the GP said - a moot point since OCR'ing it and proofreading/fixing minor typos would be far quicker than either.
    
- speech recog works (Score:3, Offtopic)
  
  by SHEENmaster ( 581283 ) writes:
  
  for short command sets. Mac OS X has excellent speech recognition for example. What we are lacking is a way to differentiate a larger vocabulary.
  
  I can see PG's next release now:
  Welcome to the audiotape version of.......
  - Re:speech recog works (Score:3, Funny)
    
    by Anonymous Coward writes:
    
    so, speech recognition works, it just can't recognise much speech.
    
    Just that one feature left to add then.
  - - Re:speech recog works (Score:5, Funny)
      
      by einer ( 459199 ) writes: on Wednesday January 29, 2003 @12:22AM (#5179538) Journal
      
      Best title for any paper, book or article on the subject: How to wreck a nice beach.
      
Cost of labor? (Score:3, Insightful)

by Anonymous Coward writes: on Tuesday January 28, 2003 @09:24PM (#5178520)

What about the cost of the books? Unless the only books you have in this "universal library" are old enough to be without copyrights, won't there be a problem in finding funding to buy current day books?

- Re:Cost of labor? (Score:5, Informative)
  
  by jonman_d ( 465049 ) writes: <nemilar&optonline,net> on Tuesday January 28, 2003 @09:29PM (#5178551) Homepage Journal
  
  That's pretty much it - most of the books are in the public domain. AFAIK, the rest are all donated by their authors.
  
  From their FAQ:
  
  What books will I find in Project Gutenberg?
  
  We cannot publish any texts still in copyright. This generally means that our texts are taken from books published pre-1923. (It's more complicated than that, as our Copyright Page explains, but 1923 is a good first rule-of-thumb for the U.S.A.)
  
  So you won't find the latest bestsellers or modern computer books here. You will find the classic books from the start of this century and previous centuries, from authors like Shakespeare, Poe, Dante, as well as well-loved favorites like the Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan and Mars books of Edgar Rice Burroughs, Alice's adventures in Wonderland as told by Lewis Carroll, and thousands of others.
  
  These books are chosen by our volunteers. Simply, a volunteer decides that a certain book should be in the archives, obtains the book and does the work necessary to turn it into an e-text. If you're interested in volunteering, click here.
  
  - Re:Cost of labor? (Score:5, Informative)
    
    by GammaTau ( 636807 ) writes: <jni@iki.fi> on Tuesday January 28, 2003 @10:18PM (#5178857) Homepage Journal
    
    Additionally, translations might create practical limitations. If a text was written in ancient Greece and translated into English or some other language in the 20th century, the translation might not be public domain even when the original work is. Of course, you are free to read the original text or make a new translation. In short, even if a piece of literature is public domain, the translation into your native language might not be.
    
- Re:Cost of labor? (Score:2, Insightful)
  
  by Acidic_Diarrhea ( 641390 ) writes:
  
  That's exactly it. Check out their website. All the works they currently have and all the ones they want to get are public domain. So it's a big project but one that we can eventually finish since the age of intellectual property that never expires is upon us. Today's books won't ever be in the public domain if the current trend continues.
Librarians? (Score:5, Interesting)

by Metallic Matty ( 579124 ) writes: on Tuesday January 28, 2003 @09:24PM (#5178526)

I'm not too informed about this topic; feel free to correct me.

If the goal is a universal library, and there is a need for a work force, wouldn't it make sense to start a program at the library level that uses librarians as a volunteer work force, perhaps as a side project they might be interested in helping along? I think of it as SETI in the library world.. *shrug*

- Re:Librarians? (Score:2, Interesting)
  
  by qortra ( 591818 ) writes:
  
  That's a pretty good idea. If each public American library (and perhaps those of other nations for other languages) were to commit about 5 books to be typed by its volunteers and staff each year (a reasonable amount), the project could really take off. Estimate 5000 participants (conservative); 25000 books a year.
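
The estimate in the comment above is plain multiplication. A minimal Python sketch, using only the commenter's own guesses (5,000 participating libraries, 5 books each per year); the variable and function names are made up for illustration:

```python
# Figures are the commenter's guesses, not measured data.
participating_libraries = 5_000
books_per_library_per_year = 5


def books_per_year(libraries: int, books_each: int) -> int:
    """Total titles typed per year if every library meets its commitment."""
    return libraries * books_each


total = books_per_year(participating_libraries, books_per_library_per_year)
print(f"{total:,} books a year")
```

Changing either guess shows how sensitive the total is to the participation rate.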
- Distributed Proofreaders (Score:5, Informative)
  
  by Amata ( 554796 ) writes: on Wednesday January 29, 2003 @01:34AM (#5179927)
  
  I just found this site a few days ago. Essentially, volunteers can proofread one page at a time, so that huge time commitments of doing an entire book yourself are not required. Worth checking out.
  
  http://texts01.archive.org/dp/ [archive.org]
  
The REAL Problem (Score:2, Insightful)

by echucker ( 570962 ) writes:

So many of the things that people want to read are copyrighted, and won't be available until long after we're dead.
- Re:The REAL Problem (Score:2, Insightful)
  
  by Lenbok ( 22992 ) writes:
  
  If they become available at all, given the current copyright extension precedents.
- Re:The REAL Problem (Score:2, Informative)
  
  by IvyMike ( 178408 ) writes:
  
  No, the real REAL problem is that because of Disney [findlaw.com], copyright lengths keep getting extended and extended. At the current rate, Mickey Mouse will never be public domain. This is actually unconstitutional, since Congress is enabled to grant exclusive rights for "limited times" only. But it's the way things are.
  - Re:The REAL Problem (Score:5, Insightful)
    
    by Anonymous Coward writes: on Tuesday January 28, 2003 @10:04PM (#5178784)
    
    Mickey Mouse will never be public domain because MICKEY MOUSE IS A TRADEMARK/LOGO. That would be like forcing IBM to give up their IBM logo/colors/design.
    
    However, *Copyrighted* works should eventually go into public domain. The point is that after you are dead, anything - be it a movie, song, cartoon, book, poem --- whatever --- serves a greater good to mankind than it could to its dead creator. I think that a decade or two is too short of a limit for copyright. If I write a book when I'm 20 years old, I should still be allowed to make money off the sale of that book when I'm 40. But when I'm in the grave, it serves me no use.
    
    Now, it could be said that a person who works hard to create pieces of work like movies or books or songs should be allowed to bestow the revenue from use of that material after the original author is dead. If I write a book that still sells well 20 years after my death, my son and daughter should be allowed to benefit from this copyrighted item in my 'estate'.
    
    But I think that indefinite extensions are ridiculous. I would say that 100 years is bordering on ridiculous. I think that 75 years is reasonable. If I create something when I'm 25, the copyright will outlive me by as much as 25 years.
    
    In fact, I would propose that copyright should be extended to the life of the creator plus 20 years **OR** 50 years. Whichever is less (so if you die two years after the copyright, the copyright is still in effect for another 20 years).
    
    - Re:The REAL Problem (Score:4, Insightful)
      
      by L. J. Beauregard ( 111334 ) writes: on Wednesday January 29, 2003 @12:30AM (#5179586)
      
      There is some good in letting a copyright extend beyond the author's death. An author may die with children still not yet grown, and his royalties can provide for them. Life plus 20 or maybe 25 should be enough for this.
      Some posthumous works may come out under a life-plus-X term that might have been cast aside under a life-plus-zero term. Life plus 50 is probably more than enough.
      Life plus 70 is absurd and our so-called elected officials should be ashamed of going along with it. And may Sonny Bono *not* rest in peace.
      
  - Re:The REAL Problem (Score:3, Informative)
    
    by Thomas M Hughes ( 463951 ) writes:
    
    This is actually unconstitutional, since Congress is enabled to grant exclusive rights for "limited times" only.
    
    As much as I wish you were right, you're actually wrong on this. The Supreme Court ruled on the case and found that what Congress did was constitutional, and since the Constitution grants the Supreme Court the right to interpret the Constitution, it is constitutional to do so. Unless the Supreme Court changes its ruling at a future date, or Congress amends the Constitution to make it unconstitutional, the extension remains constitutional, as unfortunate as that is.
- I read lots of stuff off there! (Score:5, Interesting)
  
  by neurostar ( 578917 ) writes: <neurostarNO@SPAMprivon.com> on Tuesday January 28, 2003 @11:39PM (#5179254)
  
  ...things that people want to read are copyrighted, and won't be available until long after we're dead.
  
  Actually, I've found the most value from the project is downloading and reading classics. I've downloaded works by people such as Adam Smith, Nietzsche, Aristotle, Plato, Karl Marx, Oscar Wilde, Thomas More, and various other classic writers. I've found this resource indispensable. It provides high quality texts for free. I probably wouldn't read many works by these authors if I had to purchase them. Unfortunately, I don't have the money to spend on many small works such as these (they're short, but sometimes cost $10-15). I also don't have easy access to a library and I like keeping a copy for my own personal use.
  
  So I find that Project Gutenberg is a very useful resource.
  neurostar
  
If only (Score:4, Interesting)

by Cookeisparanoid ( 178680 ) writes: on Tuesday January 28, 2003 @09:31PM (#5178566) Homepage

It would make life so much easier if I could search an online library rather than searching the library index. Just think how much space we could save as well, rather than shelves full of books that are basically dead weight 95% of the time.
I think copyright's got to be the biggest hurdle; publishing houses aren't easily going to be persuaded to put, oh, say, the next Harry Potter book online for free and risk losing millions.

If you do want to help (Score:5, Informative)

by Anonymous Coward writes: on Tuesday January 28, 2003 @09:31PM (#5178568)

Distributed Proofreaders [archive.org]. Recently discussed [slashdot.org] on /. as well.

- Re:If you do want to help (Score:3, Interesting)
  
  by adamjaskie ( 310474 ) writes:
  
  Why not modify that in such a way as to have available a scanned image of a single page of the book, along with an empty box to enter text? That way, people could work on ONE page at a time, while others work on other pages. A single book could be typed in by 547 different people, each typing up one page.
  - That's part of what DP does (Score:5, Informative)
    
    by smiff ( 578693 ) writes: on Tuesday January 28, 2003 @10:29PM (#5178901)
    
    Why not modify that in such a way as to have available a scanned image of a single page of the book, along with an empty box to enter text?
    That's basically what Distributed Proofreaders does. Except they OCR the book first, so the proofreaders just need to fix the OCR errors. Every page goes through two passes. Then the entire book goes into post-processing, where a single person puts all the pages together and checks for problems that the proofers didn't know how to solve (marked with an asterisk). Once Distributed Proofreaders finishes the book, they pass it on to Project Gutenberg, where somebody reviews the whole text again.
    Distributed Proofreaders currently has a problem. After the previous Slashdot announcement, they were overwhelmed with volunteers. The volunteers processed books so fast, they were running out of material to work on. Three or four people scan in most of the books. They have been slaving away trying to keep up with the proofers.
    Distributed Proofreaders is also working on a standard to mark up the books to better preserve tables, illustrations, bold text, math, etc. I suspect that effort is being slowed due to the priority of keeping material on the site.
    
    - Re:That's part of what DP does (Score:4, Informative)
      
      by kalidasa ( 577403 ) writes: on Wednesday January 29, 2003 @08:44AM (#5180791) Journal
      
      Distributed Proofreaders is also working on a standard to mark up the books to better preserve tables, illustrations, bold text, math, etc. I suspect that effort is being slowed due to the priority of keeping material on the site.
      
      Three Little Letters:
      
      T E I [tei-c.org]
      
      TEI is to literature as DocBook is to documentation.
      
  - - Re:If you do want to help (Score:4, Informative)
      
      by adamjaskie ( 310474 ) writes: on Tuesday January 28, 2003 @10:38PM (#5178938) Homepage
      
      They give you an image of the scanned page, along with the OCR'd text. I just looked closer, and did a few pages as well. It's pretty easy. Took me about 5-10 minutes/page. I had to remove a few end-of-line hyphenations, fix an OCR-mangled word, and replace single hyphens with double hyphens for em dashes a few times.
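
Two of the fixes described in the comment above (removing end-of-line hyphenations and turning a lone hyphen used as a dash into a double hyphen) can be roughly sketched in Python. This is an illustrative guess at the kind of cleanup involved, not Distributed Proofreaders' actual tooling; the function names and regular expressions are hypothetical.

```python
import re


def rejoin_hyphenated(text: str) -> str:
    """Join a word that OCR left split across a line break by a hyphen."""
    # Assumption: lowercase letter + hyphen + newline + lowercase letter is a
    # line-break hyphenation. A genuine compound split at its hyphen would be
    # joined wrongly, so a human proofreader still has to check the result.
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


def dashes_to_double_hyphen(text: str) -> str:
    """Replace a single hyphen standing alone between spaces with '--'."""
    return re.sub(r"(?<=\s)-(?=\s)", "--", text)


def clean_page(text: str) -> str:
    """Apply both fixes to one page of OCR text."""
    return dashes_to_double_hyphen(rejoin_hyphenated(text))


page = "an exam-\nple of OCR - really"
print(clean_page(page))
```

Hyphens inside ordinary compounds such as "well-known" are left alone because they are not surrounded by whitespace. Fixing an OCR-mangled word is not attempted here; that part needs a human reading against the page image.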
      
barbor or barber (Score:4, Funny)

by (rypto* ( 641800 ) writes: on Tuesday January 28, 2003 @09:32PM (#5178573)

The mechanics of a universal library are simple. The tricky part: hairdressing the free labor.

Karma: Barber

Books Are Printed With Computers... (Score:5, Interesting)

by MBCook ( 132727 ) writes: <foobarsoft@foobarsoft.com> on Tuesday January 28, 2003 @09:32PM (#5178574) Homepage

... aren't they? I mean, even if I buy "Moby Dick", isn't all that text in a computer at the publisher somewhere? They format it to fit the pages, etc, and then send that file off to the printers, correct? So it is still on the publisher's computer, and it shouldn't be TOO hard to get it into simple text files instead of whatever odd format they might use. What about when books get published in Braille? A computer must do that, right? There isn't some guy poking dots in steel plates to emboss the pages with, right? I could be wrong, this is my guess. Anyone in the publishing industry out there?
So the point of this post is: why not ask publishers for the material? If it's already public domain, it's not like they'll lose profits, and maybe Project Gutenberg could let them put a little

This text donated by Joe Bob Publishers Inc, of Walla Walla, Washington (www.joebobbooks.inc)

kind of thing at the top of each book they donate. Plus, maybe it's a tax write-off. I don't know. That said, I'd think it'd be much easier to just type things in than OCR it or use speech-to-text.

- Re:Books Are Printed With Computers... (Score:5, Insightful)
  
  by BJH ( 11355 ) writes: on Tuesday January 28, 2003 @09:57PM (#5178740)
  
  I used to be a book editor (at a Japanese publishing company). Let me give you a rundown of the process we followed (I'm sure there are more efficient places than the one I worked at - O'Reilly is well known for their high level of automation).
  
  1. Get manuscript from author.
     This could be either handwritten or typed. If typed, it's likely to be in either plain text or Word format, but with a lot of errors.
  
  2. If the manuscript's handwritten, farm it out to a typist.
     We used to pay 0.5 yen a letter for English, 1 yen a character for Japanese.
  
  3. Once it's data, edit.
     I used to do my editing on a Mac with BBEdit, but this varies a lot between editors - some do it on (shudder) Word, where all the formatting gets in the way.
  
  4. Reformat it to pass it to the DTP firm.
     When I say 'reformat', I don't mean making things bold or italic - I mean cleaning it up so it's easy to do the next step, which is...
  
  5. Print out and insert format directions.
     The manuscript is printed out, and you go through it one line at a time adding things like "Line break here" and "Use larger font for this".
  
  6. Proofs arrive from the DTP firm.
     You go through the proofs, making corrections by hand (i.e., "Move this down one line", etc.)
  
  7. The DTP firm passes you back the formatted data.
     QuarkXPress is king here. You get the data in a finished form and pass it to the printers.
  
  8. The printer produces the final proofs.
     You can still make corrections, but these have to be done by the DTP firm, who then give you the updated data.
  
  9. Last-minute corrections are made.
     This depends on the printer, but quite often these are done by pasting the changes over the top of the printer film (i.e., they're not reflected in the data).
  
  10. The book is printed.
      Corrections after printing are usually done as described above (pasting changes over the film).
  
  The problem with this is that the text data held by the editor is now out-of-date in all sorts of ways:
  - It doesn't have the corrections made by the DTP firm.
  - It doesn't have the corrections made by the printer.
  - It doesn't have any formatting.
  
  QuarkXPress can output the data in other forms, but it's still missing the last-minute changes and after-printing changes, and quite frankly once it's on the market, most publishing companies aren't interested in reworking the data to keep it as text for the next 90 years, so it can be released into the public domain.
  
- Yes, but they don't... (Score:2)
  
  by dachshund ( 300733 ) writes:
  
  So the point of this post is: why not ask publishers for the material? If it's already public domain, it's not like they'll lose profits, and maybe Project Gutenberg could let them put a little
  Yup. Except that the vast majority of publishers won't give out their digital masters, even if the work in question is public domain. The formatting and page layout cost them money, and they (rightly or wrongly) feel that such a release would undercut their sales.
  And even if you could get hold of the digital representation, it'd very likely be copyrighted as a "derivative work" (due to the layout info, page numbers, and even spelling corrections).
WiReD (Score:3, Interesting)

by IvyMike ( 178408 ) writes: on Tuesday January 28, 2003 @09:33PM (#5178584)

I'd just like to point out that this is the third story from Wired to show up on slashdot today. And it's not even that bad of a story. I think this must mean Wired is cool again.

- Re:WiReD (Score:3, Insightful)
  
  by NerdSlayer ( 300907 ) writes:
  
  Seriously. And there were a couple of more earlier in the week, I believe. What's the deal? Slashdot has turned into Wired with trolls substituted for pictures and illustrations. Well, I guess there's the goatse guy...
copyright information (Score:5, Informative)

by Anonymous Coward writes: on Tuesday January 28, 2003 @09:34PM (#5178586)

Keep in mind the following copyright rules:

1. Works first published before January 1, 1923 with proper copyright notice entered the public domain no later than 75 years from the date copyright was first secured. Hence, all works whose copyrights were secured before 1923 are now in the public domain.
   (This is the rule Project Gutenberg uses most often.)
   Works published from 1923-1977 retain copyright for 95 years. No such works will enter the public domain until 2019.
2. Works first created on or after January 1, 1978 enter the public domain 70 years after the death of the author if the author is a natural person.
   (Nothing will enter the public domain under this rule until at least January 1, 2049.)
3. Works first created on or after January 1, 1978 which are created by a corporate author enter the public domain 95 years after publication or 120 years after creation, whichever occurs first.
   (Nothing will enter the public domain under this rule until at least January 1, 2074.)
4. Works created before January 1, 1978 but not published before that date are copyrighted under rules 2 and 3 above, except that in no case will the copyright on a work not published prior to January 1, 1978 expire before December 31, 2002. If the work is published before December 31, 2002, its copyright will not expire before December 31, 2047.
   (This rule copyrights a lot of manuscripts that we would otherwise think of as public domain because of their age.)
5. If a substantial number of copies were printed and distributed in the U.S. prior to March 1, 1989 without a copyright notice, and the work is of entirely American authorship, or was first published in the United States, the work is in the public domain in the U.S.
6. (This rule is complicated, and is seldom applied.) Works published before 1964 needed to have their copyrights renewed in their 28th year, or they'd enter into the public domain. Some books originally published outside of the US by non-Americans are exempt from this requirement, under GATT. Works from before 1964 were automatically renewed if ALL of these apply:
   - At least one author was a citizen or resident of a foreign country (outside the US) that's a party to the applicable copyright agreements. (Almost all countries are parties to these agreements.)
   - The work was still under copyright in at least one author's "home country" at the time the GATT copyright agreement went into effect for that country (January 1, 1996 for most countries).
   - The work was first published abroad, and not published in the United States until at least 30 days after its first publication abroad.

This means that we can't simply take electronic versions of modern texts and put them in the archive, because only out-of-copyright books are in there.
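
The date arithmetic in rules 1-3 can be sketched in a few lines of Python. This is a simplified illustration of the rules exactly as summarized in this comment, not a statement of the law: it ignores rules 4-6 (unpublished works, missing notices, renewals, GATT), and the function name and parameters are invented for the example.

```python
from typing import Optional


def first_public_domain_year(
    published: Optional[int] = None,
    created: Optional[int] = None,
    author_died: Optional[int] = None,
    corporate: bool = False,
) -> Optional[int]:
    """First calendar year a work is public domain under rules 1-3 above.

    Returns None when those three rules alone cannot decide.
    """
    if published is not None and published < 1923:
        # Rule 1: at most 75 years of protection, so this is an upper bound.
        return published + 76
    if published is not None and 1923 <= published <= 1977:
        # 95 years of protection; free from January 1 of the following year.
        return published + 96
    if created is not None and created >= 1978:
        if corporate:
            # Rule 3: 95 years after publication or 120 after creation,
            # whichever ends first.
            terms = [created + 120]
            if published is not None:
                terms.append(published + 95)
            return min(terms) + 1
        if author_died is not None:
            # Rule 2: life of the author plus 70 years.
            return author_died + 71
    return None


print(first_public_domain_year(published=1923))
print(first_public_domain_year(created=1978, author_died=1978))
print(first_public_domain_year(created=1978, published=1978, corporate=True))
```

The three calls reproduce the earliest dates quoted in the list: 2019 for works published in 1923, 2049 under rule 2, and 2074 under rule 3.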

- Re:copyright information (Score:3, Informative)
  
  by ColaMan ( 37550 ) writes:
  
  Unless you visit some other, non-US version of Project Gutenberg, such as the Australian [gutenberg.net.au] one, which I peruse every now and then.
  
  From the .au front page:
  
  Works in the 'public domain' in Australia
  Under Australian copyright law, literary, dramatic, & musical work published, performed, communicated, or recorded and offered for sale in an author's lifetime are protected for the life of the author plus fifty years from the end of the year of the author's death. After this time they enter into the public domain. EBooks on this page may be still copyright in the US and are therefore not available from the US site.
  
  So, at present Australians can get up to the beginning of 1953. Seems a hell of a lot easier to follow than the mess of dates the parent posted.
  - Not quite (Score:3, Interesting)
    
    by Duds ( 100634 ) writes:
    
    So, at present Australians can get up to the beginning of 1953. Seems a hell of a lot easier to follow than the mess of dates the parent posted.
    
    Not quite.
    
    Up to 50 years after the end of the year of the author's death
    
    i.e., they can get stuff up to the end of 1952, assuming the author also died that year.
    
    I wonder though. What if they wrote something in 1951, died in 1952, but it was only discovered (and published) in 1973. What applies?
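
The rule quoted from the Australian site (life of the author plus fifty years, counted from the end of the year of death) reduces to one addition. A minimal sketch, assuming the work was published in the author's lifetime as the quoted text requires; the function names are hypothetical, and the posthumous-publication case raised above is deliberately not handled.

```python
def au_public_domain_year(author_death_year: int, term_years: int = 50) -> int:
    """First calendar year a work is public domain under the quoted rule."""
    # The term runs from the END of the year of death, so protection covers
    # death_year + 1 through death_year + term_years inclusive.
    return author_death_year + term_years + 1


def is_public_domain(author_death_year: int, current_year: int) -> bool:
    """True once the first public domain year has been reached."""
    return current_year >= au_public_domain_year(author_death_year)


print(au_public_domain_year(1952))
print(is_public_domain(1952, 2003))
print(is_public_domain(1953, 2003))
```

For an author who died in 1952 the term ends on December 31, 2002, which matches the correction above: in January 2003 the cutoff is authors who died by the end of 1952, not 1953.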
- - Re:copyright sucks but... (Score:3, Insightful)
    
    by kalidasa ( 577403 ) writes:
    
    ...humanity wrote some ok books in its first 3000 years (-ish) of literacy. The Koran, the Bible, Shakespeare... yeah there's some ok books out there not covered by the stupid copyright situation we are now in.
    
    Unfortunately, Bevington's, Taylor's, Kermode's, and even Muir's texts of Shakespeare are still under copyright. (Compare an Arden of Shakespeare to a facsimile of the First Folio some time: the printers of the First Folio were considered good in their day, but not in ours). Too bad most English translations of the Bible (the KJV and the Tyndale are two obvious exceptions) are still under copyright. Too bad most of the good translations of the Koran still are.
    
    Yes, there's plenty of good lit before 1923, but sometimes you need to look at a more modern edition to see what the original author most likely really wrote.
Distributed Proofreaders, Copyright (Score:5, Interesting)

by dachshund ( 300733 ) writes: on Tuesday January 28, 2003 @09:34PM (#5178597)

Didn't we just have a set of articles on Distributed Proofreaders [archive.org]? Those guys are harnessing technology to churn out books at a mad rate. Seems to me that Wired's reporter is maybe just a tad uninformed.
In any case, the real obstacle to a useful electronic library isn't labor. It's copyright.

hmmmmmm (Score:3, Funny)

by pummer ( 637413 ) writes: <spam&pumm,org> on Tuesday January 28, 2003 @09:34PM (#5178599) Homepage Journal

which will be ready first, Project Gutenberg or Duke Nukem?

- Re:hmmmmmm (Score:2, Funny)
  
  by yerricde ( 125198 ) writes:
  
  Are you claiming that Duke Nukem Forever will not be released within the next ninety-five [pineight.com] years?
Um, Distributed Proofreaders (Score:5, Interesting)

by volsung ( 378 ) writes: <stan@mtrr.org> on Tuesday January 28, 2003 @09:35PM (#5178600)

Apparently the author of the article missed Distributed Proofreaders [archive.org]. They seem to have survived their Slashdotting and actually retained a good fraction of their new users. This month they've proofed 116,827 pages! (Cut that in half for unique pages, I think) They have completed in their 2(?) years of existence 918 books, and have another 317 being assembled. It really seems like they are only limited by what they can get their hands on in the public domain.
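
The figures in this comment can be turned into rough rates with a couple of lines of Python. The halving assumes every page is proofed twice, which is this commenter's guess, and the two-year lifetime is marked "2(?)" in the comment itself, so both numbers below are estimates at best.

```python
pages_proofed_this_month = 116_827  # figure quoted in the comment
proofing_passes_per_page = 2        # assumption: each page is proofed twice
books_completed = 918               # figure quoted in the comment
years_running = 2                   # uncertain; the comment says "2(?)"

# Integer division: an odd page count means one page has had a single pass.
unique_pages = pages_proofed_this_month // proofing_passes_per_page
books_per_year = books_completed / years_running

print(unique_pages)
print(books_per_year)
```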

- Re:Um, Distributed Proofreaders (Score:2)
  
  by mumkin ( 28230 ) writes:
  
  Rah, Rah Distributed Proofreaders!
  
  The Slashdot story added several thousand users to their rolls, myself included, and upped the output volume dramatically. Things have quieted down a bit in the months since the Slashdotting, but it's going *very* well over at DP. I encourage anyone who is remotely interested in helping to create a phat, free digital library to check it out [archive.org] and get involved.
  
  It's truly amazing what you can accomplish with a large-enough group of volunteers, over a long-enough period of time. I've spent relatively little time proofing -- just a few pages whenever I've nothing else to do -- but over the course of several months it turns out I've proofed 551 pages... that's a decent-sized book that I, personally, have helped to bring to the masses. How cool is that?
  
  It's off the homepage now, but I believe that a previous note from DP project management estimated that if it continues at its current pace, Distributed Proofreading will manage to add ~2,000 books to the Project Gutenberg library this year alone!
  - Re:Um, Distributed Proofreaders (Score:5, Interesting)
    
    by madfgurtbn ( 321041 ) writes: on Tuesday January 28, 2003 @11:21PM (#5179157)
    
    How cool is that?
    
    Way cool. I've been working there once in a while since the first /. story, and I think it's one of the most important things happening on the web.
    
    It's only a matter of time before someone with a relatively massive audience like NPR does a story on DP and then we'll see what it's really like to be slashdotted. I would like to see the international membership increase, as well.
    
    I recommend it to anyone who reads. A page a day or a week or a month helps save another book. Most of these old books will become extinct if they are not saved to the web soon.
    
Gutenberg (Score:3, Funny)

by Eddy Johnson ( 467614 ) writes: on Tuesday January 28, 2003 @09:35PM (#5178601)

Ah, good old Gutenberg, the German man who invented the printing press. I believe he was made Man of the Millennium in 2000. Not bad for a guy who's been dead for a few hundred years. The Library of Congress has a Gutenberg Bible on display (the Bible being, of course, the first book made with a printing press.)

And while we're discussing the speech recognition for books, it wouldn't make sense for poetry, which uses alternate spellings sometimes. It also wouldn't make sense for at least one work that I can think of - Through the Looking-Glass by Lewis Carroll, which is already up there. When Alice first looks at the poem Jabberwocky, it's backwards. Try saying that backwards faster than you can type it!

- Re:Gutenberg (Score:5, Informative)
  
  by CharlieG ( 34950 ) writes: on Tuesday January 28, 2003 @10:40PM (#5178951) Homepage
  
  Gutenberg did NOT invent the printing press - he invented moveable type - a BIG difference
  
  Before Gutenberg, there were printing presses, BUT you had to carve the master (the plate) for each page, and it could NOT be changed. Other folks had the IDEA of movable type, but what Gutenberg did was figure out a way to make it work (what he did was figure out how to make all the type the same length, so that when you press down, all the type comes in contact with the paper)
  
  Movable type gives you one huge advantage - you can make up a bunch of sets of letters, and reuse them for many pages.
  
  The total irony of this is that movable type is almost never used anymore - we make up a plate for each page. Of course, we are doing it with electronic movable type, but that is neither here nor there. Movable type started to go away with the Linotype machine - which made up one LINE of type at a time.
  
  I think I still have an ingot of linotype metal around somewhere
  
  - Re:Gutenberg (Score:3, Informative)
    
    by MrOrn ( 469069 ) writes:
    
    Actually, he didn't even invent moveable type. The Chinese did that with wooden blocks much earlier and there were existing printing presses that used moveable blocks.
    Also, there were prior claimants to the "invention" in Europe, such as Laurens Coster in Haarlem, Netherlands, and others in Bruges, Flanders (Belgium), Avignon (Waldvogel, who is recorded as having "steel alphabets" in 1444) and Bologna.
    BTW Gutenberg's "invention" was not the length of the type. It was to have cast the movable type in metal using a matrix. As he was a goldsmith and his father was the Master of the Episcopal Mint in Mainz, this was a great instance of lateral thinking, adapting technology he knew well and applying it to a new field. He would have seen coins being minted and twigged that you could print books like that.
    He also designed the press (adapted from existing wine presses) and came up with an ink that was suitable for the process of printing with this type of press (the ink had to be viscous, rather than the ink used for manuscripts).
    His combination of the three things meant that he could successfully exploit printing commercially. So Gutenberg was probably the first to exploit it commercially, although he wasn't very successful (5 years, 1450-1455, isn't a long time to have a revolutionary business). This fact has ensured that he is credited with the invention of modern printing.
    - Re:Gutenberg (Score:4, Interesting)
      
      by kalidasa ( 577403 ) writes: on Wednesday January 29, 2003 @09:03AM (#5180832) Journal
      
      Actually, he didn't even invent moveable type. The Chinese did that with wooden blocks much earlier and there were existing printing presses that used moveable blocks.
      
      Are you sure about this? My understanding was that early (pre-Gutenberg) Chinese presses didn't have sorts, because with the sheer size of the Chinese writing system, they wouldn't have been efficient with the level of technology to produce wooden blocks. But I'm willing to be corrected (with a reference, preferably).
      
      Anyway, see Elizabeth Eisenstein's The Printing Press as an Agent of Change [amazon.com] for a lot of the information that Mr. Orn describes. A more accessible book, The Nature of the Book [amazon.com] by Adrian Johns, discusses some of this in the earlier chapters.
      
You can help (Score:3, Informative)

by geyser ( 45316 ) writes: on Tuesday January 28, 2003 @09:38PM (#5178612)

The volunteer page is the place to start:
http://promo.net/pg/volunteer.html

Time to Request Digital Copies from Publishers (Score:5, Insightful)

by CaptCanuk ( 245649 ) writes: on Tuesday January 28, 2003 @09:39PM (#5178622) Journal

All digital versions of books that publishers have should be requested and maintained in a safe place till their respective copyrights expire so that they can be easily integrated into the public domain... especially if OCR or speech recognition doesn't get any better any time soon.

There's already a free, online library (Score:2, Funny)

by e12532 ( 158556 ) writes:

It's called IRC :)

irc.nullus.net /join #bw

Tons of books, most in .txt format some in .pdf and .htm

-- Enjoy
size (Score:3, Interesting)

by AndrewRUK ( 543993 ) writes: on Tuesday January 28, 2003 @09:41PM (#5178638)

In its 32nd year of existence, the collection has only 6,267 etexts.

Yep, but Project Gutenberg is predicted to reach its 10,000th book on, variously, Nov 10th, Nov 14th, Dec 10th or Dec 31st this year. It's growing quickly now.

plain text -- WHY?? (Score:4, Interesting)

by CoughDropAddict ( 40792 ) writes: on Tuesday January 28, 2003 @09:42PM (#5178639) Homepage

I cannot believe that Project Gutenberg continues to use plain text as their source code! I can see why it would have been compelling in 1971, and it still may be true that there are systems out there that can only read 7-bit ASCII.

But that's absolutely no reason why the source shouldn't be marked up. Marked up source can always be converted to ASCII, but you cannot derive semantic markup from ASCII.

- Re:plain text -- WHY?? (Score:2)
  
  by SoupIsGoodFood_42 ( 521389 ) writes:
  
  Yes! As much as I admire the project, I also think that this is incredibly stupid. It's not like there are no tools; basic XML, or even HTML, would be useful. In fact, there is already an open ebook XML-type standard being created by a lot of the big corps like Adobe etc.
  It doesn't have to be anything flash, just something to show chapters, titles, [P]s, [BR]s, [B]s, [I]s, etc.
  It just seems like such a waste of effort to convert all these books, only to end up with something that has no semantic structure; I thought that would be half the reason for doing it.
- Re:plain text -- WHY?? (Score:5, Informative)
  
  by ChaosDiscord ( 4913 ) writes: on Tuesday January 28, 2003 @10:29PM (#5178897) Homepage Journal
  
  I cannot believe that Project Gutenberg continues to use plain text as their source code! I can see why it would have been compelling in 1971, and it still may be true that there are systems out there that can only read 7-bit ASCII.
  
  That's exactly why. Since 1971 a wide variety of encodings and markup languages have existed. 32 years later the only system still trivial to read is plain old ASCII. Project Gutenberg is most interested in preserving the texts themselves. The texts are quite well preserved in ASCII. Sure, some formatting is missing, but it's relatively minor for the majority of books in question. And given the existence of this unformatted text, it's a lot easier to create formatted text than from scratch, so you even get a benefit there.
  But that's absolutely no reason why the source shouldn't be marked up. Marked up source can always be converted to ASCII, but you cannot derive semantic markup from ASCII.
  
  I think you're a bit confused on semantic markup. By and large publishers aren't interested in the semantics of the documentation, just the formatting.
  
- Re:plain text -- WHY?? (Score:4, Interesting)
  
  by rodgerd ( 402 ) writes: on Tuesday January 28, 2003 @11:52PM (#5179320) Homepage
  
  Amen.
  
  It could at least shift to Unicode, so we can write in languages other than English (and English-with-no-accents, at that!).
  
They need a name change (Score:2)

by SystematicPsycho ( 456042 ) writes:

Project Gutenberg just doesn't come across as something interesting or the first thing you think of when you think "Free electronic library". Even "WikiLibrary" would be better (although not a wiki).
just scan and compress (Score:5, Interesting)

by Anonymous Coward writes: on Tuesday January 28, 2003 @09:45PM (#5178665)

The best and cheapest way to get existing books on the web is to scan them and compress the images. Compression technology for text images is so good (see DjVu [djvuzone.org]), and storage so cheap nowadays that you are better off just distributing high resolution scans.

This is a much more efficient way to make books available on the web than having volunteers painstakingly transcribe the text or correct OCR mistakes.

OCR can be used for indexing scanned documents, but there is no need to do manual correction. DjVu can compress 300dpi black and white pages of text to 5-25KB. That's less than most HTML pages, and the images look just like the original book.
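
To put the per-page figure above in perspective, here is a back-of-the-envelope sketch in Python. Only the 5-25KB range comes from the comment; the 300-page book length is an assumption picked for illustration, and real pages will vary with content and scan quality.

```python
# Rough size of a scanned book at the DjVu compression range quoted above.
KB_PER_PAGE_LOW = 5    # lower bound quoted above (KB per 300dpi B&W page)
KB_PER_PAGE_HIGH = 25  # upper bound quoted above
PAGES = 300            # assumed book length, for illustration only


def book_size_mb(pages: int, kb_per_page: float) -> float:
    """Return the total size in megabytes, taking 1 MB = 1024 KB."""
    return pages * kb_per_page / 1024


low = book_size_mb(PAGES, KB_PER_PAGE_LOW)
high = book_size_mb(PAGES, KB_PER_PAGE_HIGH)
print(f"{PAGES} pages: {low:.1f} MB to {high:.1f} MB")
```

Under those assumptions a whole scanned book lands in the low single-digit megabytes, which is the scale the storage argument rests on.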

The Million Book Project at the Internet Archive uses DjVu (as well as other formats).

The open source implementation of DjVu is available on sourceforge [sf.net]

- Downside to that method: (Score:4, Insightful)
  
  by Anonymous Coward writes: on Tuesday January 28, 2003 @11:07PM (#5179074)
  
  I, and probably many others here, like to read Project Gutenberg books on my Palm/Pocket PC. Whenever I have a little down time I can get that out and choose from a dozen "classic" books to read. Can't do that when the "book" is an 800x600 image, and your screen can only do 320x320 (Sony Clies, Palm Tungsten), 320x240 (PocketPCs, Handera), or 160x160 (almost all Palm and Handspring PDAs).
  
  Plain text, HTML, or XML are much more portable than compressed images. Which is at least partly why Gutenberg uses plain ASCII text; it's readable on literally anything with an alphanumeric display, and by all signs will be for decades, if not centuries or millennia. Good luck finding a GIF or BMP in 100 years, let alone formats nobody's even heard of. I have plenty of pictures I made only a few years ago on an Apple II that can't be read by anything, even when I get it off the 5.25" floppies. Yet I've read code and other things written on computers from the 70s and 80s. ASCII Just Doesn't Die.
  
- - You can search DJVU files... (Score:3, Informative)
    
    by raytracer ( 51035 ) writes:
    
    Scanned documents might be fine for readers, but what if you're looking for "oh, you know, that one line in the book, where the dude was talking about melons."
    
    It might help to actually understand what you are talking about before you are so quick to dismiss it. DJVU does support searchable text, which can be inserted automatically via OCR. The advantage of this is that the OCR need not be 100% accurate to still be useful (vastly more useful and accurate than the indices in most books, for instance).
bookwarez? (Score:2)

by Punto ( 100573 ) writes:

I always got the impression that there are more titles available on the bookwarez scene than on project gutenberg.
I might be wrong, or maybe some books are more '1337' than others, but I got the impression that there definitely are enough people willing to get the texts to digital format.
Just daydreaming here. (Score:3, Insightful)

by eniu!uine ( 317250 ) writes: on Tuesday January 28, 2003 @09:46PM (#5178671)

As someone pointed out, the real problem is the copyright issue. Most works are copyrighted and copyrights last for way too long. The Constitution states that copyright should be limited, but when it's lifetime plus 70 years, it may as well be unlimited since we'll all be dead before they expire. There needs to be a grassroots movement to inspire a repeal of some seriously damaging legislation. I feel confident that most slashdot readers agree about what needs to be done, but we seem too apathetic to actually do something about it. Sometimes I wish someone would post a link that says 'click here to vote for freedom'. If only it were that easy.

I think an interesting project would be public domain textbooks. Textbooks are grossly overpriced and contain information that is largely available for free. If a community of developers can create an OS like Linux then the educational community should be able to come up with open textbooks.

Huh? (Score:3, Insightful)

by Tyler Eaves ( 344284 ) writes: on Tuesday January 28, 2003 @09:48PM (#5178685)

Huh? I can type a good bit faster than I can speak.

- Re:Huh? (Score:3, Interesting)
  
  by Call Me Black Cloud ( 616282 ) writes:
  
  No you can't, unless you're impaired in some way.
  
  Average speaking rate (in English) is 100-180 wpm. The world's fastest typist hit 212 wpm on a Dvorak keyboard. See also this [angelfire.com]
  
  I took a quickie online typing test [dennisglass.com], one pass, 60 seconds, and here's my score. I'm a decent typist (better when coding). What's your score?
  
  Percentage Accuracy : 100%
  Percentage Inaccuracy : 0.8333333333333334%
  Characters per minute : 360 cpm
  Characters per second : 6 cps
  Words per minute : 67 wpm
  Words per second : 1 wps
  Total Speed status : Too Good
  Overall Accuracy : Absolutely Spot on
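
The rates quoted in the reply above can be turned into a rough time estimate. This is only a sketch: the 100,000-word book length is an assumption for illustration, and neither figure accounts for breaks or for the time spent correcting typing or recognition errors.

```python
# Compare reading aloud with typing for a book-length text.
WORDS_IN_BOOK = 100_000     # assumed book length, for illustration only
SPEAKING_WPM = (100, 180)   # average English speaking range quoted above
TYPING_WPM = 67             # the typing-test score quoted above


def hours_needed(words: int, wpm: float) -> float:
    """Hours to get through `words` at a steady `wpm`, with no breaks."""
    return words / wpm / 60


typing_hours = hours_needed(WORDS_IN_BOOK, TYPING_WPM)
speaking_hours = [hours_needed(WORDS_IN_BOOK, wpm) for wpm in SPEAKING_WPM]
print(f"typing:   {typing_hours:.1f} h")
print(f"speaking: {speaking_hours[1]:.1f} h to {speaking_hours[0]:.1f} h")
```

On these numbers alone, speaking comes out ahead of a 67 wpm typist, though the comparison says nothing about how accurate the resulting text would be.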
Project Gutenberg? (Score:2, Funny)

by bezza ( 590194 ) writes:

Let it go guys....
No 'project' is going to get Steve Guttenberg back in Police Academy.
It is time to move on...
Transcribing? (Score:5, Insightful)

by bravehamster ( 44836 ) writes: on Tuesday January 28, 2003 @09:51PM (#5178702) Homepage Journal

Hah, try transcribing "Huckleberry Finn", or any Dr. Seuss, or better yet, try "Feersum Endjinn" by Iain M. Banks. I'd love to see what a transcriber would do to that one. Given the amount of made-up words in literature, catching and correcting the mistakes a transcriber commits would make it less than useless.

- No. Boycott Dr. Seuss. (Score:2, Informative)
  
  by yerricde ( 125198 ) writes:
  
  Hah, try transcribing "Huckleberry Finn", or any Dr. Seuss
  
  No. Boycott Dr. Seuss. His estate submitted an amicus brief [harvard.edu] in favor of the Bono Act. Now that Project Gutenberg uses distributed proofreading, the Bono Act is the biggest barrier to the growth of PG.
I don't like reading online! (Score:5, Interesting)

by corvi42 ( 235814 ) writes: on Tuesday January 28, 2003 @09:54PM (#5178718) Homepage Journal

It seems paradoxical, but there it is. I spend a huge amount of time glued to the screen, reading articles, blogs, forums, FAQs, HOW-TOs, etc. But I don't like it; in fact, I find it aggravating.

I am lured and lulled by the vast amount of easy information suitably tailored to my interests, all with an easy to use intuitive associational (read: hypertextual) interface. But it is tiring, staring at a flickering glaring screen for hours. My eyes get dry, and I strain and get tired picking out fuzzy objects when I try to focus at a distance. It's nasty and annoying.

Here is my point about this project. Nobody wants to read books on their computers. Well maybe some do, but I think the vast majority don't. Paper books are easily available and cheap. If you can't find the one you want in a local library or bookstore there are a multitude of ways of ordering them. You don't get tired looking at them, they are actually enjoyable. So why should there be a desire amongst the majority for e-books?

Don't get me wrong, I think it's a good idea, but not one that I, nor I think the majority, will go in for until a better way of presenting them is developed. LCDs are an improvement, but they are still shabby. I don't think a project like this is going to see much public interest until some better presentation medium is found. E-Paper will be needed before the E-book becomes a reality for most people. Some kind of little book-sized unit that you can hold and which will display on a matte, non-glaring, non-luminous surface.

Online Interface.. (Score:3, Informative)

by slashkitty ( 21637 ) writes: on Tuesday January 28, 2003 @09:58PM (#5178744) Homepage

While I like the project, I think the biggest problem is the interface to use the books. They end up in this crappy .txt format. The searching and browsing is slow and painful. If they just spent a little time on the website, they might get more support!

In Search of the Perfect Library (Score:5, Interesting)

by drmofe ( 523606 ) writes: on Tuesday January 28, 2003 @10:00PM (#5178759)

There seems to be an interesting recurring theme in human history - we constantly strive to build libraries but we have never yet built one that is quite "good enough".

The Great Library in Alexandria was a wonder of the ancient world until it got burned down as part of a domestic dispute between Mark Antony and Cleopatra. I was amused to note that the local University recently received funding approval to rebuild it - grants committees move slowly.

In mediaeval times, monks were the guardians of knowledge and the various monasteries dotted around Europe were oases of learning and knowledge in those times. Knowledge was restricted to the few.

The original Gutenberg made it possible to create huge volumes (literally) of knowledge and disseminate it on a wide scale. Ever since, people in power have sought to control this technology - either through censorship, copyright, or even education (you have to be able to read before a book is of greatest use to you.)

In Victorian England, the mark of a scholarly gentleman was in the breadth of works he maintained in his private library.

Perhaps a new initiative might be Gutenberg@Home, whereby any reader makes an electronic copy of physical works by some convenient, nondestructive means. By keeping such a personal library private, one would not have to worry about copyright laws, even as currently framed.

How much of what is holding us back from building the perfect library is simply our insistence on monetary-related restrictions? How long will it take us to realize that lengthy (in time) and complex or intensive (in resources consumed) PHYSICAL processes are the only ones to which we need to attach a value, that whatever happens in the electronic world should be free, and that the collation, assembly, verification, dissemination and application of the sum of human knowledge is one of the most important things that we could achieve?

STF

- Re:In Search of the Perfect Library (Score:3, Informative)
  
  by Allen Varney ( 449382 ) writes:
  
  The Great Library in Alexandria was a wonder of the ancient world until it got burned down as part of a domestic dispute between Mark Antony and Cleopatra.
  
  Uh... what? For centuries people have blamed the burning of Alexandria's Great Library on the Romans, the Christians, or the Muslims, depending on which ones they disliked. But Mark Antony? Cleopatra? That's a new one. Maybe you're thinking of Julius Caesar [bede.org.uk], who gets the blame according to this fellow (a self-proclaimed Christian apologist).
Books online are not as good as books on paper .. (Score:2, Interesting)

by staaktdenarbeid ( 620908 ) writes:

Storing books online is one thing. Gutenberg also needs readers to be successful. How many readers are willing to read .txt or .pdf files instead of printed material? Several times I downloaded Gutenberg books, with the intention to read them from laptop or screen later on. Turns out this is too inconvenient when compared to paper print.
If only electronic paper [eink.com] would be at 1c a page ...
And not going anywhere soon.... (Score:4, Interesting)

by Captain Beefheart ( 628365 ) writes: on Tuesday January 28, 2003 @10:06PM (#5178796)

Floppy disks get magnetized, hard drives crash, optical disks get scratched...A book can take a beating, man. All the OCR and voice rec in the world won't change this until we can get widespread, cheap cartridged optical media.

I think this take on media longevity also prevents progress WRT Project Gutenberg. Too many people don't see the point, when they can have the Library of Congress backed up on disk one day but be looking at a screen full of garbage characters the next because someone accidentally yanked the power supply on the server or what have you.

A single $5 paperback book can be propagated more reliably than tens of thousands of dollars worth of networks and storage, although the latter system can admittedly hold a whole library's worth of that single book. But think about the infrastructure required to maintain the latter system. Until we have better media, the costs aren't justifiable, IMHO. It's an idea whose time has not yet come.

- Re:And not going anywhere soon.... (Score:5, Insightful)
  
  by dvdeug ( 5033 ) writes: <dvdeug&email,ro> on Tuesday January 28, 2003 @11:32PM (#5179222)
  
  Floppy disks get magnetized, hard drives crash, optical disks get scratched...A book can take a beating, man. All the OCR and voice rec in the world won't change this until we can get widespread, cheap cartridged optical media.
  
  One small book also takes up the space of a hard drive, and can't be redownloaded, or backed up. If my roof leaks, or I have a fire, it will cost me thousands of dollars to replace my books, and some will be hard to impossible to replace. If my hard drive crashes, I redownload the files from Gutenberg, and/or restore them from my backups.
  
Technological Breakthrough ( funny ) (Score:3, Funny)

by corvi42 ( 235814 ) writes: on Tuesday January 28, 2003 @10:07PM (#5178805) Homepage Journal

From the bookmarks that a local bookstore ( bookcity [seekbooks.com.au] ) gives out with purchases:
Introducing the new Bio-Optic Organized Knowledge device - BOOK.
BOOK is a revolutionary breakthrough in technology; no wires, no electric circuits, no batteries, nothing to be connected or switched on. It's easy to use. Even a child can operate it.
Compact and portable, it can be used anywhere - even sitting in an armchair by the fire - yet is powerful enough to hold as much as a CD-ROM.
[...]
BOOK never crashes nor requires rebooting. The 'browse' function allows instant movement to any sheet, forward or back, as one wishes. Many come with an 'index' feature, which pinpoints the exact location of any selected information for instant retrieval.
Portable, durable and affordable, BOOK is being hailed as a precursor of a new entertainment wave. BOOK's appeal seems so certain that thousands of content creators have committed to the platform and investors are reportedly flocking to the medium.

Stupid article. Project Gutenberg doing great. (Score:5, Insightful)

by ChaosDiscord ( 4913 ) writes: on Tuesday January 28, 2003 @10:08PM (#5178812) Homepage Journal

Thus Project Gutenberg has inched ahead at a snail's pace. In its 32nd year of existence, the collection has only 6,267 etexts.

I prefer to phrase it, "Thus Project Gutenberg has raced ahead at an amazing rate. In its 32nd year in existence, the collection has 6,267 etexts, averaging almost 200 etexts per year. That works out to about one book every other day. This is more impressive given that in the first twenty years of the project's existence the Internet didn't exist in anywhere near the form we take for granted today. The popularization of the Internet has just accelerated the rate at which Project Gutenberg grows. With the help of Distributed Proofreaders [archive.org], a project that allows average people to donate small amounts of time to proofread just one page at a time, Project Gutenberg can expect to add over 400 etexts per year. Clearly Project Gutenberg is thriving."
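
The averages in that rephrasing can be checked with a couple of lines of Python. This is just a sketch of the arithmetic: it divides by 32 as the comment does, which is itself an assumption, since "in its 32nd year" means slightly fewer than 32 full years have elapsed.

```python
# Check the averages quoted above from the two figures in the Wired quote.
ETEXTS = 6267   # collection size quoted from the article
YEARS = 32      # project age as used in the comment (an approximation)

per_year = ETEXTS / YEARS          # average etexts added per year
days_per_etext = 365 / per_year    # average days between new etexts

print(f"{per_year:.0f} etexts per year")
print(f"one etext every {days_per_etext:.2f} days")
```

Both results line up with "almost 200 etexts per year" and "about one book every other day".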

- Re:Stupid article. Project Gutenberg doing great. (Score:3, Informative)
  
  by Overt Coward ( 19347 ) writes:
  
  I'll point out that at the end of 2000, there were only roughly 2000 etexts in the entire PG library (I copied them all to a single CD)... So if they're up well over 6,000, then they've made amazing progress in two years!
If one demands that the library be born. . . (Score:5, Insightful)

by kfg ( 145172 ) writes: on Tuesday January 28, 2003 @10:13PM (#5178837)

full grown, like Athena springing from the head of Zeus, this criticism is largely valid.

Patience, however, is a virtue. Libraries of public domain works *grow.* Every work added remains. Although it may take many years, even generations, as did the construction of the Giza plaza, over time the pyramid grows toward its apex, another pyramid joins it, a temple is added to the side, and so on.

That's part of the point of Project Gutenberg. Not just to provide an online library but to do so in an immutable manner that only grows over time.

Adding only *one page* to the project is valuable, and that addition remains and is added to by others.

Even brick and mortar libraries can take generations to build. A two hundred year plan only requires patience to complete.

That said, I'm going to take an even more contrarian point of view to the Wired article. The amazing thing I find about Project Gutenberg is how much is already in there. It's already at the point that I think few people could manage to read one half of the texts available in their lifetime, and finding a project to donate to is complicated by the fact that the hardest part may not be performing the labor, but simply finding a project that interests you that *hasn't already been done.*

It's already a remarkable collection, and I've had to, on occasion, resort to it because my local library didn't have a lending copy of the work I wanted, but Project Gutenberg could give me free ownership of it.

KFG

Wired's failed logic... (Score:4, Interesting)

by Dynedain ( 141758 ) writes: <slashdot2 AT anthonymclin DOT com> on Tuesday January 28, 2003 @10:28PM (#5178892) Homepage

One thing that really pisses me off about Wired, ZDNet, and all the rest....

You are on the internet, talking about various companies/projects/organizations/events on the internet... SO PROVIDE THE DAMN LINKS!!!

The Wired article is about Project Gutenberg, but there is no link to it! There is a link to a children's books database, which is only mentioned in passing.

Get it together!

OCR might have a problem... (Score:3, Interesting)

by salimma ( 115327 ) writes: on Tuesday January 28, 2003 @11:07PM (#5179075) Homepage Journal

.. when the font used is different from fonts it is programmed to recognise. I tried scanning a 40-year-old book - a drama script written in Indonesian - and the combination of unusual font *and* unrecognised language was enough to make the OCR software's output 50% rubbish.

Hmm, imagine scanning a 500-year-old book hand-written in Cyrillic... forgetting for one second the damage that scanning might do to the book in the first place.

All Hail The Text! (Score:3, Informative)

by Jason Scott ( 18815 ) writes: on Tuesday January 28, 2003 @11:18PM (#5179136) Homepage

Well, until it's free, there's always textfiles.com [textfiles.com].

Actually, a while ago I copied a lot of the Project Gutenberg library, along with some others, and created etext.textfiles.com [textfiles.com].

In my experience, the reason a lot of people don't donate free time to transcription or other similar drudge work is because a lot of sites that encourage it steal it. Witness CDDB, and just wait to see how long before you pay for IMDB.

I tend to differ... (Score:3, Interesting)

by joto ( 134244 ) writes: on Tuesday January 28, 2003 @11:20PM (#5179149)

I think Gutenberg is very much there...
Have you ever looked at the amount of material in Gutenberg's archives? When it comes to books and material written in English that is in the public domain, I have to say that Gutenberg offers almost everything of interest already.
The reason the Gutenberg project isn't hugely successful is not the lack of text. Part of it might be the lack of formatting. Nobody wants to read 600 pages of a classic work on a computer screen in ASCII. Some may be masochistic enough to do it if it was in HTML. Personally, I still prefer it in book-form.
But even if it was properly formatted in several formats (including .pdf's in several sizes), it still is a lot of work to print it out and find a decent way to keep it together (no, ring binders aren't very appropriate for something you are going to read).
The main reason Gutenberg isn't successful is because it is not what people want. People don't want to read or print out old literature in the public domain. They either want a nice edition that looks good on the shelf, or a cheap paperback to carry around with them. And most likely they aren't particularly into really old books (with a possible few exceptions which the Gutenberg project has long since covered).
It's not like the work the Gutenberg people are doing isn't important, or isn't of good enough quality or anything else. The simple reason it's not heavily successful is that very few people are really interested. I'm sure much of the work the Gutenberg people have done will become important as soon as on-demand printing is more common and affordable.

Problems with speech recognition (Score:3, Interesting)

by mactari ( 220786 ) writes: <rufwork.gmail@com> on Tuesday January 28, 2003 @11:31PM (#5179216) Homepage

Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (faster to read a book aloud than to transcribe it correctly).

Was thinking about voice recognition today while lamenting that I haven't done more to type in my copy of The Queen's Necklace by Alexandre Dumas, copyright 1910.

Here are two problems that came to mind as to why I probably won't be able to use voice recog soon:
1.) Works that have been lucky enough to actually have their copyright lapse are often pretty old works. Their English (let's use English just b/c it's the lang I'm using) isn't exactly today's English, and sometimes even spellings, etc., change. Try reading anything from the 1800s and before.

2.) Names (so any protracted dialog) and other tough-to-translate stuff is going to be a pain to proofread. My book in particular has quite a bit of French in it (lots of "Parbleu" and French names with crazy accents all over the place).

I'd like to say voice recog could produce a "new version" with "updated spellings", but I just don't think that'd fly.

So once voice recog is commonplace for, say, office use (still quite a ways off) and affordable (not sure there, but I haven't heard of a friend using it yet, even just to play) we'll still have a ways to go before we can get true literature into PG simply by reading.

As an aside, at the same time I've been thinking about simply taping me reading the book and donating *that* via mp3 (or Ogg or whatever the heck). For the time being anyone who wants can listen in the car, and as soon as voice recog is up to snuff, voila. Just run it on my recording, proofread (easier said than done), and you're ready to go!

bookwarez (Score:3, Interesting)

by majcher ( 26219 ) writes: <slashdotNO@SPAMmajcher.com> on Tuesday January 28, 2003 @11:45PM (#5179282) Homepage

I love Project Gutenberg, and I've used and supported it since the pre-web days. However, I don't think they go far enough.

There are plenty of places on the net that one can find and download copyrighted works. Web sites, mail servers, IRC networks, and so on. I've used them extensively, myself. Many of the books I've downloaded, I own, and I got the electronic format for searching, reading on pocket devices, and so on. I think that this is fair - I've paid for the information once, and my sense of Fair Use tells me that it's okay to get these bits in this way.

I've also downloaded many, many books that I do not, nor will ever, own. (Some of these, I will probably never read.) Is this a copyright violation? Almost definitely. Is it ethically wrong? I don't think so. I would probably never buy a new copy of these works. If I hadn't downloaded them, I would have borrowed them from a friend, or a library, or bought a used copy, and sold it back later. None of these legal methods would have earned the author or publishers a cent. So, how are they different from downloading an electronic version? In my eyes, they are not.

I buy plenty of books - hundreds of dollars' worth every year. I love to read. I support local authors, and independent publishers. I do not think my actions are criminal. If someone disagrees, tough. You won't stop me, or the legions of other electronic book traders. Ever. Sorry. If it helps, think of us as the "books" in Fahrenheit 451, keeping a distributed library available for public use, in the event that something terrible should happen someday. Eventually, one way or the other, copyright will go away, and the words will be truly free again.

(And anyway, I was just joking. I'd never knowingly violate copyright law. What am I, stupid?)

Tax funded... (Score:4, Interesting)

by silverhalide ( 584408 ) writes: on Wednesday January 29, 2003 @01:09AM (#5179809)

Why isn't a project like this tax funded? It would be trivial for Congress to put aside a million or two to pay some schlubs to sit around doing data entry all day. Heck, create a department to do it. Almost all brick 'n mortar libraries are tax funded, so why shouldn't a public electronic library be tax funded? You could (theoretically) crank up production of the conversions to save even more rare works, on top of the fact that ideally the project could work directly with major libraries around the USA, or even the world. Of course, realistically such a project would turn into some bureaucracy that gets barely more done than the volunteer version, but it would at least look like someone cares.

Really, information is the most important thing humanity has, and the people literally "Saving" the world are doing it in their free time.

What's wrong with Wired Magazine... (Score:5, Interesting)

by raytracer ( 51035 ) writes: on Wednesday January 29, 2003 @01:41AM (#5179961)

They obviously publish articles written by people with their head up their asses.

Honestly, just what is Mr. J. Bradford DeLong thinking? To characterize Project Gutenberg as a failure is just imbecilic. According to PG's own pages, 203 ebooks were released in October 2002, and 1975 new books in 2002 as a whole (up from 1240 in 2001). It's a lot of work to produce even one book, and PG is churning them out at a pretty good clip for an entirely volunteer effort.

Even as it is, I've found PG to be pretty damned useful. It's kind of nice to be able to grep the collected works of Shakespeare. Or Darwin. Or Conan Doyle. Or H. G. Wells. Or Jules Verne. Or Charles Dickens. Or L. Frank Baum.
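
As a rough illustration of what "grepping" a plain-text collection amounts to, here is a minimal Python sketch. The function name `grep_lines` and the four sample lines are made up for the example; a real run would read the lines from a downloaded plain-text e-book instead.

```python
import re

def grep_lines(lines, pattern, ignore_case=True):
    """Return (line_number, line) pairs whose text matches the regex pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [(n, line) for n, line in enumerate(lines, start=1) if regex.search(line)]

# Stand-in for a plain-text e-book (hypothetical sample, not a real PG file).
sample = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune,",
    "Or to take arms against a sea of troubles",
]

# Print every line that contains the phrase "to be", with its line number.
for number, text in grep_lines(sample, r"\bto be\b"):
    print(f"{number}: {text}")
```

Because the text is plain ASCII, nothing beyond the standard library is needed; the same function works on any list of lines.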

Despite advances in technology, scanning, OCRing and proofreading books remains a very labor-intensive process, and it is a boring, often thankless process as well. The Million Book project wants to take a somewhat different approach to providing digital books: they actually scan the books and store them in DJVU format (a very nice format similar to PDF). They can do OCR on it to provide searchable text, but such text doesn't have to be 100% accurate to be effective. Most of the time you print and read the original scans. After all, some publisher went to the trouble of carefully typesetting the book and proofreading it once; why bother to do it all again?
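
To make the point about imperfect OCR concrete, here is a small Python sketch of a search that tolerates recognition errors. It uses the standard library's `difflib.get_close_matches`; the function name `fuzzy_find`, the 0.8 cutoff, and the sample string (with "Wonderland" misread as "Wonderiand") are assumptions chosen for illustration, not anything the Million Book project is known to use.

```python
import difflib
import string

def fuzzy_find(query, ocr_text, cutoff=0.8):
    """Return distinct words in ocr_text that approximately match query."""
    # Lowercase, split on whitespace, and trim punctuation from word edges.
    words = [w.strip(string.punctuation) for w in ocr_text.lower().split()]
    # Drop empty strings and duplicates while keeping first-seen order.
    unique_words = list(dict.fromkeys(w for w in words if w))
    return difflib.get_close_matches(query.lower(), unique_words, n=5, cutoff=cutoff)

# Hypothetical OCR output in which the "l" of "Wonderland" was read as "i".
ocr_page = "Alice's Adventures in Wonderiand, by Lewis Carroll"

print(fuzzy_find("Wonderland", ocr_page))
```

An exact search for "Wonderland" would miss this page, while the approximate match still finds it, which is why searchable text can be useful well short of 100% accuracy.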

I first became aware of this project and technology when I met Brewster Kahle as he drove the Internet Bookmobile around the U.S., going to libraries and schools trying to drum up interest in Eldred v. Ashcroft. A compressed version of Alice in Wonderland in DJVU format is about 5 megabytes (the same as a single MP3) including the illustrations and fancy typesetting. He could print and bind a copy of it for about $2 in materials, on demand using an HP laser printer out of the back of the mobile. The binding isn't amazing, but consider the possibility of having literally any book in any small town library in any place in the world. It's an exciting idea, and one that technology is only making easier and cheaper. You can get a decent scanner for $100 (even one small enough to hook to a laptop and take to a library). You can scan a book in an evening. And after you do, the file can be converted to a simple, easy to use format that everyone can use. Forever. One evening. One person. One book.

Despite the setback of Eldred v. Ashcroft, more and more books are going to be made available by the true philanthropists of the world: the volunteers who give something of their own time to make the world a better place. I wonder what Mr. DeLong has done to make the world a better place...

Semi-official response from Project Gutenberg (Score:5, Insightful)

by gbnewby ( 74175 ) writes: on Wednesday January 29, 2003 @01:46AM (#5179978) Homepage
Michael Hart and I are working on a written response that we'll send to Wired and other media, but by then this /. article will be off the front page. So, allow me to make a few comments.
- Projecting back to 1971, Project Gutenberg has tracked Moore's Law quite precisely. January 2003 will be our most productive month ever, and we are looking forward to continuing to double our rate of new eBooks every 18 months.
- Project Gutenberg has received some big donations, and we're working on grants and other funding. However, when you do the math you realize that there's essentially no hope for paying for content -- it takes thousands and thousands of people. The hope for "someone" to do it is naive -- the only answer is to figure out ways for "everyone" to work on digitization.
- While the author makes 6200 books sound like small potatoes, in fact it represents about 1/3 of all eBooks listed in places like the Internet Public Library [ipl.org]. Not bad, and it certainly explains why some random book the author wants isn't part of the collection -- there just aren't that many projects working on digitizing literature.
- Where did the author figure on $750million, and for what? Over 30 million printed books were registered for copyright in the last 100 years (this doesn't count magazines, recordings, etc.). The notion that $25/book could pay for digitization is not unreasonable. But where do you get the books, and what about copyright? If there's a plan, I'd like to hear it.
- One more point, to keep this short: We have just under 7000 eBooks (up about 800 from whenever the author did his research!). We have over 1000 active volunteers. The books are in over 20 languages, dozens of formats and, if printed, would fill a small library. We're on track to reach #10,000 in 2003. Via Distributed Proofreading, as mentioned here and in a previous /. story, we can and frequently do complete digitizing a 300 page book in just a few hours. Mr. DeLong, I don't feel apologetic about these numbers at all.
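
The arithmetic behind two of the points above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are invented for the example, and the starting rate of 200 eBooks per month is an illustrative assumption rather than an official Project Gutenberg figure.

```python
def cost_per_book(total_budget, book_count):
    """Average digitization budget available per book."""
    return total_budget / book_count

def projected_rate(rate_now, years_ahead, doubling_years=1.5):
    """Output rate after years_ahead years if it doubles every doubling_years years."""
    return rate_now * 2 ** (years_ahead / doubling_years)

# $750 million spread over 30 million copyrighted books.
print(cost_per_book(750_000_000, 30_000_000))  # 25.0 dollars per book

# Hypothetical 200 eBooks/month today, doubling every 18 months, three years out.
print(projected_rate(200, 3))  # 800.0 eBooks per month
```

Three years is two doubling periods, so the rate quadruples; the same function can be used to try other starting rates or horizons.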
That's all for now. Thanks to all the supportive comments in this thread, and to all the constructive criticism. And remember, a page a day [archive.org] is all it takes to contribute!
Greg Newby, Director and CEO The Project Gutenberg Literary Archive Foundation www.gutenberg.net [gutenberg.net]
- Re:Tex? (Score:5, Insightful)
  
  by ProgressiveCynic ( 624271 ) writes: on Tuesday January 28, 2003 @09:25PM (#5178531) Homepage
  
  Umm, Project Gutenberg can only legally use public domain works. If you know of any 100+ year old novels typeset in TeX, let's hear about it. Even if a modern reprint was done recently, do you think the publisher would really want to give away all that hard work so that everyone can get it for free instead of buying their spiffy new edition?
  
  - Re:Tex? (Score:4, Funny)
    
    by MulluskO ( 305219 ) writes: on Tuesday January 28, 2003 @09:40PM (#5178627) Journal
    
    Well, getting the source from the publishers certainly seems like a more feasible solution than copying the text out of books.
    
    Now that I think about it, I imagine publishers face this very same problem when publishing a new edition of an old work, and I'd wager they have developed a few tricks to make the process easier.
    
    If all else fails, I suggest we direct some time, effort, and old PCs into monasteries. Monks dig that tedious shit, hard style. Only one question remains, though... VI or Emacs?
    
- Re:Tex? (Score:5, Informative)
  
  by jonman_d ( 465049 ) writes: <nemilar&optonline,net> on Tuesday January 28, 2003 @09:26PM (#5178534) Homepage Journal
  
  That's not how Project Gutenberg works. Most everything that's on PG is public domain - that means the copyright has expired. Thus, most of the stuff is over 70 years old. They didn't exactly use LaTeX back in the 1930s.
  
  Besides, what I generally use PG for are the classics - Greek/Roman literature, etc... I don't think Plato used UNIX.
  
  It's all got to be somehow entered from dead-tree-format copy. Currently, that pretty much means typing up the entire book.
  
  - What's more. . . (Score:5, Informative)
    
    by kfg ( 145172 ) writes: on Tuesday January 28, 2003 @09:49PM (#5178690)
    
    it is part of the philosophy of Project Gutenberg to publish all of their works in the lowest level standard format, thus ensuring continued cross platform, program independent readability, ad infinitum.
    
    That means *plain* ASCII. Plain ASCII means you could read it in edlin if you really had to.
    
    This is a Good Thing.
    
    This also means that if you wish to format any Project Gutenberg text, in HTML or TeX for publication, you start with a blank slate and can immediately start to work your own will upon the raw text.
    
    This is also a Good Thing.
    
    KFG
    
    - - Re:What's more. . . (Score:3, Informative)
        
        by Sgs-Cruz ( 526085 ) writes:
        
        It keeps it quite good for almost all European languages [evergreen.edu], thank you. Wouldn't you consider it better than nothing? Or would you prefer that Project Gutenberg supported the Unicode standard that is mired in controversy [hastingsresearch.com] because it doesn't support all 10 to the freaking 24th ancient Chinese ideographs.
        I'd prefer that the books be transcribed now and maybe later we can add some foreign-language books once we figure out a standard that can satisfy the world. Besides, English (European languages, anyway) are the real languages of the Internet.
        
        You are correct on all points (Score:3, Insightful)
        
        by kfg ( 145172 ) writes:
        
        In fact ASCII text can even be human translated (although not really human read) if all you have is the *binary*.
        
        The poster to whom you reply seems to have missed the essential point.
        
        I would give you one caveat though. English may well be the language of the internet (and I'll leave the argument as to whether that's a good or bad thing to the students), but it isn't the language of *literature.*
        
        It would certainly be a Good Thing to be able to store the Vedas and Sun-Tzu, in the original script, at the lowest possible human readable electronic form.
        
        This, however, as you note, will apparently have to wait for some future time.
        
        KFG
      - Re:What's more. . . (Score:3, Informative)
        
        by dvdeug ( 5033 ) writes:
        
        No, it's a bad thing, because it renders Gutenberg near useless for anything other than English,
        
        Have you ever taken an actual look at Project Gutenberg? It uses whatever character set is necessary for the language in question; Unicode, CP1251, and ISO-8859-1 have all been used.
        
        Of course, so has DOS CP850, which is darn near unreadable unless you're a CS geek, which is why PG prefers ASCII.
      - Re:What's more. . . (Score:5, Informative)
        
        by Doug Merritt ( 3550 ) writes: <doug AT remarque DOT org> on Wednesday January 29, 2003 @01:50PM (#5182636) Homepage Journal
        
        No, it's a bad thing, because it renders Gutenberg near useless for anything other than English, and it cripples it for creating PDFs, TeX files for printing, and the like
        Strangely enough, people have actually addressed this, notably with the GutenMark program to convert Gutenberg text into nicely formatted documents in a variety of markup formats (including PDF and TeX, using postprocessing filters).
        See GutenMark home [sandroid.org]
        It never ceases to amaze me that, when people see something that only addresses 90% of their own problem, they call it useless, rather than doing a web search to see whether someone has addressed the remaining 10% of their problem.
        Gutenberg is an amazingly important project; I urge everyone to support it.
        
- Re:Tex? (Score:3, Informative)
  
  by AndrewRUK ( 543993 ) writes:
  
  That requires LaTeX source to post, which means a modern edition, which means it's copyrighted, which means you can't copy it (unless you have the publisher's permission.*)
  Project Gutenberg only does texts which are in the public domain, which currently means pre-1923 editions (PG of Australia [gutenberg.net.au] has newer books) and, obviously, pre-1923 means that the only sources are print copies.
  
  * pedantic point: it's the copyright holder's permission, which isn't necessarily the publisher, but usually is.
- Re:Tex? (Score:5, Informative)
  
  by Matrix ( 290 ) writes: on Tuesday January 28, 2003 @09:45PM (#5178661) Homepage
  
  While this comment has been addressed, I'd like to point out that you can get pretty decent output from the Gutenberg texts by importing them into LyX [lyx.org]. With just a little bit of work (basically setting up the chapters), LyX will allow you to create good looking PDF, Postscript, HTML, etc, along with the LaTeX source. Combine this with rbmake [sf.net] and you can even read them, complete with hyperlinks, on your eBook (if you have one!)
  
  - Re:Tex? (Score:3, Informative)
    
    by nels_tomlinson ( 106413 ) writes:
    
    I've been marking up the Project Gutenberg etexts using LaTeX for several years now. I can typeset an Oz book, or one of the Tom Swift books, in about 15 minutes. I have put about a week into typesetting ``The Voyage of the Beagle'', and no end in sight. I was able to typeset a translation of the bible in about one week, but it was sloppy work, and I wasn't satisfied.
    LyX is nice, but I don't think that it really speeds things up. I can't imagine that LyX could speed things up at all on a Tom Swift.
- Re:other issues... (Score:3, Informative)
  
  by MBCook ( 132727 ) writes:
  
  Most of the stuff on PG is public domain, IIRC. Only if Poe, Melville (I know it's wrong, so sue me), Shakespeare, and others all climb out of their graves and form some kind of union (RIAA - Recently-undeceased Inkers of Aged Albums etc.) will people complain that they're getting ripped off by these works being put on the web.
- Re:My Opinion (Score:2, Interesting)
  
  by Googol ( 63685 ) writes:
  
  >
  
  Linus, however, has persons coding for him who learn something from the experience. He doesn't ask them to retype the kernel for him and get it to recompile exactly the same way.
  
  The motivational task for PG is *much* harder. "So, your resume sez you spent 6 years typing e-text written by other people. Hmm."
  
  There are more volunteer man-hours in Linux because the volunteers get more out of it.
  
  =googol=
