Building a Searchable Literature Archive With Keywords?

timothy posted more than 5 years ago | from the must-be-in-here-somewhere dept.

Data Storage 211

Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."

211 comments

Document Management Software and OCR (5, Informative)

eldavojohn (898314) | more than 5 years ago | (#27508859)

I think what you are looking for is something called "document management [google.com] " software. As far as FOSS goes, KnowledgeTree [knowledgetree.com] offers a community version that might be up your alley. They have an online demo if you're interested. There's also Alfresco [alfresco.com] but I haven't tried either of these.

From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).

... where he could archive the PDFs and scanned documents and be able to search by keywords?

So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR [sourceforge.net] or Tesseract [linuxjournal.com] if this is the case.

Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people [wikipedia.org] who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.

Re:Document Management Software and OCR (1)

CyberLord Seven (525173) | more than 5 years ago | (#27508953)

I wish I had mod-points.

Very well done and informative on all counts.

Not quite correct! (1)

TrisexualPuppy (976893) | more than 5 years ago | (#27509275)

So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR [sourceforge.net] or Tesseract [linuxjournal.com] if this is the case.

No, no, no! HERE! [cowtax.org]

Re:Document Management Software and OCR (2, Interesting)

digitalunity (19107) | more than 5 years ago | (#27509599)

Another obvious, more practical but potentially less powerful method is simply to index all of the printouts with serial numbers and manually create a database of tags with serial numbers.

You then have a library card index system, but electronic. Sure, it won't help with documents that aren't entered properly, but it's dramatically more efficient than thumbing through PDF's until you find what you need.
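A minimal sketch of that electronic card index, assuming Python's built-in sqlite3 module. The table names, column names, and sample entries are all hypothetical; nothing here comes from a particular product.

```python
import sqlite3

def build_index(conn):
    """Create a tiny card-index schema: documents plus a serial-to-tag table."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            serial INTEGER PRIMARY KEY,   -- number written on the printout
            title  TEXT NOT NULL,
            binder TEXT                   -- where the paper copy lives
        );
        CREATE TABLE IF NOT EXISTS tags (
            serial INTEGER NOT NULL REFERENCES documents(serial),
            tag    TEXT NOT NULL,
            UNIQUE (serial, tag)          -- one row per (document, tag) pair
        );
    """)

def add_document(conn, serial, title, binder, tags):
    """Register one printout and attach any number of tags to it."""
    conn.execute("INSERT INTO documents VALUES (?, ?, ?)", (serial, title, binder))
    conn.executemany(
        "INSERT OR IGNORE INTO tags VALUES (?, ?)",
        [(serial, t.lower()) for t in tags],
    )

def find_by_tag(conn, tag):
    """Return (serial, title, binder) for every document carrying the tag."""
    rows = conn.execute(
        "SELECT d.serial, d.title, d.binder FROM documents d "
        "JOIN tags t ON t.serial = d.serial WHERE t.tag = ? ORDER BY d.serial",
        (tag.lower(),),
    )
    return rows.fetchall()

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
build_index(conn)
add_document(conn, 1, "Carbon nanotube synthesis", "Binder 12", ["nanotubes", "CVD"])
add_document(conn, 2, "AFM imaging methods", "Binder 3", ["AFM", "methods"])
add_document(conn, 3, "CVD growth kinetics", "Binder 12", ["CVD", "methods"])
print(find_by_tag(conn, "cvd"))
```

Because tags live in their own table, one printout can sit under as many headings as you like, which is the thing the binders can't do.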

Re:Document Management Software and OCR (4, Funny)

qoncept (599709) | more than 5 years ago | (#27508959)

If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.

Except he's not. He just prematurely ejaculates. And he'd gone all this time with no one drawing attention to it as you just have.

Re:Document Management Software and OCR (1)

WillKemp (1338605) | more than 5 years ago | (#27508989)

Look into GOCR [sourceforge.net] or Tesseract [linuxjournal.com] if this is the case.

Unless you're really lucky with the images, OCR requires a lot of work correcting errors. It would probably be less work to just be able to add searchable tags to whatever system is used to store the PDFs and leave them as images.

Re:Document Management Software and OCR (3, Insightful)

Shadow Wrought (586631) | more than 5 years ago | (#27509173)

OCR certainly requires work if you need it to be completely accurate. In practice, speaking as a paralegal who's overseen the OCR'ing of millions of pages, it's just not a reasonable expectation. If you can supplement it with coding, in this case keyword tags, date, author, publication and title would build a pretty strong database. If he's looking to do that already, then whatever OCR you get is gravy. Some is better than none.

Re:Document Management Software and OCR (1)

digitalunity (19107) | more than 5 years ago | (#27509551)

OCR, in my experience, is also crap with equations and technical literature in general. Its linguistic fuzzy matching changes technical words it doesn't recognize into similar words it does, on the basis that, well shit, it might have just not scanned very well.

Not much way around this.

Re:Document Management Software and OCR (2, Funny)

NoobixCube (1133473) | more than 5 years ago | (#27510549)

What he needs is a bunch of undergrads or interns to painstakingly transcribe and proofread every scrap and napkin of text!

Re:Document Management Software and OCR (2, Insightful)

shaitand (626655) | more than 5 years ago | (#27509427)

That depends, if all he needs the OCR for is to build a searchable keyword index then the error rate can be quite high and still get good results. The results of a query should point to the original PDF, not the result of the OCR.
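A rough sketch of that idea in Python, assuming the OCR text has already been extracted into strings by some other tool. The file paths and sample text are made up, and the misread words ("nan0tubes", "rnicroscopy") are just stand-ins for typical OCR errors.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and keep alphabetic runs of 3+ letters."""
    return re.findall(r"[a-z]{3,}", text.lower())

def build_index(ocr_text_by_pdf):
    """Map each word to the set of original PDF paths whose OCR text contains it."""
    index = defaultdict(set)
    for pdf_path, text in ocr_text_by_pdf.items():
        for word in tokenize(text):
            index[word].add(pdf_path)
    return index

def search(index, query):
    """Return the PDFs containing every query word, sorted by path."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical OCR output, errors included
ocr_text_by_pdf = {
    "scans/smith1998.pdf": "Growth of carbon nan0tubes by chemical vapor deposition. Nanotubes were grown on silicon.",
    "scans/lee2003.pdf": "Atomic force rnicroscopy of thin films; atomic force microscopy images are shown.",
}
index = build_index(ocr_text_by_pdf)
print(search(index, "nanotubes"))
print(search(index, "atomic microscopy"))
```

Each sample document has one mangled keyword, but the word shows up again correctly elsewhere in the text, so the lookup still lands on the right PDF. The search only ever hands back the path of the original scan, never the OCR text.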

Re:Document Management Software and OCR (1)

Hognoxious (631665) | more than 5 years ago | (#27510133)

Wouldn't it be even less work to get somebody else (TAs or some other form of free labour that needs to be on the prof's good side) to do it?

Re:Document Management Software and OCR (2, Interesting)

Red Flayer (890720) | more than 5 years ago | (#27509073)

I think what you are looking for is something called "document management" software.

Ugh... for a second there, I thought Clippy started posting to Slashdot. Would you like help with that?

I don't think he's looking for a DMS, which includes lots of things like workflows, audit trails, etc. DMSs are typically used to make an office go paperless, but he's not looking for a processing and tracking mechanism. He's looking for an easy way to create a searchable archive index.

My suggestion? Since he's a professor, get a bunch of students to "help with research" by scanning the docs and OCRing them. If he's willing to shell out a few hundred bucks from his research grant, there are services that will do this... Most of the best OCR tools are proprietary, not open source, but even a crappy one should get enough text that the OCRed files could be indexed usefully.

For an indexer, I've heard good things about MPS, and a friend did a similar project to yours with Yaz/Zebra, but he was working with a library, so there may have been a special reason for that.

Re:Document Management Software and OCR (-1, Troll)

Anonymous Coward | more than 5 years ago | (#27509171)

Maybe you can cry about the "sooner boomer" name some more. Really it was quite touching.

Re:Document Management Software and OCR (4, Interesting)

atrizu (1434023) | more than 5 years ago | (#27509283)

Try Koha [koha.org] . It's an open source ILS.

OCR is awful (1)

shaitand (626655) | more than 5 years ago | (#27509371)

OCR is pretty nasty stuff and it doesn't work very well at all. It's probably worth saying that the OCR results should probably only be used to generate your index and keywords.

Actually accessing the document should show the original PDF, not the error-riddled OCR scan of it.

sooner boomer (0)

Anonymous Coward | more than 5 years ago | (#27509373)

I think he's actually a crewman aboard the SSBN Oklahoma.
(jk, afaik there is no SSBN Oklahoma)

Re:Document Management Software and OCR (0)

Anonymous Coward | more than 5 years ago | (#27509453)

This is actually my line of business. Yes, document management software is exactly what he wants. Those hundreds of binders can be scanned in and indexed within a reasonable amount of time. In general we save companies hundreds of man hours a week with our software (shameless plug - www.docuware.com)

Re:Document Management Software and OCR (0)

Anonymous Coward | more than 5 years ago | (#27509487)

Adobe Acrobat Standard version has OCR built in. Yes, a chore, but there you go.

wow.. (3, Insightful)

way2trivial (601132) | more than 5 years ago | (#27509527)

2 years? ago I bought for my small business a Fujitsu fi-5120C duplex scanner--it came with Adobe Acrobat
I scan every bill, correspondence, notice, and everything to PDF- then I throw it the hell away.

the version of Acrobat included does OCR-I open Acrobat, choose create PDF from scanner, and scan away.
I can mix a scan job up between B&W & color or duplex or simplex within one job
I can open an existing PDF and append to it

I save everything to an Infrant NAS box.

I can go to Windows search, type in 1179.21 (actually did this one once)
set to look INSIDE the files of that directory and get results that include
a soda delivery notice, a soda invoice, and my bank statement where I paid it off

they have other model scanners that combine sheetfed+flatbed...

here is a beauty
http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html [fujitsu.com]

Re:Document Management Software and OCR (3, Informative)

burki (32245) | more than 5 years ago | (#27509595)

For an Open Source DMS that generates searchable PDF Files, try ArchivistaBox: http://sourceforge.net/projects/archivista/ [sourceforge.net]

Tesseract (including Fraktur / black-letter recognition) and the Linux port of Cuneiform (BSD licence) OCR engines are used for text recognition. The hocr2pdf module (see http://www.exactcode.de/ [exactcode.de]) is used to generate the searchable PDF files.
(http://sourceforge.net/forum/forum.php?forum_id=868471)

Re:Document Management Software and OCR (2, Insightful)

snspdaarf (1314399) | more than 5 years ago | (#27509775)

And, since what the Boomers and Sooners did was about 100 years ago, slamming someone for their Slashdot nickname helps with their current problem how? The Chickasaw side of my family has owned the same land since before statehood, and the German side was in on the Land Run, and we don't get all fired up about what the people were doing back then. The past is past. Learn to relax a little.

Re:Document Management Software and OCR (0)

Anonymous Coward | more than 5 years ago | (#27510381)

Another suggestion: Sphinx Search.

Re:Document Management Software and OCR (1)

mungewell (149275) | more than 5 years ago | (#27510745)

We are just starting to use Knowledge Tree, and it does have a 'Tag Field' where you can associate searchable keywords with every/any document contained in the system. It also supports the concept of linking documents, so you can manually add specific links between documents.

If 'you' are just looking to index the text contained in a series of PDFs, why not just use one of the many desktop search engines?
Mungewell.

Keywords (-1, Troll)

Anonymous Coward | more than 5 years ago | (#27508929)

Keywords: nigger, coon, jigaboo, dune coon, prarie nigger, sand nigger, porch monkey, shaved gorilla, spic, wetback, kyke, hooknose. Did I mention dune coon? Oh yeah I guess I did. Dune coon.

Re:Keywords (0, Flamebait)

pwfffff (1517213) | more than 5 years ago | (#27510349)

Kike, not kyke.

Yes, I just grammar Nazi'd the race Nazi (Nazi Nazi?).

fox? (4, Funny)

SnarfQuest (469614) | more than 5 years ago | (#27508939)

I'm trying to help drag a professor I work with into the 20th century

Maybe after that, you should try to bring him into the 21st century. You know, the one where PDF's exist?

Fail? (-1, Offtopic)

Anonymous Coward | more than 5 years ago | (#27509061)

Portable Document Format (PDF) is a file format created by Adobe Systems in 1993 for document exchange. [wikipedia.org]

Unless I'm in a parallel dimension of some sorts, the twentieth century did in fact include PDF's.

I understand though, teh intarwebs r hard 2 uze!

Re:Fail? (-1, Offtopic)

Anonymous Coward | more than 5 years ago | (#27510595)

Yes, offtopic mods, pile them on! 'Cause you know the ignorant parent really is funny! Seriously, you guys! Like watching anything penned by Joss Whedon - hilarity ensues!

Re:fox? (4, Funny)

fuzzyfuzzyfungus (1223518) | more than 5 years ago | (#27509089)

PDF has been around since 1993. That's what, six months or so after we switched from coal-fired data furnaces to vacuum tubes, right?

Re:fox? (1)

myrrdyn (562078) | more than 5 years ago | (#27509703)

PDF has been around since 1993. That's what, six months or so after we switched from coal-fired data furnaces to vacuum tubes, right?

No, but it was just in time for the eternal September coming...

Re:fox? (2, Funny)

Hognoxious (631665) | more than 5 years ago | (#27510205)

Maybe wait for the 22nd. If we're lucky, by then it won't suck. But you may still have to wait for the Hurd port.

OWL (1)

LWATCDR (28044) | more than 5 years ago | (#27508947)

http://owl.anytimecomm.com/ [anytimecomm.com]
we use this at my office. Works well for us.

Try Papers (3, Informative)

matt4077 (581118) | more than 5 years ago | (#27508955)

Papers [mekentosj.com] is Mac software that does exactly what you need, and does it very well. It's not web-based and it's Mac-only, unfortunately, but you can probably find out there what the right terms to google for are.

Re:Try Papers (0)

Anonymous Coward | more than 5 years ago | (#27510021)

I wish Ask Slashdot made it mandatory to say what OS the user is running.

I can't seem to throw a cat without hitting a friendly piece of OSX software that does this. I use Journaler. Not often, but it's doing the trick for indexing any kind of saved document I throw at it. Good software should be keyword indexing any OCR text for you when imported. Author didn't mention what the professor is using to create the PDFs, but Adobe Acrobat Standard comes with "not awful" OCR that needs a lot of attention for proofreading.

With FOSS OCR you have Tesseract, OCRopus, Clara... Can't remember what the one I used on Windows was, but it was great at the time. (~2000)

I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored.

This is what Digital Asset Management is all about, so be sure to add that DAM criteria to broaden your search. ;)

Reference management software (2, Insightful)

ckthorp (1255134) | more than 5 years ago | (#27508957)

I've had good luck with JabRef, which uses a BibTeX database on the backend so it integrates very well with LaTeX.

Re:Reference management software (2, Interesting)

eggy78 (1227698) | more than 5 years ago | (#27509605)

I never really enjoyed using JabRef, but have had pretty good luck with Aigaion [aigaion.nl] ... a little more setup but it's great for our lab, where everyone works from a common database of papers. It allows export to RIS, BibTeX, etc., although we do occasionally run into some errors with the LaTeX special characters and such. At least as far as our advisor is concerned, this absolutely revolutionized the way we handled our references. It's searchable, you can add keywords, your own annotations, include abstracts, and upload one or more attachments (the original paper) in whatever format you want. Technically it's an annotated bibliography that supports attachments, but it is pretty solid. One thing to note: We are still using a 1.3.x version of it; we haven't been brave enough or had the time to try the 2.x releases.

Re:Reference management software (2, Interesting)

joe 155 (937621) | more than 5 years ago | (#27510007)

I agree. I'm in the first year of my PhD and I've been making an effort to build an extensive bibtex database because it provides everything I need in terms of references and notes. What I do is read a paper, make pretty extensive notes on it and then put them in the abstract section of Jabref so that when you use the search function for terms it searches through all the relevant text in the article for what you work on. I've also tried to put down some keywords which are related just to make sure that they're linked with the article. Then if I want to know everything I've ever read on, say, political corruption it's just a search away.
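To show why stuffing notes into the abstract field pays off, here is a small Python sketch of the same kind of search. It does not use JabRef or parse a real .bib file; the entries are plain dictionaries with BibTeX-style field names, and the keys and notes are invented for illustration.

```python
def matches(entry, term):
    """True if the term appears in the entry's title, abstract (notes), or keywords."""
    fields = ("title", "abstract", "keywords")
    haystack = " ".join(entry.get(f, "") for f in fields).lower()
    return term.lower() in haystack

def search(entries, term):
    """Return the citation keys of all entries matching the term."""
    return [e["key"] for e in entries if matches(e, term)]

# Hypothetical entries; the abstract field holds personal reading notes
entries = [
    {"key": "doe2005", "title": "Clientelism in new democracies",
     "abstract": "My notes: argues vote buying is a form of political corruption.",
     "keywords": "corruption, elections"},
    {"key": "roe2001", "title": "Party systems",
     "abstract": "My notes: typology of party systems; little on accountability.",
     "keywords": "parties"},
]
print(search(entries, "political corruption"))
```

The first entry is found even though its title never mentions corruption, because the notes and keywords are searched along with it.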

If you wanted to add papers you've not digitized your notes for then you could put in the references and just a few quick keywords. Papers you don't have you can search through google scholar to find them. It works OK.

I've also been impressed with Papers for OSX, but Jabref can move systems really easily and is GPL.

Is the material copyrighted? (0)

bihoy (100694) | more than 5 years ago | (#27509065)

There is also the issue of making copies of any copyrighted material. Unless you have obtained permission to do so from the copyright holder (usually for a fee) you could find yourself in a whole lot of, very expensive, trouble for copyright infringement [usdoj.gov] .

Re:Is the material copyrighted? (2, Insightful)

shaitand (626655) | more than 5 years ago | (#27509313)

Copying excerpts for educational use is actually an explicitly protected fair use case. The copyright act actually uses it as an example if I remember correctly.

The parent said he is copying parts of texts, not entire books.

Re:Is the material copyrighted? (1)

Red Flayer (890720) | more than 5 years ago | (#27510045)

Copying excerpts for educational use in a classroom setting is actually an explicitly protected fair use case.

This is not a classroom setting, this is a research setting. Very different.

Though it may be covered under other criteria of fair use, the educational purposes exemption from copyright does not apply.

Re:Is the material copyrighted? (1)

Fallen Kell (165468) | more than 5 years ago | (#27509315)

I seem to remember something about "educational use" in Section 107 of the Copyright Act....

Re:Is the material copyrighted? (0)

Anonymous Coward | more than 5 years ago | (#27509421)

The electronic versions of these documents will be no more encumbered than the paper versions he already has. He makes no mention of distributing these documents to the public via the web, just to the person who's already using them. He's not selling them. Since the paper versions that he has are apparently printed PDF files, he's not even actually making new copies... he's just reorganizing the digital copies he already has.

There are no copyright issues here.

Success = Copyright Problems (1)

tobiah (308208) | more than 5 years ago | (#27510185)

I agree that for the immediate use listed there are unlikely to be any copyright violations. But if someone were to make a good collection for their lab, that perhaps then became popular in the department, it would start running into copyright gray areas. For example the university discontinues subscribing to a journal, but articles remain available on a broad intranet system. Normally if you already had a copy of the article that's legit, but now a new student has access to articles that were only available before they showed up. Or articles are scanned from copyright-legit sources and made available to a large audience, but not as large as the whole web. My guess is systems like this will be tolerated as long as they aren't very good. And when they become good, they'll be tolerated because everything else is not as good.

Re:Success = Copyright Problems (1)

imidan (559239) | more than 5 years ago | (#27510689)

You're right, of course, that such a system could run into problems if its use became more widespread. It seems like one option is to keep the content restricted--can't just add in any electronic resource without a thorough understanding of its copyright terms--like certain linux distributions only including unencumbered code, or (in theory) Wikipedia only including unencumbered images.

Or, go to the effort of keeping track of the copyright terms and encumbrances. This, obviously, is way beyond the scope of their project, but it's a service that academic libraries ought to be offering: document management services that make the users and the institution capable of, at the least, demonstrating a good faith effort to obey licensing terms, and, ideally, avoiding any infringement-style problems altogether.

Libraries are looking for ways to stay relevant in the digital age, and document management (including cataloging, indexing, ownership, tracking, search, etc.) is something that they've been doing forever.

Beagle (1)

WillKemp (1338605) | more than 5 years ago | (#27509091)

It may be worth looking at Beagle: http://beagle-project.org/ [beagle-project.org] - it's Linux only though.

Zotero (2, Informative)

k2enemy (555744) | more than 5 years ago | (#27509107)

Zotero [zotero.org] might be useful.

Re:Zotero (2, Informative)

hnwombat (172691) | more than 5 years ago | (#27510047)

I'll second this one. I'm a doctoral student, and have been using it to handle my research. A nice, simple firefox-based interface. It'll snarf references right off pages from search engines. You can attach things, including links to or copies of pdfs to those references, summaries, etc. You can apply keyword tags to citations, and you can organize the citations into a nice directory tree.

To get them out, there's a sweet interface available for open office. I think it's also available for Word, but I use M$ as little as possible.

The only real downsides are copying between computers and citation formats.

Copying is actually easier than it is with the other reference managers I've tried (yeah, I'm talking about you, refworks (bleaargh!) and end note (urrrp!)). You may have to do it more than you would with others, but it's easy to do when you need to. You can export some or all your references to a file, sneakernet the file to the new computer, import it into zotero, and you're done.

There are a lot of citation formats currently available, they just don't happen to be the ones I need. However, there's one close, and the system is designed to be extensible; it's not *that* hard to add your own styles. As soon as I get a round tuit I'll be adding styles for the journals I'll be submitting to, and contribute those back to the project.

Like I said, really sweet, and free.

Cheap scanner, expensive OCR software (4, Insightful)

MartinSchou (1360093) | more than 5 years ago | (#27509109)

Most high-end consumer all-in-one printers come with an ADF capable of handling most types of paper as long as it's not crumpled up, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer-level HP printers, and the Officejet Pro L7xxx series does this. The Pro L7680 is US$200 at Newegg.com.

Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the stuff in the kind of documents you're talking about is going to be math-heavy, mixed in with images, graphs, tables and personal notes. I don't know any OCR software that'll transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images, and neatly turn it all into easily searchable PDF documents.

That brings you from paper to searchable PDF files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times they're used.

A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
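For what it's worth, a crude version of that term counting fits in a few lines of Python. This is only a sketch under simple assumptions: the stopword list, the "at least three mentions" threshold, and the two sample texts are arbitrary choices made for illustration, and real categorization would need something smarter.

```python
import re
from collections import Counter

# Tiny, arbitrary stopword list; a real one would be much longer
STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "was", "from"}

def term_counts(text, top=5):
    """Return the most frequent words longer than 3 letters, with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 3 and w not in STOPWORDS)
    return counts.most_common(top)

def is_about(text, term, min_mentions=3):
    """Crude relevance test: the term must appear at least min_mentions times."""
    mentions = re.findall(re.escape(term.lower()), text.lower())
    return len(mentions) >= min_mentions

# Hypothetical document snippets
passing_mention = ("Surface forces in MEMS devices include stiction and, "
                   "for example, the Casimir effect.")
on_topic = ("We measure the Casimir force between plates. The Casimir effect "
            "scales with distance; prior Casimir measurements disagree.")

print(term_counts(on_topic, top=1))
print(is_about(passing_mention, "Casimir"))
print(is_about(on_topic, "Casimir"))
```

The document that mentions the term once in passing fails the threshold, while the one that keeps coming back to it passes.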

I'd suggest you start out with an experiment. Take a "typical" page from the binders and scan it to a non-compressed image (e.g. TIFF) at a decent resolution. We usually recommend around 300 dpi for OCR - beyond that you start picking up things that we don't really look for when we're reading.

Test that page against various OCR software, see what they reproduce as the output. Pick the one that's the best result.
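One way to make "best result" less subjective is to type the test page by hand once and score each program's output against it. A minimal sketch using Python's standard difflib follows; the engine names and output strings are placeholders, not results from any real OCR product.

```python
from difflib import SequenceMatcher

def accuracy(ocr_text, reference):
    """Similarity between OCR output and a hand-typed reference, from 0.0 to 1.0."""
    return SequenceMatcher(None, reference, ocr_text).ratio()

def best_engine(outputs, reference):
    """Score every engine's output and return (name of best engine, all scores)."""
    scores = {name: accuracy(text, reference) for name, text in outputs.items()}
    return max(scores, key=scores.get), scores

# Hand-typed transcription of the test page (hypothetical)
reference = "The Casimir force was measured at 300 K."

# Placeholder outputs from two imaginary OCR programs
outputs = {
    "engine_a": "The Casimir force was measured at 30O K.",
    "engine_b": "Tne Casirnir force was rneasured at 3OO K.",
}

winner, scores = best_engine(outputs, reference)
print(winner)
for name, score in sorted(scores.items()):
    print(name, round(score, 3))
```

A single page is a small sample, so it's worth repeating this with a page of plain prose and a page heavy on equations before spending money.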

And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.

Personal Document Management (3, Interesting)

steveha (103154) | more than 5 years ago | (#27509157)

I am hoping that someone will make a nice personal document management package as free software.

If you use Windows, you can buy this:

http://www.nuance.com/paperport/ [nuance.com]

The basic features would be:

  • Scan in a document (group multiple pages into a single PDF)
  • Easily scan a page and insert it into a pre-existing PDF (if you missed a page yesterday, today go back and put it in)
  • OCR the documents and provide an index to allow searching
  • Provide a really convenient photocopier feature (scan+print)
  • Fast and easy. Scan in color, but detect black-and-white and auto-convert to greyscale. Do not pop up any dialogs; when the user clicks on the "Scan!" button, start scanning.
  • Also allow dropping in saved HTML pages, OpenOffice.org documents, etc. Manage the user's saved documents, no matter what kind of documents they are.

In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.

steveha

Re:Personal Document Management (0)

Anonymous Coward | more than 5 years ago | (#27509719)

The software should also do auto-cropping. Put a magazine clipping on the flatbed, hit the "Scan!" button, and it not only scans it in, but it notices that the clipping is only 1/3 of the scanner surface and crops it down minimally, automatically.

It should also have a "scan image" mode where it saves the scan as a JPEG instead of a PDF, doesn't OCR, etc.
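The auto-cropping part is simple enough to sketch. The Python below assumes the scan is already available as a grid of grayscale values (0 = black, 255 = white) with a near-white scanner lid as background; the tolerance value and the tiny sample "scan" are invented for illustration, and a real scanner would also need deskewing and noise handling.

```python
def content_bbox(pixels, background=255, tolerance=10):
    """Return (top, left, bottom, right) of the non-background area, or None if blank."""
    rows = [r for r, row in enumerate(pixels)
            if any(abs(p - background) > tolerance for p in row)]
    if not rows:
        return None
    cols = [c for c in range(len(pixels[0]))
            if any(abs(row[c] - background) > tolerance for row in pixels)]
    # bottom and right are exclusive, so they can be used directly in slices
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def autocrop(pixels, **kwargs):
    """Crop the grid down to the smallest rectangle containing all content."""
    box = content_bbox(pixels, **kwargs)
    if box is None:
        return []
    top, left, bottom, right = box
    return [row[left:right] for row in pixels[top:bottom]]

# Hypothetical 4x5 scan: a small dark clipping on a white flatbed
scan = [
    [255, 255, 255, 255, 255],
    [255,  40,  60, 255, 255],
    [255,  50, 250, 255, 255],
    [255, 255, 255, 255, 255],
]
print(content_bbox(scan))
print(autocrop(scan))
```

The 250 pixel counts as background (it's within the tolerance of white) but survives the crop anyway because it sits inside the rectangle spanned by the darker pixels.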

Summation (2, Interesting)

Anonymous Coward | more than 5 years ago | (#27509193)

Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. They are then OCR searchable. Not nearly as good of a search algo as something like Google, as it is purely Boolean...but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last option would be to scan everything to OCR searchable PDFs, throw them into a folder, and set up Google Desktop to only search that folder...you could then essentially "google" the contents of all those PDFs.
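For anyone unsure what "purely Boolean" search means in practice, here is a generic Python sketch. It is not how Summation or Concordance are implemented; the function name, the all_of/any_of/none_of parameters, and the sample documents are all assumptions made up for this example.

```python
import re

def words_of(text):
    """Split text into a set of lowercase words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_search(docs, all_of=(), any_of=(), none_of=()):
    """Return names of docs containing every all_of term, at least one any_of
    term (if any are given), and no none_of term. No ranking, just yes or no."""
    hits = []
    for name, text in docs.items():
        words = words_of(text)
        if not all(t.lower() in words for t in all_of):
            continue
        if any_of and not any(t.lower() in words for t in any_of):
            continue
        if any(t.lower() in words for t in none_of):
            continue
        hits.append(name)
    return sorted(hits)

# Hypothetical OCR'd text keyed by file name
docs = {
    "depo_014.pdf": "Deposition of J. Smith regarding the 2004 contract",
    "memo_221.pdf": "Internal memo: contract renewal and invoice dispute",
    "email_090.pdf": "Lunch invoice attached",
}
print(boolean_search(docs, all_of=["contract"], none_of=["deposition"]))
print(boolean_search(docs, any_of=["invoice", "deposition"]))
```

Every document either matches or it doesn't, which is the difference from a ranked search that tries to put the most relevant hits first.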

Quick and dirty solution (2, Informative)

oldhack (1037484) | more than 5 years ago | (#27509205)

Assuming you have electronic versions of the documents in one format or another, stick them all in a file system and use desktop search (MS or Google). More than that and you're looking at a good bit of time and money.

Re:Quick and dirty solution (2, Funny)

pete-classic (75983) | more than 5 years ago | (#27509753)

Maybe you're unfamiliar with three ring binders [keyproductsinc.com] .

They're archaic devices used to store non-electronic paper-based documents. You can ask your granddad about them.

I'm beginning to think these kids today don't realize that the desktop metaphor is . . . a metaphor!

-Peter

Re:Quick and dirty solution (3, Funny)

oldhack (1037484) | more than 5 years ago | (#27509855)

I am a granddad, you insensitive clod.

Beagle or Google Desktop (1)

janoc (699997) | more than 5 years ago | (#27509227)

I am using Beagle and/or Google Desktop for exactly this task. Both are able to index PDFs and search them. Unfortunately, they will not deal with PDFs directly from scanner (large images), you need to process those with OCR first. I believe that both Beagle and Google Desktop are able to search the metadata too, so even for image documents you can still search authors and titles if you are diligent and fill them in when the document is scanned. This needs a bit of discipline and insight into how the data are actually stored, but if you are willing to invest the time, it works pretty well.

There are other tools, Strigi comes to mind, but that was too unstable for me. I do not know about commercial apps doing this - there are probably some, but I am a Linux user so I need not apply there... Then there are document management systems, but I think that is overkill for your needs.

Solr (0)

Anonymous Coward | more than 5 years ago | (#27509247)

Somebody will try to tell you Alfresco is the solution. Give it a shot, but I haven't met anybody who has actually been able to use the open source version in production. The commercial version is nice though and there is a 30 day trial.

Apache Solr is built on their Lucene project and does the web interface search part of what you want. There are VM images online that you can download and deploy. I don't know what you should use to do the tagging part of the project.

(Let me google that for you)^2 (1, Offtopic)

clinko (232501) | more than 5 years ago | (#27509285)

If only GOOGLE had a way to search your DESKTOP [google.com] , that would be perfect.

Yes.. Update me to the past plz! (-1, Redundant)

Anonymous Coward | more than 5 years ago | (#27509309)

"I'm trying to help drag a professor I work with into the 20th century. "

20th Century eh? Sounds like what he already has,

Perhaps you meant the 21st Century?

Papers for Mac OS X and iPhone (2, Insightful)

200_success (623160) | more than 5 years ago | (#27509321)

For Mac OS X, try Papers [mekentosj.com] . There's also an iPhone/iPod Touch version. Mac OS is great at handling PDFs in general.

Apple Spotlight (1)

troylanes (883822) | more than 5 years ago | (#27509329)

Not trying to sound like a fanboi... However, I have hundreds of data sheets for various microprocessors, ICs, power supplies, embedded APIs, 5 years' worth of emails, etc. Spotlight indexes them all beautifully, and access is very quick, only a few seconds to pull up all references. I believe Spotlight will even index network attached storage, although I could be wrong.

Re:Apple Spotlight (1)

tobiah (308208) | more than 5 years ago | (#27509791)

Yup, I've got a crude filing system with hundreds of papers that works great because of Spotlight. I don't bother with OCR for older docs, just punch in some keywords in the file description section. Network drive indexing works if the drive is formatted in HFS+.

Citeulike (1)

badger17 (1360865) | more than 5 years ago | (#27509339)

Check out http://www.citeulike.org/ [citeulike.org] Does pretty much what you are asking for. You put in the details of papers, and assign keyword tags. You can also look at other people's libraries and so on.

DevonThink on a Mac (2, Insightful)

Autumnmist (80543) | more than 5 years ago | (#27509385)

If your professor uses a Mac, consider Devonthink by DevonTechnologies.
http://www.devon-technologies.com/products/devonthink/index.html [devon-technologies.com]

For searching, the software has an artificial intelligence system, keywords, meta data. It can store PDFs, word docs, emails, notes. It can be integrated with a scanner so you can scan and store documents in the database. It's got OCR built in...

I have DevonThink (personal edition, not Pro/Office) and I don't even use 1/10 of the power built into this system. You should check out some of the reviews online and videos of people using DevonThink.

pdfhacks (0)

Anonymous Coward | more than 5 years ago | (#27509419)

http://www.pdfhacks.com/ [pdfhacks.com]

(disclaimer: not affiliated, just a user)

There are tools to index (kw_index) as well as a web based interface to a pdf collection (pdfportal).

OCR of your scanned pdfs is the enemy here. But as suggested, tesseract or google's continuation of it works pretty well.

here is a sample script from a set of tools I was experimenting with to index pdfs (all open source with windows binaries available):

pdftk example.pdf dump_data output example.data.txt
pdftotext example.pdf example.txt
kw_catcher 1000 keywords_only example.txt > example.keywords.txt
page_refs example.txt example.keywords.txt example.data.txt > example.pagerefs.txt
enscript --columns 2 --font "Times-Roman@10" --header "|INDEX" --header-font "Times-Bold@14" --margins 54:54:36:54 --word-wrap --output example.index.ps example.pagerefs.txt
ps2pdf example.index.ps example.index.pdf

All from pdfhacks, GnuWin32 and Ghostscript.

Re:pdfhacks (0)

Anonymous Coward | more than 5 years ago | (#27509749)

Note, you can also use pdftk to create a package of pdfs (attach_files parameter) which can then be searched in later versions of the free acrobat reader. (without an index though, I could find no free solution to index pdfs the way Acrobat pro does)

e.g.:

pdftk coverpage.pdf attach_files path-to-folder-full-of-pdfs\*.pdf output example-package.pdf dont_ask

Again, only the text content of the pdf package will be searchable.

Mac: Skim and Yep (1)

koick (770435) | more than 5 years ago | (#27509473)

If on a Mac, here are two you may consider (neither has a web interface).

Skim is open source and is a PDF reader and note-taker for OS X.

http://skim-app.sourceforge.net/ [sourceforge.net]

Yep is not open source, but will scan, tag and search PDFs ("like iTunes for PDFs").

http://www.ironicsoftware.com/yep/ [ironicsoftware.com]

Try 'Green Stone', a _digital library_ system. (1)

Eyeballs (64172) | more than 5 years ago | (#27509489)

http://en.wikipedia.org/wiki/Greenstone_(software)

-- From Greenstone's Web Site --
About Greenstone:
Greenstone is a suite of software for building and distributing digital library collections.

It provides a new way of organizing information and publishing it on the Internet or on CD-ROM.

Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO.

It is open-source, multilingual software, issued under the terms of the GNU General Public License. Read the Greenstone Factsheet for more information.

The aim of the Greenstone software is to empower users, particularly in universities, libraries, and other public service institutions, to build their own digital libraries.

Digital libraries are radically reforming how information is disseminated and acquired in UNESCO's partner communities and institutions in the fields of education, science and culture around the world, and particularly in developing countries.

We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book 'How to build a digital library', authored by two of the group's members.

Fujitsu ScanSnap (1)

heynonnynonny (249133) | more than 5 years ago | (#27509505)

Shameless Plug:

I would highly recommend the Fujitsu ScanSnap 510 (or 510M if you're on a Mac). It ain't free and it ain't open source, but it comes with everything you need to scan in large quantities of documents, name them, put them in the folders you want, and create OCR text-backed PDFs, so you keep your original files and have "searchable" backed text. It does double-sided scanning at about 15 pages per minute (my real-world estimate).

I just bought the Mac version and have managed to reduce two packed drawers of a file cabinet down to just a few documents of which I wanted to keep the originals. Plus, with them being text backed (per a previous post) I can use Spotlight to search for them.

My next plan is to scan in my old Engineering notes.

Fujitsu is coming out with the 1500, but I don't know much more than it's supposed to be improved. The 510 is fantastic, though. Check out the reviews on Amazon:
http://www.amazon.com/Fujitsu-ScanSnap-S510-Sheet-fed-Scanner/dp/B000RUOW66/

Included with the scanner is Adobe Acrobat in addition to ABBYY FineReader OCR software.

No Linux software that I'm aware of, but once you have the files in PDF format you can use them to your liking. They aren't particularly cheap at $450, but I've been very happy with the device's utility.

I had a HP All-in-One as well, but not having a double-sided scanner made it a pain to use.

So, what I think you're asking for is... (4, Informative)

Basilius (184226) | more than 5 years ago | (#27509515)

...something like this:

1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.

2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.

3. Full-text search isn't as important (but would be useful if available).

If that's the case, I'm thinking Alfresco [alfresco.com] might be what you're looking for. Multi-platform, open source, java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.

I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.

I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.

And, if a tool like that exists, could someone point me to it, please?
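For what it's worth, the tool described above is small enough to sketch. The following is only an illustration, not an existing product: it assumes Python with its bundled `sqlite3` module, and the function names (`build_index`, `tag`, `search`) and the two-table layout are made up for this example. It walks a root folder, records every file, lets you attach any number of tags to a file, and returns the files that carry all requested tags and match part of a file name.

```python
import os
import sqlite3


def build_index(root, db_path=":memory:"):
    """Walk `root` and record every file path in a small SQLite database."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS docs "
               "(id INTEGER PRIMARY KEY, path TEXT UNIQUE)")
    db.execute("CREATE TABLE IF NOT EXISTS tags "
               "(doc_id INTEGER, tag TEXT, UNIQUE(doc_id, tag))")
    for folder, _dirs, files in os.walk(root):
        for name in files:
            db.execute("INSERT OR IGNORE INTO docs (path) VALUES (?)",
                       (os.path.join(folder, name),))
    db.commit()
    return db


def tag(db, path, *tags):
    """Attach one or more tags (stored lower-case) to an indexed file."""
    row = db.execute("SELECT id FROM docs WHERE path = ?", (path,)).fetchone()
    if row is None:
        raise KeyError(path)
    db.executemany("INSERT OR IGNORE INTO tags VALUES (?, ?)",
                   [(row[0], t.lower()) for t in tags])
    db.commit()


def search(db, tags=(), name_part=""):
    """Return paths that carry ALL of `tags` and whose file name contains `name_part`."""
    sql = "SELECT path FROM docs d WHERE 1=1"
    params = []
    for t in tags:
        # One EXISTS clause per tag gives AND semantics across tags.
        sql += " AND EXISTS (SELECT 1 FROM tags WHERE doc_id = d.id AND tag = ?)"
        params.append(t.lower())
    rows = db.execute(sql + " ORDER BY path", params).fetchall()
    return [p for (p,) in rows
            if name_part.lower() in os.path.basename(p).lower()]
```

Because each document can carry several tags, the same paper can show up under "nanotubes" and "TEM" at once, which is exactly what a single binder heading can't do. Pass a file name as `db_path` if you want the tags to survive between runs.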

just use DSpace (0)

Anonymous Coward | more than 5 years ago | (#27509617)

It's a nice Java web app. We use it at the Institute for Clean and Secure Energy (ICSE), and it does a great job.

Tellico (1)

seyyah (986027) | more than 5 years ago | (#27509621)

Tellico [periapsis.org] for KDE might be a suitable solution. I use it extensively as a collection manager.

Use Yep! (0)

Anonymous Coward | more than 5 years ago | (#27509627)

If you are running Mac OS X, you can quickly accomplish this very thing with a piece of software called "Yep!" It will track all of your pdf's and allow you to tag them. You can do previews, groups, etc. It will sort by date, etc. Very intuitive, very fast, easy to use.

You can download it from www [dot] yepthat [dot] com/yep/index.html

It's relatively inexpensive at $34USD.

Find out what his colleagues use - nanohub.org (0)

Anonymous Coward | more than 5 years ago | (#27509637)

Sooner - There is a community of researchers who work in the nanotech field and collaborate through nanohub.org. I am not in the field, so I'm not sure how helpful it will be, but it's billed as "A resource for nanoscience and technology, the nanoHUB was created by the NSF-funded Network for Computational Nanotechnology."

This community is probably a much better place to ask the question than slashdot, IMHO. :-)

JR

Suggestion (3, Insightful)

vondo (303621) | more than 5 years ago | (#27509705)

I wrote and maintain a project to do this:

http://sourceforge.net/projects/docdb-v/ [sourceforge.net]

"DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."

It's intended for collaborations, but groups from 5 to 500 use it.

Defeating Bedlam part 2 (0)

Anonymous Coward | more than 5 years ago | (#27509747)

As a young academic I can vouch for this being a problem that is looking for a good solution. Olivia Judson talked about this issue in the NY Times a few months back (December 16, 2008, Defeating Bedlam). Folks who spend a lot of time with the literature need a version of EndNote or RefMag that stores the bloody PDF along with the citation info; storing the PDF might have taken a prohibitive amount of memory in the past, but these days memory is cheap. The program must also be able to search within the PDF; assigning keywords yourself is for chumps. "Papers" and "Yep" look good, but what about all of us who don't have the luxury of working on a Mac?

ePrints (1)

Demoriel (1478317) | more than 5 years ago | (#27509751)

Our institution uses something called ePrints [eprints.org] - I'm not sure if it's entirely what you're looking for but it does support different Subjects (headings?) and you can upload the documents using it.

I, Librarian seems pretty close (1)

nniillss (577580) | more than 5 years ago | (#27509827)

What the submitter needs (and I also need) is an organizer for scientific papers with an interface for standard fields such as authors, journal, title, doi, http links etc. I, Librarian [bioinformatics.org] seems to fulfill this need; unfortunately with direct interfaces (for retrieving pdf and meta information at the same time) only with pubmed.

If anybody knew of (or planned for) an adaptation to physics (with interfaces to arXiv.org [arxiv.org] , the APS journals [aps.org] and ideally other journals), I would be very interested (even as a paying customer).

Re:I, Librarian seems pretty close (0)

Anonymous Coward | more than 5 years ago | (#27510221)

Do you or other physicists use:

NASA ADS:
http://adsabs.harvard.edu/abstract_service.html

or CERN DS:
http://cdsweb.cern.ch/

Do you think it would be useful to integrate these databases with I, Librarian as PubMed and PubMed Central? Let me know.

Wikidata (0)

Anonymous Coward | more than 5 years ago | (#27509831)

What we need is a "Wikidata" project that would catalog every book, paper, recording, movie, etc. There are a few attempts that I know of, such as openlibrary.org and wikidata proposals on wikimedia, but nothing that I know of has reached critical mass. Such a system would be free as in freedom, include abstracts and item location info, allow users to create their own sub-database of items to search, etc. Something that would be a harbinger of death to google.

Look into FOSD Medical DMR or EMR systems (0)

Anonymous Coward | more than 5 years ago | (#27509869)

In healthcare there is a company called Laserfiche that does exactly what you are asking for. It's not free, but maybe there is a similar FOSS.

DekiWiki is a wiki that will index attachments (using Lucene) although I am not sure to the extent you'll need. It would be worth looking into also since it IS free.

I have used both and both work well. I hope that leads you into the right direction.

You need controlled vocabularies of your keywords (0)

Anonymous Coward | more than 5 years ago | (#27509939)

After you get past the easy part, which is the scanning / OCR / selection and installation of doc management software, training users, etc., you'll reach the hard part: Developing controlled vocabularies based on the ontologies specific to your domain's metadata.
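To make the "controlled vocabulary" idea concrete, here is a rough sketch in Python. The vocabulary entries and the function names (`build_lookup`, `normalize`) are hypothetical, invented for illustration; a real vocabulary would come from your domain, ideally with a librarian's help. The point is only that every free-form keyword is either mapped to one preferred term or flagged for review, so "CNT", "CNTs" and "carbon nanotubes" don't end up as three separate tags.

```python
# Hypothetical vocabulary: preferred term -> accepted variants.
VOCABULARY = {
    "carbon nanotube": ["cnt", "cnts", "carbon nanotubes", "nanotube"],
    "transmission electron microscopy": ["tem"],
    "atomic force microscopy": ["afm"],
}


def build_lookup(vocabulary):
    """Flatten the vocabulary into a variant -> preferred-term dictionary."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


def normalize(keywords, lookup):
    """Map keywords to preferred terms; return (accepted terms, unknown keywords)."""
    accepted, unknown = set(), []
    for keyword in keywords:
        key = " ".join(keyword.lower().split())  # fold case and stray whitespace
        if key in lookup:
            accepted.add(lookup[key])
        else:
            unknown.append(keyword)  # needs a human decision, not a guess
    return sorted(accepted), unknown
```

Keywords that land in the unknown list are the hard part the comment is talking about: someone has to decide whether each one is a new concept or just another spelling of an existing one.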

Talk to your school librarian. (1)

phallstrom (69697) | more than 5 years ago | (#27509945)

Don't librarians (particularly those in the library science realm) deal with this sort of thing all the time?

Digital Archive software (2, Insightful)

who's got my nicknam (841366) | more than 5 years ago | (#27509995)

What you are looking for is a proper archiving application. I suggest ICAAtom [ica-atom.org] . Scan your documents as TIFFs if you are going to be saving them as images; if your hardware will do OCR nicely, then you would be better off scanning them to text, as they will be more searchable. ICA Atom supports all of the standard archiving metadata protocols, of course, so you will have good searching capabilities as long as you enter proper metadata.

Ask your librarian (1)

danthelibrarian (734399) | more than 5 years ago | (#27510019)

This researcher should learn to talk to his local librarians. Many universities have a bibliography management system e.g. Refworks, that would be a lightweight solution. And many of the articles he has in print are quite likely now already properly digitized and available by PDF through his university library. If he's a proper researcher, he should care about more than what he has in his binder. There are likely more recent articles that reference those articles, building on that knowledge. Which he's missing. He can chat with a librarian online, or try the 20th century version of communication and make an appointment to talk in person.

Keywords aren't the total solution (0)

Anonymous Coward | more than 5 years ago | (#27510037)

Any solution that doesn't provide full text searching is less likely to be useful unless the exact, specific query from each and every user can be mandated.

I've lived thru "Document Management Guy" (actually a team of them, some with PhDs and publications) claims that keywords stored in document metadata was all that was needed. I called BS based on my years and years of DMS experience.

If an end user can't find a document, then the document doesn't exist, period. The document is useless unless the purpose is to have the document, but not have the document found. Images of text aren't generally useful without adding significant metadata based on how users will search for a document. IT people don't think like end users, so ask the end users what search terms they would use to find a few sample documents.
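As a rough illustration of why full text matters, here is a toy inverted index in Python. It is a sketch only: it assumes the text has already been extracted from each document (by OCR or a PDF-to-text tool), and the names `tokenize`, `build_fulltext_index` and `fulltext_search` are made up for this example. Real engines such as Lucene add stemming, ranking and phrase queries on top of the same basic structure.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_fulltext_index(docs):
    """Build a word -> set of document ids mapping from {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def fulltext_search(index, query):
    """Return the ids of documents that contain every word of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits
```

With this in place a user can find a paper by any word that appears in it, not just by the handful of keywords somebody thought to assign when it was filed.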

I've been away from Documentum, FileAid, Docushare and Sharepoint for a few years, but last time I used Sharepoint, the full text search results were worthless. I knew about a document - MS-Word, no less. Searches for a few specific keywords failed to locate it. Yes, it was in a collection that was indexed.

About 6 months ago, the company I work at implemented the OSS version of Alfresco. We're ok with it, but need to upgrade to v3.x to get a much better GUI. We did trial the beta v3, but it wasn't ready for use at the time and had a few flaws with version control. Those are all fixed now.

Amplify to autotag (1)

atcat (1527889) | more than 5 years ago | (#27510071)

Once you OCR all your paper (FineReader is not bad), and full-text index your PDFs (Beagle for Linux, MOSS for Windows), you'll still have a problem with narrowing down a keyword search. Try Amplify http://www.hapax.com/amplify.php [hapax.com] on the title/abstract/methods page of each document and maybe you'll get useful metadata.

Why Not To (4, Insightful)

DynaSoar (714234) | more than 5 years ago | (#27510271)

There are at least two reasons the professor's method is beneficial:

1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.

2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.

Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.

Got any money? (0)

Anonymous Coward | more than 5 years ago | (#27510281)

If you've got some money, get yourself a Google Mini for $3K or so and a scanner. The base Google mini will index up to 50,000 documents, and supports PDF as well.

http://www.google.com/enterprise/mini/fileformats.html

EndNote X (1)

daisybelle (1077153) | more than 5 years ago | (#27510313)

I'm a linguist, and I use EndNote X for storing all my papers (and books now actually too). It makes reference lists in my papers without me doing anything, but more importantly, it stores the reference/paper itself (which for me are mostly pdfs, with some Word and some html documents) with the record. There are fields in EndNote for notes, keywords, all that jazz, which are very searchable.

I would expect any modern photocopier to scan to pdf (while 150dpi is okay to look at, the OCR is better at 300dpi), then Adobe Professional does the OCR (my uni has a site license).

I actually bought a nifty tablet/pen thingy recently, and now I can write notes directly on the pdf too, in my own handwriting. I love it.

Zotero, Mendeley (2, Informative)

pesho (843750) | more than 5 years ago | (#27510357)

You should try Zotero [zotero.org] or Mendeley [mendeley.com] .

Zotero is a Firefox extension that can grab research papers directly from the journal or library web sites. It organizes the papers in collections, has keywords (they call them tags), and can automatically index the PDFs. The metadata is also stored on a remote server and you can browse through it using a web interface. You also get Word and OpenOffice plugins to insert citations in the papers you write. The plugins are a little rough around the edges, but are usable. The reference formatting is very robust and comes with styles for a lot of journals.

Mendeley is a stand-alone application. I haven't tried it yet, but it seems to have very similar functionality.

BibDesk (1, Informative)

Anonymous Coward | more than 5 years ago | (#27510559)

I use BibDesk to organize my research. It's not perfect for that as its basic use is to cite and all that but you can actually import pdfs and tag them, i.e. put them into smart folders. It does have the collaborative approach to organizing data in a flat structure similar to that of delicious.

Zotero (1)

Volfied (307532) | more than 5 years ago | (#27510585)

Zotero is what you want. Integrates smoothly into a research workflow. Great for managing research materials of all kinds. Powerful search and tagging features. Adding sources is quick and easy and it works hand in glove with lots of research databases. Also interoperates with Word or OpenOffice to manage citations and bibliographies.

www.zotero.org

Homebrew Solution (1)

mraiser (1151329) | more than 5 years ago | (#27510615)

As an alternative to off-the-shelf software you could create a series of HTML pages, put them online and let Google index them for you. Create a separate HTML page for each scanned document with the desired keywords and a link to the document. Create an index page with a link to each of these HTML pages. Link to that page on your home page or blog or any other page that you know Google scans. Wait a few days and search your site via Google. Build up a two-column table in your favorite spreadsheet application with the file name in one column and the keywords in the other. Export as CSV, and with a little coding in the programming language of your choice, you can generate the whole set of HTML in no time. Cheap, free, and easy!
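The "little coding" part could look something like the Python sketch below. It is an illustration under assumptions, not a finished tool: it assumes the exported CSV has exactly two columns (file name, then keywords) and no header row, that the scanned documents sit in the same folder as the generated pages, and the function name `generate_site` and the `docNNNN.html` naming are invented here.

```python
import csv
import html
import io
from pathlib import Path


def generate_site(csv_text, out_dir):
    """Write one HTML page per CSV row plus an index page; return the page count.

    Each row is assumed to be: file name, keywords (free text).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    links = []
    for number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed rows
        filename = html.escape(row[0].strip())
        keywords = html.escape(row[1].strip())
        page = f"doc{number:04d}.html"
        (out / page).write_text(
            f"<html><head><title>{filename}</title>"
            f'<meta name="keywords" content="{keywords}"></head>'
            f"<body><h1>{filename}</h1><p>{keywords}</p>"
            f'<p><a href="{filename}">Open the scanned document</a></p>'
            "</body></html>\n",
            encoding="utf-8",
        )
        links.append(f'<li><a href="{page}">{filename}</a></li>')
    # The index page is the one you link to from a page Google already crawls.
    (out / "index.html").write_text(
        "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n",
        encoding="utf-8",
    )
    return len(links)
```

Read the exported file with `Path("papers.csv").read_text()` (the file name is just an example) and pass the result in as `csv_text`. The keywords are written into the visible page body as well as the meta tag, since the body text is what a search engine is most likely to index.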

Suggestion in using Zotero (0)

Anonymous Coward | more than 5 years ago | (#27510627)

I am currently using Zotero (http://www.zotero.org/) to organize all my articles and citations. It is open source, developed by George Mason University. The software works as an add-on to Firefox, and automatically downloads the citation or the PDF of the article. The citation can then be tagged with various labels and all the words in the article are searchable. I haven't used the tagging feature much, but the software has already proven invaluable in research and paper writing.

CiteULike (0)

Anonymous Coward | more than 5 years ago | (#27510629)

I'm a biology grad student and have been dealing with some similar issues. I've ended up using an online app (CiteULike) that has a great tagging interface and uses a bookmarklet for posting from journal sites, ISI, and PubMed.

It also has a great bibtex export/import feature. Since I'm using LaTeX for my dissertation I'm slowly migrating to a BSD licensed Mac program called BibDesk. Its tagging interface could use a little work though.

I've tried Zotero, and heard good things about Mendeley and Papers, but none of them have worked as well for me.

Try Aigaion (0)

Anonymous Coward | more than 5 years ago | (#27510717)

I've used Aigaion [aigaion.nl] for managing all of my documents and references in the course of my Ph.D. I now recommend it to all of my grad students.

The website calls it a "Web based bibliography management software"

From the site:

"Both for individual researchers as for research groups or projects, it is of major importance to organize the literature one has read. A well organized bibliography is a powerful instrument. It speeds up the search for publications one has already read and supports the user in structuring information. Aigaion provides a bibliography management software environment that supports a user in just this: Organizing and managing a complete bibliography, from small bibliographies to bibliographies for a complete research department."
