Best Way to Build a Searchable Document Index?

ScuttleMonkey posted more than 6 years ago | from the build-a-better-boss-trap dept.

Software 216

Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, Access, and PDF's." What methods or tools have others seen that work? Anything to avoid?


Gee I don't know.. (4, Funny)

Anonymous Coward | more than 6 years ago | (#20816465)

You're the one gettin' paid, you figure it out.

Easy (0, Redundant)

Anonymous Coward | more than 6 years ago | (#20816473)

Grep and flat files. The way God intended.

Re:Easy (2, Informative)

avronius (689343) | more than 6 years ago | (#20817217)

If you host all of your documentation on a website, take a look at ht://dig [].

I've deployed it across a handful of servers, and it does a good job of crawling, but doesn't do well with JavaScript. If you have JavaScript for your web frontend, you can write a shell script that runs find . -print, prepends the base URL to each path, and writes the results into a file, then point htdig at that file. It will dig into each file it finds, and create a searchable database of everything that it finds.

You add /cgi-bin/search.cgi to your page, and you can auto-magically search your documentation.
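
A minimal sketch of the URL-list step described above, in Python rather than shell. The function names, the document root, and the base URL are hypothetical placeholders; how ht://Dig is pointed at the resulting file depends on its configuration and is not shown here.

```python
import os
from urllib.parse import quote


def build_url_list(doc_root, base_url):
    """Return one URL per file under doc_root, much like `find . -print`
    with a base URL prepended to every relative path."""
    urls = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
            # Use forward slashes and percent-encode so the path is a valid URL.
            rel_url = quote(rel.replace(os.sep, "/"))
            urls.append(base_url.rstrip("/") + "/" + rel_url)
    return sorted(urls)


def write_url_list(doc_root, base_url, out_path):
    """Write the URL list to out_path, one URL per line; return the count."""
    urls = build_url_list(doc_root, base_url)
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(urls) + "\n")
    return len(urls)
```

Percent-encoding matters here because file names on shared drives often contain spaces, which a crawler would otherwise misread.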

- Avron

Re:Easy off the shelf (2, Informative)

homey of my owney (975234) | more than 6 years ago | (#20817569)

If you're looking for an index, a document management system probably makes sense. This one [] is inexpensive and very good.

Lucene (5, Informative)

v_1_r_u_5 (462399) | more than 6 years ago | (#20816485)

Check out Apache's free Lucene engine, found at [] . Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.

Re:Lucene (5, Informative)

Anonymous Coward | more than 6 years ago | (#20816545)

Yes. It's hard to beat Lucene if you don't mind working at the API level. If you want a ready-built web crawler, check out Nutch, which is based on Lucene.

Re:Lucene (3, Informative)

BoberFett (127537) | more than 6 years ago | (#20816583)

I haven't used Lucene, but as for commercial software, I've used dtSearch and ISYS, and they are both excellent full-text search engines. Both have web interfaces as well as desktop programs, and SDKs are available for custom applications. They scale to a massive number of documents per index, in a large variety of formats, and are very fast. They have additional features, like returning documents of all types as HTML, so no reader other than a browser is required on the front end and legacy formats are easier to access.

Re:Lucene (1)

jafac (1449) | more than 6 years ago | (#20818005)

About 10 years ago, I used a product called Folio, which was the same product Novell used for their "Novell Support Encyclopedia" - they had a great set of robust tools, including support filters for a wide enough variety of formats (and tools to write your own), and your data would all compile down to what they called an "Infobase" - which was a single indexed file containing full text, markup, graphics, and index - readable in a free downloadable reader with a pretty decent (for 1996) search engine, that did boolean, stem, and proximity searches.

It was very convenient to be able to give to either a reseller, or high-end customer, this single-file, containing reasonably up-to-date support information on our product that they could search on. It was probably the #1 thing our tech support department spent money on that reduced call volume. Then we got bought, and the new tech support director, an IBM guy, replaced everything with Lotus Notes - (and other headcount-increasing, empire building tools).

The Folio company was bought up and I think the technology is now used by the company that does Lexis-Nexis.

For now, my current employer is using Lucene.

Re:Lucene (1)

BoberFett (127537) | more than 6 years ago | (#20818235)

Yep, Folio Views is another one, but I have no personal experience with it. I wrote two commercial software packages (CD and internet legal research systems) using ISYS and dtSearch, and I'm familiar with Folio because Lexis was a competitor of the company I worked for. I'm not sure what its complete capabilities are, but I have to imagine it's comparable.

Re:Lucene (4, Informative)

knewter (62953) | more than 6 years ago | (#20816659)

Lucene's good. If you haven't yet, have a look at Ferret, a port of Lucene for Ruby. It's listed as faster than Lucene. I've used it in 20+ projects now as my built-in full-text index of choice, and it's pretty great. You can easily define your own ranking algorithms if you'd like. You can find more information on Ferret here: []

I've got a prototype of the system described in the OP that we did while quoting a fairly large project. It's really easy to have an 'after upload' action that'll push the document through strings (or some other third party app that can operate similarly, given the document type) and throw the strings into a field that gets indexed as well. That pretty much handles everything you may need.

Obviously I'd also allow someone to specify keywords when uploading a document, but if this engine's going to just be thrown against an existing cache of documents, strings-only's the way to go.
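
A rough sketch of that 'after upload' idea, written in Python instead of Ruby/Ferret. It imitates what strings does (runs of at least 4 printable ASCII bytes) and stores the result next to user-supplied keywords. The names extract_strings, after_upload, and search are made up for illustration, and the plain dict stands in for a real full-text index.

```python
import re

# Runs of 4 or more printable ASCII bytes, similar in spirit to strings(1).
_PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data: bytes) -> str:
    """Pull human-readable runs out of an arbitrary binary document."""
    return " ".join(m.decode("ascii") for m in _PRINTABLE_RUN.findall(data))


def after_upload(name, data, keywords, index):
    """Hypothetical upload hook: store keywords plus extracted text."""
    index[name] = {
        "keywords": [k.lower() for k in keywords],
        "text": extract_strings(data).lower(),
    }


def search(index, term):
    """Return names of documents whose keywords or extracted text match."""
    term = term.lower()
    return sorted(
        name
        for name, rec in index.items()
        if term in rec["keywords"] or term in rec["text"]
    )
```

A real deployment would swap the substring test for a proper indexer, and would use a format-aware extractor where one exists, since raw byte scanning misses compressed or UTF-16 text.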

Re:Lucene (2, Informative)

caseydk (203763) | more than 6 years ago | (#20817061)

DocSearcher - [] - already does it. A friend with the US Coast Guard wrote it 4+ years ago, I deployed it within the Department of Justice for a few projects, and it's pretty widely used among some of the local tech circles. It even plugs into Tomcat if you want a web-based UI.

Re:Lucene (0)

Anonymous Coward | more than 6 years ago | (#20816701)

One of the few products I've found that you can search for partial strings, not just from the beginning of a word, is Copernic.


Re:Lucene (1)

Dadoo (899435) | more than 6 years ago | (#20817293)

I checked out the documentation on Lucene, and it appears to be designed for searching the documents on a few web servers.

In my situation, I've got a couple dozen servers (mostly Windows, but some Linux), and maybe 8TB of data, mostly in Word documents, Excel spreadsheets, etc. Can Lucene (or Nutch) scale up to something like that? I'd also like it to search Windows network drives. Is that possible?

Re:Lucene (2, Informative)

dilute (74234) | more than 6 years ago | (#20818197)

Lucene is strictly an indexing engine. It wants to index text. It can index metadata as well as full text. Your surrounding application gets the files to index from wherever (local hard drive, database BLOBS, remote Windows shares or what have you). We don't care if the files are Word, PDF, Powerpoint, HTML, or whatever. A parser (many free ones available) extracts the text. We also don't care what web server you are using - using the index to identify and retrieve files is a totally separate process. Lucene indexes the text stream one-by-one and stores the results in a very efficiently organized index. It has been ported to a bunch of languages, including dotNET. I haven't tried it on terabytes of data but it rips through gigabytes very fast. Assuming all 8 terabytes don't change between runs the scale should be no problem.
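
To make the "index metadata as well as full text" point concrete, here is a toy fielded inverted index in Python. It is not Lucene's API, just an illustration of the data structure; the TinyIndex class and the field names are assumptions, and the text is assumed to have already been extracted by whatever parser you use.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class TinyIndex:
    """Toy inverted index: maps (field, term) to the set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """fields: dict of field name -> already-extracted text."""
        for field, text in fields.items():
            for term in TOKEN.findall(text.lower()):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return ids of documents whose field contains every query term."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get((field, term), set())
            result = set(ids) if result is None else result & ids
        return result
```

Keeping the field in the key is what lets a metadata query ("title contains vpn") stay separate from a full-text query over the body.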

If you needed to run, say 100 indexing engines in parallel and merge the indexes, you'd have to research that. Somebody's probably done it.

Only one real choice (-1, Flamebait)

Anonymous Coward | more than 6 years ago | (#20818199)

Docs Open [] . Sure, it's Windows based, but if you are a real company playing in the real world, you are already Windows based by now.

Really... your data is worth too much to trust to some FOSSie app supported by an unemployed hippy living in his mom's basement and his 15 year old boy-toy. And if it's not... well, you won't have much use for a document storage solution anyway. Also, "free software" just means you get your head chopped off one centimeter at a time by the armada of consultants it's going to take to support it. Better to just pay da man up front and enjoy the free support your IT staff gets for the life of the product. FOSS never, ever, wins in the TCO battle.

Apache can't even figure out how to get their app to install properly across all Lunix versions, and you honestly expect people to trust them with a data storage solution? Puh Leeze. Yeah, I'll be modded down, but the truth always is around here.

Google (3, Informative)

Anonymous Coward | more than 6 years ago | (#20816493)

We have a Google appliance, but you can do it with regular Google, too. Just make sure you disable caching (with headers or by encrypting documents). Then place an IP or password restriction for non-Google crawlers (check IP, not user-agent). People will be able to search with the power of Google, but only people you allow in will be able to get the full documents.
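
A small sketch of the "check IP, not user-agent" rule, in Python. The network ranges below are placeholders taken from the reserved documentation and private address blocks, not real crawler ranges; you would have to substitute ranges you have verified yourself, and the function name is hypothetical.

```python
import ipaddress

# Placeholder ranges only. 192.0.2.0/24 is a documentation block, NOT a real
# crawler range; 10.0.0.0/8 stands in for an internal staff network.
CRAWLER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
STAFF_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def _in_any(addr, networks):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)


def may_fetch_full_document(remote_addr, user_agent=""):
    """Allow full documents only for approved crawler or staff addresses.

    user_agent is deliberately ignored: it is supplied by the client and
    trivially spoofed, which is why the check is on the address.
    """
    return _in_any(remote_addr, CRAWLER_NETWORKS) or _in_any(remote_addr, STAFF_NETWORKS)
```

For example, a request from 203.0.113.9 claiming to be "Googlebot/2.1" is still refused, because only the address is consulted.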

If you value your privacy, invest in a Google mini, though.

Re:Google (5, Informative)

rta (559125) | more than 6 years ago | (#20816607)

Previous place i worked we had a Google Mini and it was better than anything we had come up with in-house.

We even pointed it at the web-cvs server and bugzilla and it was great at searching those too.

To see all the bugs still open against v2.2.1 or something like that, Bugzilla's own search was better, but for searching for "bugs about X" the Google Mini was great.

It only cost something like $3k, IIRC.

not exactly what you asked about, but you should definitely see if this wouldn't work for you instead.

Re:Google (2, Interesting)

TooMuchToDo (882796) | more than 6 years ago | (#20816867)

Seconded. I've done implementations (hosted and in-house) of both Google Minis and the full-blown Enterprise appliances. They are amazing creatures. I would recommend the Mini to almost anyone, while the Enterprise costs a pretty penny.

Re:Google (3, Informative)

shepmaster (319234) | more than 6 years ago | (#20818185)

The company I work for, Vivisimo [] , makes an awesome search engine. Although I've never dealt with the Google box directly, I know that we have had customers get fed up with the Google box and replace it quite easily with our software. Click the first link to see a pretty flash demo, or go to [] to try out a subset of the functionality for real. We specialize in "complex, heterogenous search solutions", which exactly fits most intranet sites I've seen. Files are on SMB shares, local disks, Sharepoint, Lotus, Documentum, IMAP, Exchange, etc, etc, etc. We connect to all those sources and provide a unified interface. You can do really neat tricks with combining content across multiple repositories, such as metadata from a database added to files on SMB shares. We support Linux, Solaris, and Windows, all 32 and 64 bit. Although I may work here, it really is a great product, and I use it at home to crawl my email archives and various blogs, websites, forums, things that I use frequently but have sucky search.

Re:Google (1)

TooMuchToDo (882796) | more than 6 years ago | (#20818279)

How does the pricing compare to the Google product offerings? And does your licensing allow us to offer hosted solutions?

lucene...duh (-1, Redundant)

Anonymous Coward | more than 6 years ago | (#20816495)

Meta tags placed? (3, Insightful)

harmonica (29841) | more than 6 years ago | (#20816497)

Who places what types of meta tags in the documents? I don't understand the requirements.

Generally, Lucene [] does a good job. It's easy to learn and performance was fine for me and my data (~ 2 GB of textual documents).

Meta tags are worthless, generally (4, Insightful)

Anonymous Coward | more than 6 years ago | (#20817029)

Meta tags are worthless, generally, unless you have a librarian who ensures correctness.
I've worked in electronic document management in 3 different businesses and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.

Just to prove that your question is missing critical data:
  - how many documents?
  - how large is the average and largest documents?
  - what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
  - what search tools do you use elsewhere?
  - any budget constraints?
  - did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
  - Did you consider OSS solutions? htdig, e-swish, custom searching?
  - A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.

AND if i didn't get this across yet - DON'T TRUST META DATA IN HIDDEN DOCUMENT FIELDS - bad Metadata in MS-Office files will completely destroy the usefulness of your searches.

Re:Meta tags placed? (2, Insightful)

rainmayun (842754) | more than 6 years ago | (#20817123)

I don't understand the requirements.

I don't either, and that's because the submitter didn't give enough information. I'm working on a fairly large enterprise content management system for the feds (think 2.5 TB/month of new data), and I don't see any of the solution components we use mentioned in any thread yet. If I were being a responsible consultant, I'd want to know the answers to the following questions at minimum before making any recommendations:

  • What is the budget?
  • How many documents are we talking about? The answer for 10,000 is different than for 10,000,000.
  • Are you looking for off-the-shelf, or is software development + integration going to be involved?
  • Who is going to maintain the integrity of this data?

Although I am as much a fan of open source as anybody, I don't think the offerings in this area are anywhere near the maturity of commercial offerings. But some of those offerings cost a pretty penny, so it might be worthwhile to hire a developer or two for a few weeks or months to get what you want.

Re:Meta tags placed? (1)

Ctrl-Z (28806) | more than 6 years ago | (#20817299)

This doesn't sound like an enterprise-scale problem. I work for a large ECM vendor, and unless the IT department that we're talking about is huge, ECM is going to be overkill. Not that that would stop vendors from trying to sell it to you if you have enough money to put on the table.

Re:Meta tags placed? (1)

rainmayun (842754) | more than 6 years ago | (#20817887)

You're probably right. Then again, I think a lot of smaller and growing organizations underestimate the volume of their data and the value of organizing it well. Get it right now, and maybe they grow enough to need a real ECM solution.

Google Desktop or Appliance (3, Insightful)

wsanders (114993) | more than 6 years ago | (#20816507)

Because if you have to spend more than an hour on this kind of project nowadays, you're wasting your time.

The inexpensive Google appliances don't have very fine-grained access control, though. But I am involved in several semi-failed projects of this nature in my organization, both new and legacy, and my Google Desktop outperforms all of them.

Re:Google Desktop or Appliance (1, Interesting)

Anonymous Coward | more than 6 years ago | (#20816783)

and copernic desktop search outperforms GDS by a long way...

Re:Google Desktop or Appliance (1)

sootman (158191) | more than 6 years ago | (#20816807)

Since Google Desktop works by running its own little webserver, can you install Google Desktop on a server and access it by visiting http://server.ip.address:4664/ ? (I'm at work and my only Windows box has its firewall options set by group policy.)

Re:Google Desktop or Appliance (1)

slazzy (864185) | more than 6 years ago | (#20816911)

Yes - it is possible to configure Google Desktop that way (it is disabled by default). There are also a few programs out there that were designed to access Google Desktop search remotely: []

Re:Google Desktop or Appliance (2, Informative)

NoNeeeed (157503) | more than 6 years ago | (#20816967)

Yep, a Google appliance (or equivalent, there are others on the market such as X1) is the way to go.

I set up a Google Mini for indexing an internal wiki, our bug tracking system, and some other systems, and it is very straight-forward.

I know the original question mentioned meta-data, but you have to ask yourself if the meta-data is going to be maintained well enough that the search index will be valid. Going the Google Appliance route is so much simpler. It takes a bit of tweaking to set up the search restrictions, but once up and running, it works flawlessly. Most importantly, it doesn't require everyone to make sure that all their document meta-data is perfect.

Google appliance pricing is really quite cheap when you compare it to the time cost of setting up a meta-data driven system.

Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.


Personal GSA experience (1)

PIPBoy3000 (619296) | more than 6 years ago | (#20817013)

We've been quite happy with our Google Search Appliance.

The two exceptions are the way it handles secured documents (on our mostly-Windows network, that meant authenticating twice or doing complicated Kerberos stuff), and hardware (we've had two boxes fail with drive issues in the last year).

Still, when it comes to search results and speed, it's been very good. I'm also a fan of Google Desktop, but that's a completely different story and more difficult to centrally manage.

Re:Google Desktop or Appliance (0)

Anonymous Coward | more than 6 years ago | (#20817081)

Take a look at IBM/Yahoo's free (as in beer) enterprise search box. I have had good experiences with it and especially like its open forum (the IBM developers help out as much as they can). [] []

Swish-E (2, Informative)

ccandreva (409807) | more than 6 years ago | (#20816517)

Re:Swish-E (1)

nuzak (959558) | more than 6 years ago | (#20816951)

swish-E is fast, but the quality of its search results is just awful. We use socialtext at work, which uses swish-e for search, and the search results may as well just be random.

I don't think it even handles unicode, either.

Re:Swish-E (1)

PinkPanther (42194) | more than 6 years ago | (#20817973)

Swish-E's configuration is pretty flexible even when it comes to relevancy ranking, though it is also quite non-intuitive for lots of different aspects of the configuration.

And yes, it does not support UTF-8/Unicode/anything-non-ASCII-8 [] .

But the developer list is quite active and responses are usually accurate (though they also can be terse and sometimes overly-authoritative).

Open Source (1)

HartDev (1155203) | more than 6 years ago | (#20816537)

There are many open source solutions for what you are trying to do. Also, if you want it to be portable, then I would suggest a CMS that does not require a MySQL database, like "Limbo". What does the organization do?

Re:Open Source (-1, Redundant)

Anonymous Coward | more than 6 years ago | (#20817615)

but if the user is heterosexual they certainly don't want to use open source. what alternatives exist for them?

Re:Open Source (0)

Anonymous Coward | more than 6 years ago | (#20818219)

There are many open source solutions for what you are trying to do. Also, if you want it to be portable, then I would suggest a CMS that does not require a MySQL database, like "Limbo". What does the organization do?
Exactly what's wrong with a DB based solution? After all, how much more portable can it get? And why MySQL specifically? (That wouldn't be a slam on Alfresco or Mondrian, would it? Both run on more than MySQL)

Avoid kat and beagle (0)

Anonymous Coward | more than 6 years ago | (#20816539)

Hmm, avoid kat and beagle. Consider using a one-line Perl script using find and grep instead...

Re:Avoid kat and beagle (1)

Tablizer (95088) | more than 6 years ago | (#20816657)

Hmm, avoid kat and beagle. Consider using a one-line Perl script using find and grep instead...

But it still needs some kind of indexing. I don't think they want a sequential search every time a query is run, unless it's a small set. Thus, a database of some sort is probably in order. SQLite may be a quick way to go, although I've heard nasty things about its ODBC drivers (maybe since fixed). But I envision a schema something like:

    table: tags

    table: tag_doc

Perhaps also have a "document" table to give ID's to documents instead of using paths, and also a document summary description.
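
One possible reading of that schema, sketched with Python's built-in sqlite3 module. Only the table names tags, tag_doc, and document come from the comment above; every column name and helper function here is a guess added for illustration.

```python
import sqlite3

# Column names are assumptions; only the table names come from the thread.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document (
    doc_id  INTEGER PRIMARY KEY,
    path    TEXT NOT NULL UNIQUE,
    summary TEXT
);
CREATE TABLE IF NOT EXISTS tags (
    tag_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS tag_doc (
    tag_id INTEGER NOT NULL REFERENCES tags(tag_id),
    doc_id INTEGER NOT NULL REFERENCES document(doc_id),
    PRIMARY KEY (tag_id, doc_id)
);
"""


def open_index(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, path, summary, tag_names):
    """Insert a document and link it to its (lower-cased) tags."""
    cur = conn.execute(
        "INSERT INTO document (path, summary) VALUES (?, ?)", (path, summary)
    )
    doc_id = cur.lastrowid
    for name in tag_names:
        name = name.strip().lower()
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT tag_id FROM tags WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO tag_doc (tag_id, doc_id) VALUES (?, ?)",
            (tag_id, doc_id),
        )
    conn.commit()
    return doc_id


def find_by_tag(conn, tag_name):
    """Return document paths carrying the given tag, sorted by path."""
    rows = conn.execute(
        "SELECT d.path FROM document d "
        "JOIN tag_doc td ON td.doc_id = d.doc_id "
        "JOIN tags t ON t.tag_id = td.tag_id "
        "WHERE t.name = ? ORDER BY d.path",
        (tag_name.strip().lower(),),
    ).fetchall()
    return [r[0] for r in rows]
```

The tag_doc join table is what makes the many-to-many lookup an indexed query instead of a scan over every file.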

Beagle, Spotlight? (2, Insightful)

Lord Satri (609291) | more than 6 years ago | (#20816541)

Is this something that would suit your needs: Beagle for Linux [] , Spotlight for OSX [] ? I haven't tried Beagle (I don't have root access on my Debian installation at work), but Spotlight is probably my most cherished feature in OSX... it's so useful.

Re:Beagle, Spotlight? (1)

hejog (816106) | more than 6 years ago | (#20816633)

Spotlight doesn't support indexing of network volumes, which is a killer. It'll support them in Leopard (you'll be able to search Spotlight indexes of remote servers). We can't wait.

Re:Beagle, Spotlight? (1)

Constantine XVI (880691) | more than 6 years ago | (#20818221)

I don't think Beagle and Spotlight are really network-friendly. There's not really any point of having each and every machine having to index all the drives on the network. It'd be better to have some sort of networked solution.

Google Appliance (1)

spcherub (798144) | more than 6 years ago | (#20816555)

If you have a reasonable budget *and* an intranet, you can consider implementing a Google Appliance and pointing it at the network location that houses the documents. The side benefit is that documents can be found/accessed via a browser.

All depends on the document filters ... (1)

molarmass192 (608071) | more than 6 years ago | (#20816569)

Since you're indexing non-text data, you'll need a search engine that has plenty of document filters. We use Oracle Text to do something similar to this, but it's not for the faint of heart. The nice thing about Oracle Text is it includes filters for pretty much any document you'd want to index (PDF, Word, Excel, etc). Of course, Oracle Text query syntax needs an awful lot of lipstick to be made to look like Google query syntax. WMMV.

google (0)

Anonymous Coward | more than 6 years ago | (#20816573)

get your site indexed by google then add parameters like
filetype:xml ;-)

cheap quick and easy

mnogosearch (1)

nereid666 (533498) | more than 6 years ago | (#20816575)

Try []; it is like a small, free Google spider.

XML document formats (2)

mind21_98 (18647) | more than 6 years ago | (#20816611)

If you're using Office 2007, you can probably hack something together really quickly to pull the meta tags from the files and put them in a database. Not sure about the other formats you need, though--and support from Google, for instance, would probably be beneficial for your company anyway. Hope that helps!
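
A quick sketch of what that hack might look like in Python. Office 2007 files (.docx, .xlsx, .pptx) are zip packages, and the core document properties normally live in a part named docProps/core.xml; this sketch assumes that conventional part name instead of resolving it through the package relationships. The function name and the choice of fields are illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Output key -> qualified element name inside docProps/core.xml.
FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "subject": "dc:subject",
    "keywords": "cp:keywords",
}


def read_core_properties(path_or_file):
    """Return a dict of core properties, or {} if the part is missing."""
    with zipfile.ZipFile(path_or_file) as zf:
        try:
            xml_bytes = zf.read("docProps/core.xml")
        except KeyError:
            return {}
    root = ET.fromstring(xml_bytes)
    props = {}
    for key, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            props[key] = node.text.strip()
    return props
```

The resulting dict could be written straight into a database row, though this does not help with the older binary .doc/.xls formats or with PDF.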

Re:XML document formats (1)

flyingfsck (986395) | more than 6 years ago | (#20817853)

"pull the meta tags from the files" You think so? Usually there is absolutely no relationship between meta data and the file contents. Just think of meta tags on web pages...

Google corporate search (1)

Gwala (309968) | more than 6 years ago | (#20816619)

A googlebox. Indexes file shares and internal websites and makes them searchable. Can be a little pricey though.

Google backdoor appliance (1, Insightful)

Anonymous Coward | more than 6 years ago | (#20817245)

And it reports everything straight back to Google! Such a deal!

what to avoid (2, Insightful)

Anonymous Coward | more than 6 years ago | (#20816631)

You should avoid any system that relies on individual employees putting in these meta-tags. It won't work; they either won't do it, or will do it wrong (spelling errors, inventing their own tags on the fly, and so on). And then you'll catch hell when they can't find one of those documents they mislabeled. Trust me.

Most easy solution (4, Informative)

PermanentMarker (916408) | more than 6 years ago | (#20816653)

It will cost you some bucks, but just buy MS SharePoint Portal Server and leave the indexing to SharePoint.
You're not even really required to use added tags... (as most people will put in poor tags).

But if you like you can add tags even with sharepoint.

Re:Most easy solution (3, Informative)

DigitalSorceress (156609) | more than 6 years ago | (#20817455)

Actually, if you are an MS shop and have Microsoft Server 2003, SharePoint Services 3.0 is free (as opposed to the SharePoint Portal Server, now renamed, I believe, to Microsoft Office SharePoint Server, which does indeed cost a packet).

I do a lot of LAMP development, and I'm not the strongest fan of Microsoft for a lot of things, but if you have an MS desktop and MS Office environment, SharePoint Services really is quite decent for INTRANET applications, especially for collaboration. You can set up workflows for check-out/check-in, and it integrates really nicely with some of the more recent MS Office releases. If you connect it to a real MS SQL Server on the back end (as opposed to the Express edition that it defaults to), you can have full-text indexing even with the free SharePoint Services version. The only need for the full-blown Portal/MOSS version is if you think you are going to have a large number of SharePoint sites and want to simplify cross-connecting and management. (At least as far as I can recall.)

I'm not saying SharePoint is the way to go, but I'd at least read up on it and consider it IF you have a lot of MS Office stuff that you plan on indexing/sharing.

I'd strongly advise avoiding it if you plan to do Internet-based stuff though... at least until you get a good enough understanding of the security issues involved that you feel that you really know what you're doing.

Just my $0.02 worth.

Re:Most easy solution (1, Insightful)

Anonymous Coward | more than 6 years ago | (#20817613)

Just a couple bucks, $60-70 per seat, plus how much for the server software?

Re:Most easy solution (1)

JacobO (41895) | more than 6 years ago | (#20817847)


I've never seen a SharePoint site where the search worked well at all (particularly in the document libraries.) You might think by its observable behavior that it is simply offering up documents at random instead of searching their contents.

Livelink (0)

Anonymous Coward | more than 6 years ago | (#20816681)

We use Livelink.

It's huge, kludgy, awkward, slow, resource intensive, but it works (when it's up).

Re:Livelink (2, Funny)

Ajehals (947354) | more than 6 years ago | (#20816745)

You are in marketing aren't you?

(I'm sold anyway)

Upload it to the web (2, Funny)

omgamibig (977963) | more than 6 years ago | (#20816685)

Let google do the indexing!

Check out Alfresco! (2, Interesting)

thule (9041) | more than 6 years ago | (#20816713)

I posted this before on Slashdot. I discovered a while ago a cool system called Alfresco [] . There are free (as in liberty) and commercial versions. It acts as an SMB (like Samba), FTP, and WebDAV server, so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface allows users to set metatags, retrieve previous versions of a file, and, most importantly, search the documents in the system.

Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.

Don't use SAMBA for .doc and .pdf's, use Alfresco.

I am not affiliated with Alfresco, just a happy user.

Re:Check out Alfresco! (2, Interesting)

G1369311007 (719689) | more than 6 years ago | (#20817155)

Along the same lines of Alfresco is Plone. I'm currently the sole admin of a Plone site serving ~50 users on an intranet. We use it for document management etc. Just another option. The CIA's website is made in Plone so it can't be that bad right?!

Alfresco is doc based, Plone is web based (0)

thule (9041) | more than 6 years ago | (#20817307)

Last I saw, Plone is very web-centric. It is designed to manage a web site. Alfresco is designed to handle documents. It is more similar to SharePoint than Plone.

Re:Alfresco is doc based, Plone is web based (1)

seanmeister (156224) | more than 6 years ago | (#20818205)

Actually, Plone does a pretty good job of document indexing/management, particularly when you plug in TextIndexNG3 [] . Add Enfold Desktop [] on top of that, and you've got desktop and MS Office integration easy enough for any office drone to use.

X1 (1)

bhovinga (804122) | more than 6 years ago | (#20816775)

For searching Microsoft products you can't beat X1 [] as far as user interface.

Also see Xapian (4, Informative)

dmeranda (120061) | more than 6 years ago | (#20816793)

I'd suggest you should consider a full-text search engine. First start here: []

If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.

Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.

Xapian seems to be based on a little more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based front ends to many languages. It might not have as polished of a front end or as fancy of a website as Lucene, but I believe it's a better choice if you have really really huge data sets or want to venture outside the Java universe.

There are also many other wholly self-contained indexers, mostly based on web indexing, with spiders, query forms, etc. all bundled together, such as ht://Dig, mnogosearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).

Re:Also see Xapian (1)

mikeboone (163222) | more than 6 years ago | (#20817903)

I've had good luck for several years using Xapian integrated with PHP. It did take some work to integrate but it's fast and flexible.

Re:Also see Xapian (2, Informative)

risk one (1013529) | more than 6 years ago | (#20817999)

I agree that Lucene is a great choice specifically for Java shops, but it does have ports for pretty much all major languages. The Java implementation is the 'mothership', but you can use Lucene with PHP, Python, .NET, C++, or whatever.

Secondly, I'd like to point out Lemur [] . It's an indexing engine similar to Lucene, but geared much more toward the language modeling approach of information retrieval. All IR approaches will use either a vector space based approach or a language model approach. Lucene does vector space very well, but it's difficult to get it to do language model based retrieval (although extensions are available), Lemur can do both. Lemur also has Indri, a search engine written on top of Lemur, which can parse html, PDF and xml. And like Lucene, Lemur has multiple language ports of the API.
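
For anyone unfamiliar with the vector space approach mentioned above, here is a bare-bones illustration in Python: documents and the query become TF-IDF weighted term vectors, and results are ranked by cosine similarity. This is a textbook sketch with a smoothed idf chosen for simplicity, not the scoring formula of Lucene or Lemur, and the function names are made up.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return TOKEN.findall(text.lower())


def rank(query, docs):
    """docs: dict of name -> text. Returns [(name, score)], best first,
    leaving out documents that share no terms with the query."""
    tokenized = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for counts in tokenized.values():
        df.update(counts.keys())

    def weights(counts):
        # Smoothed idf, so a term found in every document keeps some weight.
        return {
            t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, tf in counts.items()
        }

    def norm(vec):
        return math.sqrt(sum(w * w for w in vec.values()))

    q_vec = weights(Counter(tokenize(query)))
    q_norm = norm(q_vec)
    results = []
    for name, counts in tokenized.items():
        d_vec = weights(counts)
        dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        denom = q_norm * norm(d_vec)
        if dot > 0 and denom > 0:
            results.append((name, dot / denom))
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

A language model approach would instead estimate how likely each document is to generate the query, which is a different scoring function over the same kind of index.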

A final point I would like to make is that IR is a very actively researched field. If you're going to do your own coding (specifically the retrieval model), I suggest you buy a book and get reading. Most of the basic problems (and there are many) have been figured out and it'll save you a lot of trouble if you just read up on how to update an index or find spelling suggestions, instead of figuring it out for yourself. It's possible to index your documents with Lucene and run searches on them in half an afternoon, but it takes some basic knowledge to get it right, and make the app useful. (Look at wikipedia's search for an example of what you get when you don't follow through, and stop after it seems to work ok).

Depends on size of document base (2, Insightful)

Nefarious Wheel (628136) | more than 6 years ago | (#20816815)

It depends on the size of your document base, and how you're going to store it -- if you're using something industrial-strength like Documentum or Hummingbird then the Google Mini won't index it, you have to go up a notch and use the yellow box solutions. And if you're using Lotus Notes, you'll need a third party crawler such as C-Search. Google Desktop can be bent into some solutions, and it's free, but for many users you're better off having a separate server do the indexing. Google bills on the number of documents you need to keep in the index at once, and they throw in a bit of tinware to support that on a 2 year contract.

Disclaimer: I flog Google search solutions at work, so I'm way biased.

use a Wiki instead (4, Insightful)

poopie (35416) | more than 6 years ago | (#20816871)

Directories full of random documents in random formats of random version with varying degrees of completeness and accuracy tend to get less useful as an information source as time goes on. Docs get abandoned and continue to provide outdated information and dead links. Doc formats change and require converters to import. Doc maintainers leave the company.

If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.

You can use something like MediaWiki or Twiki or... heck you can use a whole variety of content management systems.

The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.

Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.

Re:use a Wiki instead (2, Interesting)

LadyLucky (546115) | more than 6 years ago | (#20817805)

We've been using Confluence, from Atlassian, for our wiki, and it's pretty fully featured for a wiki.

I'd Be Interested... (1)

morari (1080535) | more than 6 years ago | (#20816905)

In an image-based solution. My business requires customer access to literally thousands of individual images. It would be nice to be able to scan them all in and tag them appropriately (multiple tags!) so as to create an easily searchable database.

Extensis + SQL Connect (0)

Anonymous Coward | more than 6 years ago | (#20816991)

I'm doing something similar at work. The Extensis web stuff they sell is a bit pathetic, so you're better off buying their SQL connect stuff and building your own web frontend.

Re:I'd Be Interested... (1)

Bronster (13157) | more than 6 years ago | (#20818081)

Wow, you run a porn site too?

True hackers (1)

athloi (1075845) | more than 6 years ago | (#20816929)

Always write their own homebrew search engines.

FileNet (1)

drhamad (868567) | more than 6 years ago | (#20816961)

How about IBM FileNet? Or are you looking for something free? We use FileNet everywhere I've been.

The downside to the suggestions like Google Appliances is that you're then storing this information on Google servers.... something that most companies find HIGHLY objectionable (security).

Re:FileNet (1)

James Youngman (3732) | more than 6 years ago | (#20817137)

No, if you buy a Google appliance, the index is stored on the appliance, not on Google's servers. That's kinda the point.

Re:FileNet (1)

rainmayun (842754) | more than 6 years ago | (#20817203)

Um, they aren't "Google servers" in the sense that Google owns and operates them... you buy it and run it in your own enterprise, the same way someone might run, say an "Oracle server".

Re:FileNet (1)

drhamad (868567) | more than 6 years ago | (#20817213)

I should amend what I said. I was referring to Google Desktop Search, not the standalone, separated Google Appliance application/hw.

I'll plug my software (1)

vondo (303621) | more than 6 years ago | (#20816985)

DocDB can interface to the search engines others are suggesting, but organizing your documents with decent meta-data in the first place (and not on a Wiki that is allowed to rot) is also important. That's what DocDB does.

$$$ - Universal Content Management (1)

hrieke (126185) | more than 6 years ago | (#20817015)

So far, everything I've read really doesn't sound like it's geared towards an enterprise level; mostly put a bunch of files out there in a folder somewhere and let a crawler index them. That's all good and fine until someone gets the idea to search for the payroll documents...

At work, and granted it's a fair size HMO, we use Universal Content Management by Oracle, formerly known as Stellent's Content Management System.
UCM allows for named accounts with control over access, plus a full audit log, full change log (plus revisions), and a centralized location for searching for documents of any type (Word, Excel, Powerpoint, AutoCAD, MPG movies, TIFF, etc.).

Supports work flows as well, a nice plus if something needs to go through a formal process, gives nice audit trails, and supports 3 different full text indexes (FAST, Verity, and database) of the stored content.

This might not be for everyone, but it is a decent tool for large size companies to manage documents.

Google (1)

JimDaGeek (983925) | more than 6 years ago | (#20817017)

Seriously, spend a tiny bit of money on a Google Appliance and get excellent search. I tried to use MS stuff, like the built-in index server and it just wasn't good enough.

We got a Google "appliance" and the damn thing just works, and works well. I don't work for Google, nor do I get paid if they make a sale. Just saying what worked great for us.

Anything to avoid? (0)

Anonymous Coward | more than 6 years ago | (#20817019)

How about avoiding Word docs, Excel spreadsheets and Access databases?

Swish++, HyperEstraier (1)

ecloud (3022) | more than 6 years ago | (#20817151)

I once integrated Swish++ as a document search system for a MediaWiki installation, to handle uploaded documents. I liked the results so then I started using it to build an index on a large codebase so I could quickly find all usages of a particular symbol (in source files, libraries and executables too). The catch is you have to define how to translate each type of file into plain text so it can be indexed. There are plenty of tools available for Word docs, PDFs, nm for libraries, etc. Compared to some others I think Swish++ has the advantage of speed. I haven't tried Lucene but my feeling is I'd rather not use Java for that unless the whole system is in Java.
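The "translate each type of file into plain text" step can be sketched as a lookup from file extension to an external converter. This is not Swish++'s own filter configuration, just an illustration in Python. The extension table and function names are assumptions, and it presumes `antiword`, `pdftotext` and `nm` are installed and on the PATH.

```python
import subprocess
from pathlib import Path

# Assumption: these command-line tools are installed; extend the table as needed.
CONVERTERS = {
    ".doc": lambda p: ["antiword", p],        # Word document -> text on stdout
    ".pdf": lambda p: ["pdftotext", p, "-"],  # "-" sends the text to stdout
    ".a": lambda p: ["nm", p],                # symbol table of a static library
    ".o": lambda p: ["nm", p],                # symbol table of an object file
}


def command_for(path):
    """Return the converter command for a file, or None if it is read as-is."""
    builder = CONVERTERS.get(Path(path).suffix.lower())
    return builder(str(path)) if builder else None


def to_text(path):
    """Produce indexable plain text for one file."""
    cmd = command_for(path)
    if cmd is not None:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout
    # No converter registered: treat the file as plain text already.
    return Path(path).read_text(errors="replace")
```

The output of `to_text` is what you would hand to the indexer. Files with no registered converter are read directly, which is fine for source code and HTML but will produce noise for unknown binary formats.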

HyperEstraier has an excellent reputation but I haven't tried it yet. It's harder to get going with it.

Too bad Beagle is written in .net; sounded like a well-integrated solution otherwise...

Requirements spec (1, Insightful)

Anonymous Coward | more than 6 years ago | (#20817179)

The requirements spec there reads like most of the projects I've worked on the last few years. *sigh*

In light of the above I can't (IGC) recommend anything specific, but I can advise you to avoid :-

1) In house solutions (expensive, usually buggy).
2) Anything from Thunderstone (If they've fixed the numerous Vortex bugs over the years I might revise my opinion but my last experience was painful).
3) MS Full text search/indexing (slow - and yeah you can throw a load of hardware at this but hardly the optimal solution).
4) Lucene (I've seen too many sites with dead Lucene searches).

The recommendations re Google are probably safe bets ("nobody ever got fired for buying Google") and I've had a lot of success with Swish-e for smaller (20,000 docs) projects.

cough cough (1)

Evets (629327) | more than 6 years ago | (#20817265)

Microsoft Index... oh never mind. I can't get it out with a straight face.

Lucene is the way to go. There are APIs for Perl for dealing with Lucene data sets and for many other languages as well. Nutch is a good place to start getting to know the power of Lucene - you can get a nutch crawler interface up and running quickly and you can browse through some of the source files to get an understanding of how to bring in various file formats - Office documents, PDFs, etc.

The Google Search boxes are decent, but with any commercial solution you end up paying fees for the number of documents in your index. They open source the code, presumably because of OSS components (maybe even Lucene) but the documentation they publish is laughable.

I do this in several programming languages (2, Informative)

MarkWatson (189759) | more than 6 years ago | (#20817371)

There are 2 problems: getting plain text out of documents, then indexing the plain text

A good tool for getting plain text out of various versions of Word documents is the "antiword" command line utility.

The Apache POI project (Java) can read and write several Microsoft Office formats.

For indexing: I like Lucene (Java), Ferret (Ruby+C), and Montezuma (Common Lisp).
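For the second problem, indexing the plain text, the core data structure behind engines like these is an inverted index: a map from each term to the set of documents containing it. The sketch below is a toy version in Python, not the API of Lucene, Ferret, or Montezuma; the class name, tokenizer, and AND-only query semantics are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumption: terms are runs of lowercase letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")


class InvertedIndex:
    def __init__(self):
        # term -> set of document ids that contain the term
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index the plain text already extracted from one document."""
        for term in TOKEN.findall(text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result
```

Usage would look like `index.add("vpn.doc", extracted_text)` for each document, then `index.search("vpn setup")`. Real engines add ranking, stemming, and on-disk storage on top of this.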

I have mostly been using Ruby the last few years for text processing. Here is a short article I wrote using the Java Lucene library using JRuby: []

Here is another short snippet for reading documents in Ruby: []

You might just want to use the entire Nutch stack, which collects documents, spiders the web, has plugins for many document types, etc. Good stuff!

Many options, here are a couple (0)

Anonymous Coward | more than 6 years ago | (#20817435)

General purpose indexing is not very finely tuned. When it comes to enterprise documentation and the indexing around it, you get out what you put into it. So if you don't have standards for meta-data then it's not going to meet the needs of your management. For example, if they type in "insert name of internal project" then they feel they should get all the stuff surrounding that project, even if the name of that project is nowhere to be found in that document.

Give EMC a call and talk to them about Documentum and all the things around it. Or organize your documents around projects/collaborations and use SharePoint and all the tools around it. Those are a couple of things to get you started.

If money's not an object... (2, Informative)

djpretzel (891427) | more than 6 years ago | (#20817457)

As a Documentum developer, especially in light of the recent 6.0 release, I'd be remiss not to recommend it for such a purpose. It's expensive, rather complex, and requires solid development talent to implement, but is almost infinitely configurable and customizable, and there are separate components (at cost, of course) that can add on all sorts of fun functionality like collaboration, digital asset management, etc. It has the ability to auto-tag documents based on configurable rules using Content Intelligence Services and supports extensible object hierarchies, workflows, lifecycles, taxonomies, web services, you name it. It's probably overkill for the user in question, and it's far from open source (although EMC is doing an admirable job at encouraging code exchange, and the new dev. environment is based on Eclipse), but it's pretty darn slick when you look at the ground it covers, functionally.

WSS 3.0 (1)

madagajs (1165331) | more than 6 years ago | (#20817535)

I'd recommend WSS 3.0. It can search any document that you can find or write an IFilter for, and many are built in out of the box. It also provides an area for people to discuss a document without having to place comments in emails or the document itself (which could inadvertently make their way to a client).

Install Wumpus Search (1)

gvc (167165) | more than 6 years ago | (#20817657)

It is free, libre. Wumpus Search.

Re:Install Wumpus Search (2, Informative)

gvc (167165) | more than 6 years ago | (#20817677)

Sorry, mangled the URL in the parent: []

Re:Install Wumpus Search (1)

Hatta (162192) | more than 6 years ago | (#20818271)

Wow, this option looks especially nice considering they use fschange to obviate the need for constant reindexing of the drive. fschange tells wumpus when a file changes and what particular portion of that file, so it can reindex the part it needs to. The constant reindexing and related performance problems are what stopped me from using Beagle, etc. Of course fschange requires a kernel patch, but big deal.

Full text search? (1)

CoffeeIsMyGod (1136809) | more than 6 years ago | (#20817679)

You may want to consider something besides full text searching (Google and company) as this usually starts to degrade fairly quickly with the size of the documents. So far I have not seen anything that actually comes close to human-indexed documents. There are several tools that help users tie documents into a pre-built taxonomy or thesaurus so you get consistency, accuracy, and a *well designed* solution, not random machine learning grouped results. They usually cost money, though, so be prepared. I think that Lucene has a taxonomy module that is in beta mode so that really helps. Automated categorization is still quite terrible so you will need to sit a few users down and have them tag the data. You will be much happier with the result, honestly.

IBM OmniFind (1)

bbdd (733681) | more than 6 years ago | (#20817693)

well, this may not apply to you, as you do not mention the size/number of items to index.

but, for small shops where there is no money to throw at this type of thing, try IBM OmniFind Yahoo! Edition. can't beat the price. []

Lucene (1)

AtomicDevice (926814) | more than 6 years ago | (#20817783)

I had an internship over the summer at a large scholarly journal archiving company that used Lucene. I found it to be very easy to learn and powerful to use and customize. I was easily able to manage the tags and whatnot for documents. I also didn't really notice any issues with scalability: we indexed millions of documents and were able to search them just fine. It also has some nice basic options to get you on the way with semantic indexing if that is your bag (there are some better tools for that, but a Lucene index is a good place to start for those).

You may be asking the WRONG question... (1)

ivi (126837) | more than 6 years ago | (#20818093)

Haven't you listened to the TED talk
on Spaghetti Sauces, which refers to Moskowitz's idea that:

  There isn't a (one) best , only best

eg, best spaghetti sauceS, etc.

Some find one best, others find another best [for them]...

Whatcha think?

Access Control and Searching (1)

queenb**ch (446380) | more than 6 years ago | (#20818135)

The project that I worked on was also concerned with who was able to access the data. For that reason, we used a wiki-like format, converted everything into text using a variety of conversion methods, and assigned access controls to it via an in-house web-based application.

This allowed for the full text to be searchable, and provided a reference back to the original file. If it was in a digital format, like a Word document, it was also stored in the database. If it wasn't, it referenced a physical file. The user could suggest modifications to the searchable entry if an error was found. The archive team would investigate the suggested correction and usually implement it.
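A rough sketch of that arrangement, with searchable text, a pointer back to the original, and per-entry access control, might look like the following in Python. The field names, group-based permission model, and substring matching are all assumptions for illustration, not a description of the in-house application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    doc_id: str
    text: str                  # converted plain text that gets searched
    source: str                # reference back to the original (stored file or physical location)
    allowed_groups: frozenset  # groups permitted to see this entry


def search(entries, query, user_groups):
    """Return entries matching the query that the user is allowed to see."""
    needle = query.lower()
    groups = set(user_groups)
    return [
        e for e in entries
        if needle in e.text.lower() and e.allowed_groups & groups
    ]
```

The point of filtering inside `search` rather than in the UI is that a user without the right group never learns that a restricted document matched at all.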

2 cents,


Hack Something Together? (1)

LionKimbro (200000) | more than 6 years ago | (#20818175)

I once spent 4 hours hacking together a symbol indexer for the 10,000+ CPP files in our source code repository. I wrote it in Python. It worked by brute force: "For every directory, for every .h or .cpp or .c file, crack it open, and, line by line, look for all instances of this regex..."

It's a little slow-- 10 seconds to look up all instances of a symbol. And it takes ~3 hours to refresh the full index.

But it saves an enormous amount of time, makes impossible tasks possible, and I have used it every day since. It's been about 8 months now, and it's been absolutely wonderful.

It would have taken far longer and many more resources to begin to figure out how to hook in Lucene, or some other heavy duty package.
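For anyone curious what such a brute-force indexer looks like, here is a minimal sketch of the same idea. It is not the original script; the regex (plain C-style identifiers), the in-memory dictionary, and the function names are assumptions.

```python
import os
import re
from collections import defaultdict

SOURCE_EXTS = (".h", ".cpp", ".c")
# Assumption: a "symbol" is any C-style identifier.
SYMBOL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_index(root):
    """Walk every directory under root and map symbol -> [(path, line number)]."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(SOURCE_EXTS):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    # set() so a symbol repeated on one line is recorded once
                    for symbol in set(SYMBOL.findall(line)):
                        index[symbol].append((path, lineno))
    return index


def lookup(index, symbol):
    """Return every (path, line number) where the symbol appears."""
    return sorted(index.get(symbol, []))
```

Building the index is the slow part, so in practice you would save it to disk (for example with `pickle`) and only rebuild it periodically.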