Lucene in Action

timothy posted more than 8 years ago | from the lucene-and-ricking dept.

Software 109

Simon P. Chappell writes "I don't know about you, but I hardly bother with browser bookmarks any more. I used to have so many bookmarks, back in the early days of Netscape's 4 series, that I would have to regularly trim and edit my bookmark file to prevent my browser from crashing on startup -- that's a lot of bookmarks, folks! Now, I go to my favourite web search engine, enter a couple of appropriate search terms and voila, there's my page! Search engines are so ubiquitous that we rarely give much thought to the technology that powers them. Lucene in Action by Otis Gospodnetic and Erik Hatcher, both committers on the Lucene project, goes behind the HTML and takes you on a guided tour of Lucene, one of a generation of powerful Free and Open-Source search engines now available." Read on for the rest of Chappell's review.

Who's it for?

Lucene is a library and framework, rather than a complete application. It truly is an engine, around which you are expected to build and extend your own application. Like Lucene, the book is targeted at those who are looking for a tool to build their own search facility application rather than just "download and go." The book does include a number of case studies of Lucene usage (including at least one download and go search engine) but those are included to show how to use and adapt Lucene to fit differing environments rather than as ends in themselves.

The Structure

The book is sensibly divided into two parts. The first part looks at "Core Lucene" functionality, while the second part addresses "Applied Lucene".

Part one has six chapters, covering the central components and inner workings of Lucene. It's here that the book starts with a tutorial introduction, familiarising the reader with the concepts of Lucene as a search engine around which you wrap your own code. The other five chapters move steadily through good search engine fare, with indexing getting the whole of chapter two to itself. The discussion of how to retrieve text from the documents being indexed is mentioned here but postponed until chapter seven, where it is dealt with exhaustively. Chapter three covers searching, and especially how Lucene ranks documents.

Chapter four examines analysis. In its chapter introduction, the book explains that "Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation, terms." This process is performed by an analyser, which tokenises text according to its own built-in rules. Each analyser has a different emphasis: some want only dictionary words, others might explicitly include acronyms, and sometimes you'll want an analyser that blocks stop words (those words in a language that are part of the structure, but that add nothing to the information being conveyed by the text; classic examples of stop words in English include "a", "and" and "the").
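To make the idea of analysis concrete, here is a minimal sketch in Python rather than Lucene's own Java. The function name analyze and the stop-word list are assumptions made for illustration; they are not taken from the book or from Lucene's API, and a real analyser applies far richer rules.

```python
import re

# Hypothetical stop-word list for illustration; real analysers ship their own.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}

def analyze(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(analyze("The quick brown fox and the lazy dog"))
```

The surviving tokens are the "terms" that would go into the index; swapping the regular expression or the stop-word set is the toy equivalent of choosing a different analyser.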

Chapter five looks at advanced search techniques; everything from sorting search results, searching on multiple fields to filtering searches. Many free or open source software tools are extensible, and Lucene is no exception. Chapter six addresses creating and using custom components within Lucene, everything from custom sort methods to custom filters.

Part two, the final four chapters, covers Applied Lucene. It is dedicated to practical uses of Lucene and answers the question "So, what can I do with a search engine?" Chapter seven covers ways and means to parse common, non-plain text document formats. The primary formats covered are RTF, XML, PDF, HTML and Microsoft Word. The ability to parse and index these file formats will cover the search engine needs of the majority of Lucene users. Chapter eight looks at a number of Lucene tools and extensions that are available, many of them free and open source software. Chapter nine covers ports of Lucene. While for many users, Lucene being a Java library is not a problem, some users want its functionality in environments that do not have Java. The chapter looks at ports written in C++, C#, Perl and Python. Lastly, chapter ten takes a thorough look at seven Lucene case studies. Perhaps the "star" case study is the one about Nutch, a download and go search engine written by Doug Cutting, the original author of Lucene.

There are three appendices. The first offers installation advice for Lucene; a useful addition that those newer to working with Java libraries will surely appreciate. The second appendix has a very well explained description of the Lucene index format. This is the kind of information that can be hard to find, so it is welcome in a book of this sort. The last appendix contains a number of categorised resource references. The number and breadth of the resources provided could provide quite an incredible education in information retrieval theory if the reader was inclined to read them all.

What's to Like?

There are several things to like about this book. Let's start with the fact that the authors are part of the core development team of Lucene. This gives them both credibility and an excellent understanding of the internal workings of Lucene. Co-author Erik Hatcher is a fantastic writer, having previously been a co-author of the only Ant book worth bothering with, Manning's Java Development with Ant. (Full disclosure: I do know Erik personally.)

The structure of the book is well thought out and each chapter does seem to move your understanding forward when combined with what you learned from the preceding ones. The division into core and applied Lucene is also helpful. While you'd hope that this was the case, it often isn't; hence I note it as a positive.

I especially appreciate that this book does not fill up page after page with API documentation. The authors appear to have grasped that if you have Internet access to download the software, you might just be able to access the documentation online; rather, they concentrate on the way to use the software. What a concept!

As a part of Manning's "in Action" series, the book has excellent layout and has obviously been thoroughly edited by both technical evaluators and copyeditors. This might seem to be a small thing to some, but a well-edited book stands out clearly from the crowd.

What's to consider?

If you are looking for a book on using and configuring a download and go style of search engine, this book would be less suitable. While the case study on Nutch is of good length, it would be too short to be useful as a configuration guide.


I enjoyed reading this book. If you have any text searching needs, this book will be more than sufficient equipment to guide you to successful completion. Even if you are just looking to download a pre-written search engine, this book will provide a good background to the nature of information retrieval in general and text indexing and searching specifically.

You can purchase Lucene in Action from Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

Raise your hand if... (0)

Anonymous Coward | more than 8 years ago | (#13391998)

in Firefox you hover your cursor over the scroll up "button" in the Bookmarks menu, then hit the up arrow key on your keyboard because you have so many bookmarks that it takes too damn long to scroll all the way down to the bottom (and yes I know I could categorize).

Re:Raise your hand if... (2, Informative)

Black Perl (12686) | more than 8 years ago | (#13392031)

Raise your hand if you have jettisoned all browser bookmarks and just use [] (and/or the wonderful bookmarklets or Firefox plugin for it).

Re:Raise your hand if... (1)

Mr. Mosty-Toasty (449993) | more than 8 years ago | (#13392181)

I use Spurl [] , which has a nice Firefox sidebar and the possibility to mirror bookmarks automatically to So in case one of the two services goes down or turns evil, I don't lose all my bookmarks.

Re:Raise your hand if... (1)

elambrecht (24773) | more than 8 years ago | (#13392434)

is nice, but I don't want to depend on that site always being around (plus I've got private bookmarks I don't want to share).

I wrote a Firefox extension to have the best of both worlds: my bookmarks are still stored locally, but they're now automatically backed up on a web site. Additionally, with this extension my other machines can synchronize with that web site, so I've got all my bookmarks stored locally on all my machines and they are periodically and automatically synchronized between them.

Check it out: []

Re:Raise your hand if... (0)

Anonymous Coward | more than 8 years ago | (#13392541)

(plus I've got private bookmarks I don't want to share)

Okay, stop being stingy with teh pr0n. If you've got good stuff you've an obligation to share!

They say... (0)

Anonymous Coward | more than 8 years ago | (#13392007)

They say that necessity is the mother of invention. And I need this search engine like a kitten needs to suck on a deep-sea thermal vent.

Nice that I can search the Lucene site with Google (-1, Offtopic)

Anonymous Coward | more than 8 years ago | (#13392028)


My solution (3, Interesting)

Neil Blender (555885) | more than 8 years ago | (#13392035)

My home page is a nicely sorted webpage with all my frequently visited links in a password protected section of my web site. If something gets used enough in my bookmarks, it gets put on that page and gets deleted from my bookmarks. Then, no matter where I am or what computer I am on, I can access my links.

Re:My solution (1)

keilinw (663210) | more than 8 years ago | (#13392063)

Yes, I've managed to do this too... right now I'm using MediaWiki. I found it to be a faster alternative to HTML and I can have friends / colleagues, etc help maintain it. I just started using wikis recently... but now I'm addicted... I use it for more than just bookmarks too.. such as my list of restaurant reviews.

Re:My solution - Booby online PIM (1)

marcop (205587) | more than 8 years ago | (#13392313)

I use Booby [] for online bookmarks. It requires PHP and one MySQL database. I have access to my bookmarks regardless of what computer I'm on.

Re:My solution - Booby online PIM (0)

Anonymous Coward | more than 8 years ago | (#13392466)

I use Booby for online bookmarks.

Which, apparently, is deceased and was reincarnated as Brim [] .

Looks interesting though, thanks.

Re:My solution - Booby online PIM (1)

Captain Splendid (673276) | more than 8 years ago | (#13392626)

Why is everybody going so high-tech about this? I just use the syncmark extension for FF and back the bookmarks file up to a directory on my webspace. It's not pretty, but it works great when I'm on somebody else's PC.

Re:My solution - Booby online PIM (1)

Neil Blender (555885) | more than 8 years ago | (#13392703)

Why is everybody going so high-tech about this? I just use the syncmark extension for FF and back the bookmarks file up to a directory on my webspace. It's not pretty, but it works great when I'm on somebody else's PC.

My solution is not really high tech, but it has one major advantage over bookmarks - it's way, way faster. Scrolling through bookmarks is slow and tedious because scrolling in general is slow and tedious. If you have them on your home page, the first thing you are presented with is links. It only takes a fraction of a second to select one. Hell, when I open my browser, my cursor is already on its way to the position of the link I want before the page is even loaded.

Your solution solves the other reason I use the solution I do. I lost my bookmarks one too many times (and that was back in 1999.) Now, if I lose them, I usually only lose 20-30 unimportant ones. In fact, I lost my bookmarks a few months ago upgrading Firefox and was too lazy to retrieve them from a backup.

Re:My solution - Booby online PIM (0)

Anonymous Coward | more than 8 years ago | (#13393428)

Why is everybody going so high-tech about this? I just use the syncmark extension for FF and back the bookmarks file up to a directory on my webspace. It's not pretty, but it works great when I'm on somebody else's PC.

Well, maybe because Brim (aka, BooBy) also maintains a contact list, calendar, task manager, notes, a password manager, etc, in addition to bookmarks.

Like their slogan says, it's "a multithingy something".

Re:My solution (1, Interesting)

Anonymous Coward | more than 8 years ago | (#13392788)

Well, exactly.

The "home" button on a browser is supposed to take you to YOUR OWN web space, maintained by you - i.e. your home. Some bits might be your front garden, visible to others, others private.

People who use the "home" button as just another bookmark to a search engine are missing the point of the web.

It isn't helped by the fact that current browsers aren't actually good as editors (unlike the original web browser vision) - Your web site should be a WYSIWYG-editable personal/private pseudowiki. Many people have the wiki part down now, but are stuck typing markup into text boxes when it should be a matter of pointing your browser at the site and making the change.

Where are the WebDAV compliant WYSIWYG wikis?

Microsoft kindof-sortof tried with frontpage, and Amaya and Mozilla Composer are both in existence, but the problem is most people on the net are now brainwashed into drooling "consumers" of corporate media instead of being active participants in society i.e. "citizens".

Wow, open source search engines. (2, Interesting)

keilinw (663210) | more than 8 years ago | (#13392038)

Thanks! I was looking for a good book on Open Source search engines. While I have never heard of "Lucene" I will definitely be looking into it now. It's probably a good opportunity to learn all about Search Engine Heuristics, methods, etc...

Also, I agree with the author that bookmark functionality has gone the way of the dinosaurs... with the exception of the "open all tabs" feature found in many browsers today... that is about the only one that I use often.

I'm just wondering how the "search" functionality will actually play out in the future. Apple has "Spotlight" and Microsoft is supposedly incorporating magick folders or something like that into Vista. Can anyone tell me more about Lucene and how it differs from say Google or other search engines?



Re:Wow, open source search engines. (1, Funny)

Anonymous Coward | more than 8 years ago | (#13392087)

I was looking for a good book on Open Source search engines.

Well, you could have used Google to fi-- oh, I see.

Re:Wow, open source search engines. (0)

Anonymous Coward | more than 8 years ago | (#13392702)

Lucene is great, used it on a project and we were so surprised when it outperformed a 50000$ product (no names).

They have got it right with lucene, indexes are about 30% of the data and indexing is super quick. That 50000$ product took us 12 hours to index our document set. Come lucene and we were done in about 2 hours.

Re:Wow, open source search engines. (1)

jdray (645332) | more than 8 years ago | (#13393156)

I liked the "search this site with Google" box at the top of the Lucene page. Irony, right?

Lucene in a nutshell (over-simplified) (5, Informative)

juancn (596002) | more than 8 years ago | (#13392352)

Lucene is not like google (it's not a full application), it's a library focused on searching text based documents (you could use it to build a mini-google).

The basic idea is that you want to build an index, and then search it, to find some document.

A document has several fields (e.g. text, title, lastModificationDate, author, categories, summary, url, etc.) which may be indexed, stored, or both.

You usually build your lucene documents, based on some real documents (e.g. web pages, PDF, records in a database, etc.), and then add them to the index.

Once you have an index, you build a query to search one or more fields (lucene provides a QueryParser class, which handles the most common cases), and you get a Hits collection containing the documents matching your query in some order (this can be customized).

Before a document is added to the index, it is passed through an Analyzer which converts the text in the fields to terms, which are the basic internal concept that is indexed.

Another interesting feature of lucene indexes is that they can be searched while they are being built without noticeable loss of search performance, and that they are process-safe (many processes can access them for reading, only one for writing), this has the drawback that the indexes are append-only (actually a separate index is created if you modify an index), but periodical optimization of the indexes removes unnecessary entries and inefficiencies.
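The steps above (build documents from fields, add them to an index, search a field, get hits back) can be sketched roughly as follows. This is a toy in Python, not Lucene's Java API: the class TinyIndex and its methods are hypothetical names invented for this example, and tokenising by whitespace stands in for a real Analyzer.

```python
from collections import defaultdict

class TinyIndex:
    """Toy in-memory index; names here are illustrative, not Lucene's API."""

    def __init__(self):
        self.postings = defaultdict(set)   # (field, term) -> set of doc ids
        self.stored = []                   # doc id -> dict of stored field values

    def add(self, fields):
        """fields maps name -> (value, indexed, stored)."""
        doc_id = len(self.stored)
        kept = {}
        for name, (value, indexed, stored) in fields.items():
            if indexed:
                # Whitespace splitting stands in for a real analyzer.
                for term in value.lower().split():
                    self.postings[(name, term)].add(doc_id)
            if stored:
                kept[name] = value
        self.stored.append(kept)
        return doc_id

    def search(self, field, term):
        """Return stored fields of every document whose field contains term."""
        ids = sorted(self.postings.get((field, term.lower()), ()))
        return [self.stored[i] for i in ids]

index = TinyIndex()
index.add({"title": ("Lucene in Action", True, True),
           "text": ("a guided tour of the lucene search library", True, False)})
index.add({"title": ("Java Development with Ant", True, True),
           "text": ("build tool book", True, False)})

hits = index.search("text", "lucene")
print(hits)
```

Note that the "text" field is indexed but not stored, so a hit found through it only hands back the stored "title" field. That is the indexed/stored distinction in miniature.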

Hope this helps!


Re:Wow, open source search engines. (1)

AntigonusPiglet (744432) | more than 8 years ago | (#13392489)

Before the web, a "search engine" was a piece of software that provided a way to search a collection of documents efficiently. The usual method is to create an inverted index, a data structure analogous to the index in the back of a book, in which you can look up a word and get back a list of all the documents containing the word. There is also a set of standard techniques for ranking the results, for example based on statistics about the distribution of words across documents.
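A rough sketch of that data structure, in Python for brevity. The ranking shown is a plain tf-idf sum, picked here as one common example of "statistics about the distribution of words across documents"; it is an assumption for illustration and not a description of Lucene's actual scoring formula. The function names and sample documents are made up.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: count of that word in the document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(text.lower().split()).items():
            index[word][doc_id] = count
    return index

def rank(index, n_docs, query):
    """Score documents by summed tf * idf over the query words."""
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, {})
        if not postings:
            continue
        # Words that appear in fewer documents get a larger weight.
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Best score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "a": "search engines index documents",
    "b": "lucene is a search library and search is its job",
    "c": "ant builds java projects",
}
index = build_inverted_index(docs)
print(rank(index, len(docs), "lucene search"))
```

Looking up a word is a single dictionary access no matter how many documents exist, which is why the book-index analogy holds.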

Lucene is a search engine in that sense. It's a library that you build into your application to give it indexing & search capability. You can use it to search files on your computer or whatever you want -- you write the application. For example, Lookout uses Lucene to make all your Outlook mail searchable.

In contrast, Google, Yahoo, etc. are services that use a variation of indexing and search technology specialized for the scale and particular characteristics of the web. It's possible to build a Google-style search engine on top of Lucene -- that's what Nutch is.

Beagle uses Lucene (0)

Anonymous Coward | more than 8 years ago | (#13392909)

Interesting you should mention desktop search

The Beagle desktop search engine uses the C# version of Lucene. Yes, it runs on Linux (and anything else that can run Mono)

Re:Wow, open source search engines. (1)

otisg (92803) | more than 9 years ago | (#13394404)

It sounds like you may be interested in Nutch [] , a sub-project of Lucene. Nutch is a full search engine package (fetcher, indexer, searcher, etc.), made to work in a cluster, etc. Of course, at the core of indexing and searching functionality is Lucene.

Bookmarks are better (4, Interesting)

saskboy (600063) | more than 8 years ago | (#13392061)

Bookmarks are more secure than a search engine, since a search engine could provide a poisoned link, and if you're typing in the URL by hand, if you make a spelling mistake, you could find yourself at a pharming site, or someplace you didn't want to go.

I tend to use bookmarks in Firefox and the autocomplete about equally, and make use of the Quick Links toolbar for my most popular sites.

The Firefox bookmark all tabs feature is a breakthrough, since you can close your browser, and reopen it to the same set of tabs as before, which is great when installing extensions and you're forced to restart. The only drawback is that scrolling through bookmarks is too slow, but if you use your scroll wheel it speeds up considerably. That's a trick I didn't figure out until just last month.

Re:Bookmarks are better (1)

jorenko (238937) | more than 8 years ago | (#13392172)

I've always used the SessionSaver [] extension to get this functionality. I don't have to go to the trouble of actually setting anything this way. I can close my browser any time and it will be in the same state the next time I open it. Or, I can save a specific session to load later.

Re:Bookmarks are better (3, Interesting)

zhiwenchong (155773) | more than 8 years ago | (#13392275)

Quite... I just want to say something:

I think that abandoning bookmarks altogether is a bad idea.

Search, while useful, only works if you can find the exact keywords necessary to bring up a certain page. Search merely complements, rather than replaces, bookmarks.

Looking through my bookmark lists, I see many websites which I would never have known how to search for (they're mostly websites I stumbled upon from other websites). Some of these sites are hard to find because:

1) they don't have enough Statistically Improbable Words. e.g. try searching for software that describes biology of a python.

2) the page doesn't contain words associated with its significance to me (yes, it can happen). e.g. let's say you come across a page that has a nice layout that you want to revisit later -- if you ever forget the keywords on that page, you may never find it again. Whereas if I were to file it under "Nice websites" in my bookmark folder, I'd be able to find it again.

3) I can't remember any of the keywords associated with the page.

4) I forget that I've ever visited those webpages. Some search engines (e.g. have histories that you can revisit, but they're no use unless you can classify them. And if you classify them, they're basically bookmarks.

I think the reason people dislike bookmarks is because they're a hassle to organize. We need some sort of tool to autoorganize bookmarks.

There are two basic requirements:
1) Multiple hierarchy - a bookmark must be able to belong to more than one category. Example of this is GMail's labels [] -- each email can belong to more than one label.

2) Automatic classification - the proper term for this is automatic taxonomy. This can be accomplished using a Bayesian algorithm (like the one POPmail is using). In fact, DEVONthink already does this [] .

When a user makes a bookmark, the program should come up with a list of category folders (sorted from likeliest to least likely) to file that bookmark under, and the user must be allowed to select more than one folder.
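For what it's worth, the ranked-folder suggestion described above could be sketched with a small naive Bayes classifier. Everything here is an assumption for illustration: the class name BookmarkClassifier, the use of bookmark titles as the only feature, and the training data are all made up, and this is not how any of the products mentioned actually work.

```python
import math
from collections import Counter, defaultdict

class BookmarkClassifier:
    """Toy multinomial naive Bayes over bookmark titles (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.folder_counts = Counter()           # folder -> number of bookmarks
        self.vocab = set()

    def train(self, title, folder):
        words = title.lower().split()
        self.word_counts[folder].update(words)
        self.folder_counts[folder] += 1
        self.vocab.update(words)

    def suggest(self, title):
        """Return folders sorted from likeliest to least likely."""
        words = title.lower().split()
        total = sum(self.folder_counts.values())
        scores = {}
        for folder, n in self.folder_counts.items():
            counts = self.word_counts[folder]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)          # prior for the folder
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # add-one smoothing
            scores[folder] = score
        return sorted(scores, key=scores.get, reverse=True)

clf = BookmarkClassifier()
clf.train("python tutorial for beginners", "Programming")
clf.train("lucene search library", "Programming")
clf.train("python snake care and biology", "Animals")
clf.train("reptile biology notes", "Animals")
print(clf.suggest("biology of the python"))
```

Because the result is a full ranking rather than a single answer, the user can tick more than one folder from the top of the list, which covers the multiple-hierarchy requirement as well.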

Re:Bookmarks are better (2, Insightful)

Michael Woodhams (112247) | more than 8 years ago | (#13392414)

Yep, I also think the poster's abandonment of bookmarks is truly bizarre. I have top-level folders of bookmarks, each of which becomes an instantly available pull-down menu. Most have subfolders.

I like your automatic classification ideas.

A complaint about Firefox: when I choose "bookmark this page" it comes up with a little dialog. This dialog has a one-line selector for where I want to create the bookmark (default being the folder named "bookmarks") and a little button to expand this one line into a screen. NOT ONCE in hundreds of times have I not clicked that little button to expand this display. The default behaviour assumes you don't classify your bookmarks, which is likely to only apply to people who seldom use bookmarks. I'd expect 80%+ of the time this dialog is invoked, the little button is clicked. So why doesn't it open the big display by default?

Sorry, I have no mod points for you today.

Re:Bookmarks are better (1)

moeffju (114331) | more than 8 years ago | (#13392848)

The OpenBook extension does this.
(I agree, it should really be default behavior.)

OpenBook Firefox Extension (1)

nv5 (697631) | more than 8 years ago | (#13392855)

try the OpenBook [] extension. It fixes that problem. And I wouldn't be surprised, if it would be incorporated in the standard Firefox "real soon now"(tm).

Re:Bookmarks are better (0)

Anonymous Coward | more than 8 years ago | (#13392363)

The Firefox bookmark all tabs feature is a breakthrough, since you can close your browser, and reopen it to the same set of tabs as before, which is great when installing extensions and you're forced to restart. The only drawback is that scrolling through bookmarks is too slow, but if you use your scroll wheel it speeds up considerably. That's a trick I didn't figure out until just last month.

Not to start a browser war, but by "breakthrough" you mean "copy of Opera's sessions", right? That's a trick I didn't figure out until 3 or 4 years ago. What will those amazing firefox devs think of next?!?!

Re:Bookmarks are better (0)

Anonymous Coward | more than 9 years ago | (#13394835)

And yet Opera still hasn't figured out the most amazing trick of all: getting more than 5 people to actually use it!

You Opera zealots are silly. Saying something was invented in Opera is like saying something was invented on the moon: it could be true, but nobody on earth is gonna know about it.

Bookmarks toolbar folder is better! (2, Interesting)

ImaLamer (260199) | more than 8 years ago | (#13392376)

Scroll wheel? Thanks, that is a major helper.

What I've begun doing is using the "Bookmarks Toolbar Folder" for all of my bookmarks. I've got "Essentials" with links to Gmail, Adsense, my website, stats and so forth, basically all of the sites that I try to visit daily. Then I've got "Favorite sites" that holds Slashdot (even though now it's "home"), Woot, Craigslist, (hehe),, Myspleen, demonoid, you get the point.

Then I've got the essential one: "Functions" - that holds mostly Javascript links but other things like TinyURL, @nonymous, Wordpress Press-It, BlogThis!, post to, Ping-o-matic, Send SMS message, mailto: and whatever. Then there is "Junk" which isn't really used any more because is so sweet. I generally dump something I might want to read later there and categorize it later (like every 8 months). Then of course I've got a few drop-down RSS feeds, but since they are torrent sites I'll keep them to myself. (Oh, I almost forgot - a huge drop down of bookmarks with the help of Foxylicious [] )

This works well, and generally reminds me of a filing system. Since I'm never using the File, Edit, and etc menus this has become my new menu.

Re:Bookmarks toolbar folder is better! (2, Funny)

Cylix (55374) | more than 8 years ago | (#13392495)

I think slashdot just became your bookmark page.

Re:Bookmarks are better (1)

mov_eax_eax (906912) | more than 8 years ago | (#13392591)

i don't think that bookmarks better than search engine makes any sense at all, are you trying to say that instead of indexing automatically a broad range of documents for intranet usage using lucene or htdig YOU will bookmark manually every page using your favorite browser?

is better (2, Informative)

xant (99438) | more than 9 years ago | (#13393734)

You can combine the best of both worlds, bookmarks and search.

I find bookmarks slow to navigate, and it's hard for me to remember my own hierarchy when I've got enough bookmarks to organize. The problems with search have been expanded on by others in this thread.

So here's the solution: [] .

You can create, edit, tag, describe, and search your own personal bookmarks. When you've done that, the world can see your links too. Subscribing to an RSS feed of some tags you're interested in ("python" for me) gives you a constant stream of interesting links other people who are into Python found useful.

If you're using Firefox, you'll probably want in your search box. Find that here [] . I find myself frequently using this when I'd normally use Google, when I know (or just suspect) that I've been somewhere before.

What I've been doing for about a year now is keeping my actual in-browser bookmarks an unsorted flat list with just about 20 sites I visit on a regular basis. Webcomics, blogs that don't have RSS feeds already, and the like. Everything else goes into All my other bookmarks were of one of these categories: links I visit only occasionally for reference, sites I intended to visit just once later when I have free time, or sites that I don't even know if I'll find useful until I go back and reread them. Now I don't have to clutter my browser with those; I throw them in As a bonus, tells you how many other people have bookmarked the page. Number of times a link is being submitted is a good first-blush indication of whether the information there is really interesting or useful.

DAMMIT - link is wrong, see update (1)

xant (99438) | more than 9 years ago | (#13393769)

I typo'd the link above. This [] is the correct link. I hit preview about 4 times and didn't catch that.

To make matters worse, there appears to be a copycat typosquatter site at the link I put in there. Oh well, if you get all your vital information from Slashdot you deserve what you get :P

is better (1)

otisg (92803) | more than 9 years ago | (#13394362)

Funny you mention delicious :)
I run Simpy (see the signature or use the
demo/demo [] account), which has some notable advantages over delicious [] , especially in the search area (surprise, surprise).

Algorithms (1)

Transcendor (907201) | more than 8 years ago | (#13392073)

That sounds interesting. At the moment, I'm dreaming of a "textual exchange service center" for pupils at my school or even schools in whole Hamburg/Germany. (in other words: a good, dialer-free, non advertising, trusted, backfeeded homework exchanger).
I've heard of Lucene through my fav. Computer magazine ( [] ), but I was more interested in indexing algorithm at that time.
So how much weight does the book give into algorithms? Is there anyone out there who's as mathematically/scientific interested in that topic as me?
You've been intentionally poked. Prepare to get an access violation ;) []

Re:Algorithms (1)

steve_l (109732) | more than 8 years ago | (#13392691)

the book (I have a copy) goes into detail on extending lucene with handlers for different formats, filters, and things like stem rules (language specific rules for stemming words).

It also talks about the indexing format; how the indexes are stored and searched. If that aint enough, well, the source is up on

Benefits? (1)

op12 (830015) | more than 8 years ago | (#13392074)

Are the benefits of having such a customizable search engine enough to justify the work required to code for it? It seems like even many of the "customizable" features that you could code have already been included in the major search engines of today, and it would be difficult/impossible to beat their algorithms when they are already developing super-efficient algorithms to stay competitive. It seems like only highly specific or unusual search engine applications could make use of something like this.

Re:Benefits? (1)

MindStalker (22827) | more than 8 years ago | (#13392102)

I think the common use for this is for having an insite search. Sure you can link to google and limit site to your domain but you won't get results as good as a search engine that completely indexes just your site.

Re:Benefits? (1)

ragnar (3268) | more than 8 years ago | (#13392169)

You bring up a good point. In my group, we use Lucene to index XML files because there is a good deal of metadata that (for legitimate reasons I won't go into here) doesn't make it into the HTML presentation that google and human readers see. In order to use the search interface for effective research, thereby using the metadata, a Lucene index was most helpful.

That said, for most projects you are better off to just use a google search, but there are times when knowing the structural properties of your data gives you more control over the search.

Re:Benefits? (3, Informative)

coflow (519578) | more than 8 years ago | (#13392201)

Typical search engines have licensing fees associated with them if you're embedding them in your application. This is basically an open source alternative. And you can customize the hell out of it. I've used it on several web-based applications and on SOA platforms, and it is fast, reliable, and easy to use. Did I mention it's open source? Take a look at the Apache site [] .

Some examples of customizable features: you can index database entries and achieve quantum leaps in performance over the indexing offered by Oracle, MySQL, PostgreSQL, Firebird, etc. You can also index formats that are not supported by the major search engines.

It may not offer quite the performance of Google, Alta Vista, etc., but it's a FREE product, well supported by the folks at Apache, and many open source J2EE frameworks support it as well.

Re:Benefits? (1)

MemoryDragon (544441) | more than 8 years ago | (#13392589)

One word yes... more words... in about 10 projects I had in the recent past, I had to apply lucene in about 3 of them due to the requirements of the project....

Better Memory Than I (2, Interesting)

Flamesplash (469287) | more than 8 years ago | (#13392107)

Now, I go to my favourite web search engine, enter a couple of appropriate search terms and voila, there's my page!

You have a better memory than I, my friend. Many times I only barely remember something I want to find again. Maybe I remember it was humorous, or maybe I remember it was an online game with pigs in it. Unless it's popular I doubt 'pig game' is gonna get me far. So bookmarks aren't so useless to those of us who don't keep everything in RAM.

Bookmarks, and a good hierarchy, also leverage the Associative aspect of our minds. Skim through your high level bookmark folders and you'll probably find what you were thinking of pretty quick. Additionally it reminds you of things you may have bookmarked yet forgotten.

Re:Better Memory Than I (1)

mopslik (688435) | more than 8 years ago | (#13392126)

I doubt 'pig game' is gonna get me far.

Maybe not, but I'll bet it would make an interesting Google Image Search with Safe-search turned off.

*dares not try at work*

Re:Better Memory Than I (1)

Chosen Reject (842143) | more than 8 years ago | (#13392355)

Doesn't look at all bad. []

And by the way, it looks like that pig game was pretty popular as it is the first link in a normal search and a screenshot of a game is the first link in a image search.

Re:Better Memory Than I (1)

AKAImBatman (238306) | more than 8 years ago | (#13392204)

Mmm... that's why I suggested moving Bookmarks and the like into a DBFS [] . In doing so, the user gains the power to organize and search [] on his data in ways that were previously impossible. Just imagine if your bookmarks automatically attached the meta-data about themselves (based on the website). You could then search for "humor" and find a list of everything you thought was funny enough to bookmark!

That's my idea, anywho. :-)

Re:Better Memory Than I (2, Interesting)

garcia (6573) | more than 8 years ago | (#13392277)

I haven't used Bookmarks since 1998 or 1999. Too much of a hassle finding stuff when the links are dead anyway.

His solution, using a search engine, is a much better method as you might even come across something new and even MORE useful than what you had originally bookmarked.

I check a handful of websites daily. Mostly Google News, slashdot, MNspeak,,, and usually some others. While having them set up in a hierarchy might leverage the association aspect, typing them in every time exercises my memory and my typing. I guess we each have our own separate areas we'd prefer to work on.


Re:Better Memory Than I (1)

BaudKarma (868193) | more than 8 years ago | (#13392437)

Whether I bookmark something or not usually depends on how difficult it was to locate in the first place. A few months ago, I was looking for instructions on how to replace the CMOS battery in an old Winbook. I had to wade through half-a-dozen pages on Google with links to companies that wanted to sell me a new laptop battery before I found something relevant. That one got bookmarked.

Re:Better Memory Than I (1)

geckofiend (314803) | more than 8 years ago | (#13392604)

Let's see...
1. Visit google.
2. Type in "Qt 3.4 documentation"
3. Hit submit
4. Find and click on the link


1. Click on the bookmark.

Yeah NOT using bookmarks is so efficient.

Re:Better Memory Than I (1)

ecloud (3022) | more than 8 years ago | (#13392879)

Dead links - that's a good point regardless of the bookmarking technology. One solution is to automatically cache everything you consider worth bookmarking, permanently. (Tie a bliki to a proxy like squid, maybe?) Another is to design the bookmarking system to go to whenever the link is dead.

Like a phone directory (1)

freeweed (309734) | more than 8 years ago | (#13392936)

typing them in every time exercises my memory


My cellphone has only work contacts programmed into it, because the only time I'm going to need these numbers is when I'm on the clock, and when I'm carrying the cellphone.

But personal contacts? I've learned over the years to deliberately NOT program these in - forced repetition of typing in the numbers means I commit them to memory. Extremely handy for when I don't have my cellphone on me, or its battery dies.

Personally, I found bookmarks were almost harmful. I'd lose them somehow, and suddenly couldn't remember the URL of a site I visited regularly. It was pretty weird, and in some cases, very frustrating.

Re:Better Memory Than I (1)

Domo-Sun (585730) | more than 9 years ago | (#13394776)

I stopped using bookmarks a long time ago too. It was too much of a hassle to keep track of them. One bad side effect is that sometimes I don't know where to go, and over time I start to forget about pages I previously frequented.

Like em, keepin em (2, Insightful)

Cylix (55374) | more than 8 years ago | (#13392122)

I might reference a search engine to tell someone how to find a site via word of mouth, but it has not replaced my bookmarks. If I am away from my system and all I can remember some common words maybe, but not so long ago I used to sync bookmarks with a firefox plugin. (Kept breaking after version updates, never went to reinstall it... though I think I will now)

Anyhow, I simply build to critical mass before I sort them into their respective folders. Some things are automatically tossed into temporary bookmark folders that are going to get washed away after they are no longer useful. (Think auction links)

Now I'll tell you why using a search engine as a replacement for bookmarks is a bad idea. Page ranking changes. That particular combination of words you can remember... might just not produce the same results next time. Wonder why? The internet changes! It is not AOL keyword search...

That said, I did something as foolish as to rely on google to get back to some website regarding video sync signals. It was an excellent page and then I went back to search for it again and I could not find it. (Eventually I did though)

Bookmarks good, search engine good... not mutually exclusive.

Re:Like em, keepin em (1)

superpulpsicle (533373) | more than 8 years ago | (#13392620)

I think you summed it up best out of everybody here. The bookmarking concept is practically perfected to the point where there is little room to improve.


Ha (1)

blake3737 (839993) | more than 8 years ago | (#13392140)

Chapelle: What do you expect from my review....It's a search Engine Biotch!!!

Oh wait.. wrong chappelle

Re:Ha (0)

Anonymous Coward | more than 8 years ago | (#13392396)

I'm Simon P. Chappell bitch

RSS (2, Interesting)

ezweave (584517) | more than 8 years ago | (#13392143)

While search engines are great, bookmarks are not obsolete. I use RSS feeds to keep up on anything that is serialized that I might care about. FF is great for that.

I still use a few regular bookmarks (like the URL that logs me into /.). Or for development servers with obscene URLs. That is the kind of thing that a search engine won't find. Especially if you have to deploy to a few web servers (this is the WebLogic machine, this is the OAS machine, etc). I have even bookmarked LDAP strings for testing.

More to the point of TFR, I would be interested in learning more about OSS search engines. It would be great to set one up on my own net... hmmm. As an aside, can Lucene be used for local searches? It would be cool to make my own desktop search. What kind of licensing does it have?

Re:RSS (1)

Schnee (743890) | more than 8 years ago | (#13392494)

As an aside, can Lucene be used for local searches?

Yes, but...

The distribution contains some demo applications that you can point to a filesystem. One app will index the text, another will index HTML (or maybe one does both, I can't remember). Then you execute another app to query the index.
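For anyone who hasn't seen the demos, the index-then-query flow described above can be sketched roughly as follows. This is a minimal stand-in using only the Python standard library, not Lucene's API or its demo code; the names `build_index` and `search` are made up for illustration, and it only handles plain `.txt` files.

```python
import os
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")

def build_index(root):
    """Walk root and map each lowercase token to the set of .txt files containing it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".txt"):
                continue  # other formats would need their own parser
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                for token in TOKEN.findall(handle.read().lower()):
                    index[token].add(path)
    return index

def search(index, query):
    """Return the sorted list of files that contain every term in the query."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)
```

A real engine would persist the index to disk and rank the hits; this sketch keeps everything in memory and returns unranked matches.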

The hard part is to get Lucene to index non-text files such as Office files. The version of Lucene I've used is the Java version. Third-party libraries exist for Word and Excel docs (on a Windows filesystem), but none for PowerPoint, as far as I know. PDF is pretty easy using other third-party libs.

The demo apps are just that. You really either need to get SearchBlox (not OSS, but built on top of Lucene) or roll-your-own if you want a full-featured app.

Lucene is Apache Licensed (1)

steve_l (109732) | more than 8 years ago | (#13392645)

Lucene is an Apache licensed java project; there is a .NET version that may work on Mono too.

The nice thing about Lucene is it adds indexing and searching to anything you want - some search plugin for Outlook (blech) is built on it; imagine an equivalent for the Unix mail systems - Thunderbird, Evolution or Emacs, for example.

Lucene providing search engine for Hula (2, Interesting)

bad_outlook (868902) | more than 8 years ago | (#13392171)

The Lucene ( [] ) indexer will be implemented within Hula, the web and cal application ( [] ) made from open-sourced Novell NetMail code. Samples of the search engine have been committed and should start functioning within weeks, just in time for the new cal UI, which you can now view a demo of here: [] That's looking to be an amazing app...

Reminds me of Mac System 7 (1)

ryanov (193048) | more than 8 years ago | (#13392178)

When I was in HS, the preferred way on a Mac to find the telnet application to go run pine was using Find. It was almost always quicker than finding out what folder it was in on the machine, as they were surprisingly nonstandard installs.

Google anyone ? (3, Interesting)

Potatomasher (798018) | more than 8 years ago | (#13392198)

Does anyone find it a little funny that on the main webpage, there is a "Search this site with Google" textbox? Kind of makes you NOT want to use their search engine if they don't even trust it enough to work on their own site....

Re:Google anyone ? (-1, Troll)

Anonymous Coward | more than 8 years ago | (#13392243)

Can't say I blame them, it's written in java :-o

Wake me up when the C port is finished.


metamatic (202216) | more than 8 years ago | (#13392320)

Seriously, why would I want to use a library that the authors don't consider good enough to use themselves?


joeykiller (119489) | more than 8 years ago | (#13392498)

Don't write off Lucene just because they're not using Lucene for the site search. Lucene is good. I've only used it on my laptop to index a couple of hundred thousand news articles, but even on a laptop Lucene performs well.

Take a look at [] - the enterprise java community. Their search is powered by Lucene. It's pretty fast and a very capable site search.

You also have open source projects, such as Beagle (the desktop search for Gnome), that uses the .Net version of Lucene. Lookout, a search plugin for Outlook (recently bought by Microsoft), also uses .Net Lucene for the indexing and searching.

Re:Google anyone ? (4, Informative)

Anonymous Coward | more than 8 years ago | (#13392443)

I'm going to assume your post wasn't a joke and explain a few valid reasons.

The best reason is that it's very, very easy to set up a Google search... all you have to do is add site:your_site to the search query, and Bam! instant search.
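To make the site: trick concrete, here is a small sketch that builds such a search URL. The helper name `site_search_url` and the `example.com` domain are placeholders of mine, not anything from the Lucene site.

```python
from urllib.parse import urlencode

def site_search_url(domain, terms):
    """Build a Google search URL restricted to one domain via the site: operator."""
    query = f"site:{domain} {terms}".strip()
    return "https://www.google.com/search?" + urlencode({"q": query})

# Placeholder domain, for illustration only
print(site_search_url("example.com", "lucene index"))
```

A search box on your own page only has to send the visitor to a URL like this, which is why it takes so little effort compared to running your own indexer.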

Lucene takes some work to setup, and is best used where normal Web crawling doesn't work. For example, I work on an eCommerce Web App where all our products are stored in the database, and you reach them by setting a CGI parameter in the URL. Not all products have links to them on our site. We use Lucene because we can pull all the products out of the database and index them, and get hits that crawling would have missed. We can also customize things like redirecting a search for "help" to the help page, set up synonym lists, etc.
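Roughly what that looks like, as a toy sketch only: this uses sqlite3 and a plain dictionary instead of Lucene, and the `products` table layout, the synonym list and the redirect table are all invented for the example.

```python
import re
import sqlite3
from collections import defaultdict

# Hypothetical customizations, invented for this sketch
SYNONYMS = {"laptop": {"notebook"}, "notebook": {"laptop"}}
REDIRECTS = {"help": "/help"}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def index_products(conn):
    """Read every product row straight from the database, linked from a page or not."""
    index = defaultdict(set)
    rows = conn.execute("SELECT id, name, description FROM products")
    for product_id, name, description in rows:
        for token in tokenize(name + " " + description):
            index[token].add(product_id)
    return index

def search(index, query):
    """Return ("redirect", url) for special queries, else ("results", product ids)."""
    special = query.strip().lower()
    if special in REDIRECTS:
        return ("redirect", REDIRECTS[special])
    result = None
    for term in tokenize(query):
        ids = set()
        # A term matches if the term itself or any of its synonyms is indexed
        for variant in {term} | SYNONYMS.get(term, set()):
            ids |= index.get(variant, set())
        result = ids if result is None else result & ids
    return ("results", sorted(result or set()))
```

The point is that the indexer reads the rows directly, so products with no inbound links still show up, which a crawler would never manage.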

So long story short, their search needs are not complex enough to justify the effort of setting up a Lucene based application.

Re:Google anyone ? (0)

Anonymous Coward | more than 8 years ago | (#13393315)

Perhaps because Lucene is not search engine solution, it's a search engine library.

Should I not trust the GNU version of grep because the FSF doesn't use grep as their website search engine?

The worst part is that some succession of no-talent ass-clowns have moderated the parent to +3.

Control-D is the death of bookmarks (1)

sled (10079) | more than 8 years ago | (#13392202)

I used to use bookmarks, until I got in the habit of exiting my shells with Control-D. I spend 90% of my computer time either in a terminal window or mozilla, and I don't use click to focus. Therefore, many times when I hit control-D to exit a shell, I have accidentally left focus on the mozilla window and I add an unwanted bookmark. My bookmarks quickly become cluttered beyond use in this way. Surely there is some way to remap this function to another key, but I've yet to find it.

Re:Control-D is the death of bookmarks (1)

michaelhood (667393) | more than 8 years ago | (#13392487)

So, I didn't know about Control-D exiting shells. And I thought to myself, that's kinda neat. So I clicked over to SecureCRT, hit Ctrl+D. Nothing. Hit it again. Nothing. Railed it about ten more times. Nothing.

I was focused on another SCRT.

I just wanted to thank you for closing screen, a tail with a lengthy grep, mysql, bind (I was running in the foreground for debugging), and god knows what else.

That is all.

Re:Control-D is the death of bookmarks (0)

Anonymous Coward | more than 8 years ago | (#13392526)

<Nelson Laugh>Ha, ha!</Nelson>

We use Lucene ... (1)

rahuja (751005) | more than 8 years ago | (#13392218)

.. here at UMIACS for searching huge e-mail corpora. Lucene rocks!

Good article (2, Informative)

Linux_ho (205887) | more than 8 years ago | (#13392250)

Check out this article [] for a good intro to Plucene, the Perl port of Lucene.

This is also a good link for all of you slashdotters who have no idea what Lucene is for and are posting rants wondering why people don't just use Google instead.

Re:Good article (0)

Anonymous Coward | more than 8 years ago | (#13393093)

One thing to be careful about is that Plucene and Lucene indexes are not compatible with each other. So if you were thinking about using Perl to crawl and using Nutch's Java-based index searcher, think again... []

It would be great if they were compatible. I hate having to write my custom parsing code in Java.

my problem solver (3, Informative)

shareme (897587) | more than 8 years ago | (#13392251)

Check out regain: [] Works as a computer indexer using Lucene.. Seems to do better on search than MS stuff :) On a typical 2.97 GHz system with a 100 gig hd 70% full, the first index takes about 6 hours.. It runs fine in the background, with no noticeable slowing down of other apps while indexing.. It also comes in a server version for website searches, to build search engines like Google and Yahoo :)

Bookmarks and chronology (1)

ch-chuck (9622) | more than 8 years ago | (#13392262)

My solution is, every 6 months or so, save the bookmarks.html somewhere with a date in the name and start over with a blank bookmarks.html. Then, if I need to find something old I just open up the old bookmarks_6_2003.html or whatever - the interesting thing is, it's like going back in time to review what I was interested in at the moment. Like if I was researching a lot on electric park flyer airplanes in 2003 it would have a lot of links. That way it's kinda like a scrapbook of your life - if your sorry life is spent browsing the web that is ;)
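If you wanted to script that rotation, a sketch might look like this. The function name and the archive folder are my own choices, and a real browser may expect more than an empty file in bookmarks.html, so treat it as an outline rather than something to run against a live profile.

```python
import shutil
from datetime import date
from pathlib import Path

def rotate_bookmarks(bookmarks, archive_dir, today=None):
    """Move bookmarks.html to a dated copy and leave a blank file in its place."""
    today = today or date.today()
    bookmarks = Path(bookmarks)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Same naming scheme as in the post, e.g. bookmarks_6_2003.html
    target = archive_dir / f"bookmarks_{today.month}_{today.year}.html"
    shutil.move(str(bookmarks), str(target))
    bookmarks.write_text("", encoding="utf-8")  # start over with a blank file
    return target
```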

Re:Bookmarks and chronology (0)

Anonymous Coward | more than 9 years ago | (#13395406)

Nobody cares how you manage your bookmarks. The whole bookmark story at the beginning of the article was just a silly lead-in to the book review.

Sitebar saved my sanity... (2, Informative)

Ransak (548582) | more than 8 years ago | (#13392268)

After realizing I had over 800 bookmarks spread across four different workstations in different geographic areas, I consolidated them into a Sitebar [] install. I'd recommend it to anyone; you can tinker with the PHP or MySQL side, or simply leave it alone beyond the default installation. It's really designed for bookmark sharing for teams, but has options for single user installations.

Usual disclaimer: I have nothing to do with Sitebar or its development, just a majorly satisfied user.

Try delicious? (2, Informative)

delete (514365) | more than 8 years ago | (#13392276)

Why not try delicious [] ? It allows you to keep your bookmarks online so that they're accessible from multiple locations, while also allowing you to search [] your bookmarks and those belonging to other people.

If you use Firefox, there are extensions that allow you to view your bookmarks in a sidebar [] and sync your online bookmarks [] with your browser bookmarks.

Re:Try delicious? (0)

Anonymous Coward | more than 9 years ago | (#13395886)

Alternatively, there's Yahoo MyWeb: []

The advantages are that you get a cached copy of the page to go with the live one, should the site go down or disappear, full text search inside the pages you've saved, and that you don't have to make every page you save publicly viewable.

The disadvantages are that it's currently less popular, so finding useful pages other people shared is less effective, and that it's easier to link into (the URLs are simpler, and My Web currently requires a Yahoo ID to use certain features)

Compare: [] &tag=css []

But what about Fishy? (1)

Patentmat (846401) | more than 8 years ago | (#13392289)

It is true, for many purposes there is no need for a bookmark. Google will take you just where you need to go, and will keep you informed if a better source for what you need becomes available and makes its way up the results page.

But other times, the search engine screws you over. Case in Point - fishy

The other day at work I wanted to play Fishy, so I typed it into Google, went to the top link, and started playing. What??? They changed FISHY?? NO, they didn't, the top link was some bizarro version of fishy (bizarro at least to me) that was not nearly as good as the one I remember growing up on. y [] For anyone who hasn't played fishy, you need to put down what you are doing and go play []

Save Some Money (0)

Anonymous Coward | more than 8 years ago | (#13392432)

Save yourself $15.28 by buying the book here: Lucene in Action []

Re:Save Some Money (-1, Troll)

Anonymous Coward | more than 8 years ago | (#13392477)

Save yourself some effort by going here: A Happy Place []

What I want (for bookmarks) (1)

ecloud (3022) | more than 8 years ago | (#13392467)

I'd like a blog which is also part of a wiki (aka a "bliki"). And I'd like to be able to set each post public or private - so I can record my own thoughts about ideas which I do not want to share, vs. plain links and comments which I do want to share. Probably I will try to use MediaWiki for this but I'm not sure about the privacy aspect of it. I have used it at work for an internal blog; the "my talk" page that each user gets is very much like a blog. You can mod the code a little bit to automatically put the time and date in each post.

It's also nice to be able to email posts to one's blog.

Hmm (0, Redundant)

romka1 (891990) | more than 8 years ago | (#13392475)

"Search this site using google" :) shouldn't they use their own technology for this ?

PPS (1)

duncangough (530657) | more than 8 years ago | (#13392573)

If you like Python, then there's LuPy from divmod, which is a python port of Lucene.

And if you've ever wanted to create a personal proxy server that gives you a searchable database of your history and bookmarks, then you can do that too, just like I did: []

Re:PPS (0)

Anonymous Coward | more than 9 years ago | (#13394328)

I've been looking for a decent version of something like this for ages. I've previously investigated Agent Frank [] , WebMate [] , IProxy [] , and I've used WBI [] . All seem lacking in some way.
PPS looks promising. I'd like to see the search functionality of PPS integrated into the proxy so you can do a search right from the browser and it can show you live links......

Re:PPS (1)

duncangough (530657) | more than 9 years ago | (#13395917)

Feel free to take the code and extend it - I got the bare bones up and running and then work came calling. I'd love it if someone else could find a use for it!

Lucene implementations can parse MS Office files (1)

dilute (74234) | more than 8 years ago | (#13392586)

So here you have a free alternative to proprietary search engine indexing software that allows you to run an intranet with MS Office (as well as PDF, HTML, text, etc.) files on a non-MS web platform (it's "free" except insofar as Java itself is not "free"). In truth, the document parsers are external to Lucene, but they do work together, plus Lucene itself is a solid piece of work. Also, Lucene itself is just an indexing engine - the other plumbing and connections of a full "search engine" have to be constructed around it (the book gives you the basics you'll need to do that, or you can look at the numerous freely available solutions that make use of Lucene).

It's about time Slashdot has picked up on Lucene - This book has been out since last December, and the Lucene project has been around quite a bit longer than that. This is very powerful stuff.

Lucene is great! I use it all the time (3, Informative)

MarkWatson (189759) | more than 8 years ago | (#13392918)

Also, I wrote a DevX article on Lucene: []

Lucene is so well documented and simple to use that I am surprised that this subject would fill an entire book :-) Just kidding.

Lucene can be used as is, or you can extend it with your own document type handlers, etc.

As a programmer, I way prefer dynamic languages like Common Lisp, Ruby, Python, Smalltalk, etc. However, one of the things that keeps me firmly in the "Java camp" is the great free infrastructure software tools (like Lucene, Tomcat, JBoss, etc.) As a programming language, Java is kind-of weak.

Re:Lucene is great! I use it all the time (2, Interesting)

bmalia (583394) | more than 8 years ago | (#13392989)

As a programming language, Java is kind-of weak.

Java is anything but weak.

Re:Lucene is great! I use it all the time (1)

mark_lybarger (199098) | more than 8 years ago | (#13393655)

ah yes, the dynamic languages. those are fun to debug. this variable is what again? what methods can i call on it? the ides for those must be excellent. having only seen ruby on some slides, python in gentoo scripts, and, well, that's it of the 4 you mention, how are the ides? i've been submerged in javascript land lately, and the dynamic nature of the language (coupled with the buggy nature of the sandbox) is just a joy to work with.

Lucene rocks! (2, Informative)

swf (129638) | more than 8 years ago | (#13393483)

Lucene is a pretty amazing piece of software. Lucene is to text indexing what postgres is to relational databases. The API is simple, and though many people have reservations about java it is very, very fast. I've written lucene apps that could perform queries in 40-60ms that would take a relational database up to 20 minutes to perform on the same hardware. I've found it to be even a few orders of magnitude faster than Oracle text indexing.

And you can index pretty much anything you want, so long as you can get it into a string. Everyone knows you can index documents in XML, HTML, etc, but you can also index objects with strings, integers and dates, hash tables and lists. Just put a primary key value and a table name and you can retrieve the full object from your database. Very cool indeed.

Every programmer out there with skills in relational databases should take a look at an information retrieval library like lucene. It's a completely different field from relational databases and it will change how you think about storing and searching data. It's not a replacement for relational databases, but it does complement them very well and allows you to do things that you wouldn't normally be able to do, like allowing for results that partially match the search query, and being able to rank them by relevance.
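To give a feel for the partial-match and ranking idea, here is a toy sketch. It uses a plain TF-IDF style score, which is a simplification and not Lucene's actual scoring formula, and the `TinyIndex` class is invented for the example. Each indexed entry carries only a table name and primary key, so the full object can be fetched from the database afterwards.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyIndex:
    """In-memory inverted index keyed by (table, primary key)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {(table, pk): term frequency}
        self.doc_count = 0

    def add(self, table, pk, text):
        self.doc_count += 1
        for term, freq in Counter(tokenize(text)).items():
            self.postings[term][(table, pk)] = freq

    def search(self, query):
        """Return keys matching at least one query term, best score first."""
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rarer terms count for more (simplified inverse document frequency)
            idf = math.log(1 + self.doc_count / len(docs))
            for key, freq in docs.items():
                scores[key] += freq * idf
        return sorted(scores, key=lambda key: (-scores[key], key))
```

Note that a document matching only one of two query terms still comes back, just lower in the list, which is exactly the part that is awkward to express in plain SQL.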

DotLucene (1)

ThinkFr33ly (902481) | more than 8 years ago | (#13393571)

I've personally used DotLucene [] , which is the .NET version of Lucene.

I used it to index a fairly complicated ASP.NET portal site on which there was little or no static content and all content was secured using a custom implementation of ACLs. The ASP.NET application allowed you to run mini-applications within it, called Portlets.

These portlets had very complex security rules. For instance, you could say that certain users could click this button, while others could not. Certain users can view this portlet page, while others cannot. Etc. This security was all implemented using an interesting blend of .NET attributes and code access security.

Needless to say, indexing this was very complex. We can't just index everything because that might expose too little or too much to those who are searching for content. DotLucene allowed us to implement a system that could index all available content while filtering those results depending on the person doing the search. DotLucene would "ask" all portlets in the system to give it all searchable content, along with a unique ID to identify that piece of content. When displaying the results to the user we would first "ask" the portlet if this particular user should be able to access this content. If so, let it through... otherwise, filter it out.
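The filter step, very roughly, looked like the sketch below. I've written it in Python rather than .NET to keep it short, and `Hit`, `Portlet` and `can_access` are stand-in names for illustration, not the real classes from the portal or from DotLucene.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    portlet: str      # name of the portlet that supplied the content
    content_id: str   # unique ID the portlet assigned at index time
    score: float

class Portlet:
    def __init__(self, name, acl):
        self.name = name
        self._acl = acl  # content_id -> set of users allowed to see it

    def can_access(self, user, content_id):
        return user in self._acl.get(content_id, set())

def filter_hits(hits, portlets, user):
    """Keep only the hits whose owning portlet says this user may see them."""
    visible = []
    for hit in hits:
        portlet = portlets.get(hit.portlet)
        if portlet is not None and portlet.can_access(user, hit.content_id):
            visible.append(hit)
    return visible
```

Because the check happens at display time, the index itself can hold everything and the security rules stay inside the portlets where they already live.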

In the end the system could support many, many simultaneous users and searching was almost instant. Great application.

Xapian (1)

mikeboone (163222) | more than 8 years ago | (#13393671)

A year ago I was looking at search engine software and search query parsers. I didn't want to mess with setting up Lucene on Java. I found another tool called Xapian [] which compiles on Linux from C and has bindings for PHP, Perl, and other languages. I've found it to be fast and stable. The documentation is sort of spotty but the guys on the mailing list are great.

Bookmarks are personal (1)

ElitistWhiner (79961) | more than 9 years ago | (#13393708)

...which the author admits up-front that he has 500 bookmarks. The solution to our bookmarked memories, is not another search engine. Silly to think that any search engine could come close to having our every bookmark in memory., the abstraction, is half the answer. Apple's "iDrive" is the technological half. What we all need is a *follow-me* resource available anywhere, anytime that is totally abstracted above the hardware layer.

iMarks, personal bookmarks, that load on launch. An open standard downloadable or hardcoded into browsers.

Bookmarks solution from the book author (1)

otisg (92803) | more than 9 years ago | (#13394334)

Hello from Otis, one of the co-authors of Lucene in Action. It is interesting the book review starts with a problem with bookmarks in the browser, because I run Simpy [] , a fairly popular social bookmarking service. The reason I started the service a few years back was because with a few keywords + search I could locate my bookmark far more easily and much faster than traversing my bookmark folder hierarchies.

Anyhow, I just wanted to connect these 3 islands - Lucene in Action + bookmark problem + Simpy. I'll go back to reading the rest of the review now...