Open Source Search Engine Benchmarks

CmdrTaco posted more than 5 years ago | from the welcome-to-the-monday dept.

Databases 62

Sean Fargo writes "This article has benchmarks for the latest versions of Lucene, Xapian, zettair, sqlite, and sphinx. It tests them by indexing Twitter and Medical Journals, providing comparative system stats and relevancy scores. All the benchmark code is open source."

62 comments

first post (-1)

Anonymous Coward | more than 5 years ago | (#28593583)

yeah. this is boring.

k (-1)

selven (1556643) | more than 5 years ago | (#28593593)

Nothing else to say, really

Re:k (5, Insightful)

eldavojohn (898314) | more than 5 years ago | (#28593673)

Nothing else to say, really

Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.

I may have to poke around in the Lucene code after work tonight to figure out what kind of strange majick those Apache developers employ. Hopefully I'll walk away with some extra spells in my bag.

Re:k (-1, Offtopic)

Anonymous Coward | more than 5 years ago | (#28593703)

Yeah, you're the only person. *yawn*

Re:k (1)

Jarlsberg (643324) | more than 5 years ago | (#28593713)

It was a foregone conclusion that lucene would trounce the others, if you ask me. And comparing sqlite vs lucene is slightly absurd, since most people with a clue already use lucene on top of sqlite (and mysql as well) to get good search results.

Re:k (4, Insightful)

julesh (229690) | more than 5 years ago | (#28593809)

Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats?

Is it really that big a surprise? Given that some of the largest, most information-heavy sites on the Internet (e.g. Wikipedia) use it for their internal search?

Re:k (4, Insightful)

nyctopterus (717502) | more than 5 years ago | (#28594627)

But Wikipedia's internal search is the suckiest thing that ever sucked! Seriously, does anyone use it, instead of just sticking "wikipedia" into their Google search?

Re:k (1)

Hurricane78 (562437) | more than 5 years ago | (#28669949)

Sticking "wiki" into it usually suffices. :)

Re:k (2, Insightful)

forkazoo (138186) | more than 5 years ago | (#28593839)

Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.

Meh, look at any /. article about Java and you'll see somebody complain about the speed of Java, and a reply explaining that Java isn't particularly slow. It has some weaknesses that mean it isn't as optimal as really good C, but it also has some capacity for dynamic optimisation which can make it faster than poorly optimised C. Regardless, in a DB-type application, a lot of your time will be spent in vendor-supplied code, whether that is disk access supplied by the OS or functions available as part of the language's standard library. A lot of what actually runs in this type of app isn't guaranteed to be written in the same language as the app.

Also, most of the Java code you run across in real life is crap. That's not a dig at the language itself. IMO, it's the volume of poor coders that give Java a reputation for slowness more than anything else. You probably won't find any secret double ninja techniques in Lucene as much as you will just find relatively few embarrassing fuckups.

Re:k (5, Informative)

Lord Grey (463613) | more than 5 years ago | (#28593877)

Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.

Lucene is a great search tool. As TFA pointed out, however, if you're looking for a "search solution" rather than "search engine" then you should check out Solr [apache.org] instead. Lucene is a toolkit that you build on top of, not something you really want to deploy by itself. Solr is that thing built on top of Lucene.

Be aware that while Lucene/Solr has made terrific progress, it is not quite in the "enterprise search" category. For superscale implementations you'll still likely need to look at a high-priced product like FAST [microsoft.com] .

Re:k (4, Informative)

tealwarrior (534667) | more than 5 years ago | (#28595065)

Solr/Lucene power a number of sites that would be in the enterprise search category (Apple, Netflix, C-Net). Where I work, we index 5 million docs in Solr/Lucene and serve out millions of search requests a day. It's not Google scale, but most people don't need that. The markets where one needs a FAST are dwindling quickly.

Re:k (1, Interesting)

Anonymous Coward | more than 5 years ago | (#28595829)

Solr/Lucene power a number of sites that would be in the enterprise search category (Apple, Netflix, C-Net). Where I work, we index 5 million docs in Solr/Lucene and serve out millions of search requests a day. It's not Google scale, but most people don't need that. The markets where one needs a FAST are dwindling quickly.

I work in a shop that uses FAST, despite pressure from some to move to Solr. As I understand it, Solr can't keep up with the volume of changes we need to make to our data. I'm talking millions of documents of 100+ fields changed per day, with any given change visible to the customer within a short timeframe (10 minutes). Solr can index that much data easily, but it can't keep up with that kind of change volume. That's what I've been told anyway.

Re:k (2, Informative)

tealwarrior (534667) | more than 5 years ago | (#28597993)

Solr/Lucene real-time search (or near real-time) is one of its weaker points. I think it could keep up with the updates but making them appear in the index immediately and having the caching still perform can be tricky.

We have one index that's updated every 20 minutes, but it only has about 50k documents, and a combination of Solr cache auto-warming and squid's stale-while-revalidate logic works there.

In another system where updates need to be faster, we had to do some custom work to make it perform: there is an in-memory index for recent changes, an on-disk index of previous changes, and a process for moving from one to the other. Hopefully these improvements will make their way back to Lucene in the future.
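The two-tier arrangement described above can be sketched roughly as follows. This is a toy illustration, not the poster's actual code and not a Lucene or Solr API: the class name `TwoTierIndex`, its methods, and the use of a second dictionary as a stand-in for the on-disk index are all assumptions made for the example, and it ignores updates and deletes entirely.

```python
from collections import defaultdict


class TwoTierIndex:
    """Toy inverted index: a small in-memory tier for recent changes,
    plus a second tier standing in for the larger on-disk index."""

    def __init__(self):
        self.recent = defaultdict(set)  # term -> doc ids, written on every add
        self.main = defaultdict(set)    # term -> doc ids, stand-in for the on-disk index

    def add(self, doc_id, text):
        # New documents become searchable immediately via the in-memory tier.
        for term in text.lower().split():
            self.recent[term].add(doc_id)

    def search(self, term):
        # A query consults both tiers and merges the results.
        term = term.lower()
        return self.recent.get(term, set()) | self.main.get(term, set())

    def flush(self):
        # Periodic step: move recent postings into the main tier.
        for term, doc_ids in self.recent.items():
            self.main[term] |= doc_ids
        self.recent.clear()


index = TwoTierIndex()
index.add(1, "Lucene search benchmark")
index.add(2, "Solr search server")
before_flush = index.search("search")
index.flush()
index.add(3, "Sphinx search daemon")
after_flush = index.search("search")
print(before_flush, after_flush)
```

The point of the split is that writes only ever touch the small tier, so the large tier (and any caches built over it) can stay stable between flushes.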

Not really surprising - Disk I/O is the slowdown (0)

Anonymous Coward | more than 5 years ago | (#28593901)

This isn't really surprising to me. Disk I/O is the slowdown for almost all programs, so efficient disk access is more important than the application code, no matter how it is written. OTOH, a well designed system that minimizes wasteful I/O will do very well - even if it is written in, cough, java.

Way to go Apache guys!

BTW, I use Lucene on our document management system. It works well enough, but definitely eats more RAM than I'd like. Did anyone look at the RAM trade-off?

Re:k (1)

kestasjk (933987) | more than 5 years ago | (#28593969)

Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.

Of course you are, fool! Everyone else on slashdot knows exactly how Lucene and sqlite's indexing systems work. I don't know why they bothered to take the benchmarks at all, anyone with half a clue has integrated a Java engine running Lucene into sqlite and hooked it into MyISAM already..

Re:k (1)

Daengbo (523424) | more than 5 years ago | (#28594307)

I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.

In the "benchmark," it wasn't just impressive in those areas: it had the lowest search time, the smallest index, and the highest relevance. That makes top honors, in my book.

Re:k (0)

Anonymous Coward | more than 5 years ago | (#28594443)

Yeah and look at the memory stats. It uses nearly twice the memory of the next one down and more than 6 times the memory of the best*. I don't imagine it gets better over a long period of time either. I see that time and again, long running Java processes are no good.

* With that said, SQLite needs a lot of tweaking and I can tell from the memory usage that they didn't tweak it much if at all. That pretty much invalidates SQLite's results in these tests.

Re:k (1)

Scott Kevill (1080991) | more than 5 years ago | (#28594531)

Far more likely to be because of the choice of algorithms and the resources behind the project. Would be interesting to see how CLucene [sourceforge.net] performs.

Java is not slow (anymore) (1)

allcoolnameswheretak (1102727) | more than 5 years ago | (#28595459)

Java can't seem to get past its reputation for being slow, which quite simply is no longer true. Java can match and even exceed the speed of C/C++ implementations. This often seems like an impossible, even outrageous claim to many C/C++ developers. What they fail to see is that Java's HotSpot compiler compiles critical code sections at runtime on the client computer. This has the advantage over C/C++ programs that the compiler has detailed info about the system it's running on and can therefore perform specific optimizations that a C/C++ program (compiled only on the developer's PC) can't.

Re:k (1)

johannesg (664142) | more than 5 years ago | (#28596241)

Nothing else to say, really

Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
 

Yes, that's pretty much you yes. Different algorithms, therefore different performance. Reimplement Lucene in C++, then see what the differences are in terms of speed (and if you care, code size, complexity, etc.). Until then the comparison is totally meaningless.

And gee, what's with the defensive attitude...

Re:k (2, Informative)

JorDan Clock (664877) | more than 5 years ago | (#28599473)

Kind of like... CLucene [sourceforge.net] ?

Re:k (4, Informative)

johannesg (664142) | more than 5 years ago | (#28600031)

Ah, thank you. So indeed, an implementation of the same algorithm turns out to be _three times_ as fast in C++ as it is in Java (see here [sourceforge.net] ).

I wonder if eldavojohn wishes to comment on that?

Re:k (1)

BikeHelmet (1437881) | more than 5 years ago | (#28598821)

It's no surprise to me. Java has long since been the best technology for all things internet. Streaming servers, forum software, indexing/archiving, Web2.0 sites; it's only several dozen times faster than Ruby or PHP, with similar memory usage. And I'm not talking applets here - I mean the backend. Tomcat is even significantly faster than mod_php or fastCGI with their C backends.

Keep in mind that anything Java based has VM overhead. If they included that in the Lucene graphs, then it performed the best while using about as much memory as sqlite. If they didn't, then it's a bit RAM hungry (add another 30MB), but still performs the best.

I've always been a big advocate of using easy languages for complex software. When I was first learning programming, I opted to create Tetris in Javascript. It took me a few days - about 12 hours - but I did it from scratch, without help! Now I could probably do the same task in Java in 2 hours, but working in an "easy" language certainly does help when the code is almost above your head. It helps you keep a larger part of the project in focus, instead of having to focus on the actual code.

And then there are the gains from when you make a mistake. I'm sure some of you will claim to be perfect - but in C/C++, if you mess up and introduce memory leaks, you have to waste time tracking them down, rather than spending that time optimizing, thinking up new algorithms, etc. Easier languages are so much better for the average programmer, who may think up an impressive algorithm from time to time, but struggle with implementing it in a low level language.

Medical Journals? (-1, Troll)

Anonymous Coward | more than 5 years ago | (#28593755)

Are those anything like Livejournals?

Re:Medical Journals? (1)

fuzzyfuzzyfungus (1223518) | more than 5 years ago | (#28594187)

No, they are what keep Livejournals from becoming Deadjournals.

Hear the heads exploding - Java is fastest (4, Informative)

MosesJones (55544) | more than 5 years ago | (#28593795)

Okay so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).

C++ and C both fail to deliver the same level of performance as the Java virtual machine.

Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?

But hell this is Slashdot and Java is Slooooooow...

Re:Hear the heads exploding - Java is fastest (1)

zappepcs (820751) | more than 5 years ago | (#28593867)

You beat me to the comment. I'm sort of surprised that the reaction so far has been the sound of crickets and loud yawning... meh

Re:Hear the heads exploding - Java is fastest (1)

BrokenHalo (565198) | more than 5 years ago | (#28598019)

I'm sort of surprised that the reaction so far has been the sound of crickets and loud yawning... meh

Well, the OP certainly got a loud yawn from me for the remark about indexing twitter posts. They might just as well index cockroach farts.

Re:Hear the heads exploding - Java is fastest (1, Interesting)

TheSunborn (68004) | more than 5 years ago | (#28593905)

It may be a bit faster on searching, but it takes ~5 times as long to generate the index, and uses twice as much memory when searching, so it may just be a different trade-off between index time and search time.

And it's a bad search test, because the total search time is less than 2 seconds, thus not including the cost of the gc for java.

hint to people doing benchmark: When benchmarking a component which uses gc or similar memory handling methods, remember to have the test dataset be large enough that you cause enough gc cycles to make the performance of any single cycle noise.

And to be fair to the gc language, set minimum memory=maximum memory, so it will use as much memory as you allow and doesn't waste time allocating more memory.

Gc is more effective the more memory you allow it to use, because the runtime cost of gc mostly depends on the number of live objects, not the number of allocated objects.
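A rough sketch of that advice as a timing harness: warm up first, then take enough samples that one collector pause shows up as an outlier rather than as the whole measurement. Everything here is an assumption for illustration, including the `benchmark` helper and the `allocate_heavily` workload. It is written in Python, whose collector (reference counting plus a cyclic collector) behaves differently from the JVM's, so it shows the shape of the harness only. On the JVM, the "minimum memory = maximum memory" suggestion corresponds to passing the same value to `-Xms` and `-Xmx`.

```python
import gc
import statistics
import time


def benchmark(func, warmup=5, runs=50):
    """Time func() repeatedly; return (median, worst) wall-clock seconds."""
    for _ in range(warmup):
        func()  # warm caches before measuring anything
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), max(samples)


def allocate_heavily():
    # Hypothetical workload: many short-lived objects, as an indexer might create.
    return [{"id": i, "terms": [str(i)] * 4} for i in range(20_000)]


gc.enable()  # keep the collector on so its cost is part of the measurement
median_s, worst_s = benchmark(allocate_heavily)
print(f"median {median_s:.4f}s, worst {worst_s:.4f}s")
```

Reporting both the median and the worst sample makes it visible when a single pause dominates, which is the failure mode the hint warns about.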

Re:Hear the heads exploding - Java is fastest (1)

xouumalperxe (815707) | more than 5 years ago | (#28605853)

hint to people doing benchmark: When benchmarking a component which uses gc or similar memory handling methods, remember to have the test dataset be large enough that you cause enough gc cycles to make the performance of any single cycle noise.

I have an even better idea. Why don't we just model the benchmark on the real world usage scenarios, and let those decide whether garbage collection and allocation even matter?

Re:Hear the heads exploding - Java is fastest (0)

Anonymous Coward | more than 5 years ago | (#28593947)

But...but...if I did it, it would be fast! The guys making the non-Java implementations must not know what they're doing!
 
Yeah...sure...

Re:Hear the heads exploding - Java is fastest (1)

bluefoxlucid (723572) | more than 5 years ago | (#28593955)

Finding it easier to code well in Java than C is like finding it easier to drive Automatic than Manual. I stopped driving automatic, it stopped almost getting me into accidents.

Re:Hear the heads exploding - Java is fastest (1)

Atzanteol (99067) | more than 5 years ago | (#28593989)

Driving an automatic almost got you into accidents? You must *suck* at driving dude.

Re:Hear the heads exploding - Java is fastest (1)

bluefoxlucid (723572) | more than 5 years ago | (#28594477)

I can't seem to move into a lane of faster traffic in heavy traffic situations on the highway without being able to immediately accelerate. I can't seem to shift into a lower gear without using the knockdown mechanism, which requires me to depress the accelerator the whole way and wait a second for everything to engage. In a manual, I can downshift to fourth or third and control my speed, enter an opening, and accelerate quickly without fear that backing off the accelerator a little (you try keeping control under full throttle) will result in me being thrown instantly into high gear.

It's the same as what happens when Boehm-gc kicks in during a real time event and causes sound playback to skip. It usually doesn't, but sometimes it does. More importantly, it's like when Boehm-gc encounters an implementation bug or limitation (like a message passing mechanism that uses relative pointers to pass information between threads) and frees up memory in use.

Too many programmers use Java's garbage collector and exception handling as crutches. There are brand new programmers that only know Java, C#, and PHP/Perl/whatnot; when you explain to them how C++ and C allocate/free memory, they look confused, and start talking about how it's humanly impossible to keep the entire running state of your program in your head, or some other such garbage. Then they go and use try/catch with an empty catch block just so their broken Java program continues to hobble along through errors. I'm more comfortable with C, and some programmers are more comfortable in Java or C# but can code in C and/or C++ fine.

Re:Hear the heads exploding - Java is fastest (1, Insightful)

Anonymous Coward | more than 5 years ago | (#28594269)

I stopped driving automatic, it stopped almost getting me into accidents.

You're a fucking idiot. Get off my road.

Re:Hear the heads exploding - Java is fastest (0)

Anonymous Coward | more than 5 years ago | (#28593971)

C++ and C both fail to deliver the same level of performance as the Java virtual machine.

The first requirement to successfully mock the uninformed wisdom of crowds, is to be informed [sourceforge.net] yourself.

HTH

Re:Hear the heads exploding - Java is fastest (1)

cpghost (719344) | more than 5 years ago | (#28594001)

Granted, bubble sort is slower in C/C++ than Quicksort in Java. Then again, we do have qsort(3) in C and std::sort() in C++/STL, and slow C++ code is usually the result of developer newbies misunderstanding the copy semantics of parameter passing.

Re:Hear the heads exploding - Java is fastest (3, Informative)

Roy van Rijn (919696) | more than 5 years ago | (#28594257)

Hrm, this had absolutely nothing to do with the language. It has almost everything to do with the algorithms.

It's very hard to compare languages, maybe if you use the languages to implement the exact same algorithm and let it run for a long while... But that still doesn't really compare it well enough.

Like somebody already said: Bubble sort in C++ is (almost) always slower than a quicksort in Java.
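To put a number on the algorithm-versus-language point, here is a small sketch that counts comparisons instead of measuring time, so the result does not depend on the language or machine. It uses merge sort as a stand-in for quicksort (an assumption made so the comparison count has a fixed upper bound); the function names and the 1000-element input are illustrative only.

```python
import random


def bubble_sort(items):
    """Plain bubble sort; returns (sorted list, number of comparisons)."""
    data, comparisons = list(items), 0
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data, comparisons


def merge_sort(items):
    """Top-down merge sort; returns (sorted list, number of comparisons)."""
    if len(items) <= 1:
        return list(items), 0
    mid = len(items) // 2
    left, left_count = merge_sort(items[:mid])
    right, right_count = merge_sort(items[mid:])
    merged, comparisons = [], left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


random.seed(0)
values = [random.randrange(10_000) for _ in range(1000)]
bubble_sorted, bubble_count = bubble_sort(values)
merge_sorted, merge_count = merge_sort(values)
print(bubble_count, merge_count)
```

For 1000 elements the bubble sort always performs n(n-1)/2 = 499,500 comparisons, while the merge sort stays under 10,000. A constant-factor language speedup of two or three times does not close a gap of that size.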

Re:Hear the heads exploding - Java is fastest (1)

ThePhilips (752041) | more than 5 years ago | (#28595043)

C++ and C both fail to deliver the same level of performance as the Java virtual machine.

Oh wait hang on...

As was pointed above, the search engines spend >90% of their time in DB/file I/O code.

In other words, implementation language plays little role - it is I/O optimization algorithms which play bigger role.

From my experience with a number of C/C++ projects, the efficiency of the languages/compilers allows developers to remain ignorant. In Java that approach simply doesn't work. Thus I more often see better algorithms in less efficient languages.

Like I recently found in one program people used a bubble sort - as if copied verbatim from "C Programming for Dummies in 21 day." And it worked without causing any problem for more than a decade - only after a rare occasion when dimension went above 1000, it took longer than 1 second to finish. I bet Java would have immediately choked on the code.

Re:Hear the heads exploding - Java is fastest (0, Flamebait)

Wovel (964431) | more than 5 years ago | (#28595911)

Java is slow. If you took the same algorithm and coded it in an efficient compiled language, it would be faster. Much faster.

Re:Hear the heads exploding - Java is fastest (1)

BlueKitties (1541613) | more than 5 years ago | (#28596679)

Java is fine for plenty of applications, but there are certain situations where it simply doesn't cut it. Heavy GUI oriented applications tend to take a massive performance hit because all of the objects are dynamically generated at run time -- just load up Eclipse and see how long it takes to start. Scientific and Mathematical applications, as well, rely on high-speed languages like C/FORTRAN. That doesn't mean Java is so slow it's useless -- in many cases the aided clarity and simplicity is worth it.

There are times we use _ASM_, there are times we use C, there are times we use Java. And, like it or not, we often fall back on C/C++ for speed. Generally, the only people who bash one language or the other are fanboys. Languages are tools to be used to our advantage, each has its own strengths and weaknesses -- sometimes we use a hammer, sometimes not; as a programmer, we must know the tools at our disposal and deploy them accordingly. Just because a pipe wrench can substitute a hammer doesn't mean it should.

Re:Hear the heads exploding - Java is fastest (1)

Lisandro (799651) | more than 5 years ago | (#28596999)

Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform...

Yes.

...and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?

No.

Re:Hear the heads exploding - Java is fastest (2, Informative)

johannesg (664142) | more than 5 years ago | (#28600093)

Okay so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).

C++ and C both fail to deliver the same level of performance as the Java virtual machine.

Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?

But hell this is Slashdot and Java is Slooooooow...

Actually if you check here [sourceforge.net] , you will find that an implementation of the exact same Lucene done in C++ is about three times faster than Java.

Sorry for spoiling your moment there...

I've been very happy with Sphinx.... (1)

tcopeland (32225) | more than 5 years ago | (#28593997)

...have used it on several projects and always gotten good results. Setting it up is easy and the Ruby API is solid, although I needed a tiny bit of additional code for special character escaping [blogs.com] . Highly recommended!

SQLLite is a search engine?!??! (2, Insightful)

brunes69 (86786) | more than 5 years ago | (#28594009)

Oh wait - seems TFA is saying a lot of sites just use an SQL DB and use like '%FOO%' as a "search engine"....

Ok, this is reasonable, however, I don't see why anyone would choose SQLite as a benchmark. If you are trying to compare search engines, and consider an RDBMS to be a 'search engine' category, then you at least need to include 4 or 5 of the most popular open source RDBMSs in the benchmark (SQLite, PostgreSQL, MySQL, Derby, Firebird), not just one.

Re:SQLLite is a search engine?!? (0)

Anonymous Coward | more than 5 years ago | (#28594995)

SQLite, Oracle, MySQL, PostgreSQL, etc. all have full-text indexing engines as part of the RDBMS, or as add-on packages. From TFA "I had some text issues with sqlite (also needs to be recompiled with FTS3 enabled) ...". According to this, it uses Full Text Search 3 (FTS3) as its text indexing engine. They all parse the CHAR(N) or CLOB(N) columns into tokens (words), and index those.

The standard SQL predicate "...WHERE columnN LIKE '%FOO%' " cannot use an ordinary (B-tree) index in any RDBMS. That is a non-indexable CHAR(N) or CLOB(N) search inside a string. Only left-anchored LIKE queries can use the index, "...WHERE columnN LIKE 'Foo%' ".
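The difference between the two predicates can be checked directly in SQLite with `EXPLAIN QUERY PLAN`. The following is a minimal sketch using Python's standard `sqlite3` module; the table `docs`, the index `idx_docs_title`, and the sample rows are made up for the example. One assumption worth flagging: SQLite's built-in LIKE is case-insensitive by default, so the column is declared `COLLATE NOCASE` here to make the index eligible for the prefix query. The exact wording of the plan text varies between SQLite versions, so the code only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOCASE collation lets SQLite's default case-insensitive LIKE use the index.
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT COLLATE NOCASE)")
conn.execute("CREATE INDEX idx_docs_title ON docs (title)")
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [("Foo fighters",), ("Kung foo",), ("Bar none",)],
)


def query_plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


prefix_sql = "SELECT title FROM docs WHERE title LIKE 'Foo%'"
infix_sql = "SELECT title FROM docs WHERE title LIKE '%Foo%'"
prefix_plan = query_plan(prefix_sql)
infix_plan = query_plan(infix_sql)
print(prefix_plan)  # expected to describe an index range SEARCH
print(infix_plan)   # expected to describe a full SCAN
```

Both queries return correct rows; only the left-anchored one can be answered by seeking into the index rather than visiting every entry.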

Re:SQLLite is a search engine?!? (1)

mindas (533922) | more than 5 years ago | (#28595719)

Although they might have full text indexing and searching, databases and search engines/libraries work differently.

E.g. you come to an online DVD shop and search for "Tom Criuse" (hint: misspelled surname). Every decent search engine (including the Lucene library, not sure about the others evaluated here) would yield a result, despite the misspelling. I am not sure whether the database fulltext thing would spit out anything at all. It's simply built to do a different job, that's it.
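One common way to tolerate a misspelling like this is edit-distance matching, which is the general idea behind fuzzy queries in search libraries. The sketch below is a plain Levenshtein implementation in Python, not Lucene's code; the `fuzzy_lookup` helper, the threshold of 2 edits, and the tiny catalogue are assumptions for illustration. A real engine would not compare the query against every entry like this.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_lookup(query, names, max_distance=2):
    # Return every name within max_distance edits of the query, ignoring case.
    query = query.lower()
    return [n for n in names if levenshtein(query, n.lower()) <= max_distance]


catalogue = ["Tom Cruise", "Tom Hanks", "Penelope Cruz"]  # hypothetical data
matches = fuzzy_lookup("Tom Criuse", catalogue)
print(matches)
```

The swapped letters in "Criuse" count as two substitutions under plain Levenshtein distance, which is why the threshold is set to 2 here.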

THIS FP FoR- GNAA (-1, Offtopic)

Anonymous Coward | more than 5 years ago | (#28594027)

driven out by the ?A super-organised am protesting

OS Search measured by OS Benchmark (0)

Anonymous Coward | more than 5 years ago | (#28594069)

The open source search engines are being measured by an open source benchmark. Must be a conspiracy. I want to see proprietary benchmarks measuring these. I'm sure M$'s Bing would be the best.

CLucene (5, Insightful)

drac667 (878093) | more than 5 years ago | (#28594121)

All the other search engines except lucene are written in C/C++. Why didn't Vik Singh test also CLucene (http://sourceforge.net/projects/clucene/)?

Here is CLucene's description on SourceForge: "CLucene is a C++ port of Lucene: the high-performance, full-featured text search engine written in Java. CLucene is faster than lucene as it is written in C++."

Re:CLucene (2, Insightful)

samkass (174571) | more than 5 years ago | (#28594519)

CLucene is faster than lucene as it is written in C++.

XXX is better than YYY as it is written in [my favorite language].

Haven't we explored this one to death already? Java isn't slow, and there's nothing magic about C/C++. Badly written C/C++ gets trounced by Java any day, and algorithmic efficiency trounces both of those when it comes to complex functions like indexed searches.

Re:CLucene (2, Insightful)

caramelcarrot (778148) | more than 5 years ago | (#28594831)

But if it's a direct port of Lucene presumably it's using the same algorithms and has similar code quality - hence it provides a good direct comparison of the language speeds and such a comment is legit.

Re:CLucene (1)

ThePhilips (752041) | more than 5 years ago | (#28595255)

Haven't we explored this one to death already? Java isn't slow, and there's nothing magic about C/C++. Badly written C/C++ gets trounced by Java any day, and algorithmic efficiency trounces both of those when it comes to complex functions like indexed searches.

Actually on synthetic benchmarks C/C++ implementation might outperform the Java implementation. Some benchmarks are crafted to essentially test memory bandwidth, where C/C++ easily wins.

And still, well written C/C++ code scales magnitudes better than Java code. Resource management is a bitch. I have seen that to win a number of deals.

Re:CLucene (1)

Scott Kevill (1080991) | more than 5 years ago | (#28603029)

CLucene is faster, and uses less memory, from what is basically a direct port. The README includes some benchmarks:

There are 250 HTML files under $JAVA_HOME/docs/api/java/util for about
6108kb of HTML text.
org.apache.lucene.demo.IndexFiles with java and gcj:
on mac os x 10.3.1 (panther) powerbook g4 1ghz 1gb:
        . running with java 1.4.1_01-99 : 20379 ms
        . running with gcj 3.3.2 -O2 : 17842 ms
        . running clucene 0.8.9's demo : 9930 ms

I recently did some more tests and came up with these rough tests:
663mb (797 files) of Guttenberg texts
on a Pentium 4 running Windows XP with 1 GB of RAM. Indexing max 100,000 fields
→ Jlucene: 646453ms. peak mem usage ~72mb, avg ~14mb ram
→ Clucene: 232141ms. peak mem usage ~60mb, avg ~4mb ram

Searching indexing using 10,000 single word queries
→ Jlucene: ~60078ms and used ~13mb ram
→ Clucene: ~48359ms and used ~4.2mb ram
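For reference, the speedups implied by those README figures work out as below. The numbers are copied from the quoted text; the one assumption is that the Clucene indexing time of 232141, which the README gives without a unit, is in milliseconds like the figure next to it.

```python
# Timings in milliseconds, copied from the README figures quoted above.
# Assumption: the unitless Clucene indexing figure (232141) is also in ms.
timings_ms = {
    "indexing 6108kb of HTML (java 1.4.1 vs clucene 0.8.9)": (20379, 9930),
    "indexing 663mb of texts (Jlucene vs Clucene)": (646453, 232141),
    "10,000 single word queries (Jlucene vs Clucene)": (60078, 48359),
}

# Speedup = Java time divided by C++ time for the same task.
speedups = {name: java_ms / cpp_ms for name, (java_ms, cpp_ms) in timings_ms.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

On these figures the C++ port is roughly 2.1x and 2.8x faster at indexing and about 1.2x faster at searching.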

why index something as useless as twitter? (0)

Anonymous Coward | more than 5 years ago | (#28594131)

why index something as useless as twitter?

How do these compare to Oracle? (1)

introspekt.i (1233118) | more than 5 years ago | (#28594283)

Does anybody know? That'd be a great comparison.

PLEASE (1)

Parker Lewis (999165) | more than 5 years ago | (#28594311)

Please, can we avoid the "Java vs C/C++" thread again?

i can speak from experience (1)

nimbius (983462) | more than 5 years ago | (#28594433)

the lucene based nutch has been a big help to our group. we currently index 60 sites across the company, dive through PDF files and even shockwave flash and powerpoint with ease. the search is extremely fast and the results are so accurate they've blown our corporate engine completely out of the water.

Swish++ not mentioned? (2, Informative)

bobv-pillars-net (97943) | more than 5 years ago | (#28596187)

Last time I had to implement an indexing and searching solution, swish++ [sourceforge.net] was by far the performance winner.

clucene (0)

Anonymous Coward | more than 5 years ago | (#28606103)

clucene beats jlucene (or simply Java Lucene) in everything.

http://clucene.wiki.sourceforge.net/Benchmarks
