lamaditx writes "The book Collective Intelligence in Action shows you how to apply theory from Machine Learning, Artificial Intelligence and Data Mining to your business. The goal is to create systems which make use of data created by groups of people — i.e. social networks — and abstract from these to gain new or additional information. Some of you might think "just another kind of Web 2.0." This is one application you might think of, but the input and output format do not matter that much. You can use these methods anywhere as long as the amount of data is big enough. You will find some examples related to the latest web technologies to explain methods, but the code is rather generic. Also, you won't find a lot disturbing details about HTML, HTTP and the like." Keep reading for the rest of Adrian's review.
There are three main parts to Collective Intelligence in Action. The first part explains how to gather data from external sources or internal repositories. The second part, "Deriving Intelligence", explains how to analyze the collected data. This is the part where you gain information and create new ideas. This does not help you much unless you find a way to use this in you application. The third part — which is also the shortest — provides you with some information on how to use the results in order to build user centric applications. This is obviously the best way to create a unique difference no matter what kind of services you may want to provide.
I have to admit that I waited for such a book for some time. After studying Artificial Intelligence — a modern approach — maybe THE book about AI — I felt like knowing a lot theory but missed the practical aspects. Several AI concepts are used in this guide but you don't create an AI system or an agent. Don't mix up those two even though they are similar.
The "in Action" series in supposed to show how things are done in practice. You can expect a lot of Java code samples and advice. Several open source tools are introduced to enable you to build your own system. These are also Java tools. It's up to you if you prefer to use Java or some other language. From my perspective it does not really matter which language you choose because the concepts can be implemented using other languages as well. The main drawback is that you will not be able to use Java Data Mining API (JDM) which is used extensively.
The first chapter introduces the main terms and concepts of the book. It is available here together with chapter 2 and the source code. One thing I consider to be an important prerequisite are mathematics. Most aspects are easy to read and understand if you have some knowledge about statistics and linear algebra. On the other hand you can still get it with basic math because the explanation is well written. The same holds for standard concepts and algorithms like word stemming, decision trees, Bayesian networks or k-means. These are summarized with the most important properties such that you don't require prior knowledge. You will notice that the chapter, like the following ones, ends with a large amount of references.
Personally I find it hard to read formulas when they are described in words (like: take the square root of x and multiply with y) instead of the mathematical notation. This is due to the fact that you cannot look up the formula quickly, because it does not stand out from the text. It might have been better to provide the formula in words and a mathematical notation as well. You will find some formulas in mathematical notation but some are really hard to read since they are printed in a font size of about 4 while the text is written in 10.
Coming back to the content: The other sections of the first part show you how to gather data from external online sources. Of course you can apply the same concepts to offline sources or other data repositories. The key is to collect usable data to derive intelligence later on. One example is generating tags from a number of sources and associate each tag with a weight relative to the occurrence of the tag. The result will be one of the well known tag clouds.
You will need a persistent data storage such as a database for the results and access them in the second step. Unsurprisingly you will find several ER diagrams to create the right data structure. A big plus is that the author tells you explicitly the important facts which can be derived from formulas or (ER-) diagrams. Reading the text is much more convenient this way. He will also provide implications for the database design when discussing ER diagrams. You can be sure that you do not miss the important points.
The second part starts with an introduction of data mining and machine learning terminology and concepts. You are also introduced to the JDM API which proves to be helpful in the future. You may start looking for a substitute if you choose not to use Java. The extensive usage of design patterns in almost every aspect eases the change from Java to an alternative language. You get to know the common methods and how to implement them. I consider this part to be more or less craftsmanship . There is some magic to it if you never heard anything about the utilized methods.
The only thing that caught my eye was the calculation of the inverse of a matrix. The notation is pretty common when solving linear equations, but you should never (except in rare cases) use the plain matrix inversion operation when implementing your solution. The reason is that the amount of effort to be undertaken grows exponentially. The more data is used, the larger the matrix will be — and thus the longer it will take to compute the inverse. Instead one should use, e.g., LU decomposition. The footnote points you to use the weka.core.matrix.Matrix class, which uses LU decomposition, but make sure about that if you use some other package or some other language.
The last 80 pages enable you to make use of your information gain and integrate it in the application. This is also the shortest part but that is due to the fact that the heavy lifting was done in parts one and two. Application means basically querying your data in the correct way to generate the right recommendations for your users. One part of that is searching and the other one is recommending. You may imagine the necessary effort to undertake if you ever happened to take a look at the way search engines work. The author deals with that by using the open source search engine Nutch together with Lucene in such a way that you just use the interfaces. This approach enables the author to keep the last part as short as it is.
I consider "Collective Intelligence in Action" to be a very good book. It is thought through from beginning to end. Examples are not just presented to the reader, but evolve step by step. You know why things are done the way they are, which enables you to change every aspect in a way you need to. From my point of view this is the right way to do it because a copy-and-paste solution would not get the job done. I pointed out some issues that could be done better such as too-small fonts in graphics or missing literature references in the text. However these are not major problems or content errors that should be blamed on the author. Finally I think you will gain from this book because it addresses Web 2.0 to some degree but is generic enough for other applications as well.
Adrian Lambeck is a graduate student in "Media and Information Technologies" and uses C# more often than Java.
You can purchase Collective Intelligence in Action from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.