
Programming Collective Intelligence

samzenpus posted more than 6 years ago | from the the-movies-you-want-to-watch dept.

Book Reviews

Joe Kauzlarich writes "In 2006, the online movie rental store Netflix offered a $1 million prize to whoever could write a movie recommendation algorithm providing a ten percent improvement over their own. As of this writing, the intriguingly named Gravity and Dinosaurs team holds first place by a slim margin of 0.07 percent over BellKor, their algorithm an 8.82 percent improvement on the Netflix benchmark. So the question remains: how do they write these so-called recommendation algorithms? A new O'Reilly book gives us a thorough introduction to the basics of this and similar lucrative sciences." Keep reading for the rest of Joe's review.

Among the chief ideological mandates of the Church of Web 2.0 is that users need not click around to locate information when that information can be brought to the users. In terms of recommendation systems, this is achieved by leveraging 'collective intelligence,' that is, by computationally analyzing the statistical patterns of past users to make as-accurate-as-possible guesses about the desires of present users. Amazon, Google and certainly many other organizations, in addition to Netflix, have successfully edged out more traditional competitors on this basis; the latter fail to pay attention to the shopping patterns of users and force customers to locate products by trial and error, as they would in, say, a Costco. As a further illustration, if I go to the movie shelf at Best Buy and look under 'R' for Rambo, no one's going to come up to me and say that the Die Hard Trilogy now has a special-edition release on DVD and is on sale. I'd have to accidentally pass the 'D' section while looking in that direction in order to notice it. Amazon would tell me immediately, without bothering to mention that Gone With The Wind has a new special edition.
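To make the 'bring the information to the user' point concrete, here is a minimal sketch of user-based collaborative filtering in the spirit of the book's opening chapter. The sample ratings and function names are invented for illustration; this is not the book's own code.

    from math import sqrt

    # Toy ratings: user -> {movie: stars}. Purely illustrative data.
    ratings = {
        'Ann':  {'Rambo': 4, 'Die Hard': 5, 'Gone With The Wind': 1},
        'Bob':  {'Rambo': 5, 'Die Hard': 4},
        'Cara': {'Gone With The Wind': 5, 'Die Hard': 2},
    }

    def similarity(a, b):
        # Inverse-distance similarity over the movies two users share.
        shared = set(ratings[a]) & set(ratings[b])
        if not shared:
            return 0.0
        dist = sqrt(sum((ratings[a][m] - ratings[b][m]) ** 2 for m in shared))
        return 1.0 / (1.0 + dist)

    def recommend(user):
        # Score unseen movies by similarity-weighted ratings of other users.
        totals, weights = {}, {}
        for other in ratings:
            if other == user:
                continue
            sim = similarity(user, other)
            for movie, stars in ratings[other].items():
                if movie in ratings[user]:
                    continue  # only suggest movies the user hasn't rated
                totals[movie] = totals.get(movie, 0) + sim * stars
                weights[movie] = weights.get(movie, 0) + sim
        return sorted(((totals[m] / weights[m], m) for m in totals), reverse=True)

    print(recommend('Bob'))  # Bob's neighbors pull Gone With The Wind up or down

Run against real rental histories instead of this toy dictionary, the same weighting idea is what surfaces the Die Hard sale before you ever wander past the 'D' shelf.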

Programming Collective Intelligence is far more than a guide to building recommendation systems. Author Toby Segaran is not a commercial product vendor, but a director of software development for a computational biology firm, doing data-mining and algorithm design (so apparently there is more to these 'algorithms' than just their usefulness in recommending movies?). Segaran takes us on a friendly and detailed tour through the field's toolchest, covering the following topics in some depth:
Recommendation Systems
Discovering Groups
Searching and Ranking
Document Filtering
Decision Trees
Price Models
Genetic Programming
... and a lot more

As you can see, the subject matter stretches into the higher levels of mathematics and academia, but Segaran successfully keeps the book intelligible to most software developers, and the examples are written in easy-to-follow Python. Later chapters cover more advanced topics, like optimization techniques, and many of the more complex algorithms are deferred to the appendix.

The third chapter of the book, 'Discovering Groups,' deserves some explanation and may enlighten you as to how the book can be of use in day-to-day software design. Suppose you have two sets of data interrelated by a 'JOIN.' For example, certain customers may spend more time browsing certain subsets of movies. 'Discovering Groups' refers to the computational process of recognizing these patterns and sectioning data into groups. In terms of music or movies, these groups would represent genres. The marketing team may thus become aware that jazz enthusiasts buy more music at sale prices than do listeners of contemporary rock, or that listeners of late-'60s jazz also listen to '70s prog, or similar trends.
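As a rough illustration of the grouping step, here is a plain k-means sketch; the book walks through its own clustering implementations, so treat this as a generic stand-in with invented data.

    import random

    # Toy data: each customer is (minutes browsing jazz, minutes browsing rock).
    customers = [(30, 2), (28, 5), (1, 40), (3, 35), (25, 4), (2, 38)]

    def kmeans(points, k=2, rounds=10):
        # Plain k-means: assign each point to its nearest centroid, re-average.
        centroids = random.sample(points, k)
        for _ in range(rounds):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[i])))
                clusters[nearest].append(p)
            centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c
                         else centroids[i] for i, c in enumerate(clusters)]
        return clusters

    for group in kmeans(customers):
        print(group)  # the jazz-ish browsers and the rock-ish browsers separate

The two printed groups are the 'genres' a marketing team would then inspect for trends like the jazz-at-sale-prices pattern above.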

Certainly the applications of the tools Programming Collective Intelligence provides are broader than my imagination can handle. Insurance companies, airlines and banks are all part of massive industries that rely on precise knowledge of consumer trends, and they can certainly make use of the data-mining techniques introduced in this book.

I have no major complaints about the book, particularly because it fills a gap in popular knowledge with no precursor of which I'm aware. Presentation-wise, even though Python is easy to read, pseudo-code is more timeless and even easier to read; you can't cut & paste from a paper book into a Python interpreter anyway. It may have been more appropriate to use pseudo-code in print and keep the example code on the website (I'm sure it's there anyway).

If you ever find yourself browsing or referencing your algorithms text from college, or even seriously studying algorithms for fun or profit, then whether I'd recommend this book depends on your background in mathematics and computer science. If you have a strong background in the academic study of related research, you might look elsewhere; but this book, certainly suitable as an undergraduate text, is probably the best one for relative beginners that will be available for a long time.

You can purchase Programming Collective Intelligence from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.


Disco Duck !! (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#23092426)

Eat my shorts!!

programmers please take note (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#23092440)

You are all homosexual gays who sex it with men in the butts, anally.

Re:programmers please take note (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#23092554)

Only Apple programmers are, you insensitive clod!

How is it quantified (3, Insightful)

4D6963 (933028) | more than 6 years ago | (#23092466)

So, the question remains, how do they write these so-called recommendation algorithms?

For now, I'm more interested in how they quantify these improvements.

Re:How is it quantified (3, Informative)

Otter (3800) | more than 6 years ago | (#23092752)

Let's say I have a dataset where 1000 people have each reviewed 20 movies. If I give you a set with five reviews blanked out for each person, how accurately can you predict them from the other 15?

Re:How is it quantified (5, Informative)

robizzle (975423) | more than 6 years ago | (#23092852)

Which improvements? The Netflix competition?

They basically have a large dataset of (User, Movie, Rating) triples, which they split into two subsets. From the smaller subset they removed the ratings and didn't release them to the public; the larger subset they didn't modify at all. They had Cinematch make predictions on the smaller subset (without being told the real ratings) and use this as the baseline. Competitors then make predictions on the missing ratings, and improvement can be calculated. The percent improvement is 100 * (1 - [Submission's Error] / [Cinematch's Error]).

There are a number of ways to calculate the error, but for the Netflix competition they use RMSE (root mean squared error): take the sum of the squared differences between the predicted and the actual ratings, divide by the number of ratings, and take the square root.
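In code, that scoring boils down to a few lines. A minimal sketch, with made-up numbers standing in for the hidden ratings and the two sets of predictions:

    from math import sqrt

    def rmse(predicted, actual):
        # Root mean squared error over paired ratings.
        return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    # Invented stand-ins for the withheld qualifying ratings.
    actual     = [4, 3, 5, 2, 4]
    cinematch  = [3.6, 3.4, 4.2, 2.9, 3.8]  # baseline predictions
    submission = [3.9, 3.1, 4.6, 2.4, 3.9]  # a contestant's predictions

    improvement = 100 * (1 - rmse(submission, actual) / rmse(cinematch, actual))
    print('%.2f%% improvement over the baseline' % improvement)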

Detailed information can be found on the Netflix Prize rules page [netflixprize.com] and there are a number of good posts on the forums as well.

Re:How is it quantified (-1, Troll)

Anonymous Coward | more than 6 years ago | (#23093156)

God, then look it up. Seriously. There are entire graduate courses devoted to this, and I bet you think you're the first person to think of it. There's nothing more embarrassing than a /.er posting on some academic subject he has never studied.

Re:How is it quantified (1)

4D6963 (933028) | more than 6 years ago | (#23093332)

God, then look it up.

Muahaha, you're new here, aren't you? Why look anything up when you can just ask knowledgeable people who'll be happy to tell you, and in the process inform people who'd be interested to know too?

Re:How is it quantified (0)

Anonymous Coward | more than 6 years ago | (#23093602)

God, then look it up.

Muahaha, you're new here, aren't you? Why look anything up when you can just ask knowledgeable people who'll be happy to tell you, and in the process inform people who'd be interested to know too?

You forgot to mention that you also get the satisfaction of making some asshat post as AC complaining about how other people are too lazy to look it up.

Re:How is it quantified (3, Informative)

Gorobei (127755) | more than 6 years ago | (#23098930)

For now I'm more interested to know how they quantify these improvements.

Quantification is a fun field in itself, and by no means trivial. As other posters have noted, there are many leave-n-out approaches: basically, divide the dataset into a training set and a test set, and rank entrants by how accurately their code predicts the test set given the training set.
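A minimal sketch of that split-and-rank setup; all the data here is random filler, not Netflix's:

    import random
    from math import sqrt

    # Toy (user, movie, rating) triples.
    data = [(u, m, random.randint(1, 5)) for u in range(100) for m in range(20)]
    random.shuffle(data)
    cut = int(0.75 * len(data))
    training_set, test_set = data[:cut], data[cut:]

    def score(predict):
        # Rank entrants by RMSE on the held-out test set: lower is better.
        errs = [(predict(u, m) - r) ** 2 for u, m, r in test_set]
        return sqrt(sum(errs) / len(errs))

    # A trivial baseline entrant: predict the training set's global mean.
    mean = sum(r for _, _, r in training_set) / len(training_set)
    print(score(lambda u, m: mean))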

These types of tests are good in that they are easy for the judges and participants to understand. The problem, of course, is that over repeated trials, information about the test set leaks out through the scoring, and the participants slowly overfit their algorithms to the test set based on scoring feedback (in the extreme case, there is no training data, only test data, and the winning algorithms are just maps from matched test inputs to correct outputs).

Even if you manage to ameliorate this problem (e.g., by requiring submission of a function that will be applied to an unknown training set to produce a set of predictions), there is still the risk that the high-scoring functions are not very useful (e.g., predicting someone's rating of "The Matrix" is easy and has a low RMS error, but do you even care about the error on most people's rating of "Mr. Lucky," which most have never heard of?)

So, to be really useful, you want your rating (objective) function to be weighted by usefulness from the point of view of your business (e.g., yes, everyone likes the current blockbuster, but will John Q. Random be happy getting "Bringing Up Baby" instead?). Here, "happy" is defined as maximizing profits for the firm :)

So, you often see a prize with a simple (but wrong) objective function. Then you offer the winners a chance at real money if they work on the actual hard problems the firm is facing (this is what we do on Wall St, anyway ;)

Who Cares About 0.1 Stars Difference? (5, Interesting)

Jynx77 (974092) | more than 6 years ago | (#23092534)

I was initially intrigued by recommendation algorithms. Sadly, it's easy to get them up to a certain point and then almost impossible to make them any better, at least for movies. Netflix rates almost everything between 2.5 and 4 stars. Movies it rates 1 or 2 stars, I wouldn't have considered watching anyway. It never rates anything 5 stars. And for things between 3 and 4 stars, I seem just as likely to really like a 3-star item as I am to not really like a 4-star item. So why is Netflix paying a million bucks to change that 3 to a 3.1 or 2.9?

Re:Who Cares About 0.1 Stars Difference? (2)

JeanBaptiste (537955) | more than 6 years ago | (#23092580)

Because it does make a big difference when you scale the system up to millions of users.

Re:Who Cares About 0.1 Stars Difference? (1)

Jynx77 (974092) | more than 6 years ago | (#23092990)

Really? How so? We're not talking about money here or something tangible that really adds up.

Re:Who Cares About 0.1 Stars Difference? (3, Insightful)

JeanBaptiste (537955) | more than 6 years ago | (#23093190)

Think of it more like marketing, because that's exactly what it is. They are basically showing you billboards of other movies you may have an interest in, and this algorithm decides which billboards are shown to you. Now, if the algorithm is 0.1 percent better at deciding which billboards to show you, does that really matter to you as an individual? Not at all. Does it matter to Netflix across a userbase of millions of people? Absolutely. Hence this contest.

Re:Who Cares About 0.1 Stars Difference? (1)

Jynx77 (974092) | more than 6 years ago | (#23095018)

I could see your point if I were paying on a per-movie basis. I guess if everyone feels a tiny bit better about Netflix because its predictions are 0.1 or 0.2 stars better, it's a win for Netflix. I'm guessing there were better ways to spend the million and achieve that. Although, as someone else pointed out, they may never pay the $1M, and they are getting lots of free publicity.

Re:Who Cares About 0.1 Stars Difference? (2, Insightful)

abolitiontheory (1138999) | more than 6 years ago | (#23092840)

I think there should be seven stars. This is an endless debate, I know, about which data-entry metric to choose, but seven stars seems to provide meaningful choices, whereas five limits the field too much and ten makes some of the choices functionally meaningless.

Of course, people who still decide to rate The Wedding Singer seven stars can throw the whole thing off, like on iTunes where *no* album scores under a four or a five. But that's the problem, isn't it: humans are entering these things. Not only do differences in taste have to be considered, but also differences in how people view the rating scale, what their current mood is while entering the information, etc.

Perhaps more effective data can be mined from people's purchasing choices, since we know that what people say and what they do are often not the same. I think that's why I like Amazon's "most people who viewed this item ended up purchasing:" followed by the three most popular options. Their recommendations are fairly solid, if redundant, overall.

Anyway, it's hard to do anything correctly with a large number of average humans.

Re:Who Cares About 0.1 Stars Difference? (1)

dookiesan (600840) | more than 6 years ago | (#23095072)

After they enter their star rating, you could give the person a list of similarly rated movies (by that person) that are currently inferred to be related, and ask, "which movies did you love/hate for similar reasons?" If people actually took the question seriously, you would quickly build up a set of very good predictors, even on a person-to-person basis (look for shared actors/directors/screenwriters among their picks).

Re:Who Cares About 0.1 Stars Difference? (1)

jsebrech (525647) | more than 6 years ago | (#23095576)

Of course, people who still decide to rate The Wedding Singer seven stars can throw the whole thing off, like on iTunes where *no* album scores under a four or a five.

The fallacy is allowing extreme votes to be worth more than moderate votes, because moderate votes are more likely to be accurate.

It's better to use a simple up/down vote system. Everyone's vote is worth as much that way.

Re:Who Cares About 0.1 Stars Difference? (1)

CanadaIsCold (1079483) | more than 6 years ago | (#23092846)

There was a fairly good Wired article on what they are trying to accomplish; it has less to do with ranking and more to do with recommending movies for individuals.

http://www.wired.com/techbiz/media/magazine/16-03/mf_netflix?currentPage=all [wired.com]

The algorithm aims to analyze your habits and then recommend the best movie for you. The interest to Netflix is that if you can get more people interested in movies they haven't seen, they will rent more movies.

Re:Who Cares About 0.1 Stars Difference? (2, Interesting)

WindowlessView (703773) | more than 6 years ago | (#23093308)

I was initially intrigued by recommendation algorithms.

Me too. Last time this topic rolled around I took a brief look at the Netflix competition and was disappointed. The star rating system was limited but more importantly there was a remarkable lack of data. Many of the teams that edged out some improvement did so by importing lots of data from other sources - with lots of holes in that process - and trying to discern patterns from that.

On the whole the exercise seems to be a variation on a couple of decades ago, when so many people bought a PC because they planned to be the next stock market whiz by throwing a neural net at basic NYSE daily data. With fancy algorithms and math constructs being all the rage these days (dare I say a bit of a fad?), it behooves us to remember that they are far from the whole story. It helps to have some useful data with which to make connections. No matter how fancy the algorithm, you aren't going to harvest rice in a desert.

Re:Who Cares About 0.1 Stars Difference? (1)

timeOday (582209) | more than 6 years ago | (#23093660)

With fancy algorithms and math constructs being all the rage these days (dare I say a bit of a fad?), it behooves us to remember that they are far from the whole story. It helps to have some useful data with which to make connections. No matter how fancy the algorithm, you aren't going to harvest rice in a desert.

Sure, but that's why people import "lots of data from other sources," so why do you call that a bad thing? Yes, collecting more and better data is often more important than additional algorithm development on the same old data sources. But I disagree that it's somehow "cheating" and not valuable. In fact, I think the reason AI falls short of human intelligence in the real world is largely because we haven't figured out how to import (via artificial perception) the wealth of data from other sources that people use.

Re:Who Cares About 0.1 Stars Difference? (1)

teh moges (875080) | more than 6 years ago | (#23100040)

Any interesting non-trivial problem suffers from a lack of data. I agree, though, that a lot of the teams are caught up in copying the top teams and hoping to get lucky by adding incremental improvements.
The winner of the Netflix prize will probably be someone who takes the problem from a completely new angle. Eventually the increments will reach the milestone, as new algorithms are tweaked and edited, but I'm guessing that someone will come along and take the prize another way before that happens.

Re:Who Cares About 0.1 Stars Difference? (1)

platykurtic (1210910) | more than 6 years ago | (#23093654)

So why is Netflix paying a million bucks to change that 3 to a 3.1 or 2.9?

That's the clever part: they're not paying a million bucks, they're offering a million bucks to anyone who gets to 10%, which may never happen. And in the meantime they've gotten some better algorithms for free, as well as good publicity.

Re:Who Cares About 0.1 Stars Difference? (2, Informative)

Jynx77 (974092) | more than 6 years ago | (#23093774)

I think they are paying $50K a year to the top team. Not sure if that's got a time limit on it. I guess the publicity is good.

Re:Who Cares About 0.1 Stars Difference? (1)

SpinyNorman (33776) | more than 6 years ago | (#23094462)

I'm not sure how you equate a 10% accuracy improvement in "predicted like vs. actual like" to a 0.1-star delta on a 5-star system. In a 5-star system, each star is surely equivalent to 20% predicted like, so a 10% accuracy improvement would be a 0.5-star reduction in the mismatch between prediction and actual.

I think it's reasonable to expect that a 0.5-star accuracy improvement on a 5-star system would be noticeable by enough people (although not all) to make a difference, presumably resulting in better confidence in the recommendation system and an increase in the number of people who would choose a recommendation when they otherwise had nothing else they wanted. The benefit to Netflix is that if more people are consistently watching (and enjoying) movies to the max of their monthly limit, then they are less likely to drop down to a cheaper fewer-movies plan (or drop out entirely).

Re:Who Cares About 0.1 Stars Difference? (1)

Jynx77 (974092) | more than 6 years ago | (#23094902)

First, you can't rate something 0 stars, so there's only a true range of 4 stars. 4 * 0.1 = 0.4, which would be +/- 0.2. However, as I mentioned in my post, the vast majority of predictions (for me) are in the 2-4 star range; hence 2 * 0.1 = 0.2, or +/- 0.1 star. Based on what my perceived margin of error is, +/- 0.1 would probably not even be noticeable.

Re:Who Cares About 0.1 Stars Difference? (1)

in10se (472253) | more than 6 years ago | (#23106608)

The question isn't what the current rating is. That's just the average of everyone else's ratings. The recommendation system attempts to figure out if *YOU* would like it based on various factors. If their system is accurate, then they can suggest more movies to you that you will actually like. If they suggest more movies to you that you like, you will continue using their service, or perhaps upgrade your subscription so you can have more of those great movies at once.

Re:Who Cares About 0.1 Stars Difference? (1)

Jynx77 (974092) | more than 6 years ago | (#23107138)

"recommendation system attempts to figure out if *YOU* would like it based on various factors."

Thank you, Captain Obvious!

Re:Who Cares About 0.1 Stars Difference? (1)

in10se (472253) | more than 6 years ago | (#23109090)

You are the one who posted the question. The question implies that you either:

a.) do not understand the difference between a recommendation system and a ratings system
        -or-
b.) do not know English well enough to coherently phrase a meaningful question

My answer to you assumed "a". I'm sorry it turns out it was "b".

The fact is, Netflix isn't trying to change a rating from 3.0 to 3.1 or 2.9. They just want to know if you will like that movie regardless of its average rating.

Re:Who Cares About 0.1 Stars Difference? (1)

Jynx77 (974092) | more than 6 years ago | (#23109520)

I suggest you re-read the posts in threaded mode. The first sentence of my post is "I was initially intrigued by recommendation algorithms." It's obvious that all subsequent uses of the word "rates" applied to the recommendation engine. Only you seem to have had trouble understanding that.

What question did I post that you thought you were answering? The last sentence of your last post shows a complete misunderstanding on your part. Is English your first language? Based on your sig, I wouldn't have thought you'd have trouble understanding that.

Numbers? (2, Informative)

drquoz (1199407) | more than 6 years ago | (#23092540)

The numbers in the summary don't match up with the numbers on Netflix's leaderboard [netflixprize.com] :

BellKor: 9.08%
Gravity/Dinosaurs: 8.82%
BigChaos: 8.80%

Re:Numbers? (0)

Anonymous Coward | more than 6 years ago | (#23092586)

You have to multiply those numbers by 90%, or 0.9, to get the results.

Re:Numbers? (0)

Anonymous Coward | more than 6 years ago | (#23092824)

The numbers in the summary don't match up with the numbers on Netflix's leaderboard [netflixprize.com] :
The leaderboard has changed since the review was written...

How are they judging "improvement?" (1)

abolitiontheory (1138999) | more than 6 years ago | (#23092612)

How are they defining this 10% improvement? How do they judge it? And how can they get it down to figures like 0.07%? There have to be user test groups involved, and I can't believe they're that objective. A 10% increase in rentals? In click-throughs? In user agreement that the recommendations are helpful? What?

Re:How are they judging "improvement?" (0)

Anonymous Coward | more than 6 years ago | (#23092812)

My understanding is that they use root mean squared error to calculate improvement. In the Netflix Prize case, they have the contestants submit ratings for a set of films that Netflix already has actual customer ratings for. These values are used to calculate the RMSE.

Re:How are they judging "improvement?" (1)

abolitiontheory (1138999) | more than 6 years ago | (#23092936)

Ahh. Thank you. For reference, root mean squared error [wikipedia.org] .

Re:How are they judging "improvement?" (1)

FourDegreez (546211) | more than 6 years ago | (#23093106)

Wired article on the Netflix challenge [wired.com] . Netflix has a benchmark dataset that they run the competing algorithms against to judge improvement.

With 35535 entrants, this may just be noise (2, Interesting)

Animats (122034) | more than 6 years ago | (#23092618)

There are now 35535 entries in the Netflix competition. If they all used roughly the same algorithm, with some randomness in the tuning variables, we'd expect to see results about like what we've seen. I think we're looking at noise here.

The same phenomenon shows up with mutual funds. Some outperform the market, some don't, but prior year results are not good predictors of future results.

Re:With 35535 entrants, this may just be noise (2, Insightful)

CastrTroy (595695) | more than 6 years ago | (#23092742)

But the teams that are good continue to refine their algorithms and do better and better. The top teams continue to be at the top over the life of the competition. Also, you can't compare this to the stock market. If company A is doing well now, there is no guarantee that they will still be doing well in 2 or 3 years. However, if you liked a movie, you will probably always like the movie. Sure tastes change, but a lot less than the stock market.

Re:With 35535 entrants, this may just be noise (1)

thrillseeker (518224) | more than 6 years ago | (#23093072)

From Netflix's perspective, it doesn't matter whether I liked it or not - it matters that I rented it.

Re:With 35535 entrants, this may just be noise (1)

CastrTroy (595695) | more than 6 years ago | (#23093102)

Yes, but if they keep recommending movies you don't like, you may stop renting movies altogether.

Re:With 35535 entrants, this may just be noise (1)

Actually, I do RTFA (1058596) | more than 6 years ago | (#23093990)

From Netflix's perspective, it doesn't matter whether I liked it or not - it matters that I rented it.

Netflix is a subscription all-you-can-eat service. So they would most prefer if you got a large plan and never used it. Since the only thing that keeps you renewing your subscription is your enjoyment of the movies, and since it costs them money every time you rent a movie, they have a vested interest in trying to maximize enjoyment per movie.

Actually, they'd probably rather you really enjoy 1/3 of the movies you borrow, so you subscribe to the 3-at-a-time model.

Re:With 35535 entrants, this may just be noise (2, Insightful)

SQLGuru (980662) | more than 6 years ago | (#23094474)

Actually, I don't think they care whether you like the movie or not... I think the point is to maximize the movies out to subscribers and minimize the movies stored in a warehouse. If I have 1,000 movies in inventory and only 100 are "active", I have 900 movies taking up space. I also have customers who are waiting on one of the 100 movies to become available so they can watch it. If I recommend to you one of the 900, you get to watch a movie while waiting for one of the 100 popular titles, which means you aren't sitting there complaining about how long it takes to get a movie from Netflix. Of course, if you like the obscure movie that was recommended, you'll be more likely to take a chance on the next obscure movie that gets recommended; thus my 900 movies are in circulation, keeping people from hating my service and coincidentally not taking up space in my warehouse.

Layne

Re:With 35535 entrants, this may just be noise (2, Informative)

glyph42 (315631) | more than 6 years ago | (#23094086)

You should read the competition rules. The test set is so enormous that you would need 2^something_huge entries to see the results we've seen based on randomness. I did a back-of-the-envelope calculation at the beginning of the competition to see whether a random search would be feasible to win the prize, and it's not. Not in a million years. Literally.
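For flavor, here is a toy version of that back-of-the-envelope check, with the sizes shrunk drastically (the real qualifying set held millions of ratings). Simulate a crowd of random 'entrants' and watch how tightly their scores bunch:

    import random
    from math import sqrt

    def rmse(preds, actual):
        return sqrt(sum((p - a) ** 2 for p, a in zip(preds, actual)) / len(actual))

    actual = [random.randint(1, 5) for _ in range(100_000)]  # stand-in test set

    scores = []
    for _ in range(50):  # fifty random "entrants"
        guesses = [random.uniform(1, 5) for _ in actual]
        scores.append(rmse(guesses, actual))

    # The gap between the best and worst random entrant is tiny, so random
    # variation alone cannot mimic a real algorithmic improvement.
    print(min(scores), max(scores))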

I bought this book (4, Informative)

iluvcapra (782887) | more than 6 years ago | (#23092698)

I was at the Borders and was looking for something to pass the weekend, and I'd been doing some sound effects library work, so I took a look at this.

It has a lot of statistics; it's essentially a statistics-in-use book, with code examples in Python for all of the algorithms. That said, it makes all of the topics very accessible, proposes many different ways of solving different wisdom-of-crowds-type problems, and gives you enough knowledge that you'd be able to hear someone pitch you their dataset and say "Oh, you want to do full-text relevance ranking," or "You need a decision tree for that," or "You just want the correlation." The book very much has a statistics-as-swiss-army-knife approach.

Also, I'm not Pythonic, but I was able to translate all of the algorithms into Ruby as I went, even turning the list comprehensions into the Rubyish block/yield equivalents, so his style is not too idiomatic.

Re:I bought this book (2, Informative)

StarfishOne (756076) | more than 6 years ago | (#23094192)

Very nice summary! I own the book and I must say that it's very nice and accessible.

The examples are practical and described quite well, even if one's math skills are not that great.

And the examples in Python read almost like pseudo-code; even if one has little to no Python experience, the language is not a huge barrier.

5 stars out of 5!

The reviews at Amazon are also quite good:

http://www.amazon.com/review/product/0596529325/ref=pd_bbs_sr_1_cm_cr_acr_txt?_encoding=UTF8&showViewpoints=1 [amazon.com]

23 ratings at this moment, 20x5 stars, 1x4 star, 1x3 star.

Re:I bought this book (1)

iluvcapra (782887) | more than 6 years ago | (#23095004)

Yeah, good reviews there.

A point the reviewers make, which I didn't make very clearly, is that the book does have a bunch of statistics, but it also has neural networks and a bunch of other material more along the lines of "machine learning." One of the reviewers said it was the "best book on machine learning ever written," which may be true, but only if you're not a theorist or academic computer scientist.

Re:I bought this book (1)

StarfishOne (756076) | more than 6 years ago | (#23155198)

Yes, there are statistics, neural networks, genetic algorithms, clustering/distance measures, etc.

I might call it "the best PRACTICAL/APPLIED book on machine learning ever written". :)

For a more theoretical approach, this book is quite nice: Machine Learning, Tom Mitchell, McGraw Hill, 1997.
( http://www.cs.cmu.edu/~tom/mlbook.html [cmu.edu] )

(Btw: great signature. :))

hmm (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#23092734)

yeah what you said

Ever been to grad school? (0)

Oaklodge (1274204) | more than 6 years ago | (#23092942)

A million dollars? This is what happens when business people dabble in science. Artificial Intelligence grad students and professors have been studying these kinds of problems for decades. Netflix could have saved a boatload of money by throwing some cash at a university with an established AI group and asking them to research the current state-of-the-art. The only reason to put up that kind of money is to generate publicity, and I'm not really sure that worked.

Where to bet your money (1)

maillemaker (924053) | more than 6 years ago | (#23093166)

If I had to choose whether to bet my million bucks on some cushy grant-wallowing researchers or some hungry self-motivated code geeks, I'd pick the latter.

Re:Where to bet your money (1)

Oaklodge (1274204) | more than 6 years ago | (#23093368)

You miss my point: it's been done. There are probably a dozen algorithms, developed with sound scientific methods and subjected to peer criticism, that cover this sort of problem. After that it's just a matter of implementation and tweaking. Look at the top teams; they're all at universities or research labs. Those aren't just simple down-home code geeks, my friend; those people are (were) grad students.

Re:Ever been to grad school? (0)

Anonymous Coward | more than 6 years ago | (#23093226)

This method is much cheaper. It is similar to DARPA's Grand Challenge: they put up a $2 million US prize and receive orders of magnitude more research and development than that minuscule purse.

Re:Ever been to grad school? (2, Insightful)

strangeattraction (1058568) | more than 6 years ago | (#23093252)

Silly. What they are doing is smart. The grad school can compete and win the money if it chooses. In the event the university or the greedy code geeks fail to produce, it costs Netflix nothing. With your thinking, it costs them money whether results are produced or not. I guess that is why you do not run Netflix :)

Re:Ever been to grad school? (2, Insightful)

Sommelier (243051) | more than 6 years ago | (#23093590)

A million dollars? This is what happens when business people dabble in science. Artificial Intelligence grad students and professors have been studying these kinds of problems for decades.

I think that is the point - academia has been studying this for decades and has yet to produce meaningful results. I'm not saying that universities haven't contributed their fair share of technological advances through the years, but doing so in a practical and timely manner isn't exactly what they're known for. When business and/or money gets thrown into the mix, the pace of progress tends to rapidly accelerate.

X Prize Foundation [xprize.org]
Millennium Problems [claymath.org]
2008 Templeton Prize [nytimes.com]

Netflix could have saved a boatload of money by throwing some cash at a university with an established AI group and asking them to research the current state-of-the-art

According to the Netflix site [netflixprize.com] there are currently 35558 contestants on 29326 teams from 170 different countries. They could have thrown any amount of money at any university and still not received the kind of effort they've seen to date. I'd say their million dollars is money well spent.

Re:Ever been to grad school? (1)

Oaklodge (1274204) | more than 6 years ago | (#23093972)

I think the last few posts have a good point: promise big money and get results. It works well for engineering where there isn't a general solution to a general problem. But I want to bring the thread back to my original argument: It's. Been. Done. You think the teams in the lead just dreamed up their solutions from scratch? I'm willing to bet they are all using algorithms that grew out of university research and are just tweaking them.

Re:Ever been to grad school? (3, Insightful)

Eivind Eklund (5161) | more than 6 years ago | (#23094260)

I believe you're missing the point: Netflix has a solution that is about as good as the best previously published work, and they have tweaked it. They are well aware of the published work.

This is an attempt to bring out new solutions.

Eivind.

Re:Ever been to grad school? (1)

dookiesan (600840) | more than 6 years ago | (#23095002)

I don't think Netflix had tried all of the current methods yet. The best algorithms used the SVD (through alternating least squares) and k-nearest-neighbors. Simon Funk made everyone slap their heads when he posted his method, but I don't think his approach to fitting the SVD is new. I admit that no one would ever have used the Restricted Boltzmann Machine approach except for Hinton, though.

Re:Ever been to grad school? (1)

Oaklodge (1274204) | more than 6 years ago | (#23095396)

I agree Netflix is attempting to buy a new (hopefully better) solution. But if you're implying that someone is going to come up with a solution not based on any previous research, and that it is going to beat the Netflix solution (which you assert *is* based on the best previous research), then good luck with that. Maybe I'll eat my words on that statement... we'll see.

Amazon is best here (1)

hesaigo999ca (786966) | more than 6 years ago | (#23092944)

The problem is where you apply your algorithm. If you wait until people are paying for their items (as at Amazon), you can add suggestions right in the shopping cart: "the people who bought this book also bought this one," or "we have a sale: two books, one of which you already have, plus this one, for less."

This can only be done with a shopping-cart style, whereas Netflix has to wait for people to select their movie before it can recommend anything. Seriously, they should partner up with Amazon:
"the people who rented this movie from Netflix also bought this book from..." lol!

"As of this writing" (3, Interesting)

Anonymous Coward | more than 6 years ago | (#23093194)

When was this written? According to the leaderboard, http://www.netflixprize.com//leaderboard , BellKor is leading by 0.26 and has been leading for several months.

Come on already... (1, Insightful)

CopaceticOpus (965603) | more than 6 years ago | (#23094136)

Among the chief ideological mandates of the Church of Web 2.0...

Shut. The. Fuck. Up.

Seriously. It's a trend to create websites with more dynamic and shared content. That's it. No church, no ideology, no 2.0.

Re:Come on already... (0)

Anonymous Coward | more than 6 years ago | (#23094330)

Amen, Brother, Amen! ... wait a minute...

Re:Come on already... (0)

Anonymous Coward | more than 6 years ago | (#23104736)

Wow, somebody's ideology got offended.

This book offers a great foundation (1)

hierro (809232) | more than 6 years ago | (#23094516)

I've read this book, and let me say I found it to be a superb introduction to the topic. It teaches you different methods applicable to a lot of different situations. In fact, after reading it, I decided to build my own social news site [ffloat.it] based on user recommendation. However, I had to research the field a lot before coming up with a good and fast algorithm. That's the only flaw I found in the book: all the algorithms are poorly implemented (although this may be for the sake of clarity).

Good introduction to pattern recognition (2, Informative)

Gendor (1148039) | more than 6 years ago | (#23094574)

I came across this book browsing through Safari Books Online's titles, and was almost halfway through before I was able to get hold of an actual copy. While the main focus of the book is on data mining (definitely not only recommendation algorithms; it also shows how Google's PageRank algorithm works, how to mine user data from Facebook and write matching algorithms, etc.), it provides a good introduction to pattern recognition in general. It shows you how to write a simple neural network in Python, how to write a Bayes classifier for spam filtering, and even touches on Support Vector Machines (SVMs). What I really love about the book is that everything is explained by means of code examples, with the actual math theory in an appendix for those of us more mathematically inclined. You can literally sit with the book next to the computer and reproduce the code as you go along.
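As a taste of the Bayes-classifier chapter mentioned above, here is a minimal naive Bayes spam scorer. The training data is invented, and the book's own implementation differs in detail:

    from collections import defaultdict
    from math import log

    # Tiny invented training set: (text, is_spam).
    samples = [
        ('cheap viagra now', True),
        ('meeting agenda attached', False),
        ('win money now', True),
        ('lunch tomorrow', False),
    ]

    counts = {True: defaultdict(int), False: defaultdict(int)}
    totals = {True: 0, False: 0}
    for text, spam in samples:
        for word in text.split():
            counts[spam][word] += 1
            totals[spam] += 1

    def spam_score(text):
        # Log-odds of spam vs. ham with add-one smoothing per word.
        vocab = len(set(w for c in counts.values() for w in c))
        score = 0.0
        for word in text.split():
            p_spam = (counts[True].get(word, 0) + 1) / (totals[True] + vocab)
            p_ham = (counts[False].get(word, 0) + 1) / (totals[False] + vocab)
            score += log(p_spam / p_ham)
        return score

    print(spam_score('win cheap money'))  # positive: looks spammy
    print(spam_score('lunch meeting'))    # negative: looks legitimate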

Why has no one beat the Netflix algorithm yet (2, Interesting)

wintermute42 (710554) | more than 6 years ago | (#23094758)

The Netflix competition, in principle, is an example of an interesting class of prediction algorithms. There is a lot of good work in academia in this area, and on the face of it one might be surprised that no one has beaten Netflix yet.

Unfortunately, Netflix restricts the data that can be applied to prediction: you have to use their data, which includes only movie title and genre. A much better job could be done if something like the Internet Movie Database were fused with the title selection information. That would allow the algorithm to predict based on actors, directors and detailed genre. For example, I see every movie directed by John Woo. Given that I've seen all of his movies, it's not hard to predict that I'm going to see his next one.

it works (1)

douthat (568842) | more than 6 years ago | (#23094854)

see: http://developers.slashdot.org/article.pl?sid=08/04/01/189230 [slashdot.org]

"A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better -- nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"

Re:Why has no one beat the Netflix algorithm yet (1)

dookiesan (600840) | more than 6 years ago | (#23094936)

People easily beat Netflix (or did you mean "no one has beaten the Netflix challenge yet"?). They just haven't reached the $1,000,000 mark yet.

Does Netflix restrict what you can use in your algorithm now? I haven't checked the rules recently, but I know at first a lot of people were using IMDB and other sites for extra predictors.

Re:Why has no one beat the Netflix algorithm yet (1)

TheAntiTech (1275704) | more than 6 years ago | (#23124458)

As we have a team in the contest (first page of the leaderboard), I know that using IMDB's downloadable info is prohibited due to a clause stating it cannot be used for commercial purposes. This is IMDB's rule, however; Netflix has raised no objection.

Moreover the problem with the Netflix dataset is they have intentionally inserted misinformation into the dataset for whatever reason.

Our answer was to have someone (read: me) comb over each of the 17,000 entries and screen for basic accuracy. For instance, I wouldn't consider "Family Guy," "South Park," or many of the anime movies intended for adults (tentacle porn and the like) as belonging in Cartoons/Family/Children, but who knows about kids these days!

Re:Why has no one beat the Netflix algorithm yet (1)

Jainith (153344) | more than 6 years ago | (#23096146)

I would agree that adding Actor, Director, Art director...grip...whatever is likely to be the "next big thing" in making movie picks more accurate.

Just an Idea (1)

Infinite Wave (1124173) | more than 6 years ago | (#23094768)

Could you not just add an extra box in the rating section that asks for the customer's mood? Say, a box that says rate this film 1-5 stars, and below that a dropdown with the most common moods: happy, sad, angry, annoyed. It seems to me a big factor in how you rate a film is your current mood. If you're in a good mood, you're more likely to be forgiving of a film; in a bad mood, you're going to be critical. This extra information might help determine the accuracy of a given rating. I'm sure a study could determine just how much a given mood can affect a rating, +/- so many points. Seems to make sense to me, but what the hell do I know?

The L Word (1)

Jimmy King (828214) | more than 6 years ago | (#23095322)

All I know about these recommendation algorithms is that they're a bit crazy. I have had The L Word recommended because I liked Alias, 24, and Roswell.

Of course maybe The L Word is about lesbian alien spies with super powers. Huh. I'm gonna go check it out.

Another Review of Collective Intelligence (1)

skuenzli (169327) | more than 6 years ago | (#23104664)

I have also read Collective Intelligence. I think I enjoyed it significantly more than the Slashdot reviewer. Here is my review:

~~~~

Have you ever wondered how:

        * Google comes up with its search results
        * Amazon recommends you books/movies/music
        * spam filters decide good from bad

Well, Toby Segaran not only explains these topics and more in Collective Intelligence, but he does so in a way accessible to software developers who haven't worked on machine-learning problems before. He even provides working Python code for all the algorithms.

Oh, and Collective Intelligence reads incredibly well. I could not wait to get home and get back to it -- and when I went in to work the next morning, I usually had a new idea or two of how to improve our software. I also started implementing the most important examples in Groovy to make sure I got it.

If you are a Senior Software Engineer or "better," this is a must-read. Proper application of the algorithms in this book is a great way to simplify your system and avoid getting nickel-and-dimed to death with new ways to prioritize/categorize/slice-and-dice your domain data.