
Augmenting Data Beats Better Algorithms

kdawson posted more than 6 years ago | from the tell-it-to-the-dhs dept.

Education 179

eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"


179 comments


Heuristics?? (1, Insightful)

nategoose (1004564) | more than 6 years ago | (#22933362)

Aren't these heuristics and not algorithms?

Re:Heuristics?? (5, Informative)

EvanED (569694) | more than 6 years ago | (#22933586)

One would hope that the thing that calculates the heuristic is an algorithm. See wikipedia [wikipedia.org] .

Re:Heuristics?? (1)

esp_ex (148837) | more than 6 years ago | (#22933808)

April Fools!

Depends on the Problem (4, Insightful)

roadkill_cr (1155149) | more than 6 years ago | (#22933376)

I think it heavily depends on what kind of data you're mining.

I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, so I'm not sure if this one case study is enough to apply the idea to all algorithms.

Though, in a way, this is sort of a "duh" result - data mining relies on lots of good data, and the more there is generally the better a fit you can make with your algorithm.

Re:Depends on the Problem (1)

TooMuchToDo (882796) | more than 6 years ago | (#22933658)

Exactly. An algorithm can't see what isn't there, so the more data you have, the better your result will be. You can of course improve upon the algorithm, but the quality/quantity of data is always going to be more important.

Re:Depends on the Problem (1)

ubrgeek (679399) | more than 6 years ago | (#22933798)

Isn't that similar to the posting about Berkeley's joke recommender from the other day [networkworld.com]? Rate jokes and it then suggests ones you should like. I tried it, and I don't know if the pool from which the jokes are pulled is shallow, but the ones it returned after I finished "calibrating" it were terrible, and not along the lines of what I would have assumed the system thought I would find funny.

Re:Depends on the Problem (2, Insightful)

Brian Gordon (987471) | more than 6 years ago | (#22933820)

It's not always going to be more important. There's really no difference between a sample of 10 million and a sample of 100 million; at that point it's obviously more effective to put work into improving the algorithm. But that turning point (again, obviously) would come way before 10 million samples of data. It's a balance.

Re:Depends on the Problem (3, Insightful)

RedHelix (882676) | more than 6 years ago | (#22934182)

Well, yeah, augmenting data can produce more reliable results than better algorithms. If a legion of film buffs went through every single film record on Netflix's database and assigned "recommendable" films to it, then went and looked up the rental history of every Netflix user and assigned them individual recommendations, you would probably end up with a recommendation system that beats any algorithm. The dataset here would be ENORMOUS. But the reason algorithms exist is so that doesn't have to happen. i like turtles

Re:Depends on the Problem (3, Interesting)

blahplusplus (757119) | more than 6 years ago | (#22934214)

"I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, ...."

Ironically enough, you'd think they'd adopt the Wikipedia model, where customers can simply vote thumbs up or thumbs down on a small list of recommendations every time they visit the site.

All this convenience comes at a cost, though: you're basically giving people insight into your personality and who you are, and I'm sure many "recommendation engines" easily double as demographic data for advertisers and other companies.

Re:Depends on the Problem (3, Insightful)

roadkill_cr (1155149) | more than 6 years ago | (#22934300)

It's true that you lose some anonymity, but there is so much to gain. To be perfectly honest, I'm completely fine with rating products on Amazon.com and Netflix - I only go to these sites to shop for products and movies, so why not take full advantage of their recommendation system? If I am in consumer mode, I want the salesman to be as competent as possible.

Anyway, if you're paranoid about data on you being used, there's a less well-known branch of recommender systems that relies on implicit data gathering, which can be easily set up on any site. For example, it might infer that because you clicked on product X many times today, you probably want it, and use that signal. Of course, implicit data gathering is more error-prone than explicit data gathering, but it goes to show that if you spend time on the internet, websites can always use your data for their own ends.

I think better is subjective... (3, Insightful)

3p1ph4ny (835701) | more than 6 years ago | (#22933378)

In problems like minimizing lateness and the like, "better" can be defined simply as "closer to optimal" or "fewer time units late."

Here, better means different things to different people. More data gives you a larger set of people, and probably a more accurate definition of "better" for that larger set. I'm not sure you can really compare the two.

Re:I think better is subjective... (1)

moderatorrater (1095745) | more than 6 years ago | (#22933652)

In this case, better is well defined. They're looking for a system that can take a certain data set and use it to predict another data set. Ultimately, the quality of picks is determined by the user. For this contest, they've got data sets that they can use to determine which is the best method.

Re:I think better is subjective... (1)

phkhd (172530) | more than 6 years ago | (#22934258)

More to the point, the demographic is not for the perfect Pepsi, but for the perfect Pepsis. Malcolm Gladwell gave a talk at a TED conference [ted.com] where he explains demographic research showing that there is usually not one perfect solution, but rather several near-optimal solutions. Probably not applicable to all data-mining applications, but certainly appropriate for anything relating to subjective tastes.

Duh (0)

Anonymous Coward | more than 6 years ago | (#22933380)

Unless the solution has a provably optimal algorithm, more data is always going to beat a better algorithm. Trivial example: The data includes the answers to the question...

That doesn't mean a better algorithm is useless, though. If the data isn't available, you're kinda up a creek.

attn computer scientists: stop renaming stuff (0, Insightful)

Anonymous Coward | more than 6 years ago | (#22933384)

"machine learning" is just statistical inference
"page rank algorithm" is just an eigenvalue calculation.

I know you computer scientists like playing mathematician, but there's a reason why you're the butt of mathematicians' jokes: you guys are nothing more than glorified engineers.
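To be fair, the "eigenvalue calculation" is concrete enough to sketch: the rank vector is the principal eigenvector of the damped link matrix, and a few lines of power iteration find it. A toy sketch (the three-page link graph is invented):

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=100):
    """Power iteration on the damped, column-stochastic link matrix."""
    n = adj.shape[0]
    # Normalize columns so each page splits its vote among its out-links.
    out = adj.sum(axis=0)
    out[out == 0] = 1  # dangling pages: avoid divide-by-zero
    m = adj / out
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Teleport term plus damped vote-passing along links.
        rank = (1 - damping) / n + damping * m @ rank
    return rank

# adj[i, j] = 1 means page j links to page i.
# Toy graph: pages 0 and 1 link to 2; page 2 links back to 0.
adj = np.array([[0, 0, 1],
                [0, 0, 0],
                [1, 1, 0]], dtype=float)
print(pagerank(adj))
```

Page 1 gets no inbound links, so its rank bottoms out at the (1 - damping)/n teleport mass; the two linked pages split the rest.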

Re:attn computer scientists: stop renaming stuff (5, Funny)

Anonymous Coward | more than 6 years ago | (#22933492)

you guys are nothing more than glorified engineers.
Computer scientists are not glorified engineers. They're the butt of engineers' jokes too.

Re:attn computer scientists: stop renaming stuff (1)

jank1887 (815982) | more than 6 years ago | (#22933686)

ooohhh... can we start on computer engineers next ??

Re:attn computer scientists: stop renaming stuff (0)

Anonymous Coward | more than 6 years ago | (#22934374)

You're both confusing Computer Science with application "design".

Re:attn computer scientists: stop renaming stuff (0)

Anonymous Coward | more than 6 years ago | (#22933710)

Your names suck.

Re:attn computer scientists: stop renaming stuff (5, Funny)

Freeside1 (1140901) | more than 6 years ago | (#22933724)

Say what you want about computer scientists, but without them you'd probably be complaining on a chalkboard.

Re:attn computer scientists: stop renaming stuff (4, Funny)

jank1887 (815982) | more than 6 years ago | (#22933726)

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics - CliffsNotes edition.

Re:attn computer scientists: stop renaming stuff (5, Funny)

JasonKChapman (842766) | more than 6 years ago | (#22933938)

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics without tenure.

Re:attn computer scientists: stop renaming stuff (1)

jank1887 (815982) | more than 6 years ago | (#22934026)

since when do engineers not get tenure? and to boot, we get sizable research dollars.

Re:attn computer scientists: stop renaming stuff (1)

Fishbulb (32296) | more than 6 years ago | (#22933966)

...and a physicist is nothing without alcohol.

Q.E.D., beer.

Re:attn computer scientists: stop renaming stuff (1)

etymxris (121288) | more than 6 years ago | (#22933768)

And chemists are just doing heuristic physics, and biologists are just doing heuristic chemistry.

All of math reduces to logic and set theory, but you don't see philosophers turning up their noses at mathematicians. I agree that "computer science" is a misnomer in many ways, but "algorithm" in this case is not. Yes, it's readily apparent that all algorithms can ultimately be represented mathematically, but that means no more than the reduction of math to logic does.

Re:attn computer scientists: stop renaming stuff (2, Informative)

Sciros (986030) | more than 6 years ago | (#22933796)

What noobery. You're confusing the "what" with the "how". Finding eigenvalues is part of a particular page rank algorithm. It's not THE page rank algorithm. Likewise, statistical inference is part of particular "machine learning" systems. It's not THE system. Using statistical inference alone will give you crude (albeit good, with enough training data) baselines to work from in some applications such as automatic text translation, but you'll need more than that to overcome issues like data sparseness, etc.

I know anonymous cowards like playing expert, but there's a reason why you're the butt of so many jokes here -- only thing you're usually expert in is misinformation and disingenuity.

Re:attn computer scientists: stop renaming stuff (1)

agentultra (1090039) | more than 6 years ago | (#22933954)

sigh.

It sounds like you've got a hammer and look at everything as nails.

You might want to take a trip outside your ivory tower.

Synonyms happen to be a way of abstracting complexity out of the language so that laypersons can understand, or at least talk about, the concepts that we "glorified engineers" use. It's really so the marketing guys have something to sell other than "eigenvalue calculation."

I suppose it's beneath you, but average people should have a chance at grasping what we do, even if it's not in its purest and most exact form. It doesn't mean that we "engineers" are all ignorant of the actual mathematical terms. It just means we have to adapt our language for people who are involved with the product of our endeavors, who may not understand what statistical inference means but can at least grasp the idea via the term "machine learning."

geez.

Re:attn computer scientists: stop renaming stuff (3, Insightful)

Metasquares (555685) | more than 6 years ago | (#22934038)

And nonlinear dimensionality reduction is just nonconvex trace optimization coupled with kernel principal component analysis (fine, call it "singular value decomposition") using Mercer's theorem to map the resulting dot product through a kernel function (usually represented as a Hermitian positive semidefinite Gram matrix), yielding an inner product space of higher (possibly infinite) dimensionality in which the original problem is linearly separable.

Now take this description and write an algorithm that performs it efficiently. And you use PageRank as an example, so let's call "efficient" "performs as well as Google on the entire web's worth of data".

If you can't do this, perhaps you should reconsider your view of computer scientists. There's no reason whatsoever to play up the boundaries between two very related fields. Arbitrary boundaries in knowledge are already bad enough; they need to be knocked down, not reinforced.
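In fairness to both sides, that mouthful does compress into a short numpy sketch. Assuming an RBF kernel and invented concentric-ring data (all parameter choices arbitrary), kernel PCA is roughly:

```python
import numpy as np

def rbf_kernel_pca(X, n_components=2, gamma=1.0):
    """Kernel PCA: eigendecompose the centered RBF Gram matrix."""
    # Pairwise squared distances -> Hermitian PSD Gram matrix K.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * d2)
    # Center K in the implicit feature space.
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    # Top eigenvectors (scaled by sqrt of eigenvalue) give the projection.
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))

# Two concentric rings: not linearly separable in the original 2-D space.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 100)
r = np.concatenate([np.full(50, 1.0), np.full(50, 3.0)])
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
Z = rbf_kernel_pca(X, n_components=2, gamma=0.5)
```

Whether this counts as "efficient" at Google scale is of course the point above; the dense n-by-n Gram matrix alone rules that out for web-sized data.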

Re:attn computer scientists: stop renaming stuff (1)

raddan (519638) | more than 6 years ago | (#22934064)

Mod parent flamebait. Right, and as a holder of a philosophy degree, I don't understand why you nitwit mathematicians can't get it through your thick skulls that "statistical inference" is just yet another flawed manifestation of the Cartesian dichotomy. See where I'm going with this?

Why all the hate, people? Different disciplines have different terminology. Sure, there are probably some mathematical generalizations for common computer science problems. And there are probably some CS generalizations for common accounting problems. But you know why actual traveling salesmen don't call their travels the Traveling Salesman problem? They don't fucking care, and for the most part, it doesn't matter to them.

Now that I'm a CS student, I can appreciate where my current field and where my former field overlap. In my book, nobody who puts their mind to work is the butt of anybody else's joke.

Re:attn computer scientists: stop renaming stuff (1)

Deepness In The Sky (1265938) | more than 6 years ago | (#22934090)

So... if you have a degree in Mathematics and Computer Science, are you the butt of your own jokes?

Re:attn computer scientists: stop renaming stuff (1)

Hoi Polloi (522990) | more than 6 years ago | (#22934316)

What was that? Sorry, I was busy admiring my fat IT paycheck.

What about lambda calculus ? (1)

S3D (745318) | more than 6 years ago | (#22934384)

i know you computer scientists like playing mathematician, but there's a reason why you're the butt of mathematicians jokes. because you guys are nothing more than glorified engineers.
And category theory applied to functional programming [ucr.edu] ?

Re:attn computer scientists: stop renaming stuff (5, Funny)

Arthur B. (806360) | more than 6 years ago | (#22934496)

"machine learning" is just statistical inference

Riiight. And mathematical research is just finding a Hamiltonian cycle in a graph defined by the set of axioms used.

Um, Yes? (4, Insightful)

randyest (589159) | more than 6 years ago | (#22933390)

Of course. Why wouldn't more (or better) relevant data that applies on a case-by-case basis provide better results than an "improved algorithm" (what does that mean, really?) that applies generally and globally?

I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.

Re:Um, Yes? (1)

robbyjo (315601) | more than 6 years ago | (#22933482)

It's the Rao-Blackwell theorem [wikipedia.org] at work. Making use of useful information (in this case, movie genre) makes the estimate more precise.
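For reference, the theorem says that replacing an estimator with its conditional expectation given a sufficient statistic T can only reduce, never increase, its variance:

```latex
\hat{\theta}^{*} = \mathbb{E}\left[\hat{\theta} \mid T\right],
\qquad
\operatorname{Var}\left(\hat{\theta}^{*}\right) \le \operatorname{Var}\left(\hat{\theta}\right).
```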

For the Sake of Discussion (3, Insightful)

eldavojohn (898314) | more than 6 years ago | (#22933544)

Well, for the sake of discussion I will try to give you an example so that you might pick it apart.

"more data"
More data means that you understand that directors and actors/actresses often do a lot of the same work. So for every movie the user likes, you weight the stars they gave it by the people attached to it. Then you cross-reference movies containing those people using a database (like IMDB). So if your user loved The Sting and Fight Club, they will also love Spy Game, which had both Redford and Pitt starring in it.
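That people-weighting scheme sketches out to a few lines of Python; the mini-catalog below is invented and merely stands in for real IMDB cast/director data:

```python
from collections import defaultdict

# Hypothetical mini-catalog: people attached to each movie (stand-in for IMDB).
PEOPLE = {
    "The Sting":  ["Robert Redford", "Paul Newman", "George Roy Hill"],
    "Fight Club": ["Brad Pitt", "Edward Norton", "David Fincher"],
    "Spy Game":   ["Robert Redford", "Brad Pitt", "Tony Scott"],
    "Seven":      ["Brad Pitt", "Morgan Freeman", "David Fincher"],
}

def recommend(ratings, catalog):
    """Score unseen movies by star-weights of the people in movies the user rated."""
    weight = defaultdict(float)
    for movie, stars in ratings.items():
        for person in catalog.get(movie, []):
            weight[person] += stars
    scores = {}
    for movie, people in catalog.items():
        if movie not in ratings:
            scores[movie] = sum(weight[p] for p in people)
    return sorted(scores, key=scores.get, reverse=True)

print(recommend({"The Sting": 5, "Fight Club": 4}, PEOPLE))  # Spy Game ranks first
```

With those ratings, Spy Game outscores Seven because it shares both Redford and Pitt with the user's favorites.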

"better algorithm"
If you naively look at the data sets, you can imagine that each user represents a taste set, and that high correlations between two movies across several users indicate that a user who has not seen the second movie will most likely enjoy it. So if 1,056 users who saw 12 Monkeys loved Donnie Darko, but your user has only seen Donnie Darko, highly recommend them 12 Monkeys.
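The naive correlation idea is equally small; a sketch over an invented toy dataset (usernames and tastes made up):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical viewing histories: user -> set of liked movies.
LIKES = {
    "u1": {"12 Monkeys", "Donnie Darko"},
    "u2": {"12 Monkeys", "Donnie Darko"},
    "u3": {"12 Monkeys", "Donnie Darko", "Brazil"},
    "u4": {"Donnie Darko"},
}

def cooccurrence(likes):
    """Count how often each pair of movies is liked by the same user."""
    pairs = defaultdict(int)
    for movies in likes.values():
        for a, b in combinations(sorted(movies), 2):
            pairs[(a, b)] += 1
    return pairs

def recommend(user, likes):
    """Recommend unseen movies that co-occur most with the user's liked ones."""
    pairs = cooccurrence(likes)
    seen = likes[user]
    scores = defaultdict(int)
    for (a, b), n in pairs.items():
        if a in seen and b not in seen:
            scores[b] += n
        elif b in seen and a not in seen:
            scores[a] += n
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("u4", LIKES))  # 12 Monkeys ranks first
```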

You could also make an elaborate algorithm that uses user age, sex & location ... or even a novel 'distance' algorithm that determines how far away they are from liking 12 Monkeys based on their highly ranked other movies.

Honestly, I could provide endless ideas for 'better algorithms' although I don't think any of them would even come close to matching what I could do with a database like IMDB. Hell, think of the Bayesian token analysis you could do on the reviews and message boards alone!

Re:Um, Yes? (0)

Anonymous Coward | more than 6 years ago | (#22933778)

I've got a headache, and the only prescription is more data.

Re:Um, Yes? (0)

Anonymous Coward | more than 6 years ago | (#22934014)

Q. Why wouldn't more (or better) relevant data that applies on a case-by-case basis provide better results than an "improved algorithm" (what does that mean, really?) applied generally and globally?

A. Because the augmented data won with a really simple algorithm (except it doesn't say anywhere what that simple algorithm was).

Re:Um, Yes? (0)

Anonymous Coward | more than 6 years ago | (#22934040)

"Better" is quite easy to define when you're talking about recommender systems. Let's say you have a list of ratings between 0 and 5 for movies submitted by a user. Take one of the items out. Now use the recommender system to predict how the user would rate that item. The closer you are to what the user actually rated an item, the better your algorithm. Next take out 5 items and do the same thing. Next take out 50%. Next take out all of the items but one.

You should be able to build a nice little chart on how well your algorithm will perform based on this.
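The hold-out evaluation described above might look like this, with an invented data set and a deliberately dumb baseline predictor (the user's mean training rating):

```python
import math

def rmse(predict, ratings, held_out):
    """Score a predictor by root-mean-squared error on held-out ratings."""
    errs = []
    for (user, movie), actual in held_out.items():
        errs.append((predict(user, movie, ratings) - actual) ** 2)
    return math.sqrt(sum(errs) / len(errs))

def user_mean(user, movie, ratings):
    """Baseline: predict the user's mean training rating (ignores the movie)."""
    mine = [r for (u, _), r in ratings.items() if u == user]
    return sum(mine) / len(mine) if mine else 3.0

# Invented train/test split of (user, movie) -> rating.
train = {("alice", "Heat"): 4, ("alice", "Ronin"): 5, ("bob", "Heat"): 2}
test = {("alice", "Spartan"): 4, ("bob", "Ronin"): 3}
print(round(rmse(user_mean, train, test), 3))
```

Varying the size of the held-out set, exactly as described, gives the chart of accuracy versus available data.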

Ask Slashdot (0)

Anonymous Coward | more than 6 years ago | (#22933392)

Yes. Yes it will.

Re:Ask Slashdot (0)

Anonymous Coward | more than 6 years ago | (#22933728)

No. No, it won't.

Re:Ask Slashdot (1)

apt142 (574425) | more than 6 years ago | (#22934138)

Depends. It really depends on the specifics.

Re:Ask Slashdot (0)

Anonymous Coward | more than 6 years ago | (#22933732)

No. Absolutely not.

This reminds me (2, Interesting)

FredFredrickson (1177871) | more than 6 years ago | (#22933396)

This reminds me of those articles that say humanity has archived so much data that nobody could possibly use it all in a lifetime. I think what people fail to remember is this: the point is to have data available just in case you need to reference it in the future. Nobody watches security tapes in full. They review the day or hour that the robbery occurred. Does that mean we should stop recording everything? No. Let's keep archiving.

Combine that with the speed at which computers are getting more efficient, and I see no reason not to keep piling up this crap. More is always better. (More efficient might be better, but add the two together and you're unstoppable.)

Is it just me that is surprised here? (1, Insightful)

zappepcs (820751) | more than 6 years ago | (#22933398)

What do you think? Will more data usually perform better than a better algorithm?"
Duh... the algorithm can ONLY be as good as the data supplied to it. Better data always improves performance in this type of problem. The netflix challenge is to arrive at a better algorithm with the supplied data. Adding more data gives you a richer data set to choose from. This is obvious, no?

I read the article in question here and can say that I'm surprised that this is even a question.

Re:Is it just me that is surprised here? (5, Informative)

gnick (1211984) | more than 6 years ago | (#22933516)

The netflix challenge is to arrive at a better algorithm with the supplied data.
Actually, the rules explicitly allow supplementing the data set and Netflix points out that they explore external data sets as well.

Re:Is it just me that is surprised here? (1)

geminidomino (614729) | more than 6 years ago | (#22934410)

What do you think? Will more data usually perform better than a better algorithm?"
Duh... the algorithm can ONLY be as good as the data supplied to it. Better data always improves performance in this type of problem. The netflix challenge is to arrive at a better algorithm with the supplied data. Adding more data gives you a richer data set to choose from. This is obvious, no?

I read the article in question here and can say that I'm surprised that this is even a question.
Good point. There doesn't appear to be any mention of the improvement you'd get from supplemented data AND an improved algorithm together.

slashdot users (1)

das_schmitt (936797) | more than 6 years ago | (#22933410)

I would never socialize with a slashdot user. sorry guys :/

Blame yourselves :(

Re:slashdot users (0)

Anonymous Coward | more than 6 years ago | (#22933786)

I would never socialize with a slashdot user. sorry guys :/

Blame yourselves :(

That raises the question: most of us use pseudonyms on here.
How do you know you aren't socializing with closet /. users every day?
Plus, of course, you're sort of stuck with yourself ;)

To a large extent ... (2, Interesting)

haluness (219661) | more than 6 years ago | (#22933420)

I can see that more data (especially more varied data) could be better than a tweaked algorithm. Especially in machine learning, I see many people publish papers on a new method that does 1% better than preexisting methods.

Now, I won't deny that algorithmic advances are important, but it seems to me that unless you have a better understanding of the underlying system (which might be a physical system or a social system) tweaking algorithms would only lead to marginal improvements.

Obviously, there will be a big jump when going from a simplistic method (say linear regression) to a more sophisticated method (say SVM's). But going from one type of SVM to another slightly tweaked version of the fundamental SVM algorithm is probably not as worthwhile as sitting down and trying to understand what is generating the observed data in the first place.

Re:Too a large extent ... (1)

Artuir (1226648) | more than 6 years ago | (#22934428)

You know, one time I had the luxury of working on a blade server set up to use both forms of feeds. Ultimately I found when compiling my AI dataset, each subchannel was coherently placed within 5 arcs of true accuracy. The AI was able to do very well on the turing test as a result and my boss was quite pleased.

Has anyone attempted to use a KVM setup to see if this improves the data augmentation at all?

Slashdot News Flash: BUSH RESIGNS +1, Good (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#22933422)

and flees to Kazakhstan [wikipedia.org] .

We hope he is apprehended and returned to face a criminal trial with the Satan sympathizer President-VICE Cheney.

Yours PatRIOTically,
K. Trout, ACTIVIST

Re:Slashdot News Flash: BUSH RESIGNS +1, Good (0, Troll)

sm62704 (957197) | more than 6 years ago | (#22933662)

In case you haven't heard yet, April Fool's day has been postponed to May [uncyclopedia.org] .

In case you haven't heard (0)

Anonymous Coward | more than 6 years ago | (#22934086)


Slashcrap's threading system threads your comment in the wrong thread.

Secondly, your moronic link would only fool Slashdot moderators.

"I've got mod points".

Who gives a fuck?

Re:In case you haven't heard (1)

sm62704 (957197) | more than 6 years ago | (#22934488)

Secondly, your moronic link would only fool Slashdot moderators.

*Woosh*

There is no difference between the two (0)

Anonymous Coward | more than 6 years ago | (#22933424)

Algorithms are nothing more than efficient representations of data.

Algorithms and data are just the two extreme ends of a continuum.

Assuming the algorithm isn't evil (1)

Lije Baley (88936) | more than 6 years ago | (#22933434)

A piece of pertinent data is worth a thousand (code) lines of speculation.

"Better data" not "more data" (1, Insightful)

Anonymous Coward | more than 6 years ago | (#22933438)

Just having more data to process doesn't produce better results in this sort of field.

Look at the application. Netflix alone VS Netflix+IMDB. The second not only has more data, but it has "better" data in terms of having more human decision inputs applied to it thus weighting the data to produce more correct results.

But if you compared Netflix 2007 data vs. Netflix 2006-2007 data, I don't think you would find a significant difference in results. This is the same "type" of data, only more of it, whereas the former comparison is a practical example of data fusion.

Char-Lez

More vs Better (3, Insightful)

Mikkeles (698461) | more than 6 years ago | (#22933440)

Better data is probably most important, and having more data makes having better data more likely. It would probably make sense to analyse the impact of each datum on the accuracy of the result, then choose a better algorithm using the most influential data. That is, a simple algorithm on good data is better than a great algorithm on mediocre data.

Why not use both? (0)

Anonymous Coward | more than 6 years ago | (#22933444)

Is it just me, or wouldn't it make even more sense to use both? It's like asking which would you choose to make a room brighter, the floor lamp or the overhead light? My guess is that both lights together would produce the most brightness.

I think that's the principle behind Metascore (though it seems vague at the moment).
http://www.metascore.org/ [metascore.org]
Massive amounts of data and massive amounts of recursive algorithms.

All things being equal... (3, Insightful)

Just Some Guy (3352) | more than 6 years ago | (#22933454)

One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better -- nearly as well as the best algorithm on the boards for the $1 million challenge.

And the teams were identically talented? In my CS classes, I could have hand-picked teams that could make O(2^n) algorithms run quickly and others that could make O(1) take hours.

Re:All things being equal... (0)

Anonymous Coward | more than 6 years ago | (#22934200)

Apparently you didn't learn much in your CS classes, because O tells you exactly nothing about the absolute time an algorithm requires.
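Concretely: Big-O discards constant factors, so an O(1) routine can lose to an O(n) one at every input size you will actually see. A contrived sketch (the loop counts are arbitrary):

```python
def constant_but_slow(n):
    """O(1): does the same large, fixed amount of work regardless of n."""
    total = 0
    for _ in range(10_000_000):  # huge constant factor, independent of n
        total += 1
    return total

def linear_but_fast(n):
    """O(n): work grows with n, but with a tiny constant factor."""
    total = 0
    for _ in range(n):
        total += 1
    return total
```

For any n below ten million, the asymptotically "worse" linear function finishes first.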

Obvious? (1)

nine-times (778537) | more than 6 years ago | (#22933458)

Is it just me, or is it pretty obvious that this all just depends on the algorithm and the data?

Like I could "augment" the data with worthless or misleading data, and get the same or worse results. If I have a huge set of really good and useful data, I can get better results without making my algorithm more advanced. And no matter how advanced my algorithm is, it won't return good results if it doesn't have sufficient data.

When a challenge is put out to improve these algorithms, it's really because these companies are operating with limited and/or bad data. They have to deal with crap data and people trying to game the system. They can't pull data from other sites because they don't own the other sites' data. They can't necessarily track their own customers' searches and compile that because (sometimes) their customers would be outraged at the "invasion of privacy".

Hold on a sec... (4, Funny)

peacefinder (469349) | more than 6 years ago | (#22933476)

"What do you think? Will more data usually perform better than a better algorithm?"

I need more data.

Re:Hold on a sec... (0)

Anonymous Coward | more than 6 years ago | (#22933730)

I need more data.

How can you be sure that you're not actually in need of a better algorithm? [sillyurl.com]

Re:Hold on a sec... (1)

peacefinder (469349) | more than 6 years ago | (#22934062)

"How can you be sure that you're not actually in need of a better algorithm?"

I was optimizing for humor.

Re:Hold on a sec... (1)

Archangel Michael (180766) | more than 6 years ago | (#22933822)

... or a better algorithm

This is classic XOR thinking, which permeates our society. "One or the other, not both" is rarely the correct option; it's for boolean operations, which this clearly is not. This is clearly an AND function: more data AND a better algorithm is the most correct answer. "Which helps more?" is a silly question except for deciding how resources should be split between improving both, along with how much easier one is than the other.

Five stars (5, Insightful)

CopaceticOpus (965603) | more than 6 years ago | (#22933488)

If more data is helpful, then Netflix is really hurting themselves with their 5-star rating system. I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low. Many movies that range from a 7/10 to a 9/10 get lumped into that 4 star category, and the nuances of the data are lost.

How to translate the entire experience of watching a movie into a lone number is a separate issue.

Re:Five stars (1)

edcheevy (1160545) | more than 6 years ago | (#22933828)

You're absolutely correct: the more you compress categories, the more data you essentially throw away. The flip side is the average customer, whose mind would be blown if you let them rate a movie on a 1-to-1000 scale (or something equally ridiculous). Most of us would chunk that down into a more meaningful range anyway.

I'm afraid I don't have the link, but I seem to recall research on the Likert scale (typically a 1-5 or 1-7 scale) finding that larger scales really didn't add much beyond 1 through 7 or 1 through 9. That said, "not adding much" may still be worth a million dollars to Netflix if that "not much" beats what they've got (and doesn't scare people off by offering too many choices).

Re:Five stars (1)

RingDev (879105) | more than 6 years ago | (#22934204)

Hey, if you can find a link to that research, please post it. I swear I caught a glimpse of similar work years ago in college, but have long since lost it. And it just so happens that now I'm working for a R&D company focusing on mental health testing and the topic of scale sizes comes up on occasion (especially when planning surveys for patients with mental illnesses). Anyways, I've had squat for luck tracking down that paper.

-Rick

Re:Five stars (1)

Chris Burke (6130) | more than 6 years ago | (#22933884)

I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low.

I don't think this is a problem of it being a 1-5 scale instead of 1-10. There's not really much information in scoring a movie a 7 instead of an 8; since it's all subjective anyway, on any given day those scores could have been reversed.

I think it's more the extremely common situation where people don't want to give an "average" score, so you get score inflation such that only the top half of the scale is ever used except for palpably bad movies. So even in a 1-10 scale, you're only really using 6-10. You might as well use a 0-5 scale, where "1" means good, and a 0 is anything worse than that.

I personally try to solve this by firmly keeping in mind the idea that the middle score should be for the "average" movie. If I'm never giving out scores that are 3 or lower, then I'm not rating them correctly. Unfortunately, I'm the only one who does this, which just means that I'm giving movies lower scores than everyone else even if I felt the same about the movie. So it's not much of a solution. ;)
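The lone-tough-rater problem described above is exactly what per-user mean-centering is meant to fix. A minimal sketch (the user names and scores are invented, and it assumes every rater has enough ratings for a stable mean):

```python
# Sketch: correcting for per-user score inflation by mean-centering.
# Each user's own mean is subtracted, so a tough rater's 3 and a
# generous rater's 4 can both read as "average for that person".

def center_ratings(ratings):
    """ratings: dict mapping user -> {movie: score}.
    Returns each user's deviations from their own mean."""
    centered = {}
    for user, scores in ratings.items():
        mean = sum(scores.values()) / len(scores)
        centered[user] = {m: s - mean for m, s in scores.items()}
    return centered

ratings = {
    "tough":    {"A": 3, "B": 5, "C": 1},   # uses the whole scale
    "generous": {"A": 4, "B": 5, "C": 3},   # never goes below 3
}
centered = center_ratings(ratings)
# Both raters now agree on the ordering and rough spread:
# tough:    A=0.0, B=+2.0, C=-2.0
# generous: A=0.0, B=+1.0, C=-1.0
```

A recommender that compares these deviations, rather than the raw stars, no longer punishes the one person who rates honestly.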

Re:Five stars (1)

areReady (1186871) | more than 6 years ago | (#22933940)

Well, I suppose they should use a 100-point scale, so you don't have to lump all those 71-79s together in the 7's when there could be much more delineation between them. Or 1,000 points. Obviously this breaks down at some point. Five stars isn't necessarily bad; the correlation between positive and negative ratings is still very useful.

Will more data usually perform better than a bette (1)

sm62704 (957197) | more than 6 years ago | (#22933572)

I would suggest that one go for both better algorithms AND more/better data.

Captain Obvious Says: (1)

GameboyRMH (1153867) | more than 6 years ago | (#22933920)

The quality (accuracy) of the result is a function of how much data you put in and how you operate on it, but entering more data can yield a much greater improvement in the quality of the output than a better algorithm.

apparently... (1)

spune (715782) | more than 6 years ago | (#22933582)

...the algorithm wasn't 'better' enough.

The punch line (1)

shewfig (1051592) | more than 6 years ago | (#22933600)

The last sentence of TFA sums up the non-usefulness of the result: "Of course, you have to be judicious in your choice of the data to add to your data set."

I refer you to the question of training Bayesian data sets for anti-spam: should you classify every single email, or only the ones that are "clearly" well-defined? Without a good algorithm to extract the search terms, the additional data just poisons the data sets, reducing the effectiveness of the filter.
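A toy sketch of that poisoning effect (the messages and labels are invented, and a real Bayesian filter would smooth these raw counts):

```python
# Toy illustration (not a real filter): how carelessly labeled training
# mail dilutes a naive-Bayes word signal. Spamminess of a word is
# estimated from how often it appears in mail labeled spam vs. ham.

def spamminess(word, labeled_mail):
    """labeled_mail: list of (set_of_words, label) pairs."""
    in_spam = sum(1 for words, lab in labeled_mail if word in words and lab == "spam")
    in_ham  = sum(1 for words, lab in labeled_mail if word in words and lab == "ham")
    return in_spam / (in_spam + in_ham)

clean = [({"viagra", "buy"}, "spam")] * 9 + [({"viagra", "prescription"}, "ham")] * 1
# Ten borderline messages carelessly labeled "ham" poison the estimate:
poisoned = clean + [({"viagra", "meeting"}, "ham")] * 10

print(spamminess("viagra", clean))     # 0.9  -> strong spam indicator
print(spamminess("viagra", poisoned))  # 0.45 -> now looks like a coin flip
```

More training data made the filter worse, because the added data was badly classified.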

See also any decent physiological study, in which "extraneous" factors are "corrected". Without enough data pruning, you have a correlation like the study that showed that losing weight, and keeping it off, reduces life expectancy. They didn't correct for the terminally ill, who lost weight as a result of their conditions. However, do too much pruning, and you have the controversial Harvard study, which reached the "common sense" conclusion almost at the expense of the data.

For more examples of massaging data using a bad algorithm, see studies that demonstrate a better TCO by going Microsoft.

In short, adding additional data is no guarantee of good results. The students clearly got lucky in finding a similar data set on a well-researched topic, based on an established taxonomy rather than a murky preference rating.

To augment or algorithm is the question? (1)

flajann (658201) | more than 6 years ago | (#22933606)

It really depends on a number of factors. I don't think anyone can make a general claim for one over the other. A smart algorithm can beat data augmentation in some cases. Of course, creating the algorithm is the crux of the matter, one that is harder to put a definition on.

So, the upshot is to look at both approaches and take the best course of action for your needs.

It depends on your definition of "better" (1)

paulatz (744216) | more than 6 years ago | (#22933638)

How do you define a "better" algorithm? Well, a better algorithm is an algorithm that works better in the field. That may seem obvious, but it is not at all. Usually it is not possible to test an algorithm deeply enough until its development is finished; on the other hand, you would rather not spend a lot of time developing an algorithm that is not good enough. Hence the quality of algorithms is often deduced from indicators, like small test samples. Finally, as the general theory improves, the difference in performance between the top-ranking algorithms decreases, and may start to depend quite strongly on the subset of the total population to which they are actually applied. We cannot simply say "given two algorithms, the best one is the one which performs better on all possible samples;" we should rather say "the best one is the one which performs better on most of the real-world samples." You can clearly see how impractical this definition is; this is why finding a good ranking algorithm requires constant tuning, as they do at Google. A better algorithm may not be so much better, or may lack generality when tested in the real world. More data always helps.

Isn't an algorithm just data? (1)

tjstork (137384) | more than 6 years ago | (#22933644)

I mean, if we balloon up to 10,000 feet, the problem really is, where do you put the extra data? Do you encode it in an algorithm, or do you have less code but more dynamic data. Given that POV, then, it stands to reason the best place to put the extra data is outside of the code, so that it is easier and less costly to modify.

Um, no. (1)

emmons (94632) | more than 6 years ago | (#22934148)

In a data mining context, an algorithm extracts, modifies or creates data from an existing data set.

Think of it this way.. algorithm is to verb as data is to noun.

This is assuming... (2, Insightful)

jd (1658) | more than 6 years ago | (#22933690)

...that algorithms and data are, in fact, different animals. Algorithms are simply mapping functions, which can in turn be entirely represented as data. A true algorithm represents a set of statements which, when taken as a collective whole, will always be true. In other words, it's something that is generic, across-the-board. Think object-oriented design - you do not write one class for every variable. Pure data will contain a mix of the generic and the specific, with no trivial way to always identify which is which, or to what degree.

Thus, an algorithm-driven design should always out-perform data-driven designs when knowledge of the specific is substantially less important than knowledge of the generic. Data-driven designs should always out-perform algorithm-driven design when the reverse is true. A blend of the two designs (in order to isolate and identify the nature of the data) should outperform pure implementations following either design when you want to know a lot about both.

The key to programming is not to have one "perfect" methodology but to have a wide range at your disposal.

For those who prefer mantras, have the serenity to accept the invariants aren't going to change, the courage to recognize the methodology will, and the wisdom to apply the difference.

A bit like swap vs. real memory (2, Informative)

etymxris (121288) | more than 6 years ago | (#22933692)

A machine with swap enabled will always have more throughput than a machine without. It's a better use of the resources available. However, replace that swap space with the same amount of RAM, and of course that will be even better. Some use this as an argument against swap space, but it's not a fair comparison, since you can enable swap space in the RAM increased machine and increase throughput even more.

So when I think of this recommendation system, a better algorithm is like having swap space enabled. It's a more sophisticated use of the data you have. Having more data is like having more RAM. And of course the best option is to have more reference data and a better algorithm. It's not an exclusive disjunction, and it's silly to think it has to be.

Data has no comments (1)

freejamesbrown (566022) | more than 6 years ago | (#22933696)

In the long term, if gamed Data determines hidden features of an algorithm's output, that output will not be completely understood in case it needs to be analyzed.

I've seen this on several systems over the years where legacy programmers tweak the data just a bit to affect sort order, etc., and it leads to nightmares when you later try to understand what's really happening in order to replace its functionality.

There's no hard rule but beware, Data has no comments, so you'll never completely understand all the actions of your algorithm.

Google Page Rank probably suffers from this.

Diminishing Returns (1)

areReady (1186871) | more than 6 years ago | (#22933758)

It is obvious that both will help. Your first big chunk of augmenting data will help a lot, as will your first few algorithm adjustments. As you go forward, however, you will get smaller and smaller returns for each new tweak to the algorithm and each new set of data. It seems obvious after these results that the best course is BOTH.

Sounds reasonable. (1)

Orig_Club_Soda (983823) | more than 6 years ago | (#22933764)

Personally, I'm more likely to watch a movie based on genre, producer, director, writers, actors... especially with plot specifics like era and technology.

Really? (1)

edcheevy (1160545) | more than 6 years ago | (#22933772)

The more data you have, the more likely your results are going to be significant. I think we already knew this. ;)

Really though, it's the "design fix" vs the "statistics fix" (or the algorithm fix in this case) and a proper design always beats a crappy design with statistical band aids.

I tried to augment the data when... (0)

DRAGONWEEZEL (125809) | more than 6 years ago | (#22933806)

I showed my boss a salary survey...

Luckily for me he fell for it!

Recommendations Systems and subjectivity (3, Insightful)

mlwmohawk (801821) | more than 6 years ago | (#22933812)

I have written two recommendations systems and have taken a crack at the Netflix prize (but have been hard pressed to make time for the serious work.)

The article is informative and generally correct, however, having done this sort of stuff on a few projects, I have some problems with the netflix data.

First, the data is bogus. The preferences are "aggregates" of rental behaviors; whole families are represented by single accounts. Little 16-year-old Tod likes different movies than his 40-year-old dad, not to mention his toddler sibling and mother. A single account may have Winnie the Pooh and Kill Bill. Obviously, you can't say that people who like Kill Bill tend to like Winnie the Pooh. (Unless of course there is a strange human behavioral factor being exposed by this; it could be that parents of young children want the thrill of vicarious killing, but I digress.)

The IMDB information about genre is interesting as it is possibly a good way to separate some of the aggregation.

Recommendation systems tend to like a lot of data, but not in the way you might think. People will ask: if you need more data, why have just 1-5 and not 1-10? Well, that really isn't much added data; it is just greater granularity of the same data. Think of it like "color depth" vs. "resolution" on a video monitor.

My last point about recommendations is that people have moods and are not as predictable as we may wish. On an aggregate basis, a group of people is very predictable. A single person setting his/her preferences one night may have had a good day and a glass of wine, and the numbers are higher; the next day he could have had a crappy day and had to deal with it sober, and the numbers are different.

You can't make a system that will accurately predict responses of a single specific individual at an arbitrary time. Let alone based on an aggregated data set. That's why I haven't put much stock in the Netflix prize. Maybe someone will win it, but I have my doubts. A million dollars is a lot of money, but there are enough vagaries in what qualifies as a success to make it a lottery or a sham.

That being said, the data is fun to work with!!
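The household-aggregation problem above is easy to demonstrate with account-level co-occurrence counts (a minimal sketch; the account names and rental histories are invented):

```python
# Sketch of the aggregation confound: account-level co-occurrence makes
# "Kill Bill" and "Winnie the Pooh" look related, because one account
# bundles a parent's and a toddler's rentals.

accounts = {
    "family_1": {"Kill Bill", "Winnie the Pooh"},
    "family_2": {"Kill Bill", "Winnie the Pooh"},
    "single_1": {"Kill Bill", "Pulp Fiction"},
}

def cooccurrence(a, b, histories):
    """Number of accounts whose history contains both titles."""
    return sum(1 for h in histories.values() if a in h and b in h)

print(cooccurrence("Kill Bill", "Winnie the Pooh", accounts))  # 2
print(cooccurrence("Kill Bill", "Pulp Fiction", accounts))     # 1
# The aggregate data "prefers" the spurious pair; splitting accounts
# into individual viewers (if only we could) would reverse this.
```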

more data in which dimension? (1)

Fuzuli (135489) | more than 6 years ago | (#22933858)

The team with more data probably performed better because their data let them differentiate between movies along a far more significant dimension than the given per-movie ratings.
The fundamental idea is to identify clusters of movies, or of users (who like a certain type of movie), and the idea of clusters is built on some form of distance. When you add a new dimension to your feature vector, you get a chance to separate groups of entities better along that dimension. You may do worse as well: a new dimension may blur the lines between groups. Genre looks like a good label for identifying groups of movies; trying to do the same with more complex methods, using only ratings, is harder.
More data does not necessarily mean you'll do better; it has to let you identify differences better, so it should contain or add a dimension carrying "good" data. It seems team B went straight for generating a more relevant data set for the problem at hand.
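A minimal sketch of that point, with made-up feature values:

```python
# Two movies with near-identical average ratings are indistinguishable
# in rating-space, but appending one genre dimension (0 = family,
# 1 = action) pulls them into separate clusters.

import math

def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

ratings_only = {"Winnie the Pooh": [3.9], "Kill Bill": [4.0]}
with_genre   = {"Winnie the Pooh": [3.9, 0.0], "Kill Bill": [4.0, 1.0]}

print(dist(*ratings_only.values()))  # ~0.1  -> look like the same movie
print(dist(*with_genre.values()))    # ~1.0  -> clearly different clusters
```

The same distance function, fed one extra well-chosen dimension, suddenly tells the two movies apart; a cleverer distance function on ratings alone could not.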

One Trivial Result, One Big Assumption (3, Insightful)

fygment (444210) | more than 6 years ago | (#22933964)

Two things. The first is that it is tritely obvious that adding more data improves your results, but there are two possible mechanisms at work. On the one hand, you can add more of the same data, i.e., just make your original database larger with more entries; that form of augmentation will hopefully give you more insight into the underlying distribution of the data. On the other hand, you can augment the existing data; here you are really adding extra dimensions/features/attributes to the data set. The latter seems to be what is alluded to in the article, i.e., the students added extra features to the original data set. The success of the technique is a trivial result which depends very much on whether the features you add are discriminating or not. In this case the IMDB presumably added discriminating features; had it not, "improved algorithms" would have had the upper hand.

The second thing is the claim's assumption that additional information is always available. The comment is made that academia and business don't seem to appreciate the value of augmenting the data. That is false. In business, additional data is often simply not available (physically or for cost reasons); consequently, improving your algorithms is all you can do. Similarly in academia (say, a computer science department), the goal is often to improve your algorithms under the assumption that you already have all the data available.

Depends on the problem. (1, Interesting)

v(*_*)vvvv (233078) | more than 6 years ago | (#22933988)

Would you rather know more or be smarter?

Knowledge is power, and the ultimate in information is the answer itself. If the answer is accessible, then by all means access it.

You cannot compare algorithms unless the initial conditions are the same, and this usually includes available information. In other words, algorithms make the most out of "what you have". If what you have can be expanded, then by all means you should expand it.

I wonder if accessing foreign web sites is legal in this competition though, because that definitely alters the complexion of the problem.

To say google succeeded by expanding their data pool is an oversimplification, because not only did they select what they felt was most important, they ignored what they felt was not. Intelligent selection took place to set their initial conditions for their algorithm. So it isn't just data augmentation. It is the ability to augment data relative to a goal, and this is much deeper than just "more data" vs "algorithm". In fact, you can also find situations where algorithms are used to make these intelligent selections, in which case the selection process can be as or more important than just the sheer volume of available data alone.

Augmented? (1)

Anne Thwacks (531696) | more than 6 years ago | (#22934066)

When I was in college "augmented data" was a tactful way of saying "faked results"

well, duh (0)

Anonymous Coward | more than 6 years ago | (#22934202)

>> Will more data usually perform better than a better algorithm?

of course! more data, more signals
more signals, more clouds
more clouds, more rain
more rain, more marijuana
more marijuana, better performance!
what was the point?

This is a rule of algorithms (1)

MobyDisk (75490) | more than 6 years ago | (#22934210)

For every problem, there is an optimal solution (okay... there are many optimal solutions, depending on what you are trying to optimize for). If you want to do better than that algorithm, you must break the model: either modify the inputs or modify the assumptions of the model. For example, the fastest way to sort arbitrary data that can only be compared pairwise takes O(n log n) time. To do any better, you must break the model by making assumptions about the range and precision of the data; then you can do it in O(n).

So for the data in netflix, there is an optimal algorithm. To do better, you must include additional data. This particular problem is interesting because it is nearly impossible to determine what the "optimal" algorithm is since it is based on psychological factors. However, the fact that they are seeking out smart people to figure this out indicates that we are probably pretty close to optimal, so maybe we need to start including more information and changing the model.
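Counting sort is the classic instance of "breaking the model": it assumes the keys are small non-negative integers, and under that assumption beats the comparison-sort bound. A minimal sketch:

```python
# Comparison sorts are bounded by O(n log n), but assuming every key is
# a non-negative integer less than k lets counting sort run in O(n + k):
# no element is ever compared to another.

def counting_sort(xs, k):
    """Sort non-negative integers, each < k, in O(n + k) time."""
    counts = [0] * k
    for x in xs:
        counts[x] += 1          # tally each key
    out = []
    for value, c in enumerate(counts):
        out.extend([value] * c) # emit each key 'count' times
    return out

print(counting_sort([3, 1, 4, 1, 5, 2], k=6))  # [1, 1, 2, 3, 4, 5]
```

The extra "data" here is knowledge about the input's range, which is exactly the kind of model-breaking assumption the comment describes.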

Making up for being slow - or being slow. (1)

totierne (56891) | more than 6 years ago | (#22934212)

I am always looking for more data, from new people, from different countries.
I think I am making up for my slow algorithm in my head, or maybe all this data is slowing me down.
Actually, the problem is making no decisions: having a cloud of maybes instead of deciding what rules I want to live by.

against the terms of the prize (1)

deander2 (26173) | more than 6 years ago | (#22934242)

yes this data is useful, but you can't use it in the contest:
http://www.netflix.com/TermsOfUse [netflix.com]

see also:
http://www.netflixprize.com/community/viewtopic.php?id=98 [netflixprize.com]
http://www.netflixprize.com/community/viewtopic.php?id=20 [netflixprize.com]
http://www.netflixprize.com/community/viewtopic.php?id=14 [netflixprize.com]

note that this makes sense. more/better data would help ANY decent algorithm. they want a better one, and they're judging you on a baseline. so they'd naturally limit your input options.

One answer: Kevin Bacon (1)

recharged95 (782975) | more than 6 years ago | (#22934342)

Now there's a simple algorithm that works. And beats even page rank.

The Best Data Wins (1)

dj e-rock (700351) | more than 6 years ago | (#22934354)

I would say that a richer set of (relevant) data would generally generate a better result than an improvement of algorithm. Granted, different statistical models and algorithms do work better on certain kinds of data (there's almost an art to picking a good model).

But, as a past professor of mine was fond of saying, "the best data wins."

However... (0)

Anonymous Coward | more than 6 years ago | (#22934358)

No data is enough when you have BAD algorithms...

Lisias.

Algorithms help too (1)

kabloom (755503) | more than 6 years ago | (#22934414)

I've seen a great many cases where developing better algorithms produced better performance (and better algorithms, rather than better data, account for the vast majority of computer science research papers out there), so certainly it can't only be better data. Additionally, what about the times when you need a better algorithm to take advantage of the additional data? Or when you combine the better algorithm with the better data?

This article is a completely false dichotomy.