
Can Machine Learning Replace Focus Groups?

samzenpus posted more than 2 years ago | from the what-does-the-machine-think? dept.


itwbennett writes "In a blog post, Steve Hanov explains how 20 lines of code can outperform A/B testing. Using an example from one of his own sites, Hanov reports a green button outperformed orange and white buttons. Why don't people use this method? Because most don't understand or trust machine learning algorithms, mainstream tools don't support it, and maybe because bad design will sometimes win."


93 comments


OK, so... (5, Insightful)

war4peace (1628283) | more than 2 years ago | (#40173821)

I have read the synopsis 4 (four) times and I didn't get shit.
Of course, TFA sheds some light on the whole thing, but really... work on your short version, guys, because what's in here makes no sense.

Re:OK, so... (1)

jakimfett (2629943) | more than 2 years ago | (#40174005)

I have read the synopsis 4 (four) times and I didn't get shit.

Read this AC-submitted summary [slashdot.org]. It may (or may not) enlighten you.

Re:OK, so... (4, Funny)

WrongSizeGlass (838941) | more than 2 years ago | (#40175091)

I have read the synopsis 4 (four) times and I didn't get shit.
Of course, TFA sheds some light on the whole thing, but really... work on your short version, guys, because what's in here makes no sense.

If you had just clicked the green button the machine would have understood it for you.

Re:OK, so... (0)

Alex Belits (437) | more than 2 years ago | (#40175371)

The article doesn't make any sense, either. Who, other than scammers, cares about trivial shit like one button being pressed by a random person who wandered onto some web page? People write software that users use to accomplish some work. You can't recruit random people to perform random actions on a randomly changing user interface, and then collect statistics on what they accomplished.

Come to think of it, if someone did that, the "best" interface would look just like GNOME3... Oh shit...

Re:OK, so... (1)

mwvdlee (775178) | more than 2 years ago | (#40177111)

Anybody who wants their users to take a certain action?

Think of websites (as stated in TFS) or focus group testing (also stated in TFS).

A lot of user interface testing is basically looking at how a user interacts with a UI. Something like automated testing could show you that people recognize the functionality of the [OK] button more easily than that of a functionally identical [Well, might as well try and go ahead with doing what I wanted to do] button.

As for websites; even on my open source project websites I prefer people press the [Download] button instead of browsing to a different site. Imagine how it is when commercial interests are at stake (even sites like /. want you to give them money).

Re:OK, so... (1)

Alex Belits (437) | more than 2 years ago | (#40178689)

Anybody who wants their users to take a certain action?

Think of websites (as stated in TFS) or focus group testing (also stated in TFS).

My response to that is identical to the comment you are replying to.

Re:OK, so... (5, Insightful)

Tarsir (1175373) | more than 2 years ago | (#40175859)

You know, I read the summary without understanding it, and just clicked through to read the article, but only after reading your comment did I realize just how little sense the summary really made.

In a blog post, Steve Hanov explains how 20 lines of code can outperform A/B testing.

It starts off talking about a nobody who did something that is apparently so trivial that it can be outdone by 20 lines of code. You might think that the following sentence will answer at least one of the questions raised by this sentence: Who is Steve Hanov? What is A/B testing? What do Steve's 20 lines of code do? But you'd be wrong.

Using an example from one of his own sites, Hanov reports a green button outperformed orange and white buttons.

Because the next sentence jumps to a topic whose banality and seeming irrelevance to the matter at hand defies belief. Three coloured buttons, one of which 'outperformed' the others, with nary a hint as to what these buttons do, or how one can outperform the others.

Why don't people use this method?

The third sentence appears to pick up where the first left off. Why don't people use the A/B testing method? Or are we talking about the three coloured buttons method?

Because most don't understand or trust machine learning algorithms, mainstream tools don't support it, and maybe because bad design will sometimes win.

The final sentence is a tour-de-force of disjointed confusion. It skips from machine learning algorithms that haven't been discussed, to tools with unknown purpose, to the design of something which was never specified.

It's like the summary is some kind of abstract art installation whose purpose is to be as uninformative as possible. It is literally the opposite of informative: Not only does it provide no information, it raises questions which you can't even be sure relate to the purported topic at hand, because you don't know what the topic at hand is.

It is either a bizarrely confused summary or one of the most artful trolls ever to grace Slashdot's front page.

Re:OK, so... (0)

Anonymous Coward | more than 2 years ago | (#40176535)

Elegantly put. Please mod up.

Captcha: Grander

Re:OK, so... (0)

Anonymous Coward | more than 2 years ago | (#40176757)

The summary was generated by a machine learning program that automatically learned to generate summaries of articles.

Re:OK, so... (1)

artor3 (1344997) | more than 2 years ago | (#40176955)

Sadly, it learned to generate summaries by reading Slashdot :-(

Re:OK, so... (0)

Anonymous Coward | more than 2 years ago | (#40179833)

I'm pretty sure this summary was what the 20 lines of code generated when the article was used as input.

Re:OK, so... (0)

littlewink (996298) | more than 2 years ago | (#40180247)

In the time you took to complain you could have RTFA. I understood it the first time I read it yesterday. Today, listening to complaints, I read it again and still understand it. Maybe you're not bright enough to either a) read it carefully or b) understand it. No problema - there are plenty of jobs as janitors and car salesmen.

Re:OK, so... (1)

Tarsir (1175373) | more than 2 years ago | (#40181837)

In the time you took to complain you could have RTFA.

Reread my post. I clicked through and read the article before posting my comment.

I understood it the first time I read it yesterday.

No you didn't. As the summary contains no actual information, you filled it in with your own prejudices and preconceptions, no doubt because you are not in the habit of reading things carefully. cf My first point.

Today, listening to complaints, I read it again and still understand it.

What is this supposed to prove? Of course you "still" understand it after having read the full article, unless you think people habitually lose all knowledge of their previous experiences after sleeping for eight hours.

Maybe you're not bright enough to either a) read it carefully

Ha! cf My first point again.

or b) understand it.

cf My second point.

No problema - there are plenty of jobs as janitors and car salesmen.

What do you do that's so prestigious and intellectually demanding?

Re:OK, so... (1)

lurker1997 (2005954) | more than 2 years ago | (#40181259)

That is probably the best post I have ever read here. Extremely insightful and hilariously written. I was in tears laughing through most of it.

Too Dumb to Understand, Therefore "5,Insightful" (0)

Anonymous Coward | more than 2 years ago | (#40184629)

Idiocy rewarded!

Re:OK, so... (0)

littlewink (996298) | more than 2 years ago | (#40180219)

RTFA

I understood it the first time I read it yesterday.

Today, listening to your complaint, I read it again and still understand it.

Maybe you're not bright enough to either a) read it carefully or b) understand it. No problema - there are plenty of jobs as janitors and car salesmen.

Re:OK, so... (1)

war4peace (1628283) | more than 2 years ago | (#40181819)

Smart. Very smart. You should be proud of yourself, being part of an elite that has the inherent right of stomping less-gifted people. Gratz!

Much ado ... (0)

jxander (2605655) | more than 2 years ago | (#40173899)

about nothing

Wake me up when they produce banner-ad algorithms that beat adblock, noscript, etc.

The only possible benefit I can see from this is *maybe* adjusting a site's color-scheme or layout to be more intuitive and easy to navigate. That is, making the "add to cart" button easier to find without being obnoxious about it. But then again, if I decide to add something to my cart, I'm confident I'll find the button even if it's 1.2% less optimized. And visual optimization can be done by any 1st-year graphic design student.

Re:Much ado ... (1)

jakimfett (2629943) | more than 2 years ago | (#40174031)

The only possible benefit I can see from this is *maybe* adjusting a site's color-scheme or layout to be more intuitive and easy to navigate.

Well, for those of us who do use testing and usability reporting on a daily basis, or have jobs that *require* us to know what is easiest for people to navigate (read: any and all web designers), this is pretty nice, and I intend to use the concept heavily.

Re:Much ado ... (0)

Anonymous Coward | more than 2 years ago | (#40174143)

I intend to use the concept heavily.

After paying your patent license fees!

Re:Much ado ... (1)

Half-pint HAL (718102) | more than 2 years ago | (#40188621)

Epsilon-greedy is one of the best-known algorithms in machine learning. I'd heard of it before but didn't know how it worked (I dropped AI after 2nd year); now I do.

Re:Much ado ... (1)

BasilBrush (643681) | more than 2 years ago | (#40174247)

Don't forget the sales pitch. It could help you choose between different texts. Real-world trials are far better than gut feel on that.

Re:Much ado ... (1)

spazdor (902907) | more than 2 years ago | (#40174495)

if I decide to add something to my cart, I'm confident I'll find the button even if it's 1.2% less optimized.

That's all well and good for you, but marketing and layout-optimization people are more interested in the question of whether one site or the other makes you more likely to decide to add something to your cart, not whether you'll succeed once you've decided to do so.

Re:Much ado ... (1)

jxander (2605655) | more than 2 years ago | (#40174673)

For most people, myself included, I'd imagine the deciding factor is not website layout, but something much more obvious.

Money, dear boy. (spoken with an English accent, ofc)

Plus a variety of other factors like shipping speeds, general reputation of the sites, ease of RMA, etc... Whether the "buy" button is Green, Orange or White is quite simply the last on my list of priorities, and pulling metrics on it without examining the other factors will net faulty results.

Re:Much ado ... (1)

spazdor (902907) | more than 2 years ago | (#40174905)

Ah, I see. You're one of those few people whose every decision is the logical, incontrovertible result of sober factual considerations.

"Psychology" is merely the study of what forces mold the choices of everyone's mind but yours.

Re:Much ado ... (0)

Anonymous Coward | more than 2 years ago | (#40176477)

Clever retort, sir. However, might I interest you in a long-forgotten theory of economics: that something bought or sold might possibly have greater value than that of the mechanism by which it is sold? Tsk, tsk, I apologize, sir; I hadn't originally noticed the little billboard for the red crawfish tattooed on your arm. Well, I suppose I'm off for lunch. Anything but seafood, I suppose.

Re:Much ado ... (1)

Half-pint HAL (718102) | more than 2 years ago | (#40188587)

Clever retort, sir. However, might I interest you in a long-forgotten theory of economics: that something bought or sold might possibly have greater value than that of the mechanism by which it is sold?

Which is why advertising and marketing are such underfunded spheres of public endeavour....

Re:Much ado ... (1)

mwvdlee (775178) | more than 2 years ago | (#40177155)

Sorry marketing and sales department, you're fired.
You can thank jxander for proving your jobs were never useful in the first place.
But don't feel bad; since competitor A offers the same service for 0.01% less, we'll soon be bankrupt anyway.

Re:Much ado ... (1)

spazdor (902907) | more than 2 years ago | (#40183543)

I lol'd.

Re:Much ado ... (0)

Anonymous Coward | more than 2 years ago | (#40185297)

What are you talking about? The first thing I look for on any website is whether the checkout button is green or red; that is the determining factor in whether I purchase there. If the button is not Pale_Golden_Rod I do not even give it a second thought, unless of course it is Pea_Green. /Sarcasm.

Re:Much ado ... (1)

Half-pint HAL (718102) | more than 2 years ago | (#40188769)

Oh FFS -- the use of button colours was what is known in technical jargon as an "example". The point of the article applies to all variables. And while you may think "layout" is less important than "shipping speeds", how do you find out shipping speeds? You have to look for them. If you can't find them, you walk; and if you can't find them, chances are it's because of something we call in technical jargon "site design", which includes details such as "layout".

It's easy when you're designing something (I'm guessing you've never had to design anything for the public) to make lots of assumptions without even realising it. You might put your "checkout" button where it is on your favourite webshop, but that might actually be the least obvious place to anyone who doesn't already share your shopping habits. Or maybe you think it's a wonderful shade of green, but what you don't realise (as someone with normal sight and no understanding of ocular defects) is that it's actually invisible against your chosen background to about 5% of the global population.


Translation (5, Informative)

Anonymous Coward | more than 2 years ago | (#40173949)

So that you don't have to click through the slashvertisement, I have read TFA for you.

Here is a summary: Let's say you have several different designs for a web interface that you want to test to find out which one works the best.

One method is to have a "testing period" in which you show each person one of the designs at random and identify how well it works for that person. Then, once you've shown 1,000 people each of the designs, you figure out which one is the best on average. Now the "testing period" is over, and the best design is shown to everyone from that point forward. That is the "old" method.

The "new" method is to dispense with the testing period. Instead, you show the first person one design at random. If it works (e.g. they click on the ad), it gets bonus points. If it doesn't work, it gets a penalty. At any time, you show the design with the most points; if it is bad, it will lose points over time and eventually stop being shown.

The goal of the "new" method is to hopefully avoid showing bad designs to 2000 people just to figure out which one is the best.

If you care about the details then you should probably read the article. This summary is just an approximation for those who can't be bothered or who object to slashvertisements on principle.

Re:Translation (1)

jakimfett (2629943) | more than 2 years ago | (#40173991)

...thank you for saving me the work of slogging through it on my own.

Re:Translation (1, Interesting)

mark-t (151149) | more than 2 years ago | (#40174137)

The "new" method has the problem of immediately favoring the first design to get a positive response.

My own experience with focus groups is that they were more interested in _WHY_ you chose something the way you did, rather than in just what you chose. I'm not entirely sure how this algorithm will determine that.

Re:Translation (3, Informative)

spazdor (902907) | more than 2 years ago | (#40174409)

The "new" method has the problem of immediately favoring the first design to get a positive response.

No it doesn't. The designs are ranked according to what percentage of responses have been positive so far, not by the total number of positive responses. The first design to get a positive response will get shown more, and thus it will get more positive responses, and more negative responses.

Re:Translation (-1)

mark-t (151149) | more than 2 years ago | (#40174541)

As soon as it presents a design that gets a positive response, that design will have the highest percentage response and, by the algorithm described above, be the only one that gets shown unless more people later vote it down.

Re:Translation (0)

Anonymous Coward | more than 2 years ago | (#40174711)

You didn't read the article, did you?

Re:Translation (1)

mark-t (151149) | more than 2 years ago | (#40174893)

My remark was on the algorithm that the poster above had presented... not the article.

Re:Translation (2)

spazdor (902907) | more than 2 years ago | (#40174725)

More people will inevitably vote it down (unless it is indeed the best option), because it's getting more exposure.

Unless you're saying that display frequency will actually affect click-through rate. Are you suggesting, for instance, that a design which gets 100 positive responses out of 300 showings should be expected to get more than 1000 positive responses if shown 3000 times instead? This seems unlikely if successive tests are causally independent (and given that successive tests are most likely completely different site users, at different computers, who have never met each other, that seems a fair assumption).

Re:Translation (-1)

mark-t (151149) | more than 2 years ago | (#40174985)

No.... I'm suggesting that the algorithm presented above, which only ever displays the single highest scoring design, is biased against designs that haven't yet had a chance to be viewed by anybody, and thus have not had an opportunity to get a positive response, when people are already showing some favor towards others.

My point is that it is an all too sad statement about humanity that most people will tend to more or less mindlessly consume whatever they are spoon-fed. If they don't know they have other choices, they are less likely to feel negative about the choices that are presented.

Re:Translation (5, Informative)

swillden (191260) | more than 2 years ago | (#40175159)

No.... I'm suggesting that the algorithm presented above, which only ever displays the single highest scoring design, is biased against designs that haven't yet had a chance to be viewed by anybody, and thus have not had an opportunity to get a positive response, when people are already showing some favor towards others.

What you're missing is the implied assumption that all of the options will fail most of the time, and that all options are initialized with maximum scores. The goal is to find the design that best motivates the user to take some action (e.g. click a link), and the assumption is that most of the time the user will not take that action. By starting all of the choices at a high value, they will all gradually converge downward to their true effectiveness rate, at which point the most effective will be chosen nearly all of the time. During the convergence process, the "leader" may change, but if the current leader isn't the true best, as it gets driven towards its true rate, it will eventually dip under one of the others.

If, by chance, a more effective option has a really bad run early on and gets pushed below the true effectiveness rate of another option, it would never recover -- which is why the author includes an occasional randomly-selected choice. If there is a large difference between the effectiveness of the options this is really unlikely to happen, but in the rare event it happens the randomization will eventually fix it. The author also covers a method of handling the case where the audience preferences drift over time, by including the ability to "forget" old input via simple exponential decay.

The only really bad thing about this approach is that it assumes you don't have a lot of repeat visitors. If you do, they'll be annoyed by seeing different versions, apparently at random (from their perspective).

Re:Translation (1)

Half-pint HAL (718102) | more than 2 years ago | (#40188837)

The only really bad thing about this approach is that it assumes you don't have a lot of repeat visitors. If you do, they'll be annoyed by seeing different versions, apparently at random (from their perspective).

What he doesn't discuss is what "one" instance of the site is. If you've got tracking cookies switched on, then you can assign one version of the site to the user at first visit and have it persist across browsing sessions.

An oversight on the author's part, but not a huge leap of logic.

Re:Translation (0)

Anonymous Coward | more than 2 years ago | (#40177187)

Reading the article would make you appear less stupid. The choices are initialised to have a 100% success rate, not 0%, and so those will automatically become the "highest scoring" if the others fail even one test.

Re:Translation (1)

spazdor (902907) | more than 2 years ago | (#40181675)

designs that haven't yet had a chance to be viewed by anybody,

There are no such designs in this model, owing to the fact that 10% of all visitors are shown a design at random, unweighted by previous measurements.

Seriously, the algorithm presented in TFA anticipates and addresses your objection perfectly. You'd do well to check it out; AC's summary up there was good but incomplete.

Re:Translation (1)

mwvdlee (775178) | more than 2 years ago | (#40177241)

Take a piece of paper and try to run down some scenarios. Try to find a scenario that disproves your own theory, then figure out why.

I'm sure there are edge cases where this "new" method fails, but there are also edge cases where classical focus group testing fails.

Since my job involves some A/B testing, I did the above and found some edge cases. But they're far less likely and with some job-specific optimizing (we have relatively long feedback delays) these edge cases can be mitigated.

Most interesting issue I found is when the positive feedback to each of the choices is near 100%. Not much of a problem unless having 100% positive feedback is somehow negative.

Re:Translation (1)

DerekLyons (302214) | more than 2 years ago | (#40176003)

The "new" method has the problem of immediately favoring the first design to get a positive response.

Only if you're stupid enough to only show the design with the highest score. Something as simple as choosing randomly among the top .75n results (where n=number of designs under test) fixes that.

Re:Translation (2)

WrongSizeGlass (838941) | more than 2 years ago | (#40175111)

Is there any way they can apply this to summaries and stories on /.? I'd be willing to read that summary ... and maybe even that story.

Re:Translation (0)

Anonymous Coward | more than 2 years ago | (#40177831)

So you mean he proposes to show the users of his website a 'randomly' selected design each time they visit...

And I was just thinking people were complaining about Facebook for constantly changing their UI!

Of course it can. (1)

thoughtspace (1444717) | more than 2 years ago | (#40173961)

Just get the last answer and repeat it over and over.
Such a machine will be just as good as any focus group.

Re:Of course it can. (1)

mwvdlee (775178) | more than 2 years ago | (#40177275)

Imagine you have 3 buttons...

First user sees button 1, clicks it.
Next user sees button 1 (because repeat), doesn't click it.
Next user sees button 2, doesn't click it.
Next user sees button 3, doesn't click it.
Next user sees button 1, clicks it.
Next user sees button 1 (because repeat), doesn't click it.
Next user sees button 2, doesn't click it.
Next user sees button 3, doesn't click it.
...repeat...

Even though button 1 has a 50% success rate and the other buttons 0% (making it infinitely better), it's only shown 50% of the time.
In this example, you'd want to show button 1 ~100% of the time, since it's the only button that ever gets clicked.

Just repeating the last answer produces sub-optimal results.

Re:Of course it can. (0)

Anonymous Coward | more than 2 years ago | (#40181855)

Okay, let's imagine that each button is set so that 0/0 represents a 100% rate. Here's what I'm pretty sure would happen:

First user sees button 1, clicks it.

1(1/1), 2(0/0), 3(0/0)

Next user sees button 1 (because repeat), doesn't click it.

1(1/2), 2(0/0), 3(0/0)

Next user sees button 2, doesn't click it.

1(1/2) ,2(0/1), 3(0/0)

Next user sees button 3, doesn't click it.

1(1/2), 2(0/1), 3(0/1)

Next user sees button 1, clicks it.

1(2/3), 2(0/1), 3(0/1)

Next user sees button 1 (because repeat), doesn't click it.

1(2/4), 2(0/1), 3(0/1)

From this point, everyone sees button 1. I don't get why button 2 would ever be shown again except in the 10% random case.

Re:Of course it can. (1)

Half-pint HAL (718102) | more than 2 years ago | (#40188973)

And that is precisely why they don't set it to 0/0 = 100%, instead initialising everything to 1:1 = 100%
1(1:1) 2(1:1) 3(1:1)
First user sees 1, clicks it:
1(2:2) 2(1:1) 3(1:1)
At this point, the algorithm could still pick any of the three.
Say it picks 1 again, and this is not clicked:
1(2:3) 2(1:1) 3(1:1)
So say it picks 2 for the next user, but the user doesn't click it:
1(2:3) 2(1:2) 3(1:1)
Well this time it has to pick 3 (unless the 10% random kicks in). Let's assume that's unsuccessful.
1(2:3) 2(1:2) 3(1:2)
OK, so 1 is now favoured, but one more "no click" on 1 levels us off at 2:4 = 1:2.

There will never be a true zero probability in the epsilon-greedy algorithm, and it can only approximate zero after accumulating an awful lot of evidence...

Can Machine Learning Replace Focus Groups? (1)

dgharmon (2564621) | more than 2 years ago | (#40173977)

NO !!!

Re:Can Machine Learning Replace Focus Groups? (1)

Half-pint HAL (718102) | more than 2 years ago | (#40188993)

Of course not. The whole point of a focus group is for the facilitator to lead the group to the conclusion he or she wants. Management can't manipulate machine learning algorithms -- only developers can.

What the...? (0)

Anonymous Coward | more than 2 years ago | (#40173999)

Is this a Turing test?

This is not exclusively machine learning (5, Insightful)

Anonymous Coward | more than 2 years ago | (#40174021)

This is not "machine learning" subsituting for human A/B testing. It's just changing the ratio of the number of visitors exposed to the "new" feature to be tested from 50% to 10%, while keeping the rest (90%) of the visitors using the "best so far" feature. There's also a bit of randomness thrown in when choosing which new feature the 10% of visitors get to test.

In this scheme, the human visitors are still doing the A/B testing, it's just that determination of which human is testing which feature dynamically adapts over time.

Now, if this guy had substituted human A/B testing completely with a machine learning technology that could somehow determine which feature is better without any input from humans, then I'd be impressed. That's kind of what the summary and article imply. But that's not what he's done. He's just being a bit more sophisticated regarding which humans get to test which feature.

He's also made a big fat claim regarding the effectiveness of his method with zero evidence to back it up. Theoretical results regarding multi-armed bandit problems are quite a far cry from real-world results regarding website feature selection. I'm looking forward to seeing some results of the proposed method on the latter.

Re:This is not exclusively machine learning (1)

BasilBrush (643681) | more than 2 years ago | (#40174513)

So you want to do A/B testing on whether this algorithm is better than A/B testing?

It'd probably be better to use the epsilon-greedy method when deciding whether the A/B testing or epsilon-greedy algorithm is better.

Or maybe not. We'll have to test that too.

It's testing all the way down.

Re:This is not exclusively machine learning (2)

tgv (254536) | more than 2 years ago | (#40177119)

Indeed, this has no relation to machine learning, whatsoever. The summary is once again ... deceptive.

And I'm sure the proof that the best one gets chosen doesn't exist. I'm also sure that this way of choosing an interface has a high probability of choosing the preferred one, but there is also a big difference with A/B testing: you'll never know how big the difference between the two is. In straightforward testing with two groups (which is not really A/B, by the way: that is alternating between A and B and then asking the subject to choose the best one; it has its origins in perceptual testing, where ABX testing is preferred), you can find out the difference in scores. Here you can't.

Re:This is not exclusively machine learning (1)

Half-pint HAL (718102) | more than 2 years ago | (#40189023)

Indeed, this has no relation to machine learning, whatsoever.

Is there an algorithm? Does the machine use the algorithm to obtain the optimum result? Just because the machine uses humans as its test subjects doesn't stop it being machine learning.

Re:This is not exclusively machine learning (1)

tgv (254536) | more than 2 years ago | (#40192457)

So ... sorting is machine learning? MS Word is machine learning? Don't think so.

Nowhere did I nor the GP claim that machines have to be involved. And the machine doesn't use humans in this case, it just uses their choices as its data. So your rebuttal is somewhat unfounded.

Machine learning is learning in the first place, through an algorithm: a machine can learn to do a task on its own. Not: a machine assists in a task where someone else learns. In this case, the machine doesn't learn anything. It just acts as a biased die. The outcome of the process might be called "learned", but the knowledge is in the head of the one who runs the experiment and oversees the outcome, not in the machine. And the "learning" doesn't generalize, so it doesn't help in improving performance on any other task than selecting between these two designs.

That's why it's not machine learning.

Re:This is not exclusively machine learning (1)

Half-pint HAL (718102) | more than 2 years ago | (#40199507)

So ... sorting is machine learning? MS Word is machine learning? Don't think so.

Nowhere did I nor the GP claim that machines have to be involved. And the machine doesn't use humans in this case, it just uses their choices as its data. So your rebuttal is somewhat unfounded.

Machine learning is learning in the first place, through an algorithm: a machine can learn to do a task on its own. Not: a machine assists in a task where someone else learns. In this case, the machine doesn't learn anything. It just acts as a biased die. The outcome of the process might be called "learned", but the knowledge is in the head of the one who runs the experiment and oversees the outcome, not in the machine. And the "learning" doesn't generalize, so it doesn't help in improving performance on any other task than selecting between these two designs.

That's why it's not machine learning.

A hell of a lot of machine learning is based around giving the computer an equation and letting it work out the particular coefficients that give the best possible answer. There are very few machine learning tasks that don't have some sort of experimenter assumptions built in, and no machine learning algorithm is ever 100% generalisable (otherwise machine learning would be a pretty small field, as there would only be one machine learning algorithm!).

The reason that this is classed as a machine learning problem and sorting isn't is that a sorting algorithm runs once and gives you a definite answer. But with epsilon-greedy, the computer maintains a theory that approximates the "correct" answer, and over time the answer gets better and better without direct operator control.

Yes, it's a simple algorithm. Yes, you could do a similar thing on paper with a human controller. But that doesn't stop the computer implementation qualifying as machine learning.

Re:This is not exclusively machine learning (1)

khipu (2511498) | more than 2 years ago | (#40177753)

Both Hanov and you are mixing up a couple of things. A/B testing is done with focus groups, not live visitors. When you test with focus groups, you don't run a live web server, and you're willing to pay for completion of some test design.

Algorithms for use with the multi-armed bandit are already widely used in live testing. Those algorithms properly belong to the field of machine learning (reinforcement learning), but it turns out that very simple algorithms or strategies are hard to beat. You're right that it's not "exclusively" machine learning, because the simple algorithms were already known before machine learning even existed, but these algorithms are still primarily studied in machine learning.

As for whether these methods are effective, that's easy: they are, and they are widely used. The part that's hard isn't to decide which versions of a page to present how often, but instead to figure out which change was responsible for the better outcome you were interested in.

Re:This is not exclusively machine learning (1)

Cederic (9623) | more than 2 years ago | (#40179141)

You can A/B test with live visitors. Works well too.

I think his approach has merit, but it's really just an automatically applied implementation of the outcome of the test - at some point you'd want to switch off A or B completely anyway.

Of course, far more interesting would be understanding why people chose A or B and offering the appropriate one based on what you know of the person involved. That's more sophisticated, but already done by people like Amazon: My amazon.co.uk web page will be very different to yours, in terms of content.

Re:This is not exclusively machine learning (1)

khipu (2511498) | more than 2 years ago | (#40179665)

You can A/B test with live visitors. Works well too.

It's still not a multi-armed bandit situation. The multi-armed bandit situation specifically means that you present either A or B, not an A/B choice. There are other machine learning techniques for optimizing A/B tests, just not the ones in the article.

This Is News? (2)

hondo77 (324058) | more than 2 years ago | (#40174077)

Throwing up banner ads with different color schemes and automatically re-weighting them based on click-through % is something I was doing well over ten years ago. This can't really be news, can it?

Re:This Is News? (1)

BasilBrush (643681) | more than 2 years ago | (#40174325)

Maybe. The fact that most sites aren't doing it means it comes under "stuff that matters".

Re:This Is News? (1)

hondo77 (324058) | more than 2 years ago | (#40189171)

I meant that I can't believe this is news because I assumed people had been doing this for years.

Re:This Is News? (0)

Anonymous Coward | more than 2 years ago | (#40201367)

I meant that I can't believe this is news because I assumed people had been doing this for years.

Good for you. Too bad you didn't patent it. And I'm not being sarcastic.

As for the method employed: Wired had a big article on A/B testing last month, which is probably why the summary was crap. It was written from the perspective of someone who already knew what A/B testing was. [wired.com]

Although the summary was poorly written, since I had seen the previous wired article I wanted to see what the new new hotness was.

Now to add to the actual discussion: my understanding from the Wired article was that they give all candidates an even test run (of, say, a few thousand page views), then select one based on performance, instead of dynamically re-weighting them immediately after the first performance feedback is entered. This would make sense for 2 reasons:
1) The computational expense of deciding which page to show on the fly, based on its most recent popularity, is higher than that of a static sample run.
2) Different pages may perform well in different demographics. If you introduce a new style and it gets downvoted to oblivion by the 10am crowd, you may never find out that it would have blown away all candidates with the 10pm crowd.

There are probably more reasons and I'm just pulling those out of my ass but there you go.

The article's premise is entirely wrong (5, Insightful)

RandCraw (1047302) | more than 2 years ago | (#40174119)

A/B focus testing is about observing how customers or users choose between two alternatives based on their qualitative sense of aesthetics. ML is about classifying data based on quantifying the data into defined classes or toward optimal values.

Predicting the outcome of a focus group is a completely different problem from multi-armed slot machines. In focus groups there is no objective metric, so focus group problems are not amenable to machine learning unless your machine can define, measure, and perhaps predict aesthetic criteria.

Now THAT I'd like to see.

Re:The article's premise is entirely wrong (0)

Anonymous Coward | more than 2 years ago | (#40174233)

A/B focus testing is about observing how customers or users choose between two alternatives based on their qualitative sense of aesthetics. ML is about classifying data based on quantifying the data into defined classes or toward optimal values.

Predicting the outcome of a focus group is a completely different problem from multi-armed slot machines. In focus groups there is no objective metric, so focus group problems are not amenable to machine learning unless your machine can define, measure, and perhaps predict aesthetic criteria.

Now THAT I'd like to see.

If you read TFA, you'll see that humans input the data to the machine. Then, the machine "learns" what is statistically best. The browser user chooses to click based on aesthetic criteria and the machine counts the votes for that link. So, it is really like a double-blind focus group.

Re:The article's premise is entirely wrong (2)

retchdog (1319261) | more than 2 years ago | (#40174579)

i don't know what the fuck a "double-blind" focus group is, since the user is clearly not blind to the design (this is the entire point).

and the reason why this is "like" a focus group, is that it is a focus group. all the information is coming from humans; it's just being used in a not-completely-idiotic way.

it's such an obvious idea it's surprising that no one has done this yet. oh, wait: http://m6d.com/about/about-us/ [m6d.com]

"Because the approach is rooted in machine learning, it continuously updates advertising decisions based on real-time signals from a marketer’s customer base. That feedback loop allows us to improve advertising performance over time."

Re:The article's premise is entirely wrong (1)

Half-pint HAL (718102) | more than 2 years ago | (#40189109)

No, it's not a focus group. A focus group is a bunch of people talking about what they like/don't like. However, humans are very poor at judging what they like. Most living room (en_GB "lounge") chairs are uncomfortable. People buy them because when they sit down on them in the showroom, they appear comfortable. Because they encourage poor posture, they take the strain off the sitting muscles. This gives the illusion of relaxation, and tricks people into believing the uncomfortable is comfortable.

A related issue is the fact that the majority of people claim to like their steaks "medium rare". Not because they like them medium rare, but because that's what they hear on the TV.

Focus groups are more often than not a total waste of time.

Re:The article's premise is entirely wrong (1)

retchdog (1319261) | more than 2 years ago | (#40191203)

yeah i know what a real focus group is, but it's a reasonable metonymic usage imho. welcome to today's internet, where you're never more than a statistic, unless someone actually notices you, in which case god help you.

medium rare: well, it's also what i'd personally recommend to someone... it's a good starting point. imho anything more than medium is a waste of decent steak, so medium-rare is in the middle of acceptable. personally, i go for rare at most if i'm at a good place (which is none-too-often, sadly), or if i'm cooking.

OT: steaks (1)

Half-pint HAL (718102) | more than 2 years ago | (#40192331)

In the UK, most places will serve you a medium if you ask for medium rare, simply because most folk who ask for medium rare will send it back to the kitchen because it's "not cooked properly". We're not good with our steaks here.

Re:OT: steaks (1)

retchdog (1319261) | more than 2 years ago | (#40215847)

that's a shame, but in line with the stereotypes of english food i suppose.

by the way, i've only read about and seen pictures of beef wellington, but it seems to me to be the culinary equivalent of an orgy, and would be, in and of itself, a total redemption of british cuisine. am i wrong here?

Re:The article's premise is entirely wrong (1)

BasilBrush (643681) | more than 2 years ago | (#40174377)

Neither the article nor the summary says anything about A/B focus testing, or mentions focus groups at all. It refers to A/B testing, where 2 different websites are offered to customers and the better one is found according to how objectively successful it has been (by sales, clicks, or whatever numerical measure).

Re:The article's premise is entirely wrong (1)

RandCraw (1047302) | more than 2 years ago | (#40175357)

You're right. My criticism was misdirected. The article is fine; it's not about ML or focus groups but minimizing trial size.

It was the Slashdot summary that somehow saw it as 'ML Replaces Focus Groups'. Thee-a-culpa.

Re:The article's premise is entirely wrong (1)

Hognoxious (631665) | more than 2 years ago | (#40177421)

Somebody in the chain, probably the submitter, thinks "user trials" and "focus groups" are synonyms.

you got it wrong too (1)

khipu (2511498) | more than 2 years ago | (#40177707)

Predicting the outcome of a focus group is a completely different problem than multi arm slot machines.

He isn't trying to use ML to predict the outcome of a focus group.

ML is about classifying data based on quantifying the data into defined classes or toward optimal values.

ML is about many things. One thing it is about is how a learner should explore an environment in order to maximize what he learns. It is one of those techniques that Hanov refers to, and it's a good idea in principle. But he picked the wrong algorithm for focus groups.

The algorithm he points to is the right one for online testing of different web page designs, where you stick with your current design 99% of the time but show visitors different designs 1% of the time and see whether those work better or worse.

Bayesian modelling and experiment design (2)

HalfFlat (121672) | more than 2 years ago | (#40175533)

It's a 'good-enough' approximation to an optimal selection process.

The probability of someone clicking on option A, B or C is unknown, but is expected to be constant when averaged over the population. Given the ratio of clicks versus views on any given option, the posterior distribution of that probability can be modelled as a Beta distribution. The experimental question is then: given the current estimates, which option should be presented to maximise the utility of the test?

For simply ranking the options, the utility may be the Shannon information [wikipedia.org] . In this case though, the utility also has to incorporate the expected benefit of a click-through. One could set up a utility function which is weighted between the two outcomes, possibly varying over time.

In practice though, Beta distributions with different means tend to converge to separate peaks quite quickly, so taking a possible 10% hit on the current best estimate click-through outcome seems an entirely plausible approximation. Bayesian experimental design though could also tell you when to stop testing and stick with the winner.

Re:Bayesian modelling and experiment design (1)

ShieldW0lf (601553) | more than 2 years ago | (#40175653)

If you used this type of algorithm to rotate a selection of different-but-good style sheets on a website, you'd be able to go past "which one is best at the time the test was devised" and actually build sites that pre-emptively and reactively stay "fashionable", "trendy" and "cool".

Re:Bayesian modelling and experiment design (1)

shadowrat (1069614) | more than 2 years ago | (#40176537)

An algorithm like this isn't going to always pick a trendy and fashionable design. It's going to pick the least bad design you have. If you make 15 designs now, they will probably all be tired in 2 years. Sure, the algorithm will say design 7 is the best 2 years from now, but it's probably not as good as whatever your designer would come up with at that time. It's probably better to plan on your designer making the 15 designs over the span of the 2 years. That way you know you are submitting designs made under the influence of the current culture and tastes.

Re:Bayesian modelling and experiment design (1)

ShieldW0lf (601553) | more than 2 years ago | (#40180507)

You're not wrong... but there are scenarios where, for example, a designer comes up with 4 proposed designs, all of which are good, and someone needs to make a decision as to which one to go with without any meaningful way to differentiate. This algorithm allows all 4 to be approved as "functional and not embarrassing" and put into place.

And yes, 2 years later, you might decide it's a good idea to hire a designer to freshen things up, and have them deliver you a few more designs. But, with a pattern like this, you don't need to discard the old ones... you can add the new ones in amongst the old and have the algorithm elevate the one that is popular.

But the real gem would be to find out that the design that was least popular 4 years ago is actually in better sync with what is stylish now, more so than the ones you paid for 6 months ago, and have that dusty old design automatically leap to the front of the queue without you even having to think about it.

Re:Bayesian modelling and experiment design (1)

martas (1439879) | more than 2 years ago | (#40186259)

For simple non-critical things like web design what parent describes is all well and good, but please don't use any similar method for a problem with serious consequences, be it in medicine or science or anything like that. There are statistically sound ways of doing experimental design, e.g. for deciding when to stop an experiment, and they are not Bayesian (usually).

Re:Bayesian modelling and experiment design (1)

HalfFlat (121672) | more than 2 years ago | (#40198987)

I am honestly curious: why should Bayesian experimental design not be used for serious work?

Re:Bayesian modelling and experiment design (1)

martas (1439879) | more than 2 years ago | (#40215155)

Put simply, because it is the wrong tool. Frequentist methods for problems like hypothesis testing and confidence set estimation were designed based on some simple assumptions that probably never really hold in the real world, but probably aren't very far from the truth. Bayesian methods rely on assumptions (and definitions of what kind of error is to be avoided) that are not suitable for many problems in science and medicine. E.g. Bayesian confidence interval estimation will tell you that "on average" over the random distribution of the unknown parameter you're estimating (i.e. the prior distribution that you pulled out of your ass) you won't be off by more than a certain amount. But clearly if what you're estimating is, for example, the safe dose of radiation for workers at a nuclear power plant, there is no random distribution over that amount. There is just a single maximal amount that is safe. Hence, the guarantee you need is that in the worst case over all possible unknown values of the quantity to estimate, you won't be off by more than some amount. This is exactly the kind of guarantee that frequentist methods give you.

Hope that explanation isn't complete gibberish to you...

Er, how about statistical significance? (2)

blach (25515) | more than 2 years ago | (#40175651)

To be valid, the last step (of which the author makes no mention) should be to compare the three groups to see if their differences are statistically significant. With tens of thousands of clicks, it's likely that they are, but the percentages were awfully close in the 2-3% range.

wrong algorithm (1)

khipu (2511498) | more than 2 years ago | (#40177679)

The multi-armed bandit problem is a problem in which you simultaneously try to optimize your overall reward and still explore. As a consumer, I face that problem (switch brands or stick with the tried-and-true). However, for focus groups, maximizing rewards for participants doesn't matter; it's all about finding the best solution for the organizer of the focus group. The participants already get the products for free. That means that it is not a multi-armed bandit problem, and algorithms for solving such problems are the wrong algorithms to use for focus groups.

There are mathematically more efficient ways of doing this kind of testing. But there are other constraints when testing with human beings as well, such as dependencies on the order in which you test. A/B testing is probably a pretty good compromise.

Then /b finds your site... (1)

GrumpySteen (1250194) | more than 2 years ago | (#40180237)

and suddenly the button with the racial epithet on it becomes the most popular one and you lose all your real customers.
