The Simian Army and the Antifragile Organization

Soulskill posted about a year ago | from the if-it-ain't-broke-get-a-bigger-hammer dept.

Programming 66

CowboyRobot writes "ACM has an article about how Netflix conducts its resilience testing. Instead of the GameDays used by sites such as Amazon and Google, Netflix uses what they call The Simian Army, based on the philosophy that 'Resilience can be improved by increasing the frequency and variety of failure and evolving the system to deal better with each new-found failure, thereby increasing anti-fragility.' While GameDay exercises are like a fire-drill, with scheduled exercises where failure is manually introduced or simulated, the Simian Army relies on failure in the live environment induced by autonomous agents known as 'monkeys.' Chaos Monkey randomly terminates virtual instances in a production environment that are serving live customer traffic. Chaos Gorilla causes an entire Amazon Availability Zone to fail. And Chaos Kong will take down an entire region of zones. 'What doesn't kill you makes you stronger' and Netflix hopes that by constantly protecting itself from internal onslaught, they will become increasingly 'anti-fragile — growing stronger from each successive stressor, disturbance, and failure.'"
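The three failure scopes in the summary (single instance, availability zone, whole region) can be sketched as a toy simulation. This is only an illustration of the idea, not Netflix's tooling: the `Instance` and `Fleet` classes, the region and zone names, and the "at least one instance alive" health rule are all assumptions made up for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    region: str
    zone: str
    alive: bool = True


@dataclass
class Fleet:
    instances: list = field(default_factory=list)

    def healthy(self):
        """Return the instances that are still running."""
        return [i for i in self.instances if i.alive]

    def can_serve(self):
        # Assumed health rule: the service is up while any instance survives.
        return bool(self.healthy())


def chaos_monkey(fleet, rng):
    """Terminate one randomly chosen healthy instance; return it (or None)."""
    victims = fleet.healthy()
    if not victims:
        return None
    victim = rng.choice(victims)
    victim.alive = False
    return victim


def chaos_gorilla(fleet, region, zone):
    """Fail every instance in a single availability zone."""
    for inst in fleet.instances:
        if inst.region == region and inst.zone == zone:
            inst.alive = False


def chaos_kong(fleet, region):
    """Fail every instance in an entire region."""
    for inst in fleet.instances:
        if inst.region == region:
            inst.alive = False


def build_fleet():
    """Hypothetical layout: 2 regions x 2 zones x 2 instances each."""
    fleet = Fleet()
    for region in ("region-1", "region-2"):
        for zone in ("a", "b"):
            for n in range(2):
                fleet.instances.append(
                    Instance(f"{region}-{zone}-{n}", region, zone)
                )
    return fleet
```

Calling `chaos_monkey(fleet, random.Random())` on a schedule and checking `fleet.can_serve()` after each call mirrors the loop the summary describes: inject a failure, observe whether the service survives, fix whatever did not.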

66 comments

Fook Yu? (-1)

Anonymous Coward | about a year ago | (#44172335)

Fook Mi!

Antifragility (2, Informative)

Anonymous Coward | about a year ago | (#44172377)

This is the example that I use when explaining antifragility to my colleagues. I highly recommend Nassim Nicholas Taleb's book, "Antifragile", or at least chapters 1, 2, and 7.

no wonder nobody takes Netflix seriously (3, Funny)

ebno-10db (1459097) | about a year ago | (#44172389)

No wonder nobody takes Netflix seriously. What kind of tech company worries about things like reliability and robustness? That's soooo 20th century. Everyone knows that if you have more than 90% availability or too low of a bug rate it means you're not agile enough and you can't be one of those amazingly innovative social networking outfits.

Re:no wonder nobody takes Netflix seriously (0)

Anonymous Coward | about a year ago | (#44172483)

Netflix itself likely runs on Linux and a number of open-source projects for ultimate reliability, and yet it refuses to develop a client for Linux. Please, take my money! No?

Oh, well -- Back to the venerable Pirate Bay.

-- Ethanol-fueled

Re:no wonder nobody takes Netflix seriously (0)

Anonymous Coward | about a year ago | (#44172557)

refuses to develop a client for Linux

There are actually several Netflix clients on Linux. Two that I use are on Android and Roku.

Re:no wonder nobody takes Netflix seriously (1)

doctor_subtilis (1266720) | about a year ago | (#44172699)

Actually they just recently said they would be making the switch to HTML5 and thus becoming completely cross-platform. I really can't believe they stuck with Silverlight this long.

Re:no wonder nobody takes Netflix seriously (1)

Guspaz (556486) | about a year ago | (#44172713)

What other choice did they have? Flash? Even HTML5 isn't really ready until more browsers implement the security features required.

Re:no wonder nobody takes Netflix seriously (1)

l0ungeb0y (442022) | about a year ago | (#44177187)

What other choice did they have? Flash?

Yes, believe it or not, Flash is kick ass when it comes to streaming video.
Far superior to your fabled HTML5 especially in regards to streaming latency.
Want to stream live video from the iPhone? Flash is the ONLY way to do it.

IIRC, they chose Silverlight over Flash because of Microsoft's DRM stack and I'm sure MS gave them some sweetheart deals like they did with MLB.

Re:no wonder nobody takes Netflix seriously (1)

Guspaz (556486) | about a year ago | (#44178843)

I don't think it was only DRM. At the time they chose Silverlight (five years ago), Flash's video streaming support wasn't nearly as robust; it didn't support seamless bitrate changes at the time, for example.

I haven't kept up on things, so it's possible that Flash's video streaming support is as robust as Silverlight today, but it wasn't back then.

Re:no wonder nobody takes Netflix seriously (1)

Dahamma (304068) | about a year ago | (#44173147)

Not cross platform at all until some browser for Linux actually implements their extensions via EME. Chrome might, but probably just for Google's own DRM (Widevine). Netflix currently uses MS PlayReady, good luck getting that in a Linux browser...

Re:no wonder nobody takes Netflix seriously (2)

CrankyFool (680025) | about a year ago | (#44173227)

Hint: You can play Netflix movies on Chromebooks, using HTML5. Think that uses MS PlayReady?

Re:no wonder nobody takes Netflix seriously (0)

Anonymous Coward | about a year ago | (#44174803)

Hint: You can play Netflix movies on Chromebooks, using HTML5. Think that uses MS PlayReady?

No, it uses a Chromebook-specific proto-implementation of EME that is disabled the moment the end-user enters root-access mode. (Functional EME will never be available on general purpose Linux, only locked-down "devices" which is no improvement over the situation today.)

Re:no wonder nobody takes Netflix seriously (1)

Anonymous Coward | about a year ago | (#44180131)

If you have access to root mode, couldn't you add a small patch called "I don't have root chrome, you did that yourself, no problems here"

Trusting the client for security is rearranging deck chairs on the Titanic.

Re:no wonder nobody takes Netflix seriously (1)

Luyseyal (3154) | about a year ago | (#44175705)

It's definitely time for DVD Jon [wikipedia.org] to make a comeback.

-l

Re:no wonder nobody takes Netflix seriously (1)

DrXym (126579) | about a year ago | (#44173753)

Just because they use HTML 5 does not mean they are cross platform. They could and probably will set the video tag to point to an encrypted stream and the browser will be expected to decrypt it and meet other criteria that stops it from being easily ripped on the fly.

Re:no wonder nobody takes Netflix seriously (0)

Anonymous Coward | about a year ago | (#44174891)

"A browser like Mozilla is *legally prevented* from actually implementing DRM, because they have to reveal all their code, including the decryption code that contains the secrets you use to decrypt," said Google Chrome team member Tab Atkins Jr., in a reply to the mailing list discussion.

"The proposal comes from authors at Google, Microsoft and Netflix, companies that stand to profit from the union of HTML5 and DRM ... *Netflix* responded that this particular component of a browser would *have to be implemented as closed source*" (emphasis added)

Re:no wonder nobody takes Netflix seriously (1)

DrXym (126579) | about a year ago | (#44175719)

Mozilla manages to provide an NPAPI interface to proprietary plugins. I see no reason whatsoever that it can't provide an API for video plugins.

Re:no wonder nobody takes Netflix seriously (0)

Anonymous Coward | about a year ago | (#44177009)

Open-source browsers already reveal their decryption code.

What makes encryption secure (given a strong enough algorithm) is not the code, it's the key. If you have no key, you cannot decrypt it. The keys, however aren't usually part of the source code.
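A minimal sketch of that point, assuming a toy stream cipher built from SHA-256 (my own construction for illustration only; it is not what any browser or DRM system uses and should not be used for real encryption). The algorithm below is fully public, yet the ciphertext is only recoverable with the right key.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key and nonce (toy scheme)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        # Each block depends on the secret key, the nonce, and a counter.
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(out[:length])


def xor_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR with the keystream is its own inverse."""
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))
```

Running `xor_cipher` twice with the same key and nonce returns the original bytes; running the second pass with a different key returns noise, even though the attacker can read every line of the code.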

Re:no wonder nobody takes Netflix seriously (0)

Anonymous Coward | about a year ago | (#44174767)

Actually they just recently said they would be making the switch to HTML5 and thus becoming completely cross-platform. I really can't believe they stuck with Silverlight this long.

They are not becoming cross-platform in the sense that HTML5 is cross-platform. They will still only run in browsers that have their DRM "plugin": currently IE and Chromebook, eventually Chrome on Windows/Mac and Safari on Mac, and not likely anything more. That is effectively no more cross-platform than Netflix is today.

Re:no wonder nobody takes Netflix seriously (1)

Anonymous Coward | about a year ago | (#44172947)

Netflix itself likely runs on Linux and a number of open-source projects for ultimate reliability,

How do they get access to the Chaos Gorilla when they're not running Microsoft products? Do the chairs throw themselves?

Re:no wonder nobody takes Netflix seriously (1)

DrXym (126579) | about a year ago | (#44173745)

I expect they would develop a standalone client if it were economically viable to do so, or there was a media framework that properly supported the copy protection they obviously require to deliver their service to a platform.

As it is, you can watch Netflix on Linux by using Wine to run the Windows Firefox and Silverlight plugin.

Re:no wonder nobody takes Netflix seriously (1)

stox (131684) | about a year ago | (#44172577)

Assuming the DRM features remain in HTML5, there will be no need for a client, you'll just be able to use your browser.

Re:no wonder nobody takes Netflix seriously (1)

qbast (1265706) | about a year ago | (#44174293)

... with the appropriate DRM plugin. Which you won't get for Linux.

Re:no wonder nobody takes Netflix seriously (1)

b4dc0d3r (1268512) | about a year ago | (#44172843)

That's why they wrote apps to lower availability. The anti-fragility thing is a cover invented since the last time the story hit Slashdot.

Their availability is in the higher range of reasonable, as a result of making the simians more powerful. Obviously they work hard at staying within the agile metrics, no matter how much time and money it takes.

Re:no wonder nobody takes Netflix seriously (1)

davester666 (731373) | about a year ago | (#44173131)

Totally. I suggest Netflix be fixed via the liberal use of Javascript, both on the Server and the Client.

Re:no wonder nobody takes Netflix seriously (1)

K. S. Kyosuke (729550) | about a year ago | (#44174367)

No wonder nobody takes Netflix seriously.

They paid the programmers peanuts, and got monkeys.

Re:no wonder nobody takes Netflix seriously (1)

luis_a_espinal (1810296) | about a year ago | (#44178241)

No wonder nobody takes Netflix seriously.

Its impact on the market says otherwise.

Re:no wonder nobody takes Netflix seriously (1)

ebno-10db (1459097) | about a year ago | (#44179499)

No wonder nobody takes Netflix seriously.

Its impact on the market says otherwise.

Many Slashdotters are impervious to irony.

Sybian Amy (0)

Anonymous Coward | about a year ago | (#44172397)

Yeah i liked that video too.

The big spec? (1)

sphealey (2855) | about a year ago | (#44172405)

But how do you write the big spec for that? The PMI would never approve.

sPh

Reliability vs Resilience (1)

Anonymous Coward | about a year ago | (#44172439)

So this is the ability to use whatever resources are available for graceful failover, allowing masses of cheap, consumer-grade equipment to be used instead of small amounts of expensive, reliable, enterprise gear.
Sounds like a winning strategy.

Re:Reliability vs Resilience (-1)

Anonymous Coward | about a year ago | (#44172697)

Man you're right, what the fuck were they thinking the whole time? You should go take charge, I bet they'll have a position to fill if you point this out to them!

Failover vs 0 downtime vs no broken connection. (1)

leuk_he (194174) | about a year ago | (#44174447)

I wonder how this behaves in the eye of the customer.

From the cluster solutions I know, there are people maintaining them who mistake a redundant system for one with zero downtime.

The problem is that if you take down a server, all connections to it go down. Some applications gracefully switch to another server. Some applications, however, first have to time out. Some applications crash.

The question is, do those interruptions get reported correctly, or do people just blame the app and restart their PC?

Very few of those user problems actually get reported, and the first-line help desk just instructs them to restart, and since by then another VM or region has taken over, everything works. But doing this on purpose is not a good user experience.

Just remember, 0 downtime does not mean that there are no interruptions. To minimize these you need a different mindset.

Yes sir, I like uptimes of 1 to 2 years.
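The difference between an application that switches servers gracefully and one that simply breaks can be sketched as follows. This is a hypothetical client, not anything Netflix has described: `ServerDown`, `fetch_with_failover`, and the handler functions are names invented for the example. It also returns the failures it saw, so the interruptions raised in the comment above could be reported instead of silently lost.

```python
class ServerDown(Exception):
    """Raised by a handler when its server cannot answer (assumed error type)."""


def fetch_with_failover(servers, request):
    """Try each (name, handler) pair in order.

    Returns (server_name, response, errors), where `errors` lists the
    servers that failed before one succeeded, so they can be logged.
    Raises ServerDown if every server fails.
    """
    errors = []
    for name, handler in servers:
        try:
            return name, handler(request), errors
        except ServerDown as exc:
            # Record the interruption rather than hiding it from operators.
            errors.append((name, str(exc)))
    raise ServerDown(f"all servers failed: {errors}")


def dead_server(request):
    """Stand-in for an instance that a chaos agent has just terminated."""
    raise ServerDown("connection reset")


def live_server(request):
    """Stand-in for a healthy instance."""
    return f"ok:{request}"
```

With `servers = [("vm-1", dead_server), ("vm-2", live_server)]`, the call succeeds via `vm-2` and the returned `errors` list still shows that `vm-1` dropped the connection. An application without this loop would surface the first failure straight to the user.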

Antifragile (1)

hhawk (26580) | about a year ago | (#44172445)

It's hard to explain, for layers of antifragility are best built on layers of fragility, meaning cells in a body are fragile but the body itself gets stronger when stressed (lifting weights, etc.). The Netflix example is good; it's a bit like randomly pulling parts off a plane in flight and then, after the crash, making the next planes stronger. It also leads to antifragility, but it's a strong stressor.

Forgotten chaos (1)

gmuslera (3436) | about a year ago | (#44172499)

The Black Swan chaos, Government/Hollywood takeover, the 1+ billion dollar lawsuit, EMP bombs, mass/worldwide migration to internet 2, Yellowstone and of course, the Cthulhu Chaos. Probably the insider threat chaos goes around all these options.

Re:Forgotten chaos (0)

Anonymous Coward | about a year ago | (#44172513)

dude, they're so gonna split your photons.

Misleading (1)

Livius (318358) | about a year ago | (#44172505)

I was looking forward to hearing about this army full of primates.

Re:Misleading (1)

camperdave (969942) | about a year ago | (#44172591)

I was looking forward to hearing about this army full of primates.

... Or at least an army of twelve monkeys.

Actually, my first thought was the final battle of Planet of the Apes (2001), when all the apes are running towards the grounded ship.

Re:Misleading (1)

Herkum01 (592704) | about a year ago | (#44172689)

That is only because you saw it on Netflix the other night...

Re:Misleading (0)

Anonymous Coward | about a year ago | (#44172941)

I was looking forward to the next season of SyFy network, but the monkeys keep writing this iambic pentameter stuff.

Re:Misleading (1)

kahless62003 (1372913) | about a year ago | (#44174149)

I was hoping for a story about an army of simians doing glorious battle with an army of code-monkeys.

Automated testing with a new name (0)

Anonymous Coward | about a year ago | (#44172675)

This just sounds like automated testing with a new name. Testing on live networks is maybe a little bit "innovative"; but it's really just automated testing. Now let's go synergize some more paradigms.

Re: Automated testing with a new name (0)

Anonymous Coward | about a year ago | (#44172779)

Paradigm is (IMHO) one of those written words that when spoken makes even the brightest among us sound pretentious.

On a side note, maybe Netflix's masochism will breed a new brand of wide spread (pun) pen for XboxOne.

Re:Automated testing with a new name (1)

Dahamma (304068) | about a year ago | (#44173165)

Not even a new name, monkey testing [wikipedia.org] has been around for a long time...

Sadly this doesn't protect against.. (0)

Anonymous Coward | about a year ago | (#44172769)

...catastrophic management failure, such as when executives decide to spin off half the company into an independent service called Qwikster...

Mongolian Horde (3, Funny)

girlintraining (1395911) | about a year ago | (#44172771)

The problem with this, is that it's still programmed failure. In my experience, hardware or software faults, or combinations of both, are not nearly as effective as plain old human stupidity. Oh, and government action. There is no disaster recovery plan for "Here's a warrant. Give us all your shit." There is a similar lack of recovery options for human stupidity. And let's be honest: It's more abundant in the universe than hydrogen, and infinitely harder to defend against, precisely because stupidity is far more cunning and unpredictable than intelligence could ever hope to be.

Re:Mongolian Horde (1)

c0lo (1497653) | about a year ago | (#44172829)

There is a similar lack of recovery options for human stupidity. And let's be honest: It's more abundant in the universe than hydrogen, and infinitely harder to defend against, precisely because stupidity is far more cunning and unpredictable than intelligence could ever hope to be.

The last I knew, the stuff that's more abundant than hydrogen was called "dark matter/energy". You mean they lately discovered those are actually "stupidity in action"?

Re:Mongolian Horde (1)

Dahamma (304068) | about a year ago | (#44173219)

The problem with this, is that it's still programmed failure. In my experience, hardware or software faults, or combinations of both, are not nearly as effective as plain old human stupidity.

But that's largely irrelevant to their testing methodology. They don't just simulate hardware, software, or human faults, they simulate loss of services at various levels of granularity. Doesn't matter whether a server died, someone misconfigured a router, a construction backhoe plowed a fiber cable, a Starz Network-funded hit squad took out their data center, or an earthquake struck the West Coast - it simulates an outage in their network that they want to recover from.

And let's be honest: It's more abundant in the universe than hydrogen, and infinitely harder to defend against

Ok, this line is just plain ridiculous. Been a long day, I assume? Or were you distracted in your metaphor thoughts while vigilantly defending your network against hydrogen? :)

Re:Mongolian Horde (1)

bertok (226922) | about a year ago | (#44173963)

I wrote about this before in an unrelated post, but the point is the same: most "enterprise" vendors will sell you kit that can tolerate nuclear war, but as far as I know, there are very few solutions to protect from administrator error or malice.

Think about the harm someone could do to a typical business with nothing other than an Active Directory "Domain Admin" account! Given something like that, I can think of a whole bunch of ways to harm an environment in such a way that even the availability of backup tapes stored off site wouldn't be sufficient to repair.

Ill will isn't even required. I've personally witnessed a fat-fingered administrator nearly destroy a business in seconds! That organization just barely managed to remain solvent, despite full backups that were successfully restored.

There is an enormous amount of research waiting to be done to develop systems with "Byzantine Security", that is, systems that can tolerate not only external attacks or simple component failures, but also deliberate attacks by trusted parties.

Re:Mongolian Horde (2)

ebno-10db (1459097) | about a year ago | (#44175011)

most "enterprise" vendors will sell you kit that can tolerate nuclear war, but as far as I know, there are very few solutions to protect from administrator error or malice

Not true. If the nuclear war kills the administrator you're safe.

Reliability (1)

doctor_subtilis (1266720) | about a year ago | (#44172831)

I think part of the reason for their heavy focus on reliability is that they are competing with the mature television industry and thus have a lot of concern for finicky customers that are considering cutting their cable/satellite plan.

Re:Reliability (1)

foniksonik (573572) | about a year ago | (#44173287)

The biggest issue isn't Netflix. It's crappy routers on the other end that can't recover from a mild outage. I came home the other day to find my router blinking like a madman on stimulants but no service. Unplugged, replugged and it all worked fine. Seems like it should have been able to diagnose itself and restart to achieve the same result.

Re:Reliability (0)

Anonymous Coward | about a year ago | (#44174167)

I don't tend to have this problem with my router since I started using Tomato for my firmware, but the cable modem from my ISP is a dumber device by far. Its connectivity issues are often solved by power cycling. But it shouldn't be too hard to program a TomatoUSB router to send a signal to cut off power to a socket controlled by a USB relay whenever the router loses its WAN connection. This would be an especially good setup for technophobic users who tend to panic when they discover that their internet isn't working (like my parents). Somebody should make it!
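A rough sketch of such a watchdog, under loud assumptions: `wan_is_up` and `power_cycle` are hypothetical callables you would supply yourself (for example a connectivity check and whatever toggles your particular USB relay), and the thresholds are arbitrary. Nothing here is a real Tomato or relay API.

```python
import time


def watchdog(wan_is_up, power_cycle, max_failures=3, interval=30.0,
             sleep=time.sleep, max_checks=None):
    """Power-cycle the modem after `max_failures` consecutive failed checks.

    wan_is_up   -- hypothetical callable returning True while the WAN link works
    power_cycle -- hypothetical callable that cuts and restores modem power
    max_checks  -- stop after this many checks (None means run forever)
    Returns the number of power cycles performed.
    """
    failures = 0
    cycles = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if wan_is_up():
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures >= max_failures:
                power_cycle()
                cycles += 1
                failures = 0  # give the modem a fresh start
        sleep(interval)
    return cycles
```

Requiring several consecutive failures before acting keeps a single dropped check from rebooting the modem, which is the main design choice worth copying from this sketch.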

In general, I'd love to get an interface inside the router firmware to turn on and off electrical sockets. That way, you could just log into it (say, with your phone) and turn off appliances you left on when you were racing out the door. Or imagine a stove that connects to your wifi and can be powered down remotely, say with an Android app. To some OCD people, this would be a godsend! Once I powered on my computer remotely (through WOL), logged in remotely, and played an audio file of a time-sensitive message for my gf, who tends to forget to charge her phone. I'm sure it was weird for her to hear my voice yelling at her from my computer speakers, but the message was received, and this demonstrated to me the value of being able to remotely power on and off appliances.

Re:Reliability (1)

DrXym (126579) | about a year ago | (#44173821)

I think more likely it's because they're following the AOL model. They have a high percentage of non technical users (and morons) and therefore the service should be ultra simple and ultra reliable. They most likely fear the cost of support calls and customer churn caused by a service that "confuses" customers.

On the flip side it makes their service maddeningly retarded at times especially in families where adult and kid viewing habits are munged into one unholy meaningless mess and there is no easy way to clean it out or hide recently watched videos. Supposedly profiles are coming soon where different members of the family can split out their recommendations but I'll reserve judgement until I see it.

Re:Reliability (1)

ebno-10db (1459097) | about a year ago | (#44175045)

They have a high percentage of non technical users ... therefore the service should be ... ultra reliable.

Technically sophisticated users shouldn't have reliable service?

Re:Reliability (1)

DrXym (126579) | about a year ago | (#44175889)

Technically sophisticated users shouldn't have reliable service?

The point I was making is that AOL achieved reliability by dumbing the UI down to what the lowest common denominator was capable of. Not because that represented the optimum user experience but because they dreaded customers choking up their call centres by "confusing" them with features.

Re:Reliability (1)

ebno-10db (1459097) | about a year ago | (#44177105)

The point I was making is that AOL achieved reliability by dumbing the UI down to what the lowest common denominator was capable of. Not because that represented the optimum user experience but because they dreaded customers choking up their call centres by "confusing" them with features.

But that's not what Netflix is doing. They're trying to ensure reliable delivery of content. That's something that should just work, and not require endless tweaking by a technically sophisticated end user.

Re:Reliability (1)

Anonymous Coward | about a year ago | (#44177893)

You know what dozens of config options are caused by? Lazy fucking programmers.

You call clean and simple UIs "dumbed down" - I call them programmers doing their fucking job. It's part of a programmer's job to reduce the complexity of a problem for the user - not just pass that complexity on in a different form.

Old story (0)

Anonymous Coward | about a year ago | (#44173611)

Why reiterate this again?

"Chaos Monkey"? (1)

Spudley (171066) | about a year ago | (#44174427)

"Chaos Monkey" sounds like it ought to be the name of the next iteration of Firefox's Javascript subsystem.

Hang on.... "Chaos Monkey is a piece of software that deliberately takes out random parts of your live production system".... hmmmm.... maybe it *is* the Firefox Javascript subsystem?

You and what army? (1)

nitehawk214 (222219) | about a year ago | (#44175423)

I originally read that as the "Syrian Army".

grr (1)

dirtyhippie (259852) | about a year ago | (#44175797)

netflix gets all this great PR for this approach - and at least in theory it's a good one - but as a customer of netflix's, the results i've experienced are actually pretty poor.

think about it, they go around shooting nodes in the head during business hours. In the long run, that's great, they can be prepared for anything, but it's still madness.

Oh and separation of services? Great. But who the hell wants to browse the netflix directory when the streaming service is down? Not me, for one.

uh huh (1)

morgauxo (974071) | about a year ago | (#44175899)

Maybe they do this with their PC client? They surely don't seem to care about the robustness of their Android client. I think they must develop and test that monster on the latest, most powerful hardware that a corporation can buy. Then they fill it full of graphics and video until it almost breaks, thus ensuring that it runs like crap on anything less. I would drop Netflix like a ton of bricks except they have licensed most of the content that I would actually want to watch, while Hulu, the only competitor I am aware of, has just about nothing for me.

Chaos Kong in the clouds (0)

Anonymous Coward | about a year ago | (#44176331)

Chaos Kong ate my cloud! I hope Mario can save my QoS levels by rescuing Pauline.

Won't Work where it Matters Most (1)

Capt.Albatross (1301561) | about a year ago | (#44197099)

The article appears to be a slightly pretentious way of saying that Netflix does reliability testing on its live systems. They can get away with this only because it is not critically important for Netflix to be highly robust: the downside of failure is merely a degree of temporary irritation. Don't try this in the financial markets or life-support systems.
