
Making Facebook Self Healing

Soulskill posted more than 2 years ago | from the resistance-is-futile dept.

Facebook

New submitter djeps writes "I used to achieve some degree of automated problem resolution with Nagios Event Handler scripts and RabbitMQ, but Facebook has done it on a far larger scale than my old days of sysadmin. Quoting: 'When your infrastructure is the size of Facebook's, there are always broken servers and pieces of software that have gone down or are generally misbehaving. In most cases, our systems are engineered such that these issues cause little or no impact to people using the site. But sometimes small outages can become bigger outages, causing errors or poor performance on the site. If a piece of broken software or hardware does impact the site, then it's important that we fix it or replace it as quickly as possible. ... We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages. So, I started writing scripts when I had time to automate the fixes for various types of broken servers and pieces of software.'"
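To make the idea concrete, here is a minimal sketch of a detect-fix-escalate loop in Python. This illustrates the pattern the quote describes; it is not Facebook's actual FBAR, whose code was never published, and the alarm names, remediation table, and ssh-based fixes are all hypothetical.

    #!/usr/bin/env python3
    """Hypothetical detect-fix-escalate loop in the spirit of the quote.
    NOT Facebook's FBAR; alarm names and remediations are made up."""

    import subprocess
    import time

    # Scripted fixes keyed by alarm name. A real system would let each
    # service team register its own handlers.
    REMEDIATIONS = {
        "httpd_down": ["systemctl", "restart", "httpd"],
        "tmp_full": ["find", "/tmp", "-mtime", "+7", "-delete"],
    }

    def still_failing(host, alarm):
        # Placeholder: re-run whatever probe raised the alarm.
        return False

    def escalate(host, alarm, reason):
        # Placeholder: file a ticket for a human instead of looping forever.
        print(f"TICKET: {host} / {alarm}: {reason}")

    def handle_alarm(host, alarm):
        fix = REMEDIATIONS.get(alarm)
        if fix is None:
            escalate(host, alarm, "no automated fix known")
            return
        result = subprocess.run(["ssh", host] + fix)
        time.sleep(30)  # let the service settle before re-checking
        if result.returncode != 0 or still_failing(host, alarm):
            escalate(host, alarm, "automated fix did not take")

The point is the shape of the loop: machines fix the routine breakage, and humans only see what the scripts couldn't handle.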


74 comments

Golden Girls! (-1)

Anonymous Coward | more than 2 years ago | (#37432024)

Thank you for being a friend
Traveled down the road and back again
Your heart is true, you're a pal and a cosmonaut.

And if you threw a party
Invited everyone you ever knew
You would see the biggest gift would be from me
And the card attached would say, thank you for being a friend.

Suggested change... (1, Offtopic)

Gription (1006467) | more than 2 years ago | (#37434124)

I would like to suggest a subtle change to the posting system: make it so the first post on any article cannot be made as "Anonymous Coward".

I know Slashdot has a tradition of being a "free-for-all, run through a blender," but I don't think there has ever been an AC first post that was anything other than:
- So lame that you wonder how a person survives such a terminal lack of personality or creativity... or
- Something with no real reason it couldn't have been posted under a login.

Stupidity really should be viciously stamped out, but if we can use automated steps to reduce the "background stupid," we can focus more energy on more invasive cases of dumb.

Re:Suggested change... (0)

Anonymous Coward | more than 2 years ago | (#37436196)

While they're at it, make it so the first post on any article cannot be modded all the way up.

Re:Golden Girls! (1)

mr_mischief (456295) | more than 2 years ago | (#37434294)

s/cosmonaut/confidant/

Maybe you have confused Zuckerberg with Guy Laliberté or Mark Shuttleworth. Or perhaps with Richard Branson, who builds space tourism vehicles.

The song, however, has nothing to do with space travelers.

Complexity arising from simplicity (2, Insightful)

Psychotria (953670) | more than 2 years ago | (#37432060)

We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages.

This seems backwards to me. Surely the "larger, more complex outages" are caused by an accumulation of, or interaction between, the smaller, less complex problems. If all of the smaller problems are well understood and dealt with, then those more complex problems should not arise. I think it's dangerous to assume that because the smaller problems can be transiently resolved by a script with minimal human intervention, the more complex problems need less exploration. Sure, scripts to handle the less complex issues are great, but this should not shift the focus of the human engineers to "solving and preventing complex outages"; solving those often (always?) means solving the less complex issues.

Re:Complexity arising from simplicity (5, Insightful)

aiken_d (127097) | more than 2 years ago | (#37432086)

I disagree. Larger outages in an infrastructure like Facebook's are only rarely an accumulation of smaller issues. Think about it: what's a more likely scenario for a major site-wide issue, thousands of web servers whose hard drives die simultaneously, or a flapping route caused by a configuration issue on a router?

Think of it like your body: every day you suffer thousands of tiny injuries and insults that your immune system and skin deal with and that you never know about. This frees you up to drive yourself to the doctor if you notice a lingering cough, or to call the ambulance if you sever a limb. You wouldn't argue against an immune system because it might hide larger issues from conscious attention, would you?

Re:Complexity arising from simplicity (1)

Anthony Mouse (1927662) | more than 2 years ago | (#37432422)

Larger outages in an infrastructure like Facebook's are only rarely an accumulation of smaller issues. Think about it: what's a more likely scenario for a major site-wide issue, thousands of web servers whose hard drives die simultaneously, or a flapping route caused by a configuration issue on a router?

Sometimes. But suppose, for example, you have a fail-over setup so that if one machine falls over, its work units or clients are automatically transferred to another machine. You're very proud of yourself until you get a damaged work unit or client that is capable of crashing the machine processing it. It then gets transferred around to server after server, causing a cascade failure until, 30 seconds later, every one of your servers has crashed.
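A generic defence against this kind of poison-unit cascade - a sketch of my own, not something TFA or the parent prescribes, with all names hypothetical - is to count how many times a unit has crashed a worker and quarantine it past a threshold instead of failing it over yet again:

    import collections

    MAX_RETRIES = 3                       # crashes allowed before we give up on a unit
    retry_counts = collections.Counter()  # unit_id -> observed worker crashes
    quarantined = set()                   # poison units parked for human inspection

    def on_worker_crash(unit_id, pending_units):
        """Called when a worker dies while processing unit_id. Instead of
        blindly failing the unit over to the next machine, track repeats
        and quarantine suspected poison so the cascade stops here."""
        retry_counts[unit_id] += 1
        if retry_counts[unit_id] >= MAX_RETRIES:
            quarantined.add(unit_id)    # stop handing it to healthy servers
        else:
            pending_units.add(unit_id)  # normal fail-over: someone retries it

The design point is just that fail-over logic needs a memory; without one, every healthy server volunteers to be the next victim.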

And sometimes you do get simultaneous "independent" failures of both hardware and software because of a common cause. Suppose you have a load spike during a nationwide heatwave and the ambient temperature in most of your data centers gets up to 85 degrees Fahrenheit, which is just within design specifications for your facilities but it had never happened before. You might very well see a rash of disk or power supply failures then.

Re:Complexity arising from simplicity (3, Insightful)

hardtofindanick (1105361) | more than 2 years ago | (#37432542)

It seems to me that you are inventing hypothetical scenarios of total failure. Most practical failure scenarios can be handled gracefully when you have Facebook's resources under your command. After all, they are not sending men to Mars. We have studied distributed database problems for more than 30 years and now understand them well. There is pretty much nothing technologically interesting about Facebook (or Twitter, for that matter).

The sad part is that someone [linkedin.com] writes up his ramblings, adds a flow chart or two, and it becomes a story on /.

Re:Complexity arising from simplicity (1)

Thanster (669304) | more than 2 years ago | (#37432698)

Here's a real one that defeated a modern multi-path network not so long ago. The network was constructed with WAN paths running over some antiquated link encryptors, which turned out to have an undocumented (at least to the end user) "drop all keys" bit sequence. Being link encryptors, they parsed for this sequence within the flowing data stream. One day an unassuming JPEG attached to an email happened, by pure chance (the bit sequence didn't have a lot of entropy to it), to contain the magic bits: instant denial of service. Each link dropped, the network re-converged, and the still-extant TCP connection between the mail servers resent the offending packet, over and over, until the site in question had completely isolated itself from the network. (That one was a real doozy to figure out!)

PHP (0)

Anonymous Coward | more than 2 years ago | (#37433094)

Google doesn't have nearly as many such problems. I'd think Google simply pays more for better people, rather than hiring dirt cheap PHP morons.

Re:Complexity arising from simplicity (1, Insightful)

Ethanol-fueled (1125189) | more than 2 years ago | (#37432094)

The FBAR system will be ineffective against the outages caused by their users leaving in droves for the next big thing.

I blindly clicked the TFA link without checking that it was a Facebook link. Once I was at the page, I was halfway through it when a box popped up telling me "Please log in to continue." I closed the box and nothing happened. If I was thinking about joining Facebook, I sure wouldn't now after seeing that shithead pop-up. Fuck Facebook - you guys get on your knees and suck my dick, you beg for my information, and only then I might just give you my real age.

And what's up with Facebook's IPO? Do their investors have a bunch of invisible Disney Dollars stashed away in Uncle Scrooge's money vault?

Re:Complexity arising from simplicity (0)

Anonymous Coward | more than 2 years ago | (#37433508)

"Your account isn't secure! Give us your cell phone number so we can notify you about stuff."

Re:Complexity arising from simplicity (3, Informative)

mclearn (86140) | more than 2 years ago | (#37432126)

TFA specifically uses an example of a failed hard drive to describe the workflow. You can see that a failed hard drive is something small, easily diagnosable, and -- in the greater scheme of things -- easily fixable.

Now, if you recall what happened with AWS in April, they had a low-bandwidth management network that suddenly had all primary EBS API traffic shunted onto it. This was caused by a human flipping a network switch when they shouldn't have. Something like this does not happen all the time, has few, if any, diagnosable features, is not well-defined enough to have a proper workflow attached to it, and needs human engineers to correct. This is an example of a complex, large-scale problem.

Read the article, it's actually quite interesting.

Re:Complexity arising from simplicity (1)

TinyManCan (580322) | more than 2 years ago | (#37432626)

Now, if you recall what happened with AWS in April, they had a low-bandwidth management network that suddenly had all primary EBS API traffic shunted onto it. This was caused by a human flipping a network switch when they shouldn't have. Something like this does not happen all the time, has few, if any, diagnosable features, is not well-defined enough to have a proper workflow attached to it, and needs human engineers to correct. This is an example of a complex, large-scale problem.

I wonder when this army of automated problem-fixing engines will encounter a corner case its masters never considered, and how it will react.

I give the ops guys at Facebook a lot of credit for managing such a gigantic workload with relatively few, very smart people. Amazon also has a lot of smart people, who have been working on EBS (in one form or another) since before Facebook was founded. These systems just interact in unpredictable ways when they get out of their comfort zone.

Systems so complicated they require self-managing management systems are going to have some interesting failure modes, to say the least.


Re:Complexity arising from simplicity (0)

Anonymous Coward | more than 2 years ago | (#37434708)

So how do you understand, and deal with, a dozen broken hard drives and a few power supplies going bonkers daily?

NOOOOOO!! (4, Funny)

Baloroth (2370816) | more than 2 years ago | (#37432082)

How are we supposed to kill it if it's self-healing? Now it will never die!

Re:NOOOOOO!! (0)

Anonymous Coward | more than 2 years ago | (#37432142)

How are we supposed to kill it if it's self-healing? Now it will never die!

Acid or Fire.

Assisted Suicide (1)

Frosty Piss (770223) | more than 2 years ago | (#37432374)

I was thinking more in terms of "assisted suicide".

Re:Assisted Suicide (1)

32771 (906153) | more than 2 years ago | (#37432814)

I thought of renaming it to Palliabook, but then look what I found at Wikipedia:

"Palliative care (from Latin palliare, to cloak) is a specialized area ..."

I guess Cloakbook would also be correct.


Re:NOOOOOO!! (1)

rvw (755107) | more than 2 years ago | (#37433392)

How are we supposed to kill it if it's self-healing? Now it will never die!

Wait until Microsoft buys it, then give it another year or two...

Re:NOOOOOO!! (1)

acidradio (659704) | more than 2 years ago | (#37434120)

Maybe Facebook is really the Skynet that we learned about in the Terminator movies. I fear the day that it becomes self-aware.

Re:NOOOOOO!! (1)

cr0nj0b (20813) | more than 2 years ago | (#37434172)

How are we supposed to kill it if it's self-healing? Now it will never die!

Make sure the halon system is not computer controlled.
Kill the internet connections to all sites at the same time, so they can't send out an SOS.
Then kill the power.

BULLSHIT (0)

Osgeld (1900440) | more than 2 years ago | (#37432088)

If it were true, Facebook would just self-destruct.

Re:BULLSHIT (0)

Anonymous Coward | more than 2 years ago | (#37432182)

The Only Winning Move Is Not to Play

( and screw slashdot for not allowing my all caps )


One script writer equals one hundred MCPs (0)

Anonymous Coward | more than 2 years ago | (#37432214)

"Today, the FBAR service is developed and maintained by two full time engineers [facebook.com], but according to the most recent metrics, it’s doing the work of approximately 200 full time system administrators".

Re:One script writer equals one hundred MCPs (1)

inglorion_on_the_net (1965514) | more than 2 years ago | (#37432720)

"Today, the FBAR service is developed and maintained by two full time engineers, but according to the most recent metrics, itâ(TM)s doing the work of approximately 200 full time system administrators"

Which doesn't really tell anyone anything. Who expresses the amount of work done in terms of the number of full-time workers? In case anyone had failed to get the message, the above shows that such a metric isn't very useful. Perhaps the message here is that 2 really effective people can do the work of 200 not-so-effective people - but that has been known for a long time. Still, the more this message is spread, the better.

Routing around the faulty components (1)

izomiac (815208) | more than 2 years ago | (#37432224)

We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages.

Given how glitchy Facebook was in the past, I can't help but be reminded of this comic [smbc-comics.com].

Re:Routing around the faulty components (1)

Raenex (947668) | more than 2 years ago | (#37433386)

And amusingly enough, the SMBC site is down now, so I can't reach your link.

Re:Routing around the faulty components (1)

perryizgr8 (1370173) | more than 2 years ago | (#37433590)

Amazingly, SMBC has been down for hours now. I've never seen any site go down for so long.

I find this ironic... (0)

Anonymous Coward | more than 2 years ago | (#37432232)

...given how broken most of the site is on a daily basis.

Feature request. (0)

PrimeNumber (136578) | more than 2 years ago | (#37432234)

Could they do the world a favour and write scripts to make it self-terminate instead?

Re:Feature request. (0)

Anonymous Coward | more than 2 years ago | (#37432920)

Probably what Anon will do.

Re:Feature request. (0)

Anonymous Coward | more than 2 years ago | (#37433078)

lol hilarious PrimeNumber!

Every generation wants to re-invent the wheel. (1)

tqk (413719) | more than 2 years ago | (#37432254)

I was rolling out Big Brother Network Monitor a decade ago. It was well capable of doing this.

Today, I'd use an RDB that stored the output of Perl DBI cron jobs running on each machine, plus another job that checked the DB and made sure everything that ought to be happening had reported in successfully and recently. Anything that hadn't would trigger an email to someone to look into it.

Easy to develop, implement, extend, and maintain.
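A minimal sketch of the checker half of that scheme, assuming each cron job INSERTs a heartbeat row into a shared table. The table name, schema, addresses, and the use of SQLite as a stand-in RDB are all assumptions for illustration:

    #!/usr/bin/env python3
    """Checker job: find hosts whose cron jobs haven't reported in lately
    and email someone about each one. Illustrative sketch only -- table
    name, schema, and addresses are assumptions, and sqlite3 stands in
    for whatever shared RDB you'd actually use."""

    import smtplib
    import sqlite3
    from email.message import EmailMessage

    STALE_SECONDS = 15 * 60  # alert if nothing heard in the last 15 minutes

    def find_stale_hosts(db_path):
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT hostname, MAX(reported_at) AS last_seen "
            "FROM heartbeats GROUP BY hostname "
            "HAVING last_seen < strftime('%s', 'now') - ?",
            (STALE_SECONDS,),
        ).fetchall()
        conn.close()
        return rows

    def alert(hostname, last_seen):
        msg = EmailMessage()
        msg["Subject"] = f"{hostname} has not reported since {last_seen}"
        msg["From"] = "monitor@example.com"
        msg["To"] = "oncall@example.com"
        msg.set_content("Go look into it.")
        with smtplib.SMTP("localhost") as server:
            server.send_message(msg)

    if __name__ == "__main__":
        for hostname, last_seen in find_stale_hosts("heartbeats.db"):
            alert(hostname, last_seen)

The reporting half is just an INSERT from cron on each box.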

No, I don't want to connect to FB just to read the article. Post it somewhere else if you want it read.

Re:Every generation wants to re-invent the wheel. (0)

Anonymous Coward | more than 2 years ago | (#37432570)

I've been doing this type of watchdog scripting for over 20 years, and that's only because I've been in systems/networking for over 20 years. My guess is they thought their hardware/software was going to be perfect when it was purchased/written... and they've just now realized that it isn't.

Dear Facebook, if you start recursively monitoring your monitoring software, pretty soon you won't be able to run anything.

Re:Every generation wants to re-invent the wheel. (1)

maxwell demon (590494) | more than 2 years ago | (#37433042)

Dear Facebook, if you start recursively monitoring your monitoring software, pretty soon you won't be able to run anything.

At which time the process will message: "Mission accomplished."

Re:Every generation wants to re-invent the wheel. (1)

Anonymous Coward | more than 2 years ago | (#37433060)

Today, I'd use an RDB that stored the output of Perl DBI cron jobs running on each machine, plus another job that checked the DB and made sure everything that ought to be happening had reported in successfully and recently. Anything that hadn't would trigger an email to someone to look into it.

You'd re-invent Nagios, but worse?

Re:Every generation wants to re-invent the wheel. (0)

Anonymous Coward | more than 2 years ago | (#37447540)

Ouch, you're reinventing the wheel, but you're also asking for trouble.

Writing into a shared database via cron jobs on different boxes has a few implications:
- The credentials share write access to the database - if not per user account, then per permission. You usually don't give each host its own table to log into; you grant several hosts access to a single table. So if a single box is compromised, or the monitoring software becomes broken (due to some junior admin "enhancing" the cron jobs), your database is basically rendered useless: it can no longer be trusted and probably won't contain any useful data anymore. If a single box is misbehaving (i.e. the hostname "got lost"), you'll end up searching for that box forever - unless you also log hostname and IP address, both retrieved via the SQL connection and not as cron job output.
- Servers need to be NTP-synchronized - which, on the other hand, results in all of those cron jobs connecting to your shared database at the same time, all of them trying to write a few records into the very same tables. I haven't seen many database management systems handle such a scenario - e.g. 30,000 connections within a single second - fairly well. So in the end you need to introduce either a simple message queue that spreads your submissions to the database over time, or even distribute the messages among multiple database servers. For a "quick fix", you may run something like "sleep(rand(60))" in your cron jobs, just before submitting the results to your database server (see the sketch after this list). On the other hand, your cron job is not so essential that it needs to run exactly on second "zero" of a specified time; maybe you'll reconsider other ways to run your job (e.g. as a self-looping daemon or from inittab).
- When your cron daemon dies, there is no easy way to recover automatically and you'll be flooded with alarms. So you'd probably also need to introduce something like a "health check" into /etc/inittab (including some "sleep(60)" to prevent the "respawning too fast" issue) to locally check whether essentials like crond, sshd and syslogd are still running (and restart them accordingly).
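The jittered reporter half might look like the following sketch. The random sleep is the "sleep(rand(60))" trick above; the DB path, table, and columns are made-up illustrations, and sqlite3 again stands in for a real networked RDB:

    #!/usr/bin/env python3
    """Reporter job, run from cron on every box. Sleeps a random 0-60 s
    before writing so that thousands of NTP-synchronized machines don't
    all hit the shared database in the same second. Path, table, and
    columns are illustrative assumptions."""

    import random
    import socket
    import sqlite3
    import time

    time.sleep(random.uniform(0, 60))  # spread the stampede over a minute

    hostname = socket.gethostname()
    conn = sqlite3.connect("heartbeats.db")
    conn.execute(
        "INSERT INTO heartbeats (hostname, ip, reported_at) VALUES (?, ?, ?)",
        # log hostname *and* IP, per the first bullet above
        (hostname, socket.gethostbyname(hostname), int(time.time())),
    )
    conn.commit()
    conn.close()

Running it as a self-looping daemon instead of from cron, as suggested above, removes the dead-crond failure mode at the cost of having to supervise the daemon.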

Re:Every generation wants to re-invent the wheel. (1)

tqk (413719) | more than 2 years ago | (#37448756)

Objections noted, but I'm unconvinced any are show-stoppers.

Writing into a shared database via cron jobs on different boxes has a few implications:
- The credentials share write access to the database - if not per user account, then per permission. You usually don't give each host its own table to log into ...

Why? I would give each host its own table, or perhaps give a small block of machines one table. This is hardly going to be a vast blob of data going back and forth. Besides, it doesn't all have to go into one DB, nor one DB on one machine. Hell, it could be a DB on each machine, with exports scp'd to a central log server (or ten).

If a single box is misbehaving (i.e. the hostname "got lost"), you'll end up searching for that box forever - unless you also log hostname and IP address, both retrieved via the SQL connection and not as cron job output.

That makes no sense to me. No, I've never worked anywhere that had 30k hosts online, but simple documentation practices scale. Yeesh. Hostname:location:IP address:... would make a very small DB entry, considering the binary blobs RDBs are comfortable handling these days.

- Servers need to be NTP-synchronized - which, on the other hand, results in all of those cron jobs connecting to your shared database at the same time ...

Get real! Of course you don't have them try to do that.

-When your cron daemon dies ...

Oh come on. Now I know you're just making stuff up.

Sounds like a good place to work (3, Interesting)

Maow (620678) | more than 2 years ago | (#37432272)

Facebook is an amazing place to work for many reasons but I think my favorite part of the job is that engineers like me are encouraged to come up with our own ideas and implement them. Management here is very technical and there is very little bureaucracy, so when someone builds something that works, it gets adopted quickly. Even though Facebook is one of the biggest websites in the world it still feels like a start-up work environment because there's so much room for individual employees to have a huge impact.

Like building infrastructure? Facebook is hiring infrastructure engineers. Apply here.

Damn, if I weren't so averse to soul-crushing rejection, I'd apply.

This guy was insightful and informative, so I believe what is quoted above.

And I'm surprised: I figured Facebook would be either more bureaucratic (like MS) or kinda dickishly autocratic (like Zuckerberg is rumoured to be).

Re:Sounds like a good place to work (1)

hedwards (940851) | more than 2 years ago | (#37432500)

If the site is often broken and randomly changing, this would probably be why. You do want people experimenting and finding fixes, but if you don't have any coordination going on that's just as bad.

Re:Sounds like a good place to work (0)

Anonymous Coward | more than 2 years ago | (#37434752)

There aren't that many engineers to begin with; why would it be bureaucratic?

Re:Sounds like a good place to work (1)

The O Rly Factor (1977536) | more than 2 years ago | (#37435064)

Having a multibillion-dollar company pretend it is still a Stanford startup is kind of like trying to pilot an oil tanker as if it were a 30-horsepower inflatable boat. Hence you get situations like that godawful instant message...thing that takes up a quarter of your screen and won't let you see which contacts are actually online.

But HEY! At least our employees feel like they are empowered and important, and we still get to have a foosball table in the conference room, right? I truly cannot take a company like Facebook seriously when I see tours of their facilities and their infrastructure engineers are walking around in Volcom t-shirts and skateboard shoes.

Re:Sounds like a good place to work (1)

russotto (537200) | more than 2 years ago | (#37436742)

I truly cannot take a company like Facebook seriously when I see tours of their facilities and their infrastructure engineers are walking around in Volcom t-shirts and skateboard shoes.

And the T-shirts and the shoes interfere with the job exactly how? Suits (or just dress shirts) and wingtips do NOT increase efficiency one iota.

Re:Sounds like a good place to work (1)

The O Rly Factor (1977536) | more than 2 years ago | (#37439518)

The same way that people who get themselves pierced and tattooed up then wonder why nobody will hire them as an investment banker. It's all about presentation: if your company looks like it's being managed by a bunch of 15-year-olds, then I'm just going to assume that it is being managed by a bunch of 15-year-olds. But hey, stick it to the man, trying to put us down with his suits and business casual and looking presentable for clients and whatnot, right?

Re:Sounds like a good place to work (1)

russotto (537200) | more than 2 years ago | (#37443724)

I'm sure Facebook, Google, and other companies where you're as likely to see a skateboard as a suit are crying into their corporate beers over whether you take them seriously. As for investment bankers, I do know someone who is pierced and tattooed and works for a Wall Street trading firm.

Of course, if we're going by dress, you really have to consider the position. The casual appearance you describe is the hallmark of the programmer... who in their right mind would hire a programmer in a suit? That'd be like hiring a Unix guru without a ponytail!

Re:Sounds like a good place to work (1)

evilviper (135110) | more than 2 years ago | (#37438360)

And I'm surprised: I figured Facebook would be either more bureaucratic (like MS) or kinda dickishly autocratic (like Zuckerberg is rumoured to be).

I've seen what happens when a startup gets big, and I don't have good things to say about it.

Lack of bureaucracy is often code for the lunatics running the asylum... Think: no standards, no processes, no training for new hires (and there are, of course, lots of them), and just nobody in charge of, or enforcing, anything. That kind of havoc is great for the sociopaths, but it makes it very hard for the adults to keep everything holding together with toothpicks and bubblegum, particularly when every new guy makes the same damn stupid mistakes because they're so "empowered" and management is hands-off and won't enforce the most basic standards.

I've seen both sides of the coin, and while both are terrible at their extremes, I'd rather err on the side of a little too much management and standards.

Of course this is an extreme generalization. There is a perfect balance in there somewhere, and Facebook is swimming in enough money that they certainly COULD have gotten things right, but I'm inclined to believe it's a lot more like the out-of-control overgrown startups I've seen than anyone would like to admit...

Sounds like the definition of Facebook to me (0)

oheso (898435) | more than 2 years ago | (#37432306)

pieces of software that have gone down or are generally misbehaving

I mean, when was the last time something on Facebook actually worked?

Re:Sounds like the definition of Facebook to me (0)

Anonymous Coward | more than 2 years ago | (#37433436)

pieces of software that have gone down or are generally misbehaving

I mean, when was the last time something on Facebook actually worked?

Every time I use it... why do people make such obviously trollish claims?

WOW (1)

PopeScott (1343031) | more than 2 years ago | (#37432320)

Auto-ticketed errors? I am amazed. If you did not detect sarcasm, please enter a problem ticket. You don't think that shit's automated, do you?

Upstart? (0)

Compaqt (1758360) | more than 2 years ago | (#37432332)

So this is basically a script that restarts dead daemons, right?

What's the difference between this and Upstart?

http://upstart.ubuntu.com/faq.html [ubuntu.com]

automation? (0)

Anonymous Coward | more than 2 years ago | (#37432564)

The MSP I work for has been doing this for at least 3 years. It can also call people and give them a menu of actions, if such is a requirement.

Google and Facebook can fail more freely (1)

Coward Anonymous (110649) | more than 2 years ago | (#37432932)

Part of the reason Facebook and Google can "self heal" is that failures are mostly not noticeable by end users. If a Facebook or Google machine fails, then unless you are getting a 404 or a service failure message, there is little to no way for you to know that the web page you have been served is wrong, partial, or out of date. This failure ambiguity provides a lot of leeway in the methods and speed required to fix a failure.

For most other services, where there is a definite correct and incorrect output - like file systems or financial services - a broken service has immediate impact, and fixing it is much harder.

They do it very differently (2)

brunes69 (86786) | more than 2 years ago | (#37433354)

From the sounds of this article, Facebook and Google go about this VERY differently.

The Facebook way, it seems, is that every node in the infrastructure is potentially important, so they write and maintain all these healing scripts to deal with problems like broken processes or failed hard drives.

Google goes about the same problem in a very different way. Google's system is architected such that no node is important. Everything is massively parallel and redundant, such that you could destroy any server, any set of servers, even blow up an entire data centre with a bomb, and aside from performance issues, no one would notice.

From an admin's point of view, I would much prefer Google's system. Something doesn't look right on a box? Yank it out TOTALLY, put in a new one, investigate some other time.

Re:They do it very differently (0)

Anonymous Coward | more than 2 years ago | (#37436412)

The article is talking about a system to respond once the box is yanked out of production. Even this is a significant number of machines (the yanking out based on monitoring is mostly automatic). The goal is to reduce the amount of work that needs to be done to get that machine back into operating order. That's what FBAR is doing.

I'm not sure where you get that Facebook treats every machine as critical?

Self Healing, my foot (1)

justcauseisjustthat (1150803) | more than 2 years ago | (#37432946)

How come friends keep disappearing, only to send requests again saying I dropped them? Either it's buggy or broken...

Re:Self Healing, my foot (0)

Anonymous Coward | more than 2 years ago | (#37433204)

You're not supposed to add friends whose middle names are ' DROP TABLE, silly.

glorified ticket agent (0)

Anonymous Coward | more than 2 years ago | (#37432984)

Sounds more like an API built to issue service tickets. They broke the API access down by group, so the individual groups can create their own resolutions to known problems.

Not Impressed (0)

Anonymous Coward | more than 2 years ago | (#37436254)

I have a friend who works there. According to him, Facebook devotes over 100 physical servers to every 35,000 users. That is incredibly inefficient in terms of power and hardware costs. Seems to me they just throw a lot of money around rather than coming up with elegant technical solutions.

Re:Not Impressed (0)

Anonymous Coward | more than 2 years ago | (#37436324)

Many of the failures that FBAR handles are software, not hardware. I wish I'd given a software example in the post. For hardware, all FBAR can do is flag the issue for humans to fix. For software outages, it repairs the broken services on the machine and humans don't have to touch it at all.

Sounds to me like they write shitty software. How about writing robust software and not needing scripts to "repair" it all the time?

Re:Not Impressed (1)

turbidostato (878842) | more than 2 years ago | (#37436508)

"Facebook devotes over 100 physical servers to every 35,000 users. That is incredibly inefficient"

Absolutely yes! If they only managed to serve 350 users per server, now that would be a neat thing.

Google wrote about this way before FB existed... (0)

Anonymous Coward | more than 2 years ago | (#37440650)

And Google dwarfs Facebook in terms of bandwidth / number of servers / databases / you-name-it.

When you've got a lot of servers, it has long been known that "failure is the norm" and that your infrastructure should be designed around that *fact*.
