Why Browsers Blamed DNS For Facebook Outage

timothy posted more than 3 years ago | from the three-letters-bad dept.

Bug 96

Julie188 writes "That was probably the only time 'DNS' will ever be a trending term on Twitter. The cause was Facebook's 2.5 hour outage on Thursday, which incorrectly told users trying to access the site that a DNS error was to blame. In truth, experts who've read Facebook's explanation say the site went down because Facebook gave itself a distributed denial-of-service attack when a system admin misconfigured a database. So why was DNS blamed? The 27-year-old communications protocol has been known to cause other, somewhat similar outages."

96 comments

DNS? (0)

Anonymous Coward | more than 3 years ago | (#33703550)

All I received was bad gateway. Just saying..

Re:DNS? (3, Funny)

Mitchell314 (1576581) | more than 3 years ago | (#33703738)

Then stop buying Dells. :P

Re:DNS? (2, Informative)

rs79 (71822) | more than 3 years ago | (#33703870)

http://rs79.vrx.net/works/photoblog/2010/Sep/23/ [vrx.net]

Notice the page, being served from facebook.com, saying "bad DNS". Think about that for a second.

Re:DNS? (2, Interesting)

kasperd (592156) | more than 3 years ago | (#33704022)

Notice the page, being served from facebook.com, saying "bad DNS".

I can understand why that may cause people to think the problem is with DNS. The error message looks like it came from an http proxy. That would suggest that either the user had a proxy configured or facebook were using a reverse proxy. If it was the latter, the DNS "problem" would be inside their network.

Re:DNS? (1)

rekoil (168689) | more than 3 years ago | (#33704372)

Easy. They absolutely do use reverse proxies - every large site does, because you just can't scale a web site to Facebook's size without them.

In the post-mortem, they mention the need to effectively "turn off" the entire site, and the easiest way to do that is to remove its DNS. In this case, however, it was most likely more effective to remove the DNS entries for the back-end hosts that the proxies forward queries to, rather than the entries for www.facebook.com. This is most likely what generated the DNS errors that users saw.
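A rough sketch of the scenario described above, in Python. It assumes a toy reverse proxy that looks up a back-end host in an internal zone before forwarding; the host name, addresses, and function names are all hypothetical, and nothing here reflects Facebook's real setup. It only shows how a page served by the public site can still report a DNS failure.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical back-end name; an assumption for illustration only.
BACKEND_HOST = "backend.internal.example"


def make_resolver(zone: Dict[str, str]) -> Callable[[str], Optional[str]]:
    """Return a lookup function backed by an in-memory 'internal zone'."""
    def resolve(name: str) -> Optional[str]:
        return zone.get(name)
    return resolve


def handle_request(path: str,
                   resolve: Callable[[str], Optional[str]]) -> Tuple[int, str]:
    """Toy reverse proxy: the client already reached us, now find the back end."""
    backend_ip = resolve(BACKEND_HOST)
    if backend_ip is None:
        # The public name resolved fine; only the internal lookup failed.
        return (502, "DNS error: could not resolve " + BACKEND_HOST)
    return (200, "forwarded " + path + " to " + backend_ip)


healthy = make_resolver({BACKEND_HOST: "10.0.0.5"})
drained = make_resolver({})  # back-end entry removed from the internal zone

print(handle_request("/home", healthy))
print(handle_request("/home", drained))
```

With the back-end entry removed, the proxy still answers on the public name, which is consistent with users seeing a DNS error on a page that plainly came from facebook.com.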

Re:DNS? (1)

sortius_nod (1080919) | more than 3 years ago | (#33709024)

If it was doing a DDoS of their SQL servers, wouldn't taking down the DNS be useless? I've always been taught to use IP rather than hostname when building any database servers.

Re:DNS? (1)

InfiniteWisdom (530090) | more than 3 years ago | (#33715990)

I've always been taught to use IP rather than hostname when building any database servers.

Definitely time for some reeducation then. Using IP addresses instead of DNS is just asking for trouble and headaches.

Re:DNS? (0)

Anonymous Coward | more than 3 years ago | (#33720520)

Agreed - the only time I can think of only using IP is on a completely isolated segment, where DNS is unavailable and undesirable, like a cluster interconnect.. even private backup segments we resolve names through the frontend, route the traffic through the backend.

Re:DNS? (1)

billcopc (196330) | more than 3 years ago | (#33707830)

Bingo! The DNS issue was internal to Facebook's load-balancing cluster. Anyone who's hosted a busy web site should be familiar with this kind of setup. Internal DNS is often used for such purposes, as it can transparently provide round-robin functionality. Every time you resolve the hostname, you get a different IP (caches notwithstanding), so while the back-end servers need to be conscious of the load-balancing in a generic fashion, the actual distribution of work is trivial, and adding more back-end nodes merely requires a painless addition to a DNS zone file.
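The round-robin idea can be sketched in a few lines of Python. This is a simplified in-memory model, not a real DNS server: the class name, host name, and addresses are assumptions, and a real server would typically return the whole record set in rotated order rather than a single address.

```python
from collections import deque
from typing import Deque, Dict, List


class RoundRobinZone:
    """Toy zone that hands out a host's addresses in rotation."""

    def __init__(self, records: Dict[str, List[str]]) -> None:
        self._records: Dict[str, Deque[str]] = {
            name: deque(ips) for name, ips in records.items()
        }

    def resolve(self, name: str) -> str:
        ips = self._records[name]
        answer = ips[0]
        ips.rotate(-1)  # the next caller gets the next address
        return answer

    def add_node(self, name: str, ip: str) -> None:
        # Adding capacity is just one more record for the same name.
        self._records.setdefault(name, deque()).append(ip)


zone = RoundRobinZone({"db.internal.example": ["10.0.0.1", "10.0.0.2"]})
print([zone.resolve("db.internal.example") for _ in range(4)])
```

Calling `add_node` corresponds to the "painless addition to a DNS zone file": callers need no change, they simply start seeing the new address in the rotation.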

Re:DNS? (1)

rs79 (71822) | more than 3 years ago | (#33708914)

I think they changed their internal DNS config, screwed it up, and when their front facing webservers tried to lookup their database servers and failed, they tried the backup/rollover db servers, failed... these cascading errors caused their internal DNS servers to melt down.

After they'd been down for a while, because it spun down slowly over about half an hour, somebody in charge asked "WHY ARE WE DOWN" and was told "DNS error" and then changed the front facing webservers to spit up HTML that said "DNS ERROR", a simple web page communicating something is better than dead air.

Pedants will note that when http://facebook.com/ [facebook.com] says "DNS error" clearly it isn't a DNS error - it was able to use the DNS to find facebook.com, no? Therefore it had to be an internal DNS error.

Facebook's own explanation of the fault speaks vaguely of cached and persistent data. Classic DNS screwup.

Re:DNS? (1)

c6gunner (950153) | more than 3 years ago | (#33709692)

Off topic here, but FYI your "orange plane" is probably a flare, especially if you spotted it over Mountain View.

Human Error (1)

mfh (56) | more than 3 years ago | (#33703572)

Is contagious, it seems.

Re:Human Error (1)

BrokenHalo (565198) | more than 3 years ago | (#33704006)

Indeed. I don't do Facebook, but if I had got such a message, my first response would be to look at my own /etc/hosts file. From time to time I manage to bite myself on the ass with my block-list, but I can live with that...
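For anyone wanting to run the same check, here is a minimal Python sketch that lists the names a hosts file points at a loopback or null address. The sample entries are made up, and treating 127.0.0.1, 0.0.0.0 and ::1 as "blocked" is an assumption about how such block-lists are usually written.

```python
from typing import Set

# Addresses commonly used to sink unwanted hosts (an assumption).
SINK_ADDRESSES = ("127.0.0.1", "0.0.0.0", "::1")


def blocked_names(hosts_text: str) -> Set[str]:
    """Return host names that a hosts file maps to a sink address."""
    blocked: Set[str] = set()
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if address in SINK_ADDRESSES:
            blocked.update(name.rstrip(".").lower() for name in names)
    blocked.discard("localhost")  # the normal loopback entry is not a block
    return blocked


SAMPLE = """\
127.0.0.1 localhost
127.0.0.1 ads.example.com tracker.example.net  # block-list
192.0.2.10 build.example.org
"""

print(sorted(blocked_names(SAMPLE)))
```

To check a real machine, pass the contents of `/etc/hosts` instead of `SAMPLE`.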

You must have interesting firewall logs... (2, Funny)

RulerOf (975607) | more than 3 years ago | (#33704942)

look at my own /etc/hosts file. From time to time I manage to bite myself on the ass with my block-list

#Below is my custom DNS blocklist
127.0.0.1 om.nom.nom.

user@localhost:~$ ping om.nom.nom.

Re:Human Error (1)

froggymana (1896008) | more than 3 years ago | (#33704236)

People are always looking to blame someone else for their problems or someone else's. It's just human nature.

And somebody here actually cares??? (1)

AliasMarlowe (1042386) | more than 3 years ago | (#33704306)

What percentage of slashdotters actually noticed the facebook outage when it happened? As opposed to merely participating in the post-hoc commentary after they read about it. It should have been posted to slashdot's idle category.

I disagree (1, Informative)

Anonymous Coward | more than 3 years ago | (#33705076)

It is the most used website in the world (more userhours/month spent on Facebook than any other site), the fastest growing internet community (when measured in new users/month), etc... And as such it is an engineering masterpiece (in software engineering and probably in several other areas, too). When it goes down for several hours, it is a newsworthy event.

For us who work for advertising agencies, FB downtime is also a financially notable event.

Re:I disagree (0)

Anonymous Coward | more than 3 years ago | (#33705296)

So you don't know and instead went on some tangent about something else? I have no doubt you are a diehard facebook user.

Re:I disagree (2, Interesting)

Kvasio (127200) | more than 3 years ago | (#33705338)

Yet, you failed to notice that /. is a site for nerds.
Many nerds do not strive to cultivate their social skills.
Checking their friends status on social network might not be on top of their agendas.
So: event was notable, but not very important to many slashdotters.

Re:And somebody here actually cares??? (1)

ryanov (193048) | more than 3 years ago | (#33708482)

I noticed, and saw the DNS message when it was there. When I read this, I said to myself "umm, why did people blame DNS? That's what the message said!"

Re:And somebody here actually cares??? (0)

Anonymous Coward | more than 3 years ago | (#33711140)

I noticed when half of my Friends suddenly appeared on AIM in the middle of the day.

I think it was DNS (1)

mhh91 (1784516) | more than 3 years ago | (#33703574)

because chrome stopped at "resolving host"

Re:I think it was DNS (-1, Troll)

Anonymous Coward | more than 3 years ago | (#33703622)

Well, then you're a fucking idiot, because even TFS states that people incorrectly thought it was DNS because of the browser message, which was wrong. You also have no business on this site, because anyone who knows anything about DNS could clearly see that all the relevant domains were resolvable.

Re:I think it was DNS (1)

morgan_greywolf (835522) | more than 3 years ago | (#33703932)

Except that 'nslookup facebook.com', et al.. worked with no issues. RTFS.

Message saying DNS error (2, Interesting)

Anonymous Coward | more than 3 years ago | (#33703610)

It wasn't your browser having a DNS error, it was the user facing servers at Facebook reporting DNS problems talking to whoever they talk to. Maybe when they decided the way to fix the problem was to take down the site, they just removed the back end server cluster from their internal DNS.

Duh (5, Insightful)

vlm (69642) | more than 3 years ago | (#33703620)

So why was DNS blamed?

From http://www.facebook.com/note.php?note_id=431441338919&id=9445547199&ref=mf&_fb_noscript=1 [facebook.com]

The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site.

I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up... they could redirect to another site, Orkut or myspace would have been mildly humorous. I am mildly surprised they don't have a simple emergency box with a simple static "undergoing repair" page, but, whatever ...

So, other than zapping the A records and waiting, what are they supposed to do? Bonus points if they were doing DNS based load balancing and simply unplugged their (dns based) load balancer.

I have no dog in the fight, having deleted my facebook account months ago. It is kind of funny that a page of technobabble is described as "technical details" as if folks like us/me would find it to be a complete description rather than pretty vague. Then again we're dealing with farmville addicts and you can't reason with addicts.

Re:Duh (1)

kasperd (592156) | more than 3 years ago | (#33704174)

I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up...

Both approaches allow for a controllable ramp up given the right software on their servers. And I think with the typical off the shelf software neither of them allow for a controllable ramp up.

But did they even need a controllable ramp up of user requests? It sounded like the overloaded system was overloaded by internal requests, that were unrelated to the number of requests they got from end users.

My guess is they simply did what they found the easiest way to get their internal systems working again and not worrying about what errors users would see in the meantime.

Re:Duh (1)

vlm (69642) | more than 3 years ago | (#33709506)

But did they even need a controllable ramp up of user requests? It sounded like the overloaded system was overloaded by internal requests, that were unrelated to the number of requests they got from end users.

When you hear hoofs, think horses not zebras.

Seeing my servers spike to 100% CPU or 100% I/O and stay there, I'd look outside first before looking inside... So my first goal would be to act for a controllable ramp up of user requests. If the systems are so overloaded I can't troubleshoot at 100% of users, maybe I COULD log in and troubleshoot at 50 or 90 % load.

Also, I've worked at places that won't upgrade until outages due to high utilization are some large multiple of the cost of upgrading, this would be strong indication facebook works the same way.

Re:Duh (0)

Anonymous Coward | more than 3 years ago | (#33704836)

Without going into specifics, we use global load balancing infrastructure driven by dns to shift traffic between VIPs in realtime. Turn that to 0 with no fallback set and you get the described behavior. Painful but in this case, necessary.
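A minimal Python sketch of the behavior described above, assuming a DNS-driven balancer that picks a VIP in proportion to a configured weight. The function name, weights, and addresses are hypothetical and do not describe any particular product.

```python
import random
from typing import Dict, Optional


def pick_vip(weights: Dict[str, int], rng: random.Random) -> Optional[str]:
    """Pick a VIP by weight; return None when every weight is zero."""
    live = {vip: weight for vip, weight in weights.items() if weight > 0}
    if not live:
        # No fallback configured: the lookup simply has no answer to give.
        return None
    vips = list(live)
    return rng.choices(vips, weights=[live[v] for v in vips], k=1)[0]


rng = random.Random(0)
print(pick_vip({"192.0.2.1": 70, "192.0.2.2": 30}, rng))  # one of the two VIPs
print(pick_vip({"192.0.2.1": 0, "192.0.2.2": 0}, rng))    # None
```

Setting every weight to 0 drains all traffic at once, and from the client's side an empty answer looks like a name resolution failure.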

Re:Duh (2, Funny)

PaganRitual (551879) | more than 3 years ago | (#33707512)

This whole situation does explain why my mother appeared to be sick on the couch at my parents' place on Thursday afternoon when I paid them a visit. With all the shaking and huddling under the covers and looking pale-faced I presumed she had come down with the flu or something.

Then again we're dealing with farmville addicts and you can't reason with addicts.

They aren't addicts, that's patently unfair. They can stop any time they want. What is most admirable about them is that they are simply so time-savvy that they coincide those times at which they wish to stop with the periods during which their crops have to be left to grow. Once the crops are ready for harvest, they desire to play again. It's really very simple and implies no addiction whatsoever.

Seriously though, 2.5 hours? The experience I have with Farmville gives me vague recollection that there are a fair few crops that have a growth period of an hour or less, and given that the crops wither and become unusable in the same time they take to complete their growth makes me wonder how many people petitioned Zynga for free ... well, the game is free so technically (and literally) nothing of value was lost, but still, I'm sure they were crying about something.

Now shut-up, it's nearly 4:01 server time and my rogue still needs the Brewfest boss' dagger to drop for it. 5 times and all I've seen is the mace which I can buy for fuck all anyway. My warlock has had two daggers already; maybe it's payback for the Midsummer event when my rogue got the staff twice and my warlock never saw it. THIS IS SUCH BULLSHIT.

Ageism (5, Informative)

Vahokif (1292866) | more than 3 years ago | (#33703658)

The 27-year-old communications protocol

So? TCP/IP is 36 years old.

Re:Ageism (0)

Anonymous Coward | more than 3 years ago | (#33703702)

Yeah, but TCP/IP at least shows up for work

DNS however...

Re:Ageism (1, Funny)

Anonymous Coward | more than 3 years ago | (#33703716)

PILFS!

Re:Ageism (1)

BrokenHalo (565198) | more than 3 years ago | (#33704028)

Who are those? Programmers I'd like to fuck? Sorry, that doesn't compute.

Re:Ageism (0)

Anonymous Coward | more than 3 years ago | (#33704652)

Different AC here - I guess the "P" stands for "Protocols".

Re:Ageism (0)

Anonymous Coward | more than 3 years ago | (#33703802)

And broken in practice. We've had that new version for 12 years now, but not many have gotten around to implementing it yet.

Re:Ageism (4, Insightful)

morgan_greywolf (835522) | more than 3 years ago | (#33703974)

Really? DNS is broken? So typing say, http://slashdot.org/ [slashdot.org] doesn't work for you?

No. DNS has a few security issues, but they're mostly minor. The fact that DNS works for millions of people every day without issue at least 99% of the time proves that DNS is a successful design, even if it could use some security updating.

Re:Ageism (2, Interesting)

kasperd (592156) | more than 3 years ago | (#33704040)

I think that comment was referring to the fact that some recent announcement said there are now 5 billion devices on the internet, and IPv4 supports only up to 3.7 billion devices.
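One way to arrive at a figure near 3.7 billion is to start from the full 32-bit space and subtract the large blocks that cannot be used as public unicast addresses. The Python sketch below does that arithmetic; the choice of blocks is an assumption about how the number is usually derived, and the list is shorter than the full special-purpose registry.

```python
import ipaddress

# Assumed set of non-public blocks: "this network", private ranges,
# loopback, link-local, multicast, and the reserved 240/4 range.
NON_PUBLIC_BLOCKS = [
    "0.0.0.0/8",
    "10.0.0.0/8",
    "127.0.0.0/8",
    "169.254.0.0/16",
    "172.16.0.0/12",
    "192.168.0.0/16",
    "224.0.0.0/4",
    "240.0.0.0/4",
]


def usable_ipv4_addresses() -> int:
    """Total IPv4 addresses minus the assumed non-public blocks."""
    total = 2 ** 32
    reserved = sum(
        ipaddress.ip_network(block).num_addresses for block in NON_PUBLIC_BLOCKS
    )
    return total - reserved


print(usable_ipv4_addresses())
```

Under these assumptions the result is a little over 3.7 billion, out of roughly 4.29 billion addresses in total.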

Re:Ageism (2, Informative)

morgan_greywolf (835522) | more than 3 years ago | (#33704144)

What does IPv4 have to do with DNS? (hint: nothing. Modern DNS servers support IPv6)

Re:Ageism (0)

Anonymous Coward | more than 3 years ago | (#33704206)

What does IPv4 have to do with DNS?

Maybe you should answer that question. After all you were the one writing a comment about DNS in reply to a comment saying that IPv4 is broken.

Re:Ageism (0)

Anonymous Coward | more than 3 years ago | (#33704760)

The comment was about TCP/IP, not IPv4/6. He's right, get over it. If he meant IP he should've said IP.

Re:Ageism (0)

Anonymous Coward | more than 3 years ago | (#33705030)

Which still has nothing to do with DNS, never mind that what you said makes no sense since you can't pair TCPv6 with IPv4, you have TCP/IP(v4) and TCP/IPv6, only the first one of these is 36 years old.

Re:Ageism (1)

morgan_greywolf (835522) | more than 3 years ago | (#33706098)

1. TFA is about DNS.
2. There is no "TCPv6". There is TCP over IPV4 and TCP over IPv6. They are, however, the same TCP.
3. TCP/IP is also used as a broad term for the entire network stack. For example, DNS is an application-level protocol implemented on top of TCP and UDP over IP. But the entire thing is, loosely speaking, TCP/IP technologies.
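To make point 3 concrete, here is a small Python sketch that builds the bytes of a standard DNS question. The same message can be sent as a UDP datagram, or over TCP with a two-byte length prefix in front. The function names and the query ID are arbitrary choices for illustration, and no packet is actually sent.

```python
import struct


def build_query(name: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a DNS query message (qtype 1 = A record, class IN)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1,
    # then ANCOUNT, NSCOUNT, ARCOUNT all zero.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    return header + question


def tcp_frame(message: bytes) -> bytes:
    """Over TCP, the same message is preceded by a 2-byte length field."""
    return struct.pack("!H", len(message)) + message


query = build_query("slashdot.org")
print(len(query), query.hex())
print(tcp_frame(query)[:2].hex())
```

The message body is identical in both cases; only the transport underneath changes, which is why DNS keeps working regardless of whether IP is v4 or v6.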

Re:Ageism (0)

Anonymous Coward | more than 3 years ago | (#33706358)

Thanks for proving my point. The OP was talking about TCP, and some idiot started mentioning IPv6/DNS.

Re:Ageism (1)

definate (876684) | more than 3 years ago | (#33707132)

AAAAhhh since when has DNS supported IPv6? I call shenaaaanigans!

Re:Ageism (5, Insightful)

kasperd (592156) | more than 3 years ago | (#33704098)

Some people think technology should be replaced just because it is old. But really, it should be replaced if it doesn't suit our needs and there is a different technology that does suit it.

It is better to replace a 1 year old technology that does not suit our needs than to replace a 50 year old one that does. Usually when replacing, you want to replace with something newer. But in some cases it may turn out to be better to replace a new and misdesigned technology with an older and proven one.

That said, there are improvements to both IP and DNS which should be rolled out because they fix real problems. The rollouts are not happening as fast as they ought to, mainly because it is problematic to roll out a change to the entire Internet, especially when not everybody involved is cooperating.

But I don't think that really has anything to do with this outage.

Re:Ageism (1)

the_womble (580291) | more than 3 years ago | (#33704598)

The people who think that are

1) The people with patents on the new technology, or who are planning to sell stuff for it.
2) The people who have been convinced by the marketing budgets made possible by 1)

Re:Ageism (1)

A beautiful mind (821714) | more than 3 years ago | (#33704868)

I'll take the quality of design of IP or DNS over what passes on for "The Web" these days. The browser as a concept is bending towards its breaking point as it tries to cope with the fact it's treated as a clown car.

I guess it's historical legacy that we started with HTML and crap like that for browser interaction and everything sort of grew from there, but we're doing the whole "web as an applications platform" wrong.

Re:Ageism (0)

Anonymous Coward | more than 3 years ago | (#33709604)

"but we're doing the whole "web as an applications platform" wrong."

I agree there but this is a symptom of one click install being an impossibility in 2010.

People don't use the web as an app platform because it's better, easier to develop for, or provides control over the system.... It's one of the worst places to do your coding but you get one beautiful thing for all the trouble.

JoeBlow typing yourdumpcompany.com and instantly running an app without libraries, install media, or waiting......

Re:Ageism (2, Funny)

oldspewey (1303305) | more than 3 years ago | (#33704120)

So? TCP/IP is 36 years old.

Yeah, but it still lives in its parents' basement.

Re:Ageism (2, Insightful)

dlgeek (1065796) | more than 3 years ago | (#33704676)

And is definitely showing its age. There's been a big cry for years from those working at the really high end of networking that we need to replace (really just extend) TCP because it doesn't work well with high bandwidth-delay-product links. This is because the max window size and ramp-up algorithm (slow start) don't allow you to saturate the pipe quickly enough or even at all. There are several proposed extensions floating around to fix the problem but none of them have widespread adoption.
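The arithmetic behind that complaint is short enough to sketch in Python. The link speed and round-trip time below are example values chosen for illustration, and the 65,535-byte figure is the largest window the original 16-bit TCP window field can express without the window scaling option.

```python
CLASSIC_MAX_WINDOW = 65535  # bytes; 16-bit window field, no window scaling


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bps * rtt_s / 8


def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """At most one window can be delivered per round trip."""
    return window_bytes * 8 / rtt_s


# Example link (assumed values): 1 Gbit/s with a 100 ms round trip.
link_bps = 1e9
rtt = 0.1

print(bdp_bytes(link_bps, rtt))                      # bytes needed in flight
print(max_throughput_bps(CLASSIC_MAX_WINDOW, rtt))   # ceiling in bit/s
```

For that example link the pipe holds about 12.5 MB, while a 65,535-byte window caps the connection at roughly 5.2 Mbit/s, a small fraction of the available 1 Gbit/s.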

This actually is the case with a lot of our old networking protocols - yes, they were incredibly well designed at the time, but many are showing that they need to be upgraded to reflect modern technology. Back to our original case, the original DNS protocol does have a lot of problems that have surfaced lately (think about the sequence number prediction stuff from a couple years back) which inspired the roll-out of DNSSEC. IPv4 is hitting its limits, but we're having trouble rolling out IPv6. How much easier would fighting spam be if SMTP had a strong authentication system for sent messages? Even HTTP, which has undergone several revisions, is again showing limitations, hence Google rolling out SPDY which allows predictive pushes, stream parallelism, etc.

I don't think anyone seeks to criticize the designers of these protocols, and the protocols have excelled and scaled far, far beyond anyone's wildest expectations. That being said, they have been showing cracks lately as technology has grown, and nothing looks like it did back when they were written. However, we have hit a point where the difficulty in upgrading or replacing them is actually starting to hold us back.

PEP (1)

HBI (604924) | more than 3 years ago | (#33705324)

There is a whole market devoted to handling high delay TCP connections. It works. It's what I do. Well, part of it.

Replacing the protocol for this reason would be kind of lame.

Strong authentication of senders: 2 drawbacks (1)

tepples (727027) | more than 3 years ago | (#33714596)

How much easier would fighting spam be if SMTP had a strong authentication system for sent messages?

There is one, called OpenPGP. There is another one, called S/MIME. Implementation of these in real-world MUAs awaits a decision on best practices for how strong the authentication needs to be. Stronger authentication has two downsides. First, the cost of obtaining a digital ID goes up with strength; even with the OpenPGP web of trust, travel to a key signing party hundreds of km away is not free. Second, requiring strong digital ID makes it difficult for someone living under a government that suppresses speech to express politically unpopular ideas.

Re:Strong authentication of senders: 2 drawbacks (1)

jgrahn (181062) | more than 3 years ago | (#33716196)

How much easier would fighting spam be if SMTP had a strong authentication system for sent messages?

There is one, called OpenPGP. There is another one, called S/MIME. Implementation of these in real-world MUAs awaits a decision on best practices for how strong the authentication needs to be. Stronger authentication has two downsides. First, the cost of obtaining a digital ID goes up with strength; even with the OpenPGP web of trust, travel to a key signing party hundreds of km away is not free.

You wouldn't have to go to such extremes though. If everyone had as a personal policy "only read OpenPGP-signed mail, and distrust mail signed with a key I haven't personally downloaded from a key server", spam and mail worms would be less of a problem.

(Not that it will ever happen.)

Spam solutions copypasta (1)

tepples (727027) | more than 3 years ago | (#33718616)

If everyone had as a personal policy "only read OpenPGP-signed mail, and distrust mail signed with a key I haven't personally downloaded from a key server"

Then it would still fall under the "Requires immediate total cooperation from everybody at once" line of the well-known copypasta [craphound.com] , and possibly "Mailing lists and other legitimate email uses would be affected" and "Many email users cannot afford to lose business or alienate potential employers" depending on how it is implemented.

Re:Ageism (1)

bill_mcgonigle (4333) | more than 3 years ago | (#33708312)

So? TCP/IP is 36 years old.

And can't even cope with lossy network connections (i.e. mobile).

Re:Ageism (1)

godefroi (52421) | more than 3 years ago | (#33710776)

On the contrary; it copes very well with lossy network connections. The real problem is YOU and your insistence on receiving everything that was sent, and in the correct order even.

If you were willing to see half-pages and miss images, then UDP would be a splendid protocol for you, and you wouldn't have to wait for timeouts and retransmissions.

Re:Ageism (1)

bill_mcgonigle (4333) | more than 3 years ago | (#33711486)

On the contrary; it copes very well with lossy network connections

I suspect this is going for Funny, but just in case: the basic problem is that TCP congestion control sees a lossy network as busy and backs off on transmission speed.
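The back-off behavior can be illustrated with a toy additive-increase, multiplicative-decrease loop in Python. This is a deliberately simplified sketch, not any real TCP stack: it leaves out slow start, timeouts, and fast recovery, and the function name and parameters are made up.

```python
from typing import Iterable, List


def aimd(loss_events: Iterable[bool], cwnd: float = 1.0,
         increase: float = 1.0, decrease: float = 0.5) -> List[float]:
    """Return the congestion window after each round trip.

    A True entry means a loss was seen in that round trip. The sender
    cannot tell radio loss from congestion, so it backs off either way.
    """
    history: List[float] = []
    for lost in loss_events:
        if lost:
            cwnd = max(1.0, cwnd * decrease)  # multiplicative decrease
        else:
            cwnd += increase                  # additive increase
        history.append(cwnd)
    return history


print(aimd([False, False, False, True, False]))
```

On a link that drops packets for reasons unrelated to load, the window keeps getting cut before it can grow, which is the mismatch with lossy mobile links described above.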

It's an open research topic, and currently handled in L2 on mobile networks since TCP can't cope.

DNS? Huh? (1)

Animats (122034) | more than 3 years ago | (#33703710)

Terrible article. What "DNS error"? Is Facebook running its own DNS servers that do something funny, or what?

As for DNS "moving to the cloud", DNS is already far more distributed than any of the "cloud" systems. Which is a good thing.

Re:DNS? Huh? (0)

Anonymous Coward | more than 3 years ago | (#33704702)

'Fake sidebar"

Lookup whoosh in the dictionary. That's not a mirror.

Couldn't be a DNS outage. (1)

MrCrassic (994046) | more than 3 years ago | (#33703748)

Yeah...from reading that Facebook note, it's pretty clear that DNS had nothing to do with the outage. Do you guys think the outage would've been better or worse had it been one?

Not mission critical! (2, Insightful)

j_col (1895476) | more than 3 years ago | (#33703792)

I found the genuine panic from many Facebook users to this outage very amusing.

Re:Not mission critical! (1)

Anonymous Coward | more than 3 years ago | (#33703862)

I found the genuine panic from many Facebook users to this outage very amusing.

I, too, laugh at the misfortune of others.

Re:Not mission critical! (0)

Anonymous Coward | more than 3 years ago | (#33704050)

Misfortune?? This is a blessing!!

Re:Not mission critical! (1)

Skylinux (942824) | more than 3 years ago | (#33709540)

Des einen Leid ist des anderen Freud. -- One man's meat is another man's poison.

Re:Not mission critical! (1, Flamebait)

kiwimate (458274) | more than 3 years ago | (#33704806)

I suppose if I were an angst-ridden bitter friendless teenager I may have found it amusing too. Luckily, I'm an adult. (How sad that this comment is currently marked insightful.)

And - really? Genuine panic? I think that says more about the specific subset of Facebook users within your anecdote set than anything else. Or do you also extrapolate out from the frequent racist troll comments on Slashdot?

Re:Not mission critical! (1)

definate (876684) | more than 3 years ago | (#33707150)

LOL Yeah it was hilarious [publicradio.org] when people [dailymail.co.uk] were complaining about being unable to get on Facebook. So funny that people need services to keep in contact with others, it's like why don't you just talk to them in person? I mean like, HELLO, am I the only one getting this? Geeze. If it's so important to you then you should be more redundant with your services, like, everyone knows that!

So what? Big Whoop! (1, Flamebait)

WarpedCore (1255156) | more than 3 years ago | (#33703810)

It's an advertisement platform that rides solely on the ignorance of its users. So people had two and a half hours to take a break from their narcissism... this is something worthy of finger pointing?

Re:So what? Big Whoop! (1)

Anonymous Coward | more than 3 years ago | (#33703966)

high and mighty non-trend followers are pretty trendy, just saying...

Re:So what? Big Whoop! (0)

Anonymous Coward | more than 3 years ago | (#33704026)

I'm way cooler because I participate in the trend, but only at a level that would barely be called surface. Once again my lack of dedication and inability to commit have paid off by inserting myself into an elitist class that hovers above all else. Unfortunately due to said attributes I am unwilling and unable to use this to any social advantage.

Re:So what? Big Whoop! (1)

_Shad0w_ (127912) | more than 3 years ago | (#33703986)

There are adverts on there?

Re:So what? Big Whoop! (2, Funny)

Sir_Lewk (967686) | more than 3 years ago | (#33704350)

No. Facebook doesn't do data-mining, and they don't serve ads. They simply pull money out of their ass.

Re:So what? Big Whoop! (2, Insightful)

kiwimate (458274) | more than 3 years ago | (#33704860)

So is Slashdot.

I don't know that finger pointing is necessarily healthy - that tends to suggest CYA and childish blame games. But on a technical IT focused web site, one might suppose that a lessons learned exercise on the root cause of the failure of a massive website would be of interest and hopefully even an educational experience.

Re:So what? Big Whoop! (1)

Skylinux (942824) | more than 3 years ago | (#33709564)

But on a technical IT focused web site, one might suppose that a lessons learned exercise on the root cause of the failure of a massive website would be of interest and hopefully even an educational experience.

I can't remember ever seeing an article about a major outage from some big website where they delivered enough information that one could learn from it.
So what did you learn from this article? Don't fuck up when you are an admin or maybe to create a better error page.
Yes very informative and of interest to nerds, indeed.

OT: I don't care. (0)

Anonymous Coward | more than 3 years ago | (#33703848)

Sorry for posting Off-Topic, but is there a way to hide all stories tagged "facebook"?

FFS (0)

Anonymous Coward | more than 3 years ago | (#33703886)

Who cares? Enough about the facebook outage already.

They get no monies from me! (2, Interesting)

chucklebutte (921447) | more than 3 years ago | (#33703934)

With all the adblocking software, no scripting, and other misc Firefox plugins I have no ads from facebook, let alone any other page. Firefox is set to delete cookies and BS on exit and I keep my machine clean with bleachbit.

Facebook allows me to connect with lots of people I never would see, like buddies in the Army based in Japan for instance or friends in New York etc, etc.

I hide all the annoying spam ads for people's stupid farms, I have convinced many of my Facebook friends and family to stop playing those games and to take similar precautions.

Every company is evil, hell even my Linux distro has political agendas, ( damn Mint Linux!!! ) but what does it really matter? It is the cost of using technology. Until we change our idea about advancing ourselves and not our pocket books nothing will change, so stop with the Q_Q and pew pew, and just deal with it and be smart about what you do.

Re:They get no monies from me! (1)

BrokenHalo (565198) | more than 3 years ago | (#33704072)

hell even my Linux distro has political agendas, ( damn Mint Linux!!! )

I had never heard of them (I'm an old Slackware hand, and more recently Arch), but Mint's webpage is so incredibly slow to load, it's impossible to see what that agenda is. It doesn't inspire much confidence in them. :-|

Re:They get no monies from me! (1, Informative)

Anonymous Coward | more than 3 years ago | (#33704720)

You won't find it on the home page. It was a post by a developer on the dev blog. He later removed it and apparently moved it to his personal blog.

Palestine Written by Clem on Sunday, May 3rd, 2009 @ 12:34 am | Main Topics

This is not the place to talk about this but I am deeply touched by what is happening over there. I feel disgust and guilt with us passively witnessing it and our money and weapons supporting it. I don't want to use my name or this project to push my own ideas about this but I spend a lot of time working and giving away, sharing and receiving to and from a lot of people.

I'm only going to ask for one thing here. If you do not agree I kindly ask you not to use Linux Mint and not to donate money to it.

I hope for these people to be able to live decently in the future and for me not to have anything to do with the misery they're in at the moment.

I promise not to talk about this anymore. I don't want any money or help coming from Israel or people who support the action of their current government.

Thank you for your understanding. This is very important to me.

Where's the beef? (0)

Anonymous Coward | more than 3 years ago | (#33703968)

No explanation for DNS is offered.

My guess is Facebook runs its own DNS servers and they were swamped by a DDoS of Facebook's own making.

There WAS some DNS issues too ! (3, Informative)

ivan_w (1115485) | more than 3 years ago | (#33704004)

The confusion might have come from the fact that when I looked, there seemed to also be some DNS problem.

Basically, when queried directly, the servers that are authoritative for the zone were giving me a CNAME in response to the 'ANY' query, but not the associated A records, which they should have included, since the CNAME pointed to a host name within the same authority. At this point, any sensible resolver stops asking!

This only lasted for a little while though, so it might have been a glitch, or possibly a deliberate action related to how they were trying to fix the underlying issue, perhaps diverting traffic until they had solved the actual problem.

--Ivan

Re:There WAS some DNS issues too ! (1)

Skapare (16644) | more than 3 years ago | (#33720902)

This kind of thing can happen when records are being changed (say from A to CNAME) and the A record has not yet expired from your cache. Did you do a "dig +trace" around the cache to verify?

no (0)

Anonymous Coward | more than 3 years ago | (#33704084)

The summary's wrong. The problem was caused by one looping server, hence making it a DOS, not a DDOS.

Anonymous Coward. (0)

Anonymous Coward | more than 3 years ago | (#33704156)

Yet m.facebook.com worked the whole time this was going on.

Why DNS? (1)

darth dickinson (169021) | more than 3 years ago | (#33704238)

Cause browsers is stoopid!

It wasn't DNS. (0, Redundant)

meerling (1487879) | more than 3 years ago | (#33704246)

That was obvious: it showed symptoms of a DDoS attack, not a DNS problem. I find it funny that it was caused by their own error.

Another failure waiting to happen (1)

Skapare (16644) | more than 3 years ago | (#33704300)

To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.

Even when the database has a valid value, if failures to get a value from the database can create a growing cascade of errors, then this design is still poised for a future failure for simple things like a partial outage of databases or network access to them. Ideally, once the data was valid, the number of clients not getting a valid value should gradually decrease as more and more get valid values and don't have to requery. But if the scale was such that none could get anything when all were trying (and hence require a shutdown and slow start), it can all happen again with many other classes of failure. It can even happen with a transient error, if the transient is long enough to trip a certain threshold of clients.

So leaving that configuration correction system off for now makes sense. I would suggest a combination of a push system (originate at the database and push new values out ... but watch for security) and a randomizing of delay times inserted for the data pulls.
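The randomized-delay half of that suggestion can be sketched roughly as follows. This is only an illustration, not anything from Facebook's actual system: the names `jittered_delay` and `pull_with_jitter`, the `fetch` callable, and the `base` and `cap` defaults are all made up for the example. The idea is that each client waits a random time below an exponentially growing bound, so retries spread out instead of arriving in lockstep.

```python
import random
import time


def jittered_delay(attempt, base=0.5, cap=60.0, rng=random):
    """Return a random wait in seconds before retry number `attempt` (0-based).

    The upper bound doubles with each attempt, up to `cap`. The wait is
    drawn uniformly from [0, bound] so that clients do not retry together.
    """
    bound = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, bound)


def pull_with_jitter(fetch, max_attempts=5, sleep=time.sleep, rng=random):
    """Call fetch() until it succeeds, sleeping a jittered delay between tries.

    `fetch` is a hypothetical callable that returns the value on success
    and raises an exception on failure. `max_attempts` must be at least 1.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            # Only wait if another attempt is still coming.
            if attempt + 1 < max_attempts:
                sleep(jittered_delay(attempt, rng=rng))
    raise last_error
```

Passing `sleep` and `rng` in as parameters is just a convenience so the behaviour can be checked without real waiting.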

Take down any company's entire network access ... (1)

Skapare (16644) | more than 3 years ago | (#33704328)

... and browsers will think their DNS is dead ... because, well, it is ... and is the first thing a browser needs to access.

Did Facebook have an internal DNS failure? (1)

Animats (122034) | more than 3 years ago | (#33704340)

The "error page" is clearly a Facebook server reporting a DNS failure within Facebook's own network. Facebook requests are processed by user-facing servers which make RPC calls (not HTTP) into Facebook's internal network. Machines in multiple locations may be involved in generating a single Facebook page. If their in-house DNS system for organizing their internal network failed, they might produce messages like that.

Re:Did Facebook have an internal DNS failure? (3, Informative)

rekoil (168689) | more than 3 years ago | (#33704420)

It didn't fail, they turned it off. This was the easiest way to "shut off the entire site" as their post-mortem describes. The DNS errors users saw were being generated by the front-end HTTP proxies, not by client browsers, which caused most of this confusion. Once the database issue cleared, they reactivated the DNS entries for the back-end servers one cluster at a time and the site came back.

Re:Did Facebook have an internal DNS failure? (1)

Skapare (16644) | more than 3 years ago | (#33706220)

You seem informed. Maybe you can explain why it is that clients would not be picking up the corrected info and reducing their "attack" on the database servers (more so than everything being turned off and back on).

Re:Did Facebook have an internal DNS failure? (1)

rekoil (168689) | more than 3 years ago | (#33791724)

This is explained in the post-mortem. Basically, the problem was that clients were reacting to corrupt data being served up by the origin DB cluster the same way that they reacted to bad data coming from the memcached cluster - by deleting the offending entry in memcached and re-sending the query to the origin DB. So a client queried the origin, got bad data, and then deleted the key from memcached - resulting in every other client (tens of thousands of them, most likely) then querying the cluster for the same key* at the same time. Instant meltage ensued.

Now think about what happens when you have tens of thousands of boxes all querying the same cluster for the same keys all at the same time. Some clients will get the answer, but others will get an invalid response back from a melting mysql box. And when that happens, what does the client do? Exactly what started the mess in the first place - it *deletes the key from memcached*. So if any other clients were happily using the cached copy of the key data, they aren't anymore...and back to the origin they go. Lather, rinse, repeat until someone hits the Big Red Button and restarts the whole shebang in an ordered fashion (i.e. only re-activating a few racks at a time).

* More likely, many keys were corrupted on the origin. A single key would only impact one memcached instance and most likely only one mysql server (read about consistent hashing for more detail) and not cause this level of chaos.
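To make the feedback loop concrete, here is a rough sketch of a client that avoids it by never deleting a cache key when the origin query fails or returns bad data. It is not Facebook's code: `CacheClient`, `OriginError`, `query_origin` and `is_valid` are hypothetical names, and a plain dict stands in for memcached.

```python
class OriginError(Exception):
    """Hypothetical error raised when the origin DB fails to answer."""


class CacheClient:
    def __init__(self, cache, query_origin, is_valid):
        self.cache = cache                # dict standing in for memcached
        self.query_origin = query_origin  # callable: key -> value, may raise OriginError
        self.is_valid = is_valid          # callable: value -> bool

    def _cached_if_valid(self, key):
        """Return the cached value if it is present and valid, else None."""
        value = self.cache.get(key)
        if value is not None and self.is_valid(value):
            return value
        return None

    def get(self, key):
        value = self._cached_if_valid(key)
        if value is not None:
            return value
        try:
            fresh = self.query_origin(key)
        except OriginError:
            # A failed query says nothing about the key, so the cache is
            # left untouched. Another client may have repopulated it in
            # the meantime, so look once more before giving up.
            return self._cached_if_valid(key)
        if self.is_valid(fresh):
            self.cache[key] = fresh
            return fresh
        # The origin returned bad data: report a miss, but do not delete
        # the key and push every other client back to the origin.
        return None
```

The point of the sketch is only the missing `del`: an error from an overloaded database is treated as "no information", so clients that already have a good cached copy keep using it and the load on the origin can actually fall.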

Clumsy title (1)

Haiyadragon (770036) | more than 3 years ago | (#33704682)

Browsers are fucking software. They don't blame anything for anything.

Facebook was down. That's the only thing that matters to most people.

DDOS? (0)

Anonymous Coward | more than 3 years ago | (#33705000)

Wouldn't this just be DOS or have the two separate terms become synonymous?

dns (1)

xander19 (1717110) | more than 3 years ago | (#33705216)

It might be a buzz acronym on Twitter anyway, because it also means DNA in German.

The REAL Reason Facebook went down! (0)

Anonymous Coward | more than 3 years ago | (#33706726)

Someone digging in Farmville didn't call before digging and cut an underground cable.
