Wikipedia Explains Today's Global Outage

timothy posted about 4 years ago | from the citation-you-requested dept.

Wikipedia 153

gnujoshua writes "The Wikimedia Tech Blog has a post explaining why many users were unable to reach Wikimedia sites due to DNS resolution failure. The article states, 'Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.'"

153 comments

DNA? Really? (0)

Anonymous Coward | about 4 years ago | (#31601168)

How 'bout proofreading titles

Re:DNA? Really? (0)

Anonymous Coward | about 4 years ago | (#31601296)

They're too busy downmodding nigger jokes.

Wow! (1, Funny)

Anonymous Coward | about 4 years ago | (#31601172)

DNA resolution failure

Re:Wow! (2)

suso (153703) | about 4 years ago | (#31601272)

Maybe the reverse DNA wasn't set right.

Re:Wow! (1)

monkeySauce (562927) | about 4 years ago | (#31601604)

No, the RNA is just fine. The real problem is a DNA CNAME pointing to an A molecule that isn't resolving.

Re:Wow! (0)

Anonymous Coward | about 4 years ago | (#31602080)

Nice try, but I think you stretched that one a little too far. And besides, pointing a CNAME to an A record is ok, it's pointing a CNAME to a CNAME that is forbidden.

Re:Wow! (1)

DA-MAN (17442) | about 4 years ago | (#31602900)

Nah, it's standard practice.

$ host download.microsoft.com
download.microsoft.com is an alias for download.microsoft.com.nsatc.net.
download.microsoft.com.nsatc.net is an alias for mscom-dlc.vo.llnwd.net.
mscom-dlc.vo.llnwd.net has address 208.111.161.113
mscom-dlc.vo.llnwd.net has address 208.111.161.89
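
To make the chain in that `host` output concrete, here is a minimal sketch of how a resolver follows aliases until it reaches an address record. The `RECORDS` table and the `resolve` helper are hypothetical, built only from the output quoted above; it does no real DNS queries.

```python
# Hypothetical in-memory zone data, copied from the `host` output above.
RECORDS = {
    "download.microsoft.com": ("CNAME", ["download.microsoft.com.nsatc.net"]),
    "download.microsoft.com.nsatc.net": ("CNAME", ["mscom-dlc.vo.llnwd.net"]),
    "mscom-dlc.vo.llnwd.net": ("A", ["208.111.161.113", "208.111.161.89"]),
}


def resolve(name, records=RECORDS, max_hops=8):
    """Follow CNAME aliases until an A record is found.

    Returns (chain_of_names, addresses). The hop limit guards against loops.
    """
    chain = [name]
    for _ in range(max_hops):
        rtype, values = records[name]
        if rtype == "A":
            return chain, values
        # A CNAME has exactly one target; continue from there.
        name = values[0]
        chain.append(name)
    raise RuntimeError("CNAME chain too long or looping")


chain, addresses = resolve("download.microsoft.com")
```

The chain has two alias hops before the A records, which is the CNAME-to-CNAME case being argued about.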

DNA DNS? (2, Funny)

Anonymous Coward | about 4 years ago | (#31601204)

I could see why the failover didn't work... They should try resolving names instead of nucleic acids. :\

Oh teh noes! (0)

Anonymous Coward | about 4 years ago | (#31601206)

It's the T-virus, run!

Test, and Test Again (3, Insightful)

Jazz-Masta (240659) | about 4 years ago | (#31601228)

However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally.

Good thing Wikimedia pays their System Administrators well enough to test their backup systems.

Re:Test, and Test Again (3, Insightful)

X0563511 (793323) | about 4 years ago | (#31601536)

I know people who work in the Florida DC. They do, and they are smart people. Don't assume incompetence.

Re:Test, and Test Again (2, Informative)

Jazz-Masta (240659) | about 4 years ago | (#31601772)

I actually wasn't assuming incompetence. The hallmark of many SysAdmins is being understaffed, overworked and underpaid, and thus they do not have the resources to properly test all backup and redundant systems.

As consultants and contractors in the area of System Administration, you get let go if anything like this were ever to happen. This is why they charge a little bit more.

Whatever happened, it failed. A good lesson for next time. I don't know the exact cause, but it is safe to say there were too many eggs in one basket. Multiple geographically diverse load-balanced DNS servers? Why was there an overheating problem in the first place? Only one air conditioner?

Wikipedia has had a few failures, not all their fault. In 2006 Cogent pulled a block of IP addresses that were leased to Wikipedia.

Re:Test, and Test Again (1, Insightful)

Anonymous Coward | about 4 years ago | (#31602230)

I know people who work in the Florida DC. They do, and they are smart people. Don't assume incompetence.

I'm going to assume incompetence. The only question is whose incompetence: the admins, or the folks higher up the food chain who didn't give them the resources they needed. But I have no doubt somebody was incompetent somewhere, how else do you explain the failure? Can you answer that instead of telling people what to think?

Re:Test, and Test Again (2, Informative)

geniice (1336589) | about 4 years ago | (#31602328)

Wikipedia has a fairly limited budget and has historically accepted the odd few hours of downtime now and again as the natural result of this. The number of such incidents has fallen over the years though.

Re:Test, and Test Again (0)

Anonymous Coward | about 4 years ago | (#31602832)

I'm going to assume incompetence. The only question is whose incompetence: the admins, or the folks higher up the food chain who didn't give them the resources they needed. But I have no doubt somebody was incompetent somewhere, how else do you explain the failure? Can you answer that instead of telling people what to think?

Wow. For someone who probably uses the service and doesn't pay for it, you're sure griping a lot. They don't serve ads (well, except to solicit funds to keep their servers up and running), they rely on private donations (which sets them apart from even public television, which gets some of its funding from the government).

When you pay for an SLA with Wikipedia (signed by someone with the authority to make such an agreement) then you have the right to throw rude accusations around.

Re:Test, and Test Again (2, Insightful)

cryfreedomlove (929828) | about 4 years ago | (#31601644)

You say test and test again. I say that this is true only when the cost of an outage outweighs the cost of testing. What does this one hour, once per year really cost wikipedia?

Re:Test, and Test Again (2, Insightful)

Dahamma (304068) | about 4 years ago | (#31601750)

True, and the cost was probably fairly minor, as they are not advertising based... so only the cost of any people so pissed off with the downtime that they refuse to donate :)

Re:Test, and Test Again (4, Interesting)

geniice (1336589) | about 4 years ago | (#31602088)

Going by past statistics the cost of downtime to wikipedia tends to be negative since donations rise. Not that this is something wikimedia aims to do.

Run both systems live at half capacity (2, Interesting)

Colin Smith (2679) | about 4 years ago | (#31602018)

active/passive systems are a pain in the arse. The whole concept of testing failover in an active/passive situation is wrong. Anything which relies on human beings doing this and that and that and that is a bad solution.

Just run active/active and load balance over both sites. If one fails its tests, you just pull it.
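
A minimal sketch of that active/active idea, assuming a simple round-robin policy and a stand-in health check. The `Site` and `ActiveActiveBalancer` classes and the site names are hypothetical illustrations, not anything Wikimedia is known to run.

```python
import itertools


class Site:
    """One data center; `healthy` stands in for a real probe result."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def passes_tests(self):
        # Stand-in for a continuously running check (HTTP probe, DNS query, ...).
        return self.healthy


class ActiveActiveBalancer:
    """Round-robin over every site that currently passes its tests."""

    def __init__(self, sites):
        self.sites = list(sites)
        self._counter = itertools.count()

    def live_sites(self):
        return [s for s in self.sites if s.passes_tests()]

    def route(self):
        live = self.live_sites()
        if not live:
            raise RuntimeError("no healthy site available")
        return live[next(self._counter) % len(live)]


europe = Site("europe")
florida = Site("florida")
lb = ActiveActiveBalancer([europe, florida])

first_two = [lb.route().name for _ in range(2)]      # both sites take traffic
europe.healthy = False                               # e.g. servers shut down from overheating
after_failure = [lb.route().name for _ in range(3)]  # failed site is pulled automatically
```

Because both sites carry traffic all the time, there is no separate failover path that can rot unnoticed; the trade-off is that every piece of state has to cope with two writers.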

 

Re:Run both systems live at half capacity (1)

rmm4pi8 (680224) | about 4 years ago | (#31602822)

For systems that can be stateless, this is always the best approach. master-master replication with conflict resolution isn't always that easy, however, especially when you think about something like the way wikipedia edits can potentially interact. So developing a conflict resolution scheme can be extraordinarily expensive, and MySQL isn't the most stable in multi-master anyway. Thus while you're right in principle, the expense can be prohibitive.

Rumor was.. (1)

cybrthng (22291) | about 4 years ago | (#31601236)

Some government pencil pusher mixed up wikileaks with wikipedia... after all the "strange tweets" from @wikileaks it sounded feasible ;)

Re:Rumor was.. (0)

Anonymous Coward | about 4 years ago | (#31601366)

Wikileaks is part of wikimedia, so it went down too (along with wikinews, wikispecies, etc.).

Re:Rumor was.. (2, Informative)

Anonymous Coward | about 4 years ago | (#31601704)

Wikileaks is part of wikimedia, so it went down too (along with wikinews, wikispecies, etc.).

Wikileaks is certainly NOT part of Wikimedia. You can see such at http://wikimediafoundation.org/wiki/Our_projects

Do we accept this... (4, Funny)

Al's Hat (1765456) | about 4 years ago | (#31601242)

...as proof of global warming?

Re:Do we accept this... (1)

alta (1263) | about 4 years ago | (#31601276)

+1 So True...

Re:Do we accept this... (0, Troll)

afabbro (33948) | about 4 years ago | (#31602006)

Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.

Wow - is that the worst sig on Slashdot or what?

  • "Do not meddle in the affairs of sysadmins, for they are subtle..." The sysadmins or their affairs? And why would subtle affairs (or subtle sysadmins) be either significant or threatening? Oh my God, he's...he's...SUBTLE! RUN!
  • "..and quick to anger." Angry affairs? Angry sysadmins?
  • And to top it off, your URL points to a slimy affiliate site that promotes "Free Advertising System" and "Copy the Super Affiliates".

Re:Do we accept this... (0)

Anonymous Coward | about 4 years ago | (#31602176)

Troll, or woefully ignorant of Tolkien? I can't tell.

Re:Do we accept this... (0, Offtopic)

jeffasselin (566598) | about 4 years ago | (#31602352)

Either you should read more or you have some serious linguistic credentials.

It's a reference to a quote in The Lord of the Rings by JRR Tolkien, something said to Frodo Baggins by Gildor Inglorion in The Fellowship of the Ring (tome 1) in chapter 3 "Three is Company":

"Do not meddle in the affairs of wizards, for they are subtle and quick to anger."

The quote is of course in reference to the wizards of Middle-Earth. The user sig which you tried to disparage is an attempt to make an analogy between sysadmins and wizards.

My problem here is that you weren't attacking the analogy, but the syntax of the sentence. A sentence crafted by JRR Tolkien, one of the most well-known scholars of the English language, and one who was named "Author of the Century" for the last century. I think he knew English syntax better than you do. Unless of course you're a well-known, published writer who has studied the English language extensively and you have the diplomas to prove it on the wall in your office.

Re:Do we accept this... (0)

Anonymous Coward | about 4 years ago | (#31602726)

^Try not failing English 101. Sysadmins are the subject of the entire sentence. Thus, sysadmins are subtle, and quick to anger. HTH

welcome to the cloud (0)

Anonymous Coward | about 4 years ago | (#31601268)

With both stormy and sunny days.

Good choice on the article you're linking to... (0)

Anonymous Coward | about 4 years ago | (#31601274)

I don't know which is more awesome - that this article came up just as I was wondering what happened to Wikipedia, or that the post links to an article which I CAN'T READ BECAUSE WIKIPEDIA IS DOWN.

Didn't know DNA could cause an outage (0, Redundant)

jmdevince (1175647) | about 4 years ago | (#31601280)

When did we start using DNA to resolve domain names? I mean we can fit a butt-load of information in a DNA strand but I think the overhead would be too high for DNS resolutions. (Should be DNS)

Re:Didn't know DNA could cause an outage (1)

wizardforce (1005805) | about 4 years ago | (#31601386)

Just to give an idea of just how vast DNA's information storage is, the average human cell contains about as much information as most DVDs can store. So hypothetically, if you could reliably transport DNA like we do electrons on the internet, the bandwidth would be enormous (1 gram DNA can store ~10^21 bits) although lag might be a problem unless you can route these DNA packets at relativistic velocities.

Re:Didn't know DNA could cause an outage (1)

Andy Dodd (701) | about 4 years ago | (#31601542)

I'm assuming that material containing large amounts of DNA gummed up a cooling fan, causing the overheating. :)

Re:Didn't know DNA could cause an outage (0)

Anonymous Coward | about 4 years ago | (#31602902)

Hey, they really love their servers!

DNA resolution failure??? (0)

Anonymous Coward | about 4 years ago | (#31601288)

>>due to DNA resolution failure. ...Also known as mutation...or X-men

wait what my dna is being resolved? (0)

Anonymous Coward | about 4 years ago | (#31601292)

cant we simply hack that by modifying gens.conf?

rndc flush (2, Funny)

ls671 (1122017) | about 4 years ago | (#31601300)

I noticed wikipedia wasn't resolving this morning.

Flushing my "DNA" cache fixed it ;-))

rndc flush

Re:rndc flush (1)

ls671 (1122017) | about 4 years ago | (#31601410)

I will add that this is a good thing this article was posted. It caused me to stop investigating the possibilities of somebody hacking into my "DNA". ;-))

Re:rndc flush (1)

Sigma 7 (266129) | about 4 years ago | (#31601562)

Flushing my "DNA" cache fixed it ;-))

Not for everyone, since some ISPs cache DNS lookup results.
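
A rough sketch of why a local flush is not always enough, assuming two stacked caches with fixed TTLs. The `TTLCache` class, the "SERVFAIL" placeholder, the TTL values and the 192.0.2.1 documentation address are all hypothetical; real resolvers honor per-record TTLs and handle failures differently.

```python
class TTLCache:
    """A toy caching resolver: remembers answers from `upstream` for `ttl` seconds."""

    def __init__(self, upstream, ttl_seconds):
        self.upstream = upstream  # callable: (name, now) -> answer
        self.ttl = ttl_seconds
        self._entries = {}        # name -> (answer, expires_at)

    def lookup(self, name, now):
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0]       # still fresh, serve from cache
        answer = self.upstream(name, now)
        self._entries[name] = (answer, now + self.ttl)
        return answer

    def flush(self):
        self._entries.clear()


# Authoritative data starts out broken, as during the failed failover.
authoritative = {"wikipedia.org": "SERVFAIL"}


def auth_lookup(name, now):
    return authoritative[name]


isp = TTLCache(auth_lookup, ttl_seconds=3600)   # the ISP's resolver
local = TTLCache(isp.lookup, ttl_seconds=300)   # your own machine

during_outage = local.lookup("wikipedia.org", now=0)
authoritative["wikipedia.org"] = "192.0.2.1"    # operators fix the zone
local.flush()                                   # "rndc flush" on your side
after_flush = local.lookup("wikipedia.org", now=600)
after_expiry = local.lookup("wikipedia.org", now=3600)
```

After the flush the local cache simply refills from the ISP's stale entry; only when that upstream entry expires does the corrected answer come through.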

Re:rndc flush (1)

ls671 (1122017) | about 4 years ago | (#31601678)

> Not for everyone, since some ISPs cache DNS lookup results.

It should have been obvious that you needed admin access to your own "DNA" in order for this fix to work... ;-))

Also your ISP must not intercept your "DNA" queries (redirecting deoxyribonucleic acid #53 to their own DNA)

Hour Delay (5, Funny)

Reason58 (775044) | about 4 years ago | (#31601322)

This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.

If you don't want to wait an hour for it to update, you can open a command prompt and type "ipconfig /flushdna".

Please be warned that this may also revert you to some sort of single-celled organism.

Re:Hour Delay (1)

mrdogi (82975) | about 4 years ago | (#31602118)

OK, I'm somewhat worried now. I was going to make a snarky comment on how I can't seem to find the ipconfig command on my Mac, but it *actually* has one! Mac is following Windows?!?

At least I'm still safe with not having one on my Solaris boxen...

Overheating (0)

Anonymous Coward | about 4 years ago | (#31601352)

Obviously this was caused by Global Warming.

Oops (3, Insightful)

girlintraining (1395911) | about 4 years ago | (#31601426)

You see guys, this is why you regularly test your backup plans and failovers. This is equivalent to building maintenance making sure the fire extinguishers aren't expired... it's basic to IT. Unfortunately, Wikipedia just reminded us that what's basic isn't always what's remembered. Someone just lost their job.

Re:Oops (2, Insightful)

cryfreedomlove (929828) | about 4 years ago | (#31601578)

I doubt anyone lost their job over this. What is the real cost of a 1 hour global outage for wikipedia if it only occurs once per year?

Re:Oops (1, Insightful)

Anonymous Coward | about 4 years ago | (#31601708)

Wikipedia does not profit from traffic so they actually save money for every hour the site is down. Looks like someone just got promoted!

Re:Oops (0)

Anonymous Coward | about 4 years ago | (#31602150)

For every hour? Really? With that logic they should just keep it down 24/7 then.

Re:Oops (2, Insightful)

Arthur Grumbine (1086397) | about 4 years ago | (#31602334)

For every hour? Really? With that logic they should just keep it down 24/7 then.

Only when combined with the premise that profit is a goal for them. Which it's not.

Re:Oops (1)

ArundelCastle (1581543) | about 4 years ago | (#31601724)

Wikipedia just reminded us that what's basic isn't always what's remembered.

TFA quote did say it was a standard procedure. Seems like an accurate description leading to the common SNAFU, or "Administrivia" if you prefer. It's the weird shit you're always checking on.
Building maintenance is an interesting comparison to use. Every year I see plenty of elevator licenses and fire extinguisher tags in many, many buildings that are expired.

Re:Oops (1)

geniice (1336589) | about 4 years ago | (#31602196)

Since wikimedia's server admins have long since been divided into two departments known as wing and prayer they can probably avoid any job losses by blaming each other.

I disagree. (1)

Colin Smith (2679) | about 4 years ago | (#31602224)

You build your systems to be fault tolerant. They automatically continue with half the components missing. Automatically disable those which fail the continually running tests.

Build your backup tests into daily procedures. i.e. don't copy/scp files to other locations/servers/sites, restore them to the other location. Autorestore DB backups to the staging/test/dev/reporting systems daily.
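
One way to picture the "restore, don't just copy" routine is the sketch below, which uses Python's built-in `sqlite3` backup API as a stand-in for whatever database is really in use. The `pages` table, the `backup_and_restore` helper and the in-memory "production" and "staging" databases are hypothetical.

```python
import sqlite3


def backup_and_restore(source, staging):
    """Restore `source` into `staging`, then verify the restored copy is usable."""
    source.backup(staging)  # sqlite3's online backup API (Python 3.7+)
    # Check the restored database, not the backup file, so a bad restore is caught.
    ok = staging.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    src_count = source.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
    dst_count = staging.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
    return ok and src_count == dst_count


production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE pages (title TEXT)")
production.executemany("INSERT INTO pages VALUES (?)", [("DNS",), ("DNA",)])
production.commit()

staging = sqlite3.connect(":memory:")
restored_ok = backup_and_restore(production, staging)
restored_rows = staging.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

Run daily, a job like this exercises the restore path as part of normal operations instead of leaving it for the day of the disaster.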

Computers are there to do stuff automatically. Getting human beings to do them is prone to failure.
 

Re:I disagree. (1)

RAMMS+EIN (578166) | about 4 years ago | (#31602450)

You make some very good points in your post.

At the end of it all comes the realization that planning for crisis is complicated, and getting it right is hard. It's also something that every organization I have ever worked with has underestimated considerably. From what little information I have about this incident with Wikimedia (I noticed nothing, myself), they did considerably better than average.

But you are right: the right approach is not to prepare for contingency, but to make recovery part of the normal flow. If failure and recovery are part of the everyday routine, you will know what things are broken before disaster strikes, and when it does, you will be prepared. Nothing will make your organization infallible, but at least you will have procedures, people who know how to execute them, and experience with doing so.

Re:Oops (2, Insightful)

VTEX (916800) | about 4 years ago | (#31602492)

Someone just lost their job.

I highly doubt someone lost their job over this - and they shouldn't. There are no perfect systems out there, period. Given Wikipedia is a not for profit corporation, they very likely have limited resources and the IT staff does the best with what they have. Even with a virtually unlimited amount of resources things can still go wrong in a "Perfect Storm".

If anything, the System Administrators should be commended for their quick actions to get the site back up and running as soon as they did.

Re:Oops (1)

Yvanhoe (564877) | about 4 years ago | (#31602668)

Someone does an awesome job at having a failover procedure for such an incredible non-profit project. And for resuming access within one hour. For heaven's sake, they don't even make money keeping the biggest encyclopedia of all History online, give them a break !

Come on wikipedia, fix this, but rest assured that we all love you !

Re:Oops (0)

Anonymous Coward | about 4 years ago | (#31602680)

This is equivalent to building maintenance making sure the fire extinguishers aren't expired...

It's as if they were making sure that the fire extinguishers' paint wasn't faded or peeling, rather than checking the mechanism and pressure.
Can't have rotten looking fire extinguishers now, can we? Nobody would want to pick them up if there was an actual fire, they might get all icky and covered in paint shards.

I just assumed DNS was reverted... (1, Funny)

Anonymous Coward | about 4 years ago | (#31601500)

...to the old setting by some Admin who edited it the last time, and who would be damned if he let anyone else get in the last word.

genetic level? (0)

Anonymous Coward | about 4 years ago | (#31601540)

DNA resolution failure? sounds serious.

Serves them right (0)

Anonymous Coward | about 4 years ago | (#31601570)

Wikipedia admins need to get out of their basement anyway.

makes sense (1)

nomadic (141991) | about 4 years ago | (#31601642)

Judging by traveling through Europe in the summer, they've never discovered "air conditioning."

Uptime is dumb (0)

Anonymous Coward | about 4 years ago | (#31601650)

Guess what Wikipedia, you are a free service. You could be up 10 hours a day or 24, existing is sufficient. Damn demanding internayz morons.

I saw some issues with wiktionary... (1)

Qubit (100461) | about 4 years ago | (#31601848)

But when I got to the wiktionary.org main page I didn't see any kind of note or warning.

Couldn't they have at least put up some kind of warning box, hopefully with a list of IP addresses underneath so that one could directly access the services when in dire need?

.
.
.
.
.

(I'm not really sure what constitutes "dire need" of wikimedia services, but I'm sure someone can come up with a list of relevant circumstances)

Re:I saw some issues with wiktionary... (1)

PPH (736903) | about 4 years ago | (#31602172)

I'm not really sure what constitutes "dire need" of wikimedia services, but I'm sure someone can come up with a list of relevant circumstances

You could look up 'Dire Need' on Wiki..... oh, never mind.

From hot to hotter (1)

LoRdTAW (99712) | about 4 years ago | (#31602048)

From the Summary:
"Due to an overheating problem in our European data center many of our servers turned off to protect themselves"
"we were forced to move all user traffic to our Florida cluster"

I think Wikipedia needs to build some data centers further north.

Deleted? (4, Funny)

Grishnakh (216268) | about 4 years ago | (#31602100)

I thought maybe they had simply deleted Wikipedia because some admin decided nothing on there was "notable".

Funny (0)

Anonymous Coward | about 4 years ago | (#31602300)

I didn't understand some terms in the summary, so I was about to wiki them... *sigh*

backup failure doesn't mean a failure to test (4, Insightful)

rritterson (588983) | about 4 years ago | (#31602462)

I see lots of comments stating that this would not have happened had admins run regular tests on the failover mechanisms. That seems a poor assumption- if the system happens to fail and then an outage occurs before the next scheduled test, one may not be aware of it.

We had this problem recently where we were testing our backup generator. Normally, we cut power to the local on-campus substation, which kicks in the generator and activates a failover mechanism, rerouting power. Well, the generator came on no problem but the failover mechanism was broken, so every server in the datacenter spontaneously lost power. Had we known the failover was broken, we would have not done the regular test. However, the last test on the failover (done directly without cutting power), a mere month prior, had shown the failover mechanism was fine.

Point being, unless you are going to literally continuously test everything, there is still some probability of an unexpected double failure.
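
As a back-of-the-envelope illustration of that residual risk, the sketch below assumes the failover mechanism breaks at random with a constant rate (an exponential model). The model, the `p_broken_since_test` helper and the 0.5 failures per year figure are assumptions for illustration, not measurements from the post.

```python
import math


def p_broken_since_test(failures_per_year, days_since_test):
    """Probability the mechanism has silently broken since its last passing test.

    Assumes failures arrive at a constant rate (exponential model).
    """
    rate_per_day = failures_per_year / 365.0
    return 1.0 - math.exp(-rate_per_day * days_since_test)


# Hypothetical failure rate: one silent breakage every two years on average.
just_tested = p_broken_since_test(0.5, 0)
daily = p_broken_since_test(0.5, 1)
monthly = p_broken_since_test(0.5, 30)
```

Under this model more frequent testing shrinks the window but never brings the probability to zero, which is the point being made about double failures.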

Denmark is still without Wikipedia (1, Interesting)

Anonymous Coward | about 4 years ago | (#31602466)

20:47 UTC+1, we are still without Wikipedia probably due to poor DNS propagation

great replies (1)

vxice (1690200) | about 4 years ago | (#31602484)

If only their blog had mod points. All the comments are of the form "still down wherever I am".

Wikipedia goes off the air... (0)

Anonymous Coward | about 4 years ago | (#31602490)

...and nobody really gave a damn.

So why are they considered relevant again?

School assignments (1)

bjb_admin (1204494) | about 4 years ago | (#31602552)

How many kids will go to school tomorrow and say they couldn't complete an assignment because Wikipedia is down?

Distributed Wikipedia (2, Interesting)

RAMMS+EIN (578166) | about 4 years ago | (#31602574)

Speaking of Wikipedia, an idea that has long been in my mind, but that I have never sat down and worked out is distributed hosting of Wikipedia. The idea is that volunteers each contribute some resources (network capacity, storage space, RAM, and CPU cycles) to host and serve part of the content.

This way, we should be able to reduce the load on the (donation supported) Wikimedia servers, as well as increase the redundancy in the system.

Is anybody already working on this or are there perhaps even already implementations of this idea?

China? (1)

zorro-z (1423959) | about 4 years ago | (#31602800)

And here, I thought that the Great Firewall of China had been blocking access to politically-charged Websites again.
