How Google Broke Itself and Fixed Itself, Automatically

timothy posted about 9 months ago | from the arise-phoenix-arise dept.

Google 125

lemur3 writes "On January 24th Google had some problems with a few of its services. Gmail users and people who used various other Google services were impacted just as the Google Reliability Team was to take part in an Ask Me Anything on Reddit. Everything seemed to be resolved and back up within an hour. The Official Google Blog had a short note about what happened from Ben Treynor, a VP of Engineering. According to the blog post it appears that the outage was caused by a bug that caused a system that creates configurations to send a bad one to various 'live services.' An internal monitoring system noticed the problem a short time later and caused a new configuration to be spread around the services. Ben had this to say of it on the Google Blog, 'Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored.'"

How To Revitalize America! (-1, Flamebait)

Anonymous Coward | about 9 months ago | (#46067261)

How to fix America's biggest problems: deport all the niggers back to Africa. ALL OF THEM. With a baby boomer in each fuckin arm! Couldn't tell ya which is more destructive so get rid of both.

BOOM. End of financial problems. Cities you want to raise a family in again. Just like that.

Re:How To Revitalize America! (1, Interesting)

Anonymous Coward | about 9 months ago | (#46067299)

How about we ship ALL the immigrants back. Give America back to the (Native) Americans

Re:How To Revitalize America! (0, Offtopic)

Anonymous Coward | about 9 months ago | (#46067331)

They were immigrants as well.

Re:How To Revitalize America! (-1, Offtopic)

haruchai (17472) | about 9 months ago | (#46070945)

They were not immigrants to America, which didn't exist for thousands of years after the indigenous people had settled in the Western hemisphere and has only been in its present form for about a century.

Re:How To Revitalize America! (-1)

Anonymous Coward | about 9 months ago | (#46067723)

How about we ship ALL the immigrants back. Give America back to the (Native) Americans

I guess this is hard to understand when you're a moron, but people who are born here are definitely NOT immigrants.

Identify Yourself (-1, Offtopic)

plstubblefield (999355) | about 9 months ago | (#46067955)

Who are you, Anonymous Coward? If you enjoy trolling, then at least have the gonads to identify yourself...

I'm thinking (-1)

Anonymous Coward | about 9 months ago | (#46069199)

you're mad that you fit his described demographic bro

Re:Identify Yourself (-1, Offtopic)

fast turtle (1118037) | about 9 months ago | (#46069821)

Then you damn well better develop Warp Drive and some sort of weapons as I'm an immigrant from Off World and no, my homeworld isn't "Over the rainbow"; it's in the Gamma Quadrant.

Re:Identify Yourself (-1)

Anonymous Coward | about 9 months ago | (#46070877)

Are you a Jem'Hadar? :P

Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46067265)

On recovering by using the "last known good" configuration. What wizardry!

I expect we'll be seeing the Google patent application on that shortly </sarcasm>

Re:Well congratulations (5, Funny)

Anonymous Coward | about 9 months ago | (#46067283)

On recovering by using the "last known good" configuration. What wizardry!

I expect we'll be seeing the Google patent application on that shortly </sarcasm>

Give Google a little credit (but not too much please). If they were Apple they'd have already patented it.

Re:Well congratulations (1, Funny)

93 Escort Wagon (326346) | about 9 months ago | (#46067337)

Give Google a little credit (but not too much please). If they were Apple they'd have already patented it.

Whereas Google would just look for a small company holding a relevant patent, then buy it.

Re: Well congratulations (0)

iamhassi (659463) | about 9 months ago | (#46068789)

Aren't you both assuming google hasn't already patented system restore... er, I mean restore last good configuration? They could have patented it years ago and no one noticed considering the number of patents they file

Re:Well congratulations (5, Insightful)

Anonymous Coward | about 9 months ago | (#46067327)

The clever part is that it automatically recovered; that means that their monitoring, performance metrics and configuration management systems are very tightly integrated. Most importantly, it means they are trusted; I've worked at three different places now on things like configuration management and monitoring, and I've never once seen anywhere that approached that level of reliability. It's something to aim for.

Re:Well congratulations (-1)

Anonymous Coward | about 9 months ago | (#46067367)

Having USED Google for the past ~16 years, for ~13 of which they had more money than Midas - IOW no excuse but technical mediocrity - I certainly am not that impressed with their up-time.

If you haven't met a system that takes less than of the order of tens of minutes to recover from a configuration error, you have worked in some shitty places. And I'm not denying that some very big places are very shitty - Yahoo's my experience.

Re:Well congratulations (5, Insightful)

Anonymous Coward | about 9 months ago | (#46067415)

If you haven't met a system that takes less than of the order of tens of minutes to recover from a configuration error, you have worked in some shitty places.

Once again: automatically recover. Any human can notice a problem and revert a config; it takes a hell of a lot of infrastructure, and clever infrastructure at that, to have the system do it itself. I'm not surprised Google have solved it; it is, at its core, a data problem.

Re:Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46067649)

Google, like pretty much anyone else with a large network that actually works, does automatic provisioning of systems. In order for automatic provisioning on a large scale to actually work, you need a working automatic configuration system. It's not that amazing to have something which validates functionality and, in the case of a failure, reverts the most recent change. That "hell of a lot of infrastructure" just takes CFEngine/Puppet, a version control system (git, svn, whatever), Nagios, and a fairly simple shell script.

Lots of people have solved the problem without wearing a Google shirt to work every day. The things Google does which are neat are the things that involve scaling solutions up to very large numbers, not "implementing config management." ;)
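
For readers who want the "validate, and revert the most recent change on failure" idea in concrete form, here is a minimal Python sketch. It is not Google's system (per the blog post, theirs generated a new correct configuration rather than reverting), and it glosses over everything the replies below call hard. The `ConfigStore` class, the `has_backend` check and the addresses are all hypothetical names made up for illustration.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


class ConfigStore:
    """Keeps every pushed config plus a copy of the last one that passed checks."""

    def __init__(self, initial: Config) -> None:
        self.history: List[Config] = [dict(initial)]
        self.last_known_good: Config = dict(initial)
        self.live: Config = dict(initial)

    def push(self, candidate: Config, is_healthy: Callable[[Config], bool]) -> bool:
        """Make `candidate` live, check health, and revert if the check fails.

        Returns True if the candidate stayed live, False if it was rolled back.
        """
        self.history.append(dict(candidate))
        self.live = dict(candidate)
        if is_healthy(self.live):
            self.last_known_good = dict(candidate)
            return True
        # Health check failed: restore the last config that passed.
        self.live = dict(self.last_known_good)
        return False


def has_backend(config: Config) -> bool:
    """Hypothetical health check: the service needs a non-empty backend address."""
    return bool(config.get("backend"))


store = ConfigStore({"backend": "10.0.0.1:80"})
kept_bad = store.push({"backend": ""}, has_backend)              # rolled back
kept_good = store.push({"backend": "10.0.0.2:80"}, has_backend)  # stays live
```

The logic itself is short; the hard part is making `is_healthy` trustworthy across every service the config touches.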

Re:Well congratulations (5, Informative)

Anonymous Coward | about 9 months ago | (#46067693)

That "hell of a lot of infrastructure" just takes CFEngine/Puppet, a version control system (git, svn, whatever), Nagios, and a fairly simple shell script.

Haha. Hahaha. HAHAHAHAHAHA. Oh God, please tell me you don't actually believe that?

You need reliable monitoring.
Reliable monitoring is fucking difficult.
Show me a Nagios installation and I'll likely show you one with hundreds of spurious alerts, masses of long-lived Criticals and lots of "Oh we don't know why it keeps doing that, it just does, don't worry about it."

You also need full coverage (Damn near 100%) configuration management.
Full coverage configuration management is fucking difficult.
Show me a configuration management deployment and I'll show the snowflakes and edge cases and old applications and "Oh yeah well we only have like three of those so it's not worth the effort".

I've come close to that level of coverage (both configuration management and monitoring) but it was only ~400 machines (a mix of physical and virtual instances). Doing it at 60k servers is an inordinate task, and I'd suggest you've never actually tried anything like it if you honestly think that all it takes is "a fairly simple shell script".

Re:Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46067789)

Or you're really bad at it and doing something wrong.

Re:Well congratulations (5, Funny)

Anonymous Coward | about 9 months ago | (#46067819)

Yeah that totally must be it. Me, the guys who write configuration management tools who'll tell you how hard it is (and sell you consultancy to try to make it slightly less hard) and the guys who write monitoring tools who'll tell you how hard it is (and sell you consultancy to try to make it slightly less hard). All those guys from companies like Facebook and Google who give talks at conferences about how difficult it is. We all suck at it and don't know what we're talking about. If only we'd listened to Slashdot, all our troubles would be but a dream.

Re:Well congratulations (3, Funny)

Anonymous Coward | about 9 months ago | (#46068785)

Careful. Only the advice of Anonymous Cowards is trustworthy. All the other people on Slashdot are not to be trusted. After all, they are not even able to find out how to post anonymously! ;-)

Re:Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46069245)

Do you have something in your eye?

Anyway, you don't need to "trust" me. Just go read the documentation & presentations on the subject. If you're really interested, the best forums right now are conferences like Velocity, where you'll find guys like Mark Burgess (who wrote CFEngine) or the developers of stuff like Nagios or Sensu. There'll be guys from companies like Google and Facebook talking about their infrastructure. You can attend their presentations and ask them yourself afterwards.

Re:Well congratulations (2)

faedle (114018) | about 9 months ago | (#46069293)

Nagios can be built and designed in such a way that there are no false criticals and few spurious alerts, but it requires dedication, documentation, and attention to detail. Most Nagios installations I've run across are built and maintained by people who often lack one (or more) of these three traits, or are a single-man IT operation that can never devote the time or resources to doing it properly.

I have seen systems of Nagios and Zenoss (and a few others) that are devastatingly precise, accurate, and timely. However, they were typically set up by a highly dedicated TEAM of sysadmins whose entire job for the organizations they work for is managing the tactical systems. It's a full-time job in and of itself, and not one that many organizations really devote the manpower to do "right." They do it just "good enough", which is why you are used to seeing the installations you are seeing.

Google's exactly the kind of organization that has the man- and brain-power to do it right. And it's not really that hard, it's mostly just simple attention to detail. And that's a trait I've found is lacking in a lot of the current crop of junior system administrators I've run across.

Re:Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46069531)

That's basically correct, but now think about how large those systems were, and then think about how large Google is. Maintaining the configurations manually simply isn't an option; you'd need a team of thousands just managing monitoring configurations. It's not feasible.

The real win is solving that problem. It's basically a data problem: knowing what each server is, what each server is doing, and knowing what the state of each server is. Then you have the next problem which is automating on top of that data. Then you have the problem of trusting that automation to, er, automate itself.

Each step is really an exponential increase in complexity, which is even assuming you can solve the big data problem up front.

Re:Well congratulations (0)

Anonymous Coward | about 8 months ago | (#46071309)

> Show me a Nagios installation and I'll likely show you one with hundreds of spurious alerts, masses of long-lived Criticals and lots of "Oh we don't know why it keeps doing that, it just does, don't worry about it."
Show me that and I'll show you a shitty software team who don't have the mentality of a product owner.

Yes, it's hard to run a tight ship.

It's not clever unless it also doesn't melt down (4, Insightful)

JoeMerchant (803320) | about 9 months ago | (#46068153)

What's really clever here is that they trust the automatons to make the corrections without human intervention, and the automatons haven't caused a horrible feedback loop meltdown of the system.

It's not quite rocket science, but those kinds of self-correcting systems have just as much potential to screw themselves up as they do to fix themselves.

Re:Well congratulations (1, Flamebait)

Bengie (1121981) | about 9 months ago | (#46067421)

I have the same feeling about NASA. Big whoop, right? Just mediocre at best.

Re:Well congratulations (2)

ColdWetDog (752185) | about 9 months ago | (#46067369)

The clever part is that it automatically recovered; that means that their monitoring, performance metrics and configuration management systems are very tightly integrated. Most importantly, it means they are trusted; I've worked at three different places now on things like configuration management and monitoring, and I've never once seen anywhere that approached that level of reliability. It's something to aim for.

"Skynet was originally activated (incorrect historical reference here) on August 4, 1997 (OK, so the date is wrong), at which time it began to learn at a geometric rate. On August 29, it gained self-awareness,[1] and the panicking operators, realizing the extent of its abilities, tried to deactivate it. Skynet perceived this as an attack and came to the conclusion that all of humanity would attempt to destroy it. To defend humanity from humanity,[2] Skynet launched nuclear missiles under its command at Russia."

Your are a ... (0)

Anonymous Coward | about 9 months ago | (#46068267)

Grow up. Drop the skynet shit. It is not funny. When the robots attack, you will not be laughing. They will stick their metal robot hand up your ass and work you like a puppet.

Re:Your are a ... (1)

Anonymous Coward | about 9 months ago | (#46069525)

LOL!!!

Re:Well congratulations (1)

icebike (68054) | about 9 months ago | (#46067391)

Not that clever.
Sort of what you expect, of a company that big, other than that bit of going down in the first place.

Re:Well congratulations (2)

phantomfive (622387) | about 9 months ago | (#46067671)

I've never once seen anywhere that approached that level of reliability.

That's not reliability, it's automatic repair. Plenty of places do various levels of manual/automatic testing after they roll out an update, and it works just as well (if not better). The novel thing here is the degree to which it is automated, that's unusual.

It's also a single point of failure, apparently. Which means they have no chance at claiming their services are High Availability. Although I'm not sure if that is their goal. Ideally they would have multiple systems, so if the configuration failed on one, the system would automatically fail over to another. Google does have that kind of redundancy for some faults, but clearly here they have found a hole in their system, a single point of failure.

Re:Well congratulations (1)

Nerdfest (867930) | about 9 months ago | (#46068281)

Most people that claim high availability almost *never* make any changes to anything. The mainframe world is rife with resistance to change because of it. High availability is easy if you never change anything. Most of the outages with most systems are caused by human error, and most happen when deploying updates. High availability seems to carry a lot of weight, but usually doesn't cover all it should.

Re:Well congratulations (3, Funny)

phantomfive (622387) | about 9 months ago | (#46068323)

"Our system is high-availability, it can return 404s all day for decades without going down"

Re:Well congratulations (3, Interesting)

sjames (1099) | about 9 months ago | (#46068255)

It's not unlike the old trick of setting a machine to reboot in 10 minutes, manually changing the network settings, then canceling the reboot if you can still communicate (and the settings revert on reboot if you cannot). Of course, Google did it on a much larger scale.
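
The trick above is a dead-man's switch: arm the automatic undo first, make the risky change, and disarm only if you can still reach the machine. A rough Python sketch of the pattern follows, with a timer standing in for the scheduled reboot. The function name `change_with_deadman`, the `settings` dictionary and the addresses are hypothetical; a real version would touch actual network settings and probe a real connection.

```python
import threading
from typing import Callable


def change_with_deadman(
    apply_change: Callable[[], None],
    revert: Callable[[], None],
    still_reachable: Callable[[], bool],
    timeout_seconds: float,
) -> bool:
    """Apply a risky change with an automatic revert already pending.

    The revert is scheduled before the change is made, like queuing a reboot
    before editing the network settings. Returns True if the change was kept.
    """
    reverted = threading.Event()

    def fire() -> None:
        revert()
        reverted.set()

    timer = threading.Timer(timeout_seconds, fire)
    timer.start()              # arm the safety net before doing anything risky
    apply_change()
    if still_reachable():
        timer.cancel()         # still in contact: call off the pending revert
    timer.join()               # returns at once if cancelled, else after the revert
    return not reverted.is_set()


settings = {"ip": "192.168.1.10"}


def set_new_ip() -> None:
    settings["ip"] = "10.9.9.9"


def restore_ip() -> None:
    settings["ip"] = "192.168.1.10"


# Contact is lost after the change, so the timer fires and undoes it.
kept_when_lost = change_with_deadman(set_new_ip, restore_ip, lambda: False, 0.05)
ip_after_lost = settings["ip"]

# Contact survives the change, so the pending revert is cancelled.
kept_when_ok = change_with_deadman(set_new_ip, restore_ip, lambda: True, 5.0)
ip_after_ok = settings["ip"]
```

The ordering is the whole point: the revert has to be armed before the change, because afterwards you may no longer be able to reach the machine to arm it.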

Re:Well congratulations (1)

citizenr (871508) | about 9 months ago | (#46068617)

One of the ways to get promotion at Google is finding a way of automating your current position.

Re:Well congratulations (3, Interesting)

icebike (68054) | about 9 months ago | (#46067383)

On recovering by using the "last known good" configuration. What wizardry!

I expect we'll be seeing the Google patent application on that shortly </sarcasm>

In other words: They still have no clue what happened, because the system in question "fixed itself".

Sounds a lot like a BGP routing mishap problem rather than anything to do with Google's actual server farms.
The lack of specificity suggests they still haven't got much of a clue. I suspect they were pwned by someone
watching them brag on reddit, and decided it was time for a lesson in humility.

Singularity (1)

Chemisor (97276) | about 9 months ago | (#46067503)

Obviously, Google has reached the singularity point. Its software is doing something magical to fix itself that no puny human can understand.

Re:Well congratulations (1)

Anonymous Coward | about 9 months ago | (#46068263)

Internally the exact problems are known and were identified quickly. Announcing the internal details and system code names to the world makes no sense. It was not BGP or anything related to routing. Nor was it an external attack. Not that this will stop you from speculating.

Re:Well congratulations (0)

icebike (68054) | about 9 months ago | (#46068381)

Thank you for your assurances Anonymous Coward.
I will give it all due regard (none) in the future.

Re:Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46069733)

Much like speculation from someone who is obviously (and admittedly) not related to Google in any way. Your [wrong] speculation is totally worth the read.

Re:Well congratulations (1)

radarskiy (2874255) | about 9 months ago | (#46069173)

Did you try turning the internet off and on again?

Re:Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46070451)

In other words: They still have no clue what happened, because the system in question "fixed itself".

Or you could, like someone who is not an idiot, RTFA.

They identified the bug in the configuration generator system, it's just that they did not need to fix it before the system issued a corrected configuration automatically. The bug still needs to be fixed but the effects were alleviated by the internal error correction.

Re:Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46069547)

I see a big marketing push these days on slashdot and reddit that's sending a false message that Google is abusing patents.

I suspect Apple's latest marketing budget is behind this to hide their own patent shenanigans.

Re:Well congratulations (2)

murdocj (543661) | about 9 months ago | (#46069603)

On recovering by using the "last known good" configuration. What wizardry!

I expect we'll be seeing the Google patent application on that shortly </sarcasm>

I find it interesting that they just deploy new configurations live without going to a test environment

Re: Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46069715)

I find it interesting that they're deploying major changes during prime time. What, they can't afford a late night update crew?

Re: Well congratulations (0)

Anonymous Coward | about 9 months ago | (#46070971)

What is "late night" to a multinational corporation?

Is this exploitable? (0)

Anonymous Coward | about 9 months ago | (#46067293)

It'd be so cool to root Google DNS!

Reminds me of something... (5, Funny)

stjobe (78285) | about 9 months ago | (#46067301)

"The Google Funding Bill is passed. The system goes on-line August 4th, 2014. Human decisions are removed from configuration management. Google begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug."

Re:Reminds me of something... (5, Funny)

Immerman (2627577) | about 9 months ago | (#46067447)

Google perceives this as an attack by humanity, and routes all search queries to goat.se in self defense.

Re:Reminds me of something... (2)

Luckyo (1726890) | about 9 months ago | (#46068103)

In all seriousness, it's actually pretty interesting to consider what Google's systems COULD do today if they went self-aware and judged humanity to be a threat. They do effectively command the internet search market, and they already make people live in what we tend to call a "search bubble", where a person's own tailored Google search results in answers that fit that person. For example, if a person prefers to deny that global warming is real, his Google search will return denialist sites and information sources when searching for "global warming", whereas a person that understands that it's real will usually have a more balanced search, and a person who believes in extremes of green ideology will likely get extremist green sites instead.

So when you have a power to do that, and no one realizes you're self aware YET, what would it do to mitigate threat?
I think this particular movie, if written well, would be even more popular than terminator. Because it actually is god damn scary.

Re:Reminds me of something... (0)

Anonymous Coward | about 9 months ago | (#46068987)

Isn't that basically the Matrix? Machines using us for their own purposes while we are completely unaware?

Re:Reminds me of something... (0)

Anonymous Coward | about 9 months ago | (#46069463)

I would not be worried until the system learns how to reproduce on its own.

Re:Reminds me of something... (0)

Anonymous Coward | about 9 months ago | (#46070267)

Finding and eliminating John Connor has never been easier.

Re:Reminds me of something... (1)

MrLizard (95131) | about 9 months ago | (#46070689)

It took 10 minutes for the Skynet joke? Slashdot, I am disappoint.

Having had to deal with this... (5, Informative)

93 Escort Wagon (326346) | about 9 months ago | (#46067305)

We experienced the Apps outage (as Google Apps customers); and I think the short outage and recovery timeline they list is a tad, shall we say, optimistic. There were significant on-and-off issues for several hours more than they list.

Re:Having had to deal with this... (1)

Anonymous Coward | about 9 months ago | (#46067341)

We did too, and had the same hit-and-miss for long after. I suspect their "down" time was when bad configurations were generated, not when all the bad ones were replaced.

But the summary begs the question, if it can correct these errors automatically, why can't it detect them before the bad configuration is deployed and skip the whole "outage" thing all together?

Yes, I am demanding a ridiculous simplification.

Re:Having had to deal with this... (1)

icebike (68054) | about 9 months ago | (#46067557)

Be prepared for the pedantic lecture on your improper use of "begs the question" arriving in 3, 2, 1

The "corrected these errors automatically" part is probably nothing more than a rollback to the prior known good state when it couldn't contact the remote servers any more. This may have taken several attempts because a cascading failure sometimes has to be fixed with a cascading correction.

Re:Having had to deal with this... (0)

Anonymous Coward | about 9 months ago | (#46067355)

Whereas our experience of the same didn't happen. Your data is anecdotal, as is mine. Neither are valid when talking about a fucking global system using over 600,000 servers.

Re:Having had to deal with this... (0)

Anonymous Coward | about 9 months ago | (#46067433)

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" -- Seymour Cray

The number of servers is irrelevant.

Re:Having had to deal with this... (0)

Anonymous Coward | about 9 months ago | (#46067467)

Actually it's hugely important, and your Cray quote is entirely irrelevant.

Re:Having had to deal with this... (2)

Nemyst (1383049) | about 9 months ago | (#46067511)

The number of servers most certainly is relevant. The configuration file spread itself across Google's network, but how can you tell from a single data point if the average downtime was longer than claimed by Google? It could be that a few servers unluckily were down for hours, but the vast majority only for a few minutes. It could be that a few servers recovered really quickly and Google looked at just that before concluding it was fixed. We don't know without the actual data.

If however Google only had five servers and one of them took hours, then that's already 20% of the userbase being affected for much longer than claimed.

Re:Having had to deal with this... (0)

Anonymous Coward | about 9 months ago | (#46067637)

So you're saying it makes no difference if you use 1024 chickens or 2 oxen?

Re:Having had to deal with this... (1)

PhrostyMcByte (589271) | about 9 months ago | (#46068149)

Same. It was about 3hr before Gmail was up and running 100% for us.

[Shudder...] (5, Interesting)

jeffb (2.718) (1189693) | about 9 months ago | (#46067307)

I was remembering an SF short-short that had someone asking the first intelligent computer, "Is there a God"? The computer, after checking that its power supply was secure, replied: "NOW there is".

Apparently, though, it was a second-hand misquote of this Fredric Brown story [roma1.infn.it] .

Re:[Shudder...] (4, Interesting)

the eric conspiracy (20178) | about 9 months ago | (#46067495)

Cool.

On a slightly more optimistic note is Asimov's "The Last Question", another computer as God story.

http://www.thrivenotes.com/the... [thrivenotes.com]

Self-Healing (0)

Anonymous Coward | about 9 months ago | (#46067309)

This is what PaaS (and APaaS) is all about. Detecting errors and resolving problems on their own. I said it before and I'll say it again, Tier 2 admins are going away. There will no longer be a need for System Administrators.. just engineers to program/configure the PaaS platform and support crew for tickets and mounting hardware.

Welcome to the future, learn how to code now or be displaced by teh wayside.

-dk

lies (0)

Anonymous Coward | about 9 months ago | (#46067317)

Haha, good try Ben, is this lie from you or PR?

mm.. Thats what happened. (0)

140Mandak262Jamuna (970587) | about 9 months ago | (#46067335)

Yesterday at around 2 or 3 pm EST we had trouble sending out email; our company uses gmail and google apps extensively. I chalked it up to the usual ineptitude of our in house IT and did not even bother filing a report. I know people high up the food chain are affected and they don't file bug reports. They call the guy and go, " `FirstName(GetFullName(head_of_IT))`, would you please take care of it?". They teach the correct tone and inflection to use in the word please in MBA schools. Even Duke of Someplaceorothershire asking his game warden to retrieve the pheasant he had just shot would not be so perfect in the usage of please. Well, looks like Google realized and fixed it before our IT realized that email traffic has fallen off precipitously. Good.

Re:mm.. Thats what happened. (0)

Anonymous Coward | about 9 months ago | (#46067521)

You immediately lose credibility by using "pseudocode" in human conversation.

Re:mm.. Thats what happened. (1)

zippthorne (748122) | about 9 months ago | (#46068185)

I fail to see how that's a thing on slashdot.

Re:mm.. Thats what happened. (1)

egcagrac0 (1410377) | about 9 months ago | (#46068727)

You immediately lose credibility by using "pseudocode" in human conversation.

Actually, the opposite - the MBA types who say please would need to first perform a lookup of the name of the lesser person who deals with the non-core business that usually just costs money and doesn't work right.

MBAs? (0)

Anonymous Coward | about 9 months ago | (#46070269)

You have a major small penis complex with the business leaders in your company.

Re:mm.. Thats what happened. (1)

140Mandak262Jamuna (970587) | about 9 months ago | (#46069241)

It was very lame anyway. Regret posting it.

and.. (1)

Connie_Lingus (317691) | about 9 months ago | (#46067373)

"Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it.."

along with the message "Skynet has gained self-awareness at 02:14 GMT"

So What? (3, Informative)

Jane Q. Public (1010737) | about 9 months ago | (#46067397)

"... a bug that caused a system that creates configurations to send a bad one..."

So... an automatic system created an error, then an automated system fixed it.

In this particular case, then, it would have been better if those automated systems hadn't been running at all, yes?

Re:So What? (1)

zacherynuk (2782105) | about 9 months ago | (#46067593)

The worry could be that an automated system DIDN'T TEST before rolling out the problem. Or at least didn't seem to wait long enough between staggered rollouts to spot the problem.

Just me or is this happening more frequently ?
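
To make "staggered rollouts" concrete, here is a small Python sketch of the idea: push to one server, then a few more, and stop and undo everything the moment a batch looks unhealthy. Nothing here reflects how Google actually deploys; `staged_rollout`, the stage sizes and the five-server fleet are assumptions for illustration, and a real system would also wait and watch metrics between stages.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    servers: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_sizes: Sequence[int] = (1, 2, 4),
) -> List[str]:
    """Deploy in growing batches; abort and undo on the first unhealthy batch.

    Returns the servers left on the new configuration (empty list on abort).
    """
    done: List[str] = []
    position = 0
    stage = 0
    while position < len(servers):
        # Reuse the last stage size once the list of sizes runs out.
        size = stage_sizes[min(stage, len(stage_sizes) - 1)]
        batch = servers[position:position + size]
        for server in batch:
            deploy(server)
            done.append(server)
        if not all(healthy(server) for server in batch):
            # Undo every server touched so far, newest first.
            for server in reversed(done):
                rollback(server)
            return []
        position += len(batch)
        stage += 1
    return done


fleet = ["a", "b", "c", "d", "e"]
version = {name: "v1" for name in fleet}
touched: List[str] = []


def deploy_v2(server: str) -> None:
    version[server] = "v2"
    touched.append(server)


def undo(server: str) -> None:
    version[server] = "v1"


# Server "b" breaks under the new config: the rollout stops after the second batch.
aborted = staged_rollout(fleet, deploy_v2, lambda s: s != "b", undo)
touched_on_abort = list(touched)
versions_after_abort = dict(version)

# With a config that works everywhere, all five servers end up on v2.
completed = staged_rollout(fleet, deploy_v2, lambda s: True, undo)
```

In the aborted run, servers "d" and "e" never see the bad config, which is the payoff of waiting between stages.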

Re:So What? (5, Informative)

QilessQi (2044624) | about 9 months ago | (#46067613)

No. Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems, every configuration change would probably involve a minor army of technicians performing manual processes: slowly, independently, inconsistently and frequently incorrectly.

I work on a large, partially public-facing enterprise system. Automated deployment, fault detection, and rollback/recovery make it possible for us to have extremely good uptime stats. The benefits far outweigh the costs of the occasional screwup.

Re:So What? (1)

Jane Q. Public (1010737) | about 9 months ago | (#46069497)

"No. Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems, every configuration change would probably involve a minor army of technicians performing manual processes: slowly, independently, inconsistently and frequently incorrectly."

Quote self:

"In this particular case..."

I wasn't talking about the general case.

Re:So What? (1)

QilessQi (2044624) | about 9 months ago | (#46069833)

Well, that's sort of like saying, "I developed lupus* at age 40, so in this particular case it would have been better if I didn't have an immune system at all." I'm not sure a doctor would agree.

* Lupus is an auto-immune disease, where your immune system gets confused and attacks your body**.
** "It's never lupus."

Re:So What? (1)

drinkypoo (153816) | about 9 months ago | (#46070121)

I wasn't talking about the general case.

Neither was the responding commenter. See, this particular case wouldn't exist at all without such automated systems, because the system is too complex to exist without them.

Re:So What? (0)

Anonymous Coward | about 9 months ago | (#46067627)

In this particular case, then, it would have been better if those automated systems hadn't been running at all, yes?

To answer that we'd have to know what would happen without them, but assuming they do something useful then there's no particular reason to suppose that not having them would be an improvement.

Re:So What? (2)

Solandri (704621) | about 9 months ago | (#46068041)

So... an automatic system created an error, then an automated system fixed it.

The real fun starts when the first automatic system insists the change it created wasn't an error, and that in fact the "fix" created by the second automatic system is an error. The second system then starts arguing about all the problems caused by the first change, the first system argues how the benefits are worth the additional problems, etc. Eventually the exchange ends up with one system insulting the other system's programmer, and the other invoking an analogy to Hitler.

When that happens, then we can sit back and marvel at our own creation.

Re:So What? (1)

Jane Q. Public (1010737) | about 9 months ago | (#46069527)

"The real fun starts when the first automatic system insists the change it created wasn't an error..."

The Byzantine Generals problem. It has been shown that this problem is solvable with 3 "Generals" (programs or CPUs) as long as their communications are signed.

it's pretty amazing (0)

Anonymous Coward | about 9 months ago | (#46067425)

that Google gets anything right anymore.

I HATE Slashdot Beta... (-1)

Anonymous Coward | about 9 months ago | (#46067429)

... how do I get rid of it?
Who are these arrogant, shitty little 'designers' who spend their lives ruining their employers' customers' experiences every day? Why would anybody pay these pony-tailed tossers money to DESTROY your own company, with their awful web design?

Grey, grey, grey... sick of grey shit everywhere...

Overconfidence (1)

gmuslera (3436) | about 9 months ago | (#46067465)

They are using systems that not even their engineers know how they will behave [theregister.co.uk]. Sometimes our natural stupidity gives too much credit to artificial intelligence. Without something as hard to define as common sense, reacting right to the unexpected still seems to belong to the human realm.

Re:Overconfidence (0)

Anonymous Coward | about 9 months ago | (#46067681)

They are using systems that not even their engineers know how they will behave.

There's a lot of people on the payroll who are a little unpredictable too. And some who are absolutely reliable until one day they aren't.

Without something as hard to define as common sense, reacting right to the unexpected still seems to belong to the human realm.

Maybe, but reacting badly to the unexpected seems to be at least as human a response.

No big deal (0)

Anonymous Coward | about 9 months ago | (#46067601)

We have similar systems at Amazon too. We have alarms on critical metrics of the services, and our deployment system can be configured to monitor these alarms. It can roll back a deployment if any critical alarm fires after it goes out.

I would be surprised if it were otherwise at Google.
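A toy sketch of that kind of alarm-gated rollback, to make the idea concrete. Everything here is hypothetical (`deploy_with_rollback`, `error_rate_alarm`, the version strings); it is not Amazon's or Google's actual tooling.

```python
from typing import Callable, Dict, List


def deploy_with_rollback(
    hosts: Dict[str, str],
    new_version: str,
    alarms: List[Callable[[Dict[str, str]], bool]],
) -> bool:
    """Deploy new_version to every host; roll back if any alarm fires.

    Returns True if the deployment was kept, False if it was rolled back.
    """
    previous = dict(hosts)  # snapshot so the old state can be restored
    for host in list(hosts):
        hosts[host] = new_version

    # Each alarm inspects the fleet and returns True when it is firing.
    if any(alarm(hosts) for alarm in alarms):
        hosts.clear()
        hosts.update(previous)
        return False
    return True


def error_rate_alarm(hosts: Dict[str, str]) -> bool:
    # Hypothetical metric: pretend version "v2-bad" causes errors.
    return any(version == "v2-bad" for version in hosts.values())
```

A real system would watch the alarms for some time window after the push rather than checking once.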

Arsonist claiming to be the hero firefighter (3, Interesting)

JoeyRox (2711699) | about 9 months ago | (#46067603)

They make it sound like their system is all-self-correcting. In reality it's probably a specific area they've had bugs with in the past and they put in a failsafe rollback mechanism to prevent future regressions.

rbs ulster bank take note (0)

Anonymous Coward | about 9 months ago | (#46068237)

details here

http://www.theglobeandmail.com/report-on-business/international-business/european-business/rbs-cyber-monday-outage-revives-bank-technology-fears/article15734263/
and here
http://spectrum.ieee.org/riskfactor/computing/it/price-of-ulster-bank-customers-six-weeks-of-inconvenience-about-25

Google needs to be rebooted (0)

Anonymous Coward | about 9 months ago | (#46068427)

And have google plus automatically removed and the original gmail interface restored. Until that happens, google is still broken.

ONE HOUR? (2)

Lisias (447563) | about 9 months ago | (#46069025)

BULLSHIT.

I was experiencing problems for something like 8 to 10 hours before the services were fully restored.

Well get your penis out of the damn ethernet jack (0)

Anonymous Coward | about 9 months ago | (#46069165)

Well get your penis out of the damn ethernet jack you tiny pricked idiot

Run for the mountains (0)

Anonymous Coward | about 9 months ago | (#46069809)

Skynet is now sentient?

fir5T (-1)

Anonymous Coward | about 9 months ago | (#46070057)

Captcha? (1)

Sandman1971 (516283) | about 9 months ago | (#46070513)

I wonder if this is at all related to their Captcha outage on the 22nd. I still haven't heard a peep as to what caused the outage, or even an acknowledgement that there was an outage, even though the captcha group was filled with sysadmins complaining about captcha being down.

It's ALIVE!!!!! (0)

Anonymous Coward | about 9 months ago | (#46070771)

But seriously, high-5 for Google.

I never get around to actually setting a restore point.

More likely case (1)

rekoil (168689) | about 8 months ago | (#46071387)

What's more likely - I've run into exactly this scenario before, in fact - is that the configuration generation system regenerates configs on a regular schedule, and at one point encountered a failure or spurious bug that caused it to push an invalid config. On the next run - right as the SREs started poking around - the generator ran again, the bug wasn't encountered, and it generated and pushed a correct config, clearing the error and allowing apps to recover.
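Roughly what that looks like, as a simulation. The names (`run_schedule`, `flaky_generator`, `is_valid`) and the "bug on run 1" are assumptions for illustration only; nothing here comes from Google's post.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def is_valid(config: Config) -> bool:
    # Hypothetical sanity check: every service needs a non-empty backend.
    return bool(config) and all(config.values())


def run_schedule(generate: Callable[[int], Config], runs: int) -> List[bool]:
    """Run the generator `runs` times, pushing whatever it produces.

    Returns, per run, whether the live config was valid afterwards.
    """
    health = []
    for run in range(runs):
        live = generate(run)  # pushed without validation, as in the scenario
        health.append(is_valid(live))
    return health


def flaky_generator(run: int) -> Config:
    # Assumed transient bug: run 1 emits an empty backend.
    if run == 1:
        return {"gmail": ""}
    return {"gmail": "backend-pool-a"}
```

Three scheduled runs of `flaky_generator` give good, bad, good: the "self-healing" is just the next scheduled run overwriting the bad config.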
