
Are Data Center "Tiers" Still Relevant?

timothy posted more than 4 years ago | from the german-datacenters-have-tieren dept.

Data Storage

miller60 writes "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices? That question is at the heart of an ongoing industry debate about the merits of the tier system, a four-level classification of data center reliability developed by The Uptime Institute. Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators. Uptime says that many industries continue to require mission-critical data centers with high levels of redundancy, which are needed to perform maintenance without taking a data center offline. Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar."


98 comments


Prejudice. (0)

Drunken Buddhist (467947) | more than 4 years ago | (#29505291)

Only when coming from the eyes of data center owners.

Re:Prejudice. (1)

MrNaz (730548) | more than 4 years ago | (#29506293)

The only way I can come up with an analogy nonsensical enough to involve prejudice and eyes is if it contains a car.

Re:Prejudice. (1)

Drunken Buddhist (467947) | more than 4 years ago | (#29506437)

Tiers/Tears

Prejudice, as in: why should data center tears be more relevant than any other tears?

No (1)

pyster (670298) | more than 4 years ago | (#29505397)

And they never were.

Re:No (1)

aaarrrgggh (9205) | more than 4 years ago | (#29506171)

The Tier Guidelines as the Uptime Institute presents them are utterly useless beyond C-suite penis posturing. It is, however, important for a company to establish what its needs are.

Most serious players customize the guidelines based on their own needs and risk assessments. Redundant UPS systems, for example, provide more benefit than redundant utility transformers. Redundant generators offer less benefit than redundant starting batteries plus proper maintenance and testing. Mechanically, over-sizing some critical systems can provide as much benefit as redundant chillers (as long as you have multiple chillers), since the biggest risk is pull-down time.

Looking at a lot of data centers, if the main facility or projects person leans more towards electrical, you see lots of electrical redundancy but no mechanical redundancy. The opposite is also true.

The most important thing is to understand the limitations of existing systems and ensure proper maintenance is taking place...

Re:No (2, Informative)

Forge (2456) | more than 4 years ago | (#29506561)

Sometimes people do irrational things in a data center. For example: where I live and work, the electricity company is notoriously unreliable. We had a 5-minute outage this morning for no apparent reason, and we had three last week of varying durations. This is in the heart of the business district, where power is most reliable.

Because of this, our data center has redundant UPSes and redundant generators. All but the least critical servers have dual power supplies plugged into independent circuits.

We have multiple ACs, but they are not strictly set up to be redundant. When one breaks down, we have to haul standing fans to the area to keep the machines cool enough while the AC is repaired.

The stupid thing, though, is that most of the smaller switches have a single power supply and most machines are plugged into a single switch. So our last UPS failure resulted in two whole racks of servers being inaccessible for 15 minutes while I ran over there, figured out what the problem was, and plugged the switch into a neighboring rack.

Data Centers and Groupthink (-1, Troll)

Anonymous Coward | more than 4 years ago | (#29505419)

" how to get the most uptime for the data center dollar."

DON'T use Microsoft operating systems.

Yours In Belarus,
Philboyd Studge

It depends (4, Interesting)

afidel (530433) | more than 4 years ago | (#29505457)

If you are large enough to survive one or more site outages, then sure, you can go for a cheaper $/sq ft design without redundant power and cooling. If, on the other hand, you are like most small to medium shops, then you probably can't afford the downtime, because you haven't reached the scale where you can geographically diversify your operations. In that case downtime is probably still much more costly than even the most expensive of hosting facilities. I know when we looked for a site to host our DR site we were only looking at tier-IV datacenters because the assumption is that if our primary facility is gone we will have to timeshare the significantly reduced performance facilities we have at DR and so downtime wouldn't really be acceptable. By going that route we saved ~$500k on equipment to make DR equivalent to production at a cost of a few thousand a month for a top tier datacenter, those numbers are easy to work.
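
(For what it's worth, the break-even arithmetic behind a decision like this fits in a few lines of Python. The ~$500k figure comes from the comment above; the monthly colo rate is a made-up placeholder for "a few thousand a month", not a real quote.)

    # Rough break-even sketch: buy redundant gear for DR vs. rent top-tier colo space.
    # equipment_savings is from the comment above; monthly_colo_cost is an assumed figure.
    equipment_savings = 500_000   # capital avoided by not duplicating production gear at DR
    monthly_colo_cost = 4_000     # hypothetical "few thousand a month" Tier IV colo rate

    breakeven_months = equipment_savings / monthly_colo_cost
    print(f"Colo stays cheaper than owning the gear for ~{breakeven_months:.0f} months "
          f"(~{breakeven_months / 12:.1f} years)")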

Re:It depends (0)

Anonymous Coward | more than 4 years ago | (#29506627)

"I know when we looked for a site to host our DR site we were only looking at tier-IV datacenters because the assumption is that if our primary facility is gone we will have to timeshare the significantly reduced performance facilities we have at DR and so downtime wouldn't really be acceptable. By going that route we saved ~$500k on equipment to make DR equivalent to production at a cost of a few thousand a month for a top tier datacenter, those numbers are easy to work."

You'll regret that when vikings attack your Tier IV facility.

The first rule of datacenter planning is that everything will eventually happen. It might be someone going, "Look at this!", flipping a switch, and knocking out power as backup systems fail to come online. It might be a flaming car setting fire to a dumpster next to the facility.

Or it might be viking raiders.

All of these things have happened or will happen; it's simply a matter of time.

Safety Case #76 - "Cattle in the Lobby" (0)

Anonymous Coward | more than 4 years ago | (#29507285)

The truly paranoid aren't working at NSA, they're working as safety engineers :-) It's a genuine pity that the so-called bond rating agencies aren't required to employ safety engineers to review the crazy stuff dreamed up by investment bankers.

There are so many "Black Swan" events that safety engineers have to worry about that some of them sound insane ... even when they're events that have happened at least once before.

(As Dave Barry says, "I am not making these up") "Cattle in the Lobby", "Concerted Attack by Rodentia", etc.


Re:It depends (1)

EvilBudMan (588716) | more than 4 years ago | (#29507875)

It's very confusing to me. These guys say they are the only Tier III in the US, according to ATAC, whoever they are. Apparently there are no Tier IVs that outsource. So I would think your statement is correct: --haven't reached the scale where you can geographically diversify your operations-- probably means you don't need anything above Tier II plus your own backups.

http://www.onepartner.com/v3/atac_tier.html [onepartner.com]

FAST-RELIABLE-CHEAP (Pick 2) (0)

Anonymous Coward | more than 4 years ago | (#29505461)

"More uptime for the data center dollar" is a meaningless phrase.

The tried and true statement is that you can pick two (2) of the following:
-FAST
-RELIABLE
-CHEAP

Changing the metric to SAY you are providing all three does not mean you ACTUALLY are. It is just another way to confuse the customer and sell an inferior service as a premium service. If a company chooses to favor lower costs over redundancy in its data center, that is its choice. If we start to blur the lines between the different options, we only hinder a company's ability to make an informed decision.

Re:FAST-RELIABLE-CHEAP (Pick 2) (1)

marcosdumay (620877) | more than 4 years ago | (#29509695)

The question is how to get the maximum improvement in RELIABLE for a given loss of CHEAP. That is the meaning you didn't find, because you are thinking in binary.

Infrastructure is very important. (4, Interesting)

CherniyVolk (513591) | more than 4 years ago | (#29505475)

Infrastructure is more important than "best practices". Infrastructure is more of a physical, concrete aspect. Practices really aren't that important once the critical, physical disasters begin. As an example, good hardware will continue to run for years. Most of the downtime with good hardware will most likely be due to misconfiguration, human error, that sort of thing. A sysadmin banks on some wrong assumption, messes up a script, or hits the wrong command, but nonetheless the hardware is still physically able and therefore the infrastructure has not been jeopardized. If the situation is reversed, top-notch paper plans and procedures... with crappy hardware. Well... the realities of physical discrepancies are harder to argue with than our personal, nebulous, intangible, inconsequential philosophies of "good/better/best" management procedures/practices.

So to me the question "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices?" is best translated as "To belittle the concept of uptime and its association with reliability, are data centers relying too much on the raw realities of the universe and the physical laws that govern it, and not enough on some random guy's philosophies regarding problems that only manifest within our imaginations?"

Or, as a medical analogy... "In their efforts in curing cancer, are doctors relying too much on science and not enough on voodoo/religion?"

Re:Infrastructure is very important. (0)

Anonymous Coward | more than 4 years ago | (#29505697)

Clearly you have no idea what you're talking about.

You can't just throw hardware (or money) at a problem and expect things to work. You have to know what you're doing, and set things up properly (i.e. follow best practices).

Re:Infrastructure is very important. (1)

dissy (172727) | more than 4 years ago | (#29507219)

Clearly you have no idea what you're talking about.

You can't just throw hardware (or money) at a problem and expect things to work. You have to know what you're doing, and set things up properly (i.e. follow best practices).

You clearly have never worked for or at 90% of the companies out there.

Things being set up properly and to best standards does not often come into play.

Re:Infrastructure is very important. (1)

Bill, Shooter of Bul (629286) | more than 4 years ago | (#29506225)

It's really a false dichotomy. You need a little of both. You need to have your procedures matched to the reliability, performance, and architecture of your infrastructure.

Or, as a medical analogy... "In their efforts in curing cancer, are doctors relying too much on surgery and not enough on chemotherapy?"

location, location, location (0)

Anonymous Coward | more than 4 years ago | (#29509373)

Having your data center somewhere that:
- floods
- gets hurricanes
- gets earthquakes
- sits at the end of an airport runway
- has a bad power supply
- has bad network connections

is just crazy.

5 miles from me, there are streets and homes with 6 feet of water covering them.

I know of multiple telecom data centers that are in hurricane paths or other possible major accident locations that could take out the building.

If your data center doesn't have redundant power from 2 different substations AND the power company doesn't offer continuous power, WHAT ARE YOU THINKING? That's a resource room, not a data center.

Your network uplinks need to be redundant via 2 different providers.

If you can't afford this infra - pay for colocation where they do OR design all your systems to be geographically redundant. I think you'll find that most companies can't afford to do that.

Best Practices? (0)

Anonymous Coward | more than 4 years ago | (#29505481)

Whose best practices do you propose we follow?

As the article states, the current cost-cutting "best practices" are leading to mediocrity.

Re:Best Practices? (1)

JSG (82708) | more than 4 years ago | (#29509809)

I hate terms such as "best practi[c|s]e" and I certainly don't use them. I may tell a client I try to follow good practice. Claiming you know what constitutes "best practice" is arrogant and a sure sign of imminent failure.

Tiers and Data Center Redundancy (3, Insightful)

japhering (564929) | more than 4 years ago | (#29505521)

Data center redundancy is a needed thing. However, most data center designs forget to address the two largest causes of downtime... people and software. People are people and will always make mistakes; even so, there are still things that can be done to reduce the impact of human error.

Software is very rarely designed for use in redundant systems. More likely, the design is for a hot-cold or hot-warm recovery scenario. Very rarely is it designed to run hot-hot across multiple data centers.

Remember, good disaster avoidance is always cheaper than disaster recovery when done right.

Re:Tiers and Data Center Redundancy (1)

dkleinsc (563838) | more than 4 years ago | (#29505853)

Disaster avoidance is good, for sure, but that's not what your DR efforts are really for.

Here's the story (fairly well-covered by /. at the time) of why you have a disaster recovery system and plan in place: A university's computing center burned to the ground. The entire place. All the servers, all the onsite backups, all the UPS units, gone. Within 48 hours, they were back up and running. Not at 100% capacity, but they were running.

Re:Tiers and Data Center Redundancy (2, Insightful)

japhering (564929) | more than 4 years ago | (#29505965)

And if you had two identical data centers, where each in and of itself was redundant with software designed to function seamlessly across the two in a hot-hot configuration .. there would have been NO downtime.. the university would have been up the entire time with little to no data loss.

So say I'm Amazon and my data center burns down.. 48 hours with ZERO sales for a disaster recovery scenario vs normal operations for the time it takes to rebuild/move the burned data center..

I think I'll take disaster avoidance and keep selling things :-)

Re:Tiers and Data Center Redundancy (2, Insightful)

aaarrrgggh (9205) | more than 4 years ago | (#29506239)

Unless you were doing maintenance in the second facility when a problem hit the first. That is what real risk management is about; when you assume hot-hot will cover everything, you have to make sure that is really the case. Far too often there are a few things that will either cause data loss or significant recovery time even in a hot-hot system when there is a failure.

Even with hot-hot systems, all facilities should be reasonably redundant and reasonably maintainable. Fully redundant and fully maintainable can be a pipe-dream.

Re:Tiers and Data Center Redundancy (1)

Vancorps (746090) | more than 4 years ago | (#29508363)

I agree. As someone who runs two data centers that are both hot and both independently capable of handling the company load, I think it is still wise to have a scaled-back DR setup at yet another location. For me it's as simple as a tier-3 storage server with a 100 TB tape library backing things up properly. This gives me email archiving and records compliance. Of course, the tapes aren't stored at the DR site.

The problem is that there is no effective limit to how much redundancy you can have. The industry default in my experience is N+1, or normally three redundant systems, which I'm fortunate enough to have because I put on live events around the country. We have onsite redundancy: two racks physically split up but connected via fiber for synchronization. Failure of one rack results in the switch mesh directing the remaining traffic to the secondary rack. Beyond that, I have a direct fiber connection back to our HQ, where I have another copy of everything, usually about an hour behind, so a sudden data incident (such as the CEO deleting 200 gigs of marketing data) can easily be recovered with minimal impact. Of course, in that scenario, had I not been present, both live sites would have deleted the data, and then you have to go back to snapshots (if you're lucky and have ponied up for proper storage) or go back to tape.

DR is necessary for any company that relies on technology and access to data to make money. For the company I work for, both are critically necessary, so you see a large investment in redundancy, but we try to attack the most vulnerable areas first. When I started we were running on two servers: one was a database server, the other was a file/print server. If either one died, our event would come to a halt. To make matters even more lovely, there was a 50/50 chance of the database server's RAID array spinning up correctly on boot. The first thing I did was double the server count to create two clusters. Later we moved to a web service, so things got even more interesting. Now I'm up to 16 servers that I bring with me, and I'm looking to cut that in half with XenServer and some ProLiant SLs that give me good density without ridiculous power requirements. Life is better when you're not relying on luck for anything.

Re:Tiers and Data Center Redundancy (3, Interesting)

japhering (564929) | more than 4 years ago | (#29508365)

Precisely. I spent the last 12 years (prior to being laid off) working on a hot-hot-hot solution. Each center was fully redundant and ran at no more than 50% dedicated utilization. Each data center got one week of planned maintenance every quarter for hardware and software updates, during which that data center was completely offline, leaving a hot-hot solution. If something else happened, we still had a "live" data center while scrambling to recover the other two.

We ran completely without change windows: we would simply deadvertize an entire data center, do the work, readvertize, then move on to the next data center. In the event of something high-importance, say a CERT advisory requiring an immediate update, we would follow the same procedure as soon as all the requisite management paperwork was complete.

And yes, we were running some of the most visible and highest-traffic websites on the internet.

Database schema changes? (1)

adrianmsmith (1237152) | more than 4 years ago | (#29509633)

Were you running relational databases? What did you do about schema changes?

(i.e. presumably if you were running relational DBs then there would be one big data set which would be shared between all three sites; you couldn't e.g. deadvertize one site, change the schema, then readvertise, as then the schemas would be different...)

Re:Database schema changes? (1)

japhering (564929) | about 5 years ago | (#29511703)

Actually, we would deadvertize and stop the synchronization, then change the code and the schema in the database and readvertize, leaving the sync off; then move to the second site and do the same thing, but restart the sync between sites 1 and 2.

When site 3 was done, all three sites would, after a few minutes, be back in sync.
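
(A sketch of the rolling procedure described above, written as Python pseudocode. The deadvertize/readvertize/sync helpers are hypothetical stand-ins for whatever GSLB and replication tooling was actually in use; only the ordering of the steps comes from the comment.)

    # Rolling code + schema change across three hot sites, per the procedure above.
    # All helper functions are hypothetical placeholders that just log what they would do.
    SITES = ["site1", "site2", "site3"]

    def deadvertize(site): print(f"{site}: pulled from rotation, traffic drains away")
    def readvertize(site): print(f"{site}: advertised again, serving traffic")
    def set_sync(sites, on): print(f"replication {'on' if on else 'off'} for {sites}")
    def apply_change(site): print(f"{site}: new code deployed, schema altered")

    def rolling_schema_change():
        set_sync(SITES, on=False)             # stop cross-site replication first
        upgraded = []
        for site in SITES:
            deadvertize(site)                 # take the site out of rotation
            apply_change(site)                # change code and schema together
            readvertize(site)                 # back in rotation; sync still off for the rest
            upgraded.append(site)
            if len(upgraded) > 1:
                set_sync(upgraded, on=True)   # re-sync only the already-upgraded sites
        # after the last site, all three catch up within a few minutes

    rolling_schema_change()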

Re:Database schema changes? (1)

adrianmsmith (1237152) | about 5 years ago | (#29513607)

But surely if you readvertize and leave the sync off, then data inconsistencies will start to occur? (E.g., a modification to one database, and a different modification to the same row in another database.) How are these inconsistencies then reconciled?

Re:Database schema changes? (1)

japhering (564929) | about 5 years ago | (#29515027)

Depends on the schema change involved. We never saw any issues with adding or deleting a column, which was 95+% of what happened in my environment. As for the remaining changes, I don't ever remember things becoming inconsistent; it might have been pure luck, really good design and implementation, or just bad memory on my part (I'll have to query some of my former colleagues).

One thing to remember is that while all three sites were running the same application, an end user never, ever switched sites (unless the site failed or was taken down), so until the user finished the activity in the app, his data was isolated to one data center.

Re:Tiers and Data Center Redundancy (1)

sjames (1099) | more than 4 years ago | (#29507715)

Not necessarily. If the VALUE of those 48 hours' worth of sales is less than the COST of a hot-hot configuration, then you're wasting money. You also have to consider the number of sales NOT lost in the 48 hours. Depending on your reputation, value, and what you're selling, some people will just try again in a day or two. In other cases potential customers will just go to the next seller on Google. You need to know which scenario is more likely.
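
(The comparison described here is a straightforward expected-value calculation. Every number below is an invented placeholder; only the structure of the comparison comes from the comment.)

    # Is hot-hot worth it? Compare expected annual outage losses against the extra cost.
    # All figures are assumed placeholders, not numbers from this thread.
    outage_hours = 48
    sales_per_hour = 10_000        # normal revenue rate, $/hour
    fraction_who_retry = 0.6       # customers who simply come back a day or two later
    outages_per_year = 0.2         # expected major outages per year (one every five years)

    lost_per_outage = outage_hours * sales_per_hour * (1 - fraction_who_retry)
    expected_annual_loss = lost_per_outage * outages_per_year
    hot_hot_annual_cost = 300_000  # extra site, gear, and people for a second hot facility

    print(f"Expected annual loss without hot-hot: ${expected_annual_loss:,.0f}")
    print(f"Annual cost of hot-hot:               ${hot_hot_annual_cost:,.0f}")
    print("Hot-hot pays off" if hot_hot_annual_cost < expected_annual_loss
          else "Cheaper to eat the occasional outage")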

But it's never the software... (2, Insightful)

Sarten-X (1102295) | more than 4 years ago | (#29505537)

"A stick of RAM costs how much? $50?"

I don't remember the source of that quote, but it was in relation to a company spending money (far more than $50) to reduce the memory use of their program. Sure, there's a lot of talk in computer science curricula about using efficient algorithms, but from what I've seen and heard, companies almost always respond to performance problems by buying bigger and better hardware. If software weren't grossly inefficient, how would that affect data centers? Less power consumption, cheaper hardware, and more "bang for your buck", so to speak.

Eventually, this whole debate becomes moot, as data centers can get more income from the hardware, thus still providing the uptime, redundancy, and features without the need to cut costs. Once those basic needs are out of the way, there's room for expansion into other less-than-critical offerings, and finally, innovation in areas other than uptime.

Re:But it's never the software... (1)

alen (225700) | more than 4 years ago | (#29505707)

With the new Intel CPUs it's still cheaper to buy hardware than pay coders. Our devs need more space. It turns out it's cheaper to buy a new HP ProLiant G6 server than just more storage for their G4 server, and if we spent a bit more we could buy the power-efficient CPU, which runs an extra few hundred dollars. A coder will easily run you $100 per hour for salary, taxes, benefits, and the environmentals. A bare-bones HP ProLiant DL380 G6 server is $200 more than the lowest-priced iMac.

Re:But it's never the software... (4, Insightful)

Maximum Prophet (716608) | more than 4 years ago | (#29505873)

Code scales, hardware doesn't. If you have one machine, yes, it's cheaper to get a bigger, better machine, or to wait for one to be released.

If you have 20,000 machines, even a 10% increase in efficiency is important.

Re:But it's never the software... (1)

Sarten-X (1102295) | more than 4 years ago | (#29508847)

There's the CPU, plus the energy cost to produce it, the environmental waste of disposing of the old unit, the fuel to ship it, the labor to install it... Somebody pays for all of it, even if it isn't put on the 'new hardware' budget.

Also, I'm not suggesting paying for more programmers, or even demanding much more from existing programmers. All I suggest is that companies ought to push programmers to produce slightly-better programs, especially when they're going to be deployed in a data center environment.

Given current circumstances, it should be pretty easy for companies to hire the best and most well-educated programmers out there, and cheaply. A fresh-out-of-college computer science degree holder ought to be able to tell the difference between a basic O(n) algorithm and an O(n^6) one. Hire better programmers in the first place, and you can reduce or eliminate the expense of needing new hardware.
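
(To make the complexity point concrete, here is the more common O(n) vs O(n^2) version of the gap, a duplicate check done two ways; the O(n^6) case above is just a more extreme instance of the same idea.)

    # Duplicate detection two ways: quadratic pairwise comparison vs. a linear set scan.
    # On large inputs the algorithmic gap dwarfs anything a hardware upgrade can buy back.
    import random
    import time

    def has_dupes_quadratic(items):    # O(n^2): compares every pair
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                if items[i] == items[j]:
                    return True
        return False

    def has_dupes_linear(items):       # O(n): single pass with a set
        seen = set()
        for x in items:
            if x in seen:
                return True
            seen.add(x)
        return False

    data = random.sample(range(10**9), 10_000)   # 10k distinct values: worst case, no dupes
    for fn in (has_dupes_linear, has_dupes_quadratic):
        start = time.perf_counter()
        fn(data)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f} s")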

Re:But it's never the software... (2, Insightful)

Maximum Prophet (716608) | more than 4 years ago | (#29505825)

That works if you have one program that you have to run every so often to produce a report. If your datacenter is more like Google, where you have 100,000+ servers, a 10% increase in efficiency could eliminate 10,000 servers. Figure $1,000 per server and it would make sense to offer a $1,000,000 prize to a programmer that can increase the efficiency of the Linux kernel by > 10%.

By the way, adding one stick of RAM might increase the efficiency of a machine, but in the case above the machines are probably maxed out w.r.t. RAM. Adding more might not be an option without an expensive retrofit.

Re:But it's never the software... (0)

Anonymous Coward | more than 4 years ago | (#29505979)

I don't know about where you work, but here it is quite the opposite. We have testing machines that run on old, crappy hardware (running Linux, at least), each supporting tests on dozens of devices at a time.

We as the test programmers are constantly being told to reduce the CPU/Memory/Network footprint of our test scripts as these machines can barely handle it, and they are not about to spring for better hardware.

Re:But it's never the software... (1)

Sarten-X (1102295) | more than 4 years ago | (#29508647)

For what it's worth, I applaud your process. I'm assuming it doesn't get in the way of 'real progress' (however that may be defined at your company), but it seems to be a nice mix of theory and practice. If only more companies cared about performance on low-end hardware...

Re:But it's never the software... (2, Informative)

Mr. DOS (1276020) | more than 4 years ago | (#29506399)

Perhaps this TDWTF article [thedailywtf.com] is what you were thinking of?

      --- Mr. DOS

Re:But it's never the software... (1)

Sarten-X (1102295) | more than 4 years ago | (#29508601)

I do believe it is. Thanks!

The case presented there goes ridiculously far to the other extreme, but the principle is sound. A few rare memory leaks aren't a problem, but using a bubble sort on a million-item list is.

Perfect illustration (4, Insightful)

jeffmeden (135043) | more than 4 years ago | (#29505593)

Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar.

Repeat after me: There is no replacement for redundancy. There is no replacement for redundancy. Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*. Redundancy is irreplaceable. If you rely on your servers (the servers housed in one place) you had better have redundancy for EVERY. SINGLE. OTHER. ASPECT. If not, you can expect downtime, and you can expect it to happen at the worst possible moment.

Re:Perfect illustration (1)

QuantumRiff (120817) | more than 4 years ago | (#29505735)

There is no replacement for redundancy..

Sorry, I had to add a 3rd one to repeat.. I'm a bit more risk averse than you!

Re:Perfect illustration (1)

Jared555 (874152) | more than 4 years ago | (#29505961)

The issue is when the systems designed to create redundancy actually cause the failure (a transfer switch causing a short, etc.). Also, with a couple seconds of searching I was able to find one extended downtime caused by safety procedures and not by a lack of redundancy:

http://www.datacenterknowledge.com/archives/2008/06/01/explosion-at-the-planet-causes-major-outage/ [datacenterknowledge.com]

I have seen other cases where entire datacenters were shut down because some idiot hit the shutdown control (required by fire departments for safety reasons; you don't want thousands of amps flowing through a building you are spraying water into), etc.

Re:Perfect illustration (1)

aaarrrgggh (9205) | more than 4 years ago | (#29506275)

There is also N/2 redundancy when you talk about EPO systems-- each button only kills one cord per server, so you have to actually hit two buttons to shut everything down...

Increased complexity increases risk; the most elegant redundant systems are never tied together, and provide the greatest simplicity. The others ensure job security until the outage happens...

Re:Perfect illustration (1)

afidel (530433) | more than 4 years ago | (#29506479)

EPO buttons aren't allowed to work like that in most jurisdictions; the firefighters want to know that when they hit the red button, ALL power to the room is off.

Re:Perfect illustration (1)

aaarrrgggh (9205) | about 5 years ago | (#29516443)

That used to be the case, but we have successfully argued for it in every jurisdiction we have tried. With the 2008 NEC, claiming it is a COPS system will quickly let you eliminate an EPO in the traditional sense.

Dating back to 1993, there was never a NFPA requirement for a single button to kill everything; they allowed you to combine HVAC and power into a single button if desired.

Re:Perfect illustration (1)

jeffmeden (135043) | more than 4 years ago | (#29506307)

While it's hard to argue that outages would still occur from things like fires and explosions in a fully redundant environment, it's easy to connect the dots and notice that fully redundant systems rarely experience fires or explosions, if only for the fact that they spend almost all of their service lives operating at less than 50% capacity. Many "bargain basement" hosting companies (I won't name names) choose to run far closer to 80% or 90% of nameplate capacity because it's cheaper. Also, the question of transfer switches causing additional points of failure is valid; however, it's perfectly possible (and practiced by paranoid data center managers) to put two switches in parallel, and two or more switches (of varying types) in series, along the power path, so that a failure at any one spot (or in some cases even more than one failure) can be corrected with zero impact to the critical load.

Re:Perfect illustration (1)

afidel (530433) | more than 4 years ago | (#29506551)

Exactly. Unless the short in the transfer switch somehow gets through the UPS, how is it going to affect a truly redundant setup? I know that if one of my transfer switches died, it wouldn't do anything, as the systems would just carry on powered by the other power feed. If there is ANY single point of failure in your design, it WILL fail at some point; that's why the design guidelines matter.

Re:Perfect illustration (1)

sjames (1099) | more than 4 years ago | (#29507893)

I saw a datacenter go down because one of the batteries in one of the UPSes burst. The fire department then came in and hit the EPO. There exists no point where 100% of everything is fully accounted for; just when you think you've covered every last contingency, some country that's afraid of boobies will black-hole your IPs for you.

Meanwhile, the cost of each 9 is exponentially higher than the last one was.

Re:Perfect illustration (1)

mokus000 (1491841) | about 5 years ago | (#29511267)

Meanwhile, the cost of each 9 is exponentially higher than the last one was.

And its value is exponentially smaller.

Re:Perfect illustration (1)

adrianmsmith (1237152) | more than 4 years ago | (#29509697)

"The issue is when the systems designed to create redundancy actually cause the failure" - exactly.

For example, we had two Oracle systems (hot-cold) and one disk array connected to both systems. The second Oracle was triggered to start automatically when the first Oracle died. One time the second Oracle thought the first Oracle had died and started, even though the first Oracle hadn't died. (We never knew why it started.) Then we had two live instances writing to the same set of data files, and not knowing anything about each other - not good.

I'm not saying redundancy is bad, but it has consequences, and one of those consequences is complexity which can introduce its own downtimes.

Re:Perfect illustration (1)

mindstrm (20013) | about 5 years ago | (#29511911)

What was missing is colloquially called STONITH: Shoot The Other Node In The Head.

Re:Perfect illustration (2, Insightful)

Timothy Brownawell (627747) | more than 4 years ago | (#29506773)

Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*.

No, I've also heard about cases where both redundant systems failed at the same time (due to poor maintenance) and where the fire department won't allow the generators to be started. Everything within the datacenter can be redundant, but the datacenter itself still is a single physical location.

Redundancy is irreplaceable.

Distributed fault-tolerant systems are "better", but they're also harder to build. Likewise redundancy is more expensive than lack of redundancy, and if you have to choose between $300k/year for a redundant location with redundant people vs. a million-dollar outage every few years, well, the redundancy might not make sense.

Re:Perfect illustration (1)

R2.0 (532027) | more than 4 years ago | (#29507207)

Redundancy is a necessary condition for uptime, but not a sufficient one. You can have N+a-kagillion levels of redundancy, but if the equipment is neglected or procedures aren't followed, it means jack shit.

Added levels of redundancy can actually hurt overall reliability if they encourage maintenance to delay repairs and ignore problems because "we have backups for that".

One facility I worked on had half again more processing equipment than needed on the floor. Why? "Well, when one fails we just move production over to another one." Then they would leave the dead equipment on the floor until they found time to get it repaired, which could take months. You can guess where I'm going - they needed 20 machines, there were 30 on the floor, and 12 were down. And now they had 12 to repair, not 1 or 2.

Re:Perfect illustration (1)

drsmithy (35869) | more than 4 years ago | (#29507553)

Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*.

IME, most outages are due to software or process failures, not hardware.

Re:Perfect illustration (0)

Anonymous Coward | more than 4 years ago | (#29507643)

Hey! You're a pretentious faggot with this "repeat after me" bullshit. Hope you get hit by a truck.

Re:Perfect illustration (1)

sjames (1099) | more than 4 years ago | (#29507823)

The question is where to put the redundancy. If you have a DR site and the ability to do hot cut-over, you now have a redundant everything (assuming it actually works). While you wouldn't likely want to have no further redundancy, realistically you just need enough UPS time to make a clean cut-over. If you skip the N+1 everything else you might even be able to afford the much more valuable N+2 data centers.

Re:Perfect illustration (0)

Anonymous Coward | more than 4 years ago | (#29508183)

Pardon my cynicism, but I can't help wondering how many of these "hardware failures" are actually software failures.

Because they farmed critical code out to the low bidder and made their own developers redundant.

Re:Perfect illustration (1)

tdeaderick (1643079) | about 5 years ago | (#29517427)

Our company, OnePartner, was referenced in the article. I agree wholeheartedly: there is no replacement for redundancy. There's a point made in the original article that I find interesting. The folks on the "con" side of certification argue that not every application requires Tier III or IV. That *might* make sense if the cost of a Tier III were substantially higher than an uncertified data center. Our cabinet rates are lower than many of the uncertified data centers with established brands. If the prices are the same or better, what kind of person would decide that their application could get by with lower resiliency? This seems like the world's easiest IQ test to me.

pointless marketing (5, Informative)

vlm (69642) | more than 4 years ago | (#29505693)

Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators

I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up. And yes I've been involved in numerous power failure incidents (dozens) at numerous companies, and only experienced two incidents of successful backup of commercial power loss.

Transfer switches that don't switch. Generators that don't start below 50 degrees. Generators with empty fuel tanks staffed by smirking employees with diesel vehicles. When you're adding capacity to battery string A, and the contractor shorts out the mislabeled B bus while pulling cable for the "A" bus.

Experience shows that if a company's core competency is not running power plants, they would be better off not trying to build and maintain a small electrical power plant. Microsoft has conditioned users to expect failure and unreliability; use that conditioning to your advantage... the users don't particularly care if it's down because of an OS patch or a loss of -48VDC...

Re:pointless marketing (2, Insightful)

Ephemeriis (315124) | more than 4 years ago | (#29506063)

I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up.

A lot of folks don't really contemplate what a loss of power means to their business.

Some IT journal or salesperson or someone tells them that they need backup power for their servers, so they throw in a pile of batteries or generators or whatever... And when the power goes out they're left in dark cubicles with dead workstations. Or their manufacturing equipment doesn't run, so it doesn't really matter if the computers are up. Or all their internal network equipment is happy, but there's no electricity between them and the ISP - so their Internet is down anyway.

I'll stand behind a few batteries for servers... Enough to keep them running until they can shut down properly... But actually staying up and running while the power is out? From what I've seen that's basically impossible.

Re:pointless marketing (1)

afidel (530433) | more than 4 years ago | (#29506667)

We have a remote telco shelf powered by enough batteries to last 48 hours (not that they have ever been drained past 30 seconds except during a battery test), and the equipment it talks to is likewise powered by two sets of such batteries (but only one generator). Soon we will have feeds to two COs which take different egress paths from the city (one east, one west). We have dual generators, dual transfer switches, dual UPSes, and all equipment is obviously dual-power-supplied.

The only potential wrinkle would be actual delivery on our fuel contracts: our tanks are big enough to run for 48-72 hours, and we pay to be right behind hospitals and 911 centers for fuel, but you never know until the event whether they will actually deliver. On the positive front, we have a new generator with a massive tank to power our new building, and once it was obvious an outage was going to be extended we would likely send everyone home and could siphon fuel from that tank to power the two datacenter generators. Over half our company is located outside our power grid, so if we can stay up we should be able to keep the company functioning even in an extended outage.

Re:pointless marketing (1)

Aqualung812 (959532) | more than 4 years ago | (#29506763)

I'll stand behind a few batteries for servers... Enough to keep them running until they can shut down properly... But actually staying up and running while the power is out? From what I've seen that's basically impossible

Many businesses have dozens or hundreds of remote offices / branches / stores. If those stores depend on the HQ site to be running (as many or most do), then having a very reliable generator is critical.
Sure, if you lose power for a single site, your customers at that single site will be forgiving and don't expect you to have a generator at every store.
However, if your HQ is in Chicago and loses power for 12 hours from an ice storm, your customers that can't shop at your Palm Beach location are going to be pissed that you are now closed nationwide.

While on the topic, it's important to note that you do need a generator when we're talking about more than a few minutes of UPS time. I've found that 30-60 minutes is the most you can go in a small datacenter without AC before the server racks turn into red pools of molten metal. You're not running A/C on battery, at least not for very long.

Re:pointless marketing (1)

Ephemeriis (315124) | more than 4 years ago | (#29507683)

Many businesses have dozens or hundreds of remote offices / branches / stores. If those stores depend on the HQ site to be running (as many or most do), then having a very reliable generator is critical.
Sure, if you lose power for a single site, your customers at that single site will be forgiving and don't expect you to have a generator at every store.
However, if your HQ is in Chicago and loses power for 12 hours from an ice storm, your customers that can't shop at your Palm Beach location are going to be pissed that you are now closed nationwide.

If you're that big, I'd expect you to have multiple data centers distributed geographically. If your data center in Chicago loses power for 12 hours from an ice storm, I'd expect the Palm Beach store to be accessing a data center somewhere else.

Even with generators and whatnot... If there's an ice storm in Chicago you're likely looking at an outage. You'll have lines down, trees falling over, issues with your ISP and whatever else. Just keeping your data center up in the middle of that kind of havoc isn't going to do you much good.

Re:pointless marketing (1)

Aqualung812 (959532) | about 5 years ago | (#29513781)

I work in the Midwest; the last two companies I worked for had 20-40 locations each. At each place (5 years at one, 4 at the next), I had an 8-hour power outage. The first place didn't have generators, so all of their retail stores lost a day's worth of sales. The second place did have a generator, and everything worked fine. In both cases, the fiber optic service didn't fail. Since it was 100% underground, that was expected. Power lines just go down more often than SONET fiber does, pure and simple.
I'd say that having an event like that every 5 years is enough to pay for a generator. At the second place, we used an HP Alpha for the core application. Because it was a crappy application, it took 10 hours to fail over to our DR site. A generator was critical.

Re:pointless marketing (1)

mindstrm (20013) | about 5 years ago | (#29511939)

Basically impossible? All it takes is an adequate UPS setup, with a proper transfer switch and a diesel generator - and a proper maintenance plan to go with it. There's nothing hard or magical about it - it just costs more. Maintenance and fuel.

Plenty of places have proper backup facilities.

The main problem, at least in most of the 1st world, is that people are so used to reliable grid power that they don't think about it or see the risk. Look at any operation running somewhere where the power goes out on a frequent basis, and you'll find the above mentioned scenario very common.

Re:pointless marketing (1)

Ephemeriis (315124) | about 5 years ago | (#29514223)

The main problem, at least in most of the 1st world, is that people are so used to reliable grid power that they don't think about it or see the risk. Look at any operation running somewhere where the power goes out on a frequent basis, and you'll find the above mentioned scenario very common.

That may very well be true... I've never done any work outside of the US, so I have no idea what kind of scenario is common elsewhere. And maybe I've just been exposed to some fairly clueless people... But I've yet to see a backup power system do what people thought it was going to do - allow them to stay open for business while the grid goes down.

Basically impossible? All it takes is an adequate UPS setup, with a proper transfer switch and a diesel generator - and a proper maintenance plan to go with it. There's nothing hard or magical about it - it just costs more. Maintenance and fuel.

The actual quote was "From what I've seen that's basically impossible." I never claimed to be omniscient or omnipresent. I'm just basing my statements on my own observations.

Folks will put in an "adequate" UPS setup... And then ignore it for several years while they add more and more hardware... And then act amazed when everything falls over after a power outage - never realizing that the batteries hadn't been regularly tested and were dead or completely insufficient for the current needs.

Or they'll put in a generator and insufficient batteries to keep everything up while the generator starts.

Or they won't keep the generator fueled up and ready to go.

Or, my favorite - they'll actually do a great job of protecting the server room and completely forget about the rest of the building. The servers will be up and humming along, but there'll be no power to the workstations, so nobody can use those servers. Or maybe all the manufacturing equipment is unpowered so they can't actually do anything. And everybody winds up standing around, twiddling their thumbs anyway.

Plenty of places have proper backup facilities.

I'm glad. Really, I am. It's good to know that they're not all as bad as what I've seen. Reassuring, actually, to think that the hospital I wind up in might not be as clueless as the places that I've worked.

But that doesn't change anything about how horrible the backup plans have been at many (most?) of the places I've worked.

Re:pointless marketing (1)

AliasMarlowe (1042386) | about 5 years ago | (#29514625)

I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up.

A lot of folks don't really contemplate what a loss of power means to their business.
Some IT journal or salesperson or someone tells them that they need backup power for their servers, so they throw in a pile of batteries or generators or whatever... And when the power goes out they're left in dark cubicles with dead workstations. Or their manufacturing equipment doesn't run, so it doesn't really matter if the computers are up. Or all their internal network equipment is happy, but there's no electricity between them and the ISP - so their Internet is down anyway.
I'll stand behind a few batteries for servers... Enough to keep them running until they can shut down properly... But actually staying up and running while the power is out? From what I've seen that's basically impossible.

I've never had the headache of maintaining a business infrastructure, but must cope with our small setup at home. The LAN printer is the only IT thing without UPS power. The server, router, and optical switch are on one UPS. Two PCs each have their own smaller UPS which also power ethernet switches, and there's a laptop which obviously has battery power built-in. All of the computers, including the server, are configured to shutdown if the batteries go down to 20% (for the laptop, it's 10%).
We live in the countryside, so power outages happen (too often), especially the annoying 1-10 minute outages which mean someone is working on the power line. The optical fiber never seems to go down, so I guess they have good power at the other end and at any intermediate units. The kids keep on surfing the net (or doing homework, or whatever) right through most power outages. They know to finish up and save any work if the battery starts getting low, and get an automated warning when the battery gets down to 30%.
So our IT at home stays up and functioning for a while during power breaks. The TV and related stuff, on the other hand, are left at the mercy of the power company.

Re:pointless marketing (1)

Ephemeriis (315124) | about 5 years ago | (#29514823)

The server, router, and optical switch are on one UPS.

The optical fiber never seems to go down, so I guess they have good power at the other end and at any intermediate units.

I love how everyone else on the planet has fiber to their home now. Even folks in the countryside.

We moved out of town while two of the local ISPs were in the process of rolling out fiber all over town. We're only about 1 mile outside of the city, and all we have available is dial-up, cable, or satellite. It sucks.

We live in the countryside, so power outages happen (too often), especially the annoying 1-10 minute outages which mean someone is working on the power line.

I'm in a similar situation at home. I've got the individual desktops on batteries, and our server, and the network hardware. Pretty much everything except the printer. But our cable Internet does not stay up during power outages - even brief ones. And it seems to take their equipment a good 10-15 minutes to recover from a power outage.

Re:pointless marketing (0)

Anonymous Coward | more than 4 years ago | (#29506127)

So very, very true - but also false.

False, because big UPSes seem to work wonders against short power outages - say a 1-10 second outage. With a UPS, you don't even notice those under normal circumstances. And they occur maybe twice per year or more per DC. For this, UPSes help a LOT.

Then you have the redundancy-induced failures, where you mention faulty transfer switches that don't switch. The corollary to that is transfer switches that switch to the battery bank in error - with the generator saying "hmm, time to power on.. " and two seconds later saying "Hmm, actually, mains is here, time to power off" - going into a bloody generator on-off loop.

In addition, you have the nice situations where somebody does a Big Mistake during UPS maintenance almost killing themselves in the process (more common than you think).

I also haven't seen the "smirking employees" thing. The rest, however, rings very, very true.

(Posted anonymously to protect the innocent :-P)

Re:pointless marketing (1)

Geoff-with-a-G (762688) | more than 4 years ago | (#29506331)

Well, my experience is the opposite of your anecdata - our remote sites often experience grid power failures and the building UPS keeps the equipment running the whole time. However, those are smaller sites, not full size datacenters I'm talking about.

I will however say this about "high availability is hard": Often the redundancy mechanisms themselves are the source of outages. Not just power, but equipment, software, protocols... Maybe your RAID controller fails, instead of the drive. Maybe the HSRP/VRRP on your routers flaps or goes active/active or standby/standby. Maybe the technician servicing your generator/UPS/PDU kills the power on the floor.

The step up from "no redundancy systems" to "some redundancy systems" doesn't necessarily take you from 2-nines to 3 or 4. Sometimes it takes you from 3-nines down to 2. If you really want high availability, it's a question of approach and procedures as much as it is hardware and systems.

Re:pointless marketing (5, Interesting)

R2.0 (532027) | more than 4 years ago | (#29507027)

It's not just in IT. I work for an organization that uses a LOT of refrigeration in the form of walk-in refrigerators and freezers. Each one can hold product worth up to $1M, and all of it can be lost in a temperature excursion. So we started designing in redundancy: two separate refrigeration systems per box, a backup controller, and redundant power feeds from different transfer switches over diverse routing (Browns Ferry lessons learned). Oh, and each facility had twice as many boxes as needed for the inventory.

After installation, we began getting calls and complaints about how our "wonder boxes" were pieces of crap, that they were failing left and right, etc. We freaked out and did some analysis. Turns out that, in almost every instance, a trivial component had failed in one compressor, the system had failed over to the other, run for weeks or months, and then that one failed too. When we asked why they never fixed the first failure, they said "What failure?" When we asked about the alarm the controller gave due to mechanical failure, we were told that it had gone off repeatedly but was ignored because the temperature readings were still good and that's all Operations cared about. In some instances the wires to the buzzer were cut, and in one instance a "massive controller failure" was really a crash due to the system memory being filled by the alarm log.

Yes, we did some design changes, but we also added another base principle to our design criteria: "You can't engineer away stupid."

Re:pointless marketing (0, Redundant)

iamhigh (1252742) | more than 4 years ago | (#29507105)

Used my points already, but that was interesting to read.

Re:pointless marketing (0)

Anonymous Coward | more than 4 years ago | (#29507429)

that really was an interesting read

Re:pointless marketing (1)

R2.0 (532027) | more than 4 years ago | (#29508677)

Hmmm, 2 "interesting read" comments.

Why do I have the feeling that the next "See me in my office!" email won't be spam?

Re:pointless marketing (1)

DNS-and-BIND (461968) | about 5 years ago | (#29512271)

I wouldn't call it "stupid" at all. Failure to consider the human element and designing something for yourself is a classic mistake. Hell yeah if I'm getting paid crap, I'm sure as hell not caring about some guy's alarm. Is it broke? No, then don't fix it. Not surprising to me it turned out that way at all. You should look at old telco systems, they knew how to design around people.

Re:pointless marketing (1)

Part`A (170102) | about 5 years ago | (#29512917)

How about IBM's approach? Have the system contact and request a technician directly and charge them for a support contract or call out fee?

Re:pointless marketing (1)

S.P.Zeidler (227675) | about 5 years ago | (#29522767)

three things to say to this:

- unmaintained UPS is worse than none
- you need actual risk assessment to decide what quality of power backup you need
- a good line filter is essential (unless you don't care if all your equipment gets toasted)

If you are in an area where mains power is very reliable, your UPS will need to be very good to beat it, i.e., be rather expensive (so it's only useful if outages are very expensive for you). If you're looking at two outages from storms a year, at least getting something that will let the systems shut down gracefully will amortize itself real fast, even if you're just going to send people home for the day anyway.

So you look at likelihood and impact of power loss, and then decide what to do against it. And you make sure the result of the cogitations receives good and regular maintenance. That's best practise, too :>

RAID (4, Interesting)

QuantumRiff (120817) | more than 4 years ago | (#29505703)

Why go with a huge, multiple-9s datacenter, when you can go the way of Google and have a RAID:
Redundant Array of Inexpensive Datacenters.

Is it really better to have 1000 machines in a 5-9s location, or 500 systems each in two 4-9s locations, with extra cash in hand?

Re:RAID (0)

Anonymous Coward | more than 4 years ago | (#29505877)

Jesus christ you people are fucking morons.

Re:RAID (2, Informative)

jeffmeden (135043) | more than 4 years ago | (#29505903)

Why go with a huge, multiple-9s datacenter, when you can go the way of Google and have a RAID: Redundant Array of Inexpensive Datacenters. Is it really better to have 1000 machines in a 5-9s location, or 500 systems each in two 4-9s locations, with extra cash in hand?

That all depends. A 5 9s datacenter is a full ten times more reliable than a 4 9s datacenter (mathematically speaking). So, all things being equal (again, mathematically), you would need ten 4-9 centers to be as reliable as your one 5-9 center. However, geographic dispersion, outage recovery lead time, bandwidth costs, maintenance, etc. can all factor in to sway the equation either way. It really comes down to itemizing your outage threats, pairing that with the cost of redundancy for each threatened component, and then looking at the cost of downtime as part of the business process. It's rarely as simple as "why not just build two at twice the price".

Re:RAID (1)

brian_tanner (1022773) | more than 4 years ago | (#29506175)

I think your probability calculation might be a bit off. The math doesn't go through.

I should say ahead of time that I don't know much about these 4-9s vs 5-9s. I interpret them as the probability of not failing, i.e. 4-9s means 99.99%, which means the probability of failure is .0001. If that's wrong, the rest of this doesn't work out.

Let's try different numbers. Choice A has a 25% probability of failing; Choice B has a 1% probability of failing.

How many A do we need such that the probability of them all failing is less than 1%?

If I have 2xA, what is the probability that they both fail (assuming they are independent)?
P(A1 and A2) = .25 * .25 = .0625 (6.25%)
What if we add a third:
.25 * .25 * .25 = .015625 (1.56%)
And a fourth:
.25^4 = .0039 (0.39%)

So, four of these 25% data centers are better than a single 1% data center.

The case is even stronger for the 4-9s vs 5-9s example.

4-9s (if I understand) means 99.99%, i.e. a .01% chance of failure (P=.0001). 5-9s means 99.999%, or a .001% chance of failure (P=.00001).

2 x 4-9s is .0001 * .0001 = 0.00000001, which is a 0.000001% chance of failure, i.e. 99.999999% availability (8 nines).

To me, it makes perfect sense to do the "google" thing. This is exactly the reason that they fill their data centers with low-cost commodity hardware instead of high cost servers.
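
The arithmetic above can be checked mechanically. A small Python sketch, assuming independent failures (which real data centers rarely have):

    import math

    def combined_failure(p_fail: float, n_sites: int) -> float:
        """P(all n sites are down at once), assuming independent failures."""
        return p_fail ** n_sites

    def nines(availability: float) -> float:
        """How many nines an availability corresponds to, e.g. 0.9999 -> 4.0."""
        return -math.log10(1 - availability)

    for p, n in [(0.25, 2), (0.25, 3), (0.25, 4), (0.0001, 2)]:
        p_all = combined_failure(p, n)
        print(f"{n} sites at {p:.4%} failure each -> {p_all:.8%} combined "
              f"({nines(1 - p_all):.1f} nines)")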

Re:RAID (1)

arcade (16638) | more than 4 years ago | (#29506223)

Wrong.

0.01*0.01 = 0.0001

Which is ten times better than 0.001

Re:RAID (1)

aaarrrgggh (9205) | more than 4 years ago | (#29506345)

Even that can over-simplify the problem; when you have to take one system offline, what redundancy do you have left? Will one drive failure take you down?

To the GP's point, the problem isn't going from 1x 5x9's to 2x 4x9's; usually companies try to do 2x 3x9's facilities instead.

Redundancy is not Reliability is not Maintainability.

Re:RAID (1)

sjames (1099) | more than 4 years ago | (#29509463)

And they're still better off!

Re:RAID (1, Informative)

Anonymous Coward | more than 4 years ago | (#29507213)

Inaccurate math aside, "4 nines" is about 4 minutes per month, i.e., restart the machine at midnight on the first of the month. "5 nines" is about 5 minutes a year, a restart every Jan 1st. Properly managed, neither of these is particularly disruptive.

If your concern is unplanned outages, then two independent "4 nines" data centers have eight nines of reliability, because there's a 99.99% probability that the second data center will be functional when the first one goes down. Of course, you can't predict susceptibility to unplanned outages, so "4 nines" or "5 nines" in that context is a made-up number.
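
For reference, the downtime budgets quoted here follow directly from the definition of n nines; a quick Python sketch:

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    def downtime_minutes(n_nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
        """Maximum downtime in the period for an availability of n_nines nines."""
        unavailability = 10.0 ** -n_nines      # e.g. 4 nines -> 0.0001
        return unavailability * period_minutes

    for n in (2, 3, 4, 5):
        per_year = downtime_minutes(n)
        print(f"{n} nines: {per_year:7.1f} min/yr (about {per_year / 12:.1f} min/month)")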

Re:RAID (1)

andymadigan (792996) | more than 4 years ago | (#29508327)

Actually, I think your math is a bit off.

A 4 9s datacenter fails with probability .0001 (.01% of the time). The chance of two 4 9s datacenters failing simultaneously is .0001 squared (.00000001). A 5 9s datacenter fails with probability .00001. Therefore, two 4 9s datacenters are a thousand times as reliable as one 5 9s datacenter (assuming I did my math right). That's why RAID works.

Re:RAID (0)

Anonymous Coward | more than 4 years ago | (#29509449)

Not necessarily. A key point is that the nines are just assurances of percentage uptime. Assurances in the sense that they SAY it's that or we THINK it's that (more likely, we GUESS it's that as far as you know).

I don't care how many 9's a datacenter has; if it's on a fault line, in an area subject to wildfire, inclement weather, etc., it can be taken completely down in a way where it will stay down for days, weeks, or forever. A 4 9's datacenter faces those same risks.

Then there's the mathematical question. 5 9's means the datacenter will be down for not more than 1/100,000th of the operational time. In a year, that's about 5 minutes. At 4 9's it's 52 minutes. For two 4 9's datacenters to be as reliable as a single 5 9's datacenter, all they have to do is not have more than 5 minutes where their 52-minute downtimes overlap. In fact, two 2 9's datacenters (each with roughly 88 hours of downtime a year) will add up to 5 9's, as long as their downtimes overlap for less than 5 minutes.

Also note the important weasel words. 5 9's means no more than 5 minutes of UNPLANNED/UNSCHEDULED downtime in a year. Basically if they announce that it will happen, anything goes as far as the SLA.
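
A rough Python sketch of the overlap point, assuming outages fall at independent, uniformly random times (real outages are correlated by storms and grid events, so this is an optimistic, purely illustrative number):

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    def expected_overlap(down_a_min: float, down_b_min: float,
                         period_min: float = MINUTES_PER_YEAR) -> float:
        """Expected minutes per period during which both sites happen to be down."""
        return (down_a_min / period_min) * (down_b_min / period_min) * period_min

    # Two 4-nines sites, each down ~52 minutes a year at random times.
    overlap = expected_overlap(52, 52)
    print(f"expected simultaneous downtime: {overlap * 60:.1f} seconds per year")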

Re:RAID (1)

drsmithy (35869) | more than 4 years ago | (#29507609)

Why go with a huge, multiple 9's datacenter, when you can go the way of google, and have a RAID: Redundant Array of Inexpensive Datacenters..

Because most systems don't scale horizontally and most businesses don't have the resources of Google to create their own that do.

Re:RAID (1)

dkf (304284) | more than 4 years ago | (#29509997)

Is it really better to have 1000 machines in a 5-9's location, or 500 systems each in a 4-9's, with extra cash in hand?

Remember that the main problems with these datacenters are in networking (because that can propagate failures) and automated failover systems. Given that, go for the cash in hand, since you can do other stuff with that (including buying disaster recovery insurance if appropriate).

uptime matters (2, Insightful)

Spazmania (174582) | more than 4 years ago | (#29505861)

Designing nontrivial systems without single points of failure is difficult and expensive. Worse, it has to be built in from the ground up. Which it rarely is: by the time a system is valuable enough to merit the cost of a failover system, the design choices which limit certain components to single devices have long since been made.

Which means uptime matters. 1% downtime is more than 3 days a year. Unacceptable.

The TIA-942 data center tiers are a formulaic way of achieving satisfactory uptime. They've been carefully studied, and statistically tier-3 data centers achieve three 9's of uptime (99.9%) while tier-4 data centers achieve four 9's. Tiers 1 and 2 only achieve two 9's.

Are there other ways of achieving the same or better uptime? Of course. But they haven't been as carefully studied, which means you can't assign as high a confidence to your uptime estimate.

Is it possible to build a tier-4 data center that doesn't achieve four 9's? Of course. All you have to do is put your eggs in one basket (like buying all the same brand of UPS) and then have yourself a cascade failure. But with a competent system architect, a tier-4 data center will tend to achieve at least 99.99% annual uptime.
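
A toy Python sketch of why the redundant paths the higher tiers call for push availability up; the component availabilities are invented round numbers, not Uptime Institute or TIA-942 figures:

    def series(*availabilities: float) -> float:
        """All components must be up (e.g. utility -> UPS -> PDU in one power path)."""
        a = 1.0
        for x in availabilities:
            a *= x
        return a

    def parallel(*availabilities: float) -> float:
        """At least one path must be up; independent failures assumed."""
        p_all_down = 1.0
        for x in availabilities:
            p_all_down *= 1 - x
        return 1 - p_all_down

    single_path = series(0.999, 0.998, 0.9995)       # one power path, three components
    dual_path = parallel(single_path, single_path)   # 2N: two independent paths
    print(f"single path: {single_path:.5f}, dual path: {dual_path:.7f}")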

Ask a European banker (1)

drdrgivemethenews (1525877) | more than 4 years ago | (#29506025)

European bank IT people are some of the most conservative and risk-averse people on the planet. If you ask them which is more important, infrastructure or best practices, they will answer "Yes."
----------
Change is inevitable. Progress is not.

Re:Ask a European banker (1, Interesting)

Anonymous Coward | more than 4 years ago | (#29508297)

I work for a very very large European bank. And yes - we're highly risk averse.

Here's the interesting thing - we built a bunch of Tier 3 and Tier 4 datacenters because the infrastructure guys thought that it was what the organization needed.

But they didn't talk to the consumers of their services - the application development folks.

So what do we have -

Redundant datacenters with redundant power supplies with redundant networks with redundant storage networks with redundant WAN connections with redundant database servers running in them.

The app guys then said "to hell with this" - whenever we try to fail over, we can't get it to work anyway because there's always something small the infra guys missed - so they built HA and auto-failover into their applications or, better still, live-live applications.

Hence all of the redundant infrastructure is ... redundant. And a complete waste of money.

In the end the app devs want cheap infrastructure and full control over it, which is why they all want to go buy the cheapest hosting they can get.

The good thing is that it's now apparent that this is the case. And so, soon, there will probably be a lot of redundant infrastructure people - and that will be a good thing.

The Uptime Institute, in my mind, bears the biggest piece of accountability for peddling this rubbish.

It's a scam.

Most uptime for the dollar is a bad idea (1)

sirwired (27582) | more than 4 years ago | (#29506179)

On a strict IT budget cost-effectiveness basis, the most uptime for your dollar will be Windows (Windows admins practically grow on trees, so they are cheap) on some commodity pizza-box servers, connected to some cheap NAS storage and networked with crap switches. If you are an IT manager looking for your short-term bonus before you move on to greener pastures, this is a great idea! There is a good chance you will be able to hold things together long enough to get your bonus, and then get outta there.

Of course, if you actually care about the business IT is supposed to support, you will get a setup that's a bit more trustworthy. But if the IT manager isn't incentivized for long-term uptime stats, it just isn't gonna happen.

SirWired

Re:Most uptime for the dollar is a bad idea (1)

Alpha830RulZ (939527) | more than 4 years ago | (#29506651)

Or, to be slightly more robust: Windows or Linux on redundant commodity boxes, with mid-grade disk and network components, set up in redundant locations, will serve a lot of needs at lower cost.

Not to go all MBA on you or anything, but a smart management team would look at the cost of providing the last 9 of reliability, weigh it against the cost of x days of outage multiplied by a reasonable estimate of the likelihood of that outage, and then ask: does it make financial sense to insure against the extremely improbable? I'll get flamed for saying this, but I don't think it makes economic sense to protect against ALL possible disasters, as some just can't be mitigated economically. The present value of the saved expense may well exceed the present value of the expected loss of business. An HR system -can- be down for 3 days without killing a business, while Amazon's web presence/order system probably can't. But without working the numbers, you can't know. Committing to an approach without working the numbers is voodoo planning.

I expect most management teams would fire an IT manager who did this kind of thinking, but that is how I would run my own business.
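
A back-of-the-envelope Python sketch of the calculation described above; every figure (hours of downtime, cost per hour, cost of the extra nine) is a hypothetical placeholder:

    def expected_loss(downtime_hours_per_year: float, cost_per_hour: float) -> float:
        """Expected annual cost of the downtime you have not engineered away."""
        return downtime_hours_per_year * cost_per_hour

    # Hypothetical system: ~8.8 h/yr of downtime at three nines vs ~0.9 h/yr at four nines.
    loss_at_3_nines = expected_loss(8.8, cost_per_hour=2_000)
    loss_at_4_nines = expected_loss(0.9, cost_per_hour=2_000)
    extra_nine_cost = 50_000      # hypothetical annual cost of the added redundancy

    avoided = loss_at_3_nines - loss_at_4_nines
    print(f"loss avoided: ${avoided:,.0f}/yr vs extra cost: ${extra_nine_cost:,.0f}/yr")
    print("the extra nine pays for itself" if avoided > extra_nine_cost
          else "cheaper to accept the downtime")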

Just remember data centers aren't important (0)

Anonymous Coward | more than 4 years ago | (#29507245)

Libraries worked just fine, for thousands of years.

Re:Just remember data centers aren't important (1)

mokus000 (1491841) | about 5 years ago | (#29511337)

The one at Alexandria would've benefitted from more offsite backup.
