Are Data Center "Tiers" Still Relevant?
miller60 writes "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices? That question is at the heart of an ongoing industry debate about the merits of the tier system, a four-level classification of data center reliability developed by The Uptime Institute. Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators. Uptime says that many industries continue to require mission-critical data centers with high levels of redundancy, which are needed to perform maintenance without taking a data center offline. Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar."
Re: (Score:1)
The only way I can come up with an analogy nonsensical enough to involve prejudice and eyes is if it contains a car.
Re: (Score:1)
Tiers/Tears
Prejudice as in why should data center tears be more relevant than any other tears.
No (Score:1)
Re: (Score:2)
The Tier Guidelines as the Uptime Institute presents them are utterly useless beyond C-Suite penis posturing. However, it is important for a company to establish what its needs are.
However, most serious players customize it based on their own needs and risk assessments. Redundant UPS systems, for example, offer more benefit than redundant utility transformers. Redundant generators offer less benefit than redundant starting batteries plus proper maintenance and testing. Mechanically, over-sizing so
Re: (Score:3, Informative)
Because of this our data center has redundant UPSes and redundant generators. All but the least critical servers have dual power supplies, plugged into independent circuits.
We have multiple ACs but they ar
It depends (Score:5, Interesting)
Re: (Score:2)
It's very confusing to me. These guys say they are the only Tier III in the US according to ATAC, whoever they are. Apparently there are no Tier IVs that outsource. So I would think your statement is correct. --haven't reached the scale where you can geographically diversify your operations-- means they probably don't need above Tier II and their own backups.
http://www.onepartner.com/v3/atac_tier.html [onepartner.com]
Re: (Score:2)
Infrastructure is very important. (Score:5, Interesting)
Infrastructure is more important than "best practices". Infrastructure is more of a physical, concrete aspect. Practices really aren't that important once the critical, physical disasters begin. As an example, good hardware will continue to run for years. Most of the downtime with good hardware will be due to misconfiguration, human error, that sort of thing. A sysadmin banks on some wrong assumption, messes up a script, or hits the wrong command, but nonetheless the hardware is still physically able and therefore the infrastructure has not been jeopardized. If the situation is reversed, top-notch paper plans and procedures with crappy hardware... well, the realities of physical discrepancies are harder to argue with than our personal, nebulous, intangible, inconsequential philosophies of "good/better/best" management procedures/practices.
So to me the question "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices?" is best translated as "To belittle the concept of uptime and its association with reliability, are data centers relying too much on the raw realities of the universe and the physical laws that govern it, and not enough on some random guy's philosophies regarding problems that only manifest within our imaginations?"
Or, as a medical analogy... "In their efforts in curing cancer, are doctors relying too much on science and not enough on voodoo/religion?"
Re: (Score:2)
Clearly you have no idea what you're talking about.
You can't just throw hardware (or money) at a problem and expect things to work. You have to know what you're doing, and set things up properly (i.e. follow best practices).
You clearly have never worked for or at 90% of the companies out there.
Things being setup properly and by best standards does not often come into play.
Re: (Score:2)
It's really a false dichotomy. You need a little of both. You need to have your procedures matched to the reliability, performance and architecture of your infrastructure.
Or, as a medical analogy... "In their efforts in curing cancer, are doctors relying too much on surgery and not enough on chemotherapy?"
Re: (Score:2)
I hate terms such as "best practi[c|s]e" and I certainly don't use them. I may tell a client I try to follow good practice. Claiming you know what constitutes "best practice" is arrogant and a sure sign of imminent failure.
Tiers and Data Center Redundancy (Score:4, Insightful)
Data center redundancy is a necessary thing. However, most data center designs forget to address the two largest causes of downtime ... people and software. People are people and will always make mistakes; even so, there are still things that can be done to reduce the impact of human error.
Software is very rarely designed for use in redundant systems. More likely, the design is for a hot-cold or hot-warm recovery scenario. Very rarely is it designed for multiple hot instances across multiple data centers.
Remember, good disaster avoidance is always cheaper than disaster recovery when done right.
Re: (Score:2)
Disaster avoidance is good, for sure, but that's not what your DR efforts are really for.
Here's the story (fairly well-covered by /. at the time) of why you have a disaster recovery system and plan in place: A university's computing center burned to the ground. The entire place. All the servers, all the onsite backups, all the UPS units, gone. Within 48 hours, they were back up and running. Not at 100% capacity, but they were running.
Re: (Score:3, Insightful)
And if you had two identical data centers, where each in and of itself was redundant with software designed to function seamlessly across the two in a hot-hot configuration .. there would have been NO downtime.. the university would have been up the entire time with little to no data loss.
So say I'm Amazon and my data center burns down.. 48 hours with ZERO sales for a disaster recovery scenario vs normal operations for the time it takes to rebuild/move the burned data center..
I think I'll take disaster avoi
Re: (Score:3, Insightful)
Unless you were doing maintenance in the second facility when a problem hit the first. That is what real risk management is about; when you assume hot-hot will cover everything, you have to make sure that is really the case. Far too often there are a few things that will either cause data loss or significant recovery time even in a hot-hot system when there is a failure.
Even with hot-hot systems, all facilities should be reasonably redundant and reasonably maintainable. Fully redundant and fully maintain
Re: (Score:2)
I agree. As someone who runs two data centers that are both hot and both independently capable of handling the company load, I think it is still wise to have a scaled-back DR at yet another location. For me it's as simple as a tier-3 storage server with a 100TB tape library backing things up properly. This gives me email archiving and records compliance. Of course the tapes aren't stored at the DR site.
The problem is that there is no effective limit to how much redundancy you can have. The industry default in
Re:Tiers and Data Center Redundancy (Score:4, Interesting)
Precisely. I've spent the last 12 years (prior to being laid off) working in a hot-hot-hot solution. Each center was fully redundant and ran at no more than 50% dedicated utilization. Each data center got one week of planned maintenance every quarter for hardware and software updates, during which that data center was completely offline, leaving a hot-hot solution. If something else happened we still had a "live" data center while scrambling to recover the other two.
We ran completely without change windows: we would simply deadvertize an entire data center, do the work, and readvertize, then move on to the next data center. In the event of high importance, say a CERT advisory requiring an immediate update, we would follow the same procedures just as soon as all the requisite mgmt paperwork was complete.
And yes, we were running some of the most visible and highest traffic websites on the internet.
Database schema changes? (Score:1)
Were you running relational databases? What did you do about schema changes?
(i.e. presumably if you were running relational DBs then there would be one big data set which would be shared between all three sites; you couldn't e.g. deadvertize one site, change the schema, then readvertise, as then the schemas would be different...)
Re: (Score:2)
Actually, we would deadvertize, and stop the synchronization, then change the code and the schema in the database and readvertize leaving the sync off, move to the second site do the same thing but restart the sync between sites 1 and 2..
When site 3 was done.. then all three sites would after a few minutes, be back in sync.
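The rolling procedure described in this subthread can be sketched as a loop over sites (a minimal sketch; the hook names `deadvertise`, `stop_sync`, `migrate`, etc. are hypothetical, not from any actual tooling):

```python
def rolling_schema_change(sites, deadvertise, readvertise, stop_sync, start_sync, migrate):
    """Sketch of a rolling code + schema change across hot sites.

    Each site is drained, isolated from replication, migrated, and then
    resynced only with peers that are already on the new schema.
    """
    upgraded = []
    for site in sites:
        deadvertise(site)           # drain user traffic from this site
        stop_sync(site)             # isolate it while its schema differs from its peers
        migrate(site)               # apply the code and schema change together
        readvertise(site)           # back in rotation, replication still off
        for peer in upgraded:
            start_sync(site, peer)  # resync only with already-upgraded peers
        upgraded.append(site)
    return upgraded
```

With three sites this reproduces the sequence above: after site 2, sync restarts between sites 1 and 2; after site 3, all three are back in sync.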
Re: (Score:1)
Re: (Score:2)
Depends on the schema change involved. Never saw any issues with adding or deleting a column, which was 95+% of what happened in my environment. As for the remaining changes, I don't ever remember things becoming inconsistent; might have been pure luck, really good design and implementation, or just bad memory on my part (I'll have to query some of my former colleagues).
One thing to remember is that while all three sites were running the same application, the end user never, ever switched sites (unless the site failed
Re: (Score:2)
Not necessarily. If the VALUE of those 48 hours worth of sales is less than the COST of a hot-hot configuration then you're wasting money. You also have to consider the number of sales NOT lost in the 48 hours. Depending on your reputation, value, and what you're selling some people will just try again in a day or two. In other cases potential customers will just go to the next seller on Google. You need to know which scenario is more likely.
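That trade-off can be made concrete with a back-of-the-envelope expected-value check (a sketch; every input here is an assumption you'd have to estimate for your own business):

```python
def hot_hot_worth_it(annual_hot_hot_cost, outage_hours, sales_per_hour,
                     recapture_rate, annual_outage_probability):
    """Crude expected-value test for paying for a hot-hot second site.

    recapture_rate: fraction of would-be buyers who simply come back
    after the outage instead of going to a competitor.
    """
    lost_sales = outage_hours * sales_per_hour * (1 - recapture_rate)
    expected_annual_loss = lost_sales * annual_outage_probability
    return expected_annual_loss > annual_hot_hot_cost
```

If customers mostly come back and outages are rare, the hot-hot site loses the comparison; if they defect to the next seller on Google, it wins.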
But it's never the software... (Score:2, Insightful)
"A stick of RAM costs how much? $50?"
I don't remember the source of that quote, but it was in relation to a company spending money (far more than $50) to reduce the memory use of their program. Sure, there's a lot of talk in computer science curricula about using efficient algorithms, but from what I've seen and heard, companies almost always respond to performance problems by buying bigger and better hardware. If software weren't grossly inefficient, how would that affect data centers? Less power consumpti
Re: (Score:2)
With the new Intel CPUs it's still cheaper to buy hardware than to pay coders. Our devs need more space; it turns out it's cheaper to buy a new HP ProLiant G6 server than just more storage for their G4 server. And if we spent a bit more we could buy the power-efficient CPU, which will run an extra few hundred $$$. A coder will easily run you $100 per hour for salary, taxes, benefits and the environmentals. A bare-bones HP ProLiant DL380 G6 server is $200 more than the lowest-priced iMac
Re:But it's never the software... (Score:5, Insightful)
If you have 20,000 machines, even a 10% increase in efficiency is important.
Re: (Score:1)
There's the CPU, plus the energy cost to produce it, the environmental waste of disposing the old unit, the fuel to ship it, the labor to install it... Somebody pays for all of it, even if it isn't put on the 'new hardware' budget.
Also, I'm not suggesting paying for more programmers, or even demanding much more from existing programmers. All I suggest is that companies ought to push programmers to produce slightly-better programs, especially when they're going to be deployed in a data center environment.
Giv
Re: (Score:3, Insightful)
BTW, adding one stick of RAM might increase the efficiency of a machine, but in the case above, the machines are probably
Re: (Score:1)
For what it's worth, I applaud your process. I'm assuming it doesn't get in the way of 'real progress' (however that may be defined at your company), but it seems to be a nice mix of theory and practice. If only more companies cared about performance on low-end hardware...
Re: (Score:3, Informative)
Perhaps this TDWTF article [thedailywtf.com] is what you were thinking of?
--- Mr. DOS
Re: (Score:1)
I do believe it is. Thanks!
The case presented there is ridiculously going to the other extreme, but the principle is sound. A few rare memory leaks aren't a problem, but using a bubble sort on a million-item list is.
Perfect illustration (Score:5, Insightful)
Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar.
Repeat after me: There is no replacement for redundancy. There is no replacement for redundancy. Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*. Redundancy is irreplaceable. If you rely on your servers (the servers housed in one place) you had better have redundancy for EVERY. SINGLE. OTHER. ASPECT. If not, you can expect downtime, and you can expect it to happen at the worst possible moment.
Re: (Score:2)
Sorry, I had to add a 3rd one to repeat.. I'm a bit more risk averse than you!
Re: (Score:2)
The issue is when the systems designed to create redundancy actually cause the failure (a transfer switch causing a short, etc.) Also with a couple seconds of searching I was able to find one extended downtime caused by safety procedures and not lack of redundancy:
http://www.datacenterknowledge.com/archives/2008/06/01/explosion-at-the-planet-causes-major-outage/ [datacenterknowledge.com]
I have seen other cases where entire datacenters were shut down because some idiot hit the shutdown control (required by fire departments for safet
Re: (Score:2)
There is also N/2 redundancy when you talk about EPO systems: each button only kills one cord per server, so you have to actually hit two buttons to shut everything down...
Increased complexity increases risk; the most elegant redundant systems are never tied together, and provide the greatest simplicity. The others ensure job security until the outage happens...
Re: (Score:2)
Re: (Score:2)
That used to be the case, but we have successfully argued for it in every jurisdiction we have tried. With the 2008 NEC, claiming it is a COPS system will quickly let you eliminate an EPO in the traditional sense.
Dating back to 1993, there was never a NFPA requirement for a single button to kill everything; they allowed you to combine HVAC and power into a single button if desired.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
I saw a datacenter go down because one of the batteries in one of the UPSes burst. The fire department then came in and hit the EPO. There exists no point where 100% of everything is fully accounted for. Just when you think you've covered every last contingency, some country that's afraid of boobies will black-hole your IPs for you.
Meanwhile, the cost of each 9 is exponentially higher than the last one was.
Re: (Score:1)
Meanwhile, the cost of each 9 is exponentially higher than the last one was.
And its value is exponentially smaller.
Re: (Score:1)
"The issue is when the systems designed to create redundancy actually cause the failure" - exactly.
For example, we had two Oracle systems (hot-cold) and one disk array connected to both systems. The second Oracle was triggered to start automatically when the first Oracle died. One time the second Oracle thought the first Oracle had died and started, even though the first Oracle hadn't died. (We never knew why it started.) Then we had two live instances writing to the same set of data files, and not knowin
Re: (Score:1)
What was missing is colloquially called STONITH: Shoot The Other Node In The Head.
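The fencing idea can be sketched as a small decision function (a minimal illustration, assuming a hypothetical `fence_peer` hook that forcibly powers off the other node, e.g. via a managed PDU, and returns True only on confirmed success):

```python
def failover_decision(peer_heartbeat_ok: bool, fence_peer) -> str:
    """STONITH-style takeover logic: never start a second instance
    until the first is provably dead."""
    if peer_heartbeat_ok:
        return "standby"   # primary looks healthy; do nothing
    if not fence_peer():
        return "standby"   # can't prove the peer is dead; refusing avoids split-brain
    return "takeover"      # peer is powered off; safe to open the shared datafiles
```

In the Oracle incident above, a fence step would have kept the second instance from ever writing to the shared datafiles while the first was still alive.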
Re: (Score:3, Insightful)
Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*.
No, I've also heard about cases where both redundant systems failed at the same time (due to poor maintenance) and where the fire department won't allow the generators to be started. Everything within the datacenter can be redundant, but the datacenter itself still is a single physical location.
Redundancy is irreplaceable.
Distributed fault-tolerant systems are "better", but they're also harder to build. Likewise redundancy is more expensive than lack of redundancy, and if you have to choose between $300k/year for a redundant location
Re: (Score:1)
Redundancy is a necessary condition for uptime, but not a sufficient condition. You can have N+a kagillion levels of redundancy, but if the equipment is neglected or procedures aren't followed, it means jack shit.
Added levels of redundancy can actually hurt overall reliability, if it encourages maintenance to delay repairs and ignore problems because "we have backups for that".
One facility I worked on had half again more processing equipment than needed on the floor. Why? "Well, when one fails we just
Re: (Score:2)
Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*.
IME, most outages are due to software or process failures, not hardware.
Re: (Score:2)
The question is where to put the redundancy. If you have a DR site and the ability to do hot cut-over, you now have a redundant everything (assuming it actually works). While you wouldn't likely want to have no further redundancy, realistically you just need enough UPS time to make a clean cut-over. If you skip the N+1 everything else you might even be able to afford the much more valuable N+2 data centers.
Re: (Score:1)
pointless marketing (Score:5, Informative)
Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators
I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up. And yes I've been involved in numerous power failure incidents (dozens) at numerous companies, and only experienced two incidents of successful backup of commercial power loss.
Transfer switches that don't switch. Generators that don't start below 50 degrees. Generators with empty fuel tanks staffed by smirking employees with diesel vehicles. When you're adding capacity to battery string A, and the contractor shorts out the mislabeled B bus while pulling cable for the "A" bus.
Experience shows that if a company's core competency is not running power plants, it would be better off not trying to build and maintain a small electrical power plant. Microsoft has conditioned users to expect failure and unreliability; use that conditioning to your advantage... the users don't particularly care whether it's down because of an OS patch or a loss of -48VDC...
Re: (Score:3, Insightful)
I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up.
A lot of folks don't really contemplate what a loss of power means to their business.
Some IT journal or salesperson or someone tells them that they need backup power for their servers, so they throw in a pile of batteries or generators or whatever... And when the power goes out they're left in dark cubicles with dead workstations. Or their manufacturing equipment doesn't run, so it doesn't really matter if the computers are up. Or all their internal network equipment is happy, but there's no electricity
Re: (Score:2)
Re: (Score:2)
Many businesses have dozens or hundreds of remote offices / branches / stores. If those stores depend on the HQ site to be running (as many or most do), then having a very reliable generator is critical.
Sure, if you lose power for a single site, your customers at that single site will
Re: (Score:2)
Many businesses have dozens or hundreds of remote offices / branches / stores. If those stores depend on the HQ site to be running (as many or most do), then having a very reliable generator is critical.
Sure, if you lose power for a single site, your customers at that single site will be forgiving and don't expect you to have a generator at every store.
However, if your HQ is in Chicago and loses power for 12 hours from an ice storm, your customers that can't shop at your Palm Beach location are going to be pissed that you are now closed nationwide.
If you're that big, I'd expect you to have multiple data centers distributed geographically. If your data center in Chicago loses power for 12 hours from an ice storm, I'd expect the Palm Beach store to be accessing a data center somewhere else.
Even with generators and whatnot... If there's an ice storm in Chicago you're likely looking at an outage. You'll have lines down, trees falling over, issues with your ISP and whatever else. Just keeping your data center up in the middle of that kind of havoc isn
Re: (Score:2)
I'd sa
Re: (Score:1)
Basically impossible? All it takes is an adequate UPS setup, with a proper transfer switch and a diesel generator - and a proper maintenance plan to go with it. There's nothing hard or magical about it - it just costs more. Maintenance and fuel.
Plenty of places have proper backup facilities.
The main problem, at least in most of the 1st world, is that people are so used to reliable grid power that they don't think about it or see the risk. Look at any operation running somewhere where the power goes o
Re: (Score:2)
The main problem, at least in most of the 1st world, is that people are so used to reliable grid power that they don't think about it or see the risk. Look at any operation running somewhere where the power goes out on a frequent basis, and you'll find the above mentioned scenario very common.
That may very well be true... I've never done any work outside of the US, so I have no idea what kind of scenario is common elsewhere. And maybe I've just been exposed to some fairly clueless people... But I've yet to see a backup power system do what people thought it was going to do - allow them to stay open for business while the grid goes down.
Basically impossible? All it takes is an adequate UPS setup, with a proper transfer switch and a diesel generator - and a proper maintenance plan to go with it. There's nothing hard or magical about it - it just costs more. Maintenance and fuel.
The actual quote was "From what I've seen that's basically impossible." I never claimed to be omniscient or omnipresent. I'm just basing my statements on my
Re: (Score:2)
I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up.
A lot of folks don't really contemplate what a loss of power means to their business.
Some IT journal or salesperson or someone tells them that they need backup power for their servers, so they throw in a pile of batteries or generators or whatever... And when the power goes out they're left in dark cubicles with dead workstations. Or their manufacturing equipment doesn't run, so it doesn't really matter if the computers are up. Or all their internal network equipment is happy, but there's no electricity between them and the ISP - so their Internet is down anyway.
I'll stand behind a few batteries for servers... Enough to keep them running until they can shut down properly... But actually staying up and running while the power is out? From what I've seen that's basically impossible.
I've never had the headache of maintaining a business infrastructure, but must cope with our small setup at home. The LAN printer is the only IT thing without UPS power. The server, router, and optical switch are on one UPS. Two PCs each have their own smaller UPS which also power ethernet switches, and there's a laptop which obviously has battery power built-in. All of the computers, including the server, are configured to shutdown if the batteries go down to 20% (for the laptop, it's 10%).
We live in the
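That shutdown policy amounts to a one-line threshold check; a trivial sketch (thresholds taken from the comment above):

```python
def should_shutdown(charge_percent: float, is_laptop: bool = False) -> bool:
    """Battery shutdown policy: shut down at 20% charge, or 10% for the laptop."""
    threshold = 10.0 if is_laptop else 20.0
    return charge_percent <= threshold
```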
Re: (Score:2)
The server, router, and optical switch are on one UPS.
The optical fiber never seems to go down, so I guess they have good power at the other end and at any intermediate units.
I love how everyone else on the planet has fiber to their home now. Even folks in the countryside.
We moved out of town while two of the local ISPs were in the process of rolling out fiber all over town. We're only about 1 mile outside of the city, and all we have available is dial-up, cable, or satellite. It sucks.
We live in the countryside, so power outages happen (too often), especially the annoying 1-10 minute outages which mean someone is working on the power line.
I'm in a similar situation at home. I've got the individual desktops on batteries, and our server, and the network hardware. Pretty much everything except the printer. But our cable Internet
Re: (Score:2)
Well, my experience is the opposite of your anecdata - our remote sites often experience grid power failures and the building UPS keeps the equipment running the whole time. However, those are smaller sites, not full size datacenters I'm talking about.
I will however say this about "high availability is hard": Often the redundancy mechanisms themselves are the source of outages. Not just power, but equipment, software, protocols... Maybe your RAID controller fails, instead of the drive. Maybe the HSRP/V
Re:pointless marketing (Score:5, Interesting)
It's not just in IT. I work for an organization that uses a LOT of refrigeration in the form of walk-in refrigerators and freezers. Each one can hold product worth up to $1M and all can be lost in a temperature excursion. So we started designing in redundancy: 2 separate refrigeration systems per box, backup controller, redundant power feeds from different transfer switches over diverse routing (Browns Ferry lessons learned). Oh, and each facility had twice as many boxes as needed for the inventory.
After installation, we began getting calls and complaints about how our "wonder boxes" were pieces of crap, that they were failing left and right, etc. We freaked out and did some analysis. Turns out that, in almost every instance, a trivial component had failed in 1 compressor and the system had failed over to the other system, run for weeks or months, and then that failed too. When we asked why they never fixed the first failure, they said "What failure?" When we asked about the alarm the controller gave due to mechanical failure, we were told that it had gone off repeatedly but was ignored because the temperature readings were still good and that's all Operations cared about. In some instances the wires to the buzzer were cut, and in one instance, a "massive controller failure" was really a crash due to the system memory being filled by the alarm log.
Yes, we did some design changes, but we also added another base principle to our design criteria: "You can't engineer away stupid."
Re: (Score:1, Redundant)
Re: (Score:1)
Hmmm, 2 "interesting read" comments.
Why do I have the feeling that the next "See me in my office!" email won't be spam?
Re: (Score:2)
Re: (Score:1)
How about IBM's approach? Have the system contact and request a technician directly and charge them for a support contract or call out fee?
Re: (Score:1)
three things to say to this:
- unmaintained UPS is worse than none
- you need actual risk assessment to decide what quality of power backup you need
- a good line filter is essential (unless you don't care if all your equipment gets toasted)
If you are in an area where mains power is very reliable, your UPS will need to be very good to beat it, ie be rather expensive (so only useful if outages are very expensive for you); if you're looking at two outages from storms a year, at least getting something that will
RAID (Score:5, Interesting)
Redundant Array of Inexpensive Datacenters..
Is it really better to have 1000 machines in a 5-9's location, or 500 systems in each of two 4-9's locations, with extra cash in hand?
Re: (Score:3, Informative)
Why go with a huge, multiple-9's datacenter, when you can go the way of Google, and have a RAID: Redundant Array of Inexpensive Datacenters. Is it really better to have 1000 machines in a 5-9's location, or 500 systems in each of two 4-9's locations, with extra cash in hand?
That all depends. A 5-9s datacenter is a full ten times more reliable than a 4-9s datacenter (mathematically speaking). So, all things being equal (again, mathematically), you would need ten 4-9 centers to be as reliable as your one 5-9 center. However, geographic dispersion, outage recovery lead time, bandwidth costs, maintenance, etc. can all factor in to sway the equation either way. It really comes down to itemizing your outage threats, pairing that with the cost of redundancy for each threatened comp
Re: (Score:1)
I should say ahead of time, I don't know much about these 4-9s vs 5-9s. I interpret them as probabilities of not failing, i.e., 4-9s means 99.99%, which means the probability of failure is
Let's try different numbers. Choice A has a probability of 25% of failing, Choice B has a probability of 1% of failing.
How many A do we need such that the probability of t
Re: (Score:2)
Wrong.
0.01*0.01 = 0.0001
Which is ten times better than 0.001
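The arithmetic in this subthread generalizes: for independent failures, the combined failure probability is the product of the individual probabilities. A minimal sketch:

```python
def downtime_minutes_per_year(nines: int) -> float:
    """Allowed annual downtime at 'n nines' of availability."""
    unavailability = 10.0 ** -nines
    return unavailability * 365 * 24 * 60

def combined_availability(*site_availabilities: float) -> float:
    """Availability of independent redundant sites, where any one site
    keeps you up: you fail only if every site fails at once."""
    failure = 1.0
    for a in site_availabilities:
        failure *= 1.0 - a
    return 1.0 - failure
```

So two independent 99% sites give 99.99% combined, and two independent 4-9s sites give 8 nines - assuming the failures really are independent, which geography and shared software can easily break.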
Re: (Score:2)
Even that can over-simplify the problem; when you have to take one system offline, what redundancy do you have left? Will one drive failure take you down?
To the GP's point, the problem isn't going from 1x 5x9s to 2x 4x9's, usually companies try to do 2x 3x9's facilities instead.
Redundancy is not Reliability is not Maintainability.
Re: (Score:2)
And they're still better off!
Re: (Score:1, Informative)
Inaccurate math aside, "4 nines" is about 4 minutes per month, i.e., restart the machine at midnight on the first of the month. "5 nines" is about 5 minutes a year, a restart every Jan 1st. Properly managed, neither of these is particularly disruptive.
If your concern is unplanned outages, then two independent "4 nines" data centers have eight nines of reliability, because there's a 99.99% probability that the second data center will be functional when the first one goes down. Of course, you can't predict susceptibilit
Re: (Score:2)
A 4 9s datacenter fails
Re: (Score:2)
Why go with a huge, multiple 9's datacenter, when you can go the way of google, and have a RAID: Redundant Array of Inexpensive Datacenters..
Because most systems don't scale horizontally and most businesses don't have the resources of Google to create their own that do.
Re: (Score:2)
Is really better to have 1000 machines in a 5-9's location, or 500 systems each in a 4-9's, with extra cash in hand?
Remember that the main problems with these datacenters are in networking (because that can propagate failures) and automated failover systems. Given that, go for the cash in hand, since you can do other stuff with that (including buying disaster recovery insurance if appropriate).
uptime matters (Score:3, Insightful)
Designing nontrivial systems without single points of failure is difficult and expensive. Worse, it has to be built in from the ground up. Which it rarely is: by the time a system is valuable enough to merit the cost of a failover system, the design choices which limit certain components to single devices have long since been made.
Which means uptime matters. 1% downtime is more than 3 days a year. Unacceptable.
The TIA-942 data center tiers are a formulaic way of achieving satisfactory uptime. They've been carefully studied and statistically tier-3 data centers achieve three 9's uptime (99.9%) while tier-4 data centers achieve four 9's. Tiers 1 and 2 only achieve two 9's.
Are there other ways of achieving the same or better uptime? Of course. But they haven't been as carefully studied, which means you can't assign as high a confidence to your uptime estimate.
Is it possible to build a tier-4 data center that doesn't achieve four 9's? Of course. All you have to do is put your eggs in one basket (like buying all the same brand of UPS) and then have yourself a cascade failure. But with a competent system architect, a tier-4 data center will tend to achieve at least 99.99% annual uptime.
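Those availability figures translate directly into allowed downtime per year; a quick sketch (the tier-to-nines mapping is taken from the comment above, not quoted from the standard text):

```python
# Statistical availability by tier, as cited above: tiers 1-2 achieve
# two 9's, tier 3 three 9's, tier 4 four 9's.
TIER_AVAILABILITY = {1: 0.99, 2: 0.99, 3: 0.999, 4: 0.9999}

def annual_downtime_hours(tier: int) -> float:
    """Expected hours of downtime per year implied by a tier's availability."""
    return (1.0 - TIER_AVAILABILITY[tier]) * 365 * 24
```

Three 9's works out to about 8.8 hours of downtime a year; four 9's to under an hour.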
Ask a European banker (Score:1)
----------
Change is inevitable. Progress is not.
Re: (Score:1, Interesting)
I work for a very very large European bank. And yes - we're highly risk averse.
Here's the interesting thing - we built a bunch of Tier 3 and Tier 4 datacenters because the infrastructure guys thought that it was what the organization needed.
But they didn't talk to the consumers of their services - the application development folks.
So what do we have -
Redundant datacenters with redundant power supplies with redundant networks with redundant storage networks with redundant WAN connections with redundant data
Most uptime for the dollar is a bad idea (Score:2)
On a strict IT budget cost-effectiveness basis, the most uptime for your dollar will be Windows (Windows admins practically grow on trees, so they are cheap) on some commodity Pizza Box servers, connected to some cheap NAS storage and networked with crap switches. If you are an IT manager looking for your short-term bonus before you move onto greener pastures, this is a great idea! There is a good chance you will be able to hold things together long enough to get your bonus, and then get outta there.
Of co
Re: (Score:2)
Or, to be slightly more robust, Windows or Linux on redundant commodity boxes, with mid-grade disk and network components, set up in redundant locations, will serve a lot of needs for lower cost. Not to go all MBA on you or anything, but a smart management team would look at the cost of providing the last 9 of reliability, weigh it against the cost of x days of outage multiplied by some reasonable estimate of the likelihood of the outage, and then ask: does it make financial sense to insure against the extremely
Re: (Score:1)
The one at Alexandria would've benefitted from more offsite backup.