
Quickly Switching Your Servers to Backups?

Cliff posted more than 7 years ago | from the fast-failover dept.

The Internet 73

moogoogaipan writes "After a few days of thinking about the quickest way to bring my website back to internet users, I am still stuck at DNS. From experience, even if I set the TTL for my DNS zone file as low as 5 minutes, there are still DNS servers out there that won't update until a few days later (yeah, I'm looking at you, AOL). Here is my situation: say I have web servers and database servers at a remote backup location, ready to serve. If we get hit by an earthquake at our main location, what can I do in a few hours to get everyone to go to our backup location?"


BGP (5, Informative)

ckdake (577698) | more than 7 years ago | (#19043309)

Same provider at both (N) locations, same IPs for servers/services; just don't advertise the prefixes via BGP from the backup location until the primary one goes down.
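For illustration only, here is a minimal Python sketch of the watchdog half of that idea: a script at the backup site probes the primary and only triggers the local announcement after several consecutive failures. The probe address, thresholds, and the announce/withdraw step are all assumptions; in a real deployment that last step would push a change to your edge router or BGP daemon rather than just print.

    #!/usr/bin/env python3
    """Sketch of a backup-site watchdog: announce the shared prefix only once
    the primary has failed several consecutive health checks. The addresses
    and thresholds are examples; set_announcement() is a placeholder."""
    import socket
    import time

    PRIMARY_PROBE = ("203.0.113.10", 80)   # example service address at the primary site
    FAILS_BEFORE_TAKEOVER = 5              # require several consecutive failures
    CHECK_INTERVAL = 30                    # seconds between probes

    def primary_is_up(addr, timeout=5):
        """True if we can open a TCP connection to the primary's service port."""
        try:
            with socket.create_connection(addr, timeout=timeout):
                return True
        except OSError:
            return False

    def set_announcement(active):
        """Placeholder: enable or withdraw the BGP advertisement at this site,
        e.g. by reconfiguring your edge router or signalling your BGP daemon."""
        print("ANNOUNCE prefix" if active else "WITHDRAW prefix")

    failures = 0
    announcing = False
    while True:
        if primary_is_up(PRIMARY_PROBE):
            failures = 0
            if announcing:                 # primary is back; stop advertising here
                set_announcement(False)
                announcing = False
        else:
            failures += 1
            if failures >= FAILS_BEFORE_TAKEOVER and not announcing:
                set_announcement(True)
                announcing = True
        time.sleep(CHECK_INTERVAL)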

Re:BGP (4, Informative)

georgewilliamherbert (211790) | more than 7 years ago | (#19044895)

Bingo.

This is exactly what BGP (or OSPF feeding in to your providers' BGP) is made for.

If you're big enough, you can get your own AS number and do this without having the same provider at each end (useful if the disaster that happens is that the software on all of provider X's core routers goes insane all at once, which happens from time to time).

DNS just can't be assumed to fail over fast enough for very-high-reliability services. You can do DNS right, low TTLs and all, and some providers will still cache the results and do the wrong thing, and some client systems will never look up the changed record if the old IP stops responding, until someone restarts the browser or reboots the workstation.
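If you want to see this for yourself, here is a small Python sketch that asks a few public resolvers what TTL they are actually handing out for a record. The hostname and resolver list are just examples, and it shells out to the standard dig utility.

    #!/usr/bin/env python3
    """Ask several resolvers what TTL they report for a record.
    Requires the 'dig' utility; hostname and resolver list are examples."""
    import subprocess

    HOSTNAME = "www.example.com"                   # replace with your record
    RESOLVERS = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]  # public resolvers to sample

    for resolver in RESOLVERS:
        # +noall +answer prints only the answer section: name TTL class type rdata
        out = subprocess.run(
            ["dig", "@" + resolver, HOSTNAME, "A", "+noall", "+answer"],
            capture_output=True, text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 5 and fields[3] == "A":
                print(f"{resolver}: {fields[0]} TTL={fields[1]} -> {fields[4]}")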

BGP.

Re:BGP (2, Interesting)

sjames (1099) | more than 7 years ago | (#19047115)

In addition, make sure that your routers at both locations have a routable IP NOT in your IP block and use that as the source for your BGP sessions, AND make sure you can log in remotely using that address (perhaps from a short list of external IPs you control). That way you can log in to each and confirm that the route is actually being announced by only one of them. Not all failover situations will take your network down completely (but unless you have a way to trigger the failover yourself, you'll sure wish they did).

Re:BGP (1)

mother_reincarnated (1099781) | more than 7 years ago | (#19051151)

DNS doesn't do the job seamlessly 100% of the time, but BGP isn't robust enough to do the job alone. Solution: do both.

Re:BGP (1)

sjames (1099) | more than 7 years ago | (#19053151)

Actually, BGP is certainly robust enough, unless your provider configures their router very badly. The advantage, if they do, is that as a paying customer you have some leverage to get it corrected, or you can change providers. By contrast, the DNS configurations in question sit on third-party servers and are beyond corrective reach.

That's not to say there are no problems at all with BGP, but those typically have more to do with router admins who haven't yet noticed that a few formerly reserved ranges are now valid, and with admins who try to make do with a router that's too small and compensate by rejecting perfectly valid /24 entries to save table space. There's little one can do about those, since it's their network that's broken. Of course, those will be ongoing rather than failover-specific problems.

Re:BGP (1)

mother_reincarnated (1099781) | more than 7 years ago | (#19057303)

At the point where you are doing site-to-site DR via BGP, you are most likely going to be running both your own edge routers and your own DNS-based GSLB servers.

When I mentioned robustness I was really thinking of two specific issues:
-partial failures (where not all of your services are down)
-routing instability directly following a major 'disaster'

I've seen people use iBGP w/ route health injection to move traffic internally between DC's, but AS-at-a-time is not particularly granular.

Users well outside of a disaster event are also much more likely to be able to get DNS resolution and a stable route to the DR network if it is in a wholly separate AS.

I still vote for the best solution being both your own AS + BGP and DNS-based GSLB using ISP IPs...

Re:BGP (1)

sjames (1099) | more than 7 years ago | (#19058151)

For cases where a handful of servers go down in an otherwise functional facility, simple failover will do fine: a backup server takes over the relevant IP address, and/or you double up a service on another machine and alias the IP. For example, I run DNS on my web server but block external queries. Should a DNS server go away, I add its address on the web server and add a permit rule to the firewall.

Another strategy for single-server failures is to NAT its IP over to its counterpart in the DR location and make the DNS change. Much of the traffic will transition when the TTL expires, but AOL and other lame networks may bounce through the NAT for a while. The benefit is practically instant switchover; the cost is doubling the bandwidth use.
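For concreteness, a rough Python sketch of both moves, as shell-outs to iproute2 and iptables. The addresses and interface name are examples, and it has to run as root.

    #!/usr/bin/env python3
    """Sketch of the two takeover moves described above. Addresses and the
    interface name are examples; run as root on the surviving/NAT box."""
    import subprocess

    FAILED_IP = "192.0.2.10"        # address of the dead server (example)
    DR_IP = "198.51.100.10"         # its counterpart at the DR site (example)
    IFACE = "eth0"                  # interface on the surviving local box

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def alias_locally():
        """Option 1: a surviving local server takes over the failed address."""
        run(["ip", "addr", "add", FAILED_IP + "/32", "dev", IFACE])
        # open the firewall for the service that address carried (DNS here)
        run(["iptables", "-I", "INPUT", "-d", FAILED_IP,
             "-p", "udp", "--dport", "53", "-j", "ACCEPT"])

    def nat_to_dr():
        """Option 2: bounce traffic for the failed address to the DR counterpart
        while the DNS change propagates (costs double bandwidth). The NAT box
        also needs IP forwarding enabled."""
        run(["iptables", "-t", "nat", "-A", "PREROUTING", "-d", FAILED_IP,
             "-j", "DNAT", "--to-destination", DR_IP])
        run(["iptables", "-t", "nat", "-A", "POSTROUTING", "-d", DR_IP,
             "-j", "MASQUERADE"])

    if __name__ == "__main__":
        alias_locally()   # or nat_to_dr(), depending on the failure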

Unfortunately, those are the options.

Re:BGP (1)

opec (755488) | more than 7 years ago | (#19053129)

Could we possibly fit any more acronyms in there?

Server Clusters (2, Informative)

Tuoqui (1091447) | more than 7 years ago | (#19043313)

An NLB (Network Load Balancing) cluster: link the two together and have them both serve the website. Not only will it not go down (barring freak accidents like both locations being hit at once), but it will also have the added benefit of presumably double the bandwidth and such.

The only problem with putting them in two separate locations is that they need to be able to communicate with each other, keep identical copies of the website, and be able to connect to any databases you may need.

Basically, any server-clustering setup where you can connect the two remotely would probably be a good starting point for your website, assuming it is that important that it never goes down.

Re:Server Clusters (5, Funny)

Tackhead (54550) | more than 7 years ago | (#19043579)

> The only problem with putting them in two separate locations is that they need to be able to communicate with each other, keep identical copies of the website, and be able to connect to any databases you may need.

Depending on the industry, that's a very real problem.

Sysadmin: "Don't worry, we're already switched over to the hot spare, just get out of there!"
CIO: "What if the whole building goes?"
Sysadmin: "No worries. Remember that $1M we spent stringing all that fiber over to the other datacenter?"
CIO: "Oh yeah, the one in WTC 2!"
Sysadmin: "Aaw, shit."

Re:Server Clusters (0)

Anonymous Coward | more than 7 years ago | (#19045015)

Haha, very good. I laughed out loud

Re:Server Clusters (1)

akita (16773) | more than 7 years ago | (#19045363)

I fail to see how this is flamebait.

Re:Server Clusters (1)

Anonymous Coward | more than 7 years ago | (#19046689)

How is this flamebait?

The canonical example is "the failover system is on the floor below us" / "the AC failed and dumped 100 gallons of water into the machine room floor" / "oh shit".

Guaranteeing sub-15-minute failover is hard. Many IT departments in the industry aim for 5 minutes of downtime per year. An earthquake wasn't likely to hit NYC, and even the "worst-case scenario" of an airplane hitting the tower would only destroy one tower, and putting the hot spare in "the other tower" was a perfectly good strategy... for about 17 minutes.

Time costs money. How long can you afford to be down?

Re:Server Clusters (1)

Sobrique (543255) | more than 7 years ago | (#19048863)

This is why many sites are now moving to a three-site model: a 'local failover' copy plus a 'nearline async' copy. Certainly the SAN vendors are on the bandwagon; with EMC kit (and probably others, but EMC's is the one whose training course I've been on) you can do 'SRDF/STAR', which basically does synchronous storage replication between two sites, with asynchronous replication to a third, remote site (so it can be geographically further away; with synchronous Fibre Channel replication you hit latency problems as the distance grows).

Get the ISP involved (3, Insightful)

MeanMF (631837) | more than 7 years ago | (#19043345)

Talk to your ISP. They can set it up so the IP addresses at the main location can be rerouted to the DR site almost instantly.

Scale back your expectations (5, Informative)

eln (21727) | more than 7 years ago | (#19043413)

You could spend a bundle of money doing global load balancing and maintaining a full hot spare site, or you could figure out how critical it really is that your website be up within 5 minutes of some major disaster like an earthquake.

In the event of a major disaster, the need for "immediate" recovery is actually defined as being able to be back up and running within 24 hours of the event. This is true even for business critical functions. Unless your business would cease to exist within 24 hours if your website went down, I would consider a 72 hour return to service to be perfectly adequate, and it would cost a whole lot less time and money to set up. Keep in mind that we are talking about an eventuality that would only occur if your primary site was entirely disabled for an extended period of time, which is highly unlikely to happen if you're hosted in any kind of modern data center.

Re:Scale back your expectations (1)

BlackSnake112 (912158) | more than 7 years ago | (#19044353)

Talk to trading companies; 5-10 minutes could mean millions. (So that VP kept telling me.)

We had a major failure of all systems one day. All the servers were up and running fine when we looked at them locally. What went down was the network itself (not the database/web/file servers), the actual network blade chassis. The individual blades were working fine; we had communication on each blade, just not across blades. We asked the network guys why they didn't have a backup plan. We had to keep 5 sets of backups (backup on site, backup off site local, backup off site within an hour, backup off site regional, backup off site far away), and it was a pain to keep all the servers synced. The network guys were the ones who got the 5-layer backup model approved by the board. Then their blade network went down.

I still think someone hacked into the blade and killed it, because all the hardware worked fine. The blade BIOS was cleared (according to the network guys). And adding 5x to everything you wanted to do made getting things approved a lot harder. To this day I do not like blade-based servers and network components: the blade enclosure is the single point of failure.

Re:Scale back your expectations (5, Insightful)

eln (21727) | more than 7 years ago | (#19044409)

OK, sure, but if you're the kind of company that is making millions of dollars a minute through your website, you're paying qualified IT professionals to go out and spend a bundle of money on an architecture that allows full global load balancing with constant mirroring, probably with dedicated circuits between sites. You are definitely not posting Ask Slashdot articles about how to get around other ISPs' annoying habit of holding on to DNS records for too long.

Re:Scale back your expectations (1)

vux984 (928602) | more than 7 years ago | (#19048635)

Even if it's only 2 million dollars (which is about as low as it can be and still be called 'millions')...

24 hours
x 60 minutes/hour = 1,440 minutes
x 365 days/year = 525,600 minutes
x $2,000,000/minute = $1,051,200,000,000

A trillion dollars! A hundred times what even Google's total revenues are. It's a problem nobody has.

And let's face it, if your "website" is generating that kind of revenue, you wouldn't have an ISP. You'd BE an ISP, with your own peering, fiber, and so on.

Re:Scale back your expectations (2, Insightful)

autocracy (192714) | more than 7 years ago | (#19049997)

Millions of dollars a minute in revenue isn't what he meant, I think. A bank's revenue, for example, is much less than the amounts it moves. A stock exchange's as well.

Also, consider that $2 million x 60 minutes x 40 hours x 52 weeks comes out to $249,600,000,000. I bet Bank of America sees a billion dollars a day move. Peak transaction volume is often used when calculating potential loss, so it may only be $2 million/minute during the busiest hour, but that's always the hour you'll fail during ;)

Re:Scale back your expectations (1)

vux984 (928602) | more than 7 years ago | (#19056641)

How can not being able to move X dollars be construed as an X-dollar "loss"? Say my online banking goes down and for an entire day I'm unable to transfer my Mom X, move Y from chequing to savings, or purchase Z in mutual funds. It's absurd to suggest the bank lost X+Y+Z. Even *I* didn't lose X+Y+Z from the inconvenience, and it's my money.

Re:Scale back your expectations (4, Insightful)

CrankyOldBastard (945508) | more than 7 years ago | (#19045151)

"You could spend a bundle of money doing global load balancing and maintaining a full hot spare site, or you could figure out how critical it really is that your website be up within 5 minutes of some major disaster like an earthquake."

I wish I still had mod points left. Many people overlook that you should always compare the cost of a disaster/break-in/breakdown to the cost of being prepared for it. I've seen situations where over $10,000 was spent on a bug that would have cost about $200 in downtime. Similarly, I've seen a few thousand dollars spent fixing a bug that affected one customer who was paying $15 per month.

Re:Scale back your expectations (0)

Anonymous Coward | more than 7 years ago | (#19046143)

In the event of a major disaster, the need for "immediate" recovery is actually defined as being able to be back up and running within 24 hours of the event. This is true even for business critical functions.

LOL. You say "actually defined" as if there is a reference definition for "immediate".

If there is a definition for "immediate", it's not sensitive to context. Immediate is immediate. There's not a different meaning for "immediate" in the context of a major disaster.

In any context, poll 100 people and tell me how many define "immediate" to be "within 24 hours". Very few I'll bet. So you're free to live by your own definition, but your choice of words is going to confuse the hell out of other people.

Re:Scale back your expectations (1)

CBravo (35450) | more than 7 years ago | (#19048463)

Keep in mind that we are talking about an eventuality that would only occur if your primary site was entirely disabled for an extended period of time, which is highly unlikely to happen if you're hosted in any kind of modern data center
Ha, this happened to IBM last year. The whole site in Belgium burned down while they were upgrading the electrical supply. Our contracts said 72 hours, but it turned into a week and a half.

Re:Scale back your expectations (1)

afabbro (33948) | more than 7 years ago | (#19058935)

In the event of a major disaster, the need for "immediate" recovery is actually defined as being able to be back up and running within 24 hours of the event.

Defined by who? Someone without a passable command of the English language?

(If your answer is any sort of government agency, then I was right.)

geographical load balancing (1)

Doobian Coedifier (316239) | more than 7 years ago | (#19043431)

A geographic load-balancing solution, such as Coyote Point's Envoy [coyotepoint.com] or F5's Global Traffic Manager [f5.com], would do it. Very expensive, though.

Re:geographical load balancing (2, Informative)

passthecrackpipe (598773) | more than 7 years ago | (#19045473)

F5 mainly uses DNS for its Global Traffic Management solution. There are other bits and pieces, but that is the core, really.

Um (2, Funny)

Anonymous Coward | more than 7 years ago | (#19043507)

You could hire an actual IT administrator who knows what they're doing? Like, one who's actually trained?

Re:Um (2, Funny)

CRiMSON (3495) | more than 7 years ago | (#19043943)

But then it wouldn't be another half-assed implementation. Come on, what were you thinking?

Location (1)

vertinox (846076) | more than 7 years ago | (#19043569)

I hate to say this, but sometimes you want to avoid hosting locations in places that are earthquake, hurricane, or just natural disaster prone if it is that critical.

Then again, even places like NYC fall victim to total power-grid failure once in a blue moon, so you do want some type of clustering in place, as others have mentioned. I can't tell you how many times in IT I've heard someone say, "Some guy in Georgia just dug up a major fiber cable with his backhoe!"

Re:Location (0)

Anonymous Coward | more than 7 years ago | (#19043781)

Honestly, I got a good place for ya
www.teamnet.net

Iowa FTW

Re:Location (1)

RobNich (85522) | more than 7 years ago | (#19056919)

What an absolutely horridly shitty site! It's impossible to see the html version, and the flash version has links that go to items that are wrong!

For instance, try to compare the specs of the two locations. They're the same, except for the first page! But the link to the second location goes to the second page, so you can't even tell that it's a different location!

WHY IS THIS A FLASH SITE?

Re:Location (2, Informative)

bahwi (43111) | more than 7 years ago | (#19043809)

I've got my personal/small-business server (it does nothing but serve a crappy webpage, so it's not critical) down in Florida. It's gone down because of router issues but never because of a hurricane, and oh yes, it's been hit. I actually think those places may be better, as they are built to weather those kinds of storms.

Yeah, I've seen lots of people sweating and panicking because a backhoe was working somewhere near the datacenter: on site, with beads of sweat on their foreheads.

Re:Location (0)

Anonymous Coward | more than 7 years ago | (#19045069)

You may or may not look down on the old country, but in our thousands of years of history, we've learnt that buildings made of wood may get blown down in strong weather, but the ones made of brick tend to survive. You seem enlightened, so I'm guessing your datacentre is made of brick...

For those of you in Kansas, the metaphor is buildings built on sand versus those built on rock.

Re:Location (2, Interesting)

walt-sjc (145127) | more than 7 years ago | (#19046213)

There is an AT&T data center in Virginia that was hit by a tornado. Our servers are there. In the part that got hit hardest, water was pouring in and down onto a few racks of servers. The servers were still up, but they powered down that section of the data center for safety reasons. Our servers were fortunate not to be affected, and AT&T kept them running throughout the whole ordeal (power grid was down too, so they were on generator for a couple days.) BTW, that was the "before SBC" AT&T.

Oops (2, Funny)

tedhiltonhead (654502) | more than 7 years ago | (#19044115)

That was me... sorry... my bad. FSBs (Fiber-Seeking Backhoes) are tough to control.

Re:Location (0)

Anonymous Coward | more than 7 years ago | (#19053703)

I can't tell you how many times in IT I've heard someone say, "Some guy in Georgia just dug up a major fiber cable with his backhoe!"
Yes, but how many times has it been true?

Re:Location (0)

Anonymous Coward | more than 7 years ago | (#19055471)

I don't know about the old backhoe excuse, but our last major outage was caused by a drunk guy. He crashed his car into a power pole and took out power to the entire town for 2.5 hours. The massive UPS in our server room was fine, but the UPS in the room where our DS3 comes in had a bad battery and died within minutes.

Re:Location, Location, Location (1)

bikeidaho (951032) | more than 7 years ago | (#19062055)

That's what makes places like Boise, Idaho ideal for datacenters: 1) no natural disasters, 2) a good power grid, 3) cheap hydropower, 4) unused fiber networks. Solutionpro is the way to go.

DNS failover (3, Informative)

linuxwrangler (582055) | more than 7 years ago | (#19043787)

We have used DNS failover from dnsmadeeasy.com for a couple years and have put it to the test a couple times. They have had perfect reliability and a low cost (typically well under $100/year).

The method is not perfect, but it is plenty good enough for our needs: protection against something that takes a datacenter down for a prolonged time (several minutes, hours, or days). And the price is right.

And to those who recommend avoiding "disaster prone" places: they all have people. People like the backhoe guy who took out the OC192 down the street. Or the core drillers who managed to punch both the primary and secondary optical links to a building of ours at a point where they were too close to each other.

You can roll your own by having a DNS server at each site, where DNS server 1 always hands out the IP of site 1 and DNS server 2 always hands out the IP of site 2. But there are a number of issues, like traffic hitting both sites at the same time. And you will have to detect more than just a down link, so you will be scripting web tests and DNS updates. By the time you are done, you will have spent decades' worth of dnsmadeeasy fees.
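To give a feel for how much of that scripting there is, here is a minimal Python sketch of the health-check-plus-DNS-update piece. It assumes your zone accepts RFC 2136 dynamic updates with a TSIG key and shells out to BIND's nsupdate; the names, URL, key path, and IPs are all examples.

    #!/usr/bin/env python3
    """Probe the primary over HTTP and, if it looks dead, repoint the A record
    via nsupdate. Assumes RFC 2136 dynamic updates with a TSIG key are allowed
    on the zone. Names, IPs, and the key path are examples. Run from cron."""
    import subprocess
    import urllib.request

    NAME = "www.example.com."
    PRIMARY_URL = "http://203.0.113.10/healthz"
    BACKUP_IP = "198.51.100.20"
    DNS_SERVER = "ns1.example.com"
    TSIG_KEYFILE = "/etc/failover.key"
    TTL = 300

    def primary_healthy():
        try:
            with urllib.request.urlopen(PRIMARY_URL, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    def point_record_at(ip):
        commands = "\n".join([
            f"server {DNS_SERVER}",
            f"update delete {NAME} A",
            f"update add {NAME} {TTL} A {ip}",
            "send",
        ]) + "\n"
        subprocess.run(["nsupdate", "-k", TSIG_KEYFILE],
                       input=commands, text=True, check=True)

    if __name__ == "__main__":
        if not primary_healthy():
            point_record_at(BACKUP_IP)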

Note: dnsmadeeasy isn't the only game in town. Just the one we happen to use.

Re:DNS failover (4, Informative)

Joe5678 (135227) | more than 7 years ago | (#19044605)

dnsmadeeasy doesn't solve the problem the OP is asking about. They simply monitor your services and start serving a different DNS record if your primary is down.

The OP is concerned with all the DNS servers that aren't yours that would then have a cached version already, and continue to serve up the dead DNS record until their (incorrectly configured) TTL expired.

As another poster already mentioned, BGP is really the only technical solution to this problem. All other "solutions" are going to be convincing people that they don't really need instant failover in the event of a major disaster.

Re:DNS failover (2, Informative)

mother_reincarnated (1099781) | more than 7 years ago | (#19050719)

Disclosure time: I deal with this stuff every day and I have a vested interest in commercial solutions to this issue.
dnsmadeeasy doesn't solve the problem the OP is asking about. They simply monitor your services and start serving a different DNS record if your primary is down.

The OP is concerned with all the DNS servers that aren't yours that would then have a cached version already, and continue to serve up the dead DNS record until their (incorrectly configured) TTL expired.

As another poster already mentioned, BGP is really the only technical solution to this problem. All other "solutions" are going to be convincing people that they don't really need instant failover in the event of a major disaster.


The problem is that this is a somewhat specious concern. It tends to be browser/OS combos that don't honor TTLs, and there are very few of those around anymore. In that case, all the client has to do is close and reopen the browser.

BGP is cumbersome but a valid solution (if you've got your own AS). You probably want to supplement it with a DNS-based solution, though; far more of your outages will be less-than-whole-site events. The more likely scenario to keep in mind is a partial outage at the primary DC, where you use DNS to send 99% of the users of the affected applications to the DR site, and handle the 1% who are non-RFC-compliant via more draconian methods (302s, L3 backhaul, or even shared L2 if you have it between sites). A properly designed solution will do all of this for you and will also know what is really available, without manual intervention.

I really want to address your comment about convincing people they "don't really need instant failover in the event of a major disaster." Well, they probably don't [need 100% of users to have instant failover during a major disaster]. Metrics get wrapped around BC for a good reason. What people often fail to grasp is that if your entire primary site goes 'poof', you, and likely most of the world, will not notice the time it takes either a DNS- or BGP-based solution to reconverge, and you certainly won't care if a fractional percentage of your users need to close their browsers (many users probably won't be able to get to your site anyway, thanks to the route-flapping mayhem caused by said disaster). Frankly, it won't be anyone's biggest concern at that point.

Layered approaches are often needed; your particular requirements and budget determine how deep you go. A DNS-based first layer will do 99%+ of the job.

[FYI: for dealing with the bad-boy superproxy types like AOL, you could segment that traffic out and handle it differently from everyone who 'plays fair'... If you've got the money to play with the best-of-breed solutions, it's all been done before; but then you probably wouldn't be asking /.]

Prioritization (0)

Anonymous Coward | more than 7 years ago | (#19044069)

It depends. After an emergency like an earthquake, IT services are the absolute last thing on my priority list.

The damn website can wait; family needs come first. Most IT people are not paid enough to give a damn about a website or server anyway.

excellent point (2, Insightful)

swschrad (312009) | more than 7 years ago | (#19044189)

but wrong answer for it. the disaster plan should include backup for key people and assume responsibility for their dependents, so the key people give a schytte about what they're doing and have an out for the whole family from the (hopefully local) disaster.

it's incumbent on manglement to have useful plans, and you should help make what they have useful. shift the end focus and present it to them.

Re:Prioritization (1)

qwijibo (101731) | more than 7 years ago | (#19045279)

If it's a local company that can't operate after a local disaster, that attitude is fine. However, there are a fair number of companies that can't completely stop their business because of a problem at one location. In fact, it's probably safe to say that without a way to get back up and running elsewhere, many companies won't have the cash flow to ever get that site running again, which means you're going to need a new job anyway.

Oddly enough, those of us who give a damn about whether or not the systems we administer are working tend to get paid enough to continue to give a damn. A solid disaster recovery plan actually puts everyone in a better position to attend to their family needs without neglecting their work duties.

Buy "Scalable Internet Architectures" (3, Informative)

tedhiltonhead (654502) | more than 7 years ago | (#19044171)

For a real answer, buy Theo Schlossnagle's book, "Scalable Internet Architectures". Theo presented a lengthy and highly-informative session at OSCON last year, and I subsequently bought and read his book. Worth every penny if you're professionally involved in providing reliable Internet services of any kind.

Re:Buy "Scalable Internet Architectures" (1)

scribblej (195445) | more than 7 years ago | (#19047435)

Thank you Ted. Teddy. Theodore. Theo. Whatever.

What?! I'm just saying thanks.

Re:Buy "Scalable Internet Architectures" (1)

tedhiltonhead (654502) | more than 7 years ago | (#19048333)

Nice try, but it's not me. :)

Fail the IP address across (1)

Colin Smith (2679) | more than 7 years ago | (#19044351)

Rather than the name.

Your networking gurus will find that an interesting one. In reality, though, a couple of days is probably good enough; in the event of something like an earthquake, you may find that people have other things on their minds when that amount of shit hits the fan.

Still, kudos on planning for it. Most IT depts are taken completely by surprise and the business goes down in flames. A decent global filesystem (like AFS) helps massively.

 

Re:Fail the IP address across (4, Insightful)

QuantumRiff (120817) | more than 7 years ago | (#19044635)

you may find that people have other things on their minds when that amount of shit hits the fan.

Really interesting point that seems to be overlooked. The CEO is concerned about getting everything back up and running (since statistically they have no heart or pulse), but the employees are more concerned about finding family members in the wreckage of their houses, cleanup, watching the kids because schools are shut down, etc.

Whatever you do, make it as automated as possible, and please, please, please don't forget to test. I've heard too many stories about everything looking okay until the emergency generator runs for several hours, vibrates a connection loose, and shuts down. It would pass the monthly test run, which was only 15 minutes long. "Hmm, power is out and power poles are blown all over the streets. Do I stay safely inside, or do I brave a trip across town to flip a switch for my wonderful employer?"

Re:Fail the IP address across (1)

mabu (178417) | more than 7 years ago | (#19047077)

Speaking as the owner of an ISP that was hit by Hurricane Katrina, I can comment on this... While everyone else in the family was looking for a place to live, my customers demanded their web sites be back online ASAP, so it took me 24-48 hours to get everything back online even though our network never went down; the military took over the building, messed with the generator transfer switch, and screwed things up.

I slept on floors and in people's cars while I got the network back online. Only then did I concern myself with stuff like food and where I was going to live. I wish it hadn't gone down like that, but it did.

Re:Fail the IP address across (1)

Sobrique (543255) | more than 7 years ago | (#19048941)

Then you are clearly far more dedicated than average :).

Re:Fail the IP address across (1)

cornjones (33009) | more than 7 years ago | (#19049995)

read: owner. joe paycheck went looting for food.

Re:Fail the IP address across (1)

jrvz (734655) | more than 7 years ago | (#19053739)

Re: testing. I remember a command center that ran on a generator only as long as the inside tank had fuel. When they tried to switch to the outside tank, the generator died. There was water in the connecting pipe, which had frozen.

Re:Fail the IP address across (1)

gbjbaanb (229885) | more than 7 years ago | (#19048947)

in the event of something like an earthquake, you may find that people have other things on their minds

Only if you live there. Most people rent servers in a geographically different location, and their customers may easily be all over the world. If you live in the UK, your servers are in Texas, and your customers are in Australia, you only care about the disaster the way you'd care about any other news story.

The OP is not necessarily concerned about disasters, though; power cuts (increasingly likely nowadays), cable accidents, and human error are all much more likely to cause extensive downtime.

Redirection (1)

gweihir (88907) | more than 7 years ago | (#19044911)

If you can get somebody with multiple, already redundant locations to put up a redirection page, then you are all set, since you can basically ride on their redundancy. Personally, I think this is the cheapest solution. I don't see any way around DNS RFC violators.

Other options would likely require messing with routing info. I think that is chancy at best.

Simple cheap solution: (0)

Anonymous Coward | more than 7 years ago | (#19045113)

I'm always amazed at the expensive solutions that people bandy about.

Just set www.example.com to have two IP addresses: one at your main site, one at your backup site. On the start page at both sites, immediately redirect the browser to www1.example.com (main site) or www2.example.com (backup site). Run a script every five minutes that updates that static start page based on whether your main site is still up, using whatever technique suits your fancy.

DNS round-robin means that browsers try the different IP addresses until they find one that works. Primary and secondary DNS servers mean that DNS lookups try the different servers until they find one that works. Use these features; that's what they're there for!
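Leaving aside the dispute in the replies about whether browsers really walk the A-record list, the every-five-minutes monitor might look something like this Python sketch (hostnames, check URL, and output path are examples; run it from cron):

    #!/usr/bin/env python3
    """Check the main site and rewrite a static start page that bounces
    visitors to www1 or www2. Hostnames, URL, and path are examples."""
    import urllib.request

    MAIN_SITE_CHECK = "http://www1.example.com/"
    START_PAGE = "/var/www/html/index.html"
    PAGE = ('<html><head>'
            '<meta http-equiv="refresh" content="0; url=http://{host}/">'
            '</head><body>Redirecting to {host}...</body></html>\n')

    def main_site_up():
        try:
            with urllib.request.urlopen(MAIN_SITE_CHECK, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            return False

    target = "www1.example.com" if main_site_up() else "www2.example.com"
    with open(START_PAGE, "w") as f:
        f.write(PAGE.format(host=target))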

(If you're really big, and have servers all over the world, have your start page redirect people based on their geo-ip location, like sourceforge's downloads. Or pay someone a load of money, because you're obviously rich... :-)

Enjoy!

Re:Simple cheap solution: (1)

spydum (828400) | more than 7 years ago | (#19052003)

Nice try, but DNS round-robin does not reliably work the way you suggest. Give it a try; you will find quite a lot of variation in how clients respond. More often than not, you're going to end up with a timeout.

Re:Simple cheap solution: (1)

RobNich (85522) | more than 7 years ago | (#19057541)

DNS round-robin means that browsers try the different IP addresses until they find one that works
Nope. It means that each client request for an address gets the next address in the list. A client does not do another DNS lookup when the port 80 connection fails.

Primary and secondary DNS servers mean that DNS lookups try the different servers until they find one that works
This is true.

To address your final statement, geo-location to find the closest server and having a backup site are two essentially unrelated problems. Having a site that redirects to a closer mirror is great, but you still need redundancy for the main site (the one that does the redirection).

Stochastic Resilience (4, Interesting)

DamonHD (794830) | more than 7 years ago | (#19045957)

Hi,

An alternative is to forget the all-or-nothing view and make sure that, with some simple round-robin DNS and enough geographically separated servers for the DNS and HTTP/whatever, even if one server is taken out by a quake or an Act of Congress (ewwww, those nature programmes), *most* users will still get through just fine. Any clients/proxies that are smart enough to try multiple A records for one URL will always get through if even one of your servers is reachable.

Example: my main UK server failed strangely yesterday morning, but only about 30% of my visitors can even have noticed, and the other servers worldwide took up some of the load. Just simple, reliable, cheap round-robin DNS.

Rgds

Damon
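For anyone curious what that "smart client" behaviour looks like, here is a small stdlib-only Python illustration (the hostname is an example): look up every A record for a name and try each address until one answers on port 80.

    #!/usr/bin/env python3
    """Look up all A records for a name and try each address until one
    answers on port 80. The hostname is an example."""
    import socket

    HOST, PORT = "www.example.com", 80

    # getaddrinfo returns every address the resolver knows for the name
    addresses = {info[4][0] for info in
                 socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)}

    for addr in addresses:
        try:
            with socket.create_connection((addr, PORT), timeout=5):
                print(f"reached {HOST} via {addr}")
                break
        except OSError:
            print(f"{addr} unreachable, trying the next record")
    else:
        print("no server reachable")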

Try asking at webhostingtalk.com (3, Informative)

Wabbit Wabbit (828630) | more than 7 years ago | (#19046117)

The industry pros discuss this sort of thing there all the time. The colocation sub-forum would be the best place to ask. I know that sounds odd, but that's the area on WHT where the best network/transit/BGP people hang out.

Re:Try asking at webhostingtalk.com (0)

Anonymous Coward | more than 7 years ago | (#19048045)

if by "industry pros" you mean "16 year olds trying to get rich by reselling hosting, without the slightest clue as to what they're doing" then you'd be right.

Re:Try asking at webhostingtalk.com (1)

Wabbit Wabbit (828630) | more than 7 years ago | (#19049167)

Not in the colo forum. Serious people do serious business there. The main hosting forum and the reseller forum are indeed both something of a joke.

But your answer (and your posting as AC) certainly tell me which parts of WHT *you're* familiar with, and more importantly, which parts you aren't.

Cross site IP takeover (1)

subreality (157447) | more than 7 years ago | (#19046737)

You need cross-site IP address takeover. You can accomplish this generically with BGP (but if you're asking these questions, I'd stay away from this for now), or work with your ISP to set up a simple way to accomplish this.

Back in the day I used Exodus to do this (3, Informative)

ejoe_mac (560743) | more than 7 years ago | (#19047535)

When money is no issue, a tier-1 colocation provider with its own services would be the best option. They've got big pipes and will work with you to provide the additional services needed. I'd go as far as to say that you're going to want a failover script that they would follow in the event of site A going offline. You'd need redundant equipment, or use a DR firm to get back up.

It's called "BGP"... (1)

jafo (11982) | more than 7 years ago | (#19048031)

You can move a block of IP addresses; most sites will honor an advertisement of a /24, in my experience. With BGP you can cause this IP block to start getting routed to equipment in another part of the world. In other words, you keep your DNS the same and move the IP addresses instead. No DNS propagation time required. BGP changes can propagate in a minute or two, unless the route has been flapping and remote routers have dampened it.

Sean

Umm, doesn't that get complicated? (1)

cheros (223479) | more than 7 years ago | (#19055899)

It's been a long time since I've been near any kind of routing; wouldn't that require access to two separate Autonomous Systems? I'd be interested to know how this works. Good refresher :-)

I personally thought that using VLANs would be a quicker way to go about it, but that's obviously a more localised solution. You're right to look for solutions at the network level, though; there's too much ignoring of DNS TTL values going on (AOL being the example we all love) for any other measure to be quick enough.

= Ch =

Disaster Tolerant is the answer (1)

tengu1sd (797240) | more than 7 years ago | (#19048929)

A disaster-tolerant system will have servers configured at multiple sites with real-time data replication. OpenVMS can do this with the remote sites far enough away that you need to fly there. "How do I restore service?" is the wrong question; ask "How do I avoid downtime even if a site fails?"

The Amsterdam Police IT group recently announced a 10-year uptime. OpenVMS.org [openvms.org] has details on celebrating 10 years of uninterrupted service.

Not so sure about this... (1)

Gazzonyx (982402) | more than 7 years ago | (#19049039)

But can't you use a CNAME for your external name so that clients are forced to come to your DNS server to get the actual IP? I'm not sure it's the most efficient way, but then again your TTL is only 300. That is, you make yourself the SOA and only publish a CNAME, and then just point the CNAME at the currently running server? It *would* mean having to have failover DNS servers, though. I'm no admin, but I do admin-type work from time to time (read: I admin a single server at my internship that only touches the intranet, so I don't have to worry about such issues on a regular basis, and I keep the manuals on my shelf for just these occasions).

Re:Not so sure about this... (1)

spydum (828400) | more than 7 years ago | (#19052047)

The complaint is not that his DNS server does not act appropriately, but that there are providers on the net who run software that cache responses for far longer than the norm. Those folks will have old IP addresses in their cache for anywhere from 24-48 hours (sometimes longer, but that is rare).

Re:Not so sure about this... (1)

Gazzonyx (982402) | more than 7 years ago | (#19052175)

Ah, I was thinking they would cache his DNS server's IP instead of what his server resolved the server to, like a single world IP and then treating the servers as internal IPs. I withdraw my idea, it probably wouldn't work at all. I was truly half asleep when I posted that; not sure what I was thinking. Oh, yeah, I was thinking about studying for finals... speaking of which...

DNS is the way to go (1)

RobNich (85522) | more than 7 years ago | (#19057941)

You don't want to mess with BGP unless you have plenty of money to have a redundant location, and a large enough IP block to justify it. You may find an ISP that has this set up or their own block, but I don't know of any.

The way to go is DNS. For an example of this, look at Akamai:
(Removed because of the fucking lameness filter. It was a very useful DNS lookup. Try 'dig images.apple.com' to see what I saw.)

It's done using an extremely short TTL on the final A record. Obviously this handles the vast majority of cases. I also recommend having a backup DNS site hosted by someone ELSE! Set up your two locations, and host DNS on them, but have third and fourth DNS servers that are authoritative for your domain. That way if your main site is down, you can switch to secondary, but if secondary goes down, you can set up something else in a pinch and point your backup DNS at it. If you don't have this, there's no chance you'll get back up in less than a day, as you'll have to change your domain's DNS servers.
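As a small companion to that advice, here is a Python sketch of a consistency check (hostname and nameserver list are examples; requires dig): ask each authoritative server directly and make sure they all return the same A record with the short TTL you expect.

    #!/usr/bin/env python3
    """Query each authoritative nameserver directly and compare the A records
    and TTLs they return. Hostname and nameserver list are examples; requires
    the 'dig' utility."""
    import subprocess

    NAME = "www.example.com"
    NAMESERVERS = ["ns1.example.com", "ns2.example.com",
                   "ns3.offsite-dns.example", "ns4.offsite-dns.example"]

    def a_records_from(ns):
        out = subprocess.run(["dig", "@" + ns, NAME, "A", "+noall", "+answer"],
                             capture_output=True, text=True).stdout
        # answer lines look like: name TTL class type rdata
        return sorted((f[1], f[4])
                      for f in (line.split() for line in out.splitlines())
                      if len(f) >= 5 and f[3] == "A")

    answers = {ns: a_records_from(ns) for ns in NAMESERVERS}
    for ns, records in answers.items():
        print(ns, records)
    if len({tuple(records) for records in answers.values()}) != 1:
        print("WARNING: authoritative servers disagree")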

Also, if you're hosting email for your domain, set up a mail forwarding service that will hold your mail or deliver it to other addresses while your main site is down.

I used Rollernet [rollernet.us] for both of these services, but I currently only use them for Secondary DNS, since my mail is now hosted by TuffMail [tuffmail.com] . As a former LFS-building home-server-roller, it's nice to have others dealing with that stuff.

ultradns have a product called .... (1)

Alex (342) | more than 7 years ago | (#19158967)

sitebacker, which will change your DNS within 3 minutes of your site going down.

cheers,

Alex