
Pii's Journal: Same shit, Different day...

I should not have been surprised when, two days ago, my company's primary Internet connection, through PSInet, dropped unexpectedly. It had happened before, and will likely happen again in the future.

What did surprise me was that they are in the midst of a catastrophic network failure affecting not only the Washington, DC, metropolitan area, but also taking down San Francisco, Chicago, St. Louis, Atlanta, and a number of other good-sized, but geographically dispersed, markets.

(My buddy and I, both serious network guys with over 30 years' experience between us, including a lot of time in the carrier space, can only come to the conclusion that PSInet was not paying its bills. No single equipment failure, or loss of a POP, could possibly result in carnage of this magnitude.)

Wow, sucks to be them, and by extension, sucks to be us. Fortunately for me, the last time we experienced an outage, my company's owner sought my guidance in getting some Internet redundancy in here. I had been lobbying for this for quite a while. Like most companies today, our fortunes are won and lost on the delivery of email. An Internet outage represents the loss of opportunity, and revenue, and in these trying economic times, nobody can afford to squander opportunities.

Understand that we are network integrators. We design, sell, and install network services for our customers, from infrastructure components like circuits, routers, and switches, through server building and maintenance. When our customers are looking to ensure their connectivity to the Internet, we propose multiple providers, terminating to separate edge routers, and guide them through the process of obtaining portable IP address space and an Autonomous System Number, so that they can run BGP upstream, ensuring their reachability in the event of a single provider outage.

"Great!" thought I.

What did we end up with? It fell a little short of what we like to install for our customers. We got a second Internet connection in the form of a 768k/128k ADSL link. Terminated to the same edge router, no less. No BGP for us. The barest minimum.

The most redundancy I could get out of this was for email, using an additional MX record, and some fancy port forwarding on the edge router.
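For anyone wondering how an extra MX record buys you anything, here is a rough sketch of what a sending mail server does with it: it sorts the MX records by preference (lowest number first) and delivers to the first host it can reach. The hostnames, the preference values, and the `is_reachable` callback are all made up for illustration; they are not our actual configuration.

```python
from typing import Callable, Iterable, Optional, Tuple

# (preference, mail host); a lower preference value is tried first.
MxRecord = Tuple[int, str]


def pick_mail_host(
    records: Iterable[MxRecord],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the most-preferred mail host that is currently reachable."""
    for _preference, host in sorted(records):
        if is_reachable(host):
            return host
    return None  # every host is down; the sender queues the mail and retries


# Hypothetical zone: the primary circuit's host plus a backup on the DSL link.
records = [(20, "mail-dsl.example.com"), (10, "mail.example.com")]


def primary_down(host: str) -> bool:
    """Simulate an outage on the primary circuit only."""
    return host != "mail.example.com"


print(pick_mail_host(records, primary_down))
```

Note that all of this only works while somebody is still answering DNS queries for the zone. If the MX lookup itself fails, the sender never gets as far as choosing a host.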

At least email would continue to flow in the event of an outage, and for a time, it functioned perfectly.

That is, until PSInet's secondary nameservers, slaves to our own Master here at the office, expired the cache, and now refuse to respond to queries for records for our domain. This is, of course, because some moronic MCSE was left to handle our DNS configuration, and like a tool, he set the SOA expire time to 1 day.

Like clockwork, after 24 hours of circuit outage, PSInet's DNS servers decided (because our MCSE told them to) that they could no longer trust the information in their cache, and expired all of the information for our zone. Two lessons to take from this are:

  • Never send an MCSE to do anything important.
  • Never assume that because a person is capable of setting up a critical network service (because, hey, it's all point and click), they should be tasked with doing so.
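To put numbers on the expire problem: a secondary keeps answering for a zone only while the time since its last successful refresh from the master is under the SOA expire value. The sketch below compares the 1-day expire we had against a 2-week value. The 36-hour outage and the 2-week figure are assumptions picked for illustration (expire values on the order of weeks are commonly suggested), and the function name is my own.

```python
HOUR = 3600
DAY = 24 * HOUR


def secondary_still_answers(seconds_since_refresh: int, soa_expire: int) -> bool:
    """True while the secondary may still serve its copy of the zone.

    The clock runs from the last successful refresh from the master,
    which can be somewhat earlier than the start of the outage.
    """
    return seconds_since_refresh < soa_expire


outage = 36 * HOUR  # hypothetical time since the secondary last reached the master

for label, expire in (("1 day", DAY), ("2 weeks", 14 * DAY)):
    status = "still answering" if secondary_still_answers(outage, expire) else "zone expired"
    print(f"SOA expire = {label}: {status}")
```

With an expire value measured in weeks, the secondaries would have ridden out this outage without anyone noticing.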

So yesterday, even with our primary circuit down, we were able to receive email with almost no loss of performance. All was right in the world. Then, as the 24-hour clock ran out, all of a sudden our inbound email ceased to exist.

The topper was this morning, when one of our directors said to me:

"Hey, I don't want you to take this personally, but seriously, how do you guys (the engineering staff) sleep at night? I've got a $60,000 order out there, but I can't receive the email. Just trying to keep the lights on in this place."

My response (and I am not embellishing):

I got right in his face, pointed my finger in the general direction of his office, and said "You can take that shit somewhere else."

If this company had listened, just once, to the recommendations that I had made, they never would have known we had an interruption. Not only would we have abandoned the sinking ship that is PSInet over a year ago, but we'd have had two legitimate Internet connections to two separate Tier-1 providers, and we'd have been announcing our own routes via BGP. In addition, our "Mission Critical" applications (Email is mission critical, but they would probably think it was a good idea that our completely worthless and devoid-of-content website was also always available) would have backup servers co-located in the facilities of a third major provider.

I mean, you have to make a decision. Is email mission critical, or isn't it? There are no half measures; there's no middle ground. If the absence of email is more costly than the price of ensuring its availability, then you implement the bulletproof solution. Every goddamn time. Period.

But what else is new... The cobbler's kid needs new shoes. We sell technology solutions. We don't use them.

I hope the problem doesn't get solved.
