Air Traffic Control "Telephone Glitch" Delays Hundreds of UK Flights

samzenpus posted about 4 months ago

United Kingdom 40

First time accepted submitter biodata writes "The BBC is reporting that hundreds of UK commercial air flights have been delayed for most of Saturday due to an internal telephone systems problem in the National Air Traffic Control Service, and delays are likely to continue into the evening. A spokesperson said that it was a different software bug from the one which grounded flights in the summer."

oh, editorial control (5, Insightful)

richlv (778496) | about 4 months ago | (#45633153)

"difficult software bug from the one which grounded flights"

well, spellchecker did not complain, so we're all set...

Re:oh, editorial control (1)

richlv (778496) | about 4 months ago | (#45633177)

it also seems that somebody had to make this change... article has this quote:

The BBC's transport correspondent Richard Westcott said it was a totally different issue to a software problem that hit the control centre in summer.

Re:oh, editorial control (0)

Anonymous Coward | about 4 months ago | (#45633381)

Someone was translating into American and autocorrected a typo.

Re:oh, editorial control (0)

Anonymous Coward | about 4 months ago | (#45633547)

I see ... it's a difficult word and different to spot.

Re:oh, editorial control (-1)

Anonymous Coward | about 4 months ago | (#45633247)

Do you ever decide to not use the toilet and instead just poop your pants? Get off on just rolling around in your filth and being a dirty birdie?

Re:oh, editorial control (-1)

Anonymous Coward | about 4 months ago | (#45633539)

Much prefer imported poop. It's classier. You're just an amateur.

Re:oh, editorial control (2)

maxwell demon (590494) | about 4 months ago | (#45633321)

Sure. It was really not a simple bug to put in, but the programmer who wrote it had already grounded flights in the summer, and thanks to that experience he also managed to put this bug in, despite all its difficulty.

This was definitely not intentional. (5, Informative)

Rei (128717) | about 4 months ago | (#45633817)

It's just an unfortunate incident.

British Telecom has had an issue (which has happened a number of times) which led to a minor timing glitch in one of their systems. When this happens, the data reliability on the FARICE line to Iceland drops and you start getting corrupted flight messages. Shanwick was alerted to the problem and both sides consulted and decided that the best solution in the interim would be something that had been done previously, disconnecting FARICE and thus forcing all connections through the backup line, DANICE, which appeared to be operating normally.

Unfortunately, the problem was even worse on DANICE. What appeared to be normal operation was only normal up to the data logger. Once it actually got to the flight tracking software, the messages were being refused, and corrupted messages being sent in the other direction. So while BT was working on getting their system fixed, flight control managers were being forced to basically manually dig up ATC messages and copy-paste them off to the air traffic controllers (as much was handled through voice as possible as well).

But it got even worse. A totally unrelated communications network, Datalink, decided to misbehave during all of this, which may or may not have been due to the Shanwick problems. On the Iceland side, the general solution is to force a switchover to the backup system. Which was done... except a critical component on the backup system immediately crashed. Repeated attempts to switch and ultimately switch back caused even more problems for the air traffic controllers.

Eventually the fixed FARICE line was brought back up and Datalink came back online (with the switchover-crash problem postponed, to be investigated during a low-traffic time period).

It's terrible that there were so many delays, but these are extremely complicated systems with a challenging task, built up over decades with tons of computer components, protocols, lines, routers, radar systems, transmitters, and on and on, scattered all over the world. On a weekend. Everyone was scrambling and doing their damnedest to fix it as soon as possible. It should also be noted that it was never a safety issue - even in the absolute worst case, air traffic control could go all the way back to the old paper-and-pencil method. What the systems give is, primarily, speed, and thus when there's big problems, there's delays.

And that was my weekend, how was yours? ;)
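The failover story above suggests one general lesson: a backup line that looks healthy at the data logger may still fail once messages reach the application. Here is a minimal sketch of that idea, not a description of how Shanwick or Iceland actually do it. The link names come from the comment; the `Link` class, its two check functions and `choose_link` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Link:
    """Hypothetical model of one communications line."""
    name: str
    # True if bytes arrive at the data logger (transport-level check).
    reaches_logger: Callable[[], bool]
    # True if a test message is accepted by the flight tracking software
    # and comes back uncorrupted (end-to-end check).
    round_trip_ok: Callable[[], bool]


def choose_link(links: Sequence[Link]) -> Optional[Link]:
    """Return the first link, in priority order, that passes the end-to-end check.

    A link that only passes the logger-level check is skipped, because
    that is the state the backup line was described as being in.
    """
    for link in links:
        if link.reaches_logger() and link.round_trip_ok():
            return link
    return None  # no usable link: fall back to manual procedures


# Assumed states for illustration: both lines look fine at the logger only.
farice = Link("FARICE", reaches_logger=lambda: True, round_trip_ok=lambda: False)
danice = Link("DANICE", reaches_logger=lambda: True, round_trip_ok=lambda: False)
print(choose_link([farice, danice]))  # None
```

The point of the design is only that the switchover decision is made on the deepest check available, not on the first layer that reports "normal".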

Re:This was definitely not intentional. (2)

Rei (128717) | about 4 months ago | (#45633893)

Oh, and I forgot to mention the voice communication systems problem. That one didn't affect me directly but I did get a memo about it.

Re:This was definitely not intentional. (2)

tibit (1762298) | about 4 months ago | (#45634361)

Of course, I might be entirely off base here, but below is the first impression I got.

Wouldn't a "fix" be as simple as routing all that junk encapsulated over a point-to-point ssh connection between two routers? Doesn't almost any router let you pack up all of the disparate kinds of traffic and push it over a "safe" pipe that doesn't give a flying fuck about datagram corruption? Wouldn't a solution here be, quite literally, two router boxes from any major vendor? Yeah, it may not perform all that great when there's bad corruption, but it will work as much as anything would over that link. What I just don't get is how you can get application-level messages corrupted when all that happens is a bad data link, if that's in fact what was going on. That stuff has been solved a long time ago with things as low-key as HDLC and X.25, and if you really need hardcore resiliency to corruption, then I think a PTP link over SSL will not pass anything corrupt to the higher layers even if you do a deliberate MITM, much less from random data corruption.
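To illustrate the point about link-layer integrity checks: a frame check sequence lets the receiver drop a corrupted frame instead of handing it to the application. The sketch below uses Python's `binascii.crc_hqx` (a 16-bit CRC with the CRC-CCITT polynomial). It is not a real HDLC or X.25 implementation, and the `frame`/`unframe` helpers and the message text are made up for illustration.

```python
import binascii
import struct
from typing import Optional


def frame(payload: bytes) -> bytes:
    """Append a 16-bit CRC to the payload (big-endian)."""
    crc = binascii.crc_hqx(payload, 0xFFFF)
    return payload + struct.pack(">H", crc)


def unframe(data: bytes) -> Optional[bytes]:
    """Return the payload if the CRC matches, else None (frame is dropped)."""
    if len(data) < 2:
        return None
    payload = data[:-2]
    (crc,) = struct.unpack(">H", data[-2:])
    if binascii.crc_hqx(payload, 0xFFFF) != crc:
        return None
    return payload


good = frame(b"HYPOTHETICAL FLIGHT MESSAGE")
bad = bytearray(good)
bad[3] ^= 0x01  # flip one bit, as a noisy line might

print(unframe(good))        # payload comes back intact
print(unframe(bytes(bad)))  # None: corruption detected, nothing passed up
```

A real link protocol would also retransmit the dropped frame; this only shows the detection step.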

Re:This was definitely not intentional. (1)

mjwalshe (1680392) | about 4 months ago | (#45634905)

mm you do wonder if the fault was in using a modern tcp/ip link rather than an old school error corrected up the wazzo x.25 - one of the problems with OSI was that it had a lot of error correction and was less efficient than TCP/IP

Re:This was definitely not intentional. (1)

Rei (128717) | about 4 months ago | (#45638397)

The issue is, you deal with the system you're with, not the situation you wish you had.

We can't change a transmission protocol or route data over arbitrary connections. This is a collection of everything from very old hardware to brand new, protocols from very old to brand new, in every country in the world, and you can't just arbitrarily rework them. It's the same in the air, too. And when new protocols are made, they're generally in addition to existing ones, not replacing them. I'm not aware of any with error correcting codes or the like (there could be, I just haven't worked with them), but some of them (not all) use checksums (though that's a whole 'nother story... the documentation on how one common type of checksum, that used in datalink messages, is calculated is a big fat lie, caused by a screwup by whoever implemented the code the first time that everyone else now has to imitate... but it works, so...).

In the long run, the goal is to move as much traffic as possible to the more automated, more reliable newer protocols. But this is something that's invariably going to happen at a snail's pace.

As I've never messed with them directly, I can't describe to you the protocols used for physical data transmission at every point over the FARICE and DANICE links - just the message layer on top of them, which is plaintext except for the header marker characters. I've never worked at anything more than the endpoints. But I can tell you this, there's no way we could just go in and replace all of the hardware along the way (you should see the graph of all of the hardware that exists just between Iceland and Britain). It would be an expensive long-term international effort with major potential for disruption in its own right. And it would only help for that particular link anyway. What you really want is how all of air traffic control messages are transmitted - aircraft, atc, tower, etc - everywhere in the world to be switched over to a single, reliable mechanism and a standardized set of international routing hardware. Well, great, join the club, I'd love that too! But it's just not going to happen any time soon without a massive funding surge.

You work with the systems that you have, not the systems you wish you had. Yes, we're working to modernize everything, just like everyone else. For example, in the past year I've spent a good bit of time working on adding in capabilities to one system to help take a sort of "middleman" server that it talks to out of the loop to improve reliability and error logging. But these things don't happen fast. And how many programmers / hardware engineers do you think we have, really? We're no Microsoft here.

Re:This was definitely not intentional. (1)

tibit (1762298) | about 4 months ago | (#45645011)

Ultimately, there are routers or modems involved, and they push some legacy protocols, and there's a lot of providers out there who offer modules for modern routing hardware that take those old protocols and push them quite transparently over modern data pipes. It's a reasonably well understood problem. It would not require reworking the whole thing, that's the whole point - you take what you have and push the data around using modern hardware that can ensure that the data is safe.

Even if all you have is a 7 bit-only "ASCII" link, you can still push HDLC/X.25 on top of it and then any other protocol you wish on top of that. All it takes is inserting new hardware at key points in the infrastructure. Eventually you can provision alternate links, in an emergency you can leverage public internet - it's better than downtime, and with proper cryptography it's actually more secure than legacy non-encrypted links.

Re:This was definitely not intentional. (0)

Anonymous Coward | about 4 months ago | (#45635115)

British Telecom has had an issue (which has happened a number of times) which led to a minor timing glitch in one of their systems.

Given the consequences that's not minor at all. Too many mediocre developers and system administrators think inaccurate time is unimportant. Time affects so many different computer [distributed] sub-systems it's just ridiculous, everything from caches to message ordering to syncing to version control to debugging to timeouts to dependency management to resource sharing etc. One of the first things I do on any system I manage is to make sure I have accurate time, preferably with NTP or GPS.

Not to mention a backup system that failed its entire reason for existence.
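A small sketch of why clock accuracy matters for message ordering, one of the items listed above. It assumes two hosts that stamp messages with their own local clocks and a receiver that trusts those stamps; the `Message` class, the helper names and the message texts are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    text: str
    true_time: float  # when the event really happened (seconds)
    stamped: float    # timestamp written by the sender's local clock


def stamp(text: str, true_time: float, clock_offset: float) -> Message:
    """Stamp a message using a sender clock that is off by clock_offset seconds."""
    return Message(text, true_time, true_time + clock_offset)


def order_by_stamp(messages: List[Message]) -> List[str]:
    """Order messages the way a receiver trusting the timestamps would."""
    return [m.text for m in sorted(messages, key=lambda m: m.stamped)]


# Host A's clock is accurate; host B's clock runs 2 seconds slow.
msgs = [
    stamp("A: request handoff", true_time=10.0, clock_offset=0.0),
    stamp("B: handoff accepted", true_time=11.0, clock_offset=-2.0),
]
print(order_by_stamp(msgs))
# ['B: handoff accepted', 'A: request handoff'] -- the reply sorts before the request
```

With both offsets at zero the order comes out right, which is the whole argument for keeping clocks synchronised.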

Re:This was definitely not intentional. (1)

maxwell demon (590494) | about 4 months ago | (#45654375)

I'll assume that it was only because you were overworked that you missed the humour in my comment. What I did was to give a possible interpretation which would have made the erroneous sentence correct. Of course I didn't mean to imply that someone really added bugs intentionally. At least one person understood it and gave me a "Funny" mod.

But anyway, your comment was full of interesting information, so it was the rare case of a productive Whoosh. Thank you for sharing that information.

How my weekend was? Well, the only thing that failed for me was my own computer, so by far not the same scale as your problems. :-)

Re:oh, editorial control (0)

koan (80826) | about 4 months ago | (#45633931)

Pilot: "Mountain Ahead"
Plane: "Did you mean maintain altitude?"
Plane: "Maintaining Altitude"

Doesn't ANYONE check for typos anymore? (0)

Anonymous Coward | about 4 months ago | (#45633165)

I believe you meant to say "different software bug," not "difficult software bug."

Re:Doesn't ANYONE check for typos anymore? (1)

Stewie241 (1035724) | about 4 months ago | (#45633185)

This happens enough that I often wonder whether the editors are really that careless, or whether they intentionally insert errors like that in order to provide fodder for those who so enjoy writing posts correcting the article and complaining about the lack of editing. Thorough proofreading would kill one of the memes that makes slashdot what it is.

Re:Mind Games (0)

Anonymous Coward | about 4 months ago | (#45633561)

Could I maybe get a tl;dr on that?

I choose to believe... (1)

rmdingler (1955220) | about 4 months ago | (#45633201)

Samzenpus chose to combine different cult using an ultra-liberal poetic license. Bugs are not an option with this telephone system... they come bundled with it.

I work as part of the team... (-1)

Anonymous Coward | about 4 months ago | (#45633277)

And this is the reason [google.com] for this bug :-(

Re:I work as part of the team... (1)

maxwell demon (590494) | about 4 months ago | (#45633483)

Yeah I can understand that - after seeing goatse you cannot concentrate on your programming and create all sorts of nasty bugs ...

Re:I work as part of the team... (0)

Anonymous Coward | about 4 months ago | (#45633513)


Re:I work as part of the team... (0)

Anonymous Coward | about 4 months ago | (#45634625)

And yet you fell for it.

Time to Scale back on Computerisation (1, Interesting)

ObsessiveMathsFreak (773371) | about 4 months ago | (#45633385)

This wouldn't -- no, couldn't have happened in the days before computers.

Eventually, I think centralised computer control is going to go the way of semaphore. It's too easy for a centralised computer system to glitch, break, be shut down, and then screw up the lives and functions of millions.

What we should see is decentralised systems run using independent computer systems.

Re:Time to Scale back on Computerisation (1)

Anonymous Coward | about 4 months ago | (#45633435)

Yeah no. Probably not.

The efficiency gains from centrally controlled, fully integrated computer systems simply dwarf any benefits you might get from time to time with a distributed system.

A central computer with occasional downtime is acceptable when the alternative is a stupid, slow clerical system every day, all the time. "Clerical" is what disparate, independent systems always break down to because of the amount of human effort required to keep them working together.

Re:Time to Scale back on Computerisation (1)

0123456 (636235) | about 4 months ago | (#45633803)

What efficiency gains? Airlines would be far more efficient if they could fly direct from A to B, rather than being funneled into narrow corridors. Pretty much since the advent of GPS, people have been trying to get rid of 'air traffic control' and replace it by direct communication between aircraft which know where they're going and where they want to go.

Re:Time to Scale back on Computerisation (1)

wonkey_monkey (2592601) | about 4 months ago | (#45633515)

This wouldn't -- no, couldn't have happened in the days before computers.

And we wouldn't have all these Tesla fires holding back the adoption of the electric car if we'd just stuck with horses and carts.

Also, without the internet paedophiles wouldn't have easy access to kiddy porn. Won't someone (else) puhlease think of the children?!

What we should see is decentralised systems run using independent computer systems.

Got much experience of ATC systems?

Re:Time to Scale back on Computerisation (1)

Anne Thwacks (531696) | about 4 months ago | (#45634415)

Also, without the internet paedophiles wouldn't have easy access to kiddy porn.

Without the internet, how the hell would anyone know what they had access to? There was something called privacy before the Internet came along.

Re:Time to Scale back on Computerisation (1)

drinkypoo (153816) | about 4 months ago | (#45636653)

What we should see is decentralised systems run using independent computer systems.

How about some updates? These are old-ass systems developed incrementally. It's time to spend some money modernising and unifying them.

RTFA says it's not telephones (0)

swschrad (312009) | about 4 months ago | (#45633495)

but a day/night switchover.

which means they have back-assward management in the first place, for not operating a life-safety system as a 24/7 operation.

carbon-based computation should not be part of the core logic on which the air control system rests.

Re:RTFA says it's not telephones (5, Informative)

jonbryce (703250) | about 4 months ago | (#45633605)

They do operate it as a 24/7 operation. However, at night time there are fewer planes in the sky, so each traffic controller is given a bigger area to work on and there are fewer of them on duty. During day time, these areas are subdivided into smaller areas and more controllers are brought on-line to work on the larger number of areas. It was this switch-over that failed.
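The day/night arrangement described above can be pictured as regrouping the same set of sectors under a different number of controllers. This is only a toy model: the sector and controller names and the `assign` function are hypothetical, and the real NATS configuration is certainly more involved.

```python
from typing import Dict, List


def assign(sectors: List[str], controllers: List[str]) -> Dict[str, List[str]]:
    """Split sectors into contiguous groups, one group per controller.

    Fewer controllers means each one covers a larger combined area.
    """
    if not controllers:
        raise ValueError("at least one controller is required")
    groups: Dict[str, List[str]] = {c: [] for c in controllers}
    per_controller = -(-len(sectors) // len(controllers))  # ceiling division
    for i, sector in enumerate(sectors):
        groups[controllers[i // per_controller]].append(sector)
    return groups


sectors = ["S1", "S2", "S3", "S4"]  # hypothetical sector names
night = assign(sectors, ["ctl-1"])  # one controller, whole area
day = assign(sectors, ["ctl-1", "ctl-2", "ctl-3", "ctl-4"])  # one sector each
print(night)  # {'ctl-1': ['S1', 'S2', 'S3', 'S4']}
print(day)    # {'ctl-1': ['S1'], 'ctl-2': ['S2'], 'ctl-3': ['S3'], 'ctl-4': ['S4']}
```

The switch-over that failed is the step from the `night` grouping to the `day` grouping; if it cannot be made, the same few controllers are left with areas sized for night-time traffic.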

Internal telephone systems problem? (1)

codeusirae (3036835) | about 4 months ago | (#45633681)

`One of the key changes involves improving the warning messages that flash on the air traffic controllers' screens when an aircraft moves out of their area of control and responsibility. The aim is for a warning to flash on the display to remind the controllers to ensure that they have completed all their co-ordination checks before an aircraft leaves their screen and becomes the responsibility of others.

"There is a quirk over whether it flashes or not," says Chisholm. "We want it to work in 100% of cases".

It is important to fix this problem because the Swanwick system, unlike the current manual process, supports the automated transfer of aircraft from one air space sector to another.

Currently at the London Air Traffic Control Centre, when controllers relinquish responsibility for an aircraft, they confirm this by phoning the appropriate new controller. This will not happen under the new automated procedures at Swanwick'. link [computerweekly.com]

Re:Internal telephone systems problem? (1)

tibit (1762298) | about 4 months ago | (#45634379)

And this is how government organizations perform in the 21st century. Facepalm...

Re:Internal telephone systems problem? (1)

Capt. Mubbers (206692) | about 4 months ago | (#45635057)

NATS (http://en.wikipedia.org/wiki/National_Air_Traffic_Services) is 51% in private ownership, and 42% is actually owned by large airlines

Re:Internal telephone systems problem? (0)

Anonymous Coward | about 4 months ago | (#45637011)

When a system 100% has to fail safe, often the only way is to fail into the hands of a human doing shit manually.

I caught the beginning of this... (1)

dwater (72834) | about 4 months ago | (#45635165)

I was traveling from Heathrow to Beijing via Helsinki (5.5 hour lay-over) on a flight that was supposed to leave LHR at 7:30 but was delayed until 9:00... the estimated departure moved again backwards and forwards once (after we got on the plane), but it seemed to be a minor delay from my point of view.

The most annoying thing was that the online systems weren't showing the disruption. I was looking at the departure board at LHR and it was showing the delay (though it took a while), but the online web page and the 'Heathrow App' for my android phone both showed no delay even though it was ~8am already. I was due to meet someone for lunch (during my 5.5 hour lay-over) and I had smsed them about the delay, but they had called the airport authority and were told there was no delay, and so they experienced some inconvenience while they waited.

The good thing was that the flight from Helsinki to Beijing was very sparse, and I was able to use a whole 4-seat row to sleep on - I guess many flights missed the connection. Sucks to be them, but good for me, I suppose :)

