Curiosity Rover On Standby As NASA Addresses Computer Glitch

samzenpus posted about a year ago | from the fixing-the-glitch dept.

Mars 98

alancronin writes "NASA's Mars rover Curiosity has been temporarily put into 'safe mode,' as scientists monitoring from Earth try to fix a computer glitch, the US space agency said. Scientists switched to a backup computer Thursday so that they could troubleshoot the problem, said to be linked to a glitch in the original computer's flash memory. 'We switched computers to get to a standard state from which to begin restoring routine operations,' said Richard Cook of NASA's Jet Propulsion Laboratory, the project manager for the Mars Science Laboratory Project, which built and operates Curiosity."

98 comments

Glitch or flash memory failure? (4, Interesting)

AmiMoJo (196126) | about a year ago | (#43062771)

Are we talking a temporary issue that can be resolved by re-flashing the memory in question or is one of the cells damaged in some un-recoverable way? Either way there are solutions but the latter is far more serious.

Re:Glitch or flash memory failure? (3, Funny)

Anonymous Coward | about a year ago | (#43062779)

Would hate to be the field service engineer for this one...

Re:Glitch or flash memory failure? (0)

Anonymous Coward | about a year ago | (#43062835)

At least it hasn't been contracted out to SpaceX, yet. I can just imagine the scathing accusations against cosmic rays by Elon Musk

Re:Glitch or flash memory failure? (-1)

Anonymous Coward | about a year ago | (#43065175)

you are a fucking clueless wanker.

fuck you and all your kind.

Re:Glitch or flash memory failure? (2)

Hesty Heffew (941832) | about a year ago | (#43063123)

Call Howard Wolowitz. "Having worked on the Mars rover project, he invited his date, Dr. Stephanie Barnett, to drive it at a secret facility. Instead, the rover became stuck in a Martian ditch and he spent the rest of the night with Sheldon and Raj trying to undo the damage. When this proved unsuccessful, he decided to erase the hard drives of the facility to cover up his meddling, only to later find out that the rover had discovered the first clear signs of life on Mars. When he made it onto a new team for the Defense Department laser-equipped surveillance satellite, Howard was not granted security clearance after Sheldon revealed the rover incident to an investigating FBI agent." "The Apology Insufficiency". The Big Bang Theory. Season 4. Episode 7. November 4, 2010.

Re:Glitch or flash memory failure? (3, Insightful)

icebike (68054) | about a year ago | (#43063511)

Does every incident in the real world need a reference to a TV show?

Are you sure you can't find an XKCD comic [xkcd.com] that would be more appropriate?

Re:Glitch or flash memory failure? (-1)

Anonymous Coward | about a year ago | (#43063735)

I'd rather he reference a sometimes funny tv show than a never funny piss poor "web comic".

Re:Glitch or flash memory failure? (1, Insightful)

Anonymous Coward | about a year ago | (#43063855)

Fuck off. The big bang theory is beyond shite, I've tried watching it a handful of times and always gave up after 3 or 4 minutes. It's absolutely insufferable tripe.

XKCD, on the other hand, is great.

Re:Glitch or flash memory failure? (1)

Anonymous Coward | about a year ago | (#43064665)

Fuck off. The big bang theory is beyond shite, I've tried watching it a handful of times and always gave up after 3 or 4 minutes. It's absolutely insufferable tripe.

XKCD, on the other hand, is great.

I derogatorily dismiss your subjective point of view and substitute my own subjective opinion! Now that is how you win an argument!

Re:Glitch or flash memory failure? (0)

Anonymous Coward | about a year ago | (#43065181)

especially when it's true!

BBT is rubbish aimed at idiots who think they know what 'geeks' or 'educated people' are like and it really is almost exclusively insufferable trash, aimed at the 'unwashed masses'.

XKCD is for geeks by a geek...what could be better?

Re:Glitch or flash memory failure? (1)

icebike (68054) | about a year ago | (#43074463)

aimed at idiots who think they know what 'geeks' or 'educated people' are like

I knew people just about exactly like every character on that show while going through college. Yes, babbling in Klingon to each other, and the whole 9 yards. The exaggeration in the show isn't all that great from what I remember.

Of course I knew regular PhDs as well. This show isn't about them.

Re:Glitch or flash memory failure? (1)

Celeritas 5k (1587217) | about a year ago | (#43064077)

If you don't like XKCD, you're everything that's wrong with the world. Besides, anything with a laugh track is awful.

Re:Glitch or flash memory failure? (0)

Anonymous Coward | about a year ago | (#43064159)

If you don't drink my brand of whiskey or smoke my brand of cigarettes or drive my brand of car, then you are somewhat less of a man. You are probably a homosexual and/or a communist. If you don't read the same comics as I do, you are an occupy-this-n-that dirty hippy at the very least. Is that what you are trying to say?

Sign me up! (0)

Anonymous Coward | about a year ago | (#43067225)

Would hate to be the field service engineer for this one...

Are you kidding? Trip to Mars and back, on the company dime?

The SkyMiles alone would be worth it!

Re:Glitch or flash memory failure? (1)

TJNoffy (1255090) | about a year ago | (#43071123)

Really? I would LOVE it if the technology existed to get me there and back safely! Send me!

Re:Glitch or flash memory failure? (1)

RockDoctor (15477) | about a year ago | (#43095331)

Really? I would LOVE it if the technology existed to get me there and back safely! Send me!

To misquote ... whoever actually wrote Star Wars ... "You are not the field service engineer we're looking for!"

Real Mars rover field service engineers go out, fix the fault, and then terraform the planet. With nowt but a roll of duct tape and a can of WD40.

Re:Glitch or flash memory failure? (5, Informative)

icebike (68054) | about a year ago | (#43062837)

TFA pretty much covers this, saying they believe it is a problem in the flash memory.

The computer problem is related to a glitch in flash memory on the A-side computer caused by corrupted memory files, Cook said. Scientists are still looking into the root cause of the corrupted memory, but it's possible the memory files were damaged by high-energy space particles called cosmic rays, which are always a danger beyond the protective atmosphere of Earth.

They also say

"We also want to look to see if we can make changes to software to immunize against this kind of problem in the future," Cook said.

It seems that, since the same thing happened on one of the earlier rovers, this is something they would have done some time ago.

They are now updating the B side computer so it can manage the mission while they work on the primary. I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?

Re:Glitch or flash memory failure? (2, Funny)

Anonymous Coward | about a year ago | (#43062975)

Because apt-get update installed takes a million years with that kind of latency?

Re:Glitch or flash memory failure? (2)

icebike (68054) | about a year ago | (#43063103)

They sent the update once, didn't they?
Wait till you are satisfied it worked, and shunt it over to computer B.

Re:Glitch or flash memory failure? (5, Interesting)

BradleyUffner (103496) | about a year ago | (#43064475)

They sent the update once, didn't they?
Wait till you are satisfied it worked, and shunt it over to computer B.

I'm fairly sure that they purposely keep the computers out of sync to avoid a single bug taking out both systems. If I recall, it actually has 3 computers, 2 of them have identical hardware that run different versions of the same software, and a 3rd computer based on completely different hardware running yet another software package. Each system is able to assume command of the mission and issue updates to the other systems.

Re:Glitch or flash memory failure? (5, Informative)

Kjella (173770) | about a year ago | (#43063023)

They are now updating the B side computer so it can manage the mission while they work on the primary. I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?

When are you "satisfied" with software like this? Imagine something comes as slow corruption or only occurs when a certain counter overflows or whatever, you don't want to be caught in a race against time to save the system before the B computer dies too. Which is probably the reason why they want to move it back to the A computer, if it can't run there then they don't have a backup anymore. It's better for them with a slightly reduced system with backup than a B computer running with no backup. It does them no good to sit on the ground and say "well, we've figured out what happened and how we could have fixed it" after you've lost contact, then it's game over. You don't run it in "if it breaks, we're done here" mode unless you really, absolutely must.

Re:Glitch or flash memory failure? (2)

icebike (68054) | about a year ago | (#43063107)

If your scenario were real, then why are they updating B at this point?
Clearly they are satisfied with what A was running before the memory fritz.

Re:Glitch or flash memory failure? (3, Insightful)

instagib (879544) | about a year ago | (#43063441)

One can only hope that they have a C computer which will never be updated, and which can reset the rover to the initial state. Even if updates on A run fine for some time, experience in computing of the last decades shows that Murphy's Law is always lurking.

Re:Glitch or flash memory failure? (1)

Impy the Impiuos Imp (442658) | about a year ago | (#43064359)

As they cannot have someone push a physical hardware reset button, they should have a simple watchdog circuit that looks for a particular radio signal (it would listen in on it) and force a reset that way. The flash routine would similarly be brutally powerful yet simple.

And why they don't have 200 flash banks I don't know. Best 9 of 17 wins on a bit level, that kind of thing.
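For what it's worth, the "best 9 of 17 wins on a bit level" idea is plain bitwise majority voting over redundant copies. A minimal Python sketch follows, assuming each bank can be read back as an equal-length byte string; `majority_vote` is a made-up name for illustration, not anything from the rover's flight software.

```python
def majority_vote(copies):
    """Return the bitwise majority of an odd number of equal-length byte strings."""
    if len(copies) % 2 == 0:
        raise ValueError("need an odd number of copies so there is never a tie")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("copies must have equal length")

    threshold = len(copies) // 2 + 1  # e.g. 9 of 17
    out = bytearray(length)
    for i in range(length):
        byte = 0
        for bit in range(8):
            # Count how many copies have this bit set at this byte offset.
            ones = sum((c[i] >> bit) & 1 for c in copies)
            if ones >= threshold:
                byte |= 1 << bit
        out[i] = byte
    return bytes(out)
```

With 17 copies, up to 8 of them can disagree on any given bit and the vote still returns the original value. The obvious cost is 17 times the storage and 17 reads per byte.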

Yes, there's watchdogs on watchdogs (0)

Anonymous Coward | about a year ago | (#43067445)

There's a whole sequence of watchdog like things. There's a watchdog in the flight computer, and, there are things like hardware command decoders (see if a particular sequence of bits comes over the radio, and push the reset button). Also most spacecraft (MSL included) have hardware that deals with command uplink loss timeout (We haven't heard a command in N days, time to go into safe mode).

Bear in mind that "safe mode" reverts to VERY slow data rates over the radio link (10 bps kind of range) on the assumption that no matter what, you can get some bits through.

Yes, there is error detection and correction (EDAC) logic on flash and RAM, but that doesn't mean a) you didn't have a bad day and get a multiple-bit error, or b) that some other software bug is making it seem there's bad memory, or c) that you've found some sort of obscure hardware bug.

Back in the 80s, I worked on a system that would apparently throw double-bit errors on DRAM cards every couple of days. MUCH too frequently given the observed single-bit error rate. Turns out it was the bus interface logic and some timing problems. Stuff like this happens.
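To make the single-bit versus multiple-bit distinction concrete, here is a toy SECDED (single error correct, double error detect) code in Python: an extended Hamming(8,4) code protecting one 4-bit value. This is only a textbook illustration. The actual EDAC scheme on the rover is not described in this thread, and `encode` and `decode` are hypothetical names.

```python
def encode(nibble):
    """Encode a 4-bit value as 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions with bit 2 set
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the error cannot be corrected."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2

    if syndrome and overall:
        code[syndrome] ^= 1          # single-bit error: syndrome is its position
        status = "corrected"
    elif syndrome:
        return None, "uncorrectable"  # two bits flipped: detected, not fixable
    elif overall:
        code[0] ^= 1                 # only the overall parity bit flipped
        status = "corrected"
    else:
        status = "ok"

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status
```

Any one flipped bit is repaired silently, any two are flagged as uncorrectable, and three or more can be mis-corrected into wrong data, which is the "bad day" case.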

Re:Glitch or flash memory failure? (1)

perryizgr8 (1370173) | about a year ago | (#43065367)

i'm pretty sure the guys at nasa writing code for the mars rover are doing it in a lot more careful way than those impulsive hacks at facebook or the iTunes team. murphy's law is supposed to be obliterated by engineering.

Re:Glitch or flash memory failure? (0)

Anonymous Coward | about a year ago | (#43063075)

It seems that, since the same thing happened on one of the earlier rovers, this is something they would have done some time ago.

They are now updating the B side computer so it can manage the mission while they work on the primary. I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?

They have Verizon for an ISP and were afraid of exceeding their cap.

Re:Glitch or flash memory failure? (5, Interesting)

Brett Buck (811747) | about a year ago | (#43063109)

I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?

I can easily imagine this happening, I work on a very similar, perhaps nearly identical spacecraft (that's just a tad more critical AND expensive than this thing...) and we haven't necessarily maintained this. You underestimate the overhead associated with generating the necessary uploads.

        The reason they probably want to go back to the Prime is that their failure isolation system database is keyed to using the prime units only, and to alter it to start on the "B" side and have it switch back to "A" is prohibitive, or at least easier to get around by switching back to A. This last is also something we do in the rare case of a temporary failure. There's less good justification to doing it than leaving the backup program image alone but having to completely retest the entire redundancy management system for a new configuration is generally avoided. If it fails hard, it doesn't really matter, since there's no Prime to switch back to.

      Brett

Re:Glitch or flash memory failure? (1)

sjames (1099) | about a year ago | (#43063183)

Several reasons come to mind. First, it's not necessary. If they ever need to switch computers they can do it then (as they are doing). It takes time and energy (from the limited supply) to do the updates.

Next, up until they have to do that, they stay with the software that has already proven itself to be at least good enough to run the mission. That is, if something really screws up and they realize it was a change 3 revisions back, they still have a computer that was never affected by that change.

As for moving back, they want to get A working again so they keep redundancy. Once they do get it working, the best way to test it is to use it.

Re:Glitch or flash memory failure? (2)

ScottMaxwell (108831) | about a year ago | (#43065273)

They are now updating the B side computer so it can manage the mission while they work on the primary. I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?

They're running the same flight software, but the parameters are different. (A parameter might say, for example, how far to drive between autonomous visual odometry updates, or how big the bounding boxes around the arm should be when computing ChemCam laser safety.) There are thousands of these parameters, and they're not routinely kept up to date on the non-prime side (which has historically been the B side).

And while the computers are identical, the non-cross-strapped equipment isn't. For example, the B-side rear HAZCAMs are exposed to more radiation, because of the DAN instrument, than the A-side rear HAZCAMs, and are therefore expected to degrade faster. Switching back to the A side is, generally, switching back to slightly better equipment.

Re:Glitch or flash memory failure? (0)

Anonymous Coward | about a year ago | (#43065523)

Hogwash - its bugs bugs bugs

Re:Glitch or flash memory failure? (0)

Anonymous Coward | about a year ago | (#43066489)

Noob much?

transmissions take time and power. that time could be spent sending data back to earth instead. Priorities.
Identical computers does not mean identical bus access or identical sensor access. Perhaps the backup bus access is half speed?

Of course, I don't know all the details, but I have worked on spacecraft software systems. There are always details and slight differences between every "identical" box due to limitations at the time of design.

Re:Glitch or flash memory failure? (1)

RockDoctor (15477) | about a year ago | (#43095527)

And since the computers are said to be identical, why the desire to move back to A?

The desire, as I read it, is to have the "A" side back available as a backup. So, in the event of another fault (hypothesis : cosmic ray hit) on the "B" side, then they'll be able to recover the rover's operational capability using the "A" side.

Re:Glitch or flash memory failure? (1)

tipo159 (1151047) | about a year ago | (#43062851)

Are we talking a temporary issue that can be resolved by re-flashing the memory in question or is one of the cells damaged in some un-recoverable way? Either way there are solutions but the latter is far more serious.

Did you read the article? Most of these questions are answered there.

Re:Glitch or flash memory failure? (2)

Solandri (704621) | about a year ago | (#43063557)

Glitched memory usually isn't a problem. Other spacecraft [arstechnica.com] have had similar memory problems [nasa.gov]. Usually it's temporary. If it's permanent, the computers are programmed to map around the glitched memory or (back in the tape drive days) not use that segment of tape.

The real danger is that such a glitch will first manifest itself by altering control or orientation instructions, breaking the spacecraft's contact with Earth. Most spacecraft are designed with a "safe mode" when this happens. If there's been no communication with Earth for x days, the main computer switches to a rudimentary instruction set or a second computer takes over, and tries to re-establish communications.
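The "no communication for x days" rule is essentially a command-loss timer. A minimal Python sketch of the idea is below, assuming time is a plain count of seconds and that safe mode latches until the ground clears it; `CommandLossTimer` and its methods are invented for illustration and are not taken from any real flight software.

```python
class CommandLossTimer:
    """Latch safe mode if no valid command arrives within timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_command_s = 0.0
        self.safe_mode = False

    def command_received(self, now_s):
        # Any valid uplinked command restarts the countdown.
        self.last_command_s = now_s

    def tick(self, now_s):
        # Called periodically; once tripped, safe mode stays set
        # (in this sketch only an explicit recovery would clear it).
        if now_s - self.last_command_s >= self.timeout_s:
            self.safe_mode = True
        return self.safe_mode
```

Typical use would be to call `tick()` from a periodic task and hand control to the fallback routine or backup computer as soon as it returns True.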

Re:Glitch or flash memory failure? (1)

tlhIngan (30335) | about a year ago | (#43065195)

Are we talking a temporary issue that can be resolved by re-flashing the memory in question or is one of the cells damaged in some un-recoverable way? Either way there are solutions but the latter is far more serious.

It's most likely NAND flash, in which case damaged cells are a natural occurrence anyways (most have up to 2% bad to begin with, even when "new"). Even if it's damaged, it's where the flash translation layer goes and marks it as bad and avoids using it.

The only question is - is the rest of the array damaged and the entire flash chip unusable...?
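The "mark it as bad and avoid using it" step can be sketched as a logical-to-physical block table with a pool of spares. This Python version is a simplified assumption of how such a layer might look, not the rover's implementation: `BlockMap` is a hypothetical name, and real flash translation layers also handle wear leveling and recovering the data that was on the failed block.

```python
class BlockMap:
    """Map logical blocks to physical blocks, swapping in spares for bad ones."""

    def __init__(self, total_blocks, spare_blocks):
        self.logical_count = total_blocks - spare_blocks
        # Initially logical block n lives in physical block n.
        self.mapping = list(range(self.logical_count))
        self.spares = list(range(self.logical_count, total_blocks))
        self.bad = set()

    def mark_bad(self, physical):
        self.bad.add(physical)
        if physical in self.spares:
            self.spares.remove(physical)  # never hand out a bad spare
        for logical, phys in enumerate(self.mapping):
            if phys == physical:
                if not self.spares:
                    raise RuntimeError("no spare blocks left")
                self.mapping[logical] = self.spares.pop(0)

    def physical_for(self, logical):
        return self.mapping[logical]
```

Once the spares run out the device can no longer hide failures, which is roughly the "is the entire chip unusable" question.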

Robust hardware (0)

DigiShaman (671371) | about a year ago | (#43062795)

Who actually fabs the chips and circuit boards used by NASA? I'm guessing these flash modules are not your typical Sandisk variety. Also, do they plan on wiping out the flash memory from a secondary computer, mapping bad cells, and reloading software from scratch? Hopefully these flash modules don't suffer any systemic issues.

Re:Robust hardware (0)

Gareth Iwan Fairclough (2831535) | about a year ago | (#43062849)

Who else has a feeling that someone fitted in a module backwards? Either that, or a dead cell or two.

Re:Robust hardware (5, Informative)

icebike (68054) | about a year ago | (#43062871)

Who else has a feeling that someone fitted in a module backwards?
Either that, or a dead cell or two.

Nobody who has read TFA has that feeling. Curiosity has been running since Aug. 6, 2012 on your putative "backwards module".

Re:Robust hardware (4, Funny)

93 Escort Wagon (326346) | about a year ago | (#43063071)

Nobody who has read TFA has that feeling.

You could actually explain it to him rather than choosing to go all holier than thou. Here, I'll do it for you.

Who else has a feeling that someone fitted in a module backwards? Either that, or a dead cell or two.

The A-side flux capacitor was somehow depolarized, perhaps by a cosmic ray impact event. They're hoping to fix it by reinitializing the quantum warp matrix.

Re:Robust hardware (1)

Tablizer (95088) | about a year ago | (#43063279)

[fitted in a module backwards?]
Nobody who has read TFA has that feeling. Curiosity has been running since Aug. 6, 2012 on your putative "backwards module".

But the rover's actually on Venus because of it.

Re:Robust hardware (1)

Gareth Iwan Fairclough (2831535) | about a year ago | (#43064067)

I suppose I deserved such snarky responses. Seeing karma go "bad" after my second post is .. Well, I don't know. I'd hazard a guess at "bad". It was a dumb post though. Btw, TFA? Whatever happens, it would be a hell of a blow for those guys to lose the rover after only a few months. Not only to the engineers and the scientists, but to exploration in general.

Re:Robust hardware (1)

Anonymous Coward | about a year ago | (#43065689)

Btw, TFA?

The Fucking Article

Re:Robust hardware (0)

Anonymous Coward | about a year ago | (#43063079)

"Who actually fabs the chips and circut boards used by NASA?"

I heard they have a 3D printer.

Re:Robust hardware (1)

c0lo (1497653) | about a year ago | (#43063791)

Who actually fabs the chips and circuit boards used by NASA?

"American Components, Russian Components, all the same, all made in Taiwan!"

Lev Andropov - on board of MIR

lol (-1)

Anonymous Coward | about a year ago | (#43062883)

This is why you shouldn't use node.js on anything that matters.

Could it be Java? (-1)

Anonymous Coward | about a year ago | (#43063033)

Do they need to update the Java Runtime?

Gotta love Armchair Quarterbacks and their simple (3)

shoottothrill (1806688) | about a year ago | (#43063051)

solutions: "someone fitted a module in backwards" "Why didn't they simply do X instead of Y" Do you seriously think the people who work for NASA, the same ones who sent a group of men to the moon and back, rovers to Mars, Voyager to Deep Space, shuttles, space stations, et al. can be out-thought by YOU? And from the comfort of your Lazy Boy Chair, no less?! Wow. Surely you can come to respect that these men and women have given countless hours of thought, simulation, and planning to these missions. In fact, many have dedicated their lives to this engineering. Surely you can respect that those involved in this mission did not simply put the red wire where the blue one should have gone. That's sorta insulting. Peace out

Re:Gotta love Armchair Quarterbacks and their simp (3, Informative)

MLCT (1148749) | about a year ago | (#43063171)

On the whole I am sure everyone does respect NASA, but they do have "previous" on things far simpler than the random slashdotter obtuse suggestions:

Newtons or pound-force? [wikipedia.org]

I don't think anyone is suggesting that simple mistakes were the cause in this case - but the above link may help explain why a little leg-pulling by slashdotters is not crossing any lines.
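The linked incident is presumably the Mars Climate Orbiter loss, where thruster impulse figures in pound-force seconds were consumed as newton-seconds. One common guard against that class of mistake is to carry the unit in the type instead of passing bare floats. A minimal Python sketch follows; `Impulse` is a made-up class for illustration, not anything NASA uses.

```python
from dataclasses import dataclass

LBF_TO_N = 4.4482216152605  # exact: 0.45359237 kg * 9.80665 m/s^2


@dataclass(frozen=True)
class Impulse:
    """An impulse stored internally in newton-seconds only."""

    newton_seconds: float

    @classmethod
    def from_lbf_s(cls, value):
        # Imperial input is converted once, at the boundary.
        return cls(value * LBF_TO_N)

    def __add__(self, other):
        # Refuse to mix with bare numbers, whose unit is unknown.
        if not isinstance(other, Impulse):
            return NotImplemented
        return Impulse(self.newton_seconds + other.newton_seconds)
```

Adding an `Impulse` to a plain float raises `TypeError`, so a value of unknown unit cannot slip into a sum unnoticed.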

"Peace out"

Re:Gotta love Armchair Quarterbacks and their simp (4, Interesting)

Tablizer (95088) | about a year ago | (#43063329)

The Galileo Jupiter atmosphere probe actually had a parachute-related part put on backward. It almost ruined the mission. They got lucky and the shaking from atmospheric drag eventually shook the high-altitude parachute off the bad lock barely in time before it could have damaged the probe.

Doesn't hurt to ask, although knowing more about the hardware may allow you to give more specific advice, such as "part X could be put in backward and still mostly work without early detection according to simulation Y."

Re:Gotta love Armchair Quarterbacks and their simp (0)

Anonymous Coward | about a year ago | (#43063399)

Do you seriously think the people who work for NASA, the same ones who sent a group of men to the moon and back

Not the same people.

Re:Gotta love Armchair Quarterbacks and their simp (0)

Anonymous Coward | about a year ago | (#43063697)

The current crop of "scientists" and "engineers" (terms used very loosely here) at NASA is nothing like the group of scientists and engineers that put men on the moon 50 years ago.

Re:Gotta love Armchair Quarterbacks and their simp (0)

Anonymous Coward | about a year ago | (#43065295)

except it *isn't* the same team that sent people to the moon and back...not even close. 1/3 of the apollo astronauts are dead, and probably roughly the same proportion of the team which worked on the ground as well.

The space shuttle was a monumental failure compared to its stated objectives, unlike apollo, so now you have a culture of abject failure predominating.

Don't believe me? Ask yourself, how many people actually died going to the moon vs the space shuttle fiasco?

Re:Gotta love Armchair Quarterbacks and their simp (0)

Anonymous Coward | about a year ago | (#43068605)

Ask yourself, how many people actually died going to the moon vs the space shuttle fiasco?

3 vs. 12? Is that your point? Well, you might have a point. 3 during a simulation and 12 due to PR and marketing bullshit.

Capcha: Counters

Remote fixes always a hair raiser (5, Interesting)

Celarent Darii (1561999) | about a year ago | (#43063153)

I once had to fix a server some 6000 km away due to a corrupted disk. Doing pdisk and modifying fstab over ssh and then a reboot. You just check and recheck to make sure you did it right and just hope you get a ping a few minutes later.

Can't imagine how these guys feel. 45 min ping and it isn't like they could ask someone to go turn it off and on again.
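The "45 min ping" figure is close to the worst case for a round trip at light speed. A quick Python check is below, using round approximate Earth-Mars distances (about 54.6 million km at closest approach and about 401 million km at the farthest); those two distances are assumptions added here for illustration, not figures from the article.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum


def round_trip_minutes(distance_km):
    """Light-time for a signal to go out and a reply to come back, in minutes."""
    return 2 * distance_km / C_KM_PER_S / 60


NEAREST_KM = 54.6e6    # approximate closest Earth-Mars distance
FARTHEST_KM = 401e6    # approximate farthest Earth-Mars distance

for label, km in (("nearest", NEAREST_KM), ("farthest", FARTHEST_KM)):
    print(f"{label}: {round_trip_minutes(km):.1f} min round trip")
```

That works out to roughly 6 minutes at best and a little under 45 minutes at worst, before counting any processing or relay delays.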

Good luck to the guys working on this.

Re:Remote fixes always a hair raiser (0)

Anonymous Coward | about a year ago | (#43063499)

Can't imagine how these guys feel. 45 min ping and it isn't like they could ask someone to go turn it off and on again.

Thank god it's only a disk and they do not have to worry about security.
I can't even count the number of times I've told myself "don't work on firewalls and VPN's remotely" yet assure myself "it's simple" only to lock myself out.

Good luck guys!

Re:Remote fixes always a hair raiser (1)

Electricity Likes Me (1098643) | about a year ago | (#43065533)

I've always been curious as to how secure the rover communications are. It does have a direct transmitter from the surface back to Earth - I wonder what you get if you could receive/transmit to that.

Re:Remote fixes always a hair raiser (3, Interesting)

evilviper (135110) | about a year ago | (#43064059)

I once had to fix a server some 6000 km away due to a corrupted disk. Doing pdisk and modifying fstab over ssh and then a reboot. You just check and recheck to make sure you did it right and just hope you get a ping a few minutes later.

It's called out-of-band management. You can bring up a server from bare metal with no working OS installed. Damn near every server out there comes with at least ipmi, and often DRACs/iLos/RSAs with some additional features. All you need to do is give the OoBM interface an IP address (perhaps a DHCP reservation) and you're good to go.

Even if you're running on desktop-class hardware, you can still fake OoBM pretty well with a serial port. Linux/BSD/etc., will bring-up the serial port as the console as soon as the bootloader starts up, if configured to do so. And if the disk has failed, or otherwise your bootloader doesn't work, hopefully your bios is set to PXE boot, and your pxelinux configuration will give you a serial console as soon as that kicks-in. Throw-in magic sysrq to allow you to reboot a system that's not responding, and you've got something reasonably close to OoBM just about free. You could also supplement this with a watchdog timer and make things even more reliable.

But as cheap as server-class hardware is, and the ubiquity of ipmi, it's probably not worthwhile going the cheap route.

Re:Remote fixes always a hair raiser (1)

cthulhu11 (842924) | about a year ago | (#43070577)

All you need to do is give the OoBM interface an IP address (perhaps a DHCP reservation) and you're good to go.

"You too can be a MILLIONAIRE, and never pay taxes! First step: get a million dollars" This is one place where server manufacturers all too often mistake rackmount servers for desktops. I demo'd an IBM x-something system last year and the thing was a non-starter. Serial console didn't work out of the box, and even the elaborate legacy-KVM steps they came up with couldn't make the thing work. Talked to Cisco about their UCS systems, and they thoroughly were incapable of understanding. "Just hook up a laptop". Kinda hard to do from the other side of the planet. Last I looked at Dell theirs didn't work out of the box either. HP started doing that a couple of generations ago, and with very recent firmware revisions it's reached almost the point where Sun's ILOM was 10 years ago.

Even if you're running on desktop-class hardware, you can still fake OoBM pretty well with a serial port. Linux/BSD/etc., will bring-up the serial port as the console as soon as the bootloader starts up, if configured to do so

How would it be configured? Jedi mind tricks?

And if the disk has failed, or otherwise your bootloader doesn't work, hopefully your bios is set to PXE boot, and your pxelinux configuration will give you a serial console as soon as that kicks-in.

All you need is for a DHCP server to magically be available. Oh and for a non-brain-dead PXE implementation, ie, something different from the HP DL580 G7, where the goofy-ass thing forces the console to 115200 bps for no apparent reason.

Re:Remote fixes always a hair raiser (1)

evilviper (135110) | about a year ago | (#43076763)

I don't know why you're having these millions of insurmountable problems with OoBM, but I can assure you, it's only you... everyone else in the world has things working just fine.

Rack a server, plug in the power and ethernet cables, then use the front panel LCD menu to assign the static IP address for each one if you don't like DHCP, maybe set the password depending on vendor, and get the hell out of there. Or you can plug in video and keyboard to do it via the BIOS screen prompts. Done it with hundreds and hundreds of servers without issue (dealt with plenty of issues with BMCs/DRACS/iLOS/RSA later on, of course).

"How would it be configured? Jedi mind tricks?"

You start with a pxelinux config with serial console. Then one of your menu options is Linux/BSD with the appropriate configuration already set. You've got to customize your distribution anyhow, so a couple config changes to get a serial console is nothing.

" All you need is for a DHCP server to magically be available."

Nothing magical about it. If you've got more than one server on the network, you just make one (or two) the DHCP/pxe server the rest can bootstrap off of.

You've got to start somewhere. Somebody is out there physically racking the servers and running the wires. Somebody is ordering the stuff, handling delivery and unboxing. Giving one of them a customized DVD or USB flash drive to insert is trivial, and then you've got your infrastructure up and running, ready to support bootstrapping the rest of it. There's no catch-22 here, just a bit of planning.

You make it sound as if servers just spring into exisitence in the farthest corners of the world, attached to some random internet link. And if you are actually supporting lone systems stashed on dstant networks, I recomend supplamenting it with a $40 DD-WRT box with a large USB drive plugged-in, acting as a DHCP/pxe server, including system images on flash, and governed by a watchdog timer, and perhaps a network connectivity cron job. It's dirt cheap infrastructure.

Re:Remote fixes always a hair raiser (1)

cthulhu11 (842924) | about a year ago | (#43083007)

I don't know why you're having these millions of insurmountable problems with OoBM,

Um, I made no such claim.

but I can assure you, it's only you... everyone else in the world has things working just fine.

Wow, you know EVERYONE ELSE? Your friend count on Facebook must be just sick!

Rack a server, plug in the power and ethernet cables, then use the front panel LCD menu to assign the static IP address for each one if you don't like DHCP, maybe set the password depending on vendor, and get the hell out of there.

Maybe set the password? Why on earth would I want to leave it set to admin/changeme? Only front-panel IP address setting I've seen has been on old laser printers and crappy Infortrend RAID arrays. Liking or not liking DHCP is moot. Servers are not desktops, there rarely is a local server available. Recently "ip helper" router config foo can sometimes relay DHCP broadcasts to a remote server, but this is far from reliable and the config needed to work on a given router/interface is far from predictable.

Or you can plug in video and keyboard to do it via the BIOS screen prompts.

5790 miles between here and Bucharest. That's an awfully long USB cable.

Done it with hundreds and hundreds of servers without issue

Because you were in the same building/state/country/continent as them, right?

You start with a pxelinux config with serial console.

Um, no. You start with a serial console, then if firmware bugs don't prevent it, you configure static addressing, IP address/netmask/gateway, user/pass, and hope that the network cables and connections are right and working.

Then one of your menu options is Linux/BSD with the appropriate configuration already set.

If one's forced to run a Linux or BSD ... and if some unicorn magically makes that configuration. You're skipping a bunch of bootstrapping.

You've got to customize your distribution anyhow, so a couple config changes to get a serial console is nothing.

Again, how? Jedi Mind Tricks are unreliable.

" All you need is for a DHCP server to magically be available."

Nothing magical about it. If you've got more than one server on the network

Usually, no. /30 directly connected to a router port. Again, not desktops in an office, not some lame mom's-basement gamer party. And even if there were a DHCP server, that just shifts the issue because it would need to be configured itself.

Somebody is out there physically racking the servers and running the wires.

Yep, often in the middle of the night, my local time, often I don't have direct contact with them, and if I did, my technical Malay and Hungarian is realllly rusty. Even in the US (eg. Houston) I can't count on English fluency. Or on them having a laptop or legacy keyboard/mouse/VGA monitor in their pocket.

Somebody is ordering the stuff, handling delivery and unboxing.

At least three different people there, in at least two different countries.

Giving one of them a customized DVD or USB flash drive to insert is trivial

No, it isn't. Again, the world is a bigger place than a one-building office. DVD's also require a drive, which rarely exists any more. But what good would a USB flash drive do anyway?

You make it sound as if servers just spring into exisitence in the farthest corners of the world, attached to some random internet link.

Pretty much, yeah.

And if you are actually supporting lone systems stashed on dstant [sic] networks, I recomend supplamenting [sic] it with a $40 DD-WRT box with a large USB drive plugged-in, acting as a DHCP/pxe server, including system images on flash, and governed by a watchdog timer, and perhaps a network connectivity cron job. It's dirt cheap infrastructure.

You're kidding, right? Where exactly would I expect unskilled hands in Sofia, Bulgaria to source such a box, and which unicorn would configure it? This would also consume an extra switch port and rack unit and require a VLAN to be configured on the switch. Lots of complexity for no benefit, and yet another OS to manage.

Re:Remote fixes always a hair raiser (1)

evilviper (135110) | about a year ago | (#43088651)

*Sigh*... I've worked with plenty of people like you before. Acting like an asshole to try and mask your incompetence doesn't ever actually work.

Why on earth would I want to leave it set to admin/changeme?

I was discussing what config NEEDS to be done locally to get OoBM working. Changing the password is something you can do from the other side of the planet, once it's pingable.

Liking or not liking DHCP is moot. Servers are not desktops, there rarely is a local server available. Recently "ip helper" router config foo can sometimes relay DHCP broadcasts to a remote server, but this is far from reliable and the config needed to work on a given router/interface is far from predictable.

This is so UTTERLY MORONIC. That shiny new "ip helper" stuff is something I've been using for the past 15 years, and I don't recall it being new at the time. It sure as hell is 100% "reliable" and "predictable" in every possible way. I've got hundreds and hundreds of systems depending on high-availability DHCP servers every single day, running that way for years, without any hiccups.

Only front-panel IP address setting I've seen has been on old laser printers and crappy Infortrend RAID arrays

EVERY DELL SERVER, produced in the past decade or so, has a front-panel LCD allowing IP configuration of the BMC/DRAC. What you've "seen" is a lousy measure of anything, since you're spouting nothing but ignorance and nonsense left and right.

You start with a serial console, then if firmware bugs don't prevent it, you configure static addressing, IP address/netmask/gateway, user/pass, and hope that the network cables and connections are right and working.

That's idiotic. Get your PXE environment working right, and you don't even need to look at the thing until the OS install has finished and restarted the box.

If one's forced to run a Linux or BSD ... and if some unicorn magically makes that configuration. You're skipping a bunch of bootstrapping.

Why do you think you need Unicorns to configure a PXE boot server and OS kickstart deployment? That's a big part of the SysAdmin's job.

It's damn clear you've never done any of this, most basic best-practices in the enterprise world. Sounds like your half-assed company needs to find a halfway decent admin.

DVD's also require a drive, which rarely exists any more. But what good would a USB flash drive do anyway?

Playing dumb again? (Or are you playing?)

Customize the media any way you want it... Have it install an OS, enable a console on the serial ports, configure a working dhcp/pxe server, etc. Have it install the manufacturers binaries for configuring the server, and change things however the hell you want. You don't need to be dependent on what the manufacturer does by default.

Where exactly would I expect unskilled hands in Sofia, Bulgaria to source such a box, and which unicorn would configure it?

Look in the mirror, Unicorn. Are you a sys admin or not? Some reason you refuse to do the job?

This would also consume an extra switch port and rack unit and require a VLAN to be configured on the switch. Lots of complexity for no benefit, and yet another OS to manage.

No extra VLANs needed, the benefits are massive, and I've listed exactly what they are, repeatedly.

You want to talk about added complexity for no benefit, let's talk about these terminal servers you insist on using, a decade after everyone else in the world replaced serial port OoBM with IPMI and SOL.

Re:Remote fixes always a hair raiser (1)

cthulhu11 (842924) | about a year ago | (#43113687)

*Sigh*... I've worked with plenty of people like you before. Acting like an asshole

Who's the one namecalling here?

This is so UTTERLY MORONIC. That shiny new "ip helper" stuff is something I've been using for the past 15 years, and I don't recall it being new at the time. It sure as hell is 100% "reliable" and "predictable" in every possible way

Maybe for you, with whatever homogenous environment you have. IOS-XR bugs happen.

I've got hundreds and hundreds of systems depending on high-availability DHCP servers every single day

And you're telling *me* that I'm not a sysadmin? What bizarre fear do you have of static configuration?

EVERY DELL SERVER[...]

Dell? Seriously? Do you work in Hollywood setting up rows of servers for spy show shoots? I'll bet they're all in one room.

What you've "seen" is a lousy measure of anything

Because you're more important than me?

since you're spouting nothing but ignorance and nonsense left and right.

Oh right, I clearly am imagining the bugs in HP's iLO. And the PXE code on the shitty onboard NICs that shipped with the DL580G7. What happens in your DHCP wonderland when new systems are installed, you telepathically discover the MAC addresses, configure everything, and the box doesn't magically start taking SSH sessions? In your world remote hands never silently connect the wrong switch ports, connect the wrong server NIC, use a crossover cable by mistake, use the wrong jack on a patch panel? You never have bad cables, bad patch panel ports, bad switch ports, DOA server NICs?

That's idiotic. Get your PXE environment working right

My PXE environment works fine. It however can't work magic.

Why do you think you need Unicorns to configure a PXE boot server

Not what I said.

and OS kickstart deployment? That's a big part of the SysAdmin's job.

Management would laugh me out of my job if I told them that I needed to fly around the world to do every system install personally.

It's damn clear you've never done any of this

Really? I must then have really vivid synthetic memories of my career, one which has seen attrition of countless erstwhile peers who didn't get it and couldn't work with others. Heck, you're probably one of them.

most basic best-practices in the enterprise world.

Setting phasers to "stun" and boinking green alien ladies is beyond the scope of this hissy fit.

Sounds like your half-assed company needs to find a halfway decent admin.

We have a number of them, and we're assed-enough to not buy Dell.

Playing dumb again? (Or are you playing?)

Customize the media any way you want it... Have it install an OS

And this media gets from me to the other side of the world how exactly?

enable a console on the serial ports

Any half-decent server has a working serial console out of the box.

Have it install the manufacturers [sic] binaries

Where exactly would I expect unskilled hands in Sofia, Bulgaria to source such a box, and which unicorn would configure it? Look in the mirror, Unicorn. Are you a sys admin or not? Some reason you refuse to do the job?

You're ragging on me, yet you think it's reasonable to fly between continents and try to hunt down a specific hackable box in Bulgaria? Open a window, you need more oxygen.

No extra VLANs needed

Yes, extra VLAN's needed, to have the extraneous little DHCP thing on the same network as each server. They both need to plug into the switch to talk to each other. Or does your network work over actual aether?

a decade after everyone else in the world replaced serial port OoBM with IPMI and SOL.

You really don't get the idea of worldwide unstaffed sites, do you?

Re:Remote fixes always a hair raiser (1)

evilviper (135110) | about a year ago | (#43133273)

You're the worst combination of ignorant, tiresome, and obtuse when it suits you... So I'll just say goodbye, after I quickly address just a few of your points...

use a crossover cable by mistake

I have to point out, this statement shows massive ignorance. Every NIC and switch that can do gigabit does auto MDIX, and will just work, whether you have a straight or xover cable.

Management would laugh me out of my job if I told them that I needed to fly around the world to do every system install personally.

It's not my fault you don't know what a kickstart deployment is, and couldn't be bothered to look it up...

Really? I must then have really vivid synthetic memories of my career

I have no doubt you have memories of being a sysadmin... But from multiple statements you've made, I can only assume those memories date from over 20 years ago, when the technology was vastly different than today.

Flash? (1)

angularbanjo (1521611) | about a year ago | (#43063241)

Didn't Apple just disable Flash again? Coincidence? Knew they should have turned off remote updating on the Rover's Mac Mini.

The design is very robust (5, Interesting)

chalker (718945) | about a year ago | (#43063353)

Check out the official rover press kit for a summary of the computer design (http://mars.jpl.nasa.gov/msl/news/pdfs/MSLLanding.pdf) Page 42 in particular:

"Curiosity has redundant main computers, or rover compute elements. Of this “A” and “B” pair, it uses one at a time, with the spare held in cold backup. Thus, at a given time, the rover is operating from either its “A” side or its “B” side. Most rover devices can be controlled by either side; a few components, such as the navigation camera, have side-specific redundancy themselves. The computer inside the rover — whichever side is active — also serves as the main computer for the rest of the Mars Science Laboratory spacecraft during the flight from Earth and arrival at Mars. In case the active computer resets for any reason during the critical minutes of entry, descent and landing, a software feature called “second chance” has been designed to enable the other side to promptly take control, and in most cases, finish the landing with a bare-bones version of entry, descent and landing instructions.

Each rover compute element contains a radiation-hardened central processor with PowerPC 750 architecture: a BAE RAD 750. This processor operates at up to 200 megahertz speed, compared with 20 megahertz speed of the single RAD6000 central processor in each of the Mars rovers Spirit and Opportunity. Each of Curiosity’s redundant computers has 2 gigabytes of flash memory (about eight times as much as Spirit or Opportunity), 256 megabytes of dynamic random access memory and 256 kilobytes of electrically erasable programmable read-only memory.

The Mars Science Laboratory flight software monitors the status and health of the spacecraft during all phases of the mission, checks for the presence of commands to execute, performs communication functions and controls spacecraft activities. The spacecraft was launched with software adequate to serve for the landing and for operations on the surface of Mars, as well as during the flight from Earth to Mars. The months after launch were used, as planned, to develop and test improved flight software versions. One upgraded version was sent to the spacecraft in May 2012 and installed onto its computers in May and June. This version includes improvements for entry, descent and landing. Another was sent to the spacecraft in June and will be installed on the rover’s computers a few days after landing, with improvements for driving the rover and using its robotic arm."

And according to a release they issued after landing, both computers receive the same updates and are running the same software (not a version or 2 behind like others have suggested): http://mars.jpl.nasa.gov/news/whatsnew/index.cfm?FuseAction=ShowNews&NewsID=1305 [nasa.gov]

In space cosmic ray excuse never gets old (1)

WaffleMonster (969671) | about a year ago | (#43064171)

OK, let's assume a cosmic ray corrupted some random block of flash memory... so what? Why should that lead to failure to upload anything or enter sleep mode?

I can only assume there is an integrity check for block-level I/O from flash and that it just did not try to load garbage without knowing it. If it were any old PC app this would be perfectly acceptable behavior.

However, for ultra-expensive spacefaring things I would expect it to be designed to still try and be useful even if the southbridge caught fire.

Re:In space cosmic ray excuse never gets old (3, Informative)

Binestar (28861) | about a year ago | (#43064437)

Yeah... did you miss the part where it went to the redundant unit and sent an error to mission control? Sheesh.

Re:In space cosmic ray excuse never gets old (1)

WaffleMonster (969671) | about a year ago | (#43065261)

Yeah... did you miss the part where it went to the redundant unit and sent an error to mission control? Sheesh.

No, I missed it. I read both articles and neither of them mentioned (A) that the rover went to a redundant anything by *itself* or (B) that it sent an error.

It says they "NASA" switched it and that they noticed the problem when the rover did not uplink or enter sleep mode when it was supposed to... what error are you talking about?

And I think my point remains. Just because you can't read or write to an area of persistent storage, what prevents you from entering sleep mode or uplinking data? There was also no indication that anything but the flash memory was broken.

Re:In space cosmic ray excuse never gets old (4, Insightful)

Chris Burke (6130) | about a year ago | (#43065613)

OK, let's assume a cosmic ray corrupted some random block of flash memory... so what? Why should that lead to failure to upload anything or enter sleep mode?

Pretty much any fault, error, or out-of-bounds reading with any part of the rover causes it to stop whatever it is doing and wait for ground control to check it out and decide what to do. If the fault is with the computer itself, it makes sense to gracefully enter safe mode. It probably was a cosmic ray flipping a random bit, but you can't assume that when designing your fault handler.

If it were any old PC app this would be perfectly acceptable behavior. However, for ultra-expensive spacefaring things I would expect it to be designed to still try and be useful even if the southbridge caught fire.

See, I think you have that backwards. If it were a PC app it would be appropriate to just assume the error was insignificant or more likely not bother checking in the first place. If it's a more serious problem then eventually the app or OS might crash, the user will reboot, and if that doesn't work reinstall, and if not that then they'll just go get some new hardware.

For a multi-billion rover on another planet, you don't want to just wait and see what happens. Any anomaly at all should be cause for cautious, deliberate action. Heck, the whole project is run that way.

The rover was designed with a lot of redundancy and flexibility so that it can be useful even in the face of more serious problems, and if that turns out to be the case they'll find a way to make the rover as useful as possible. Missing a couple night's worth of downloads and delaying some activities in order to take the time to make sure they're maximizing the rover's future potential is an easy tradeoff.

Re:In space cosmic ray excuse never gets old (1)

WaffleMonster (969671) | about a year ago | (#43070175)

Pretty much any fault, error, or out-of-bounds reading with any part of the rover causes it to stop whatever it is doing and wait for ground control to check it out and decide what to do.

That's a great strategy; the only problem is that, per TFA, the only indication they received was noticing that it was not behaving the way it was supposed to. They had to look around to figure out why.

If the fault is with the computer itself, it makes sense to gracefully enter safe mode. It probably was a cosmic ray flipping a random bit, but you can't assume that when designing your fault handler.

You don't have to assume anything. You KNOW the block is invalid. A bad block should not cripple the computer so that it can't do anything else. There is no indication from TFA there were any other faults.

See, I think you have that backwards. If it were a PC app it would be appropriate to just assume the error was insignificant or more likely not bother checking in the first place.

All I do is write software, and I refuse to follow this shitty advice. Every error should be checked and handled. Besides, the fricking hardware does all the heavy lifting for us; all we have to do is check the return codes of read() and write(). As they say, not rocket science.
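
To make "check the return codes" concrete, here is a minimal sketch in Python, where the low-level calls raise OSError instead of returning -1. The function name read_block and the (data, reason) return convention are assumptions made for illustration; nothing here reflects how the rover's flight software is written.

```python
import errno
import os


def read_block(fd, offset, size):
    """Read exactly `size` bytes at `offset` from file descriptor `fd`.

    Returns (data, None) on success, or (None, reason) when the read
    fails or comes back short, so the caller has to look at the outcome.
    """
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        data = os.read(fd, size)
    except OSError as exc:
        # Map the errno to its symbolic name, e.g. EIO or EBADF.
        return None, errno.errorcode.get(exc.errno, "UNKNOWN")
    if len(data) != size:
        # A short read is not an exception, but it is still not success.
        return None, "SHORT_READ"
    return data, None
```

Note that this only surfaces errors the OS and hardware report; silent corruption needs a separate check such as a checksum.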

serious problem then eventually the app or OS might crash, the user will reboot, and if that doesn't work reinstall, and if not that then they'll just go get some new hardware.

We're talking about I/O failure to flash not crashing an OS or broken hardware.

For a multi-billion rover on another planet, you don't want to just wait and see what happens. Any anomaly at all should be cause for cautious, deliberate action. Heck, the whole project is run that way.

From TFA this is exactly what they did do...they waited to notice the rover not doing what it was supposed to be doing. This is deserving in my view of "should not happen again".

"The issue cropped up Wednesday (Feb. 27), when the spacecraft failed to send its recorded data back to Earth and did not switch into its daily sleep mode as planned. After looking into the issue, engineers decided to switch the Curiosity rover from its primary "A-side" computer to its "B-side" backup on Thursday at 5:30 "

The rover was designed with a lot of redundancy and flexibility so that it can be useful even in the face of more serious problems, and if that turns out to be the case they'll find a way to make the rover as useful as possible. Missing a couple night's worth of downloads and delaying some activities in order to take the time to make sure they're maximizing the rover's future potential is an easy tradeoff.

"We have probably several days, maybe a week of activities to get everything back and reconfigured."

Re:In space cosmic ray excuse never gets old (1)

Chris Burke (6130) | about a year ago | (#43071421)

Thats a great strategy only problem with it is from TFA the indication they received was noticing it was not behaving the way it was supposed to be behaving. They had to look around to figure out why.

Yes, because normal operations were suspended.

You don't have to assume anything. You KNOW the block is invalid. A bad block should not cripple the computer so that it can't do anything else. There is no indication from TFA there were any other faults.

No, you don't. You don't know if the block is bad, if the data bus is suffering an intermittent fault that happened to occur while that block was being read, if it's the BIST or ECC mechanisms that are faulty, or if it's a software error corrupting the data. Going from "we got a fault on reading this block" to "that block and only that block is affected, let's get on with it" with no consideration is a great way to lose a rover.

All I do is write software and I refuse to follow this shitty advice. Every error should be checked and handled. Besides the fricking hardware does all the heavy lifting for us

Ah, so you only allow your software to be run on hardware with ECC corrected RAM and ECC caches and ECC data busses... seems weird to call this a "PC app" when it's excluding most of the PC market. Unless you're doing it yourself then you're only checking for a subset of errors.

Now, assuming it's one that you can see, how do you "handle" that error? Do you just not read from that file again but continue on under the assumption that it was a singular event of no further consequence? Or do you have the software notify you so you can identify what the actual source of the read error was?

The former, I presume, which is fine for the situation of a PC app. Just let the hardware do the heavy lifting, and don't worry about what it can't find, and don't worry about what errors it signals actually mean. If it's a more serious error that ends up causing rampant corruption, it's not your problem! Contact your OEM, your help desk can tell them.

We're talking about I/O failure to flash not crashing an OS or broken hardware.

So, you don't see how an I/O failure could cause an OS to crash, like say if it's reading a code page, and you're still assuming that an ECC error on reading a block of flash can only mean that it's the ram cell itself and only that ram cell that could be affected? You're willing to bet 2.5 billion dollars and the rest of the mission on these assumptions?

Okay.

From TFA this is exactly what they did do...they waited to notice the rover not doing what it was supposed to be doing.

Which is a consequence of the rover doing what it should be doing -- ceasing normal activities on detecting a fault. We're talking about what the rover should be doing -- assume it's a one-off fault and continue normal operation minus that one block as you would have it, or try to prevent anything else bad from happening by going into safe mode (!= sleep mode, btw) and waiting for ground control to figure out the problem.

"We have probably several days, maybe a week of activities to get everything back and reconfigured."

Yes, and? Are you quibbling over "couple" vs "several" -- is this attempted pedantry, or are you actually implying that a couple days is fine, but waiting a week to finish the historic first-time analysis of an interior rock sample on Mars to make sure it has the maximum chance of success instead of bricking the rover crosses the line?

Re:In space cosmic ray excuse never gets old (1)

WaffleMonster (969671) | about a year ago | (#43073895)

Yes, because normal operations were suspended.

Why the mystery? Why couldn't it just say that this failed?

No, you don't. You don't know if the block is bad, if the data bus is suffering an intermittent fault that happened to occur while that block was being read, if it's the BIST or ECC mechanisms that are faulty, or if it's a software error corrupting the data. Going from "we got a fault on reading this block" to "that block and only that block is affected, let's get on with it" with no consideration is a great way to lose a rover.

Why should the rover have to read or write to persistent storage to continue to operate?

Ah, so you only allow your software to be run on hardware with ECC corrected RAM and ECC caches and ECC data busses... seems weird to call this a "PC app" when it's excluding most of the PC market. Unless you're doing it yourself then you're only checking for a subset of errors.

If you're going to use this interpretation of "error", ECC is not good enough. It can fail undetected as well; the same goes for cryptographic signatures. This is not a grand tour of everything and anything that can go wrong. I never asserted the system should continue if the running image was suspect. That would be madness. This is about I/O to persistent storage specifically.

Now, assuming it's one that you can see, how do you "handle" that error? Do you just not read from that file again but continue on under the assumption that it was a singular event of no further consequence?

Not using anything you have reason to think may be suspect is a fine strategy.

In PC land I trust the I/O subsystem to retry read operations to underlying media and remap failed write operations as appropriate. I trust the storage subsystem to monitor persistent storage and inform of any systematic problems. Trust is more powerful than paranoia. Make each component trustworthy and each exchange between components transactional, instrument each subsystem so you can be alerted to problems and make good decisions. Paranoia does not scale.
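
A rough sketch of the "retry, then alert" idea in Python, assuming each stored block carries a CRC-32 recorded when it was written. The names read_verified and StorageFault are hypothetical, and the retry count is arbitrary.

```python
import zlib


class StorageFault(Exception):
    """Raised when a block cannot be read back intact after all retries."""


def read_verified(read_fn, expected_crc, retries=3):
    """Call read_fn() up to `retries` times until the CRC-32 matches.

    Returns (data, attempts_used). Raises StorageFault if every attempt
    fails, so the caller is alerted instead of consuming suspect data.
    """
    for attempt in range(1, retries + 1):
        data = read_fn()
        if zlib.crc32(data) == expected_crc:
            return data, attempt
    raise StorageFault(f"checksum mismatch after {retries} attempts")
```

What the caller does with StorageFault (skip the block, or stop and wait for an operator) is exactly the policy question being argued in this thread; the sketch does not settle it.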

Yes, and?

No implication, was providing a data point from TFA.

Re:In space cosmic ray excuse never gets old (1)

AC-x (735297) | about a year ago | (#43068117)

However, for ultra-expensive spacefaring things I would expect it to be designed to still try and be useful even if the southbridge caught fire.

The computer on Curiosity is completely redundant and has switched over to the secondary computer, even if the primary computer has suffered fatal hardware failure the rover can continue to operate on the secondary. If that's not "being useful" after a failure I don't know what is!

Re:In space cosmic ray excuse never gets old (1)

WaffleMonster (969671) | about a year ago | (#43070535)

The computer on Curiosity is completely redundant and has switched over to the secondary computer, even if the primary computer has suffered fatal hardware failure the rover can continue to operate on the secondary. If that's not "being useful" after a failure I don't know what is!

This is confusing my point. You're drawing a "systems" box around both computers while I have only drawn a box around one computer.

Relying on hardware redundancy to fix a software problem kind of spoils the reason for hardware redundancy, doesn't it? What if B-side had been burnt to a crisp and then the same problem occurred? I'm sure they have an answer for that.

Re:In space cosmic ray excuse never gets old (1)

AC-x (735297) | about a year ago | (#43070809)

Who said it was a software only problem? The article suggests the flash memory may have been corrupted by cosmic rays, how do you protect against that? Redundancy. How redundant did they make it? Completely redundant (2 separate computer systems).

Plus no-one said that the A-side could never recover on its own (like what happened with Spirit), I'm sure it's just a lot easier to boot the redundant system and diagnose it from there.

Or do you have a better idea for how they could have architected it?

Re:In space cosmic ray excuse never gets old (1)

WaffleMonster (969671) | about a year ago | (#43073465)

Who said it was a software only problem?

NASA did. From the space.com article they said some files were corrupted meaning flash hardware could still be accessed.

The article suggests the flash memory may have been corrupted by cosmic rays, how do you protect against that? Redundancy.

There are several ways to do it using error correction techniques at the cost of some capacity. My point is not that flash should not have failed; it is that the system should be able to continue to function with external I/O failure present as long as the core system processor/northbridge is OK. An I/O error transferring data for one discrete experiment or function should not adversely affect another.
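
As one illustration of trading capacity for error correction, here is a sketch of triple redundancy with a bitwise majority vote in Python. This is the simplest scheme of its kind (it costs two thirds of the capacity) and is only an example; nothing in TFA says what, if anything, Curiosity's flash uses. The function name is hypothetical.

```python
def majority_vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Recover data from three stored copies using a bitwise majority vote.

    Each output bit takes the value held by at least two of the three
    copies, so a bit flipped in any single copy is corrected.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all three copies must have the same length")
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
```

For example, if one copy of b"rover" has a single flipped bit, voting it against the two intact copies returns b"rover" again. Two copies damaged at the same bit position would defeat the scheme.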

Plus no-one said that the A-side could never recover on its own (like what happened with Spirit), I'm sure it's just a lot easier to boot the redundant system and diagnose it from there.

This again is not my point. The point is it should still be able to function in the face of I/O failure.

Or do you have a better idea for how they could have architected it?

I would expect all I/O to be transactional and orthogonal operations to be isolated. From the space.com article I know that (1) data upload failed, (2) sleep failed, and (3) they had to take manual action to figure out why.
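
A small Python sketch of what "transactional" and "isolated" could mean in practice, under the assumption of an ordinary POSIX-style filesystem. The names atomic_write and run_isolated are hypothetical, and this is an illustration of the idea, not a claim about the rover's software.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Write `data` so readers see either the old file or the new one.

    The data goes to a temporary file in the same directory, is flushed
    to storage, and is then renamed over the target in a single step.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)
    except BaseException:
        # Leave no half-written temporary file behind.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def run_isolated(tasks):
    """Run independent tasks; a failure in one is recorded, not propagated."""
    results = {}
    for name, task in tasks.items():
        try:
            results[name] = ("ok", task())
        except Exception as exc:
            results[name] = ("failed", repr(exc))
    return results
```

With this shape, a failed uplink task would show up as ("failed", ...) in the results while the sleep task still runs. Whether carrying on like that is wise after an unexplained fault is the counter-argument made elsewhere in the thread.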

Re:In space cosmic ray excuse never gets old (1)

AC-x (735297) | about a year ago | (#43077289)

There are several ways to do it using error correction techniques at the cost of some capacity. My point is not that flash should not have failed; it is that the system should be able to continue to function with external I/O failure present as long as the core system processor/northbridge is OK. An I/O error transferring data for one discrete experiment or function should not adversely affect another.

You're acting as if the rover suddenly failed and is no longer working. The fact is the rover is completely operational, and they're currently diagnosing the problem fully to make sure they fix it.

Plus, have you considered it's far better to fail safe than to fail catastrophically? We're talking about computer systems with zero physical access. It seems like a pretty damn good idea to me, on detection of any sort of problem, to enter a safe mode so the problem can be fully diagnosed, instead of the rover just assuming it can carry on operating, executing faulty commands and putting itself in an uncontactable state like Viking I did [wikipedia.org].
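The fail-safe idea can be sketched as a tiny command gate: once any fault is reported, only diagnostic commands are accepted until the fault is cleared from the ground. The class, the command names, and the fault text are all made up for illustration and say nothing about the real flight software.

```python
class SafeModeGate:
    """Hypothetical command gate that refuses risky work after a fault."""

    # Assumed set of commands that are harmless to run while diagnosing.
    DIAGNOSTIC_COMMANDS = {"report_status", "dump_memory"}

    def __init__(self):
        self.fault = None
        self.executed = []

    def report_fault(self, description: str) -> None:
        """Any detected problem latches the gate into safe mode."""
        self.fault = description

    def clear_fault(self) -> None:
        """Only an explicit ground command leaves safe mode."""
        self.fault = None

    def execute(self, command: str) -> bool:
        """Run a command, or refuse it if safe mode forbids it."""
        if self.fault is not None and command not in self.DIAGNOSTIC_COMMANDS:
            return False
        self.executed.append(command)
        return True


gate = SafeModeGate()
gate.execute("drive_forward")
gate.report_fault("flash file corruption")
refused = gate.execute("drive_forward")   # rejected while in safe mode
allowed = gate.execute("dump_memory")     # diagnostics still work
```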

What if Curiosity found something? (-1)

Anonymous Coward | about a year ago | (#43064637)

Let's say Curiosity took a picture of, say, an item or a spaceship or anything that couldn't have been created naturally. How would NASA / the govt react?

Sending the pictures to everyone?

Or going in "stealth" mode and telling everyone: "There are issues with Curiosity, we'll try to switch to computer B. Meanwhile the regular mission is on hold."

Now I'm not saying that this is what happened. All I'm saying is that they'd basically act in precisely this way should some really major picture have been taken by Curiosity.

Damn thing (0)

Anonymous Coward | about a year ago | (#43064817)

probably runs Java. I say 'let it crash'.

Occam's Razor? (1)

Smerta (1855348) | about a year ago | (#43064885)

E.E. / firmware guy here... Corrupted files, and so far not even a mention of a potential software problem?

Even considering radiation out in space, it seems that it's still easier to get the hardware right than the software / firmware. I'm not jumping to any conclusions, but my first guess would be some kind of rare and unanticipated race condition, or some rarely-executed leg of the filesystem / logging software, etc.

I'm probably cynical from having worked on lots of safety-critical systems for a while, but it often just seems convenient to throw alpha particles under the bus and not even question, perhaps gently, whether there is some latent, obscure bug which just crapped all over the flash.

Is that the history-worth imminent announcement ? (0)

Anonymous Coward | about a year ago | (#43065739)

NASA spoke of a historic announcement or something a few months ago.
Is that it? Do not use TLC flash memory in an expensive piece of hardware?

Well, Duh!

complicating this, Earth-Sun-Mars are in a line (0)

Anonymous Coward | about a year ago | (#43067579)

Yes, we're coming into conjunction, and for the month of April, give or take, communicating with anything at Mars is difficult, because you have to point your antenna at the Sun, which is noisy and blocks or distorts the radio signals. Mars's orbit is inclined about 1-2 degrees relative to Earth's, so at best, it's 2 degrees away from the Sun. The Sun's corona extends at least that far, and is a fine radio signal distorter.

Not only that, but we're also about as far apart as you can get (2.5 AU), so that reduces the maximum data rate substantially (vs closest at 0.5 AU).
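To put a number on that, assuming received signal power falls off with the square of distance and the achievable data rate scales roughly with received power (a simplification that ignores antenna choice, coding, and solar noise), a short calculation with the distances quoted above looks like this. The function name is hypothetical.

```python
def relative_data_rate(distance_au: float, reference_au: float) -> float:
    """Data rate at distance_au relative to the rate at reference_au,
    under the inverse-square assumption stated above."""
    return (reference_au / distance_au) ** 2


# Farthest (2.5 AU) compared with closest (0.5 AU), figures from the comment.
ratio = relative_data_rate(2.5, 0.5)
slowdown = 1 / ratio
```

Under these assumptions the link at 2.5 AU supports roughly 1/25 of the rate available at 0.5 AU.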

I would imagine there's a fair amount of late night work going on to get this all resolved in the next couple weeks.
