Debugging The Spirit Rover

timothy posted more than 10 years ago | from the at-a-distance dept.

Space 390

icebike writes "eeTimes has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could have just as likely been lost forever. Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?"

390 comments

Oh, sure... (5, Funny)

inertia@yahoo.com (156602) | more than 10 years ago | (#8354094)

Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?

As a former co-worker (hi, jwalker!) used to say when people tried to draw ridiculous analogies, "It's exactly like that...only different."

Re:Oh, sure... (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354097)

You fucking moron. First posts are to be used for trolling only. Seriously, what the fuck is wrong with you?

Re:Oh, sure... (-1, Flamebait)

Anonymous Coward | more than 10 years ago | (#8354124)

Awww poor baby get beat to a fp by a subscriber? The world's smallest fiddle plays for you.

Re:Oh, sure... (1, Troll)

kraksmoka (561333) | more than 10 years ago | (#8354220)

actually, i had a saying with my high school sweetheart that would better express this idea. . . . .

its like being on a sea cruise, but different!

One reasonable analogy (1, Insightful)

Anonymous Coward | more than 10 years ago | (#8354295)

One lesson we can learn from the Spirit problems that really and truly does directly apply on earth:

Just in case of a worst case scenario, always make sure you have physical access to the machines.

Re:Oh, sure... (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354389)

As a former co-worker (hi, jwalker!) used to say

Woah! Hi Martin! How've you been? How's Ben doing? And how was Disneyland? Give me a call sometime :)

jwalker

Local Debugging (3, Funny)

webmaestro (323340) | more than 10 years ago | (#8354095)

Man, I have a hard enough time debugging programs running on my local machine.

Re:Local Debugging (5, Funny)

srichand (750139) | more than 10 years ago | (#8354168)

In other news stories, the Microsoft Corporation decided to sue NASA, apparently since the right to crash systems was only theirs. Not to be left behind, SCO insisted that the code that caused the failure was unethically copied from their source repositories. This has indeed caused a flutter in the space communities

I dont know about learning much.... (4, Funny)

detritus` (32392) | more than 10 years ago | (#8354098)

I don't think I want to learn too much from this, as the solution was the equivalent of rm -rf... On a side note, I wonder when the 40 min ssh delay jokes will begin again.

Re:I dont know about learning much.... (1)

BlueTrin (683373) | more than 10 years ago | (#8354141)

well i don't know about these jokes, but my Beowulf cluster has a delay of (x machines) x 40 minutes

well (4, Funny)

whackco (599646) | more than 10 years ago | (#8354104)

at least it wasn't a blue screen?

Like this? (4, Funny)

The Human Cow (646609) | more than 10 years ago | (#8354106)

man rover?

er - don't run Windows... (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354110)

although with MS going open source these days ... :)

Remote debugging? (4, Funny)

Nimloth (704789) | more than 10 years ago | (#8354114)

I don't get it, couldn't NASA afford the on-site warranty?

Re:Remote debugging? (0)

sysbot (238421) | more than 10 years ago | (#8354218)

Nope.

Re:Remote debugging? (5, Funny)

kfg (145172) | more than 10 years ago | (#8354227)

Yeah, but they thought they could save a few bucks and got the Gateway consumer version.

"Oh, you've got the on-site warranty, huh? Ok, first thing you have to do is ship it to South Dakota. . ."

Oh, hey, looks just like Mars.

KFG

Re:Remote debugging? (0)

Anonymous Coward | more than 10 years ago | (#8354271)

Shipping it to South Dakota isn't going to help. Gateway moved to California in 1998.

Re:Remote debugging? (1)

kfg (145172) | more than 10 years ago | (#8354330)

No. The executives moved to California. The schlubs are still in South Dakota. You think they're going to pay assemblers San Diego wages?

Gateway Factory [beans-arou...-world.com]

Although bits of California look like Mars too, except for the funny color of the sky.

KFG

Re:Remote debugging? (2, Funny)

operagost (62405) | more than 10 years ago | (#8354308)

When you get the on-site warranty, make sure they tell you WHICH site!

Re:Remote debugging? (0)

kfg (145172) | more than 10 years ago | (#8354343)

Sometimes I learn things the hard way.

KFG

rebooting on mars... (4, Interesting)

segment (695309) | more than 10 years ago | (#8354268)

Interesting reading:

Rebooting on Mars

By Matthew Fordahl, The Associated Press

It's a PC user's nightmare: You're almost done with a lengthy e-mail, or about to finish a report at the office, and the computer crashes for no apparent reason. It tries to restart but never quite finishes booting. Then it crashes again. And again.

Getting caught in such a loop is frustrating enough on Earth. But imagine what it's like when the computer is 200 million miles away on Mars. That's what mission controllers faced when the Mars rover Spirit stopped communicating last month.

...

Tech support for an $820 million mission is a cautious affair. Tools to recover from and fix any problem must be built into the system before launch. The systems' behaviors need to be completely understood and predictable.

"Luckily, during the design period, we anticipated that we might get into a situation like this," said Glenn Reeves, who oversees the software aboard the Mars rovers Spirit and Opportunity at NASA's Jet Propulsion Laboratory.

For stability, reliability and predictability, mission designers did not bust the budget and design the hardware or software from scratch. Instead, they turned to hardware and software that's been used in space before and has a proven track record on Earth as well.

"The advantage of using commercial software is it's well-known, and it's well deployed," said Mike Deliman, an engineer at Alameda-based Wind River Systems Inc., which made the rovers' operating system. "It has been used throughout the world in hundreds of thousands of applications."

The operating system, VxWorks, has its roots in software developed to help Francis Ford Coppola gain more control over a film editing system. But the developers, David Wilner and Jerry Fiddler, saw a greater potential and eventually formed Wind River, named for the mountains in Wyoming. VxWorks became a formal product in 1987.

rest of article [securityfocus.com]

Starmanta is a total fuckcock (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354120)

The GNAA trolled starmanta liberally. We strapped him in to our niggermobile and we couldnt keep our first-posting hands off of him. we were performing many first-post-submitting touches. we couldnt believe what the fuck was going on. we told starmanta that the slashdot trolling community would not approve of a fucking loser posting rabidly idiotic questions while logged in.

it doesn't help at all that starmanta has been modbombed out of ever posting above -1 again. he can hardly get modded after not touching the post anonymously button. how is he possibly going to make up for all this lost karma when he doesnt even know how to prepare a properly insightful karma whoring post? they'll make him drop his account in front of the whole Slashdot community. there it is. Malda just called and asked why starmanta cant stop being such a flaming faggot. he has to go.

Re:Starmanta is a total fuckcock (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354134)

This would be even funnier if you included one of his "pwnination flowcharts", maybe a pastiche of his sig too

The GNAA have never complained about their lack of karma. Neither does Starmanta

Unified Theory of Faggotry (Based on Suggestions) (-1, Troll)

Anonymous Coward | more than 10 years ago | (#8354163)

THE STARCRAFT FLOWCHART OF PWNATION, part n+1

Me [Zerg Air] --> Josh [Terran]

--
StarManta
I don't think BMW has ever complained about their 2% marketshare. Neither has Apple.

------------
-jx

lots of mem of an embedded system (4, Funny)

millette (56354) | more than 10 years ago | (#8354127)

Wow, I didn't expect the rover had 128MiB of RAM, or 256MiB of flash. Funny to think they had to run chkdsk from so far away :)

Re:lots of mem of an embedded system (1)

clifgriffin (676199) | more than 10 years ago | (#8354189)

chkdsk is windows.

I doubt they are running windows.

Space Technology (5, Insightful)

superpulpsicle (533373) | more than 10 years ago | (#8354131)

That's the thing that amazes me. Any technology having to do with space seems that much more advanced.

Here on earth we can't even build cars that require no maintenance and last more than 10 years.

Re:Space Technology (1)

Naffer (720686) | more than 10 years ago | (#8354142)

Do the people building the cars want them to last 10 years with no maintenance? Dealerships make wads of cash in their auto shops.

Re:Space Technology (1, Funny)

Anonymous Coward | more than 10 years ago | (#8354147)

I couldn't afford a car that NASA built... :P

Re:Space Technology (2, Insightful)

Anonymous Coward | more than 10 years ago | (#8354150)

Yeah offer to pay $800 million for a custom built car, and you can bet it will last 90 days too.

Re:Space Technology (0, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354165)

Here on earth we can't even build cars that require no maintenance and last more than 10 years.

Sure we can. It's just that nobody does, because making completely reliable, long-lasting products is not good business.

If your products die one day after the warranty is up, or if they last forever, you kill your repeat business. Companies have to strike that happy medium to keep people coming back for more.

Re:Space Technology (4, Insightful)

Billly Gates (198444) | more than 10 years ago | (#8354202)

The Japanese started that.

They make a lot of money from loyal customers. But I admit my 13-year-old '91 Honda Civic with 140k miles is getting on my nerves with repair costs. Would a '91 Ford Escort still be running today? I think not.

I will buy only Toyotas and Hondas for that reason.

It amazes me that consumers are too stupid to read Consumer Reports and instead buy cars on looks. Repair costs for things like Cadillacs and BMWs are not cheap in TCO terms! Yes, consumer products have TCO too, and we, not just businesses, should look at that as well.

Re:Space Technology (0, Redundant)

sangreal66 (740295) | more than 10 years ago | (#8354275)

My '89 temp runs great...

Re:Space Technology (1, Informative)

afidel (530433) | more than 10 years ago | (#8354290)

Just gave a '93 Ford Taurus to my brother's fiancée. It runs great, and in the 5 years I owned it I had to replace a seal on the radiator and that was IT other than oil and gas. My current car is a '99 Taurus with 158K miles and I haven't put a dime into it other than oil and brakes. I need to do spark plugs, as the fuel economy has gone down this winter and that's the most likely cause =)

Re:Space Technology (0, Redundant)

Kurt Russell (627436) | more than 10 years ago | (#8354298)

Would a '91 Ford Escort still be running today? I think not.


And why not? I have a 90 mustang 5.0 with 168k on the clock. My girl has an 88 grand am with 220k..;-)

Re:Space Technology (-1, Troll)

operagost (62405) | more than 10 years ago | (#8354322)

They make a lot of money from loyal customers. But I admit my 13-year-old '91 Honda Civic with 140k miles is getting on my nerves with repair costs. Would a '91 Ford Escort still be running today? I think not.
Yes, I'm sure there are absolutely no 1991 Ford Escorts still running today. Those American cars sure do suck! You are so insightful! Gosh! And that LTD that I ran up to 182,000 miles? Must have been my imagination! FIX OR REPAIR DAILY! BWAHAHAHAHA!
It amazes me consumers are too stupid to read consumer reports and buy cars on looks.
Yup, I'm sure that's the only criterion. Especially for Escorts ... they sure were snazzy! Well, compared to the Civics of the time, which looked like roller skates with headlights.

Ford Reliability shell game (0, Offtopic)

MythoBeast (54294) | more than 10 years ago | (#8354388)

Regardless of your personal experience, it is Ford's habit to replace reliable vehicles with unreliable ones. The classical example of this is the Festiva. Those little things just went and went, got excellent reviews in Consumer Reports, and really upset a few Ford corporate executives.

They replaced the vehicle with the Aspire, which Ford dealership automechanics quicky nicknamed the "expire" due to their regular need for maintenance. They still sold quite a number of them due to the reputation of the previous vehicle.

Re:Space Technology (0)

Anonymous Coward | more than 10 years ago | (#8354328)

They make a lot of money from loyal customers. But I admit my 13-year-old '91 Honda Civic with 140k miles is getting on my nerves with repair costs. Would a '91 Ford Escort still be running today? I think not.

I will buy only Toyotas and Hondas for that reason.
Overrated--- mod this bullshit down. ALL CARS break down. Sheesh..

Re:Space Technology (1)

Biogenesis (670772) | more than 10 years ago | (#8354344)

Maybe a '91 Ford Escort wouldn't be running, but my '91 Ford Falcon sure is, and with ~260Mm's (convert to miles yourself, I'm too lazy to use Google) on the clock it's still going strong with only a bi-yearly service.

Re:Space Technology (0)

Chrispy1000000 the 2 (624021) | more than 10 years ago | (#8354366)

Well, to debunk your theory about Fords in general, I call the following into evidence:

My car is a 1993 Ford Topaz and it's still running! Mind you, the E-brake, horn, normal brakes, RPM gauge, door locks, seatbelts, transmission, ball joints, cam shaft, air conditioning/heating panel, fan ducts, block heater, key thingy, and throttle cable all need a little work, but it still runs!

And it's a standard to boot, can spin tires on really, really slippery ice, and can get to 115kph with a good 30k stretch, and a lot of shaking.

Can you tell that I hate my car?

Re:Space Technology (4, Interesting)

kfg (145172) | more than 10 years ago | (#8354288)

No. You can't make a mechanical device like a car that requires no maintenance. Bearings wear out. Hoses and belts have a limited lifespan even if you never drive the car, etc. This is the real world. We will obey the laws of thermodynamics. Entropy always wins.

What you can do is make it require less maintenance, make that maintenance cheaper to perform, and make the car last until you hit something really hard, so long as you maintain it. You should be able to hand your car down to your kids.

Other than that you're bang on though.

I wonder what we can learn from that about maintaining our computers?

KFG

Re:Space Technology (5, Insightful)

beeplet (735701) | more than 10 years ago | (#8354174)

Actually any technology making it into space is more likely to be 10 years out of date... Getting anything certified for space is a long process. The technology in space isn't more advanced, just much better documented and well-understood.

Re:Space Technology (5, Insightful)

kfg (145172) | more than 10 years ago | (#8354248)

Ten years out of date, but ten years more reliable for the effort.

Sort of like Debian.

Cutting edge ain't always what it's cracked up to be.

KFG

Re:Space Technology (0)

Anonymous Coward | more than 10 years ago | (#8354207)

Until you realize that an MS-DOS computer in that same situation wouldn't have crashed.

do they use SSH ? (5, Funny)

Anonymous Coward | more than 10 years ago | (#8354135)

I hope they use SSH or something .. who's to say that on a future mission some hax0r doesn't grab control of a space probe and have it send goatse.cx pics back??

All it takes is a transmitter out in the middle of nowhere in Africa or on some island .. after all, the probe communicates using known frequencies. There may be problems picking up the return signal without an expensive antenna, I suppose. But then again maybe some hax0r can build one cheaply and/or do what Captain Midnight did ( www.signaltonoise.net/library/captmidn.htm ).

I wouldn't worry about signal jamming though, as that will probably be discovered easily.

Re:do they use SSH ? (1, Insightful)

Anonymous Coward | more than 10 years ago | (#8354179)

No, they didn't use SSH. However, a lucky hacker would have to have access to a very large radio antenna, like the one atop a volcano in Hawaii.

Re:do they use SSH ? (5, Insightful)

mcbridematt (544099) | more than 10 years ago | (#8354184)

I don't think they would bother using anything to do with TCP. Anything you do send you will have to wait 9 minutes for. Just imagine the ping times:

Pinging mars-rover with 32 bytes of data:
request timed out
request timed out
request timed out
64 bytes from mars-rover: icmp_seq=0 ttl=64 time=1080000ms :(

If it has anything to do with current internet protocols, it would be UDP.
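For anyone who wants to check the ping arithmetic, here is a minimal sketch of the light-time calculation. The speed of light is exact; the 162 million km distance is only an assumption, picked so that the one-way delay lands near the 9 minutes quoted above (the real Earth-Mars distance swings between roughly 55 and 400 million km). The function names are made up for this illustration.

```python
# Speed of light in vacuum, km/s (exact by definition of the metre).
SPEED_OF_LIGHT_KM_S = 299_792.458


def one_way_delay_s(distance_km: float) -> float:
    """Seconds a radio signal needs to cover distance_km."""
    return distance_km / SPEED_OF_LIGHT_KM_S


def round_trip_ms(distance_km: float) -> float:
    """Round-trip time in milliseconds, which is what a ping would report."""
    return 2.0 * one_way_delay_s(distance_km) * 1000.0


if __name__ == "__main__":
    # Assumed distance, not a figure from the article or the thread.
    distance_km = 162_000_000
    print(f"one way:    {one_way_delay_s(distance_km) / 60:.1f} min")
    print(f"round trip: {round_trip_ms(distance_km):.0f} ms")
```

At that assumed distance the round trip comes out a little over a million milliseconds, so an interactive session where every keystroke waits for an echo is not realistic.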

Re:do they use SSH ? (0)

Anonymous Coward | more than 10 years ago | (#8354230)

Now they do. They were using telnet before, but some hacker broke in and uploaded megabytes of porn to its flash RAM. Eventually the rover ran out of memory and crashed. NASA has now switched to SSH to keep hackers from breaking in again in future.

This is news why? (-1, Flamebait)

Anonymous Coward | more than 10 years ago | (#8354138)

To make a long article short, they basically nuked and reinstalled. Nothing most of us who run or support Windows machines haven't done. The only difference was the distance involved.

Pissed Martians (5, Funny)

Tablizer (95088) | more than 10 years ago | (#8354148)

The Martians are pissed that the repair labor was outsourced to Earth.

What's the big deal?? (4, Insightful)

prakslash (681585) | more than 10 years ago | (#8354166)

Unless you are a lay person, I don't understand what the big deal is .

If it was the hardware that got fried and they miraculously fixed that, I would understand but this was just a software glitch.

I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.

As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

Re:What's the big deal?? (5, Funny)

Gizzmonic (412910) | more than 10 years ago | (#8354219)

I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.

As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.


You are too humble, friend. What you do routinely and without thinking, is nothing less than a miracle of modern science. A miracle that you take part in every day. And because of men like you, we don't have to rely on the abacus anymore. We sent a pentium to the Moon, and soon, Mars will be colonized by G5s. America salutes you, for all the things that you do.....

Like a rock! I was strong as I could be!

Ooooooohh! Like a rock!

Re:What's the big deal?? (1, Redundant)

mattkime (8466) | more than 10 years ago | (#8354222)

As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

...and I suppose you have the entire news media providing constant updates to the world about your server reboots.

Actually, it is interesting only because it's NASA and it happened on Mars. NASA projects tend to have circumstances a bit different from most of us.

Re:What's the big deal?? (4, Insightful)

dellis78741 (745139) | more than 10 years ago | (#8354262)

The tricky part here was that the 'hardware connectivity' depended on 'software functionality'. Try maintaining a machine a block away if the communication link requires both ends to point a satellite dish at an orbiting satellite and that pointing relies on software functioning correctly.

Re:What's the big deal?? (4, Insightful)

FTL (112112) | more than 10 years ago | (#8354280)

I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.

As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

There are some fundamental differences, my friend:

  • If you screw up leaving the computer unbootable, you get local tech support to check the console and fix it. NASA on the other hand doesn't have tech support on Mars.
  • If you hose the server, it means a day's worth of reinstallation. If NASA hoses their rover, they just lost $300,000,000.
  • You can poke around the system and see what's wrong. NASA has a harder time since their lag time is 20 minutes.
  • You can download core dumps, NASA were operating on the low-bandwidth antenna which meant looking at file sizes, time stamps, selected lines, but not file contents.
  • You have your boss breathing down your neck (hoping for success), NASA have the international media breathing down their necks (hoping for a disaster).

Re:What's the big deal?? (4, Insightful)

updog (608318) | more than 10 years ago | (#8354287)

There is a big difference between this, and your example of forcing a controlled reboot of your remote machines.

Spirit was in a constant reboot cycle, and the fact that they could even communicate with it long enough to bypass the problem was an accomplishment (and lucky).

It would be more similar to your remote data-center machine suddenly going offline and you have no idea why, and you are unable to ssh to it, and you fix it by running through potential scenarios and finding that the problem could have been due to mounting a certain partition, then discovering that there's an exploit in ICMP that allows you to hack the kernel so it doesn't mount that partition.

Re:What's the big deal?? (4, Insightful)

amRadioHed (463061) | more than 10 years ago | (#8354303)

Are you forgetting that the latency when communicating with Mars averages around 1200000 ms? I'd say that when you have to wait 20 minutes to see the result of anything you do, you're going to have to substantially change your debugging strategy.

Re:What's the big deal?? (5, Interesting)

afidel (530433) | more than 10 years ago | (#8354315)

Actually I remember NASA doing a hardware repair from most of the way across the solar system. One of the deep space probes was starting to have a problem sending signals. Some bright mind at NASA looked at the circuit diagram and figured out that a single component (resistor, cap, can't remember) was starting to fail, and that there was a way to recondition the part. So they came up with a program that basically intentionally overstressed that component path, and the extra energy heated up the part and reconditioned it so that the unit was back to working condition.

Uh-oh (5, Funny)

z0ink (572154) | more than 10 years ago | (#8354170)

"We recognized early in the planning process that the flash file system had a limited capacity for files."

Sounds like NASA forgot to empty the rover's recycle bin. =)

Re:Uh-oh (3, Funny)

LnxAddct (679316) | more than 10 years ago | (#8354276)

I've thought long and hard on this topic and yes, on Windows it is accurately called the recycle bin because you don't get rid of the junk you put in there; it gets reused in some other part of your system. You put junk in, the junk is modified into other junk and then sent back to create new system DLLs. In Linux (and I believe Macs) it is accurately called the trash can because what we put in there is thrown out for good; we don't have our junk recycled to create more, but different, junk :)
Regards,
Steve

Re:Uh-oh (3, Funny)

brendan_orr (648182) | more than 10 years ago | (#8354368)

Nah, Linux, Mac OS X, *BSD, and other *nix users have /dev/null as a trash can.

Re:Uh-oh (1)

ProKras (727865) | more than 10 years ago | (#8354403)

Sounds like NASA forgot to empty the rover's recycle bin. =)

Fortunately, the folks at JPL can at least say that causing Spirit's memory error wasn't as stupid as Microsoft's first foray into DVRs [cnn.com] a couple years back.

The proper fix... (3, Insightful)

Dan East (318230) | more than 10 years ago | (#8354173)

...would have been to have "fixed" the problem before the hardware left earth. This "bug" (or more accurately, known limitation of the filesystem) should have been discovered here on earth if the rover had been properly tested.

The only real bug was the inability of the system to properly handle running out of file entries (or more specifically, consuming too much RAM as the number of file entries increased). However, the software should never have stressed the filesystem to that degree in the first place.

Dan East

Re:The proper fix... (1)

tiny69 (34486) | more than 10 years ago | (#8354266)

That is a problem with NASA's faster-better-cheaper approach to space flight. There's a good chance that a catastrophic bug will be missed. NASA lost a $125 million orbiter on Mars due to a metric conversion error. A simple conversion check was never done!!

http://clive.canoe.ca/CNEWSHeyMartha9911/10_metric.html [canoe.ca]
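A rough sketch of what a "simple conversion check" could look like in code: every value carries its unit and gets normalised on the way in, and bare numbers are refused. The comment above only says "metric conversion error"; using impulse (pound-force seconds versus newton-seconds) as the quantity is an illustrative choice here, and the class and function names are hypothetical, not anything from NASA's software.

```python
from dataclasses import dataclass

# 1 pound-force second expressed in newton-seconds (1 lbf = 4.4482216152605 N).
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse value that always stores newton-seconds internally."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(float(value))

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert once, at the boundary, so nothing downstream sees lbf*s.
        return cls(float(value) * LBF_S_TO_N_S)


def total_impulse(readings) -> Impulse:
    """Sum readings, refusing any bare number that carries no unit."""
    total = 0.0
    for reading in readings:
        if not isinstance(reading, Impulse):
            raise TypeError(f"reading without units: {reading!r}")
        total += reading.newton_seconds
    return Impulse(total)
```

The point of the design is that a mismatch becomes a loud TypeError at the interface instead of a quiet factor of about 4.45 in the numbers.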

Re:The proper fix... (4, Funny)

Chester K (145560) | more than 10 years ago | (#8354342)

The only real bug was the inability of the system to properly handle running out of file entries (or more specifically, consuming too much RAM as the number of file entries increased). However, the software should never have stressed the filesystem to that degree in the first place.

When you can write an embedded operating system that can gracefully and automatically recover from every possible thing that might ever go wrong, perhaps you should send your resume to NASA.

Hindsight (5, Insightful)

FTL (112112) | more than 10 years ago | (#8354175)

The article (I know, I know, this is Slashdot) is really good. It contains everything that is missing from traditional media. The story, the background, technical details, and follow through.

Granted, mainstream media have to keep their coverage dumbed down if Joe Public is going to read it. But what really bugs me is the lack of follow-up. We hear about poorly understood events as they are unfolding, then never hear about them later when they are completely understood.

A recent example is the gangway between ship and shore at the QM2's drydock. It collapsed killing lots of people, an investigation was launched. Why did it collapse? At the time it wasn't known. I'm sure it's known now, but there's been absolutely no followup.

This article about the rover is great not so much because of the level of detail but because it reports on an event with the benefit of hindsight.

Re:Hindsight (2, Informative)

Jeremy Erwin (2054) | more than 10 years ago | (#8354327)

I'm sure there will be at least some mention of the results of the investigation when it is completed and various persons are prosecuted. In the meantime, here's a relatively recent article [yahoo.com] on the investigation into the collapse.

What the article doesn't say (4, Insightful)

Mr2cents (323101) | more than 10 years ago | (#8354213)

What filesystem is used? Is wear leveling being used? The directory structure is apparently stored in RAM during the day (why else would it use so much RAM?), that is a good thing for reducing wear on the flash system. But what's the number of writes on the flash chips? When will that number be reached?

Re:What the article doesn't say (4, Interesting)

afidel (530433) | more than 10 years ago | (#8354329)

Never. The rovers are only going to operate for ~100 days, and the number of writes for modern flash RAM is 100K cycles minimum, over a million typical. So unless they are really screwing something up, that shouldn't be a limitation. Also, distributing file placement shouldn't be a software function; good CF cards do it in the controller logic.
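The wear argument is easy to put into numbers. A back-of-the-envelope sketch, using only the figures quoted in this comment (100K rated cycles, a ~100 day mission); the function names are invented for the example and it assumes the worst case where the same block is rewritten every time, with no wear leveling at all:

```python
def erase_budget_per_day(rated_cycles: int, mission_days: int) -> float:
    """Erase cycles per day a single block can absorb over the mission."""
    return rated_cycles / mission_days


def days_until_worn(rated_cycles: int, erases_per_day: float) -> float:
    """Days until one block reaches its rated cycle count."""
    return rated_cycles / erases_per_day


if __name__ == "__main__":
    # Figures from the comment: 100K cycles minimum, ~100 day mission.
    print(erase_budget_per_day(100_000, 100))    # budget at the minimum rating
    print(erase_budget_per_day(1_000_000, 100))  # budget at the "typical" rating
    # Hypothetical workload: the same block rewritten once an hour.
    print(days_until_worn(100_000, 24))
```

At the minimum rating that is a budget of 1,000 erases per block per day, so even a sloppy write pattern would struggle to wear the flash out within the planned mission.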

Mod this "redundant" (5, Informative)

Penguinshit (591885) | more than 10 years ago | (#8354221)


'How do you diagnose an embedded system that has rendered itself unobservable?'

The way you do this is by having an exact duplicate of the remote system so you can set up a test with conditions as close as possible to those under which the remote system is currently operating. You can then do a series of carefully controlled test solutions to determine the optimum prior to trying it on the "live" system.

This is the way I set up all my production systems and, barring catastrophic hardware failure (self-immolating disks and a router which just folded when its power supply burped) I've had perfect uptime.

(well, ok.. there was that one time, late at night, when I typed "reboot" in the wrong window.. but that happens...)

NASA should have simulated... (-1, Troll)

Anonymous Coward | more than 10 years ago | (#8354224)

the entire mission before launch. The fact that they filled up the flash memory with too many files that were accumulated during the cruise phase of the mission between earth and mars was something that they should have known would happen. As usual, however, NASA is just plain dumb lucky that they could save the mission. They are lucky that they could ctrl-alt-delete the rover and reformat the flash memory. But the main point is that they should not have had to do this at all; had they prepared properly, this would not have happened in the actual mission because it would have happened on the ground during simulations. I guess NASA cannot be trusted to execute the hard work anymore; it isn't any fun, nor does it get good TV.

I'm not even going to mention the fact that there is just no way that any peer review process could justify spending 0.8B$ to get some geological data that will at best be ambiguous. We are never going to really know whether there was any water on the surface of Mars; I'm not really sure what good it would do if we did know the answer about that. For 0.8B$, the government could fund 800 single-investigator NIH grants for five years each, and maybe 50 of those might come up with a cure for cancer, eliminate diabetes, solve the origins of heart disease and stroke, and maybe solve the mystery of life.

We are lucky that NASA is lucky. Or all that money would be wasted because NASA couldn't spend the time to execute the mission preparation properly. Jeez.

Re:NASA should have simulated... (3, Informative)

updog (608318) | more than 10 years ago | (#8354337)

"The fact that they filled up the flash memory with too many files that were accumulated during the cruise phase of the mission between earth and mars was something that they should have known would happen."

Apparently you didn't read the article. Because of a communication failure, a utility that was supposed to delete the old files didn't get completely uploaded. The utility was scheduled for retransmission, but the filesystem filled up before it got re-transmitted.

No exception handler?? (0)

SID*C64 (444002) | more than 10 years ago | (#8354225)

With all of the money we spend on this stuff, couldn't someone have written an exception handler for this? Haven't we learned our lessons in the past about unhandled exceptions?

The article states that they are working on one now. A bit late eh? Lucky indeed.

Ran out of flash disk space. No, really. (2, Insightful)

randyest (589159) | more than 10 years ago | (#8354233)

If you RTFA you will realize that I'm not lying in the least when I say that, effectively, they ran out of flash-based "disk" space! They forgot to delete old files when updating the programs in the flash memory (which is mounted like a filesystem, or hard disk), and the OS was failing because it wanted to use that space. So it rebooted, and still had insufficient disk space, and rebooted again . . . lather rinse repeat. There was no signal because it was stuck in a reboot loop because they ran out of disk. Wow.

They fixed it by telling it to boot without using the flash (safe mode :) ), then used low-level (direct access) flash utilities to remove the old files. Reboot, mount, disk check / corruption repair, voila it works again.
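
A toy model of the reboot loop and the way out, for anyone who wants to see it spelled out. Everything here is hypothetical: the names, the numbers, and especially the automatic fallback after a few failed attempts. Per the article the no-flash boot was commanded from the ground, so treat this as a sketch of the idea, not of the rover's software.

```python
class MountError(Exception):
    """Raised when the flash volume cannot be mounted."""


def mount_flash(file_count: int, table_capacity: int) -> str:
    # Mounting builds a directory table in RAM; too many files overflow it.
    if file_count > table_capacity:
        raise MountError(f"{file_count} files exceed table capacity {table_capacity}")
    return "normal mode: flash mounted"


def boot(file_count: int, table_capacity: int, max_attempts: int = 3) -> str:
    """Try the normal boot a bounded number of times, then skip the flash."""
    for _ in range(max_attempts):
        try:
            return mount_flash(file_count, table_capacity)
        except MountError:
            continue  # a real system would reset here; we just count the attempt
    return "safe mode: flash not mounted"


def purge_old_files(file_count: int, stale: int) -> int:
    """Stand-in for the low-level delete that works without mounting."""
    return max(0, file_count - stale)


files, capacity = 1200, 1000          # made-up numbers
print(boot(files, capacity))          # safe mode: flash not mounted
files = purge_old_files(files, 500)
print(boot(files, capacity))          # normal mode: flash mounted
```

The point of the bounded retry is that rebooting with the same inputs can never succeed, so something has to break the cycle.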

We have a big 1TB NetApp server where I work, and we have so much disk space that people get lazy and don't delete files or archive old projects. Then they get really confused when jobs fail, not thinking of disk space until they've checked everything else first. But it happens, and it's usually surprisingly hard to debug (they check a lot of other things first, sometimes even upgrading tool versions!). It's really kinda funny, in an expensive and mildly embarrassing way, that the Spirit had the same problem.

Re:Ran out of flash disk space. No, really. (1)

AaronStJ (182845) | more than 10 years ago | (#8354362)

They forgot to delete old files when updating the programs in the flash memory (which is mounted like a filesystem, or hard disk), and the OS was failing because it wanted to use that space.


It's not even that they forgot to delete old files. The program they sent to delete the old files failed to upload correctly, and they ran out of space before they could retransmit the delete program.

Ran out of INODES. No really. (5, Informative)

dorko (89725) | more than 10 years ago | (#8354385)

If you RTFA you will realize that I'm not lying in the least when I say that, effectively, they ran out of flash-based "disk" space!
Well, I did read the article and I wouldn't say it quite like that. The article says: "Spirit attempted to allocate more files than the RAM-based directory structure could accommodate." Furthermore, the article says that the low-level file manipulation commands "worked directly on the flash memory without mounting the volume or building the directory table in RAM."

To me, if this were a Unix-like system, it sounds like they ran out of inodes [webopedia.com]. Running out of inodes is very different from running out of disk space.

If you think running out of disk space can be hard to troubleshoot, try running out of inodes.
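
For the Unix-like case, a minimal sketch of how to look at both numbers at once, using Python's os.statvfs (Unix only). The function name and the dictionary keys are my own. Note that this has nothing to do with what the rover's OS exposes; it only illustrates the space-versus-inodes distinction on an ordinary box.

```python
import os


def fs_usage(path: str = "/") -> dict:
    """Return percent of blocks and inodes in use on the filesystem holding path."""
    st = os.statvfs(path)
    # Some filesystems report zero totals; guard against dividing by zero.
    blocks_pct = 100.0 * (st.f_blocks - st.f_bavail) / st.f_blocks if st.f_blocks else 0.0
    inodes_pct = 100.0 * (st.f_files - st.f_favail) / st.f_files if st.f_files else 0.0
    return {"blocks_used_pct": blocks_pct, "inodes_used_pct": inodes_pct}


usage = fs_usage("/")
print(f"space used:  {usage['blocks_used_pct']:.1f}%")
print(f"inodes used: {usage['inodes_used_pct']:.1f}%")
```

A filesystem can show plenty of free space and still refuse to create files if the second number is at 100%. On the command line, df -i shows the same thing.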

mod parent down (2)

ChrisCampbell47 (181542) | more than 10 years ago | (#8354395)

Wrong wrong wrong, as I'm sure someone else will post. He spins a good yarn but he's just a machine room flunky and hasn't RTFA himself.

only 120 megs ram? (-1, Troll)

imsirovic5 (542929) | more than 10 years ago | (#8354237)

From the article:

"The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems at the heart of the system. The processor accesses 120 Mbytes of RAM and 256 Mbytes of flash. Mounted in a 6U VME chassis, the processor board also has access to custom cards that interface to systems on the rover."

The project cost about 800 million; would it have hurt to spend a few dollars more for greater flash capacity??

Re:only 120 megs ram? (0)

Anonymous Coward | more than 10 years ago | (#8354267)

Well, considering the problem was they had too many files on the flash, why would they want even more? They should have had more RAM, not flash.

not really... (3, Informative)

rebelcool (247749) | more than 10 years ago | (#8354270)

on projects such as this, the design specs would've been frozen several years ago, and then would've been conservative for the time, using proven technology.

Another factor in this is the safety of the flash ram. It is rad-hardened and built with tons of extra error correction which again, requires years of testing and special design considerations. And is extremely expensive.

YOU FAIL IT? (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354242)

FYI (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354313)

Dude, goatse.cx is gone.

Lucky Hack? (5, Insightful)

electromaggot (597134) | more than 10 years ago | (#8354243)

"The outcome strikes me as an extremely Lucky Hack..."

The outcome does not strike me as a "Lucky Hack." They made the system flexible, that flexibility got them into some trouble, and it's also what got them out of it. Anyone else agree?

All these worlds.... (4, Funny)

dmeranda (120061) | more than 10 years ago | (#8354246)

"The irony of it was that the operating system was doing exactly what we'd told it to do," Klemm lamented.

Yeah, that was HAL's excuse too.

Seriously, hats off to all the JPL programmers. Proving to the Martians that there is indeed intelligent life on Earth, very intelligent.

Remote debugging pet peeve (5, Funny)

Peter McC (24534) | more than 10 years ago | (#8354261)

My pet peeve when I'm doing remote troubleshooting is 'ifconfig eth0 down'...oops. At least NASA is smarter than that.

Peter.

Doh (1)

BlueTrin (683373) | more than 10 years ago | (#8354269)

Klemm explained that as data is collected by Spirit, files are created and stored in the flash file system until a communications window opens -- an opportunity to transmit the data either directly to Earth or to one of the two orbiters circling the Red Planet. Then the files are transmitted. They are still held in the flash system until retrieved and error-corrected on Earth.


They should just have ticked the "autoaccept and minimize" checkbox.

Lucky Hack? (5, Insightful)

SuperKendall (25149) | more than 10 years ago | (#8354291)

Your post is the only thing that strikes me as a "Lucky Hack" here. They included the ability in the design to remotely disable booting from flash and upload new boot images, in what way is that a "hack"? All this is just foresight in design to include as many possible recovery modes as they could.

Basically, they rebooted from a recovery image (sent via radio) and then proceeded to do low-level fixes on Flash memory and then ran a chkdisk. If I do something similar via recovery disk or CD, I don't get a lot of people telling me that it was a "Lucky Hack" that I could boot off of CD!!!

WindRiver's fault (-1, Troll)

Anonymous Coward | more than 10 years ago | (#8354292)

Those crappy guys at WindRiver messed it up. If an operating system other than VxWorks had been used, it would have been better....
Wonder how much WindRiver "SPENT" to get their OS on board ... hmmm ... efficiently spent "Marketing Budget"

Re:WindRiver's fault (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354345)

You're just bent because they laughed WinCE out of the building when you pitched it...

Seems like a stupid mistake to me (0)

Anonymous Coward | more than 10 years ago | (#8354293)

I know next to nothing about programming, but I'm a fairly good armchair quarterback.

"Spirit attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception..."

For an agency that usually tries to think of everything, doesn't this seem like a stupid lack of planning? To not have any error handling to catch something that is trying to allocate more memory than what is available? From a layman's perspective, this seems like a rookie goof. Please correct me if I'm wrong.

"...just in case, the team is working on an exception-handler routine that will more gracefully recover from an allocation failure."

I think anything would be more graceful than 'totally puke and get stuck in a futile reboot cycle'.

Our tax dollars at work...
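
For what "more gracefully recover from an allocation failure" might look like in the small, here is a minimal sketch. The class, the capacity, and the choice to skip the file and carry on are all assumptions for illustration; the article does not say what the team's handler actually does.

```python
class DirectoryTable:
    """Fixed-capacity table of file names, standing in for a RAM directory."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = []

    def allocate(self, name: str) -> None:
        if len(self.entries) >= self.capacity:
            raise MemoryError(f"no room in directory table for {name!r}")
        self.entries.append(name)


def store_file(table: DirectoryTable, name: str) -> bool:
    """Try to record a file; on failure, report it and keep running."""
    try:
        table.allocate(name)
        return True
    except MemoryError as err:
        print(f"allocation failed, file skipped: {err}")
        return False


table = DirectoryTable(capacity=2)
results = [store_file(table, n) for n in ("img1", "img2", "img3")]
print(results)  # [True, True, False]
```

The third file is lost, but the system stays up and can still report the problem, which beats a reset.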

NASA Rocks! (5, Interesting)

blueZhift (652272) | more than 10 years ago | (#8354301)

Great article! This is just the sort of thing that has always impressed me about NASA and the JPL. Just when mere mortals might give it up and walk away, they figure out the problem. I can only imagine how wild the party must have been after they fixed Spirit; the scientists and engineers I've worked with in the past could really put away the booze.

Seriously though, the key lessons to take away from this are:

1) Gather all of the clues you can.

2) Take those clues and build a model.

With luck and care, the model should get you closer to what may have gone wrong. And in this case it apparently did just that. Now that's geek cool!

BTW, I know that generally you want to prevent this sort of thing from happening. But in reality most software ships with bugs and launch windows to Mars are non-negotiable.

Remote safe mode (3, Interesting)

Megane (129182) | more than 10 years ago | (#8354314)

The first thing needed to achieve remote maintainability on the order of space probes is some way to access a machine remotely when it's not running the full OS. A KVM switch isn't going to work over long distances. The BIOS needs a way to run over the network. Same for the kernel boot messages. Whether it's through a serial console and SSH server, or through the BIOS running TCP/IP, what we have now isn't enough. A separate console server could also control a power cycle/reset switch circuit.

There also needs to be a way to load bootstrap code remotely. For instance, having a TCP/IP enabled BIOS be able to run TFTP or some other protocol to load a netboot floppy image. Then you could give it a LILO command instructing it where to find a boot image, preferably one on a server in the same hosting center.

whoops (5, Funny)

usillyman (755322) | more than 10 years ago | (#8354349)

Operating System not found. Press any key to continue.
Damn! Left the floppy in!

Could an earthbout 'twin' computer help? (4, Interesting)

AaronStJ (182845) | more than 10 years ago | (#8354350)

What surprises me is that they don't have a 'twin' of the rover's computer system set up on earth. When commands are run on the rover, the same commands could be run on the computer system on earth. Then, if the rover's software fails (as it did), the software on earth would (theoretically) fail in a similar way, and be MUCH easier to debug. Of course, the systems wouldn't be identical (without building an entire duplicate and expensive rover), and the data gathered wouldn't be identical, but the twin could be carefully planned and fed dummy data that approximately mirrored the data the rover was gathering. For example, the twin could be fed dummy pictures about as often as the rover took a real picture.

From the article "[The] transmission that uploaded the utility was a partial failure: Only one of the utility program's two parts was received successfully. The second part was not received, and so in accordance with the communications protocol it was scheduled for retransmission on sol 19." NASA could have simulated a half failed transfer on the twin computer on earth, and then watched carefully using traditional debugging tools to make sure the failed transmission didn't cause a software failure (which it did).

Again, from the article "The data management team's calculations had not made any provision for leftover directories from a previous load still sitting in the flash file system." However, if they had a twin computer system to watch, they would have seen the failure occur on earth as it did in space. Debugging a system you can hook a serial debugger to is bound to be much easier than debugging a system a million miles away.
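
To make the idea concrete, a rough sketch of replaying a command log against a model of the file count. The capacity, the command format, and the numbers are all invented; the only thing taken from the article is that the delete utility did not arrive intact and old files were left behind.

```python
def first_overflow(commands, capacity):
    """Replay (op, count) commands; return the 1-based step that overflows, or None."""
    files = 0
    for step, (op, count) in enumerate(commands, start=1):
        if op == "create":
            files += count
        elif op == "delete":
            files = max(0, files - count)
        if files > capacity:
            return step
    return None


CAPACITY = 1000  # made-up directory table limit
planned = [("create", 600), ("delete", 500), ("create", 600)]
# Same plan, but the delete utility never arrived, so its step does nothing.
as_flown = [("create", 600), ("delete", 0), ("create", 600)]

print(first_overflow(planned, CAPACITY))   # None
print(first_overflow(as_flown, CAPACITY))  # 3
```

Replaying the log as it actually happened, rather than as it was planned, is what would flag the overflow on the ground first.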

Re:Could an earthbout 'twin' computer help? (4, Informative)

Anonymous Coward | more than 10 years ago | (#8354383)

Uhmm... we DID build a 'twin' of the rover, hardware and all. Give us a bit more credit, will ya? :-P What you may not realize is that exposure to radiation on the surface of Mars, solar wind while in transit and other factors such as thermal expansion / contraction, etc. are slowly degrading the rovers in nondeterministic ways. It is not nearly as simple as 'running the commands in the testbed' at JPL to diagnose any problems which occur.

Re:Could an earthbout 'twin' computer help? (2, Informative)

Gogo Dodo (129808) | more than 10 years ago | (#8354384)

They do have a twin system here, but having one here isn't quite the same as the two on Mars. You can't replicate everything on the two Mars rovers such as the science data files.

When Spirit was turned around on its lander, they tested the moves on its twin here, hence the long delay getting off the lander.

You can easily debug rovers with these! (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#8354354)

Hi,

My name is Simon Feay. I'm based out of Vancouver and I sell ex-lease computers
from the major banks to people across Canada.

Currently we're offering IBM P3 650mhz computers for $285.00.

I send out an email once a month with a special and was wondering if it would
interest you.

We have great deals on used laptops laser printers etc. also.

Let me know if you're interested.

Cheers

Simon

Aceon Computers
P. 604 873-3300
TF. 1-866-268.3792
simon@aceonsource.com
www.aceonso urce.com

Found in Spirit's flash memory (-1)

britneys 9th husband (741556) | more than 10 years ago | (#8354386)

To: spirit@mars.nasa.gov
Subject: Important Notification About Your PCs Recent Internet Activity:

You may have recently noticed that your computer's connection to
the Internet has been much slower than usual. If you, or someone
else that uses your PC, have been downloading Internet files such
as music, games, or movies, then adware and spyware programs may
have been added to your computer's hard drive without your direct
knowledge.

To check for any adware or spyware applications press on the link below.
There is no cost for this scan:
http://www.ScanPC4spyware.spyw.com

If after completing the complimentary SCAN it is brought to your attention
that your computer's hard drive is infected with adware, spyware, or both,
then it may be in your computer's best interest to remove the adware and
spyware applications.

Press below to begin the scan:
http://www.ScanPC4spyware.spyw.com

To UNSUBSCRIBE from these mailings, please send mailto:
RemovalReq@asurfer.com with REMOVE in the subject line.

This is an advertisement. To not receive future SpyWareNuker offers
or promotions, please click on: http://Removal4.mp3update.com or write to:

100 E. San Marcos Blvd.
San Marcos, CA 92069 USA...

There is a significant lesson to learn, here .. (3, Insightful)

Anonymous Coward | more than 10 years ago | (#8354360)

.. namely, "Do Not Use VxWorks". Use something stable instead. eCos [planetmirror.com] comes to mind. So does everyone's favorite OS these days [kernel.org] , which has RTOS support. Having been a frustrated VxWorks user in the past, I'd no more entrust my mission-critical services to it than I would to Microsoft. -- TTK