
Self-Repairing Computers

Hemos posted more than 11 years ago | from the repairing-the-box dept.

Technology

Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago, but operating them is much more complex. We've all experienced a PC crash or the disappearance of a large Internet site. What can be done to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery through what these researchers call micro-rebooting; better tools to pinpoint problems in multicomponent systems; an "undo" function (similar to those in word-processing programs) for large computing systems; and injected test errors to evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."


This would be great (4, Funny)

CausticWindow (632215) | more than 11 years ago | (#5935335)

coupled with self-debugging code.

Re:This would be great (-1)

Anonymous Coward | more than 11 years ago | (#5935483)

"tighten your buttocks, pour juice on your chin - i promised my girlfriend.. the violin"

Great song. I bet most people here have no idea what that's from. :)

Re:This would be great (1)

ThundaGaiden (615019) | more than 11 years ago | (#5935484)

And they had better make sure that the self-debugging can handle threaded applications.

It'll be great, but I can tell you one thing straight off... I really pity the first person who has to code it, ha ha ha. It's not going to be me.

DWIM (3, Funny)

PhilHibbs (4537) | more than 11 years ago | (#5935577)

We've had RISC, MMX, VLIW, SSI; maybe it's time for DWIM [google.co.uk] processors.

Hmmm... (-1, Redundant)

Thaidog (235587) | more than 11 years ago | (#5935338)

Sounds a lot like magic, self-healing server pixy dust to me...

This post (2, Funny)

nother_nix_hacker (596961) | more than 11 years ago | (#5935339)

Is Ctrl-Alt-Del ROC too? :)

Add it to windoze! (-1, Troll)

Anonymous Coward | more than 11 years ago | (#5935340)

That's what windoze needs

Managerspeak (3, Insightful)

CvD (94050) | more than 11 years ago | (#5935342)

I haven't read the long and dense article, but this sounds like managerspeak, PHB-talk. The concepts described are all very high-level, requiring a whole plethora of as-yet-unwritten code to roll back changes in a large system. This will require a lot of work, including rebuilding many of those large systems from the ground up.

I don't think anybody (any company) is willing to undertake such an enterprise, having to re-architect/redesign whole systems from the ground up -- systems that work these days, but aren't 100% reliable.

Will it be worth it? For those systems to have a shorter boot-up time after failure? I don't think so, but YMMV.

Cheers,

Costyn.

Re:Managerspeak (5, Interesting)

gilesjuk (604902) | more than 11 years ago | (#5935347)

Not to mention that the ROC system itself will need to be rock solid. It's no good to have a recovery system that needs to recover itself, which would then recover itself and so on :)

Self-diagnostics (4, Interesting)

6hill (535468) | more than 11 years ago | (#5935459)

I've done some work on high-availability computing (incl. my Master's thesis), and one of the more interesting problems is the one you described here -- true metaphysics. The question as usually posed goes: how does one self-diagnose? Can a computer program distinguish between malfunctioning software and malfunctioning software-monitoring software -- is the problem in the running program or in the diagnostic software itself? How do you run diagnostics on diagnostics running diagnostics on diagnostics... ugh :).

My particular research system finally wound up relying on the Windows method: if uncertain, erase and reboot. It didn't have to be 99.999% available, after all. There are other ways to solve this in distributed/clustered computing, such as voting: servers in the cluster vote on each other's sanity (i.e. determine whether the messages sent by one computer make sense to at least two others). However, even this system is not rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious, or widespread in the cluster?).
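
To make the voting idea concrete, here is a minimal shell sketch -- with loud assumptions: the peer hostnames are invented, and the remote "healthcheck" command stands in for whatever sanity test a real cluster would use.

#!/bin/sh
# Majority-vote sanity check: ask each peer whether the target node's
# heartbeat messages make sense to it. Hostnames and the remote
# "healthcheck" command are hypothetical placeholders.
PEERS="node2 node3 node4"
TARGET="web1"
votes=0
total=0
for peer in $PEERS; do
    total=$((total + 1))
    if ssh "$peer" "healthcheck $TARGET" >/dev/null 2>&1; then
        votes=$((votes + 1))
    fi
done
# The node is declared failed only when it loses the majority.
if [ $((votes * 2)) -le "$total" ]; then
    echo "$TARGET voted insane by $((total - votes))/$total peers; rebooting it"
    ssh "$TARGET" reboot
fi

Note that the sketch inherits exactly the weakness described above: if a majority of peers malfunction in the same way, the vote itself goes wrong.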

So, self-correcting is an intriguing question, to say the least. I'll be keenly following what the ROC fellas come up with.

"Managerspeak"?! (3, Insightful)

No Such Agency (136681) | more than 11 years ago | (#5935421)

Somebody has to suggest the weird ideas, even if they sound stupid and impractical now. Of course we won't be retrofitting our existing systems in six months, I think this is a bigger vision than that.

Rather than trying to eliminate computer crashes--probably an impossible task--our team concentrates on designing systems that recover rapidly when mishaps do occur.

The goal here is clearly to make the stability of the operating system and software less critical, so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. Since all us good Slashdotters KNOW there will still be crufty, evil OS's around in 10 years, even if WE aren't using them :-)

Re:"Managerspeak"?! (1)

_typo (122952) | more than 11 years ago | (#5935609)

so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. Since all us good Slashdotters KNOW there will still be crufty, evil OS's around in 10 years, even if WE aren't using them :-)

Then maybe the solution isn't using additional bug-prone software to try to recover quickly from failures, but actually replacing the crufty, evil OS's.

Re:Managerspeak (2, Funny)

TopShelf (92521) | more than 11 years ago | (#5935436)

Speaking for the PHB's, this sounds very exciting. I can't wait until they have self-upgrading computers as well. No more replacing hardware every 3 years!

Re:Managerspeak (3, Insightful)

Bazzargh (39195) | more than 11 years ago | (#5935652)

I haven't read the long and dense article

Yet you feel qualified to comment....

requiring a whole plethora of yet unwritten code

You do realize they have running code for (for example) an email server [berkeley.edu] (actually a proxy) which uses these principles? NB this was based on proxying sendmail, so they didn't "re-architect/redesign whole systems from ground up". This isn't the only work they've done, either.

As for 'will it be worth it': if you'd read the article you'd find their economic justifications. This [berkeley.edu] has a good explanation of the figures. Note in particular that a large proportion of the failures they are concerned about are operator errors, which is why they emphasise system rollback as a recovery technique, as opposed to software robustness.

Re:Managerspeak (4, Interesting)

sjames (1099) | more than 11 years ago | (#5935659)

There are already steps in place towards recoverability in currently running systems. That's what filesystem journaling is all about: journaling doesn't do anything that fsck can't do, EXCEPT that replaying the journal is much faster. Vi recovery files are another example. As the article points out, 'undo' in any app is an example.

Life-critical systems are often actually two separate programs: 'old reliable', which is primarily designed not to allow a dangerous condition, and the 'latest and greatest', which has optimal performance as its primary goal. Should 'old reliable' detect that 'latest and greatest' is about to do something dangerous, it will take over and possibly reboot 'latest and greatest'.

Transaction-based systems feature rollback, volume managers support snapshots, and libraries exist to support application checkpointing. EROS [eros-os.org] is an operating system based on transactions and persistent state; it's designed to support this sort of reliability.

HA clustering and server farms are another similar approach. In that case, they allow individual transactions to fail and individual machines to crash, but overall remain available.

Apache has used a simple form of this for years. Each server process has a maximum service count associated with it. It will serve that many requests, then be killed and a new process spawned. The purpose is to minimize the consequences of unfixed memory leaks.

Many server daemons support a reload method where they re-read their config files without doing a complete restart. Smart admins make a backup copy of the config files to roll back to should their changes cause a system failure.
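
That habit is easy to sketch in shell; the Apache paths and commands below are just one plausible instance of the pattern.

#!/bin/sh
# Backup-then-reload: snapshot the known-good config, test the edited
# one, and reload only if it parses; otherwise roll back. Paths assumed.
CONF=/etc/httpd/httpd.conf
cp "$CONF" "$CONF.bak"            # the 'undo' point
vi "$CONF"                        # make the change
if apachectl configtest; then
    apachectl graceful            # reload without dropping connections
else
    echo "bad config -- rolling back" >&2
    cp "$CONF.bak" "$CONF"
fi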

Also, as the article points out, design for testing (DFT) has been around in hardware for a while; that's what JTAG is for. JTAG itself will be more useful once reasonably priced tools become available. Newer motherboards have JTAG ports built in; they are intended for monitor boards, but can be used for debugging as well (IMHO they would be MORE useful for debugging than for monitoring, but that's another post!). Built-in watchdog timers are becoming more common as well, and ECC RAM is now mandatory on many server boards.

It WILL take a lot of work. It IS being done NOW, in a stepwise manner. IF/when healthy competition in software is restored, we will see even more of this. When it comes down to it, nobody likes to lose work or time, and software that prevents that will be preferred to software that doesn't.

Interesting choice (4, Insightful)

sql*kitten (1359) | more than 11 years ago | (#5935344)

From the article:

We decided to focus our efforts on improving Internet site software. ...
Because of the constant need to upgrade the hardware and software of Internet sites, many of the engineering techniques used previously to help maintain system dependability are too expensive to be deployed.

(etc)

Translation: "when we started this project, we thought we'd be able to spin it off into a hot IPO and get rich!!"

/etc/rc.d ? (4, Interesting)

graveyhead (210996) | more than 11 years ago | (#5935345)

Frequently, only one of these modules may be encountering trouble, but when a user reboots a computer, all the software it is running stops immediately. If each of its separate subcomponents could be restarted independently, however, one might never need to reboot the entire collection. Then, if a glitch has affected only a few parts of the system, restarting just those isolated elements might solve the problem.
OK, how is this different from the scripts in /etc/rc.d that can start, stop, or restart all my system services? Any daemon process needs this feature, right? It doesn't help if the machine has locked up entirely.

Maybe I just don't understand this part. The other points all seem very sensible.
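
For reference, the rc.d mechanism in question is just a dispatch script; a minimal generic skeleton (the daemon name and pidfile are invented) looks like this:

#!/bin/sh
# Minimal rc.d-style init script; /usr/sbin/mydaemon and its pidfile
# are placeholders for a real service.
DAEMON=/usr/sbin/mydaemon
PIDFILE=/var/run/mydaemon.pid

case "$1" in
  start)   $DAEMON && echo "mydaemon started" ;;
  stop)    [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" ;;
  restart) "$0" stop; sleep 1; "$0" start ;;
  *)       echo "Usage: $0 {start|stop|restart}" >&2; exit 1 ;;
esac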

Re:/etc/rc.d ? (1)

jvervloet (532924) | more than 11 years ago | (#5935356)

... and is this undo feature a big improvement compared to, e.g., regular backups?

Re:/etc/rc.d ? (1)

oliverthered (187439) | more than 11 years ago | (#5935380)

....or journaling/transactions.

Re:/etc/rc.d ? (4, Insightful)

Surak (18578) | more than 11 years ago | (#5935384)

Exactly. It isn't. I think the people who wrote this are looking at Windows machines, where restarting individual subcomponents is often impossible.

If my Samba runs into trouble and gets its poor little head confused, I can restart the Samba daemon. There's no equivalent on Windows -- if SMB-based file sharing goes down on an NT box, you're rebooting the computer; there is no other choice.

Or (1)

gazbo (517111) | more than 11 years ago | (#5935416)

You're just making assumptions. Snippet:

The most common way to fix Web site faults today is to reboot the entire system, which takes anywhere from 10 seconds (if the application alone is rebooted) to a minute (if the whole thing is restarted). According to our initial results, micro-rebooting just the necessary subcomponents takes less than a second.

So in fact it's not talking about rebooting machine vs restarting services, it's talking about both of the above vs restarting subcomponents.

But hey, if you want to start talking about rebooting failed SMB services on Windows then go right ahead - you're in front of a friendly audience after all.

Re:Or (0)

Anonymous Coward | more than 11 years ago | (#5935662)

Doesn't Apache coupled with a good database already provide much of what this article discusses? Apache already monitors child processes and reboots these subcomponents when necessary. A good database provides transactions, enabling a rollback should something go south.

net start/stop (1)

oliverthered (187439) | more than 11 years ago | (#5935479)

net stop workstation
net start workstation

When NT services blow chunks, they often leave crap in kernel space that prevents them from being stopped/started.

I hope things have improved with Windows XP.

Re:/etc/rc.d ? (3, Interesting)

Mark Hood (1630) | more than 11 years ago | (#5935414)

It's different (in my view) in that you can go even lower than that... Imagine you're running a webserver, and you get 1000 hits a minute (say).

Now say that someone manages to hang a session because of a software problem. Eventually the same bug will hang another one, and another, until you run out of resources.

Just being able to stop the web server & restart to clear it is fine, but it is still total downtime, even if you don't need to reboot the PC.

Imagine you could restart the troublesome session and not affect the other 999 hits that minute... That's what this is about.

Alternatively, making a config change that requires a reboot is daft - why not apply it for all new sessions from now on? If you get to a point where people are still logged in after (say) 5 minutes you could terminate or restart their sessions, perhaps keeping the data that's not changed...

rc.d files are a good start, but this is about going further.

Re:/etc/rc.d ? (1)

platypus (18156) | more than 11 years ago | (#5935442)

How about killing just the worker process which hangs?

Re:/etc/rc.d ? (2, Insightful)

GigsVT (208848) | more than 11 years ago | (#5935443)

Apache sorta does this with its thread pool.

That aside, wouldn't the proper solution be to fix the bug, rather than covering it up by treating the symptom?

I think this ROC could only encourage buggier programs.

Re:/etc/rc.d ? (1)

the-dude-man (629634) | more than 11 years ago | (#5935642)

That's the goal, but Apache is also trying to keep these threads locked down. If someone tries a buffer overrun, we can't simply 'return' -- they may have overwritten the return address -- so kill the thread immediately and flush the stack, without giving them a chance to get to that pointer.

Yes, fixing the bug is the proper solution. However, the idea behind this is that you can never catch 100% of the bugs -- that is the one thing you can guarantee about any piece of software. Because of this, you have systems in place to handle the bugs and then fix them; that way you still can (and should) fix the bug, but you haven't incurred a lot of downtime in the process.

Re:/etc/rc.d ? (1)

42forty-two42 (532340) | more than 11 years ago | (#5935449)

Imagine you could restart the troublesome session and not affect the other 999 hits that minute...
So delete the offending session from the database.

Re:/etc/rc.d ? (1)

the-dude-man (629634) | more than 11 years ago | (#5935650)

In order to isolate that session in memory (without affecting other users), you need some of the very concepts we are talking about. Also, the goal is to make it more stable for end users, so we want to kill the session only if we can't fix the bug.

Re:/etc/rc.d ? (1)

42forty-two42 (532340) | more than 11 years ago | (#5935445)

HURD is an even better example - TCP breaking? Reboot it! Of course, you have a single-threaded filesystem, but that's okay, right?

Re:/etc/rc.d ? (2)

Bluelive (608914) | more than 11 years ago | (#5935530)

rc.d doesn't detect failures in the daemons, it doesn't resolve dependencies between daemons, and so on. rc.d is a step in the right direction, but it isn't a solution to the whole problem set.

hmmmmm (5, Funny)

Shishio (540577) | more than 11 years ago | (#5935349)

the disappearance of a large Internet site.

Yeah, I wonder what could ever bring down a large Internet site?
Ahem. [slashdot.org]

Re:hmmmmm (0)

Anonymous Coward | more than 11 years ago | (#5935542)

btw, your web-browser micro-rebooted 7 times while loading this page.

test errors (3, Funny)

paulmew (609565) | more than 11 years ago | (#5935350)

"Last, computer scientists should develop the ability to inject test errors" Ah, so that explains those BSOD's It's not a fault, it's a feature....

LOL (-1, Offtopic)

gazbo (517111) | more than 11 years ago | (#5935364)

Well said. I love jokes like this because they're so original - if you can see a joke coming a mile off then it's just not as good.

Respect.

ROC detail (5, Informative)

rleyton (14248) | more than 11 years ago | (#5935352)

For a much better and more detailed discussion of Recovery-Oriented Computing, you're better off visiting the ROC group at Berkeley [berkeley.edu] , specifically David Patterson's writings [berkeley.edu] .

Computer.... (2, Funny)

Viceice (462967) | more than 11 years ago | (#5935353)

Heal thy-self!

Re:Computer.... (0)

Anonymous Coward | more than 11 years ago | (#5935552)

Heal thy-self!

Ahh... Pixie Dust!

Use it regularly, and servers solve their own problems.

No clue (-1, Redundant)

Anonymous Coward | more than 11 years ago | (#5935354)

I'm sorry to say so, but this guy has no clue whatsoever. Micro-rebooting (as in rebooting only specific parts of the system)? Has this guy never heard of process identifiers, or PIDs for short? I never reboot my machine just to restart the web server, or MySQL, or any other application for that matter. He does have some interesting ideas about undoing executed commands on the machine: after a virus attack, just turn back time on the machine and it's clean of viruses, then roll forward those changes that did not infect the system, and you're virus-free. Well... you are if the virus doesn't understand your operating system, because if it does it will of course infect the undo feature too. I don't see how you could do this in hardware, so it has to be software -- and if it's software it can be altered by a virus, right?

Re:No clue (1)

Jedi Alec (258881) | more than 11 years ago | (#5935370)

Theoretically, I don't see why you shouldn't be able to do it in hardware: for example, an entire OS could be written to report to some piece of hardware what processes it has running, with each of those processes reporting to that piece of hardware on its status. If a report comes in indicating problems, or a report fails to come in altogether, the chip then takes action to remedy the situation, for example by restarting that particular process.

Disclaimer: all uses of the word process in this post are due to a total lack of knowledge concerning *nix and more than is good for me with 2K/XP.

Re:No clue (4, Informative)

Gordonjcp (186804) | more than 11 years ago | (#5935397)

Well, yeah. That's basically a watchdog timer. It's very common in embedded stuff because it's cheap to implement -- in fact, many microcontrollers have it built into the hardware. In microcontrollers they're very simple: a counter counts up (say) 1024 clock pulses, and if it rolls over, the CPU is reset. In normal operation, every time round the main loop you'd write to a specified IO port to kick the watchdog once every millisecond or so, which resets the counter. It's crude but effective, and is very commonly used in things like ECUs for automotive electrickery -- although the software is simple enough to be thoroughly tested (BMW 735i's aside), there's still dirty power and a mechanically harsh environment to deal with. And your ABS ECU doesn't have <CTRL><ALT><DELETE>, does it?
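
The same idea can be mimicked in user space. A hedged sketch -- the heartbeat file and restart action are assumptions, and a real system would use hardware or the kernel's watchdog device instead:

#!/bin/sh
# Software analogue of a watchdog timer: the monitored program is
# expected to touch $HEARTBEAT every second; if the file goes stale,
# "reset" it. File path and restart command are illustrative only.
HEARTBEAT=/var/run/app.heartbeat
LIMIT=5                             # seconds of silence tolerated
while sleep 1; do
    now=$(date +%s)
    last=$(stat -c %Y "$HEARTBEAT" 2>/dev/null || echo 0)
    if [ $((now - last)) -gt "$LIMIT" ]; then
        echo "watchdog: no heartbeat for ${LIMIT}s, restarting" >&2
        /etc/init.d/app restart      # the 'reset' action
    fi
done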

it will not work now (4, Insightful)

KingRamsis (595828) | more than 11 years ago | (#5935357)

Computers still rely on the original John von Neumann architecture; they are not redundant in any way, and there will always be a single point of failure, no matter what you hear about RAID, redundant power supplies, etc. The self-healing system is based on the same concept. Compare that to a natural system like the human nervous system -- now that is redundant and self-healing. A fly has more wires in its brain than all the Internet's nodes; cut your finger, and after a couple of days a fully automated, autonomous, transparent healing system will fix it. If we ever want to create self-healing computers, we need to radically change what a computer is. We need to break from von Neumann, not because anything is wrong with it, but because it is quickly reaching its limits. We need truly parallel, autonomous computers with replicated capacity that increases linearly as you add hardware, and software paradigms that take advantage of that. Try to make a self-healing, self-fixing computer today and you will end up with a very complicated piece of software that will fail in real life.

Re:it will not work now (2, Interesting)

torpor (458) | more than 11 years ago | (#5935407)

So what are some of the other paradigms which might be proffered instead of von Neumann?

My take is that for as long as CPU design is instruction-oriented instead of time-oriented, we won't be able to have truly trusty 'self-repairable' computing.

Give every single datatype in the system its own tightly-coupled timestamp as part of its inherent existence, and then we might be getting somewhere ... the biggest problems with existing architectures for self-repair are in the area of keeping track of one thing: time.

Make time a fundamental to the system, not just an abstract datatype among all other datatypes, and we might see some interesting changes...

Re:it will not work now (2, Interesting)

KingRamsis (595828) | more than 11 years ago | (#5935572)

Well, the man who answers this question will certainly become the von Neumann of the century. It needs some serious out-of-the-box thinking: first you throw away the concept of the digital computer as you know it. Personally, I think there will be a split in computer science into two general types of computer: the "classical" von Neumann machine, and a new and different type for which the classical computer will be useful as a controller of some sort. It is difficult to come up with the working principle of that new computer; it is like a missing piece of a puzzle -- you know its shape, but you are not certain what exactly will be printed on it. But I can summarize its features:
1. It must be data-oriented, with no concept of instructions (just routing information): data flows through the system and is transformed in a non-linear way, and the output is all possible computations doable by the transformations.
2. It must be based on a fully interconnected grid of very simple processing elements.
3. Its performance will be measured in terms of bandwidth, not the usual MIPS. As you can see, you will need a classical computer to operate the computer described above, so it will not totally replace the classical type.
I believe we should look at nature more closely. We stole the design of the plane straight from birds' wings and the helicopter from the dragonfly, and a lot more was inspired by mother nature. One relevant example that has always fascinated me is the fly brain: each eye is a processor in its own right, working independently and conveying information to a more concise layer, and so on. Even human vision is based on a similar concept in the retinal cells -- there is no "pixel" concept; each layer that processes vision emphasizes one aspect, such as texture, color, outline, shadowing, movement, etc.
Finally, would such a computer be useful? Could we just write a plain spreadsheet on it, send it by email to someone, and then resume our saved DOOM game? Well, it is possible, but we also need to redefine what we can do with a computer, because the classical von Neumann machine we have been stuck with for the last half-century has certainly limited our imagination about what can be done with one.

Re:it will not work now (0)

Anonymous Coward | more than 11 years ago | (#5935427)

I think you need to learn the concept of the full-stop. That has to be the least readable sentence I have seen in a looooooong time!

Re:it will not work now (0)

KingRamsis (595828) | more than 11 years ago | (#5935460)

Well, instead of commenting on my English, just extract the knowledge in the post; English is not my native language.

Re:it will not work now (1)

the-dude-man (629634) | more than 11 years ago | (#5935441)

Well, yes and no.

I don't think ROC will ever yield servers that can heal themselves -- rather, servers that can take corrective measures for a wide array of problems. There really is no way to make a completely redundant system... well, there may be, but as you said, we are nowhere near there yet.

ROC may someday evolve into that; for the moment, though, it's really a constantly expanding range of exceptional situations that a system can handle by design, using structures such as exceptions and the like.

Re:it will not work now (0)

the_duke_of_hazzard (603473) | more than 11 years ago | (#5935461)

Dude, please self-repair your grammar. I could only follow what you were saying with some effort.

SPOFs (1)

6hill (535468) | more than 11 years ago | (#5935514)

there will always be a single point of failure

Well, yes and no. Single points of failure are extremely difficult to find in the first place, not to mention remove, but it can be done on the hardware side. I could mention the servers formerly known as Compaq Himalaya, nowadays part of HP's NonStop Enterprise Division [hp.com] . Duplicated everything, from processors and power sources to I/O and all manner of computing doo-dads, scalable from 2 to 4,000 processors.

They are (or were, when I did my research piece on the Himalayas) also self-correcting, in the sense that the two processors run in lock-step and, if the two differ in their opinions, the primary immediately hands responsibility over to the redundant/backup -- data self-correction at the assembly level. Of course, this doesn't prevent software from being a point of failure or from functioning incorrectly, but one of these (or a cluster of them) is as close as you're going to get without automated hot-swapping or nanobot parts-building, or other such sci-fi notions.

Various levels of rebooting... (4, Funny)

jkrise (535370) | more than 11 years ago | (#5935358)

Micro-rebooting: Restart service.
Mini-rebooting: Restart Windows 98
Rebooting : Switch off/on power
Macro-rebooting: BSOD.
Mega-rebooting: BSOD--> System crash--> reload OS from Recovery CD--> Reinstall apps --> reinstall screen savers --> reinstall Service Packs --> Say your prayers --> Reboot ---> Curse --> Repeat.

!RTFA, but (2, Interesting)

the_real_tigga (568488) | more than 11 years ago | (#5935363)

I wonder if this [osdl.org] [PDF!] cool new feature will help there.

Sounds a lot like "micro-rebooting" to me...

uunnschulding sme.. (3, Insightful)

danalien (545655) | more than 11 years ago | (#5935366)

but if end-users got a better computer education, I think most of the problems would be fixed.

I find it quite funny that the "ground course in computers" courses we have (here in Sweden) only educate people in how to use Word/Excel/PowerPoint/etc. -- nothing _fundamental_ about how to operate a computer. It's like learning how to use the cigarette lighter in your car and declaring yourself someone who can drive a car. And now you want a quick fix for your incompetence in driving "the car".

Re:uunnschulding sme.. (-1, Flamebait)

Anonymous Coward | more than 11 years ago | (#5935381)

Nothing fundamental about how to spell english either eh?

Re:uunnschulding sme.. (0)

Anonymous Coward | more than 11 years ago | (#5935413)

I find it quite funny that the "ground course in computers" courses we have (here in Sweden) only educate people in how to use Word/Excel/PowerPoint/etc. -- nothing _fundamental_ about how to operate a computer. It's like learning how to use the cigarette lighter in your car and declaring yourself someone who can drive a car.
Nonsense. Most people only need to learn Word, Excel, Powerpoint and their company's in-house stuff, just like most car owners only need to learn to drive.

With cars, the in-depth course is learning to be a mechanic or how to design engines. There's a similar thing for computers: being a sysadmin or developer. 99.99% of people don't need to know that stuff.

Re:uunnschulding sme.. (1)

Lord Kholdan (670731) | more than 11 years ago | (#5935602)

Let's say there are a billion computer users today, and it'd take 100 hours on average to make each of them at least somewhat computer-savvy. Let's say teaching them costs $5/hour and they'd earn $10/hour if they were working instead of studying. It'd cost $1,500 billion to train them!

I think it'd be much cheaper to just write stable software, even in the really long run.

Borg (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#5935367)

Ensign: Captain, the borg seem to be repairing themselves.
Picard: Increase to warp factor 9.3
Ensign: The cube has damaged our warp drive. We can only go warp 8.7.
Data: The borg cube will overtake us in 27 seconds. We could use an inverse tachyon beam to repair our engines and disable the borg's warp drive.
Picard: No, you almost screwed up the galaxy last time you tried that.
CleverNickName: There's this article on slashdot about self repairing computers in our archives. Maybe we can apply it to our 24th century technology to fix the warp core.
Picard: Who are you?
CleverNickName: It's me, Wesley!
Picard: Didn't you run off with the galactic pedophile?
CleverNickName: Yes, but I came back, with a bunch of new infor---
Picard: Fuck Off! Mr. Data, initialize inverse tachyon beam.

So the next playstation can fix itself (-1)

Anonymous Coward | more than 11 years ago | (#5935368)

It would be cool if my ps2 could fix those Disk read errors itself.

...D (-1)

Anonymous Coward | more than 11 years ago | (#5935369)

Bull.. ..shit

Compulsory M$ joke (3, Funny)

Rosco P. Coltrane (209368) | more than 11 years ago | (#5935373)

Third, programmers ought to build systems that support an "undo" function (similar to those in word-processing programs), so operators can correct their mistakes. Last, computer scientists should develop the ability to inject test errors; these would permit the evaluation of system behavior and assist in operator training.

[WARNING]
You have installed Microsoft[tm] Windows[tm]. Would you like to undo your mistake, or are you simply injecting test errors on your system?

[Undo] [Continue testing]

Hmm. (4, Insightful)

mfh (56) | more than 11 years ago | (#5935376)

Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex

I think that's a big fat lie.

Re:Hmm. (0)

Anonymous Coward | more than 11 years ago | (#5935393)

I think that's a big fat lie.

True. A computer does pretty much the same thing it did 20 years ago but on a bigger, grander scale. Most of that 10,000-fold speed increase is sucked up by software - making operating them easier, not by what the user does with the machine.

Re:Hmm. (0)

Anonymous Coward | more than 11 years ago | (#5935438)

My old TI99/4A needs specific, arbitrary commands to be entered via the keyboard to be operated. It has very little concept of a "file system", and nothing approaching the desktop metaphor of my PowerBook's OS X GUI. It also requires extensive knowledge of either BASIC or assembly language if you want to use it for anything other than a primitive arcade gaming machine, while my PowerBook's software, video games and administration tools all follow the same interface guidelines.

And even if you consider the command line, my PowerBook tries to correct the input and auto-completes file paths and command names, while my TI99/4A shits itself or returns obscure error codes if I misspell anything.

Re:Hmm. (1)

Technician (215283) | more than 11 years ago | (#5935456)

Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex

Let's see. IBM PC XT at 4.77 megahertz to Pentium 4 at 3 gigahertz (3,000 megahertz). It seems a little shy of 10,000 times, unless you factor in going from an 8-bit processor to a 32-bit processor -- that's 4x the bandwidth. I don't think they missed the mark by much. 10,000 times or 12,000 times, what's the diff?

Re:Hmm. (1)

vofka (572268) | more than 11 years ago | (#5935544)

But even there you are forgetting to factor in multiple pipelining (at best the P4 can 'complete' 9 instructions per cycle, though it doesn't usually get that good) and shorter instruction execution times: for example, a 32-bit relative CALL on a 386 takes a minimum of 7 cycles, whereas on a Pentium-class system it takes only one cycle...

So it's a lot more complex than just comparing clock-for-clock, or even clock and bus width. 10,000 times is probably a very low estimate of how much power has increased in 20 years, just for x86 alone -- and that doesn't factor in other architectures such as SPARC or PPC!

Re:Hmm. (1)

gl4ss (559668) | more than 11 years ago | (#5935608)

Hmm... is it just me, or does windows/beos/whatever look more complex to operate than MS-DOS 1.0 and DOS-based programs in general?

Re:Hmm. (1)

justin_speers (631757) | more than 11 years ago | (#5935654)

I agree with the original post actually, I think you misinterpreted it.

Computers may be (approximately) 10,000 times faster, but is operating them really more complex?

Write scripts for it... (4, Insightful)

ndogg (158021) | more than 11 years ago | (#5935382)

and cron them in.

This concept isn't particularly new. It's easy to write a script that will check a particular piece of the system by running some diagnostic command (e.g. netstat), parse the output, and make sure everything looks normal. If something doesn't look normal, just stop the process and restart it, or do whatever you need to get the service back up and running, or secured, or whatever is needed to make the system normal again.

Make sure that script is part of a crontab that's run somewhat frequently, and things should recover on their own as soon as they fail (well, within the time-frame that you have the script running within your crontab.)
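
A minimal version of such a script, with a hypothetical httpd target:

#!/bin/sh
# Crude cron-driven self-recovery: restart httpd if it disappears.
# Install with a crontab line like:  */5 * * * * /usr/local/sbin/check-httpd
if ! pgrep httpd >/dev/null 2>&1; then
    echo "$(date): httpd down, restarting" >> /var/log/check-httpd.log
    /etc/rc.d/init.d/httpd restart
fi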

"Undo" feature? That's what backups are for.

Of course, the article was thinking that this would be built into the software, but I don't think that's a much better solution. In fact, I would say it would make things more complicated than anything.

Re:Write scripts for it... (1)

the-dude-man (629634) | more than 11 years ago | (#5935463)

You're quite right... most large systems are maintained by shell scripts and the crontab.

However, this is inherently limited in the errors it can find; some errors (e.g. /var/run having incorrect permissions) can't be solved by restarting the service. This concept is about identifying the problem and then taking the correct measures.

What you described is a primitive version of this: it will handle most of the *dumb* errors, but not persistent errors that may be outside the program's control. ROC is more or less an evolution of what you described.

Go figure... (1)

qat (637648) | more than 11 years ago | (#5935385)

Sounds like a great way to lure in customers for another product. What happens when part of this ROC fucks up? No coding is perfect. Also, would it be cost effective? I doubt it...

Self Repairing gone bad (2, Insightful)

UndercoverBrotha (623615) | more than 11 years ago | (#5935388)

Windows Installer [microsoft.com] was an effort in self-"repairing" or "healing", whatever you'd like to call it. But am I the only one who has seen errors like "Please insert Microsoft Office XP CD..." blah blah, when nothing is wrong, and had to cancel out of it just to use something totally unrelated, like say Excel or Word?

The Office 2000 self-repairing installation is another notorious one [google.com] : if you remove something, the installer thinks it has been removed in error and tries to reinstall it...

Oh well, let's wish the recovery-oriented computing guys luck...

Re:Self Repairing gone bad (1)

swankypimp (542486) | more than 11 years ago | (#5935641)

This week I took a look at my sister's chronically gimpy machine. It had Gateway's "GoBack" software on it, which lets the OS return to a bootable state if it gets completely hosed (the "System Restore" option in newer versions of Windows is similar, but GoBack loads right after the BIOS POST, before the machine tries to boot the OS).

The problem is that GoBack interprets easily recoverable errors as catastrophic. The machine didn't shut down properly? Go back to the previously saved state. BSOD lockup? Go back to the previously saved state. The end result was that files were written to the hard disk but the system didn't keep track of them. The files were still there, and I could access them from a DOS prompt, but Windows Explorer had no clue where they were. The same thing happened with recently installed programs, which utterly cocked things up. Windows only "knew" about them subconsciously, or something.

Of course, this (and the installers you mentioned) is a cheap consumer-grade product, and the server-grade ones these people are researching would be much better -- GoBack exists mainly as a "why buy Gateway over Dell" marketing tool, while real ROC would live on mission-critical servers. I just felt like ranting about Gateway GoBack for a while. (In the end I just uninstalled it rather than troubleshoot the thing. I still have some "hidden" directories, though. If I ever need a place to hide my porno stash, now I have an option. Shrug.)

Second paragraph (4, Insightful)

NewbieProgrammerMan (558327) | more than 11 years ago | (#5935389)

The second paragraph of the "long and dense article" strikes me as hyperbole. I haven't noticed that my computer's "operation has become brittle and unreliable" or that it "crash[es] or freeze[s] up regularly." I have not experienced the "annual outlays for maintenance, repairs and operations" that "far exceed total hardware and software costs, for both individuals and corporations."

Since this is /. I feel compelled to say this: "Gee, sounds like these guys are Windows users." Haha. But, to be fair, I have to say that - in my experience, at least - Windows2000 has been pretty stable both at home and at work. My computers seem to me to have become more stable and reliable over the years.

But maybe my computers have become more stable because I learned to not tweak on them all the time. As long as my system works, I leave it the hell alone. I don't install the "latest and greatest M$ service pack" (or Linux kernel, for that matter) unless it fixes a bug or security vulnerability that actually affects me. I don't download and install every cutesy program I see. My computer is a tool I need to do my job - and since I've started treating it as such, it seems to work pretty damn well.

I already do this with Linux... (2, Interesting)

jkrise (535370) | more than 11 years ago | (#5935391)

Here's the strategy:
1. Every system has a spare 2GB filesystem partition, to which I copy all the files of the 'root' filesystem after a successful installation: drivers, personalised settings, blah blah.
2. Every day during shutdown, users are prompted to 'copy' changed files to this 'backup OS partition'. A script handles this -- only changed files are updated.
3. After the first installation, a copy of the installed version is put onto a CD.
4. On a server with 4*120GB IDE disks, I've got "data" (home dirs) for about 200 systems in the network -- updated once a quarter.

Now, for self-repairing:
1. If a user messes up settings, the kernel, etc.: boot tomsrtbt, run a script to recopy changed files back to the root filesystem -> restart. (20 mins)
2. If the disk drive crashes: install from the CD of step 3 and restore data from the server. (40 mins)

Foolproof system, so far - and yes, lots of foolish users around.
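
The 'copy changed files' step in a scheme like this is essentially one rsync invocation (the backup partition's mount point is an assumption):

#!/bin/sh
# Mirror the root filesystem into the spare partition, copying only
# files that changed. -x keeps rsync from crossing into other mounts.
rsync -aHx --delete / /mnt/backup-os/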

First use for this (1)

buyo-kun (664999) | more than 11 years ago | (#5935392)

I think the first good use of ROC would be to clean up the errors and problems in Windows. Of course, the only way ROC could possibly clean up all the problems with Windows is to delete Windows altogether -- but hey, we'd do it ourselves sooner or later anyway.

I used systems like this (5, Interesting)

Mark Hood (1630) | more than 11 years ago | (#5935401)

they were large telecomms phone switches.

When I left the company in question, they had recently introduced a 'micro-reboot' feature that let you clear the registers for just one call -- previously you had to drop all the calls to fix a hung channel or a software error.

The system could do this for phone calls, commands entered on the command line, even backups could be halted and started without affecting anything else.

Yes, it requires extensive development, but you can do it incrementally -- we had thousands of software 'blocks' which had this functionality added whenever they were opened for other reasons; we never added this feature unless we were already making major changes.

Patches could be introduced to the running system, and falling back was simplicity itself - the same went for configuration changes.

This stuff is not new in the telecoms field, where 'five nines' uptime is the bare minimum. Now that the telcos are trying to save money, they're looking at commodity PCs and open-standard solutions, and shuddering -- you need to reboot everything to fix a minor issue? Ugh!

As for introducing errors to test stability: I did this, and I can vouch for its effects. I made a few patches that randomly caused 'real world' type errors (call dropped, congestion on routes, no free devices) and let the system run over a weekend while an automated caller tried to make calls. When I came in on Monday I'd caused 2,000 failures, which boiled down to 38 unique faults. The system had not rebooted once, so only those 2,000 calls had even noticed a problem. Once the software went live, the customer spotted 2 faults in the first month, where previously they'd found 30... so I swear by 'negative testing'.
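
In shell terms, that weekend run amounts to a fault-injection loop along these lines -- the process names and the fault menu are invented stand-ins for the switch's real error patches:

#!/bin/bash
# Crude negative testing: while a load generator makes calls, inject one
# random 'real world' fault every few minutes. All names are made up.
while true; do
    sleep $((RANDOM % 300))
    case $((RANDOM % 3)) in
      0) pkill -KILL -o call-handler ;;          # drop one call
      1) ifconfig eth1 down; sleep 5
         ifconfig eth1 up ;;                     # brief congestion on a route
      2) kill -STOP "$(pgrep -n deviced)"; sleep 30
         kill -CONT "$(pgrep -n deviced)" ;;     # no free devices for a while
    esac
done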

Nice to see the 'PC' world finally catching up :)

If people want more info, then write to me.

Mark

Re:I used systems like this (1)

the-dude-man (629634) | more than 11 years ago | (#5935501)

I've been striving to work this kind of stability into my clients' software for years! To a certain extent a lot of it is there; the problem with the PC world is that you have to do an update every three days just to prevent someone from rooting your box, with all the remote exploits floating around out there.

I usually use large sets of negative data to isolate the problem... but there are just some things users can cause that, in an integrated world like the PC world, will simply take things down.

That's not to say you can't keep a box up for several years. I have a client that has outright refused to update the kernel of, or reboot, a Red Hat box I set up five years ago. The kernel is sheltered enough from the real world that as long as it does what we want... it's fine. Services are updated almost daily via scripting, and most of the kernel is modularized so parts of it can be updated to keep up with the services.

So I can keep an operating system up for a very long time; my concern has now turned to keeping services up. There are just some things that will take down a service no matter what I do (e.g. DoSing the socket until it explodes), and I can't seem to find a way around them without restarting the service to correct the problem. This is problematic because there is an indefinite number of other users we don't want to affect. Telecom has been doing this for years, so I'd be interested in hearing any coding tricks you may have up your sleeve :)

already done? (1)

the-dude-man (629634) | more than 11 years ago | (#5935428)

Hmmm... Recovery-Oriented Computing... this just screams Linux.

Recovery-Oriented Computing is nothing new; most developers (well, *nix developers) have been heading down this route for years. Particularly as more hardcore OO languages (i.e. Java, and in many respects C++) come to the surface with exception structures, it becomes easier to isolate and identify the exception that occurred and take appropriate action to keep the server going.

However, this method of coding is still growing; there are no solid, widely accepted methods of isolating and identifying problems. In the next few years, though, you will probably see this trend move to the next level as algorithms for identification and localization are developed and widely adopted.

Of course, if you're running on a Windows platform this is kinda pointless; rebooting at least once every 30 days really eliminates any chance of long-term running and the need for large-scale localization and identification.

Excellent (1, Funny)

hdparm (575302) | more than 11 years ago | (#5935429)

we could in that case:

rm -rf /*

^Z

just for fun!

Re:Excellent (0)

Anonymous Coward | more than 11 years ago | (#5935623)

[1]+ Stopped rm -rf /*

ACID ROC? (3, Insightful)

shic (309152) | more than 11 years ago | (#5935434)

I wonder... is there a meaningful distinction between ROC and the classical holy grail of ACID systems (i.e. systems which meet the Atomic, Consistent, Isolated and Durable assumptions commonly cited in the realm of commercial RDBMSs)? Apart from the 'swish' buzzword rename that isn't even an acronym?

Professionals in the field, while usually in agreement about the desirability of systems which pass the ACID test, mostly admit that although the concepts are well understood, the real-world cost of the additional software complexity often precludes strict ACID compliance in typical systems. I would certainly be interested if there were more to ROC than evaluating the performance of existing, well-understood ACID-related techniques, but I can't find anything more than the hype. For example, has ROC suggested designs to resolve distributed incoherence due to hardware failure? Classified non-trivial architectures immune to various classes of failure? Discovered a cost-effective approach to ACID?

Not going to work (1, Offtopic)

locarecords.com (601843) | more than 11 years ago | (#5935440)

This is pie in the sky.

In my experience the best system is a pair of computers running in parallel, balanced by another computer that watches for problems and seamlessly switches the Live role from the crashed system to the other computer. It then reboots the system with problems and allows it to recreate its dataset from its partner.

In effect this points to the massive parallelism required for totally stable systems, where clusters form the virtual computer and we get away from the idea of a computer as a single machine.

After all, individual computers suffer hardware failures too!

The Hurd (3, Interesting)

rf0 (159958) | more than 11 years ago | (#5935448)

Wouldn't the Hurd (if/when it becomes ready) be a software solution of sorts? Since it is a microkernel system, you could just restart the relevant bit of the operating system. As said in another post, this is like /etc/rc.d but at a lower level.

Or you could just have some sort of failover setup.

Rus

Magic Server Pixie Dust (3, Funny)

thynk (653762) | more than 11 years ago | (#5935455)

Didn't IBM come out with some Magic Server Pixie Dust that did this sort of thing already, or am I mistaken?

Re:Magic Server Pixie Dust (1)

the-dude-man (629634) | more than 11 years ago | (#5935550)

That was just a gimmick for that commercial. What they were actually selling was BSD boxes running ports that can update themselves, rebuild the kernel according to predefined specs, and reboot when necessary to implement the changes -- but designed to avoid rebooting whenever possible (so they build the kernel to be very modular and only update the modules as needed, until something in the base needs updating, then rebuild the kernel and reboot), plus some tweaked-out bash scripting to respawn services that died.

This isn't really ROC, since if it's a system-wide error, restarting the service just causes it to die again; they are selling more or less very well set up BSD boxes (I sell similar solutions to people). They still need to be administered -- the goal was just for someone with minimal knowledge of the OS to be able to handle the day-to-day operations of the server.

Sounds like commitment control (1)

GerardM (535367) | more than 11 years ago | (#5935462)

In databases, you have your actions, and when a sequence of events starts, it is committed at the end of the event cycle. When you change things, there is a sequence of events that leads to a "stable" state; when the stable state has arrived, you commit. When you decide it is no good anyway, there is the possibility of a rollback: everything is rolled back to the last known good state.

In practice it would mean that changes are logged, and only after logging are the changes put into effect. This results in overhead and in potential vulnerabilities (both to hackers and to errors).

Things like this also smack of what "standardised" hardware and software would look like. How else can you control the quality of such a system? NB: this does not mean that Linux or BSD is inferior; it would only be more obvious and visible what went right and what went wrong.
Thanks,
Gerard
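
The cycle described above is the classic transaction idiom; here it is in miniature (sqlite3 and the accounts table are just a convenient, invented stand-in):

#!/bin/sh
# Commit/rollback in miniature: both updates land together (COMMIT) or
# vanish together (ROLLBACK). Database file and table are made up.
sqlite3 /tmp/demo.db <<'SQL'
CREATE TABLE IF NOT EXISTS accounts (name TEXT, balance INTEGER);
INSERT INTO accounts VALUES ('a', 100);
INSERT INTO accounts VALUES ('b', 0);
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE name = 'a';
UPDATE accounts SET balance = balance + 50 WHERE name = 'b';
ROLLBACK;   -- "no good anyway": back to the last known good state
SELECT name, balance FROM accounts;   -- unchanged: a|100, b|0
SQL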

Not Just In DataBases (1)

the-dude-man (629634) | more than 11 years ago | (#5935559)

In databases, you have your actions, and when a sequence of events starts, it is committed at the end of the event cycle. When you change things, there is a sequence of events that leads to a "stable" state; when the stable state has arrived, you commit.

This is actually exactly what iptables does -- there is even a COMMIT command at the end of every ruleset, once all the rules have been laid out.
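
Concretely: a ruleset in iptables-save format is applied atomically by iptables-restore, and nothing takes effect until the COMMIT line (the two rules here are a trivial example):

#!/bin/sh
# Atomic ruleset load: either the whole table below is installed at
# COMMIT, or (on a syntax error) none of it is.
/sbin/iptables-restore <<'EOF'
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp --dport 22 -j ACCEPT
COMMIT
EOF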

micro-rebooting... (1, Funny)

amanpatelhotmail.com (604171) | more than 11 years ago | (#5935465)

Is just one of the cool new features in MS Windows(r) Longhorn(tm)

You now don't reboot(tm), you micro-reboot(tm) -- i.e., the system will do it for you! Remember the times when you were writing that important report in MS(r) Word(tm) and the system crashed, and you had to press Ctrl-Alt-Del(tm) to reboot(tm)? No more! No more pressing awkward buttons... the system is intelligent enough to do that for you :)

"operating them is much more complex" (2, Funny)

NReitzel (77941) | more than 11 years ago | (#5935467)

Are you crazy?

My first "PC" was a PDP-11/20, with paper tape reader and linc tape storage. Anyone who tries to tell me that operating today's computers is much more complex needs to take some serious drugs.

What is more complex is what today's computers do, and increasing their reliability or making them goal oriented are both laudable goals. What will not be accomplished is making the things that these computers actually do less complex.

Ah, youth... (2, Insightful)

tkrotchko (124118) | more than 11 years ago | (#5935471)

"But operating them is much more complex."

You're saying the computers of today are more complex to operate than those of 20 years ago?

What was the popular platform 20 years ago (1983)? The Mac OS had not yet debuted, but the PC XT had, and the Apple ][ was its main competitor.

So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OS X today? I mean, you can actually have your *mother* operate a computer today.

I'm not sure I agree with the premise.

Re:Ah, youth... (1)

the-dude-man (629634) | more than 11 years ago | (#5935523)

So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OS X today? I mean, you can actually have your *mother* operate a computer today.

This is true; however, keep in mind that none of the DOS operating systems had a kernel, nor were any of them truly multitasking until Windows 95 in the Windows world (shudders). And Unix itself was only debuting 20 years ago.

Also keep in mind all the new technologies, such as networking (that's a whole post of changes on its own) and hardware like Bluetooth, FireWire and USB -- a huge number of new technologies that have evolved to meet the ever-expanding demands we place on systems.

Hardware on the order of the popular platforms of 20 years ago, such as the PC XT, is used in calculators today. The very definition of a computer has changed in 20 years, so operating systems are orders of magnitude more complex; 20 years ago the PC world was still in its infancy. Since then everything, down to the very definition of the PC, has changed -- and notebook and handheld technologies keep pushing it.

That being said, it's not really fair to compare operating systems from 20 years ago to those of today; it's just a different world, and the very definition of an operating system is no longer the same.

Good Code and Hardware (1)

caffeinex36 (608768) | more than 11 years ago | (#5935486)

Wouldn't better coding and better hardware be more efficient? This sounds a little silly. Perhaps, come quantum computers, maybe. Think of all the SAs who fix things that break all day -- they'd be jobless.

Rob

The long wondered about origin of ... (0, Funny)

den_erpel (140080) | more than 11 years ago | (#5935499)

Self-Repairing Computers

Finally, this provides us with the long awaited answer to the following situations:

Reed: Captain, direct hit on the power supply!
Archer: That'll teach those cyborgs for flooding our inbox with p0rn!
T'Pol: Captain, their server is mysteriously repairing itself, we're still being flooded.

for any other series:
TOS:
%s/Reed/Checkov/g
%s/Archer/Kirk/g
%s/T'Pol/Spock/g
TNG:
%s/Reed/Worf/g
%s/Archer/Picard/g
%s/T'Pol/Data/g
DS9:
%s/Reed/Kira/g
%s/Archer/Sisko/g
%s/T'Pol/Dax/g
VGR:
%s/Reed/Tuvok/g
%s/Archer/Janeway/g
%s/T'Pol/Kim/g

Since the B&B messed up the timelines anyway, they'll probably pour it into an episode; they seem to be out of inspiration anyhow...

A computer is no washing machine, but why? (2, Insightful)

Quazion (237706) | more than 11 years ago | (#5935504)

Washing machines have a lifetime of around 15-20 years, I guess; computers about 1-3 years.
This is because the technology in computers is so new every year, and so...

1: It's too expensive to make them failsafe; development would take too long.
2: You can't refine/redesign and resell, because of new technology.
3: If it just works, no one will buy new systems, so they have to fail every now and then.

Other consumer products have a much longer development cycle. Cars, for example, shouldn't fail, and when one does it should be fairly easy to repair; cars have also been around for, I don't know, like a hundred years -- and have they changed much? Computers? Heck, just buy a new one or hire a PC Repair Man [www.pcrm.nl] (Dutch only) to do your fixing.

Excuse me for my bad English ;-) but I hope you got the point; no time to ask my living dictionary.

English (1)

rf0 (159958) | more than 11 years ago | (#5935582)

I wouldn't worry about your English. It's better than some native speakers I've seen.

Rus

But I do that already... (2, Informative)

edunbar93 (141167) | more than 11 years ago | (#5935508)

build an "undo" function (similar to those in word-processing programs) for large computing systems

This is called "the sysadmin thinks ahead."

Essentially, when any sysadmin worth a pile of beans makes any change whatsoever, they make sure there's a backup plan before making the change live - whether that means running the service on a non-standard port to test, running it on the development server, making backups of the configuration and/or the binaries in question, or backing up the entire system every night. They are thinking "what happens if this doesn't work?" before making any change. It doesn't matter if it's a web server running on a lowly Pentium 2 or Google - the sysadmin is paid to think about actions before taking them. Having things like this won't replace the sysadmin, although I can imagine a good many PHBs trying before realizing that just because you can back out of stupid mistakes doesn't mean you can keep them from happening in the first place.
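
As a minimal sketch of that think-ahead habit (in Python; the file names and the validate step are hypothetical), wrapping a config change so the undo path exists before the change does might look like:

    import shutil
    import time

    def safe_config_change(path, apply_change, validate):
        # Snapshot the known-good config before touching anything.
        backup = "%s.bak.%d" % (path, int(time.time()))
        shutil.copy2(path, backup)
        try:
            apply_change(path)         # edit the live config
            if not validate():         # e.g. reload the service and probe it
                raise RuntimeError("validation failed")
        except Exception:
            shutil.copy2(backup, path) # roll back to the snapshot
            raise

Nothing clever here - the whole point is that the backup is taken before the change goes live, not after it breaks.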

Does SCI AM review articles properly nowadays? (3, Insightful)

panurge (573432) | more than 11 years ago | (#5935525)

The authors either don't know much about the current state of the art or are just ignoring it. And as for unreliability - well, it's true that the first Unix box I ever had (8 users with VT100 terminals) could go almost as long without rebooting as my most recent small Linux box, but there's a bit of a difference in traffic between eight 19,200-baud serial links and two 100baseT ports, not to mention the range of applications being supported.
Or the factor of 1000 to 1 in hard disk sizes.
Or the 20:1 price difference.

I think a suitable punishment would be to lock the authors in a museum somewhere that has a 70s mainframe, and let them out when they've learned how to swap disk packs, load the tapes, splice paper tape, connect the Teletype, sweep the chad off the floor, stack a card deck or two and actually run an application...those were the days, when computing kept you fit.

Some of this isn't entirely new... (5, Interesting)

Mendenhall (32321) | more than 11 years ago | (#5935564)

As one component of my regular job (I am a physicist), I develop control systems for large scientific equipment, and have been doing so for about 25 years. One of the cornerstones of this work has been high-reliability operation and fault tolerance.

One of the primary tricks I have used has always been mockup testing of software and hardware with an emulated machine. In a data acquisition/control system, I can generate _lots_ of errors and fault conditions, most of which would never be seen in real life. This way, I can not only test the code for fault tolerance, repeatedly, but also thoroughly check the error-recovery code to make sure it doesn't introduce any errors itself.

This is really the software equivalent of teaching an airline pilot to fly on a simulator. A pilot who trains only in real planes gets just one fatal crash (obviously), and so never really learns how to recover from worst-case scenarios. In a simulator, one can repeat 'fatal' crashes until they aren't fatal any more. My software has been through much the same experience, and it is surprising what types of errors one can avoid this way.

Really, the main problem with building an already highly reliable system, using very good hardware, etc., is that you must do this kind of testing: failures will start out very rare, so unless one intentionally creates faults, the ability to recover from them is never verified. Especially in asynchronous systems, one must test each fault many times, and in combination with other faults, to find out how hard it is to really break a system - and that won't happen without emulating the error conditions.
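
A toy version of that fault-injection loop (in Python; the device model and fault list are made up for illustration) would be something like:

    import random

    class EmulatedDevice:
        # Stand-in for real hardware that can be told to misbehave on demand.
        def __init__(self, faults):
            self.faults = faults       # e.g. ["timeout", "garbage", "overrun"]

        def read(self):
            if self.faults and random.random() < 0.3:
                raise IOError(random.choice(self.faults))
            return 42                  # a "good" reading

    def acquire_with_recovery(dev, retries=3):
        # The recovery code under test: retry on any injected fault.
        for _ in range(retries):
            try:
                return dev.read()
            except IOError:
                continue
        raise RuntimeError("device unrecoverable")

    # Hammer the recovery path with thousands of injected faults -
    # something you could never do safely on the real instrument.
    dev = EmulatedDevice(["timeout", "garbage", "overrun"])
    for _ in range(10000):
        try:
            acquire_with_recovery(dev)
        except RuntimeError:
            pass    # rare but expected; the point is it must not corrupt state

The real systems are far more elaborate, but the principle is the same: the emulator lets you replay the 'crash' as many times as you need.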

Imagine a beowulf cluster of (-1, Offtopic)

noogle (664169) | more than 11 years ago | (#5935566)

these.

Nope. Memory (4, Interesting)

awol (98751) | more than 11 years ago | (#5935600)

The problem here is that while it's true that _certain_ aspects of computational power have increased "probably 10,000 times", others have not. And the undo is the critical bit for really making stuff like this work, since redundant hardware already exists - NonStop from HP (née Himalaya), for example.

Where I work we implemented at least one stack-based undo facility, and it worked really nicely: we trapped SIGSEGVs etc. and, in the event of an error, just popped the appropriate state back into the places that had been touched. We also wrote a magical "for loop" construct that broke out after N iterations regardless of the other constraints. The software that resulted from this was uncrashable. I mean that relatively seriously - you could not crash the thing. You could very seriously screw up data through bugs, but the beast would just keep on ticking.
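
A stripped-down sketch of that kind of undo stack (in Python, with an ordinary exception standing in for the trapped SIGSEGV, since this is only an illustration):

    _MISSING = object()

    class UndoLog:
        # Record the old value before each write; pop everything back on error.
        def __init__(self, state):
            self.state = state
            self.stack = []

        def write(self, key, value):
            self.stack.append((key, self.state.get(key, _MISSING)))
            self.state[key] = value

        def rollback(self):
            while self.stack:
                key, old = self.stack.pop()
                if old is _MISSING:
                    self.state.pop(key, None)
                else:
                    self.state[key] = old

    state = {"balance": 100}
    log = UndoLog(state)
    try:
        log.write("balance", 250)
        raise MemoryError("stand-in for a trapped fault")
    except MemoryError:
        log.rollback()    # state is back to {"balance": 100}

The bounded "for loop" is the same idea applied to control flow: count the iterations and bail out after N, no matter what the loop condition says.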

More than a decade ago I had a discussion with a friend of mine about whether all the extra MHz that were coming would eventually be overkill. His argument was that, no, more of them would be consumed in the background, making good stuff happen. He was thinking of things like voice recognition, handwriting recognition, predictive work, etc. I agree with his point. If you have a surfeit of CPU, then use it to do cool things (not necessarily wasting it on eye candy) that make things easier to use. Indeed, we see some of that stuff now - not enough, but some.

Self-repairing is an excellent candidate, and with so much CPU juice lying around in your average machine, it must be workable. I mean, think about the computers used in industrial plants. Most of them could be emulated faster on a P4 than they currently run. So emulate N of them and check the results against each other; if one breaks, just emulate a new one and turf the old one. Nice.
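
A toy version of that emulate-N-and-vote scheme (in Python; the step function and the deliberately "broken" replica are contrived for illustration):

    import random
    from collections import Counter

    def step_replica(state, broken=False):
        # Stand-in for one emulated controller advancing one step.
        if broken and random.random() < 0.5:
            return -1                  # a corrupted result from a failing replica
        return state + 1

    def run_redundant(states):
        results = [step_replica(s, broken=(i == 2)) for i, s in enumerate(states)]
        majority, _ = Counter(results).most_common(1)[0]
        # Replicas that disagree with the majority are presumed broken:
        # turf them and restart fresh copies from the majority state.
        return [majority] * len(results)

    states = [0, 0, 0]
    for _ in range(10):
        states = run_redundant(states)
    # All replicas end up agreeing; the flaky one gets resynced every step.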

But here's the rub: memory. We have nowhere near decreased memory latency by the same factor we have boosted processing power (and as for I/O, sheesh!). As a result, undo is very expensive to do generically - it at least halves the available bandwidth, since each plain write turns into a read-then-write (to save the old value) plus the write itself - not to mention the administrative overhead, and we just haven't got that much spare memory latency left. Indeed, just after that ten-year-old discussion, I had to go and enhance some software to get past the HP-UX 9 limit of 800MB for a single shared memory segment, and demand is only just being outstripped by the affordable supply of memory. We do not yet have the orders of magnitude of performance needed to make the self-correcting model work in a generic sense.

I think this idea will come, but it will not come until we have an order of magnitude more capacity in all the areas of the system. Until then we will see very successful but limited solutions like the one we implemented.

'IMPORTANT' 'NEW' 'DISCOVERY'! (2, Funny)

kahei (466208) | more than 11 years ago | (#5935648)


Scientists discovered this week that well-known and rather obvious software engineering concepts like componentization and redundancy could seem new and impressive if written up like Science!

Although this week's breakthrough yielded little direct benefit, it is theorized that applying the verbal style of Science to other subjects, such as aromatherapy and running shoes, could have highly profitable results.
