
When Mistakes Improve Performance

kdawson posted more than 3 years ago | from the let's-change-everything dept.

Hardware 222

jd and other readers pointed out BBC coverage of research into "stochastic" CPUs that allow communication errors in order to reap benefits in performance and power usage. "Professor Rakesh Kumar at the University of Illinois has produced research showing that allowing communication errors between microprocessor components and then making the software more robust will actually result in chips that are faster and yet require less power. His argument is that at the current scale, errors in transmission occur anyway, and that the efforts of chip manufacturers to hide these errors to create the illusion of perfect reliability simply introduce a lot of unnecessary expense, demand excessive power, and deoptimise the design. He favors a new architecture, which he calls the 'stochastic processor,' designed to handle data corruption and error recovery gracefully. He believes he has shown such a design would work and that it would permit Moore's Law to continue to operate into the foreseeable future. However, this is not the first time someone has tried to fundamentally revolutionize the CPU. The Transputer, the AMULET, the FM8501, the iWARP, and the Crusoe were all supposed to be game-changers but died cold, lonely deaths instead — and those were far closer to design philosophies programmers are currently familiar with. Modern software simply isn't written with the level of reliability the stochastic processor requires (and many software packages are too big and too complex to port), and the volume of available software frequently makes or breaks new designs. Will this be 'interesting but dead-end' research, or will Professor Kumar pull off a CPU architectural revolution of a kind not seen since the microprocessor was first designed?"


222 comments

Impossible design (4, Interesting)

ThatMegathronDude (1189203) | more than 3 years ago | (#32392416)

If the processor goofs up the instructions that it's supposed to execute, how does it recover gracefully?

Re:Impossible design (3, Funny)

Anonymous Coward | more than 3 years ago | (#32392496)

The Indian-developed software will itself fuck up in a way that negates whatever fuck up just happened with the CPU. In the end, it all balances out, and the computation is correct.

Re:Impossible design (0)

Anonymous Coward | more than 3 years ago | (#32392586)

I believe Windows is American.

Re:Impossible design (1)

koreaman (835838) | more than 3 years ago | (#32392652)

And Windows is rock solid reliable compared to the software made abroad (I won't name specific countries) that I have to deal with at work. I suspect many people feel the same way.

Re:Impossible design (4, Insightful)

TheThiefMaster (992038) | more than 3 years ago | (#32392562)

Especially a JMP (GOTO) or CALL. If the instruction is JMP 0x04203733 and a transmission error makes it do JMP 0x00203733 instead, causing it to attempt to execute data or an unallocated memory page, how the hell can it recover from that? It could be even worse if the JMP instruction is changed only subtly, jumping only a few bytes too far or too close could land you the wrong side of an important instruction that throws off the entire rest of the program. All you could do is to detect the error/crash and restart from the beginning and hope. What if the error was in your error detection code? Do you have to check the result of your error detection for errors too?
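
(As an aside, the two addresses in that example really do differ by a single bit; a quick Python sanity check, purely for illustration:

    a, b = 0x04203733, 0x00203733
    diff = a ^ b                             # bits that differ between the two jump targets
    print(hex(diff), bin(diff).count("1"))   # 0x4000000 1  -> exactly one flipped bit (bit 26)

So a single transmission error on one address line is all it takes to land the jump 64 MB away from where it was supposed to go.)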

Re:Impossible design (2, Interesting)

Turzyx (1462339) | more than 3 years ago | (#32392684)

Or worse, it could jump to itself repeatedly, thereby creating an HCF situation.

Re:Impossible design (1)

cheese_wallet (88279) | more than 3 years ago | (#32393442)

Especially a JMP (GOTO) or CALL. If the instruction is JMP 0x04203733 and a transmission error makes it do JMP 0x00203733 instead, causing it to attempt to execute data or an unallocated memory page, how the hell can it recover from that? It could be even worse if the JMP instruction is changed only subtly, jumping only a few bytes too far or too close could land you the wrong side of an important instruction that throws off the entire rest of the program. All you could do is to detect the error/crash and restart from the beginning and hope. What if the error was in your error detection code? Do you have to check the result of your error detection for errors too?

Space the instructions further apart so that one or two bit flips won't map to another instruction.

Re:Impossible design (5, Insightful)

Interoperable (1651953) | more than 3 years ago | (#32393494)

The research [illinois.edu] is targeted specifically at dedicated audio/video encoding/decoding blocks within the processors of mobile devices and similar error-tolerant applications. The journalist just didn't mention that the idea isn't to expose the entire system to fault-prone components. Considering that the most power-sensitive mainstream devices (cell phones) spend most of their time doing these error-tolerant tasks, the research becomes quite interesting. They claim to have demonstrated the effectiveness of the technique by encoding h.264 video.

Re:Impossible design (3, Interesting)

Anonymous Coward | more than 3 years ago | (#32392594)

That's a good point. You accept mistakes with the data, but don't want the operation to change from add (where, when computing large averages, being off by a few hundred won't matter) to multiply or divide.

But once you have the opcode separated from the data, you can mess with the latter. E.g. not care about a race condition, because that happening every 1000th operation doesn't matter too much.
And since this is a source of noise, you just got free random data!
Still, this looks more like something for scientific computing, and when they build the next big one that can easily be factored in. For home computing, not so much; 99% of the time they wait for user input anyhow.

More likely: Impossible gains (2, Insightful)

Anonymous Coward | more than 3 years ago | (#32392618)

More importantly, if the software is more robust so as to detect and correct errors, then it will require more clock cycles of the CPU and negate the performance gain.

This CPU design sounds like the processing equivalent of a perpetual motion device. The additional software error correction is like the friction that makes the supposed gain impossible.

Re:More likely: Impossible gains (0)

Anonymous Coward | more than 3 years ago | (#32393216)

More importantly, if the software is more robust so as to detect and correct errors, then it will require more clock cycles of the CPU and negate the performance gain.

This CPU design sounds like the processing equivalent of a perpetual motion device. The additional software error correction is like the friction that makes the supposed gain impossible.

No one here is really getting this.

Think about it this way: just as you can talk about the time complexity of an algorithm you can also talk about its energy complexity.

Say there's some energy cost associated with doing a particular operation at a given probability of being correct, and that cost changes as you adjust that probability. Given that, suppose your goal is to minimize the amount of energy it takes to complete an algorithm with a 99.9999% probability of a correct answer. You would go about this the same way you optimize the time performance of an algorithm: you organize it so that redundant work is minimized.

So let's say I want to add n numbers together with a 99.9999% probability of a correct answer. Do you suppose it would use more energy to do each addition with a (99.9999%)^(1/n) probability of being correct, or to design an algorithm that doesn't do that much checking the whole time but generates a number which is highly likely correct and is accompanied by additional redundant coded data which can be used to true up the approximation on demand?
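
To make the arithmetic concrete (a quick Python sketch; the 99.9999% target and the operation counts are just illustrative numbers):

    target = 0.999999                         # desired probability the whole computation is correct
    for n in (10**3, 10**6, 10**9):
        per_op = target ** (1.0 / n)          # per-operation success probability you'd need
        print(f"n={n:>10}: per-op error budget ~ {1 - per_op:.1e}")
    # roughly 1e-09, 1e-12 and 1e-15 respectively (the last is at the edge of float precision)

In other words, the "pay for reliability on every single operation" approach has to buy absurdly small per-operation error rates, while checking or correcting once at the end only pays for correctness where it is actually needed.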

Re:More likely: Impossible gains (1)

shimage (954282) | more than 3 years ago | (#32393326)

I think an important part of his argument is that current processors aren't perfect anyway, and they are only going to get worse as we move to smaller processes. The argument is that, eventually, everyone should be writing more robust code anyway, so why not get something out of it. I don't know whether that is true or not, but to argue against his ideas, you need to address his points ...

Re:Impossible design (1)

mederbil (1756400) | more than 3 years ago | (#32392648)

It is similar to quantum computing. Quantum computing can be insanely fast, but it often makes inaccurate calculations.

It's mainly about quantity, not quality. A possible use for it is computational knowledge engines, like WolframAlpha. It would be inexpensive for computation servers, but only really useful if it were at least 98% accurate.

Re:Impossible design (1)

WrongSizeGlass (838941) | more than 3 years ago | (#32392796)

Basically he's saying we should trade power consumption for accuracy? Hmmm ... I vote 'No'.

Re:Impossible design (1, Informative)

Anonymous Coward | more than 3 years ago | (#32392972)

Reduced power OR equal power at a faster clock rate. Many times speed is preferred to accuracy when perfection isn't necessary. Video and audio are good examples already doing this (e.g. dropped frames on slow connections).

Re:Impossible design (3, Informative)

AmiMoJo (196126) | more than 3 years ago | (#32392866)

The first thing to say is that we are not talking about general purpose CPU instructions but rather the highly repetitive arithmetic processing that is needed for things like video decoding or 3D geometry processing.

The CPU can detect when some types of error occur. It's a bit like ECC RAM where one or two bit errors can be noticed and corrected. It can also check for things like invalid op-codes, jumps to invalid or non-code memory and the like. If a CPU were to have two identical ALUs it could compare results.

Software can also look for errors in processed data. Things like checksums and estimation can be used.

In fact GPUs already do this to some extent. AMD and nVidia's workstation cards are the same as their gaming cards, the only difference being that the workstation ones are certified to produce 100% accurate output. If a gaming card colours a pixel wrong every now and then it's no big deal and the player probably won't even notice. For CAD and other high end applications the cards have to be correct all the time.
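
A software analogue of the dual-ALU comparison is easy to sketch; here's a rough Python illustration, where unreliable_add is just a stand-in for a fault-prone hardware unit (not any real API):

    import random

    def unreliable_add(a, b, error_rate=0.01):
        """Stand-in for a fault-prone adder: occasionally flips a bit of the result."""
        result = a + b
        if random.random() < error_rate:
            result ^= 1 << random.randrange(32)
        return result

    def checked_add(a, b, max_retries=10):
        """Software lockstep: compute twice, accept on agreement, retry on mismatch."""
        for _ in range(max_retries):
            r1, r2 = unreliable_add(a, b), unreliable_add(a, b)
            if r1 == r2:
                return r1            # agreement is very likely (though not guaranteed) correct
        raise RuntimeError("persistent disagreement; fall back to a reliable unit")

The catch, as others point out, is that every comparison and retry costs cycles and energy, so the sloppy unit has to be cheap enough to leave a net win.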

OpenCL (2, Interesting)

tepples (727027) | more than 3 years ago | (#32393192)

AMD and nVidia's workstation cards are the same as their gaming cards, the only difference being that the workstation ones are certified to produce 100% accurate output. If a gaming card colours a pixel wrong every now and then it's no big deal and the player probably won't even notice.

As OpenCL and other "abuses" of GPU power become more popular, "colors a pixel wrong" will eventually happen in the wrong place at the wrong time on someone using a "gaming" card.

Re:Impossible design (3, Insightful)

Chowderbags (847952) | more than 3 years ago | (#32392908)

Moreover, if the processor goofs on the check, how will the program know? Do we run every operation 3 times and take the majority vote (then we've cut down to 1/3rd of the effective power)? Even if we were to take the 1% error rate, given that each core of CPUs right now can run billions of instructions per second, this CPU will fail to check correctly every second (even checking, rechecking, and checking again every single operation). And what about memory operations? Can we accept errors in a load or store function? If so, we can't in practice trust our software to do what we tell it. (Change a bit on load and you could do damn near anything, from adding the wrong number, to saying an if statement is true when it should be false, to not even running the right fricken instruction.)

There's a damn good reason why we want our processors to be rock solid. If they don't work right, we can't trust anything they output.
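
(For reference, the "run it three times and vote" scheme is classic triple modular redundancy. A minimal Python sketch, with unreliable_op standing in for a hypothetical fault-prone unit:

    import random
    from collections import Counter

    def unreliable_op(x, error_rate=0.01):
        """Hypothetical fault-prone operation: usually returns x*x, occasionally garbage."""
        r = x * x
        if random.random() < error_rate:
            r += random.randrange(1, 1000)
        return r

    def tmr(x):
        """Triple modular redundancy: take the majority of three executions."""
        results = [unreliable_op(x) for _ in range(3)]
        value, votes = Counter(results).most_common(1)[0]
        if votes >= 2:
            return value
        raise RuntimeError("no majority: all three runs disagreed")

At a 1% per-run error rate, the odds of two runs being wrong are on the order of 3 x 0.01^2, and they would also have to be wrong in the same way to out-vote the correct result, so the voted answer is far more trustworthy. But, as the parent says, you've paid for the work three times over.)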

Re:Impossible design (4, Insightful)

pipatron (966506) | more than 3 years ago | (#32393618)

Not very insightful. You seem to say that a CPU today is error-free, and if this is true, the part of the new CPU that does the checks could also be made error-free, so there's no problem.

Well, they aren't rock-solid today either, so you cannot trust their output even today. It's just not very likely that there will be a mistake. This is why mainframes execute a lot of instructions at least twice and decide on the fly if something went wrong. This idea is just an extension of that.

I thcnk it4s a gre&t idKa (0)

jolyonr (560227) | more than 3 years ago | (#32392430)

Bu) I l/ved in thE day oI the 2400 baUUUd modem.

Actually you've got the right idea (2, Interesting)

Anonymous Coward | more than 3 years ago | (#32392758)

The summary talked about the communication links... I remember when we were running SLIP over two-conductor serial lines and "overclocking" the serial lines. Because the networking stack (software) was doing checksums and retries, it worked faster to run the line in its fast but noisy mode than to clock it conservatively at a rate with low noise.

If the chip communications paths start following the trend of off-chip paths (packetized serial streams), then you could have higher level functions of the chip do checksums and retries, with a timeout that aborts back even higher to a software level. Your program could decide how much to wait around for correct results versus going on with partial results, depending on its real-time requirements. The memory controllers could do this, using the large, remote SRAM and RAM spaces as an assumed-noisy space and overlaying its own software checksum format on top.

This is really not so different from modern filesystems which start to reduce their trust in the storage device, and overlay their own checksum, redundancy, and recovery methods. You can imagine bringing these reliability boundaries ever "closer" to the CPU. Of course, you are right that it doesn't make sense to allow noisy computed goto addresses, unless you can characterize the noise and link your code with "safe landing areas" around the target jump points. It makes even less sense to have noisy instruction pointers, e.g. where it could back up or skip steps by accident, unless you can design an entire ISA out of idempotent instructions which you can then emit with sufficient redundancy for your taste.
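
The checksum-and-retry loop described above is simple enough to sketch; a toy Python version, with the bit-flipping function standing in for the noisy serial line (none of this is a real networking API):

    import random
    import zlib

    def noisy_channel(data, bit_error_rate=1e-4):
        """Flip each bit of the frame independently with some small probability."""
        out = bytearray(data)
        for i in range(len(out) * 8):
            if random.random() < bit_error_rate:
                out[i // 8] ^= 1 << (i % 8)
        return bytes(out)

    def send_reliably(payload, max_tries=20):
        """Checksum-and-retry: resend the frame until the CRC verifies at the far end."""
        frame = payload + zlib.crc32(payload).to_bytes(4, "big")
        for _ in range(max_tries):
            received = noisy_channel(frame)
            body, crc = received[:-4], int.from_bytes(received[-4:], "big")
            if zlib.crc32(body) == crc:
                return body                  # delivered intact on this attempt
        raise RuntimeError("link too noisy for this frame size")

Whether the fast-but-noisy link wins depends entirely on how often you end up retransmitting, which is the same bet the stochastic-processor idea makes on-chip.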

Wrong approach? (1)

Yvan256 (722131) | more than 3 years ago | (#32392454)

Wouldn't it be simpler to just add redundancy and CRC or something to that effect?

Re:Wrong approach? (1)

DigiShaman (671371) | more than 3 years ago | (#32392550)

The goal is to reduce power consumption and improve performance. Adding redundancy and CRC goes against that.

Re:Wrong approach? (0)

Dunbal (464142) | more than 3 years ago | (#32393160)

The goal is to reduce power consumption and improve performance.

Well that's fine if you are an academic who measures "performance" in "operations per second". Usually, however, computers are used to make CORRECT calculations. What use is a blazing fast computer that is no longer reliable? If you allow a fraction of errors, considering the speed of CPUs and the length of time they can be running, you can expect these errors to compound and magnify over time, eventually corrupting the whole program/data.

Re:Wrong approach? (2, Funny)

jibjibjib (889679) | more than 3 years ago | (#32393436)

Yeah because academics are idiots who measure performance in incorrect calculations per second, and they did this research without thinking of all these things that you've thought up in two minutes reading the Slashdot summary.

Seriously, people, get some common sense.

Re:Wrong approach? (3, Interesting)

somersault (912633) | more than 3 years ago | (#32393538)

What use is a blazing fast computer that is no longer reliable

Meh.. I'm pretty happy to have my brain, even if it makes some mistakes sometimes.

Moving, not fixing, the problem (4, Interesting)

Red Jesus (962106) | more than 3 years ago | (#32392494)

The "robustification" of software, as he calls it, involves re-writing it so an error simply causes the execution of instructions to take longer.

Ooh, this is tricky. So we can reduce CPU power consumption by a certain amount if we rewrite software in such a way that it can slowly roll over errors when they take place. There are some crude numbers in the document: a 1% error rate, whatever that means, causes a 23% drop in power consumption. What if the `robustification' of software means that it has an extra "check" instruction for every three "real" instructions? Now you're back to where you started, but you had to rewrite your software to get here. I know, it's unfair to compare his proven reduction in power consumption with my imaginary ratio of "check" instructions to "real" instructions, but my point still stands. This system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.
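
To put rough numbers on that worry (the one-check-per-three-instructions overhead is my made-up figure; the 23% is from the article): energy is roughly power times time, so

    power_saving = 0.23          # claimed power drop at a 1% error rate
    check_overhead = 1.0 / 3.0   # hypothetical: one extra "check" instruction per three real ones
    relative_energy = (1 - power_saving) * (1 + check_overhead)
    print(f"energy vs. baseline: {relative_energy:.2f}")   # ~1.03, i.e. slightly worse

On those assumptions the scheme only pays off if the "robustification" overhead stays under roughly 30% extra instructions, or if the checks themselves can run on the cheap, sloppy hardware.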

Re:Moving, not fixing, the problem (2, Insightful)

sourcerror (1718066) | more than 3 years ago | (#32392546)

This system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.

People said the same about RISC processors.

Re:Moving, not fixing, the problem (2, Informative)

DigiShaman (671371) | more than 3 years ago | (#32392640)

From what I understand, all modern processors are now a hybrid of both RISC and CISC (Intel Core 2, AMD K8, etc). Except for embedded applications, the generic CPU doesn't have that kind of pure classification anymore. Right?

Re:Moving, not fixing, the problem (2, Informative)

xZgf6xHx2uhoAj9D (1160707) | more than 3 years ago | (#32392786)

The classifications weren't totally meaningful to begin with, but CISC has essentially died. I don't mean there aren't CISC chips anymore--any x86 or x64 chip can essentially be called "CISC"--but certainly no one's designed a CISC architecture in the past decade at least.

RISC has essentially won and we've moved into the post-RISC world as far as new designs go. VLIW, for instance, can't really be considered classical RISC, but it builds on what RISC has accomplished.

The grandparent's point is a good one: people thought RISC would never succeed; they were wrong.

Re:Moving, not fixing, the problem (2, Interesting)

Turzyx (1462339) | more than 3 years ago | (#32392636)

I'm making assumptions here, but if these errors are handled by software would it not be possible for a program to 'ignore' errors in certain circumstances? Perhaps this could result in improved performance/battery life for certain low priority tasks. Although an application where 1% error is acceptable doesn't spring immediately to mind, maybe supercomputing - where anomalous results are checked and verified against each other...?

Re:Moving, not fixing, the problem (0)

Anonymous Coward | more than 3 years ago | (#32392788)

Video decoding maybe?

Re:Moving, not fixing, the problem (1)

thegarbz (1787294) | more than 3 years ago | (#32392936)

Although an application where 1% error is acceptable doesn't spring immediately to mind,

This struck me as a problem too. Where is a processing error acceptable? In a game, which currently doesn't tax the CPU as much as it does the GPU anyway? In a word processor, where the performance of the CPU really doesn't matter? A 1% error is definitely not tolerable when calculating pi, and again, how do you check the result is correct? Do you execute each instruction twice and ensure that the same result came out the other end?

Also, what kind of error are we talking about? Sure, when playing a game a flipped bit could cause the screen to display a fault, or it could cause the game to come crashing down. Is the error checking code somehow immune to errors too? This just reeks of the Pentium FDIV problem, which had a significant impact.

Re:Moving, not fixing, the problem (3, Insightful)

jd (1658) | more than 3 years ago | (#32393570)

For this, I'd point to the RISC vs. CISC debate. RISC programs took many more instructions to do the same things, but each instruction was so much faster that you ended up with greater performance overall. Extra steps = some amount of overhead, but so long as the net gain is greater than the net overhead, you will gain overall. The RISC chip demonstrated that such examples really do exist in the real world.

But just because there are such situations, it does not automatically follow that more steps always equals greater performance. It may be that better error-correction techniques in the hardware would handle the transmission errors just fine without having to alter any software at all. It depends on the nature of the errors (bursty vs randomly-distributed, percent of signal corrupted, etc) as to what error-correction would be needed and whether the additional circuitry can be afforded.

Alternatively, the problem may in fact turn out to be a solution. Once upon a time, electron leakage was a serious problem in CPU designs. Then, chip manufacturers learned that they could use electron tunneling deliberately. The cause of these errors may be further electron leakage or some other quantum effect, it really doesn't matter. If it leads to a better practical understanding of the quantum world to the point where the errors can be mitigated and the phenomenon turned to the advantage of the engineers, it could lead to all kinds of improvement.

There again, it might prove redundant. There are good reasons for believing that "Processor In Memory" architectures are good for certain types of problem - particularly for providing a very standard library of routines, but certain opcodes can be shifted there as well. There is also the Cell approach, which is to have a collection of tightly-coupled but specialized processors of different kinds. A heterogeneous cluster with shared memory, in effect. If you extend the idea to allow these cores to be physically distinct, you can offload from the original CPU that way. In both cases, you distribute the logic over a much wider area without increasing the distance signals have to travel (you may even shorten the distance). As such, you can eliminate a lot of the internal sources of errors.

It may prove redundant in other ways, too. There are plenty of cooling kits these days that will take a CPU to much lower temperatures. Less thermal noise may well result in fewer errors, since that is a likely source of some of them. Currently, processors often run at 40°C - and I've seen laptop CPUs reach 80°C. If you can bring the cores down to nearly 0°C and keep them there, that should have a sizable impact on whether the data is being transmitted accurately. The biggest change would be to modify the CPU casing so that heat is reliably and rapidly transferred from the silicon. I would imagine that you'd want the interior of the CPU casing to be flooded with something that conducts heat but not electricity - Fluorinert does a good job there - and then have the top of the case also be an extra-good heat conductor (plastic only takes you so far).

However, if programs were designed with fault-tolerance in mind, these extra layers might not be needed. You might well be able to get away with better software on a poorer processor. Indeed, one could argue that the human brain is an example of an extremely unreliable processor whose net processing power (even allowing for the very high error rates) is far, far beyond any computer yet built. This fits perfectly with the professor's description of what he expects, so maybe this design actually will work as intended.

Re:Moving, not fixing, the problem (1, Insightful)

Anonymous Coward | more than 3 years ago | (#32392980)

I've read something similar to this in the past, and the example they used is video playback. If a few pixels in a video frame are rendered incorrectly, the end user probably won't even notice. I think the likely applications of this are in video decoders and gaming graphics.

Re:Moving, not fixing, the problem (1)

noidentity (188756) | more than 3 years ago | (#32392658)

I can see this possibly working, though the devil is in the details. First, consider a similar situation with a communications link. You could either send every byte twice (TI 99/4A cassette format, I'm looking at you!), or if the error rate isn't too high, checksum large blocks of data and retransmit if there's an error. The latter will usually yield a higher rate for the error-free channel you create the illusion of. So if you could break a computation into blocks and somehow detect a corrupt computation, you could just recompute the block. So bringing this back to computation, how the hell do you determine a corrupt computation without doing the computation to see what the correct result is? And if you knew this already, you wouldn't need to compute it in the first place. Maybe there are solutions to this, but for any piece of software? And you thought rewriting for multiple threads was hard...
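
One partial answer to "how do you detect a corrupt computation without redoing it" is that for some problems verification is much cheaper than computation. Sorting is the textbook example: checking that the output is ordered and has the same elements is O(n), while re-sorting is O(n log n). A Python sketch of the recompute-the-block structure (here sorted() is of course reliable; pretend it runs on the sloppy hardware):

    from collections import Counter

    def verified_sort(data, max_retries=5):
        """Do the block of work, verify it with a cheaper check, redo the block on failure."""
        original = list(data)
        for _ in range(max_retries):
            result = sorted(original)   # imagine this runs on the fast-but-unreliable unit
            in_order = all(a <= b for a, b in zip(result, result[1:]))   # O(n)
            same_items = Counter(result) == Counter(original)            # O(n) expected
            if in_order and same_items:
                return result
        raise RuntimeError("verification kept failing; use the reliable path")

But the parent's objection stands for the general case: plenty of computations have no verifier that's meaningfully cheaper than the computation itself.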

Re:Moving, not fixing, the problem (2, Informative)

xZgf6xHx2uhoAj9D (1160707) | more than 3 years ago | (#32392924)

Error Correction Codes (aka Forward Error Correction) are typically more efficient for high-error channels than error detection (aka checksum and retransmit), which is why 10Gbps Ethernet uses Reed-Solomon rather than CRC in previous Ethernet standards: it avoids the need to retransmit.

I had the same questions about how this is going to work, though. What is the machine code going to look like and how will it allow the programmer to check for errors? Possibly each register could have an extra "error" bit (similarly to IA-64's NaT bit on its GP registers). E.g., if you do an "add" instruction, it checks the error bits on its source operands and propagates them. So long as you only allow false positives and not false negatives, it would work, and could be relatively efficient.
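
A sketch of that error-bit idea in Python, just to show the propagation rule (the NaT analogy, not any real ISA):

    from dataclasses import dataclass

    @dataclass
    class Reg:
        value: int
        err: bool = False    # analogous to a NaT bit: "this value may be garbage"

    def add(a, b):
        """An add that propagates the error flag: if either input is suspect, so is the result."""
        return Reg(a.value + b.value, a.err or b.err)

    x, y = Reg(40), Reg(2, err=True)
    z = add(x, y)
    if z.err:
        pass    # recompute, fall back to a reliable unit, or accept it if the value is non-critical

The nice property is that software only has to look at the flag at the points where correctness actually matters, rather than after every instruction.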

Re:Moving, not fixing, the problem (1)

jd (1658) | more than 3 years ago | (#32393716)

There are many types of error correction codes, and which one you use depends on the nature of the errors. For example, if the errors are totally random, then Reed-Solomon will likely be the error correction code used. CDs use two layers of Reed-Solomon concatenated in series. This is not especially fast, but the output for a CD is in tens of kilohertz and an ASIC could easily be operating in the gigahertz realm. However, when you're talking about the internals of a CPU, there's a limit to how slow you can afford things to be.

BCH is a general form of Reed-Solomon and you can therefore tailor the function to handle specific characteristics of the sorts of error observed rather than trying to code round anything that doesn't fit perfectly within Reed-Solomon. Although it's more general, it might actually be quicker if you can find parameters that better describe what it is supposed to be correcting than the default values used by Reed-Solomon.

Turbo codes (as used by NASA for modern deep-space communication) are great when you're dealing with block errors. In long-distance communication, your biggest problem will be bursts of interference rather than the random noise. Wikipedia states: "Turbo codes, as described first in 1993, implemented a parallel concatenation of two convolutional codes, with an interleaver between the two codes and an iterative decoder that would pass information forth and back between the codes." It goes on to state that this is faster than any previous concatenation scheme. Fast is good, here, but it is only meaningful if this is the sort of error experienced.

There are plenty of other sorts out there, and again there may well be error correction codes not listed above that are much better suited to this specific type of problem.
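
For a feel of what forward error correction looks like at the very small end of that spectrum, here's a Hamming(7,4) toy in Python: three parity bits per four data bits, and any single flipped bit is located and corrected without a retry. (Reed-Solomon, BCH and turbo codes are the industrial-strength versions of the same idea.)

    def hamming74_encode(d):
        """d = [d1, d2, d3, d4] data bits -> 7-bit codeword, parity at positions 1, 2, 4."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(c):
        """Correct up to one flipped bit and return the four data bits."""
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]      # parity check over positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]      # parity check over positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]      # parity check over positions 4,5,6,7
        syndrome = s1 + 2 * s2 + 4 * s3     # 0 = clean, otherwise the 1-based error position
        if syndrome:
            c[syndrome - 1] ^= 1
        return [c[2], c[4], c[5], c[6]]

    cw = hamming74_encode([1, 0, 1, 1])
    cw[4] ^= 1                              # inject a single-bit error
    assert hamming74_decode(cw) == [1, 0, 1, 1]

The trade-off jd describes is exactly this: the extra parity bits and decode logic cost area and time, so the code has to be matched to the kind and rate of errors you actually see.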

Re:Moving, not fixing, the problem (0)

Anonymous Coward | more than 3 years ago | (#32392898)

It's not that simple; there are errors all the time in hardware that are hidden from you. For instance, before hard disk companies got smart and started hiding and automatically re-allocating bad sectors, if your drive was beginning to go you could get it replaced as soon as it had bad sectors. But drive companies started reallocating and remapping sectors, so disk scanners became redundant, since drives came with extra space on the platters.

I think the idea MAY have promise if we ever hit a brick wall, but it will require a lot of experimentation to see if it is feasible. They really have to design the hardware so that you don't have to change the software.

Re:Moving, not fixing, the problem (2, Interesting)

JoeMerchant (803320) | more than 3 years ago | (#32392974)

Why rewrite the application software? Why not catch it in the firmware and still present a "perfect" face to the assembly level code? Net effect would be an unreliable rate of execution, but who cares about that if the net rate is faster?

Why not move and fix it? (2, Informative)

BartholomewBernsteyn (1720348) | more than 3 years ago | (#32393070)

This may be a far-out thought, but if stochastic CPUs allow for increased performance in a trade-off against correctness, maybe something like the following could reap the benefits while keeping the stochastics out of the way:
Suppose those CPUs really allow for faster instruction handling using fewer resources; maybe you could put more in a package for the same price, which on a hardware level would give rise to more processing cores at the same cost (multi-core stochastic CPUs).
Naturally, you have the ability to do parallel processing, with errors possible, but you are able to process instructions at a faster rate.
On the software side, support for concurrency is a major selling point; of course, there has to be something able to recover from those pesky stochastics gracefully. What comes to mind is the functional language Erlang.
This is taken from Wikipedia:

Concurrency supports the primary method of error-handling in Erlang. When a process crashes, it neatly exits and sends a message to the controlling process which can take action. This way of error handling increases maintainability and reduces complexity of code

From the official source:

Erlang has a built-in feature for error handling between processes. Terminating processes will emit exit signals to all linked processes, which may terminate as well or handle the exit in some way. This feature can be used to build hierarchical program structures where some processes are supervising other processes, for example restarting them if they terminate abnormally.

Asked to 'refer to OTP Design Principles for more information about OTP supervision trees, which use[s] this feature' I read this:

A basic concept in Erlang/OTP is the supervision tree. This is a process structuring model based on the idea of workers and supervisors. Workers are processes which perform computations, that is, they do the actual work. Supervisors are processes which monitor the behaviour of workers. A supervisor can restart a worker if something goes wrong. The supervision tree is a hierarchical arrangement of code into supervisors and workers, making it possible to design and program fault-tolerant software.

This seems like a good fit? Create a real, physical machine for a language that is both able to reap its benefits and to cope with the trade-off.
Or maybe I'm too far off (I'm bored technologically, allow me some paradigmatic change at Slashdot).

TamedStochastics - Hiring.

Yes, checksumming on dedicated hardware was my first thought as well.
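
The supervision-tree idea doesn't require Erlang specifically; a rough Python analogue of the worker/supervisor pattern quoted above (restart on crash, with a cap so a permanently broken worker eventually escalates) looks something like:

    def supervise(worker, *args, max_restarts=5, **kwargs):
        """Crude supervisor: run the worker, restart it if it dies, give up eventually."""
        last_exc = None
        for attempt in range(1, max_restarts + 1):
            try:
                return worker(*args, **kwargs)
            except Exception as exc:     # a crash, possibly caused by a flaky compute unit
                last_exc = exc
                print(f"worker died (attempt {attempt}): {exc!r}; restarting")
        raise RuntimeError("worker keeps failing; escalate up the tree") from last_exc

Erlang/OTP does this across huge numbers of lightweight processes with configurable restart strategies, which is why it keeps coming up whenever "let it crash and recover" hardware is discussed.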

Re:Moving, not fixing, the problem (0)

Anonymous Coward | more than 3 years ago | (#32393412)

Moving things from hardware to software has always led to failure or challenges. Examples include

1. The current parallel programming challenge (using all cores is up to the programmer or compiler)
2. Software controlled caches (The Cell processor)

Anything that can be moved from software to hardware shields users from complexities in a transparent manner.

Re:Moving, not fixing, the problem (4, Insightful)

Interoperable (1651953) | more than 3 years ago | (#32393432)

I did some digging and found some material by the researcher, unfiltered by journalists. I don't have any background in processor architecture but I'll present what I understood. The original publications can be found here [illinois.edu].

The target of the research is not general computing, but rather low-power "client-side" computing, as the author puts it. I understand this to be decoding applications, such as voice or video in mobile devices. Furthermore, the entire architecture would not be stochastic, but rather it would contain some functional blocks that are stochastic. I think the idea is that certain mobile hardware devices devote much of their time to specialized applications that do not require absolute accuracy.

A mobile phone may spend most of its time encoding/decoding low-resolution voice and video and would have significant blocks within the processor devoted to those tasks. Those tasks could be considered error tolerant. The operating system would not be exposed to error-prone hardware, only applications that use hardware acceleration for specialized, error-tolerant tasks. In fact, the researchers specifically mention encoding/decoding voice and video and have demonstrated the technique on encoding h.264 video.

Re:Moving, not fixing, the problem (1)

oldhack (1037484) | more than 3 years ago | (#32393438)

Bit like Itanium VLI-whatever architecture - leave it to the compiler (i.e. software) to correctly pack instructions to big units so they can use up all the subunits simultaneously.

Possibly it might be suitable for some niche applications.

Never going to work (0)

Anonymous Coward | more than 3 years ago | (#32392520)

Consider that we currently can't even write software that is reliable on relatively error-free hardware. Introducing faulty hardware will just make everything suck even more.

This is just like all the other "solutions" to problems that only require well designed, bug free software. That just ain't gonna happen. Programming is complex and our little fuzzy/faulty brains aren't very good at it.

Does his lab produce Daily WTFs? (0)

Anonymous Coward | more than 3 years ago | (#32392524)

The whole TFS and most of the TFA is a load of non-sequiturs. If they bothered linking the papers it might have been useful, but no, it's another shining example of tech journali^H^H^H^H^H^H^H total bullshit.

It might be something to do with clock variability and operations that are retry-able on error. Or they could be running a clock signal from a grandfather clock.

Sounds like... (1, Offtopic)

Chineseyes (691744) | more than 3 years ago | (#32392526)

Sounds like Kumar and his friend Harold have been spending too much time baking weed brownies and not silicon wafers.

Moore's law (1)

koreaman (835838) | more than 3 years ago | (#32392588)

I don't see how allowing a higher error rate will enable them to put more transistors on a chip.

Re:Moore's law (3, Informative)

takev (214836) | more than 3 years ago | (#32392632)

What he is proposing is to reduce the number of transistors on a chip, to increase its speed and reduce power usage.
So in fact he is trying to reverse Moore's law.

Re:Moore's law (0)

Anonymous Coward | more than 3 years ago | (#32393288)

What he is doing is exactly what the people who made NoSQL are doing. Remove error correction/detection, and of course a system will go faster.

OB car analogy: removing the airbags, brakes, and the body of the car, leaving just a chassis, will surely make a car run faster. It won't be usable, though, for anything other than "I have moar speed than you!" type of crap.

Current software can't recover from its own errors (0)

Anonymous Coward | more than 3 years ago | (#32392698)

Current software can't recover from its own errors most of the time, but we're supposed to trust it to handle hardware ones, too?

I'll trade some speed for reliability, thanks. Rebooting sucks and this sounds like a great way to do more of it.

why use a stochastic processor? (0)

Anonymous Coward | more than 3 years ago | (#32392728)

Why use a stochastic processor which makes mistakes when we can use our brains, which make mistakes?

Re:why use a stochastic processor? (3, Funny)

Arker (91948) | more than 3 years ago | (#32393024)

Why use a stochastic processor which makes mistakes when we can use our brains, which make mistakes?

Because the stochastic processor will be able to make mistakes much more quickly of course.

Don't you understand progress?

Sounds reasonable to me (4, Insightful)

bug1 (96678) | more than 3 years ago | (#32392738)

Ethernet is an improvement over token ring, yet Ethernet has collisions and token ring doesn't.

Token ring avoids collisions; Ethernet accepts that collisions will take place but has a good error recovery system.

Re:Sounds reasonable to me (0)

Anonymous Coward | more than 3 years ago | (#32392826)

And the recovery algorithm runs on a defective processor?

Re:Sounds reasonable to me (2, Insightful)

TheGratefulNet (143330) | more than 3 years ago | (#32393308)

in fact, it's the randomness of Ethernet (back off and retry at random, non-matching intervals) that gets you order. if everyone used the same backoff timers, they'd keep colliding; but add in some randomness and things work better.

increase entropy to ensure order. ha! but it's true.
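
(For the curious, the randomness in question is truncated binary exponential backoff; a simplified Python sketch using the classic 10 Mbit/s slot time:

    import random

    SLOT_TIME = 51.2e-6   # seconds; 512 bit times at 10 Mbit/s

    def backoff_delay(collision_count):
        """After the n-th collision, wait a random number of slot times in [0, 2^min(n,10))."""
        k = min(collision_count, 10)
        return random.randrange(0, 2 ** k) * SLOT_TIME

Because each station draws its delay independently, two stations that just collided are unlikely to pick the same slot again, and the window doubles with every repeat collision -- which is the "more entropy, more order" point above.)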

Re:Sounds reasonable to me (1)

h4rr4r (612664) | more than 3 years ago | (#32393316)

No, they are totally separate things. You can run token ring over Ethernet, been there done that. Ethernet does use a bus topology but these days we use switches to avoid collisions.

Re:Sounds reasonable to me (1)

oldhack (1037484) | more than 3 years ago | (#32393466)

Our networking code explicitly assumed unreliable network transmission, but I doubt much of our code is designed to handle baked-in CPU faults.

Moore Slaw (0)

Anonymous Coward | more than 3 years ago | (#32392782)

And maybe some potato salad?

A brainy idea. (4, Interesting)

Ostracus (1354233) | more than 3 years ago | (#32392962)

He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully.

I dub thee neuron.

Re:A brainy idea. (0)

Anonymous Coward | more than 3 years ago | (#32393444)

Ah, now I understand what a brain fart is.

Re:A brainy idea. (1)

Colin Smith (2679) | more than 3 years ago | (#32393512)

Indeed. It couldn't be used with traditional programming methods, you'd only be able to use it with statistical methods.

Genetic programming maybe. Errors are mutations.
 

The problem isn't hardware to begin with... (4, Insightful)

Angst Badger (8636) | more than 3 years ago | (#32392986)

...the problem is software. In the last twenty years, we've gone from machines running at a few MHz to multicore, multi-CPU machines with clock speeds in the GHz, with corresponding increases in memory capacity and other resources. While the hardware has improved by several orders of magnitude, the same has largely not been true of software. With the exception of games and some media software, which actually require and can use all the hardware you can throw at them, end user software that does very little more than it did twenty years ago could not even run on a machine from 1990, much less run usably fast. I'm not talking enterprise database software here, I'm talking about spreadsheets and word processors.

All of the gains we make in hardware are eaten up as fast or faster than they are produced by two main consumers: useless eye-candy for end users, and higher and higher-level programming languages and tools that make it possible for developers to build increasingly inefficient and resource-hungry applications faster than before. And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

It really doesn't matter how powerful the hardware becomes. For specialist applications, it's still a help. But for the average user, an increase in processor speed and memory simply means that their 25 meg printer drivers will become 100 meg printer drivers and their operating system will demand another gig of RAM and all their new clock cycles. Anything that's left will be spent on menus that fade in and out and buttons that look like quivering drops of water -- perhaps next year, they'll have animated fish living inside them.

Re:The problem isn't hardware to begin with... (0)

Anonymous Coward | more than 3 years ago | (#32393134)

That's why real geeks live under the rock of [url=http://suckless.org/]Suckless[/url].

Re:The problem isn't hardware to begin with... (1)

dbirk (1822312) | more than 3 years ago | (#32393250)

The thing you aren't realizing is that while it might not matter to the person using Microsoft Word on their computer, it matters because now Google has more CPU power to give them better search quality, and YouTube can show people videos on their mobile phones, and WoW can support millions of people logged on to its virtual worlds, and a small startup can buy a server that can scale to thousands of users for a few hundred bucks a month. So while it doesn't affect that individual user's machine, the hardware improvements absolutely affect what they do on their computer, and in a very beneficial way.

How eye candy helps interoperability (2, Interesting)

tepples (727027) | more than 3 years ago | (#32393262)

And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

I think I know why. If free software lacks eye candy, free software has trouble gaining more users. If free software lacks users, hardware makers won't cooperate, leading to the spread of "paperweight" status on hardware compatibility lists. And if free software lacks users, there won't be any way to get other software publishers to document data formats or to get publishers of works to use open data formats.

Re:The problem isn't hardware to begin with... (1)

blindbat (189141) | more than 3 years ago | (#32393422)

I started programming on Apple ][ computers and did assembly. The code was fast and efficient.

However, the benefits of developing with powerful APIs on beefy operating systems are well worth it. There is so much more going on under the hood of a program and system, and that costs CPU cycles.

Even software I'm writing for the iPad moves so fast that you *must* include the movements, fades, etc. in order to let the user know something has changed or is changing.

Much is unnecessary and ugly (especially Windows driver interfaces by many manufacturers) but I'll take today over 30 years ago.

Re:The problem isn't hardware to begin with... (1)

uglyduckling (103926) | more than 3 years ago | (#32393448)

Higher level languages aren't just there to save developers time. Using higher level languages usually makes it harder to generate code that will walk on protected memory, cause race conditions etc., and higher level languages are usually more portable and make it easier to write modular re-usable code.

Re:The problem isn't hardware to begin with... (4, Insightful)

Draek (916851) | more than 3 years ago | (#32393604)

for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

You yourself stated that high-level languages allow for a much faster rate of development, yet you dismiss the idea of using them in the F/OSS world as a mere "desire to emulate corporate software development"?

Hell, you also forgot another big reason: high-level code is almost always *far* more readable than its equivalent set of low-level instructions, the appeal of which for F/OSS ought to be similarly obvious.

Sorry but no, the reason practically the whole industry has been moving towards high-level languages isn't because we're all lazy, and if you worked in the field you'd probably know why.

Re:The problem isn't hardware to begin with... (3, Insightful)

Homburg (213427) | more than 3 years ago | (#32393676)

So you're posting this from Mosaic, I take it? I suspect not, because, despite your "get off my lawn" posturing, you recognize in practice that modern software actually does do more than twenty-year-old software. Firefox is much faster and easier to use than Mosaic, and it also does more, dealing with significantly more complicated web pages (like this one; and terrible though Slashdot's code surely is, the ability to expand comments and comment forms in-line is a genuine improvement, leaving aside the much more significant improvements of something like gmail). Try using an early 90s version of Word, and you'll see that, in the past 20 years word processors, too, have become significantly faster, easier to use, and capable of doing more (more complicated layouts, better typography).

Sure, the laptop I'm typing this on now is, what, 60 times faster than a computer in 1990, and the software I'm running now is neither 60 times faster nor 60 times better than the software I was running in 1990. But it is noticeably faster, at the same time that it does noticeably more and is much easier to develop for. The idea that hardware improvements haven't led to huge software improvements over the past 20 years can only be maintained if you don't remember what software was like 20 years ago.

Re:The problem isn't hardware to begin with... (0)

Anonymous Coward | more than 3 years ago | (#32393704)

You think high level languages are a bad idea? Wow. Just wow. I'm not sure you could get much more off the mark with that belief. Sounds like you missed your class on programming languages.

Human learning (1)

Gruff1002 (717818) | more than 3 years ago | (#32393028)

We all learn (or are supposed to) from our mistakes; how is a machine supposed to act differently? It's simple logic.

Goes against the trend (1)

izomiac (815208) | more than 3 years ago | (#32393064)

The trend lately seems to be to build hardware that runs existing software faster. Designing hardware without legacy support would make for faster, more power-efficient hardware. Furthermore, hardware is expensive to modify, whereas software is relatively cheap to update.

OTOH, since the world relies on commercial software distributed in binary form, hardware makers have to support it. Today, the hardware is built so the software doesn't need to be changed, despite the fact that computers would perform at a much higher level if it were the other way around. I suppose one could point out that we have so much software today that porting all of it isn't practical. Of course, the current state of affairs is solely due to the Windows-on-x86 near-monoculture. People seem to love sticking with what works rather than go through a bit of pain to achieve higher levels. I suppose people expect that computers aren't ever going to move past the general design standardized in the 1990s.

IMHO, what we need is a clean break, a complete redesign, every decade or so. At that point, most decade-old software should be emulatable, and we get the benefits of the ever-advancing state of computer science. Plus, the periodic chaos should prevent complacency and increase competition, while the decade-long stability would allow for optimization and provide a common build target. Fat chance that Microsoft or big business would ever go along with that idea, though.

Re:Goes against the trend (1)

tepples (727027) | more than 3 years ago | (#32393282)

IMHO, what we need is a clean break, a complete redesign, every decade or so.

Game consoles have that. But the problem with those is the cryptographic lockout chips that enforce policies that shut out small businesses.

Nope (1)

klaiber (117439) | more than 3 years ago | (#32393112)

If it requires software changes that are not 100% automated, then this won't fly. Programmers have a hard enough time writing sequential programs, let alone multithreaded ones. Now they're supposed to also foresee and check hardware errors? I think not.
I note that the entire idea hinges on the s/w component, yet the article hides the complexity under the harmless-sounding term "robustification".
Another idea from the ivory towers that is good at generating papers, but not actual machines. IMHO.

I've gone even better (1)

Dunbal (464142) | more than 3 years ago | (#32393180)

I have designed a CPU that uses only one transistor, requires absolutely no power, and is infinitely fast! Of course at the moment the only instruction it can run is NOP, but I'm working on the problem...

Garbage in, garbage out, professor. A computer that isn't accurate is no longer useful. We might as well go back to using thousands of humans to double-check other thousands of humans. Oh wait no those require FAR more energy and time.

I'd just like to point out... (2, Insightful)

DavidR1991 (1047748) | more than 3 years ago | (#32393210)

...that the Transmeta Crusoe processor has sod-all to do with porting or different programming models. The whole point of the Crusoe was that it could distil down various types of instruction (e.g. x86, even Java bytecode) to native instructions it understood. It could run 'anything' so to speak, given the right abstraction layer in between

Its lack of success was nothing to do with programming - just that no one needed a processor that could do these things. The demand wasn't there.

Mistakes make you learn more... (1)

AmazinglySmooth (1668735) | more than 3 years ago | (#32393228)

I had a physics prof for freshman physics that said you learn more from mistakes. I told him that we must be physics experts by now.

Just adds another layer... (1)

VortexCortex (1117377) | more than 3 years ago | (#32393322)

We rarely write software that is even robust enough to be secure against unexpected input on our current "reliable" chips (see: Viruses and other malware).

The idea of having application programmers cope with the new unpredictable hardware errors is seriously flawed.

In the end an additional "software" layer (probably actually firmware) will have to deal with this new type of hardware error; Application level coding (and existing software) will continue working as usual.

If this turns out to be faster than current techniques: Meh. A new faster processor and a new buzzword will be born.

I'll be interested when I can buy the new hardware and run *nix on it; Until then the only buzzword that comes to mind is "vaporware".

Late, and inaccurate (3, Interesting)

gman003 (1693318) | more than 3 years ago | (#32393324)

I've seen this before, except for an application that made more sense: GPUs. A GPU instruction is almost never critical. Errors writing pixel values will just result in minor static, and GPUs are actually far closer to needing this sort of thing. The highest-end GPUs draw far more power than the highest-end CPUs, and heating problems are far more common.

It may even improve results. If you lower the power by half for the least significant bit, and by a quarter for the next-least, you've cut power 3% for something invisible to humans. In fact, a slight variation in the rendering could make the end result look more like our flawed reality.

A GPU can mess up and not take down the whole computer. A CPU can. What happens when the error hits during a syscall? During bootup? While doing I/O?

Re:Late, and inaccurate (0)

Anonymous Coward | more than 3 years ago | (#32393430)

While encrypting your /home?

GPUs are used for crypto too.

Re:Late, and inaccurate (1)

MostAwesomeDude (980382) | more than 3 years ago | (#32393510)

Not *that* kind of crypto. Still...

GPUs become inaccurate intentionally. Most GPU instructions are as accurate as IEEE 754 requires, and some are *more* accurate because they are directly in silicon. For example, reciprocals and square roots usually have dedicated circuits. Everything is at least a single-precision float. The inaccuracy comes later, during output; GPUs can be configured to dither away quality or lower their color depth in order to work with software that expects lower quality.

However, general-purpose computing GPUs, from the Dx 9 era onwards, are all shaderful, and shader units are not designed to be inaccurate.

Finally, GPUs are more than just the shader unit. If an error occurs on the GPU that causes the DMA unit to lock up, then the OS will spin its wheels endlessly trying to get the GPU to talk to it again. We've mitigated this somewhat in Linux, but it's still possible for a misprogrammed or misbehaving GPU to lock up the PCI/AGP/PCIe bus entirely, something we can't possibly recover from.

Re:Late, and inaccurate (1)

Kjella (173770) | more than 3 years ago | (#32393632)

But it's a long time since computers just drew something on the screen. One little error in video decoding will keep throwing off every frame until the next keyframe. One error in a shader computation can cause a huge difference in the output. What coordinates have an error tolerance after everything is transformed and textured and tessellated and whatnot? An error in the Z-buffer will throw one object in front of another instead of behind it. The number of operations where you don't get massive screen corruption is not that high.

low power - for embedded and server farms not desk (1)

mr_walrus (410770) | more than 3 years ago | (#32393356)

it could take off, but in specialized areas like embedded designs (low power - long battery life consideration)
and in server farms (low power, low cooling and electric costs).

embedded and server applications do not have the bloated huge application suites that need porting.
ie: big bloated popular desktop apps like photoshop and excel are not an issue for this new cpu design approach to be adopted.

server apps, 'relatively' speaking, are much less bloated, and often have been ported a zillion times already; adding error robustness is doable and worth doing for the potential savings.

embedded apps don't mind doing what it takes to achieve battery life noticeably longer than the competitor's, and often do such specialized (read: not bloated) functions that error robustness should also be doable -- even if currently glossed over.

but, regardless of how desirable this turns out to be, if a "big guy" (read: Intel) can't be bothered to push it, it'll die :(

Take a step back? (1)

solid_liq (720160) | more than 3 years ago | (#32393396)

He apparently wants us to take a step backwards to the days when crashes were frequent, such as with Windows 98. Software quality has a long way to go already. Does he not realize that making programmers deal with such an issue would bring software quality back into the Dark Ages?

As it is, programmers aren't given enough time to write software that works bug-free. Schedules are always rushed. This would dramatically increase: the burden on developers, the quantity of bugs, the number of developers being fired because they didn't get a project accomplished nearly as quickly as someone who pulled off a similar project 5 years earlier, the frustration of the users and developers (and transitively, the number of heart attacks around the world due to elevated blood pressure), the number of security vulnerabilities in software, and the migration rate to processor vendors who didn't make this mistake.

In short: this guy is on crack!

The perfect processor (1, Insightful)

Anonymous Coward | more than 3 years ago | (#32393434)

I think Mr Kumar is confusing the ability of the designer to develop a useful, power-efficient product today on a modern process with the performance of the end result. There is no law or provable proposition that a useful processor needs to be sloppy to outperform a neat competitor. This only holds true when you fail to include the cost of being sloppy and limit the intelligence and creativity of the designer. Any figures you produce to prove your point are by definition limited to a narrow, defined set of tradeoffs. They do not represent what is possible when someone smarter than yourself is designing a solution.

The space is difficult and getting more and more so. Deal with it or find another job. For quite some time now, innovations in the space have always come from techniques to mitigate complexity and error. When you're designing non-trivial ASICs, it's what you do.

In certain areas analog computers make sense. Heck, our brains are analog computers, but asking a classic computing environment to check itself is a non-starter in terms of any product users will accept.

Circuit design is somewhat of an art. There is an infinite array of subtle tradeoffs and clever hacks one can use to improve performance, such as use of crosstalk to bootstrap charging of neighboring caps, clock gating, distributed clocking, intentional glitching, even the use of analog circuits in certain limited roles.

What pisses me off the most about articles like this is that designers suffer from tunnel vision and therefore act like morons. I mean, look at a modern desktop PC. Intel et al tout their SpeedStep, bus power management, LPC, etc. to save energy, and they have epically failed. Why does a computer doing absolutely nothing need to use >100 watts to sit idle? If they can't get reasonable power scaling from clock gating, then why not just design an idle processor that's slow and stupid (i.e. Atom) and shut the other crap down when it's not needed. If people really cared about power, there are a lot of relatively low-tech solutions that would work and make huge dents in world demand for energy to power electronics.

But we still have a situation where GPU designers would rather let their processors idle at 70°C to protect against temperature gradients and not have to account for the effects of temperature changes on their circuits.

Reminds me of Java2k (0)

Anonymous Coward | more than 3 years ago | (#32393530)

This reminds me of Java2K, an esoteric programming language inspired by physics. When you do measurements in physics, you have to be specific about the error. You have to deal with it, so you have to think about how to minimize it in your contraptions.

In Java2K, all instructions can misbehave. So x1/x2 will divide x1 by x2, but only does so correctly with a probability of 90%. And all variables start with random values, "like in real physics". A language impossible to work with?! Turns out, you can at least do simple things:

integer variable x
integer variable y
y = x/x

at the end of the computation, y=1 -- with a probability of 90%. Now, how to proceed...?
